I'm involved with several projects that analyze recordings from e-interviews conducted using systems like Zoom, BlueJeans, and WebEx. Some of our analysis techniques rely on timing information, and so it's natural to wonder how much distortion might be introduced by those systems' encoding, transmission, and decoding processes.
Why might timing in particular be distorted? Because any internet-based audio or video streaming system encodes the signal at the source into a series of fairly small packets, sends them individually by diverse routes to the destination, and then assembles them again at the end.
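As a rough illustration of that idea, here's a toy sketch in Python — not the actual protocol used by any of these systems, and with made-up constants (a 16 kHz sample rate, 20 ms packets): the sender chops the audio into small, timestamped, sequence-numbered packets, the network may deliver them out of order, and the receiver sorts whatever arrives back into capture order.

```python
# Toy packetization/reassembly sketch -- not any real VoIP protocol.
from dataclasses import dataclass
import random

SAMPLE_RATE = 16000                 # assumed sample rate (Hz)
PACKET_MS = 20                      # a typical conversational packet: ~20 ms of audio
SAMPLES_PER_PACKET = SAMPLE_RATE * PACKET_MS // 1000

@dataclass
class Packet:
    seq: int                        # sequence number
    timestamp: float                # capture time of the first sample, in seconds
    samples: list                   # the audio samples for this 20 ms slice

def packetize(audio):
    """Split a list of samples into 20 ms packets."""
    packets = []
    for seq, start in enumerate(range(0, len(audio), SAMPLES_PER_PACKET)):
        packets.append(Packet(seq=seq,
                              timestamp=start / SAMPLE_RATE,
                              samples=audio[start:start + SAMPLES_PER_PACKET]))
    return packets

def reassemble(received):
    """Put packets back in capture order, regardless of arrival order."""
    return sorted(received, key=lambda p: p.seq)

# Simulate diverse routing by shuffling the arrival order.
audio = [0.0] * SAMPLE_RATE         # one second of (silent) audio
packets = packetize(audio)
random.shuffle(packets)             # the network delivers them out of order
in_order = reassemble(packets)
assert [p.seq for p in in_order] == list(range(len(in_order)))
```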
If the transmission is one-way, then the system can introduce a delay that's long enough to ensure that all the diversely-routed packets get to the destination in time to be reassembled in the proper order — maybe a couple of seconds of buffering. But for a conversational system, that kind of latency disrupts communication, and so the buffering delays used by broadcasters and streaming services are not possible. As a result, there may be missing packets at decoding time, and the system has to deal with that by skipping, repeating, or interpolating (the signal derived from) packets, or by just freezing up for a while.
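Conceptually, the receiver's choices when a packet's playout deadline arrives and the packet hasn't look something like the following — again a hypothetical Python sketch of a couple of the strategies just mentioned (repeat, interpolate, or fill with silence), not any system's actual loss-concealment code, and using the same made-up 20-ms-at-16-kHz packet size as the sketch above:

```python
SAMPLES_PER_PACKET = 320   # 20 ms at 16 kHz, as in the sketch above

def conceal_missing(buffer, seq, last_good):
    """Return the samples to play for packet `seq`, concealing losses."""
    if seq in buffer:
        return buffer.pop(seq)                     # packet arrived in time
    if last_good is None:
        return [0.0] * SAMPLES_PER_PACKET          # nothing to work with: silence ("freeze")
    nxt = buffer.get(seq + 1)
    if nxt is not None:
        # Crude interpolation: cross-fade from the last good packet into the next one.
        n = SAMPLES_PER_PACKET
        return [last_good[i] * (1 - i / n) + nxt[i] * (i / n) for i in range(n)]
    return list(last_good)                         # otherwise, repeat the previous packet

def play_out(arrived, total_packets):
    """`arrived` maps sequence number -> samples for the packets that made it."""
    buffer = dict(arrived)
    output, last_good = [], None
    for seq in range(total_packets):
        samples = conceal_missing(buffer, seq, last_good)
        last_good = samples
        output.extend(samples)
    return output
```

Each of these fixes leaves a slightly different fingerprint on the timing of the decoded signal, which is exactly the sort of thing our timing-based analyses might be sensitive to.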
It's not clear (at least to me) how much of this happens, or when, or how to monitor it. (Though it's easy to see that the video signal in such conversations is often coarsely sampled or even "frozen", and obvious audio glitches sometimes occur too.) But the results of a simple test suggest that more subtle time distortion is sometimes a problem for the audio channel as well.
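One way such a check might be set up (not necessarily the test referred to above) is to play a known reference signal through the conferencing system, record what comes out the other end, and estimate the local time offset in successive windows by cross-correlation: a drifting or jumping offset, rather than a constant delay, would point to time distortion rather than simple latency. A sketch, assuming NumPy arrays at the same sample rate and a rough global alignment already done:

```python
import numpy as np

def local_offsets(reference, received, sr, win_s=1.0, max_lag_s=0.1):
    """Estimate the time offset (seconds) of `received` relative to
    `reference` in successive windows of `win_s` seconds."""
    win = int(win_s * sr)
    max_lag = int(max_lag_s * sr)
    offsets = []
    for start in range(0, min(len(reference), len(received)) - win, win):
        ref = reference[start:start + win]
        rec = received[start:start + win]
        # Full cross-correlation of the mean-removed windows; look only
        # within +/- max_lag samples of zero lag.
        xc = np.correlate(rec - rec.mean(), ref - ref.mean(), mode='full')
        center = win - 1                       # index corresponding to zero lag
        lo, hi = center - max_lag, center + max_lag + 1
        lag = np.argmax(xc[lo:hi]) + lo - center
        offsets.append(lag / sr)               # positive = received is delayed
    return np.array(offsets)
```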