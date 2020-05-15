

I'm involved with several projects that analyze recordings from e-interviews conducted using systems like Zoom, BlueJeans, and WebEx. Some of our analysis techniques rely on timing information, so it's natural to wonder how much distortion might be introduced by those systems' encoding, transmission, and decoding processes.

Why might timing in particular be distorted? Because any internet-based audio or video streaming system encodes the signal at the source into a series of fairly small packets, sends them individually by diverse routes to the destination, and then assembles them again at the end.

If the transmission is one-way, then the system can introduce a delay that's long enough to ensure that all the diversely-routed packets get to the destination in time to be reassembled in the proper order — maybe a couple of seconds of buffering. But for a conversational system, that kind of latency disrupts communication, and so the buffering delays used by broadcasters and streaming services are not possible. As a result, there may be missing packets at decoding time, and the system has to deal with that by skipping, repeating, or interpolating (the signal derived from) packets, or by just freezing up for a while.
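The trade-off between buffering delay and missing packets can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a description of how Zoom or any other real system works: the delay model (a fixed 30 ms base delay plus exponentially distributed jitter), the 20 ms packet size, and the names `simulate_playout` and `late_fraction` are all hypothetical. Each packet has a playout deadline, and a packet that misses it is concealed by repeating the last packet that arrived in time.

```python
import random


def simulate_playout(num_packets, packet_ms=20, buffer_ms=60, seed=0):
    """Toy receiver: packet i is sent at i * packet_ms and must be played
    at i * packet_ms + buffer_ms. Late packets are concealed by repeating
    the last packet that arrived in time.

    Returns a list whose entry i is the index of the packet actually
    played in slot i (None means nothing usable had arrived yet).
    """
    rng = random.Random(seed)
    played = []
    last_good = None
    for i in range(num_packets):
        send_time = i * packet_ms
        # Assumed delay model: 30 ms base delay plus jitter with mean 20 ms.
        arrival = send_time + 30 + rng.expovariate(1 / 20)
        deadline = send_time + buffer_ms
        if arrival <= deadline:
            last_good = i
            played.append(i)
        else:
            played.append(last_good)  # conceal by repeating the previous packet
    return played


def late_fraction(played):
    """Fraction of playout slots that did not play their own packet."""
    return sum(1 for i, p in enumerate(played) if p != i) / len(played)


if __name__ == "__main__":
    for buffer_ms in (40, 60, 2000):
        frac = late_fraction(simulate_playout(1000, buffer_ms=buffer_ms))
        print(f"buffer {buffer_ms:>4} ms: {frac:.1%} of slots concealed")
```

Under this assumed model, a two-second buffer lets essentially every packet arrive before its deadline, while a buffer of a few tens of milliseconds forces a substantial share of slots to be concealed. Every concealed slot is a place where the timing of the played signal can differ from the timing of the original.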

It's not clear (at least to me) how much of this happens when, or how to monitor it. (Though it's easy to see that the video signal in such conversations is often coarsely sampled or even "frozen", and obvious audio glitches sometimes occur as well.) But the results of a simple test suggest that more subtle time distortion is sometimes a problem for the audio channel as well.

The simple test is to use a metronomic "click track", playing it on one end and recording it at the other end (or in the middle). Yesterday, Chris and Caitlin Cieri tried a version of this test, using a ten-minute signal from the Wikipedia "click track" article. This signal is suitably regular, as this sample indicates:

[Audio sample: excerpt from the original click track]

When I apply a simple click-detection program to this signal, the resulting histogram of inter-click intervals is suitably exact. The click track is at metronome marking 120, equivalent to 120 clicks per minute, or two per second. And all 1200 clicks occur exactly half a second apart:

[Figure: histogram of inter-click intervals for the original click track]
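For readers who want to try something similar, here is a minimal sketch of one way to detect clicks and measure the intervals between them. It is not the program used for the measurements in this post. The threshold, the refractory period, the 8 kHz sample rate, and the function names `make_click_track` and `detect_clicks` are assumptions chosen for illustration, and the input is a synthesized click track rather than the Wikipedia recording.

```python
import numpy as np


def make_click_track(bpm=120, duration_s=10.0, sample_rate=8000, click_ms=5):
    """Synthesize a click track: short full-scale bursts separated by silence."""
    signal = np.zeros(int(duration_s * sample_rate))
    period = int(round(60.0 / bpm * sample_rate))  # samples between click onsets
    click_len = int(sample_rate * click_ms / 1000)
    for start in range(0, len(signal), period):
        signal[start:start + click_len] = 1.0
    return signal


def detect_clicks(signal, sample_rate, threshold=0.5, refractory_s=0.1):
    """Return click onset times in seconds.

    An onset is the first sample whose magnitude exceeds `threshold` at
    least `refractory_s` seconds after the previous onset, so the rest of
    the same click is not counted again.
    """
    above = np.flatnonzero(np.abs(signal) > threshold)
    min_gap = int(refractory_s * sample_rate)
    onsets = []
    for idx in above:
        if not onsets or idx - onsets[-1] >= min_gap:
            onsets.append(idx)
    return np.array(onsets) / sample_rate


if __name__ == "__main__":
    rate = 8000
    track = make_click_track(bpm=120, duration_s=10.0, sample_rate=rate)
    times = detect_clicks(track, rate)
    intervals = np.diff(times)  # seconds between successive clicks
    print(len(times), "clicks; intervals range from",
          intervals.min(), "to", intervals.max(), "seconds")
```

At 120 clicks per minute the nominal interval is 60 / 120 = 0.5 seconds, so a ten-minute track contains 1200 clicks and 1199 intervals. A real recording would first need to be loaded into an array of samples, and the threshold would need tuning to the recording's level and noise floor.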

Caitlin and Chris recorded the "conversation" (in this case entirely one-sided) via Zoom, using Zoom's cloud recording capability (which in effect records in the middle, so to speak, on a server that's mediating among the participants). The results are sometimes rather erratic, as these samples indicate:

[Audio samples: six excerpts from the Zoom cloud recording]

And the histogram of inter-click intervals shows it:

[Figure: histogram of inter-click intervals for the Zoom cloud recording]
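A histogram like this can be computed from a list of click times in a few lines. The sketch below is an assumption-laden illustration rather than the code behind the figure: the 10 ms bin width, the 100 ms span around the nominal interval, and the name `interval_histogram` are hypothetical choices, and the example inputs are made-up click times, not measurements from the Zoom recording.

```python
import numpy as np


def interval_histogram(click_times_s, nominal_s=0.5, bin_ms=10, span_ms=100):
    """Count inter-click intervals in bins centred on the nominal interval.

    Returns (counts, edges), with edges in milliseconds. Intervals that
    fall outside nominal +/- (span_ms + bin_ms / 2) are not counted.
    """
    intervals_ms = np.diff(np.asarray(click_times_s, dtype=float)) * 1000.0
    nominal_ms = nominal_s * 1000.0
    # Offset the edges by half a bin so the nominal value sits mid-bin.
    edges = np.arange(nominal_ms - span_ms - bin_ms / 2,
                      nominal_ms + span_ms + bin_ms, bin_ms)
    counts, edges = np.histogram(intervals_ms, bins=edges)
    return counts, edges


if __name__ == "__main__":
    # Made-up click times: one click arrives 30 ms late, so one interval
    # is stretched and the following one is shortened.
    distorted = [0.0, 0.5, 1.03, 1.5, 2.0]
    counts, edges = interval_histogram(distorted)
    for count, left in zip(counts, edges[:-1]):
        if count:
            print(f"{left:.0f}-{left + 10:.0f} ms: {count}")
```

For a perfectly regular track, all of the counts land in the single bin around 500 ms. Any spread into neighbouring bins indicates intervals that were stretched or compressed somewhere between the source and the recording.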

This is just one sample, using one system (Zoom 5.0.2), one network configuration and traffic condition at one time, and one recording method, for a one-way audio stream of a particular kind. And it's possible that Zoom's encoding and decoding system is doing something special with (what it reckons to be) silence, prolonging or curtailing silences in order to even out transmission latencies. So we'll try some additional tests, and let you know what we find.

Aside from our concern about timing cues in interview analysis, audio time distortions of this kind may be responsible for some of what people experience as "Zoom fatigue".

