Fundamentals of USB Audio

Henk Muller, Principal Technologist, XMOS Ltd. - June 27, 2012

What's a second between friends?

The big issue in digital audio is to agree on a common notion of time. Above we have defined USB frames to be transferred 8,000 times per second, and set the speakers to play a sample 96,000 times per second. This will only work if the speaker and the host agree on the length of a second. USB Audio offers three modes that ensure that the host and the speaker agree on timings:

  • In synchronous mode, the length of a second is defined by the host. That is, the host sends data at a given rate, and the device has to match that rate exactly.
  • In asynchronous mode it is the other way around: the device sets the definition of a second, and the host has to match the device.
  • In adaptive mode the data flow determines the clock: the device derives its clock from the rate at which data arrives.

Adaptive and synchronous modes are not ideal because PCs are notoriously bad at keeping a stable clock, and there are often other audio sources involved, such as an external digital deck. Asynchronous mode enables an external clock source, or a low-jitter clock in the device, to be used as the master. Typically, either relies on a crystal-based PLL, as shown in Figure 2.

Figure 2: A USB-audio board, with a crystal for stable audio frequencies, and a low-jitter PLL to generate any frequency required.

Hence there are at least two separate clocks in the system: the USB clock, with a host-driven frequency of 8,000 transfers per second, and a sample clock, with an externally driven sample rate of, for example, 96,000 Hz.

These clocks will have subtly different frequencies, and the difference will vary slightly over time. Hence the average number of audio samples per frame will be slightly more or less than the nominal value. For example, in the case of our 96,000 Hz sample rate, the nominal value is 12 samples per frame (96,000 / 8,000), but the measured average may be 12.001.
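The arithmetic behind this figure can be sketched in a few lines of Python. The function name `samples_per_frame` and the 96,008 Hz measurement are illustrative assumptions, chosen so that the result matches the 12.001 example; neither comes from a USB library or from the article.

```python
def samples_per_frame(sample_rate_hz: float, frames_per_second: float = 8000.0) -> float:
    """Average number of audio samples produced during one USB frame.

    Hypothetical helper: both rates must be measured against the same
    reference clock for the ratio to be meaningful.
    """
    return sample_rate_hz / frames_per_second


# Ideal case: both clocks agree on the length of a second.
nominal = samples_per_frame(96_000)

# Assumed drift: measured against the host's clock, the device's crystal
# delivers 96,008 samples per second (roughly 83 ppm fast).
actual = samples_per_frame(96_008)

print(f"nominal: {nominal:.3f} samples per frame")
print(f"actual:  {actual:.3f} samples per frame")
```

A difference of well under one hundred parts per million is enough to shift the third decimal place, which is why the rate has to be measured continuously rather than assumed.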

In order to ensure that the host sends the right amount of data, and not too much or too little, the host requests the current sample rate over a dedicated feedback endpoint. Every few milliseconds the device reports the average number of samples per frame over the last period as a 16.16 fixed-point number. If the last period averaged out as 12.001 samples per frame, then the value 0x000C0041 will be reported (65536 * 12.001, truncated to an integer).
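As a minimal sketch of the 16.16 encoding, the following Python converts a samples-per-frame average to the reported word and back. The names `encode_feedback` and `decode_feedback` are hypothetical, and truncation (rather than rounding) is an assumption made so that the result agrees with the 0x000C0041 value quoted above.

```python
def encode_feedback(samples_per_frame: float) -> int:
    """Pack an average samples-per-frame value into 16.16 fixed point.

    The upper 16 bits hold the integer part and the lower 16 bits the
    fraction. Assumption: the fraction is truncated, not rounded.
    """
    return int(samples_per_frame * 65536) & 0xFFFFFFFF


def decode_feedback(word: int) -> float:
    """Recover the approximate samples-per-frame value from a 16.16 word."""
    return word / 65536


word = encode_feedback(12.001)
print(f"0x{word:08X}")        # prints 0x000C0041
print(decode_feedback(word))  # slightly below 12.001 because of truncation
```

The integer part 0x000C is 12, and the fractional part 0x0041 is 65/65536, which is the closest 16-bit fraction that does not exceed 0.001.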

Given this average rate, the host can work out when to send an extra sample in a transfer; in this example 8 transfers each second will carry one extra sample. In addition, the host can use this value to synchronise itself with the audio device. This enables host applications such as a DVD player to keep the video in sync with the audio. Without this synchronisation, the audio would slowly run ahead of the video, and after two hours the audio would be a second out.
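One way a host could decide which transfers carry the extra sample is a fractional accumulator, sketched below in Python. This is an illustration under stated assumptions, not the scheduling used by any particular host driver; `samples_per_transfer` is a hypothetical name.

```python
def samples_per_transfer(feedback: int, transfers: int) -> list:
    """Spread fractional samples over transfers using a 16.16 accumulator.

    Hypothetical helper: `feedback` is the 16.16 samples-per-frame word
    reported by the device. Each transfer sends the whole samples that
    have accumulated and carries the fraction over to the next transfer.
    """
    counts = []
    accumulator = 0
    for _ in range(transfers):
        accumulator += feedback
        counts.append(accumulator >> 16)  # whole samples to send now
        accumulator &= 0xFFFF             # keep only the fractional part
    return counts


counts = samples_per_transfer(0x000C0041, 8000)
print(sum(counts))       # prints 96007
print(counts.count(13))  # prints 7
```

Because 0x000C0041 is 12.001 truncated to 16 fractional bits, this sketch places 7 thirteen-sample transfers in the first 8,000 rather than the 8 that an exact rate of 12.001 would give; the carried-over fraction makes up the difference in the following second.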

In order to keep the feedback loop short, the trick is to not buffer audio packets and feedback packets unnecessarily. Any additional buffering creates latency in the reporting, and this latency makes it more difficult to keep a smooth flow of traffic. This means that the low-level USB stack and the USB Audio stack should be tightly integrated, without buffering in between. Although this is hard to achieve on an application processor, it is quite easy if the software is implemented on an embedded processor that has a predictable execution time.
