Fundamentals of USB Audio
USB, the Universal Serial Bus, has been around for decades and is a heavily used standard in the world of personal computers. Memory sticks, external drives, mice, and web cameras are all interfaced over USB. This article looks at USB Audio: a standard for digital audio used in PCs, smartphones, and tablets to interface with audio peripherals such as speakers, microphones, or mixing desks. We set out to show how USB Audio works, what to watch out for, and how to use USB Audio for high-fidelity multi-channel input and output.
USB is a protocol where the PC, the USB-host, initiates a transfer, and the device (for example a USB speaker) responds. Each transfer is addressed to a specific device, and to a specific endpoint on the device. IN-transfers send data to the PC. When the host initiates an IN-transfer the device has to respond with data for the host. OUT-transfers send data to the device.
When the host performs an OUT-transfer it sends a packet of data that the device must capture. In the world of USB Audio, IN and OUT transfers may be used to transport audio samples: an OUT-transfer sends audio data from a PC to a speaker, whereas an IN-transfer sends audio data from a microphone to the PC.
There are four sorts of IN and OUT-transfers in USB: Bulk, Isochronous, Interrupt, and Control transfers.
A bulk transfer is used to reliably transfer data between host and device. All USB transfers carry a CRC (checksum) that indicates whether an error has occurred. On a bulk transfer, the receiver of the data has to verify the CRC. If the CRC is correct the transfer is acknowledged, and the data is assumed to have been transferred error-free. If the CRC is not correct, the transfer is not acknowledged and will be retried.
If the device is not ready to accept data it can send a negative acknowledgment, NAK, which will cause the host to retry the transfer. Bulk transfers are not considered time-critical, and are scheduled around the time-critical transfers discussed below.
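The acknowledge-and-retry behaviour of a bulk transfer can be sketched in a few lines. This is a toy model, not driver code: the `channel` callable and the `flaky_channel` helper are hypothetical stand-ins for the wire, `zlib.crc32` stands in for the CRC that USB actually uses on data packets, and NAK handling is left out.

```python
import zlib


def send_bulk(payload, channel, max_retries=8):
    """Send payload until the receiver's CRC check passes.

    `channel` is a hypothetical callable modelling the wire: it takes the
    bytes sent and returns the bytes received, possibly corrupted.
    Returns the number of attempts that were needed.
    """
    expected_crc = zlib.crc32(payload)  # stand-in for the USB data CRC
    for attempt in range(1, max_retries + 1):
        received = channel(payload)
        if zlib.crc32(received) == expected_crc:
            return attempt  # CRC correct: the transfer is acknowledged
        # CRC incorrect: no acknowledgment, so the host tries again
    raise RuntimeError("transfer failed after %d attempts" % max_retries)


def flaky_channel(fail_count):
    """Build a channel that corrupts the first `fail_count` transfers."""
    remaining = {"count": fail_count}

    def channel(data):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return bytes([data[0] ^ 0xFF]) + data[1:]  # flip the first byte
        return data

    return channel
```

With a channel that corrupts the first two attempts, `send_bulk(b"audio", flaky_channel(2))` succeeds on the third attempt.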
Isochronous transfers are used to transfer data in real-time between host and device. When an isochronous endpoint is set up by the host, the host allocates a specific amount of bandwidth to the isochronous endpoint, and it regularly performs an IN- or OUT-transfer on that endpoint. For example, the host may OUT 1 KByte of data every 125 us to the device. Since a fixed and limited amount of bandwidth has been allocated, there is no time to resend data if anything goes wrong. The data has a CRC as normal, but if the receiving side detects an error there is no resend mechanism.
Interrupt transfers are used by the host to regularly poll the device to find out whether something worthwhile has happened. For example, a host may poll an audio device to check whether the MUTE button has been pressed. The name Interrupt transfer is slightly confusing, since these transfers do not interrupt anything. However, regular polling of data gives the same sort of functionality that a host interrupt would provide.
Control transfers are very much like bulk transfers. Control transfers are acknowledged, can be NAKed, and are delivered in a non-real-time fashion. Control transfers are used for operations that are outside the normal data flow, such as querying the device capabilities or endpoint status. An explanation of how device capabilities are described is outside the scope of this article; we just state that there are predefined classes such as 'USB Audio Class' or 'USB Mass Storage Class' that enable cross-platform interoperability.
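As a compact recap, the four transfer types can be written down as a small table and queried. The field names and the `types_where` helper are our own illustration and not USB terminology. The text above does not say whether interrupt transfers are acknowledged; the `True` below is taken from the USB 2.0 specification.

```python
from collections import namedtuple

# One row per transfer type. Field names are illustrative, not USB terms.
TransferType = namedtuple(
    "TransferType", ["name", "acknowledged", "periodic", "typical_use"]
)

TRANSFER_TYPES = [
    TransferType("Bulk", True, False, "reliable, non-time-critical data"),
    TransferType("Isochronous", False, True, "real-time data such as audio"),
    TransferType("Interrupt", True, True, "regular polling, e.g. a MUTE button"),
    TransferType("Control", True, False, "device capabilities, endpoint status"),
]


def types_where(**conditions):
    """Return the names of the transfer types matching every given field."""
    return [
        t.name
        for t in TRANSFER_TYPES
        if all(getattr(t, field) == value for field, value in conditions.items())
    ]
```

For example, `types_where(acknowledged=False)` returns only the isochronous type, which is the one that carries audio data.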
All transfers are made in USB frames. High Speed USB frames (strictly speaking, microframes) span 125 us, whereas Full Speed USB frames span 1 ms. Each frame is marked by the host sending a Start-Of-Frame (SOF) message. Isochronous and Interrupt transfers are transmitted at most once a frame.
USB Audio uses isochronous, interrupt and control transfers. All audio data is transferred over isochronous transfers; interrupt transfers are used to relay information regarding the availability of audio clocks; control transfers are used to set volume, request sample rates, etc. These are shown in Figure 1.
The data requirements of a USB Audio system depend on the number of channels, the number of bits used to represent each sample, and the sample rate. Typical channel counts are 2 (stereo), 6 (5.1) or much higher for studio and DJ use. The typical sample size is 24 bits, although 16 bits is available for legacy audio, and 32 bits for high quality audio. Typical sample rates are 44.1, 48, 96, and 192 kHz. The highest of these is used for high quality audio.
Suppose that we design a stereo audio speaker with a 96 kHz sample rate and 24-bit samples. In order to simplify data marshalling on host and device, 24-bit values are typically padded with a zero byte, so the total data throughput is 96,000 x 2 channels x 4 bytes = 768,000 bytes per second. The isochronous endpoints run at a rate of one transfer per 125 us, or 8,000 transfers per second. Dividing the required byte rate by the frame rate gives us the number of bytes for each isochronous transfer: 768,000/8,000 = 96 bytes per transfer.
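The same arithmetic can be written as a small helper. This is a minimal sketch; the function name and its defaults (4 bytes per padded sample, 8,000 transfers per second for High Speed USB) are our own choices for illustration.

```python
from fractions import Fraction


def bytes_per_transfer(sample_rate_hz, channels,
                       bytes_per_sample=4, transfers_per_second=8000):
    """Average payload of one isochronous transfer, as an exact fraction."""
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    return Fraction(bytes_per_second, transfers_per_second)
```

Calling `bytes_per_transfer(96000, 2)` reproduces the 96 bytes per transfer worked out above. The result is kept as a fraction because not every sample rate divides evenly into 8,000 transfers per second.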
When using CD rates, such as 44,100 Hz, the data rate for a stereo stream works out as 44,100 x 2 channels x 4 bytes / 8,000 = 44.1 bytes per transfer, which is not a whole number of samples. In USB Audio each transfer always carries a whole number of samples, so transfers carry either 48 or 40 bytes (6 or 5 stereo samples), mixed in such a way that the average rate works out as 44.1 bytes per transfer.
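One way to produce such a sequence of 40- and 48-byte transfers is a running accumulator that carries the fractional remainder from one transfer to the next. The sketch below is an assumption about how this could be done, not a description of any particular host driver; in a real system the exact mix also depends on the audio clock.

```python
from itertools import islice


def packet_sizes(sample_rate_hz, transfers_per_second=8000,
                 channels=2, bytes_per_sample=4):
    """Yield the payload size in bytes of each successive transfer."""
    bytes_per_frame = channels * bytes_per_sample  # one sample on all channels
    accumulator = 0
    while True:
        accumulator += sample_rate_hz
        samples = accumulator // transfers_per_second  # whole samples due now
        accumulator -= samples * transfers_per_second  # keep the remainder
        yield samples * bytes_per_frame


# One second of transfers at 44,100 Hz stereo.
one_second = list(islice(packet_sizes(44100), 8000))
```

Over one second the sizes in `one_second` are all 40 or 48 bytes and add up to 352,800 bytes, which is exactly 44,100 stereo samples of 8 bytes each.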
A single isochronous transfer can carry 1024 bytes, which is at most 256 samples (24- or 32-bit samples each occupy 4 bytes). At 48 kHz each channel contributes 6 samples to every 125 us transfer, and at 192 kHz it contributes 24. This means that a single isochronous endpoint can transfer 42 channels at 48 kHz, or 10 channels at 192 kHz (assuming that High Speed USB is used; Full Speed USB cannot carry more than a single stereo IN and OUT pair at 48 kHz).
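The channel limits follow from dividing the 256-sample budget by the number of samples each channel needs per transfer. A minimal sketch of that calculation, with a function name and defaults of our own choosing, assuming High Speed USB and 4-byte samples:

```python
def max_channels(sample_rate_hz, max_samples_per_transfer=256,
                 transfers_per_second=8000):
    """Largest channel count that fits in one isochronous transfer."""
    # Samples one channel needs in its fullest transfer (ceiling division).
    samples_per_channel = -(-sample_rate_hz // transfers_per_second)
    return max_samples_per_transfer // samples_per_channel
```

`max_channels(48000)` and `max_channels(192000)` give the 42 and 10 channels quoted above. The ceiling division matters for rates such as 44.1 kHz, where some transfers carry 6 samples per channel rather than 5.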
When transmitting digital audio, latency is introduced. In the case of High Speed USB this latency is 250 us. A packet of data is transferred once in every 125 us window, but given that it may be sent anytime in this window a 250 us buffer is required. On top of this 250 us delay, extra delay may be incurred in the O/S driver, and in the CODEC. Note that Full Speed USB has a much higher intrinsic latency of 2 ms, as data is only sent once in every 1 ms window.
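To get a feel for what this buffering means in samples, the figures above can be turned into a short calculation. This sketch only covers the intrinsic USB delay of two frame periods; the driver and CODEC delays mentioned above are not modelled, and the function names are our own.

```python
def buffer_latency_us(frame_period_us):
    """Intrinsic buffering delay: two frame periods, as described above."""
    return 2 * frame_period_us


def buffer_samples(sample_rate_hz, frame_period_us):
    """Samples per channel needed to cover that delay, rounded up."""
    latency_us = buffer_latency_us(frame_period_us)
    return -(-sample_rate_hz * latency_us // 1_000_000)  # ceiling division
```

At 48 kHz this comes to 12 samples per channel for High Speed USB (`buffer_samples(48000, 125)`) and 96 samples per channel for Full Speed USB (`buffer_samples(48000, 1000)`).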