EDN Access

 

May 22, 1997


Videoconferencing goes to POTS

Stephen Kempainen, Technical Editor

Believe it or not, the POTS infrastructure can now deliver videoconferencing. Because of new standardized codecs, low-cost, interoperable products provide toll-quality audio as well as video that ranges from acceptable to very good.

Telecommuters beware: Videoconferencing is ready to invade your home office, bringing an end to teleconferences in which you're still unshaven or wearing your pajamas. At the same time, however, videoconferencing introduces exciting and as-yet-unimagined adventures in communications. Enabling these adventures is the International Telecommunication Union--Telecommunication Standardization Sector (ITU-T) standard H.324 for multimedia conferencing on plain-old telephone service (POTS). H.324 defines new standards for transmitting video, voice, and data over one analog telephone line. Possibilities abound because related ITU-T recommendations for videoconferencing on LANs and WANs use the same interoperable substandards as H.324, enabling even greater interoperability among videophones with different features and from different manufacturers.

Before H.324, products required the use of ISDN phone lines, which added cost and hassle to videoconferencing. In addition, an expensive and proprietary "boardroom" product from one manufacturer could not conduct a call with the product of another manufacturer. Add the fact that the video and audio quality were inconsistent, and you can see why videophones have not enjoyed widespread use among the general public.

H.324 addresses these problems, but another roadblock to widespread use is that even the adventurous consumer expects a videophone to have the quality and reliability of his telephone. The H.324 standard also addresses these expectations. Delivery occurs through the digitally switched, worldwide POTS infrastructure that has the bit rate and guaranteed circuit connections necessary for multimedia calls. In addition, the H.324 standard relies on the proven bandwidth of V.34 modems to carry simultaneous video, audio, and data.

However, since the completion of H.324 in March 1996, few products have appeared on the shelves, so consumers have yet to show through their purchases whether the POTS signal can carry enough video data to meet quality expectations. The audio quality is not in question, because the new codec techniques provide toll quality or better. But the video quality is suspect. Consumers will have to adjust to videophone calls as they adjust to regular phone calls: making sure that calls happen in quiet surroundings and that participants sit relatively still in a brightly lit place with a visually quiet background. Only then will the new video codec be able to properly transfer facial expressions in color for full-sized displays. If the codec must analyze excessive motion and poor lighting, the video frame rate will slow, and the images will blur.

If consumers are willing to adjust their expectations, videoconferencing will ride H.324 to mainstream consumer products. The almost-universal connection of every home and business to POTS provides a ready market for videoconferencing. The popularity of V.34 modems and powerful, cheap computers and videoconferencing chips, in addition to software techniques for compression, will pave the way for multimedia conferencing on low-bit-rate, circuit-switched networks. H.324 provides for interoperability among diverse terminals, including PC-based systems, PBX videophone systems, stand-alone videophones, Web browsers with live video, telemedicine terminals, and remote-security and surveillance cameras.

ITU-T standards for other networks provide even more variety for interoperable videoconferencing products. Standards for all network connections consist of substandards (Table 1, pg 88), both mandatory and optional, for the audio and video codecs. The table shows how some of the substandards are common across a number of the overall videoconferencing standards, meaning that interoperation could be as simple as translating from one communication format to another. For example, the video codec H.261 is mandatory for ISDN, POTS, LAN, and asynchronous-transfer-mode (ATM) implementations. Also evident is the commonality of control-function and data-transfer standards, which further eases interoperability.

As the first videoconferencing standard, created by the ITU-T in 1990, H.320 for ISDN served as the architectural model for all the follow-on standards. H.324, the first follow-on standard, focused on the telephone analog local loop; therefore, performance on low-bit-rate networks was of primary importance. Also important were flexibility, options, and maintaining backward compatibility with H.320. However, H.324 video quality will not match that of H.320 because of POTS' limited bandwidth: 33.6 kbps compared to ISDN's 128 kbps. (Because the new 56-kbps modems provide their bit rate in only one direction, they might only minimally improve videoconferencing bandwidth.)

In addition, there are two videoconferencing standards for LANs: one for those that can guarantee bandwidth and another for those that can't. H.323 is for packet-switched networks that cannot guarantee bandwidth, such as Ethernet. Ethernet's dominance in the business market is causing much product development around this recommendation. The other standard, for guaranteed-bandwidth LANs such as ATM, is H.310. This standard is making strides in compatibility with the MPEG standards, so future products will be able to interoperate with the next MPEG standard for video compression, MPEG-4.

A primary goal of H.323 was to establish interoperability with all other terminal types for videoconferencing. Hence, in addition to the terminal specifications, H.323 defines such components as gatekeepers for conference admissions, multipoint controllers for group conferences, and gateways for interoperability with H.324, H.310, and H.320. The gateways translate call signaling, control-channel messages, and multiplexing techniques between terminals. You can avoid audio- and video-compression transcoding with gateways, because terminals could have a common mandatory or optional algorithm between them.

Taking a closer look at H.324 for POTS gives insight into all the standards, because they are so architecturally similar. Figure 1 (pg 88) shows the components that make up a videoconferencing system that complies with H.324. The modem, multiplexer, and control blocks are mandatory; the audio, video, and data streams are optional. For example, an H.324 system can have multiple streams of audio and video but no data streams. However, the mandatory blocks must be in place for system interoperability. (See box, "Standard videoconferencing architecture," pg 90, for an explanation of these mandatory blocks.)

Toll-quality audio is critical

Even though the term "videoconferencing" implies that video is of primary importance, without the audio there would be no conference. Inconsistent or dropped video frames are annoying but tolerable, whereas bad audio can end a call. Therefore, the audio stream and its quality need primary design attention. Recent work to optimize audio coders for specific applications focuses on trade-offs in attributes such as bit rate, complexity, delay, and quality. (See box, "Speech coders for low-bit-rate multimedia communications," for an explanation of speech-coding attributes.)

The G.723.1 audio-codec standard addresses low-bit-rate videoconferencing systems and produces a near-toll-quality audio signal. The two bit rates for G.723.1 are 5.3 and 6.3 kbps; the less complex, higher bit-rate system uses less processor power. The estimated processing power needed to implement the algorithm on a fixed-point DSP is 23 MIPS for the low bit rate and 18 MIPS for the high bit rate. As speech coders go, both algorithms are of moderate complexity: above the 15 MIPS or less that is typical of low-complexity coders but well below the 30 or more MIPS of high-complexity coders.

The G.723.1 also has a provision for silence-suppression coding, which is possible because usually only one party talks at a time. The listener's terminal can eliminate its audio portion from the multimedia stream and use that bandwidth to improve the video signal when the other person is talking. However, complete silence on a call might make a listener think he has lost his connection. For this reason, the receive-end audio DSP adds simulated background noise to increase the user's comfort level with the technology. When the listener starts talking, the voice-activity detector turns the audio codec back on.

The talker's transmitter-control unit decides whether to use the high or low bit rate for the audio codec. The listener's receiver signals its preference for either high or low bit rate to the transmitter using the new substandard, H.245 for control protocols, during the call setup and whenever the receiver's preference changes. The transmitter can change rates during a call, depending on receive-end feedback and audio complexity. This change is possible because each audio bit-stream frame carries the coder rate as a part of its syntax.

The G.723.1 speech coder adds about a 100-msec, one-way system delay to the audio stream, which is less than the video coder adds. If lip-synch capability is important to the end product, a delay in the receive-audio path synchs the audio to the video (Figure 1). The transmitter uses H.245 to send a message describing the skew between the transmitted audio and video streams. The receiver uses this information to adjust the delay in the receive-audio path. The receiver may choose no added delay if the video-frame rate is low, because at five to seven frames/sec, the motion of facial expressions is not going to closely match the audio anyway. However, at 10 to 15 frames/sec, the absence of lip-synch becomes distracting. Therefore, the receive terminal should add audio delay to match the video delay.
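The receiver's lip-synch decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the millisecond skew argument, and the 10-frames/sec cutoff (taken from the frame rates quoted above) are hypothetical and are not part of H.245.

```python
def receive_audio_delay_ms(skew_ms: float, video_fps: float,
                           lipsync_min_fps: float = 10.0) -> float:
    """Return the delay (msec) to insert in the receive-audio path.

    skew_ms is the amount by which the video trails the audio, as reported
    by the transmitter. The cutoff frame rate is an assumed design choice.
    """
    if video_fps < lipsync_min_fps:
        return 0.0  # motion is too coarse for lip-synch to matter
    return float(max(skew_ms, 0.0))  # never try to "advance" the audio


print(receive_audio_delay_ms(skew_ms=150, video_fps=6))   # 0.0
print(receive_audio_delay_ms(skew_ms=150, video_fps=15))  # 150.0
```

A real terminal would probably smooth changes in the added delay rather than jump between values, because abrupt changes in an audio delay line are audible.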

Another problem that the audio-stream delay causes is acoustic echo from the receiving speakerphone. In a full-duplex videophone, the listener's microphone picks up the room's audio energy emanating from the listener's speaker and sends it back to the talker. This echo can be disconcerting to the talker and becomes worse after adding delay for lip-synch. Therefore, including acoustic-echo cancellation (AEC) in a videophone is a very good idea. A good AEC algorithm typically uses about 15 MIPS.

Total audio processing uses about 23 MIPS for coding, 15 MIPS for AEC, and 5 MIPS for audio-control messages, for a total of approximately 43 MIPS. Because low system cost is crucial to H.324, one processor that can handle the task and still have MIPS to spare, such as Hitachi's SH-DSP, is beneficial. The SH-DSP provides 60 MIPS, which is enough to run any of the videoconferencing speech coders in Table 1 and still have MIPS for control functions and other features, such as speech recognition. Hitachi has the algorithms available for all the ITU-T speech coders. Using one powerful processor is probably better than the alternatives: leaving out some of the features, such as AEC, or using multiple processors.
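The processor budget in this paragraph is simple arithmetic, and a short Python sketch makes the headroom explicit. The figures are the ones quoted in the text; the dictionary and function names are illustrative only.

```python
# MIPS figures quoted in the article; the coding figure is the worst case
# (the 5.3-kbps mode of G.723.1).
AUDIO_TASKS_MIPS = {
    "G.723.1 coding": 23,
    "acoustic-echo cancellation": 15,
    "audio-control messages": 5,
}
PROCESSOR_MIPS = 60  # capacity quoted for the SH-DSP


def audio_budget(tasks, capacity):
    """Return (MIPS used, MIPS left over) for a set of audio tasks."""
    used = sum(tasks.values())
    return used, capacity - used


used, spare = audio_budget(AUDIO_TASKS_MIPS, PROCESSOR_MIPS)
print(f"used {used} MIPS, {spare} MIPS spare")  # used 43 MIPS, 17 MIPS spare
```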

You wouldn't be alone in assuming that it is the video in POTS videoconferencing (and not the audio) that will go to pot. Video sent over analog phone lines creates images with jerky motion, blurring, and stair-step or jagged-edge artifacts. The low bit rate that analog telephone lines offer presents a challenge to producing good-quality video.

Previous videoconferencing experience shows that a rate of approximately 20 frames/sec gives motion images that compare with TV quality. At 15 frames/sec, the motion quality is very good, but lip-synch to the audio needs to have a skew of less than 50 msec, or the audio will not match the mouth moving on the video. At five frames/sec, motion quality deteriorates until the lip movement is undetectable anyway, but image clarity is still important. If you have a bigger display, motion and clarity become even more critical.

The size of the display and whether it is LCD or CRT set user expectations. Consider a TV-based videophone connected to POTS by a set-top box. The viewer expects the image on his TV to be TV-quality and cringes at a jerky picture. On the other hand, a videophone on your desktop with a 4-in., color LCD screen or a quarter-common-intermediate-format (QCIF) (176×144 pixels) video window on your notebook PC does not create the same user expectations that the TV display does; viewers are more likely to accept a less-than-TV-quality image on such a display. So, the task is to cost-effectively transmit enough information at low bit rates and process the received information to fill video displays that satisfy each consumer.

The ITU-T undertook this task when it created the recommendation H.263, which addresses video coding for low-bit-rate communications. It specifies a coding technique to compress the moving-picture component of audio-visual signals. The video-compression technique comes from H.261 (the video-compression algorithm of H.320 for ISDN conferencing), with significant improvements in the motion-compensation quality at low frame rates. The target bit rate for H.263 was 28.8 kbps, the rate that a V.34 modem could deliver over POTS in 1995. A low-complexity algorithm, which translates into low cost, was another objective that the ITU-T wanted to include. In addition, H.263 had to be as generic as possible to fit the range of products that are possible in POTS-videoconferencing connections.

Flexibility is just what H.263 delivers. It is so flexible that either CPU software routines or a hardware accelerator can satisfy its processing requirements. It works in a TV-based or stand-alone videophone or in a multimedia PC equipped with a videophone. The motion-picture quality can vary from three to 30 frames/sec, depending on the processing power and bandwidth available. In addition, there are five standardized picture formats: sub-QCIF (128×96 pixels), QCIF, CIF, 4CIF, and 16CIF. Therefore, the video can be a small window with only three to five frames/sec or a full-screen window with 15 or more frames/sec. Product designers make a trade-off between picture quality and cost/complexity by choosing from the H.263 options for compression, image size, and frame rate.
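To get a feel for how much the choice of picture format changes the codec's workload, the following Python sketch compares the luminance pixel counts of the five formats. The sub-QCIF, QCIF, and CIF sizes appear in this article; the 4CIF and 16CIF sizes (704×576 and 1408×1152) are the luminance dimensions given in the H.263 recommendation and are added here for illustration. The names in the code are hypothetical.

```python
# Luminance dimensions (width, height) of the five H.263 picture formats.
H263_FORMATS = {
    "sub-QCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "16CIF": (1408, 1152),
}


def luma_pixels(name):
    """Number of luminance pixels in one picture of the given format."""
    width, height = H263_FORMATS[name]
    return width * height


qcif = luma_pixels("QCIF")
for name in H263_FORMATS:
    pixels = luma_pixels(name)
    print(f"{name:>8}: {pixels:>8} pixels, {pixels / qcif:.2f} x QCIF")
```

Each step from QCIF to CIF to 4CIF to 16CIF quadruples the number of pixels, which is why a larger window at the same frame rate demands so much more processing power and bandwidth.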

The video quality for H.263 can be very good. Even though the design goal was 28.8 kbps minus the 5.3 or 6.3 kbps for audio, H.263 also slightly outperforms H.261 at higher bit rates. But at the lower bit rates, the video quality of H.263 is equal to that of H.261 working at five to 10 times H.263's bit rate. H.263 worked so well that MPEG adopted it as the basis for MPEG-4. MPEG and ITU-T will likely converge on one recommendation for next-generation products.
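The design goal leaves very few bits for each picture, as a rough Python calculation shows. This is a back-of-the-envelope sketch only: it uses the 28.8-kbps modem rate and the two G.723.1 rates from the text and ignores multiplexer, control, and data-channel overhead, so real budgets are smaller. The function name is hypothetical.

```python
MODEM_BPS = 28_800                # design-goal modem rate
AUDIO_RATES_BPS = (5_300, 6_300)  # the two G.723.1 rates


def video_bits_per_frame(audio_bps, fps, modem_bps=MODEM_BPS):
    """Average bits available per video frame, ignoring all overhead."""
    return (modem_bps - audio_bps) / fps


for audio_bps in AUDIO_RATES_BPS:
    for fps in (5, 10, 15):
        bits = video_bits_per_frame(audio_bps, fps)
        print(f"audio {audio_bps} bps, {fps:>2} frames/sec: {bits:.0f} bits/frame")
```

At 10 frames/sec with 6.3-kbps audio, the average comes to 2250 bits per frame, or fewer than 300 bytes for a whole picture.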

The trade-off between a software and a hardware video codec for H.263 is simply between motion quality and cost. One example of a software codec is Intel's Video Phone v1.2, which runs best on a multimedia-extension (MMX) Pentium processor. In general, the software technique can eliminate the need for a separate video codec, but it provides only a QCIF window at about three to 15 frames/sec; the frame rate depends on available CPU cycles. And you still need a video-capture card when using an analog camera with a software video codec. On the other hand, a hardware video codec can deliver a CIF (352×288 pixels) or bigger window at 15 frames/sec. Hardware can deliver more frames per second, because software doesn't use all the complex, full-motion estimation algorithms and compression options offered by H.263 that a hardware video codec does.

Codec delays are another difference between software and hardware codecs; software codecs take more time to run the same algorithms. Increasing CPU power affects the trade-off between software and hardware codecs (Figure 2); as the CPU's power increases, it can run more of the H.263 options with less delay.

An interesting insight on software systems comes from Microsoft's (Redmond, WA) NetMeeting 2.0, which adds H.324 and H.323 audioconferencing and videoconferencing to the company's Internet-telephony and multipoint-data-conferencing capabilities. Beta testing of NetMeeting 2.0 reveals poor video-image quality. Microsoft lists some possible causes, such as too large an image and insufficient camera lighting, which both overload the CPU. Suggested fixes include connecting the camera through a video-capture card to help unload the CPU, closing programs to conserve CPU power, reducing the size of the image, and increasing camera lighting. You can get more information on NetMeeting for Windows 95 and NT at www.microsoft.com/netmeeting.

The video-capture card for software codecs can be simple, but it is essential for speeding video-image processing. For example, the Brooktree Bt848 Video Capture for PCI chip integrates the video-capture system. It incorporates functions from the analog video-input circuitry to the PCI-master interface. The Bt848 helps the CPU by directly delivering down-scaled video to the system memory in YUV format. The resized and formatted video is ready for immediate codec processing by the CPU.

Further help not only for software but also for all video codecs comes from filtering noise from the video-input signal. Noise from poor video-input circuitry, poor lighting, and electrical interference wreaks havoc on compression algorithms. The Bt848 has a five-tap vertical filter at the video input. The filter reduces high-frequency noise in the video signal, enabling better compression of the signal.

You can eliminate the need for a separate video-capture card by using a video-accelerator chip with video input, such as the CL-GD5480 VisualMedia Accelerator from Cirrus. The chip uses Cirrus' V-Port standard interface for a digital-camera input, performs linear scaling by filtering for scale-down, and performs interpolation for zooming without using the host processor. The chip also has two hardware windows and a mirroring feature that allows the local videoconferencing user to see himself.

A full-function-videophone PC add-in card can serve as a complete multimedia card. Figure 3 shows a reference design that includes MPEG-1 video-decompression-playback capability. The design uses Lucent's AVP-III for both audio and video performance. The AVP-III connects to the PCI bus through a bus-mastering device, and the AVP software system controller runs the control and multiplexing algorithms on the host processor. The AVP-III is a specialized processor for H.263, H.261, and G.723 algorithms with minimal delay. In addition, the AVP-III can run the MPEG-1 audio and video algorithms if necessary.

It is a challenge to design an H.324 stand-alone videophone that hits the (as-yet-undetermined) price and performance sweet spot. One method uses the new breed of multimedia processors; you can use the TriMedia processor from Philips Semiconductor as the basis for a videophone or Web-phone design (Figure 4). TriMedia processes audio and video codecs plus the AEC algorithms. It even has power left over to perform a video-frame-to-frame interpolation to minimize jerky motion at low frame rates. TriMedia also offers a set of development tools.

In addition to the audio and video media channels, H.324 provides for data channels. The data-channel substandard is T.120, and it provides for a plethora of data-transmission functions. T.120 includes electronic white boards, computer-application sharing, file transfers, and camera remote control. Standardized formats do not limit the type of data exchange possible in T.120; it allows two terminals to negotiate any type of data exchange between them.

The ITU-T standard recommendations for videoconferencing open the door for more useful products, implying guaranteed interoperability for all products designed for the same recommendation, regardless of manufacturer or complexity. Also, the common substandards within the recommendations enable the design of gateways that can translate compression and multiplexing schemes from one standard to another for even greater interoperability.


References

  1. Lindbergh, David, "The H.324 multimedia communication standard," IEEE Communications, December 1996, pg 46.
  2. Rijkse, Karel, "H.263: Video coding for low-bit-rate communication," IEEE Communications, December 1996, pg 42.
  3. Cox, Richard, and Peter Kroon, "Low bit-rate speech coders for multimedia communications," IEEE Communications, December 1996, pg 34.
  4. Martin, John, "Real world considerations in a video compression system," Motion Media Technology paper presented at ADVICE, Cambridge, MA, 1996.
  5. Thom, Gary, "H.323: The multimedia communications standard for local area networks," IEEE Communications, December 1996, pg 52.

  • The ITU-T standards for videoconferencing for plain-old telephone service (POTS), ISDN, LANs, and asynchronous-transfer mode (ATM) share substandards to promote product interoperability.
  • Audio is the primary medium for a multimedia conference, so near-toll-quality audio is essential.
  • Video compression for POTS, with the ability to dynamically adjust bit rates, fills the available bandwidth for a POTS call by trading off frames per sec and resolution.
  • Specialized and general-purpose media processors are at the core of new designs for videophones.

Standard videoconferencing architecture

For videoconferencing to become mainstream, it needs to be as easy as making a telephone call. To this end, worldwide standardization began in 1990, when the International Telecommunication Union--Telecommunication Standardization Sector's (ITU-T) H.320 became the first videoconferencing standard for public switched-digital networks. The ITU-T built this standard on existing proprietary videoconferencing systems. However, because H.320 requires direct access to a digital network, such as an ISDN connection, widespread adoption in North America has not happened, and adoption in Europe and Asia is slow. Still, H.320 serves as the architectural model for the newer ITU-T videoconferencing standards.

Because it uses H.320 as a model, H.324 includes many second-generation improvements but still maintains backward compatibility. The improvements are evident in new substandards, which include multiplexing the various media types into one bit stream, video and audio compression, and control messages. ITU-T changed the H.320 substandards to meet the low-bit-rate requirements of analog telephone lines. Other second-generation improvements are receiver-controlled mode preferences, dynamic assignment of bandwidth to different channels, and the ability to support multiple channels of each medium. Yet, some of the substandards from H.320, such as encryption and far-end camera control, carry over directly to H.324. In fact, all the new videoconferencing standards reuse substandards wherever possible, which eases interoperation through a simple gateway.

Figure 1 (pg 88) shows the overall system architecture for H.324; all the other videoconferencing standards are basically the same. The mandatory components are the modem (or appropriate network connection for other standards), multiplexer, and control protocol. The audio, video, and data streams are optional, and you can have more than one of each. These options are possible because H.324 terminals negotiate a common set of capabilities during the call setup. The terminals negotiate capabilities independently for the circuit connection in each direction. These independent connections allow asymmetric products, such as a receive-only videophone with no camera, to use all the bandwidth in one direction for audio and, possibly, data.

All the standards also offer the ability to accommodate group videoconferences (Figure 1). The multipoint operation happens through the multipoint control unit (MCU), sometimes called the "bridge." The MCU takes the media streams from each terminal and selectively distributes, or, in the case of audio, mixes, the streams to all terminals.

H.324 specifies using V.34 modems, which operate in full-duplex mode at 9600 bps to 33.6 kbps in increments of 2400 bps. The mandatory V.8 or optional V.8bis protocols negotiate the bit rate during the call setup. The V.8bis protocol also offers an important advantage: You can begin a regular voice telephone call with another party, and you can switch at any point to a video call. There is only a 10- to 15-sec delay while the phone initializes the multimedia call, but this delay is user-initiated and therefore an accepted part of the call.

Using the V.34 modem for H.324 requires special consideration when it comes to bit-error-rate handling. For nonvideo data transmissions, the protocol "tunes" the modem to the highest bit rate at which retransmitting errored data is still more efficient than lowering the bit rate by 2400 bps to get fewer errors. In videoconferencing, the isochronous nature of the data precludes the retransmission of data, so lowering the bit rate by 2400 bps to reduce the bit-error rate would be more efficient. Therefore, video- and audio-retransmission protocols, such as the link-access procedure for modems, work at the control level rather than at the modem level.
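The rate ladder and the step-down rule for isochronous media can be sketched in Python as follows. The rate range and the 2400-bps step come from this sidebar; the error-rate threshold and the function names are hypothetical, chosen only to make the example concrete, and do not come from V.34 or H.324.

```python
# V.34 rates as described in the text: 9600 bps to 33.6 kbps in 2400-bps steps.
V34_RATES_BPS = list(range(9600, 33600 + 1, 2400))


def step_down(rate_bps):
    """Return the next lower rate, or the same rate if already at the bottom."""
    index = V34_RATES_BPS.index(rate_bps)
    return V34_RATES_BPS[max(index - 1, 0)]


def choose_rate(rate_bps, bit_error_rate, max_ber=1e-5):
    """Toy policy for isochronous media, which cannot be retransmitted:
    drop one 2400-bps step whenever the error rate exceeds an assumed limit."""
    if bit_error_rate > max_ber:
        return step_down(rate_bps)
    return rate_bps


print(len(V34_RATES_BPS), "rates:", V34_RATES_BPS)
print(choose_rate(28800, bit_error_rate=1e-3))  # 26400
```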

A special circumstance arises when a PC-based videophone uses a V.34 modem. In this case, the V.34 modem requires the bit stream from the PC to be synchronous. Because most PCs lack a convenient synchronous interface, the modem uses the ITU V.80 protocol, which provides an asynchronous-to-synchronous "tunneling" protocol to accommodate the modem.

The multiplexer layer sends and receives a serialized bit stream to or from the modem. In this layer, the multiplexer assembles the serialized bit stream from the various media streams to send it to the modem. The demultiplexer receives the single bit stream from the modem and divides it into the audio, video, control, and data streams. The H.223 substandard governs the multiplexing for H.324. This substandard needed to modify the H.221 time-division multiplexer (TDM) that H.320 used, because the V.34-modem bit-rate adjustments due to line conditions and varying payload bit rates do not fit neatly into a TDM. Also, a TDM is unfriendly to software implementations.

The H.223 substandard defines both the multiplexer and the adaptation layers. The multiplexer layer mixes media streams into one stream of multiplex-protocol data units (MUX-PDU). Each MUX-PDU carries a variable-length mix of media types as specified in the header. The adaptation layer handles the logical framing, sequence numbering, and error detection and correction on a medium-by-medium basis.

The layers work together to carry a mix of audio, video, and data streams. The mix contains dynamic proportions of each stream, depending on which stream needs the bandwidth. For example, during a silent period, the unused audio-bandwidth allocation goes to the video or data stream. The procedure incurs little delay and has low protocol overhead.
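The idea of handing unused audio bandwidth to other streams can be illustrated with a toy allocator in Python. This is not the H.223 packet format: the fixed payload size, the strict priority order, and all names below are assumptions made for the sake of the example.

```python
def fill_mux_pdu(pending, payload_bytes, priority=("audio", "video", "data")):
    """Share one packet's payload among streams in priority order.

    pending maps a stream name to the number of bytes it has waiting.
    Returns a mapping of stream name to bytes carried in this packet.
    """
    allocation = {}
    remaining = payload_bytes
    for stream in priority:
        take = min(pending.get(stream, 0), remaining)
        if take:
            allocation[stream] = take
            remaining -= take
    return allocation


# While someone is talking, audio takes its share first.
print(fill_mux_pdu({"audio": 24, "video": 500, "data": 100}, payload_bytes=100))
# During silence, the whole payload goes to video.
print(fill_mux_pdu({"audio": 0, "video": 500, "data": 100}, payload_bytes=100))
```

In this sketch the first call yields 24 bytes of audio and 76 bytes of video, and the second yields 100 bytes of video, which mirrors the silent-period example in the text.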

H.323, H.324, and H.310 all use the H.245 substandard for control. H.245 draws from the H.242 control substandard that H.320 uses. An end-to-end H.245 control channel is open when the videoconference connection begins. The first exchange between terminals consists of a description of each terminal's capabilities, such as video- and audio-codec abilities. The terminals also exchange mode-preference requests leading to negotiation for a common set of capabilities. Then, H.245 opens logical channels for the exchange of multimedia information. H.245 controls as many as 65,535 logical channels that are independent, unidirectional bit streams with defined content.
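The capability exchange amounts to finding what both terminals can do before opening channels. The Python sketch below shows only that idea; real H.245 messages are far richer than Python sets, and the function names, the data layout, and the numbering of channels from 1 are all hypothetical.

```python
MAX_LOGICAL_CHANNELS = 65_535  # limit quoted in the text


def common_capabilities(local, remote):
    """Return, per medium, the codecs that both terminals support."""
    common = {}
    for medium in local.keys() & remote.keys():
        shared = local[medium] & remote[medium]
        if shared:
            common[medium] = shared
    return common


def assign_channel_numbers(media):
    """Give each negotiated medium its own logical-channel number."""
    media = sorted(media)
    if len(media) > MAX_LOGICAL_CHANNELS:
        raise ValueError("too many logical channels requested")
    return {number: medium for number, medium in enumerate(media, start=1)}


pc_phone = {"audio": {"G.723.1"}, "video": {"H.263", "H.261"}, "data": {"T.120"}}
desk_phone = {"audio": {"G.723.1"}, "video": {"H.261"}}

shared = common_capabilities(pc_phone, desk_phone)
print(shared)
print(assign_channel_numbers(shared))
```

Here the two terminals settle on G.723.1 audio and H.261 video, and no data channel opens because only one side offers T.120.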

The commonality of the H.245 control substandard helps with interoperability between the plain-old telephone service (POTS), LAN, and asynchronous-transfer-mode (ATM) systems. In addition to capabilities and logical channels, messages for general commands and indications use the H.245 protocol. The terminals or systems also exchange flow-control and multiplex substandard messages. The substandard has room for manufacturer-specific messages to add special product features. ITU-T designed H.245 to accommodate the addition of future standard features.

Speech coders for low-bit-rate multimedia communications

Speech-coding algorithms reduce the number of bits you need to correctly replicate an utterance. When the speech occurs simultaneously with video, graphics, or data, you call it "multimedia communications." Because videoconferencing happens on plain-old telephone service (POTS), ISDN, and LANs, the attributes for a speech coder vary from one medium to another. Instead of using one coder for all media, the ITU-T specifies speech coders that make economical trade-offs between attributes for specific applications.

The attributes common to all speech coders are bit rate, complexity, delay, and quality. The application for which the coder is suitable depends on how these attributes interact. For example, in one application, cost is the dominant design criterion, whereas in another application, quality might be most important. You can use three recent speech coders for low-bit-rate media to demonstrate these trade-offs in coders (Table A).

Bit rate--The goal is the lowest possible peak bit rate that conserves bandwidth but still maintains quality. At the same time, the bit rate should also be variable; a variable bit rate allows the coder to use only the bandwidth it needs, which frees bandwidth for the other media that share the channel. A simple way to achieve variable bit rates is to use a fixed rate for active speech and to suppress silent periods. However, total silence can be uncomfortable for users; they may think they have lost their connections.

The silence-suppression techniques rely on two algorithms. The first one is the voice-activity detector, which is necessary to determine when a silent period occurs. It must be able to detect speech from background noise quickly enough to start coding without losing the beginning of the utterance. The second algorithm is the comfort-noise generator. The receiver uses the comfort-noise generator to fill the dead silence that results from a period that the voice-activity detector designates as nonspeech. The challenge is for the encoder and decoder to stay synchronized during zero-bit periods so that there is no delay when positive voice-activity detection occurs.
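The interplay of the two algorithms can be shown with a deliberately simple Python sketch. It uses a plain energy threshold as the voice-activity detector and uniform random samples as comfort noise; the detectors and noise generators in the ITU-T coders are considerably more sophisticated, and the threshold, amplitude, and function names here are assumptions.

```python
import random


def is_speech(frame, energy_threshold=500.0):
    """Crude voice-activity detector: mean-square energy above a threshold."""
    energy = sum(sample * sample for sample in frame) / len(frame)
    return energy > energy_threshold


def transmit(frames):
    """Send active frames; send nothing (None) for frames judged silent."""
    return [frame if is_speech(frame) else None for frame in frames]


def receive(received, frame_len, noise_amplitude=20, seed=0):
    """Play received frames, filling suppressed ones with comfort noise."""
    rng = random.Random(seed)
    played = []
    for frame in received:
        if frame is None:
            frame = [rng.randint(-noise_amplitude, noise_amplitude)
                     for _ in range(frame_len)]
        played.append(frame)
    return played


talk = [1000, -1000] * 4   # loud frame: treated as speech
quiet = [3, -3] * 4        # faint frame: treated as silence
sent = transmit([talk, quiet])
played = receive(sent, frame_len=8)
print(sent[1] is None, len(played[1]))  # True 8
```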

Complexity--A speech coder's degree of complexity depends on the amount of computer resources needed to perform the codec algorithms. MIPS is a convenient way to measure computer resources needed for coding. Low-complexity speech coders typically use 15 MIPS or less, whereas high-complexity coders are those that use 30 or more MIPS. The coder-implementation cost varies with the number of MIPS.

Delay--The one-way system delay for a speech coder consists of the algorithm, processing, and communication delays. In telephony applications, the one-way system delay becomes noticeable at 200 msec and annoying at 400 msec, assuming that no echoes are introduced into the speech. If echoes occur (from a speaker videophone, for example), the tolerable delay becomes only 25 msec.

For videoconferencing, it is a good idea to include echo cancellation in speech processing. The acoustic-echo cancellation (AEC) happens at the receiver end, because that is where the sound energy bounces back into the microphone. The echo cancellation added to the receiver enhances the talker's comfort--he doesn't hear his own echo.

Low-bit-rate speech coders work on a frame-by-frame basis that causes the algorithm delay. The coder analyzes and encodes or decodes a segment of speech, or frame, at a time. For efficiency and accuracy, the coder analyzes a portion of the next frame at the same time it analyzes the current frame. This piece of the next frame is the look-ahead, or subframe. A buffer holds both the frame and the subframe while the coder analyzes them, so there is a minimum delay in the algorithm that equals the length of the stored frames. Table A shows that the frame for G.729 is 10 msec and the subframe is 5 msec, so the total algorithmic delay is 15 msec.

In addition to the algorithmic delay, there is also the processing delay: the time the DSP or CPU takes to analyze and code or decode the speech data. This time depends on the algorithm complexity and the processor speed; hence, the processing delay is somewhat under the control of the designer. A typical processing delay for G.729 is approximately 10 msec.

The communications delay is the time it takes for an entire frame to transfer from the encoder to the decoder. This delay depends on the proximity of the terminals and changes from one call to the next. The speech coder's impact is due only to the frame size; the smaller the frame, the smaller the communications delay.

In videoconferencing standards, multiple-caller conferences might add another delay. The conference callers connect to each other through a bridge and, possibly, a gateway (if the call is from an H.320 to an H.324 system, for example). The bridge decodes the speech and then distributes it to all conference participants. Before transmission from the bridge to the callers, however, another coder encodes the speech, which doubles the algorithm and processing delays.
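A rough model of the bridge's effect follows. It assumes, as the text describes, that the bridge adds one extra decode-and-encode pass and leaves the communications delay unchanged, which is a simplification. The function name is hypothetical, and the 40-msec communications delay in the example is an assumed figure, because that delay changes from one call to the next.

```python
def one_way_delay(algorithmic, processing, communications, via_bridge=False):
    """One-way delay in msec; a bridge adds a second decode/encode pass."""
    codec_passes = 2 if via_bridge else 1
    return codec_passes * (algorithmic + processing) + communications

# G.729 figures from the text: 15-msec algorithmic, about 10-msec processing.
# The 40-msec communications delay is an assumed value for illustration.
direct = one_way_delay(15, 10, 40)
bridged = one_way_delay(15, 10, 40, via_bridge=True)
```

With these inputs, the direct call has a 65-msec one-way delay and the bridged call has 90 msec.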

Quality--Speech-coder quality has many variables to consider. The ultimate measure of quality is how the decoded speech sounds to the user. Factoring into this final measure are the following questions: How many encodes and decodes has the utterance gone through? How much background noise is there? What is the transmission-error rate? Does the modem detect errors, and, if so, does the protocol ignore them? Another measure of quality is how well a coder deals with more than one speaker at a time. How successfully the algorithm accounts for all these variables is the measure of the coder's quality.

These speech-coder attributes are interdependent. For example, G.729 has a short algorithmic delay and good quality at the expense of increased complexity and bit rate. The annex recommendation, G.729A, targets the digital-simultaneous-voice-and-data (DSVD) application, in which low cost matters more than short delay and quality. To reduce the complexity and cost, G.729A sacrifices some performance under certain conditions, such as data-only transfers.

POTS videoconferencing has a different set of requirements from DSVD, and G.723.1 aims to optimize the coder for POTS-videoconferencing applications. Because a minimum video-frame rate of five frames/sec, or 200 msec per frame, is likely, G.723.1 works backward from a 100-msec, one-way system delay to establish the longest tolerable algorithmic delay. The 100-msec delay allows for a multipoint bridging unit to double the delay and still match the video delay. The approximate delays for G.723.1 total 97.5 msec: an algorithmic delay of 37.5 msec; a processing delay of 15 msec; and a communications delay of about 45 msec, which includes the V.34-modem delay of 35 msec. The long algorithmic delay was a trade-off for reducing the bit rate and lowering the complexity. The quality difference between G.723.1 and G.729 is undetectable to the typical user.
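The delay budget can be checked with a few lines of arithmetic. The sketch below uses only the approximate figures quoted in the text; the variable names are illustrative, and doubling the whole one-way figure for the bridge is the simple reading of the text rather than a detailed model.

```python
# Approximate G.723.1 delay components in msec, as quoted in the text.
G7231_DELAYS = {
    "algorithmic": 37.5,
    "processing": 15.0,
    "communications": 45.0,   # includes the 35-msec V.34-modem delay
}

ONE_WAY_TARGET_MSEC = 100.0
VIDEO_FRAME_RATE = 5          # frames/sec, the minimum rate the text assumes

total = sum(G7231_DELAYS.values())
video_frame_period = 1000.0 / VIDEO_FRAME_RATE
bridged_total = 2 * total     # a multipoint bridge roughly doubles the delay

print(f"one-way total: {total} msec (target {ONE_WAY_TARGET_MSEC} msec)")
print(f"through a bridge: {bridged_total} msec "
      f"(video frame period {video_frame_period} msec)")
```

The one-way total of 97.5 msec fits within the 100-msec target, and the doubled figure of 195 msec still fits within one 200-msec video-frame period.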

Table A--Important parameters for new ITU-T speech-coder standards

Attribute | Parameter | G.729 | G.729A | G.723.1 | G.723.1
Bit rate | Bit rate (kbps) | 8 | 8 | 6.3 | 5.3
Delay | Frame (msec) | 10 | 10 | 30 | 30
Delay | Subframe (msec) | 5 | 5 | 7.5 | 7.5
Delay | Algorithmic delay (msec) | 15 | 15 | 37.5 | 37.5
Complexity | MIPS (estimated for general-purpose, fixed-point DSP) | 22 | 12 | 18 | 23
Complexity | RAM (16-bit words) | 2.7k | 2k | 2.2k | 2.2k

Table 1--Major ITU-T videoconferencing standards and substandards

Feature | H.320 | H.324 | H.323 | H.310
Primary application | ISDN | Plain-old telephone service (POTS) | LAN, Ethernet | Asynchronous-transfer-mode (ATM) LANs
Video codecs | H.261* | H.261*; H.263* | H.261*; H.263 | H.262 (MPEG-2)*; H.261*
Audio codecs | G.711*; G.722; G.728 | G.723.1*; G.729 | G.711*; G.722; G.728; G.723.1; G.729 | MPEG-1*; G.711*; G.722; G.728
Multiplexing | H.221* | H.223* | H.225* | H.222*; H.222.1*
Control functions | H.242* | H.245* | H.245* | H.245*
Data transfer | T.120 | T.120 | T.120 | T.120

*mandatory
Note: See box, "Videoconferencing standards and substandards glossary," for definitions.

Videoconferencing standards and substandards glossary
Videoconferencing standards

H.310: Standard for guaranteed-bandwidth LANs, such as ATM.

H.320: The first videoconferencing standard for public switched-digital networks, such as ISDN.

H.323: Standard for packet-switched networks that cannot guarantee bandwidth, such as Ethernet.

H.324: Standard for multimedia conferencing on plain-old telephone service (POTS).

Audio-codec substandards

G.711: Audio-compression algorithms for 64-kbps pulse-code modulation.

G.722: Audio-compression with low-complexity algorithms but a high bit rate.

G.723.1: Audio-compression algorithms for low bit rates and low complexity.

G.728: Audio-compression with high quality and high complexity at 16 kbps.

G.729: Standard for telephone-network-quality speech, originally designed for wireless applications but suitable for multimedia as well.

G.729A: A less complex version of G.729, explicitly designed for simultaneous voice and data; interoperates with G.729.

Video-codec substandards

H.261: Compression algorithms for color motion video, originally for H.320.

H.262: MPEG-2-compatible video-compression algorithms.

H.263: Video-compression algorithm based on H.261 with enhancements including motion compensation for improved quality at low bit rates.

Substandards for mixing media into one bit stream

H.221: Uses a time-division multiplexing method.

H.222: Uses multiplexing for guaranteed-bandwidth LANs.

H.222.1: Extensions for guaranteed-bandwidth LANs.

H.223: Uses multiplexing for low-bit-rate connections.

H.225: Uses multiplexing for nonguaranteed-bandwidth LANs.

Substandards for control

H.242: Control protocol for ISDN connections; the basis for H.245.

H.245: Control protocol that performs automatic capability negotiation and logical channel control.

Miscellaneous standards

T.120: Data-channel substandard.

V.8: Call start-up protocol to identify modem type and operation mode.

V.8bis: Call protocol that allows a normal voice call to switch to a videoconference call at any time.

V.14: Modem-level retransmission protocol that the data channel uses in multimedia calls.

V.80: Standard that provides an asynchronous-to-synchronous "tunneling" protocol to accommodate V.34 modems.

Representative list of videoconferencing-IC manufacturers
When you contact any of the following manufacturers directly, please let them know you read about their products on EDN's website.
Analog Devices
Norwood, MA
1-617-461-3060
www.analog.com
Brooktree
San Diego, CA
1-800-228-2777
www.brooktree.com
Chromatic Research
Sunnyvale, CA
1-408-752-9100
www.mpact.com
Cirrus Logic
Fremont, CA
1-510-623-8300
www.cirrus.com
Crystal Semiconductor
Austin, TX
1-800-888-5016
www.crystal.com
Digital Semiconductor
Hudson, MA
1-800-332-2717
www.digital.com/info/semiconductor
Hitachi America
Brisbane, CA
1-800-285-1601, ext 27
www.hitachi.com
Intel
Hillsboro, OR
1-800-548-4725
www.intel.com/proshare/videophone/
Lucent Technologies
Berkeley Heights, NJ
1-800-372-2447
www.lucent.com
Micro Linear
San Jose, CA
1-408-433-5200
www.microlinear.com
Motion Media Technology
Bristol, UK
+44 1454 313444
fax +44 1454 313678
www.mmtech.co.uk
Philips Semiconductors
Sunnyvale, CA
1-800-447-1500, ext 1344
www.semiconductors.philips.com/ps/
Siemens
Cupertino, CA
1-408-777-4500
www.siemens.com
Texas Instruments
Dallas, TX
1-800-477-8924
www.ti.com
Toshiba America
San Jose, CA
1-800-879-4963
www.toshiba.com
Table 2--ICs in videoconferencing products

Vendor | Part no. | Videoconference function | Features | Package | Price (quantity)
Analog Devices | ADSP-2181 (KS115) | Audio codec for all ITU-T recommendations; acoustic-echo cancellation; control | 80-kbyte RAM | 128-pin TQFP or PQFP | $22.68 (10,000)
Brooktree | Bt848 Video Capture for PCI | H.324-enabled home PCs; captures, resizes, and reformats video for direct codec feed | PCI Rev 2.1 interface; supports multiple video analog- and digital-input and -output formats | 160-pin PQFP | $19.50 (10,000)
Cirrus Logic | CL-GD5480 64-bit VisualMedia Accelerator | Display accelerator with videoconferencing features | YUV420 video I/O; digital-camera interface; dual hardware video windows | 208-pin QFP | $21 (10,000)
Crystal Semiconductor | CS6403 echo-canceling codec | Acoustic-echo cancellation | Full-duplex codec; DSP and ROM in one package | 44-pin TQFP or PLCC | $10.40 (10,000)
Digital Semiconductor | 21230 video codec | Video codec for MPEG-1, H.261, and extendible to H.263 | Supports video codec at 30 frames/sec for MPEG-1 and H.261 (CIF resolution at 15 frames/sec) | 240-pin PQFP | $75 (10,000)
Digital Semiconductor | 21231 video-output logic | PC interface directly to TV monitor or VCR | Supports TV/VCR output on 21230-board designs | 144-pin PQFP | $22.50 (1000)
Hitachi America | SH7410 SH-DSP (60 MHz) | Audio codec for all ITU-T recommendations; acoustic-echo cancellation; control | 48-kbyte ROM; 8-kbyte RAM | 176-pin QFP | $25 (10,000)
Lucent Technologies | AV4400A AVP III video/audio processor | Video and audio codec for H.320, H.324, and MPEG-1; modules G.711, G.722, G.728, and audio-echo cancellation | Includes AVP-API host-software platform | 240-pin SQFP | $75 (100,000)
Micro Linear | ML6430 Genlock chip | Microclock synthesizer | Extracts and generates the video timing waveforms for codecs, video captures, and PCs | 32-pin TQFP | $15 (1000)
Philips Semiconductors | TriMedia processor | Video and audio codec for H.320, H.323, and H.324 | General-purpose media processor; PCI interface and C programming; open architecture | 240-pin PQFP | $50 (100,000)
Siemens | PSB 7230 joint audio decoder encoder (JADE); PSB 2163 audio-ringing codec filter | Two chips for H.324-compatible chip set; needs video and control processor | Video-reference board needs camera, handset, speakerphone, and modem; PCI board; Windows software | 7230, 100-pin TQFP; 2163, 28-pin PLCC or PDIP | 7230, $20 (10,000); 2163, $11 (10,000)
Texas Instruments | TMS320C80 DSP/32-bit RISC | Videoconferencing for room-size systems, either proprietary or H.320 | Includes video controller, four parallel-processing DSPs, 50-kbyte RAM, and 2 billion operations/sec | 352-pin BGA; 305-pin ceramic PGA | $120 (25,000)
Texas Instruments | TMS320C82 DSP/32-bit RISC | Videoconferencing for desktop systems, H.320, and H.324 | Two parallel-processing DSPs, 44-kbyte RAM, 1.5 billion operations/sec | 352-pin BGA | $80 (25,000)
Toshiba America | TC80301AF MPACT/3000 multimedia processor | PCI-card videoconferencing; H.324 and modem modules available | General-purpose media processor; 2- and 3-D-graphics acceleration; audio; fax/modem; MPEG-1 and MPEG-2 video | 240-pin PQFP | $75 (1000)

Stephen Kempainen, Technical Editor

You can reach Stephen Kempainen at 1-415-643-1760, fax 1-415-643-9513, ednkempainen@worldnet.att.net.




Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.