Design Feature: February 17, 1994

Speech-synthesis and -recognition chips personalize consumer products

JOHN GALLANT,
Technical Editor

If speech is the mirror of the soul, then many of today's commercial products will reap their just rewards. Maturation of speech-compression coding algorithms is allowing almost every new device to talk, produce sounds, and even recognize a human voice.

Have you ever noticed how virtually every consumer product you buy these days can talk to you? Practically every new toy, learning aid, game, and even some greeting cards can make audible sounds. In addition, multimedia computers, automotive warning systems, appliances, clocks, and equipment for the handicapped all can talk. At the core of these talking devices are speech-synthesis ICs that generate speech from sampled data stored in memory.

The 1980s witnessed a range of technical breakthroughs that enable today's high-quality talking products. Gone are the early days of talking computers, which employed ICs that linked phonemes to generate speech. Although these products could generate unlimited speech, the speech was of very poor quality: At best, it sounded robotic; at worst, it was unintelligible. The techniques that emerged to generate today's high-quality speech are sampled-data systems that take samples of an actual human voice. The systems use predictive-coding algorithms for data compression.

The two most popular predictive-coding algorithms for generating speech are adaptive differential pulse-code modulation (ADPCM) and linear predictive coding (LPC). These algorithms compress speech samples stored in memory while retaining "near-toll-quality" speech. The CCITT defines toll-quality speech as logarithmic PCM-coded data at a sample rate of 8 kHz and a resolution of 8 bits/sample. The combination results in a bit rate of 64 kbps for toll-quality speech.
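The toll-quality arithmetic above can be checked with a short Python sketch. The article says only "logarithmic PCM"; the sketch assumes the µ-law curve with µ = 255, and it uses the continuous textbook formula rather than the segmented approximation that real codecs implement. The function names are hypothetical.

```python
import math

MU = 255  # assumed mu-law constant; the article does not name the log law

def bit_rate_bps(sample_rate_hz, bits_per_sample):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample

def mu_law_compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# 8-kHz sampling at 8 bits/sample gives the 64-kbps toll-quality rate.
toll_rate = bit_rate_bps(8000, 8)

# Quiet samples get a larger share of the code range than loud ones.
quiet = mu_law_compress(0.01)
loud = mu_law_compress(0.5)
```

The logarithmic curve is what lets 8 bits cover the dynamic range of speech: small amplitudes are quantized finely and large ones coarsely.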


ADPCM has lots of advocates

Oki Semiconductor uses the ADPCM coding system to compress data in the company's speech-synthesis ICs. ADPCM is a variant of DPCM, which reduces the amount of data by quantizing and encoding the differential between speech-signal samples. ADPCM adaptively changes the quantization width, depending on the quantization width of the previous differential sample.
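To make the idea concrete, here is a minimal sketch of a 4-bit ADPCM coder in Python. It is not Oki's algorithm: the step multipliers, step limits, and function names are illustrative assumptions. It shows only the two ingredients described above, namely coding the difference between the input and a running prediction, and rescaling the quantizer step according to the previous code.

```python
import math

# Step multipliers indexed by the 3-bit magnitude code.
# Illustrative values only, not taken from any vendor's table.
STEP_MULT = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]
STEP_MIN, STEP_MAX = 16.0, 2048.0

class AdpcmState:
    def __init__(self):
        self.predicted = 0.0   # running estimate of the signal
        self.step = STEP_MIN   # current quantization width

def _reconstruct(state, code):
    """Apply one 4-bit code to the state; shared by encoder and decoder."""
    magnitude = code & 0x7
    delta = (magnitude + 0.5) * state.step
    if code & 0x8:             # bit 3 is the sign bit
        delta = -delta
    state.predicted = max(-32768.0, min(32767.0, state.predicted + delta))
    # Large codes widen the next step; small codes narrow it.
    state.step = max(STEP_MIN, min(STEP_MAX, state.step * STEP_MULT[magnitude]))
    return state.predicted

def encode(samples):
    """Turn 16-bit-range samples into a list of 4-bit codes (0..15)."""
    state = AdpcmState()
    codes = []
    for sample in samples:
        diff = sample - state.predicted
        magnitude = min(7, int(abs(diff) / state.step))
        code = magnitude | (0x8 if diff < 0 else 0)
        codes.append(code)
        _reconstruct(state, code)  # keep the encoder in step with the decoder
    return codes

def decode(codes):
    """Rebuild an approximation of the signal from 4-bit codes."""
    state = AdpcmState()
    return [_reconstruct(state, code) for code in codes]

# Usage: a 200-Hz tone sampled at 8 kHz, coded at 4 bits/sample.
signal = [8000 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = encode(signal)
decoded = decode(codes)
```

Because the encoder runs the same reconstruction step as the decoder, both sides derive the step size from the code stream itself, so no side information has to be stored.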

Oki offers synthesizers having internal mask ROM ranging from 128 kbits to 1 Mbit. The company also offers synthesizers with one-time-programmable (OTP) or external ROMs (Fig 1). The synthesizers have resolutions of 3 or 4 bits/sample and variable sampling rates from 4 to 16 kHz. A 4-bit ADPCM synthesizer sampling at 4 kHz has a compressed bit rate of 16 kbps. The 4-bit MSM6379 has 512 kbits of OTPROM, which stores 32 sec of speech sampling at 4 kHz. The $10 (5000) chip comes in a 16-pin DIP and has an internal 12-bit D/A converter and a lowpass filter. You program the device using a dedicated programming tool called Anawriter Mark VII.
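The bit-rate and storage figures quoted above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the sketch assumes that 1 kbit means 1000 bits, which reproduces the 32-sec figure exactly; counting 1024 bits per kbit gives slightly more.

```python
def adpcm_bit_rate_bps(bits_per_sample, sample_rate_hz):
    """Compressed bit rate of a fixed-width ADPCM stream."""
    return bits_per_sample * sample_rate_hz

def playback_seconds(memory_bits, bit_rate_bps):
    """Seconds of speech that fit in a given amount of memory."""
    return memory_bits / bit_rate_bps

# 4 bits/sample at 4 kHz: the 16-kbps case from the text.
rate = adpcm_bit_rate_bps(4, 4000)

# 512 kbits of OTPROM at that rate (assuming 1 kbit = 1000 bits).
seconds = playback_seconds(512 * 1000, rate)

# The same memory at the top settings (4 bits/sample, 16 kHz).
seconds_at_16khz = playback_seconds(512 * 1000, adpcm_bit_rate_bps(4, 16000))
```

The trade-off is direct: quadrupling the sampling rate to 16 kHz improves bandwidth but cuts the same memory to a quarter of the playing time.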

Mosel-Vitelic also offers a wide range of speech-processing chips, largely designed into products from Asia, based on ADPCM data compression. The VTV001 has as much as 256 kbytes of memory and variable bit rates from 16 to 32 kbps. The chip also includes a microphone and a preamplifier for voice recording and an 8-bit D/A converter for reproduction.


LPC models the vocal tract

Texas Instruments employs the other popular coding system, LPC, to compress data in speech-synthesis ICs. LPC attempts to model the human vocal tract. Air from the lungs excites the vocal tract by moving through the vocal cords (two small flaps at the base of the larynx). When producing voiced sounds such as "a" or "e," the vocal cords vibrate to modulate the air from the lungs, thus producing nearly periodic pulses of air. The pitch period determines the sound produced.

Besides the voiced sounds, generating speech also requires the use of unvoiced (or noise) sounds, such as "s." Turbulent air passing through the open vocal tract produces unvoiced sounds. Above the vocal cords are the pharynx and the oral and nasal cavities, all of which shape the spectrum of the sound. The frequency response of the vocal tract is similar to that of a tube with constant diameter, which has a number of resonances, or formants.
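The uniform-tube comparison can be put into numbers with the standard quarter-wave formula for a tube closed at one end and open at the other, F_n = (2n - 1)c/(4L). The Python sketch below is an illustration only; the 17-cm tract length and the 340-m/sec speed of sound are assumed round values that do not come from the article.

```python
def tube_resonances_hz(length_m, speed_of_sound_m_s=340.0, count=3):
    """Resonances of a uniform tube closed at one end, open at the other.

    Uses the quarter-wave formula F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound_m_s / (4.0 * length_m)
            for n in range(1, count + 1)]

# Assumed 17-cm tract: resonances near 500, 1500, and 2500 Hz.
formants = tube_resonances_hz(0.17)
```

A real vocal tract changes shape as you speak, which moves these resonances around; that movement is what an LPC filter's coefficients have to track.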

TI's TSP50C1x family combines an 8-bit µP, a speech synthesizer, ROM, RAM, a D/A converter, and I/O interface ports on a low-cost chip. The chip's LPC model imitates the human vocal tract. The model extracts parameters from sampled speech to create two excitation generators that model the vocal-cord restrictions for voiced and unvoiced sounds. The model has a gain multiplication stage to model levels of pressure from the lungs and a 12-pole lattice filter that models the shape of the oral cavity. Because the filter has 12 poles, TI calls it LPC-12 (Fig 2).
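The structure described above (excitation generator, gain stage, all-pole lattice filter) can be sketched in a few lines of Python. This is a floating-point illustration of the general technique, not TI's implementation: the function names, the reflection coefficients, and the use of uniform random noise for unvoiced excitation are all assumptions. For brevity the filter state is reset on every call; a real synthesizer would carry it across frames.

```python
import random

def voiced_excitation(n, pitch_period):
    """Unit pulses spaced one pitch period apart (models cord vibration)."""
    return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]

def unvoiced_excitation(n, seed=0):
    """Uniform random noise (models turbulent air)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def lattice_synthesize(excitation, k):
    """All-pole lattice filter with reflection coefficients k[0..p-1].

    Stable when every |k[i]| < 1. b[i] holds the backward signal of
    stage i from the previous sample.
    """
    p = len(k)
    b = [0.0] * (p + 1)
    out = []
    for e in excitation:
        f = e
        for i in range(p, 0, -1):
            f = f - k[i - 1] * b[i - 1]       # forward signal into stage i-1
            b[i] = k[i - 1] * f + b[i - 1]    # backward signal out of stage i
        b[0] = f
        out.append(f)
    return out

def synthesize_frame(voiced, pitch_period, gain, k, n):
    """Excitation -> gain -> lattice filter, for one frame of n samples."""
    if voiced:
        excitation = voiced_excitation(n, pitch_period)
    else:
        excitation = unvoiced_excitation(n)
    return lattice_synthesize([gain * e for e in excitation], k)

# Usage: one voiced frame through a 12-stage filter with made-up coefficients.
K12 = [-0.9, 0.6, -0.3, 0.2, -0.1, 0.1, -0.05, 0.05, 0.0, 0.0, 0.0, 0.0]
frame = synthesize_frame(True, pitch_period=80, gain=0.5, k=K12, n=160)
```

The lattice form is popular for this job because stability is easy to guarantee: as long as each coefficient stays between -1 and 1, the filter cannot blow up, even when the coefficients are coarsely quantized.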

Because speech changes slowly, the µP accesses parameter samples from memory in frames that are generally 10 to 25 msec long. The device calculates the input parameters to the model as an average of the parameters for the entire frame. The resulting compressed bit rate is effectively 1.5 kbps. The TI devices offer five sizes of internal ROM ranging from 4 kbytes (capable of processing 14 sec of speech) in the $0.85 TSP50C04 to 32 kbytes (capable of up to 3 minutes) in the $2.30 TSP50C19. Because the devices are mask programmable, they are available only in minimum-quantity orders of 50,000 units.
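A quick Python check shows what the 1.5-kbps figure implies for frame size and ROM capacity. The function names and the 20-msec frame length are assumptions (the article gives only a 10- to 25-msec range), and the sketch treats the whole ROM as speech data, so its results are upper bounds.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Average number of bits available to describe one frame."""
    return bit_rate_bps * frame_ms / 1000

def speech_seconds(rom_bytes, bit_rate_bps):
    """Upper bound on speech duration if the whole ROM held speech data."""
    return rom_bytes * 8 / bit_rate_bps

LPC_RATE_BPS = 1500

# About 30 bits describe each assumed 20-msec frame.
frame_bits = bits_per_frame(LPC_RATE_BPS, 20)

# 32 kbytes: roughly 175 sec, in line with "up to 3 minutes".
large_rom_seconds = speech_seconds(32 * 1024, LPC_RATE_BPS)

# 4 kbytes: roughly 22 sec as an upper bound.
small_rom_seconds = speech_seconds(4 * 1024, LPC_RATE_BPS)
```

The 32-kbyte result agrees with the quoted 3 minutes. The 4-kbyte bound is higher than the quoted 14 sec; the article does not say what accounts for the difference.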

TI recommends that external speech-coding services perform speech development for the company's speech-synthesis chips. One external speech developer is Robert Jeffway, who has coded a host of toys and commercial appliances using TI's development tools for speech analysis. You can contact him in Leeds, MA, at (413) 584-0491.


Toys and games

ESS Technology offers the Sound Magician line of speech- and sound-synthesis chips for the toy and game market. The playback-only chips cost $0.50 to $2 in OEM quantities. A customer supplies a recording of the desired speech and sound, which the company samples and stores in an on-chip ROM.

In addition, ESS supplies three chips that generate speech and sound for PC applications. The chips employ ADPCM or a patented ESPCM compression coding algorithm. The chips are register compatible with Creative Labs' Sound Blaster board. The $12 ES488 creates all of the speech and sound of a Sound Blaster board, except music synthesis.

The recently introduced $18 ES1488 is socket compatible with the ES488 and includes on-chip music synthesis. The $20 ES688 is a 16-bit stereo chip that features 44.1-kHz sampling for recording and playback of CD-quality music. All of the chips interface with the ISA bus and have drivers for Windows and the Windows Sound System. The chips' audio-application software includes an audio recorder, a talking clock, a calculator, and a volume control. In addition, the chips run on 3.3 or 5V and have power-management features for adding sound to portable computers.

National Semiconductor's NS32AM160 and NS32AM161 chips are members of the 32-bit Series 32000/EP family of embedded system processors. Designed for digital answering machines, the $18 (10,000) chips can also replace microcassettes in dictation machines. The processors integrate the functions of a DSP chip and a system controller. The DSP function compresses and decompresses data using sub-band coding or LPC algorithms.

The processors can execute instructions from on-chip or external ROM. In addition, the chips can detect and generate DTMF tones and provide voice recognition. National supplies the NSvoice algorithmic software to execute the DSP, compression/decompression, DTMF, and tone-generation functions.

Information Storage Devices (ISD) provides a different twist to recording and playback devices for short (10 sec or less) speech applications. The $5.48 (1000) ISD1100 series stores analog signals directly in single cells as one of 256 levels. Each device provides an oscillator, a microphone preamplifier, automatic gain control, a smoothing filter, and a speaker amplifier on one chip.

Because the ISD chips provide analog storage, they don't require A/D or D/A converters. ChipCorder technology lets you record and rerecord as much as 10 sec of audio without using a special programmer. You can play back the sound through a small external speaker. The chip is at the heart of Hallmark's recordable greeting cards.

Voice synthesis is only half the story. Many commercial devices can recognize voice commands as well. Because voice-recognition algorithms are more complex than synthesis algorithms, they require the full horsepower of a DSP chip. AT&T has leveraged its expertise based on the DSP16A chip to develop dedicated speaker-trained voice recognizers for telecommunication terminals, such as cellular and cordless telephones. Using ADPCM or code-excited linear-predictive-plus coding, the recognizers store compressed speech that achieves bit rates of 5.2 kbps.

AT&T's latest offering is a hands-free voice processor, called the HVP-S, which provides full-duplex operation of a cellular phone with no microphone suppression. The chip, along with two codecs, memory, and a microcontroller, stores as many as 64 speaker-trained utterances. The chip allows speed dialing by voice and automatically answers the phone. The HVP-S with ROM-coded voice-recognizer software sells for less than $20 (10,000).

Because Analog Devices sells some of its DSP chips for less than $10, the company is actively recruiting third-party vendors to add speech-processing value to its chips. For example, Centigram Communication adds interactive voice response and text-to-speech value to Analog Devices' 2100 family of DSP chips. You can purchase the product as an adapter card or as integrated chip sets.

Recently, Dragon Systems and Analog Devices received a federal grant to port Dragon's voice-recognition software onto one of Analog Devices' DSP chips. Dragon will develop speech-recognition software to fit available memory and run on a range of DSP chips for handheld personal digital assistants and notebook computers.


Voice recognition for faxes

National Semiconductor offers the Dispatch family, which includes three 32-bit embedded processors that incorporate DSP functions and three peripheral controllers. The product permits the use of a single phone line for facsimile and voice communications. It uses software to switch automatically between send and receive modes or between voice and fax modes. Dispatch uses a set of 25 words to provide speaker-dependent voice recognition for controlling the fax and the answering machine. The lowest configuration of the family—the 32FX161 processor plus the 32X100 peripheral controller—costs $45 (1000).

Vocal Inc adds voice-recognition software to TI's TMS320C25 DSP chip or National Semiconductor's 32000/EP family of embedded-system processors. The company's TrueWord software stores as many as 100 utterances using a speaker-trained DSP algorithm called Spectral Fit Coding (SFC). SFC compresses sampled audio data to 5 kbps using a minimum-squared-error fitting process.

For systems that already have a TI DSP chip or one of National's embedded processors, Vocal supplies TrueWord as a licensed OEM software product. Otherwise, Vocal offers a 3×6-in. card that contains a processor, a µ-Law codec, external RAM, and external ROM. Evaluation units for a TMS320C25 voice computer card cost $1500. Evaluation units for a voice module card containing a National 32000/EP family processor cost $500.

Looking ahead

Speech-synthesis technology has matured through the 1980s and the 1990s. As a consequence, you can purchase powerful speech-synthesis ICs today for only a few dollars or less in OEM quantities. The low added cost of these devices must appeal to commercial-equipment designers because they are including vocal commands in virtually every product under development.

Even speech-recognition technology has made major advances in the past decade. Although the most sophisticated voice recognizers require a computer with megabytes of memory, voice-recognition products designed around DSP chips or embedded processors are powerful subsets of this technology. These speaker-trained devices store a variety of utterances and achieve accuracies greater than 90%, even in the presence of automobile noise. You can expect to see these products in cellular and cordless telephones, handheld personal digital assistants, digital answering machines, and a variety of voice organizers, such as memos, calendars, and to-do lists.


You can reach Technical Editor John Gallant at (617) 558-4666, fax (617) 558-4470.

For free information...
For free information on the speech-processing products discussed in this article, contact any of the following manufacturers directly. Please let them know you read about their products in EDN.
Analog Devices, Norwood, MA, (617) 329-4700
AT&T Microelectronics, Allentown, PA, (800) 372-2447
Centigram Communication Corp, San Jose, CA, (408) 944-0250
Dragon Systems, Newton, MA, (617) 965-5200
ESS Technology, Fremont, CA, (510) 226-1088
Information Storage Devices, San Jose, CA, (408) 452-8700
Mosel-Vitelic, San Jose, CA, (408) 433-0952
National Semiconductor Corp, Santa Clara, CA, (800) 272-9959
Oki Semiconductor, Sunnyvale, CA, (408) 720-8940
Texas Instruments, Denver, CO, (800) 477-8924
Vocal Inc, Palo Alto, CA, (415) 323-5613




Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.