Design Features, March 3, 1997
Stephen Kempainen, Technical Editor
Specialized chips, new technologies, and voice-activated user interfaces are making automatic speech recognition more accurate, more user-friendly, and less expensive than ever before.
Automatic speech recognition (ASR) is approaching the science-fiction capabilities of HAL, the infamous computer in 2001: A Space Odyssey. But, unlike HAL, ASR is something designers can use to make people's lives easier. New application-specific standard products, interactive error-recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial-phone owners to "talk" to machines. ASR affords such users a natural human interface to computers; telephone-call centers, such as those for airline-flight information; learning devices; and toys.
Inaccuracy and high costs plagued earlier attempts at ASR. For example, although automobile security is a natural application, what if you can't get into your car because the ASR system rejects your password? Worse, what if a car thief overhears your password?
The steady improvements in ASR have worked toward overcoming those problems. Today, speech-recognition technology ranges from the simple and low-cost to the complex and expensive. Categories of products include speaker-dependent, isolated-word-recognition (IWR); general-purpose, small-vocabulary, continuous-speech-recognition; and "natural-language" systems, which allow a user to phrase a response in a few different ways and still achieve recognition. More complex products are large-vocabulary-recognition (LVR) and speech-to-text systems (see box, "Speech recognition 101"). Now emerging from research labs are the most complex systems: speaker-independent, natural-language, continuous, LVR systems.
Application-specific standard products with good design tools support the less complex products and make possible such ASR tasks as speaker verification, keyword spotting, and small-vocabulary recognition (Table 1). DSP hardware supports more complex technology in speaker-dependent, almost-continuous-speech LVR systems.
The cost of implementing speech recognition depends on the complexity and capability of the product. At the low end, Sensory offers the $5 RSC-164, a general-purpose interactive-speech chip, which uses neural-network technology to preprocess the input utterance for simpler recognition. Sensory targets the device at consumer telecommunication products, personal digital assistants, both small and major home appliances, and electronic toys. For a complete system, you need to add only a microphone, a speaker, a preamp, and gain control. The chip provides speaker-independent recognition of 14 words per set (the words from which a recognizer can choose to match the utterance) or speaker-dependent recognition of 60 words per active set. A development kit is also available (Figure 1).
At the other end of the capability and price spectrum is Dialogic's $3500 Antares computer-telephony DSP board (Figure 2). The resource board uses either an ISA bus or a VMEbus and supports speech-recognition products on Windows NT, DOS, Unix, Solaris, and OS/2 operating systems. Four independent DSPs compute the algorithms an application needs. System developers can use the board with software for ASR and text to speech (TTS) for such applications as call-center processing, facsimile, and information hot lines.
With the Antares board, you can also choose from third-party vendors, such as Brooktrout, Lernout & Hauspie, and PureSpeech, for tools to add ASR, TTS, speech compression, echo cancellation, or other computationally intensive algorithms to a telephony application.
Vocabularies for computers
Each ASR system has an active vocabulary (the set of words from which the recognition engine tries to match the accumulated utterance) and a total vocabulary size (the sum of the words in all the possible sets that can be called from memory). Along with vocabulary size, a system's "recognition latency" (the allowable time to accurately recognize an utterance) determines the processing horsepower that a recognition engine requires.
An active vocabulary set ranges from approximately 14 words plus "none of the above," which the recognizer chooses when none of the 14 words is a good match, for a speaker-independent system to 30 to 60 words for a speaker-dependent system. When using a 4-MIPS processor, recognition latency is about 0.5 sec for a speaker-independent set and less than 1 sec for a speaker-dependent system. Processing-power requirements increase dramatically for LVR sets with thousands of words. Only vocabularies of a few thousand words approach real-time latencies even with Pentium-class processors.
A small active vocabulary limits a system's search range, providing advantages in latency and less search computation. A large total vocabulary enables a more versatile human interface but affects system-memory requirements. A system with a small active vocabulary for each prompt usually provides faster, more accurate results. Similar-sounding words in a vocabulary set cause recognition errors, but a unique sound for each word enhances the recognition engine's accuracy.
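The relationship between per-prompt active sets, the total vocabulary, and the "none of the above" choice can be sketched in a few lines of Python. This is only an illustration: the prompt names, word lists, and threshold are invented, and the `similarity` function uses text similarity from the standard library as a stand-in for a real acoustic match score.

```python
from difflib import SequenceMatcher

# Hypothetical total vocabulary, split into small active sets, one per prompt.
ACTIVE_SETS = {
    "main_menu": ["arrivals", "departures", "reservations", "operator"],
    "confirm": ["yes", "no"],
}

NONE_OF_THE_ABOVE = "none of the above"


def similarity(heard: str, word: str) -> float:
    """Toy stand-in for an acoustic match score in the range 0.0 to 1.0."""
    return SequenceMatcher(None, heard, word).ratio()


def recognize(heard: str, prompt: str, threshold: float = 0.6) -> str:
    """Search only the active set for this prompt; reject weak matches."""
    candidates = ACTIVE_SETS[prompt]
    best = max(candidates, key=lambda w: similarity(heard, w))
    if similarity(heard, best) < threshold:
        return NONE_OF_THE_ABOVE
    return best


def total_vocabulary_size() -> int:
    """Total vocabulary is the sum of the words in all callable sets."""
    return sum(len(words) for words in ACTIVE_SETS.values())
```

Because `recognize` compares the input against only the handful of words tied to the current prompt, the search stays small even as more sets are added to the total vocabulary.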
In addition to speaker-dependent and -independent systems, another class of ASR is "speaker-adaptive": systems that allow each user to train the system by reading acoustically significant text that prompts the system to adjust the acoustic model to match the speaker's voice. These high-end systems allow users to interactively increase the system's vocabulary and perform corrections. One such system, Philips SpeechMagic, has an active vocabulary of 64,000 words. Its recognition engine requires at least a 66-MHz, 486-class processor, along with 16 Mbytes of RAM and 500 Mbytes of disk space to hold the vocabulary. The system attempts no real-time recognition; instead, a user dictates a session and then activates the engine to produce a text file. The system also requires two slots: one for the ISA-compatible coprocessor and the other for a SoundBlaster-compatible board.
Speaker independence gives a product the advantage of working immediately, whereas speaker dependence requires a training cycle before you can use it. However, the training cycle itself can yield the benefit of language independence, because trainers can use whatever sounds they wish. On the other hand, only one user can employ a speaker-dependent system unless it allows multiple users to train, store, and call up profiles. Another advantage of speaker-independent systems is that they come with a vocabulary, thus eliminating the need for writing speaker-dependent vocabularies. In speaker-adaptive systems, meanwhile, the user corrects inaccurately transcribed words and can add words to the vocabulary.
Measuring accuracy
Comparing ASR technologies for accuracy is difficult, and no established standards exist. One measure of accuracy is how gracefully a system recovers from its inability to correctly recognize an utterance. In a normal conversation, how many times do you ask a speaker to repeat a word because of background noise, an unfamiliar pronunciation or word, or a speaker's accent? How often must you ask a speaker to repeat a whole phrase because the word makes no sense to you in the context of the sentence? Now, imagine a machine trying to understand the idiosyncrasies of normal conversation.
To combat these problems, sophisticated ASR systems use synthesized, interactive "prompt-and-reply" techniques, which critically impact a user's successful completion of a transaction without talking to a human. If the prompts are quick and clear, you can tolerate the process as long as you complete the task.
More quantitative measures of accuracy are also available. For example, Oki Semiconductor offers the following equations on its data sheet for its Voice Recognition Processor (Figure 4). The equations measure the speech-recognition accuracy of IWR systems:
$$\mathrm{ACCURACY} = 100\% - E_{RATE},$$
and
$$E_{RATE} = E_{SUB} + \tfrac{1}{2}E_{REJ},$$
where $E_{RATE}$ is the error rate, $E_{SUB}$ is the substitution error, and $E_{REJ}$ is the sum of word rejections. Of these errors, the substitution error (such as when a speaker says "five," and the system recognizes, or substitutes, "nine") is the most crucial error because it is the most difficult from which to recover. Three factors affect $E_{REJ}$: $E_{GAP}$, an error caused by a word's being spoken before the recognition engine is ready for the next word; $E_{TIME}$, an error caused by words that are too long; and $E_{SPU}$, an error caused by spurious noise or by the wrong word's being spoken.
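As a quick numeric check of the accuracy equations, the short Python sketch below evaluates them for invented error percentages. It assumes, beyond what the text states, that the rejection error is simply the sum of the gap, time, and spurious components, each expressed as a percentage of words spoken.

```python
def rejection_error(e_gap: float, e_time: float, e_spu: float) -> float:
    """Assumption: E_REJ is the sum of the three rejection causes, in percent."""
    return e_gap + e_time + e_spu


def error_rate(e_sub: float, e_rej: float) -> float:
    """E_RATE = E_SUB + 1/2 * E_REJ, all in percent."""
    return e_sub + 0.5 * e_rej


def accuracy(e_sub: float, e_rej: float) -> float:
    """ACCURACY = 100% - E_RATE."""
    return 100.0 - error_rate(e_sub, e_rej)


# Illustrative numbers only: 1% substitutions, 2% total rejections.
e_rej = rejection_error(e_gap=1.0, e_time=0.5, e_spu=0.5)
print(accuracy(e_sub=1.0, e_rej=e_rej))
```

The half weight on rejections reflects the point made above: a rejection costs the user a repeat, whereas a substitution is harder to recover from and counts in full.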
The human factor
In addition to accuracy, another design consideration is human-factor engineering to reduce the number of prompt-and-reply iterations for a successful product. Flexible speech synthesis is important. Phrases such as, "It is noisy; please speak louder," and "Please wait for the prompt," help users adjust to low-cost recognition systems. Pointed prompts such as these typically elicit a well-defined word or words from the user. You must also field-test your design's prompts to learn whether a novice would understand the prompt. Plan on iterations to the design to fine-tune the prompt-response cycle.
Although simple prompts are desirable, prompts that are too simple can be obnoxious or insulting. On the other hand, prompts that are too open-ended or conversational will elicit answers having words lacking from the system's vocabulary. In addition, nonspecific prompts may elicit incorrect answers. Accuracy ratings improve when a user gives one phonetically distinct answer from among limited possibilities. For example, a user may have to limit using the articles "the," "an," and "a," and answer to a prompt as clearly either plural or singular.
Another design trade-off is whether a product needs to listen constantly or has the simpler option of "windowed" listening. The advantage of the windowed approach is power savings, but this type of listening also limits what a user can do: If the response window is closed, a user cannot interrupt a prompt and proceed to the next step, which constant listening allows. On the other hand, constant listening may cause the product to falsely recognize a user's input because of a noisy environment. For example, your car unlocks because of surrounding noise. To alleviate this problem, a system would work by having a user employ an attention-getting phrase, such as "car doors," and then provide a password.
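The attention-phrase approach amounts to a small state machine. The following Python sketch assumes the recognizer has already turned each utterance into text; the class name, the password, and the two-utterance window are hypothetical choices made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # listening only for the attention phrase
    ARMED = auto()      # attention phrase heard; password window is open
    UNLOCKED = auto()


class DoorLock:
    """Hypothetical two-step voice lock: attention phrase, then password."""

    def __init__(self, attention="car doors", password="open sesame", window=2):
        self.attention = attention
        self.password = password
        self.window = window      # utterances allowed while armed
        self.remaining = 0
        self.state = State.IDLE

    def hear(self, utterance: str) -> State:
        """Feed one already-recognized utterance into the state machine."""
        if self.state is State.IDLE:
            if utterance == self.attention:
                self.state = State.ARMED
                self.remaining = self.window
        elif self.state is State.ARMED:
            if utterance == self.password:
                self.state = State.UNLOCKED
            else:
                self.remaining -= 1
                if self.remaining == 0:
                    self.state = State.IDLE  # window closed; must re-arm
        return self.state
```

In this sketch, a password spoken without the attention phrase is ignored, and stray noise while armed simply runs out the window and returns the lock to idle.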
Software recognition engines and the tools to design them into your application are now available. The tools ease the preliminary design work. Unfortunately, the software engines are too expensive for low-cost consumer products. Instead, the target markets are high-end telephony and dictation. Software-development tools include IBM's VoiceType Developers Toolkit version 3.0, which is a free download from the Internet, although you must pay for support, and Nuance's Toolkit (Table 2).
The $5000 Nuance Conversational Transactions Product Toolkit combines client and server recognition engines for large-scale telephony. It interprets natural-language utterances and determines an action. The product comprises the recognition-client, recognition-server, speech-action-library, and development-tool-kit software modules.
The recognition clients act as the front end and manage telephony interaction with users, and the recognition server handles speech recognition and meaning extraction. The speech-action library handles semantic links to business applications, and the tool kit lets developers build and maintain speech-action libraries and define templates for understanding the meanings of phrases. Using the Toolkit, you can design large-scale telephony stock-quotation systems and dictation systems with specialized vocabularies for doctors and lawyers.
Burgeoning biometrics
In addition to these high-end applications, a growing application for ASR is biometric security, which includes fingerprint and retina-scan systems. ASR is less expensive to implement than are other biometric systems, which use expensive graphics-recognition systems, such as scanners and cameras, to scan users' retinas and fingerprints. In contrast, for less than $10, you can build a biometric-security system using Sensory's Voice Password chip and adding a battery, a speaker, a microphone, external memory, a crystal, and an input amplifier. The speaker-dependent chip authenticates a speaker's voice and password, has a 16-word vocabulary, and allows as many as four passwords per entry.
The chip offers a security threshold that you can set high or low for matching the trained template to the password utterance. Setting the threshold low lets a less accurate match gain entry. A false rejection occurs when the system denies access to an authorized user with a correct password, and a false acceptance occurs when the system grants access to an unauthorized user who knows the password. Decreasing the false-acceptance rate increases the false-rejection rate and vice versa (Figure 3). Also, increasing the number of passwords per user decreases both false-acceptance and false-rejection rates.
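The trade-off between false acceptance and false rejection can be illustrated with a small Python sketch. The match scores below are made up, and the rule that a score at or above the threshold is accepted is an assumption for the example, not a description of how the Sensory chip works internally.

```python
def rates(genuine_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A score at or above the threshold is accepted.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))


# Made-up match scores (0 to 1) for illustration only.
GENUINE = [0.91, 0.84, 0.78, 0.66, 0.95]    # authorized user, correct password
IMPOSTOR = [0.35, 0.52, 0.70, 0.41, 0.60]   # unauthorized user, same password

for threshold in (0.5, 0.65, 0.8):
    frr, far = rates(GENUINE, IMPOSTOR, threshold)
    print(f"threshold={threshold:.2f}  FRR={frr:.0%}  FAR={far:.0%}")
```

Raising the threshold in this toy data drives false acceptances toward zero while false rejections climb, which is the behavior the threshold setting trades on.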
| Table 1--Representative speech-recognition ICs |
| Company | Product | Type | Active/total vocabulary | On-chip speech synthesis | Development board | Price (quantity) | Comments |
| Oki Semiconductor | MSM6679 Voice Recognition Processor | Speaker independent/ speaker dependent | Speaker independent: 25 words on chip; speaker dependent: 61-word set, SRAM based | Yes | Voice Recognition Processor Toolkit, Windows interface to evaluation board ($995) | $18 to $20 (10,000) | Slave-mode device, audio record and playback, memory controller and interface, PWM sound output |
| Oki Semiconductor | MSM6679A | Speaker independent/ speaker dependent | Speaker independent: 40 words on chip | Yes | Voice Recognition Processor Toolkit, Windows interface to evaluation board ($995) | $13 to $14 (10,000) | Smaller dice and algorithms than the MSM6679, additional on-chip memory, speaker dependent uses external memory |
| Ricoh | RL5S840 | Speaker independent | 60 words | No | RVZ2000 ($3000) | $10 (10,000) | IWR, good noise rejection |
| Sensory | RSC-164 | Speaker independent/ speaker dependent | Speaker independent: 14-word sets; speaker dependent: 60-word sets | Yes | RSC board ($3000) | $5 (50,000) | Audio record and playback, output amp on chip, neural-network recognition |
| Sensory | Voice Password | Speaker dependent | 16 passwords | Yes | N/A | $6 (50,000) | Configurable, nonprogrammable, limited synthesis, stand-alone and external-host modes |
| Table 2--Representative ASR-system software-development products | ||||||
| Company | Product | Type | Vocabulary size | Operating system | Price | Comments |
| AT&T | Watson Advanced Speech Application Platform | Speaker independent | 1000 words | Windows 95, NT | $395 | Software-based speech platform with a software developers kit using Visual Basic, Visual C++, and Borland Delphi; Microsoft speech-application-programming-interface- and telephony-application-programming-interface-compliant; needs SoundBlaster-compatible card |
| Dialogic | Antares DSP board, software developers kit | Speaker independent/ speaker dependent | Software-vendor dependent | Spox real-time DSP | $3500 | Computer-telephony system that supports multiple algorithms and operating systems for automatic speech recognition and text to speech; includes four independent Texas Instruments DSPs with multiple memory options; Signal Computing bus allows standard access to call-processing products; drivers for DOS, Unix, OS/2, and Windows NT |
| Dragon | Speech Tool | Speaker independent | Small | Windows | $295 | Creates small vocabularies for speech-enabled software applications; produces add-on vocabularies for DragonDictate for Windows |
| Dragon | DragonXTools | Speaker independent | As many as 60,000 words | Windows | $295 | Adds ASR to custom controls and programs using Visual Basic and other tools supporting VBX format; dictation applications require DragonDictate for Windows |
| Entropic Cambridge Research Labs | HTK Hidden Markov model tool kit | Speaker independent/ speaker dependent | Limited by memory size | Unix (Sun, SGI, HP700, Alpha, IBM) | $7020 | Training, development, testing, and prototyping of ASR software; software-only system requires no boards; enables building a recognizer in any language |
| IBM | VoiceType Developer Toolkit | Speaker independent | Small | Windows | Free Web download | Recognizes phrases spoken at a natural rate; contains set of 14 C function calls to control ASR with a C/C++ compiler |
| Lernout & Hauspie | asr1500/M ASR software-development kit for Windows | Speaker independent | Limited only by memory size | Windows | $790 | Development kit for C/C++, Visual Basic, and other programming languages to incorporate ASR into Windows applications; modes include isolated word, keyword spotting, continuous digits, continuous speech; supports seven languages; application-programming-interface compatibility; one runtime engine plus 4 hours support |
| Lernout & Hauspie | asr1500/T | Speaker independent | Limited only by memory size | Unix (SCO Unixware), OS/2, DOS | $790 | ASR application-programming interface for computer-telephony integration; modes include isolated word, keyword spotting, continuous digits, continuous speech; system requires Dialogic Antares board; supports seven languages; one runtime engine plus 4 hours support |
| Nuance Communications | Nuance Toolkit | Speaker independent | 160,000 total plus adaptive; 15,000 active | Solaris, Pentium Pro, SCO, AIX | $5000 | Includes one client and one server runtime version to develop voice-processing applications; typically for large-scale telephony applications; developer must have telephony termination board to serve as front end |
| Philips | SpeechMagic | Speaker independent/ speaker dependent | 64,000 active | Windows 3.1x | $3500 to $4500 per author | Uses Philips ISA-compatible DSP board and SoundBlaster-compatible board; universal modular architecture gives developers the ability to integrate ASR into their applications; continuous-ASR capability; medical- and legal-language models |
| PureSpeech | ReCite software-development kit | Speaker independent | Pro: 2000 words, Lite: 100 words total, 16 words active | SCO Unix, Sun (Host), Windows NT | $7500 | Software-development kit for Dialogic Antares DSP board to add ASR to telephony applications (runtime configuration); based on phonetics; continuous-digits vocabulary is zero to nine, "oh," "cancel," "quit," "help," "yes," and "no"; two runtime licenses plus support |
| Speech Solutions | Voice Tools | Speaker independent | 30,000 total, fewer active | Windows 95, NT | $299 | Developers using Visual Basic and C++ use ActiveX custom controls that encapsulate the IBM VoiceType Dictation Engine application-programming-interface set; VoiceType Dictation runtime software comes with Voice Tools |
| Speech Systems | VoiceMatch | Speaker independent | Finite grammar, 2000 active | Windows | $995 | Application-programming interfaces include OCX, C library; enable continuous ASR for PCs and wearable computers running LynxOS; one runtime license; 90 days free support |
| Looking Ahead |
| Designers foresee a day when a computer will be able to translate a conversation between speakers of two different languages in real time. Before that day comes, however, there are still problems to overcome. One problem is background noise, an obstacle for both speaker-independent and -dependent systems. Implementing front-end noise removal, unless you limit it to a specific noise profile, can only handle the global statistics of speech. However, research is progressing in digital noise compensation and noise-tolerant pattern-matching techniques.
Another problem is that language models depend heavily on tasks, such as law or medicine. Designers are working to incorporate explicit grammar knowledge to make these models less task-dependent. In addition, transcribing spontaneous casual speech is error-prone, and the reasons are unclear. But the errors probably stem from poor articulation, highly variable speaking rate, hesitations, and false starts. |
| Speech recognition 101 |
| Statistical pattern recognition is the basis of all speech-to-text products. However, the pattern model varies dramatically, depending on the scope of the recognition task. When a vocabulary set comprises only numbers and a few commands, an acoustic model is usually adequate. However, a continuous-natural-speech-to-text dictation system uses complex acoustic and language models to match speech utterances to existing patterns. The system first acquires an input word or phrase and compares it to a library of stored words or phrases. The system then calculates the most probable match to find the result. The size of the library, or vocabulary, determines the complexity of the algorithm and the processing power needed to identify the correct match.
Automatic-speech-recognition (ASR) systems are either speaker-dependent or -independent. Before using a speaker-dependent system, a user must orient the system to the user's articulation. Speaker-independent systems, on the other hand, allow for similar word utterances by more than one speaker. Speaker-independent systems are more challenging to design because dialect, accent, age, sex, noise, and other variations change how speakers utter the same word. Thus, speaker-independent systems require larger memories, using as much as 1500 bytes per word vs 50 bytes per word for a speaker-dependent system.
ASR systems can be isolated-word-recognition (IWR) systems, keyword spotters, or continuous-natural-speech recognizers. All recognizers include a front-end speech processor, a coding processor, and a recognition phase. IWR products work only on single-word utterances. They need not find end-of-word breaks. Keyword spotters, on the other hand, must select a pattern from a series of utterances. Continuous-natural-speech recognizers identify ends of words, sentence structure, and all the other syntax of natural speech. Most reasonably priced dictation programs simplify continuous speech by requiring the speaker to pause momentarily between words.
One example of a typical complex ASR is the Cambridge University (Cambridge, UK) HTK large-vocabulary-recognition (LVR), speaker-independent system. It processes speech, comprising phonemes, or fundamental sounds, into a digitized format. It then searches the previously stored patterns of formatted phonemes for a match to the accumulated sequence of phonemes.
The front-end speech processor converts an unknown speech waveform to a sequence of acoustic vectors. Speech is a slowly changing waveform, and each vector is a small portion of an utterance, typically 10 msec. The average 10-word utterance can take about 3 sec.
Therefore, the sequence resulting from this utterance is 300 vectors long, or an average of 30 vectors per spoken word. The front-end block must extract and digitize all the necessary acoustic information to transcode the utterance. The digitized form results from processes such as discrete cosine transform, linear prediction, and Fourier analysis. The front-end waveform digitization must complement the subsequent pattern-matching algorithm.
The coding-processor stage comprises both an acoustic model and a language model. The acoustic model provides a method for determining the probability of a word, given a sequence of phonemes. The language model presents probabilities of the word fitting into a context of words that make sense.
In the HTK acoustic model, a statistical, "hidden Markov" model (HMM) represents each phoneme. A sequence of HMMs represents each word in a vocabulary. In principle, the model transforms each utterance to a sequence of HMMs and compares them to the vocabulary to find the most likely match. In reality, however, each phoneme varies because of the context of continuous speech. The HMM's accurate representation of the phoneme depends on the contextual information.
The HMM is a state-transition model whose simplest form comprises three states for entry, exit, and loop. The accuracy of the state model depends on how well the state sequence corresponds to the phonemes when transitioning through a sequence of sounds. In practice, only the observed phoneme is known, and the state transitions are "hidden."
US English requires approximately 45 phonemes. The HTK acoustic model links phonemes in "triphones." Triphones place phonemes in the context of the preceding and following neighbors. This linking of phonemes accounts for the fact that contextual effects cause variations in the way sounds are produced. Because US English uses 45 phonemes, it can have $45^3 = 91{,}125$ triphones.
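The vector and triphone counts quoted above follow from simple arithmetic, which this short Python check reproduces. The variable names are chosen here for illustration; the input figures are the ones given in the text.

```python
FRAME_MS = 10          # each acoustic vector covers about 10 msec
UTTERANCE_MS = 3000    # a 10-word utterance of about 3 sec
WORDS = 10
PHONEMES = 45          # approximate US English phoneme count

vectors = UTTERANCE_MS // FRAME_MS       # acoustic vectors per utterance
vectors_per_word = vectors // WORDS      # average vectors per spoken word
triphones = PHONEMES ** 3                # every (left, center, right) combination

print(vectors, vectors_per_word, triphones)
```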
In practice, only about 60,000 triphones are needed to describe all possible US English phonetics. Triphones overlap each other and cross word boundaries to produce better acoustic models than simple phonemes. The HTK model can link HMMs to form composite HMMs, which, in turn, link to form words that match those in the vocabulary. Linking HMMs that cross word boundaries leads to phrases that match those in the language model.
The language model lets the recognition engine estimate the probability of a certain word's being recognized, given the preceding words. The premise is that syntax, semantics, and pragmatics give higher probabilities to certain phrase utterances. For instance, you create a language model for law tasks by examining legal documents and calculating the probability of the use of certain phrases. In a legal-language model, a higher probability exists that the phrase "waive the right" will be spoken more frequently than "wave the write." Task-oriented text thus eliminates the need for formal grammar rules in the language model.
In language models, an N-word-gram stores the likelihood of a word string's usage. Bigrams and trigrams (two- and three-word strings, respectively) dominate current language models. Using short N-grams has obvious deficiencies, such as the failure to exploit longer commonly used phrases and subject-verb agreement. However, a vocabulary of W words has $W^3$ possible trigrams. Computational requirements grow exponentially as the number of words an N-gram uses increases.
The decoder phase compares the acquired speech waveform data to the data from the acoustic and language models. To decode, the HTK system pursues all hypotheses in parallel; other systems may pursue a promising hypothesis to its end. The pursuit of parallel hypotheses eliminates the least likely word sequences from the acoustic and language models. This pruning simplifies the computational effort, as long as it avoids pruning the correct recognition too early in the decoding phase. The search finds the most probable match from the vocabulary (or "grammar," in LVR systems) for the utterance. A recognition system's capability depends on the vocabulary size and the computational power available to search for probable matches. For more on speech recognition, see References 1, 2, and 3.
|
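The sidebar's legal-language example can be made concrete with a toy bigram model. The Python sketch below is an illustration only, not the HTK method: the four-sentence corpus is invented, and add-one smoothing is an assumption introduced here so that unseen word pairs receive a small nonzero probability.

```python
from collections import Counter
import math

# Tiny made-up "legal" corpus; a real model would use far more text.
CORPUS = [
    "you may waive the right to counsel",
    "the defendant chose to waive the right",
    "parties waive the right to appeal",
    "please write the name of the witness",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()   # <s> marks the sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(phrase, unigrams, bigrams):
    """Log probability of a phrase under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    words = ["<s>"] + phrase.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) /
                          (unigrams[prev] + vocab_size))
    return total


unigrams, bigrams = train_bigrams(CORPUS)
print(log_prob("waive the right", unigrams, bigrams))
print(log_prob("wave the write", unigrams, bigrams))
```

With this corpus, "waive the right" scores higher than the acoustically similar "wave the write," which is how a task-specific language model helps a decoder choose between words that sound alike.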
| For more information... |
| Analog Devices, Norwood, MA, (800) 262-5643, www.analog.com | AT&T Advanced Speech Products Group, Madison, WI, (800) 592-8766, www.att.com/aspg | Brooktrout Technology, Needham, MA, (617) 449-4100, www.techspk.com | Dialogic, Parsippany, NJ, (201) 993-3000, www.dialogic.com |
| Dragon Systems, Newton, MA, (800) 825-5897, www.dragonsys.com | Entropic Cambridge Research Labs, Washington, DC, (202) 547-1420, www.entropic.com | IBM Speech Products, Yorktown Heights, NY, (800) 825-5263, www.software.ibm.com/is/voicetype/ | Kurzweil Applied Intelligence, Waltham, MA, (617) 883-5151, www.kurzweil.com |
| Lernout & Hauspie, Burlington, MA, (617) 238-0960, www.lhs.com | Nuance Communications, Menlo Park, CA, (415) 462-8200, www.nuance.com | Oki Semiconductor, Sunnyvale, CA, (408) 720-1900, www.oki.com | Philips Speech Processing, Woodbury, NY, (516) 921-9310, www.speech.be.philips.com |
| PureSpeech, Cambridge, MA, (617) 441-0000, www.speech.com | Ricoh, San Jose, CA, (408) 432-8800 | Sensory, Sunnyvale, CA, (408) 744-1299, www.SensoryInc.com | Speech Solutions, Philadelphia, PA, (800) 773-3247, www.speechsolutions.com |
| Speech Systems, Boulder, CO, (303) 938-1110, www.speechsys.com | Texas Instruments, Dallas, TX, (800) 477-8924, ext 4500, www.ti.com | Voice Control Systems, Cambridge, MA, (617) 494-0100 |
|

You can reach Technical Editor Stephen Kempainen at (415) 643-1760, fax (415) 643-9513, e-mail ednkempainen@worldnet.att.net.