Ubiquitous sensors meet the most natural interface--speech—Part I
Bernie Brafman, Sensory, Inc. - January 8, 2013
These popular mobile applications, along with automotive infotainment systems, Bluetooth headsets and hands-free kits, and entertainment and home automation remote controls, all require a button press to engage the voice user interface. Yet the microphone is already present and capable of capturing voice, which could itself be used to begin the interaction, much like the “Computer” command in Star Trek.
Historically, there have been practical reasons for the button press requirement. First, speech recognition technology is sensitive to background noise, and recognizing “voice triggers” in real-world environments has not yielded acceptable accuracy. As a result, close-talking microphones such as headsets were required. Second, the computation needed for “always on” continuous listening for voice triggers can rapidly consume battery life in mobile devices. These limitations have made for unacceptable real-world performance. However, the compelling value of truly hands-free operation, namely safety and convenience, continues to spur the search for a reliable and robust solution.
Fortunately, just such a solution has arrived.
The technology
During decades of embedded speech recognition innovation, users have wanted reliable, speaker-independent voice triggers that make their devices truly hands-free. Recently, experience in real-world speech recognition, combined with a breakthrough in handling noisy conditions, has allowed the development of hands-free voice control with fast, highly accurate, reliable, and noise-robust voice triggers for a wide variety of consumer electronics.
Three components led to the innovation. The first is keyword spotting, whereby a phrase can be recognized without the customary preceding and following silence. This is a fundamental part of the noise robustness and reliability. Extensive experience with keyword spotting was developed over years of producing top-selling consumer electronics products. For example, in a mobile phone, the user might be able to say, “Take a picture,” to activate the camera. Keyword spotting gives the phone the ability to recognize this phrase when said as “I want to take a picture” or “please take a picture” or “hey watch this, I can just say take a picture and it works.” In short, voice triggers can be recognized even when embedded in a sentence, with no pause or silence before or after the phrase.
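The behavior described above can be illustrated with a toy example. The sketch below works on text transcripts rather than audio, so it is not acoustic keyword spotting and not the engine discussed in this article; it only shows the idea of finding a trigger phrase anywhere inside a longer utterance. The function name `spot_trigger` and the fourth test utterance are assumptions made for illustration.

```python
import re


def spot_trigger(transcript: str, trigger: str) -> bool:
    """Return True if the trigger words occur consecutively anywhere in the transcript.

    Text-level stand-in for keyword spotting: no leading or trailing
    "silence" (here, sentence boundary) is required around the phrase.
    """
    # Lowercase and split into words, dropping punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    target = re.findall(r"[a-z']+", trigger.lower())
    n = len(target)
    if n == 0:
        return False
    # Slide a window of n words across the utterance.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


utterances = [
    "Take a picture",
    "I want to take a picture",
    "Hey watch this, I can just say take a picture and it works.",
    "Take my picture",  # hypothetical near miss, not from the article
]
results = [spot_trigger(u, "take a picture") for u in utterances]
for utterance, hit in zip(utterances, results):
    print(f"{hit!s:5}  {utterance}")
```

A real keyword spotter makes the same kind of decision on a continuous audio stream, where word boundaries are not given and noise is present, which is why the problem is much harder than this text version suggests.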
The second component represents a breakthrough in the handling of noisy conditions. Noise reduction techniques that are very effective in telecommunications have proven ineffective for speech recognition, and can even degrade its accuracy, because of a mismatch between the spectral characteristics of the incoming audio and those of the audio from which the speech recognition engine’s underlying models are built. Figure 1 shows a block diagram of an embedded speech recognition engine, including the Acoustic Model, a database of phonetic information used to match the user's speech to the vocabulary to be recognized. Also shown is a Speech and Noise database, which is part of the breakthrough developed to solve the challenges associated with voice triggers in noise. Technologists now have a way to use statistical methods to encode noise and include it in the recognition process in a way that dramatically improves not only noise robustness but far-field recognition as well.
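The article does not say how noise is encoded or how the Speech and Noise database is built. As a loosely related illustration only, the sketch below shows one common way to prepare audio in which speech and noise are combined at a known level: scaling a noise signal so that it mixes with speech at a chosen signal-to-noise ratio (SNR). This is ordinary data preparation, not the statistical method described above, and the names `mix_at_snr` and `measured_snr_db` and the synthetic signals are assumptions.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must not be silent")
    # SNR(dB) = 10 * log10(speech_power / noise_power)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture, given the clean speech that went into it."""
    residual = [m - s for m, s in zip(mixed, speech)]
    speech_energy = sum(x * x for x in speech)
    noise_energy = sum(x * x for x in residual)
    return 10 * math.log10(speech_energy / noise_energy)


# Synthetic stand-ins: a 440 Hz tone for "speech", seeded random values for "noise".
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
print(f"measured SNR: {measured_snr_db(speech, mixed):.2f} dB")
```

Mixing at several SNR values gives a range of conditions against which a recognizer can be built or evaluated, which is one way to reduce the mismatch between the audio a device hears in the field and the audio behind its models.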
The third component is experience. With more than 100 million products utilizing embedded speech technology shipped to consumers, significant expertise in product and voice user interface design is available. Active engagement with customers throughout the design process leads to a user-friendly experience and the highest possible accuracy. For voice triggers, this includes careful selection of the trigger phrase and a detailed process for collecting speech data from the target demographic. That data is used to optimize trigger performance to customer specifications, including the tradeoff between false accepts and false rejects.
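The tradeoff between false accepts and false rejects can be made concrete with a small sketch. It assumes the recognizer emits a confidence score per utterance and fires when the score reaches a threshold; the function name `error_rates` and all scores are made up for illustration and do not come from any real product.

```python
def error_rates(trigger_scores, other_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at the given threshold.

    A detection fires when score >= threshold.
    trigger_scores: scores for utterances that DO contain the trigger phrase.
    other_scores:   scores for audio that does NOT contain it.
    """
    false_rejects = sum(1 for s in trigger_scores if s < threshold)
    false_accepts = sum(1 for s in other_scores if s >= threshold)
    return false_rejects / len(trigger_scores), false_accepts / len(other_scores)


# Hypothetical confidence scores, for illustration only.
trigger_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
other_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

sweep = {t: error_rates(trigger_scores, other_scores, t) for t in (0.3, 0.5, 0.7)}
for threshold, (frr, far) in sweep.items():
    print(f"threshold={threshold:.1f}  false rejects={frr:.0%}  false accepts={far:.0%}")
```

Raising the threshold lowers the false accept rate at the cost of more false rejects, and lowering it does the reverse. Choosing the operating point is a product decision: a device that wakes up unprompted annoys users in a different way than one that ignores them.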
Next: Applications and design considerations
About the Author
Bernard Brafman is Vice President of Business Development for Sensory, Inc., responsible for strategic business partnerships. He received his MSEE from Stanford University. He can be reached at email@example.com