Ubiquitous sensors meet the most natural interface: speech (Part I)

Bernie Brafman, Sensory, Inc. - January 08, 2013

Advances in sensor technology have surrounded us with devices that are aware of our activities and environment, enabling systems to deliver convenience and comfort.  Among the most ubiquitous of these sensors is the microphone.  Originally intended to enhance person-to-person communication, today's microphones are an increasingly popular user interface, as voice-based personal assistants such as Apple's Siri, Google's Voice Actions, Nuance's Dragon Go and Samsung's S Voice have become widespread.

Yet these popular mobile applications, along with automotive infotainment systems, Bluetooth headsets and hands-free kits, and entertainment and home-automation remote controls, all require a button press to engage the voice user interface.  The microphone is already present and capable of capturing voice, so speech itself could begin the interaction, much like the "Computer" command in Star Trek.

Historically, there have been practical reasons for the button-press requirement.  First, speech recognition technology is sensitive to background noise, and recognizing "voice triggers" in real-world environments has not yielded acceptable accuracy; as a result, close-talking microphones such as headsets were required.  Second, the computational demands of "always on" continuous listening can rapidly drain the battery in mobile devices.  These limitations have made for unacceptable real-world performance.  However, the compelling value of truly hands-free operation (safety and convenience) continues to spur the search for a reliable and robust solution.

Fortunately, just such a solution has arrived.

The technology

Throughout decades of embedded speech recognition innovation, users have asked for reliable, speaker-independent voice triggers that make their devices truly hands free.  Recently, long experience with real-world speech recognition, combined with a breakthrough in handling noisy conditions, enabled hands-free voice control with fast, accurate, reliable, and noise-robust voice triggers for a wide variety of consumer electronics.

Three components led to this innovation.  The first is keyword spotting, whereby a phrase can be recognized without the customary preceding and following silence; this is fundamental to noise robustness and reliability.  Extensive experience with keyword spotting was developed over years of producing top-selling consumer electronics products.  For example, on a mobile phone the user might say, "Take a picture," to activate the camera.  Keyword spotting lets the phone recognize this phrase when said as "I want to take a picture" or "please take a picture" or "hey watch this, I can just say take a picture and it works."  In short, keyword spotting allows voice triggers to be recognized when embedded in a sentence, without pauses or silence before or after, and it contributes to overall noise robustness.
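In a real recognizer, keyword spotting operates on acoustic frames with phonetic models, but the core idea (scanning for the trigger phrase anywhere within a longer utterance) can be sketched on a stream of already-recognized words.  This toy function and its name are illustrative only, not the engine's actual API:

```python
def spot_keyword(utterance, keyword):
    """Return the word index where the keyword phrase begins in the
    utterance, or -1 if it is not present. Matching is case-insensitive
    and ignores trailing punctuation, mimicking how a spotter accepts
    the trigger regardless of what surrounds it."""
    kw = keyword.lower().split()
    words = [w.lower().strip(",.!?") for w in utterance.split()]
    for i in range(len(words) - len(kw) + 1):
        if words[i:i + len(kw)] == kw:
            return i
    return -1

# The trigger fires whether spoken alone or embedded in a sentence:
spot_keyword("Take a picture", "take a picture")
spot_keyword("hey watch this, I can just say take a picture and it works",
             "take a picture")
```

A frame-level spotter additionally scores competing "filler" models against the trigger model, so that background speech that merely resembles the phrase is rejected.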

The second component is a breakthrough in the handling of noisy conditions.  Noise-reduction techniques that are very effective in telecommunications have proven ineffective for recognition, and can even degrade accuracy, because of a mismatch between the spectral characteristics of the incoming audio and the audio from which the speech recognizer's underlying models were built.  Figure 1 shows a block diagram of an embedded speech recognition engine, including the Acoustic Model, a database of phonetic information used to match the user's speech to the vocabulary to be recognized.  Also shown is a Speech and Noise database, part of the breakthrough developed to solve the challenges of voice triggers in noise.  Technologists now have a way to use statistical methods to encode and include noise in the recognition process, dramatically improving not only noise robustness but far-field recognition as well.
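The article does not detail how the Speech and Noise database is built, but one common way to make acoustic models noise-robust is to train them on speech mixed with representative noise at controlled signal-to-noise ratios, so the models match the audio they will see in the field.  A minimal sketch of such mixing, assuming speech and noise are NumPy arrays of audio samples (the function name is hypothetical):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise and add it to the speech so the mixture has the
    requested signal-to-noise ratio in dB. Used to synthesize noisy
    training examples from clean recordings."""
    noise = noise[:len(speech)]          # trim noise to the speech length
    p_speech = np.mean(speech ** 2)      # average speech power
    p_noise = np.mean(noise ** 2)        # average noise power
    # Choose scale so 10*log10(p_speech / (scale**2 * p_noise)) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Training on mixtures at several SNRs (e.g. 0 to 20 dB) exposes the models to the spectral characteristics of noisy input, which is one interpretation of "encoding" noise statistically in the recognition process.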


The third component is experience.   With more than 100 million products shipped to consumers using embedded speech technology, significant expertise in product and voice user interface design is available.  Active engagement with customers throughout the design process leads to a user-friendly experience and the highest possible accuracy.  For voice triggers, this includes careful selection of the trigger phrase and a detailed process for collecting speech data from the target demographic, which is used to tune trigger performance to customer specifications, including the tradeoff between false accepts and false rejects.


Next:  Applications and design considerations


About the Author

Bernard Brafman is Vice President of Business Development for Sensory, Inc., responsible for strategic business partnerships.  He received his MSEE from Stanford University.  He can be reached at bbrafman@sensoryinc.com.

