Ubiquitous sensors meet the most natural interface—speech—Part II
Applications and Design Considerations
Figure 1 illustrates how voice activation complements the other stages of speech recognition, showing the steps of a multistage process for creating a truly hands-free voice user interface.
Figure 1 – The role of voice activation
However, this important voice-activation step requires a few critical characteristics:
Extremely fast response time. Since voice activation effectively competes with a button press, it must respond at least as quickly. Because the hands-free system uses a probabilistic approach, it can fire without waiting for the recognizer to determine whether the word has even finished. Slow response times lead users to speak before the Step 2 recognizer is ready to listen, which is a major cause of failure.
Low power consumption. This technology can deliver "always listening" wake-up triggers with as few as 7 MIPS, and current draw in the 1-10 mA range on today's devices.
High accuracy, even in low-SNR environments. This means several things:
- Works in high noise: TrulyHandsfree Voice Control performs virtually flawlessly in extremely loud environments, including with music playing in the background, in a car, or even outdoors.
- Works without a microphone in close proximity: it is responsive even at distances of 20 feet (in a relatively quiet environment) and at arm's length in noise. This is critical because many VUI-based applications of the future will become commonplace in a wide variety of consumer electronics devices, and users won't want to get up and walk over to their devices to control them.
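The fast-response requirement above can be sketched in code. The following is a minimal illustration (not a real SDK) of an "always listening" gate that accumulates per-frame evidence and fires as soon as its confidence crosses a threshold, rather than waiting for the whole word to end; `score_frame`, the decay constant, and the threshold are all hypothetical stand-ins for a real acoustic model and tuned parameters.

```python
TRIGGER_THRESHOLD = 0.95   # cumulative confidence needed to fire (illustrative)
DECAY = 0.90               # forget old evidence so background noise doesn't add up

def score_frame(frame) -> float:
    """Hypothetical per-frame likelihood (0.0-1.0) that the wake phrase is
    being spoken; a real system would run a small phrase-spotting model here.
    For this sketch, the input is treated as a pre-computed score."""
    return frame

def detect_trigger(frames):
    """Return the index of the first frame at which the trigger fires, or
    None. Firing mid-word, before the phrase is complete, is what keeps
    latency competitive with a button press."""
    confidence = 0.0
    for i, frame in enumerate(frames):
        # Exponential moving average of frame scores.
        confidence = confidence * DECAY + (1 - DECAY) * score_frame(frame)
        if confidence >= TRIGGER_THRESHOLD:
            return i  # wake the Step 2 recognizer immediately
    return None
```

The moving average lets sustained evidence trigger quickly while isolated noisy frames decay away without firing.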
Companies such as Nuance, Google, and Microsoft are prominent in the second step: a powerful (often cloud-based) recognition system.
The third step, "Understanding Meaning," is what the original Siri was all about. This was an AI component developed under DARPA funding at SRI and later spun off and acquired by Apple. Nuance's Vlingo does a very nice job of implementing Steps 1-3. It's very likely that Google, Microsoft, Apple, and Nuance all have efforts underway in AI and natural language understanding.
The SEARCH in Step 4 is handled by typical search engines (Google, Microsoft, Apple), and the independent players have likely developed partnerships in these areas.
Step 5 represents a good-quality Text-to-Speech (TTS) engine. Providers like Nuance, Ivona, AT&T, NeoSpeech, and Acapela all have quality TTS engines, and no doubt Apple, Microsoft, and Google all have in-house solutions as well.
Mobile applications for smartphones, tablets, and ultrabooks benefit from hands-free voice control in both safety and convenience. Applications can wake up and be controlled without touching the handset, whether in the car or across the room. Delivered as part of a medium-vocabulary recognizer with SDKs for iOS and Android, voice triggers and extensive command menus can be combined with cloud-based recognizers, creating a hybrid: a rich user experience when connected and extensive control capabilities when not. Response time is so fast that no pause between the trigger and command is necessary; for example, "hello computer what time is it in Tokyo?"
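The hybrid routing described above can be sketched as follows. This is an illustrative outline only; the function names, the trigger phrase, and the tiny offline command menu are assumptions, not a real SDK's API. The point is the split: the trigger and command arrive in one utterance with no pause, and the command is routed to the cloud when connected or to the on-device menu when not.

```python
# Small on-device command menu available even with no connection (illustrative).
OFFLINE_COMMANDS = {"what time is it", "call home", "play music"}

def recognize_on_device(utterance: str):
    """Match against the limited on-device command menu."""
    return utterance if utterance in OFFLINE_COMMANDS else None

def recognize_in_cloud(utterance: str):
    """Stand-in for a large-vocabulary cloud recognizer that can handle
    open-ended queries; here it simply echoes the transcription."""
    return utterance

def handle_utterance(audio: str, connected: bool):
    """Trigger and command in a single utterance, no pause required,
    e.g. "hello computer what time is it in Tokyo?"."""
    trigger = "hello computer"
    if not audio.startswith(trigger):
        return None  # trigger did not fire; stay asleep
    command = audio[len(trigger):].strip()
    if connected:
        return recognize_in_cloud(command)  # rich, open-ended queries
    return recognize_on_device(command)     # limited but works offline
```

When connected, the open-ended Tokyo query succeeds; offline, only the small local menu is understood, which matches the "extensive control when not connected" trade-off described above.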
Triggers can be made contextual; for example, if a phone number is included in a text message or email, a trigger such as "Dial the number" can be activated. Uniquely, these SDKs also support using voice triggers as Speaker Verification or Speaker Identification phrases. In these scenarios, a single user or multiple users enroll themselves by speaking the phrase a few times. Once enrolled, the trigger can be used as a voice password in the case of Speaker Verification, rejecting any other speaker, or as identification from a group of enrolled users in the case of Speaker Identification, so that users' preferences may be retrieved. Both predefined, fixed "hard-coded" triggers and user-defined triggering systems can be implemented on the device for further personalization (and combined with Speaker Verification/Identification).
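The enrollment, verification, and identification flow described above can be illustrated with a toy model. Real systems compare speaker embeddings produced by an acoustic model; in this hypothetical sketch, each utterance of the trigger phrase is already a plain feature vector, enrollment simply averages a few repetitions, and the distance threshold is made up.

```python
import math

def _dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class TriggerSpeakerModel:
    def __init__(self, threshold=1.0):
        self.profiles = {}        # user -> averaged enrollment vector
        self.threshold = threshold

    def enroll(self, user, utterances):
        """User speaks the trigger phrase a few times; store the mean vector."""
        n = len(utterances)
        self.profiles[user] = [sum(vals) / n for vals in zip(*utterances)]

    def verify(self, user, utterance):
        """Speaker Verification: accept only the enrolled user (voice password)."""
        return _dist(utterance, self.profiles[user]) <= self.threshold

    def identify(self, utterance):
        """Speaker Identification: pick the closest enrolled user so their
        preferences can be retrieved, or None if nobody is close enough."""
        best = min(self.profiles, key=lambda u: _dist(utterance, self.profiles[u]))
        return best if _dist(utterance, self.profiles[best]) <= self.threshold else None
```

The same enrolled phrase serves both roles: `verify` gates access for one user, while `identify` selects among several enrolled users.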