Creating JARVIS - Smart microphones enabling the digital butler
The voice interface sector has been driven by rapid advances in cloud-based automated speech recognition (ASR) systems and smart microphones. ASR systems now add context to voice queries, enabling searches that far outstrip the capabilities of current browser-based search, and bringing voice control to home automation and entertainment systems. Smart microphones allow these systems to work in a free-form environment. Together, they will soon enable the creation of a digital butler.
Here's an example. At the start of 2016, Facebook’s CEO, Mark Zuckerberg, announced that his usual annual challenge was to “… build a simple AI to run my home and help me with my work. You can think of it kind of like JARVIS in Iron Man.” Now, if you’re a fan of Marvel - be it the comics or the films - you probably know that JARVIS (Just A Rather Very Intelligent System) is Tony Stark’s digital assistant in Iron Man, and has the ultimate voice user interface.
Zuckerberg was suitably vague about what his version of JARVIS would be and how it would work, but very clear about what the outcome should be: he wants the ability to use his voice to control much of his home. Zuckerberg must also have seen the project as an opportunity to get a better understanding of the Amazon Echo. In particular, he probably wanted to understand the integrated control of IoT devices around the home, such as lights, thermostats and locks, and the development of Alexa Skills (built with the Alexa Skills Kit) that developers can use to automate repetitive tasks like making a diary entry or booking a ticket.
Consumer interest in voice interfaces is growing rapidly. Less than 18 months ago Gartner predicted that, by 2018, 30% of our interactions with technology will be through ‘conversations’ with smart machines. Yet by June 2016 Ben Bajarin of Creative Strategies concluded that the release of the Amazon Echo had already pushed voice into the mainstream, observing:
- “I’ve long said the true test of a great feature very early in its life cycle, is when it combines both delight and frustration. Once you use it, you’re hooked but you want it to be great all the time because you can see the potential.”
Certainly the pace of innovation and development has been impressive. Since the Amazon Echo's launch in March 2015, the number of Alexa Skills has grown from 70 to more than 2,000 today. During the last six months, Google has announced a competitor to the Echo and a similar product is expected from Apple, while Facebook and Microsoft have announced chatbot initiatives based around Messenger and Cortana. Each company is trying to position its technologies to take best advantage of ecosystems built around voice.
So how might Zuckerberg go about fulfilling his 2016 initiative? Let’s consider the scale of some of the challenges he’ll have to face as he develops his voice-activated butler.
The first challenge will be to capture his voice commands and questions with as much clarity as possible. If Zuckerberg intends to use his mobile phone and a Facebook chatbot to communicate with JARVIS, the problem is reasonably well constrained and defined. Recognition is unlikely to hit the ideal target of 99% accuracy, but it will be good enough for starters.
But if Zuckerberg intends to talk to JARVIS as he walks around his home, JARVIS needs to capture Zuckerberg's voice when he’s standing several metres away with the same clarity as when he is standing nearby, before passing it to the ASR engine. The technology for capturing distant voice, sometimes referred to as a far-field microphone or smart microphone, needs to take into account the direction of the voice, ambient noise in the surrounding environment, the reverberation and echo generated by hard surfaces, sudden loud noises, and the like. And all of these characteristics change from room to room.
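One standard far-field technique is microphone-array beamforming, which steers the array toward the talker so that direct speech adds up coherently while off-axis noise does not. Here is a minimal delay-and-sum sketch in Python (NumPy); the array geometry, sample rate and function names are illustrative, not any particular product's API:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second, at room temperature

def delay_and_sum(channels, mic_positions, direction, sample_rate):
    """Steer a microphone array toward `direction` (a unit vector) by
    delaying each channel so the target wavefront lines up, then averaging.

    channels:      (n_mics, n_samples) array of captured audio
    mic_positions: (n_mics, 3) array of microphone coordinates in metres
    """
    n_mics, n_samples = channels.shape
    # Per-microphone arrival delay relative to the array origin, in samples.
    delays = (mic_positions @ direction) / SPEED_OF_SOUND * sample_rate
    delays -= delays.min()  # keep every shift non-negative
    out = np.zeros(n_samples)
    for channel, delay in zip(channels, delays):
        shift = int(round(delay))
        out[shift:] += channel[:n_samples - shift]
    return out / n_mics  # coherent speech sums; incoherent noise averages down
```

A real smart microphone adds acoustic echo cancellation, de-reverberation and noise suppression on top of the beamformer, but the steering step above is the core of "direction of voice" handling.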
Having decided on how to capture a clean voice stream, Zuckerberg will then need to implement a way to activate the butler. Generally, this activation challenge gets solved by using a keyword -- like “JARVIS” -- to wake up the system components for processing the subsequent captured voice. But Zuckerberg will have to make sure the trigger keyword is recognised at least 95% of the time, and that similar words or phrases don’t mistakenly trigger JARVIS. Then there’s the question of who can activate JARVIS, which brings biometrics into consideration, including voice biometrics, which many banks are looking at for personal identification/verification.
Zuckerberg then needs to decide whether, having woken up JARVIS and asked a question or issued a command, he wants the decisions and responses to be made using local processing or to be passed to a cloud-based ASR service for processing. Both are reasonable approaches, but they require different interface technologies with associated security concerns. Most of the conversation will probably be handed to the ASR service managed by Facebook, of course, but if the request is a simple decision to turn a device on or off it’s much more efficient to run the command on the device or a local home network.
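That local-versus-cloud split can be expressed as a simple router: a short whitelist of device commands executes on the home network, and everything else goes to the cloud service. This Python sketch is purely illustrative; the command set and callables stand in for whatever local controller and cloud client a real system would use:

```python
# Commands simple enough to resolve without a network round trip.
LOCAL_COMMANDS = {"lights on", "lights off", "door lock", "door unlock"}

def route(utterance, local_executor, cloud_client):
    """Send trivial device commands to the local executor; forward
    anything needing full ASR/NLU to the cloud service."""
    text = utterance.strip().lower()
    if text in LOCAL_COMMANDS:
        return local_executor(text)   # low latency, works if the link is down
    return cloud_client(text)         # richer understanding, higher latency
```

In practice the local path would still run on-device speech recognition for its small vocabulary, but the routing decision itself is this simple.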
Finally, JARVIS will have to talk back to Zuckerberg with an answer to a question or to confirm that an action has been completed. This requirement doesn’t mean that JARVIS has to have a display screen, but it will need to somehow interface to the surrounding systems and provide visual and audible cues as to its actions.
So how close is Zuckerberg’s JARVIS?
These challenges represent a significant amount of work. But there are plenty of others out there trying to achieve the same objective, and they have large development teams working on it. In an interview in August, Zuckerberg said he should be ready to demo “the AI sometime in the month of September”.
But what can we expect apart from a technology demonstration? Well, we know that Zuckerberg will show that he can use voice to instruct the AI system to control the lights, the temperature, the doors, and make him toast – he’s already confirmed that much. He might also show that facial recognition can be used to control the gates to his mansion. But I suspect he’ll do most of the conversation using his phone. Interesting, but nothing ground-breaking.
If, however, he shows the ability to do these tasks by talking with JARVIS using far-field voice (so he controls the features as he walks around his home), then he really will raise a few eyebrows. Amazon set a standard for far-field voice capture when they released the Echo back in 2015, so Zuckerberg wouldn’t want to show anything less capable and I’m certain that he won’t demonstrate his butler being driven by an Amazon product.
Even if Zuckerberg doesn’t show a far-field voice activated butler in September, he’ll have raised the expectations of what people can do with voice control, which will in turn drive more innovation and competition. Key to these developments will be smart microphones that handle the microphone aggregation, voice-capture, keyword detection, acoustic signal processing, system control, and even thin-client secure cloud integration. Devices that integrate all these features, replacing configurations of multiple discrete devices, are already available to early technology adopters who are working on product releases for 2017.
As smart microphones lower the cost and time-to-market barriers to voice-controlled applications, developers will be encouraged to interface with the myriad cloud services available. Maybe we can all start to think about having our own JARVIS, or at least a personal assistant, well before 2030.
Huw Geddes is the Director of Marketing for XMOS. He has an extensive background in the delivery of technology to designers, developers and engineers. Prior to joining XMOS as Information/Documentation Manager, Huw worked as Technology Transfer Manager at the 3D graphics company Superscape Ltd, and was Technical Author at VideoLogic Ltd. Huw also has a strong background and interest in fine art.