NEW YORK–If you’ve ever been frustrated using a voice-activated customer agent or scratched your head while reading an unintelligible voice-to-text message, AT&T says help is on the way.
The company, which has invested more than one million research hours over the past 20 years in speech and language recognition technology, says it has developed technologies that will not only make traditional voice-activated services more accurate but will extend voice activation to other modes of communication.
Earlier this week, AT&T Labs researchers showed off some of the technologies they’ve been working on at their labs here. Most of the applications showcased are not yet ready for commercial prime time, and researchers could not say when these services will find their way into products. But bits and pieces are already in products developed by AT&T and its partners.
For decades, AT&T has been at the forefront of speech recognition and natural language technology research. It has developed a core technology platform, known as Watson, a cloud-based system of services that not only identifies words but interprets meaning and context to deliver more accurate results. The system is built on servers that model speech and compare it against recorded voices. And Watson is an evolving platform: as more data flows through it, it adapts and learns, continually improving its accuracy and cross-referencing data so that speech can serve as the input for all kinds of communication and information services.
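To make that division of labor concrete, here is a minimal sketch, in Python, of such a two-stage pipeline: a recognition stage that turns audio into words, and an understanding stage that turns words plus context into an actionable intent. Every function name, signature, and return value below is hypothetical; AT&T has not published Watson’s actual interface.

```python
# A minimal sketch of the two-stage design described above. All names
# and return values are invented for illustration.

def recognize_speech(audio: bytes) -> str:
    """Recognition stage: audio in, best-guess word string out."""
    return "find reality shows on thursday"  # canned result for this sketch

def interpret(words: str, context: dict) -> dict:
    """Understanding stage: words plus context in, structured intent out."""
    return {"action": "tv_search", "query": words, "user": context.get("user")}

def handle_utterance(audio: bytes, context: dict) -> dict:
    # First get the words, then work out what the speaker wants done.
    return interpret(recognize_speech(audio), context)

print(handle_utterance(b"<pcm samples>", {"user": "demo"}))
# {'action': 'tv_search', 'query': 'find reality shows on thursday', 'user': 'demo'}
```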
“We are really on the cusp of a technology revolution in speech and language technology,” said Mazin Gilbert, executive director of speech and language technology at AT&T Labs. “It’s no longer about simply trying to get the words right. It’s about adding intelligence to interpret what is being said and then using that to apply to other modes of communication, such as text or video.”
Of course, AT&T is not alone in its quest for developing more intelligent voice-activated technologies. IBM and Microsoft have each invested heavily in this area for years. Microsoft has already incorporated some speech recognition technology into the Xbox Kinect. And Google, a relative newcomer to the field, is also making headway with voice recognition built into its Google Voice product, which is now available on the iPhone.
But AT&T’s researchers say the intelligence built into the Watson engine sets their applications apart from these others.
One of the demonstrations the company showed at its lab in New York was the iRemote, an application that turns an iPhone or other smartphone into a voice-activated TV remote. The application allows users to speak normal sentences asking to search for specific shows, actors, or genres.
For example, someone might ask the app to search for reality shows on Thursday evening, and the app will generate a list of all the reality shows starting at 8 p.m. that night. Users will likely still have to scroll through a short list of titles, but the search has been narrowed from the hundreds of shows they would otherwise have to sift through. A toy version of that refinement is sketched below.
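This sketch reduces a free-form request to structured filters and applies them to a program catalog. The keyword matching and the three-show catalog are invented purely for illustration; real systems use statistical language models for this step, and would also resolve words like “evening” into a time window.

```python
# Toy query refinement: free-form words become structured filters,
# which are then applied to a program catalog.

CATALOG = [
    {"title": "Show A", "genre": "reality", "day": "thursday", "start": "20:00"},
    {"title": "Show B", "genre": "drama",   "day": "thursday", "start": "20:00"},
    {"title": "Show C", "genre": "reality", "day": "friday",   "start": "21:00"},
]

def parse_query(words: str) -> dict:
    """Pull a genre and a day out of the word string."""
    words = words.lower()
    filters = {}
    for genre in ("reality", "drama", "comedy"):
        if genre in words:
            filters["genre"] = genre
    for day in ("monday", "tuesday", "wednesday", "thursday", "friday"):
        if day in words:
            filters["day"] = day
    return filters

def search(words: str) -> list:
    filters = parse_query(words)
    return [show for show in CATALOG
            if all(show.get(key) == value for key, value in filters.items())]

print(search("show me reality shows on thursday evening"))
# [{'title': 'Show A', 'genre': 'reality', 'day': 'thursday', 'start': '20:00'}]
```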
Voice-activated remotes already exist. But AT&T’s technology goes well beyond what is available today, said Michael Johnston, a principal researcher at AT&T Labs. Many existing applications respond only to a fixed set of prerecorded commands. AT&T’s application not only identifies words, it also uses other properties of language, such as syntax and semantics, to interpret and understand the meaning of a request. The system is designed to become more accurate over time as it learns the speech patterns of large numbers of users.
“The hardest thing in developing a service like this is populating it with a base level of understanding,” he said. “Even humans make mistakes in hearing words correctly. But we’re able to infer meaning from the way the question was phrased or even by understanding gestures or facial expressions.”
Eventually, Johnston said, cameras could be used to read lips or gauge facial expressions, which would also help determine the intent behind what’s being said.
“The vision is that we have something like you’d see in ‘Star Trek’ or ‘Minority Report,’” he said. “You shouldn’t have to sit with a keyboard and type anything. Your environment should sense you, and through voice commands or gestures the devices around you should know what you’re searching for or be able to initiate some other action for you.”
Researchers have also been applying the Watson speech and language framework to mobile devices. Some of AT&T’s technology partners, which license AT&T’s core speech and language technologies, have already built commercial products. For example, Vlingo licenses AT&T’s Watson core technology and also partners with AT&T on research. Today it offers applications for Android, BlackBerry, Nokia and iPhone smartphones. The Vlingo apps, which are often used to enable or enhance other applications, allow users to search the Web, find directions, update social networking status, and send emails and text messages to contacts simply by using voice commands.
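The pattern behind such apps can be sketched as a dispatch from recognized intent to device action. The action names and the intent format below are invented for illustration; Vlingo’s actual integration with AT&T’s engine is not public.

```python
# Sketch of dispatching a recognized intent to a device action, the
# pattern behind voice layers like Vlingo's. All names are invented.

def send_sms(contact: str, body: str) -> str:
    return f"SMS to {contact}: {body}"

def web_search(query: str) -> str:
    return f"searching the web for {query!r}"

# Map each intent the speech engine can return to a device action.
DISPATCH = {
    "sms":    lambda slots: send_sms(slots["contact"], slots["body"]),
    "search": lambda slots: web_search(slots["query"]),
}

# Pretend the engine heard "text Mom happy birthday":
intent = {"action": "sms", "slots": {"contact": "Mom", "body": "Happy birthday!"}}
print(DISPATCH[intent["action"]](intent["slots"]))  # SMS to Mom: Happy birthday!
```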
As touch-screen devices such as the iPad have emerged, AT&T has begun introducing physical gestures into the platform. Earlier this year, it introduced a research application for the iPhone that is capable of understanding both the spoken word and physical gestures.
The Speak4It app, which can be downloaded from the iTunes App Store, allows consumers to discover restaurants within a specific area, obtain directions to the nearest gas station, call their local pharmacy and access information on a variety of local businesses. By pressing the speak button people can say what they would like to find and have it pinpointed on a Google map. Users can also touch a point on a map and ask, “What’s there?” Or they can circle a neighborhood on the map and search for something only in that specific area.
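The core of that interaction is multimodal fusion: speech supplies what to find, and the touch gesture supplies where. Here is a small sketch of the idea; the coordinates, business data, and bounding-box representation are made up for illustration and are not Speak4It’s actual mechanics.

```python
# Sketch of multimodal fusion: a spoken query constrained by a
# geographic region derived from a touch gesture.

from dataclasses import dataclass

@dataclass
class Business:
    name: str
    category: str
    lat: float
    lon: float

PLACES = [
    Business("Corner Pharmacy", "pharmacy", 40.742, -73.991),
    Business("Main St Gas", "gas_station", 40.780, -73.960),
]

def in_region(b: Business, box: tuple) -> bool:
    """box = (min_lat, min_lon, max_lat, max_lon), e.g. the bounding
    box of a circle the user drew on the map."""
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= b.lat <= max_lat and min_lon <= b.lon <= max_lon

def multimodal_search(category: str, box: tuple) -> list:
    # Speech gives the category; the gesture constrains the geography.
    return [b for b in PLACES if b.category == category and in_region(b, box)]

# "Find pharmacies" plus a circled patch of downtown:
print(multimodal_search("pharmacy", (40.73, -74.00, 40.75, -73.98)))
```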
In addition to understanding and correctly interpreting language, AT&T is also developing voice technology that mimics natural voices. Its AT&T Natural Voices technology builds on text-to-speech so that any text processed through AT&T’s cloud-based service can be spoken in a variety of languages, including English, German, Spanish, French, and Italian.
The technology works by accessing a database of high-quality recorded sounds that, when melded together by algorithms, form spoken phrases. AT&T demonstrated the technology with an application that reads children’s storybooks aloud. Running on an iPad, the app used synthesized voices to read the story of Goldilocks and the Three Bears, highlighting each word as it was read and giving each character a distinct voice. While the voices still sound somewhat mechanical, the goal is that over time they will match the intonations and speech patterns of natural voices.
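The lookup-and-join idea at the heart of that concatenative approach can be sketched in a few lines. The word-sized units below are a deliberate simplification; production systems select among many candidate sound units per phrase and smooth the joins between them.

```python
# Toy concatenative synthesis: look up a stored, pre-recorded unit
# for each word and join the units into one waveform.

RECORDED_UNITS = {
    "once": b"<waveform:once>",
    "upon": b"<waveform:upon>",
    "a":    b"<waveform:a>",
    "time": b"<waveform:time>",
}

def synthesize(text: str) -> bytes:
    """Concatenate the stored waveform unit for each word in the text."""
    return b"".join(RECORDED_UNITS[word] for word in text.lower().split())

print(synthesize("Once upon a time"))
# b'<waveform:once><waveform:upon><waveform:a><waveform:time>'
```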
“The whole idea behind what we’re doing with this voice and multimodal technology is to develop an intelligent virtual agent that is with you all the time, whether you’re at home or out in the world,” Gilbert said. “When you’re out and about, it helps you look for restaurants, it knows to send an SMS to your mother on her birthday, it knows you go to Dunkin’ Donuts every day and sends you a virtual coupon on Monday morning, and it can speak to you when you need something read to you.”