State of Natural Speech and Italy

Last week, I had the opportunity to attend and speak at InteractiveMedia’s Speech Workshop in Rome. Speech recognition has become much more mainstream and accepted in the past 5 years, mainly because the industry has made huge strides in getting it to work, and also, frankly, because it is more ubiquitous, in cars for example, so people are more comfortable with it.

While I used to be very embedded in the speech scene, I haven’t been lately. In fact, IM asked me to speak about Dialogic’s cloud experiences and what we’re seeing there. But the workshop was very interesting for me, and even though the entire day was in Italian, here are some key themes that I picked up. Or shall I say, some key themes that crystallized for me, even if they weren’t exactly what the speakers were talking about!

First, the movement to mobile is increasing the usage of speech technologies. As I said above, speech is more ubiquitous in cars now than ever before, and that’s because when you are driving, you can’t really be hitting buttons. For the same reason, it’s hard to enter DTMF tones while talking on a mobile phone. Smartphones increase the use of speech technologies as well. One reason is that 3G has both a voice and a data channel, as any US consumer will know from the AT&T commercials about 3G and the iPhone. Another reason is that as smartphones increasingly become vehicles for mobile payments, speech recognition will be the vehicle for speaker verification, so you won’t need a PIN. Your unique voice, like your unique fingerprint, will be the verification.

Another theme I saw was that virtual agents (i.e., pieces of software that may look like a human and also “listen” to you and “talk” to you) can be used to help with specific tasks. They don’t make mistakes. And in some cases, it might be better to give personal information to a computer rather than to a human being. Sure, live agents are still required, and will be for some time, to deal with complex requests and probably accents (though I saw some cool demos of speaking a language, say English, with an Italian accent, and it sounded like an Italian speaking English). But virtual agents for specific tasks are cost-effective, and they work.

I also got a low-down on the latest standards, such as EmotionML, EMMA, SCXML, VoiceXML 3.0 and HTMLspeech.

All in all, the speech industry remains extremely vibrant and innovative, and we are now moving into true multi-modal support and true interactivity. I’m sure this will continue to move forward and bring increased support for all of us, along with even more new, innovative applications, such as support for mobile banking.