
Speech Recognition in the News



Rest in Peas: The Unrecognized Death of Speech Recognition
By kmaclean - 5/3/2010

An interesting article on speech recognition. The author, Robert Fortner, is not impressed with the rate of improvement in speech recognition over the years. The passage that gives the gist of his argument is:

We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech recognition acknowledged last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end.[...]

However, what is more interesting is the rebuttal by Jeff Foley (Nuance), who says in a comment:

First of all, any discussion of speech recognition is useless without defining the task--with the references to Dragon I'll assume we're talking about large vocabulary speaker dependent general purpose continuous automatic speech recognition (ASR) using a close-talking microphone. Remember that that "speech recognition" is successfully used for other tasks from hands-free automotive controls to cell phone dialing to over-the-phone customer service systems. For this defined task, accuracy goes well beyond the 20% WERR cited here. Accuracy even bests that for speaker independent tasks in noisy environments without proper microphones, but of course those have constricted vocabularies making them easier tasks. In some cases, you write about the failure to recognize "conversational speech," which is a different task involving multiple speakers and not being aware of an ASR system trying to transcribe words. Software products such as Dragon do not purport to accomplish this task; for that, you need other technologies which are still tackling this task.

And with respect to Fortner's comment that "The core language machinery had not changed since the 50s and 60s", Foley says:

[...]  Actually, it was the Bakers' reliance on Hidden Markov Models (HMM) that made NaturallySpeaking possible. Where other ASR attempts focused on either understanding words semantically (what does this word mean?) or on word bigram and trigram patterns (which words are most likely to come next?), both techniques you described, the HMM approach at the phoneme level was far more successful. HMM's are pretty nifty; it's like trying to guess what's happening in a baseball game by listening to the cheers of the crowd from outside the stadium.[...]

Good thing Sphinx, HTK and Julius all use HMM-based acoustic models...
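As a toy illustration of the phoneme-level HMM decoding Foley describes, here is a minimal Viterbi decoder over a made-up two-phoneme model. The states, probabilities, and quantized "frames" are all invented for illustration; real engines like Sphinx, HTK, and Julius use Gaussian-mixture emissions over acoustic feature vectors rather than a handful of discrete symbols:

```python
# Toy Viterbi decoding over a two-phoneme HMM (illustrative only).
# States are phonemes; observations are quantized acoustic frames.

states = ["AH", "B"]
start_p = {"AH": 0.6, "B": 0.4}
trans_p = {"AH": {"AH": 0.7, "B": 0.3},
           "B":  {"AH": 0.4, "B": 0.6}}
emit_p  = {"AH": {"low": 0.1, "mid": 0.4, "high": 0.5},
           "B":  {"low": 0.6, "mid": 0.3, "high": 0.1}}

def viterbi(obs):
    """Return the most likely phoneme sequence for the observed frames."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for frame in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][frame],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

print(viterbi(["high", "mid", "low", "low"]))  # -> ['AH', 'AH', 'B', 'B']
```

The decoder never "understands" the words; it just tracks which hidden state sequence best explains the noisy observations, which is exactly the baseball-stadium analogy in the quote above.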

Keyword audio searching website
By LeslawPawlaczyk - 11/27/2009


Recently there was a lot of news about Google creating its own ASR for generating subtitles on YouTube. However, they are not the only ones with expertise in this technology. My team has just launched a website that lets anyone try out speech recognition based on a modified Julius engine, called Skrybot. The website is aimed at showing how speech recognition can be used to decode speech into text and then navigate the transcribed information. Currently the website has a Polish interface and uses Polish acoustic/language models for decoding. However, in the near future we are planning to support new languages such as English and German. Please feel free to have a look. We are proud to prove that it is not only big corporations that are able to develop innovative technologies, but also smaller teams who are passionate about what they are doing.

Kind regards

Dr. Leslaw Pawlaczyk

Automatic Captions in YouTube
By kmaclean - 11/20/2009

Google is now offering automatic captions (auto-caps) in YouTube.   Video captions are generated using Google's speech recognition technology.  From the official blog post:

[...] we've combined Google's automatic speech recognition (ASR) technology with the YouTube caption system to offer automatic captions, or auto-caps for short. Auto-caps use the same voice recognition algorithms in Google Voice to automatically generate captions for video. The captions will not always be perfect (check out the video below for an amusing example), but [...] the technology will continue to improve with time.

They also have another related feature, called auto-timing, that can create time stamps for the words uttered in a video (if you upload a transcript along with the video). The resulting time stamp file is downloadable. From the blog:

[...] we’re also launching automatic caption timing, or auto-timing, to make it significantly easier to create captions manually. With auto-timing, you no longer need to have special expertise to create your own captions in YouTube. All you need to do is create a simple text file with all the words in the video and we’ll use Google’s ASR technology to figure out when the words are spoken and create captions for your video. [...]

Seems like an easier way to perform forced alignment on the audio track of a YouTube video...
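Forced alignment of this kind boils down to a monotonic dynamic program: the transcript is known, so the only question is where each word's time boundaries fall. Here is a minimal sketch with hypothetical per-frame scores; real aligners (HTK's HVite, or whatever Google runs behind auto-timing) score phoneme HMM states against acoustic features, but the DP has the same shape:

```python
# Sketch of forced alignment: given a known transcript and per-frame
# acoustic scores for each word (hypothetical numbers here), find the
# segmentation of frames into words that maximizes total score.

def align(scores, n_words):
    """scores[t][w] = log-likelihood that frame t belongs to word w.
    Returns the word index assigned to each frame (non-decreasing)."""
    T = len(scores)
    NEG = float("-inf")
    # best[t][w]: best total score ending at frame t while in word w
    best = [[NEG] * n_words for _ in range(T)]
    back = [[0] * n_words for _ in range(T)]
    best[0][0] = scores[0][0]          # alignment must start in word 0
    for t in range(1, T):
        for w in range(n_words):
            stay = best[t - 1][w]                       # remain in word w
            advance = best[t - 1][w - 1] if w > 0 else NEG  # move to next word
            if advance > stay:
                best[t][w], back[t][w] = advance + scores[t][w], w - 1
            else:
                best[t][w], back[t][w] = stay + scores[t][w], w
    # backtrace from the final word at the last frame
    path, w = [], n_words - 1
    for t in range(T - 1, -1, -1):
        path.append(w)
        w = back[t][w]
    return path[::-1]

# Two words, six frames: word 0 matches early frames, word 1 later ones.
frames = [[0.0, -5.0], [0.0, -5.0], [-1.0, -1.0],
          [-5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]]
print(align(frames, 2))  # -> [0, 0, 1, 1, 1, 1]
```

Frame boundaries where the path steps from one word index to the next become the caption time stamps.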

On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup
By kmaclean - 8/30/2009 - 1 Reply

From this article: On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup:

For all of those thinking of integrating speech recognition into their apps I have a word of advice for you: Don’t.


[The] speech rec discussed in this article is the kind that understands short phrases and/or commands with no training required. It’s not free flowing dictation like that found in Dragon software. [...]

He reviews some of the ways to integrate speech recognition into a web application:

  1. Telephony
  2. Web Services
  3. Embedded

And then describes the main stumbling block for open source speech recognition:

[...] The only real differences between the open source and commercially available solutions lie in what’s called their Acoustic Models. AMs for speech rec are like gold. A good AM is produced from several thousand hours of good audio samples.
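Of the three routes above, the "web services" one is the simplest to sketch: record audio client-side, POST it to a hosted recognizer, and parse the transcript out of the reply. The endpoint URL and the JSON shape below are hypothetical; every real service defines its own API:

```python
# Sketch of the "web services" integration route. The URL and the
# response JSON layout are made up for illustration.
import json
import urllib.request

def build_asr_request(audio: bytes,
                      url="https://asr.example.com/v1/recognize"):
    """Package raw audio as a POST to a hypothetical hosted recognizer."""
    return urllib.request.Request(
        url,
        data=audio,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

def parse_asr_response(body: str) -> str:
    """Pull the top transcript out of a hypothetical JSON reply."""
    return json.loads(body)["results"][0]["transcript"]

req = build_asr_request(b"RIFF....WAVEfmt ")  # placeholder audio bytes
print(req.get_method(), req.get_header("Content-type"))
print(parse_asr_response('{"results": [{"transcript": "hello world"}]}'))
```

Note that this route sidesteps the acoustic-model problem entirely: the provider's AM (those "several thousand hours of good audio samples") lives on their servers, which is precisely why the hosted services outperform what a small team can train in-house.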

Speech Recognition and the 'Uncanny Valley'
By kmaclean - 8/18/2009

Interesting article on the Discovery News Website (Why are Speech Recognition and Natural Language Neither of Those?) which says that part of the author's frustration with a telephony speech IVR application was due to his expectations:

[...] The more human the electronic operator sounds, the more I expect from her. When she doesn’t perform, my eye-rolling, jaw-jutting, and nose-exhaling ensues. Japanese roboticist Masahiro Mori (b. 1927) devised a theory around this phenomenon and calls it the “Uncanny Valley."

It says, basically, that humans will tolerate and even show empathy for artificially intelligent life forms (robots, electronic operators) as long as the machines don’t get too big for their britches and start looking and acting all Homo sapiens sapiens. [...]

So what to do? For starters, stop trying to simulate humans, said Bilmes. Keep speech recognition technology on a short leash and use it for applications where expectations are not so high.

He then goes on to give examples where speech recognition makes sense.


Bing 411
By kmaclean - 6/20/2009

Bing 411 is Microsoft's answer to Goog411 (originally discussed in this post: Google Voice Local Search).  It allows you to connect directly to businesses for free, and can also give directions (from this article):

The system, which is powered by Tellme, uses speech recognition technology to retrieve results. The free service helps users find a business, or receive a text message with a link to a map. It also includes star ratings of businesses based on reviews from others. [...]

How does it work? Users dial 1-800-Bing 411 (1-800-246-4411) from any phone and give the system the information they’re seeking. Bing 411 will give you directions over the phone (you can stop and repeat the directions several times, if needed). Or users can request them via text message.

Dasher - interesting use of a Language Model
By kmaclean - 4/11/2009 - 3 Replies

Dasher is a mouse-driven interface that allows you to enter text without a keyboard:

Dasher is a text-entry system in which a language model plays an integral role, and it's driven by continuous gestures. Users can achieve single-finger writing speeds of 35 words per minute and hands-free writing speeds of 25 words per minute.

Demo video located here: Single-finger text input
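The integral role the language model plays can be sketched in a few lines: Dasher divides the display into boxes whose sizes are proportional to the probability of each next character, so likely letters are big, easy targets for the pointer. The toy unigram probabilities below are made up for illustration (Dasher itself uses a context-sensitive model such as PPM):

```python
# Sketch of Dasher's core idea: map each candidate next character to a
# slice of the unit interval sized by its probability, so the pointer
# travels a short distance to select likely text.

probs = {"e": 0.4, "t": 0.3, "a": 0.2, "q": 0.1}  # toy numbers

def layout(probs):
    """Map each character to its (start, end) slice of the unit interval,
    largest box first."""
    boxes, lo = {}, 0.0
    for ch, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        boxes[ch] = (lo, lo + p)
        lo += p
    return boxes

for ch, (lo, hi) in layout(probs).items():
    print(f"{ch}: {hi - lo:.0%} of the screen")
```

After each selection the model re-predicts in the new context and the boxes are re-laid out, which is how practiced users reach 25-35 words per minute from continuous gestures alone.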

Google Adds Voice Search to Android
By kmaclean - 2/4/2009

Google has added voice search to the Android platform via a new software update.

Because it was trained on Goog411 (which is currently only available in the US and Canada) Google voice search has difficulty with other accents. Here's a statement from Google explaining why:

The acoustic model for Voice Search was trained, in part, by using data from GOOG-411 which has only launched broadly in the US. Since the acoustic model was trained using mostly American accents, the tool currently works best when receiving queries with American accents. While you can still download the Google Mobile App and turn on the Voice Search here, we've turned off the voice functionality by default when the app is downloaded from anywhere outside of the US. We don't have any specific launches to announce at this time, but we think this is exciting new technology and the speech recognition and understanding will only get better for other accents and jargon as we keep working on it

Google voice interface to iPhone search
By kmaclean - 11/14/2008

The New York Times is reporting that Google has added a voice interface to their iPhone search software:

Users of the free application, which Apple is expected to make available as soon as Friday through its iTunes store, can place the phone to their ear and ask virtually any question, like “Where’s the nearest Starbucks?” or “How tall is Mount Everest?” The sound is converted to a digital file and sent to Google’s servers, which try to determine the words spoken and pass them along to the Google search engine.

Other similar services:

Google Voice Dialer
By dano - 10/22/2008 - 1 Reply

Google Android has its own speech recognition engine, which is used by its VoiceDialer application.

You can check out the code on

It is licensed under the Apache License, which is compatible with the GPL.