gnomeSpeak is a two-way voice application using GVC and Festival. The prototype is aimed at helping the visually impaired. It currently supports English and Tamil.
I'd appreciate your feedback on it.
Hello. My team and I just released new open source software under the LGPL license for controlling Windows using voice commands. The software uses Julius as its speech recognition engine. Currently we support Polish acoustic models, so anyone who has any knowledge of a Slavic language is welcome to download and try it. Once again, thanks go to Prof. Lee and his team for developing Julius.
http://sourceforge.net/projects/skrybotdomowy/files/Releases/InstalatorSkrybotKomendy-188.8.131.524.exe/download
Thank you,
Leslaw Pawlaczyk
http://skrybot.pl/en/
From the Google Chrome blog:
Today, we’re updating the Chrome beta channel with a couple of new capabilities, especially for web developers. Fresh from the work that we’ve been doing with the HTML Speech Incubator Group, we’ve added support for the HTML speech input API. With this API, developers can give web apps the ability to transcribe your voice to text. When a web page uses this feature, you simply click on an icon and then speak into your computer’s microphone. The recorded audio is sent to speech servers for transcription, after which the text is typed out for you.
You can try it out yourself on Google's html5rocks.com website (you need Google Chrome 11 beta installed).
It works on Linux: I tried it on Fedora 14 and Ubuntu 10.04 with no problems.
I wanted to announce the first release of an open source project called Skrybot doMowy, which is based on the well-known decoder Julius. This software is the result of three years of research and is an LVCSR dictation system for the Windows platform, available from http://sourceforge.net/projects/skrybotdomowy/
The aim of this software, whose code is written in C#, is to allow fellow software engineers to write their own plugins and extensions to the dictation system.
Currently the program supports only Polish acoustic and language models, making it possible to use it for dictating emails or simple documents. It has a live view of the microphone input, allowing the user to monitor the volume of their speech.
Another aim was to make speech dictation available for free to everyone, with quality similar to that of commercial programs.
I encourage other researchers and programmers to get in contact with me to potentially develop GUI versions in other languages, as well as acoustic and language models for their native languages. We are considering supporting a British English version of this software soon, but we still need to develop such models.
More details can be found on http://skrybot.pl/en/
According to this InfoWorld article, Google is building speech-recognition technologies not just for Chrome, but for all browsers.
Ian Fette, product manager for the Google Chrome team, said (at the Google I/O conference in San Francisco late last week):
We're hoping that the text-to-speech APIs as well as the voice input, voice recognition ship in Chrome but also become a Web standard that is implementable by any browser out there.
Interesting article on speech recognition. The author, Robert Fortner, is not impressed with the rate of speech recognition improvements over the years. The passage that gives the gist of his argument is:
We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech [recognition put it] last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end. [...]
However, what is more interesting is the rebuttal by Jeff Foley (of Nuance), who says in a comment:
First of all, any discussion of speech recognition is useless without defining the task: with the references to Dragon I'll assume we're talking about large-vocabulary, speaker-dependent, general-purpose continuous automatic speech recognition (ASR) using a close-talking microphone. Remember that "speech recognition" is successfully used for other tasks, from hands-free automotive controls to cell phone dialing to over-the-phone customer service systems. For this defined task, accuracy goes well beyond the 20% WERR cited here. Accuracy even bests that for speaker-independent tasks in noisy environments without proper microphones, but of course those have constricted vocabularies, making them easier tasks. In some cases, you write about the failure to recognize "conversational speech," which is a different task involving multiple speakers and not being aware of an ASR system trying to transcribe words. Software products such as Dragon do not purport to accomplish this task; for that, you need other technologies which are still tackling this task.
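The "WERR" figure Foley refers to is word error rate: the edit distance (substitutions, insertions, and deletions) between the recognizer's output and a reference transcript, divided by the reference length. A minimal sketch, not tied to any particular scoring toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed as Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion: 1/6
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason raw accuracy claims need a defined task behind them.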
And with respect to Fortner's claim that "The core language machinery had not changed since the 50s and 60s", Foley says:
[...] Actually, it was the Bakers' reliance on Hidden Markov Models (HMMs) that made NaturallySpeaking possible. Where other ASR attempts focused on either understanding words semantically (what does this word mean?) or on word bigram and trigram patterns (which words are most likely to come next?), both techniques you described, the HMM approach at the phoneme level was far more successful. HMMs are pretty nifty; it's like trying to guess what's happening in a baseball game by listening to the cheers of the crowd from outside the stadium. [...]
Good thing Sphinx, HTK and Julius all use HMM-based acoustic models...
Recently there was a lot of news about Google creating its own ASR for generating subtitles on YouTube. But big companies are not the only ones with expertise in these technologies. My team has just launched a website, http://skrybot.tv, which allows anyone to try speech recognition technology based on a modified Julius engine called Skrybot (http://www.skrybot.pl). The website aims to show how speech recognition can be used to decode speech into text and then navigate across the transcribed information. Currently the website has a Polish interface and uses Polish acoustic/language models for decoding, but in the near future we are planning to support new languages such as English and German. Please feel free to take a look. We are proud to prove that innovative technologies can be developed not only by big corporations but also by smaller teams who are passionate about what they are doing.
Dr. Leslaw Pawlaczyk
Google is now offering automatic captions (auto-caps) on YouTube. Video captions are generated using Google's speech recognition technology. From the official blog post:
[...] we've combined Google's automatic speech recognition (ASR) technology with the YouTube caption system to offer automatic captions, or auto-caps for short. Auto-caps use the same voice recognition algorithms in Google Voice to automatically generate captions for video. The captions will not always be perfect (check out the video below for an amusing example), but [...] the technology will continue to improve with time.
They also have another related feature, called auto-timing, that can create timestamps for the words uttered in a video (if you upload the transcription along with the video). The resulting timestamp file is downloadable. From the blog:
[...] we’re also launching automatic caption timing, or auto-timing, to make it significantly easier to create captions manually. With auto-timing, you no longer need to have special expertise to create your own captions in YouTube. All you need to do is create a simple text file with all the words in the video and we’ll use Google’s ASR technology to figure out when the words are spoken and create captions for your video. [...]
Seems like an easier way to perform forced alignment on the audio track of a YouTube video...
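The output of a forced aligner is essentially a list of (word, start time, end time) entries, and turning that into captions is mechanical. A sketch that groups hypothetical word timings into SRT-style cues — the words and times below are invented, though the SRT timestamp format itself is standard:

```python
def fmt(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=4):
    """Group (word, start, end) tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{len(cues) + 1}\n{fmt(start)} --> {fmt(end)}\n{text}")
    return "\n\n".join(cues)

# Invented aligner output: each word with its start and end time in seconds.
aligned = [("hello", 0.0, 0.4), ("and", 0.5, 0.6), ("welcome", 0.6, 1.1),
           ("to", 1.2, 1.3), ("the", 1.3, 1.4), ("demo", 1.5, 2.0)]
print(words_to_srt(aligned))
```

The hard part, of course, is the alignment itself — which is exactly what YouTube's auto-timing does for you given a plain transcript.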
From this article: On Speech Recognition: Web App Integration, Pointers for Newbies, & Lessons Learned from a failed startup:
For all of those thinking of integrating speech recognition into their apps I have a word of advice for you: Don’t.
[The] speech rec discussed in this article is the kind that understands short
phrases and/or commands with no training required. It’s not free
flowing dictation like that found in Dragon software. [...]
He reviews some of the ways to integrate speech recognition into a web application:
And then describes the main stumbling block for open source speech recognition:
[...] The only real differences between the open source and commercially available solutions lie in what's called their Acoustic Models. AMs for speech rec are like gold. A good AM is produced from several thousand hours of good audio samples.
Interesting article on the Discovery News website ("Why are Speech Recognition and Natural Language Neither of Those?") which says that part of the author's frustration with a telephony speech IVR application was due to his expectations:
[...] The more human the electronic operator sounds, the more I expect from her. When she doesn’t perform, my eye-rolling, jaw-jutting, and nose-exhaling ensues. Japanese roboticist Masahiro Mori (b. 1927) devised a theory around this phenomenon and calls it the “Uncanny Valley.” It says, basically, that humans will tolerate and even show empathy for artificially intelligent life forms (robots, electronic operators) as long as the machines don’t get too big for their britches and start looking and acting all Homo sapiens sapiens. [...]

So what to do? For starters, stop trying to simulate humans, said Bilmes. Keep speech recognition technology on a short leash and use it for applications where expectations are not so high.
He then goes on to give examples where speech recognition makes sense: