Reusing Speech from Other Open Source/Social Projects
There is more than enough speech on the Internet to create a commercial-quality FOSS speech corpus and acoustic models. The problem is that converting such speech into a format usable for acoustic model creation is a very time-consuming process. Automating our current manual process for segmenting an audiobook (from LibriVox, for example), and applying the same algorithms to other potential sources of speech (audio or video blogs, etc.), would go a long way towards improving FOSS speech recognition.
This project is to create a series of scripts to train acoustic models using audiobooks from LibriVox.
The high level steps are as follows:
1) Get a list of speakers and number of hours spoken by each speaker.
2) Write scripts to download all the audio and text.
3) Write scripts to clean up the text so that it matches the audio. Initially this means removing the Project Gutenberg preamble, adding the spoken LibriVox preamble, and deciding how to handle chapter headings, etc.
4) Build acoustic and language models using one of the following speech recognition engines:
5) Use an "automated transcription script" to highlight any problems with the transcriptions; if any are found, go back to step 3 and fix them.
6) Decide on a sensible split of data between train, eval and test.
7) Make three releases: first the audio and text (in their original forms), second the scripts that perform steps 3-5 above (so that others may improve them), and third the acoustic models themselves.
8) Complete the acoustic model creation scripts for the other speech recognition engines not selected in step 4.
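Step 3 could be sketched along the following lines. Note this is only a sketch: the Gutenberg START/END marker wording varies between books, and the spoken LibriVox disclaimer varies between readers, so both the regexes and the preamble text here are assumptions to adapt per title.

```python
import re

# Assumed marker patterns -- Project Gutenberg texts use several variants.
START_RE = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*")

# One common form of the spoken disclaimer -- check each recording.
LIBRIVOX_PREAMBLE = "This is a LibriVox recording. All LibriVox recordings are in the public domain."

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body text between the Gutenberg START and END markers."""
    start, end = START_RE.search(text), END_RE.search(text)
    return text[start.end():end.start()].strip() if start and end else text.strip()

def align_transcript(text: str) -> str:
    """Drop the Gutenberg boilerplate and prepend the spoken preamble."""
    return LIBRIVOX_PREAMBLE + "\n\n" + strip_gutenberg_boilerplate(text)
```

Chapter headings would still need handling on top of this, since readers speak them in varying forms.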
Well - who am I to disagree?
One thought I've just had: I've heard of speech recognition companies who have trained up models using HTK and run them under SPHINX (before HDecode was available). Surely it can't be that hard; certainly the means/variances/transition probabilities should be trivial to convert. VoxForge would seem to be an excellent place to:
1) develop some GPL code to do the acoustic model conversion (let's assume ARPA-format LMs)
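As a first step towards such a converter, the means, variances and transition matrix can be pulled out of a text-format HTK MMF with a simple tokenizer. This is only a sketch for plain single-Gaussian, macro-free model files; real MMFs also carry GCONSTs, mixture weights, tied macros, etc., and the Sphinx-side writing is left out entirely.

```python
def parse_mmf_vectors(mmf_text: str) -> dict:
    """Extract <MEAN>, <VARIANCE> and <TRANSP> values from a text-format
    HTK MMF.  Sketch only: assumes a single-Gaussian, macro-free model."""
    toks = mmf_text.split()
    out = {"means": [], "variances": [], "transp": None}
    i = 0
    while i < len(toks):
        tag = toks[i].upper()
        if tag == "<MEAN>":                      # <MEAN> n  v1 .. vn
            n = int(toks[i + 1])
            out["means"].append([float(x) for x in toks[i + 2:i + 2 + n]])
            i += 2 + n
        elif tag == "<VARIANCE>":                # <VARIANCE> n  v1 .. vn
            n = int(toks[i + 1])
            out["variances"].append([float(x) for x in toks[i + 2:i + 2 + n]])
            i += 2 + n
        elif tag == "<TRANSP>":                  # <TRANSP> n  n*n matrix entries
            n = int(toks[i + 1])
            vals = [float(x) for x in toks[i + 2:i + 2 + n * n]]
            out["transp"] = [vals[r * n:(r + 1) * n] for r in range(n)]
            i += 2 + n * n
        else:
            i += 1
    return out
```

From there the numbers would still have to be rewritten into the Sphinx model format, which is the real work.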
Anyone got any results to share?
One important wrinkle in using MP3 audio from LibriVox (or even the WAV audio) is that some (not sure how much) of the speech audio submitted to LibriVox has been 'processed', i.e. the audio has been 'cleaned' with noise-removal algorithms, audio level normalization, and/or equalization.
Not sure how this might affect a final acoustic model; the rule of thumb has been to use unaltered speech audio as much as possible.
My gut feeling is that it shouldn't put you off using the data. Okay, you might find that some audio has been distorted, but you can always throw that away later.
Dr Tony Robinson, CEO Cantab Research Ltd
Phone: +44 845 009 7530, Fax: +44 845 009 7532
I created a quick 'sanity test' to compare acoustic models trained with wav audio versus mp3 audio. Basically I took the larger wav directories in the VoxForge corpus, converted them to MP3, and then converted them back. I then compared acoustic models ("AMs") created with the original wav data to AMs trained with the converted mp3 data, to get an idea of any performance differences.
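For anyone wanting to repeat a round trip like this, it can be scripted around lame. The 128 kbps bitrate below is an assumption on my part, not necessarily what was used in the test above.

```python
import subprocess
from pathlib import Path

def roundtrip_commands(wav: Path, workdir: Path) -> list:
    """Build the lame encode/decode commands for one wav file.
    128 kbps is an assumed bitrate -- match it to the corpus being tested."""
    mp3 = workdir / (wav.stem + ".mp3")
    back = workdir / (wav.stem + "_mp3.wav")
    return [
        ["lame", "-b", "128", str(wav), str(mp3)],   # encode wav -> mp3
        ["lame", "--decode", str(mp3), str(back)],   # decode mp3 -> wav
    ]

def roundtrip_dir(src: Path, workdir: Path) -> None:
    """Round-trip every wav in src, leaving *_mp3.wav files in workdir."""
    for wav in sorted(src.glob("*.wav")):
        for cmd in roundtrip_commands(wav, workdir):
            subprocess.run(cmd, check=True)
```

The `*_mp3.wav` outputs can then be dropped into the normal AM training run in place of the originals.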
The tests with Julius performed as expected, with some degradation in performance when using mp3-based acoustic models.
The tests with HTK are a bit more confusing, since they show some improvement in performance when using AMs based on mp3 audio.
Basically I need to use a larger test sample with a more complex grammar to get a better test. But the use of MP3 audio for AM training looks promising.