Speech Recognition Engines

VoxForge Acoustic Models w/ Sphinx 4
User: william818
Date: 1/8/2010 8:51 am
Views: 28911
Rating: 14

Hi All,

I need help getting the VoxForge Acoustic Models working with Sphinx 4.  I'm only asking on this forum after spending an entire day looking through user manuals and wikis.

I understand the differences between acoustic models, language models, etc.  I'm also a Java developer, so that helps since Sphinx 4 is written in Java.

I originally guessed that the VoxForge download was in Sphinx 3 format, so I looked at Sphinx 4's build.xml to see what it does to convert Sphinx 3 models to Sphinx 4, but those steps don't match up with the contents of the VoxForge download.

I also looked around the Sphinx 4 source tree to see any similarities between the existing CMU models and the VoxForge model, and I can't find any.  For example, the CMU models don't have *.align files, while VoxForge does.

Is there an instruction manual on integrating VoxForge into Sphinx 4?  Call me stupid but I couldn't find anything in manuals and wikis.

In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them.  Basically similar to what Google Voice does.

Thanks in advance for any help,

William

--- (Edited on 1/8/2010 8:51 am [GMT-0600] by ) ---


EDIT: Second follow-up question:  How come there are daily builds only for HTK/Julius and none for Sphinx?  It would be nice if we could get the most up-to-date models for Sphinx as well, wouldn't it?

--- (Edited on 1/8/2010 8:55 am [GMT-0600] by ) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/8/2010 10:31 am
Views: 221
Rating: 15

>Is there an instruction manual on integrating VoxForge into Sphinx 4?

See nsh's post here: Re: Sphinx3 model to Sphinx4 model

> Would be nice if we could get most up-to-date models for Sphinx as well, wouldn't it be?

On my to do list...

Are you volunteering to help out?

Ken

--- (Edited on 1/8/2010 11:31 am [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: william818
Date: 1/8/2010 11:06 am
Views: 199
Rating: 12

>On my to do list...

>Are you volunteering to help out?

Hey Ken, yeah sure, I'd be willing to help set this up.  I have no idea how to build these bundles at this point, but I'm sure it's not rocket science and if you send me a build guide, I can set it up.

 

--- (Edited on 1/8/2010 11:06 am [GMT-0600] by ) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/8/2010 12:40 pm
Views: 283
Rating: 17

>I have no idea how to build these bundles at this point, but I'm sure it's not rocket science and if you send me a build guide, I can set it up.

See this post: Creating Sphinx Acoustic Model.

Downloading the corpus will take a while (we have about 70 hours worth of speech).

Ken

 

--- (Edited on 1/8/2010 1:40 pm [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: nsh
Date: 1/8/2010 6:25 pm
Views: 253
Rating: 16

> Is there an instruction manual on integrating VoxForge into Sphinx 4?  Call me stupid but I couldn't find anything in manuals and wikis.


There is official documentation on this, which you can easily find through Google and in the Sphinx 4 source:

http://cmusphinx.sourceforge.net/sphinx4/doc/UsingSphinxTrainModels.html

> In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them.  Basically similar to what Google Voice does.

In case you are interested, that's not as simple as you might think. Voicemails are usually 8 kHz audio, while the VoxForge model is trained on 16 kHz audio, so it will not work for telephone recordings. There are other issues as well.
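
A quick way to see whether a given recording has this mismatch is to read the sample rate from the WAV header before decoding. The following is only a minimal sketch using the standard javax.sound.sampled API; the class name SampleRateCheck, the helper matchesModelRate, and the command-line arguments are made up for illustration and are not part of Sphinx 4 or VoxForge.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Sphinx 4: compares a WAV file's sample
// rate with the rate the acoustic model is assumed to have been trained on.
class SampleRateCheck {

    /** True when the audio sample rate equals the model's training rate (in Hz). */
    static boolean matchesModelRate(AudioFormat format, float modelRateHz) {
        return Math.abs(format.getSampleRate() - modelRateHz) < 0.5f;
    }

    public static void main(String[] args)
            throws IOException, UnsupportedAudioFileException {
        if (args.length < 2) {
            System.err.println("usage: java SampleRateCheck <file.wav> <modelRateHz>");
            return;
        }
        File wav = new File(args[0]);
        float modelRateHz = Float.parseFloat(args[1]);

        // AudioSystem parses the header; no audio data is decoded here.
        try (AudioInputStream in = AudioSystem.getAudioInputStream(wav)) {
            AudioFormat format = in.getFormat();
            System.out.println("sample rate: " + format.getSampleRate() + " Hz, "
                    + format.getSampleSizeInBits() + "-bit, "
                    + format.getChannels() + " channel(s)");
            if (!matchesModelRate(format, modelRateHz)) {
                System.out.println("Mismatch: the model expects " + modelRateHz + " Hz audio.");
            }
        }
    }
}
```

For example, running it with a voicemail file and 16000 as the second argument would flag an 8 kHz recording as a mismatch. It only detects the problem; it does not resample anything.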

 

--- (Edited on 1/9/2010 03:25 [GMT+0300] by nsh) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: nsh
Date: 1/8/2010 6:26 pm
Views: 212
Rating: 14

Hello Ken

> See nsh's post here: Re: Sphinx3 model to Sphinx4 model


I wouldn't link to that rather sketchy post when official documentation is available.

--- (Edited on 1/9/2010 03:26 [GMT+0300] by nsh) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: Visitor
Date: 1/8/2010 7:20 pm
Views: 231
Rating: 15

> In case anyone is interested, what I basically want to do is to take wav files of voice mails, and transcribe them.  Basically similar to what Google Voice does.


Don't you need a dictation system for that? Maybe I'm wrong, but I think that you need much more than 70 hours of speech for that.

--- (Edited on 1/8/2010 7:20 pm [GMT-0600] by Visitor) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: nsh
Date: 1/8/2010 7:49 pm
Views: 338
Rating: 16

> Don't you need a dictation system for that? Maybe I'm wrong, but I think that you need much more than 70 hours of speech for that.

It's a common myth that is spread on VoxForge for some reason. Database size is only one of the factors that affect accuracy; there are many others. There is no direct dependency between size and accuracy: it's possible to get good accuracy with 70 hours, and it's possible to get a higher error rate with 10 thousand hours.

Here on page 13 you can find a comparison of accuracy and database size:

http://mi.eng.cam.ac.uk/research/projects/EARS/pubs/evermann_sttmay04.pdf


Basically the difference in accuracy between 400 hours and 2200 hours is 2%.

 

--- (Edited on 1/9/2010 04:49 [GMT+0300] by nsh) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/19/2010 2:27 pm
Views: 7753
Rating: 16

>It's a common myth spread on VoxForge for some reason.

Yes, it is all part of a grand conspiracy to collect thousands of hours of recorded speech to create "The One Acoustic Model"... :)

I don't have the time or the inclination to collect more speech than we need.  If we have enough English speech now, please let me know, so I can cut a release, and move on to other things.

>Database size is only one of the factors that affect accuracy, there are many others. And there is no direct dependency between size and accuracy.

Good to know...

>It's possible to have good accuracy with 70 hours, it's possible that with 10 thousand you'll have a bigger error rate.

Wouldn't more data reduce the impact of outliers/errors in transcriptions or pronunciation, or non-speech noise?  i.e. the theory being that rather than spending lots of time manually transcribing/reviewing a small database of speech, you just collect lots of it, and hope that the statistical analysis performed during the acoustic model training process drops the outliers.

>Here on page 13 you can find a comparison of accuracy and database size

>http://mi.eng.cam.ac.uk/research/projects/EARS/pubs/evermann_sttmay04.pdf

From the "Experiments with Fisher Data" PowerPoint slide:

     2.2% abs. WER reduction from using 800h Fisher instead of 360h h5train03

Could that not also be interpreted as saying that by more than doubling the speech from 360 hours to 800 hours, you only get a 2.2% absolute improvement in Word Error Rate (which, as far as I could tell, was pretty bad to start with), and therefore, everything else being equal, you need a lot more speech to get a decent improvement because of diminishing returns?

So how much speech do we need for command and control acoustic models (70 hours - less than 400 hours?) and how much for dictation (no more than 400 hours), if the data was perfectly transcribed (speech and non-speech) and the audio perfectly clean, and the target was North American users with minimal accent variance?

Ok that last question was rhetorical... my real question is:

Is the only way to find out how much speech we really need for a given domain to create a test set (as you have suggested many times...) of the target domain (e.g. North American English) and keep collecting speech until we gain no more improvement in recognition using the test set?  Once that is accomplished, then we should focus our attention on AM and LM adaptation frameworks as described in your post: How to create a speech recognition application for your needs?
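
To make the test-set idea concrete: the number to track after each retraining would be the Word Error Rate on the fixed test set, i.e. the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is only a rough sketch of that calculation; the class name WerSketch, its methods, and the sample sentences are invented for illustration and are not taken from any Sphinx or VoxForge tool.

```java
// Hypothetical helper for illustration: computes Word Error Rate (WER) as
// (substitutions + deletions + insertions) / number of reference words.
class WerSketch {

    /** Word-level Levenshtein distance between reference and hypothesis. */
    static int wordEditDistance(String[] ref, String[] hyp) {
        int[] prev = new int[hyp.length + 1];
        int[] curr = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // empty reference: insert every hypothesis word
        }
        for (int i = 1; i <= ref.length; i++) {
            curr[0] = i; // empty hypothesis: delete every reference word
            for (int j = 1; j <= hyp.length; j++) {
                int substitution = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[hyp.length];
    }

    /** Lower-cases and splits on whitespace; real scoring tools normalize more carefully. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** WER of one hypothesis against one reference transcript. */
    static double wordErrorRate(String reference, String hypothesis) {
        String[] ref = tokenize(reference);
        String[] hyp = tokenize(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference must contain at least one word");
        }
        return (double) wordEditDistance(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        // Made-up transcript pair: one substitution and one deletion in four words.
        double wer = wordErrorRate("call me back tomorrow", "call be back");
        System.out.println("WER = " + wer);
    }
}
```

Scoring every acoustic model release against the same test set this way would show whether additional hours of speech are still lowering the error rate or whether the curve has flattened.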

thanks,

Ken

--- (Edited on 1/19/2010 3:27 pm [GMT-0500] by kmaclean) --- Fix Font size

--- (Edited on 1/19/2010 3:38 pm [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/19/2010 2:42 pm
Views: 118
Rating: 10

>Voicemails are usually 8kHz audio and Voxforge model is trained on 16kHz, it will not work for telephone recordings.

All audio submitted to VoxForge is downsampled to 8 kHz, 16-bit, and that downsampled audio is what the Sphinx acoustic model tutorial/post I gave william818 links to.

--- (Edited on 1/19/2010 3:42 pm [GMT-0500] by kmaclean) ---
