Acoustic Model Discussions

Re: Acoustic model testing
User: tpavelka
Date: 3/31/2009 2:28 am
Views: 83
Rating: 4

Hi,

> I am also thinking that for a release of the VoxForge acoustic
> model, we would train using the *entire* corpus (not just the
> training corpus) in order to ensure that it includes as much
> speech as possible - is this reasonable?

I do not think this is a good idea. You always need a testing corpus; it should be reasonably big (thousands of sentences), and it has to be static (i.e. not changing over time).

Reasons:

  1. The need for a testing corpus: One reason is to encourage competition among the people who train their own acoustic models. This will make the training more enjoyable ;-) But the more important reasons are debugging and parameter tuning. Debugging of statistical models is notoriously difficult, because the only way to find a mistake is to look at the resulting accuracy, and you can only tell that the accuracy is low by comparing it to tests done by someone else. Also, the testing corpus can be used to tune the parameters of the acoustic model, namely the number of mixtures per state and the decision tree clustering threshold, which controls the number of physical states after clustering. Both of these should have a pretty visible impact on accuracy.
  2. The size of the testing corpus: The corpus should be quite large, because the incremental improvements achieved by parameter tuning etc. can be small, and you need a large enough corpus for these differences to be statistically significant. If you get a slight accuracy improvement you don't know whether it is by chance or because your new acoustic model is better, but the larger the testing corpus, the more sure you can be (a rough back-of-the-envelope check of this is sketched below the list).
  3. The corpus has to be static: I do not believe that results achieved on different corpora are comparable, since by choosing a testing set you are also choosing how difficult it is to recognize. By adding sentences to the testing corpus you can (unintentionally) make it easier or harder to recognize; the results on the new testing corpus may then not be comparable with the results on the old one, and you can just throw the old results away. The testing can take quite a long time (my experiment with HDecode ran a little less than 24 hours), so you do not want to invalidate these expensive tests by changing the testing corpus too often (though I guess the corpus has to be changed sometimes, based on the needs that arise from the experiments).
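
To give an idea of what "statistically significant" means here, a crude back-of-the-envelope check could look like the following. This is only a sketch: it treats per-word errors as independent, which they are not, so the proper tool would be a matched-pairs test such as NIST's MAPSSWE, but it already shows how test-set size limits what you can conclude.

import math

def wer_difference_significant(wer_a, wer_b, n_words, z=1.96):
    # True if the gap between two WERs (as fractions, e.g. 0.12) measured
    # on the same n_words-word test set exceeds a ~95% confidence margin.
    se_a = math.sqrt(wer_a * (1.0 - wer_a) / n_words)
    se_b = math.sqrt(wer_b * (1.0 - wer_b) / n_words)
    return abs(wer_a - wer_b) > z * math.sqrt(se_a ** 2 + se_b ** 2)

print(wer_difference_significant(0.12, 0.11, 1000))    # False: 1,000 words cannot resolve a 1% gap
print(wer_difference_significant(0.12, 0.11, 20000))   # True: 20,000 words can

So a one percent absolute improvement is lost in the noise on a small test set, but becomes meaningful once the corpus contains tens of thousands of words.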

Some more thoughts:

  • By training the acoustic model you assume that your training data is representative enough of your task. The only way you can ensure this is by not including the testing data in the training; otherwise you cannot be sure that your model has not just "memorized" the training data. This is called overtraining and shows up as a too-large gap between the accuracy on the training and testing data. (Given the size of the VoxForge corpus this does not seem likely, but you can never be sure.)
  • Ideally the speakers in the testing set should be different from the speakers in the training set (a minimal sketch of such a speaker-disjoint split follows below this list).
  • There should be more than one testing corpus. The testing corpora should differ in difficulty, e.g. corpora for grammar-based recognition (the numbers etc.), small, medium and large vocabulary + stochastic language models, clean speech, noisy speech etc.
  • The size of the testing corpus is relatively small (compared with the training data). This allows for more rigorous checking, e.g. if the corpus is just one or two hours long you can listen to all the recordings and take out the bad ones.
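
As an illustration of the speaker separation point, a split could be done along these lines. This is only a sketch; it assumes VoxForge-style file IDs where the speaker name is the part before the first "-", which may not match the actual prompt files, so the key function would have to be adjusted.

import random
from collections import defaultdict

def split_by_speaker(fileids, test_fraction=0.1, seed=0):
    by_speaker = defaultdict(list)
    for fid in fileids:
        speaker = fid.split("-", 1)[0]        # assumed naming convention
        by_speaker[speaker].append(fid)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [f for s, fs in by_speaker.items() if s not in test_speakers for f in fs]
    test = [f for s, fs in by_speaker.items() if s in test_speakers for f in fs]
    return train, test

The point is that whole speakers are held out, so no voice from the testing set ever appears in the training data.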

I understand that the design and recording of an official testing corpus will be quite a difficult task. I do not claim to be 100% correct (actually I think it is highly likely that I am wrong in some parts, I just do not know which ones ;-)), so this should be taken as a starting point for a debate. Hopefully you can get some ASR professionals with more experience than I have to take part. Also, the articles about the design of e.g. the WSJ corpus might be helpful.

Since it can take some time to get a proper testing corpus, I suggest that we make a temporary one so that the testing (and the resulting improvements in accuracy) can start soon. For this I have proposed the set used by nsh in the sphinx experiments. It should be easy to exclude these files from the training process.

If we have at least a temporary testing corpus we can try to figure out:

  • Whether mixtures should be added (they will improve accuracy at the expense of computation time; a rough HHEd sketch for this follows below the list).
  • What is the best setting of the decision tree clustering threshold.
  • What are the best features (although MFCC_D_N_Z seems to perform quite well).
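
For the mixture experiments, something along the lines of the HTK Book's mixture-splitting recipe could be scripted like this. It is just a sketch: the directory names are made up, and the item list {*.state[2-4].mix} assumes the usual 3-emitting-state topology, so it would have to be adapted to the actual VoxForge training scripts.

import subprocess

def mix_up(src_dir, dst_dir, n_mix, hmm_list="tiedlist"):
    # Write a one-line HHEd edit script that raises the number of mixtures
    # per state to n_mix, then apply it to the model set in src_dir.
    edit_script = "mu%d.hed" % n_mix
    with open(edit_script, "w") as f:
        f.write("MU %d {*.state[2-4].mix}\n" % n_mix)
    subprocess.run(["HHEd", "-H", src_dir + "/macros", "-H", src_dir + "/hmmdefs",
                    "-M", dst_dir, edit_script, hmm_list], check=True)

# e.g. mix_up("hmm15", "hmm16", 2), followed by a few HERest passes before testing

Each doubling would then be followed by re-estimation and a run on the (fixed) testing corpus to see whether the gain justifies the extra computation.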

Tomas

--- (Edited on 3/31/2009 2:28 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: kmaclean
Date: 3/31/2009 9:37 am
Views: 84
Rating: 6

Hi Tomas,

Thank you for the amazing feedback!

> [...] I suggest that we make a temporary one so that the testing (and the
> resulting improvements in accuracy) can start soon. For this I have
> proposed the set used by nsh in the sphinx experiments. It should be
> easy to exclude these files from the training process.

Just to confirm, you want me to exclude the files listed in:

voxforge_en_sphinx_test.transcription (which corresponds to the file IDs in voxforge_en_sphinx_test.fileids)

from the master_prompts files (8kHz_16bit and 16kHz_16bit) used in our current VoxForge acoustic model nightly build?

thanks,

Ken

--- (Edited on 3/31/2009 10:37 am [GMT-0400] by kmaclean) ---

Re: Acoustic model testing
User: tpavelka
Date: 4/1/2009 1:46 am
Views: 69
Rating: 6

Hi Ken,

Confirmed, those are the files. One more suggestion: if my guess is right, then once you take these files out of the training process they will also not be present in this file. If that is correct, please publish the testing prompts separately on the same page, so that anyone who downloads the prompts knows why some of them are missing.

Tomas

 

--- (Edited on 4/1/2009 1:46 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: nsh
Date: 4/1/2009 3:07 pm
Views: 82
Rating: 5

Btw, Tomas, what is JLaser's performance on the model? I just got the reference from the archive :)

--- (Edited on 4/1/2009 3:07 pm [GMT-0500] by nsh) ---

Re: Acoustic model testing
User: tpavelka
Date: 4/2/2009 2:12 am
Views: 96
Rating: 4

Hi nsh,

JLASER is my attempt at creating my own speech recognizer. The only results I have with JLASER and VoxForge are those at the beginning of this thread. After that, all my experiments were done with HTK, because I wanted to avoid possible bugs in my recognizer and wanted to improve the accuracy of my acoustic models. As you can see from my previous posts, I am still not done with that ;-)

The problem with JLASER is that I was never able to properly incorporate stochastic language models into the decoder (i.e. every time I tried, the improvement in accuracy was quite small). I guess one of the reasons was the grammar scaling factor. I know that when doing a transition between words you subtract the word transition penalty and multiply the LM score by the grammar scaling factor. I thought that the penalty would be sufficient, but from the tests with HDecode I see that it is not. The decoder works with logarithms, so multiplying the LM score by e.g. 15 means the result is the LM probability raised to the power of 15. This means that if I wanted to achieve the same effect with the word transition penalty alone, it would have to be a huge number, which is something I did not expect.
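
To make the numbers concrete, this is roughly how I understand the score combination at a word boundary. It is only a toy sketch of the usual formula with natural-log scores, following the sign convention above, not code taken from JLASER or HDecode.

import math

def word_transition_score(acoustic_logprob, lm_logprob, lm_scale=15.0, word_penalty=0.0):
    # log P = acoustic + lm_scale * lm - penalty (sign conventions differ between
    # decoders). In linear terms the LM probability is raised to the power
    # lm_scale, while the penalty only multiplies by the constant exp(-penalty).
    return acoustic_logprob + lm_scale * lm_logprob - word_penalty

print(15 * math.log(0.01))   # about -69: a single LM probability of 0.01 scaled by 15

So one bigram of probability 0.01, scaled by 15, already contributes the same as a fixed penalty of about 69 in the log domain, which is why the penalty alone can never stand in for the scaling factor.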

The other thing is pruning. JLASER is kind of slow because it is written in Java (let's say HTK is about twice as fast). I was able to make it run faster than HTK by using tree lexicons and pruning (in my experiments a tree lexicon with pruning can run more than 10x faster than the linear lexicon normally used by HTK). But this was all done without any language model.
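
For anyone curious what I mean by a tree lexicon: the idea is simply that words sharing a phone prefix share the corresponding search nodes, so the shared prefix is evaluated once instead of once per word as in a linear lexicon. A minimal sketch (the pronunciations are made up for illustration):

class LexNode:
    def __init__(self):
        self.children = {}   # phone -> LexNode
        self.words = []      # words whose pronunciation ends at this node

def build_tree_lexicon(pron_dict):
    root = LexNode()
    for word, phones in pron_dict.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, LexNode())
        node.words.append(word)
    return root

lexicon = build_tree_lexicon({
    "speech": ["s", "p", "iy", "ch"],
    "speed":  ["s", "p", "iy", "d"],   # shares the "s p iy" prefix with "speech"
    "tree":   ["t", "r", "iy"],
})

Combined with pruning, this prefix sharing is where most of the speedup over the linear lexicon comes from.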

Without any language model I could set a pruning threshold that would lead to about 200 active states per frame, which gave a speed of about 0.1-0.3 RT. But when I look at the results from HDecode with the trigram LM, the average number of active states is around 10-20k (that is, if I read what comes out of HDecode correctly). The speed isn't that great either, about 10 RT. If I try to do the same with JLASER it will run much slower, unless I have a more efficient decoding algorithm than HDecode, and I don't believe that is the case ;-)

Anyway, regarding HDecode, do you have any experience with it? The accuracy I got with it looks good, but the speed does not. It might run faster if the pruning thresholds are set properly, but I don't know what numbers to put there. The only one I understand is the absolute beam width, but there are also a relative beam width and a word end beam width... I can guess what these do, but I have no idea what values they should have.

Tomas

--- (Edited on 4/2/2009 2:12 am [GMT-0500] by tpavelka) ---

--- (Edited on 4/2/2009 2:15 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: nsh
Date: 4/3/2009 4:10 am
Views: 111
Rating: 6

Hi Tomas,

It's a very interesting decoder then.

10xRT is quite a usual speed for testing a medium vocabulary decoder. Articles on LVCSR systems like CU-HTK mention 400xRT for the "slow" decoder and 10xRT for the "fast" one :).

Sphinx decoders are currently somewhere around 2xRT in normal mode for a medium vocabulary task, but it can of course vary depending on the required accuracy.

Btw, this raises the problem of a proper search algorithm implementation; it is hard to handle word transitions properly. I wonder if the JLaser search is optimized for a large vocabulary.

I haven't dug deep into the models yet; to be honest, 95% accuracy on a single-mixture model looks strange to me, but I really want to dig deeper into this.

Btw, the language weight for a medium vocabulary is typically around 8-10; 15 is more for a small vocabulary.

The beams typically used in Sphinx are, for example:

-beam       1.0e-55   Beam selecting active HMMs (relative to best) in each frame [0(widest)..1(narrowest)]
-ci_pbeam   1e-80     CI phone beam for CI-based GMM Selection [0(widest)..1(narrowest)]
-pbeam      1.0e-50   Beam selecting HMMs transitioning to successors in each frame [0(widest)..1(narrowest)]
-wbeam      1.0e-35   Beam selecting word-final HMMs exiting in each frame [0(widest)..1(narrowest)]
-wend_beam  1.0e-80   Beam selecting word-final HMMs exiting in each frame [0(widest)..1(narrowest)]

--- (Edited on 4/3/2009 4:10 am [GMT-0500] by nsh) ---

Re: Acoustic model testing
User: tpavelka
Date: 4/3/2009 9:43 am
Views: 123
Rating: 4

Hi nsh,

thanks for the very interesting reply,

> I wonder if JLaser search is optimized for a large vocabulary.

Well, I tried 120k words with no language model; it runs reasonably fast with a tight beam threshold but with very low accuracy. As for a test with stochastic language models, that has never been done successfully ;-)

> I didn't get deep into the models yet, to be honest 95%
> accuracy on single mixture model looks strange to me, but I
> really wanted to dig deeper into this.

Absolutely, it seems very high to me as well, especially when compared to your results. That's why I asked Ken to remove the test set from the training, because it was the only explanation I could come up with.

> Articles for LVCSR like CU-HTK mentions 400xRT speed for
> "slow" decoder and 10xRT for the "fast" :).

Now that's what I call slow... I always thought that when you do tests you can just watch and wait for the results and that it was only the training part for which you might have to wait several days...

But what do the people who build dictation systems do that we do not know about? I mean, in this article (link in Czech, sorry, I could not find any English links) they talk about selling a dictation system for use in courts which has a 370k vocabulary, a language model that includes 6-grams, and runs in RT... Also, I thought that dictation systems like ViaVoice or Dragon can do 100k+ vocabularies in RT.

Thanks for the hints about beam widths. I used the value 220 suggested by the tutorial; I guess the number is the negative of a natural logarithm, which would translate to about 1e-96 in numbers comparable with yours, which is quite wide. I tried to lower it, but that led to a quite sharp decrease in accuracy, so I guess such a wide beam is needed when you have poor acoustic models (with one mixture).
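
The conversion I did is just this (assuming the HTK beam is a difference in natural-log likelihood, which is how I read it):

import math

htk_beam = 220.0
print(math.exp(-htk_beam))   # about 2.9e-96, i.e. roughly the 1e-96 mentioned above

so 220 in HTK's units corresponds to a linear ratio of roughly 1e-96, which is indeed much wider than the Sphinx defaults you listed.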

I might try to play with the other beams, but if you say that this kind of speed is normal, then I think we cannot expect any radical speed up.

Tomas

 

--- (Edited on 4/3/2009 9:43 am [GMT-0500] by tpavelka) ---

--- (Edited on 4/3/2009 9:45 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: nsh
Date: 4/5/2009 4:15 pm
Views: 62
Rating: 5

I recently tested Dragon NS 9.0 Professional on VoxForge data (the kayray book and some other segmented audiobook). I used their language model (didn't specify my own) and batch transcription (they have such a mode), which is slower than real time, though not by much. No adaptation was used since I decoded with a clean new user profile, which is of course not correct; DNS should rely heavily on adaptation.


Results differ from one speaker to another.

TOTAL Percent correct = 89.88% Error = 10.94% Accuracy = 89.06%


This is the best one.

TOTAL Percent correct = 64.34% Error = 39.08% Accuracy = 60.92%

This is also typical.

The VoxForge model is worse, but with adaptation the results were comparable, or at least not much worse.

--- (Edited on 4/5/2009 4:15 pm [GMT-0500] by nsh) ---

Re: Acoustic model testing
User: nsh
Date: 4/5/2009 4:27 pm
Views: 76
Rating: 4

Basically that shows that DNS is essentially a single-user dictation system. Once you train it on your voice it will not recognize anyone else.

For modern large vocabulary dictation systems, I think this one is typical:

http://mi.eng.cam.ac.uk/research/projects/AGILE/publications/mjfg_ASL.pdf

and the 1xRT WER is not that good. Of course such a system is much more advanced than HTK, but the vocabulary is bigger as well.

--- (Edited on 4/5/2009 4:27 pm [GMT-0500] by nsh) ---

Re: Acoustic model testing
User: kmaclean
Date: 4/2/2009 2:27 pm
Views: 121
Rating: 5

Hi Tomas,

>> Just to confirm, you want me to exclude the files listed in:
>> voxforge_en_sphinx_test.transcription (which corresponds to the
>> file IDs in voxforge_en_sphinx_test.fileids)
>> from the master_prompts files (8kHz_16bit and 16kHz_16bit) used in
>> our current VoxForge acoustic model nightly build?

>confirmed, these are the files

Done!

The requested changes have been incorporated into the nightly build - see Changeset [2675] for details.

> If my guess is right then if you take these files out of the training process,
> they will also not be present in this file.

That is correct - in last night's run the old master_prompts files are still there, but going forward they will not be included in the tar file.

Ken

--- (Edited on 4/2/2009 3:27 pm [GMT-0400] by kmaclean) ---
