User: Sam
Date: 3/31/2013 5:35 am
Views: 3722
Rating: 5

Tried Julius 4.2 version. Ran it with the same julian.jconf file as in the tutorial.

The good news is that I am no longer getting the repeating sp problem. The bad news is that the recognition is terrible: it's returning more words than I uttered, and most of the time they seem quite unrelated to what I said.

Now I am totally confused. What should I try next?

Thanks in advance...

Re: untitled
User: hstech
Date: 8/9/2013 10:51 am
Views: 79
Rating: 5

I got the same problem with terrible recognition accuracy. Basically, what Julian spat out was completely out of sync with what was actually said. So I did some digging and testing to figure out what is happening and why.

The first thing I did was take one of the already completed audio files and purposely damage the prompts file by inserting a word into one of the prompts. My script contains a check of the HVite log file to confirm that all words were recognized properly (see "Step 8: Realigning the Training Data" in the tutorial, namely the part about checking the output of the HVite command), so I expected the script to error out at this step, saying that not all words were properly recognized. To my surprise, however, the script zipped through the damaged input all the way to the end and created a model. Further testing revealed that HVite can swallow up to three words added to a prompt without signaling that anything is wrong with it. More experiments revealed that you can rearrange the words in the prompt significantly, and even drop some of them, and HVite will still report all of them, despite the fact that the prompt now completely disagrees with the audio.

Trying to use the resulting model with Julius failed (a lot of "phone xx-yy-zz not found" error messages, concluded with "XX words out of YY failed"), so I ran the HVite command to recognize the audio and output some sort of "labelled file". Comparing the "labelled file" produced with the damaged prompt against the one produced with the original prompt, I realized that the program is somehow squeezing the labels for the existing words and moving them around to make room for the nonexistent words.
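A rough way to spot this kind of squeezing is to look at the durations in the aligned label file. The sketch below is only an illustration, not part of the tutorial: it assumes the usual HTK label layout of `start end label` per line with times in 100 ns units, and it assumes 10 ms frames with three emitting states per phone, so that 30 ms is the shortest duration a phone can be aligned to. The function names, the threshold and the example labels are made up for this post.

```python
HTK_TICKS_PER_MS = 10_000  # HTK label times are written in 100 ns units


def parse_label_lines(text):
    """Parse 'start end label [score]' lines into (start_ms, end_ms, label)."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        # Skip MLF headers, quoted file names, '.' terminators and blank lines.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        start, end = int(parts[0]), int(parts[1])
        segments.append((start / HTK_TICKS_PER_MS, end / HTK_TICKS_PER_MS, parts[2]))
    return segments


def suspiciously_short(segments, min_ms=30.0, ignore=("sil", "sp")):
    """Return non-silence segments aligned to the assumed minimum duration or less."""
    return [s for s in segments if s[2] not in ignore and (s[1] - s[0]) <= min_ms]


# Hypothetical alignment output, invented for illustration.
example = """#!MLF!#
"*/sample1.lab"
0 3000000 sil
3000000 3900000 dh
3900000 4200000 ax
4200000 4500000 k
4500000 6500000 ae
.
"""

for start, end, label in suspiciously_short(parse_label_lines(example)):
    print(f"{label}: {end - start:.0f} ms")
```

A few minimum-length phones are normal in fast speech, so a check like this can only point at files worth listening to. A long run of minimum-length phones in one prompt is the pattern to look for.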

I will post the exact details of these tests later, when I get to the computer where I did them.

Then I downloaded the VoxForge model and tried to recognize the files I had used with Julius (version 4.3.x or so). I did not get the errors, but the accuracy was horrible. Basically, what the program spat out was completely different from what was said in the audio. So I tried to figure out why this was happening, and I came to the conclusion that the speech does not match what is in the sample.dict file. A little further digging revealed that the program was actually right, given the grammar and the audio in question.

This last thing brought me to the conclusion that this "pronunciation vocabulary + prompts + audio" approach to speech recognition training is fundamentally broken. The problem is that every person pronounces the same words slightly differently, and if this pronunciation is not in sync with the lexicon used during training, the resulting model will be a piece of crap. And you will get no error indicating that the model is a piece of crap until you actually try to use it (at which point you get horrible accuracy). I studied some tutorials about Hidden Markov processes, and I believe that this kind of error is inevitable if you don't use perfect training input in the early stage of model training to build some sort of preliminary model.

After some thinking I came up with an idea about how to do this training properly. My idea is to manually label the locations of all phones in one training set, use this set to build a model, and then use this model to automatically label another training set (from the same person). As more and more files get labelled, each subsequent file should be progressively easier to label, because the model will be improving all the time. Namely, the idea is to get rid of that "speech not in sync with the lexicon" problem by taking the lexicon out of the training equation. I also hope that this approach will allow me to discover imperfections in the next sets of training data.

Another approach (maybe slightly easier to try but with uncertain outcome) will be to construct a little lexicon for each of the train sets. I am not sure about my mileage with this (actually I think it is more likely to not work than to work because there is a step in the tutorial which says "take the whole lexicon and do some training with it") but at least I can give it a try.

The good news here is that all the audio is available here so I can pick it up and try these different approaches. With a proprietary audio model I would be stuck.

Some more findings about this issue
User: hstech
Date: 10/4/2013 2:18 pm
Views: 80
Rating: 9

I now have some more information about why I believe this "pronunciation vocabulary + prompts + audio" approach to speech recognition is fundamentally broken. This information comes from my study of the speech recognition process.

So, the process hiding in the background (at least for Julian/Julius and other HTK based speech recognition engines) is called a "Hidden Markov Process". The "Hidden" in the name means that we have some process which we can't directly observe. It takes some input we can't observe and produces some output we can observe, and we are trying to model that process so that we can deduce the unobserved input from the observed output. In our case the unobserved input is the words (or phonemes), the observed output is the speech audio, and the process that transforms the input into the output is the speaking of the particular human.

There is an algorithm called the forward-backward algorithm, which is used to train a model of the hidden process with training samples composed of the observed output and the corresponding unobserved input. The algorithm then produces the most probable model of the process used to convert the given input to the given output.
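To make this a bit more concrete, here is a minimal sketch of the forward half of that algorithm on a toy two-state model with two possible observation symbols. All the numbers are made up for illustration and have nothing to do with real acoustic models, which use continuous features rather than two discrete symbols.

```python
def forward_likelihood(start, trans, emit, observations):
    """Return P(observations | model) using the forward recursion."""
    n_states = len(start)
    # Initialisation: probability of starting in each state and emitting the first symbol.
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    # Induction: sum over all predecessor states, then emit the next symbol.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy left-to-right model (illustrative numbers only).
start = [1.0, 0.0]                 # always start in state 0
trans = [[0.6, 0.4], [0.0, 1.0]]   # state 0 may stay or move on; state 1 only stays
emit = [[0.9, 0.1], [0.2, 0.8]]    # state 0 prefers symbol 0, state 1 prefers symbol 1

well_matched = forward_likelihood(start, trans, emit, [0, 0, 1])
badly_matched = forward_likelihood(start, trans, emit, [1, 1, 0])
print(well_matched, badly_matched)
```

Note that the badly matched sequence does not cause any error. It just gets a smaller probability, and nothing in the computation itself says whether that probability is "small enough to be wrong".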

Here is a nice tutorial which explains the stuff in greater detail:

So, given this information, I can see that the problem lies in the "given input with the given output" part of the training (in our case, the speech and its transcription). If the example input and output used to train the model do not correspond (i.e. the audio contains some words pronounced differently than the transcription/lexicon says), the training algorithm will still produce a model, but this model will be horribly inaccurate. And we have practically no way (short of tedious manual checking) to know whether our training data set is producing an accurate model or not. It is actually possible to produce a model from audio that says one thing and a prompt that says something quite different, and it can even pass the HVite check. I tested that by adding a few words to one of the prompts: HTK was still able to construct a model that its HVite tool could somehow use to recognize the nonexistent words and even align the particular file to them. To make the alignment fail, I had to add almost half of one of the prompts to the other.

Fortunately, we now have a pretty huge amount of data here, so I can try some other approaches. Well, it is still not 140 hours of audio, but with 100+ hours I believe I have enough to make it a useful test case.

The approach is to first construct a "seed training set", which will be a few audio files manually labelled to mark which phones occur in them and where exactly. A "seed model" is then built from this seed training set. This model can then be used to recognize phones in an additional audio sample and automatically label that sample with the phones. At the beginning this additional labelling needs to be manually verified and corrected. Once that is done, the sample can be added to the collection of labelled samples and used to train a better model (because more training data is used for it). This better model is then used on another unlabelled audio sample to help label it. Once a "critical mass" of training data is reached, the model should be able to label an additional sample accurately on its own and add it to its training set collection.
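The loop described above can be sketched on toy data. This is only an illustration of the bookkeeping, under heavy assumptions: each "sample" is a single made-up number instead of real acoustic features, "training" is just averaging per phone, and the confidence margin that decides between automatic labelling and manual review is a hypothetical parameter.

```python
def centroids(labelled):
    """Average feature value per label: a stand-in for 'training a model'."""
    sums, counts = {}, {}
    for value, label in labelled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def auto_label(model, value):
    """Return (best_label, margin); margin is the distance gap to the runner-up."""
    ranked = sorted(model, key=lambda label: abs(value - model[label]))
    best, second = ranked[0], ranked[1]
    margin = abs(value - model[second]) - abs(value - model[best])
    return best, margin


def bootstrap(seed, batches, min_margin=1.0):
    """Grow the labelled set batch by batch; unsure samples go to manual review."""
    labelled = list(seed)
    needs_review = []
    for batch in batches:
        model = centroids(labelled)  # retrain on everything labelled so far
        for value in batch:
            label, margin = auto_label(model, value)
            if margin >= min_margin:
                labelled.append((value, label))  # confident: accept automatically
            else:
                needs_review.append(value)       # unsure: a human must check it
    return centroids(labelled), needs_review


# Manually labelled seed set and two unlabelled batches (made-up numbers).
seed = [(1.0, "aa"), (1.2, "aa"), (5.0, "iy"), (5.4, "iy")]
batches = [[0.8, 5.1, 3.2], [1.1, 4.9]]

model, review = bootstrap(seed, batches)
print(model)
print("needs manual review:", review)
```

The weak point of any loop like this is that a wrong automatic label gets fed back into the next model, which is why the early rounds need the manual verification.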

I believe this is how humans learn to recognize speech. A small baby does not know any words, but as it listens to its parents, it asks them questions about things and listens to what they say. Later, when the baby has absorbed enough speech input, it can improve its speech recognition skill on its own just by listening to its parents and other adults. This makes me think that I am going to use some sort of neural network based speech recognition approach, because neural networks are closer to what is inside the human brain than Hidden Markov processes.

Well, I guess that it will take me a while to get some working example of my approach. Once I have it, I will follow up with my findings.

Recently noticed something in the files
User: hstech
Date: 5/21/2014 11:26 am
Views: 113
Rating: 4

I was downloading and looking at the latest speech files and I noticed all of them containing a file called "Julius_log". According to these files the speech files were processed with Julius and the sentences successfully recognized. Examination of several speech files yielded similar results. I found some samples where not all audio files were recognized properly but still the accuracy is much better than "horrible" as stated here in this thread.

This means that my speech recognition model training process is somehow botched, because it does not yield good accuracy at all. I do not yet understand how or where the difference is (I suspect the culprit might be that HTK is configured to expect 16000 Hz samples and is then fed 48000 Hz samples, but I cannot tell for sure right now, as I had to put my speech recognition analysis on hold for a while), but the answer should be hidden somewhere in the VoxForge submission processing scripts. I am reasonably confident that these scripts are published in an SVN repository here. The interesting directory is lib/Corpus/Quarantine/Submission, as it appears to be the home of the submission processing script(s). Another interesting file is "Installation.txt", which contains a (fairly involved) procedure for installing the dependencies of the scripts.
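The sample rate suspicion at least is cheap to check before digging through the scripts. Below is a minimal sketch using Python's standard `wave` module, which only reads plain PCM WAV files. The 16000 Hz default is the rate assumed above, the helper names are made up, and the two silent files exist only to demonstrate the check.

```python
import os
import tempfile
import wave


def sample_rate_mismatches(paths, expected_hz=16000):
    """Return {path: actual_rate} for WAV files whose rate differs from expected_hz."""
    mismatches = {}
    for path in paths:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_hz:
            mismatches[path] = rate
    return mismatches


def write_silence(path, rate_hz, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))


# Demo: one file at the expected rate, one at 48000 Hz.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.wav")
bad = os.path.join(tmp, "bad.wav")
write_silence(good, 16000)
write_silence(bad, 48000)

for path, rate in sample_rate_mismatches([good, bad]).items():
    print(f"{os.path.basename(path)}: {rate} Hz (expected 16000 Hz)")
```

If this reports mismatches for the training audio, the files would need to be resampled, or the feature extraction configuration adjusted, before the model can be expected to work.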

The reason why I did not notice this little detail is that many of the older speech files don't contain a "Julius_log" file and certainly those which I was working with did not have one. Additional indication that this path leads somewhere is that the amount of samples processed per week increased dramatically recently and at roughly the same time when this increase happened the "Julius_log" files started to appear in the speech samples.

So I guess I am going to grab a Fedora Core distribution (as that is what "Installation.txt" implies) and get the thing up and running. This is a fairly complicated project, so it might take quite some time (as I have an emergency survival task of sorts to finish first), but as soon as I get some results I am going to post them here. Additionally, I have this neural network phone recognition approach, which seems quite promising and fairly easy to test quickly, so this might get completed sooner.

Re: Recently noticed something in the files
User: kmaclean
Date: 5/21/2014 12:08 pm
Views: 76
Rating: 4

>So I guess I am going to grab a Fedora Core distribution (

The scripts you link to are to sanitize/validate submissions (, add them to Subversion (, and replicate submissions to the VoxForge repository ( & - they are a mess - and not just because I wrote them in Perl...

What you really want is: (which is called by

which is basically the same Bash script used in the How-to.


Re: Recently noticed something in the files
User: hstech
Date: 5/30/2014 5:57 am
Views: 87
Rating: 8

Thank you for the pointer. Now that you have explained it to me, I will likely be able to find the bug much sooner. Namely, I based my script on one of the tutorials, but I suspect I broke some things while writing it (the tutorial does not explain everything in the file(s) it uses in detail, which is understandable; after all, it is a tutorial, not a reference), and now I need to figure out what is broken and how to fix it.

Another problem might be that I am using a newer version of HTK. I will try to downgrade it to match yours and try again. However, it is likely that the problem is not there, so I will first study the script you pointed me to.

Re: Recently noticed something in the files
User: kmaclean
Date: 6/1/2014 9:47 am
Views: 89
Rating: 5

>Another problem might be that I am using newer version of HTK. [...]

>However it is likely that the problem is not there