Speech Recognition Engines

Julius’s behavior understanding
User: sagar
Date: 10/12/2009 11:39 pm
Views: 19587
Rating: 7

Hello,

My requirement is to identify 20 selected words from any speaker and ignore all other speech data.

I recorded those 20 words from different speakers (currently 10). I have a single wav file per speaker, and each wav file contains the 20 words I need to identify.
e.g. 

Speaker1.wav -> WORD1 WORD2 WORD3 … WORD20
Speaker2.wav -> WORD1 WORD2 WORD3 … WORD20
.
.
.
Speaker10.wav -> WORD1 WORD2 WORD3 … WORD20

I trained Julius using these files, but its accuracy was not good. Here, accuracy means the accuracy of identifying the 20 trained words across different speakers.

So I divided the recorded wav files so that each new wav file contains a single word only, e.g.

Speaker1_1.wav -> WORD1
Speaker1_2.wav -> WORD2
Speaker1_3.wav -> WORD3
.
.
.
Speaker1_20.wav -> WORD20
Speaker2_1.wav -> WORD1
Speaker2_2.wav -> WORD2
.
.
.
Speaker10_20.wav -> WORD20

I trained Julius using these new files, and its accuracy improved. Again, accuracy means the accuracy of identifying the 20 trained words across different speakers.

If the spoken word is not among the 20 trained words, Julius still returns one of the 20 words with some cmscore (confidence score) value. I noticed that this score is not reliable enough to reject unknown words.
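
For reference, this is roughly how I apply a threshold to the cmscore values (a sketch only: the 0.9 cutoff and the parsing of the standard "sentence1:" / "cmscore1:" output lines are just what I tried):

    import re

    THRESHOLD = 0.9  # illustrative cutoff; needs tuning on held-out data

    def accept(julius_output):
        # Julius prints the best hypothesis and per-word confidences, e.g.:
        #   sentence1: <s> WORD3 </s>
        #   cmscore1: 1.000 0.842 1.000
        words = re.search(r"sentence1:\s*(.*)", julius_output)
        scores = re.search(r"cmscore1:\s*(.*)", julius_output)
        if not words or not scores:
            return None
        tokens = words.group(1).split()
        cms = [float(s) for s in scores.group(1).split()]
        for token, cm in zip(tokens, cms):
            if token in ("<s>", "</s>"):
                continue  # skip the sentence/silence markers
            if cm < THRESHOLD:
                return None  # reject: a word scored below the cutoff
        return tokens

Even with this kind of thresholding, out-of-vocabulary words often come back with high scores.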

To handle this scenario, I decided to add more words to my dictionary file (currently it contains only 20 words). I downloaded the required training data from the VoxForge site (approx. 2000 wav files from different speakers) and trained Julius with them together with my current data (which contains the training for my 20 words). But when I ran Julius with the new training model, I got the following warning message about the hypothesis stack.

WARNING: 00 _default: hypothesis stack exhausted, terminate search now

I searched the Julius forum for this problem. Please refer to the following thread for more information:
http://julius.sourceforge.jp/forum/viewtopic.php?f=6&t=42

I tried all of those suggestions, but I am still getting this warning message.
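
For reference, the search-tuning options discussed in that thread look roughly like this in a jconf file (the values below are only illustrative starting points, not the exact ones suggested there):

    -b 1000    # first-pass beam width (larger beam = slower, less pruning)
    -s 2000    # second-pass hypothesis stack size
    -m 8000    # hypothesis overflow threshold before the search terminates
    -n 5       # number of sentence hypotheses to find before stopping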

I then trained Julius again with the downloaded data (this time only 600 wav files) and my current data (which contains the training for my 20 words). Julius started without the above warning message, but with much worse accuracy.

I also noticed one more thing: when my training dictionary has multiple pronunciations for the same word, Julius's accuracy decreases.
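
By multiple pronunciations I mean duplicate dictionary entries like these (the word and phones are only illustrative):

    EITHER  [EITHER]  iy dh er
    EITHER  [EITHER]  ay dh er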

Can anyone suggest what I should do next? Has anyone been through a similar scenario?

--- (Edited on 10/12/2009 11:39 pm [GMT-0500] by Visitor) ---

Re: Julius’s behavior understanding
User: kmaclean
Date: 10/13/2009 9:49 am
Views: 111
Rating: 5

Hi sagar,

>I recorded those 20 words from different speakers (currently 10). I have
>a single wav file per speaker, and each wav file contains the 20 words I
>need to identify.

You need more than just one recording from each speaker saying your twenty words.  Add more recordings until recognition improves (start with 10 or 20 recordings from each speaker, and increase from there...).

If you want to create a truly speaker-independent acoustic model, you will need far more than 10 speakers... (I am guessing at least 50-100 speakers).

>So I divided the recorded wav files so that each new wav file contains
>a single word only, e.g.

Are you training phone-based HMMs or whole-word HMMs?
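
(For context: in a phone-based setup, the HTK pronunciation dictionary transcribes each word as a sequence of phone HMMs, while a whole-word setup maps each word to a single HMM. The entries below are only illustrative:

    CALL  [CALL]  k ao l
    CALL  [CALL]  CALL

The first decomposes "call" into three phone models; the second points the word at one whole-word model.)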

>If the spoken word is not among the 20 trained words, Julius still
>returns one of the 20 words with some cmscore (confidence score) value.

See this post: One word grammar, always recognized?

>I decided to add more words to my dictionary file (currently it
>contains only 20 words).

I am actually surprised that HTK allowed you to compile an acoustic model with only 20 different words... Are you compiling monophone or triphone acoustic models? Are you following the VoxForge acoustic model tutorial?
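
(For context, take an illustrative word transcribed as the phones "k ao l". A monophone model set has one context-independent HMM per phone, while a triphone set expands them into context-dependent units:

    k ao l              monophones
    k+ao k-ao+l ao-l    word-internal triphones

where "k-ao+l" means "ao" preceded by "k" and followed by "l".)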

Is your target language English?  You should avoid mixing English audio data from VoxForge with another language for training purposes...

Ken

--- (Edited on 10/13/2009 10:49 am [GMT-0400] by kmaclean) ---

Re: Julius's behavior understanding
User: sagar
Date: 10/14/2009 1:05 am
Views: 131
Rating: 6

Hi Ken,

Thanks for the reply.

> Are you training phone-based HMMs or whole-word HMMs?
I am using phone-based HMMs.

> See this post: One word grammar, always recognized?
Yes, I have seen that post. Can I reject words (which are not present in the dictionary) based on the grammar? The filler word concept could satisfy my requirement, for example something like the sketch below. Is there any other way?
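
For example, I was thinking of a grammar with a filler category along these lines (a sketch only; the GARBAGE word and its phone string are made up):

    # sample.grammar
    S   : NS_B UTT NS_E
    UTT : KEYWORD
    UTT : FILLER

    # sample.voca
    % KEYWORD
    CALL     k ao l
    % FILLER
    GARBAGE  sil
    % NS_B
    <s>      sil
    % NS_E
    </s>     sil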

>I am actually surprised that HTK allowed you to compile an acoustic model with only 20 different words... Are you compiling monophone or triphone acoustic models? Are you following the VoxForge acoustic model tutorial?
I am using a triphone acoustic model, and I am able to create an acoustic model for the 20 words using the 10 steps given at http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/download

>Is your target language English?
Yes, my target language is English. There is no language mixing in the audio samples; all downloaded samples were checked before starting the training process.

 

--- (Edited on 10/14/2009 1:05 am [GMT-0500] by Visitor) ---

Re: Julius's behavior understanding
User: kmaclean
Date: 10/15/2009 1:39 pm
Views: 354
Rating: 4

Hi sagar,

>Can I reject words (which are not present in the dictionary) based on
>the grammar?

No, you will need to add them to your pronunciation dictionary.

In Step 10 you created an acoustic model covering your entire pronunciation dictionary, which is usually much larger than your training set.  This is why you were able to create acoustic models with only 20 different words: your pronunciation dictionary contained only the words in your training set.

Steps 1 to 9 only create acoustic models for the words in your training set.  Step 10 creates a "tiedlist" file that maps the "unseen" triphones in the larger pronunciation dictionary to the actual triphones in your smaller training set.  Depending on the words in your out-of-grammar list, you may also need to provide speech audio training data for these new out-of-grammar words... see Step 2 - Pronunciation Dictionary.
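
For reference, the tiedlist is a plain text file: a line with a single name is a physical triphone that was actually trained from your data, while a line with two names maps an "unseen" logical triphone onto a trained physical one. The triphone names below are made up:

    ao-l+d
    ao-l+t ao-l+d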

>Is there any other way?

Better acoustic models might help.

Ken

--- (Edited on 10/15/2009 2:39 pm [GMT-0400] by kmaclean) ---

Re: Julius's behavior understanding
User: Ravishanker
Date: 7/26/2010 11:39 am
Views: 106
Rating: 7

Hi Ken


I need an application that can reject words not in the vocabulary file of Julius. When I speak a word that is not in the vocabulary, it still identifies it as some other word in the vocabulary, with a very high confidence value. I am wondering what the reason could be. Is there any other way I can identify such words without using the confidence value? Thanks

--- (Edited on 7/26/2010 12:21 pm [GMT-0500] by Ravishanker) ---

Re: Julius's behavior understanding
User: kmaclean
Date: 7/28/2010 2:32 pm
Views: 2437
Rating: 6

See this post: One word grammar, always recognized?

--- (Edited on 7/28/2010 3:32 pm [GMT-0400] by kmaclean) ---
