Acoustic Model Discussions

Single gram representation.
User: DooGood
Date: 2/21/2007 11:07 am

I am pretty new to speech recognition, but I don't clearly see why single gram phones cannot cover pretty much all words.  Why do we need N-grams?  Is it simply for word separation? 


--- (Edited on 2/21/2007 11:07 am [GMT-0600] by DooGood) ---

Re: Single gram representation.
User: kmaclean
Date: 2/21/2007 12:56 pm

Hi DooGood,

Not sure I understand your question ...

From Wikipedia:

An n-gram is a sub-sequence of n items from a given sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis. The items in question can be letters, words or base pairs according to the application.
An n-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram" or "(n − 1)-order Markov model".

In a Speech Recognition context, n-grams usually describe word sequences and are used to create Language Models.
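To make the definition concrete, here is a minimal sketch of pulling word n-grams out of a sentence. The function name `ngrams` and the sample sentence are made up for illustration; a real Language Model toolkit would also count the n-grams and estimate probabilities from them, which this does not attempt.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample sentence, split into words.
words = "please translate this sentence".split()

print(ngrams(words, 1))  # unigrams: one word at a time
print(ngrams(words, 2))  # bigrams: each word with the word that follows it
```

The same function works on any sequence, so the "items" could just as well be letters or phones instead of words.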

Acoustic Models, on the other hand, use phonemes (also called 'phones'). A single phone is a monophone, and a sequence of 3 phones is called a triphone. In the CMU dictionary, which has close to 130,000 word pronunciations, there are only 43 phones, but there are close to 6000 triphones.

As set out in Step 9 of the Voxforge tutorial, a triphone declaration is in the form "L-X+R".   The "L" phone  (i.e. the left-hand phone) is the phone that precedes the "X" phone and the "R" phone (i.e. the right-hand phone) is the one that follows it. 
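As a rough illustration of the "L-X+R" form, the sketch below splits a label into its left, centre and right phones. The helper name `parse_triphone` is hypothetical, and it assumes that phone symbols never contain "-" or "+" themselves.

```python
def parse_triphone(label):
    """Split an 'L-X+R' label into (left, centre, right).

    A missing left or right context (for example at the start or
    end of a word) is returned as None.
    Assumption: phone symbols contain neither '-' nor '+'.
    """
    left, right = None, None
    if "-" in label:
        left, label = label.split("-", 1)
    if "+" in label:
        label, right = label.split("+", 1)
    return left, label, right


print(parse_triphone("t-r+@"))  # full triphone: left, centre, right
print(parse_triphone("t+r"))    # no left context
print(parse_triphone("e-t"))    # no right context
```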

Below is an example of a monophone declaration and a triphone declaration of the word "TRANSLATE" (the first line shows the "monophone" declaration, and the second line shows the "triphone" declaration):

TRANSLATE [TRANSLATE] t r @ n s l e t
TRANSLATE [TRANSLATE] t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t
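The step from the first line to the second can be sketched in a few lines of Python. This is only an illustration of the word-internal expansion shown above, not the tool the tutorial itself uses, and the function name `to_triphones` is made up.

```python
def to_triphones(phones):
    """Expand a list of monophones into word-internal 'L-X+R' labels.

    The first phone has no left context and the last phone has no
    right context, so those labels are shorter.
    """
    labels = []
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = phones[i - 1] + "-" + label
        if i < len(phones) - 1:
            label = label + "+" + phones[i + 1]
        labels.append(label)
    return labels


monophones = "t r @ n s l e t".split()
print(" ".join(to_triphones(monophones)))
# prints: t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t
```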

Using a triphone-based Acoustic Model improves recognition accuracy. With monophone Acoustic Models, we are not looking at the 'context' of the monophone. The Speech Recognition Engine (SRE) is trying to match the sound that it has heard to a single phone, i.e. a single sound.

With a triphone acoustic model, we are essentially looking for a monophone in the "context" of other monophones - i.e. the one immediately before and the one immediately after (if they exist - it may be the beginning or end of the word). This greatly improves recognition accuracy, because the SRE is looking to match a specific sequence of 3 sounds together (a triphone), rather than only one sound. This is like using a 3-word Google search rather than a single-word Google search - you get more accurate results. Triphones reduce the possibility of error caused by confusing one sound with another, because we are now looking for a distinct sequence of 3 sounds.


--- (Edited on 2/21/2007 1:56 pm [GMT-0500] by kmaclean) ---

Re: Single gram representation.
User: DooGood
Date: 2/21/2007 1:44 pm

Forgive me... a mix-up in my vocabulary there! What I meant was, why can't we just use monophones to deduce each word, which was answered. I guess I understand the reason for having the bi- and triphones, but it just seems redundant. Accuracy is everything in speech recognition though, and I suppose accuracy at the cost of redundancy is well worth it.


Thanks for the quick reply.   I am sure I will have more questions coming soon. 

--- (Edited on 2/21/2007 1:44 pm [GMT-0600] by DooGood) ---