Librivox contributions and dates/numbers
User: colbec
Date: 5/18/2012 10:55 am
Views: 5481
Rating: 10

In reviewing a possible audio file I came across a lot of dates in one section, 1800, 1839 and so on.

This raises the issue of whether, in a prompt context, it is better to deal with these numbers at the text2prompts stage (ensuring that 1800 becomes "eighteen hundred", for example) or to include '1800' as a separate word in the lexicon.

The downside of the latter is that potentially you end up with a lot of numbers in your lexicon, eventually more numbers than words. The pre-treatment seems to be more efficient.
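
A minimal sketch of the pre-treatment option, assuming Python and four-digit years from 1000 to 1999 read in the usual date style. The names `year_to_words` and `expand_years` are hypothetical and not part of any VoxForge tool, and the result still has to be checked against what the reader actually says in the audio.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99 as space-separated words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_words(year):
    """Spell a year in 1000..1999 as it is commonly read aloud as a date."""
    if year % 1000 == 0:
        return f"{ONES[year // 1000]} thousand"
    century, rest = divmod(year, 100)
    if rest == 0:
        return f"{two_digits(century)} hundred"
    if rest < 10:
        return f"{two_digits(century)} oh {ONES[rest]}"
    return f"{two_digits(century)} {two_digits(rest)}"


def expand_years(text):
    """Replace every standalone 1xxx number with its spoken date form.

    Assumption: any such number is a year. A quantity like "1500 men"
    would also be rewritten, so real text needs a manual check.
    """
    return re.sub(r"\b1\d{3}\b",
                  lambda m: year_to_words(int(m.group())), text)


print(expand_years("in the year 1800"))
```

With this approach the lexicon only needs ordinary words such as "eighteen" and "hundred", at the cost of committing to one reading per number before the audio has been checked.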

Is there an industry standard or even Voxforge preference for this?

--- (Edited on 5/18/2012 10:55 am [GMT-0500] by colbec) ---

Re: Librivox contributions and dates/numbers
User: ralfherzog
Date: 5/19/2012 2:06 pm
Views: 144
Rating: 8

In my opinion, it is better if '1800 becomes "eighteen hundred" for example'. If you have a pronunciation dictionary with an entry "1800" you would need two corresponding pronunciations: "eighteen hundred" and "one thousand eight hundred". So in the end, "eighteen hundred" is less complicated.

I don't think that there is anything like an "industry standard". You can handle it both ways. There is no right or wrong.

--- (Edited on 2012-05-19 2:06 pm [GMT-0500] by ralfherzog) ---

Re: Librivox contributions and dates/numbers
User: colbec
Date: 5/19/2012 3:20 pm
Views: 230
Rating: 10

Thanks for your thoughts, Ralf.

In a Librivox context, you are trying to match the audio (which is fixed) to the text (which is editable). In the case of 'in the year 1800', the voice either says 'in the year eighteen hundred' (probably more common) or 'in the year one thousand eight hundred' (still possible but not as likely to my ear). And the text has to match that.

Since Voxforge has submissions coming in from multiple sources (Librivox readers may tend to treat numbers/dates differently) it might be helpful to have a guideline, numbers as numbers or numbers as words. It would probably make life simpler for the guy at the other end to have the cards in order.

I know it is possible to have two pronunciations for any word, but I wonder if confusion arises if you have two words for any pronunciation. Homophones are the enemy of accurate recognition.

Once 1800 is stored as 'eighteen' and 'hundred' it might make for fewer entries in the lexicon, and this is good provided the lexicon is a one-way function: you build it and it remains a simple word-by-word reference. However, if you are given only the lexicon and ask the question "Has 1800 been used in the prompts?", you can no longer get an answer, so information has been lost; you can only get it from the prompts file. Generally speaking you have both files, so the loss of information is not important.

My inclination is to store numbers and dates as separate words.

--- (Edited on 5/19/2012 3:20 pm [GMT-0500] by colbec) ---

Re: Librivox contributions and dates/numbers
User: TonyR
Date: 5/19/2012 5:08 pm
Views: 243
Rating: 10

"I know it is possible to have two pronunciations for any word, but I wonder if confusion arises if you have two words for any pronunciation. Homophones are the enemy of accurate recognition."

You have to distinguish between acoustic model training and recognition. In acoustic model training the right way to do this is to introduce sufficient variations that the Viterbi alignment picks the right one.

 

Tony

Dr Tony Robinson, Founder Cantab Research Ltd

 

--- (Edited on 19-May-2012 11:08 pm [GMT+0100] by TonyR) ---

Re: Librivox contributions and dates/numbers
User: colbec
Date: 5/20/2012 6:03 am
Views: 136
Rating: 9

Thanks for that, Dr Tony. Of course, the topic of the thread is building an acoustic model, in this case using Librivox data. As I understand it, we are giving the model builder as many different opportunities to hear triphones as possible, and the triphone patterns are defined in the lexicon.

But the lexicon is also the palette from which you build a grammar. So any decision taken related to the lexicon has implications in a later process, recognition.

If my lexicon contains SAY s eh, ONE w uh n, WON w uh n, TO t uw, TWO t uw, and I try to use prompts SAY ONE, SAY WON, SAY TO, SAY TWO, I find it hard to imagine that even given an infinite amount of data a recognizer would ever be able to do better than 50-50 on SAY ONE. Perhaps I am wrong here?

My example is a bit academic; you can design your grammar with the lexicon weaknesses in mind, or ask the recognizer for the top two possibilities and deal with the outcome in the dialogue manager (DM).

--- (Edited on 5/20/2012 6:03 am [GMT-0500] by colbec) ---

Re: Librivox contributions and dates/numbers
User: TonyR
Date: 5/20/2012 11:50 am
Views: 158
Rating: 9

"If my lexicon contains SAY s eh, ONE w uh n, WON w uh n, TO t uw, TWO t uw, and I try to use prompts SAY ONE, SAY WON, SAY TO, SAY TWO, I find it hard to imagine that even given an infinite amount of data a recognizer would ever be able to do better than 50-50 on SAY ONE. Perhaps I am wrong here?"

The acoustic model doesn't care about the words, just the phones, so it is trained on /s eh w uh n/. The language model says that given the recognised phones /s eh w uh n/ you output SAY ONE.
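
A toy sketch of that division of labour, assuming Python. The lexicon entries are the ones quoted above; the bigram scores are invented purely for illustration and the function names are hypothetical. The lexicon alone cannot separate ONE from WON, so the choice is made by the word-sequence score.

```python
# Pronunciations exactly as given in the example lexicon above.
LEXICON = {
    "SAY": ("s", "eh"),
    "ONE": ("w", "uh", "n"),
    "WON": ("w", "uh", "n"),
    "TO": ("t", "uw"),
    "TWO": ("t", "uw"),
}

# Made-up bigram log-probabilities (assumption, not measured data).
BIGRAM_LOGPROB = {
    ("SAY", "ONE"): -1.0,
    ("SAY", "WON"): -4.0,
    ("SAY", "TWO"): -1.2,
    ("SAY", "TO"): -3.0,
}


def homophones(phones):
    """All lexicon words whose pronunciation equals the given phones."""
    target = tuple(phones)
    return sorted(w for w, p in LEXICON.items() if p == target)


def pick_word(previous, phones):
    """Choose among homophones using the score of (previous, word)."""
    candidates = homophones(phones)
    return max(candidates,
               key=lambda w: BIGRAM_LOGPROB.get((previous, w), float("-inf")))


print(homophones(["w", "uh", "n"]))        # both ONE and WON match
print(pick_word("SAY", ["w", "uh", "n"]))  # the bigram score decides
```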

The topic of the thread was really different pronunciations for words, so a better example might have been "SAY 101", which could be pronounced as "SAY ONE OH ONE", "SAY ONE HUNDRED AND ONE" or "SAY ONE HUNDRED ONE". In general it is better to allow all reasonable pronunciations and let the Viterbi alignment pick the best.
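
As a rough illustration, assuming Python and three-digit numbers only, the candidate readings for a number such as 101 could be generated like this. The function name `readings` is hypothetical, and the list covers only the three styles mentioned above.

```python
ONES = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN",
        "EIGHT", "NINE", "TEN", "ELEVEN", "TWELVE", "THIRTEEN", "FOURTEEN",
        "FIFTEEN", "SIXTEEN", "SEVENTEEN", "EIGHTEEN", "NINETEEN"]
TENS = ["", "", "TWENTY", "THIRTY", "FORTY", "FIFTY", "SIXTY", "SEVENTY",
        "EIGHTY", "NINETY"]


def below_hundred(n):
    """Spell out 1..99 as space-separated upper-case words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def readings(n):
    """Candidate spoken forms for a number in 100..999."""
    if not 100 <= n <= 999:
        raise ValueError("this sketch only handles 100..999")
    # Digit-by-digit reading, with zero spoken as OH.
    digits = ["OH" if d == "0" else ONES[int(d)] for d in str(n)]
    out = [" ".join(digits)]
    hundreds, rest = divmod(n, 100)
    if rest == 0:
        out.append(f"{ONES[hundreds]} HUNDRED")
    else:
        out.append(f"{ONES[hundreds]} HUNDRED AND {below_hundred(rest)}")
        out.append(f"{ONES[hundreds]} HUNDRED {below_hundred(rest)}")
    return out


for r in readings(101):
    print("SAY", r)
```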

 

Tony

-- 
Dr Tony Robinson, Founder Cantab Research Ltd

--- (Edited on 20-May-2012 5:50 pm [GMT+0100] by TonyR) ---

Re: Librivox contributions and dates/numbers
User: colbec
Date: 5/20/2012 12:24 pm
Views: 228
Rating: 9

Tony, since I was challenged on a particular detail of one of my responses I tailored the example to suit that situation. I maintain that a recognizer would have a problem distinguishing between SAY ONE and SAY WON since they have identical phoneme representations. But I don't have any data to back that up, so I will leave it as a claim for now, ready to be withdrawn later in the face of evidence.

Your example is a good one to bring us back to the original topic. Of course, in a Librivox context the audio will already have been decided for you: TEN SIXTY SIX or ONE THOUSAND [AND] SIXTY SIX.

So the master lexicon will need to contain a subset of 1066, TEN, SIXTY, SIX, THOUSAND, AND, ONE, TEN_SIXTY_SIX and so on. I'm just concerned that the 1066/TEN_SIXTY_SIX approach (which of course is perfectly valid) for numbers means a perpetually growing lexicon.

--- (Edited on 5/20/2012 12:24 pm [GMT-0500] by colbec) ---

Re: Librivox contributions and dates/numbers
User: TonyR
Date: 5/20/2012 12:33 pm
Views: 2108
Rating: 11

"I maintain that a recognizer would have a problem distinguishing between SAY ONE and SAY WON since they have identical phoneme representations"

This is the job of the language model that is used at run time and has nothing to do with the acoustic model training.

"So the master lexicon will need to contain a subset of 1066, TEN, SIXTY, SIX, THOUSAND, AND, ONE, TEN_SIXTY_SIX and so on. I'm just concerned that the 1066/TEN_SIXTY_SIX approach (which of course is perfectly valid) for numbers means a perpetually growing lexicon."

You can create a network that has branches for all the pronunciations of 1066 in terms of ordinary words and then force align on that. In this way you keep the pronunciation dictionary small. But even if you have tens of thousands of hours of data to align, it's not too bad to expand out 1066 and all other troublesome cases many times in the dictionary used for alignment.
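
A small sketch of the branching idea, assuming Python. It only enumerates the word-level paths through such a network; picking the path that matches the audio needs a forced aligner with an acoustic model, which is not shown. The `READINGS` table is hand-written for this example and the function names are hypothetical.

```python
from itertools import product

# Hand-written alternative readings for the troublesome token (assumption).
READINGS = {
    "1066": [
        ["TEN", "SIXTY", "SIX"],
        ["ONE", "THOUSAND", "SIXTY", "SIX"],
        ["ONE", "THOUSAND", "AND", "SIXTY", "SIX"],
    ],
}


def expand_paths(tokens, readings=READINGS):
    """Return every word sequence the prompt could correspond to.

    Tokens without an entry in `readings` contribute a single branch.
    """
    slots = [readings.get(tok, [[tok]]) for tok in tokens]
    return [[w for branch in combo for w in branch]
            for combo in product(*slots)]


def vocabulary(paths):
    """Words the pronunciation dictionary must cover for these paths."""
    return sorted({w for path in paths for w in path})


paths = expand_paths("IN THE YEAR 1066".split())
for p in paths:
    print(" ".join(p))
print(vocabulary(paths))
```

Note that "1066" itself never appears in the resulting vocabulary: only ordinary words need dictionary entries, which is why the dictionary stays small.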

 

Tony

-- 

Dr Tony Robinson, Founder Cantab Research Ltd

--- (Edited on 20-May-2012 6:33 pm [GMT+0100] by TonyR) ---
