VoxForge
Re: Librivox contributions and dates/numbers
Thanks for your thoughts, Ralf.
In a Librivox context, you are trying to match the audio (which is fixed) to the text (which is editable). In the case of 'in the year 1800', the voice says either 'in the year eighteen hundred' (probably more common) or 'in the year one thousand eight hundred' (still possible, but less likely to my ear), and the text has to match that.
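To make the two readings concrete, here is a rough Python sketch that spells out a four-digit number both ways. The function names are my own invention for illustration and are not part of any VoxForge or Librivox tool; it only covers 1000 to 9999, and which reading applies still has to be decided by listening to the audio.

```python
# Illustrative only: helper names are hypothetical, range is limited to 1000..9999.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def hundreds_reading(n):
    """Read n as a count of hundreds, e.g. 1800 -> 'eighteen hundred'."""
    hundreds, rest = divmod(n, 100)
    parts = [two_digits(hundreds), "hundred"]
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)

def thousands_reading(n):
    """Read n as thousands plus hundreds, e.g. 1800 -> 'one thousand eight hundred'."""
    thousands, rest = divmod(n, 1000)
    parts = [ONES[thousands], "thousand"]
    hundreds, rest = divmod(rest, 100)
    if hundreds:
        parts += [ONES[hundreds], "hundred"]
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)

print(hundreds_reading(1800))   # eighteen hundred
print(thousands_reading(1800))  # one thousand eight hundred
```

Either string could be pasted into the prompt text in place of '1800', depending on what the reader actually said.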
Since VoxForge has submissions coming in from multiple sources (and Librivox readers may tend to treat numbers and dates differently), it might be helpful to have a guideline: numbers as numbers, or numbers as words. It would probably make life simpler for the guy at the other end to have the cards in order.
I know it is possible to have two pronunciations for any word, but I wonder if confusion arises if you have two words for any pronunciation. Homophones are the enemy of accurate recognition.
Once 1800 is stored as 'eighteen' and 'hundred', it might make for fewer entries in the lexicon, and this is good provided the lexicon is a one-way function: you build it and it remains a simple word-by-word reference. However, if you have only the lexicon and ask the question "Has 1800 been used in the prompts?", you can no longer get an answer, so information has been lost; you can only get it from the prompts file. Generally speaking you have both files, so the loss of information is not important.
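A small Python sketch of what I mean by lost information. The prompt lines are made up, and the "lexicon" here is just the set of distinct words (a real lexicon also carries a pronunciation for each word), so treat it as an illustration under those assumptions rather than how the actual files are built.

```python
# Hypothetical prompt sets; neither comes from a real VoxForge submission.
with_year = [
    "in the year eighteen hundred",
    "eighteen men walked a hundred miles",
]
without_year = [
    "eighteen men walked a hundred miles in the year",
]

def build_lexicon(prompts):
    """Simplified lexicon: the set of distinct words seen in the prompts."""
    return {word for line in prompts for word in line.split()}

def phrase_used(prompts, phrase):
    """Answer the question from the prompts file itself."""
    return any(phrase in line for line in prompts)

# Both prompt sets give exactly the same lexicon, so the lexicon alone
# cannot say whether 'eighteen hundred' (1800) was ever spoken.
print(build_lexicon(with_year) == build_lexicon(without_year))  # True

# The prompts still hold the answer.
print(phrase_used(with_year, "eighteen hundred"))     # True
print(phrase_used(without_year, "eighteen hundred"))  # False
```

If the prompts had kept '1800' as a single token, it would show up in the lexicon directly, at the cost of one more entry per distinct number.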
My inclination is to store numbers and dates as separate words.
--- (Edited on 5/19/2012 3:20 pm [GMT-0500] by colbec) ---