German Speech Files

first machine-generated pronunciation lexicon
User: timobaumann
Date: 12/12/2007 6:47 am


I've just used a few scripts to generate a lexicon for all the words (5088 in total) contained in the submissions up to now. We will now have to start to review these... 

NB: As always, this is GPLed, even though I didn't write it explicitly.

voxDE20071211.pr0n.gz

Re: first machine-generated pronunciation lexicon
User: kmaclean
Date: 12/12/2007 8:36 am

Hi timobaumann,

Great job!

I added your lexicon to the Subversion repository.  You can see it using the Trac front-end at: 

Or using the Subversion HTTP front-end at:

If you have a Subversion client, you can check out the lexicon using this last URL.




Special characters and numbers in the German dictionary
User: ralfherzog
Date: 12/12/2007 8:53 am
Hello Timo,

I am pretty impressed.  Over 5000 words in the first version of your dictionary.  Here are my thoughts:

1. What about the special characters (first seven lines of the dictionary)?  At the moment, there is no pronunciation included.

2. What about the numbers (lines 8 to 60)?  Should we use them, or should we discard them?

3. What should we do with line 65 (German word: "aber") and line 66 (German word: "Aber")? Do we need both of them?

Greetings, Ralf

Re: Special characters and numbers in the German dictionary
User: timobaumann
Date: 12/20/2007 2:40 am
Hi Ralf,

well, no reason to be impressed. It was you who read those 3400 utterances, composed of 26000 word tokens, made up of 5000 individual word types. It's your countless hours of work that really help the project.

Now, praise aside:

You are right, the lines with the special characters should be removed. They are just there because of the scripting I did (I will write more about that when I find the time).

Same with upper-/lowercase words. It would be best not to discard this information completely by uppercasing/lowercasing everything; we will need some slightly smart way of figuring this out. For the moment it would probably suffice to always join the two cases.
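
A minimal sketch of what "joining the two cases" could look like, assuming the lexicon is held as a mapping from spelling to a list of pronunciations. The function name `join_case_variants`, the data layout and the transcription used in the example are illustrative assumptions, not the format of the .pr0n file. The original spellings are kept so that the case information is not thrown away.

```python
def join_case_variants(lexicon):
    """Merge entries that differ only in case, keeping every original spelling.

    `lexicon` maps a spelling to a list of pronunciations (assumed layout).
    """
    merged = {}
    for word, prons in lexicon.items():
        key = word.lower()  # "Aber" and "aber" share one entry
        entry = merged.setdefault(key, {"spellings": set(), "prons": set()})
        entry["spellings"].add(word)   # remember which cases were seen
        entry["prons"].update(prons)   # union of all pronunciations
    return merged


# Placeholder transcription, not taken from the lexicon file.
example = {"aber": ["Qa:b6"], "Aber": ["Qa:b6"]}
joined = join_case_variants(example)
```

Here `joined` has a single key, "aber", that still records both spellings.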

The numbers shouldn't be discarded. They should be handled differently, though. We need some number normalization (5 -> fünf, 85 -> fünf und achtzig) and then just keep the word tokens forming the numbers. (NB: This is not strictly correct, because "fünf und achtzig" is /fYnf QUnt QaxtsiC/, while "85" should be /fYn vUn daxtsiC/; there is no Auslautverhärtung (final devoicing), because there are no Auslauts (word-final positions) within the number).
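
The normalization step could be sketched roughly as follows. This only covers 0 to 99 and leaves larger numbers untouched; the function names `number_to_tokens` and `normalize_numbers` are made up for illustration. As noted above, splitting into separate word tokens is an approximation, since the pronunciation of the spoken number differs from that of the isolated words.

```python
import re

UNITS = ["null", "eins", "zwei", "drei", "vier",
         "fünf", "sechs", "sieben", "acht", "neun"]
TEENS = ["zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def number_to_tokens(n):
    """Return the German word tokens for an integer in the range 0..99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 are handled in this sketch")
    if n < 10:
        return [UNITS[n]]
    if n < 20:
        return [TEENS[n - 10]]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return [TENS[tens]]
    # German puts the unit first; "eins" becomes "ein" inside a number.
    unit_word = "ein" if unit == 1 else UNITS[unit]
    return [unit_word, "und", TENS[tens]]


def normalize_numbers(text):
    """Replace digit sequences up to 99 by their word tokens."""
    def repl(match):
        n = int(match.group())
        if n > 99:
            return match.group()  # out of scope here, leave as is
        return " ".join(number_to_tokens(n))
    return re.sub(r"\d+", repl, text)


print(normalize_numbers("85 Euro"))  # fünf und achtzig Euro
```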

If we start to build (as in "with our own hands") a *serious* pronunciation resource, we will have to put some thought into how we organize our data. A good starting point for the storage of pronunciation lexicon data is the W3C Pronunciation Lexicon Specification ( ).
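
To give an idea of what such storage could look like, here is a small sketch that writes a few entries as a PLS document using Python's standard library. The element names (`lexicon`, `lexeme`, `grapheme`, `phoneme`) and the namespace come from PLS 1.0; the helper `build_pls`, the alphabet label "x-sampa" and the input layout are assumptions made for this example. The transcriptions are the ones quoted above.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="de", alphabet="x-sampa"):
    """Serialize {spelling: [pronunciations]} as a PLS lexicon string.

    The alphabet value is an assumed vendor-style label, not a PLS default.
    """
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for spelling, prons in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = spelling
        for pron in prons:  # one <phoneme> per pronunciation variant
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = pron
    return ET.tostring(lexicon, encoding="unicode")


pls_xml = build_pls({"fünf": ["fYnf"], "und": ["QUnt"], "achtzig": ["QaxtsiC"]})
print(pls_xml)
```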

I think, though, that our manual work should be on a more structured level: separated by noun, verb and adjective (thus allowing us to automatically guess all the different forms), leaving traces (so that errors can be traced and corrected in the base form), and several more things. It's probably smartest to build a tool that helps us manage our yet-to-be-built great pr0nlex.

All this takes time, which I will not have before January. My next steps will be to improve and publish the process by which I extracted the lexicon above and then to think about how we can start to manually review, sort and improve the lexicon.

Have a nice Christmas!

pronunciation lexicon specification
User: ralfherzog
Date: 12/20/2007 4:48 am
Hello Timo,

Countless hours aren't necessary to create the prompts.  I just dictate them using NaturallySpeaking 9.5.  Afterwards, I use a macro with Notepad++ to insert the line numbers in a semi-automated process.  I could submit many more prompts if needed.

Maybe we could treat the special characters as normal words?  So we wouldn't need the special characters anymore.  Instead, we could use the words "Bindestrich", "Komma", "Ausrufezeichen", "Fragezeichen", "Punkt", "Paragraph" and the corresponding pronunciation.
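
As a rough sketch, the replacement could be a simple lookup applied to the tokens of a prompt. Which characters actually appear in the first lines of the dictionary is an assumption here, as is the pairing of each character with one of the words above; `spell_out_punctuation` is a hypothetical helper name.

```python
# Assumed mapping from special character to the word that is spoken.
SPOKEN_FORM = {
    "-": "Bindestrich",
    ",": "Komma",
    "!": "Ausrufezeichen",
    "?": "Fragezeichen",
    ".": "Punkt",
    "§": "Paragraph",
}


def spell_out_punctuation(tokens):
    """Replace special-character tokens by ordinary words; keep the rest."""
    return [SPOKEN_FORM.get(token, token) for token in tokens]


spoken = spell_out_punctuation(["Hallo", ",", "Welt", "!"])
```

The lexicon would then only need ordinary entries for these six words.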

Maybe you are right.  The numbers can be handled differently.  In Dragon NaturallySpeaking, there is a "dictation mode" and a "numbers mode."  In the long term we may need different categories for both kinds of words.

Yes, we need to find a solution for the storage of the pronunciation lexicon data.  I will take a look at the Pronunciation Lexicon Specification.  It is necessary that other people are able to use our work without barriers.  To achieve this goal we may have to look at specifications like the Speech Recognition Grammar Specification, VoiceXML, and Semantic Interpretation for Speech Recognition.  We should try to eliminate any barrier, and this means employing standards. If we do employ standards, we have a chance to become a serious pronunciation resource.

Nice Christmas (I will be here anyway)!