Re: Murmelthier vs. Murmeltier
User: Robin
Date: 5/8/2010 3:59 am
Hi Ralf,

I fully agree with you that it would not make sense to add words in the old spelling to a pronunciation dictionary. Even if that were acceptable from a performance perspective -- and let's assume it would be -- you would end up with a dictionary that could be around 50% larger than necessary.

I have posted a question about this in the general part of the forums and I hope that someone will come up with a good solution.



German (Hochdeutsche Mundart, year 1798) PLS dictionary
User: ralfherzog
Date: 5/8/2010 6:24 am
Hi Robin!

"add words in the old spelling to a pronunciation dictionary" - We can develop a separate PLS dictionary with old spelling. The name of the dictionary could be something like "German (Hochdeutsche Mundart, year 1798) PLS dictionary". The phonemes can be created by eSpeak (it shouldn't be a problem if words with old spelling like "Murmelthier" appear - the corresponding phoneme element would be acceptable).

At the moment, I am developing an XSLT stylesheet that should automatically convert any PLS dictionary containing eSpeak phonemes into one with IPA phonemes (e.g. Ralf's Dutch dictionary currently contains eSpeak phonemes). This XSLT stylesheet could of course also be used to create a "German (Hochdeutsche Mundart, year 1798) PLS dictionary" with IPA phonemes.
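
To make the eSpeak-to-IPA idea concrete, here is a minimal sketch in Python rather than XSLT. It is not the stylesheet mentioned above, only an illustration of the same table-driven approach. The table `ESPEAK_TO_IPA` and the function `espeak_to_ipa` are hypothetical names, and the few entries are assumed from the Kirshenbaum-style ASCII conventions that eSpeak mnemonics are based on; a real conversion would need the complete, language-specific phoneme table.

```python
# Hypothetical, deliberately tiny mapping table (assumption: Kirshenbaum-style
# mnemonics). A real table must be checked against eSpeak's phoneme
# definitions for each language.
ESPEAK_TO_IPA = {
    "'": "ˈ",    # primary stress
    ",": "ˌ",    # secondary stress
    ":": "ː",    # length mark
    "@": "ə",    # schwa
    "S": "ʃ",
    "N": "ŋ",
    "Z": "ʒ",
    "tS": "tʃ",  # multi-character mnemonic, must be tried before "t" and "S"
}


def espeak_to_ipa(mnemonics, table=ESPEAK_TO_IPA):
    """Convert a mnemonic string to IPA using greedy longest-match lookup.

    Characters without a table entry are copied through unchanged.
    """
    keys = sorted(table, key=len, reverse=True)  # longest mnemonics first
    out = []
    i = 0
    while i < len(mnemonics):
        for key in keys:
            if mnemonics.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mnemonic matches at this position: keep the character as is.
            out.append(mnemonics[i])
            i += 1
    return "".join(out)
```

Trying the longest mnemonics first matters because some mnemonics are prefixes of others; without it, a two-character mnemonic would be split into two unrelated symbols.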

Each language (and each dialect) should have its own PLS dictionary (e.g. for the Spanish language, I am offering three different PLS dictionaries). And of course, even for the ancient language Latin there is a solution: Ralf's Latin dictionary. This concept is applicable to every language (even dead languages).

The conversion of the grapheme elements can be done later. We should go one step at a time. The copyright-free texts from Gutenberg have their own orthography. It is no problem to use this old orthography (1798) in a separate PLS dictionary.
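
For illustration, an entry with old spelling in such a separate PLS dictionary could be generated roughly as follows. This is a sketch under assumptions: the function name `build_pls`, the alphabet label "x-espeak", and the phoneme string are placeholders and not verified eSpeak output. The element names and the namespace follow the W3C PLS 1.0 format.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="de", alphabet="x-espeak"):
    """Return a PLS lexicon as an XML string.

    entries is a list of (grapheme, phoneme) pairs. The alphabet value
    "x-espeak" is a hypothetical vendor-style label, not an official name.
    """
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, phoneme in entries:
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(lexicon, encoding="unicode")


# Placeholder transcription, for illustration only.
print(build_pls([("Murmelthier", "m'Urm@lti:r")]))
```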

Later, we can try to do a conversion from old orthography (1798) to current orthography (2010).

Of course, it would theoretically be possible to add old orthography to the PLS dictionary with current orthography (for performance reasons, the total size of the dictionary should not exceed 800,000 words). But this would be too confusing. A better solution is to develop a separate PLS dictionary.

In short:

Step 1: Create "German (Hochdeutsche Mundart, year 1798) PLS dictionary". Create the phonemes with eSpeak. Convert the eSpeak phonemes into IPA phonemes using an XSLT stylesheet.

Step 2: Convert the grapheme elements with old orthography (1798) into current orthography (2010).

Step 3: Convert the Gutenberg text "Alice's Abenteuer im Wunderland" (old orthography from 1798) into current orthography (2010).
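
The last two steps could be sketched with a simple word list. This is only an illustration in Python, not an XSLT solution: the table `OLD_TO_NEW` and the function `modernize` are hypothetical, and the three entries are examples. A whole-word lookup is used instead of a blanket rule such as "th becomes t", because such a rule would also change words like "Theater" that keep their "th" in current orthography.

```python
import re

# Hypothetical word list: old spelling -> current spelling.
# A real list would have to be compiled from the old-orthography dictionary.
OLD_TO_NEW = {
    "Murmelthier": "Murmeltier",
    "Thier": "Tier",
    "Thür": "Tür",
}

WORD = re.compile(r"\w+")  # \w is Unicode-aware for str, so umlauts match


def modernize(text, table=OLD_TO_NEW):
    """Replace whole words found in the table; leave everything else as is."""
    return WORD.sub(lambda m: table.get(m.group(0), m.group(0)), text)
```

The same function can be applied both to the grapheme elements of the dictionary and to running text, for example `modernize("Das Murmelthier schlief.")`.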

Of course, no one is interested in using a "German (Hochdeutsche Mundart, year 1798) PLS dictionary" for dictation. But this specific PLS dictionary could be part of the XML-based ASR framework. We could convert from old orthography (1798) to current orthography, e.g. via an XSLT stylesheet. The end user doesn't have to deal with old/current orthography because the conversion can be done in the background.

The conversion from old (1798) to current (2010) orthography has low priority. At the moment, I am investing my time in grapheme-to-phoneme conversion (and in eSpeak-to-IPA conversion) because this has high priority.

Greetings, Ralf