
Audio and Prompts Discussions

converting copyright free texts to modern spelling
User: Robin
Date: 5/8/2010 3:56 am
Views: 9917
Rating: 21

Hi all,

Using copyright-free texts from, for instance, Project Gutenberg poses some problems, because these texts are typically 100 years old or older (after all, in many countries copyright expires 70 years after the death of the author). So if you use these texts to record speech, you may end up with many words that are not present in a modern dictionary or in a pronunciation dictionary.

I am not talking about words that are simply no longer used much in modern language. In some languages, the spelling of words has changed quite systematically. For instance, in German many instances of "th" have been replaced by "t"; in Dutch many double vowels such as "oo" are now spelt as "o" (but not all).
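To illustrate the "(but not all)" problem, here is a deliberately naive sketch. The German pair below is, as far as I know, a real example of the 1901 reform: "Thal" became "Tal", but "Theater" kept its "th", so a blind replacement rule corrupts it.

```python
# Deliberately naive converter: blindly replaces every "th" with "t",
# with no exception list. Fine for reformed words, wrong for the rest.
def naive_modernize(word):
    return word.replace("th", "t").replace("Th", "T")

print(naive_modernize("Thal"))     # "Tal"    -- correct, reformed spelling
print(naive_modernize("Theater"))  # "Teater" -- wrong: "Theater" kept its "th"
```

This is exactly why a pure rule-based approach needs either an exception dictionary or human confirmation.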

Adding the words with old-fashioned spellings to the dictionary would not, in my view, make a lot of sense: one would end up with a very bloated dictionary, possibly 50% or so larger than it needs to be. A far nicer solution would be to convert an old text into modern spelling in a relatively efficient manner. As a bonus, such converted texts could then also be used to build language models (in combination with other texts in modern spelling). If one did that without converting the text into modern spelling, the speech recognition system would quite often suggest old spellings. Not something one should want, in my opinion.

Of course one solution would be to use an existing spellchecker, but I don't know if that would be successful, especially for shorter words. Also, I think that it should be possible to come up with a more efficient solution. Perhaps a type of spellchecker that would remember replacements for future documents, so one could convert one old-fashioned text and the second one would go a lot faster...

Does anyone have a good idea?

--- (Edited on 5/8/2010 3:56 am [GMT-0500] by Robin) ---

Re: converting copyright free texts to modern spelling
User: nsh
Date: 5/8/2010 7:22 pm
Views: 323
Rating: 14

> Dutch many double vowels such as "oo" are now spelt as "o" (but not all)

"But not all" is a keyword here. It's well known that dictionary is actually the only possible way to describe the variability of language. Everything else could be considered as a rough approximation. Dictionary could be separate of course or just contains marks to make it easy to strip old words.

--- (Edited on 5/9/2010 04:22 [GMT+0400] by nsh) ---

Re: converting copyright free texts to modern spelling
User: Robin
Date: 5/9/2010 2:51 am
Views: 417
Rating: 17

Yes, I understand. I mentioned "but not all" because I wanted to emphasise that it is not simply a matter of replacing certain letter combinations. If it were, perhaps even I could come up with a solution.

I assume that a spellchecker uses a dictionary, and every time it comes across a word that is not in that dictionary, it flags it for correction. More advanced spellcheckers make suggestions, e.g. "I think the word 'teh' should be replaced with 'the'", and then there are typically the options "ignore" and "replace".

I don't know if there are any spellcheckers out there with an option to "replace and automatically replace in the future"; something like that would make it a lot faster to correct the next hundred-year-old book.

It would also be handy if the auto-replace list could be exported, so others could benefit from corrections that have already been made.
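To make the idea concrete, such a "replace and remember" tool might look roughly like this (a hypothetical sketch: the class name, method names, and JSON export format are all invented for illustration):

```python
import json

class ReplacementMemory:
    """Remembers archaic -> modern corrections and reapplies them."""

    def __init__(self):
        self.replacements = {}  # archaic spelling -> modern spelling

    def correct(self, word, ask_user):
        # Seen before: replace silently, no question asked.
        if word in self.replacements:
            return self.replacements[word]
        # Otherwise defer to the user (or a spellchecker suggestion).
        modern = ask_user(word)
        if modern != word:
            self.replacements[word] = modern
        return modern

    def export(self, path):
        # Share the correction list so others benefit from it.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.replacements, f, ensure_ascii=False, indent=2)

    def load(self, path):
        # Import corrections made by someone else.
        with open(path, encoding="utf-8") as f:
            self.replacements.update(json.load(f))
```

After correcting "koopen" to "kopen" once, every later occurrence in this or any future text is converted without prompting, and the exported list gives the next person a head start.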

Does that make sense?

--- (Edited on 5/9/2010 2:51 am [GMT-0500] by Robin) ---

Re: converting copyright free texts to modern spelling
User: kmaclean
Date: 6/9/2010 9:48 pm
Views: 318
Rating: 13

>So if you use these texts to record speech, you potentially end up with many words that are not present in a modern dictionary or in a pronunciation dictionary.

From an acoustic model perspective (HTK at least...), another approach might be to create two pronunciation dictionaries: one for training and one for recognition (or a single XML encoded corpus, with properties that designate a spelling as archaic or new, and if it is an archaic spelling, have a pointer to the newer spelling, and generate the two dictionaries from this...).

When creating an acoustic model, what is important is the pronunciations (i.e. the series of phonemes), not the spelling of the word itself.  So, in your training pronunciation dictionary, if you have an archaic word that uses "th", and it has been replaced by "t" in a newer spelling, as long as both spellings have the same pronunciation (i.e. are assigned to the same series of phonemes) you are OK from an HMM training perspective.  

On the recognition side, your recognition pronunciation dictionary would only include the newer spellings. Therefore, for the same series of phonemes, the newer spelling would get picked up during recognition.

I think that from an acoustic model creation perspective, this approach *might* be easier than looking at modifying the spelling in all the old texts you might use for training, because your change is only made in one place, and there would be less chance for mistakes... 

I guess it depends on the number of words we are talking about... if there are lots, use the double-dictionary approach; if there are not that many, just change the words in the training texts.
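As a rough sketch of the single-source idea, here is one way the generation step could look. The entry format, words, and phoneme strings below are invented for illustration, and the output is plain dicts rather than real HTK dictionary files:

```python
# One lexicon where archaic entries point at their modern spelling;
# both the training and the recognition dictionary are generated from it.
lexicon = [
    {"word": "tal",  "phones": "t aa l", "archaic": None},
    {"word": "thal", "phones": None,     "archaic": "tal"},  # old spelling of "tal"
]

def build_dictionaries(lexicon):
    modern = {e["word"]: e for e in lexicon if e["archaic"] is None}
    training, recognition = {}, {}
    for e in lexicon:
        # Archaic spellings borrow the phonemes of their modern form,
        # so both spellings map to the same series of phonemes.
        phones = e["phones"] or modern[e["archaic"]]["phones"]
        training[e["word"]] = phones          # both spellings: for HMM training
        if e["archaic"] is None:
            recognition[e["word"]] = phones   # modern spellings only
    return training, recognition
```

The change is made in one place (the pointer from "thal" to "tal"), and both dictionaries stay consistent by construction.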

Ken

 

--- (Edited on 6/9/2010 10:48 pm [GMT-0400] by kmaclean) ---

two Dutch PLS dictionaries
User: ralfherzog
Date: 6/12/2010 11:09 am
Views: 285
Rating: 20

Let's say you have two Dutch PLS dictionaries: The first Dutch PLS dictionary contains grapheme elements with old spelling. The second Dutch PLS dictionary contains grapheme elements with new spelling.

Let's assume that the phoneme elements are often the same (because the differences between old/new spelling are marginal and normally don't affect the pronunciation).

Then you can do the following: You can copy and paste the lexeme elements from the second Dutch PLS dictionary into the first Dutch PLS dictionary.

Then you use the XSLT expression <xsl:for-each-group select="lexeme" group-by="phoneme">. The resulting PLS dictionary contains lexeme elements with two grapheme elements: first grapheme element is new spelling. Second grapheme element is old spelling.
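In Python, the same group-by-phoneme merge might look roughly like this (the graphemes and the IPA string are invented examples, and the dict-of-lists stands in for the PLS lexeme structure):

```python
# Merge lexemes from the new- and old-spelling dictionaries: entries that
# share a phoneme string are collapsed into one entry carrying both
# graphemes, with the new spelling listed first.
def merge_by_phoneme(new_entries, old_entries):
    # each entry is a (grapheme, phoneme) pair
    merged = {}
    for grapheme, phoneme in new_entries + old_entries:
        merged.setdefault(phoneme, [])
        if grapheme not in merged[phoneme]:
            merged[phoneme].append(grapheme)
    return merged  # phoneme -> [new grapheme, old grapheme, ...]
```

For example, merging the new-spelling entry ("kopen", "koːpə(n)") with the old-spelling entry ("koopen", "koːpə(n)") yields one entry with two graphemes under the shared phoneme string, mirroring what the xsl:for-each-group step produces.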

Robin, maybe you are interested in continuing the development of Ralf's Dutch dictionary version 0.1.1? You could rename it to Robin's Dutch PLS dictionary.

--- (Edited on 2010-06-12 11:13 am [GMT-0500] by ralfherzog) ---

Re: two Dutch PLS dictionaries
User: Robin
Date: 6/12/2010 4:07 pm
Views: 371
Rating: 18

Thanks, Ralf. I had a look at your dictionary and it looks interesting. What was your starting point? We already had a Dutch dictionary under the GPL. However, I think yours is generated using espeak pronunciation rules, right? Why did you use this approach, and which list of words was your starting point? I need to get an idea of which of the two has better pronunciation before I commit to improving one of them.

--- (Edited on 6/12/2010 4:07 pm [GMT-0500] by Robin) ---

Robin's Dutch PLS dictionary
User: ralfherzog
Date: 6/15/2010 6:22 am
Views: 1289
Rating: 18

"any solution will require a lot of manual work" - that is right. The best approach is to generate the phonemes with eSpeak. Then you can use XSLT for further refinement rules. E.g. take a look at the article that explains how I created Ralf's German dictionary version 0.1.9.6. With XSLT, you can minimize the amount of manual work that you have to do.

"I had a look at your dictionary and it looks interesting." - Yes, it is interesting. If you would invest some time into Ralf's Dutch dictionary, you should be able to improve the <phoneme> elements so that they could be used for training of some words e.g. with simon.

"What was your starting point?" - I think that I used this Dutch OpenOffice.org spelling dictionary as source. Maybe, I used the command "unmunch nl_NL.dic nl_NL.aff > nl-wordlist" in the Ubuntu terminal to generate the Dutch word list.

"We already had a Dutch dictionary under the GPL."
- I know. But as far as I know it is not in the PLS format with IPA phonemes. I want to offer a Dutch PLS/IPA dictionary. It is necessary that human editors can "read" the phonemes, and IPA is much more readable than SAMPA. The results should be much better in the long run.

"yours is generated using espeak pronunciation rules" - yes. Can you tell me which eSpeak phonemes correspond with which IPA phonemes? Take a look at the section that begins with the line <xsl:when test="matches(/lexicon/@xml:lang, 'nl')"> in ralfs-ipa-stylesheet.xsl. This section contains the Dutch eSpeak-to-IPA conversion rules. You can help me with the refinement of these rules.

"Why did you use this approach" - I used eSpeak for a lot of languages. Now I know how I can generate phonemes without eSpeak. I use XSLT for this approach. You just need a starting point. eSpeak was my starting point. But I don't need eSpeak any more. I implement my own phoneme improvement rules.

"which of the two has better pronunciation" - Ralf's Dutch dictionary probably has very bad pronunciation because eSpeak support for the Dutch language is "provisional". But as soon as you have a feeling for XSLT you should be able to write your own grapheme-dependent rules for phoneme improvement. Take a look into my style sheet improve-german.xsl. You can develop your own style sheet improve-dutch-dictionary.xsl.

Robin, I think that you should invest some time in the development of "Robin's Dutch PLS dictionary". You can use of course the Dutch Voxforge dictionary as source if you want to.

Later, you can use "Robin's Dutch PLS dictionary" as source for the development of simon scenarios. At the moment, simon scenarios for German and English are available. You can develop a scenario for the Dutch language.

Your Dutch dictionary should have a size of up to 800,000 words; more words would be bad for performance. Ralf's Dutch dictionary has a size of just 160,000 words. That is a good starting point. You can add more words if you want to.

--- (Edited on 2010-06-15 6:22 am [GMT-0500] by ralfherzog) ---

--- (Edited on 2010-06-15 6:35 am [GMT-0500] by ralfherzog) ---

Re: converting copyright free texts to modern spelling
User: Robin
Date: 6/12/2010 4:00 pm
Views: 322
Rating: 17

Thanks, Ken. In Dutch it is around 5% to 50% of the words, depending on the age of the text. A text that is 'only' 100 years old probably has just about 5% of its words in an archaic spelling. For a short text that is not bad, I guess, but it would be nice to transform many texts in one go (not literally one press of a button, but quickly anyway).

I understand that technically one could use a dictionary with archaic and modern spelling for acoustic model training and a dictionary with only modern spelling for recognition. However, I was hoping that there would have been a more practical solution. On the other hand, any solution will require a lot of manual work.

I will think this over carefully and see what I will do.

--- (Edited on 6/12/2010 4:00 pm [GMT-0500] by Robin) ---
