11,000 lemmas with lots of inflected forms
User: ralfherzog
Date: 7/14/2008 5:24 pm
Views: 2171
Rating: 16

Hello bedahr,

1. "It is my strong belief [...]" OK, I understand your arguments.

2. If you want, you can promote the "dictionary acquisition project".  But I am not the maintainer of this project.  The maintainer is Timo Baumann.  I can only offer my personal opinion: Just do what you think is the right thing to do.

As far as I know, he has extracted about 11,000 entries (= lemmas).  Most of them should have several inflected forms.  The entries are sorted by frequency of occurrence, so the most common words get handled first.  This is a very intelligent approach.  It shouldn't take a lot of time to succeed.

By the way, I didn't know the IPA either.  Most of the work is just drag and drop.

3. "The character set and allowed "signs" for use with the HTK" - In my opinion, this is generally a major problem.  I think that there isn't enough awareness (must read) when it comes to the character set.  For example, just take a look at the German prompts (download and extract the file Prompts.tgz, open the file "master_prompts_16kHz-16bit" with a text editor).  Currently, the German special characters (ä,ö,ü,ß) are often displayed incorrectly.  And I think that we might have the same problem if we tried to use the IPA.  In the long term, we can't stick to US-ASCII.  Why not take the complete evolutionary step to UTF-8?  But this needs a lot of thought and discussion.

When special characters (ä,ö,ü,ß; IPA characters) are processed with software like HTK (or Sphinx), major problems may occur due to mojibake, or those special characters may not be allowed at all.
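To illustrate what mojibake looks like, here is a minimal Python sketch.  It is only an illustration of the general effect, not a description of what HTK or Sphinx do internally; the function names are made up for this example.

```python
# Minimal sketch: how mojibake arises when UTF-8 bytes are read as ISO-8859-1.
# The function names are hypothetical and exist only for this illustration.

def garble(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo garble() by reversing the wrong decoding step."""
    return garbled.encode("iso-8859-1").decode("utf-8")


original = "äöüß"
garbled = garble(original)

# Each special character takes two bytes in UTF-8, so the garbled
# string has twice as many characters as the original.
print(len(original), len(garbled))
print(repair(garbled) == original)

# The opposite mistake (ISO-8859-1 bytes read as UTF-8) does not garble
# silently; a strict decoder rejects the bytes instead.
try:
    original.encode("iso-8859-1").decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)
```

The repair step only works here because nothing else has touched the garbled text in between.  Once a file has been saved again or edited in the wrong encoding, the damage is often not reversible.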

In my opinion, the main problem is the mixed use of the different encodings (especially ISO-8859-1 and UTF-8).  It is necessary to focus on this problem.  Americans don't have this problem as long as they stick to US-ASCII. 
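One possible way to deal with such a mix is to normalize everything to UTF-8 before further processing.  The following Python sketch assumes that each line is either valid UTF-8 or ISO-8859-1; the function name is hypothetical, and the fallback is a heuristic, not a reliable encoding detector.

```python
# Minimal sketch: normalize lines of mixed encoding to one Unicode string type.
# Assumption: every line is either UTF-8 or ISO-8859-1 (nothing else).

def decode_prompt_line(raw: bytes) -> str:
    """Try strict UTF-8 first; fall back to ISO-8859-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails,
        # but it will silently accept text from other 8-bit encodings too.
        return raw.decode("iso-8859-1")


lines = [
    "Größe".encode("utf-8"),       # already UTF-8
    "Größe".encode("iso-8859-1"),  # legacy encoding
    b"plain ASCII",                # valid in both encodings
]

decoded = [decode_prompt_line(raw) for raw in lines]
print(decoded[0] == decoded[1])

# Write the result back out as UTF-8 only.
normalized = [line.encode("utf-8") for line in decoded]
```

Trying UTF-8 first makes sense because ISO-8859-1 text with special characters is rarely valid UTF-8 by accident, while the reverse check would accept anything.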

Greetings, Ralf

--- (Edited on 2008-07-14 5:24 pm [GMT-0500] by ralfherzog) ---

--- (Edited on 2008-07-14 6:07 pm [GMT-0500] by ralfherzog) ---
