Acoustic Model Discussions

HDMan character encoding problem
User: nagi
Date: 6/6/2007 3:49 am
Views: 10865
Rating: 34
Hi folks,

Suppose my lexicon file contains:

ÄHNLICH    ae n l i C

and my wordlist file content is:

ÄHNLICH

When I run the HDMan command as follows:

HDMan -m -w wordlist -n monophones1 -l dlog dict lexicon

everything works fine except for the 'dict' file, which contains the following line:

\303\204HNLICH  ae n l i C sp

As you can see, the 'Ä' character has been replaced with backslash-escaped octal numbers.
I tried setting LANG, LC_ALL and the other LC_XXX environment variables to de_DE.ISO8859-1 (on a FreeBSD machine), but it made no difference.
Could somebody tell me what I am doing wrong?

Any help would be appreciated.

Nagi

--- (Edited on 6/6/2007 3:49 am [GMT-0500] by nagi) ---
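
A note on the escaped output in the post above: \303\204 is the byte pair C3 84 written in octal, which is the UTF-8 encoding of 'Ä'. In ISO-8859-1 the same letter would be the single byte C4 (\304), so the lexicon file appears to be stored as UTF-8 even though the locale was set to de_DE.ISO8859-1. The Python sketch below turns such escapes back into readable text. The function name decode_htk_escapes is made up for illustration and is not part of HTK, and the sketch assumes every escape is a backslash followed by exactly three octal digits.

```python
import re

# Matches one escaped byte such as \303 (octal 000-377).
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")

def decode_htk_escapes(text: str, encoding: str = "utf-8") -> str:
    """Rebuild the original bytes from \\ooo escapes, then decode them.

    `encoding` is an assumption about how the lexicon was saved;
    use "latin-1" if the source file was ISO-8859-1.
    """
    raw = bytearray()
    pos = 0
    for match in OCTAL_ESCAPE.finditer(text):
        raw += text[pos:match.start()].encode(encoding)  # plain text before the escape
        raw.append(int(match.group(1), 8))               # the escaped byte itself
        pos = match.end()
    raw += text[pos:].encode(encoding)                   # plain text after the last escape
    return raw.decode(encoding)

line = r"\303\204HNLICH  ae n l i C sp"
decoded = decode_htk_escapes(line)  # "ÄHNLICH  ae n l i C sp"
```

Decoding a copy of the dict file this way is only for inspection; the escaped file itself is what the other HTK tools read.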

Re: HDMan character encoding problem
User: kmaclean
Date: 6/6/2007 10:01 am
Views: 336
Rating: 18

Hi Nagi,

For questions like this, it is well worth the trouble to download the entire HTK users and developers mailing lists, and search through them to see if your answer might be there.  See these links (you need to be logged in to HTK to download these ...):

[   ] htk-users.mbox          09-Feb-2007 14:08   20M  
[   ] htk-developers.mbox     09-Feb-2007 10:36  2.9M  

I found a couple of references (I'm not sure if you already came across these ..) - but no clear answer:

One email on the users list says:

From: "Emilian Stoimenov" <[email protected]>
To: "htk-users" <[email protected]>
Subject: Re: [HTK-Users] info request
Date: Fri, 17 May 2002 10:20:16 +0300 
[...] 
[...] you don't need to switch to the Latin character set. HTK uses codes to represent the non-Latin characters: instead of writing them directly as characters, it writes them using their character codes represented as octals preceded by '\'. So, for example, an HMM name can be '\345\347'. When the various tools read such codes they print them on screen correctly.

Another email states the following: 

To: [email protected]
Cc: [email protected]
Subject: Re: [HTK-Users] Thai language output from HVite
References: <[email protected]>
Date: 21 Jan 2002 16:20:06 +0000 
Supphanat Kanokphara <[email protected]> writes:

> I am making a Thai speech recognition system. Now I have a problem: the
> output from HVite becomes unreadable text, as below
>
> 0 400000 sil
> 400000 4500000 \342\244\303\247\312\303\351\322\247

I am afraid I don't know very much about Thai character sets, but I am
assuming you are using some sort of 8-bit character encoding (maybe
ISO-8859-11, if that is published as the standard, yet?).

HTK tries to automatically determine whether a particular character is printable. For this it uses the standard isprint() function. If you set your locale (e.g. the LANG and LC_ALL environment variables) correctly, then this should all work automatically. Failing that, you could put the following into your HTK config file to avoid the \xxx notation:

NONUMESCAPES = F

 Gunnar

My reading of these two emails is that setting your LANG and LC_ALL environment variables only affects output from HVite, and that HTK uses the backslash notation internally, so you should not need to convert the contents of your dict file. The best way to test this theory would be to train your acoustic model and try it out with HVite. If this does not work, you might try asking on the HTK mailing lists. You might also try the Julius mailing list, since they use HTK to train acoustic models in Japanese.

Not sure how this might affect Julius ... try it out and let us know how you make out! 

Another thought ... what language is your PC set to? If it is English, this might also be part of the problem.

Hope that helps,

Ken 

 

--- (Edited on 6/6/2007 11:01 am [GMT-0400] by kmaclean) ---

Re: HDMan character encoding problem
User: nagi
Date: 6/6/2007 10:59 am
Views: 472
Rating: 24

Hi Ken,

Thank you for the help. The link for the mail archive is great :)

I gave up :( ,,, but not completely:)

I have just replaced these characters (ÄÜÖüäö) with plain ASCII equivalents (Ä -> AE, Ü -> UE, ...) and the problem is solved.

I have also changed my lexicon slightly, as follows:

old version :

ÄHNLICH  ae n l i C

 new version :

AEHNLICH  [Ähnlich] ae n l i C

So the part in [xxx] will be the output text in Julius.

OK, that is it for now. I have not finished my AM training yet :)

 Best regards

Nagi 

--- (Edited on 6/6/2007 10:59 am [GMT-0500] by nagi) ---
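
The lexicon rewrite described in the post above could be automated roughly as follows. This is a sketch only: the post spells out just Ä -> AE and Ü -> UE, so the remaining mappings are assumed by analogy, and the capitalisation of the bracketed output symbol simply copies the single example given. The names UMLAUT_MAP, ascii_key and rewrite_entry are hypothetical.

```python
# Only the first two uppercase mappings come from the post;
# the others are assumed to follow the same pattern.
UMLAUT_MAP = {
    "Ä": "AE", "Ü": "UE", "Ö": "OE",
    "ä": "ae", "ü": "ue", "ö": "oe",
}

def ascii_key(word: str) -> str:
    """Replace each umlaut with its two-letter ASCII stand-in."""
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in word)

def rewrite_entry(line: str) -> str:
    """Turn 'WORD phones...' into 'ASCIIWORD  [Word] phones...'."""
    word, *phones = line.split()
    output_symbol = word.capitalize()  # matches the [Ähnlich] example
    return f"{ascii_key(word)}  [{output_symbol}] {' '.join(phones)}"

new_line = rewrite_entry("ÄHNLICH  ae n l i C")
# new_line == "AEHNLICH  [Ähnlich] ae n l i C"
```

The wordlist passed to HDMan would need the same ascii_key treatment so that its entries still match the rewritten lexicon.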

Re: HDMan character encoding problem
User: kmaclean
Date: 7/18/2007 7:32 am
Views: 1140
Rating: 30

Found another reference to this problem on the HTK users list:

Hi,

wubin wrote (2007/07/17 18:59):

>  N=19998 L=899059
>  I=0    W=!NULL              
>  I=1    W=</s>              
>  I=2    W=<s>                
>  I=3    W=\260\241          
>  I=4    W=\260\242          
>  I=5    W=\260\242\260\315\313\271
>  I=6    W=\260\242\266\373\260\315\304\341\321\307
>  I=7    W=\260\242\266\373\274\260\300\373\321\307
>  I=8    W=\260\242\270\273\272\271
>  I=9    W=\260\242\270\371\315\242

HTK automatically adds backslash before non-printable characters (in 1-byte character sense, see HShell.c:ReWriteString() for detail).
So words in your lattice file became unreadable.
You can turn off this function by setting the environmental variable NONUMESCAPES to TRUE.

Regards,

Heiga ZEN (Byung Ha CHUN)

--
------------------------------------------------
Heiga ZEN     (in Japanese pronunciation)
Byung Ha CHUN (in Korean pronunciation)

Department of Computer Science and Engineering
Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya 466-8555 Japan

http://www.sp.nitech.ac.jp/~zen
------------------------------------------------
 

--- (Edited on 7/18/2007 8:32 am [GMT-0400] by kmaclean) ---
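
To make the behaviour described in the quoted emails concrete, here is a rough Python imitation of the escaping step. It is not a port of HShell.c:ReWriteString(); it assumes a plain C locale in which only the ASCII range 0x21 to 0x7E counts as printable, and it ignores any special handling of quotes or backslashes. The no_num_escapes parameter is a stand-in for the NONUMESCAPES setting. Note that the two quoted emails disagree on its value (F in the first, TRUE in the second) and on where it goes (config file versus environment variable); going by the variable's name, a true value is what switches the escaping off, but that is worth checking against the HTK Book for your version.

```python
def escape_name(raw: bytes, no_num_escapes: bool = False) -> bytes:
    """Write non-printable bytes as \\ooo, one escape per byte."""
    if no_num_escapes:
        return raw  # escaping disabled: bytes pass through untouched
    out = bytearray()
    for byte in raw:
        if 0x21 <= byte <= 0x7E:   # printable ASCII under the assumed C locale
            out.append(byte)
        else:                      # anything else becomes a 3-digit octal escape
            out += f"\\{byte:03o}".encode("ascii")
    return bytes(out)

utf8_word = "ÄHNLICH".encode("utf-8")
escaped = escape_name(utf8_word)                          # b"\\303\\204HNLICH"
unescaped = escape_name(utf8_word, no_num_escapes=True)   # original UTF-8 bytes
```

Because the check is made one byte at a time, a multi-byte UTF-8 character always ends up as several escapes, which matches the \303\204 seen at the start of this thread.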

Re: HDMan character encoding problem
User: Myra
Date: 6/24/2013 9:41 pm
Views: 2500
Rating: 2

I'm working with Windows 7 and HTK.

I have a problem with HDMan during dictionary construction when numbers are involved. Every time I use a number in the label for a word, I get this error message:

ERROR [+8050]  ReadDict: Probability malformed 3
  ERROR [+1413]  ReadNextWord: ReadDictWord failed
 FATAL ERROR - Terminating program exe\HDMan

What can I do to solve this problem?

Thanks

 

--- (Edited on 6/24/2013 9:42 pm [GMT-0500] by Visitor) ---
