Spanish

New Subversion repository for Spanish
User: ubanov
Date: 11/27/2008 6:09 pm
Views: 12202
Rating: 16

Ken has created a Subversion repository instance for Spanish in the following directory.

http://www.dev.voxforge.org/svn/es/

I have finally managed to upload all the files from my training directory.

The directory contains the HTK/Julian training setup with all the voices that VoxForge has for Spanish.

I have described some things about speech recognition software and the training process in my blog (it's written in Spanish) (http://ubanov.wordpress.com/2008/11/28/reconocimiento-de-voz-en-castellano/).

 

Spanish dictionary has some unreadable characters
User: ralfherzog
Date: 11/28/2008 8:27 am
Views: 117
Rating: 16

Hello ubanov! Hello everyone!

It is great that you write about speech recognition software in your blog.

I just took a look at the Spanish pronunciation dictionary. On my computer (Win XP, Firefox) there are some unreadable characters. We have similar problems with the German language when it comes to characters like "ä, ö, ü, ß".

The special characters of the Spanish language may be displayed correctly on your own computer. But please keep in mind that other people, like me, may experience problems with them. You probably know about this problem already. If not, please read this article.

Greetings, Ralf

Re: Spanish dictionary has some unreadable characters
User: kmaclean
Date: 11/28/2008 9:40 am
Views: 356
Rating: 15

Hi Ralf,

>On my computer (Win XP, Firefox) there are some unreadable characters.

I think this might be a problem with Subversion's web front-end, because Trac displays the exact same file correctly: voxforge_lexicon_spanish

 

There must be a default setting in Subversion somewhere that I need to change...

Ken

Re: Spanish dictionary has some unreadable characters
User: kmaclean
Date: 11/28/2008 10:26 am
Views: 137
Rating: 16

Hi Ralf,

I guess I spoke too soon. As you indicated, there is a problem with displaying the characters in voxforge_lexicon_spanish (the prompts file has the same problem; see ticket #441 for details). I think it is because the file uses ISO-8859-1 rather than UTF-8 encoding.

Workaround: If you display it using ISO-8859-1 it displays correctly...

In Firefox, go to:

    View > Character Encoding

then select Western (ISO-8859-1) so that the file displays properly.

Ubanov: if you can figure it out, it is best to use UTF-8 encoding for the files you upload to VoxForge. It saves a lot of headache down the road.

thanks,

Ken

Is the Spanish Voxforge dictionary compatible with Simon?
User: ralfherzog
Date: 11/28/2008 11:02 am
Views: 236
Rating: 15

Hello Ken,

Thanks for your answer. I just inserted a direct hyperlink to the Spanish VoxForge dictionary in the how-to on installing Simon under Ubuntu. It would be great if someone would try Simon with this Spanish pronunciation dictionary (and report the results)! It would be interesting to know whether the Spanish VoxForge dictionary is Simon-compatible or not.

This is exactly what I have in mind: saving a lot of headache!  It took me a very long time to find out that there is a problem with the character encoding.  How is it possible to train a speech model when the character encoding is wrong?  ISO-8859-1 is still a very common standard, but we should try to totally switch to UTF-8.  When I change the display mode to ISO-8859-1, the Spanish special characters are being displayed correctly. Thanks for the tip, Ken.

UTF-8 is the way to go, e.g. Wikipedia, WordPress.com, eBay.com are using UTF-8.  I encourage everyone to employ UTF-8 instead of ISO-8859-1.

Greetings, Ralf

Re: Is the Spanish Voxforge dictionary compatible with Simon?
User: ubanov
Date: 11/29/2008 6:45 am
Views: 142
Rating: 19

Hi,

As you asked, I have updated the dictionary to UTF-8 format. To do the conversion I used a simple program that I wrote (it's the filtroiso884911toutf8.c program, stored in the svn programas directory).
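The source of that C program is not shown in this thread, so the following is only a minimal Python sketch of the same kind of conversion, not ubanov's actual code. The function names latin1_to_utf8 and convert_file are made up for illustration, and the sketch assumes the input really is ISO-8859-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    # Every ISO-8859-1 byte maps to the Unicode code point with the same
    # value, so the decode step cannot fail. It only produces wrong text
    # if the input was not really ISO-8859-1 in the first place.
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as ISO-8859-1 and write dst_path as UTF-8."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(data))
```

Running this twice on the same file would double-encode the accented characters, so it is worth keeping the original until the result has been checked.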

I have updated some other files too, but maybe I missed some. If anyone finds a file that has not been converted, send me a mail and I will change it.

I used ISO 8859-1 because it's the default option with a Debian 4.0 installation in Spanish. When I connect to my Debian machine with PuTTY, the default is 8859-1 too. If UTF-8 is preferable, that's all right with me :-)

Ralf: I will test Simon in Spanish one of these days and let you know how it goes.

Regards.

 

encoding issues in German and Spanish
User: ralfherzog
Date: 11/29/2008 10:28 am
Views: 130
Rating: 14

Hello! 

Which standard should we use?  ISO-8859-1 or UTF-8? Well, it is not easy to find an answer. The Spanish Debian distribution may use ISO-8859-1.  And even the CMU Sphinx website is encoded in ISO-8859-1. Obviously, they don't care about UTF-8.  But from my point of view, they should.

I just downloaded the German prompts (658k). And when I look at the unpacked prompts with my text editor Notepad++, I see a lot of garbage when it comes to the German special characters (ä,ö,ü,ß). The reason for this crap is the mixed use of different character encodings. In my opinion, it doesn't make sense to train the acoustic/language/speech model when the German special characters are garbage.
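One possible way to repair a file with mixed encodings is to decide line by line: try UTF-8 first and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. The Python sketch below illustrates that heuristic. The function names are hypothetical, and it assumes that the only two encodings present are UTF-8 and ISO-8859-1 and that no single line mixes both.

```python
def decode_mixed_line(raw: bytes) -> str:
    """Decode one line as UTF-8 if possible, otherwise as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the line was written as ISO-8859-1.
        return raw.decode("iso-8859-1")


def normalize_to_utf8(data: bytes) -> bytes:
    """Return the content re-encoded consistently as UTF-8."""
    lines = data.split(b"\n")
    text = "\n".join(decode_mixed_line(line) for line in lines)
    return text.encode("utf-8")
```

This is a heuristic, not a guarantee: a short ISO-8859-1 sequence can occasionally happen to be valid UTF-8, so the output should still be checked by eye.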

So we have to take care of that problem. eBay migrated from Latin 1 (I think that this is the same as ISO-8859-1) to UTF-8. In Spanish as well as in German they used Latin 1, and migrated to UTF-8.

A reason why eBay migrated was to solve cross-border trade impediments (PDF). I think that we have an analogous problem in speech recognition development. E.g. I created and uploaded lots of German prompts (unfortunately using mixed character encodings). And nsh compiled my prompts (and the corresponding audio files) into a speech model. And the result is not yet usable because of the character encoding issues (the special German characters are crap). It won't be easy to fix that. Maybe we should try and use your script filtroiso88591toutf8.c.

Sorry for writing so much.

I hope that you won't lose too much time because of encoding issues.

Thanks in advance for testing Simon with the Spanish dictionary.

Greetings, Ralf

Re: encoding issues in German and Spanish
User: ubanov
Date: 11/29/2008 4:14 pm
Views: 130
Rating: 16

Hi,

I would like to help you. The only problem is that I don't have a conversion table from ISO-8859-1 to UTF-8.

What I did this morning was to look up the characters áéíóúñ in ISO and UTF. In order to help you, I would need to know which characters you use in ISO-8859-1 and their UTF-8 equivalents (if you give me two files with all the characters, I will do the rest). Do you use ISO-8859-1 or another ISO-8859-x?

Maybe you could write the characters you need to convert in a reply to this message (that will give me the ISO chars), and then I will try to create the UTF file.

I now think that I missed some characters (ï and ü) in my conversion of the lexicon dictionary...

I hope I can help you, but I need a little help.
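A hand-made table is not strictly necessary, because each ISO-8859-1 byte corresponds to the Unicode code point with the same number, and the UTF-8 bytes follow from that. As an illustration only, this Python sketch prints both byte sequences for any set of characters. The helper name utf8_table and the character list are assumptions chosen for this thread. For example, ñ is byte F1 in ISO-8859-1 and the two bytes C3 B1 in UTF-8.

```python
def utf8_table(chars: str) -> dict[str, tuple[str, str]]:
    """Map each character to its (ISO-8859-1 hex, UTF-8 hex) byte strings."""
    table = {}
    for ch in chars:
        iso = ch.encode("iso-8859-1").hex()  # always one byte
        utf = ch.encode("utf-8").hex()       # two bytes for accented letters
        table[ch] = (iso, utf)
    return table


# Spanish and German characters mentioned in this thread.
for ch, (iso, utf) in utf8_table("áéíóúñïüäöß").items():
    print(f"{ch}  ISO-8859-1: {iso}  UTF-8: {utf}")
```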

Regards.

 

Audio in mono format and notes about encoding
User: ubanov
Date: 11/29/2008 4:36 pm
Views: 144
Rating: 12

Hi Ken,

In the train/wav subdirectories I have uploaded the sounds again, after converting the stereo wav files to mono (using sox -c 1 fichero.wav ficherosal.wav) and changing the prompts files to UTF-8 characters. I have uploaded the ubanov*, buhochileno4 and txita1 directories.

Ken, maybe you could upload the files to the Spanish voice repository, so that they can be downloaded from the Listen option of VoxForge.

Another thing: I'm going to include a note about the encoding in the Spanish Read or Listen page (asking people to use the UTF-8 charset).
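For anyone without sox at hand, the same kind of downmix can be sketched with Python's standard library. This is only an illustration, not what was run for the uploaded files: the function name stereo_to_mono is made up, and the sketch assumes 16-bit PCM stereo input and simply averages the left and right samples.

```python
import array
import sys
import wave


def stereo_to_mono(src_path: str, dst_path: str) -> None:
    """Downmix a 16-bit PCM stereo WAV file to mono by averaging channels."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM input")
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved as L, R, L, R, ...; average each pair.
    mono = array.array(
        "h",
        ((samples[i] + samples[i + 1]) // 2
         for i in range(0, len(samples) - 1, 2)),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

The sample rate is left unchanged; only the channel count is reduced.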

Regards.

Re: Is the Spanish Voxforge dictionary compatible with Simon?
User: kmaclean
Date: 12/1/2008 11:22 am
Views: 117
Rating: 11

Hi Ralf,

>How is it possible to train a speech model when the character encoding is wrong?

The use of UTF-8 is really more to get rid of headaches that occur when trying to display international character sets on a web site. 

It does not really have much to do with acoustic model training, since Sphinx, Julius/HTK, ... use ASCII internally (which I assume is the reason why the SAMPA computer-readable phonetic alphabet was created).

Ken
