New Subversion repository for Spanish

User: ubanov
Date: 11/27/2008 6:09 pm

Views: 13742
Rating: 16

Ken has created a Subversion repository instance for Spanish in the following directory.

http://www.dev.voxforge.org/svn/es/

I have finally managed to upload all the files from my training directory.

The directory contains the training setup for HTK/Julian, using all the Spanish voices that VoxForge has.

I have described some things about speech recognition software and the training process in my blog (it's written in Spanish) (http://ubanov.wordpress.com/2008/11/28/reconocimiento-de-voz-en-castellano/).

Spanish dictionary has some unreadable characters

User: ralfherzog
Date: 11/28/2008 8:27 am

Views: 117
Rating: 16

Hello ubanov! Hello everyone!

It is great that you write about speech recognition software in your blog.

I just took a look at the Spanish pronunciation dictionary. On my computer (Win XP, Firefox) there are some unreadable characters. We have similar problems with the German language when it comes to characters like "ä, ö, ü, ß".

The special characters of the Spanish language may be displayed correctly on your own personal computer. But please keep in mind that other people like me may experience problems with the special characters. Probably you know about this problem. If not, please read this article.

Greetings, Ralf

Re: Spanish dictionary has some unreadable characters

User: kmaclean
Date: 11/28/2008 9:40 am

Views: 373
Rating: 15

Hi Ralf,

>On my computer (Win XP, Firefox) there are some unreadable characters.

I think this might be a problem with Subversion's web front-end, because Trac displays the exact same file correctly: voxforge_lexicon_spanish

There must be a default setting in Subversion somewhere that I need to change...

Ken

Re: Spanish dictionary has some unreadable characters

User: kmaclean
Date: 11/28/2008 10:26 am

Views: 137
Rating: 16

Hi Ralf,

I guess I spoke too soon. As you indicated, there is a problem with displaying the characters in voxforge_lexicon_spanish (the prompts file has the same problem; see ticket #441 for details). I think it is because the file uses ISO-8859-1 rather than UTF-8 encoding.

Workaround: if you display it using ISO-8859-1, it displays correctly.

In Firefox, go to:

View > Character Encoding

then select Western (ISO-8859-1) so that the file displays properly.
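
To illustrate why the same file looks right under one encoding setting and wrong under another, here is a minimal Python sketch. It is not from the thread, and the sample word is an arbitrary choice rather than an entry taken from the lexicon. It stores a Spanish word as ISO-8859-1 bytes and then reads those bytes back under both encodings.

```python
# Illustration only: "señal" is an arbitrary sample word.
word = "señal"

# In ISO-8859-1 every character is one byte; "ñ" is the single byte 0xF1.
latin1_bytes = word.encode("iso-8859-1")

# Read back as ISO-8859-1, the text is intact.
as_latin1 = latin1_bytes.decode("iso-8859-1")

# Read back as UTF-8, the lone byte 0xF1 is not a valid sequence,
# so a viewer shows a replacement character instead of "ñ".
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)
print(as_latin1)
print(as_utf8)
```

The bytes on disk never change; only the interpretation does, which is why switching the browser's character encoding is enough to make the file readable.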

Ubanov: if you can, it is best to use UTF-8 encoding for the files you upload to VoxForge. It saves a lot of headaches down the road.

thanks,

Ken

Is the Spanish Voxforge dictionary compatible with Simon?

User: ralfherzog
Date: 11/28/2008 11:02 am

Views: 236
Rating: 15

Hello Ken,

Thanks for your answer. I just inserted a direct hyperlink to the Spanish VoxForge dictionary in "how to install Simon under Ubuntu". It would be great if someone would try Simon with this Spanish pronunciation dictionary (and report the results)! It would be interesting to know whether the Spanish VoxForge dictionary is Simon-compatible or not.

This is exactly what I have in mind: saving a lot of headache! It took me a very long time to find out that there is a problem with the character encoding. How is it possible to train a speech model when the character encoding is wrong? ISO-8859-1 is still a very common standard, but we should try to totally switch to UTF-8. When I change the display mode to ISO-8859-1, the Spanish special characters are being displayed correctly. Thanks for the tip, Ken.

UTF-8 is the way to go, e.g. Wikipedia, WordPress.com, eBay.com are using UTF-8. I encourage everyone to employ UTF-8 instead of ISO-8859-1.

Greetings, Ralf

Re: Is the Spanish Voxforge dictionary compatible with Simon?

User: ubanov
Date: 11/29/2008 6:45 am

Views: 142
Rating: 19

Hi,

As you asked, I have updated the dictionary to UTF-8 format. To do the conversion I used a simple program that I wrote (it's the filtroiso88591toutf8.c program, stored in the svn programas directory).
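
For readers who want to try the same kind of conversion, the following is a minimal Python sketch. It is not the filtroiso88591toutf8.c program mentioned above, whose source is not shown in this thread; the function names and file paths are hypothetical. It assumes the input really is ISO-8859-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> int:
    """Convert one file and return the number of non-ASCII characters.

    Every ISO-8859-1 byte from 0x80 to 0xFF becomes exactly two bytes
    in UTF-8, so the growth in size equals the count of such characters.
    """
    with open(src_path, "rb") as src:
        original = src.read()
    converted = latin1_to_utf8(original)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
    return len(converted) - len(original)


print(latin1_to_utf8(b"a\xf1o"))
```

Running this on a file that is already UTF-8 would encode it a second time and corrupt it, so it is worth checking the input encoding first.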

I have updated some other files too, but some may be missing. If anyone finds a file that has not been converted, send me a mail and I will change it.

I used ISO-8859-1 because it's the default option with a Debian 4.0 installation in Spanish. When I connect to my Debian machine with PuTTY, the default is 8859-1 too. If UTF-8 is preferable, that's all right with me :-)

Ralf: I will test Simon in Spanish one of these days and let you know how it goes.

Regards.

encoding issues in German and Spanish

User: ralfherzog
Date: 11/29/2008 10:28 am

Views: 130
Rating: 14

Hello!

Which standard should we use? ISO-8859-1 or UTF-8? Well, it is not easy to find an answer. The Spanish Debian distribution may use ISO-8859-1. And even the CMU Sphinx website is encoded in ISO-8859-1. Obviously, they don't care about UTF-8. But from my point of view, they should.

I just downloaded the German prompts (658k). And when I look at the unpacked prompts with my text editor Notepad++, I see a lot of garbage when it comes to the German special characters (ä,ö,ü,ß). The reason for this crap is the mixed use of different character encodings. In my opinion, it doesn't make sense to train the acoustic/language/speech model when the German special characters are garbage.
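
One way to find out whether a prompts file mixes encodings is to test each line separately. The Python sketch below is only an illustration under that assumption; the function names are hypothetical and the sample data is made up rather than taken from the German prompts.

```python
def classify_line(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for one line of bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"  # for example ISO-8859-1 or Windows-1252


def report_mixed(data: bytes) -> dict:
    """Count how many lines fall into each category."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in data.splitlines():
        counts[classify_line(line)] += 1
    return counts


# Made-up sample: one UTF-8 line, one ISO-8859-1 line, one plain ASCII line.
sample = (
    "Straße\n".encode("utf-8")
    + "Größe\n".encode("iso-8859-1")
    + b"Haus\n"
)
print(report_mixed(sample))
```

A file where both the 'utf-8' and 'other' counts are above zero very likely contains mixed encodings and needs line-by-line repair rather than a single whole-file conversion.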

So we have to take care of that problem. eBay migrated from Latin 1 (I think that this is the same as ISO-8859-1) to UTF-8. In Spanish as well as in German they used Latin 1, and migrated to UTF-8.

A reason why eBay migrated was to solve cross-border trade impediments (PDF). I think that we have an analogous problem in speech recognition development. E.g. I created and uploaded lots of German prompts (unfortunately using mixed character encodings). And nsh compiled my prompts (and the corresponding audio files) to a speech model. And the result is not yet usable because of the character encoding issues (the special German characters are crap). It won't be easy to fix that. Maybe we should try and use your script filtroiso88591toutf8.c.

Sorry for writing so much.

I hope that you won't lose too much time because of encoding issues.

Thanks in advance for testing Simon with the Spanish dictionary.

Greetings, Ralf

Re: encoding issues in German and Spanish

User: ubanov
Date: 11/29/2008 4:14 pm

Views: 130
Rating: 16

Hi,

I would like to help you. The only problem is that I don't have a conversion table from ISO-8859-1 to UTF-8.

What I did this morning was to look up the characters áéíóúñ in ISO and UTF. In order to help you, I would need to know which characters you use in ISO-8859-1 and their UTF-8 equivalents (if you give me two files with all the characters, I will do the rest). Do you use ISO-8859-1 or another ISO-8859-x?

Maybe you could write the characters you need to convert in a reply to this message (that way I will have the ISO chars), and then I will try to create the UTF file.

I now realize that I missed some characters (ï and ü) in my conversion of the lexicon dictionary...
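
On the question of a conversion table: for ISO-8859-1 specifically, each byte value is the same number as the Unicode code point of the character, so the full table can be generated instead of collected by hand. The Python sketch below shows this; the function name is hypothetical. It does not apply to other ISO-8859-x variants, which map some bytes to different characters.

```python
def build_table() -> dict:
    """Map every ISO-8859-1 byte value to its UTF-8 byte sequence.

    This relies on ISO-8859-1 byte N being Unicode code point U+00NN.
    """
    return {b: chr(b).encode("utf-8") for b in range(256)}


table = build_table()

# Print the entries for the Spanish and German characters from this thread.
for ch in "áéíóúñïüäöß":
    byte = ch.encode("iso-8859-1")[0]
    print(ch, hex(byte), table[byte].hex())
```

Because the table covers all 256 byte values, characters such as ï and ü are included automatically.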

I hope to help you, but I need a little bit of help.

Regards.

needed: conversion script PLS/IPA to HTK/ASCII

User: ralfherzog
Date: 12/1/2008 12:04 pm

Views: 425
Rating: 16

Hello ubanov!

Thanks for your offer to help. I think that I have found a solution to the encoding problem (thanks to nsh) with the following commands in the Ubuntu terminal:

ubuntu@ubuntu-desktop:~$ svn checkout http://[email protected]/svn/de/Trunk/Prompts

cd Prompts

gedit master_prompts_8kHz-16bit

svn commit

I am just doing a simple search & replace with gedit. There is no need to write a script. After searching for the corrupt characters and replacing them with the valid special characters (ä,ö,ü,ß), it is just necessary to save the file with the character encoding UTF-8. The results are shown in the German timeline.
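
If the manual search & replace ever becomes too tedious, one common kind of corruption can be undone programmatically. The Python sketch below assumes the corrupt characters came from UTF-8 text that was read as ISO-8859-1; the thread does not say which corrupt sequences were actually in the prompts file, so this is only one possible case, and the function name is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 read as ISO-8859-1' corruption.

    Returns the text unchanged when it does not fit that pattern.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the corruption on a made-up word, then repair it.
broken = "Größe".encode("utf-8").decode("iso-8859-1")
print(repr(broken))
print(repair_mojibake(broken))
```

Text that is already correct fails the round trip and is returned as is, so the function can be applied line by line to a file with mixed content.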

Well, I have a similar problem. How is it possible to convert the German PLS/IPA pronunciation dictionary into HTK/ASCII format? If you want, you could help me with the conversion. My thoughts are: using XSLT/XPath. Or someone could write a C++ script. Or a Perl script. Or search & replace with gedit. At the moment, I don't have the necessary programming skills. But I am trying to find a solution. I would appreciate any help.

Obviously, you know how to write a C++ script. If you want to help, you are welcome.

It is very convenient to create the German pronunciation dictionary using the IPA. But in the end, we need just ASCII. And for the conversion, a script would be fine.
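
As a starting point for such a script, here is a rough Python sketch. Several things in it are assumptions: the IPA-to-ASCII phone table is hypothetical and covers only the sample word, the `WORD [WORD] phones` output layout is a guess at the target dictionary format, and the input is assumed to follow the W3C PLS layout with one grapheme and one phoneme per lexeme.

```python
import xml.etree.ElementTree as ET

PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Hypothetical phone table; a real one must cover the whole phone set.
IPA_TO_ASCII = {"h": "h", "aʊ": "aU", "s": "s", "a": "a", "ʊ": "U"}


def ipa_to_phones(ipa: str) -> list:
    """Split an IPA string into ASCII phones, longest symbol first."""
    symbols = sorted(IPA_TO_ASCII, key=len, reverse=True)
    phones, i = [], 0
    while i < len(ipa):
        for sym in symbols:
            if ipa.startswith(sym, i):
                phones.append(IPA_TO_ASCII[sym])
                i += len(sym)
                break
        else:
            # No table entry matched at this position.
            raise ValueError(f"no mapping for {ipa[i]!r}")
    return phones


def pls_to_htk(pls_xml: str) -> list:
    """Return one dictionary line per lexeme in a PLS document."""
    root = ET.fromstring(pls_xml)
    lines = []
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        word = lexeme.findtext("pls:grapheme", namespaces=PLS_NS)
        ipa = lexeme.findtext("pls:phoneme", namespaces=PLS_NS)
        lines.append(f"{word} [{word}] {' '.join(ipa_to_phones(ipa))}")
    return lines


sample = """<lexicon version="1.0" alphabet="ipa" xml:lang="de"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme><grapheme>Haus</grapheme><phoneme>haʊs</phoneme></lexeme>
</lexicon>"""
print(pls_to_htk(sample))
```

Matching the longest symbol first keeps diphthongs such as "aʊ" together instead of splitting them into two phones. Stress marks and other IPA symbols missing from the table raise an error, which makes gaps in the table easy to spot.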

Such a script could be useful for other languages, too, of course. Is there a Spanish IPA dictionary available that is licensed under the GPL? If yes, you could use that script for your own language (I assume that your native language is Spanish).

Regards, Ralf

Re: needed: conversion script PLS/IPA to HTK/ASCII

User: ubanov
Date: 12/1/2008 5:35 pm

Views: 346
Rating: 16

Hi,

In the end I finished the filter program (after searching Google for information about the characters), and I have run the program on the German master prompts files. As I can't upload anything to the German svn directory, the resulting files are in the following directory: "svn checkout http://www.dev.voxforge.org/svn/es/Trunk/german".

In the 16 kHz file the program changed about 8500 characters. In the 8 kHz file the program changed only 6 (?).

Download the files and review them. When you have downloaded them, tell me so that I can delete them. If you can use them, good; if not, just delete them. :-)

Regards.
