General Discussion

Re: Language Hebrew
User: ofir
Date: 12/26/2007 12:54 pm
Views: 229
Rating: 9

I wrote the article and published it in an open source magazine.

In the article, I guided the readers to record themselves reading an open source text, mainly from Wikipedia, and to save the results in a WAV (or AIFF) file at 48 kHz, 16-bit, mono.
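For anyone who wants to verify a recording before uploading it, here is a minimal Python sketch that checks a WAV file against the format above. It assumes uncompressed PCM WAV input (the standard `wave` module does not read AIFF), and the function name `check_wav_format` is only an illustration.

```python
import wave


def check_wav_format(path, rate=48000, sample_bytes=2, channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz"
            )
        # Sample width is reported in bytes: 2 bytes means 16-bit audio.
        if wav.getsampwidth() != sample_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth() * 8}-bit, "
                f"expected {sample_bytes * 8}-bit"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channels, expected {channels}"
            )
    return problems
```

Calling `check_wav_format("recording.wav")` on a file (the file name is a placeholder) returns an empty list when the recording is 48 kHz, 16-bit, mono.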

 

Anyway, I guided them to record the whole text and to upload it to one of the file hosting services kmaclean mentioned before.

One of them already recorded 56 min.

So now I need to cut the recordings into sentences of 10-15 words, and write a file that lists the file name of each recording with the sentence the speaker says in it. Am I right?

What if there is a sentence longer than 15 words? Cut it in the middle and make two files?

Is there an option to manage the Hebrew project from here (assign tasks for volunteers, track progress etc)?

Is there anything else I can do to help with the project and with the Hebrew language?

Hebrew is written from right to left. Do I need to write the file (which contains filename->sentence) in a special way, or doesn't it matter?

--- (Edited on 12/26/2007 12:54 pm [GMT-0600] by ofir) ---

Re: Language Hebrew
User: kmaclean
Date: 12/26/2007 1:59 pm
Views: 214
Rating: 8

Hi ofir,

>In the article, I guided the readers to record themselves reading an open
>source text, mainly from Wikipedia, and to save the results in a WAV (or
>AIFF) file at 48 kHz, 16-bit, mono.

Excellent idea!  Remember to get them to submit their audio and text under GPL (having them include a text file containing the GPL notice should suffice).

>One of them already recorded 56 min.

Wow!  

>What if there is a sentence longer than 15 words? Cut it in the middle and make two files?

10-15 words per sentence is a rule of thumb - I've got some sentences that I segmented (LibriVox audiobooks) that had 20-25 words per sentence.  When the sentences get too long, the HTK training process seems to get "stuck".  Try to segment where there is a natural pause.  If there is no natural pause, add some silence (half a second) at the beginning and at the end of the sentence.  The HTK training scripts (and SphinxTrain) assume that each sentence begins and ends with a short silence section.
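The silence padding can be scripted.  Here is a minimal Python sketch, assuming 16-bit PCM WAV segments like the ones described above; the function name `pad_with_silence` is hypothetical and this is not part of the HTK or SphinxTrain scripts.

```python
import wave


def pad_with_silence(src_path, dst_path, seconds=0.5):
    """Copy a PCM WAV file, adding `seconds` of silence at both ends."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # One frame holds one sample per channel. Zero bytes are silence for
    # signed PCM (16-bit and wider); 8-bit WAV is unsigned and would need
    # 0x80 bytes instead, which this sketch does not handle.
    frame_size = params.sampwidth * params.nchannels
    silence = b"\x00" * (int(params.framerate * seconds) * frame_size)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(silence + frames + silence)
```

Writing to a separate output file keeps the original segment untouched, so the padding length can be changed later without re-cutting the audio.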

>Is there an option to manage the Hebrew project from here (assign tasks for volunteers, track progress etc)?

If you are familiar with Trac (or willing to learn) I can set you up with a Trac environment for Hebrew.  It's more geared to issue tracking for software development projects, but we've been using it for speech corpora development too.

>Is there anything else I can do to help with the project and with the Hebrew language?

Any type of promotion (articles in Hebrew, English, or otherwise ...) is appreciated.

>Hebrew is written from right to left. Do I need to write the file (which contains filename->sentence) in a special way, or doesn't it matter?

It's best to leave the speech text in its original form.  I am not sure if HTK can transpose such text natively, but (if required) we should be able to create a Perl script to do so.  The current training scripts assume text goes left-to-right.
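As a rough illustration of "original form": Unicode text files store Hebrew in logical (reading) order, and the right-to-left layout is applied only when the text is displayed, so the file name can come first on every line.  The Python sketch below assumes a one-line-per-recording layout of file name, a space, then the sentence, saved as UTF-8.  The function names and the `heb0001`-style file names are hypothetical, and whether HTK accepts the result unchanged is not confirmed here.

```python
def write_prompts(entries, path):
    """Write one 'file_name sentence' line per recording, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for file_name, sentence in entries:
            # The sentence is written exactly as typed (logical order);
            # no reversal is applied for right-to-left scripts.
            out.write(f"{file_name} {sentence}\n")


def read_prompts(path):
    """Read the file back into a list of (file_name, sentence) pairs."""
    entries = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.rstrip("\n")
            if line:
                # Split only on the first space: the rest is the sentence.
                file_name, sentence = line.split(" ", 1)
                entries.append((file_name, sentence))
    return entries
```

A text editor may draw such a line with the Hebrew part running right to left, but the bytes in the file still begin with the file name.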

Ken 

 

--- (Edited on 12/26/2007 2:59 pm [GMT-0500] by kmaclean) ---

Re: Language Hebrew
User: ofir
Date: 12/26/2007 3:39 pm
Views: 160
Rating: 7

In a previous post in this thread, it was mentioned that the Java applet can be translated, but that it will take some time until Hebrew is included in it.

I want to know if it is possible to try to include Hebrew in it, since many people who are eager to help don't submit recordings simply because it isn't a one-click action: start the applet and record sentences.

 

I don't want to push you, but it is really going to help people contribute to the project.

 

About Trac, I don't know it yet, but I can learn how to use it.

I would be grateful if you could tell me how you use it and what you use it for.

--- (Edited on 12/26/2007 3:39 pm [GMT-0600] by ofir) ---

Re: Language Hebrew
User: kmaclean
Date: 12/26/2007 8:30 pm
Views: 234
Rating: 12

Hi ofir,

>I want to know if it is possible to try to include Hebrew in it

See this post.

>>Is there an option to manage the Hebrew project from here (assign tasks for volunteers, track progress etc)?

>About Trac, I don't know it yet, but I can learn how to use it.

See the new Hebrew site on VoxForgeDevWiki. The site includes most of the Trac documentation you'll need.  Just play with it a bit to get the hang of it.

Trac is a front-end for the Subversion version control system.  We use Subversion to track changes to a speech corpus as it evolves.  It also permits the creation of sub-corpora with the tagging option.  It has a wiki and a ticket tracker that allows you to track issues or tasks (and assign them to people to track progress), create milestones, etc.  We've been using Trac for the 'messy' dev side of speech corpora creation, and the VoxForge website for the final product.

I'll email your Trac password shortly.  Let me know if you have anyone else that needs access.

Ken 

--- (Edited on 12/26/2007 9:30 pm [GMT-0500] by kmaclean) ---

Re: Language Hebrew
User: ofir
Date: 12/27/2007 2:22 pm
Views: 187
Rating: 10

Where should I submit the translated texts from here? http://www.voxforge.org/home/forums/other-languages-forum/general-discussion/speechsubmission-java-applet-localization

 

Can Wikipedia be used as a text for the recordings? Anyway, where can I find texts that meet the correct standards?

--- (Edited on 12/27/2007 2:22 pm [GMT-0600] by ofir) ---

Re: Language Hebrew
User: kmaclean
Date: 12/27/2007 3:37 pm
Views: 293
Rating: 11

Hi ofir,

>Where should I submit the translated texts from here?

Either in your Hebrew Submission forum or on the Hebrew Wiki - best place would be the Wiki, since this is corpus development stuff.

>Can Wikipedia be used as a text for the recordings?

I don't know what kind of license they use ... do you know?

>Anyway, where can I find texts that meet the correct standards?

Project Gutenberg might have something in Hebrew.  Wikipedia should be OK, but I don't know the license they use - you will need to look it up.  You can also create your own texts ... write an article on speech recognition (or just nonsense sentences) that uses words that cover all the sounds of the Hebrew language, and then parse into 10-15 word sentences, and get people to record it.
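Parsing a text into prompts of at most 15 words can be partly automated.  The Python sketch below is only an assumption about how one might do it, not an existing VoxForge tool: it splits on sentence-ending punctuation first, then cuts any sentence that is still too long, preferring a comma, semicolon or colon as a likely natural pause.  The function name `split_into_prompts` is hypothetical, and the result still needs a human check, since a comma near the start of a long sentence produces a very short prompt.

```python
import re


def split_into_prompts(text, max_words=15):
    """Split text into prompts of at most `max_words` words each."""
    prompts = []
    # Sentence boundaries: whitespace that follows '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        while len(words) > max_words:
            # Default to a hard cut at the limit, but prefer to cut after
            # the last word within the limit that ends in ',', ';' or ':'.
            cut = max_words
            for i in range(max_words, 0, -1):
                if words[i - 1][-1] in ",;:":
                    cut = i
                    break
            prompts.append(" ".join(words[:cut]))
            words = words[cut:]
        if words:
            prompts.append(" ".join(words))
    return prompts
```

Because the function only splits on whitespace and punctuation, it works the same way for Hebrew text as for English.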

Note: there are 2 layers of Copyright involved here:

  1. Copyright on the original text, which may or may not be permissively licensed;
  2. Copyright on the speech of someone reading the text, which may or may not be licensed separately.

The Copyright in the text takes precedence over any Copyright in a recording of that text.  You need permission (i.e. a written license) from *both* the Copyright holder of the text, and from the person who made the recording.  That is why VoxForge only uses out-of-Copyright texts for its prompts, and asks users to submit their speech using a GPL license.

Details:

Original text under Copyright and no license

  • then any recording of the text is a derivative work of the original text, and cannot be distributed (nor used by VoxForge).

Original text under Copyright with a permissive license (like GPL or equivalent).

  • then the recording is a permitted derivative work of the text, and the person who made the recording has Copyright in their recording, which they can then license under a similar permissive license.

Original text under Public Domain (i.e. no Copyright)

  • then a recording of the text will have Copyrights attached to it, which need to be licensed under GPL.

I'm not a lawyer, but that is my understanding of how things work in this context. 

Ken 

--- (Edited on 12/27/2007 4:37 pm [GMT-0500] by kmaclean) ---

Re: Language Hebrew
User: ofir
Date: 12/28/2007 5:27 am
Views: 156
Rating: 8

Thank you...

I checked Wikipedia and I saw they use GFDL, which is a license of the Free Software Foundation, like GPL, so I guess it is OK.

What do you think?

--- (Edited on 12/28/2007 5:27 am [GMT-0600] by ofir) ---

Re: Language Hebrew
User: kmaclean
Date: 12/28/2007 9:18 am
Views: 179
Rating: 10

Hi Ofir,

I don't think so ... according to Wikipedia:

GPL incompatible in both directions

The GNU FDL is incompatible in both directions with the GPL: that is GNU FDL material cannot be put into GPL code and GPL code cannot be put into a GNU FDL manual. Because of this, code samples are often dual-licensed so that they may appear in documentation and can be incorporated into a free software program.

You might try translating some of the VoxForge prompts.  

Ken 

--- (Edited on 12/28/2007 10:18 am [GMT-0500] by kmaclean) ---

Re: Language Hebrew
User: ralfherzog
Date: 12/30/2007 5:04 am
Views: 189
Rating: 10
Hi ofir,

Why don't you create your own prompts?  In my opinion, due to copyright concerns this would be the best solution.  I create prompts in German and in English using Dragon NaturallySpeaking 9.5.  Unfortunately this application isn't available for Hebrew.  So to create your own prompts, you would have to use your keyboard.

You can number your own prompts in a semi-automated process with Notepad++ using a macro.
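If you prefer a script to an editor macro, the numbering could also be done with a few lines of Python.  This is just a sketch under my own assumptions: the function name `number_prompts` and the `heb` prefix with four digits are made up for illustration and are not a VoxForge convention.

```python
def number_prompts(sentences, prefix="heb", start=1, width=4):
    """Return lines such as 'heb0001 <sentence>' for each non-empty sentence."""
    numbered = []
    counter = start
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines without using up a number
        # Zero-pad the counter so the lines sort in order, e.g. heb0001.
        numbered.append(f"{prefix}{counter:0{width}d} {sentence}")
        counter += 1
    return numbered
```

For example, `number_prompts(["first", "", " second "])` returns `["heb0001 first", "heb0002 second"]`.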

It is not difficult to create your own prompts; it is just a lot of work.  But you will get used to it.

Greetings, Ralf

--- (Edited on 2007-12-30 5:04 am [GMT-0600] by ralfherzog) ---

Re: Language Hebrew
User: ofir
Date: 12/30/2007 5:46 am
Views: 178
Rating: 11

Yes, as suggested earlier in this thread, I will translate the prompts of the Java applet.

I just need some time to translate them; I hope to do it in the next few days...

 

Thank you!

--- (Edited on 12/30/2007 5:46 am [GMT-0600] by ofir) ---
