VoxForge
Hi,
As of today (10 Dec 2008), we have the following materials for Spanish:
- Two lists of phrases, one written by me and another one from the Nuremberg project
- Voices from 10 different people (about three hours of recordings)
- A written procedure for "compiling" the voices into a Julian model (for now this information is in Spanish on my blog http://ubanov.wordpress.com/2008/11/28/reconocimiento-de-voz-en-castellano/)
- A Trac system containing about 1 GB of data (most of it the wav files)
I would like to know what the next steps are in order to continue with this. Things that I think will have to be done next:
- Train noise models
- Have the output of the training script reviewed by someone who understands what the program reports (maybe Ken O:-)). Look at the file http://www.dev.voxforge.org/svn/es/Trunk/auto/out.log; there are some warnings
- Write more phrases containing more triphones
- Get more voice donors, with more phrases
- Write a procedure for CMU Sphinx
Other things that could be missing?
Regards,
Ivan
The previous post is from ubanov, but I don't know why Firefox didn't validate my user until I had sent the email. Sorry.
> Other things that could be missing?
Item number 0 is estimating the model's accuracy on a test set.
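With the HTK tools a minimal sketch of that looks something like this (hmm15, wdnet, tiedlist, dict, test.scp and testref.mlf are just placeholders for whatever your build scripts produce):

# 1. recognise a held-out set of test MFCC files with the current model
HVite -H hmm15/macros -H hmm15/hmmdefs -C config -S test.scp \
      -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict tiedlist

# 2. score the recogniser output against the reference transcriptions;
#    HResults prints the word correct / accuracy figures
HResults -I testref.mlf tiedlist recout.mlf

The important part is that the test speakers and recordings are kept out of the training set, otherwise the numbers will look much better than they really are.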
Hi Ivan,
>Other things that could be missing?
You're doing great!
Other things that you might want to look at:
1. testing - I would second nsh's comment about needing some sort of test set (we have a rudimentary 'sanity test' that I implemented, but it is not that good).
2. other sources of speech - you might look at contacting the Spanish section of LibriVox and see if there are any readers who might be interested in giving you their chapters in wav format. You'll have to segment them so they can be used for creating AMs, but it might be a good source of speech.
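If you do get LibriVox chapters, a very rough first pass at segmenting them can be done with a reasonably recent SoX by splitting on long silences; the thresholds below are only a starting point, and the pieces still have to be checked and transcribed by hand:

# split chapter01.wav into piece_001.wav, piece_002.wav, ... each time
# roughly 2 seconds of audio stay below 1% amplitude
sox chapter01.wav piece_.wav silence 1 0.5 1% 1 2.0 1% : newfile : restart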
On the VoxForge side of things (which I think we have already discussed...), the next steps would be to:
1. set up the VoxForge mirroring scripts so that they automatically downsample submitted audio to a 'normalised' format (16kHz-16bit & 8kHz-16bit) and create the associated MFCCs (for HTK/Julius; a rough sketch follows after this list) - this requires some way for you (or others) to review and approve the audio...
2. Further review your acoustic model build scripts (I like them - I was just starting in Perl when I made mine, and it shows...) and incorporate them into the nightly acoustic model build.
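For item 1, roughly what I have in mind is the following (depending on the SoX version you may need an explicit resample/rate effect instead of just -r, and the HCopy settings are only the usual HTK tutorial front-end, nothing final):

# downsample a submission to the two 'normalised' formats
sox submission.wav -r 16000 -b 16 submission-16k.wav
sox submission.wav -r 8000 -b 16 submission-8k.wav

# create the matching MFCC file for HTK/Julius training; wav_config holds
# the usual front-end settings, e.g.
#   SOURCEFORMAT = WAV
#   TARGETKIND = MFCC_0_D
#   TARGETRATE = 100000.0
#   WINDOWSIZE = 250000.0
#   USEHAMMING = T
#   PREEMCOEF = 0.97
#   NUMCHANS = 26
#   CEPLIFTER = 22
#   NUMCEPS = 12
HCopy -T 1 -C wav_config submission-16k.wav submission-16k.mfc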
These two things will have to come after I finish with the user submission review system. Unfortunately I am having to learn way more Javascript/CSS/HTML than I ever wanted to (my CMS doesn't do what I need it to do, and I am not at the point where I want to do this as another Java applet...), which takes time...
Ken
And how can I do that?!?
Thanks in advance.
Hi Ubanov,
>>Item number 0 is estimating the model's accuracy on a test set.
>And how can I do that?!?
See:
Ken
About other sources of speech: on 30 Nov 2008 I searched for spoken books. I found that there were books on cervantesvirtual.com, but I asked them if I could use them and have had no response :-( On LibriVox I have not found any Spanish readings :-(
I will try to follow your (and nsh's) testing recommendations...
What's the point of using both 16 kHz and 8 kHz (I understand that 8 kHz is for telephony systems, but what is 16 kHz for)? I have reviewed some of the audio files.
I always listened to one or two files from each set of recordings, and I know that some of them are not correct (because they have some noise before the speech...). It would be necessary to cut off this noise, wouldn't it?
Hi ubanov,
>On LibriVox I have not found any Spanish readings :-(
You need to use the advanced search (i.e. "more search options" on their catalog page) and select Spanish under the "Language" drop down. They have 12 completed works.
>What's the point of using both 16 kHz and 8 kHz (I understand that 8 kHz is
>for telephony systems, but what is 16 kHz for)?
With a 16 kHz sampling rate at 16 bits per sample, you have more audio 'information' with which to train an acoustic model, and this seems to improve recognition accuracy (at least in my experiments...). This *seems* to indicate that the higher the sampling rate, the better the recognition rate you get (and therefore the less audio needed for an acoustic model), but you need to take into account the Nyquist theorem:
This maximum frequency for a given sampling rate is called the Nyquist frequency. Most information in human speech is in frequencies below 10,000 Hz; thus a 20,000 Hz sampling rate would be necessary for complete accuracy.
See this page for more info: Acoustic Model Creation, and my post in the comments section.
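Just to spell out the arithmetic: the highest frequency a sampled signal can represent is half the sampling rate, so

f_max = sampling_rate / 2
16000 Hz sampling -> speech content up to 8000 Hz
 8000 Hz sampling -> speech content up to 4000 Hz (roughly what a telephone line carries)

which is why the 8 kHz models are mainly of interest for telephony audio.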
>Then it would be necessary to cut off this noise, wouldn't it?
Yes- that is what I do.
And this is why it takes me so long to turn around submissions... because it is so mind-numbingly boring to do this task... :)
Ken
> Yes- that is what I do.
Actually it's a bad thing, since real data for recognition will always have noise. In proper databases noise is just marked as such in the labels, and it's possible to train a noise model. Of course there are different types of noise marked as fillers:
++AH++ +AH+
++BEEP++ +TONE+
++NOISE++ +NOISE+
++DOOR_SLAM++ +SLAM+
++BREATH++ +BREATH+
++GRUNT++ +GRUNT+
++LIP_SMACK++ +SMACK+
++PHONE_RING++ +RING+
++CLICK++ +CLICK+
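Those are the markers as they appear in the transcripts (left column) and the filler entries they map to (right column). In a Sphinx-style transcription file the noise events just stay in line with the words instead of being cut out, for example (the Spanish text and utterance ids here are made up, only to show the layout):

<s> ++BREATH++ HOLA BUENOS DIAS ++NOISE++ </s> (es0001-001)
<s> QUIERO ++CLICK++ DOS ENTRADAS PARA MANANA </s> (es0001-002)

The trainer then builds a separate filler model for each of these events instead of letting them corrupt the word and silence models.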
Hi nsh,
>Actually it's a bad thing, since real data for recognition will always have noise.
Point taken... that'll save some time... :)
I think I recall a discussion with someone (I thought it was you, but I can't find the original post...) a while back where they mentioned that we might not need as much speech (for dictation AMs) if we had a "good quality" speech corpus. I think I inferred (wrongly...) from that conversation that if "isolated noise" can be safely removed from a recording, then we should do so. I was not thinking that this could also be used as a source of information for a noise model. I still have all the original submission audio in its unmolested format... another project for later...

I guess from this discussion, a "good quality" speech corpus is one that properly tags its isolated noise, not one that has the minimum of non-speech noise.
The main problem that I am finding is that sometimes, the SpeechSubmission applet looks like it is ready to record, even though it isn't actually ready - so a user's recording gets truncated. I think this might be related to Java's garbage collector... the BAS SpeechRecorder Software page talks about a similar problem (in the section entitled "Important note for reliable recordings").
Because of this, I need to visually review the waveforms of all the submissions (using Audacity) to make sure that there has been no truncation of the utterance at the beginning or at the end of the recording - this can be done rather quickly. If a prompt is truncated, I remove the truncated word(s) and add in silence - this takes time.
In addition, if there are very noticeable click noises after the speaker has finished (usually in submissions recorded with a laptop microphone), I have also been removing them. I will gladly stop doing this... and see what I can do to tag these appropriately.
Another question is how to deal with submissions that are so noisy (line noise/hum,...) that they seem to cause problems with the training of the silence model (my theory...) - see attached file. Since the whole recording contains this constant line noise, there is no real "silence", and this seems to have caused problems with recognition rates. I have been removing these submissions from the "Master_Prompts" for acoustic model training (and if they are very bad, I don't include them in the corpus at all...). What would be the best approach to deal with these cases?
thanks,
Ken