Errors in Voxforge corpus

Audio and Prompts Discussions

User: gongdusheng
Date: 4/12/2007 10:24 pm

In the process of training Sphinx4, I'm finding there are some errors in the corpus. I've encountered one or more of the following errors:

1) Prompt doesn't match recording

2) Prompt has incorrect recording label

3) Prompt file named transcripts.txt

4) Prompt has a typo

5) Recording is unintelligible

I'm wondering if and how I should report these findings and if they will be corrected in the repository. For #3 above, I'm wondering if there is some standard. In addition to the name of the prompts file, some prompts are all uppercase while some are mixed, some have recording labels pointing to the mfc directory while most are relative paths to the wav file. Some prompts have punctuation while some don't. Some prompts have multiple sentence fragments, while most are single sentences or a series of words.
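One way to smooth over the inconsistencies listed above (mixed case, punctuation, labels pointing at different directories) is to normalize each prompt line before training. The following is only a sketch: it assumes each line has the form "label text", and the function name normalize_prompt and the choice to uppercase everything are illustrative, not a VoxForge or Sphinx convention.

```python
import re
from pathlib import PurePosixPath


def normalize_prompt(line):
    """Return (label, text) for one prompt line.

    Assumed input form: "<recording label> <prompt words...>".
    """
    parts = line.strip().split(None, 1)
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    # Keep only the file stem, so "../mfc/a0001" and "wav/a0001.wav"
    # both become "a0001".
    label = PurePosixPath(label).stem
    # Replace anything that is not a letter, digit, apostrophe or space.
    text = re.sub(r"[^A-Za-z0-9' ]+", " ", text)
    # Uppercase and collapse runs of whitespace.
    text = " ".join(text.upper().split())
    return label, text
```

Typos and mismatched recordings would still need to be checked by hand; this only makes the formatting uniform.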

Thanks.

--- (Edited on 4/12/2007 10:24 pm [GMT-0500] by gongdusheng) ---

Re: Errors in Voxforge corpus

User: kmaclean
Date: 4/13/2007 7:16 am

Hi gongdusheng,

Sorry, some of the older prompt files that are included with the audio itself need to be cleaned up. I'm hoping to correct this with the restructuring of the Subversion database and Trac projects format that I am currently working on.

The Master_Prompts file is in the HTK scripts directory (download the HTK.tgz tar file and get it from the AMCreate_scripts directory therein) and has the correct version of the prompts for all the audio in the VoxForge corpus - this is the prompts file used for the creation of the VoxForge Acoustic Models. Part of the subversion restructuring includes moving the Master_Prompts file to the Speech Corpus subversion site, and updating the Acoustic Model scripts to use it at this new location.

For any errors, please report them on the new Speech Corpus Trac site. The Issue Tracker is located at this link. To prevent comment SPAM, your browser needs to support cookies, and you can only post "non-clickable" URLs - (i.e. without the "http://", like this: "www.dev.voxforge.org/wiki"). If you need to post clickable URLs email me at kmaclean at voxforge dot org, and I can set you up with a Trac username and password.

Thanks,

Ken

--- (Edited on 4/13/2007 8:16 am [GMT-0400] by kmaclean) ---

Re: Errors in Voxforge corpus

User: gongdusheng
Date: 4/13/2007 10:42 pm

No need for apology. It's hard to make a clean corpus. That's why you have to pay for all those non-GPLed ones, right?

I took a look at the master_prompts file, and I believe it still has mistakes. I checked the first 3 mistakes that I'd found and they are all in master_prompts too. Nevertheless, I'll switch to using this prompts file as a basis for the Sphinx4 model since its formatting is more consistent.

FYI, I found some new types of errors:

6) Recording from 16kHz 16bit repository is actually 44.1kHz.

7) Recording doesn't have background noise or silence at the end.

I guess #7 isn't an error so much as a standards issue, since SphinxTrain will ignore a file if it doesn't appear to match its prompt. Most files seem to have silence at the beginning and end, so I was just marking every prompt with the silence phone at the beginning and end. The exceptions make this hard to script, so I've been manually adding silence to the wav files when necessary by copying the silence at the beginning of the wav and pasting it at the end.
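The manual copy-and-paste step described above could in principle be scripted. The sketch below uses Python's standard wave module; the function name append_leading_silence and the 0.25 second default are assumptions made for illustration, and it simply trusts that the opening of the recording really is silence or background noise.

```python
import wave


def append_leading_silence(src_path, dst_path, seconds=0.25):
    """Write dst_path as src_path plus a copy of its first `seconds` at the end."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # One frame holds one sample for every channel.
    frame_size = params.sampwidth * params.nchannels
    total_frames = len(frames) // frame_size
    # Never copy more than the recording actually contains.
    n_lead = min(int(seconds * params.framerate), total_frames)
    lead = frames[: n_lead * frame_size]
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The wave module corrects the frame count in the header on close.
        dst.writeframes(frames + lead)
```

A recording that starts speaking immediately would need a different source of silence, so the output is still worth spot-checking by ear.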

When I clean the corpus enough to make Sphinx work, I'll open an "issue" with all the errors I've found. FYI, I haven't been keeping track of which files run afoul of #6. I just changed my script to run every file through a downsampler.

--- (Edited on 4/13/2007 10:42 pm [GMT-0500] by gongdusheng) ---

Re: Errors in Voxforge corpus

User: kmaclean
Date: 5/8/2007 9:39 pm

Hi gongdusheng,

I managed to fix a couple of the issues you mentioned:

>1) Prompt doesn't match recording

Fixed - see Ticket #2 - there were problems with transcriptions not matching speech audio in cmu_us_jmk_arctic

I've created a script to review the HVite log and flag entries whose audio might not match the corresponding transcription (FindQuestionableAudio.pl)

>6) Recording from 16kHz 16bit repository is actually 44.1kHz.

Fixed - see Ticket #1 - my manual segmentation of audio for the ductapeguy-20070308b submission caused the downsampling.pl script not to work, and it just copied the audio unchanged.

I created another script to review each wav file in Main_8kHz-16bit and Main_16kHz-16bit to make sure that the audio sampling rate is correct (ConfirmSamplingRate.pm)
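For readers who want to run a similar check on their own copy of the corpus, here is a minimal sketch of the idea in Python. It is not the ConfirmSamplingRate.pm script mentioned above; the function name find_wrong_rate is hypothetical, and it only reads the sampling rate stored in each wav header.

```python
import wave
from pathlib import Path


def find_wrong_rate(directory, expected_rate):
    """Return (path, rate) for every .wav under directory whose header
    reports a sampling rate other than expected_rate."""
    mismatches = []
    for path in sorted(Path(directory).rglob("*.wav")):
        # wave.open raises wave.Error for files it cannot parse
        # (for example, non-PCM encodings).
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
        if rate != expected_rate:
            mismatches.append((str(path), rate))
    return mismatches
```

For example, find_wrong_rate("Main_16kHz-16bit", 16000) would list any 44.1kHz files that slipped into the 16kHz directory, assuming that directory exists locally.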

I've created a new ticket for #3:

3) Prompt file named transcripts.txt

I'm going to need more information to address the other issues you mentioned:

2) Prompt has incorrect recording label

4) Prompt has a typo

5) Recording is unintelligible

7) Recording doesn't have background noise or silence at the end.

Ken

--- (Edited on 5/8/2007 10:39 pm [GMT-0400] by kmaclean) ---

--- (Edited on 5/9/2007 2:16 pm [GMT-0400] by kmaclean) ---
