
Audio and Prompts Discussions

a quicker approach to building a corpora?
User: dchambers
Date: 4/27/2008 1:58 pm

Would it not be quicker, and less fatiguing on the user, to automatically transliterate spoken audio submissions (e.g. Librivox readings) using a best-of-breed speech recognizer (e.g. Dragon), and then have a 2-stage proofing process where any mistakes can be quickly corrected by volunteers?

I believe this approach could significantly increase the transliterated words per volunteer hour that you are currently achieving, and more importantly, make the process far more enjoyable so that you can build long term relationships with volunteers, and they can continue to drive this project towards meeting its goals, rather than have them take a short term interest only.

This is exactly the model that was used by Distributed Proofreaders to revolutionize the speed at which books could be ASCIIfied for submission into Project Gutenberg. That project has many parallels with what you are doing, and their process has some significant advantages over what is being used here:

  1. the process is enjoyable because there aren't too many mistakes to correct, and so you actually get to enjoy reading books at the same time -- instead, the VoxForge process is quickly fatiguing.
  2. you can choose which books to work on which means you get to read bits of something you may have actually been interested in anyway.
  3. they have a very prominent display showing how many books (in your case transliterated hours) have been achieved so far, and the history of that progress, and this acts as a great motivator to do more work.
  4. since their process de-couples authors from proof readers (or proof listeners in your case ;-) it allows interested parties to make long term commitments to the project that are not coupled to the optimal number of minutes needed from each author.

The perfect interface I can imagine for this would let the user listen to audio in real time, while displaying sub-titles of the corresponding transliterated text, so that the user could pause the audio each time they see an error, and fix the current sub-title. I believe this should be possible by automatically dividing the audio up into sentence-sized chunks before queuing it for transliteration.

Another way to make this process even smoother and more enjoyable (more listening/less fixing) for the user would be to only add one chapter at a time, and then to feed the corrected audio back into the next round of training so that the next chapter required far less fix-up, and so on -- this might not be possible with Dragon, but should be with OSS recognizers, and so perhaps these would be better for subsequent chapters.

Anyway, just my 2 cents! Thanks for all the good work, and I look forward to the day when high-quality speech recognizers are a part of every OSS desktop.

Regards, Dominic Chambers.

--- (Edited on 4/27/2008 1:58 pm [GMT-0500] by Visitor) ---

--- (Edited on 4/27/2008 5:43 pm [GMT-0400] by Visitor) ---

Re: a quicker approach to building a corpora?
User: kmaclean
Date: 4/28/2008 1:16 pm

Hi Dominic,

Thanks for the feedback.  My replies follow:

>Would it not be quicker, and less fatiguing on the user, to automatically
>transliterate spoken audio submissions (e.g. Librivox readings) using a
>best-of-breed speech recognizer (e.g. Dragon), and then have a 2-stage
>proofing process where any mistakes can be quickly corrected by volunteers?

Not sure I understand what you mean here ... are you saying that we should take audiobooks read by LibriVox users, put them through DNS (Dragon NaturallySpeaking), and have it spit out the text transcriptions of the audio book?

Since we can get the text of the audio book directly from Project Gutenberg, I am not sure I understand what the advantage of using a commercial speech recognition system might be ... is it to ensure that the audiobook chapter matches the Gutenberg text?  Note that we currently do this in the nightly build (however, it is not perfect) for any new audio submitted to VoxForge using our current FOSS tools.  If the text deviates considerably from the speech audio, the script ends abnormally, and then we can go and fix the problem.
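To illustrate the kind of check described above, here is a minimal sketch in Python. It is not the actual nightly-build script: the function names and the 0.2 threshold are assumptions made for this example, and it compares two word sequences with the standard-library difflib module rather than running a recognizer.

```python
import difflib
import re


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_mismatch_ratio(reference, hypothesis):
    """Return a dissimilarity score in [0, 1]; 0.0 means the word sequences match."""
    ref_words = normalize(reference)
    hyp_words = normalize(hypothesis)
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    return 1.0 - matcher.ratio()


def check_transcript(reference, hypothesis, max_mismatch=0.2):
    """Raise ValueError when the two texts deviate too much.

    max_mismatch is a hypothetical threshold chosen for illustration only.
    """
    ratio = word_mismatch_ratio(reference, hypothesis)
    if ratio > max_mismatch:
        raise ValueError(f"text deviates from audio transcript: {ratio:.2f}")
    return ratio
```

Here `reference` would be the Gutenberg text and `hypothesis` whatever the recognizer produced for the same chapter; a raised error plays the role of the script ending abnormally so that someone can go and fix the problem.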

In addition, we still have the problem that we need the speech segmented into 10-15 word sentences so that it can be used to train the acoustic models.  We are working on an automated process to do this (see this link: Automated Audio Segmentation Using Forced Alignment), and Sequitur G2P will go a long way to helping us completely automate this process.
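As a rough sketch of the segmentation step (an illustration only, not the process we are building), suppose a forced aligner has already produced a start and end time for every word. The input format, the function name and the "cut at the longest pause" rule below are all assumptions made for this example.

```python
def segment_words(aligned, min_words=10, max_words=15):
    """Group word alignments into segments of roughly 10-15 words.

    aligned: list of (word, start_sec, end_sec) tuples in spoken order,
    assumed to come from a forced aligner (hypothetical input format).
    Each cut is placed at the longest pause inside the allowed window.
    The final segment may be shorter than min_words.
    """
    segments = []
    i = 0
    n = len(aligned)
    while i < n:
        if n - i <= max_words:
            cut = n  # everything left fits in one segment
        else:
            cut, best_pause = i + min_words, -1.0
            # Cutting at j means the segment ends just before word j.
            for j in range(i + min_words, i + max_words + 1):
                pause = aligned[j][1] - aligned[j - 1][2]
                if pause > best_pause:
                    cut, best_pause = j, pause
        chunk = aligned[i:cut]
        segments.append({
            "text": " ".join(word for word, _, _ in chunk),
            "start": chunk[0][1],
            "end": chunk[-1][2],
        })
        i = cut
    return segments
```

Each returned segment carries its text plus start and end times, which is what would be needed to cut the audio file and write the matching prompt line.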

I am also not clear on what you mean by "transliteration" and how it differs from "transcription" in this context (the Wikipedia examples I read are not clear to me...), since we are only concerned that the words in the audio file match the text in the transcription.  Please clarify.

>... and more importantly, make the process far more enjoyable so that you
>can build long term relationships with volunteers, and they can continue to
>drive this project towards meeting its goals, rather than have them take a
>short term interest only.

I agree - this can be improved. 

>the process is enjoyable because there aren't too many mistakes to correct,
>and so you actually get to enjoy reading books at the same time -- instead,
>the VoxForge process is quickly fatiguing.

Users can still read/submit audio books to LibriVox and submit an uncompressed version of their chapter to VoxForge (see this link: How to Send Your LibriVox Recording).  There is still some processing required to segment the submitted audio so that it can be used to train acoustic models - but this is in the process of being automated.

We actually had a few people who asked for pre-existing texts to read for quick and easy submission to VoxForge - the Java Applet on the read page is the result.

>you can choose which books to work on which means you get to read bits
>of something you may have actually been interested in anyway.

The Java Applet has a lot of evolving left to go ... and allowing users to submit their own text for recording is definitely a good option to have. 

In addition, as stated above, VoxForge accepts uncompressed audio chapters of books submitted to LibriVox, so someone can help two projects at the same time: LibriVox with an audio book chapter, and VoxForge with speech audio for training acoustic models.

>they have a very prominent display showing how many books (in your case
>transliterated hours) have been achieved so far, and the history of that
>progress, and this acts as a great motivator to do more work.

I agree, we can do a better job on this.  The metrics page is our current gauge of progress.

>since their process de-couples authors from proof readers (or proof
>listeners in your case ;-) it allows interested parties to make long term
>commitments to the project that are not coupled to the optimal number of
>minutes needed from each author.

We actually have a long-term plan to allow users to be able to proof listen audio ... it just has not been implemented yet.

However, I am not sure how keen people might be to do this ... We've had a forum (i.e. the listen page) that links to all submitted audio, and out of the hundreds of audio submissions we have received, we have not received that many corrections.  It might be a "chicken or the egg" issue (i.e. if it is easier to proof-listen and make corrections, then this might entice people to actually do it).

>The perfect interface I can imagine for this would let the user listen to audio
>in real time, while displaying sub-titles of the corresponding transliterated
>text, so that the user could pause the audio each time they see an error,
>and fix the current sub-title. I believe this should be possible by
>automatically dividing the audio up into sentence-sized chunks before
>queuing it for transliteration.

What I have in mind (from an "ease of coding" perspective at least) is basically the current Java Applet, but with playback buttons and user editable text fields.  The user could play the first prompt, and the app would play each prompt consecutively, so that a submission could be reviewed quickly. If corrections are required, the user would make the correction in the text field and submit the changes, and they would get incorporated into the Subversion repository (for version control).
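The correction step could be sketched roughly as follows. This is only an illustration in Python (not the Java Applet, and not existing VoxForge code): the prompt IDs, the dictionary layout and the function name are hypothetical, and the commit to Subversion is left out.

```python
def apply_corrections(prompts, corrections):
    """Apply a reviewer's text corrections to a set of prompts.

    prompts: dict mapping a prompt ID to its transcription text.
    corrections: dict mapping a prompt ID to the reviewer's corrected text.
    Returns (updated_prompts, changes), where changes is a list of
    (prompt_id, old_text, new_text) tuples for the prompts that changed.
    """
    updated = dict(prompts)  # leave the original mapping untouched
    changes = []
    for prompt_id, new_text in corrections.items():
        if prompt_id not in prompts:
            raise KeyError(f"unknown prompt ID: {prompt_id}")
        new_text = " ".join(new_text.split())  # collapse stray whitespace
        if new_text != prompts[prompt_id]:
            updated[prompt_id] = new_text
            changes.append((prompt_id, prompts[prompt_id], new_text))
    return updated, changes
```

The list of changes is the part that would be worth recording in version control, since it shows exactly which prompts a reviewer touched and what the text was before.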

>Another way to make this process even smoother and more enjoyable
>(more listening/less fixing) for the user would be to only add one chapter at
>a time, and then to feed the corrected audio back into the next round of
>training so that the next chapter required far less fix-up, and so on -- this
>might not be possible with Dragon, but should be with OSS recognizers,
>and so perhaps these would be better for subsequent chapters.

We kind of do this now, since the nightly build of acoustic models includes any audio submitted that day.  Therefore these acoustic models should improve with the new audio.  Note that this is not always the case if the submission contains noise (background, non-speech, line hiss, etc.).

>I look forward to the day when high-quality speech recognizers are a part of
>every OSS desktop.

Me too  :) 

Thanks again,

Ken 

 

--- (Edited on 4/28/2008 2:16 pm [GMT-0400] by kmaclean) ---
