I want to do some experiments. I don't have any expectations, so save the "it isn't going to work well enough", because all I am hoping for at first is to see how good/bad things are with various amounts of effort.
I have all the .dv files that are the source of everything on pycon.blip.tv, plus the 3 hour tutorials that haven't been published yet.
Ultimately, I am hoping to submit a good chunk of transcribed audio to VoxForge. I am not sure how good the audio is, so if someone can listen to some of what's on blip and tell me how useful it will be, that will help me know how much effort to put into it.
So what should I do to get started?
--- (Edited on 4/7/2009 9:56 pm [GMT-0500] by CarlFK) ---
> all I am hoping for at first is to see how good/bad things are
>with various amounts of effort.
I think that if you want to do experiments you may want to compare the efficiency of transcribing versus reading. The interesting number is this ratio: length of transcribed recording / time spent getting the recording. My experience is that transcription is less efficient than reading, but maybe you can prove me wrong.
I have tried two ways of acquiring transcribed speech data. One was a sort of supervised reading, where one person read sentences and another listened to each sentence to check that it was read correctly. In this way I could get about 12 minutes of speech per hour spent.
The other way I tried was to have lecture recordings (which may be similar to your conference talks) transcribed. My observation is that a skilled transcriber can transcribe about 3 minutes of speech in one hour.
The problem with lecture transcription is that for acoustic model training your transcription should be as close as possible to the actual sound of the recording, i.e. you have to include things like laughter, music (jingles etc.), filled pauses (sounds such as er, um, erm, etc.), coughing and many other things which I cannot remember right now. What is even worse, you have to have acoustic models for those special sounds, which means that if you have e.g. coughing in your files you have to have a lot of recordings of coughing so that you can train those acoustic models.
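As a hypothetical illustration (the marker names here are made up; real conventions vary by toolkit), a training transcription with those special events marked might look like:

```text
++LAUGH++ SO UM THE NEXT SLIDE ++COUGH++ SHOWS ER HOW THE PARSER WORKS ++NOISE++
```

Each ++MARKER++ token would then need its own acoustic model, trained on many examples of that sound.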
Another problem is the inherent difficulty of spontaneous speech recognition. Lectures are much harder to recognize than read speech because of spontaneous speech effects (filled pauses, false starts, contracted forms, non-standard grammar etc.), different speaking rates (e.g. one guy I saw on pycon.blip.tv was speaking extremely fast compared to the usual recordings in VoxForge), and adverse conditions (background noise, reverberation etc.).
An open question is whether the inclusion of spontaneous speech in the VoxForge corpus, which is mostly read speech, would increase or decrease the accuracy on read speech.
I do not want to discourage you from transcribing the recordings; I think freely available transcribed spontaneous speech will be useful in the future and there is no harm in starting to collect it now.
One more thing: a useful tool for transcription is this: trans.sourceforge.net
--- (Edited on 4/8/2009 2:31 am [GMT-0500] by tpavelka) ---
As we discussed on IRC, first of all you need to collect a lot of text from the target domain and train a language model with it. I'm not sure why you haven't started with that yet.
You can use cmuclmtk, SRILM or even this simple perl script for that.
Get texts from the mailing lists and create a language model. Then just decode; the accuracy should be reasonably high.
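As a rough sketch of the text preparation this involves (the exact cleanup steps are my assumption, not a prescribed recipe), mailing-list text can be piped through standard tools to get uppercase, punctuation-free sentences wrapped in the context-cue markers the LM toolkits expect:

```shell
# Hypothetical cleanup of raw mailing-list text before LM training:
# uppercase everything, strip punctuation, wrap each line in <s>...</s>.
echo "Hello, world! This is a test." |
  tr 'a-z' 'A-Z' |
  tr -d '[:punct:]' |
  sed 's/^/<s> /; s/$/ <\/s>/'
# prints: <s> HELLO WORLD THIS IS A TEST </s>
```

Real mailing-list archives would also need quoting, signatures and code snippets stripped out first.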
--- (Edited on 4/8/2009 3:16 am [GMT-0500] by nsh) ---
> As we discussed on IRC, first of all you need to collect a lot of text from the target domain and train a language model with it. I'm not sure why you haven't started with that yet.
> Get texts from the mailing lists and create a language model.
I want to get something working with existing files (programs, models, configs) before I start trying to make it better.
> Then just decode; the accuracy should be reasonably high.
I am trying to figure out what I need to "just decode." For example, what was used to produce http://pastebin.ca/1384195 ?
--- (Edited on 4/8/2009 8:14 am [GMT-0500] by Visitor) ---
>I am trying to figure out what i need to "just decode."
nsh and tpavelka have provided excellent information, in addition to the information in your previous thread on this topic: PyCon transcription.
Here is what you can do to get started on your transcribing experiments:
The language model used by Sphinx-4 follows the ARPA format. Language models provided with the acoustic model packages were created with the Carnegie Mellon University Statistical Language Modeling toolkit (CMU SLM toolkit), available at CMU. A manual is available there.
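For reference, an ARPA-format LM is a plain text file listing log10 probabilities and back-off weights per n-gram. The numbers below are made up purely for illustration:

```text
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.7782 <s>	-0.3010
-0.4771 HELLO	-0.2218
-0.4771 </s>

\2-grams:
-0.3010 <s> HELLO
-0.1761 HELLO </s>

\end\
```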
The language model is created from a list of transcriptions. Given a file with training transcriptions, the following script creates a list of words that appear in the transcriptions, then creates bigram and trigram LM files in the ARPA format. The file with the extension .ccs contains the context cues, usually a list of words used as markers - beginning or end of speech etc.
# Task name (csh syntax); inputs are $task.transcript and $task.ccs
set task = RM
# Location of the CMU SLM toolkit
set bindir = ~/src/CMU-SLM_Toolkit_v2/bin
# Build the vocabulary from word frequencies in the transcriptions
cat $task.transcript | $bindir/text2wfreq | $bindir/wfreq2vocab > $task.vocab
# Discounting (smoothing) strategy
set mode = "-absolute"
# Create bigram
cat $task.transcript | $bindir/text2idngram -n 2 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 2 -vocab $task.vocab \
-idngram - -arpa $task.bigram.arpa
# Create trigram
cat $task.transcript | $bindir/text2idngram -n 3 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 3 -vocab $task.vocab \
-idngram - -arpa $task.trigram.arpa
Note: there are no guarantees that this will work because of the acoustic model limitations outlined by tpavelka.
--- (Edited on 4/8/2009 10:54 am [GMT-0400] by kmaclean) ---
@CarlFK: I misunderstood your first post; I thought you were talking about manual transcription of speech. I did not realize you wanted to do automatic transcription of lectures, which is one of the hardest problems in ASR.
The problem is not just the acoustic model; the language modeling is just as difficult. There is a big difference between written text and spoken lectures, and if you train the language model on text then the difference is going to show as a drop in your accuracy. Even if training simple trigrams worked, they would have to be trained on a lot of real lecture transcriptions, and these are hard to come by.
And even if you manage to get your problem-specific text you have to clean it up, which you cannot do manually, because the amount of text you need is so big. You have to do so-called normalization, where you convert everything in the text to words. And you have to figure out the pronunciations, which is hard to do automatically for foreign names and abbreviations.
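As a toy sketch of what normalization means (the patterns here are hypothetical; real normalization needs far more rules, usually generated rather than hand-written), digits and abbreviations get expanded into the words a speaker would actually say:

```shell
# Expand a few numbers and abbreviations into spoken-word form,
# as text normalization for LM training would. Illustrative only.
echo "The talk is approx. 10 yrs old" |
  sed -e 's/approx\./APPROXIMATELY/g' \
      -e 's/ 10 / TEN /g' \
      -e 's/yrs/YEARS/g'
# prints: The talk is APPROXIMATELY TEN YEARS old
```

The hard cases are the ambiguous ones: "1999" could be "nineteen ninety-nine" or "one thousand nine hundred ninety-nine" depending on context, which is why this cannot be done with simple substitutions alone.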
--- (Edited on 4/8/2009 10:38 am [GMT-0500] by tpavelka) ---
--- (Edited on 4/8/2009 10:43 am [GMT-0500] by tpavelka) ---
--- (Edited on 4/8/2009 10:43 am [GMT-0500] by tpavelka) ---
I have both sphinx3 and 4 built, did some of 4's demos.
> Replace this with the English Gigaword Language Model.
"Distribution: 1 DVD Non-member Fee: US$3000.00"
--- (Edited on 4/9/2009 6:50 pm [GMT-0500] by Visitor) ---
>"Distribution: 1 DVD Non-member Fee: US$3000.00"
That fee is for the source corpus itself. Keith has kindly trained language models based on the Gigaword dataset... pick one of these language models (I have not tried this out myself, so you will need to experiment to see which one works best for you...)
--- (Edited on 4/9/2009 8:13 pm [GMT-0400] by kmaclean) ---
I have foo.wav - what next?
Again, I am not really interested in the quality of the results, just in getting some pieces put in place that give some results. Until I can do something, I can't do something better - so all I am looking for is getting something working. I do of course appreciate the advice, but right now it is just getting pushed into the pile of "stuff I hope I can use later", which often gets forgotten about.
--- (Edited on 4/10/2009 12:23 pm [GMT-0500] by CarlFK) ---
>I have foo.wav - what next?
What are you trying to do?... recognize the speech in foo.wav? I don't know Sphinx very well, but there must be a setting to point it at a wav file and attempt to recognize the speech it contains.
--- (Edited on 4/22/2009 1:48 pm [GMT-0400] by kmaclean) ---