General Discussion

transcribing experiments
User: CarlFK
Date: 4/7/2009 9:55 pm
Views: 3892
Rating: 1

I want to do some experiments.  I don't have any expectations, so save the "it isn't going to work well enough," because all I am hoping for at first is to see how good or bad things are with various amounts of effort.

I have all the .dv files that are the source of everything on pycon.blip.tv, plus the 3-hour tutorials that haven't been published yet.

 

Ultimately, I am hoping to submit a good chunk of transcribed audio to VoxForge.  I am not sure how good the audio is, so if someone can listen to some of what's on blip and tell me how useful it will be, that will help me know how much effort to put into it.

So what should I do to get started? 

--- (Edited on 4/7/2009 9:56 pm [GMT-0500] by CarlFK) ---

Re: transcribing experiments
User: tpavelka
Date: 4/8/2009 2:31 am
Views: 26
Rating: 1

Hi CarlFK,

> all I am hoping for at first is to see how good/bad things are with various amounts of effort.

I think that if you want to do experiments, you may want to compare the efficiency of transcribing and reading. The interesting number is this ratio: length of transcribed recording / time spent getting the recording. My experience is that transcription is less efficient than reading, but maybe you can prove me wrong.

I have tried two ways of acquiring transcribed speech data, one was a sort of supervised reading, where there was one person reading sentences and another one listening to each sentence to hear if it is correct. In this way I could get about 12 minutes of speech per hour spent.

The other way I tried was to have lecture recordings (which may be similar to your conference talks) transcribed. My observation is that a skilled transcriber can transcribe about 3 minutes of speech in one hour.
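The ratio described above can be worked through with the two figures quoted in this post. This is only a minimal sketch; the function name is illustrative, and the numbers are the rough per-hour estimates given here, not measurements.

```python
def acquisition_efficiency(speech_minutes: float, effort_minutes: float) -> float:
    """Length of transcribed speech divided by the time spent obtaining it."""
    if effort_minutes <= 0:
        raise ValueError("effort_minutes must be positive")
    return speech_minutes / effort_minutes


# Rough figures from the post, each per hour (60 minutes) of effort.
reading = acquisition_efficiency(12, 60)       # supervised reading
transcribing = acquisition_efficiency(3, 60)   # lecture transcription

print(f"reading:      {reading:.2f} min of speech per min of effort")
print(f"transcribing: {transcribing:.2f} min of speech per min of effort")
print(f"reading yields about {reading / transcribing:.0f}x more speech per hour")
```

On these estimates, an hour of supervised reading produces roughly four times as much usable speech as an hour of lecture transcription.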

The problem with lecture transcription is that for acoustic model training your transcription should be as close as possible to the actual sound of the recording, i.e. you have to include things like laughter, music (jingles etc.), filled pauses (sounds such as er, um, erm, etc.), coughing and many other things which I cannot remember right now. What is even worse, you need acoustic models for those special sounds, which means that if you have e.g. coughing in your files, you need a lot of recordings of coughing so that you can train these acoustic models.

Another problem is the inherent difficulty of spontaneous speech recognition. Lectures are much harder to recognize than read speech because of spontaneous speech effects (filled pauses, false starts, contracted forms, non standard grammar etc.), different speaking rates (e.g. one guy I saw on pycon.blip.tv was speaking extremely fast when compared to the usual recordings in VoxForge), and adverse conditions (background noise, reverberation etc.).

A question is, whether the inclusion of spontaneous speech in the VoxForge corpus, which is mostly read speech, would increase or decrease the accuracy on read speech.

I do not want to discourage you from transcribing the recordings. I think free transcribed spontaneous speech will be useful in the future, and there is no harm in starting to collect it now.

One more thing: a useful tool for transcription is this: trans.sourceforge.net

Tomas

 

--- (Edited on 4/8/2009 2:31 am [GMT-0500] by tpavelka) ---

Re: transcribing experiments
User: nsh
Date: 4/8/2009 3:16 am
Views: 27
Rating: 1

As we discussed on IRC, first of all you need to collect a lot of text from your target domain and train a language model on it. I'm not sure why you haven't started on that yet.


You can use cmuclmtk, srilm or even this simple perl script for that:

http://scribblej.com/svn/lmgen/lmgen.pl


Get text from mailing lists and create a language model. Then just decode; the accuracy should be reasonably high.

 

--- (Edited on 4/8/2009 3:16 am [GMT-0500] by nsh) ---

Re: transcribing experiments
User: Visitor
Date: 4/8/2009 8:14 am
Views: 27
Rating: 1

> As we discussed on IRC, first of all you need to collect a lot of text from your target domain and train a language model on it. I'm not sure why you haven't started on that yet.

> Get text from mailing lists and create a language model.

I want to get something working with existing files (programs, models, configs) before I start trying to make it better.

> Then just decode; the accuracy should be reasonably high.

I am trying to figure out what I need to "just decode."  Like what was used to produce http://pastebin.ca/1384195 ?

 

 

--- (Edited on 4/8/2009 8:14 am [GMT-0500] by Visitor) ---

Re: transcribing experiments
User: kmaclean
Date: 4/8/2009 9:54 am
Views: 811
Rating: 1

Hi CarlFK,

> I am trying to figure out what I need to "just decode."

nsh and tpavelka provide excellent information, which adds to what you were given in your previous thread on this topic: PyCon transcription.

Here is what you can do to get started on your transcribing experiments:

  • Download Sphinx 4;
  • Play the Hello N-Gram Demo (uses N-gram Language Model for speech recognition);
  • Look at the actual "hellongram.trigram.lm" language model in the HelloNGram.jar file.  Replace this with the English Gigaword Language Model.
  • To improve recognition performance, start collecting lots of texts from your target domain by "taking transcriptions of previous conferences, mailing list archives, related documentation, technical papers on Python and so on..." (nsh post in your previous thread).
  • When you are ready to start creating your own language model, follow these steps from the Sphinx4 page:

Language Models

The language model used by Sphinx-4 follows the ARPA format. Language models provided with the acoustic model packages were created with the Carnegie Mellon University Statistical Language Modeling toolkit (CMU SLM toolkit), available at CMU. A manual is available there.

The language model is created from a list of transcriptions. Given a file with training transcriptions, the following script creates a list of the words that appear in the transcriptions, then creates bigram and trigram LM files in the ARPA format. The file with extension ccs contains the context cues, and it is usually a list of words used as markers (beginning or end of speech, etc.).

set task = RM

# Location of the CMU SLM toolkit
set bindir = ~/src/CMU-SLM_Toolkit_v2/bin

cat $task.transcript | $bindir/text2wfreq | $bindir/wfreq2vocab > $task.vocab

set mode = "-absolute"

# Create bigram
cat $task.transcript | $bindir/text2idngram -n 2 -vocab $task.vocab | \
 $bindir/idngram2lm $mode -context $task.ccs -n 2 -vocab $task.vocab \
 -idngram - -arpa $task.bigram.arpa

# Create trigram
cat $task.transcript | $bindir/text2idngram -n 3 -vocab $task.vocab | \
 $bindir/idngram2lm $mode -context $task.ccs -n 3 -vocab $task.vocab \
 -idngram - -arpa $task.trigram.arpa

Note: there are no guarantees that this will work because of the acoustic model limitations outlined by tpavelka.
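To make the first stages of the script above more concrete, here is a rough Python sketch of what they compute: word frequencies, a vocabulary, and raw n-gram counts. It is not the CMU SLM toolkit's implementation, it does no discounting or smoothing, and it does not write ARPA files. The function names, the sample transcript and the `<s>`/`</s>` markers are assumptions chosen for illustration.

```python
from collections import Counter


def word_frequencies(lines):
    """Count how often each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def build_vocab(counts, max_size=20000):
    """Keep the max_size most frequent words (sorted here only for readability)."""
    return sorted(word for word, _ in counts.most_common(max_size))


def count_ngrams(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical two-line transcript with sentence markers as context cues.
transcript = [
    "<s> hello world </s>",
    "<s> hello python </s>",
]

frequencies = word_frequencies(transcript)
vocab = build_vocab(frequencies)
bigrams = count_ngrams(transcript, 2)
trigrams = count_ngrams(transcript, 3)

print(vocab)
print(bigrams.most_common(3))
```

A real language model is estimated from counts like these, with the discounting step (the `-absolute` mode in the script) redistributing probability to unseen n-grams.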

Ken

 

 

--- (Edited on 4/8/2009 10:54 am [GMT-0400] by kmaclean) ---

Re: transcribing experiments
User: tpavelka
Date: 4/8/2009 10:38 am
Views: 117
Rating: 2

@CarlFK: I misunderstood your first post, I thought you were talking about manual transcription of speech. I did not realize you wanted to do automatic transcription of lectures, which is one of the hardest problems of ASR.

The problem is not just the acoustic model; the language modeling is just as difficult. There is a big difference between written text and spoken lectures, and if you train the language model on written text, the difference is going to show as a drop in your accuracy. Even if simple trigrams would work, they would have to be trained on a lot of real lecture transcriptions, and these are hard to come by.

And even if you manage to get your problem-specific text, you have to clean it up, which you cannot do manually because the amount of text you need is so large. You have to do so-called normalization, where you convert everything in the text to words. And you have to figure out the pronunciations, which is hard to do automatically for foreign names and abbreviations.
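As a rough illustration of the normalization step, the sketch below lowercases the text, spells out digits one at a time and strips punctuation. It is a deliberately naive assumption about what a first pass might look like: a real normalizer would read "2009" as a year, expand abbreviations, and handle non-ASCII letters, none of which is attempted here.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text: str) -> str:
    """Convert raw text into lowercase words separated by single spaces."""
    text = text.lower()
    # Spell out every ASCII digit separately, padded with spaces.
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Replace anything that is not a lowercase ASCII letter, apostrophe
    # or space with a space (this also drops accented characters).
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(normalize("PyCon 2009: Python 3.0 is here!"))
```

Even this toy version shows the problem raised above: "3.0" comes out as "three zero", which is not how a speaker would say it, so the language model would be trained on word sequences that never occur in the audio.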

 

--- (Edited on 4/8/2009 10:38 am [GMT-0500] by tpavelka) ---

--- (Edited on 4/8/2009 10:43 am [GMT-0500] by tpavelka) ---

Re: transcribing experiments
User: Visitor
Date: 4/9/2009 6:50 pm
Views: 35
Rating: 1

I have both sphinx3 and 4 built, did some of 4's demos.

> Replace this with the English Gigaword Language Model.

"Distribution:   1 DVD Non-member Fee:     US$3000.00"

That?

 

--- (Edited on 4/9/2009 6:50 pm [GMT-0500] by Visitor) ---

Re: transcribing experiments
User: kmaclean
Date: 4/9/2009 7:13 pm
Views: 27
Rating: 1

>"Distribution:   1 DVD Non-member Fee:     US$3000.00"

>That?

That is the fee for the source corpus itself.  Keith has kindly trained language models on the Gigaword dataset. Pick one of these (I have not tried this out myself, so you will need to experiment to see which one works best for you):


Vocab  Punc  Size    giga OOV%  giga ppl  setasd OOV%  setasd ppl  csr OOV%  csr ppl
5K     NVP   2-gram  12.5       123.7     12.0         132.5       12.1      130.9
20K    NVP   2-gram  4.1        199.7     3.7          215.9       3.8       213.9
64K    NVP   2-gram  1.7        238.6     1.1          264.9       1.2       262.5
5K     VP    2-gram  11.4       96.4      10.8         103.6       11.0      104.0
20K    VP    2-gram  3.8        141.8     3.4          151.2       3.5       153.0
64K    VP    2-gram  1.8        161.4     1.2          176.2       1.2       178.9
5K     NVP   3-gram  12.5       82.0      12.0         91.1        12.1      89.7
20K    NVP   3-gram  4.1        121.3     3.7          138.5       3.8       136.4
64K    NVP   3-gram  1.7        144.6     1.1          170.1       1.2       167.5
5K     VP    3-gram  11.4       62.0      10.8         67.8        11.0      67.9
20K    VP    3-gram  3.8        79.3      3.4          88.2        3.5       88.7
64K    VP    3-gram  1.8        90.3      1.2          103.1       1.2       104.1
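The OOV% columns report the share of words in a test set that fall outside the model's vocabulary. To get a feel for how one of these vocabularies would cover PyCon talks, a similar figure can be computed on your own text. The sketch below assumes you have the vocabulary as a set of words and the text already split into words; the tiny vocabulary and sentence are made up for illustration.

```python
def oov_rate(words, vocab) -> float:
    """Percentage of word tokens that are not in the vocabulary."""
    words = list(words)
    if not words:
        return 0.0
    missing = sum(1 for word in words if word not in vocab)
    return 100.0 * missing / len(words)


# Hypothetical vocabulary and sentence; real ones would be loaded from files.
vocab = {"the", "python", "is", "a", "language"}
text = "the python interpreter is a language runtime".split()

print(f"OOV rate: {oov_rate(text, vocab):.1f}%")
```

Domain terms such as library or project names are the likely source of out-of-vocabulary words in conference talks, and a recognizer cannot output a word that is missing from its vocabulary.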


--- (Edited on 4/9/2009 8:13 pm [GMT-0400] by kmaclean) ---

Re: transcribing experiments
User: CarlFK
Date: 4/10/2009 12:23 pm
Views: 132
Rating: 1

wget http://www.inference.phy.cam.ac.uk/kv227/lm_giga/lm_giga_5k_nvp_2gram.zip
unzip lm_giga_5k_nvp_2gram.zip

I have foo.wav - what next?

Again, I am not really interested in the quality of the results, just in getting some pieces put in place that give some results.  Until I can do something, I can't do something better, so all I am looking for is getting something working.  I do of course appreciate the advice, but right now it is just getting pushed onto the pile of stuff I hope I can use later, which often gets forgotten about.


--- (Edited on 4/10/2009 12:23 pm [GMT-0500] by CarlFK) ---

Re: transcribing experiments
User: kmaclean
Date: 4/22/2009 12:48 pm
Views: 140
Rating: 1

Hi CarlFK,

>I have foo.wav - what next?

What are you trying to do?... recognize the speech in foo.wav?  I don't know Sphinx very well, but there must be a setting to point to a wav file and attempt to recognize the speech it contains.
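Before pointing a recognizer at foo.wav, it may help to check the file's audio format, since acoustic models are trained for a specific sample rate and channel layout. The sketch below uses only Python's standard `wave` module. The default of 16 kHz, mono, 16-bit is an assumption about the acoustic model, not something stated in this thread; check the documentation of the model you actually use.

```python
import wave


def describe_wav(path):
    """Read basic format information from an uncompressed WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate if rate else 0.0,
        }


def looks_decoder_ready(info, expected_rate_hz=16000):
    """Assumed target format: mono, 16-bit, expected_rate_hz samples per second."""
    return (
        info["channels"] == 1
        and info["bits_per_sample"] == 16
        and info["sample_rate_hz"] == expected_rate_hz
    )


# Example usage (foo.wav is the file mentioned in the thread):
#   info = describe_wav("foo.wav")
#   print(info, looks_decoder_ready(info))
```

If the check fails, the audio would need to be resampled or mixed down with an external tool before decoding.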

Ken

--- (Edited on 4/22/2009 1:48 pm [GMT-0400] by kmaclean) ---
