Audio and Prompts Discussions

Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/2/2007 3:05 pm
Views: 23606
Rating: 44
email from David Gelbart:
 
Hi Ken,

I was wondering regarding the Google Summer of Code proposal: how are
you planning to build an automatic segmenter that can segment
transcripts as well as segmenting audio?

It seems there needs to be some kind of speech recognition involved,
in order to keep the transcript chunks synchronized with the audio
chunks.  Ivan Uemlianin talked about using human speech recognition
for this on the OSSRI list
(http://harvee.org/pipermail/ossri/2006-November/001793.html).

If you can run an existing recognizer in 'forced alignment' mode (in
which it is given the transcript and tries to time-align the
transcript to the audio), doing this with automatic speech recognition
might actually be fairly straightforward to do.  The HTK web site says
that HTK supports forced alignment.  I didn't check JULIUS or Sphinx
but hopefully they do as well.  I think this will be particularly easy
if the recognizer will support doing a forced alignment on the entire
audio at once; then you could simply use a script to parse the forced
alignment output and break the transcript where the forced alignment
indicates pauses.  But personally, I've only run forced alignments on
audio that was already segmented into small chunks, so I don't know
how likely it is that running a forced alignment on an hour of audio
at once will be supported.
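
A minimal sketch of the "parse the forced alignment output and break the
transcript at pauses" script described above. It assumes word-level label
lines in the HTK style ("start end label", times in 100 ns units); the pause
labels "sil" and "sp", the 0.3 second threshold, and both function names are
illustrative assumptions, not something taken from a particular recognizer
setup.

```python
from typing import List, Tuple

HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in 100 ns units


def parse_labels(text: str) -> List[Tuple[float, float, str]]:
    """Parse 'start end label [score]' lines into (start_s, end_s, label)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip headers, blank lines and terminator lines
        try:
            start, end = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # not a timed label line
        rows.append((start / HTK_UNITS_PER_SECOND,
                     end / HTK_UNITS_PER_SECOND,
                     parts[2]))
    return rows


def split_at_pauses(rows, pause_labels=("sil", "sp"), min_pause_s=0.3):
    """Group words into chunks, cutting wherever a pause lasts min_pause_s or more."""
    chunks, current = [], []
    for start, end, label in rows:
        if label in pause_labels:
            if end - start >= min_pause_s and current:
                chunks.append(current)
                current = []
            continue  # pauses are never kept as words
        current.append((start, end, label))
    if current:
        chunks.append(current)
    # One (chunk_start_s, chunk_end_s, chunk_text) tuple per chunk.
    return [(c[0][0], c[-1][1], " ".join(w for _, _, w in c)) for c in chunks]
```

Each returned tuple gives the cut points for the audio and the matching
transcript text, so the same pass yields both segmentations.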

Regards,
David

--- (Edited on 3/ 2/2007 4:05 pm [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/2/2007 3:06 pm
Views: 394
Rating: 38

More from David Gelbart: 

Regarding forced alignment, I think you will find this paper of
interest:

Evaluating Factors Impacting the Accuracy of Forced Alignments...
Lei Chen, Yang Liu, Mary Harper, Eduardo Maia, Susan McRoy
http://citeseer.ist.psu.edu/724385.html

They compare forced alignment performance of HTK and ISIP. They also
mention at least one downloadable system that can be used for
alignment, the HTK-based Aligner.  I wonder if the trained models for
their WSJ-HTK or SWB-ISIP system can be downloaded too?

They also considered the impact of doing segmentation before
alignment, and found it helpful.

It looks like Lei Chen is still at Purdue:
http://cobweb.ecn.purdue.edu/~chenl

Yang Liu is now in Dallas:
http://www.hlt.utdallas.edu/~yangl/

I didn't check up on the other authors.  I know Yang from ICSI; she's
very nice.

Feel free to quote from this mail (or my previous mail) on the
VoxForge forums if you like.

--- (Edited on 3/ 2/2007 4:06 pm [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/2/2007 3:06 pm
Views: 381
Rating: 38

More from David Gelbart: 

> They also considered the impact of doing segmentation before
> alignment, and found it helpful.

Although they attributed it to crosstalk: "This is probably due the
considerable channel crosstalk in this corpus."  So maybe you can do
alignment on the unsegmented data.  That would certainly be
convenient.

--- (Edited on 3/ 2/2007 4:06 pm [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/2/2007 3:07 pm
Views: 463
Rating: 28
Hi David,

If you go back a few posts on the ossri site, you'll notice that I started that whole thread :)  

The Julius adintool tool allows you to use silence detection to segment a larger speech audio file.

My sense is that running any forced alignment on a large file would take way too long. I was figuring on plugging the audio through adintool, then using a 'brute strength and ignorance' approach to try to find the matching text for each resulting file: looking at the text for punctuation at first to see if it matches an audio segment and, if it does not, adding or dropping words until a match is found.  Basically, the first sentence in the supplied text would be used as the grammar to recognize (using the VoxForge AMs) the first audio segment.  Once a match is found, remove the recognized text and the audio from their respective text and audio files, and continue on.  Not pretty, but it should at the very least provide a first pass that would need to be manually re-segmented thereafter.  As we get more experience with this process, we can look at automating it more.
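
A rough sketch of that matching loop, to make the idea concrete. The `score` callback is hypothetical: it stands in for running the recognizer on one audio segment with the candidate words as the grammar and returning some confidence value. The threshold, the `max_shift` limit and the function names are likewise assumptions for illustration only.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple


def candidate_lengths(words: List[str], max_shift: int = 3) -> List[int]:
    """Word counts to try: the first punctuation boundary, then +/- a few words."""
    boundary = next(
        (i + 1 for i, w in enumerate(words) if re.search(r"[.!?;:,]$", w)),
        len(words),
    )
    lengths = [boundary]
    for shift in range(1, max_shift + 1):
        for n in (boundary + shift, boundary - shift):
            if 1 <= n <= len(words):
                lengths.append(n)
    return lengths


def match_segments(
    text: str,
    segments: Sequence[object],
    score: Callable[[object, List[str]], float],
    threshold: float = 0.5,
) -> List[Tuple[object, Optional[str]]]:
    """Greedily pair each audio segment with a prefix of the remaining text."""
    words = text.split()
    matched = []
    for seg in segments:
        hit = None
        if words:
            for n in candidate_lengths(words):
                if score(seg, words[:n]) >= threshold:
                    hit = n  # stop at the first acceptable match
                    break
        if hit is None:
            matched.append((seg, None))  # leave for manual re-segmentation
            continue
        matched.append((seg, " ".join(words[:hit])))
        words = words[hit:]  # remove the recognized text and continue on
    return matched
```

Segments that come back paired with `None` are the ones that would still need manual attention in the first pass.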

An HTK time alignment approach might work too, but it seems to try to make a match even where there is none, hence the term 'forced alignment' I guess. 

Basically some experimentation will be required to find a workable method - it does not have to be perfect, but 'good enough' (I think that's going to be my motto on this project) to get the job done.

Ken

--- (Edited on 3/ 2/2007 4:07 pm [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: Tony Robinson
Date: 3/4/2007 1:59 pm
Views: 355
Rating: 22

> My sense is that running any forced alignment on a large file would take way too long.

There are many papers describing training on data that wasn't transcribed for recognition.  In order to do so, some form of forced alignment must have taken place in order to split the data up into chunks suitable for forward/backward training.  I'd suggest that it would be worth testing your assumption, particularly as LibriVox data is very clean, and so should segment/align quite easily.

 Tony

 P.S. Is there some way to "subscribe" or otherwise be notified of all recent posts to every forum on this site?  

--- (Edited on 3/ 4/2007 1:59 pm [GMT-0600] by Tony Robinson) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/4/2007 3:01 pm
Views: 382
Rating: 24

Hi Tony,

Excellent! Thanks for the info re: forced alignment.

With respect to subscribing to a forum, after you click on the title of a forum on the Forums page, a list of threads appears.  At the very top of this page there should be a subscribe link.

Notes:

  • You need to be logged in for this subscribe link to appear
  • You need to subscribe to each forum - there is no 'master subscribe'. 
  • Sometimes, there can be a long delay (1+ days) before you get an email notification.

Ken 

--- (Edited on 3/ 4/2007 4:01 pm [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/6/2007 9:22 am
Views: 347
Rating: 19
From David Gelbart:
 
> My sense is that running any forced alignment on a large file would take way
> too long.

I'm not sure.

> An HTK time alignment approach might work too, but it seems to try to make a
> match even where there is none, hence the term 'forced alignment' I guess.

I expect it can give you a probability for the alignment, which may be
usable for rejecting a mismatch.  Perhaps the probability would tend
to be lower for long utterances, in which case you might have to normalize
it for utterance length somehow.  If you can run the alignment on the
entire unsegmented audio book then you wouldn't have a need to detect
mismatches, although as we've discussed that might be undesirable for
other reasons.
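
One simple way to do the length normalization mentioned above is to divide
the total alignment log-likelihood by the number of frames, so that short and
long utterances are compared on the same scale. This is only a sketch: the
function names and the threshold of -80.0 are made-up placeholders, and a
usable threshold would have to be tuned on real alignments.

```python
def per_frame_score(total_log_likelihood: float, num_frames: int) -> float:
    """Average log-likelihood per frame, comparable across utterance lengths."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return total_log_likelihood / num_frames


def is_mismatch(total_log_likelihood: float, num_frames: int,
                threshold: float = -80.0) -> bool:
    """Flag an alignment whose per-frame score falls below the threshold."""
    return per_frame_score(total_log_likelihood, num_frames) < threshold
```

Without the division, a long utterance with a good alignment would score
lower than a short one with a poor alignment, simply because more frames
contribute to the sum.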

--- (Edited on 3/ 6/2007 10:22 am [GMT-0500] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: kmaclean
Date: 3/11/2007 11:41 pm
Views: 431
Rating: 24

Hi David, 

Thanks again for your feedback on this. 

The main conclusion from the paper you cite is this:

 From this study, we found that segmenting the speech files prior to alignment improves the overall alignment accuracy and that alignment accuracy is enhanced by using more advanced acoustic models and more training data matched on speaking style (conversational versus planned) to the data to be aligned. Speaker adaptation improves the models somewhat, but more so for the weaker models.

I am really only looking to segment the audio and text data into 5-10 second snippets - I don't need word-level or phone-level time alignments. 

I have done some experiments with HTK.  As you indicated, the run time for forced alignment is not an issue (less than 1 minute for a 60 meg wav file).  Unfortunately, HVite Forced Alignment using the full text does not yield acceptable results.  It may be that I need to enter silence phones at the end of each sentence and paragraph - I will try that next. 
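
A sketch of what "entering silence at the end of each sentence and paragraph" could look like when preparing the word sequence for alignment. The token name `sil`, the upper-casing of words and the punctuation rules are assumptions made for illustration; they would need to match whatever the dictionary and acoustic models actually expect.

```python
import re
from typing import List


def words_with_silence(text: str, sil: str = "sil") -> List[str]:
    """Turn plain text into a word list with a silence token after each
    sentence and at the end of each paragraph."""
    tokens: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        for raw in paragraph.split():
            # Keep letters and apostrophes only; dictionary words assumed upper case.
            word = re.sub(r"[^A-Za-z']", "", raw).upper()
            if word:
                tokens.append(word)
            # Sentence-ending punctuation, optionally followed by quotes or brackets.
            if re.search(r"[.!?]\W*$", raw):
                tokens.append(sil)
        # Close the paragraph with silence unless one was just added.
        if tokens and tokens[-1] != sil:
            tokens.append(sil)
    return tokens
```

The resulting list can then be written out in whatever transcription format the aligner reads.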

Next I will try to segment the data by paragraph (manually or using Julius' adintool utility) as described above, and try to use Forced Alignment to determine where the sentence boundaries are.  If that doesn't work, I'll try speaker adaptation with part of the LibriVox audio book submission - since I am limited to the VoxForge Acoustic Model for HTK and Julius experimentation.  Julian does not seem to handle large wav files (greater than 10 seconds).

If none of these work, then I may need to look at using ISIP alignment tools or Sphinx-align, since they use more robust Acoustic Models.  Or simply perform manual sentence segmentation until the VoxForge AM gets robust enough to be useful for Forced Alignment Segmentation.  

Ken 

--- (Edited on 3/12/2007 12:41 am [GMT-0400] by kmaclean) ---

Re: Automatic Segmentation of LibriVox Audio
User: mofei
Date: 3/27/2007 4:07 am
Views: 378
Rating: 21

Hi -

first of all please excuse my ignorance and the slight off-topicness of this posting.

I'm a researcher in neuro/computational linguistics (but not the speech variety, unfortunately), and am looking for some way to time-annotate plain text relative to a speech file. I do EEG research, and will be recording the electrical activity on people's scalps (aka "brainwaves") while they listen to a recorded spoken text. Afterwards the EEG file has to be aligned to the word level of the written text, so we can trace brain responses to individual words. So I would need an output something along the lines of:

[text] [audio-file]
The 0-232ms
lazy 233-507ms
brown 598-789ms
fox 790-1012ms
...

I figured the best way to do this would be to force-align an audio-book with its text. The book I would like to use is Beatrix Potter's Peter Rabbit: http://www.gutenberg.org/etext/12702.

I saw that you've been discussing how long segments can be, and what level of training/tweaking is required to get acceptable results. Is it at all feasible to try and force-align longer stretches of audio-book text, of say 10 minutes? (I guess the recordings are pretty clean compared to say phone transcripts and the like). How much do things like speech accent and plain text formatting matter?

Any suggestions welcome!

Brian Murphy

CIMeC, University of Trento, Italy 

 

--- (Edited on 3/27/2007 4:07 am [GMT-0500] by mofei) ---

Re: Automatic Segmentation of LibriVox Audio
User: Tony Robinson
Date: 3/28/2007 6:42 am
Views: 398
Rating: 29

Hi Brian,

There should be no problem performing a forced alignment of ten minutes of audio.   You do need the audio and the text to correspond (ie. plain text with no words missing or extra), a pronunciation for each word and pre-existing acoustic models.   Accent isn't too much of an issue.
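
As a rough illustration of the "pronunciation for each word" requirement, a quick pre-check like the one below can list transcript words that are missing from the pronunciation dictionary before any alignment is attempted. It assumes a plain-text dictionary with one "WORD phone phone ..." entry per line and upper-case headwords; the function names are hypothetical.

```python
import re
from typing import List, Set


def load_headwords(dict_text: str) -> Set[str]:
    """Collect the first field of each 'WORD phone phone ...' line."""
    words = set()
    for line in dict_text.splitlines():
        parts = line.split()
        if parts:
            words.add(parts[0].upper())
    return words


def missing_words(transcript: str, known: Set[str]) -> List[str]:
    """Transcript words with no dictionary entry, in order of first appearance."""
    seen, missing = set(), []
    for token in re.findall(r"[A-Za-z']+", transcript):
        word = token.upper()
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing
```

Any word this reports needs a pronunciation added (or the text corrected) before the alignment is run.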

Contact me off-board if you like (tonyr at cantabResearch.com) and I should think we can work something out to get the timings you need.

 

Tony 

-- 

Dr Tony Robinson, CEO Cantab Research Ltd
Phone:  +44 845 009 7530, Fax: +44 845 009 7532


--- (Edited on 28-March-2007 12:42 pm [GMT+0100] by Tony Robinson) ---
