Automated Audio Segmentation Using Forced Alignment (Draft)

Step 1 - Create Pronunciation Dictionary

First you need to make sure that all the words in the eText of the audio book are contained in the VoxForge Lexicon.  The Lexicon file contains the pronunciations used for Acoustic Model creation; if you try to train an Acoustic Model with a word that is not in the Lexicon file, the training process will end abnormally.

This section will guide you through the process of creating a list of all the words in the eText, comparing it against the Lexicon file, and creating a log of all the missing words.  The missing words will then need to be added to the VoxForge Lexicon (with pronunciations).

In this example, we used the Librivox text for the History of England, from the Accession of James the Second, by Thomas Babington Macaulay (eText.txt & Librivox Audio File).

Cleanup eText file

You might need to convert the text file to your OS format (see this link for an explanation of the format differences, and click here to obtain the flip utility for Mac or Windows/MS-DOS environments).  On Linux, use the dos2unix command to convert the file from MS-DOS (or Mac) format to Unix format:

$ dos2unix eText.txt

Then run it through a spell checker to correct any spelling mistakes. 

You will also need to change your eText so that numbers and dates are spelled out, as follows:

  • numbers - if you find any numbers in your eText, write them out in the eText itself.  For example, the number '10' should be converted to 'ten' in the eText.
  • dates - all dates should also be written out in long form.  For example, the date '1928' should be converted to 'Nineteen Twenty Eight' in the eText.

You can do this by hand, or wait until you run the HDMan command below; it will log any words in the eText file that are missing from the Lexicon file, which only includes spelled-out numbers and dates.

Note: you may need to listen to the actual audio to confirm that the reader has read the number or date as you would expect.  Sometimes they might say 'one, oh' rather than 'ten' for the number '10', or 'one, nine, two, eight' rather than 'Nineteen Twenty Eight' for the date '1928'.  Always transcribe the number or date the way the reader says it.
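A quick grep for digits is an easy way to spot numbers and dates that still need to be written out.  This is a minimal sketch; the sample file below is made up for illustration:

```shell
# Create a tiny sample eText (hypothetical) and scan it for leftover digits;
# each hit is a number or date that still needs to be spelled out.
printf 'The year 1928 was cold.\nAll ten men left the town.\n' > sample_eText.txt
grep -nE '[0-9]' sample_eText.txt   # prints: 1:The year 1928 was cold.
```

Run the same grep against your real eText.txt; an empty result means there are no numeric tokens left to convert.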

Create Word List file

Next create a word list file using the etext2wlistmlf.pl script:

$ perl ./etext2wlistmlf.pl eText.txt

This will create a wlist file which contains an alphabetical list of all the words in the eText file. 

Note: this script adds the following entries to your wlist file:

SENT-END
SENT-START  

These are HTK internal entries required for the start of a sentence (SENT-START) and the end of a sentence (SENT-END).  
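If you are curious what the script does under the hood, the core of the word-list extraction can be approximated with standard Unix tools.  This is only an illustrative sketch, not the actual etext2wlistmlf.pl logic (and it omits the SENT-START/SENT-END entries):

```shell
# Rough approximation of building a wlist: one word per line, punctuation
# stripped, uppercased (HTK dictionary entries are uppercase), sorted and
# de-duplicated.  Note: this also strips apostrophes, which the real
# script likely handles more carefully.
tr -s '[:space:]' '\n' < eText.txt \
  | tr -d '[:punct:]' \
  | tr '[:lower:]' '[:upper:]' \
  | sort -u > wlist_sketch
```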

Download VoxForge Lexicon 

Next download and extract the most current version of the VoxForge Lexicon (also called a pronunciation dictionary) from here:

VoxForge.tgz

Extract it into your current directory.

Find Missing Words 

First you need to create the global.ded (default script used by HDMan), which contains:

AS sp
RS cmu
MP sil sil sp

These commands append the short-pause phone sp to each pronunciation (AS sp), strip CMU-style stress markings (RS cmu), and merge sil sp sequences into sil (MP sil sil sp).  See the HTK Book for details of what these commands mean.

Next run the HTK HDMan command against the VoxForge Lexicon to see if there are any missing words (i.e. words in your eText.txt file that are not included in the VoxForge Lexicon):

$ HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

The following is the output from this command:

HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

No HTK Configuration Parameters Set

Output dictionary dict opened
Source dictionary VoxForgeDict opened
Dictionary dict created - 969 words processed, 52 missing

No HTK Configuration Parameters Set

Review the HDMan log file (dlog) to determine which words (if any) in the eText file are missing from the VoxForge Lexicon.

If you don't have any missing words go to Step 3.

If you have missing words, review the list and look for any that might correspond to numbers or dates in numeric format.  If you find any of these, then change them in your eText file (i.e. so that they are written using words, not numbers), and re-run the HDMan command as shown above.  If you still have missing words showing in the dlog file, then go to Step 2. 

Step 2 - Add Missing Words to Your Copy of the VoxForge Lexicon

Manually 

To add a missing word (as displayed in your HDMan log - dlog) to the VoxForge Lexicon, you need to look at the pronunciation of similar words in the dictionary, and create a new pronunciation entry for your word based on these similar words. 

For example, if you want to add the word "winward", you would look up words that are similar, such as:

WINWOOD [WINWOOD] w ih n w uh d

In this case, this gives us the pronunciation for the "win" in the word "winward".  Next, we look for words that contain "ward" in the dictionary, such as:

WOODWARD [WOODWARD] w uh d w er d
WARD [WARD] w ow r d

Notice that although the words "woodward" and "ward" contain the same sequence of letters (ward), they are pronounced differently - they have different phoneme sequences.  Next you need to make a judgment call based on your knowledge of your English dialect (you might also want to listen to the actual audio passage that contains the word, but this could take too much time for each and every word you are unsure of...).  For me, the way I pronounce the word part "ward" in "winward" is closer to the sounds I make in "woodward" than in the word "ward".  Therefore, the final pronunciation dictionary entry I would use would look like this:

WINWARD [WINWARD] w ih n w er d

You then need to add this word to your version of the VoxForge Lexicon in *alphabetical* sequence.  You need to repeat these steps for all the missing words in your eText.  It's a little tedious the first time you perform this process, but as you get familiar with the words and phonemes, it goes much quicker.
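Since HDMan expects the dictionary entries in sorted order, it can save some frustration to verify the order after each batch of additions.  A minimal check (the LC_ALL=C byte-wise ordering is an assumption; adjust if HDMan still complains):

```shell
# Verify that the edited lexicon is still in sorted order;
# 'sort -c' exits non-zero and names the first out-of-order line.
LC_ALL=C sort -c VoxForgeDict && echo "VoxForgeDict is sorted"
```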

Manually with help from Festival 

Start Festival

$ festival

From the Festival command line, there is a series of "lex" commands that can help you determine the phonemes contained in a word that is not included in the VoxForge dictionary, and as an added bonus, you can actually listen to how Festival pronounces the word to get a better feel for the phonemes.

First, find out which lexicons (i.e. pronunciation dictionaries and rules) are included in your distribution of Festival using the "lex.list" command as follows:

festival> (lex.list) 

("english_poslex" "cmu")

Since VoxForge is based on the cmu dictionary, we can use Festival to determine the phonemes of an unknown word, using Festival's dictionary and pronunciation rules (see here for Festival's phone list).

Festival (rel 1.95) usually uses the "cmu" lexicon by default.  To make sure that you are using this dictionary, use the following command:

festival> (lex.select "cmu")

Next, to determine the pronunciation of a word use the "lex.lookup" command as follows:

 festival> (lex.lookup "internet")

("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))

Festival lists the phonemes in the word, but also includes numbers (these indicate the "lexical stress" of each syllable).  Ignore the parentheses and numbers, and you have Festival's view of the phonemes that make up the word you entered.  Therefore, for the word "Internet", Festival says its phonemes are: "ih n t er n eh t".
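To avoid hand-stripping the parentheses and stress numbers, the lex.lookup output can be flattened with standard tools.  This sketch assumes output in the exact form shown above; the pipeline is illustrative and not part of Festival:

```shell
# Flatten a lex.lookup result to a plain phoneme string: delete parentheses
# and stress digits, drop the leading word and part-of-speech fields, then
# squeeze the whitespace.
echo '("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))' \
  | tr -d '()0-9' \
  | sed 's/^"[a-z]*" nil //' \
  | xargs
# prints: ih n t er n eh t
```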

Semi-automated approach using Festival (manual corrections required)

Create a new file called MissingWords, and copy into it the missing words listed in the dlog file from the HDMan run.

Next, run the MissingWordsCleanup.pl script as follows:

$ perl ./MissingWordsCleanup.pl

This will create a good first draft of the pronunciations for the missing words, in a file called MissingWords_out.  You still need to confirm these pronunciations to make sure they are OK.  You can do this by looking at similar groups of letters in the missing words, and looking up the pronunciations of these groups in other known words - if they match, use what Festival recommends.  If they don't match, you need to make a judgment call based on your knowledge of English.

After you've added all the missing words to your copy of the VoxForge dictionary

In the current example, once all the missing words have been added, your VoxForge Lexicon should look like this: VoxForgeDict.

Once you finish adding all your words, re-run the HDMan command:

 $ HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

And review the HDMan log output (i.e. dlog) again to make sure that you did not miss any other words. 

Note: One common error is to put the new entries in the Lexicon file in the wrong sort order.  You might have to experiment with word placement (especially with words containing non-alphanumeric characters) to get it so that HDMan will run correctly.

The HDMan command will create a dictionary file called dict.  Your dict file is essentially all the words in your wlist file with added pronunciation information.

Step 3 - Temporarily DownSample Your Wav File

Forced Alignment using HVite only seems to work with audio recorded at a 16 kHz sampling rate with 16 bits per sample.  So we will create a temporary version of your audio at 16 kHz/16 bits, run HVite, and then use the time alignments generated by the Forced Alignment process to segment your original audio.

You need to use the SoX sound editing utility to downsample your audio.  The SoX command syntax is as follows:

    $ sox original.wav -c 1 -r 16000 -w downsampled.wav 

  • -c 1 - converts stereo audio to mono
  • -r 16000 - the target sampling rate of 16kHz
  • -w - the target bits per sample (word length = 16 bits)

In this example, we will use the audio from jimmowatt's "History of England" submission:

historyofengland01ch04_01_macaulay.wav (165 MB)

We will downsample it to 16 kHz/16 bits using the following sox command:

    $ sox historyofengland01ch04_01_macaulay.wav -c 1 -r 16000 -w downsampled.wav

which creates the downsampled.wav file.   It's a good idea to open the resulting file with Audacity to make sure the file converted OK.

Step 4 - Convert eText to MLF file

Convert the transcription to an HTK 'mlf' file by executing the etext2wlistmlf.pl script in the following format:

    $ perl ./etext2wlistmlf.pl [eText.txt file] [wav filename (no suffix)] 

For our current example, execute etext2wlistmlf.pl as follows:

    $ perl ./etext2wlistmlf.pl eText.txt downsampled

This creates the words.mlf file.

Note: you need to make sure that the text contained in the eText is the same as what is recorded.

Step 5 - Forced Alignment

Next run HVite (an HTK tool), which uses the VoxForge Acoustic Model to line up the written words in the words.mlf file with the spoken words in the corresponding audio book file, and generates time alignments.

First you need to create an HVite configuration file called "wav_config" containing the following: 

SOURCEFORMAT = WAV
TARGETKIND = MFCC_D_N_Z_0
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12

Next create a training script called "train.scp" with the name(s) of the audio files you will be using for forced alignment, something like this:

downsampled.wav

Next, get the most current version of the 16khz-16bit VoxForge Acoustic Models (you can use the current stable release, or one of the nightly builds).  Copy the following files to your directory:

  • macros
  • hmmdefs
  • tiedlist

Run the HVite command as follows:

$ HVite -A -D -T 1 -l '*' -a -b SENT-END -m -C wav_config -H macros -H hmmdefs -t 250.0 150.0 1000.0 -I words.mlf -i aligned.out -S train.scp dict tiedlist

This creates a file called aligned.out containing all the words from your words.mlf file, with time alignments.

Note: different acoustic models may produce slightly different forced alignment results (i.e. the better the Acoustic Model, the more accurate the forced alignments).

 

Step 6 - Validate Audio with Audacity

1. Run the htklabels2audacity.pl script to convert the HTK time stamps into a format readable by Audacity, as follows

    $ perl ./htklabels2audacity.pl aligned.out audacityLabelTrack.txt

This creates a label file called audacityLabelTrack.txt that can be opened in Audacity so you can compare it with the original audio book and confirm that the Forced Alignment times look OK.

2. Open your original speech audio file (historyofengland01ch04_01_macaulay.wav) as an audio track in Audacity.  Next open the audacityLabelTrack.txt label file as a label track in Audacity.  To perform a quick confirmation that the alignments look OK, you will need to zoom in to certain sections using Audacity's 'Zoom to Selection' feature, listen to the audio, and make sure the spoken audio matches the label. 
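HTK writes label times in 100-nanosecond units, while Audacity label tracks expect seconds, so the conversion in step 1 boils down to dividing each timestamp by 10^7.  A minimal awk sketch of that step, assuming the usual HTK "start end word" label-line layout (the actual htklabels2audacity.pl script may do more):

```shell
# Convert an HTK label line ("start end word", times in 100 ns units)
# to an Audacity label-track line ("start<TAB>end<TAB>word", times in seconds).
echo "2500000 7500000 HELLO" \
  | awk '{ printf "%.4f\t%.4f\t%s\n", $1 / 1e7, $2 / 1e7, $3 }'
# prints: 0.2500	0.7500	HELLO
```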

Step 7 - Run the Segmentation Script

First, create a directory called 'wav'.

Prepare Your Original Audio File for Processing 

Next, you need to rename your audio file with a 3 or 4 character name, because this is the name that will be used for the segmented wav files (e.g. hoe.wav will be segmented into hoe0001.wav, hoe0002.wav, hoe0003.wav, ...).  In addition, your audio file may require some further changes, because HTK only works with audio files recorded at a maximum of 16 bits per sample, and the segmenting script assumes that the audio was recorded in mono.

If your audio is a mono recording, at 16 bits per sample, then you only need to rename your file to a short filename (keep the wav suffix):

    $ mv original.wav target.wav

If your audio was recorded in 32-bit float format, you can use SoX to convert it to 16 bits (using the '-w' 16-bit word parameter), or if it was recorded in stereo (i.e. using 2 channels), you can also use SoX to convert it to mono (using the '-c 1' single channel parameter).  In our current example, you would use SoX as follows:

    $ sox historyofengland01ch04_01_macaulay.wav -w -c 1 hoe.wav

Run Audio Segmentation Script

Next you need to create an HCopy configuration file called "copy_config" containing the following (the htksegment.pl script below uses HCopy to segment the audio):  

SOURCEKIND=WAVEFORM
SOURCEFORMAT=WAV
TARGETKIND=WAVEFORM
TARGETFORMAT=NOHEAD
NATURALWRITEORDER = T
NATURALREADORDER = T 

Next, run the htksegment.pl script to perform the actual segmentation of the audio into many smaller files and create a corresponding prompt file.  It uses the following parameters:

    $ perl htksegment.pl [wav filename] [sample rate]

So for our current example, you would run it as follows:

    $ perl htksegment.pl hoe.wav 44100

Step 8 - Submit your segmented files to VoxForge

Create Readme file

Create a README file that describes your submission.  Right-click this link and save the file to your upload folder.  Modify the entries where appropriate:

  • Each line in the readme has a question with some possible answers within brackets.  Please replace the suggestions between the brackets with your answer.
  • Take your best guess as to the original author's dialect (follow this link for help on this).  If you are not sure, just put in Librivox.

Create License file 

Next, create a LICENSE file for your submission.  Right-click this link and save the file to your upload folder.   Change the year to the current year, and the 'name of author' to your name (or to the 'Free Software Foundation' - if you wish to assign your copyright to the FSF). 

Although the audio book you segmented is likely in the public domain, you have copyright over the way the audio was segmented (because you have rearranged this audio in a unique way) and therefore you can license the segmented audio under GPL. 

Tar your files

Please create a single compressed tar file containing the following files:
  • your segmented wav files;
  • the corresponding prompts file;
  • your updated eText file (remove any references to Project Gutenberg - see this FAQ for an explanation why);
  • any changes/updates you might have made to the VoxForge Lexicon; and
  • your README and LICENSE files.
Name your tar file as follows: "[voxforge username]-[year][month][day].tgz".  For example, if you stored all these files in the /home/myusername/segment folder, you would execute the following command to create your gzipped tar file:

$ cd /home/myusername
$ tar -zcvf kmaclean-20070125.tgz segment

Connect to the VoxForge FTP site

Connect to the site using your favourite FTP client (see link below). 

If you are using Firefox 1.5 or greater, you can use FireFTP, a cross-platform FTP client.  For Linux you can use Nautilus (Gnome), for Windows you can use FileZilla or WinSCP, and Cyberduck can be used on a Mac.  

(Note: You need to be registered on the VoxForge site for the link to display and to get the current password. )

Copy your TarFile

Copy your compressed tarfile to the VoxForge FTP site. 

Submission Notification

Please add a note stating that you submitted some audio to the VoxForge FTP site and/or ask any questions about the FTP submission process.  You can do this by clicking the 'Add' link below (note: it is only visible if you are logged in).

thanks!
