Step 5 - Forced Alignment

Next, run HVite (an HTK tool). It uses the VoxForge Acoustic Model to line up the written words in the words.mlf file with the spoken words in the corresponding audio book file, and produces time alignments for them.

First you need to create an HVite configuration file called "wav_config" containing the following: 

SOURCEFORMAT = WAV
TARGETKIND = MFCC_D_N_Z_0
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
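HTK expresses times in units of 100 nanoseconds, so the TARGETRATE and WINDOWSIZE values above are easier to read once converted to milliseconds. The following is a minimal sketch of that arithmetic, not part of HTK: the helper names (parse_htk_config, htk_time_to_ms) are made up for illustration, and the parser only handles simple "NAME = VALUE" lines.

```python
# HTK time values are given in units of 100 ns, i.e. 10,000 units per millisecond.
HTK_UNITS_PER_MS = 10_000

def parse_htk_config(text):
    """Parse simple 'NAME = VALUE' lines into a dict of strings.

    Hypothetical helper for illustration: blank lines and '#' comments are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings

def htk_time_to_ms(value):
    """Convert an HTK time value (100 ns units) to milliseconds."""
    return float(value) / HTK_UNITS_PER_MS

# The two time-valued settings from wav_config.
WAV_CONFIG = """
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
"""

settings = parse_htk_config(WAV_CONFIG)
frame_shift_ms = htk_time_to_ms(settings["TARGETRATE"])   # spacing between feature frames
window_ms = htk_time_to_ms(settings["WINDOWSIZE"])        # length of each analysis window
print("frame shift (ms):", frame_shift_ms)
print("window size (ms):", window_ms)
```

With these settings a feature vector is computed every 10 ms over a 25 ms window, so consecutive windows overlap.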

Next, create a script file called "train.scp" listing the name(s) of the audio file(s) you will be using for forced alignment, something like this:

downsampled.wav
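If you have several audio files, the list can be generated rather than typed by hand. This is a small sketch under stated assumptions: the function name write_scp is hypothetical, the audio files are assumed to end in ".wav", and only bare file names are written, which assumes HVite will be run from the directory that holds the audio.

```python
from pathlib import Path

def write_scp(wav_dir, scp_path):
    """Write the names of all .wav files in wav_dir to scp_path, one per line.

    Hypothetical helper: bare file names are written, so HVite is assumed
    to be run from wav_dir. Returns the list of names written.
    """
    wav_files = sorted(p.name for p in Path(wav_dir).glob("*.wav"))
    Path(scp_path).write_text("".join(name + "\n" for name in wav_files))
    return wav_files
```

For example, write_scp(".", "train.scp") would list every .wav file in the current directory in train.scp.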

Next, get the most current version of the 16kHz-16bit VoxForge Acoustic Models (you can use the current stable release, or one of the nightly builds). Copy the following files to your directory:

Run the HVite command as follows:

$ HVite -A -D -T 1 -l '*' -a -b SENT-END -m -C wav_config -H macros -H hmmdefs -t 250.0 150.0 1000.0 -I words.mlf -i aligned.out -S train.scp dict tiedlist
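HVite stops with an error if any of its input files is missing, so it can help to check for them before starting a long run. The sketch below mirrors the options of the command above (each option given once) and annotates them as described in the HTK documentation. The function names (build_hvite_command, missing_inputs, run_alignment) are made up for this example, and it assumes that HVite is on your PATH and that all files sit in one working directory.

```python
import subprocess
from pathlib import Path

# Input files named on the HVite command line above.
INPUT_FILES = ["wav_config", "macros", "hmmdefs", "words.mlf",
               "train.scp", "dict", "tiedlist"]

def build_hvite_command(mlf_out="aligned.out"):
    """Return the HVite argument list (hypothetical helper, same options as above)."""
    return [
        "HVite",
        "-A", "-D", "-T", "1",             # echo arguments, show config, basic tracing
        "-l", "*",                         # label path pattern in the output MLF (no shell quoting needed here)
        "-a",                              # align against the supplied transcriptions
        "-b", "SENT-END",                  # word used as the utterance boundary
        "-m",                              # output model-level (phone) boundaries
        "-C", "wav_config",                # feature extraction settings
        "-H", "macros", "-H", "hmmdefs",   # acoustic model definitions
        "-t", "250.0", "150.0", "1000.0",  # pruning beam, increment, and limit
        "-I", "words.mlf",                 # input word-level transcriptions
        "-i", mlf_out,                     # output MLF with the alignments
        "-S", "train.scp",                 # list of audio files to align
        "dict", "tiedlist",                # pronunciation dictionary and HMM list
    ]

def missing_inputs(base_dir="."):
    """Return the names of required input files not found in base_dir."""
    return [name for name in INPUT_FILES if not (Path(base_dir) / name).is_file()]

def run_alignment(base_dir="."):
    """Check the inputs, then run HVite in base_dir; raises if anything fails."""
    missing = missing_inputs(base_dir)
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    subprocess.run(build_hvite_command(), cwd=base_dir, check=True)
```

Calling run_alignment() from the directory that holds the files would then be equivalent to typing the command by hand, with a clearer message if a file is absent.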

The command creates a file called aligned.out containing all the words from your words.mlf file, with time alignments (HTK writes start and end times in units of 100 nanoseconds). The output from the HVite command is here.
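To use the alignments elsewhere, it is usually convenient to convert the times to seconds and collect one start/end pair per word. The sketch below is an illustration only, under these assumptions: aligned.out is an HTK master label file whose entries start with a quoted name and end with a line containing a single "."; each label line has the form "start end model [score] [word]"; and the word appears only on the first model line of each word. Check these against your own aligned.out. The function names (parse_alignment, word_spans) are made up for this example.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in units of 100 ns

def parse_alignment(text):
    """Parse MLF text into {entry_name: [segment dict, ...]} (assumed layout, see above)."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"') and line.endswith('"'):
            current = entries.setdefault(line.strip('"'), [])
            continue
        if line == ".":
            current = None  # end of this entry
            continue
        fields = line.split()
        if current is None or len(fields) < 3:
            continue
        try:
            start, end = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # not a timed label line
        rest = fields[3:]
        score = None
        if rest:
            try:
                # Assumption: a numeric fourth field is the score. If scores are
                # absent, a numeric-looking word would be misread as a score.
                score = float(rest[0])
                rest = rest[1:]
            except ValueError:
                pass
        current.append({
            "start": start / HTK_UNITS_PER_SECOND,
            "end": end / HTK_UNITS_PER_SECOND,
            "model": fields[2],
            "score": score,
            "word": rest[0] if rest else None,
        })
    return entries

def word_spans(segments):
    """Merge model-level segments into (word, start_seconds, end_seconds) tuples."""
    spans = []
    for seg in segments:
        if seg["word"] is not None:
            spans.append([seg["word"], seg["start"], seg["end"]])
        elif spans:
            spans[-1][2] = seg["end"]  # extend the current word to this model's end
    return [tuple(span) for span in spans]
```

A typical use would be to read the file with open("aligned.out").read(), pass the text to parse_alignment, and call word_spans on each entry's segment list.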

Note: different acoustic models may produce slightly different forced alignment results; in general, the better the Acoustic Model, the more accurate the forced alignments.