Step 1 - Create Pronunciation Dictionary

First you need to make sure that all the words in the eText of the audio book are contained in the VoxForge Lexicon.  The Lexicon file contains the pronunciations used for Acoustic Model creation, and if you try to train an Acoustic Model with a word that is not in the Lexicon file, the training process will end abnormally.

This section will guide you through the process of creating a list of all words in the eText, comparing it against the Lexicon file, and creating a log of all the missing words.  The missing words will then need to be added to the VoxForge Lexicon (with pronunciations).

In this example, we used the Librivox text for the History of England, from the Accession of James the Second, by Thomas Babington Macaulay (eText.txt & Librivox Audio File).

Cleanup eText file

You might need to convert the text file to the line-ending format of your OS (see this link for an explanation of the format differences and click here to obtain the flip utility for Mac or Windows/MS-DOS environments).  On Linux, use the dos2unix command to convert the file from MS-DOS format (or Mac format) to Unix format:

$ dos2unix eText.txt
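
If neither dos2unix nor flip is available, the same conversion can be approximated in a few lines of Python.  This is only a minimal sketch: the function names to_unix_newlines and convert_file are made up for this illustration and are not part of any VoxForge or HTK tool.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert MS-DOS (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first; otherwise the lone-CR pass would turn each
    # CRLF into two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_unix_newlines(data))
```

Calling convert_file("eText.txt") would rewrite the eText file in place, so keep a backup copy if you want to preserve the original.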

Then run it through a spell checker to correct any spelling mistakes. 

You will also need to change your eText so that numbers and dates are spelled out in words (for example, 'ten' rather than '10').

You can do this by hand, or you can wait until you run the HDMan command below.  It will log any words in the eText file that are missing from the Lexicon file, which only includes spelled-out numbers and dates.

Note: you may need to listen to the actual audio to confirm that the reader has read the number or date the way you expect.  Sometimes they might say 'one, oh' rather than 'ten' for the number '10', or they might say 'one, nine, two, eight' rather than 'Nineteen Twenty Eight' for the date '1928'.  Always transcribe the number or date the same way as the reader says it.
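
Because the spelled-out form depends on how the reader actually says each number, automatic conversion is risky.  A safer aid is to list every token that contains a digit, so that each one can be checked against the audio and fixed by hand.  The following is a minimal sketch; the function name find_numeric_tokens and the punctuation set it strips are assumptions made for this example.

```python
import re

# Any whitespace-delimited token containing a digit, e.g. "10", "1928,", "2nd".
NUMERIC_TOKEN = re.compile(r"\S*\d\S*")


def find_numeric_tokens(text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs for tokens that contain digits."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in NUMERIC_TOKEN.finditer(line):
            # Strip surrounding punctuation so "1928," is reported as "1928".
            token = match.group().strip(".,;:!?()[]\"'")
            hits.append((line_number, token))
    return hits
```

Reading eText.txt into a string and passing it to find_numeric_tokens gives a checklist of line numbers to review while listening to the recording.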

Create Word List file

Next create a word list file using the etext2wlistmlf.pl script:

$ perl ./etext2wlistmlf.pl eText.txt

This will create a wlist file which contains an alphabetical list of all the words in the eText file. 

Note: this script adds the following entries to your wlist file:

SENT-END
SENT-START  

These are HTK internal entries required for the start of a sentence (SENT-START) and the end of a sentence (SENT-END).  
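
To make the shape of the wlist file concrete, here is a rough Python sketch of how such a list could be built.  It is not the etext2wlistmlf.pl script and may differ from it in tokenization and sort order.  The function name build_wlist, the definition of a word as a run of letters with optional internal apostrophes, and the conversion to uppercase are all assumptions made for this illustration.

```python
import re

# Assumption: a word is a run of letters, optionally joined by apostrophes
# (so "king's" stays one word).
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def build_wlist(text: str) -> list[str]:
    """Return the sorted, de-duplicated, uppercased words found in `text`."""
    words = {w.upper() for w in WORD.findall(text)}
    # Add the sentence boundary entries that HTK expects.
    words.update({"SENT-END", "SENT-START"})
    return sorted(words)
```

Writing the returned list to a file, one word per line, produces something comparable to wlist that can be used to sanity-check the output of the real script.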

Download VoxForge Lexicon 

Next download the most current version of the VoxForge Lexicon (also called a pronunciation dictionary), VoxForge.tgz, and extract it to your current directory.

Find Missing Words 

First you need to create the global.ded file (the default edit script used by HDMan), which contains:

AS sp
RS cmu
MP sil sil sp

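These commands append the short-pause model sp to each pronunciation (AS sp), remove CMU-style stress markings (RS cmu), and merge each sil sp sequence into a single sil (MP sil sil sp).  See the HTK book for details of what these commands mean.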
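
As a rough illustration of the three edit commands as the HDMan reference in the HTK book describes them, the sketch below applies their effect to a single pronunciation held as a list of phones.  It only mimics the behaviour for explanation and is not how HDMan is implemented.  The function name apply_global_ded is hypothetical.

```python
import re


def apply_global_ded(phones: list[str]) -> list[str]:
    """Mimic AS sp, RS cmu and MP sil sil sp on one pronunciation."""
    # AS sp: append the short-pause model to the pronunciation.
    result = phones + ["sp"]
    # RS cmu: drop CMU-style stress digits, e.g. "IH1" becomes "IH".
    result = [re.sub(r"[0-9]$", "", p) for p in result]
    # MP sil sil sp: merge each "sil sp" sequence into a single "sil".
    merged = []
    for p in result:
        if p == "sp" and merged and merged[-1] == "sil":
            continue
        merged.append(p)
    return merged
```

For an ordinary word the pronunciation simply gains a trailing sp, while an entry whose pronunciation is just sil (as for a sentence boundary entry) stays as sil instead of becoming sil sp.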

Next run the HTK HDMan command against the VoxForge Lexicon to see if there are any missing words (i.e. words in your eText.txt file that are not included in the VoxForge Lexicon):

$ HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

The following is the output from this command:

HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict

No HTK Configuration Parameters Set

Output dictionary dict opened
Source dictionary VoxForgeDict opened
Dictionary dict created - 969 words processed, 52 missing

No HTK Configuration Parameters Set

Review the HDMan log file (dlog) to determine which words (if any) in the eText file are missing from the VoxForge Lexicon.

If you don't have any missing words, go to Step 3.
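
As a cross-check on dlog, the comparison itself is simple enough to sketch in Python.  The sketch assumes that each lexicon line starts with the word, followed by whitespace and then the rest of the entry, and it ignores any special escaping that HTK dictionaries may use.  The function names lexicon_words and missing_words are made up for this example.

```python
def lexicon_words(lexicon_lines):
    """Collect the head word (first field) of each dictionary line."""
    words = set()
    for line in lexicon_lines:
        fields = line.split()
        if fields:  # skip blank lines
            words.add(fields[0].upper())
    return words


def missing_words(wlist_lines, lexicon_lines):
    """Return the sorted words from the word list that the lexicon lacks."""
    known = lexicon_words(lexicon_lines)
    wanted = {line.strip().upper() for line in wlist_lines if line.strip()}
    return sorted(wanted - known)
```

Opening wlist and VoxForgeDict and passing the two file objects to missing_words should give a list that can be compared with the missing words reported in dlog.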

If you have missing words, review the list and look for any that might correspond to numbers or dates in numeric format.  If you find any of these, then change them in your eText file (i.e. so that they are written using words, not numbers), and re-run the HDMan command as shown above.  If you still have missing words showing in the dlog file, then go to Step 2.