To use the audio and text from an Audio Book in the creation of the VoxForge Acoustic Model, the audio must be segmented into 5-10 second files. This involves using a silence detection tool to find the pauses in the Audio Book, and splitting the large audio file into a series of smaller files at these pauses. The corresponding eText must then be segmented as well, using a Perl script that determines where each sentence in the text begins and ends.
Once we have a few hundred hours of speech audio, we will be able to use the VoxForge Acoustic Model to perform 'Forced Alignment' to automatically segment the speech audio and text into files.
Step 1: Confirm Mono Audio
Check whether the speech audio file is Mono or Stereo (i.e. recorded with one channel or two). You can use Audacity to do this: just import the wav file into Audacity. If two tracks are shown (one for each channel), the audio is stereo; if only one track is shown, it is mono.
If the audio file is in Stereo (i.e. recorded with two channels) use SoX to convert it to Mono (i.e. convert it to one channel) as follows:
$ sox stereo_file.wav -c 1 mono_file.wav
where the parameters are as follows:
-c corresponds to the number of channels
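If you want to check (or convert) files in a script rather than by hand, the same two operations can be sketched in Python using only the standard-library wave module. This is a minimal sketch for 16-bit WAV files; the function names and the simple average-the-channels mixdown are illustrative, not part of SoX or Audacity.

```python
import struct
import wave

def channel_count(path):
    """Return the number of channels in a WAV file (1 = mono, 2 = stereo)."""
    with wave.open(path, "rb") as w:
        return w.getnchannels()

def stereo_to_mono(src, dst):
    """Mix a 16-bit stereo WAV down to mono by averaging the two channels
    (the same effect as `sox stereo_file.wav -c 1 mono_file.wav`)."""
    with wave.open(src, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Average each left/right pair into a single mono sample.
    mono = [(samples[i] + samples[i + 1]) // 2
            for i in range(0, len(samples), 2)]
    with wave.open(dst, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % len(mono), *mono))
```

In practice SoX is the better tool for the conversion itself; the sketch is mainly useful for batch-checking a directory of downloads before segmenting.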
Step 2: Segment the Audio Using Silence Detection
Go to the directory where the speech audio file you want to segment is located. Create a sub-directory called 'wav'.
Use the Julius adintool to segment the audio file as follows:
$ adintool -in file -out file -filename wav/segment -startid 1000 -freq 44100 -lv 1000
The following will appear in your console:
Input-Source: Wave File (filename from stdin)
Segmentation: on, continuous
SampleRate: 44100 Hz
Level: 1000 / 32767
ZeroCross: 60 per sec.
HeadMargin: 400 msec.
TailMargin: 400 msec.
remove DC: off
Recording: segment.1000.wav, segment.1001.wav, ...
Enter the name of the audio file to be segmented next to "enter filename->".
The parameters are as follows:
-lv is the Level threshold (default: 2000). If the audio input amplitude goes over this threshold, this triggers the beginning of a speech segment. If the level then drops back below this threshold and stays there, that marks the end of the speech segment.
-zc is the zero-cross count (default: 60). Fewer zero crossings per second signify a pause in speech.
-headmargin msec (default: 400). Header margin of each speech segment (unit: milliseconds)
-tailmargin msec (default: 400). Tail margin of each speech segment (unit: milliseconds)
All audio is recorded differently. As a result, you will have to adjust the '-lv', '-zc', '-headmargin' and '-tailmargin' parameters until you get a set of files that are each less than 10 seconds in duration.
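To see how these parameters interact before running adintool on a long file, the level-trigger idea can be sketched in a few lines of Python: a segment opens when the amplitude first exceeds the threshold (backed up by the head margin) and closes once the level has stayed below the threshold for the length of the tail margin. This is a simplified illustration of the -lv/-headmargin/-tailmargin behaviour, not adintool's actual algorithm (it ignores -zc, for example); the function name is an assumption.

```python
import struct
import wave

def find_segments(path, level=1000, head_ms=400, tail_ms=400):
    """Locate speech segments in a mono 16-bit WAV with a simple amplitude
    threshold. `level`, `head_ms` and `tail_ms` loosely mirror adintool's
    -lv, -headmargin and -tailmargin. Returns (start, end) frame pairs."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    head = rate * head_ms // 1000   # margin lengths in frames
    tail = rate * tail_ms // 1000
    segments, start = [], None
    for i, s in enumerate(samples):
        if abs(s) >= level:
            if start is None:
                start = max(0, i - head)       # open segment, keep head margin
            end = min(len(samples), i + tail)  # push the tail margin forward
        elif start is not None and i >= end:
            segments.append((start, end))      # quiet past the tail: close it
            start = None
    if start is not None:
        segments.append((start, end))          # file ended mid-segment
    return segments
```

Raising `level` or shrinking the margins produces more, shorter segments; this is exactly the tuning loop described above.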
Step 3: Segment the eText into Sentences
Next, right click and save the eText2Prompts.pl script (change the suffix from "_pl.txt" to ".pl"), and execute it as follows:
$ perl ./eText2Prompts.pl eText prompts
This will create a file ("prompts") that contains all the sentences in the eText, separated by a line feed.
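The core of the sentence-splitting step can be sketched in Python as follows. This is a simplified approximation of what eText2Prompts.pl does (the actual Perl script's rules may differ, e.g. around abbreviations); it assumes sentences end in '.', '!' or '?' followed by whitespace, and joins lines that were wrapped mid-sentence.

```python
import re

def text_to_prompts(text):
    """Split eText into one sentence per line.
    Collapses the eText's hard line wraps first, then splits after
    sentence-ending punctuation. A rough sketch, not eText2Prompts.pl."""
    flat = " ".join(text.split())             # collapse line breaks/whitespace
    parts = re.split(r"(?<=[.!?])\s+", flat)  # split after ., ! or ?
    return "\n".join(p for p in parts if p)
```

A naive splitter like this will mis-handle things like "Mr. Smith", so the output still needs the manual review described in Step 4.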
Step 4: Match the eText Sentences to the Generated Speech Audio Segments
Next, review each wav file generated in Step 2, and label each line of text in the prompts file with the name of the audio file it corresponds to. You may have to split sentences across separate prompt lines, because the audio segmentation in Step 2 may have broken a sentence into two or more files (depending on the number of pauses it detected in the audio file).
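A small script can produce a first draft of this labelling by pairing the generated wav files, in order, with the sentences in the prompts file. This sketch assumes one sentence per segment (which, as noted above, will often be wrong) and a '<basename> <sentence>' line layout; both the function name and the layout are assumptions here, so treat the output strictly as a starting point to correct by hand while listening to each file.

```python
import os

def draft_prompt_pairs(wav_dir, prompts_path):
    """Pair each segmented wav file with a sentence from the prompts file,
    in order, producing '<basename> <sentence>' lines as a draft to review.
    Assumes one sentence per audio segment, which must be checked by ear."""
    wavs = sorted(f for f in os.listdir(wav_dir) if f.endswith(".wav"))
    with open(prompts_path) as f:
        sentences = [line.strip() for line in f if line.strip()]
    # zip() stops at the shorter list; leftover wavs or sentences are a sign
    # that segmentation split (or merged) sentences and need manual fixing.
    return ["%s %s" % (os.path.splitext(w)[0], s)
            for w, s in zip(wavs, sentences)]
```

If the number of wav files and the number of sentences differ, that mismatch itself tells you where Step 2 split a sentence across files.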