I had someone email me and ask how to create Sphinx acoustic models using the VoxForge Speech Corpus. Based on the information from the CMU Robust Group Tutorial (Learning to use the CMU SPHINX Automatic Speech Recognition system), here is my reply:
I've downloaded and compiled SphinxTrain, SphinxBase and PocketSphinx.
To create acoustic models, I've run the SphinxTrain scripts (as described in the Robust tutorial) on the AN4 database using these commands:
$ perl scripts_pl/make_feats.pl -ctl etc/an4_train.fileids # converts wav files to features
$ perl scripts_pl/RunAll.pl # creates acoustic models
$ perl scripts_pl/decode/slave.pl # runs the PocketSphinx speech rec engine against the an4 test data
The results were bad, but it is a small database...
You can set up a new environment to use the VoxForge corpus as follows:
$ cd an4
$ perl scripts_pl/copy_setup.pl -task VoxForge
This basically copies the an4 directory structure and contents to a new directory called VoxForge; you can then change the required files to use the VoxForge Speech Corpus.
Audio (referred to as Acoustic Signals in the Robust Tutorial)
Download the audio from the VoxForge Repository (8kHz-16bit) and put it in the VoxForge/wav directory. You might want to use the wget utility to automate this.
Create a script to parse the VoxForge prompts file to create a new VoxForge_train.fileids file. The prompts file is in this format:
jaiger-12032006-6/mfc/vf6-01 HE CRIED AND SWUNG THE CLUB WILDLY
jaiger-12032006-6/mfc/vf6-02 SHE TURNED FEARING THAT JACQUES MIGHT SEE WHAT WAS IN HER FACE
...
Which needs to be converted to this format:
jaiger-12032006-6/wav/vf6-01
jaiger-12032006-6/wav/vf6-02
...
Make sure the paths of the audio files correspond to the paths listed in the VoxForge_train.fileids file. Note: some files in the Repository are in FLAC format; these need to be converted to WAV (or omitted from the VoxForge_train.fileids file, for now...)
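Here is a rough sketch of such a prompts-to-fileids script. It is written in Python, which is only my choice for illustration; any scripting language would do. It takes the first field of each prompts line and swaps /mfc/ for /wav/. The function names are hypothetical, and it assumes every prompts line starts with a path containing /mfc/, as in the sample above.

```python
def prompt_to_fileid(line):
    """Map one prompts line to a fileids entry, or None for a blank line."""
    parts = line.split(None, 1)
    if not parts:
        return None
    utt_path = parts[0]
    # Assumption: the prompts file points at "/mfc/" and the audio is under "/wav/".
    return utt_path.replace("/mfc/", "/wav/", 1)


def prompts_to_fileids(lines):
    """Convert an iterable of prompts lines, skipping blank ones."""
    ids = (prompt_to_fileid(line) for line in lines)
    return [i for i in ids if i is not None]


def write_fileids(prompts_path, fileids_path):
    """Read a prompts file and write one fileids entry per line."""
    with open(prompts_path, encoding="utf-8") as src, \
         open(fileids_path, "w", encoding="utf-8") as dst:
        for fileid in prompts_to_fileids(src):
            dst.write(fileid + "\n")
```

Something like write_fileids("prompts", "etc/VoxForge_train.fileids") would then produce the file; both file names here are placeholders for wherever your copies live.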
Next, convert the wav files to feature files using the make_feats.pl script:
$ perl scripts_pl/make_feats.pl -ctl etc/VoxForge_train.fileids
This will take a long time to complete...
Transcriptions
Create a script to modify the VoxForge prompts file and copy them into the etc/VoxForge_train.transcription file so that they are in this format:
<s> HE CRIED AND SWUNG THE CLUB WILDLY </s> (jaiger-12032006-6/wav/vf6-01)
<s> SHE TURNED FEARING THAT JACQUES MIGHT SEE WHAT WAS IN HER FACE </s> (jaiger-12032006-6/wav/vf6-02)
Note: I am not sure if Sphinx accepts paths in the name... if not, you will have to rename all the audio files so that they are unique.
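A minimal sketch of the transcription step, again in Python and again with hypothetical function names. It assumes the same prompts layout as above (a path, whitespace, then the words) and keeps the path in the utterance ID, so it inherits the open question of whether Sphinx accepts paths there.

```python
def prompt_to_transcription(line):
    """Turn 'path WORDS...' into '<s> WORDS... </s> (path)'.

    Returns None for blank lines or lines with no words after the path.
    """
    parts = line.split(None, 1)
    if len(parts) < 2:
        return None
    utt_path, text = parts
    # Assumption: same /mfc/ -> /wav/ mapping as used for the fileids file.
    utt_id = utt_path.replace("/mfc/", "/wav/", 1)
    words = " ".join(text.split())  # collapse stray whitespace and newlines
    return "<s> {} </s> ({})".format(words, utt_id)


def write_transcription(prompts_path, transcription_path):
    """Read a prompts file and write one transcription line per utterance."""
    with open(prompts_path, encoding="utf-8") as src, \
         open(transcription_path, "w", encoding="utf-8") as dst:
        for line in src:
            entry = prompt_to_transcription(line)
            if entry is not None:
                dst.write(entry + "\n")
```

Calling write_transcription("prompts", "etc/VoxForge_train.transcription") would write the file; the input file name is a placeholder. The lines come out in the same order as the prompts file, which keeps them aligned with a fileids file generated from the same prompts.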
Phones
Copy the VoxForge Phone list into etc/VoxForge.phone. Note: you might not need the sp model - HTK does not in certain circumstances, I don't know about Sphinx. Here is the VF phone list:
ax
sp
ae
b
l
ow
n
d
m
t
ey
iy
s
ix
k
sh
aa
z
er
eh
dx
ng
ay
ih
jh
ao
r
aw
ah
v
hh
p
uw
y
ch
w
f
th
g
uh
dh
oy
zh
sil
Pronunciation Dictionary (language dictionary)
Create a script to modify the VoxForge lexicon (i.e. pronunciation dictionary), which is in this format:
A [A] ax
A'READY [A'READY] ax r eh d iy
A'S [A'S] ey z
A(2) [A] ey
...
Into this format (i.e. remove the return word in brackets):
A ax
A'READY ax r eh d iy
A'S ey z
A(2) ey
....
and copy it to etc/VoxForge.dic.
HTK's HDMan command can do this too.
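If you would rather script the lexicon conversion yourself, the following Python sketch drops the bracketed field from each entry. The function names are hypothetical, and it assumes every entry has the form "WORD [OUTPUT] phone phone ..." as in the sample; lines that do not match that shape are skipped rather than guessed at.

```python
import re

# WORD, then a bracketed field to discard, then one or more phones.
_ENTRY = re.compile(r"^(\S+)\s+\[[^\]]*\]\s+(.+)$")


def lexicon_to_dic(line):
    """Convert 'WORD [WORD] p1 p2' to 'WORD p1 p2'; None if it does not match."""
    m = _ENTRY.match(line.strip())
    if m is None:
        return None
    word, phones = m.groups()
    return "{} {}".format(word, " ".join(phones.split()))


def write_dic(lexicon_path, dic_path):
    """Read the lexicon and write the bracket-free entries to dic_path."""
    with open(lexicon_path, encoding="utf-8") as src, \
         open(dic_path, "w", encoding="utf-8") as dst:
        for line in src:
            entry = lexicon_to_dic(line)
            if entry is not None:
                dst.write(entry + "\n")
```

For example, write_dic("VoxForgeDict", "etc/VoxForge.dic") would produce the dictionary; the input file name is a placeholder for wherever your copy of the lexicon is.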
Filler (filler dictionary)
Just use the one that is already there (etc/VoxForge.filler).
Language Model
I am not sure how to create a language model - the Robust Tutorial says to "check the CMU SLM Toolkit page for an excellent manual". With HTK/Julius (which I am more familiar with....), you don't need one if you are just creating grammar (for command and control applications - not dictation).
You should be able to use Keith Vertanen's English Gigaword Language Model. Copy it to etc/VoxForge.ug.lm. You may need to convert it to a dump file (etc/VoxForge.ug.lm.DMP) - there should be a tool (lm3g2dmp?) on the Sphinx site to do this.
Any feedback and/or corrections would be greatly appreciated,
thanks,
Ken
--- (Edited on 7/22/2008 5:32 pm [GMT-0400] by kmaclean) ---
A few comments
> Note: I am not sure if Sphinx accepts paths in the name... if not, you will have to rename all the audio files so that they are unique.
It's better to rename them.
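One possible way to derive unique, path-free names is sketched below in Python. The naming scheme (speaker directory joined to the file name with a hyphen) and the function names are only an illustration, not a Sphinx requirement; the second helper checks that the resulting names really are unique.

```python
from collections import Counter


def flat_utt_id(path):
    """Join the first and last path components, e.g. 'spk/wav/utt' -> 'spk-utt'."""
    parts = [p for p in path.split("/") if p]
    if len(parts) == 1:
        return parts[0]
    # Assumption: the first component is the submission directory and the
    # last is the utterance name, as in the VoxForge paths shown earlier.
    return "{}-{}".format(parts[0], parts[-1])


def find_duplicates(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)
```

With this scheme jaiger-12032006-6/wav/vf6-01 becomes jaiger-12032006-6-vf6-01. The audio files, the fileids entries and the IDs in the transcription file would all need to be renamed the same way.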
> Note: you might not need the sp model - HTK does not in certain circumstances, I don't know about Sphinx.
Unlike HTK, Sphinx inserts fillers automatically, so you don't need sp.
> I am not sure how to create a language model - the Robust Tutorial says to "check the CMU SLM Toolkit page for an excellent manual". With HTK/Julius (which I am more familiar with....), you don't need one if you are just creating grammar (for command and control applications - not dictation).
You can use JSGF grammars. Something like:
#JSGF V1.0;
/* JSGF Grammar for Turtle example */
grammar goforward;
public <move> = GO FORWARD TEN METERS;
public <move2> = GO <direction> <distance> [METER | METERS];
<direction> = FORWARD | BACKWARD;
<distance> = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | TEN;
--- (Edited on 7/28/2008 6:27 pm [GMT-0500] by nsh) ---