All Speech Recognition Engines ("SRE"s) are made up of the following components:
Language Model or Grammar - A Language Model contains a very large list of words and their probability of occurrence in a given sequence. Language Models are used in dictation applications. A Grammar is a much smaller file containing sets of predefined combinations of words. Grammars are used in IVR or desktop Command and Control applications. Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up the word).
Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
Decoder - A software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.
Although Julian (a special version of Julius which performs
grammar-based speech recognition) uses Acoustic Models created with the
HTK toolkit, Julian uses its own Grammar definition format.
Grammar
A recognition Grammar essentially defines constraints on what the SRE can expect as
input. It is a list of words and/or phrases that the
SRE listens for. When one of these predefined words or phrases is heard, the SRE returns the
word or phrase to the calling program - usually a Dialog Manager (but could
also be a script written in Perl, Python, etc.). The Dialog
Manager then does some processing based on this word or phrase.
The example in the HTK book is that of a voice-operated interface
for phone dialling. If the SRE hears the sequence of words: 'Call
Steve Young', it returns the textual representation of this phrase to
the Dialog Manager, which then looks up Steve's telephone number and
then dials the number.
It is very important to understand that the words that you can use in
your Grammar are limited to the words that you have 'trained' in
your Acoustic Model. The two are tied very closely together.
Acoustic Model
An Acoustic Model is a file that contains a statistical representation
of each distinct sound that makes up a spoken word. It must
contain the sounds for each word used in your grammar. The words
in your grammar give the SRE the sequence of sounds it must listen
for. The SRE then listens for the sequence of sounds that make up
a particular word, and when it finds a particular sequence, returns the
textual representation of the word to the calling program (usually a
Dialog Manager). Thus, when an SRE is listening for words, it is
actually listening for the sequence of sounds that make up
one of the words you defined in your Grammar. The Grammar and the Acoustic
Model work together.
Therefore, when you train your Acoustic Model to recognize the phrase
'call Steve Young', the SRE is actually listening for the phoneme sequence "k",
"ao", "l", "s", "t", "iy", "v", "y", "ah" and "ng". If you say
each of these phonemes aloud in sequence, it will give you an idea of
what the SRE is looking for.
Commercial SREs use large databases of speech audio to create their Acoustic Models. Because of this,
most common words that might be used in a Grammar are already included
in their Acoustic Model.
When creating
your own Acoustic Models and Grammars, you need to make sure that all
the phonemes that make up the words in your Grammar are included in your
Acoustic Model.
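One way to check this is to collect every phoneme used in your pronunciations and compare it against the phoneme set of your Acoustic Model. The following Python sketch only illustrates the idea: the function name, the in-memory dictionary, and the example phoneme set are assumptions made for this example and are not part of Julian or HTK.

```python
def missing_phonemes(pronunciations, acoustic_phonemes):
    """Return the phonemes used in pronunciations but absent from the acoustic model."""
    used = set()
    for phonemes in pronunciations.values():
        used.update(phonemes.split())
    return sorted(used - set(acoustic_phonemes))


# Pronunciations for the 'call Steve Young' example.
pronunciations = {
    "CALL": "k ao l",
    "STEVE": "s t iy v",
    "YOUNG": "y ah ng",
}

# Hypothetical phoneme set of an acoustic model that was trained without "ng".
acoustic_phonemes = {"k", "ao", "l", "s", "t", "iy", "v", "y", "ah"}

print(missing_phonemes(pronunciations, acoustic_phonemes))
```

With the hypothetical phoneme set above, the check reports "ng" as missing, which means the word "YOUNG" could not be recognized until that phoneme is trained.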
Background - Julian Grammars
In Julian, a recognition grammar is separated into two files:
the ".grammar" file, which defines a set of rules governing the words the SRE is expected to recognize. Rather than listing each word, a .grammar file refers to "Word Categories", each of which names a list of words to be recognized (the lists themselves are defined in a separate ".voca" file);
the ".voca" file, which defines the actual "Word Candidates" in each Word Category and their pronunciation information (Note: the phonemes that make up this pronunciation information must be the same as those used to train your Acoustic Model).
.grammar file
The rules governing the allowed words are defined in the .grammar file using
a modified BNF format. A .grammar specification in Julian uses a
set of derivation rules, written as:
Symbol : [expression with Symbols]
where Symbol is a nonterminal and [expression with Symbols] is an
expression which consists of sequences of Symbols, which can be terminals and/or nonterminals.
A terminal is BNF jargon for a symbol that represents a constant
value. It never appears to the left of the colon. In Julian
terminals represent Word Categories - lists of words that are further defined in a separate ".voca" file.
A nonterminal is BNF jargon for a symbol that can be
expressed in terms of other symbols. It can be replaced as a result of substitution rules.
For example, look at the following derivation rules:
S : NS_B LOOKUP NS_E
LOOKUP: CONNECT NAME
In this example, "S" is the initial sentence symbol. NS_B and NS_E correspond to the silence that occurs just before and just after the utterance you want to recognize. "S", "NS_B" and "NS_E" are required in all Julian grammars.
"NS_B", "NS_E", "CONNECT", and "NAME" are terminals, and represent Word Categories that must be defined in the ".voca" file. In the ".voca" file, "CONNECT" corresponds to two words, "PHONE" and "CALL", and their pronunciations. "NAME" corresponds to two words, "STEVE" and "YOUNG", and their pronunciations.
"LOOKUP" is a nonterminal, and does not have any definition in the
.voca file. It does have a further definition in the .grammar
file, where it is replaced by the expression "CONNECT NAME". All
nonterminals must be further defined in the .grammar file until they
are finally represented by terminals (which are then defined in the
.voca file as Word Categories).
With Julian, only one
Substitution Rule per line is permitted, with the colon ":" as the separator. Alphanumeric ASCII characters and the underscore
are permitted for Symbol names, and these are case sensitive.
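To make the terminal/nonterminal distinction concrete, here is a small Python sketch that reads derivation rules and classifies each Symbol. It is not how Julian parses grammars; the function names are made up for this example, and it only handles the rules described above (one rule per line, a colon separator, and Symbol names made of ASCII letters, digits and underscores).

```python
import re

# Symbol names: alphanumeric ASCII characters and the underscore, case sensitive.
SYMBOL = re.compile(r"[A-Za-z0-9_]+")


def parse_grammar(text):
    """Parse .grammar text into a list of (left symbol, [right symbols]) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        left, sep, right = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in rule: {line!r}")
        symbols = [left.strip()] + right.split()
        for symbol in symbols:
            if not SYMBOL.fullmatch(symbol):
                raise ValueError(f"invalid symbol name: {symbol!r}")
        rules.append((symbols[0], symbols[1:]))
    return rules


def classify(rules):
    """Nonterminals appear left of a colon; terminals never do."""
    nonterminals = {left for left, _ in rules}
    terminals = {s for _, right in rules for s in right} - nonterminals
    return nonterminals, terminals


grammar_text = """
S : NS_B LOOKUP NS_E
LOOKUP: CONNECT NAME
"""

nonterminals, terminals = classify(parse_grammar(grammar_text))
print(sorted(nonterminals))
print(sorted(terminals))
```

For the two rules above, the nonterminals are "LOOKUP" and "S", and the terminals are the four Word Categories that still need a definition in the .voca file.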
.voca file
The ".voca" file contains Word Definitions for each Word Category defined in the .grammar file.
Each Word Category must be defined with "%" preceding it. Word
Definitions in each Word Category are then defined one per line. The
first column is the string which will be output when recognized, and
the rest is the pronunciation. Spaces and/or tabs can act as field
separators.
For example, the Word Categories "NS_B", "NS_E", "CONNECT", and "NAME"
were referenced in the ".grammar" file above and are defined in a ".voca" file as follows:
% NS_B
<s> sil
% NS_E
</s> sil
% CONNECT
PHONE f ow n
CALL k ao l
% NAME
STEVE s t iy v
YOUNG y ah ng
In the above example, the NS_B and NS_E Word Categories each have one Word Definition with a silence model ('sil' is a special silence model defined in your Acoustic Model). These correspond to the head and tail silence in speech input.
"CONNECT" is broken out into two words, "PHONE" and "CALL", with pronunciation information: the phonemes that make up the words to be recognized (and which correspond to phonemes that will be included in your Acoustic Model).
"NAME" is broken out into two words, "STEVE" and "YOUNG", and their phonemes.
The phonemes used here must match the phonemes used in the creation of your
Acoustic Model (which we will create in later steps).
If you have words with different pronunciations, simply create the
additional entries on separate lines for the same word but with the different
pronunciation.
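The layout of a .voca file is simple enough to read with a few lines of code. The Python sketch below is an illustration only: parse_voca is a hypothetical helper, not a Julian tool, and it assumes nothing beyond the format described above ("%" before each Word Category, one Word Definition per line, fields separated by spaces and/or tabs).

```python
def parse_voca(text):
    """Parse .voca text into {category: [(output string, [phonemes]), ...]}."""
    categories = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("%"):
            # A new Word Category starts here.
            current = line[1:].strip()
            categories.setdefault(current, [])
            continue
        if current is None:
            raise ValueError(f"word definition before any category: {line!r}")
        fields = line.split()  # splits on spaces and tabs alike
        categories[current].append((fields[0], fields[1:]))
    return categories


voca_text = """
% NS_B
<s> sil
% NS_E
</s> sil
% CONNECT
PHONE f ow n
CALL k ao l
% NAME
STEVE s t iy v
YOUNG y ah ng
"""

voca = parse_voca(voca_text)
print(voca["CONNECT"])
```

Because each Word Definition is stored as its own entry, a word with several pronunciations simply shows up several times in its category's list.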
The .grammar and .voca files working together
Julian needs a predefined word lattice file where each word and each word-to-word transition is listed explicitly. We get this by compiling the ".grammar" and ".voca" files together with a script to generate the word lattice file (actually it is two files, but more on that later). The mkdfa.pl script does this by looking for the Initial Sentence Symbol "S" in the .grammar file, replacing the Word Categories with all the possible Word Candidates from the .voca file, and making a predefined list of all the possible combinations of words and phrases Julian must recognize. In this case, the list of all possible sentences would be:
<s> PHONE STEVE </s>
<s> PHONE YOUNG </s>
<s> CALL STEVE </s>
<s> CALL YOUNG </s>
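The substitution described above can be reproduced with a short Python sketch. This is only a way to see where the four sentences come from: the dictionaries and the expand function are assumptions made for this example, the sketch assumes a grammar without recursive rules, and mkdfa.pl itself builds a finite automaton rather than an explicit list of sentences.

```python
from itertools import product

# Derivation rules from the .grammar example (nonterminal -> list of alternatives).
rules = {
    "S": [["NS_B", "LOOKUP", "NS_E"]],
    "LOOKUP": [["CONNECT", "NAME"]],
}

# Word Candidates from the .voca example (Word Category -> output strings).
voca = {
    "NS_B": ["<s>"],
    "NS_E": ["</s>"],
    "CONNECT": ["PHONE", "CALL"],
    "NAME": ["STEVE", "YOUNG"],
}


def expand(symbol):
    """Return every word sequence that a symbol can produce."""
    if symbol in voca:
        # Terminal: one single-word sequence per Word Candidate.
        return [[word] for word in voca[symbol]]
    sentences = []
    for alternative in rules[symbol]:
        # Nonterminal: combine the expansions of each symbol in the rule.
        parts = [expand(s) for s in alternative]
        for combination in product(*parts):
            sentences.append([w for part in combination for w in part])
    return sentences


for sentence in expand("S"):
    print(" ".join(sentence))
```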
What this means is that when Julian hears the sounds that make up a
word or phrase uttered by a user, it tries to match these sounds to the
statistical representations of sounds contained in the Acoustic
Model. When a match is made, Julian determines the phoneme
corresponding to the sound. It keeps track of the
matching phonemes
until it reaches a pause in the user's speech. It then
searches the compiled grammar for the equivalent series
of phonemes. You can think of the compiled grammar as looking something like
this:
sil f ow n s t iy v sil
sil f ow n y ah ng sil
sil k ao l s t iy v sil
sil k ao l y ah ng sil
If, for example, a match is made with the list of phonemes "sil k ao l s t iy v sil", Julian returns the words "<s> CALL STEVE </s>" to the calling program.
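As a rough mental model, this last step behaves like a lookup from a phoneme sequence to a sentence. The Python sketch below shows that lookup with the four sentences from this example; the names are invented for illustration, and a real decoder scores competing hypotheses statistically instead of doing an exact dictionary match.

```python
# Each allowed sentence paired with its phoneme sequence, as in the example above.
sentences = {
    "<s> PHONE STEVE </s>": "sil f ow n s t iy v sil",
    "<s> PHONE YOUNG </s>": "sil f ow n y ah ng sil",
    "<s> CALL STEVE </s>": "sil k ao l s t iy v sil",
    "<s> CALL YOUNG </s>": "sil k ao l y ah ng sil",
}

# Invert the table so it can be searched by phoneme sequence.
by_phonemes = {phonemes: words for words, phonemes in sentences.items()}


def recognize(phonemes):
    """Return the sentence for a list of phonemes, or None if nothing matches."""
    return by_phonemes.get(" ".join(phonemes))


heard = "sil k ao l s t iy v sil".split()
print(recognize(heard))
```

A sequence that is not in the table, such as one containing a name outside the grammar, returns None, which mirrors the point that Julian can only return sentences the grammar allows.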
Tutorial
.grammar file
For this tutorial, go to the 'voxforge' folder you created in your home directory.
Create a new directory called 'manual'. Next create a file
called sample.grammar in your new 'voxforge/manual' folder, and add the
following text:
S : NS_B SENT NS_E
SENT: CALL_V NAME_N
SENT: DIAL_V DIGIT
In this case, NS_B, NS_E, CALL_V, NAME_N, DIAL_V, DIGIT are Word Categories (i.e. terminals in BNF jargon),
and they must be defined in a separate .voca file.
"SENT" is the only nonterminal symbol. The "SENT" in the first line will be substituted with either of the following Word Category phrases:
"CALL_V NAME_N" (from the second line); or
"DIAL_V DIGIT" (from the third line).
Each Word Category (i.e. "CALL_V",
"NAME_N", "DIAL_V", or "DIGIT") is replaced by one of the Word Definitions set out in
the .voca file below.
.voca file
For this tutorial, create a file called: sample.voca in your 'voxforge/manual' folder, and add the following text:
% NS_B
<s> sil
% NS_E
</s> sil
% CALL_V
PHONE f ow n
CALL k ao l
% DIAL_V
DIAL d ay l
% NAME_N
STEVE s t iy v
YOUNG y ah ng
% DIGIT
FIVE f ay v
FOUR f ow r
NINE n ay n
EIGHT ey t
OH ow
ONE w ah n
SEVEN s eh v ih n
SIX s ih k s
THREE th r iy
TWO t uw
ZERO z iy r ow
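Before compiling, it can help to confirm that every Word Category used in sample.grammar is defined in sample.voca, and to see how many sentences the grammar allows. The Python sketch below does both for the tutorial files. It is an illustration under stated assumptions: the rules and category sizes are typed in by hand rather than read from the files, the function names are invented, and the counting only works for grammars without recursive rules.

```python
from math import prod

# Rules from sample.grammar (SENT has two alternatives).
rules = {
    "S": [["NS_B", "SENT", "NS_E"]],
    "SENT": [["CALL_V", "NAME_N"], ["DIAL_V", "DIGIT"]],
}

# Number of Word Definitions per Word Category in sample.voca.
category_sizes = {
    "NS_B": 1,
    "NS_E": 1,
    "CALL_V": 2,
    "DIAL_V": 1,
    "NAME_N": 2,
    "DIGIT": 11,
}


def undefined_categories():
    """Return symbols used in the rules that are neither rules nor Word Categories."""
    used = {s for alts in rules.values() for alt in alts for s in alt}
    return sorted(used - set(rules) - set(category_sizes))


def count_sentences(symbol="S"):
    """Count the word sequences a symbol can produce."""
    if symbol in category_sizes:
        return category_sizes[symbol]
    # Multiply within one alternative, add across alternatives.
    return sum(prod(count_sentences(s) for s in alt) for alt in rules[symbol])


print(undefined_categories())
print(count_sentences())
```

For the tutorial files, no category is undefined and the grammar allows 15 sentences: 2 x 2 from "CALL_V NAME_N" plus 1 x 11 from "DIAL_V DIGIT".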
Compiling your Grammar
The .grammar and .voca files now need to be compiled into ".dfa"
and ".dict" files so that Julian can use them. This is done using the Julian "mkdfa.pl"
grammar compiler. The .grammar and .voca files need to have the same
file prefix, and this prefix is then specified to the mkdfa.pl
script. Compile your files (sample.grammar and sample.voca)
as follows:
$ mkdfa.pl sample
sample.grammar has 3 rules
sample.voca has 6 categories and 18 words
---
Now parsing grammar file
Now modifying grammar to minimize states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[6/6]
Now making deterministic finite automaton[6/6]
Now making triplet list[6/6]
---
generated: sample.dfa sample.term sample.dict