julian - grammar based continuous speech recognition parser
NAME
Julian - grammar based continuous speech recognition parser
SYNOPSIS
julian [-C jconffile] [options ...]
DESCRIPTION
Julian is a high-performance, multi-purpose, free speech recognition
parser based on finite state
grammar. It is capable of
performing
real-time recognition
of continuous speech with over thousands of
vocabulary.
Julian is a derived version of Julius, and almost all components are
the same except language model related part.
To execute a recognition, it needs an acoustic model and a finite state
grammar that describes sentence patterns to be recognized. The grammar
format is an original one, and tools to create a recognirion grammar
are included in the distribution. For acoustic model, standard format
(i.e. HTK) with any word/phone units and sizes are supported. So users
can build a recognition system customized for specific tasks using own
task grammar and acoustic models. For details about models and how to
write a grammar, please see the documents contained in this package.
Julian can perform recognition on audio files, live microphone input,
network input and feature parameter files. The maximum size of vocabu-
lary is 65,535 words.
RECOGNITION MODELS
Julian supports the following models.
Acoustic Models
Same as Julius: Sub-word HMM (Hidden Markov Model)
in HTK
format are supported. Phoneme models
(monophone), context
dependent phoneme models (triphone), tied-mixture and
pho-
netic tied-mixture models of any unit can
be used. When
using context dependent models, interword context
is also
handled. You can further use a tool mkbinhmm to convert the
ascii HMM definition file to binary format, for speeding up
the startup (this format is incompatible with that of HTK).
Language model
The grammar format is an original one, and tools to create a
recognirion grammar are included in the
distribution. A
grammar consists of two files: one is a 'grammar' file that
describes sentence structures in a BNF style,
using word
'categories' as terminate symbols. Another is a 'voca' file
that defines word with its pronunciations
(i.e. phoneme
sequences) for each category. They should be
converted by
mkdfa.pl(1) to a deterministic finite automaton file (.dfa)
and a dictionary file (.dict), respectively.
SPEECH INPUT
Same as Julius: Both live speech input and recorded speech file input
are supported. Live
input stream from microphone device,
DatLink
(NetAudio) device and tcpip network input using adintool is supported.
Speech waveform files (16bit WAV (no compression), RAW format, and many
other format will be acceptable if compiled with libsndfile library).
Feature parameter files in HTK format are also supported.
Note that Julian itself can
only extract MFCC_E_D_N_Z features from
speech data. If you use an acoustic HMM trained by other feature type,
only the HTK parameter file of the same feature type can be used.
SEARCH ALGORITHM OVERVIEW
Recognition algorithm of Julian is
based on a two-pass strategy. In
the first pass,
a high-speed approximate search is performed using
weaker constraints then the given grammar. Here a LR beam search using
only inter-category
constraints extracted from the grammar is per-
formed. The second pass re-searches the input, using the original gram-
mar rules and intermediate results from the first pass, to gain a high
precision result quickly. In the second pass the optimal solution is
theoretically guaranteed using the A* search.
When using context dependent phones (triphones), interword contexts are
taken into consideration. For tied-mixture and phonetic tied-mixture
models, high-speed acoustic likelihood calculation is possible using
gaussian pruning.
For more details, see the related document or web page below.
OPTIONS
The options below specify
the models, system behaviors and various
search parameters. These option can be set all at once at the command
line, but it is recommended that
you write them in a text file as a
"jconf file", and specify the file with "-C" option.
Most are the same as Julius.
Options only in Julian: -gram, -gramlist, -dfa, -penalty1, -penalty2,
-looktrellis
Options only in Julius: -nlr, -nrl, -d, -lmp, -lmp2, -transp, -silhead,
-siltail, -spdur, -sepnum, -separatescore
Speech Input
-input {rawfile|mfcfile|mic|adinnet|netaudio|stdin}
Select speech data input source. 'rawfile' is
waveform file,
and specified after startup from stdin). 'mic' means microphone
device, and 'adinnet' means receiving waveform data
via tcpip
network from an adinnet
client. 'netaudio' is from
DatLink/NetAudio input, and 'stdin' means data input from stan-
dard input.
WAV (no compression) and RAW (noheader, 16bit,
BigEndian) are
supported for waveform file input. Other
format can be sup-
ported using external library. To see what format is
actually
supported, see the help message using option "-help". For stdin
input, only WAV and RAW is supported.
(default: mfcfile)
-filelist file
(With -input rawfile|mfcfile) perform recognition on all
files
listed in the file.
-adport portnum
(with -input adinnet) adinnet port number (default: 5530)
-NA server:unit
(with -input netaudio) set the server name and unit
ID of the
Datlink unit.
-zmean -nozmean
This option enables/disables DC offset removal of
input wave-
form. For speech file input, zero mean will be
computed from
the whole input. For microphone / network input, zero
mean of
the first 48000 samples (3 seconds in 16kHz sampling)
will be
used at the rest. (default: disabled (-nozmean))
-zmeanframe -nozmeanframe
With speech input, this option enables/disables
frame-wise DC
offset removal. This is the same as HTK's ZMEANSOURCE
option,
and cannot be set with "-zmean". (default: disabled
(-nozmean-
frame))
-nostrip
Julian by default removes zero samples in input speech data. In
some cases, such invalid data may be recorded at the
start or
end of recording. This option inhibit this automatic removal.
-record directory
Auto-save input speech data successively under the
directory.
Each segmented inputs are recorded to a file each by
one. The
file name of the recorded data is generated
from system time
when the input starts, in a style of
"YYYY.MMDD.HHMMSS.wav".
File format is 16bit monoral WAV. Invalid for
mfcfile input.
With input rejection by "-rejectshort", the rejected input will
also be recorded even if they are rejected.
-rejectshort msec
Reject input shorter than specified milliseconds.
Search will
be terminated and no result will be output. In
module mode,
'<REJECTED REASON="..."/>' message will be sent to client.
With
"-record", the rejected input will also be recorded even if they
are rejected. (default: 0 = off)
Speech Detection
Options in this section is invalid for mfcfile input.
-cutsilence
-nocutsilence
Force silence cutting (=speech segment
detection) to ON/OFF.
(default: ON for mic/adinnet, OFF for files)
-lv threslevel
Level threshold (0 - 32767) for speech
triggering. If audio
input amplitude goes over this threshold for a
period, Julius
begin the 1st pass recognition. If the level goes
below this
level after triggering, it is the end of the
speech segment.
(default: 2000)
-zc zerocrossnum
Zero crossing threshold per a second (default: 60)
-headmargin msec
Margin at the start of the speech segment
in milliseconds.
(default: 300)
-tailmargin msec
Margin at the end of the
speech segment in milliseconds.
(default: 400)
Acoustic Analysis
-smpFreq frequency
Set sampling frequency of input speech in Hz. Sampling rate can
also be specified using "-smpPeriod". Be careful that this fre-
quency should be the same as the trained conditions of acoustic
model you use. This should be specified for
microphone input
and RAW file input when using other than default rate. Also see
"-fsize", "-fshift", "-delwin" and "-accwin".
(default: 16000 (Hz = 625ns))
-smpPeriod period
Set sampling frequency of input speech by its
sampling period
(nanoseconds). The sampling rate can also be
specified using
"-smpFreq". Be careful that the input frequency
should be the
same as the trained conditions of acoustic model you use.
This
should be specified for microphone input and RAW file input when
using other than default rate. Also see
"-fsize", "-fshift",
"-delwin" and "-accwin".
(default: 625 (ns = 16000Hz))
-fsize sample
Analysis window size in number of samples. (default: 400).
-fshift sample
Frame shift in number of samples (default: 160).
-preemph value
Pre-emphasis coefficient (default: 0.97)
-fbank num
Number of filterbank channels (default: 24)
-ceplif num
Cepstral liftering coefficient (default: 22)
-rawe / -norawe
Enable/disable using raw energy before pre-emphasis
(default:
disabled)
-enormal / -nornormal
Enable/disable normalizing log energy
(default: disabled).
Note: normalising log energy should not be
specified on live
input, at both training and recognition (see sec. 5.9
"Direct
Audio Input/Output" in HTKBook).
-escale value
Scaling factor of log energy when
normalizing log energy
(default: 1.0)
-silfloor value
Energy silence floor in dB when normalizing log energy (default:
50.0)
-delwin frame
Delta window size in number of frames (default: 2).
-accwin frame
Acceleration window size in number of frames (default: 2).
-lofreq frequency
Enable band-limiting for MFCC filterbank computation: set lower
frequency cut-off.
(default: -1 = disabled)
-hifreq frequency
Enable band-limiting for MFCC filterbank computation: set upper
frequency cut-off.
(default: -1 = disabled)
-sscalc
Perform spectral subtraction using head part of each file. With
this option, Julius assume there are certain length of
silence
at each input file. Valid only for
rawfile input. Conflict
with "-ssload".
-sscalclen
With "-sscalc", specify the length of head part silence in mil-
liseconds (default: 300)
-ssload filename
Perform spectral subtraction for speech input using
pre-esti-
mated noise spectrum from file. The noise spectrum data
should
be computed beforehand by mkss. Valid for
all speech input.
Conflict with "-sscalc".
-ssalpha value
Alpha coefficient of spectral subtraction. Noise will
be sub-
tracted stronger as this value gets larger, but
distortion of
the resulting signal also becomes remarkable. (default: 2.0)
-ssfloor value
Flooring coefficient of spectral
subtraction. The spectral
parameters that go under zero after subtraction will be substi-
tuted by the source signal with this
coefficient multiplied.
(default: 0.5)
GMM-based Input Verification and Rejection
-gmm filename
GMM definition file in HTK format. If specified, GMM-based input
verification will be performed concurrently with the 1st
pass,
and you can reject the input according to the result as
speci-
fied by "-gmmreject". Note that the GMM should be
defined as
one-state HMMs, and their training parameter should be the same
as the acoustic model you want to use with.
-gmmnum N
Number of Gaussian components to be computed per frame
on GMM
calculation. Only the N-best Gaussians
will be computed for
rapid calculation. The default is 10 and
specifying smaller
value will speed up GMM calculation, but too small value (1
or
2) may cause degradation of identification performance.
-gmmreject string
Comma-separated list of GMM names to be
rejected as invalid
input. When recognition, the log likelihoods of
GMMs accumu-
lated for the entire input will be computed
concurrently with
the 1st pass. If the GMM name of the maximum
score is within
this string, the 2nd pass will not be executed and
the input
will be rejected.
Language Model (Finite State Grammar)
The recognition
grammar can be specified in three ways:
"-gram",
"-gramlist" or combination of "-dfa" and "-v".
Multiple grammars can be specified by using "-gram" and "-gramlist".
When you use these options several times, all of them will be read at
startup. Note that this is a
different behavior from other options
(last one override previous ones). You can use "-nogram" to reset the
already specified grammars at that point.
-gram gramprefix1[,gramprefix2[,gramprefix3,...]]
Comma-separated list of grammars to be
used. the argument
should be prefix of a grammar, i.e. if you have
"foo.dfa" and
"foo.dict", you can specify them by single argument "foo". Mul-
tiple grammars can be specified as comma-separated list.
-gramlist listfile
Specify a grammar list file that contains list of grammars to be
used. The list file should contain the prefixs
of grammars,
each per line. A relative path in the list file will be treated
as relative to the list file, not the current path or configura-
tion file.
-dfa dfa_filename
Finite state automaton grammar file.
-v dictionary_file
Word dictionary file (required)
-nogram
Remove the current list of grammars already
specified by the
options above.
-penalty1 float
Word insertion penalty for the first pass. (default: 0.0)
-penalty2 float
Word insertion penalty for the second pass. (default: 0.0)
-spmodel {WORD|WORD[OUTSYM]|#num}
Name of short pause model as defined in the hmmdefs. In Julian,
a word whose pronunciation consists of only
this short pause
model is called 'short pause word', and handled
especially in
recognition: even if its appearance in a sentence is explicitly
specified in the grammar, it can be skipped while parsing. This
behavior is for dealing with insertion and
deletion of short
pause that often appear unintensionally in
user utterances.
They can be specified in a style as shown below (default: "sp").
Example
Word_name
<s>
Word_name[output_symbol] <s>[silB]
#Word_ID
#14
(Word_ID is the word position in the dictionary
file starting from 0)
-forcedict
Ignore dictionary errors and force running. Words
with errors
will be dropped from dictionary at startup.
Acoustic Model (HMM)
-h hmmfilename
HMM definition file to use. Format (ascii/binary) will be auto-
matically detected. (required)
-hlist HMMlistfilename
HMMList file to use. Required when using triphone
based HMMs.
This file provides a mapping between the logical triphones names
genertated from the phonetic representation in the
dictionary
and the HMM definition names.
-iwcd1 {best N|max|avg}
When using a triphone model, select method to handle inter-word
triphone context on the first and last phone of a word
in the
first pass.
best N: use average likelihood of N-best scores from the same
context triphones
max: use maximum likelihood of the same
context triphones
avg: use average likelihood of the same
context triphones (default)
-force_ccd / -no_ccd
Normally Julius determines whether the specified acoustic model
is a context-dependent model from the model names, i.e., whether
the model names contain character '+' and '-'. You can
explic-
itly specify by these options to avoid
mis-detection. These
will override the automatic detection result.
-notypecheck
Disable checking of the input parameter type. (default: enabled)
Acoustic Computation
Gaussian Pruning will be automatically enabled when using tied-mixture
based acoutic
model. It is disabled by default for non tied-mixture
models, but you can activate
pruning to those models by explicitly
specifying "-gprune". Gaussian Selection needs a monophone model con-
verted by mkgshmm.
-gprune {safe|heuristic|beam|none}
Set the Gaussian pruning technique to use.
(default: 'safe' (setup=standard), 'beam' (setup=fast) for tied
mixture model, 'none' for non tied-mixture model)
-tmix K
With Gaussian Pruning, specify the number of Gaussians to
com-
pute per mixture codebook. Small value will speed up
computa-
tion, but likelihood error will grow larger. (default: 2)
-gshmm hmmdefs
Specify monophone hmmdefs to use for Gaussian Mixture Selectio.
Monophone model for GMS is generated from an ordinary monophone
HMM model using mkgshmm. This option is
disabled by default.
(no GMS applied)
-gsnum N
When using GMS, specify number of monophone state to select from
whole monophone states. (default: 24)
Inter-word Short Pause Handling
-iwsp (Multi-path version only) Enable inter-word context-free short
pause handling. This option appends a skippable
short pause
model for every word end. The added model will
be skipped on
inter-word context handling. The HMM model to be
appended can
be specified by "-spmodel" option.
Search Parameters (First Pass)
-b beamwidth
Beam width (number of HMM nodes) on the first pass. This
value
defines search width on the 1st pass, and has great
effect on
the total processing time. Smaller
width will speed up the
decoding, but too small value will result
in a substantial
increase of recognition errors due to search
failure. Larger
value will make the search stable and will lead to failure-free
search, but processing time and memory usage will grow in
pro-
portion to the width.
default value: acoustic model dependent
400 (monophone)
800 (triphone,PTM)
1000 (triphone,PTM, setup=v2.1)
-1pass Only perform the first pass search.
-realtime
-norealtime
Explicitly specify whether real-time (pipeline) processing will
be done in the first pass or not. For file input, the
default
is OFF (-norealtime), for microphone,
adinnet and NetAudio
input, the default is ON (-realtime). This
option relates to
the way CMN is performed: when OFF, CMN is calculated for
each
input using cepstral mean of the whole input. When the realtime
option is ON, MAP-CMN will be performed. When MAP-CMN, the cep-
stral mean of last 5 seconds are used as the initial
cepstral
mean at the beginning of each input. Also refer to
"-progout".
-cmnsave filename
Save last CMN parameters computed while recognition to the spec-
ified file. The parameters will be saved to the file
in each
time a input is recognized, so the output file always keeps the
last CMN parameters. If output file already exist, it
will be
overridden.
-cmnload filename
Load initial CMN parameters previously saved in a file by "-cmn-
save". Loading an initial CMN
enables Julius to better
recognize the first utterance on a microphone / network
input.
Also see "-cmnnoupdate".
-cmnmapweight
Specify weight of initial cepstral mean at the beginning of each
utterance for microphone / network input. Specify larger
value
to retain the initial cepstral mean for a longer
period, and
smaller value to rely more on the current
input. (default:
100.0)
-cmnnoupdate
When microphone / network input, this option makes engine not to
update the cepstral mean at each input and force engine to
use
the initial cepstral mean given by "-cmnload" parmanently.
Search Parameters (Second Pass)
-b2 hyponum
Beam width (number of hypothesis) in second pass. If the
count
of word expantion at a certain length of hypothesis reaches this
limit while search, shorter hypotheses are not expanded further.
This prevents search to fall in breadth-first-like status stack-
ing on the same position, and improve search failure. (default:
30)
-n candidatenum
The search continues till 'candidate_num' sentence
hypotheses
have been found. The obtained sentence hypotheses are sorted by
score, and final result is displayed in the order (see also the
"-output" option).
The possibility that the optimum hypothesis is correctly
found
increases as this value gets increased, but the processing time
also becomes longer.
Default value depends on the engine setup on compilation time:
10 (standard)
1 (fast, v2.1)
-output N
The top N sentence hypothesis will be Output at
the end of
search. Use with "-n" option. (default: 1)
-cmalpha float
This parameter decides smoothing effect of word confidence mea-
sure. (default: 0.05)
-sb score
Score envelope width for enveloped scoring.
When calculating
hypothesis score for each generated
hypothesis, its trellis
expansion and viterbi operation will be pruned in the middle of
the speech if score on a frame goes under [current maximum score
of the frame- width]. Giving small value makes the second
pass
faster, but computation error may occur. (default: 80.0)
-s stack_size
The maximum number of hypothesis that can be stored on the stack
during the search. A larger value may give more stable results,
but increases the amount of memory required. (default: 500)
-m overflow_pop_times
Number of expanded hypotheses required
to discontinue the
search. If the number of expanded hypotheses is
greater then
this threshold then, the search is discontinued at that
point.
The larger this value is, The longer Julius gets
to give up
search (default: 2000)
-lookuprange nframe
When performing word expansion on the second pass, this
option
sets the number of frames before and after to look up next word
hypotheses in the word trellis. This prevents the
omission of
short words, but with a large value, the number
of expanded
hypotheses increases and system becomes slow. (default: 5)
-looktrellis
Expand only the words survived on the
first pass instead of
expanding all the words predicted by grammar. This option makes
second pass decoding slightly faster especially for large vocab-
ulary condition, but may increase deletion error of short words.
(default: disabled)
Graph Output
-graphrange nframe
When graph output is enabled
(--enable-graphout), merge same
words at neighbor position. If the position of same words
dif-
fers smaller than this value, they will be merged. The
default
is 0 (allow merging on exactly the same location) and specifying
larger value will result in smaller graph output. Setting to -1
will disable merging, in that case same words on the same loca-
tion of different scores will be left as they are. (default: 0)
-graphcut depth
Cut the resulting graph by its word depth at
post-processing
stage. The depth value is the number of words to be allowed
at
a frame. Setting to -1 disables this feature. (default: 80)
-graphboundloop num
Limit the number of boundary adjustment loop at post-processing
stage. This parameter prevents Julius from blocking by infinite
adjustment loop by short word oscillation. (default: 20)
-graphsearchdelay
-nographsearchdelay
When "-graphsearchdelay" option is set,
Julius modifies its
graph generation alogrithm on the 2nd pass not
to terminate
search by graph merging, until the first sentence candidate
is
found. This option may improve graph accuracy, especially
when
you are going to generate a huge word graph by
setting broad
search. Namely, it may result in better graph accuracy when you
set wide beams on both 1st pass "-b" and 2nd
pass "-b2", and
large number for "-n". (default: disabled)
Forced Alignment
-walign
Do viterbi alignment per word units from the recognition result.
The word boundary frames and the average
acoustic scores per
frame are calculated.
-palign
Do viterbi alignment per phoneme (model) units from the recogni-
tion result. The phoneme boundary frames and the average acous-
tic scores per frame are calculated.
-salign
Do viterbi alignment per HMM state from the recognition result.
The state boundary frames and the average acoustic
scores per
frame are calculated.
Server Module Mode
-module [port]
Run Julian on "Server Module Mode". After startup, Julian waits
for tcp/ip connection from client. Once
connection is estab-
lished, Julian start communication with the client to
process
incoming commands from the client, or to
output recognition
results, input trigger information and other system
status to
the client. The multi-grammar mode is only
supported at this
Server Module Mode. The default port number is 10500.
jcontrol
is sample client contained in this package.
-outcode [W][L][P][S][C][w][l][p][s]
(Only for Server Module Mode) Switch which symbols of recognized
words to be sent to client. Specify 'W' for output symbol,
'L'
for grammar entry, 'P' for phoneme sequence, 'S' for score, and
'C' for confidence score, respectively. Capital letters are for
the second pass (final result), and
small letters are for
results of the first pass. For example, if you
want to send
only the output symbols and phone sequences as
a recognition
result to a client, specify "-outcode WP".
Message Output
-multigramout
Enable multiple grammar output. Usually, Julian will search for
the best hypothesis among the
grammars. This options will
change the search to find the best result one by one
for each
grammar.
-quiet Omit phoneme
sequence and score, only output the best word
sequence hypothesis.
-progout
Enable progressive output of the partial results on
the first
pass.
-proginterval msec
set the output time interval of "-progout" in milliseconds.
-demo Equivalent to "-progout -quiet"
-charconv from to
Enable output character set conversion. "from"
is the source
character set used in the language model, and "to" is the target
character set you want to get.
On Linux, the arguments should be a code name. You
can obtain
the list of available code names by invoking the command "iconv
--list". On Windows, the arguments should
be a code name or
codepage number. Code name should be one
of "ansi", "mac",
"oem", "utf-7", "utf-8", "sjis", "euc". Or you can specify
any
codepage number supported at your environment.
OTHERS
-debug (For debug) output enoumous internal status and debug informa-
tion.
-C jconffile
Load the jconf file. The
options written in the file are
included and expanded at the point. This option
can also be
used within other jconf file.
-check wchmm
(For debug) turn on interactive
check mode of tree lexicon
structure at startup.
-check triphone
(For debug) turn on interactive check mode of
model mapping
between Acoustic model, HMMList and dictionary at startup.
-setting
Display compile-time engine configuration and exit.
-help Display a brief description of all options.
EXAMPLES
For examples of
system usage, refer to the tutorial section in the
Julian documents.
NOTICE
Note about jconf files: relative paths in a jconf file are interpreted
as relative to the jconf file itself, not to the current directory.
SEE ALSO
julius(1), jcontrol(1), adinrec(1), adintool(1), mkdfa(1), mkbinhmm(1)
mkgsmm(1), wav2mfcc(1), mkss(1)
http://julius.sourceforge.jp/en/
DIAGNOSTICS
Julian normally will return the
exit status 0. If an error occurs,
Julian exits abnormally with exit status 1. If an input file cannot be
found or cannot be loaded for some reason then Julian will skip pro-
cessing for that file.
BUGS
There are some restrictions to the type and size of the models Julian
can use. For a detailed explanation refer to the Julius documentation.
For
bug-reports, inquires
and comments
please contact
julius@kuis.kyoto-u.ac.jp or julius@is.aist-nara.ac.jp.
COPYRIGHT
Copyright (c) 1991-2006 Kawahara Lab., Kyoto University
Copyright (c) 2000-2005 Shikano
Lab., Nara Institute of Science and
Technology
Copyright (c) 2005-2006 Julius project team, Nagoya Institute of Tech-
nology
AUTHORS
Rev.1.0 (1998/07/20)
Designed by Tatsuya KAWAHARA and Akinobu LEE (Kyoto University)
Rev.2.0 (1999/02/20)
Rev.2.1 (1999/04/20)
Rev.2.2 (1999/10/04)
Rev.3.1 (2000/05/11)
Development of above versions by Akinobu LEE (Kyoto University)
Rev.3.2 (2001/08/15)
Rev.3.3 (2002/09/11)
Rev.3.4 (2003/10/01)
Rev.3.4.1 (2004/02/25)
Rev.3.4.2 (2004/04/30)
Development of above versions by Akinobu LEE (Nara Institute of
Science and Technology)
Rev.3.5 (2005/11/11)
Rev.3.5.1 (2006/03/31)
Rev.3.5.2 (2006/07/31)
Development of above versions by Akinobu LEE (Nagoya
Institute
of Technology)
THANKS TO
From rev.3.2, Julian is released to the member of the "Information Pro-
cessing Society, Continuous Speech Consortium". From rev.3.4, Julian
becomes an open-source products incorporated with Julius.
The Windows Microsoft Speech API compatible version was developed by
Takashi SUMIYOSHI (Kyoto University).
Comments
Click the 'Add' link to add a comment to this page; click the 'Read More' link to view replies to a posted comment.
Add
•
Search