Acoustic Model Discussions

New 160k words 1080 hours english models released
User: guenter
Date: 10/30/2017 5:01 pm
Views: 7038
Rating: 0

After working on the German VoxForge model for some time, I have now applied the scripts I developed for those models to a combination of the English LibriSpeech and VoxForge corpora. The resulting models can be downloaded from:

The scripts I am using to build my models can be found on github here:

While I took a largely manual approach for the German models, I decided to try a more or less fully automated approach for the English ones, mostly because a lot of speech model resources are available here (while I had to start pretty much from scratch for the German models).

The lexicon is based on the CMUdict to which I added missing entries using sequitur g2p (trained on CMUdict).
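
The lexicon step above can be sketched roughly as follows. This is only an illustration, assuming a CMUdict-style text format (one `WORD PH1 PH2 ...` entry per line, `;;;` comment lines, alternate pronunciations marked like `WORD(2)`). The function names `load_lexicon` and `missing_words` are hypothetical and not taken from the author's scripts; the sketch only shows how the words that still need a g2p-generated pronunciation could be found.

```python
import re

def load_lexicon(lines):
    """Parse CMUdict-style lines into {word: [phoneme list, ...]}."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(';;;'):
            continue  # skip blank lines and comments
        parts = line.split()
        # strip alternate-pronunciation markers such as "(2)"
        word = re.sub(r'\(\d+\)$', '', parts[0]).lower()
        lexicon.setdefault(word, []).append(parts[1:])
    return lexicon

def missing_words(transcripts, lexicon):
    """Return the sorted transcript words that have no lexicon entry."""
    seen = set()
    for transcript in transcripts:
        seen.update(transcript.lower().split())
    return sorted(w for w in seen if w not in lexicon)

# tiny made-up sample in CMUdict style (not real corpus data)
cmudict_sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(2)  HH EH0 L OW1",
    "WORLD  W ER1 L D",
]
lexicon = load_lexicon(cmudict_sample)
oov = missing_words(["hello world", "hello voxforge"], lexicon)
print(oov)  # these words would be handed to the g2p model
```

The out-of-vocabulary list is what a g2p tool such as sequitur would then be asked to transcribe before the new entries are merged into the lexicon.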

The audio recordings consist of


  •  the "good" librispeech recordings
  •  those same recordings with noise and reverb added to them at random
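
The post does not say how the noise and reverb were generated, so the following is only a minimal sketch of the general idea, not the author's pipeline. It assumes white Gaussian noise mixed in at a random signal-to-noise ratio and a synthetic, exponentially decaying impulse response for the reverb; all function names, probabilities and parameter ranges are made up for illustration.

```python
import math
import random

def add_noise(samples, snr_db, rng):
    """Mix white Gaussian noise into samples at the given SNR (in dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

def add_reverb(samples, sample_rate, decay_s, rng):
    """Convolve with a synthetic, exponentially decaying impulse response."""
    ir_len = int(decay_s * sample_rate)
    # direct sound first, then a decaying noise tail (assumed shape)
    ir = [1.0] + [rng.gauss(0.0, 0.3) * math.exp(-6.0 * i / ir_len)
                  for i in range(1, ir_len)]
    out = [0.0] * len(samples)
    for n in range(len(samples)):
        acc = 0.0
        for k in range(min(len(ir), n + 1)):
            acc += ir[k] * samples[n - k]
        out[n] = acc  # output is truncated to the input length
    return out

def augment(samples, sample_rate, rng):
    """Apply reverb and/or noise at random (assumed 50% probability each)."""
    out = list(samples)
    if rng.random() < 0.5:
        out = add_reverb(out, sample_rate, rng.uniform(0.01, 0.03), rng)
    if rng.random() < 0.5:
        out = add_noise(out, rng.uniform(10.0, 30.0), rng)
    return out

rng = random.Random(0)
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
noisy = add_noise(tone, 20.0, rng)
augmented = augment(tone, rate, rng)
```

A real setup would more likely use recorded noise and measured room impulse responses; the synthetic versions here only keep the example self-contained.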


I trained a first Kaldi nnet3 model on these recordings, then used this model to decode all the recordings from the English VoxForge corpus, and added to my corpus those recordings whose decoding results matched the transcripts. I iterated this process once more (and plan to do more iterations in the future, along with manual reviews).
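
The auto-review loop described above can be sketched as below. This is an illustration under assumptions, not the author's script: `decode` is a hypothetical callable standing in for the trained model, and the case and whitespace normalization before comparing is an assumption of this sketch (the post only says the decoding results had to match the transcripts).

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return ' '.join(text.lower().split())

def auto_review(recordings, decode):
    """Split recordings into accepted and rejected ids.

    recordings: iterable of (audio_id, transcript) pairs
    decode:     callable mapping audio_id -> hypothesis string
                (hypothetical stand-in for the trained model)
    """
    accepted, rejected = [], []
    for audio_id, transcript in recordings:
        hypothesis = decode(audio_id)
        if normalize(hypothesis) == normalize(transcript):
            accepted.append(audio_id)
        else:
            rejected.append(audio_id)
    return accepted, rejected

# toy stand-in for a decoder: a lookup table of made-up hypotheses
fake_hypotheses = {'a': 'i cannot follow you', 'b': 'she sad', 'c': 'Hello  world'}
recordings = [('a', 'I cannot follow you'), ('b', 'she said'), ('c', 'hello world')]
accepted, rejected = auto_review(recordings, fake_hypotheses.get)
print(accepted, rejected)
```

Each iteration would retrain on the enlarged corpus and run the rejected recordings through the new model again, which is why later rounds can accept recordings that an earlier, weaker model got wrong.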


159373 lexicon entries.
total duration of all good submissions: 1038:59:40
%WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] exp/nnet3/nnet_tdnn_a/decode/wer_8_0.0
CMU Sphinx models:
cmusphinx cont model: SENTENCE ERROR: 85.5% (12906/15093)   WORD ERROR RATE: 18.0% (89407/496158)
cmusphinx ptm model: SENTENCE ERROR: 89.2% (13467/15093)   WORD ERROR RATE: 24.2% (120169/496158)
sequitur g2p model:
    total: 13147 strings, 99753 symbols
    successfully translated: 13146 (99.99%) strings, 99746 (99.99%) symbols
        string errors:       4881 (37.13%)
        symbol errors:       9557 (9.58%)
            insertions:      2190 (2.20%)
            deletions:       2422 (2.43%)
            substitutions:   4945 (4.96%)
    translation failed:      1 (0.01%) strings, 7 (0.01%) symbols
    total string errors:     4882 (37.13%)
    total symbol errors:     9564 (9.59%)
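
As a reading aid for the Kaldi line above: the reported %WER is the sum of insertions, deletions and substitutions divided by the number of reference words. The sketch below parses such a line and recomputes the figure; the regular expression and the function name `check_wer_line` are assumptions made for this example and are not part of Kaldi.

```python
import re

# matches lines like: %WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] ...
WER_LINE = re.compile(
    r'%WER\s+(?P<wer>[\d.]+)\s+\[\s*(?P<errors>\d+)\s*/\s*(?P<words>\d+),\s*'
    r'(?P<ins>\d+) ins,\s*(?P<dels>\d+) del,\s*(?P<subs>\d+) sub\s*\]'
)

def check_wer_line(line):
    """Parse a Kaldi-style WER line and recompute the error rate from its counts."""
    m = WER_LINE.search(line)
    if m is None:
        raise ValueError('not a WER line: %r' % line)
    errors = int(m.group('errors'))
    words = int(m.group('words'))
    parts_sum = int(m.group('ins')) + int(m.group('dels')) + int(m.group('subs'))
    return {
        'reported': float(m.group('wer')),
        'recomputed': 100.0 * errors / words,
        'errors': errors,
        'parts_sum': parts_sum,
    }

line = ('%WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] '
        'exp/nnet3/nnet_tdnn_a/decode/wer_8_0.0')
result = check_wer_line(line)
print('%.2f%% reported, %.2f%% recomputed' % (result['reported'], result['recomputed']))
```

For the nnet3 line this gives 2226 + 16007 + 17963 = 36196 errors over 496128 reference words, which rounds to the reported 7.30%.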

--- (Edited on 10/30/2017 5:01 pm [GMT-0500] by guenter) ---

Re: New 160k words 1080 hours english models released
User: kmaclean
Date: 10/31/2017 7:44 am
Views: 281
Rating: 0

>I have now applied the scripts I developed for those models to a combination of the english librispeech and voxforge corpora

Very impressive!

Thank you,


--- (Edited on 10/31/2017 8:44 am [GMT-0400] by kmaclean) ---

Re: New 160k words 1080 hours english models released
User: guenter
Date: 11/29/2017 6:04 pm
Views: 321
Rating: 1

I have done another auto-review round using the latest model. This time, I also upgraded to kaldi 5.2 and used that to train tdnn-chain models - with quite encouraging results:

%WER 2.48 [ 12525 / 504653, 737 ins, 2720 del, 9068 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_10_0.0
%WER 3.03 [ 15269 / 504653, 948 ins, 3260 del, 11061 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_9_0.0
The smaller (tdnn_250) model is targeted at embedded platforms like the Raspberry Pi 3, where it achieves near-realtime performance:
[bofh@donald py-kaldi-asr]$ python examples/ 
tdnn_250 loading model...
tdnn_250 loading model... done, took 23.394126s.
tdnn_250 creating decoder...
tdnn_250 creating decoder... done, took 14.411979s.
decoding data/dw961.wav...
 0.087s:  4000 frames ( 0.250s) decoded.
 0.400s:  8000 frames ( 0.500s) decoded.
 0.742s: 12000 frames ( 0.750s) decoded.
 1.021s: 16000 frames ( 1.000s) decoded.
 1.263s: 20000 frames ( 1.250s) decoded.
 1.497s: 24000 frames ( 1.500s) decoded.
 1.714s: 28000 frames ( 1.750s) decoded.
 1.992s: 32000 frames ( 2.000s) decoded.
 2.370s: 36000 frames ( 2.250s) decoded.
 2.642s: 40000 frames ( 2.500s) decoded.
 2.873s: 44000 frames ( 2.750s) decoded.
 3.112s: 48000 frames ( 3.000s) decoded.
 3.333s: 52000 frames ( 3.250s) decoded.
 3.668s: 56000 frames ( 3.500s) decoded.
 3.876s: 60000 frames ( 3.750s) decoded.
 4.092s: 64000 frames ( 4.000s) decoded.
 4.305s: 68000 frames ( 4.250s) decoded.
 4.517s: 72000 frames ( 4.500s) decoded.
 4.951s: 74000 frames ( 4.625s) decoded.
** data/dw961.wav
** i cannot follow you she said 
** tdnn_250 likelihood: 1.99656772614
tdnn_250 decoding took     4.95s
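
One way to put a number on "near realtime" is the real-time factor (RTF): decoding time divided by audio duration, where a value of 1.0 or less means the decoder keeps up with the audio. The sketch below takes its inputs from the log above (4000 frames per 0.250 s, 74000 frames in total, 4.95 s of decoding); the function name is made up for this example, and model loading and decoder creation are left out of the calculation.

```python
def real_time_factor(decode_seconds, num_frames, sample_rate):
    """Decoding time divided by audio duration (<= 1.0 means realtime or faster)."""
    audio_seconds = num_frames / float(sample_rate)
    return decode_seconds / audio_seconds

# the log reports 4000 frames per 0.250 s chunk, i.e. a 16 kHz sample rate
sample_rate = int(4000 / 0.250)
rtf = real_time_factor(4.95, 74000, sample_rate)
print('audio: %.3fs, RTF: %.2f' % (74000 / float(sample_rate), rtf))
# prints "audio: 4.625s, RTF: 1.07"
```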
The new models are available for download here:
And, as always, the scripts used to produce these models are available free and open source on my GitHub:

--- (Edited on 11/29/2017 6:04 pm [GMT-0600] by guenter) ---

Re: New 160k words 1080 hours english models released
User: mischmerz
Date: 5/5/2018 4:55 pm
Views: 65
Rating: 0

Thanks for your work. I've got a question though: the directory voxforge.cd_cont_6000 already contains means / mdef, so I tried to use it directly with pocketsphinx and the -hmm parameter. It seems to be working, though loading means and variances takes quite some time (on a Raspberry Pi). Is there any way to speed up the process, or am I doing something completely wrong?



--- (Edited on 5/5/2018 4:55 pm [GMT-0500] by ) ---

Re: New 160k words 1080 hours english models released
User: guenter
Date: 5/8/2018 2:17 pm
Views: 2196
Rating: 1

Hi Michaela,

Currently we are quite focused on Kaldi, and we do have models tuned for use on the RPi 3. For now this is work in progress, but we are planning to do an "official" announcement soon.

If you want to give it a try, here we have Raspbian packages of kaldi, the models and the python wrapper:

Once you have them installed, you can use a Python script to try them (this one is adapted from the example scripts that come with ):

import sys
import os
import wave
import struct
import numpy as np

from time import time

from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder

# number of decoder runs; raising this is useful for benchmarking purposes
# (1 is an assumed default)
NUM_DECODER_RUNS = 1

MODELDIR    = '/opt/kaldi/model/kaldi-chain-voxforge-de'
MODEL       = 'tdnn_250'
WAVFILE     = 'data/gsp1.wav'

print '%s loading model...' % MODEL
time_start = time()
kaldi_model = KaldiNNet3OnlineModel (MODELDIR, MODEL, acoustic_scale=1.0, beam=7.0, frame_subsampling_factor=3)
print '%s loading model... done, took %fs.' % (MODEL, time()-time_start)

print '%s creating decoder...' % MODEL
time_start = time()
decoder = KaldiNNet3OnlineDecoder (kaldi_model)
print '%s creating decoder... done, took %fs.' % (MODEL, time()-time_start)

for i in range(NUM_DECODER_RUNS):

    time_start = time()

    print 'decoding %s...' % WAVFILE
    wavf = wave.open(WAVFILE, 'rb')

    # check format
    assert wavf.getnchannels()==1
    assert wavf.getsampwidth()==2

    # process file in 250ms chunks

    chunk_frames = 250 * wavf.getframerate() / 1000
    tot_frames   = wavf.getnframes()

    num_frames = 0
    while num_frames < tot_frames:

        finalize = False
        if (num_frames + chunk_frames) < tot_frames:
            nframes = chunk_frames
        else:
            # last chunk: read the remaining frames and tell the decoder to finalize
            nframes = tot_frames - num_frames
            finalize = True

        frames = wavf.readframes(nframes)
        num_frames += nframes
        samples = struct.unpack_from('<%dh' % nframes, frames)

        decoder.decode(wavf.getframerate(), np.array(samples, dtype=np.float32), finalize)

        s, l = decoder.get_decoded_string()

        print "%6.3fs: %5d frames (%6.3fs) decoded. %s" % (time()-time_start, num_frames, float(num_frames) / float(wavf.getframerate()), s)


    s, l = decoder.get_decoded_string()
    print "*****************************************************************"
    print "**", WAVFILE
    print "**", s
    print "** %s likelihood:" % MODEL, l
    print "*****************************************************************"
    print "%s decoding took %8.2fs" % (MODEL, time() - time_start )



--- (Edited on 5/8/2018 2:17 pm [GMT-0500] by guenter) ---