Problem with silence/fillers using Julius4 forced alignment

User: azeem
Date: 4/23/2013 6:08 am

Basically, I am not able to spot silences/fillers with forced alignment in julius.

1 Forced Alignment with Julius

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.1 word / phoneme segmentation kit

====================================

I started from the word / phoneme segmentation kit that is available

on the home page of the Julius project:

+ [http://sourceforge.jp/projects/julius/downloads/32570/julius4-segmentation-kit-v1.0.tar.gz]

In this package there is a Perl script that automatically generates

the speech grammar files 'tmp.dfa', 'tmp.phndict' and 'tmp.dict' from

a transcription. Then recognition with julius is performed with

-walign and/or -palign parameters.

1.2 What I have done successfully

==================================

So I used Julius for Forced Alignment and I prepared a script that

builds a grammar (like in the Perl script mentioned above) for each

input file from its text transcription. I obtained very good

results both in word and phone alignment.

1.3 What I am not able to do

=============================

Unfortunately, I am not able to implement the following feature:

I would like Julius to be able to spot silence (and furthermore

even non verbal sounds) that may occur between words (or even

between phones). I would like to do this without explicitly

designing a particular grammar to contain also optional states

related to silence or filler words.

1.4 -iwsp parameter

====================

The feature I am looking for seems to be related to the "-iwsp" option:

$ julius -help

...

"[-iwsp] insert sp for all word end (multipath)(off)"

...

I have tried it, but without the expected results: with

input audio files containing silences between words, no "sp"

was detected in the output.

2 My question

~~~~~~~~~~~~~~

Does anybody know how to spot silences or non verbal sounds in a

forced alignment procedure with Julius4, without explicitly

designing a grammar that includes states associated with silence/non

verbal sounds?

3 Test with voxforge acoustic model for English

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If someone is interested, I can share a test example I'm struggling

with. It's about forced alignment for the English language.

3.1 Acoustic model

===================

I downloaded a pre-built AM, contained in this tarball:

+ [http://www.repository.voxforge1.org/downloads/software/julius-3.5.2-quickstart-linux.tgz]

The "tiedlist" file, namely

julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist

is lacking several triphones, and this can cause Julius to give an

error (see below).

3.2 A sample audio and grammar

===============================

Here follows a link to a tarball with files to test this issue:

[https://dl.dropboxusercontent.com/u/10183668/julius_FA/test_julius_fa.tar.gz]

3.2.1 Here follows a description of each file contained in that tarball:

-------------------------------------------------------------------------

- sample_eng/c31c030s_plus_sil.wav

This is a sample audio file we used to test julius forced

alignment. It is a sentence taken from the WSJ corpus. A silence chunk

has been manually added.

- sample_eng/c31c030s_plus_sil.txt

This is the corresponding text file:

+ "he updates his list of things to do today before going home each

evening"

The silence has been inserted between the words "today" and "before".

- fa_files/file.dfa, fa_files/file.phndict and fa_files/file.dict

Files that specify the grammar and dictionary for forced alignment

- output_files/c31c030s_plus_sil.phn and output_files/c31c030s_plus_sil.wrd

These two files are the output I got from julius. There is no

silence (nor "sp", short pause) inserted between "today" and

"before". Actually, the phoneme "ey" in the alignment lasts

more than a second and a half, spanning the long silence.

These files have been obtained by converting the actual output

of julius, from timings in frames to timings in seconds.
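
The frame-to-seconds conversion can be sketched as follows. This is only a minimal sketch, not the script actually used for these files: it assumes a 10 ms frame shift and that the last frame index of each segment is inclusive, and the frame numbers in the demo are made up. Both assumptions should be checked against the acoustic model's front-end settings and the real Julius output.

```python
# Assumed frame shift of 10 ms; verify against the acoustic model's configuration.
FRAME_SHIFT_S = 0.010


def frames_to_seconds(segments, frame_shift=FRAME_SHIFT_S):
    """Convert (first_frame, last_frame, label) into (start_s, end_s, label).

    The last frame is treated as inclusive (an assumption), so each segment
    ends exactly where the next one begins.
    """
    converted = []
    for first, last, label in segments:
        start = round(first * frame_shift, 3)
        end = round((last + 1) * frame_shift, 3)
        converted.append((start, end, label))
    return converted


if __name__ == "__main__":
    # Hypothetical frame indices, not copied from a real Julius run.
    demo = [(0, 31, "sil"), (32, 44, "he")]
    for start, end, label in frames_to_seconds(demo):
        print(start, end, label)
```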

3.3 Missing triphones

======================

In the sample sentence I provided there are triphones that are

missing in the voxforge acoustic model. One can map some of them

into other triphones and put them into the voxforge tiedlist file,

like this:

$ echo "ah-p+d ah-p+ch
ax-p+d ax-p+t
b-iy+f b-iy+s
hh-ix+z hh-ix+s
iy-f+ao iy-f+ow
ow-ix+n ow-ix+ng
p-d+ey n-d+ey
uw-d+ey ax-d+ey" \
>> julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist
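
To find which triphones need such a mapping, something like the following could be used. It is a rough sketch under several assumptions: the utterance is treated as one flat phone sequence (no special handling of "sil"/"sp" or word boundaries), the first field of each tiedlist line is taken as the logical triphone name, and the function names and the phone sequence in the demo are hypothetical.

```python
def to_triphones(phones):
    """Expand a phone sequence into HTK-style 'l-c+r' context names."""
    names = []
    for i, centre in enumerate(phones):
        name = centre
        if i > 0:
            name = phones[i - 1] + "-" + name      # left context
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]      # right context
        names.append(name)
    return names


def logical_names(tiedlist_lines):
    """Collect the first field of every non-empty tiedlist line."""
    return {line.split()[0] for line in tiedlist_lines if line.strip()}


def missing_triphones(phones, tiedlist_lines):
    """Return the triphones of the sequence that the tiedlist does not define."""
    known = logical_names(tiedlist_lines)
    return [t for t in to_triphones(phones) if t not in known]


if __name__ == "__main__":
    # Hypothetical fragment; a real check would read the tiedlist file
    # and the phone sequence produced from the dictionary.
    tiedlist = ["b+iy", "b-iy+s", "iy-f+ow", "f-ao"]
    print(missing_triphones(["b", "iy", "f", "ao"], tiedlist))
```

The names it reports can then be mapped by hand to acoustically close triphones that do exist in the model.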

3.4 The julius command line I used is the following:

=====================================================

$ echo sample_eng/c31c030s_plus_sil.wav | julius \
-h julius-3.5.2-quickstart-linux/acoustic_model_files_build726/hmmdefs \
-hlist julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist \
-dfa fa_files/file.dfa \
-v fa_files/file.phndict \
-walign \
-palign \
-multipath \
-spmodel "sp" \
-iwsp \
-b 200 \
-b2 200 \
-bs 200 \
-sb 200.0 \
-gprune safe \
-iwcd1 max \
-iwsppenalty -30.0 \
-input file

--- (Edited on 4/23/2013 6:08 am [GMT-0500] by azeem) ---

Re: Problem with silence/fillers using Julius4 forced alignment

User: kmaclean
Date: 4/24/2013 10:28 am

> I would like Julius to be able to spot silence (and furthermore even non

>verbal sounds) that may occur between words (or even between phones).

First, for better results, use a current nightly build of the VoxForge acoustic models.

Second, if you look at the output of the forced alignment, you should have word and phoneme timestamps. You should be able to create a script to collect the time stamps from the end of one word to the beginning of the next to give you an idea of where there might be silence.
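
A minimal sketch of such a script, assuming the alignment has already been converted to plain lines of the form "start end word" with times in seconds (the function names, the 0.05 s threshold and the sample data are hypothetical):

```python
def parse_alignment(text):
    """Parse 'start end label' lines into (float, float, str) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append((float(parts[0]), float(parts[1]), parts[2]))
    return rows


def gaps_between_words(rows, min_gap=0.05):
    """Return (previous_word, next_word, gap_seconds) for gaps of at least min_gap."""
    found = []
    for (_, prev_end, prev_word), (next_start, _, next_word) in zip(rows, rows[1:]):
        gap = next_start - prev_end
        if gap >= min_gap:
            found.append((prev_word, next_word, round(gap, 3)))
    return found


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    sample = "0.00 0.40 hello\n0.90 1.30 world\n"
    print(gaps_between_words(parse_alignment(sample)))
```

If the aligner assigns every frame to some word, consecutive segments touch and no gaps are reported.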

I used HTK's HVite for forced alignment in this tutorial (on how to segment a speech file). The word and phoneme timestamps from that tutorial look like this... the sp "short pause" entries correspond to the silence you are looking for.

--- (Edited on 4/24/2013 11:28 am [GMT-0400] by kmaclean) ---

Re: Problem with silence/fillers using Julius4 forced alignment

User: azeem
Date: 4/28/2013 10:18 am

@kmaclean: First of all, thank you very much for your advice.

> First, for better results, use a current nightly build of the

> VoxForge acoustic models

Thanks, I will surely try those!

> Second, if you look at the output of the forced alignment, you

> should have word and phoneme timestamps. You should be able to

> create a script to collect the time stamps from the end of one

> word to the beginning of the next to give you an idea of where

> there might be silence.

Yes, I have already prepared such a script. Actually, in the

output that I linked in my post there is an example of output

with timestamps (I realize that my post may be too long and

difficult to read, which hides this info). I have two versions, .wrd

and .phn. Here follows the word version. Note that

the word "today" is very long (and that's the error: a silence

should have been spotted right after that word):

0 0.32 sil
0.32 0.45 he
0.45 0.88 updates
0.88 1.04 his
1.04 1.31 list
1.31 1.37 of
1.37 1.75 things
1.75 1.87 to
1.87 2.06 do
2.06 3.98 today
3.98 4.31 before
4.31 4.6 going
4.6 4.88 home
4.88 5.08 each
5.08 5.63 evening
5.63 5.91 sil
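
Since the segments in this listing are contiguous, looking for gaps between words finds nothing. One workaround is to flag words whose duration is far above the median instead. The sketch below does that on the listing above. The factor of 3 is an arbitrary choice of mine, and this only hints at where a swallowed silence might be; it does not locate the silence inside the word.

```python
import statistics

# The .wrd listing from this post: start time, end time, word.
WRD = """\
0 0.32 sil
0.32 0.45 he
0.45 0.88 updates
0.88 1.04 his
1.04 1.31 list
1.31 1.37 of
1.37 1.75 things
1.75 1.87 to
1.87 2.06 do
2.06 3.98 today
3.98 4.31 before
4.31 4.6 going
4.6 4.88 home
4.88 5.08 each
5.08 5.63 evening
5.63 5.91 sil
"""


def long_words(text, factor=3.0):
    """Return (word, duration) for words longer than factor times the median duration."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            rows.append((parts[2], float(parts[1]) - float(parts[0])))
    median = statistics.median(duration for _, duration in rows)
    return [(word, round(duration, 2)) for word, duration in rows
            if duration > factor * median]


if __name__ == "__main__":
    print(long_words(WRD))
```

On this listing the only word flagged is "today", at 1.92 seconds.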

> I used HTK's Hvite for forced alignment in this tutorial (on how

> to segment a speech file). The word and phoneme timestamps from

> that tutorial look like this... the sp "short pause" entries

> correspond to the silence you are looking for.

Sure, HVite is another option that I want to try. But in this

thread I would like to troubleshoot julius, i.e. to investigate

whether it is capable of automatically spotting silence/filler events

in a Forced Alignment task, without designing a specific grammar

containing those nodes.

In particular, the -iwsp option seems to deliver what I am

looking for, and I would like to understand if I am using it

correctly.

--- (Edited on 4/28/2013 10:18 am [GMT-0500] by azeem) ---
