In Sphinx there are phonemes for noises, and the documentation speaks about the importance of these sounds.
Would it be necessary to train noises in order to recognize these sounds as noise in HTK?
(I have trained sounds, and each time I make noises I get very strange results.) Or may it be that I need more voice for training, so that the noises will then be ignored? (The training data is Spanish, with 20-30 minutes of voice.)
Thanks in advance.
--- (Edited on 11/13/2008 4:56 pm [GMT-0600] by ubanov) ---
>Would it be necessary to train noises in order to recognize these sounds as
>noise in HTK?
Yes, we should train for noise in our speech submissions. I have not been too concerned about this because it can be added after the fact (i.e., we can add noise tags to the user submission transcriptions/prompts).
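One way such after-the-fact tagging could look, as a sketch only: the `++noise++` token below follows the CMU Sphinx filler-dictionary convention (an HTK setup would instead map such a token to a dedicated noise HMM), and the insertion positions are assumed to be known from listening to the audio.

```python
# Sketch: inserting a noise filler token into transcription lines.
# "++noise++" follows the CMU Sphinx filler-dictionary convention;
# the word positions are assumptions for illustration.

def tag_noise(transcript, positions, token="++noise++"):
    """Insert a noise token before the words at the given indices."""
    words = transcript.split()
    for i in sorted(positions, reverse=True):  # insert right-to-left
        words.insert(i, token)
    return " ".join(words)

line = "close the door please"
# Suppose a cough was heard just before "door" (word index 2):
print(tag_noise(line, [2]))
# -> close the ++noise++ door please
```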
>Or may it be that I need more voice for training, so that the noises will then be ignored?
It is best to train with clean speech (i.e., with no noise); see these posts for more on this topic:
The article referenced in this link might also be useful: The Production of Speech Corpora
--- (Edited on 11/13/2008 10:42 pm [GMT-0500] by kmaclean) ---
I am a beginner in creating a quality ASR.
One major question regards the environment I am targeting.
If I record my training data with the same poor-quality microphone the users tend to use, there will be an overlay of noise consistent throughout the whole utterance.
There are no tags for this noise, as it becomes part of the phones for each word spoken.
So we have a three-dimensional problem in the data recording:
1) The atmosphere the microphone records in (including the additional noise the connection makes to the digitizer in the computer).
2) The sounds the mouth, throat, etc. make while recording, i.e. the grunts, teeth clicks, huhs, haas, lip smacking, hissing, etc., all of it just coming out of the speaker's mouth. This is one area for noise tags. The other would be people in the area while recording, and machinery, i.e. phones, fans, beepers, doors, sirens, a baby crying, loud talking, etc.
3) Finally, we get to the actual words spoken in continuous speech.
This is the nature of my problem, and it is taking a lot of time to come to a proper understanding of how to properly record an acoustic database for a hospital in southern India. Many offices have open ceilings with a high roof (a humid tropical environment).
Thousands of people come to this hospital every day. It is quite noisy.
I found not long ago that raising the recorded wav files' amplitude by 6 dB took me from a horrible WER of 93% to a quite acceptable 12%.
These wav files were recorded with a very noisy mic.
Listening to an open microphone was like listening to a wind tunnel, with strange noises from time to time.
By raising the amplitude by 6 dB I was making the spoken words rise up through the cloud obscuring them, so they could be heard by the recognizer.
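For reference, +6 dB corresponds to multiplying the sample amplitudes by 10^(6/20) ≈ 2. A minimal Python sketch of such a gain stage, operating on raw 16-bit PCM sample values (the clipping limits are for int16 audio; a real pipeline would more likely apply the gain with sox or the recording front end):

```python
def apply_gain_db(samples, gain_db):
    """Scale 16-bit PCM sample values by gain_db decibels, clipping at full scale."""
    factor = 10.0 ** (gain_db / 20.0)  # +6 dB -> ~1.995x amplitude
    out = []
    for s in samples:
        v = int(s * factor)
        out.append(max(-32768, min(32767, v)))  # clip to int16 range
    return out

# Toy PCM values; note the last sample clips at -32768 after the boost.
print(apply_gain_db([1000, -2000, 16000, -30000], 6.0))
# -> [1995, -3990, 31924, -32768]
```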
Does this make any sense to anyone?
And how do I create a quality ASR in this environment???
--- (Edited on 2/6/2009 3:25 am [GMT-0600] by Visitor) ---
And you are here as well.
> I found not long ago that raising the recorded wav files' amplitude by 6 dB took me from a horrible WER of 93% to a quite acceptable 12%.
That conclusion is a mistake: amplitude is not directly related to WER. Probably you just made the frontend work better in the endpointer area.
> And how do I create a quality ASR in this environment???
Noise training is not directly related to HTK, since HTK itself has no methods to deal with noise. Usual solutions require:
- noise cancellation in the frontend processor
- special noise-robust features like RASTA
- better classifiers, both offline (discriminative training) and online (HMM-ANN or HMM-SVM methods)
The noise cancellation is probably the easiest thing to start with.
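As a rough illustration of what frontend noise cancellation by spectral subtraction does, here is a single-frame toy in Python (my own sketch, not code from any of the toolkits discussed): a noise magnitude spectrum estimated from a noise-only stretch is subtracted from the frame's spectrum, scaled by an over-subtraction factor, and bins driven below zero are floored. Windowing, frame overlap, and proper resynthesis are omitted.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=2.0, floor=0.1):
    """One-frame spectral subtraction sketch.

    frame:     1-D array of time-domain samples
    noise_mag: estimated noise magnitude spectrum (same FFT size)
    alpha:     over-subtraction factor; higher removes more noise
               but distorts the signal more
    floor:     bins driven below zero are replaced by the original
               magnitude scaled by this coefficient
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    cleaned = mag - alpha * noise_mag
    cleaned = np.where(cleaned < 0, floor * mag, cleaned)  # flooring
    return np.fft.irfft(cleaned * np.exp(1j * phase), n=len(frame))

# Toy demo: a sine tone buried in white noise; the noise spectrum is
# estimated from a noise-only stretch preceding the speech.
rng = np.random.default_rng(0)
n = 256
tone = np.sin(2 * np.pi * 20 * np.arange(n) / n)
noise = 0.5 * rng.standard_normal(2 * n)
noise_mag = np.abs(np.fft.rfft(noise[:n]))  # noise-only estimate
noisy = tone + noise[n:]
denoised = spectral_subtract(noisy, noise_mag, alpha=1.0)
# Bin magnitudes can only shrink, so total energy must decrease:
print(bool(np.sum(denoised ** 2) < np.sum(noisy ** 2)))
# -> True
```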
--- (Edited on 2/6/2009 3:41 am [GMT-0600] by nsh) ---
Julius can perform noise reduction while recognizing, using spectral subtraction (not sure if Sphinx can do this). From the Julius manual (this is for an older version of Julius - the newer one may have more options):
-sscalc
    Perform spectral subtraction using the head part of each file. With this option, Julius assumes there is a certain length of silence at the head of each input file. Valid only for rawfile input. Conflicts with "-ssload".
-ssload file
    Perform spectral subtraction for speech input using a pre-estimated noise spectrum from file. The noise spectrum data should be computed beforehand by mkss. Valid for all speech input. Conflicts with "-sscalc".
-ssalpha coef
    Alpha coefficient of spectral subtraction for "-sscalc" and "-ssload". Noise will be subtracted more strongly as this value gets larger, but distortion of the resulting signal also becomes more noticeable. (default: 2.0)
-ssfloor coef
    Flooring coefficient of spectral subtraction. The spectral parameters that go below zero after subtraction will be substituted by the source signal multiplied by this coefficient.
This does not help you on the recording end - you still need speech as clean as you can get - but for recognizing, spectral subtraction might help.
Theoretically, you might be able to use two microphones: one to record the target speech and another to pick up the background noise that you would feed into the Julius spectral subtraction algorithm. I have read of noise-cancelling headsets that use two microphones in a similar way for noise cancellation.
--- (Edited on 2/6/2009 10:21 am [GMT-0500] by kmaclean) ---