I'm using sphinx3 [latest from svn] on windows vista using visual studio 2008. Everything is compiling and working correctly, however I'm getting TERRIBLE accuracy.
I'm not doing dictation, I simply want to decode a single word from a wav file.
I've tried 5 different open source models from CMU's page and NONE of them have given accurate results. I have 20+ wav files that I've recorded each saying a SINGLE word [dog, cat, etc].
I've tried using the pronunciation dictionary that comes with each model, and I've also tried deleting every word except the 20 I've recorded. My thought was that some of the dictionaries have 20,000+ words, so it would be easy for the decoder to confuse similar words. However, even with a dictionary containing only my 20 words, it doesn't get a single word correct.
For some of the models, the audio data was provided along with the zip. If I decode the same audio the engine was trained on, I get VERY GOOD accuracy.
Any ideas on what could be going on here?
--- (Edited on 1/19/2009 6:01 pm [GMT-0600] by speige) ---
I tried what you suggested and I'm still not getting any better results.
I've tried converting each wav file to cepstra format [.mfc] with wave2feats.exe before running sphinx3_decode, and I told it the correct sample rate of my wav files during conversion. However, decoding the converted .mfc files still gave completely inaccurate results. I've tried using -mode allphone during decoding to get the decoded phonemes rather than words, and none of the phonemes are even close. I've also tried telling decode.exe my sample rate for both .mfc and .wav files, and the results weren't any better.
Is there a way to tell what sampling rate an acoustic engine is expecting? I've tried 48000 & 44100, but I also downsampled those to 16000 & 8000 [without changing pitch, so it still sounded good] & I still got terrible results.
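One thing worth ruling out first is what rate your wav files actually claim to be. The WAV header records the sample rate, so a quick stdlib check (file name here is just a placeholder) will tell you what the container says, independent of how the audio was recorded:

```python
import wave

def wav_sample_rate(path):
    """Return the sample rate stored in a WAV file's header."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Demo: write a short silent 16 kHz, 16-bit mono file and read it back.
with wave.open("probe.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                   # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)   # 0.1 s of silence

print(wav_sample_rate("probe.wav"))  # 16000
```

If the header rate doesn't match the rate you passed to the feature-extraction tool, the features will be computed with the wrong analysis windows.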
I've tried TONS of combinations on each of my 5 acoustic models and I've never gotten anything that resembled an accurate decoding. I'd be happy to get even a few phonemes correct on a few words, just so I knew it was kind of working, but I'm getting nothing even close.
I understand I must be doing something wrong but I'm not sure what else to try.
An interesting note is I have some cepstra .mfc files that are used in the CMU example of training the RM1 engine. If I decode that file with the engine it was trained on it gives accurate results, but if I try it on any other engine the decoded phonemes aren't even close.
--- (Edited on 1/20/2009 5:51 pm [GMT-0600] by speige) ---
Another note: if I run cepview.exe on my .mfc files and compare them to the files in RM1's included audio data, my files have values ranging from -20,000 to 20,000, while RM1's range from -10 to 10.
That seems like a really big difference to me. Is this a problem, or just a sign that our recordings were done differently?
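A spread that large often points to a scaling mismatch somewhere in the pipeline, e.g. raw 16-bit PCM values versus normalized audio. As a debugging aid, a small stdlib sketch (hypothetical file name) can report the raw sample range of a recording so you can compare your wav files against the corpus audio before any feature extraction happens:

```python
import struct
import wave

def sample_range(path):
    """Return (min, max) of the 16-bit PCM samples in a mono WAV file."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return min(samples), max(samples)

# Demo: write a file whose samples span -20000..20000, like the range above.
with wave.open("loud.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", -20000, 0, 20000))

print(sample_range("loud.wav"))  # (-20000, 20000)
```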
--- (Edited on 1/20/2009 6:13 pm [GMT-0600] by speige) ---
It is a PROBLEM. Instead of poking in the dark, you could follow the documentation more carefully, be more precise, provide the exact commands you are running and their output, and provide links to the files you are working with.
--- (Edited on 1/20/2009 11:24 pm [GMT-0600] by nsh) ---
>Is there a way to tell what sampling rate an acoustic engine is expecting?
The sampling rate that the speech recognition engine (more properly called a decoder) is expecting depends on the acoustic model that you are using. Most Sphinx acoustic models hint at the expected rate in the name of the acoustic model itself.
>I understand I must be doing something wrong but I'm not sure what else to try.
Sometimes it is best to restart the training process and follow the steps very carefully. An error at one point can cause difficult-to-identify problems at another point.
>An interesting note is I have some cepstra .mfc files that are used in
>the CMU example of training the RM1 engine. If I decode that file with
>the engine it was trained on it gives accurate results, but if i try it on
>any other engine the decoded phonemes aren't even close.
Not sure what you mean by: "If I decode that file with the engine it was trained on"... Are you saying you are using the Sphinx engine with different acoustic models?
The process is usually one where you train an acoustic model using audio from a speech corpus. You then use a speech recognition engine (like Sphinx) to listen to speech and look up the phonemes in your acoustic model. The sampling rate (and to a lesser extent the bits per sample) of the speech audio that you want to recognize must match the sampling rate of the speech audio used to create the acoustic model.
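This matching requirement can be illustrated end to end: resample the input down to the model's rate before feature extraction. The sketch below is a deliberately crude, stdlib-only integer-factor downsampler (it keeps every N-th sample and applies no anti-alias filter, and the file names are hypothetical); for real work a proper resampler such as sox is the right tool:

```python
import struct
import wave

def decimate_wav(src, dst, factor):
    """Crudely downsample a 16-bit mono WAV by keeping every `factor`-th
    sample. No anti-alias filtering is done, so this only illustrates
    matching the model's expected rate, not production resampling."""
    with wave.open(src, "rb") as w:
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    kept = samples[::factor]
    with wave.open(dst, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate // factor)
        w.writeframes(struct.pack("<%dh" % len(kept), *kept))

# Demo: 48 kHz -> 16 kHz (factor 3), as in the 48000 -> 16000 case above.
with wave.open("in48k.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(48000)
    w.writeframes(b"\x00\x00" * 4800)   # 0.1 s of silence

decimate_wav("in48k.wav", "out16k.wav", 3)
with wave.open("out16k.wav", "rb") as w:
    print(w.getframerate(), w.getnframes())  # 16000 1600
```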
--- (Edited on 1/26/2009 11:23 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/26/2009 11:26 am [GMT-0500] by kmaclean) ---