What would be the approach to get Sphinx, Julius, or HTK to be as accurate as Dragon NaturallySpeaking Preferred or Pro in English (US and/or UK)? Has anyone achieved anything close, i.e. accurate CSR with a large English vocabulary? I assume that a much larger speech corpus than VoxForge currently has would be required, along with a much larger dictionary than what is currently available. Even if VoxForge reaches 140 hours of English, would this be enough? What else would have to be done to those 140 hours for them to be good enough?
Finally, are there commercial dictionaries and speech corpora available that would be sufficient to make one of the above speech engines work comparably to DNS Preferred?
--- (Edited on 1/8/2010 1:22 pm [GMT-0600] by bhartsb) ---
I blogged about this a while back:
--- (Edited on 1/10/2010 04:32 [GMT+0300] by nsh) ---
Well, you sort of block it out, but don't really address what it might take to get accuracy similar to Dragon's. Regarding accuracy, the areas from your list that seem to apply are:
1) Initial acoustic and language models
2) Acoustic model adaptation framework
3) Language model adaptation framework
But my questions really are about what it might hypothetically take to achieve 1, 2, and 3. E.g. for 1, what does it take to get speaker-independent accuracy similar to Dragon's? And for 2 and 3, what would be entailed in getting speaker-dependent accuracy near Dragon's?
--- (Edited on 1/10/2010 1:46 am [GMT-0600] by bhartsb) ---
Your question is very general, so it is very hard to give a precise answer. Also, it's not very constructive.
I think that the approach mentioned on nsh's blog uses an initial acoustic model and an initial language model that are not nearly as good as the initial models of commercial products. However, with a well-coded adaptation framework, it would still be possible to achieve an end result that is comparable (though, of course, speaker dependent).
If you want an approach where the initial recognition results, without any training, are already comparable, then more speech is needed, and it is essential that the audio matches its transcription almost 100%. Exactly how much and what type of speech is highly debated; by searching the forums you can see that every person involved with VoxForge has different thoughts on that issue.
For someone with a bit of patience, the first approach would work fine and would mostly require a lot of coding, since an initial acoustic model already exists (at least for English), and I believe some initial language models are available as well. So most of the work would be coding work. Of course this would require some fundraising or a business model, because programmers also need to pay the rent, etc.
The second approach would require more initial work on speech collection and speech processing.
Ask yourself which approach you would prefer, or which you believe has the most potential for you. Then ask what the best way for you to contribute would be. That of course depends on your specific skills.
--- (Edited on 1/10/2010 6:23 am [GMT-0600] by Robin) ---
My vantage point is that I spent a good deal of money between 2005 and 2007 funding the building of software applications and infrastructure for voice-to-text. I have plenty of experience building large scalable SaaS web apps, mobile device apps, Windows apps, and networking apps. In pursuit of current business objectives, I am now more interested in having our own CSR engine because of various constraints imposed by using Dragon, i.e. marketing constraints because of its license cost, and development constraints because we don't control or even have access to the source code. I am merely attempting to make an initial baseline assessment of what it might take to get something comparable to DNS. To be of any use, it would have to be at least as accurate as DNS, both trained and untrained. I know that I'm out of my element with regard to CSR fundamentals and don't have time to become an expert, and I wasn't trying to be unproductive with my comments, but rather to get the exact information I need to make initial informed decisions. I appreciate everyone's comments so far.
--- (Edited on 1/11/2010 9:10 pm [GMT-0600] by bhartsb) ---
Well, to answer this: each point will require a month or so of coding work. Take acoustic model adaptation, which is basically done with VTLN warp factor estimation and MLLR; it requires:
1) Modification of the current VoxForge training scripts for VTLN warp factor estimation, warp factor normalization, and retraining of the models on the normalized database.
2) Modification of the training scripts to implement SAT (speaker adaptive training), where each speaker's data is normalized with a speaker-specific MLLR matrix.
3) Implementation of VTLN warp factor estimation in the decoder.
4) Implementation of a speaker-specific decoder modification, if a speaker identifier is somehow available in your system.
5) Implementation of a decoding pass with MLLR (this depends on the decoder).
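To make steps 1 and 3 concrete, the warp factor is usually found by a maximum-likelihood grid search: re-extract the speaker's features with each candidate warp, score them against the acoustic model via forced alignment, and keep the warp with the highest total log-likelihood. A minimal sketch, where `extract_features` and `score` are hypothetical placeholders (not actual HTK or Sphinx APIs):

```python
import numpy as np

def estimate_warp_factor(utterances, acoustic_model,
                         extract_features, score,
                         warps=np.arange(0.80, 1.21, 0.02)):
    """Grid search for the maximum-likelihood VTLN warp factor.

    utterances:       list of (audio, transcript) pairs for one speaker
    extract_features: callable producing features for a given warp
    score:            callable returning the forced-alignment log-likelihood
    """
    best_warp, best_ll = 1.0, -np.inf
    for warp in warps:
        total_ll = 0.0
        for audio, transcript in utterances:
            feats = extract_features(audio, warp=warp)  # warped filterbank
            total_ll += score(acoustic_model, feats, transcript)
        if total_ll > best_ll:
            best_warp, best_ll = warp, total_ll
    return best_warp
```

In practice the decoder would cache the warped feature streams, and the search range and step (0.80-1.20 in steps of 0.02 here) are tunable.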
The same could be written for other tasks.
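For the MLLR matrix in steps 2 and 5: a global MLLR mean transform W = [A b] maps each Gaussian mean mu to A*mu + b, estimated from the speaker's aligned adaptation data. A minimal sketch under simplifying assumptions (a single regression class and identity covariances, in which case the maximum-likelihood estimate reduces to weighted least squares); this is not HTK's actual HEAdapt implementation:

```python
import numpy as np

def mllr_global_transform(means, obs_means, counts):
    """Estimate a single global MLLR mean transform W = [A b].

    means:     (G, d) Gaussian means of the speaker-independent model
    obs_means: (G, d) per-Gaussian means of the adaptation data
    counts:    (G,)   Gaussian occupation counts from forced alignment
    Returns W of shape (d, d+1); the adapted mean is W @ [mu, 1].
    """
    ext = np.hstack([means, np.ones((means.shape[0], 1))])  # extended means [mu, 1]
    w = np.sqrt(counts)[:, None]                            # weight rows by counts
    W, *_ = np.linalg.lstsq(w * ext, w * obs_means, rcond=None)
    return W.T

def adapt_mean(W, mu):
    """Apply the transform to one model mean."""
    return W @ np.append(mu, 1.0)
```

A real implementation estimates the transform row by row using the Gaussian covariances, and typically uses a regression-class tree so that Gaussians with little adaptation data share a transform.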
I wouldn't agree with you that just those three points from the list are relevant. System integration and system testing are obviously activities that will take a lot of time and effort, as will many others. Speech collection is not so important; unlike the other tasks it is mostly done, since it's possible to start with the VoxForge speech database.
For an optimal result it will be necessary to obtain commercial databases, which will slightly raise the price of the solution.
Obviously, to create a detailed proposal it is necessary to understand the details of your business, the specific requirements for the engine, and many other issues. If you are interested, please contact me at email@example.com and we can discuss the details of this project.
--- (Edited on 1/12/2010 19:08 [GMT+0300] by nsh) ---