>3. Do you plan to release the results coming from the dictionary acquisition
>project under the Pronunciation Lexicon Specification?
You certainly are persistent with respect to PLS :)
>And I am convinced that in the long-term XML related standards are the way
>to go. Maybe not today, but in a few years.
>I would like to submit much more speech samples (prompts) in the English
>and in the German language employing the SSML, even if there isn't any
>demand at the moment.
Please note that SSML is only a markup language for controlling how a text-to-speech engine renders text. I don't think it is used as a format for describing transcribed audio submitted for the creation of acoustic models.
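Just to illustrate what SSML actually is - a minimal fragment of the kind a TTS engine consumes might look like this (element names follow the W3C SSML 1.0 spec; the prompt text and break value are arbitrary examples):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- tells the TTS engine HOW to speak, not what was recorded -->
  <p>
    <s>Welcome to VoxForge.</s>
    <s>Please read the <emphasis>following</emphasis> prompt
       <break time="500ms"/> clearly.</s>
  </p>
</speak>
```

Notice there is nothing in there about timing of recorded audio or speaker identity - it is purely synthesis direction, which is why it doesn't fit the transcription use case.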
>I started to read the HTK book. It is really not easy.
The HTK book is a difficult read... I have only read the first few chapters, and now only use it as a reference - I don't have the math skills to understand all the formulas and how they interact. But if you look at HTK as a "black-box", and only focus on the minimum command set required to compile an acoustic model, then you can do quite a bit with trial and error - which essentially was my approach when I started out... :)
You might be interested in the W3C VoiceXML standard, which essentially ties together subsets of the SSML, CCXML, and SRGS specifications. The doc "Voice Browsers: Introduction" provides a good overview of how they all should work together.
The jvoicexml project has implemented a working VoiceXML browser, which essentially provides a VoiceXML dialog manager front-end to Sphinx and Festival. They might provide bindings to Asterisk (IP PBX). Note that jvoicexml uses the JSAPI and JTAPI "standards" to accomplish this.
>I will take a look into the Introduction and Overview of W3C Speech Interface Framework
Good link, I didn't see that particular one...
>I am looking for "a format for describing transcribed audio."
I am not sure there is any such format on the W3C site for this. The LDC might have something. With XML, one could be created fairly easily. But in VoxForge's case, there would be quite a few scripting changes required on the acoustic model creation backend to implement such a thing.
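As a sketch of what such a home-grown format could look like - this is purely hypothetical, not any W3C or LDC standard, and every element and attribute name here is invented:

```xml
<?xml version="1.0"?>
<!-- Hypothetical transcription format: all names are invented,
     not taken from any existing standard -->
<transcription audio="prompt-0042.wav" xml:lang="en-US">
  <speaker id="spk1" gender="male" dialect="en-CA"/>
  <segment start="0.00" end="2.35" speaker="spk1">
    THIS IS AN EXAMPLE PROMPT
  </segment>
</transcription>
```

The point is just that the pieces an acoustic model trainer needs - audio file, time-aligned text, speaker metadata - map naturally onto XML, so defining a format is the easy part; rewriting the backend scripts to consume it is the real work.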
> Perhaps it is VoiceXML, I am not sure at the moment, I will read about it.
VoiceXML is a language for describing spoken dialogs... think of spoken interactive voice response (IVR) systems in a telephone environment (which is what VoiceXML was originally designed for). For example, when calling my ISP, I used to have to punch in keypad sequences to get routed to the help desk. Now I just call their number, say "Internet technical support" into my phone, and get routed to the help desk queue.
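To make the IVR idea concrete, a minimal VoiceXML dialog for that kind of call routing might look like this - a sketch following the VoiceXML 2.0 spec, where the grammar file name, field name, and transfer number are all made up:

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="mainmenu">
    <field name="department">
      <prompt>Say the name of the department you want.</prompt>
      <!-- "departments.grxml" is a made-up SRGS grammar file -->
      <grammar src="departments.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- hand the caller off to the selected queue;
             the number is hypothetical -->
        <transfer dest="tel:+18005551234"/>
      </filled>
    </field>
  </form>
</vxml>
```

The browser plays the prompt via TTS, runs the caller's answer through the speech rec engine against the grammar, and then acts on the result - all without the dialog author touching any engine directly.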
A VoiceXML browser "abstracts" away all the differences between the different implementations of its underlying components: the speech recognition, text-to-speech, and telephony engines.
There have been a few open source implementations of VoiceXML that implemented the text-to-speech and telephony components. But most attempts to implement the speech recognition portion failed - because it is very difficult to do. jvoicexml is impressive because they got the speech rec component working (though I have not tried it out myself). I think using JSAPI was an excellent way to avoid having to work out the details of a particular speech rec or TTS engine, but I am not sure where Sun's JSAPI licensing currently stands.
>I solved one problem, and then the next problem occurred. I stopped
>trying. Maybe I will try it again.
Don't give up yet, if that is what you are interested in. It takes some effort. Some familiarity with a scripting language also helps a lot.