>And I am convinced that in the long-term XML related standards are the way
>to go. Maybe not today, but in a few years.
>I would like to submit much more speech samples (prompts) in the English
>and in the German language employing the SSML, even if there isn't any
>demand at the moment.
Please note that SSML is only a markup language for directing what a text-to-speech engine says. I don't think it is used as a format for describing transcribed audio submitted for the creation of acoustic models.
>I started to read the HTK book. It is really not easy.
The HTK book is a difficult read... I have only read the first few chapters, and now only use it as a reference - I don't have the math skills to understand all the formulas and how they interact. But if you look at HTK as a "black-box", and only focus on the minimum command set required to compile an acoustic model, then you can do quite a bit with trial and error - which essentially was my approach when I started out... :)
You might be interested in the W3C VoiceXML standard, which essentially merges subsets of the SSML, CCXML and SRGS specifications. This doc: "Voice Browsers, Introduction" provides a good overview of how they all should work together.
The jvoicexml project has implemented a working VoiceXML browser, which essentially provides a VoiceXML dialog manager front-end to Sphinx and Festival. They might provide bindings to Asterisk (IP PBX). Note that jvoicexml uses the JSAPI and JTAPI "standards" to accomplish this.
>I will take a look into the Introduction and Overview of W3C Speech Interface Framework
Good link, I didn't see that particular one...
>I am looking for "a format for describing transcribed audio."
I am not sure there is any such format on the W3C site for this. The LDC might have something. With XML, one could be created fairly easily. But in VoxForge's case, quite a few scripting changes would be required on the acoustic model creation backend to implement such a thing.
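To give an idea of what "created fairly easily" might look like, here is a minimal sketch in Python that writes one transcribed recording as XML. The element and attribute names (recording, prompt, word, file, speaker, lang) and the file name a0001.wav are all made up for illustration; they do not come from any W3C or LDC specification, and the VoxForge backend does not read this format.

```python
import xml.etree.ElementTree as ET


def build_transcription(audio_file, speaker, language, words):
    """Return an XML string describing one transcribed recording.

    Every element and attribute name here is hypothetical.
    """
    root = ET.Element(
        "recording",
        {"file": audio_file, "speaker": speaker, "lang": language},
    )
    # The full prompt as the speaker read it.
    ET.SubElement(root, "prompt").text = " ".join(words)
    # One element per word, so later tools could attach pronunciations.
    for w in words:
        ET.SubElement(root, "word").text = w
    return ET.tostring(root, encoding="unicode")


xml_text = build_transcription("a0001.wav", "anonymous", "en", ["HOUSE"])
print(xml_text)
```

The hard part is not producing a file like this, but changing every script that currently expects plain prompt files.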
> Perhaps it is VoiceXML, I am not sure at the moment, I will read about it.
VoiceXML is a language to describe spoken dialogs... think of spoken interactive voice response (IVR) systems in a telephone environment (which is what VoiceXML was originally designed for). For example, when I call my ISP, I used to use keypad sequences to get routed to the help desk. Now I call their number, and just say "Internet technical support" on my phone, and get routed to the help desk queue.
A VoiceXML browser "abstracts" away all the differences between the different implementations of:
There have been a few open source implementations of VoiceXML that implemented the text-to-speech and the telephony components. But most attempts to implement the speech recognition portion failed, because it is very difficult to do. jvoicexml is amazing since they got the speech rec component working (though I have not tried it out myself). I think using JSAPI was an excellent way to avoid having to work out the details of a particular speech rec or TTS engine, but I am not sure where Sun's JSAPI licensing currently stands.
>I solved one problem, and then the next problem occurred. I stopped
>trying. Maybe I will try it again.
Don't give up yet, if that is what you are interested in. It takes some effort. A bit of understanding of a scripting language is also very helpful.
>You select a phoneset, build an LTS system that will generate variants and
>then use forced-alignment against the recording to check whether the
>pronunciations are valid or not.
Forgive my ignorance, but by "LTS" do you mean "Letter to Sound"? If so, do you mean that for each letter in the alphabet of a target language, you create a table that contains the different sounds that the letter might have, and then create a dictionary that has multiple alternate pronunciations of the same word? Then you take a transcribed speech recording and let the speech recognizer figure out the correct phonemes for each word (using forced alignment), based on what it recognizes in the recording?
For example, if someone wants to create a dictionary for a new language, do you first start with a set of speech transcriptions for the target language (i.e. speech audio files with a transcription of the actual words spoken in a text file)?
Then create the letter-to-sound rules? For example, the word "house" in the VoxForge dictionary is pronounced as follows:
HOUSE [HOUSE] hh aw s
If I were using your approach, first I would create a phone list like this (CMU's phone list in this case):
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
K        key      K IY
L        lee      L IY
M        me       M IY
N        knee     N IY
NG       ping     P IH NG
OW       oat      OW T
OY       toy      T OY
P        pee      P IY
R        read     R IY D
S        sea      S IY
SH       she      SH IY
T        tea      T IY
TH       theta    TH EY T AH
UH       hood     HH UH D
UW       two      T UW
V        vee      V IY
W        we       W IY
Y        yield    Y IY L D
Z        zee      Z IY
ZH       seizure  S IY ZH ER
I would then create a set of letter-to-phone rules as follows (phonemes converted to lower case for easier reading):
O ow, oy, uw
Then create rules for letter combinations to sounds (only for such letter combinations that have a unique sound in the target language):
HO hh aa, hh uh, hh ow
US ax s,
Then generate all the possible pronunciations for the word "house":
HOUSE hh ow uh z iy
HOUSE hh oy uh z iy
HOUSE hh uw uh z iy
HOUSE hh aa uh z iy
HOUSE hh uh uh z iy
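As I understand the idea, generating the variants amounts to splitting the word into every sequence of letter chunks that has a rule, and combining the alternative sounds for each chunk. Below is a rough Python sketch of that. The RULES table is my own assumption: the O and HO entries are taken from the rules above, while the single-sound entries for H, U, S and E are only guessed from the pronunciations listed, and the US rule is left out so that the result stays short.

```python
# Hypothetical letter-to-sound table: letter chunk -> alternative sounds.
# Only the "O" and "HO" entries come from the post; the rest are guesses.
RULES = {
    "H": ["hh"],
    "O": ["ow", "oy", "uw"],
    "HO": ["hh aa", "hh uh", "hh ow"],
    "U": ["uh"],
    "S": ["z"],
    "E": ["iy"],
}


def candidates(word, rules):
    """Return every pronunciation the rules allow for word, without duplicates."""
    if not word:
        return [""]
    found = []
    # Try every prefix of the word that has a rule, then recurse on the rest.
    for size in range(1, len(word) + 1):
        chunk = word[:size]
        for sounds in rules.get(chunk, []):
            for rest in candidates(word[size:], rules):
                pron = (sounds + " " + rest).strip()
                if pron not in found:
                    found.append(pron)
    return found


for pron in candidates("HOUSE", RULES):
    print("HOUSE", pron)
```

With this table the sketch yields the five variants listed above; "hh ow uh z iy" is reachable both through H + O and through HO, so it is only kept once. A realistic rule table would produce far more candidates, which is why the forced-alignment step is needed to pick among them.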
And then use the forced alignment feature of a speech recognition engine (like Sphinx, HTK, ...) to look at the text of a particular recording (in this case, of the single word "house") and see which phonemes it identifies as the most likely ones used in the recording (HTK format in this example):
0 9400000 sil -5373.277832 SENT-END
9400000 10400000 hh -750.756897 HOUSE
10400000 11300000 aw -659.823364
11300000 12900000 s -962.888245
12900000 13300000 sil -238.437622 SENT-END
Which can then be fed into a script to create the final correct pronunciation for the word "house":
HOUSE [HOUSE] hh aw s
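The script for that last step could be quite small. Here is a sketch in Python, assuming the aligned output has exactly the column layout shown above (start time, end time, phone, score, and the word on the first phone of each word). The function name dictionary_entries and the choice to skip SENT-START and SENT-END entries are my own assumptions, not part of HTK.

```python
# The aligned label output from the example above.
ALIGNED = """\
0 9400000 sil -5373.277832 SENT-END
9400000 10400000 hh -750.756897 HOUSE
10400000 11300000 aw -659.823364
11300000 12900000 s -962.888245
12900000 13300000 sil -238.437622 SENT-END
"""


def dictionary_entries(label_text, skip=("SENT-START", "SENT-END")):
    """Collect the phones aligned to each word and format dictionary lines."""
    entries = []
    current = None  # [word, phones] for the word being read, if any
    for line in label_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a label line
        phone = fields[2]
        if len(fields) >= 5:
            # A fifth column marks the first phone of a new word.
            word = fields[4]
            if word in skip:
                current = None
            else:
                current = [word, [phone]]
                entries.append(current)
        elif current is not None:
            # Continuation line: the phone belongs to the current word.
            current[1].append(phone)
    return ["%s [%s] %s" % (w, w, " ".join(p)) for w, p in entries]


for entry in dictionary_entries(ALIGNED):
    print(entry)
```

A real script would also have to decide what to do when different recordings of the same word align to different pronunciations, for example by keeping each distinct variant as an alternate entry.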