Hey Ralf, I've checked some of your recordings, and there's too much speech from you, I suppose :) The model will end up overtrained if it contains only your voice.
But there is one great advantage to a large database - we could easily create a German TTS voice for Festival. What do you think about that?
That's a good idea! I've always thought it was a shame there aren't more voices out there. I don't really use TTS myself, but it's great for people with poor eyesight or no voice of their own.
How many hours (more or less) are necessary to make a decent voice for Festival?
Here are my two cents ...
I agree with you that the best speech corpus would be a large one containing both read and transcribed spontaneous speech from hundreds or thousands of people uttering 400,000 different words and 1 million different sentences. The problem, however, is cost - collecting such data is very expensive and time-consuming.
As a shortcut, we can try to think of the problem in terms of getting good monophone and *triphone* coverage from as many people as possible. The original CMU dictionary (the source of the current VoxForge pronunciation dictionary) has close to 130,000 word pronunciations, 43 phonemes, and close to 6,000 triphones. If we can get good coverage of these 6,000 triphones (or however many there are in the target language), then we might reach our objective of a reasonably good acoustic model without needing to worry about complete coverage of all word/sentence combinations in the target language.
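For what it's worth, counting triphones is straightforward. Here's a minimal sketch, assuming a CMUdict-style dictionary with one "WORD PH1 PH2 ..." entry per line (the two entries below are just toy data, not real CMUdict lines):

```python
from collections import Counter

def triphone_counts(dict_lines):
    """Count within-word triphones in a CMUdict-style pronunciation
    dictionary, one 'WORD PH1 PH2 ...' entry per line."""
    counts = Counter()
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 4:          # need at least 3 phonemes for a triphone
            continue
        phones = parts[1:]
        for i in range(len(phones) - 2):
            counts[tuple(phones[i:i + 3])] += 1
    return counts

# Toy example with two made-up entries:
entries = [
    "HELLO HH AH L OW",
    "YELLOW Y EH L OW",
]
counts = triphone_counts(entries)
print(len(counts), "distinct triphones")
print(counts.most_common(3))
```

Run over the full dictionary, `counts.most_common()` would give exactly the "which triphones matter most" ranking we'd need for prompt design.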
Hope that helps,
Heh, let me repeat that trying to get all triphones looks like a deeply wrong idea: the current senone-tying technique gives you effective recognition without complete coverage. And the biggest problem is that rare triphones give you zero improvement in accuracy.
I'd actually like to ask for help - in the acoustic model build logs (inside the acoustic model archive) you'll see a lot of errors due to not reaching the final state. It means the transcription is out of sync with the dictionary and recordings. Can you please check the dictionary, transcription, and recordings for those particular prompts and find the reason for the misalignment?
P.S. Thanks to Ken for uploading it. There was also a request for a sphinx4 model.
Ralf: Sorry, I was thinking of speech recognition when I replied to your post about covering all the words of a language (I forgot that Robin's original question was about text-to-speech).
From a speech recognition context, I just wanted to save you some work by making sure that you did not attempt to create submissions for every combination of words. However, if your goal is to create a pronunciation dictionary by recording many different prompts, and getting the added benefit of actual speech for these prompts, then it makes sense.
>let me repeat that it looks like a deeply wrong idea to get all triphones, current
>senone tying technique allows you to get effective recognition without good
>coverage. And the biggest problem is that rare triphones give you zero
>improvement in accuracy.
Thanks for this clarification. My assumption that a good acoustic model (for speech recognition) needs to be trained on recordings of words containing all triphones is wrong. Therefore, the key is to get recordings of words that contain the most common triphones, and to use "tied-state triphone" models (which I think is HTK terminology for the "senone tying" technique that Sphinx uses...) to cover the rare ones.
I'm wondering if HTK's HDMan command can provide triphone counts (in a similar way that it provides phoneme counts), so we can then create prompts that give us the "most bang for our buck". I'm thinking we would run it against a large database to get these triphone counts (it could even be proprietary, since we are only looking at the counts), and then generate a list of words (from this same database) that cover these common triphones, so Ralf (and others creating prompts for new languages) could use these words in his prompts.
... I'll put it on my todo list :)
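The word-selection step could be a simple greedy set cover: repeatedly pick the word whose pronunciation covers the most not-yet-covered target triphones. A hypothetical sketch (the entries and target triphones below are made up; in practice the targets would be the top triphones from HDMan or the corpus counts):

```python
def pick_covering_words(word_phones, target_triphones):
    """Greedy set cover: repeatedly pick the word whose pronunciation
    covers the most not-yet-covered target triphones."""
    def tris(phones):
        return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}

    remaining = set(target_triphones)
    chosen = []
    while remaining:
        best = max(word_phones,
                   key=lambda w: len(tris(word_phones[w]) & remaining))
        if not tris(word_phones[best]) & remaining:
            break  # no word covers any remaining triphone
        chosen.append(best)
        remaining -= tris(word_phones[best])
    return chosen

# Toy data: made-up pronunciations and target triphones.
word_phones = {
    "HELLO":  ["HH", "AH", "L", "OW"],
    "YELLOW": ["Y", "EH", "L", "OW"],
    "LOW":    ["L", "OW"],            # too short to contain a triphone
}
targets = [("HH", "AH", "L"), ("AH", "L", "OW"), ("EH", "L", "OW")]
chosen = pick_covering_words(word_phones, targets)
print(chosen)
```

Greedy set cover isn't optimal, but it's known to get within a logarithmic factor of the minimum, which should be plenty good for prompt lists.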
Thanks for bringing up Zipf's law. For others (like me) who have never heard of it, here is an excerpt from Wikipedia:
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and all by itself accounts for nearly 7% of all word occurrences (69971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36411 occurrences), followed by "and" (28852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
P.S. re: "I just hope that you have enough webspace to store those prompts."
No worries, disk space gets cheaper every year - not sure if it follows Moore's Law, but it must be pretty close :)