So far I can operate with numbers. Thanks to Ken, who uploaded almost all the audio we have, I was able to train a model on 21 hours of speech. The results are very good for you:
SENTENCE ERROR: 9.6% (23/249) WORD ERROR RATE: 2.6% (50/1928)
but very bad for other speakers:
SENTENCE ERROR: 86.4% (190/221) WORD ERROR RATE: 52.8% (884/1675)
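For reference, the word error rate above is just edit distance over the number of reference words. A minimal sketch (the function name `wer` is mine, not from any particular toolkit):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between the word sequences,
    divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Sentence error is simpler still: the fraction of sentences with at least one word error.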
The model is clearly overtrained. Basically, other speakers even get dropped from training because of alignment issues. The 21 hours we already have is enough even for TTS, although the recordings aren't designed to be a good TTS database (they don't cover the intonation aspects they should). So don't submit more, it's already fine. I hardly believe we'll ever get a balanced database with 10 hours from 10 speakers.
But again, I uploaded the model, and you can now have a very precise recognizer of your own.
I just came across a tool on the Festvox site that might be helpful:
find_nice_prompts: tools for building balanced prompt lists
Scripts for finding a nice set of prompts. Given a large set of text data find a balanced subset of "nice" prompts that can be recorded.
I haven't had a chance to look at it in detail.
It's possible to make a good voice with a relatively small database - an hour or two of speech - but modern ones include from 5 up to 30 hours of speech. Also note that 30 hours usually includes various kinds of emotional speech and so on.
Whether it is better to use a large database, nobody knows. Some researchers state that a small but well-designed database is better; others say that a bigger database gives you better coverage.
My understanding is that it depends on the type of Text-To-Speech ("TTS") engine you are using.
If you use a TTS engine that is based on the concatenation of diphones (like the MBROLA speech synthesizer - though this is not a true 'text'-to-speech engine since it doesn't accept raw text for input), then using speech from more than one person will not sound natural, since different parts of a word may contain speech from different people.
HTS uses Hidden Markov Modeling to generate speech. HMMs are also used for speech recognition (Sphinx/HTK/Julius all use HMMs). In this case, rather than using statistical models (HMMs) to recognize speech, they are used to generate speech. Thus the use of many speakers is possible (not sure how many would be the maximum), because the differences in speakers should be smoothed out by the statistical modelling process.
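The smoothing effect can be illustrated with a toy example: if a state's output is modelled as a single Gaussian, training on frames pooled from several speakers gives a mean and variance that average over the individual speakers, rather than reproducing any one of them (the numbers below are made up for illustration):

```python
def pooled_gaussian(speaker_frames):
    """Estimate one Gaussian (mean, variance) from frames pooled across
    all speakers, as HMM training of a shared model effectively does."""
    frames = [f for spk in speaker_frames for f in spk]
    mean = sum(frames) / len(frames)
    var = sum((f - mean) ** 2 for f in frames) / len(frames)
    return mean, var
```

Two speakers whose values cluster around 1.0 and 3.0 yield a shared model centred at 2.0: the speaker differences end up in the variance, not the mean.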
The Festival Speech Synthesis System can use different types of Text-to-Speech engines (it can use MBROLA, HTS, ...). Its default engine is Multisyn, a general-purpose unit selection synthesis engine. This excerpt(1) describes the Festival Multisyn implementation:
The multisyn implementation in Festival uses a conventional unit selection algorithm. A target utterance structure is predicted for the input text, and suitable diphone candidates from the inventory are proposed for each target diphone. The best candidate sequence is found that minimises target and join costs.
So it seems that Festival's Multisyn implementation uses diphones. Because of this, it would likely have the same problems as MBROLA if more than one person was used for creating the audio base for the TTS model.
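The search described in the excerpt - finding the candidate sequence that minimises target and join costs - is a standard Viterbi-style dynamic program. A minimal sketch (the function and cost signatures are my invention, not Festival's API):

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search: pick one candidate per target position so that the
    sum of target costs (unit vs. target) and join costs (between
    neighbouring units) is minimal."""
    # best maps each candidate at the current position to (path cost, path)
    best = {c: (target_cost(targets[0], c), (c,)) for c in candidates[0]}
    for t, cands in zip(targets[1:], candidates[1:]):
        new = {}
        for c in cands:
            # Cheapest way to reach candidate c from any previous candidate
            prev_cost, prev_path = min(
                ((pc + join_cost(p, c), path) for p, (pc, path) in best.items()),
                key=lambda x: x[0],
            )
            new[c] = (prev_cost + target_cost(t, c), prev_path + (c,))
        best = new
    cost, path = min(best.values(), key=lambda x: x[0])
    return list(path), cost
```

With candidates from several speakers, the join cost between units of different speakers would be high, which is exactly why mixing speakers in a unit selection inventory tends to sound bad.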
(1) Multisyn Voices from ARCTIC Data for the Blizzard Challenge, Robert Clark, Korin Richmond and Simon King, CSTR, The University of Edinburgh, Edinburgh, UK.
Nothing was built.
Well, it's not that required nowadays, since OpenMary released their German voices under a BSD license. But it would still be interesting to do for someone who wants to become familiar with Festival voice building. Also, it would be GPL, which is important for some people.