Re: Google Summer of Code
I think it would be worth discussing the need to work with uncompressed audio.
Firstly, to agree with you: the use of compression is an obvious source of noise, and all obvious sources of noise should be eliminated as far as is practical.
However, I would like to question how much noise 128kbps MP3 actually adds to an audio recording. I've listened to three speakers and looked at one with WaveSurfer - the speech sounds and looks clear to me. In one case, the background noise, whilst not excessive, was certainly more noticeable than the MP3 artifacts.
Ideally the way to test this would be to train and test on the uncompressed audio, train and test on the compressed audio, and compare the error rates. However, that implies that most of the project is already done.
Another way to look at this is to consider the degree of mismatch between the source/train environment and the target environment. If the aim is freely redistributable acoustic models, then the target environment is very varied, and it could be that the coding noise is not significant compared with this mismatch.
Of course, if Librivox speakers will upload the original audio to you then that is preferable; however, from a project-management point of view I'd hate to be dependent on 400 people who have volunteered for a different cause.
On other issues:
Focussing on Librivox would seem sensible if there is a sufficient variety of speakers, which you seem to think is the case.
My point on copyright was targeted at Project Gutenberg texts; however, rereading the T&Cs, they allow the insertion of markup to add timings to the text. Nevertheless, the whole text (plus markup) must be distributed, not just a lot of 15-second chunks.
You may well find that an "automated transcription script" is very helpful in this project; many ASR sites use such tools to check audio transcriptions or to train on partially transcribed audio.
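One common form of such a check is to run a recogniser over each segment and flag those whose word error rate (WER) against the reference text is high. A minimal sketch of the WER calculation itself; the function name and plain edit-distance formulation are my own illustration, not taken from any particular toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped word out of six -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Segments scoring above some threshold would then be queued for manual checking rather than discarded outright.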
To summarise, if I were doing this for Cantab Research I would focus only on 128kbps audio from Librivox and:
1) Get a list of speakers and number of hours spoken by each speaker.
2) Write the scripts to download all the audio and text.
3) Write scripts to clean up the text so that it matches the audio. In the first instance this would mean removing the Gutenberg preamble, adding the spoken Librivox preamble, and looking at what can be done about chapter headings, etc.
4) Build acoustic and language models
5) Use an "automated transcription script" to highlight any problems with the transcriptions; if there are any, go back to stage 3 and fix them up.
6) Decide on a sensible split of data between train, eval and test.
7) Make three releases: first the audio and text (in original forms), second the scripts that perform steps 3-5 above (so that others may improve them), and third the acoustic models.
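For step 3, Project Gutenberg plain-text files carry a standard legal header and footer delimited by "*** START OF" / "*** END OF" marker lines. A minimal sketch of stripping them; the exact marker wording varies between editions, so matching on the "*** START OF" prefix is a simplifying assumption:

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the body between the Gutenberg start and end markers."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # body ends before the end marker
            break
    return "\n".join(lines[start:end]).strip()
```

The spoken Librivox preamble would then be prepended to this cleaned body so the text tracks what is actually read aloud.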
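For step 6, one sensible constraint is to split by speaker, so no speaker appears in more than one of the train/eval/test sets. A sketch under stated assumptions: the greedy fill by hours, the 5% fractions, and the dict-of-hours input format are all my own choices, not part of the plan above:

```python
import random

def split_by_speaker(hours_by_speaker, eval_frac=0.05, test_frac=0.05, seed=0):
    """Return (train, eval, test) speaker lists, roughly proportional by hours."""
    speakers = sorted(hours_by_speaker)
    random.Random(seed).shuffle(speakers)      # seeded for reproducibility
    total = sum(hours_by_speaker.values())
    train, eval_, test = [], [], []
    eval_h = test_h = 0.0
    for spk in speakers:
        h = hours_by_speaker[spk]
        if test_h < test_frac * total:         # fill test set first
            test.append(spk); test_h += h
        elif eval_h < eval_frac * total:       # then eval
            eval_.append(spk); eval_h += h
        else:                                  # everything else is training data
            train.append(spk)
    return train, eval_, test
```

Splitting by speaker, rather than by utterance, keeps the eval and test figures honest: the models are scored on voices they have never seen.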
Hope that helps,
[ speaking for Cantab Research only ]
--- (Edited on 2/28/2007 5:35 am [GMT-0600] by Tony Robinson) ---
--- (Edited on 2/28/2007 5:42 am [GMT-0600] by Tony Robinson) ---