In addition, please do not submit any lossy compressed audio (such as MP3 or Ogg Vorbis) converted to an uncompressed format (such as WAV or AIFF). For example, if you convert your audio from MP3 to WAV, information will still be missing from the audio stream, even though it has been converted to WAV.
I find this requirement very strange. A human cannot tell the difference between uncompressed and lightly compressed audio, yet can decode speech very accurately from either, so it feels intuitive that an algorithm designed to decode speech should not be sensitive to the difference between them either.
Instinctively, it feels to me that if the MFCC transform produces different results for compressed and uncompressed copies of the same audio, then MFCC is the wrong transform to be using.
This issue was discussed here: Google Summer of Code.
Tony Robinson says (his second post):
I think it would be worth discussing the need to work with uncompressed audio.
Firstly, to agree with you, the use of compression is an obvious source of noise, and all obvious sources of noise should be eliminated as far as is practical.
However, I would like to question how much noise 128kbps MP3 adds to an audio recording. I've listened to three speakers and looked at one with wavesurfer - the speech sounds and looks clear to me. In one case, the background noise, whilst not excessive, was certainly more noticeable than the MP3 artifacts.
Ideally the way to test this out would be to train and test on the uncompressed audio, train and test on the compressed, and compare error rates. However, this implies that most of the project is already done.
Another way to look at this is to consider the degree of mismatch between the source/train environment and the target environment. If the aim is freely redistributable acoustic models, then the target environment is very varied, and it could be that the coding noise is not significant compared with this mismatch.
Of course, if LibriVox speakers will upload the original audio to you then that is preferable; however, from a project management point of view I'd hate to be dependent on 400 people who have volunteered for a different cause.
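The "train and test on each, then compare error rates" idea above could be scored along these lines. This is only a sketch: it assumes word error rate (WER) is the metric of interest, and the function names `edit_distance` and `word_error_rate` are made up for illustration rather than taken from any toolkit mentioned in this thread.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return errors / words
```

Running the same test prompts through a recognizer with the uncompressed-trained models and again with the compressed-trained models, then calling `word_error_rate` on each set of transcripts, would give two numbers to compare directly.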
With respect to the use of lossy compressed audio, such as 128kbps mp3, I always assumed that we could not use such audio for the training of Acoustic Models. It was a rule of thumb I used without really questioning why. Thanks for pointing out that mp3 could be 'good enough' for our purposes.
Brough Turner, in a post on his blog called Large Speech Corpora, also discussed the use of podcast audio (which is usually lossy compressed, like mp3) as a possible source of audio data for creating Acoustic Models. I commented (incorrectly) that lossy compressed audio (such as MP3 or Ogg recordings) was not a good source of audio for training AMs. However, he rightly noted that mp3 audio, although lossy, is probably better quality speech than what you would find in telephony audio (8kHz sampling rate at 8 bits per sample). I never really thought that comment through ... Basically, the same logic would apply to audio used for Acoustic Models (which usually use 16kHz sample rates, at 16 bits per sample) for use in Desktop Computer Speech Recognition, which is your point.
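To put rough numbers on that comparison, here is a small sketch of the raw bit rates involved. It assumes single-channel (mono) PCM, and `pcm_bitrate_kbps` is just an illustrative helper. Bit rate alone is not a measure of speech quality, since MP3 is a perceptual codec, so treat this as back-of-the-envelope only.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate of uncompressed PCM audio, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


telephony_kbps = pcm_bitrate_kbps(8000, 8)    # 8kHz, 8 bits per sample
desktop_kbps = pcm_bitrate_kbps(16000, 16)    # 16kHz, 16 bits per sample
mp3_kbps = 128                                # the MP3 rate discussed above

print(f"telephony PCM: {telephony_kbps:.0f} kbps")
print(f"desktop PCM:   {desktop_kbps:.0f} kbps")
print(f"MP3:           {mp3_kbps} kbps")
```

Under those assumptions a 128kbps MP3 carries twice the bits of telephony audio, and is only about a 2:1 reduction from 16kHz, 16-bit mono.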
I created a quick 'sanity test' to compare Acoustic Models trained with wav audio versus mp3 audio. Basically I took the larger wav directories in the VoxForge corpus, and converted them to MP3 and then converted them back. I then compared Acoustic Models ("AM") created with the original wav data, to AMs trained with converted mp3 data to get an idea of any performance differences.
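For anyone wanting to try a similar wav to mp3 and back round trip, something like the following might work. It is a sketch only: the post does not say which tool or settings were used, so `ffmpeg` with the `libmp3lame` encoder, the 128k bit rate, and the 16kHz mono output are all assumptions, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def round_trip_commands(wav_in, mp3_tmp, wav_out, bitrate="128k", sample_rate=16000):
    """Build the two ffmpeg command lines: wav -> mp3, then mp3 -> wav."""
    encode = [
        "ffmpeg", "-y", "-i", str(wav_in),
        "-codec:a", "libmp3lame", "-b:a", bitrate,  # assumed encoder and bit rate
        str(mp3_tmp),
    ]
    decode = [
        "ffmpeg", "-y", "-i", str(mp3_tmp),
        "-ar", str(sample_rate), "-ac", "1",        # assumed 16kHz mono output
        "-c:a", "pcm_s16le",                        # 16-bit PCM wav
        str(wav_out),
    ]
    return encode, decode


def round_trip(wav_in, out_dir, **kwargs):
    """Run both conversions; requires ffmpeg (built with libmp3lame) on PATH."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(wav_in).stem
    mp3_tmp = out_dir / f"{stem}.mp3"
    wav_out = out_dir / f"{stem}.wav"
    for cmd in round_trip_commands(wav_in, mp3_tmp, wav_out, **kwargs):
        subprocess.run(cmd, check=True)
    return wav_out
```

Writing the converted files to a separate output directory keeps the original wav data untouched, so both sets can be used to train models side by side.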
The tests with Julius performed as expected, with a bit of a degradation of performance by using mp3-based Acoustic Models.
The tests with HTK are a bit more confusing, since these show some improvement in performance when using AMs based on mp3 audio.
Basically I need to use a larger test sample with a more complex grammar to get a better test. Regardless, the use of MP3 audio for AM training looks promising.
All this to say that we probably could start using mp3 compressed audio (or some other open codec: SPEEX, or Ogg Vorbis, ...), but if we can get uncompressed, we prefer it for now...