VoxForge
For speech audio files to be 'compiled' into acoustic models, the speech contained in each audio file must be labelled in some way. This can be done using orthographic transcriptions (transcriptions containing the actual words uttered) or phonetic transcriptions (transcriptions containing the sounds that make up the words). These transcriptions are usually kept in a separate text file and are linked to the speech audio file by a common file name.
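As a rough illustration of linking by a common file name, the sketch below pairs each audio file with the text file that shares its base name. The function name `pair_by_stem` and the `.wav`/`.txt` extensions are assumptions made for this example, not a convention stated above.

```python
from pathlib import Path


def pair_by_stem(filenames, audio_ext=".wav", text_ext=".txt"):
    """Pair audio and transcription files that share a base name.

    Hypothetical helper: the extensions are assumptions for illustration.
    Returns {base_name: (audio_file, transcription_file)}.
    """
    audio = {}
    text = {}
    for name in filenames:
        path = Path(name)
        suffix = path.suffix.lower()
        if suffix == audio_ext:
            audio[path.stem] = name
        elif suffix == text_ext:
            text[path.stem] = name
    # Keep only the base names that have both an audio and a text file.
    common = sorted(audio.keys() & text.keys())
    return {stem: (audio[stem], text[stem]) for stem in common}


pairs = pair_by_stem(["a01.wav", "a01.txt", "a02.wav", "notes.md"])
```

Here `a02.wav` is left out of `pairs` because it has no matching transcription, which is a simple way to spot unlabelled audio.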
Transcriptions can also be 'time aligned' (the file contains the start and end time of each word or phone) or not (there are no time stamps marking the start or end of a word or phone).
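The difference between the two kinds of transcription can be sketched as a small data structure. The line layout `start end label` (times in seconds) is a hypothetical format chosen for this example; real label files vary by toolkit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Label:
    """One word or phone, with optional start/end times in seconds."""
    text: str
    start: Optional[float] = None
    end: Optional[float] = None


def parse_aligned(lines):
    """Parse lines in the assumed 'start end label' layout."""
    labels = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        start, end, text = line.split(maxsplit=2)
        labels.append(Label(text.strip(), float(start), float(end)))
    return labels


def parse_plain(transcription):
    """Parse a transcription with no time stamps: one Label per word."""
    return [Label(word) for word in transcription.split()]


def is_time_aligned(labels):
    """True only if every label carries both a start and an end time."""
    return bool(labels) and all(
        label.start is not None and label.end is not None for label in labels
    )


aligned = parse_aligned(["0.00 0.35 hello", "0.35 0.80 world"])
plain = parse_plain("hello world")
```

Both calls yield the same two labels; only `aligned` records where each word starts and ends in the audio.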
Training acoustic models with short segments of transcribed speech (10-15 words long) and no time alignments seems to produce the best acoustic models.
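A simple way to check transcriptions against the 10-15 word guideline is to count whitespace-separated words, as in the sketch below. The function name `in_recommended_range` is hypothetical, and splitting on whitespace is an assumption about how words are counted.

```python
def in_recommended_range(transcription, min_words=10, max_words=15):
    """Return True if the transcription has between min_words and
    max_words whitespace-separated words (inclusive)."""
    word_count = len(transcription.split())
    return min_words <= word_count <= max_words


prompt = "the quick brown fox jumps over the lazy dog near the river"
ok = in_recommended_range(prompt)          # 12 words
too_short = in_recommended_range("hello world")
```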