The LT and the Telecooperation group have open-sourced their German spoken language corpus, recorded during 2014 and 2015 with several speakers from their department.
The corpus contains about 35 hours of speech. About 180 speakers read aloud sentences from the German Wikipedia, protocols from the European Parliament, and some individual commands.
The speakers have confirmed that the recorded speech can be distributed under a CC-BY license.
For each sentence, the speaker metadata (ageclass, region, corpus, transcript sentence, etc.) was generated, along with an individual wave file for each microphone. The recordings were collected with the software KiSrecord, which supports concurrent recordings via multithreading.
The distance between the speaker and each microphone is 1 meter. More details here (pdf). The target is distant speech recognition and a speaker-independent acoustic model. In addition to the open speech data corpus, they have also developed acoustic models for Sphinx and Kaldi.
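As a rough illustration of working with per-microphone wave files, the sketch below reads the duration of a WAV file using only Python's standard `wave` module. The function names, the file name, and the 16 kHz sample rate of the generated demo file are assumptions made for this example; they are not taken from the corpus documentation.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_silence(path, seconds, rate=16000):
    """Write a mono 16-bit WAV file of silence (stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)  # assumed rate, only used for this demo file
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


# Demo on a generated file; the file name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example_microphone.wav")
    write_silence(demo_path, seconds=2.0)
    demo_duration = wav_duration_seconds(demo_path)
    print(f"{demo_duration:.1f} s")
```

Summing this value over all files recorded with one microphone would give a quick check of the total amount of speech in a local copy of the corpus.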
Their motivation was to support open source speech recognition. When their research in speech recognition began, they faced the general problem of obtaining speech data and decided to support open source speech recognition projects (instead of buying commercial software).
They would like the patterns they developed to be reused by other research institutions or companies. The current size of the open speech data corpus is meant only as a starting point.
VoxForge mirror of the Open Speech Data Corpus for German:
german-speechdata-TUDa-2015.tar.gz 27-Mar-2015 17:21 16G
The LT and the Telecooperation group now have an update to the speech data corpus. Many words have been corrected in the second version (an issue with text normalisation, e.g. thousands separators as in 1.000.000).
We received feedback from the community, which allowed us to improve the speech data corpus :)
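To make the thousands-separator issue concrete, here is a minimal sketch of one way to strip German-style separators from integers in a transcript. It is not the authors' normalisation pipeline; the regular expression and the function name `strip_thousands_separators` are assumptions chosen for illustration, and a real pipeline would likely also expand the digits into spoken words.

```python
import re

# Integers written with dots as thousands separators, e.g. 1.000.000.
# Each group after a dot must have exactly three digits, and the match must
# not be glued to further digits or dots, so dates like 27.03.2015 are skipped.
THOUSANDS = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{3})+(?!\d|\.\d)")


def strip_thousands_separators(text):
    """Remove the dots inside numbers such as 1.000.000, leaving other text alone."""
    return THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


print(strip_thousands_separators("Am 27.03.2015 kamen 1.000.000 Besucher."))
```

The pattern is deliberately conservative: anything that does not look like clean three-digit groups is left unchanged rather than guessed at.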
You can find the new corpus here:
german-speechdata-package-v2.tar.gz 01-Jul-2015 14:07 16G