Open Speech Data Corpus for German
User: kmaclean
Date: 4/28/2015 1:18 pm
Views: 23348
Rating: 8

The LT and the Telecooperation group have open-sourced their German spoken language corpus, recorded during 2014 and 2015 with several speakers from their department.

The corpus contains about 35 hours of speech. About 180 speakers read aloud sentences from German Wikipedia, protocols from the European Parliament and some individual commands.

The speakers have confirmed that the recorded speech can be distributed under a CC-BY license.

For each sentence, the speaker metadata (ageclass, region, corpus, transcript sentence etc.) was stored, and an individual WAV file was generated for each microphone. The recordings were collected with the software KiSrecord, which supports concurrent recording via multithreading.
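To make the "one metadata file plus one WAV file per microphone" layout concrete, here is a minimal Python sketch that groups file names by recording. It assumes the naming pattern visible in the file names quoted further down this thread (`<timestamp>.xml` and `<timestamp>_<microphone>.wav`); the function name `group_by_recording` is made up for illustration and is not part of the corpus tooling.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_recording(filenames):
    """Map each recording stem to its XML file and its per-microphone WAV files."""
    recordings = defaultdict(lambda: {"xml": None, "wavs": {}})
    for name in filenames:
        path = PurePath(name)
        if path.suffix == ".xml":
            recordings[path.stem]["xml"] = name
        elif path.suffix == ".wav":
            # Assumed pattern: <recording stem>_<microphone>.wav
            stem, _, microphone = path.stem.rpartition("_")
            if stem:  # skip WAV names without an underscore
                recordings[stem]["wavs"][microphone] = name
    return dict(recordings)


files = [
    "train/2014-08-04-13-15-38.xml",
    "train/2014-08-04-13-15-38_Yamaha.wav",
    "train/2014-08-04-13-13-22.xml",
    "train/2014-08-04-13-13-22_Yamaha.wav",
]
grouped = group_by_recording(files)
for stem in sorted(grouped):
    print(stem, grouped[stem]["xml"], sorted(grouped[stem]["wavs"]))
```

If the real archive uses a different naming scheme for some microphones, the `rpartition("_")` split would need adjusting.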

The distance between the speaker and each microphone is 1 meter. More details here (pdf). The target is distant speech recognition and a speaker-independent acoustic model. In addition to the open speech data corpus, they have also developed acoustic models for Sphinx and Kaldi.

Their motivation was to support open source speech recognition. When their research in speech recognition began, they faced the general problem of obtaining speech data and decided to support open source speech recognition projects (instead of buying commercial software).

They would like other research institutions or companies to make further use of what they have developed. The current size of the open speech data corpus is meant only as a first start.

VoxForge mirror of the Open Speech Data Corpus for German:

german-speechdata-TUDa-2015.tar.gz 27-Mar-2015 17:21 16G

Re: Open Speech Data Corpus for German
User: kmaclean
Date: 7/13/2015 10:27 pm
Views: 1642
Rating: 3

The LT and the Telecooperation group have now released a definitive update of the speech data corpus. Many words have been corrected in this second version (there was an issue with text normalisation, e.g. thousands separators in numbers like 1.000.000).
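For readers unfamiliar with this kind of normalisation problem: German text writes one million as 1.000.000, so the dots have to be handled before a number can be mapped to spoken words. The Python sketch below shows one way to strip such separators. It is only an illustration under my own assumptions, not the normalisation the corpus authors actually used, and the name `strip_thousands_separators` is hypothetical.

```python
import re

# Digit groups joined by dots, e.g. 1.000 or 1.000.000.
# The lookarounds keep dates such as 3.10.1990 untouched and still
# allow a number to be followed by a sentence-final full stop.
THOUSANDS = re.compile(r"(?<!\d)(?<!\d\.)\d{1,3}(?:\.\d{3})+(?!\d)(?!\.\d)")


def strip_thousands_separators(text):
    """Remove dots used as thousands separators inside numbers."""
    return THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


print(strip_thousands_separators("Die Stadt hat 1.000.000 Einwohner."))
print(strip_thousands_separators("Am 3.10.1990 waren es 1.000."))
```

This is deliberately simple: anything that merely looks like dot-grouped digits (an IP address with three-digit parts, for instance) would be rewritten too, so a real pipeline would need more context.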

We received feedback from the community, which allowed us to improve the speech data corpus :)

You can find the new corpus here:

german-speechdata-package-v2.tar.gz 01-Jul-2015 14:07 16G

Re: Open Speech Data Corpus for German
User: meiko
Date: 3/7/2018 10:01 am
Views: 9
Rating: 1

Problems with this corpus:

A lot of audio files are assigned to wrong transcriptions (~15%):

Example 1:
File: german-speechdata-package-v2/train/2014-08-04-13-15-38.xml (sentence id: 113)
text should read: "In den wenigen beobachteten Fällen wurden diese großen Beutetiere innerhalb von Sekunden getötet."

But the corresponding audio files (e.g. 2014-08-04-13-15-38_Yamaha.wav) contain only "Okay"
(according to "SentencesAndIDs.raw.txt" sentence id 851)

Example 2:
File: german-speechdata-package-v2/train/2014-08-04-13-13-22.xml (sentence id: 158)
Text should read: "Das Land der offenen Fernen wie die Rhön auch genannt wird ..."

The associated audio files (e.g. 2014-08-04-13-13-22_Yamaha.wav) contain the sentence "Ich weiß nicht" (sentence-id 815)

It seems that WAV files from the "command" corpus have overwritten those from the "wiki" corpus. This might be easy for the creators of the corpus to fix, but I cannot find a way to fix it with the given files.
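Until the files are fixed upstream, one rough way to find suspicious pairs is to compare the length of the transcript with the duration of the audio: a long Wikipedia sentence cannot fit into a clip that only contains "Okay". The Python sketch below does this with the standard `wave` module. The function names and the threshold of 25 characters per second are my own assumptions, not values derived from the corpus, and the check will miss mismatches where the audio happens to be long enough.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a WAV file in seconds (frames divided by frame rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def looks_mismatched(transcript, duration_seconds, max_chars_per_second=25.0):
    """True if the transcript is too long to have been read in the given time.

    max_chars_per_second is a hypothetical threshold; tune it on recordings
    that are known to be correct.
    """
    if duration_seconds <= 0:
        return True
    return len(transcript) / duration_seconds > max_chars_per_second


sentence = ("In den wenigen beobachteten Fällen wurden diese großen "
            "Beutetiere innerhalb von Sekunden getötet.")
print(looks_mismatched(sentence, 1.0))  # long sentence in a one-second clip
print(looks_mismatched("Okay", 1.0))    # short command in a one-second clip
```

In practice you would call `wav_duration_seconds` on each WAV file and pass the transcript taken from the matching XML file; how to read the transcript out of the XML is left out here because it depends on the schema.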

Re: Open Speech Data Corpus for German
User: kmaclean
Date: 3/7/2018 11:02 am
Views: 2693
Rating: 0

>A lot of audio files are assigned to wrong transcriptions (~15%):

This is not a VoxForge corpus.

We are mirroring it for the LT and the Telecooperation group.

It is best to contact them directly.

If any updates are made, I can update the mirrored copy here.