how to get more voice samples?
User: V
Date: 3/17/2008 4:51 pm
Views: 8509
Rating: 45

I don't know if this is a viable solution, but it might be worth thinking about a bit.

The idea is to put a small icon on blogs with the VoxForge logo and a short text: "Give your voice." If someone clicks on it, they are redirected to the VoxForge page, where they can read the given blog post into the system, à la LibriVox.

I am sure that it's relatively easy to create WordPress/Drupal modules that would provide this feature (communicate the text to the VoxForge site). Moreover, we should offer an API to make this kind of contribution easy for others. (Imagine a Firefox plugin that opens all the major newspaper articles on VoxForge! :) )
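As a rough sketch of what the blog-side badge could emit - the voxforge.org endpoint and the "text" parameter below are made up for illustration, not an existing VoxForge API:

    import java.net.URLEncoder;

    public class GiveYourVoiceBadge {
        // Hypothetical endpoint and parameter; not an existing VoxForge API.
        static String badgeUrl(String postText) throws Exception {
            return "http://www.voxforge.org/submit?text="
                    + URLEncoder.encode(postText, "UTF-8");
        }

        public static void main(String[] args) throws Exception {
            // A WordPress/Drupal module would print this link next to each post.
            System.out.println(badgeUrl("It was the best of times..."));
        }
    }

For anything longer than a short post, a real module would more likely POST the text, or just pass the post's URL and let VoxForge fetch it.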

 What do you think?

Re: how to get more voice samples?
User: kmaclean
Date: 3/17/2008 8:41 pm
Views: 237
Rating: 29

Hi V,

Technically, this is a cool idea!  The speech submission app could be modified to accept text as a parameter when the applet is called.
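A minimal sketch of the receiving side, assuming a hypothetical "prompt" parameter (the actual VoxForge applet's parameters and classes may differ):

    import java.applet.Applet;
    import java.awt.Graphics;
    import java.net.URLDecoder;

    public class PromptedSubmissionApplet extends Applet {
        private String prompt = "(no prompt supplied)";

        @Override
        public void init() {
            // The embedding page would pass the blog text via
            // <param name="prompt" value="...">; "prompt" is a made-up name.
            String raw = getParameter("prompt");
            if (raw != null) {
                try {
                    prompt = URLDecoder.decode(raw, "UTF-8");
                } catch (java.io.UnsupportedEncodingException e) {
                    prompt = raw;
                }
            }
        }

        @Override
        public void paint(Graphics g) {
            // A real submission applet would show record/stop controls here;
            // this sketch only displays the text to be read.
            g.drawString("Please read: " + prompt, 20, 20);
        }
    }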

Legally, there might be problems if the user is reading a copyrighted page ... however, we might get the blogger to include a license file authorizing the reading of their blog for submission to VoxForge (sort of a "yes, you can record this, so long as your recording is GPL and is being submitted to VoxForge" license)... interesting.

From a promotion perspective, it might allow us to build ties with other open source projects and "spread the word".

Any other comments?

Ken 

Re: how to get more voice samples?
User: nsh
Date: 3/18/2008 3:35 am
Views: 1270
Rating: 39

Nice idea really.

Re: how to get more voice samples?
User: Mariane
Date: 9/14/2009 3:14 pm
Views: 35
Rating: 7

Ask for voice recordings which are already open source.

Run them through an existing (even if closed-source) speech recognition engine. The output would still not be under any copyright.

Proofread the output and segment.
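As a hedged sketch of the recognition step, something like CMU Sphinx's Java API could produce the first-pass transcript; the model paths below are the stock en-us resources bundled with the (much later) Sphinx4 releases, and interview.wav is a placeholder for a 16 kHz, 16-bit mono recording:

    import java.io.FileInputStream;

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;

    public class FirstPassTranscriber {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            // Stock US English models bundled with Sphinx4.
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(config);
            // Placeholder file name; Sphinx4 expects 16 kHz, 16-bit mono PCM.
            recognizer.startRecognition(new FileInputStream("interview.wav"));

            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                // Each hypothesis is one draft line of transcript to proofread.
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }

Each printed hypothesis is only a draft line to check against the audio, which is exactly the proofreading step above.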

I am planning on this approach for my research, so I asked Christopher Lydon from radioopensource.org whether he would give me a few hours of interviews in WAV format.

http://www.radioopensource.org/author/chris/

I'll let you know what he says (if anything). If he refuses or simply ignores my email, we can probably find other recordings somewhere (there are plenty online; the only problem is that they tend to be in MP3). Any suggestions welcome.

Please tell me about any sound-segmenting tools you have; for the moment I use Audacity, but it is not well adapted for the purpose.
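For what a purpose-built segmenter might look like, here is a minimal sketch using plain Java Sound; it assumes 16-bit little-endian mono PCM input, and the 50 ms window and RMS threshold of 300 are illustrative guesses to tune per recording:

    import java.io.File;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;

    public class SilenceScanner {
        public static void main(String[] args) throws Exception {
            AudioInputStream in = AudioSystem.getAudioInputStream(new File(args[0]));
            AudioFormat fmt = in.getFormat();
            // One 50 ms analysis window per read.
            int windowBytes = (int) (fmt.getFrameRate() * 0.05) * fmt.getFrameSize();
            byte[] buf = new byte[windowBytes];
            int n, window = 0;
            while ((n = in.read(buf)) > 0) {
                double rms = rms16le(buf, n);
                // Threshold of 300 is a guess; tune it per recording.
                System.out.printf("%6.2fs  rms=%6.0f  %s%n",
                        window * 0.05, rms, rms < 300 ? "silence" : "speech");
                window++;
            }
        }

        // RMS of 16-bit little-endian signed PCM samples.
        static double rms16le(byte[] b, int n) {
            long sum = 0;
            int count = Math.max(n / 2, 1);
            for (int i = 0; i + 1 < n; i += 2) {
                int s = (b[i + 1] << 8) | (b[i] & 0xff);
                sum += (long) s * s;
            }
            return Math.sqrt((double) sum / count);
        }
    }

Runs of consecutive "silence" windows mark the natural cut points; a fuller tool would write each speech run out as its own WAV file.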

Mariane

Re: how to get more voice samples?
User: kmaclean
Date: 9/14/2009 3:30 pm
Views: 36
Rating: 6

Hi Mariane,

>Run them through an existing (even if closed-source) speech recognition
>engine. The output would still not be under any copyright.

Why not just use something like LibriVox, where people record themselves reading public-domain Gutenberg texts... then you don't need the speech-rec step?

>plenty online, the only problem is that they tend to be in mp3

See the discussion on MP3 recordings here: What Kind of Audio Formats is VoxForge looking for? (my reply at the bottom)

>Please tell me about any sound segmenting tools you have,

See here:

Ken

Re: how to get more voice samples?
User: Mariane
Date: 9/14/2009 3:48 pm
Views: 35
Rating: 6

Thank you for these links.


To answer your question, the rhythm is not the same (reading aloud is more rhythmic, pauses are not distributed in the same way, there is a lack of fillers such as "um", etc.). I'm afraid that training only on read samples will produce a system that is best at parsing read-aloud text - not really what we want, is it? You only have to listen to a few samples of read text vs. live radio speech to hear the difference.

Maybe this issue will only arise in the future as some kind of fine-tuning of our systems, but I'm starting right now to look for samples of spontaneous speech :). I'll share them with you, of course.

Mariane

Re: how to get more voice samples?
User: kmaclean
Date: 9/14/2009 7:58 pm
Views: 58
Rating: 7

Hi Mariane,

>To answer your question, the rhythm is not the same (reading aloud is
>more rhythmic, pauses are not distributed in the same way, there is a
>lack of fillers such as "um", etc.).

Agreed... from Wikipedia (disclaimer: my own submission, so take it for whatever it is worth):

There are two types of Speech Corpora:

(1) Read Speech - which includes:

  • Book excerpts
  • Broadcast news
  • Lists of words
  • Sequences of numbers

(2) Spontaneous Speech - which includes:

  • Dialogs - between two or more people (includes meetings);
  • Narratives - a person telling a story (one such corpus is the Buckeye Corpus);
  • Map-tasks - one person explains a route on a map to another;
  • Appointment-tasks - two people try to find a common meeting time based on individual schedules.

Sphinx3 uses the WSJ1 corpus in its wideband acoustic model.  The LDC Catalog description for the WSJ1 corpus says the following:

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation.

[edit - Sept15] Basically, only 5% of the entire corpus uses "Spontaneous Speech" (see the README of the WSJ1 corpus for more info); most of the corpus is "Read Speech".

>I'm afraid that training only on read samples will produce a system
>that is best at parsing read-aloud text - not really what we want,
>is it?

Actually, for command and control (C&C) applications on a person's computer, "Read Speech" fits the bill perfectly from a cost perspective and from an application domain perspective. 

Dictation (as opposed to C&C) acoustic models require training with very large amounts of read and spontaneous speech (1000+ hours).  However, the cost of transcribing these large quantities of speech is prohibitive... tpvelka's post in this thread, Re: transcribing expermints, describes the issue quite clearly:

I think that if you want to do experiments you may want to compare the efficiency of transcribing and reading. The interesting number is this ratio: length of transcribed recording / time spent getting the recording. My experience is that transcription is less efficient than reading, but maybe you can prove me wrong.

I have tried two ways of acquiring transcribed speech data, one was a sort of supervised reading, where there was one person reading sentences and another one listening to each sentence to hear if it is correct. In this way I could get about 12 minutes of speech per hour spent.

The other way I tried was to have lecture recordings (which may be similar to your conference talks) transcribed. My observation is that a skilled transcriber can transcribe about 3 minutes of speech in one hour.
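Working that ratio out from tpvelka's own numbers: supervised reading gave 12 minutes of speech per hour of effort (12/60 = 0.20), while transcribing lectures gave about 3 minutes per hour (3/60 = 0.05), so reading is roughly four times as efficient.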

Read speech is simply a reasonable compromise to get things started.

In addition, in the command and control application domain, a user simply wants the computer to perform simple commands using speech.  These commands would tend to be words or short phrases, very close to what read speech would be like.
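As a hedged illustration of how narrow that domain is, Sphinx4's Java API can be pointed at a small JSGF grammar instead of a dictation language model; the grammar path, name, and command words below are made up for the example:

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.LiveSpeechRecognizer;
    import edu.cmu.sphinx.api.SpeechResult;

    public class CommandListener {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            // Instead of a dictation language model, restrict recognition to a
            // tiny JSGF grammar, e.g. commands.gram containing:
            //   #JSGF V1.0;
            //   grammar commands;
            //   public <command> = (open | close | save) (file | window);
            config.setGrammarPath("resource:/grammars");   // hypothetical path
            config.setGrammarName("commands");             // hypothetical name
            config.setUseGrammar(true);

            LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
            recognizer.startRecognition(true);  // true = clear cached microphone data
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println("Heard: " + result.getHypothesis());
            }
        }
    }

Because the recognizer can only ever output one of the short phrases in the grammar, read-speech training data matches the task well.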

tpvelka's post also talks about his experience using speech recognition to 'bootstrap' transcriptions for lectures - I am not sure how applicable this might be to your plan to use open source radio archives...

Ken

Re: how to get more voice samples?
User: Mariane
Date: 9/18/2009 11:56 am
Views: 957
Rating: 7

If we are talking about non-free corpora, you also have the Bmr meeting corpus at:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S02

Described in a paper available here:

http://www.icsi.berkeley.edu/ftp/pub/speech/papers/icassp03-janin.pdf

Mariane

Re: how to get more voice samples?
User: kmaclean
Date: 9/18/2009 1:29 pm
Views: 271
Rating: 7

Hi Mariane,

I'm not sure if you have seen this list on the VoxForge Dev Wiki:

AudioSources -- Possible sources of Spoken Audio for the creation of Acoustic Models.

You might be interested in the TalkBank corpus listed therein.  They have a subset of the Switchboard telephony corpus released under the GPL.

Ken

Re: how to get more voice samples?
User: kmaclean
Date: 11/17/2009 9:34 am
Views: 190
Rating: 6

Found an interesting paper furthering your argument about the differences between read speech and conversational speech: Spontaneous Speech: How People Really Talk and Why Engineers Should Care by Elizabeth Shriberg.  From the Abstract:

Spontaneous conversation is optimized for human-human communication, but differs in some important ways from the types of speech for which human language technology is often developed. This overview describes four fundamental properties of spontaneous speech that present challenges for spoken language applications because they violate assumptions often applied in automatic processing technology.

The four challenge areas are:

2.1. Recovering hidden punctuation
In many formal written languages, punctuation is rendered explicitly. But spoken language is a stream of words, with no overt lexical marking of the punctuation itself. Instead, phrasing is conveyed through other means, including prosody [...]

2.2. Coping with disfluencies
Disfluencies such as filled pauses, repetitions, repairs, and false starts are frequent in natural conversation. Across many corpora and languages, disfluencies occur at rates higher than every 20 words, and can affect up to one third of utterances. Although disfluencies were once viewed as “errors”, a growing literature on the topic has come to appreciate them as an integral part of natural language and conversation [...]

2.3. Allowing for realistic turn-taking
Spontaneous speech has another dimension of difficulty for automatic processing when more than one speaker is involved. As described in classic work on conversation analysis, turn-taking involves intricate timing. In particular, speakers do not alternate sequentially in their contributions as often suggested by the written rendition of dialog. Rather, listeners project the end of a current speaker’s turn using syntax, semantics, pragmatics, and prosody, and often begin speaking before the current talker is finished [...]

2.4. Hearing more than words
A fourth challenge area is to “hear” a speaker’s emotion or state of being, through speech. Modeling emotion and user state is particularly important for certain dialog system applications. For example, in an automated assistance application, one would want to transfer an angry user to a human. Detecting affect through speech obviously requires more than just words. Despite a growing literature in both linguistics and applied fields, this area remains a challenge both because it is such an inherently difficult task, and because it is hard to obtain natural emotion data [...]

Conclusion

This overview described four challenge areas for the automatic modeling of spontaneous speech. In each area, speakers convey useful information on multiple levels that is often not modeled in current speech technology. Greater attention to these challenges, as well as increased scientific understanding of natural speaking behavior, should offer long-term benefits for the development of intelligent spoken language applications.
