VoxForge
Hello.
I recently discovered your initiative and would like to thank all contributors so far to this endeavour. I just hope you make it one day!
I think I could spend some time helping the project. I have no knowledge of acoustic modelling, but I am a native French speaker, a decent English reader (with a French accent, though), and I have some software development, tinkering and scripting skills.
What's the best way to help?
(Point 2 has the advantage of being doable on public transport, even in a noisy environment.)
Have a nice day, and a happy new year!
Berteh.
> timing/sub-titling/splitting/re-encoding some audio stream you already have
This is a more efficient way to collect audio data. The top task is to collect more transcribed data and to make sure it can be automatically aligned with the existing tools. The automatic alignment part of CMUSphinx needs extensive testing.
>> timing/sub-titling/splitting/re-encoding some audio stream you already have
>This is a more efficient way to collect audio data.
While I am willing to add subtitles to audio streams, I don't know:
_whether any French audio stream without noise or music will do. Example: are old films (at least 10 years old) OK?
Example: is public TV news OK?
Example: are radio podcasts OK?
_how to upload it, and in which format (using which freeware, if needed).
> _whether any French audio stream without noise or music will do. Example: are old films (at least 10 years old) OK?
> Example: is public TV news OK?
> Example: are radio podcasts OK?
> _how to upload it, and in which format (using which freeware, if needed).
To give you a better idea of what is needed, here is the set we are using for UK English:
wget http://downloads.bbc.co.uk/podcasts/radio4/rla76/rla76_20110920-0930b.mp3 -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/2011_reith3.pdf -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/rla76/rla76_20110913-0930b.mp3 -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/2011_reith3.pdf -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/reith/reith_20110906-0940a.mp3 -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/2011_reith3.pdf -O '2011, Eliza Manningham-Buller: Securing Freedom, 3.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/rla76/rla76_20110628-0915c.mp3 -O '2011, Aung San Suu Kyi: Liberty, 1.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/1974_reith1.pdf -O '2011, Aung San Suu Kyi: Liberty, 1.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/reith/reith_20100622-0940a.mp3 -O '2010, Martin Rees: Scientific Horizons, 4.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/20100622_reith.pdf -O '2010, Martin Rees: Scientific Horizons, 4.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/reith/reith_20100615-0945a.mp3 -O '2010, Martin Rees: Scientific Horizons, 3.mp3'
wget http://downloads.bbc.co.uk/rmhttp/radio4/transcripts/20100615_reith.pdf -O '2010, Martin Rees: Scientific Horizons, 3.pdf'
wget http://downloads.bbc.co.uk/podcasts/radio4/reith/reith_20100608-0940a.mp3 -O '2010, Martin Rees: Scientific Horizons, 2.mp3'
Bonjour arbae,
> are old films (at least 10 years old) OK?
No, these cannot be used on VoxForge because of Copyright issues.
You can create your own Acoustic Models from any audio stream as Nick is doing, as long as you don't distribute the source audio. There is an argument that an acoustic model is a derivative work of the copyrighted source material, and therefore even the AM cannot be distributed, but whether such an argument would hold up in court is unknown to me.
For this reason, any audio we collect here at VoxForge is licensed by the author (who owns the Copyright) with a GPL compatible license, so we can redistribute freely.
Your best bet would be to use French Project Gutenberg recordings for timing/sub-titling/splitting/re-encoding since they are in the public domain, and therefore have no Copyright restrictions,
thanks,
Ken
Bonjour kmaclean.
>You can create your own Acoustic Models from any audio stream as Nick is doing, as long as you don't distribute the source audio.
Please provide a link to the thread you are talking about.
>Your best bet would be to use French Project Gutenberg recordings for timing/sub-titling/splitting/re-encoding since they are in the public domain, and therefore have no Copyright restrictions,
I'm rather scientifically minded, so when I took a look at what was there I was a little bored, because the titles were neither classified by genre nor tagged with keywords.
I was thinking about the radio station France Info: you can listen to it over the internet and on AM radio in France.
There are also podcasts in some categories.
example :
audio source : http://rf.proxycast.org/905031126442057728/18998-18.06.2014-ITEMA_20642896-0.mp3
audio transcript :
http://www.franceinfo.fr/emission/nouveau-monde/2013-2014/comment-facebook-dresse-notre-portrait-psy-06-18-2014-06-50
I have found two problems, though: I haven't asked them whether it's allowed, and their transcripts are often only about 90% exact rather than 100%.
What do you think about that ?
>Please provide a link to the thread you are talking about.
Basically, you can create your own acoustic models from any speech source you want, but unfortunately, if the source audio is protected by Copyright, we cannot host it on VoxForge.
>There are also podcasts in some categories.
If the works are in the public domain, then we can host it on VoxForge. If they are protected by Copyright, you will need to get a license from the author which is compatible with the GPL.
> There are also podcasts in some categories. http://rf.proxycast.org/905031126442057728/18998-18.06.2014-ITEMA_20642896-0.mp3
This seems to be a good source for the training. Is it possible to get 200 hours of audio like this? We could train a model then.
>as Nick is doing
I would like a link to the thread where Nick describes how to "create your own Acoustic Models" please.
>Is it possible to get 200 hours of audio like this?
With an average of 5 minutes per article, you would need 2400 links (200 hours × 60 / 5).
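As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The function name and its parameters are mine, purely for illustration; the 5-minute average is only my estimate.

```python
import math


def clips_needed(target_hours: float, avg_clip_minutes: float) -> int:
    """Return how many clips are needed to reach the target duration.

    Rounds up, since a fraction of a clip still means one more download.
    """
    if avg_clip_minutes <= 0:
        raise ValueError("average clip length must be positive")
    return math.ceil(target_hours * 60 / avg_clip_minutes)


# 200 hours of audio at roughly 5 minutes per article
print(clips_needed(200, 5))  # prints 2400
```

If the real average turns out shorter or longer than 5 minutes, the number of links changes in inverse proportion.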
In 2009 and 2010, everything was recorded. Nowadays, only chronicles ("chroniques" in French) are published as podcasts. If you wait for new audio, I believe you would get enough in about 4 months.
Furthermore, the RSS feeds are broken: they show the title of each chronicle with its date, but they do not link to its page. They do still link to the mp3 files, but each feed shows only about 15 of them. This means that the remaining mp3s must be retrieved by going to the page with the transcript and then obtaining the mp3 podcast from there.
Is it possible to make an acoustic model without hosting the audio source files on this site? I could also ask them about their license by email.
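To give an idea of how that retrieval could be scripted, here is a rough sketch using only the Python standard library. It assumes (I have not verified this for every page) that each transcript page references its mp3 in an `href` or `src` attribute. The class and function names are hypothetical, and the page HTML is passed in as a string so the fetching step is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkParser(HTMLParser):
    """Collect every href/src attribute that points to an .mp3 file."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            # Ignore any query string when checking the file extension.
            if value.lower().split("?")[0].endswith(".mp3"):
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:  # skip duplicates, keep page order
                    self.links.append(url)


def find_mp3_links(page_html: str, base_url: str) -> list:
    """Return the absolute mp3 URLs found in one transcript page."""
    parser = Mp3LinkParser(base_url)
    parser.feed(page_html)
    return parser.links
```

Each transcript page would be downloaded first (for example with wget) and its HTML passed to `find_mp3_links` together with the page URL, which gives the transcript/mp3 pairs one by one.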