Be very careful in the way you license this data. If you release a great collection of GPL data, and someone else releases a great collection of data under, for example, a CC-by license, it will probably be illegal to combine the two corpuses and make a working speech recognition product.
I have been involved in two projects that went through immense pain because of this issue. The solution is as follows:
If you don't make this clear right from the start, any attempt to change the license, for whatever reason, will be impossible.
For example, if a court rules that using a subset of the data derived from VoxForge on a hand-held device doesn't comply with the source code rule, and you have to bundle gigabytes of data in order to use VoxForge derived engines on mobile devices, you will be back to square one, and will have to start collecting data all over again. Who knows why a court may do that, but unless you can be sure, you can't risk not assigning data.
The organisation could state in its constitution (terms of incorporation, or whatever it's called in the relevant jurisdiction) that it will always make the data available under the GPL, but may additionally make it available under other licenses.
Thanks for the post. My comments follow:
>If you release a great collection of GPL data, and someone else releases a
>great collection of data under, for example, a CC-by license, it will probably
>be illegal to combine the two corpuses and make a working speech
>recognition product.
This was already discussed in this post: License terms vs. existing database.
Remember that the only thing that can be prevented under the GPL is the *distribution* of such a combined corpus (if it turns out that the licenses are incompatible), or of any derivative works made from it (like sub-corpora or acoustic models). A researcher could still combine such corpora for their own use, and a corporation could use them internally (for server-based speech recognition, for example), as long as they don't distribute all or part of the source corpus or any derivative acoustic models.
>The solution is as follows: 1. Set up a proper legal entity to hold the data (an organisation)...
We already get users to assign their copyright to the Free Software Foundation for any speech they submit using the Speech Submission Java applet.
>If you don't make this clear right from the start, any attempt to change the
>license, for whatever reason, will be impossible.
That is exactly the reason why we chose GPL. Otherwise we would have just chosen a BSD style license or released everything in the public domain.
>For example, if a court rules that using a subset of the data derived from
>VoxForge on a hand-held device doesn't comply with the source code rule,
>and you have to bundle gigabytes of data in order to use VoxForge derived
>engines on mobile devices,
I realize this is just an example, and you never really know how other situations might develop. In this particular case, though, this FAQ entry on the Free Software Foundation website is helpful:
Can I put the binaries on my Internet server and put the source on a different Internet site?
The GPL says you must offer access to copy the source code "from the same place"; that is, next to the binaries. However, if you make arrangements with another site to keep the necessary source code available, and put a link or cross-reference to the source code next to the binaries, we think that qualifies as "from the same place".
Note, however, that it is not enough to find some site that happens to have the appropriate source code today, and tell people to look there. Tomorrow that site may have deleted that source code, or simply replaced it with a newer version of the same program. Then you would no longer be complying with the GPL requirements. To make a reasonable effort to comply, you need to make a positive arrangement with the other site, and thus ensure that the source will be available there for as long as you keep the binaries available.
My interpretation here is that there would be *no* requirement to deliver the source audio corpus on the mobile device in your example.
P.S. I am not a lawyer, and this is not a legal opinion.