I think there is a misunderstanding of the difference between audio data and source code. When I create an executable from source code, I might modify the source code. But when I create acoustic models, I don't modify the data that is used to train the acoustic models. So, if I have to distribute the data along with my models, I'd be distributing an identical copy of the data. (Not to mention the difficulties of distributing gigabytes of data...)
Visitor said: "But when I create acoustic models, I don't modify the data that is used to train the acoustic models. So, if I have to distribute the data along with my models, I'd be distributing an identical copy of the data."
But you may use the source data (i.e. VoxForge Speech Audio files), add additional non-VoxForge source data, and then create new Acoustic Models; or you may clean up the audio (remove non-speech noises, improve the signal-to-noise ratio, etc.) or otherwise improve it, or correct some of the transcriptions, and then create a new Acoustic Model.
The GPL is used in this case to ensure that if you distribute your new and improved Acoustic Models based on VoxForge audio, you also make available any changes you made to the audio or transcriptions.
A bit more information from the Free Software Foundation Website FAQ:
Can I put the binaries on my Internet server and put the source on a different Internet site?
The GPL says you must offer access to copy the source code "from the same place"; that is, next to the binaries. However, if you make arrangements with another site to keep the necessary source code available, and put a link or cross-reference to the source code next to the binaries, we think that qualifies as "from the same place".
Note, however, that it is not enough to find some site that happens to have the appropriate source code today, and tell people to look there. Tomorrow that site may have deleted that source code, or simply replaced it with a newer version of the same program. Then you would no longer be complying with the GPL requirements. To make a reasonable effort to comply, you need to make a positive arrangement with the other site, and thus ensure that the source will be available there for as long as you keep the binaries available.
Beyond the issue of sharing new data, as pointed out by Ken, one of the strengths of the GPL is ensuring modifiability by end users. If an end user wants to augment an existing body of data to create a new model, or wants to generate a new model type suitable for a different recognition algorithm, he/she will need access to the source data. The GPL serves to ensure that the user has this access. To distributors, this may seem like 'making another identical copy', but the point is that the identical copy is being made available to everyone who might need to use it in the course of satisfying their own needs, hence encouraging further refinement and development that might not otherwise have occurred.
I have a couple of questions:
1. If I use VoxForge audio and add more data (commercial speech data covered by another type of license), do I need to give access to this commercial speech data?
2. Do I need to give access to acoustic models trained using VoxForge data (either with VoxForge data only, or with added commercial speech data)?
If I understand the GPL correctly, I only need to provide my modifications to the VoxForge audio or transcriptions.
My answers to your questions follow:
>1. If I use VoxForge audio and add more data (commercial speech data covered by another type of license), do I need to give access to this commercial speech data?
If you want to "distribute" them together, then yes.
I think that section 2 of the GPL applies (my emphasis added):
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
In this case, the "Program" is the VoxForge Corpus, and in order to distribute the VoxForge Corpus along with commercial speech data, you must license the whole (i.e. the VoxForge Corpus *and* the commercial audio) under the terms of the GPL.
>2. Do I need to give access to acoustic models trained using VoxForge data (either with VoxForge data only, or with added commercial speech data)?
Acoustic Models are considered derivative works of the audio used to train them. If you distribute Acoustic Models trained using the VoxForge corpus (or a portion thereof, or a collective work containing the VoxForge corpus, or parts thereof), then you must provide *all* the audio used to train these Acoustic Models under the terms of the GPL.
Hope this helps,