About VoxForge
VoxForge was created to collect transcribed speech for use with Open Source Speech Recognition Engines ("SRE"s) such as ISIP, HTK, Julius and Sphinx. We will categorize and make available all submitted recordings (also known as a 'Speech Corpus') and Acoustic Models under the GPL license.
Why do we need GPL transcribed speech?
To recognize speech, Speech Recognition Engines need two types of files. The first, called an Acoustic Model, is created by collecting a large number of transcribed speech recordings (called a 'Speech Corpus') and 'compiling' them into statistical representations of the sounds that make up each word. The second is a 'Grammar' or a Language Model. A 'Grammar' is a relatively small file containing sets of predefined combinations of words. A Language Model is a much larger file containing the probabilities of certain sequences of words.
Problems with current approaches:
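The difference between a 'Grammar' and a Language Model can be illustrated with a small sketch. This is only a toy example under stated assumptions, not how any of the engines named above store these files: the phrase list, the function names and the bigram (two-word) estimate are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: a toy 'Grammar' is just a fixed set of allowed word combinations.
GRAMMAR = {"open file", "close file", "open window"}


def grammar_accepts(phrase):
    """Return True if the phrase is one of the predefined combinations."""
    return phrase.lower() in GRAMMAR


def bigram_probabilities(sentences):
    """Estimate P(second word | first word) from example sentences.

    This is a plain maximum-likelihood estimate with no smoothing,
    chosen only to keep the illustration short.
    """
    first_word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            first_word_counts[first] += 1
            pair_counts[(first, second)] += 1
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical training text for the toy Language Model.
model = bigram_probabilities(["open file", "open window", "close file", "open file"])
```

Here `grammar_accepts` gives a yes/no answer for a small fixed set of phrases, while `model` assigns a probability to each observed word pair, which is why a real Language Model covering a large vocabulary becomes a much larger file.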
Acoustic Models are Closed-Source
Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the Acoustic Model. If they do give you access, there are usually licensing restrictions on the distribution of the 'source' (i.e. you can only use it for personal or research purposes).
The reason is that there is no free Speech Corpus that is both in a readily usable form and large enough to create good quality Acoustic Models for Speech Recognition Engines. Although there are a few small FOSS speech corpora that could be used to create acoustic models, the vast majority of corpora (especially the large corpora best suited to building good acoustic models) must be purchased under restrictive licenses.
As a result, Open Source projects that want to distribute their code freely must purchase restrictively licensed Speech Corpora that limit distribution of the 'source' speech audio, but allow them to distribute any Acoustic Models they create.
VoxForge will address this problem by providing all Acoustic Models and their 'source' (i.e. transcribed speech audio) under the GPL, which requires that the distribution of derivative works include access to the source used to create that work.
Restrictive Licensing Creates an Access Barrier to Potential Contributors
Every project that wants to build an acoustic model using a corpus with restrictive licensing must purchase its own copy. This is difficult for FOSS projects, which usually have no revenue. If a project does purchase such resources, the license restrictions will require it to keep the resources behind some kind of access barrier restricted to official project members. This takes away freedom and flexibility from end users and shrinks the pool of potential contributors to the project.
Acoustic Models are not Interchangeable
Most Open Source Speech Recognition Engines ("SRE"s) come with an Acoustic Model. However, these Acoustic Models are not interchangeable with other open source Speech Recognition engines. The way to address this problem is to provide the 'source code' for the Acoustic Models (i.e. the Speech Corpora used to create the Acoustic Models), and permit users to 'compile' it into Acoustic Models that can be used with the Open Source SRE of their choice.
VoxForge hopes to address this problem by creating a repository of 'source' speech audio and transcriptions, and by creating Acoustic Models for each of the main Open Source Speech Recognition Engines (such as Sphinx, Julius, HTK and ISIP).
Open Source Acoustic Models Need to be Improved
Current Acoustic Models used by Open Source Speech Recognition Engines do not match the quality of those used by Commercial Speech Recognition Engines.
VoxForge provides a central location that can collect GPL speech audio and transcriptions. As more speech audio data is collected, better Acoustic Models can be created, to the point that someday they will be comparable to those of Commercial Speech Recognition Engines.
No Open Source Dictation Software
Most Open Source SREs are designed for command and control and IVR telephony type applications (e.g. Sphinx, HTK and ISIP). The Julius Speech Recognition Engine was designed for dictation applications; however, the Julius distribution only includes Japanese Acoustic Models. Because Julius uses Acoustic Models trained using the HTK toolkit, it can also use Acoustic Models trained in other languages, such as English. We just need hundreds of hours of transcribed speech audio to create English dictation Acoustic Models. This same audio data might also be used to permit the other open source Speech Recognition Engines to work in dictation applications.
Although the current focus of VoxForge is on Speech Recognition for IVR telephony applications or Command and Control applications on the desktop, when the amount of audio data collected reaches a certain threshold, this data can then be used in the creation of Acoustic Models for Open Source Dictation Applications.
The VoxForge Approach
Currently, you can create a single-user Acoustic Model trained to recognize your own voice using open source speech recognition software; it just takes time and patience. VoxForge's main objective is to create multi-user Acoustic Models that can be used without training for:
- telephony IVR (8kHz Acoustic Models);
- desktop command and control (16-48kHz Acoustic Models);
- dictation (in the future).
To achieve this, VoxForge will serve as a repository for transcribed speech audio files that will be used to create continuously-improving Acoustic Models (as user contributions are merged into the VoxForge Multi-User Acoustic Model).
As more and more transcribed speech data is collected, the creation of single-user Acoustic Models will be made easier. This is because users will be able to adapt the VoxForge Multi-User Acoustic Model to recognize their voice, rather than try to create one from scratch. As even more speech data is obtained, the VoxForge Multi-User Acoustic Model will be able to recognize speech without needing to be adapted to a particular user's voice.
To achieve this objective, we need your help. There are two ways to help:
- Create transcribed speech audio files and submit them to VoxForge:
  - You create transcribed speech audio files using your own voice and submit them;
  - We compile them into the VoxForge Speaker Independent Acoustic Model.
- Create your own Acoustic Models and submit the speech audio files used to create them to VoxForge:
  - You use our How-tos or Tutorials to learn how to create your own Acoustic Models and submit the speech audio and transcriptions to VoxForge;
  - We compile them into the VoxForge Speaker Independent Acoustic Model.
We are currently focusing on collecting high quality audio data (48kHz/16-bit) and downsampling it for use with telephony IVR (8kHz/16-bit) and desktop command and control applications (16kHz-48kHz/16-bit).
The advantage of collecting data at these higher sampling rates is that these audio files could also be used in the creation of Acoustic Models for PC based dictation applications (note that robust statistical Language Models would also need to be created to work with these Acoustic Models).
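The downsampling step can be sketched as follows. This is a minimal illustration, assuming the audio is already available as a list of 16-bit integer samples and that the target rate divides the source rate evenly (as 8kHz and 16kHz both divide 48kHz). The function name `decimate` is hypothetical, and averaging each block of samples is only a crude stand-in for the low-pass filtering a proper resampler would apply; it is not a description of the tooling VoxForge actually uses.

```python
def decimate(samples, source_rate, target_rate):
    """Downsample integer PCM samples by an integer factor.

    Each output sample is the average of `factor` consecutive input
    samples. Trailing samples that do not fill a whole block are dropped.
    """
    if source_rate % target_rate != 0:
        raise ValueError("target_rate must divide source_rate evenly")
    factor = source_rate // target_rate
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) // factor
        for i in range(0, usable, factor)
    ]


# One second of constant-valued 48kHz audio (hypothetical test signal).
one_second_48k = [1000] * 48000
telephony = decimate(one_second_48k, 48000, 8000)   # factor 6
desktop = decimate(one_second_48k, 48000, 16000)    # factor 3
```

One second of 48kHz input yields 8000 samples for the telephony case and 16000 for the desktop case. Going in the other direction (from 8kHz up to 48kHz) cannot recover the discarded detail, which is why collecting at the higher rate keeps more options open.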
More information on the creation of Acoustic Models.
Why GPL?
Unrestricted Licenses for Speech Corpora will not be Effective
We believe that making Speech Corpora available using an unrestrictive, BSD-style license will not help the Open Source Community in this particular case. A BSD-style license permits users to distribute derivative works without having to contribute the source of those modifications back to the community. In our opinion, the Open Source Speech Recognition community does not have the required threshold of users to create a self-sustaining community using a BSD-style license. If there were a larger community, there would be a greater likelihood that a self-sustaining group would give back to the community, even though a BSD-style license does not require it.
GPL licensing ensures that any contributions made by the Open Source Community to VoxForge will benefit the community. This is because the distribution of any derivative works based on the VoxForge Speech Corpora must make the source (i.e. the transcribed speech audio) available to the community.