Frequently Asked Questions - voxforge.org

VoxForge

What is a Java Applet?
What is a Language Model?
What is a Phone?
What is a Speaker Dependent or Independent Acoustic Model?
What is a speech corpus or speech corpora?
What is a Speech Decoder?
What is an Acoustic Model?
What is closed source software?
What is Downsampling?
What is forced alignment?
What is Free Software?
What is G2P?
What is GPL?
What is grapheme?
What is Open Source Software?
What is Telephony IVR?
What is the CMU Arctic Database?
What is the difference between a dialect and an accent?
What is the difference between a phone and a phoneme?
What is the difference between a Speech Recognition Engine and a Speech Recognition System?
What is the difference between a VoiceXML Interpreter, a VoiceXML Browser and a VoiceXML Platform?
What is the difference between lossy, lossless, and uncompressed audio formats?
What is the difference between the HTK Pronunciation Dictionary and the Julius sample.dict?
What is the difference between a monophone and a triphone?
What is the VoxForge phoneset?
What is a Transcribed or Annotated Speech Audio File?
What Kind of Audio Formats is VoxForge looking for?
What Other Speech Submissions Options are There?
What's the difference between Linux and GNU/Linux?
Where can I get software to convert FLAC to wav format?


What is a Java Applet?

The Read page on the VoxForge site contains a Java applet. A Java applet is basically a Java program that runs on the Java Run-time Environment on your PC, but looks like it forms part of a web page.

Part of the start-up process for any Java applet is a check to see if the applet is going to use any resources on your PC, such as accessing your hard drive or the Internet.

What is a Language Model?

A Statistical Language Model is a file used by a Speech Recognition Engine to recognize speech. It contains a large list of words and their probability of occurrence. It is used in dictation applications.
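As a rough illustration of "words and their probability of occurrence", the sketch below estimates single-word probabilities from a tiny made-up text. The function name and the sample sentence are assumptions for illustration only; real Language Models are trained on far larger text collections and usually model sequences of words rather than single words.

```python
from collections import Counter


def unigram_probabilities(text):
    """Return a dict mapping each word to its relative frequency in text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical training text, for illustration only.
probs = unigram_probabilities("the house is near the other house")
print(probs["house"])  # 2 occurrences out of 7 words
```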

What is a Phone?

Although there are theoretical differences between a phone and a phoneme, practically, a phone is simply a contraction for phoneme.

What is a Speaker Dependent or Independent Acoustic Model?

An Acoustic Model is a file used by a Speech Recognition Engine for Speech Recognition. It contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. More information can be found on the Background re: Acoustic Model Creation Document.

A Speaker Dependent Acoustic Model is exactly what its name suggests - it is an Acoustic Model that has been tailored to recognize a particular person's speech. Such Acoustic Models are usually trained using audio from that person's speech. However, you can also take a generic Acoustic Model and adapt it to a particular person's speech to create a Speaker Dependent Acoustic Model.

A Speaker Independent Acoustic Model can recognize speech from a person who did not submit any speech audio that was used in the creation of the Acoustic Model.

The reason for the distinction is that it takes much more speech audio training data to create a Speaker Independent Acoustic Model than a Speaker Dependent Acoustic Model.

What is a speech corpus or speech corpora?

A Speech Corpus (or Spoken Corpus) is a database of speech audio files and text transcriptions of these audio files in a format that can be used to create Acoustic Models (which can then be used with a Speech Recognition Engine). ISIP's Switchboard database is a good example of this.

A corpus is one such database. Corpora is the plural of corpus (i.e. it is many such databases).

There are two types of Speech Corpora:

(1) Read Speech - which includes:

Book excerpts;
Broadcast news;
Lists of words;
Sequences of numbers.

(2) Spontaneous Speech - which includes:

Dialogs - between two or more people (includes meetings);
Narratives - a person telling a story;
Map-tasks - one person explains a route on a map to another;
Appointment-tasks - two people try to find a common meeting time based on individual schedules.

What is a Speech Decoder?

A Speech Decoder (or simply "Decoder") is the software portion of the Speech Recognition Engine. In addition to a Decoder, Speech Recognition Engines need an Acoustic Model and a Language Model or Grammar in order to recognize speech.

What is an Acoustic Model?

An acoustic model is a file that contains statistical representations of each of the distinct sounds that make up a word. Each of these statistical representations is assigned a label called a phoneme. The English language has about 40 distinct sounds that are useful for speech recognition, and thus we have about 40 different phonemes.

An acoustic model is created by taking a large database of speech (called a speech corpus) and using special training algorithms to create statistical representations for each phoneme in a language. These statistical representations are called Hidden Markov Models ("HMM"s). Each phoneme has its own HMM.

For example, if the system is set up with a simple grammar file to recognize the word "house" (whose phonemes are: "hh aw s"), here are the (simplified) steps that the speech recognition engine might take:

The speech decoder listens for the distinct sounds spoken by a user and then looks for a matching HMM in the Acoustic Model. In our example, each of the phonemes in the word house has its own HMM:
- hh
- aw
- s
When it finds a matching HMM in the acoustic model, the decoder takes note of the phoneme. The decoder keeps track of the matching phonemes until it reaches a pause in the user's speech.

When a pause is reached, the decoder looks up the matching series of phonemes it heard (i.e. "hh aw s") in its Pronunciation Dictionary to determine which word was spoken. In our example, one of the entries in the pronunciation dictionary is HOUSE:
- ...
- HOUSAND [HOUSAND] hh aw s ax n d
- HOUSDEN [HOUSDEN] hh aw s d ax n
- HOUSE [HOUSE] hh aw s
- HOUSE'S [HOUSE'S] hh aw s ix z
- HOUSEAL [HOUSEAL] hh aw s ax l
- HOUSEBOAT [HOUSEBOAT] hh aw s b ow t
- ...

The decoder then looks in the Grammar file for a matching word or phrase. Since our grammar in this example only contains one word ("HOUSE"), it returns the word "HOUSE" to the calling program.

This gets a little more complicated when you start using Language Models (which contain the probabilities of a large number of different word sequences), but the basic approach is the same.
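The dictionary and grammar lookup in the "house" example can be sketched as follows. This is a simplified illustration, not how any particular decoder is implemented: it assumes the phonemes have already been identified (the HMM matching step is skipped entirely), and the names PRONUNCIATIONS, GRAMMAR and recognize are hypothetical.

```python
# Hypothetical excerpt of a pronunciation dictionary (word -> phonemes).
PRONUNCIATIONS = {
    "HOUSE": ("hh", "aw", "s"),
    "HOUSE'S": ("hh", "aw", "s", "ix", "z"),
    "HOUSEBOAT": ("hh", "aw", "s", "b", "ow", "t"),
}

# The grammar in the example contains a single word.
GRAMMAR = {"HOUSE"}


def recognize(phonemes):
    """Return the grammar word whose pronunciation matches phonemes, or None."""
    heard = tuple(phonemes)
    for word, pronunciation in PRONUNCIATIONS.items():
        # A word is returned only if it is pronounced as heard
        # and is also allowed by the grammar.
        if pronunciation == heard and word in GRAMMAR:
            return word
    return None


print(recognize(["hh", "aw", "s"]))  # HOUSE
```

A word that is in the pronunciation dictionary but not in the grammar (such as HOUSEBOAT here) is never returned, which mirrors the final grammar check described in the steps.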

What is closed source software?

Closed source is a term for software released or distributed without the corresponding source code.

What is Downsampling?

Downsampling (or subsampling) is the process of reducing the sampling rate of a signal. This is usually done to reduce the data rate or the size of the data. For details, please refer to this Wikipedia link.

A paper by Mitchel Weintraub and Leonardo Neumeyer called CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS provides some background on the use of downsampled High Quality Speech Audio in applications that can only use lower sampling rates.
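To make the idea concrete, here is a minimal sketch of downsampling by a whole-number factor. It is an assumption-laden toy: each output sample is simply the average of a block of input samples, whereas real audio tools use proper anti-aliasing filters and can handle non-integer rate ratios. The function name downsample is hypothetical.

```python
def downsample(samples, factor):
    """Reduce the sampling rate by an integer factor.

    Each output sample is the average of `factor` consecutive input samples;
    the averaging acts as a crude low-pass filter before the rate reduction.
    Trailing samples that do not fill a whole block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


# Going from 48 kHz to 16 kHz corresponds to a factor of 3.
print(downsample([0, 3, 6, 9, 12, 15, 18], 3))  # [3.0, 12.0]
```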

What is forced alignment?

Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment.

This is the reverse of speech recognition, where the object is to take an audio speech segment and generate its text transcription.

For more info, see:

Speech Recognition Engines that can perform forced alignment (from command line or script):

GUI Programs:

SPPAS (SPeech Phonetization Alignment and Syllabification) (uses Julius and Praat)

Other Information:

Automated Audio Segmentation Using Forced Alignment
AutoCap - Automatic and Accurate Captioning (http://imj.ucsb.edu/autocap/)

What is Free Software?

Free software is software that gives users the four essential freedoms:

to run the program, for any purpose
to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
to redistribute copies so you can help your neighbor, and
to distribute copies of your modified versions to others. Access to the source code is a precondition for this.

For more information, see the definition of Free Software on the Free Software Foundation's (FSF) website. The FSF promotes the development and use of free software, particularly the GNU operating system, used widely in its GNU/Linux variant.

What is G2P?

G2P refers to grapheme-to-phoneme conversion. This is the process of using rules to generate a pronunciation for a word (for creating a pronunciation dictionary). The rules are usually created by an automated statistical analysis of a pronunciation dictionary.

The G2P algorithm is used to generate the most probable phone list for a word not contained in the pronunciation dictionary (i.e. out-of-vocabulary words) used to create the G2P rules.

From SPEECH and LANGUAGE PROCESSING By Daniel Jurafsky:

The process of converting a sequence of letters into a sequence of phones is called grapheme-to-phoneme conversion, sometimes shortened g2p. The job of a grapheme-to-phoneme algorithm is thus to convert a letter string like cake into a phone string like [K EY K].
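The following toy sketch shows the general shape of G2P: known words come straight from the pronunciation dictionary, and out-of-vocabulary words fall back to letter rules. Everything in it is an assumption for illustration. The DICTIONARY and RULES tables are hand-written and tiny, whereas real G2P rules are derived statistically from a large dictionary and take context into account.

```python
# Hypothetical pronunciation dictionary and hand-written letter rules.
DICTIONARY = {"cake": ["K", "EY", "K"]}
RULES = {
    "sh": ["SH"], "ch": ["CH"], "ee": ["IY"],
    "a": ["AE"], "c": ["K"], "k": ["K"], "p": ["P"], "t": ["T"],
}


def g2p(word):
    """Return a phone list for word: dictionary first, then letter rules."""
    word = word.lower()
    if word in DICTIONARY:
        return list(DICTIONARY[word])
    phones = []
    i = 0
    while i < len(word):
        # Prefer two-letter graphemes over single letters.
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                phones.extend(RULES[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return phones


print(g2p("cake"))   # ['K', 'EY', 'K']  (found in the dictionary)
print(g2p("sheep"))  # ['SH', 'IY', 'P'] (generated from the rules)
```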

What is GPL?

GPL refers to the 'GNU General Public License'. Copyright provides an author with the right to control copies and changes to a work, whereas the GPL license (also referred to as "copyleft") provides a user with the right to copy and change a work.

The preamble to the GPL license follows:

The GNU General Public License is a free, copyleft license for
software and other kinds of works.

The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.

Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

What is grapheme?

A grapheme is basically a letter ("a", "b", ...).

What is Open Source Software?

Open-source software is an antonym for closed source and refers to any computer software whose source code is available under a license that permits users to study, change, and improve the software, and to redistribute it in modified or unmodified form. It is often developed in a public, collaborative manner.

See this Wikipedia entry for more information.

In addition, see the Open Source Initiative (OSI) web site. OSI is a non-profit corporation dedicated to managing and promoting the Open Source Definition for the good of the community, specifically through the OSI Certified Open Source Software certification mark and program.

What is Telephony IVR?

IVR is an acronym for: Interactive Voice Response.

Older IVR systems allowed users to call in to a system and use the keys on their telephone (also called 'touch-tones') to navigate a series of menus to get information or conduct a transaction. The system would respond to the user over the phone using Text-to-Speech.

Newer IVR systems use a speech-based interface (using Speech Recognition and Text to Speech) to permit a caller to get similar information or conduct similar transactions. The menu structure of speech-based IVRs tends to be 'flatter' than with touch-tone menus, because the available options are no longer limited to the keys on a telephone keypad.

What is the CMU Arctic Database?

The CMU_ARCTIC database was constructed at the Language Technologies Institute at Carnegie Mellon University. It consists of around 1150 utterances selected from out-of-copyright texts from Project Gutenberg.

The prompt files used in the CMU_ARCTIC database were originally designed as US English single-speaker prompt files for Speech Synthesis research (i.e. Text to Speech). Since the prompt set is phonetically balanced, we are using it to generate prompt files for the creation of Speech Recognition Acoustic Models.

What is the difference between a dialect and an accent?

Dialect

A dialect is a variety of language differing in vocabulary and grammar as well as pronunciation. Dialects are usually spoken by a group united by geography or class.

Accent

When a standard language and pronunciation are defined by a group, an accent may be any pronunciation that deviates from that standard.

Groups sharing an identifiable accent may be defined by any of a wide variety of common traits. An accent may be associated with the region in which its speakers reside (a geographical accent), the socio-economic status of its speakers, their ethnicity, their caste or social class, their first language (when the language in which the accent is heard is not their native language), and so on.

What is the difference between a phone and a phoneme?

A phoneme is the smallest structural unit that distinguishes meaning in a language. Phonemes are not the physical segments themselves, but are cognitive abstractions or categorizations of them.

On the other hand, phones refer to the instances of phonemes in the actual utterances - i.e. the physical segments.

For example (from this article):

the words "madder" and "matter" obviously are composed of distinct phonemes; however, in American English, both words are pronounced almost identically, which means that their phones are the same, or at least very close in the acoustic domain.

What is the difference between a Speech Recognition Engine and a Speech Recognition System?

Speech Recognition Engines ("SRE"s) are made up of the following components:

Language Model or Grammar - Language Models contain a very large list of words and their probability of occurrence in a given sequence. They are used in dictation applications. Grammars are a much smaller file containing sets of predefined combinations of words. Grammars are used in IVR or desktop Command and Control applications. Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.

A Speech Recognition System ('SRS') on a desktop computer does what a typical user of speech recognition would expect it to do: you speak a command into your microphone and the computer does something, or you dictate something to the computer and it types out the corresponding text on your screen.

An SRS typically includes a Speech Recognition Engine and a Dialog Manager (and may or may not include a Text to Speech Engine).

What is the difference between a VoiceXML Interpreter, a VoiceXML Browser and a VoiceXML Platform?

A VoiceXML Interpreter does just what it says - it interprets VoiceXML documents. However, it does not, by itself, recognize speech or output speech responses. It is the core smarts of a VoiceXML platform, but does not have the Application Programming Interfaces ("APIs") necessary to communicate with ASR, TTS and/or PBX systems.

A VoiceXML Browser contains a VoiceXML interpreter, and includes generic APIs to ASR, TTS and PBX systems. However it does not, by itself, recognize speech or output speech responses. It still requires separate ASR and TTS and/or PBX systems, and its APIs still need to be modified to work with specific ASR/TTS/PBX systems.

A VoiceXML Platform is a 'turnkey' VoiceXML Speech Recognition and Text to Speech System that works with a PBX. It works 'out of the box' and includes a VoiceXML Browser and the required TTS, ASR and PBX subsystems. A VoiceXML Platform may also be called a 'VoiceXML Spoken Dialog System'.

What is the difference between lossy, lossless, and uncompressed audio formats?

Uncompressed audio is audio without any compression applied to it. This includes audio recorded in PCM or WAV form.

Lossless audio compression is where audio is compressed without losing any information or degrading the quality at all. Examples of lossless formats include WMA Lossless or FLAC in Matroska.

Lossy audio compression attempts to discard as much 'irrelevant' data as possible from the original audio, thereby producing a file that sounds almost identical to the original but is much smaller than the lossless or uncompressed versions. Lossy audio formats include AC3, DTS, AAC, MPEG-1/2/3, Vorbis, and Real Audio.
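The difference between lossless and lossy can be demonstrated with a small sketch. Note the assumptions: zlib is a general-purpose compressor standing in for a lossless audio codec such as FLAC, and dropping the low bits of each sample is only a crude stand-in for what real lossy codecs do. The sample bytes are made up.

```python
import zlib

# Made-up 8-bit samples, for illustration only.
original = bytes([10, 11, 12, 13, 200, 201, 202, 203] * 100)

# Lossless: decompressing gives back exactly the original bytes.
lossless = zlib.compress(original)
restored = zlib.decompress(lossless)

# Lossy stand-in: discard the low 4 bits of every sample, then compress.
# The discarded detail cannot be recovered afterwards.
quantized = bytes(b & 0xF0 for b in original)
lossy = zlib.compress(quantized)

print(restored == original)                # True
print(zlib.decompress(lossy) == original)  # False
```

This is also why converting a lossy file back to WAV does not help: the decoded audio is uncompressed, but the discarded information is still gone.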

What is the difference between the HTK Pronunciation Dictionary and the Julius sample.dict?

The pronunciation dictionary is HTK specific and is different from the file you created in Step 1. The HTK file is used in creating Acoustic Models. The sample.dict file used in Step 1 is part of the Julian Grammar. The Julian sample.dict file will usually only contain a subset of the words and pronunciation information contained in the HTK pronunciation dictionary.

Note: There is duplication of pronunciation information in the Julius sample.dict file (part of the Julius Grammar) and the HTK pronunciation dictionary (used in the creation of HTK Acoustic Models). This can cause errors if you don't get your pronunciation data just right - so be careful.

What is the difference between a monophone and a triphone?

Monophone: The pronunciation of a word can be given as a series of symbols that correspond to the individual units of sound that make up a word. These are called 'phonemes' or 'phones'. A monophone refers to a single phone.

Triphone: A triphone is simply a group of 3 phones in the form "L-X+R" - where the "L" phone (i.e. the left-hand phone) precedes the "X" phone and the "R" phone (i.e. the right-hand phone) follows it.

Below is an example of the conversion of a monophone declaration of the word "TRANSLATE" to a triphone declaration (the first line shows the "monophone" declaration, and the second line shows the "triphone" declaration):

TRANSLATE [TRANSLATE] t r @ n s l e t
TRANSLATE [TRANSLATE] t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t

In the CMU dictionary, which has close to 130,000 word pronunciations, there are only 43 phones, but there are close to 6000 triphones.
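The monophone-to-triphone conversion shown in the TRANSLATE example can be sketched in a few lines. This is an illustrative sketch rather than the tool actually used to build Acoustic Models; the function name to_triphones is hypothetical, and it only handles the word-internal case shown in the example (no context across word boundaries).

```python
def to_triphones(phones):
    """Convert a list of monophones into "L-X+R" triphone labels.

    The first phone has no left context and the last has no right context,
    matching the word-internal form shown in the TRANSLATE example.
    """
    labels = []
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = f"{phones[i - 1]}-{label}"
        if i < len(phones) - 1:
            label = f"{label}+{phones[i + 1]}"
        labels.append(label)
    return labels


print(" ".join(to_triphones("t r @ n s l e t".split())))
# t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t
```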

What is the VoxForge phoneset?

The VoxForge phoneset is derived from the CMU phone set (cmudict-0.7b) as follows:

The current phoneme set has 39 phonemes

        Phoneme Example Translation
        ------- ------- -----------
        AA      odd     AA D
        AE      at      AE T
        AH      hut     HH AH T
        AO      ought   AO T
        AW      cow     K AW
        AY      hide    HH AY D
        B       be      B IY
        CH      cheese  CH IY Z
        D       dee     D IY
        DH      thee    DH IY
        EH      Ed      EH D
        ER      hurt    HH ER T
        EY      ate     EY T
        F       fee     F IY
        G       green   G R IY N
        HH      he      HH IY
        IH      it      IH T
        IY      eat     IY T
        JH      gee     JH IY
        K       key     K IY
        L       lee     L IY
        M       me      M IY
        N       knee    N IY
        NG      ping    P IH NG
        OW      oat     OW T
        OY      toy     T OY
        P       pee     P IY
        R       read    R IY D
        S       sea     S IY
        SH      she     SH IY
        T       tea     T IY
        TH      theta   TH EY T AH
        UH      hood    HH UH D
        UW      two     T UW
        V       vee     V IY
        W       we      W IY
        Y       yield   Y IY L D
        Z       zee     Z IY
        ZH      seizure S IY ZH ER

What is a Transcribed or Annotated Speech Audio File?

In order for Speech Audio Files to be 'compiled' into Acoustic Models, the speech contained in the audio file must be labelled in some way. This can be done using orthographic transcriptions (transcriptions containing the actual words uttered) or using phonetic transcriptions (transcriptions containing the sounds that make up the words). These transcriptions are usually contained in a separate text file, and are linked to the speech audio file by a common file name.

Transcriptions can also be 'time aligned' (where the file contains the start and end time of each word or phone) or not (no time stamps indicating the start or end of a word or phone).

Training Acoustic Models with short segments of transcribed speech (10-15 words long), with no time alignments, seems to create the best acoustic models.

What Kind of Audio Formats is VoxForge looking for?

VoxForge is only looking for original audio files in an uncompressed format such as WAV or AIFF (up to a 48kHz sampling rate at 16 bits per sample).

Please do not submit any audio using 'lossy' compressed formats such as MP3, Ogg Vorbis, etc. Lossy compression removes information from the audio stream.

In addition, please do not submit any lossy compressed audio (such as MP3 or Ogg Vorbis) converted to an uncompressed format (such as WAV or AIFF). For example, if you convert your audio from MP3 to WAV, information will still be missing from the audio stream, even though it has been converted to WAV.

You can submit audio in a 'lossless' compressed format such as FLAC, since it does not remove information from the audio stream.
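If you want to check a WAV file before submitting it, a sketch along these lines reads the header with Python's standard wave module. The function name is hypothetical, and it assumes "16 bits per sample" is meant exactly while 48kHz is an upper limit. It only inspects the header, so it cannot detect audio that was converted from a lossy format.

```python
import io
import wave


def meets_voxforge_format(wav_file):
    """Check a WAV file (path or file object) against the stated limits.

    Only inspects the header: 16 bits per sample, at most 48 kHz.
    """
    with wave.open(wav_file, "rb") as wav:
        return wav.getsampwidth() == 2 and wav.getframerate() <= 48000


# Build a tiny in-memory 16-bit, 16 kHz mono file to exercise the check.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)   # 2 bytes = 16 bits per sample
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)
print(meets_voxforge_format(buffer))  # True
```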

What Other Speech Submissions Options are There?

If you have contributed speech for all the Prompts in the Speech Submission Section, you may be interested in contributing additional speech recordings using your own prompts or transcribing audio recorded by others.

These types of submissions will help to ensure that we get speech audio for as many different words as possible (especially words not already included in our Phoneme Coverage Prompts), and thus provide coverage for as many triphones as possible. It is not enough to get many different people reading the same VoxForge-created Phoneme Prompts files, because the resulting Acoustic Models will only be as good as the triphones covered in those files. We need a large variety of texts to ensure we cover as many of the triphones in the English language as possible.

User Submitted Audio

Suggestions for user-submitted prompts:

Gutenberg Project - The Gutenberg project is a library of 17,000 free ebooks whose copyright has expired. Pick one and record all or part of it. Then submit a compressed audio version to the Gutenberg Audio Books project (e.g. mp3) and submit the uncompressed version (in wav format) to VoxForge.
Wikipedia - Wikipedia is the free encyclopedia built collaboratively using Wiki software. Pick your favourite article and record all or part of it. Then submit a compressed version of your audio (in ogg format) to the Spoken Wikipedia project and submit an uncompressed version (in wav format) to VoxForge.
Google Books - Google Books gives you access to many out-of-copyright books that you can download (you need to select the "Full view" radio button when you search on books.google.com).

Don't worry if you don't have the time (or the inclination) to create VoxForge style prompts and/or audio files. We can convert your "one big prompt file" and corresponding "one big speech audio file" (in uncompressed wav format) into VoxForge style prompt and audio files. What is important is that we get as many varied speech audio contributions as possible.

Transcribing Speech Audio to VoxForge Format

Check the AudioSources page of the VoxForge Dev Wiki for possible sources of audio that could be transcribed.

What's the difference between Linux and GNU/Linux?

Linux is the kernel of a GNU/Linux operating system; it does not include all the other software needed to create an operating system.

Linux is not an operating system. Linux is the kernel: the program in the system that allocates the machine's resources to the other programs that you run. The kernel is an essential part of an operating system, but useless by itself; it can only function in the context of a complete operating system.

See this link for more information.

Where can I get software to convert FLAC to wav format?

You can download a FLAC decoder here:


Unless otherwise indicated, © 2006-2025 VoxForge; Legal: Terms and Conditions