Speech Recognition Engines

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/22/2010 9:57 pm
Views: 293
Rating: 6

>This article is not very related, but attracted my attention recently

>http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.4227

Found some background (from the paper Acoustic Model Clustering Based on Syllable Structure, by Izhak Shafran & Mari Ostendorf) that helped me understand your link:

In many ASR systems, the acoustic variation of words are modeled at two levels - the pronunciation model which maps word sequences to phonemes, and the acoustic model which maps phoneme sequences to multivariate acoustic models. Work with simulated data which was produced using the acoustic models of speech, have pointed to pronunciation variability as a key problem in recognizing conversational speech [...] However, the work on pronunciation modeling in terms of phoneme-level substitutions, deletions and insertions has so far only yielded small performance gains [...]

Conventionally, phone-level acoustic variation has been captured by conditioning the acoustic models for a phoneme on the context of neighboring phonemes in the hypothesized sequence. Typically, in large vocabulary ASR, phonemes with immediate neighbors (triphones) [...] are used. Conditioning only on phonemic context does not capture the acoustic variation of conversational speech fully [...]

Our hypothesis is that, in English, syllable structure is also useful in modeling the variation not accounted for by phoneme context. Consider the phoneme "t" (in the context "iy t er") in "beater", "beat Ernest" and "return". Even though it is the same triphone, the articulation of phone "t" in the three contexts is distinctly different - in the first it is flapped, in the second it is an unreleased closure and in the third it is a closure plus a release. These differences are closely related to syllable structure [...] The use of syllable structure is motivated in part by results from psychoacoustic studies, which argue for the syllable as a unit of perception [...]
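To make the "same triphone" point concrete, here is a minimal Python sketch. The phone strings are hypothetical hand-written pronunciations that follow the paper's example (they are not taken from any real dictionary), and `triphones` and `contains_triphone` are illustrative helper names, not part of Sphinx or HTK.

```python
def triphones(phones):
    """Return (left, center, right) triples, padding the ends with 'sil'."""
    padded = ["sil", *phones, "sil"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def contains_triphone(phones, target):
    """True if the target (left, center, right) triple occurs in the sequence."""
    return target in triphones(phones)


# Hypothetical pronunciations, written out by hand for illustration only.
UTTERANCES = {
    "beater": ["b", "iy", "t", "er"],
    "beat Ernest": ["b", "iy", "t", "er", "n", "ah", "s", "t"],
    "return": ["r", "iy", "t", "er", "n"],
}

TARGET = ("iy", "t", "er")

for text, phones in UTTERANCES.items():
    # A context-dependent phone model keyed only on this triple
    # cannot tell the three realizations of "t" apart.
    print(text, contains_triphone(phones, TARGET))
```

All three utterances contain the identical triple, so a model conditioned only on the neighboring phonemes would share one "t" model across the flap, the unreleased closure, and the released closure.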

Whereas a phoneme, as defined in Wikipedia:

[...] is a group of slightly different sounds which are all perceived to have the same function by speakers of the language or dialect in question. An example of a phoneme is the /k/ sound in the words kit and skill. [...] Even though most native speakers don't notice this, in most dialects, the k sounds in each of these words are actually pronounced differently: they are different speech sounds, or phones [...] . In our example, the /k/ in kit is aspirated, [kʰ], while the /k/ in skill is not, [k]. The reason why these different sounds are nonetheless considered to belong to the same phoneme in English is that if an English-speaker used one instead of the other, the meaning of the word would not change: saying [kʰ] in skill might sound odd, but the word would still be recognized.

The paper you refer to (Moving Beyond the 'Beads-On-A-String' Model of Speech) goes even further:

[...] several researchers have recently argued for the syllable as an alternative to the phoneme for representing speech. In this paper, we take a different tack and argue for finer-grained low level representation, incorporating dependence on syllable (and higher level) structure via context conditioning.

They then cite two very different approaches:

1. data-driven: 

Acoustically derived sub-word units (ASWUs) represent a data driven approach to defining the sub-word units of speech. Recognition system design involves a combination of automatic segmentation into stationary regions or ‘segments’, clustering the segments based on acoustic similarity, and dictionary design.

2. linguistically based:

In linguistics, it is [linguistic] features and not phonemes that are viewed as the fundamental units of speech, where phones are specified (or coded) in terms of distinctive features. [...] For the most part, distinctive features are related to the manner in which a speech sound is produced (the degree of constriction in the vocal tract), the particular articulator that is used (glottis, soft palate, lips and tongue blade, body and root) and/or place of constriction, and how an articulator is used to produce the sound.

Although I'm not sure I understand exactly how the data-driven or linguistically-based approaches might be implemented in Sphinx or HTK/Julius, it seems from your blog post (Moving Beyond the 'Beads-On-A-String') that you're leaning toward the syllabic approach, which seems like it could be implemented with a pronunciation dictionary that uses syllables rather than phonemes.
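A rough sketch of what such a syllable-based dictionary could look like, next to the usual phoneme-based one. Everything here is an assumption for illustration: the syllabifications are hand-written, the `_`-joined unit names are a made-up convention, and whether Sphinx or HTK/Julius training would accept these units without further changes (phone list, state tying, and so on) is not something I have verified.

```python
# Hypothetical hand-syllabified lexicon: word -> list of syllables,
# each syllable being a list of phones.
SYLLABIFIED = {
    "BEATER": [["b", "iy"], ["t", "er"]],
    "RETURN": [["r", "iy"], ["t", "er", "n"]],
    "ERNEST": [["er"], ["n", "ah", "s", "t"]],
}


def phoneme_entry(word, syllables):
    """Conventional dictionary line: the word followed by its phones."""
    phones = [p for syl in syllables for p in syl]
    return f"{word} {' '.join(phones)}"


def syllable_entry(word, syllables):
    """Syllable dictionary line: each syllable becomes one unit name."""
    units = ["_".join(syl) for syl in syllables]
    return f"{word} {' '.join(units)}"


def unit_inventory(lexicon):
    """Sorted set of syllable units the acoustic model would need to cover."""
    return sorted({"_".join(syl) for syls in lexicon.values() for syl in syls})


for word, syls in SYLLABIFIED.items():
    print(phoneme_entry(word, syls))
    print(syllable_entry(word, syls))

print(unit_inventory(SYLLABIFIED))
```

With this layout the "t" in BEATER lives inside the unit `t_er` and the "t" in RETURN inside `t_er_n`, so they no longer share a model. The cost is a much larger unit inventory, which means less training data per unit.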

Ken

--- (Edited on 1/22/2010 10:57 pm [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: sfssZSDSER32
Date: 10/14/2021 1:27 am
Views: 618
Rating: 0

Newer technologies have emerged since this discussion. During my uni days we used a Raspberry Pi for home automation using speech recognition.

--- (Edited on 10/14/2021 1:27 am [GMT-0500] by sfssZSDSER32) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 3/11/2010 10:07 pm
Views: 2426
Rating: 6

Interesting article supporting your argument: DEPLOYING GOOG-411: EARLY LESSONS IN DATA, MEASUREMENT, AND TESTING.  From the article:

Interestingly, recognition performance does not increase dramatically with the amount of training data (8% absolute CA [Correct Accept] increase at 10% FA [False Accept] for a factor 64 increase in training size). Part of the reason may be that the training data is well matched to the test set, both phonetically and acoustically (the same users may even appear in both training and testing, in different calls of course, but probably on the same device, and sometimes speaking the same query). Another reason may simply be that we haven’t explored that space much yet

Another interesting factoid is that they use MapReduce for acoustic model training.

--- (Edited on 3/11/2010 11:07 pm [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: kmaclean
Date: 1/19/2010 2:42 pm
Views: 124
Rating: 10

>Voicemails are usually 8kHz audio and Voxforge model is trained on 16kHz, it will not work for telephone recordings.

All audio submitted to VoxForge is downsampled to 8kHz-16bit, which is what is linked to in the Sphinx Acoustic Model tutorial/post I gave william818.
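For anyone wanting to try 8kHz input themselves, here is a minimal pure-Python sketch of halving the sample rate of a mono 16-bit WAV file (for example 16kHz to 8kHz). It is not the tool VoxForge uses; it averages adjacent sample pairs as a very crude low-pass filter before dropping every second sample, and a proper resampler would use a much better filter. The function names are my own.

```python
import array
import sys
import wave


def halve_rate(samples):
    """Average each adjacent pair of samples and keep one value per pair."""
    return [(samples[i] + samples[i + 1]) // 2
            for i in range(0, len(samples) - 1, 2)]


def downsample_wav(src_path, dst_path):
    """Write a copy of a mono 16-bit PCM WAV file at half its sample rate."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        rate = src.getframerate()
        pcm = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    out = array.array("h", halve_rate(pcm))
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate // 2)
        dst.writeframes(out.tobytes())
```

Calling `downsample_wav("in16k.wav", "out8k.wav")` (file names are placeholders) on a 16kHz recording would produce an 8kHz file. An odd trailing sample is simply dropped.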

--- (Edited on 1/19/2010 3:42 pm [GMT-0500] by kmaclean) ---

Re: VoxForge Acoustic Models w/ Sphinx 4
User: nsh
Date: 1/19/2010 5:22 pm
Views: 98
Rating: 7

Sorry, I was talking about the pretrained acoustic model.

--- (Edited on 1/20/2010 02:22 [GMT+0300] by nsh) ---
