General Discussion

audio-visual problem
User: pooya
Date: 10/1/2012 1:47 pm

Dear all,

I am working on a project to predict lip kinematics from features extracted from sound. I first tried Matlab and extracted MFCC features with auditory, but the results were not accurate, so I have now moved to HTK.

I have an audio file (16000 Hz, mono) and the corresponding video frames (25 frames per second). I have two main questions.

1. How can I extract the sound features that correspond to each video frame? I have 25 frames and 16000 samples per second, so how do I extract features for each frame?
2. I have extracted 9 vectors from each image frame (each with X and Y values). How can I train an HMM for these vector points? For instance, one of my pivot points (vectors) has 4 values:

(x,y)  => (120,145), (120,150), (125,145), (125,150)

Should I create one HMM for each pivot point, or a single HMM for all 9 vectors? Please also guide me on how to do this with HTK, because I am a complete beginner with this software. So far I have reviewed the HTK documentation and tutorials, but all of them focus on phonetic transcription of the training data. My case is quite different, and I do not need phonetic transcription.
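To make the numbers in the first question concrete, here is a rough sketch of the alignment arithmetic, not a tested solution. At 16000 samples per second and 25 frames per second, each video frame spans 640 audio samples (40 ms). HTK expresses times in units of 100 ns, so a 40 ms frame period corresponds to 400000 such units. The function names, the placeholder audio, and the idea of flattening the (x, y) points into one observation vector are assumptions made for illustration only; flattening is just one possible way to feed the points to a single model.

```python
SAMPLE_RATE = 16000  # audio samples per second (from the post)
VIDEO_FPS = 25       # video frames per second (from the post)


def samples_per_video_frame(sample_rate, fps):
    """Number of audio samples spanned by one video frame."""
    if sample_rate % fps != 0:
        raise ValueError("sample rate is not an integer multiple of the frame rate")
    return sample_rate // fps


def htk_time_units(fps):
    """Frame period expressed in units of 100 ns (the time unit HTK uses)."""
    return 10_000_000 / fps


def audio_window(samples, frame_index, hop):
    """Slice of audio samples that lines up with video frame `frame_index`."""
    start = frame_index * hop
    return samples[start:start + hop]


def stack_points(points):
    """Flatten [(x, y), ...] into one vector [x1, y1, x2, y2, ...]."""
    return [coord for point in points for coord in point]


hop = samples_per_video_frame(SAMPLE_RATE, VIDEO_FPS)

# Placeholder for one second of real audio samples (hypothetical data).
audio = list(range(SAMPLE_RATE))
window = audio_window(audio, 3, hop)  # audio belonging to the 4th video frame

# The four (x, y) values quoted in the post, flattened into one vector.
stacked = stack_points([(120, 145), (120, 150), (125, 145), (125, 150)])

print(hop, htk_time_units(VIDEO_FPS), len(window), stacked)
```

With this pairing, every video frame has exactly one audio window, so a feature vector computed once per 40 ms would stay synchronous with the visual points.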

I would be grateful if you could point me in the right direction. I have struggled a lot without making any progress, so please help. I can send my data if you need it. For reference, here is a related paper:
Audio Visual Speech Recognition in Noisy Visual Environment

Best regards,
Pooya

--- (Edited on 10/1/2012 1:47 pm [GMT-0500] by pooya) ---
