Speech Recognition Engines

recognition engine wishlist
User: trevarthan
Date: 4/3/2007 1:22 pm

As I've been researching speech recognition and toying with the available open source engines (mainly CMU Sphinx2 and Julius/Julian), I've been making a mental list of ways to improve the engines themselves.

DISCLAIMER: I've never written a recognition engine personally, so I'm hardly an expert on the subject. All I have to draw on is nature and existing documents detailing how speech recognition works.

Here are a few things I've thought up, none of which are particularly novel, and none of which I have the slightest idea how to translate into code. I'm a programmer, but I don't deal with audio much, so this is new territory for me.


1.) Codec agnostic

    The human ear/brain doesn't care whether speech is encoded as 8-bit 8 kHz or 32-bit 60 kHz. Neither should our speech recognition engines. I know why they currently care, and I don't have a clue how to make them not care (some sort of abstraction/inference system, most likely), but I think it needs to happen. I also don't think training an engine separately for each individual codec is an efficient way to do it.
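
One modest way to approximate this today is a front end that converts whatever arrives into a single canonical format before the engine sees it. The sketch below is only an illustration of that idea, not something Sphinx2 or Julius is known to do: the function names are made up, the input is assumed to be signed integer PCM, the 16 kHz target is an arbitrary choice, and linear interpolation is a crude stand-in for a proper resampler with an anti-aliasing filter.

```python
def pcm_to_float(samples, bits):
    """Scale signed integer PCM samples to floats in [-1.0, 1.0)."""
    full_scale = float(2 ** (bits - 1))
    return [s / full_scale for s in samples]


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (crude: no anti-aliasing filter)."""
    if not samples:
        return []
    if src_rate == dst_rate:
        return list(samples)
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_canonical(samples, bits, src_rate, dst_rate=16000):
    """Hypothetical front end: any bit depth and rate in, one format out."""
    return resample_linear(pcm_to_float(samples, bits), src_rate, dst_rate)
```

With something like this in the pipeline, an 8-bit 8 kHz recording and a 16-bit 44.1 kHz recording both reach the engine as floats at one rate. It does not restore the bandwidth a low-rate recording never captured, so it only addresses the format half of the problem.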

2.) Background noise filtering

    We need some way to deal with background noise. It could be a simple algorithmic filter or something more intelligent. I'm not even sure if it belongs in the engine itself (probably not). But it needs to be somewhere in the pipeline or the technology becomes much less useful.

3.) Multiple speaker identification and separation.

    Again, I'm not sure if this belongs in the recognition engine itself or if it should be a separate filter that runs before the engine to clean up the input data, but I think it would be excellent to have. In particular, it could be used to identify the speaker and "lock on" to his voice patterns, filtering out everything in the background. This is an advanced form of #2 above.
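
To make the "lock on" idea concrete, here is a toy sketch of it as a pre-filter. It assumes each audio frame has already been turned into a feature vector by some earlier step (not shown), builds a profile from frames known to belong to the target speaker, and keeps only later frames whose cosine similarity to that profile clears a threshold. The function names and the 0.9 threshold are invented for illustration; real speaker identification relies on trained statistical models rather than a single averaged vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(frames):
    """Average the target speaker's feature vectors into one profile."""
    if not frames:
        raise ValueError("need at least one enrollment frame")
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def lock_on(frames, profile, threshold=0.9):
    """Keep only frames that resemble the enrolled speaker's profile."""
    return [f for f in frames if cosine_similarity(f, profile) >= threshold]
```

Frames that fail the test would be dropped (or attenuated) before the recognizer runs, which is why this could live outside the engine as a separate filter.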


Please feel free to comment on the above and/or offer your own improvements. Who knows, maybe one of us will end up coding some of this functionality in the future.



--- (Edited on 4/ 3/2007 1:22 pm [GMT-0500] by trevarthan) ---

Re: recognition engine wishlist
User: kmaclean
Date: 4/3/2007 7:50 pm

Hi trevarthan,

Thanks for the post.

>2.) Background noise filtering

Julius/Julian can use 'spectral subtraction' for removing noise - see the '-ssload filename' parameter. Many microphones have noise reduction built in. The one codec developer I spoke to about noise reduction stated that all noise reduction leaves artifacts (which he called 'musical noise') in the processed audio.
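
For anyone curious what spectral subtraction does in principle, here is a minimal sketch of the general technique. It is not Julius's implementation, and every name in it is made up for illustration. It estimates an average noise magnitude spectrum from frames assumed to contain only noise, subtracts that from each frame's magnitude spectrum, and resynthesizes using the original phase. The naive DFT is only practical for short frames, and the 0.01 floor is an arbitrary choice.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def noise_profile(noise_frames):
    """Average magnitude spectrum over frames assumed to be noise only."""
    spectra = [[abs(c) for c in dft(f)] for f in noise_frames]
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def spectral_subtract(frame, noise_mag, floor=0.01):
    """Subtract the noise magnitude per bin, keeping the noisy phase."""
    cleaned = []
    for c, n_mag in zip(dft(frame), noise_mag):
        mag = abs(c)
        # Never let a bin drop below a small fraction of its original level.
        new_mag = max(mag - n_mag, floor * mag)
        cleaned.append(cmath.rect(new_mag, cmath.phase(c)))
    return idft(cleaned)
```

Because the noise estimate is only an average, individual bins get over- or under-subtracted from frame to frame, and the leftover isolated peaks are the usual explanation for the 'musical noise' artifact.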

>3.) Multiple speaker identification and separation.

You might be interested in a project by Ronan Crowley and Paul Connolly called Speaker Verification Implemented Security - they use HTK for speaker recognition.


--- (Edited on 4/ 3/2007 8:50 pm [GMT-0400] by kmaclean) ---