Dialog Managers

New to voice to text, advice sought
User: bil-lancaster
Date: 6/8/2019 9:58 am
Views: 5026
Rating: 0

I spend some time editing audio files that contain a mixture of voice and music (mostly non-vocal).

Is it possible to 'analyse' the source file to determine where the speech parts are?

I'm using Kubuntu 18.04 and would be looking for a solution to run from the command line.

I hope this isn't a daft idea!  Looking forward to any thoughts.


--- (Edited on 6/8/2019 9:58 am [GMT-0500] by ) ---

Re: New to voice to text, advice sought
User: colbec
Date: 7/9/2019 12:47 am
Views: 2706
Rating: 1

Sure, it is relatively straightforward. Consider the various instruments of the orchestra. Each has a distinct mix of frequencies in the sound it is capable of producing. So we can distinguish a piano from a violin by examining the frequencies produced. Now consider that the human voice is just like an instrument with its own mix of frequencies, plus other characteristics like being a bit more staccato than a sustained instrument sound.

Mathematics in the form of a fast Fourier transform (FFT) can decompose a sound into its component frequencies. We can also use a windowing technique to divide the stream into short chunks, typically a few tens of milliseconds each, and analyse each chunk separately.
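As a rough illustration of the windowing and frequency decomposition just described, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for the example: the signal is a synthetic 1 kHz tone rather than a real recording, the sample rate is 8000 Hz, the window is 256 samples with 50% overlap, and the function names are made up. A real file would first have to be decoded into a list of samples.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def frames(samples, size=256, hop=128):
    """Yield overlapping chunks of `size` samples, advancing by `hop` each time."""
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]

def magnitude_spectrum(frame):
    """Apply a Hann window to one chunk and return the magnitudes of the
    non-negative frequency bins (0 Hz up to half the sample rate)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    return [abs(c) for c in fft(windowed)[: n // 2 + 1]]

# Assumption: a synthetic one-second 1 kHz tone stands in for real audio.
RATE = 8000   # samples per second (assumed)
SIZE = 256    # window length in samples, i.e. 32 ms at this rate
tone = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(RATE)]

spectra = [magnitude_spectrum(f) for f in frames(tone, size=SIZE)]

# The strongest bin of the first chunk, converted back to a frequency in Hz.
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
peak_hz = peak_bin * RATE / SIZE
print(len(spectra), "chunks; strongest frequency in first chunk:", peak_hz, "Hz")
```

Each entry in `spectra` is the frequency mix of one chunk, and chunk number `i` starts at `i * 128 / RATE` seconds, which is how a decision about a chunk can be mapped back to a position in the file.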

All that remains is to train a neural net (NN) model so that each window chunk of the stream fed into it produces a likelihood that the chunk is voice rather than not voice. The more labelled examples you train the NN model on, the more accurate the marks showing where the voices are will be.
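To make the "chunk in, likelihood out" idea concrete, here is a deliberately tiny sketch. It is not a full neural net: it trains a single logistic unit (the smallest building block of one) by gradient descent. The two input features and all of the training data are invented for the example; a real model would be trained on features taken from labelled recordings, such as the per-chunk spectra.

```python
import math
import random

def sigmoid(z):
    """Squash a score into the range 0..1 (clamped to avoid overflow)."""
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def voice_likelihood(x, weights, bias):
    """Likelihood (0..1) that the chunk described by features `x` is voice."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def train(examples, epochs=200, lr=0.5):
    """Fit one logistic unit by stochastic gradient descent on log loss.
    `examples` is a list of (features, label) pairs with label 1 = voice."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = voice_likelihood(x, weights, bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Assumption: synthetic toy data. Each chunk is described by two made-up
# features in 0..1 (hypothetically: share of energy in a speech band, and
# how "staccato" the chunk is). Voice chunks cluster high, music chunks low.
random.seed(0)

def toy_chunk(centre):
    return [min(1.0, max(0.0, random.gauss(c, 0.05))) for c in centre]

examples = [(toy_chunk((0.8, 0.7)), 1) for _ in range(50)]
examples += [(toy_chunk((0.3, 0.2)), 0) for _ in range(50)]
random.shuffle(examples)

weights, bias = train(examples)

p_voice = voice_likelihood([0.8, 0.7], weights, bias)
p_music = voice_likelihood([0.3, 0.2], weights, bias)
print("voice-like chunk:", round(p_voice, 3))
print("music-like chunk:", round(p_music, 3))
```

Running the trained model over every chunk in order and noting where the likelihood crosses some threshold (0.5, say) would give the start and end marks of the speech parts.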

--- (Edited on 2019-07-09 1:47 am [GMT-0400] by colbec) ---