Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in that segment.
This is in contrast to speech recognition, where the objective is to take an audio speech segment and generate its text transcription.
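To make the idea concrete, here is a minimal sketch of how an alignment can be found once the word sequence is already known. It is an illustration only, not the method of any particular engine: real aligners work on phone-level HMM states and score frames with trained acoustic models. The function name `forced_align`, the per-frame scores in `demo_scores`, and the 10 ms frame shift are all assumptions made for this example.

```python
import math

def forced_align(log_likes, n_units):
    """Viterbi alignment of frames to a known, ordered sequence of units.

    log_likes[t][u] is the (assumed, precomputed) log-likelihood of frame t
    under unit u. Every unit must cover at least one frame, in order.
    Returns a list of (start_frame, end_frame_exclusive), one per unit.
    """
    n_frames = len(log_likes)
    if n_units < 1 or n_frames < n_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = -math.inf
    # best[t][u]: score of the best path that ends in unit u at frame t
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    # advanced[t][u]: True if that best path entered unit u exactly at frame t
    advanced = [[False] * n_units for _ in range(n_frames)]
    best[0][0] = log_likes[0][0]  # the path must start in the first unit

    for t in range(1, n_frames):
        for u in range(n_units):
            stay = best[t - 1][u]
            move = best[t - 1][u - 1] if u > 0 else neg_inf
            if move > stay:
                best[t][u] = move + log_likes[t][u]
                advanced[t][u] = True
            else:
                best[t][u] = stay + log_likes[t][u]

    # Backtrace from the last frame of the last unit to recover boundaries.
    starts = [0] * n_units
    ends = [0] * n_units
    u = n_units - 1
    ends[u] = n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][u]:
            starts[u] = t
            u -= 1
            ends[u] = t
    return list(zip(starts, ends))

# Made-up scores: 6 frames, 3 words. Frames 0-1 favour word 0,
# frames 2-4 favour word 1, frame 5 favours word 2.
hi, lo = math.log(0.8), math.log(0.1)
demo_scores = [
    [hi, lo, lo],
    [hi, lo, lo],
    [lo, hi, lo],
    [lo, hi, lo],
    [lo, hi, lo],
    [lo, lo, hi],
]
segments = forced_align(demo_scores, 3)

frame_shift = 0.01  # seconds per frame; an assumed, commonly used value
times = [(s * frame_shift, e * frame_shift) for s, e in segments]
```

Because the transcription fixes the order of the words, the search only has to decide where each boundary falls. That is what distinguishes it from recognition, where the word sequence itself is unknown.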
For more info, see:
Speech Recognition Engines that can perform forced alignment (from command line or script):
I have a question: how does Viterbi segment an unknown utterance at the recognition step, when the given utterance is only a sequence of feature vectors?