Machine Speech Recognition

Speech is the most natural means of human communication. Machine speech recognition, often called automatic speech recognition, is the automatic process by which a machine or computer transforms a speech utterance into text consisting of a string of words. The term machine distinguishes machine speech recognition from human speech recognition (human speech perception). Machine speech recognition also differs from machine speaker recognition, which is the automatic process of identifying a speaker, or verifying a speaker's identity, from his or her voice. After a brief overview of the general steps involved in machine speech recognition, this entry introduces two of the most prominent machine speech recognition techniques.

There are different techniques for machine speech recognition, but generally speaking, they share a common series of steps. First, a microphone converts the acoustic pressure waves of the speech utterance into an electrical signal. This signal is then transformed into a string of feature vectors, usually called acoustic feature vectors, which represent the spectrum and energy of the speech signal over short time intervals. Next, the extracted string of acoustic feature vectors is matched against previously stored models of sentences, words, syllables, or phonemes. The string of words that best matches the incoming string of acoustic feature vectors is presented at the output of the machine speech recognition system. Based on the type of input utterances, machine speech recognition can be classified as isolated word recognition or continuous speech recognition. Based on the generality of the models, it can be classified as speaker dependent or speaker independent. Based on the size of the vocabulary, it can be classified as small vocabulary (up to 100 words), medium vocabulary (up to 1,000 words), or large vocabulary (up to hundreds of thousands of words). Applications of machine speech recognition include voice dialing (e.g., digit recognition), command and control, form filling (e.g., data entry), web search by voice, and dictation (e.g., speech-to-text word processing).
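The feature extraction step described above can be illustrated with a minimal sketch. It is not the method of any particular recognizer: the function names, the 25 ms frame length, the 10 ms hop, the Hamming window, and the choice of log energy plus the log magnitudes of the first 16 discrete Fourier transform bins are all assumptions made for illustration. The entry itself specifies only that each vector represents the spectrum and energy over a short period.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame, n_bins):
    """Return [log energy] followed by log magnitudes of the first n_bins DFT bins."""
    n = len(frame)
    # Hamming window (an assumed choice) reduces leakage at the frame edges.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]
    energy = sum(x * x for x in windowed)
    feats = [math.log(energy + 1e-10)]  # small offset avoids log(0)
    for b in range(n_bins):
        # Direct (slow) DFT of bin b; an FFT would be used in practice.
        coeff = sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(windowed))
        feats.append(math.log(abs(coeff) + 1e-10))
    return feats

def extract_features(samples, frame_len=200, hop=80, n_bins=16):
    """Turn a sampled signal into a string (list) of acoustic feature vectors."""
    return [frame_features(f, n_bins)
            for f in frame_signal(samples, frame_len, hop)]

# Synthetic stand-in for microphone output: 0.1 s of a 440 Hz tone at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(800)]
features = extract_features(tone)
print(len(features), "vectors of dimension", len(features[0]))
```

At the assumed 8 kHz sampling rate, 200 samples correspond to 25 ms and 80 samples to 10 ms, so consecutive frames overlap. Each resulting vector is one element of the string that is later matched against the stored models.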

Dynamic Time Warping

One of the most successful early techniques for machine speech recognition is dynamic time warping (DTW), which combines template matching and dynamic programming. Dynamic programming is a mathematical optimization method that finds the best (optimal) sequence of decisions recursively. A string of acoustic feature vectors corresponding to the input test utterance is matched in turn against each stored reference template of feature vectors corresponding to training utterances. The test string of vectors and the stored string of vectors corresponding to each reference template form a search grid on which DTW finds an optimum path. The test feature vectors are warped nonlinearly in time (compressed or expanded) to align with the feature vectors of the stored templates. A matching score or distance is then computed between the test utterance and each stored reference template. The input test utterance is recognized as the utterance corresponding to the stored reference template that yields the highest score or lowest distance to the test utterance. The first DTW approaches used isolated words to create templates. Later, this technique was extended to connected speech by creating sentence templates made of concatenated word templates.
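The matching procedure described above can be sketched for the isolated-word case as follows. This is an illustrative sketch under stated assumptions, not the formulation of any specific system: the function names, the Euclidean local distance, the three-neighbor step pattern, and the toy one-dimensional templates are all hypothetical choices, since the entry does not specify them.

```python
import math

def euclidean(u, v):
    """Local distance between two feature vectors (assumed: Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(test, template):
    """Accumulated distance of the optimum warping path through the search grid."""
    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j]: lowest accumulated distance aligning test[:i] with template[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], template[j - 1])
            # Dynamic programming recursion: extend the best of the three
            # predecessor cells (assumed step pattern).
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated
                                 cost[i][j - 1],      # template frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(test, templates):
    """Return the label of the stored template with the lowest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))

# Toy reference templates with one-dimensional feature vectors (made-up values).
templates = {
    "yes": [[1.0], [3.0], [4.0], [3.0], [1.0]],
    "no": [[4.0], [2.0], [1.0], [2.0], [4.0]],
}
# A slower rendition of "yes": same shape, stretched in time.
test = [[1.0], [1.0], [3.0], [3.0], [4.0], [4.0], [3.0], [1.0]]
print(recognize(test, templates))  # prints: yes
```

Because the warping path may repeat frames of either sequence, the stretched test utterance aligns with the "yes" template at zero accumulated distance even though the two differ in length, which is the nonlinear time alignment the technique is named for.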

...
