Computer-Generated Speech, Perception of

The ability to generate synthetic speech under computer control provides an important tool for the study of speech perception as well as an important technology for interacting with computers. We generally think of speech as a uniquely human ability, even if qualified by the vocal mimicry abilities of other species such as parrots and mynah birds, but computer models of speech production make it possible to generate speech signals with specific acoustic characteristics. Speech is an extremely complex acoustic signal that mixes periodic and aperiodic sounds and consists of patterns of frequency change over time, bursts of noise, silent gaps, and brief steady-state patterns. Understanding how listeners quickly and effectively understand this complex signal is the primary goal of speech perception research, and synthetic speech allows systematic control of these acoustic properties. This entry describes the scientific importance of synthetic speech, models of speech production, text-to-speech synthesis, resynthesis of speech, and applications and limitations of computer-generated speech.

Scientific Importance of Synthetic Speech

Although psychoacoustic research defines and manipulates the acoustic properties of stimuli exactly using formal mathematical descriptions, this has not generally been possible for speech research. No simple mathematical description of speech can be used to characterize the sound patterns that affect listeners' perception of speech. Phonetic research seeks to identify the acoustic patterns, or the movements and positions of parts of the speaker's mouth, that determine perception of consonant and vowel sounds (i.e., phonemes). The development of the speech spectrograph allowed researchers to measure how the sound patterns of speech change over time. On a spectrogram (printed by a spectrograph), the x-axis shows time as an utterance unfolds, the y-axis shows frequency, and the amount of energy at each point in time and frequency is shown by the darkness of that point. Thus, dark and light visual patterns depict the different frequencies and noises in the acoustic patterns of speech: Rising pitches are displayed as lines that slant upward, and falling pitches as lines that slant downward. The development of the Pattern Playback Machine at Haskins Laboratories provided a way of turning those visual patterns into acoustic speech sounds. Different patterns painted onto acetate were converted into speech containing only those acoustic properties hypothesized to change perception of one consonant or vowel into another. By making slight variations to those visual patterns, researchers could generate small, systematic changes in speech signals with a precision that would be difficult or even impossible for a human talker to achieve.
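The correspondence between the spectrogram's axes and the underlying signal can be made concrete in a few lines of code. The following is a minimal sketch, assuming NumPy, SciPy, and Matplotlib are available; it uses a rising tone as a stand-in for speech, which appears on the spectrogram as an upward-slanting line exactly as described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import chirp, spectrogram

# Stand-in "speech": a tone gliding upward from 200 Hz to 800 Hz.
fs = 16000                                  # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
signal = chirp(t, f0=200, t1=1.0, f1=800, method="linear")

# Compute the spectrogram: rows are frequencies, columns are time
# frames, and each value is the energy at that time-frequency point.
f, times, Sxx = spectrogram(signal, fs=fs, nperseg=512)

# Plot in grayscale so more energy appears darker, as on paper output.
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r")
plt.xlabel("Time (s)")        # x-axis: time as the utterance unfolds
plt.ylabel("Frequency (Hz)")  # y-axis: frequency; darkness = energy
plt.show()
```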

Models of Speech Production

Speech synthesizers are the modern computational version of the Pattern Playback. Instead of drawing acoustic patterns in visual form, speech synthesizers take as input a description of the speech signal—for example, as numerical descriptions of acoustic properties or moment-by-moment physical positions of speech articulators such as the tongue. These descriptions are then input to a computational model of speech production that generates the actual acoustic output. Some synthesizers model the movements of the mouth whereas others model the acoustic properties of speech production through filters and resonators.
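As a rough illustration of the second kind of model, the sketch below implements a minimal source-filter synthesizer, assuming NumPy and SciPy; the function name synthesize_vowel and all parameter values are illustrative assumptions, not the API or settings of any particular synthesizer. A periodic glottal source is passed through a cascade of second-order digital resonators, one per formant.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_vowel(f0, formants, bandwidths, dur=0.5, fs=16000):
    """Minimal cascade formant synthesis (illustrative sketch): an
    impulse-train glottal source filtered through one second-order
    resonator per formant."""
    n = np.arange(int(dur * fs))
    # Source: impulse train approximating periodic glottal pulses.
    output = (n % round(fs / f0) == 0).astype(float)
    # Filter: each resonator adds one spectral peak (a formant).
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)             # pole radius from bandwidth
        theta = 2 * np.pi * f / fs               # pole angle from frequency
        a = [1.0, -2 * r * np.cos(theta), r**2]  # resonator denominator
        output = lfilter([1 - r], a, output)     # crude gain normalization
    return output / np.max(np.abs(output))

# A rough "ah"-like vowel; formant and bandwidth values are assumptions.
vowel = synthesize_vowel(f0=120, formants=[730, 1090, 2440],
                         bandwidths=[60, 110, 160])
```

Each resonator contributes one spectral peak, so the cascade shapes the flat spectrum of the impulse train into a vowel-like spectrum; written to an audio file (e.g., with scipy.io.wavfile.write), the result is buzzy but recognizably vowel-like.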

Despite the differences between the underlying models, however, speech synthesizers are used less often to test those models than to parametrically control the acoustic properties of speech tested in perceptual experiments. Because synthesizers permit precise control of the acoustic pattern properties of speech, synthetic speech has made possible studies of how these properties are used in recognizing spoken consonants, vowels, and words; how this ability develops from infancy; and how human perception of speech differs from nonhuman animals' perception of the same signals. For example, it is possible to make a series of test stimuli varying from one vowel to another (e.g., from EE as in beet to IH as in bit) or from one consonant to another (e.g., from B as in bit to P as in pit). In the case of vowel stimuli, this can be accomplished by varying the duration of the vowel or by changing the frequency of one of the components of the vowel. For consonant differences, it is possible to vary the timing relationships among different acoustic properties, such as a burst of noise and the vowel of the syllable, or to change frequency properties. In this way, small acoustic changes can be produced and classified by listeners (adult humans, infants, chinchillas) to reveal the relationship between acoustic properties and perception.
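To make the idea of a stimulus continuum concrete, the sketch below steps the first formant (F1) in equal increments between two endpoint values, reusing the hypothetical synthesize_vowel function from the earlier sketch; the endpoint frequencies and the fixed higher formants are illustrative assumptions, not measured vowel targets.

```python
import numpy as np

# Hypothetical EE-to-IH continuum: step the first formant (F1) in
# equal increments while holding every other parameter constant.
# Endpoint values are assumptions; synthesize_vowel is the sketch
# defined above, not a library function.
f1_ee, f1_ih = 280, 400          # assumed F1 endpoints (Hz)
n_steps = 7                      # stimuli along the continuum

stimuli = [
    synthesize_vowel(f0=120,
                     formants=[f1, 2200, 2900],   # F2, F3 held fixed
                     bandwidths=[60, 110, 160])
    for f1 in np.linspace(f1_ee, f1_ih, n_steps)
]
# Presenting these stimuli in random order and recording listeners'
# labels reveals where the EE/IH category boundary falls.
```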

...
