The primary form of human communication is based on a speaker producing a series of words and a listener understanding the speaker's message. Speech perception is the process that allows the listener to decode the complex acoustic signal that the speaker has produced, ultimately resulting in the listener (usually) understanding what the speaker intended. A full description of speech perception begins with the signal (i.e., what the speaker has produced) and involves both perceptual and cognitive processes. The signal can be considered at many levels: It is a sound, it contains vowels and consonants, it is made up of words, and the words are syntactically arranged to convey the desired semantic content. Speech perception research has examined all these levels.

The Speech Signal

The way the speech signal is produced is well understood. The standard theory of speech production is called source-filter theory. The idea is that there are certain sources of sound within the vocal tract, and that these sources are then filtered by the changing shape of the vocal tract. The most important source of sound for speech is called “voicing,” a kind of buzzing sound that is produced when air from the lungs is forced upward through the vocal cords. There are other sources of sound as well, such as the noise produced when air slips through a narrow opening (e.g., the sound of /s/ or of /f/).

The filtering of these sources is due to a physical property called resonance: Each physical object resonates at particular frequencies that depend on its size, shape, and material. As the tongue, lips, and jaw move, the shape of the air spaces within the mouth changes, producing different resonant properties.
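
The source-filter idea can be illustrated with a minimal sketch. It is not a model of a real vocal tract: it assumes the voicing source can be approximated by a train of impulses and the filter by a single two-pole digital resonator. The function names and all parameter values (a 100 Hz source, a 500 Hz resonance, an 8000 Hz sampling rate) are illustrative assumptions, not taken from the text.

```python
import cmath
import math


def impulse_train(f0, fs, n_samples):
    """Crude stand-in for voicing: one unit pulse per pitch period."""
    period = round(fs / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq, bandwidth, fs):
    """Two-pole filter that emphasizes energy near `freq` (in Hz)."""
    r = math.exp(-math.pi * bandwidth / fs)   # pole radius set by bandwidth
    theta = 2 * math.pi * freq / fs           # pole angle set by resonance
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out = []
    y1 = y2 = 0.0                             # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def magnitude_at(signal, freq, fs):
    """Magnitude of the signal's spectrum at a single frequency (in Hz)."""
    return abs(sum(x * cmath.exp(-2j * math.pi * freq * n / fs)
                   for n, x in enumerate(signal)))


FS = 8000                                     # assumed sampling rate in Hz
source = impulse_train(100, FS, 800)          # assumed 100 Hz "buzz"
filtered = resonator(source, 500, 80, FS)     # assumed resonance at 500 Hz

for f in (500, 1500):
    print(f, magnitude_at(source, f, FS), magnitude_at(filtered, f, FS))
```

In this sketch the source carries equal energy at 500 Hz and 1500 Hz, while the filtered signal is much stronger near the resonance. Changing the resonance frequency plays the role of changing the shape of the air spaces in the mouth.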

Speech can be thought of as alternating between relatively open positions of the mouth and relatively closed positions; the more open positions correspond to vowels, and the more closed positions correspond to consonants. This is the signal that the speech perception process must decode: patterns of energy that reflect the articulation patterns for each vowel and consonant in a given language. Linguists have characterized thousands of human languages in terms of the vowels and consonants that each uses. Across all human languages, about 100 different such sounds, called phonemes, have been identified. Each language uses a subset of these, with some variation in the number across languages. English is a fairly typical language from this perspective, with about 42 different phonemes.

It has proven useful to think about each phoneme as being made up of a set of phonetic features. For example, it is possible to characterize each of the consonants in English in terms of three features: voicing, place of articulation, and manner of articulation. Voicing specifies whether or not the vocal cords are active during the consonant's production: they are active when producing a sound like /z/ but not when producing /s/. The place of articulation is based on where in the vocal tract the airflow is most constricted. For example, the air is completely stopped by the lips when saying /b/, whereas the restriction is in the middle of the mouth when saying /d/. Manner of articulation refers to how the airflow is restricted: For sounds like /b/ and /d/ (called stops) the air is fully stopped momentarily, whereas for sounds like /s/ and /z/ the air is only mostly restricted, with a bit slipping through; the noisy sound of the air escaping is called frication, and the manner is fricative.
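
The three-feature description can be written down as a small lookup table. The sketch below is illustrative only: it covers the consonants used as examples above plus /p/ and /t/, which are added here as the voiceless counterparts of /b/ and /d/. The place labels "bilabial" (at the lips) and "alveolar" (behind the upper teeth) are standard phonetic terms that the text does not use, and the names `Consonant`, `CONSONANTS`, and `differing_features` are hypothetical.

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    voiced: bool   # are the vocal cords active?
    place: str     # where the airflow is most constricted
    manner: str    # how the airflow is restricted


# Small illustrative subset of English consonants, not a full inventory.
CONSONANTS = {
    "b": Consonant(True, "bilabial", "stop"),
    "p": Consonant(False, "bilabial", "stop"),
    "d": Consonant(True, "alveolar", "stop"),
    "t": Consonant(False, "alveolar", "stop"),
    "z": Consonant(True, "alveolar", "fricative"),
    "s": Consonant(False, "alveolar", "fricative"),
}


def differing_features(a, b):
    """Return the names of the features on which two consonants differ."""
    fa, fb = CONSONANTS[a], CONSONANTS[b]
    return [name for name in Consonant._fields
            if getattr(fa, name) != getattr(fb, name)]


print(differing_features("s", "z"))   # the pair differs only in voicing
print(differing_features("b", "d"))   # the pair differs only in place
print(differing_features("d", "z"))   # the pair differs only in manner
```

Representing phonemes this way makes it easy to see that pairs such as /s/ and /z/ are separated by a single feature, whereas /b/ and /s/ differ on all three.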

...
