Auditory Scene Analysis

When we listen, we effortlessly hear the separate sounds of our environment (voices, musical instruments, cars), each coming from its appropriate direction and having its own characteristic qualities such as pitch and timbre. This simple perceptual experience is the result of the complex brain mechanisms of auditory scene analysis (ASA), described in this entry. ASA addresses the problem of grouping together those frequency components that have originated from the same sound source, thereby separating them from other sounds that happen to be present at the same time. The problem of ASA can be conceptually divided into two parts: deciding which simultaneous frequency components belong together to form a sound source, and then sequentially tracking a particular sound source across time. These mechanisms make it possible for us to function in an acoustically cluttered environment by perceptually separating sounds that are physically mixed together in the sound waves that enter our ears. The limits of our ability to separate different sound sources are exploited in music, and our incomplete understanding of the processes involved limits both the automatic recognition of speech by computer and the effectiveness of hearing aids.

Figure 1 Narrow-Band Spectrograms

Notes: (Top) a male talker saying “One two three four”; (bottom) a female talker saying “Two hundred and six”; and (middle) both utterances mixed together. The darker the spectrogram, the more energy there is in the sound at a particular frequency and time. The horizontal axis is time; the vertical axis is frequency, extending up to 4,000 Hz. The roughly horizontal parallel thin dark lines denote the harmonics of the voice, closely spaced for the low-pitched male voice and more widely spaced for the higher-pitched female voice. The difference in pitch helps the brain perceive the two separate sounds in the mixture.

As an example of our ability to separate simultaneous sounds, Figure 1 shows spectrograms of three sounds. A spectrogram plots as a function of time how much energy is present in a sound at different frequencies; it broadly represents the pattern of activity that sound produces in the inner ear and that is signaled to the brain by the auditory nerve. The upper spectrogram is of a male voice saying “One two three four,” the lower spectrogram is of a female voice saying “Two hundred and six,” and the middle spectrogram is of those two sounds added together. When we listen to this mixture, we clearly hear the two voices as separate yet simultaneous sounds, and we have no difficulty in following a particular voice over time. Yet it is not obvious to the eye which parts of the mixture belong to one voice and which to the other. The brain uses a variety of cues to pull apart these two sounds from the mixture.
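
As a rough illustration of what a spectrogram computes, the Python sketch below cuts a signal into short overlapping frames and measures the energy at each frequency within each frame. It is a minimal sketch, not the method used to produce Figure 1: the sample rate, frame length, window, and the two pure tones (500 Hz and 1,500 Hz) standing in for two sound sources are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (illustrative choice)
FRAME_LEN = 64      # samples per analysis frame (illustrative choice)


def tone(freq_hz, n_samples):
    """Pure tone of the given frequency, sampled at SAMPLE_RATE."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def spectrogram(samples, frame_len=FRAME_LEN, hop=FRAME_LEN // 2):
    """Return energy[frame][frequency_bin] from a Hann-windowed naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(frame_len // 2 + 1):  # bins from 0 Hz to SAMPLE_RATE / 2
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            energies.append(abs(acc) ** 2)
        frames.append(energies)
    return frames


def strongest_bins(energies, count):
    """Indices of the `count` highest-energy bins, in ascending order."""
    ranked = sorted(range(len(energies)), key=lambda k: energies[k], reverse=True)
    return sorted(ranked[:count])


# Two stand-in "sources" added together, as the two voices are in the mixture.
low = tone(500, 256)
high = tone(1500, 256)
mixture = [a + b for a, b in zip(low, high)]

frames = spectrogram(mixture)
bin_width_hz = SAMPLE_RATE / FRAME_LEN  # 125 Hz per bin
print([k * bin_width_hz for k in strongest_bins(frames[0], 2)])  # [500.0, 1500.0]
```

Each row of `frames` corresponds to one moment in time and each column to one frequency band, which is the grid that a spectrogram displays as light and dark. The mixture carries energy at both tone frequencies at once, and nothing in the grid itself marks which energy came from which source.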

Simultaneous Grouping: Harmonic Structure and Onset Time

One of the cues that is useful for both speech and music is harmonic structure. Any sound apart from a pure tone (resembling a whistle) consists of many different frequencies. For a sound with a distinct pitch, such as a musical note or a sung or spoken vowel, these frequencies are all whole-number multiples (harmonics) of the fundamental frequency. For example, when the oboe plays the A above middle C to tune up an orchestra, its sound contains frequencies that are all multiples of 440 Hz; that is, it has a harmonic spectrum. If the oboe played the A an octave lower, the fundamental would be 220 Hz and the harmonics would all be multiples of 220 Hz. The harmonic structure of the voiced parts of male and female speech is visible in the spectrograms of Figure 1, where the harmonics appear as groups of roughly horizontal parallel lines. For the male voice, these are closely spaced, reflecting its low pitch; for the higher-pitched female voice, they are more widely spaced. The brain is adept at separating mixtures of differently spaced harmonics, as for example at the beginning of the mixed spectrogram, where the low-pitched male voice and the high-pitched female voice are sounding simultaneously. If two voices are speaking or two instruments are playing on the same pitch rather than on different pitches, it is much harder to separate them because their harmonic frequencies coincide.
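
The arithmetic behind harmonic spacing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 4,000 Hz ceiling simply matches the frequency range of Figure 1, and the 100 Hz and 190 Hz fundamentals are made-up stand-ins for a low and a higher voice pitch, not values measured from the figure.

```python
def harmonics(fundamental_hz, max_hz=4000):
    """Whole-number multiples of the fundamental, up to max_hz."""
    count = max_hz // fundamental_hz
    return [fundamental_hz * n for n in range(1, count + 1)]


def shared_harmonics(f0_a, f0_b, max_hz=4000):
    """Frequencies at which the harmonics of two sources coincide."""
    return sorted(set(harmonics(f0_a, max_hz)) & set(harmonics(f0_b, max_hz)))


# Oboe A above middle C, and the A an octave lower.
print(harmonics(440)[:4])  # [440, 880, 1320, 1760]
print(harmonics(220)[:4])  # [220, 440, 660, 880]

# Hypothetical low and higher voice pitches: few harmonics coincide.
print(shared_harmonics(100, 190))  # [1900, 3800]

# Same pitch: every harmonic coincides, so spacing offers no cue.
print(shared_harmonics(220, 220) == harmonics(220))  # True
```

The lower fundamental gives more closely spaced harmonics within the same frequency range, and two sources on different pitches overlap at only a few frequencies, whereas two sources on the same pitch overlap at all of them.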

...
