Auditory System

Authors: Joe Steinmetz and Glen Lee

In hearing, the human ear's purpose is to convert sound waves into nerve impulses. These impulses are then perceived and interpreted by the brain as sound. The human ear can perceive sounds in the range of 20 to 20,000 Hz. This section is broken down into a basic overview of the ear, a section on how sound is received by the ear, a section on how the ears communicate with the brain, and finally, a section on human factors. Understanding how the ear works is key to successfully implementing 3D sound in VR systems.

Overview of the human ear

The human ear is made up of three distinct areas: the outer ear, the middle ear, and the inner ear. The outer ear channels sound waves through the ear canal to the eardrum. The pinna, the visible part of the outer ear, is a structure whose shaping of incoming sound is considered an auditory cue in head-related transfer function (HRTF) generation. The eardrum is a thin membrane stretched across the inner end of the canal. The acoustic environment created by the ear canal and eardrum is one that 3D sound synthesis strives to simulate. HRTF generation uses small microphones embedded in a real ear canal to help simulate the real acoustic environment. A small microphone in the ear canal can only approximate the real thing: its frequency response is different, its position is different, and the reflection and refraction of sound waves are different. Air pressure changes in the ear canal cause the eardrum to vibrate. These vibrations are transmitted to three small bones called "ossicles". The ossicles are located in the air-filled middle ear and conduct the vibrations across the middle ear to another thin membrane called the oval window.

All of these vibrations are different when a small microphone is embedded in the ear canal, so this method (HRTF measurement) cannot exactly reproduce the acoustic environment of the human ear. The problem is that of making a measurement without affecting what is being measured. The oval window separates the middle ear from the fluid-filled inner ear. The effect that the fluid-filled inner ear has on sound transmission is not modelled at all in HRTF generation. In fact, HRTF generation attempts to measure only the effects of the external structures and the immediate results inside the ear canal. Everything that happens from the middle ear inward is not modelled at all.
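To make the role of these measurements concrete, the sketch below shows one common way a measured response pair can be used: a mono signal is convolved with a left-ear and a right-ear impulse response. This is only an illustration under stated assumptions. The impulse responses are made-up placeholders rather than measured data, and `convolve` and `render_binaural` are hypothetical helper names, not part of any particular 3D sound system.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal through a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Placeholder impulse responses (assumed values, NOT measured HRTF data).
# The right ear receives the sound one sample later and quieter, as it
# might for a source located to the listener's left.
hrir_left = [1.0, 0.3]
hrir_right = [0.0, 0.6, 0.2]

click = [1.0, 0.0, 0.0]  # a single-sample click as the mono source
left, right = render_binaural(click, hrir_left, hrir_right)
print("left ear: ", left)
print("right ear:", right)
```

Because the filters come from microphone measurements, any error introduced by the microphone itself is carried into every sound rendered this way.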

The inner ear houses the "cochlea", a spiral-shaped structure that contains the "organ of Corti", the most important component of hearing. The organ of Corti sits on an extremely sensitive membrane called the "basilar membrane". Whenever the basilar membrane vibrates, small sensory hair cells inside the organ of Corti are bent, which stimulates the sending of nerve impulses to the brain.

The outer ear

Two structures make up the external ear: a flexible, oval-shaped structure ("pinna") attached to the head, and the ear canal ("meatus") that leads to the middle ear. The pinna is made up of fibrous cartilage. Because humans have few muscles to move the pinna, they need to move their heads to help locate the source of a sound (see 3D sound, dynamic modeling).

The transmission of sound through the ear

Sound waves hitting the outer ear are both reflected (scattered) and conducted. Conducted sound waves travel through the ear canal and hit the eardrum, driving it inwards. This portion of the process is measured in HRTF generation by an embedded microphone. Although the remaining process is not currently modelled, it is described here to show how complex the human auditory system is and how much work remains in the 3D sound synthesis world. The inward force causes the malleus and incus to push the stapes deep into the oval window of the inner ear. The surface area of the eardrum is 30 times greater than that of the stapes. This causes the pressure on the oval window to be 30 times greater than the original pressure on the eardrum. This pressure is needed for the stapes to be able to transfer the energy into the "perilymph". The basilar membrane of the perilymph is compressed inward by the movement of the stapes. The compression of the flexible membrane causes the round window to bulge into the middle ear. The organ of Corti pivots in response to the movements of the basilar membrane. The action of the organ of Corti and the tectorial membrane sliding against each other causes the hairs of the hair cells to bend.
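The pressure increase follows from pressure being force divided by area: the same force concentrated on an area 30 times smaller produces 30 times the pressure. The short sketch below works through that arithmetic using the area ratio given above. It is an idealized illustration that assumes the force is passed on unchanged, ignoring any lever action of the ossicles and any losses. The area units are arbitrary, since only the ratio matters, and the decibel figure is an added conversion rather than a value from the sources.

```python
import math


def pressure(force, area):
    """Pressure is force divided by the area it acts on."""
    return force / area


# Arbitrary units: only the 30:1 ratio stated in the text matters.
EARDRUM_AREA = 30.0
STAPES_AREA = 1.0
force = 1.0  # assumed to be transmitted unchanged (idealization)

pressure_at_eardrum = pressure(force, EARDRUM_AREA)
pressure_at_oval_window = pressure(force, STAPES_AREA)

gain = pressure_at_oval_window / pressure_at_eardrum
gain_db = 20.0 * math.log10(gain)  # pressure ratios use 20 * log10

print(f"pressure gain: {gain:.1f}x (about {gain_db:.1f} dB)")
```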

Near the end of the hair cells are the tips of 27,000 nerve fibers. The interconnections of these nerve fibers and hair cells are complex and overlapping. No direct connections exist. Instead, as hairs are bent, nerve impulses are stimulated in the nerve fibers. The exact method of the electrical impulse generation is not known, although some theories exist. The changing stiffness of the basilar membrane from one end of the cochlea to the other creates a kind of mechanical analyzer of sound. High frequency sound causes the narrow basal end of the membrane to vibrate. Medium frequencies cause the membrane in the middle of the cochlea to vibrate. Low frequencies cause the whole membrane to vibrate. The cochlea is thus able to map frequencies onto certain locations on the basilar membrane. The sensation of pitch is a function of the location of the vibration on the basilar membrane.
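The frequency-to-place mapping can be sketched numerically. The example below uses Greenwood's place-frequency function with commonly quoted constants for the human cochlea. Both the function and the constants are assumptions brought in for illustration and do not come from the sources cited in this section. Position `x` is the fractional distance along the basilar membrane, from 0 at the far (apical) end to 1 at the basal end next to the oval window.

```python
import math

# Commonly quoted human constants for Greenwood's function (assumed here).
A = 165.4      # scaling constant, in Hz
ALPHA = 2.1    # slope, for position expressed as a fraction of length
K = 0.88       # low-frequency correction term


def place_to_frequency(x):
    """Frequency (Hz) that best excites fractional position x (0=apex, 1=base)."""
    return A * (10.0 ** (ALPHA * x) - K)


def frequency_to_place(frequency_hz):
    """Inverse mapping: fractional position best excited by a frequency."""
    return math.log10(frequency_hz / A + K) / ALPHA


for frequency in (100.0, 1000.0, 10000.0):
    position = frequency_to_place(frequency)
    print(f"{frequency:8.0f} Hz -> {position:.2f} of the way from apex to base")
```

With these constants the two ends of the membrane land close to the 20 to 20,000 Hz range quoted at the start of this section, and higher frequencies map closer to the basal end, in line with the description above.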

Auditory nerves and the brain

Nerve impulses are transmitted from the ear to the brain via the auditory nerves, one of the several sensory nerves that exist in the group of nerves known as cranial nerves. The auditory nerves carry the nerve impulses of the ears to the upper "temporal lobe" of the "cerebral cortex". Nerve impulses pass over "neurons" via an electro-chemical action. That is, the neuron itself provides the necessary energy to propel the impulse along the nerve. The nerve impulse does not travel as fast as a standard electrical current, but instead moves at about 3.25 to 395 feet/sec. Nerve impulses travel over many neurons on their way to the brain. Neurons work together by transmitting the impulses through the "axons" to the "dendrites" of the next neuron. The axon of one neuron communicates with the dendrites of another by means of the "synapse". This gap-like structure communicates by releasing a chemical transmitter substance.
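To put the quoted speeds in perspective, the sketch below converts them to meters per second and estimates how long an impulse would take to cover a short path. The 10 cm path length is a hypothetical round number chosen only for illustration. It is not an anatomical measurement, and the estimate ignores any delay at the synapses.

```python
FEET_TO_METERS = 0.3048  # exact by definition of the international foot


def feet_per_sec_to_meters_per_sec(speed_ft_s):
    """Convert a speed from feet/sec to meters/sec."""
    return speed_ft_s * FEET_TO_METERS


def travel_time_ms(distance_m, speed_ft_s):
    """Time in milliseconds to cover distance_m at the given speed."""
    speed_m_s = feet_per_sec_to_meters_per_sec(speed_ft_s)
    return distance_m / speed_m_s * 1000.0


SLOWEST_FT_S = 3.25   # lower bound quoted in the text
FASTEST_FT_S = 395.0  # upper bound quoted in the text
PATH_LENGTH_M = 0.10  # hypothetical path length, for illustration only

for speed in (SLOWEST_FT_S, FASTEST_FT_S):
    print(
        f"{speed:6.2f} ft/s = {feet_per_sec_to_meters_per_sec(speed):7.3f} m/s, "
        f"{travel_time_ms(PATH_LENGTH_M, speed):7.2f} ms over {PATH_LENGTH_M} m"
    )
```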

Human factors involving the auditory system

Virtual reality attempts to offer realistic stimuli for all human senses. Given the current state of technology, realistic detail is often traded off against real-time management of the system components. The auditory system is no exception. While the human ear is capable of hearing a multitude of distinct sounds, a listener can only concentrate on one particular sound at a given time.

This physiological constraint has led to studies in "auditory cognition". Auditory cognition analyzes such issues as attending to auditory events, remembering and recognizing sound sources and events, and perceptions of acoustic sequences. The theories behind auditory cognition attempt to explain how the brain processes and/or filters out certain sounds.

These studies are useful in Virtual Reality systems, because they allow the developers to mimic selective sounds in the background while the user attends to a specific important sound. For example, if the virtual environment imitates the city of New York, the sound generators need not individually orchestrate the sounds of taxi horns, subway noise, the rustling of people, and other typical background noise. These may be merged into one sound that is played in the background, since the user may not want to attend to any individual sound in this sound cluster. However, the user may be interested in hearing his or her name being called among the noise. This sound, layered over the background noise, is one to which the user would selectively attend. In this example, the sound of interest (i.e. the name being called) would require the use of 3D sound. However, the background noises need not take advantage of the 3D sound capabilities, since these sounds seem to surround the user, and the user does not consciously attempt to locate the sound source of each individual noise.
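The layering described in this example can be sketched as follows: the background sources are merged into a single non-spatialized bed, and only the sound of interest is positioned. This is a minimal sketch under several assumptions. The sample values are invented, `mix_to_bed`, `pan`, and `render_scene` are hypothetical names, and a simple constant-power stereo pan stands in for true 3D sound rendering, which would normally involve HRTF filtering.

```python
import math


def mix_to_bed(sources):
    """Average several equal-length mono sources into one background bed."""
    count = len(sources)
    return [sum(samples) / count for samples in zip(*sources)]


def pan(mono, azimuth_deg):
    """Constant-power stereo pan from -90 (full left) to +90 (full right).

    A crude stand-in for 3D positioning, used only for illustration.
    """
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    return [s * left_gain for s in mono], [s * right_gain for s in mono]


def render_scene(background_sources, foreground, azimuth_deg, bed_gain=0.5):
    """Play the merged bed equally in both ears and position the foreground."""
    bed = mix_to_bed(background_sources)
    fg_left, fg_right = pan(foreground, azimuth_deg)
    left = [bed_gain * b + f for b, f in zip(bed, fg_left)]
    right = [bed_gain * b + f for b, f in zip(bed, fg_right)]
    return left, right


# Invented four-sample signals standing in for real audio buffers.
taxi_horns = [0.2, -0.2, 0.2, -0.2]
subway = [0.1, 0.1, -0.1, -0.1]
crowd = [0.0, 0.3, 0.0, -0.3]
name_called = [0.0, 1.0, 0.0, 0.0]

left, right = render_scene([taxi_horns, subway, crowd], name_called, azimuth_deg=-90.0)
```

Only one source passes through the positioning step regardless of how many background sounds are merged, which is where the real-time saving comes from.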

This section provides some sample theories on how sounds of interest can be made to be attention-holding in the virtual environment.

Broadbent's Theory

Broadbent's Filter Theory of Attention has been used as the basis for most selective attention models. In his theory, sound information is passed through a number of sensory channels. These "channels" are vaguely defined as having some distinct neural representation in the brain. Their representations may be based on a number of sound attributes, such as pitch, loudness or spatial position characteristics. Broadbent postulates that the sound channels lead into the short-term memory portion of the brain, where a particular channel may then be filtered based on the desired sound attributes. This filter allows only one of the channels to lead to the long-term memory store and any output mechanisms necessary to respond to the input channel.

This theory arose from the cocktail party situation, in which a guest must filter out all distracting sounds to concentrate on one conversation. Several studies have suggested different conditions under which the filter switches to listen to a different channel. For instance, hearing your name in the middle of a conversation suggests that some channels take priority in determining which channel the filter passes. Some experiments have suggested that the switching time of the filter between channels is on the order of 0.25 seconds [Moray, 1970].
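As a rough illustration of the filter idea, the sketch below models channels distinguished by one physical attribute (spatial position) and a filter that passes exactly one of them, charging 0.25 seconds for each switch. It is a toy model under stated assumptions, not an implementation from Broadbent or Moray. `Channel` and `BroadbentFilter` are hypothetical names, and the messages are invented.

```python
from dataclasses import dataclass

SWITCH_TIME_S = 0.25  # order-of-magnitude switching time cited in the text


@dataclass
class Channel:
    name: str
    position: str  # physical attribute used by the filter, e.g. "left"
    message: str


class BroadbentFilter:
    """Passes only the channel whose attribute matches the attended one."""

    def __init__(self, attended_position):
        self.attended_position = attended_position
        self.switching_time_spent_s = 0.0

    def switch_to(self, position):
        """Attend to a different channel, paying the switching cost."""
        if position != self.attended_position:
            self.attended_position = position
            self.switching_time_spent_s += SWITCH_TIME_S

    def select(self, channels):
        """Return the attended channel's message; all others are blocked."""
        for channel in channels:
            if channel.position == self.attended_position:
                return channel.message
        return None


channels = [
    Channel("conversation", "left", "the report is due Friday"),
    Channel("stranger", "right", "someone is calling your name"),
]

listener = BroadbentFilter(attended_position="left")
heard_first = listener.select(channels)
listener.switch_to("right")
heard_second = listener.select(channels)
```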

Treisman's Theory

Treisman's theory modifies Broadbent's theory by proposing that the input selections bypass the short-term memory area of the brain to arrive immediately at a filter which is sensitive to a sound's physical characteristics. This filter eliminates most unattended sound channels, but allows a subset of channels to enter a series of nodes in the brain, known as "dictionary units". This network of dictionary units becomes a pattern matcher, where similar-sounding stimuli trigger signals to the listener's output activity mechanism in the brain.
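One way to picture this is as two stages: a physical filter that weakens unattended channels rather than blocking them outright, followed by dictionary units that respond when a word arrives with enough strength. The sketch below is a toy model under assumptions not stated in the text. The attenuation factor, the per-word thresholds, and the names `physical_filter` and `dictionary_units` are all hypothetical.

```python
ATTENUATION = 0.3  # assumed strength remaining on an unattended channel

# Hypothetical thresholds: a low value means the unit triggers easily.
DICTIONARY_THRESHOLDS = {"alice": 0.2, "taxi": 0.8, "report": 0.5}


def physical_filter(channels, attended):
    """Weight each word by whether its channel is the attended one.

    channels maps a channel name to the list of words it carries.
    """
    weighted = []
    for name, words in channels.items():
        strength = 1.0 if name == attended else ATTENUATION
        weighted.extend((word, strength) for word in words)
    return weighted


def dictionary_units(weighted_words, thresholds):
    """Return the words whose strength reaches their unit's threshold."""
    # Words without a known unit default to the highest threshold.
    return [w for w, s in weighted_words if s >= thresholds.get(w, 1.0)]


channels = {
    "conversation": ["report"],
    "background": ["taxi", "alice"],
}

weighted = physical_filter(channels, attended="conversation")
recognized = dictionary_units(weighted, DICTIONARY_THRESHOLDS)
print(recognized)
```

In this toy setup the listener's own name on the unattended channel still gets through, while the taxi on the same channel does not.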

Deutsch and Deutsch's Theory

Deutsch and Deutsch's Response Selection Theory of Selective Attention alters Treisman's theory by omitting the initial filter for physical characteristics. Within the dictionary network, each signal is analyzed and recognized for its importance. The importance of the signal fires a proportional signal to the brain's output activity mechanism. Hence, the sound that captures the listener's attention is the sound that bears the heaviest signal. Applied to hearing your name during a conversation, this theory says that you tune in to your name because it carries high importance.
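Under this theory there is no early filter, so selection reduces to comparing importance across all recognized signals. The sketch below illustrates that with invented importance weights. `IMPORTANCE` and `select_response` are hypothetical names, and a real system would need some principled way to assign the weights.

```python
# Hypothetical importance weights for recognized signals (0 to 1).
IMPORTANCE = {"alice": 0.9, "report": 0.6, "taxi": 0.2}


def select_response(signals, importance):
    """Every signal is analyzed; the most important one drives the response."""
    return max(signals, key=lambda signal: importance.get(signal, 0.0))


# All three signals reach the dictionary network with no prior filtering.
signals = ["report", "taxi", "alice"]
attended = select_response(signals, IMPORTANCE)
print(attended)
```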

Coursework for pursuing research in the auditory system

The following courses are suggestions for pursuing research in the auditory system: Linguistics, Brain and Human Communications, Anatomy and Physiology of Speech Mechanisms, Phonetic Science, Vocal Diction, Physics of Music, and Vibrations and Waves.


[Borden, 1980] Borden, Gloria J. and Harris, Katherine S., Speech Science Primer: Physiology, Acoustics, and Perception of Speech, Williams and Wilkins, Baltimore, 1980.

[Jarvis, 1978] Jarvis, J.F., The Anatomy and Physiology of Speech and Hearing, Juta and Company Limited, Cape Town, 1978.

[McAdams, 1993] McAdams, Stephen and Bigand, Emmanuel, Thinking in Sound: The Cognitive Psychology of Human Audition, Clarendon Press, Oxford, 1993.

[Moray, 1970] Moray, Neville, Attention: Selective Processes in Vision and Hearing, Academic Press, New York, 1970.

[Underwood, 1976] Underwood, Geoffrey, Attention and Memory, Pergamon Press, New York, 1976.

[Yost, 1985] Yost, William A. and Nielsen, Donald W., Fundamentals of Hearing, Holt, Rinehart and Winston, New York, 1985.

Human Interface Technology Laboratory