Authors: Joe Steinmetz and Glen Lee
In hearing, the human ear's purpose is to convert sound waves into
nerve impulses. These impulses are then perceived and interpreted by
the brain as sound. The human ear can perceive sounds in the range of
20 to 20,000 Hz. This section is broken down into a basic overview of
the ear, a section on how sound is received by the ear, a section on
how the ears communicate with the brain, and finally, a section on
human factors. Understanding how the ear works is key to successfully
implementing 3D sound in VR systems.
Overview of the human ear
The human ear consists of three distinct areas: the outer ear, the
middle ear, and the inner ear. The outer ear channels sound waves
through the ear canal to the eardrum. The outer ear, or pinna, is a
structure that serves as an auditory cue in HRTF generation. The
eardrum is a thin membrane stretched across the inner end of the
canal. The acoustic environment created by the ear canal and ear drum
is one that 3D
sound synthesis strives to simulate. HRTF generation uses small
microphones embedded in a real ear canal to help simulate the real
acoustic environment. The use of a small microphone in the ear canal
can only approximate the real thing; frequency response is different,
position is different, and the reflection and refraction of sound waves are
different. Air pressure changes in the ear canal cause the thin
membrane to vibrate. These vibrations are transmitted to three small
bones called "ossicles". The ossicles are located in the air-filled
middle ear and will conduct the vibrations across the middle ear to
another thin membrane called the oval window.
All of these vibrations change when a small microphone is embedded in
the ear canal, so HRTF measurements can never exactly reproduce the
acoustic environment of the human ear: the act of making the
measurement alters the thing being measured. The oval window separates
the middle ear from the fluid-filled inner ear. The effect that the
fluid-filled inner ear has on sound transmission is not modelled at
all in HRTF generation; in fact, HRTF generation attempts to measure
only the effects of external structures and their immediate results
inside the ear canal. Everything that happens from the middle ear
onward is not modelled at all.
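In synthesis, a measured HRTF pair is applied by convolving a mono source with a left-ear and a right-ear impulse response. The sketch below illustrates the operation with made-up two- and three-tap impulse responses standing in for real measured HRTFs, which are far longer:

```python
# Toy sketch of how a measured HRTF pair is applied in 3D sound
# synthesis: the mono source is convolved with a left-ear and a
# right-ear impulse response. The impulse responses below are
# invented placeholders; real HRTFs come from the microphone
# measurements described above and contain hundreds of taps.

def convolve(signal, impulse_response):
    """Direct-form FIR convolution: y[n] = sum over k of h[k] * x[n-k]."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Produce a (left, right) sample pair of lists from a mono source."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Placeholder impulse responses: the right ear gets a slightly
# delayed, quieter copy, mimicking a source to the listener's left.
hrir_left = [1.0, 0.3]
hrir_right = [0.0, 0.6, 0.2]

left, right = spatialize([1.0, 0.0, 0.0, 0.0], hrir_left, hrir_right)
```

In a real system the two impulse responses would be selected (or interpolated) from a table of measured HRTFs indexed by source direction.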
The inner ear houses the "cochlea", a spiral-shaped structure that
contains the organ of "Corti" - the most important component of
hearing. The organ of Corti sits on an extremely sensitive membrane
called the "basilar membrane". Whenever the basilar membrane vibrates,
small sensory hair cells inside the organ of Corti are bent, which stimulates
the sending of nerve impulses to the brain.
The outer ear
Two structures make up the external ear: a flexible, oval shaped
structure ("pinna") attached to the head, and the ear canal ("meatus")
that leads to the middle ear. The pinna is made up of fibrous
cartilage. Because there are few muscles to move the pinna, humans
need to move their head in order to help locate the source of a sound
(see 3D sound, dynamic modeling).
The transmission of sound through the ear
Sound waves hitting the outer ear are both reflected (scattered) and
conducted. Conducted sound waves will travel through the ear canal and
will hit the eardrum causing it to be driven inwards. This portion of
the process is measured in HRTF generation by the embedded microphone.
Although the remaining stages are not currently modelled, they are
described here to convey how complex the human auditory system is and
how much work remains in 3D sound synthesis.
The inward force causes the malleus and incus to push the stapes
deep into the oval window of the inner ear. The surface area of the
eardrum is about 30 times greater than that of the stapes, which makes
the pressure on the oval window about 30 times greater than the
original pressure on the eardrum. This amplification is needed for the
stapes to be able to transfer the energy into the "perilymph". The
movement of the stapes compresses the perilymph, and the compression
of the flexible membrane causes the round window to bulge into the
middle ear. The organ of Corti pivots in response to the movements of
the basilar membrane. The organ of Corti and the tectorial membrane
sliding against each other cause the hairs of the hair cells to bend.
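Before following the signal onward, the 30:1 pressure gain above can be checked numerically: it follows directly from P = F/A. The eardrum area used below is an illustrative textbook figure, not a value from this text; the stapes area is derived from the stated 30:1 ratio.

```python
# Pressure gain from eardrum to oval window, assuming the force
# collected by the eardrum is delivered undiminished to the much
# smaller stapes footplate: P = F / A, so P_out / P_in = A_in / A_out.
# The 55 mm^2 eardrum area is an illustrative figure; the stapes
# footplate area is derived from the 30:1 ratio stated in the text.

EARDRUM_AREA_MM2 = 55.0
STAPES_AREA_MM2 = EARDRUM_AREA_MM2 / 30.0

def pressure_gain(area_in_mm2, area_out_mm2):
    """Amplification of pressure when the same force acts on a smaller area."""
    return area_in_mm2 / area_out_mm2

gain = pressure_gain(EARDRUM_AREA_MM2, STAPES_AREA_MM2)  # ~30x
```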
Near the end of the hair cells are the tips of 27,000 nerve fibers. The
interconnections of these nerve fibers and hair cells are complex and
overlapping. No direct connections exist. Instead, as hairs are bent,
nerve impulses are stimulated in the nerve fibers. The exact method of
the electrical impulse generation is not known, although some theories
exist. The changing stiffness of the basilar membrane from one end of
the cochlea to the other creates a kind of mechanical analyzer of
sound. High-frequency sound causes the narrow basal end of the
membrane to vibrate. Medium frequencies cause the membrane in the
middle cochlea to vibrate. Low frequencies cause the whole membrane to
vibrate. The cochlea is able to map frequencies onto certain locations
on the basilar membrane. The sensation of pitch is a function of the
location of the vibration on the basilar membrane.
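This place-to-frequency mapping is commonly summarized by Greenwood's empirical frequency-position function, which is not given in the text but fits the human cochlea well. A minimal sketch, with the standard human constants:

```python
# Greenwood's empirical frequency-position function for the human
# cochlea (an established fit, though not taken from this text):
#   f(x) = A * (10**(a*x) - k)
# where x is the fractional distance along the basilar membrane from
# the apex (x = 0.0, low frequencies) to the base (x = 1.0, high
# frequencies), with human constants A = 165.4 Hz, a = 2.1, k = 0.88.

A, a, k = 165.4, 2.1, 0.88

def place_to_frequency(x):
    """Characteristic frequency (Hz) at fractional position x on the membrane."""
    return A * (10.0 ** (a * x) - k)

# The apex responds to roughly 20 Hz, the narrow basal end to
# roughly 20,000 Hz, matching the hearing range stated earlier.
```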
Auditory nerves and the brain
Nerve impulses are transmitted from the ear to the brain via
the auditory nerves, one of the several sensory nerves that
exist in the group of nerves known as cranial nerves. The auditory
nerves connect the nerve impulses of the ears to the upper "temporal
lobe" of the "cerebral cortex". Nerve impulses pass over "neurons" via
an electro-chemical action. That is, the neuron itself provides the
necessary energy to propel the impulse along the nerve. The nerve
impulse does not travel as fast as a standard electrical current, but
instead moves at about 3.25 to 395 feet/sec. Nerve impulses
travel over many neurons on their way to the brain. Neurons work
together by transmitting the impulses through the "axons" to the
"dendrites" of the next neuron. The axon of one neuron communicates
with the dendrites of another by means of the "synapse". This gap-like
structure transmits the impulse by releasing a chemical transmitter
across the gap.
Human factors involving the auditory system
Virtual reality attempts to offer realistic stimuli for all human
senses. Given the current state of technology, the details of realism
are often trade-offs for real-time management of the system components.
The auditory system is no exception. While the human ear is capable of
hearing a multitude of distinct sounds, a listener can concentrate on
only one particular sound at a time.
This physiological constraint has led to studies in "auditory
cognition". Auditory cognition analyzes such issues as attending to
auditory events, remembering and recognizing sound sources and events,
and the perception of acoustic sequences. The theories behind auditory
cognition attempt to explain how the brain processes and/or filters
out unattended sounds.
These studies are useful in virtual reality systems, because they
allow developers to blend selected sounds into the background while the
user attends to a specific, important sound. For example, if the
virtual environment imitates the city of New York, the sound generators
need not orchestrate the sounds of taxi horns, subway noise, the
rustling of people, and other typical background noise. These may be
merged as one sound that is played in the background, since the user
may not want to attend to any individual sound in this sound cluster.
However, the user may be interested in hearing his name being called
among the noise. The layering of this sound over the background noise
is a sound to which the user would selectively attend. In this
example, the sound of interest (i.e. the name being called) would
require the use of 3D sound. However, the background noises need not
take advantage of the 3D sound capabilities, since these sounds seem to
surround the user, and the user does not consciously attempt to locate
the sound source of each individual noise.
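The strategy above can be sketched as pre-mixing the background sounds into one bed while only the sound of interest receives spatial processing. Simple stereo panning stands in here for full 3D/HRTF rendering, and all signal values are invented:

```python
# Sketch of the mixing strategy described above: background noises
# are pre-merged into one non-spatialized buffer, while the single
# sound of interest is panned (a stand-in for full 3D/HRTF
# processing). All sample values and the pan law are illustrative.

def mix(buffers):
    """Sum equal-length sample buffers into one background bed."""
    return [sum(samples) for samples in zip(*buffers)]

def pan(mono, position):
    """Linear stereo pan: position 0.0 = full left, 1.0 = full right."""
    return ([s * (1.0 - position) for s in mono],
            [s * position for s in mono])

taxi_horns = [0.1, 0.2]
subway = [0.05, 0.05]
crowd = [0.1, 0.0]
background = mix([taxi_horns, subway, crowd])  # one merged background bed

name_call = [0.8, 0.4]
left, right = pan(name_call, 0.25)  # sound of interest, placed to the left

# Final output: spatialized foreground layered over the shared bed.
out_left = [f + b for f, b in zip(left, background)]
out_right = [f + b for f, b in zip(right, background)]
```

Only one signal path pays the spatialization cost; the bed is computed once and reused, which is the trade-off the text describes.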
This section provides some sample theories on how sounds of interest
can be made attention-holding in the virtual environment.
Broadbent's Theory
Broadbent's Filter Theory of Attention has been used as the basis for
most selective attention models. In his theory, sound information is
passed through a number of sensory channels. These "channels" are
vaguely defined as having some distinct neural representation in the
brain. Their representations may be based on a number of sound
attributes, such as pitch, loudness or spatial position
characteristics. Broadbent postulates that the sound channels lead
into the short-term memory portion of the brain, where a particular
channel may then be filtered based on the desired sound attributes.
This filter allows only one of the channels to lead to the long-term
memory store and to any output mechanisms necessary to respond to the
attended sound.
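A toy illustration of Broadbent's single-channel filter, with invented channels keyed on spatial position as the filtered attribute:

```python
# Toy model of Broadbent's filter theory: several channels arrive in
# a short-term store, and a filter passes exactly one channel on,
# selected by a sound attribute (here, spatial position). The
# channel names and attributes are invented for illustration.

short_term_store = [
    {"channel": "left speaker", "position": "left", "message": "cocktail chatter"},
    {"channel": "companion", "position": "front", "message": "your conversation"},
    {"channel": "right speaker", "position": "right", "message": "music"},
]

def broadbent_filter(store, wanted_position):
    """Pass only the first channel matching the attended attribute."""
    for channel in store:
        if channel["position"] == wanted_position:
            return channel  # the single channel that reaches long-term memory
    return None

attended = broadbent_filter(short_term_store, "front")
```

The all-or-nothing selection is the distinctive feature: unattended channels contribute nothing downstream, which is exactly what Treisman's refinement below relaxes.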
This theory arose from the cocktail party situation in which a guest
must filter out all distracting sounds to concentrate on one
conversation. Several studies have suggested different conditions
under which the filter switches to listen to a different channel. For
instance, hearing your name in the middle of another conversation
takes priority in determining which channel is selected. Some
experiments have suggested that the switching time of the filter
between channels is on the order of 0.25 seconds [Moray, 1970].
Treisman's Theory
Treisman's theory modifies Broadbent's theory by proposing that the
input selections bypass the short-term memory area of the brain to
arrive immediately to a filter which is sensitive to a sound's physical
characteristics. This filter eliminates most unattended sound
channels, but allows a subset of channels to enter a series of nodes in
the brain, known as "dictionary units". This network of dictionary
units becomes a pattern matcher, where similar-sounding stimuli trigger
signals to the listener's output activity mechanism in the brain.
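In a toy form, Treisman's variant attenuates unattended channels rather than blocking them, and the surviving signals are matched against dictionary units with per-word thresholds; important words such as one's own name have low thresholds. All numbers here are invented:

```python
# Toy model of Treisman's attenuation theory: unattended channels
# are attenuated rather than blocked, and the remaining signal
# levels are matched against "dictionary units" with per-word firing
# thresholds. Important words (like one's own name) have low
# thresholds, so they can fire even from an attenuated channel.
# The gain, thresholds, and levels are all invented.

ATTENUATION = 0.2  # gain applied to unattended channels

dictionary_units = {"your name": 0.1, "fire": 0.15, "weather": 0.6}

def treisman(channels, attended):
    """Return the dictionary units fired by the (attenuated) channel signals."""
    fired = []
    for name, (word, level) in channels.items():
        gain = 1.0 if name == attended else ATTENUATION
        if word in dictionary_units and level * gain >= dictionary_units[word]:
            fired.append(word)
    return fired

channels = {
    "conversation": ("weather", 0.9),  # attended channel, loud
    "party": ("your name", 0.8),       # unattended, but an important word
}
fired = treisman(channels, attended="conversation")
```

Unlike the Broadbent model, an unattended channel can still trigger a response when its word's threshold is low enough, which accounts for hearing your name across the room.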
Deutsch and Deutsch's Theory
Deutsch and Deutsch's Response Selection Theory of Selective Attention
alters Treisman's theory by omitting the initial filter for physical
characteristics. Within the dictionary network, each signal is
analyzed and recognized for its importance. The importance of the
signal fires a proportional signal to the brain's output activity
mechanism. Hence, the sound that captures the listener's attention is
the sound that bears the heaviest signal. Applied to the earlier
example, your name captures your attention when called because the
brain recognizes its high importance.
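In toy form, the Deutsch and Deutsch model analyzes every signal fully and simply selects the one with the heaviest importance weight; the weights below are invented:

```python
# Toy model of Deutsch and Deutsch's response selection theory:
# every signal is fully analyzed and assigned an importance weight,
# and the single most important signal drives the output mechanism.
# There is no early filter at all. The weights are invented.

importance = {"your name": 1.0, "conversation": 0.6, "traffic": 0.2}

def select_response(signals):
    """Attention goes to the signal bearing the heaviest importance weight."""
    return max(signals, key=lambda s: importance.get(s, 0.0))

attended = select_response(["traffic", "conversation", "your name"])
```

The contrast with the two models above is where selection happens: here it is pushed all the way to the response stage, after full analysis of every input.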
Coursework for pursuing research in the auditory system
The following courses are suggestions for pursuing research in the
auditory system: Linguistics, Brain and Human Communications, Anatomy
and Physiology of Speech Mechanisms, Phonetic Science, Vocal Diction,
Physics of Music, Vibrations and Waves
[Borden, 1980] Borden, Gloria J. and Harris, Katherine S., Speech
Science Primer: Physiology, Acoustics, and Perception of Speech,
Williams and Wilkins, Baltimore, 1980.
[Jarvis, 1978] Jarvis, J.F., The Anatomy and Physiology of Speech and
Hearing, Juta and Company Limited, Cape Town, 1978.
[McAdams, 1993] McAdams, Stephen and Bigand, Emmanuel, Thinking in
Sound: The Cognitive Psychology of Human Audition, Clarendon Press,
Oxford, 1993.
[Moray, 1970] Moray, Neville, Attention: Selective Processes in Vision
and Hearing, Academic Press, New York, 1970.
[Underwood, 1976] Underwood, Geoffrey, Attention and Memory, Pergamon
Press, New York, 1976.
[Yost, 1985] Yost, William A. and Nielsen, Donald W., Fundamentals of
Hearing, Holt, Rinehart and Winston, New York, 1985.
Human Interface Technology Laboratory