An Exploration of Virtual Auditory Shape Perception
Before I attempt to explain where one might draw form out of chaos, I want to give a brief description of the chaos.
Figure 2.1 The Cochlea [From Williams, 1988]
All of audition relies on a sensory transducer with fundamentally two-dimensional output. This transducing mechanism, the cochlea, lies curled up inside the inner ear. The cochlea is physically coupled at one end to the tympanic membrane (eardrum) by the bones of the middle ear: the malleus (hammer), the incus (anvil), and the stapes (stirrup). When sound pressure waves from the outside world vibrate the eardrum, the bones of the middle ear initiate sympathetic motion in the fluid that fills the upper chamber of the cochlea. Waves in this fluid travel along the length of the cochlea. This in turn causes motion of the basilar membrane, which is part of a long partition that divides the interior of the cochlea into two channels. The motion of the basilar membrane reaches maximum amplitudes at positions corresponding to each frequency component of the external sound. Hair cells along the basilar membrane measure the amplitude of motion. In this way the cochlea is a mechanical Fourier analysis machine, measuring the intensity of each frequency component of external sounds that are within the range of human hearing.
Figure 2.2
Frequency Sensitivity Along The Basilar Membrane
(After Carterette, 1978)
From just these two physical dimensions, frequency and intensity[2], we build a vast perceptual cosmos containing numerous perceptual dimensions. In order to establish the context for yet more potential complexity to our auditory experience, I will conduct a tour of some of the perceptual attributes within audition:
Loudness
Loudness is "the subjective intensity of a sound" [Scharf, 1978, p. 227]. It is probably the most familiar perceptual dimension. If you are a member of my generation, you have probably, at least once, found yourself yelling: "DAD, TURN DOWN THE STEREO, IT IS TOO LOUD!" at 3:00 A.M.
The perceptual scale of loudness is the sone scale. One sone is defined as the loudness of a 1000 Hz tone at 40 phons (40 dB). The rest of the sone scale is defined in terms of that reference point. A tone which is perceived to be twice as loud as the reference tone is defined to be 2 sones, and a tone that is perceived half as loud is 1/2 sone. For 1000 Hz tones, this perceptual scale relates to the physical dimension of intensity roughly following a power law:

$$L = k P^{0.6}$$

where L is loudness in sones, P is the pressure in micropascals (uPa), and k is a constant of proportionality.
Or

$$L = k' I^{0.3}$$

where I is the intensity, since intensity changes as the square of pressure. A better match is obtained at low sound pressure levels using:

$$L = k (P - P_0)^{0.6}$$

where P_0 is the effective threshold (~45 uPa).
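The power-law relationship described above can be sketched numerically. This is a minimal illustration, not the author's procedure: it assumes the commonly cited exponent of 0.6 on sound pressure, the standard 20 uPa reference for 0 dB SPL, and a constant chosen so that a 40 dB tone comes out at exactly 1 sone. The function names are hypothetical.

```python
REF_PRESSURE_UPA = 20.0   # assumed 0 dB SPL reference pressure, in micropascals
EXPONENT = 0.6            # assumed power-law exponent on pressure


def pressure_upa(db_spl):
    """Convert a level in dB SPL to sound pressure in micropascals."""
    return REF_PRESSURE_UPA * 10 ** (db_spl / 20.0)


def sones(db_spl, p0=0.0):
    """Loudness in sones of a 1000 Hz tone at the given level.

    p0 is the effective threshold in micropascals; p0=0 gives the plain
    power law, p0=45 gives the threshold-corrected form. The constant k is
    recomputed so that 40 dB always maps to 1 sone (an assumption made
    here for convenience).
    """
    k = 1.0 / (pressure_upa(40.0) - p0) ** EXPONENT
    excess = max(pressure_upa(db_spl) - p0, 0.0)
    return k * excess ** EXPONENT


if __name__ == "__main__":
    for level in (20, 40, 50, 60):
        print(level, round(sones(level), 3), round(sones(level, p0=45.0), 3))
```

Under these assumptions each 10 dB step roughly doubles the loudness, and the threshold correction only matters noticeably at low levels.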
Another potentially important function to know regarding loudness is the relative difference limen or just noticeable difference (JND). The JND of loudness varies from ~2 dB at the threshold to about 0.5 dB up around 80 dB.
The above rules apply to a sustained tone at 1000 Hz. The behavior becomes somewhat more complicated with other types of sounds. Loudness varies in a non-uniform fashion with frequency, tapering off at both the low and high ends of the perceivable spectrum. This is just the first of many interactions of perceptual dimensions that we will see.
With complex tones, loudness increases with bandwidth, to a certain extent, irrespective of the overall energy. Surprisingly this works for both line spectra and continuous spectra. The reason involves the concept of critical bandwidth.
One method of evaluating the loudness of a narrow-band tone is to determine its perceptual threshold against a background of noise, using the known intensity of the noise as a reference. The threshold intensity of the tone is the point where the noise and the tone are equally loud. Experiments have shown[3] that narrowing the bandwidth of the noise, down to a specific limiting width, does not change the threshold level of the tone. Likewise, if the tone is broadened out to a limiting width, the threshold remains unchanged. This limiting bandwidth is called the critical bandwidth. The absolute loudness of broadband sounds is partially determined by the number of critical bandwidths that they span. The critical bandwidth is constant at about 100 Hz for center frequencies of up to 500 Hz, and thereafter is 10-15% of the center frequency [Scharf & Buus, p. 14.35].
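The rule of thumb quoted above for the critical bandwidth can be written as a small piecewise function. This is only a sketch of that approximation, with a hypothetical function name; because the text gives a 10-15% range above 500 Hz, the function returns both ends of the range rather than picking one.

```python
def critical_bandwidth_hz(center_hz):
    """Return (low, high) estimates of the critical bandwidth in Hz.

    Up to 500 Hz the bandwidth is taken as a constant 100 Hz; above that
    it is taken as 10% to 15% of the center frequency.
    """
    if center_hz <= 500.0:
        return (100.0, 100.0)
    return (0.10 * center_hz, 0.15 * center_hz)


if __name__ == "__main__":
    for f in (250.0, 500.0, 1000.0, 4000.0):
        print(f, critical_bandwidth_hz(f))
```

Note that the two pieces of the rule do not meet exactly at 500 Hz, which is a reminder that the quoted figures are approximations.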
There is also time dependence in loudness. As the duration of a tone increases up to ~200ms the threshold loudness decreases: the ear integrates on time scales < 200ms.
With regard to all perceptual dimensions, it is important to keep in mind that there are individual differences. This means that variation from person to person is not merely likely, but guaranteed. In fact Levelt, et al. [1972][4] have shown that the sone scale, which is among the more consistent perceptual scales, varies not just from person to person, but between the two ears of an individual. The basic structure of a perceptual attribute should, however, remain the same across many perceivers.
The just noticeable difference in frequency is dependent both on the frequency and the method of measurement. The JND has traditionally been measured in two ways: change in frequency and difference in frequency. At low frequencies, modulations of 2-5 Hz are detectable. If, however, two pitches are played sequentially, the JND is in the 1-3 Hz range. At high frequencies this bias reverses, and the JND for modulation is just below 100 Hz, while it is just above 100 Hz for a sequence.
When sounds deviate in structure from the harmonic complex, a generalized form of this rule for perceived pitch applies. If the fundamental frequency is weak or even missing, but some number of the harmonics are present, then the pitch perceived is still that of the fundamental. A good example of this phenomenon is the lowest key on the piano keyboard. There is no measurable energy at the fundamental frequency of this note (27.5 Hz on a standard 88-key piano), and yet it is still useful in a musical scale.[5]
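The missing-fundamental effect can be illustrated with an idealized calculation: for partials at exact integer frequencies, the implied fundamental is their greatest common divisor. This is a deliberately simplified stand-in for whatever the auditory system actually does, and the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def implied_fundamental_hz(partials_hz):
    """Greatest common divisor of a list of integer partial frequencies."""
    return reduce(gcd, partials_hz)


if __name__ == "__main__":
    # The 100 Hz fundamental is absent, yet the harmonics still imply it.
    print(implied_fundamental_hz([200, 300, 400]))  # 100
```

Real partials are never exact integers, so a perceptual model would need some tolerance for mistuning; the sketch ignores that.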
There is something special about certain relationships between musical notes. One of the most important relationships is the octave. Octaves, notes that have integral frequency ratios,[6] seem to go well together. This is not surprising, as octaves form the harmonic sets that we saw were so important to pitch perception. Octaves are therefore perceptually similar, sometimes more similar, in fact, than certain sets of notes at less than octave intervals [Shepard, 1982, p. 346]. We cannot stick to a unidimensional scale unless we are willing to consider a non-monotonic, multi-valued perceptual mapping for frequency. If we step into a higher dimensional representation of pitch, a simple spatial analog for a monotonic scale that preserves the similarity relationship of the octave interval is a helix [Shepard, 1982, p. 352]. The pitch proximity of adjacent notes (tone chroma) is expressed as we travel around the coil, and the similarity of the octave is expressed in the proximity of successive winds.
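The helical representation can be sketched as a mapping from frequency to a point in three dimensions: the angle around the coil encodes chroma, and the height encodes the number of octaves above a reference. The 440 Hz reference, the unit radius, and the one-unit-per-octave pitch of the coil are all arbitrary assumptions made for illustration, not values taken from Shepard.

```python
import math


def helix_point(freq_hz, ref_hz=440.0, radius=1.0):
    """Map a frequency to (x, y, z) on a simple pitch helix."""
    octaves = math.log2(freq_hz / ref_hz)        # height along the helix axis
    angle = 2.0 * math.pi * (octaves % 1.0)      # chroma as an angle
    return (radius * math.cos(angle), radius * math.sin(angle), octaves)


if __name__ == "__main__":
    for f in (440.0, 440.0 * 2 ** 0.5, 880.0):
        print(f, helix_point(f))
```

Notes an octave apart land directly above one another, one turn of the coil apart, while notes half an octave apart sit on opposite sides of the coil.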
There are other frequency ratios that are important, and hence other sets of significant tonal relationships, for example major thirds, and perfect fifths. Shepard [1982] proposes (among other structures) a seven dimensional manifold that can be partially described as a double helix wrapped around a helical cylinder.
Brightness is probably the most well known of the timbral dimensions. Brightness is a measure of acoustic energy distribution, roughly quantified as the centroid of the perceivable auditory spectrum. As noted previously, it is possible to generate equal pitched harmonic complexes with varying numbers of partials and even missing fundamentals. These can all be perceived as the same pitch; however, they differ in brightness.
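The centroid measure of brightness is easy to compute for a line spectrum. The sketch below weights each partial by its amplitude; whether amplitude, power, or a perceptually scaled quantity is the right weight is left open here, and the example spectra are invented for illustration.

```python
def spectral_centroid_hz(partials):
    """Amplitude-weighted mean frequency of (frequency_hz, amplitude) pairs."""
    total = sum(amp for _, amp in partials)
    return sum(freq * amp for freq, amp in partials) / total


if __name__ == "__main__":
    # Three complexes sharing a 200 Hz fundamental pitch (made-up amplitudes).
    dull = [(200, 1.0), (400, 0.5), (600, 0.25)]
    bright = [(200, 1.0), (400, 1.0), (600, 1.0), (800, 1.0)]
    no_fundamental = [(400, 1.0), (600, 1.0), (800, 1.0)]
    for name, spec in (("dull", dull), ("bright", bright),
                       ("no fundamental", no_fundamental)):
        print(name, spectral_centroid_hz(spec))
```

All three spectra imply the same pitch, but their centroids differ, which is the sense in which they differ in brightness.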
Taking this process a bit further uncovers yet another less established auditory dimension. Since there are many possible sounds with the same spectral centroid, it is correspondingly also possible to construct many different sounds of the same pitch and brightness. These only differ in the density of spectral energy. Thus density joins the ranks of auditory dimensions.
Many more esoteric timbral dimensions have been suggested. In addition to the ones mentioned above, Carterette [1978] cites tonality and vocality. Other qualities that have been studied are roughness, vowel, beating, bite, and spectral flux. These dimensions may seem to be pushing the boundary between perceptual significance and arbitrary classification. However, several of these timbral dimensions are sufficiently robust as to allow the construction of multidimensional subspaces with interesting properties. David Wessel [1979] describes a two dimensional timbre space in which analogical relationships are perceivable. Wessel performed an experiment in which subjects were asked to rate the appropriateness of different tonal analogies: tone A is to tone B as tone C is to tone D. Several different combinations were supplied, and Wessel found a significant tendency toward favoring analogies that preserved relationships within the timbre space.
It is important to note that time variation of properties has crept in during our discussion of timbre. When I stated before that every one of the perceptual dimensions was built out of the two dimensions frequency and intensity, I left out the influence of time. Frequency is properly a combination of time and intensity (and so it might be better to say that the auditory perceptual dimensions are all based on time and intensity). For perceptual purposes, however, this occurs on scales at or below 50 milliseconds, which corresponds to the minimum perceivable frequency of 20 Hz. Time (on larger scales) should therefore be considered one of the fundamental building blocks of the auditory perceptual dimensions.
Since our brains can only juggle so many details at once[8], we need a system for distilling content from cacophony. Bregman addresses such questions as: "How do we separate the speech of two people talking at once, or the voice of the singer from the orchestra?" He introduces the concept of auditory streams as the critical phase in the perceptual process of interpreting the auditory environment. We separate the sound in our environment into streams, and these streams can then be associated with objects.
One of the reasons that Bregman chose the word "stream" to describe this phase of perception is because a single auditory object may involve a series of sounds, and may take place over a period of time [Bregman, 1990, p. 10]. The basic assumption that the auditory system makes is that the characteristics of sound producing objects remain the same or change slowly over time.
At a given point in time, stream segregation involves proximity on various perceptual dimensions: space, pitch, loudness, brightness, etc. Due to its physical properties, the sounds made by a single object are frequently compact on some of these dimensions. A person's voice has a characteristic pitch and degree of roughness. So do the "voices" of biting insects or hungry predators.
One simple example of stream segregation involves the series of pitches shown in figure 2.3.
Figure 2.3
Auditory Stream Alternatives
If the difference in pitch between temporally successive sounds is small enough then the sequence will be perceived as one stream (figure 2.3A). If the differences are too extreme then the series splits into two streams (figure 2.3B).
The characteristics of a given stream are allowed to change slowly over time. If, however, the changes occur too rapidly then the streams will fragment. If, for example, the time scale of sequence in figure 2.3A is compressed, eventually the streams will split and take on the configuration of 2.3B. Temporal factors generalize to a larger class of streaming phenomena: sounds tend to group into one stream if they start together, stop together, or vary cohesively in some fashion over time.
There are several reasons that a singer is distinguishable from the orchestra. One is that singers (notably opera singers) have a formant in their voice which corresponds to a peak in the frequency spectrum somewhere in the 2.5-3 kHz range. The average spectrum of the orchestra peaks at about 500 Hz [Sundberg, 1982, p. 69]. As long as this "singer's formant" is present, the vocalist can stand out above the sound of the orchestra. Another tactic applies not only to singers but to any soloist: when musicians use vibrato, they are in a sense providing a carrier wave for the listeners' ears. All of the various aspects of the instrument's sound (be it a violin or a vocal tract) vary together, and hence are easier to pick out against the background.
Bregman describes two classes of auditory stream segregation processes: primitive processes and schema-based processes. Primitive processes are innate; they work for broad classes of stimuli, and tend to be associated with assigning auditory information to sound sources. Schema-based processes involve learned information, and tend to apply to a more limited set of acoustic events.
Schema-based segregation involves the application of cognitive effort to pick out an auditory stream. As an example, in figure 2.3, especially when the time/pitch scale is such that neither grouping is particularly favored, if one applies effort, it is possible to force the sequence into either grouping. If a sequence is less ambiguous, then more effort is often required. One consequence of the necessity for cognitive effort is that where primitive processes improve with speed, schema-based ones get worse.
Another difference between primitive and schema-based segregation is that primitive segregation is symmetrical and schema-based processes are not [Bregman, 1990, p. 669]. Whereas it may be possible to pick out the Gilligan's Island theme from a mixed sequence of notes, it is not likely that you would subsequently be able to hum the residual tune.
Grouping is of primary importance in our perception of the auditory portion of the environment. There is a rich variety of cues that can be used in auditory grouping. I suggest that this is one of the reasons why there are so many auditory perceptual attributes: This way we have a large number of potential dimensions on which an object might cohere.
This is a paradox: in order to locate an auditory object, is it not first necessary to determine what portion of the ambient sound belongs uniquely to that object? Not much is known about the mechanics of such an operation. What is known is that localization is strongest as a grouping cue when other corroborating grouping factors are present [Bregman, 1990, p. 645].
These cues alone do not uniquely indicate the 3-space position of a sound source. Any given point in space is only one member of a locus of points that have the same interaural characteristics. This "cone of confusion" is disambiguated by another set of cues that are due to the effect of the external part of the ear: the pinnæ.
Since the pinnæ are irregularly shaped, they have an asymmetric effect on incoming sound waves. This effect not only serves to resolve the cone of confusion, but is also the main factor in vertical localization. Pinnæ cues are generally represented as slight modifications to the spectrum of incoming sounds.
The third cue in auditory motion perception makes it clear that such perceptions are not based solely on change. Because objects in our environment frequently have velocities that are a significant fraction of the speed of sound in air (343 m/s at 20deg.C in dry air at 1 atmosphere of pressure) they are subject to Doppler shifts in frequency:

$$\nu' = \nu \, \frac{c}{c + v_s}$$

where $v_s$ is the velocity of the sound source away from the ear, $\nu$ is the initial frequency, $\nu'$ is the shifted frequency, and $c$ is the speed of sound.
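As a worked example of the Doppler relation, the sketch below uses the standard form for a source moving directly away from a stationary listener: the shifted frequency equals the initial frequency multiplied by c / (c + v), with v the source velocity away from the ear. The function name is hypothetical, and the speed of sound is the 343 m/s figure quoted in the text.

```python
SPEED_OF_SOUND = 343.0  # m/s, dry air at 20 degrees C, as quoted in the text


def doppler_shift_hz(source_hz, v_away, c=SPEED_OF_SOUND):
    """Frequency heard from a source receding at v_away m/s.

    A negative v_away means the source is approaching the listener.
    """
    if v_away <= -c:
        raise ValueError("source must approach slower than the speed of sound")
    return source_hz * c / (c + v_away)


if __name__ == "__main__":
    # A 440 Hz source moving at 10% of the speed of sound.
    print(doppler_shift_hz(440.0, 34.3))    # receding: pitch drops
    print(doppler_shift_hz(440.0, -34.3))   # approaching: pitch rises
```

At a tenth of the speed of sound the receding source is heard at roughly 400 Hz, a shift of well over a semitone.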
Rosenblum, et al. performed a study of listeners' ability to identify the time when a moving sound passed directly in front. Their data showed that each of these cues in isolation is sufficient to perform this task. The precision was greatest with intensity differences, followed by phase differences (actually time differences in this case), and finally Doppler shifts.
Summary of Studies of Horizontal Acuity in the Horizontal Plane

| Study | 0deg. | 20deg. | 40deg. | 80deg. | 90deg. | 120deg. | 180deg. |
|---|---|---|---|---|---|---|---|
| Preibisch-Effenberger, 1966; Haustein & Schirmer, 1970[10] | 3.6deg. | | | | 9.2deg. | | 5.5deg. |
| Oldfield & Parker, 1984 | 4deg. | 6deg. | 6deg. | 6deg. | 12deg. | 20deg. | 10deg. |

Values are mean absolute error.
Off the horizontal plane, azimuthal acuity stays about the same, (or perhaps improves a little in the 20-30deg. range [Oldfield & Parker, 1984]) and then begins to get worse at elevations in the 70-80deg. range [Strybel, 1992].
The main determinant for the accuracy of localization at a given position is the spectral content of the stimulus. Sinusoids are the worst, especially at low frequencies, and impulses of broad-band noise are best. This is even more the case with vertical localization.
During a recent informal experiment with a Macintosh-based auditory localization system [13] I observed people making a preponderance of back-front reversals. My suspicion is that the subjects' experience in sitting at the Macintosh monitor is quite analogous to sitting in front of a television. Generally sounds associated with a television issue forth from the television, and not from behind you.[14] I call this effect "television ventriloquism". Another common error in human behavior attests to the extent that televisions can dominate the environment as a spatial focus: No matter where the VCR is located, people usually point the remote-control at the TV.
Effect of head motion
One possible method of resolving the cone of confusion is the use of head motion. Thurlow & Runge [1967] observed that induced head rotation causes large reductions in the frequency of reversals (in one instance from 90% down to zero). Thurlow & Runge observed an overall reduction in localization error even after the data were corrected for reversals. This is likely due to the fact that head rotations bring sound sources through the higher-acuity portions of auditory space. Shelton, Rodger, & Searle [1982] noted that an important determinant of the effect of head motion is the presence of visual cues. They observed that the benefits of head motion are minimal unless there are also visual stimuli present. This is likely due to the augmentation of proprioceptive feedback that visual context provides.
Summary of Studies of Vertical Acuity in the Median Plane

| Elevation | Acuity (Damaske & Wagener, 1969) | Acuity (Oldfield & Parker, 1984) |
|---|---|---|
| -40deg. | | 4deg. |
| -30deg. | | 2deg. |
| -20deg. | | 6deg. |
| -10deg. | | 10deg. |
| 0deg. | 9deg. | 10deg. |
| 10deg. | | 10deg. |
| 20deg. | | 4deg. |
| 30deg. | | 6deg. |
| 36deg. | 10deg. | |
| 40deg. | | 6deg. |
| 90deg. | 13deg., 22deg. | |
| 144deg. | 15deg. | |

Values are mean absolute error.
Wightman and Kistler simply measured the effect of different positions on a stimulus containing equal amounts of all perceivable frequencies. They used tiny probe microphones to make recordings of the ear-canal perspective on impulses of white noise from 144 positions surrounding the listener. This record of the frequency domain effects of each of those positions is called a head-related transfer function (HRTF).
To reproduce the effect of the position with an arbitrary sound, all that is necessary is to perform a Fourier transform to break the sound into its frequency components, apply the HRTF, and then perform an inverse transform to return the sound to the time domain representation. Wightman & Kistler [1989b] verified the efficacy of this procedure by testing subjects' localization both with these synthetic cues, and in the free field. With the exception of increased frequency of vertical and front-back confusions, the subjects had performance that was consistent between these cases.
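The transform, multiply, inverse-transform procedure can be sketched in a few lines. This is a toy illustration under stated assumptions, not Wightman and Kistler's implementation: it uses a naive discrete Fourier transform for short signals, treats the HRTF as one complex gain per frequency bin for a single ear, and the example transfer functions are invented (a flat response and a pure one-sample delay). Multiplying whole-signal spectra this way corresponds to circular convolution; a practical system would use an FFT and process both ears in blocks.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a short sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def apply_hrtf(signal, hrtf):
    """Shape a real signal by one ear's transfer function (one gain per bin)."""
    if len(signal) != len(hrtf):
        raise ValueError("signal and hrtf must have the same length")
    shaped = [s * h for s, h in zip(dft(signal), hrtf)]
    return [value.real for value in idft(shaped)]


if __name__ == "__main__":
    signal = [1.0, 0.0, -1.0, 0.5]
    flat = [1.0] * 4                      # leaves the sound unchanged
    delay = dft([0.0, 1.0, 0.0, 0.0])     # made-up response: one-sample delay
    print([round(v, 6) for v in apply_hrtf(signal, flat)])
    print([round(v, 6) for v in apply_hrtf(signal, delay)])
```

With the flat response the signal comes back unchanged, and with the delay response it comes back shifted by one sample, which is the kind of interaural time cue a measured HRTF pair would encode along with its spectral shaping.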
The measurement of an HRTF (also called an "earprint") is a laborious process. It is not currently practical to measure the earprint for every individual desiring to use three dimensional sound. Wenzel, Wightman, & Kistler [1991][18] examined listeners' ability to localize sounds using other people's earprints. They determined that performance is largely dependent on the "quality" of the earprint. There are large variations in auditory spatial acuity from person to person. If the HRTF is measured from a person with generally poor localization ability, then others who use that earprint also will have poor localization, regardless of their initial acuity. If, however, the earprint is taken from a "good" localizer, then other good localizers will tend to retain much of their acuity. There are even some indications that poor localizers can benefit from a good earprint in slightly increased acuity.
The main failing in non-individualized earprints is in hemispheric confusion rates. Using another's earprint tends to increase the frequency of front-back confusions by a factor of four and vertical confusions by a factor of seven[19].