An Exploration of Virtual Auditory Shape Perception
There is a considerable body of research addressing the apparent non-directional size of sounds. This somewhat abstract quantity is (unfortunately) termed "tonal volume"[20], and could be considered a unidimensional manifestation of auditory shape. It is the first precedent to auditory shapes that I discovered in the literature. The topic of tonal volume experienced a period of relative popularity around the turn of the century[21].
An early study by G. J. Rich uncovered reasonable evidence that tonal volume possesses the characteristics that define a distinct perceptual attribute. In his experiments, Rich used pneumatic tone generators powered by compressed air (whistles and "Stern Variators") to present his stimuli. He would play a reference tone and subsequently a comparison tone of a different frequency randomly selected from within a range. He fixed the time intervals by measuring their duration with a swinging pendulum. He trained the subjects to listen for changes in tonal volume in addition to changes in pitch. From these evaluations Rich wanted to determine the just noticeable differences of both pitch and volume.
Subjectively, all three of the subjects reported that they experienced tonal volume as a separate quality from pitch. The subjects, however, seemed to have difficulty articulating their judgment criteria. Some reported that it was at least partially a kinesthetic experience in that they felt the extent of the sounds in their bodies. One subject was not quite sure what criteria he used, or even if he consistently used the same criteria. Nonetheless, all three of the subjects were relatively confident in their judgments. The fact itself that the listeners felt that they could make good judgments of the tonal volume of the tone, and yet had difficulty describing it in terms of their existing acoustic vocabulary, would seem to lend credence to the idea that tonal volume is a distinct perceptual attribute.
Objectively, the (upper) difference limens for tonal volume calculated from Rich's data (by the method of constant stimuli) were consistent from judgment to judgment, and consistently higher (i.e. the just noticeable differences for tonal volume were larger) than the difference limens for pitch. This data indicates that perceptions of tonal volume can be treated systematically as a perceptual dimension.
Rich's subjects were trained to listen for changes in tonal volume, which remains to this day a rather abstract quantity. Nowhere in his research is there any objective evidence of a truly spatial component to tonal volume. We rely on reports from Rich's subjects that some of the tones they heard were larger and more space-filling and that others were more compact. Experiments conducted during subsequent years seemed to indicate that volume perceptions were highly dependent on experimental conditions, and results were difficult to repeat [Stevens, 1934]. Thereafter tonal volume fell out of favor as a perceptual attribute and was largely ignored in research. Over the next 50 years only a handful of studies considered volume at all [Perrott et al., 1980, p. 54].
In 1980 Perrott, Musicant, and Schwethelm cleared the reputation of tonal volume. They untangled previously confounded variables and showed that tonal volume depends explicitly on duration in addition to pitch and loudness. The experiments that had discredited tonal volume failed to properly control the duration of stimulus presentation. In a previous experiment of Perrott et al., subjects were required to listen to a 5 kHz tone continuously for five minutes. One of the subjects complained that the tone gradually expanded until it filled his head. With subsequent experiments where all three variables were properly controlled, they were able to reliably measure not only perceptions of tonal volume, but also how they increase over time.[22]
Modern studies show that tonal volume is a rather convoluted sense. Tonal volume increases with the intensity and duration of a sound, and is inversely proportional to its pitch [Perrott & Musicant, 1980]. The hypotheses as to the nature of tonal volume are various. Some describe it as a cognitive phenomenon stemming from associations formed between types of sounds and their sources. Other theories are more physiological in nature and suggest that tonal volume is correlated with the area of the basilar membrane affected by a sound.[23] Unfortunately, none of the theories of the nature of the phenomenon have been thoroughly explored.
On short time scales loudness is dependent on duration. For the first few hundred milliseconds the sound energy is integrated to produce the subjective impression of loudness. Perrott et al. observed tonal volume to increase for hundreds of seconds. This vast difference in time scale is a good indication that loudness and tonal volume not only are distinct, but employ different mechanisms.
During tonal volume's dark decades, von Békésy [1960] had performed a more geometrical assessment of the attribute. He asked his subjects to give a numerical estimate (in centimeters) of the "width" of the test tones. He observed, as expected, that the subjective width increased with loudness and decreased with frequency. However, von Békésy also observed that the perceived size of the auditory image was of the actual size of the speaker that produced the sound.
In their 1980 experiments, Perrott et al. attempted a two-dimensional[24] analysis of volume. Their results reaffirmed the findings of von Békésy on the horizontal dimension, but they were not able to extend the definition of tonal volume to include the vertical dimension. Judgments of tonal volume on the vertical dimension were not significantly different from judgments of loudness, and thus in this study they were unable to establish whether tonal volume occupies more than one spatial dimension.
Most of the tonal volume experiments were conducted with sinusoid tones. As we have seen before, sinusoids, due to their spectral monochromaticity, are nearly impossible to localize on the median plane. Therefore, it is hardly surprising that, using sine tones, Perrott et al. could not get good vertical data.
Perrott & Buell [1982] performed an experiment confirming tonal volume and the expanding image effect for white noise. In this experiment they also demonstrated that interaural decorrelation has a major effect on perceived horizontal extent. An explicit comparison was made between monaural, diotic, and dichotic stimuli. Respectively, they were rated in size by a ratio of 1:1.9:2.5.
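The three stimulus classes compared above differ only in how the two ear signals relate to one another. The following is a minimal sketch, not a reconstruction of Perrott & Buell's apparatus: it assumes Gaussian noise and a standard two-source mixing formula to set the expected interaural correlation, and the function names (noise_pair, monaural, correlation) are hypothetical. A correlation of 1 gives a diotic pair, 0 gives a fully decorrelated dichotic pair, and the monaural case silences one channel.

```python
import math
import random


def noise_pair(n_samples, rho, seed=0):
    """Return (left, right) Gaussian noise whose expected correlation is rho.

    Assumption: right = rho * a + sqrt(1 - rho^2) * b, where a and b are
    independent unit-variance noise sources. This is a textbook construction,
    not the method reported in the source.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    mix = math.sqrt(1.0 - rho * rho)
    left, right = [], []
    for _ in range(n_samples):
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        left.append(a)
        right.append(rho * a + mix * b)
    return left, right


def monaural(n_samples, seed=0):
    """Noise in the left channel only; the right channel is silent."""
    left, _ = noise_pair(n_samples, 1.0, seed)
    return left, [0.0] * n_samples


def correlation(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Calling noise_pair(n, 1.0) yields identical channels, while noise_pair(n, 0.0) yields channels whose measured correlation approaches zero as n grows. Intermediate values of rho would allow a graded test of the decorrelation effect, although the source reports only the three conditions.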
The following sections were removed:
Stereo and beyond
Stereo Audio
Multichannel Audio
Surround Sound
THX(TM)
Many auditory spatial acuity studies consist of measuring the ability to detect a change in the location of an object. Perrott [1984a] challenged the comparison of these "minimum audible angle" studies to minimum visual angle studies. He contended that the analogy was inappropriate because the minimum visual angle is measured not by change but by juxtaposition: the minimum visual angle is the minimum separation between two objects required in order for them to be independently resolved. Perrott proposed a "concurrent minimum audible angle" as a more direct comparison.
The resolution of concurrently active sounds brings us back to the concept of auditory stream segregation. If two concurrent sounds are to be resolved, it is not only important for them to be spatially separate, but they must also fall into separate streams. A specific example of this problem is embodied by the "auditory precedence effect" (also called the "Haas Effect" [Haas, 1951], or "Law of the First Wavefront").
The precedence effect arises in reverberant environments. In such situations there are many possible paths that a sound can take from the source to a listener. In terms of grouping mechanisms and localization, it would be very confusing if every echo were perceived as a separate event. In order to deal with this problem, the auditory system has echo-suppression built into it: the first instance of a sound is assumed to be the one that followed the most direct path from source to listener. Others are more or less ignored. This is the Law of the First Wavefront.[25]
Much of the more recent literature on the precedence effect has involved showing that the echo suppression is not complete.[26] If the precedence effect is a manifestation of auditory stream segregation, then it seems that the influence of the "subordinate" sounds is not negligible. Perrott, et al. [1987] attempted to establish an "existence region" for the precedence effect. They had experimental subjects give qualitative evaluations of pairs of white noise bursts (played from speakers at ±20° azimuth) with the responses "Single", "Simultaneous", or "Continuous Motion". Subjects also attempted to determine whether the left or right stimulus was first.
With uncorrelated noise, the subjects tended to report that the sound underwent continuous motion down to an inter-stimulus onset interval of 10 ms. Below 10 ms they usually reported that the sounds were simultaneous. With correlated noise there was a cutoff at about 6-7 ms, below which the stimuli were most often judged to be a single event. The most surprising result was that, except in the case of truly simultaneous presentation, subjects were able to identify which sound came first in all cases with better than chance accuracy.
Another study explored the spatial implications of the precedence effect. Perrott, et al. [1989] performed an analysis of the minimum audible angle for sounds supposedly suppressed by the precedence effect. Their apparatus consisted of three speakers: one speaker centered in front of the listener, and a pair of speakers a controllable distance apart and a controllable distance behind the first speaker (i.e. farther in front of the listener). First the center speaker was energized by itself, and then once again along with one of the side speakers. The subjects' task was to determine which of the side speakers was used. The separation at which subjects achieved a performance level of 75% correct was defined to be the minimum audible angle. The results showed that the minimum audible angle for the "suppressed" sounds was 2-4° larger than in a condition where the front speaker was not played along with the side speakers. However, since the minimum audible angle in quiet conditions was often less than 1°, this represents as much as a sevenfold increase.
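The 75% criterion used to define the minimum audible angle can be illustrated with a short calculation. This is only a sketch: the source does not say how the 75% point was estimated, so linear interpolation between measured separations is an assumption, the function name threshold_at is hypothetical, and any data passed to it here are illustrative rather than values from the study.

```python
def threshold_at(points, criterion=0.75):
    """Estimate the stimulus value at which proportion correct reaches criterion.

    points: iterable of (stimulus_value, proportion_correct) pairs.
    Assumption: performance is linearly interpolated between adjacent
    measured values. Returns None if the criterion is never reached.
    """
    pts = sorted(points)
    if not pts:
        return None
    # Criterion already met at the smallest tested value.
    if pts[0][1] >= criterion:
        return pts[0][0]
    # Find the first adjacent pair that brackets the criterion.
    for (x0, p0), (x1, p1) in zip(pts, pts[1:]):
        if p0 < criterion <= p1:
            return x0 + (criterion - p0) * (x1 - x0) / (p1 - p0)
    return None
```

With made-up points such as (2 degrees, 50% correct) and (4 degrees, 100% correct), the interpolated 75% point falls midway, at 3 degrees. A fitted psychometric function would be a more robust alternative when the data are noisy.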
The conflict can be summarized as follows: Complex acoustic environments produce situations where it becomes important to be able to distinguish the original sound amongst the echoes. This echo suppression mechanism still applies to situations where the multiple sources are not echoes, which would make perception of extended sources extremely difficult. However, these suppression mechanisms occur with identical sources. Perhaps if the sources are differentiated in some way, in effect assigned to different streams, the precedence effect would not apply. In fact, von Békésy [1960] observed that if the two sources had a slightly different timbre, the image that would otherwise be a point source centered between the speakers becomes a broad elliptical image spanning the space in between [von Békésy, 1960, p. 377]!
Perrott [1984b] explores this effect in more detail. He observes that when presented with two pure tones from different locations there seem to be three different regions of frequency differences which have three different effects. If the difference in frequency is too small, then the tones will not be distinguished. If the difference is too large, then the two sources are perceived as just that: two sources. There is, however, an intermediate range over which the two sources are perceived as one extended source.
A possible interpretation for these three regions is the following: When the sources are too similar, they are fused. They are assigned to the same stream, and a single object at a single point in space. On the other extreme the sources are assigned to two different streams at two different points in space. The intermediate case is the intriguing one: This could be the boundary between differentiation and fusion. It is within this thin line that natural extended auditory perceptions reside.
In the first experiment subjects were asked to determine the slope of a line displayed on this matrix as a series of sine tones. The subjects were given a folder of printed lines from which to choose. The subjects performed the task with significantly better than chance accuracy. The easiest slope to identify was the vertical line which was identified with 47% accuracy. The other slopes were only correctly identified about 20% of the time.
The second experiment was also an identification task where the choices were between six geometric shapes, one of which was a random pattern (denoted as "other"). When asked to draw the shapes, the subjects were unable to perform the task, so again they were given a folder of pictures from which to choose. Performance on this task was better than on the line slope task (42% correct identifications) with no significant difference between shapes. Performance was better with an 800 Hz tone than with a 6400 Hz tone. Varying the time-per-speaker durations (100, 200, 400, 800, 1600 ms) did not cause significant performance differences.
The third experiment was a letter identification task where the subjects were asked to pick from three, six, or all twenty-six letters. Performance was always above chance. The three choice case was significantly easier than the latter two cases, but there was no difference between the latter two cases.
In their next study, Ruff & Perret [1982] examined the anisotropy of the system. They examined the effects of compressing different horizontal and vertical regions of the display by varying extents.
They found a significant main effect for overall horizontal compression of the array. Vertical compression of the array (including complete vertical compression) was not significant. Interestingly, although the region of compression had no significant effect in the horizontal case, it was significant in the vertical case. The highest performance in the experiment occurred when the vertical middle third of the shapes was compressed (i.e. performance was best when the shape was compressed such that it fit into a region in the center of the matrix that was six speakers tall by ten wide).
Ruff and Perret concluded that, since drastic changes to the spatial characteristics of a pattern did not necessarily have a significant effect, it is the succession of the stimuli, i.e. the overall motion in the pattern, that determines correct identification. I would add an additional interpretation to their reduction data. The fact that vertical compression was not significant, while horizontal compression was, is a consequence of the type of stimuli they used: sine tones are notoriously difficult to localize on the vertical dimension [e.g. Blauert, 1983].
Next, similar experiments with the same apparatus were conducted with brain-damaged subjects [Ruff & Perret, 1983; Ruff, 1985]. In these experiments the authors attempted to discover the degree of involvement of the visual system in the auditory spatial integration task. They tried positioning the speaker array above, to the sides, and behind the subjects in addition to in front. They additionally blindfolded a second group of subjects, and tracked the eye movements of a third group.
The data showed that visual cues do not influence performance. Eye movements did not seem to be correlated with the auditory shapes. Involuntary (vs. voluntary) head restraint did, however, cause a decrement in performance on the side presentations, probably because the subjects tended to rotate their heads to face the array when the stimulus was from the side.
The authors note that 35% of the errors made can be explained by vertical confusions. They expected that, since the overhead presentations did not involve any vertical component, performance should be better under this condition, because front/back determinations are easier than above/below. This did not seem to be the case. I would suggest that this could be due to poor horizontal acuity at large elevation angles [Strybel, et al., 1992].
The authors also noted that feedback did not drastically influence performance, so the influence of "acquisition strategies" is not great.
Ruff & Perret's display technology seemed to be most limited by the stimulus that they employed. Lakatos [1993a] performed a similar experiment with speaker arrays, using a harmonic complex instead of sine tones. His harmonic complex consisted of a 1000 Hz fundamental with a total of 12 partials. This stimulus would be better than a pure sinusoid for vertical localization (although probably not ideal).
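To make the stimulus description concrete, the following sketch synthesizes a harmonic complex with a 1000 Hz fundamental and 12 partials. Several details are assumptions not given in the source: the partials are taken to be the fundamental plus its next eleven integer harmonics, all at equal amplitude and zero starting phase, and the sample rate, duration, and function name harmonic_complex are hypothetical choices.

```python
import math


def harmonic_complex(f0=1000.0, n_partials=12, duration=0.1, sample_rate=44100):
    """Return samples of a sum of n_partials harmonics of f0.

    Assumptions: equal-amplitude sine partials at f0, 2*f0, ..., all
    starting at zero phase; the sum is divided by n_partials so that
    every sample stays within [-1, 1].
    """
    if f0 * n_partials >= sample_rate / 2:
        raise ValueError("highest partial would exceed the Nyquist frequency")
    n_samples = int(round(duration * sample_rate))
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        total = sum(math.sin(2.0 * math.pi * f0 * k * t)
                    for k in range(1, n_partials + 1))
        samples.append(total / n_partials)
    return samples
```

Setting n_partials to 1 reduces the signal to the pure sinusoid used by Ruff & Perret, which gives a convenient way to compare the two kinds of stimulus with otherwise identical parameters.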
Lakatos used a 16-element speaker array with adjacent elements set 1 foot apart (see figure 3.1). The task for the listeners was to identify which of 10 alpha-numeric characters was being displayed. Subjects were given an opportunity to associate each shape with its respective auditory display. For each pattern, the starting position on the array was indicated to subjects. His subjects performed at an accuracy level of 60-90% from a field of 10 possible choices.
In a second experiment, Lakatos examined the influence of modifications to the stimulus on performance. He varied the number of harmonic partials in the stimuli and the envelope of the stimuli.
Performance was slightly worse with a gradual attack and decay than with a sharp attack and decay. Performance also worsened with fewer partials, especially when the mid-ranges were omitted.
Figure 3.1
Sixteen Element Speaker Array
(after Lakatos, 1993)
The prototype device does not exploit the spatial properties of sound. Meijer avoids the technical and perceptual difficulties of spatial sound by using temporal and timbral analogies to space. The vertical dimension is mapped onto a spectrum of 64 discrete frequency bands, trusting that the resultant timbral changes will be evocative enough to preserve image information. Height maps onto frequency, with the top of the display conveyed via high frequency components, and the bottom of the image in the lower frequencies. The one aspect that this display addresses, and that no other in the literature does, is image contrast. Meijer's display maps contrast onto the intensity of each frequency component, producing a "16-level gray-scale" output.
The image is built up of a sequence of these vertical slices, sampling from left to right. The device repeats this process, emitting a click at the beginning of each frame. Thus, the horizontal dimension is simply mapped onto the time interval since the last click. Meijer reports that he plans to add horizontal localization cues in a later version.
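The mapping described above (rows to frequency bands, gray level to intensity, columns to time, with a click at the start of each frame) can be sketched in a few lines. This is an illustration under stated assumptions, not Meijer's implementation: the frequency range, the exponential band spacing, the time per column, the sample rate, and the names band_frequencies and sonify are all hypothetical.

```python
import math


def band_frequencies(n_rows, f_low=500.0, f_high=5000.0):
    """One frequency per image row; row 0 (top of image) is the highest.

    Assumption: bands are spaced exponentially between f_low and f_high.
    The source specifies 64 bands but not their range or spacing.
    """
    if n_rows == 1:
        return [f_high]
    ratio = f_high / f_low
    return [f_low * ratio ** ((n_rows - 1 - r) / (n_rows - 1))
            for r in range(n_rows)]


def sonify(image, column_time=0.01, sample_rate=16000, levels=16):
    """Convert one frame (a list of rows of gray levels 0..levels-1) to samples.

    The frame starts with a single full-scale sample standing in for the
    click; the columns are then played left to right, each as a sum of
    sinusoids weighted by gray level.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    freqs = band_frequencies(n_rows)
    per_column = int(round(column_time * sample_rate))
    samples = [1.0]  # click marking the start of the frame
    for c in range(n_cols):
        # Gray level sets the amplitude of each row's frequency component.
        amps = [image[r][c] / (levels - 1) for r in range(n_rows)]
        for i in range(per_column):
            t = (c * per_column + i) / sample_rate
            total = sum(a * math.sin(2.0 * math.pi * f * t)
                        for a, f in zip(amps, freqs) if a)
            samples.append(total / n_rows)
    return samples
```

A frame with a single bright pixel produces sound only during that pixel's column, at the frequency assigned to its row. The listener would therefore recover horizontal position from the delay after the click and vertical position from pitch.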
Whether this display actually works is an unresolved question. At the time of his paper, no psychophysical evaluation had been performed.