An Exploration of Virtual Auditory Shape Perception
Each virtual speaker, or "pixel", in a particular shape pattern was sequentially energized for 60 milliseconds using the stimulus described above, with a 60 millisecond pause between pixels. The shapes were 10 alphanumeric characters ("3", "6", "9", "C", "G", "O", "P", "R", "S", "U") designed to be analogous to those described in Lakatos [1993a] (see figure 9.1).
Figure 9.1
Alphanumeric Auditory Shape Patterns
Interface for Auditory Vector Experiment I
ANOVA of Performance on Shape Identification Task (Experiment 1)
| Source | Sum-of-Squares | DF | Mean-Square | F-Ratio | P |
|---|---|---|---|---|---|
| Subject | 5.480 | 6 | 0.913 | 4.759 | 0.001 |
| Shape | 10.413 | 9 | 1.157 | 6.029 | 0.001 |
| Subject x Shape | 17.977 | 54 | 0.333 | 1.735 | 0.001 |
| Error | 120.900 | 630 | 0.192 | | |
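The relationship among the columns of this table can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the usual definitions (mean square = sum of squares / degrees of freedom, and F = effect mean square / error mean square); the function names are illustrative and do not come from the original analysis software.

```python
# Rows of the ANOVA table above: (source, sum of squares, degrees of freedom).
rows = [
    ("Subject", 5.480, 6),
    ("Shape", 10.413, 9),
    ("Subject x Shape", 17.977, 54),
]
error_ss, error_df = 120.900, 630


def mean_square(ss, df):
    """Mean square is the sum of squares divided by its degrees of freedom."""
    return ss / df


ms_error = mean_square(error_ss, error_df)


def f_ratio(ss, df):
    """F ratio of an effect against the error mean square."""
    return mean_square(ss, df) / ms_error


for source, ss, df in rows:
    print(f"{source:16s} MS={mean_square(ss, df):.3f} F={f_ratio(ss, df):.3f}")
```

Under these assumptions the recomputed values agree with the tabulated mean squares and F ratios to within rounding.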
To better understand the results, I created a confusion matrix with the presented shapes as the column headings and the subjects' choices as the row headings. Next I performed a cluster analysis on this matrix and sorted the rows and columns based on the clustering (see figure 9.3).
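As an illustration of the first step, a confusion matrix of this kind can be tallied from a list of (presented, chosen) pairs. This is a minimal sketch only: the trial data are hypothetical, the function name `confusion_percentages` is an assumption, and the cluster analysis and sorting step is not shown.

```python
from collections import Counter

SHAPES = ["3", "6", "9", "C", "G", "O", "P", "R", "S", "U"]


def confusion_percentages(trials, labels):
    """Return table[chosen][presented]: percent of a presented shape's trials
    on which the subject chose each label (columns = presented shapes)."""
    counts = Counter(trials)  # (presented, chosen) -> number of trials
    presented_totals = Counter(presented for presented, _ in trials)
    table = {}
    for chosen in labels:
        table[chosen] = {}
        for presented in labels:
            total = presented_totals[presented]
            pct = 100.0 * counts[(presented, chosen)] / total if total else 0.0
            table[chosen][presented] = pct
    return table


# Hypothetical trials as (presented, chosen) pairs, for illustration only.
trials = [("C", "C"), ("C", "C"), ("O", "C"), ("O", "O"), ("G", "C"), ("G", "G")]
table = confusion_percentages(trials, SHAPES)
print(table["C"]["O"])  # percent of "O" presentations answered as "C"
```

Normalizing by column means each presented shape's responses sum to 100%, which matches the "Percentage of Responses" reading of the matrix.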
Figure 9.3
Confusion Matrix for Experiment 1 (axis labels: Auditory Pattern Presented, Percentage of Responses)
This representation brings out some interesting features of the data. As one would hope, the peaks are generally located in the right places, along the diagonal. The sorting by clusters helps preserve the appearance of the diagonal line even though several shapes were specifically confused with one another. By looking at which shapes fall adjacent to each other, we discover that many confusions are just the ones we would expect: the "C" is the same pattern as the adjacent "O" with three pixels missing. Although the adjacent "S", "6", and "G" are traced in different spatial-temporal patterns, independent of time they are morphologically quite similar.
The diagonal line appears almost "anti-aliased" in places. This is a consequence of the fact that within columns, and to a certain extent, within rows, the data are quite smooth. This gives the impression that there is a fair amount of perceptual overlap between shapes that are adjacent in this representation. The one major exception is the region around the correct identification of the letter "C". Whereas the "O" and the "G" are often mistaken for a "C", the reverse is almost never true. This asymmetry probably reflects the fact that the pattern that makes up the letter "C" has only eleven pixels in it, whereas the "O" and "G" have fourteen. This makes the duration of the "C" pattern 360 ms shorter than that of "O" and "G". The difference in duration alone could be the distinguishing factor.
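The 360 ms figure follows directly from the stimulus timing. A minimal sketch, assuming the timing stated earlier (60 ms per pixel and a 60 ms pause between consecutive pixels); the constant and function names are illustrative:

```python
PIXEL_ON_MS = 60  # each pixel sounds for 60 ms (from the text)
PAUSE_MS = 60     # pause between consecutive pixels (from the text)


def pattern_duration_ms(n_pixels):
    """Time from the onset of the first pixel to the offset of the last."""
    if n_pixels <= 0:
        return 0
    return n_pixels * PIXEL_ON_MS + (n_pixels - 1) * PAUSE_MS


c_ms = pattern_duration_ms(11)  # "C" has eleven pixels
o_ms = pattern_duration_ms(14)  # "O" and "G" have fourteen
print(c_ms, o_ms, o_ms - c_ms)
```

Each additional pixel adds one 120 ms onset interval, so three extra pixels account for the 360 ms difference.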
The asymmetry and the enhanced performance on the identification of the "C" suggest another analysis: if the patterns with the fewest pixels are the least ambiguous, perhaps there is an overall effect caused by the number of pixels in a pattern. I performed an analysis of variance with respect to the number of pixels in each pattern (see table 9.2), and found a significant effect at the P < .001 level. Moreover, post-hoc contrasts show a highly significant inverse linear relationship between performance and the number of pixels in a pattern (see figure 9.4). One way to interpret this is that the greater complexity of shapes with more pixels confused the subjects. In other words, the main challenge in recognizing a pattern was not getting sufficient information, but keeping track of details. This interpretation argues that the pattern recognition task is a difficult cognitive exercise, and not a purely perceptual one. Several subjects reported that they would occasionally "lose track" of complicated patterns and had to resort to guessing.
ANOVA of Performance on Shape Identification Task (Experiment 2)
| Source | Sum-of-Squares | DF | Mean-Square | F-Ratio | P |
|---|---|---|---|---|---|
| Pixels | 8.646 | 4 | 2.161 | 9.612 | 0.001 |
| Error | 250.726 | 1115 | 0.225 | | |

Figure 9.4
The Relationship Between Identification Accuracy and Number of Pixels
An alternative explanation which might yield the same result was proposed by Lakatos [1993a]. He suggested that there may be a bias toward the selection of simpler patterns. This hypothesis is difficult to evaluate without a strong definition of "simplicity". Two possibilities are depicted in figure 9.5.
Panel labels: numbers in parentheses indicate the number of pixels; numbers in parentheses indicate the number of changes of direction.
Figure 9.5
Erroneous Selection Counts
Overall performance on Lakatos' [1993a] experiment (60%-90% correct), with nearly identical patterns and stimuli, was superior to that on my experiment (20%-43% correct). This could be due to the virtual nature of the patterns in my experiment and the fact that I did not use head tracking. Although I believe that these are factors, I would like to show that they do not account for the majority of the discrepancy in performance. Wenzel, Wightman, and Kistler [1993] determined that non-individualized HRTFs incur only a minor decrement in localization acuity. However, they also state that there is a high frequency of front-back and, to a lesser extent, up-down confusions in localization in such situations. My subjects, however, were selected for their low frequency of vertical confusions.
To compare the two experiments appropriately, it is first necessary to emphasize a few differences in methodology other than the virtual-vs.-physical aspect. In Lakatos' experiment, the subjects were given a training session before starting the experiment. In my experiment, only enough advance training was supplied to acquaint the subjects with the interface. Instead, feedback was supplied for the first half of each test run. I chose this method of training because I wanted to obtain information on how listeners learn the task. Inspection of the change in performance over time in my experiment reveals great variation between subjects. The performance of some subjects increased steadily throughout the half of the experiment in which they had feedback. All but one of these subjects seemed to reach an asymptote in performance level before the end of the feedback (two of these were lower asymptotes, as two subjects started out with a string of correct identifications). Two of the subjects were extremely consistent throughout the experiment run. Since some of the subjects did improve during the experiment, their initial poor performance is one source of error that likely accounts for part of the discrepancy between my data and those of Lakatos.
Another methodological difference was that in Lakatos' experiment the starting point was depicted on each pattern choice. Depending on the strategy used by the subjects, this additional cue could greatly affect the distinguishability of several patterns that differed mainly in the direction of tracing. I chose not to supply this cue because I thought that it would bias the subjects' strategies.
To reconcile these differences, I ran one subject who had previously scored quite well on my experiment (43%) through a version of the experiment modified to more closely match Lakatos'. The subject was given a training session in which she had an opportunity to listen to and compare the shapes at will, and to perform a practice run of the experiment. During her practice run she scored 70% (although the practice set was not balanced, so the stimuli did not each occur an equal number of times). Afterward she went through the experiment as she had during the formal trials, except that the captions on the selection buttons had been modified to include an indication of the starting position and direction of each pattern. She scored 58%, which is close to the performance level in Lakatos' experiment. This subject's dramatic improvement is strong evidence that the majority of the difference in performance between my experiment and Lakatos' was due to methods, and not apparatus.
The patterns in experiment 2 consisted of 6 geometric shapes: a horizontal line (from left to right), a diagonal line (from upper left to lower right), a square (with no bottom), a trapezoid (with no bottom), and a triangle. These shapes all have the same number of pixels: seven (see figure 9.6). As a consequence of the small number of pixels, the patterns in this experiment are much simpler than in experiment 1. The subjects, stimuli, and apparatus were the same as in experiment 1. The minimum angular separation was 8.7° for the adjacent speakers at the center of the grid, and the maximum angular separation was 24° for the diagonally adjacent speakers at the corners of the array.
Figure 9.6
Geometric Auditory Shape Patterns
Interface for Auditory Vector Experiment II
In two instances there were highly asymmetric confusions. Shape 2 was mistakenly identified as shape 3 twice as often as it was correctly identified. Even more pronounced was the favoring of shape 5 over shape 6. Further inspection of the confusion rates leads us to believe that the overlap within these pairs is complete. The columns in figure 9.8 can be thought of as vectors within a shape description space: Shape 1 is 77% like the visual picture 1, 4% like picture 2, 9% like picture 3, and so on. The column profiles of shapes 5 and 6 and also those of shapes 2 and 3 are nearly identical.
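One way to make "nearly identical column profiles" concrete is to treat each column as a vector and compare vectors with a similarity measure. The sketch below uses cosine similarity, which is an assumption of this example and not necessarily the comparison used in the original analysis; the profile values are hypothetical and are not the data of figure 9.8.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two response profiles (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("profiles must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical column profiles: percent of responses naming pictures 1..6.
profile_a = [5, 30, 55, 5, 3, 2]  # an illustrative overlapping shape
profile_b = [4, 28, 58, 6, 2, 2]  # a second shape with a similar profile
profile_c = [80, 5, 8, 3, 2, 2]   # a shape that is usually identified correctly

print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

Profiles that overlap almost completely give a similarity close to 1, while a well-identified shape compared against them gives a much lower value.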
Figure 9.8 (image not available; axis labels: Auditory Pattern Presented, Percentage of Responses)
The large perceptual overlap between shapes 5 and 6 is not too surprising, as they are quite morphologically similar. The differences between shapes 2 and 3 are more striking. In both cases the favored visual description is the one that most closely follows the overall motion (see figure 9.8). This is a strong argument that motion in these patterns is more important than morphology.
An alternative explanation involves the spatial distribution of the pixels in the patterns. The pixels of the two unfavored shapes fall more in the periphery of the grid, which is the region of poorest auditory acuity. The unfavored shapes could therefore be considered the more ambiguous patterns (23.9° vs. 11.3°; cf. the Perrott papers on MAMA vs. position).
Performance on experiment 2 did not fit the linear dependency on number of pixels found in experiment 1. This might cloud the interpretation of the importance of that factor, but experiment 2 was not designed to be directly comparable to experiment 1, and the nature of the patterns themselves is quite different. First, the array used in this experiment is three feet wider than that used in experiment 1, so portions of patterns fall in regions of poorer auditory acuity. Second, since consecutive pattern elements do not necessarily occupy adjacent elements of the array, issues of apparent motion begin to come into play. The largest angular separation between consecutive pixels in this experiment is about 24° (as compared to 11° in experiment 1). This is within the range of some empirically derived critical values for apparent motion perception with a 120 ms onset interval [Lakatos, 1993b].