An Exploration of Virtual Auditory Shape Perception



9. Virtual Auditory Vector Experiments

I conducted two experiments to test the efficacy of the virtual auditory vector display. These experiments were shape identification tasks wherein subjects were presented with an auditory pattern over headphones, and then asked to choose the picture that this pattern most closely matched.

9.1. Subjects

The same seven subjects that passed the screening test, and were used in the virtual concurrent minimum audible angle experiment, were also used in these experiments.

9.2. Stimulus

The stimulus in these experiments is a 12-partial harmonic complex with a fundamental frequency of 1000 Hz (as described in chapter 6).

9.3. Apparatus

The apparatus is described in chapter 6.

9.4. Virtual Auditory Vector Display I

This experiment was intended to be similar to that of Lakatos [1993a], only using virtual sound technology. The shape patterns were displayed on a 16 element virtual speaker array. This array was set such that the center of the array appeared at a "distance" of 6.5 feet from the listener and was oriented perpendicular to the line of sight. Adjacent elements ("speakers") of the array were spatialized to appear 1 foot apart. This corresponds to a minimum angular separation of 8.8deg. for the adjacent speakers at the center of the grid, and a maximum angular separation of 11deg. for the diagonally adjacent speakers in the corners of the array.

Each virtual speaker, or "pixel" in a particular shape pattern was sequentially energized for 60 milliseconds using the stimulus described above. There was a 60 millisecond pause between pixels. The shapes were 10 alphanumeric characters: "3", "6", "9", "C", "G", "O", "P", "R", "S", "U" designed to be analogous to those described in Lakatos [1993a] (see figure 9.1).

Figure 9.1

Alphanumeric Auditory Shape Patterns

9.4.1. Procedure

There were 105 trials for each subject: five practice trials, followed by 100 trials in which each shape occurred ten times in random order. In each trial, the subject was presented with an auditory shape and was then required to select the best match from screen buttons labeled with alphanumeric characters. The subject was allowed to listen to the shape twice before making a selection. For the first 55 trials (including the 5 practice trials) the subject was given feedback. If they selected the correct letter, the button they pressed turned green. If they made an incorrect choice, the button they selected turned red, and the correct button turned green. The Toolbook interface is shown in figure 9.2.

Figure 9.2

Interface for Auditory Vector Experiment I

9.4.2. Results & Discussion of Experiment 1

A t-test comparing correct identification rate to random performance indicates significantly better than chance performance (T = 12.932, P < 0.001). An analysis of variance indicated significant main effects for subjects, letters, and a significant interaction between subjects and letters (see table 9.1).

Table 9.1

ANOVA of Performance on Shape Identification Task (Experiment 1)

      Source           Sum-Of-Squares     DF      Mean-Square    F-Ratio     P     
      Subject              5.480           6         0.913        4.759    0.001   
       Shape               10.413          9         1.157        6.029    0.001   
  Subject x Shape          17.977         54         0.333        1.735    0.001   
       Error              120.900         630        0.192                         

In order to better understand the results, I created a confusion matrix with the presented shapes as the column headings and the subjects' choices as the row headings. Next, I performed a cluster analysis on this matrix and sorted the rows and columns based on clustering (see figure 9.3).
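As a rough illustration of how such a matrix can be tabulated, the sketch below counts (presented, response) pairs and converts each column to percentages. The function name and the trial list are hypothetical and are not the thesis data; the cluster-based sorting is not shown.

```python
from collections import Counter

def confusion_percentages(trials, labels):
    """Return matrix[response][presented] as a percentage of the
    presentations of each shape, so every column sums to 100."""
    counts = Counter(trials)  # keys are (presented, response) pairs
    shown = Counter(presented for presented, _ in trials)
    return {
        resp: {
            pres: 100.0 * counts[(pres, resp)] / shown[pres] if shown[pres] else 0.0
            for pres in labels
        }
        for resp in labels
    }

# Hypothetical trials, for illustration only (not the thesis data).
trials = [("C", "C"), ("C", "C"), ("O", "C"), ("O", "O"), ("G", "C"), ("G", "G")]
matrix = confusion_percentages(trials, ["C", "O", "G"])
print(matrix["C"]["O"])  # 50.0: "O" was presented and "C" was chosen
print(matrix["O"]["C"])  # 0.0: "C" was presented and "O" was chosen
```

In this toy data the confusion is asymmetric: "O" is sometimes answered as "C", but never the reverse.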

Figure 9.3

Confusion Matrix for Experiment 1 (axes: Auditory Pattern Presented vs. Percentage of Responses)

This representation brings out some interesting features of the data. As one would hope, the peaks are generally located in the right places, along the diagonal. The sorting by clusters helps preserve the appearance of the diagonal line in spite of the fact that there are several shapes that were specifically confused. By looking at which shapes fall adjacent to each other, we discover that many confusions are just the ones we would expect: the "C" is the same pattern as the adjacent "O" with three pixels missing. Although the adjacent "S", "6", and "G" are traced in different spatial-temporal patterns, independent of time they are morphologically quite similar.

The diagonal line appears almost "anti-aliased" in places. This is a consequence of the fact that within columns, and to a certain extent, within rows, the data are quite smooth. This gives the impression that there is a fair amount of perceptual overlap between shapes that are adjacent in this representation. The one major exception is the region around the correct identification of the letter "C". Whereas the "O" and the "G" are often mistaken for a "C", the reverse is almost never true. This asymmetry probably reflects the fact that the pattern that makes up the letter "C" has only eleven pixels in it, whereas the "O" and "G" have fourteen. This makes the duration of the "C" pattern 360 ms shorter than that of "O" and "G". The difference in duration alone could be the distinguishing factor.
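The 360 ms figure can be checked from the timing given in section 9.4 (60 ms per pixel, 60 ms pause between pixels). The sketch below is only a worked example; the helper name is hypothetical, and it assumes that no pause follows the final pixel, which does not affect the difference.

```python
PIXEL_ON_MS = 60  # each pixel sounds for 60 ms
PAUSE_MS = 60     # silent gap between consecutive pixels

def pattern_duration_ms(n_pixels):
    """Duration from the first pixel onset to the end of the last pixel,
    assuming no pause after the final pixel."""
    return n_pixels * PIXEL_ON_MS + (n_pixels - 1) * PAUSE_MS

c_ms = pattern_duration_ms(11)  # "C": eleven pixels
o_ms = pattern_duration_ms(14)  # "O" and "G": fourteen pixels
print(o_ms - c_ms)  # 360
```

Each additional pixel adds one 120 ms onset interval, so three extra pixels account for the 360 ms difference.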

The asymmetry and enhanced performance on the identification of the "C" suggest another analysis: if the patterns with the fewest pixels are the least ambiguous, perhaps there is an overall effect caused by the number of pixels in a pattern. I performed an analysis of variance with respect to the number of pixels in each pattern (see table 9.2), and discovered that there is a significant effect at the P < 0.001 level. Moreover, post-hoc contrasts show a highly significant inverse linear relationship between performance and the number of pixels in a pattern (see figure 9.4). One way to interpret this is that the greater complexity of shapes with more pixels confused the subjects. In other words, the main challenge in recognizing a pattern was not getting sufficient information, but keeping track of details. This interpretation argues that the pattern recognition task is a difficult cognitive exercise, and not a purely perceptual one. Several subjects reported that they would occasionally "lose track" of complicated patterns, and had to resort to guessing.

Table 9.2

ANOVA of Performance on Shape Identification Task (Experiment 2)

   Source        Sum-Of-Squares     DF      Mean-Square    F-Ratio     P     
   Pixels            8.646           4         2.161        9.612    0.001   
    Error           250.726        1115        0.225                         

Figure 9.4

The Relationship Between Identification Accuracy and Number of Pixels

An alternative explanation which might yield the same result was proposed by Lakatos [1993a]. He suggested that there may be a bias toward the selection of simpler patterns. This hypothesis is difficult to evaluate without a strong definition of "simplicity". Two possibilities are depicted in figure 9.5.

Figure 9.5

Erroneous Selection Counts (one panel with the number of pixels in parentheses, one with the number of changes of direction in parentheses)

Overall performance on Lakatos' [1993a] experiment (60%-90% correct), with nearly identical patterns and stimuli, was superior to that on my experiment (20%-43% correct). This could be due to the virtual nature of the patterns in my experiment and the fact that I did not use head tracking. Although I believe that these are factors, I would like to show that they do not account for the majority of the discrepancy in performance. Wenzel, Wightman, and Kistler [1993] determined that non-individualized HRTFs incur only a minor decrement in localization acuity. However, they also state that there is a high frequency of front-back, and to a lesser extent, up-down confusions in localization in such situations. My subjects, however, were selected for their low frequency of vertical confusions.

In order to appropriately compare the two experiments, it is first necessary to emphasize a few differences in methodology other than the virtual-vs.-physical aspect. In Lakatos' experiment, the subjects were given a training session before starting the experiment. In my experiment, only enough advance training was supplied to acquaint the subjects with the interface. Instead, feedback was supplied for the first half of each test run. I chose this method of training because I wanted to obtain information on how listeners learn the task. Inspection of the change in performance over time in my experiment reveals that there is great variation between subjects. The performance of some subjects increased steadily throughout the half of the experiment in which they had feedback. All but one of these subjects seemed to reach an asymptote in performance level before the end of the feedback (two of these were lower asymptotes, as two subjects started out with a string of correct identifications). Two of the subjects were extremely consistent throughout the experiment run. Since some of the subjects did improve during the experiment, their initial poor performance is one source of error that is likely a part of the discrepancy between my data and that of Lakatos.

Another methodological difference was that in Lakatos' experiment, the starting point was depicted on each pattern choice. Depending on the strategy used by the subjects, this additional cue could greatly affect the distinguishability of several patterns that differed mainly in the direction of tracing. I chose not to supply this cue because I thought that it would bias the subjects' strategies.

To reconcile these differences, I ran one subject who had previously scored quite well on my experiment (43%) through a version of the experiment modified to more closely match Lakatos'. The subject was given a training session in which she had an opportunity to listen to and compare the shapes at will, and to perform a practice run of the experiment. During her practice run she scored 70% (although that was not a balanced set of stimuli where each occurred an equal number of times). Afterward she went through the experiment as she had during the formal trials, only this time the captions on the selection buttons had been modified to include an indication of the starting position and direction of each pattern. She scored 58%, which is close to the performance level in Lakatos' experiment. This subject's dramatic improvement is strong evidence that the majority of the difference in performance between my experiment and Lakatos' was due to methods, and not apparatus.

9.5. Virtual Auditory Vector Display II

The patterns in experiment 1 were quite complex, and thus it was difficult to determine just what aspects of certain shapes made them hard to recognize. Also, I was concerned about the effect of differing numbers of pixels (and hence different pattern durations). I wanted to control these properties a little better.

The patterns in experiment 2 consisted of 6 geometric shapes: a horizontal line (from left to right), a diagonal line (from upper left to lower right), a square (with no bottom), a trapezoid (with no bottom), and a triangle. These shapes all have the same number of pixels: seven (see figure 9.6). As a consequence of the small number of pixels, the patterns in this experiment are much simpler than in experiment 1. The subjects, stimuli, and apparatus were the same as in experiment 1. The minimum angular separation was 8.7deg. for the adjacent speakers at the center of the grid, and the maximum angular separation was 24deg. for the diagonally adjacent speakers at the corners of the array.

Figure 9.6

Geometric Auditory Shape Patterns

9.5.1. Procedure

The procedure in the second experiment was much like that of the first. The only difference was that in this experiment there were only six shapes and six choices, so there were a total of 66 trials, of which 6 were for practice. Feedback was supplied during only the first 36 trials. Again, each shape occurred 10 times. The Toolbook interface is shown in figure 9.7.

Figure 9.7

Interface for Auditory Vector Experiment II

9.5.2. Experiment 2 Results

The performance on this experiment ranged from 31% to 63%, with an overall average of 42%. It should be noted that, as there were fewer choices in experiment 2, this is only 26% above chance, which is not significantly higher than the 23% above chance achieved in experiment 1. Again, I performed a cluster analysis of the responses to the shapes, and created a confusion matrix sorted by these clusters (see figure 9.8).
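The comparison with chance can be made explicit with a small worked example. The helper below is a hypothetical sketch that assumes uniform guessing among the response buttons (six in experiment 2, ten in experiment 1). With the rounded 42% average it gives about 25 percentage points, close to the reported 26%, which was presumably computed from the unrounded mean.

```python
def points_above_chance(percent_correct, n_choices):
    """Percentage points by which accuracy exceeds uniform guessing."""
    return percent_correct - 100.0 / n_choices

exp2 = points_above_chance(42.0, 6)  # six response buttons in experiment 2
print(round(exp2, 1))  # 25.3
```

Expressing both experiments this way puts them on a common footing despite the different number of response choices.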

In two instances there were highly asymmetric confusions. Shape 2 was mistakenly identified as shape 3 twice as often as it was correctly identified. Even more pronounced was the favoring of shape 5 over shape 6. Further inspection of the confusion rates leads us to believe that the overlap within these pairs is complete. The columns in figure 9.8 can be thought of as vectors within a shape description space: Shape 1 is 77% like the visual picture 1, 4% like picture 2, 9% like picture 3, and so on. The column profiles of shapes 5 and 6 and also those of shapes 2 and 3 are nearly identical.
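One way to quantify how similar two column profiles are is to treat each column as a vector and compare directions. The sketch below uses cosine similarity, which is my choice for illustration and not necessarily the measure behind the analysis above; the response profiles are hypothetical numbers, not the values in figure 9.8.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two response profiles (1.0 = same shape)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical response percentages over six pictures (illustration only).
profile_a = [5, 10, 5, 5, 50, 25]
profile_b = [5, 8, 7, 5, 52, 23]
profile_c = [70, 5, 10, 5, 5, 5]

print(cosine_similarity(profile_a, profile_b) > 0.99)  # True: nearly identical
print(cosine_similarity(profile_a, profile_c) < 0.5)   # True: clearly distinct
```

Profiles that point in almost the same direction indicate that the two auditory patterns elicit the same distribution of visual responses.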

Figure 9.8

[Image not available. Axes: Auditory Pattern Presented (columns) vs. Percentage of Responses.]

The large perceptual overlap between shapes 5 and 6 is not too surprising, as they are quite morphologically similar. The differences between shapes 2 and 3 are more striking. In both cases the favored visual description is the one that most closely follows the overall motion (see figure 9.8). This is a strong argument that motion in these patterns is more important than morphology.

An alternative explanation involves the spatial distribution of the pixels in the patterns. The pixels of the two unfavored shapes fall more in the periphery of the grid. This is the region of poorest auditory acuity. The unfavored shapes could therefore be considered the more ambiguous patterns (23.9deg. vs. 11.3deg.).

Perrott Papers on MAMA vs. position

The performance on experiment 2 did not fit the linear dependency on number of pixels found in experiment 1. This might cloud the interpretation of the importance of that factor, but experiment 2 was not designed to be directly comparable to experiment 1, and the nature of the patterns themselves is quite different. First, the array used in this experiment is three feet wider than that used in experiment 1. This means that portions of patterns fall in regions of poorer auditory acuity. Second, since consecutive pattern elements do not necessarily occupy adjacent elements of the array, issues of apparent motion begin to come into play. The largest angular separation between consecutive pixels in this experiment is about 24deg. (as compared to 11deg. in experiment 1). This is within the range of some empirically derived critical values for apparent motion perception with a 120 ms onset interval [Lakatos, 1993b].