3D sound, often termed spatial sound, is sound as we hear it in everyday life. Sounds reach us from all directions and distances, and individual sounds can be distinguished by pitch, tone, loudness, and location in space. The spatial location of a sound is what gives it its three-dimensional aspect.
The constant influx of sound from our environment provides much information about the world around us. Slight echoes and reverberations give the brain cues about the direction and distance of objects, and they also relay information about the size of the surrounding space; a small room, for example, has fewer echoes than one with cathedral ceilings. Additionally, objects outside the field of view can be sensed by hearing the sounds they emit; such sounds also serve as cues to turn and locate their sources. Finally, sound carries information about the material qualities of objects and the environment around us. You can tell, for example, whether an object is soft or hard by dropping it on a hard surface and listening to the sound it makes. Similarly, you can gain information about the physical qualities of the ground through sound: walking on a wet surface yields squishing sounds as your feet make contact with it.
Being able to accurately synthesize such spatial sound would clearly add to the immersiveness of a virtual environment. Sounds are a constant presence in our everyday world and offer rich cues about our environment. Sound localization, however, is a complex human process, and efforts to artificially spatialize sounds must be grounded in an understanding of how humans actually hear and localize them.
All of these cues contribute in some way to the ability to locate a sound in 3D space, and 3D sound synthesis must account for them in order to provide accurate sound immersion. The difficulty in doing this is great: researchers do not fully understand exactly how the brain interprets the signals it receives from the ear, nor all of the characteristics that cause sound to be perceived in 3D space. As research continues, we will hopefully gain a better understanding of the human ear and how to emulate it.
Stereo sound is recorded with two microphones several feet apart, separated by empty space. Most people are familiar with stereo sound; it is heard commonly through stereo headphones and in the movie theater. When a stereo recording is played back, the recording from one microphone goes into the left ear, while the recording from the other is channeled into the right ear. This gives a sense of the sound's position as captured by the microphones. Listeners of stereo sound often perceive the sound sources to be at a position inside the listener's head -- this is because humans do not normally hear sounds the way they are recorded in stereo, separated by empty space. In natural listening, the head itself acts as a filter on incoming sounds.
Binaural recordings are more realistic because they are made in a manner that more closely resembles the human acoustic system: the recording microphones are embedded in a dummy head, which filters sound much as a human head does. The result is sound that is perceived as external to the listener's head, closer to what we hear in the real world.
The computations involved in convolving the sound signal for a particular point in space are demanding; refer to [BURGESS92] for details on these sound computations. The key point is that the computations are so demanding that they currently cannot be performed in real time without special hardware. To meet this need, Crystal River Engineering has implemented these convolution operations on a digital signal processing chip called the Convolvotron.
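To make the cost concrete, here is a minimal sketch of the core operation in pure Python: filtering a mono signal through a separate head-related impulse response (HRIR) for each ear. The impulse responses below are invented for illustration; real measured HRIRs run to hundreds of taps per ear, and every output sample costs one multiply-add per tap, which is why dedicated DSP hardware is needed for real-time use.

```python
def convolve(signal, impulse_response):
    """Direct discrete convolution: y[n] = sum over k of h[k] * x[n-k]."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

# Hypothetical HRIRs for one direction: the right-ear response is
# delayed and attenuated relative to the left, mimicking a source
# located off to the listener's left.
hrir_left  = [1.0, 0.4, 0.1]
hrir_right = [0.0, 0.6, 0.25, 0.05]

mono = [1.0, 0.0, -1.0, 0.0]          # a short mono source signal
left_ear  = convolve(mono, hrir_left)
right_ear = convolve(mono, hrir_right)
```

Playing `left_ear` and `right_ear` over headphones is the basic idea behind HRTF-based spatialization; doing it at audio sample rates for every source, and updating the filters as the head moves, is what motivates hardware like the Convolvotron.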
The rendering technique treats a sound source as a one-dimensional signal with an intensity over time. This is a simpler approach than the more traditional Fourier-transform representation used in HRTF generation. The technique exploits the similarity of light and sound to provide the necessary convolutions. A sound source in space propagates sound waves in all directions just as a light source does, and like light, sound waves can be reflected and refracted by the acoustic environment. A sound wave interacts with many objects in the environment on its way to the listener. The final sound the listener hears is the integral of the signals from the multiple simultaneous paths between the sound source and the listener. The rendering algorithm cannot evaluate this function continuously and therefore must break it into discrete calculations to compute the sound transformations.
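The discrete form of that integral can be sketched as a sum of delayed, attenuated copies of the source signal, one per propagation path. This is a simplified illustration, not the algorithm of [TAKALA92] itself; the path delays and gains below are invented, and a real tracer would derive them from scene geometry and surface materials.

```python
def render_paths(source, paths, out_len):
    """Sum delayed, attenuated copies of the source signal, one per
    propagation path; each path is a (delay_in_samples, gain) pair."""
    out = [0.0] * out_len
    for delay, gain in paths:
        for n, x in enumerate(source):
            if n + delay < out_len:
                out[n + delay] += gain * x
    return out

# Hypothetical paths: a direct path plus two wall reflections,
# each arriving later and weaker than the last.
paths = [(0, 1.0), (5, 0.5), (9, 0.3)]
heard = render_paths([1.0, 0.5], paths, out_len=12)
```

Each path's delay encodes its travel time and its gain encodes distance attenuation and reflection losses; summing them approximates the integral over all simultaneous source-to-listener paths.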
The actual sound rendering process is a pipeline of four stages. The first stage is the generation of each object's characteristic sound (recorded, synthesized, or derived from modal analysis of collisions). The second stage is sound instantiation and attachment to moving objects within the scene. The third stage is the calculation of the convolutions needed to describe the sound source's interaction with the acoustic environment. In the last stage, the convolutions are applied to the attached, instantiated sound sources. This process is demonstrated in |Figure SOUND3D.jpg| from [TAKALA92].
The convolution stage of this pipeline also deals with the effect of reverberation, an auditory cue that can lead to better spatial perception. Mathematically, reverberation is a convolution with a continuous weighting function; physically, it is just multiple echoes within the sound environment. The sound rendering technique approximates this by capitalizing on the fact that the wavelength of sound is similar to the size of objects in the environment, so reflections are diffuse. Sound diffraction also allows sound to propagate around an object, which has a "smoothing" effect on the sound. These observations allow the technique to use a simplified sound tracing algorithm, which is beyond the scope of this article; for more information, please consult [TAKALA92].
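One common way to approximate reverberation cheaply, shown here only as an illustration of the "multiple echoes" idea (it is not the sound tracing algorithm of [TAKALA92]), is a feedback comb filter: each output sample feeds back a decaying copy of the output from a fixed delay earlier, producing a geometric series of echoes.

```python
def comb_reverb(signal, delay, feedback, tail=0):
    """Feedback comb filter: out[n] = in[n] + feedback * out[n - delay].
    `tail` extends the output so the echoes can ring past the input."""
    n_out = len(signal) + tail
    out = []
    for n in range(n_out):
        x = signal[n] if n < len(signal) else 0.0
        echo = feedback * out[n - delay] if n >= delay else 0.0
        out.append(x + echo)
    return out

# A single impulse returns every `delay` samples at half the amplitude,
# a crude stand-in for a room's decaying echo pattern.
wet = comb_reverb([1.0, 0.0, 0.0], delay=3, feedback=0.5, tail=6)
```

Because the echoes decay geometrically (`feedback**k` on the k-th repeat), the filter mimics the smooth energy decay of a reverberant room at a tiny fraction of the cost of convolving with a full room impulse response.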
This method was designed for animated worlds that need not run in real time; it is unclear how it would fare in a real-time virtual reality application. However, its similarity to ray tracing and its unique approach to handling reverberation are noteworthy. [FOSTER92]
Front-to-back reversal refers to the effect of hearing a sound directly in front of a subject when it is really located directly behind, or vice versa. This classic problem can be diminished by accurately including the subject's head movement and pinna response; generally, the problem arises when these two cues are left out of the HRTF calculation. Another possible solution involves a different auditory cue, the early echo response: the inclusion of a first-order echo response has been shown to provide front-to-back differentiation for most test subjects [BURGESS92].
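The early echo cue amounts to mixing one first-order reflection in with the direct sound. A minimal sketch, with an invented delay and gain standing in for a measured reflection (e.g. a floor bounce):

```python
def add_early_echo(signal, delay, gain):
    """Mix a single first-order reflection into the direct sound:
    the original signal plus one delayed, attenuated copy."""
    out = list(signal) + [0.0] * delay
    for n, x in enumerate(signal):
        out[n + delay] += gain * x
    return out

# A short impulse followed, 4 samples later, by a quieter echo.
with_echo = add_early_echo([1.0, 0.0], delay=4, gain=0.3)
```

In practice the delay and gain would differ for front and rear source positions, and it is that difference which gives listeners the front-to-back cue.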
Intracranially heard sound is sound that is perceived inside one's head when the source is really external to it. This problem can be lessened by adding reverberation cues.
Other problems occur in the generation of the HRTFs themselves, specifically with measurements made with small microphones in the ear canal, as these microphones are prone to noise and linearity problems. The speakers used to generate the test sounds are also sometimes ineffective at low frequencies [BEGAULT90]. When an HRTF is generated for a particular subject, it contains characteristics based on the localization skills of that subject. Researchers have determined that by measuring several primary auditory cues with a subject who is skilled in localization, an HRTF can be created that is good enough for most of the population [BURGESS92].
Sound can also be used as a substitute for other sensory feedback in virtual environments. For example, pushing a virtual button is a task detected by a wired glove. Without haptic feedback, however, users have had difficulty knowing when the button was successfully activated [BEGAULT92]. Sound cues have been used to alleviate this problem: hearing the sound of the button being pushed gives users the immediate feedback needed to know that their actions were indeed successful.
Similarly, sounds can be used to compensate for sensory impairments of specific users. The Mercator project, for example, is researching the use of sound as an alternative, nonvisual interface to X Window System applications for visually impaired software developers [BURGESS92]. The goal of the project is to map the behaviors of window-based applications into an auditory space; spatial sound is being used heavily to relay information about the organization of objects on the user's screen.
Research has begun on using sound as an additional channel in human-computer interaction [SMITH93], but much more human factors work is needed before sound can be accurately used for data representation in user interfaces. The auditory channel is currently underutilized, and using sound alongside visual and other sensory outputs could increase the bandwidth of information relayed to users.
3D sound is a young technology, still in the early stages of development and understanding. More potential applications will continue to unfold as our understanding of spatial hearing, and of ways to artificially recreate it, continues to evolve.
[BEGAULT92]: Begault, Durand R. "An Introduction to 3-D Sound for Virtual Reality". NASA-Ames Research Center, Moffett Field, CA, 1992.
[BURGESS92]: Burgess, David A. "Techniques for Low Cost Spatial Audio". UIST 1992.
[FOSTER92]: Foster, Wenzel, and Taylor. "Real-Time Synthesis of Complex Acoustic Environments". Crystal River Engineering, Groveland, CA.
[SMITH93]: Smith, Stuart. "Auditory Representation of Scientific Data". Focus on Scientific Visualization, H. Hagen, H. Muller, G.M. Nielson, eds. Springer-Verlag, 1993.
[STUART92]: Stuart, Rory. "Virtual Auditory Worlds: An Overview". VR Becomes a Business, Proceedings of Virtual Reality 92, San Jose, CA, 1992.
[TAKALA92]: Takala, Tapio and James Hahn. "Sound Rendering". Computer Graphics, 26, 2, July 1992.