3D Sound Synthesis

Authors: Cindy Tonnesen and Joe Steinmetz


3D sound, often termed spatial sound, is sound as we hear it in everyday life. Sounds come at us from all directions and distances, and individual sounds can be distinguished by pitch, tone, loudness, and by their location in space. The spatial location of a sound is what gives the sound its three-dimensional aspect.

The constant influx of sound from our environment provides much information about the world around us. Slight echoes and reverberations in the surrounding environment give the brain cues about the direction and distance of objects. These cues also relay information about the size of the surrounding space; a small room, for example, has fewer echoes than one with cathedral ceilings. Additionally, objects outside the field of view can be sensed by hearing the sounds they emit, and those sounds serve as a cue to turn and locate the source. Finally, sound conveys information about the material qualities of objects and the environment around us. You can tell, for example, whether an object is soft or hard by dropping it on a hard surface and listening to the sound it makes. Similarly, you can gain information about the physical qualities of the ground through sound: walking on a wet surface yields squishing sounds as your feet make contact with it.

Being able to accurately synthesize such spatial sound would clearly add to the immersiveness of a virtual environment. Sounds are a constant presence in our everyday world and offer rich cues about our environment. Sound localization, however, is a complex human process, so efforts to artificially spatialize sounds must begin with an understanding of how humans actually hear and localize sounds.

Cues that aid in human sound localization

Humans use auditory localization cues to help locate the position in space of a sound source. There are eight sources of localization cues: interaural time difference, head shadow, pinna response, shoulder echo, head motion, early echo response, reverberation, and vision. The first four cues are considered static and the others dynamic; dynamic cues involve movement of the subject's body, which affects how sound enters and interacts with the ear [FOSTER92]. The following sections briefly define the eight localization cues [BURGESS92].

Interaural time difference

Interaural time difference describes the time delay between a sound arriving at the left and right ears. It is a primary localization cue for interpreting the lateral position of a sound source. The interaural time delay of a sound source directly in front of or behind a subject is approximately zero, while for a source to the far left or right it is around 0.63 ms. The frequency of, and the linear distance to, the sound source factor into the interaural time delay as well.
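
As a rough illustration of how this cue scales with direction, the following sketch uses the common spherical-head (Woodworth) approximation; the head radius, the speed of sound, and the formula itself are textbook assumptions rather than values taken from the references above.

    import math

    def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
        # Spherical-head (Woodworth) approximation:
        #   ITD = (r / c) * (theta + sin(theta))
        # where theta is the azimuth in radians (0 = straight ahead,
        # 90 degrees = far right). Head radius and speed of sound are
        # assumed nominal values.
        theta = math.radians(azimuth_deg)
        return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

    # A source straight ahead gives ~0 s; a source at 90 degrees gives
    # roughly 0.66 ms, close to the ~0.63 ms figure quoted above.
    print(interaural_time_difference(0.0))
    print(interaural_time_difference(90.0))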

Head shadow

Head shadow describes the fact that a sound must pass through or around the head in order to reach the far ear. The head can account for a significant attenuation (reduced amplitude) of overall intensity and also acts as a filter. The filtering effects of head shadowing make it harder to perceive the linear distance and direction of a sound source.
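
The attenuation and filtering can be caricatured with a single gain factor and a one-pole low-pass filter applied to the far ear's signal, as in the sketch below; the cutoff frequency and gain are arbitrary illustrative values, not a measured head model.

    import numpy as np

    def head_shadow(signal, sample_rate, cutoff_hz=2000.0, attenuation=0.7):
        # Crude far-ear model: reduce the amplitude and low-pass the signal
        # with a one-pole filter. Cutoff and gain are illustrative only.
        x = np.asarray(signal, dtype=float)
        alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
        y = np.empty_like(x)
        state = 0.0
        for i, sample in enumerate(x):
            state += alpha * (sample - state)
            y[i] = attenuation * state
        return y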

Pinna response

Pinna response describes the effect that the external ear, or pinna, has on sound. Higher frequencies are filtered by the pinna in such a way as to affect the perceived lateral position, or azimuth, and elevation of a sound source. The response of the pinna "filter" is highly dependent on the direction of the sound source.

Shoulder echo

Frequencies in the range of 1-3 kHz are reflected from the upper torso of the human body. In general, the reflection produces echoes that the ears perceive as a time delay partially dependent on the elevation of the sound source. The reflectivity also depends on frequency; some sources do not reflect as strongly as others. The shoulder echo effect is not a primary auditory cue; other cues have greater significance in sound localization.

Head Motion

Moving the head to determine the location of a sound source is a natural and key part of human hearing. Head movement occurs more often as the frequency of a sound source increases, because higher frequencies tend not to bend around objects as much and are therefore harder to localize.

Early echo response/reverberation

Sounds in the real world are the combination of the original sound source plus its reflections from surfaces in the world (floors, walls, tables, etc.). The early echo response occurs in the first 50-100 ms of a sound's life. The combination of the early echo response and the dense reverberation that follows appears to affect the judgment of sound distance and direction. Research in this area is still emerging and should eventually allow more accurate sound synthesis.
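
A first-order echo can be approximated by mixing a single delayed, attenuated copy of the signal back into the original, as in the sketch below; the delay and gain values are illustrative assumptions only.

    import numpy as np

    def add_early_echo(dry, sample_rate, delay_ms=30.0, gain=0.4):
        # Mix one delayed, attenuated copy of the signal back in, mimicking
        # a single early reflection arriving within the first ~50-100 ms.
        dry = np.asarray(dry, dtype=float)
        delay = int(sample_rate * delay_ms / 1000.0)
        out = dry.copy()
        if delay < len(dry):
            out[delay:] += gain * dry[:len(dry) - delay]
        return out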

Vision

Vision helps us quickly locate the physical position of a sound source and confirm the direction that we perceive.

All of these cues contribute in some way to the ability to locate a sound in 3D space, and 3D sound synthesis needs to account for them in order to provide convincing sound immersion. The difficulty in doing so is great: researchers do not fully understand exactly how the brain interprets the signals it receives from the ear, nor do they understand all of the characteristics that cause sound to be perceived in 3D space. As research continues, we will hopefully gain a better understanding of the human ear and how to emulate it.

Methods to synthesize spatial sound

In order to gain a clear understanding of spatial sound, it is important to distinguish monaural, stereo, and binaural sound from 3D sound. A monaural sound recording is a recording of a sound with one microphone. No sense of sound positioning is present in monaural sound.

Stereo sound is recorded with two microphones placed several feet apart, separated only by empty space. Most people are familiar with stereo sound; it is heard commonly through stereo headphones and in movie theaters. When a stereo recording is played back, the recording from one microphone goes into the left ear, while the recording from the other microphone is channeled into the right ear. This gives a sense of the sound's position as recorded by the microphones. Listeners of stereo sound often perceive the sound sources to be located inside their heads. This is because humans do not normally hear sounds the way stereo microphones record them, separated by empty space; in normal listening the head itself acts as a filter on incoming sounds.

Binaural recordings sound more realistic because they are made in a manner that more closely resembles the human acoustic system. Binaural recordings are made with the recording microphones embedded in a dummy head, and yield sounds that are perceived as external to the listener's head. They sound closer to what humans hear in the real world because the dummy head filters sound in a manner similar to the human head.

The head-related transfer function for 3D sound synthesis

In synthesizing accurate 3D sound, attempts to model the human acoustic system have taken binaural recordings one step further by recording sounds with tiny probe microphones placed in the ears of a real person. These recordings are then compared with the original sounds to compute the person's head-related transfer function (HRTF). The HRTF is a linear function of the sound source's position that takes into account many of the cues humans use to localize sounds, as discussed in the previous section. The HRTF is then used to derive pairs of finite impulse response (FIR) filters for specific sound positions; each position requires two filters, one for the left ear and one for the right. Thus, to place a sound at a certain position in virtual space, the pair of FIR filters corresponding to that position is applied to the incoming sound, yielding spatial sound.
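
In signal-processing terms, applying the filter pair amounts to convolving the mono source with the left- and right-ear impulse responses measured for that position. The sketch below shows this step only; the filter coefficients are assumed to come from an existing HRTF measurement set, which is not part of this article.

    import numpy as np

    def spatialize(mono, hrir_left, hrir_right):
        # Convolve a mono signal with the FIR filter pair (left/right ear
        # impulse responses) measured for one source position. Both filters
        # are assumed to have the same length.
        left = np.convolve(mono, hrir_left)
        right = np.convolve(mono, hrir_right)
        return np.stack([left, right], axis=-1)  # shape: (samples, 2)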

The computations involved in convolving the sound signal from a particular point in space are demanding. Refer to [BURGESS92] for details on these sound computations. The point to recognize is that the computations are so demanding that they currently cannot be performed in real-time without special hardware. To meet this need, Crystal River Engineering has implemented these convolving operations on a digital signal processing chip called the Convolvotron.

Sound rendering as a method for 3D sound synthesis

Sound rendering is a technique for generating a synchronized soundtrack for animations. This method of 3D sound synthesis creates a sound world by attaching a characteristic sound to each object in the scene. Sound sources can come from sampling or from artificial synthesis. The sound rendering technique operates in two distinct passes. The first pass calculates the propagation paths from every object in the space to each microphone; this data is then used to calculate the geometric transformations of the sound sources as they relate to the acoustic environment. Each transformation is made up of two parameters, delay and attenuation. In the second pass, the sound objects are instantiated and then modulated and summed to generate the final soundtrack. Synchronization is inherent in the use of convolutions that correspond to an object's position with respect to the listener [TAKALA92].
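
For the direct path, the two parameters follow from simple geometry: the delay is the path length divided by the speed of sound, and the attenuation falls off with distance. The sketch below assumes an inverse-distance law and a nominal speed of sound in air; [TAKALA92] describes the full treatment, including reflected paths.

    import math

    SPEED_OF_SOUND = 343.0  # m/s, assumed value for air

    def path_transformation(source_pos, listener_pos):
        # Delay and attenuation for the direct propagation path between a
        # sound source and the listener (positions in metres).
        distance = math.dist(source_pos, listener_pos)
        delay = distance / SPEED_OF_SOUND
        attenuation = 1.0 / max(distance, 1.0)  # clamp to avoid blow-up near 0
        return delay, attenuation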

The rendering technique treats a sound source as a one-dimensional signal with an intensity that varies over time. This is a simpler approach than the more traditional Fourier transform representation used in HRTF generation. The technique exploits the similarity of light and sound to provide the necessary convolutions. A sound source in space propagates sound waves in all directions, just as a light source does. As with light, sound waves can be reflected and refracted by the acoustic environment. A sound wave interacts with many objects in the environment as it makes its way to the listener, and the final sound that the listener hears is the integral of the signals from the multiple simultaneous paths existing between the sound source and the listener. The rendering algorithm cannot evaluate this function continuously and therefore must break it up into discrete calculations to compute the sound transformations.

The actual sound rendering process is a pipeline of four stages. The first stage is the generation of each object's characteristic sound (recorded, synthesized, or derived from modal analysis of collisions). The second stage is sound instantiation and attachment to moving objects within the scene. The third stage is the calculation of the convolutions needed to describe the sound source's interaction with the acoustic environment. In the last stage the convolutions are applied to the attached, instantiated sound sources. This process is illustrated in Figure SOUND3D.jpg from [TAKALA92].

The convolution stage of this pipeline also deals with the effect of reverberation, an auditory cue that can lead to better spatial perception. Mathematically, reverberation is a convolution with a continuous weighting function; in effect, it is the sum of many echoes within the sound environment. The sound rendering technique approximates this by exploiting the fact that the wavelength of sound is comparable to the size of objects in the environment, so reflections are diffuse. Sound diffraction also allows sound to propagate around an object, which has a "smoothing" effect on the sound. These observations allow the technique to use a simplified sound tracing algorithm, which is beyond the scope of this article; for more information please consult [TAKALA92].
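
The idea of reverberation as convolution with a decaying weighting function can be roughed out by convolving the dry signal with an exponentially decaying noise burst, as below; this is a generic approximation for illustration, not the simplified sound tracing algorithm of [TAKALA92].

    import numpy as np

    def simple_reverb(dry, sample_rate, decay_seconds=1.2, wet_gain=0.3):
        # Approximate dense reverberation by convolving the signal with an
        # exponentially decaying burst of noise (a crude room impulse response).
        dry = np.asarray(dry, dtype=float)
        n = int(sample_rate * decay_seconds)
        t = np.arange(n) / sample_rate
        impulse = np.random.randn(n) * np.exp(-3.0 * t / decay_seconds)
        wet = np.convolve(dry, impulse)[:len(dry)]
        return dry + wet_gain * wet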

This method was designed for animation, which is not necessarily real-time; it is unclear how it would perform in a real-time virtual reality application. However, its similarity to ray tracing and its unique approach to handling reverberation are noteworthy [FOSTER92].

Synthesizing 3D sound with speaker location

Still other efforts at meeting the real-time challenges of 3D sound synthesis have involved using strategically placed speakers to simulate spatial sound. This model does not attempt to simulate many of the human localization cues, focusing instead on attaching sampled sounds to objects in 3D space. Visual Synthesis Incorporated's Audio Image Sound Cube uses this approach, with eight speakers simulating spatial sound. The speakers can be arranged to form a cube of any size, with two speakers at each vertical corner of the cube, one up high and one down low. Pitch and volume of the sampled sounds are used to simulate sound location; volume is distributed among the speakers to give the perception of a sound source's spatial location. This solution gives up the accuracy yielded by convolving sound as in the previous two approaches, but it avoids their computational demands, allowing much less expensive real-time spatial sound.
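
The volume distribution the paragraph describes can be sketched as giving each speaker a gain that falls off with its distance from the virtual source. The speaker layout and the inverse-distance gain law below are illustrative assumptions, not Visual Synthesis Incorporated's actual algorithm.

    import math

    # Assumed layout: eight speakers at the corners of a 4 m cube centred
    # on the listener (two per vertical corner, one high and one low).
    SPEAKERS = [(x, y, z) for x in (-2.0, 2.0) for y in (-2.0, 2.0) for z in (-2.0, 2.0)]

    def speaker_gains(source_pos):
        # Give the speakers nearest the virtual source the largest share of
        # the volume; gains are normalized so they sum to 1.
        raw = [1.0 / (1.0 + math.dist(source_pos, s)) for s in SPEAKERS]
        total = sum(raw)
        return [g / total for g in raw]

    # Example: a source at (2, 2, 2) produces the largest gain on the
    # speaker in that corner.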

Problems with spatial sound

Spatial sound generation does have problems that tend to lessen its immersiveness. The most common of these problems are front-to-back reversals, intracranially heard sounds, and HRTF measurement problems.

Front-to-back reversal refers to a sound being heard directly in front of a subject when it is actually located directly behind, or vice versa. This is a classic problem that can be diminished by accurately including the subject's head movement and pinna response; generally, the problem arises when these two cues are left out of the HRTF calculation. Another possible solution involves a different auditory cue, the early echo response. The inclusion of a first-order echo response has been shown to provide front-to-back differentiation for most test subjects [BURGESS92].

Intracranially heard sound describes sound that is heard inside one's head when the source is really located external to the head. This problem can be lessened by adding reverberation cues.

Other problems occur in the generation of the HRTFs themselves, specifically with measurements made with small microphones in the ear canal, as these microphones are prone to noise and linearity problems. The speakers used to generate the measurement sounds are also sometimes ineffective at low frequencies [BEGAULT90]. When an HRTF is generated for a particular subject, it contains characteristics based on the localization skills of that subject. Researchers have determined that by using several primary auditory cues with a subject who is skilled in localization, an HRTF can be created that is good enough for most of the population [BURGESS92].

Applications of 3D sound

Sound has many potential applications in the areas of virtual reality and telepresence. As previously discussed, spatial sound could help increase the sense of presence in virtual environments by relaying information about the environment and the objects within it. Such environmental awareness could be very beneficial in improving the user's orientation within virtual environments.

Sound can also be used as a substitute for other sensory feedback in virtual environments. For example, pushing a virtual button is a task detected by a wired glove. Without haptic feedback, however, users have had difficulty knowing when the button was successfully activated [BEGAULT92]. Sound cues have been used to alleviate this problem; hearing the sound of the button being pushed gives users the immediate feedback they need to know that their action was successful.

Similarly, sounds can be used to compensate for sensory impairments of specific users. The Mercator project, for example, is researching the use of sound as an alternative, nonvisual interface to X Window System applications for visually impaired software developers [BURGESS92]. The goal of the project is to map the behaviors of window-based applications into an auditory space; spatial sound is being used heavily to relay information about the organization of objects on the user's screen.

Using sound as an additional channel for computer-human interaction has begun to be researched [SMITH93], but much more human factors work needs to be done before sound can be effectively utilized for data representation in user interfaces. The auditory channel is currently underutilized in user interfaces, and the potential exists to increase the bandwidth of information relayed to users by using sound alongside visual and other sensory outputs.

3D sound is a new technology that is still in the early stages of development and understanding. More potential applications will continue to unfold as our understanding of spatial hearing, and of ways to artificially recreate it, continues to evolve.

Preparation for research in spatial sound

Suggested areas of study for further education and research in the field of 3D sound synthesis include perceptual psychology, digital signal processing, human factors engineering, user interface design, and auditory perception.

References

[BEGAULT90]: Begault, Durand R. "Challenges to the Successful Implementation of 3-D Sound", NASA-Ames Research Center, Moffett Field, CA, 1990.

[BEGAULT92]: Begault, Durand R. "An Introduction to 3-D Sound for Virtual Reality", NASA-Ames Research Center, Moffett Field, CA, 1992.

[BURGESS92]: Burgess, David A. "Techniques for Low Cost Spatial Audio", UIST 1992.

[FOSTER92]: Foster, Wenzel, and Taylor. "Real-Time Synthesis of Complex Acoustic Environments" Crystal River Engineering, Groveland, CA.

[SMITH93]: Smith, Stuart. "Auditory Representation of Scientific Data", Focus on Scientific Visualization, H. Hagen, H. Muller, G.M. Nielson, eds. Springer-Verlag, 1993.

[STUART92]: Stuart, Rory. "Virtual Auditory Worlds: An Overview", VR Becomes a Business, Proceedings of Virtual Reality 92, San Jose, CA, 1992.

[TAKALA92]: Takala, Tapio and James Hahn. "Sound Rendering". Computer Graphics, 26, 2, July 1992.



