Effects of Spatial Audio on Communication During Desktop Conferencing
by Jessica J. Baldis



Chapter 2: Background

  1. Desktop Conferencing

    What is desktop conferencing? Desktop conferencing is a communication medium that allows individuals to use their personal computers to communicate with each other over a network. Desktop conferencing can refer to either audio-only conferencing, similar to a telephone conference, or video conferencing. During a desktop conference, each participant is seated at a computer. A microphone captures the participant’s voice as they speak (a camera may also capture their image), and the electrical signal is sent to the computer’s sound card, where it is sampled and digitized. The digitized signal is then compressed and transmitted across the network. When the compressed signal is received by the machines of the other conference participants, it is decompressed and converted back to sound by the sound card. The sound card then plays back the signal on the computer’s speakers or headphones. See Figure 1 (Summers, 1998).

    Figure 1: How Desktop Conferencing Works.
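
The audio path described above (capture, digitize, compress, transmit, decompress, play back) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8000 Hz sample rate, the 16-bit sample size, and all function names are hypothetical, a generated sine tone stands in for the microphone, and `zlib` is a lossless stand-in for the speech codecs that real conferencing software would use. No network is involved; the compressed bytes are simply handed to the receiving side.

```python
import math
import struct
import zlib

SAMPLE_RATE = 8000  # samples per second; an assumed telephone-quality rate

def capture_tone(freq_hz, duration_s):
    """Stand-in for the microphone: a sine wave as floats in [-1, 1]."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def digitize(samples):
    """Sending sound card: quantize to 16-bit signed integers, packed as bytes."""
    ints = [int(round(s * 32767)) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

def compress(pcm_bytes):
    """Stand-in codec (assumption): lossless zlib instead of a speech codec."""
    return zlib.compress(pcm_bytes)

def decompress(payload):
    """Receiving side: undo the stand-in codec."""
    return zlib.decompress(payload)

def to_samples(pcm_bytes):
    """Receiving sound card: unpack 16-bit integers back to floats in [-1, 1]."""
    count = len(pcm_bytes) // 2
    ints = struct.unpack("<%dh" % count, pcm_bytes)
    return [v / 32767 for v in ints]

# One pass through the pipeline: speak, send, receive, play back.
sent = capture_tone(440.0, 0.05)
payload = compress(digitize(sent))          # what would cross the network
received = to_samples(decompress(payload))  # what the listener's speakers get
```

Because the stand-in codec is lossless, the only difference between `sent` and `received` here is the 16-bit quantization error; a real codec would trade further accuracy for a smaller payload.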

    While desktop conferencing has been possible for many years, it has only recently become widely available at a reasonable price, and it is now experiencing a dramatic increase in use. Prior to 1995, real-time desktop conferencing was not possible using standard PCs with a 386 or 486 microprocessor. A person or company wishing to engage in real-time desktop conferencing had to do so either by using a high-end computer or by adding conferencing hardware to a standard PC. Either solution was expensive enough to pose a major barrier to use.

    By 1996, however, the standard PC had been upgraded to a Pentium microprocessor and was capable of sending and receiving audio and video over a standard 28.8 Kbps modem. This opened the door to the first desktop conferencing software available for use by the general public, CU-SeeMe. In the few years since CU-SeeMe was first launched, a variety of other desktop conferencing software packages have been introduced. These include Microsoft’s NetMeeting, VocalTec’s Internet Conference Professional, PictureTel’s LiveShare, SmithMicro’s Audio Vision, and NetPodium’s (formerly Metabridge) NetPodium. Currently, none of the commercial desktop conferencing packages incorporates spatial audio.

    While the number of commercially available desktop conferencing systems, and their usage, is increasing, desktop conferencing in its current state still has many shortcomings. The most commonly addressed issues are technical in nature, such as delays in video and/or audio transmission, poor quality, video and audio that are out of sync, and slow frame rates (Tang & Isaacs, 1993). There is also a series of less well addressed issues surrounding how to develop desktop conferencing into a richer, more productive communication medium.

    Buxton (1992) suggests that one such issue is the need to "establish a sense of shared presence or shared space among geographically separated members of a group" (p. 123). Mane (1997) also discusses the need for a sense of a shared, or group, space in a successful video-mediated communication environment. Mane outlines four levels of cues that are needed for there to be a sense of group space. These cues are: connectivity assurance cues, which provide simple knowledge and awareness of the other participants; audience response cues, which provide feedback; relationship/status cues, which allow a sense of group structure to be built; and focal assurance cues.

    Focal assurance cues provide information regarding the involvement of each participant, such as who is speaking, who is asking questions, or who is interrupting. "In a situation where the participants are not familiar with each other, it is especially hard to develop a sense of where people stand on issues when the contributions are not tied to a specific participant" (Mane, 1997, p. 403). Strong focal assurance cues allow the conference participants to develop a deeper sense of who the other participants are, and to build stronger contextual cues.

    Current desktop conferencing systems lack strong focal assurance cues, and conference participants may often experience a sense of confusion, particularly if the conference is made up of multiple or unfamiliar participants. A simple way of strengthening focal assurance cues would be to create a stronger sense of space during desktop conferences. This could be accomplished by adding spatial audio to desktop conferencing systems.

  2. Spatial Hearing

    Humans are extremely adept at locating a sound source in space, which makes sound an important source of information about the environment. Not only does sound provide us with information about the environment, it also provides cues that help us determine where to focus our attention. For example, if a twig snaps in a quiet forest, the "snap" cues us either to turn our head and focus our visual attention, or to be still and focus our auditory attention, in the direction of the sound. The following section briefly discusses how humans localize sound in the horizontal plane.

    The ability to localize sound is a function of binaural hearing, that is, of having two ears, each receiving separate audio input. Because each ear receives separate input, the brain can analyze differences in the sound reaching the two ears and locate the origin of the sound. There are three primary interaural differences used to determine the location of a sound: interaural time differences (ITDs), interaural intensity differences (IIDs), and interaural phase differences (IPDs).

    Interaural Time Differences (ITDs): ITD refers to the difference in the time a sound stimulus takes to reach each of the two ears. Imagine a tone emitted from a speaker to the left of a listener. As the tone travels through space, it will reach the listener’s left ear before their right, creating an ITD (see Figure 2). ITDs can range from 0 to 0.8 msec, and vary as the azimuth of the sound source changes. The human ear can detect differences as small as 0.01 msec (Yost, 1994).
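
To make the dependence of ITD on azimuth concrete, the sketch below uses Woodworth's spherical-head approximation, ITD = (r / c)(θ + sin θ), which is a standard textbook model and not one taken from this chapter. The head radius of 0.0875 m and the speed of sound of 343 m/s are assumed values, and the function name is hypothetical. With these assumptions the largest ITD (source directly to one side) comes out somewhat below the 0.8 msec upper bound quoted above; the exact figure depends on the assumed head size.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed)
HEAD_RADIUS = 0.0875    # m; an assumed average head radius

def itd_seconds(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Woodworth spherical-head approximation of ITD for a distant source.

    azimuth_deg: 0 means straight ahead, 90 means directly to one side.
    The approximation is intended for azimuths between 0 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))

# ITD grows as the source moves from straight ahead toward one side.
for azimuth in (0, 30, 60, 90):
    print(azimuth, "degrees:", round(itd_seconds(azimuth) * 1000, 3), "msec")
```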

     

    Figure 2: Interaural Time Difference

    Figure 3: Interaural Intensity Difference

    Interaural Intensity Differences (IIDs): IID refers to the difference in intensity, or volume, of a sound stimulus as it reaches each ear. A tone played to the left of the listener will be closer to the listener’s left ear than their right, and because intensity decreases with distance from the source, the sound will be less intense in the right ear than in the left. Also, as the sound travels through space, the listener’s head may act as an obstacle, creating a ‘sound shadow’ on the far side of the head (see Figure 3). The impact of the sound shadow depends on the wavelength of the sound compared with the dimensions of the head. Middlebrooks (1989) found that at higher frequencies the head’s sound shadow can cause up to a 35 dB difference between the two ears, while the auditory system is capable of detecting differences of less than 1 dB. Note that as the wavelength increases, the head is no longer large enough to shadow the sound, and IIDs become less significant in sound localization.
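
The relationship between wavelength and head size can be illustrated with a little arithmetic. The sketch below computes the wavelength of a tone as speed of sound divided by frequency and compares it with an assumed head diameter of 0.175 m. Treating "wavelength shorter than the head" as the point where shadowing becomes noticeable is a crude rule of thumb adopted here for illustration, not a threshold given in the source, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s in air (assumed)
HEAD_DIAMETER = 0.175   # m; an assumed average head diameter

def wavelength_m(freq_hz, c=SPEED_OF_SOUND):
    """Wavelength in metres of a tone at the given frequency."""
    return c / freq_hz

def head_shadows(freq_hz, head_diameter=HEAD_DIAMETER):
    """Rule of thumb (assumption): the head casts a noticeable sound shadow
    only when the wavelength is shorter than the head itself."""
    return wavelength_m(freq_hz) < head_diameter

# Low frequencies have wavelengths far larger than the head; high ones do not.
for freq in (200, 1000, 4000, 8000):
    print(freq, "Hz:", round(wavelength_m(freq), 3), "m, shadowed:", head_shadows(freq))
```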

    Interaural Phase Differences (IPDs): IPD refers to the difference in the phase of the sound wave as it reaches each ear, and depends on both the frequency of the stimulus and the ITD. Imagine a 1000 Hz tone that reaches the left ear 0.5 msec before the right. When the wave reaches the right ear, it will be 180 degrees out of phase with the wave at the left ear. IPDs are extremely useful, as the human ear can detect differences as small as 3 degrees, and the combination of IPD and ITD not only aids the listener in determining where the sound stimulus originated, but also helps identify the frequency of the stimulus (Yost, 1994).
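
The 1000 Hz example above follows from a simple relation: the phase difference is the fraction of a cycle that elapses during the ITD, so IPD in degrees equals 360 times frequency times ITD, wrapped to a single cycle. The short sketch below works through that arithmetic; the function name is hypothetical.

```python
def ipd_degrees(freq_hz, itd_s):
    """Interaural phase difference in degrees, wrapped to [0, 360).

    freq_hz * itd_s is the number of cycles that pass during the ITD;
    multiplying by 360 converts cycles to degrees.
    """
    return (360.0 * freq_hz * itd_s) % 360.0

# The example from the text: a 1000 Hz tone arriving 0.5 msec later
# at the far ear has completed half a cycle.
example = ipd_degrees(1000.0, 0.0005)
print(round(example, 6), "degrees")
```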

    Once the brain has analyzed these three differences (ITD, IID, and IPD), the location of the sound stimulus can be determined with relative accuracy. A great deal of research has gone into determining exactly how accurate the human listener is at pinpointing the origins of sounds. For a more in-depth discussion of human localization, please see Middlebrooks and Green (1991) or Blauert (1996).

     

    1. Spatial Sound Technology

      The most common technique for electronically generating spatial audio uses the three cues discussed above (ITD, IID, and IPD) together with Head Related Transfer Function (HRTF) technology. An HRTF refers to the spectral changes that occur as a sound travels from its source to the listener’s tympanic membrane, or eardrum. Each individual has a unique HRTF that is determined by the size and shape of their head, shoulders, and pinna, or outer ear.

      An individual’s HRTF can be measured by putting tiny microphones in the ear canal and measuring the amplitude and phase spectra of sound as it reaches the ear. The amplitude and phase spectra of the sound source can then be compared to the recordings made at the ear. Based on the comparison, an HRTF filter can be created. By taking recordings of broadband noise from sound sources systematically placed around the listener, an HRTF model of how the listener hears in three-dimensional space, at various frequencies, is created.
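
The comparison step described above can be sketched as a ratio of spectra: if X is the spectrum of the source and Y the spectrum recorded at the ear, the per-frequency transfer function is H = Y / X, whose magnitude gives the amplitude change and whose angle gives the phase shift. The example below is a toy illustration only. The "ear recording" is simulated as the source attenuated by half and delayed by three samples, which is an assumption standing in for real in-ear measurements, and the function names are hypothetical. A plain discrete Fourier transform is written out so the example needs only the standard library.

```python
import cmath
import math
import random

def dft(signal):
    """Plain discrete Fourier transform (O(n^2); fine for a short example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def transfer_function(source, at_ear):
    """H[k] = Y[k] / X[k]: per-frequency gain and phase shift, source to ear."""
    x_spec = dft(source)
    y_spec = dft(at_ear)
    # Bins where the source has no energy carry no information; report 0 there.
    return [y / x if abs(x) > 1e-9 else 0j for x, y in zip(x_spec, y_spec)]

# Broadband test signal: seeded random noise, as a stand-in for measurement noise.
random.seed(0)
N = 64
source = [random.uniform(-1.0, 1.0) for _ in range(N)]

# Toy "recording at the ear" (assumption): half amplitude, 3-sample circular delay.
GAIN, DELAY = 0.5, 3
at_ear = [GAIN * source[(t - DELAY) % N] for t in range(N)]

hrtf = transfer_function(source, at_ear)
amplitude = [abs(h) for h in hrtf]        # amplitude spectrum of the filter
phase = [cmath.phase(h) for h in hrtf]    # phase spectrum of the filter
```

In this toy case the amplitude spectrum is flat at 0.5 and the phase falls linearly with frequency, which is exactly what a pure attenuation plus delay should produce. A measured HRTF would instead show frequency-dependent peaks and notches caused by the head, shoulders, and pinna.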

      Spatial sound can then be reproduced using carefully positioned speakers or headphones. ITDs and IIDs can be generated by sending differing signals to the left and right speakers, while an HRTF filter is used to modulate the sound spectrum for the simulated sound location. While the highest localization accuracy occurs when an individual uses their own unique HRTF, it is not always practical to record unique HRTFs, and averaged HRTFs are often used.
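
A minimal sketch of the first part of this process, generating an ITD and an IID by sending different signals to the two channels, is given below. It delays and attenuates the signal for the ear farther from the simulated source. The sample rate, the 0.5 msec delay, the 6 dB level difference, and the function name are illustrative assumptions, and the HRTF filtering stage is omitted entirely, so this should be read as a simplified illustration and not as a complete spatializer.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed)

def spatialize(mono, itd_s, iid_db, source_on_left=True):
    """Return (left, right) channels with the far ear delayed and attenuated.

    itd_s:  interaural time difference in seconds (delay applied to the far ear)
    iid_db: interaural intensity difference in dB (attenuation of the far ear)
    """
    delay = int(round(itd_s * SAMPLE_RATE))   # ITD expressed in whole samples
    gain = 10 ** (-iid_db / 20.0)             # dB converted to an amplitude ratio
    near = list(mono) + [0.0] * delay         # padded so both channels match in length
    far = [0.0] * delay + [gain * s for s in mono]
    return (near, far) if source_on_left else (far, near)

# A short 500 Hz tone placed to the listener's left.
tone = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(441)]
left, right = spatialize(tone, itd_s=0.0005, iid_db=6.0)
```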

    2. Speech Intelligibility Research

      A large body of research has shown that spatial audio increases speech intelligibility. The "cocktail party effect" refers to a listener’s ability to focus on one sound source in the presence of many by filtering out unwanted distractions. Research into the cocktail party effect has examined both a listener’s ability to focus on one of two competing messages and a listener’s ability to recognize speech in the presence of noise.

      It has been shown that a listener’s ability to focus on one of two competing messages is dramatically enhanced by either dichotic or spatial audio (Egan et al., 1954; Dirks & Wilson, 1969). In addition, as two messages become closer in space, it becomes more difficult for a listener to focus on one of the messages (Egan et al., 1954; Spieth et al., 1954; Treisman, 1964).

      It has also been shown that when non-speech noise is present, spatialized speech is perceived more clearly than non-spatialized speech (McKinley et al., 1994). McKinley et al. also found that pilots listening to spatial speech during in-flight tests reported improved clarity over non-spatialized speech. For a detailed review of the literature, please see Erickson and McKinley (1996).

      During desktop conferences, participants often speak at the same time, creating competing messages. In addition, because of current technological constraints, the signal-to-noise ratio during conferences is often less than ideal. The above research indicates that adding spatial audio to desktop conferences may help increase speech intelligibility in these cases.

    3. Spatial and Auditory Memory

      A full discussion of the memory issues relevant to this thesis can be found in Robert H. Logie’s book Visuo-Spatial Working Memory (1995). The following will provide only a brief overview.

What makes something easy to remember? There is a plethora of theories on this question. In his 1892 discussion of memory, William James said:

"In mental terms, the more other facts a fact is associated with in the mind, the better possession of it our memory retains. Each of its associates becomes a hook to which it hangs, a means to fish it up by when sunk beneath the surface. Together, they form a network of attachments by which it is woven into the entire tissue of our thoughts. The secret of a ‘good memory’ is this, the secret of forming diverse and multiple associations with every fact we retain." (James, 1892).

James is saying, in essence, that the more cues are attached to something, the easier it will be to remember. In the case of adding spatial audio to desktop conferences, James’s theory suggests that the additional location cue should improve retention and recall of the conference’s content.

Turning to a more detailed model: although the claim is not undisputed, memory is commonly thought to divide into two categories, long-term memory and short-term, or working, memory. Working memory has been broken down further by psychologists such as Baddeley and Hitch (1974), who conceptualize it as a set of short-term memory functions that includes a central executive, responsible for reasoning, decision making, and coordinating the functions of working memory’s ‘slave systems’ (Logie, 1995). Baddeley and Hitch specified two distinct ‘slave systems’. The first is a verbal rehearsal loop, sometimes called the phonological loop, which is responsible for temporary retention of verbal material and semantic meaning. The second is the visuo-spatial sketch pad (VSSP), which is responsible for temporary retention of visual and spatial material.

The phonological loop and the VSSP function somewhat independently of each other (Baddeley et al., 1975; Baddeley, 1987). Because of this independence, they are susceptible to interference from different kinds of concurrent tasks. For example, trying to perform two spatial tasks may be a greater drain on memory resources than performing one verbal and one spatial task. In a series of experiments, Brooks (1968) required participants to perform either a verbal or a spatial task, and to give either a verbal or a spatial response. He found that participants performed best when the type of task and the type of response were different, such as a verbal task with a spatial response. This illustrates the point that if two tasks employ different parts of working memory, they will be "time-shared" more effectively (Wickens, 1992). Wickens and Liu (1988) also showed that spatial tasks were disrupted less when the phonological loop was used in concurrent information processing tasks. Conversely, phonological tasks are disrupted less by concurrent spatial tasks.

In the case of desktop conferencing, listening to and understanding what an individual is saying requires the phonological loop. In desktop conferences using non-spatial audio, the only speaker identification cues are verbal in nature, so a case could be made that the phonological loop is also used to determine who is speaking. As the research above suggests, having two concurrent tasks draw on the resources of the phonological loop is less efficient than distributing the tasks between the phonological loop and the VSSP. Baddeley and Lieberman (1980), however, showed that keeping track of the location of sounds in space is spatial in nature and is assigned to the VSSP. Thus, by adding spatial audio to a desktop conference, the task of speaker identification could be offloaded to the VSSP, increasing the efficiency of the whole memory system.


Human Interface Technology Laboratory