Wide Area Tracking

for

Computer Supported Collaborative Work

submitted by:

Richard May, Stu Turner, Kevin Audleman, Ioannis Kassabalides

as a partial requirement for the completion of

Industrial Engineering 543

Winter Quarter 1999

Professor Thomas A. Furness III, Ph.D.

12 March 1999

Introduction

This project examines wide area tracking technologies for application in computer supported collaborative work (CSCW). In the context of Furness' taxonomy of virtual reality (VR) development priorities, this project fuses the priorities of "accurate and low-latency tracking…" with "driving applications" [Furness, personal communication, 5 Jan 1999]. This fusion of topics was adopted by our team for three reasons: 1) current tracking technologies are not mature and do not offer performance adequate for highly desirable VR and augmented reality (AR) applications; 2) wide area tracking of head, hand, and body position and orientation introduces difficult engineering problems that can more easily be approached via a specific application framework; and 3) the HITL "Mixed Reality" project provides a unique approach to a small CSCW virtual workspace that seems ripe for broadening to wide area dimensions.

The goal of this project is to propose new research vectors to achieve a room-sized system for VR and AR applications, primarily CSCW-oriented applications. Our team envisions a conference-room-sized environment affording the placement of virtual information on multiple real room surfaces, the free movement of virtual and physical objects around the room, and the tracking of system users as they move about the room.

Such a system would afford highly desirable functionality: the representation of remote collaborators as life-sized avatars co-existing with the physical environment, the extension of tangible interface constructs [Ullmer 1997] for use between remote users, and the mixing of realities by allowing people, avatars, physical data, and virtual data to interact seamlessly, enabling new and unique collaborative capabilities.

Most existing CSCW tools introduce seams, or discontinuities, between how people think and act in the physical world and the virtual world. Ishii [Ishii 1994] described these functional and cognitive seams as important factors that limit the usefulness of computer-based collaborative tools. Minimizing these seams is a critical step in the development of superior CSCW tools. The HITL Mixed Reality prototype attempts to reduce these seams by supporting the use of current physical objects and interaction techniques in the virtual interface [Kato, Billinghurst, Weghorst, and Furness, 1999]. It also makes it easier for users to maintain eye contact with collaborators, which has been shown to be an important social cue for communication.

Existing CSCW tools also focus primarily on the person-to-person communication channel without consideration for the importance of the data objects within the environment. The purpose of group collaboration is often the design, analysis, or discussion of data. Examples of source data include design schematics, physical models, scientific data sets, or business information such as cost projections and competitive intelligence data. The issue is not only face-to-face communications but also the conveyance of knowledge about the object of study. This data-centric design philosophy causes a fundamental change in the information flow. Rather than information flowing directly between users, the object of study now serves as an important information link through which information passes.

The HITL Mixed Reality prototype provides strong support for the data-centric design of a CSCW tool. Data (both virtual and physical) is placed on the table between the collaborators, supporting interaction techniques just as in non-computer-assisted proximal collaboration.

The system we envision depends considerably upon an accurate, low-latency, wide area tracking capability, which thus dictates the primary thrust of our investigation. A wide area tracking system appropriate for CSCW applications needs to break the 'one sensor per object tracked' limitation. The potential number of objects being tracked far exceeds a system's ability to place sensors on every object. The system needs to track the user's head and hands while still maintaining the position and orientation of all objects in the environment. This needs to be accomplished while minimizing user and environmental encumbrances.

We have examined existing systems used for tracking, including these systems' performance, encumbrance, portability, flexibility, and methods of calculation. Unfortunately, no existing wide area tracking system adequately satisfies the need. We have considered the feasibility of expanding or modifying the most promising techniques to fulfill the system requirements.

A detailed description of the desired system capabilities is followed by a summary of relevant literature. Proposed directions for research to achieve the goal of a room-sized environment for virtual and augmented CSCW applications conclude this report.

Desired System Performance and Characteristics

Literature from psychology and human factors highlights the importance of natural, or "ecological" interfacing in collaboration among groups of individuals. Body positioning, gestures, eye contact, facial expressions, and common spatial reference frames enhance communication and contribute significant information to the human collaborative process [Tang, 1991; Wickens, Gordon, and Liu, 1997]. A VR/AR system designed to facilitate CSCW should adhere to principles of design that preserve these characteristics of human communication. Numerous "soft" interface design issues arise under these general constraints, but we will limit consideration to those issues of tracking, associated hardware, and performance characteristics that directly impact these communicative factors.

The overarching design requirement is for a room-sized environment. To accommodate collaborative groups of various sizes, we believe a small conference room is the minimum room size desirable. We avoid specifying exact dimensions since different applications will require different minimums and any developed technology should be scalable to support a range of facilities. However, the room should accommodate multiple individuals who may be collaborating among themselves and collaborating virtually with other individuals at distant locations. Employing AR techniques, collaborators could directly see and hear their peers in the room while simultaneously viewing video or avatars of distant collaborators in a mixed reality (MR) display. Similarly, each collaborator whether local or distant could interact with common MR objects and information presentations. In a MR display, natural communication characteristics need to be preserved for collaborators in the room and the potential exists to preserve most of these characteristics for distant collaborators as well.

Preserving these characteristics at distance dictates some stringent system tracking performance constraints. Preservation of fine directional information, such as that required to afford very accurate spatial reference frames, pointing gestures, and especially eye contact, requires precise registration of virtual objects with the real world. Positional accuracy and image stabilization precision of virtual objects must be very good, and system latency must be minimized to avoid unacceptable dynamic registration errors along these dimensions. Specification of minimum performance requirements is a difficult prospect. What is "good enough" for a given application? Our approach is to examine existing system performance parameters and then seek methods to improve performance. For instance, hybrid systems using multiple tracking techniques may offer one means of improving performance. For a research-oriented CSCW application we believe a system approaching the following user perceived performance values should be a goal, but not a rigid constraint determining success or failure:

Perceived Latency < 5 ms

Perceived Accuracy < 5 min. of arc

Perceived Precision < 5 min. of arc

Obstructing instrumentation also threatens the preservation of natural communication among collaborators in virtual environments. Tracking and display systems may require collaborators to wear any combination of devices unnatural to the face-to-face collaboration environment. These systems frequently require instrumenting the environment as well as the system users, potentially impeding natural interactions in the collaborative space. With current technology, completely avoiding such encumbrances is impossible. However, we seek a system that offers an elegant solution to the tracking problem while balancing user encumbrance and environmental instrumentation. Design trade-offs should attempt to minimize interference with natural human communication.

Finally, the tracking system should afford ease in set-up and calibration. Again, specifying minimum requirements is difficult since various systems require widely varying calibration and set-up tasks. Assessing an acceptable level of difficulty for these tasks is highly scenario dependent. However, reducing the time and effort required with existing techniques should be the initial goal. Eliminating these tasks completely for the user should be a long-term goal.

In order to properly frame our proposed vectors for new research into wide area tracking systems for CSCW, a summary of existing technologies and techniques is necessary. An extensive VR literature base describes various tracking schemes, and several articles pertinent to our proposed research directions are reviewed below.

Tracker Technology Review

Electromagnetic Trackers

Magnetic trackers have served as the primary tracking technology in almost every aspect of VR research for the last decade. Because of this, we felt it was appropriate to consider this technology in greater detail.

Magnetic trackers work by setting up a magnetic field and measuring its strength at discrete locations. A transmitter is placed in a stationary location to set up the field, and receivers measure their location relative to the transmitter. The transmitter contains three orthogonal electromagnetic coils; pulsing these coils sets up a magnetic field in three-dimensional space. A receiver contains three orthogonal coils just like the transmitter, and these take on a voltage as they pass through the field. The received voltages are compared to the transmitted voltages, from which location and orientation can be determined. All measurements are made relative to the transmitter, so to get world coordinates a transform from transmitter coordinates to world coordinates is required, as sketched below.
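
The transform is a standard rigid-body change of coordinates. The following minimal Python sketch illustrates the idea; the transmitter pose (mounting height, rotation) and the receiver reading are hypothetical values, not measurements from any actual system.

    import numpy as np

    def make_transform(R, t):
        """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = t
        return T

    # Hypothetical transmitter pose: mounted 36 inches above the floor origin,
    # axes rotated 90 degrees about vertical (Z).
    theta = np.radians(90)
    R_world_tx = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                           [np.sin(theta),  np.cos(theta), 0.0],
                           [0.0,            0.0,           1.0]])
    T_world_tx = make_transform(R_world_tx, np.array([0.0, 0.0, 36.0]))

    # A receiver reading in transmitter coordinates (inches), as a homogeneous point.
    p_tx = np.array([30.0, 4.75, 0.0, 1.0])
    p_world = T_world_tx @ p_tx          # the same point in world coordinates
    print(p_world[:3])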

Advantages

Electromagnetic trackers have several advantages over other technologies. They are popular and supported by many years of field use. The receivers are small and unobtrusive, typically smaller than a 1" cube, and can be fitted onto an existing head-mounted display with very little extra burden. Multiple-object tracking is also available: the Flock of Birds transmitter made by Ascension, for instance, supports up to 16 receivers at one time.

Disadvantages

There are also several disadvantages. First, the receivers are tethered. To transmit the information from the receivers, wires must be run from each one to the computer, limiting the motion of the user. Untethered magnetic trackers are now appearing on the market, but the wiring is replaced by comparatively bulky IR transmitters and battery packs. Second, transmitters typically have a small range, with accurate measurements ending at a few feet. Part of this is due to the limited power of the transmitter, but a large part comes from side effects created by environmental conditions. Electromagnetic interference from radios and other devices can cause erroneous readings. Also, large objects made of ferrous metal can cause disruptions in the magnetic field if they are too close, where "too close" generally means closer to the transmitter or receiver than the two are to each other. This means that for an environment to function optimally with a magnetic tracking system, metal objects must be kept clear. In an office environment where metal desks, filing cabinets, and computers cannot be removed, working range is limited. Another problem is that most buildings contain metal support structures in the floor, walls, and ceiling that further reduce the working range of the tracker. In most practical environments, providing a metal-free area is virtually impossible.

Researchers have developed techniques to account for these environmental conditions and compensate for their effects. Warps in the magnetic field can be mapped, allowing errors in measurements to be predicted and compensated for. Researchers at the University of Illinois developed calibration matrices to correct for errors in their CAVE Extended Range Transmitter Flock of Birds (ERTFOB).

In 1995, Ghazisaedy [Ghazisaedy, 1995] reported on the use of an ultrasonic measuring device (UMD) to correct for errors in the CAVE ERTFOB. The CAVE environment is a 10'x10'x10' room. The process used to calibrate the CAVE was to create a 1'x1'x1' grid of points and measure the actual position of the receiver with the UMD as well as the ERTFOB-reported position. From this information a 3D calibration matrix was produced. When measurements are taken in the CAVE, they are first passed through this matrix to correct for errors. The UMD is not suitable for real-time tracking, but can be used to perform static measurements, and is accurate to about 1.5% of the distance measured. The matrix is only valid for the environment in which it was created, thus a new correction grid must be generated every time the CAVE is moved or something near the environment is moved.

The data gathered to generate the correction matrix showed significant error in the ERTFOB. The errors over the 10' cubed area ran between 0.2' and 3.7'. After generating the 3D correction matrix, they measured new values at the centroid of each of the original 1'x1'x1' grid cells, which maximized the required interpolation distance. They found the error at these points ranged between 0.01' and 0.29'.
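
The published report does not give the interpolation code, but a correction grid of this kind is typically applied by trilinear interpolation between the eight lattice points surrounding a reading. A minimal Python sketch, with a hypothetical grid layout and spacing, follows:

    import numpy as np

    def correct_position(p, grid, origin, spacing):
        """Apply a 3D correction grid to a raw tracker reading. grid[i, j, k]
        holds the error vector (true - reported) at that lattice point; the
        correction at p is trilinearly interpolated from the eight
        surrounding lattice points."""
        g = (np.asarray(p, dtype=float) - origin) / spacing   # grid coordinates
        i0 = np.clip(np.floor(g).astype(int), 0, np.array(grid.shape[:3]) - 2)
        f = g - i0                                            # fractional offsets
        corr = np.zeros(3)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    w = ((f[0] if dx else 1 - f[0]) *
                         (f[1] if dy else 1 - f[1]) *
                         (f[2] if dz else 1 - f[2]))
                    corr += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
        return np.asarray(p) + corr

    # Hypothetical 11x11x11 grid of correction vectors on a 1-foot lattice
    # spanning a 10' cube (all zeros here, standing in for measured errors).
    grid = np.zeros((11, 11, 11, 3))
    print(correct_position([3.4, 5.1, 2.7], grid, origin=np.zeros(3), spacing=1.0))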

In order to further study calibration techniques, we gained access to an Ascension Flock of Birds transmitter and performed our own experiments. Our goal was to quantify the errors we could expect in a typical office environment, and to determine the feasibility of calibration and compensatory methods.

Ascension ERTFOB Magnetic Tracker Tests

The HITL has an ERTFOB system and we performed an informal test to evaluate the accuracy of the ERTFOB in a CSCW environment. The test was conducted in Fluke Hall room 115F. Three experiments were conducted to evaluate the positional accuracy of the system. The coordinate system used by the ERTFOB has the X axis and Y axis on the horizontal plane and the Z axis for elevation information.

The ERT was placed on a plastic stand approximately 36 inches above the floor. An effort was made to maximize the distance between movable metal structures and the ERT. The ceiling structure, approximately 6 ft. above the ERT, contains extensive metal structures. No information was available concerning the structure of walls or floors.

Dead Zone Experiment:

The sphere immediately surrounding the ERT does not register sensor data. We measured the radius of this sphere on the horizontal plane, placing the receiver on a plastic stand to ensure constant height. This was measured by moving the receiver toward the ERT until distances were no longer reported and then slowly moving back out to the point where the system started to report values.

The numbers reported by the ERTFOB were approximately 25 inches and are depicted in Figure 1. The distance collected by physical measurement was approximately 28 inches from the center of the ERT. The difference is attributed to the fact that the ERT does not measure from its center but from the diameter of the coils. This test indicates that ERTFOB interactions need to take place at least 28 inches away from the center of the ERT.

Elevation accuracy along each horizontal axis:

In this experiment, two receivers were placed 4.75 inches apart on a horizontal plastic stand. The elevation of the stand was approximately level with the ERT. Measurements were then taken at one-foot increments moving away from the ERT. The first measurement was at 30 inches. Figures 2 and 3 show graphs of the data collected for the two tests performed.

This data shows the reported elevation increasing when, in fact, it was remaining constant. The reported elevation was consistent for both receivers in both conditions. The ERTFOB documentation indicates this sort of error is due to metal support structures in the floor. We could potentially reduce this error by raising the ERT 18 inches higher to place it directly between the floor and ceiling.

The reported elevation plateaus for the last measurement in the +Y axis (Figure 3). This measurement was taken adjacent to a metal cabinet that may have influenced the magnetic field. It is impossible to tell without taking additional measurements near the cabinet.

The collected data also shows some variance in the reported distance between the two receivers. The actual distance remained a constant 4.75" but the reported values ranged from 4.62" to 8.24" for the –X axis and 4.52" to 5.08" for the +Y axis. For the –X axis data, these values were generally increasing as the receivers moved away from the ERT. For the +Y axis the opposite was true. The values decreased as the receiver moved away from the ERT. The +Y axis values were much tighter and more accurate than those for the –X axis.

Positional accuracy of several points on a plane:

This experiment measured how accurately the ERTFOB matched known distances. A nine-point grid was placed on a table. The point arrangement and distances are shown in Figure 4. Two receivers were used for the first test, in which the ERT was approximately 55" from the center of the grid (point E). Each receiver was placed on all nine points and the values recorded. A second test was run with the ERT approximately 105" away from point E. Only one receiver was used in the second test. Because each receiver had to be visually aligned with the points on the table and the point placement itself was conducted manually, some errors were introduced. The total physical error between any two points would not exceed 0.125".

In the first test, both receivers showed an average error of 0.42" in the distance between any two neighbors. The errors ranged from 0.01" to 0.89". For the second test this value more than doubled to an average error of 0.95", with individual errors ranging between 0.23" and 2.24". These numbers are without correction for the 0.125" potential experimental error.

Analysis of using Magnetic Tracking in a conference room environment

Magnetic receivers are tethered devices. Although not a significant logistical problem, this is inconsistent with our goal of minimizing user encumbrance. One possible method for handling the tethers would be to run them to a hub under the conference table; this way, user movement would only be limited away from the table.

We anticipate the environment will contain substantial amounts of metal. Chairs and tables are potential troublemakers, as are computers, overhead projectors, and lamps. A typical office building has substantial metal structures in the ceiling and floors, which adds to the problem. Properly calibrating the environment is a necessity, but doing so could prove problematic. Calibration matrices depend on a consistent scene, but metal objects in a conference room will move around. Chairs will move, as will the metal objects that people bring with them (such as laptops). Re-calibration would have to be performed constantly, on the fly, and no current techniques exist to perform this task.

Acoustic Trackers

Acoustic tracking is a fairly simple technique, typically employing ultrasonic frequencies so the signals are inaudible to humans. A transmitter emits ultrasonic sound, and a special microphone receives it. The time it takes the sound to reach the microphone is measured, and from that, distance can be calculated. This technique is known as time-of-flight (TOF). One microphone and one transmitter allow determination of distance only, so triangulation of multiple signals is needed to find position; three microphones or three transmitters are needed for this. To get both position and orientation, three transmitters and three microphones are needed. A sketch of the position calculation appears below.
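
The following Python sketch shows the core of the TOF position calculation. It linearizes the range equations by subtracting the first from the rest; we use four non-coplanar microphones to avoid the mirror ambiguity of the three-microphone case. All positions and flight times are hypothetical.

    import numpy as np

    SPEED_OF_SOUND = 343.0   # m/s at about 20 C; varies with temperature etc.

    def tof_position(mics, tofs):
        """Estimate a transmitter's position from time-of-flight readings to
        microphones at known positions. Subtracting the first range equation
        from the others gives a linear system; four non-coplanar microphones
        avoid the mirror ambiguity of the three-microphone case."""
        d = SPEED_OF_SOUND * np.asarray(tofs)     # ranges (m)
        m = np.asarray(mics, dtype=float)
        A = 2.0 * (m[1:] - m[0])
        b = (np.sum(m[1:] ** 2, axis=1) - np.sum(m[0] ** 2)
             - d[1:] ** 2 + d[0] ** 2)
        x, *_ = np.linalg.lstsq(A, b, rcond=None)
        return x

    # Hypothetical ceiling array (meters) and flight times for a source
    # at roughly (1.5, 1.5, 1.0).
    mics = [(0, 0, 2.5), (4, 0, 2.5), (0, 4, 2.5), (2, 2, 1.0)]
    tofs = [0.00758, 0.00956, 0.00956, 0.00206]
    print(tof_position(mics, tofs))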

Another less commonly used method is phase coherency. This technique measures the phase difference between the tracking signal and a reference signal. The difference is used to calculate changes in the position of the transmitter relative to the receiver. The problem with this method is that errors accumulate over time. Periodic re-calibration is necessary.

A third technique is passive acoustic tracking. One such system uses a four-microphone array as detectors and the human voice as the transmitter [Omologo, 1996]. The technique uses crosspower spectrum phase analysis and, in a room with moderate noise and reverberation, shows localization accuracy between 2 cm and 10 cm. A sketch of the underlying delay estimation follows.
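
Crosspower spectrum phase analysis estimates the difference in arrival time of a sound at two microphones; the source is then located from the pattern of delays across the array. A minimal Python sketch of the delay estimate (our own illustration, not Omologo's implementation) follows:

    import numpy as np

    def csp_delay(x, y, fs):
        """Estimate the arrival-time difference between two microphone signals
        using the crosspower spectrum phase (whitened cross-correlation):
        only phase, i.e. delay, information is kept, then the correlation
        peak gives the delay."""
        n = len(x) + len(y)
        X = np.fft.rfft(x, n=n)
        Y = np.fft.rfft(y, n=n)
        cross = X * np.conj(Y)
        cross /= np.abs(cross) + 1e-12        # phase transform (whitening)
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return (np.argmax(np.abs(cc)) - max_shift) / fs

    # Synthetic check: y is x delayed by 5 samples, so the estimate should be
    # 5/fs (negative here, by this function's sign convention).
    fs = 16000
    sig = np.random.randn(1024)
    x = np.concatenate((sig, np.zeros(5)))
    y = np.concatenate((np.zeros(5), sig))
    print(csp_delay(x, y, fs))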

Advantages

The technology is inexpensive, and the parts used to make an acoustic tracker are readily available. The microphones and transmitters are small, and can easily be placed unobtrusively on the body. The technique also has a longer range than most other technologies.

Disadvantages

First, line of sight must be maintained between the transmitter and receiver. Second, the speed of sound in air varies with temperature, pressure, and humidity; steps must be taken to account for the environment in order to avoid inaccurate calculations. Furthermore, ambient sounds and echoes can cause interference: both create "ghost" signals that interfere with incoming signals. These effects can drastically reduce the accuracy and effective working area of acoustic trackers. Techniques to effectively address both ambient sounds and echoes are just starting to appear in the literature.

Evaluation of technology for use in a conference room environment

A conference room is a controlled environment, so fluctuations in the atmosphere would not be a concern. The trackers could be calibrated once for temperature, humidity, and pressure, and no further corrections should be necessary. Implementation could use either participant-mounted transmitters or microphones; in either case, the technology is small and light, posing minimal encumbrance. Line of sight could be maintained by redundantly instrumenting the ceiling. Ambient noise, however, would continue to be a problem. A conference room is designed as a place to hold conversations, so a large amount of sound would be expected. Using narrow-frequency ultrasonic transmitters could reduce this problem. Regardless, echoing would continue to be a problem, because conference rooms are full of hard, sound-reflecting surfaces. The conference table, chairs, and even the walls all contribute to a very reflective environment. Hence, an acoustic system would have to deal with a large amount of interference. The tracking of items other than the users would still require instrumenting all objects.

Inertial Trackers

Inertial trackers attach directly to a moving body and give an output signal proportional to their motion with respect to an inertial frame of reference. There are two types of inertial tracking devices: accelerometers, which sense and respond to translational accelerations, and gyroscopes, which sense rotational rates by exploiting the conservation of angular momentum [Sowizral, 1995].

Accelerometers

Accelerometers sense and respond to translational accelerations. Their outputs need to be integrated once with respect to time to get velocity, and twice to get position. Many technologies are used to implement accelerometer designs, including piezoelectric, piezoresistive, and capacitive technologies [Baratoff].

The designs listed above are primarily pendulous. Pendulous accelerometers contain an inertial proof mass, a part of the sensor with a known mass, which is attached to the rest of the sensor by a spring-like hinge or tether. When this type of sensor undergoes acceleration, the proof mass stays stationary and the spring is deformed. The deformation of the spring is what is measured and transduced into the output signal of the sensor. Another method allows the proof mass to move, and its displacement is transduced into the output signal.

Piezoelectricity is a common pendulous sensing technique. Piezoelectric materials develop distributed electric charges when displaced or subjected to forces. Piezoelectric accelerometers employ a cantilever design with a piezoelectric film attached to the beam of the cantilever. When accelerated, the proof mass causes the beam to deflect, which in turn causes the piezoelectric film to stretch, resulting in an electric charge difference (the output of the sensor). Piezoelectric accelerometers are called active devices because they generate their own signals. Since these sensors require a time-varying input (physical work), they do not respond to steady-state inputs such as the acceleration of gravity, hence they are called AC responsive (sense only changing signals).

Another common transducer technology used for accelerometers is piezoresistivity. Piezoresistive materials change their electrical resistance under physical pressure or mechanical work. If a piezoresistive material is strained or deflected, its internal resistance will change, and will stay changed until the original position of the material is restored. Piezoresistive accelerometers can sense static signals and are thus called DC sensors.

The most commonly used type is the capacitive pendulous accelerometer. This type of accelerometer contains a capacitor with one of its plates being the proof mass. When the sensor is accelerated, the proof mass moves, and the voltage across the capacitor changes. The amount of change in the voltage corresponds to the applied acceleration.

Gyroscopes

There are two main branches of gyroscope design: mechanical gyros that operate using the inertial properties of matter, and optical gyros that operate using the inertial properties of light.

Mechanical gyroscope designs are commonly of the vibrating type. Vibrating gyroscopes use Coriolis acceleration effects to sense when they rotate. This is accomplished by establishing an oscillatory motion orthogonal to the input axis in a sensing element within the gyro. When the sensor is rotated about its input axis, the vibrating element experiences Coriolis forces in a direction tangential to the rotation (orthogonal to the vibratory and rotating axes).

Optical gyroscopes operate based on the "Sagnac Effect." These sensors use two light waves traveling in opposite directions around a fixed path. When the device is rotated, the light wave traveling against the direction of rotation will complete its revolution faster than the light wave traveling with the rotation. This effect is detected by comparison of the phase difference in the two light waves.

Advantages

Inertial trackers are small and potentially unobtrusive. AMP makes a model, the ACH-04-08, that has dimensions of 0.4 x 0.4 x 0.06 inches. Inertial trackers have an update rate that is one of the fastest of any tracking technology. InterSense (www.isense.com) markets a product, the IS-600 Mark 2 Precision Motion Tracker, that has an update rate of up to 500Hz. Inertial trackers are unaffected by external fields and can be used in almost any environment.

Disadvantages

The main disadvantage of inertial trackers is that they are subject to considerable drift. Positions are found by integrating the tracker signals over time (t), and any signal errors are integrated as well. As a result, position errors accumulate. To find position with an accelerometer, the raw data is integrated twice, so errors grow with t²; even small amounts of noise can lead to large errors. In just 60 seconds, an accelerometer with an output error level of just 0.004 g yields a position uncertainty of about 70 meters (see the worked example below). Gyroscope errors increase linearly with time and are not quite as susceptible to large errors; they drift approximately 10 degrees per minute and are also sensitive to vibrations. Because of this inherent buildup of absolute positional errors, inertial trackers are limited in usefulness to relative motion sensing or tracking over brief periods.
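
The 70-meter figure corresponds to treating the 0.004 g error as a constant bias and integrating it twice. A few lines of Python reproduce the arithmetic:

    # A constant 0.004 g accelerometer error, integrated twice, produces a
    # position error that grows quadratically with elapsed time.
    g = 9.81                          # m/s^2
    bias = 0.004 * g                  # constant acceleration error (m/s^2)
    for t in (1.0, 10.0, 60.0):       # elapsed time (s)
        print(t, 0.5 * bias * t**2)   # position error: ~0.02 m, ~2 m, ~70.6 m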

Analysis of use in a conference room application

The conference room environment has characteristics that cause problems with many tracking technologies, but inertial tracking avoids most of these. Inertial devices are small and lightweight, not adding significantly to the encumbrance of the user. The advantages of inertial tracking are its environmental stability and small size.

Their disadvantage is drift. We need accurate head positioning in world coordinates, and inertial tracking technologies are not advanced enough to support this. Inertial trackers could potentially be combined with other technologies to correct for this, but by themselves are not sufficient for our needs.

Mechanical Trackers

Mechanical tracking systems range from simple armature devices to complex whole-body exoskeletons. They measure limb positions through mechanical linkages, where the relative position of each link is compared to give final positions and orientations. Measurements are usually joint angles and lengths between joints [Hand, 1994]. Potentiometers and LVDTs are typically used as transducers, but a wide array of technologies can be used, including resistive bend sensors and fiber optics. Mechanical trackers can be ground-based or body-based. A tracker is called ground-based when it is attached to the ground, allowing the user to work only within its constraints. Body-based trackers attach the entire system to the user, allowing freedom of movement [Kolozs 1996].

Mechanical trackers are used for all parts of the body. The smallest device is the glove, which measures the location of the hand and fingers. Virtual Technologies (www.virtex.com) produces the Cyberglove, which has an update rate of 110 Hz and a resolution of about 0.5°. It uses resistive bend sensors to detect finger bend. Fifth Dimension Technologies (www.5dt.com) manufactures the 5th Glove, which has a refresh rate of 200 Hz and uses fiber optic sensors to track the position of each finger [Hand, 1994].

The next step up is the mechanical arm, which measures the orientation of the arm, elbow, and wrist. To fully capture the positions and rotations of the arm, many DOF must be measured. The Dexterous Arm Master, produced by Sarcos, uses 4-bar linkages to measure 20 DOF.

The highest level is the complete body suit, which measures the positions of each body part. The VR Bodysuit, a device produced by the Advanced Digital Systems Laboratory, measures the joint angles of the entire body. It currently measures 8 DOF, but a more advanced system is being designed.

Mechanical trackers are also combined with structural armatures to support the weight of technology that is too heavy for users. The BOOM, manufactured by Fakespace (www.fakespace.com), is an armature mounted display that is too heavy to put in a head-mounted device. Tracking technology is integrated into the armature alleviating the requirement for a separate tracking system.

Advantages

There are many advantages to mechanical trackers. They are impervious to external fields or materials and thus can be used in a wide range of environments. If properly attached to the body, the exoskeletons accurately reflect the user's joint angles. The sensors used are taken from well-developed technology in other fields and are inexpensive, accurate, and fast. These systems are inherently fast and do not severely limit bandwidth or latency. Systems that are contained on the body are not limited to a confined workspace and can be used in many environments impractical for other tracking techniques.

Disadvantages

Unfortunately, mechanical trackers are very cumbersome. They can be bulky and heavy, and they severely limit the motion of the user. Devices are attached to soft tissue, trading comfort for accuracy; a tight fit is necessary to maintain accuracy. Making a system feasible for marketing requires it to be robust enough to fit multiple users of different height, weight, and gender. This is a difficult problem at best! The machine-to-human connection is also subject to shifting: relative motion between the user and the device can occur, resulting in random error during use.

Analysis of mechanical trackers in conference room application

Mechanical trackers have the advantage of stability. The large amounts of metal and flat, reflective surfaces that wreak havoc on magnetic and acoustic systems would not cause a problem for mechanical tracking systems.

There are, however, many problems with the potential use of mechanical trackers for CSCW applications. First, our goal of minimizing user encumbrance is brutally violated: mechanical trackers would produce the greatest amount of encumbrance possible. Accommodating many different users with different body types would require very robust devices, which is currently a limitation of the technology. Some advantages of these trackers, the lack of independent emitters and the ability to use them in many different environments, are both nullified in our setting: a transmitter can easily be built into the room, and our constrained environment does not benefit from the versatility of the tracker.

Video Tracking

Video tracking methodologies are radically different from other VR tracking schemes. Other technologies provide direct, simple input to the VR system. This input usually takes the form of a 1-D data-stream of position, orientation, or acceleration of the object being tracked. This is not the case with video tracking. Video output is a series of 2-D, time sequential images. The size, frame rate, view volume, optical distortion and other factors are all dependent on the type of video used, any filters applied, and the analysis techniques employed. Currently there are very few integrated video-tracking products available for virtual environment work. This is partially because video output is much more complex than other tracking technologies' output, requiring substantially more processing power for analysis. The number of possible combinations of technical factors in video tracking systems is vast, and typically each system must be customized to fit the intended application.

The number of computations required to derive tracking information from a video image can be enormous. Until recently the use of video tracking for VR has mostly been an academic exercise. This has changed in the last few years as computer processing power has increased to a level sufficient to support video tracking techniques. The technology is now advancing rapidly. As a demonstration event for video tracking technology, the annual RoboCup pits teams of autonomous robots against each other in a game of soccer. The robots negotiate the field, avoid other robots, and score goals, all with video cameras as the primary sensory input.

Video tracking usually takes one of two forms:

    1. Outside-In: placing the camera in the environment and viewing the object(s) being tracked.
    2. Inside-Out: placing the camera on the object being tracked and viewing the environment.

These techniques can be combined: the camera can be placed on an object and that object tracked by looking at the environment, but other objects that appear in the video frame can also be tracked. Regardless of the technique, the image needs to be analyzed to find and recognize objects in the scene. We will discuss three different approaches to recognizing objects in the video scene: (1) placing markers, called fiducial markers, in the environment, (2) recognizing natural objects, and (3) recognizing people.

Using Environmental Markers

Currently, the most common approach to registering the position and orientation of objects, including the human body, is to place known markers in the physical environment or on the object itself. These markers have known qualities that can be easily detected in the video image. The techniques being developed are varied, and new approaches are common in the research literature. One method is to use the inside-out approach and place multiple markers in the scene. The markers can then be used to determine the position and orientation of the camera. In one system, the MIT Media Lab used three fixed fiducial markers, each with unique color values, and ensured that those hues did not appear elsewhere in the environment [State 1996]. As long as the markers remained in the image frame, the position of the user's head (with mounted camera) could be calculated.

Color is not a requirement, however, and the same techniques can be applied to gray-scale intensity values. This is the approach taken by Kato and Billinghurst in the UW HITL Mixed Reality project. Here, fiducial markers of known intensity values and shape are placed in the environment. Again, the camera is attached to the user, and the position and orientation of the fiducial markers are determined relative to the user's head coordinates; a transform to world coordinates is not necessary. This allows a unique twist in the work done by Kato and Billinghurst: the markers are not static in the world. The markers can be moved, and any virtual overlay associated with a marker moves with the marker in the user's display. The sketch below illustrates the underlying pose calculation.
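
Recovering a marker's pose from its detected corners is a standard camera-resection problem. The following Python sketch uses the modern OpenCV library purely as an illustration; the marker size, corner pixels, and camera intrinsics are hypothetical values, and this is not the HITL implementation.

    import numpy as np
    import cv2   # OpenCV, used here purely as a modern illustration

    marker_size = 0.08   # hypothetical marker width (meters)
    object_points = np.array([               # marker corners, marker frame
        [-marker_size / 2,  marker_size / 2, 0],
        [ marker_size / 2,  marker_size / 2, 0],
        [ marker_size / 2, -marker_size / 2, 0],
        [-marker_size / 2, -marker_size / 2, 0]], dtype=np.float32)

    image_points = np.array([                # corner pixels from the detector
        [310.0, 220.0], [402.0, 225.0],
        [398.0, 312.0], [306.0, 308.0]], dtype=np.float32)

    K = np.array([[800.0,   0.0, 320.0],     # assumed camera intrinsics
                  [  0.0, 800.0, 240.0],
                  [  0.0,   0.0,   1.0]])
    dist = np.zeros(5)                       # assume negligible lens distortion

    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
    # rvec/tvec give the marker's pose in camera (head) coordinates, so a
    # virtual overlay can follow the marker without any world transform.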

The fiducial markers do not have to be visible to the human eye to be effective. An unfiltered CCD chip, the common technology used in almost all video cameras, can detect energies beyond the 700 nm (visible light) wavelength and up to about 1500 nm. This range is commonly referred to as IR or near-IR. Markers that emit near-IR light have the advantage of not disturbing the user's visual field with artificial objects. This is the approach taken at the University of North Carolina on the HiBall project [University of North Carolina, 1999].

The HiBall system employs ceiling-mounted IR LEDs for optical tracking by photodetectors. The HiBall Tracker is only slightly larger than a golf ball and weighs about five ounces. It has a cluster of six lenses and six photodiodes arranged so that each photodiode can view IR LEDs through several of the six lenses, providing 26 total views. This hardware can provide data on more than 3000 LED sightings per second, although the tracking system currently uses only about 1500. The system uses a "single constraint at a time" (SCAAT) algorithm on multiple individual sightings, making use of an extended Kalman filter for positional and orientational updates after individual IR LED sightings.

The performance of this system is impressive. The group claims linear tracking resolution of less than 0.2 mm, and angular motion is resolved to under 0.03 degrees! The update rate is greater than 1500 Hz with latency of about 1 ms. This quality of performance could make the trouble of ceiling-mounted LEDs worth the effort, especially in dedicated spaces such as offices or conference rooms.

The group's current work involves the integration of inertial sensors into the system to reduce reliance on world-mounted infrastructure such as the IR LEDs. Their stated goal is to derive a system that combines passive optical sensing of the natural environment with inertial and other sensors to allow augmented reality applications outdoors and on the shop floor.

Video Tracking with Natural Objects

A growing area of research in vision-based tracking is natural 3-D object tracking. This involves accurate registration of a 3-D object, from which head position and orientation can be tracked, given the camera-model parameters. The main advantages of this approach are that the vision-based tracking system is no longer dependent on fiducial markers, allowing greater mobility and a deeper sense of natural interaction.

A number of vision techniques for 3-D object tracking have been proposed in the literature. These techniques can be divided into the following major categories:

1) Mesh-based modeling.

2) Neurofuzzy classification.

3) Simple shape fitting approach.

4) Feature extraction based tracking.

5) Surface-volume approximation.

It should be noted that the boundaries between the above classes are very vague, and that there exist a number of hybrid techniques that use a combination of two or more of the above approaches. Some of the methods are used as aids and intermediate steps in other more complicated ones. For the above categories, there exist a number of refinements that should be mentioned.

First of all, the number of objects that can be tracked varies according to the application, as does the number of cameras used. In general, the computational capacity required increases with the number of objects, while the use of multi-ocular vision increases robustness at the expense of system latency and allows the position to be estimated through triangulation. Moreover, the position of the camera(s) is another issue. Placing the camera in the environment instead of on the user allows direct registration of the user's pose and orientation, while affixing the camera on top of the head requires registration of another 3-D object.

Moreover, the process of registration can be based on a sequence of images at different time instants or on a set of stereo (or multi-camera) images at the same time instant. In the case of a sequence of images, prediction can also be incorporated to improve system latency; variations of the Kalman filtering technique are the most commonly used.

As far as the fitting processes are concerned, either polyhedral, polygonal, surface or mesh-based, the search can be based on the least squares method, the nearest neighbor approach, the nearest to mean approach or on a graph-based search, where the depth of the search tree will be determined by the number of features. The features to be matched themselves can be updated during run-time, allowing for alternate object tracking and greater mobility.

A number of constraints can be applied to all five major approaches, specifically kinematic constraints or camera projection constraints. The kinematic constraints usually apply to the rigid body assumption, dictating uniform motion for all the points on an image that belong to the same rigid body. On the other hand, the most common camera projection constraint is the epipolar constraint, which applies to stereo images under the perspective projection assumption.

Other methods:

Apart from the above most commonly used methods, others include the use of global properties such as color, volume, or area for matching. The use of color as a classification criterion implies the creation of a scene histogram, while the use of volume or area size requires some knowledge of the distance from the object [Ashbrook, 1998]. An interesting approach is the segmentation of the object into different parts and the application of one or a combination of the other methods to perform the matching. Other matching approaches include hierarchical modeling, where each object is made up of a number of primitives or basis functions, and Bayesian-based approaches [Ogniewicz and Ilg, 1992; Shao, 1998].

Mesh-based modeling:

Mesh-based modeling is a technique in which the image is separated into a number of patches: triangular, quadrilateral, or generic polygonal. The process of creating triangular patches is called triangulation and is the most commonly used. The vertices of the patches are the nodes of the mesh, and the whole structure justifies its name. The mesh can be made of equal or unequal patches, where the case of unequal patches can be interpreted as irregular 2-D sampling [Wang, 1998]. In the case of unequal patches, their creation can be content-based, which provides a number of advantages [Altunbasak, 1997].

The main advantages of mesh-based modeling are that it represents the neighboring relations between patches while allowing for spatial transformations and deformations of the 3-D object. Moreover, prediction can be incorporated into the whole scheme by measuring the position, velocity, and acceleration of each individual node. The position of the nodes need not be given at every time instant, because a motion vector is enough to describe the position of each node, given knowledge of the previous coordinates.

Therefore, the problem of registration and tracking of 3-D objects is reduced to fitting the given mesh model of the object to the observed mesh structure. The mesh structure can be used to approximate any free surface form, polyhedral and polygonal, even spherical [Hebert 1995]. In order to track the 3-D object, a mapping of the mesh nodes from each image of the sequence to the next has to be established. This motion field is relatively sparse, but a dense motion field can be derived from it through interpolation (sketched below). This leads to efficient 2-D tracking of the object on the image sequence, and knowledge of the geometric parameters of the camera leads straightforwardly to 3-D tracking.
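
The sparse-to-dense step is simple inside each triangular patch: the motion at any interior point is a barycentric blend of the motions of the patch's three nodes. A minimal Python sketch with hypothetical node positions and motion vectors:

    import numpy as np

    def dense_motion(p, tri, node_xy, node_motion):
        """Motion vector at point p inside one triangular patch: a barycentric
        blend of the motion vectors tracked at the patch's three nodes."""
        a, b, c = node_xy[list(tri)]
        M = np.array([[b[0] - a[0], c[0] - a[0]],
                      [b[1] - a[1], c[1] - a[1]]])
        w1, w2 = np.linalg.solve(M, np.asarray(p, dtype=float) - a)
        w0 = 1.0 - w1 - w2
        return (w0 * node_motion[tri[0]] + w1 * node_motion[tri[1]]
                + w2 * node_motion[tri[2]])

    # Hypothetical patch: three mesh nodes and their tracked motion vectors.
    node_xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    node_motion = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
    print(dense_motion((3.0, 3.0), (0, 1, 2), node_xy, node_motion))  # [1.3 0.3]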

Neurofuzzy classification:

Recognition and classification are two of the primary tasks for which neural networks and fuzzy systems are well known. Before the image data can be applied to the neural or fuzzy system, it has to undergo some preprocessing so that it is appropriate as input to the classifier. Some features may have to be extracted, using known feature extraction methods such as edge detection, singularity detection, or any other primitive-feature extraction technique. Some form of segmentation of the image may also be necessary, to allow for better localization of the recognition problem.

The design of the neural or fuzzy classifier is based on a training phase, during which the network is supplied with a sufficiently large number of training samples until it converges. The adaptation of the neural network is achieved through back-propagation, which propagates the classification error back through the layers of the network and adjusts the synaptic weights. The fuzzy system can be designed manually or statistically, taking averages and standard deviations of the object features. Hybrid systems may also be implemented, and a large number of variations for neural networks also exist. As in the previous case, after successful registration has been achieved, knowledge of the camera parameters can be used to derive the position of the object.

Feature extraction and matching:

This method was one of the first implemented for the task of object tracking, and it is often used as an aid to other, more complicated techniques. The basic concept is that the extraction of some basic distinctive features of an object, combined with accurate 2-D tracking of those features over a sequence of images, can lead to 3-D tracking of the object. The most commonly used primitive features are lines, points, and curves [Kamgar, 1997]. Among the most common feature extraction techniques are edge detection and block matching.

However, this technique by itself is extremely sensitive to image noise and occlusion. On the other hand, it is much faster than the other, more complicated methods, and under some conditions it performs satisfactorily. More robust performance can be achieved through the use of the Hough transform, which can compensate for some of the problems caused by noise and local singularities. An edge detection sketch follows.
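
As an illustration of the primitive-feature step, the following Python sketch computes a Sobel edge magnitude image, one of the simplest and most common edge detectors (a generic example, not tied to any system discussed above):

    import numpy as np

    def sobel_edges(img):
        """Sobel edge magnitude of a grayscale image (2D float array); edges
        are among the simplest primitive features used to seed tracking."""
        kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        ky = kx.T
        pad = np.pad(img, 1, mode='edge')
        h, w = img.shape
        gx = np.zeros((h, w))
        gy = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                patch = pad[i:i + 3, j:j + 3]
                gx[i, j] = np.sum(kx * patch)
                gy[i, j] = np.sum(ky * patch)
        return np.hypot(gx, gy)

    # Tiny synthetic test: a bright square on a dark background; the strongest
    # responses lie along the square's border.
    img = np.zeros((16, 16))
    img[4:12, 4:12] = 1.0
    edges = sobel_edges(img)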

Simple shape fitting approach:

This approach has two main variations. The first is the direct tracking of polygonal objects by fitting a polyhedral, cylindrical, or spherical model to candidate objects in the scene; the second is the tracking of an object by fitting a circular, polygonal, or ellipsoidal surface model to the object's surfaces [Huang, 1996]. In both cases, some sort of feature extraction is necessary, usually acquired through edge detection. The fitting process can be performed with one of the above-mentioned methods.

Surface-volume approximation:

This approach also has two main variations. The first is the use of surface or volume primitives to approximate the shape or surface of the object; the second is the use of model polynomial functions to approximate the surfaces of the object. The volume primitives usually used are spheres and cubes, also known as voxels, while the usual surface primitives are circles and rectangles [Ranjan, 1996]. A mesh-based approach can be incorporated into this scheme to produce a hybrid registration and tracking technique. As in all the above cases, the combination of the registration data with the camera parameters yields the desired results.

Tracking People

Another set of objects that can be tracked by video in a virtual environment are the users themselves. Myron Krueger has been a proponent of video-based tracking of people for most of his career; he started working on video tracking in the early 1970s. Many of his systems and ideas are discussed in his book Artificial Reality II [Krueger 1990]. Tracking users and their complex actions generally takes one of three forms: tracking the body, tracking the hands, and tracking the face/head.

Tracking the entire body is usually done for the purpose of recognizing general limb movement and body position/posture. This technique is used in game research and development. Typically, this is done in a blue/green background screen environment and the user (or player, in this case) interacts with animated, life-size characters.

Using video to track the hands has recently gained much support in the research community. This idea is not new and is not limited to video. This is the primary use for glove devices (see Mechanical Trackers). The concept is that the hand can be used to replace the mouse and keyboard. The problem with video tracking of hands as input devices is two-fold. First, hands come in all sizes and shapes. They also change shape as they move. This can make the process of hand and finger recognition very difficult. Second, hand gestures are temporal. They need to be tracked and compared over a series of frames. There are many different algorithms and techniques in existence for image analysis that can be applied to hand and hand gesture recognition. Some of them include discrete statistics [Schuster 1996], a significant number of hidden Markov model (HMM) variations [Devijver 1988] [He 1991] [Stoll 1995], wavelets [Mallat 1992], and hierarchical decomposition [Carlsson 1995].

Tracking the user’s head position and facial expressions by video is a relatively new effort, but one that has direct applicability to CSCW research. Frequently the focus of these techniques is not only to track head position, but to use facial expressions as part of the overall interface technique [Cascia 1998]. The facial masks produced can then be used to overlay on avatars or can be used in other aspects of CSCW applications.

Advantages

Video tracking has several advantages. It provides a flexible system that can be developed by instrumenting only the user, only the environment, or both. Video tracking allows many objects to be tracked by a single sensor without having to tag each object. This capability to track multiple objects without placing transmitters on them removes an encumbrance that has hampered many other tracking technologies. When users are required to wear cameras, they are relatively small and light. Video tracking is also scalable: the same technology can be used to track the motion of any of the objects in a room, from hands to whole bodies. This scalability means that one technology has the potential to do the job previously accomplished only with several combined technologies.

Disadvantages

Video tracking technology can be very complex. The output from a video camera is a 2D array, and the analysis method for that information is application dependent. The bounds and limitations of video tracking are not well understood at this time. Because there are no general purpose image tracking systems available, each application must develop its own video tracking system. Video tracking is still more of a research endeavor than an end user product.

Video tracking can also be significantly affected by environmental factors. Lighting conditions need to be controlled in order for the camera to see objects in the environment consistently. Significant changes in light, such as dimming the lights to present slides, can have a detrimental effect on the video data. A camera must also have an unoccluded view of the object it is tracking. Typically, redundant cameras are required to provide a continuous, unobstructed view.

Analysis of use in conference room application

Despite these disadvantages, video tracking holds great promise for use in the small-room CSCW application. The ability to perform passive tracking means users can freely enter and leave the environment. By tracking multiple objects from a single sensor, the amount of instrumentation and user encumbrance can be minimized. This also means users can bring new objects into the meeting without an extensive preparation process. Many of the lighting restrictions can be addressed by taking greater advantage of the near-IR spectrum, but more research is required to address these problems.

Prediction Algorithms

Minimizing throughput delay is especially critical for mixed reality applications. The throughput delay (tpd) is the total time required for the position or orientation to be sampled (at time ts) and the resultant virtual display to be refreshed (at time tr). In mixed reality applications, virtual objects are registered to the physical world. Without prediction techniques, virtual objects are displayed in a location relative to the user's head position at the time of the last sample used for display refresh (tr - tpd). The technologies discussed previously can lessen this problem by reducing tpd. Here, we briefly discuss three predictor functions that can be used to estimate where the user's position/orientation will be at ts + tpd, reducing the perception of lag in the system.

The techniques we discuss all use the same basic concept: use the last N time samples {ts, ts-1, ts-2, ..., ts-(N-1)} to estimate the value at ts + tpd. A sample set is maintained for each constraint provided by the tracking system. If the tracker is providing positional information, a prediction algorithm is run for each of the X, Y, and Z positional values.

The simplest technique for estimating the value at ts + tpd is a linear predictor. As the name implies, a best-fit line is found for the sample set, usually by a standard least-mean-squared-error fit, and the value of the fitted line at ts + tpd is used (see the sketch below). While this technique can show some improvement over no predictor, typical tracker input is noisy and user motions are too variable for this technique to be of much value. Two more advanced techniques commonly used in VR systems today are Kalman filtering and Grey system prediction.
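
A minimal Python sketch of the linear predictor; the sample times, yaw values, and prediction horizon are hypothetical:

    import numpy as np

    def linear_predict(t, x, t_target):
        """Fit a line to the last N samples of one tracker channel and
        evaluate it at the future time ts + tpd."""
        slope, intercept = np.polyfit(t, x, deg=1)
        return slope * t_target + intercept

    # Hypothetical use: five head-yaw samples at 100 Hz, predicted 30 ms ahead.
    t = np.array([0.00, 0.01, 0.02, 0.03, 0.04])     # sample times (s)
    yaw = np.array([10.0, 10.4, 10.9, 11.5, 11.9])   # degrees
    print(linear_predict(t, yaw, t_target=0.04 + 0.030))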

A head-tracking prediction algorithm based on Kalman filtering was introduced in 1991 by Liang, Shaw, and Green [Liang 1991], and it has become the standard predictor in use by many systems today. This predictive filtering technique was designed to reduce the latency associated with orientation lag. Kalman filtering is derived using a difference equation and uses a predictor-corrector method to solve for the least-squared error. A detailed description of the technique can be found in papers by Kalman [Kalman 1961] and Bozic [Bozic 1987]. The Kalman difference equation is heavily dependent on a derived low-pass filter named the Kalman gain, which is calculated from a recursive set of equations. One drawback to Kalman filtering is the complexity of the computations and the inherently slower recursive techniques; this historically made it difficult to perform Kalman filtering in real time. With the current generation of processors this problem has been greatly reduced and is no longer a major consideration.
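
To make the predictor-corrector structure concrete, here is a minimal constant-velocity Kalman filter for a single orientation channel in Python. This is our own sketch, not the Liang, Shaw, and Green formulation; the process and measurement noise values are assumptions.

    import numpy as np

    class KalmanPredictor:
        """Minimal constant-velocity Kalman filter for one orientation
        channel. State is [angle, angular rate]; q and r (process and
        measurement noise) are assumed values."""

        def __init__(self, dt, q=50.0, r=0.05):
            self.x = np.zeros(2)
            self.P = np.eye(2)
            self.F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model
            self.Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                                   [dt**2 / 2, dt]])
            self.H = np.array([[1.0, 0.0]])              # we observe angle only
            self.R = np.array([[r]])

        def update(self, z):
            # Predict one step ahead, then correct with the new measurement.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
            self.x = self.x + (K @ (z - self.H @ self.x)).ravel()
            self.P = (np.eye(2) - K @ self.H) @ self.P

        def predict_ahead(self, tpd):
            # Extrapolate the filtered state to ts + tpd to hide latency.
            return self.x[0] + self.x[1] * tpd

    kf = KalmanPredictor(dt=0.01)
    for z in (10.0, 10.4, 10.9, 11.5, 11.9):             # yaw samples (degrees)
        kf.update(np.array([z]))
    print(kf.predict_ahead(0.030))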

Azuma and Bishop [Azuma 1994] used Kalman filtering in conjunction with head-mounted inertial trackers to further reduce dynamic registration errors. Their research showed errors 2 to 3 times lower than Kalman filtering without inertial trackers and 5 to 10 times lower than using no prediction filtering at all.

Wu and Ouhyoung proposed another prediction algorithm based on Grey system theory in 1994 [Wu 1994]. Grey theory applies to systems that are only partially defined, which fits HMD tracking because many aspects of the system are ill defined and user behavior is uncertain. In this technique, the observed data are accumulated to derive a differential model used to predict and analyze system behavior. The technique is less dependent on user-defined system characteristics than the Kalman filtering described above. The Grey predictor has been shown to produce errors 5 to 12 times lower than no prediction.
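
For illustration, the sketch below implements the textbook GM(1,1) Grey model; Wu and Ouhyoung's tracker-specific variant differs in detail. The raw samples are accumulated, a first-order grey differential equation is fitted by least squares, and its closed-form solution is differenced back to predict the next raw value.

```python
import numpy as np

def gm11_predict(x0, steps_ahead=1):
    """Predict a future raw sample from a short sequence x0 using the
    standard GM(1,1) Grey model. Requires a nonzero fitted coefficient a;
    no guards are included in this sketch."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                       # accumulated (AGO) sequence
    z1 = 0.5 * (x1[1:] + x1[:-1])            # background series
    # Least-squares fit of the grey differential model dx1/dt + a*x1 = b.
    B = np.column_stack([-z1, np.ones(n - 1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    # Evaluate the model's closed-form solution, then difference back (IAGO).
    k = n - 1 + steps_ahead
    x1_k = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x1_prev = (x0[0] - b / a) * np.exp(-a * (k - 1)) + b / a
    return x1_k - x1_prev
```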

Computationally, the Grey predictor is faster than the Kalman techniques. While more studies are required to further compare the two prediction models, Kalman filtering generally shows slightly better results but with more variance (jitter) in the predicted values. The type of application and the tracking technology used will also affect which predictor should be chosen.

Single-Constraint-Based Tracking

Tracking the position of an object in a VR system has traditionally been accomplished by collecting data for all degrees of freedom (constraints) before computing its position/orientation in space. In 1997, Welch and Bishop [Welch 1997] proposed a mathematical framework named single-constraint-at-a-time, or SCAAT, tracking. In this approach each constraint is used to make an estimate as soon as it arrives, and each new constraint is combined with the previous ones to produce a new, improved estimate. Thus, in a positional tracking system with three DOF, three position estimates have been made by the time the first 'complete' set of measurements has been received. The result is more frequent estimates, lower latency (reduced tpd), and better accuracy.

One reason for the improved performance has to do with how many tracking technologies work. Magnetic trackers, for example, sample each constraint sequentially before combining them into a complete package; SCAAT lets the system take advantage of each constraint as soon as it is collected. Vision-based systems use a similar methodology: several markers are usually present in the scene being analyzed, and each marker is found sequentially within the image. Rather than waiting to find all the markers, a SCAAT approach processes each marker individually while the others are being located. Note the implied parallelism, usually achieved by having one processor perform data capture while another does the application processing.
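
Welch and Bishop's actual formulation embeds each constraint in a Kalman filter; the toy sketch below (our own simplification, with hypothetical names and noise values) illustrates only the scheduling idea, namely that the estimate is refreshed after every scalar measurement instead of after every complete (x, y, z) triple.

```python
import numpy as np

R = 0.01                                     # assumed per-measurement noise

def scaat_update(x, P, axis, z):
    """Fold one scalar constraint z of the given axis (0=x, 1=y, 2=z) into
    the running position estimate; a new estimate exists immediately."""
    H = np.zeros((1, 3))
    H[0, axis] = 1.0
    K = P @ H.T / (H @ P @ H.T + R)          # gain for this one constraint
    x = x + K.flatten() * (z - x[axis])
    P = (np.eye(3) - K @ H) @ P
    return x, P

x, P = np.zeros(3), np.eye(3)
# Constraints arrive one at a time; the estimate is refreshed after each.
for axis, z in [(0, 1.02), (1, 0.48), (2, 1.51), (0, 1.05)]:
    x, P = scaat_update(x, P, axis, z)
```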

An integral part of the SCAAT algorithms is some type of predictor function. Welch used a variation of the Kalman filter in his work on the HiBall project at the University of North Carolina at Chapel Hill. While direct comparison with other techniques is difficult, experimental results indicated errors one to two orders of magnitude lower than for systems using neither SCAAT nor prediction techniques.

Proposed Research of a New Tracking System for CSCW Applications

Based on our technology review, we believe the best approach to meeting CSCW tracking requirements is to develop a hybrid tracking system. No single technology has been shown to meet the requirements of accuracy, latency, and precision while maintaining flexibility and minimizing user and environmental encumbrances. Video- and inertial-based trackers are the two most promising technologies to combine in a hybrid tracker. More specifically, we recommend continued research to enhance the head-mounted video tracking system used by the HITL Mixed Reality project and to combine it with an inertial tracking system.

Why a head-mounted video system?

A head-mounted video system, as implemented in the HITL Mixed Reality project, offers many advantages for CSCW applications. First, the captured video can serve purposes beyond tracking. In an inside-out design, where the camera is attached to the user, the camera shares essentially the same view as the user, so the video signal can be sent to other collaborators to effectively place them in the user's body by giving them the user's visual perspective of the environment. Because the video serves two vital purposes, its encumbrance cost is better justified.

Objects in the user's view can also be passively tracked using video techniques, removing the need to place active trackers on every object in the collaborative environment. The number of objects in a collaborative environment could be quite large; passively tracking them all with a single camera makes the system much more flexible and greatly reduces the amount of technology that could hinder the user's task.

With a head-mounted video tracker, the user's hands are in the visual field during many operations, allowing hand tracking and gesture recognition to be accomplished by the same camera. Gesture recognition can be a powerful tool: it would allow virtual objects in the environment to be placed or modified through more natural interaction techniques. For example, the user could draw on virtual paper without the need for virtual pens or other tracking hardware, creating a virtual whiteboard in 3-D space that all collaborators could use.

Why include the inertial tracking system?

Traditional video-based tracking has a limited refresh rate. With a standard RS-170 television signal, the maximum update rate is 60 Hz, and only if each of the two interlaced fields is processed as a separate image. This could be improved with nonstandard video signals, but we believe inertial trackers would provide the necessary higher update rate while incurring only a minimal impact on bandwidth. The inertial tracker would be used in conjunction with the prediction filters described earlier to increase the accuracy of the tracked position and orientation.
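
As a rough illustration of the proposed division of labor (the sampling rates, blend weight, and function names are our assumptions, not measured or cited values), the sketch below dead-reckons from inertial samples between video frames and pulls the drifting estimate back toward each absolute video fix.

```python
import numpy as np

INERTIAL_DT = 1.0 / 600.0     # assumed 600 Hz inertial sampling
ALPHA = 0.2                   # assumed blend weight toward the video fix

def inertial_step(pos, vel, accel):
    """Dead-reckon between video frames; this estimate drifts over time."""
    vel = vel + accel * INERTIAL_DT
    pos = pos + vel * INERTIAL_DT
    return pos, vel

def video_correction(pos, video_pos):
    """Pull the drifting inertial estimate toward the absolute 60 Hz video
    measurement; video bounds the drift, inertia supplies the update rate."""
    return (1.0 - ALPHA) * pos + ALPHA * video_pos
```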

How to enhance the current video tracking system

The current Mixed Reality tracking system works in a user coordinate space without concern for registration to a world coordinate space. The system can be enhanced to support world coordinates by placing static markers in the environment. In the short term, we recommend fiducial markers similar to those currently used by the Mixed Reality system. Long-term research should look at replacing fiducial markers with natural objects already in the environment that neither interfere with room aesthetics nor disturb the user's visual field.

Adding static markers to the environment will not solve all of the existing system's limitations. The authors of the Mixed Reality tracking system have determined that the current system rapidly loses accuracy at distances greater than a few feet. The exact cause of the degradation is not known but is believed to involve optical aberrations and camera resolution.

The first step in addressing the problem would be a higher-resolution camera. The Mixed Reality prototype uses a camera with lower resolution than other RS-170 cameras can achieve; a different camera would increase the resolution and allow marker detection at greater distances. However, this exacerbates the problem of maintaining a real-time system, because more pixels must be analyzed to find the markers. We therefore propose the development of two marker types.

Short-range markers serve as physical handles for virtual objects. These are the markers used by the current system; they carry patterns that uniquely identify the virtual objects for which they serve as handles.

Long-range markers serve to register head position to the physical (world coordinate) environment. The visual pattern on these markers can be simplified to reduce the resolution required for recognition.

Long-range markers can be placed in the environment in a known pattern, which facilitates a faster algorithm for finding markers between frames. The optimized marker search would be aided not only by the known placement scheme but also by using relative position and orientation changes between frames to select the initial search parameters.

This approach reduces search time for the long-range markers but is not sufficient for the short-range markers, since these are moved dynamically by the user. One way to approach the problem is to use the additional information we now have about the world coordinate space: the locations of the short-range markers can be kept in world coordinates. As long as a marker has not moved, its position in the view field is easily determined from the transformation functions. Since markers are moved by the user's hand, the tracking system can detect when a marker is moved and calculate its new position. Ideally this eliminates any searching for the markers.
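
As a sketch of that transformation step (the pose matrix, camera intrinsics, and function names are illustrative assumptions, not part of the existing system), the following projects a marker's stored world position through the current head pose so the image search can begin in a small window around the predicted pixel.

```python
import numpy as np

def predicted_marker_pixel(marker_world, world_to_camera, f, cx, cy):
    """Project a marker's stored world-coordinate position through the
    current head pose to predict its pixel location. world_to_camera is a
    4x4 pose matrix recovered from the long-range markers; f, cx, cy are
    pinhole intrinsics (focal length in pixels, image center). Assumes the
    marker lies in front of the camera (positive depth)."""
    p = world_to_camera @ np.append(marker_world, 1.0)  # into camera frame
    u = f * p[0] / p[2] + cx
    v = f * p[1] / p[2] + cy
    return u, v

# The search for an unmoved short-range marker can then start in a small
# window around (u, v) instead of scanning the whole frame.
```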

Conclusion

After a thorough review of the recent literature on tracking, it seems clear that a hybrid video-inertial tracking system holds the best potential for a room-sized CSCW application. Such a system should provide the performance required to preserve many natural human communicative behaviors in the CSCW environment. Expanding a known, working video technique, such as that advanced by Kato and Billinghurst, reduces the technical risk and complexity to a level that should allow rapid success in integrating the inertial tracking system. Once this integration has been achieved and refined to acceptable performance levels, more advanced video tracking techniques can be examined; the long-term goal is to reduce or eliminate reliance on markers or other environmental modifications. We believe the HIT Lab should embark on a program to explore these research vectors and produce a prototype CSCW laboratory and technology demonstration showpiece.

References

Azuma, R. and Bishop, G., "Improved Static and Dynamic Registration in an Optical See-through HMD," SIGGRAPH '94 Conference Proceedings, pp. 197-204, 1994.

Baratoff, G. and Blanksteen, S., "Tracking Devices," Encyclopedia of Virtual Environments, http://www.cs.umd.edu/projects/hcil/eve.restore/eve-articles/I.D.1.b.TrackingDevices.html

Bozic, S., "Kalman filter subroutine computation," Journal of Electronic Engineering, pp. 29-31, July 1987.

Carlsson, S., "Projectively Invariant Decomposition and Recognition of Planar Shapes," International Journal of Computer Vision 17(2), pp. 193-209, 1995.

Devijver, P.A. and Dekesel, M.M., "Experiments with an Adaptive Hidden Markov Mesh Image Model," Philips Journal of Research, Vol. 43, No. 34, 1988, pp. 375-392.

Ghazisaedy, M., Adamczyk, D., Sandin, D., Kenyon, R., and DeFanti, T., "Ultrasonic Calibration of a Magnetic Tracker," Proceedings of the Virtual Reality Annual International Symposium '95, March 11-15, 1995, pp. 179-188.

Hand, C., "A Survey of 3-D Input Devices," Technical Report CS TR94/2, 1994.

He, Y. and Kundu, A., "Shape Classification Using Hidden Markov Model," Proceedings of Computer Vision and Pattern Recognition, July 1991, pp. 10-15.

Ishii, H., Kobayashi, M., and Arita, K., "Iterative Design of Seamless Collaboration Media," Communications of the ACM, Vol. 37, No. 8, August 1994, pp. 83-97.

Kalman, R.E. and Bucy, R.S., "New Results in Linear Filtering and Prediction Theory," Trans. ASME, Journal of Basic Engineering, Series 83D, pp. 95-108, March 1961.

Kato, H., Billinghurst, M., Weghorst, S., and Furness, T., "A Mixed Reality 3D Conferencing Application," Human Interface Technology Laboratory Technical Report TR-99-1, http://www.hitl.washington.edu/publications/r-99-1, 1999.

Kolozs, J., "Position Trackers for Virtual Reality," University of Utah, May 28, 1996.

Krueger, M., Artificial Reality II, Addison-Wesley, 1990.

La Cascia, M., Isidoro, J., and Sclaroff, S., "Head Tracking via Robust Registration in Texture Map Images," Proc. CVPR '98, June 1998.

Liang, J., Shaw, C., and Green, M., "On Temporal-Spatial Realism in the Virtual Reality Environment," Proc. 4th Annual Symposium on User Interface Software and Technology, Hilton Head, SC, pp. 19-25, 1991.

Mallat, S.G. and Zhong, S., "Characterization of Signals from Multiscale Edges," IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7), pp. 710-732, 1992.

Omologo, M. and Svaizer, P., "Acoustic Source Location in Noisy and Reverberant Environment using CSP Analysis," Proceedings of ICASSP 96, Atlanta, USA, 1996.

Schuster, M. and Rigoll, G., "Fast Online Video Image Sequence Recognition with Statistical Methods," Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3450-3453.

Silicon Microstructures Inc., Capacitive Accelerometer, 7130 Series, Product Data Sheet, Fremont, CA, 1995.

Sowizral, H., "Tutorial: An Introduction to Virtual Reality," Virtual Reality Annual International Symposium, 1995.

State, A., Livingston, M.A., Hirota, G., Garrett, W.F., Whitton, M.C., Fuchs, H., and Pisano, E.D., "Techniques for Augmented-Reality Systems: Realizing Ultrasound-Guided Needle Biopsies," Proceedings of SIGGRAPH '96 (New Orleans, LA, 4-9 August 1996), pp. 439-446.

Stoll, P.A. and Ohya, J., "Applications of HMM Modeling to Recognizing Human Gestures in Image Sequences for a Man-Machine Interface," IEEE International Workshop on Robot and Human Communication 1995, pp. 129-134.

Tang, J.C., "Findings from Observational Studies of Collaborative Work," in S. Greenberg (ed.), Computer-Supported Cooperative Work and Groupware, pp. 11-28, San Diego, CA: Academic Press, 1991.

Ullmer, B. and Ishii, H., "The metaDESK: Models and Prototypes for Tangible User Interfaces," UIST '97 Proceedings, pp. 209-210.

University of North Carolina, Department of Computer Science, "Wide-Area Tracking: Navigation Technology for Head-Mounted Displays," www.cs.unc.edu/~tracker, 1999.

Watts, R.G. and Bahill, T.A., Keep Your Eye on the Ball: The Science and Folklore of Baseball, W.H. Freeman and Company, New York, 1990.

Welch, G. and Bishop, G., "SCAAT: Incremental Tracking with Incomplete Information," SIGGRAPH '97 Conference Proceedings, pp. 333-344, 1997.

Wickens, C., Gordon, S., and Liu, Y., An Introduction to Human Factors Engineering, Reading, MA: Addison-Wesley, 1998, pp. 223-258.

Wu, J.-R. and Ouhyoung, M., "Reducing the Latency in Head-Mounted Displays by a Novel Prediction Method Using Grey System Theory," Computer Graphics Forum (EuroGraphics '94) 13(3), pp. C503-C512, 1994.