Gesture Recognition

Authors: Gregory Baratoff and Dan Searles

Gestures are physical positions or movements of a person's fingers, hands, arms or body used to convey information. For example, see the figure below.

In a virtual reality system, gestures can be used to navigate, control, or otherwise interact with the computer creating the virtual reality. Gesture recognition is the process by which the gestures a user forms are made known to the system. In completely immersive VR environments the keyboard is generally not available, so some other means of controlling the environment is needed. Other possibilities include voice commands, wands, and other devices that can be manipulated unseen (or via their symbolic inclusion in the virtual environment). The goal is a natural means of control, or at least one that is simple to learn. Gestures of the fingers or hands can provide this capability and are usually inexpensive and easy to implement.

Gestures can be expressed in several ways. Any physical aspect of a person that the person can control can be the basis for a set of gestures. Lip reading or interpretation of facial expressions is one possibility, as is the sign language used by the deaf. A gesture recognition system for Japanese sign language is presented in [Murakami 91]. Other gesture sets developed outside the computer arena can be seen in use by traffic cops, construction site workers, and airport ground controllers. Some gestures have been created specifically for use in computer-generated virtual realities, meaning, for example, "fly in this direction" or "select (or pick up) this object". One example of a mixture of these types is a program called Charade [Baudel 93], where gestures are used to control a presentation.

Gestures can be static, where the user assumes a pose or certain configuration, or dynamic, where movement is the gesture itself. In order to make these gesture configurations accessible to the computer, sensing devices are either directly attached to the user, or measure the configuration indirectly from a distance.
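The distinction between static and dynamic gestures can be sketched in code. In this hypothetical example, a static gesture is a single pose vector (here, invented finger-flex angles from a data glove), while a dynamic gesture is a time-ordered sequence of such poses; all names and values are illustrative, not from any particular device.

```python
# A static gesture is one pose: a single vector of sensor readings,
# e.g. finger-joint flex angles from a data glove (degrees, invented values).
fist_pose = [85.0, 90.0, 88.0, 92.0, 80.0]   # all fingers flexed
point_pose = [5.0, 88.0, 90.0, 91.0, 82.0]   # index finger extended

# A dynamic gesture is a trajectory: a time-ordered sequence of poses.
wave_gesture = [
    [5.0, 4.0, 6.0, 5.0, 3.0],    # open hand, t = 0
    [10.0, 8.0, 9.0, 7.0, 6.0],   # t = 1
    [5.0, 4.0, 6.0, 5.0, 3.0],    # t = 2
]

def is_static(gesture):
    """A static gesture is a single pose; a dynamic one is a list of poses."""
    return not isinstance(gesture[0], list)

print(is_static(fist_pose))     # True
print(is_static(wave_gesture))  # False
```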

Attached devices (gloves, datasuits, 6DOF trackers) generally provide information along all three spatial dimensions. This is not the case in image-based approaches, where only two-dimensional projections are available. This is problematic since a given 3D configuration projects to different 2D views for different relative positions of user and camera. In some degenerate cases, a given 2D view could even correspond to different 3D configurations. Computer vision techniques could be used either to recognize gestures from 2D views, or to reconstruct the third dimension from several views taken by two or more cameras. These are, however, hard and computationally expensive tasks, requiring dedicated hardware for real-time performance. In order to make recognition from images easier, care should be taken in the definition of the gestures to make them redundant along several dimensions, so as to avoid degeneracies due to the reduction from three to two dimensions.
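The degeneracy described above can be made concrete with a small sketch. Assuming an orthographic camera looking along the z axis, projection simply discards depth, so two different 3D fingertip configurations (the coordinates below are invented for illustration) can collapse to the identical 2D view.

```python
def project(points_3d):
    """Orthographic projection onto the image plane: drop the z coordinate."""
    return [(x, y) for (x, y, z) in points_3d]

# Two different 3D configurations of a finger (illustrative coordinates):
# extended toward the camera vs. partially curled. They differ only in depth.
pointing_at_camera = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.5)]
curled_finger      = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.1)]

# Both collapse to the same 2D view -- a degenerate case for recognition.
print(project(pointing_at_camera) == project(curled_finger))  # True
```

A gesture set designed to be redundant along several dimensions avoids relying on exactly these depth-only distinctions.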

A further problem with image-based measurements is the possibility of occlusion. The user might be facing away from the camera, hiding his/her own gestures, or an object could be blocking the camera's view of the user. The possibility of such occlusions places limits on the space in which gestures can be recognized, and thus puts the additional burden on the user of staying aware of this restriction. This is another problem that needs to be addressed during the design phase. In real life, people use gestures to communicate with other people or animals, or to enrich verbal communication. In such situations the speaker and the audience usually face each other. This allows the speaker to receive feedback from the audience and, more importantly, gives the audience an unobstructed view of the speaker's gestures. In a virtual environment it is fairly easy to project an audience in front of the user, but it is more difficult to keep cameras in front of the user at all times.

Inside the computer, each gesture is represented as a region in some feature space. In the direct approach, the features would be the x,y,z coordinates of various points on the user's body and the orientations of some limbs. In an image-based approach, the sensor measures the intensity values of a 2D grid of pixels. In general, the image is first preprocessed to enhance contrast and filter out noise, followed by a feature extraction process, which localizes points of high image contrast, such as edges. A grouping process then links these features together to form a representation of boundaries in the image. The boundary representation is usually the basis for a segmentation process, which separates from each other the regions corresponding to different parts of the body. Once this is done, relative positions and orientations of the parts in the image can be measured. The space of all possible positions and orientations could be a candidate for the feature space inside which the gesture regions are defined.
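The feature extraction step can be illustrated with a minimal sketch. Assuming a tiny intensity image (values invented), high-contrast pixels are found by thresholding the difference between horizontally adjacent pixels; a real system would smooth the image first and use more sophisticated edge operators, but the principle is the same.

```python
# A small 2D grid of pixel intensities with a vertical boundary
# between a dark region (10) and a bright region (200). Values are invented.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

def horizontal_edges(img, threshold=50):
    """Return (row, col) positions where adjacent pixels differ strongly."""
    edges = []
    for r, row in enumerate(img):
        for c in range(len(row) - 1):
            if abs(row[c + 1] - row[c]) > threshold:
                edges.append((r, c))
    return edges

# The detected edge pixels trace the boundary between the two regions;
# a grouping process would then link them into a boundary representation.
print(horizontal_edges(image))  # [(0, 1), (1, 1), (2, 1)]
```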

For each measurement taken by the sensing devices, the computer tries to identify the gesture by locating the region in feature space into which the measurement falls. This can be done using pattern recognition or neural network algorithms. Usually such algorithms use a rather simple model for the gesture regions. They store a limited set of templates, or prototypes, each defining a gesture. The gesture region of a given template is defined as the set of all points that lie closer to this template than to any other of the stored templates. The recognition process therefore consists of finding the template that is closest to the input vector. Such algorithms are known as nearest-neighbour search algorithms. One example of a neural network architecture for recognition tasks is the attractor network. In such a network, the prototypes are stored implicitly in the weights of the connections of the network. Recognition of a pattern is achieved by an iterative algorithm that converges to one of the prototypes.
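A nearest-neighbour recognizer of the kind described above can be sketched in a few lines. The gesture names and template vectors (invented finger-flex angles) are purely illustrative; the essential point is that each input measurement is assigned to the stored template closest to it in Euclidean distance.

```python
import math

# Stored templates: one prototype vector per gesture in feature space.
# Feature values are invented finger-flex angles (degrees).
templates = {
    "fist":  [90.0, 90.0, 90.0, 90.0],
    "open":  [0.0, 0.0, 0.0, 0.0],
    "point": [0.0, 90.0, 90.0, 90.0],
}

def recognize(measurement):
    """Return the gesture whose template is nearest (Euclidean distance).

    The gesture region of each template is implicitly the set of all
    points closer to it than to any other template.
    """
    return min(templates,
               key=lambda name: math.dist(measurement, templates[name]))

print(recognize([85.0, 88.0, 92.0, 87.0]))  # fist
print(recognize([3.0, 85.0, 88.0, 90.0]))   # point
```

Noisy measurements near a template still map to the right gesture, but points near the boundary between two gesture regions are inherently ambiguous, which is why gesture sets are usually designed with well-separated prototypes.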


[Murakami 91] Murakami, K. and Taguchi, H. (1991). Gesture Recognition Using Recurrent Neural Networks. In Proceedings of the ACM CHI'91 Conference on Human Factors in Computing Systems, pp. 237-242.

[Baudel 93] Baudel, T. and Beaudouin-Lafon, M. (1993). CHARADE: Remote Control of Objects Using Free-Hand Gestures. Communications of the ACM, 36(7): 28-35.


Human Interface Technology Laboratory