Voice Recognition

Author: Jim Baumann

Voice recognition is the process of taking the spoken word as an input to a computer program. This process is important to virtual reality because it provides a fairly natural and intuitive way of controlling the simulation while allowing the user's hands to remain free. This article will delve into the uses of voice recognition in the field of virtual reality, examine how voice recognition is accomplished, and list the academic disciplines that are central to the understanding and advancement of voice recognition technology.

What is voice recognition, and why is it useful in a virtual environment?

Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned" [ADA90]. While the concept could more generally be called "sound recognition", we focus here on the human voice because we most often and most naturally use our voices to communicate our ideas to others in our immediate surroundings. In the context of a virtual environment, the user would presumably gain the greatest feeling of immersion, or being part of the simulation, if they could use their most common form of communication, the voice. The difficulty in using voice as an input to a computer simulation lies in the fundamental differences between human speech and the more traditional forms of computer input. While computer programs are commonly designed to produce a precise and well-defined response upon receiving the proper (and equally precise) input, the human voice and spoken words are anything but precise. Each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success, to overcome these difficulties.

How is voice recognition performed?

The most common approaches to voice recognition can be divided into two classes: "template matching" and "feature analysis". Template matching is the simplest technique and has the highest accuracy when used properly, but it also suffers from the most limitations. As with any approach to voice recognition, the first step is for the user to speak a word or phrase into a microphone. The electrical signal from the microphone is digitized by an "analog-to-digital (A/D) converter", and is stored in memory. To determine the "meaning" of this voice input, the computer attempts to match the input with a digitized voice sample, or template, that has a known meaning. This technique is a close analogy to the traditional command inputs from a keyboard. The program contains the input template, and attempts to match this template with the actual input using a simple conditional statement.
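The matching step can be sketched as follows. This is a minimal illustration, not a description of any particular product: it assumes the digitized input and the stored templates are equal-length lists of numbers, and the names `distance`, `match_template`, and `threshold` are hypothetical. Because two utterances are never bit-for-bit identical, the sketch accepts the closest template only if it lies within a tolerance, rather than testing for strict equality.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_template(samples, templates, threshold):
    """Return the word whose template is closest to the input,
    or None if even the closest template is farther than threshold."""
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        d = distance(samples, template)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None


# Toy data (assumed): each "recording" is only four samples long.
templates = {
    "stop": [0.0, 0.8, 0.9, 0.1],
    "go": [0.7, 0.2, 0.1, 0.6],
}
spoken = [0.1, 0.7, 0.9, 0.2]
print(match_template(spoken, templates, threshold=0.5))
```

An input that resembles none of the templates falls outside the tolerance and is rejected, which corresponds to an unrecognized command.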

Since each person's voice is different, the program cannot possibly contain a template for each potential user, so the program must first be "trained" with a new user's voice input before that user's voice can be recognized by the program. During a training session, the program displays a printed word or phrase, and the user speaks that word or phrase several times into a microphone. The program computes a statistical average of the multiple samples of the same word and stores the averaged sample as a template in a program data structure. With this approach to voice recognition, the program has a "vocabulary" that is limited to the words or phrases used in the training session, and its user base is also limited to those users who have trained the program. This type of system is known as "speaker dependent." It can have vocabularies on the order of a few hundred words and short phrases, and recognition accuracy can be about 98 percent.
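The averaging performed during a training session might look like the sketch below. It assumes, for simplicity, that every repetition of the word has already been trimmed to the same number of samples; the function name `train_template` and the toy data are hypothetical.

```python
def train_template(recordings):
    """Average several equal-length recordings of one word,
    sample by sample, into a single template."""
    if not recordings:
        raise ValueError("at least one recording is required")
    length = len(recordings[0])
    if any(len(r) != length for r in recordings):
        raise ValueError("all recordings must have the same length")
    count = len(recordings)
    return [sum(r[i] for r in recordings) / count for i in range(length)]


# Three repetitions of the same word by one user (toy data, assumed).
repetitions = [
    [0.0, 0.9, 0.8],
    [0.1, 0.7, 0.9],
    [0.2, 0.8, 1.0],
]
# One template per trained word; this dictionary is the "vocabulary".
vocabulary = {"stop": train_template(repetitions)}
```

The vocabulary holds exactly the words that were trained, which is why such a system recognizes nothing outside its training session.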

A more general form of voice recognition is available through feature analysis, a technique that usually leads to "speaker-independent" voice recognition. Instead of trying to find an exact or near-exact match between the actual voice input and a previously stored voice template, this method first processes the voice input using "Fourier transforms" or "linear predictive coding (LPC)", then attempts to find characteristic similarities between the expected inputs and the actual digitized voice input. These similarities will be present for a wide range of speakers, and so the system need not be trained by each new user. The speech differences that the speaker-independent method can deal with, but which template matching would fail to handle, include accents and variations in speed of delivery, pitch, volume, and inflection. Speaker-independent speech recognition has proven to be very difficult, with some of the greatest hurdles being the variety of accents and inflections used by speakers of different nationalities. Recognition accuracy for speaker-independent systems is somewhat less than for speaker-dependent systems, usually between 90 and 95 percent.
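As a rough illustration of why a transformed signal is easier to compare across speakers, the sketch below computes a discrete Fourier transform directly from its definition and normalizes the resulting magnitudes. It covers only one of the differences listed above, volume, and it is an assumption-laden toy: the function names are hypothetical, and a practical system would use a fast transform over short overlapping frames.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of the discrete Fourier transform, bins 0 to n/2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(coeff))
    return spectrum


def normalized_features(samples):
    """Scale the spectrum to sum to 1 so that loudness drops out."""
    spectrum = magnitude_spectrum(samples)
    total = sum(spectrum)
    return [m / total for m in spectrum] if total else spectrum


# The same tone (two cycles over 16 samples) at two volumes.
quiet = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
loud = [3.0 * x for x in quiet]

quiet_features = normalized_features(quiet)
loud_features = normalized_features(loud)
```

The raw samples of `quiet` and `loud` differ by a factor of three, so a sample-by-sample comparison would treat them as different inputs, while their normalized features agree to within rounding error.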

Another way to differentiate between voice recognition systems is by whether they can handle only discrete words, connected words, or continuous speech. Most voice recognition systems are discrete word systems, and these are easiest to implement. For this type of system, the speaker must pause between words. This is fine for situations where the user is required to give only one-word responses or commands, but is very unnatural for multiple-word inputs. In a connected word voice recognition system, the user is allowed to speak in multiple-word phrases, but he or she must still be careful to articulate each word and not slur the end of one word into the beginning of the next word. Totally natural, continuous speech includes a great deal of "coarticulation", where adjacent words run together without pauses or any other apparent division between words. A speech recognition system that handles continuous speech is the most difficult to implement.
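The reason discrete word systems are the easiest to implement can be seen in a small sketch: when the speaker pauses, word boundaries can be found by looking for runs of near-silence. The function name `split_on_pauses` and its parameters `silence_level` and `min_pause` are hypothetical, and the amplitudes are toy values.

```python
def split_on_pauses(samples, silence_level, min_pause):
    """Split a sample stream into words at long quiet runs.

    A sample counts as quiet when its absolute value is at most
    silence_level. A word ends once min_pause quiet samples occur
    in a row. Quiet samples themselves are not kept in the output.
    """
    words = []
    current = []    # loud samples of the word being collected
    quiet_run = 0   # consecutive quiet samples seen so far
    for s in samples:
        if abs(s) > silence_level:
            current.append(s)
            quiet_run = 0
        else:
            quiet_run += 1
            if current and quiet_run >= min_pause:
                words.append(current)
                current = []
    if current:
        words.append(current)
    return words


signal = [0, 0, 5, 6, 5, 0, 0, 0, 7, 8, 0, 4, 0, 0, 0]
words = split_on_pauses(signal, silence_level=1, min_pause=3)
```

Here the single quiet sample between 8 and 4 is too short to count as a pause, so those samples stay in one word. Coarticulated speech offers no such quiet runs, which is why this simple approach does not extend to continuous speech.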

What disciplines are involved in voice recognition?

The template matching method of voice recognition is founded in the general principles of digital electronics and basic computer programming. To fully understand the challenges of efficient speaker-independent voice recognition, the fields of phonetics, linguistics, and digital signal processing should also be explored.

References

[ADA90] Adams, Russ, Sourcebook of Automatic Identification and Data Collection, Van Nostrand Reinhold, New York, 1990.

[CAT84] Cater, John P., Electronically Hearing: Computer Speech Recognition, Howard W. Sams & Co., Indianapolis, IN, 1984.

[FOU89] Fourcin, A., G. Harland, W. Barry, and V. Hazan, editors, Speech Input and Output Assessment, Ellis Horwood Limited, Chichester, UK, 1989.

[YAN87] Yannakoudakis, E. J., and P. J. Hutton, Speech Synthesis and Recognition Systems, Ellis Horwood Limited, Chichester, UK, 1987.

Human Interface Technology Laboratory