Voice synthesis

Author: Glen Lee


Many aspects of Virtual Reality involve imitating familiar real-world processes by computer. Imitating the human form and feel alone is an overwhelming task because of the complexity of the human body. This section provides an overview of one human ability of particular interest for duplication: speech.

How do humans speak: the physiological mechanics of speech

In order to synthesize human speech, we must first understand how humans produce it. By breaking speech and its production down into measurable, analyzable qualities, instruments and techniques can be created to reproduce those qualities.

In its simplest description, human speech is produced by air pushed from the lungs flowing through the vocal tract, set vibrating by the vocal cords, filtered by the activity of the facial muscles, and released through the mouth and nose (see |Figure IB2.1|). Although the vocal tract is a tube roughly one foot long, the sound actually takes shape in the latter six inches of the tube--the oral cavity. In this portion of the vocal tract, many combinations of anatomical activity can alter the passing air flow.

For example, the brain can signal the vocal cords, housed in the larynx, to vibrate or resonate. This vibration forms the glottal waveform, shown in |Figure IB2.2|. Because this waveform is relatively simple in shape, waveform generators can approximate it. Simply tightening the vocal cord muscles raises the sound's fundamental frequency, producing higher-pitched sounds.
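A minimal sketch of this idea, assuming NumPy is available: generate a periodic source waveform at a chosen fundamental frequency. The sawtooth used here is only a crude stand-in for the true glottal waveform of |Figure IB2.2|, and the function name and pitch values are illustrative assumptions, not from the original text.

    # Crude periodic "glottal" source at a chosen fundamental frequency.
    import numpy as np

    def glottal_source(f0_hz=120.0, duration_s=1.0, sample_rate=16000):
        """Return a sawtooth-like pulse train with fundamental frequency f0_hz."""
        t = np.arange(int(duration_s * sample_rate)) / sample_rate
        phase = (t * f0_hz) % 1.0          # ramps 0..1 once per glottal cycle
        return 2.0 * phase - 1.0           # scale to the range [-1, 1]

    # Tightening the vocal cords corresponds to raising f0: compare a
    # 120 Hz source with a 220 Hz source (both values are just examples).
    low_pitch  = glottal_source(f0_hz=120.0)
    high_pitch = glottal_source(f0_hz=220.0)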

Furthermore, the air flow can be restricted and altered by the positions of the tongue, teeth, lips, and other facial muscles in the oral cavity. It is the interaction of all the possible combinations and variations of these acoustic filters that complicates the task of synthesizing speech [CATER83]. An alternative to determining how each change in the facial muscles shapes the resulting sound is to study the actual sounds produced.

"Linguistics", the "science of language," helps to classify speech into a phonetic alphabet, called "phonemes". The International Phonetic Association (IPA) provides a standard for which phonetic alphabets are based. The phonemes can be collected into popular subsets known as dialects. For instance, we might be interested in the General American (GA) dialect of phonemes, which comprises the sounds of commonly spoken American English.

Within the English language, there are also variations in how a phoneme is pronounced, depending on its context within a word. Each of these variations is an "allophone". Furthermore, when phonemes are connected to produce words or other sounds of speech, there are characteristic transition sounds between certain phonemes. These sounds, which cannot be represented by a single symbol, are called "diphthongs". Diphthongs usually occur when two vowel-type phonemes are pronounced in succession, such as "ah" and "ee" combining to create the sound "i" (as in "ice") [CATER83].
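To make the idea of a phonetic alphabet concrete, here is a tiny illustrative lookup in Python. The ARPABET-style symbols are a machine-friendly stand-in for IPA, and the dictionary entries are examples invented for this sketch, not an authoritative lexicon.

    # Illustrative only: a tiny pronunciation dictionary mapping English
    # words to ARPABET-style phoneme codes.
    PRONUNCIATIONS = {
        "ice":   ["AY", "S"],          # "AY" is the diphthong "ah" + "ee"
        "speak": ["S", "P", "IY", "K"],
        "data":  ["D", "EY", "T", "AH"],
    }

    def to_phonemes(word):
        """Look up a word's phoneme sequence; None if it is not in the lexicon."""
        return PRONUNCIATIONS.get(word.lower())

    print(to_phonemes("ice"))   # ['AY', 'S'] -- note the single diphthong symbol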

Describing the mechanics of producing recognizable sounds does not fully capture the essence of speech. Other human faculties contribute to its sophistication. Many of these provide sensory feedback during speech, and their influence went unnoticed until the speech of people lacking those feedback channels was compared with typical speech.

The ears continuously monitor the progress of speech, providing an input used to adjust speech patterns or volume as necessary (refer to section IV.3 Human Factors/Auditory System). The effect of losing this input can be observed in the speech of the deaf: without auditory feedback, a deaf person relies more heavily on the bone-conducted vibration felt during speech to control it.

In addition, the brain orchestrates the lungs to exhale sufficient air, contracts the various muscles that modify the air stream, manages the visual and auditory inputs that affect speech, and sets the rhythm of speech. Tasks that seem completely natural to the brain are very hard to capture algorithmically; for instance, rhythm comes effortlessly to a speaker, yet writing rules to synthesize it would require analyzing an exhaustive set of cases.

How do we make computers speak: techniques for speech synthesis

In speech generation, there are three basic techniques (in order of increasing complexity): "waveform encoding", analog "formant frequency synthesis", and "digital vocal tract modeling" of speech. Each of these techniques is described briefly below.

In waveform encoding, the computer acts essentially as a tape recorder: it records phrases or words into digital memory and plays them back from the application software as needed. Providing a wide repertoire of phrases requires a considerable amount of storage, especially if high-quality recordings are desired, since higher quality means sampling the sound at shorter time intervals and therefore producing more data to store.
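A back-of-the-envelope calculation makes the storage cost concrete. The sample rates and sample widths below are common illustrative values, not figures from the original text.

    # Bytes needed to record a phrase grow with sample rate, sample width,
    # and duration.
    def storage_bytes(duration_s, sample_rate_hz, bits_per_sample):
        return duration_s * sample_rate_hz * bits_per_sample // 8

    # A 2-second phrase such as "Your door is ajar!":
    print(storage_bytes(2, 8000, 8))     #  16,000 bytes at telephone quality
    print(storage_bytes(2, 44100, 16))   # 176,400 bytes at CD quality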

However, this technique has proven useful for applications such as relaying the status of an automobile: "Your door is ajar!" Very little additional hardware is needed, and since a human makes the initial recordings, many of the obstacles to mimicking speech electronically are avoided. For more information on how these recordings are sampled and stored, refer to sections I.B.1 System Components/Sound/3D Sound and I.B.2 System Components/Sound/Replay and Formats.

The second basic technique, formant frequency synthesis, attempts to replicate the human vocal tract (refer to I.B.2 System Components/Sound/Voice Synthesis/How do humans speak). In this method, the outputs of several bandpass filters are summed to act as the various acoustic filters of the oral cavity. This method offers the flexibility to utter many different sounds in succession with far less data storage. However, if the input is plain text, the resulting sounds can be very unnatural and sometimes unrecognizable [CATER83], because the computer lacks rules for how to pronounce written text and how to phrase it rhythmically. Careful use of the phonetic alphabet is needed in most text-to-speech applications; if words and phrases can be entered phonetically, this method replicates our speech quite closely.
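The following is a minimal sketch of the parallel-formant idea, assuming NumPy and SciPy are available: a pulse-train source is passed through a few bandpass filters centered on rough formant frequencies for the vowel "ah", and the filter outputs are summed. The formant values and filter settings are approximate textbook assumptions, not measured data or the method described by [CATER83].

    import numpy as np
    from scipy.signal import butter, lfilter

    SAMPLE_RATE = 16000

    def pulse_train(f0_hz=120.0, duration_s=0.5):
        """One impulse per glottal period -- a crude voiced excitation."""
        n = int(duration_s * SAMPLE_RATE)
        period = int(SAMPLE_RATE / f0_hz)
        source = np.zeros(n)
        source[::period] = 1.0
        return source

    def formant_filter(signal, center_hz, bandwidth_hz=100.0):
        """Second-order Butterworth bandpass acting as one formant resonator."""
        low = (center_hz - bandwidth_hz / 2) / (SAMPLE_RATE / 2)
        high = (center_hz + bandwidth_hz / 2) / (SAMPLE_RATE / 2)
        b, a = butter(2, [low, high], btype="band")
        return lfilter(b, a, signal)

    source = pulse_train()
    # Approximate first three formants of the vowel "ah" (~700/1200/2600 Hz).
    vowel = sum(formant_filter(source, f) for f in (700.0, 1200.0, 2600.0))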

The third technique models the human vocal tract digitally. The methods behind it are highly mathematical, since they map the actions of the vocal tract onto equations. The most prevalent method has been linear predictive coding (LPC) of speech; closely related methods include partial autocorrelation (PARCOR) and parametric waveform encoding. Human speech, recorded in the same manner as for waveform encoding, is analyzed into its constituent frequencies and vocal characteristics and stored in that form, which greatly reduces the memory required. A well-known example of this technique is Texas Instruments' Speak & Spell children's learning aid.

The basic concept of linear predictive coding is that the recent past of a speech signal helps estimate its immediate future; in more mathematical terms, each upcoming speech sample can be approximated by a linear combination of the samples that precede it. The prediction coefficients, together with parameters such as the voice's pitch, formant characteristics, and amplitude, are what is actually stored.
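A minimal illustration of the prediction step, assuming NumPy: fit coefficients so that each sample of a short frame is predicted from the samples before it, then regenerate the frame from its own past. A real coder would also extract pitch and gain per frame and use a proper analysis method such as Levinson-Durbin; the simple least-squares fit and the toy test signal below are assumptions made for this sketch.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        """Least-squares predictor: frame[n] ~ sum_k a[k] * frame[n-1-k]."""
        rows = [frame[n - order:n][::-1] for n in range(order, len(frame))]
        past = np.array(rows)
        target = frame[order:]
        coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
        return coeffs

    def predict(frame, coeffs):
        """Reconstruct the frame from its own past using the coefficients."""
        order = len(coeffs)
        out = frame[:order].copy()       # seed with the first `order` samples
        for n in range(order, len(frame)):
            out = np.append(out, np.dot(coeffs, out[n - order:n][::-1]))
        return out

    # Toy "speech" frame: a decaying sine is predicted almost exactly.
    t = np.arange(240)
    frame = np.sin(2 * np.pi * 0.05 * t) * np.exp(-t / 200.0)
    a = lpc_coefficients(frame, order=4)
    error = np.max(np.abs(predict(frame, a) - frame))   # close to zero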

Where do we go from here: preparation for research in voice synthesis

Preparing for research in voice synthesis requires a fairly diverse range of coursework and understanding. A background in many biological and physiological functions is needed; the most helpful courses are those that deal directly with the lungs and the organs of the head, since almost all of these organs affect speech. Additionally, a study of linguistics allows the researcher to associate identifiable symbols with those biological characteristics. Such studies are prominent in language and music departments; indeed, depending on the nature of the research, many music studies have analyzed the flexible transitions of the voice.

Certainly, an understanding of the physical properties of sound and its representation applies to all sound synthesis, including voice (refer to I.B.1 System Components/Sound/3D Sound Synthesis). These properties provide concrete representations that can be measured, graphed, adjusted, and controlled with hardware. Working with that hardware requires some training in electrical and/or mechanical engineering, while implementing its software counterparts demands strong abilities in computer science and mathematics.

The following courses are suggestions for pursuing research in voice synthesis. These courses were taken from the Course Catalog at the University of Maryland at College Park.

COMPUTER SCIENCE: Programming Languages, Data Structures, Computer Algorithms, Database Design

ENGINEERING: Computer Systems Architecture, Digital Signal Processing

HEARING AND SPEECH SCIENCES: Linguistics, Brain and Human Communications, Anatomy and Physiology of Speech Mechanisms, Phonetic Science

MATHEMATICS: Linear Algebra, Probability and Statistics, Calculus, Differential Equations

MUSIC: Vocal Diction

PHYSICS: Physics of Music, Vibrations and Waves

References

[CATER83]: Cater, John P., Electronically Speaking: Computer Speech Generation, Howard W. Sams & Co., Inc., c.1983.
