Section 4.5.1: Keyword Recognition and Conceptual
Dependency
Section 4.5.2: Experiments and Results
4.5 Connected Word Recognition Using Keywords
One of our goals during the development of our system was to find the
best way to combine DSP, statistical, and AI technologies. We chose
to represent each continuously spoken sentence by conceptual
dependencies (CDs), because this representation is well suited to a
virtual reality system, as will be explained in the following chapters.
The CDs, and some of their components, that represent a spoken sentence
are chosen according to the results obtained by the keyword speech
recognition system.
The initial structure produced to represent a sentence may contain holes
(NILs) that need to be filled. These holes in the sentence's knowledge
representation may correspond to words that either were wrongly
recognized or are not part of the recognition vocabulary. The holes are
filled by a mechanism in which an inference engine uses rules and
context to look for the needed information.
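The idea of a representation with NIL holes can be illustrated with a
minimal sketch. The slot names, the `CDFrame` class, and the
`fill_from_context` helper are hypothetical and are not taken from the
system described here; a plain dictionary lookup stands in for the
rule-based inference engine.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CDFrame:
    """Hypothetical sentence representation; None plays the role of a NIL hole."""
    actor: Optional[str] = None
    verb: Optional[str] = None
    obj: Optional[str] = None
    recipient: Optional[str] = None

    def holes(self):
        """Return the names of the slots that still need to be filled."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def fill_from_context(frame, context):
    """Fill each hole from a context dictionary when a value is available.

    A real inference engine would apply rules; this lookup is only a stand-in.
    """
    for slot in frame.holes():
        if slot in context:
            setattr(frame, slot, context[slot])
    return frame


# Keywords were recognized for three slots, but no verb survived recognition.
frame = CDFrame(actor="robot", obj="shoes", recipient="father")
print(frame.holes())
fill_from_context(frame, {"verb": "give"})  # assumed context, for illustration
print(frame.holes())
```

Keeping the holes explicit lets a later stage decide which slots still
need inference instead of guessing at recognition time.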
As an analogy, suppose a person is learning a foreign language and
knows only some words of that language. When someone speaks a phrase in
that language to him, he picks out just the words that he knows. He may
also misunderstand some of them, so the remaining meaning is fuzzy or
unintelligible. He tries to understand the whole sentence from the
words that he understands and from the context surrounding the
conversation. He forms an idea in his mind about the phrase that he
just heard. It is possible that later sentences will correct or
discredit that idea.
According to conceptual dependency theory, each phrase can be
decomposed into several categories, such as ACTORS, VERBS, and OBJECTS.
We represent each category by a connected network in which the HMM
models of all the words that belong to that category are concatenated
together. The silence and extraneous models are also concatenated into
the network (see Fig. 4.4). The extraneous model for a category is
created using all the words that do not belong to that category. For
example, to create the extraneous model for the category VERBS, we used
all the words that are not verbs.
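The choice of training words for an extraneous model amounts to a set
difference, which the following sketch illustrates. The words are taken
from the worked example later in this section; the `VOCABULARY`
dictionary and the `extraneous_words` function are assumptions made for
illustration, and the actual HMM training is not shown.

```python
# Keyword vocabulary per category. The words come from the worked example
# in this section; grouping them in a dict is an assumption of this sketch.
VOCABULARY = {
    "humans": {"robot", "he", "me", "father"},
    "verbs": {"give", "live", "show"},
    "objects": {"object", "shoes", "dog"},
    "places": {"living-room", "kitchen"},
}


def extraneous_words(category, vocabulary=VOCABULARY):
    """Return every vocabulary word that does not belong to `category`."""
    others = set()
    for name, words in vocabulary.items():
        if name != category:
            others |= words
    # Guard against a word that is listed under more than one category.
    return others - vocabulary[category]


print(sorted(extraneous_words("verbs")))
```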
For a given sentence we used the keyword spotting technique to find
some of the members of each category embedded in it.
The DSP module produces a cepstral vector quantization (VQ)
representation of the input and passes the index vectors to each of the
connected HMM networks. There, the Viterbi algorithm finds the optimum
sequence of states, which in turn represents the best possible words for
each category. These words are labeled the "Viterbi raw words".
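As a rough illustration of this decoding step, the sketch below runs a
log-domain Viterbi search over a discrete-observation HMM and then
collapses the state path into labeled frame ranges. The two-state toy
model, its probabilities, and the function names `viterbi` and
`segments` are assumptions; the networks described here are far larger
and concatenate many word models.

```python
import math


def viterbi(observations, log_start, log_trans, log_emit):
    """Return the most likely state sequence for a discrete-observation HMM.

    observations: list of VQ codebook indices
    log_start[s]: log probability of starting in state s
    log_trans[r][s]: log probability of moving from state r to state s
    log_emit[s][o]: log probability of state s emitting index o
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[s][observations[0]] for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[s][obs])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segments(path, state_labels):
    """Collapse a state path into (label, first_frame, last_frame), 1-based.

    Adjacent frames with the same label are merged into one segment.
    """
    result = []
    for frame, state in enumerate(path, start=1):
        label = state_labels[state]
        if result and result[-1][0] == label:
            result[-1] = (label, result[-1][1], frame)
        else:
            result.append((label, frame, frame))
    return result


# Toy model: state 0 is an extraneous model, state 1 a keyword model.
log = math.log
log_start = [log(0.5), log(0.5)]
log_trans = [[log(0.8), log(0.2)], [log(0.2), log(0.8)]]
log_emit = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]

path = viterbi([0, 0, 1, 1, 1, 0], log_start, log_trans, log_emit)
print(segments(path, ["extraneous", "keyword"]))
```

Working with log probabilities avoids numerical underflow on utterances
that span many frames.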
Next, we produce the set of "Viterbi pruned words" by eliminating some
of the Viterbi raw words: the extraneous words, the silences, and the
words whose durations are shorter than some threshold. The Viterbi
pruned words contain only words that can be used in the CD primitives.
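The pruning rule can be sketched as a simple filter. The threshold value
is not stated in the text; the 14-frame minimum used below is an
assumption, chosen only because it is consistent with the worked example
later in this section. The `prune` function and the tuple layout are
likewise hypothetical.

```python
MIN_FRAMES = 14  # hypothetical threshold; the text does not state its value


def prune(raw_words, min_frames=MIN_FRAMES):
    """Drop extraneous segments, silences, and words shorter than min_frames.

    raw_words: list of (label, first_frame, last_frame), frames inclusive.
    """
    pruned = []
    for label, first, last in raw_words:
        if label.startswith("extraneous-model") or label == "silence-model":
            continue
        if last - first + 1 < min_frames:
            continue
        pruned.append((label, first, last))
    return pruned


# Raw Viterbi output for the Humans category in the worked example.
humans_raw = [
    ("robot", 1, 45),
    ("extraneous-model-for-humans", 46, 46),
    ("he", 47, 69),
    ("extraneous-model-for-humans", 70, 71),
    ("me", 72, 74),
    ("extraneous-model-for-humans", 75, 113),
    ("father", 114, 140),
    ("he", 141, 150),
    ("extraneous-model-for-humans", 151, 152),
]
print(prune(humans_raw))
```

With this assumed threshold, the Humans list reduces to robot, he, and
father, which matches the pruned list given in the worked example.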
The final step is to reduce multiple overlaps between words of
different categories. Due to incorrect recognition, a given set of index
vectors may produce several different words in the same frame range. To
find the correct one, we evaluate each word's model probability and
choose the word with the highest probability. We label the resulting
string of words the "Viterbi final words". Figure 4.5
shows a block diagram of the whole process.
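One possible reading of this selection step is a greedy choice by score,
sketched below. The frame ranges are borrowed from the worked example,
but the scores are invented log probabilities, and the `overlaps` and
`resolve_overlaps` functions are assumptions. The text does not say how
much overlap makes two words compete, and the worked example retains
some words whose ranges overlap, so the actual criterion is evidently
more selective than this simplification.

```python
def overlaps(a, b):
    """True when two (label, first, last, score) segments share a frame."""
    return a[1] <= b[2] and b[1] <= a[2]


def resolve_overlaps(candidates):
    """Greedy selection by score.

    Words are visited from best to worst score, and a word is kept only
    if it does not overlap a word that has already been kept.
    """
    kept = []
    for word in sorted(candidates, key=lambda w: w[3], reverse=True):
        if not any(overlaps(word, other) for other in kept):
            kept.append(word)
    return sorted(kept, key=lambda w: w[1])  # restore time order


candidates = [
    ("he", 47, 69, -310.0),       # scores are invented for illustration
    ("kitchen", 52, 66, -355.0),
    ("shoes", 75, 99, -280.0),
]
print([w[0] for w in resolve_overlaps(candidates)])
```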
Each of these three word lists is fed to an expert system which
produces a representation of the sentence meaning using this
information and the context in which the sentence was spoken.
As an example, the following sentence was spoken: "Robot, give the shoes
to the father." This sentence had 152 frames of 128 samples each and
lasted 2.432 seconds (which implies a sampling rate of 8 kHz).
The Viterbi algorithm found the following:
Category Humans:
- robot, from frame 1 to 45
- extraneous-model-for-humans, from frame 46 to 46
- he, from frame 47 to 69
- extraneous-model-for-humans, from frame 70 to 71
- me, from frame 72 to 74
- extraneous-model-for-humans, from frame 75 to 113
- father, from frame 114 to 140
- he, from frame 141 to 150
- extraneous-model-for-humans, from frame 151 to 152
Category Verbs:
- extraneous-model-for-verbs, from frame 1 to 58
- give, from frame 59 to 70
- extraneous-model-for-verbs, from frame 71 to 74
- live, from frame 75 to 80
- show, from frame 81 to 87
- extraneous-model-for-verbs, from frame 88 to 152
Category Objects:
- extraneous-model-for-objects, from frame 1 to 3
- silence-model, from frame 4 to 12
- extraneous-model-for-objects, from frame 13 to 32
- object, from frame 33 to 38
- extraneous-model-for-objects, from frame 39 to 51
- object, from frame 52 to 59
- extraneous-model-for-objects, from frame 60 to 74
- shoes, from frame 75 to 99
- extraneous-model-for-objects, from frame 100 to 108
- dog, from frame 109 to 114
- extraneous-model-for-objects, from frame 115 to 152
Category Places:
- extraneous-model-for-places, from frame 1 to 33
- living-room, from frame 34 to 37
- extraneous-model-for-places, from frame 38 to 51
- kitchen, from frame 52 to 66
- extraneous-model-for-places, from frame 67 to 77
- kitchen, from frame 78 to 87
- extraneous-model-for-places, from frame 88 to 152
Category Numbers:
- extraneous-model-for-numbers, from frame 1 to 3
- silence-model, from frame 4 to 12
- extraneous-model-for-numbers, from frame 13 to 46
- three, from frame 47 to 80
- extraneous-model-for-numbers, from frame 81 to 93
- one, from frame 94 to 100
- extraneous-model-for-numbers, from frame 101 to 113
- one, from frame 114 to 130
- extraneous-model-for-numbers, from frame 131 to 152
Category Questions:
- how-is, from frame 1 to 16
- extraneous-model-for-question, from frame 17 to 18
- how-are, from frame 19 to 33
- extraneous-model-for-question, from frame 34 to 34
- what, from frame 35 to 47
- what-do, from frame 48 to 59
- extraneous-model-for-question, from frame 60 to 67
- what, from frame 68 to 73
- extraneous-model-for-question, from frame 74 to 106
- where-is, from frame 107 to 114
- extraneous-model-for-question, from frame 115 to 118
- how, from frame 119 to 130
- extraneous-model-for-question, from frame 131 to 152
After pruning we had the following:
Category Humans:
- robot, from frame 1 to 45
- he, from frame 47 to 69
- father, from frame 114 to 140
Category Verbs:
- No Verbs
Category Objects:
- shoes, from frame 75 to 99
Category Places:
- kitchen, from frame 52 to 66
Category Numbers:
- three, from frame 47 to 80
- one, from frame 114 to 130
Category Questions:
- how-is, from frame 1 to 16
- how-are, from frame 19 to 33
Finally, after overlap resolution, the final selection was:
Category Humans:
- robot, from frame 1 to 45
- he, from frame 52 to 66
- father, from frame 114 to 140
Category Verbs:
- No Verbs
Category Objects:
- shoes, from frame 75 to 99
Category Places:
- No Places
Category Numbers:
- three, from frame 47 to 80
- one, from frame 114 to 130
Category Questions:
- how-is, from frame 1 to 16