Section 4.5.1: Keyword Recognition and Conceptual Dependency
Section 4.5.2: Experiments and Results


4.5 Connected Word Recognition Using Keywords

One of our goals during the development of our system was to find the best way to combine DSP, statistical, and AI technologies. We chose to represent each continuously spoken sentence by conceptual dependencies (CDs), because this representation is well suited to a virtual reality system, as will be explained in the following chapters. The CDs that represent a spoken sentence, and some of their components, are chosen according to the results obtained by the keyword speech recognition system.

The initial structure used to represent a sentence may contain holes (NILs) that need to be filled. These holes in the sentence's knowledge representation may correspond to words that were either recognized incorrectly or are not part of the recognition vocabulary. The holes are filled by a mechanism in which an inference engine uses rules and context to look for the needed information.
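
As a rough illustration of this idea, the sketch below models a CD structure as a dictionary whose unfilled slots are None (standing in for NIL) and applies one context rule to it. The slot names, the rule, and the context dictionary are hypothetical and are not taken from the system described here.

```python
from typing import Callable, Dict, List, Optional

Frame = Dict[str, Optional[str]]
Rule = Callable[[Frame, Dict[str, str]], None]


def holes(frame: Frame) -> List[str]:
    """Return the names of the slots that are still empty (NIL)."""
    return [slot for slot, value in frame.items() if value is None]


def default_actor_from_context(frame: Frame, context: Dict[str, str]) -> None:
    """Hypothetical rule: if no actor was recognized, assume the last addressee."""
    if frame.get("actor") is None and "last_addressee" in context:
        frame["actor"] = context["last_addressee"]


def fill_holes(frame: Frame, context: Dict[str, str], rules: List[Rule]) -> Frame:
    """Apply every rule once to a copy of the frame and return the copy."""
    filled = dict(frame)
    for rule in rules:
        rule(filled, context)
    return filled
```

A slot that no rule can fill simply stays None, so a later sentence or a later rule can still supply or correct it.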

As an analogy, suppose a person is learning a foreign language and knows only some words of that language. When someone speaks a phrase in that language to him, he picks out only the words that he knows. He may also misunderstand some of them, so the rest of the meaning is fuzzy or unintelligible. He tries to understand the whole sentence from the words that he does understand and from the context surrounding the conversation, and he forms an idea of what the phrase meant. Later sentences may correct or discredit that idea.

According to conceptual dependency theory, each phrase can be decomposed into several categories, such as ACTORS, VERBS, and OBJECTS. We represent each category by a connected network, in which the HMM models of all the words that belong to that category are concatenated together. The silence and extraneous models are concatenated into the same network (see Fig. 4.4). The extraneous model for a category is created using all the words that do not belong to that category. For example, to create the extraneous model for the category VERBS, we used all the words that are not verbs.
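
The construction of the extraneous training set can be sketched as a set difference over the vocabulary. The function names and the dictionary layout below are assumptions made for illustration; only the model labels (silence-model, extraneous-model-for-...) follow the naming used in the example later in this section.

```python
from typing import Dict, List, Set


def extraneous_vocabulary(categories: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each category, collect every vocabulary word outside that category."""
    all_words = set().union(*categories.values())
    return {name: all_words - words for name, words in categories.items()}


def network_entries(category: str, categories: Dict[str, Set[str]]) -> List[str]:
    """List the models placed in one category network:
    the category's keywords, the silence model, and the extraneous model."""
    keywords = sorted(categories[category])
    return keywords + ["silence-model", f"extraneous-model-for-{category}"]
```

For a toy vocabulary with the categories verbs = {give, show} and objects = {shoes, dog}, the extraneous set for verbs would be {shoes, dog}.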

For a given sentence we used the keyword spotting technique to find some of the members of each category embedded in it.

The DSP module produces a cepstral Vector Quantization (VQ) representation of the input and passes the resulting index vectors to each of the connected HMM networks. In each network, the Viterbi algorithm finds the optimum sequence of states, which in turn identifies the most likely words for that category. These words are labeled the "Viterbi raw words".
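
For readers unfamiliar with the decoding step, the following is a minimal sketch of the Viterbi algorithm for a discrete-observation HMM, working in the log domain. It is a textbook formulation, not the implementation used in this system; the parameter layout and the helper logs are assumptions. It returns only the state sequence, whereas the networks described above would additionally map runs of states back to word models.

```python
import math
from typing import List, Sequence


def logs(row: Sequence[float]) -> List[float]:
    """Convert probabilities to log probabilities (zero maps to -inf)."""
    return [math.log(p) if p > 0 else float("-inf") for p in row]


def viterbi(obs: Sequence[int],
            log_start: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_emit: Sequence[Sequence[float]]) -> List[int]:
    """Return the most likely state sequence for a non-empty list of
    VQ codebook indices, given log-domain HMM parameters."""
    n_states = len(log_start)
    # score[s] = best log probability of any path that ends in state s
    score = [log_start[s] + log_emit[s][obs[0]] for s in range(n_states)]
    backptr: List[List[int]] = []
    for symbol in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[s][symbol])
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```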

Next, we produce the set of "Viterbi pruned words" by eliminating some of the Viterbi raw words: the extraneous words, the silences, and the words whose durations are less than some threshold. The Viterbi pruned words contain only words that can be used in the CD primitives.
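
The pruning rule can be sketched as a simple filter over labeled frame ranges. The Segment type and the function name are assumptions. The threshold value is not stated in the text; the 15 frames used in the usage note is an assumption, chosen only because it is consistent with the example lists in this section.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str
    start: int  # first frame, inclusive
    end: int    # last frame, inclusive


def prune(raw: List[Segment], min_frames: int) -> List[Segment]:
    """Drop extraneous and silence segments, and keywords shorter than min_frames."""
    kept = []
    for seg in raw:
        is_filler = (seg.label.startswith("extraneous-model")
                     or seg.label == "silence-model")
        duration = seg.end - seg.start + 1
        if not is_filler and duration >= min_frames:
            kept.append(seg)
    return kept
```

Applied with min_frames=15 to the raw Humans list of the example below, this filter keeps robot (1 to 45), he (47 to 69), and father (114 to 140), and drops me (72 to 74) and the second he (141 to 150).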

The final step is to resolve overlaps between words of different categories. Because of recognition errors, a given set of index vectors may produce several different words over the same range of frames. To find the correct one, we evaluate each word's model probability and choose the word with the highest probability. We label the resulting string of words the "Viterbi final words". Figure 4.5 shows a block diagram of the whole process.
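
One possible reading of this step is a greedy selection by score, sketched below. The text does not specify the overlap criterion or how the word probabilities are computed, so both are assumptions here: any shared frame counts as an overlap, and log_prob is a placeholder score supplied by the caller. The selection actually used by the system may be more permissive than this.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    label: str
    category: str
    start: int       # first frame, inclusive
    end: int         # last frame, inclusive
    log_prob: float  # placeholder model score (higher is better)


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True when the two frame ranges share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def resolve_overlaps(candidates: List[Candidate]) -> List[Candidate]:
    """Visit candidates from best to worst score and keep each one
    only if it does not overlap a candidate that was already kept."""
    kept: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.log_prob, reverse=True):
        if not any(overlaps(cand, other) for other in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)
```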

Each of these three word lists is fed to an expert system which produces a representation of the sentence meaning using this information and the context in which the sentence was spoken.

As an example, the following sentence was spoken: "Robot, give the shoes to the father." The utterance was 152 frames long (each frame consists of 128 samples), or 2.432 seconds.
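
The figures quoted for this utterance can be cross-checked with a few lines of arithmetic. The sketch assumes that frames do not overlap; under that assumption the numbers imply a sampling rate of about 8 kHz and a frame length of about 16 ms, neither of which is stated explicitly in the text. The function name is hypothetical.

```python
from typing import Tuple


def utterance_stats(n_frames: int, samples_per_frame: int,
                    duration_s: float) -> Tuple[int, float, float]:
    """Return (total samples, implied sample rate in Hz, frame length in ms),
    assuming non-overlapping frames."""
    total_samples = n_frames * samples_per_frame
    sample_rate_hz = total_samples / duration_s
    frame_ms = 1000.0 * samples_per_frame / sample_rate_hz
    return total_samples, sample_rate_hz, frame_ms


total, rate, frame_ms = utterance_stats(152, 128, 2.432)
```

Here total is 152 × 128 = 19456 samples, and dividing by 2.432 seconds gives the implied rate.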

The Viterbi algorithm found the following:

Category Humans:
  1. robot, from frame 1 to 45
  2. extraneous-model-for-humans, from frame 46 to 46
  3. he, from frame 47 to 69
  4. extraneous-model-for-humans, from frame 70 to 71
  5. me, from frame 72 to 74
  6. extraneous-model-for-humans, from frame 75 to 113
  7. father, from frame 114 to 140
  8. he, from frame 141 to 150
  9. extraneous-model-for-humans, from frame 151 to 152
Category Verbs:
  1. extraneous-model-for-verbs, from frame 1 to 58
  2. give, from frame 59 to 70
  3. extraneous-model-for-verbs, from frame 71 to 74
  4. live, from frame 75 to 80
  5. show, from frame 81 to 87
  6. extraneous-model-for-verbs, from frame 88 to 152
Category Objects:
  1. extraneous-model-for-objects, from frame 1 to 3
  2. silence-model, from frame 4 to 12
  3. extraneous-model-for-objects, from frame 13 to 32
  4. object, from frame 33 to 38
  5. extraneous-model-for-objects, from frame 39 to 51
  6. object, from frame 52 to 59
  7. extraneous-model-for-objects, from frame 60 to 74
  8. shoes, from frame 75 to 99
  9. extraneous-model-for-objects, from frame 100 to 108
  10. dog, from frame 109 to 114
  11. extraneous-model-for-objects, from frame 115 to 152
Category Places:
  1. extraneous-model-for-places, from frame 1 to 33
  2. living-room, from frame 34 to 37
  3. extraneous-model-for-places, from frame 38 to 51
  4. kitchen, from frame 52 to 66
  5. extraneous-model-for-places, from frame 67 to 77
  6. kitchen, from frame 78 to 87
  7. extraneous-model-for-places, from frame 88 to 152
Category Numbers:
  1. extraneous-model-for-numbers, from frame 1 to 3
  2. silence-model, from frame 4 to 12
  3. extraneous-model-for-numbers, from frame 13 to 46
  4. three, from frame 47 to 80
  5. extraneous-model-for-numbers, from frame 81 to 93
  6. one, from frame 94 to 100
  7. extraneous-model-for-numbers, from frame 101 to 113
  8. one, from frame 114 to 130
  9. extraneous-model-for-numbers, from frame 131 to 152
Category Questions:
  1. how-is, from frame 1 to 16
  2. extraneous-model-for-question, from frame 17 to 18
  3. how-are, from frame 19 to 33
  4. extraneous-model-for-question, from frame 34 to 34
  5. what, from frame 35 to 47
  6. what-do, from frame 48 to 59
  7. extraneous-model-for-question, from frame 60 to 67
  8. what, from frame 68 to 73
  9. extraneous-model-for-question, from frame 74 to 106
  10. where-is, from frame 107 to 114
  11. extraneous-model-for-question, from frame 115 to 118
  12. how, from frame 119 to 130
  13. extraneous-model-for-question, from frame 131 to 152
After pruning, we had the following:

Category Humans:
  1. robot, from frame 1 to 45
  2. he, from frame 47 to 69
  3. father, from frame 114 to 140
Category Verbs:
  1. No Verbs
Category Objects:
  1. shoes, from frame 75 to 99
Category Places:
  1. kitchen, from frame 52 to 66
Category Numbers:
  1. three, from frame 47 to 80
  2. one, from frame 114 to 130
Category Questions:
  1. how-is, from frame 1 to 16
  2. how-are, from frame 19 to 33
Finally, the final selection was:

Category Humans:
  1. robot, from frame 1 to 45
  2. he, from frame 52 to 66
  3. father, from frame 114 to 140
Category Verbs:
  1. No Verbs
Category Objects:
  1. shoes, from frame 75 to 99
Category Places:
  1. No Places
Category Numbers:
  1. three, from frame 47 to 80
  2. one, from frame 114 to 130
Category Questions:
  1. how-is, from frame 1 to 16