Wilpon and Rabiner described an automated call-type recognition experiment developed at AT&T, in which the user was allowed to say only five keywords to complete a call: collect, person to person, operator, third number, and calling card [15]. In this experiment, 20% of the users' responses included other words.
For instance, one user who wanted to make a call with a telephone card, instead of just saying "calling card", said:
um? Gee, ok I'd like to place a calling card call
If we try to recognize this sentence with a system trained only on the keyword "calling card", recognition fails because of the surrounding words that were never trained. This type of problem requires a mechanism for finding keywords embedded in a phrase. One solution is to include an extraneous speech model in the system (see Fig. 3.3). This extraneous speech model is created using all the words that are outside the recognition vocabulary. The system also includes a model that represents silence and background noise.
As the figure shows, the keyword models are connected with the extraneous speech, silence and background noise models to form a connected network. The Viterbi algorithm is then used to find the time at which the states of a keyword model are reached in the network; if the best path remains in those states for some time, a successful keyword match is declared.
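The decoding step just described can be illustrated with a minimal sketch. It is not the implementation of [15]: for brevity each model (keyword, extraneous speech, silence) is collapsed to a single state, whereas a real system uses multi-state HMMs, and the probabilities, the frame alignment `truth`, the helper names `viterbi` and `spot_keywords`, and the `min_frames` threshold are all assumptions chosen for the illustration.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for log_emit[t][s] (frames x states)."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def spot_keywords(path, labels, keywords, min_frames):
    """Return (label, start, end) for keyword runs of at least min_frames.

    start is inclusive, end is exclusive (frame indices).
    """
    hits, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or labels[path[t]] != labels[path[start]]:
            label = labels[path[start]]
            if label in keywords and t - start >= min_frames:
                hits.append((label, start, t))
            start = t
    return hits


# Toy network: one state per model (an assumption made for brevity).
labels = ["extraneous", "calling card", "silence"]
truth = [2, 2, 0, 0, 0, 1, 1, 1, 1, 0, 0, 2, 2]  # hypothetical frame alignment
n = len(labels)
log_init = [math.log(1.0 / n)] * n
log_trans = [[math.log(0.8 if p == s else 0.1) for s in range(n)] for p in range(n)]
# Each frame is scored highest by the model that "truly" produced it.
log_emit = [[math.log(0.9 if s == k else 0.05) for s in range(n)] for k in truth]

path = viterbi(log_init, log_trans, log_emit)
hits = spot_keywords(path, labels, {"calling card"}, min_frames=3)
print(hits)
```

The minimum-duration check stands in for the condition that the path must stay in the keyword states for some time; raising `min_frames` makes the detector more conservative at the cost of missing short keyword occurrences.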
The connected network is first built from HMMs trained in isolation: one for each keyword, one for the extraneous words, and one for silence and background noise. Each keyword HMM is created using isolated repetitions of that word. The model of the extraneous words is created using all the isolated words that do not belong to the recognition vocabulary. The model for silence and background noise is created using samples of the noise environment without the user's speech.
The connected network is then retrained with continuous speech, using the Viterbi algorithm to segment the sentences. The Viterbi algorithm separates each sentence into three kinds of segments: those that correspond to the extraneous words, those that correspond to silence and background noise, and those that correspond to each of the keywords. If the Viterbi segmentation fails, the sentences are segmented manually.
Each keyword model is updated using its corresponding segments found in the previous step. The extraneous-word model is likewise updated using all the extraneous segments found in the sentences, and the same procedure is applied to the silence and background noise model. Finally, the network is rebuilt with the updated HMMs of the keywords, extraneous words, silence and background noise.
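The update step can be sketched as follows, under strong simplifying assumptions: each model is reduced to a single one-dimensional Gaussian (mean and variance) rather than a full HMM, and the frame-level labels are taken as already produced by the segmentation. The function names `pool_frames_by_model` and `reestimate`, the variance floor, and the feature values are hypothetical and only illustrate how segments are pooled per model before re-estimation.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def pool_frames_by_model(utterances):
    """Group feature frames by the model that the segmentation assigned them to.

    utterances: list of (frames, frame_labels) pairs, where frame_labels[t]
    names the model (keyword, extraneous, silence) aligned to frames[t].
    """
    pooled = defaultdict(list)
    for frames, frame_labels in utterances:
        for value, label in zip(frames, frame_labels):
            pooled[label].append(value)
    return dict(pooled)


def reestimate(pooled, var_floor=1e-3):
    """Re-estimate one (mean, variance) pair per model from its pooled frames.

    The variance floor (an assumption) avoids degenerate models when a
    model received very few frames.
    """
    return {
        label: (fmean(values), max(pvariance(values), var_floor))
        for label, values in pooled.items()
    }


# Two hypothetical segmented sentences with scalar features.
utterances = [
    ([0.0, 0.2, 5.0, 5.2], ["silence", "silence", "calling card", "calling card"]),
    ([3.0, 3.4, 4.8], ["extraneous", "extraneous", "calling card"]),
]
models = reestimate(pool_frames_by_model(utterances))
for label, (mean, var) in sorted(models.items()):
    print(f"{label}: mean={mean:.2f} var={var:.4f}")
```

In a full system the re-estimated models would be placed back into the network and the segment-and-update cycle repeated; the sketch covers only a single update pass.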
Using keyword speech recognition, Wilpon and Rabiner report that they correctly recognized 95.1% of the five keywords in fluent speech spoken over long-distance telephone lines.