
The Expert Surgical Assistant: An Intelligent Virtual Environment with Multimodal Input

M. Billinghurst, J. Savage, P. Oppenheimer, C. Edmond
Human Interface Technology Laboratory
University of Washington
PO Box 352142, Seattle, WA 98195-2142

{grof,savage,peter}@hitl.washington.edu

MAJ_Charles_Edmond@SMTPLINK.MAMC.AMEDD.ARMY.MIL


Abstract

Virtual Reality has made computer interfaces more intuitive but not more intelligent. This paper shows how an expert system can be coupled with multimodal input in a virtual environment to provide an intelligent simulation tool or surgical assistant. This is accomplished in three steps. First, voice and gestural input is interpreted and represented in a common semantic form. Second, a rule-based expert system is used to infer context and user actions from this semantic representation. Finally, the inferred user actions are matched against steps in a surgical procedure to monitor the user's progress and provide automatic feedback. In addition, the system can respond immediately to multimodal commands for navigational assistance and/or identification of critical anatomical structures. To show how these methods are used we present a prototype sinus surgery interface. The approach described here may easily be extended to a wide variety of medical and non-medical training applications by making simple changes to the expert system database and virtual environment models. Successful implementation of an expert system in both simulated and real surgery has enormous potential for the surgeon, both in training and in clinical practice.

Keywords

Virtual Reality, Surgery Simulation, Sinus Surgery, Multimodal Input, Voice Recognition.

Introduction

We have developed a prototype surgery interface which couples an expert system with a virtual environment and multimodal voice and gesture recognition. The interface allows surgeons to interact with virtual tissue and organ models using an intuitive combination of voice and gesture and also monitors their actions to give automatic feedback. In this way the interface can be used as a training tool, providing expert feedback during a simulated operation, or as a development tool for evaluating different types of intelligent medical interfaces. In the future the techniques described here could be used in an intelligent surgical assistant that would give assisted navigation or identification of critical surgical landmarks during actual operative dissection.

Our interface is composed of three different components: a virtual anatomical model, an interpretation module for integrating speech and gesture into a common semantic representation, and a rule-based expert system which uses this representation to interpret the user's actions and match them against components of a surgical procedure. While speech and gesture have been used together before in virtual environments, this is the first time that they have been coupled with an expert system which infers context and higher level understanding. This is particularly important in situations where the computer needs to be able to monitor what the surgeon is doing and notify them if they are performing steps in an operation either out of sequence or in a dangerous manner.

The first application we have developed using this approach is for sinus surgery. Here it is vital that the surgeon remain cognizant of instrument location within the nasal cavity at all times. As the operation proceeds this becomes more and more critical, because the landmarks are fewer and vital structures (optic nerve, carotid artery and skull base) converge from an anterior to posterior plane of dissection. With our interface the surgeon can use vocal commands to retrieve CT scans of the patient's head at the location of the surgical tool, display a three dimensional model showing the tool location in the nasal cavity, or obtain visual directions to specific surgical landmarks. The expert system also continually monitors the surgeon's actions, giving visual and auditory warnings to prevent inadvertent injury to critical structures. As a training tool the expert system can watch student progress and suggest or demonstrate the next step in a procedure.

The remainder of this paper describes the motivation for our work, the problems we set out to address and the general approaches used to overcome them. We describe our interface in detail, covering both its visual aspects and the underlying expert system, outline the theoretical foundation used, and give directions for future research.

Motivation

This work arose out of an attempt to address two critical needs. The first is to investigate the best way to provide accurate localization information to the surgeon during a minimal access surgical (MAS) procedure. In brief, MAS (laparoscopic, endoscopic) has revolutionized the surgeon's approach to numerous surgical maladies by using keyhole techniques to minimize the damage to the surrounding tissue. The application of MAS has translated into an enormous saving in hospital bed days and personnel days lost to convalescence. However, this surgical advance has challenged the surgeon to acquire new and/or distinctly different skills for performing surgery. Additional consequences of MAS are felt to be a limited field of view, diminished tactile input and loss of 3-D spatial orientation. Efforts to minimize these side effects have been limited. Current navigational aids for localization within the head and neck region are confined to a two dimensional view of the instrument location, forcing the surgeon to mentally reconstruct the actual three dimensional position[1]. Current aids also do not respond automatically to user actions by warning the surgeon when the instrument is too close to critical structures.

A second need is to provide students with medical simulators which can respond in a lifelike manner to their actions and provide expert training assistance. Surgical training to date has revolved around the age-old tradition of "See One, Do One, Teach One". However, recent technological advances have allowed the development of cost effective simulators that enable students to repeat procedures many times before facing a real patient. Electronic simulators have already proven their value in military and commercial pilot and crew training and promise similar value in the medical arena. MAS is an area in which simulators could be particularly valuable. Acquiring the visual, manual and psychomotor skills necessary for successful MAS procedures requires considerable experience that can in part be developed through simulators. MAS computer simulation also offers the potential for standardized basic training and skill assessment without direct patient involvement. Despite this, there are currently no commercially available sinus surgery simulators, although several prototype systems for other endoscopic procedures have been demonstrated [2,3,4].

The common element between these two problems is the need for expert assistance, and so our primary motivation was to investigate possible solutions resulting from coupling an expert system with virtual reality technology. In doing so we have created a prototypical sinus surgery interface with several unique characteristics, described in the following sections.

Multimodal Input

Although humans communicate with each other through a range of different modalities, human-machine interfaces rarely reflect this. One of the contributions of this work is to show how multimodal input can be incorporated into medical interfaces. This is particularly desirable in a surgical setting, where surgeons typically have their hands engaged and rely heavily on vocal and gestural commands. While multimodal input has not previously been demonstrated in medical simulators, the cholecystectomy trainer with simulated multimodal input built by Lasko-Harvill et al. [5] received very positive feedback from users.

There are several additional reasons why we chose to use multimodal input. First, voice and gesture complement each other and, when used together, create an interface more powerful than either modality alone. Cohen [6,7] shows how natural language interaction is suited for descriptive techniques, while gestural interaction is ideal for direct manipulation of objects. Weimer and Ganapathy [8] also found that speech and gestures were suited to different tasks: in their 3D CAD package voice was used for navigating through menus and gesture for determining the graphics transformations.

Second, users prefer using combined voice and gestural communication over either modality alone when attempting graphics manipulation. Hauptmann and McAvinney [9] used a simulated speech and gesture recognizer in an experiment to judge the range of vocabulary and gestures used in a typical graphics task. Three different modes were tested - gesture only, voice only, and combined gesture and voice recognition. Users overwhelmingly preferred combined voice and gestural recognition due to the greater expressiveness possible. When combined input was possible, subjects used speech and gesture together 71% of the time, as opposed to voice only (13%) or gesture only (16%). Subjects were also able to express commands with the greatest sufficiency using combined input.

Finally, combining speech, gesture and context understanding improves recognition accuracy. By integrating speech and gesture recognition, Bolt [10] discovered that neither had to be perfect provided they converged on the user's intended meaning. In this case the computer responds to the user's commands by using speech and gesture recognition and taking the current context into account. For example, if a user says "Create a blue square there (pointing at a screen location)", and the speech recognizer fails to recognize "Create", the sentence can still be correctly understood by considering the graphical context. If there are no blue squares present then the missed word could only be "Create", and the system responds correctly.

These examples demonstrate three key advantages of multimodal interfaces:

- voice and gesture are suited to different tasks and complement each other,
- users prefer combined voice and gesture input over either modality alone, and
- combining modalities with contextual knowledge improves recognition accuracy.

Expert System

An expert system is necessary to effectively integrate voice and gestural input into a single semantic form. This unified representation can then be matched against procedural knowledge contained in the expert system to make inferences about user actions. In this way the system can recognize when the user performs specific actions and provide automatic intelligent feedback.

In our interface we use a rule-based system which encodes expert knowledge in a set of if-then production rules[11]; for example, if FACT-1 then DO-THIS, where FACT-1 is the event necessary for rule activation and DO-THIS the consequence of rule activation. The user's speech and actions in the virtual environment generate a series of facts which are passed to the expert system fact database and matched against the appropriate rules, causing an intelligent response. Rule-based expert systems have been successfully applied to a wide variety of domains, such as medical diagnosis [12], configuration of computer systems[13] and oil exploration [14].
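To make the production-rule mechanism concrete, the sketch below mimics it in Python; the actual system encodes its rules in CLIPS, so the fact and rule names here (a collision fact and a carotid-artery warning) are purely illustrative.

    # Illustrative analogue of the rule-based mechanism; the real system uses CLIPS.
    # Facts are tuples asserted into working memory; each rule fires when its
    # condition matches the current fact set (if FACT-1 then DO-THIS).

    facts = set()

    def rule_warn_near_carotid():
        # if a collision with the carotid bounding box is asserted, then warn
        if ("collide", "carotid_bounding_box") in facts:
            print("WARNING: instrument is close to the carotid artery")

    rules = [rule_warn_near_carotid]

    def assert_fact(fact):
        """Called when the virtual environment reports an event."""
        facts.add(fact)
        for rule in rules:          # re-evaluate every rule against the fact base
            rule()

    # Example: the simulated instrument enters a critical bounding box.
    assert_fact(("collide", "carotid_bounding_box"))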

Natural Language Processing

Natural Language Processing (NLP) techniques provide a theoretical framework for interpreting and representing multimodal input. NLP is traditionally concerned with understanding of written or spoken language, but these methods may be modified to understand multimodal input and create inferences about user actions from them. In effect Natural Language methods determine the representation scheme used in our expert system, while the expert system determines how these representations are to be manipulated in an intelligent way.

Previous multimodal interfaces show how voice and gesture can be integrated into a common semantic form [15,16], but our work extends beyond this by using low-level semantic knowledge as the basis for higher level pragmatic understanding. To do this we draw on established natural language techniques such as Conceptual Dependency (CD) representations and Scripts.

An important lesson learnt from the natural language literature is that the most successful systems use a combination of top-down and bottom-up processing to create conceptual models of user input. For example, BORIS [17] is a system which understands stories and responds to users' questions about them. Understanding begins with bottom-up parsing of textual input, which activates higher level semantic structures. These structures then make predictions about the possible text meaning which guide processing of subsequent input in a top-down manner. Thus semantic and conceptual information processing occurs as an integrated part of parsing the text input.

In a similar way, our multimodal interface brings semantic knowledge to bear as early as possible in the parsing process. Hill [18] argues that this should occur not just within each modality but also across modalities, so that the application level is no longer the point of modality integration. By blending conceptual representations of multimodal input at the lowest semantic level possible, the user can switch input modalities at any time without loss of understanding by the system.

The Surgical Assistant

As outlined above, our goals were to address the problems of instrument location during operative procedures and effective surgical simulation. Sinus surgery was selected as the initial application for a number of reasons: it is a minimal access endoscopic procedure and so presents navigational problems typical of endoscopic surgery, the surgical procedure is relatively easy to simulate, and the nasal anatomy is fairly rigid so we did not have to address the difficulty of modeling soft tissue deformation in a realistic manner.

The Interface

The sinus interface itself is shown in figure 1.0. It consists of a three dimensional virtual environment model of the inside of the nasal cavity connected to a rule-based expert system, voice recognition software and synthesized speech output. The virtual environment is viewed on a computer monitor, simulating the view from a video endoscope. Interaction with the model is through virtual instruments whose motions correspond to the movements of a Polhemus magnetic sensor attached to the surgeon's hand in the real world.

Figure 1.0: The surgical interface showing the virtual environment, CT scans with instrument position overlay and miniature navigational model.

The anatomical model was developed using sinus anatomy textbooks and actual patient CT scans. The emphasis was on producing a model of sufficient accuracy to be useful in procedural training, but not of such high complexity as to slow the simulation down to unacceptable frame rates. Graphics rendering, virtual environment interface devices, and interactions were controlled using Division Ltd's dVS and dVISE software. The expert system and speech recognition software were developed using CLIPS, an expert system shell available from NASA. The interface runs on a Silicon Graphics ONYX computer, while the expert system, voice recognizer and synthetic speech output run on a DEC Alpha. Two-way communication between the expert system and virtual environment is possible through the use of UNIX sockets. The voice recognition software permitted speaker-dependent continuous recognition with a vocabulary of 200 words, while gestures were limited to pointing and grasping due to the necessity of simulating real instruments.
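The paper does not describe the message format used over these sockets; the following Python fragment is only a sketch of the idea of the virtual environment streaming fact messages to the expert system process and reading back its response, with an invented host name, port and plain-text format.

    # Hypothetical sketch of the socket link between the virtual environment and
    # the expert system; host, port and message format are invented.
    import socket

    EXPERT_SYSTEM_ADDR = ("expert-host", 5000)   # assumed address of the expert system process

    def send_fact(fact_text):
        """Send one fact (e.g. a collision event) and return any reply."""
        with socket.create_connection(EXPERT_SYSTEM_ADDR) as sock:
            sock.sendall((fact_text + "\n").encode())
            return sock.recv(1024).decode().strip()

    # Example (not run here): report a collision and print any warning returned.
    # print(send_fact("collide uncinate_bounding_box"))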

In our work we are more interested in developing a testbed for prototyping interface ideas than in actual surgical implementation, although all the technologies necessary are currently being used in surgical settings. If this interface were to be used in actual surgery then the virtual model would be replaced with a live video feed from the endoscope. Actual patient CT data would be used to reconstruct an accurate anatomical model for the expert system database, and optical or magnetic tracking used for instrument location. Methods for model construction from CT data are detailed in [19], while surgical use of magnetic tracking techniques is outlined in [20]. Once techniques have been developed for rapid reconstruction of patient anatomy and accurate registration between the real and virtual world anatomy, the interface described will have greater applicability to actual surgical situations.

Navigation and Instrument Location

Instrument location is of critical importance in sinus surgery. Sinus surgeons currently rely on visual and tactile cues from the endoscope and surgical instruments respectively. In complex cases the use of navigational aids such as the ISG system[1] adds an additional level of information. This system utilizes instruments that are registered to the patient and their pre-operative CT scan. The computer interactively displays a virtual probe overlaid on the CT scan images, the tip of which moves in concert with a real probe guided in and around the patient's head. Other navigational techniques include displaying marching squares on the endoscopic video, showing a safe path for surgical dissection to a specific structure.

Using purely visual cues becomes more challenging as the operation progresses and the tissue anatomy changes. The disease process or prior surgical intervention may also affect traditional landmarks. In addition, many critical bony structures are hidden from view under soft tissue making it difficult to accurately estimate distance to them. CT scan overlay and marching squares address these limitations, but are passive methods which fail to provide automatic warnings as the instrument moves dangerously close to critical structures. CT overlays also force the surgeon to make the additional step of mentally reconstructing a three dimensional instrument location from several two dimensional displays.

We address the problem of instrument location in several ways. The first is a visual representation. Coronal and axial CT scans are displayed alongside the simulated endoscope view. The instrument tip location is marked on these images with a cross which moves according to the user's instrument motion. The CT scans also change in response to instrument depth, so the current images are those corresponding to the instrument tip location within the nasal cavity. In addition, a small three dimensional model of the nasal cavity is attached to the simulated endoscope view. A miniature instrument marker present in the model follows the user's motions exactly. The model can also be rotated independently of the endoscope view to prevent the instrument marker from being occluded. In this way the user can easily get accurate two or three-dimensional locational information.

The second technique is to allow the user to directly query the expert system about the location of anatomical landmarks. The system responds appropriately with synthesized speech and visual cues. For example, if the user says "Where is the sphenoid sinus?", the system will respond by highlighting the sphenoid if it is in view or providing a direction and distance if it isn't. The system can also respond to combined voice and gestural commands such as "What is that?" while pointing at an anatomical structure, by describing the structure being pointed to. It will also calculate paths to an object and a safe trajectory for instrument movement if necessary. If the user says "Show me how to get to the Natural Ostium", a set of marching circles will be overlaid on the anatomical model, showing a safe route to that structure (figure 2.0). The expert system is capable of identifying all of the anatomical structures referred to in sinus surgical procedures and of calculating distances between them and the instrument location.

Figure 2.0: The system responds to the command "Show me how to get to the Natural Ostium" by producing a path of marching circles to the Natural Ostium.
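A sketch of how such a location query might be resolved is given below. The landmark coordinates, distances and visibility test are placeholders, since the paper does not describe the underlying geometry; the sketch only illustrates the highlight-or-give-direction behaviour described above.

    # Illustrative handling of a "Where is the sphenoid sinus?" style query.
    import math

    ANATOMY = {                                   # hypothetical landmark positions (mm)
        "sphenoid sinus": (62.0, 5.0, -10.0),
        "natural ostium": (38.0, 12.0, -4.0),
    }

    def answer_where_is(name, instrument_pos, in_view):
        target = ANATOMY[name]
        if in_view(name):
            return f"highlight {name}"            # visual cue in the endoscope view
        dist = math.dist(instrument_pos, target)  # otherwise give distance and direction
        direction = tuple(round(t - i, 1) for t, i in zip(target, instrument_pos))
        return f"{name} is {dist:.0f} mm away, direction {direction}"

    # Example with a stubbed visibility test that reports the structure out of view:
    print(answer_where_is("sphenoid sinus", (20.0, 10.0, 0.0), lambda n: False))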

In addition to passive and command driven navigational aids, the expert system also gives automatic vocal and visual warnings when the instrument approaches predetermined critical structures. These include the lamina papyracea, skull base, carotid artery and optic nerve. Severe trauma or mortality results if any of these are damaged during a real procedure. Auditory and visual warnings are generated in response to two events. First, an instrument collision with an invisible bounding box surrounding the structure sends a collide message to the expert system. Following this, the instrument location is sent every second so that impending collisions can be determined more accurately and an even more urgent warning generated.
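The two-stage warning scheme could be sketched as follows; the bounding-box collision and the one-second position updates come from the description above, while the structure coordinates, threshold and messages are invented for illustration.

    # Sketch of the two-stage proximity warning (illustrative values only).
    import math

    CRITICAL = {                                  # hypothetical landmark positions (mm)
        "optic nerve":    (70.0, 18.0, 6.0),
        "carotid artery": (75.0, 0.0, -2.0),
    }

    def on_bounding_box_collision(structure):
        """Stage 1: fired once when the instrument enters the invisible bounding box."""
        print(f"Caution: approaching the {structure}")

    def on_position_update(structure, instrument_pos, danger_mm=3.0):
        """Stage 2: called every second after the collision to catch impending injury."""
        dist = math.dist(instrument_pos, CRITICAL[structure])
        if dist < danger_mm:
            print(f"WARNING: instrument within {dist:.1f} mm of the {structure}!")

    on_bounding_box_collision("optic nerve")
    on_position_update("optic nerve", (69.0, 17.0, 5.0))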

Simulation and Training

The locational techniques described require an expert system with simple rules to monitor and respond to user actions and commands. By extending this rule base and adding procedural knowledge, the same expert system can be used as a sinus surgery trainer. In doing this we were concerned with teaching students the various steps in a sinus surgical procedure rather than specific skills training, such as suturing or cutting techniques. Skills training is difficult with current virtual reality technology because of several outstanding challenges, most notably the realistic modeling of soft tissue deformation.

However, virtual reality has been shown to be useful for developing procedural knowledge. Loftin [21] details how users found NASA's Hubble repair simulator effective in teaching them the steps in the repair process, while others have presented medical virtual environment simulators [22]. To be effective a simulator must allow training at several levels, give evaluation of student performance, and provide expert feedback. It must also be open ended and not restrict the student to a linear sequence of operations. Our approach is unique among medical virtual environment simulators because it uses an expert system to achieve these requirements.

The sinus interface supports training of the classic anterior-to-posterior approach in paranasal sinus surgery. In constructing the expert system database, the complete surgical procedure is broken down into a number of self-contained steps, each of which is comprised of several tasks. Some of these steps need to be performed in order, while for others this is not essential. For example, if the bulla ethmoidalis needs to be removed then the anterior ethmoid air cells within the uncinate process must be dissected first to make it accessible. In contrast, frontal sinus disease can be addressed at any time once the anterior ethmoid air cells have been opened. A complete graph showing the required ordering of steps in the operative procedure is shown in figure 3.0. This graph is used by the expert system to monitor user progress through the operative procedure and provide accurate feedback. In contrast, the tasks that make up each surgical step do have a definite order, such as incisions that need to be made before tissue is removed, or instruments that need to be selected.
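One way to encode such an ordering graph is as a map from each step to its prerequisite steps. The fragment below is a deliberately incomplete, illustrative encoding using step names drawn from the examples above; the full graph is the one given in figure 3.0.

    # Illustrative (and incomplete) encoding of the step-ordering graph.
    PREREQUISITES = {
        "remove uncinate process": [],
        "open anterior ethmoid air cells": ["remove uncinate process"],
        "remove bulla ethmoidalis": ["open anterior ethmoid air cells"],
        "address frontal sinus disease": ["open anterior ethmoid air cells"],
    }

    def may_begin(step, completed):
        """A step may begin once all of its prerequisite steps are complete."""
        return all(p in completed for p in PREREQUISITES[step])

    completed = {"remove uncinate process"}
    print(may_begin("remove bulla ethmoidalis", completed))        # False
    print(may_begin("open anterior ethmoid air cells", completed)) # True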

To identify which stage of the operation the user is attempting, the expert system combines information from the user's multimodal input with the state of the virtual world and compares it to sets of rules designed to recognize context. Each set of rules corresponds to the onset of a different step in the surgical procedure and, when activated, causes the rest of the user interaction to be interpreted from within that surgical context. This contextual knowledge is then used to improve speech and gesture recognition. When the interface is used for training, activation of a particular context causes activation of the set of rules corresponding to the tasks contained in that surgical step. These rules differ from those used to recognize context in that they correspond to actions that need to be performed in a definite sequence. In this way the user's progress can be monitored through the operation at both the step and the task level.


Figure 3.0: The ordering of operative steps in the anterior-to-posterior approach to paranasal sinus surgery, from [23]. The number of steps that need to be completed depends on the severity and type of disease.

The response to incorrect actions depends on the training level chosen at the program outset. At the strictest level the system prevents the user from performing a task-level action out of sequence and gives an auditory warning when they try to do so. While the system cannot constrain instrument movement, it can prevent tissue dissection and removal, stopping the user from progressing further. At the next level the system will prevent the user from beginning a new surgical step before all the tasks in the current step are complete. In this case the system will remind the user of the previous tasks that remain undone. Finally, the system can be used in an unconstrained mode where it doesn't prevent any user actions but notifies users at the end of the entire operation of the surgical steps they skipped over.
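The three training levels might be represented as a simple policy that decides whether an out-of-sequence action is blocked; the level names below are invented labels for the behaviours just described.

    # Sketch of the training-level policy (labels and messages are invented).
    from enum import Enum

    class TrainingLevel(Enum):
        STRICT = 1         # block any task attempted out of sequence
        STEP = 2           # only block starting a new step while tasks remain undone
        UNCONSTRAINED = 3  # never block; skipped steps are reported at the end

    def allow_action(level, undone_tasks, starting_new_step):
        """Return True if the action may proceed under the chosen training level."""
        if level is TrainingLevel.STRICT and undone_tasks:
            print("Blocked: complete", ", ".join(undone_tasks), "first")
            return False
        if level is TrainingLevel.STEP and starting_new_step and undone_tasks:
            print("Blocked: tasks still undone in this step:", ", ".join(undone_tasks))
            return False
        return True        # unconstrained mode (or an allowed skip): logged for the debrief

    allow_action(TrainingLevel.STRICT, ["make incision"], starting_new_step=False)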

Aside from recording tasks left undone, the warnings generated during the surgical procedure and the time taken to complete it are also noted and can be used to evaluate student performance. The system also stores a record of all the user interactions so they can be played back at a later time and evaluated by the student's instructors. During playback the virtual instruments interact with the environment just as the user did when using the simulator, and the anatomical model responds in the same way.

Since the entire surgical procedure is encoded in the expert system database the interface can also respond to requests for help during training. If a student becomes confused they can ask "What do I do now?", and the system responds by vocalizing the next task or step in the operation. Alternatively, if the user asks "Show me what to do now", the rules for the next task are used to send commands to the virtual environment that move the instrument in the correct manner. At the same time the system describes what it is doing so the user receives visual and auditory reinforcement.

As can be seen, our sinus interface addresses the two tasks of instrument localization and surgical training. In the next section we describe in detail how to build the expert system and show how natural language techniques may be modified to support multimodal input.

The Theoretical Foundation

From the previous section it is clear that different amounts of expert knowledge are needed for each aspect of the interface. For instrument location the expert system needs to monitor user position and understand multimodal commands. For context recognition the system needs to be able to recognize where the user is in the operation, while for training the expert system needs detailed procedural knowledge of the simulated operation and the ability to monitor user performance at a task level. In this section we show how natural language processing techniques can be used to develop an expert system with these characteristics.

We use several steps in understanding multimodal input: syntactic features are first extracted from the user's raw input, semantic meaning is assigned to these features, and finally higher level pragmatic representations are used to create inferences about context and user actions. Each step is progressively more knowledge intensive. Contextual knowledge is used at all steps, ensuring we have both top-down and bottom-up processing. Integration of the various input modalities occurs at the semantic level, where a unified representation is created. The type of input to each of these stages varies, as does the output generated and the set of rules that the output is matched against, as shown in the table below.

Table One: Input, output and rules at each stage of multimodal understanding.

    Stage       Input                            Output                           Rules matched against
    Syntactic   raw speech, gesture, position    syntactic features               position and collision monitoring rules
    Semantic    syntactic features               Conceptual Dependencies (CDs)    direct multimodal command rules
    Pragmatic   CDs and current context          active context and scripts       context recognition and task/step rules

Syntactic Feature Extraction

Syntactic representation of user input is accomplished in several steps. First, significant features are extracted from the input data using statistical techniques such as Vector Quantization[24]. In speech input these features include phonemes, syllables or even words and phrases[25]. Wexelblat[26] has found commonality in gestures at the sub-full gesture level which may be used to reconstruct gestures, while the user's absolute position and orientation can be used for locational features. After features have been separated out, they are interpreted using the current context and a syntactic representation found:

    features + current context -> syntactic representation

Contextual cues can be gathered from the initial syntactic interpretation as well as from higher levels of semantic analysis. In a virtual environment, contextual cues could be given in a number of ways, including the user's location, a menu or voice command they have chosen, an action they are executing, or by the type of the virtual world they are in. Unlike in the real world, the virtual world designer has complete control of the environment and so can precisely specify the types of contextual cues that may be used. In our interface context is determined by where the user is in the surgical procedure and the actions they are performing.

Syntactic features are used to match against rules for monitoring instrument position and collisions. For example when the instrument collides with a bounding box a collision fact is inserted into the expert system fact database. If this fact matches a rule for warning the user they are too close to a critical structure, then the rule will be activated and the warning sounded.

Semantic Representation

Once a syntactic representation has been found, the associated semantic meaning can be established - in much the same way that humans recognize words in a conversation before trying to infer meaning. In order that higher level pragmatic understanding can be inferred from low level semantic knowledge, the semantic meaning needs to be encoded in a representation that satisfies two requirements[27]: the same meaning should produce the same representation regardless of the modality or surface form used to express it, and the representation should support inference about context and user actions.

There are a number of suitable constructs, such as Frames [28] and Conceptual Dependencies[30], developed from the theory that language is based on a very small number of semantic objects and the relations between them [29]. In our interface we use Conceptual Dependencies (CDs)[30] for semantic representation and combine groups of CDs together into Scripts for pragmatic understanding[31]. These constructs are traditionally used by expert systems for Natural Language Processing of text but until now have not been applied to multimodal interfaces or virtual environments.

Conceptual Dependencies essentially attempt to represent every action as the composition of one or more primitive actions, intermediate states and causal relations[30]. CDs use a simple structure to represent events, designed to ensure that no matter what form a description of the event takes, the CD representation will always be the same. Every CD structure has a number of empty slots that need to be filled, such as the actor, the action performed, the object acted upon, and the direction of the action.

For example, one of the CD primitive actions is called "ATRANS", representing an abstract transfer of possession, control or ownership. Using ATRANS we can form CD representations of the following sentences:

"John gave Mary a ball" "Mary took a ball from John"
actor: John actor: Mary
action: ATRANS action: ATRANS
object: ball object: ball
direction:
    TO Mary
    FROM John
direction:
    TO Mary
    FROM John

Although both sentences differ markedly syntactically, they have the same semantic meaning and so the CDs are almost identical. Other CD primitives include:

PTRANS   Physical transfer of location
ATTEND   Focus a sense organ
SPEAK    Make a sound
GRASP    Grasp an object
PROPEL   Apply a force to an object
MTRANS   Mental transfer of information

The primitive used is determined by verb analysis - "gave" and "took" activate ATRANS, while "kick" implies PROPEL, and so on. More complex sentences, and even entire passages, can be represented by linking CDs together.

Conceptual Dependencies have been used successfully for text-based natural language processing; however, with small modifications, multimodal input can also be represented in CD form. Instead of extracting primitive actions from speech only, the user's gestures, actions and location can also be analyzed for verb equivalents and CD primitives. For example, in a virtual environment a user could move a block from the floor to a table top with the vocal command "Put the block on top of the table", which would have the CD representation:

actor: user
action: PTRANS
object: block
direction:
    TO table top
    FROM floor

Alternatively, the user could pick up the block and move it to the table top without saying anything. The gesture of picking up a block would cause the PTRANS primitive to be activated and the appropriate procedures called to fill its object and direction slots. This would also produce the same CD representation as above.

Finally, the user could point at the block and say "Put the block over there (pointing at the table top)". In this case separate CDs will be generated for the speech and gestural input with empty slots for the unknown information:

"Put the block over there" Pointing at the table top
actor: user actor: user
action: PTRANS action: ATTEND
object: block object: hand
direction:
    TO unknown
    FROM floor
direction:
    TO table top
    FROM unknown

Empty slots in a CD structure can be filled by examining CDs from other modalities generated at the same time. Comparing and combining the two representations above, it is simple for the computer to arrive at the same CD produced by the purely spoken or gestural inputs. This is to be expected, since in all three cases the meaning of the user's input is the same. Using the same representation for all the various input modalities considerably simplifies integration of the different modalities at the semantic level. In those cases where the CDs alone do not contain the needed information, the empty slots can be filled by querying the virtual environment directly or by reexamining the user's input.
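The slot-filling integration just described can be sketched with dictionaries standing in for CD structures. The two CDs are those from the example above; the merge function itself is only an illustration of the idea, not the actual implementation.

    # Illustrative merging of simultaneous CDs by filling empty slots.
    UNKNOWN = None

    speech_cd = {   # "Put the block over there"
        "actor": "user", "action": "PTRANS", "object": "block",
        "to": UNKNOWN, "from": "floor",
    }
    gesture_cd = {  # pointing at the table top
        "actor": "user", "action": "ATTEND", "object": "hand",
        "to": "table top", "from": UNKNOWN,
    }

    def merge(primary, secondary):
        """Fill the empty slots of one CD from a simultaneous CD in another modality."""
        return {slot: (value if value is not UNKNOWN else secondary.get(slot))
                for slot, value in primary.items()}

    print(merge(speech_cd, gesture_cd))
    # -> actor: user, action: PTRANS, object: block, to: table top, from: floor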

The CD representation has the advantage of decoupling the input modalities from their interpretation, allowing the same meaning to be expressed in a number of ways. This makes it easy to develop higher level pragmatic understanding, since the underlying semantic representations will largely be the same regardless of the input modalities used to express them. In this way, if the computer fails to interpret one type of input correctly, the user can try conveying the same meaning using a different modality.

However, there are some disadvantages with CDs, the most obvious being that CD primitives are mainly for physical actions, making it difficult to represent emotional states or concepts about agent motivation and goal-based behavior. The number of CDs needed to represent natural language also becomes unwieldy as the length of the input increases. While these are serious limitations for using CDs to represent large amounts of complicated text or speech input, they are not so prohibitive for representing input in a virtual environment. Here the user interacts with the environment using physical actions and/or short spoken commands, both of which are easily representable by CDs.

In our interface we use the CD representation to encode all of the expert system knowledge base, except for the rules matched by syntactic features. At the semantic level, CDs are used to activate rules that respond to direct multimodal commands. For example, if the user says "What is that?" while pointing at some anatomical structure, a unified CD representation is found from the voice and gestural input which then matches the rules for responding to anatomic queries, and the correct response is given.

Context Recognition

Syntactic feature extraction and the semantic representation described above are sufficient for activating the rules needed for instrument location and navigation. However, contextual knowledge is needed for tracking the user's progress through procedural tasks while training.

For context recognition the expert system needs to be able to identify which step in the surgery the user is performing. In a related problem, DeJong developed a very effective NLP system which could read news stories and find the context for each story [32]. This was achieved by encoding a wide range of contexts into CD form, one set of CDs for each context. When a particular set of CDs matched those being generated by the NLP, the associated context was activated and the remainder of the story interpreted from within that context. Context activation occurred in a number of different ways, including explicit activation, where the text states the context directly, and event-induced activation, where a characteristic sequence of events implies the context.

Our interface uses a similar approach. Explicit context activation occurs when the user says the surgical step they are about to undertake. The phrase "I'm going to remove the uncinate" will produce a CD representation that matches the rule for setting the context to the surgical step of removing the uncinate process. Event-induced context recognition is more complex. A set of events is chosen which happen immediately prior to the start of a particular surgical stage or shortly thereafter. Rules are then written for identifying these events and the context is established when these rules are activated. For example, the key events for recognizing when the user is about to remove the uncinate process are:

- the user picks up a sickle knife or Cottle elevator from the instrument tray,
- the instrument is moved back past the front of the middle turbinate, and
- the instrument comes into contact with the uncinate process.

These events are represented by the appropriate CDs and syntactic features and a set of rules determine if the corresponding facts have been inserted into the fact database. If so, the context is set to the surgical step of removing the uncinate process. For example, the system first checks if the user is holding the right instrument by seeing if the following CD is present in the fact database:

actor: User
action: ATRANS
object: sickle knife or cottle elevator
direction:
    TO User
    FROM Instrument Tray

Then the user has to have moved the instrument back far enough - this is determined by checking if the instrument has collided with an invisible horizontal plane across the front of the middle turbinate. If so, a Collide Turbinate_Plane fact will be in the fact database. Finally, the user needs to attempt to cut the uncinate process, so a Collide Uncinate_Bounding_Box fact should also be present. If all these facts are present then the system assumes the user is removing the uncinate process and sets the context accordingly.
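A sketch of this check is shown below, with the facts written as simple tuples; the real rules are productions over CD structures in CLIPS, so this is only an analogue of the logic described above.

    # Illustrative check for the "remove uncinate process" context.
    def uncinate_context_active(facts):
        """True when all key events for uncinate removal are present in the fact base."""
        holds_instrument = any(
            ("ATRANS", "user", tool, "instrument tray") in facts
            for tool in ("sickle knife", "cottle elevator"))
        behind_turbinate = ("collide", "turbinate_plane") in facts
        touched_uncinate = ("collide", "uncinate_bounding_box") in facts
        return holds_instrument and behind_turbinate and touched_uncinate

    facts = {
        ("ATRANS", "user", "sickle knife", "instrument tray"),
        ("collide", "turbinate_plane"),
        ("collide", "uncinate_bounding_box"),
    }
    print(uncinate_context_active(facts))   # True: set context to uncinate removal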

Pragmatic Representation

For surgical training our expert system needs to be able to monitor user actions and detect when they perform tasks out of order or even miss them entirely. This pragmatic knowledge can be encoded using scripts[31]. Scripts are basically a causal sequence of CDs that can be used to predict future actions as well as interpret past behavior, allowing the formation of a single coherent interpretation from a collection of observations. The expert system can make inferences about user actions from the particular script activated.

For example, a restaurant script could be used to interpret the story: "John went to a restaurant. He ordered chicken. He left the restaurant.", and answer questions such as "Did John eat dinner?" or "Did John pay the bill?". A simple restaurant script is: enter the restaurant, sit down, read the menu, order food, eat the food, pay the bill, and leave the restaurant.

Once the sentence "John went to a restaurant" is understood, the script is activated and the remainder of the input is understood from within the restaurant context. The continuing dialog can be used to establish where in the script the story is, so when the text "he ordered chicken" is input we know that John must have already sat down and read a menu. A change of context is indicated when the input cannot be matched to the current script. In the simple script above there is only a single flow of events, but scripts can have alternate sequences of events or even additional scripts nested inside them.

Scripts are commonly used in natural language processing, but with events based on multimodal CDs they provide a powerful way of monitoring the user's actions to make sure they are performed in the correct order. For example, to remove the uncinate process a definite sequence of tasks needs to be performed: the correct instrument (a sickle knife or Cottle elevator) must be selected, the incision made along the uncinate process, and the freed tissue removed.

Each of the steps in the sinus procedure has its own script of tasks, which are represented by a set of rules that match the appropriate CDs and syntactic features. Activation of a surgical context also activates the associated script, and the expert system will then attempt to match the user's actions to tasks contained in the script. At the end of each task a fact representing task completion is inserted into the fact database so the expert system can keep track of user progress. If the user's input matches the rules for a particular task, but a previous task has not been completed, then the system notes that the user has missed part of the procedure, the consequences of which depend on the training level selected. The steps in the entire sinus surgical procedure are also contained in a script, so the system recognizes when the user skips an entire step.
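Task-order monitoring against a script can be sketched as follows; the task names are placeholders, since the paper does not enumerate the tasks within each step.

    # Illustrative script follower: tasks should be completed in the scripted order.
    class Script:
        def __init__(self, tasks):
            self.tasks = tasks
            self.completed = set()

        def perform(self, task):
            """Record a task and report any earlier tasks the user has skipped."""
            index = self.tasks.index(task)
            skipped = [t for t in self.tasks[:index] if t not in self.completed]
            self.completed.add(task)
            if skipped:
                print(f"Out of sequence: '{task}' attempted before", ", ".join(skipped))
            return skipped

    # Hypothetical task script for a single surgical step:
    script = Script(["select instrument", "make incision", "remove tissue"])
    script.perform("select instrument")
    script.perform("remove tissue")       # reports that "make incision" was skipped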

Since the script encodes the sequence of tasks for each surgical step, it is simple for the system to respond to user requests for help about what to do next. By transforming the task CDs into commands to the virtual environment, the expert system can explicitly demonstrate particular tasks that the user does not know how to perform.

In this section we have shown how natural language techniques can be used to create intelligent virtual environments. The complexity of the expert system used depends on the degree of intelligence desired in the interface. Simple rules can be used which match syntactic features such as object collisions, while semantic representations and scripts allow higher level understanding. Although we have applied these techniques to modeling sinus surgery, the same approaches can be used in a wide range of medical and non-medical applications.

Conclusion

We have described a prototype intelligent multimodal medical interface that addresses the needs of instrument localization and effective training simulators for endoscopic sinus surgery. The approach described here is not unique to sinus surgery and may be easily applied to other procedures by changing the virtual environment model and expert system rule base.

Our interface has three key aspects:

- voice and gestural input is interpreted and represented in a common semantic form,
- a rule-based expert system infers context and user actions from this semantic representation, and
- the inferred actions are matched against the steps of a surgical procedure to monitor the user's progress and provide automatic feedback.

The common element between these aspects is the use of knowledge encoded in a rule-based expert system. This is applied at all levels in the interface, from responding to instrument collisions with tissue, to interpreting multimodal commands and identifying surgical context. Encoding procedural knowledge in the expert system rule base also allows the interface to be used as an intelligent trainer.

Our immediate goal is to extend this work to develop an interface which would be useful during actual operative dissection. To accomplish this we intend to conduct research on rapid model reconstruction from patient data and accurate registration between the real patient and this anatomical model. We will also conduct user studies to identify which aspects of the interface would be most useful for instrument localization and navigation. Finally, we need to investigate better ways of performing script activation and context recognition so that the expert system can immediately recognize the surgeon's actions and respond accordingly.

Acknowledgements

This work was partially supported under ARPA grant DAMD17-94-J-4239, "Advanced Human Interfaces for Telemedicine". Thanks to Paul Schwartz for providing considerable software support, Suzanne Weghorst for help with interface ideas and Don Stredney for donating the CT scan images used.

References

[1]Drake, J.M., Rutka, J.T., Hoffman, H.J. ISG Viewing Wand System. Neurosurgery, v4, pp. 1094-1097, 1994.

[2]Loftin, R.B., Ota, D., Saito, T., Voss, M. A Virtual Environment for Laparoscopic Surgical Training. Medicine Meets Virtual Reality II, Jan 27-30, 1994, San Diego, pp. 166-69.

[3]Barde, C. Simulation Modelling of the Colon. First International Symposium on Endoscopy Simulation, World Congress of Gastroenterology, Sydney, 1990.

[4]Poon, A., Williams, C., Gillies, D. T. The Use of Three-Dimensional Dynamic and Kinematic Modelling in the Design of a Colonoscopy Simulator. New Trends in Computer Graphics, Springer Verlag, 1988.

[5]Lasko-Harvill, A., Blanchard, C., Lanier, J., McGrew, D. A Fully Immersive Cholecystectomy Simulation. In Interactive Technology and the New Paradigm for Healthcare, K. Morgan, R.M. Satava, H.B. Sieburg, R. Mattheus, J.P. Christensen (Eds.), pp. 182-186, IOS Press and Ohmsha, 1995.

[6]Cohen, P.R., Dalrymple, M., Pereira, F.C.N., Sullivan, J.W., Gargan Jr., R.A., Schlossberg, J.L. and Tyler, S.W. Synergistic Use of Direct Manipulation and Natural Language. Conference on Human Factors in Computing Systems (CHI '89), pp. 227-233. Austin, Texas, IEEE, ACM, 1989.

[7]Cohen, P.R. The Role of Natural Language in a Multimodal Interface. 1992 UIST Proceedings, pp. 143-149.

[8]Weimer, D. and Ganapathy, S.K. A Synthetic Visual Environment with Hand Gesturing and Voice Input. Conference on Human Factors in Computing Systems (CHI '89), pp. 235-240. Austin, Texas, IEEE, ACM, 30th April - 4th May 1989.

[9]Hauptmann, A.G. and McAvinney, P. Gestures with Speech for Graphics Manipulation. Intl. J. Man-Machine Studies, vol. 38, pp. 231-249, 1993.

[10]Bolt, R.A. "Put-That-There": Voice and Gesture at the Graphics Interface. Computer Graphics, v14 (no. 3), pp 262-270. In Proceedings of ACM SIGGRAPH, 1980.

[11]Giarratano, J., Riley, G. Expert Systems: Principles and Programming. PWS Publishing Company, Boston, 1993.

[12]Buchanan, B.G., Shortliffe, E.H. Rule Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading, Massachusetts, 1984.

[13]McDermott, J., Bachant, J. R1 Revisited: Four Years in the Trenches. AI Magazine, V, no. 3, pp. 21-32, Fall 1984.

[14]Duda, R., Gaschnig, J., Hart, P. Model Design in the PROSPECTOR Consultant System for Mineral Exploration. In Expert Systems in the Micro-Electronic Age, ed. D. Michie. Edinburgh University Press, pp. 153-167, 1979.

[15]Koons, D.B., Sparrell, C.J. and Thorisson, K.R. Integrating Simultaneous Input from Speech, Gaze and Hand Gestures. In Intelligent Multi-Media Interfaces, ed. M. Maybury. Menlo Park: AAAI Press, 1993.

[16]Neal, J.G., Shapiro, S.C. Intelligent Multi-Media Interface Technology. In Intelligent User Interfaces, Sullivan, J.W., Tyler, S.W. eds, Frontier series, ACM Press, New York, 1991.

[17]Dyer, M.G. In-Depth Understanding: A Computer Model of Integrated Processing for Narrative Comprehension. MIT Press, Cambridge, Massachusetts, 1983.

[18]Hill, W., Wroblewski, D., McCandless, T., Cohen, R. Architectural Qualities and Principles for Multimodal and Multimedia Interfaces. In Multimedia Interface Design, ed. M.M. Blattner and R.B. Dannenberg, pp. 311-318, ACM Press, New York, 1992.

[19]Lorensen, W., Jolesz, F.A., Kikinis, R. The Exploration of Cross Sectional Data with a Virtual Endoscope. In Interactive Technology and the New Paradigm for Healthcare, K. Morgan, R.M. Satava, H.B. Sieburg, R. Mattheus, J.P. Christensen (Eds.), pp. 221-230, IOS Press and Ohmsha, 1995.

[20]Doyle, W.K. Interactive Image-Directed Epilepsy Surgery: Rudimentary Virtual Reality in Neurosurgery. In Interactive Technology and the New Paradigm for Healthcare, K. Morgan, R.M. Satava, H.B. Sieburg, R. Mattheus, J.P. Christensen (Eds.), pp. 91-100, IOS Press and Ohmsha, 1995.

[21]Loftin, R.B., Kenney, P.J. Virtual Environments in Training: NASA's Hubble Space Telescope Mission. In Proceedings of the 16th Interservice/Industry Training Systems and Education Conference, 28th November - 1st December, 1994.

[22]Merril, J.R. Surgery on the Cutting Edge: Virtual Reality Applications in Medical Education. Virtual Reality World, November/December 1993, pp 34-38.

[23]Rice, D.H., Schaefer, S.D. Endoscopic Paranasal Sinus Surgery, pp. 159-186. Raven Press, New York, 1993.

[24]Makhoul, J., Roucos, S., Gish, H. Vector Quantization in Speech Coding. Proceedings of the IEEE, vol. 73, pp. 1551-1588, 1985.

[25]Schmandt, C. Voice Communication with Computers: Conversational Systems. Van Nostrand Reinhold, New York, 1994.

[26]Wexelblat, A. Natural Gesture in Virtual Environments. Virtual Reality Software and Technology: Proceedings of the VRST '94 Conference, 23-26 August 1994, Singapore, pp. 5-16. Singapore: World Scientific.

[27]Lytinen, S.L. Conceptual Dependency and its Descendants. Computers Math. Applic. Vol. 23, No. 2-5, pp. 51-73, 1992.

[28]Minsky, M. A Framework for Representing Knowledge. In The Psychology of Computer Vision, ed. P.H. Winston, McGraw-Hill, New York, 1975.

[29]Waltz, D. Semantic Structures: Advances in Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1989.

[30]Schank, R.C. Conceptual Information Processing. North-Holland, Amsterdam, 1975.

[31]Schank, R.C., Abelson, R. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1977.

[32]DeJong, G. An Overview of the FRUMP System. In Strategies for Natural Language Processing, ed. W.G. Lehnert, M.H. Ringle, pp. 149-176, Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1982.