Volume 24, Issue 8 (August 2001)

It's Not What You Say, But How You Say It



For many people, public speaking is a nightmare. Among those who are crippled by such exposure are animated characters. Speech can bring even the best-looking, most realistically moving digital persona to its knees. This is because human speech involves not only verbal expression, but also a range of intonations, spontaneous hand gestures, and facial movements that are tied not only to the meaning of the words, but also to the manner in which they are spoken.

While procedural-animation techniques have made tremendous strides with respect to re-creating human motion, including facial and hand behavior, "those techniques have not done well when the motion has needed to be synchronized to speech," says Justine Cassell, associate professor at MIT's Media Laboratory and director of the Lab's Gesture and Narrative Language Research Group. "And although lip sync has gotten much better, nobody has worked on synchronizing parts of the body besides the lips with speech."

Procedurally generated animated characters may soon get a chance to be heard, however, thanks to a new technology developed by Cassell and Media Lab colleagues Hannes Vilhjalmsson and Timothy Bickmore. Called BEAT, for Behavior Expression Animation Toolkit, the procedural system promises an automated method for giving virtual actors a voice by applying rules of speech and expression derived from extensive research into human conversational behavior.
The gestural and facial expressions of this animated character are driven in "near" real time by computational linguistic analyses of typed text.




With BEAT, users input typed text to be spoken by an animated human figure. What they get in return are appropriate and synchronized nonverbal behaviors and synthesized speech in a form (XML trees) that can be sent to a number of different animation systems. "BEAT uses linguistic and contextual information contained in the text to control the movements of the hands, arms, and face, as well as the intonation of the voice," says Cassell. "The mapping from text to facial, intonational, and body gestures is contained in a set of rules."

Written in Java, BEAT consists of various modules, each of which takes tagged text as input and produces tagged text as output. The tags identify the text as being of a certain type so the system knows which rules should be applied to it. The BEAT knowledge base contains information that can be inferred directly from the text, such as the type of object being discussed and the related action.
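
The module contract is simple to picture. As a rough sketch in Java, using hypothetical names rather than BEAT's published interfaces, each stage can be modeled as a transformer over a tagged XML tree, with the knowledge base acting as a lookup from domain actions to candidate gestures (the entries below are invented for illustration):

    import org.w3c.dom.Document;
    import java.util.Map;

    // Minimal sketch, assuming hypothetical names: each processing stage reads a
    // tagged XML tree and returns an enriched copy of it.
    interface BeatModule {
        Document process(Document taggedText);
    }

    // Illustrative stand-in for the knowledge base: baseline associations between
    // domain actions and gesture descriptions (entries invented for illustration).
    class GestureKnowledgeBase {
        private final Map<String, String> actionGestures = Map.of(
            "open", "right hand sweeps outward from the body",
            "point", "index finger extended toward the referent"
        );

        String gestureFor(String action) {
            return actionGestures.getOrDefault(action, "beat");  // fall back to a formless beat
        }
    }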

Each piece of input text makes its way through three main processing modules: the language-tagging module, the behavior-generation module, and the behavior-scheduling module. The behavior-generation module is further divided into a suggestion module and a selection module.
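
Continuing the sketch above (again with assumed names, not BEAT's actual API), the three stages can be chained so that the output tree of one module becomes the input tree of the next:

    import org.w3c.dom.Document;
    import java.util.List;

    // Hypothetical composition of the three main stages; each reads the tags added
    // by the previous stage and contributes its own annotations.
    class BeatPipeline {
        private final List<BeatModule> stages;

        BeatPipeline(BeatModule languageTagger,
                     BeatModule behaviorGenerator,   // suggestion + selection
                     BeatModule behaviorScheduler) {
            this.stages = List.of(languageTagger, behaviorGenerator, behaviorScheduler);
        }

        Document run(Document utterance) {
            Document current = utterance;
            for (BeatModule stage : stages) {
                current = stage.process(current);
            }
            return current;  // final tree: synchronized speech and behavior commands
        }
    }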

The language-tagging module is responsible for annotating input text with the linguistic and contextual information that allows successful nonverbal behavior assignment and scheduling. The language tags that the system currently implements include clause, theme and rheme (the former is the part of a clause that creates a coherent link with a preceding clause, and the latter is the part that contributes some new information to the discussion), word newness, contrast, and objects and actions. Word newness helps to determine which words should be emphasized by the addition of intonation, eyebrow motion, or hand gesture. The contrast tag is important because words that stand in stark contrast to each other ("I wanted blue, but they only had green") are often marked with hand gesture and intonation.
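
As a rough illustration only (the tag names and nesting here are assumptions, not BEAT's actual schema), the contrast sentence above might come out of the language tagger looking something like this:

    <UTTERANCE>
      <CLAUSE>
        <THEME> I wanted </THEME>
        <RHEME> <NEW><CONTRAST> blue </CONTRAST></NEW> </RHEME>
      </CLAUSE>
      <CLAUSE>
        <THEME> but they only had </THEME>
        <RHEME> <NEW><CONTRAST> green </CONTRAST></NEW> </RHEME>
      </CLAUSE>
    </UTTERANCE>
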
Each BEAT module relies on input text that is tagged to identify the linguistic rules that should be applied to it. The system's knowledge base contains baseline associations between domain actions, facial expressions, and hand gestures.




Once tagged in the language module, the input text moves through the behavior-suggestion and selection modules. The current set of behavior generators implemented in the toolkit includes such things as beats (formless hand waves that account for approximately 50 percent of the naturally occurring gestures observed in most contexts), actions for which gestural descriptions are available, contrast gestures, eyebrow flashes, gaze descriptions, and intonation. Based on the language tags, the behavior-suggestion module suggests a wide range of possibly appropriate nonverbal behavior and prioritizes the suggestions based on their relevance to the object or the action at hand. This step is intentionally liberal. Any nonverbal behavior that is possibly appropriate is suggested independent of any other.
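
In code, this liberal suggestion step might look like the following sketch, where the class names, behavior labels, and priority values are illustrative assumptions rather than BEAT's actual rules:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative types: a word carrying the tags attached by the language module,
    // and a behavior suggestion with a priority (values here are invented).
    record Word(String text, boolean isNew, boolean isContrast) {}
    record Suggestion(String behavior, Word target, double priority) {}

    // Every generator proposes whatever might be appropriate and leaves the
    // pruning to the selection filters.
    class BehaviorSuggester {
        List<Suggestion> suggest(List<Word> rhemeWords) {
            List<Suggestion> out = new ArrayList<>();
            for (Word w : rhemeWords) {
                if (w.isNew()) {
                    out.add(new Suggestion("EYEBROW_FLASH", w, 0.5));
                    out.add(new Suggestion("BEAT_GESTURE", w, 0.3));
                }
                if (w.isContrast()) {
                    out.add(new Suggestion("CONTRAST_GESTURE", w, 0.8));
                }
            }
            return out;  // over-generated on purpose; conflicts are resolved later
        }
    }
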
Nonverbal behaviors, such as hand motions, facial expressions, and eye gazes, are as important to the "success" of an animated character as what it has to say.




The resulting "over-generated" behaviors are filtered down in the behavior-selection mode, which analyzes the input data using an extensible set of filters, each of which can delete behavior suggestions that do not meet its criteria. In general, says Cassell, "the filters can reflect the personalities, affective states, and energy levels of characters by regulating how much nonverbal behavior they exhibit." Among the filters the researchers have implemented thus far are a conflict-resolution filter and a priority-threshold filter. The former detects all of the nonverbal behavior suggestions that could not physically co-occur and resolves the conflicts by deleting the suggestions tagged to be of lower priority. The latter removes all behavior suggestions whose priority falls below a specific threshold.

The final processing step is behavior scheduling and animation, in which the filtered data is converted to a set of instructions that can be executed by an animation system or edited by a user prior to rendering.
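
A rough sketch of that conversion, again with assumed names and assuming per-word timings supplied by a speech synthesizer, might look like this:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Align each surviving behavior with the spoken word it annotates and emit
    // time-ordered animation instructions (reuses the earlier Word and Suggestion records).
    class BehaviorScheduler {
        record Instruction(String command, double startSeconds, double endSeconds) {}

        List<Instruction> schedule(List<Suggestion> selected, Map<Word, double[]> wordTiming) {
            List<Instruction> script = new ArrayList<>();
            for (Suggestion s : selected) {
                double[] span = wordTiming.get(s.target());   // [start, end] of the spoken word
                script.add(new Instruction(s.behavior(), span[0], span[1]));
            }
            script.sort(Comparator.comparingDouble(Instruction::startSeconds));
            return script;   // hand off to an animation system, or edit before rendering
        }
    }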

One of the most important features of BEAT is its extensibility. "New entries can easily be made in the knowledge base to add new hand gestures to correspond to object features and actions, and the range of nonverbal behaviors and the strategies for generating them can easily be modified by defining new behavior-suggestion generators," says Cassell. Also, users have the flexibility to override the output from any of the modules. For example, says Cassell, "an animator could force a character to raise its eyebrows on a particular word by including the relevant 'eyebrows' tag wrapped around the word in question. The tag will then be passed through the language tagger, and behavior-generation and selection modules to be compiled into the animation commands by the scheduler."
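
For instance (the exact tag name and syntax here are an assumption for illustration), such an override might be written directly into the input text:

    I think you will <EYEBROWS> love </EYEBROWS> this design.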

BEAT's major disadvantage, notes Cassell, "is that it relies on the state of the art in computational linguistic analysis of written text, which does not yet do all the kinds of analyses that we'd need to get perfect nonverbal output." Also, at "near real-time," the current implementation is slower than ideal.

On the plus side, BEAT has proven itself to be a valuable method for "roughing out" an animation before the animator applies his or her art. "One animator told us that BEAT suggested natural movements that an animator might not necessarily consider," says Cassell.

On the BEAT horizon is the possibility of applying it to multiple characters in a scene, as well as adding other nonverbal communicative behaviors, such as forehead wrinkles, smiles, and ear wiggling.

Additional information on the BEAT system can be found at http://gn.www.media.mit.edu/groups/gn/.




Diana Phillips Mahoney is chief technology editor of Computer Graphics World.