Volume 23, Issue 8 (August 2000)

Lip Service

By Audrey Doyle

Talk is cheap. And thanks to the latest crop of software for lip-sync animation, making 3D computer-generated characters talk in a realistic manner is becoming not only cheaper, but also easier and quicker to accomplish.

"Today, everybody expects realism in character animation," notes Zac Jacobs, vice president of product development at Famous Technologies, which markets software for lip sync and facial animation. "Because of movies like Antz and A Bug's Life, which had some really good lip sync, the bar has been raised. Viewers expect all lip sync to look believable, even in TV series, commercials, and games, which don't have the long production cycles and large budgets of features."
FamousFaces, used to lip sync this warrior from the game Empire Earth, is just one program that helps remove the drudgery from lip syncing. (Image courtesy Blur Studio.)

Luckily for animators working in these markets, several software tools are now available that enable them to accurately sync lips to dialog without suffering through the notoriously laborious (and therefore expensive) process of keyframing. Some of the newer tools that are enjoying widespread use among production facilities today are LipSinc's Ventriloquist, Third Wish Software's Magpie Pro, and Famous Technologies' FamousFaces Animator. (A fourth product, Face2Face, was scheduled to ship this summer from Face2Face Animation.) Although each takes a different approach to automating lip sync, all enable animators to create believable expressions and mouth movements for digital characters.

Incorporated in early 1998, LipSinc began as a project in the Tech Program at North Carolina State University. "Three professors in speech technology developed a way to extract mouth- and lip-position data from a voice signal," notes Mike Helpingstine, LipSinc's vice president of business development. "Recognizing an application for the technology in computer animation, several students created a business plan to develop lip-sync tools for professional animators."

The resulting product, called Ventriloquist, is available as a $595 plug-in for 3D Studio Max R3. Shipping since December 1999, Ventriloquist is based on LipSinc's proprietary voiceDSP voice analysis system, which uses voice digital-signal processing technology to analyze speech and automatically output corresponding mouth-, jaw-, and lip-position data in 3D. Output data is generated as a stream of morph targets or movement destinations that re-create lip-synced animation. "It's the sophistication of this analysis engine that produces naturalistic and accurate results," says Scott Curtis, chief creative officer. "The engine analysis mimics the co-articulation that occurs when a human actually speaks by anticipating what the next phoneme will be and blending visemes [mouth positions] to match it."
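LipSinc's engine is proprietary, but the co-articulation behavior Curtis describes, anticipating the next phoneme and blending mouth positions toward it, can be sketched in a few lines. All names and the lookahead weight below are illustrative assumptions, not LipSinc's actual algorithm:

```python
def coarticulated_weights(visemes, frames_per_viseme=4, lookahead=0.3):
    """Crude co-articulation sketch: each frame blends the current viseme
    with the upcoming one, so the mouth anticipates the next sound."""
    frames = []
    for i, v in enumerate(visemes):
        nxt = visemes[i + 1] if i + 1 < len(visemes) else v
        for f in range(frames_per_viseme):
            t = f / frames_per_viseme        # 0 -> 1 across this viseme
            w_next = lookahead * t           # ramp in the next viseme
            if nxt == v:
                frames.append({v: 1.0})
            else:
                frames.append({v: 1.0 - w_next, nxt: w_next})
    return frames
```

Each frame's weights sum to 1.0, which is the form a morph-target mixer typically expects.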
LipSinc's Ventriloquist, based on research begun at North Carolina State University, offers lip syncing for 3D Studio Max users. (Image courtesy LipSinc.)

"Ventriloquist fills the role of a junior animator," adds Helpingstine. "It does the grunge work-going through the sound files, locating the utterances and defining the mouth shapes, assigning them at the correct time, and setting them up on the character. And it does it automatically." To use the software, the animator builds a character in Max with a relaxed, expressionless face and the lips and jaw closed. Then he loads the character into Ventriloquist, where he maps morph targets of the different mouth positions that the model will need to be able to execute when the audio data is assigned to it.

LipSinc assigned 15 visemes for Ventriloquist to use to define speech. In general, if the name of a viseme begins with a consonant, the viseme represents all the consonants in the name. If it begins with a vowel, it represents that vowel sound and others resembling it. The Bump viseme, for example, shows the mouth in the position for pronouncing the B, M, and P sounds in such words as bump, mom, and pea. The Church viseme represents the hard CH and J sounds in words such as church, joke, and choke. The If viseme is used to pronounce many of the front vowels, or those in which the tongue is in the front of the mouth, as in the words if, bet, bat, bout, debit, and suspect. And the Told viseme is used to represent the T, L, and D sounds in such words as told, day, lay, and muddy. It also is used in flaps, the indistinct consonants between vowels (for example, the TT in the word mutter).
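A viseme scheme like this amounts to a many-to-one lookup from phonemes to mouth shapes. The sketch below uses the viseme names described above, but the phoneme inventory and the fallback "Rest" pose are assumptions for illustration:

```python
# Hypothetical phoneme-to-viseme lookup in the spirit of Ventriloquist's
# 15-viseme scheme; ARPAbet-style phoneme codes are assumed here.
PHONEME_TO_VISEME = {
    # Bump viseme covers the bilabial consonants B, M, P
    "B": "Bump", "M": "Bump", "P": "Bump",
    # Church viseme covers hard CH and J sounds
    "CH": "Church", "JH": "Church",
    # If viseme covers many front vowels
    "IH": "If", "EH": "If", "AE": "If",
    # Told viseme covers T, L, D, and flaps
    "T": "Told", "L": "Told", "D": "Told", "DX": "Told",
}

def visemes_for(phonemes):
    """Map a phoneme sequence to viseme names, defaulting to a neutral pose."""
    return [PHONEME_TO_VISEME.get(p, "Rest") for p in phonemes]

print(visemes_for(["B", "AE", "T"]))  # ['Bump', 'If', 'Told']
```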

To map the visemes, the animator first displays the 15 viseme mapping tracks. He then displays the Track View Pick dialog box that contains all the assigned morph targets. To connect Ventriloquist's sound analysis (the phonemes) to the model's facial expressions (the visemes), the animator double-clicks each target in order. "After you've modeled your base character, you can set up your initial targets in an hour," notes Helpingstine.
Poser enthusiasts now have a lip-syncing program, Mimic from LipSinc, that works with Versions 3 and 4 of the popular character modeling and animation software from Curious Labs. (Image courtesy LipSinc.)

After the visemes are mapped, the animator analyzes the audio file by loading a text file that corresponds to the audio to which the character's lips will be synced. Then he uses the Analyze Sound File feature, selecting the sound file that matches the text, and then inserting the morph keys. "That's it; at this point, the lip-syncing process is complete," Helpingstine says. "Now the animator can preview the keyframes the analysis generated, then fine-tune and render the animation in Max."

LipSinc also offers two additional lip-sync and facial animation tools. Echo, a standalone batch processor for handling large amounts of dialog, analyzes audio files and outputs flipbook, dope sheet, and viseme curve animation data for animation packages, game engines, and multimedia platforms; Echo licenses start at $10,000. Mimic is a $195 tool for use with Curious Labs' Poser 3 and 4: the animator loads in a sound file, and Mimic automatically outputs the corresponding mouth, jaw, and lip positions. Mimic also enables users to automatically generate head, eye, eyebrow, and eye-blink movements appropriate to the character's dialog.

The company also offers two SDKs: TalkBack, a run-time version of the LipSinc SDK for creating character-driven content for CD or Web applications, and TalkNow, a real-time version of the SDK. TalkBack is available immediately for license by other animation software vendors interested in incorporating automated lip-sync functionality into their programs; licenses start at $50,000. TalkNow will be available in Q4 starting at $100,000.
Garvin Entertainment is using Ventriloquist to animate characters in the upcoming feature film Interface, which combines digital beings with a CG environment. (Images courtesy Garvin Entertainment Partners.)

Garvin Entertainment Partners (San Gabriel, CA) is one facility that's using Ventriloquist in production. According to head producer Brian Reed Garvin, the company is working on a feature film called Interface that's being shot with high-definition cameras and will star more than a dozen 3D CG characters. Says Garvin, "Interface is about a man who, due to an unusual set of events, can interface his mind and body into a 3D animated world to fight [computer] viruses, energy tears, and spikes, and help the government [thwart] hackers and illegal encoders." Whenever the actor is shown in the computer, the character and its surroundings are 3D CG. In total, the film will boast approximately 26 minutes of 3D animation.

At press time, the facility had created 11 characters: men, women, and creatures. To construct the digital replica of the main actor, the facility scanned his head at Cyber F/X (Glendale, CA), reduced the 10MB file to a manageable size using Digimation's MultiRes Max plug-in, and imported the data into Ventriloquist, where morph targets were mapped and lip sync was generated. "Ventriloquist blows us away," enthuses Garvin. "Before, when we keyframed our lip sync, it took a fast animator about a day to do just one sentence." With Ventriloquist, he says, they can do three sentences in about 20 minutes. According to Garvin, the facility is using Max on NT machines to model, render, animate, and texture the Interface characters and their environments. For body animation, they're using Character Studio, and to composite the CG with the live action, they're employing Discreet's Paint and Effect. Interface is scheduled to hit theaters this Christmas.

Another popular lip-sync and facial animation tool, Third Wish Software's Magpie Pro, was developed by software programmer Miguel Grinberg and his wife, Alicia Crivicich, both of whom emigrated from Argentina to Portland, Oregon, a few years ago when Grinberg began working at Will Vinton Studios. Magpie, for 2D animation, started shipping five years ago; Magpie Pro, for 3D animation, was released last year.

According to Grinberg, the main difference between Ventriloquist and Magpie Pro is that in Ventriloquist the process of mapping morph targets, and subsequently outputting corresponding mouth-, jaw-, and lip-position data, is automatic. With Magpie Pro, it's a more manual process. "Another difference is that Ventriloquist is a Max plug-in, while Magpie Pro is a stand-alone application that can export lip-sync and facial animation data to all the main 3D applications, including LightWave, Maya, Softimage, and Max," Grinberg says.

To animate with Magpie Pro, an animator creates morph targets in a 3D program, launches Magpie Pro, and loads the morph targets and the audio waveform file containing the dialog the character will speak. He then creates an exposure sheet by manually assigning morph targets at specified frames to break down the waveform file.

Say, for instance, a character must speak the words "Happy Birthday." At frame 0, the animator assigns a morph target for a closed mouth. At frame 1, he or she assigns a morph target for the "h" sound heard at that frame in the waveform file, at frame 3 the "ah" sound, and so on, until the animator reaches the end of the file. The different mouth positions are represented by images that can be generated by the animator or from a set of default images in the software. In addition, an animator can assign morph targets for different facial positions; for example, at frame 0 a morph target for open eyes can be assigned, and at frame 13 one for closed eyes. Magpie Pro creates a 3D preview in real time that runs concurrently with the audio. When the animators are happy with the preview, they can export the data to 3D software for fine-tuning.
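An exposure sheet of this kind behaves like a sparse track of keys that hold until the next key arrives. The bookkeeping can be sketched in a few lines (the function and data names are invented for illustration):

```python
def expand_exposure_sheet(keys, total_frames):
    """keys: {frame_number: target_name}, sparse as the animator entered them.
    Returns one target per frame, each key held until the next one."""
    sheet, current = [], None
    for frame in range(total_frames):
        current = keys.get(frame, current)
        sheet.append(current)
    return sheet

# The "Happy Birthday" breakdown above, for the first six frames:
mouth = expand_exposure_sheet({0: "closed", 1: "h", 3: "ah"}, 6)
# -> ['closed', 'h', 'h', 'ah', 'ah', 'ah']
```

A second sheet with eye targets (open at frame 0, closed at frame 13) would run through the same function independently, which is why mouth and eye channels can be keyed on different frames.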

One of the newest features in Magpie Pro is a speech recognition module that automatically fills the mouth channel of the aforementioned exposure sheet. However, users must first "train" the software to recognize mouth shapes from the waveform by loading in a waveform of the desired voice, finding as many frames as possible that sound like standard phonemes, and telling the software which mouth position represents each. Then, using the database of examples created during the training step, Magpie Pro analyzes the waveform, compares it to the examples using internal speech-recognition techniques, and assigns the best match to each frame. The recognized mouth shapes can then be modified by hand if necessary. The software also provides a curve editor that enables animators to deform the character's facial and mouth movements and then export the data to 3D packages that have a weighted morph capability, such as NewTek's LightWave 6, Max (through the MorphMagic plug-in), Alias|Wavefront's Maya, and Softimage. Also, users can build libraries of lip-sync words as well as body and facial movements for reuse. Magpie Pro runs on Intel and DEC Alpha PCs as well as the Mac and sells for $250.
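Third Wish hasn't published its recognition internals, but the train-then-match step described above is, in essence, nearest-neighbor classification. A bare-bones sketch, assuming each audio frame has already been reduced to a numeric feature vector (the feature extraction itself is omitted):

```python
import math

def classify_frames(frame_features, examples):
    """examples: list of (feature_vector, mouth_shape) pairs built during
    the training step. Each analysis frame gets the mouth shape of its
    nearest training example by Euclidean distance."""
    return [
        min(examples, key=lambda ex: math.dist(feat, ex[0]))[1]
        for feat in frame_features
    ]

# Two training examples, two frames to classify:
examples = [((0.0, 0.0), "closed"), ((1.0, 1.0), "Bump")]
print(classify_frames([(0.1, 0.1), (0.9, 0.8)], examples))
# -> ['closed', 'Bump']
```

Results from a matcher like this are only as good as the training examples, which is consistent with the article's note that recognized shapes may still need hand correction.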
Magpie Pro offers a lip-syncing process that is less automated but more customizable than that of most other products. (Image courtesy Third Wish Software. )

Magpie Pro is the facial animation/lip-sync tool of choice for the folks at DNA Productions (Irving, TX), which used the software with LightWave and the Puppet Master plug-in on NT machines to create the one-hour TV special Olive the Other Reindeer, which aired last Christmas. At press time, DNA was in the process of developing Jimmy Neutron, Boy Genius, Nickelodeon's first 3D CG animated character and first property to be simultaneously developed and produced as a feature film, TV series, and online franchise. Created by DNA's John Davis, Jimmy will debut as a series of 1-minute cartoon shorts on the network prior to its 2001 theatrical release; then the show will join Nickelodeon's line-up as a series. According to animation supervisor Paul Allen, DNA is using LightWave 6 for modeling, animation, texturing, and rendering, and Magpie Pro and the Morph Gizmo plug-in for facial animation and lip sync. "The animators who don't have Magpie Pro installed will use Morph Gizmo to make minor tweaks to the facial animation. We'll likely use Project:messiah for character animation," he adds.
This post-apocalyptic mutant created by Blaze International for the computer game Krossfire was speech-enabled with FamousFaces. (Image courtesy Famous Technologies.)

Although lip sync in Magpie Pro isn't as automated as it is in Ventriloquist, Allen says he prefers it that way. "We didn't want an automated product, because having lots of control over facial animation and lip sync is important to us. With Magpie Pro, we can block in all the phonemes and build sliders on top of them in Messiah or LightWave to add more personality to the facial performance," he says. "And besides, in Magpie Pro we can still animate a 300-frame shot with one character in less than 10 minutes."

Famous Technologies was formed in 1996 as the animation R&D division of content developer Blaze International. Its first product, FamousFaces, released in July 1998, is different from Ventriloquist and Magpie Pro in that it uses motion-capture technology and voice recognition to drive the movements of a 3D character's face and mouth. At NAB in April, the company announced Face Ace, a $595 facial animation and lip-sync plug-in for 3D Studio Max.

According to Famous, FamousFaces is a stand-alone product that also can be used from within Maya, LightWave, Softimage, Max, and Kaydara Filmbox, and can accept motion-capture data from Oxford Metrics, Motion Analysis, Xist, and Phoenix systems, and from the company's own FamousFaces vTracker. Last November, Famous began shipping FamousFaces Animator 1.5, which offers support for real-time game animation and provides plug-in support for major animation packages, including LightWave 6.

"With our system, you can capture facial movements as the performer speaks and tweak the data to clean it up or to emphasize or de-emphasize captured movements," comments Famous's Jacobs. "Our tools deliver a base animation and then allow animators to use their skills to tweak it properly."
An actor wearing markers (bottom) provides the basis for FamousFaces lip syncing. A NURBS model (top) shows adjustable markers (red and blue pointers), picked up in a videotape session.

Animators can use FamousFaces to create morph targets, much the same as in Ventriloquist. But they also can use it with motion-capture data to further automate the process. In the latter case, the animator builds a character in any one of the supported packages and places optical markers on the face of the performer speaking the dialog to be captured.

"Where you place the markers depends on what you want to capture," says Jacobs. "If you're animating a close-up scene for a film, you might want to use 100 markers all over the face to get the most detail. But if it's for a game or Web site, you might need only a dozen markers. With just a few markers, you can capture basic eye or eyebrow movement, or mouth movement," he says.

After videotaping the performer, the animator imports into FamousFaces the resulting .AVI file and the mesh data of the 3D head. The software reads where the optical markers are and measures their movement, then correlates that data to the mesh. Because FamousFaces is based on deformations rather than morph targets, every cluster (or region of the face that moves as a unit) that the animator creates deforms to follow the movement of one of the markers on the performer's face. Therefore, there's no need to sculpt the face in multiple poses before animating. After the animator views the animation, he or she can go to a frame that needs modification, reposition clusters, and set the new facial pose as a keyframe. The software is available for NT and sells for $4,990. The vTracker option, which enables users to employ a single video camera to capture the performance of the actor and apply the data to the character's mesh using FamousFaces, sells for an additional $3,000.
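Famous's deformation system is proprietary, but the cluster-follows-marker relationship can be sketched simply. Moving every cluster vertex by its marker's displacement, as below, is a deliberate simplification (a production tool would weight and smooth the deformation); all names are invented for illustration:

```python
def deform_clusters(mesh, clusters, marker_deltas):
    """mesh: list of (x, y, z) vertex positions.
    clusters: {marker_name: [vertex indices]} tying face regions to markers.
    marker_deltas: {marker_name: (dx, dy, dz)} tracked for this frame.
    Returns a new vertex list with each cluster offset by its marker."""
    out = [list(v) for v in mesh]
    for marker, verts in clusters.items():
        dx, dy, dz = marker_deltas[marker]
        for i in verts:
            out[i][0] += dx
            out[i][1] += dy
            out[i][2] += dz
    return out

# A two-vertex "head" whose jaw vertex follows a jaw marker moving down:
mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(deform_clusters(mesh, {"jaw": [1]}, {"jaw": (0.0, -0.2, 0.0)}))
# -> [[0.0, 0.0, 0.0], [1.0, -0.2, 0.0]]
```

This also illustrates why no pose sculpting is needed up front: the pose at any frame is derived from the markers, not chosen from a library of pre-built targets.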

"We use Famous on almost all the facial animation we do," comments Aaron Powell, visual effects supervisor at Blur Studio (Venice, CA). So far, Blur has used it to create morph targets in Max for various projects, and has used it with vTracker for a game cinematic. Recently, Blur used Famous in conjunction with Magpie Pro to animate the faces and mouths, respectively, of four characters starring in a 90-second game cinematic the studio created for Empire Earth, a new game developed by Stain less Steel Studios and published by Sierra.
Blur Studio used FamousFaces along with Magpie Pro to create lip syncing for characters in a cinematic for the game Empire Earth. (Images courtesy Blur Studios.)

Earlier this year, the studio purchased 12 Oxford Metrics Vicon M-Series cameras and has begun using them with FamousFaces for a stereoscopic ride film based on the characters in comic-book icon Stan Lee's new 7th Portal Webisode. "We're animating 16 characters in this ride film," says Powell. "Around three full minutes will be close-up facial animation of characters at 2K resolution, looking right at you, speaking dialog." Blur is partnering with Stan Lee Media and Paramount Parks to create the ride film, which will open in Paramount theme parks worldwide in February 2001.

"This is the best facial tool I've used, and I've been looking for years," says Powell, who points to FamousFaces' clustering process as one of his favorite features. With this feature, users can paint in 3D on a mesh to define areas of the face that can be moved independently and blended to create complex expressions. "This helps you create believable-looking morphs. You can't do that in any other tool," he says.

As of press time, the newest lip-sync tool to come on the market was Face2Face, created by Face2Face Animation, a new company out of Lucent Technologies' New Ventures Group. Announced at NAB and scheduled to ship this summer, Face2Face uses visual rather than voice recognition but doesn't require markers to be placed on the performer's face. Instead, a standard video camera captures the performer's face and lips while he or she is speaking, and the recording is transferred into a workstation and processed through the Face2Face software for automatic lip sync to a selected animated face model.

To process the sequence, the animator, looking at the first frame of the video sequence, uses the mouse to click on the actor's eyes and on the corners of his mouth. Then the animator clicks the Start button to begin analyzing the video sequence. "The software tracks the movement of these temporary reference points in relation to each other as the sequence plays," says Bill Riley, vice president of marketing and business development. According to Riley, a 20-second sequence of video takes less than 30 seconds to process on a standard 500MHz Pentium III machine.
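Face2Face's tracker is proprietary, but the value of measuring the reference points in relation to each other can be sketched: normalizing mouth measurements against a stable eye-to-eye baseline cancels out head scale and distance from camera. The names below are illustrative, not Face2Face's API:

```python
import math

def mouth_width_ratio(left_eye, right_eye, mouth_left, mouth_right):
    """Mouth width expressed relative to the eye-to-eye baseline, so the
    measurement is unchanged if the actor leans toward or away from camera."""
    baseline = math.dist(left_eye, right_eye)
    return math.dist(mouth_left, mouth_right) / baseline

# Same expression at two camera distances yields the same ratio:
near = mouth_width_ratio((0, 0), (4, 0), (1, -3), (3, -3))
far = mouth_width_ratio((0, 0), (2, 0), (0.5, -1.5), (1.5, -1.5))
print(near, far)  # 0.5 0.5
```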

The output of the software is a frame-by-frame capture of the key facial definition points, which the animator can then map to the character to drive the animation, using sliders to add emotion to the character's face. Because it's based on MPEG-4 standards, the output file can be streamed over the Internet at bit rates of less than 2 kilobits per second. According to Face2Face, the software, which is available for NT, Irix, and Linux platforms and works with Softimage 3D, Max, and Maya, will be available in downloadable form from Face2Face's Web site, with pricing based on the number of frames of processed video.
Face2Face uses facial features, such as the corners of the mouth, as markers (below). The output can then be applied to a digital character like the one above.

As advances in technology continue to simplify the facial animation and lip-sync process, believable CG characters will no doubt enjoy even more exposure. "I think the future is in the games market," predicts Famous's Jacobs. "The outstanding processing power of the next-generation game consoles will be able to handle high-quality, photorealistic faces."

"An interesting use of our technology is real-time communication over the Web," adds LipSinc's Helpingstine. "We're focusing on getting our technology, especially our real-time SDK, into Web applications like online greeting cards, instant messaging systems using characters, and interactive games."

The folks at Face2Face also have their sights set on the Web. "Ultimately, that's where we hope to take our product," comments Riley. "Our MPEG-4 compliance is important for streaming video over the Web." Plus, Riley says Face2Face is in preliminary discussions regarding lip-syncing projects with several different companies, including the UK's Digital Animations Group, which earlier this year created the virtual newscaster Ananova. "We're exploring the opportunity to use Face2Face to drive the face and mouth movements of the photorealistic animated characters these companies create."

All told, it's an exciting time for character animators. "You can make a character's body look and move realistically, but that means nothing if its face doesn't look like it's really talking," concludes Third Wish Software's Grinberg. "That's what brings a character to life."

Audrey Doyle is a freelance writer and editor based in Boston. She can be reached at

Synchronized software

Face2Face Animation, Summit, NJ
Famous Technologies, San Francisco
LipSinc, Cary, NC
Third Wish Software, Portland, OR