Volume: 23 Issue: 5 (May 2000)

TECHWATCH: A New Track for Modeling Human Motion

The world's most renowned movers and shakers could eventually be brought to digital life if the promise of a novel motion-tracking approach is realized. Fred Astaire's fancy footwork. Babe Ruth's hallmark swing. The possibilities are limited only by the availability of the world's vast film and video archives.

Unlike existing commercial technologies for capturing human motion, which typically track a human subject's movements with sensor-based hardware, the new system relies on 2D video of human motion as its primary input. This is a significant distinction: for many potential applications, the value of traditional motion tracking is offset by the expense of the specialized equipment needed to collect the motion data, as well as by technical constraints that can limit the tracked person's range of motion, the accuracy of the results, or both.

In an attempt to create a less expensive, less restrictive means for tracking and representing human motion, researchers at Cornell University, the Massachusetts Institute of Technology, and Mitsubishi Electric Research Laboratory (MERL) have developed a system that combines computer-vision techniques with a learning-based approach to reconstruct 3D human motion from standard 2D video data. The researchers' objective is to create a motion-tracking system that is not only more accessible to a broader group of users because it relies on inexpensive and easy single-camera video output, but also one that can take advantage of the range of human motion represented in existing film and video clips.

The heart of the system is an ability to infer 3D information from 2D video clips using knowledge about human motion that it learns from sample "training" data. What sets the effort apart from previous projects that have attempted to exploit the promise of 2D video for 3D motion tracking is that it relies more on inference than on the acquisition of precise measurements to solve the 3D puzzle, says principal researcher Nicholas Howe of Cornell, who developed the system with Michael Leventon of MIT and William Freeman of MERL. "The difficulty is that 2D video doesn't have enough information, so even with precise measurements, the 3D motion is hard to reconstruct. Our approach differs in that it has expectations about how humans move."

A video clip of a subject waving an arm provides input for reconstructing the same motion in 3D. The system targets and tracks specific body parts, then infers depth information by comparing the movements to those of a sample data set.

The system develops these expectations based on what it has "learned" about the regularity of human motion. "Many movements a human could make in principle are actually pretty unlikely in practice. Thus if there's ambiguity between two possible motions, and one is more likely, our system goes with the most likely one," says Howe. "This lets it get through parts of the video where it's hard to see what's going on."

To create the 3D motion model, the system processes a 2D video stream to track the movement of specific joints and body parts in the image plane over time. This is achieved through the use of part maps, support maps, and positional information. The part maps are models of each body part derived from a weighted average of several frames, as opposed to only the most recent one. These provide the system with information about what each body part looks like. Support maps differentiate the body part from its surroundings.
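
As a rough illustration of the weighted-average idea behind a part map, the sketch below blends each new image patch into a running average rather than keeping only the latest frame. The article does not say how the researchers weight the frames; the exponential weighting, the function name `update_part_map`, and the `alpha` parameter are all assumptions made for this example.

```python
def update_part_map(part_map, patch, alpha=0.2):
    """Blend the newest patch into the running part map.

    alpha is a hypothetical blending weight: a small alpha means older
    frames keep most of their influence on the stored appearance.
    """
    return [
        [(1.0 - alpha) * old + alpha * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(part_map, patch)
    ]


# Three tiny 1x2 "patches" of the same body part from successive frames.
frames = [
    [[100.0, 100.0]],
    [[110.0, 90.0]],
    [[120.0, 80.0]],
]

part_map = frames[0]
for patch in frames[1:]:
    part_map = update_part_map(part_map, patch)

print(part_map)  # roughly [[105.6, 94.4]]
```

The stored appearance drifts toward the newest frame without jumping to it, which is one simple way a tracker could tolerate a single noisy frame.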

"Creating the support map is like drawing a box around a body part and crossing out all of the areas in the box that don't belong to that part," says Howe. "For example, the support map of the head would probably have pixels around the corners crossed out, leaving an oval region in the center."

The support map is critical in that it allows the system to account for self-occlusions. "If the hand was in front of the face, then a hand-shaped imprint would also be crossed out, leaving the visible part of the head. So the purpose of the support map is to make sure we're tracking the right thing. We don't want the tracker to get confused when one body part moves in front of another," says Howe. By tracking the part and support maps, the system obtains the pose information for the various body parts.
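
The "crossing out" Howe describes can be pictured as a mask applied when comparing the stored part map with the current image. The sketch below is only an illustration under assumptions: the function `masked_ssd`, the 0/1 encoding of the support map, and the mean squared-difference score are invented for this example and are not taken from the researchers' system.

```python
def masked_ssd(part_map, support_map, patch):
    """Mean squared difference over the pixels the support map keeps.

    support_map uses 1 for "belongs to this part" and 0 for "crossed out"
    (an assumed encoding).
    """
    total = 0.0
    count = 0
    for pm_row, sm_row, p_row in zip(part_map, support_map, patch):
        for stored, keep, seen in zip(pm_row, sm_row, p_row):
            if keep:
                total += (stored - seen) ** 2
                count += 1
    return total / count if count else float("inf")


# Stored appearance of a "head": a bright cross on a 3x3 box.
head = [[50, 200, 50],
        [200, 200, 200],
        [50, 200, 50]]

# Corners crossed out, as in Howe's head example.
support = [[0, 1, 0],
           [1, 1, 1],
           [0, 1, 0]]

# New frame: background corners changed, and a "hand" (value 90)
# now covers the right-hand pixel of the head.
observed = [[10, 200, 240],
            [200, 200, 90],
            [99, 200, 0]]

# Same support map with the hand-shaped imprint also crossed out.
occluded_support = [[0, 1, 0],
                    [1, 1, 0],
                    [0, 1, 0]]

print(masked_ssd(head, support, observed))           # 2420.0
print(masked_ssd(head, occluded_support, observed))  # 0.0
```

With the occluded pixel crossed out, the visible part of the head still matches perfectly, so the changed background and the hand do not mislead the comparison.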

Essentially, explains Howe, the system tracks each body part by moving frame by frame through the video, looking near the old position of the given part for something that looks like the image it has stored. "When it finds the best match, it updates the position, part map, and support map accordingly and goes to the next frame." The 2D tracker returns the coordinates of the body parts in each successive frame, which yield the joint and control-point positions needed for 3D reconstruction.
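
The frame-by-frame search can be sketched in a few lines. This is a simplified stand-in, not the researchers' tracker: it updates only the position (not the part map or support map), uses a plain sum of squared differences as the match score, and the names `track_step`, `make_frame`, and the `radius` parameter are hypothetical.

```python
def ssd(template, image, top, left):
    """Sum of squared differences between the template and one image window."""
    total = 0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            total += (image[top + r][left + c] - value) ** 2
    return total


def track_step(image, template, old_pos, radius=1):
    """Look near old_pos for the window that best matches the template."""
    h, w = len(template), len(template[0])
    rows, cols = len(image), len(image[0])
    best_pos, best_score = old_pos, None
    for top in range(old_pos[0] - radius, old_pos[0] + radius + 1):
        for left in range(old_pos[1] - radius, old_pos[1] + radius + 1):
            if top < 0 or left < 0 or top + h > rows or left + w > cols:
                continue  # window would fall outside the image
            score = ssd(template, image, top, left)
            if best_score is None or score < best_score:
                best_pos, best_score = (top, left), score
    return best_pos


def make_frame(top, left, size=6):
    """Synthetic frame: a bright 2x2 'body part' on a dark background."""
    frame = [[0] * size for _ in range(size)]
    for r in range(2):
        for c in range(2):
            frame[top + r][left + c] = 9
    return frame


template = [[9, 9], [9, 9]]
pos = (1, 1)
path = [pos]
for true_pos in [(1, 2), (2, 3), (3, 3)]:
    pos = track_step(make_frame(*true_pos), template, pos)
    path.append(pos)

print(path)  # [(1, 1), (1, 2), (2, 3), (3, 3)]
```

Because each step searches only a small neighborhood of the previous position, a part that moves farther than `radius` between frames would be lost, which hints at why small errors can compound over long sequences.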

On their own, however, these 2D observations are insufficient. "There's actually not enough information," says Howe. "The trick lies in using expectations about the motion. Your brain can do this too, and probably does all the time. For example, if I play a movie of the 2D coordinate information, your brain will interpret it as a person doing something, perhaps walking. We're so used to seeing people move that we do this automatically. We're trying to put the same expectations into the computer system."

A video of a walking subject presents a 2D tracking challenge because several body parts are partially occluded. As a result, the reconstructed motion can be jerky and twisted, and consequently less believable.

Equipping the system with the appropriate expectations involves teaching it which reconstructions are plausible based on a training set of 3D motions that have been acquired by a traditional 3D tracking system. Using this information, the system is able to disregard unnatural and anatomically impossible projections. It uses a prior probabilities framework to compute the likelihood of any of the remaining projections being the correct one. "In a sense, we look at examples of motions that we've seen before that in 2D look similar to the one we're trying to figure out," says Howe. "By a sort of interpolation between these close examples, we create an idealized motion that looks like the 2D observations, the depth of which is the 'inferred' depth."
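
Howe's description of interpolating between close examples can be loosely illustrated with a nearest-neighbor lookup. The sketch below is not the prior probabilities framework the researchers use; it swaps in a much simpler inverse-distance weighting over the k closest training examples, and the function `infer_depth`, the toy training pairs, and the parameter `k` are all assumptions made for this example.

```python
import math


def infer_depth(observed_2d, training, k=2):
    """Estimate depth by blending the depths of the k training examples
    whose 2D observations lie closest to the observed one."""
    nearest = sorted(
        training, key=lambda pair: math.dist(observed_2d, pair[0])
    )[:k]
    # Closer examples get larger weights; the small constant avoids
    # division by zero on an exact match.
    weights = [1.0 / (math.dist(observed_2d, obs) + 1e-9) for obs, _ in nearest]
    total = sum(weights)
    size = len(nearest[0][1])
    return [
        sum(w * depth[i] for w, (_, depth) in zip(weights, nearest)) / total
        for i in range(size)
    ]


# Toy training set: (2D observation, known depth from a 3D tracking system).
training = [
    ((0.0, 0.0), [1.0]),
    ((2.0, 0.0), [3.0]),
    ((10.0, 10.0), [50.0]),
]

# An observation halfway between the first two examples.
depth = infer_depth((1.0, 0.0), training)
print(depth)  # approximately [2.0]
```

The distant third example is ignored, so the estimate stays between the depths of the two motions that look most like the observation in 2D.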

The system has a number of technical obstacles to overcome in order for it to become commercially viable. In its current state, it is able to track and reconstruct the motion of human figures in short video sequences, but it is not yet reliable enough to track significant lengths of difficult footage, primarily because even small errors snowball as they accumulate through the frames, to the point where the system can lose track of the body parts it is following. Lighting changes, contrast problems, and self-occlusion can exacerbate this. Body parts crossing each other in a heavily shadowed environment, for example, can cause the system to get off track. Consequently, the 3D reconstructed motion might be shaky and unbelievable.

Longer, more reliable tracking is one of the researchers' goals for this system. Another is the implementation of automatic initialization, says Howe, "so the computer can find people on its own. Right now, a human tells the computer where to start tracking." Automating this process should improve the tracking reliability as well. "If it gets way off, the computer can find the person again and start from scratch."

Once the tracking issues are resolved, the system's interpretation capabilities will be ripe for enhancement. "For this to work as a tool for human-computer interaction, the computer will need to determine what the human is doing and adjust its behavior accordingly," says Howe.

In addition to enabling more realistic animation for games, film, and broadcast projects, the ability to track human motion inexpensively and automatically from new and old video footage holds appeal for a number of applications. "The computer could act as a coach, analyzing the motion of golfers, figure skaters, divers, and so on. It might compare the motions of beginners with those of professionals and make suggestions for improvement, for example," says Howe. "One can also imagine systems that scan for shoplifters or perhaps watch precious museum exhibits." The possibilities, he says, are endless. "Once a computer can track people reliably, all sorts of interaction opportunities open up."

Diana Phillips Mahoney is chief technology editor of Computer Graphics World.