Volume 23, Issue 10 (October 2000)

Getting Up to Speed



Despite huge gains in the performance of interactive graphics systems in recent years, many applications still have demands far in excess of what the latest architectures can provide. Among the applications crying out for real-time interactive performance are scientific visualization of large datasets, photorealistic rendering, low-latency virtual reality, and large-scale display systems.

To meet the ever-growing computational needs of such applications, researchers in Stanford University's Computer Graphics Lab have built an entirely new, fully scalable, OpenGL-compatible, parallel graphics architecture that is poised to take full advantage of current and future-generation consumer-level graphics technology.

Called Pomegranate ("for no particularly good reason," says principal researcher Matthew Eldridge), the new system consists of a five-stage graphics pipeline connected by a network of high-speed, point-to-point links. To date, performance simulations demonstrate Pomegranate's scalability up to a 64x speed-up.

Where high-end graphics was once the domain of workstation vendors peddling their proprietary architectures, today the entire graphics pipeline can be placed on a single consumer-level chip. And as the chips themselves edge closer to the billion-transistor mark, the amount of internal parallelism they possess will grow as well. Unfortunately, says Eldridge, "existing architectures cannot simultaneously scale to a high degree of parallelism efficiently while supporting a standard programming interface [such as OpenGL]." In contrast, as a single chip with multiple pipelines, Pomegranate could provide a practical method for optimizing transistor usage by enabling the replication of smaller pipelines to consume the available transistor count.

With Pomegranate, the same model was rendered using different techniques (marching cubes on the left, 3D volume textures on the right), and the performance was compared to a standard high-end graphics system. The results showed gains of 58x and 56x, respectively.




The Pomegranate architecture could also be implemented as a scalable graphics pipeline with many levels of parallelism. The performance of a single-pipeline solution is comparable to that of today's graphics accelerators. A 64-pipeline accelerator would be able to match supercomputer performance.

Regardless of scale, each pipeline consists of five stages: geometry, rasterization, texture, fragment, and display. During the geometry stage, commands are received from an application, the primitives are transformed, lit, and clipped, and screen-space primitives are sent to the rasterizer. During rasterization, the primitives are converted into untextured fragments. Next, textures are applied to the resulting fragments, which merge with the frame buffer in the fragment-processing stage. Finally, the display processor reads pixels from the fragment processor and sends them to a display.
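The five stages described above can be pictured as a chain of functions, each consuming the output of the one before it. The sketch below is a toy software model, not Pomegranate's hardware design: every function name, the data layout, the one-fragment-per-vertex "rasterizer," and the checkerboard "texture" are assumptions made purely for illustration.

```python
# Toy model of a five-stage pipeline. All names and data layouts here are
# hypothetical; real hardware does far more work in every stage.

def geometry(commands):
    """Transform primitives into screen space (here: a fixed +1 offset)."""
    return [[(x + 1, y + 1) for (x, y) in prim] for prim in commands]

def rasterize(primitives):
    """Convert primitives into untextured fragments (here: one per vertex)."""
    return [{"xy": v, "color": None} for prim in primitives for v in prim]

def texture(fragments):
    """Apply a texture to each fragment (here: a checkerboard lookup)."""
    return [{**f, "color": (f["xy"][0] + f["xy"][1]) % 2} for f in fragments]

def fragment(fragments, framebuffer):
    """Merge textured fragments into the frame buffer."""
    for f in fragments:
        framebuffer[f["xy"]] = f["color"]
    return framebuffer

def display(framebuffer, width, height):
    """Read pixels out in scanline order, as a display processor would."""
    return [[framebuffer.get((x, y), 0) for x in range(width)]
            for y in range(height)]

def run_pipeline(commands, width=4, height=4):
    """Push a list of primitives through all five stages."""
    fb = fragment(texture(rasterize(geometry(commands))), {})
    return display(fb, width, height)

if __name__ == "__main__":
    for row in run_pipeline([[(0, 0), (1, 0), (0, 1)]]):
        print(row)
```

In a parallel system each of these functions would exist in many copies, which is where the question of how to hand work from one stage to the next arises.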

The architecture's scalability is achieved through a "sort-everywhere" approach, whereby work is distributed in a balanced fashion at each of these five stages of the pipeline. With the point-to-point network, each pipeline of the architecture can communicate with the others at every stage. For example, each geometry processor can distribute its transformed primitives over all of the rasterizers.
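As a rough illustration of a geometry processor spreading its output over all of the rasterizers, the sketch below counts how many primitives each rasterizer receives. The round-robin policy and the function name `distribute` are assumptions chosen for simplicity; the article does not say which distribution policy Pomegranate uses.

```python
from collections import Counter

def distribute(primitives_per_geometry_unit, num_rasterizers):
    """Return the number of primitives each rasterizer receives.

    Each geometry unit sends its primitives round-robin over all
    rasterizers using point-to-point sends (an assumed policy).
    """
    load = Counter()
    for unit, n_prims in enumerate(primitives_per_geometry_unit):
        for i in range(n_prims):
            # Offset by the unit index so the units do not all start
            # at rasterizer 0.
            target = (unit + i) % num_rasterizers
            load[target] += 1
    return [load[r] for r in range(num_rasterizers)]

if __name__ == "__main__":
    # One busy geometry unit and one nearly idle one.
    print(distribute([10, 3], 4))
```

Even when the geometry units produce very different amounts of work, the rasterizers in this toy model end up with nearly equal shares, and no single send goes to more than one destination.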

Balancing the workload on each pipeline is crucial to Pomegranate's efficient operation and is also the system's greatest challenge. Eldridge uses triangle rasterization to illustrate the difficulty of the task. "Triangles vary greatly in their size on the screen and generally clump on the screen as individual scene elements are drawn; both of these effects can greatly limit parallel efficiency."

The traditional method for dealing with these effects is to closely interleave the responsibility of the rasterizers for pixels on the screen. Doing so yields an excellent load balance, but it also means that every triangle must be broadcast to all the rasterizers, says Eldridge, "which severely limits scalability by forcing each rasterizer to accept triangles at the aggregate emission rate."
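The link between fine interleaving and broadcast can be seen with a small calculation. The sketch below assumes a simple diagonal tile-to-rasterizer mapping and a hypothetical helper, `rasterizers_touched`; neither comes from the article. It reports which rasterizers own at least one pixel inside a triangle's bounding box.

```python
def rasterizers_touched(bbox, tile, num_rasterizers):
    """Return the set of rasterizers owning pixels inside bbox.

    bbox is (x0, y0, x1, y1) in inclusive pixel coordinates. Screen tiles
    of size tile x tile are assigned to rasterizers along diagonals, which
    is an assumed interleaving pattern used only for illustration.
    """
    x0, y0, x1, y1 = bbox
    owners = set()
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            owners.add((tx + ty) % num_rasterizers)
    return owners

if __name__ == "__main__":
    small_triangle = (0, 0, 9, 9)  # a 10x10-pixel bounding box
    print(rasterizers_touched(small_triangle, tile=1, num_rasterizers=4))
    print(rasterizers_touched(small_triangle, tile=64, num_rasterizers=4))
```

With per-pixel interleaving, even this small triangle lands on all four rasterizers, so each of them must receive it. With coarse 64-pixel tiles it reaches only one, but clumped triangles would then pile up on that single rasterizer, which is the load-balance problem interleaving was meant to solve.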

To avoid this, Pomegranate sorts the load twice: once between geometry processing and rasterization, and again between rasterization and fragment processing. "This allows Pomegranate to use a scalable point-to-point communication mechanism instead of broadcast communication," says Eldridge. Unfortunately, the double-sort approach exacerbates another challenge: executing commands in the order in which they were submitted. With double sorting, the fragments of one triangle may arrive at a fragment processor before the fragments of a triangle earlier in the command stream. To deal with this, the researchers developed a mechanism that relays ordering information without using broadcast communication.
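The ordering problem itself can be shown with a small reorder buffer. This is not the researchers' mechanism, which the article does not describe; it is a generic sequence-number scheme, and it makes the simplifying assumption that every sequence number eventually reaches this one fragment processor, something a real parallel system could not take for granted. The class name `ReorderBuffer` is hypothetical.

```python
import heapq

class ReorderBuffer:
    """Hold out-of-order fragment batches until they can be applied in order.

    Assumes batches are tagged 0, 1, 2, ... in submission order and that
    none is ever missing (an illustrative simplification).
    """

    def __init__(self):
        self._heap = []      # min-heap of (sequence number, batch)
        self._next_seq = 0   # next sequence number that may be applied

    def receive(self, seq, batch):
        """Accept one batch; return the batches that are now safe to apply."""
        heapq.heappush(self._heap, (seq, batch))
        ready = []
        while self._heap and self._heap[0][0] == self._next_seq:
            ready.append(heapq.heappop(self._heap)[1])
            self._next_seq += 1
        return ready

if __name__ == "__main__":
    buf = ReorderBuffer()
    # Triangle 1 took a faster path and arrives before triangle 0.
    print(buf.receive(1, "fragments of triangle 1"))
    print(buf.receive(0, "fragments of triangle 0"))
```

The late-arriving batch is held back until its predecessor shows up, at which point both are released in submission order.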

A final parallel-graphics challenge that Pomegranate overcomes is providing enough input bandwidth to keep the hardware busy. Pomegranate achieves this through support for a parallel API, which allows multiple graphics contexts to be active in the hardware at once, with commands submitted for them simultaneously through multiple interfaces.
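The idea of several contexts feeding the hardware at once can be mimicked in software with one thread and one queue per context. The sketch below is only an analogy: the names `submit` and `run_contexts`, and the use of Python threads and queues to stand in for hardware interfaces, are assumptions and not part of any real graphics API.

```python
import threading
from queue import Queue

def submit(context_id, interface, num_commands):
    """Issue commands for one context through its own interface."""
    for i in range(num_commands):
        interface.put((context_id, i))

def drain(interface):
    """Collect everything that has been submitted to an interface."""
    items = []
    while not interface.empty():
        items.append(interface.get())
    return items

def run_contexts(num_contexts, num_commands):
    """Run all contexts concurrently, each with a private interface."""
    interfaces = [Queue() for _ in range(num_contexts)]
    threads = [
        threading.Thread(target=submit, args=(c, interfaces[c], num_commands))
        for c in range(num_contexts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [drain(q) for q in interfaces]

if __name__ == "__main__":
    for stream in run_contexts(num_contexts=3, num_commands=4):
        print(stream)
```

Because no context waits on a shared entry point, the total submission rate in this model grows with the number of contexts, while each context's own commands stay in the order it issued them.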

The Pomegranate architecture could easily bear fruit soon, says Eldridge. "The main enabling technology (high-speed, point-to-point interconnect) has moved from academia to industry over the past several years, making Pomegranate readily buildable with current technology."

Diana Phillips Mahoney is chief technology editor of Computer Graphics World.