GPU Computing Uncovered
Volume 30, Issue 6 (June 2007)


It’s official. GPUs are not just about graphics any longer. With the release of its latest generation G80 chip and accompanying CUDA driver, Nvidia’s GPUs are now equal-opportunity, floating-point compute engines aimed at a wide variety of uses. Make no mistake: The company isn’t backing off 3D graphics, and the G80 promises to push both throughput and render effects to the next level. But now the company has another angle to push its chips: high-performance, general-purpose computation for demanding, floating-point intensive applications in a new category called GPU Computing.
The GPU Grows Up
Once upon a time, hardwired graphics accelerators ruled the graphics world. Limited to the subset of features that hardware designers chose to implement, users could turn a rendering function on or off, albeit with very limited control over how it rendered. A user might have Z-buffered triangles with Gouraud shading and depth-cued vectors, but not much more. And as far as using the accelerator for anything other than rendering, forget it.

During the past five years, that has all changed, and in a big way. The year 2001 marked the advent of the programmable graphics shader with Nvidia’s GeForce3. Hardwired vertex transform/lighting and pixel engines gave way to ISV programmable shader units. Primitive in their first incarnations, these units today have evolved to deliver massive amounts of floating-point power—power made available to the programmer through high-level shader languages.

With successive generations of shader-based GPUs, vendors and analysts alike began pointing out that GPUs had not only achieved parity with most CPUs in terms of complexity, but in a couple of important aspects they reigned supreme. GPUs trounce CPUs on matrix floating-point mathematics and tend to be coupled with a huge amount of bandwidth for streaming applications. And the growth in that raw horsepower, better exported to the application via programmable shader architectures, had not gone unnoticed in technical and scientific communities.

A grassroots campaign has since grown from a few factions of the scientific community and become more organized, looking for ways to harness all those FLOPS for general-purpose computing. The idea of high-performance computing (HPC) on GPUs was born.

But hopeful users soon found more than a few issues in porting their code and algorithms to GPUs, all of which stemmed from the basic fact that GPUs, unlike CPUs, were built specifically for graphics.

These included the following: the GPU output colors (or Z values) only into a quad or triangle region; there was no memory management on the GPU; no communication existed between stream processing stages (everything went in and out of memory); there was no scatter support and no load/store functions; the primary GPU data type was a stream, whereas the CPU’s was a 32-bit word; graphics languages were very different and more difficult to program; and GPU instruction sets were limited, all geared to rendering.

Most notably, the programmer had to think—and structure instructions and data—in terms of geometry and pixels. The GPU operated on a stream of vertices and output colors (and Z values) restricted to memory regions defined as triangles or quads. Also, there was no support for scattering data—that is, writing data to some location based on an index—and only awkward, incomplete support for gathering data, reading from an indexed location. Output emerged strictly as pixel colors scanned into a quad or triangle. Feeding the GPU often meant organizing input data as textures, and only the pixel shader could read them, not the vertex shader.
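To see why the missing operations mattered, consider what gather and scatter look like once a GPU does support indexed loads and stores, as CUDA (discussed later in this article) would. The kernel names and the assumption that `idx` holds valid in-range indices are illustrative, not from the article:

```cuda
#include <cuda_runtime.h>

// Gather: each thread READS from a computed location.
// Pre-CUDA GPUs supported this only awkwardly, via texture lookups.
__global__ void gather(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}

// Scatter: each thread WRITES to a computed location.
// This had no equivalent at all in the fixed rendering pipeline,
// where output positions were dictated by the rasterized triangle.
__global__ void scatter(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];
}
```

Without the second kernel’s indexed store, algorithms as basic as histogramming or sorting had to be contorted into multi-pass rendering operations.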

All in all, this was not a pretty sight for programmers accustomed to CPUs. Advocates of HPC on the GPU loved the processing power, but to program, they needed to see more of a typical computing model rather than a typical rendering model.

A Step Forward for HPC

Nvidia and ATI were surely intrigued by the possibilities of HPC on the GPU. Both would welcome the prospect of finding new pockets in the market to generate more volume. The problem was that the early, vocal GPU adopters weren’t promising much volume but, rather, were presenting traditional HPC opportunities, such as scientific computing, geosciences, and non-polygonal graphics like final-frame or volume rendering. Nvidia, for one, generates remarkable revenue plying niches in the professional ranks, but does so primarily with the same silicon designed for the consumer/gaming ranks. Making incremental changes to a complex chip in order to serve new, unproven markets is a risky proposition for any vendor.

In 2005, the HPC community got some help for free, thanks to Nvidia’s G70 GPU and Microsoft’s DirectX 9c API. There was nothing in DX 9c specifically for HPC, but advancements built for graphics helped nonetheless. Single-precision (32-bit) floating point throughout the pipeline, dynamic branching, larger code sizes, and vertex textures eased some of the programming issues. Vertex texture support—the reading of indexed texture data into the vertex shader—in particular, helped address the lack of gather support.

With the G70 and DirectX 9c, the programmer had a reasonable, though still far from simple, solution—streaming data in from video memory, through a shader, and then back out to memory.

Even with the help of the G70 and DX 9c, the technology was still limited to a select few. These were typically academics who were desperate to find some reasonably priced, capable hardware solution to work their hugely compute-intensive problems, but who also possessed deep insight into the 3D graphics pipeline. In order to tap into all those FLOPS, a person faced the daunting task of tearing apart the algorithm and data and then effectively repackaging it as a stream-based graphics rendering operation.

The As, Bs, and Cs of GPU Computing

Recently, the industry got a look at the first fruits of Nvidia’s labor. As part of the rollout of its latest G80 GPU, Nvidia unveiled the category of GPU Computing, the company’s first comprehensive answer to the demands of HPC on the GPU. GPU Computing covers both hardware solutions specifically tailored to the needs of the HPC market and software, in the form of CUDA, or Compute Unified Device Architecture, a C compiler and standard libraries that provide a completely new programming environment for GPUs—one designed and optimized for general-purpose, data-parallel computing.

Nvidia opened up GPU Computing in conjunction with the launch of the G80, the code name of the GeForce 8 series, with a couple of key pieces of the planned environment—the debugger and profiler, and double-precision accuracy—following later. The GPU Computing model exploited the G80’s unified shader architecture, where shader resources are no longer dedicated to vertices or pixels and can be allocated more efficiently to whichever processing threads are active and demand attention.

Most notably, CUDA provides a dedicated, general-purpose computing model, the standard C language, load/store support, and a more complete instruction set. The parallel data cache eliminates the need to make multiple passes to memory, and concurrent threads can share data. The graphics and CUDA drivers can run concurrently, in two separate contexts, allowing developers to, for example, calculate physics in the CUDA context and send the results off to a graphics (DirectX) context.

Running multiple threads on unified, general-purpose shaders, CUDA eliminates the awkwardness of mapping algorithms and code to triangle rasterizers and graphics shaders. Programmers can thankfully forget about vertices, triangles, and pixels, and stick with a model that better mimics what they’re used to. Textures are still available (for free, as they have to be there for graphics), and are something an image filtering application might use, for example.

The programming language is standard C, and in the CUDA model, programmers call a function, specifying how many threads to run. The G80’s Massively Multi-Threaded Architecture manages thousands of threads, allocated across the chip’s 128 shader units (each running at 1.35GHz).
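The call-a-function-with-a-thread-count model can be sketched in a few lines of CUDA. The kernel name (`scale`) and sizes here are illustrative, not from the article; the `<<<blocks, threads>>>` launch syntax is where the programmer specifies how many threads to run:

```cuda
#include <cuda_runtime.h>

// Each thread scales one array element: standard C, plus a launch syntax
// that replaces any notion of vertices, triangles, or pixels.
__global__ void scale(float *data, float factor, int n)
{
    // Compute this thread's global index from its block and position in it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));   // device (video) memory
    // ... populate d with cudaMemcpy(..., cudaMemcpyHostToDevice) ...
    scale<<<n / 256, 256>>>(d, 2.0f, n);          // 4 blocks of 256 = 1,024 threads
    cudaDeviceSynchronize();                      // wait for the kernel to finish
    cudaFree(d);
    return 0;
}
```

The hardware thread manager, not the programmer, decides how those blocks are scheduled across the 128 shader units.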

The G80 implements a parallel, software-managed cache, which CUDA uses as a central repository to store and share data among shader units and the threads running on those units. With the cache and thread manager, threads can share data and pass along output directly to other threads and shader units, thereby resolving another of the oft-quoted complaints of GPU Computing programmers.
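In CUDA source, that software-managed cache appears as `__shared__` memory. A common sketch of its use is a block-wide reduction, where threads stage data once from memory and then share intermediate results entirely on chip; the kernel name and 256-thread block size below are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // assumes the kernel is launched with 256 threads per block

// Sum each block's 256 inputs using the per-block shared cache, so the
// partial sums never make extra round trips to video memory.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[BLOCK];               // software-managed cache
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];  // one global load per thread
    __syncthreads();                            // all data staged before sharing

    // Tree reduction: halve the active threads each step, all in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride)
            tile[t] += tile[t + stride];
        __syncthreads();
    }
    if (t == 0)
        out[blockIdx.x] = tile[0];              // one result per block
}
```

Under the old rendering model, each of those reduction steps would have been a separate pass through video memory.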

Nvidia is working with ISV partners to optimize code to best exploit CUDA and the GPU Computing engine.

Beyond easing the programming burden, Nvidia’s GPU Computing technology promises big boosts in computation throughput. Holding up a host of test cases from applications in weather, oil/gas exploration, medical imaging, simulations, and, of course, physics, Nvidia is touting some spectacular numbers as compared to Intel’s recent Core 2 Duo CPU (at 2.66GHz).

There is no clear way of vetting Nvidia’s specific speed-up numbers, but we don’t have trouble believing they’re substantial. Consider Schmid & Partner Engineering AG, which, along with Acceleware, is working with researchers at Boston Scientific to investigate the impact of modern design parameters on implantable medical devices, such as pacemakers, when exposed to electromagnetic fields. Nik Chavannes, director of software at Schmid & Partner Engineering, states: “Running electromagnetic simulations using Nvidia hardware empowers faster processing times by factors of 25 or more, enabling the analysis and optimization of medical products applying a level of complexity that nobody dreamed of, even two years ago. Nvidia’s and Acceleware’s solutions have opened completely new worlds for Computational Electromagnetics.” Listen closely, and you’ll probably hear more cheering from all those HPC users anxious and ready to put that boatload of G80 FLOPS to work.

Changing Tide

Nvidia—not to mention AMD/ATI and, most likely, Intel—is paving the way for dramatic changes in computer architecture. Any vendor will claim that its new technology will change the industry; that’s just marketing. But when can you tell that a vendor is really serious about such claims? When it puts its money where its mouth is, and Nvidia has done just that with GPU Computing.

Adding cost, schedule, and risk to a GPU is a serious commitment, but that’s what Nvidia did. And it plans to continue doing so, tipping its hand about double-precision floating-point operations in some next-generation GPUs later this year. Double-precision floating point today has virtually no application in graphics rendering, so it would have to justify itself exclusively on GPU Computing. Yes, Nvidia is bullish on the notion of GPU Computing, and it’s willing to take on significant risk to make it work.

What’s pushing Nvidia? Maybe it’s the promise of substantial incremental volume for GPUs running as processors for high-performance computing markets. Or maybe Nvidia is taking the “build it and they will come” approach, counting on creative developers to come up with new killer applications that can take advantage of an HPC-optimized GPU solution. Or maybe it’s something else.

In this industry, if you’re a vendor of peripheral chips, you’re constantly looking over your shoulder for the relentless threat of obsolescence by integration. Remember discrete chips for audio, telephony, and networking? Gone.

Nvidia’s GPU Computing technology allows for far greater computational throughput, making it ideal in the medical realm, where users regularly interact with very dense imagery.

Looking to the Future

Integration has taken a toll on those vendors making a living off graphics already. Remember, it’s Intel that sells more graphics hardware than anyone, integrating controllers in its north bridge. Arguably, gaming is the only reason that graphics integrated in memory controllers have not relegated both ATI and Nvidia to niche roles. GPUs are continually under pressure to justify themselves as discrete components.

One sure way to ensure that a peripheral is not subsumed by the CPU is for it to stop being a peripheral. In fact, Nvidia is not the first GPU vendor to substantially shift its architecture toward general-purpose computing, though it is the first to deliver a comprehensive solution to market.

In last year’s R520, ATI already made advances to make its architecture more general purpose, for example, with a big array of general-purpose registers (providing inter-thread communication a la Nvidia’s parallel data cache) as well as its UltraThreading technology, preceding the G80’s GigaThread. And in early October, ATI more formally positioned its hardware for more general-purpose computing applications with its announcement of StreamComputing, an initiative that will probably look something like CUDA when it’s officially unveiled.

But guess what? GPU vendors aren’t the only ones reading the tea leaves to make sure they’ll still be around in 10 years. With the capabilities of GPUs on par—or exceeding—CPUs, don’t think companies like AMD and Intel aren’t concerned about the incursion of GPUs on their turf. We’ve just witnessed the blockbuster acquisition of ATI by AMD—what do you think that was all about?

AMD had to be ready. It needed the graphics expertise to ensure that it was prepared to evolve, no matter what direction the baseline architecture would take. Whether it is the CPU taking on GPU functionality, CPUs sharing the motherboard with GPUs, or a fully integrated combined CPU/GPU, AMD felt it critical to improve its footing—critical enough to justify the $5.4 billion buyout of ATI.

Earlier this year, before ever revealing its intentions regarding ATI, AMD had already tipped its hand with a flurry of announcements paving the way both for GPGPU and the ATI acquisition. First, it announced it would license Opteron’s cache-coherent HyperTransport and socket, allowing third-party processors to share the motherboard with its flagship processor. Then it followed up with Torrenza, a platform formalizing the concept of “socket fillers” and allowing OEMs to build hybrid systems optimized to deliver maximum performance for specific application demands.

From Torrenza, it’s not much of a stretch to imagine the CPU integrating all or part of that third-party accelerator. And this past October, that’s precisely what AMD announced with its Fusion program, promising future multi-core chips starting in 2008, combining both CPU and GPU cores on a single processor.

So, what’s Intel doing during all this activity? For a couple of days, Wall Street, for one, was guessing it would go out and acquire Nvidia, resulting in a short-term spike in Nvidia’s stock price. But there were lots of reasons such a move was unlikely. Intel has graphics technology in-house, though it has not had success producing the kind of innovative, high-performance designs that Nvidia and ATI have.

To date, Intel hasn’t been as vocal on how it is positioning itself for this changing landscape. But at the latest Intel Developer Forum last September, the company did unveil the concept for a future large-scale, multi-core platform called Terascale. Terascale combines both general-purpose and special-purpose cores in a processor, supported by a new interconnect fabric. Intel didn’t say so, but one can imagine the general-purpose core would be a conventional x86 core, and a special-purpose core might be a GPU or some subset/variant thereof.

For its part, Nvidia is going it alone, at least for now. With GPU Computing, it is first to market with a better-thought-out, comprehensive solution for carrying out HPC tasks on a GPU. It should lead to incremental business for Nvidia GPUs, first and foremost in higher-volume gaming, as well as a non-trivial amount in lower-volume but higher-margin workstation applications.

As it has in the past, Nvidia is taking risks and aggressively blazing its way ahead. And if the past is any indication of future success, Nvidia is sure to be rewarded handsomely.

Alex Herrera is a senior analyst with Jon Peddie Research and author of “JPR’s Workstation Report.” Based in Tiburon, CA, JPR provides consulting, research, and other specialized services to technology companies, including graphics development, multimedia for professional applications and consumer electronics, high-end computing, and Internet-access product development. For more information about the reports, visit