Volume: 32 Issue: 9 (Sep. 2009)


As evidenced by their use in the highly interactive 3D games available today, modern graphics processing units (GPUs) are very efficient at manipulating raster graphics. What is not as well known is that the highly parallel structure and computational power of GPUs also make them extremely effective for handling more general, computationally intensive algorithms.
Using graphics processors for tasks other than graphics is a disruptive technology, one with a large impact within the industry, and it has been an active area of research since the late 1970s, starting with work on Ikonas [England, 1978], one of the first commercial graphics companies and chips.

Using rasterization hardware to perform motion planning [Lengyel et al., 1990], calculate Voronoi diagrams [Hoff et al., 1999], simulate neural networks [Bohn, 1998], and crack passwords [Kedem and Ishihara, 1999], among many other examples, became the basis for continued active research as commodity graphics engines became more widely available and powerful.

Mike Houston is a senior system architect in the Advanced Technology Development group at AMD in Santa Clara, CA, working in architecture design and programming models for parallel architectures. He received his PhD in Computer Science from Stanford University, focusing on research in programming models, algorithms, and runtime systems for parallel architectures, including GPUs, Cell, multi-core, and clusters.
The first wave of truly programmable GPUs, early in this decade, brought about many more advanced algorithms running on them, including advanced image processing, raytracing, cellular automata, linear algebra, physics, and database operations. Running these applications on programmable GPUs was the start of the general-purpose GPU (GPGPU) computation revolution. In graphics-oriented conferences, such as SIGGRAPH, there was an increased interest in advanced research in computational photography, image processing, raytracing, and advanced shading effects, all going beyond what is available using traditional graphics application programming interfaces (APIs). This is also when the immense processing power of GPUs was noticed in other fields, and publications about their use can be found in areas from finance to biology, often demonstrating large performance increases.

Most of the original GPGPU work was done by changing the compute algorithm to fit the programming models of graphics APIs, such as OpenGL or Microsoft DirectX. As a simple example, to perform computation at each pixel in a 2D image, developers would bind a simple shader program, written in a graphics shading language, like HLSL or GLSL, and, at times, even pseudo-assembly, like ARB_fragment_program, then render a full-screen quad the size of the output image.
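The per-pixel model described above can be mimicked in a few lines of ordinary Python. This is only a sketch of the idea, not real shader or graphics API code: the names `scale_kernel` and `render_full_screen_quad` are hypothetical, the nested list stands in for an input texture, and the serial loop stands in for the many fragment invocations a GPU would run in parallel.

```python
def scale_kernel(src, x, y):
    """Stand-in for a fragment shader: computes one output pixel.

    It may read the input "texture" (src), but it writes only the single
    output pixel at (x, y), which is what makes the work data-parallel.
    """
    return min(255, 2 * src[y][x])


def render_full_screen_quad(kernel, src):
    """Invoke the kernel once per output pixel, as rasterizing a quad
    the size of the output image would. A GPU would run these
    invocations in parallel; this loop is a serial stand-in.
    """
    height, width = len(src), len(src[0])
    return [[kernel(src, x, y) for x in range(width)] for y in range(height)]


image = [[10, 100], [128, 200]]
print(render_full_screen_quad(scale_kernel, image))  # [[20, 200], [255, 255]]
```

Because each invocation is independent of the others, the same kernel could be applied to every pixel at once, which is the property the graphics hardware exploits.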
While there was a tremendous amount of research using graphics APIs, this was a difficult approach for developers who were not proficient with graphics APIs. Consequently, the academic research community began to design languages to abstract the GPU as a compute engine, hiding the use of the graphics APIs behind more compute-friendly programming interfaces. Perhaps the two best-known examples of this are Brook, developed at Stanford, and Sh, developed at the University of Waterloo. (Brook evolved into AMD’s Brook+; Sh eventually was commercialized by RapidMind.)

Until recently, even with more compute-oriented programming interfaces, computation on GPUs has remained primarily a research technology used by early adopters—a new, promising, experimental capability for scientists, engineers, financial professionals, and others running computationally intensive applications. Mainstream adoption, however, was low.

Basically, two issues have kept GPU computing from gaining wide adoption: First, available GPU compute APIs were proprietary, targeting a single vendor’s architecture; second, the GPU has been treated as an independent application accelerator, as opposed to being part of a larger, balanced, heterogeneous platform working with other computational resources.

Recently, two APIs have been created to address cross-vendor computation on GPUs: DirectX 11 DirectCompute and OpenCL. OpenCL goes further by targeting parallel processors more generally and providing cross-platform support.

In 2008, Apple provided the initial proposal for OpenCL and acted as the catalyst that gathered the major silicon vendors to work on an industry standard for compute. The main goal of OpenCL is to provide a cross-platform, cross-vendor, cross-architecture, royalty-free industry standard for parallel programming. While ensuring solid support for GPU compute, the working group strove to make OpenCL applicable beyond GPUs, providing a framework for writing software that can exploit the vast, heterogeneous computing power of multi-core CPUs, GPUs, Cell processors, and DSPs.

Thus, OpenCL addresses the need for a cross-platform, industry-standard approach to development for heterogeneous architectures. OpenCL is a C-based language with a structure familiar to parallel programmers. Developers who are used to programming multi-core CPUs in C, or to working in other GPU computation languages, can therefore transition easily to OpenCL and gain portability for their applications across a wider range of devices and platforms.

OpenCL includes a platform API that lets developers query, select, and initialize compute devices, as well as a run-time API to execute the compute kernels and manage the scheduling of compute and memory resources. It is designed as a low-level interface to maximize the performance that can be extracted from the targeted devices. By creating efficient, close-to-the-metal programming interfaces, OpenCL forms the foundational layer of a parallel computing ecosystem of platform-independent tools, middleware, and applications.

Of course, no application runs entirely on the GPU. Beyond the obvious need for CPUs to drive execution, most mainstream applications are heterogeneous in nature: They have some functions that accelerate well on multi-core CPUs and others that are perfectly suited for a GPU’s data-parallel architecture (see “Power Play,” pg. 21). GPUs excel at mathematically intensive algorithms with a high degree of data parallelism. Many algorithms, like those discussed above, can be excellent candidates for acceleration; however, some algorithms generally thought of as perfect GPGPU algorithms, such as image convolution with small window sizes, may be faster on a multi-core CPU system than on a GPU because of the expense of off-loading the computation to the GPU.

Current GPUs accelerate an algorithm effectively only when it has a high degree of arithmetic intensity, a measure of the amount of computation performed per data read. A developer must take a balanced approach and match the algorithms to the best-suited device in the system. For some algorithms, that may be just the GPU, or just the CPU, but many algorithms benefit from using both the GPU and the CPU.
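To make the trade-off concrete, here is a rough back-of-the-envelope model in Python for the image convolution case mentioned earlier. Everything numeric in it is an assumption chosen for illustration: the CPU, GPU, and bus rates are hypothetical round figures rather than measurements, each filter tap is counted as one multiply and one add, and each input pixel is assumed to be fetched from memory only once.

```python
# All rates below are hypothetical round numbers, not measurements.
CPU_FLOPS_PER_S = 50e9   # assumed multi-core CPU throughput
GPU_FLOPS_PER_S = 1e12   # assumed GPU throughput
BUS_BYTES_PER_S = 4e9    # assumed host <-> GPU transfer rate
BYTES_PER_PIXEL = 4
PIXELS = 1_000_000


def convolution_flops_per_pixel(window):
    """One multiply and one add per tap of a window x window filter."""
    return 2 * window * window


def arithmetic_intensity(window):
    """Flops per byte read, assuming each input pixel is fetched once."""
    return convolution_flops_per_pixel(window) / BYTES_PER_PIXEL


def cpu_seconds(window):
    return PIXELS * convolution_flops_per_pixel(window) / CPU_FLOPS_PER_S


def gpu_seconds(window):
    # Off-loading means sending the image to the GPU and reading the result back.
    transfer = 2 * PIXELS * BYTES_PER_PIXEL / BUS_BYTES_PER_S
    compute = PIXELS * convolution_flops_per_pixel(window) / GPU_FLOPS_PER_S
    return transfer + compute


def best_device(window):
    return "GPU" if gpu_seconds(window) < cpu_seconds(window) else "CPU"


for window in (3, 15):
    print(window, arithmetic_intensity(window), best_device(window))
```

Under these assumed numbers, the 3x3 filter stays on the CPU because the transfer time dominates, while the 15x15 filter does enough work per byte to justify the trip to the GPU. Different hardware figures would move the break-even point.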

AMD’s first public beta release of OpenCL targets multi-core x86 processors. Beyond being part of a full-platform approach to OpenCL, it also enables developers to begin exploring OpenCL on the systems they already have, without having to first invest in new hardware. Moreover, the techniques critical to GPU performance, such as data locality awareness, use of vector types, and large amounts of parallelism, are also critical to CPU performance. Coding in OpenCL provides a solid methodology for writing parallel code that is scalable on multi-core x86. In fact, AMD has shown some applications scaling well on a 24-core system based on four Six-Core AMD Opteron processors.

Programming that takes maximum advantage of all the parallel-computing resources in the system is the next frontier of application acceleration. OpenCL and the burgeoning ecosystem around it form a solid foundation on which this coming revolution can flourish.