The New Dawn of HPC
By: Kathleen Maher
Issue: Volume: 33 Issue: 2 (Feb. 2010)

As PCs rose from the ashes of mainframes, a new life in computation was born.

Once upon a time, there was a small group of people who tended very expensive machines in glass houses and used them to answer important questions about who, how much, and when, and what's the weather going to be like. They were gods, or rather, they were priests, serving the machines more than the information needs of banks, hospitals, government agencies, and big companies. If you wanted any answer from these very important people, you had to submit your question in the proper way and then wait until they deemed the time was right to give you, the poor supplicant, an answer. Now the fact of the matter is that most of the questions they got were really simple, and more often than not, they could have produced an answer much faster than they usually did. When their time passed, no one mourned them.

As we all know, what took the place of those machines was a new generation of workstations and PCs. At first they might not have been capable of the same complex computing as the glass-house mainframes, but at least they belonged to their users and could provide an answer when asked. It was well worth the sacrifice. IBM puts the birth of the PC at 1981; the Mac first said “hello” in 1984.

Today, supercomputing is all about processors. At top is the Cray Jaguar, the world’s fastest computer for open science; inset shows the Cray XT5 Kraken, the world’s fastest academic supercomputer.  

Time has passed, and the capabilities of desktop computers have actually surpassed those of the old gargantuan mainframe computers. But high-end computing is no longer a monster computer in a glass house; it is thousands and thousands of processors in rows and rows in warehouses. At the recent Supercomputing Conference (SC09), the Cray Jaguar took the top spot in the TOP500 list of the world’s fastest supercomputers. It is now based on six-core Opteron processors from AMD, has a total of roughly 224,000 cores, and posted a 1.75-petaflop performance running Linpack.
The Jaguar lives at the Oak Ridge Leadership Computing facility in Oak Ridge, Tennessee, and is put to work in climate science, chemistry, materials science, nuclear energy, physics, bioenergy, astrophysics, geoscience, fusion, and combustion.

In performance, the Jaguar competes closely with an IBM Roadrunner computer at Los Alamos in New Mexico, the former fastest computer. The Roadrunner has 6562 dual-core AMD Opterons and 12,240 Cell processors from IBM. And, the list goes on. In all, there are millions of processors hard at work in the TOP500 computers, and there is another generation of priests serving these beasts.

The beat goes on, and enormous change is coming to the compute world. In a way, it’s a repeat of that last revolution. What happens when you have a supercomputer at your desk? Well, they are already here. Nvidia has its Tesla, a computer on an add-in board that takes advantage of the many cores in a GPU. The core counts now run from 128 to 240, and to 512 for the new Fermi. And, with this processor, Nvidia has turned its focus to high-performance computing (HPC), developing first for this market, and addressing the mainstream market next with a variant (at least that’s how it looks from here at the moment). AMD can offer a 400-, 800-, or 1600-core ATI FireStream board.

At SC09, Dell and Cray showcased the Cray CX1-iWS deskside computer, which combines a visualization cluster using Nvidia processors with eight-core Xeon processors in a compute cluster. Several independent software vendors (ISVs) have already announced support for the new machine, including Dassault and Ansys for their analysis tools and The MathWorks for MATLAB, its scientific and engineering software.

At the moment, there is fierce competition in the realm of HPC as the leading computer scientists and processor makers hash out the various approaches to solving big problems. The stakes are high. The strategies developed to drive giant supercomputers are also going to go to work on mainstream problems: video transcoding, rendering, physics, imaging filters, and image enhancement. And the list will grow, so this is technology that will be in every computer down the road, and it’s not a long road, either.

There are three basic paths being pursued by the processor companies. Intel is championing the multi-CPU core approach. AMD is in there as well with multiple CPUs, but it is also pursuing Fusion, an integrated chip that includes CPUs and GPU cores. Nvidia and ATI, AMD’s graphics subsidiary, are going for multiple GPUs, and at the moment, this approach—also referred to as heterogeneous computing—has a great deal of traction.

Rise of GPGPU
GPGPU stands for general-purpose computing on graphics processing units, and the term is interchangeable with “GPU compute.” GPU compute approaches take advantage of the hundreds of cores in a GPU to perform tasks other than getting pixels to the screen. The trick is that you have to program applications in such a way that they behave like a graphics application. Nvidia has been very proactive in GPU compute, and the company has worked closely with university research centers to develop its CUDA tool for application development.
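
To make the idea of a program that behaves "like a graphics application" a little more concrete, here is a minimal sketch in plain Python. It is not CUDA or OpenCL code, and the names `saxpy_kernel` and `launch` are hypothetical, invented for this illustration. The point is the shape of the program: the work is written as a small function that computes one output element from its index, much as a shader computes one pixel, and knows nothing about its neighbors. A GPU would run one such invocation per thread across its many cores; the stand-in launcher below simply loops.

```python
# Illustrative sketch only: plain Python standing in for a GPU kernel launch.
# `saxpy_kernel` and `launch` are hypothetical names, not part of CUDA or OpenCL.

def saxpy_kernel(i, a, x, y, out):
    """Compute one output element, the way a shader computes one pixel."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args):
    """Run `kernel` once for each index in range(n).

    A GPU would run these invocations in parallel. Here they run one after
    another, which gives the same result because no invocation reads
    another invocation's output.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, result)
print(result)
```

Because each invocation touches only its own slot in `result`, the order in which they run does not matter, and that independence is what lets hundreds of GPU cores share the work.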

Supercomputers are vital to scientific research. Here, positive (red) and negative (green) ions are placed into a ribosome in an effort to kill it. This process requires very large scale calculations of ribosome electrostatics that have been made possible by GPU computing.

The open standards body Khronos has developed OpenCL, a project that has involved Apple, AMD, and Nvidia, among other interested parties. OpenCL is also a development language, and Nvidia is evolving CUDA to co-exist with OpenCL; in a subtle bit of marketing eloquence, Nvidia now refers to CUDA as CUDA for CL. Nvidia’s archrival ATI has developed the Stream architecture for GPGPU and has committed to OpenCL. The company has been somewhat schizophrenic about its commitment to GPU compute, probably because it has had to pick its fights more carefully than its wealthier competitors, Intel and Nvidia. Now, however, AMD seems to be concentrating on its Stream initiative. The rewards for GPU compute are huge: There are applications that are seeing a speedup of 300x, and improvements of 10x are common.

The Once and Future Larrabee
Intel has not always been solely an x86 processor builder. The company has had other semiconductor architectures, most notably the ARM-based XScale, which was eventually sold to Marvell. When Intel sold its XScale technology, the decision was based on the longevity of the x86, and the company’s position is that there is nothing you can’t do with x86 cores, including graphics.

Larrabee, which Intel hoped to be selling as a graphics processor in 2010, has multiple x86 cores. Intel has not announced how many, but it is generally believed there are 32 to 48 x86 cores in Larrabee. The advantage Larrabee has, says Intel, is in being more easily programmable for traditional computer programs, but Intel had something of the same problem making x86 cores behave like graphics processors as the GPU guys have making GPUs handle CPU-type jobs. A GPU is a SIMD (single instruction, multiple data) processor, while a multi- or many-core x86 processor is designed for MIMD (multiple instructions, multiple data) operations. The compute world has both kinds of problems, and neither is more important than the other, a fact that has gotten lost in the competition between the big players: AMD, Intel, and Nvidia.
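
The SIMD and MIMD distinction is easier to see side by side. The following is only a sketch of the two patterns in plain Python, with hypothetical function names and with threads standing in for hardware cores; it says nothing about how Larrabee or any GPU is actually built. In the SIMD-style function, one operation is applied to every item of the data. In the MIMD-style function, independent workers each run a different job at the same time.

```python
# Illustrative sketch only: the function names are hypothetical, and Python
# threads stand in for hardware cores.
from concurrent.futures import ThreadPoolExecutor


def simd_style(data):
    """SIMD pattern: a single operation (squaring) applied to many data items."""
    return [v * v for v in data]


def mimd_style(data):
    """MIMD pattern: independent workers, each running a different job."""
    tasks = [sum, max, min, len]  # four unrelated jobs on the same input
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task, data) for task in tasks]
        # Collect results in the same order the tasks were submitted.
        return [f.result() for f in futures]


values = [3, 1, 4, 1, 5]
squares = simd_style(values)
summary = mimd_style(values)
print(squares)
print(summary)
```

Image filters and physics steps tend to look like the first function; a program juggling unrelated chores looks like the second, which is why neither kind of processor makes the other unnecessary.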

Larrabee generated a lot of excitement as a rendering engine that might be able to reduce the cost, size, and power consumption of huge rendering farms, but as is now well known, Intel has pulled back from the brink. The company built the chip and demonstrated it reaching 1 teraflop on a single-precision matrix-multiply (SGEMM) test at SC09.
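
For readers wondering where a figure like "1 teraflop on SGEMM" comes from, the arithmetic can be sketched as follows. Multiplying an n-by-k matrix by a k-by-m matrix takes one multiply and one add for each of the n × k × m inner steps, or 2 × n × k × m floating-point operations; divide by the elapsed time and you have a flops rate. The sketch below uses a naive pure-Python multiply and hypothetical helper names, and Python floats are double precision rather than the single precision SGEMM uses, so it shows only the bookkeeping, not the benchmark Intel ran.

```python
# Illustrative sketch only: a naive matrix multiply used to show how the
# operation count behind a GEMM benchmark figure is derived.
import time


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply and one add
            c[i][j] = acc
    return c


def gemm_flops(n, k, m):
    """Floating-point operations in an (n x k) by (k x m) product."""
    return 2 * n * k * m


size = 64
a = [[1.0] * size for _ in range(size)]
b = [[2.0] * size for _ in range(size)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start

flops = gemm_flops(size, size, size)
print(f"{flops} operations in {elapsed:.4f} s = {flops / elapsed / 1e9:.4f} gigaflops")
```

A teraflop is a thousand gigaflops, so whatever rate this prints on a desktop interpreter gives a feel for the gap that purpose-built parallel hardware is closing.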

Unfortunately, Intel was pushing development of Larrabee as a strong competitor to graphics processors, much to the bemusement of outsiders who can clearly perceive the strength of Larrabee for a much broader type of computing. At the Supercomputing Conference, Intel CTO Justin Rattner seemed to be repositioning Larrabee, stressing the importance of MIMD operations.

As Intel was reaching an internal deadline for Larrabee, the stakeholders called for a major time-out. They had not worked out the difficulties of programming Larrabee. Above all, Intel wanted Larrabee to come out a winner, and to do that, the firm had to meet the high expectations it had set for graphics performance and give developers tools that would be easy to use. That was not happening as the managers thought it should.

The Larrabee name is probably tainted, but Intel’s work in this field is part of an ongoing initiative that’s not going away. Almost concurrent with the announcement that Intel would pause in the creation of a graphics chip, the company announced a 48-core, tera-scale processor developed in India as a cloud computer in a chip.

Meanwhile, Fusion
Fusion, AMD’s grand vision, combines the GPU and the CPU. In one sense, it is a design that will enable low-cost computers and mobile devices to take the idea of an integrated processor one step further than the integrated graphics processors that have come to dominate mainstream and low-end computing. But, AMD also sees the potential for a raft of Fusion processors all lined up and working hard on parallel and linear problems.

The fact of the matter is that the world needs both CPUs and GPUs to handle the different problems that are out there. In most applications, the CPU and GPU are working in tandem.

It’s All in the Language
The challenge in high-performance computing is not necessarily in building the biggest, baddest, most efficient processor, though that’s by no means easy; it’s also necessary to build something on which applications can be developed. The HPC applications of today are fabulous, truly fabulous, the stuff of myth and magic: they will cure cancer and re-create universes. However, the first applications were built by dedicated scientists who have been able to twist their brains around the challenge of restating problems in ways the processors will understand.

Here, an image of the Satellite Tobacco Mosaic Virus is colored by the electrostatic field created by the virus capsid and contained RNA. The electrostatic field was calculated on an Nvidia Tesla C1060 GPU, which ran 26 times faster than a single-core CPU or 6.6 times faster than a typical four-core CPU. Researchers are then able to work with the model and run simulations on GPU-accelerated workstations.

OpenCL and CUDA are relatively new and evolving, and more work has to be done to make them easier to use. Microsoft and Nvidia have pushed the ball further with the development of Nexus, a programming tool that works within Visual Studio to let developers access CPU and GPU cores and also analyze the ways in which these cores are working in the app.

The ability to offer these sophisticated capabilities in a mainstream programming tool like Visual Studio is huge. We expect more to come from Microsoft to enable work that takes advantage of DirectX 11, which supports heterogeneous computing. ATI’s Stream Computing group has developed an SDK that also includes analysis, but so far Nvidia has made the most strides in tool development.

Intel knows this, as well. The company announced its own development tool at SC09 called Ct. It’s a dialect of C++ and will likewise enable developers to spread code across CPU- and GPU-based processors. Compliant with CUDA and with OpenCL, Ct is an example of the kind of work being done at Intel that goes beyond Larrabee. It’s going to be needed in order to enable the work of the tera-scale group and of future processors from Intel that combine GPU and CPU.

(Top) A supernova shock wave and (bottom) neutron experiments are some of the projects being performed with supercomputers.

So, let’s go back to visit those high priests of computing. They’re still with us; in fact, some of them are the same people, but they’ve changed, too. They are our friends. They are the people working in universities, in the Khronos Group, and at the major hardware and software companies to make supercomputing everyday computing.

Some of the changes we can see right now are in faster video performance, faster transcoding, jazzy heads-up displays, instant imaging adjustments, and, of course, better gameplay. In the next go-round, all those processors available out there will be working on science projects, but they’ll also be available to perform analysis and render in the cloud, letting us try out new materials on our couches. The differences between HPC and mainstream computing will fade, and access to more power will be like turning on a light switch.