And GPU computing is no longer just a curious niche of interest to only a handful of academics with access to a workstation and a little extra time on their hands; it has been the launching point for a fundamental wave of change in emerging system architectures. No longer does the conventional paradigm apply, in which the subservient GPU is limited to 3D and 2D rendering, while the CPU handles everything else. The line of processing responsibility between the GPU and the CPU is blurring, and vendors on both sides of that fence have taken notice.
The Multi-core Software Gap
Despite the welcome and surprisingly strong upswing in growth, the workstation market is a mature one. And with that maturity comes a growing dependence on the replacement cycle to sustain volume. If customers judge the latest features and performance levels as offering a cost-effective boost in productivity, they’ll replace equipment more often, thereby shrinking the cycle and raising volume. But if they see little benefit in the industry’s latest round of new products, they’ll be more likely to hang on to their old systems and software longer, thus extending the cycle and lowering volume.
And therein lies the risk for the workstation industry, as it makes its way into the new age of multi-core computing architectures. Raising the bar with every new hardware generation—enough to entice customers to throw out the old and replace with the new—has never been easy, but with the advent of multi-core, it’s gotten that much harder. Once largely responsible and accountable for raising performance from one generation to the next, hardware vendors now find themselves increasingly dependent on the software industry to realize substantial performance gains.
According to figures from JPR, workstations are experiencing resurgent growth.
That’s a new, uncomfortable position for hardware vendors such as Intel. While it toyed with modest forms of simultaneous multithreading (SMT), such as Pentium 4’s Hyper-Threading, the company could for the most part focus on single-threaded architectures, designing in more elaborate superscalar features and dialing up the gigahertz. Compilers needed to stay in sync with architectural improvements, but largely, the hardware provider alone determined how big a jump in performance the user would see.
And for a while, things were good. If Intel did its job correctly, last year’s x86 binaries ran on this year’s beefed-up processor, everything got substantially faster, and buyers could justify forking over the dollars to upgrade to the new platforms.
But Pentium 4 marked the beginning of the end for that paradigm. Clock rates began hitting a wall, and achieving even small boosts in frequency meant significantly more complicated designs and dramatically more watts. Power consumption was spiraling out of control, beyond the ability to effectively cool the chips and beyond the sensibility to pay for the electricity to power them.
So, en masse, the computer industry revisited the problem and arrived at a new answer: multi-core architectures. Rather than try to double the clock rate of a CPU core twice the size, vendors integrated twice the number of cores running at roughly the same frequency. That brought power under control and still achieved twice the theoretical performance of the previous generation.
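The power argument can be made concrete with a rough back-of-the-envelope model. The sketch below assumes the textbook approximation that a chip's dynamic power scales with capacitance times voltage squared times frequency, and it further assumes, as a deliberate simplification, that supply voltage has to rise in proportion to clock frequency. The function name and both assumptions are illustrative and do not come from any vendor; leakage power and design complexity are ignored.

```python
def relative_dynamic_power(freq_scale, cores=1, voltage_tracks_frequency=True):
    """Dynamic power relative to one baseline core (baseline = 1.0).

    Uses the textbook approximation P ~ C * V^2 * f per core.
    Assumption: if voltage_tracks_frequency is True, supply voltage is
    scaled by the same factor as the clock (a crude simplification).
    """
    if freq_scale <= 0 or cores < 1:
        raise ValueError("freq_scale must be positive and cores at least 1")
    voltage_scale = freq_scale if voltage_tracks_frequency else 1.0
    return cores * voltage_scale ** 2 * freq_scale


# Option A: one core at twice the clock rate.
double_clock = relative_dynamic_power(2.0)
# Option B: two cores at the original clock rate.
double_cores = relative_dynamic_power(1.0, cores=2)

print(f"Doubling the clock: {double_clock:.1f}x the power")
print(f"Doubling the cores: {double_cores:.1f}x the power")
```

Under these assumptions, both options offer the same theoretical doubling of throughput, but doubling the clock costs roughly eight times the power while doubling the cores costs about twice.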
But with every engineering decision comes a trade-off. And in the case of the migration to multi-core, that trade-off meant hardware vendors now had to more evenly share control over performance and productivity with application developers. In the multi-core era, raising performance depends much more on the software developer: to explicitly draw out parallelism and to implement multi-threaded code efficiently enough to keep more cores busy more of the time.
So today’s celebrated next-generation processor, with twice as many cores as last year’s model, might deliver a significant boost running the end user’s key application, or it might not. It now depends a lot on what code is running. Installations that rely heavily on carrying unmodified legacy code forward from platform to platform may be disappointed in the performance boost a new quad-core processor delivers over its dual-core predecessor.
Legacy x86 code may perform better on a newer processor. Yet this is likely due not to the presence of more cores, but rather to other system resource enhancements, such as faster memory or I/O access, or to clever design tricks. For example, AMD and Intel, both keenly aware of the need to improve single-thread performance, have introduced power-averaging techniques that allow a system running single-threaded code to turn off one core and use the saved power to crank up the frequency of the operational core. So older single-threaded code will get some boost out of a new platform, but very possibly not enough for a manager debating whether to invest serious dollars (out of a tight IT budget) to upgrade staff workstations.
So how is the ISV community doing at keeping the pace of multi-threading in step with the hardware industry’s pace of multiplying cores? While the situation, of course, varies dramatically by application, the answer is generally, and by all accounts, not good enough. Far from getting the extended, near-linear scaling they want, vendors too often see performance trail off after only two or three cores, with diminishing returns beyond (sometimes drastically so).
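One common way to reason about this kind of trailing-off is Amdahl's law, which caps the speedup of a program by the fraction of its work that cannot run in parallel. The article does not name this model; it is brought in here as an illustration, and the 60 percent parallel fraction below is a hypothetical figure, not a measurement of any real application.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over one core when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical application in which 60% of the run time can be threaded.
PARALLEL_FRACTION = 0.6

for cores in (1, 2, 4, 8):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    print(f"{cores} cores: {speedup:.2f}x")
```

For this hypothetical application, going from two cores to eight raises the ideal speedup only from about 1.4x to about 2.1x, and no number of cores can push it past 2.5x. Real applications add synchronization and memory-contention costs that this model leaves out, so measured scaling is typically worse.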
But there is hope. The industry understands how critical it is to improve programming on more massively parallel platforms, and companies are allocating more money, staff, and PR to address it. Alert workstation vendors, like HP, are working with the ISV community to help promote and stimulate more effective multi-core programming.
Unlike its competitors, Boxx Technologies has stayed solely focused on the professional market, attracting power users with its PC-derived solutions.
Vendors such as RapidMind have seen the need and are developing the tools to better map code to the more massively parallel architectures of the future. Hardware leaders AMD, Nvidia, HP, IBM, Intel, and Sun are establishing research sites like the new Pervasive Parallelism Lab at Stanford, chartered to pursue new, more effective models for the future’s massively parallel architectures. And the vendors with the most at stake in this battle—Intel and Microsoft—are kick-starting research and development of new tools and new approaches to multi-core programming, with the two companies already putting up more than $100 million in funding.
Don’t Forget the GPU
Exploring beyond their traditional boundaries, GPU vendors, such as Nvidia and AMD, are stepping up to help in the battle to scale system performance (though Intel, of course, isn’t particularly enthusiastic about that proposition). With the transition to unified arrays of massively parallel, programmable engines, GPUs are moving well beyond simply rendering triangles, instead tackling stream-intensive, floating-point-heavy general-purpose compute problems other than graphics.
While a processor vendor might be shooting to see the next-generation quad core deliver 50 percent more performance than last generation’s dual core, GPU computing applications running on Nvidia’s Tesla and AMD’s FireStream hardware are delivering some eye-popping numbers for well-suited applications. In work performed with the University of Illinois (and presented at the Hot Chips conference in 2007), for example, Nvidia claims a wide range of speedups for its targeted applications—anywhere from 1.5X to more than 400X, depending on the application.
An increase in system throughput is an increase, whether it comes via the CPU, the GPU, or anyplace else. And the promise of better throughput will always spur more upgrades and entice new buyers.
Beyond relying exclusively on improving application multi-threading, the workstation industry fortunately has another avenue to pursue in its goal to translate more cores into better end-user productivity: the actual end user.
The thinking is this: If a task for one application can scale performance efficiently by taking advantage of two cores, then a user kicking off two tasks in parallel ought to effectively consume four cores. And the best part is that it’s simply taking more explicit advantage of what we’ve all been doing already, consciously or not: partitioning a project into distinct tasks, sorting out which can execute in parallel versus those which must be worked sequentially, and then adapting our workflow to match.
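A small calculation illustrates why this can pay off. The sketch below compares two ways of finishing two identical jobs on a four-core workstation: running them back to back with all four cores each, or running them side by side with two cores each. It assumes a simple Amdahl-style timing model and a hypothetical job in which 60 percent of the work can be threaded; the names and numbers are illustrative only, and contention for memory, disk, and licenses is ignored.

```python
def run_time(parallel_fraction, cores):
    """Run time relative to a single-core run (single core = 1.0)."""
    if not 0.0 <= parallel_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cores >= 1")
    return (1.0 - parallel_fraction) + parallel_fraction / cores


PARALLEL_FRACTION = 0.6  # hypothetical: 60% of each job can be threaded
TOTAL_CORES = 4

# Two jobs run one after the other, each given all four cores.
back_to_back = 2 * run_time(PARALLEL_FRACTION, TOTAL_CORES)

# Two jobs started together, each given half of the cores.
side_by_side = run_time(PARALLEL_FRACTION, TOTAL_CORES // 2)

print(f"Back to back: {back_to_back:.2f}")
print(f"Side by side: {side_by_side:.2f}")
```

In this model, both jobs finish in about 0.70 time units when run side by side, versus about 1.10 when run back to back, because each job wastes less of its allotted hardware on work that cannot be threaded.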
Resourceful engineers and artists quickly develop their own techniques for overlapping iterations and batching up jobs to run in parallel. Take the illustration from HP, Intel, and component car manufacturer Factory Five depicting an iterative workflow for CAD: render, review, test, analyze, adjust, and repeat (see graphic, this page). Tweak the design and then kick off a detailed rendering and an FEA run, while at the same time visually reviewing the modified assembly.
Similarly, digital content creators naturally overlap tasks in the pipeline: adjust a model, render a scene, tweak the animation, render a rough sequence, review, re-render, and so forth. Whatever the space, resourceful professionals have always adapted their own workflow to a parallel process whenever they can. It’s just that now with multi-core architectures, those available compute cycles will be more plentiful and better suited to handle discrete tasks.
CAD professionals multitask and adapt their workflow to a parallel process whenever possible. This depiction from Factory Five is a prime example of such a process.
So just as the ISV needs to raise the tempo of application multi-threading, users will need to pick up the pace in workflow multi-tasking. Fortunately, users are already getting some help to do just that.
Workstation vendors have their suppliers of processor platforms, graphics cards, and displays to thank for helping users juggle more and more tasks in parallel. Add-in cards with two dual-link DVI interfaces have trickled down the product lines of AMD and Nvidia, and recent platforms from both Intel and AMD now let workstation vendors populate two of those cards in a single system. Throw in dramatically lower prices on high-resolution LCDs, and it’s become both easy and inexpensive to deploy two, three, or even four high-resolution displays on the desktop.
More screen real estate lets us manage more tasks all at the same time, keeping more hardware resources busy, and getting more work done in the process. And, ultimately, that’s what it all boils down to: delivering a meaningful boost in productivity. Multi-core workstations can get more done in less time than their predecessors, but ensuring that they do will mean an increased emphasis not just on getting the application to do more in parallel, but on giving the user the tools to do the same.