Parallel computing has become personal, in both liberating and demanding ways.
I remember using an IBM 1130 mainframe in high school in the 1970s, and how frustrating it was because only one person could use the machine at a time, feeding it via a card reader. Then, in the 1980s, computers became personal, and I had all the time I wanted to run and debug sequential programs.
Parallel computing has undergone a similar shift. In the 1980s and 1990s, parallel computers were institutional. They were fascinating to program, but access was limited. I was fortunate at one point to share a 256-processor nCUBE with only a few other people. Now, with multi-core chips, every programmer can have cheap access to a parallel computer—perhaps not with 256 processors yet, but with a growing number every year.
The downside, of course, is that parallel programming is no longer optional because parallel computers have become personal for consumers, too. Now parallel programming is mandatory for performance-sensitive applications.
There is no one true way to do parallel programming. Many paradigms have been proposed and have been cast in the form of new languages, language extensions, and libraries. One such paradigm defines tasks that run in shared memory. This paradigm is well suited to achieving parallel speedup on multi-core chips. The key notion is to separate logical task patterns from physical threads, and to delegate task scheduling to the system.
The paradigm has been around for a long time. Intel Threading Building Blocks was written to evangelize the paradigm, and to provide it off the shelf so that programmers would not have to reinvent it (and debug and tune it!).
Threading Building Blocks is strictly a library, not a new language or language extension. Though new languages and extensions are attractive, they raise a high barrier to adoption in the near term, particularly in commercial settings, where continuity from the existing code base is paramount. (Indeed, it is so important that businesses are still selling systems that are upward-compatible with some of those mainframes from the 1970s.)
Starting in 2004, I chaired a study group inside Intel that drafted the initial Threading Building Blocks proposal. Despite an early commitment to a library-only solution, we drew much of our inspiration from new languages and extensions for parallel programming. Our goal was to borrow as many good ideas as we could put into library form. Sticking to a library was a requirement so that Threading Building Blocks could slip easily into existing C++ programming environments.
C++ makes the library approach practical because it is designed for writing libraries. Stroustrup cites a 10X reduction in line count for the Booch components written in C++ versus Ada. Perhaps C++ will be even more powerful in the future. For example, the addition of lambda functions (see Chapter 12) would simplify the mechanics of using the Threading Building Blocks
A library-only solution is not perfect. We had to leave out some features that really require compiler support. For example, data-parallel array operations such as Fortran 90, ZPL, and NESL were deemed impractical because they rely heavily on optimizing compilers. There have been some C++ libraries such as POOMA that do some of the optimizations via template metaprogramming, but the complexity of such libraries is high. Parallel functional programming is another powerful paradigm, but alas, it requires significant compiler support.
Several systems were particularly influential in the development of Threading Building Blocks; these (and others) are listed in the bibliography of this book.
The Chare Kernel (now Charm++) showed the advantages of breaking a program into many small tasks. In particular, distributing load is simplified. By analogy, it’s a lot easier to evenly distribute many small objects among the cases instead of a few large objects.
Cilk showed the power of combining a scheduling technique called task stealing with recursive tasks. Recursion is often slower than iteration for serial programming, but it turns out that recursive parallelism has some big advantages over iterative parallelism with respect to load balancing and cache reuse. Cache reuse is critical because restructuring for cache sometimes improves program speed by 2X or more, possibly delivering better improvement than multithreading alone. Fortunately, the Cilk approach tends to steer programmers to solutions that both are parallel and have good cache behavior.
The C++ Standard Template Library (STL) showed how a library could be both generic and efficient. As we gain experience, we’re learning how to be more generic. STAPL showed how to bring generic binding of algorithms to containers into the parallel world, by substituting the fundamentally sequential STL iterator with parallel recursive ranges (pRanges in STAPL). This enabled parallel algorithms to operate on parallel containers and opened up the ability to apply parallel recursive ranges to multidimensional spaces (e.g.,
blocked_range2d), and even reuse (some would say abuse) them to write a parallel quicksort.
If you are accustomed to heavyweight threads, lightweight tasks require a new mindset. For example, heavyweight threads tend to drive designs toward relatively few threads because they are costly. Furthermore, they tend to introduce a lot of explicit synchronization, such as locks. Lightweight tasks drive designs that use many tiny tasks with implicit synchronization. Each task combines some work and a little bit of synchronization. An extreme case is the Threading Building Blocks
empty_task class, which does nothing except synchronization. For instance, it is used by
parallel_for to synchronize completion without using any (explicit) locking. The synchronization is implied by the task pattern. Locks are still sometimes necessary, but I encourage you to exploit implicit task synchronization instead where possible.
Performance matters. By definition, parallel programming for speedup is wasted effort if the sequential equivalent outruns it. Merely decomposing a problem into a zillion parallel race-free tasks is typically not enough to achieve speedup because each of those zillion tasks needs space. Threading Building Blocks is intended to guide programmers toward patterns that are space-efficient.
I’ll close with my favorite principle for parallel programming: KISS (Keep It Simple, Stupid). Writing parallel programs can be as complicated as you want it to be. However, the successful parallel programs that I’ve seen stick with simple patterns. Threading Building Blocks provides a good foundation layer for these patterns.
—Arch D. Robison
Arch D. Robison is currently the lead developer of Intel Threading Building Blocks. Previously, Arch worked at Shell on massively parallel codes for seismic imaging, and then was the lead developer for KAI C++, a portable, cross-platform, high-performance, C++ compiler. He received his Ph.D. from the University of Illinois.