Fundamental Laws of Parallel Computing .. a theoretical post, but important to understand!!

Amdahl’s and Gustafson’s laws are the two laws of parallelism that have influenced the research and practice of parallel computing during the past decades. In this article I try to explain Amdahl’s law: how it can be applied to multi-core chips and what implications it has for architecture and programming model research. Gustafson’s law, and the equivalence of the two laws, will follow in a later post!

Amdahl’s Law: Amdahl’s law is probably the best known and most widely cited law defining the limits of how much parallel speedup can theoretically be achieved for a given application. It was formulated by Gene Amdahl in 1967 [1] as a supporting argument for continuing to scale the performance of a single core, rather than aiming for massively parallel systems.

In its traditional form it provides an upper bound on the potential speedup of a given application, as a function of the size of the sequential fraction of that application. The mathematical form of the law is

Speedup = 1 / ((1 − P) + P/N)

where P is the fraction of the application that is parallelized and N is the number of available cores. It is immediately clear that, according to Amdahl’s law, any application with a sequential fraction has an upper bound on how fast it can run, independent of the number of cores available for its execution.

When N approaches infinity, this upper bound becomes

Speedup = 1 / (1 − P)

The speedup curve for various levels of parallelism is shown in the figure below.
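The effect is easy to see numerically. Below is a minimal sketch (plain Python; the parallel fractions and core counts are illustrative values of my own choosing) that evaluates Amdahl’s formula for a few values of P and N and prints the limiting speedup 1/(1 − P):

```python
# Amdahl's law: speedup as a function of the parallel fraction P
# and the number of cores N. Illustrative sketch only.

def amdahl_speedup(p, n):
    """Speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95, 0.99):
        speedups = [amdahl_speedup(p, n) for n in (2, 16, 64, 1024)]
        limit = 1.0 / (1.0 - p)  # upper bound as N approaches infinity
        print(f"P={p:.2f}: "
              + "  ".join(f"{s:6.1f}" for s in speedups)
              + f"   limit={limit:6.1f}")
```

Even with 1024 cores, a program that is 95% parallel stays just below its 20x ceiling; the sequential 5% dominates.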

The implications of Amdahl’s law were profound. Essentially, it predicted that the focus should be on making single cores run faster (something that was within reach for the past four decades) rather than on the costlier approach of parallelizing existing software, which would in any case have had limited scalability, in accordance with Amdahl’s law, as long as some part of the software remained sequential. The law also triggered groundbreaking research that resulted in innovations such as out-of-order execution, speculative execution, pipelining, dynamic voltage and frequency scaling and, more recently, embedded DRAM. All these techniques are primarily geared towards making single-threaded applications run faster and, consequently, towards pushing out the hard limit set by Amdahl’s law.

As multi-core chips were becoming mainstream, Amdahl’s law had the clear merit of bringing sequential applications, or the sequential portions of otherwise parallelized applications, into focus. A large body of research deals with auto-parallelizing existing applications; the research into asymmetric multi-core architectures is likewise driven by the need to cater for both highly parallelized applications and applications with large sequential portions.

Amdahl’s law does, however, have an important shortcoming: it assumes that the sequential fraction of an application stays constant, no matter how many cores can be used for that application. This is clearly not always the case: more cores may mean more data parallelism, which dilutes the significance of the sequential portion; at the same time, an abundance of cores may enable speculative, run-ahead execution of the sequential part, resulting in a speedup without actually turning the sequential code into parallel code. The first observation later led to Gustafson’s law, while the second led to a new way of using many-core chips.

Amdahl’s Law for Many-core Chips: 

The applicability of Amdahl’s law to many-core chips was first explored, on a theoretical level, in Ref. [2]. It considered three scenarios and analyzed the speedup that can be obtained, under the same assumptions as in the original form of Amdahl’s law, for three different types of multi-core chip architectures. The three scenarios were:

• Symmetric multi-core: all cores have equal capabilities
• Asymmetric multi-core: the chip is organized as one powerful core along with several simpler cores, all sharing the same ISA
• Dynamic multi-core: the chip may act as one single core with increased performance or as a symmetric multi-core processor; so far this is a theoretical case only, as no such chip has been designed

In order to evaluate the three scenarios, a simple model of the hardware is needed. Ref. [2] assumes that a chip has a certain amount of resources, expressed through an abstract quantity called the base processing unit (BCU). The simplest core that can be built requires at least one BCU, and speedup is reported relative to the execution speed on a core built with one BCU; multiple BCUs can be grouped together, statically at chip design time or dynamically at execution time as in the dynamic case, in order to build cores with improved capabilities. In Ref. [2], the performance of a core built from n BCUs was approximated using a theoretical function perf(n), expressed as the square root of n. This is clearly just an approximation, meant to express the diminishing returns from adding more transistors (BCUs in our terminology) to a given core design.

In the symmetric multi-core processor case, the results were equivalent to the original scenario outlined by Amdahl’s law: using homogeneous architectures, for the canonical type of applications, the speedup will be limited by the sequential fraction.
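To make the model concrete, here is a minimal sketch of the symmetric case, assuming the perf(n) = sqrt(n) approximation above: a 64-BCU budget is split into identical cores of r BCUs each, the sequential part runs on one such core, and the parallel part runs on all of them. The function names and the chosen parallel fractions are mine, for illustration only.

```python
import math

def perf(r):
    """Performance of a core built from r BCUs, approximated as sqrt(r)."""
    return math.sqrt(r)

def symmetric_speedup(p, total_bcus, r):
    """Symmetric chip: total_bcus // r identical cores of r BCUs each.
    The sequential part runs on one core, the parallel part on all cores."""
    cores = total_bcus // r
    return 1.0 / ((1.0 - p) / perf(r) + p / (cores * perf(r)))

if __name__ == "__main__":
    for p in (0.90, 0.95, 0.99):
        best_speedup, best_r = max(
            (symmetric_speedup(p, 64, r), r) for r in (1, 2, 4, 8, 16, 32, 64))
        print(f"P={p:.2f}: best symmetric speedup {best_speedup:5.1f} "
              f"with {64 // best_r} cores of {best_r} BCUs each")
```

In this model a 95% parallel program peaks at roughly 18x, with 16 cores of 4 BCUs each; no symmetric split comes close to the chip’s raw parallel capacity.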
The more interesting results were obtained for the other two cases. In the single-ISA asymmetric multi-core processor case, the number of simple cores is reduced in order to introduce one more powerful core alongside a larger set of simpler cores.

Ref. [2] shows the speedup curves for programs with different degrees of parallelism on a chip with 64 BCUs, organized into different asymmetric configurations. There are two conclusions that may be drawn from this scenario:

• The speedup is higher than what Amdahl’s classical law predicts: the availability of the more complex (faster) core makes it possible to run the sequential part of the program faster.
• There’s a sweet spot for each level of parallelism, beyond which performance declines; for example, for the 95% parallel type of application this sweet spot is reached with 48 equal cores plus one core running at four times higher speed (illustrated in the sketch below).
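These observations can be reproduced, at least qualitatively, by extending the previous sketch: one big core of r BCUs is carved out of the 64-BCU budget, the remaining BCUs become single-BCU cores, the sequential part runs on the big core, and, as in the model of Ref. [2], the big core joins the simple cores for the parallel part. Restricting r to perfect squares (my own choice, so that perf(r) is a whole number) keeps the output readable; the exact optimum shifts slightly if all sizes are allowed.

```python
import math

def perf(r):
    """Performance of a core built from r BCUs, approximated as sqrt(r)."""
    return math.sqrt(r)

def asymmetric_speedup(p, total_bcus, r):
    """Asymmetric chip: one big core of r BCUs plus (total_bcus - r) single-BCU
    cores. The sequential part runs on the big core; the parallel part runs on
    the big core and all simple cores together."""
    simple_cores = total_bcus - r
    return 1.0 / ((1.0 - p) / perf(r) + p / (perf(r) + simple_cores))

if __name__ == "__main__":
    for p in (0.90, 0.95, 0.99):
        best_speedup, best_r = max(
            (asymmetric_speedup(p, 64, r), r) for r in (1, 4, 9, 16, 25, 36, 49))
        print(f"P={p:.2f}: best asymmetric speedup {best_speedup:5.1f} with one "
              f"{perf(best_r):.0f}x core ({best_r} BCUs) and {64 - best_r} simple cores")
```

For the 95% parallel program this reproduces the sweet spot mentioned above (one 4x core plus 48 simple cores) with a speedup of roughly 32x, compared to about 18x for the best symmetric layout.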

The implication of this result is that asymmetric chip designs can help mitigate the impact of Amdahl’s law; the challenge, however, is that different applications may have their sweet spots at different configurations. This is what makes the third case, the fully dynamic scenario, really interesting. In this scenario the hardware can either act as a large pool of simple cores, when the parallel part is executed, or as a single, fast core whose performance scales as a function of the number of simple cores it replaces.
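Under the same model, the dynamic scenario is the simplest to express: during the sequential phase all 64 BCUs fuse into one core running at perf(64), and during the parallel phase they operate as 64 independent simple cores. The sketch below describes the model only, not any real chip.

```python
import math

def perf(r):
    """Performance of a core built from r BCUs, approximated as sqrt(r)."""
    return math.sqrt(r)

def dynamic_speedup(p, total_bcus):
    """Dynamic chip: the sequential part runs on a single 'fused' core with
    performance perf(total_bcus); the parallel part runs on total_bcus
    single-BCU cores."""
    return 1.0 / ((1.0 - p) / perf(total_bcus) + p / total_bcus)

if __name__ == "__main__":
    for p in (0.90, 0.95, 0.99):
        print(f"P={p:.2f}: dynamic speedup on 64 BCUs = {dynamic_speedup(p, 64):5.1f}")
```

With P = 0.95 this gives roughly 47x on the same 64-BCU budget, comfortably above the asymmetric optimum, which is what makes the dynamic case so attractive despite being hypothetical.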

Such a chip could push out the limitations set by Amdahl’s law considerably. However, we do not yet know how to build one, so at first sight this application of Amdahl’s law seems to be a mere theoretical exercise. There are, however, two techniques that could, in theory, make N cores behave as one single powerful core. The first, albeit with a limited scope of applicability, is dynamic voltage and frequency scaling applied to one core while the others are switched off (or kept in a very low power mode); the second, with theoretically better scalability, is speculative, run-ahead execution.

Run-ahead speculative execution aims at speeding up the execution of single-threaded applications by speculatively executing the most promising branches of execution in advance, on separate cores; if the speculation is successful, the result is earlier completion of the sequential portions and thus a speedup of execution. The grand challenge of speculative execution is the accuracy of prediction: too many mis-predictions decrease the ratio of useful work per unit of power consumed, making the approach less appealing while delivering only limited speedup. On the other hand, it does scale with the number of cores, as more cores increase the likelihood of speculating on the right branch of execution. The key to efficient speculation is to dramatically strengthen the semantic link between the execution environment and the software it is executing.

Amdahl’s law, along with Gustafson’s law that will be introduced in another post, is still at the foundation of how we reason about computer architecture and parallel programming in general, and about the architecture and programming of many-core chips in particular. At its core, it defines the key problem that needs to be addressed in order to leverage the increased computational power of many-core chips: the portions of the program that are designed to execute within one single thread.

References:

  1. Amdahl G (1967) Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. American Federation of Information Processing Societies (AFIPS) Conference Proceedings 30:483–485
  2. Hill MD, Marty MR (2008) Amdahl’s Law in the Multi-core Era. IEEE Computer 41(7):33–38
  3. Vajda A (2011) Programming Many-Core Chips. Springer, ISBN 978-1441997388