Fundamental Laws of Parallel Computing .. a theoretical post, but important to understand!

This post is about Amdahl’s and Gustafson’s laws, the two laws of parallelism that have influenced the research and practice of parallel computing during the past decades, and about the equivalence between them. How Amdahl’s law can be applied to multi-core chips and what implications it has for architecture and programming model research .. this is what I am trying to explain in this article!

Amdahl’s Law: Amdahl’s law is probably the best known and most widely cited law defining the limits of how much parallel speedup can theoretically be achieved for a given application. It was formulated by Gene Amdahl in 1967 [1] as a supporting argument for continuing to scale the performance of a single core rather than aiming for massively parallel systems.

In its traditional form it provides an upper bound on the potential speedup of a given application, as a function of the size of that application's sequential fraction. The mathematical form of the law is

Speedup = 1 / ((1 − P) + P/N)

where P is the fraction of the application that has been parallelized and N is the number of available cores. It is immediately clear that, according to Amdahl’s law, any application with a sequential fraction will have an upper bound on how fast it can run, independent of the number of cores that are available for its execution.

When N approaches infinity, this upper bound becomes

Speedup = 1 / (1 − P)

The speedup curves for various levels of parallelism are shown in the figure below.
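To get a feel for the numbers, here is a minimal Python sketch (the function name is mine, not something from the literature) that evaluates the formula for a few values of P and N, together with the asymptotic bound 1/(1 − P):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: p is the parallelized fraction, n the number of cores."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.95, 0.99):
    row = ", ".join(f"N={n}: {amdahl_speedup(p, n):6.2f}" for n in (2, 16, 64, 1024))
    print(f"P={p:.2f}  {row}  limit: {1.0 / (1.0 - p):6.2f}")
```

Even with 1024 cores, a 99% parallel program stays below the 100x ceiling given by 1/(1 − P).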

The implications of Amdahl’s law were profound. Essentially it predicted that the focus should be on making single cores run faster (something that was within reach for the past four decades) instead of the costlier approach of parallelizing existing software, which would in any case have limited scalability, in accordance with Amdahl’s law, as long as some part of the software remained sequential. This law also triggered groundbreaking research that resulted in innovations such as out-of-order execution, speculative execution, pipelining, dynamic voltage and frequency scaling and, more recently, embedded DRAM. All these techniques are primarily geared towards making single-threaded applications run faster and, consequently, towards pushing out the hard limit set by Amdahl’s law.

As multi-core chips were becoming mainstream, Amdahl’s law had the clear merit of putting sequential applications (or the sequential portions of otherwise parallelized applications) into focus. A large amount of research deals with auto-parallelizing existing applications; the research into asymmetric multi-core architectures is also driven by the need to cater for both highly parallelized applications and applications with large sequential portions.

However, Amdahl’s law has an important shortcoming: it assumes that the sequential fraction of an application stays constant, no matter how many cores can be used for that application. This is clearly not always the case: more cores may mean more data parallelism, which can dilute the significance of the sequential portion; at the same time, an abundance of cores may enable speculative and run-ahead execution of the sequential part, resulting in a speedup without actually turning the sequential code into parallel code. The first observation later led to Gustafson’s law, while the second points to a new way of using many-core chips.

Amdahl’s Law for Many-core Chips: 

The applicability of Amdahl’s law to many-core chips was first explored, on a theoretical level, in Ref. [2]. It analyzed the speedup that can be obtained, under the same assumptions as in the original form of Amdahl’s law, for three different types of multi-core chip architectures. The three scenarios were:

• Symmetric multi-core: all cores have equal capabilities
• Asymmetric multi-core: the chip is organized as one powerful core along with several simpler cores, all sharing the same ISA
• Dynamic multi-core: the chip may act either as one single core with increased performance or as a symmetric multi-core processor; this is a theoretical case only so far, as no such chip has been designed

In order to evaluate the three scenarios, a simple model of the hardware is needed. Ref. [2] assumes that a chip has a certain amount of resources, expressed through an abstract quantity called the base processing unit (BCU). The simplest core that can be built requires at least one BCU, and speedup is reported relative to the execution speed on a core built with one BCU; multiple BCUs can be grouped together (statically at chip design time, or dynamically at execution time as in the dynamic case) in order to build cores with improved capabilities. In Ref. [2], the performance of a core built from n BCUs was approximated using a theoretical function perf(n), expressed as the square root of n. This is clearly just an approximation, meant to express the diminishing returns from adding more transistors (BCUs in our terminology) to a given core design.
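Expressed as code, the model is easy to play with. The Python sketch below is my own rendering of the three speedup formulas of Ref. [2] (the function and variable names are mine, and perf(n) uses the square-root approximation described above), so treat it as an illustration of the model rather than a definitive implementation:

```python
import math

def perf(b):
    """Relative performance of a core built from b BCUs (square-root approximation)."""
    return math.sqrt(b)

def speedup_symmetric(p, n, b):
    """n BCUs grouped into n/b identical cores of b BCUs each; p is the parallel fraction."""
    return 1.0 / ((1.0 - p) / perf(b) + p * b / (perf(b) * n))

def speedup_asymmetric(p, n, b):
    """One big core of b BCUs plus (n - b) simple one-BCU cores; the big core also joins the parallel phase."""
    return 1.0 / ((1.0 - p) / perf(b) + p / (perf(b) + n - b))

def speedup_dynamic(p, n, b):
    """Sequential phase runs on a core fused from b BCUs; parallel phase uses all n BCUs."""
    return 1.0 / ((1.0 - p) / perf(b) + p / n)

# Sanity check: with one-BCU cores the symmetric case reduces to plain Amdahl.
print(speedup_symmetric(0.95, 64, 1))   # ~15.4, same as Amdahl's law with P=0.95, N=64
```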

In the symmetric multi-core processor case, the results were equivalent to the original scenario outlined by Amdahl’s law: essentially, using homogeneous architectures, for the canonical type of applications, the speedup will be limited by the sequential fraction. The more interesting results were obtained for the other two cases. In the single-ISA asymmetric multi-core processor case, the number of cores is reduced in order to introduce one more powerful core alongside the remaining set of simpler cores.

Ref. [2] shows the speedup curves for programs with different degrees of parallelism on a chip with 64 BCUs, organized into different asymmetric configurations. There are two conclusions that may be drawn from this scenario:

• The speedup is higher than what Amdahl’s classical law predicts: the availability of the more complex (faster) core makes it possible to run the sequential part of the program faster.
• There is a sweet spot for each level of parallelism, beyond which performance will decline; for example, for the 95% parallel type of application, this sweet spot is reached with 48 equal cores plus one core running at four times higher speed.

The implication of this result is that asymmetric chip designs can help mitigate the impact of Amdahl’s law; the challenge, however, is that different applications may have their sweet spots at different configurations (a simple sweep over possible configurations, sketched below, illustrates this). This is what makes the third case, the fully dynamic scenario, really interesting. In this scenario the hardware can either act as a large pool of simple cores, when the parallel part is executed, or as a single fast core whose performance scales as a function of the number of simple cores it replaces.
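To see the sweet spot mentioned above, one can simply sweep the size of the big core on a 64-BCU chip for a 95% parallel workload, reusing the asymmetric formula from the earlier sketch (repeated here so the snippet runs on its own). The exact optimum depends on the perf(n) approximation, so the numbers are illustrative only:

```python
import math

def perf(b):                        # same square-root approximation as above
    return math.sqrt(b)

def speedup_asymmetric(p, n, b):    # one b-BCU core plus (n - b) simple cores
    return 1.0 / ((1.0 - p) / perf(b) + p / (perf(b) + n - b))

n_bcu, p = 64, 0.95                 # 64-BCU chip, 95% parallel workload
for b in (1, 4, 16, 32, 64):
    print(f"big core: {b:2d} BCUs (perf {perf(b):.1f}x), "
          f"{n_bcu - b:2d} simple cores -> speedup {speedup_asymmetric(p, n_bcu, b):.1f}")
```

Among these configurations the speedup peaks around the 16-BCU big core (48 simple cores plus one roughly 4x faster core) and then declines again, which is the sweet-spot behaviour described above.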

Such a design would, in theory, soften the limitations set by Amdahl’s law. However, we don’t yet know how to build such a chip, so at first sight this application of Amdahl’s law seems to be a mere theoretical exercise. There are, however, two techniques that could, in theory, make N cores behave as one single, more powerful core. The first, albeit with a limited scope of applicability, is dynamic voltage and frequency scaling, applied to one core while the others are switched off (or kept in a very low power mode); the second, with theoretically better scalability, is speculative, run-ahead execution.

Run-ahead speculative execution aims at speeding up the execution of single-threaded applications by speculatively executing the most promising branches of execution in advance, on separate cores; if the speculation is successful, the result is earlier completion of the sequential portions and thus a speedup of execution. The grand challenge of speculative execution is the accuracy of prediction: too many mis-predictions decrease the ratio of useful work per unit of power consumed, making the approach less appealing while delivering only a limited amount of speedup. On the other hand, it does scale with the number of cores, as more cores increase the chance of speculating on the right branch of execution. The key to efficient speculation is to dramatically strengthen the semantic link between the execution environment and the software it is executing.

Amdahl’s law, along with Gustafson’s law that will be introduced in another post, is still at the foundation of how we reason about computer architecture and parallel programming in general, and about the architecture and programming of many-core chips in particular. At its core, it defines the key problem that needs to be addressed in order to leverage the increased computational power of many-core chips: the portions of the program that are designed to execute within one single thread.

References:

  1. Amdahl G (1967) Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. American Federation of Information Processing Societies (AFIPS) Conference Proceedings 30:483-485
  2. Hill M D, Marty M R (2008) Amdahl’s Law in the Multicore Era. IEEE Computer
  3. Vajda A (2011) Programming Many-Core Chips. Springer, ISBN 978-1441997388

Never stop reading this .. Linux is obsolete – Andrew S. Tanenbaum and Linus Torvalds Debate

From the kernel architecture perspective there are two main categories of operating systems, with a number of variations in between. Micro-kernel operating systems are characterized by running most of their services in user mode, as user processes, and keeping only the very basic scheduling and hardware management mechanisms in the protected kernel space. Monolithic kernels, on the other hand, are characterized by incorporating most of the operating system services into the kernel space, sharing the same memory space.

The fundamental principle micro-kernel designs aim to follow is the separation of mechanism and policy. In a micro-kernel, the kernel’s role is to provide the minimal support for executing processes, inter-process communication and hardware management. All the other services (and indeed, policies) are then implemented as servers in user space, communicating with applications and with other components through inter-process communication mechanisms, usually messages. Micro-kernel based operating systems are usually characterized by strong modularity, a small kernel footprint and increased security and robustness, as a consequence of the strong isolation of the components and the execution of most services in user space. On the other hand, micro-kernels traditionally incur a higher overhead for accessing operating system services, as there will be more context switches between user mode and kernel mode.

Monolithic kernels excel primarily at speed, as all the services of the operating system execute in kernel mode and hence can share memory and perform direct function calls, without the need for inter-process communication mechanisms such as message passing. As most of the OS functions are packed together, monolithic kernels tend to be bigger, more complex and hence more difficult to test and maintain, also requiring a careful, holistic approach to the overall design. There have been several long debates around these two approaches, most famously between Linus Torvalds, the creator and gatekeeper of Linux (on the monolithic side), and Andrew Tanenbaum, a respected professor and author (advocate of micro-kernel architectures), dating back to 1992 with a revived exchange in 2006. The essence of the debate revolves around the maintainability, security, efficiency and complexity of operating systems, with valid arguments brought forward by both camps. The argument for micro-kernels is primarily based on the emphasis on reliability and security, supported by as little data sharing as possible and strict decomposition and isolation of operating system components. The counter-argument brought forward by Torvalds builds on the fact that algorithm design for distributed, share-nothing systems is inherently more complex, and hence micro-kernels, with their emphasis on isolation, would suffer on the maintainability and performance fronts.
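To make the performance argument tangible, here is a toy Python sketch (entirely made up for illustration, with nothing borrowed from any real kernel) that contrasts the two call paths: a service invoked as a direct function call in a shared address space, versus the same service implemented as a separate "server" process reached through message passing. The only point is to show where the extra round-trip cost of the micro-kernel style comes from:

```python
import multiprocessing as mp
import time

def file_server(requests, replies):
    """User-space server in the micro-kernel style: serves requests that arrive as messages."""
    for name in iter(requests.get, None):      # None is the shutdown message
        replies.put(f"contents of {name}")

def read_file_direct(name):
    """Monolithic style: the service is just a function call in the same address space."""
    return f"contents of {name}"

if __name__ == "__main__":
    requests, replies = mp.Queue(), mp.Queue()
    server = mp.Process(target=file_server, args=(requests, replies))
    server.start()

    t0 = time.perf_counter()
    for _ in range(1000):
        read_file_direct("passwd")             # direct call, no boundary crossing
    t1 = time.perf_counter()
    for _ in range(1000):
        requests.put("passwd")                 # message to the user-space server...
        replies.get()                          # ...and wait for its reply
    t2 = time.perf_counter()

    print(f"direct calls:        {t1 - t0:.4f} s")
    print(f"message round-trips: {t2 - t1:.4f} s")

    requests.put(None)                         # shut the server down
    server.join()
```

The absolute numbers mean nothing here; what matters is that every request in the second loop crosses process boundaries twice, which is the structural overhead the paragraph above refers to.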

If you are a Linux enthusiast but have never heard of this debate, then I think you have missed something really interesting. The basis of this debate was the criticism voiced by Andrew S. Tanenbaum (an American computer scientist and professor of computer science at the Vrije Universiteit in Amsterdam, the Netherlands, best known as the author of MINIX, a free Unix-like operating system for teaching purposes) of Linux's portability and kernel architecture in general. The debate started in 1992 on the Usenet discussion group comp.os.minix. It was a heated debate, joined by Linus Torvalds (the creator of Linux) himself and many other hackers/developers.
Here are some excerpts from that discussion (from Google Groups):

Tanenbaum :
I was in the U.S. for a couple of weeks, so I haven’t commented much on LINUX (not that I would have said much had I been around), but for what it is worth, I have a couple of comments now.

As most of you know, for me MINIX is a hobby, something that I do in the evening when I get bored writing books and there are no major wars, revolutions, or senate hearings being televised live on CNN.  My real job is a professor and researcher in the area of operating systems.

Linus :
You use this as an excuse for the limitations of minix? Sorry, but you loose: I’ve got more excuses than you have, and Linux still beats the pants of minix in almost all areas.  Not to mention the fact that most of the good code for PC minix seems to have been written by Bruce Evans.

Re 1: you doing minix as a hobby – look at who makes money off minix, and who gives Linux out for free.  Then talk about hobbies.  Make minix freely available, and one of my biggest gripes with it will disappear.  Linux has very much been a hobby (but a serious one: the best type) for me: I get no money for it, and it’s not even part of any of my studies in the university.  I’ve done it all on my own time, and on my own machine.

Re 2: your job is being a professor and researcher: That’s one hell of a good excuse for some of the brain-damages of minix. I can only hope (and assume) that Amoeba doesn’t suck like minix does.

 

Tanenbaum :

I think it is a gross error to design an OS for any specific architecture, since that is not going to be around all that long.

Linus :

“Portability is for people who cannot write new programs”
-me, right now (with tongue in cheek)

This war of words was re-ignited in 2006, after Tanenbaum wrote a cover story for Computer magazine titled “Can We Make Operating Systems Reliable and Secure?”. Here is what Wikipedia reports about it:

This subject was revisited in 2006 after Tanenbaum wrote a cover story for Computer magazine titled “Can We Make Operating Systems Reliable and Secure?”.[3] While Tanenbaum himself has mentioned that he did not write the article to renew the debate on kernel design,[4] the juxtaposition of the article and an archived copy of the 1992 debate on the technology site Slashdot caused the subject to be rekindled.[5] Torvalds posted a rebuttal of Tanenbaum’s arguments via an online discussion forum,[6] and several technology news sites began reporting the issue.[7] This prompted Jonathan Shapiro to respond that most of the field-proven reliable and secure computer systems use a more microkernel-like approach.

Here are some of the important links that would make this whole debate an interesting read: