Kernel 3.2+: Hide Processes From Other Normal Users!

In a multi-user system it has always been possible for any user to list all running processes on the system, whether or not those processes belong to that user.

With Linux kernel 3.2+ (RHEL/CentOS 6.5+) there is a new feature that gives root full control over this: root can still list all running processes, while ordinary users can only list their own.

New mount option: hidepid

The new option controls how much information about processes is exposed to non-owners. Its value selects one of the following modes:

hidepid=0 – The old behavior – anybody may read all world-readable /proc/PID/* files (default).

hidepid=1 – Users may not access any /proc/PID/ directories but their own. Sensitive files like cmdline, sched*, status are now protected against other users.

hidepid=2 – Same as hidepid=1, plus all /proc/PID/ directories are invisible to other users. This complicates an intruder’s task of gathering info about running processes: whether some daemon runs with elevated privileges, whether another user runs some sensitive program, whether other users run any program at all, etc.

This can be done from the command line as follows:

# mount -o remount,rw,hidepid=2 /proc

Or by updating /etc/fstab

# vi /etc/fstab

proc    /proc    proc    defaults,hidepid=2     0     0

After editing, run # mount -a to re-read /etc/fstab and apply the change.
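You can verify the effect by comparing the view as root with the view as an unprivileged user; a minimal sketch of what to expect after remounting with hidepid=2 (not captured from a real session):

# ps aux | wc -l        # as root: all processes are still listed

$ ps aux | wc -l        # as a normal user: only that user's own processes are counted

$ ls /proc/1            # as a normal user: PID 1 is simply not visible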

References:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0499680a42141d86417a8fbaa8c8db806bea1201

Bringing Unix Philosophy to Big Data

The Unix philosophy fundamentally changed the way we think of computing systems: instead of a sealed monolith, the system became a collection of small, easily understood programs that could be quickly connected in novel and ad hoc ways. Today, big data looks much like the operating systems landscape in the pre-Unix 1960s: complicated frameworks surrounded by a priesthood that must manage and protect a fragile system.

In one of the best Big Data talks, Bryan Cantrill describes and demonstrates Manta, a new object store featuring in situ compute that brings the Unix philosophy to big data, allowing tools like grep, awk and sed to be used in map-reduce fashion on arbitrary amounts of data. He also covers the design challenges in building Manta (a system built largely in Node.js).
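As a rough illustration of the idea, here is the same map-reduce pattern expressed with plain local Unix tools (this is not Manta's actual job API, and the *.log file names are just placeholders): the "map" phase counts matching lines per file, the "reduce" phase aggregates the partial results.

# "map": count ERROR lines in each log file, one count per line
$ for f in *.log; do grep -c 'ERROR' "$f"; done > partial_counts.txt

# "reduce": sum the per-file counts
$ awk '{ total += $1 } END { print total }' partial_counts.txt

Manta applies the same pattern at scale by running the map commands in situ, next to each stored object, and piping their outputs into the reduce phase.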

Fundamental Laws of Parallel Computing .. theoretical post but important to understand!!

Amdahl’s and Gustafson’s laws, as well as the equivalence between the two, have influenced the research and practice of parallel computing during the past decades. How Amdahl’s law can be applied to multi-core chips, and what implications it has for architecture and programming model research, is what I try to explain in this article!

Amdahl’s Law: Amdahl’s law is probably the best known and most widely cited law defining the limits of how much parallel speedup can theoretically be achieved for a given application. It was formulated by Amdahl in 1967 [1] as a supporting argument for continuing to scale the performance of a single core rather than aiming for massively parallel systems.

In its traditional form it provides an upper bound for the potential speedup of a given application, as a function of the size of the sequential fraction of that application. The mathematical form of the law is:

Speedup = 1 / ((1 − P) + P/N)

where P is the fraction of the application that is parallelized and N is the number of available cores. It is immediately clear that, according to Amdahl’s law, any application with a sequential fraction has an upper bound on how fast it can run, independent of the number of cores available for its execution.

When N approaches infinity, this upper bound becomes:

Speedup = 1 / (1 − P)

[Figure: speedup curves for various levels of parallelism]
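To get a feel for how harsh this bound is: with P = 0.95 (95% of the work parallelizable), the speedup can never exceed 1/(1 − 0.95) = 20, no matter how many cores are added, and with P = 0.5 the ceiling is just 2. Even at N = 64 cores, P = 0.95 gives only 1/(0.05 + 0.95/64) ≈ 15.4, already close to the asymptotic limit.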

The implications of Amdahl’s law were profound. Essentially it predicted that the focus should be on making single cores run faster (something that was within reach for the past four decades) instead of the costlier approach of parallelizing existing software, which would in any case have limited scalability, in accordance with Amdahl’s law, as long as some part of the software remained sequential. This law also triggered groundbreaking research that resulted in innovations such as out-of-order execution, speculative execution, pipelining, dynamic voltage and frequency scaling and, more recently, embedded DRAM. All these techniques are primarily geared towards making single-threaded applications run faster and, consequently, pushing back the hard limit set by Amdahl’s law.

As multi-core chips were becoming mainstream, Amdahl’s law had the clear merit of putting the sequential applications—or sequential portions of otherwise parallelized applications—into the focus. A large amount of research is dealing with auto-parallelizing existing applications; the research into asymmetric multi-core architectures is also driven by the need to cater for both highly parallelized applications and applications with large sequential portions.

However, Amdahl’s law has a shortcoming: it assumes that the sequential fraction of an application stays constant, no matter how many cores can be used for that application. This is clearly not always the case: more cores may mean more data parallelism that dilutes the significance of the sequential portion; at the same time, an abundance of cores may enable speculative and run-ahead execution of the sequential part, resulting in a speedup without actually turning the sequential code into parallel code. The first observation later led to Gustafson’s law, while the second led to a new way of using many-core chips.

Amdahl’s Law for Many-core Chips: 

The applicability of Amdahl’s law to many-core chips was first explored, at a theoretical level, in Ref. [2]. It considered three scenarios and analyzed the speedup that can be obtained, under the same assumptions as in the original form of Amdahl’s law, for three different types of multi-core chip architectures. The three scenarios were:

• Symmetric multi-core: all cores have equal capabilities
• Asymmetric multi-core: the chip is organized as one powerful core along with several simpler cores, all sharing the same ISA
• Dynamic multi-core: the chip may act as one single core with increased performance or as a symmetric multi-core processor; this is a theoretical case only so far, as no such chip has been designed

In order to evaluate the three scenarios, a simple model of the hardware is needed. Ref. [2] assumes that a chip has a certain amount of resources, expressed through an abstract quantity called the base processing unit (BCU). The simplest core that can be built requires at least one BCU, and speedup is reported relative to the execution speed on a core built with one BCU; multiple BCUs can be grouped together (statically at chip design time, or dynamically at execution time as in the dynamic case) in order to build cores with improved capabilities. In Ref. [2], the performance of a core built from n BCUs was approximated using a theoretical function perf(n), expressed as the square root of n. This is clearly just an approximation to express the diminishing returns from adding more transistors (BCUs in this terminology) to a given core design.
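For reference, and using the notation above with P as the parallel fraction, N as the total number of BCUs on the chip and r as the number of BCUs fused into the larger core (so perf(r) = sqrt(r)), the speedup models of Ref. [2] can be written roughly as:

Symmetric (N/r identical cores of r BCUs each):    Speedup = 1 / ((1 − P)/perf(r) + (P · r)/(perf(r) · N))

Asymmetric (one core of r BCUs plus N − r single-BCU cores):    Speedup = 1 / ((1 − P)/perf(r) + P/(perf(r) + N − r))

Dynamic (r BCUs fused for the sequential part, all N BCUs used for the parallel part):    Speedup = 1 / ((1 − P)/perf(r) + P/N)

Setting r = 1 in the symmetric case recovers the classical Amdahl formula given earlier.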

In the symmetric multi-core processor case, the results were equivalent to the original scenario outlined by Amdahl’s law: essentially, using homogeneous architectures, for the canonical type of applications, the speedup will be limited by the sequential fraction. The more interesting results were obtained for the other two cases. In the single-ISA asymmetric multi-core processor case, the number of cores is reduced in order to introduce one more powerful core alongside a larger set of simpler cores.

Consider the speedup curves for programs with different degrees of parallelism on a chip with 64 BCUs, organized into different asymmetric configurations. Two conclusions may be drawn from this scenario:

• The speedup is higher than what Amdahl’s classical law predicts: the availability of the more complex (faster) core makes it possible to run the sequential part of the program faster.
• There’s a sweet spot for each level of parallelism beyond which the performance will decline; for example, for the 95% parallel type of application this sweet spot is reached with 48 equal cores and one core running at four times higher speed (see the worked numbers right after this list).
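To make the first point concrete with back-of-the-envelope numbers (using the perf(r) = sqrt(r) approximation and the asymmetric model from Ref. [2]): with N = 64 BCUs and r = 16 BCUs fused into the fast core, perf(16) = 4 and 48 single-BCU cores remain, so for a 95% parallel application

Speedup ≈ 1 / (0.05/4 + 0.95/(4 + 64 − 16)) = 1 / (0.0125 + 0.0183) ≈ 32.5

versus roughly 1 / (0.05 + 0.95/64) ≈ 15.4 for 64 equal single-BCU cores; the asymmetric layout roughly doubles the achievable speedup for this workload.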

The implication is that asymmetric chip designs can help mitigate the impact of Amdahl’s law; the challenge, however, is that different applications may have their sweet spots at different configurations. This is what makes the third case, the fully dynamic scenario, really interesting. In this scenario the hardware can either act as a large pool of simple cores (when the parallel part is executed) or as a single, fast core whose performance scales as a function of the number of simple cores it replaces.

Such a design could, in principle, push back the limitations set by Amdahl’s law. However, we don’t yet know how to build such a chip, so at first sight this application of Amdahl’s law seems to be a mere theoretical exercise. There are, however, two techniques that could, in theory, make N cores behave as one single powerful core. The first, albeit with a limited scope of applicability, is dynamic voltage and frequency scaling applied to one core while the others are switched off (or in a very low power mode); the second, with theoretically better scalability, is speculative, run-ahead execution.

The run-ahead speculative execution aims at speeding up execution of single threaded applications, by speculatively executing in advance the most promising branches of execution on separate cores; if the speculation is successful, the result is earlier completion of sequential portions and thus a speedup of execution. The grand challenge of speculative execution is the accuracy of prediction: too many mis-predictions decrease the ratio of useful work per power consumed, making it less appealing while delivering a limited amount of speedup. On the other hand, it does scale with the number of cores, as more cores increase the possibility of speculating on the right branch of execution. The key to efficient speculation is to dramatically strengthen the semantic link between the execution environment and the software it is executing.

Amdahl’s law, along with Gustafson’s law that will be introduced in another post, is still at the foundation of how we reason about computer architecture and parallel programming in general, and the architecture and programming of many-core chips in particular. At its core, it defines the key problem that needs to be addressed in order to leverage the increased computational power of many-core chips: the portions of the program that are designed to execute within one single thread.

References:

  1. Amdahl G (1967) Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. American Federation of Information Processing Societies (AFIPS) Conference Proceedings 30:483-485
  2. Hill M D, Marty M R (2008) Amdahl’s Law in the Multi-core Era. IEEE Computer
  3. Vajda A (2011) Programming Many-Core Chips. Springer, 237 p, hardcover. ISBN-10: 1441997385, ISBN-13: 978-1441997388

Why do 1000+ core CPUs matter these days?!

We have come to the end of the road for clock speed improvements. As a corollary to Moore’s Law, CPUs will double their number of cores roughly every 18 months. Based on this trend (observed since 2005), 1000-core CPUs would become commonplace before 2020. Recently, several research projects have started prototyping or implementing 1000-core chips, including Intel’s 80-core Teraflops Research Chip (also called Polaris) and 48-core Single-chip Cloud Computer (SCC), CAS’s 64-core Godson-T, Tilera’s 100-core Tile-Gx, MIT’s 1000-core ATAC chip, and the recent Xilinx FPGA-based 1000-core prototype from the University of Glasgow. With rising core counts, chip multiprocessors (CMPs) following the traditional bus-based cache-coherent architecture will fail to sustain scalability in power and memory latency. Targeting the kilo-core scale, a paradigm shift in on-chip computing towards the tile-based (tiled) architecture has taken place. A tile is a block comprising one or more compute cores, a router, and optionally some programmable on-chip memory for ultra-low-latency inter-core communication. Instead of buses or crossbars, a network-on-chip (NoC), commonly a 2D mesh, is used to interconnect the tiles. To avoid the “coherency wall”, which emerges as an eventual scalability barrier, some projects, notably Intel’s SCC and Polaris, do away with coherent caches and promote software-managed coherence via on-chip inter-core message passing instead. Eliminating hardware coherence may also lead to more energy-efficient and flexible computing.

Previous work has shown that only about 10% of application memory references actually require cache coherence tracking. Applications tend to have most data read-only shared and only a little read-write shared; hardware coherence can thus be overkill and can also waste energy (possibly up to 40% of the total cache power). Scaling up to 1,000 cores hits another barrier: the “memory wall”. Over the past 40 years, memory density has doubled nearly every two years, but performance has improved only slowly (a DRAM access still costs hundreds of CPU clock cycles). Besides the growing speed disparity between the CPU and off-chip memory, the current memory architecture scales poorly even for 100 cores, since CMPs are critically constrained by off-chip memory bandwidth. Only a limited number of DRAM controllers (e.g. four in the SCC) are connected to the edges of the 2D mesh, and this will not scale with increasing core density because of significant constraints on package pin density and pin bandwidth to memory devices. The reality of as many as 1,000 cores sharing a few memory controllers raises an important issue: how to uniformly spread the processors’ memory traffic across all the available memory ports. To mitigate the external DRAM bandwidth bottleneck, one solution is to increase the amount of on-chip cache per core so as to reduce the bandwidth demand imposed on off-chip DRAM. However, this also reduces the number of cores that can be integrated on a die of fixed area.

Recent 3D stacked memory techniques can be employed (e.g. in Polaris) to alleviate such planar layout issues by attaching a memory controller to each router in the NoC. 3D stacking, however, makes it difficult to cool systems effectively through conventional heat sinks and fans. OS support for many-core chips also needs a radical rethink. Today’s OSes with symmetric multiprocessor (SMP) support have been adapted to work on CMPs but cannot scale to high core counts (e.g. the Linux 2.6 kernel’s physical page allocator does not scale beyond 8 cores under heavy load). Such SMP-based OSes also rely heavily on hardware cache coherence for efficient communication through kernel-space data structures and locks, which will no longer be available in future non-cache-coherent kilo-core CMPs. In light of these problems, the design of next-generation OSes for CMPs is moving to the multikernel paradigm (evolved from microkernel design): the CMP is treated as a network of independent cores that do not internally rely on shared memory but on explicit, message-based communication. Examples are the Microsoft/ETH Zurich Barrelfish, MIT’s fos and Berkeley’s Tessellation, which have been designed specifically to address many limitations of current OSes as we move into the many-core era.

All the above changes pose several implications, both opportunities and challenges, for the upper software layers: (1) programming paradigms designed for distributed systems, such as the message passing interface (MPI) and software distributed shared memory (SDSM), a.k.a. shared virtual memory (SVM), become useful for many-core systems; (2) they need remodeling, however, as the system bottleneck now lies in off-chip memory instead of the network, so it is vital to trim the slow off-chip accesses and exploit the fast but small on-chip memory effectively.

Building gem5 from scratch on Ubuntu 12.04 LTS

This assumes you have a fresh Ubuntu 12.04 install and that you already understand what the build targets gem5.opt, gem5.debug, gem5.fast, etc. are; if not, you can visit the official gem5 documentation at http://m5sim.org/Documentation

$ sudo apt-get update; sudo apt-get upgrade

# installing the needed packages for Gem5

$ sudo apt-get install mercurial scons swig gcc m4 python python-dev libgoogle-perftools-dev g++  build-essential

# You can download the .tar.gz file and extract it, but I prefer cloning from the repo just in case you need to run $ hg pull later to update the code or to commit

$ hg clone http://repo.gem5.org/gem5   # this is a developer version

$ cd gem5/

# gem5 uses the scons build system instead of make

$ scons build/ARM/gem5.opt
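# The build can take quite a while; scons accepts the usual -j flag to run build jobs in
# parallel, so on a multi-core machine something like the following helps (the job count
# of 4 is just an example, match it to your core count)
$ scons build/ARM/gem5.opt -j 4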

## To test syscall emulation (SE) mode you can run one of the test programs shipped with gem5
$ ./build/ARM/gem5.opt configs/example/se.py -c tests/test-progs/hello/bin/arm/linux/hello

# You can check out a live demo by Bayn in this YouTube video