We have come to the end of the road for clock-speed improvements. As a corollary to Moore's Law, the number of cores per CPU is expected to double every 18 months. Extrapolating this trend (observed since 2005), 1000-core CPUs would become commonplace before 2020. Recently,
several research projects have started prototyping or implementing 1000-core chips, including Intel's 80-core Teraflops Research Chip (also called Polaris) and 48-core Single-chip Cloud Computer (SCC), CAS's 64-core Godson-T, Tilera's 100-core Tile-Gx, MIT's 1000-core ATAC chip, and the recent Xilinx FPGA-based 1000-core prototype from the University of Glasgow. With rising core counts, chip multiprocessors (CMPs) following the traditional bus-based cache-coherent architecture will fail to sustain scalability in power and memory latency. Targeting the kilo-core scale, on-chip computing has undergone a paradigm shift towards the tile-based (or tiled) architecture. A tile is a block comprising one or more compute cores, a router and, optionally, some programmable on-chip memory for ultra-low-latency inter-core communication. Instead of buses or crossbars, a network-on-chip (NoC), commonly a 2D mesh, is used to interconnect the tiles. To avoid the "coherency wall", which emerges as an eventual scalability barrier, some projects, notably Intel's SCC and Polaris, do away with coherent caches and instead promote software-managed coherence via on-chip inter-core message passing. Eliminating hardware coherence may also lead to more energy-efficient and flexible computing.
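To make software-managed coherence concrete, the following minimal Python sketch (an illustrative simulation under our own assumptions, not any vendor's actual API) models two cores with private, non-coherent caches over a shared memory; coherence is restored only when the writer explicitly sends an invalidation message that the reader applies itself:

```python
from queue import Queue

DRAM = {"x": 0}            # shared off-chip memory (simulated)

class Core:
    """A core with a private, non-coherent cache and a message mailbox."""
    def __init__(self):
        self.cache = {}            # no hardware keeps this coherent
        self.mailbox = Queue()     # on-chip inter-core message channel

    def read(self, addr):
        if addr not in self.cache:         # miss: fetch from DRAM
            self.cache[addr] = DRAM[addr]
        return self.cache[addr]

    def write_through(self, addr, val):    # write-through for simplicity
        self.cache[addr] = val
        DRAM[addr] = val

    def handle_messages(self):
        while not self.mailbox.empty():
            op, addr = self.mailbox.get()
            if op == "invalidate":         # self-invalidate on request
                self.cache.pop(addr, None)

producer, consumer = Core(), Core()
consumer.read("x")                  # consumer caches x == 0
producer.write_through("x", 42)     # update reaches DRAM only
stale = consumer.read("x")          # still 0: no hardware coherence
consumer.mailbox.put(("invalidate", "x"))  # software-managed coherence
consumer.handle_messages()
fresh = consumer.read("x")          # refetched from DRAM: 42
```

The stale read (0) followed by the fresh read (42) illustrates why, once hardware coherence is removed, every cross-core update must be accompanied by an explicit message.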
Previous work has shown that only about 10% of an application's memory references actually require cache-coherence tracking. Most shared data in applications is read-only shared and only a small fraction is read-write shared; hardware coherence is thus overkill and also wastes energy (possibly up to 40% of the total cache power). Scaling up to 1,000 cores hits another barrier: the "memory wall". Over the past 40 years, memory density has doubled nearly every two years, but memory performance has improved only slowly; a DRAM access now costs hundreds of CPU clock cycles. Besides the growing speed disparity between the CPU and off-chip memory, the current memory architecture scales poorly even to 100 cores, since CMPs are critically constrained by off-chip memory bandwidth. Only a limited number of DRAM controllers (e.g. four in the SCC) are connected to the edges of the 2D mesh, and this number will not scale with increasing core density owing to the significant design impediments on package pin density and pin bandwidth to memory devices. The reality of as many as 1,000 cores sharing a few memory controllers raises an important question: how to spread the processors' memory traffic uniformly across all the available memory ports. To mitigate the external DRAM bandwidth bottleneck, one solution is to increase the amount of on-chip cache per core so as to reduce the bandwidth demand imposed on off-chip DRAM. However, this also reduces the number of cores that can be integrated on a die of fixed area.
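One common way to spread traffic uniformly over a few memory ports is fine-grained address interleaving across the controllers. The sketch below is illustrative only; the controller count (four, matching the SCC) and the 64-byte line size are assumptions, and real chips fix the mapping in hardware:

```python
NUM_CONTROLLERS = 4   # assumption: four memory controllers, as in the SCC
LINE_BYTES = 64       # assumption: 64-byte cache lines

def controller_for(paddr: int) -> int:
    """Cache-line interleaving: the address bits just above the line
    offset select the controller, so consecutive lines go round-robin
    across controllers and any dense access stream is spread evenly."""
    return (paddr // LINE_BYTES) % NUM_CONTROLLERS

# Six consecutive cache lines map to controllers 0, 1, 2, 3, 0, 1:
mapping = [controller_for(line * LINE_BYTES) for line in range(6)]
```

Coarser interleaving (e.g. per page instead of per line) trades this uniformity for locality at a single controller; the choice of granularity is itself a design parameter.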
Recent 3D stacked-memory techniques can be employed (e.g. in Polaris) to alleviate such planar layout issues by attaching a memory controller to each router in the NoC. 3D stacking, however, makes it difficult to cool systems effectively with conventional heat sinks and fans. OS support for many-core chips also needs a radical rethink. Today's OSes with symmetric multiprocessor (SMP) support have been adapted to work on CMPs but cannot scale to high core counts (e.g. the Linux 2.6 kernel's physical page allocator does not scale beyond 8 cores under heavy load). Such SMP-based OSes also rely heavily on hardware cache coherence for efficient access to kernel-space data structures and locks, support that will be absent in future non-cache-coherent kilo-core CMPs. In light of these problems, the design of next-generation OSes for CMPs is moving to the multikernel paradigm (evolved from the microkernel design): the CMP is treated as a network of independent cores that communicate not through shared memory but through explicit message passing. Examples are the Microsoft/ETH Zurich Barrelfish, MIT's fos and Berkeley's Tessellation, which have been designed specifically to address many limitations of current OSes as we move into the many-core era.
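The multikernel idea can be sketched as follows: a toy Python model in the spirit of Barrelfish (not its actual interface; all names here are hypothetical). Each core runs its own kernel instance holding a private replica of OS state; an update is broadcast as a message that every replica applies locally, so no kernel data structure is ever shared or locked across cores:

```python
from queue import Queue

class KernelInstance:
    """One per-core kernel with a private replica of OS state."""
    def __init__(self):
        self.page_table = {}   # private replica, never shared or locked
        self.inbox = Queue()   # messages are the only way state changes

    def apply_pending(self):
        while not self.inbox.empty():
            op, vpage, frame = self.inbox.get()
            if op == "map":                  # apply the update locally
                self.page_table[vpage] = frame

kernels = [KernelInstance() for _ in range(4)]  # one kernel per core

def broadcast(msg):
    """Replica maintenance by message passing, not shared memory."""
    for k in kernels:
        k.inbox.put(msg)

broadcast(("map", 0x1000, 7))   # core 0 proposes a new page mapping
for k in kernels:
    k.apply_pending()           # each replica converges independently
```

After the messages are applied, every kernel instance agrees on the mapping without any cross-core lock, which is exactly the property that survives on a non-cache-coherent chip.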
All the above changes carry several implications, both opportunities and challenges, for the upper software layers: (1) programming paradigms designed for distributed systems, such as the message passing interface (MPI) and software distributed shared memory (SDSM), also known as shared virtual memory (SVM), become useful for many-core systems; (2) they nevertheless need remodeling, because the system bottleneck now lies in the off-chip memory rather than the network. It is therefore vital to trim the slow off-chip accesses and to exploit the fast but small on-chip memory effectively.