Since the first details about the MIC architecture emerged, Intel has continually returned to its vision of offering a high degree of parallelism inside a power-efficient package that remains programmable.
With the next-generation Xeon Phi due to hit the market in the coming years, bringing a (still unstated) high number of cores, on-package memory, and the ability to shift from coprocessor to processor along the x86 continuum, many are wondering what kind of programmatic muscle will be needed to spring from Knights Corner to Knights Landing.
As Intel turns its focus on the Xeon front to doubling FLOPS, boosting memory bandwidth and stitching in I/O, Intel’s technical computing lead Raj Hazra says the long-term goal is to make the full transition from multi-core to manycore via the Knights-codenamed family.
One can look at Knights Landing as simply a new Xeon with higher core counts, since at least some of the complexities of using it as a coprocessor will no longer be an issue. Unlike with the current Xeon Phi, transfers across PCIe are eliminated and memory is local. Knights Landing acts as a true processor that carries over the parallelism and efficiency of Phi in a processor form factor, while still offering the option to use it as a coprocessor for specific highly parallel parts of a given workload. In theory, this should make programming for one of these essentially the same as programming for Xeon.
Despite the emphasis on extending programmability, make no mistake: parallel programming is not suddenly going to become magically simple, and that is certainly still not the case for using coprocessors, says James Reinders, Intel’s software director. However, there are some notable features that will make the transition more seamless.
When it comes to using Knights Landing as a coprocessor, the real differences between Knights Corner and Knights Landing will become more apparent. As it stands now, many programmers using accelerators or coprocessors rely on offload models for mixed (serial and highly parallel) code: they write their programs to run on the processor, with certain highly parallel sections offloaded. The advantage is that the serial sections get the power of the host processors, which are much better at serial tasks than accelerators or Phi. Of course, programmers are keenly aware of Amdahl’s Law and are looking to OpenACC and OpenMP directives to address some of these problems with offloading, problems that Intel is addressing by nixing the offloading middleman.
As Reinders described, “One of the big things about Knights Landing in this regard is that to make it a processor we had to reduce the effects of Amdahl’s Law. Making Knights Landing a processor means we wanted to build a system around it where the program runs on it but it ‘offloads itself’ in a sense. There’s no such thing as offloading to yourself; you just switch between being somewhat serial to highly parallel, just like you do in a program you write for a processor today. However, Knights Landing is more capable of handling highly parallel workloads than any other processor today.”
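For reference, Amdahl’s Law can be written as a one-line formula: if a fraction p of a program’s runtime parallelizes perfectly across n cores, the ideal speedup is 1 / ((1 - p) + p / n). The following is a minimal Python sketch for illustration only. The function name is hypothetical, the 61-core figure is used purely as an example based on the current Xeon Phi, and the model ignores offload transfer costs and every other real-world overhead.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's Law.

    parallel_fraction: share of the original runtime that scales across cores (0..1).
    cores: number of cores the parallel share is spread over (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial share runs at the original speed; the parallel share is divided by the core count.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # 61 cores is an illustrative assumption, not a Knights Landing specification.
    for p in (0.50, 0.90, 0.99):
        print(f"parallel fraction {p:.2f}: ideal speedup {amdahl_speedup(p, 61):.1f}x")
```

Even with 99 percent of the runtime parallelized, the serial remainder keeps the ideal speedup well below the core count, which is why serial performance still matters on a manycore part.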
The other way to program for Knights Landing (or its predecessor, for that matter) is to treat it as a processor hooked together with other Xeons or Phis using MPI. Knights Landing will support that model as either a processor or a coprocessor, Reinders said. “A lot of users today are just taking their applications and using MPI instead of offloading. When you build a Knights Landing machine they can all run MPI, and since they run a full OS you can do anything that a processor would do.”
As a side note on the OS, many HPC users will likely not let the OS run wild and eat up a number of the cores (and there are definitely more than 61 on the new chips), and they will also have to prevent the OS from munching into the high-bandwidth memory it sees sitting nearby. The number of cores the OS runs on is a matter of user-set policy, and as for keeping the OS’s greedy hands off the new on-package memory, workarounds are in development.
With that specific OS piece in mind, however, it’s easy to see why Reinders is giddy about Knights Landing. “You can think of Knights Landing exactly like it’s a Xeon with lots and lots of (but-we-still-can’t-tell-you-how-many) cores. The big difference is how good it is at highly parallel workloads. It’s a high core count Xeon. That’s how we get extreme compatibility with Knights Landing to make it a processor: every OS that boots on it will look at it and think it’s just a Xeon on steroids; it shouldn’t look any different. But again, you can set policy to run it on one of the cores.” He expects that OEMs supplying systems will continue to configure machines with policies that keep the OS contained and give applications free rein on the other cores.
Among the refinements that will be present in Knights Landing are 512-bit SIMD capabilities, which will eventually be extended across the entire Intel processor line. Currently, with AVX2 and its 256-bit width, users can pull 4 double-precision operations (or 8 single-precision) from a single clock; with the introduction of 512-bit vectors, that doubles for both single and double precision. There is already 512-bit capability built into the current Xeon Phi, but it is only for use in the coprocessor since it hasn’t been fully synced with the full set of x86 capabilities. People using the current Phi thus don’t have the throughput possibilities or all the functionality that Intel will roll out with Knights Landing.
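The lane arithmetic behind those figures is simple division: a vector register holds as many elements as its width divided by the element size. A minimal Python sketch follows. The function name is hypothetical, and it only counts elements per vector instruction, ignoring details such as fused multiply-add or the number of vector units per core.

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements of a given size that fit in one vector register."""
    if vector_bits <= 0 or element_bits <= 0:
        raise ValueError("widths must be positive")
    if vector_bits % element_bits != 0:
        raise ValueError("element size must divide the vector width evenly")
    return vector_bits // element_bits


if __name__ == "__main__":
    # 64-bit elements are double precision, 32-bit elements are single precision.
    for width in (256, 512):
        doubles = simd_lanes(width, 64)
        singles = simd_lanes(width, 32)
        print(f"{width}-bit vectors: {doubles} doubles or {singles} singles per instruction")
```

Going from 256-bit to 512-bit vectors doubles the lane count for both element sizes, which is the doubling described above.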
Reinders has been teaching users how to tap into Xeon Phi, and as he introduces concepts leading up to Knights Landing, everyone is “looking for holes in the armor.” But he argues that the ones they know about are being addressed through the ecosystem, compilers, and in-house work. “The simple answer is that anyone who already programs for Knights Corner will find the Landing leap an easy one since there’s no new learning.”
This bodes well for Intel’s ability to take this highly parallel approach well beyond HPC applications in the future, especially if it continues to push the idea that there’s nothing “special” (i.e., difficult or accelerator-like for programmers) about it, that it’s simply a high core count processor. The beauty is that Intel can eventually round out its suite of processor choices so users can tailor them to their workloads and the degree of parallelism, performance and power required.
Credit to http://www.hpcwire.com