Kalray Clusters Calculate Quickly
Supercomputer-on-a-Chip Targets HPC, Data Centers
A little-known company, Kalray, has developed a processor that promises industry-leading performance per watt. The design packs 256 general-purpose cores—more than twice as many as Tilera offers and five times more than Cavium. You might expect such a powerful chip to burn 100W or even 200W, but the 28nm processor is rated at just 16W (typical) at 600MHz. Target applications include high-performance computing (HPC), image processing, embedded systems, and networking.
To achieve this impressive power efficiency, the company takes a unique approach. The 256 general-purpose cores offload compute-intensive tasks from 16 high-level CPUs that run Linux and application software. Instead of a coherent-memory model, Kalray uses message passing, and each CPU cluster manages its own shared memory. The CPUs implement a proprietary VLIW instruction set; the resulting hardware is relatively simple, which improves energy efficiency and makes its behavior highly predictable. Thus, the company offers massively parallel but low-power processors with a real-time twist. The downside is increased complexity for software developers, although the company provides an SDK as well as Posix and OpenCL drivers.
Kalray designed its first chip, code-named Andey, in conjunction with Global Unichip. Andey taped out in 28nm HP at TSMC in late 2012. The first-generation 32-bit chip is now in production as the MPPA-256 (Multi-Purpose Processing Array), and the company claims over a dozen customers. The second-generation design, Bostan, adds 64-bit addressing capabilities to the cores and improves energy efficiency by widening the CPU data path. Whereas Andey is limited to 400MHz, Bostan can clock as fast as 800MHz. Here, we focus on the Bostan implementation, which will tape out later this month in the same 28nm process.
Kalray has flown under the radar for a number of years, having started in 2008 as a technology spinoff from CEA, the French equivalent of the U.S. Department of Energy. The company is headquartered near Paris, and the engineering team, now 55 strong, is based in Grenoble. Last April, it raised $8 million, including additional investment from CEA, and brought in new management led by CEO Eric Baissus. Gilles Delfassy, former CEO of ST-Ericsson and long-time leader of Texas Instruments’ wireless business, heads the company’s supervisory board.
Custom Cores Quash Complexity
Kalray’s focus on simplifying the design starts with the CPUs, which are the basic elements of the system. Each core implements a five-issue VLIW architecture with classic Fisher-style features such as nontrapping loads, conditional moves, and conditional execution for memory accesses. By moving instruction scheduling from hardware to the compiler, VLIW cores are more power efficient than dynamically scheduled superscalar designs. For power efficiency, the core has various idle modes and can wake when receiving an event, interrupt, or explicit wake-up command.
The CPU uses a relatively short instruction pipeline, as Figure 1 shows. Basic integer operations complete in just five cycles. Loads that hit in the data cache require six cycles, whereas floating-point operations can take up to eight cycles. The lack of complex instruction-scheduling logic, register renaming, and reorder buffers keeps the hardware simple and the pipeline short. One advantage of a short pipeline is that mispredicted branches have just a two-cycle penalty; therefore, expensive branch-prediction hardware is unnecessary.
Figure 1. Block diagram of the MPPA-256 VLIW CPU. The CPU uses a simple six-stage pipeline for load operations. The short pipeline keeps the mispredicted branch penalty to just two cycles.
In the first cycle, a 128-bit VLIW bundle containing up to five operations is fetched from the prefetch buffer. This buffer contains up to three bundles and decouples decoding from the 8KB instruction cache, which is two-way set associative. The instruction cache uses 64-byte lines and is not coherent with the data cache. The next stage is instruction decoding (ID), which breaks the bundle into constituent operations for execution in parallel. The register file (RF) that feeds the execution units contains 64 general-purpose registers that are 32 bits wide; it supports up to 16 reads and 6 writes per clock cycle. Operands larger than 32 bits are accessed as register pairs.
The register file fans out to five different execution units. The branch control unit (BCU) resolves conditional branches and computes the target address for jumps and indirect branches; it also includes counters to perfectly predict counted loops. Two symmetric 32-bit integer units (ALU0 and ALU1) handle most integer computations, including bit-wise logic. The multiply-add execution unit (MAU) handles both integer and IEEE 754 floating-point data with a fully pipelined 64-bit data path. Integer multiply-add operations have a two-cycle latency, and floating-point operations take four cycles; floating-point throughput is one 64-bit fused-multiply-add (FMA) operation or two 32-bit FMA operations per clock cycle.
The last and most important execution resource is the 64-bit load/store unit (LSU). The load-to-use latency is just two cycles for a cache hit, with the first cycle for address computation using one of two addressing modes: base register + immediate offset, or base register + scaled index register. The second cycle is for accessing the 8KB data cache (which is virtually indexed, physically tagged, and two-way associative) and for retrieving up to 64 bits of data from a 64-byte cache line. The access occurs in parallel with virtual-address translation, which is performed by dual TLBs: a JTLB that’s two-way associative with 64 entries and an LTLB with eight fully associative entries. The cache has a write-through no-allocate policy. Stores take advantage of a fully associative write buffer that holds eight 8-byte entries.
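The cache geometry above fixes how an address is split for lookup. The following Python sketch derives the set count and field widths from the stated 8KB capacity, two-way associativity, and 64-byte lines; the function name and the 32-bit address are illustrative assumptions, not Kalray's hardware description.

```python
# Geometry taken from the text: 8KB data cache, two-way set associative, 64-byte lines.
CACHE_BYTES = 8 * 1024
WAYS = 2
LINE_BYTES = 64

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # sets in the cache
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # log2(64)
INDEX_BITS = NUM_SETS.bit_length() - 1         # log2(NUM_SETS)


def split_address(addr):
    """Split an address into (tag, set_index, byte_offset).

    Hypothetical helper: the field layout is the conventional one for a
    set-associative cache, not a documented Kalray format.
    """
    byte_offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, byte_offset
```

The offset and index fields together span 12 bits, or 4KB. If the smallest page is also 4KB (an assumption here), the set index comes entirely from untranslated address bits, which is what lets a virtually indexed, physically tagged cache start its lookup while the TLBs translate the address.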
Completing the Compute Cluster
The second level of the compute hierarchy is the compute cluster. This cluster is where Kalray begins to depart from traditional multicore processors and more strongly resemble GPUs and other high-throughput architectures. As Figure 2 shows, the compute cluster comprises 16 compute cores, an additional core dedicated to system functions, a shared 2MB memory, a DMA engine, interfaces to the network-on-a-chip (NoC) router, and a debug-support unit (DSU).
Figure 2. Block diagram of MPPA-256 compute cluster. Each cluster contains 16 general-purpose VLIW cores, a system core, and 2MB of shared memory.
The cores in a cluster all share a 2MB local memory that’s instantiated in the cluster and serves two critical functions. First, all memory accesses that miss the first-level caches are handled through the shared memory with a minimum nine-cycle latency. Second, software uses the shared memory as an explicit buffer for the external DRAM. In some usage models, the shared memory acts as a software-managed cache employing the MMU to handle 4KB blocks on a per-application basis. Because the cores in a cluster are not cache coherent, this shared memory also provides the main mechanism for implementing intercore communication in a compute cluster.
Kalray physically implements the shared memory as 16 independent 128KB banks with ECC over eight-byte chunks. The first-level caches are inclusive with regard to the shared memory, freeing them to use parity rather than ECC. Access to the 16 banks is through 20 ports: one per compute core and the system core; one for the debug-support unit; and two for the network-on-a-chip interfaces. Each port can access 8 bytes per cycle; with each of the 16 banks serving one such access per cycle, the total bandwidth is 128 bytes/clock, or 102.4GB/s at 800MHz. The shared memory-address mapping is configurable by software, and it’s allocated in 128KB blocks or interleaved over 64-byte portions across banks, depending on the application.
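The two mapping modes and the bandwidth figure can be sketched in a few lines of Python. The modulo-based bank selection is an assumption (Kalray has not published the exact mapping function), as is the premise that peak bandwidth is bounded by 16 banks each delivering 8 bytes per cycle.

```python
BANKS = 16
BANK_BYTES = 128 * 1024         # 16 x 128KB = 2MB of shared memory
INTERLEAVE_BYTES = 64           # interleaving granularity from the text
BYTES_PER_BANK_PER_CYCLE = 8    # assumption: one 8-byte access per bank per cycle


def bank_blocked(addr):
    """Blocked mode: each bank holds one contiguous 128KB region."""
    return (addr // BANK_BYTES) % BANKS


def bank_interleaved(addr):
    """Interleaved mode: consecutive 64-byte chunks rotate across the banks."""
    return (addr // INTERLEAVE_BYTES) % BANKS


def peak_bandwidth_gbs(clock_mhz):
    """Aggregate shared-memory bandwidth in GB/s at the given clock."""
    return BANKS * BYTES_PER_BANK_PER_CYCLE * clock_mhz * 1e6 / 1e9
```

Under these assumptions, interleaving spreads a sequential 1KB transfer over all 16 banks, whereas blocked mode keeps each core's working set in a single bank, which limits interference between cores and suits real-time code.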
The DMA engine includes a horizontal-microcoded CPU responsible for moving data between the shared memory and the NoC at up to 3.2GB/s in each direction. To do so, it employs the ports dedicated to the NoC interface. The debug unit connects to a JTAG chain and uses a system trace to record and export debug information to an external device. The system core is a fully featured CPU that implements the same VLIW architecture as the compute cores, but it runs an exokernel, executes privileged functions, and operates the DMA engine.
NoCing on Heaven’s Door
The MPPA-256 contains 16 compute clusters, which are not cache coherent. Giving up coherence is a reasonable choice for a high-throughput processor; that feature is complex, often power hungry, and encourages large caches. For Kalray, it also poses a challenge for real-time operation, since hardware coherence can introduce variable latencies into memory accesses (e.g., remote-cache accesses and contention). The MPPA-256, however, goes a step further: each cluster has an entirely private memory-address space and communicates with other clusters via explicit message passing, similar to Intel’s Single-Chip Cloud Computer (see MPR 4/26/10, “The Single-Chip Cloud Computer”).
A custom-designed network-on-a-chip (NoC) connects the clusters. It differentiates the MPPA-256 from other processors and demonstrates Kalray’s real-time focus. The NoC is simple and explicitly configured by software; it natively supports RDMA primitives, atomic message queues, and synchronization barriers. Its simplicity ensures that software developers can precisely predict and control the performance, and it also saves power by avoiding buffers and complex flow control. The NoC ties together the compute clusters and other system elements. Messages are composed of configurable-length packets, typically with a 64-byte payload, that in turn comprise 32-bit flits.
Each network node (e.g., a compute cluster) has four duplexed data links and four duplexed control links. The data links are four bytes wide in each direction and operate at the CPU clock rate of 800MHz; therefore, each tile can transmit a total of 3.2GB/s and receive 3.2GB/s spread across the four directions (north, south, east, and west).
As Figure 3 shows, the overall network topology is a two-dimensional torus, meaning that each north-south or east-west route wraps around at the edge of the chip. The packets are routed using a four-byte header. An optional four-byte protocol header extension ensures appropriate RDMA semantics and barrier synchronization. Packets use wormhole switching with source-based flow control and routing; this approach avoids buffering but stalls flit transmission in the presence of congestion.
Figure 3. MPPA-256 network topology. To facilitate communication among a large number of clusters and I/O elements, Kalray chose a two-dimensional torus that passes data both north-south and east-west.
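The wraparound property of a torus is easy to see in code. This Python sketch assumes the 16 compute clusters sit on a 4x4 grid and ignores the I/O nodes; the coordinate scheme and function names are illustrative, not Kalray's routing tables.

```python
WIDTH = 4   # assumption: 16 compute clusters arranged as a 4x4 grid
HEIGHT = 4


def neighbors(x, y):
    """Return the four neighbors of node (x, y), wrapping at the edges."""
    return {
        "north": (x, (y - 1) % HEIGHT),
        "south": (x, (y + 1) % HEIGHT),
        "west": ((x - 1) % WIDTH, y),
        "east": ((x + 1) % WIDTH, y),
    }


def ring_distance(a, b, size):
    """Shortest distance between two positions on a ring of the given size."""
    d = abs(a - b)
    return min(d, size - d)


def hop_count(src, dst):
    """Minimum number of links a packet crosses between two nodes."""
    return (ring_distance(src[0], dst[0], WIDTH)
            + ring_distance(src[1], dst[1], HEIGHT))
```

On this assumed grid, opposite corners are only two hops apart thanks to the wraparound links, and no pair of nodes is more than four hops apart; a mesh without wraparound would need six hops corner to corner.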
The network design has minimal flow-control hardware, instead relying on software to avoid deadlocks. Flow control is configurable via a few parameters (e.g., average and peak flow rate as well as minimum/maximum payload size) and is simple enough that network calculus (a form of queuing theory) can guarantee service when routes between pairs of NoC nodes are predetermined. This approach to guaranteeing services is an important differentiator in the real-time world. For regular applications, the standard QoS behavior can be disabled and replaced with a best-effort mode.
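To give a flavor of how network calculus produces such guarantees, the Python sketch below applies its textbook bounds for a flow limited by a token bucket (burst plus sustained rate) crossing a server that offers a rate-latency service curve. Mapping Kalray's flow parameters onto these two curves is an assumption, and the numbers in the usage note are hypothetical.

```python
def delay_bound(burst, flow_rate, service_rate, service_latency):
    """Worst-case delay for a token-bucket flow on a rate-latency server.

    burst and rates can use any consistent units (e.g., bytes and
    bytes/cycle); the result is then in cycles.
    """
    if flow_rate > service_rate:
        raise ValueError("flow rate exceeds service rate; delay is unbounded")
    return service_latency + burst / service_rate


def backlog_bound(burst, flow_rate, service_rate, service_latency):
    """Worst-case amount of data queued for the same flow and server."""
    if flow_rate > service_rate:
        raise ValueError("flow rate exceeds service rate; backlog is unbounded")
    return burst + flow_rate * service_latency
```

For example, a flow that may burst 64 bytes and then sustain 1 byte/cycle, crossing a path that serves 4 bytes/cycle after at most 10 cycles, is delayed no more than 26 cycles and never queues more than 74 bytes. Because the bounds depend only on configured parameters and a fixed route, they can be computed before the application runs.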
Keeping I/O on the Edge
Figure 4 shows the MPPA-256’s I/O blocks, which reside around the die’s exterior. The two I/O blocks are mostly symmetric; the north-east and south-west blocks contain PCI Express (PCIe) and Ethernet interfaces. Every I/O block contains two quad-core clusters, each with a shared data cache and 2MB of shared memory. One quad-core block controls the interface to the NoC; the other runs a real-time or standard operating system. Each I/O block also contains twelve 10G serdes (divided into an eight-lane block and a four-lane block) that can be configured to handle I/O (e.g., PCIe 3.0 or Ethernet) or to extend the on-chip fabric using the Interlaken protocol with extensions that Kalray calls NoC Express (NoCX). This approach allows system designers to gluelessly build an array of MPPA-256 processors.
Figure 4. MPPA-256 processor overview. Each compute cluster (CC) contains 16 VLIW CPU cores. Ethernet, PCI Express, and other I/Os are spread around the chip’s outer edge.
The processor shares two 64-bit-wide DRAM interfaces that address 128GB of ECC memory for the chip (64GB per controller). Bostan’s maximum memory speed is DDR3-2133. The chip also includes a static memory controller that works with NAND or NOR flash as well as external SRAM.
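For reference, the peak DRAM bandwidth implied by these figures can be worked out as follows. This Python sketch assumes the nominal 2,133 million transfers per second of DDR3-2133 and counts only the 64 data bits of each interface, not the ECC bits.

```python
CONTROLLERS = 2
DATA_BITS = 64                 # data width per interface, excluding ECC
TRANSFERS_PER_SEC = 2133e6     # nominal DDR3-2133 transfer rate
GB_PER_CONTROLLER = 64         # addressable capacity per controller


def peak_dram_bandwidth_gbs(controllers=CONTROLLERS):
    """Theoretical peak bandwidth in GB/s across the given controllers."""
    bytes_per_transfer = DATA_BITS // 8
    return controllers * bytes_per_transfer * TRANSFERS_PER_SEC / 1e9


def total_capacity_gb():
    """Total addressable DRAM in GB."""
    return CONTROLLERS * GB_PER_CONTROLLER
```

The result is about 17GB/s per interface and 34GB/s for the chip, a theoretical ceiling that real access patterns will not reach.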
The external I/O includes two x8 PCIe Gen3 links. In some configurations, multiple processors link to a single PCIe switch that fans out to a x16 host interface. For networking, the processor offers two 40G Ethernet (40GbE) MACs; using its 10Gbps serdes, the chip can connect directly to QSFP+ optical modules, eliminating the need for an external PHY. These two ports allow the chip to operate in a 40Gbps flow-through design or in a 2x40GbE network interface card that connects to a host processor using PCIe. The chip also includes the usual assortment of serial ports and 128 bits of GPIO. One unusual twist is that the GPIOs can directly access the NoC.
Clusters Exposed to Programmers
Kalray offers a Linux port to its VLIW architecture and also a port of the standard Gnu tool chain to enable standard C/C++ programming in the compute clusters. Thus, the company can position its processors for general-purpose applications. The challenge lies in coordinating tasks among the various core types, which run different operating systems and have different address spaces.
The four quad cores in the I/O blocks can run either a real-time OS or a Linux kernel with MPPA drivers. In each compute cluster, the system core runs a specialized OS, whereas the compute cores run user threads and direct system calls to the system core. Thus, Linux applications can use the MPPA drivers to pass tasks to the compute clusters; the drivers communicate with the system cores, which allocate tasks to the compute cores. This software model gives developers a great deal of freedom to take advantage of the simple hardware’s real-time nature. To do so, however, they must understand the Kalray architecture well enough to allocate tasks to the clusters while directly managing the chip’s multiple address spaces to ensure that the necessary data gets to each core.
For customers with less expertise, time, or resources, Kalray must provide substantial software infrastructure and libraries. The company currently provides a Posix interface as part of the SDK. Better yet, it has developed an alpha OpenCL 1.2 implementation, where OpenCL workgroups map to the compute clusters and work items map to each VLIW core. Kalray expects to deliver a complete OpenCL implementation by the time Bostan enters production later this year.
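The workgroup-to-cluster mapping can be illustrated with a short Python sketch. The article states only that workgroups map to clusters and work items to cores; the round-robin placement of surplus workgroups and the function name are assumptions, since Kalray's runtime scheduling policy is not described.

```python
CLUSTERS = 16
CORES_PER_CLUSTER = 16


def place_work_item(global_id, local_size=CORES_PER_CLUSTER):
    """Map a one-dimensional OpenCL global ID to a (cluster, core) pair.

    Assumption: workgroups are dealt to clusters round-robin once there
    are more workgroups than clusters.
    """
    if not 1 <= local_size <= CORES_PER_CLUSTER:
        raise ValueError("workgroup must fit in one 16-core cluster")
    group_id, local_id = divmod(global_id, local_size)
    return group_id % CLUSTERS, local_id
```

With the default workgroup size of 16, a 256-item kernel fills the chip exactly: work item 37 lands on core 5 of cluster 2, and item 255 on core 15 of cluster 15. Work items in the same group can then share data through that cluster's 2MB memory, which matches OpenCL's local-memory semantics.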
Applications Include Packet Processing
Its general-purpose nature allows the MPPA-256 to serve a variety of highly parallel applications that require strong performance at modest power. The message-passing model is similar to that of a supercomputer, so HPC programmers will be comfortable with the design. Some customers are already using the processor for applications such as financial simulations (e.g., Monte Carlo analysis), geophysics (e.g., petroleum exploration), and life sciences.
The initial processor also attracted military and aerospace customers, who are willing to buy expensive processors and invest in programming them. These customers also appreciate Kalray’s real-time features. The MPPA-256 can perform object recognition and tracking, augmented reality, and other vision-processing tasks. Its highly parallel architecture is well suited to other types of audio and video processing. For example, it can perform broadcast-quality HEVC video encoding on an HD video stream while consuming only 15W.
Packet processing is well suited to highly parallel designs, since a packet stream can be divided among any number of CPUs. The Ethernet controller embeds a hardware engine that examines each packet and assigns it to a cluster on the basis of its flow ID, keeping each flow in a coherent environment. At 800MHz, Bostan can process 240 million packets per second (Mpps), which works out to 160Gbps for minimum-size (64-byte) packets. At this rate, the chip can execute 3,400 instructions for each packet—enough to perform advanced functions in addition to basic routing. The VLIW architecture’s bit-manipulation and population-count instructions further extend its networking performance.
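These throughput figures can be cross-checked with simple arithmetic. The Python sketch below assumes that the 160Gbps figure counts the standard 20 bytes of Ethernet preamble and interframe gap per packet, and that each core sustains four instructions per cycle; both are inferences chosen because they reproduce the quoted numbers, not statements from Kalray.

```python
CORES = 256
CLOCK_HZ = 800e6
PACKET_RATE = 240e6          # packets per second, from the text
MIN_FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20     # 8-byte preamble/SFD + 12-byte interframe gap
ASSUMED_IPC = 4              # assumption: sustained instructions per cycle per core


def line_rate_gbps(frame_bytes=MIN_FRAME_BYTES):
    """Wire bandwidth consumed at PACKET_RATE, including per-frame overhead."""
    return PACKET_RATE * (frame_bytes + WIRE_OVERHEAD_BYTES) * 8 / 1e9


def cycles_per_packet():
    """Aggregate core cycles available for each packet."""
    return CORES * CLOCK_HZ / PACKET_RATE


def instructions_per_packet(ipc=ASSUMED_IPC):
    """Instruction budget per packet at the assumed issue rate."""
    return cycles_per_packet() * ipc
```

Under these assumptions, the line rate works out to about 161Gbps, and each packet gets roughly 853 core cycles, or about 3,400 instructions. That budget assumes perfect load balancing across all 256 cores and a fully used issue width, so real code will see less.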
For Bostan, the company enhanced the hardware for packet processing. The new system core can perform basic traffic management (e.g., scheduling and flow control) on egress data, removing this burden from the CPUs. Bostan also includes 128 new crypto engines (one for every two compute cores) that can offload cryptography algorithms such as AES, 3DES, SHA, and iSCSI CRC. Together, these engines enable line-rate encryption and decryption for the two 40GbE ports.
Given these capabilities, Kalray positions Bostan for intelligent network interface cards (NICs). Using its two 40GbE ports, the chip can handle 160Gbps of throughput. It can plug into a PCI Express slot using one of its x8 PCIe connections. The company is developing an Open vSwitch (OVS) port to deliver a variety of software-defined-networking (SDN) functions (see NWR 4/28/14 sidebar “Open vSwitch History and Porting”); the code should be in production by the end of this year. For example, the Kalray chip could perform flow classification, flow setup and teardown, and packet processing and routing. This approach is more efficient than using the main server CPU to handle all SDN functions.
Best Performance per Joule
Given the unusual nature of Kalray’s architecture, comparing it with other commercial processors is difficult. Bostan’s raw compute power of 845Gflops is impressive, but many PC graphics cards exceed that rating, as do HPC accelerators such as Intel’s Xeon Phi (see MPR 12/3/12, “Tesla and Xeon Phi Fuel HPC”). Graphics cards and accelerators, however, typically burn 100W or more, whereas the Kalray chip is a relative miser at about 20W. Alternatively, Intel’s new Broadwell-U processors offer two 3.1GHz CPUs at a 28W TDP; these cores deliver a combined 100Gflops, one-eighth of Bostan’s performance at similar power.
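The efficiency gap is clearer when expressed as Gflops per watt. This Python sketch uses only the figures quoted above (845Gflops at about 20W for Bostan, 100Gflops at 28W for Broadwell-U); peak ratings and TDP are rough proxies for delivered performance and power, so the ratio is indicative only.

```python
# (peak Gflops, watts) as quoted in the text
CHIPS = {
    "Bostan": (845.0, 20.0),
    "Broadwell-U": (100.0, 28.0),
}


def gflops_per_watt(name):
    """Peak compute efficiency of the named chip."""
    gflops, watts = CHIPS[name]
    return gflops / watts


def efficiency_ratio(a, b):
    """How many times more efficient chip a is than chip b."""
    return gflops_per_watt(a) / gflops_per_watt(b)
```

On these numbers, Bostan delivers about 42Gflops/W versus 3.6Gflops/W for Broadwell-U, nearly a 12x advantage. Using the 24W rating given later for Bostan's 800MHz top speed lowers its figure to about 35Gflops/W.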
Raw compute power must be harnessed to be effective. An independent analysis led by the University of Grenoble compared the original MPPA-256 (Andey) against an eight-core Xeon E5-4640 (Sandy Bridge-EP) and a quad-core Tegra 3 running at 1.3GHz. Each chip was programmed for the well-known traveling-salesman problem. Using 256 cores, the Kalray chip solved a 20-city problem in less time than even the powerful Xeon processor, as Figure 5 shows. More importantly, it used about one-tenth as much energy as the Tegra chip, which is designed for low-power devices.
Figure 5. Processor performance and energy comparison. The Kalray product completed the 20-city traveling-salesman problem in less time than a high-end Xeon while using about a tenth of the energy of either Xeon or Tegra. (Source: University of Grenoble paper)
The product most similar to Kalray’s is the Tile-Gx72 (see MPR 2/25/13, “Tilera Unwraps 72-Core Whopper”), which packs 72 CPUs, each operating at up to 1.2GHz. Like Kalray, Tilera used a proprietary instruction set for its cores, but it differs in maintaining full cache coherence across the entire chip, greatly simplifying the programming model. On the other hand, the 40nm chip is rated at 60W (typical). Thus, the Tile-Gx72 provides less than a third as many cores as Bostan while operating at about three times the power.
Network processors offer greater core counts. Netronome, for example, integrates 120 flow-processing cores and 96 packet-processing cores into its NFP-6xxx NPU. As their names imply, these cores are specialized and can’t handle the general-purpose tasks that Kalray’s chips can. But Kalray will compete against both Tilera (now part of EZchip) and Netronome in packet processing and intelligent NICs. The NFP-6xxx appears in a 2x40GbE NIC, similar to Bostan, that is rated at 55W (typical). The Tile-Gx72 NIC also handles 2x40GbE but uses 95W. In addition to greater electrical cost, these high-power NICs also require double-wide PCI cards, whereas Bostan will fit on a single-slot card.
Please Come to Bostan (for the Real Time)
The 28nm Bostan will sample in 2Q15 and is scheduled to enter production in 4Q15 as the MPPA-256 v2. It is expected to dissipate 16W (typical) at 600MHz and 24W at its top speed of 800MHz. Kalray has not disclosed pricing for its processors. The third generation, Coolidge, is slated for a 16nm process and will arrive about a year later than Bostan. In this process node, the architecture could scale to 1,024 cores per chip, but the company has yet to finish its product plans.
Kalray has created a unique chip-level architecture that mimics the cluster architecture of many supercomputers, much as initial multicore chips followed the approach of simple SMP systems. By limiting shared memory to 16-core clusters, this design simplifies the on-chip interconnect. The architecture scales easily through cluster replication, helping the company pack more general-purpose cores onto its chip than any other available processor. The simpler design also improves real-time latency.
The downside of the cluster approach is that it requires significant source-code changes. Most programs assume a single coherent memory space and must be modified for a cluster model. Even programs written for cluster computing will need some modification to handle the specifics of Kalray’s implementation. The company is developing an OpenCL interface to abstract the hardware model and support programs already written for this emerging API.
Code must also be recompiled for the company’s proprietary instruction set. Many customers would be more comfortable with a standard instruction set such as ARM, but Kalray says its VLIW ISA is critical for delivering best-in-class performance per watt. Tilera had some success with its home-brew instruction set, but that company did not require customers to modify their source code.
Kalray has won over some HPC and military/aerospace customers that need the best possible performance per watt and are willing to invest in programming to a custom architecture and unusual memory model. To expand on this success, the company must offer some prepackaged software to avoid exposing customers to its unusual programming model. The smart NIC is a good example of this approach, and the sudden demand for SDN is driving interest in data-center offload. If Kalray delivers the software that smart NICs require, the processor’s inherent power efficiency will win designs.
Price and Availability
The initial MPPA-256 (Andey) is in production. The MPPA-256 v2 (Bostan) is scheduled to sample in 2Q15 and reach production in 4Q15. The company withheld pricing. Kalray will also offer Bostan in a four-chip PCI Express card (Turbocard3) and in a 40G Ethernet smart NIC. For more information, access www.kalrayinc.com/kalray/products.