**Linley Spring Processor Conference 2018**

*April 11 - 12, 2018*

Hyatt Regency, Santa Clara, CA

Hyatt Regency, Santa Clara, CA

Accelerating Machine Learning

New Specialized Architectures Examined at ISSCC

*By David Kanter*

At the recent ISSCC, researchers from Intel and MIT presented machine-learning accelerators that offer tremendous power and performance benefits compared with CPUs, GPUs, and DSPs. Their papers highlight options that are available for products ranging from smartphones to servers. The Intel team described an accelerator that handles the *k*-nearest-neighbor problem, and the MIT Eyeriss chip is designed specifically for convolutional neural networks.

Machine learning is one of the hottest topics in computing; applications include fraud detection, ad optimization, and automated/assisted driving. Commercial implementations are quite common in the data center and in embedded devices (e.g., computer vision for robots).

Although machine-learning algorithms often run on standard microprocessors, companies are starting to introduce products developed specifically for this task. Nvidia designed its Tesla M40 and M4 to implement machine-learning models using the Cuda software ecosystem (see *MPR 11/30/15*, “Nvidia Brings Learning to the Server”). On the embedded side, Ceva optimized its embedded DSPs for vision processing; they are accessible through software libraries (see *MPR 10/26/15*, “Ceva Enables Deep Learning”).

Instead of being specifically architected for machine learning, these products are slightly modified versions of existing mainstream GPUs and DSPs. For these workloads, dedicated hardware will eclipse the performance and efficiency of even modestly specialized generic hardware. But existing GPUs and DSPs already ship in high volume, whereas the economic viability of machine-learning hardware is uncertain. Nvidia GPUs are successful in low-volume niches such as high-performance computing only because gaming and similar consumer applications drive large volumes and underwrite the R&D cost.

Machine Learning Moves Mainstream

The goal of machine learning is to enable a computer system to observe data and create or extract an accurate model of the underlying phenomena without explicit human intervention. Typically, once the model is “trained,” the user deploys it to make inferences, predictions or classifications about new data. Some algorithms incorporate online learning, which uses new data to subsequently retrain or adjust the model.

A crucial advantage of machine learning is that it doesn’t require a programmer to create the model manually. For instance, rather than trying to compile an exhaustive list of scenarios in which an autonomous vehicle can (or cannot) safely change lanes, a machine-learning approach would simply train a model on millions of observed lane changes so that the vehicle can infer and subsequently emulate human behavior.

Although machine learning has been studied in many guises for over 30 years, the critical insight was that accuracy scales with the amount of training data. Even relatively simple techniques are highly accurate when trained using millions or billions of observations. On the other hand, sophisticated algorithms and input from domain experts are no substitute for data quantity. The rise of machine learning’s popularity is tied to the availability of large-scale data, as well as large-scale computer systems (e.g., Google’s massive compute farms) to analyze this data.

Machine learning encompasses a wide variety of algorithms, each with its own strengths and weaknesses, creating a rich field for research. One paper from the Intel Circuits Research Lab focused on the *k*-nearest-neighbor (KNN) problem, while another from MIT addressed convolutional neural networks (CNNs). Both illustrate algorithms with unique architectural and circuit-design techniques.

Picking the Right Neighbors

The KNN classification problem is fairly straightforward. Given a query, the goal is to select from a data set the *k* objects that are most similar to the query. More quantitatively, an input (or query) is represented by a vector with multiple dimensions (usually one dimension for each feature—e.g., height or weight), and the program picks the “nearest” *k* data points in the set. The distance metric for calculating “nearness” is defined over the various features of the data set. For example, the data might be books, and features could include genre, author, and the frequency of rare words; the KNN algorithm would identify other books similar to the input across the features.

Intel’s research project implements a KNN accelerator in state-of-the-art 14nm silicon, as Figure 1 illustrates. The accelerator uses an adaptive-precision algorithm that iteratively computes the partial distance between the query and reference vectors. Eliminating more-distant vectors in early iterations reduces energy usage.

**Figure 1. ****Adaptive-precision implementation of k-nearest-neighbor problem.** The KNN accelerator comprises storage for 128 reference vectors, as well as logic to compute partial distances between reference vectors and the input query vector. The distances are fed into a partial-sort network, which outputs the minimum distance into the control logic. The control logic then uses the minimum distance to select the nearest neighbors and disable distant ones. (Source: Intel Research)

The accelerator can compare a 128-dimension query vector against 128 reference vectors, which reside in 16KB of on-die memory. Each dimension is represented by a single byte, and the distance between two vectors is measured using either the Manhattan metric (sum of absolute differences of dimensions) or the standard Euclidean metric (square root of the sum of squares).

There Can Be Only One

On each clock, the accelerator computes partial sums for a 2-bit window in every dimension of every reference vector (starting from the most significant bits). A sort network determines the minimum distance from the query vector to an active reference vector. The distance calculation and sorting circuits are highly optimized to account for the relatively small 2-bit input, saving area and power. Any vector that exceeds the minimum distance from the query is eliminated and deactivated in the search-space control logic.

The accelerator then iterates over the remaining active vectors, moving from the most to least significant bits and computing additional partial sums until only the nearest neighbor is left. As a result, the distance computations and sorting are now data dependent; only the reference vectors near the query vector are handled with full precision, whereas others are terminated early to save power while still producing the correct result.

The search-space control tracks disable reference vectors using a global bit mask. Additionally, each vector stores data that indicates when it was eliminated (i.e., during which 2-bit window)—critical information for calculating additional nearest neighbors.

Once the accelerator has found the nearest neighbor, it can calculate the next-nearest neighbor from the group of most recently eliminated reference vectors, potentially rolling back to a group of reference vectors from an earlier window. This backtracking technique substantially reduces the cost of identifying additional neighbors: as *k* increases from 1 to 10, each extra neighbor increases the latency by two cycles on average.

Figure 2 shows an example execution of a nearest-neighbor query. The search space is initially large, but it narrows until the best match is found. The accelerator then rolls back to find more nearby neighbors.

**Figure 2. ****Adaptive-precision KNN calculation.** Each iteration reduces the search space. Once it finds the nearest neighbor, the accelerator rolls back and does a small search to find the next-closest neighbors. (Source: Intel Research)

The 12-million-transistor KNN accelerator measures a miniscule 0.333mm^{2} in Intel’s 14nm process; it operates at 338MHz and 0.75V at 25°C. Using random reference vectors and the Manhattan distance, the accelerator achieves 21.5 million query vectors per second (or 16 cycles per query) while consuming 73mW. Using the Euclidean distance reduces throughput to 20.3 million queries per second, or 17 cycles per query, while consuming 78mW. The accelerator employs Intel’s near-threshold-voltage design style; the best energy efficiency comes at 0.39V and is a 2.7x improvement over 0.75V operation, although performance drops by roughly 10x. The accelerator is designed for 128 reference vectors with 8-bit dimensions, but it could easily expand to more dimensions or more bits per dimension.

Convolutional Neural Networks

Neural networks are a popular machine-learning algorithm that organizes data flow into a graph of “neurons” (see *MPR 4/13/15*, “Synopsys Embeds Vision Processing”). As Figure 3 illustrates, a convolutional neural network (CNN) is simply a variety of neural network where the computation in a set of neurons is a convolution (as opposed to another computational primitive). In this context, the convolution consumes over 90% of the time and energy. A *deep neural network* is a fuzzy term, but it generally refers to a neural network with more than five layers of neurons (and up to hundreds of layers) in series on the data-flow graph.

**Figure 3. ****Example use of a convolutional neural network.** The neurons in each layer perform a convolution on the input data to extract features. The convolution is the dominant computation and is the focus of the Eyeriss accelerator. (Source: MIT)

Deep CNNs are incredibly taxing computationally. For example, the well-recognized AlexNet requires 4.6MB of storage for filter weights and 13K multiply-accumulates per pixel to classify a 227x227-pixel image. Other CNNs, such as VGG16, require 5–10x more storage and 30x more computations than AlexNet. Deep CNNs have a configurable but highly specialized data flow and structure that exhibits considerable parallelism and locality. As Table 1 illustrates, however, each layer will vary in filter size, number of filters, and number of channels (e.g., red, blue, and green channels for image analysis).

**Table 1. AlexNet convolutional-layer configurations.** Production neural networks use varied configurations, which require flexible accelerators. (Source: MIT)

As Figure 4 illustrates, CNNs create a sea of inherently parallel computation multiplying each element of a filter against nearly every pixel in each channel of the image and then accumulating the results. All this number crunching provides many opportunities to reuse data. For example, The main challenge for accelerating CNNs is creating an architecture that takes advantage of this data flow while being flexible enough to accommodate all different CNN sizes.

**Figure 4. ****Parallel computation and data flow in convolutional neural networks.** The basic convolution operation comprises a dot product and a partial sum (top). Many different convolutions occur in parallel (e.g., different filters and different channels) but often share or reuse data (bottom). (Source: MIT)

Optimized Eyeriss Architecture

At ISSCC, a team of MIT researchers presented Eyeriss, a reconfigurable deep-CNN accelerator. They chose a rectangular array of processing elements (PEs) connected through a fabric to a 108KB shared on-chip buffer, as Figure 5 illustrates. Each PE comprises a 16-bit multiply-accumulate (MAC) unit as well as local storage for image data (24B), filter weights (550B), and partial sums (48B). Image and filter data is read from DRAM to the on-chip buffer and then transmitted to the PE array across a network-on-a-chip (NoC) specifically designed for the CNN data flow.

**Figure 5. ****CNN-accelerator architecture.** The accelerator comprises an array of processing elements with separate on-chip networks for filter weights, image data, input partial sums, and output partial sums, as well as an SRAM buffer and dedicated hardware for compression, decompression, and activation. (Source: MIT)

Figure 6 shows the logical data flow; the filter weights are read from the buffer and move west to east across the PE array, with a filter row for each PE. The image data starts from the west column and south row, and values move diagonally to the northeast, with one image row per PE. Last, the computed partial sums move north from the generating PE. They are read to the buffer from the top row and potentially fed back to the bottom row (if reused). If the CNN doesn’t precisely fit the PE array, the data mapping can be folded (for larger CNNs) or replicated (for smaller CNNs).

**Figure 6. ****CNN accelerator data flow.** Filter and image data follow specific patterns in the NoC. Multicasting enables the NoC to reduce the number of read operations while avoiding the overhead of broadcasting. (Source: MIT)

The NoC is physically optimized for CNNs using both point-to-point and multicast communication flows. It’s logically hierarchical, with the on-chip buffer feeding into a global Y bus that vertically spans the PE array on the western edge. The Y bus fans out to a global X bus for each row of the PE array, and each PE also has a local link to its immediate northern neighbor. The network is capable of single-cycle multicast and comprises three separate physical channels (filter, image, and partial sums). Multicasting uses less power than broadcasting while still reducing data reads from DRAM and the buffer. For example, a single read of filter weights can provide filter rows to 12 PEs.

Another major insight from the researchers involves data compression. Convolution is a linear operator, so it’s coupled with a nonlinear activation function. For activation, most recent neural networks use a rectified linear transformation *f*(*x*) = max(*x,*0), because the network will converge toward a solution more quickly than other activation functions. Eyeriss contains a dedicated rectified linear unit (RELU) that operates on filtered data. Applying this transform makes run-length data compression highly effective and reduces DRAM bandwidth. Special zero-detection circuits bypass and clock gate the MAC in addition to reading from local memory, further saving PE power.

The Eyeriss chip is fabricated in much older TSMC 65nm LP technology, so the absolute performance is mostly irrelevant to modern processors that use the 28nm node or below. But the architecture’s relative performance and energy-efficiency benefits demonstrate the advantages of specialization.

Classifying four images with AlexNet requires 2.6 billion MACs and therefore eight billion 16-bit inputs (or 16GB), and it produces 2.6 billion outputs (or 5.2GB). Eyeriss accessed a mere 15.4MB of data from DRAM and only 208.5MB from the on-chip buffer, indicating the accelerator took advantage of both data reuse and multicasting. Using run-length data compression on images and filters decreases the DRAM accesses per layer from 2.7-6MB down to 1.3-5MB for AlexNet. In addition, the data-dependent clock gating reduces PE power by 45%. Eyeriss natively handles filters up to 32 pixels wide and 12 pixels high with as many as 1,024 filters and 1,024 channels. Larger problems can be decomposed into multiple software passes (e.g., to increase the number of channels).

Will Silicon Ever Learn?

The popularity of machine learning is a direct result of Moore’s Law. As the cost of computation and storage has fallen, more consumer data can be collected and analyzed. Now that machine learning is established as a commercially valuable workload across many different markets, from servers to mobile phones, the next step is further optimization.

The easiest optimization is moving the workload from general-purpose microprocessors, such as Intel’s Xeon line, to a general-purpose accelerator such as a GPU or DSP. Dedicated accelerators, exemplified by Intel’s KNN project and MIT’s Eyeriss, have a number of architectural advantages over CPUs, GPUs, and DSPs.

First, the data representation is much more efficient. The two accelerators use 8- and 16-bit fixed-point data, which is much more area and power efficient than 16- or 32-bit floating point, enabling 2–4x more computations in a fixed design. Second, the data flow is structured to reduce off-chip accesses. For example, MIT estimates that Eyeriss uses 2–4x less memory bandwidth than a state-of-the-art GPU. Third, dedicated on-chip storage is more efficient than caches, which require an additional tag lookup and may activate multiple locations. Fourth, the dedicated architectures enable application-aware power-saving techniques, such as the variable-precision search of Intel’s KNN accelerator and the data-driven clock gating of the Eyeriss PEs.

As Moore’s Law and process-technology advances decelerate, architectural optimizations will become increasingly attractive levers to improve performance and efficiency. Although proper comparisons among research papers and modern products are difficult, the initial results indicate that specialized hardware can deliver 4–30x better performance per watt than CPUs or GPUs for machine-learning workloads. These research papers promise gains, but the actual system-integration options are unclear. Should such accelerators be on-die coprocessors or tightly integrated functional units (e.g., like AES in x86 or ARM)? Similarly, should they communicate with the main system through memory, cache, or special offload APIs? For the data center, could these accelerators be virtualized? Architects must tackle a host of questions before accelerator products become viable.

In the short term, accelerators could be implemented in Altera or Xilinx FPGAs. Although the performance would surely lag that of fixed-function silicon, it may still be an improvement over that of CPUs and GPUs, and the time to market will undoubtedly be shorter than that of dedicated accelerators.

The research presented at ISSCC clearly demonstrates the technical benefits and feasibility of accelerators for machine learning. Actual products will require sufficient customer interest to drive the high volumes necessary to absorb the high development costs for modern semiconductors. The first place we expect to see machine-learning accelerators is the data center. Microsoft’s Catapult project already uses FPGAs for Bing, demonstrating a willingness to embrace unusual hardware, and we fully expect to see dedicated accelerators within the next five years.

For More Information

For additional information on MIT’s Eyeriss processor, access the original presentation at *www.rle.mit.edu/eems/wp-content/uploads/2016/02/eyeriss_isscc_2016_slides.pdf*.

Events

Hyatt Regency, Santa Clara, CA

October 31 - November 1, 2018

Hyatt Regency, Santa Clara, CA

Newsletter

Press

White Papers