Microprocessor Report (MPR)

Google TPU Boosts Machine Learning

Proprietary Data-Center Accelerator Bests CPUs and GPUs

May 8, 2017

By David Kanter

A leader in the deployment of machine learning, Google uses neural networks as well as other inference techniques that apply weights to input data to classify incoming email as spam, perform speech recognition, and similar tasks. The company has tremendous compute resources across its data centers and a penchant for designing its own hardware as well as software. It is the first company to design and deploy a custom processor that accelerates machine learning. Google describes this processor in a paper that it will present at the International Symposium on Computer Architecture next month.

The “tensor processing unit” (TPU) is a 28nm accelerator that offloads neural-network inferencing based on the open-source TensorFlow library. It offers roughly 10x better performance than comparable 28nm GPUs and 22nm CPUs. Google set a blistering pace by designing, verifying, and deploying the accelerator in just 15 months. This fast development cycle was possible because the TPU is simpler than a general-purpose CPU or GPU, and the design team deliberately left out complicated features such as power management.

When users interact with neural-network-based services (such as Google Now), they expect a prompt response. Thus, Google places particular emphasis on latency and quality-of-service guarantees when evaluating neural-network inferencing. Rather than using complex vector units, caches, and DRAM, the TPU reduces latency using a simple design based on a massive array of 256x256 multiply-accumulate (MAC) units as well as explicitly addressed SRAM. It relies on a host processor for higher-level functions.

As Figure 1 shows, the TPU is mounted on an add-in card that connects to the host processor using a PCIe 3.0 link with 16 lanes. The PCIe connector also provides power, limiting the board to about 75W. The TPU operates at 700MHz; it could have been faster, but Google focused on a short schedule rather than high operating speed. Although the company withheld the exact die size, we estimate it’s about 300mm². The TPU comes with 8GB of ECC-protected DDR3 memory that stores the read-only weights of multiple inferencing models; inferencing can therefore operate primarily on the TPU with minimal CPU overhead.


Figure 1. Google TPU add-in card. The TPU and its associated DRAM are mounted on a card that uses PCIe for both power and data, limiting the total draw to 75W. (Source: Google)

As an accelerator, the TPU must be managed by driver code running on the main CPU. The driver is split into two parts: a lightweight kernel that only handles memory management and interrupts, and a user-space driver that performs more-complex and rapidly changing functions. For example, the user-space driver contains a just-in-time (JIT) compiler that transforms TensorFlow code into TPU binaries and then moves the instructions and associated model weights to the TPU memory for execution. It also manages the data transfers between the CPU and TPU, including any formatting and unpacking.

Google’s Target Workload

The TPU project started in 2013, a time when machine learning and neural networks were a small part of Google’s total computing portfolio. But some engineers were concerned that these workloads would grow rapidly, potentially doubling the infrastructure needed to deliver search, Gmail, and other services to consumers. At first, the TPU team had just one or two engineers. When it became clear that machine-learning demand was growing rapidly, Google added more resources to the project.

From among the many neural-network architectures, Google has predominantly chosen multilayer perceptrons (MLPs—a type of fully connected network) and recurrent neural networks (RNNs). (See MPR 12/12/16, “Many Options for Machine Learning.”) The company currently uses 61% MLPs and 29% RNNs. Convolutional neural networks (CNNs), which typically operate on visual or audio data, are just 5% of its inference workload.

These neural-network types have different numbers of layers (between 4 and 89), different nonlinear activation functions (e.g., sigmoid or hyperbolic tangent), and widely varying compute intensity (measured in operations per byte). All require low-latency streaming computation, however. These neural networks use a large collection of static weights to perform inferences on the basis of input data. Unlike training, inferencing is typically a customer-facing task, so it requires a rapid response; Google’s quality-of-service requirement is seven milliseconds for 99th-percentile latency. (This value represents only the computation time. The customer’s perceived response time also includes client and network latency.)
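To make the 99th-percentile requirement concrete, the following sketch checks a set of measured latencies against the seven-millisecond limit. It assumes the nearest-rank definition of a percentile, since Google does not say how it computes the figure, and the sample latencies are invented for illustration.

```python
import math

QOS_LIMIT_MS = 7.0  # 99th-percentile limit quoted in the article


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_qos(latencies_ms, limit_ms=QOS_LIMIT_MS):
    """True if the 99th-percentile latency is within the limit."""
    return percentile_nearest_rank(latencies_ms, 99) <= limit_ms


# Hypothetical measurements: 98 fast requests and 2 slow ones.
example = [2.0] * 98 + [9.0] * 2
print(percentile_nearest_rank(example, 99), meets_qos(example))
```

A batch in which only one request in a hundred is slow would pass this check, whereas two slow requests in a hundred fail it, which is why a processor may have to run below peak throughput to stay inside the limit.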

Most of Google’s neural networks are small and comprise fewer than 1,500 lines of TensorFlow code. The weights for the trained networks are modest as well, ranging from 5 million to 100 million parameters. Although GPUs perform the training with floating-point arithmetic, Google quantizes and transforms most of its neural-network models into 8-bit integer data for inferencing. The integer weights consume less memory, and integer computation is more energy efficient and requires less of the processor die.
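As a rough illustration of what quantizing to 8-bit integers involves, the sketch below maps floating-point weights to signed 8-bit values using a single shared scale factor. Symmetric per-tensor scaling is our assumption; Google has not described its quantization scheme in this detail, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers with one shared scale.

    Symmetric per-tensor scaling is an assumption made for this sketch.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Invented example weights.
weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each weight then occupies one byte instead of four, at the cost of a rounding error of at most half the scale factor per weight.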

Data Flow Tailored to the Task

Figure 2 shows a block diagram of the Google TPU. It also illustrates the general data flow, which closely aligns with the application, unlike in a GPU or CPU. The host processor loads TPU instructions and data across the PCIe interface. The instructions flow into an instruction buffer and drive further execution. The TPU fetches neural-network weights from memory and stages them in a queue. The matrix-multiply unit combines data inputs from the unified buffer and the weights from the queue. It then performs the multiplication and sends the results to the accumulators. The accumulators act as inputs to the activation stage, which applies a nonlinear differentiable function such as a sigmoid or hyperbolic tangent. For CNNs, the chip has dedicated hardware for pooling, which occurs after activation.


Figure 2. Block diagram of Google TPU. The design is based on a neural-network data flow. The 64K 8-bit MAC units handle most computation and use 24MB of SRAM to store intermediate results.

The TPU and the host-based driver connect through PCIe, which is high latency and low bandwidth relative to on-die and coherent interfaces. That link carries both instructions and input data for the inferencing. To make the most of it, Google chose a CISC-style instruction encoding, including a repeat prefix. It says most instructions execute in 10–20 cycles, but each one does a lot of work.

The TPU has five main instructions: two that move data between the accelerator and the host and three that do the computation. The compute instructions read weights from memory, perform matrix multiplication and convolution, and apply an activation function. For example, the matrix-multiply instruction is encoded as 12 bytes: 3 for input addressing, 2 for output addressing, 4 to specify the length, and 3 for the opcode and flags. The instruction multiplies a variable-length input and a 256x256-entry weight matrix, yielding a variable-length output. The TPU also has a number of other, rarely used instructions for configuration and other system functions.
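The 12-byte layout of the matrix-multiply instruction can be sketched as a simple pack-and-unpack routine. Only the field widths (3, 2, 4, and 3 bytes) come from Google; the field order, the little-endian byte order, and the opcode value below are assumptions for illustration.

```python
MATMUL_OPCODE = 0x01  # hypothetical value; the real opcodes are not public here


def encode_matmul(input_addr, output_addr, length, opcode_flags=MATMUL_OPCODE):
    """Pack the four fields into 12 bytes (3 + 2 + 4 + 3).

    Field order and little-endian byte order are assumptions.
    to_bytes raises OverflowError if a value does not fit its field.
    """
    return (
        input_addr.to_bytes(3, "little")
        + output_addr.to_bytes(2, "little")
        + length.to_bytes(4, "little")
        + opcode_flags.to_bytes(3, "little")
    )


def decode_matmul(raw):
    """Inverse of encode_matmul."""
    if len(raw) != 12:
        raise ValueError("matrix-multiply instruction must be 12 bytes")
    return (
        int.from_bytes(raw[0:3], "little"),
        int.from_bytes(raw[3:5], "little"),
        int.from_bytes(raw[5:9], "little"),
        int.from_bytes(raw[9:12], "little"),
    )


instruction = encode_matmul(input_addr=0x012345, output_addr=0x0FFF, length=1000)
print(len(instruction), decode_matmul(instruction))
```

The field widths are consistent with the rest of the design: a 96K-entry unified buffer needs 17 address bits, which fit in 3 bytes, and 4K accumulators need 12 bits, which fit in 2 bytes.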

Fixed Pipeline With Massive Compute

The TPU has a four-stage pipeline that aims to keep the 256x256 matrix-multiply unit occupied. Accordingly, most of the chip’s data paths are 256 bytes wide. The read-weights instruction completes once the weight address is generated but before fetching all the weights. This approach enables the TPU to overlap weight fetching with matrix multiplication. The TPU reads weights from the DDR3 memory into a queue that can hold four 64KB (256x256) tiles.
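The benefit of overlapping weight fetches with matrix multiplication can be estimated with a simple two-stage pipeline model. This is a sketch under stated assumptions: the cycle counts in the example are hypothetical, and the model assumes the weight queue always has room for the next tile.

```python
def total_cycles(num_tiles, fetch_cycles, compute_cycles, overlap=True):
    """Cycles to process num_tiles weight tiles.

    Without overlap, each tile is fetched and then used in sequence.
    With overlap, the fetch of tile i+1 runs during the compute of tile i,
    so only the first fetch is fully exposed.
    """
    if num_tiles <= 0:
        return 0
    if not overlap:
        return num_tiles * (fetch_cycles + compute_cycles)
    steady_state = (num_tiles - 1) * max(fetch_cycles, compute_cycles)
    return fetch_cycles + steady_state + compute_cycles


# Hypothetical numbers: 4 tiles, 100 cycles to fetch, 150 cycles to compute.
print(total_cycles(4, 100, 150, overlap=False), total_cycles(4, 100, 150))
```

When fetching a tile takes longer than using it, the `max` term shows that memory bandwidth, not the multiplier array, sets the pace.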

The matrix-multiply unit is a systolic array comprising 64K MAC units that operate on 8-bit integers. The input weights come from the weight FIFO and reside in two 64KB tiles. The first tile is for the active computation, and the second provides double-buffering to hide latency. The matrix unit reads 256 inputs or activations per clock from a large unified buffer, and it performs the multiply-accumulate operations using the weights. The multiplier array stalls if either the weights or the inputs are unavailable.
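The arithmetic that the matrix-multiply unit performs can be modeled functionally in a few lines. The sketch below is not cycle accurate and does not model the systolic data movement; it only shows 8-bit inputs and weights being multiplied and summed into 32-bit accumulators. Wrapping on overflow is our assumption, since Google does not say whether the accumulators wrap or saturate, and the example uses a 2x2 tile in place of the real 256x256.

```python
def to_int32(value):
    """Wrap a Python int to signed 32-bit (overflow behavior is an assumption)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def matmul_accumulate(inputs, weights, accumulators):
    """Add inputs x weights into the accumulators, one input row at a time.

    inputs:       rows of N signed 8-bit values (one row per clock in the TPU)
    weights:      an N x N tile of signed 8-bit values
    accumulators: one row of N 32-bit sums per input row, updated in place
    """
    n = len(weights)
    for row, acc_row in zip(inputs, accumulators):
        for col in range(n):
            total = acc_row[col]
            for k in range(n):
                total += row[k] * weights[k][col]  # 8-bit x 8-bit product
            acc_row[col] = to_int32(total)
    return accumulators


# Toy 2x2 example; the TPU uses N = 256.
acc = [[0, 0], [0, 0]]
matmul_accumulate([[1, 2], [3, 4]], [[5, 6], [7, 8]], acc)
print(acc)
```

Calling the function again with the same accumulators adds a second partial product to the first, which is how results from several weight tiles are combined.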

The result is 256 products that are 16 bits each. The TPU writes them into a collection of 4K accumulators. Each of these accumulators contains 256 entries of 32 bits each for a total capacity of 4MB. Google chose this large number of accumulators after analyzing the performance of representative neural networks. According to its performance models, its networks need 100–1,350 operations per byte of memory bandwidth to sustain peak throughput. The company simply rounded up to 2,048 and used twice that number to handle double buffering.

The matrix-multiply unit is designed primarily for 8-bit integers but can operate on larger data at lower throughput. If either the weights or the inputs are 16 bits, for example, the throughput is halved; if both weights and input require 16-bit data, the throughput falls by a factor of four.
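The throughput scaling for wider operands, along with the peak rate implied by the array size and clock speed, works out as follows. Counting each multiply-accumulate as two operations is a common convention that we assume here; the function names are ours.

```python
MAC_UNITS = 256 * 256    # 64K multiply-accumulate units
CLOCK_HZ = 700_000_000   # 700MHz


def relative_throughput(weight_bits, input_bits):
    """Throughput relative to the 8-bit case, per the scaling described above."""
    if weight_bits not in (8, 16) or input_bits not in (8, 16):
        raise ValueError("only 8-bit and 16-bit operands are modeled")
    return 1 / ((weight_bits // 8) * (input_bits // 8))


def peak_ops_per_second(weight_bits=8, input_bits=8):
    """Peak rate, counting each multiply-accumulate as two operations."""
    return 2 * MAC_UNITS * CLOCK_HZ * relative_throughput(weight_bits, input_bits)


for w_bits, i_bits in [(8, 8), (16, 8), (16, 16)]:
    tera_ops = peak_ops_per_second(w_bits, i_bits) / 1e12
    print(f"{w_bits}-bit weights, {i_bits}-bit inputs: {tera_ops:.1f} trillion ops/s")
```

With 8-bit operands, the array peaks at roughly 92 trillion operations per second under this counting convention.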

The activation pipeline applies nonlinear operations to the data in the accumulators. Google CNNs use the rectified linear unit (ReLU), whereas its other neural networks use ReLU, sigmoid functions, and hyperbolic tangents. The activation pipeline executes these functions on the accumulators. It also includes hardware that executes pooling layers, which downsample between layers in some CNNs.
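The functions that the activation pipeline applies are simple to state in software. The sketch below uses floating-point values for clarity, whereas the hardware operates on quantized data; it also assumes max pooling over a one-dimensional window, since Google does not specify the pooling type here.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid, "tanh": math.tanh}


def activate(accumulator_values, kind):
    """Apply one nonlinear function to every accumulator value."""
    fn = ACTIVATIONS[kind]
    return [fn(v) for v in accumulator_values]


def max_pool_1d(values, window):
    """Downsample by keeping the largest value in each window (assumed pooling type)."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


activated = activate([-2.0, 0.0, 3.0], "relu")
pooled = max_pool_1d([1, 5, 2, 4, 3], 2)
print(activated, pooled)
```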

The activation results are written back into the unified buffer, which is a giant SRAM array that holds 96K 256-byte entries. Google selected a large size for two reasons. First, it wanted to avoid spilling any intermediate results (e.g., activations) to DRAM when inferencing. Second, as Figure 3 illustrates, the unified buffer has the same vertical height as the matrix-multiply unit to simplify layout.
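The on-chip storage figures follow directly from the entry counts and widths quoted above, as this short calculation shows. The variable names are ours.

```python
KIB = 1024
MIB = 1024 * KIB

weight_tile_bytes = 256 * 256 * 1          # 256x256 weights, 1 byte each
weight_fifo_bytes = 4 * weight_tile_bytes  # the queue holds four tiles
accumulator_bytes = 4 * KIB * 256 * 4      # 4K accumulators x 256 entries x 32 bits
unified_buffer_bytes = 96 * KIB * 256      # 96K entries x 256 bytes

print(weight_tile_bytes // KIB, "KB per weight tile")
print(weight_fifo_bytes // KIB, "KB in the weight queue")
print(accumulator_bytes // MIB, "MB of accumulators")
print(unified_buffer_bytes // MIB, "MB of unified buffer")
```

The unified buffer alone accounts for the 24MB of SRAM cited in Figure 2.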


Figure 3. TPU floorplan. The design is roughly one-quarter SRAM, one-quarter compute logic, one-quarter I/O and control, and one-quarter miscellaneous functions. (Source: Google)

Dedicated Hardware Boosts Performance

Google shared some carefully selected performance and power data to justify the benefits of its design compared with processors from 2015, when it first deployed the TPU. Specifically, it compares the TPU, an Intel Xeon E5-2699v3 (Haswell) processor, and an Nvidia Tesla K80 (Kepler) GPU. Google ruled out Nvidia’s Maxwell and other parts that lack ECC memory protection because it requires this feature in its data-center products.

The data also explains the shortcomings of current solutions. Both CPUs and GPUs must reduce throughput to hit the quality-of-service target. On one workload, the Xeon processor only achieved 42% of peak throughput at the required latency, and the Tesla GPU could only attain 37%. In contrast, the TPU sustained 80% of its peak throughput while delivering about 20x better performance than the GPU and 41x better than the CPU.

These performance comparisons are imperfect and come with several caveats. First, the baseline Xeon E5v3 employs single-precision floating-point code, whereas the TPU employs 8-bit integers. Google admits that porting the code to AVX2 (which offers 8-bit integer SIMD) would improve performance by 3–4x. Intel recently announced large performance gains thanks to AVX and fundamental algorithmic changes. The Tesla K80 also lacks integer support, but the newer P4 and P40 GPUs offer 8-bit integer SIMD (see MPR 10/17/16, “Nvidia Tunes Pascal for Machine Learning”). The K80 is inefficient when ECC is enabled; newer GPUs deliver much closer to peak throughput with ECC.

Even with those caveats, the TPU provides a tremendous performance gain over its contemporaries, as Figure 4 shows. It’s best for MLPs and especially CNNs. The CPU is surprisingly strong for RNNs, sometimes beating the GPU. Were the RNNs compiled to use AVX2, Xeon could even outrun the TPU.

Figure 4. Relative performance. The TPU is much faster than the CPU and GPU for MLPs and CNNs, but the CPU delivers surprisingly strong performance for RNNs. The average is weighted based on Google’s workload mix. (Source: Google)

Energy efficiency is a large factor in the total cost of ownership, and the TPU is even more compelling from that perspective. It offers 29x the performance of the Xeon E5-2699v3, and a TPU-based system is 34x more energy efficient (including the overhead of the host server processor). The Tesla GPU improves system energy efficiency by about 2x compared with Xeon, but that edge is too small to justify the hardware and software changes, let alone the additional cost.

The Tip of the Innovation Iceberg

Since 2015, both Intel and Nvidia have developed new products that focus more on neural networks, including the Tesla P4 and P40 as well as Intel’s impending Skylake Xeons. Google, however, is already well along in developing a second-generation TPU, which is likely near release and should deliver sizable performance gains. Thus, the TPU’s performance and power-efficiency advantages are likely to persist.

In its initial design, Google sacrificed performance to meet a tight schedule. For example, the TPU has almost no power management. The company did identify some relatively straightforward optimizations that could at least double the design’s energy efficiency: upgrading the DDR3 memory interface to faster GDDR5 and increasing the number of accumulators as well as the operating frequency. We expect such a chip will also implement other optimizations and use 14nm FinFET or another new process technology.

Other companies have developed architectures specifically for accelerating neural networks, but startups such as Wave Computing and Nervana Systems (now part of Intel) have focused on training. Few other accelerators target inferencing in the data center. Microsoft’s FPGA-based Catapult project can handle inferencing, but it performs a wider variety of tasks that include software-defined networking. As a result, it fails to achieve the performance gains of the TPU. Ceva, Synopsys, and Tensilica make licensable DSP cores for inferencing, but they primarily target client devices. MIT’s Eyeriss accelerator, an academic project, supports only CNNs (see MPR 3/7/16, “Accelerating Machine Learning”).

The TPU illustrates that order-of-magnitude improvements are practical in a commercial context, not just in academic settings. But its market impact will be muted, as Google will not sell the chip to other data-center operators. Deploying it has reduced the company’s cost of running TensorFlow workloads. The TPU’s advantages will spur competitors such as Amazon, Baidu, and Microsoft to investigate acceleration options, beginning with FPGAs and possibly including dedicated hardware.

For those data-center operators that don’t want to develop their own hardware, processor vendors offer off-the-shelf solutions. But these chip vendors should take heed of Google’s definition of performance. The company’s performance metrics are based on latency and quality of service, considerations that are absent from most processor marketing. Google also chose a solution that fits within standard PCIe power-delivery limits, unlike most of Nvidia’s power-hungry GPUs, and indicated a clear requirement for ECC data protection.

The TPU’s biggest impact is as a proof of concept. Benchmark data from the world’s largest neural-network user shows that a dedicated architecture can deliver more than 10x performance gains over CPU and GPU designs when running real-world workloads. Although Intel and Nvidia will continue to optimize their products and software stacks, perhaps including new extensions to the x86 architecture, these general-purpose designs will struggle to compete with dedicated architectures in performance and power efficiency. We see an exciting future as inferencing becomes a focus for new architectures from vendors both old and new.

For More Information

Google’s ISCA paper describing the performance of the TPU is available at https://arxiv.org/abs/1704.04760.
