A Guide to Processors for Deep Learning
Fourth Edition
Published February 2021
Authors: Linley Gwennap and Mike Demler
Corporate License: $5,995
Take a Deep Dive into Deep Learning
Deep learning, a branch of artificial intelligence (AI), has improved rapidly over the past few years and is now being applied to a wide variety of applications. Typically implemented using neural networks, deep learning powers image recognition, voice processing, language translation, and many other web services in large data centers. It is an essential technology in self-driving cars, providing both object recognition and decision making. It is even used in smartphones, PCs, and embedded (IoT) systems.
Even the fastest CPUs cannot efficiently execute the highly complex neural networks needed to address these advanced problems. Boosting performance requires more-specialized hardware architectures. Graphics chips (GPUs) have become popular, particularly for the initial training function. Many other hardware approaches have recently emerged, including DSPs, FPGAs, and dedicated ASICs. Although these newer solutions promise order-of-magnitude improvements, GPU vendors are tuning their designs to better support deep learning.
Autonomous vehicles are an important application for deep learning. Vehicles don't implement training but instead focus on the simpler inference tasks. Even so, they require very powerful processors, and because they are more constrained in cost and power than data-center servers, they demand different tradeoffs. Several chip vendors are delivering products specifically for this application; some automakers are developing their own ASICs instead.
Large chip vendors such as Intel and Nvidia currently generate the most revenue from deep-learning processors. But many startups, some well funded, have emerged to develop new, more customized architectures for deep learning; Cerebras, Graphcore, GreenWaves, Gyrfalcon, Groq, Horizon Robotics, Tenstorrent, and Untether are among the first to deliver products. Eschewing these options, leading data-center operators such as Alibaba, Amazon, and Google have developed their own hardware accelerators.
We Sort Out the Market and the Products
A Guide to Processors for Deep Learning covers hardware technologies and products from more than 55 companies. The 300+ page report provides deep technology analysis and head-to-head product comparisons, as well as analysis of company prospects in this rapidly developing market segment. We explain which products will win designs, and why. The Linley Group’s unique technology analysis provides a forward-looking view, helping you sort through competing claims and products.
The guide begins with a detailed overview of the market. We explain the basics of deep learning, the types of hardware acceleration, and the end markets, including a forecast for both automotive and data-center adoption. The heart of the report provides detailed technical coverage of announced chip products from AMD, Cambricon, Cerebras, Graphcore, Groq, Huawei, Intel (including former Altera, Habana, Mobileye, and Movidius), Mythic, Nvidia (including Tegra and Tesla), NXP, and Xilinx. Other chapters cover Google’s TPU family of ASICs and Tesla’s autonomous-driving ASIC. We also include shorter profiles of numerous other companies developing AI chips of all sorts, including Amazon, BrainChip, Gyrfalcon, Hailo, Lattice, Qualcomm, Synaptics, and Texas Instruments. Finally, we bring it all together with technical comparisons in each product category and our analysis and conclusions about this emerging market.
Make Informed Decisions
As the leading vendor of technology analysis for processors, The Linley Group has the expertise to deliver a comprehensive look at the full range of chips designed for deep-learning applications. Principal analyst Linley Gwennap and senior analyst Mike Demler use their experience to deliver the deep technical analysis and strategic information you need to make informed business decisions.
Whether you are looking for the right processor for an automotive application, an IoT device, or a data-center accelerator, or seeking to partner with or invest in one of these vendors, this report will cut your research time and save you money. Make the smart decision: order A Guide to Processors for Deep Learning today.
This report is written for:
- Engineers designing chips or systems for deep learning or autonomous vehicles
- Marketing and engineering staff at companies that sell related chips and need more information on processors for deep learning or autonomous vehicles
- Technology professionals who want an introduction to deep learning, vision processing, or autonomous-driving systems
- Financial analysts who desire a hype-free analysis of deep-learning processors and of which chip suppliers are most likely to succeed
- Press and public-relations professionals who need to get up to speed on this emerging technology
This market is developing rapidly — don't be left behind!
The fourth edition of A Guide to Processors for Deep Learning covers dozens of new products and technologies announced in the past year, including:
- The innovation behind Nvidia’s Ampere A100, the industry-leading GPU
- The first products in Qualcomm’s power-efficient Cloud AI 100 line
- Graphcore’s second-generation accelerator, the GC200
- Tenstorrent’s initial Grayskull product, which outperforms Nvidia’s T4 at the same 75W
- Intel’s new Stratix NX, its first AI-optimized FPGA
- AMD’s powerful new MI100 (CDNA) accelerator for supercomputers
- NXP’s i.MX 8M Plus, the company’s first applications processor with AI acceleration
- Untether’s TsunAImi, which packs an industry-leading 2,000 TOPS into a single card
- The evolution of Google’s TPU family, including its next-generation TPUv4
- Intel’s new Xe GPU initiative
- Esperanto’s first product, which features more than 1,000 Minion cores at only 20W
- New card and system products from Groq
- A preview of the second-generation Wafer-Scale Engine (WSE2) from Cerebras
- Updated roadmaps for Intel’s Habana accelerators and Agilex FPGAs
- The Jacinto processor from Texas Instruments for Level 3 ADAS
- Synaptics’ VS680, a low-cost SoC for AI-enabled consumer devices
- Other new AI vendors such as Ambient, Aspinity, Coherent Logix, Kneron, Perceive, SambaNova, Sima.ai, and XMOS
- New product roadmaps and other updates on all vendors
- Updated market size and forecast, including the economic effects of the 2020 pandemic
Deep-learning technology is being deployed or evaluated in nearly every industry in the world. This report focuses on the hardware that supports the AI revolution. As demand for the technology grows rapidly, we see opportunities for deep-learning accelerators (DLAs) in three general areas: the data center, automobiles, and embedded (edge) devices.
Large cloud-service providers (CSPs) can apply deep learning to improve web searches, language translation, email filtering, product recommendations, and voice assistants such as Alexa, Cortana, and Siri. Data-center DLA revenue exceeded $5 billion in 2020 and will approach $12 billion within five years. By 2025, we expect nearly half of all new servers (and most cloud servers) to include a DLA.
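As a quick sanity check (our arithmetic, not a figure quoted from the report), growth from $5 billion to roughly $12 billion over five years corresponds to a compound annual growth rate of about 19%:

$$\text{CAGR} = \left(\frac{12}{5}\right)^{1/5} - 1 \approx 0.19$$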
Deep learning is critical to the development of self-driving cars. ADAS functions such as autonomous emergency braking already appear in more than half of new cars worldwide; we forecast nearly 80% adoption by 2025. Several automakers are shipping Level 3 autonomous vehicles. Over the next few years, we expect rapid growth in robotaxis and other commercial vehicles, generating more than $2.5 billion in processor revenue in 2025.
To reduce latency and improve reliability for voice and other cloud services, edge products such as drones, security cameras, and Internet of Things (IoT) devices are implementing neural networks. Most premium smartphones now include a DLA, which is also becoming common in smart speakers. We expect 1.9 billion edge devices to ship with DLAs in 2025.
The rapid market growth has spurred many new companies to develop chips with DLAs. This report covers more than 55 vendors, including those developing chips only for internal use; we’re aware of many others that have disclosed too little information to cover. Although the situation is reminiscent of previous booms in graphics accelerators and network processors, we expect more winners in this competition. Given the widely differing product requirements in the data center, automotive, and various embedded market segments, different companies are likely to take the lead in each.
Nvidia dominates the data-center market with its new Ampere GPU, which offers excellent performance for both neural-network training and inference. Customers prefer the company’s broad and reliable software stack. Nvidia also leads the push to develop autonomous vehicles; its Xavier processor is the industry’s first single-chip solution for Level 3 autonomy, and its Drive AGX cards deliver even greater capabilities.
Intel offers several DLA architectures. Its standard Xeon CPUs are often used for lightweight inference, although their performance falls far short of Ampere’s. In late 2019, it acquired Habana Labs, whose high-end DLA chips handle both training and inference, but product development has stalled since then. Intel also sells a range of FPGAs for customers that wish to design their own DLA architecture. Its low-power Myriad chips target drones and other camera-based devices. In addition, its Mobileye subsidiary leads in ADAS and competes against Nvidia at Level 3 and above.
Several others offer data-center DLAs. Cerebras, Graphcore, Groq, SambaNova, and other startups are sampling or shipping chips that use new architectures to outperform Nvidia’s GPUs on at least some workloads. AMD’s new MI100 GPU challenges the A100 for HPC but still lags on AI applications. Xilinx added AI cores to its new Versal FPGAs but hasn’t disclosed any neural-network benchmarks. These vendors must also compete against in-house inference ASICs at Alibaba, Amazon, Baidu, Google, Microsoft, and other top CSPs, all of which are now deploying these devices. Huawei disclosed its own AI architecture that it sells in servers and as a cloud service. Each of these companies offers a limited software stack and thus can target only a few workloads.
Established automotive suppliers such as NXP, Renesas, Texas Instruments, and Toshiba compete against Mobileye in the ADAS market. Well-funded Chinese startups Black Sesame and Horizon Robotics have released ADAS processors and are developing more-powerful chips for autonomous driving. Other startups also target this market, including Blaize (formerly ThinCI) and Hailo. Some automakers are developing their own chips for autonomous vehicles, but only Tesla has disclosed any details about its ASIC, which is already in production. Pilot deployments of Level 4 robotaxis have already started, and we expect mass production of these vehicles in 2022.
The embedded market has attracted the most startups, as the cost of both hardware and software development is relatively low, and design wins can quickly generate revenue. This market encompasses multiple end applications. Flex Logix, Gyrfalcon, and Mythic offer high-end DLA chips for high-resolution and multicamera applications. BrainChip, Google, Intel, NXP, and Synaptics target drones and other consumer video devices. Smart speakers and other voice-activated devices require less performance; Syntiant supplies the lowest-power chip for keyword spotting, but it competes against Ambient, Kneron, and Knowles. For ultra-low-power sensors, GreenWaves and Lattice provide alternatives.
Comparing the capabilities of such products is complicated; much depends on the needs of the end application. This report provides the data necessary to evaluate these companies and their products, along with our analysis of how well they address market requirements.
Table of Contents
List of Figures
List of Tables
About the Authors
About the Publisher
Preface
Executive Summary
1 Deep-Learning Applications
What Is Deep Learning?
Cloud-Based Deep Learning
Advanced Driver-Assistance Systems
Autonomous Vehicles
Voice Assistants
Smart Cameras
Manufacturing
Robotics
Financial Technology
Health Care and Medicine
2 Deep-Learning Technology
Artificial Neurons
Deep Neural Networks
Spiking Neural Networks
Neural-Network Training
Training Spiking Neural Networks
Pruning and Compression
Neural-Network Inference
Quantization
Neural-Network Software
Popular Frameworks
Other Open Software
Neural-Network Models
Image-Classification Models
Natural-Language and Recommender Models
3 Deep-Learning Accelerators
Accelerator Design
Data Formats
Computation Units
Dot Products
Systolic Arrays
Handling Sparsity
Other Common Functions
Processor Architectures
CPUs
GPUs
DSPs
Custom Architectures
FPGAs
Performance Measurement
Peak Operations
Neural-Network Performance
MLPerf Benchmarks
AI-Benchmark
4 Market Forecast
Market Overview
Data Center and HPC
Market Size
Market Forecast
Automotive
Market Size
Market Forecast
Autonomous Forecast
Client and IoT
Market Size
Market Forecast
5 AMD
Company Background
Key Features and Performance
Product Roadmap
Conclusions
6 Cambricon
Company Background
Key Features and Performance
Product Roadmap
Conclusions
7 Cerebras
Company Background
Key Features and Performance
Product Roadmap
Conclusions
8 Google
Company Background
Key Features and Performance
Data-Center TPUs
Edge TPU
Conclusions
9 Graphcore
Company Background
Key Features and Performance
Graphcore GC2
Graphcore GC200
Graphcore M2000
Product Roadmap
Conclusions
10 Groq
Company Background
Key Features and Performance
Product Roadmap
Conclusions
11 Huawei
Company Background
Key Features and Performance
Conclusions
12 Intel
Company Background
Xeon Processors
Key Features and Performance
Product Roadmap
Habana Accelerators
Key Features and Performance
Product Roadmap
Stratix and Agilex FPGAs
Key Features and Performance
Product Roadmap
Movidius Myriad SoCs
Key Features and Performance
Product Roadmap
Conclusions
Data Center
Client and IoT
Strategy Summary
13 Mobileye (Intel)
Company Background
Key Features and Performance
Product Roadmap
Conclusions
14 Mythic
Company Background
Key Features and Performance
Product Roadmap
Conclusions
15 Nvidia Tegra
Company Background
Key Features and Performance
Software Development
Product Roadmap
Conclusions
16 Nvidia Tesla
Company Background
Key Features and Performance
Product Roadmap
Conclusions
17 NXP
Company Background
Key Features and Performance
Product Roadmap
Conclusions
18 Tesla (Motors)
Company Background
Key Features and Performance
Product Roadmap
Conclusions
19 Xilinx
Company Background
Key Features and Performance
UltraScale+
Alveo
Versal
Product Roadmap
Conclusions
20 Other Automotive Vendors
Black Sesame
Company Background
Key Features and Performance
Conclusions
Blaize
Company Background
Key Features and Performance
Conclusions
Hailo
Company Background
Key Features and Performance
Conclusions
Horizon Robotics
Company Background
Key Features and Performance
Conclusions
Renesas
Company Background
Key Features and Performance
Conclusions
Texas Instruments
Company Background
Key Features and Performance
Conclusions
Toshiba
Company Background
Key Features and Performance
Conclusions
21 Other Data-Center Vendors
Achronix
Company Background
Key Features and Performance
Conclusions
Alibaba
Company Background
Key Features and Performance
Conclusions
Amazon
Baidu
Company Background
Key Features and Performance
Conclusions
Centaur (Via)
Company Background
Key Features and Performance
Conclusions
Enflame
Company Background
Key Features and Performance
Conclusions
Esperanto
Company Background
Key Features and Performance
Conclusions
Furiosa
Marvell
Microsoft
Company Background
Key Features and Performance
Conclusions
Qualcomm
Company Background
Key Features and Performance
Conclusions
SambaNova
Company Background
Key Features and Performance
Conclusions
SimpleMachines
Company Background
Key Features and Performance
Conclusions
Tenstorrent
Company Background
Key Features and Performance
Conclusions
Tianshu Zhixin
Untether
Company Background
Key Features and Performance
Conclusions
Wave Computing
22 Other Embedded Vendors
Ambient
Company Background
Key Features and Performance
Conclusions
Aspinity
Company Background
Key Features and Performance
Conclusions
BrainChip
Company Background
Key Features and Performance
Conclusions
Coherent Logix
Company Background
Key Features and Performance
Conclusions
Cornami
Eta Compute
Company Background
Key Features and Performance
Conclusions
Flex Logix
Company Background
Key Features and Performance
Conclusions
Grai Matter
Company Background
Key Features and Performance
Conclusions
GreenWaves
Company Background
Key Features and Performance
Conclusions
Gyrfalcon
Company Background
Key Features and Performance
Conclusions
Kneron
Company Background
Key Features and Performance
Conclusions
Knowles
Company Background
Key Features and Performance
Conclusions
Lattice
Company Background
Key Features and Performance
Conclusions
NovuMind
Company Background
Key Features and Performance
Conclusions
Perceive
Company Background
Key Features and Performance
Conclusions
Sima.ai
Company Background
Key Features and Performance
Conclusions
Synaptics
Company Background
Key Features and Performance
Conclusions
Syntiant
Company Background
Key Features and Performance
Conclusions
XMOS
Company Background
Key Features and Performance
Conclusions
23 Processor Comparisons
How to Read the Tables
Data-Center Training
Architecture
Interfaces
Performance and Power
Summary
Data-Center Inference
Architecture
Interfaces
Performance and Power
Summary
Power-Efficient Data-Center Inference
Architecture
Interfaces
Performance and Power
Summary
Automotive Processors
CPU Subsystem
Vision Processing
Interfaces
Summary
Embedded Processors
Performance and Power
Memory and Interfaces
Summary
Embedded Coprocessors
Performance and Power
Memory and Interfaces
Summary
Ultra-Low-Power Processors
Performance and Power
Memory and Interfaces
Summary
24 Conclusions
Market Summary
Data Center
Automotive
Embedded
Technology Trends
Neural Networks
Hardware Options
Performance Metrics
Vendor Summary
Data Center
Automotive
Embedded
Closing Thoughts
Index
List of Figures
Figure 1‑1. SAE autonomous-driving levels
Figure 1‑2. Waymo autonomous test vehicle
Figure 1‑3. GM’s autonomous-vehicle prototype
Figure 1‑4. Various smart speakers
Figure 1‑5. A smart surveillance camera
Figure 1‑6. Processing steps in a computer-vision neural network
Figure 1‑7. Robotic arms use deep learning
Figure 1‑8. Comparisons of lung-disease severity (PXS)
Figure 2‑1. Neuron connections in a biological brain
Figure 2‑2. Model of a neural-network processing node
Figure 2‑3. Common activation functions
Figure 2‑4. Model of a four-layer neural network
Figure 2‑5. Three-dimensional neural network
Figure 2‑6. Spiking effect in biological neurons
Figure 2‑7. Spiking-neural-network pattern
Figure 2‑8. Pruning a neural network
Figure 2‑9. Mapping from floating-point format to integer format
Figure 3‑1. Common AI data types and approximate data ranges
Figure 3‑2. Arm dot-product operation
Figure 3‑3. A systolic array
Figure 3‑4. Performance versus batch size
Figure 4‑1. Revenue forecast for deep-learning chips, 2017–2025
Figure 4‑2. Unit forecast for deep-learning chips, 2017–2025
Figure 4‑3. Unit forecast for deep-learning chips by technology, 2018–2025
Figure 4‑4. Unit forecast for ADAS-equipped vehicles, 2017–2025
Figure 4‑5. Revenue forecast for ADAS processors, 2017–2025
Figure 4‑6. Unit forecast for client deep-learning chips, 2017–2025
Figure 5‑1. AMD CDNA architecture
Figure 7‑1. Cerebras wafer-scale engine (WSE)
Figure 8‑1. Google TPUv3 board
Figure 9‑1. Block diagram of Graphcore M2000 accelerator
Figure 10‑1. TSP conceptual diagram
Figure 12‑1. Functional diagram of Habana HLS-1 system
Figure 14‑1. Mythic’s flash-based neural-network tile
Figure 20‑1. Blaize Pathfinder system-on-module
Figure 20‑2. Hailo-8 heterogeneous-resource map
Figure 20‑3. Block diagram of Renesas R-Car V3H
Figure 20‑4. TI Jacinto 7 TDA4VM primary compute domain
Figure 20‑5. Block diagram of Toshiba TMPV770 ADAS processor
Figure 21‑1. Block diagram of Centaur CHA processor
Figure 21‑2. Esperanto ET-SoC-1 die plot
Figure 22‑1. Block diagram of Aspinity AnalogML processor
Figure 22‑2. Block diagram of Eta Compute ECM3531
Figure 22‑3. Block diagram of Flex Logix InferX X1
Figure 22‑4. Block diagram of GreenWaves GAP8 processor
Figure 22‑5. Block diagram of Knowles AISonic processor
Figure 22‑6. Block diagram of Lattice SensAI architecture
Figure 22‑7. Block diagram of Synaptics AudioSmart AS-371
Figure 22‑8. Block diagram of Syntiant NDP101 speech processor
Figure 22‑9. Xcore.ai processor architecture
Figure 23‑1. ResNet-50 v1.0 training throughput
Figure 23‑2. ResNet-50 v1.0 inference throughput (high end)
Figure 23‑3. ResNet-50 v1.0 inference throughput (low end)
Figure 23‑4. ResNet-50 v1.0 inference latency
Figure 24‑1. Model-size trends, 2014–2020
List of Tables
Table 2‑1. Size and compute requirement of popular DNNs
Table 3‑1. AI-Benchmark smartphone tests
Table 4‑1. Data-center DLA units and revenue, 2019–2025
Table 5‑1. Key parameters for AMD Instinct accelerators
Table 6‑1. Key parameters for Cambricon edge- and cloud-AI processors
Table 8‑1. Key parameters for Google TPU accelerators
Table 8‑2. Key parameters for Google Edge TPU
Table 9‑1. Key parameters for Graphcore processors
Table 10‑1. Key parameters for Groq TSP architecture
Table 11‑1. Key parameters for Huawei Ascend devices
Table 12‑1. Key parameters for selected Intel Xeon Scalable processors
Table 12‑2. Key parameters for Intel Habana Goya accelerator card
Table 12‑3. Key parameters for selected Intel FPGAs
Table 12‑4. Key parameters for Intel Movidius processors
Table 13‑1. Key parameters for Mobileye EyeQ processors
Table 14‑1. Key parameters for Mythic M1108 DLA
Table 15‑1. Key parameters for Nvidia automotive processors
Table 16‑1. Key parameters for Nvidia deep-learning GPUs
Table 17‑1. Key parameters for NXP AI-enabled processors
Table 18‑1. Key parameters for Tesla FSD ASIC
Table 19‑1. Key parameters for selected Xilinx FPGAs
Table 20‑1. AI-chip companies targeting automotive applications
Table 21‑1. AI-chip companies targeting data-center applications
Table 21‑2. Key parameters for Achronix Speedster7t FPGAs
Table 21‑3. Key parameters for Alibaba HanGuang 800 accelerator card
Table 21‑4. Key parameters for Baidu Kunlun K200 accelerator card
Table 21‑5. Key parameters for Enflame CloudBlazer products
Table 21‑6. Key parameters for Qualcomm Cloud AI 100 products
Table 21‑7. Key parameters for SimpleMachines Accelerando card
Table 21‑8. Key parameters for Tenstorrent Grayskull accelerator card
Table 21‑9. Key parameters for Untether TsunAImi accelerator card
Table 22‑1. AI-chip companies targeting embedded applications
Table 22‑2. Key parameters for Gyrfalcon Lightspeeur coprocessors
Table 22‑3. Key parameters for Kneron edge-AI processors
Table 22‑4. Key parameters for Perceive Ergo processor
Table 23‑1. Comparison of leading DLAs for AI training
Table 23‑2. Comparison of other DLAs for AI training
Table 23‑3. Comparison of high-end DLAs for AI inference
Table 23‑4. Comparison of midrange DLAs for AI inference
Table 23‑5. Comparison of 75W DLAs for AI inference
Table 23‑6. Comparison of M.2 DLAs for AI inference
Table 23‑7. Comparison of ADAS processors
Table 23‑8. Comparison of autonomous-vehicle processors
Table 23‑9. Comparison of automotive DLAs
Table 23‑10. Comparison of low-power embedded DLA SoCs
Table 23‑11. Comparison of high-performance embedded DLA SoCs
Table 23‑12. Comparison of embedded DLA coprocessors
Table 23‑13. Comparison of ultra-low-power DLAs below 5 GOPS
Table 23‑14. Comparison of ultra-low-power DLAs above 50 GOPS