Microprocessor Report (MPR)

New AI Engines Give Versal an Edge

Xilinx AI Edge Products Target Automotive, Industrial Designs

June 28, 2021

By Linley Gwennap


As promised, Xilinx is continuing to expand its Versal line, introducing its AI Edge family. The new parts are similar to those in the AI Core family, which is now in production, but extend Versal downward to serve applications that require lower cost and power. They also introduce an upgraded AI engine that doubles the peak performance of the previous generation, yielding a mid-generation boost. Versal devices combine a complete CPU subsystem and other hard logic with a programmable FPGA fabric to form what Xilinx calls an ACAP.

The new models span a broad range. The low-end ones deliver up to 8 trillion INT8 operations per second (TOPS) at less than 10W, while the top-of-the-line VE2802 reaches 202 TOPS with a typical power of 75W. All include dual Cortex-A72 CPUs and at least one DDR4 DRAM interface. These products target camera-based automotive systems, robotics, drones, and similar edge equipment. Xilinx expects them to sample by the middle of next year.

The biggest functional change is the new AI Engine-ML, which doubles the multiply-accumulate (MAC) throughput of the original AI Engine (which appears in the AI Core line). Whereas the original design employs two 512-bit MAC units, the new one has four, as Figure 1 shows. This change boosts the peak TOPS rate as well as power efficiency (TOPS/W) relative to the older models. To support the greater throughput, the new design also doubles the amount of memory per engine to 64KB.

  • Figure 1. Second-generation Versal AI engine. Each cycle, the VLIW engine can perform two scalar operations, two loads, one store, and four vector operations. Together, the four vector units can perform 256 8x8-bit integer MAC operations per cycle.

Separately, the company announced plans to acquire Silexica, a small software company that provides tools to create FPGA-based accelerators from C code. These tools enable FPGA customers to avoid RTL design by converting software models into FPGA logic using a high-level synthesis flow. Although they simplify the design process, the performance of such accelerators considerably lags that of traditional RTL. The German startup had raised $28 million in funding; the acquisition price remains undisclosed. Xilinx plans to integrate the Silexica tools into its Vitis software stack.

A Versatile Family

The AI Edge products derive from the original Versal design but lack some of its high-speed I/O features (see MPR 10/8/18, “Xilinx Versal Surpasses UltraScale+”). Xilinx offers Versal parts in three speed grades; the company’s TOPS ratings assume the AI engines operate at 1.3GHz, which requires the highest speed grade (-3). The chips can deliver additional AI throughput using the programmable logic and DSP engines, but this approach reduces performance per watt. Instead, the fabric is typically used for other types of accelerators.
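The per-cycle MAC count in the Figure 1 caption and the 1.3GHz clock behind the TOPS ratings are enough for a back-of-envelope check of the peak numbers. The sketch below is illustrative only: it assumes each MAC counts as two operations (a common convention, but not one the article states), and the engine counts it prints are inferences from the quoted ratings, not Xilinx specifications.

```python
# Back-of-envelope peak INT8 throughput for one AI Engine-ML.
MACS_PER_CYCLE = 256   # 8x8-bit MACs per cycle per engine (Figure 1 caption)
OPS_PER_MAC = 2        # assumption: a MAC counts as a multiply plus an add
CLOCK_HZ = 1.3e9       # clock behind the TOPS ratings (-3 speed grade)


def engine_peak_tops(macs_per_cycle=MACS_PER_CYCLE,
                     ops_per_mac=OPS_PER_MAC,
                     clock_hz=CLOCK_HZ):
    """Peak tera-operations per second for a single engine."""
    return macs_per_cycle * ops_per_mac * clock_hz / 1e12


def implied_engine_count(chip_tops, per_engine_tops):
    """Engines needed to reach a chip-level rating (an inference, not a spec)."""
    return chip_tops / per_engine_tops


per_engine = engine_peak_tops()
print(f"per engine: {per_engine:.3f} TOPS")
for name, tops in (("VE2802", 202), ("low-end models", 8)):
    print(f"{name}: ~{implied_engine_count(tops, per_engine):.0f} engines")
```

Under these assumptions, each engine peaks at about 0.67 TOPS, so the VE2802's 202 TOPS rating implies roughly 300 engines and the 8 TOPS low-end parts about a dozen.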

To create the AI Edge line, Xilinx scales the number of FPGA gates (LUTs), AI engines, memory blocks, and I/O, as Table 1 shows. The family breaks down into three chip pairs, each presumably from a different die. The VE2802 and VE2602 provide the most LUTs and by far the most AI engines along with up to 111Mbits (about 14MB) of on-chip SRAM. They have extensive high-speed I/O, including two Ethernet ports that scale to 40GbE as well as a slew of 32Gbps serdes that can emulate various communications protocols using a controller instantiated in the LUTs. They feature three 64-bit DRAM interfaces and four x8 PCIe Gen4 ports.

  • Table 1. Selected Versal AI Edge products. The new line spans a wide performance and power range. All models are manufactured in TSMC 7nm. *AI engines only; †ResNet-50 inferences per second, MLPerf rules, batch=1; ‡AI workloads. (Source: Xilinx, except §The Linley Group estimate)

FPGA power ratings are difficult to quote, as they vary widely depending on what portions of the chip are in use. According to Xilinx, the VE2802 consumes about 75W for AI applications; this mode stresses the AI engines but not the programmable fabric or the high-speed I/O. A design that extensively employs the fabric and I/O could push the power to 100W or more. These values are typical consumption; the TDP would be considerably greater.

The VE2302 and VE2202 deliver up to 23 TOPS but reduce power to about 20W. They have only a single Ethernet port and far fewer serdes. The VE2102 and VE2002 omit high-speed I/O entirely, allowing them to operate below 10W. But having just a few AI engines, they’re constrained to at most 8 TOPS. The line also includes a seventh model (not shown), the VE1752, that appears to be a repurposed AI Core device; it’s similar to the VE2802 but sports the original AI engine, yielding half the peak TOPS.

Xilinx provided some ResNet-50 performance estimates for the new design. At a batch size of one, it expects the VE2802 to achieve about 9,500 inferences per second (IPS). This score is consistent with the company’s recent MLPerf submission for the VC1902 (the most powerful AI Core model) scaled by peak TOPS. The VE2802’s ResNet-50 throughput is about 50% more than the VC1902’s, owing to the new chip’s second-generation AI engines. At maximum batch size, the VE2802 could reach 19,000 IPS, according to Xilinx. The company also estimates the chip will achieve 0.16ms latency on ResNet-50; if the actual product meets this mark, it would best even Nvidia’s fastest Ampere GPUs.
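The throughput and latency estimates above can be related with simple arithmetic. The following sketch assumes the 0.16ms latency and the batch=1 throughput describe the same configuration, which Xilinx's estimates do not confirm, so the result is a rough illustration rather than a measured property of the chip.

```python
# Relating Xilinx's VE2802 ResNet-50 estimates to one another.
BATCH1_IPS = 9_500       # estimated inferences per second at batch=1
MAX_BATCH_IPS = 19_000   # estimated inferences per second at maximum batch
LATENCY_S = 0.16e-3      # estimated ResNet-50 latency (0.16ms)


def mean_interval_ms(ips):
    """Average time between completed inferences, in milliseconds."""
    return 1e3 / ips


def inferences_in_flight(ips, latency_s):
    """Little's law: average concurrency = throughput x latency."""
    return ips * latency_s


interval = mean_interval_ms(BATCH1_IPS)
in_flight = inferences_in_flight(BATCH1_IPS, LATENCY_S)
batching_gain = MAX_BATCH_IPS / BATCH1_IPS

print(f"mean interval at batch=1: {interval:.3f} ms")
print(f"implied inferences in flight: {in_flight:.2f}")
print(f"throughput gain from batching: {batching_gain:.1f}x")
```

If both estimates hold together, a new result completes about every 0.105ms while each inference takes 0.16ms, implying about 1.5 inferences overlap on average; batching then doubles throughput.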

Matching Orin’s AI Performance

Although Xilinx prefers to compare the AI Edge chips against Nvidia’s Xavier, by the time they enter production, they’ll compete against Nvidia’s upcoming Orin processor. Orin is actually a bit ahead, having already achieved first silicon whereas the AI Edge chips haven’t. Like Versal, Orin will be available in multiple models, ranging from 25W to 100W TDP. We compare the two top-of-the-line products, the VE2802 and Orin X, although Nvidia has yet to disclose full specifications for its next-generation device (see MPR 5/17/21, “Nvidia Orin Turbocharges ADAS”).

Nvidia rates the 100W Orin X at 254 TOPS, about 25% more than the VE2802, as Table 2 shows. The company sometimes employs sparsity to inflate its TOPS ratings for Ampere GPUs. We believe it refrained from doing so with Orin, whose integer AI accelerator doesn’t support sparsity, but the specification is unclear on this point.

On the basis of Xavier’s recent MLPerf results, we estimate Orin X will deliver about 17,500 IPS on ResNet-50 at maximum batch size but only 7,100 at batch=1. We estimate the Versal part’s TDP to be about the same as Orin’s; Xilinx reports a peak power of 81W when running ResNet-50 at batch=1, but power will be greater for larger batches. Thus, Versal could have a small performance-per-watt advantage, particularly at batch=1, but it’s difficult to tell given the preliminary state of both products.

  • Table 2. Edge-processor comparison. Despite its lower TOPS rating, Versal could outperform Orin on ResNet-50. Whereas Nvidia provides common accelerators for graphics and image processing, Xilinx customers can create their own accelerators using the FPGA fabric. (Source: vendors, except *The Linley Group estimate) 
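The comparison figures quoted in the text can be reduced to a few ratios. This sketch uses only numbers stated above (the Orin X throughput values are The Linley Group estimates); the `EdgeChip` class and helper names are hypothetical, introduced for illustration. Power is left out because the available figures (81W measured peak at batch=1 for Versal, 100W TDP for Orin X) are not like-for-like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeChip:
    name: str
    peak_tops: float      # vendor-rated INT8 TOPS
    ips_batch1: float     # ResNet-50 inferences/s at batch=1
    ips_max_batch: float  # ResNet-50 inferences/s at maximum batch


VE2802 = EdgeChip("VE2802", 202, 9_500, 19_000)   # Xilinx estimates
ORIN_X = EdgeChip("Orin X", 254, 7_100, 17_500)   # IPS: Linley Group estimates


def ips_per_tops(chip, batch1=True):
    """ResNet-50 throughput extracted per rated TOPS (a rough utilization proxy)."""
    ips = chip.ips_batch1 if batch1 else chip.ips_max_batch
    return ips / chip.peak_tops


tops_ratio = ORIN_X.peak_tops / VE2802.peak_tops
batch1_ratio = VE2802.ips_batch1 / ORIN_X.ips_batch1
max_batch_ratio = VE2802.ips_max_batch / ORIN_X.ips_max_batch

print(f"Orin X / VE2802 peak TOPS:   {tops_ratio:.2f}x")
print(f"VE2802 / Orin X IPS, batch=1: {batch1_ratio:.2f}x")
print(f"VE2802 / Orin X IPS, max:     {max_batch_ratio:.2f}x")
for chip in (VE2802, ORIN_X):
    print(f"{chip.name}: {ips_per_tops(chip):.1f} IPS/TOPS at batch=1, "
          f"{ips_per_tops(chip, batch1=False):.1f} at max batch")
```

On these numbers, the VE2802 extracts more ResNet-50 throughput per rated TOPS at both batch sizes, which is how it can lead despite the lower peak rating; the gap narrows at maximum batch size.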

Although the VE2802 includes a complete CPU subsystem, Orin X provides about 15x more CPU performance, enough to handle high-end autonomous applications without an external processor. The Orin cores enable Arm’s split-lock function, so they can operate in redundant pairs to enhance functional safety. Orin also includes accelerators such as a high-performance GPU, image processor, and video codec that can efficiently offload imaging functions from the CPUs. Versal lets customers implement their own accelerators in the FPGA fabric, permitting greater customization but requiring design skills that smaller customers lack. For basic image processing and video conversion, the fabric will be less efficient than Nvidia’s hardened units.

Versal’s AI capabilities remain somewhat unproven. For MLPerf, Xilinx submitted only a ResNet-50 score while ignoring the other five neural networks in that suite. It has posted performance results for several other CNNs—including Inception, VGG, and Yolo—that are available through its “model zoo.” None of these published results includes latency, a critical metric for real-time image applications. 

Versal Combines FPGA and AI

With its AI Edge and AI Core products, Xilinx now offers AI-powered processors that can serve devices ranging from smart sensors to data-center accelerators. For non-AI applications, Versal also includes Prime and Premium models, which omit the AI engines (see MPR 3/20/20, “Versal Premium Targets Core Networks”). The Prime and Core products entered volume production in April, and the Premium and Edge devices should follow next year. The company has yet to introduce offerings in the specialized Versal AI RF and HBM families, which will add their eponymous capabilities. Once the 7nm Versal line is complete, Xilinx will begin disclosing next-generation FPGAs, possibly in 3nm, although it will likely be a subsidiary of AMD by that time (see MPR 10/26/20, “Shopping for Chip Companies”).

The AI Edge products are well suited to advanced automotive systems, particularly forward-looking cameras that must process high-resolution images at 30 or even 60 frames per second. In this application, the AI engines can run one or more neural networks to analyze the images and send them to a more powerful central processor that handles path planning and vehicle control. In a simpler industrial robot, the Cortex CPUs, assisted by accelerators instantiated in the programmable fabric, can implement the control algorithms while the AI engines process the images. A military drone could use the fabric to implement radar or communications functions in addition to AI processing, although it may require an external host processor.

Nvidia targets similar applications with its Xavier and future Orin processors. Thanks to its upgraded AI engines, Versal AI Edge roughly matches Orin in neural-network performance and efficiency, at least for basic CNNs. But Nvidia has a proven and complete software stack that can handle a broader range of neural networks. Orin is a better choice for designers who can employ its programmable graphics and imaging pipelines as well as its standard interfaces. Versal is better for designers who want to create custom accelerators or interfaces using the chip’s FPGA fabric. These customers must either have RTL design skills or be willing to accept the lower performance of designing at a high level using the Silexica tools. 

To create its Versal products, Xilinx combines its traditional FPGA logic with standard processor components and custom AI engines. These hardened components enable the chips to deliver competitive AI performance and supplant SoCs in some applications. But the FPGA fabric remains an essential part of the value proposition. Processors from Nvidia and others replace that fabric with additional accelerators that are more power efficient and easier to program, making those processors better suited to mainstream designs. The FPGA logic, however, remains ideal for lower-volume designs that require more-custom solutions. The new AI Edge line extends Versal to lower-power applications, but its appeal still depends on the customer’s appetite for FPGA design.

Price and Availability

Xilinx expects to sample the Versal AI Edge products in 1H22. It withheld product pricing. For more information, see www.xilinx.com/versal-ai-edge. Versal AI performance data is at github.com/Xilinx/Vitis-AI/tree/master/models/AI-Model-Zoo.
