Microprocessor Report (MPR)

R-Car V3U Runs Lockstep Neural Nets

Renesas Processor Delivers 66 TOPS for Level 2 and Level 3 ADASs

March 22, 2021

By Mike Demler


The new Renesas R-Car V3U offers more-extensive functional-safety features and much greater performance than its predecessor, the R-Car V3H. To meet ISO 26262 requirements, most ADAS processors implement hardware redundancy only in an isolated safety island, but the V3U employs that technique in all the compute subsystems, including the deep-learning accelerators (DLAs). Although the company positions a single chip for Level 2 ADASs and dual chips for Level 3, the V3U delivers greater peak throughput—66 trillion operations per second (TOPS)—than the processors in most Level 4 autonomous vehicles (AVs) now under development.

The V3U implements redundancy in five functional domains using a technique called dual-core lockstep (DCLS), as Figure 1 shows. The control domain applies the most common technique: its two Cortex-R52 CPUs manage real-time ASIL D–compliant operations throughout the SoC, and they control security as well as external communications to the automotive bus. The application domain integrates eight Cortex-A76 CPUs that run as four lockstep pairs; the computer-vision (CV) domain allows two of the three DLA cores to run in lockstep, sacrificing one-third of the possible neural-network throughput. The interconnect block applies lockstep operations to I/Os, and the memory interface permits lockstep DRAM operations.

Figure 1. Renesas R-Car V3U functional-safety architecture. The design meets ASIL D functional-safety requirements by employing hardware redundancy in each block outlined in blue, including lockstep CPUs and DLAs as well as dual-DRAM interfaces.

The R-Car V3U is a 230mm2 chip that TSMC manufactures in 12nm technology. Renesas is now sampling the processor, but because of long automotive-design cycles, it plans to begin volume production in 2Q23. It also offers the Falcon development board, which includes the chip, four LPDDR4X DRAMs, connectors for automotive interfaces, a display, storage, and three video cameras.

Safety Runs Throughout

As Figure 2 shows, the V3U’s application CPUs comprise eight cache-coherent Cortex-A76s clocking at 1.8GHz. Renesas withheld the cache sizes. The real-time control/safety domain employs lockstep Cortex-R52 CPUs (see MPR 10/3/16, “Cortex-R52: Safer Real-Time Control”), which run an Automotive Open System Architecture (AUTOSAR) RTOS at 1.0GHz. The GPU is a six-year-old Imagination PowerVR GE7800 operating at 600MHz; it’s a low-end model, but according to Renesas, it handles general-purpose-GPU (GPGPU) tasks.

Figure 2. Renesas R-Car V3U architecture. The chip integrates eight Cortex-A76 application CPUs configurable as four lockstep pairs, along with a lockstep Cortex-R52 control subsystem. Four ISPs prepare incoming video for analysis in the CV accelerators and DLAs.

All of the V3U’s compute cores perform self-diagnosis by applying error-correction and error-detection codes (ECC and EDC) to their inputs and outputs. The A76s operate as four lockstep pairs: each pair consists of a primary core and a checker core that both execute the same instructions. A comparator detects any mismatch between the outputs. The error code identifies which core caused a fault, enabling the processor to switch to the good one in just a few cycles. Because clock jitter and noise can produce the same error in both, the design also employs temporal diversity, delaying the checker core by six cycles relative to the main core. A monitor circuit detects any clock instability that would affect the redundant circuits.
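The lockstep-plus-temporal-diversity scheme can be sketched in software. The following toy simulation is our illustration, not Renesas’s logic: the checker re-executes each instruction six cycles after the primary, and the comparator reports the first instruction on which the two results disagree.

```python
from collections import deque

DELAY = 6  # checker trails the primary by six cycles (temporal diversity)

def run_lockstep(program, inject_fault_at=None):
    """Simulate a DCLS pair: a primary core and a delayed checker core.

    `program` is a list of (opcode, a, b) tuples. A fault (single-bit flip)
    can be injected into the primary's result at a given instruction index.
    Returns the index of the first detected mismatch, or None.
    """
    def execute(op, a, b):
        return a + b if op == "add" else a * b

    pending = deque()  # primary results waiting for the checker to catch up
    for i, (op, a, b) in enumerate(program):
        result = execute(op, a, b)
        if i == inject_fault_at:
            result ^= 1  # upset affects the primary core only
        pending.append((i, result))
        # Once the checker is DELAY cycles behind, it re-executes and compares.
        if len(pending) > DELAY:
            j, primary_out = pending.popleft()
            if primary_out != execute(*program[j]):
                return j  # comparator flags the faulty instruction
    while pending:  # drain the remaining comparisons
        j, primary_out = pending.popleft()
        if primary_out != execute(*program[j]):
            return j
    return None

program = [("add", i, i) for i in range(20)]
print(run_lockstep(program))                      # None: both cores agree
print(run_lockstep(program, inject_fault_at=3))   # 3: fault localized
```

Because the checker runs the same stream later in time, a transient glitch that corrupts both cores simultaneously corrupts *different* instructions and is therefore still caught, which is the point of the six-cycle skew.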

The architecture further supports ASIL D by applying a safety ID (SID) to all memory and register operations; the control processor checks the SID to ensure freedom from interference (FFI) when two blocks operating at different safety levels try to access the same resource.
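A minimal sketch of such an FFI check follows. The policy table and all names are our own assumptions for illustration (Renesas hasn’t disclosed the SID scheme): each transaction carries the initiator’s safety level, and the checker blocks writes from lower-ASIL initiators to higher-ASIL resources.

```python
# Illustrative FFI check: lower-ASIL blocks may not corrupt higher-ASIL state.
ASIL_LEVEL = {"QM": 0, "A": 1, "B": 2, "C": 3, "D": 4}

def ffi_check(initiator_sid, resource_sid, is_write):
    """Allow reads freely; permit writes only from an equal or higher
    safety level (a simplification of any real interference policy)."""
    if not is_write:
        return True
    return ASIL_LEVEL[initiator_sid] >= ASIL_LEVEL[resource_sid]

print(ffi_check("B", "D", is_write=True))   # False: ASIL B can't write ASIL D state
print(ffi_check("D", "B", is_write=True))   # True
print(ffi_check("B", "D", is_write=False))  # True: reads don't interfere here
```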

The V3U integrates a total of 41.6MB of SRAM. The CV and DLA cores use 8MB as scratchpad memory; another 1MB is allocated to system storage and 2MB to real-time storage for the Cortex-R52s. The company withheld details of the remaining 30.6MB beyond explaining that every block has its own SRAM and that most of it is inaccessible to software. The chip has a 128-bit DRAM interface that supports LPDDR4X-4266, as well as an eMMC port and two QSPI flash-memory interfaces.

The R-Car V3U has four ISPs, each connected to one of its MIPI CSI2 camera ports. It can also output video to two MIPI DSI–connected displays. Standard automotive I/Os include eight CAN-FD, six Ethernet AVB, and one two-channel FlexRay interface. The PCIe port offers four Gen4 lanes. The Aurora PCIe interface works with a lightweight link-layer protocol developed by Xilinx for high-speed serial communication; the specification is open source and royalty-free. The interprocessor-communication (IPC) interface enables connection of multiple processors using the High-Speed Serial Link (HSSL).

ADAS Requires Multiple Engines

The V3U performs CV and inference tasks through a combination of single-function accelerators, programmable DLAs, and DSPs. Four of the accelerators execute the aggregate-channel-feature (ACF) algorithm that detects pedestrians by identifying facial characteristics. Two dense-optical-flow (DOF) accelerators track motion by computing a vector for each pixel in an image. The IMP blocks implement Renesas’s proprietary image-recognition engine, which performs pre- and postprocessing in conjunction with the DLAs using a dedicated 2MB scratchpad memory (IMPC). The pyramid scalers (PSCs) shrink images to four different sizes. The stereovision (STV) accelerators extract depth information from two front-facing cameras.

Renesas describes the CVE (computer-vision engine) as a DSP that implements a 32-slot SIMD instruction set optimized for CV. The chip integrates eight CVE cores along with two general-purpose DSPs that can handle lidar- and radar-signal processing.

The V3U employs three identical DLA cores. As Figure 3 shows, they have a ping-pong buffer to stream input data and weights, increasing utilization of the multiply-accumulate (MAC) units. The DLA compute unit contains 16 convolution engines, each integrating 32x27 INT8 MAC arrays for a total of 13,824 MAC units per DLA. Each DLA has a 2MB scratchpad memory (SPM) that connects to the convolution engine through a 512-bit bus. The SPM is organized in 32 banks, each comprising 64KB of data storage and 8KB for ECCs. The multibank arrangement employs clock gating, which conserves power by eliminating toggling from all but the accessed bank. The DLA also includes function units that handle activation, pooling, and other nonconvolution layers.

Figure 3. R-Car V3U DLA. These inference engines integrate 16 convolution units, each employing a 32x27 MAC-unit array, along with separate units for nonconvolution layers. The DLA includes a 2MB scratchpad memory, organized in 32x64KB banks, that connects to the input buffers through a 512-bit bus.
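The MAC count and the chip’s peak throughput follow directly from those figures; counting a multiply-accumulate as two operations:

```python
# Peak-throughput arithmetic for the V3U DLA, using the figures in the text.
ENGINES_PER_DLA = 16          # convolution engines per DLA
MACS_PER_ENGINE = 32 * 27     # INT8 MAC array in each engine
CLOCK_HZ = 800e6              # maximum DLA clock
OPS_PER_MAC = 2               # one multiply plus one accumulate

macs_per_dla = ENGINES_PER_DLA * MACS_PER_ENGINE
tops_per_dla = macs_per_dla * OPS_PER_MAC * CLOCK_HZ / 1e12
chip_tops = 3 * tops_per_dla  # three identical DLA cores

print(macs_per_dla)            # 13824 MAC units per DLA
print(round(tops_per_dla, 1))  # 22.1 TOPS per DLA
print(round(chip_tops, 1))     # 66.4 TOPS for the chip
```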

To optimize SPM storage efficiency and bandwidth, the V3U organizes data in two formats depending on which is best for a particular neural-network layer. In a typical convolutional neural network (CNN), the feature maps comprise multiple channels of 2D pixel data. In the single-channel-multiposition (SCMP) format, the SPM on each cycle transfers 512 bits that represent multiple rows of one feature-map channel. In the single-position-multichannel (SPMC) format, the transfers represent groups of data in the same position across multiple channels. In a 16-channel feature map, for example, each 512-bit transfer carries 32 bits of data from every channel (512/16 = 32).

SCMP is the most efficient format for storing image data to feed the first CNN layers, whereas SPMC is more efficient for processing the intermediate layers, because pixel arrays shrink and channel depth increases as data moves through the network. According to the company’s measurements, applying the SPM storage options to VGG-16 reduces DRAM bandwidth by more than 10x and more than quadruples power efficiency.
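The two orderings can be illustrated with a toy feature map. This sketch is our simplification, streaming one element at a time rather than the hardware’s 512-bit beats, to show only how the traversal order differs:

```python
# Illustrative sketch (not Renesas's actual hardware format): channel-major
# (SCMP-like) versus position-major (SPMC-like) ordering of a feature map
# stored as fmap[channel][row][col].

def scmp_stream(fmap):
    """One channel at a time, many positions per transfer."""
    return [v for channel in fmap for row in channel for v in row]

def spmc_stream(fmap):
    """One position at a time, all channels together."""
    channels, rows, cols = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [fmap[c][y][x]
            for y in range(rows) for x in range(cols)
            for c in range(channels)]

# A tiny 2-channel, 2x2 feature map:
fmap = [[[1, 2], [3, 4]],      # channel 0
        [[5, 6], [7, 8]]]      # channel 1
print(scmp_stream(fmap))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(spmc_stream(fmap))  # [1, 5, 2, 6, 3, 7, 4, 8]
```

For deep, small feature maps, the position-major order keeps all of a convolution window’s channels adjacent, which is why SPMC suits the intermediate layers.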

A DLA That Thinks Twice

Running at up to 800MHz, each V3U DLA can deliver 22 TOPS. In tests using VGG-16, Renesas measured actual throughput of 32.3 TOPS across the three cores, indicating about 49% MAC utilization, much better than typical DLAs achieve. The three DLA cores, the SPMs, and the interconnect together consumed 5.3W. The company also characterized an unspecified neural network after optimizing it to run on the V3U hardware, yielding 60.4 TOPS while consuming 4.4W in the DLAs and associated components.

Power efficiency thus increases from 6.1 to 13.8 TOPS per watt, but the unspecified network has 90% zero-value data and weights, so the gain depends heavily on sparsity. The company says many of the feature maps and networks it tested include such sparse layers. In its datasheet, it specifies the optimized 60-TOPS peak throughput from evaluation-board measurements, but we believe typical throughput will be less. Nevertheless, nearly all other AI-processor vendors publish theoretical peak DLA throughputs that users will never see on real neural networks.

As described above for the CPU subsystems, ASIL D compliance normally requires hardware redundancy. Applying that method to DLAs is inefficient, however, because it requires duplicating the large MAC arrays and SRAMs at the expense of much greater die area and power. Instead, Renesas developed a software-defined method that enables dynamic reconfiguration of two DLAs to run in lockstep only for the most safety-critical layers.

ASIL D compliance requires that hardware maintain at least a 90% single-point-fault metric (SPFM). In other words, safety mechanisms must cover at least 90% of the function blocks (or elements in ISO vernacular) in which a failure can violate a safety goal. On the basis of typical ADAS neural networks, Renesas estimates that no more than 10% of a DLA’s tasks must run in lockstep, reducing the inference engine’s throughput by just a few percent.

Figure 4 shows the operating sequence for lockstep mode. In the first step, a function block that Renesas calls the LDMAC (lockstep direct-memory access) loads the same data from DRAM into both SPM1 and SPM2, which reside in separate DLAs. In the second step, the respective CNN engines execute the same neural-network operations, but with the checker engine delayed by two clock cycles. In this mode, the solitary DLA and the lockstep DLAs deliver up to 44 TOPS total. In the final step, the LDMAC compares the two CNN results; if they match, it transfers the output to DRAM.

Figure 4. DLA lockstep operation. The R-Car V3U can dynamically configure two of its three DLAs to run in lockstep. The LDMAC loads the same data into each DLA’s scratchpad memory and, on completion, checks whether the two results match.
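The three-step sequence maps naturally to a check-then-commit pattern. In this sketch, `run_cnn`, `commit`, and the exception are illustrative names of our own, not Renesas APIs:

```python
# Sketch of the DLA lockstep sequence: load identical data into both
# scratchpads, run both engines, and commit the output only if they match.

class LockstepFault(Exception):
    """Raised when the two DLA results disagree (a detected fault)."""

def ldmac_run(layer_input, weights, run_cnn, commit):
    spm1 = (layer_input, weights)   # step 1: same data into SPM1...
    spm2 = (layer_input, weights)   # ...and into SPM2, in a second DLA
    out1 = run_cnn(*spm1)           # step 2: both engines execute
    out2 = run_cnn(*spm2)           #   (checker delayed two cycles in hw)
    if out1 != out2:                # step 3: compare before writing DRAM
        raise LockstepFault("DLA outputs disagree")
    commit(out1)                    # only verified results reach DRAM
    return out1

# Usage with a toy "CNN" (a dot product) standing in for the real engine:
def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

outputs = []
print(ldmac_run([1, 2, 3], [4, 5, 6], dot, outputs.append))  # 32
```

Because the comparison happens once per layer rather than per cycle, this scheme suits coarse-grained accelerators far better than CPU-style per-instruction lockstep would.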

Entering the Top Tier

The combination of DLA performance and functional-safety features, along with a comprehensive set of automotive subsystems, puts the R-Car V3U in the top tier of announced ADAS processors. The group also includes Nvidia’s Xavier and Tesla’s FSD ASIC. Because the V3U targets Level 2 and Level 3 ADASs, it handles only four camera streams, whereas the other two processors can create a 360-degree surround view using up to 16 cameras.

As Table 1 shows, the Renesas chip delivers twice the DLA performance of Xavier, which employs the same 12nm technology. To achieve peak throughput, its DLA consumes less than 5W—about half the chip’s typical power by our estimate. Assuming 10W for the entire SoC, the V3U provides six times the power efficiency. Xavier integrates Nvidia’s custom Carmel CPUs, which offer ASIL D compliance by enabling lockstep operation and temporal diversity, but it lacks the other safety features of the V3U, such as lockstep DLA operation and FFI.

Table 1. Comparison of high-performance ADAS processors. The R-Car V3U delivers more than twice the AI throughput of Xavier and about the same as the FSD ASIC, but it’s much more power efficient. (Source: vendors, except *The Linley Group estimate)

Nvidia has preannounced its next-generation Orin products, including a 35W card that will deliver 100 TOPS, but it has yet to release further details. We expect Orin will enter volume production in 2022, around the same time as the V3U.

At 230mm2, the V3U die is 12% smaller than Tesla’s FSD ASIC (see MPR 5/13/19, “Tesla Rolls Its Own Self-Driving Chip”); the latter employs a 14nm Samsung process. The two chips achieve proportional AI performance per unit die area: 66 TOPS and 74 TOPS, for a 0.89 ratio. Nevertheless, the FSD ASIC consumes more than three times the power, likely because of its more powerful CPU and GPU clusters.
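Back-computing the FSD die size from the 12% figure confirms the near-equal area efficiency; the ~261mm2 value is our derivation, not a quoted spec:

```python
# Area-efficiency arithmetic from the figures above.
v3u_mm2 = 230.0
fsd_mm2 = v3u_mm2 / (1 - 0.12)     # ~261mm2, derived from "12% smaller"
v3u_tops, fsd_tops = 66.0, 74.0

print(round(fsd_mm2))               # ~261mm2 for the FSD ASIC
print(round(v3u_tops / fsd_tops, 2))  # 0.89 throughput ratio
print(round(v3u_tops / v3u_mm2, 2))   # ~0.29 TOPS/mm2 (V3U)
print(round(fsd_tops / fsd_mm2, 2))   # ~0.28 TOPS/mm2 (FSD)
```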

The Renesas and Tesla designs exhibit many differences, beginning with the CPU configurations. The FSD ASIC employs older Cortex-A72 CPUs, which deliver about 70% of the Cortex-A76’s single-thread performance, according to our performance metric. Because it integrates 12 cores arranged in three clusters, however, Tesla’s chip provides much greater CPU performance in aggregate; the cores lack support for lockstep operation. The FSD ASIC integrates two DLA cores, but they run independently. For redundancy, Tesla’s FSD computer has two of the ASICs, doubling the system power to 72W.

Resetting Expectations

Although the R-Car V3U offers more than 15 times the CV and AI throughput of the predecessor V3H (see MPR 12/3/18, “Renesas Focuses on ADAS Cameras”), Renesas appears to be greatly downplaying its capabilities. The new chip delivers big performance boosts in all of its subsystems, with substantial upgrades to the CPUs, control processor, CV accelerators, DLAs, and DRAM bandwidth, while simultaneously tripling power efficiency. But rather than promote the device for future self-driving cars, the company is smartly positioning it for the Level 2 ADASs that now appear in mainstream vehicles and for the Level 3 systems that will become mainstream in the next 3–5 years.

Although a few luxury vehicles offer limited Level 3 capabilities, and although Tesla continues to hype its system as “full self-driving,” most carmakers have become leery of promoting hands-free operation. Instead, they call such systems Level 2+, aiming to emphasize that drivers remain responsible for controlling their vehicles. To ensure safe operation, most carmakers are including driver-monitoring systems (see MPR 3/15/21, “Driver Monitoring Makes ADAS Safer”), and Tesla recently removed some owners from its FSD beta for failing to remain attentive.

The V3U is competitive with the leading ADAS and AV processors now in production, but it places a much greater emphasis on functional safety. Although competitors have yet to address how their inference engines will support ASIL D, R-Car makes it easy for programmers to employ hardware redundancy as a check on critical neural-network operations. The V3U can dynamically couple two of its DLAs, a much more efficient technique than Tesla’s brute-force hardware duplication.

Renesas vies with NXP for the top spot among automotive-semiconductor suppliers, but that European rival lacks a processor to compete with the R-Car V3U. Now that Toshiba has ceased to develop its Visconti ADAS processors (see MPR 3/25/19, “Toshiba Moves Up to Level 3 ADAS”), Renesas has one less competitor in its home market as well. Although it primarily supplies R-Car chips to Japanese carmakers, the company’s largest customer, Toyota, is also one of the auto industry’s top-selling manufacturers.

Because of its broad automotive lineup, which includes battery management, MCUs, and power devices, Renesas has expanded its sales to automotive customers in China, now the world’s largest car producer. Newer competitors such as Mobileye, Nvidia, and several startups lack such complete offerings. We expect the R-Car V3U will attract many carmakers and Tier One suppliers looking to sell more-powerful ADASs while ensuring support for the highest functional-safety standards.

Price and Availability

Renesas withheld pricing for the R-Car V3U. It’s now shipping samples as well as the Falcon development board. The company expects the chip to enter volume production in 2Q23. For details on the R-Car V3U SoC, access www.renesas.com/us/en/products/automotive-products/automotive-system-chips-socs/r-car-v3u-best-class-r-car-v3u-asil-d-system-chip-automated-driving.
