Microprocessor Report (MPR)

Untether Delivers At-Memory AI

Startup Packs Industry-Leading 2,000 TOPS Into a PCIe Card

November 2, 2020

By Linley Gwennap


Using what it refers to as an at-memory architecture, Untether AI has created a highly power-efficient accelerator that can achieve a stunning two petaop/s in a PCIe card. The at-memory design interleaves more than 250,000 tiny processing elements (PEs) inside a standard SRAM array. Putting the processing next to the memory enables massive data flow into the PEs. It also greatly reduces the power required to move data to the compute units; this movement consumes more power than the computation itself in traditional architectures. The startup is already sampling its TsunAImi card, which accelerates AI inference, and expects to ship production-qualified units in 2Q21.

Untether emerged from stealth and introduced its new architecture at the recent Linley Fall Processor Conference. CTO Martin Snelgrove, a former professor at the University of Toronto, helped found the company in 2018. After serving in executive roles at AMD and Xilinx, CEO Arun Iyengar helped Untether raise $27 million from Radical Ventures, Intel Capital, and other investors. The startup has about 60 employees, mostly in Toronto.

The Canadian company has quickly achieved first silicon of its initial chip, in part because of its highly regular design. The RunAI200 chip features 511 cores that contain a total of 192MB of primary SRAM, obviating the need for external DRAM. Operating at 960MHz, the cores generate peak performance of 502 trillion operations per second (TOPS) for the 8-bit integer (INT8) operations that commonly serve in neural-network inferencing. At a relatively low 100W TDP, RunAI200 offers more than 3x better TOPS per watt than Nvidia’s new Ampere A100 GPU (see MPR 6/8/20, “Nvidia A100 Tops in AI Performance”). The new chip also supports an “eco mode” that delivers 377 TOPS at 47W.
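
As a sanity check, the peak rating follows directly from the published core count, PE count, and clock speed, counting each multiply-accumulate as two operations per the usual convention. A short Python sketch using only the numbers above:

```python
# Reproduce RunAI200's peak-TOPS rating from its published specs.
cores = 511          # memory banks, each with its own RISC CPU
pes_per_core = 512   # 8 rows x 64 columns of MAC units
clock_hz = 960e6     # 960MHz
ops_per_mac = 2      # one multiply plus one add

peak_ops = cores * pes_per_core * ops_per_mac * clock_hz
print(f"Peak: {peak_ops / 1e12:.0f} TOPS")                 # ~502 TOPS
print(f"At 100W TDP: {peak_ops / 1e12 / 100:.1f} TOPS/W")  # ~5.0 TOPS/W
print(f"Eco mode: {377 / 47:.1f} TOPS/W")                  # ~8.0 TOPS/W at 47W
```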

To boost performance, the TsunAImi card includes four of these chips. At 2,000 TOPS, it outperforms all other high-end AI accelerators by more than 2x, as Figure 1 shows. To achieve this rating, the card has a 400W TDP, similar to the A100 card. Boasting nearly 800MB of on-chip memory, it can hold large models without using any DRAM. Untether expects the four-chip design to achieve 80,000 images per second (IPS) on ResNet-50, which would be a record for a single accelerator card. Alibaba’s HanGuang 800 card is rated at 78,563 IPS (see MPR 3/2/20, “Alibaba Uses Convolution Architecture”).

Figure 1. Untether TsunAImi. The PCIe card generates more than twice as many operations per second (TOPS) as the leading accelerators. (Data source: vendors)

Where It’s At

Untether differentiates its approach from in-memory computing as practiced by companies such as Ambient, Mythic, and GSI Technology (see MPR 7/20/20, “GSI Offers In-Memory Computing”). In-memory designs employ the memory cell itself to compute results, often through analog techniques. In contrast, Untether’s at-memory design works with standard SRAM structures and standard digital logic. Whereas most processor designers focus on moving the data to the compute units, the at-memory approach moves the compute units to the data. In this way, it’s similar to Upmem’s approach, but that startup instead uses DRAM technology (see MPR 8/26/19, “Upmem Embeds Processors in DRAM”). By sticking with SRAM, Untether breaks the memory bottleneck while implementing its processing elements in high-performance 16nm CMOS technology.

By the company’s accounting, a typical CPU or GPU spends 2.3 picojoules (pJ) to move one data byte from DRAM to the compute unit and only 0.2pJ to compute an 8-bit multiply-accumulate (MAC) operation. Untether instead places the PE, which is essentially a MAC unit, inside the SRAM array, as Figure 2 shows. This approach reduces the data-movement energy by 90%—about the same as the MAC operation burns. Each PE connects to 752 bytes of SRAM such that the memory and the compute unit have the same physical width, minimizing wasted die area. Each memory bank, which we call a core, has 8 rows and 64 columns of these structures, totaling 512 PEs and 385KB of SRAM.

Figure 2. Untether chip architecture. The design comprises an 8x64 array of cores (memory banks) that each contain one RISC CPU to control 512 processor elements (PEs) connected to a total of 385KB of SRAM. The rotator cuff transfers data east-west, the direct row transfer (DRT) runs north-south, and the P-Bus handles long-distance accesses.
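
By these numbers, the at-memory layout cuts per-MAC energy by roughly 6x. A back-of-the-envelope model, using only the figures the company quotes:

```python
# Energy per INT8 MAC, per the company's accounting.
DRAM_MOVE_PJ = 2.3  # move one byte from DRAM to the compute unit
MAC_PJ = 0.2        # compute one 8-bit multiply-accumulate

conventional = DRAM_MOVE_PJ + MAC_PJ      # fetch from DRAM, then compute
at_memory = 0.1 * DRAM_MOVE_PJ + MAC_PJ   # at-memory cuts movement energy 90%

print(f"Conventional: {conventional:.2f} pJ/MAC")   # 2.50 pJ
print(f"At-memory:    {at_memory:.2f} pJ/MAC "      # 0.43 pJ
      f"(~{conventional / at_memory:.0f}x less)")
```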

The PE is a simple design that contains an INT8 MAC unit and a few registers. Each cycle, it can load one byte (INT8 value) from its local SRAM and perform one 8x8-bit MAC operation using a 32-bit accumulator. The PE checks the input values before each computation; if either operand is zero, it clock-gates the MAC unit, reducing the PE’s power by about 50% for that cycle. All 512 PEs perform the same operation in parallel (SIMD) fashion as directed by a single RISC CPU that controls the core.
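
A few lines of Python can model this per-cycle behavior. The sketch below is a behavioral approximation based on the description above, not Untether’s actual design; signed-arithmetic details are elided.

```python
class PE:
    """Behavioral model of one processing element (approximation)."""
    def __init__(self, local_sram: bytes):
        self.sram = local_sram  # the PE's 752 bytes of adjacent SRAM
        self.acc = 0            # 32-bit accumulator

    def mac_cycle(self, addr: int, activation: int) -> bool:
        """One cycle: load a byte from local SRAM and multiply-accumulate
        it with the incoming activation. Returns True if the MAC was
        clock-gated because an operand was zero (saving ~50% PE power)."""
        operand = self.sram[addr]
        if operand == 0 or activation == 0:
            return True                       # zero operand: gate the MAC
        self.acc = (self.acc + operand * activation) & 0xFFFFFFFF
        return False
```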

The PE can use its MAC unit to handle other computations. It can perform multiplication or addition (and subtraction) by appropriately setting the input registers. Activation functions can be synthesized from sequences of these basic math operations. The PE also has a mask bit that enables conditional operations. For example, ReLU (a common activation function) requires zeroing all negative values; the PE can compute this function by copying the value’s sign bit into the mask and conditionally overwriting the value with zero wherever the mask bit is set (that is, wherever the value is negative).
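
Continuing the model, the ReLU example reduces to two SIMD steps: set the mask from the sign bit, then do a conditional overwrite. The mask polarity is our assumption for illustration; Untether hasn’t published the exact semantics.

```python
def relu_via_mask(values: list[int]) -> list[int]:
    """ReLU across one SIMD row using per-PE mask bits (illustrative)."""
    # Step 1: each PE copies its value's sign bit into its mask bit.
    masks = [1 if v < 0 else 0 for v in values]
    # Step 2: a conditional write zeroes each value whose mask bit is set.
    return [0 if m else v for v, m in zip(values, masks)]

print(relu_via_mask([5, -3, 0, -7, 12]))  # [5, 0, 0, 0, 12]
```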

Maintaining Direct Communication

The Untether design provides multiple ways for the PEs to communicate. A direct row transfer (DRT) moves an entire row of data either up or down in the memory array. As Figure 2 shows, these north-south transfers can proceed across core boundaries, although they require two extra cycles (about a nanosecond each) because of the longer physical distance. Once the data shifts via DRT, the PE can read the new data from its local SRAM.

The whimsically named rotator cuff transfers data east and west. Using these connections, data can shift one, two, or three PEs per cycle to the left or right, simplifying support for common 1x1 and 3x3 convolutions. The rotator cuff can shift data across core boundaries; it even has a “snake mode” that moves data from the far end of one row to the near end of the next row, enabling the entire chip to act like a linear set of PEs. For long-distance calls, the pipelined bus (P-bus) creates a mesh-like interconnection among the cores and the PCIe interface. Like DRT, the P-bus loads data directly into the memory array, where the PE can then access it.
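
A rough behavioral sketch of the east-west shift, including snake mode, appears below; the distance parameter may be one, two, or three PEs per cycle. This illustrates the described dataflow, not the actual interconnect, and the zero-fill at the entry edge is an assumption.

```python
def rotator_shift(rows: list[list[int]], dist: int, snake: bool) -> list[list[int]]:
    """Shift every row's activations `dist` PEs to the right in one cycle
    (dist may be 1, 2, or 3). In snake mode, data leaving the far end of
    one row enters the near end of the next, so the whole array behaves
    like one long linear chain of PEs."""
    if snake:
        flat = [v for row in rows for v in row]       # linearize the rows
        flat = [0] * dist + flat[:-dist]              # shift, zero-fill entry
        width = len(rows[0])
        return [flat[i:i + width] for i in range(0, len(flat), width)]
    return [[0] * dist + row[:-dist] for row in rows] # per-row shift

grid = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(rotator_shift(grid, 1, snake=True))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```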

When inferencing a typical neural-network layer, the PEs keep the weights in their local SRAM, which can hold hundreds of INT8 weights. Each cycle, the PE can load a weight value from this memory into its C register; if the weight is unchanged from the previous cycle (which is often the case), it can skip the load. At the same time, the PE can load an activation value from the rotator cuff into its A register. It then accumulates the product of the A and C registers using its MAC unit. By keeping the weights stationary and moving the activations only a short physical distance on each cycle, this approach minimizes the power consumed for data transfer.
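
This dataflow maps naturally onto a simple per-cycle loop: reload the C register only when the weight address changes, stream activations through the A register, and accumulate. A behavioral sketch, with register names taken from the description above:

```python
def weight_stationary(pe_sram: list[int], schedule: list[tuple[int, int]]) -> int:
    """Accumulate sum(weight * activation) over a cycle schedule.
    Each schedule entry is (weight_addr, activation); the C-register
    load is skipped when the address matches the previous cycle's."""
    acc, c_reg, last_addr, loads = 0, 0, None, 0
    for addr, a_reg in schedule:           # a_reg arrives via the rotator cuff
        if addr != last_addr:              # weight changed: reload C register
            c_reg, last_addr = pe_sram[addr], addr
            loads += 1
        acc += a_reg * c_reg               # MAC into the 32-bit accumulator
    print(f"{loads} weight loads for {len(schedule)} MACs")
    return acc
```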

Controlling the Core

Although the PEs handle most of the computation, they aren’t individually programmable. Only the RISC CPU can fetch and decode instructions, satisfying our definition of a core. Untether developed a custom instruction set to meet the unique requirements of this design. Using the PEs, the CPU essentially performs 4,096-bit-wide SIMD operations, but instead of a single wide register feeding one row of compute units, it implements eight rows of 64 MAC units, as Figure 3 shows. The “register file” is likewise divided into rows and columns to provide the data more quickly. The company calls this approach 2D SIMD.

Figure 3. Untether CPU design. The proprietary RISC CPU controls 512 PEs in a two-dimensional SIMD structure. Each PE contains a mask bit that, combined with the row mask, enables fine-grain control of the SIMD operations.

The CPU issues a SIMD instruction to all 512 PEs, causing them to perform the same operation. But it includes a row-mask register that can disable one or more rows if they’re unneeded for a specific calculation. As noted previously, each PE can be individually disabled using a mask bit. These capabilities allow fine-grain control of the SIMD operations. For example, algorithms can implement conditional execution by disabling PEs that finish early. The CPU also contains a set of row ALUs that can operate on values stored in the PEs. For instance, the row ALUs could sum the results from all the PEs, performing a portion of the softmax function.
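
The two predication levels compose as a simple AND of enables: a PE executes only if its row is enabled and its own mask bit is clear. A minimal sketch of a masked 2D SIMD operation (illustrative; the actual instruction encoding isn’t public):

```python
def simd_2d(array, op, row_enable, pe_mask):
    """Apply `op` to an 8x64 array of PE values. A row executes only if
    its entry in `row_enable` is set; within an enabled row, a PE whose
    mask bit is set is individually disabled and keeps its old value."""
    return [
        [op(v) if en and not m else v
         for v, m in zip(row, mrow)]
        for row, mrow, en in zip(array, pe_mask, row_enable)
    ]

# Example: square only row 0, with its second PE individually masked.
vals = [[1, 2, 3], [4, 5, 6]]
print(simd_2d(vals, lambda x: x * x,
              row_enable=[True, False],
              pe_mask=[[0, 1, 0], [0, 0, 0]]))
# [[1, 2, 9], [4, 5, 6]]
```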

In most other respects, the design is a simple 32-bit RISC CPU. It has an 8KB local memory to hold both instructions and scalar data such as loop counters and pointers. It can fetch and execute one instruction per cycle, either a SIMD operation or an overhead operation using its internal registers and ALU. The CPU can communicate directly with its nearest neighbors (north, south, east, and west), and it can employ the P-bus to access any other cores or to load data from the PCIe interface. To offload the CPU, each core also contains state machines that autonomously control data transfers on the rotator cuff, DRT, and P-bus, similar to DMA engines in a traditional design.

The RunAI200 chip provides a total of 192MB of SRAM for storing neural-network parameters; this figure excludes the RISC cores’ memory. The chip lacks a DRAM interface, however. Each chip has a PCIe Gen4 x16 interface, and the four-chip card sports a PCIe bridge chip that converts a single x16 interface to four x4 interfaces connecting to the accelerator chips. With all four chips running at full utilization, the card consumes 400W. Most neural networks, however, operate at much lower power; the company expects both ResNet-50 and Bert-Base to require 200W. Customers can limit the card to a lower TDP to reduce the necessary cooling—a 300W TDP, for example, will still allow it to operate at full performance on most neural networks. For maximum efficiency, the card can run in eco mode at 200W TDP, but doing so reduces performance by 25%.

Industry-Leading Power Efficiency

Although Untether will primarily sell the four-chip card, at a chip level the design is similar to recent Qualcomm and Tenstorrent releases in performance and power dissipation, as Table 1 shows. Architecturally, however, these three are quite different. To achieve its strong performance, Qualcomm’s Cloud AI 100 employs only 16 cores, each containing large systolic MAC arrays (see MPR 10/12/20, “Qualcomm Samples First AI Chip”). Tenstorrent offers its Grayskull chip with an array of 120 cores, each containing about 3x more SRAM and computing 3x more TOPS than RunAI200’s core (see MPR 4/13/20, “Tenstorrent Scales AI Performance”). But because of its greater core count, Untether comes out ahead of its fellow Toronto startup in total memory and TOPS.

Table 1. Deep-learning accelerators for inference. QPS=queries per second. Untether’s first chip excels in peak operations (TOPS) but falls a bit short on ResNet-50 performance. All Untether data is estimated. *50W typical when running ResNet-50 or Bert; †best batch size, INT8. (Source: vendors, except ‡The Linley Group estimate)

All three chips feature a hefty amount of memory, but the Qualcomm and Tenstorrent products also support external DRAM to efficiently handle large models. Although Untether omits a DRAM interface, the four-chip card packs 768MB of on-chip memory, so it doesn’t need DRAM to hold most neural networks. The company’s software helps divide large models among the chips in a manner that minimizes chip-to-chip communication.

At 20,000 IPS on ResNet-50, RunAI200 falls slightly short of the other two competitors. But at only 50W (typical) when running this model, it offers slightly better performance per watt than even Qualcomm’s design. It does even better on the Bert model, slightly outperforming Grayskull and opening a big efficiency lead. Untether can achieve even greater efficiency using its low-voltage eco mode, but Qualcomm has a similar mode. An advantage for RunAI200 is that its performance comes with a batch size of one; the smaller batch size should reduce latency, although Untether withheld latency results.

We must caution that none of these three parts is in production, and their neural-network performance results are all preliminary. Tenstorrent has yet to complete its software stack or even specify its chip’s clock speed, so its scores may improve. Untether’s estimates assume unimplemented software gains and thus are unproven. The design’s low utilization factor could indicate that the software is inefficient or that the RISC CPU spends too many cycles on nonconvolution operations.

More Ways to Skin the Cat

Untether’s at-memory architecture successfully attacks the data-movement problem in traditional von Neumann architectures, delivering a big power-efficiency boost relative to CPUs and GPUs. Other than Intel and Nvidia, however, AI-accelerator vendors have already moved away from von Neumann designs. A systolic array resembles Untether’s array of PEs, employing short connections to move data north-south and east-west from one MAC unit to the next. Small cores from companies such as Graphcore and Tenstorrent are similar to Untether’s in the number of MAC units and amount of SRAM. Putting small compute units into the SRAM isn’t that different from putting lots of SRAM into small compute units.

By combining four power-efficient chips on a board, however, Untether moves into rarefied air. Its PCIe card far outperforms a single Nvidia A100 GPU at about the same power rating (400W). It also outruns the 300W Groq accelerator. Alibaba’s impressive HanGuang card falls just short of Untether’s ResNet-50 score, but because that card is available only through the Chinese vendor’s cloud service, it isn’t a direct competitor. Qualcomm and Tenstorrent customers can achieve the same performance and power efficiency by using multiple cards, although such an approach can’t match TsunAImi’s density (performance per rack unit).

The challenge for Untether, as for other AI startups, is to build an adequate software stack. As the market leader, Nvidia supports all popular AI frameworks and neural networks. Untether supports only TensorFlow, with plans to add PyTorch next year. Even for TensorFlow, the startup currently accelerates just two dozen functions, whereas Nvidia accelerates hundreds. Untether still has much performance tuning to complete.

The TsunAImi card delivers far better performance, measured in either peak TOPS or ResNet-50 throughput, than any other merchant AI inference accelerator. Although some competitors can approach its per-watt performance, they provide only single-chip cards with much lower performance. This performance lead is impressive for a relatively small and lightly funded startup. Untether must now focus on building out its software stack and raising additional funding for its next product. At 2,000 TOPS, the card’s industry-leading performance should motivate customers to find a way to tap into that power.

Price and Availability

Untether AI recently began sampling its four-chip TsunAImi accelerator. It expects to ship production-qualified PCIe cards in 2Q21, but it withheld pricing. For more information, access www.untether.ai.
