NXP Debuts QorIQ Cortex-A72
New Dual- and Quad-Core LS1 Chips Boost CPU Performance
NXP is on the verge of sampling the industry’s first embedded processors that use ARM’s Cortex-A72. The quad-core LS1046A and dual-core LS1026A will bring greater CPU performance to the QorIQ family’s LS1 series, which currently comprises 32- and 64-bit chips based on Cortex-A7, Cortex-A9, and Cortex-A53. The new A72 processors are scheduled to sample this quarter and begin production later this year.
As usual, networking is the main target, but the chips are also useful for industrial and general embedded applications. With their dual 10G Ethernet (10GbE) controllers, four GbE controllers, and one 2.5GbE controller, the new processors are well equipped for enterprise routers, switches, line-card controllers, security appliances, virtual customer premises equipment (vCPE), service-provider gateways, and network-attached storage (NAS).
Although some previous LS1 chips target the same applications, we estimate Cortex-A72 delivers more than double the single-thread performance of Cortex-A53 at the same clock frequency. Moreover, the new products target a maximum clock speed of 2.0GHz, which is 25% faster than the existing A53-based QorIQ chips. NXP estimates the LS1046A’s power consumption will be 12W (typical) or 15W thermal design power (TDP) at 2.0GHz. By downshifting to 1.2GHz, it can deliver sufficient performance for many of the target applications while reducing power consumption to about 7W typical or 10W TDP, according to the company. That’s about the same power dissipation as existing quad-core A53 chips.
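As a rough illustration of how the per-clock and clock-speed gains compound, the sketch below multiplies the two figures quoted above. It assumes single-thread performance scales linearly with clock frequency, which real workloads only approximate, and it takes 1.6GHz as the top speed of the existing A53-based chips. The function names are illustrative only.

```python
# Back-of-the-envelope estimate; inputs are the figures quoted in the article.
# Assumption: performance scales linearly with clock frequency.

def clock_uplift(new_ghz: float, old_ghz: float) -> float:
    """Fractional clock-speed increase of the new part over the old one."""
    return new_ghz / old_ghz - 1.0

def single_thread_gain(per_clock_gain: float, new_ghz: float, old_ghz: float) -> float:
    """Combined speedup from a per-clock gain and a clock-speed increase."""
    return per_clock_gain * new_ghz / old_ghz

A53_MAX_GHZ = 1.6     # existing A53-based QorIQ chips
A72_MAX_GHZ = 2.0     # target for the new A72-based chips
PER_CLOCK_GAIN = 2.0  # "more than double", so treat this as a lower bound

uplift = clock_uplift(A72_MAX_GHZ, A53_MAX_GHZ)
gain = single_thread_gain(PER_CLOCK_GAIN, A72_MAX_GHZ, A53_MAX_GHZ)

print(f"Clock uplift: {uplift:.0%}")
print(f"Single-thread gain at top clock speeds: at least {gain:.1f}x")
```

Because the per-clock figure is itself a lower bound, the combined result should be read as a floor and not as a prediction.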
In addition to delivering better CPU performance, the new products further expand the scope of NXP’s QorIQ LS1 series, which the company inherited with its recent acquisition of Freescale. The biggest LS1 processor has eight Cortex-A53 CPUs, and three models are dual- or quad-core A53 designs. The rest are dual-core 32-bit processors, except for the recently announced LS1012A, which has one A53 core (see MPR 3/14/16, “NXP’s Lowest-Power 64-Bit ARM”). The next-higher product tier in the QorIQ family is the LS2 series, which scales to even greater performance. No other vendor offers as many 32- and 64-bit ARM-based embedded processors.
As Figure 1 shows, the LS1046A is well endowed with I/O interfaces. Until now, only the quad-core LS1048A and octa-core LS1088A sported dual 10GbE controllers. Other LS1 models have just one 10GbE controller or are limited to GbE or 2.5GbE. The second 10GbE port is generally required for enterprise networking equipment. The 2.5GbE controller also supports GbE, so a system can implement up to five GbE ports.
Figure 1. Block diagram of NXP’s QorIQ LS1046A processor. This 64-bit design is similar to the existing quad-core LS1048A but substitutes 2.0GHz Cortex-A72 CPUs for that product’s 1.6GHz Cortex-A53 CPUs. The new dual-core LS1026A is virtually identical to the LS1046A except for the core count.
The new chips implement NXP’s first-generation Data Path Acceleration Architecture (DPAA1), which is slower and less flexible than the second-generation DPAA2 soon appearing in the LS1088A, LS1048A, and future LS2-series chips (see MPR 7/20/15, “Freescale Overhauls the Data Plane”). The older acceleration hardware costs less to implement, is sufficient for chips in this class, and is more compatible with the existing LS1043A and T1040 processors. Also, the company is porting some acquired software to DPAA1 before DPAA2, so the LS1046A will have a complete turnkey stack for home and small-enterprise routers.
Using the DPAA1 hardware, NXP estimates the LS1046A will perform basic Layer 3 packet forwarding at 16Gbps. By contrast, the LS1048A reaches 20Gbps. Overall, however, the new chips should deliver better data-plane performance than most other LS1 models, and they will deliver much better control-plane performance. NXP says a 1.6GHz LS1046A scores 32,000 (aggregate) on the EEMBC CoreMark test.
Note that the company is using ARM’s CCI-400 cache-coherent interconnect instead of a CCN-5xx interconnect, which is normally recommended for an ARMv8 multicore processor. NXP says it chose the older interconnect because it consumes significantly less power yet still delivers sufficient performance for these dual- and quad-core designs.
Not So Quicc
Like almost all other LS1 processors, the LS1046A and LS1026A omit the regular-expression (reg-ex) and data-compression engines that some LS2 processors include. The only LS1 chip with reg-ex is the LS1024A, which is actually the former Mindspeed Comcerto C2200 acquired from Macom in 2014. Omitting these features helps to differentiate the LS1 and LS2 products from each other. Competing processors in this class usually lack these accelerators as well, although reg-ex engines are becoming more widespread.
The new LS1046A and LS1026A also drop the proprietary Quicc Engine, which mainly provides backward compatibility with communications software written for older Freescale and Motorola processors. Seven other chips in the LS1 series have Quicc Engines, so legacy customers still have plenty of options—unless they also want Cortex-A72 CPUs. Like all QorIQ chips, however, the new products do have the company’s SEC security engine, which in this case can process IPSec packets at 8Gbps. That’s about 20% slower than the LS1048A and LS1088A but still competitive in this class.
Both new processors have three PCI Express (PCIe Gen3) controllers linked to four 10Gbps lanes. They have one Serial ATA (SATA3) controller, three USB3.0 interfaces with PHYs, and the miscellaneous I/O interfaces common to the LS1 series (SD/MMC, SPI, quad SPI, I2C, UARTs, GPIO). They also have a JTAG debug port. The integrated 32/64-bit DDR4-2133 controller provides up to 17GB/s of peak DRAM bandwidth, matching the best throughput of other LS1 processors.
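The 17GB/s figure follows from the DDR4 transfer rate and the width of the memory interface. The sketch below reproduces that arithmetic for both the 64-bit and 32-bit configurations. It uses the nominal 2,133 megatransfers per second, counts decimal gigabytes, and ignores any ECC bits; the function name is illustrative only.

```python
# Peak DRAM bandwidth = transfers per second x bytes per transfer.
# Assumptions: nominal DDR4-2133 rate, decimal GB, data bits only (no ECC).

def peak_dram_bandwidth_gbs(megatransfers_per_s: float, bus_width_bits: int) -> float:
    """Return theoretical peak bandwidth in GB/s for one memory interface."""
    bytes_per_transfer = bus_width_bits / 8
    return megatransfers_per_s * 1e6 * bytes_per_transfer / 1e9

full_width = peak_dram_bandwidth_gbs(2133, 64)  # 64-bit configuration
half_width = peak_dram_bandwidth_gbs(2133, 32)  # 32-bit configuration

print(f"64-bit DDR4-2133: {full_width:.1f} GB/s")
print(f"32-bit DDR4-2133: {half_width:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth in a real system will be lower.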
The new products are packaged in a 23mm flip-chip ball-grid array (FCBGA-621). They are pin compatible with the dual-core LS1023A, the quad-core LS1043A and LS1048A, and the octa-core LS1088A. (Note that the LS1023A and LS1043A are also available in older 21mm FCBGA packages that are not pin compatible.) We estimate the 1,000-unit pricing will be $80 for the quad-core LS1046A and $50 for the dual-core LS1026A, but they may cost a little less if some older products get price reductions. The initial devices will run at 1.6GHz; NXP plans to release 2.0GHz versions next year.
Surprisingly Few Competitors
NXP anticipates an unusually fast ramp from sampling to production—about six months, if the project stays on schedule. This ramp is feasible because the new designs are similar to the existing Cortex-A53 designs and are built in the same TSMC 28nm HPM process. The Freescale acquisition brought NXP much experience with that process; almost all ARM-based QorIQs are built in the same technology.
Among 64-bit ARM embedded processors, the LS1046A and LS1026A will compete mainly with AMD’s Opteron A1100 family, AppliedMicro’s Helix 2 family, and Broadcom’s “Northstar 2” StrataGX products. The AMD and Broadcom chips use the lower-performance Cortex-A57, whereas Helix 2 uses an internally developed ARMv8-compatible CPU. Despite the industry’s ARM frenzy, NXP’s LS1 series has surprisingly little direct competition.
Table 1 compares the LS1046A with AMD’s Opteron A1120 and AppliedMicro’s Helix 2 APM887104-H2. All are quad-core ARMv8 designs, but they use different CPUs. Because Cortex-A72 can execute about 25% more instructions per cycle than Cortex-A57, the 1.6GHz LS1046A should beat the 1.7GHz Opteron in CPU throughput despite its slightly slower clock. If NXP succeeds in pushing the clock speed to 2.0GHz, the QorIQ processor will easily win. In AMD’s favor, Opteron has a huge 8MB L3 cache, whereas the LS1046A has none. If a program or its data can fit entirely in this cache, the DRAM latency is effectively hidden. And when Opteron does miss the cache, it has 75% more DRAM bandwidth (see MPR 2/24/14, “AMD to Sample First ARM Chip”).
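The CPU comparison above can be sketched as a first-order model in which throughput is proportional to clock speed times instructions per cycle. This is a simplification that ignores the cache and DRAM differences just described, and the normalization of Cortex-A57 to 1.0 is an assumption made only for illustration.

```python
# First-order model: relative throughput = clock (GHz) x relative IPC.
# Assumptions: Cortex-A57 IPC normalized to 1.0; cache and memory effects ignored.

def relative_throughput(clock_ghz: float, relative_ipc: float) -> float:
    """Per-core throughput in arbitrary units (A57 at 1GHz = 1.0)."""
    return clock_ghz * relative_ipc

A57_IPC = 1.00
A72_IPC = 1.25  # "about 25% more instructions per cycle"

opteron_17 = relative_throughput(1.7, A57_IPC)
ls1046a_16 = relative_throughput(1.6, A72_IPC)
ls1046a_20 = relative_throughput(2.0, A72_IPC)

advantage_16 = ls1046a_16 / opteron_17 - 1.0
advantage_20 = ls1046a_20 / opteron_17 - 1.0

print(f"LS1046A at 1.6GHz vs. Opteron at 1.7GHz: {advantage_16:+.0%}")
print(f"LS1046A at 2.0GHz vs. Opteron at 1.7GHz: {advantage_20:+.0%}")
```

Under these assumptions, the lead is modest at 1.6GHz and grows considerably at 2.0GHz, but workloads that fit in Opteron's L3 cache or depend on DRAM bandwidth could narrow or reverse it.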
Table 1. Comparison of three ARMv8 embedded processors. OOO=out-of-order execution. *NXP plans to boost the maximum clock speed to 2.0GHz in future versions. (Source: vendors, except †The Linley Group estimate)
AppliedMicro’s Helix 2 offers better CPU performance than either of the other chips, owing to its much faster 2.4GHz clock speed and wider superscalar issue. (AppliedMicro and AMD claim SPECint scores that work out to approximately the same performance per megahertz.) Although Helix 2 has only a quarter as much L2 cache as the other processors, it has more total cache than the LS1046A, which lacks an L3 cache. AMD’s 10MB of total cache easily surpasses that of both the NXP and AppliedMicro chips.
Our price estimate for Helix 2 is much higher than our estimates for the competing products because it’s based on the company’s X-Gene 2 server processors, which are essentially the same design. If AppliedMicro prices the embedded chips much lower, embedded customers might buy the server chips, unless embedded features such as the packet-acceleration hardware are disabled. AppliedMicro has not divulged its Helix 2 pricing strategy (see MPR 10/27/14, “AppliedMicro ARMs for Embedded”).
LS1046A Excels at Networking
All three processors employ hardware accelerators to offload their CPUs. All have crypto engines, and the LS1046A and Helix 2 have efficient packet acceleration. Only Opteron has an engine for data compression and decompression. Despite the lack of reliable benchmarks, we believe the NXP and AppliedMicro products will deliver the best packet throughput for typical networking. The LS1046A definitely has the best Ethernet connectivity. Helix 2 is limited by a single 10GbE interface, and Opteron has two 10GbE ports but nothing else.
All these chips have sufficient PCIe connectivity, but Opteron’s I/O interfaces reflect its server-oriented design. Whereas the others have one SATA3 controller and three USB3.0 interfaces with integrated PHYs, Opteron has 14 SATA3 controllers and no USB at all. Thus, Opteron is the best choice for RAID and NAS subsystems that must attach stacks of disk drives, and its compression engine will reduce the overhead of downsizing files to conserve storage. USB would be useful for attaching an external drive, though.
NXP’s estimated power consumption for the LS1046A is 15W TDP at 2.0GHz or 10W at 1.2GHz—hence our estimate of 12.5W at the initial maximum clock speed of 1.6GHz. We estimate 16W for the 2.4GHz Helix 2, and AMD rates the 1.7GHz Opteron A1120 at 25W.
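The 12.5W figure is consistent with a straight-line interpolation between NXP's two TDP data points, as the sketch below shows. Treating power as linear in clock frequency is an assumption made here for illustration; actual power also depends on supply voltage and workload. The function name is hypothetical.

```python
# Linear interpolation between two (clock GHz, TDP watts) points.
# Assumption: TDP varies linearly with clock speed between the quoted points.

def interpolate_tdp(clock_ghz: float,
                    low: tuple[float, float] = (1.2, 10.0),
                    high: tuple[float, float] = (2.0, 15.0)) -> float:
    """Estimate TDP in watts at clock_ghz from two known operating points."""
    (f0, p0), (f1, p1) = low, high
    return p0 + (clock_ghz - f0) * (p1 - p0) / (f1 - f0)

tdp_at_1_6 = interpolate_tdp(1.6)
print(f"Estimated TDP at 1.6GHz: {tdp_at_1_6:.1f}W")
```

Because 1.6GHz is the midpoint of 1.2GHz and 2.0GHz, the result is simply the average of 10W and 15W.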
The Opteron and Helix 2 processors are available now. The LS1046A is scheduled to sample this quarter and qualify for production this fall. By that time, NXP’s competition is more likely to come from Broadcom (which announced its first ARMv8 StrataGX products on April 19) and Cavium (which is developing ThunderX chips that are smaller than its 48-core behemoth).
NXP Hits the Sweet Spot
NXP is moving quickly to bring ARM’s newest high-end CPU, Cortex-A72, to the embedded market ahead of competitors. In fact, no other company offers as many ARMv8 embedded processors today. The AMD and AppliedMicro products in this comparison both derive from server processors and are not fully optimized embedded designs. The A1120 has poor performance per watt and is best suited to storage servers. Helix 2 is similar to the LS1046A, but we believe it’s based on an eight-core server die that’s more costly to manufacture.
By contrast, NXP’s LS1046A and LS1026A are highly optimized for their target applications. They supplement their Cortex-A72 CPUs with packet-acceleration hardware and class-leading Ethernet connectivity. Their PCIe, SATA, and USB interfaces are useful for a wide range of embedded systems. Their estimated power consumption is relatively low, and they are pin compatible with four other LS1 products. Although they have the fastest CPUs in that series, they aren’t the fastest LS1 chips—the octa-core LS1088A still delivers greater aggregate CPU performance, and it has the newer DPAA2 packet acceleration.
Thus, the LS1046A and LS1026A are midrange products that hit the dual/quad-core sweet spot. As the embedded market’s first Cortex-A72 products, they should surpass most Cortex-A57 and Cortex-A53 designs while using about the same power and silicon. Having few direct competitors optimized for the same applications, they belong on every potential customer’s short list.
Price and Availability
NXP plans to sample the QorIQ LS1046A and LS1026A this quarter and begin production no later than 4Q16. Initial chips will have a maximum clock speed of 1.6GHz; the company plans to introduce 2.0GHz models next year. It will announce list prices when the chips become available. We estimate the quad-core LS1046A will cost about $80 and the dual-core LS1026A about $50 in 1,000-unit volumes. For more information, access www.nxp.com/QorIQ.