Google Develops 3D Vision for Tablets
Project Tango Employs GPU Compute to Map Space in Real Time
By Mike Demler
Smartphone and tablet cameras may soon enter a new dimension. In February 2014, Google launched Project Tango, which aims to enable mobile devices to map a user’s physical environment and movement in 3D. By employing depth-sensing cameras synchronized with nine-axis motion sensors, a Tango device can capture and graphically recreate the location, shape, and distance of nearby objects and structures in real time. In December, the company began limited sales of a Tango development kit through the Google Play store. The seven-inch tablet, which employs Nvidia’s Tegra K1 processor, is currently available for $1,024 to professional developers that Google has chosen from its waiting list.
Project Tango falls under Google’s Advanced Technology and Projects (ATAP) group, which also came up with the Ara customizable smartphone (see MCR 7/21/14, “Google DIY Phone Sounds Loony”). Although the Ara DIY concept is likely to have limited appeal, the potential uses for 3D computer vision in mobile devices are numerous.
Games are foremost among Tango’s applications. By combining 3D maps and graphics, the technology enables users to transform their surroundings into a virtual play space that changes as they move through their actual physical environment. It can also facilitate indoor navigation, which currently requires building owners to install a network of wireless beacons for location detection. Using a downloadable 3D map, a tablet equipped with Tango’s 3D cameras can instead sense its location by the simpler process of dead reckoning: the user need only take a picture, and the device can then determine its location and follow the user’s movement with its built-in motion sensors. Other applications include interior design, robotics, and industrial 3D scanners.
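The dead-reckoning process described above can be reduced to a simple position integration: start from a known fix (here, the location recovered by matching the user’s photo against the 3D map) and accumulate heading-and-distance increments from the motion sensors. The sketch below is purely illustrative; it is not Tango’s navigation code, and the step data it assumes would in practice come from the IMU.

```python
import math

def dead_reckon(start, steps):
    """Update a 2D position from (heading_deg, distance_m) increments.

    start: (x, y) fix, e.g., from matching a photo against a 3D map.
    steps: iterable of (heading_deg, distance_m) pairs derived from
           the device's accelerometer and gyroscope (hypothetical).
    """
    x, y = start
    for heading_deg, dist in steps:
        x += dist * math.cos(math.radians(heading_deg))
        y += dist * math.sin(math.radians(heading_deg))
    return x, y
```

Each new fix (another photo match) would reset the accumulated drift that is dead reckoning’s main weakness.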
Similarities between Tango and Microsoft’s Xbox Kinect system are apparent (see MPR 5/30/11, “Visual Computing Becomes Embedded”). Not coincidentally, Google’s technical lead is Johnny Lee, who developed human-tracking algorithms for Kinect. At its 2014 I/O developer conference, Google announced a partnership with LG to bring Tango technology to the mass market. A consumer product is unlikely before 2Q15, but an analysis of the development-kit design provides insight into the processor capabilities that these devices will need.
It Takes More Than Two to Tango
To film a 3D movie, cinematographers employ a second 2D camera that records the scene stereoscopically. The result is two slightly offset images of each frame that attempt to imitate how humans see the world using two eyes. The “depth” that viewers sense, if they don’t get a headache, is just an illusion that the brain creates.
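The depth cue in such a stereo pair comes from the horizontal offset (disparity) between the two images of the same point; the classic pinhole-camera relation Z = f·B/d recovers distance from it. The snippet below is a generic illustration of that relation, with made-up focal-length and baseline values rather than figures from any Tango or cinema camera.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point from stereo disparity: Z = f * B / d.

    focal_px: focal length in pixels (assumed value)
    baseline_m: distance between the two cameras, in meters
    disparity_px: horizontal offset of the point between the images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

Nearby objects produce large disparities and distant ones small disparities, which is why stereo depth accuracy falls off rapidly with range.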
To accurately map and replicate a physical space in three dimensions, the Project Tango tablet also uses a pair of cameras, but their purpose is not to drive stereoscopic displays but rather to capture images for processing by 3D-computer-vision algorithms. Conventional smartphone and tablet cameras employ a CMOS sensor to convert visible light into pixels, which an image signal processor (ISP) then transforms into a picture suitable for viewing on a display (see MPR 8/12/13, “New Sensors for Smartphones”). As Figure 1 shows, Tango’s camera system also has a sensor that captures a conventional RGB image, but additionally, it can acquire an infrared (IR) image of the same scene.
Figure 1. Block diagram of Project Tango 3D camera system. To sense depth and object shapes, Tango combines images from conventional and fisheye-lens cameras with an IR-beam scan of the same scene. A sensor hub synchronizes the camera images with motion data from an accelerometer and gyroscope, which track the user’s movement in three dimensions.
The IR image is a scan formed by flashing the scene with structured light patterns from an IR projector; these patterns appear similar to a wire frame. Google adopted this technique from the Kinect system, which uses structured light to detect player movement. In Tango, the CMOS sensor captures distortion of a grid of IR beams as they wrap around three-dimensional objects. Image-processing algorithms can “reverse-engineer” these patterns to reconstruct the 3D geometry. Mantis Vision, an Israeli startup that has attracted investments from Qualcomm and Samsung, supplies the IR blaster and proprietary structured-light algorithms.
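Mantis Vision’s algorithms are proprietary, but the underlying triangulation idea is standard: a projected IR feature that lands on a surface nearer or farther than a calibrated reference plane shifts sideways in the sensor image, and that shift encodes depth. A minimal sketch of that model, with assumed focal-length, baseline, and reference-plane values:

```python
def structured_light_depth(f_px, baseline_m, z_ref_m, shift_px):
    """Depth from the shift of a projected IR feature relative to its
    position on a reference plane at distance z_ref_m.

    Uses the standard relation 1/Z = 1/Z0 + d/(f*b) -- a simplified
    textbook model, not Mantis Vision's proprietary method.
    """
    return z_ref_m / (1.0 + (z_ref_m * shift_px) / (f_px * baseline_m))
```

A zero shift means the surface lies on the reference plane; larger shifts correspond to larger depth deviations, which is how the “wrapped” grid reveals 3D geometry.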
Structured light provides information about object shape, but Tango also aims to track user movements in three dimensions, a process that requires forming an accurate view of objects as the user’s perspective changes. To accomplish that feat, Google additionally employs an RGB motion-tracking camera with a 170° fisheye lens, which emulates a human’s peripheral vision. The Tango image-processing algorithms use the fisheye camera to extract further information on object shape using a process known as Harris corner detection.
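Harris corner detection itself is a well-documented algorithm: it builds a structure tensor from local image gradients and scores each pixel with R = det(M) − k·trace(M)², where corners score high, edges negative, and flat regions near zero. The NumPy sketch below illustrates the method generically; Tango’s production implementation (and its GPU mapping) is not public.

```python
import numpy as np

def harris_response(img, k=0.05):
    """Harris corner response R = det(M) - k*trace(M)^2 per pixel,
    using finite-difference gradients and a 3x3 box window (a
    Gaussian window is more common in practice)."""
    img = img.astype(float)
    iy, ix = np.gradient(img)          # image gradients
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box(a):                        # 3x3 box filter
        p = np.pad(a, 1, mode='edge')
        return sum(p[r:r + a.shape[0], c:c + a.shape[1]]
                   for r in range(3) for c in range(3)) / 9.0

    sxx, syy, sxy = box(ixx), box(iyy), box(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace
```

Thresholding the response map and keeping local maxima yields the trackable corner features that the fisheye camera supplies to the motion-tracking pipeline.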
By employing a sensor hub, Tango synchronizes images from the two cameras with position data from accelerometers and gyroscopic motion sensors; it then stores the data to compile a 3D location map of the user’s surroundings. The tablet also integrates a GPS receiver, along with a barometer and compass, that developers can use to supplement location data from motion sensors. As Figure 2 shows, the combination of RGB and IR images with synchronized motion-sensor data is enough for Tango to reconstruct the geometry of a 3D space.
Figure 2. Reconstruction of a 3D map. By combining RGB and IR depth-sensing images with motion data from the accelerometer and gyroscope, the Tango system can reconstruct a 3D map of the user’s surroundings. In this image, the white line traces the user’s path up the staircase of an eight-story building. The colors indicate the elevation change detected by the tablet’s motion sensors. (Source: Google)
Tablets Cure High Temperatures
Before developing the Tango tablet, Google first experimented with building a 3D camera and sensor system in a smartphone. The prototype integrates a quad-core Qualcomm Snapdragon 8974 processor running at 2.3GHz and two first-generation Myriad vision coprocessors from Movidius (see MPR 8/29/11, “Movidius Powers 3D Video”). Even with the 3D-camera functions offloaded from the application processor, the real-time vision-processing load proved too great for the smartphone package, causing the processors to throttle against thermal limits. As a result, Google is currently withholding the smartphone kit from developers.
For its tablet developer kit, the company adopted Nvidia’s Tegra K1 processor, which integrates four Cortex-A15 CPUs with a scaled-down A15 CPU in a variable-symmetric-multiprocessing (VSMP) architecture. The smaller core offloads the larger A15s by performing single-thread tasks that are less CPU intensive (see MPR 1/13/14, “Tegra K1 Adds PC Graphics to Mobile”). The K1 also integrates Nvidia’s Kepler GPU, which employs one streaming-multiprocessor (SMX) unit from the company’s GeForce graphics plug-in cards.
The seven-inch Tango tablet eases the thermal constraints that plagued the smartphone prototype, and it also has room for a 4,960mAh battery, approximately 25% more capacity than that of a similar seven-inch tablet in Samsung’s Galaxy line. As a result, the device weighs more than 13 ounces, nearly 35% more than the Galaxy Tab.
The CMOS RGB-IR sensor is a four-megapixel device manufactured by Omnivision. This resolution is low compared with that of conventional smartphone or tablet cameras, but the reduction is necessary to accommodate the larger-than-normal two-micron pixels. Typical CMOS camera sensors use 1- to 1.5-micron pixels to achieve higher resolution in the same-size chip. The Omnivision RGB-IR sensor dedicates 25% of its pixels to capturing IR. The larger pixels provide higher light sensitivity and a higher signal-to-noise ratio (SNR), which are critical for 3D mapping.
As Figure 3 shows, Tango’s motion-sensing system employs a 32-bit STMicroelectronics MCU to acquire data from the accelerometer and gyroscope. The sensor hub also provides a time stamp to synchronize position data with images from the cameras, maintaining at least 50-microsecond accuracy in the 3D location map.
Figure 3. Tango Tablet block diagram. The 3D-depth-sensing camera system couples an STMicroelectronics 32-bit MCU for sensor-hub functions with an Nvidia Tegra K1 application processor. Tegra’s CPU cluster pairs four full-size Cortex-A15s with a lower-power A15 companion core. Its Kepler GPU performs the image-processing functions for feature extraction and depth detection. *Power-saver CPU.
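The sensor hub’s time stamps make it possible to pair each camera frame with the IMU samples recorded closest to it in time. A minimal sketch of that pairing step, assuming microsecond time stamps and a sorted IMU sample list (hypothetical data layout, not Google’s actual hub firmware):

```python
import bisect

def nearest_imu_sample(imu_ts, frame_ts):
    """Index of the IMU sample whose timestamp is closest to a
    camera frame's timestamp.

    imu_ts: sorted list of IMU timestamps in microseconds
    frame_ts: camera-frame timestamp in microseconds
    """
    i = bisect.bisect_left(imu_ts, frame_ts)
    if i == 0:
        return 0
    if i == len(imu_ts):
        return len(imu_ts) - 1
    # choose whichever neighbor is closer in time
    return i if imu_ts[i] - frame_ts < frame_ts - imu_ts[i - 1] else i - 1
```

In a real pipeline, the matched IMU samples would typically be interpolated to the exact frame time rather than merely selected, but the binary search shown here is the core of keeping image and motion data aligned to tens of microseconds.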
The Tegra K1 application processor performs all image-processing and camera-control tasks, eliminating the need for the Myriad coprocessors in this design. It integrates two instances of the Chimera computational-photography engine, which Tango uses to provide low-noise camera images for object-feature tracking (see MPR 3/11/13, “Tegra 4 Outperforms Snapdragon”). In addition to its normal role as an ISP, Chimera collects pixel-data statistics and processes the RGB elements to correct for inadvertent IR absorption.
The Tango 3D-vision system makes extensive use of Kepler’s 192 shader cores for GPU compute. Nvidia worked with Mantis Vision to convert the latter company’s IR-mapping algorithms to the Cuda parallel-programming platform. Kepler provides hardware that applies feature-extraction algorithms to the RGB images while simultaneously performing depth calculations for the structured-light map.
Whereas the Tango developer tablet employs the Mantis IR blaster and Omnivision RGB-IR imager for depth detection, Google is also working with a separate IR time-of-flight (ToF) sensor in some of its engineering prototypes. Infineon Technologies and Pmd Technologies jointly manufacture the ToF sensor: Pmd provides the pixel-matrix design, which Infineon integrates with mixed-signal circuitry and control logic for production in its CMOS process.
The ToF device combines the IR-blaster controller and detector in one chip. It calculates distance by measuring the time a flash of IR light takes to return to each pixel, a more accurate approach than structured light. The Pmd chip can deliver much higher resolution and is available in 160x120-pixel (19K-pixel) and 352x268-pixel (100K-pixel) versions. Pmd says the ToF sensor costs no more than the $2–$3 Omnivision sensor. Integrating the projector and sensor in a smaller package could ease building depth-sensing cameras into smartphones.
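The time-of-flight principle reduces to a one-line relation: distance is half the round-trip time multiplied by the speed of light. (Pmd-style sensors actually infer that time from the phase of modulated IR light rather than timing a single pulse, but the distance relation is the same.) A minimal illustration:

```python
C = 299_792_458.0  # speed of light in a vacuum, m/s

def tof_distance(round_trip_s):
    """Distance to a surface from the round-trip time of an IR
    flash: d = c * t / 2 (the light travels out and back)."""
    return C * round_trip_s / 2.0
```

The numbers show why ToF timing is demanding: a surface one meter away returns light in roughly 6.7 nanoseconds, so per-pixel depth resolution requires picosecond-scale timing precision, which is exactly what the phase-measurement approach provides.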
Google Goes for a Grand SLAM
Google is not alone in its efforts to bring 3D-vision capability to mass-market devices. In 2013, Apple acquired PrimeSense, the Israeli company that developed the 3D-sensor system for Microsoft’s Xbox. The notoriously secretive Cupertino company has not disclosed its plans for that technology, but the itSeez3D app, freely available through the iTunes store, already brings 3D scanning to the iPad. The app works with a third-party add-on device similar to Kinect, called the Structure Sensor, which connects to an iPad and is smaller than the PrimeSense camera. itSeez3D enables users to create 3D scans by moving the iPad-mounted Structure camera around a stationary object, but it doesn’t map 3D space.
Also in 2013, Intel launched its perceptual-computing project by holding a contest for developers of gesture-recognition UIs. It used a Creative Labs 3D camera similar in size to Kinect. Intel is now branding its 3D-computer-vision technology as RealSense and is even promoting it in a television ad featuring Big Bang Theory star Jim Parsons. The company has adapted the technology for a seven-inch Android tablet, which Dell plans to ship in January 2015. This tablet uses Intel’s Moorefield processor, which integrates an Imagination PowerVR G6430 GPU (see MPR 3/17/14, “Intel Makes Merry in Smartphones”). It is likely to be the first shipping consumer device with 3D-scanning capability, but it will only be able to create images of stationary objects.
Google has a much more aggressive goal for Project Tango: simultaneous localization and mapping (SLAM). The science for building a SLAM-capable device is rooted in robotics. Tango shows that the components necessary to integrate SLAM into a smartphone or tablet are now available. Modern nine-axis MEMS sensors enable motion tracking in three dimensions, and inexpensive CMOS image sensors enable scanning of physical objects in 3D as well. MCU sensor hubs and powerful application processors with GPU compute complete the picture.
Google’s objective is to provide a complete software stack, which will encourage developers to create 3D-computer-vision applications for mobile platforms. The Tango tablet kit for developers will kickstart the ecosystem, and we expect other vendors to join LG in the race to bring new dimensions to mobile vision.
Price and Availability
The Project Tango tablet is currently available only to developers chosen by Google. It sells for $1,024 through the Google Play store. For more information, see www.google.com/atap/projecttango/#project.