With the DON system, a robot can perform novel tasks such as looking at a shoe it has never seen before and successfully grabbing it by its tongue. (Photo: Tom Buehler/CSAIL)

Advanced Computer Vision Lets Robot Pick Up Virtually Any Object

Humans have long been masters of dexterity, a skill that can largely be credited to the help of our eyes. Robots, meanwhile, are still catching up.

For decades, robots in controlled environments like assembly lines have been able to pick up the same object over and over again. More recently, breakthroughs in computer vision have enabled robots to make basic distinctions between objects. Even then, though, the systems don’t truly understand objects’ shapes, so there’s little the robots can do after a quick pick-up.

1. PhD student Lucas Manuelli worked with lead author Pete Florence to develop a system that uses advanced computer vision to enable a KUKA robot to pick up virtually any object. (Photo: Tom Buehler/CSAIL)

In a recent paper, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) say that they’ve made a key development in this area of work: a system that lets robots inspect random objects and visually understand them well enough to accomplish specific tasks without ever having seen them before. Figure 1 shows the robot, and the accompanying video shows it at work.

The system, called Dense Object Nets (DON), looks at objects as collections of points that serve as a sort of visual roadmap. This approach lets robots better understand and manipulate items and, most importantly, allows them to pick up a specific object among a clutter of similar objects, a valuable skill for the kinds of machines that companies like Amazon and Walmart use in their warehouses.

For example, someone might use DON to get a robot to grab onto a specific spot on an object, say, the tongue of a shoe (see the opening photo). After that, the robot can look at a shoe it has never seen before and successfully grab its tongue.

“Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter,” says PhD student Lucas Manuelli, who wrote a new paper about the system with lead author and fellow PhD student Pete Florence, alongside MIT Professor Russ Tedrake. “For example, existing algorithms would be unable to grasp a mug by its handle, especially if the mug could be in multiple orientations, like upright, or on its side.”

The team views potential applications not just in manufacturing settings, but also in homes. Imagine giving the system an image of a tidy house, and letting it clean while you’re at work, or using an image of dishes so that the system puts your plates away while you’re on vacation.

What’s also noteworthy is that none of the data was actually labeled by humans. Instead, the system is what the team calls “self-supervised,” not requiring any human annotations.

Two common approaches to robot grasping involve either task-specific learning or creating a general grasping algorithm. These techniques both have obstacles: Task-specific methods are difficult to generalize to other tasks, and general grasping doesn’t get specific enough to deal with the nuances of particular tasks, like putting objects in specific spots.

The DON system, however, essentially creates a series of coordinates on a given object, which serve as a kind of visual roadmap, to give the robot a better understanding of what it needs to grasp, and where.

The team trained the system to look at objects as a series of points that make up a larger coordinate system. It can then map different points together to visualize an object’s 3D shape, similar to how panoramic photos are stitched together from multiple photos. After training, if a person specifies a point on an object, the robot can take a photo of that object, identify and match points, and then pick up the object at the specified point.
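The matching step can be illustrated with a minimal sketch. It assumes that the trained network assigns every pixel in a photo a short descriptor vector, and that the robot finds the requested point by looking for the pixel whose descriptor is closest to a reference descriptor. The function name, the toy descriptors, and the use of Euclidean distance are illustrative assumptions, not details taken from the team's code.

```python
import math

def best_match(reference, descriptor_image):
    """Return ((row, col), distance) for the pixel whose descriptor
    lies closest to `reference` in Euclidean distance.

    `descriptor_image` is a hypothetical grid (list of rows) in which
    every pixel holds a descriptor vector of the same length.
    """
    best_pixel, best_dist = None, math.inf
    for r, row in enumerate(descriptor_image):
        for c, descriptor in enumerate(row):
            dist = math.dist(reference, descriptor)
            if dist < best_dist:
                best_pixel, best_dist = (r, c), dist
    return best_pixel, best_dist

# Toy 2x3 "photo" of a new object with made-up 3-D descriptors.
new_view = [
    [(0.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.2, 0.8, 0.1)],
    [(0.1, 0.1, 0.9), (0.5, 0.5, 0.5), (0.0, 1.0, 0.0)],
]

# Descriptor of the point a person picked earlier (also made up).
tongue_descriptor = (0.85, 0.15, 0.05)

pixel, distance = best_match(tongue_descriptor, new_view)
print(pixel, round(distance, 3))
```

In this toy case the closest descriptor sits in row 0, column 1, so that pixel would be handed to the grasp planner. A real system would work on full-resolution images and would likely reject matches whose distance is too large.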

This is different from systems like UC Berkeley’s Dex-Net, which can grasp many different items but is unable to satisfy a specific request. Imagine a child at 18 months old, who doesn’t understand which toy you want it to play with but can still grab lots of items, versus a four-year-old who can respond to “go grab your truck by the red end of it.”

In one set of tests done on a soft caterpillar toy, a KUKA robotic arm powered by DON could grasp the toy’s right ear from a range of different configurations. This showed that, among other things, the system has the ability to distinguish left from right on symmetrical objects.

When tested on a bin of different baseball hats, DON could pick out a specific target hat even though all of the hats had very similar designs and the system had never seen pictures of them in its training data.

“In factories, robots often need complex part feeders to work reliably,” says Florence. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”

In the future, the team hopes to improve the system to the point where it can perform specific tasks with a deeper understanding of the corresponding objects. For example, it could learn how to grasp an object and move it with the ultimate goal of, say, cleaning a desk.

The team will present their paper on the system later this month at the Conference on Robot Learning in Zürich, Switzerland.


A nimble-fingered machine shows how machine learning can teach robots to recognize and pick up different types of objects, a skill that could transform many factories and warehouses.

It might not look that special, but the robot is, according to a new measure, the most dexterous one ever created. Among other tricks, it could sort through your junk drawer with unrivaled speed and skill.

The key to its dexterity isn’t in its mechanical grippers, but in its brain. The robot uses software called Dex-Net to determine how to pick up even odd-looking objects with incredible efficiency (Fig. 2).

2. Jeff Mahler, a graduate student at UC Berkeley, configures the Dex-Net robot system. (Photo: Adriel Olmos)

Dex-Net was developed by Ken Goldberg, a professor at UC Berkeley, and one of his graduate students, Jeff Mahler. The software runs on an off-the-shelf industrial machine made by ABB, a Swiss robotics company. Goldberg demonstrated the latest version of his system at EmTech Digital, an event in San Francisco organized by MIT Technology Review and dedicated to artificial intelligence. 

According to Goldberg, the system is a lot closer to matching the adroitness of a human than anything developed previously. Industrial robots with better dexterity could find application in warehouses and factories as well as hospitals and homes.

What’s especially clever about Dex-Net is how it learns to grasp. The software tries picking up objects in a virtual environment, training a deep neural network through trial and error. Even in simulation, this is a laborious task. However, Dex-Net can generalize from an object it has seen before to a new one. The robot will even nudge an item to get a better look at it if it isn’t sure how it should be grasped. 

The latest version of the system includes a high-resolution 3D sensor and two arms, each controlled by a different neural network. One arm is equipped with a conventional robot gripper and the other with a suction system. The robot’s software scans an object and then consults both neural networks to decide, on the fly, whether it makes more sense to grab or suck that particular object.
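One plausible reading of that decision is a simple comparison. The sketch below assumes each network reports a predicted probability that its tool will succeed on the scanned object, and that the robot picks the tool with the higher score. The function name and the scores are hypothetical; the article does not describe how Dex-Net actually combines the two networks.

```python
from typing import Dict, Tuple

def choose_tool(predicted_success: Dict[str, float]) -> Tuple[str, float]:
    """Return the tool with the highest predicted success and its score.

    `predicted_success` maps a tool name to the probability of a
    successful pick estimated by that tool's network (hypothetical values).
    """
    if not predicted_success:
        raise ValueError("at least one tool score is required")
    tool = max(predicted_success, key=predicted_success.get)
    return tool, predicted_success[tool]

# Made-up scores for one scanned object.
scores = {"gripper": 0.62, "suction": 0.91}
tool, score = choose_tool(scores)
print(tool, score)
```

With these made-up scores the suction arm would be chosen. The same comparison extends to more than two tools without changing the function.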

UC Berkeley researchers also developed a better way to measure the performance of a picking robot. The metric, called “mean picks per hour,” is calculated by multiplying the average number of picks attempted per hour (the inverse of the average time per pick) by the average probability of success for a consistent set of objects.
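Read so that the units come out as picks per hour, the metric is the number of attempts a robot can make in an hour (3,600 divided by the average seconds per attempt) times the average probability that an attempt succeeds. The sketch below follows that reading, which is an assumption about the intended definition. The function name and the example inputs are hypothetical and are not measurements from the Berkeley team.

```python
def mean_picks_per_hour(avg_seconds_per_pick: float,
                        success_probability: float) -> float:
    """Successful picks per hour = attempts per hour * success probability."""
    if avg_seconds_per_pick <= 0:
        raise ValueError("average time per pick must be positive")
    if not 0.0 <= success_probability <= 1.0:
        raise ValueError("success probability must be between 0 and 1")
    attempts_per_hour = 3600.0 / avg_seconds_per_pick
    return attempts_per_hour * success_probability

# Hypothetical robot: 12 seconds per attempt, 90 percent of attempts succeed.
mpph = mean_picks_per_hour(12.0, 0.9)
print(round(mpph, 1))
```

These hypothetical inputs work out to about 270 mean picks per hour. Because both speed and reliability enter the product, a fast robot that often drops objects and a slow, careful one can end up with the same score.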

The new metric will help research labs working on picking robots share their results. “We’ve been talking about how to align our results so that we see progress,” says Goldberg. “It all depends what robot you’re using, what sensor you’re using, and—very importantly—what objects you’re using.”

Humans are capable of between 400 and 600 mean picks per hour. In a contest organized by Amazon recently, the best robots were capable of between 70 and 95. The new machine reaches 200 to 300 mean picks per hour, says Goldberg. The results will be presented at a conference in Australia later this year.

During his presentation, Goldberg added that within five years, he expects that robots will reach “human or even superhuman mean picks per hour.”

Grasping and manipulating awkward and unfamiliar objects is a fundamental challenge in robotics, and one that has held the technology back. The robots found in car factories, for instance, are fast and precise but have no ability to adapt to a changing or unfamiliar setting. Besides factory or warehouse work, more sophisticated manipulation may lead to the first useful robots for helping people in places such as hospitals and elder-care facilities.

Recent progress in this aspect of robotics is the result of several simultaneous trends. Smaller, safer robots have proliferated, new kinds of end grippers have emerged, and—most significantly—big strides have been made in machine learning.

In addition to Goldberg’s work and research at several other academic labs, researchers at places like DeepMind and OpenAI have begun exploring how machine learning could be used to make robots smarter and more adaptable. Advances in robotics may well feed back into other areas of AI, such as perception. 

“Machine learning is having an unprecedented impact on robotics,” says MIT’s Tedrake, who has seen the UC Berkeley robot demoed. “There is incredible value in getting robots to proliferate to the point that we finally have big data for robotics.”
