A new approach enables robots to manipulate new objects in a variety of environments.
Clearing the dinner table is a task easy enough for a child to master, but it's a major challenge for robots.
Robots are great at doing repetitive tasks but struggle when they must do something new or interact with the disorder and mess of the real world. Such tasks become especially challenging when they have many steps.
"You don't want to reprogram the robot for every new task," says Murtaza Dalal, a PhD student in the School of Computer Science's (SCS) Robotics Institute at Carnegie Mellon University. "You want to just tell the robot what to do, and it does it. That's necessary if we want robots to be useful in our daily lives."
To enable robots to undertake a wide variety of tasks they haven't previously encountered, Dalal and other researchers at SCS and Apple Inc. have developed an approach to robotic manipulation called ManipGen that has proven highly successful for these multistep tasks, known as long-horizon tasks.
The key idea, Dalal explains, is to divide the task of planning how a robotic arm needs to move into two parts.
Imagine opening a door: The first step is to reach the door handle, next is to turn it. To solve the first problem, the researchers use well-established data-driven methods for computer vision and motion planning to locate the object and move a robotic arm's manipulator near the object. This method simplifies the second part of the process, limiting it to interacting with the nearby object. In this case, the door handle.
"At that point, the robot no longer cares where the object is. The robot only cares about how to grasp it," Dalal says.
Robots are typically trained to perform a task by using massive amounts of data derived from demonstrations of the task. That data can be manually collected, with humans controlling the robot, but the process is expensive and time consuming. An alternative method is to use simulation to rapidly generate data. In this case, the simulation would place the robot in a variety of virtual scenes, enabling it to learn how to grasp objects of various shapes and sizes, or to open and shut drawers or doors.
Dalal says the research team used this simulation method to generate data and train neural networks to learn how to pick up and place thousands of objects and open and close thousands of drawers and doors, employing trial-and-error reinforcement learning techniques. The team developed specific training and hardware solutions for transferring these networks trained in simulation to the real world. They found that these skills could be recombined as necessary to enable the robot to interact with many different objects in the real world, including those it hadn't previously encountered.
"We don't need to collect any new data," Dalal says of deploying the robot in the real world. "We just tell the robot what to do in English and it does it."
The team implements the two-part process by using foundation models such as GPT-4o that can look at the robot's environment and decompose the task -- like cleaning up the table -- into a sequence of skills for the robot to execute. Then the robot executes those skills, first estimating positions near objects using computer vision, then going there using motion planning, and finally manipulating the object using a depth camera to measure distances.
The researchers have applied their method to challenging multistage tasks such as opening drawers and placing objects in them or rearranging objects on a shelf. They have demonstrated that this approach works with robotics tasks that involve up to eight steps "but I think we could go even further," Dalal says.
Likewise, gathering data through demonstrations could enable this approach to be extended to objects that can't currently be simulated, such as soft and flexible objects.
"There's so much more to explore with ManipGen. The foundation we've built through this project opens up exciting possibilities for future advancements in robotic manipulation and brings us closer to the goal of developing generalist robots," says Min Liu, a master's student in the machine learning department and co-lead on the project.
"ManipGen really demonstrates the strength of simulation-to-reality transfer as a paradigm for producing robots that can generalize broadly, something we have seen in locomotion, but until now, not for general manipulation," says Deepak Pathak, an assistant professor of computer science in the Robotics Institute.
ManipGen builds on research to enable robots to solve longer and more complicated tasks, says Ruslan Salakhutdinov, the principal investigator on the project and professor of computer science in the machine learning department.
"In this iteration," he says, "we finally show the exciting culmination of years of work: an agent that can generalize and solve an enormous array of tasks in the real world."