Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators
- Alexander Herzog*
- Kanishka Rao*
- Karol Hausman*
- Yao Lu*
- Paul Wohlhart*
- Mengyuan Yan
- Jessica Lin
- Montserrat Gonzalez Arenas
- Ted Xiao
- Daniel Kappler
- Daniel Ho
- Jarek Rettinghouse
- Yevgen Chebotar
- Kuang-Huei Lee
- Keerthana Gopalakrishnan
- Ryan Julian
- Adrian Li
- Chuyuan Kelly Fu
- Bob Wei
- Sangeetha Ramesh
- Khem Holden
- Kim Kleiven
- David Rendleman
- Sean Kirmani
- Jeff Bingham
- Jon Weisz
- Ying Xu
- Wenlong Lu
- Matthew Bennice
- Cody Fong
- David Do
- Jessica Lam
- Yunfei Bai
- Benjie Holson
- Michael Quinlan
- Noah Brown
- Mrinal Kalakrishnan
- Julian Ibarz
- Peter Pastor
- Sergey Levine
*Authors with equal contribution
Abstract
We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only effective training algorithms, but also the ability to bootstrap real-world training and enable broad generalization. To this end, our system, RL at Scale (RLS), combines scalable deep RL from real-world data with bootstrapping from training in simulation, and incorporates auxiliary inputs from existing computer vision systems as a way to boost generalization to novel objects, while retaining the benefits of end-to-end training. We analyze the tradeoffs of different design decisions in our system, and present a large-scale empirical validation that includes training on real-world data gathered over the course of 24 months of experimentation, across a fleet of 23 robots in three office buildings, with a total training set of 9,527 hours of robotic experience. Our final validation also consists of 4,800 evaluation trials across 240 waste station configurations, in order to evaluate in detail the impact of the design decisions in our system, the scaling effects of including more real-world data, and the performance of the method on novel objects.
Video
Problem Setup
We study the problem of continual real-world reinforcement learning through the lens of a large-scale experiment, in which we deployed a fleet of 23 RL-enabled robots over two years in Google office buildings to sort waste and recycling. In our experiment, a robot roamed around an office building searching for “waste stations” (bins for recyclables, compost, and trash). The robot was tasked with approaching each waste station and sorting it, moving items between the bins so that all recyclables (cans, bottles, etc.) were placed in the recyclable bin, all compostable items (cardboard containers, paper cups, etc.) were placed in the compost bin, and everything else was placed in the landfill trash bin. The task of sorting waste is much harder than it sounds: not only does the robot need to correctly pick up the vast variety of objects that people deposit into waste bins, but it also needs to identify the appropriate bin for each object and sort them as quickly and efficiently as possible.
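The sorting rule above can be written down as a small sketch. The item names, the bin labels, and the `contamination` metric (fraction of items sitting in the wrong bin) are illustrative assumptions, not the definitions used in the actual system.

```python
# Category-to-bin rule paraphrased from the task description.
# Item names and bin labels are hypothetical placeholders.
CORRECT_BIN = {
    "can": "recycle",
    "bottle": "recycle",
    "cardboard_container": "compost",
    "paper_cup": "compost",
}


def correct_bin(item):
    """Return the target bin for an item; anything unlisted goes to landfill."""
    return CORRECT_BIN.get(item, "landfill")


def misplaced(station):
    """List (item, current_bin, target_bin) for every item in the wrong bin."""
    return [
        (item, bin_name, correct_bin(item))
        for bin_name, items in station.items()
        for item in items
        if correct_bin(item) != bin_name
    ]


def contamination(station):
    """Assumed metric: fraction of all items that sit in the wrong bin."""
    total = sum(len(items) for items in station.values())
    if total == 0:
        return 0.0
    return len(misplaced(station)) / total


# One hypothetical waste station before sorting.
station = {
    "recycle": ["can", "paper_cup"],
    "compost": ["cardboard_container"],
    "landfill": ["bottle", "chip_bag"],
}
moves = misplaced(station)
score = contamination(station)
```

In this example two of the five items (the paper cup and the bottle) are in the wrong bin, so a fully sorted station would bring the assumed contamination score from 0.4 down to 0.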
The experiment setup enabled robots to learn on the job and improve through real-world experience, additional autonomous data collection in “robot classrooms,” and simulation. Our robotic system combines scalable deep RL from real-world data with bootstrapping from training in simulation and auxiliary object perception inputs to boost generalization, while retaining the benefits of end-to-end training, which we validate with 4,800 evaluation trials across 240 waste station configurations.
Bootstrapping
We start by learning to sort in simulation, using the previously developed PI-QT-Opt framework to obtain an initial sorting policy. To make sim2real transfer possible, we apply a separately trained RetinaGAN to make the simulated images look closer to reality, as shown below.
Once we have an initial sim2real policy and data collected using scripts in the real world, we begin collecting data autonomously in a lab setting that we call a “robot classroom.” While real-world office buildings provide the most representative experience, their data collection throughput is limited: some days there is a lot of trash to sort, some days not so much. Our robots therefore collect a large portion of their experience in robot classrooms. In the classroom shown below, 20 robots practice the waste sorting task:
Equipped with the data coming from scripts, simulation, and robot classrooms, we continuously train our waste sorting policies using PI-QT-Opt. The resulting policy is deployed in real office buildings; in this case, we deployed RLS in 3 office buildings with 30 waste stations.
The deployed policy was then continually trained on all sources of data to improve sorting success in novel scenarios.
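One simple way to picture training on several data sources at once is a sampler that draws each training example from a weighted mix of per-source buffers. The sketch below is only an illustration: the mixing weights, the buffer contents, and the `sample_batch` helper are hypothetical, and the page does not state how the real system balances its sources.

```python
import random

# Hypothetical mixing weights; the real proportions are not given in the text.
SOURCE_WEIGHTS = {
    "simulation": 0.25,
    "scripted": 0.05,
    "classroom": 0.50,
    "deployment": 0.20,
}


def sample_batch(buffers, weights, batch_size, rng):
    """Draw a batch of (source, transition) pairs from a weighted mix of buffers."""
    # Skip sources that hold no data yet, e.g. before the first deployment.
    names = [name for name in weights if buffers.get(name)]
    if not names:
        raise ValueError("no data available in any source")
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [(name, rng.choice(buffers[name])) for name in picks]


# Placeholder transitions; real entries would be (state, action, reward, ...) tuples.
buffers = {
    "simulation": ["sim_0", "sim_1"],
    "scripted": ["script_0"],
    "classroom": ["class_0", "class_1", "class_2"],
    "deployment": [],
}
batch = sample_batch(buffers, SOURCE_WEIGHTS, batch_size=8, rng=random.Random(0))
```

Because empty sources are skipped, the same sampler works both before deployment data exists and after it starts arriving.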
Method
Equipped with real and simulated data, we use deep RL to train an end-to-end policy that is directly optimized for reducing the contamination of the bins. As with our simulation policy, we use PI-QT-Opt to train the final policy on the complete dataset assembled from simulation and real-world collection. The neural network architecture of the Q-function learned with PI-QT-Opt is shown in the diagram below.
We train this model using deep RL, which allows us not only to distill the best possible policy out of the bootstrapping data, but also to enable the robot to improve continuously as it gains more experience interacting with waste stations.
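For readers unfamiliar with QT-Opt-style methods, it may help to see how a learned Q-function can be turned into actions. The original QT-Opt approach picks actions by approximately maximizing the Q-function with the cross-entropy method (CEM). This page does not describe action selection for PI-QT-Opt, so the following is a generic sketch under that assumption. `toy_q` is a hypothetical stand-in for the learned network, and all hyperparameters are illustrative.

```python
import random
import statistics


def toy_q(action):
    """Hypothetical stand-in for a learned Q-function; peaks at (0.3, -0.2)."""
    x, y = action
    return -((x - 0.3) ** 2 + (y + 0.2) ** 2)


def cem_argmax(q_fn, dim, iterations=5, samples=64, elites=6, seed=0):
    """Approximate argmax of q_fn over continuous actions with CEM."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        # Sample candidate actions from the current Gaussian.
        candidates = [
            [rng.gauss(mean[d], std[d]) for d in range(dim)]
            for _ in range(samples)
        ]
        # Keep the highest-scoring candidates (the "elites").
        candidates.sort(key=q_fn, reverse=True)
        top = candidates[:elites]
        # Refit the Gaussian to the elites; the small floor keeps std positive.
        mean = [statistics.fmean(a[d] for a in top) for d in range(dim)]
        std = [statistics.pstdev([a[d] for a in top]) + 1e-3 for d in range(dim)]
    return mean


best_action = cem_argmax(toy_q, dim=2)
```

With a real Q-network, `q_fn` would also take the current observation, and candidates would typically be scored in one batched forward pass rather than one at a time.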
Results
In the end, we gathered 540k trials in the classrooms and 32.5k trials from deployment. Overall system performance improved as more data was collected. We evaluated our final system in the classrooms to allow for controlled comparisons, setting up scenarios based on what the robots saw during deployment. You can see the classroom evaluation scenes below.
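As a quick sanity check on the scale of these numbers, the sketch below works through the arithmetic using the rounded figures quoted on this page (540k and 32.5k trials, plus the 4,800 evaluation trials across 240 waste station configurations mentioned earlier). The per-configuration figure is only an average and assumes nothing about how trials were actually distributed.

```python
# Rounded figures as quoted on the page.
classroom_trials = 540_000
deployment_trials = 32_500
eval_trials = 4_800
eval_configs = 240

total_trials = classroom_trials + deployment_trials
deployment_share = deployment_trials / total_trials
avg_trials_per_config = eval_trials / eval_configs

print(f"total training trials: {total_trials}")
print(f"deployment share: {deployment_share:.1%}")
print(f"average evaluation trials per configuration: {avg_trials_per_config:.0f}")
```

Under these rounded figures, deployment accounts for roughly 5.7% of about 572.5k trials, which is consistent with the classrooms supplying the bulk of the experience, and the evaluation averages 20 trials per configuration.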
Citation
Acknowledgements
We would like to thank Mohi Khansari, Cameron Tuckerman, Stanley Soo, Justin Vincent, Mario Prats, Thomas Buschmann, Joséphine Simon, Jarrett Lee, Kalpesh Kuber, Meghha Dhoke, Christian Bodner, Russell Wong and the entire Everyday Robots team for their help and support in various aspects of the project.
The website template was borrowed from Jon Barron.