Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators
- Alexander Herzog*
- Kanishka Rao*
- Karol Hausman*
- Yao Lu*
- Paul Wohlhart*
- Mengyuan Yan
- Jessica Lin
- Montserrat Gonzalez Arenas
- Ted Xiao
- Daniel Kappler
- Daniel Ho
- Jarek Rettinghouse
- Yevgen Chebotar
- Kuang-Huei Lee
- Keerthana Gopalakrishnan
- Ryan Julian
- Adrian Li
- Chuyuan Kelly Fu
- Bob Wei
- Sangeetha Ramesh
- Khem Holden
- Kim Kleiven
- David Rendleman
- Sean Kirmani
- Jeff Bingham
- Jon Weisz
- Ying Xu
- Wenlong Lu
- Matthew Bennice
- Cody Fong
- David Do
- Jessica Lam
- Yunfei Bai
- Benjie Holson
- Michael Quinlan
- Noah Brown
- Mrinal Kalakrishnan
- Julian Ibarz
- Peter Pastor
- Sergey Levine
*Authors with equal contribution
We describe a system for deep reinforcement learning of robotic manipulation skills applied to a large-scale real-world task: sorting recyclables and trash in office buildings. Real-world deployment of deep RL policies requires not only effective training algorithms, but also the ability to bootstrap real-world training and enable broad generalization. To this end, our system - RL at Scale (RLS) - combines scalable deep RL from real-world data with bootstrapping from training in simulation, and incorporates auxiliary inputs from existing computer vision systems as a way to boost generalization to novel objects, while retaining the benefits of end-to-end training. We analyze the tradeoffs of different design decisions in our system, and present a large-scale empirical validation that includes training on real-world data gathered over the course of 24 months of experimentation, across a fleet of 23 robots in three office buildings, with a total training set of 9527 hours of robotic experience. Our final validation also consists of 4800 evaluation trials across 240 waste station configurations, in order to evaluate in detail the impact of the design decisions in our system, the scaling effects of including more real-world data, and the performance of the method on novel objects.
Problem Setup
We study the problem of continual real-world reinforcement learning through the lens of a large-scale experiment, in which we deployed a fleet of 23 RL-enabled robots over two years in Google office buildings to sort waste and recycling. In our experiment, a robot roamed around an office building searching for “waste stations” (bins for recyclables, compost, and trash). The robot was tasked with approaching each waste station to sort it, moving items between the bins so that all recyclables (cans, bottles, etc.) were placed in the recyclable bin, all the compostable items (cardboard containers, paper cups, etc.) were placed in the compost bin, and everything else was placed in the landfill trash bin.
The task of sorting waste is much harder than it sounds: not only does the robot need to correctly pick up the vast variety of objects that people deposit into waste bins, but it also needs to identify the appropriate bin for each object and sort them as quickly and efficiently as possible.
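The sorting rule described above can be made concrete with a small sketch. The category names, the category-to-bin mapping, and the definition of contamination as the fraction of misplaced items are illustrative assumptions here, not the exact metric used in the paper.

```python
# Hypothetical category-to-bin mapping, following the examples in the text.
CORRECT_BIN = {
    "can": "recycling",
    "bottle": "recycling",
    "cardboard_container": "compost",
    "paper_cup": "compost",
}
DEFAULT_BIN = "landfill"  # everything else goes to the landfill trash bin


def target_bin(item):
    """Return the bin an item belongs in under the rule above."""
    return CORRECT_BIN.get(item, DEFAULT_BIN)


def contamination(station):
    """Fraction of items that sit in the wrong bin (assumed definition).

    `station` maps a bin name to the list of items currently in it.
    """
    total = sum(len(items) for items in station.values())
    if total == 0:
        return 0.0
    misplaced = sum(
        1
        for bin_name, items in station.items()
        for item in items
        if target_bin(item) != bin_name
    )
    return misplaced / total


station = {
    "recycling": ["can", "paper_cup"],
    "compost": ["cardboard_container"],
    "landfill": ["bottle", "chip_bag"],
}
print(contamination(station))  # 2 of the 5 items are misplaced
```

In this toy station, the paper cup and the bottle are in the wrong bins, so a fully sorted station would bring the value down to zero.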
The experiment setup enabled robots to learn on the job and improve through real-world experience, additional autonomous data collection in “robot classrooms,” and simulation. Our robotic system combines scalable deep RL from real-world data with bootstrapping from training in simulation and auxiliary object perception inputs to boost generalization, while retaining the benefits of end-to-end training, which we validate with 4,800 evaluation trials across 240 waste station configurations.
To make sure that robots can learn on the job, we need to bootstrap the robots with a basic set of skills. To this end, we use four sources of experience: (1) a set of simple hand-designed policies that have a very low success rate, but serve to provide some initial experience, (2) a simulated training framework that uses sim-to-real transfer to provide some initial bin sorting strategies, (3) robot classrooms where the robots continually practice at a set of representative waste stations, and (4) the real deployment setting, where robots practice in real office buildings with real trash.
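One way to picture how the four sources listed above could feed a single learner is a weighted mixture over per-source buffers. The following is a minimal sketch under stated assumptions: the mixing weights, the buffer layout, and the function name `sample_batch` are hypothetical, and the text does not say how the sources are actually balanced.

```python
import random

# Hypothetical mixing weights; the source does not state the real proportions.
SOURCE_WEIGHTS = {
    "scripted": 0.1,
    "simulation": 0.3,
    "classroom": 0.4,
    "deployment": 0.2,
}


def sample_batch(buffers, batch_size, rng):
    """Draw a batch by picking a data source, then an episode from that source.

    Sources whose buffer is still empty are skipped, so training can start
    before any deployment data exists.
    """
    names = [name for name in SOURCE_WEIGHTS if buffers.get(name)]
    weights = [SOURCE_WEIGHTS[name] for name in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append((source, rng.choice(buffers[source])))
    return batch


buffers = {
    "scripted": ["s0", "s1"],
    "simulation": ["m0", "m1", "m2"],
    "classroom": ["c0", "c1", "c2", "c3"],
    "deployment": [],  # empty until robots reach real office buildings
}
batch = sample_batch(buffers, batch_size=8, rng=random.Random(0))
```

Skipping empty buffers mirrors the bootstrapping order in the text: scripted and simulated experience is available first, and classroom and deployment data join the mixture as they are collected.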
We start by learning to sort in simulation, using the previously developed PI-QT-Opt framework to obtain a sorting policy. To make sim-to-real transfer possible, we apply a separately trained RetinaGAN model to make the simulated images look closer to reality, as shown below.
Once we have an initial sim-to-real policy and data collected using scripts in the real world, we begin collecting data autonomously in a lab setting that we call a “robot classroom.” While real-world office buildings provide the most representative experience, their data collection throughput is limited: on some days there is a lot of trash to sort, and on others not much. Our robots therefore collect a large portion of their experience in robot classrooms. In the classroom shown below, 20 robots practice the waste sorting task:
Equipped with data from scripts, simulation, and robot classrooms, we continuously train our waste sorting policies using PI-QT-Opt. The resulting policy is deployed in real office buildings; in this case, we deployed RLS in 3 office buildings with 30 waste stations.
The deployed policy kept training on all sources of data to improve sorting success in novel scenarios.
Method
Equipped with real and simulated data, we use deep RL to train an end-to-end policy that is directly optimized for reducing the contamination of the bins. As with our simulation policy, we use PI-QT-Opt to train the final policy on the complete dataset assembled from simulation and real-world collection.
The diagram of the neural network architecture of the Q-function that is learned with PI-QT-Opt is shown below.
We feed two RGB images into two separate convolutional towers, whose outputs are later concatenated and processed by another set of convolutional layers. The two images are the current camera image and the object mask image. The object mask image is an extra image with a dot at the center of every object that is currently misplaced. The color of each dot indicates which bin the object should be sorted into, and the mask is generated using a pre-trained vision model. This image is fed to the network as an extra input channel concatenated to the current RGB image.
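To illustrate the object mask described above, here is a minimal sketch that renders such an image from a list of detections. The color encoding, the detection format, and the square dot shape are assumptions made for illustration; in the real system the detections come from a pre-trained vision model, and its exact output format is not given in the text.

```python
# Hypothetical colors for each target bin; the real encoding is not specified.
BIN_COLORS = {
    "recycling": (0, 0, 255),
    "compost": (0, 255, 0),
    "landfill": (255, 0, 0),
}


def make_object_mask(height, width, detections, radius=1):
    """Build an RGB mask with a dot at the center of each misplaced object.

    `detections` is a list of dicts with keys `center` (row, col),
    `current_bin`, and `target_bin`, as an upstream vision model might emit.
    Each dot is drawn as a small square clipped to the image bounds.
    """
    mask = [[(0, 0, 0) for _ in range(width)] for _ in range(height)]
    for det in detections:
        if det["current_bin"] == det["target_bin"]:
            continue  # correctly placed objects get no dot
        row, col = det["center"]
        color = BIN_COLORS[det["target_bin"]]
        for r in range(max(0, row - radius), min(height, row + radius + 1)):
            for c in range(max(0, col - radius), min(width, col + radius + 1)):
                mask[r][c] = color
    return mask


detections = [
    {"center": (2, 2), "current_bin": "landfill", "target_bin": "compost"},
    {"center": (0, 4), "current_bin": "recycling", "target_bin": "recycling"},
]
mask = make_object_mask(5, 6, detections)
```

Only the first detection produces a dot, colored for the compost bin; the second object is already in the right bin, so its location stays black.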
We train this model using deep RL, which allows us not only to distill the best possible policy out of the bootstrapping data, but also to let the robot improve continuously as it keeps interacting with waste stations.
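The text above does not spell out how a learned Q-function turns into actions. QT-Opt-style methods, which PI-QT-Opt builds on, typically pick actions by approximately maximizing the Q-function with the cross-entropy method (CEM) and train toward a one-step Bellman target. The sketch below is a generic illustration of those two pieces, not the authors' implementation: `toy_q` is a stand-in for the learned network, and all hyperparameters are hypothetical.

```python
import random


def cem_argmax(q_fn, state, action_dim, rng, iters=3, samples=64, elites=6):
    """Approximate argmax over actions of q_fn(state, action) using CEM."""
    mean = [0.0] * action_dim
    std = [1.0] * action_dim
    best_action, best_q = None, float("-inf")
    for _ in range(iters):
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(samples)
        ]
        ranked = sorted(candidates, key=lambda a: q_fn(state, a), reverse=True)
        top = ranked[:elites]
        top_q = q_fn(state, top[0])
        if top_q > best_q:
            best_action, best_q = top[0], top_q
        # Refit the sampling distribution to the elite actions.
        mean = [sum(a[i] for a in top) / elites for i in range(action_dim)]
        std = [
            max(1e-3, (sum((a[i] - mean[i]) ** 2 for a in top) / elites) ** 0.5)
            for i in range(action_dim)
        ]
    return best_action, best_q


def bellman_target(reward, done, next_q, gamma=0.9):
    """One-step target: the reward plus the discounted best next-state value."""
    return reward if done else reward + gamma * next_q


def toy_q(state, action):
    # Toy stand-in for the learned network: peaks when action equals state.
    return -sum((a - s) ** 2 for a, s in zip(action, state))


best_action, best_q = cem_argmax(toy_q, [0.5, -0.25], 2, random.Random(0))
target = bellman_target(reward=0.0, done=False, next_q=best_q)
```

Because the maximization only needs forward passes through the Q-function, the same routine serves both for computing training targets and for choosing actions on the robot.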
Results
In the end, we gathered 540k trials in the classrooms and 32.5k trials from deployment. Overall system performance improved as more data was collected. We evaluated our final system in the classrooms to allow for controlled comparisons, setting up scenarios based on what the robots saw during deployment. You can see the classroom evaluation scenes below.
The final system could accurately sort about 84% of the objects on average, with performance increasing steadily as more data was added. We performed an ablation study to understand how our design decisions contribute to the final performance as seen in the next graph.
Lastly, one of the most challenging aspects of this problem was the diversity of data encountered in the real office buildings. Below, we present a few examples of the variety of situations and unique objects found in real office buildings.
We logged statistics from three real-world deployments between 2021 and 2022 and found that our system could reduce contamination in the waste bins by between 40% and 50%, despite the challenging out-of-distribution scenarios.
Our paper provides further insights on the technical design, ablations studying various design decisions, and more detailed statistics on the experiments.
We would like to thank Mohi Khansari, Cameron Tuckerman, Stanley Soo, Justin Vincent, Mario Prats, Thomas Buschmann, Joséphine Simon, Jarrett Lee, Kalpesh Kuber, Meghha Dhoke, Christian Bodner, Russell Wong and the entire Everyday Robots team for their help and support in various aspects of the project.