Plunder: Probabilistic Program Synthesis for Learning from Unlabeled and Noisy Demonstrations

Jimmy Xin, Linus Zheng, Kia Rahmani, Jiayi Wei, Jarrett Holtz, Isil Dillig, Joydeep Biswas


Abstract

Imitation Learning (IL) is a popular approach for teaching robots how to perform novel tasks using only human demonstrations. Plunder is a novel Programmatic Imitation Learning (PIL) algorithm that integrates a probabilistic program synthesizer in an iterative Expectation-Maximization (EM) framework. Our method is robust against actuation errors in demonstrations, and the synthesized probabilistic policies intuitively reflect uncertainties commonly found in human demonstrations. We benchmark Plunder against several established IL techniques, and demonstrate its superiority across five challenging imitation learning tasks under noise. Plunder policies achieve 95% accuracy in matching the given demonstrations, outperforming the next best baseline by 19%. Additionally, policies generated by Plunder successfully complete the tasks 17% more frequently than the nearest baseline.


Overview of the Plunder algorithm
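At a high level, Plunder alternates between an E-step that uses a particle filter to infer likely action-label sequences for each demonstration under the current probabilistic policy, and an M-step that synthesizes a new policy maximizing the likelihood of those labels. The snippet below is a minimal, hypothetical sketch of this loop, not the actual implementation; `sample_label_sequences` and `synthesize_policy` are illustrative stand-ins for the particle filter and the program synthesizer.

```python
def plunder_em(demos, initial_policy, sample_label_sequences, synthesize_policy, iterations=10):
    """Sketch of the EM loop: the E-step infers action-label sequences with a
    particle filter, the M-step re-synthesizes a probabilistic policy from them."""
    policy = initial_policy
    for _ in range(iterations):
        # E-step: for each demonstration (observations + low-level actions),
        # sample likely label sequences under the current policy.
        label_samples = [sample_label_sequences(policy, obs, acts) for obs, acts in demos]
        # M-step: synthesize a new policy that maximizes the likelihood
        # of the sampled label sequences.
        policy = synthesize_policy(demos, label_samples)
    return policy
```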

Experiments

Stop Sign (SS): acceleration control in a straight line.
Pass Traffic (PT): switching lanes in highway traffic. See here for the version with human demonstrations.
Merge (MG): merging from the leftmost to the rightmost lane in highway traffic. See here for the version with human demonstrations.
Pick and Place (PP): controlling a robot arm to grab an object and move it to a designated location.
Stack (ST): stacking two boxes on top of each other at a designated location.

Experiment Details

| Task | Environment | Action Labels | Observed Variables | State Variables | Training Set Size | Testing Set Size |
|------|-------------|---------------|--------------------|-----------------|-------------------|------------------|
| SS | AV | ACC, DEC, CON | acc | a_min, a_max, v, v_max, target, dist_to_target | 10 | 20 |
| PT | AV | FASTER, SLOWER, LANE_LEFT, LANE_RIGHT | acc, steer | x, v, front_x, front_v, right_x, right_v, left_x, left_v | 10 | 20 |
| MG | AV | FASTER, SLOWER, LANE_LEFT, LANE_RIGHT | acc, steer | x, v, front_x, front_v, right_x, right_v, left_x, left_v | 10 | 20 |
| PP | RA | MOVE_TO_CUBE, MOVE_TO_TARGET | dx, dy, dz, d_end | x, y, z, block_x, block_y, block_z, target_x, target_y, target_z | 5 | 15 |
| ST | RA | MOVE_TO_CUBE1, MOVE_TO_CUBE2, MOVE_TO_TARGET, GRASP, LIFT | dx, dy, dz, d_end | x, y, z, block_x, block_y, block_z, block2_x, block2_y, block2_z, target_x, target_y, target_z | 10 | 10 |

Aggregate Results

AST sizes

For full policies, please visit each module’s page.

| Task | SS | PT | MG | PP | ST | PT_H | MG_H |
|------|----|----|----|----|----|------|------|
| AST size (total) | 25 | 96 | 46 | 22 | 55 | 196 | 98 |
| Average size per transition | 8.33 | 9.6 | 7.67 | 22 | 6.11 | 16.33 | 12.25 |

Accuracy Results

Visualizes the accuracy of the action labels that each learned policy produces on the test trajectories.
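As a concrete reading of this metric, here is a minimal sketch (function and variable names are illustrative, not taken from the repository):

```python
def label_accuracy(pred_trajs, true_trajs):
    """Per-timestep action-label accuracy, aggregated over all test trajectories."""
    correct, total = 0, 0
    for pred, true in zip(pred_trajs, true_trajs):
        correct += sum(p == t for p, t in zip(pred, true))
        total += len(true)
    return correct / total
```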

Likelihood Heatmap

Visualizes the log-likelihood of observations generated by each policy on the test trajectories.
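A minimal sketch of how such a log-likelihood can be computed, assuming a Gaussian observation model around the low-level actions predicted by the policy; `sigma` and the function name are illustrative assumptions:

```python
import math

def obs_log_likelihood(pred_actions, obs_actions, sigma=0.1):
    """Sum of Gaussian log-densities of the observed low-level actions
    (e.g., acc, steer) around the actions the policy would have produced."""
    ll = 0.0
    for mu, x in zip(pred_actions, obs_actions):
        ll += -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    return ll
```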

Noise Experiments

Visualizes how each approach's performance degrades as the noise level increases, measured by both action-label accuracy and the log-likelihood of observations.

EM Loop Convergence

Visualizes EM loop convergence for each task. More graphs can be found on each module’s page.

Success Rates

Each approach’s policy was tested by running it in the environment 100 times, and the number of successful runs was recorded. The results are visualized below, along with our metrics for what is considered a successful run.

| Task | Success Metric |
|------|----------------|
| SS | Coming to a stop near the target |
| PT | Maintaining some minimum velocity and being in a valid lane for a specified period of time |
| MG | Reaching the rightmost lane, starting from the leftmost lane |
| PP | Placing a randomly-initialized cube at a randomly-initialized target position |
| ST | Placing two randomly-initialized cubes in a stack at a randomly-initialized target position, in the correct order |
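Below is a minimal sketch of the evaluation loop described above, assuming an older Gym-style environment API and an `is_success` flag in `info`; all names are illustrative, not the project's actual harness:

```python
def success_rate(env, policy, episodes=100, horizon=500):
    """Roll out the learned policy repeatedly and count runs that meet the task's success metric."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()  # older Gym-style reset returning only the observation
        info = {}
        for _ in range(horizon):
            obs, reward, done, info = env.step(policy(obs))
            if done:
                break
        successes += bool(info.get("is_success", False))
    return successes / episodes
```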

Particle Filter Implementation

Our particle filter follows a standard implementation and can be found here: https://github.com/ut-amrl/plunder/blob/main/particleFilter/pf.h

The algorithm is as follows:

1. Initialize all particles with the default label.
2. For i = 1..n:
   - Forward-propagate particles by sampling a new label for each particle from the probabilistic policy π.
   - Reweight particles using the observation model.
   - Normalize the weights.
   - Resample particles to remove ineffective (low-weight) particles.
3. Trace back particle lineages to recover full label trajectories.
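For illustration, here is a minimal Python sketch of the same steps (the linked implementation is C++); `sample_next_label` and `obs_likelihood` are illustrative stand-ins for the probabilistic policy π and the observation model:

```python
import random

def particle_filter(sample_next_label, obs_likelihood, obs_traj, act_traj,
                    num_particles=1000, default_label="CON"):
    """Sketch of the label-inference particle filter:
    initialize, forward-propagate, reweight, normalize, resample, trace back."""
    # 1. Initialize all particles with the default label.
    particles = [[default_label] for _ in range(num_particles)]
    for t in range(1, len(obs_traj)):
        weights = []
        for p in particles:
            # 2. Forward-propagate: sample a new label from the probabilistic policy.
            p.append(sample_next_label(p[-1], obs_traj[t]))
            # 3. Reweight with the observation model for the observed low-level action.
            weights.append(obs_likelihood(act_traj[t], p[-1], obs_traj[t]))
        # 4. Normalize the weights.
        total = sum(weights)
        weights = [w / total for w in weights]
        # 5. Resample to remove ineffective (low-weight) particles; copying each
        # lineage keeps the traced-back label trajectories distinct.
        particles = [list(p) for p in random.choices(particles, weights=weights, k=num_particles)]
    # 6. Trace back lineages: each surviving particle is a full label trajectory.
    return particles
```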

Citation

@ARTICLE{10493855,
  author={Xin, Jimmy and Zheng, Linus and Rahmani, Kia and Wei, Jiayi and Holtz, Jarrett and Dillig, Isil and Biswas, Joydeep},
  journal={IEEE Robotics and Automation Letters},
  title={Programmatic Imitation Learning From Unlabeled and Noisy Demonstrations},
  year={2024},
  volume={},
  number={},
  pages={1-8},
  keywords={Task analysis;Trajectory;Probabilistic logic;Noise measurement;Training data;Noise;Approximation algorithms},
  doi={10.1109/LRA.2024.3385691}}