ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion

Accepted at CoRL 2025


  • Zichao Hu1,
  • Chen Tang1,
  • Michael Munje1,
  • Yifeng Zhu1,
  • Alex Liu1,
  • Shuijing Liu1,
  • Garrett Warnell12,
  • Peter Stone13,
  • Joydeep Biswas1
1The University of Texas at Austin, 2Army Research Laboratory, 3 Sony AI
Figure: Given an instruction that specifies how a robot should interact with entities in the scene (a), ComposableNav leverages the composability of diffusion models (b) to compose motion primitives to generate instruction-following trajectories (c).

Abstract

This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “pass from the pedestrian while staying on the right side of the road” consists of two specifications: “pass from the pedestrian” and “walk on the right side of the road.” To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap-composing baselines.


ComposableNav

ComposableNav is a diffusion-based planner for instruction-following navigation. ComposableNav first learns motion primitives via a two-stage training procedure. At deployment, given instruction specifications, it selects the relevant primitives and composes them by summing the noise predicted by each primitive's diffusion model at every step of the denoising process. Finally, for real-time control, ComposableNav is paired with a model predictive controller.
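The composition step above can be sketched as a standard DDPM reverse process in which the per-step noise estimate is a weighted sum over primitives. This is a minimal NumPy illustration, not the paper's implementation: the linear beta schedule, the `weights` argument, and the toy trajectory shape are all assumptions for the sketch.

```python
import numpy as np

def make_schedule(T=50):
    """Hypothetical linear DDPM noise schedule (betas, alphas, cumulative alphas)."""
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def compose_sample(models, obs, weights, T=50, shape=(64, 2), seed=0):
    """Denoise a trajectory while summing weighted noise predictions
    from several primitive diffusion models (the composition step)."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    tau = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        # Composed noise estimate: weighted sum over primitive denoisers
        eps = sum(w * m(tau, t, obs) for w, m in zip(weights, models))
        # Standard DDPM posterior mean for the epsilon-prediction parameterization
        mean = (tau - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        tau = mean + np.sqrt(betas[t]) * noise
    return tau
```

With the primitive denoisers replaced by trained networks, `tau` is the composed trajectory handed to the downstream controller.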

Main Method Figure


Two-Stage Training Procedure

We randomly generate environments with different agents, terrains, and goal locations. Then, using an RRT + Hybrid A* planner, we create diverse, smooth, collision-free trajectories and save them as training data for supervised pre-training of the base diffusion model.

Randomly Generated Environment With Dynamic Agents

Plan Robot Trajectories Via RRT + Hybrid A*

Diverse Collision-Free Goal-reaching Trajectories

From the collision-free robot trajectory dataset, the model learns a conditional denoising network $f^{\text{base}}_{\theta}(\tau_t, t, \mathcal O)$, which predicts the noise $\epsilon$ to denoise the trajectory $\tau_t$ at step $t$, conditioned on environment observations $\mathcal O$.
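The supervised pre-training stage corresponds to the usual epsilon-prediction objective: noise a clean demonstration trajectory to a random diffusion step and regress the network's output onto the added noise. The sketch below is a simplified NumPy rendering under assumed names (`f_theta`, `alpha_bars`); the actual model, optimizer, and conditioning encoder are omitted.

```python
import numpy as np

def diffusion_train_step(f_theta, tau0, obs, alpha_bars, rng):
    """One supervised pre-training step on a clean trajectory tau0.
    f_theta(tau_t, t, obs) predicts the noise eps added at step t;
    returns the MSE loss (gradient update omitted in this sketch)."""
    T = len(alpha_bars)
    t = int(rng.integers(0, T))                    # sample a random diffusion step
    eps = rng.standard_normal(tau0.shape)          # noise to inject
    # Forward (noising) process: corrupt tau0 to its step-t marginal
    tau_t = np.sqrt(alpha_bars[t]) * tau0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    pred = f_theta(tau_t, t, obs)                  # conditional noise prediction
    return float(np.mean((pred - eps) ** 2))       # epsilon-prediction MSE
```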

We fine-tune the base model into each motion primitive with RL using the DDPO approach. Each primitive is trained in environments containing the entities relevant to that primitive, where the diffusion model generates trajectories that a critic evaluates to assign rewards.
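DDPO treats the denoising chain as an MDP and applies a REINFORCE-style update: the critic's trajectory reward weights the summed per-step denoising log-likelihoods. The snippet below is a schematic objective only, with assumed inputs (`log_probs`, `rewards`); DDPO's full machinery (per-step log-prob computation, clipping, batching) is not shown.

```python
import numpy as np

def ddpo_objective(log_probs, rewards):
    """Sketch of a DDPO-style policy-gradient objective.
    log_probs[i]: per-denoising-step log-likelihoods of sampled trajectory i
                  under the current model.
    rewards[i]:   the critic's reward for that trajectory.
    Rewards are normalized into advantages, which then weight the summed
    log-likelihoods; minimizing this raises the likelihood of high-reward
    denoising chains."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize rewards
    per_sample = np.array([np.sum(lp) for lp in log_probs])     # sum over steps
    return float(-(adv * per_sample).mean())                    # negative weighted log-lik
```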

Deployment Illustration


Simulation Demonstrations

Learned Primitives

Pass From Left
Pass From Right
Follow
Yield
Avoid
Walk Over

On-the-Fly Composition

Pass From Left + Yield
Pass From Left + Follow + Yield
Yield + Pass From Left + Walk Over
Avoid + Follow + Walk Over
Pass From Right + Pass From Left
Pass From Left + Pass From Right
Avoid + Avoid + Yield
Walk Over + Walk Over + Yield

Real World Demonstrations

ComposableNav enables complex navigation by composing motion primitives.

ComposableNav enables customizable robot navigation through instructions (indoor).

ComposableNav enables customizable robot navigation through instructions (outdoor).

ComposableNav runs in real time and adapts to unexpected changes.

The robot passes through the narrow doorway after the person in the white coat and the one in the black hoodie. When another person approaches from the opposite side, the robot waits for them to pass before continuing its navigation.

The robot follows the person in the white shirt down the hallway toward the elevator. An unexpected person enters from the right and briefly blocks its path. Near the elevator, the robot waits for the door to open and lets the person enter first, then follows into the elevator.


BibTeX

@inproceedings{hu2025composablenav,
  title={ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion},
  author={Zichao Hu and Chen Tang and Michael Joseph Munje and Yifeng Zhu and Alex Liu and Shuijing Liu and Garrett Warnell and Peter Stone and Joydeep Biswas},
  booktitle={9th Annual Conference on Robot Learning},
  year={2025},
  url={https://openreview.net/forum?id=FBsawSyYBM}
}