This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “pass_from the pedestrian while staying on the right side of the road” consists of two specifications: “pass_from the pedestrian” and “walk on the right side of the road.” To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap-composing baselines.
ComposableNav is a diffusion-based planner for instruction-following navigation. It first learns motion primitives via a two-stage training procedure. At deployment, given the specifications of an instruction, it selects the relevant primitives and composes them by summing the predicted noise from each diffusion model during the denoising process. Finally, for real-time control, ComposableNav is paired with a model predictive controller.
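To make the composition step concrete, below is a minimal sketch of one composed denoising step. The deviation-from-base formulation and the per-primitive `weights` are assumptions (one common compositional-diffusion rule); only the overall "sum the predicted noises during denoising" structure comes from the description above, and the exact rule in ComposableNav may differ.

```python
import torch

@torch.no_grad()
def composed_ddpm_step(tau_t, t, obs, base_model, primitive_models,
                       alphas, alphas_cumprod, weights=None):
    """One DDPM reverse step whose noise estimate composes several
    motion-primitive diffusion models (sketch; t is a scalar timestep)."""
    if weights is None:
        weights = [1.0] * len(primitive_models)

    # Compose noise predictions: base prediction plus each primitive's
    # weighted deviation from it (the weighting scheme is hypothetical).
    eps_base = base_model(tau_t, t, obs)
    eps = eps_base + sum(w * (m(tau_t, t, obs) - eps_base)
                         for w, m in zip(weights, primitive_models))

    # Standard DDPM posterior mean computed from the composed noise estimate.
    a_t, abar_t = alphas[t], alphas_cumprod[t]
    mean = (tau_t - (1 - a_t) / torch.sqrt(1 - abar_t) * eps) / torch.sqrt(a_t)
    if t == 0:
        return mean
    # Ancestral sampling noise with variance beta_t = 1 - alpha_t.
    return mean + torch.sqrt(1 - a_t) * torch.randn_like(tau_t)
```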
We randomly generate environments with different agents, terrains, and goal locations. Then, using an RRT + Hybrid A* planner, we create diverse, smooth, collision-free trajectories and save them as training data for supervised pre-training of the base diffusion model. A rough sketch of this pipeline follows the figure panels below.
Randomly Generated Environment With Dynamic Agents
Plan Robot Trajectories Via RRT + Hybrid A*
Diverse Collision-Free Goal-Reaching Trajectories
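As a rough illustration of the data-generation pipeline described above, the sketch below shows one possible collection loop. All helper names (`sample_environment`, `rrt_waypoints`, `hybrid_astar_smooth`) are hypothetical placeholders for the components named in the text, not the paper's actual implementation.

```python
import pickle

def generate_dataset(num_episodes, out_path="pretrain_data.pkl"):
    """Collect collision-free goal-reaching trajectories for pre-training
    (sketch; helpers below are hypothetical stand-ins)."""
    dataset = []
    for _ in range(num_episodes):
        env = sample_environment()  # random agents, terrain, goal location
        # Coarse global route from RRT, then Hybrid A* for a smooth,
        # kinodynamically feasible trajectory along it.
        waypoints = rrt_waypoints(env.start, env.goal, env.static_map)
        traj = hybrid_astar_smooth(waypoints, env)
        if traj is not None and env.is_collision_free(traj):
            dataset.append({"trajectory": traj, "observation": env.observe()})
    with open(out_path, "wb") as f:
        pickle.dump(dataset, f)
```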
From the collision-free robot trajectory dataset, the model learns a conditional denoising network $f^{\text{base}}_{\theta}(\tau_t, t, \mathcal O)$, which predicts the noise $\epsilon$ to denoise the trajectory $\tau_t$ at step $t$, conditioned on environment observations $\mathcal O$.
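For concreteness, here is a minimal sketch of one supervised pre-training step with the standard DDPM noise-prediction objective. The batch format and conditioning details are assumptions; only the loss structure (predict the injected noise $\epsilon$ given $\tau_t$, $t$, and $\mathcal O$) follows from the description above.

```python
import torch
import torch.nn.functional as F

def pretrain_step(f_base, batch, alphas_cumprod, optimizer):
    """One DDPM training step for the base model (sketch; assumes `batch`
    holds clean planner trajectories and environment observations)."""
    tau_0, obs = batch["trajectory"], batch["observation"]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (tau_0.shape[0],), device=tau_0.device)

    # Forward-diffuse: tau_t = sqrt(abar_t) * tau_0 + sqrt(1 - abar_t) * eps.
    eps = torch.randn_like(tau_0)
    abar = alphas_cumprod[t].view(-1, *([1] * (tau_0.dim() - 1)))
    tau_t = torch.sqrt(abar) * tau_0 + torch.sqrt(1 - abar) * eps

    # Train f_base(tau_t, t, obs) to recover the injected noise.
    loss = F.mse_loss(f_base(tau_t, t, obs), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```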
We fine-tune the base model into each motion primitive with RL using the DDPO approach. Each primitive is trained in environments containing the relevant dynamic agents, where the diffusion model generates trajectories that a critic evaluates to assign rewards.
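The sketch below illustrates the flavor of such an update: DDPO treats the denoising chain as a policy and applies a policy gradient weighted by the critic's reward. This shows only a plain REINFORCE variant; the actual DDPO recipe additionally uses importance sampling and PPO-style clipping, and `ddpm_posterior_mean` is a hypothetical helper that recomputes the Gaussian denoising mean under the current parameters.

```python
import torch

def ddpo_update(f_primitive, samples, optimizer):
    """One REINFORCE-style DDPO update (sketch; assumes each sample stores
    its critic reward, a baseline, and per-step chain tuples
    (tau_t, t, obs, tau_prev, std) recorded during sampling)."""
    losses = []
    for s in samples:
        advantage = s["reward"] - s["reward_baseline"]
        for (tau_t, t, obs, tau_prev, std) in s["chain"]:
            # Recompute the denoising distribution under current parameters.
            mean = ddpm_posterior_mean(f_primitive, tau_t, t, obs)  # hypothetical helper
            log_prob = torch.distributions.Normal(mean, std).log_prob(tau_prev).sum()
            # Score-function gradient: reinforce steps of high-reward chains.
            losses.append(-advantage * log_prob)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```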
@inproceedings{hu2025composablenav,
title={ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion},
author={Zichao Hu and Chen Tang and Michael Joseph Munje and Yifeng Zhu and Alex Liu and Shuijing Liu and Garrett Warnell and Peter Stone and Joydeep Biswas},
booktitle={9th Annual Conference on Robot Learning},
year={2025},
url={https://openreview.net/forum?id=FBsawSyYBM}
}