This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “pass_from the pedestrian while staying on the right side of the road” consists of two specifications: “pass_from the pedestrian” and “walk on the right side of the road.” To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap-composing baselines.
ComposableNav is a diffusion-based planner for instruction-following navigation. It first learns motion primitives via a two-stage training procedure. At deployment, given the specifications of an instruction, it selects the relevant primitives and composes them by summing the predicted noise from each diffusion model during the denoising process. Finally, for real-time control, ComposableNav is paired with a model predictive controller.
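To make the composition step concrete, below is a minimal sketch of one composed denoising step. The deviation-from-base formulation and the per-primitive `weights` are assumptions (one common compositional-diffusion rule); only the overall "sum the predicted noises during denoising" structure comes from the description above, and the exact rule in ComposableNav may differ.

```python
import torch

@torch.no_grad()
def composed_ddpm_step(tau_t, t, obs, base_model, primitive_models,
                       alphas, alphas_cumprod, weights=None):
    """One DDPM reverse step whose noise estimate composes several
    motion-primitive diffusion models (sketch; t is a scalar timestep)."""
    if weights is None:
        weights = [1.0] * len(primitive_models)

    # Compose noise predictions: base prediction plus each primitive's
    # weighted deviation from it (the weighting scheme is hypothetical).
    eps_base = base_model(tau_t, t, obs)
    eps = eps_base + sum(w * (m(tau_t, t, obs) - eps_base)
                         for w, m in zip(weights, primitive_models))

    # Standard DDPM posterior mean computed from the composed noise estimate.
    a_t, abar_t = alphas[t], alphas_cumprod[t]
    mean = (tau_t - (1 - a_t) / torch.sqrt(1 - abar_t) * eps) / torch.sqrt(a_t)
    if t == 0:
        return mean
    # Ancestral sampling noise with variance beta_t = 1 - alpha_t.
    return mean + torch.sqrt(1 - a_t) * torch.randn_like(tau_t)
```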
We randomly generate environments with different agents, terrains, and goal locations. Then, using an RRT + Hybrid A* planner, we create diverse, smooth, collision-free trajectories and save them as training data for supervised pre-training of the base diffusion model. A rough sketch of this pipeline follows the figure panels below.
Randomly Generated Environment With Dynamic Agents
Plan Robot Trajectories Via RRT + Hybrid A*
Diverse Collision-Free Goal-Reaching Trajectories
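As a rough illustration of the data-generation pipeline described above, the sketch below shows one possible collection loop. All helper names (`sample_environment`, `rrt_waypoints`, `hybrid_astar_smooth`) are hypothetical placeholders for the components named in the text, not the paper's actual implementation.

```python
import pickle

def generate_dataset(num_episodes, out_path="pretrain_data.pkl"):
    """Collect collision-free goal-reaching trajectories for pre-training
    (sketch; helpers below are hypothetical stand-ins)."""
    dataset = []
    for _ in range(num_episodes):
        env = sample_environment()  # random agents, terrain, goal location
        # Coarse global route from RRT, then Hybrid A* for a smooth,
        # kinodynamically feasible trajectory along it.
        waypoints = rrt_waypoints(env.start, env.goal, env.static_map)
        traj = hybrid_astar_smooth(waypoints, env)
        if traj is not None and env.is_collision_free(traj):
            dataset.append({"trajectory": traj, "observation": env.observe()})
    with open(out_path, "wb") as f:
        pickle.dump(dataset, f)
```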
From the collision-free robot trajectory dataset, the model learns a conditional denoising network $f^{\text{base}}_{\theta}(\tau_t, t, \mathcal O)$, which predicts the noise $\epsilon$ to denoise the trajectory $\tau_t$ at step $t$, conditioned on environment observations $\mathcal O$.
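For concreteness, here is a minimal sketch of one supervised pre-training step with the standard DDPM noise-prediction objective. The batch format and conditioning details are assumptions; only the loss structure (predict the injected noise $\epsilon$ given $\tau_t$, $t$, and $\mathcal O$) follows from the description above.

```python
import torch
import torch.nn.functional as F

def pretrain_step(f_base, batch, alphas_cumprod, optimizer):
    """One DDPM training step for the base model (sketch; assumes `batch`
    holds clean planner trajectories and environment observations)."""
    tau_0, obs = batch["trajectory"], batch["observation"]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (tau_0.shape[0],), device=tau_0.device)

    # Forward-diffuse: tau_t = sqrt(abar_t) * tau_0 + sqrt(1 - abar_t) * eps.
    eps = torch.randn_like(tau_0)
    abar = alphas_cumprod[t].view(-1, *([1] * (tau_0.dim() - 1)))
    tau_t = torch.sqrt(abar) * tau_0 + torch.sqrt(1 - abar) * eps

    # Train f_base(tau_t, t, obs) to recover the injected noise.
    loss = F.mse_loss(f_base(tau_t, t, obs), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```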
We fine-tune the base model into each motion primitive with RL using the DDPO approach. Each primitive is trained in environments containing the relevant dynamic agents, where the diffusion model generates trajectories that a critic evaluates to assign rewards.
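The sketch below illustrates the flavor of such an update: DDPO treats the denoising chain as a policy and applies a policy gradient weighted by the critic's reward. This shows only a plain REINFORCE variant; the actual DDPO recipe additionally uses importance sampling and PPO-style clipping, and `ddpm_posterior_mean` is a hypothetical helper that recomputes the Gaussian denoising mean under the current parameters.

```python
import torch

def ddpo_update(f_primitive, samples, optimizer):
    """One REINFORCE-style DDPO update (sketch; assumes each sample stores
    its critic reward, a baseline, and per-step chain tuples
    (tau_t, t, obs, tau_prev, std) recorded during sampling)."""
    losses = []
    for s in samples:
        advantage = s["reward"] - s["reward_baseline"]
        for (tau_t, t, obs, tau_prev, std) in s["chain"]:
            # Recompute the denoising distribution under current parameters.
            mean = ddpm_posterior_mean(f_primitive, tau_t, t, obs)  # hypothetical helper
            log_prob = torch.distributions.Normal(mean, std).log_prob(tau_prev).sum()
            # Score-function gradient: reinforce steps of high-reward chains.
            losses.append(-advantage * log_prob)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```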
@inproceedings{hu2025composablenav,
title={ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion},
author={Zichao Hu and Chen Tang and Michael Joseph Munje and Yifeng Zhu and Alex Liu and Shuijing Liu and Garrett Warnell and Peter Stone and Joydeep Biswas},
booktitle={9th Annual Conference on Robot Learning},
year={2025},
url={https://openreview.net/forum?id=FBsawSyYBM}
}