Foresight: Iterative Reasoning About Clues that Matter for Navigation

Abstract

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal, e.g., interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior work relies on closed-set factor categories or identifies cues before motion planning, missing plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We present Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context, with subsequent plans conditioned on prior critiques to enable iterative motion refinement before execution. To align critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and six real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real time on a Jetson AGX Orin.

Test-time Reasoning via Iterative Plan-Critique Refinement

Foresight formulates motion planning as a test-time reasoning loop. Conditioned on the image observation and language instruction, a finetuned VLM proposes an image-space motion plan and then critiques it with respect to the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling the model to surface plan-dependent cues (e.g., ramps, signs, detours) and iteratively refine its motion before execution. A lightweight policy converts the refined image-space plan into executable trajectories.
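The control flow of this loop can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: `refine_plan`, `RefinementTrace`, and the `propose`/`critique` callables are hypothetical names standing in for calls to the finetuned VLM, and the toy stubs exist only so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Waypoint = Tuple[float, float]  # (u, v) pixel coordinates in the image
Plan = List[Waypoint]

# Hypothetical interfaces: in practice both would be calls to the same VLM.
ProposeFn = Callable[[Any, str, List[str]], Plan]
CritiqueFn = Callable[[Any, str, Plan], str]


@dataclass
class RefinementTrace:
    """Records every plan and critique produced during refinement."""
    plans: List[Plan] = field(default_factory=list)
    critiques: List[str] = field(default_factory=list)


def refine_plan(propose: ProposeFn, critique: CritiqueFn, image: Any,
                instruction: str, num_refinements: int = 1):
    """Alternate between proposing and critiquing before execution."""
    trace = RefinementTrace()
    # Initial proposal: no critiques are available yet.
    plan = propose(image, instruction, trace.critiques)
    trace.plans.append(plan)
    for _ in range(num_refinements):
        # Critique the current plan against the goal and visual context.
        trace.critiques.append(critique(image, instruction, plan))
        # Re-propose, conditioned on all critiques so far.
        plan = propose(image, instruction, trace.critiques)
        trace.plans.append(plan)
    return plan, trace


# Toy stubs (assumptions for illustration only, not real model calls).
def toy_propose(image: Any, instruction: str, critiques: List[str]) -> Plan:
    # Shift the single waypoint once per critique received.
    return [(float(len(critiques)), 0.0)]


def toy_critique(image: Any, instruction: str, plan: Plan) -> str:
    return f"plan {plan} does not yet follow '{instruction}'"
```

Calling `refine_plan(toy_propose, toy_critique, None, "take the ramp", num_refinements=1)` yields two plans and one critique in the trace; the final plan would then be handed to the lightweight policy that produces executable trajectories.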

Scalable Training Recipe: Supervised Finetuning + Preference RL

To adapt pretrained VLMs for iterative refinement, we introduce a two-stage training recipe. We first apply supervised finetuning to teach the VLM the structure of the plan-critique loop. Then, we learn a reward model from human preference annotations to judge motion plan quality and use it to post-train the VLM with reinforcement learning, aligning critiques and refinements with open-set behavior preferences that are difficult to capture with hand-designed objectives alone.
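The text does not specify how the reward model is fit to the preference annotations. One common choice, assumed here purely for illustration, is a Bradley-Terry pairwise loss: the reward of the human-preferred plan should exceed that of the rejected plan. The function name and scalar-reward interface below are hypothetical.

```python
import math


def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumption: the reward model emits one scalar score per motion plan.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when both plans receive the same score and shrinks toward zero as the preferred plan's score pulls ahead, so minimizing it over annotated pairs pushes the reward model to rank plans the way annotators did.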

Offline and Real-World Evaluations

We evaluate Foresight on a challenging offline benchmark spanning 25 unique urban environments and in real-world experiments across six environments, comparing against state-of-the-art navigation models, including test-time reasoning and robot foundation models. With just one additional refinement step, Foresight decreases the median planning error by 40%, improves average task success rate by 37%, and reduces interventions per mission by 52% relative to state-of-the-art baselines, while running in real time on a Jetson AGX Orin.
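The planning-error figure in this section is labeled with a Hausdorff distance, which suggests that planned and reference paths are compared as point sets. The exact evaluation protocol is not given here, so the following is only a sketch of the standard symmetric Hausdorff distance between two waypoint sequences; the function names and the 2D point representation are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

For example, `hausdorff([(0, 0), (1, 0)], [(0, 0), (1, 1)])` is 1.0: each path's end point lies exactly one unit from the nearest point on the other path. A lower value means the planned path stays closer to the reference everywhere, not just on average.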

Hausdorff distance boxplot by reflection

Real-world robot experiments success rate

Deployment: Long-Horizon Missions and Head-to-Head Comparisons

We deploy Foresight on a real robot to follow sparse language instructions across long-horizon navigation missions, and compare it head-to-head against competing methods in six environments that require sign understanding, inferring structural clues, and navigating detours.

Long-horizon navigation mission

Head-to-head comparisons (LeLaN, Alpamayo, Foresight)

Detour rerouting tasks require inferring instruction-aligned alternative routes when the current path is blocked.

Limitations

While Foresight makes significant strides toward scalable mapless navigation, several challenges remain:

Credit assignment in multi-step refinement: Outcome-level supervision makes it hard to attribute gains to better critiques, critique-conditioned planning, or critique following. Process-level supervision could provide denser signal and reduce spurious correlations.
Memory and multi-view understanding: Navigation plans can degrade when relevant cues are sparse or absent from the local observation history. Retrieval-augmented memory and co-training on multi-view reasoning could improve cue grounding in these scenarios.
Static-environment focus: Our experiments primarily target static settings. Dynamic environments demand faster policies that reason over temporal relationships, which we leave for future work.

Citation

If you find Foresight useful in your research, please consider citing our paper:

@article{zhang2026foresight,
    title={Foresight: Iterative Reasoning About Clues that Matter for Navigation},
    author={Zhang, Arthur and Qi, Carl and Su, Donne and Meng, Xiangyun and Zhang, Amy and Biswas, Joydeep},
    journal={arXiv preprint arXiv:2606.12550},
    year={2026}
}