CREStE (Counterfactuals for Reward Enhancement with Structured Embeddings) is the first approach to
learn representations that address the full mapless navigation
problem. CREStE learns generalizable bird's eye view (BEV) scene representations for urban environments by
distilling priors from visual foundation models trained on internet-scale data. Using this
representation, we predict BEV reward maps for navigation that are aligned with expert and counterfactual
demonstrations. CREStE outperforms all state-of-the-art approaches in mapless urban navigation,
traversing a 2-kilometer mission with just 1 intervention and demonstrating its ability to generalize
to unseen semantic entities and terrains, handle challenging scenarios with little room for error,
and respect fine-grained human preferences.
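To illustrate how a predicted BEV reward map can drive navigation, the sketch below scores sampled candidate trajectories by the reward accumulated along the BEV cells they cross and picks the best one. This is a minimal, hypothetical example: the function and tensor names are our own, and CREStE's actual planner may differ.

```python
import torch

def plan_with_bev_reward(reward_map: torch.Tensor,
                         candidate_paths: torch.Tensor) -> torch.Tensor:
    """Score candidate trajectories against a predicted BEV reward map.

    reward_map:      (H, W) predicted reward per BEV grid cell.
    candidate_paths: (K, T, 2) integer (row, col) indices of K sampled
                     trajectories, each with T waypoints.
    Returns the index of the highest-reward trajectory.
    """
    rows, cols = candidate_paths[..., 0], candidate_paths[..., 1]
    # Accumulate the reward of every cell each trajectory passes through.
    path_rewards = reward_map[rows, cols].sum(dim=-1)  # (K,)
    return torch.argmax(path_rewards)
```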
CREStE achieves this performance without an exhaustive list of semantic classes, large-scale robot
datasets, or carefully designed reward functions. We achieve this with the following contributions:
1) a novel model architecture and learning objective that leverages visual foundation models to learn
geometrically grounded semantic, geometric, and instance-aware representations; and 2) a
counterfactual-based inverse reinforcement learning objective and framework for learning reward
functions that attend to the most important features for navigation.
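To make the second contribution concrete, here is one minimal sketch of a counterfactual ranking loss in PyTorch: the reward accumulated by the expert demonstration is pushed above that of every counterfactual (undesirable) alternative. CREStE's actual inverse reinforcement learning objective is more involved; the names, shapes, and hinge form below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def counterfactual_irl_loss(reward_map: torch.Tensor,
                            expert_path: torch.Tensor,
                            counterfactual_paths: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    """Ranking-style objective: the expert trajectory should accumulate
    more reward than each counterfactual trajectory by at least `margin`.

    reward_map:           (H, W) predicted BEV reward map.
    expert_path:          (T, 2) BEV cell indices of the expert demonstration.
    counterfactual_paths: (K, T, 2) BEV cell indices of K counterfactuals.
    """
    expert_reward = reward_map[expert_path[:, 0], expert_path[:, 1]].sum()
    cf_rewards = reward_map[counterfactual_paths[..., 0],
                            counterfactual_paths[..., 1]].sum(dim=-1)  # (K,)
    # Hinge penalty whenever a counterfactual is not worse by `margin`.
    return F.relu(margin + cf_rewards - expert_reward).mean()
```

Minimizing a loss of this form trains the reward predictor to rank expert behavior above counterfactual behavior, which is how counterfactual demonstrations steer the reward toward the features that matter for navigation.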