Segmenting Transformer for Open-Vocabulary Object Goal Navigation. A lightweight RGB-only agent that generalizes to unseen object categories using a goal mask encoder and entropy-adaptive training.
Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies often overfit small simulator datasets and exhibit unsafe behavior such as frequent collisions. OVSegDT is a lightweight transformer policy that introduces two synergistic components: (1) a semantic branch with a target-mask encoder and auxiliary segmentation loss to ground the goal and provide spatial cues, and (2) Entropy-Adaptive Loss Modulation (EALM), a per-sample scheduler that continuously balances imitation and reinforcement learning signals using policy entropy. These additions reduce sample complexity and improve navigation safety while keeping inference cost low with an RGB-only 130M-parameter model.
OVSegDT targets Open-Vocabulary Object Goal Navigation (OVON), where the agent must explore unfamiliar scenes and navigate to a target object described by text. The observation at time step t includes the current RGB frame, the text goal, the previous discrete action, and a binary segmentation mask for the target category. A transformer processes the last 100 observation embeddings to predict the next action.
The approach is mapless and does not require depth, odometry, or large vision-language models. Instead, it leverages compact encoders and a training strategy that improves both convergence speed and generalization to unseen categories.
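As a rough illustration of this pipeline, the sketch below fuses the per-step embeddings (RGB, goal text, previous action, goal mask) into one token and runs a causal transformer over the last 100 tokens; the module names, feature dimensions, and layer counts are assumptions, not the exact OVSegDT architecture.

```python
# Illustrative sketch, not the exact OVSegDT architecture: fuse per-step
# observation embeddings into a single token and run a causal transformer
# over the last 100 tokens to predict the next discrete action.
import torch
import torch.nn as nn

CONTEXT_LEN = 100   # observation history length used by the policy
D_MODEL = 512       # hypothetical token width
NUM_ACTIONS = 6     # hypothetical discrete action space (stop, forward, turns, ...)

class ObsPolicySketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical compact encoders/projections for each modality.
        self.rgb_proj = nn.Linear(768, D_MODEL)    # RGB backbone features
        self.text_proj = nn.Linear(512, D_MODEL)   # text-goal encoder features
        self.mask_proj = nn.Linear(256, D_MODEL)   # goal-mask encoder features
        self.prev_action_emb = nn.Embedding(NUM_ACTIONS, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(D_MODEL, NUM_ACTIONS)

    def forward(self, rgb_feat, text_feat, mask_feat, prev_act):
        # rgb/text/mask features: (batch, T, dim); prev_act: (batch, T), with T <= CONTEXT_LEN.
        tokens = (self.rgb_proj(rgb_feat) + self.text_proj(text_feat)
                  + self.mask_proj(mask_feat) + self.prev_action_emb(prev_act))
        T = tokens.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), diagonal=1)
        hidden = self.transformer(tokens, mask=causal)
        return self.action_head(hidden[:, -1])      # logits for the next action
```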
OVSegDT is evaluated on the HM3D-OVON benchmark [1], a large-scale dataset for open-vocabulary object-goal navigation in photorealistic 3D scanned indoor environments.
The total loss combines PPO, behavior cloning, value regression, entropy bonus, and auxiliary segmentation loss (Dice + BCE). A semantic reward encourages progress toward the goal and increasing target-mask area in the current view. These components are kept task-invariant across experiments.
The full objective adds value regression, entropy bonus, and segmentation loss to EALM.
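Schematically, the full objective can be written as below, where \(\lambda_t\) is the EALM mixing weight defined next and the coefficients \(c_v\), \(c_H\), \(c_{\text{seg}}\) are illustrative placeholders for the loss weights:

\[
\mathcal{L}_t \;=\; \lambda_t\,\mathcal{L}^{\text{PPO}}_t \;+\; (1-\lambda_t)\,\mathcal{L}^{\text{BC}}_t \;+\; c_v\,\mathcal{L}^{\text{value}}_t \;-\; c_H\,\mathcal{H}\!\left[\pi_\theta(\cdot\,|\,s_t)\right] \;+\; c_{\text{seg}}\left(\mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{BCE}}\right)
\]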
Entropy EMA \(\hat H_t\) is mapped to a mixing weight \(\lambda_t\) that schedules PPO vs. behavior cloning.
The policy loss interpolates between PPO and behavior cloning per sample.
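A minimal sketch of EALM under stated assumptions: the entropy EMA is mapped onto \([0, 1]\) between two thresholds so that high entropy favors behavior cloning and low entropy favors PPO; the linear mapping, the EMA factor, and the value of \(H_{\text{low}}\) are assumptions.

```python
# Sketch of Entropy-Adaptive Loss Modulation (EALM). The paper states that an
# entropy EMA is mapped to a per-sample mixing weight between PPO and behavior
# cloning; the linear mapping, the EMA factor, and H_LOW below are assumptions.
H_LOW, H_HIGH = 0.25, 0.75   # H_HIGH = 0.75 follows the ablation; H_LOW is illustrative
EMA_BETA = 0.99              # assumed smoothing factor for the entropy EMA

def update_entropy_ema(ema, entropy):
    """Exponential moving average H_hat of per-sample policy entropy."""
    return EMA_BETA * ema + (1.0 - EMA_BETA) * entropy

def mixing_weight(entropy_ema):
    """Map the entropy EMA to lambda_t in [0, 1]: high entropy -> imitation, low entropy -> PPO."""
    lam = (H_HIGH - entropy_ema) / (H_HIGH - H_LOW)
    return min(max(lam, 0.0), 1.0)

def policy_loss(ppo_loss, bc_loss, entropy_ema):
    """Per-sample interpolation between PPO and behavior cloning."""
    lam = mixing_weight(entropy_ema)
    return lam * ppo_loss + (1.0 - lam) * bc_loss
```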
The reward increases when the agent moves closer and the target mask grows in view.
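One plausible shaping of this reward (the coefficients and exact terms are assumptions, not the paper's definition):

```python
# Illustrative semantic reward: progress toward the goal plus growth of the
# target mask in the current view. Coefficients w_dist and w_mask are assumptions.
def semantic_reward(prev_dist, cur_dist, prev_mask_area, cur_mask_area,
                    w_dist=1.0, w_mask=0.5):
    progress = prev_dist - cur_dist                        # > 0 when the agent moved closer
    mask_gain = max(cur_mask_area - prev_mask_area, 0.0)   # fraction of view covered by the goal mask
    return w_dist * progress + w_mask * mask_gain
```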
Dice + BCE supervision encourages accurate reconstruction of the goal mask.
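A standard Dice + BCE formulation, as a sketch of what this auxiliary term could look like over predicted goal-mask logits (the relative weighting of the two terms is an assumption):

```python
# Standard Dice + BCE loss over predicted goal-mask logits.
import torch
import torch.nn.functional as F

def dice_bce_loss(mask_logits, mask_target, eps=1e-6):
    """mask_logits, mask_target: (batch, H, W); target is a binary mask in {0, 1} as float."""
    probs = torch.sigmoid(mask_logits)
    inter = (probs * mask_target).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + mask_target.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    bce = F.binary_cross_entropy_with_logits(mask_logits, mask_target, reduction="none").mean(dim=(1, 2))
    return (dice + bce).mean()
```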
OVSegDT reaches state-of-the-art RGB-only performance on HM3D-OVON, with success on unseen categories on par with seen categories [1].
| Method | Depth | Odometry | Val Seen SR | Val Seen SPL | Val Seen Synonyms SR | Val Seen Synonyms SPL | Val Unseen SR | Val Unseen SPL |
|---|---|---|---|---|---|---|---|---|
| BC | No | No | 11.1 ± 0.1 | 4.5 ± 0.1 | 9.9 ± 0.4 | 3.8 ± 0.1 | 5.4 ± 0.1 | 1.9 ± 0.2 |
| DAgger | No | No | 18.1 ± 0.4 | 9.4 ± 0.3 | 15.0 ± 0.4 | 7.4 ± 0.3 | 10.2 ± 0.5 | 4.7 ± 0.3 |
| RL | No | No | 39.2 ± 0.4 | 18.7 ± 0.2 | 27.8 ± 0.1 | 11.7 ± 0.2 | 18.6 ± 0.3 | 7.5 ± 0.2 |
| BCRL | No | No | 20.2 ± 0.6 | 8.2 ± 0.4 | 15.2 ± 0.1 | 5.3 ± 0.1 | 8.0 ± 0.2 | 2.8 ± 0.1 |
| DAgRL | No | No | 41.3 ± 0.3 | 21.2 ± 0.3 | 29.4 ± 0.3 | 14.4 ± 0.1 | 18.3 ± 0.3 | 7.9 ± 0.1 |
| Uni-NaVid | No | No | 41.3 | 21.1 | 43.9 | 21.8 | 39.5 | 19.8 |
| VLFM | Yes | Yes | 35.2 | 18.6 | 32.4 | 17.3 | 35.2 | 19.6 |
| DAgRL+OD | Yes | Yes | 38.5 ± 0.4 | 21.1 ± 0.4 | 39.0 ± 0.7 | 21.4 ± 0.5 | 37.1 ± 0.2 | 19.8 ± 0.3 |
| TANGO | Yes | Yes | — | — | — | — | 35.5 ± 0.3 | 19.5 ± 0.3 |
| MTU3D | Yes | Yes | 55.0 | 23.6 | 45.0 | 14.7 | 40.8 | 12.1 |
| OVSegDT | No | No | 43.6 ± 0.4 | 20.1 ± 0.2 | 40.1 ± 0.4 | 17.9 ± 0.1 | 44.7 ± 0.4 | 20.6 ± 0.2 |
[1] HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation. arXiv:2409.14296 [cs.AI]
OVSegDT is trained with ground-truth masks for fast convergence, then fine-tuned with predicted masks from an open-vocabulary segmenter (YOLOE).
Calibrating category-specific confidence thresholds improves navigation quality, and fine-tuning on predicted masks yields additional gains on unseen categories; the combination of semantic reward and segmentation loss is critical to preserving generalization after the switch from ground-truth to predicted masks.
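A sketch of what per-category confidence calibration could look like; the selection criterion (maximizing mean per-frame mask IoU on a validation split) is an assumption rather than the paper's exact procedure.

```python
# Illustrative per-category confidence calibration for predicted goal masks.
# The criterion (maximize mean per-frame IoU, counting frames whose detection
# falls below the threshold as IoU 0) is an assumption, not the paper's procedure.
import numpy as np

def calibrate_thresholds(per_category_frames, candidates=np.linspace(0.05, 0.95, 19)):
    """per_category_frames: dict category -> list of (confidence, iou_with_gt),
    one entry per validation frame; use (0.0, 0.0) when nothing was detected."""
    thresholds = {}
    for category, frames in per_category_frames.items():
        best_t, best_iou = 0.5, -1.0
        for t in candidates:
            mean_iou = np.mean([iou if conf >= t else 0.0 for conf, iou in frames])
            if mean_iou > best_iou:
                best_t, best_iou = float(t), float(mean_iou)
        thresholds[category] = best_t
    return thresholds
```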
The switching-strategy study shows that naive PPO stalls, while DAgger overfits and collides frequently. EALM steadily improves success rate and reduces collisions by continuously shifting from imitation to reinforcement as policy entropy falls. The entropy-threshold study varies \(H_{\text{low}}\) with \(H_{\text{high}} = 0.75\) and shows that switching too early harms success, while switching too late delays RL gains.
Policies are trained for 200M steps with ground-truth segmentation and then fine-tuned for 15M steps using predicted masks. Experiments run across 40 environments using Variable Experience Rollout on two A100 GPUs.