Memory-Adaptive Vision-and-Language Navigation

Keji He, Ya Jing, Yan Huang, Zhihe Lu, Dong An, Liang Wang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Vision-and-Language Navigation (VLN) requires an agent to navigate in 3D environments by following given instructions, where history is critical for decision-making in the dynamic navigation process. In particular, existing methods widely use a memory bank that stores histories and combines them with multimodal representations of the current scene for better decision-making. However, by weighting each history with a single scalar, those methods cannot isolate the informative cues that co-exist with detrimental contents in each history, thereby inevitably introducing noise into decision-making. To that end, we propose a novel Memory-Adaptive Model (MAM) that dynamically restrains the detrimental contents in histories, retaining only the contents that benefit navigation. Specifically, two key modules, the Visual and Textual Adaptive Modules, are designed to restrain history noise based on scene-related vision and text, respectively. A Reliability Estimator Module is further introduced to refine the above adaptation operations. Our experiments on the widely used RxR and R2R datasets show that MAM outperforms its baseline method by 4.0%/2.5% and 2%/1% on the validation unseen/test splits, respectively, in terms of the SR (Success Rate) metric.
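To make the contrast the abstract draws concrete, below is a minimal sketch (not the authors' released code) of scalar history weighting versus element-wise adaptive gating of a memory bank conditioned on the current scene, with a rough stand-in for the reliability estimate. All module names, feature dimensions, and the sigmoid gating form are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: scalar weighting assigns one attention weight per
# stored history, so useful and noisy channels inside a history are kept or
# dropped together; per-channel gating can suppress noisy channels selectively.
import torch
import torch.nn as nn

class ScalarHistoryWeighting(nn.Module):
    """Baseline-style pooling: each stored history gets a single scalar weight."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)

    def forward(self, memory, query):
        # memory: (T, dim) past-step features; query: (dim,) current-scene feature
        scores = torch.softmax(self.attn(memory * query).squeeze(-1), dim=0)  # (T,)
        return (scores.unsqueeze(-1) * memory).sum(dim=0)  # whole history scaled at once

class AdaptiveHistoryGating(nn.Module):
    """Sketch of per-dimension gating: restrain noisy channels in each history."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)        # channel-wise keep/suppress gate
        self.reliability = nn.Linear(2 * dim, 1)   # hypothetical reliability score

    def forward(self, memory, query):
        T, dim = memory.shape
        ctx = torch.cat([memory, query.expand(T, dim)], dim=-1)  # (T, 2*dim)
        g = torch.sigmoid(self.gate(ctx))          # (T, dim): per-channel gate
        r = torch.sigmoid(self.reliability(ctx))   # (T, 1): trust in the gate
        gated = memory * (r * g + (1 - r))         # fall back to raw history if unreliable
        return gated.mean(dim=0)

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, T = 8, 5
    memory, query = torch.randn(T, dim), torch.randn(dim)
    print(ScalarHistoryWeighting(dim)(memory, query).shape)  # torch.Size([8])
    print(AdaptiveHistoryGating(dim)(memory, query).shape)   # torch.Size([8])
```

The design point of the sketch is only that a (T, dim) gate can down-weight individual channels of each history, whereas a (T,) score cannot; the paper's actual Visual/Textual Adaptive Modules and Reliability Estimator are more elaborate.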

Original language: English
Article number: 110511
Number of pages: 13
Journal: Pattern Recognition
Volume: 153
Early online date: Apr 2024
Publication status: Published - Sept 2024
Externally published: Yes

Keywords

  • History noises
  • Memory bank
  • Memory-Adaptive Model
  • Vision-and-Language Navigation
