
NaviFormer: Multimodal scene segmentation for assistive navigation

  • Ly Bui
  • Son Lam Phung*
  • Yang Di
  • Soan Thi Minh Duong
  • Abdesselam Bouzerdoum

*Corresponding author for this work

  • University of Wollongong
  • Le Quy Don Technical University

Research output: Contribution to journal › Article › peer-review

Abstract

Safe and independent navigation is an essential part of daily life for individuals with visual impairments. Recent advances in assistive navigation have leveraged deep learning-based semantic scene segmentation of RGB images, with promising results. In comparison, RGB-D images provide richer geometric and appearance information, but their integration into deep learning frameworks for assistive navigation could be further explored to enhance segmentation accuracy. When modeling cross-modal interactions, existing RGB-D segmentation methods tend to lose fine-grained spatial details or discard useful information, which hinders performance on dense prediction tasks. To address these gaps, we propose NaviFormer, a novel RGB-D semantic segmentation architecture tailored for assistive navigation. NaviFormer features a dual-stream transformer encoder with shared weights to efficiently extract latent features from the RGB and depth modalities. It also incorporates a Local–Global Cross-Modal Fusion module, which facilitates effective information exchange between the two modalities at both local and global feature levels. In training NaviFormer, we further employ a pixel-wise contrastive loss to enhance the separability of pixel-level embeddings in the RGB-D feature space. Extensive experiments on the TrueSight and Cityscapes datasets indicate that NaviFormer achieves superior performance compared to existing RGB-D segmentation methods. Our findings highlight the importance of leveraging RGB-D data for enhancing semantic understanding in assistive navigation systems, and establish NaviFormer as a solid baseline for future research in this domain.
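
The abstract does not give the exact form of the pixel-wise contrastive loss, so the following is only a minimal sketch of one common choice: a supervised InfoNCE-style loss that pulls same-class pixel embeddings together and pushes different-class embeddings apart. The function names, the `temperature` default, and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised InfoNCE-style loss over sampled pixel embeddings.

    embeddings: list of feature vectors, one per sampled pixel
    labels:     list of class ids, one per sampled pixel
    (Hypothetical helper; not the authors' implementation.)
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        # Cosine similarity of anchor i to every other pixel, scaled by temperature.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class pixel contributes nothing
        # Log-sum-exp over all other pixels, stabilised by subtracting the max.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of selecting a same-class pixel.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Toy check: embeddings that cluster by class should score a lower loss.
labels = [0, 0, 1, 1]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_separated = pixel_contrastive_loss(separated, labels)
loss_mixed = pixel_contrastive_loss(mixed, labels)
```

In this toy setup `loss_separated` is smaller than `loss_mixed`, which is the behaviour such a loss is meant to reward: embeddings of the same class sit close together in feature space. In practice a loss of this kind is typically computed on a sampled subset of pixels per batch and added to the usual segmentation loss.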

Original language: English
Article number: 104793
Journal: Computer Vision and Image Understanding
Volume: 269
DOIs
Publication status: Published - Jun 2026

Keywords

  • Assistive navigation
  • Cross-modal attention
  • Deep learning
  • RGB-D semantic segmentation
