Abstract
Safe and independent navigation is an essential part of daily life for individuals with visual impairments. Recent advances in assistive navigation have applied deep learning-based semantic scene segmentation to RGB images with promising results. RGB-D images provide richer geometric and appearance information than RGB images alone, but their integration into deep learning frameworks for assistive navigation has not been fully explored as a way to improve segmentation accuracy. When modeling cross-modal interactions, existing RGB-D segmentation methods tend to lose fine-grained spatial details or discard useful information, which hinders performance on dense prediction tasks. To address these gaps, we propose NaviFormer, a novel RGB-D semantic segmentation architecture tailored for assistive navigation. NaviFormer features a dual-stream transformer encoder with shared weights to efficiently extract latent features from the RGB and depth modalities. It also incorporates a Local–Global Cross-Modal Fusion module, which facilitates information exchange between the two modalities at both local and global feature levels. During training, we further employ a pixel-wise contrastive loss to enhance the separability of pixel-level embeddings in the RGB-D feature space. Extensive experiments on the TrueSight and Cityscapes datasets indicate that NaviFormer outperforms existing RGB-D segmentation methods. Our findings highlight the importance of leveraging RGB-D data for enhancing semantic understanding in assistive navigation systems and establish NaviFormer as a solid baseline for future research in this domain.
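The abstract does not give the exact form of the pixel-wise contrastive loss, so the following is only a minimal sketch of one common choice: a supervised, InfoNCE-style loss over a set of sampled pixel embeddings, where pixels sharing a class label act as positives. The function names, the `temperature` value, and the pure-Python formulation are assumptions made for illustration and are not taken from the paper; a real training pipeline would compute this on tensors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised InfoNCE-style loss over sampled pixel embeddings.

    embeddings: list of per-pixel feature vectors (hypothetical fused RGB-D features)
    labels:     list of class ids, one per pixel
    Pixels with the same label are positives; all other pixels are negatives.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        # Cosine similarity to every other pixel, scaled by the temperature.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class pixel contributes nothing
        # Numerically stable log-sum-exp over all other pixels.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy example: two tight clusters in a 2-D embedding space.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = pixel_contrastive_loss(pixels, [0, 0, 1, 1])
loss_mixed = pixel_contrastive_loss(pixels, [0, 1, 0, 1])
```

In this toy example, labels that agree with the clusters give a much lower loss than labels that cut across them, which is the separability effect the abstract attributes to the contrastive term.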
| Original language | English |
|---|---|
| Article number | 104793 |
| Journal | Computer Vision and Image Understanding |
| Volume | 269 |
| Publication status | Published - Jun 2026 |
Keywords
- Assistive navigation
- Cross-modal attention
- Deep learning
- RGB-D semantic segmentation
Fingerprint
Dive into the research topics of 'NaviFormer: Multimodal scene segmentation for assistive navigation'. Together they form a unique fingerprint.