Abstract
Traveling safely and independently in unfamiliar environments remains a significant challenge for people with visual impairments. Conventional assistive navigation systems, while aiming to enhance spatial awareness, typically handle crucial tasks such as semantic segmentation and depth estimation separately, resulting in high computational overhead and reduced inference speed. To address this limitation, we introduce AMT-Net, a novel multi-task deep neural network for joint semantic segmentation and monocular depth estimation. AMT-Net employs a single unified decoder, which improves both the model's efficiency and its scalability on portable devices with limited computational resources. We propose two self-attention-based modules, CSAPP and RSAB, which combine the strengths of convolutional neural networks for extracting robust local features with those of Transformers for capturing essential long-range dependencies, enhancing the model's ability to interpret complex scenes. Furthermore, AMT-Net has low computational complexity and achieves real-time performance, making it suitable for assistive navigation applications. Extensive experiments on the public NYUD-v2 dataset and the TrueSight dataset demonstrate the model's state-of-the-art performance and the effectiveness of the proposed components.
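The abstract's central design point is that one unified decoder serves both tasks, rather than two task-specific decoders. A minimal sketch of that idea, assuming the shared decoder emits a single tensor whose channels are split into per-class segmentation logits plus one depth channel (the function name, channel layout, and class count below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

NUM_CLASSES = 4  # assumed toy value, not from the paper

def split_unified_output(decoder_out):
    """Split a unified decoder output of shape (NUM_CLASSES + 1, H, W)
    into a segmentation map and a depth map.

    Channels 0..NUM_CLASSES-1: per-class logits; last channel: depth.
    """
    logits = decoder_out[:NUM_CLASSES]
    depth = decoder_out[NUM_CLASSES]
    seg_map = logits.argmax(axis=0)       # per-pixel class prediction
    depth_map = np.maximum(depth, 0.0)    # clamp: depth is non-negative
    return seg_map, depth_map

# Toy forward pass: a random "decoder output" standing in for real features.
out = np.random.default_rng(0).normal(size=(NUM_CLASSES + 1, 8, 8))
seg, depth = split_unified_output(out)
```

Because both predictions come from one decoder pass, the backbone and decoder computation is shared across tasks, which is the efficiency argument the abstract makes for resource-constrained portable devices.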
| Original language | English |
|---|---|
| Article number | 129468 |
| Number of pages | 12 |
| Journal | Neurocomputing |
| Volume | 625 |
| DOIs | |
| Publication status | Published - 7 Apr 2025 |
Keywords
- Assistive navigation
- Depth estimation
- Multi-task learning
- Semantic segmentation
- Vision impairment
- Vision transformers