ViT-CNN: Explainable Dual-Stream Cross-Attention for MRI Brain Tumor Screening on Consumer Edge Devices

  • Bakht Zada
  • Juhua Pu
  • Yar Muhammad
  • Asmat Ullah
  • Mona M. Jamjoom
  • Zahid Ullah
  • Ahmed Farouk
  • Muhammad Adil*

*Corresponding author for this work

  • Beihang University
  • Princess Nourah Bint Abdulrahman University
  • Al-Imam Muhammad Ibn Saud Islamic University
  • Hurghada University
  • Texas Southern University

Research output: Contribution to journal › Article › peer-review

Abstract

Brain tumors are among the most aggressive and life-threatening cancers, requiring accurate and timely diagnosis for effective treatment. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been widely explored for MRI-based tumor classification. However, CNNs often struggle with long-range dependency modeling and background noise, while ViTs lack strong inductive biases such as translation invariance. Although CNN-ViT hybridization can improve performance, practical screening systems on consumer edge devices must satisfy strict constraints on compute, memory, and latency. To address these challenges, we propose ViT-CNN, an explainable dual-stream cross-attention model that integrates CNN and ViT representations in an efficiency-aware manner for brain tumor screening on consumer edge devices. The CNN branch captures tumor-specific local patterns such as edges, textures, and shapes, while the ViT branch models global anatomical context. A spatial attention module is applied to the CNN features to suppress background noise, and a self-attention refinement module is applied to the ViT representation to enhance informative global cues. The two streams are then fused through bidirectional cross-attention, enabling adaptive interaction between local and global features. Experiments on two public MRI datasets show that ViT-CNN achieves 93.86±0.40% accuracy on the Kaggle multiclass dataset-v2 and 99.33% on the BR35H binary dataset. To strengthen interpretability, we perform both qualitative and quantitative XAI analysis using saliency maps and LIME, including fidelity and consistency evaluation. Deployment-oriented profiling further demonstrates real-time CPU inference at 29.36 ms/image (34.05 FPS) for the full model, while lightweight variants achieve up to 55.57 FPS on CPU with modest accuracy reduction, supporting flexible deployment across consumer edge devices.
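
The bidirectional cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the token counts, the embedding width `d`, the random projection weights, and the residual-add, mean-pool, and concatenate fusion rule are all assumptions chosen for illustration, and the names `cross_attention` and `bidirectional_fusion` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention in which `queries` tokens attend over `context` tokens."""
    q = queries @ w_q                          # (n_q, d)
    k = context @ w_k                          # (n_c, d)
    v = context @ w_v                          # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_c) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


def bidirectional_fusion(cnn_tokens, vit_tokens, params):
    """Fuse two token streams with cross-attention in both directions.

    Assumed fusion rule (not from the paper): residual add, mean-pool each
    stream over its tokens, then concatenate the two pooled vectors.
    """
    # Local (CNN) tokens query the global (ViT) tokens, and vice versa.
    cnn_from_vit, _ = cross_attention(cnn_tokens, vit_tokens, *params["cnn_to_vit"])
    vit_from_cnn, _ = cross_attention(vit_tokens, cnn_tokens, *params["vit_to_cnn"])
    cnn_out = cnn_tokens + cnn_from_vit
    vit_out = vit_tokens + vit_from_cnn
    return np.concatenate([cnn_out.mean(axis=0), vit_out.mean(axis=0)])


rng = np.random.default_rng(0)
d = 16                                         # assumed shared embedding width
cnn_tokens = rng.standard_normal((49, d))      # assumed: flattened 7x7 feature map
vit_tokens = rng.standard_normal((197, d))     # assumed: 196 patches + 1 class token
params = {
    name: tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    for name in ("cnn_to_vit", "vit_to_cnn")
}

fused = bidirectional_fusion(cnn_tokens, vit_tokens, params)
print(fused.shape)  # (32,)
```

Each direction has its own projection weights, so the CNN stream can weight global context differently from how the ViT stream weights local detail. The fused vector would then feed a classification head, which is omitted here.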

Original language: English
Journal: IEEE Transactions on Consumer Electronics
Publication status: Accepted/In press - 2026
Externally published: Yes

Keywords

  • Brain Tumor Classification
  • Cross-attention fusion
  • Feature Fusion
  • LIME
  • MRI
  • Saliency Maps
  • ViT-tiny
