TY - GEN
T1 - Adaptive Inter-Modality Attention for Enhanced Cross-Domain Deepfake Detection Transferability
AU - Khan, Naseem
AU - Tuan, Nguyen Vu
AU - Khalil, Issa
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/12/6
Y1 - 2025/12/6
AB - Cross-domain generalization remains a critical challenge in deepfake detection, with existing methods exhibiting severe performance degradation across unseen generative architectures. We propose CAMME (Cross-domain Adaptive Multi-Modal Embeddings), which dynamically integrates visual, textual, and frequency-domain features via embedding-level multi-modal self-attention. Treating each modality as a distinct sequence element enables cross-modal interactions that adaptively weight discriminative features based on input characteristics. Unlike static fusion approaches, CAMME learns input-specific contributions, emphasizing the most informative signals across visual semantics, textual consistency, and spectral artifacts. Evaluation across twelve generative architectures demonstrates superior cross-domain performance: 77.34% average F1-score on natural scenes (7.30% improvement) and 66.46% on facial datasets (13.25% improvement). CAMME exhibits exceptional robustness, with a 14.7% Attack Success Rate against seven prominent adversarial attacks (4-6× improvement) and 96.63% accuracy under natural perturbations. Ablation results confirm the importance of each modality and the effectiveness of our inter-modal attention over standard fusion methods.
KW - Adversarial robustness
KW - Cross-domain transferability
KW - Deepfake detection
KW - Inter-modal attention
KW - Multi-modal learning
UR - https://www.scopus.com/pages/publications/105025127825
U2 - 10.1145/3743093.3771048
DO - 10.1145/3743093.3771048
M3 - Conference contribution
AN - SCOPUS:105025127825
T3 - Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025
BT - Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025
A2 - Chua, Tat-Seng
A2 - Wong, Lai-Kuan
A2 - Chan, Chee Seng
A2 - Tang, Jinhui
A2 - Ngo, Chong-Wah
A2 - Schoeffmann, Klaus
A2 - Liu, Jiaying
A2 - Ho, Yo-Sung
PB - Association for Computing Machinery, Inc
T2 - 7th ACM International Conference on Multimedia in Asia, MMAsia 2025
Y2 - 9 December 2025 through 12 December 2025
ER -
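
Note: the abstract above describes treating each modality's embedding as a token in a short sequence and fusing them with self-attention. The following is a minimal illustrative sketch of that idea, not the authors' CAMME implementation: the module name, embedding dimensions, encoder choices, pooling, and classifier head are all assumptions for illustration only.

# Sketch of embedding-level multi-modal self-attention, as outlined in the
# abstract. All names and dimensions below are hypothetical.
import torch
import torch.nn as nn

class InterModalAttentionFusion(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, freq_dim=256,
                 d_model=256, n_heads=4):
        super().__init__()
        # Project each modality's embedding into a shared space so the
        # three embeddings can be treated as tokens of one sequence.
        self.proj_v = nn.Linear(vis_dim, d_model)
        self.proj_t = nn.Linear(txt_dim, d_model)
        self.proj_f = nn.Linear(freq_dim, d_model)
        # Self-attention over the 3-token "sequence" lets each modality
        # attend to the others, yielding input-specific weighting rather
        # than static fusion.
        self.attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # real vs. fake logits

    def forward(self, vis_emb, txt_emb, freq_emb):
        # Each input: (batch, modality_dim) -> stacked: (batch, 3, d_model).
        tokens = torch.stack(
            [self.proj_v(vis_emb), self.proj_t(txt_emb), self.proj_f(freq_emb)],
            dim=1)
        fused = self.attn(tokens)       # (batch, 3, d_model)
        pooled = fused.mean(dim=1)      # average-pool the modality tokens
        return self.classifier(pooled)  # (batch, 2)

# Usage with random stand-ins for the visual, textual, and frequency encoders:
model = InterModalAttentionFusion()
logits = model(torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 2])

The design point the sketch illustrates is that, because the modalities interact through attention weights computed from the input itself, the relative contribution of visual, textual, and spectral cues can shift per sample, which is the property the abstract contrasts with static fusion.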