TY - JOUR
T1 - A temporal–spatial deep learning framework leveraging dynamic 3D attention maps for violence detection
AU - Varghese, Elizabeth B.
AU - Elzein, Almiqdad
AU - Yang, Yin
AU - Qaraqe, Marwa
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/11
Y1 - 2025/11
N2 - In intelligent systems for real-time security and safety monitoring, the proliferation of surveillance cameras has fueled growing interest in using deep learning-based artificial intelligence (AI) models for violence detection. Most current approaches treat violence detection as a video classification task, overlooking the fact that violent activities occur within relatively small spatiotemporal regions. Moreover, these activities depend on relationships among multiple such regions, making single-region analysis inadequate, especially for larger-scale violence. This paper proposes a novel temporal–spatial attention framework inspired by human visual perception that dynamically focuses on multiple informative regions across space and time. By learning where, when, and for how long to attend within a video using dynamic three-dimensional attention prediction networks, the model captures complex patterns of violent behavior more effectively. Experiments on four public benchmark datasets and a real-world dataset created for this study demonstrate that the proposed approach outperforms existing methods in accuracy and interpretability.
AB - In intelligent systems for real-time security and safety monitoring, the proliferation of surveillance cameras has fueled growing interest in using deep learning-based artificial intelligence (AI) models for violence detection. Most current approaches treat violence detection as a video classification task, overlooking the fact that violent activities occur within relatively small spatiotemporal regions. Moreover, these activities depend on relationships among multiple such regions, making single-region analysis inadequate, especially for larger-scale violence. This paper proposes a novel temporal–spatial attention framework inspired by human visual perception that dynamically focuses on multiple informative regions across space and time. By learning where, when, and for how long to attend within a video using dynamic three-dimensional attention prediction networks, the model captures complex patterns of violent behavior more effectively. Experiments on four public benchmark datasets and a real-world dataset created for this study demonstrate that the proposed approach outperforms existing methods in accuracy and interpretability.
KW - 3D spatiotemporal attention maps
KW - Computer vision
KW - Residual convolutional neural network
KW - Video surveillance
KW - Violence detection
UR - https://www.scopus.com/pages/publications/105016606086
U2 - 10.1007/s00521-025-11641-4
DO - 10.1007/s00521-025-11641-4
M3 - Article
AN - SCOPUS:105016606086
SN - 0941-0643
VL - 37
SP - 26689
EP - 26709
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 32
ER -