TY - JOUR
T1 - Dual-attention Network for View-invariant Action Recognition
AU - Kumie, Gedamu Alemu
AU - Habtie, Maregu Assefa
AU - Ayall, Tewodros Alemu
AU - Zhou, Changjun
AU - Liu, Huawen
AU - Seid, Abegaz Mohammed
AU - Erbad, Aiman
N1 - Publisher Copyright:
© The Author(s) 2023.
PY - 2024/2
Y1 - 2024/2
N2 - View-invariant action recognition has been widely researched in various applications, such as visual surveillance and human–robot interaction. However, view-invariant human action recognition is challenging due to action occlusions and the information loss caused by view changes. Modeling the spatiotemporal dynamics of body joints and minimizing the representation discrepancy between different views could be a valuable solution for view-invariant human action recognition. Therefore, we propose a Dual-Attention Network (DANet) that aims to learn robust video representations for view-invariant action recognition. The DANet is composed of a relation-aware spatiotemporal self-attention module and a spatiotemporal cross-attention module. The relation-aware spatiotemporal self-attention module learns representative and discriminative action features. This module captures local and global long-range dependencies, as well as pairwise relations among human body parts and joints, in the spatial and temporal domains. The cross-attention module learns view-invariant attention maps and generates discriminative features for semantic representations of actions in different views. We extensively evaluate our proposed approach on the large-scale and challenging NTU-60, NTU-120, and UESTC datasets under multiple evaluation protocols, including Cross-Subject, Cross-View, Cross-Set, and Arbitrary-view. The experimental results demonstrate that our proposed approach significantly outperforms state-of-the-art approaches in view-invariant action recognition.
AB - View-invariant action recognition has been widely researched in various applications, such as visual surveillance and human–robot interaction. However, view-invariant human action recognition is challenging due to action occlusions and the information loss caused by view changes. Modeling the spatiotemporal dynamics of body joints and minimizing the representation discrepancy between different views could be a valuable solution for view-invariant human action recognition. Therefore, we propose a Dual-Attention Network (DANet) that aims to learn robust video representations for view-invariant action recognition. The DANet is composed of a relation-aware spatiotemporal self-attention module and a spatiotemporal cross-attention module. The relation-aware spatiotemporal self-attention module learns representative and discriminative action features. This module captures local and global long-range dependencies, as well as pairwise relations among human body parts and joints, in the spatial and temporal domains. The cross-attention module learns view-invariant attention maps and generates discriminative features for semantic representations of actions in different views. We extensively evaluate our proposed approach on the large-scale and challenging NTU-60, NTU-120, and UESTC datasets under multiple evaluation protocols, including Cross-Subject, Cross-View, Cross-Set, and Arbitrary-view. The experimental results demonstrate that our proposed approach significantly outperforms state-of-the-art approaches in view-invariant action recognition.
KW - Attention transfer
KW - Cross-attention
KW - Dual-attention
KW - Human action recognition
KW - Self-attention
KW - View-invariant representation
UR - https://www.scopus.com/pages/publications/85165209241
U2 - 10.1007/s40747-023-01171-8
DO - 10.1007/s40747-023-01171-8
M3 - Article
AN - SCOPUS:85165209241
SN - 2199-4536
VL - 10
SP - 305
EP - 321
JO - Complex & Intelligent Systems
JF - Complex & Intelligent Systems
IS - 1
ER -