TY - GEN
T1 - Spatiotemporal Transformer-Based Analysis of Social Gaze in Multi-Agent Interaction Videos
AU - Aldhubri, Ali
AU - Varghese, Elizabeth B.
AU - Al-Thani, Dena
AU - Qaraqe, Marwa K.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/10/22
Y1 - 2025/10/22
N2 - Understanding human gaze communication from a video is critical for decoding complex social interactions in dynamic, real-world environments. Existing gaze communication models focus on a single interaction, such as mutual gaze or shared attention, leaving the full spectrum of dyadic gaze states unaddressed. Unlike low-level gaze tracking that focuses on eye movement anatomy, this work addresses high-level gaze behaviors such as mutual gaze, referential gaze, and shared attention, which reflect the social-cognitive functions of gaze in multi-agent contexts. To this end, a spatiotemporal transformer-based framework is proposed, which involves human-object detection and tracking, gaze-following prediction, and a robust spatiotemporal transformer architecture for fine-grained classification and localization of these gaze behaviors. Moreover, the proposed model incorporates human gaze information, which provides explicit, fine-grained cues about each individual's focus of attention, allowing more precise alignment of visual features with underlying social intent. Evaluated on a benchmark dataset, the proposed model substantially improves over strong graph-based and transformer-based baselines, particularly in accurately identifying rare yet socially meaningful gaze behaviors. This study contributes a scalable architecture for multi-class gaze analysis, supporting socially aware AI systems in healthcare through applications like autism screening and social engagement assessment, as well as in robotics and behavioral science.
AB - Understanding human gaze communication from a video is critical for decoding complex social interactions in dynamic, real-world environments. Existing gaze communication models focus on a single interaction, such as mutual gaze or shared attention, leaving the full spectrum of dyadic gaze states unaddressed. Unlike low-level gaze tracking that focuses on eye movement anatomy, this work addresses high-level gaze behaviors such as mutual gaze, referential gaze, and shared attention, which reflect the social-cognitive functions of gaze in multi-agent contexts. To this end, a spatiotemporal transformer-based framework is proposed, which involves human-object detection and tracking, gaze-following prediction, and a robust spatiotemporal transformer architecture for fine-grained classification and localization of these gaze behaviors. Moreover, the proposed model incorporates human gaze information, which provides explicit, fine-grained cues about each individual's focus of attention, allowing more precise alignment of visual features with underlying social intent. Evaluated on a benchmark dataset, the proposed model substantially improves over strong graph-based and transformer-based baselines, particularly in accurately identifying rare yet socially meaningful gaze behaviors. This study contributes a scalable architecture for multi-class gaze analysis, supporting socially aware AI systems in healthcare through applications like autism screening and social engagement assessment, as well as in robotics and behavioral science.
KW - Computer Vision
KW - Deep Neural Network
KW - Gaze Communication
KW - Gaze following
UR - https://www.scopus.com/pages/publications/105032348662
U2 - 10.1109/AICCSA66935.2025.11315206
DO - 10.1109/AICCSA66935.2025.11315206
M3 - Conference contribution
AN - SCOPUS:105032348662
SN - 979-8-3315-5694-5
T3 - International Conference on Computer Systems and Applications
BT - 2025 IEEE/ACS 22nd International Conference on Computer Systems and Applications, AICCSA
PB - IEEE Computer Society
T2 - 22nd ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2025
Y2 - 19 October 2025 through 22 October 2025
ER -