Spatiotemporal Transformer-Based Analysis of Social Gaze in Multi-Agent Interaction Videos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Understanding human gaze communication from video is critical for decoding complex social interactions in dynamic, real-world environments. Existing gaze communication models focus on a single interaction type, such as mutual gaze or shared attention, leaving the full spectrum of dyadic gaze states unaddressed. Unlike low-level gaze tracking, which focuses on eye movement anatomy, this work addresses high-level gaze behaviors such as mutual gaze, referential gaze, and shared attention, which reflect the social-cognitive functions of gaze in multi-agent contexts. To this end, a spatiotemporal transformer-based framework is proposed, which combines human-object detection and tracking, gaze-following prediction, and a robust spatiotemporal transformer architecture for fine-grained classification and localization of these gaze behaviors. Moreover, the proposed model incorporates human gaze information, which provides explicit, fine-grained cues about each individual's focus of attention, allowing more precise alignment of visual features with underlying social intent. Evaluated on a benchmark dataset, the proposed model substantially improves over strong graph-based and transformer-based baselines, particularly in accurately identifying rare yet socially meaningful gaze behaviors. This study contributes a scalable architecture for multi-class gaze analysis, supporting socially aware AI systems in healthcare through applications like autism screening and social engagement assessment, as well as in robotics and behavioral science.
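
The abstract names mutual gaze and shared attention among the target labels. As a rough illustration of what such labels mean once each agent's gaze target is known, the Python sketch below assigns them with hand-written rules and smooths them over time with a majority vote. It is not the transformer model described in the paper. The agent IDs "A" and "B", the function names `frame_label` and `clip_labels`, the label strings, and the working definitions (mutual gaze: each agent's gaze target is the other agent; shared attention: both agents look at the same non-agent target) are all assumptions made for illustration. Referential gaze is left out because the abstract does not define it.

```python
from collections import Counter
from typing import List, Optional, Sequence

AGENTS = ("A", "B")  # hypothetical IDs for the two people in a dyad


def frame_label(target_a: Optional[str], target_b: Optional[str]) -> str:
    """Rule-based dyadic label for one frame (illustrative assumption).

    target_a / target_b are the IDs of whatever each agent looks at:
    the other agent's ID, an object ID such as "cup", or None.
    """
    if target_a == "B" and target_b == "A":
        return "mutual_gaze"
    if target_a is not None and target_a == target_b and target_a not in AGENTS:
        return "shared_attention"
    return "other"


def clip_labels(
    targets_a: Sequence[Optional[str]],
    targets_b: Sequence[Optional[str]],
    window: int = 3,
) -> List[str]:
    """Label every frame, then smooth with a sliding majority vote.

    The vote is a crude stand-in for temporal modelling; a learned
    spatiotemporal model would replace both steps.
    """
    per_frame = [frame_label(a, b) for a, b in zip(targets_a, targets_b)]
    half = window // 2
    smoothed = []
    for i in range(len(per_frame)):
        lo, hi = max(0, i - half), min(len(per_frame), i + half + 1)
        smoothed.append(Counter(per_frame[lo:hi]).most_common(1)[0][0])
    return smoothed


# Hand-made gaze targets for an 8-frame clip (not real data).
gaze_a = ["B", "B", "cup", "B", "B", "cup", "cup", "cup"]
gaze_b = ["A", "A", "A", "A", "A", "cup", "cup", "cup"]
labels = clip_labels(gaze_a, gaze_b)
```

In this toy clip, the single frame where agent A glances at the cup while B still looks at A is absorbed by the vote, which hints at why temporal context matters for gaze labels that are brief or rare.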

Original language: English
Title of host publication: 2025 IEEE/ACS 22nd International Conference on Computer Systems and Applications, AICCSA
Publisher: IEEE Computer Society
Number of pages: 6
ISBN (Electronic): 9798331556938
ISBN (Print): 979-8-3315-5694-5
DOIs
Publication status: Published - 22 Oct 2025
Event: 22nd ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2025 - Doha, Qatar
Duration: 19 Oct 2025 to 22 Oct 2025

Publication series

Name: International Conference on Computer Systems and Applications

Conference

Conference: 22nd ACS/IEEE International Conference on Computer Systems and Applications, AICCSA 2025
Country/Territory: Qatar
City: Doha
Period: 19/10/25 to 22/10/25

Keywords

  • Computer Vision
  • Deep Neural Network
  • Gaze Communication
  • Gaze following
