Combining Facial Videos and Biosignals for Stress Estimation During Driving

Official repository of ICPR 2026 submission
Paraskevi Valergaki², Vassilis C. Nicodemou², Iason Oikonomidis², Antonis Argyros¹٬², Anastasios Roussos²
1Computer Science Department, University of Crete, Heraklion, Greece
2Institute of Computer Science (ICS), Foundation for Research & Technology – Hellas (FORTH), Heraklion, Greece
Paper (PDF) · Code · arXiv

Abstract

Reliable stress recognition is critical in applications such as medical monitoring and safety-critical systems, including real-world driving. While stress is commonly detected using physiological signals such as perinasal perspiration and heart rate, facial activity provides complementary cues that can be captured unobtrusively from video. We propose a multimodal stress estimation framework that combines facial videos and physiological signals, remaining effective even when biosignal acquisition is challenging. Facial behavior is represented using a dense 3D Morphable Model, yielding a 56-dimensional descriptor that captures subtle expression and head-pose dynamics over time. To investigate the correlation between stress and facial motions, we perform extensive experiments that also involve physiological markers. Paired hypothesis tests between baseline and stressor phases show that 38 of 56 facial components exhibit consistent, phase-specific stress responses comparable to physiological markers. Building on these findings, we introduce a Transformer-based temporal modeling framework and evaluate unimodal, early-fusion, and cross-modal attention strategies. Combining 3D-derived facial features with physiological signals via cross-modal attention substantially improves performance over physiological signals alone, increasing AUROC from 52.7% and accuracy from 51.0% to 92.0% and 86.7%, respectively. Although evaluated on driving data, the proposed framework and protocol may readily generalize to other stress estimation settings.
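The paired baseline-vs-stressor testing described in the abstract can be sketched as follows. This is an illustrative example on synthetic data, not the paper's code: the subject count, effect size, and choice of the Wilcoxon signed-rank test are assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_subjects, n_components = 20, 56  # 56-D facial descriptor, as in the paper

# Synthetic per-subject means of each facial component in the two phases.
baseline = rng.normal(0.0, 1.0, size=(n_subjects, n_components))
stressor = baseline + rng.normal(0.3, 1.0, size=(n_subjects, n_components))

# Paired test per component: does the stressor phase differ from baseline?
p_values = np.array([
    wilcoxon(stressor[:, k], baseline[:, k]).pvalue
    for k in range(n_components)
])
significant = p_values < 0.05
print(f"{significant.sum()} of {n_components} components show a phase effect")
```

On the real data, the paper reports that 38 of the 56 components pass such a paired test; with a study of this size, a multiple-comparison correction (e.g. Holm or FDR) would typically be applied before counting significant components.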

Feature Visualization Video

Method Overview

Conceptual framework
Conceptual overview of the proposed stress estimation framework combining facial videos and physiological signals.
Cross-modal attention fusion
Cross-modal attention fusion architecture for integrating 3D-derived facial features with biosignals.
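A minimal PyTorch sketch of the cross-modal attention idea shown above: facial tokens act as queries attending over biosignal tokens. All dimensions, the number of biosignal channels, the pooling choice, and the layer layout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: facial-feature tokens attend to biosignal tokens."""

    def __init__(self, d_face=56, d_bio=4, d_model=64, n_heads=4):
        super().__init__()
        self.face_proj = nn.Linear(d_face, d_model)
        self.bio_proj = nn.Linear(d_bio, d_model)
        # Queries come from the facial stream, keys/values from biosignals.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # binary stress logit

    def forward(self, face_seq, bio_seq):
        q = self.face_proj(face_seq)         # (B, T_face, d_model)
        kv = self.bio_proj(bio_seq)          # (B, T_bio, d_model)
        fused, _ = self.attn(q, kv, kv)      # cross-modal attention
        return self.head(fused.mean(dim=1))  # pool over time -> logit

model = CrossModalFusion()
face = torch.randn(2, 100, 56)  # 56-D 3DMM descriptor per frame
bio = torch.randn(2, 100, 4)    # hypothetical biosignal channels
logits = model(face, bio)
print(logits.shape)  # torch.Size([2, 1])
```

Because queries and keys/values come from different streams, the two modalities need not be sampled at the same rate: `T_face` and `T_bio` can differ.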

BibTeX

@misc{valergaki2026combiningfacialvideosbiosignals,
      title={Combining facial videos and biosignals for stress estimation during driving}, 
      author={Paraskevi Valergaki and Vassilis C. Nicodemou and Iason Oikonomidis and Antonis Argyros and Anastasios Roussos},
      year={2026},
      eprint={2601.04376},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.04376}, 
}