Reliable stress recognition is critical in applications such as medical monitoring and safety-critical systems, including real-world driving. While stress is commonly detected from physiological signals such as perinasal perspiration and heart rate, facial activity provides complementary cues that can be captured unobtrusively from video. We propose a multimodal stress estimation framework that combines facial videos with physiological signals and remains effective even when biosignal acquisition is challenging. Facial behavior is represented using a dense 3D Morphable Model, yielding a 56-dimensional descriptor that captures subtle expression and head-pose dynamics over time. To investigate the correlation between stress and facial motion, we perform extensive experiments that also involve physiological markers. Paired hypothesis tests between baseline and stressor phases show that 38 of the 56 facial components exhibit consistent, phase-specific stress responses comparable to those of physiological markers. Building on these findings, we introduce a Transformer-based temporal modeling framework and evaluate unimodal, early-fusion, and cross-modal attention strategies. Combining the 3D-derived facial features with physiological signals via cross-modal attention substantially improves performance over physiological signals alone, increasing AUROC from 52.7% to 92.0% and accuracy from 51.0% to 86.7%. Although evaluated on driving data, the proposed framework and protocol may readily generalize to other stress estimation settings.
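The cross-modal attention fusion described above can be illustrated with a minimal single-head sketch: physiological tokens act as queries that attend over the sequence of 56-dimensional facial descriptors. This is an assumption-laden toy in NumPy (random projection weights, illustrative dimensions, no training), not the paper's actual architecture.

```python
# Hedged sketch of single-head cross-modal attention: physiological
# frames (queries) attend to 3DMM facial descriptors (keys/values).
# All dimensions and weights here are illustrative, not from the paper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(physio, face, d_model=32, seed=0):
    """physio: (T_p, D_p) query sequence; face: (T_f, D_f) key/value sequence.
    Returns (T_p, d_model) physiological tokens enriched with facial context."""
    rng = np.random.default_rng(seed)
    # Random projection matrices stand in for learned weights.
    Wq = rng.standard_normal((physio.shape[1], d_model)) / np.sqrt(physio.shape[1])
    Wk = rng.standard_normal((face.shape[1], d_model)) / np.sqrt(face.shape[1])
    Wv = rng.standard_normal((face.shape[1], d_model)) / np.sqrt(face.shape[1])
    Q, K, V = physio @ Wq, face @ Wk, face @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model))  # (T_p, T_f) attention weights
    return attn @ V

# Example: 100 physiological frames (4 hypothetical channels) attend to
# 100 facial frames carrying the 56-dimensional 3DMM descriptor.
rng = np.random.default_rng(1)
fused = cross_modal_attention(rng.standard_normal((100, 4)),
                              rng.standard_normal((100, 56)))
```

In a trained model the projections would be learned and the fused tokens fed into Transformer layers; this sketch only shows the data flow of the fusion step.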
@misc{valergaki2026combiningfacialvideosbiosignals,
title={Combining facial videos and biosignals for stress estimation during driving},
author={Paraskevi Valergaki and Vassilis C. Nicodemou and Iason Oikonomidis and Antonis Argyros and Anastasios Roussos},
year={2026},
eprint={2601.04376},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.04376},
}