

[StageM2] Graph-Transformer Based Techniques for Facial Expression Recognition, Human Gesture & Action Recognition from Videos

14 January 2026


Category: Internship positions


Duration: 6 months, starting in March or April 2026

Location: Université de Lille – CRIStAL, 59655 Villeneuve d’Ascq, France

Supervisors: Dr Tanmoy Mondal, Dr Debabrota Basu, and Dr Deise Santana Maia

This internship proposal falls within the scope of two research groups of the CRIStAL laboratory at the Université de Lille: GT Image, whose main focus is developing new tools and algorithms for image analysis, video scene interpretation, and 3D object shape analysis, and GT DatInG, which focuses on machine learning, data mining, and signal processing.

Scientific context

In recent years, Graph Convolutional Networks (GCNs) have become one of the most popular choices for modeling relational data, owing to their ability to capture both the local and global structure of graphs. GCNs have accordingly shown promising results in various tasks involving spatial-temporal data. Similarly, Transformer models have revolutionized the field of Natural Language Processing (NLP) and have become the dominant architecture for many NLP tasks. Beyond NLP, Transformer architectures (e.g. the Vision Transformer) have also been applied to tasks such as skeleton-based human action recognition, landmark-based human gesture recognition, and facial expression recognition.

It can be noted that all three of these tasks (facial expression recognition, human action recognition, and human gesture recognition from video) operate on a sequence of video frames, and that each frame can be structured as a graph built from detected landmarks: facial or hand landmarks for expression and gesture recognition, and skeleton joints for action recognition. Given this shared graph structure, and since the temporal sequences exhibit dynamic spatial-temporal patterns, in this work we propose to develop a Graph-Transformer model [1,2,3] that combines the strengths of spectral GCNs for learning spatial-temporal representations with a Transformer for capturing long-term dependencies in video sequences.
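
To make this graph representation concrete, below is a minimal Python sketch of how a single frame could be turned into a graph from detected landmarks. The build_landmark_graph helper, the 5-node hand topology, and the random coordinates standing in for detector output are illustrative assumptions, not part of the proposal; the normalization is the standard D^(-1/2)(A+I)D^(-1/2) scheme used by spectral GCNs.

    import numpy as np

    def build_landmark_graph(landmarks, edges):
        """Build node features and a normalized adjacency matrix
        from the 2D landmarks detected in one frame.

        landmarks: (N, 2) array of (x, y) coordinates.
        edges:     list of (i, j) index pairs defining the topology.
        """
        n = landmarks.shape[0]
        adj = np.zeros((n, n), dtype=np.float32)
        for i, j in edges:
            adj[i, j] = adj[j, i] = 1.0           # undirected edges
        adj += np.eye(n, dtype=np.float32)        # self-loops: A + I
        d_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
        adj_norm = d_inv_sqrt @ adj @ d_inv_sqrt  # D^-1/2 (A+I) D^-1/2
        return landmarks.astype(np.float32), adj_norm

    # Hypothetical 5-point hand graph: wrist (0) linked to four fingertips.
    edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
    landmarks = np.random.rand(5, 2)              # stand-in for detector output
    feats, adj_norm = build_landmark_graph(landmarks, edges)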

Objectives

The main scientific objective of this internship is to contribute to the Graph-Transformer model and generalize it into a single architecture applicable to all three tasks. The Graph-Transformer based spatial-temporal model will learn the dynamics of the spatial-temporal correlations of the landmark or skeleton graphs. It consists of spectral GCN layers for spatial-temporal feature learning, followed by a Transformer encoder-only model for capturing global temporal dependencies.
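
As a rough illustration of this pipeline, the following PyTorch sketch chains per-frame spectral GCN layers with a Transformer encoder applied along the frame axis. The layer sizes and the mean-pooling steps are hypothetical placeholders rather than the final design, which is precisely what the internship will investigate.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        """One spectral graph convolution (Kipf & Welling style):
        X' = ReLU(A_norm @ X @ W), with A_norm precomputed."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x, adj_norm):
            # x: (batch, frames, nodes, in_dim); adj_norm: (nodes, nodes)
            return torch.relu(adj_norm @ self.linear(x))

    class GraphTransformer(nn.Module):
        """Spectral GCNs per frame, then a Transformer encoder over time."""
        def __init__(self, in_dim=2, dim=64, heads=4, layers=2, classes=10):
            super().__init__()
            self.gcn1 = GCNLayer(in_dim, dim)
            self.gcn2 = GCNLayer(dim, dim)
            enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                             batch_first=True)
            self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
            self.head = nn.Linear(dim, classes)

        def forward(self, x, adj_norm):
            # x: (batch, frames, nodes, in_dim)
            h = self.gcn2(self.gcn1(x, adj_norm), adj_norm)
            h = h.mean(dim=2)    # pool over graph nodes -> (batch, frames, dim)
            h = self.encoder(h)  # global temporal dependencies across frames
            return self.head(h.mean(dim=1))  # pool over frames -> class logits

    # Toy usage: 8 clips, 30 frames, 5-node landmark graphs, 2D coordinates.
    model = GraphTransformer()
    logits = model(torch.rand(8, 30, 5, 2), torch.eye(5))  # eye(5): placeholder adjacency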

Tasks

In this internship, the selected student is expected to:

  • Review the state of the art on graph- and transformer-based architectures applied to facial expression, gesture, and action recognition.
  • Adapt selected state-of-the-art models to any of the three aforementioned tasks on public datasets.
  • Propose and implement new methodological improvements to existing graph-transformer models, e.g. novel ways of constructing lighter graphs (fewer vertices or connections) without compromising overall model performance; a toy illustration of edge pruning is sketched after this list.
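
As a toy illustration of the graph-lightening idea in the last task, the NumPy sketch below prunes a weighted adjacency matrix down to its k strongest edges. The keep_ratio parameter and the weight-based criterion are illustrative assumptions; identifying pruning criteria that actually preserve recognition performance is part of the proposed work.

    import numpy as np

    def prune_graph(adj, keep_ratio=0.5):
        """Keep only the strongest edges of a symmetric weighted
        adjacency matrix, yielding a lighter graph."""
        i, j = np.triu_indices_from(adj, k=1)  # upper triangle, no diagonal
        weights = adj[i, j]
        k = max(1, int(keep_ratio * (weights > 0).sum()))
        keep = np.argsort(weights)[-k:]        # indices of the k strongest edges
        pruned = np.zeros_like(adj)
        pruned[i[keep], j[keep]] = weights[keep]
        return pruned + pruned.T               # restore symmetry

    # Toy usage on a random symmetric weight matrix.
    w = np.random.rand(5, 5)
    light = prune_graph((w + w.T) / 2, keep_ratio=0.3)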

Required profile

  • Final-year Master’s student or engineering student specializing in machine learning, computer vision, or a related field.
  • Knowledge of computer vision and of machine/deep learning (CNNs, GNNs, Transformers, etc.)
  • Programming skills (Python)
  • Autonomy, rigor and critical thinking.

Application

If you are interested in this internship offer, please send the following documents to Dr Tanmoy Mondal (mondal.tanmoy@univ-lille.fr), Dr Debabrota Basu (debabrota.basu@inria.fr), and Dr Deise Santana Maia (deise.santanamaia@univ-lille.fr):

  • CV
  • Motivation letter
  • Grades obtained during your Bachelor’s/Master’s/Engineering studies, along with your class rank
  • Name and contact information of at least one reference we may contact if necessary

References

[1] Feghoul, K., Maia, D. S., Amrani, M. E., Daoudi, M., & Amad, A. (2024). Spatial-temporal graph transformer for surgical skill assessment in simulation sessions. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2023), Lecture Notes in Computer Science, vol. 14469. Springer.

[2] Feghoul, K., Maia, D. S., El Amrani, M., Daoudi, M., & Amad, A. (2024). MGRFormer: A multimodal transformer approach for surgical gesture recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) (pp. 1-10). IEEE.

[3] Yun, S., Jeong, M., Kim, R., Kang, J., & Kim, H. J. (2019). Graph transformer networks. Advances in Neural Information Processing Systems, 32.
