

[PhD] Multi-modal data synchronization for speech and gesture analysis

January 14, 2026


Category: PhD positions


Application deadline: April 18, 2026

Final results: June 2026

Duration: 3 years, starting October 2026

Place: Université de Lille – CRIStAL, Villeneuve d’Ascq 59655, France

Supervisors:

Mohamed Daoudi, Professor (mohamed.daoudi@univ-lille.fr)

Deise Santana Maia, Assistant Professor (deise.santanamaia@univ-lille.fr)

This PhD thesis proposal falls within the scope of the research group 3D SAM of the CRIStAL laboratory of the Université de Lille, whose primary focus is to develop new models and algorithms for analyzing and synthesizing human movement and behavior (bodies, faces and gestures). The selected candidate will have the opportunity to work in an interdisciplinary environment in collaboration with linguists from the Universidade Estadual do Sudoeste da Bahia (Brazil).

Scientific context

Multi-modal data analysis has gained significant attention in the computer vision community, as many computer vision tasks can benefit from modalities beyond images and videos, such as audio, physiological, and kinematic data. For instance, recent works in the literature [3, 4] successfully combine multiple modalities for pain measurement and surgical skill assessment. In both studies, the authors rely on datasets where the different modalities are captured using various devices, including RGB or RGB-D cameras and connected watches, which are manually synchronized. Such manual synchronization can be replaced with hardware-based methods, which use external triggers to synchronize data streams from different devices. However, these methods require specialized equipment and may be challenging to implement, as discussed in [1].

Considering only the cases where different modalities (e.g., kinematics and audio) are recorded separately by different devices, several research questions arise:

• How can we verify that both audio and kinematic data are indeed synchronized?

• Are there patterns in the audio, such as changes in voice tone, that can be directly correlated with certain kinematic events, such as sudden movements?

• Given that the data might be slightly desynchronized, how can we re-establish synchronization? Can we define a metric to quantify the level of synchrony between such modalities? (An illustrative baseline is sketched right after this list.)
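
As a purely illustrative sketch of the last question, the following Python snippet estimates a constant time offset between an audio recording and an accelerometer stream by cross-correlating the audio energy envelope with the movement magnitude. The helper names (energy_envelope, estimate_offset), the RMS envelope, and the constant-offset assumption are illustrative choices made for this example and are not part of the proposal.

    import numpy as np
    from scipy.signal import correlate, correlation_lags

    def energy_envelope(audio, sr_audio, sr_target):
        """Frame-wise RMS energy of the audio, at roughly sr_target frames per second."""
        hop = int(sr_audio // sr_target)            # audio samples per envelope frame
        n_frames = len(audio) // hop
        frames = audio[:n_frames * hop].reshape(n_frames, hop)
        return np.sqrt((frames ** 2).mean(axis=1))

    def estimate_offset(audio, sr_audio, accel, sr_accel):
        """Estimate a constant lag (in seconds) of the kinematic stream with
        respect to the audio stream by maximizing their cross-correlation."""
        env = energy_envelope(audio, sr_audio, sr_accel)   # audio envelope at the accelerometer rate
        mov = np.linalg.norm(accel, axis=1)                # movement magnitude per accelerometer sample
        env = (env - env.mean()) / (env.std() + 1e-8)      # z-score both signals
        mov = (mov - mov.mean()) / (mov.std() + 1e-8)
        n = min(len(env), len(mov))
        xcorr = correlate(mov[:n], env[:n], mode="full")   # cross-correlation over all lags
        lags = correlation_lags(n, n, mode="full")
        return lags[np.argmax(xcorr)] / sr_accel           # lag with the highest correlation, in seconds

The peak value of this cross-correlation could also serve as a crude synchrony score; a learned measure, as envisioned in the objectives below, would go beyond this hand-crafted envelope and constant-offset assumption.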

This problem is fairly general and can be extended to various types of data modalities. In the context of this thesis, we focus more specifically on how addressing the data synchronization problem can contribute to a better understanding of the synchrony between human speech and gestures, and, conversely, on how knowledge of speech–gesture synchrony can inform synchronization methods.

For instance, it has been observed in different contexts that stressed syllables tend to coincide with the forceful phase of gestures, known as gestural strokes [2].

More precisely, this problem raises the following research questions:

• Are speech and gesture patterns consistently observable across different populations, and to what extent do they depend on attributes such as age, gender, or other individual characteristics?

• In populations with mildly impaired speech abilities, are similar synchrony patterns still observable?

Objectives

We aim to study the synchrony between two specific modalities, namely kinematic data (obtained from connected watches or estimated from videos) and audio. We focus on two complementary objectives, both of which can benefit from a well-defined measure of speech and gesture synchronization:

(1) Analyzing speech and gesture synchrony to identify patterns across different populations.

(2) Developing machine learning and deep learning techniques for the synchronization of audio, video, and kinematic data.

Using the techniques developed throughout the thesis, our final application will focus on the analysis of speech and gesture patterns in individuals with Trisomy 21 (T21), also known as Down Syndrome. This topic is part of the multidisciplinary Brazilian-French project SAVOIR 21, led by the 3D SAM team of CRIStAL in collaboration with the Saber Down research center (Universidade Estadual do Sudoeste da Bahia, Brazil).

Required profile

  • M2 or engineering degree with a specialization in machine learning, computer vision, data science, or a related field
  • Knowledge of computer vision, machine learning, and deep learning (CNNs, GNNs, Transformers, etc.)
  • Excellent programming skills (Python)
  • Familiarity with audio processing is a plus
  • Autonomy, rigor, and critical thinking

Application

If you are interested in this position, please send the following documents to Prof. Mohamed Daoudi (mohamed.daoudi@univ-lille.fr) and Deise Santana Maia (deise.santanamaia@univ-lille.fr):

  • CV
  • Motivation letter (explaining why this specific position interests you)
  • Academic transcripts from your Bachelor’s, Master’s or Engineering degree, including class rank
  • Name and contact information of at least two references we may contact if necessary

References

[1] O. Basystiuk, Z. Rybchak, I. Zavushchak, and U. Marikutsa. Evaluation of multimodal data synchronization tools. Comput. Des. Syst. Theory Pract, 6:104–111, 2024.

[2] M. Chu and P. Hagoort. Synchronization of speech and gesture: Evidence for interaction in action. Journal of Experimental Psychology: General, 143(4):1726, 2014.

[3] K. Feghoul, D. S. Maia, M. Daoudi, and A. Amad. MMGT: Multimodal graph-based transformer for pain detection. In 2023 31st European Signal Processing Conference (EUSIPCO), pages 556–560. IEEE, 2023.

[4] K. Feghoul, D. S. Maia, M. El Amrani, M. Daoudi, and A. Amad. MGRFormer: A multimodal transformer approach for surgical gesture recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2024.
