★ The internship will take place at the FOX team of CRIStAL laboratory, at the University of Lille.
Summary:
Our work focuses on 3D human pose estimation in video. State-of-the-art approaches formulate the problem as 2D key point detection followed by 3D pose estimation.
While splitting up the problem arguably reduces the difficulty of the task, it is inherently ambiguous as multiple 3D poses can map to the same 2D key points.
Previous work tackled this ambiguity by modeling temporal information with recurrent neural networks (RNNs). Convolutional networks, however, have proven very successful at modeling temporal information in tasks traditionally tackled with RNNs. One of their main advantages is that they can process multiple frames in parallel, which is not possible with recurrent networks.
In this internship, we plan to work on a fully convolutional architecture that performs temporal convolutions over 2D key points for accurate 3D pose prediction in video. Our approach should be compatible with any 2D key point detector and can effectively handle large contexts via dilated convolutions.
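To illustrate how dilated temporal convolutions enlarge the context window, the following minimal numpy sketch (function names and shapes are illustrative, not taken from any cited implementation) applies a dilated 1D convolution along the time axis of a key point trajectory and computes the receptive field of a stack of such layers:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid 1D dilated convolution along the time axis.

    x: (frames, channels)       e.g. flattened 2D key points, channels = 2 * joints
    w: (kernel, channels, out_channels)
    """
    k = w.shape[0]
    span = (k - 1) * dilation            # temporal extent covered by the kernel
    out_frames = x.shape[0] - span
    y = np.zeros((out_frames, w.shape[2]))
    for t in range(out_frames):
        taps = x[t : t + span + 1 : dilation]   # (kernel, channels) strided taps
        y[t] = np.einsum('kc,kco->o', taps, w)
    return y

def receptive_field(kernel, dilations):
    """Frames of context seen by a stack of dilated conv layers."""
    return 1 + sum((kernel - 1) * d for d in dilations)

# Example: kernel 3 with dilations 1, 3, 9, 27 covers 81 frames of context.
print(receptive_field(3, [1, 3, 9, 27]))  # -> 81
```

The point of the dilation schedule is that context grows exponentially with depth while the per-layer cost stays constant, which is what makes large temporal windows affordable.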
We also plan to address settings where labeled training data is scarce, introducing a new scheme that leverages unlabeled video for semi-supervised training. Low-resource settings are particularly challenging for neural networks, which require large amounts of labeled training data, and collecting labels for 3D human pose estimation requires an expensive motion-capture setup as well as lengthy recording sessions.
Hence, we will first work on a simple and efficient approach for 3D human pose estimation in video based on dilated temporal convolutions over 2D key point trajectories. Second, we will introduce a semi-supervised approach that exploits unlabeled video and is effective when labeled data is scarce. Compared to previous semi-supervised approaches, it requires only camera intrinsic parameters rather than ground-truth 2D annotations or multi-view imagery with extrinsic camera parameters.
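The intrinsics-only supervision can be sketched as a reprojection check: project the predicted camera-space 3D pose back to pixel coordinates with the intrinsic matrix and compare against the 2D detector output. A minimal numpy sketch, with hypothetical function names that are not taken from the cited work:

```python
import numpy as np

def project(points_3d, K):
    """Perspective projection of camera-space 3D joints with intrinsics K (3x3)."""
    uvw = points_3d @ K.T              # (joints, 3) homogeneous image points
    return uvw[:, :2] / uvw[:, 2:3]    # divide by depth -> pixel coordinates

def reprojection_loss(pred_3d, detected_2d, K):
    """Mean 2D distance between projected 3D predictions and detector output."""
    return np.mean(np.linalg.norm(project(pred_3d, K) - detected_2d, axis=1))

# A joint on the optical axis projects to the principal point (cx, cy).
K = np.array([[1000., 0., 320.],
              [0., 1000., 240.],
              [0., 0., 1.]])
print(project(np.array([[0., 0., 2.]]), K))  # -> [[320. 240.]]
```

Because this loss only needs K, unlabeled video plus an off-the-shelf 2D detector is enough to produce a training signal, which is the appeal of the semi-supervised setting described above.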
Despite its simplicity, the second stage (lifting 2D key points to 3D) is an ill-posed problem: it lacks a depth prior and suffers from the ambiguity described above. We will also explore Transformer architectures, which are well suited to modeling the spatio-temporal correlations relevant to 3D human pose estimation. However, the literature shows that the cost of computing the joint-to-joint affinity matrix in self-attention grows quadratically with the number of frames, making a full spatio-temporal solution impractical for model training.
As a result, most transformer architectures adopt a two-step alternative: first encode spatial information within each frame, then aggregate the resulting feature sequence with a temporal transformer. This strategy mines correlations across frame-level features but seldom models the relations between joints in different frames.
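A back-of-envelope comparison of affinity-matrix sizes shows why the factorization helps. The numbers below are illustrative; 17 joints and 243 frames is a common setting in the video pose estimation literature:

```python
def full_attention_cost(joints, frames):
    """Entries in one joint-to-joint affinity matrix over all frames."""
    tokens = joints * frames
    return tokens * tokens               # (J*F)^2

def two_step_cost(joints, frames):
    """Spatial attention per frame (J x J) plus temporal attention (F x F)."""
    spatial = frames * joints * joints
    temporal = frames * frames
    return spatial + temporal

# 17 joints, 243 frames:
print(full_attention_cost(17, 243))  # -> 17065161 (~1.7e7 entries)
print(two_step_cost(17, 243))        # -> 129276  (~1.3e5 entries)
```

The two-step factorization is roughly two orders of magnitude cheaper here, at the cost of never attending directly between joints in different frames, which is exactly the limitation we intend to investigate.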
We will explore this research direction for the purpose of human pose estimation.
Desired Profile:
- Final-year Master’s student (M2) or engineering student specializing in machine learning, computer vision, or a related field.
- Knowledge of computer vision, machine learning, and deep learning.
- Programming skills (Python).
- Autonomy, rigor, and critical thinking skills.
Address of the Internship:
CAMPUS Haute-Borne CNRS IRCICA-IRI-RMN
Parc Scientifique de la Haute Borne, 50 Avenue Halley, BP 70478, 59658 Villeneuve d'Ascq Cedex, France.
Application:
If this proposal interests you, please send the following documents to Dr. Tanmoy MONDAL (tanmoy.mondal@univ-lille.fr) and Cedric BONNET (cedrick.bonnet@univ-lille.fr):
- CV
- Motivation Letter
- Transcripts of grades obtained in Bachelor’s/Master’s/Engineering school as well as class ranking
- Name and contact details of at least one reference person who can be contacted if necessary
References
- Tang, Z., Qiu, Z., Hao, Y., Hong, R., & Yao, T. (2023). 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 4790–4799.
- Lin, K., Wang, L., & Liu, Z. (2021). End-to-End Human Pose and Mesh Reconstruction with Transformers. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1954–1963.
- Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi-supervised training. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June, 7745–7754.
