Announcement


[M2 Internship] Graph-based methods for human action recognition with limited data

21 September 2025


Category: Internship positions


Background

The analysis of human body posture, gesture and movement is an important part of daily communication between humans. Consequently, the ability to build an autonomous system or agent that automatically interprets human body movements during interactions has proven beneficial in many fields, such as accessibility (for instance, recognizing and translating sign languages [1]), automotive applications (e.g. gesture recognition while driving), and teaching [2].

Current state-of-the-art human gesture recognition methods have substantially improved prediction accuracy. Despite this, an important limitation remains: the large number of training samples these methods require [3]. This can be problematic, especially when dealing with body movements or gestures intended for specific applications (e.g. when high-resolution, high-fidelity human movement is of utmost importance). Furthermore, the majority of approaches rely on image-based observations and do not exploit the structure of the human body. Finally, the features learned for prediction (e.g. which ones are actually relevant) are still rarely investigated, which hampers model understanding, i.e. explainability.

Objective

Given these limitations, this Master-level internship aims to investigate the impact of introducing self-supervised learning [4], coupled with a skeleton-based data representation expressed as a graph [5], for human body/gesture analysis. Furthermore, several feature-analysis techniques, such as variational learning [6] and model explainability [7], will be applied to discover which features matter during prediction. Quantitative and qualitative experiments will be conducted to assess the impact of each proposed component, and the results are intended to be submitted to a publication venue.

Methodology

The first step of this internship will use a variant of a graph neural network, e.g. a Graph Convolutional Network [5], to incorporate the topology of the skeleton into the training process. Afterwards, self-supervised mechanisms [4], such as reconstruction or contrastive learning, will be applied to the graph representations, using unlabeled data to capture the inherent structure of the data. Quantitative experiments on a downstream task, such as activity recognition, will then evaluate the impact of the self-supervised approach.
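To make this first step concrete, the sketch below shows a minimal graph-convolution layer over skeleton joints in PyTorch. It is an illustrative sketch only: the toy five-joint skeleton, tensor shapes and layer sizes are assumptions for demonstration, not the architecture to be developed during the internship. Its output features could then feed a reconstruction or contrastive pretext head.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkeletonGraphConv(nn.Module):
        """One graph-convolution layer over skeleton joints.

        A is a (V, V) adjacency matrix encoding the skeleton topology,
        i.e. which joints are connected by bones.
        """
        def __init__(self, in_channels, out_channels, A):
            super().__init__()
            # Add self-loops, then symmetrically normalize:
            # A_norm = D^{-1/2} (A + I) D^{-1/2}
            A_hat = A + torch.eye(A.size(0))
            d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
            self.register_buffer("A_norm", d_inv_sqrt @ A_hat @ d_inv_sqrt)
            self.linear = nn.Linear(in_channels, out_channels)

        def forward(self, x):
            # x: (batch, frames, joints, channels)
            x = torch.einsum("uv,btvc->btuc", self.A_norm, x)  # aggregate neighbours
            return F.relu(self.linear(x))

    # Toy chain skeleton with 5 joints and 4 bones, 3D joint coordinates.
    A = torch.zeros(5, 5)
    for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
        A[i, j] = A[j, i] = 1.0
    layer = SkeletonGraphConv(in_channels=3, out_channels=16, A=A)
    out = layer(torch.randn(8, 30, 5, 3))  # 8 clips, 30 frames, 5 joints, xyz
    print(out.shape)  # torch.Size([8, 30, 5, 16])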

Next, variational learning [6] will be incorporated into feature learning to enable exploration of the features that govern the variety of movements. This step may also include established explainability methods such as Grad-CAM [8] or attention-based methods (e.g. transformers [9]) to project these contributions back onto the input data [10].
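As an illustration of the variational component, the sketch below adds a VAE-style Gaussian bottleneck on pooled clip features, using the standard reparameterization trick and KL regularizer from [6]. The feature and latent dimensions are placeholder assumptions; traversing individual latent dimensions is one simple way to inspect which factors govern movement variety.

    import torch
    import torch.nn as nn

    class VariationalBottleneck(nn.Module):
        """Maps a pooled feature vector to a latent Gaussian and samples from it."""
        def __init__(self, feat_dim=256, latent_dim=32):
            super().__init__()
            self.to_mu = nn.Linear(feat_dim, latent_dim)
            self.to_logvar = nn.Linear(feat_dim, latent_dim)

        def forward(self, h):
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            std = torch.exp(0.5 * logvar)
            z = mu + std * torch.randn_like(std)  # reparameterization trick
            # KL divergence to a standard normal prior, averaged over the batch
            kl = -0.5 * torch.mean(
                torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
            return z, kl

    h = torch.randn(8, 256)            # pooled per-clip features (placeholder)
    z, kl = VariationalBottleneck()(h)
    print(z.shape, kl.item())          # torch.Size([8, 32]) and a scalar KL term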

Datasets and evaluation

Evaluations will primarily use publicly available datasets, such as the NTU RGB+D dataset [11] and possibly the Motion-X dataset [12], to allow direct comparison with the state-of-the-art literature. Afterwards, an in-house dataset recorded by the partner company MocapLab (http://mocaplab.com), one of the leaders in motion capture, will be used to analyze model accuracy on real-world, field-recorded data.
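For orientation, skeleton pipelines built on NTU RGB+D commonly batch sequences as a five-dimensional tensor with 25 joints and up to two tracked subjects per clip; the snippet below only illustrates that convention (the actual data loading will depend on the codebase chosen during the internship).

    import torch

    # Common convention in the NTU RGB+D skeleton literature:
    # (batch, channels, frames, joints, bodies), with 3D coordinates,
    # 25 joints per skeleton and up to 2 tracked subjects per clip.
    N, C, T, V, M = 16, 3, 64, 25, 2
    batch = torch.randn(N, C, T, V, M)

    # Graph layers typically fold the body dimension into the batch and
    # move channels last before joint-wise graph convolutions.
    x = batch.permute(0, 4, 2, 3, 1).reshape(N * M, T, V, C)
    print(x.shape)  # torch.Size([32, 64, 25, 3])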

Expected outcomes

  1. Successful implementation of a graph-based method combining self-supervised and variational learning (or Grad-CAM) to exploit the structure of the data and explore the learned features.
  2. Application of the features learned through self-supervised pre-training to a downstream task, with competitive accuracy on benchmark datasets.
  3. Submission of the results to a publication venue.

Timeline and Research Lab

  1. The internship is expected to last six months, with the results communicated in the Master's thesis and possibly a publication (an international conference or journal, depending on the results achieved).
  2. The research will be carried out within the ARTEMIS department of Télécom SudParis (ARMEDIA team, SAMOVAR laboratory), in conjunction with the company MocapLab for case studies.

Requirements

  1. Knowledge of Deep Learning.
  2. Knowledge of Computer Vision.
  3. Basic knowledge of Graph Neural Networks.

Organizations

  • The research will be carried out within the ARTEMIS department of Télécom SudParis (ARMEDIA team, SAMOVAR laboratory).
  • The work will be conducted in conjunction with MocapLab for case studies.
  • The work location is Evry, France.

Application Requirements

Administrative and Technical Requirements

  • Enrolled in a Master 2 program or in the final year of an engineering school.
  • Knowledge of Machine Learning, Deep Learning and Image Processing.
  • Familiarity with Machine Learning (e.g. scikit-learn), Computer Vision (e.g. OpenCV) and Deep Learning frameworks (e.g. PyTorch or TensorFlow).
  • Prior experience in research is highly advantageous.

Documents to be submitted:

Please send the following documents as a single PDF file, with the title ‘Internship-GAR-TSP’, to decky.aspandi_latif@telecom-sudparis.eu or titus.zaharia@telecom-sudparis.eu:

  • Curriculum Vitae.
  • Current diploma and transcripts.
  • Motivation letter (half a page to one page, optional).
  • Recommendation letters (if any).

Application deadline, selection process and start of the internship:

  • Application deadline: 31st March 2025 (flexible, first come, first served).
  • Selection planned from 26th March 2025 onward.
  • Start of the internship from 1st April 2025 onward (as soon as possible).

Contacts:

Further questions can be addressed to:

  • Decky Aspandi, Teacher-researcher: decky.aspandi_latif@telecom-sudparis.eu
  • Titus Zaharia, Professor: titus.zaharia@telecom-sudparis.eu

References

[1] Bertin-Lemée, Élise, et al. "Rosetta-LSF: an aligned corpus of French Sign Language and French for text-to-sign translation." 13th Conference on Language Resources and Evaluation (LREC 2022). 2022.

[2] Mitra, Sushmita, and Tinku Acharya. "Gesture recognition: A survey." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37.3 (2007): 311-324.

[3] Mohamed, Noraini, Mumtaz Begum Mustafa, and Nazean Jomhari. "A review of the hand gesture recognition system: Current progress and future directions." IEEE Access 9 (2021): 157422-157436.

[4] Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." IEEE Transactions on Knowledge and Data Engineering 35.1 (2021): 857-876.

[5] Miah, Abu Saleh Musa, Md Al Mehedi Hasan, and Jungpil Shin. "Dynamic hand gesture recognition using multi-branch attention based graph and general deep learning model." IEEE Access 11 (2023): 4703-4716.

[6] Pu, Yunchen, et al. "Variational autoencoder for deep learning of images, labels and captions." Advances in Neural Information Processing Systems 29 (2016).

[7] Minh, Dang, et al. "Explainable artificial intelligence: a comprehensive review." Artificial Intelligence Review (2022): 1-66.

[8] Zhang, C., D. Aspandi, and S. Staab. "Predicting eye gaze location on websites." Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023), Volume 4: VISAPP, Lisbon, Portugal, February 19-21, 2023: 121-132.

[9] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[10] Hanna-Asaad, Antoine, Decky Aspandi, and Titus Zaharia. "Multi-modal interpretable automatic video captioning." arXiv preprint arXiv:2411.06872 (2024).

[11] Shahroudy, Amir, et al. "NTU RGB+D: A large scale dataset for 3D human activity analysis." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[12] Lin, Jing, et al. "Motion-X: A large-scale 3D expressive whole-body human motion dataset." Advances in Neural Information Processing Systems 36 (2023): 25268-25280.
