Announcement


Generative Few-Shot Learning using Mixture-of-Experts Transformers for 3D Skeleton-based Human Action Recognition

03 January 2025


Category: Internship positions


MASTER'S INTERNSHIP

Scientific fields: Computer science, Artificial Intelligence, Computer Vision

Keywords: Generative Few-Shot Learning; Transformers; Mixture-of-Experts; RGB+D Datasets; Human-System Interaction (HSI)

Research interests: Computer Vision; Machine Learning; Deep Learning

Research work: Deep Learning Models for 3D Skeleton-based Human Action Recognition

3D Skeleton-based Human Action Recognition (HAR) [1], [2] is a fundamental task in pattern recognition and computer vision and a key issue in many applications, e.g., medical and industrial imaging, robotics, and VR/AR.

Human Action Recognition (HAR) [1], [2] decodes human movements by analyzing sequential 3D skeletal joint coordinates obtained through sensor technologies such as motion capture devices, depth cameras (e.g., Microsoft Kinect, Intel RealSense), and wearable motion sensors. These sensors track body joint positions in real time, enabling sophisticated computational analysis of human actions and gestures across diverse domains.
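
As an illustration of this input representation, a skeleton clip can be stored as an array of per-frame 3D joint coordinates. The minimal sketch below assumes 25 joints (as in the Kinect v2 skeleton used by NTU RGB+D) and an arbitrary clip length of 64 frames; the values and layout are illustrative only.

```python
import numpy as np

# One skeleton clip: T frames, V joints, 3 coordinates (x, y, z) per joint.
# 25 joints matches the Kinect v2 skeleton of NTU RGB+D; 64 frames is an
# arbitrary clip length chosen for this sketch.
T, V, C = 64, 25, 3
clip = np.random.randn(T, V, C).astype(np.float32)  # placeholder for real sensor data

# Many skeleton-based HAR pipelines rearrange the clip to (C, T, V), treating the
# coordinates as channels, time as one axis and joints as another.
clip_ctv = np.transpose(clip, (2, 0, 1))
print(clip_ctv.shape)  # (3, 64, 25)
```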

Recently, the authors of [3] introduced few-shot generative models for skeleton-based human action recognition, enabling accurate action classification using limited training samples from specific domains. They leveraged large public datasets (NTU RGB+D 120 [4] and NTU RGB+D [5]) to develop cross-domain generative models. By introducing novel entropy-regularization losses, they effectively transferred motion diversity from source to target domains, enabling more robust action recognition with limited training samples. They used a standard model, the Spatial-Temporal Graph Convolutional Network (ST-GCN) [6], to generate action samples, and then trained the few-shot generative model on the concatenation of the real data and the samples generated by ST-GCN.
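
Conceptually, this augmentation step amounts to pooling the few available real samples with samples drawn from a pretrained generator and training the recognizer on the combined set. The sketch below is a minimal illustration of that idea, assuming a hypothetical `generator.generate(label, n)` sampling interface; it is not the exact procedure of [3].

```python
import numpy as np

def augment_few_shot_set(real_clips, real_labels, generator, n_synthetic_per_class):
    """Concatenate real clips with synthetic clips from a generative model.

    `generator.generate(label, n)` is a hypothetical interface standing in for
    whatever sampling routine the chosen generative model (e.g. an ST-GCN-based
    generator) actually exposes; clips are assumed to have shape (n, T, V, C).
    """
    synthetic_clips, synthetic_labels = [], []
    for label in np.unique(real_labels):
        clips = generator.generate(label, n_synthetic_per_class)
        synthetic_clips.append(clips)
        synthetic_labels.extend([label] * n_synthetic_per_class)

    augmented_clips = np.concatenate([real_clips] + synthetic_clips, axis=0)
    augmented_labels = np.concatenate([real_labels, np.array(synthetic_labels)])
    return augmented_clips, augmented_labels
```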

Few-shot scenarios [3], [6], [7] arise when HAR models must be trained with very limited labeled data; this is a major obstacle to practical use, because collecting human action data and annotating it correctly is time consuming and labor intensive. In [8], the authors proposed few-shot learning for cross-domain HAR: self-training is used to adapt representations learned in a labeled source domain (defined by activities, sensor positions, and users) to a target domain with very limited labeled data.

In this internship, we will develop Generative Few-Shot Learning models for HAR that generate action samples and augment limited training data. We will propose a novel approach to 3D Skeleton-based Human Action Recognition (HAR) that combines Generative Few-Shot Learning with Mixture-of-Experts (MoE) Transformers [9], [10], [11]. The proposed approach aims to improve the efficiency and accuracy of action recognition on RGB+D datasets while addressing the challenges of limited training data.

The key concepts of Generative Few-Shot Learning for HAR are:

1. 3D Skeleton-based Human Action Recognition (HAR):

  • The method focuses on recognizing human actions based on skeletal representations, utilizing joint coordinates to represent movements [12].
  • The 3D skeleton captures dynamic interactions and positions of different body joints over time, which can be used to classify actions [12].

2. Generative Few-Shot Learning:

  • Few-shot learning [3] aims to train a model to recognize new actions using only a small number of training examples. This is particularly important in HAR due to the often-limited availability of labeled data for certain action classes.
  • Generative models help by synthesizing additional training examples or variations of existing examples to augment the training dataset, improving model robustness.

3. Generative MoE Transformers:

  • Combining MoE with Transformer architectures [10], [11] allows for a dynamic approach where different experts specialize in different aspects of the action recognition task.
  • The architecture uses self-attention mechanisms to capture temporal dependencies in the skeleton data, enhancing the model’s ability to understand complex actions over time.

The main building blocks of Generative Few-Shot Learning with an MoE Transformer architecture for 3D Skeleton-based HAR are the following (a minimal code sketch of this pipeline follows the list):

  1. Input Representation: for each action, take as input the 3D skeleton joint coordinates over time.
  2. Transformer Encoder: modified with MoE layers [9], [10].
  3. MoE Layers: replace the feed-forward layers in the Transformer [9], [10]; multiple experts (feed-forward networks) specialize in different aspects of action recognition, and a router network determines which experts to use for each input sequence.
  4. Generative Few-Shot Learning [3], [6], [7], [8], [13] based on ST-GCN [12] and MoE layers [9], [10] to generate samples.
  5. Output: action recognition.
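
Under several simplifying assumptions, the pipeline above can be sketched in PyTorch as follows: the router uses dense soft gating over all experts (rather than the sparse, load-balanced routing of MoE works such as [9], [10], [11]), joints are simply flattened into per-frame tokens, and the generative few-shot stage (item 4) is omitted. All module names and hyperparameters are illustrative, not a prescribed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Mixture-of-Experts feed-forward block: a router softly weights expert outputs."""
    def __init__(self, d_model, d_hidden, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                          # x: (batch, time, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # (batch, time, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)
        return (expert_out * gates.unsqueeze(-2)).sum(dim=-1)

class MoETransformerLayer(nn.Module):
    """Transformer encoder layer whose feed-forward sublayer is replaced by MoE."""
    def __init__(self, d_model, n_heads, d_hidden, num_experts):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoEFeedForward(d_model, d_hidden, num_experts)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)           # self-attention over the temporal axis
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.moe(x))

class SkeletonMoETransformer(nn.Module):
    """Toy classifier: flatten joints per frame, encode with MoE Transformer layers, pool, classify."""
    def __init__(self, num_joints=25, num_classes=120, d_model=128, n_heads=4,
                 d_hidden=256, num_experts=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)
        self.layers = nn.ModuleList([
            MoETransformerLayer(d_model, n_heads, d_hidden, num_experts)
            for _ in range(num_layers)
        ])
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clips):                      # clips: (batch, T, V, 3)
        x = self.embed(clips.flatten(2))           # (batch, T, d_model)
        for layer in self.layers:
            x = layer(x)
        return self.head(x.mean(dim=1))            # temporal average pooling

# Usage on a random batch of 8 clips (64 frames, 25 joints):
logits = SkeletonMoETransformer()(torch.randn(8, 64, 25, 3))
print(logits.shape)  # torch.Size([8, 120])
```

In practice, the linear embedding could be replaced by an ST-GCN-style spatial encoder, and the dense gating by a sparse top-k router with a load-balancing loss, as in the cited MoE literature.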

In this internship, we will train the proposed models on different datasets for 3D human action recognition. Finally, to measure accuracy and study performance, we will test the proposed models on the “NTU RGB+D” and “NTU RGB+D 120” datasets.
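
For reference, top-1 accuracy on a held-out split can be measured with a loop like the minimal PyTorch sketch below; `model` and `test_loader` are placeholders for the trained model and an NTU RGB+D test-split data loader, neither of which is specified here.

```python
import torch

@torch.no_grad()
def top1_accuracy(model, test_loader, device="cpu"):
    """Fraction of test clips whose highest-scoring class matches the ground-truth label."""
    model.eval()
    correct, total = 0, 0
    for clips, labels in test_loader:          # clips: (batch, T, V, 3), labels: (batch,)
        logits = model(clips.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)
```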

Work plan:

The work plan is divided into two phases:

1) In the first phase (about two months), the student will review the state of the art (SOTA) of few-shot learning models (machine/deep learning) applied to 3D Skeleton-based Human Action Recognition (HAR). The student will then test SOTA models on the “NTU RGB+D” and “NTU RGB+D 120” datasets.

2) In the second phase (about four months), the student will propose contributions in the following research directions:

  1. Proposing a new few-shot learning model based on generative models, Transformers, and Mixture-of-Experts for Human Action Recognition;
  2. Studying the properties of such models (complexity, expressivity, frugality);
  3. Applying the proposed few-shot learning model to the NTU RGB+D and NTU RGB+D 120 datasets for Human Action Recognition tasks, to measure accuracy and study performance.

Expected scientific production

Several scientific productions are expected, in the form of an international peer-reviewed conference paper or an indexed journal paper:

  1. A journal publication relating to the literature review of deep learning models for 3D Skeleton-based Human Action Recognition (HAR) using the NTU RGB+D and NTU RGB+D 120 datasets;
  2. A publication relating to the proposed Generative Few-Shot learning model for 3D HAR, including model performance and evaluation based on training, validation, and testing on Human-System Interaction datasets.

Introduction to the laboratory: CESI LINEACT Research Unit

CESI LINEACT (Digital Innovation Laboratory for Business and Learning at the service of the Competitiveness of Territories) is the CESI group's laboratory, whose activities are carried out on CESI campuses.

Link to the laboratory website:

https://lineact.cesi.fr/en/

https://lineact.cesi.fr/en/research-unit/presentation-lineact/

CESI LINEACT (EA 7527), Digital Innovation Laboratory for Business and Learning at the service of the Competitiveness of Territories, anticipates and accompanies the technological transformations of the sectors and services related to industry and construction. CESI's historical proximity to companies is a determining factor for our research activities and has led us to focus our efforts on applied research, close to companies and in partnership with them. A human-centered approach coupled with the use of technologies, as well as the territorial network and the links with training, have allowed us to build transversal research that puts people, their needs and their uses at the center of its questions and addresses the technological angle through these contributions.

Its research is organized according to two interdisciplinary scientific themes and two application areas.

  • Theme 1 « Learning and Innovation » is mainly concerned with Cognitive Sciences, Social Sciences and Management Sciences, Training Sciences and Techniques and Innovation Sciences. The main scientific objectives of this theme are to understand the effects of the environment, and more particularly of situations instrumented by technical objects (platforms, prototyping workshops, immersive systems, etc.) on the learning, creativity and innovation processes.
  • Theme 2 « Engineering and Digital Tools » is mainly concerned with Digital Sciences and Engineering. The main scientific objectives of this theme concern the modeling, simulation, optimization and data analysis of industrial or urban systems. The research work also focuses on the associated decision support tools and on the study of digital twins coupled with virtual or augmented environments.

These two themes develop and cross their research in the two application areas of the Industry of the Future and the City of the Future, supported by research platforms, mainly the one in Rouen dedicated to the Factory of the Future and the one in Nanterre dedicated to the Factory and Building of the Future.

CESI LINEACT RESEARCH THEME:

Human-System Interactions (HSI)

https://lineact.cesi.fr/en/engineering-and-numerical-tools/thematics/human-system-interactions-theme

Your application must include:

  • A detailed curriculum vitae.
  • A cover letter explaining why the candidate is interested in this internship.
  • Master 1 and 2 transcripts (to be adapted to the level of the internship)
  • Recommendation letters if available
  • Any other documents you consider useful, such as project reports, publications, datasets, or code related to this internship topic.

Please send all documents in one file.

Your skills:

Scientific and technical skills:

1. Master's research student or final-year engineering school student in Computer Science

2. Python programming skills and experience with standard Computer Vision and Machine/Deep Learning libraries

3. Basics of Machine Learning and Deep Learning: neural networks, GANs, Transformers, and Mixture-of-Experts

4. Skills in Machine/Deep Learning frameworks: PyTorch, Keras, TensorFlow

5. Experience with Computer Vision applications: image classification, action recognition, etc.

6. Practical skills with the expected tools, e.g., Python with Google Colab, JupyterLab/Notebook, Weights & Biases (wandb), etc.

7. Ability to write a Master's report

8. Fluency in English, in order to write an international peer-reviewed conference paper or an indexed journal paper with an impact factor

Interpersonal skills:

  • Be autonomous, with a spirit of initiative and curiosity,
  • Know how to work in a team and have good interpersonal skills,
  • Be rigorous.

References

[1] A. Ali, E. Pinyoanuntapong, P. Wang, and M. Dorodchi, “Skeleton-based Human Action Recognition via Convolutional Neural Networks (CNN),” Jan. 30, 2023, arXiv: arXiv:2301.13360. Accessed: Sep. 09, 2024. [Online]. Available: http://arxiv.org/abs/2301.13360

[2] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, “Revisiting Skeleton-based Action Recognition,” Apr. 02, 2022, arXiv: arXiv:2104.13586. Accessed: Oct. 10, 2024. [Online]. Available: http://arxiv.org/abs/2104.13586

[3] K. Fukushi, Y. Nozaki, K. Nishihara, and K. Nakahara, “Few-shot generative model for skeleton-based human action synthesis using cross-domain adversarial learning,” in 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA: IEEE, Jan. 2024, pp. 3934–3943. doi: 10.1109/WACV57701.2024.00390.

[4] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10, pp. 2684–2701, Oct. 2020, doi: 10.1109/TPAMI.2019.2916873.

[5] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis,” Apr. 11, 2016, arXiv: arXiv:1604.02808. Accessed: Sep. 20, 2024. [Online]. Available: http://arxiv.org/abs/1604.02808

[6] L. Wang, J. Liu, and P. Koniusz, “3D Skeleton-based Few-shot Action Recognition with JEANIE is not so Naïve,” Dec. 23, 2021, arXiv: arXiv:2112.12668. Accessed: Nov. 20, 2024. [Online]. Available: http://arxiv.org/abs/2112.12668

[7] L. Xu, Q. Wang, X. Lin, and L. Yuan, “An Efficient Framework for Few-shot Skeleton-based Temporal Action Segmentation,” Jul. 20, 2022, arXiv: arXiv:2207.09925. Accessed: Nov. 20, 2024. [Online]. Available: http://arxiv.org/abs/2207.09925

[8] M. Thukral, H. Haresamudram, and T. Ploetz, “Cross-Domain HAR: Few Shot Transfer Learning for Human Activity Recognition,” Oct. 22, 2023, arXiv: arXiv:2310.14390. Accessed: Nov. 20, 2024. [Online]. Available: http://arxiv.org/abs/2310.14390

[9] A. Alboody and R. Slama, “Graph Transformer Mixture-of-Experts (GTMoE) for 3D Hand Gesture Recognition,” in Intelligent Systems and Applications, K. Arai, Ed., Lecture Notes in Networks and Systems, vol. 1067. Cham: Springer Nature Switzerland, 2024, pp. 317–336. doi: 10.1007/978-3-031-66431-1_21.

[10] A. Alboody and R. Slama, “EPT-MoE: Toward Efficient Parallel Transformers with Mixture-of-Experts for 3D Hand Gesture Recognition,” presented at the 10th World Congress on Electrical Engineering and Computer Systems and Science, Aug. 2024. doi: 10.11159/mvml24.105.

[11] T. Chen et al., “AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France: IEEE, Oct. 2023, pp. 17300–17311. doi: 10.1109/ICCV51070.2023.01591.

[12] C. Plizzari, M. Cannici, and M. Matteucci, “Spatial Temporal Transformer Network for Skeleton-Based Action Recognition,” in Pattern Recognition. ICPR International Workshops and Challenges, A. Del Bimbo, R. Cucchiara, S. Sclaroff, G. M. Farinella, T. Mei, M. Bertini, H. J. Escalante, and R. Vezzani, Eds., Lecture Notes in Computer Science, vol. 12663. Cham: Springer International Publishing, 2021, pp. 694–701. doi: 10.1007/978-3-030-68796-0_50.

[13] X. Wang, X. Wang, B. Jiang, and B. Luo, “Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification,” Aug. 26, 2022, arXiv: arXiv:2208.12398. Accessed: Nov. 20, 2024. [Online]. Available: http://arxiv.org/abs/2208.12398

https://cesirh.talentview.io/jobs/by2qvh?utm_source=mail
