Meeting


Towards Pragmatic Learning with Limited Labeled Visual Data (2nd edition)

Date: 6 June 2024
Venue: MSH Paris Nord
20, avenue George Sand - 93210 La Plaine Saint-Denis
Amphitheater

Scientific themes:

Please note that, to guarantee all registered participants access to the meeting rooms, registration for meetings is free but mandatory.

Register for the meeting.

Registrations

35 members of the GdR ISIS and 36 non-members are registered for this meeting.

Room capacity: 70 people.

Announcement

Deep neural networks have delivered many impressive results, particularly on computer vision tasks such as image or scene classification and object detection. However, for these networks to learn effectively, for example to recognize visual categories, thousands of training examples per target category must be collected and labeled by hand. Moreover, applying optimization algorithms, which are often iterative, incurs a considerable computational cost, requiring hundreds or even thousands of GPU hours. Furthermore, extending the model to new categories requires collecting new training data and rerunning the training procedure. Faced with these significant requirements, some researchers are exploring research directions that are more economical in training data. The underlying idea is to draw inspiration from the human visual system, which can easily assimilate new concepts from only a few examples and perform the task reliably. Reproducing this behavior in learning-based artificial vision systems is one of the central goals of current research, particularly in the context of real-world vision.
This day, organized jointly with members of the FR Math-STIC (Université Sorbonne Paris Nord), is the second edition of the event. It aims to review the state of progress in learning-based artificial vision with very little labeled data on computer vision tasks.

Five researchers from academia and industry have been invited to present their recent scientific contributions.

Invited speakers:

Frédéric JURIE, GREYC-UMR6072, Université de Caen Normandie

Hervé LE BORGNE, CEA LIST, Saclay

Stéphane HERBIN, DTIS, ONERA, Université Paris Saclay

Hichem SAHBI, LIP6, Sorbonne Université

Tuan-Hung VU, VALEO

Call for contributions:

A call for contributions is open on the topic of learning with few labeled examples. Researchers wishing to present their contributions are invited to submit an abstract (2 pages maximum) to the organizers before 30 April 2024.

A poster session will also be organized. Researchers are invited to submit an abstract (2 pages maximum) to the organizers before 20 May 2024. This session will give PhD students and second-year master's interns the opportunity to present and discuss their work.

Organizers:

Anissa MOKRAOUI (L2TI, FR Math-STIC, Université Sorbonne Paris Nord), anissa.mokraoui@univ-paris13.fr

Fangchen FENG (L2TI, FR Math-STIC, Université Sorbonne Paris Nord), fangchen.feng@univ-paris13.fr

Program

9:15 (15 minutes)
Welcome and introduction to the day
By Anissa MOKRAOUI, Fangchen FENG

9:30-10:10 (40 minutes)
Label-Efficient Machine Learning for Visual Recognition
Hichem SAHBI, LIP6, Sorbonne Université

10:10-10:40 (30 minutes)
Bridging Domains with Minimal Supervision: Domain Adaptation and Generalization for Semantic Segmentation
Mohammed-Yasser BENIGMIM, LTCI, Télécom Paris, Institut Polytechnique de Paris,
LIX, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris

10:40-11:00: Coffee break -- Poster session

11:00-11:40 (40 minutes)
Learning with Limited Labeled Visual Data: From Few-Shot Methods to Foundation Models
Frédéric JURIE, GREYC-UMR6072, Université de Caen Normandie
Stéphane HERBIN, DTIS, ONERA, Université Paris Saclay

11:40-12:10 (30 minutes)
Domain Adaptation of Echocardiographic Segmentations via Reinforcement Learning
Thierry JUDGE, INSA Lyon, Université Claude Bernard Lyon 1, CNRS UMR5220, Inserm U1294, CREATIS, Villeurbanne, France

12:10-14:00: Lunch break -- Poster session

14:00-14:40 (40 minutes)
Semantic Generative Augmentations for Few-Shot Counting
Hervé LE BORGNE, CEA LIST, Saclay
Perla DOUBINSKY, CEA LIST, Saclay

14:40-15:10 (30 minutes)
Towards Foundation Models for Satellite Image Time Series
Iris DUMEUR, CESBIO, Université de Toulouse, CNES/CNRS/INRAe/IRD/UT3

15:10-15:20: Coffee break -- Poster session

15:20-16:00 (40 minutes)
Language-guided Adaptation and Generalization in Semantic Segmentation
Tuan-Hung VU, VALEO

16:00-16:30 (30 minutes)
Indirect-attention: SS-DETR for one shot object detection
Bissmella BAHADURI, LabCom IRISER, L2TI, Université Sorbonne Paris Nord

16:30-17:00 (30 minutes)
Discussion & closing of the day
By Anissa MOKRAOUI, Fangchen FENG

Abstracts

Oral session

Label-Efficient Machine Learning for Visual Recognition
Hichem SAHBI, LIP6, Sorbonne Université
Most existing machine learning (ML) models, particularly deep neural networks, rely on large datasets whose hand-labeling is expensive and time-consuming. A current trend is to make ML frugal and less label-dependent. Among the existing solutions, self-supervised and active learning are currently attracting major interest; their purpose is to train ML models without labeled data, or with only the most informative labeled data. This presentation discusses progress in label-efficient ML for visual recognition. The first part of the talk introduces a novel active learning method that seeks to minimize the hand-labeling effort. The method is probabilistic and based on the optimization of a constrained objective function that mixes diversity, representativity and uncertainty of data. The proposed approach unifies all these criteria in a single objective function, using a stateless reinforcement learning algorithm, that measures the relevance of data (i.e., how critical they are) when training ML models. The second part of the talk introduces a novel adversarial scheme that frugally labels the data that most challenge the learning models, which ultimately leads to a better re-estimation of these models in subsequent active learning iterations. Finally, the third part of the talk discusses some progress in self-supervised learning that pushes frugality further by making training totally label-free. The applicability of all these methods is shown through different visual recognition tasks including image classification, change detection and video analysis.
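As a point of reference for the uncertainty criterion mentioned above, here is a minimal sketch of plain entropy-based uncertainty sampling, a standard active learning baseline. It is not the method presented in the talk, which combines diversity, representativity and uncertainty in a single constrained objective; the function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_for_labeling(predictions, budget):
    """Return the indices of the `budget` samples whose predicted class
    distribution has the highest entropy, i.e. the most uncertain ones."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical softmax outputs of a classifier on four unlabeled images.
unlabeled_predictions = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform: very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

# Only these two images would be sent to a human annotator.
chosen = select_for_labeling(unlabeled_predictions, budget=2)
```

Uncertainty alone tends to pick redundant samples, which is one reason to combine it with diversity and representativity criteria as described in the abstract.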

Bridging Domains with Minimal Supervision: Domain Adaptation and Generalization for Semantic Segmentation
Mohammed-Yasser BENIGMIM, LTCI, Télécom Paris, Institut Polytechnique de Paris,
LIX, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris
Responding to the need for label-efficient models in computer vision, our research explores methods to enhance the adaptability of semantic segmentation models under conditions where real-life data is scarce or entirely unavailable. These methods utilize generative modeling and domain adaptation techniques to minimize reliance on hard-to-collect real-world datasets. In the first work, we developed a method that leverages a single sample from the target domain, utilizing text-to-image diffusion models to generate high-quality images that closely match the specific style and context of the target domain. This approach involves training the model in a self-supervised fashion on these generated images, which induces a more robust model capable of adapting effectively to new domains. Our second work addresses a more complex challenge where target data is not available at all. For this, we employ a collaborative system of existing foundation models, already pre-trained on vast amounts of data. This enables content diversification and robust feature representation of images, using the power of generative models and strong feature extractors trained contrastively on huge amounts of data. We further refine our predictions on the generated images using another pre-trained foundation model known for its sophisticated prediction refinement capabilities, enhancing the reliability of our domain generalization approach. By harnessing these advanced foundation models, we contribute to the deployment of more adaptable and efficient visual recognition systems, capable of operating reliably in dynamic and data-scarce scenarios.

Learning with Limited Labeled Visual Data: From Few-Shot Methods to Foundation Models
Frédéric JURIE, GREYC-UMR6072, Université de Caen Normandie
Stéphane HERBIN, DTIS, ONERA, Université Paris Saclay
In this presentation, we will look at recent advances in the Machine Learning techniques used in Computer Vision in settings where training data is limited or absent. Machine Learning / Deep Learning has played an essential role in recent progress in Computer Vision, but at the cost of an ever-growing need for training data. Yet in many application domains, collecting such data remains a major obstacle because of cost or availability constraints.
To overcome this limitation, various approaches have been proposed, notably few-shot learning, zero-shot learning and domain adaptation techniques. We will examine these techniques in detail, highlighting their fundamental principles and illustrating their application through concrete examples drawn from our research.
This discussion will lead us to more recent concepts, in particular the use of so-called "foundation models", which open up new prospects for Computer Vision applications in conditions where training data is scarce or nonexistent. We will explore the implications and possibilities offered by these models, highlighting the remaining challenges and promising research directions.

Domain Adaptation of Echocardiographic Segmentations via Reinforcement Learning
Thierry JUDGE, INSA Lyon, Université Claude Bernard Lyon 1, CNRS UMR5220, Inserm U1294, CREATIS, Villeurbanne, France
Université de Sherbrooke, Sherbrooke, QC, Canada
Neural networks deliver excellent segmentation performance on echocardiographic images when trained on a sufficiently large number of images. However, their performance drops significantly when they are faced with images from different acquisition protocols. For this reason, several domain adaptation methods have been developed to exploit unlabeled data. However, these methods generally do not consider a prior shape model and can therefore produce invalid segmentations. We present RL4Seg, an innovative domain adaptation method for segmentation based on reinforcement learning. Inspired by the training of ChatGPT, it produces segmentations that agree well with anatomical conformity metrics and optimizes a reward network that can serve as a reliable uncertainty estimator. Its performance surpasses the state of the art, notably reaching 99% anatomical validity.

Semantic Generative Augmentations for Few-Shot Counting
Hervé LE BORGNE, Perla DOUBINSKY, CEA LIST, Saclay
Recent advancements in generative modeling for visual content have led to the synthesis of diverse high-resolution and high-quality images. For applications such as image editing and data augmentation, controlling the semantic properties of the generated images is also highly desirable. The work presented in this talk was developed in the more general context of a thesis that focuses on enhancing control over the generated content and also explores how to exploit the control to synthesize effective and diversified training data.
In the work presented, we specifically investigate the use of large pre-trained models conditioned on text to synthesize training datasets. However, we find that text control alone may be insufficient for tasks requiring compositionality. To address this, we propose adding a task-specific conditioning to generate precise augmentations suitable for supervised learning. Additionally, we introduce a strategy to diversify augmentations by utilizing both task-specific and text conditioning, prompting the generative model with novel but plausible pairs. We apply this method to the task of few-shot class-agnostic object counting and demonstrate improvements in the counting network's performance.

Towards Foundation Models for Satellite Image Time Series
Iris DUMEUR, CESBIO, Université de Toulouse, CNES/CNRS/INRAe/IRD/UT3
Following the many recent launches of Earth observation satellites, multi-year satellite image time series offering wide geographic coverage are now accessible. These data contain crucial information for various Earth monitoring tasks such as crop management, land cover classification and the study of climate change. However, these applications often face a lack of labeled data, which hinders the development of learning methods applicable at large scale. We therefore propose a multi-view self-supervised training method suited to satellite image time series (SITS). In particular, this method combines a cross-reconstruction task with loss functions in the latent space. In addition, the proposed neural network generates a fixed-size, aligned latent representation of SITS, which are by default irregular and unaligned. The quality of the pre-training was evaluated on 3 different downstream tasks: land cover, crop segmentation and change detection. In the configuration where the pre-trained model is frozen, our method outperforms current competitive methods on SITS [1, 2]. Finally, in an experiment simulating a lack of labeled data for crop segmentation, our results indicate that pre-training significantly improves crop classification performance. These results underline the importance of self-supervised pre-training for tasks with little labeled data.
References
[1] Iris Dumeur, Silvia Valero, and Jordi Inglada. Self-supervised spatio-temporal representation learning of satellite image time series. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pages 1-18, 2024.
[2] Gabriel Tseng, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Rae Kerner. Lightweight, pre-trained transformers for remote sensing timeseries. ArXiv, abs/2304.14065, 2023.

Language-guided Adaptation and Generalization in Semantic Segmentation
Tuan-Hung VU, VALEO
A critical challenge in deploying semantic segmentation models in the open world is the distribution shift between training and testing environments. Domain adaptation and domain generalization are research fields focused on enhancing testing performance in target domains, whether known or unknown. In this study, we revisit both the adaptation and the generalization problems by leveraging recent vision-language models such as CLIP. In our first work, we exploit CLIP's latent space and propose a simple and effective feature stylization mechanism that converts source-domain features into target-domain ones simply via language prompting. Fine-tuning the segmentation model on these zero-shot synthesized features helps mitigate the distribution gap between the source and target domains, thus improving performance on targets. In our second work, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: i) preservation of the intrinsic CLIP robustness through minimal fine-tuning, ii) language-driven local style augmentation, and iii) randomization by locally mixing the source and augmented styles during training. Through this line of research, we demonstrate the significant potential of harnessing the textual modality to improve the robustness of vision systems.

Indirect-attention: SS-DETR for one shot object detection
Bissmella BAHADURI, LabCom IRISER, L2TI, Université Sorbonne Paris Nord

One-shot object detection presents a significant challenge, requiring the identification of objects within a target image using only a single sample image of the object class as query image. Attention-based methodologies have garnered considerable attention in the field of object detection. Specifically, the cross-attention module, as seen in DETR, plays a pivotal role in exploiting the relationships between object queries and image features. However, in the context of DETR networks for one-shot object detection, the intricate interplay among target image features, query image features, and object queries must be carefully considered. In this study, we propose a novel module termed "indirect attention." By harnessing the adaptive nature of attention mechanisms, we illustrate that relationships among target image features, query image features, and object queries can be effectively captured in a more concise manner compared to cross-attention. Furthermore, we introduce a pre-training pipeline tailored specifically for one-shot object detection, addressing three primary objectives: identifying objects of interest, class differentiation, and object detection based on a given query image. Our experimental findings demonstrate that the proposed IA-DETR (Indirect-Attention DETR) significantly outperforms state-of-the-art one-shot object detection methods on both the Pascal VOC and COCO benchmarks.

Poster session

EQUIMOD: An EQUIvariance MODule to Improve Visual Instance Discrimination

Alexandre DEVILLIERS & Mathieu LEFORT
Univ Lyon, UCBL, CNRS, INSA Lyon, LIRIS, UMR5205, F-69622 Villeurbanne
Recent self-supervised visual representation methods are closing the gap with supervised learning performance. Most of these successful methods rely on learning invariance to augmentations; however, this limits generalization, as augmentation-related information may be essential for some downstream tasks. A few recent works have proposed to mitigate this problem of using only an invariance task by exploring some form of equivariance to augmentations, yet in an uncontrolled way. In this work, we introduce EquiMod, a generic equivariance module that structures the learned latent space, in the sense that our module learns to predict the displacement in the embedding space caused by the augmentations. Applying that module to SOTA invariance models, such as BYOL and SimCLR, improves performance on the usual CIFAR10 and ImageNet datasets.

Text Aided Domain Adaptation
Louis HEMADOU
Safran
This paper addresses the challenge of zero-shot domain adaptation in the context of image classification. The focus is on developing a model that remains resilient to shifts in the test distribution without relying on training samples from the target domain. While acquiring target domain images can be tedious and costly, we leverage verbal descriptions to capture the differences between target and training domains. Utilizing large pre-trained text-image models like CLIP, we introduce TADA, an augmentation method that adapts arbitrary image features from source to target domains using textual prompts as training data, without requiring target domain images. Our method outperforms robust fine-tuning techniques on several domain generalization benchmarks, achieving state-of-the-art performance on the DomainNet dataset.

Reinterpreting Confidence Intervals in Few-Shot Learning
Raphael LAFARGUE
IMT Atlantique, Lab-STICC, UMR CNRS 6285, F-29238, France,
AIML, University of Adelaide, IRL Crossing, Adelaide
The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e., allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the size of the CI by strategically sampling tasks of a specific size. We show theoretically and empirically that there exists an optimal number of tasks such that a data budget is optimally used. We also introduce a new optimized benchmark.
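To make the paired-test remark concrete, here is a minimal sketch that computes a 95% confidence interval over per-task accuracies for two methods, and then over their per-task differences. The accuracies are made-up numbers, `mean_ci95` is a hypothetical helper, and the interval uses the normal approximation (with so few tasks a Student t quantile would be more appropriate). It does not reproduce the with/without replacement analysis of the poster.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and half-width of a 95% confidence interval, using the normal
    approximation 1.96 * s / sqrt(n), with s the sample standard deviation."""
    n = len(values)
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width

# Hypothetical per-task accuracies of two methods evaluated on the SAME tasks.
acc_a = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59, 0.74, 0.68]
acc_b = [0.60, 0.68, 0.54, 0.77, 0.63, 0.58, 0.70, 0.66]

mean_a, hw_a = mean_ci95(acc_a)
mean_b, hw_b = mean_ci95(acc_b)

# Paired comparison: build the interval on the per-task differences, which
# removes the task-to-task difficulty variation shared by both methods.
diffs = [a - b for a, b in zip(acc_a, acc_b)]
mean_d, hw_d = mean_ci95(diffs)

# The two separate intervals overlap, yet the paired interval excludes zero.
unpaired_overlap = (mean_a - hw_a) <= (mean_b + hw_b)
paired_significant = (mean_d - hw_d) > 0.0
```

On this toy data the separate intervals overlap, while the interval on the differences stays above zero, which is the kind of situation where a paired comparison changes the conclusion.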

A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models
Reda BENSAID
IMT Atlantique, Lab-STICC UMR CNRS 6285, F-29238, France,
Polytechnique Montréal, Département de génie électrique, Canada
In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models (DINO V2, Segment Anything, CLIP and Masked AutoEncoders) and of a straightforward ResNet50 pre-trained on the COCO dataset. We also include 5 adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms other models by a large margin, across various datasets and adaptation methods. On the other hand, the choice of adaptation method makes little difference to the results, suggesting that simple linear probing can compete with advanced, more computationally intensive alternatives.

Inferring Latent Class Statistics from Text for Robust Visual Few-Shot Learning
Yassir BENDOU
IMT Atlantique, Technopole Brest Iroise, France
In the realm of few-shot learning, foundation models like CLIP have proven effective but exhibit limitations in cross-domain robustness, especially in few-shot settings. Recent works add text as an extra modality to enhance the performance of these models. Most of these approaches treat text as an auxiliary modality without fully exploring its potential to elucidate the underlying distribution of visual features for each class. In this paper, we present a novel approach that leverages text-derived statistics to predict the mean and covariance of the visual feature distribution for each class. This predictive framework enriches the latent space, yielding more robust and generalizable few-shot learning models. We demonstrate the efficacy of incorporating both mean and covariance statistics in improving few-shot classification performance across various datasets. Our method shows that text can be used to predict the mean and covariance of the distribution, offering promising improvements in few-shot learning scenarios.
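To illustrate why a per-class covariance can matter in addition to the mean, here is a minimal sketch of a Gaussian classifier that assigns a feature vector to the class with the highest log-likelihood. The abstract does not state how the predicted statistics are used at inference, so this is only one plausible use; the diagonal covariance, the function names and the toy numbers are all assumptions made for the example.

```python
import math

def diag_gaussian_log_likelihood(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance `var`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def classify(x, class_stats):
    """Pick the class whose Gaussian gives x the highest log-likelihood."""
    return max(class_stats,
               key=lambda c: diag_gaussian_log_likelihood(x, *class_stats[c]))

# Hypothetical (mean, per-dimension variance) pairs; in the setting described
# above they would be predicted from text rather than typed in by hand.
class_stats = {
    "cat": ([0.0, 0.0], [1.0, 1.0]),
    "dog": ([3.0, 0.0], [4.0, 0.25]),
}

query = [1.4, 0.0]

# Decision using both mean and variance.
predicted = classify(query, class_stats)

# Decision using the class means only (nearest mean in Euclidean distance).
nearest_mean = min(class_stats,
                   key=lambda c: math.dist(query, class_stats[c][0]))
```

In this toy case the query is closer to the "cat" mean, but the wide spread of "dog" along the first dimension makes "dog" the more likely class, so the two decision rules disagree.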