We remind you that, in order to guarantee access to the meeting rooms for all registrants, registration for meetings is free but mandatory.
22 GdR ISIS members and 10 non-members are registered for this meeting.
Room capacity: 90 people.
Visual attention: prediction and applications
Please note the new date: 23 May 2024.
In recent years, many studies have addressed visual attention, leading to a large number of new methods for extracting visual saliency and to its use in a variety of applications, either as a guidance tool (object extraction, tracking, ...) or as an intermediate step for focusing only on certain regions of interest (image quality, watermarking, compression, ...). Across these different studies, the relevance of using such models has been widely demonstrated.
The main theme we wish to develop is therefore visual saliency and its exploitation. For this one-day meeting, we are interested in methods for predicting visual attention across different modalities (image, medical, stereo or 3D) as well as in its many and varied applications, with no restriction on the application domain.
We are issuing a call for contributions on this theme of visual attention. Researchers and PhD students wishing to present their work are invited to send their proposal (title and abstract, limited to 1 page) by e-mail to the organisers before 10 May 2024 (deadline extended from 30 April 2024). E-mails: meriem.outtas@insa-rennes.fr; lu.zhang@insa-rennes.fr; aladine.chetouani@univ-orleans.fr.
Prof. Patrick Le Callet, "Neuroscience vs. computer vision: a cross-perspective on visual attention",
Laboratoire des Sciences du Numérique de Nantes (LS2N - CNRS), École Centrale de Nantes / Nantes Université, member of the Institut Universitaire de France (IUF), France.
Prof. Jean-Marc Odobez, "Exploring Visual Attention: Methods for Analyzing Gaze Cues in Everyday Contexts",
IDIAP Research Institute (Perception and Activity Understanding group) & École Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
Meriem Outtas, INSA Rennes, IETR - CNRS UMR 6164
Lu Zhang, INSA Rennes, IETR - CNRS UMR 6164
Aladine Chetouani, Laboratoire PRISME, Polytech Orléans, Université d'Orléans
Date: 23 May 2024
Venue: INSA Rennes, Amphi Bonnin (capacity: 90 people).
9h00 Welcome coffee
9h15 (15 minutes) Welcome and presentation of the day
9h30-9h50 (20 minutes)
No-reference 3D point cloud quality assessment using 3D visual saliency.
Salima BOURBIA, LRIT, Univ. Mohammed V (Morocco), L@bISEN (Nantes), PRISME (Univ. Orléans).
9h50-10h10 (20 minutes)
HMM representation of scan paths reveals consistently more and less efficient visual strategies for task recognition.
Tahri Badr, LS2N - Commissariat à l'énergie atomique et aux énergies alternatives (CEA)
10h10-11h20 (70 minutes)
Invited talk: Neuroscience vs. computer vision: a cross-perspective on visual attention
Prof. Patrick Le Callet, Laboratoire des Sciences du Numérique de Nantes (LS2N - CNRS), École Centrale de Nantes / Nantes Université, France.
11h20-13h30 Lunch break and discussion.
14h00-15h10 (70 minutes) Remote
Invited talk: Exploring Visual Attention: Methods for Analyzing Gaze Cues in Everyday Contexts
Prof. Jean-Marc Odobez, IDIAP Research Institute (Perception and Activity Understanding group) & École Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
15h10-15h30 (20 minutes) Remote
Guided Dynamic Inference for model compression
Karim HAROUN, Institut List, Centre Paris-Saclay | Site Nano-INNOV
15h30-16h00 (30 minutes) Remote
Efficient visual search using retinotopic convolutional neural networks
Emmanuel DAUCE, Institut de Neurosciences de la Timone
Over the last few years, 3D point clouds (3DPC) have seen rapid growth in various computer vision domains, which has led to an increased demand for effective approaches for automatically assessing the quality of 3D point clouds. In this study, we present a new deep learning approach for no-reference point cloud quality assessment, aiming to predict the visual quality of 3DPCs without resorting to reference content (the original, undistorted 3DPCs). Our method integrates 3D visual saliency into the assessment process, exploiting its ability to identify the visually significant areas that immediately attract human attention. We assume that distortions in these salient areas are more likely to affect perceived quality. To support this hypothesis, we trained our model with and without the saliency-map weighting step. The results show that the model provides better predictions when it uses the saliency maps. We also compared the performance of our model with the state of the art. The results show that the proposed model exhibits a strong correlation with subjective human ratings, outperforming state-of-the-art methods, including full-reference (FR), reduced-reference (RR) and no-reference (NR) ones.
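As an illustration of the saliency-weighting idea described in this abstract, the short Python sketch below (not the authors' actual architecture; the layer sizes, names and patch representation are hypothetical) pools per-patch features with 3D saliency weights before regressing a single quality score:

# Minimal sketch (assumption: not the authors' model) of saliency-weighted pooling
# of per-patch point-cloud features for no-reference quality prediction.
import torch
import torch.nn as nn

class SaliencyWeightedPCQA(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Hypothetical per-patch encoder: each patch is N points with (x, y, z) coordinates.
        self.patch_encoder = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        self.regressor = nn.Linear(feat_dim, 1)  # pooled features -> scalar quality score

    def forward(self, patches, saliency):
        # patches:  (B, P, N, 3)  P patches of N points each
        # saliency: (B, P)        one precomputed 3D saliency weight per patch
        feats = self.patch_encoder(patches).mean(dim=2)       # (B, P, feat_dim)
        w = torch.softmax(saliency, dim=1).unsqueeze(-1)      # normalised saliency weights
        pooled = (feats * w).sum(dim=1)                       # saliency-weighted pooling
        return self.regressor(pooled).squeeze(-1)             # (B,) predicted quality

model = SaliencyWeightedPCQA()
scores = model(torch.randn(2, 32, 128, 3), torch.rand(2, 32))   # toy input

In such a setup, the ablation mentioned in the abstract (with vs. without saliency weighting) would amount to replacing the softmax weights with a uniform average in the pooling step.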
Understanding human visual behaviour during videos that show humans performing a task lies at the crossroads of several fields, such as psychology and neurology, as well as imitation learning. Machine learning models tasked with predicting visual attention can be black boxes that are difficult to translate into an exploitable understanding of human behaviour. Hidden Markov models (HMMs) are statistical yet interpretable models, capable of capturing the tendencies of a large number of scan paths, clustering them for similar stimuli, co-clustering them across different stimuli for the same observers, and using the Viterbi algorithm to output the most probable hidden-state sequences. In this work, we adapted the approach for analysing eye movements using Hidden Markov Models (EMHMM) to static eye tracking of videos that show humans performing simple tasks on a conveyor belt. This adaptation sets the hidden states as regions of interest in the underlying scene, making the outputs of the Viterbi algorithm and the properties of the learned HMMs human-interpretable. We uncover, among observers, efficient and inefficient visual strategies for task recognition. Groups with similar strategies proved to be consistently more efficient across different stimuli. We use a measure of entropy to classify strategies as efficient active or inefficient passive. By applying HMMs to eye-tracking data, we reveal distinct patterns of visual exploration that correlate with task efficiency. These results can help understand cognitive behaviour and exploit eye-movement datasets to improve the performance and/or learning speed of machine learning models for task recognition. The generative property of HMMs allows us to augment the data in a targeted way, for example by increasing the weight of efficient active scan paths, in the hope that this yields better results for subsequent applications.
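For readers unfamiliar with this kind of pipeline, the sketch below (using the generic hmmlearn library on toy data, not the EMHMM toolbox itself) shows how a Gaussian HMM can be fitted to fixation sequences so that hidden states play the role of regions of interest, with Viterbi decoding and a transition-entropy measure along the lines of what the abstract describes:

# Minimal sketch (assumed data layout): hidden states as regions of interest,
# Viterbi decoding of a scan path, and per-state transition entropy.
import numpy as np
from hmmlearn import hmm

# Toy scan paths: each row is one fixation (x, y) in pixels; lengths marks path boundaries.
scanpaths = [np.random.rand(20, 2) * [1920, 1080] for _ in range(5)]
X = np.concatenate(scanpaths)
lengths = [len(s) for s in scanpaths]

model = hmm.GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
model.fit(X, lengths)                    # EM estimation of ROIs (state means/covariances)

state_seq = model.predict(scanpaths[0])  # Viterbi: most probable hidden-state sequence
entropy = -np.sum(model.transmat_ * np.log(model.transmat_ + 1e-12), axis=1)
print(state_seq, entropy)                # outgoing-transition entropy per state (active vs. passive)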
Attention models have recently gained popularity in machine learning with Transformer architectures, dominating natural language processing (NLP) and challenging CNN-based neural network architectures in computer vision tasks. This is due to the self-attention mechanism, i.e., the building block of Transformers, which assigns importance weights to different regions of the input sequence, enabling the model to focus on the information relevant to each prediction. Recent work leverages the inherent attention mechanism of Transformers to reduce model complexity in image classification. Precisely, it uses attention to focus on the most important information in input images, which allows computation to be allocated only to these salient spatial locations. The motivation for compressing neural networks stems from their computational complexity, which is quadratic (O(N²)) in the case of Transformers, where N is the number of input tokens, and from their memory requirements, which hinder their efficiency in terms of energy consumption. In our case, we explore a novel approach named dynamic compression, which aims to reduce complexity during inference by dynamically allocating resources to each input sample. Through a preliminary study, we observed that Transformers exhibit a suboptimal partitioning of images into tokens: small models (fewer tokens) classify a set of images better than bigger models (more tokens) on the image classification task.
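The attention-driven allocation described above can be illustrated with a purely hypothetical sketch: given the attention weights of the CLS token in a ViT-like model, only the top-k most salient patch tokens are kept for the remaining layers at inference time.

# Minimal sketch (illustrative only, not the speaker's method): attention-based token pruning.
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    # tokens:   (B, 1 + N, D)  CLS token followed by N patch tokens
    # cls_attn: (B, N)         attention weight of the CLS token towards each patch token
    B, N1, D = tokens.shape
    k = max(1, int((N1 - 1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices + 1          # +1 to skip the CLS position
    idx = torch.cat([torch.zeros(B, 1, dtype=torch.long), idx], dim=1)  # always keep CLS
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

tokens = torch.randn(2, 197, 768)                      # e.g. ViT-B/16 on a 224x224 input
cls_attn = torch.rand(2, 196)
print(prune_tokens(tokens, cls_attn).shape)            # (2, 99, 768): 1 CLS + 98 kept tokens

In a dynamic-compression setting, keep_ratio could itself be chosen per input sample rather than fixed, which is the "dynamic" part of the approach.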
Foveal vision, a trait shared by many animals including primates, makes an important contribution to the performance of the visual system. However, it has been largely overlooked in machine learning applications. This study investigates whether retinotopic mapping, an essential component of foveal vision, can improve image categorisation and localisation performance when integrated with deep convolutional neural networks (CNNs). A retinotopic map is used as input to standard convolutional neural networks. These networks are then retrained on a visual classification task using the ImageNet dataset. Surprisingly, despite the loss of information due to foveal deformation, our retrained networks show classification performance similar to the state of the art. In addition, during the test phase, the network showed an increased ability to detect regions of interest in the image corresponding to the object predicted by the classifier. In other words, the network can analyse the entire image to find the position that best corresponds to the object being searched for. This visual search mechanism, a typical feature of the human visual system, is absent in typical CNNs. These results suggest that retinotopic mapping may play a key role in the design of visual recognition algorithms, whose robustness to zoom and rotation may be valuable in an open environment.
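One common way to obtain such a retinotopic input is a log-polar transform centred on the image (assumed here; the authors' exact mapping may differ). The sketch below remaps an image with OpenCV and feeds it to an unmodified CNN:

# Minimal sketch: log-polar ("retinotopic-like") remapping as CNN input.
# Resolution is high near the centre (the "fovea") and coarse in the periphery.
import cv2
import numpy as np
import torch
from torchvision.models import resnet18

img = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)   # stand-in for a real image
h, w = img.shape[:2]
retino = cv2.warpPolar(img, (224, 224), (w / 2, h / 2),
                       maxRadius=min(h, w) / 2,
                       flags=cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)

x = torch.from_numpy(retino).permute(2, 0, 1).float().unsqueeze(0) / 255.0
net = resnet18(weights=None)        # untrained ResNet-18 (weights=None needs torchvision >= 0.13)
logits = net(x)                     # the network itself is unchanged; only the input is remapped
print(logits.shape)                 # torch.Size([1, 1000])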
Beyond words, non-verbal behaviors (NVB) are known to play important roles in face-to-face interactions. However, decoding NVB is a challenging problem that involves both extracting subtle physical NVB cues and mapping them to higher-level communication behaviors or social constructs. Gaze, in particular, serves as a fundamental indicator of attention and interest, influencing communication and social signaling across various domains such as human-computer interaction, robotics, and medical diagnosis, notably in Autism Spectrum Disorders (ASD) assessment.
However, estimating others' visual attention, encompassing their gaze and Visual Focus of Attention (VFOA), remains highly challenging, even for humans. It requires not only inferring accurate 3D gaze directions but also understanding the contextual scene to discern which object in the field of view is actually being looked at. Context can include people's activities, which can provide priors about which objects are looked at, or the scene structure, which helps detect obstructions in the line of sight. Recent research has pursued two avenues to address this: the first focuses on improving appearance-based 3D gaze estimation from images and videos, while the second investigates gaze following, the task of inferring where a person looks in an image.
This presentation will explore ideas and methodologies addressing both challenges. Initially, it delves into advancements in 3D gaze estimation, including personalized model construction via few-shot learning and gaze redirection eye synthesis, differential gaze estimation, and leveraging social interaction priors for model adaptation. Subsequently, recent models for estimating gaze targets in real-world settings are introduced, including the inference of social labels like eye contact and shared attention.
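As a rough illustration of the gaze-following task mentioned above, the sketch below (a generic two-branch model, not the speaker's architecture) fuses a scene branch and a head-crop branch to predict a heatmap of where the person is looking:

# Minimal sketch (hypothetical architecture): scene + head-crop branches -> gaze-target heatmap.
import torch
import torch.nn as nn

class GazeFollowNet(nn.Module):
    def __init__(self):
        super().__init__()
        def branch():  # tiny convolutional encoder, purely illustrative
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.scene_enc = branch()   # full image: where could the person be looking?
        self.head_enc = branch()    # head crop: in which direction are they looking?
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),    # 1-channel gaze-target heatmap
        )

    def forward(self, scene, head):
        f = torch.cat([self.scene_enc(scene), self.head_enc(head)], dim=1)
        return self.decoder(f)      # (B, 1, H/4, W/4) heatmap over the scene

net = GazeFollowNet()
heatmap = net(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print(heatmap.shape)                # torch.Size([1, 1, 56, 56])

Social labels such as eye contact or shared attention can then be derived by comparing the predicted gaze targets of several people in the same scene.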