Meeting
Artificial Intelligence and Pattern Recognition in Remote Sensing
Scientific axes:
- Machine learning
Please note that, in order to guarantee access to the meeting rooms for all registrants, registration for meetings is free but mandatory.
The meeting will also be accessible remotely, but registration is mandatory.
Registration
48 GdR IASIS members and 87 non-members are registered for this meeting.
Room capacity: 50 people. In-person registrations: 54; remote registrations: 81.
-4 places remaining
Announcement
As climate change intensifies and extreme events become more frequent, the monitoring and understanding of Earth system processes have become increasingly critical. Earth Observation (EO) and Remote Sensing (RS) support these efforts, driven by the rapid increase in the number of satellite missions. This growth in the volume and diversity of EO data has enabled large-scale environmental analyses, but has also introduced major challenges in data representation, interpretation, and timely analysis. In this context, advanced pattern recognition and data-driven techniques have become essential to extract meaningful information from these data. The objective of this meeting is therefore to review ongoing advances in EO and RS data analysis. To this end, we aim to cover the following topics during the day:
- Deep learning for Earth observation data
- Foundation models for Earth observation data
- Vision and language models for Earth observation
- 3D reconstruction
- Semantic classification and parameter estimation from remote sensing data
- Active, interactive and transfer learning
- Multi-modal and multi-temporal analysis
- Feature extraction, selection, learning, and reduction
- Novel pattern recognition tasks in remote sensing applications
- Explainable and interpretable machine learning
- Hybrid models, combining physics and machine learning
- Benchmark datasets
Invited speakers:
- Nicolas Audebert (IGN, LASTIG, STRUDEL)
- Title: A tour of generative models for remote sensing
- Abstract: Generative models are an untapped opportunity for remote sensing and Earth observation. They are versatile and unsupervised models that can produce an approximation of the data distribution. This makes them an effective tool for a large range of applications, from super-resolution to anomaly detection and domain adaptation. This talk will give an overview of modern classes of generative models, how they can be leveraged for Earth observation, and propose a roadmap for future research in « deep remote sensing ».
- Flora Weissgerber (ONERA, DTIS, SAPIA)
- Title: Weakly supervised learning for the monitoring of the cryosphere
- Abstract: The cryosphere changes rapidly under climate change forcing. On top of hugely impacting local communities, these changes drastically affect the future climate through feedback loops. In the Alps, both snow cover duration and snow depth are decreasing, limiting access to fresh water and the capacity to produce hydroelectricity. At the poles, sea ice is becoming thinner and more fragile. This impacts wildlife and local sea-ice journeys, as well as decreasing the albedo of the ocean and accelerating its warming.
Image processing techniques, and in particular deep learning, can expand the monitoring of these environments. Despite the large number of images available thanks to ambitious Earth observation programs, manual labels are generally very sparse. In this presentation, I will show how these sparse labels, different sensors, simulation, and foundation models can be assembled, through a good understanding of the physical properties of the imaged object, to design weakly supervised deep learning techniques that overcome the limitations imposed by the label shortage. First, I will present a weakly supervised deep learning algorithm to monitor seasonal snow combining both SAR and optical images. Then, I will present how sea ice can be mapped and its drift measured, combining SAR, optical images, and nadir altimetry through dedicated weakly supervised deep learning algorithms.
Down the line, we hope that these algorithms will help to improve the modeling of the future climate and that they could help local communities to adapt to climate change if they were transformed into accessible products.
Call for contributions:
We welcome contributions on these topics. We encourage presentations in English to accommodate international researchers, but do not require it. Researchers and doctoral students wishing to present their work are invited to send their proposal (title and abstract, limited to one page) by email by March 2, 2026, to:
- Charlotte Pelletier: charlotte.pelletier@univ-ubs.fr
- Sylvain Lobry: sylvain.lobry@u-paris.fr
This meeting is endorsed by Technical Committee 7 of the IAPR.
Organizers:
- Charlotte Pelletier (IRISA, Univ. Bretagne Sud)
- Ksenia Bittner (German Aerospace Center, DLR)
- Marc Russwurm (MEO-Lab, Univ. Bonn)
- Sylvain Lobry (LIPADE, Univ. Paris Cité)
Programme
9h - 9h45 : Welcome coffee
9h45 - 10h00 : Introduction
10h - 11h: Keynote:
Nicolas Audebert (IGN, LASTIG, STRUDEL): A tour of generative models for remote sensing
11h - 12h: Presentations
Iris Dumeur (Kayrros) - Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series
Cornelia Vacar (CEA) - Learning-based SAR-Optical Registration for Navigation: Insights from a Multi-Year, Multi-Season Continental-Scale Dataset
Pierre Audisio (ENS Lyon) - Plug-and-Play Forward Backward Algorithm to Restore Landsat Images: A Preliminary Step to Uncover the History of Surface Waters
Rana Elayeb (Satlantis) - Hybrid Deep Learning Model for Methane Detection: Merging the Spectral and Spatial Information
12h - 13h30: Lunch break
13h30-14h30: Keynote: Flora Weissgerber (ONERA, DTIS, SAPIA): Weakly supervised learning for the monitoring of the cryosphere
14h30-15h30: Presentations
Pierre Adorni (Uni. Bretagne Sud) - EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Thomas Hallopeau (Université Montpellier) - MUSICA: A Multi-Source Informal settlement Classification Approach combining remote sensing foundation models and expert knowledge
Nisrine Bajja (Université du Littoral Côte d’Opale / ENSAM Casablanca) - Scalable and Interpretable Multimodal Kernel Boosting for Earth Observation Data Analysis
Youssef Fouzai (BRGM) - Deep learning model to classify building usage using footprints from BD TOPO and aerial images
15h30 - 16h: Coffee break
16h-17h: Presentations
Pallavi Jain (Université de Montpellier) - TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing
Mihailo Obrenovic (ICube) - Semantic Heterogeneous Domain Adaptation with Vision-Language Models for Remote Sensing
Sarah Brood (INRIA/LSCE) - Addressing Geographical Domain Shift in Tree Species Mapping via Foundation Models using Satellite Image Time Series
Juliana Carvalho (CNAM) - Self-supervised learning with auxiliary metadata for Earth observation
17h-17h15: Closing
Abstracts of the contributions
Iris Dumeur (Kayrros) - Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series
Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
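As a rough illustration of the dual-form idea described above (a generic sketch, not the authors' exact mechanism), unnormalized causal linear attention can be computed either in parallel over a whole sequence, or recurrently by updating a small state for each new acquisition. The two forms produce identical outputs, which is what allows parallel training and O(1)-per-step incremental inference:

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    # Parallel (training-time) form: causal unnormalized linear attention.
    # O[t] = sum_{s<=t} (Q[t] . K[s]) * V[s]
    T = Q.shape[0]
    scores = Q @ K.T                      # (T, T) pairwise Q.K products
    mask = np.tril(np.ones((T, T)))       # causal mask: keep s <= t
    return (scores * mask) @ V

def linear_attention_recurrent(Q, K, V):
    # Recurrent (inference-time) form: maintain the state S = sum_s K[s] V[s]^T,
    # updated in O(1) per new acquisition -- no sequence reprocessing.
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = []
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t])      # incremental state update
        out.append(Q[t] @ S)              # query the running state
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The equivalence holds because the causal sum factorizes through the running state; normalized variants and the temporal (date-based) adaptations mentioned in the abstract add terms on top of this skeleton.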
Cornelia Vacar (CEA) - Learning-based SAR-Optical Registration for Navigation: Insights from a Multi-Year, Multi-Season Continental-Scale Dataset
This work addresses the challenge of SAR-optical image registration for autonomous navigation in GPS-denied environments. One key aspect lies in introducing and leveraging MEOW, a dataset with extensive temporal coverage, to evaluate both spatial and temporal generalization of deep learning registration methods. Moreover, we conduct a comprehensive architectural sensitivity analysis of four recent deep neural networks for SAR-optical registration by combining various backbone networks, similarity measures and loss functions across 36 hybrid models to isolate the contribution of each component and identify the optimal configuration.
Pierre Audisio (ENS Lyon) - Plug-and-Play Forward Backward Algorithm to Restore Landsat Images: A Preliminary Step to Uncover the History of Surface Waters
The goal of this work is to obtain higher-resolution images over longer time periods and higher image density across the years jointly covered by Sentinel-2 (since 2015) and Landsat (since 1972), in order to monitor river ecosystems and the complex interactions between ecological and morphological processes occurring in these environments. We adopt an inverse problem approach to reconstruct high-resolution images from Landsat data (∼30 m/pixel), using Sentinel-2 images (∼10 m/pixel) as spatial references. To this end, we develop a Plug-and-Play (PnP) method, named Spect-FB-PnP [1], dedicated to multispectral reconstruction, combining the stability of classical inverse problem formulations with the expressivity of deep neural networks.
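The plug-and-play forward-backward scheme mentioned above alternates a gradient step on the data-fidelity term with a denoiser applied in place of a proximal operator. The sketch below uses a toy 1-D averaging operator and a hand-crafted smoothing "denoiser"; all operators and names are illustrative stand-ins under simplifying assumptions, not the actual Spect-FB-PnP components:

```python
import numpy as np

def pnp_forward_backward(y, A, At, denoise, step, n_iter=50):
    """Generic PnP-FB iteration: x <- denoise(x - step * At(A(x) - y)).

    A / At  : forward operator and its adjoint (e.g. blur + downsampling)
    denoise : any denoiser plugged in as an implicit prior
    """
    x = At(y)                              # crude initialization
    for _ in range(n_iter):
        grad = At(A(x) - y)                # forward step on ||A x - y||^2 / 2
        x = denoise(x - step * grad)       # backward step: plug-in prior
    return x

# Toy 1-D example: A averages neighboring samples (a crude stand-in for a
# low-resolution acquisition model); At is its adjoint.
def A(x):  return 0.5 * (x + np.roll(x, 1))
def At(x): return 0.5 * (x + np.roll(x, -1))
def denoise(x):                            # mild smoothing as a toy prior
    return 0.8 * x + 0.1 * (np.roll(x, 1) + np.roll(x, -1))

x_true = np.sin(np.linspace(0, 2 * np.pi, 64))
y = A(x_true)
x_hat = pnp_forward_backward(y, A, At, denoise, step=0.5)
mse = float(np.mean((x_hat - x_true) ** 2))
print(mse)
```

In the actual method, the denoiser is a trained network and the forward operator models the Landsat point spread function and subsampling; the iteration structure is the same.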
Rana Elayeb (Satlantis) - Hybrid Deep Learning Model for Methane Detection: Merging the Spectral and Spatial Information
Methane (CH4) is a major greenhouse gas, responsible for about 20–30% of the warming observed since the industrial revolution. Reducing its emissions represents an important short-term mitigation lever because of its relatively short atmospheric lifetime. In short-wave infrared (SWIR) satellite imagery, detecting methane plumes remains challenging because the CH4 absorption signature is weak and is mixed with strong radiometric variability in the background (surface reflectance, atmospheric effects, and instrumental artifacts). Existing approaches generally fall into two categories: classical physics-based methods, which are interpretable but often limited by their sensitivity and background assumptions, and artificial intelligence models, which can achieve higher performance but are sometimes difficult to control and interpret. We propose a hybrid physics-guided deep learning framework for methane plume detection and segmentation in SWIR imagery. The framework is organized in two complementary stages. First, we use a machine learning algorithm to produce a methane-free reference image, representing the expected radiometric background in the absence of a plume. Detection then relies on differential imaging, which isolates the radiative signature associated with the plume. In a second stage, a multi-resolution wavelet decomposition combined with a deep CNN preserves the frequency organization of the signal to perform structured denoising and improve the segmentation of diffuse plumes. Experiments conducted on data from the GEISAT satellite (VNIR/SWIR iSIM-90 camera) show strong performance in intra-domain evaluation, with a Dice coefficient of around 0.9, indicating very high consistency between predictions and ground truth.
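The first stage described above — predict a methane-free background, then detect by differencing — can be caricatured as follows. The background predictor here is a trivial median estimate standing in for the learned model, and the plume is a synthetic absorption dip; everything is illustrative:

```python
import numpy as np

# Toy differential-imaging detection: a methane plume appears as a weak
# absorption dip in an otherwise noisy SWIR background.
rng = np.random.default_rng(0)
scene = rng.normal(loc=100.0, scale=2.0, size=(64, 64))   # background radiance
scene[20:30, 20:30] -= 6.0                                # absorption dip = plume

background = np.full_like(scene, np.median(scene))        # "methane-free" estimate
diff = background - scene                                 # differential image
plume_mask = diff > 3.0                                   # simple detection rule
print(plume_mask[20:30, 20:30].mean())                    # mostly flagged as plume
```

The real pipeline replaces the median by a learned background model and the threshold by wavelet-CNN denoising and segmentation, but the differencing step plays the same role of isolating the plume signature from background variability.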
Pierre Adorni (Uni. Bretagne Sud) - EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Foundation models for Earth Observation aim to generalize across tasks with limited supervision, but current approaches rely heavily on scaling model and dataset size, leading to high computational costs. We propose EoS-FM, an Ensemble-of-Specialists framework that builds Remote Sensing Foundation Models by combining frozen, task-specialized ConvNeXtV2 encoders. This modular design improves efficiency, interpretability, and extensibility, while supporting pruning, federated training, and incremental integration of new specialists for scalable and sustainable RSFMs.
Thomas Hallopeau (Université Montpellier) - MUSICA: A Multi-Source Informal settlement Classification Approach combining remote sensing foundation models and expert knowledge
Informal settlements are rapidly evolving urban areas that often remain unmapped in official records, posing major challenges for urban planning and policy. Remote sensing methods offer powerful tools to detect and monitor these areas. Approaches based on handcrafted geospatial features efficiently leverage domain knowledge to select input variables adapted to the study area, but reduce complex observations to a limited set of predefined descriptors. On the other hand, deep learning, and recent Remote Sensing Foundation Models (RSFMs), extract high-dimensional image representations directly from satellite imagery, yet remain confined to image modalities from pretraining and do not incorporate complementary geospatial variables. To address this gap, we propose MUSICA (Multi-Source Informal settlement Classification Approach), a hybrid method that combines RSFM features from Sentinel-1 and Sentinel-2 imagery with domain-informed geospatial descriptors from digital elevation and street network data. The two feature sets are processed with independent Random Forest classifiers and merged through a late fusion scheme, which preserves modality-specific strengths while avoiding the information loss of joint embeddings. We evaluate MUSICA in Rio de Janeiro, a city characterized by strong intra-urban diversity in informal settlement morphologies and contexts, making it a challenging and representative testbed. Using spatial cross-validation across five heterogeneous urban zones, we show that MUSICA outperforms a state-of-the-art RSFM used in isolation (+14% Cohen's kappa over CROMA) and the method relying solely on expert features (+7%). These results demonstrate the value of integrating expert knowledge with deep learning.
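The late-fusion design described above can be sketched with scikit-learn: one Random Forest per modality, with the predicted class probabilities averaged at the end. The synthetic features and the equal 50/50 weighting are assumptions for illustration, not details from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
# Two synthetic "modalities": image-derived features and expert descriptors.
X_img = y[:, None] + rng.normal(scale=1.0, size=(n, 8))
X_exp = y[:, None] + rng.normal(scale=1.0, size=(n, 4))

# One classifier per modality, trained independently on its own feature set.
rf_img = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_img[:300], y[:300])
rf_exp = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_exp[:300], y[:300])

# Late fusion: average the per-modality class probabilities.
proba = 0.5 * rf_img.predict_proba(X_img[300:]) + 0.5 * rf_exp.predict_proba(X_exp[300:])
pred = proba.argmax(axis=1)
acc = (pred == y[300:]).mean()
print("fused accuracy:", acc)
```

Because each classifier only ever sees its own modality, a failure mode in one feature space (e.g. noisy imagery) cannot corrupt the other modality's representation — the motivation for late over joint fusion in the abstract.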
Nisrine Bajja (Université du Littoral Côte d’Opale / ENSAM Casablanca) - Scalable and Interpretable Multimodal Kernel Boosting for Earth Observation Data Analysis
The rapid growth of multimodal and multidimensional data from Earth observation (EO) systems has significantly improved large-scale environmental monitoring and semantic analysis capabilities. However, the integration of multiple data sources introduces critical challenges related to multimodal fusion, scalability and interpretability. In this work, we present a scalable and interpretable multimodal multi-class kernel boosting framework to address these challenges and ensure the reliability of remote sensing tasks. Our approach is based on a stratified three-level boosting architecture that successively combines (i) adaptive multiple kernel learning, (ii) modality-level aggregation, and (iii) final ensemble decision learning to construct a strong classifier. At the kernel level, we adaptively combine the output of multiple kernel learners, which allows an automatic kernel selection without requiring time-consuming cross-validations. At the modality level, the model estimates the contribution of each modality, providing direct interpretability of multimodal fusion. At the final level, a strong classifier is built through a boosting strategy based on weakened multiclass learners, ensuring both scalability and interpretability. The proposed model is evaluated on distinct multispectral remote sensing datasets derived from Sentinel-2 imagery and compared with state-of-the-art tree-based, kernel-based, and deep learning models. Experimental results demonstrate competitive classification performance while maintaining low computational complexity. Beyond predictive accuracy, the model provides interpretable insights into the contribution of individual spectral sources and their complementarity, making it particularly suitable for Earth Observation scenarios where transparency and reliability are critical.
Youssef Fouzai (BRGM) - Deep learning model to classify building usage using footprints from BD TOPO and aerial images
In the field of natural hazard assessment, evaluating the vulnerability of exposed elements is essential. Residential buildings are particularly vulnerable to major risks, especially in the context of shrinking and swelling clay soils. However, assessing vulnerability requires extracting their usage type and structural characteristics. Building footprints from the IGN BD TOPO provide incomplete data on their usage. This is obvious in our study area, situated in the west of Orléans, where 40% of buildings listed by IGN have an unknown usage type due to missing information. Using deep learning models and remote sensing data, we were able to resolve this uncertainty with an accuracy of 0.74 and a recall of 0.84, outperforming state-of-the-art models. This marks the initial phase in developing a new processing pipeline capable of automatically extracting building features from multimodal data sources, including ground-level imagery and LiDAR data.
Pallavi Jain (Université de Montpellier) - TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing
Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges, such as the dependence on caption-based supervision, which is often not available or very limited in terms of the covered semantics, and the fact of being adapted from generic VLM architectures that are suitable for very high resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series, using a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. Unlike prior VLMs, TimeSenCLIP emphasizes temporal and spectral signals over spatial context, investigating whether single-pixel time series contain sufficient information for solving a variety of tasks. Our approach is trained on the LUCAS and Sen4Map datasets and evaluated across four main mapping tasks: land cover, land use, habitat mapping and crop type classification. The CLIP text encoder can be used to probe the learned representations using semantically meaningful categories, enabling effective zero-shot generalization without task-specific text supervision. We further extend our evaluation to bioregions mapping and country-level image retrieval. Although coarse, these tasks are valuable for probing whether the model captures geographically meaningful representations, such as regional climate regimes, vegetation patterns, and land-use structures. TimeSenCLIP achieves consistently better performance than existing CLIP-based remote sensing models in both zero-shot classification and cross-modal retrieval.
Notably, single-pixel multispectral time series variants remain highly competitive, particularly with extended temporal coverage, demonstrating that temporal–spectral dynamics can compensate to a substantial degree for the reduced spatial footprint. While larger spatial patches still offer advantages for tasks where spatial patterns are inherently informative, such as ecosystem type classification, the results suggest that single-pixel multispectral time series can provide effective remote sensing vision–language pipelines, supporting scalable and efficient modelling in scenarios where large spatial tiles or extensive textual annotations are impractical.
Mihailo Obrenovic (ICube) - Semantic Heterogeneous Domain Adaptation with Vision-Language Models for Remote Sensing
Domain adaptation methods have achieved significant success in vision applications but are often designed with the assumption that both domains are represented by RGB images. Sometimes, however, domains lie in different spaces of possibly different dimensionalities – a setting called heterogeneous domain adaptation (HDA). HDA methods are very interesting for the field of remote sensing, where a variety of sensors are used, capturing images of different modalities, and different spatial and spectral resolutions. However, the current state-of-the-art unsupervised HDA approaches have been shown to be limited because there is a large domain shift between different modalities, and class flipping often occurs. Current HDA methods therefore remain highly dependent on supervision. The presented work stems from the hypothesis that learning a meaningful and explainable semantic representation of heterogeneous images would facilitate knowledge transfer and enhance class alignment between domains, without the need for labelling in the target domain. A representation based on language emerges as a good choice, with the availability of powerful vision-language models (VLMs). With strong zero-shot classification performance, VLMs hold the potential to improve unsupervised HDA methods. This paper proposes a novel heterogeneous domain adaptation methodology based on vision-language models specialised for remote sensing tasks. The method is shown to have very strong performance on domains of different spatial resolutions, where neither the target domain nor any domain of another resolution was seen by either the domain adaptation method or the vision-language model used, outperforming both state-of-the-art HDA methods and VLM zero-shot performance. Finally, the work is concluded by exploring the possibilities and challenges of using this domain adaptation methodology on domains with different spectral resolutions, i.e., different numbers of channels.
Sarah Brood (INRIA/LSCE) - Addressing Geographical Domain Shift in Tree Species Mapping via Foundation Models using Satellite Image Time Series
In a context of rapid environmental change, delivering robust tree species mapping is essential. It enables better quantification of forest biomass, facilitates climate change adaptation through better forest management and supports biodiversity preservation. However, the scarce existing ground-truth datasets suffer from geographic sparsity, semantic inconsistencies and class imbalance, making current methods overfit to context and unsuitable for accurate large-scale tree species mapping. Therefore, it is imperative to design methods that learn spatially invariant representations for tree species mapping.
The surge of Earth Observation missions has unlocked vast amounts of Satellite Image Time Series (SITS) which capture phenology and spectral dynamics that are an asset for tree species classification. Leveraging this data, an increasing number of Foundation Models (FM) pre-trained using Self-Supervised Learning (SSL) have been introduced. Yet, due to the prevalence of patch-level annotations in tree species datasets, FMs are primarily evaluated through classification tasks instead of segmentation, preventing the production of pixel-level maps. Furthermore, spatial generalization remains largely unexplored, partially explained by the geographic sparsity of the labels. As a result, current models often overfit to local context: they perform well on training areas but fail to generalize to new spatial domains. Therefore, this work focuses on rigorous spatial generalization evaluation and the development of methods to produce large-scale pixel-level tree species maps overcoming current spatial domain shifts.
To quantify this generalization gap, we propose a spatial zero-shot domain adaptation evaluation protocol, where frozen FMs are linearly probed through a segmentation task on a geographical region and tested on geographically distinct, unseen regions. We aligned 3 datasets in Europe (TreeSatAI, PureForest and a regional dataset covering Poland) into 6 classes to benchmark state-of-the-art FMs (AnySat, ALISE, Presto) pre-trained on SITS and introduce a new architecture addressing current limitations.
We propose an SSL framework based on the TimeSFormer backbone. It captures complex spatio-temporal dynamics using divided space and time attention. The model is pre-trained as a Masked Auto-Encoder on a European-scale unlabeled Sentinel-2 dataset to learn robust phenological features. To mitigate the observed spatial generalization gap, we investigate different strategies such as adversarial disentangling of information inducing spurious correlations, such as geolocation, and auxiliary conditioning using causal information on tree species.
Our evaluation protocol reveals a significant accuracy drop of state-of-the-art models when applied to unseen regions. This decline suggests that current FMs capture geographically-dependent features rather than intrinsic tree species characteristics, resulting in a spatial generalization gap. Experiments confirm that the proposed architecture learns semantically rich features, evidenced by its high capacity to reconstruct missing time steps of satellite time series.
By quantifying the spatial domain shift, proposing a resilient SSL architecture, and applying domain adaptation strategies, this work addresses the important challenge of generalization in label-scarce regimes. It supports high-resolution forest monitoring, a prerequisite for precise carbon accounting and forest biodiversity conservation.
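The spatial zero-shot evaluation protocol described in this abstract — linearly probing frozen features on one region and testing on a geographically disjoint one — can be sketched as follows, with synthetic features (a shared class signal plus a region-specific shift) standing in for real frozen-FM embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_region(n, shift):
    # 6 tree-species classes; features = one-hot class signal + region-specific
    # nuisance shift + noise (a toy stand-in for frozen FM embeddings).
    y = rng.integers(0, 6, size=n)
    X = np.eye(6)[y] + shift + rng.normal(scale=0.5, size=(n, 6))
    return X, y

X_a, y_a = make_region(600, shift=0.0)             # training region
X_b, y_b = make_region(600, shift=0.3)             # unseen, shifted test region

# Linear probe on frozen features: only a linear classifier is trained.
probe = LogisticRegression(max_iter=1000).fit(X_a, y_a)
acc_in = probe.score(X_a, y_a)
acc_out = probe.score(X_b, y_b)
print("in-region accuracy  :", acc_in)
print("cross-region accuracy:", acc_out)
```

Comparing the two scores quantifies the spatial generalization gap; in the real protocol, the regions are distinct geographic datasets (e.g. TreeSatAI, PureForest) and the features come from frozen pre-trained encoders.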
Juliana Carvalho (CNAM) - Self-supervised learning with auxiliary metadata for Earth observation
Self-supervised learning (SSL) has emerged as a scalable and effective paradigm in computer vision, especially in the domain of Earth observation, where satellite imagery is abundant but ground-truth annotations require domain expertise. While existing SSL methods predominantly rely on visual content, they often overlook the rich auxiliary image metadata. This work investigates the integration of metadata into the contrastive learning framework to improve representation learning for remote sensing tasks. Specifically, we explore how to encode and leverage spatial metadata, primarily geographic location, during the self-supervised pre-training phase to enhance generalization in downstream classification tasks. Our method builds upon and adapts the approach from Dufumier et al. (2021), modifying it to operate in a geospatial context. We reformulate the distance-aware contrastive loss to incorporate geographic proximity, encouraging representations of spatially close images to reside nearer in the embedding space. This adaptation demonstrates the potential of distance-aware metadata integration in self-supervised learning, contributing to more effective and semantically meaningful representations for Earth observation data. We achieved 1.6% higher accuracy than contrastive models for image-only tasks and 1% higher accuracy when compared with the baseline approach. This highlights the potential to use our method for pre-training on larger datasets and transferring to smaller ones.
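A distance-aware contrastive loss of the kind described above might look like the following sketch, where an InfoNCE-style cross-entropy is weighted by a Gaussian kernel on geographic distance. The kernel form, bandwidth, and planar coordinates are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def geo_weighted_contrastive_loss(z, coords, temperature=0.1, sigma_km=50.0):
    """Distance-aware InfoNCE-style loss (illustrative sketch).

    z      : (N, D) L2-normalized embeddings of N images
    coords : (N, 2) planar coordinates in km (a simplification of lat/lon)
    Spatially close pairs get higher positive weight, pulling their
    embeddings together; far pairs mostly act as negatives.
    """
    sim = z @ z.T / temperature                    # scaled cosine similarities
    np.fill_diagonal(sim, -1e9)                    # exclude self-pairs
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    w = np.exp(-(d ** 2) / (2 * sigma_km ** 2))    # geographic proximity kernel
    np.fill_diagonal(w, 0.0)
    w = w / w.sum(axis=1, keepdims=True)           # normalized positive weights
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-(w * log_p).sum(axis=1).mean())  # kernel-weighted cross-entropy

rng = np.random.default_rng(1)
z = rng.normal(size=(16, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)
coords = rng.uniform(0, 200, size=(16, 2))
loss = geo_weighted_contrastive_loss(z, coords)
print(loss)
```

As sigma_km shrinks, the weights concentrate on the nearest images and the loss approaches standard InfoNCE with nearest-neighbor positives; as it grows, all pairs become equally positive, which is one way to see how the bandwidth trades off spatial specificity against supervision density.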
