Context and Motivation
Generative AI models such as GANs and diffusion models are trained on massive, often opaque image collections [1, 2]. These datasets may include personal photos, artwork, or copyrighted media, raising important concerns about attribution, copyright, and content ownership. Existing auditing techniques—such as membership inference [3] or data watermarking [4]—either require access to the original training set or provide only coarse indicators of memorization. They cannot quantify how much, or in what way, a specific dataset has influenced the internal geometry of a generative model.
Traditional generative models operate in pixel or latent spaces, which are high-dimensional and difficult to interpret or control. Subspace learning offers a compelling alternative by decomposing weights or latent features into meaningful low-dimensional components [5,6,7]. These subspaces provide clearer insights into how data shapes model behavior and may reveal dataset-specific signatures embedded during training.
Proposed Approach
This project aims to develop a subspace-based method for dataset attribution: determining whether a given dataset has influenced a generative model—and quantifying this influence—without requiring access to the original training data. The central hypothesis is that spectral or geometric fingerprints of datasets remain encoded in the subspaces of the model (weights, intermediate layers, or latent trajectories), and that comparing these subspaces enables reliable attribution.
Planned Actions
- Survey existing work on attribution, interpretability, and subspace analysis in generative models.
- Build the experimental pipeline for subspace extraction using SVD/PCA on latent trajectories or model parameters.
- Develop dataset fingerprinting techniques to identify spectral or geometric signatures left by training data in the model.
- Design a subspace alignment method to measure similarity between candidate datasets and model-induced subspaces.
- Define attribution metrics to quantify relative contributions of different datasets.
- Conduct experimental validation on multiple generative models using fully open-access datasets.
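As a rough illustration of the first two pipeline steps, the sketch below extracts a low-rank subspace from latent features via SVD (equivalently, PCA on centered data) and compares two subspaces through the cosines of their principal angles. This is only one plausible instantiation under assumed design choices: the function names `extract_subspace` and `subspace_affinity`, and the choice of mean squared principal-angle cosines as the alignment score, are illustrative assumptions, not the project's prescribed method.

```python
import numpy as np

def extract_subspace(features: np.ndarray, rank: int = 16) -> np.ndarray:
    """Extract a rank-r orthonormal basis from a (n_samples x dim)
    matrix of latent features via SVD of the centered data (PCA)."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the right singular vectors; the leading ones
    # span the principal subspace of the feature cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T  # (dim x rank), columns orthonormal

def subspace_affinity(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Similarity between two orthonormal bases: the singular values of
    basis_a.T @ basis_b are the cosines of the principal angles between
    the two subspaces. Returns their mean squared cosine, so 1.0 means
    identical spans and 0.0 means mutually orthogonal subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))
```

In an attribution setting, one subspace would come from the candidate dataset's features and the other from the model (weights, intermediate activations, or latent trajectories); an affinity close to 1 would then be evidence of influence, subject to the calibration and metrics developed in the later steps.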
Requirements:
Prospective interns should have a strong background in computer vision and deep learning, along with hands-on experience in deep learning frameworks such as PyTorch or TensorFlow. Experience with video data and video processing is a plus. Strong programming skills in Python are essential.
- Location: laboratoire CIAD, Montbéliard, France.
- Start date: 1st of February 2026.
- This internship is remunerated.
Application:
Send a curriculum vitae, referees' contact details, and grades for the last two years before the 15th of January, 2026 to:
ibrahim.kajo@utbm.fr
yassine.ruichek@utbm.fr
References:
[1] Rombach, R. et al. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
[2] Podell, D. et al. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
[3] Zhang, M. et al. (2024). Generated distributions are all you need for membership inference attacks against generative models. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 4839–4849.
[4] Cui, Y. et al. (2023). DiffusionShield: A watermark for copyright protection against generative diffusion models. arXiv preprint arXiv:2306.04642.
[5] Han, L. et al. (2023). SVDiff: Compact parameter space for diffusion fine-tuning. Proceedings of the IEEE/CVF International Conference on Computer Vision, 7323–7334.
[6] Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR, 1(2), 3.
[7] Kajo, I. et al. (2018). SVD-based tensor-completion technique for background initialization. IEEE Transactions on Image Processing, 27(6), 3114–3126.
