Context and Motivation
Generative AI models such as GANs and diffusion models are trained on massive, often opaque image collections [1, 2]. These datasets may include personal photos, artwork, or copyrighted media, raising important concerns about attribution, copyright, and content ownership. Existing auditing techniques—such as membership inference [3] or data watermarking [4]—either require access to the original training set or provide only coarse indicators of memorization. They cannot quantify how much, or in what way, a specific dataset has influenced the internal geometry of a generative model.
Traditional generative models operate in pixel or latent spaces, which are high-dimensional and difficult to interpret or control. Subspace learning offers a compelling alternative by decomposing weights or latent features into meaningful low-dimensional components [5,6,7]. These subspaces provide clearer insights into how data shapes model behavior and may reveal dataset-specific signatures embedded during training.
Proposed Approach
This project aims to develop a subspace-based method for dataset attribution: determining whether a given dataset has influenced a generative model—and quantifying this influence—without requiring access to the original training data. The central hypothesis is that spectral or geometric fingerprints of datasets remain encoded in the subspaces of the model (weights, intermediate layers, or latent trajectories), and that comparing these subspaces enables reliable attribution.
Planned Actions
- Survey existing work on attribution, interpretability, and subspace analysis in generative models.
- Build the experimental pipeline for subspace extraction using SVD/PCA on latent trajectories or model parameters.
- Develop dataset fingerprinting techniques to identify spectral or geometric signatures left by training data in the model.
- Design a subspace alignment method to measure similarity between candidate datasets and model-induced subspaces.
- Define attribution metrics to quantify relative contributions of different datasets.
- Conduct experimental validation on multiple generative models using fully open-access datasets.
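As a rough illustration of the first two pipeline steps, the sketch below extracts a low-rank subspace from latent features via SVD (equivalently, PCA on centered data) and compares two subspaces through the cosines of their principal angles. This is only one plausible instantiation under assumed design choices: the function names `extract_subspace` and `subspace_affinity`, and the choice of mean squared principal-angle cosines as the alignment score, are illustrative assumptions, not the project's prescribed method.

```python
import numpy as np

def extract_subspace(features: np.ndarray, rank: int = 16) -> np.ndarray:
    """Extract a rank-r orthonormal basis from a (n_samples x dim)
    matrix of latent features via SVD of the centered data (PCA)."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the right singular vectors; the leading ones
    # span the principal subspace of the feature cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T  # (dim x rank), columns orthonormal

def subspace_affinity(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Similarity between two orthonormal bases: the singular values of
    basis_a.T @ basis_b are the cosines of the principal angles between
    the two subspaces. Returns their mean squared cosine, so 1.0 means
    identical spans and 0.0 means mutually orthogonal subspaces."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))
```

In an attribution setting, one subspace would come from the candidate dataset's features and the other from the model (weights, intermediate activations, or latent trajectories); an affinity close to 1 would then be evidence of influence, subject to the calibration and metrics developed in the later steps.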
Requirements:
Prospective interns should have a strong background in computer vision and deep learning, along with hands-on experience in deep learning frameworks such as PyTorch or TensorFlow. Experience with video data and video processing is a plus. Strong programming skills in Python are essential.
- Location: laboratoire CIAD, Montbéliard, France.
- Start date: 1st of February 2026.
- This internship is remunerated.
Application:
Send a curriculum vitae, referees' contact details, and grades for the last two years before the 15th of January, 2026 to:
ibrahim.kajo@utbm.fr
yassine.ruichek@utbm.fr
References:
[1] Rombach, R. et al. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
[2] Podell, D. et al. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
[3] Zhang, M. et al. (2024). Generated distributions are all you need for membership inference attacks against generative models. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 4839–4849.
[4] Cui, Y. et al. (2023). DiffusionShield: A watermark for copyright protection against generative diffusion models. arXiv preprint arXiv:2306.04642.
[5] Han, L. et al. (2023). SVDiff: Compact parameter space for diffusion fine-tuning. Proceedings of the IEEE/CVF International Conference on Computer Vision, 7323–7334.
[6] Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR, 1(2), 3.
[7] Kajo, I. et al. (2018). SVD-based tensor-completion technique for background initialization. IEEE Transactions on Image Processing, 27(6), 3114–3126.
