Announcement


PhD thesis position in AI at Sorbonne University: Deep Learning and Large Generative AI Models for Machine Vision

30 April 2025


Category: PhD positions


Deep neural networks are currently among the most successful models in image processing and computer vision [1-4, 10-13]. They learn convolutional filters, together with attention [14, 15] and fully connected layers, so as to maximize classification and generation performance. Large generative models (LGMs) [5] are a particular category of deep learning models specifically designed to generate new data, often resembling the data they were trained on. These models are at the forefront of artificial intelligence research, pushing the boundaries of what computers can create. Unlike standard deep learning models trained for classification or prediction, LGMs focus on creating entirely new data samples, such as images, text, video, or audio.


LGMs leverage various deep learning architectures, among the most common being:

  • Generative Adversarial Networks (GANs) [6], in which two neural networks compete against each other: a generator tries to create realistic data, while a discriminator tries to distinguish real data from generated data. This adversarial game drives the generator to fool the discriminator and produce increasingly realistic outputs.
  • Variational Autoencoders (VAEs) [7], which encode input data into a latent space, a bottleneck that captures its essential features. The model then learns to decode samples from this latent space, effectively generating new data that resemble the training data.
  • Normalizing Flows and Diffusion Models [8, 9], which start from a noisy version of the target data and gradually denoise it step by step, ultimately producing a clean and realistic sample.

The challenges of LGMs stem from their training complexity, their biases, particularly when trained in a lifelong learning regime [16], and their ability to generate realistic and potentially manipulative content, which raises ethical concerns.
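To make the diffusion idea concrete, here is a minimal sketch of the forward (noising) process that diffusion models learn to invert, for a single scalar "pixel" and a linear beta schedule; the schedule bounds (1e-4, 0.02) and the number of steps T are illustrative choices, not values taken from the works cited here.

```python
import math
import random

# Linear noise schedule: beta grows from 1e-4 to 0.02 over T steps (toy values).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = prod_{s<=t} (1 - beta_s): the fraction of signal kept at step t.
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, rng):
    """Closed-form sample x_t ~ q(x_t | x_0): signal shrinks, noise grows."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

rng = random.Random(0)
x_last = q_sample(5.0, T - 1, rng)
# By the final step almost no signal remains, so x_T is close to a standard
# Gaussian; a trained denoiser learns to reverse these steps one by one.
print(round(alpha_bars[0], 6), alpha_bars[-1] < 1e-3)
```

Sampling then consists of running the learned denoiser from pure noise back to step 0, which is the "de-noising step by step" described above.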

The goal of this thesis is to study and design novel solutions that address several LGM challenges, including:

  • Enhancing control and interpretability: developing interactive techniques, based on prompting or semantic subspace design, that better control the outputs of LGMs and their quality/diversity, and that shed light on how they generate specific data.
  • Extending LGMs to the lifelong learning paradigm: developing effective solutions that learn from streams of data while mitigating catastrophic forgetting (i.e., without forgetting previously learned information); the proposed solutions will mainly rely on regularization and dynamic LGM architecture design (to maintain LGM capacity), as well as domain adaptation (to address the non-stationarity of training data streams).
  • Extending LGMs to unstructured data (such as 3D point clouds): designing LGMs on graphs while handling all the possible symmetries (ambiguities) in the unstructured data, particularly permutations, both for LGM encoding and decoding.
  • Improving LGM efficiency: developing more efficient training algorithms to make LGMs more accessible, particularly on edge devices.
  • Mitigating bias and ensuring responsible use: designing safeguards that address potential biases and promote responsible development and deployment of LGMs.
  • etc.
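To illustrate the permutation symmetry mentioned for unstructured data such as 3D point clouds, the sketch below shows a DeepSets-style encoder with sum pooling, whose output is invariant to the order of the input points; the feature map phi here is a fixed toy nonlinearity, not a proposed architecture.

```python
import math

def phi(point):
    # Per-point feature map (a fixed toy nonlinearity, no learned weights).
    x, y, z = point
    return (math.tanh(x), math.tanh(y), math.tanh(z), x * y + z)

def encode(points):
    # Summing per-point features discards the ordering of the points,
    # so any permutation of the cloud yields the same code.
    feats = [phi(p) for p in points]
    return [sum(f[i] for f in feats) for i in range(4)]

cloud = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (-0.2, 0.1, 0.0)]
shuffled = [cloud[2], cloud[0], cloud[1]]
code_a, code_b = encode(cloud), encode(shuffled)
print(all(abs(a - b) < 1e-9 for a, b in zip(code_a, code_b)))
```

The same symmetry must also be respected on the decoding side, where the generated point set (rather than any particular ordering of it) is what matters.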

Applications of this thesis will be centred on various computer vision and image processing tasks, including (i) creative image and video content generation, (ii) data augmentation (generating synthetic data to improve the performance of other machine learning models), and (iii) image/video editing (filling in missing parts of visual content, enhancing resolution, or photorealistic editing).
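Application (ii) can be illustrated with a deliberately simple stand-in for a trained generative model: a per-class Gaussian fitted to scarce real data, from which cheap synthetic samples are drawn to enlarge the training set. All names and values below are illustrative.

```python
import random
import statistics

def fit_generator(samples):
    # Fit a 1-D Gaussian as a toy stand-in for a trained generative model.
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return lambda rng: rng.gauss(mu, sigma)

rng = random.Random(0)
real = [rng.gauss(5.0, 1.0) for _ in range(50)]    # scarce real data
gen = fit_generator(real)
synthetic = [gen(rng) for _ in range(200)]          # cheap extra samples
augmented = real + synthetic
print(len(augmented))
```

A downstream classifier would then be trained on the augmented set instead of the small real one; with an actual LGM, the generator would of course produce images or videos rather than scalars.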

Keywords. Deep learning, Neural networks, Deep generative models, Image and video generation, Computer vision & AI.

Thesis Director: Hichem Sahbi, CNRS Researcher, HDR, LIP6 Lab, Sorbonne University (contact: hichem.sahbi@lip6.fr).

PhD Student Background. We are seeking a highly motivated PhD candidate, preferably with a background in applied mathematics or computer science, with emphasis on statistics, machine learning, and/or image processing and computer vision, and with familiarity with existing machine learning tools and programming platforms.

Related bibliography

[1] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, June 2016.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, pages 1097–1105, 2012.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[5] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[7] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[8] D. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In ICML (PMLR), 2015.
[9] S. Luo and W. Hu. Diffusion Probabilistic Models for 3D Point Cloud Generation. In CVPR, 2021.
[10] M. Jiu and H. Sahbi, “Nonlinear deep kernel learning for image annotation,” IEEE Transactions on Image Processing, vol. 26(4), 2017.
[11] H. Sahbi. Learning Chebyshev Basis in Graph Convolutional Networks for Skeleton-based Action Recognition. In ICCV Workshops (ICCV-W), 2021.
[12] R. Marsal, F. Chabot, A. Loesch, W. Grolleau, and H. Sahbi. MonoProb: Self-Supervised Monocular Depth Estimation with Interpretable Uncertainty. In WACV, pages 3637-3646, 2024.
[13] R. Marsal, F. Chabot, A. Loesch, and H. Sahbi. BrightFlow: Brightness-Change-Aware Unsupervised Learning of Optical Flow. In WACV, pages 2061-2070, 2023.
[14] A. Vaswani et al. Attention Is All You Need. In NIPS, 2017.
[15] H. Sahbi, J.-Y. Audibert, and R. Keriven. Context-Dependent Kernels for Object Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2011.
[16] H. Sahbi and H. Zhan. FFNB: Forgetting-Free Neural Blocks for Deep Continual Learning. In BMVC, 2021.

