Location: LAAS-CNRS, Toulouse, France
Application deadline: April 18, 2025
Keywords: Reinforcement learning, stochastic gradient descent, policy gradient, convergence analysis, gradient estimator, exponential families, product-form stochastic systems.
Context: In reinforcement learning (RL), an agent improves its behavior in a trial-and-error fashion by interacting with an environment. The environment is typically modeled as a Markov decision process (MDP) with unknown transition rates, and the goal of the agent is to learn a policy with maximal gain, where the gain of a policy is defined as the long-run average reward under this policy. In policy-gradient algorithms such as actor-critic [10, 9], the agent is given a parametric set of policies (e.g., a softmax policy parameterization) and searches for an optimal policy parameter by stochastic gradient descent. The critical step in policy-gradient algorithms is the estimation of the policy gradient, i.e., the gradient of the gain with respect to the policy parameter.
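For concreteness, these quantities can be formalized as follows. This is only an illustrative sketch, assuming a finite, ergodic average-reward MDP with reward function r, stationary distribution d_\theta under the policy \pi_\theta, and an (assumed) feature map \phi for the softmax parameterization:

\[
  \eta(\theta)
  \;=\; \lim_{T \to \infty} \frac{1}{T}\,
        \mathbb{E}_{\pi_\theta}\Bigl[\sum_{t=0}^{T-1} r(S_t, A_t)\Bigr]
  \;=\; \sum_{s} d_\theta(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),
  \qquad
  \pi_\theta(a \mid s)
  \;=\; \frac{\exp\bigl(\theta^\top \phi(s, a)\bigr)}
             {\sum_{b} \exp\bigl(\theta^\top \phi(s, b)\bigr)},
\]

and the agent updates the parameter by stochastic gradient iterations \(\theta_{k+1} = \theta_k + \alpha_k \widehat{\nabla_\theta \eta}(\theta_k)\), where \(\widehat{\nabla_\theta \eta}\) denotes an estimator of the policy gradient and \(\alpha_k\) an illustrative step size.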
Objectives: The starting point of this PhD project is a policy-gradient algorithm introduced in [4] and
later studied in [6, 7, 5, 8, 2, 3]. In this algorithm, the policy gradient is written as an expectation involving
the score of the stationary distribution, which is estimated by temporal-difference (TD) learning. A key
advantage of this algorithm is that several steps of the learning process are independent of the reward, so that they can be reused for different reward functions. Unfortunately, without function approximation, the memory consumption of this algorithm grows linearly with the size of the MDP’s state space, which is prohibitive in many applications. Our goal in this PhD project is to overcome this difficulty by exploiting information about the structure of the MDP, e.g., its transition diagram and/or its stationary distribution. A promising lead is the exponential-family assumption considered in the recent preprint [1]. As an example, a first step of the PhD project could consist of combining the results of [4] and [1] to solve complex RL problems in product-form stochastic networks; this may lead to a collaboration with Pascal Moyal (Institut Élie Cartan de Lorraine).
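To give a flavor of this approach, the policy-gradient expression underlying the algorithm of [4] can be sketched, in the same average-reward setting as above, as

\[
  \nabla_\theta \eta(\theta)
  \;=\; \sum_{s} d_\theta(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a)\,
        \bigl[\nabla_\theta \log d_\theta(s) + \nabla_\theta \log \pi_\theta(a \mid s)\bigr],
\]

that is, an expectation, under the stationary behavior of the controlled chain, of the reward weighted by the sum of the score of the stationary distribution and the score of the policy (see [4] for the precise statement and assumptions). The score \(\nabla_\theta \log d_\theta\) is the quantity estimated by TD learning; storing it without function approximation requires one gradient vector per state, which is the source of the linear memory growth mentioned above.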
Prerequisites: The candidate should be able to demonstrate a solid mathematical background, including one or more courses on probability theory and Markov chains, as well as proficiency in a scientific computing language for prototyping algorithms, such as Python, Julia, or Matlab. Experience in stochastic optimization is appreciated but not mandatory. The candidate must have obtained a Master’s degree or equivalent before starting the PhD project.
Supervisors: Céline Comte (CNRS, LAAS) and Gersende Fort (CNRS, LAAS)
Work environment: The research project will be carried out in the SARA team at LAAS, in Toulouse,
France. The candidate will also be involved in the activities of the SOLACE team, which gathers researchers working on stochastic modeling at IRIT and LAAS. The SOLACE team has established international collaborations with leading academic institutions and industrial labs. As part of the PhD project, the PhD student will be offered the opportunity to undertake a collaboration with, and possibly a long-term visit to, one of these partners, in particular Pascal Moyal (Institut Élie Cartan de Lorraine) and Siva Theja Maguluri (Georgia Tech).
Application procedure: Candidates are invited to send their application by email by April 18, 2025, to
Céline Comte (celine.comte@laas.fr) and Gersende Fort (gersende.fort@laas.fr), in PDF format. Applications should include a curriculum vitae, transcripts of grades from the first and second years of Master’s study (possibly incomplete for the second year), a short covering letter (approximately 200–400 words), and the name and email address of a teacher or project advisor who may support the application.
References:
[1] Céline Comte et al. “Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions”. June 2024. url: https://hal.science/hal-04329790.
[2] Toru Hishinuma and Kei Senda. “Importance-Weighted Variational Inference Model Estimation for Offline Bayesian Model-Based Reinforcement Learning”. In: IEEE Access 11 (2023), pp. 145579–145590. issn: 2169-3536. doi: 10.1109/ACCESS.2023.3345799. url: https://ieeexplore.ieee.org/abstract/document/10368011.
[3] Pulkit Katdare, Anant Joshi, and Katherine Driggs-Campbell. Towards Provable Log Density Policy Gradient. Mar. 3, 2024. doi: 10.48550/arXiv.2403.01605. arXiv: 2403.01605 [cs]. url: http://arxiv.org/abs/2403.01605.
[4] Tetsuro Morimura et al. “Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning”. In: Neural Computation 22.2 (Feb. 2010), pp. 342–376. issn: 0899-7667. doi: 10.1162/neco.2009.12-08-922. url: https://doi.org/10.1162/neco.2009.12-08-922.
[5] Harsh Satija, Philip Amortila, and Joelle Pineau. “Constrained Markov Decision Processes via Backward Value Functions”. In: Proceedings of the 37th International Conference on Machine Learning. issn: 2640-3498. PMLR, Nov. 21, 2020, pp. 8502–8511.
[6] Yannick Schroecker and Charles L. Isbell. “State Aware Imitation Learning”. In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. url: https://proceedings.neurips.cc/paper/2017/hash/08e6bea8e90ba87af3c9554d94db6579-Abstract.html.
[7] Yannick Schroecker, Mel Vecerik, and Jonathan Scholz. Generative predecessor models for sample-efficient imitation learning. Apr. 1, 2019. doi: 10.48550/arXiv.1904.01139. arXiv: 1904.01139 [cs]. url: http://arxiv.org/abs/1904.01139.
[8] Yannick Karl Daniel Schroecker. “Manipulating state space distributions for sample-efficient imitation learning”. PhD thesis. Georgia Institute of Technology, Mar. 16, 2020. url: http://hdl.handle.net/1853/62755.
[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Ed. by Francis Bach. 2nd ed. Adaptive Computation and Machine Learning series. Cambridge, MA, USA: A Bradford Book, Nov. 13, 2018. 552 pp. isbn: 978-0-262-03924-6.
[10] Csaba Szepesvári. “Algorithms for Reinforcement Learning”. In: Synthesis Lectures on Artificial Intelligence and Machine Learning 4.1 (Jan. 2010), pp. 1–103. issn: 1939-4608. doi: 10.2200/S00268ED1V01Y201005AIM009. url: https://www.morganclaypool.com/doi/abs/10.2200/S00268ED1V01Y201005AIM009.