POWR: Operator World Models for Reinforcement Learning

Pietro Novelli, Marco Pratticò, Massimiliano Pontil, Carlo Ciliberto
Computational Statistics and Machine Learning - Istituto Italiano di Tecnologia
AI Centre, Computer Science Department, University College London
NeurIPS 2024

Abstract

Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to Reinforcement Learning (RL) because explicit action-value functions are not accessible. We address this challenge by introducing a novel approach based on learning a world model of the environment using conditional mean embeddings. Leveraging tools from operator theory, we derive a closed-form expression for the action-value function in terms of the world model via simple matrix operations. Combining these estimators with PMD leads to POWR, a new RL algorithm for which we prove convergence rates to the global optimum. Preliminary experiments in finite and infinite state settings support the effectiveness of our method.
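As a rough illustration of the two ingredients described above, the sketch below treats the tabular case, where a conditional mean embedding estimator of the world model reduces to an empirical transition matrix: the action-value function is then obtained in closed form by solving a linear system, and the policy is updated with an exponentiated mirror-descent step. The function names, array shapes, and the variables P, r, pi, gamma, and eta are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def q_from_world_model(P, r, pi, gamma):
        """Closed-form action values from a (tabular) world model.

        Assumed shapes: P is (S*A, S) with P[s*A + a, s'] ≈ Pr(s' | s, a),
        r is (S*A,), pi is (S, A). Solves the Bellman identity
        Q = r + gamma * P @ Pi @ Q as a linear system.
        """
        S, A = pi.shape
        # Policy-averaging operator Pi: maps state-action values to state values.
        Pi = np.zeros((S, S * A))
        for s in range(S):
            Pi[s, s * A:(s + 1) * A] = pi[s]
        M = np.eye(S * A) - gamma * P @ Pi
        return np.linalg.solve(M, r)          # Q as a (S*A,) vector

    def pmd_update(pi, Q, eta):
        """One Policy Mirror Descent step: pi_new(a|s) ∝ pi(a|s) exp(eta * Q(s,a))."""
        S, A = pi.shape
        logits = np.log(pi + 1e-12) + eta * Q.reshape(S, A)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        new_pi = np.exp(logits)
        return new_pi / new_pi.sum(axis=1, keepdims=True)

Under these assumptions, alternating q_from_world_model and pmd_update on an estimated transition matrix gives a minimal tabular analogue of the POWR loop; the paper's operator-theoretic treatment extends the same closed-form computation to infinite state spaces via conditional mean embeddings.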

BibTeX

@article{novelli2024operator,
  title={Operator World Models for Reinforcement Learning},
  author={Novelli, Pietro and Pratticò, Marco and Pontil, Massimiliano and Ciliberto, Carlo},
  journal={arXiv preprint arXiv:2406.19861},
  year={2024}
}