Efficient Bayesian Planning
Doctoral thesis, 2022

Artificial Intelligence (AI) is a long-studied yet still very active field of research. The list of abilities that differentiate humans from AI grows shorter, but the dream of an artificial general intelligence remains elusive. Sequential Decision Making is a subfield of AI that poses a seemingly benign question: "How to act optimally in an unknown environment?" Answering it requires the AI agent to learn about its environment as well as to plan an action sequence given its current knowledge about it. The two common problem settings are partial observability and unknown environment dynamics. Bayesian planning deals with these issues by defining a single planning problem that accounts for the simultaneous effects of an action on both learning and goal search. The technique involves infinite tree data structures, which are hard to store but essential for computing the optimal plan. Finally, we consider the minimax setting, where the Bayesian prior is chosen by an adversary and therefore a worst-case policy needs to be found.

In this thesis, we present novel Bayesian planning algorithms. First, we propose DSS (Deeper, Sparser Sampling) for the case of unknown environment dynamics. It is a meta-algorithm derived from a simple insight about Bayes' rule, and it beats the state of the art across the board, from discrete to continuous state settings. A theoretical analysis provides a high-probability bound on its performance. Our analysis differs from previous approaches in the literature in terms of problem formulation and formal guarantees. The result also contrasts with those of previous comparable Bayesian Reinforcement Learning (BRL) algorithms, which typically provide asymptotic convergence guarantees. Suitable Bayesian models and their corresponding planners are proposed for implementing the discrete and continuous versions of DSS. We then address the issue of partial observability via our second algorithm, FMP (Finite Memory Planner), which uses depth-dependent partitioning of the infinite planning tree. Experimental results demonstrate performance comparable to the current state of the art in both discrete and continuous settings. Finally, we propose algorithms for finding the best policy for the worst-case belief in the Minimax Bayesian setting.


Partially Observable MDP


Bayesian Reinforcement Learning

Hall HC2, Hörsalsvägen 14
Opponent: Alessandro Lazaric, Research Scientist, Facebook Artificial Intelligence Research


Divya Grover

Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI

Bayesian Reinforcement Learning via Deep, Sparse Sampling

Proceedings of Machine Learning Research, Vol. 108 (2020), p. 3036-3045

Paper in proceeding

Divya Grover, Debabrota Basu, Christos Dimitrakakis. “Bayesian Reinforcement Learning via Approximate Planning in Bayes Adaptive MDP” submitted to the Journal of Artificial Intelligence Research, 2022.

Divya Grover, Christos Dimitrakakis. “Adaptive Belief Discretization for POMDP Planning” in 15th European Workshop on Reinforcement Learning. Milano, Italy, September 19-21, 2022.

Minimax-Bayes Reinforcement Learning

Proceedings of Machine Learning Research, Vol. 206 (2023), p. 7511-7527

Paper in proceeding

This thesis develops Bayesian planning algorithms for sequential decision-making problems, in particular the problems of Reinforcement Learning (RL) and Partially Observable Markov Decision Processes (POMDP). The former describes the situation where an agent is unaware of the dynamics of the environment it interacts with, while the latter pertains to situations where there is a lack of feedback from the environment to the agent. The need for this work arose from the lack of efficient algorithms for such problems. Efficiency here refers in particular to sample efficiency, since it is only natural to aim for an agent that limits the number of bad interactions with its environment. Bayesian methods are particularly suitable for problems in the low-sample regime. They are characterized by the Bayesian belief, which quantifies one's estimate of the uncertainty in the problem. Usually a rule for updating this belief is also defined; the process of updating is called Bayesian inference, and it incorporates the latest observations into the current uncertainty estimate. We therefore develop algorithms belonging to this class of solutions, owing to their many practical benefits.
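As a rough illustration of the belief update described above, the following sketch maintains a belief over the unknown transition probabilities of a small discrete environment. It is not taken from the thesis: the class name `DirichletBelief`, the independent Dirichlet prior per state-action pair, and the uniform `prior_count` are all assumptions chosen for simplicity.

```python
from collections import defaultdict


class DirichletBelief:
    """Belief over unknown transition probabilities of a discrete environment.

    Assumption (not from the thesis): every (state, action) pair has an
    independent Dirichlet prior over next states, with the same pseudo-count
    for each outcome.
    """

    def __init__(self, n_states, prior_count=1.0):
        self.n_states = n_states
        self.prior_count = prior_count
        # counts[(s, a)][s_next] = number of observed transitions
        self.counts = defaultdict(lambda: [0] * n_states)

    def update(self, state, action, next_state):
        """Bayesian inference step: fold one observed transition into the belief."""
        self.counts[(state, action)][next_state] += 1

    def posterior_mean(self, state, action):
        """Expected transition probabilities under the current belief."""
        alphas = [self.prior_count + c for c in self.counts[(state, action)]]
        total = sum(alphas)
        return [a / total for a in alphas]


belief = DirichletBelief(n_states=2)
before = belief.posterior_mean(0, 0)  # nothing observed yet: uniform estimate
for _ in range(3):
    belief.update(state=0, action=0, next_state=1)
after = belief.posterior_mean(0, 0)   # probability mass shifts toward state 1
```

With only three observations the estimate already moves noticeably away from the prior, which is one way to see why such models are attractive when samples are scarce.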

Our solutions aim in particular at maximizing performance given finite computational resources. We focus on identifying key properties in both settings and developing algorithms around them. One such property is the rate of change of the Bayesian belief: over the possible dynamics models in the RL setting, and over the state space in the POMDP setting. In our algorithms, we propose tunable parameters that are strongly correlated with these key properties, thereby granting direct control over the computation-performance tradeoff. The proposed solutions beat the current best solutions and scale elegantly to larger problems. A key requirement of our algorithms is the existence of some framework for Bayesian inference, which is not always available due to modelling issues. This work can be extended to such problem settings as well.
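For the POMDP setting, the belief is a probability distribution over hidden states. The sketch below shows the standard discrete Bayes filter that updates such a belief after one action and one observation. It is a generic textbook construction, not the planner proposed in the thesis; the function name `belief_update` and the nested-list layout of the two models are assumptions made for illustration.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """One step of Bayesian filtering over hidden states.

    Assumed (hypothetical) model layout:
      transition[a][s][s2]        = P(s2 | s, a)
      observation_model[a][s2][o] = P(o | s2, a)
    """
    n = len(belief)
    # Prediction: push the current belief through the dynamics.
    predicted = [
        sum(belief[s] * transition[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correction: weight each state by the likelihood of the observation.
    unnormalized = [
        predicted[s2] * observation_model[action][s2][observation]
        for s2 in range(n)
    ]
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnormalized]


# Toy problem: two hidden states, one action that leaves the state unchanged,
# and a sensor that reports the true state with probability 0.8.
transition = [[[1.0, 0.0], [0.0, 1.0]]]
observation_model = [[[0.8, 0.2], [0.2, 0.8]]]

prior = [0.5, 0.5]
posterior = belief_update(prior, 0, 1, transition, observation_model)
```

How quickly `posterior` moves away from `prior` depends on how informative the observation model is, which gives a concrete sense of the rate of change of the belief mentioned above.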

Learning, privacy and the limits of computation

Swedish Research Council (VR) (2015-05410), 2016-01-01 -- 2019-12-31.

Subject Categories

Computer and Information Science

Probability Theory and Statistics



Doktorsavhandlingar vid Chalmers tekniska högskola. Ny serie: 5224



