By Chakraborty, Debraj
Jury president: Filiot, Emmanuel
Supervisor: Raskin, Jean-François
Publication: Unpublished, 2022-12-20
Doctoral thesis
Abstract: We study how to efficiently combine techniques from formal methods and machine learning for the online computation of a strategy that aims at optimizing the expected long-term reward in large systems modelled as Markov decision processes (MDPs). This strategy is computed with a receding horizon using Monte Carlo tree search (MCTS). The MCTS algorithm is augmented with the notion of advice, which guides the search towards the relevant part of the tree using exact methods. We show that the classical theoretical guarantees of Monte Carlo tree search are maintained after this augmentation. To lower the latency of MCTS with advice, we propose replacing the advice computed by exact algorithms with an artificial neural network trained in an expert-imitation framework. To demonstrate the practical interest of these techniques, we implement them on several systems modelled as MDPs: the games of Pac-Man and Frozen Lake, and the safe and optimal scheduling of jobs in a task system.
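To make the idea of advice-guided MCTS concrete, the following is a minimal sketch, not the thesis's implementation: a UCT-style search on a toy line-shaped MDP in which a hypothetical `advice` function prunes the actions considered during expansion. All names (`Node`, `mcts`, `advice`, the toy dynamics) are illustrative assumptions.

```python
import math
import random

# Toy MDP (assumption, for illustration): states 0..N on a line.
# Action +1 moves right, -1 moves left; reward 1 on reaching the goal state N.
N = 5    # goal state
H = 10   # receding horizon

def actions(state):
    return [+1, -1]

def step(state, action):
    nxt = max(0, min(N, state + action))
    return nxt, (1.0 if nxt == N else 0.0), nxt == N

def advice(state):
    # Hypothetical advice: an exact method would compute which actions are
    # safe/relevant; here it simply keeps only the move towards the goal.
    return [+1]

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0

def rollout(state, depth, rng):
    """Random simulation from `state` for at most `depth` steps."""
    total = 0.0
    for _ in range(depth):
        state, r, done = step(state, rng.choice(actions(state)))
        total += r
        if done:
            break
    return total

def mcts(root_state, iterations=200, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node, path, depth = root, [root], 0
        # Selection: standard UCT descent over existing children.
        while node.children and depth < H:
            a = max(node.children, key=lambda a: (
                node.children[a].value / (node.children[a].visits + 1e-9)
                + c * math.sqrt(math.log(node.visits + 1)
                                / (node.children[a].visits + 1e-9))))
            node = node.children[a]
            path.append(node)
            depth += 1
        # Expansion: only actions permitted by the advice are added,
        # which is how the advice steers the search towards relevant subtrees.
        if depth < H:
            for a in advice(node.state):
                if a not in node.children:
                    node.children[a] = Node(step(node.state, a)[0])
            node = node.children[rng.choice(advice(node.state))]
            path.append(node)
            depth += 1
        # Simulation + backpropagation.
        value = rollout(node.state, H - depth, rng)
        for n in path:
            n.visits += 1
            n.value += value
    # Recommend the most visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

best = mcts(0)
```

Because the advice prunes expansion to the goal-directed action, the search recommends `+1` from state 0; replacing `advice` with a learned classifier (the neural network mentioned in the abstract) keeps the same search skeleton while lowering latency.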