By Roijers, Diederik D.M.; Zintgraf, Luisa L.M.; Nowe, Ann
Reference: Lecture Notes in Computer Science, 10576 LNAI, pages 18-34
Publication: Published, 2017
Peer-reviewed article
Abstract: In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown user utility functions, based only on the stochastic reward vectors. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn directly about its user's utility function. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that their regret closely approximates that of UCB and regular Thompson sampling when those are provided with the user's ground-truth utility function from the start, and that ITS outperforms umap-UCB.
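
To make the interactive setting concrete, the following is a minimal sketch of an ITS-style loop, assuming a linear utility function u(v) = w·v, Gaussian reward vectors, and a particle approximation of the posterior over the utility weights updated from simulated pairwise preference queries. All names, constants, and modeling choices here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MOMAB instance: K arms, D-dimensional Gaussian reward vectors.
K, D = 5, 2
true_means = rng.uniform(0.0, 1.0, size=(K, D))  # unknown to the agent
true_w = np.array([0.7, 0.3])                    # ground-truth utility weights (unknown)

def pull(arm):
    """Stochastic reward vector for an arm."""
    return true_means[arm] + rng.normal(0.0, 0.1, size=D)

def user_prefers(v1, v2):
    """Simulated user: compares two reward vectors under the true utility."""
    return true_w @ v1 > true_w @ v2

# Simplified Gaussian posterior over each arm's mean reward vector.
counts = np.ones(K)          # pseudo-count prior
sums = np.zeros((K, D))      # running reward sums

# Particle posterior over the utility weights w (non-negative, summing to 1).
P = 500
particles = rng.dirichlet(np.ones(D), size=P)
weights = np.ones(P) / P

T, query_every = 1000, 25
for t in range(T):
    # Thompson step: sample an environment and a utility function, act greedily.
    sampled = sums / counts[:, None] + rng.normal(0.0, 1.0, (K, D)) / np.sqrt(counts)[:, None]
    w = particles[rng.choice(P, p=weights)]
    arm = int(np.argmax(sampled @ w))
    reward = pull(arm)
    counts[arm] += 1
    sums[arm] += reward

    # Interaction step: occasionally ask the user to compare two posterior means.
    if t % query_every == 0:
        a, b = rng.choice(K, size=2, replace=False)
        va, vb = sums[a] / counts[a], sums[b] / counts[b]
        if not user_prefers(va, vb):
            va, vb = vb, va
        # Reweight particles by a logistic likelihood of the observed preference.
        weights *= 1.0 / (1.0 + np.exp(-10.0 * (particles @ (va - vb))))
        weights /= weights.sum()
        if 1.0 / np.sum(weights**2) < P / 2:  # resample on low effective sample size
            idx = rng.choice(P, size=P, p=weights)
            particles = np.abs(particles[idx] + rng.normal(0.0, 0.02, size=(P, D)))
            particles /= particles.sum(axis=1, keepdims=True)
            weights = np.ones(P) / P

print("estimated utility weights:", np.average(particles, axis=0, weights=weights))
print("true utility weights:     ", true_w)
```

The particle filter here merely stands in for an exact Bayesian utility posterior; the paper's actual learners, query strategy, and user model may differ. The sketch does illustrate the abstract's core idea: a single loop that learns the environment (arm posteriors) and the user (weight posterior) simultaneously.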