A dynamic programming strategy to balance exploration and exploitation in the bandit problem

Caelen, Olivier; Bontempi, Gianluca

doi:10.1007/s10472-010-9190-1

Reference: Annals of Mathematics and Artificial Intelligence
Publication: Published, 2010

Peer-reviewed article

Abstract:

The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of "expected reward of greedy actions", which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques. © 2010 Springer Science+Business Media B.V.
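For readers unfamiliar with the terms used in the abstract, the following is a minimal sketch of a generic semi-uniform strategy (epsilon-greedy) and of a Monte Carlo estimate of the probability of correct selection. It is not the DP-greedy algorithm of the paper, whose details are not given in this record. The Gaussian rewards with unit variance, the fixed epsilon, and the function names epsilon_greedy and estimate_pcs are all assumptions made for illustration.

```python
import random


def epsilon_greedy(means, horizon, epsilon, seed=0):
    """Play a K-armed bandit with an epsilon-greedy (semi-uniform) policy.

    Assumption: arm a pays a Gaussian reward with mean means[a] and
    unit variance. Returns (total reward, number of pulls per arm).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # initialization: pull every arm once
        elif rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: uniform random arm
        else:
            # exploit: greedy action, i.e. highest sample mean so far
            arm = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        # incremental update of the sample mean of the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts


def estimate_pcs(means, samples_per_arm, trials=2000, seed=0):
    """Monte Carlo estimate of the probability of correct selection:
    the chance that the arm with the highest sample mean is truly the best,
    when every arm has been sampled samples_per_arm times."""
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda a: means[a])
    hits = 0
    for _ in range(trials):
        sample_means = [
            sum(rng.gauss(m, 1.0) for _ in range(samples_per_arm)) / samples_per_arm
            for m in means
        ]
        if max(range(k), key=lambda a: sample_means[a]) == best:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    total, counts = epsilon_greedy([0.2, 0.5, 0.8], horizon=1000, epsilon=0.1)
    print("total reward:", round(total, 1), "pulls per arm:", counts)
    print("estimated PCS:", estimate_pcs([0.2, 0.5, 0.8], samples_per_arm=10))
```

In this sketch epsilon is a fixed constant. The abstract indicates that the paper instead uses dynamic programming and estimation techniques, built on the expected reward of greedy actions, to decide how exploration and exploitation are balanced.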
