Beating the Adaptive Bandit with High Probability

Jacob Abernethy and Alexander Rakhlin

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2009-10
January 22, 2009

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-10.pdf

We provide a principled way of proving $\tilde{O}(\sqrt{T})$ high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given such an algorithm as a black box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability $\tilde{O}(\sqrt{T})$ bound holds whenever the black box, the sampling scheme, and the estimates of the missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an $\tilde{O}(\sqrt{T})$ high-probability bound. We also derive the result for the $n$-simplex, improving the $O(\sqrt{nT\log(nT)})$ bound of Auer et al. by replacing the $\log T$ term with $\log\log T$ and closing the gap to the lower bound of $\Omega(\sqrt{nT})$. While $\tilde{O}(\sqrt{T})$ high-probability bounds should hold for general decision sets through our main result, the construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al.), and the algorithms are efficient, provided that the linear upper bounds on confidence can be computed.
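To illustrate the bandit-to-full-information reduction described above, the sketch below shows the classical scheme for the $n$-simplex: at each round, sample an arm from the current distribution, form an importance-weighted estimate of the unobserved loss vector, and feed the estimate to a full-information black box (here a simple multiplicative-weights update). This is a minimal illustration of the generic reduction, not the report's algorithm, which additionally incorporates linear upper bounds on confidence intervals to obtain high-probability guarantees.

```python
import math
import random

def bandit_reduction_sketch(losses, eta=0.1):
    """Bandit-to-full-information reduction on the n-simplex (sketch).

    `losses` is a T x n sequence of adversarial loss vectors in [0, 1];
    only the sampled arm's loss is revealed to the learner. Returns the
    learner's cumulative observed loss.
    """
    n = len(losses[0])
    weights = [1.0] * n
    total_loss = 0.0
    for loss_vec in losses:
        z = sum(weights)
        probs = [w / z for w in weights]
        # sampling scheme: draw an arm from the current distribution
        arm = random.choices(range(n), weights=probs)[0]
        observed = loss_vec[arm]  # bandit feedback: one coordinate only
        total_loss += observed
        # unbiased estimate of the full loss vector (importance weighting)
        est = [0.0] * n
        est[arm] = observed / probs[arm]
        # full-information black box: multiplicative-weights update
        weights = [w * math.exp(-eta * e) for w, e in zip(weights, est)]
    return total_loss
```

Running this on a sequence in which one fixed arm always has zero loss, the sampling distribution quickly concentrates on that arm, so the cumulative loss grows much more slowly than the number of rounds.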


BibTeX citation:

@techreport{Abernethy:EECS-2009-10,
    Author = {Abernethy, Jacob and Rakhlin, Alexander},
    Title = {Beating the Adaptive Bandit with High Probability},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2009},
    Month = {Jan},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-10.html},
    Number = {UCB/EECS-2009-10},
    Abstract = {We provide a principled way of proving $\tilde{O}(\sqrt{T})$ high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given such an algorithm as a black box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability $\tilde{O}(\sqrt{T})$ bound holds whenever the black box, the sampling scheme, and the estimates of the missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an $\tilde{O}(\sqrt{T})$ high-probability bound. We also derive the result for the $n$-simplex, improving the $O(\sqrt{nT\log(nT)})$ bound of Auer et al. by replacing the $\log T$ term with $\log\log T$ and closing the gap to the lower bound of $\Omega(\sqrt{nT})$. While $\tilde{O}(\sqrt{T})$ high-probability bounds should hold for general decision sets through our main result, the construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al.), and the algorithms are efficient, provided that the linear upper bounds on confidence can be computed.}
}

EndNote citation:

%0 Report
%A Abernethy, Jacob
%A Rakhlin, Alexander
%T Beating the Adaptive Bandit with High Probability
%I EECS Department, University of California, Berkeley
%D 2009
%8 January 22
%@ UCB/EECS-2009-10
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-10.html
%F Abernethy:EECS-2009-10