Strategies to Identify the Best Dish:
Error Bounds for Best Arm Identification in Multi-Armed Bandits
Billy Fang
Princeton University
Abstract
In the multi-armed bandit model, an agent is presented with a number of hidden distributions and
in each turn chooses a distribution and observes a sample from it. This paper focuses on a particular
setup called the best arm identification problem, in which the agent has a fixed number of turns to
sample from the distributions in order to determine which distribution has the highest mean. We
present and discuss upper bounds and lower bounds for the probability of error.
Introduction
Suppose you are planning to go to the same restaurant for lunch every day this month, and each
day you order one of the dishes on the menu. The tastiness of each dish follows a hidden probability
distribution, and each time you eat a dish, the tastiness you experience is drawn from that dish’s underlying
distribution. Each day you are free to choose whichever dish you want, and can base your decision upon
your past experience. At the end of the month, you are asked to use your experience to identify the dish
with the highest expected tastiness.
This framework is a variant of the multi-armed bandit model. The name of the model is based
on the phrase “one-armed bandit,” which is slang for a slot machine; one could reformulate the above
example in terms of a set of slot machines (which we call arms) which return rewards according to hidden
distributions.
The defining characteristic of the multi-armed bandit problem is the trade-off between “exploration”
and “exploitation.” If the agent explores all the arms too much in order to gain knowledge about their
underlying distributions, he risks incurring bad rewards. On the other hand, if he exploits what he believes
to be the best arms in order to get high rewards, he risks missing a greater reward due to his limited
knowledge about the other arms.
The problem of identifying the best arm after a fixed number of trials, described above, is known as
the best arm identification problem. A different well-studied problem (which we will not consider) is that
of maximizing the cumulative reward. The main difference between these two problems is that in the latter
the performance metric (cumulative reward) is tied to the rewards incurred during the decision-making
process, while in the former the performance metric (the final identification) is separated from the “testing
phase.” For example, consider clinical trials, in which possibly negative effects on each patients must
be minimized, compared to cosmetic trials, in which poor results during the isolated testing phase are of
little consequence and only help with the final decision of which product to place on the market.
The multi-armed bandit model and the best arm identification problem can model a host of learning
problems, such as selecting the best communication channel from a set of noisy channels, or a general
reinforcement learning scenario where an agent can observe the random rewards of different actions in a
trial period in order to decide which action to ultimately select.
In this paper we will provide bounds for the probability of error in the best arm identification problem.
1 Problem statement and notation
Let ν_1, ..., ν_K be K probability distributions with respective means µ_1, ..., µ_K. Without loss of
generality we will assume that the means are ordered and that there is a unique optimal mean. That is,
    µ_1 > µ_2 ≥ µ_3 ≥ ··· ≥ µ_K,
where ties are broken arbitrarily. We will assume that for each i ∈ {1, ..., K}, the distribution ν_i is
Gaussian with variance 1 and mean µ_i in [−1, 1].
An agent, who has no knowledge of the µ_i, is given a budget of n rounds. For each round t ∈
{1, ..., n}, the agent pulls an arm I_t ∈ {1, ..., K} and observes a reward drawn from the distribution
ν_{I_t}, independently of past actions and observations. After the n rounds, the agent returns an arm J_n that
he believes corresponds to the distribution with the highest mean.
We define the gap of a suboptimal arm i ≠ 1 to be the difference between the optimal mean and its own
mean:
    ∆_i := µ_1 − µ_i.
For i = 1, we define ∆_1 := ∆_2. In particular, we then have
    ∆_1 = ∆_2 ≤ ∆_3 ≤ ··· ≤ ∆_K ≤ 2,
where the last inequality holds because the µ_i lie in [−1, 1].
We let T_i(t) denote the number of times arm i was chosen in rounds 1 to t, and we let X_{i,1}, X_{i,2}, ..., X_{i,T_i(t)}
be the corresponding sequence of rewards observed from choosing arm i. The empirical mean of arm i
after s pulls is denoted by
    µ̂_{i,s} := (1/s) Σ_{t=1}^{s} X_{i,t}.
To assess the success of the agent’s choice J_n, we define the probability of error by
    e_n := P(J_n ≠ 1).
We would like to find policies (i.e., strategies) for the agent that ensure a small probability of error. We
will only consider deterministic policies; that is, under a given policy and given past observations, the
agent’s choice of action is fixed.
The bounds that we find for the probability of error depend on the particular multi-armed bandit.
Intuitively, if the suboptimal arms have means close to the best mean (i.e., the ∆_i are small), it is difficult
to distinguish the best arm from the others; likewise, if the ∆_i are large, the task is easier. We define two
measures of hardness that capture this notion.
    H_1 := Σ_{i=1}^{K} 1/∆_i²    and    H_2 := max_{i ∈ {1,...,K}} i/∆_i².
These two quantities are actually equivalent up to a logarithmic factor (Audibert and Bubeck, 2010):
    H_2 ≤ H_1 ≤ \overline{log}(K) H_2 ≤ log(2K) H_2,
where \overline{log}(K) := 1/2 + Σ_{i=2}^{K} 1/i. We will later see that these quantities are indeed appropriate measures of hardness.
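As a quick illustration (not part of the original analysis; the toy means and the helper name are ours), the following Python sketch computes H_1 and H_2 for a small bandit, assuming a unique best arm:

```python
import numpy as np

def hardness(means):
    """Compute the hardness measures H1 and H2 from a vector of arm means."""
    mu = np.sort(np.asarray(means, dtype=float))[::-1]   # order so that mu[0] is the best mean
    gaps = mu[0] - mu                                     # gaps Delta_i = mu_1 - mu_i
    gaps[0] = gaps[1]                                     # convention Delta_1 := Delta_2
    H1 = np.sum(1.0 / gaps ** 2)
    H2 = np.max(np.arange(1, len(mu) + 1) / gaps ** 2)   # max_i  i / Delta_i^2
    return H1, H2

H1, H2 = hardness([1.0, 0.9, 0.8, -1.0])
print(H1, H2)   # for this bandit, H2 <= H1 <= log(2K) * H2 as claimed
```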
2 Upper bounds for best arm identification
2.1 Uniform allocation
We begin by presenting a naïve policy known as uniform allocation, which simply pulls each arm n/K
times and returns the arm with the highest empirical mean.
Theorem 2.1. Under the uniform allocation policy, the probability of error in the best arm identification
problem satisfies
    e_n ≤ Σ_{i=2}^{K} exp(−n ∆_i²/(2K)) ≤ K exp(−n ∆_2²/(2K)).
Proof. If J_n ≠ 1, then the empirical mean of arm J_n was higher than that of arm 1. Therefore,
    {J_n ≠ 1} ⊂ ∪_{i=2}^{K} {µ̂_{1,n/K} ≤ µ̂_{i,n/K}}.
Applying a union bound and a Hoeffding-type bound (Theorem A.1) produces the stated inequality:
    e_n := P(J_n ≠ 1)
        ≤ Σ_{i=2}^{K} P(µ̂_{1,n/K} ≤ µ̂_{i,n/K})                              (union bound)
        = Σ_{i=2}^{K} P(µ̂_{i,n/K} − µ̂_{1,n/K} − (µ_i − µ_1) ≥ ∆_i)
        ≤ Σ_{i=2}^{K} exp(−n ∆_i²/(2K))                                      (Theorem A.1)
        ≤ K exp(−n ∆_2²/(2K)). ∎
We can rephrase the result of this theorem as follows. To ensure that e_n < δ for some δ > 0, we
require the budget n to satisfy
    n ≥ (2K/∆_2²) log(K/δ).
The fact that this bound only depends on ∆_2 is unsatisfying, since it does not take into account how
suboptimal the other arms are. For instance, this policy has the same type of performance on the two
sets of arm means {1, 0.9, 0.8} and {1, 0.9, −1}, because ∆_2 = 0.1 in both cases, even though in the
latter case it is much easier to distinguish the worst arm from the rest. This shortcoming is due to uniform
allocation’s inability to adapt its strategy to the observed rewards.
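To make the policy concrete, here is a minimal simulation sketch (our own code, not from the original text), assuming K divides n and Gaussian rewards with unit variance; the function name and the toy means are ours:

```python
import numpy as np

def uniform_allocation(means, n, seed=0):
    """Pull each of the K arms n // K times and return the empirically best arm (0-indexed)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    pulls = n // K
    # rewards[i, t] is the t-th reward from arm i, drawn from N(means[i], 1)
    rewards = rng.normal(loc=np.asarray(means)[:, None], scale=1.0, size=(K, pulls))
    return int(np.argmax(rewards.mean(axis=1)))

# The two bandits below have the same gap Delta_2 = 0.1, so the bound above treats them identically.
print(uniform_allocation([1.0, 0.9, 0.8], n=300))
print(uniform_allocation([1.0, 0.9, -1.0], n=300))
```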
2.2 Successive Rejects
An issue with the uniform allocation strategy is that it does not adapt its behavior upon observing the
rewards and only checks them after exhausting the budget. If there is an arm that is extremely suboptimal,
this strategy will waste turns pulling this arm even after observing that it returns very suboptimal rewards.
The following algorithm addresses this issue.
In the Successive Rejects (SR) algorithm (Audibert and Bubeck, 2010), the budget of n rounds is
divided into K − 1 phases. The agent keeps track of an “active set” of arms that initially contains all the
arms. In each phase, he tries all the arms in the active set equally often, and eliminates the arm with the
lowest empirical mean. By the end of the last phase, he will have eliminated all but one arm, which he
will return as J� . The procedure is intuitive, since the agent needs to spend more time trying the arms
that are closer to the optimal arm in order to properly determine which one is the best.
The lengths of the phases are chosen carefully to ensure a good bound. In the notation below, n_k
denotes the number of times each arm still active during phase k has been pulled by the end of that phase;
in particular, the arm eliminated at the end of phase k has been pulled n_k times, and the two arms that
stay in the active set through all the rounds are each pulled n_{K−1} times. Moreover, in the kth phase, each
of the active arms is pulled n_k − n_{k−1} times, so the length of the kth phase is (n_k − n_{k−1})(K + 1 − k),
where we have defined n_0 := 0. Below, A_k denotes the active set during phase k.
Successive Rejects algorithm
Let A_1 := {1, ..., K}, \overline{log}(K) := 1/2 + Σ_{i=2}^{K} 1/i, and n_0 := 0. For k ∈ {1, ..., K − 1}, let
    n_k := ⌈ (1/\overline{log}(K)) · (n − K)/(K + 1 − k) ⌉.
In each phase k = 1, ..., K − 1:
    – Pull each active arm i ∈ A_k for n_k − n_{k−1} rounds.
    – Let A_{k+1} be the result of removing arg min_{i ∈ A_k} µ̂_{i,n_k} from A_k (ties broken arbitrarily).
Let J_n be the unique element of A_K, and return it.
Note that the number of rounds does not exceed the budget:
    n_1 + n_2 + ··· + n_{K−1} + n_{K−1} ≤ K + ((n − K)/\overline{log}(K)) · (1/2 + Σ_{k=1}^{K−1} 1/(K + 1 − k)) = n.
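Before stating the error bound, here is a minimal sketch of the procedure (our own code, not from the original text), assuming Gaussian rewards with unit variance and n > K; the function name and the toy means are ours:

```python
import numpy as np

def successive_rejects(means, n, seed=0):
    """Successive Rejects on Gaussian arms with unit variance; returns the surviving arm (0-indexed)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))        # overline{log}(K)

    def n_k(k):
        return int(np.ceil((n - K) / (log_bar * (K + 1 - k))))

    active = list(range(K))
    counts = np.zeros(K, dtype=int)   # number of pulls of each arm so far
    sums = np.zeros(K)                # running sums of rewards
    prev = 0
    for k in range(1, K):             # phases k = 1, ..., K - 1
        for i in active:
            draws = rng.normal(means[i], 1.0, size=n_k(k) - prev)
            counts[i] += draws.size
            sums[i] += draws.sum()
        prev = n_k(k)
        # eliminate the active arm with the lowest empirical mean (ties broken arbitrarily)
        worst = min(active, key=lambda j: sums[j] / counts[j])
        active.remove(worst)
    return active[0]

print(successive_rejects([1.0, 0.9, 0.8, -1.0], n=400))
```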
Theorem 2.2. Under the Successive Rejects algorithm, the probability of error in the best arm identification
problem satisfies
    e_n ≤ (K(K − 1)/2) exp(−(n − K)/(2\overline{log}(K) H_2)).
Proof. At the beginning of the kth phase, we will have already eliminated k − 1 arms, so at least one of
the worst k arms will still be in the active set. Therefore, if the optimal arm is eliminated at the end of
the kth phase, then we must have
    µ̂_{1,n_k} = min_{i ∈ A_k} µ̂_{i,n_k} ≤ max_{i ∈ {K+1−k, ..., K}} µ̂_{i,n_k}.
If we let E_k denote the event that arm 1 was eliminated in phase k, then by what we have just shown,
    E_k ⊂ ∪_{i=K+1−k}^{K} {µ̂_{1,n_k} ≤ µ̂_{i,n_k}}.
Therefore,
    e_n := P(A_K ≠ {1})
        ≤ Σ_{k=1}^{K−1} P(E_k)                                                      (union bound)
        ≤ Σ_{k=1}^{K−1} Σ_{i=K+1−k}^{K} P(µ̂_{1,n_k} ≤ µ̂_{i,n_k})                    (union bound)
        = Σ_{k=1}^{K−1} Σ_{i=K+1−k}^{K} P(µ̂_{i,n_k} − µ̂_{1,n_k} − (µ_i − µ_1) ≥ ∆_i)
        ≤ Σ_{k=1}^{K−1} Σ_{i=K+1−k}^{K} exp(−n_k ∆_i²/2)                             (Theorem A.1)
        ≤ Σ_{k=1}^{K−1} k exp(−n_k ∆_{K+1−k}²/2).
To conclude the proof, note that by the definition of n_k and of H_2 (which gives (K + 1 − k) ∆_{K+1−k}^{−2} ≤ H_2), we have
    n_k ∆_{K+1−k}² ≥ (1/\overline{log}(K)) · (n − K)/(K + 1 − k) · ∆_{K+1−k}² ≥ (n − K)/(\overline{log}(K) H_2).
Plugging this into the last display and using Σ_{k=1}^{K−1} k = K(K − 1)/2 yields the stated bound. ∎
We can again rephrase the result of this theorem as follows. To ensure e_n < δ for some δ > 0, it
suffices that the budget n satisfy
    n ≥ 2\overline{log}(K) H_2 log(K(K − 1)/(2δ)) + K.
Since H_2 ≤ H_1 ≤ \overline{log}(K) H_2, this requirement is, up to a logarithmic factor in K, of order
H_1 log(K(K − 1)/(2δ)). If we compare this with the analogous result for uniform allocation, we see that
(up to logarithmic factors) H_1 has replaced K/∆_2². But since ∆_2 ≤ ∆_i for all i ∈ {1, ..., K}, we have
H_1 ≤ K/∆_2². This essentially captures the reason why the SR algorithm performs better than uniform
allocation. For all bandits that have the same gap ∆_2, uniform allocation has the same performance,
regardless of whether arms i ≥ 3 are extremely suboptimal or extremely close to optimal; however, SR
takes into account all gaps, and performs differently depending on the value of H_1. Note that we have
equality H_1 = K/∆_2² when ∆_2 = ∆_3 = ··· = ∆_K, which is precisely when SR does not have the
advantage of rejecting suboptimal arms.
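As a toy illustration (the numbers here are ours, not from the original analysis), take K = 10 arms with means either {1, 0.9, 0.9, ..., 0.9} or {1, 0.9, −1, ..., −1}. In both cases ∆_2 = 0.1, so K/∆_2² = 1000 and the uniform allocation bound is identical for the two bandits. In contrast,
    H_1 = 10 · (0.1)^{−2} = 1000   for the first bandit,   and   H_1 = 2 · (0.1)^{−2} + 8 · 2^{−2} = 202   for the second,
so the SR bound asks for roughly a fifth of the budget on the second bandit, where eight of the arms are easy to reject.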
3 Lower bounds for best arm identification
We would like to find a lower bound on the probability of error for a given bandit under all policies,
or in other words, a “ceiling” on the performance of any policy on a given bandit. To do this, we consider
an “oracle” agent that is given extra information about the bandit, and therefore will perform better than a
normal agent. Any lower bound on the probability of error for the oracle agent will therefore be a lower
bound for that of a normal agent.
Our motivation for finding a lower bound is twofold. First, whereas constructing good algorithms/policies
gives upper bounds for the probability of error, a lower bound provides a benchmark against which upper
bounds can be compared; note that without a lower bound we currently have no context to evaluate the
performance of the SR algorithm. Second, if we do find a lower bound that is close to the upper bound
given by the SR algorithm, we will have shown that H1 and H2 are indeed “correct” measures of hardness.
Note that our definition of lower bound is currently meaningless, since it is possible that e_n = 0.
For example, if an agent is presented with the arms (unordered), and he follows the policy that simply
identifies the third arm as best regardless of the observations, then e_n = 0 for any bandit in which the
third arm is indeed the best arm. However, this policy is clearly not “best,” since it has probability of
error equal to 1 on any other bandit. We need to redefine the notion of lower bound in a way that is
more meaningful.
A natural way to resolve this issue is to introduce an adversarial aspect. Given a bandit, we reveal all
the arm distributions to the agent. Then, given the policy that the agent chooses, the adversary considers
all bandits whose arms are a permutation of the original bandit’s arms, and chooses the permutation that
maximizes the policy’s probability of error. We seek a lower bound for this maximum, over all policies.
This avoids the issue described above because if the policy were to always identify the third arm as
best, the adversary would simply choose any permutation of the original bandit such that the third arm
is not the best, producing a probability of error equal to 1. Moreover, it is reasonable to assume that
a good policy performs similarly on any permutation of the arms (note that permutations do not change
the hardness measures H1 or H2 ). The resulting lower bound, due to Audibert and Bubeck (2010), is
comparable to the upper bound in Theorem 2.2.
Unfortunately, the proof of this bound is rather lengthy and involved, so we instead present an alternate
approach (Bubeck, private communication). Instead of considering all permutations of the arms, the
adversary considers certain “translations” of the arms, and returns the one that maximizes the given policy’s
probability of error. The resulting lower bound is also comparable to the upper bound in Theorem 2.2,
and the proof is much shorter than the previous one. What we possibly sacrifice is that the class of
translations is “farther away” from the original bandit than the class of permutations of a bandit. We will
elaborate on this below.
For simplicity we will assume that the arm distributions are Gaussian.
3.1 Lower bound
We change our definitions of the gaps and hardness measures slightly: we redefine ∆_1 := 0 and
redefine H_1 := Σ_{i=2}^{K} ∆_i^{−2} (the redefined H_1 is within a constant factor of the original, so the
change is minor). We still assume that there is a unique best arm, i.e., ∆_2 > 0.
Let ν := ν_1 ⊗ ··· ⊗ ν_K denote the bandit with arm distributions ν_1, ν_2, ..., ν_K. We define a translation
operation τ_i on ν, for each i ∈ {1, ..., K}:
    (τ_i ν)_j := ν_j + 2∆_j   if j = i,        (τ_i ν)_j := ν_j   if j ≠ i,
where ν_j + 2∆_j denotes the distribution ν_j with its support translated by +2∆_j. We see that τ_i simply fixes
everything except arm i, whose mean is translated to µ_i + 2∆_i = µ_1 + ∆_i, above that of the old best arm,
making it the new best arm. Moreover, τ_1 is the identity (since ∆_1 = 0).
Let τ_i ν denote the bandit obtained by performing the translation τ_i on the arms of the original bandit,
and let e_n(τ_i ν) and H_1(τ_i ν) be the probability of error and hardness on τ_i ν, respectively. Since each τ_i
increases the size of the gaps, we have H_1(τ_i ν) ≤ H_1(ν). This is intuitive, because translating an arm far
above all the others will make the new problem easier. We will describe the importance of this property
after presenting the proof.
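For concreteness (a toy example of our own, not from the original text): if ν has three Gaussian arms with means (µ_1, µ_2, µ_3) = (1, 0.9, 0.8), then (∆_1, ∆_2, ∆_3) = (0, 0.1, 0.2), and τ_3 shifts arm 3 up by 2∆_3 = 0.4, so that τ_3 ν has means (1, 0.9, 1.2) and arm 3 is its unique best arm, while arms 1 and 2 are untouched. The gaps of τ_3 ν are (0.2, 0.3, 0), so H_1(τ_3 ν) = (0.2)^{−2} + (0.3)^{−2} ≈ 36.1, which is indeed smaller than H_1(ν) = (0.1)^{−2} + (0.2)^{−2} = 125.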
Theorem 3.1. Let ν be a bandit whose arm distributions are ν_i ∼ N(µ_i, 1) with µ_i ∈ [−1, 1], for i ∈
{1, ..., K}. Given any policy, we have
    max_i e_n(τ_i ν) ≥ (1/4) exp(−2n/H_1(ν)).
The proof of this theorem uses results involving Kullback-Leibler divergence, which, loosely speaking,
is a type of distance between two probability distributions. We defer these results to the appendix of Fang (2014).
Proof. Recall that T_i(n) is the number of times arm i was pulled. Choose i ≠ 1 such that
    E_ν[T_i(n)] ≤ n/(∆_i² H_1(ν)).
Such an i exists because otherwise the expected number of pulls of each arm i ≠ 1 would exceed
n/(∆_i² H_1(ν)), which implies
    n = Σ_{i=1}^{K} E_ν[T_i(n)] ≥ Σ_{i=2}^{K} E_ν[T_i(n)] > (n/H_1(ν)) Σ_{i=2}^{K} 1/∆_i² = n,
a contradiction. In plain words, we are choosing a suboptimal arm i that is pulled relatively rarely in
expectation (relative to its gap), which is where the agent is most likely to make a mistake.
We let P(Y_i) denote the distribution of the n-dimensional vector Y_i of observed rewards when the algorithm
runs on the translated bandit τ_i ν. Using the chain rule for Kullback-Leibler divergence (see the appendix of
Fang (2014)), we can calculate the divergence between the reward distribution of the original bandit (translation τ_1)
and that of this translated bandit (translation τ_i for i ≠ 1). The two bandits differ only in arm i, and each pull
of arm i contributes KL(N(µ_i, 1), N(µ_i + 2∆_i, 1)) = (2∆_i)²/2 = 2∆_i², so
    KL(P(Y_1), P(Y_i)) = 2∆_i² E_ν[T_i(n)] ≤ 2n/H_1(ν),
where the last step is due to our choice of i.
We are now equipped to finish the proof. In a translated bandit τ_i ν where i ≠ 1, one way the agent
could be incorrect is if he chooses J_n = 1, because arm 1 is no longer the best. Therefore,
    2 · max_j e_n(τ_j ν) ≥ 2 · max{e_n(ν), e_n(τ_i ν)}
        ≥ e_n(ν) + e_n(τ_i ν)
        ≥ P_ν(J_n ≠ 1) + P_{τ_i ν}(J_n = 1)
        ≥ (1/2) exp(−KL(P(Y_1), P(Y_i)))          (Lemma A.2)
        ≥ (1/2) exp(−2n/H_1(ν)).
Dividing by 2 gives the claimed bound. ∎
As mentioned earlier, we know H_1(τ_i ν) ≤ H_1(ν). Therefore, the theorem implies that there exists i
such that
    e_n(τ_i ν) ≥ (1/4) exp(−2n/H_1(ν)) ≥ (1/4) exp(−2n/H_1(τ_i ν)).
So, the theorem gives a lower bound for a translated bandit in terms of the hardness of that translated
bandit. If the translation did not decrease H1 , we would not be able to arrive at such a conclusion.
As mentioned before, this lower bound is comparable to the upper bound given by the SR algorithm,
which suggests that H1 and H2 are appropriate measures of hardness.
The proof of this bound is considerably shorter than that of the lower bound by Audibert and Bubeck
(2010). Although both proofs involve a perturbation and controlling of the same quantities (number of
pulls of an arm and the Kullback-Leibler divergence), the shorter proof only works with true Kullback-Leibler divergences and expectations of the T_i(n), while the longer proof delves into realizations of random
variables and empirical estimates of the Kullback-Leibler divergence. Moreover, it is much easier to
control the two quantities when dealing with translations rather than permutations, since only one arm
is affected.
What is possibly lost is that the class of translates is “farther away” from the original bandit than
the class of permutations. Whereas it is easy to reason that a good policy should perform similarly on
any permutation of a given bandit, it is in some sense harder to justify why a good policy should perform
similarly on any translate of a given bandit.
Conclusions
In this paper, we introduced the best arm identification problem for multi-armed bandits. We described
the Successive Rejects algorithm and gave an upper bound on its probability of error on the best arm
identification problem. We also compared two approaches to finding lower bounds for the best arm
identification problem. Moreover, we noted that the resulting lower bounds are close to the upper bound,
suggesting that H1 and H2 are good measures of the complexity of the bandit.
A discussion of how to generalize these results to the more general m-best arms identification problem,
where the goal is to identify the best m arms rather than the best arm, can be found in Fang (2014). A
further extension of m-best arms identification is combinatorial identification, where the returned set of
arms must satisfy some constraint; for example, suppose the weights of edges of a connected graph follow
hidden distributions, and we want to identify the spanning tree with the highest expected weight. This
area of research is currently open.
References
Jean-Yves Audibert and Sébastien Bubeck. Best Arm Identification in Multi-Armed Bandits. In Proceedings
of the 23rd Annual Conference on Learning Theory (COLT), 2010.
Sébastien Bubeck. Private communication.
Billy Fang. Error Bounds for Identification Problems in Multi-Armed Bandits. 2014.
A Lemmas
Proofs of these lemmas and other relevant background results can be found in the appendix of Fang
(2014).
Theorem A.1 (Hoeffding-type bound). If X_1, ..., X_s are independent sub-Gaussian random variables with
means µ_t and common parameter σ², then for any ε > 0,
    P( (1/s) Σ_{t=1}^{s} (X_t − µ_t) ≥ ε ) ≤ exp(−s ε²/(2σ²)).
Lemma A.2. Let ρ_0 and ρ_1 be two probability distributions supported on some set X. Then for any
measurable function ψ : X → {0, 1},
    P_{X∼ρ_0}(ψ(X) = 1) + P_{X∼ρ_1}(ψ(X) = 0) ≥ (1/2) exp(−KL(ρ_0, ρ_1)).