Multi-Armed Bandits at MSR-SV

This is an umbrella project for several related efforts at Microsoft
Research Silicon Valley that address various Multi-Armed Bandit (MAB)
formulations motivated by web search and ad placement. The MAB problem
is a classical paradigm in Machine Learning in which an online algorithm
chooses from a set of strategies in a sequence of trials so as to
maximize the total payoff of the chosen strategies.
The
name "multi-armed bandits" comes from a whimsical scenario in which a
gambler faces several slot machines, a.k.a. "one-armed bandits", that
look identical at first but produce different expected winnings. The
crucial issue here is the trade-off between acquiring new information (exploration)
and capitalizing on the information available so far (exploitation).
While the MAB problems have been studied extensively in Machine
Learning, Operations Research and Economics, many exciting questions are
open. One aspect that we are particularly interested in
concerns modeling and efficiently using various types of side
information that may be available to the algorithm.
Contact: Alex
Slivkins.
Research directions
- MAB
with similarity information
- MAB
in a changing environment
-
Explore-exploit tradeoff in mechanism design
-
Explore-exploit learning with limited resources
-
Risk vs. reward tradeoff in MAB
People at MSR-SVC

Ittai Abraham

Moshe Babaioff

Sreenivas Gollapudi

Nina Mishra

Rina Panigrahy

Alex Slivkins
External visitors and collaborators
Prof. Sébastien Bubeck (Princeton)
Prof. Robert Kleinberg (Cornell)
Filip Radlinski (MSR Cambridge)
Prof. Eli Upfal (Brown)
Former interns
Yogi Sharma (Cornell —>
Facebook; intern at MSR-SV in summer 2008)
Umar Syed (Princeton —> Google; intern at MSR-SV in summer 2008)
Shaddin Dughmi (Stanford —>USC; intern at MSR-SV in summer 2010)
Ashwinkumar Badanidiyuru (Cornell --> Google; intern at MSR-SV in
summer 2012)
MAB problems with similarity information
-
Multi-armed bandits in metric spaces
Robert Kleinberg, Alex Slivkins and Eli Upfal (
STOC
2008)
Abstract We
introduce a version of the stochastic MAB problem, possibly with
a very large set of arms, in which the expected payoffs obey a
Lipschitz condition with respect to a given metric space. The
goal is to minimize regret as a function of time, both in the
worst case and for 'nice' problem instances.
-
Sharp dichotomies for regret minimization in metric spaces
Robert Kleinberg and Alex Slivkins (SODA
2010)
Abstract We
focus on the connections between online learning and metric
topology. The main result that the worst-case regret is either O(log
t) or at least sqrt{t}, depending on whether the completion of the
metric space is compact and countable. We prove a number of other
dichotomy-style results, and extend them to the full-feedback
(experts) version.
-
Learning optimally diverse rankings over large document collections
Alex Slivkins, Filip Radlinski and Sreenivas Gollapudi (ICML
2010)
Abstract We
present a learning-to-rank framework for web search that
incorporates similarity and correlation between documents and thus,
unlike prior work, scales to large document collections.
-
Contextual bandits with similarity
information
Alex Slivkins (COLT
2011)
Abstract In
the 'contextual bandits' setting, in each round nature reveals a
'context' x, algorithm chooses an 'arm' y, and the expected payoff
is µ(x,y). Similarity info is expressed by a metric space over the
(x,y) pairs such that µ is a Lipschitz function. Our algorithms are
based on adaptive (rather than uniform) partitions of the metric
space which are adjusted to the popular and high-payoff regions.
-
Multi-armed bandits on implicit metric
spaces
Alex Slivkins (NIPS
2011)
Abstract Suppose
an MAB algorithm is given a tree-based classification of arms. This
tree implicitly defines a "similarity distance" between arms, but
the numeric distances are not revealed to the algorithm. Our
algorithm (almost) matches the best known guarantees for the setting
(Lipschitz MAB) in which the distances are revealed.
MAB problems in a changing environment
-
Adapting to a stochastically changing
environment
Alex Slivkins and Eli Upfal (COLT
2008)
Abstract We
study a version of the stochastic MAB problem in which the expected
reward of each arm evolves stochastically and gradually in time,
following an independent Brownian motion or a similar process. Our
benchmark is a hypothetical policy that chooses the best arm in each
round.
-
Adapting to the Shifting Intent of Search
Queries
Umar Syed, Alex Slivkins and Nina Mishra (NIPS
2009)
Abstract Query
intent may shift over time. A classifier can use the available
signals to predict a shift in intent. Then a bandit algorithm can be
used to find the new relevant results. We present a meta-algorithm
that combines such classifier with a bandit algorithm in a feedback
loop.
-
Contextual bandits with similarity
information
Alex Slivkins (COLT
2011)
Abstract Interpreting
the current time as a part of the contextual information, we obtain
a very general bandit framework that (in addition to similarity
between arms and contexts) can include slowly changing payoffs and
variable sets of arms.
-
The best of both worlds: stochastic and
adversarial bandits
Sebastien Bubeck and Alex Slivkins (COLT
2012)
Abstract We
present a new bandit algorithm whose regret is optimal both for
adversarial rewards and for stochastic rewards, achieving, resp.,
square-root regret and polylog regret.
Explore-exploit tradeoff in mechanism design
-
Characterizing truthful multi-armed bandit mechanisms
Moshe Babaioff, Alex Slivkins and Yogi Sharma (
EC
2009)
Abstract We
consider a multi-round auction setting motivated by
pay-per-click auctions in the Internet advertising, which can be
viewed as a strategic version of the MAB problem. We investigate
how the design of MAB algorithms is affected by the restriction
of truthfulness. We show striking differences in terms of both
the structure of an algorithm and its regret.
-
Truthful mechanisms with implicit payment computation
Moshe Babaioff, Robert Kleinberg and Alex Slivkins (
EC
2010 Best
Paper Award)
Abstract We
show that any single-parameter, monotone allocation function can
be truthfully implemented using a single call to this function.
We apply this to MAB mechanisms.
-
-
Multi-parameter mechanisms with implicit payment computation
Moshe Babaioff, Robert Kleinberg and Alex Slivkins (EC
2013)
Abstract We generalize
the main result of the EC'10 paper to the multi-parameter setting.
We apply this to a natural multi-parameter extension of MAB
mechanisms.
Explore-exploit learning with limited resources
-
Dynamic pricing with limited supply
Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg and Alex
Slivkins (
EC
2012)
Abstract We
consider dynamic pricing with limited supply and unknown demand
distribution. We extend MAB techniques to the limited supply
setting, and obtain optimal regret rates.
-
Bandits with Knapsacks
Ashwinkumar Badanidiyuru, Robert Kleinberg and Alex Slivkins (FOCS
2013)
Abstract We
define a broad class of explore-exploit problems with knapsack-style
resource constraints, which subsumes dynamic pricing, dynamic
procurement, pay-per-click ad allocation, and many other problems.
Our algorithms achieve optimal regret w.r.t. the optimal dynamic
policy.
Risk vs. reward trade-off in MAB