Skip to content

Latest commit

 

History

History
24 lines (17 loc) · 2.61 KB

README.md

File metadata and controls

24 lines (17 loc) · 2.61 KB

Bandit Algorithms

Multi-arm bandit algorithms are fundamental to enabling recommendation systems, online advertising, clinical trials, online resource allocation and online search. Here is a small collection of them in Python Implementation: ETC, E-Greedy, Elimination, UCB, Exp3, LinearUCB, and Thompson Sampling.

Vis-a-vis Bandits used in Reinforcement Learning settings, three key assumptions apply:

  • Reward observed only conrrespond to the action taken (feedback).
  • Action does not alter the environment.
  • Taking an action does not restrict future actions.

Furthermore, these algorithms differ in their requirement of the time horizon n and suboptimality gap \Delta, and in their scaling of regret bounds.

Algorithm Environment Requirement Regret Scaling
Explore-then-Commit (ETC) Unstructured / Stochastic \Delta, n R_{n} = \frac{4}{n}  ln(n) + C
Epsilon-Greedy (E-Greedy) Unstructured / Stochastic \Delta R_{n} = \frac{1}{\Delta }  ln(n) + C
Elimination Unstructured / Stochastic n R_{n} = \frac{C_{1}}{\Delta }  ln(n) + C_{2}
Upper-Confidence Bound (UCB) - Infinite Horizon Unstructured / Stochastic --- R_{n} = \frac{2}{\Delta }  ln(n)
Exponantiate-Explore-Exploit (Exp3) - Partial feedback Adversarial / Non-stochastic --- R_{n} = \sqrt{2nk  ln(k)}
Linear UCB (LinUCB) Contextual / Linear --- R_{n} = Cd \sqrt{n} ln(nL)
Thompson Sampling Structured --- R_{n} = ln(n)

For detailed background of each algorithm, refer to Bandit Algorithms (2020) by Lattimore & Szepesvari.