
Bandit Algorithms for Website Optimization

By John Myles White. Published by O'Reilly Media, Inc.
  1. Bandit Algorithms for Website Optimization
  2. Preface
    1. Finding the Code for This Book
    2. Dealing with Jargon: A Glossary
    3. Conventions Used in This Book
    4. Using Code Examples
    5. Safari® Books Online
    6. How to Contact Us
    7. Acknowledgments
  3. 1. Two Characters: Exploration and Exploitation
    1. The Scientist and the Businessman
      1. Cynthia the Scientist
      2. Bob the Businessman
      3. Oscar the Operations Researcher
    2. The Explore-Exploit Dilemma
  4. 2. Why Use Multiarmed Bandit Algorithms?
    1. What Are We Trying to Do?
    2. The Business Scientist: Web-Scale A/B Testing
  5. 3. The epsilon-Greedy Algorithm
    1. Introducing the epsilon-Greedy Algorithm
    2. Describing Our Logo-Choosing Problem Abstractly
      1. What’s an Arm?
      2. What’s a Reward?
      3. What’s a Bandit Problem?
    3. Implementing the epsilon-Greedy Algorithm
    4. Thinking Critically about the epsilon-Greedy Algorithm
  6. 4. Debugging Bandit Algorithms
    1. Monte Carlo Simulations Are Like Unit Tests for Bandit Algorithms
    2. Simulating the Arms of a Bandit Problem
    3. Analyzing Results from a Monte Carlo Study
      1. Approach 1: Track the Probability of Choosing the Best Arm
      2. Approach 2: Track the Average Reward at Each Point in Time
      3. Approach 3: Track the Cumulative Reward at Each Point in Time
    4. Exercises
  7. 5. The Softmax Algorithm
    1. Introducing the Softmax Algorithm
    2. Implementing the Softmax Algorithm
    3. Measuring the Performance of the Softmax Algorithm
    4. The Annealing Softmax Algorithm
    5. Exercises
  8. 6. UCB – The Upper Confidence Bound Algorithm
    1. Introducing the UCB Algorithm
    2. Implementing UCB
    3. Comparing Bandit Algorithms Side-by-Side
    4. Exercises
  9. 7. Bandits in the Real World: Complexity and Complications
    1. A/A Testing
    2. Running Concurrent Experiments
    3. Continuous Experimentation vs. Periodic Testing
    4. Bad Metrics of Success
    5. Scaling Problems with Good Metrics of Success
    6. Intelligent Initialization of Values
    7. Running Better Simulations
    8. Moving Worlds
    9. Correlated Bandits
    10. Contextual Bandits
    11. Implementing Bandit Algorithms at Scale
  10. 8. Conclusion
    1. Learning Life Lessons from Bandit Algorithms
    2. A Taxonomy of Bandit Algorithms
    3. Learning More and Other Topics
  11. Colophon
  12. Copyright

Chapter 5. The Softmax Algorithm

Introducing the Softmax Algorithm

If you’ve completed the exercises for Chapter 4, you should have discovered an obvious problem with the epsilon-Greedy algorithm: it explores options completely at random, without any concern for their merits. For example, in one scenario (call it Scenario A), you might have two arms, one of which rewards you 10% of the time and the other 13% of the time. In Scenario B, the two arms might reward you 10% of the time and 99% of the time. In both of these scenarios, the probability that the epsilon-Greedy algorithm explores the worse arm is exactly the same (it’s epsilon / 2), even though the inferior arm in Scenario B is, in relative terms, far worse than the inferior arm in Scenario A.
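To make this concrete, here is a minimal simulation sketch (not the book’s code) assuming two arms and epsilon = 0.1: the rate at which the worse arm gets chosen during exploration comes out to roughly epsilon / 2, and nothing in the calculation depends on whether we are in Scenario A or Scenario B.

    import random

    # Minimal sketch, assuming epsilon = 0.1 and exactly two arms.
    epsilon = 0.1

    def explores_worse_arm():
        # Explore with probability epsilon; when exploring, each of the
        # two arms is picked uniformly at random, so the worse arm gets
        # half of the exploration steps.
        return random.random() < epsilon and random.random() < 0.5

    n = 100000
    rate = sum(explores_worse_arm() for _ in range(n)) / n
    print(rate)  # approximately epsilon / 2 = 0.05 in Scenario A and in Scenario B

The worse arm’s reward rate never enters the calculation, which is exactly the behavior described above.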

This is a problem for several reasons:

  • If the difference in reward rates between the two arms is small, you’ll need to explore a lot more often than 10% of the time (the exploration rate you get with epsilon = 0.1) to correctly determine which of the two options is actually better.
  • In contrast, if the difference is large, you need to explore a lot less often than 10% of the time to correctly identify the better option, so you’ll end up losing a lot of reward by continuing to explore an unambiguously inferior option. When we first described the epsilon-Greedy algorithm, we said that we wouldn’t set epsilon = 1.0 precisely so that we wouldn’t waste time on inferior options; but, if the difference between the two arms is large enough, we end up wasting plays on a clearly inferior option anyway, because the algorithm explores without regard to how bad an arm looks. A sketch of the value-weighted alternative this chapter develops appears after this list.

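The fix the rest of this chapter develops is to make the probability of trying an arm depend on its estimated value. As a preview, here is a minimal, hedged sketch of softmax-style arm selection, not the implementation given later in the chapter; the temperature value of 0.1 is an arbitrary choice for illustration.

    import math
    import random

    def softmax_select(values, temperature=0.1):
        # Weight each arm in proportion to exp(estimated value / temperature),
        # so clearly inferior arms are explored rarely while close calls are
        # explored often. Smaller temperatures behave more greedily.
        weights = [math.exp(v / temperature) for v in values]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Draw one arm index according to these probabilities.
        threshold, cumulative = random.random(), 0.0
        for arm, p in enumerate(probs):
            cumulative += p
            if cumulative > threshold:
                return arm
        return len(values) - 1

    # Scenario A (10% vs. 13%): the weaker arm is chosen roughly 43% of the time.
    print(softmax_select([0.10, 0.13]))
    # Scenario B (10% vs. 99%): the weaker arm is chosen well under 1% of the time.
    print(softmax_select([0.10, 0.99]))

Under this weighting, how much an arm gets explored depends on how bad it currently looks, which is precisely the property the epsilon-Greedy algorithm lacks.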