If you’ve completed the exercises for Chapter 2, you should have discovered that there’s an obvious problem with the epsilon-Greedy algorithm: it explores options completely at random without any concern about their merits. For example, in one scenario (call it Scenario A), you might have two arms, one of which rewards you 10% of the time and the other rewards you 13% of the time. In Scenario B, the two arms might reward you 10% of the time and 99% of the time. In both of these scenarios, the probability that the epsilon-Greedy algorithm explores the worse arm is exactly the same (it’s
epsilon / 2), despite the inferior arm in Scenario B being, in relative terms, much worse than the inferior arm in Scenario A.
This is a problem for several reasons:
epsilon = 1.0precisely so that we wouldn’t waste time on inferior options, but, if the difference between two arms is large enough, we end up wasting ...