The algorithms we’ve presented so far have one systematic weakness: they don’t keep track of how much they know about any of the arms available to them. They pay attention only to how much reward they’ve gotten from the arms. This means that they’ll underexplore options whose initial experiences were not rewarding, even though they don’t have enough data to be confident about those arms. We can do better by using an algorithm that pays attention to not only what it knows, but also how much it knows.
The algorithm, UCB, that we’ll present in this chapter does exactly this. Before we describe how the UCB algorithm keeps track of how much it knows, let’s look back at the epsilon-Greedy and Softmax algorithms and take a more abstract perspective on them. Both the epsilon-Greedy algorithm and the Softmax algorithm share the following broad properties:
The algorithm sometimes decides to explore and chooses an option that isn’t the one that currently seems best: