Policy iteration

In this section, we are going to analyze a strategy for finding an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy, π; we can define the intrinsic value of a state, s_t, as the expected discounted return obtained by the agent starting from s_t and following the stochastic policy, π:

V^{\pi}(s_t) = E_{\pi}\left[\sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1} \,\Big|\, s_t\right]

In this case, we are assuming that, as the agent will follow π, state s_a is more useful than s_b if the expected return starting from s_a is greater than the one starting from s_b.
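To make the procedure concrete, the following is a minimal NumPy sketch of policy iteration on a small, randomly generated finite MDP (it is not the book's code). The transition tensor P, the reward tensor R, the state and action counts, and the discount factor gamma are all illustrative assumptions; the policy is kept deterministic, since the greedy improvement step always yields a deterministic policy:

import numpy as np

# Hypothetical finite MDP: P[s, a, s'] are transition probabilities and
# R[s, a, s'] are expected rewards; both are assumed to be fully known.
n_states, n_actions = 5, 2
gamma = 0.9

rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)   # normalize rows into valid distributions
R = rng.random((n_states, n_actions, n_states))

def policy_evaluation(policy, tol=1e-8):
    # Iteratively compute V^pi(s) for a deterministic policy (array of actions)
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    # Alternate evaluation and greedy improvement until the policy is stable
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(policy)
        # Greedy improvement: one-step lookahead Q(s, a), then argmax over a
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

optimal_policy, optimal_V = policy_iteration()
print(optimal_policy, optimal_V)

Because the MDP is finite and gamma < 1, the evaluation loop converges and the improvement step can only switch between a finite number of policies, so the outer loop terminates with a greedy (and therefore optimal) policy.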
