Chapter 5

Policy-Gradient Algorithms

Most methods for solving MDPs evaluate a value function, which makes it possible to identify optimal actions in each state. A first difficulty for these approaches is handling large-scale problems. As shown in Chapter 3, one solution is to use methods that approximate the value function, but monotonic policy improvement is then no longer guaranteed for algorithms such as value iteration or policy iteration. A second difficulty is handling partial observations, a setting in which the Markov property no longer necessarily holds.

A very different approach is to perform a direct policy search, the policy taking the form of a parameterized controller. The choice of this controller makes it possible to adapt to the type of problem under consideration and to strike a trade-off between expressing a large variety of policies and limiting memory usage. We end up with a parameter optimization problem for which there exist algorithms that guarantee a monotonic improvement of the policy towards a local optimum, in some cases even under partial observability.
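To make the idea of direct policy search concrete, here is a minimal sketch in Python (NumPy only) of a parameterized softmax controller whose parameters are optimized with a Monte-Carlo policy-gradient (REINFORCE-style) update. The two-state toy MDP, the softmax parameterization, and the learning rate are illustrative assumptions introduced for this sketch; they are not taken from the chapter, which develops the algorithms in full later on.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2
GAMMA = 0.95

# Hypothetical toy MDP for illustration: action 1 tends to move the agent to
# state 1, where a reward of +1 is received; everything else yields 0.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 for actions 0, 1
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 1 for actions 0, 1
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward depends only on the current state

def policy(theta, s):
    """Parameterized controller: softmax over the action preferences theta[s]."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def sample_episode(theta, horizon=50):
    """Roll out the current policy for a fixed horizon, recording (s, a, r)."""
    s, traj = 0, []
    for _ in range(horizon):
        probs = policy(theta, s)
        a = rng.choice(N_ACTIONS, p=probs)
        traj.append((s, a, R[s, a]))
        s = rng.choice(N_STATES, p=P[s, a])
    return traj

def reinforce_update(theta, traj, alpha=0.05):
    """Monte-Carlo gradient step: theta[s] += alpha * G_t * grad log pi(a|s)."""
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(traj, returns):
        probs = policy(theta, s)
        grad_log = -probs
        grad_log[a] += 1.0          # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G_t * grad_log
    return theta

theta = np.zeros((N_STATES, N_ACTIONS))  # policy parameters
for episode in range(500):
    theta = reinforce_update(theta, sample_episode(theta))

print("pi(.|s=0) =", policy(theta, 0))   # the probability of action 1 should grow

The parameter array theta plays the role of the controller's memory: its size is chosen by the designer, independently of the (possibly huge) state space, which is precisely the trade-off between expressiveness and memory usage mentioned above.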

EXAMPLE 5.1. What about the cars that need maintenance, then? A garage mechanic often knows the procedures (or algorithms) to follow to diagnose and repair a vehicle, but these procedures may vary depending on the model or the age of the vehicle. Thus, it can be more cost-effective to replace the brake discs without testing them once their age exceeds a given limit. The methods we will discuss here aim at optimizing/learning ...
