Core concept

In policy gradient methods, the algorithm performs rollouts to collect samples of transitions and (potentially) rewards, and updates the policy parameters by gradient ascent on an objective function that measures the expected return. The idea is to keep updating the parameters to improve the policy until a good policy is obtained. To improve training stability, the Trust Region Policy Optimization (TRPO) algorithm enforces a Kullback-Leibler (KL) divergence constraint on the policy updates, so that the new policy does not move too far from the old policy in a single step. TRPO was the precursor to the PPO algorithm. Let's briefly discuss the objective function used in the TRPO algorithm in order to get a better understanding of PPO. ...
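As a rough orientation before the details, TRPO maximizes a surrogate objective, the expectation of the probability ratio between the new and old policies weighted by an advantage estimate, subject to the constraint that the average KL divergence between the old and new policies stays below a small threshold (often written delta). The following minimal sketch, not taken from the book's code, illustrates these two quantities with NumPy using hypothetical categorical action probabilities for a small batch of sampled states; the batch values and the delta shown are made-up placeholders.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rollout data: action probabilities of the old policy (pi_old)
# and the current policy (pi_new) over 4 discrete actions, the actions that
# were actually taken, and their advantage estimates.
batch_size, n_actions = 5, 4
pi_old = rng.dirichlet(np.ones(n_actions), size=batch_size)
pi_new = rng.dirichlet(np.ones(n_actions), size=batch_size)
actions = rng.integers(n_actions, size=batch_size)
advantages = rng.normal(size=batch_size)

# Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t)
ratio = pi_new[np.arange(batch_size), actions] / pi_old[np.arange(batch_size), actions]

# Surrogate objective: mean ratio-weighted advantage (to be maximized)
surrogate = np.mean(ratio * advantages)

# Mean KL divergence KL(pi_old || pi_new) over the sampled states;
# TRPO constrains this quantity to stay below a small threshold delta.
kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=-1))

delta = 0.01  # illustrative trust-region size, not a prescribed value
print(f"surrogate objective: {surrogate:.4f}")
print(f"mean KL divergence:  {kl:.4f} (constraint: KL <= {delta})")

In practice the policies are parameterized neural networks and the update is solved with constrained optimization rather than computed directly as above; the sketch only shows what is being maximized and what is being constrained.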
