Advantage actor-critic algorithm

The action-value actor-critic algorithm still has high variance. We can reduce the variance by subtracting a baseline function, B(s), from the policy gradient. Because the baseline depends only on the state and not on the action, subtracting it does not bias the gradient estimate. A good baseline is the state value function, $V^{\pi_\theta}(s)$. With the state value function as the baseline, we can rewrite the result of the policy gradient theorem as the following:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,\big(Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\big)\big]$$
We can define the advantage function to be the following:

$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$$
When used in the previous expression for the policy gradient, the advantage function gives us the advantage actor-critic policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,A^{\pi_\theta}(s, a)\big]$$
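To make this concrete, the following is a minimal sketch of how the advantage and the resulting actor-critic loss can be computed in PyTorch. The ActorCritic network, the a2c_loss helper, and the 0.5 critic-loss weight are illustrative assumptions, not code from this chapter; here the sampled return is used as an estimate of $Q^{\pi_\theta}(s, a)$.

```python
# Minimal advantage actor-critic loss sketch (assumes PyTorch; names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: state value V(s)

    def forward(self, obs):
        x = self.body(obs)
        return self.policy_head(x), self.value_head(x).squeeze(-1)

def a2c_loss(model, obs, actions, returns):
    """Advantage actor-critic loss for a batch of transitions.

    obs:     (batch, obs_dim) observations
    actions: (batch,) actions taken (long tensor)
    returns: (batch,) sampled returns, used here as an estimate of Q(s, a)
    """
    logits, values = model(obs)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Advantage A(s, a) = Q(s, a) - V(s); V(s) acts as the baseline.
    # detach() keeps the baseline out of the policy-gradient term.
    advantages = returns - values.detach()

    actor_loss = -(chosen_log_probs * advantages).mean()  # policy-gradient term
    critic_loss = F.mse_loss(values, returns)             # fit V(s) to the returns
    return actor_loss + 0.5 * critic_loss
```

Using the sampled return in place of $Q^{\pi_\theta}(s, a)$ is the Monte Carlo variant of this estimate; a one-step bootstrapped target such as $r + \gamma V(s')$ is another common choice.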
