A policy is formally a deterministic or stochastic law that the agent follows in order to maximize its return. Conventionally, all policies are denoted with the letter π. A deterministic policy is usually a function of the current state that outputs a precise action:
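One common way to write this, assuming $\mathcal{S}$ denotes the set of states and $\mathcal{A}$ the set of actions (symbols chosen here for illustration, not fixed by the text), is:

```latex
\pi : \mathcal{S} \rightarrow \mathcal{A}, \qquad a_t = \pi(s_t)
```

Read this as: at every time step $t$, the action $a_t$ is fully determined by the current state $s_t$.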
A stochastic policy, analogously to environments, outputs the probability of each action (here we assume a finite MDP):
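As a minimal sketch in Python, a stochastic policy over a finite action set can be stored as one probability vector per state. The state names, action names, and probabilities below are hypothetical and serve only to illustrate the idea. The helper functions (`action_probabilities`, `sample_action`, `greedy_action`) are likewise illustrative names, not part of any library.

```python
import random

# Hypothetical finite MDP: state and action names are illustrative only.
ACTIONS = ["left", "right", "stay"]

# Stochastic policy: for each state, one probability per action.
# Each row is assumed to sum to 1.
stochastic_policy = {
    "s0": [0.7, 0.2, 0.1],
    "s1": [0.1, 0.1, 0.8],
}


def action_probabilities(state):
    """Return a dict mapping each action to its probability in `state`."""
    return dict(zip(ACTIONS, stochastic_policy[state]))


def sample_action(state, rng=random):
    """Draw one concrete action according to the policy's probabilities."""
    return rng.choices(ACTIONS, weights=stochastic_policy[state], k=1)[0]


def greedy_action(state):
    """Deterministic counterpart: always return the most probable action."""
    probs = stochastic_policy[state]
    return ACTIONS[probs.index(max(probs))]
```

With these assumed values, `action_probabilities("s0")` lists the probability of each action in state `s0`, `sample_action("s0")` draws a single action from that distribution, and `greedy_action("s0")` shows how a deterministic policy could be derived from the same table.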
However, contrary to the environment, an agent must always pick a specific action, transforming any stochastic policy into ...