In this section, we discuss in detail some of the methods for solving Reinforcement Learning problems: specifically, dynamic programming (DP), Monte Carlo methods, and temporal-difference (TD) learning. These methods also address the problem of delayed rewards.
DP is a set of algorithms used to compute optimal policies given a model of the environment, such as a Markov Decision Process (MDP). Because DP methods are computationally expensive and assume a perfect model of the environment, they see limited practical use. Conceptually, however, DP is the basis for many of the algorithms and methods used in the following sections: