We are going to consider an example based on a checkerboard environment representing a tunnel. The goal of the agent is to reach the ending state (lower-right corner), avoiding 10 wells that are negative absorbing states. The rewards are:
- Ending state: +5.0
- Wells: -5.0
- All other states: -0.1
Assigning a small negative reward to all non-terminal states is helpful because every extra step has a cost, so the agent is pushed to keep moving toward the ending state, where the maximum (final) reward is obtained, instead of wandering. Let's start by modeling the environment as a 5 × 15 matrix:
import numpy as np

width = 15
height = 5

y_final = width - 1
x_final = height - 1

y_wells = [0, 1, 3, 5, 5, 7, 9, 11, 12, 14]
x_wells = [3, 1, 2, 0, 4, 1, 3, 2, 4, 1]

standard_reward = -0.1
tunnel_rewards ...
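As an illustration of the reward scheme listed above, the following is a minimal sketch of how such a grid could be laid out with NumPy. The names `reward_grid`, `well_reward`, `final_reward`, and `is_terminal` are hypothetical and introduced only for this example; they are not taken from the original listing, and the sketch assumes that `x` indexes rows and `y` indexes columns.

```python
import numpy as np

width = 15
height = 5

# Assumption: x indexes rows (0..height-1), y indexes columns (0..width-1)
x_final, y_final = height - 1, width - 1

y_wells = [0, 1, 3, 5, 5, 7, 9, 11, 12, 14]
x_wells = [3, 1, 2, 0, 4, 1, 3, 2, 4, 1]

standard_reward = -0.1
well_reward = -5.0    # hypothetical name for the well penalty
final_reward = 5.0    # hypothetical name for the ending-state reward

# Hypothetical illustration grid: start with the step cost everywhere,
# then overwrite the wells and the ending state
reward_grid = np.full((height, width), standard_reward)
reward_grid[x_wells, y_wells] = well_reward
reward_grid[x_final, y_final] = final_reward

# Set of absorbing cells (wells plus the ending state)
terminal_cells = set(zip(x_wells, y_wells)) | {(x_final, y_final)}


def is_terminal(x, y):
    """Return True if (x, y) is a well or the ending state."""
    return (x, y) in terminal_cells


print(reward_grid)
```

Printing the grid makes it easy to check visually that the wells are scattered along the tunnel and that the only positive reward sits in the lower-right corner.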