As you train this agent, you'll notice that the first thing it learns to do is hover the lander, and avoid landing. When the lander finally lands, it receives a very strong reward, either +100 for landing successfully or -100 for crashing. This -100 reward is so strong that the agent would rather incur small penalties for hovering at first. It takes quite a few episodes for our agent to finally get the hint that good landings are better than no landings, because crash landings are so very bad.
Because of this extreme negative feedback for crash landings, it will ...