Live Online Training

Reinforcement Learning with Tensorflow and Keras

Use Reinforcement Learning applications in games and robotics

Pablo Maldonado

Reinforcement Learning algorithms are behind some of the most impressive breakthroughs in Artificial Intelligence. In this course, we will cover the fundamentals of reinforcement learning with an emphasis on its applications in video games and robotics.

What you'll learn and how you can apply it

  • How to use Q-functions to obtain optimal policies for your agent, be it a game-playing bot or a simulated robot
  • How to choose optimal behavior in complicated environments
  • How to use actor-critic methods to fine-tune and efficiently train your policy
  • How to use Monte Carlo tree search to speed up the learning process

And you'll be able to:

  • Teach a bot to play video games
  • Train a simulated robot to perform a task, and then transfer it to a real robot
  • Understand state-of-the-art methods in artificial intelligence

This training course is for you because...

This session is aimed at anyone with basic software development skills (data scientists, software engineers, amateur programmers, and managers) who wants to understand, at a high level, the main concepts of applied deep learning and artificial intelligence. You will gain valuable practical knowledge and a new perspective on your day-to-day challenges.


Prerequisites:

  • Working knowledge of R and/or Python, and familiarity with calculus and probability

Recommended preparation:

  • Course requirements can be found here: http://www.datastart.eu/index.php/packt-training/
  • The book R Deep Learning Projects is suggested reading, as it introduces some of the methods we will use in this course. A GitHub repository with course material is also available: https://github.com/jpmaldonado

About your instructor

  • Pablo Maldonado is an applied mathematician and data scientist with a taste for software development since his days of programming BASIC on a Tandy 1000. As an academic and business consultant, he spends a great deal of his time building applied artificial intelligence solutions for text analytics, sensor and transactional data, and reinforcement learning. Pablo earned his PhD in applied mathematics (with a focus on mathematical game theory) at the Université Pierre et Marie Curie in Paris, France. He is the founder of Maldonado Consulting, a technology-agnostic data analytics consultancy based in Prague, Czech Republic, that leverages the latest tools and research to develop custom solutions in areas such as data analytics, mathematical modelling, machine learning, and artificial intelligence. Pablo has been an adjunct professor, teaching AI (Reinforcement Learning) and Machine Learning at Czech Technical University in Prague, the oldest technical university in Central Europe. He co-authored the book “R Deep Learning Projects”, published by Packt.


The time frames are only estimates and may vary according to how the class is progressing.

Day 1:

Introduction to Reinforcement Learning (20 min)

  • In this lecture, we will cover examples and use cases of Reinforcement Learning, both in practice and in (applied) research, and provide the motivation and fundamentals for the following lectures.
    • Where can we find RL in our daily lives?
    • Problem definition
    • History and motivation of the field. Outline of research frontiers and state of the art.

From Markov chains to Markov Decision Processes (40 min)

  • This lecture provides the theoretical foundation for the rest of the course. It is important to grasp at least the main ideas, as they will help you make the most of this course.
    • Markov chains: definition and examples
    • Markov Decision Processes: definition and examples
    • The Dynamic programming principle
    • Value and policy iteration
    • A stochastic approximation perspective on learning
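
As an illustration of the value iteration item in this outline, here is a minimal sketch in plain Python. The two-state MDP, its transition table `P`, the rewards, and the discount `GAMMA` are all made up for the example and are not taken from the course material.

```python
# Hypothetical 2-state, 2-action MDP; every number here is invented for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed)


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


def greedy_policy(V):
    """Pick, in every state, the action with the highest one-step lookahead value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


V = value_iteration()
policy = greedy_policy(V)
print(V, policy)
```

Policy iteration would alternate a full policy evaluation with this greedy improvement step instead of folding both into a single update.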

RL as a black-box optimization problem (20 min)

  • We treat Reinforcement Learning as a “black-box” optimization problem and apply two approaches to it: the cross-entropy method and natural evolution strategies. These approaches are conceptually simple, yet they can be competitive with more sophisticated methods.
    • Black-box algorithms: random search, genetic algorithms, and other heuristics
    • The Cross-Entropy method
    • Natural Evolution strategies
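
The cross-entropy method listed above can be sketched in a few lines. This is a one-dimensional toy version under stated assumptions: the objective `score`, the Gaussian sampling distribution, and all hyperparameters are hypothetical choices for illustration. In reinforcement learning, `score` would be the return of a policy with the sampled parameters.

```python
import random
import statistics


def cross_entropy_method(score, mean=0.0, std=5.0, n_samples=50,
                         n_elite=10, n_iters=30, seed=0):
    """Maximize a black-box score by repeatedly refitting a Gaussian to the best samples."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # 1. Sample candidate parameters from the current distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2. Keep the highest-scoring candidates (the "elite" set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # 3. Refit the distribution to the elite set.
        mean = statistics.mean(elite)
        std = statistics.pstdev(elite) + 1e-3  # small floor keeps some exploration
    return mean


# Toy objective with its maximum at x = 3 (an assumption for the example).
best = cross_entropy_method(lambda x: -(x - 3.0) ** 2)
print(best)
```

Note that the method only ever calls `score`; it needs no gradients and no knowledge of the environment's structure, which is what makes it a black-box approach.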

Temporal Difference Methods. Q-Learning and Sarsa. Eligibility traces (40 min)

  • In this session, we will go through the different methods for estimating value functions, which are used later for estimating the optimal behavior.
    • Q-function: definition and examples
    • Estimating Q-functions via Q-Learning
    • Sarsa: on-policy vs. off-policy methods
    • Beyond Q-Learning: double Q-Learning, Zap Q-Learning, and others
    • Eligibility traces: forward and backward perspective
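
To make the difference between the two temporal difference updates concrete, here is a minimal tabular sketch. The step size `ALPHA`, the discount `GAMMA`, and the example transition are assumptions chosen for illustration.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # step size and discount factor (assumed values)


def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update: bootstrap from the action the agent actually takes next."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])


# Same transition, two different updates (made-up numbers).
actions = ["left", "right"]
Q_ql = defaultdict(float, {(1, "left"): 1.0, (1, "right"): 2.0})
Q_sarsa = defaultdict(float, {(1, "left"): 1.0, (1, "right"): 2.0})

q_learning_update(Q_ql, 0, "right", 1.0, 1, actions)    # uses max over next actions
sarsa_update(Q_sarsa, 0, "right", 1.0, 1, "left")       # uses the chosen next action
print(Q_ql[(0, "right")], Q_sarsa[(0, "right")])
```

The two results differ whenever the behavior policy picks a non-greedy next action, which is exactly the on-policy versus off-policy distinction.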

Practice: Solving Cart-Pole and MountainCar (60 min)

  • We will practice our hard-earned knowledge on two fun and challenging tasks. We will compare different algorithms and see their advantages and disadvantages live.
    • Solve Cart-Pole and MountainCar using different methods
    • Instructor support will be available, and solutions will be provided at the end

Deep Reinforcement Learning (40-50 min)

  • Building on the temporal difference methods lecture, we will show how to use state-of-the-art methods in artificial intelligence (deep learning) to calculate value functions.
    • Function approximation methods
    • Linear function approximation methods
    • Using a multilayer perceptron for function approximation
    • Experience replay
    • Improving baseline Deep Q-Learning
    • Deep Cross-Entropy Method: a deep learning approach for black-box reinforcement learning
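
Experience replay, listed above, is usually built around a fixed-size buffer of past transitions that is sampled at random during training. The following is a minimal sketch; the class name `ReplayBuffer` and its methods are hypothetical and do not come from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random minibatch and regroup it column by column."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.push(i, 0, 1.0, i + 1, False)  # dummy transitions for illustration
states, actions, rewards, next_states, dones = buf.sample(2)
print(len(buf), states)
```

Sampling at random breaks the correlation between consecutive transitions, which tends to stabilize training when the Q-function is a neural network.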

Wrap up and lessons learned (10-20 min)

  • We will summarize what we learned on the first day and suggest how you can apply your newly acquired knowledge in a number of settings.
    • More detailed example applications
    • Suggestions for projects: build your portfolio

Day 2:

Policy methods (DDPG and TRPO) and Actor-Critic methods (60 min)

  • In this session, we will consider a new class of algorithms: policy methods. These methods are more suitable for complicated tasks such as robotic arm manipulation, as they do not rely on computing value functions in advance.
    • Introduction to Policy Gradients: finite differences
    • Use likelihood ratios for improving the quality of the policy
    • REINFORCE: a Monte Carlo approach for policy approximation
    • Improving policy methods with a baseline: actor-critic methods
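
The likelihood-ratio idea behind REINFORCE, together with a baseline, can be sketched on the simplest possible problem: a two-armed bandit with one-step episodes. The reward values, learning rate, and running-average baseline are assumptions made for this example. A full implementation would sum discounted rewards over multi-step episodes.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), lr=0.1, episodes=2000, seed=0):
    """REINFORCE with a running-average baseline on a bandit with fixed rewards."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one preference per action
    baseline = 0.0
    for t in range(1, episodes + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]          # return of this one-step episode
        advantage = ret - baseline     # the baseline reduces gradient variance
        for a in range(len(theta)):
            # Gradient of log softmax: indicator(a == action) - probs[a]
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += lr * advantage * grad_log
        baseline += (ret - baseline) / t  # running mean of observed returns
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)
```

In an actor-critic method, the running-average baseline is replaced by a learned value function (the critic), while the preference update plays the role of the actor.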

Practice: Solving Cliffwalk (30 min)

  • We will apply the methods to solve a challenging environment. Although this is a toy problem, it exhibits a number of features characteristic of more complicated setups. Once you can debug your algorithms here, you can apply them to more complicated tasks.
    • Implementation of several algorithms to solve a simple environment:
    • REINFORCE with baseline
    • Policy Gradients (different algorithms)

Practice: Solving Pong (30 min)

  • In this session, we will tackle the problem of teaching a bot to play the classic game Pong. We will use this as an excuse to practice the policy methods we learned before.
    • Video frame pre-processing pipeline: reading and combining video frames
    • Implementation of a policy gradient algorithm using NumPy
    • Policy gradients using Keras
    • Testing different algorithms at once: using OpenAI Baselines
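
A frame pre-processing pipeline of the kind listed above typically converts each frame to grayscale, shrinks it, and feeds the difference between consecutive frames to the policy so that motion is visible. The sketch below uses plain Python lists to stay dependency-free; the function names, the downsampling factor, and the tiny 4×4 frames are assumptions for illustration, and a real pipeline would more likely use NumPy arrays.

```python
def to_grayscale(frame):
    """Average the (r, g, b) channels of every pixel."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in frame]


def downsample(gray, factor=2):
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in gray[::factor]]


def preprocess(frame, prev_processed=None, factor=2):
    """Return the processed frame and a flat difference vector against the previous one."""
    current = downsample(to_grayscale(frame), factor)
    if prev_processed is None:
        diff = [[0.0] * len(row) for row in current]  # no motion information yet
    else:
        diff = [[c - p for c, p in zip(cur_row, prev_row)]
                for cur_row, prev_row in zip(current, prev_processed)]
    flat = [v for row in diff for v in row]  # flat input vector for the policy
    return current, flat


# Two made-up 4x4 RGB frames: all black, then one white pixel appears.
black = [[(0, 0, 0)] * 4 for _ in range(4)]
white_dot = [list(row) for row in black]
white_dot[0][0] = (255, 255, 255)

prev, first_input = preprocess(black)
cur, second_input = preprocess(white_dot, prev)
print(second_input)
```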

Introduction to RL for Robotics (30 min)

  • In this session, we will provide an overview of how reinforcement learning is used in robotics. This covers roughly two scenarios: teaching a robot to imitate what a human demonstrator does (through virtual reality) and teaching a robot to discover the correct behavior through trial and error.
    • The robotics control problem: definition and challenges
    • Imitation learning
    • Model-based methods

Efficient Algorithms for Robotics (30 min)

  • Reinforcement Learning can be quite data-hungry, and in this session, we will explore methods to reduce the number of simulations needed to obtain high-quality policies.
    • Imitation Learning algorithms
    • Estimating the dynamics of the system via Gaussian Processes

Practice: Humanoid robot control through RL in a simulator (60 min)

  • To conclude, we will apply our hard-earned knowledge to train a humanoid robot (NAO) in simulation. If you have access to a NAO robot, the learned policy can be deployed to the real robot.
    • Set up your environment: V-REP simulator and NAOqi Python development kit
    • Using V-REP as an OpenAI Gym environment
    • Training NAO robot with reinforcement learning
    • Where to go from here? Ideas for projects