Understand the intuition behind MDPs, how they lead to reinforcement learning, and the Q-learning algorithm.
Reinforcement Learning

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School, and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages, part of my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository because many people have found it useful.

Approaches to RL

MDP's Value Function

  • $U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$
  • $\pi(s) = \operatorname{argmax}_a \sum_{s'} T(s, a, s')\, U(s')$
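
As a concrete illustration, here is a minimal value-iteration sketch for a tabular MDP built directly from the two equations above. The `states`, `actions`, `T`, `R`, and `gamma` names are hypothetical placeholders for a dictionary-based model; they are not defined in the original course notes.

```python
def value_iteration(states, actions, T, R, gamma=0.9, n_iters=100):
    # Hypothetical model: T[(s, a, s_next)] = probability of landing in s_next
    # after taking action a in state s; R[s] = reward for being in state s.
    U = {s: 0.0 for s in states}
    for _ in range(n_iters):
        # U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
        U = {
            s: R[s] + gamma * max(
                sum(T.get((s, a, s2), 0.0) * U[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
    # pi(s) = argmax_a sum_s' T(s, a, s') * U(s')
    pi = {
        s: max(actions,
               key=lambda a: sum(T.get((s, a, s2), 0.0) * U[s2] for s2 in states))
        for s in states
    }
    return U, pi
```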

Q function

  • $Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
    • Value for arriving in s
    • Leaving via a
    • Proceeding optimally thereafter
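
Written with Q-values instead of utilities, the same fixed point can be computed by iterating the equation above. This is a hedged sketch that reuses the same hypothetical `states`, `actions`, `T`, `R` placeholders as the value-iteration sketch.

```python
def q_from_model(states, actions, T, R, gamma=0.9, n_iters=100):
    # Iterate Q(s, a) = R(s) + gamma * sum_s' T(s, a, s') * max_a' Q(s', a')
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q = {
            (s, a): R[s] + gamma * sum(
                T.get((s, a, s2), 0.0) * max(Q[(s2, a2)] for a2 in actions)
                for s2 in states
            )
            for s in states
            for a in actions
        }
    return Q
```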

Defining MDP's Value Function as Q Functions

  • $U(s) = \max_a Q(s, a)$
  • $\pi(s) = \operatorname{argmax}_a Q(s, a)$
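
Given such a Q-table, recovering the utility and the policy is just a max and an argmax over actions. A small sketch, assuming the dictionary-of-tuples `Q` layout used in the previous block:

```python
def utility_and_policy_from_q(Q, states, actions):
    # U(s) = max_a Q(s, a); pi(s) = argmax_a Q(s, a)
    U = {s: max(Q[(s, a)] for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return U, pi
```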

Q-learning

  • Evaluating the Bellman equations from data.
    • We need to estimate Q from transitions.
      • s, a, r, s'
        • We are in a state, we take an action, we receive a reward, and we end up in the next state (a sampling sketch follows this list).
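
In the learning setting we never read T and R directly; we only see sampled transitions. Below is a hedged sketch of drawing a single ⟨s, a, r, s'⟩ tuple from the hypothetical tabular model used earlier, purely to generate data for the updates that follow.

```python
import random

def sample_transition(s, a, states, T, R):
    # Draw s' with probability T(s, a, s'); observe r = R(s) for the state we leave.
    probs = [T.get((s, a, s2), 0.0) for s2 in states]
    s_next = random.choices(states, weights=probs, k=1)[0]
    return s, a, R[s], s_next
```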

Estimating Q from Transitions: Q-learning Equation

  • $\hat{Q}(s, a) \xleftarrow{\alpha_t} r + \gamma \max_{a'} \hat{Q}(s', a')$
  • Intuition: imagine we already have an estimate of the Q function.
    • We update the entry for the state and action we just experienced, moving it a little toward the target.
    • The target is the reward plus the discounted maximum estimated utility of the next state.
  • This term is the update target: a one-sample estimate of the utility of the state we just left.
    • $r + \gamma \max_{a'} \hat{Q}(s', a')$
  • This term is the estimated utility of the next state $s'$.
    • $\max_{a'} \hat{Q}(s', a')$
  • $\alpha_t$ is the learning rate (a tabular update sketch follows this list).
    • $V \xleftarrow{\alpha_t} X$ is shorthand for
    • $V \leftarrow (1 - \alpha_t)V + \alpha_t X$
      • When $\alpha_t = 0$, there is no learning at all: the new value is just the original value, $V = V$.
      • When $\alpha_t = 1$, there is total learning: you completely forget your previous value $V$, and the new value is $V = X$.
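
Putting the update rule and the learning rate together, here is a minimal tabular Q-learning update sketch. Initializing unseen entries to zero via `defaultdict` is just one convenient choice, not something the notes prescribe.

```python
from collections import defaultdict

Q_hat = defaultdict(float)  # Q-hat starts at 0.0 for every unseen (s, a) pair

def q_learning_update(Q_hat, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Target: r + gamma * max_a' Q_hat(s', a'), a one-sample estimate of the utility
    target = r + gamma * max(Q_hat[(s_next, a2)] for a2 in actions)
    # Blend old estimate with the target: Q_hat <- (1 - alpha) * Q_hat + alpha * target
    Q_hat[(s, a)] = (1 - alpha) * Q_hat[(s, a)] + alpha * target
    return Q_hat
```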

Convergence of Q-Learning Equation

  • $\hat{Q}(s, a)$ starts anywhere
  • $\hat{Q}(s, a) \xleftarrow{\alpha_t} r + \gamma \max_{a'} \hat{Q}(s', a')$
  • Then $\hat{Q}(s, a) \rightarrow Q(s, a)$
    • The solution to Bellman's equation
  • But, this is true only if
    1. (s, a) is visited infinitely often.
    2. $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
    3. $s' \sim T(s, a, s')$ and $r \sim R(s)$, i.e. the next states and rewards are sampled from the actual MDP.
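
Condition 2 is the usual stochastic-approximation (Robbins-Monro) requirement. A per-(s, a) learning rate of 1/N(s, a) is one common schedule that satisfies it, sketched below with a hypothetical visit-count dictionary; other decays work too.

```python
visit_counts = {}  # hypothetical per-(s, a) visit counter

def decayed_alpha(s, a, visit_counts):
    # alpha_t = 1 / N(s, a): sum_t alpha_t diverges while sum_t alpha_t^2 converges
    visit_counts[(s, a)] = visit_counts.get((s, a), 0) + 1
    return 1.0 / visit_counts[(s, a)]
```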

Q-learning is a family of algorithms

  • How to initialize $\hat{Q}$?
  • How to decay $\alpha_t$?
  • How to choose actions?
    • ϵ-greedy exploration
      • If GLIE (Greedy in the Limit with Infinite Exploration) holds, e.g. with a decayed $\epsilon$, then
      • $\hat{Q} \rightarrow Q$ and $\hat{\pi} \rightarrow \pi$
        • $\hat{Q} \rightarrow Q$ is learning
        • $\hat{\pi} \rightarrow \pi$ is using
      • Note that we face the exploration-exploitation dilemma here, a fundamental tradeoff in reinforcement learning (an $\epsilon$-greedy sketch follows below).
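
A hedged sketch of $\epsilon$-greedy action selection with a decaying $\epsilon$, which is one simple way to aim at the GLIE conditions. The 1/(1 + t) decay schedule is an illustrative choice, not something prescribed by the notes.

```python
import random

def epsilon_greedy_action(Q_hat, s, actions, t, epsilon_0=1.0):
    # Decay epsilon over time: exploration diminishes but never fully stops
    epsilon = epsilon_0 / (1 + t)
    if random.random() < epsilon:
        return random.choice(actions)                      # explore: random action
    return max(actions, key=lambda a: Q_hat[(s, a)])       # exploit: greedy action
```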