
Introduction to Reinforcement Learning
Part 5: Learning (Near)-Optimal Policies
Rowan McAllister
Reinforcement Learning Reading Group
13 May 2015
Note
I’ve created these slides whilst following, and using figures from,
the “Algorithms for Reinforcement Learning” lectures by Csaba Szepesvári,
specifically Sections 4.1–4.3.
The lectures themselves are available on Professor Szepesvári’s homepage:
http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf
If you spot any errors, please email me: [email protected]
Context
P and R known?
  - Evaluate a policy π            (week 1)
  - Compute optimal policy π∗      (week 1)
P and R unknown?
  - Evaluate a policy π            (weeks 2, 3, 4)
  - Learn optimal policy π∗        (today!)
Goal: cumulative reward - the sum of all rewards received whilst learning.
The ‘dilemma’: explore vs exploit?
  - pure exploitation (greedy actions) → might fail to learn better policies
  - pure exploration → continually receive mediocre rewards
  - need to balance the two...
Simple Exploration Heuristics
ε-greedy:

    π(x, a) = ε/|A| + (1 − ε) · I[a = argmax_{a′} Qt(x, a′)]

Boltzmann:

    π(x, a) = exp(β Qt(x, a)) / Σ_{a′} exp(β Qt(x, a′))
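As a rough illustration (not from the lecture notes; the function and variable names are my own), both rules in Python:

import numpy as np

def epsilon_greedy_probs(q_values, epsilon=0.1):
    """pi(x, .): uniform mass epsilon/|A| plus (1 - epsilon) on the greedy action."""
    probs = np.full(len(q_values), epsilon / len(q_values))
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def boltzmann_probs(q_values, beta=1.0):
    """pi(x, .) proportional to exp(beta * Q_t(x, a))."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()               # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Example: sample an action for the current state's Q estimates.
q = np.array([0.1, 0.5, 0.3])
a = np.random.choice(len(q), p=epsilon_greedy_probs(q, epsilon=0.1))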
Bandits
Bandits are single-state MDPs: |X| = 1
UCB1
optimism in the face of uncertainty...
    Ut(a) = Qt(a) + R · √(2 log t / nt(a))
    π = argmax_a Ut(a)
where:
    nt(a) : number of times action a has been selected
    Qt(a) : sample mean of rewards observed for action a, with rewards bounded in [−R, R]
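A minimal Python sketch of UCB1 under these definitions (the class and method names are my own; rewards assumed bounded in [−R, R]):

import numpy as np

class UCB1:
    def __init__(self, n_actions, reward_bound=1.0):
        self.R = reward_bound                      # rewards assumed in [-R, R]
        self.counts = np.zeros(n_actions)          # n_t(a)
        self.means = np.zeros(n_actions)           # Q_t(a), sample means
        self.t = 0

    def select_action(self):
        self.t += 1
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:                       # play each action once first
            return int(untried[0])
        bonus = self.R * np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))  # argmax_a U_t(a)

    def update(self, action, reward):
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]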
UCB - simple reward
Now assume the goal is simple reward - to optimise the reward received after T
interactions, i.e. rewards received up until T do not matter.
Idea: start eliminating actions once you are sufficiently certain they are
worse than the others.

    Ut(a) = Qt(a) + R · √(log(2|A|T/δ) / (2t))
    Lt(a) = Qt(a) − R · √(log(2|A|T/δ) / (2t))

where 0 < δ < 1 is a user-specified probability of failure.
We eliminate action a if Ut(a) < max_{a′∈A} Lt(a′).
This algorithm is pretty much unbeatable, apart from constant scaling factors.
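For illustration, a sketch of this elimination test in Python (the names and calling convention are my own assumptions, not from the notes):

import numpy as np

def surviving_actions(means, t, T, delta, R=1.0):
    """Return indices of actions not yet eliminated at round t.

    means : sample-mean rewards Q_t(a); rewards assumed bounded in [-R, R].
    """
    n_actions = len(means)
    width = R * np.sqrt(np.log(2 * n_actions * T / delta) / (2 * t))
    upper = means + width                          # U_t(a)
    lower = means - width                          # L_t(a)
    return np.flatnonzero(upper >= lower.max())    # keep a unless U_t(a) < max_a' L_t(a')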
Q(uality)-learning
Algorithm: Tabular Q-learning (called after each transition)
1: Input: X (last state), A (last action), R (reward), Y (next state), Q (current action-value estimates)
2: δ ← R + γ · max_a Q[Y, a] − Q[X, A]
3: Q[X, A] ← Q[X, A] + α · δ
4: return Q
Pros:
  - Simple!
  - “Off-policy” algorithm (meaning: any sampling policy may be used, provided it visits each state-action pair infinitely often).
  - Developed here in the Cambridge Engineering Dept!
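The tabular update above translates almost line-for-line into Python; a minimal sketch (the dict-of-arrays representation of Q is my own choice):

import numpy as np

def q_learning_update(Q, x, a, r, y, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step for the observed transition (x, a, r, y).

    Q : dict mapping each state to a numpy array of action values.
    """
    delta = r + gamma * np.max(Q[y]) - Q[x][a]   # line 2 of the algorithm
    Q[x][a] += alpha * delta                     # line 3 of the algorithm
    return Q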
Q-learning with Linear Approximation
Algorithm: Q-learning with Linear Approximation (called after each transition)
1: Input: X (last state), A (last action), R (reward), Y (next state), θ (parameters)
2: δ ← R + γ · max_a θ⊤φ[Y, a] − θ⊤φ[X, A]
3: θ ← θ + α · δ · φ[X, A]      (note: φ[X, A] = ∇θ Qθ(X, A))
4: return θ
Note this is TD(0) when there is only one action!
where
    Qθ(x, a) = θ⊤φ(x, a),   x ∈ X, a ∈ A
and
    θ ∈ R^d (weights),   φ : X × A → R^d (basis fns)
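A minimal Python sketch of this update, assuming a user-supplied feature map phi(x, a) returning a length-d numpy array and a finite set of actions (all names here are illustrative):

import numpy as np

def linear_q_update(theta, phi, actions, x, a, r, y, alpha=0.01, gamma=0.99):
    """One Q-learning step with Q_theta(x, a) = theta . phi(x, a)."""
    q_next = max(theta @ phi(y, b) for b in actions)   # max_a theta^T phi(Y, a)
    delta = r + gamma * q_next - theta @ phi(x, a)     # TD error
    return theta + alpha * delta * phi(x, a)           # phi(x, a) = grad_theta Q_theta(x, a)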