Reinforcement Learning

What is Reinforcement Learning?
• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal

Complete Agent
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment
• Environment is stochastic and uncertain
[Diagram: the agent sends an action to the environment; the environment returns the next state and a reward to the agent.]

Key Features of RL
• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward
  – Sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Elements of RL
• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model: what follows what
[Diagram: policy, value function, and model of the environment inside the agent, driven by reward.]

An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:

   State                               V(s) – estimated probability of winning
   positions with the game undecided    .5  (an initial guess)
   positions in which x has won         1   (win)
   positions in which x has lost        0   (loss)
   positions that are drawn             0   (draw)

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states:
   • Just pick the next state with the highest estimated prob. of winning — the largest V(s); a greedy move.
   • But 10% of the time pick a move at random; an exploratory move.
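The move-selection rule just described fits in a few lines. The sketch below is illustrative only and is not code from the slides: the value table V, the candidate-state list, and the function name pick_next_state are assumed placeholders for whatever board representation is used.

import random

EXPLORE_PROB = 0.1          # "10% of the time pick a move at random"

def pick_next_state(possible_next_states, V, default=0.5):
    """Return one of the candidate next states.

    possible_next_states: list of hashable board states reachable in one move.
    V: dict mapping state -> estimated probability of winning.
    Unseen states get the initial guess `default` (0.5).
    """
    if random.random() < EXPLORE_PROB:
        return random.choice(possible_next_states)                      # exploratory move
    return max(possible_next_states, key=lambda s: V.get(s, default))   # greedy move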
An Extended Example: Tic-Tac-Toe
[Diagram: the tree of game positions, with the starting position at the top; our (x's) moves alternate with the opponent's (o's) moves, and each of our moves is followed by the set of possible opponent replies.]
• Assume an imperfect opponent: he or she sometimes makes mistakes.

RL Learning Rule for Tic-Tac-Toe
[Diagram: the sequence of states a, b, c, c*, d, e, e′* ("exploratory" move), f, g, g* from one game; arrows indicate the backups made after greedy moves.]
• s  – the state before our greedy move
• s′ – the state after our greedy move
• We increment each V(s) toward V(s′) – a backup:
      V(s) ← V(s) + α [ V(s′) − V(s) ]
  where α, the step-size parameter, is a small positive fraction, e.g., α = .1
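The backup itself is a one-line update. A minimal sketch, assuming the same dict-based value table V as in the move-selection sketch above; in the scheme described here it is applied after greedy moves, with unseen states starting at the initial guess of 0.5.

ALPHA = 0.1   # step-size parameter

def backup(V, s, s_next, alpha=ALPHA, default=0.5):
    """Move the estimate for state s a fraction alpha toward the estimate
    for the state s_next that followed our greedy move."""
    v_s = V.get(s, default)
    v_next = V.get(s_next, default)
    V[s] = v_s + alpha * (v_next - v_s)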
The n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where
      E{ r_t | a_t } = Q*(a_t)
• These are the unknown action values
• The distribution of r_t depends only on a_t
• The objective is to maximize the reward in the long term, e.g., over 1000 plays
• To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them

Action-Value Methods
• Methods that adapt action-value estimates and nothing else, e.g.: suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then
      Q_t(a) = ( r_1 + r_2 + … + r_{k_a} ) / k_a            ("sample average")
•     lim_{k_a → ∞} Q_t(a) = Q*(a)
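As a concrete reading of the sample-average estimate, here is a minimal sketch; keeping one list of observed rewards per action is an illustrative choice, not something prescribed by the slides.

def sample_average(rewards):
    """Q_t(a) = (r_1 + ... + r_ka) / ka for the rewards observed for action a."""
    return sum(rewards) / len(rewards) if rewards else 0.0

# e.g., keep one list of observed rewards per action:
rewards_per_action = {0: [1.0, 0.0, 1.0], 1: [0.5]}
Q = {a: sample_average(rs) for a, rs in rewards_per_action.items()}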
The Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at t is
      a_t* = argmax_a Q_t(a)
• a_t = a_t*  ⇒  exploitation
  a_t ≠ a_t*  ⇒  exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring; but you should always reduce exploring. Maybe.

ε-Greedy Action Selection
• Greedy action selection:
      a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy:
      a_t = a_t*              with probability 1 − ε
            a random action   with probability ε
  … the simplest way to balance exploration and exploitation
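A minimal sketch of ε-greedy selection over the current estimates; Q is assumed to be a list or array of action-value estimates indexed by action.

import random

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(Q))                 # explore: random action
    return max(range(len(Q)), key=lambda a: Q[a])       # exploit: greedy action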
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• Choose action a on play t with probability
      e^{Q_t(a)/τ} / ∑_{b=1}^{n} e^{Q_t(b)/τ} ,
  where τ is the "computational temperature".
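A minimal sketch of softmax (Boltzmann) selection with temperature τ. Subtracting max(Q) before exponentiating is a standard numerical-stability trick added here; it does not change the resulting probabilities.

import math
import random

def softmax_action(Q, tau=1.0):
    m = max(Q)
    prefs = [math.exp((q - m) / tau) for q in Q]    # exp(Q_t(a)/tau), shifted for stability
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(Q)), weights=probs, k=1)[0]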
Chapter 3: The Reinforcement Learning Problem
Objectives of this chapter:
• describe the RL problem we will be studying for the remainder of the course;
• present the idealized form of the RL problem for which we have precise theoretical results;
• introduce key components of the mathematics: value functions and Bellman equations;
• describe trade-offs between applicability and mathematical tractability.
Incremental Implementation
• Recall the sample-average estimation method. The average of the first k rewards (dropping the dependence on a) is
      Q_k = ( r_1 + r_2 + … + r_k ) / k
  i.e., the mean reward obtained from k plays of the same action a.
• Can we do this incrementally (without storing all the rewards)?
• We could keep a running sum and count, or, equivalently:
      Q_{k+1} = Q_k + (1 / (k+1)) [ r_{k+1} − Q_k ]
• This is a common form for update rules:
      NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
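The update rule above in code, as a minimal sketch; the small driver loop simply checks that it reproduces the ordinary average.

def incremental_update(q_k, k, reward):
    """q_k: current average of the first k rewards; returns the average of k+1.
    Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k)."""
    return q_k + (reward - q_k) / (k + 1)

# Example: the running average of [2.0, 4.0, 6.0]
q, k = 0.0, 0
for r in [2.0, 4.0, 6.0]:
    q = incremental_update(q, k, r)
    k += 1
print(q)   # 4.0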
The Agent-Environment Interface
• Agent and environment interact at discrete time steps: t = 0, 1, 2, …
• Agent observes the state at step t:   s_t ∈ S
  produces an action at step t:         a_t ∈ A(s_t)
  gets the resulting reward:            r_{t+1} ∈ ℜ
  and the resulting next state:         s_{t+1}
[Diagram: … s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → …]
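The interface can be read as a simple interaction loop. The Environment class below (reset/step returning next state, reward, and a terminal flag) is a hypothetical stand-in, not an API defined in the slides; any environment exposing these two methods fits the loop.

class Environment:
    def reset(self):
        return 0                              # initial state s_0
    def step(self, action):
        return 0, 1.0, True                   # next state, reward, terminal?

def run_episode(env, policy):
    s = env.reset()
    done, ret = False, 0.0
    while not done:
        a = policy(s)                         # choose a_t in state s_t
        s, r, done = env.step(a)              # observe r_{t+1} and s_{t+1}
        ret += r
    return ret

print(run_episode(Environment(), policy=lambda s: 0))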
The Agent Learns a Policy
• Policy at step t, π_t:
      a mapping from states to action probabilities
      π_t(s, a) = probability that a_t = a when s_t = s
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

Returns
• Suppose the sequence of rewards after step t is:  r_{t+1}, r_{t+2}, r_{t+3}, …
• What do we want to maximize? In general, we want to maximize the expected return, E{R_t}, for each step t.
• Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
      R_t = r_{t+1} + r_{t+2} + … + r_T ,
  where T is a final time step at which a terminal state is reached, ending an episode.
Getting the Degree of Abstraction Right
• Time steps need not refer to fixed intervals of real time.
• Actions can be low level (e.g., voltages to motors), or high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention), etc.
• States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
• Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

Returns for Continuing Tasks
• Continuing tasks: interaction does not have natural episodes.
• Discounted return:
      R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑_{k=0}^{∞} γ^k r_{t+k+1} ,
  where γ, 0 ≤ γ ≤ 1, is the discount rate.
      shortsighted  0 ← γ → 1  farsighted
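A minimal sketch of computing a discounted return from a finite reward sequence; setting gamma = 1 recovers the undiscounted episodic return of the previous slide.

def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1} over the given reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75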
The Markov Property
• Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
      Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0 }
          = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }
  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.

Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.
  State-value function for policy π:
      V^π(s) = E_π{ R_t | s_t = s } = E_π{ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
  Action-value function for policy π:
      Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
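Reading the definition literally, V^π(s) can be estimated by averaging sampled returns. This sketch is illustrative only and runs ahead of the slides (which solve for values rather than sample them); sample_episode is an assumed helper that plays one episode from s under π and returns the observed rewards.

def mc_state_value(s, pi, sample_episode, gamma=0.9, n_episodes=1000):
    """Average the discounted returns of n_episodes episodes started in s under pi."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(pi, s)
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes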
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  – state and action sets
  – one-step "dynamics" defined by transition probabilities:
        P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }            for all s, s′ ∈ S, a ∈ A(s)
  – expected rewards:
        R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }    for all s, s′ ∈ S, a ∈ A(s)

Bellman Equation for a Policy π
• The basic idea:
      R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
          = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
          = r_{t+1} + γ R_{t+1}
• So:
      V^π(s) = E_π{ R_t | s_t = s }
             = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }
• Or, without the expectation operator:
      V^π(s) = ∑_a π(s,a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
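Because this Bellman equation is linear in the |S| unknowns V^π(s), a small finite MDP can be evaluated with a direct linear solve. A sketch under assumed conventions: dynamics as numpy arrays P[a, s, s1] = P^a_{ss′} and R[a, s, s1] = R^a_{ss′}, and the policy as pi[s, a].

import numpy as np

def evaluate_policy_exact(P, R, pi, gamma=0.9):
    """Solve V = r_pi + gamma * P_pi V, the Bellman equation in matrix form."""
    n_actions, n_states, _ = P.shape
    P_pi = np.einsum('sa,ast->st', pi, P)          # state-to-state transitions under pi
    r_pi = np.einsum('sa,ast,ast->s', pi, P, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)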
More on the Bellman Equation
      V^π(s) = ∑_a π(s,a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
• This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.
• Backup diagrams:
  [Diagram: backup diagrams for V^π and for Q^π.]

Bellman Optimality Equation for V*
• The value of a state under an optimal policy must equal the expected return for the best action from that state:
      V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
            = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
            = max_{a ∈ A(s)} ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]
• [Diagram: the relevant backup diagram.]
• V* is the unique solution of this system of nonlinear equations.
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
      π ≥ π′  if and only if  V^π(s) ≥ V^{π′}(s)   for all s ∈ S
• There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
• Optimal policies share the same optimal state-value function:
      V*(s) = max_π V^π(s)   for all s ∈ S
• Optimal policies also share the same optimal action-value function:
      Q*(s,a) = max_π Q^π(s,a)   for all s ∈ S and a ∈ A(s)
  This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for Q*
      Q*(s,a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
              = ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]
  (Action a is not necessarily the best action, but after taking a and reaching state s_{t+1}, the optimal policy is followed.)
• [Diagram: the relevant backup diagram.]
• Q* is the unique solution of this system of nonlinear equations.
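One way to make the Q* equation concrete is to solve it by fixed-point iteration (Q-value iteration). This is a sketch, not an algorithm given in the slides, using the same assumed array layout as before: P[a, s, s1] = P^a_{ss′}, R[a, s, s1] = R^a_{ss′}.

import numpy as np

def q_value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q(s,a) = sum_s1 P[a,s,s1] * (R[a,s,s1] + gamma * max_a1 Q(s1,a1))
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q_new = np.einsum('ast,ast->sa', P, target)
        if np.max(np.abs(Q_new - Q)) < theta:
            return Q_new
        Q = Q_new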
What About Optimal Action-Value Functions?
• Given Q*, the agent does not even have to do a one-step-ahead search:
      π*(s) = argmax_{a ∈ A(s)} Q*(s,a)
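A short illustration of the point above, under the same assumed arrays: acting greedily from Q* is a plain argmax over actions, while acting greedily from V* still requires a one-step lookahead through the model.

import numpy as np

def greedy_from_Q(Q):                      # no model needed
    return Q.argmax(axis=1)                # pi*(s) = argmax_a Q*(s, a)

def greedy_from_V(V, P, R, gamma=0.9):     # needs the one-step model P, R
    return np.einsum('ast,ast->sa', P, R + gamma * V[None, None, :]).argmax(axis=1)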
Chapter 4: Dynamic Programming
Objectives of this chapter:
• Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
• Show how DP can be used to compute value functions, and hence, optimal policies
• Discuss efficiency and utility of DP
Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
  – accurate knowledge of environment dynamics;
  – enough space and time to do the computation;
  – the Markov Property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Policy Evaluation
• Policy Evaluation: for a given policy π, compute the state-value function V^π.
• Recall the state-value function for policy π:
      V^π(s) = E_π{ R_t | s_t = s } = E_π{ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
• Bellman Equation for V^π:
      V^π(s) = ∑_a π(s,a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
Iterative Methods
•     V_0 → V_1 → … → V_k → V_{k+1} → … → V^π    (each arrow is a "sweep")
• A sweep consists of applying a backup operation to each state.
• A full policy-evaluation backup:
      V_{k+1}(s) ← ∑_a π(s,a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
• Iterative policy evaluation: repeat such sweeps until the value function stops changing (it converges to V^π).
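A sketch of one way to implement these sweeps (iterative policy evaluation), with the same assumed array conventions as in the earlier examples; the loop stops when a sweep changes no value by more than a small threshold.

import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        target = R + gamma * V[None, None, :]                  # R^a_{ss1} + gamma * V_k(s1)
        V_new = np.einsum('sa,ast,ast->s', pi, P, target)      # one full sweep over all states
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new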
Policy Improvement
• Suppose we have computed V^π for a deterministic policy π.
• For a given state s, would it be better to do an action a ≠ π(s)?
• The value of doing a in state s is:
      Q^π(s,a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a }
               = ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
• It is better to switch to action a for state s if and only if Q^π(s,a) > V^π(s).

Policy Improvement Cont.
• Do this for all states to get a new policy π′ that is greedy with respect to V^π:
      π′(s) = argmax_a Q^π(s,a)
            = argmax_a ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
• Then V^{π′} ≥ V^π.

Policy Iteration
      π_0 → V^{π_0} → π_1 → V^{π_1} → … → π* → V* → π*
• Each step alternates policy evaluation with policy improvement ("greedification").
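A sketch of the policy iteration loop under the same assumed array layout. For brevity the evaluation step here uses an exact linear solve rather than the iterative sweeps sketched above; either choice fits the π → V^π → π′ scheme.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)                      # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve the linear Bellman equations for the current policy.
        P_pi = P[policy, np.arange(n_states), :]                # P_pi[s, s1] = P[pi(s), s, s1]
        r_pi = np.einsum('st,st->s', P_pi, R[policy, np.arange(n_states), :])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: be greedy with respect to V.
        Q = np.einsum('ast,ast->sa', P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return new_policy, V                                # policy is stable: optimal
        policy = new_policy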
Value Iteration
• Recall the full policy-evaluation backup:
      V_{k+1}(s) ← ∑_a π(s,a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]
• Here is the full value-iteration backup:
      V_{k+1}(s) ← max_a ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ V_k(s′) ]

Generalized Policy Iteration
• Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
• [Figure: a geometric metaphor for the convergence of GPI.]
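A sketch of value iteration with the same assumed arrays: the max backup replaces the policy-weighted average, and a greedy policy is read off once the values stop changing.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s,a] = sum_s1 P[a,s,s1] * (R[a,s,s1] + gamma * V_k(s1))
        Q = np.einsum('ast,ast->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                       # max backup instead of a policy average
        if np.max(np.abs(V_new - V)) < theta:
            return Q.argmax(axis=1), V_new          # greedy policy and approximate V*
        V = V_new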