What is Reinforcement Learning?
• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Reinforcement Learning
(Diagram: the agent-environment loop; the agent sends actions to the environment, and the environment returns a state and a reward.)

Key Features of RL
• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Complete Agent
• Temporally situated
• Continual learning and planning
• Objective is to affect the environment
• Environment is stochastic and uncertain

Elements of RL
• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model of environment: what follows what

An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state, where V(s) is the estimated probability of winning: states with three x's in a row get V(s) = 1 (win), states with three o's in a row get V(s) = 0 (loss), drawn states get 0, and all other states start at 0.5.
2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states. Just pick the next state with the highest estimated probability of winning, i.e., the largest V(s): a greedy move. But 10% of the time pick a move at random: an exploratory move.

An Extended Example: Tic-Tac-Toe
(Figure: a game tree from the starting position, alternating x's moves (ours) and o's moves (the opponent's).)

RL Learning Rule for Tic-Tac-Toe
(Figure: the sequence of states a, b, c, c*, d, e, e'*, f, g, g* in one game, with one move marked "exploratory".)
• s: the state before our greedy move
• s′: the state after our greedy move
• We increment each V(s) toward V(s′), a backup (sketched in code below):
  V(s) ← V(s) + α [V(s′) − V(s)]
  where α is the step-size parameter, a small positive fraction, e.g., α = 0.1
• Assume an imperfect opponent: he/she sometimes makes mistakes

The n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where E{r_t | a_t} = Q*(a_t)
  These are unknown action values
  The distribution of r_t depends only on a_t
• Objective is to maximize the reward in the long term, e.g., over 1000 plays
• To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them

Action-Value Methods
• Methods that adapt action-value estimates and nothing else, e.g.: suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the "sample average" is
  Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a
• lim_{k_a → ∞} Q_t(a) = Q*(a)

The Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at t is a_t* = argmax_a Q_t(a)
  a_t = a_t* ⇒ exploitation
  a_t ≠ a_t* ⇒ exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring; but you should always reduce exploring. Maybe.

ε-Greedy Action Selection
• Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
• ε-Greedy: a_t = a_t* with probability 1 − ε, a random action with probability ε
• … the simplest way to balance exploration and exploitation
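A minimal sketch of ε-greedy action selection with sample-average estimates, in Python. The 10-arm testbed and Gaussian rewards are assumptions for illustration, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
n = 10                            # assumed number of arms
q_star = rng.normal(size=n)       # true (unknown) action values Q*(a)
sums = np.zeros(n)                # running sum of rewards per action
counts = np.zeros(n)              # k_a: number of times each action was chosen
Q = np.zeros(n)                   # sample-average estimates Q_t(a)
epsilon = 0.1

for t in range(1000):             # "maximize reward over 1000 plays"
    if rng.random() < epsilon:               # explore: random action
        a = int(rng.integers(n))
    else:                                    # exploit: greedy action argmax_a Q_t(a)
        a = int(np.argmax(Q))
    r = rng.normal(q_star[a], 1.0)           # reward distribution depends only on a
    sums[a] += r
    counts[a] += 1
    Q[a] = sums[a] / counts[a]               # "sample average"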
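And a minimal sketch of the tic-tac-toe learning rule from earlier on this page. The value-table structure and helper functions are illustrative assumptions; the slides do not specify a board encoding:

import random
from collections import defaultdict

alpha = 0.1                        # step-size parameter (a small positive fraction)
V = defaultdict(lambda: 0.5)       # value table: estimated probability of winning
                                   # (terminal states would be seeded with 1 for a win
                                   # and 0 for a loss or draw, as on the slide)

def pick_move(afterstates, epsilon=0.1):
    """Greedy on V(s) most of the time; 10% of the time an exploratory move."""
    if random.random() < epsilon:
        return random.choice(afterstates), True        # exploratory move
    return max(afterstates, key=lambda s: V[s]), False # greedy move

def backup(s, s_next):
    """V(s) <- V(s) + alpha * (V(s') - V(s)): move V(s) toward V(s')."""
    V[s] += alpha * (V[s_next] - V[s])

# In play, `afterstates` would be the hashable positions reachable by our legal
# moves, and backups would be applied along the greedy moves of each game.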
Softmax Action Selection
• Softmax action selection methods grade action probabilities by estimated values.
• Choose action a on play t with probability
  e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},
  where τ is the "computational temperature".

Chapter 3: The Reinforcement Learning Problem
Objectives of this chapter:
• describe the RL problem we will be studying for the remainder of the course
• present the idealized form of the RL problem for which we have precise theoretical results
• introduce key components of the mathematics: value functions and Bellman equations
• describe trade-offs between applicability and mathematical tractability

Incremental Implementation
• Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):
  Q_k = (r_1 + r_2 + … + r_k) / k
  (for the same action a, the average reward obtained over k plays)
• Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
  Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]
• This is a common form for update rules (sketched in code below):
  NewEstimate = OldEstimate + StepSize [Target − OldEstimate]

The Agent-Environment Interface
• Agent and environment interact at discrete time steps: t = 0, 1, 2, …
• Agent observes state at step t: s_t ∈ S
• produces action at step t: a_t ∈ A(s_t)
• gets resulting reward r_{t+1} ∈ ℜ and resulting next state s_{t+1}
(Diagram: the resulting trajectory … s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, …)

The Agent Learns a Policy
• Policy at step t, π_t: a mapping from states to action probabilities
  π_t(s, a) = probability that a_t = a when s_t = s
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

Returns
• Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, …
• What do we want to maximize? In general, we want to maximize the expected return, E{R_t}, for each step t.
• Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
  R_t = r_{t+1} + r_{t+2} + … + r_T,
  where T is a final time step at which a terminal state is reached, ending an episode.

Getting the Degree of Abstraction Right
• Time steps need not refer to fixed intervals of real time.
• Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), or "mental" (e.g., shift in focus of attention), etc.
• States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
• Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

Returns for Continuing Tasks
• Continuing tasks: interaction does not have natural episodes.
• Discounted return:
  R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1},
  where γ, 0 ≤ γ ≤ 1, is the discount rate.
  shortsighted 0 ← γ → 1 farsighted

The Markov Property
• Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:
  Pr{s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0} = Pr{s_{t+1} = s′, r_{t+1} = r | s_t, a_t}
  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.
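A short sketch of the softmax selection rule from the start of this page; the temperature value in the example call is an arbitrary assumption:

import numpy as np

def softmax_action(Q, tau, rng=None):
    """Choose a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                        # for numerical stability only
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))

# High tau -> nearly equiprobable actions (more exploration); low tau -> nearly greedy.
a = softmax_action([0.2, 1.0, 0.5], tau=0.5)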
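A short sketch of the incremental sample-average update from this page, in the general "NewEstimate = OldEstimate + StepSize [Target − OldEstimate]" form:

def update_mean(Q_k, k, r):
    """Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k);
    k is how many rewards are already averaged into Q_k."""
    return Q_k + (r - Q_k) / (k + 1)

# Gives the same result as the explicit average, without storing past rewards:
rewards = [1.0, 0.0, 2.0, 1.0]
Q = 0.0
for k, r in enumerate(rewards):
    Q = update_mean(Q, k, r)
assert abs(Q - sum(rewards) / len(rewards)) < 1e-12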
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy:
  State-value function for policy π:
  V^π(s) = E_π{R_t | s_t = s} = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:
  Action-value function for policy π:
  Q^π(s, a) = E_π{R_t | s_t = s, a_t = a} = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }

Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  state and action sets
  one-step "dynamics" defined by transition probabilities:
  P^a_{ss′} = Pr{s_{t+1} = s′ | s_t = s, a_t = a} for all s, s′ ∈ S, a ∈ A(s)
  expected rewards:
  R^a_{ss′} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′} for all s, s′ ∈ S, a ∈ A(s)

Bellman Equation for a Policy π
• The basic idea:
  R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + …
      = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + …)
      = r_{t+1} + γ R_{t+1}
• So:
  V^π(s) = E_π{R_t | s_t = s} = E_π{r_{t+1} + γ V^π(s_{t+1}) | s_t = s}
• Or, without the expectation operator:
  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]

More on the Bellman Equation
• This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. (A small numerical sketch appears at the end of this chapter.)
• Backup diagrams: (diagrams for V^π and Q^π omitted)

Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
  π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S
• There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s) for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
  This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*
• The value of a state under an optimal policy must equal the expected return for the best action from that state:
  V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
        = max_{a ∈ A(s)} E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a}
        = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V*(s′)]
• The relevant backup diagram: (omitted)
• V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
  Q*(s, a) = E{r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a}
           = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ max_{a′} Q*(s′, a′)]
  (Here a is not necessarily the best action; but after taking action a and reaching state s_{t+1}, the optimal policy is followed.)
• The relevant backup diagram: (omitted)
• Q* is the unique solution of this system of nonlinear equations.

What About Optimal Action-Value Functions?
• Given Q*, the agent does not even have to do a one-step-ahead search:
  π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
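Because the Bellman equation for V^π is one linear equation per state, V^π can be computed exactly for a small finite MDP. A minimal NumPy sketch, assuming the dynamics are given as arrays P[s, a, s′] and R[s, a, s′] (an illustrative encoding, not from the slides):

import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V for V^pi.

    P[s, a, s1]  transition probabilities  Pr{s_{t+1}=s1 | s_t=s, a_t=a}
    R[s, a, s1]  expected rewards          E{r_{t+1} | s_t=s, a_t=a, s_{t+1}=s1}
    pi[s, a]     action probabilities under the policy being evaluated
    """
    P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state transitions under pi
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)    # expected one-step reward under pi
    n = P.shape[0]
    # The linear system from the slide has a unique solution when gamma < 1
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

Iterative policy evaluation, introduced in the next chapter, computes the same V^π by repeated sweeps instead of a direct solve, which scales better when the state set is large.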
Chapter 4: Dynamic Programming
Objectives of this chapter:
• Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
• Show how DP can be used to compute value functions, and hence, optimal policies
• Discuss efficiency and utility of DP

Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
  accurate knowledge of environment dynamics;
  enough space and time to do the computation;
  the Markov Property.
• How much space and time do we need? Polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Policy Evaluation
• Policy evaluation: for a given policy π, compute the state-value function V^π.
• Recall the state-value function for policy π:
  V^π(s) = E_π{R_t | s_t = s} = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
• Bellman equation for V^π:
  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]

Iterative Methods
• V_0 → V_1 → … → V_k → V_{k+1} → … → V^π
• Each arrow is a "sweep": a sweep consists of applying a backup operation to each state.
• A full policy-evaluation backup:
  V_{k+1}(s) ← Σ_a π(s, a) Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V_k(s′)]

Iterative Policy Evaluation
(Algorithm box omitted; a code sketch of this procedure appears after the GPI slide below.)

Policy Improvement
• Suppose we have computed V^π for a deterministic policy π.
• For a given state s, would it be better to do an action a ≠ π(s)?
• The value of doing a in state s is:
  Q^π(s, a) = E_π{r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a}
            = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]
• It is better to switch to action a for state s if and only if Q^π(s, a) > V^π(s).

Policy Improvement Cont.
• Do this for all states to get a new policy π′ that is greedy with respect to V^π:
  π′(s) = argmax_a Q^π(s, a) = argmax_a Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]
• Then V^{π′} ≥ V^π.

Policy Iteration
  π_0 → V^{π_0} → π_1 → V^{π_1} → … → π* → V* → π*
• Alternates policy evaluation with policy improvement ("greedification").

Value Iteration
• Recall the full policy-evaluation backup:
  V_{k+1}(s) ← Σ_a π(s, a) Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V_k(s′)]
• Here is the full value-iteration backup:
  V_{k+1}(s) ← max_a Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V_k(s′)]

Value Iteration Cont.
(Algorithm box omitted.)

Generalized Policy Iteration
• Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
• (Figure: a geometric metaphor for the convergence of GPI.)
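A minimal sketch of policy iteration built from the full backups above, reusing the assumed P[s, a, s′] and R[s, a, s′] array encoding from the earlier value-function sketch:

import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Sweep V_{k+1}(s) <- sum_a pi(s,a) sum_s' P [R + gamma V_k(s')] until stable."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = np.einsum("sa,sat,sat->s", pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def greedy_policy(P, R, V, gamma):
    """Policy improvement: pi'(s) = argmax_a sum_s' P [R + gamma V(s')]."""
    Q = np.einsum("sat,sat->sa", P, R + gamma * V)
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0   # deterministic greedy policy
    return pi

def policy_iteration(P, R, gamma):
    """pi_0 -> V^{pi_0} -> pi_1 -> V^{pi_1} -> ... until the policy stops changing."""
    n_states, n_actions = P.shape[0], P.shape[1]
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # arbitrary initial policy
    while True:
        V = iterative_policy_evaluation(P, R, pi, gamma)
        pi_new = greedy_policy(P, R, V, gamma)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new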
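And a sketch of value iteration, which folds the max directly into the evaluation sweep (same assumed P, R encoding):

import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """V_{k+1}(s) <- max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V_k(s'))."""
    V = np.zeros(P.shape[0])
    while True:
        Q = np.einsum("sat,sat->sa", P, R + gamma * V)   # one-step lookahead values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)               # near-optimal V and its greedy policy
        V = V_new

Both routines are instances of generalized policy iteration: one alternates full evaluation and greedification, the other interleaves them at the finest granularity.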