
Temporal Difference Learning in the Brain
CBL Reinforcement Learning Reading Group
22/04/2015
Studies
• Schultz, Dayan, Montague, A Neural Substrate of Prediction and Reward, Science (1997)
• O’Doherty et al., Temporal Difference Models and Reward-Related Learning in the Human Brain, Neuron (2003)
• Steinberg et al., A causal link between prediction errors, dopamine neurons and learning, Nature Neuroscience (2013)
Overview
            Schultz et al        O’Doherty et al            Steinberg et al
Model       Monkey               Human                      Rodent
Method      Electrophysiology    Functional neuroimaging    Optogenetics
Causality   No                   No                         Yes
Region      Midbrain             Ventral Striatum           Midbrain
What Rowan Covered
• Markov reward processes
• Markov decision processes
• Value functions
• Policy functions
• Dynamic programming (value/policy iteration)
• TD(0)
• Monte Carlo method
• TD(lambda) = TD(0) + Monte Carlo
Today
• Markov reward processes
• Markov decision processes
• Value functions
• Policy functions
• Dynamic programming (value/policy iteration)
• TD(0)
• Monte Carlo method
• TD(lambda) = TD(0) + Monte Carlo
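As a reminder of what "TD(lambda) = TD(0) + Monte Carlo" abbreviates, the three updates can be written side by side. This is a sketch in standard textbook notation (step size \alpha, discount \gamma, episode end T, forward-view \lambda-return); the symbols are assumptions, not taken from the slides.

```latex
\begin{align}
\text{TD(0):}\quad & V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t,
  \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \\
\text{Monte Carlo:}\quad & V(s_t) \leftarrow V(s_t) + \alpha\,\bigl(G_t - V(s_t)\bigr),
  \qquad G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1} \\
\text{TD}(\lambda)\text{:}\quad & V(s_t) \leftarrow V(s_t) + \alpha\,\bigl(G_t^{\lambda} - V(s_t)\bigr),
  \qquad G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)} \\
\text{where}\quad & G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} + \gamma^{n} V(s_{t+n})
\end{align}
```

Setting \lambda = 0 leaves only the one-step return, which recovers TD(0); in an episodic task \lambda = 1 recovers the Monte Carlo return. The quantity \delta_t is the prediction error that the studies below look for in the brain.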
Pavlov’s Dogs
http://animals.howstuffworks.com/pets/dog-training1.htm
“Pavlovian” Conditioning
CS = bell
US = food
UR = salivation
“Pavlovian” Conditioning
CS = bell -> UR (some salivation)
US = food -> UR (some salivation)
(also see “Rescorla-Wagner Rule”)
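The Rescorla-Wagner rule mentioned above can be sketched in a few lines. This is a minimal illustration for a single cue, assuming a learning rate `alpha` and a reward of 1 for food and 0 for no food; the function name and the trial counts are hypothetical and chosen only for the example.

```python
def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Return the associative strength V of the cue after each trial."""
    v = v0
    history = []
    for r in rewards:
        delta = r - v        # prediction error on this trial
        v += alpha * delta   # move the prediction toward the outcome
        history.append(v)
    return history


# 50 acquisition trials (bell followed by food), then 50 extinction trials (bell alone).
values = rescorla_wagner([1.0] * 50 + [0.0] * 50, alpha=0.1)
print("after acquisition:", round(values[49], 3))
print("after extinction: ", round(values[99], 3))
```

The prediction rises toward 1 while the bell is paired with food and decays back toward 0 once food is withheld. The rule works at the level of whole trials, which is the main thing TD learning changes.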
Dopaminergic System
VTA = ventral tegmental area (part of “midbrain”)
Nucleus accumbens (part of “ventral striatum”)
VTA/Substantia Nigra = main sources of dopamine in the brain
Schultz et al
Recording from VTA neurons in monkeys in a Pavlovian conditioning paradigm (light -> fruit juice)
Schultz et al
Szepesvari
Schultz et al
Function approximation over features of visual input (similar argument for temporal precision of prediction)
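The TD account of these recordings can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model fitted in the paper: time within a trial is represented by a tapped delay line (one weight per time step since CS onset), the period before the CS has no features so its value stays at 0, and `t_cs`, `t_us`, `alpha`, `gamma` and the trial count are arbitrary illustrative choices.

```python
import numpy as np


def run_trial(w, t_cs, t_us, n_steps, reward=1.0, alpha=0.2, gamma=1.0, learn=True):
    """Run one conditioning trial and return the TD error at every time step."""

    def value(t):
        # Tapped delay line: weight i is active i steps after CS onset.
        i = t - t_cs
        return w[i] if 0 <= i < len(w) else 0.0

    deltas = np.zeros(n_steps)
    for t in range(n_steps - 1):
        r = reward if t + 1 == t_us else 0.0
        delta = r + gamma * value(t + 1) - value(t)
        deltas[t + 1] = delta  # the error is registered when step t + 1 is observed
        i = t - t_cs
        if learn and 0 <= i < len(w):
            w[i] += alpha * delta  # TD(0) update of the active weight
    return deltas


n_steps, t_cs, t_us = 20, 5, 15
w = np.zeros(n_steps - t_cs)

first = run_trial(w, t_cs, t_us, n_steps)          # naive animal
for _ in range(500):                               # further training trials
    run_trial(w, t_cs, t_us, n_steps)
trained = run_trial(w, t_cs, t_us, n_steps, learn=False)
omitted = run_trial(w, t_cs, t_us, n_steps, reward=0.0, learn=False)

print("first trial:   CS", first[t_cs], " US", first[t_us])
print("after training: CS", round(trained[t_cs], 3), " US", round(trained[t_us], 3))
print("reward omitted: US", round(omitted[t_us], 3))
```

In this sketch the error appears at the time of the juice on the first trial, moves to CS onset after training, and turns negative at the expected reward time when the reward is omitted.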
Functional Neuroimaging
Non-invasive recording of the “BOLD” (blood-oxygen-level-dependent) signal
Delayed increase in oxygenated blood flow in response to energy demands of neurons
O’Doherty et al
CS -> US
US+ = glucose (positive)
USneut = artificial saliva (neutral)
US- = nothing (negative)
1-to-1 mapping of CS to US
Trial timeline, time (s): CS on at 0; CS off / US on at 3; timeline shown to 6
O’Doherty et al
Prediction error regressor time-locked to CS and US
Ventral striatum = downstream of midbrain dopaminergic neurons
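One way to picture a prediction error regressor time-locked to CS and US is the sketch below. It is an illustration only, not the analysis pipeline of the paper: it covers rewarded (US+) trials, uses a simple trial-level learning rule, and assumes a sampling step `dt`, a trial spacing `trial_len`, a learning rate `alpha`, and a single-gamma haemodynamic response peaking at 5 s. Only the 3 s CS-to-US delay comes from the slide.

```python
import math

import numpy as np

dt = 0.5                   # seconds per sample (assumed)
trial_len = 12.0           # seconds between trial onsets (assumed)
us_delay = 3.0             # CS on at 0 s, CS off / US on at 3 s (from the slide)
n_trials, alpha = 20, 0.3  # illustrative values

samples_per_trial = int(trial_len / dt)
sticks = np.zeros(n_trials * samples_per_trial)

v = 0.0  # learned prediction of reward given the CS
for k in range(n_trials):
    cs = k * samples_per_trial
    us = cs + int(us_delay / dt)
    sticks[cs] = v          # error at CS onset: predicted value jumps from 0 to v
    sticks[us] = 1.0 - v    # error at US: reward (1) minus prediction
    v += alpha * (1.0 - v)  # update the prediction after the trial

# Single-gamma haemodynamic response (assumed shape, peak at 5 s).
t = np.arange(0.0, 30.0, dt)
hrf = t**5 * np.exp(-t) / math.factorial(5)

# Convolve the error "sticks" with the response to get a BOLD-like regressor.
regressor = np.convolve(sticks, hrf)[: len(sticks)] * dt
```

Early in learning the regressor is driven by the US; as the prediction is learned, the same regressor is driven by the CS instead, which is the pattern a TD account predicts for the BOLD signal.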
Optogenetics
“switching neurons on and off with lights and viruses”
Steinberg et al
Blocking paradigm
Steinberg et al
PairedCre+ = experimental group
PairedCre- = wild-type equivalent (control 1)
UnpairedCre+ = asynchronous optical stimulation (control 2)
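The logic of blocking, and of "unblocking" it by stimulating dopamine neurons at the time of reward, can be sketched with a compound-cue Rescorla-Wagner model. This is an illustration under assumptions, not the analysis from the paper: cue names `A` and `X`, the learning rate, the trial counts, and the idea of modelling optical stimulation as a constant `stim` added to the prediction error are all hypothetical choices for the example.

```python
def train(v, cues, reward, n_trials, alpha=0.2, stim=0.0):
    """Update the associative strengths in v for the cues presented together."""
    for _ in range(n_trials):
        # Shared prediction error; stim models an artificial dopamine burst (assumption).
        delta = reward - sum(v[c] for c in cues) + stim
        for c in cues:
            v[c] += alpha * delta
    return v


def blocking_experiment(stim=0.0):
    """Return what is learned about X after A alone, then A and X together."""
    v = {"A": 0.0, "X": 0.0}
    train(v, ["A"], reward=1.0, n_trials=50)                   # phase 1: A -> reward
    train(v, ["A", "X"], reward=1.0, n_trials=20, stim=stim)  # phase 2: AX -> reward
    return v["X"]


blocked = blocking_experiment(stim=0.0)    # reward already predicted by A
unblocked = blocking_experiment(stim=0.5)  # extra error paired with the reward
print("V(X) without stimulation:", round(blocked, 4))
print("V(X) with stimulation:   ", round(unblocked, 4))
```

Without stimulation, A already predicts the reward, the error is near zero, and X acquires almost nothing. Adding an error signal at reward time lets X acquire value, which is the qualitative effect the paired stimulation group is designed to test.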
Discussion
• Correlative and causal evidence that reward prediction errors (RPEs) drive behaviour and physiological responses
• Many unknowns:
  Construction of state-dependent value signals
  How are prediction errors “bound” to states?
  D1 vs D2 receptors
  Serotonergic error signaling?