Temporal Difference Learning in the Brain
CBL Reinforcement Learning Reading Group
22/04/2015

Studies
• Schultz, Dayan, Montague, A Neural Substrate of Prediction and Reward, Science (1997)
• O’Doherty et al., Temporal Difference Models and reward-related learning in the brain, Neuron (2003)
• Steinberg et al., A causal link between prediction errors, dopamine neurons, and learning, Nature Neuroscience (2013)

Overview
            Schultz et al       O’Doherty et al           Steinberg et al
Model       Monkey              Human                     Rodent
Method      Electrophysiology   Functional neuroimaging   Optogenetics
Causality   No                  No                        Yes
Region      Midbrain            Ventral Striatum          Midbrain

What Rowan Covered
• Markov reward processes
• Markov decision processes
• Value functions
• Policy functions
• Dynamic programming (value/policy iteration)
• TD(0)
• Monte Carlo method
• TD(lambda) = TD(0) + Monte Carlo

Today
• Markov reward processes
• Markov decision processes
• Value functions
• Policy functions
• Dynamic programming (value/policy iteration)
• TD(0)
• Monte Carlo method
• TD(lambda) = TD(0) + Monte Carlo

Pavlov’s Dogs
http://animals.howstuffworks.com/pets/dog-training1.htm

“Pavlovian” Conditioning
CS (bell)   US (food)
UR (salivation)

“Pavlovian” Conditioning
CS (bell)   US (food)
UR (some salivation)   UR (some salivation)
(also see “Rescorla-Wagner Rule”)

Dopaminergic System
VTA = ventral tegmental area (part of “midbrain”)
Nucleus accumbens (part of “ventral striatum”)
VTA/Substantia Nigra = source of dopamine in the brain

Schultz et al
Recording from VTA neurons in monkeys in Pavlovian conditioning paradigm (light -> fruit juice)

Szepesvari

Schultz et al
Function approximation over features of visual input (similar argument for temporal precision of prediction)

Functional Neuroimaging
Non-invasive recording of “BOLD” signal
Delayed increase in oxygenated blood flow in response to energy demands of neurons

O’Doherty et al
CS -> US, 1-to-1 mapping CS to US
US+ = glucose (positive)
USneut = artificial saliva (neutral)
US- = nothing (negative)
Timeline: CS on at 0 s; CS off / US on at 3 s; time axis (s) marked 0, 3, 6

O’Doherty et al
Prediction error regressor time-locked to CS and US
Ventral striatum = downstream of midbrain dopaminergic neurons

Optogenetics
“switching neurons on and off with lights and viruses”

Steinberg et al
Blocking paradigm

Steinberg et al
PairedCre+ = experimental group
PairedCre- = wild-type equivalent (control 1)
UnpairedCre+ = asynchronous optical stimulation (control 2)

Discussion
• Correlative and causal evidence that RPEs drive behaviour and physiological responses
• Many unknowns:
    • Construction of state-dependent value signals
    • How are prediction errors “bound” to states?
    • D1 vs D2 receptors
    • Serotonergic error signaling?
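Sketch: TD(0) prediction error in a Pavlovian trial
The slides tie TD(0) reward prediction errors (RPEs) to dopamine responses in a light -> fruit juice paradigm. The Python sketch below is one minimal way to illustrate that idea, not the model used in any of the three studies. It assumes a value function with one weight per time step since CS onset, a discount factor of 1, and illustrative timings and learning rate. All names (value, run_trial, CS_TIME, US_TIME, N_STEPS) are hypothetical.

```python
def value(w, t, cs_time):
    """Predicted future reward at time t (assumption: one weight per step since CS onset)."""
    i = t - cs_time
    return w[i] if 0 <= i < len(w) else 0.0


def run_trial(w, cs_time, us_time, n_steps, reward=1.0,
              alpha=0.2, gamma=1.0, learn=True):
    """Run one trial and return the TD error observed at each time step.

    delta[t + 1] = r(t + 1) + gamma * V(t + 1) - V(t)
    """
    delta = [0.0] * n_steps
    for t in range(n_steps - 1):
        # The reward (US) arrives on the transition into us_time.
        r = reward if t + 1 == us_time else 0.0
        delta[t + 1] = r + gamma * value(w, t + 1, cs_time) - value(w, t, cs_time)
        # TD(0) update of the weight for the state we just left.
        # Steps before CS onset have no weight, so the CS stays unpredicted.
        i = t - cs_time
        if learn and 0 <= i < len(w):
            w[i] += alpha * delta[t + 1]
    return delta


# Illustrative timings (assumptions, not taken from the studies).
CS_TIME, US_TIME, N_STEPS = 5, 15, 25
weights = [0.0] * (N_STEPS - CS_TIME)

first = run_trial(weights, CS_TIME, US_TIME, N_STEPS)        # naive animal
for _ in range(500):                                         # conditioning
    run_trial(weights, CS_TIME, US_TIME, N_STEPS)
trained = run_trial(weights, CS_TIME, US_TIME, N_STEPS, learn=False)
omitted = run_trial(weights, CS_TIME, US_TIME, N_STEPS, reward=0.0, learn=False)

for name, d in [("first", first), ("trained", trained), ("omitted", omitted)]:
    print(name, "error at CS:", round(d[CS_TIME], 3), "error at US:", round(d[US_TIME], 3))
```

Under these assumptions the error sits at the US on the first trial, moves to CS onset after training, and turns negative at the expected US time when the reward is omitted.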