Introduction to Artificial Intelligence
Marc Toussaint
April 7, 2015

The majority of slides of the earlier parts are adapted from Stuart Russell.

This is a direct concatenation and reformatting of all lecture slides and exercises from the Artificial Intelligence course (winter term 2014/15, U Stuttgart), including indexing to help prepare for exams.

[Figure: course overview map. Recoverable labels: sequential decisions, deterministic, probabilistic, propositional, relational, learning; search (BFS), backtracking, sequential assignment, CSP, constraint propagation, propositional logic, FOL, fwd/bwd chaining, games, minimax, alpha/beta pruning, on trees, bandits, UCB, MCTS, utilities, Decision Theory, MDPs, multi-agent MDPs, relational MDPs, dynamic programming V(s), Q(s,a), graphical models, relational graphical models, belief propagation, msg. passing, HMMs, fwd/bwd msg. passing, ML, Reinforcement Learning, Active Learning]

Contents

1 Introduction

3 Search
  Example: Romania (1:2); Example: Vacuum World (1:5); Problem Definition: Deterministic, fully observable (1:9); Example: The 8-Puzzle (1:15)
  3.1 Tree Search Algorithms
    Tree search implementation: states vs nodes (1:25); Tree Search: General Algorithm (1:26); Breadth-first search (BFS) (1:29); Complexity of BFS (1:37); Uniform-cost search (1:38); Depth-first search (DFS) (1:39); Complexity of DFS (1:52); Iterative deepening search (1:54); Complexity of Iterative Deepening Search (1:63); Graph search and repeated states (1:65)

4 Informed search algorithms
  Best-first Search (2:3); Greedy Search (2:5); Complexity of Greedy Search (2:14); A* search (2:15); A*: Proof 1 of Optimality (2:22); Complexity of A* (2:27); A*: Proof 2 of Optimality (2:28); Admissible heuristics (2:30); Memory-bounded A* (2:34)

5 Constraint Satisfaction Problems
  Constraint satisfaction problems (CSPs): Definition (3:2); Map-Coloring Problem (3:3)
  5.1 Methods for solving CSPs
    Backtracking (3:10); Variable order: Minimum remaining values (3:18); Variable order: Degree heuristic (3:19); Value order: Least constraining value (3:20); Forward checking (3:21); Constraint propagation (3:25); Tree-structured CSPs (3:33)

6 Optimization
  Optimization problem: Definition (4:2); Local Search (4:5); Travelling Salesman Problem (TSP) (4:6); Local optima, plateaus (4:8); Iterated Local Search (4:9); Simulated Annealing (4:11); Genetic Algorithms (4:14)
  6.1 A glimpse at general optimization problems
    LP, QP, ILP, NLP (4:17); Slack Variables (4:19); n-queens as ILP (4:19); TSP as ILP (4:20); CSP as ILP (4:21)

7 Propositional Logic
  Knowledge base: Definition (5:2); Wumpus World example (5:4); Logics: Definition, Syntax, Semantics (5:20); Entailment (5:21); Model (5:22); Inference (5:28); Propositional logic: Syntax (5:29); Propositional logic: Semantics (5:31); Logical equivalence (5:37); Validity (5:38); Satisfiability (5:38); Horn Form (5:40); Modus Ponens (5:40); Forward chaining (5:41); Completeness of Forward Chaining (5:51); Backward Chaining (5:52); Conjunctive Normal Form (5:64); Resolution (5:64); Conversion to CNF (5:65)

8 First Order Logic
  FOL: Syntax (6:5); Universal quantification (6:7); Existential quantification (6:8)
  8.1 FOL description of interactive domains
    Situation Calculus (6:21); Frame problem (6:22); Planning Domain Definition Language (PDDL) (6:24)

9 First Order Logic – Inference
  Reduction to propositional inference (7:6); Unification (7:9); Generalized Modus Ponens (7:14); Forward Chaining (7:15); Backward Chaining (7:17); Resolution (7:19); Conversion to CNF (7:20)

10 Probabilities
  Probabilities as (subjective) information calculus (8:2); Frequentist vs Bayesian (8:4)
  10.1 Basic definitions
    Definitions based on sets (8:6); Random variables (8:7); Probability distribution (8:8); Joint distribution (8:9); Marginal (8:9); Conditional distribution (8:9); Bayes' Theorem (8:11); Multiple RVs, conditional independence (8:12)
  10.2 Probability distributions
    Bernoulli and Binomial (8:14); Beta (8:15); Multinomial (8:18); Dirichlet (8:19); Conjugate priors (8:23); Dirac (8:26); Gaussian (8:27); Particle approximation of a distribution (8:29); Utilities and Decision Theory (8:32); Entropy (8:33); Kullback-Leibler divergence (8:34)

11 Bandits & UCT
  Multi-armed Bandits (9:1); Exploration, Exploitation (9:6); Upper Confidence Bound (UCB) (9:8); Monte Carlo Tree Search (MCTS) (9:14); Upper Confidence Tree (UCT) (9:19)

12 Game Playing
  Minimax (10:3); Alpha-Beta Pruning (10:6); Evaluation functions (10:11); UCT for games (10:12)

13 Graphical Models
  Bayesian Network (11:3); Conditional independence in a Bayes Net (11:7); Inference: general meaning (11:12)
  13.1 Inference Methods in Graphical Models
    Inference in graphical models: overview (11:17); Monte Carlo (11:19); Importance sampling (11:22); Gibbs sampling (11:24); Variable elimination (11:27); Factor graph (11:30); Belief propagation (11:36); Message passing (11:36); Loopy belief propagation (11:39); Junction tree algorithm (11:41); Maximum a-posteriori (MAP) inference (11:45); Conditional random field* (11:46)

14 Dynamic Models
  Markov Process (12:2); Filtering, Smoothing, Prediction (12:3); Hidden Markov Model (12:4); HMM: Inference (12:5); HMM inference (12:6); Kalman filter (12:9)

15 Reinforcement Learning
  Markov Decision Process (MDP) (13:3); Value Function (13:4); Bellman optimality equation (13:8); Value Iteration (13:10); Q-Function (13:11); Q-Iteration (13:12); Proof of convergence of Q-Iteration (13:13); Temporal difference (TD) (13:19); Sarsa (13:21); Q-learning (13:22); Proof of convergence of Q-learning (13:24); Eligibility traces (13:26); Model-based RL (13:34); Imitation Learning (13:37); Inverse RL (13:40); Policy gradients (13:44)

16 Reinforcement Learning – Exploration
  Epsilon-greedy exploration in Q-learning (14:4); Sample Complexity (14:6); PAC-MDP efficiency (14:7); Explicit-Exploit-or-Explore* (14:9); R-Max (14:14); Bayesian RL (14:15); Optimistic heuristics (14:16)

17 Exercises
  17.1 Exercise 1; 17.2 Exercise 2; 17.3 Exercise 3; 17.4 Exercise 4; 17.5 Exercise 5; 17.6 Exercise 6; 17.7 Exercise 7; 17.8 Exercise 8; 17.9 Exercise 9; 17.10 Exercise 10

Index
1 Introduction

I wasn't happy with the first slides ('Introduction' and 'Intelligent Agents'). So I skip them here. They will also not be relevant for the exam. You may find them on the lecture webpage.

3 Search

Outline (1:1)
• Problem formulation & examples
• Basic search algorithms

Example: Romania (1:2)
On holiday in Romania; currently in Arad. Flight leaves tomorrow from Bucharest.
Formulate goal: be in Bucharest, $S_\text{goal}$ = {Bucharest}
Formulate problem:
  states: various cities, S = {Arad, Timisoara, ...}
  actions: drive between cities, A = {edges between states}
Find solution:
  sequence of cities, e.g., Arad, Sibiu, Fagaras, Bucharest
  minimize costs with cost function, $(s, a) \mapsto c$

Example: Romania (1:3)
[Figure not reproduced]

Problem types (1:4)
Deterministic, fully observable ("single-state problem")
  Agent knows exactly which state it will be in; solution is a sequence
  First state and world known → the agent does not rely on observations
Non-observable ("conformant problem")
  Agent may have no idea where it is; solution (if any) is a sequence
Nondeterministic and/or partially observable ("contingency problem")
  percepts provide new information about current state
  solution is a reactive plan or a policy
  often interleave search, execution
Unknown state space ("exploration problem")

Example: vacuum world (1:5–1:8)
[Figure not reproduced: the vacuum-world states, numbered 1 to 8]
Deterministic, fully observable, start in #5. Solution?? [Right, Suck]
Non-observable, start in {1, 2, 3, 4, 5, 6, 7, 8}; e.g., Right goes to {2, 4, 6, 8}. Solution?? [Right, Suck, Left, Suck]
Non-deterministic, start in #5
  Murphy's Law: Suck can dirty a clean carpet
  Local sensing: dirt, location only.
  Solution?? [Right, if dirt then Suck]

Deterministic, fully observable problem def. (1:9)
A deterministic, fully observable problem is defined by four items:
  initial state $s_0 \in S$, e.g., $s_0$ = Arad
  successor function succ : S × A → S, e.g., succ(Arad, Arad-Zerind) = Zerind
  goal states $S_\text{goal} \subseteq S$, e.g., s = Bucharest
  step cost function cost(s, a, s'), assumed to be ≥ 0
    e.g., traveled distance, number of actions executed, etc.
    the path cost is the sum of step costs
A solution is a sequence of actions leading from $s_0$ to a goal
An optimal solution is a solution with minimal path costs
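The four items of the problem definition map directly onto a small data structure. The following is a minimal sketch, assuming a fragment of the Romania map; the class name `RouteProblem`, the helper `evaluate`, and the road distances are illustrative assumptions (the distances are not given in the text).

```python
# Fragment of the Romania road map. Distances (km) are illustrative
# assumptions; the text above does not list them.
ROADS = {
    ("Arad", "Zerind"): 75,
    ("Arad", "Sibiu"): 140,
    ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99,
    ("Sibiu", "Rimnicu Vilcea"): 80,
    ("Fagaras", "Bucharest"): 211,
    ("Rimnicu Vilcea", "Pitesti"): 97,
    ("Pitesti", "Bucharest"): 101,
}


class RouteProblem:
    """Deterministic, fully observable problem: (s0, succ, S_goal, cost)."""

    def __init__(self, s0, goal_states, roads):
        self.s0 = s0
        self.goal_states = set(goal_states)
        self.roads = roads

    def actions(self, s):
        # An action is a road leaving s; it is named by the city it leads to.
        return sorted([b for (a, b) in self.roads if a == s] +
                      [a for (a, b) in self.roads if b == s])

    def succ(self, s, a):
        # Deterministic successor function: driving road s-a ends in a.
        if a not in self.actions(s):
            raise ValueError(f"no road from {s} to {a}")
        return a

    def cost(self, s, a, s_next):
        # Step cost, assumed >= 0; roads are undirected.
        if (s, s_next) in self.roads:
            return self.roads[(s, s_next)]
        return self.roads[(s_next, s)]

    def is_goal(self, s):
        return s in self.goal_states


def evaluate(problem, plan):
    """Return (final state, path cost) of an action sequence started in s0."""
    s, total = problem.s0, 0
    for a in plan:
        s_next = problem.succ(s, a)
        total += problem.cost(s, a, s_next)  # path cost = sum of step costs
        s = s_next
    return s, total


problem = RouteProblem("Arad", {"Bucharest"}, ROADS)
final_state, path_cost = evaluate(problem, ["Sibiu", "Fagaras", "Bucharest"])
print(final_state, path_cost, problem.is_goal(final_state))
```

A solution is then any plan for which `is_goal` holds on the final state; an optimal solution is one with minimal `path_cost`.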
Example: vacuum world state space graph (1:10–1:14)
[Figure not reproduced]
states??: integer dirt and robot locations (ignore dirt amounts etc.)
actions??: Left, Right, Suck, NoOp
goal test??: no dirt
path cost??: 1 per action (0 for NoOp)

Example: The 8-puzzle (1:15–1:19)
[Figure not reproduced]
states??: integer locations of tiles (ignore intermediate positions)
actions??: move blank left, right, up, down (ignore unjamming etc.)
goal test??: = goal state (given)
path cost??: 1 per move
[Note: optimal solution of n-Puzzle family is NP-hard]

3.1 Tree Search Algorithms (1:20)

Tree search algorithms (1:21)
Basic idea: offline, simulated exploration of state space by generating successors of already-explored states (a.k.a. expanding states)

Tree search example (1:22–1:24)
[Figures not reproduced]

Implementation: states vs. nodes (1:25)
A state is a (representation of) a physical configuration
A node is a data structure constituting part of a search tree
  includes parent, children, depth, path cost g(x)
States do not have parents, children, depth, or path cost!
The EXPAND function creates new nodes, filling in the various fields and using the SUCCESSOR-FN of the problem to create the corresponding states.

Implementation: general tree search (1:26)

  function TREE-SEARCH(problem, fringe) returns a solution, or failure
    fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
    loop do
      if fringe is empty then return failure
      node ← REMOVE-FRONT(fringe)
      if GOAL-TEST(problem, STATE[node]) then return node
      fringe ← INSERT-ALL(EXPAND(node, problem), fringe)

  function EXPAND(node, problem) returns a set of nodes
    successors ← the empty set
    for each action, result in SUCCESSOR-FN(problem, STATE[node]) do
      s ← a new NODE
      PARENT-NODE[s] ← node; ACTION[s] ← action; STATE[s] ← result
      PATH-COST[s] ← PATH-COST[node] + STEP-COST(STATE[node], action, result)
      DEPTH[s] ← DEPTH[node] + 1
      add s to successors
    return successors

Search strategies (1:27)
A strategy is defined by picking the order of node expansion
Strategies are evaluated along the following dimensions:
  completeness: does it always find a solution if one exists?
  time complexity: number of nodes generated/expanded
  space complexity: maximum number of nodes in memory
  optimality: does it always find a least-cost solution?
Time and space complexity are measured in terms of
  b: maximum branching factor of the search tree
  d: depth of the least-cost solution
  m: maximum depth of the state space (may be ∞)

Uninformed search strategies (1:28)
Uninformed strategies use only the information available in the problem definition
  Breadth-first search
  Uniform-cost search
  Depth-first search
  Depth-limited search
  Iterative deepening search

Breadth-first search (1:29–1:32)
Expand shallowest unexpanded node
Implementation: fringe is a FIFO queue, i.e., new successors go at end
[Figures not reproduced]

Properties of breadth-first search (1:33–1:37)
Complete?? Yes (if b is finite)
Time?? $1 + b + b^2 + b^3 + \ldots + b^d + b(b^d - 1) = O(b^{d+1})$, i.e., exp. in d
Space?? $O(b^{d+1})$ (keeps every node in memory)
Optimal?? Yes (if cost = 1 per step); not optimal in general
Space is the big problem; can easily generate nodes at 100MB/sec, so 24hrs = 8640GB.
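The TREE-SEARCH and EXPAND pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the `Node` class, the `lifo` flag, and the toy `GRAPH` are assumptions made for the example. With a FIFO fringe the function behaves as breadth-first search; switching the fringe to LIFO gives depth-first search, which illustrates that a strategy is just an order of node expansion.

```python
from collections import deque


class Node:
    """Search-tree node: wraps a state plus parent, action, path cost, depth."""

    def __init__(self, state, parent=None, action=None, path_cost=0, depth=0):
        self.state = state
        self.parent = parent
        self.action = action
        self.path_cost = path_cost
        self.depth = depth

    def path(self):
        # Follow parent pointers back to the root to recover the state sequence.
        node, states = self, []
        while node is not None:
            states.append(node.state)
            node = node.parent
        return states[::-1]


def expand(node, successor_fn):
    # successor_fn(state) yields (action, resulting state, step cost) triples.
    return [Node(result, node, action, node.path_cost + cost, node.depth + 1)
            for action, result, cost in successor_fn(node.state)]


def tree_search(start, is_goal, successor_fn, lifo=False):
    """FIFO fringe = breadth-first search; LIFO fringe = depth-first search."""
    fringe = deque([Node(start)])
    while fringe:
        node = fringe.pop() if lifo else fringe.popleft()
        if is_goal(node.state):
            return node
        fringe.extend(expand(node, successor_fn))
    return None  # failure


# Hypothetical toy state space (acyclic, so tree search terminates).
GRAPH = {
    "A": [("A->B", "B", 1), ("A->C", "C", 1)],
    "B": [("B->D", "D", 1)],
    "C": [("C->G", "G", 1)],
    "D": [("D->G", "G", 1)],
    "G": [],
}

goal = tree_search("A", lambda s: s == "G", GRAPH.__getitem__)
print(goal.path(), goal.depth, goal.path_cost)
```

Because no repeated-state check is done, this sketch would loop forever with a LIFO fringe on a cyclic state space, which matches the completeness caveats for depth-first search.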
Uniform-cost search (1:38)
Expand least-cost unexpanded node
Implementation: fringe = queue ordered by path cost, lowest first
Equivalent to breadth-first if step costs all equal
Complete?? Yes, if step cost ≥ ε
Time?? # of nodes with g ≤ cost of optimal solution, $O(b^{\lceil C^*/\epsilon \rceil})$, where $C^*$ is the cost of the optimal solution
Space?? # of nodes with g ≤ cost of optimal solution, $O(b^{\lceil C^*/\epsilon \rceil})$
Optimal?? Yes: nodes are expanded in increasing order of g(n)

Depth-first search (1:39–1:47)
Expand deepest unexpanded node
Implementation: fringe = LIFO queue, i.e., put successors at front
[Figures not reproduced]
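Uniform-cost search differs from breadth-first search only in how the fringe is ordered. A minimal sketch using a binary heap as the priority queue is given below; the function name, the tie-breaking counter, and the toy graph `EDGES` are assumptions for illustration.

```python
import heapq
from itertools import count


def uniform_cost_search(start, is_goal, successors):
    """Tree search with the fringe ordered by path cost g, lowest first."""
    tie = count()  # tie-breaker so the heap never compares states or paths
    fringe = [(0, next(tie), start, [start])]
    while fringe:
        g, _, state, path = heapq.heappop(fringe)
        if is_goal(state):  # goal test on expansion, so the returned g is minimal
            return g, path
        for nxt, step_cost in successors(state):
            heapq.heappush(fringe, (g + step_cost, next(tie), nxt, path + [nxt]))
    return None  # failure


# Hypothetical weighted graph; all step costs are positive (>= epsilon).
EDGES = {
    "S": [("A", 1), ("B", 5)],
    "A": [("B", 1), ("G", 10)],
    "B": [("G", 1)],
    "G": [],
}

print(uniform_cost_search("S", lambda s: s == "G", EDGES.__getitem__))
```

In this example the direct-looking routes S, B, G (cost 6) and S, A, G (cost 11) are both more expensive than S, A, B, G (cost 3), which is the one returned, because nodes leave the fringe in increasing order of g.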
Properties of depth-first search (1:48–1:52)
Complete?? No: fails in infinite-depth spaces, spaces with loops
  Modify to avoid repeated states along path ⇒ complete in finite spaces
Time?? $O(b^m)$: terrible if m is much larger than d, but if solutions are dense, may be much faster than breadth-first
Space?? $O(bm)$, i.e., linear space!
Optimal?? No

Depth-limited search (1:53)
= depth-first search with depth limit l, i.e., nodes at depth l have no successors
Recursive implementation:

  function DEPTH-LIMITED-SEARCH(problem, limit) returns soln/fail/cutoff
    RECURSIVE-DLS(MAKE-NODE(INITIAL-STATE[problem]), problem, limit)

  function RECURSIVE-DLS(node, problem, limit) returns soln/fail/cutoff
    cutoff-occurred? ← false
    if GOAL-TEST(problem, STATE[node]) then return node
    else if DEPTH[node] = limit then return cutoff
    else for each successor in EXPAND(node, problem) do
      result ← RECURSIVE-DLS(successor, problem, limit)
      if result = cutoff then cutoff-occurred? ← true
      else if result ≠ failure then return result
    if cutoff-occurred? then return cutoff else return failure

Iterative deepening search (1:54)

  function ITERATIVE-DEEPENING-SEARCH(problem) returns a solution
    inputs: problem, a problem
    for depth ← 0 to ∞ do
      result ← DEPTH-LIMITED-SEARCH(problem, depth)
      if result ≠ cutoff then return result
    end

Iterative deepening search, l = 0 (1:55) and l = 1 (1:56)
[Figures not reproduced]
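The two pseudocode functions above translate almost line by line into Python. The sketch below is an illustration under assumptions: it works on states rather than node objects, returns the path as a list, uses `None` for failure, caps the depth loop with a hypothetical `max_depth` parameter instead of running to infinity, and uses a made-up graph with a loop between A and B.

```python
CUTOFF = "cutoff"  # sentinel distinct from failure (None) and from any path


def depth_limited_search(state, is_goal, successors, limit, path=None):
    """Return a path (list of states), CUTOFF, or None for failure."""
    path = (path or []) + [state]
    if is_goal(state):
        return path
    if limit == 0:
        return CUTOFF
    cutoff_occurred = False
    for child in successors(state):
        result = depth_limited_search(child, is_goal, successors, limit - 1, path)
        if result == CUTOFF:
            cutoff_occurred = True
        elif result is not None:
            return result
    return CUTOFF if cutoff_occurred else None


def iterative_deepening_search(start, is_goal, successors, max_depth=50):
    """Run depth-limited search with limit 0, 1, 2, ... up to max_depth."""
    for depth in range(max_depth + 1):
        result = depth_limited_search(start, is_goal, successors, depth)
        if result != CUTOFF:
            return result
    return CUTOFF


# Hypothetical state space with a loop (A <-> B); plain DFS could cycle here.
LOOPY = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["G"],
    "D": [],
    "G": [],
}

print(iterative_deepening_search("A", lambda s: s == "G", LOOPY.__getitem__))
```

The depth limit is what keeps the search from following the A, B, A, B, ... loop forever, while memory use stays linear in the current limit.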
Iterative deepening search, l = 2 (1:57) and l = 3 (1:58)
[Figures not reproduced]

Properties of iterative deepening search (1:59–1:63)
Complete?? Yes
Time?? $(d+1)b^0 + d\,b^1 + (d-1)b^2 + \ldots + b^d = O(b^d)$
Space?? $O(bd)$
Optimal?? Yes, if step cost = 1
  Can be modified to explore uniform-cost tree
Numerical comparison for b = 10 and d = 5, solution at far left leaf:
  N(IDS) = 50 + 400 + 3,000 + 20,000 + 100,000 = 123,450
  N(BFS) = 10 + 100 + 1,000 + 10,000 + 100,000 + 999,990 = 1,111,100
IDS does better because other nodes at depth d are not expanded
BFS can be modified to apply goal test when a node is generated

Summary of algorithms (1:64)

  Criterion   Breadth-First   Uniform-Cost                       Depth-First   Depth-Limited   Iterative Deepening
  Complete?   Yes*            Yes*                               No            Yes, if l ≥ d   Yes
  Time        $b^{d+1}$       $b^{\lceil C^*/\epsilon \rceil}$   $b^m$         $b^l$           $b^d$
  Space       $b^{d+1}$       $b^{\lceil C^*/\epsilon \rceil}$   $bm$          $bl$            $bd$
  Optimal?    Yes*            Yes                                No            No              Yes*

Loops: Repeated states (1:65)
Failure to detect repeated states can turn a linear problem into an exponential one!
[Figure not reproduced]

Graph search (1:66)

  function GRAPH-SEARCH(problem, fringe) returns a solution, or failure
    closed ← an empty set
    fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
    loop do
      if fringe is empty then return failure
      node ← REMOVE-FRONT(fringe)
      if GOAL-TEST(problem, STATE[node]) then return node
      if STATE[node] is not in closed then
        add STATE[node] to closed
        fringe ← INSERT-ALL(EXPAND(node, problem), fringe)
    end

But: storing all visited nodes leads again to exponential space complexity (as for BFS)

Summary (1:67)
In BFS (or uniform-cost search), the fringe propagates layer-wise, containing nodes of similar distance-from-start (cost-so-far), leading to optimal paths but exponential space complexity $O(b^{d+1})$
In DFS, the fringe is like a deep light beam sweeping over the tree, with space complexity $O(bm)$. Iteratively deepening it also leads to optimal paths.
Graph search can be exponentially more efficient than tree search, but storing the visited nodes may lead to exponential space complexity, as for BFS.

4 Informed search algorithms

Outline (2:1)
• Best-first search
• A* search
• Heuristics

Review: Tree search (2:2)

  function TREE-SEARCH(problem, fringe) returns a solution, or failure
    fringe ← INSERT(MAKE-NODE(INITIAL-STATE[problem]), fringe)
    loop do
      if fringe is empty then return failure
      node ← REMOVE-FRONT(fringe)
      if GOAL-TEST[problem] applied to STATE[node] succeeds return node
      fringe ← INSERT-ALL(EXPAND(node, problem), fringe)

A strategy is defined by picking the order of node expansion

Best-first search (2:3)
Idea: use an arbitrary priority function f(n) for each node
  – actually f(n) is neg-priority: nodes with lower f(n) have higher priority
f(n) should reflect which nodes could be on an optimal path
  – could is optimistic: the lower f(n), the more optimistic you are that n is on an optimal path
⇒ Expand the unexpanded node with highest priority
Implementation: fringe is a queue sorted in decreasing order of priority
Special cases:
  greedy search
  A* search

Romania with step costs in km (2:4)
[Figure not reproduced]

Greedy search (2:5)
We set the priority function equal to a heuristic, f(n) = h(n)
  h(n) = estimate of cost from n to the closest goal
E.g., $h_\text{SLD}(n)$ = straight-line distance from n to Bucharest
Greedy search expands the node that appears to be closest to goal

Greedy search example (2:6–2:9)
[Figures not reproduced]

Properties of greedy search (2:10–2:14)
Complete?? No: can get stuck in loops, e.g., with Oradea as goal, Iasi → Neamt → Iasi → Neamt → ...
  Complete in finite space with repeated-state checking
Time?? $O(b^m)$, but a good heuristic can give dramatic improvement
Space?? $O(b^m)$: keeps all nodes in memory
Optimal?? No
Greedy search does not care about the 'past' (the cost-so-far).

A* search (2:15)
Idea: combine information from the past and the future
  – neg-priority = cost-so-far + estimated cost-to-go
Evaluation function f(n) = g(n) + h(n)
  g(n) = cost-so-far to reach n
  h(n) = estimated cost-to-go from n
  f(n) = estimated total cost of path through n to goal
A* search uses an admissible (= optimistic) heuristic, i.e., $h(n) \le h^*(n)$ where $h^*(n)$ is the true cost-to-go from n.
  (Also require h(n) ≥ 0, so h(G) = 0 for any goal G.)
E.g., $h_\text{SLD}(n)$ never overestimates the actual road distance
Theorem: A* search is optimal (= finds the optimal path)

A* search example (2:16–2:21)
[Figures not reproduced]
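Greedy search and A* are both best-first search with different priority functions, which the following sketch makes explicit. It is an illustration under assumptions: the map is a fragment of Romania, and both the road distances in `ROADS` and the straight-line distances in `H_SLD` are assumed values that do not appear in the text; `best_first` and `neighbors` are hypothetical helper names.

```python
import heapq
from itertools import count

# Assumed road distances (km) for a fragment of the Romania map.
ROADS = {
    ("Arad", "Zerind"): 75,
    ("Arad", "Sibiu"): 140,
    ("Arad", "Timisoara"): 118,
    ("Sibiu", "Fagaras"): 99,
    ("Sibiu", "Rimnicu Vilcea"): 80,
    ("Fagaras", "Bucharest"): 211,
    ("Rimnicu Vilcea", "Pitesti"): 97,
    ("Pitesti", "Bucharest"): 101,
}

# Assumed straight-line distances to Bucharest (admissible on this fragment).
H_SLD = {
    "Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253,
    "Fagaras": 176, "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0,
}


def neighbors(city):
    # Roads are undirected: yield (neighbor, distance) in both directions.
    for (a, b), km in ROADS.items():
        if a == city:
            yield b, km
        elif b == city:
            yield a, km


def best_first(start, goal, f):
    """Tree search expanding the node with the lowest f(g, state) first."""
    tie = count()  # tie-breaker so the heap never compares paths
    fringe = [(f(0, start), next(tie), 0, start, [start])]
    while fringe:
        _, _, g, city, path = heapq.heappop(fringe)
        if city == goal:
            return g, path
        for nxt, km in neighbors(city):
            g2 = g + km
            heapq.heappush(fringe, (f(g2, nxt), next(tie), g2, nxt, path + [nxt]))
    return None


greedy = best_first("Arad", "Bucharest", lambda g, n: H_SLD[n])      # f = h
astar = best_first("Arad", "Bucharest", lambda g, n: g + H_SLD[n])   # f = g + h
print("greedy:", greedy)
print("A*:    ", astar)
```

On this fragment, greedy search commits to Fagaras because it looks closest to the goal and returns a 450 km route, while A* accounts for the cost-so-far and returns the 418 km route via Rimnicu Vilcea and Pitesti.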
Proof of optimality of A* (2:22)
Suppose some suboptimal goal $G_2$ has been generated and is in the fringe (but has not yet been selected to be tested for goal condition!). Let n be an unexpanded node on a shortest path to an optimal goal G.
[Figure not reproduced]
  $f(G_2) = g(G_2)$   since $h(G_2) = 0$
  $\quad > g(G)$   since $G_2$ is suboptimal
  $\quad \ge f(n)$   since h is admissible
Since $f(n) < f(G_2)$, A* will expand n before it selects $G_2$ from the fringe (for goal testing). Then, as G is added to the fringe, and since $f(G) = g(G) < f(G_2) = g(G_2)$, it will select G before $G_2$ for goal testing.

Properties of A* (2:23–2:27)
Complete?? Yes, unless there are infinitely many nodes with f ≤ f(G)
Time?? Exponential in [relative error in h × length of soln.]
Space?? Keeps all nodes in memory
Optimal?? Yes
  A* expands all nodes with $f(n) < C^*$
  A* expands some nodes with $f(n) = C^*$
  A* expands no nodes with $f(n) > C^*$

Optimality of A* (more useful) (2:28)
Lemma: A* expands nodes in order of increasing f value*
Gradually adds "f-contours" of nodes (cf. breadth-first adds layers)
Contour i has all nodes with $f = f_i$, where $f_i < f_{i+1}$
[Figure not reproduced]

Proof of lemma: Consistency (2:29)
A heuristic is consistent if
  $h(n) \le c(n, a, n') + h(n')$
If h is consistent, we have
  $f(n') = g(n') + h(n')$
  $\quad = g(n) + c(n, a, n') + h(n')$
  $\quad \ge g(n) + h(n)$
  $\quad = f(n)$
I.e., f(n) is nondecreasing along any path.

Admissible heuristics (2:30–2:31)
E.g., for the 8-puzzle:
  $h_1(n)$ = number of misplaced tiles
  $h_2(n)$ = total Manhattan distance (i.e., no. of squares from desired location of each tile)
[Figure not reproduced: start state S and goal state]
  $h_1(S)$ =?? 6
  $h_2(S)$ =?? 4+0+3+3+1+0+2+1 = 14

Dominance (2:32)
If $h_2(n) \ge h_1(n)$ for all n (both admissible), then $h_2$ dominates $h_1$ and is better for search
Typical search costs:
  d = 14: IDS = 3,473,941 nodes; A*($h_1$) = 539 nodes; A*($h_2$) = 113 nodes
  d = 24: IDS ≈ 54,000,000,000 nodes; A*($h_1$) = 39,135 nodes; A*($h_2$) = 1,641 nodes
Given any admissible heuristics $h_a$, $h_b$,
  $h(n) = \max(h_a(n), h_b(n))$
is also admissible and dominates $h_a$, $h_b$

Relaxed problems (2:33)
Admissible heuristics can be derived from the exact solution cost of a relaxed version of the problem
If the rules of the 8-puzzle are relaxed so that a tile can move anywhere, then $h_1(n)$ gives the shortest solution
If the rules are relaxed so that a tile can move to any adjacent square, then $h_2(n)$ gives the shortest solution
Key point: the optimal solution cost of a relaxed problem is no greater than the optimal solution cost of the real problem

Memory-bounded A* (2:34)
As with BFS, A* has exponential space complexity
Iterative-deepening A* works for integer path costs, but is problematic for real-valued ones
(Simplified) Memory-bounded A* (SMA*):
  – Expand as usual until a memory bound is reached
  – Then, whenever adding a node, remove the worst node n' from the tree
  – worst means: the n' with highest f(n')
  – To not lose information, back up the measured step-cost $\text{cost}(\tilde n, a, n')$ to improve the heuristic $h(\tilde n)$ of its parent
SMA* is complete and optimal if the depth of the optimal path is within the memory bound

Summary (2:35)
Combine information from the past and the future
A heuristic function h(n) represents information about the future
  – it estimates cost-to-go optimistically
Good heuristics can dramatically reduce search cost
Greedy best-first search expands lowest h
  – incomplete and not always optimal
A* search expands lowest f = g + h
  – neg-priority = cost-so-far + estimated cost-to-go
  – complete and optimal
  – also optimally efficient (up to tie-breaks, for forward search)
Admissible heuristics can be derived from exact solution of relaxed problems
Memory-bounded strategies exist

Outlook (2:36)
We postpone tree search with partial observations
  – rather discuss this in a fully probabilistic setting later
We postpone tree search for games
  – minimax extension to tree search
  – discuss state-of-the-art probabilistic Monte-Carlo tree search methods later
Next: Constraint Satisfaction Problems

5 Constraint Satisfaction Problems
Outline
• CSP examples
• Backtracking sequential assignment for CSPs
• Problem structure and problem decomposition
• Later: general-purpose discrete (and continuous) optimization methods
3:1

Constraint satisfaction problems (CSPs)
In previous lectures we considered sequential decision problems
CSPs are not sequential decision problems
However, the basic methods address them by sequentially testing 'decisions'
CSP:
We have n variables x_i, each with domain D_i, x_i ∈ D_i
We have K constraints C_k, each of which determines the feasible configurations of a subset of variables
The goal is to find a configuration X = (X_1, .., X_n) of all variables that satisfies all constraints
Formally C_k = (I_k, c_k) where I_k ⊆ {1, .., n} determines the subset of variables, and c_k : D_{I_k} → {0, 1} determines whether a configuration x_{I_k} ∈ D_{I_k} of this subset of variables is feasible
3:2

Example: Map-Coloring
Variables W, N, Q, E, V, S, T (E = New South Wales)
Domains D_i = {red, green, blue} for all variables
Constraints: adjacent regions must have different colors
e.g., W ≠ N, or
(W, N) ∈ {(red, green), (red, blue), (green, red), (green, blue), . . .}
3:3

Example: Map-Coloring contd.
Solutions are assignments satisfying all constraints, e.g.,
{W = red, N = green, Q = red, E = green, V = red, S = blue, T = green}
3:4

Constraint graph
Binary CSP: each constraint relates at most two variables
Constraint graph: a bi-partite graph: nodes are variables, boxes are constraints
In the map-coloring problem, all constraints relate two variables: boxes↔edges
In general, constraints may constrain several (or one) variables (|I_k| ≠ 2)
(Figure: constraint graph of the map-coloring problem with variable nodes W, N, Q, E, V, S, T and constraint boxes c1, .., c9)
3:5

Varieties of CSPs
• Discrete variables:
  finite domains; each D_i of size |D_i| = d ⇒ O(d^n) complete assignments
    – e.g., Boolean CSPs, incl. Boolean satisfiability
  infinite domains (integers, strings, etc.)
    – e.g., job scheduling, variables are start/end days for each job
    – linear constraints solvable, nonlinear undecidable
• Continuous variables
    – e.g., start/end times for Hubble Telescope observations
    – linear constraints solvable in poly time by LP methods
3:6

Varieties of constraints
Unary constraints involve a single variable, |I_k| = 1
e.g., S ≠ green
Binary constraints involve pairs of variables, |I_k| = 2
e.g., S ≠ W
Higher-order constraints involve 3 or more variables, |I_k| > 2
e.g., Sudoku
Having "soft constraints" (preferences, cost, probabilities) leads to general optimization and probabilistic inference problems
3:7

Real-world CSPs
Assignment problems
e.g., who teaches what class
Timetabling problems
e.g., which class is offered when and where?
Hardware configuration
Spreadsheets
Transportation scheduling
Factory scheduling
Floorplanning
Notice that many real-world problems involve real-valued variables
3:8

5.1 Methods for solving CSPs
3:9

Sequential assignment approach
Let's start with the straightforward, dumb approach, then fix it
States are defined by the values assigned so far
• Initial state: the empty assignment, { }
• Successor function: assign a value to an unassigned variable that does not conflict with current assignment
  ⇒ fail if no feasible assignments (not fixable!)
• Goal test: the current assignment is complete
1) Every solution appears at depth n with n variables ⇒ use depth-first search
2) b = (n − ℓ)d at depth ℓ, hence n! d^n leaves!
3:10

Backtracking sequential assignment
Two variable assignment decisions are commutative, i.e.,
[W = red then N = green] same as [N = green then W = red]
We can fix a single next variable to assign a value to at each node
This does not compromise completeness (ability to find the solution)
⇒ b = d and there are d^n leaves
Depth-first search for CSPs with single-variable assignments is called backtracking search
Backtracking search is the basic uninformed algorithm for CSPs
Can solve n-queens for n ≈ 25
3:11

Backtracking search
function BACKTRACKING-SEARCH(csp) returns solution/failure
   return RECURSIVE-BACKTRACKING({ }, csp)

function RECURSIVE-BACKTRACKING(assignment, csp) returns soln/failure
   if assignment is complete then return assignment
   var ← SELECT-UNASSIGNED-VARIABLE(VARIABLES[csp], assignment, csp)
   for each value in ORDERED-DOMAIN-VALUES(var, assignment, csp) do
      if value is consistent with assignment given CONSTRAINTS[csp] then
         add [var = value] to assignment
         result ← RECURSIVE-BACKTRACKING(assignment, csp)
         if result ≠ failure then return result
         remove [var = value] from assignment
   return failure
3:12

Backtracking example (figure slides)
3:13–3:16

Improving backtracking efficiency
Simple heuristics can give huge gains in speed:
1. Which variable should be assigned next?
2. In what order should its values be tried?
3. Can we detect inevitable failure early?
4. Can we take advantage of problem structure?
3:17

Minimum remaining values
Minimum remaining values (MRV):
choose the variable with the fewest legal values
3:18

Degree heuristic
Tie-breaker among MRV variables
Degree heuristic: choose the variable with the most constraints on remaining variables
3:19

Least constraining value
Given a variable, choose the least constraining value: the one that rules out the fewest values in the remaining variables
Combining these heuristics makes 1000 queens feasible
3:20

Forward checking
Idea: Keep track of remaining legal values for unassigned variables
Terminate search when any variable has no legal values
Use a data structure DOMAIN[X] to explicitly store D_i for each node
3:21–3:24

Constraint propagation
Forward checking propagates information from assigned to unassigned variables, but doesn't provide early detection for all failures:
N and S cannot both be blue!
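The sequential-assignment approach with MRV variable ordering and forward checking can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's reference implementation: the adjacency table `NEIGHBORS` is our reading of the map-coloring constraint graph (variables W, N, Q, E, V, S, T as above), and the names `forward_check`, `backtrack` and `solve` are hypothetical.

```python
# Map-coloring CSP with the variable names used in the slides.
# The adjacency below is an assumption read off the constraint graph (9 binary constraints).
NEIGHBORS = {
    "W": ["N", "S"],
    "N": ["W", "S", "Q"],
    "S": ["W", "N", "Q", "E", "V"],
    "Q": ["N", "S", "E"],
    "E": ["Q", "S", "V"],
    "V": ["S", "E"],
    "T": [],
}
COLORS = ["red", "green", "blue"]


def forward_check(var, value, domains, assignment):
    """Return pruned copies of the domains, or None if some domain becomes empty."""
    new_domains = {v: list(d) for v, d in domains.items()}
    new_domains[var] = [value]
    for other in NEIGHBORS[var]:
        if other not in assignment and value in new_domains[other]:
            new_domains[other].remove(value)
            if not new_domains[other]:
                return None  # inevitable failure detected early
    return new_domains


def backtrack(assignment, domains):
    """Depth-first search that assigns one variable per node."""
    if len(assignment) == len(NEIGHBORS):
        return assignment
    # MRV: pick the unassigned variable with the fewest remaining legal values.
    unassigned = [v for v in NEIGHBORS if v not in assignment]
    var = min(unassigned, key=lambda v: len(domains[v]))
    # Forward checking keeps domains consistent with all assigned neighbors,
    # so every value left in domains[var] is legal here.
    for value in domains[var]:
        pruned = forward_check(var, value, domains, assignment)
        if pruned is None:
            continue
        result = backtrack({**assignment, var: value}, pruned)
        if result is not None:
            return result
    return None


def solve():
    """Solve the map-coloring CSP starting from full domains."""
    domains = {v: list(COLORS) for v in NEIGHBORS}
    return backtrack({}, domains)
```

Calling `solve()` returns one complete assignment in which adjacent regions differ. Forward checking only prunes the neighbors of the variable just assigned, so a situation such as N and S both reduced to {blue} is noticed one assignment later rather than immediately.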
Idea: propagate the implied constraints several steps further
Generally, this is called constraint propagation
3:25

Arc consistency (for pair-wise constraints)
Simplest form of propagation makes each arc consistent
X → Y is consistent iff
for every value x of X there is some allowed y
If X loses a value, neighbors of X need to be rechecked
Arc consistency detects failure earlier than forward checking
Can be run as a preprocessor or after each assignment
3:26–3:29

Arc consistency algorithm (for pair-wise constraints)
function AC-3(csp) returns the CSP, possibly with reduced domains
   inputs: csp, a pair-wise CSP with variables {X_1, X_2, . . . , X_n}
   local variables: queue, a queue of arcs, initially all the arcs in csp
   while queue is not empty do
      (X_i, X_j) ← REMOVE-FIRST(queue)
      if REMOVE-INCONSISTENT-VALUES(X_i, X_j) then
         for each X_k in NEIGHBORS[X_i] do
            add (X_k, X_i) to queue

function REMOVE-INCONSISTENT-VALUES(X_i, X_j) returns true iff DOM[X_i] changed
   changed ← false
   for each x in DOMAIN[X_i] do
      if no value y in DOMAIN[X_j] allows (x, y) to satisfy the constraint X_i ↔ X_j
         then delete x from DOMAIN[X_i]; changed ← true
   return changed

O(n^2 d^3), can be reduced to O(n^2 d^2)
3:30
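The AC-3 pseudocode can be transcribed almost line by line. The sketch below assumes that all constraints are binary and given by a single predicate `constraint(xi, x, xj, y)`; that signature, the dictionary-based domains, the helper `domains_after` and the early exit on an empty domain are choices made for this illustration, not part of the slide.

```python
from collections import deque

# Adjacency of the map-coloring constraint graph (assumed, as in the example above).
NEIGHBORS = {
    "W": ["N", "S"],
    "N": ["W", "S", "Q"],
    "S": ["W", "N", "Q", "E", "V"],
    "Q": ["N", "S", "E"],
    "E": ["Q", "S", "V"],
    "V": ["S", "E"],
    "T": [],
}


def different(xi, x, xj, y):
    """Binary map-coloring constraint: adjacent regions get different colors."""
    return x != y


def remove_inconsistent_values(domains, constraint, xi, xj):
    """Delete values of xi that have no supporting value in the domain of xj."""
    changed = False
    for x in list(domains[xi]):
        if not any(constraint(xi, x, xj, y) for y in domains[xj]):
            domains[xi].remove(x)
            changed = True
    return changed


def ac3(domains, neighbors, constraint):
    """Make every arc consistent in place; return False if a domain becomes empty."""
    queue = deque((xi, xj) for xi in neighbors for xj in neighbors[xi])
    while queue:
        xi, xj = queue.popleft()
        if remove_inconsistent_values(domains, constraint, xi, xj):
            if not domains[xi]:
                return False  # failure detected by propagation alone
            for xk in neighbors[xi]:
                queue.append((xk, xi))  # neighbors of xi must be rechecked
    return True


def domains_after(assigned):
    """Full color domains, except that assigned variables are fixed to their value."""
    colors = ["red", "green", "blue"]
    return {v: [assigned[v]] if v in assigned else list(colors) for v in NEIGHBORS}
```

For instance, `ac3(domains_after({"W": "red", "Q": "green"}), NEIGHBORS, different)` reports failure: N and S are both reduced to {blue}, and the arc between them then empties one of the two domains.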
Constraint propagation
See textbook for details for non-pair-wise constraints
Very closely related to message passing in probabilistic models
In practice: design approximate constraint propagation for specific problem
E.g.: Sudoku: If X_i is assigned, delete this value from all peers
3:31

Problem structure
(Figure: constraint graph of the map-coloring problem with variable nodes W, N, Q, E, V, S, T and constraint boxes c1, .., c9)
Tasmania and mainland are independent subproblems
Identifiable as connected components of constraint graph
3:32

Tree-structured CSPs
Theorem: if the constraint graph has no loops, the CSP can be solved in O(n d^2) time
Compare to general CSPs, where worst-case time is O(d^n)
This property also applies to logical and probabilistic reasoning!
3:33

Algorithm for tree-structured CSPs
1. Choose a variable as root, order variables from root to leaves such that every node's parent precedes it in the ordering
2. For j from n down to 2, apply REMOVE-INCONSISTENT(Parent(X_j), X_j)
   This is backward constraint propagation
3. For j from 1 to n, assign X_j consistently with Parent(X_j)
   This is forward sequential assignment (trivial backtracking)
3:34

Nearly tree-structured CSPs
Conditioning: instantiate a variable, prune its neighbors' domains
Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree
Cutset size c ⇒ runtime O(d^c · (n − c) d^2), very fast for small c
3:35

Summary
CSPs are a fundamental kind of problem:
finding a feasible configuration of n variables
the set of constraints defines the (graph) structure of the problem
Sequential assignment approach
Backtracking = depth-first search with one variable assigned per node
Variable ordering and value selection heuristics help significantly
Forward checking prevents assignments that guarantee later failure
Constraint propagation (e.g., arc consistency) does additional work to constrain values and detect inconsistencies
The CSP representation allows analysis of problem structure
Tree-structured CSPs can be solved in linear time
If after assigning some variables, the remaining structure is a tree → linear time feasibility check by tree CSP
3:36

6 Optimization

Outline
• Local Search
• Iterated Local Search
• Simulated annealing & Genetic algorithms (briefly)
• General formulation of optimization problems
• LP, QP, ILP, non-linear program
• ILP formulations of n-queens, CSP, TSP
4:1

Optimization problems
We have n variables x_i,
continuous x ∈ R^n, or discrete x_i ∈ {1, .., d}, or mixed
An optimization problem (or mathematical program) is defined by
   min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
where g : R^n → R^k defines k inequality constraints,
and h : R^n → R^l defines l equality constraints
Optimization is a central thread through all of science:
– Machine Learning, Robotics, Computer Vision
– Engineering, Control Theory
– Economics, Operations Research
– Physics, Chemistry, Molecular Biology
– Social Sciences
Computational modelling of natural phenomena often via optimality principles
4:2

Stefan Funke gives an excellent lecture on Discrete Optimization (WS)
– max-flow and min-cut on graphs
– Linear Programs, esp. Simplex methods
– Integer Linear Programming and LP-relaxations
I offer a lecture on Continuous Optimization (SS)
– Gradient and Newton methods
– Lagrangian, log-barrier, augmented Lagrangian methods, primal-dual
– Local & stochastic search, global optimization, Bayesian optimization
4:3

Iterative improvement
The majority of optimization methods iteratively manipulate x to monotonically improve x, e.g.:
– line search, backtracking, trust region methods
– gradient-based, (Quasi-) Newton methods
– interior point methods, Simplex method
– primal-dual Newton
– local search, pattern search, Nelder-Mead
Exceptions:
– Global Optimization, Bayesian Optimization
– stochastic search, simulated annealing, evolutionary algorithms
4:4

Local search (greedy downhill, hill climbing)
We assume there is a finite neighborhood N(x) defined for every x
Greedy local search (variant 1):
   Input: Initial x, function f(x)
   Output: Local minimum x̂ of f(x)
   repeat
      x̂ ← x
      x ← argmin_{y ∈ N(x)} f(y)
   until f(x̂) ≤ f(x)
Variant 2: x ← the "first" y ∈ N(x) such that f(y) < f(x)
4:5

Example: Travelling Salesman Problem (TSP)
Goal: Find the shortest closed tour visiting n cities.
Start with any complete tour; modify 2 arcs to make the tour shorter
Variants of this approach get within 1% of optimum very quickly with thousands of cities
In TSP, this neighborhood is called 2-opt (modifying 2 arcs).
3-opt or 4-opt are larger neighborhoods.
4:6

Example: n-queens
Goal: Put n queens on an n × n board with no two queens on the same row, column, or diagonal
Start with any configuration of n queens; move a queen to reduce number of conflicts
Almost always solves n-queens problems almost instantaneously for very large n, e.g., n = 1 million
4:7

Local search contd.
Useful to consider solution space landscape
Random-restart local search overcomes local optima problem (trivially complete)
Random sideways moves escape from plateaus, but loop on flat optima
4:8

Iterated Local Search (≠ random restarts)
Random restarts may be rather expensive, sampling initial x is fully uninformed
Idea: Escape local minimum x by restarting in a meta-neighborhood N∗(x)
   Input: Initial x, function f(x)
   Output: Local minimum x̂ of f(x)
   repeat
      For all meta-neighbors y_i ∈ N∗(x) compute ŷ_i ← LocalSearch(y_i)
      x ← argmin_{y ∈ {ŷ_1, .., ŷ_I}} f(y)
   until x converges
LocalSearch uses a simple/quick neighborhood N(x)
The meta-neighborhood N∗(x) enables large jumps towards alternative local optima
Variant 2: x ← the "first" y_i ∈ N∗(x) such that f(ŷ_i) < f(x)
Stochastic variant: Meta-neighborhood N∗(x) is replaced by a transition prob. q∗(y|x)
4:9

Example: Travelling Salesman Problem
LocalSearch uses the simple 2-opt or 3-opt neighborhood (→ quick)
Iterated Local Search uses 4-opt meta-neighborhood (double bridges)
4:10

Simulated annealing
Idea: Escape local minimum by allowing some "bad" moves but gradually decrease their size and frequency
   Input: initial x, function f(x), proposal distribution q(y|x), initial temp. T
   Output: Global minimum x̂ of f(x)
   repeat
      Sample y from the neighborhood of x, y ∼ q(y|x)
      Acceptance probability A = min{1, e^{(f(x)−f(y))/T} · q(x|y)/q(y|x)}
      With probability A update x ← y
      Decrease T, e.g. T ← (1 − ε)T for small ε
   until T = 0 or x converges
Typically q(x|y) = q(y|x)
The new sample y is always accepted if y is better than x (f(y) ≤ f(x))
If y is worse than x, only accept with probability e^{(f(x)−f(y))/T}
4:11

Properties of simulated annealing
At fixed "temperature" T, state occupation probability reaches Boltzmann distribution
   p(x) = α e^{−f(x)/kT}
T decreased slowly enough ⟹ always reach best state x∗ = argmin_x f(x)
because e^{−f(x∗)/kT} / e^{−f(x)/kT} = e^{(f(x)−f(x∗))/kT} ≫ 1 for small T
Is this necessarily an interesting guarantee??
Devised by Metropolis et al., 1953, for physical process modelling
4:12

Local beam search (maintain k candidates)
Idea: keep k candidates instead of 1; choose top k of all their successors
Not the same as k searches run in parallel!
Searches that find good candidates recruit other searches to join them
Problem: quite often, all k candidates end up on same local hill
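Local beam search with the "top k of all successors" rule can be sketched on the n-queens problem from slide 4:7. The state encoding (one queen per column, `state[c]` = row), the conflict count used as objective f, and the names `conflicts`, `successors` and `local_beam_search` are assumptions made for this illustration.

```python
import random


def conflicts(state):
    """Number of attacking queen pairs; state[c] is the row of the queen in column c."""
    n = len(state)
    return sum(
        1
        for a in range(n)
        for b in range(a + 1, n)
        if state[a] == state[b] or abs(state[a] - state[b]) == b - a
    )


def successors(state):
    """All states reachable by moving one queen within its column."""
    n = len(state)
    for col in range(n):
        for row in range(n):
            if row != state[col]:
                yield state[:col] + (row,) + state[col + 1:]


def local_beam_search(n, k, max_iters=200, seed=0):
    """Keep k candidates; each step keeps the best k of all their successors."""
    rng = random.Random(seed)
    beam = [tuple(rng.randrange(n) for _ in range(n)) for _ in range(k)]
    for _ in range(max_iters):
        best = min(beam, key=conflicts)
        if conflicts(best) == 0:
            return best
        # Pool the successors of ALL candidates, then keep the k best overall.
        pool = {s for state in beam for s in successors(state)}
        beam = sorted(pool, key=conflicts)[:k]
    return min(beam, key=conflicts)
```

Because the pooled successors of all k candidates compete for the same k slots, the beam tends to collapse onto one region of the landscape, which is the problem noted above. The sketch returns the best state found, which need not be conflict-free.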
Idea: choose k successors randomly, biased towards good ones
4:13

Genetic algorithms
= stochastic local beam search + generate successors from pairs of candidates
4:14

Genetic algorithms contd.
GAs require solutions encoded as strings (GPs use trees or programs)
Crossover helps iff substrings are meaningful components
GAs ≠ evolution: e.g., real genes encode replication machinery!
More general view:
keeping multiple candidates allows us to use more general neighborhoods N(x_1, .., x_K) or meta-neighborhoods
4:15

6.1 A glimpse at general optimization problems
4:16

Optimization problems
Linear Program (LP)
   min_x c⊤x  s.t.  Gx ≤ h, Ax = b
– Simplex Algorithm, Interior point method (Log-barrier), Augmented Lagrangian, primal-dual
– LP in standard form: min_x c⊤x  s.t.  x ≥ 0, Ax = b
Quadratic Program (QP) (Q is positive definite)
   min_x ½ x⊤Qx + c⊤x  s.t.  Gx ≤ h, Ax = b
– Log-barrier, Augmented Lagrangian, primal-dual Newton
Integer Linear Program (ILP)
   min_x c⊤x  s.t.  Ax = b, x_i ∈ {1, .., d_i}
– LP-relaxations & backtracking, specialized methods, graph cut methods
Non-linear program (Convex Program: f, g convex and h linear)
   min_x f(x)  s.t.  g(x) ≤ 0, h(x) = 0
– Sequential Quadratic Programming (SQP), Log-barrier, Augmented Lagrangian, primal-dual
4:17

The art is in finding a reduction
How can a real-world problem be encoded as an optimization problem?
4:18

Example: n-queens as Integer Linear Program
binary indicator variables x_ij for a queen at position (i, j), i, j = 1, .., n
Constraints:
– row constraints ∀i : Σ_j x_ij ≤ 1
– column constraints ∀j : Σ_i x_ij ≤ 1
– diagonal cnstr. ∀ i ∈ {−n+1, .., n−1} : Σ_{j : j, i+j ∈ {1,..,n}} x_{i+j,j} ≤ 1
– diagonal cnstr. ∀ i ∈ {−n+1, .., n−1} : Σ_{j : j, i−j ∈ {1,..,n}} x_{i−j,j} ≤ 1
Objective Function: arbitrary (e.g. f(x) = 1)! We encoded everything in the constraints!
Better alternative: Optimize the number of constraint violations:
instead of "≤ 1" write "≤ 1 + ξ_k" in all constraints
the slack variables ξ = (ξ_1, .., ξ_K) become part of the state
add the constraints ξ_k ≥ 0
objective function f(x, ξ) = Σ_k ξ_k
related to Phase I optimization of finding a feasible x
4:19

Example: TSP as Integer Linear Program
binary indicator variables x_ij for (ij) ∈ tour
city-visit-times t_i ∈ {1, .., n}
Objective: cost f(x) = Σ_ij c_ij x_ij of the tour
Constraints:
– Columns sum to 1: ∀j : Σ_i x_ij = 1
– Rows sum to 1: ∀i : Σ_j x_ij = 1
– city-visit-times t_i must fulfill: ∀ 2 ≤ i ≠ j ≤ n : t_i − t_j ≤ n − 2 − (n − 1) x_ij
(There are alternative formulations.)
4:20

Example: CSP as Integer Linear Program
binary indicator variables x_iv = [X_i = v] for every CSP variable X_i
Constraints:
– "X_i can have only one value": ∀i : Σ_v x_iv = 1 (cf. probabilities..)
– If [X_i = v ∧ X_j = w] is constraint-violating, add a constraint x_iv + x_jw ≤ 1
– Do this for EVERY forbidden local configuration (MANY constraints)
Objective Function: arbitrary (e.g. f(x) = 1)! We encoded everything in the constraints!
Better alternative: Translate the constraints into soft constraints
   x_iv + x_jw ≤ 1 + ξ_k
Minimize Σ_k ξ_k s.t. ξ_k ≥ 0
(There exists a more efficient formulation for MaxSAT in conjunctive normal form.)
4:21

Summary
Many problems can be reduced to optimization problems
Local Search, esp.
Iterated Local Search is often effective in practice
In continuous domains, when gradients, Hessians are given → full field of optimization
Ongoing research in global & Bayesian optimization
4:22

7 Propositional Logic

Outline
• Knowledge-based agents
• Wumpus world
• Logic in general: models and entailment
• Propositional (Boolean) logic
• Equivalence, validity, satisfiability
• Inference rules and theorem proving
  – forward chaining
  – backward chaining
  – resolution
5:1

Knowledge bases
Knowledge base = set of sentences of a formal language
Declarative approach to building an agent (or other system):
TELL it what it needs to know
Then it can ASK itself what to do; answers should follow from the KB
Agents can be viewed at the knowledge level
i.e., what they know, regardless of how implemented
Or at the implementation level
i.e., data structures in KB and algorithms that manipulate them
5:2

A simple knowledge-based agent
function KB-AGENT(percept) returns an action
   static: KB, a knowledge base
           t, a counter, initially 0, indicating time
   TELL(KB, MAKE-PERCEPT-SENTENCE(percept, t))
   action ← ASK(KB, MAKE-ACTION-QUERY(t))
   TELL(KB, MAKE-ACTION-SENTENCE(action, t))
   t ← t + 1
   return action
The agent must be able to:
Represent states, actions, etc.
Incorporate new percepts
Update internal representations of the world
Deduce hidden properties of the world
Deduce appropriate actions
5:3

Wumpus World description
Performance measure
   gold +1000, death -1000
   -1 per step, -10 for using the arrow
Environment
   Squares adjacent to wumpus are smelly
   Squares adjacent to pit are breezy
   Glitter iff gold is in the same square
   Shooting kills wumpus if you are facing it
   The wumpus kills you if in the same square
   Shooting uses up the only arrow
   Grabbing picks up gold if in same square
   Releasing drops the gold in same square
Actuators Left turn, Right turn, Forward, Grab, Release, Shoot, Climb
Sensors Breeze, Glitter, Stench, Bump, Scream
5:4

Wumpus world characterization
Observable?? No: only local perception
Deterministic?? Yes: outcomes exactly specified
Static?? Yes: Wumpus and Pits do not move
Discrete?? Yes
Single-agent?? Yes: Wumpus is essentially a natural feature
5:5–5:10

Exploring a wumpus world (figure slides)
5:11–5:18

Other tight spots
Breeze in (1,2) and (2,1) ⇒ no safe actions
Assuming pits uniformly distributed, (2,2) has pit w/ prob 0.86, vs. 0.31
Smell in (1,1) ⇒ cannot move
Can use a strategy of coercion: shoot straight ahead
   wumpus was there ⇒ dead ⇒ safe
   wumpus wasn't there ⇒ safe
5:19

Logic in general
Logics are formal languages for representing information such that conclusions can be drawn
Syntax defines the sentences in the language
Semantics define the "meaning" of sentences; i.e., define truth of a sentence in a world
E.g., the language of arithmetic
   x + 2 ≥ y is a sentence; x2 + y > is not a sentence
   x + 2 ≥ y is true iff the number x + 2 is no less than the number y
   x + 2 ≥ y is true in a world where x = 7, y = 1
   x + 2 ≥ y is false in a world where x = 0, y = 6
5:20

Entailment
Entailment means that one thing follows from another:
   KB ⊨ α
Knowledge base KB entails sentence α if and only if α is true in all worlds where KB is true
E.g., the KB containing "the Giants won" and "the Reds won" entails "Either the Giants won or the Reds won"
E.g., x + y = 4 entails 4 = x + y
Entailment is a relationship between sentences (i.e., syntax) that is based on semantics
5:21

Models
Given a logical sentence, when is its truth uniquely defined in a world?
Logicians typically think in terms of models, which are formally structured worlds (e.g., full abstract description of a world, configuration of all variables, world state) with respect to which truth can uniquely be evaluated
We say m is a model of a sentence α if α is true in m
M(α) is the set of all models of α
Then KB ⊨ α if and only if M(KB) ⊆ M(α)
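The characterisation KB ⊨ α iff M(KB) ⊆ M(α) can be checked by brute force for small propositional examples. The sketch below is an illustration, not part of the lecture: sentences are represented as Python predicates over a model (a dict from symbol names to truth values), and the helper names `models` and `entails` as well as the symbol names are hypothetical.

```python
from itertools import product


def models(sentence, symbols):
    """Return M(sentence): all truth assignments over `symbols` that satisfy it."""
    result = set()
    for values in product([False, True], repeat=len(symbols)):
        m = dict(zip(symbols, values))
        if sentence(m):
            result.add(values)
    return result


def entails(kb, alpha, symbols):
    """KB |= alpha iff every model of KB is also a model of alpha."""
    return models(kb, symbols) <= models(alpha, symbols)


# The Giants/Reds example from the entailment slide.
SYMBOLS = ["GiantsWon", "RedsWon"]


def kb(m):
    """KB: the Giants won and the Reds won."""
    return m["GiantsWon"] and m["RedsWon"]


def either(m):
    """Either the Giants won or the Reds won."""
    return m["GiantsWon"] or m["RedsWon"]
```

Here `entails(kb, either, SYMBOLS)` holds because the single model of the KB is among the three models of the disjunction, while the converse `entails(either, kb, SYMBOLS)` fails. Enumerating all assignments is exponential in the number of symbols, so this only serves to make the definition concrete.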
KB = Giants won and Reds won α = Giants won KB = wumpus-world rules + observations α1 = “[1,2] is safe”, KB |= α1 , proved by model checking 5:22 5:26 Entailment in the wumpus world Wumpus models Situation after detecting nothing in [1,1], moving right, breeze in [2,1] Consider possible models for ?s assuming only pits 3 Boolean choices ⇒ 8 possible models 5:23 Wumpus models KB = wumpus-world rules + observations α2 = “[2,2] is safe”, KB 6|= α2 5:27 Inference Inference in the general sense means: Given some pieces of information (prior, observed variabes, knowledge base) what is 5:24 the implication (the implied information, the posterior) on other things (non-observed variables, sentence) Wumpus models KB `i α = sentence α can be derived from KB by procedure i Consequences of KB are a haystack; α is a needle. Entailment = needle in haystack; inference = finding it Soundness: i is sound if whenever KB `i α, it is also true that KB |= α Completeness: i is complete if whenever KB |= α, it is also true that KB `i α Preview: we will define a logic (first-order logic) which is expressive enough to say almost anything of interest, and for which there exists a sound and complete inference procedure. That is, the procedure will answer any question whose answer follows from what is known by the KB. KB = wumpus-world rules + observations 5:25 5:28 30 Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 Propositional logic: Syntax 5:33 Propositional logic is the simplest logic—illustrates basic ideas Wumpus world sentences The proposition symbols P1 , P2 etc are sentences If S is a sentence, ¬S is a sentence (negation) Let Pi,j be true if there is a pit in [i, j]. If S1 and S2 are sentences, S1 ∧ S2 is a sentence (conjunction) Let Bi,j be true if there is a breeze in [i, j]. 
If S1 and S2 are sentences, S1 ∨ S2 is a sentence (disjunction) If S1 and S2 are sentences, S1 ⇒ S2 is a sentence (implication) ¬P1,1 If S1 and S2 are sentences, S1 ⇔ S2 is a sentence (bicondi- ¬B1,1 tional) B2,1 5:29 “Pits cause breezes in adjacent squares” Propositional logic: Syntax grammar hsentencei hatomic sentencei hcomplex sentencei → → → hatomic sentencei | hcomplex sentencei true | false | P | Q | R | . . . ¬ hsentencei | (hsentencei ∧ hsentencei) | (hsentencei ∨ hsentencei) | (hsentencei ⇒ hsentencei) | (hsentencei ⇔ hsentencei) B1,1 ⇔ (P1,2 ∨ P2,1 ) B2,1 ⇔ (P1,1 ∨ P2,2 ∨ P3,1 ) “A square is breezy if and only if there is an adjacent pit” 5:34 5:30 Truth tables for inference Propositional logic: Semantics B1,1 B2,1 P1,1 P1,2 P2,1 P2,2 P3,1 Each model specifies true/false for each proposition symbol E.g. P1,2 P2,2 P3,1 true true false (With these symbols, 8 possible models, can be enumerated automatically.) R1 R2 R3 R4 R5 KB false false false false false false false true true true true false false false false false false false false true true true false true false false .. .. .. .. .. .. .. .. .. .. .. .. .. . . . . . . . . . . . . . false true false false false false false true true false true true false false true false false false false true true true true true true true false true false false false true false true true true true true true false true false false false true true true true true true true true Rules for evaluating truth with respect to a model m: ¬S is true iff S is false false true false false true false false true false false true true false .. .. .. .. .. .. .. .. .. .. .. .. .. S1 ∧ S2 is true iff S1 is true and S2 is true . . . . . . . . . . . . . 
S1 ∨ S2 is true iff S1 is true or S2 is true true true true true true true true false true true false true false S1 ⇒ S2 is true iff S1 is false or S2 is true Enumerate rows (different assignments to symbols), i.e., is false iff S1 is true and S2 is false if KB is true in row, check that α is too S1 ⇔ S2 is true iff S1 ⇒ S2 is true and S2 ⇒ S1 is true Simple recursive process evaluates an arbitrary sentence, e.g., 5:35 ¬P1,2 ∧ (P2,2 ∨ P3,1 ) = true ∧ (false ∨ true) = true ∧ true = true 5:31 Inference by enumeration Truth tables for connectives P Q ¬P P ∧Q P ∨Q P ⇒Q P ⇔Q false false true true false true false true true true false false false false false true false true true true true true false true true false false true 5:32 Wumpus world sentences Let Pi,j be true if there is a pit in [i, j]. Let Bi,j be true if there is a breeze in [i, j]. ¬P1,1 ¬B1,1 B2,1 “Pits cause breezes in adjacent squares” Depth-first enumeration of all models is sound and complete function TT-E NTAILS ?(KB, α) returns true or false inputs: KB, the knowledge base, a sentence in propositional logic α, the query, a sentence in propositional logic symbols ← a list of the proposition symbols in KB and α return TT-C HECK -A LL(KB, α, symbols, [ ]) function TT-C HECK -A LL(KB, α, symbols, model) returns true or false if E MPTY ?(symbols) then if PL-T RUE ?(KB, model) then return PL-T RUE ?(α, model) else return true else do P ← F IRST(symbols); rest ← R EST(symbols) return TT-C HECK -A LL(KB, α, rest, E XTEND(P , true, model)) and TT-C HECK -A LL(KB, α, rest, E XTEND(P , false, model)) O(2n ) for n symbols 5:36 Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 31 Logical equivalence Forward and backward chaining Two sentences are logically equivalent iff true in same models: Applicable when KB is in Horn Form α ≡ β if and only if α |= β and β |= α Horn Form (restricted) KB = conjunction of Horn clauses (α ∧ β) ≡ (β ∧ α) commutativity of ∧ (α ∨ β) ≡ (β ∨ α) commutativity of ∨ – 
proposition symbol; or ((α ∧ β) ∧ γ) ≡ (α ∧ (β ∧ γ)) associativity of ∧ – (conjunction of symbols) ⇒ symbol ((α ∨ β) ∨ γ) ≡ (α ∨ (β ∨ γ)) associativity of ∨ ¬(¬α) ≡ α (α ⇒ β) ≡ (¬β ⇒ ¬α) contraposition (α ⇒ β) ≡ (¬α ∨ β) implication elimination (α ⇔ β) ≡ ((α ⇒ β) ∧ (β ⇒ α)) biconditional elimination ¬(α ∧ β) ≡ (¬α ∨ ¬β) De Morgan ¬(α ∨ β) ≡ (¬α ∧ ¬β) De Morgan (α ∧ (β ∨ γ)) ≡ ((α ∧ β) ∨ (α ∧ γ)) distributivity of ∧ over ∨ (α ∨ (β ∧ γ)) ≡ ((α ∨ β) ∧ (α ∨ γ)) distributivity of ∨ over ∧ Horn clause = E.g., C ∧ (B ⇒ A) ∧ (C ∧ D ⇒ B) double-negation elimination Modus Ponens (for Horn Form): complete for Horn KBs α1 , . . . , αn , α1 ∧ · · · ∧ αn ⇒ β β Can be used with forward chaining or backward chaining. These algorithms are very natural and run in linear time 5:40 5:37 Validity and satisfiability Forward chaining A sentence is valid if it is true in all models, e.g., true, A ∨ ¬A, A ⇒ A, Idea: fire any rule whose premises are satisfied in the KB, (A ∧ (A ⇒ B)) ⇒ B Validity is connected to inference via the Deduction Theorem: KB |= α if and only if (KB ⇒ α) is valid add its conclusion to the KB, until query is found P ⇒Q L∧M ⇒P A sentence is satisfiable if it is true in some model e.g., A ∨ B, B∧L⇒M C A∧P ⇒L A sentence is unsatisfiable if it is true in no models A∧B ⇒L e.g., A ∧ ¬A A Satisfiability is connected to inference via the following: B KB |= α if and only if (KB ∧ ¬α) is unsatisfiable 5:41 i.e., prove α by reductio ad absurdum 5:38 Forward chaining example Proof methods Proof methods divide into (roughly) two kinds: Application of inference rules – Legitimate (sound) generation of new sentences from old – Proof = a sequence of inference rule applications Can use inference rules as operators in a standard search alg. 
  – Typically require translation of sentences into a normal form
Model checking
  – truth table enumeration (always exponential in n)
  – improved backtracking, e.g., Davis–Putnam–Logemann–Loveland (see book)
  – heuristic search in model space (sound but incomplete), e.g., min-conflicts-like hill-climbing algorithms
5:39

Forward chaining example
(figures; slides 5:42–5:49)

Forward chaining algorithm

function PL-FC-ENTAILS?(KB, q) returns true or false
  inputs: KB, the knowledge base, a set of propositional Horn clauses
          q, the query, a proposition symbol
  local variables: count, a table, indexed by clause, initially the number of premises
                   inferred, a table, indexed by symbol, each entry initially false
                   agenda, a list of symbols, initially the symbols known in KB

  while agenda is not empty do
    p ← POP(agenda)
    unless inferred[p] do
      inferred[p] ← true
      for each Horn clause c in whose premise p appears do
        decrement count[c]
        if count[c] = 0 then do
          if HEAD[c] = q then return true
          PUSH(HEAD[c], agenda)
  return false
5:50

Proof of completeness
FC derives every atomic sentence that is entailed by KB
1. FC reaches a fixed point where no new atomic sentences are derived
2. Consider the final state as a model m, assigning true/false to symbols
3. Every clause in the original KB is true in m
   Proof: Suppose a clause a_1 ∧ ... ∧ a_k ⇒ b is false in m
   Then a_1 ∧ ... ∧ a_k is true in m and b is false in m
   Therefore the algorithm has not reached a fixed point!
4. Hence m is a model of KB
5. If KB |= q, q is true in every model of KB, including m
General idea: construct any model of KB by sound inference, check α
5:51

Backward chaining
Idea: work backwards from the query q:
  to prove q by BC,
    check if q is known already, or
    prove by BC all premises of some rule concluding q
Avoid loops: check if new subgoal is already on the goal stack
Avoid repeated work: check if new subgoal
  1) has already been proved true, or
  2) has already failed
5:52

Backward chaining example
(figures; slides 5:53–5:62)

Forward vs. backward chaining
FC is data-driven, cf. automatic, unconscious processing,
  e.g., object recognition, routine decisions
May do lots of work that is irrelevant to the goal
BC is goal-driven, appropriate for problem-solving,
  e.g., Where are my keys? How do I get into a PhD program?
Complexity of BC can be much less than linear in size of KB
5:63

Resolution
Conjunctive Normal Form (CNF, universal)
  conjunction of disjunctions of literals (each disjunction is a clause)
  E.g., (A ∨ ¬B) ∧ (B ∨ ¬C ∨ ¬D)
Resolution inference rule (for CNF): complete for propositional logic
  $\frac{\ell_1 \lor \cdots \lor \ell_k, \qquad m_1 \lor \cdots \lor m_n}{\ell_1 \lor \cdots \lor \ell_{i-1} \lor \ell_{i+1} \lor \cdots \lor \ell_k \lor m_1 \lor \cdots \lor m_{j-1} \lor m_{j+1} \lor \cdots \lor m_n}$
  where ℓ_i and m_j are complementary literals.
  E.g., from P_{1,3} ∨ P_{2,2} and ¬P_{2,2} infer P_{1,3}
Resolution is sound and complete for propositional logic
5:64

Conversion to CNF
B_{1,1} ⇔ (P_{1,2} ∨ P_{2,1})
1. Eliminate ⇔, replacing α ⇔ β with (α ⇒ β) ∧ (β ⇒ α).
   (B_{1,1} ⇒ (P_{1,2} ∨ P_{2,1})) ∧ ((P_{1,2} ∨ P_{2,1}) ⇒ B_{1,1})
2. Eliminate ⇒, replacing α ⇒ β with ¬α ∨ β.
   (¬B_{1,1} ∨ P_{1,2} ∨ P_{2,1}) ∧ (¬(P_{1,2} ∨ P_{2,1}) ∨ B_{1,1})
3. Move ¬ inwards using de Morgan's rules and double-negation:
   (¬B_{1,1} ∨ P_{1,2} ∨ P_{2,1}) ∧ ((¬P_{1,2} ∧ ¬P_{2,1}) ∨ B_{1,1})
4. Apply distributivity law (∨ over ∧) and flatten:
   (¬B_{1,1} ∨ P_{1,2} ∨ P_{2,1}) ∧ (¬P_{1,2} ∨ B_{1,1}) ∧ (¬P_{2,1} ∨ B_{1,1})
5:65

Resolution algorithm
Proof by contradiction, i.e., show KB ∧ ¬α unsatisfiable

function PL-RESOLUTION(KB, α) returns true or false
  inputs: KB, the knowledge base, a sentence in propositional logic
          α, the query, a sentence in propositional logic

  clauses ← the set of clauses in the CNF representation of KB ∧ ¬α
  new ← { }
  loop do
    for each C_i, C_j in clauses do
      resolvents ← PL-RESOLVE(C_i, C_j)
      if resolvents contains the empty clause then return true
      new ← new ∪ resolvents
    if new ⊆ clauses then return false
    clauses ← clauses ∪ new
5:66

Resolution example
KB = (B_{1,1} ⇔ (P_{1,2} ∨ P_{2,1})) ∧ ¬B_{1,1}
α = ¬P_{1,2}
(figure)
5:67

Dictionary: propositional logic (inference rules)
Modus Ponens rule: complete for Horn KBs
  $\frac{\alpha_1, \ldots, \alpha_n, \qquad \alpha_1 \land \cdots \land \alpha_n \Rightarrow \beta}{\beta}$
Resolution rule: complete for propositional logic in CNF, let "ℓ_i = ¬m_j":
  $\frac{\ell_1 \lor \cdots \lor \ell_k, \qquad m_1 \lor \cdots \lor m_n}{\ell_1 \lor \cdots \lor \ell_{i-1} \lor \ell_{i+1} \lor \cdots \lor \ell_k \lor m_1 \lor \cdots \lor m_{j-1} \lor m_{j+1} \lor \cdots \lor m_n}$
5:70

Summary
Logical agents apply inference to a knowledge base to derive new information and make decisions
Basic concepts of logic:
  – syntax: formal structure of sentences
  – semantics: truth of sentences wrt models
  – entailment: necessary truth of one sentence given another
  – inference: deriving sentences from other sentences
  – soundness: derivations produce only entailed sentences
  – completeness: derivations can produce all entailed sentences
Wumpus world requires the ability to represent partial and negated information, reason by cases, etc.
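The PL-FC-ENTAILS? procedure from slide 5:50 can be sketched in runnable Python. This is a minimal illustration, not the course's reference code: the function name `pl_fc_entails`, the encoding of a Horn clause as a `(premises, head)` pair, and the example KB are assumptions made here. One deliberate deviation from the pseudocode is named in the comments: the query is tested when a symbol is popped from the agenda, so a query that is itself a known fact is also reported as entailed.

```python
from collections import deque


def pl_fc_entails(clauses, facts, query):
    """Forward chaining for propositional Horn clauses.

    clauses: list of (premises, head) pairs; premises is a list of symbols
    facts:   symbols known to be true in the KB
    query:   the proposition symbol to prove
    """
    # count[i] = number of distinct premises of clause i not yet inferred
    count = [len(set(premises)) for premises, _ in clauses]
    inferred = set()
    agenda = deque(facts)
    while agenda:
        p = agenda.popleft()
        # Deviation from the slide: test the query on pop, so that a query
        # that is already a fact in the KB is recognised as entailed.
        if p == query:
            return True
        if p in inferred:
            continue
        inferred.add(p)
        for i, (premises, head) in enumerate(clauses):
            if p in premises:
                count[i] -= 1
                if count[i] == 0:
                    agenda.append(head)
    return False


# Hypothetical Horn KB, chosen only for illustration
example_clauses = [
    (["P"], "Q"),
    (["L", "M"], "P"),
    (["B", "L"], "M"),
    (["A", "P"], "L"),
    (["A", "B"], "L"),
]
example_facts = ["A", "B"]
```

Each clause is visited once per newly inferred symbol in this sketch; indexing clauses by premise symbol would be needed to obtain the linear running time stated for forward chaining.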
Summary contd.
Forward, backward chaining are linear-time, complete for Horn clauses
Resolution is complete for propositional logic
Propositional logic lacks expressive power
5:68

Dictionary: logic in general
a logic: a language, elements α are sentences (grammar example: slide 34)
model m: a world/state description that allows us to evaluate α(m) ∈ {true, false} uniquely for any sentence α,
  M(α) = {m : α(m) = true}
entailment α |= β: M(α) ⊆ M(β), "∀m : α(m) ⇒ β(m)" (German: Folgerung)
equivalence α ≡ β: iff (α |= β and β |= α)
KB: a set of sentences
inference procedure i can infer α from KB: KB ⊢_i α
soundness of i: KB ⊢_i α implies KB |= α (German: Korrektheit)
completeness of i: KB |= α implies KB ⊢_i α
5:69

Dictionary: propositional logic
conjunction: α ∧ β, disjunction: α ∨ β, negation: ¬α
implication: α ⇒ β ≡ ¬α ∨ β, biconditional: α ⇔ β ≡ (α ⇒ β) ∧ (β ⇒ α)
Note: |= and ≡ are statements about sentences in a logic; ⇒ and ⇔ are symbols in the grammar of propositional logic
α valid: true for any model, e.g.: KB |= α iff [(KB ⇒ α) is valid] (German: allgemeingültig)
α unsatisfiable: true for no model, e.g.: KB |= α iff [(KB ∧ ¬α) is unsatisfiable]
literal: A or ¬A, clause: disjunction of literals, CNF: conjunction of clauses
Horn clause: symbol | (conjunction of symbols ⇒ symbol), Horn form: conjunction of Horn clauses

8 First Order Logic

Outline
• Why FOL?
• Syntax and semantics of FOL
• Example sentences
• Wumpus world in FOL
6:1

Pros and cons of propositional logic
Pros:
  Propositional logic is declarative: pieces of syntax correspond to facts
  Propositional logic allows partial/disjunctive/negated information (unlike most data structures and databases)
  Propositional logic is compositional: meaning of B_{1,1} ∧ P_{1,2} is derived from meaning of B_{1,1} and of P_{1,2}
  Meaning in propositional logic is context-independent (unlike natural language, where meaning depends on context)
Cons:
  Propositional logic has very limited expressive power (unlike natural language)
  E.g., cannot say "pits cause breezes in adjacent squares" except by writing one sentence for each square
6:2

First-order logic
Whereas propositional logic assumes world contains facts, first-order logic (like natural language) assumes the world contains
• Objects: people, houses, numbers, theories, Ronald McDonald, colors, baseball games, wars, centuries ...
• Relations: red, round, bogus, prime, multistoried ..., brother of, bigger than, inside, part of, has color, occurred after, owns, comes between, ...
• Functions: father of, best friend, third inning of, one more than, end of ...
6:3

Logics in general
Language             Ontological Commitment             Epistemological Commitment
Propositional logic  facts                              true/false/unknown
First-order logic    facts, objects, relations          true/false/unknown
Temporal logic       facts, objects, relations, times   true/false/unknown
Probability theory   facts                              degree of belief
Fuzzy logic          facts + degree of truth            known interval value
6:4

Syntax of FOL: Basic elements
Constants    KingJohn, 2, UCB, ...
Predicates   Brother, >, ...
Functions    Sqrt, LeftLegOf, ...
Variables    x, y, a, b, ...
Connectives  ∧ ∨ ¬ ⇒ ⇔
Equality     =
Quantifiers  ∀ ∃
6:5

First Order Logic: Syntax grammar
⟨sentence⟩ → ⟨atomic sentence⟩
           | ⟨complex sentence⟩
           | [∀ | ∃] ⟨variable⟩ ⟨sentence⟩
⟨atomic sentence⟩ → predicate(⟨term⟩, ...)
                  | ⟨term⟩ = ⟨term⟩
⟨term⟩ → function(⟨term⟩, ...)
       | constant
       | variable
⟨complex sentence⟩ → ¬ ⟨sentence⟩
                   | (⟨sentence⟩ [∧ | ∨ | ⇒ | ⇔] ⟨sentence⟩)
6:6

Universal quantification
∀ ⟨variables⟩ ⟨sentence⟩
Everyone at Berkeley is smart:
  ∀x At(x, Berkeley) ⇒ Smart(x)
∀x P is true in a model m iff P is true with x being each possible object in the model
6:7

Existential quantification
∃ ⟨variables⟩ ⟨sentence⟩
Someone at Stanford is smart:
  ∃x At(x, Stanford) ∧ Smart(x)
∃x P is true in a model m iff P is true with x being some possible object in the model
6:8

Properties of quantifiers
∀x ∀y is the same as ∀y ∀x
∃x ∃y is the same as ∃y ∃x
∃x ∀y is not the same as ∀y ∃x
  ∃x ∀y Loves(x, y)   "There is a person who loves everyone in the world"
  ∀y ∃x Loves(x, y)   "Everyone in the world is loved by at least one person"
Quantifier duality: each can be expressed using the other
  ∀x Likes(x, IceCream)   ≡   ¬∃x ¬Likes(x, IceCream)
  ∃x Likes(x, Broccoli)   ≡   ¬∀x ¬Likes(x, Broccoli)
6:9

Truth in first-order logic
Sentences are true with respect to a model and an interpretation
Model contains ≥ 1 objects (domain elements) and relations among them
Interpretation specifies referents for
  constant symbols → objects (domain elements)
  predicate symbols → relations
  function symbols → functional relations
An atomic sentence predicate(term_1, ..., term_n) is true
  iff the objects referred to by term_1, ..., term_n are in the relation referred to by predicate
6:10

Models for FOL: Example
(figure)
6:11
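The difference between ∃x ∀y and ∀y ∃x, and the quantifier duality, can be checked by brute force in a small finite model, which is exactly what "true in a model" means for quantified sentences. The sketch below is an illustration only; the domain, the `loves` relation and all function names are hypothetical and do not come from the slides.

```python
# Hypothetical finite model: three objects and a Loves relation,
# where a pair (x, y) means "x loves y".
domain = ["alice", "bob", "carol"]
loves = {("alice", "bob"), ("bob", "carol"), ("carol", "alice")}


def exists_forall(domain, relation):
    """Truth value of  ∃x ∀y Loves(x, y)  in the given model."""
    return any(all((x, y) in relation for y in domain) for x in domain)


def forall_exists(domain, relation):
    """Truth value of  ∀y ∃x Loves(x, y)  in the given model."""
    return all(any((x, y) in relation for x in domain) for y in domain)


def duality_holds(domain, predicate):
    """Check  ∀x P(x) ≡ ¬∃x ¬P(x)  for a unary predicate in this model."""
    lhs = all(predicate(x) for x in domain)
    rhs = not any(not predicate(x) for x in domain)
    return lhs == rhs


someone_loves_everyone = exists_forall(domain, loves)
everyone_is_loved = forall_exists(domain, loves)
```

In this model everyone is loved by someone, but nobody loves everyone, so the two sentences receive different truth values.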
Models for FOL: Lots!
Entailment in propositional logic can be computed by enumerating models
We can enumerate the FOL models for a given KB vocabulary:
  For each number of domain elements n from 1 to ∞
    For each k-ary predicate P_k in the vocabulary
      For each possible k-ary relation on n objects
        For each constant symbol C in the vocabulary
          For each choice of referent for C from n objects ...
Computing entailment by enumerating FOL models is not easy!
6:12

Example sentences
Brothers are siblings
  ∀x,y Brother(x, y) ⇒ Sibling(x, y).
"Sibling" is symmetric
  ∀x,y Sibling(x, y) ⇔ Sibling(y, x).
One's mother is one's female parent
  ∀x,y Mother(x, y) ⇔ (Female(x) ∧ Parent(x, y)).
A first cousin is a child of a parent's sibling
  ∀x,y FirstCousin(x, y) ⇔ ∃p,ps Parent(p, x) ∧ Sibling(ps, p) ∧ Parent(ps, y)
6:13–6:17

8.1 FOL description of interactive domains
6:18

Knowledge base for the wumpus world
"Perception"
  ∀b,g,t Percept([Smell, b, g], t) ⇒ Smelt(t)
  ∀s,b,t Percept([s, b, Glitter], t) ⇒ AtGold(t)
Reflex: ∀t AtGold(t) ⇒ Action(Grab, t)
Reflex with internal state: do we have the gold already?
  ∀t AtGold(t) ∧ ¬Holding(Gold, t) ⇒ Action(Grab, t)
Holding(Gold, t) cannot be observed
  ⇒ keeping track of change is essential
6:19

Deducing hidden properties
Properties of locations:
  ∀x,t At(Agent, x, t) ∧ Smelt(t) ⇒ Smelly(x)
  ∀x,t At(Agent, x, t) ∧ Breeze(t) ⇒ Breezy(x)
Squares are breezy near a pit:
  ∀y Breezy(y) ⇔ [∃x Pit(x) ∧ Adjacent(x, y)]
Implies two rules:
  Diagnostic rule (infer cause from effect)
    ∀y Breezy(y) ⇒ ∃x Pit(x) ∧ Adjacent(x, y)
  Causal rule (infer effect from cause)
    ∀x,y Pit(x) ∧ Adjacent(x, y) ⇒ Breezy(y)
6:20

Keeping track of change: Situation Calculus
Facts hold in situations, rather than eternally
  E.g., Holding(Gold, Now) rather than just Holding(Gold)
Situation calculus is one way to represent change in FOL:
  Adds a situation argument to each non-eternal predicate
  E.g., Now in Holding(Gold, Now) denotes a situation
Situations are connected by the Result function
  Result(a, s) is the situation that results from doing a in s
6:21

Describing actions I: Frame problem
"Effect" axiom: describe changes due to action
  ∀s AtGold(s) ⇒ Holding(Gold, Result(Grab, s))
"Frame" axiom: describe non-changes due to action
  ∀s HaveArrow(s) ⇒ HaveArrow(Result(Grab, s))
Frame problem: find an elegant way to handle non-change
  (a) representation: avoid frame axioms
  (b) inference: avoid repeated "copy-overs" to keep track of state
6:22

Describing actions II
Successor-state axioms solve the representational frame problem
Each axiom is "about" a predicate (not an action per se):
  P true afterwards ⇔ [an action made P true
                       ∨ P true already and no action made P false]
For holding the gold:
  ∀a,s Holding(Gold, Result(a, s)) ⇔
       [(a = Grab ∧ AtGold(s))
        ∨ (Holding(Gold, s) ∧ a ≠ Release)]
6:23

Planning Domain Definition Language (PDDL)
The Situation Calculus is very general, but not concise
The AI community developed a simpler format (based on STRIPS) for the 1998/2000 International Planning Competition (IPC)
(figure)
6:24

PDDL
The precondition specifies if an action predicate is applicable in a given situation
The effect determines the changed facts
Frame assumption: All facts not mentioned in the effect remain unchanged.
The majority of state-of-the-art AI planners use this format
FFplan: (B. Nebel, Freiburg) a forward chaining heuristic state space planner
For probabilistic versions of PDDL: T. Lang and M. Toussaint: Planning with noisy probabilistic relational rules. JAIR, 2010.
6:25

Planning as FOL inference
A general approach to planning is to query the KB for a plan that fulfills a goal condition
There is debate and ongoing research on this versus fwd search
6:26

Substitution
Suppose a wumpus-world agent is using an FOL KB and perceives a smell and a breeze (but no glitter) at s:
  TELL(KB, Percept([Smell, Breeze, None], s))
  ASK(KB, ∃a Action(a, s))
I.e., does KB entail any particular actions at s?
Answer: Yes, {a/Shoot} ← substitution (binding list)
Given a sentence S and a substitution σ,
Sσ denotes the result of plugging σ into S; e.g.,
  S = Smarter(x, y)
  σ = {x/Hillary, y/Bill}
  Sσ = Smarter(Hillary, Bill)
ASK(KB, S) returns some/all σ such that KB |= Sσ
6:27

Making plans as inference over plans
Represent plans as action sequences [a_1, a_2, ..., a_n]
PlanResult(p, s) is the result of executing p in s
Then the query ASK(KB, ∃p Holding(Gold, PlanResult(p, S_0)))
has the solution {p/[Forward, Grab]}
Definition of PlanResult in terms of Result:
  ∀s PlanResult([ ], s) = s
  ∀a,p,s PlanResult([a|p], s) = PlanResult(p, Result(a, s))
6:28

Summary
First-order logic:
  – objects and relations are semantic primitives
  – syntax: constants, functions, predicates, equality, quantifiers
Increased expressive power: sufficient to define wumpus world
Situation calculus:
  – convention for describing actions and change in FOL
  – can formulate planning as inference on a situation calculus KB
Planning Domain Definition Language (PDDL):
  – more common restricted language
  – more concise because of the frame assumption
  – lends itself directly to forward chaining methods (like FFplan)
6:29

9 First Order Logic – Inference

Outline
• Reducing first-order inference to propositional inference
• Unification
• Generalized Modus Ponens
• Forward and backward chaining
• Resolution
7:1

A brief history of reasoning
450 B.C.  Stoics        propositional logic, inference (maybe)
322 B.C.  Aristotle     "syllogisms" (inference rules), quantifiers
1565      Cardano       probability theory (propositional logic + uncertainty)
1847      Boole         propositional logic (again)
1879      Frege         first-order logic
1922      Wittgenstein  proof by truth tables
1930      Gödel         ∃ complete algorithm for FOL
1930      Herbrand      complete algorithm for FOL (reduce to propositional)
1931      Gödel         ¬∃ complete algorithm for arithmetic
1960      Davis/Putnam  "practical" algorithm for propositional logic
1965      Robinson      "practical" algorithm for FOL (resolution)
7:2

Universal instantiation (UI)
Every instantiation of a universally quantified sentence is entailed by it:
  $\frac{\forall v \; \alpha}{\text{SUBST}(\{v/g\}, \alpha)}$
for any variable v and ground term g
7:3

Existential instantiation (EI)
For any sentence α, variable v, and constant symbol k that does not appear elsewhere in the knowledge base:
  $\frac{\exists v \; \alpha}{\text{SUBST}(\{v/k\}, \alpha)}$
E.g., ∃x Crown(x) ∧ OnHead(x, John) yields
  Crown(C_1) ∧ OnHead(C_1, John)
provided C_1 is a new constant symbol, called a Skolem constant
Another example: from ∃x d(x^y)/dy = x^y we obtain
  d(e^y)/dy = e^y
provided e is a new constant symbol
7:4

Existential instantiation contd.
UI can be applied several times to add new sentences; the new KB is logically equivalent to the old
EI can be applied once to replace the existential sentence; the new KB is not equivalent to the old, but is satisfiable iff the old KB was satisfiable
7:5

Reduction to propositional inference
Suppose the KB contains just the following:
  ∀x King(x) ∧ Greedy(x) ⇒ Evil(x)
  King(John)
  Greedy(John)
  Brother(Richard, John)
Instantiating the universal sentence in all possible ways, we have
  King(John) ∧ Greedy(John) ⇒ Evil(John)
  King(Richard) ∧ Greedy(Richard) ⇒ Evil(Richard)
  King(John)
  Greedy(John)
  Brother(Richard, John)
The new KB is propositionalized: proposition symbols are King(John), ...
7:6

Reduction contd.
Idea: propositionalize KB and query, apply resolution, return result
Problem: with function symbols, there are infinitely many ground terms,
  e.g., Father(Father(Father(John)))
Theorem: Herbrand (1930). If a sentence α is entailed by an FOL KB, it is entailed by a finite subset of the propositional KB
Idea: For n = 0 to ∞ do
  create a propositional KB by instantiating with depth-n terms
  see if α is entailed by this KB
Problem: works if α is entailed, loops if α is not entailed
Theorem: Turing (1936), Church (1936), entailment in FOL is semidecidable
7:7

Problems with propositionalization
Propositionalization seems to generate lots of irrelevant sentences.
E.g., from
  ∀x King(x) ∧ Greedy(x) ⇒ Evil(x)
  King(John)
  ∀y Greedy(y)
  Brother(Richard, John)
it seems obvious that Evil(John), but propositionalization produces lots of facts such as Greedy(Richard) that are irrelevant
With p k-ary predicates and n constants, there are p · n^k instantiations
With function symbols, it gets much much worse!
7:8
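The instantiation step behind the reduction to propositional inference, and the p · n^k growth mentioned above, can be sketched as follows. This is a minimal illustration restricted to function-free KBs; encoding a sentence as a Python format string and the helper names `instantiate` and `count_atom_instantiations` are assumptions made for this example.

```python
from itertools import product


def instantiate(rule_template, variables, constants):
    """Return all ground instances of a universally quantified sentence.

    rule_template: format string with one named field per variable
    variables:     variable names used in the template
    constants:     constant symbols of the KB (no function symbols)
    """
    instances = []
    for binding in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, binding))  # substitution {v/g}
        instances.append(rule_template.format(**theta))
    return instances


def count_atom_instantiations(num_predicates, arity, num_constants):
    """Number of ground atoms for p k-ary predicates and n constants."""
    return num_predicates * num_constants ** arity


constants = ["John", "Richard"]
rule = "King({x}) ∧ Greedy({x}) ⇒ Evil({x})"
ground_rules = instantiate(rule, ["x"], constants)
```

With function symbols the set of ground terms is infinite, so this enumeration would have to be bounded by term depth, as in the Herbrand-style loop on slide 7:7.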
Unification
We can get the inference immediately if we can find a substitution θ
such that King(x) and Greedy(x) match King(John) and Greedy(y)
θ = {x/John, y/John} works
UNIFY(α, β) = θ if αθ = βθ

p               q                    θ
Knows(John, x)  Knows(John, Jane)    {x/Jane}
Knows(John, x)  Knows(y, OJ)         {x/OJ, y/John}
Knows(John, x)  Knows(y, Mother(y))  {y/John, x/Mother(John)}
Knows(John, x)  Knows(x, OJ)         fail

Standardizing apart eliminates overlap of variables, e.g., Knows(z_17, OJ)
7:9–7:13

Generalized Modus Ponens (GMP)
  $\frac{p_1', \; p_2', \; \ldots, \; p_n', \; (p_1 \land p_2 \land \ldots \land p_n \Rightarrow q)}{q\theta}$   where $p_i'\theta = p_i\theta$ for all $i$

  p_1' is King(John)      p_1 is King(x)
  p_2' is Greedy(y)       p_2 is Greedy(x)
  θ is {x/John, y/John}   q is Evil(x)
  qθ is Evil(John)
GMP used with KB of definite clauses (exactly one positive literal)
All variables assumed universally quantified
7:14

Forward chaining algorithm

function FOL-FC-ASK(KB, α) returns a substitution or false
  repeat until new is empty
    new ← { }
    for each sentence r in KB do
      (p_1 ∧ ... ∧ p_n ⇒ q) ← STANDARDIZE-APART(r)
      for each θ such that (p_1 ∧ ... ∧ p_n)θ = (p'_1 ∧ ... ∧ p'_n)θ
                  for some p'_1, ..., p'_n in KB
        q' ← SUBST(θ, q)
        if q' is not a renaming of a sentence already in KB or new then do
          add q' to new
          φ ← UNIFY(q', α)
          if φ is not fail then return φ
    add new to KB
  return false
7:15

Properties of forward chaining
Sound and complete for first-order definite clauses (proof similar to propositional proof)
Datalog = first-order definite clauses + no functions (e.g., crime KB)
FC terminates for Datalog in poly iterations: at most p · n^k literals
May not terminate in general if α is not entailed
This is unavoidable: entailment with definite clauses is semidecidable
7:16

Backward chaining algorithm

function FOL-BC-ASK(KB, goals, θ) returns a set of substitutions
  inputs: KB, a knowledge base
          goals, a list of conjuncts forming a query (θ already applied)
          θ, the current substitution, initially the empty substitution { }
  local variables: answers, a set of substitutions, initially empty

  if goals is empty then return {θ}
  q' ← SUBST(θ, FIRST(goals))
  for each sentence r in KB
      where STANDARDIZE-APART(r) = (p_1 ∧ ... ∧ p_n ⇒ q)
      and θ' ← UNIFY(q, q') succeeds
    new_goals ← [p_1, ..., p_n | REST(goals)]
    answers ← FOL-BC-ASK(KB, new_goals, COMPOSE(θ', θ)) ∪ answers
  return answers
7:17

Properties of backward chaining
Depth-first recursive proof search: space is linear in size of proof
Incomplete due to infinite loops
  ⇒ fix by checking current goal against every goal on stack
Inefficient due to repeated subgoals (both success and failure)
  ⇒ fix using caching of previous results (extra space!)
Widely used (without improvements!) for logic programming
7:18

Resolution: brief summary
Full first-order version:
  $\frac{\ell_1 \lor \cdots \lor \ell_k, \qquad m_1 \lor \cdots \lor m_n}{(\ell_1 \lor \cdots \lor \ell_{i-1} \lor \ell_{i+1} \lor \cdots \lor \ell_k \lor m_1 \lor \cdots \lor m_{j-1} \lor m_{j+1} \lor \cdots \lor m_n)\theta}$
where UNIFY(ℓ_i, ¬m_j) = θ.
For example, from
  ¬Rich(x) ∨ Unhappy(x)
  Rich(Ken)
infer
  Unhappy(Ken)
with θ = {x/Ken}
Apply resolution steps to CNF(KB ∧ ¬α); complete for FOL
7:19

Conversion to CNF
Everyone who loves all animals is loved by someone:
  ∀x [∀y Animal(y) ⇒ Loves(x, y)] ⇒ [∃y Loves(y, x)]
1. Eliminate biconditionals and implications
  ∀x [¬∀y ¬Animal(y) ∨ Loves(x, y)] ∨ [∃y Loves(y, x)]
2. Move ¬ inwards: ¬∀x, p ≡ ∃x ¬p, ¬∃x, p ≡ ∀x ¬p:
  ∀x [∃y ¬(¬Animal(y) ∨ Loves(x, y))] ∨ [∃y Loves(y, x)]
  ∀x [∃y ¬¬Animal(y) ∧ ¬Loves(x, y)] ∨ [∃y Loves(y, x)]
  ∀x [∃y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃y Loves(y, x)]
7:20

Conversion to CNF contd.
3. Standardize variables: each quantifier should use a different one
  ∀x [∃y Animal(y) ∧ ¬Loves(x, y)] ∨ [∃z Loves(z, x)]
4. Skolemize: a more general form of existential instantiation.
  Each existential variable is replaced by a Skolem function of the enclosing universally quantified variables:
  ∀x [Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
5. Drop universal quantifiers:
  [Animal(F(x)) ∧ ¬Loves(x, F(x))] ∨ Loves(G(x), x)
6. Distribute ∨ over ∧:
  [Animal(F(x)) ∨ Loves(G(x), x)] ∧ [¬Loves(x, F(x)) ∨ Loves(G(x), x)]
7:21

10 Probabilities

Objective Probability
The double slit experiment:
(figure; labels x, θ, P)
8:1

Probability Theory
• Why do we need probabilities?
  – Obvious: to express inherent (objective) stochasticity of the world
• But beyond this (also in a "deterministic world"):
  – lack of knowledge!
  – hidden (latent) variables
  – expressing uncertainty
  – expressing information (and lack of information)
  – Subjective Probability
• Probability Theory: an information calculus
8:2

Outline
• Basic definitions
  – Random variables
  – joint, conditional, marginal distribution
  – Bayes' theorem
• Probability distributions:
  – Binomial & Beta
  – Multinomial & Dirichlet
  – Conjugate priors
  – Gauss
  – Dirac & Particles
• Utilities, decision theory, entropy, KLD
8:3

Probability: Frequentist and Bayesian
• Frequentist probabilities are defined in the limit of an infinite number of trials
  Example: "The probability of a particular coin landing heads up is 0.43"
• Bayesian (subjective) probabilities quantify degrees of belief
  Example: "The probability of it raining tomorrow is 0.3"
  – Not possible to repeat "tomorrow"
8:4

10.1 Basic definitions
8:5

Probabilities & Sets
• Sample Space/domain O, e.g. O = {1, 2, 3, 4, 5, 6}
• Probability P : A ⊂ O ↦ [0, 1]
  e.g., P({1}) = 1/6, P({4}) = 1/6, P({2, 5}) = 1/3
• Axioms: ∀A, B ⊆ O
  – Nonnegativity P(A) ≥ 0
  – Additivity P(A ∪ B) = P(A) + P(B) if A ∩ B = { }
  – Normalization P(O) = 1
• Implications
  0 ≤ P(A) ≤ 1
  P({ }) = 0
  A ⊆ B ⇒ P(A) ≤ P(B)
  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  P(O \ A) = 1 − P(A)
8:6

Probabilities & Random Variables
• For a random variable X with discrete domain dom(X) = O we write:
  $\forall x \in O: \; 0 \le P(X{=}x) \le 1$
  $\sum_{x \in O} P(X{=}x) = 1$
  Example: A die can take values O = {1, .., 6}. X is the random variable of a die throw. P(X=1) ∈ [0, 1] is the probability that X takes value 1.
• A bit more formally: a random variable is a map from a measurable space to a domain (sample space) and thereby introduces a probability measure on the domain ("assigns a probability to each possible value")
8:7

Probability Distributions
• P(X=1) ∈ R denotes a specific probability
  P(X) denotes the probability distribution (function over O)
  Example: A die can take values O = {1, 2, 3, 4, 5, 6}. By P(X) we describe the full distribution over possible values {1, .., 6}. These are 6 numbers that sum to one, usually stored in a table, e.g.: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
• In implementations we typically represent distributions over discrete random variables as tables (arrays) of numbers
• Notation for summing over a RV:
  In equations we often need to sum over RVs. We then write
    $\sum_X P(X) \cdots$
  as shorthand for the explicit notation $\sum_{x \in \text{dom}(X)} P(X{=}x) \cdots$
8:8

Joint distributions
Assume we have two random variables X and Y
• Definitions:
  Joint: $P(X, Y)$
  Marginal: $P(X) = \sum_Y P(X, Y)$
  Conditional: $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
  The conditional is normalized: $\forall Y: \sum_X P(X|Y) = 1$
• X is independent of Y iff: $P(X|Y) = P(X)$
  (table thinking: all columns of P(X|Y) are equal)
8:9

Joint distributions
joint: $P(X, Y)$
marginal: $P(X) = \sum_Y P(X, Y)$
conditional: $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
• Implications of these definitions:
  Product rule: $P(X, Y) = P(X|Y)\, P(Y) = P(Y|X)\, P(X)$
  Bayes' Theorem: $P(X|Y) = \frac{P(Y|X)\, P(X)}{P(Y)}$
8:10

Bayes' Theorem
  $P(X|Y) = \frac{P(Y|X)\, P(X)}{P(Y)}$
  $\text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalization}}$
8:11

Multiple RVs:
• Analogously for n random variables X_{1:n} (stored as a rank n tensor)
  Joint: $P(X_{1:n})$
  Marginal: $P(X_1) = \sum_{X_{2:n}} P(X_{1:n})$,
  Conditional: $P(X_1 | X_{2:n}) = \frac{P(X_{1:n})}{P(X_{2:n})}$
• X is conditionally independent of Y given Z iff:
  $P(X | Y, Z) = P(X | Z)$
• Product rule and Bayes' Theorem:
  $P(X_{1:n}) = \prod_{i=1}^n P(X_i | X_{i+1:n})$
  $P(X_1 | X_{2:n}) = \frac{P(X_2 | X_1, X_{3:n})\, P(X_1 | X_{3:n})}{P(X_2 | X_{3:n})}$
  $P(X, Z, Y) = P(X | Y, Z)\, P(Y | Z)\, P(Z)$
  $P(X | Y, Z) = \frac{P(Y | X, Z)\, P(X | Z)}{P(Y | Z)}$
  $P(X, Y | Z) = \frac{P(X, Z | Y)\, P(Y)}{P(Z)}$
8:12

10.2 Probability distributions
Bishop, C. M.: Pattern Recognition and Machine Learning. Springer, 2006
http://research.microsoft.com/en-us/um/people/cmbishop/prml/
8:13

Bernoulli & Binomial
• We have a binary random variable x ∈ {0, 1} (i.e. dom(x) = {0, 1})
  The Bernoulli distribution is parameterized by a single scalar μ,
  $P(x{=}1 \mid \mu) = \mu, \quad P(x{=}0 \mid \mu) = 1 - \mu$
  $\text{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$
• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {0, 1}. If each x_i ∼ Bern(x_i | μ) we have
  $P(D \mid \mu) = \prod_{i=1}^n \text{Bern}(x_i \mid \mu) = \prod_{i=1}^n \mu^{x_i} (1-\mu)^{1-x_i}$
  $\text{argmax}_\mu \log P(D \mid \mu) = \text{argmax}_\mu \sum_{i=1}^n x_i \log \mu + (1 - x_i) \log(1-\mu) = \frac{1}{n} \sum_{i=1}^n x_i$
• The Binomial distribution is the distribution over the count $m = \sum_{i=1}^n x_i$
  $\text{Bin}(m \mid n, \mu) = \binom{n}{m} \mu^m (1-\mu)^{n-m}, \quad \binom{n}{m} = \frac{n!}{(n-m)!\, m!}$
8:14

Beta
How to express uncertainty over a Bernoulli parameter μ
• The Beta distribution is over the interval [0, 1], typically the parameter μ of a Bernoulli:
  $\text{Beta}(\mu \mid a, b) = \frac{1}{B(a,b)} \mu^{a-1} (1-\mu)^{b-1}$
  with mean $\langle\mu\rangle = \frac{a}{a+b}$ and mode $\mu^* = \frac{a-1}{a+b-2}$ for a, b > 1
• The crucial point is:
  – Assume we are in a world with a "Bernoulli source" (e.g., binary bandit), but don't know its parameter μ
  – Assume we have a prior distribution P(μ) = Beta(μ | a, b)
  – Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {0, 1}, with counts $a_D = \sum_i x_i$ of [x_i = 1] and $b_D = \sum_i (1 - x_i)$ of [x_i = 0]
  – The posterior is
    $P(\mu \mid D) = \frac{P(D \mid \mu)}{P(D)} P(\mu) \propto \text{Bin}(D \mid \mu)\, \text{Beta}(\mu \mid a, b)$
    $\propto \mu^{a_D} (1-\mu)^{b_D} \mu^{a-1} (1-\mu)^{b-1} = \mu^{a-1+a_D} (1-\mu)^{b-1+b_D}$
    $= \text{Beta}(\mu \mid a + a_D, b + b_D)$
8:15

Beta
The prior is Beta(μ | a, b), the posterior is Beta(μ | a + a_D, b + b_D)
• Conclusions:
  – The semantics of a and b are counts of [x_i = 1] and [x_i = 0], respectively
  – The Beta distribution is conjugate to the Bernoulli (explained later)
  – With the Beta distribution we can represent beliefs (state of knowledge) about uncertain μ ∈ [0, 1] and know how to update this belief given data
8:16

Beta
(figure, from Bishop)
8:17

Multinomial
• We have an integer random variable x ∈ {1, .., K}
  The probability of a single x can be parameterized by μ = (μ_1, .., μ_K):
  $P(x{=}k \mid \mu) = \mu_k$
  with the constraint $\sum_{k=1}^K \mu_k = 1$ (probabilities need to be normalized)
• We have a data set of random variables D = {x_1, .., x_n}, each x_i ∈ {1, .., K}. If each x_i ∼ P(x_i | μ) we have
  $P(D \mid \mu) = \prod_{i=1}^n \mu_{x_i} = \prod_{i=1}^n \prod_{k=1}^K \mu_k^{[x_i = k]} = \prod_{k=1}^K \mu_k^{m_k}$
  where $m_k = \sum_{i=1}^n [x_i = k]$ is the count of [x_i = k]. The ML estimator is
  $\text{argmax}_\mu \log P(D \mid \mu) = \frac{1}{n} (m_1, .., m_K)$
• The Multinomial distribution is this distribution over the counts m_k
  $\text{Mult}(m_1, .., m_K \mid n, \mu) \propto \prod_{k=1}^K \mu_k^{m_k}$
8:18

Dirichlet
How to express uncertainty over a Multinomial parameter μ
• The Dirichlet distribution is over the K-simplex, that is, over μ_1, .., μ_K ∈ [0, 1] subject to the constraint $\sum_{k=1}^K \mu_k = 1$:
  $\text{Dir}(\mu \mid \alpha) \propto \prod_{k=1}^K \mu_k^{\alpha_k - 1}$
  It is parameterized by α = (α_1, .., α_K), has mean $\langle\mu_i\rangle = \frac{\alpha_i}{\sum_j \alpha_j}$ and mode $\mu_i^* = \frac{\alpha_i - 1}{\sum_j \alpha_j - K}$ for α_i > 1.
• The crucial point is:
  – Assume we are in a world with a "Multinomial source" (e.g., an integer bandit), but don't know its parameter μ
  – Assume we have a prior distribution P(μ) = Dir(μ | α)
  – Assume we collected some data D = {x_1, .., x_n}, x_i ∈ {1, .., K}, with counts $m_k = \sum_i [x_i = k]$
  – The posterior is
    $P(\mu \mid D) = \frac{P(D \mid \mu)}{P(D)} P(\mu) \propto \text{Mult}(D \mid \mu)\, \text{Dir}(\mu \mid \alpha)$
    $\propto \prod_{k=1}^K \mu_k^{\alpha_k - 1} \prod_{k=1}^K \mu_k^{m_k} = \prod_{k=1}^K \mu_k^{\alpha_k - 1 + m_k}$
    $= \text{Dir}(\mu \mid \alpha + m)$
8:19

Dirichlet
The prior is Dir(μ | α), the posterior is Dir(μ | α + m)
• Conclusions:
  – The semantics of α is the counts of [x_i = k]
  – The Dirichlet distribution is conjugate to the Multinomial
  – With the Dirichlet distribution we can represent beliefs (state of knowledge) about uncertain μ of an integer random variable and know how to update this belief given data
8:20

Dirichlet
Illustrations for α = (0.1, 0.1, 0.1), α = (1, 1, 1) and α = (10, 10, 10):
(figure, from Bishop)
8:21

Conjugate priors
likelihood                       conjugate
Binomial     Bin(D | μ)          Beta           Beta(μ | a, b)
Multinomial  Mult(D | μ)         Dirichlet      Dir(μ | α)
Gauss        N(x | μ, Σ)         Gauss          N(μ | μ_0, A)
1D Gauss     N(x | μ, λ^{-1})    Gamma          Gam(λ | a, b)
nD Gauss     N(x | μ, Λ^{-1})    Wishart        Wish(Λ | W, ν)
nD Gauss     N(x | μ, Λ^{-1})    Gauss-Wishart  N(μ | μ_0, (βΛ)^{-1}) Wish(Λ | W, ν)
8:24

Distributions over continuous domain
8:25

Distributions over continuous domain
• Let x be a continuous RV.

Motivation for Beta & Dirichlet distributions
• Bandits:
  – If we have binary [integer] bandits, the Beta [Dirichlet] distribution is a way to represent and update beliefs
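The conjugate updates derived on the Beta and Dirichlet slides amount to adding observed counts to the prior parameters. A minimal sketch in plain Python follows; the function names and the example data are hypothetical and only illustrate the formulas Beta(μ | a + a_D, b + b_D) and Dir(μ | α + m).

```python
def beta_posterior(a, b, data):
    """Posterior parameters (a + a_D, b + b_D) for binary data and a Beta(a, b) prior."""
    a_d = sum(data)            # count of [x_i = 1]
    b_d = len(data) - a_d      # count of [x_i = 0]
    return a + a_d, b + b_d


def beta_mean(a, b):
    """Mean a / (a + b) of Beta(a, b)."""
    return a / (a + b)


def beta_mode(a, b):
    """Mode (a - 1) / (a + b - 2) of Beta(a, b), defined for a, b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a, b > 1")
    return (a - 1) / (a + b - 2)


def dirichlet_posterior(alpha, data, num_values):
    """Posterior parameters alpha + m for data in {1, .., K} and a Dir(alpha) prior."""
    counts = [0] * num_values
    for x in data:
        counts[x - 1] += 1     # m_k = number of observations with x_i = k
    return [a_k + m_k for a_k, m_k in zip(alpha, counts)]


# Hypothetical observations from a binary and a 3-valued source
coin_data = [1, 0, 1, 1]
posterior_a, posterior_b = beta_posterior(1, 1, coin_data)
posterior_alpha = dirichlet_posterior([1, 1, 1], [1, 3, 3, 2, 3], 3)
```

With the uniform prior Beta(1, 1) used here, the posterior mode coincides with the ML estimate (1/n) Σ_i x_i, while the posterior mean is pulled slightly towards 1/2.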
Motivation for Beta & Dirichlet distributions contd.
• Bandits contd.:
  – The belief space becomes discrete: The parameter α of the prior is continuous, but the posterior updates live on a discrete "grid" (adding counts to α)
  – We can in principle do belief planning using this
• Reinforcement Learning:
  – Assume we know that the world is a finite-state MDP, but do not know its transition probability P(s' | s, a). For each (s, a), P(s' | s, a) is a distribution over the integer s'
  – Having a separate Dirichlet distribution for each (s, a) is a way to represent our belief about the world, that is, our belief about P(s' | s, a)
  – We can in principle do belief planning using this → Bayesian Reinforcement Learning
• Dirichlet distributions are also used to model texts (word distributions in text), images, or mixture distributions in general
8:22

Conjugate priors
• Assume you have data D = {x_1, .., x_n} with likelihood
  $P(D \mid \theta)$
  that depends on an uncertain parameter θ
  Assume you have a prior P(θ)
• The prior P(θ) is conjugate to the likelihood P(D | θ) iff the posterior
  $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
  is in the same distribution class as the prior P(θ)
• Having a conjugate prior is very convenient, because then you know how to update the belief given data
8:23

Distributions over continuous domain contd.
• For a continuous RV x, the probability density function (pdf) p(x) ∈ [0, ∞) defines the probability
  $P(a \le x \le b) = \int_a^b p(x)\, dx \in [0, 1]$
  The (cumulative) probability distribution $F(y) = P(x \le y) = \int_{-\infty}^y dx\, p(x) \in [0, 1]$ is the cumulative integral with $\lim_{y \to \infty} F(y) = 1$
  (In discrete domain: probability distribution and probability mass function P(x) ∈ [0, 1] are used synonymously.)
• Two basic examples:
  Gaussian: $N(x \mid \mu, \Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} e^{-\frac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu)}$
  Dirac or δ ("point particle"): $\delta(x) = 0$ except at $x = 0$, $\int \delta(x)\, dx = 1$
    $\delta(x) = \frac{\partial}{\partial x} H(x)$ where $H(x) = [x \ge 0]$ = Heaviside step function
8:26

Gaussian distribution
(figure: density N(x | μ, σ²), with the mean μ and the width 2σ marked)
• 1-dim: $N(x \mid \mu, \sigma^2) = \frac{1}{|2\pi\sigma^2|^{1/2}} e^{-\frac{1}{2} (x-\mu)^2 / \sigma^2}$
• n-dim Gaussian in normal form:
  $N(x \mid \mu, \Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} \exp\{-\frac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu)\}$
  with mean μ and covariance matrix Σ. In canonical form:
  $N[x \mid a, A] = \frac{\exp\{-\frac{1}{2} a^\top A^{-1} a\}}{|2\pi A^{-1}|^{1/2}} \exp\{-\frac{1}{2} x^\top A\, x + x^\top a\}$   (1)
  with precision matrix $A = \Sigma^{-1}$ and coefficient $a = \Sigma^{-1} \mu$ (and mean $\mu = A^{-1} a$).
• Gaussian identities: see http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gaussians.pdf
8:27

Motivation for Gaussian distributions
• Gaussian Bandits
• Control theory, Stochastic Optimal Control
• State estimation, sensor processing, Gaussian filtering (Kalman filtering)
• Machine Learning
• etc
8:28

Motivation for particle distributions
• Numeric representation of "difficult" distributions
  – Very general and versatile
  – But often needs many samples
• Distributions over games (action sequences), sample based planning, MCTS
• State estimation, particle filters
• etc
8:31

Utilities & Decision Theory
• Given a space of events O (e.g., outcomes of a trial, a game, etc) the utility is a function
  $U: O \to \mathbb{R}$
• The utility represents preferences as a single scalar – which is not always obvious (cf. multi-objective optimization)
• Decision Theory: making decisions (that determine p(x)) that maximize expected utility
  $E\{U\}_p = \int_x U(x)\, p(x)$
• Concave utility functions imply risk aversion (and convex, risk-taking)
8:32

Particle Approximation of a Distribution
• We approximate a distribution p(x) over a continuous domain R^n
• A particle distribution q(x) is a weighted set $S = \{(x_i, w_i)\}_{i=1}^N$ of N particles
  – each particle has a "location" x_i ∈ R^n and a weight w_i ∈ R
  – weights are normalized, $\sum_i w_i = 1$
  $q(x) := \sum_{i=1}^N w_i\, \delta(x - x_i)$
  where δ(x − x_i) is the δ-distribution.
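A particle distribution as defined above can be stored directly as a list of (location, weight) pairs. Because q is a weighted sum of δ-distributions, integrating any function f against q collapses to the weighted sum Σ_i w_i f(x_i). The sketch below uses only the Python standard library and one-dimensional locations for brevity; the sampling scheme (equally weighted draws from a standard Gaussian) and all names are assumptions made for illustration.

```python
import random


def make_particles(sampler, n, rng):
    """Draw n equally weighted particles (x_i, w_i) with w_i = 1/n."""
    return [(sampler(rng), 1.0 / n) for _ in range(n)]


def normalize(particles):
    """Rescale the weights so that they sum to one."""
    total = sum(w for _, w in particles)
    return [(x, w / total) for x, w in particles]


def expectation(particles, f):
    """Integral of f against q(x) = sum_i w_i δ(x - x_i), i.e. sum_i w_i f(x_i)."""
    return sum(w * f(x) for x, w in particles)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
particles = make_particles(lambda r: r.gauss(0.0, 1.0), 10000, rng)
mean_estimate = expectation(particles, lambda x: x)
second_moment = expectation(particles, lambda x: x * x)
```

The two estimates approximate the mean 0 and the second moment 1 of the sampled Gaussian; their accuracy improves only slowly with the number of particles, which is one reason particle representations often need many samples.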
Particle Approximation of a Distribution
• Given weighted particles, we can estimate for any (smooth) $f$:
$$\langle f(x) \rangle_p = \int_x f(x)\, p(x)\, dx \approx \sum_{i=1}^N w_i f(x_i)$$
See An Introduction to MCMC for Machine Learning, www.cs.ubc.ca/~nando/papers/mlintro.pdf
8:29

Particle Approximation of a Distribution
Histogram of a particle representation:
[Figure: histogram of a particle representation]
8:30

Entropy
• The neg-log ($-\log p(x)$) of a distribution reflects something like “error”:
– neg-log of a Gaussian ↔ squared error
– neg-log likelihood ↔ prediction error
• The ($-\log p(x)$) is the “optimal” coding length you should assign to a symbol $x$. This will minimize the expected length of an encoding
$$H(p) = \int_x p(x)\,[-\log p(x)]$$
• The entropy $H(p) = E_{p(x)}\{-\log p(x)\}$ of a distribution $p$ is a measure of uncertainty, or lack-of-information, we have about $x$
8:33

Kullback-Leibler divergence*
• Assume you use a “wrong” distribution $q(x)$ to decide on the coding length of symbols drawn from $p(x)$. The expected length of an encoding is
$$\int_x p(x)\,[-\log q(x)] \ge H(p)$$
• The difference
$$D(p \,\|\, q) = \int_x p(x) \log \frac{p(x)}{q(x)} \ge 0$$
is called Kullback-Leibler divergence
Proof of inequality, using the Jensen inequality:
$$-\int_x p(x) \log \frac{q(x)}{p(x)} \ge -\log \int_x p(x)\, \frac{q(x)}{p(x)} = 0$$
8:34

Some more continuous distributions*
Gaussian: $\mathcal{N}(x \mid a, A) = \frac{1}{|2\pi A|^{1/2}}\, e^{-\frac12 (x-a)^\top A^{-1} (x-a)}$
Dirac or $\delta$: $\delta(x) = \frac{\partial}{\partial x} H(x)$
Student’s t (= Gaussian for $\nu \to \infty$, otherwise heavy tails): $p(x; \nu) \propto [1 + \frac{x^2}{\nu}]^{-\frac{\nu+1}{2}}$
Exponential (distribution over single event time): $p(x; \lambda) = [x \ge 0]\, \lambda e^{-\lambda x}$
Laplace (“double exponential”): $p(x; \mu, b) = \frac{1}{2b}\, e^{-|x-\mu|/b}$
Chi-squared: $p(x; k) \propto [x \ge 0]\, x^{k/2-1} e^{-x/2}$
Gamma: $p(x; k, \theta) \propto [x \ge 0]\, x^{k-1} e^{-x/\theta}$
8:35

11 Bandits & UCT

Multi-armed Bandits
• There are $n$ machines
• Each machine $i$ returns a reward $y \sim P(y; \theta_i)$. The machine’s parameter $\theta_i$ is unknown
• Your goal is to maximize the reward, say, collected over the first $T$ trials
9:1

Bandits – applications
• Online advertisement
• Clinical trials, robotic scientist
• Efficient optimization
9:2

Bandits
• The bandit problem is an archetype for
– Sequential decision making
– Decisions that influence knowledge as well as rewards/states
– Exploration/exploitation
• The same aspects are inherent also in global optimization, active learning & RL
• The Bandit problem formulation is the basis of UCB – which is the core of several planning and decision making methods
• Bandit problems are commercially very relevant
9:3

9:4

Bandits: Formal Problem Definition
• Let $a_t \in \{1, .., n\}$ be the choice of machine at time $t$. Let $y_t \in \mathbb{R}$ be the outcome
• A policy or strategy maps all the history to a new choice:
$$\pi : [(a_1, y_1), (a_2, y_2), ..., (a_{t-1}, y_{t-1})] \mapsto a_t$$
• Problem: Find a policy $\pi$ that
$$\max \langle \textstyle\sum_{t=1}^T y_t \rangle$$
or $\max \langle y_T \rangle$, or other objectives like discounted infinite horizon $\max \langle \sum_{t=1}^\infty \gamma^t y_t \rangle$
9:5

Exploration, Exploitation
• “Two effects” of choosing a machine:
– You collect more data about the machine → knowledge
– You collect reward
• For example
– Exploration: Choose the next action $a_t$ to $\min \langle H(b_t) \rangle$
– Exploitation: Choose the next action $a_t$ to $\max \langle y_t \rangle$
9:6

Digression: Active Learning
• “Active Learning”, “Experimental Design”, “Exploration in Reinforcement Learning”: All of these are strongly related to trying to minimize (also) $H(b_t)$
Gaussian Processes:
[Figure: Gaussian Processes (from Rasmussen & Williams)]
9:7

Upper Confidence Bounds (UCB)

Upper Confidence Bound (UCB)
Initialization: Play each machine once
repeat
    Play the machine $i$ that maximizes $\hat y_i + \beta \sqrt{\frac{2 \ln n}{n_i}}$
until
$\hat y_i$ is the average reward of machine $i$ so far
$n_i$ is how often machine $i$ has been played so far
$n = \sum_i n_i$ is the number of rounds so far
$\beta$ is often chosen as $\beta = 1$
The bound is derived from the Hoeffding inequality
See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fischer, Machine learning, 2002.
9:8

UCB algorithms
• UCB algorithms determine a confidence interval such that
$$\hat y_i - \sigma_i < \langle y_i \rangle < \hat y_i + \sigma_i$$
with high probability. UCB chooses the upper bound of this confidence interval
• Optimism in the face of uncertainty
• Strong bounds on the regret (sub-optimality) of UCB (e.g. Auer et al.)
9:9

UCB for Bernoulli
• If we have a single Bernoulli bandit, we can count
$$a = 1 + \#\text{wins}, \quad b = 1 + \#\text{losses}$$
• Our posterior over the Bernoulli parameter $\mu$ is $\text{Beta}(\mu \mid a, b)$
• The mean is $\langle \mu \rangle = \frac{a}{a+b}$
The mode (most likely) is $\mu^* = \frac{a-1}{a+b-2}$ for $a, b > 1$
The variance is $\text{Var}\{\mu\} = \frac{ab}{(a+b+1)(a+b)^2}$
One can numerically compute the inverse cumulative Beta distribution → get exact quantiles
• Alternative strategies:
$$\text{argmax}_i\ 90\%\text{-quantile}(\mu_i)$$
$$\text{argmax}_i\ \langle \mu_i \rangle + \beta \sqrt{\text{Var}\{\mu_i\}}$$
9:10

UCB for Gauss
• If we have a single Gaussian bandit, we can compute
the mean estimator $\hat\mu = \frac{1}{n} \sum_i y_i$
the empirical variance $\hat\sigma^2 = \frac{1}{n-1} \sum_i (y_i - \hat\mu)^2$
and the variance of the mean estimator $\text{Var}\{\hat\mu\} = \hat\sigma^2 / n$
• $\hat\mu$ and $\text{Var}\{\hat\mu\}$ describe our posterior Gaussian belief over the true underlying $\mu$
Using the err-function we can get exact quantiles
• Alternative strategies:
$$90\%\text{-quantile}(\mu_i)$$
$$\hat\mu_i + \beta \sqrt{\text{Var}\{\mu_i\}} = \hat\mu_i + \beta \hat\sigma / \sqrt{n}$$
9:11

UCB - Discussion
• UCB over-estimates the reward-to-go (under-estimates cost-to-go), just like A∗ – but does so in the probabilistic setting of bandits
• The fact that regret bounds exist is great!
• UCB became a core method for algorithms (including planners) to decide what to explore:
In tree search, the decision of which branches/actions to explore is itself a decision problem. An “intelligent agent” (like UCB) can be used within the planner to make decisions about how to grow the tree.
9:12

Monte Carlo Tree Search
9:13

Monte Carlo Tree Search (MCTS)
• MCTS is very successful on Computer Go and other games
• MCTS is rather simple to implement
• MCTS is very general: applicable on any discrete domain
• Key paper: Kocsis & Szepesvári: Bandit based Monte-Carlo Planning, ECML 2006.
• Survey paper: Browne et al.: A Survey of Monte Carlo Tree Search Methods, 2012.
• Tutorial presentation: http://web.engr.oregonstate.edu/~afern/icaps10-MCP-tutorial.ppt
9:14

Monte Carlo methods
• Generally, the term Monte Carlo simulation refers to methods that generate many i.i.d. random samples $x_i \sim P(x)$ from a distribution $P(x)$. Using the samples one can estimate expectations of anything that depends on $x$, e.g. $f(x)$:
$$\langle f \rangle = \int_x P(x)\, f(x)\, dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$$
(In this view, Monte Carlo approximates an integral.)
• Example: What is the probability that a solitaire would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitaires and count.
• The method developed in the 40ies, where computers became faster. Fermi, Ulam and von Neumann initiated the idea. von Neumann called it “Monte Carlo” as a code name.
9:15

Generic MCTS scheme
[Figure: generic MCTS scheme, from Browne et al.]

Flat Monte Carlo
• The goal of MCTS is to estimate the utility (e.g., expected payoff $\Delta$) depending on the first action $a$ chosen:
$$Q(s_0, a) = E\{\Delta \mid s_0, a\}$$
where the expectation is taken w.r.t. the whole future randomized actions (including a potential opponent)
• Flat Monte Carlo does so by rolling out many random simulations (using a ROLLOUT POLICY) without growing a tree
The key difference/advantage of MCTS over flat MC is that the tree growth focusses computational effort on promising actions
9:18

Upper Confidence Tree (UCT)
• UCT uses UCB to realize the TREE POLICY, i.e. to decide where to expand the tree
• BACKUP updates all parents of $v_l$ as
$n(v) \leftarrow n(v) + 1$ (count how often has it been played)
$Q(v) \leftarrow Q(v) + \Delta$ (sum of rewards received)
• TREE POLICY chooses child nodes based on UCB:
$$\text{argmax}_{v' \in \partial(v)}\ \frac{Q(v')}{n(v')} + \beta \sqrt{\frac{2 \ln n(v)}{n(v')}}$$
or choose $v'$ if $n(v') = 0$
9:19
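The UCB rule that UCT uses as its tree policy is the same rule as in the bandit setting (slide 9:8). The sketch below illustrates that selection rule on a simulated Bernoulli bandit. The arm probabilities, the seed, and the function names `ucb_select` and `run_bandit` are assumptions made for illustration and are not part of the lecture material.

```python
import math
import random

def ucb_select(counts, sums, beta=1.0):
    """Pick the arm maximizing mean_i + beta * sqrt(2 ln n / n_i).

    Arms that were never played are chosen first, which mirrors
    "Initialization: play each machine once" (and "choose v' if n(v') = 0").
    """
    for i, c in enumerate(counts):
        if c == 0:
            return i
    n = sum(counts)  # number of rounds so far
    scores = [s / c + beta * math.sqrt(2.0 * math.log(n) / c)
              for s, c in zip(sums, counts)]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(probs, rounds, seed=0, beta=1.0):
    """Simulate Bernoulli arms with success probabilities `probs` (assumed values)."""
    rng = random.Random(seed)
    counts = [0] * len(probs)   # n_i: how often arm i was played
    sums = [0.0] * len(probs)   # sum of rewards of arm i
    for _ in range(rounds):
        i = ucb_select(counts, sums, beta)
        reward = 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts, sums

counts, sums = run_bandit([0.1, 0.9], rounds=1000)
print(counts)
```

In UCT the same function would be applied at every tree node, with `counts` and `sums` replaced by the children's $n(v')$ and $Q(v')$.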
Generic MCTS scheme
start tree $V = \{v_0\}$
while within computational budget do
    $v_l \leftarrow$ TREE POLICY($V$) chooses a leaf of $V$
    append $v_l$ to $V$
    $\Delta \leftarrow$ ROLLOUT POLICY($V$) rolls out a full simulation, with return $\Delta$
    BACKUP($v_l, \Delta$) updates the values of all parents of $v_l$
end while
return best child of $v_0$
9:16

Generic MCTS scheme
• In comparison to other planners it always computes full roll outs to a terminal state. No heuristics to estimate the utility of a state are needed.
• The tree grows unbalanced
• The TREE POLICY decides where the tree is expanded – and needs to trade off exploration vs. exploitation
• The ROLLOUT POLICY is necessary to simulate a roll out. It could be a random policy; at least a randomized policy.
9:17

12 Game Playing

Outline
• Minimax
• α–β pruning
• UCT for games
10:1

Game tree (2-player, deterministic, turns)
[Figure: game tree]
10:2

Minimax
Perfect play for deterministic, perfect-information games
Idea: choose move to position with highest minimax value = best achievable payoff against best play
E.g., 2-ply game:
[Figure: 2-ply game tree]
10:3

Minimax algorithm
function MINIMAX-DECISION(state) returns an action
    inputs: state, current state in game
    return the a in ACTIONS(state) maximizing MIN-VALUE(RESULT(a, state))

function MAX-VALUE(state) returns a utility value
    if TERMINAL-TEST(state) then return UTILITY(state)
    v ← −∞
    for a, s in SUCCESSORS(state) do v ← MAX(v, MIN-VALUE(s))
    return v

function MIN-VALUE(state) returns a utility value
    if TERMINAL-TEST(state) then return UTILITY(state)
    v ← ∞
    for a, s in SUCCESSORS(state) do v ← MIN(v, MAX-VALUE(s))
    return v
10:4

Properties of minimax
Complete?? Yes, if tree is finite (chess has specific rules for this)
Optimal?? Yes, against an optimal opponent. Otherwise??
Time complexity?? $O(b^m)$
Space complexity?? $O(bm)$ (depth-first exploration)
For chess, $b \approx 35$, $m \approx 100$ for “reasonable” games ⇒ exact solution completely infeasible
But do we need to explore every path?
10:5

α–β pruning example
[Figure: α–β pruning example]
10:6

Why is it called α–β?
[Figure: search path with MAX and MIN levels and a node of value V]
α is the best value (to MAX) found so far off the current path
If V is worse than α, MAX will avoid it ⇒ prune that branch
Define β similarly for MIN
10:7

The α–β algorithm
function ALPHA-BETA-DECISION(state) returns an action
    return the a in ACTIONS(state) maximizing MIN-VALUE(RESULT(a, state))

function MAX-VALUE(state, α, β) returns a utility value
    inputs: state, current state in game
            α, the value of the best alternative for MAX along the path to state
            β, the value of the best alternative for MIN along the path to state
    if TERMINAL-TEST(state) then return UTILITY(state)
    v ← −∞
    for a, s in SUCCESSORS(state) do
        v ← MAX(v, MIN-VALUE(s, α, β))
        if v ≥ β then return v
        α ← MAX(α, v)
    return v

function MIN-VALUE(state, α, β) returns a utility value
    same as MAX-VALUE but with roles of α, β reversed
10:8

Properties of α–β
Pruning does not affect final result
Good move ordering improves effectiveness of pruning
A simple example of the value of reasoning about which computations are relevant (a form of metareasoning)
10:9

Resource limits
Standard approach:
• Use CUTOFF-TEST instead of TERMINAL-TEST
e.g., depth limit (perhaps add quiescence search)
• Use EVAL instead of UTILITY
i.e., evaluation function that estimates desirability of position
Suppose we have 100 seconds, explore $10^4$ nodes/second
⇒ $10^6$ nodes per move $\approx 35^{8/2}$
⇒ α–β reaches depth 8 ⇒ pretty good chess program
10:10

Evaluation functions
[Figure: example chess positions]
For chess, typically linear weighted sum of features
$$\text{EVAL}(s) = w_1 f_1(s) + w_2 f_2(s) + \ldots + w_n f_n(s)$$
e.g., $w_1 = 9$ with $f_1(s)$ = (number of white queens) – (number of black queens), etc.
10:11

Upper Confidence Tree (UCT) for games
Standard backup updates all parents of $v_l$ as
$n(v) \leftarrow n(v) + 1$ (count how often has it been played)
$Q(v) \leftarrow Q(v) + \Delta$ (sum of rewards received)
In games use a “negamax” backup: While iterating upward, flip sign $\Delta \leftarrow -\Delta$ in each iteration
Survey of MCTS applications: Browne et al.: A Survey of Monte Carlo Tree Search Methods, 2012.
10:12

Brief notes on game theory
• (Small) zero-sum games can be represented by a payoff matrix
• $U_{ji}$ denotes the utility of player 1 if she chooses the pure (=deterministic) strategy $i$ and player 2 chooses the pure strategy $j$.
Zero-sum games: $U_{ji} = -U_{ij}$, $U^\top = -U$
• Finding a minimax optimal mixed strategy $p$ is a Linear Program
$$\max_w\ w \quad \text{s.t.} \quad U p \ge w, \quad \sum_i p_i = 1, \quad p \ge 0$$
Note that $U p \ge w$ implies $\min_j (U p)_j \ge w$.
• Gainable payoff of player 1: $\max_p \min_q q^\top U p$
Minimax-Theorem: $\max_p \min_q q^\top U p = \min_q \max_p q^\top U p$
Minimax-Theorem ↔ optimal $p$ with $w \ge 0$ exists
10:13

13 Graphical Models

Outline
• A. Introduction
– Motivation and definition of Bayes Nets
– Conditional independence in Bayes Nets
– Examples
• B. Inference in Graphical Models
– Sampling methods (Rejection, Importance, Gibbs)
– Variable Elimination & Factor Graphs
– Message passing, Loopy Belief Propagation
11:1

Graphical Models
• The core difficulty in modelling is specifying
What are the relevant variables?
How do they depend on each other? (Or how could they depend on each other → learning)
• Graphical models are a simple, graphical notation for
1) which random variables exist
2) which random variables are “directly coupled”
Thereby they describe a joint probability distribution $P(X_1, .., X_n)$ over $n$ random variables.
• 2 basic variants:
– Bayesian Networks (aka. directed model, belief network)
– Factor Graphs (aka. undirected model, Markov Random Field)
11:2

Bayesian Networks
• A Bayesian Network is a
– directed acyclic graph (DAG)
– where each node represents a random variable $X_i$
– for each node we have a conditional probability distribution $P(X_i \mid \text{Parents}(X_i))$
• In the simplest case (discrete RVs), the conditional distribution is represented as a conditional probability table (CPT)
11:3

Example
drinking red wine → longevity?
11:4

Bayesian Networks
• DAG → we can sort the RVs; edges only go from lower to higher index
• The joint distribution can be factored as
$$P(X_{1:n}) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i))$$
• Missing links imply conditional independence
• Ancestral simulation to sample from joint distribution
11:5

Example
(Heckerman 1995)
[Figure: network with nodes Battery (B), Fuel (F), Gauge (G), TurnOver (T), Start (S)]
P(B=bad) = 0.02
P(F=empty) = 0.05
P(G=empty|B=good,F=not empty) = 0.04
P(G=empty|B=good,F=empty) = 0.97
P(G=empty|B=bad,F=not empty) = 0.10
P(G=empty|B=bad,F=empty) = 0.99
P(T=no|B=good) = 0.03
P(T=no|B=bad) = 0.98
P(S=no|T=yes,F=not empty) = 0.01
P(S=no|T=yes,F=empty) = 0.92
P(S=no|T=no,F=not empty) = 1.00
P(S=no|T=no,F=empty) = 1.00
⇐⇒ $P(S, T, G, F, B) = P(B)\, P(F)\, P(G \mid F, B)\, P(T \mid B)\, P(S \mid T, F)$
• Table sizes: LHS = $2^5 - 1 = 31$, RHS = $1 + 1 + 4 + 2 + 4 = 12$
11:6

Bayes Nets & conditional independence
• Independence: $\text{Indep}(X, Y) \iff P(X, Y) = P(X)\, P(Y)$
• Conditional independence: $\text{Indep}(X, Y \mid Z) \iff P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$
[Figure: three graphs over X, Y, Z]
(head-to-head): Indep(X, Y), ¬Indep(X, Y|Z)
(tail-to-tail): ¬Indep(X, Y), Indep(X, Y|Z)
(head-to-tail): ¬Indep(X, Y), Indep(X, Y|Z)
11:7

• Head-to-head: Indep(X, Y)
$P(X, Y, Z) = P(X)\, P(Y)\, P(Z \mid X, Y)$
$P(X, Y) = P(X)\, P(Y) \sum_Z P(Z \mid X, Y) = P(X)\, P(Y)$
• Tail-to-tail: Indep(X, Y|Z)
$P(X, Y, Z) = P(Z)\, P(X \mid Z)\, P(Y \mid Z)$
$P(X, Y \mid Z) = \frac{P(X, Y, Z)}{P(Z)} = P(X \mid Z)\, P(Y \mid Z)$
• Head-to-tail: Indep(X, Y|Z)
$P(X, Y, Z) = P(X)\, P(Z \mid X)\, P(Y \mid Z)$
$P(X, Y \mid Z) = \frac{P(X, Y, Z)}{P(Z)} = \frac{P(X, Z)\, P(Y \mid Z)}{P(Z)} = P(X \mid Z)\, P(Y \mid Z)$
11:8

General rules for determining conditional independence in a Bayes net:
• Given three groups of random variables X, Y, Z
Indep(X, Y|Z) ⇐⇒ every path from X to Y is “blocked by Z”
• A path is “blocked by Z” ⇐⇒ on this path...
– ∃ a node in Z that is head-to-tail w.r.t. the path, or
– ∃ a node in Z that is tail-to-tail w.r.t. the path, or
– ∃ another node A which is head-to-head w.r.t. the path and neither A nor any of its descendants are in Z
11:9

Example
(Heckerman 1995)
[Figure: the Battery/Fuel/Gauge/TurnOver/Start network with the same probabilities as in the example above]
Indep(T, F)? Indep(B, F|S)? Indep(B, S|T)?
11:10

Inference
• Inference: Given some pieces of information (prior, observed variables) what is the implication (the implied information, the posterior) on a non-observed variable
• In a Bayes Net: Assume there are three groups of RVs:
– Z are observed random variables
– X and Y are hidden random variables
– We want to do inference about X, not Y
Given some observed variables Z, compute the posterior marginal P(X | Z) for some hidden variable X.
$$P(X \mid Z) = \frac{P(X, Z)}{P(Z)} = \frac{1}{P(Z)} \sum_Y P(X, Y, Z)$$
where Y are all hidden random variables except for X
• Inference requires summing over (eliminating) hidden variables.
11:12

Example: Holmes & Watson
• Mr. Holmes lives in Los Angeles. One morning when Holmes leaves his house, he realizes that his grass is wet. Is it due to rain, or has he forgotten to turn off his sprinkler?
– Calculate P(R|H), P(S|H) and compare these values to the prior probabilities.
– Calculate P(R, S|H). Note: R and S are marginally independent, but conditionally dependent
• Holmes checks Watson’s grass, and finds it is also wet.
– Calculate P(R|H, W), P(S|H, W)
– This effect is called explaining away
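The Holmes & Watson questions can be answered by brute-force summation over the hidden variables, exactly as in the formula $P(X \mid Z) = \frac{1}{P(Z)} \sum_Y P(X, Y, Z)$. The sketch below assumes the CPT values stated for this example in the lecture's network slide (P(R=yes)=0.2, P(S=yes)=0.1, and the tables for W and H) and encodes yes/no as 1/0. The helper names `joint` and `posterior` are hypothetical.

```python
import itertools

# CPTs of the Holmes & Watson network (values as given in the lecture example).
P_R = {1: 0.2, 0: 0.8}                       # rain
P_S = {1: 0.1, 0: 0.9}                       # sprinkler
P_W1_GIVEN_R = {1: 1.0, 0: 0.2}              # P(W=yes | R)
P_H1_GIVEN_RS = {(1, 1): 1.0, (1, 0): 1.0,   # P(H=yes | R, S)
                 (0, 1): 0.9, (0, 0): 0.0}

def joint(r, s, w, h):
    """P(H, W, S, R) = P(H|S,R) P(W|R) P(S) P(R)."""
    pw = P_W1_GIVEN_R[r] if w == 1 else 1.0 - P_W1_GIVEN_R[r]
    ph = P_H1_GIVEN_RS[(r, s)] if h == 1 else 1.0 - P_H1_GIVEN_RS[(r, s)]
    return ph * pw * P_S[s] * P_R[r]

def posterior(query, evidence):
    """P(query = 1 | evidence), summing the joint over all hidden variables."""
    names = ("r", "s", "w", "h")
    num = den = 0.0
    for values in itertools.product((0, 1), repeat=4):
        x = dict(zip(names, values))
        if any(x[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the observations
        p = joint(**x)
        den += p
        if x[query] == 1:
            num += p
    return num / den

p_r_h = posterior("r", {"h": 1})             # rain, given Holmes' grass is wet
p_s_h = posterior("s", {"h": 1})             # sprinkler, given Holmes' grass is wet
p_r_hw = posterior("r", {"h": 1, "w": 1})    # additionally Watson's grass is wet
p_s_hw = posterior("s", {"h": 1, "w": 1})
print(p_r_h, p_s_h, p_r_hw, p_s_hw)
```

Observing Watson's wet grass raises the posterior of rain and lowers the posterior of the sprinkler, which is the explaining-away effect. The enumeration visits all $2^4$ joint states, so it only scales to tiny networks.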
What can we do with Bayes nets?
• Inference: Given some pieces of information (prior, observed variables) what is the implication (the implied information, the posterior) on a non-observed variable
• Decision Making: If utilities and decision variables are defined → compute optimal decisions in probabilistic domains
• Learning:
– Fully Bayesian Learning: Inference over parameters (e.g., β)
– Maximum likelihood training: Optimizing parameters
• Structure Learning (Learning/Inferring the graph structure itself): Decide which model (which graph structure) fits the data best; thereby uncovering conditional independencies in the data.
11:11

Example: Holmes & Watson
JavaBayes: run it from the html page
http://www.cs.cmu.edu/~javabayes/Home/applet.html
11:13

Example: Holmes & Watson
[Figure: network with nodes Rain (R), Sprinkler (S), Watson (W), Holmes (H)]
P(R=yes) = 0.2
P(S=yes) = 0.1
P(W=yes|R=yes) = 1.0
P(W=yes|R=no) = 0.2
P(H=yes|R=yes,S=yes) = 1.0
P(H=yes|R=yes,S=no) = 1.0
P(H=yes|R=no,S=yes) = 0.9
P(H=yes|R=no,S=no) = 0.0
$$P(H, W, S, R) = P(H \mid S, R)\, P(W \mid R)\, P(S)\, P(R)$$
$$P(R \mid H) = \sum_{W,S} \frac{P(R, W, S, H)}{P(H)} = \frac{1}{P(H)} \sum_{W,S} P(H \mid S, R)\, P(W \mid R)\, P(S)\, P(R) = \frac{1}{P(H)} \sum_S P(H \mid S, R)\, P(S)\, P(R)$$
$$P(R{=}1 \mid H{=}1) = \frac{1}{P(H{=}1)} (1.0 \cdot 0.2 \cdot 0.1 + 1.0 \cdot 0.2 \cdot 0.9) = \frac{1}{P(H{=}1)}\, 0.2$$
$$P(R{=}0 \mid H{=}1) = \frac{1}{P(H{=}1)} (0.9 \cdot 0.8 \cdot 0.1 + 0.0 \cdot 0.8 \cdot 0.9) = \frac{1}{P(H{=}1)}\, 0.072$$
11:14

• These types of calculations can be automated → Variable Elimination Algorithm (discussed later)
11:15

13.1 Inference Methods in Graphical Models
11:16

Inference methods in graphical models
• Sampling:
– Rejection sampling, importance sampling, Gibbs sampling
– More generally, Markov-Chain Monte Carlo (MCMC) methods
• Message passing:
– Exact inference on trees (includes the Junction Tree Algorithm)
– Belief propagation
• Other approximations/variational methods
– Expectation propagation
– Specialized variational methods depending on the model
• Reductions:
– Mathematical Programming (e.g. LP relaxations of MAP)
– Compilation into Arithmetic Circuits (Darwiche et al.)
11:17

Sampling
• Read Andrieu et al: An Introduction to MCMC for Machine Learning (Machine Learning, 2003)
• Here I’ll discuss only three basic methods:
– Rejection sampling
– Importance sampling
– Gibbs sampling
11:18

Monte Carlo methods
• Generally, a Monte Carlo method is a method to generate a set of (potentially weighted) samples that approximate a distribution $p(x)$.
In the unweighted case, the samples should be i.i.d. $x_i \sim p(x)$
In the general (also weighted) case, we want particles that allow to estimate expectations of anything that depends on $x$, e.g. $f(x)$:
$$\langle f(x) \rangle_p = \int_x f(x)\, p(x)\, dx = \lim_{N \to \infty} \sum_{i=1}^N w_i f(x_i)$$
In this view, Monte Carlo methods approximate an integral.
• Motivation: $p(x)$ itself is too complicated to express analytically or compute $\langle f(x) \rangle_p$ directly
• Example: What is the probability that a solitaire would come out successful? (Original story by Stan Ulam.) Instead of trying to analytically compute this, generate many random solitaires and count.
• Naming: The method developed in the 40ies, where computers became faster. Fermi, Ulam and von Neumann initiated the idea. von Neumann called it “Monte Carlo” as a code name.
11:19

Rejection Sampling
• We have a Bayesian Network with RVs $X_{1:n}$, some of which are observed: $X_{obs} = y_{obs}$, $obs \subset \{1 : n\}$
• The goal is to compute marginal posteriors $P(X_i \mid X_{obs} = y_{obs})$ conditioned on the observations.
• We generate a set of $K$ (joint) samples of all variables
$$S = \{x^k_{1:n}\}_{k=1}^K$$
Each sample $x^k_{1:n} = (x^k_1, x^k_2, .., x^k_n)$ is a list of instantiation of all RVs.
11:20

Rejection Sampling
• To generate a single sample $x^k_{1:n}$:
1. Sort all RVs in topological order; start with $i = 1$
2. Sample a value $x^k_i \sim P(X_i \mid x^k_{\text{Parents}(i)})$ for the $i$th RV conditional to the previous samples $x^k_{1:i-1}$
3. If $i \in obs$ compare the sampled value $x^k_i$ with the observation $y_i$. Reject and repeat from 1. if the sample is not equal to the observation.
4. Repeat with $i \leftarrow i + 1$ from 2.
• We compute the marginal probabilities from the sample set $S$:
$$P(X_i = x \mid X_{obs} = y_{obs}) \approx \frac{\text{count}_S(x^k_i = x)}{K}$$
or pair-wise marginals:
$$P(X_i = x, X_j = x' \mid X_{obs} = y_{obs}) \approx \frac{\text{count}_S(x^k_i = x \wedge x^k_j = x')}{K}$$
11:21

Importance sampling (with likelihood weighting)
• Rejecting whole samples may become very inefficient in large Bayes Nets!
• New strategy: We generate a weighted sample set
$$S = \{(x^k_{1:n}, w^k)\}_{k=1}^K$$
where each sample $x^k_{1:n}$ is associated with a weight $w^k$
• In our case, we will choose the weights proportional to the likelihood $P(X_{obs} = y_{obs} \mid X_{1:n} = x^k_{1:n})$ of the observations conditional to the sample $x^k_{1:n}$
11:22

Importance sampling
• To generate a single sample $(w^k, x^k_{1:n})$:
1. Sort all RVs in topological order; start with $i = 1$ and $w^k = 1$
2. a) If $i \notin obs$, sample a value $x^k_i \sim P(X_i \mid x^k_{\text{Parents}(i)})$ for the $i$th RV conditional to the previous samples $x^k_{1:i-1}$
   b) If $i \in obs$, set the value $x^k_i = y_i$ and update the weight according to likelihood
   $$w^k \leftarrow w^k\, P(X_i = y_i \mid x^k_{1:i-1})$$
3. Repeat with $i \leftarrow i + 1$ from 2.
• We compute the marginal probabilities as:
$$P(X_i = x \mid X_{obs} = y_{obs}) \approx \frac{\sum_{k=1}^K w^k\, [x^k_i = x]}{\sum_{k=1}^K w^k}$$
and likewise pair-wise marginals, etc.
Notation: [expr] = 1 if expr is true and zero otherwise
11:23

Gibbs sampling*
• In Gibbs sampling we also generate a sample set $S$ – but in this case the samples are not independent from each other. The next sample “modifies” the previous one:
• First, all observed RVs are clamped to their fixed value $x^k_i = y_i$ for any $k$.
• To generate the $(k+1)$th sample, iterate through the latent variables $i \notin obs$, updating:
$$x^{k+1}_i \sim P(X_i \mid x^k_{1:n \setminus i})$$
$$\sim P(X_i \mid x^k_1, x^k_2, .., x^k_{i-1}, x^k_{i+1}, .., x^k_n)$$
$$\sim P(X_i \mid x^k_{\text{Parents}(i)}) \prod_{j : i \in \text{Parents}(j)} P(X_j = x^k_j \mid X_i, x^k_{\text{Parents}(j) \setminus i})$$
That is, each $x^{k+1}_i$ is resampled conditional to the other (neighboring) current sample values.
11:24

Gibbs sampling*
• As for rejection sampling, Gibbs sampling generates an unweighted sample set $S$ which can directly be used to compute marginals.
In practice, one often discards an initial set of samples (burn-in) to avoid starting biases.
• Gibbs sampling is a special case of MCMC sampling.
Roughly, MCMC means to invent a sampling process, where the next sample may stochastically depend on the previous (Markov property), such that the final sample set is guaranteed to correspond to $P(X_{1:n})$.
→ An Introduction to MCMC for Machine Learning
11:25

Sampling – conclusions
• Sampling algorithms are very simple, very general and very popular
– they equally work for continuous & discrete RVs
– one only needs to ensure/implement the ability to sample from conditional distributions, no further algebraic manipulations
– MCMC theory can reduce required number of samples
• In many cases exact and more efficient approximate inference is possible by actually computing/manipulating whole distributions in the algorithms instead of only samples.
11:26

Variable Elimination
11:27

Variable Elimination example
[Figure: network over X1, .., X6 with the eliminated terms F1, .., F5 and the remaining terms µ1, .., µ5 marked]
$$P(x_5) = \sum_{x_1,x_2,x_3,x_4,x_6} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)\, P(x_4|x_2)\, P(x_5|x_3)\, P(x_6|x_2,x_5)$$
$$= \sum_{x_1,x_2,x_3,x_6} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)\, P(x_5|x_3)\, P(x_6|x_2,x_5) \underbrace{\sum_{x_4} P(x_4|x_2)}_{F_1(x_2,x_4)}$$
$$= \sum_{x_1,x_2,x_3,x_6} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)\, P(x_5|x_3)\, P(x_6|x_2,x_5)\, \mu_1(x_2)$$
$$= \sum_{x_1,x_2,x_3} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)\, P(x_5|x_3)\, \mu_1(x_2) \underbrace{\sum_{x_6} P(x_6|x_2,x_5)}_{F_2(x_2,x_5,x_6)}$$
$$= \sum_{x_1,x_2,x_3} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)\, P(x_5|x_3)\, \mu_1(x_2)\, \mu_2(x_2,x_5)$$
$$= \sum_{x_2,x_3} P(x_5|x_3)\, \mu_1(x_2)\, \mu_2(x_2,x_5) \underbrace{\sum_{x_1} P(x_1)\, P(x_2|x_1)\, P(x_3|x_1)}_{F_3(x_1,x_2,x_3)}$$
$$= \sum_{x_2,x_3} P(x_5|x_3)\, \mu_1(x_2)\, \mu_2(x_2,x_5)\, \mu_3(x_2,x_3)$$
$$= \sum_{x_3} P(x_5|x_3) \underbrace{\sum_{x_2} \mu_1(x_2)\, \mu_2(x_2,x_5)\, \mu_3(x_2,x_3)}_{F_4(x_2,x_3,x_5)}$$
$$= \underbrace{\sum_{x_3} P(x_5|x_3)\, \mu_4(x_3,x_5)}_{F_5(x_3,x_5)}$$
$$= \mu_5(x_5)$$
11:28
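The elimination steps of the example can be checked numerically. The sketch below assumes binary variables and randomly generated CPTs (neither is specified in the lecture), eliminates in the order 4, 6, 1, 2, 3 as in the derivation, and compares the result with brute-force summation of the full joint. The helper `random_cpt` is a hypothetical name.

```python
import itertools
import random

random.seed(0)
VALS = (0, 1)  # assumption: all variables are binary

def random_cpt(n_parents):
    """table[parent_values][child_value] = P(child | parents), each row normalized."""
    table = {}
    for pa in itertools.product(VALS, repeat=n_parents):
        w = [random.random() for _ in VALS]
        z = sum(w)
        table[pa] = [wi / z for wi in w]
    return table

P1 = random_cpt(0)  # P(x1)
P2 = random_cpt(1)  # P(x2 | x1)
P3 = random_cpt(1)  # P(x3 | x1)
P4 = random_cpt(1)  # P(x4 | x2)
P5 = random_cpt(1)  # P(x5 | x3)
P6 = random_cpt(2)  # P(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    return (P1[()][x1] * P2[(x1,)][x2] * P3[(x1,)][x3]
            * P4[(x2,)][x4] * P5[(x3,)][x5] * P6[(x2, x5)][x6])

# Brute force: sum the joint over everything except x5 (2^5 terms per value).
brute = [sum(joint(x1, x2, x3, x4, x5, x6)
             for x1, x2, x3, x4, x6 in itertools.product(VALS, repeat=5))
         for x5 in VALS]

# Variable elimination, one "remaining term" mu per eliminated variable.
mu1 = {x2: sum(P4[(x2,)][x4] for x4 in VALS) for x2 in VALS}
mu2 = {(x2, x5): sum(P6[(x2, x5)][x6] for x6 in VALS)
       for x2 in VALS for x5 in VALS}
mu3 = {(x2, x3): sum(P1[()][x1] * P2[(x1,)][x2] * P3[(x1,)][x3] for x1 in VALS)
       for x2 in VALS for x3 in VALS}
mu4 = {(x3, x5): sum(mu1[x2] * mu2[(x2, x5)] * mu3[(x2, x3)] for x2 in VALS)
       for x3 in VALS for x5 in VALS}
mu5 = [sum(P5[(x3,)][x5] * mu4[(x3, x5)] for x3 in VALS) for x5 in VALS]

print(mu5, brute)
```

Each µ table involves at most three variables, whereas the brute-force sum touches the full joint over six. In this particular network µ1 and µ2 are identically 1, because they sum a CPT over its own child variable.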
Variable Elimination example – lessons learnt
• There is a dynamic programming principle behind Variable Elimination:
– For eliminating $X_{5,4,6}$ we use the solution of eliminating $X_{4,6}$
– The “sub-problems” are represented by the F terms, their solutions by the remaining µ terms
– We’ll continue to discuss this 4 slides later!
• The factorization of the joint
– determines in which order Variable Elimination is efficient
– determines what the terms F(...) and µ(...) depend on
• We can automate Variable Elimination. For the automation, all that matters is the factorization of the joint.
11:29

Variable Elimination Algorithm
• eliminate_single_variable(F, i)
    Input: list F of factors, variable id i
    Output: list F of factors
    find relevant subset $\hat F \subseteq F$ of factors coupled to i: $\hat F = \{k : i \in \partial k\}$
    create new factor $\hat k$ with neighborhood $\partial \hat k$ = all variables in $\hat F$ except i
    compute $\mu_{\hat k}(X_{\partial \hat k}) = \sum_{X_i} \prod_{k \in \hat F} f_k(X_{\partial k})$
    remove old factors $\hat F$ and append new factor $\mu_{\hat k}$ to F
    return F
• elimination_algorithm(µ, F, M)
    Input: list F of factors, tuple M of desired output variables ids
    Output: single factor µ over variables $X_M$
    define all variables present in F: V = vars(F)
    define variables to be eliminated: E = V \ M
    for all i ∈ E: eliminate_single_variable(F, i)
    for all remaining factors, compute the product $\mu = \prod_{f \in F} f$
    return µ
11:32

Factor graphs
• In the previous slides we introduced the box notation to indicate terms that depend on some variables. That’s exactly what factor graphs represent.
Variable Elimination on trees • A Factor graph is a – bipartite graph Y3 1 – where each circle node represents a random variable Xi Y8 – each box node represents a factor fk , which is a function fk (X∂k ) Y1 2 6 X Y4 F1 (Y1,8 , X) F3 (Y3,4,5 , X) Y2 – the joint probability distribution is given as 4 P (X1:n ) = K Y 7 3 5 Y5 F2 (Y2,6,7 , X) Y6 Y7 fk (X∂k ) k=1 The subtrees w.r.t. X can be described as Notation: ∂k is shorthand for Neighbors(k) 11:30 F1 (Y1,8 , X) = f1 (Y8 , Y1 ) f2 (Y1 , X) F2 (Y2,6,7 , X) = f3 (X, Y2 ) f4 (Y2 , Y6 ) f5 (Y2 , Y7 ) Bayes Net → factor graph F3 (Y3,4,5 , X) = f6 (X, Y3 , Y4 ) f7 (Y4 , Y5 ) X4 X2 X1 The joint distribution is: X6 P (Y1:8 , X) = F1 (Y1,8 , X) F2 (Y2,6,7 , X) F3 (Y3,4,5 , X) 11:33 X3 • Bayesian Network: X5 P (x1:6 ) = P (x1 ) P (x2 |x1 ) P (x3 |x1 ) P (x4 |x2 ) P (x5 |x3 ) P (x6 |x2 , x5 ) X2 Variable Elimination on trees X4 Y3 µ2→X Y1 2 X1 • Factor Graph: X6 X3 Y9 1 Y8 F1 (Y1,8 , X) 6 X µ6→X µ3→X 3 X5 8 Y4 7 F3 (Y3,4,5 , X) Y2 Y5 4 P (x1:6 ) = f1 (x1 , x2 ) f2 (x3 , x1 ) f3 (x2 , x4 ) f4 (x3 , x5 ) f5 (x2 , x5 , x6 ) 5 F2 (Y2,6,7 , X) Y6 Y7 → each CPT in the Bayes Net is just a factor (we neglect the We can eliminate each tree independently. The remaining terms special semantics of a CPT) 11:31 (messages) are: P µF1 →X (X) = Y1,8 F1 (Y1,8 , X) 60 Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 µF2 →X (X) = P F2 (Y2,6,7 , X) • Message passing exemplifies how to exploit the factorization µF3 →X (X) = P F3 (Y3,4,5 , X) structure of the joint distribution for the algorithmic implemen- Y2,6,7 Y3,4,5 tation The marginal P (X) is the product of subtree messages P (X) = µF1 →X (X) µF2 →X (X) µF3 →X (X) 11:34 • Note: These are recursive equations. They can be resolved exactly if and only if the dependency structure (factor graph) is Variable Elimination on trees – lessons learnt a tree. If the factor graph had loops, this would be a “loopy recursive equation system”... 
• The “remaining terms” µ’s are called messages 11:37 Intuitively, messages subsume information from a subtree • Marginal = product of messages, P (X) = Q Message passing variants µFk →X , is very k intuitive: – Fusion of independent information from the different subtrees – Fusing independent information ↔ multiplying probability tables • Message passing has many important applications: – Many models are actually trees: In particular chains esp. Hidden Markov Models – Message passing can also be applied on non-trees (↔ loopy graphs) → approximate inference (Loopy Belief Propagation) • Along a (sub-) tree, messages can be computed recursively 11:35 – Bayesian Networks can be “squeezed” to become trees → exact inference in Bayes Nets! (Junction Tree Algorithm) 11:38 Message passing • General equations (belief propagation (BP)) for recursive message computation (writing µk→i (Xi ) instead of µFk →X (X)): • If the graphical model is not a tree (=has loops): – The recursive message equations cannot be resolved. µ ¯ j→k (Xj ) µk→i (Xi ) = X fk (X∂k ) X∂k\i z Y Y j∈∂k\i | Loopy Belief Propagation – However, we could try to just iterate them as update equations... }| { µk0 →j (Xj ) k0 ∈∂j\k {z F (subtree) } • Loopy BP update equations: Q j∈∂k\i : excl. i Q k0 ∈∂j\k (initialize with µk→i = 1) branching at factor k, prod. over adjacent variables j : branching at variable j, prod. over adjacent factors k 0 µnew k→i (Xi ) = X fk (X∂k ) X∂k\i Y Y j∈∂k\i k0 ∈∂j\k µold k0 →j (Xj ) excl. 
k 11:39 µ ¯ j→k (Xj ) are called “variable-to-factor messages”: store them for efficiency Loopy BP remarks Y3 µ1→Y1 Y9 1 Y8 µ2→X Y1 2 F1 (Y1,8 , X) 6 X µ6→X µ3→X 3 4 8 Y4 µ7→Y47 F3 (Y3,4,5 , X) Y2 µ4→Y2 µ8→Y4 µ5→Y2 5 Y5 Example messages: P µ2→X = f2 (Y1 , X) µ1→Y1 (Y1 ) PY1 µ6→X = f6 (Y3 , Y4 , X) µ7→Y4 (Y4 ) PY3 ,Y4 µ3→X = Y f3 (Y2 , X) µ4→Y2 (Y2 ) µ5→Y2 (Y2 ) 2 F2 (Y2,6,7 , X) Y6 Y7 11:36 Message passing remarks • Computing these messages recursively on a tree does nothing else than Variable Elimination Q ⇒ P (Xi ) = k∈∂i µk→i (Xi ) is the correct posterior marginal • However, since it stores all “intermediate terms”, we can compute ANY marginal P (Xi ) for any i • Problem of loops intuitively: loops ⇒ branches of a node to not represent independent information! – BP is multiplying (=fusing) messages from dependent sources of information • No convergence guarantee, but if it converges, then to a state of marginal consistency X X b(X∂k ) = b(X∂k0 ) = b(Xi ) X∂k\i X ∂k0 \i and to the minimum of the Bethe approximation of the free energy (Yedidia, Freeman, & Weiss, 2001) • We shouldn’t be overly disappointed: – if BP was exact on loopy graphs we could efficiently solve NP hard problems... 
  – loopy BP is a very interesting approximation to solving an NP hard problem
  – is hence also applied in context of combinatorial optimization (e.g., SAT problems)
• Ways to tackle the problems with BP convergence:
  – Damping (Heskes, 2004: On the uniqueness of loopy belief propagation fixed points)
  – CCCP (Yuille, 2002: CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation)
  – Tree-reweighted MP (Kolmogorov, 2006: Convergent tree-reweighted message passing for energy minimization)
11:40

Junction Tree Algorithm
• Many models have loops
  Instead of applying loopy BP in the hope of getting a good approximation, it is possible to convert every model into a tree by redefinition of RVs. The Junction Tree Algorithm converts a loopy model into a tree.
• Loops are resolved by defining larger variable groups (separators) on which messages are defined
11:41

Junction Tree Example
• Example:
  [Figure: loopy graph over A, B, C, D]
• Join variable B and C to a single separator
  [Figure: chain A, (B, C), D]
  This can be viewed as a variable substitution: rename the tuple (B, C) as a single random variable
• A single random variable may be part of multiple separators, but only along a running intersection
11:42

Junction Tree Algorithm
• Standard formulation: Moralization & Triangulation
  A clique is a fully connected subset of nodes in a graph.
  1) Generate the factor graph (classically called “moralization”)
  2) Translate each factor to a clique: Generate the undirected graph where undirected edges connect all RVs of a factor
  3) Triangulate the undirected graph
  4) Translate each clique back to a factor; identify the separators between factors
• Formulation in terms of variable elimination:
  1) Start with a factor graph
  2) Choose an order of variable elimination
  3) Keep track of the “remaining µ terms” (slide 14): which RVs would they depend on? → this identifies the separators
11:43

Junction Tree Algorithm Example
[Figure: loopy model over $X_1,\dots,X_6$ (left) and its junction tree with joined variables $X_{2,5}$ and $X_{2,3}$ (right)]
• If we eliminate in order 4, 6, 5, 1, 2, 3, we get remaining terms
  $(X_2),\ (X_2, X_5),\ (X_2, X_3),\ (X_2, X_3),\ (X_3)$
  which translates to the Junction Tree on the right
11:44

Maximum a-posteriori (MAP) inference
• Often we want to compute the most likely global assignment
$$X^{\text{MAP}}_{1:n} = \operatorname*{argmax}_{X_{1:n}} P(X_{1:n})$$
  of all random variables. This is called MAP inference and can be solved by replacing all $\sum$ by $\max$ in the message passing equations. The algorithm is called Max-Product Algorithm and is a generalization of Dynamic Programming methods like Viterbi or Dijkstra.
• Application: Conditional Random Fields
$$f(y,x) = \phi(y,x)^\top\beta = \sum_{j=1}^k \phi_j(y_{\partial j},x)\,\beta_j = \log\Big[\prod_{j=1}^k e^{\phi_j(y_{\partial j},x)\,\beta_j}\Big]$$
  with prediction $x \mapsto y^*(x) = \operatorname*{argmax}_y f(y,x)$
  Finding the argmax is a MAP inference problem! This is frequently needed in the inner loop of CRF learning algorithms.
11:45

Conditional Random Fields
• The following are interchangeable: “Random Field” ↔ “Markov Random Field” ↔ Factor Graph
• Therefore, a CRF is a conditional factor graph:
  – A CRF defines a mapping from input $x$ to a factor graph over $y$
  – Each feature $\phi_j(y_{\partial j}, x)$ depends only on a subset $\partial j$ of variables $y_{\partial j}$
  – If $y_{\partial j}$ are discrete, a feature $\phi_j(y_{\partial j}, x)$ is usually an indicator feature (see lecture 03); the corresponding parameter $\beta_j$ is then one entry of a factor $f_k(y_{\partial j})$ that couples these variables
11:46

What we didn’t cover
• A very promising line of research is solving inference problems using mathematical programming. This unifies research in the areas of optimization, mathematical programming and probabilistic inference.
  Linear Programming relaxations of MAP inference and CCCP methods are great examples.
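As a small illustration of the MAP inference slide: on a chain-structured model, replacing the sums in the messages by max gives a Viterbi-style recursion. The sketch below is a minimal NumPy version under assumed names (`map_chain`, `brute_force`, `unary`, `pairwise` are illustrative and not from the lecture); it compares the result with brute-force enumeration on a tiny random model.

```python
import itertools
import numpy as np

def map_chain(unary, pairwise):
    """Max-product on a chain X_0 - X_1 - ... - X_{n-1}.

    unary:    list of n arrays, unary[i][x] = f_i(x), assumed non-negative
    pairwise: list of n-1 arrays, pairwise[i][x, y] = f_{i,i+1}(x, y)
    Returns the maximizing assignment and its unnormalized score.
    """
    n = len(unary)
    msg = unary[0].astype(float)      # max-message arriving at X_0
    back = []                         # argmax pointers for backtracking
    for i in range(1, n):
        # scores[x, y] = msg(x) * f(x, y); maximize over the previous variable x
        scores = msg[:, None] * pairwise[i - 1]
        back.append(scores.argmax(axis=0))
        msg = scores.max(axis=0) * unary[i]
    x = [int(msg.argmax())]
    for ptr in reversed(back):        # follow the pointers from the last variable
        x.append(int(ptr[x[-1]]))
    return x[::-1], float(msg.max())

def brute_force(unary, pairwise):
    """Enumerate all assignments; only feasible for tiny models."""
    best, best_x = -1.0, None
    for x in itertools.product(*[range(len(u)) for u in unary]):
        s = np.prod([unary[i][x[i]] for i in range(len(x))])
        s *= np.prod([pairwise[i][x[i], x[i + 1]] for i in range(len(x) - 1)])
        if s > best:
            best, best_x = s, list(x)
    return best_x, float(best)

rng = np.random.default_rng(0)
unary = [rng.random(3) for _ in range(5)]
pairwise = [rng.random((3, 3)) for _ in range(4)]
x_mp, score_mp = map_chain(unary, pairwise)
x_bf, score_bf = brute_force(unary, pairwise)
```

The message passing version costs a number of operations linear in the chain length, whereas the enumeration grows exponentially; on a general tree the same idea applies with messages sent from the leaves to a root.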
11:47

14 Dynamic Models

Motivation
  – Robotics slides
  – Speech recognition
  – Music
12:1

Markov processes (Markov chains)
Construct a Bayes net from these variables: parents?
Markov assumption: $X_t$ depends on bounded subset of $X_{0:t-1}$
First-order Markov process: $P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-1})$
Second-order Markov process: $P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-2}, X_{t-1})$
Sensor Markov assumption: $P(Y_t \mid X_{0:t}, Y_{0:t-1}) = P(Y_t \mid X_t)$
Stationary process: transition model $P(X_t \mid X_{t-1})$ and sensor model $P(Y_t \mid X_t)$ fixed for all $t$
12:2

Different inference problems in Markov Models
• $P(x_t \mid y_{0:T})$ marginal posterior
• $P(x_t \mid y_{0:t})$ filtering
• $P(x_t \mid y_{0:a}),\ t > a$ prediction
• $P(x_t \mid y_{0:b}),\ t < b$ smoothing
• $P(y_{0:T})$ likelihood calculation
• Viterbi alignment: Find sequence $x^*_{0:T}$ that maximizes $P(x_{0:T} \mid y_{0:T})$
  (This is done using max-product, instead of sum-product message passing.)
12:3

Hidden Markov Models as Graphical Model
• We assume we have
  – a discrete latent variable $X_t$ in each time slice
  – observed (discrete or continuous) variables $Y_t$ in each time slice
  – some observation model $P(Y_t \mid X_t; \theta)$
  – some transition model $P(X_t \mid X_{t-1}; \theta)$
• A Hidden Markov Model (HMM) is defined as the joint distribution
$$P(X_{0:T}, Y_{0:T}) = P(X_0) \cdot \prod_{t=1}^T P(X_t \mid X_{t-1}) \cdot \prod_{t=0}^T P(Y_t \mid X_t)\ .$$
[Figure: chain $X_0, X_1, X_2, X_3, \dots, X_T$ with observations $Y_0, Y_1, Y_2, Y_3, \dots, Y_T$]
12:4

Inference in an HMM – a tree!
[Figure: HMM chain split at $X_2$ into $F_{\text{past}}(X_{0:2}, Y_{0:1})$, $F_{\text{now}}(X_2, Y_2)$, $F_{\text{future}}(X_{2:T}, Y_{3:T})$]
• The marginal posterior $P(X_t \mid Y_{0:T})$ is the product of three messages
$$P(X_t \mid Y_{0:T}) \propto P(X_t, Y_{0:T}) = \underbrace{\mu_{\text{past}}(X_t)}_{\alpha}\ \underbrace{\mu_{\text{now}}(X_t)}_{\varrho}\ \underbrace{\mu_{\text{future}}(X_t)}_{\beta}$$
• For all $a < t$ and $b > t$
  – $X_a$ conditionally independent from $X_b$ given $X_t$
  – $Y_a$ conditionally independent from $Y_b$ given $X_t$
  “The future is independent of the past given the present” (Markov property)
  (conditioning on $Y_t$ does not yield any conditional independences)
12:5

Inference in HMMs
[Figure: HMM chain split at $X_2$ into $F_{\text{past}}(X_{0:2}, Y_{0:1})$, $F_{\text{now}}(X_2, Y_2)$, $F_{\text{future}}(X_{2:T}, Y_{3:T})$]
Applying the general message passing equations:
forward msg.
$$\mu_{X_{t-1}\to X_t}(x_t) =: \alpha_t(x_t) = \sum_{x_{t-1}} P(x_t \mid x_{t-1})\ \alpha_{t-1}(x_{t-1})\ \varrho_{t-1}(x_{t-1})\ ,\qquad \alpha_0(x_0) = P(x_0)$$
backward msg.
$$\mu_{X_{t+1}\to X_t}(x_t) =: \beta_t(x_t) = \sum_{x_{t+1}} P(x_{t+1} \mid x_t)\ \beta_{t+1}(x_{t+1})\ \varrho_{t+1}(x_{t+1})\ ,\qquad \beta_T(x_T) = 1$$
observation msg.
$$\mu_{Y_t\to X_t}(x_t) =: \varrho_t(x_t) = P(y_t \mid x_t)$$
posterior marginal
$$q(x_t) \propto \alpha_t(x_t)\ \varrho_t(x_t)\ \beta_t(x_t)$$
posterior marginal
$$q(x_t, x_{t+1}) \propto \alpha_t(x_t)\ \varrho_t(x_t)\ P(x_{t+1} \mid x_t)\ \varrho_{t+1}(x_{t+1})\ \beta_{t+1}(x_{t+1})$$
12:6

Inference in HMMs – implementation notes
• The message passing equations can be implemented by reinterpreting them as matrix equations: Let $\alpha_t, \beta_t, \varrho_t$ be the vectors corresponding to the probability tables $\alpha_t(x_t), \beta_t(x_t), \varrho_t(x_t)$; and let $P$ be the matrix with entries $P(x_t \mid x_{t-1})$. Then

  $\alpha_0 = \pi,\ \beta_T = 1$
  for $t = 1:T$ :  $\alpha_t = P\,(\alpha_{t-1} \cdot \varrho_{t-1})$
  for $t = T{-}1:0$ :  $\beta_t = P^\top (\beta_{t+1} \cdot \varrho_{t+1})$
  for $t = 0:T$ :  $q_t = \alpha_t \cdot \varrho_t \cdot \beta_t$
  for $t = 0:T{-}1$ :  $Q_t = P \cdot [(\beta_{t+1} \cdot \varrho_{t+1})\,(\alpha_t \cdot \varrho_t)^\top]$

  where $\cdot$ is the element-wise product! Here, $q_t$ is the vector with entries $q(x_t)$, and $Q_t$ the matrix with entries $q(x_{t+1}, x_t)$. Note that the equation for $Q_t$ describes $Q_t(x', x) = P(x' \mid x)\,[(\beta_{t+1}(x')\,\varrho_{t+1}(x'))\,(\alpha_t(x)\,\varrho_t(x))]$.
12:7

Kalman Filter example
• filtering of a position $(x, y) \in \mathbb{R}^2$:
  [Figure: filtered position estimate]
12:10

Inference in HMMs: classical derivation
Given our knowledge of Belief propagation, inference in HMMs is simple.
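The matrix form of the forward and backward messages from the implementation notes can be sketched in NumPy as follows. This is a minimal illustration for a discrete HMM, not the lecture's code: `forward_backward`, `brute_force_marginals`, `pi`, `P`, and `rho` are assumed names, with `P[i, j]` storing $P(x_t{=}i \mid x_{t-1}{=}j)$ and `rho[t]` the observation message $\varrho_t$. Messages are left unnormalized to mirror the equations; a practical implementation would rescale them at every step to avoid underflow.

```python
import itertools
import numpy as np

def forward_backward(pi, P, rho):
    """pi: (n,) prior P(x_0); P: (n, n) with P[i, j] = P(x_t=i | x_{t-1}=j);
    rho: (T+1, n) with rho[t, i] = P(y_t | x_t=i) for the observed y_t."""
    T1, n = rho.shape
    alpha = np.zeros((T1, n))
    beta = np.ones((T1, n))                    # beta_T = 1
    alpha[0] = pi                              # alpha_0 = pi
    for t in range(1, T1):                     # forward messages
        alpha[t] = P @ (alpha[t - 1] * rho[t - 1])
    for t in range(T1 - 2, -1, -1):            # backward messages
        beta[t] = P.T @ (beta[t + 1] * rho[t + 1])
    q = alpha * rho * beta                     # unnormalized posterior marginals
    return q / q.sum(axis=1, keepdims=True)

def brute_force_marginals(pi, P, rho):
    """Sum the joint over all state sequences; only feasible for tiny models."""
    T1, n = rho.shape
    q = np.zeros((T1, n))
    for xs in itertools.product(range(n), repeat=T1):
        p = pi[xs[0]] * rho[0, xs[0]]
        for t in range(1, T1):
            p *= P[xs[t], xs[t - 1]] * rho[t, xs[t]]
        for t, x in enumerate(xs):
            q[t, x] += p
    return q / q.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, T1 = 3, 5
pi = rng.random(n)
pi /= pi.sum()
P = rng.random((n, n))
P /= P.sum(axis=0, keepdims=True)              # each column is a distribution over x_t
rho = rng.random((T1, n))
q_fb = forward_backward(pi, P, rho)
q_bf = brute_force_marginals(pi, P, rho)
```

On this small model both functions return the same smoothed marginals $P(x_t \mid y_{0:T})$, while the message passing version needs only two sweeps over the chain.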
For reference, here is a more classical derivation:
$$\begin{aligned}
P(x_t \mid y_{0:T}) &= \frac{P(y_{0:T} \mid x_t)\ P(x_t)}{P(y_{0:T})}
 = \frac{P(y_{0:t} \mid x_t)\ P(y_{t+1:T} \mid x_t)\ P(x_t)}{P(y_{0:T})} \\
 &= \frac{P(y_{0:t}, x_t)\ P(y_{t+1:T} \mid x_t)}{P(y_{0:T})}
 = \frac{\alpha_t(x_t)\ \beta_t(x_t)}{P(y_{0:T})}
\end{aligned}$$
$$\alpha_t(x_t) := P(y_{0:t}, x_t) = P(y_t \mid x_t)\ P(y_{0:t-1}, x_t) = P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\ \alpha_{t-1}(x_{t-1})$$
$$\beta_t(x_t) := P(y_{t+1:T} \mid x_t) = \sum_{x_{t+1}} P(y_{t+1:T} \mid x_{t+1})\ P(x_{t+1} \mid x_t) = \sum_{x_{t+1}} \Big[\beta_{t+1}(x_{t+1})\ P(y_{t+1} \mid x_{t+1})\Big]\ P(x_{t+1} \mid x_t)$$
Note: $\alpha_t$ here is the same as $\alpha_t \cdot \varrho_t$ on all other slides!
12:8

HMM remarks
• The computation of forward and backward messages along the Markov chain is also called forward-backward algorithm
• The EM algorithm to learn the HMM parameters is also called Baum-Welch algorithm
• If the latent variable $x_t$ is continuous $x_t \in \mathbb{R}^d$ instead of discrete, then such a Markov model is also called state space model.
• If the continuous transitions and observations are linear Gaussian
$$P(x_{t+1} \mid x_t) = \mathcal{N}(x_{t+1} \mid A x_t + a,\ Q)\ ,\qquad P(y_t \mid x_t) = \mathcal{N}(y_t \mid C x_t + c,\ W)$$
  then the forward and backward messages $\alpha_t$ and $\beta_t$ are also Gaussian.
  → forward filtering is also called Kalman filtering
  → smoothing is also called Kalman smoothing
• Sometimes, computing forward and backward messages (in discrete or continuous context) is also called Bayesian filtering/smoothing
12:9

Kalman Filter example
• smoothing of a position $(x, y) \in \mathbb{R}^2$:
  [Figure: smoothed position estimate]
12:11

HMM example: Learning Bach
• A machine “listens” (reads notes of) Bach pieces over and over again
  → It’s supposed to learn how to write Bach pieces itself (or at least harmonize them).
• Harmonizing Chorales in the Style of J S Bach, Moray Allan & Chris Williams (NIPS 2004)
• use an HMM
  – observed sequence $Y_{0:T}$: Soprano melody
  – latent sequence $X_{0:T}$: chord & harmony
12:12

HMM example: Learning Bach
• results: http://www.anc.inf.ed.ac.uk/demos/hmmbach/
• See also work by Gerhard Widmer http://www.cp.jku.at/people/widmer/
12:13

Dynamic Bayesian Networks
  – Arbitrary BNs in each time slice
  – Special case: MDPs, speech, etc
12:14

15 Reinforcement Learning

Long history of RL in AI
Idea of programming a computer to learn by trial and error (Turing, 1954)
SNARCs (Stochastic Neural-Analog Reinforcement Calculators) (Minsky, 54)
Checkers playing program (Samuel, 59)
Lots of RL in the 60s (e.g., Waltz & Fu 65; Mendel 66; Fu 70)
MENACE (Matchbox Educable Naughts and Crosses Engine) (Michie, 63)
RL based Tic Tac Toe learner (GLEE) (Michie 68)
Classifier Systems (Holland, 75)
Adaptive Critics (Barto & Sutton, 81)
Temporal Differences (Sutton, 88)
from Satinder Singh’s Introduction to RL, videolectures.com
• Long history in Psychology
13:1

Outline
• Markov Decision Processes as formal model
  – Definition
  – Value/Q-function
  – Planning as computing V/Q given a model
• Learning
  – Temporal Difference & Q-learning
  – Limitations of the model-free view
  – Model-based RL
• Exploration
• Briefly
  – Imitation Learning & Inverse RL
  – Continuous states and actions (LSPI, Policy Gradients)
13:2

Markov Decision Process
[Figure: graphical model with actions $a_0, a_1, a_2$, states $s_0, s_1, s_2$, rewards $r_0, r_1, r_2$]
$$P(s_{0:T+1}, a_{0:T}, r_{0:T}; \pi) = P(s_0) \prod_{t=0}^T P(a_t \mid s_t; \pi)\ P(r_t \mid s_t, a_t)\ P(s_{t+1} \mid s_t, a_t)$$
  – world’s initial state distribution $P(s_0)$
  – world’s transition probabilities $P(s_{t+1} \mid s_t, a_t)$
  – world’s reward probabilities $P(r_t \mid s_t, a_t)$
  – agent’s policy $\pi(a_t \mid s_t) = P(a_0 \mid s_0; \pi)$ (or deterministic $a_t = \pi(s_t)$)
• Stationary MDP:
  – We assume $P(s' \mid s, a)$ and $P(r \mid s, a)$ independent of time
  – We also define $R(s, a) := E\{r \mid s, a\} = \int r\, P(r \mid s, a)\, dr$
13:3

State value function
• The value (expected discounted return) of policy $\pi$ when started in state $s$:
$$V^\pi(s) = E_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\}$$
  discounting factor $\gamma \in [0, 1]$
• Definition of optimality: behavior $\pi^*$ is optimal iff
$$\forall s:\ V^{\pi^*}(s) = V^*(s) \quad\text{where}\quad V^*(s) = \max_\pi V^\pi(s)$$
  (simultaneously maximising the value in all states)
  (In MDPs there always exists (at least one) optimal deterministic policy.)
13:4

An example for a value function...
[Figure: value function example]
demo: test/mdp runVI
Values provide a gradient towards desirable states
13:5

Value function
• The value function $V$ is a central concept in all of RL!
  Many algorithms can directly be derived from properties of the value function.
• In other domains (stochastic optimal control) it is also called cost-to-go function (cost = −reward)
13:6

Recursive property of the value function
$$\begin{aligned}
V^\pi(s) &= E\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s; \pi\} \\
 &= E\{r_0 \mid s_0 = s; \pi\} + \gamma E\{r_1 + \gamma r_2 + \cdots \mid s_0 = s; \pi\} \\
 &= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\ E\{r_1 + \gamma r_2 + \cdots \mid s_1 = s'; \pi\} \\
 &= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\ V^\pi(s')
\end{aligned}$$
• We can write this in vector notation $V^\pi = R^\pi + \gamma P^\pi V^\pi$
  with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(s, \pi(s))$ and matrix $P^\pi_{ss'} = P(s' \mid s, \pi(s))$
• For stochastic $\pi(a \mid s)$:
$$V^\pi(s) = \sum_a \pi(a \mid s) R(s, a) + \gamma \sum_{s', a} \pi(a \mid s) P(s' \mid s, a)\ V^\pi(s')$$
13:7

Bellman optimality equation
• Recall the recursive property of the value function
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\ V^\pi(s')$$
• Bellman optimality equation
$$V^*(s) = \max_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V^*(s')\Big]$$
  with $\pi^*(s) = \operatorname*{argmax}_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V^*(s')\Big]$
  (Sketch of proof: If $\pi$ would select another action than $\operatorname{argmax}_a[\cdot]$, then $\pi'$ which $= \pi$ everywhere except $\pi'(s) = \operatorname{argmax}_a[\cdot]$ would be better.)
• This is the principle of optimality in the stochastic case
13:8

Bellman’s principle of optimality
Richard E. Bellman (1920–1984)
[Figure: path through A and B; A opt ⇒ B opt]
$$V^*(s) = \max_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V^*(s')\Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V^*(s')\Big]$$
13:9

Value Iteration
• How can we use this to compute $V^*$?
• Recall the Bellman optimality equation:
$$V^*(s) = \max_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V^*(s')\Big]$$
• Value Iteration: (initialize $V_{k=0}(s) = 0$)
$$\forall s:\ V_{k+1}(s) = \max_a \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ V_k(s')\Big]$$
  stopping criterion: $\max_s |V_{k+1}(s) - V_k(s)| \le \epsilon$
• Note that $V^*$ is a fixed point of value iteration!
• Value Iteration converges to the optimal value function $V^*$ (proof below)
demo: test/mdp runVI
13:10

State-action value function (Q-function)
• We repeat the last couple of slides for the Q-function...
• The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:
$$\begin{aligned}
Q^\pi(s, a) &= E_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\} \\
 &= R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ Q^\pi(s', \pi(s'))
\end{aligned}$$
  (Note: $V^\pi(s) = Q^\pi(s, \pi(s))$.)
• Bellman optimality equation for the Q-function
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q^*(s', a')$$
  with $\pi^*(s) = \operatorname*{argmax}_a Q^*(s, a)$
13:11

Q-Iteration
• Recall the Bellman equation:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q^*(s', a')$$
• Q-Iteration: (initialize $Q_{k=0}(s, a) = 0$)
$$\forall s, a:\ Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q_k(s', a')$$
  stopping criterion: $\max_{s,a} |Q_{k+1}(s, a) - Q_k(s, a)| \le \epsilon$
• Note that $Q^*$ is a fixed point of Q-Iteration!
• Q-Iteration converges to the optimal state-action value function $Q^*$
13:12

Proof of convergence
• Let $\Delta_k = \|Q^* - Q_k\|_\infty = \max_{s,a} |Q^*(s, a) - Q_k(s, a)|$
$$\begin{aligned}
Q_{k+1}(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q_k(s', a') \\
 &\le R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} \Big[Q^*(s', a') + \Delta_k\Big] \\
 &= \Big[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q^*(s', a')\Big] + \gamma \Delta_k \\
 &= Q^*(s, a) + \gamma \Delta_k
\end{aligned}$$
  similarly: $Q_k \ge Q^* - \Delta_k\ \Rightarrow\ Q_{k+1} \ge Q^* - \gamma \Delta_k$
  Together this gives $\Delta_{k+1} \le \gamma \Delta_k$, so $\Delta_k \to 0$ for $\gamma < 1$.
• The proof translates directly also to value iteration
13:13

For completeness*
• Policy Evaluation computes $V^\pi$ instead of $V^*$: Iterate:
$$\forall s:\ V^\pi_{k+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\ V^\pi_k(s')$$
  Or use matrix inversion $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$, which is $O(|S|^3)$.
• Policy Iteration uses $V^\pi$ to incrementally improve the policy:
  1. Initialise $\pi_0$ somehow (e.g. randomly)
  2. Iterate:
     – Policy Evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$
     – Policy Update: $\pi_{k+1}(s) \leftarrow \operatorname*{argmax}_a Q^{\pi_k}(s, a)$
demo: test/mdp runPI
13:14

Towards Learning
• From Sutton & Barto’s Reinforcement Learning book:
  The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. DP provides an essential foundation for the understanding of the methods presented in the rest of this book. In fact, all of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.
• So far, we introduced basic notions of an MDP and value functions and methods to compute optimal policies assuming that we know the world (know $P(s' \mid s, a)$ and $R(s, a)$)
  Value Iteration and Q-Iteration are instances of Dynamic Programming
• Reinforcement Learning?

Learning in MDPs
• While interacting with the world, the agent collects data of the form
  $D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^T$
  (state, action, immediate reward, next state)
  What could we learn from that?
• Model-based RL:
  learn to predict next state: estimate $P(s' \mid s, a)$
  learn to predict immediate reward: estimate $P(r \mid s, a)$
• Model-free RL:
  learn to predict value: estimate $V(s)$ or $Q(s, a)$
• Policy search:
  e.g., estimate the “policy gradient”, or directly use black box (e.g. evolutionary) search
13:17

Let’s introduce basic model-free methods first.
[Diagram: $D = \{(s, a, r, s')_t\}_{t=0}^T$ → (learn) → $V(s)$ → $\pi(s)$]
13:18

Temporal difference (TD) learning with V
• Recall the recursive property of $V(s)$:
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\ V^\pi(s')$$
• TD learning: Given a new experience $(s, a, r, s')$
$$\begin{aligned}
V_{\text{new}}(s) &= (1 - \alpha)\ V_{\text{old}}(s) + \alpha\ [r + \gamma V_{\text{old}}(s')] \\
 &= V_{\text{old}}(s) + \alpha\ [r + \gamma V_{\text{old}}(s') - V_{\text{old}}(s)]\ .
\end{aligned}$$
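The TD update for $V$ can be sketched in a few lines. The example below is a minimal illustration, assuming a small random MDP under a fixed policy; the names `P_pi` and `R_pi`, the decaying step size, and the number of steps are illustrative choices and not from the lecture. It compares the TD estimate, which only uses sampled experiences, with the exact solution $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$ from the policy evaluation slide, which needs the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Hypothetical policy-induced model:
# P_pi[s, s2] = P(s2 | s, pi(s)),  R_pi[s] = R(s, pi(s))
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(n_states)

# Exact policy evaluation by solving (I - gamma P) V = R (requires the model)
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# TD learning: estimate V from sampled experiences (s, r, s2) only
V = np.zeros(n_states)
s = 0
for k in range(50_000):
    alpha = 0.5 / (1.0 + k / 1000.0)            # slowly decaying step size
    s2 = rng.choice(n_states, p=P_pi[s])        # sample next state from the world
    r = R_pi[s]                                 # deterministic reward, for simplicity
    V[s] += alpha * (r + gamma * V[s2] - V[s])  # V_new = V_old + alpha * TD error
    s = s2

max_error = np.max(np.abs(V - V_exact))
```

With a constant step size the estimate would keep fluctuating around $V^\pi$; the decaying schedule used here is one common way to damp that noise.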
• Reinforcement:
  – more reward than expected ($r > V_{\text{old}}(s) - \gamma V_{\text{old}}(s')$) → increase $V(s)$
  – less reward than expected ($r < V_{\text{old}}(s) - \gamma V_{\text{old}}(s')$) → decrease $V(s)$
13:19

Temporal difference (TD) learning with Q
• Recall the recursive property of $Q(s, a)$:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ Q^\pi(s', \pi(s'))$$
• TD learning: Given a new experience $(s, a, r, s', a' = \pi(s'))$
$$\begin{aligned}
Q_{\text{new}}(s, a) &= (1 - \alpha)\ Q_{\text{old}}(s, a) + \alpha\ [r + \gamma Q_{\text{old}}(s', a')] \\
 &= Q_{\text{old}}(s, a) + \alpha\ [r + \gamma Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]
\end{aligned}$$
• Reinforcement:
  – more reward than expected ($r > Q_{\text{old}}(s, a) - \gamma Q_{\text{old}}(s', a')$) → increase $Q(s, a)$
  – less reward than expected ($r < Q_{\text{old}}(s, a) - \gamma Q_{\text{old}}(s', a')$) → decrease $Q(s, a)$
13:20

Sarsa = on-policy TD learning with Q
• On-policy: We estimate $Q^\pi$ while executing $\pi$
• Sarsa:

  Initialize $Q(s, a) = 0$
  repeat // for each episode
      Initialize start state $s$
      Choose action $a \approx_\epsilon \operatorname{argmax}_a Q(s, a)$
      repeat // for each step of episode
          Take action $a$, observe $r, s'$
          Choose action $a' \approx_\epsilon \operatorname{argmax}_{a'} Q(s', a')$
          $Q(s, a) \leftarrow Q(s, a) + \alpha\ [r + \gamma Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]$
          $s \leftarrow s',\ a \leftarrow a'$
      until end of episode
  until happy

• $\epsilon$-greedy action selection:
$$a \approx_\epsilon \operatorname*{argmax}_a Q(s, a) \iff a = \begin{cases} \text{random} & \text{with prob. } \epsilon \\ \operatorname{argmax}_a Q(s, a) & \text{else} \end{cases}$$
13:21

Q-learning
• Recall the Bellman optimality equation for the Q-function:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\ \max_{a'} Q^*(s', a')$$
• Q-learning (Watkins, 1988): Given a new experience $(s, a, r, s')$
$$\begin{aligned}
Q_{\text{new}}(s, a) &= (1 - \alpha)\ Q_{\text{old}}(s, a) + \alpha\ [r + \gamma \max_{a'} Q_{\text{old}}(s', a')] \\
 &= Q_{\text{old}}(s, a) + \alpha\ [r - Q_{\text{old}}(s, a) + \gamma \max_{a'} Q_{\text{old}}(s', a')]
\end{aligned}$$
• Reinforcement:
  – more reward than expected ($r > Q_{\text{old}}(s, a) - \gamma \max_{a'} Q_{\text{old}}(s', a')$) → increase $Q(s, a)$
  – less reward than expected ($r < Q_{\text{old}}(s, a) - \gamma \max_{a'} Q_{\text{old}}(s', a')$) → decrease $Q(s, a)$
13:22

Q-learning = off-policy TD learning with Q∗
• Off-policy: We estimate $Q^*$ while executing $\pi$
• Q-learning:

  Initialize $Q(s, a) = 0$
  repeat // for each episode
      Initialize start state $s$
      repeat // for each step of episode
          Choose action $a \approx_\epsilon \operatorname{argmax}_a Q(s, a)$
          Take action $a$, observe $r, s'$
          $Q(s, a) \leftarrow Q(s, a) + \alpha\ [r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]$
          $s \leftarrow s'$
      until end of episode
  until happy
13:23

Q-learning convergence with prob 1
• Q-learning is a stochastic approximation of Q-Iteration:
  Q-learning: $Q_{\text{new}}(s, a) = (1 - \alpha) Q_{\text{old}}(s, a) + \alpha [r + \gamma \max_{a'} Q_{\text{old}}(s', a')]$
  Q-Iteration: $\forall s, a:\ Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q_k(s', a')$
  We’ve shown convergence of Q-Iteration to $Q^*$
• Convergence of Q-learning:
  Q-Iteration is a deterministic update: $Q_{k+1} = T(Q_k)$
  Q-learning is a stochastic version: $Q_{k+1} = (1 - \alpha) Q_k + \alpha [T(Q_k) + \eta_k]$
  $\eta_k$ is zero mean!
13:24

Q-learning impact
• Q-Learning was the first provably convergent direct adaptive optimal control algorithm
• Great impact on the field of Reinforcement Learning in 80/90ies
  – “Smaller representation than models”
  – “Automatically focuses attention to where it is needed,” i.e., no sweeps through state space
  – Can be made more efficient with eligibility traces
13:25

Eligibility traces
• Temporal Difference: based on single experience $(s_0, r_0, s_1)$
$$V_{\text{new}}(s_0) = V_{\text{old}}(s_0) + \alpha [r_0 + \gamma V_{\text{old}}(s_1) - V_{\text{old}}(s_0)]$$
• Longer experience sequence, e.g.: $(s_0, r_0, r_1, r_2, s_3)$
  Temporal credit assignment, think further backwards: receiving $r_3$ also tells us something about $V(s_0)$
$$V_{\text{new}}(s_0) = V_{\text{old}}(s_0) + \alpha [r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{\text{old}}(s_3) - V_{\text{old}}(s_0)]$$
• TD(λ): remember where you’ve been recently (“eligibility trace”) and update those values as well:
$$e(s_t) \leftarrow e(s_t) + 1$$
$$\forall s:\ V_{\text{new}}(s) = V_{\text{old}}(s) + \alpha\, e(s)\, [r_t + \gamma V_{\text{old}}(s_{t+1}) - V_{\text{old}}(s_t)]$$
$$\forall s:\ e(s) \leftarrow \gamma \lambda\, e(s)$$
• Core topic of Sutton & Barto book
  → great improvement of basic RL algorithms
13:26

TD(λ), Sarsa(λ), Q(λ)
• TD(λ):
  $\forall s:\ V(s) \leftarrow V(s) + \alpha\, e(s)\, [r_t + \gamma V_{\text{old}}(s_{t+1}) - V_{\text{old}}(s_t)]$
• Sarsa(λ)
  $\forall s, a:\ Q(s, a) \leftarrow Q(s, a) + \alpha\, e(s, a)\, [r + \gamma Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]$
• Q(λ)
  $\forall s, a:\ Q(s, a) \leftarrow Q(s, a) + \alpha\, e(s, a)\, [r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]$
13:27

TD-Gammon, by Gerald Tesauro
(See section 11.1 in Sutton & Barto’s book.)
• MLP to represent the value function $V(s)$
• Only reward given at end of game for win.
• Self-play: use the current policy to sample moves on both sides!
• random policies → games take up to thousands of steps. Skilled players ∼ 50−60 steps.
• TD(λ) learning (gradient-based update of NN weights)
13:28

TD-Gammon notes
• Choose features as raw position inputs (number of pieces at each place)
  → as good as previous computer programs
• Using previous computer program’s expert features
  → world-class player
• Kit Woolsey was world-class player back then:
  – TD-Gammon particularly good on vague positions
  – not so good on calculable/special positions
  – just the opposite to (old) chess programs
• See annotated matches: http://www.bkgm.com/matches/woba.html
• Good example for
  – value function approximation
  – game theory, self-play
13:29

Detour: Dopamine
[Figure: dopamine neuron recordings]
Montague, Dayan & Sejnowski: A Framework for Mesencephalic Dopamine Systems based on Predictive Hebbian Learning. Journal of Neuroscience, 16:1936-1947, 1996.
13:30

So what does that mean?
  – We derived an algorithm from a general framework
  – This algorithm involves a specific variable (reward residual)
  – We find a neural correlate of exactly this variable
Great!
Devil’s advocate:
  – Does not prove that TD learning is going on
    Only that an expected reward is compared with an experienced reward
  – Does not discriminate between model-based and model-free
    (Both can induce an expected reward)
13:31

Limitations of the model-free view
• Given learnt values, behavior is a fixed SR (or state-action) mapping
• If the “goal” changes: need to re-learn values for every state in the world! all previous values are obsolete
• No general “knowledge”, only values
• No anticipation of general outcomes ($s'$), only of value
• No “planning”
13:32

  By definition, goal-directed behavior is performed to obtain a desired goal. Although all instrumental behavior is instrumental in achieving its contingent goals, it is not necessarily purposively goal-directed. Dickinson and Balleine [1,11] proposed that behavior is goal-directed if: (i) it is sensitive to the contingency between action and outcome, and (ii) the outcome is desired. Based on the second condition, motivational manipulations have been used to distinguish between two systems of action control: if an instrumental outcome is no longer a valued goal (for instance, food for a sated animal) and the behavior persists, it must not be goal-directed. Indeed, after moderate amounts of training, outcome revaluation brings about an appropriate change in instrumental actions (e.g. lever-pressing) [43,44], but this is no longer the case for extensively trained responses ([30,31], but see [45]). That extensive training can render an instrumental action independent of the value of its consequent outcome has been regarded as the experimental parallel of the folk psychology maxim that well-performed actions become habitual [9] (see Figure I).
Niv, Joel & Dayan: A normative perspective on motivation. TICS, 10:375-381, 2006.
13:33

Model-based RL
[Diagram: $D = \{(s, a, r, s')_t\}_{t=0}^T$ → (learn) → $P(s' \mid s, a)$ → (DP) → $V(s)$ → $\pi(s)$]
• Model learning: Given data $D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^T$ estimate $P(s' \mid s, a)$ and $R(s, a)$
  For instance:
  – discrete state-action: $\hat P(s' \mid s, a) = \frac{\#(s', s, a)}{\#(s, a)}$
  – continuous state-action: $\hat P(s' \mid s, a) = \mathcal{N}(s' \mid \phi(s, a)^\top \beta,\ \Sigma)$
    estimate parameters $\beta$ (and perhaps $\Sigma$) as for regression (including non-linear features, regularization, cross-validation!)
• Planning, for instance:
  – discrete state-action: Value Iteration with the estimated model
  – continuous state-action: Least Squares Value Iteration
    Stochastic Optimal Control (Riccati, Differential Dynamic Prog.)
13:34

Five approaches to learning behavior
[Figure: taxonomy of approaches]
From experience data $D = \{(s_t, a_t, r_t)\}_{t=0}^T$:
  – Model-based: learn model $P(s' \mid s, a)$, $R(s, a)$ → dynamic prog. → $V(s)$ → policy $\pi(s)$
  – Model-free: learn value fct. $V(s)$ → policy $\pi(s)$
  – Policy Search: optimize policy $\pi(s)$
From demonstration data $D = \{(s_{0:T}, a_{0:T})^d\}_{d=1}^n$:
  – Imitation Learning: learn policy $\pi(s)$
  – Inverse RL: learn latent costs $R(s, a)$ → dynamic prog. → $V(s)$ → policy $\pi(s)$
13:36

Imitation Learning
[Diagram: $D = \{(s_{0:T}, a_{0:T})^d\}_{d=1}^n$ → (learn/copy) → $\pi(s)$]
• Use ML to imitate demonstrated state trajectories $x_{0:T}$
Literature:
  Atkeson & Schaal: Robot learning from demonstration (ICML 1997)
  Schaal, Ijspeert & Billard: Computational approaches to motor learning by imitation (Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 2003)
  Grimes, Chalodhorn & Rao: Dynamic Imitation in a Humanoid Robot through Nonparametric Probabilistic Inference. (RSS 2006)
  Rüdiger Dillmann: Teaching and learning of robot tasks via observation of human performance (Robotics and Autonomous Systems, 2004)
13:37

Imitation Learning
• There are many ways to imitate/copy the observed policy:
  Learn a density model $P(a_t \mid s_t) P(s_t)$ (e.g., with mixture of Gaussians) from the observed data and use it as policy (Billard et al.)
  Or trace observed trajectories by minimizing perturbation costs (Atkeson & Schaal 1997)
13:38

Imitation Learning
[Figure: robot imitation experiment]
Atkeson & Schaal
13:39

Inverse RL
[Diagram: $D = \{(s_{0:T}, a_{0:T})^d\}_{d=1}^n$ → (learn) → $R(s, a)$ → (DP) → $V(s)$ → $\pi(s)$]
• Use ML to “uncover” the latent reward function in observed behavior
Literature:
  Pieter Abbeel & Andrew Ng: Apprenticeship learning via inverse reinforcement learning (ICML 2004)
  Andrew Ng & Stuart Russell: Algorithms for Inverse Reinforcement Learning (ICML 2000)
  Nikolay Jetchev & Marc Toussaint: Task Space Retrieval Using Inverse Feedback Control (ICML 2011).
13:40

Inverse RL (Apprenticeship Learning)
• Given: demonstrations $D = \{x^d_{0:T}\}_{d=1}^n$
• Try to find a reward function that discriminates demonstrations from other policies
  – Assume the reward function is linear in some features $R(x) = w^\top \phi(x)$
  – Iterate:
    1. Given a set of candidate policies $\{\pi_0, \pi_1, ..\}$
    2. Find weights $w$ that maximize the value margin between teacher and all other candidates
$$\max_{w, \xi}\ \xi \quad \text{s.t.} \quad \forall \pi_i:\ \underbrace{w^\top \langle\phi\rangle_D}_{\text{value of demonstrations}} \ge \underbrace{w^\top \langle\phi\rangle_{\pi_i}}_{\text{value of } \pi_i} + \xi\ ,\qquad \|w\|^2 \le 1$$
    3. Compute a new candidate policy $\pi_i$ that optimizes $R(x) = w^\top \phi(x)$ and add to candidate list.
(Abbeel & Ng, ICML 2004)
13:41

[Figure]
13:42

Continuous state/actions in model-free RL
• All of this is fine in small finite state & action spaces.
  $Q(s, a)$ is a $|S| \times |A|$-matrix of numbers.
  $\pi(a \mid s)$ is a $|S| \times |A|$-matrix of numbers.
• In the following:
  – optimize a parameterized $\pi(a \mid s)$ (policy search)
13:43

Policy gradients
• In continuous state/action case, represent the policy as linear in arbitrary state features:
$$\pi(s) = \sum_{j=1}^k \phi_j(s) \beta_j = \phi(s)^\top \beta \quad \text{(deterministic)}$$
$$\pi(a \mid s) = \mathcal{N}(a \mid \phi(s)^\top \beta,\ \Sigma) \quad \text{(stochastic)}$$
  with $k$ features $\phi_j$.
• Basically, given an episode $\xi = (s_t, a_t, r_t)_{t=0}^H$, we want to estimate $\frac{\partial V(\beta)}{\partial \beta}$
13:44

Policy Gradients
• One approach is called REINFORCE:
$$\begin{aligned}
\frac{\partial V(\beta)}{\partial \beta} &= \frac{\partial}{\partial \beta} \int P(\xi \mid \beta)\ R(\xi)\ d\xi = \int P(\xi \mid \beta)\ \frac{\partial}{\partial \beta} \log P(\xi \mid \beta)\ R(\xi)\ d\xi \\
 &= E_{\xi \mid \beta}\Big\{\frac{\partial}{\partial \beta} \log P(\xi \mid \beta)\ R(\xi)\Big\} = E_{\xi \mid \beta}\Big\{\sum_{t=0}^H \gamma^t\ \frac{\partial \log \pi(a_t \mid s_t)}{\partial \beta}\ \underbrace{\sum_{t'=t}^H \gamma^{t'-t} r_{t'}}_{Q^\pi(s_t, a_t, t)}\Big\}
\end{aligned}$$
• Another is PoWER, which requires $\frac{\partial V(\beta)}{\partial \beta} = 0$
$$\beta \leftarrow \beta + \frac{E_{\xi \mid \beta}\{\sum_{t=0}^H \epsilon_t\ Q^\pi(s_t, a_t, t)\}}{E_{\xi \mid \beta}\{\sum_{t=0}^H Q^\pi(s_t, a_t, t)\}}$$
See: Peters & Schaal (2008): Reinforcement learning of motor skills with policy gradients, Neural Networks.
Kober & Peters: Policy Search for Motor Primitives in Robotics, NIPS 2008.
Vlassis, Toussaint (2009): Learning Model-free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots 27, 123-130.
13:45

[Figure: robot motor primitive experiment]
Kober & Peters: Policy Search for Motor Primitives in Robotics, NIPS 2008.
13:46

[Figure: taxonomy of the five approaches to learning behavior (model-based, model-free, policy search, imitation learning, inverse RL)]
  – Policy gradients are one form of policy search.
  – There are other, direct policy search methods (plain stochastic search, e.g. “Covariance Matrix Adaptation”)
13:47

Conclusions
• Markov Decision Processes and RL provide a solid framework for describing behavioural learning & planning
• Little taxonomy:
  [Figure: taxonomy of the five approaches to learning behavior (model-based, model-free, policy search, imitation learning, inverse RL)]
13:48

Basic topics not covered
• Partial Observability (POMDPs)
  What if the agent does not observe the state $s_t$?
  → The policy $\pi(a_t \mid b_t)$ needs to build on an internal representation, called belief $b_t$.
• Continuous state & action spaces, function approximation in RL
• Predictive State Representations, etc etc...
13:49
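As a compact illustration of two updates from this chapter, the sketch below runs Q-Iteration (model known) and ε-greedy Q-learning (model unknown, learning from sampled transitions only) on the same small MDP and compares the results. It is a minimal sketch under stated assumptions: the random MDP, the variable names (`P`, `R`, `Q_star`, `visits`), the visit-count step size, and all constants are illustrative and not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.8

# Hypothetical random MDP: P[s, a, s2] = P(s2 | s, a), R[s, a] = expected reward
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Q-Iteration with the known model:
# Q_{k+1}(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) * max_a2 Q_k(s2,a2)
Q_star = np.zeros((nS, nA))
for _ in range(1000):
    Q_next = R + gamma * P @ Q_star.max(axis=1)
    converged = np.max(np.abs(Q_next - Q_star)) <= 1e-10   # stopping criterion
    Q_star = Q_next
    if converged:
        break

# Q-learning: same fixed point, but only sampled experiences (s, a, r, s2) are used
Q = np.zeros((nS, nA))
visits = np.zeros((nS, nA))
eps = 0.3
s = 0
for _ in range(60_000):
    # epsilon-greedy behaviour: random action with prob. eps, greedy otherwise
    a = int(rng.integers(nA)) if rng.random() < eps else int(Q[s].argmax())
    s2 = rng.choice(nS, p=P[s, a])
    r = R[s, a]                                  # deterministic reward, for simplicity
    visits[s, a] += 1
    alpha = 10.0 / (10.0 + visits[s, a])         # per-pair decaying step size
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

max_error = np.max(np.abs(Q - Q_star))
```

Note that the learning loop executes the ε-greedy policy while its update targets the greedy value $\max_{a'} Q(s', a')$, which is what makes Q-learning off-policy.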
16 Reinforcement Learning – Exploration

Exploration is fundamental intelligent behavior
• Try to control the data you collect!
  – Make decisions that lead to interesting data
  – Reflect your own knowledge; know what you don’t know, what you could learn
  – Go to situations where you might learn something
• Curiosity, fun, intrinsic motivation, life-long learning
14:1

Recall Markov decision processes
[Figure: graphical model with actions $a_0, a_1, a_2$, states $s_0, s_1, s_2$, rewards $r_0, r_1, r_2$]
$$P(s_{0:T}, a_{0:T}, r_{0:T}; \pi) = P(s_0)\, P(a_0 \mid s_0; \pi)\, P(r_0 \mid a_0, s_0) \prod_{t=1}^T P(s_t \mid a_{t-1}, s_{t-1})\, P(a_t \mid s_t; \pi)\, P(r_t \mid a_t, s_t)$$
  – world’s initial state distribution $P(s_0)$
  – world’s transition probabilities $P(s_{t+1} \mid a_t, s_t)$
  – world’s reward probabilities $P(r_t \mid a_t, s_t)$
  – discount parameter $\gamma$ for future rewards
  – agent’s policy $\pi(a_t \mid s_t)$ (or deterministic $a_t = \pi(s_t)$): not part of the world model!
  – two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent)
14:2

Recall reinforcement learning
• Agent wants to maximize its future rewards $E[\sum_{t=0}^\infty \gamma^t r_t \mid s_0; \pi]$
• Agent starts without a world model → no $P(s_{t+1} \mid a_t, s_t)$, no $P(r_t \mid a_t, s_t)$, no $V^*(s)$, no $Q^*(s, a)$
• Agent needs to learn from experience $s_0, a_0, r_0, s_1, a_1, r_1, \dots$ which actions lead to high rewards
  – model-based RL: learn world model and then plan
  – model-free RL: learn $V$ and $Q$ directly (Q-learning, TD-learning)
• Don’t forget the difference between learning and planning. Planning (solving an MDP / calculating optimal state and action values) is the problem of adapting the behavior based on a given model
14:3

ε-greedy exploration in Q-learning

   1: Initialize $Q(s, a) = 0$
   2: repeat [for each episode]
   3:     Initialize start state $s$
   4:     repeat [for each step of episode]
   5:         Choose action $a$ = random with prob. $\epsilon$, else $\operatorname{argmax}_a Q(s, a)$
   6:         Take action $a$, observe $r, s'$
   7:         $Q_{\text{new}}(s, a) \leftarrow Q_{\text{old}}(s, a) + \alpha\ [r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a)]$
   8:         $s \leftarrow s'$
   9:     until end of episode
  10: until happy

• Estimate $Q(s, a)$ converges to $Q^*(s, a)$ with infinite number of samples for $(s, a)$ and appropriate $\alpha$ (Watkins and Dayan, 1992)
• Off-policy learning:
  – Q-learning estimates $\pi^*$ in form of $Q^*$ (line 7)
  – However, Q-learning does not execute $\pi^*$ (or its current estimate thereof), but ε-greedy to ensure to explore (line 5)
14:4

“Exploration-exploitation tradeoff”
• Two different types of behavior:
  – exploration: act with the goal to learn as much as possible;
    perform actions with unknown rewards / outcomes / values
  – exploitation: act with the goal of getting as much reward as possible;
    perform actions which are known to produce large reward / value
• Exploration-exploitation tradeoff:
  be sure not to miss states and actions with large rewards; but do not waste too much time in low-reward states and actions
14:5

Sample Complexity
• Let $A$ be an RL algorithm which acts in an unknown MDP, resulting in $s_0, a_0, r_0, s_1, a_1, r_1, \dots$
• How can we describe and judge the exploration efficiency of $A$ in formal terms?
• Definition (Kakade, 2003):
  Let $V_t^A = E[\sum_{s=0}^\infty \gamma^s r_{t+s} \mid s_0, a_0, r_0 \dots s_{t-1}, a_{t-1}, r_{t-1}, s_t]$. $V^*$ is the value function of the optimal policy. Let $\epsilon > 0$ be a prescribed accuracy.
  The sample complexity of $A$ is the number of timesteps $t$ such that
  $V_t^A(s_t) < V^*(s_t) - \epsilon$
  This is the number of timesteps where the policy of $A$ is more than $\epsilon$ worse than the optimal policy
14:6

PAC-MDP efficiency
• Let $\delta > 0$ be an allowed probability of failure. $A$ is called PAC-MDP efficient if with probability $1 - \delta$ its sample complexity scales polynomially in $\delta$, $\epsilon$ and quantities describing the MDP
• PAC: probably ($\delta$) approximately ($\epsilon$) correct
  The PAC framework is fundamental to frequentist learning theory. For instance, it can be used to derive guarantees on the generalization performance of support vector machines
• Quantities describing the MDP: number of states, number of actions, discount factor $\gamma$, maximal reward $R_{\max} > 0$, parameters in the transition model $P(s' \mid s, a)$, ...
14:7

PAC-MDP efficiency
• ε-greedy is not PAC-MDP efficient.
  Its sample complexity is exponential in the number of states (Whitehead, 1991)
• Examples of PAC-MDP efficient approaches:
  – model-based: $E^3$, R-MAX
  – model-free: Delayed Q-learning
14:8

Explicit-Exploit-or-Explore ($E^3$) algorithm*
Kearns and Singh (2002)
• PAC-MDP efficient model-based RL algorithm
• Based on two previously established key ideas:
  – counts $c(s, a)$ for states and actions to quantify model confidence: $s$ is known if all actions in $s$ have been executed sufficiently often
  – optimism in the face of uncertainty: unknown states are assumed to give maximum reward (whose value is known)
• $E^3$ uses two MDPs:
  – MDP_known: known states with (approximately exact) estimates of $P(s_{t+1} \mid s_t, a_t)$ and $P(r_t \mid s_t, a_t)$
    → captures what you know and drives exploitation
  – MDP_unknown: MDP_known without reward + special state s0 where the agent receives maximum reward
    → drives exploration
14:9

$E^3$ sketch*

  Input: State $s$
  Output: Action $a$
  if $s$ is known then
      Plan in MDP_known // Sufficiently accurate model estimates
      if resulting plan has value above some threshold then
          return first action of plan // Exploitation
      else
          Plan in MDP_unknown
          return first action of plan // Planned exploration
      end if
  else
      return action with the least observations in $s$ // Direct exploration
  end if
14:10

$E^3$ example*
[Figure]
S. Singh (Tutorial 2005)
14:11

$E^3$ example*
[Figure]
S. Singh (Tutorial 2005)
14:12

• Rmax:
$$R^*(s, a) = \begin{cases} R_{\max} & \text{if } \#_{s,a} < n \\ \hat\theta_{rsa} & \text{otherwise} \end{cases}\ ,\qquad P^*(s' \mid s, a) = \begin{cases} \delta_{s' s^*} & \text{if } \#_{s,a} < n \\ \hat\theta_{s'sa} & \text{otherwise} \end{cases}$$
  – Guarantees over-estimation of values, polynomial PAC results!
  – Read about “KWIK-Rmax”! (Li, Littman, Walsh, Strehl, 2011)
• Bayesian Exploration Bonus (BEB), Kolter & Ng (ICML 2009)
  – Choose $P^*(s' \mid s, a) = P(s' \mid s, a, b)$ integrating over the current belief $b(\theta)$ (non-over-confident)
  – But choose $R^*(s, a) = \hat\theta_{rsa} + \frac{\beta}{1 + \alpha_0(s, a)}$ with a hyperparameter $\alpha_0(s, a)$, over-estimating return
• Confidence intervals for V-/Q-function (Kaelbling ’93, Dearden et al. ’99)
14:16

$E^3$ example*
[Figure; $\hat M$: estimated known state MDP, $M$: true known state MDP]
S.
Singh (Tutorial 2005) 14:13 More ideas about exploration R- MAX • Intrinsic rewards for learning progress – “fun”, “curiousity” Brafman and Tennenholtz (2002) • similar to E 3 ; implicit instead of explicit exploration – in addition to the external “standard” reward of the MDP • based on reward function – “Curious agents are interested in learnable but yet unknown regularities, and get bored by both predictable and inherently unpredictable things.” (J. Schmidhuber) RR- MAX (s, a) = – Use of a meta-learning system which learns to predict the error that the learning machine makes in its predictions; meta-predictions measure the potential interestingness of situations (Oudeyer et al.) R(s, a) c(s, a) ≥ m (s, a known) Rmax c(s, a) < m (s, a unknown) • Is PAC-MDP efficient • Optimism in the face of uncertainty 14:14 • Dimensionality reduction for model-based exploration in continuous spaces: low-dimensional representation of the transition Bayesian RL function; focus exploration on relevant dimensions (A. Nouri, • There exists an optimal solution to the exploration-exploitation M. Littman) 14:17 trade-off: belief planning (see my tutorial “Bandits, Global Optimization, Active Learning, and Bayesian RL – understanding the common ground”) V π (b, s) = R(s, π(b, s)) + Z P (b0 , s0 | b, s, π(b, s)) V π (b0 , s0 ) b0 ,s0 – Agent maintains a distribution (belief) b(m) over MDP models m – typically, MDP structure is fixed; belief over the parameters – belief updated after each observation (s, a, r, s0 ): b → b0 – only tractable for very simple problems • Bayes-optimal policy π ∗ = argmaxπ V π (b, s) – no other policy leads to more rewards in expectation w.r.t. 
prior distribution over MDPs More ideas about exploration • Exploration to reduce uncertainty of a belief p(x) – In robotics, p(x) might be the belief about the robot position x – Entropy: a probabilistic measure of information R H(p(x)) = − p(x) log p(x)dx Hp (x) is maximal if p is uniform, and minimal if p is a point mass distribution – Information gain of action a: I(a) = H(p(x)) − Ez [H(p(x0 | z, a))] (z is the potential observation) expected change of entropy when executing an action – maximizing information gain = minimizing uncertainty in belief 14:18 – solves the exploration-exploitation tradeoff 14:15 Optimistic heuristics Digression: Active Learning Cohn, Ghahramani, Jordan (1996) • As with UCB, choose estimators for R∗ , P ∗ that are optimistic/over- • active choice of learning examples in a supervised learning setting: confident h ∗ Vt (s) = max R + a P s0 ∗ 0 0 P (s |s, a) Vt+1 (s ) i learn mapping X → Y from training examples D = {(xi , yi )m i=1 } • active learning protocol: Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 – select a new input x ˜: x ˜ may be a query, experiment, action ... – observe resulting output y˜ – incorporate new example (˜ x, y˜) into D, relearn, and repeat; 77 • Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain • crucial question: how to choose x ˜? 
• Generalization over states and actions is crucial (→ relational • heuristics: – where we don’t have data RL) – where we perform poorly – where we have low confidence • Active learning selects statistically optimal training data for effi- – where we expect it to change our model cient supervised learning – where we previously found data 14:22 • in the following: select x ˜ in a statistically “optimal” manner 14:19 Digression: Active Learning (continued) • Goal: minimize variance of prediction yˆ for given x σy2ˆ = ED [(ˆ y − ED [ˆ y ])2 ] changes with new training example (˜ x, y˜) • Choose x ˜ which minimizes the expected predictive variances conditioned on having seen x ˜: ˜] hσy2ˆ i = ED∪(˜x,˜y) [σy2ˆ | x References • Brafman, Tennenholtz (2002). R-max - a general polynomial time algorithm for near-optimal RL. JMLR. • Cohn, Ghahramani, Jordan (1996): Active learning with statistical models. JAIR. • Kakade (2003): On the sample complexity of RL. PhD thesis. • Kearns, Singh (2002): Near-optimal reinforcement learning in polynomial time. Machine Learning Journal. • Li (2009): A unifying framework for computational RL theory. PhD thesis. • Nouri, Littman (2010): Dimension reduction and its application to modelbased exploration in continuous spaces. Machine Learning Journal. • Oudeyer, Kaplan, Hafner (2007): Intrinsic motivation systems for autonomous mental development. IEEE Evolutionary Computation. • Schmidhuber (1991): Curious model-building control systems. In Int. Joint Conf. on Neural Networks. 14:23 • How can we compute hσy2ˆ i? 
Monte Carlo approximation (sam- pling): evaluate at a set of reference points drawn from P (x) 14:20 Digression: Active Learning (continued) • Example: mixture of Gaussians – analytic solution for calculating expected predictive variances of the learner Cohn, Ghahramani, Jordan (1996) 14:21 Conclusions • Exploration is fundamental intelligent behavior • RL agents need to solve the exploration-exploitation tradeoff • Sample complexity is a measure of the exploration efficiency of an RL algorithm 78 17 Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 Exercises 17.1 17.1.1 Exercise 1 First Steps You will hand in your exercises in groups of (up to) three. If you did not write Johannes your group members, please do so as soon as possible. ([email protected]) All exercises will be in Python and handed in via Git. Make yourself familiar with both Python and Git by reading a few tutorials and examples. You can find links to some free good tutorials at the course website at https: around with the data in it. The last file is run_tests.sh. It runs the tests, so that you can use the test to check whether you are doing right. Note that our test suite will be different from the one we hand to you. So just mocking each function with the desired output without actually computing it will not work. You can run the tests by executing: $ sh run_tests.sh If you are done implementing the exercise simply commit your implementation and push it to our server. $ git add e01-graphsearch.py $ git commit $ git push Task: Implement breadth-first search, uniform-cost search, limited-depth search, iterative deepening search and A14-ArtificialIntelligence/. star as described in the lecture (A-star will be topic of the next lecture). All methods get as an input a graph, Login in our GitLab system at https://sully.informatik. a start state, and a goal state. 
Your methods should uni-stuttgart.de/gitlab/ with the account sent return two things: the path from start to goal, and the to you. If you did not receive an account yet, please fringe at the moment when the goal state is found (that email Johannes. latter allows us to check correctness of the implemenCreate a SSH key (if you don’t already have one) and tation). The first return value should be the found Node upload it in your profile at “Profile Settings” and “SSH (which has the path implicitly included through the parKeys”. ent links) and a Queue (one of the following: Queue, LifoQueue, PriorityQueue and NodePriorityQueue) $ ssh-keygen object holding the fringe. You also have to fill in the pri$ cat ˜/.ssh/id_rsa.pub ority computation at the put() method of the NodePriorityQue //ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/ Clone your repository with: Iterative Deepening and Depth-limited search are a bit different in that they do not explicetly have a fringe. You $ git clone [email protected]:ai_lecture/group_[GROUP_NUMBER].git don’t have to return a fringe in those cases, of course. Depth-limited search additionally gets a depth limit as input. A-star gets a heuristic function as input, which you can call like this: 17.1.2 Tree Search def a_star_search(graph, start, goal, heuristic In the repository you will find the directory e01-graphsearch # ... with a couple of files. First there is e01-graphsearch.py h = heuristic(node.state, goal) with the boilerplate code for the exercise. The com# ... ments in the code define what each function is supposed to do. Implement each function and you are done with the exercise. Tips: The second file you will find is tests.py. It consists of tests that check whether your function does what they should. You don’t have to care about this file, but you can have a look in it to understand the exercise better. The next file is data.py. It consists of a very small graph and the S-Bahn net of Stuttgart as graph structure. 
It will be used by the test. If you like you can play – For those used to IDEs like Visual Studio or Eclipse: Install PyCharm (Community Edition). Start it in the git directory. Perhaps set the Keymap to ’Visual Studio’ (which sets exactly the same keys for running and stepping in the debugger). That’s helping a lot. – Use the data structure Node that is provided. It has exactly the attributes mentioned on slide 26. – Maybe you don’t have to implement the ’Tree-Search’ and ’Expand’ methods separately; you might want to put Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 • Breitensuche, Tiefensuche und Suche mit ein¨ einer Bestenheitlichen Kosten sind Spezialfalle suche. them in one little routine. 17.2 79 • Suche mit einheitlichen Kosten ist ein Spezialfall der A∗ -Suche. Exercise 2 ¨ Prasenz ¨ Dieses Blatt enthalt ubungen, die teils am 6.11. ¨ ¨ in der Ubungsgruppe besprochen werden und auch zur Klausurvorbereitung dienen. Sie sind nicht abzugeben. ¨ Studenten werden zufallig gebeten, sich an den Aufgaben zu versuchen. ¨ • Es gibt Zustandsraume, in dem iterative Tiefen¨ ¨ als Tiefensuche eine hohere Laufzeit-Komplexitat 2 suche hat (O(n ) vs. O(n)). 17.2.3 ¨ Prasenzaufgabe: Greedy-Bestensuche und A∗ -Suche 17.2.1 ¨ Prasenzaufgabe: Beispiel fur ¨ Baum- ¨ Betrachten Sie die Rumanien-Karte auf Folie 4 in 03-search.pd suche Betrachten Sie den Zustandsraum, in dem der Startzustand mit der Nummer 1 bezeichnet wird und die Nach¨ folgerfunktion fur mit den Num¨ Zustand n die Zustande mern 4n − 2, 4n − 1, 4n und 4n + 1 zuruck ¨ gibt. Nehmen Sie an, dass die hier gegebene Reihenfolge auch genau die Reihenfolge ist, in der die Nachbarn in expand durchlaufen werden und in die LIFO fringe eingetragen werden. • Zeichnen Sie den Teil des Zustandsraums, der ¨ die Zustande 1 bis 21 umfasst. 
• Geben Sie die Besuchsreihenfolge (Besuch=[ein Knoten wird aus der fringe genommen, goal-check, ¨ und expandiert]) fur Tiefensuche ¨ eine beschrankte mit Grenze 2 und fur ¨ eine iterative Tiefensuche, jeweils mit Zielknoten 4, an. Geben Sie nach jedem Besuch eines Knotens den dann aktuellen Inhalt der fringe an. Die initiale fringe ist [1]. Nutzen Sie fur ¨ jeden Besuch in etwa die Notation: besuchter Zustand: [fringe nach dem Besuch] • Fuhrt ein endlicher Zustandsraum immer zu einem ¨ endlichen Suchbaum? Begrunden Sie Ihre Antwort. ¨ 17.2.2 ¨ ¨ der SuchPrasenzaufgabe: Spezialfalle • Die Luftlinien-Heuristik hLL bringt Probleme fur ¨ eine Greedy-Bestensuche, wenn wir von Iasi nach Faragas gehen wollen. In umgekehrter Richtung jedoch nicht. Finden Sie einen Fall, in dem GreedySuche mit hLL fur ¨ keine Richtung den kurzesten ¨ Weg findet. • Verfolgen Sie den Weg von Lugoj nach Bukarest mittels einer A∗ -Suche und verwenden Sie die Luftlinien-Distanz als Heuristik. Geben Sie alle Knoten an, die auf dem Weg berucksichtigt wer¨ den, und ermitteln Sie jeweils die Werte fur ¨ f, g und h. • Geben Sie den mittels der A∗ -Suche gefundenen kurzesten Weg an. ¨ 17.3 17.3.1 Exercise 3 Install Numpy For this exercise you will need numpy. Please install a recent version of it. Installation notes you can find http://www.scipy.org/install.html. Any reasonably recent version will do for this exercise. (So using a version from the Ubuntu repository for instance is fine.) strategien Beweisen Sie die folgenden Aussagen: • Breitensuche ist ein Spezialfall der Suche mit einheitlichen Kosten. 17.3.2 Constrained Satisfaction Problems Pull the current exercise from our server to your local repository: 80 Introduction to Artificial Intelligence, Marc Toussaint—April 7, 2015 $ git pull Task 1: Implement backtracking for the constrained satisfaction problem definition you find in csp.py. 
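As a rough orientation for Task 1, backtracking with the minimum-remaining-values (MRV) heuristic might look like the sketch below. The interface of csp.py is not shown here, so the sketch assumes a hypothetical dictionary-based representation (`domains`, `neighbors`) with pair-wise "not equal" constraints only; the function name `backtrack` is likewise an assumption and not part of the provided code.

```python
def backtrack(domains, neighbors, assignment=None):
    """Backtracking search for a pair-wise 'not equal' CSP, using MRV.

    domains:   dict variable -> list of candidate values (hypothetical format)
    neighbors: dict variable -> set of variables that must take a different value
    Returns a complete assignment dict, or None if no solution exists.
    """
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        return assignment

    def legal_values(var):
        # Values of var that do not clash with already assigned neighbours.
        return [v for v in domains[var]
                if all(assignment.get(n) != v for n in neighbors[var])]

    unassigned = [v for v in domains if v not in assignment]
    # MRV: pick the variable with the fewest remaining legal values;
    # min() keeps the first such variable, i.e. there is no tie-breaker.
    var = min(unassigned, key=lambda v: len(legal_values(v)))
    for value in legal_values(var):
        assignment[var] = value
        result = backtrack(domains, neighbors, assignment)
        if result is not None:
            return result
        del assignment[var]  # undo and try the next value
    return None


# Usage: colour a triangle A-B-C with a pendant node D using three colours.
neighbors = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
domains = {v: ["red", "green", "blue"] for v in neighbors}
solution = backtrack(domains, neighbors)
```

Replacing the `min(...)` line by `unassigned[0]` would give the variant without any heuristic; a degree-heuristic tie-breaker would extend the sort key.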
Make three different versions of it: 1) without any heuristic, 2) with minimal remaining value as heuristic but without tie-breaker (take the first best solution), 3) with minimal remaining value and the degree heuristic as tie-breaker.

Optional: Implement AC-3 or any approximate form of constraint propagation and activate it if the corresponding parameter is set.

Task 2: Implement a method to convert a Sudoku into a csp.ConstrainedSatisfactionProblem. The sudoku is given as a numpy array. Every empty field is set to 0. The CSP you create should cover all rules of a Sudoku, which are (from http://en.wikipedia.org/wiki/Sudoku): Fill a 9 × 9 grid with digits so that each column, each row, and each of the nine 3 × 3 sub-grids that compose the grid (also called 'boxes', 'blocks', 'regions', or 'subsquares') contains all of the digits from 1 to 9.

In the lecture we mentioned the all-different constraint for columns, rows, and blocks. As the csp.ConstrainedSatisfactionProblem only allows you to represent pair-wise unequal constraints (to facilitate constraint propagation) you need to convert this.

17.4 Exercise 4

This sheet contains in-class exercises that will be discussed in the tutorial group on 20 Nov. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the exercises.

17.4.1 In-class exercise: CSP

Consider the following map section:
[Figure not recoverable: map section]
The map section is to be colored with 4 colors in total such that any two neighboring countries have different colors.
(a) Which country would one most likely start with?
(b) Color the first country and apply constraint propagation throughout.

17.4.2 In-class exercise: Generalized Arc Consistency

We have n variables $x_i$, each with the (current) domain $D_i$. Constraint propagation by establishing local constraint consistency ("arc consistency") in general means the following: For a variable $x_i$ and an adjacent constraint $C_k$, we delete all values $v$ from $D_i$ for which there exists no tuple $\tau \in D_{I_k}$ with $\tau_i = v$ that satisfies the constraint.

Consider a simple example
$x_1, x_2 \in \{1, 2\}$, $x_3, x_4 \in \{2, .., 6\}$, $c = \text{AllDiff}(x_1, .., x_4)$
(a) How does constraint propagation from $c$ to $x_3$ update the domain $D_3$?
(b) On http://norvig.com/sudoku.html Norvig describes his Sudoku solver, using the following rules for constraint propagation:
(1) If a square has only one possible value, then eliminate that value from the square's peers.
(2) If a unit (block, row or column) has only one possible place for a value, then put the value there.
Is this a general implementation of constraint propagation for the allDiff constraint?

Note: The generalized arc consistency is equivalent to so-called message passing (or belief propagation) in probabilistic networks, except that the messages are domain sets instead of belief vectors.
See also www.lirmm.fr/~bessiere/stock/TR06020.pdf

17.5 Exercise 5

Deadline: 3 Dec, 24:00h. Please check in your solutions as a Python file (see the template in the email) named e05/e05_sol.py (in directory e05) to your git account. Please get in touch if anything is unclear.

17.5.1 Satisfiability and validity (propositional logic)

Decide whether the following sentences are satisfiable, valid, or neither of the two (none).
(a) Smoke ⇒ Smoke
(b) Smoke ⇒ Fire
(c) (Smoke ⇒ Fire) ⇒ (¬Smoke ⇒ ¬Fire)
(d) Smoke ∨ Fire ∨ ¬Fire
(e) ((Smoke ∧ Heat) ⇒ Fire) ⇔ ((Smoke ⇒ Fire) ∨ (Heat ⇒ Fire))
(f) (Smoke ⇒ Fire) ⇒ ((Smoke ∧ Heat) ⇒ Fire)
(g) Big ∨ Dumb ∨ (Big ⇒ Dumb)
(h) (Big ∧ Dumb) ∨ ¬Dumb

17.5.2 Enumerating models (propositional logic)

Consider propositional logic with the symbols A, B, C and D. In total there are thus 16 models. In how many models are the following sentences satisfied?
1. (A ∧ B) ∨ (B ∧ C)
2. A ∨ B
3. A ⇔ (B ⇔ C)

17.5.3 Unification (predicate logic)

For each pair of atomic sentences give the most general unifier, if it exists. Do not standardize apart any further. Return None if no unifier exists; otherwise return a dictionary that has the variable as key and the constant as value.
For P(A), P(x): sol3z = {'x': 'A'}
1. P(A, B, B), P(x, y, z).
2. Q(y, G(A, B)), Q(G(x, x), y).
3. Older(Father(y), y), Older(Father(x), John).
4. Knows(Father(y), y), Knows(x, x).

17.5.4 In-class exercise: Matching as Constraint Satisfaction Problem

Consider the Generalized Modus Ponens (slide 09:15) for inference (forward and backward chaining) in first order logic. Applying this inference rule requires finding a substitution $\theta$ such that $p'_i \theta = p_i \theta$ for all $i$.

Show constructively that the problem of finding a substitution $\theta$ (also called matching problem) is equivalent to a Constraint Satisfaction Problem. "Constructively" means: explicitly construct/define a CSP that is equivalent to the matching problem.

Note: The PDDL language to describe agent planning problems (slide 08:24) is similar to a knowledge base in Horn form. Checking whether the action preconditions hold in a given situation is exactly the matching problem; applying the Generalized Modus Ponens corresponds to the application of the action rule on the current situation.

17.5.5 In-class exercise

In the lecture we discussed the case "A first cousin is a child of a parent's sibling"
$\forall x, y\; FirstCousin(x, y) \iff \exists p, z\; Parent(p, x) \wedge Sibling(z, p) \wedge Parent(z, y)$
A question was whether this is equivalent to
$\forall x, y, p, z\; FirstCousin(x, y) \iff Parent(p, x) \wedge Sibling(z, p) \wedge Parent(z, y)$
Let's simplify: Show that the following two
$\forall x\; A(x) \iff \exists y\; B(y, x)$ (2)
$\forall x, y\; A(x) \iff B(y, x)$ (3)
are different. For this, bring both sentences into CNF as described on slides 09:21 and 09:22 of lecture 09-FOLinference.

17.6 Exercise 6

Deadline: 7 Jan 2015, 24:00h.

17.6.1 Chess

Implement a chess-playing program. The basic Python code for it is in your repositories. We have also already implemented the basic structure of the UCT algorithm, so that you only need to implement the individual functions. You may also shorten the search depth for this and, for example, use the very simple evaluation function that is provided, or one of your own. It is however up to you to implement a completely different algorithm (e.g. Minimax), as long as the following interface is adhered to:

    class ChessPlayer(object):
        def __init__(self, game_board, player):
            # The game board is the board at the beginning, the optimization is
            # either chess.WHITE or chess.BLACK, depending on the player you are.
            pass

        def inform_move(self, move):
            # after each move (also your own) this f [...]
            # the player of the move played (which c [...]
            # chose, if you chose a illegal one.
            pass

        def get_next_move(self, board, secs):
            # return a move you want to play on the [...]
            # seconds
            pass

You can use the python-chess library that is provided.

This time there are no unit tests; instead your ChessPlayer should be able to play against another player and should, if possible, win against a randomly playing player. You can test your implementation with

    $ python2 interface.py --human

to play as a human against your player. Or with

    $ python2 interface.py --random

to let a randomly playing player compete against your program.

17.7 Exercise 7

This sheet contains in-class exercises that will be discussed in the tutorial group on 12 Jan. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the exercises.

17.7.1 In-class exercise: Conditional probability

1. The probability of contracting a certain tropical disease is 0.02%. A test that determines whether one has the disease is correct in 99.995% of cases. What is the probability of actually suffering from the disease if the test comes out positive?
2. Another rare disease affects 0.005% of all people. A corresponding test is correct in 99.99% of cases. With what probability is one affected by the disease given a positive test result?
3. There is a new test for the disease from item 2 that is correct in 99.995% of cases. What is the probability of having the disease here if the test comes out positive?

17.7.2 In-class exercise: Bandits

Assume you have 3 bandits. You have already tested them a few times and received returns
• From bandit 1: 8 7 12 13 11 9
• From bandit 2: 8 12
• From bandit 3: 5 13
For the returns of each bandit separately, compute a) the mean return, b) the standard deviation of returns, and c) the standard deviation of the mean estimator.
Which bandit would you choose next? (Distinguish cases: a) if you know this is the last chance to pull a bandit; b) if you will have many more trials thereafter.)

17.8 Exercise 8

Deadline: Wed, 28.01.2015, 23:59 h

17.8.1 Spam filter with Naive Bayes

In the lecture you got to know graphical models and inference in them. The widely used Naive Bayes classifier is based on this foundation. The Bayes classifier is used, for example, to detect spam emails automatically. For this, training emails are examined and the word frequencies are counted. From these, $D$ probabilities $p(x_i|c)$ for the occurrence of a particular word, given that an email is spam/ham, are estimated. Now the assumption is made that all these probabilities are independent, so that the joint distribution can be computed as follows:
$p(x, c) = p(c) \prod_{i=1}^{D} p(x_i|c)$ (4)
For a new incoming email one can now analyze the occurring words $x^*$ and, using Bayes' formula, compute the probability that this email is spam or ham:
$p(c|x^*) = \frac{p(x^*|c)\, p(c)}{p(x^*)} = \frac{p(x^*|c)\, p(c)}{\sum_c p(x^*|c)\, p(c)}$ (5)
Task: Implement a Naive Bayes classifier for the spam emails. You will find training data and Python code that can handle it in your repository. Your implementation should contain two functions:

    class NaiveBayes(object):
        def train(self, database):
            ''' Train the classificator with the giv [...] '''
            pass

        def spam_prob(self, email):
            ''' Compute the probability for the give [...] '''
            return 0.

Tip: In his book "Bayesian Reasoning and Machine Learning", David Barber gives a very good introduction to the Naive Bayes classifier (page 243 ff., or page 233 ff. in the free online version of the book, which can be downloaded at http://www.cs.ucl.ac.uk/staff/d.barber/brml/).

17.9 Exercise 9

This sheet contains in-class exercises that will be discussed in the tutorial group on 26 Jan. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the exercises.

17.9.1 In-class exercise: Hidden Markov Models

You are standing at night on a bridge over the B14 in Stuttgart and want to count how many trucks, buses and vans are driving in the direction of Bad Cannstatt.
Since you are observing several lanes at the same time and it is dark, you make the following errors when observing the traffic:
• You recognize a truck as a bus in 30% of cases and as a van in 10% of cases.
• You recognize a bus as a truck in 40% of cases and as a van in 10% of cases.
• You recognize a van as a bus or as a truck in 10% of cases each.
In addition you assume the following:
• A bus is followed by a bus with 10% and by a truck with 30%, otherwise by a van.
• A truck is followed by a van with 60% and by a bus with 30%, otherwise by another truck.
• A van is followed by a van with 80% and by a bus or a truck with 10% each.
You know for certain that the first observed vehicle is actually a van.
a) Formulate the HMM of this scenario, i.e., state $P(X_1)$, $P(X_{t+1}|X_t)$ and $P(Y_t|X_t)$ explicitly.
b) Prediction: What is the marginal distribution $P(X_3)$ over the 3rd vehicle?
c) Filtering: You made the observations $Y_{1:3} = (K, B, B)$ (K = van, from the German "Kleintransporter"; B = bus). What is the probability $P(X_3|Y_{1:3})$ of the 3rd vehicle given these observations?
d) Smoothing: What is the probability $P(X_2|Y_{1:3})$ of the 2nd vehicle, given the 3 observations?
e) Viterbi (most likely sequence): What is the most likely sequence $\mathrm{argmax}_{X_{1:3}} P(X_{1:3}|Y_{1:3})$ of vehicles, given the 3 observations?

17.10 Exercise 10

This sheet contains in-class exercises that will be discussed in the tutorial group on 5 Feb. and that also serve as exam preparation. They are not to be handed in. Students will be asked at random to attempt the exercises.

17.10.1 In-class exercise: Value Iteration

[Figure: eight states, numbered 1 to 8, arranged in a circle]

Consider the circle of states above, which depicts the 8 states of an MDP. The green state (#1) receives a reward of r = 4096 and is a terminal state, the red state (#2) is punished with r = −512. Consider a discounting of $\gamma = 1/2$.

Description of $P(s'|s, a)$:
• The agent can choose between two actions: going one step clockwise or one step counter-clockwise.
• With probability 3/4 the agent will transition to the desired state, with probability 1/4 to the state in the opposite direction.
• Exception: the green state (#1) is a terminal state: The MDP terminates after the agent has reached this state and collected the reward.

Description of $P(r|s, a)$:
• The agent will receive a reward of r = 4096 upon reaching the green state (#1).
• The agent receives a reward of r = −512 upon reaching the red state (#2).

1. Perform three steps of Value Iteration: Initialize $V_{k=0}(s) = 0$; what is $V_{k=1}(s)$, $V_{k=2}(s)$, $V_{k=3}(s)$?
2. How could you compute the value function $V^\pi(s)$ for a GIVEN policy (e.g., always walk clockwise) in closed form? Provide an explicit matrix equation.
3. Assume you are given $V^*(s)$. How can you compute the optimal $Q^*(s, a)$ from this? And assume $Q^*(s, a)$ is given, how can you compute the optimal $V^*(s)$ from this? Provide general equations.
4. What is $Q_{k=3}(s, a)$ for the example above? What is the "optimal" policy given $Q_{k=3}$?

17.10.2 In-class exercise: TD-learning

Consider TD-learning. The initial value function is $V(s) = 0$. Consider the following setup: The agent starts in state 4, then permanently chooses the clockwise action. As soon as it reaches the green terminal state (one way or another), it will be beamed back to the start state 4 and everything repeats over and over again.
1. Describe at what events plain TD-learning will update the value function and how it will update it. Guess roughly how many steps the agent will have taken when $V(s_4)$ becomes non-zero for the first time. How would this be different for eligibility traces?
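For the value iteration exercise (17.10.1), the sweeps could be set up as in the following minimal sketch. Since the figure is not available, several things are assumptions: the states are taken to be numbered 1 to 8 clockwise around the circle, the reward is collected on entering a state, the terminal state keeps value 0, and the update used is $V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)\,[r(s') + \gamma V_k(s')]$. All names in the code are hypothetical.

```python
GAMMA = 0.5
STATES = range(1, 9)                    # assumed numbered clockwise on the circle
ENTRY_REWARD = {1: 4096.0, 2: -512.0}   # reward collected on entering a state
TERMINAL = 1


def step(s, direction):
    """Neighbour of s; direction +1 = clockwise, -1 = counter-clockwise."""
    return (s - 1 + direction) % 8 + 1


def q_value(V, s, direction):
    """Expected return when trying to move in `direction` from state s."""
    q = 0.0
    # 3/4: move as intended, 1/4: slip to the opposite neighbour.
    for prob, d in ((0.75, direction), (0.25, -direction)):
        s2 = step(s, d)
        q += prob * (ENTRY_REWARD.get(s2, 0.0) + GAMMA * V[s2])
    return q


def value_iteration(n_sweeps):
    """Run n_sweeps synchronous Bellman backups starting from V = 0."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_sweeps):
        V = {s: 0.0 if s == TERMINAL
             else max(q_value(V, s, +1), q_value(V, s, -1))
             for s in STATES}
    return V


V1 = value_iteration(1)
V3 = value_iteration(3)
```

With a different state layout or reward convention only `step` and `ENTRY_REWARD` would need to change; `q_value` also provides the Q-values asked for in question 4.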
Index

n-queens as ILP (4:19), A* search (2:15), A*: Proof 1 of Optimality (2:22), A*: Proof 2 of Optimality (2:28), Admissible heuristics (2:30), Alpha-Beta Pruning (10:6), Backtracking (3:10), Backward Chaining (5:52), Backward Chaining (7:17), Bayes' Theorem (8:11), Bayesian Network (11:3), Bayesian RL (14:15), Belief propagation (11:36), Bellman optimality equation (13:8), Bernoulli and Binomial (8:14), Best-first Search (2:3), Beta (8:15), Breadth-first search (BFS) (1:29), Completeness of Forward Chaining (5:51), Complexity of BFS (1:37), Complexity of DFS (1:52), Complexity of Greedy Search (2:14), Complexity of Iterative Deepening Search (1:63), Complexity of A* (2:27), Conditional distribution (8:9), Conditional independence in a Bayes Net (11:7), Conditional random field* (11:46), Conjugate priors (8:23), Conjunctive Normal Form (5:64), Constraint propagation (3:25), Constraint satisfaction problems (CSPs): Definition (3:2), Conversion to CNF (5:65), Conversion to CNF (7:20), CSP as ILP (4:21), Definitions based on sets (8:6), Depth-first search (DFS) (1:39), Dirac (8:26), Dirichlet (8:19), Eligibility traces (13:26), Entailment (5:21), Entropy (8:33), Epsilon-greedy exploration in Q-learning (14:4), Evaluation functions (10:11), Example: Romania (1:2), Example: The 8-Puzzle (1:15), Example: Vacuum World (1:5), Existential quantification (6:8), Explicit-Exploit-or-Explore* (14:9), Exploration, Exploitation (9:6), Factor graph (11:30), Filtering, Smoothing, Prediction (12:3), FOL: Syntax (6:5), Forward Chaining (7:15), Forward chaining (5:41), Forward checking (3:21), Frame problem (6:22), Frequentist vs Bayesian (8:4), Gaussian (8:27), Generalized Modus Ponens (7:14), Genetic Algorithms (4:14), Gibbs sampling (11:24), Graph search and repeated states (1:65), Greedy Search (2:5), Hidden Markov Model (12:4), HMM inference (12:6), HMM: Inference (12:5), Horn Form (5:40), Imitation Learning (13:37), Importance sampling (11:22), Inference (5:28), Inference in graphical models: overview (11:17), Inference: general meaning (11:12), Inverse RL (13:40), Iterated Local Search (4:9), Iterative deepening search (1:54), Joint distribution (8:9), Junction tree algorithm (11:41), Kalman filter (12:9), Knowledge base: Definition (5:2), Kullback-Leibler divergence (8:34), Local optima, plateaus (4:8), Local Search (4:5), Logical equivalence (5:37), Logics: Definition, Syntax, Semantics (5:20), Loopy belief propagation (11:39), LP, QP, ILP, NLP (4:17), Map-Coloring Problem (3:3), Marginal (8:9), Markov Decision Process (MDP) (13:3), Markov Process (12:2), Maximum a-posteriori (MAP) inference (11:45), Memory-bounded A* (2:34), Message passing (11:36), Minimax (10:3), Model (5:22), Model-based RL (13:34), Modus Ponens (5:40), Monte Carlo (11:19), Monte Carlo Tree Search (MCTS) (9:14), Multi-armed Bandits (9:1), Multinomial (8:18), Multiple RVs, conditional independence (8:12), Optimistic heuristics (14:16), Optimization problem: Definition (4:2), PAC-MDP efficiency (14:7), Particle approximation of a distribution (8:29), Planning Domain Definition Language (PDDL) (6:24), Policy gradients (13:44), Probabilities as (subjective) information calculus (8:2), Probability distribution (8:8), Problem Definition: Deterministic, fully observable (1:9), Proof of convergence of Q-Iteration (13:13), Proof of convergence of Q-learning (13:24), Propositional logic: Semantics (5:31), Propositional logic: Syntax (5:29), Q-Function (13:11), Q-Iteration (13:12), Q-learning (13:22), R-Max (14:14), Random variables (8:7), Reduction to propositional inference (7:6), Resolution (5:64), Resolution (7:19), Sample Complexity (14:6), Sarsa (13:21), Satisfiability (5:38), Simulated Annealing (4:11), Situation Calculus (6:21), Slack Variables (4:19), Temporal difference (TD) (13:19), Travelling Salesman Problem (TSP) (4:6), Tree search implementation: states vs nodes (1:25), Tree Search:
General Algorithm (1:26), Tree-structured CSPs (3:33), TSP as ILP (4:20), UCT for games (10:12), Unification (7:9), Uniform-cost search (1:38), Universal quantification (6:7), Upper Confidence Bound (UCB) (9:8), Upper Confidence Tree (UCT) (9:19), Utilities and Decision Theory (8:32), Validity (5:38), Value Function (13:4), Value Iteration (13:10), Value order: Least constraining value (3:20), Variable elimination (11:27), Variable order: Degree heuristic (3:19), Variable order: Minimum remaining values (3:18), Wumpus World example (5:4), 87