Neural Networks: Delta Rule and Back Propagation
CS570 - Artificial Intelligence
Choi, Hyun-il and Jang, Seung-ick
CSD., KAIST
2001.04.02

Threshold Logic Unit (TLU)
- An artificial neuron: models the functionality of a biological neuron.
- Output y = threshold function f(activation).
- Activation: a = Σ_{i=1..n} w_i x_i
- Threshold functions: hard limiter, sigmoid.
  [Figure: output y versus activation a for a hard limiter and a sigmoid.]

Comparison of 1-, 2-, 3-layer nets: Hard Limiter (1)
- Single-layer perceptron: half-plane decision regions.
- Two-layer perceptron: any convex region (possibly unbounded), the convex hull of its half-planes.
- Three-layer perceptron: arbitrarily complex decision regions; can separate meshed classes.
- No more than three layers are required.

Comparison of 1-, 2-, 3-layer nets: Hard Limiter (2)
[Figure: example decision regions formed by one-, two-, and three-layer nets.]

TLUs as Classifiers
- Four classes A, B, C, D: the groups (A,B) / (C,D) are linearly separable, and the groups (A,D) / (B,C) are linearly separable.

Information for Classifier Training
- Use two output units and decode their outputs:
  y1 = 1 for (A,B), y1 = 0 for (C,D)
  y2 = 1 for (A,D), y2 = 0 for (B,C)

  y1  y2  Class
  0   0   C
  0   1   D
  1   0   B
  1   1   A

- Two pieces of information are necessary: the four classes can be separated by two hyperplanes, since (A,B) / (C,D) and (A,D) / (B,C) are each linearly separable.

Minimizing Error
- Find the minimum of a function by gradient descent.
- For y = f(x), slope = dy/dx; find the x position of the minimum of f(x).
- Suppose we can find the slope (rate of change of y) at any point.

Gradient Descent
- Δy ≈ (dy/dx) Δx  (1)
- so that Δy ≈ slope × Δx  (2)
- Put Δx = -α × slope  (3), where α > 0 and is small enough that the approximation in (1) still holds.
- Then Δy ≈ -α (slope)^2 ≤ 0  (4)
- If we keep repeating step (4), we can find the value of x associated with the function minimum.

Gradient Descent on an Error
- Calculate the error for each training vector.
- Perform gradient descent on the error, considered as a function of the weights.
- Find the weights which give minimal error.

Error Function
- For each pattern p, the error E_p is a function of the weights.
- Typically defined as half the squared difference between output and target:
  E_p = (1/2)(t - y)^2  (5)
- The total error is E = Σ_p E_p  (6)

Gradient Descent on Error
- To perform gradient descent there must be a well-defined gradient at each point, so the error must be a continuous function of the weights.
- Therefore train on the activation rather than the hard-limited output; the target activation is {-1, 1}.
- Learning rule: the delta rule.
  E_p = (1/2)(t - a)^2  (7)
  Δw_j = -α × (slope of E_p with respect to w_j) = α (t - a) x_j  (8)

Delta Rule
- The error will not always be zero: the unit is always learning something from the input.
- The term (t - a) is known as delta (δ).

               error comparison   theoretical background
  perceptron   output             hyperplane manipulation
  delta rule   activation         gradient descent on squared error

Delta Rule for Sigmoid Units
- For a sigmoid TLU, y = σ(a).
- From E_p = (1/2)(t - y)^2:
  dE_p/da = -(t - σ(a)) σ'(a), where σ'(a) = σ(a)(1 - σ(a))
  Δw_j = -α (dE_p/da) x_j = α (t - y) σ'(a) x_j
- (A short code sketch of this rule follows below.)
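The following is a minimal NumPy sketch (not part of the original slides) of the delta rule for a single sigmoid TLU: gradient descent on the squared error of expressions (5)-(8), using σ'(a) = σ(a)(1 - σ(a)). The AND-like toy data, the bias input, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy training set (assumed for illustration): inputs plus a constant bias input.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)   # last column acts as a bias input
t = np.array([0, 0, 0, 1], dtype=float)  # AND-like targets

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)        # one weight per input (incl. bias)
alpha = 0.5                              # learning rate (assumed value)

for epoch in range(2000):
    for x, target in zip(X, t):
        a = w @ x                        # activation a = sum_i w_i x_i
        y = sigmoid(a)                   # output y = sigma(a)
        # Delta rule for a sigmoid unit: dw_j = alpha * (t - y) * sigma'(a) * x_j
        delta = (target - y) * y * (1.0 - y)
        w += alpha * delta * x

print("trained weights:", w)
print("outputs:", sigmoid(X @ w))        # should approach the targets 0, 0, 0, 1
```

Each update moves w_j against the slope of E_p, so for a sufficiently small α the per-pattern error can only decrease, exactly as in step (4) of the gradient-descent slide.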
Multilayer Nets
[Figure: a two-layer net with inputs (x1, x2), hidden units implementing the (A,B) and (A,D) separations, and an output layer classifying A, B, C, D.]

Backpropagation - Theory
- Consider both hidden and output nodes; hidden nodes cannot be directly accessed for the purpose of training.
- Use the delta rule for the output nodes:
  Δw_ij = α σ'(a_j)(t_j - y_j) x_i  (2)
  where
    (t_j - y_j): the error on the jth node
    σ'(a_j): how quickly the activation can change the output
    x_i: the amount that the ith input has affected the activation

Causal Chain for Determining the Error
- "A --q--> B" means "A influences B via q":
  x --W--> a --σ--> y --(t)--> E
- Define δ_j = σ'(a_j)(t_j - y_j): a measure of the rate of change of the error  (3)
- Δw_ij = α δ_j x_i  (4)

δ of a Hidden Layer Node
- Consider the kth hidden node: the credit assignment problem, i.e. how much influence has this node had on the error?
- For the input i to hidden node k:
  Δw_ik = α δ_k x_i  (5)
- From hidden node k to output node j:
  - how much k can influence output node j: w_kj
  - via this, how output node j affects the error: δ_j
  - fan-out of the kth node: I_k
  δ_k = σ'(a_k) Σ_{j ∈ I_k} δ_j w_kj  (6)

Using the Training Rule
Forward pass
 (1) Present the pattern at the input layer.
 (2) Let the hidden nodes evaluate their output.
 (3) Let the output nodes evaluate their output using the results of step (2).
Backward pass
 (4) Apply the target pattern to the output layer.
 (5) Calculate the δ on the output nodes according to expr. (3).
 (6) Train each output node using gradient descent, expr. (4).
 (7) Calculate the δ on the hidden nodes according to expr. (6).
 (8) Train each hidden node using the δ from step (7) according to expr. (5).
(A combined code sketch of this procedure, with the momentum term, follows at the end of these notes.)

An Example
- Approximating the function g(p) = 1 + sin(πp/4) for -2 ≤ p ≤ 2 with a small network.
  [Figure: plot of g(p) over -2 ≤ p ≤ 2.]

Non-linearly Separable Problems
- Classes A and B are separated by an arbitrarily shaped decision surface.
- More complex decision surfaces need more hidden units.
- Difficulties: choosing the number of hidden units, an inadequate training set, generalization.

Generalization
- Test vectors which were not shown during training are classified correctly.

Overfitting the Decision Surface
- The decision planes align themselves as closely to the training data as possible.
- Test data can then be misclassified: too much freedom.
  [Figure: decision surfaces with 1 hidden unit vs. 2 hidden units.]

Overfitting: Curve Fitting
- Actual output y vs. input x.
- Fewer hidden units: the fit captures the underlying trend in the data.
- More hidden units: the curve follows the training data too sensitively and generalizes poorly.

Local Minima
- Start training at a point p and perform gradient descent; the descent may reach M_l, a local minimum.
- A local minimum corresponds to a partial solution for the network in response to the training data.
- Can be addressed using simulated annealing (the Boltzmann machine).

Speeding up Learning: the Momentum Term (1)
- The speed of learning is set by the learning rate α; if it is too big, learning is unstable and oscillates back and forth across the minimum.
- Alter the training rule from pure gradient descent to include a term for the last weight change:
  Δw_ij(n) = α δ_j x_i + λ Δw_ij(n - 1)  (7)
- If the previous weight change was large, so will the new one be: the weight change carries some momentum into the next iteration.
- λ governs the contribution of the momentum term.

Speeding up Learning: the Momentum Term (2)
[Figure illustrating the effect of the momentum term.]

Number of Nodes in the Hidden Layer: Two-layer Perceptron
- Must be large enough to form a decision region as complex as the problem requires.
- Must not be so large that the required weights cannot be reliably estimated from the training data.

Number of Nodes in the Hidden Layer: Three-layer Perceptron
- Number of nodes in the 2nd layer:
  - must be greater than one when the decision region cannot be formed from a single convex area;
  - in the worst case, equal to the number of disconnected regions.
- Number of nodes in the 1st layer:
  - must be sufficient to provide three or more edges for each convex area generated by every second-layer node;
  - typically more than three times the number of second-layer nodes.
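To tie the training procedure (expressions (3)-(6)), the momentum term (7), and the sine-approximation example together, here is a hedged NumPy sketch of backpropagation for a small 1-H-1 network. The hidden-layer size, learning rate, momentum coefficient λ, the bias terms, and the use of a linear output unit (so σ' = 1 at the output) are assumptions made for illustration, not the configuration used in the original slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Target function from the example slide: g(p) = 1 + sin(pi*p/4), -2 <= p <= 2.
P = np.linspace(-2, 2, 21).reshape(-1, 1)
T = 1.0 + np.sin(np.pi * P / 4.0)

H = 8            # number of hidden units (assumed)
alpha = 0.02     # learning rate (assumed)
lam = 0.8        # momentum coefficient lambda in expr. (7) (assumed)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)   # input -> hidden
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)   # hidden -> output (linear)
dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)            # previous weight changes,
dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)            # kept for the momentum term

for epoch in range(20000):
    for p, t in zip(P, T):
        # Forward pass: steps (1)-(3)
        a1 = p @ W1 + b1          # hidden activations
        y1 = sigmoid(a1)          # hidden outputs
        y2 = y1 @ W2 + b2         # linear output node

        # Backward pass: steps (4)-(8)
        delta2 = (t - y2)                         # expr. (3) with a linear output (sigma' = 1)
        delta1 = y1 * (1 - y1) * (W2 @ delta2)    # expr. (6): sigma'(a_k) * sum_j delta_j w_kj

        # Weight changes with momentum, expr. (7): dw(n) = alpha*delta*x + lambda*dw(n-1)
        dW2 = alpha * np.outer(y1, delta2) + lam * dW2
        db2 = alpha * delta2 + lam * db2
        dW1 = alpha * np.outer(p, delta1) + lam * dW1
        db1 = alpha * delta1 + lam * db1
        W2 += dW2; b2 += db2
        W1 += dW1; b1 += db1

# The trained net should now approximate g(p) on the training interval.
approx = sigmoid(P @ W1 + b1) @ W2 + b2
print("max abs error on training points:", float(np.max(np.abs(approx - T))))
```

The inner loop is the eight-step procedure from "Using the Training Rule": a forward pass per pattern, δ computed at the output node via expr. (3), propagated back to the hidden layer via expr. (6), and weight changes applied via exprs. (4), (5), and the momentum rule (7).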