Neural Networks: Delta Rule and Back Propagation
CS570 - Artificial Intelligence
Choi, Hyun-il and Jang, Seung-ick
CSD., KAIST
2001.04.02

Threshold Logic Unit (TLU)
- An artificial neuron: models the functionality of a biological neuron.
- Output y = threshold function f(activation).
- Activation: a = Σ_{i=1..n} w_i x_i
- Threshold functions: hard limiter, sigmoid.
  [Figure: output y versus activation a for a hard limiter and a sigmoid.]

Comparison of 1-, 2-, 3-layer nets: Hard Limiter (1)
- Single-layer perceptron: half-plane decision regions.
- Two-layer perceptron: any convex region (possibly unbounded), the convex hull of its half-planes.
- Three-layer perceptron: arbitrarily complex decision regions; can separate meshed classes.
- No more than three layers are required.

Comparison of 1-, 2-, 3-layer nets: Hard Limiter (2)
[Figure: example decision regions formed by one-, two-, and three-layer nets.]

TLUs as Classifiers
- Four classes A, B, C, D: the groups (A,B) / (C,D) are linearly separable, and the groups (A,D) / (B,C) are linearly separable.

Information for Classifier Training
- Use two output units and decode their outputs:
  y1 = 1 for (A,B), y1 = 0 for (C,D)
  y2 = 1 for (A,D), y2 = 0 for (B,C)

  y1  y2  Class
  0   0   C
  0   1   D
  1   0   B
  1   1   A

- Two pieces of information are necessary: the four classes can be separated by two hyperplanes, since (A,B) / (C,D) and (A,D) / (B,C) are each linearly separable.

Minimizing Error
- Find the minimum of a function by gradient descent.
- For y = f(x), slope = dy/dx; find the x position of the minimum of f(x).
- Suppose we can find the slope (rate of change of y) at any point.

Gradient Descent
- Δy ≈ (dy/dx) Δx  (1)
- so that Δy ≈ slope × Δx  (2)
- Put Δx = -α × slope  (3), where α > 0 and is small enough that the approximation in (1) still holds.
- Then Δy ≈ -α (slope)^2 ≤ 0  (4)
- If we keep repeating step (4), we can find the value of x associated with the function minimum.

Gradient Descent on an Error
- Calculate the error for each training vector.
- Perform gradient descent on the error, considered as a function of the weights.
- Find the weights which give minimal error.

Error Function
- For each pattern p, the error E_p is a function of the weights.
- Typically defined as half the squared difference between output and target:
  E_p = (1/2)(t - y)^2  (5)
- The total error is E = Σ_p E_p  (6)

Gradient Descent on Error
- To perform gradient descent there must be a well-defined gradient at each point, so the error must be a continuous function of the weights.
- Therefore train on the activation rather than the hard-limited output; the target activation is {-1, 1}.
- Learning rule: the delta rule.
  E_p = (1/2)(t - a)^2  (7)
  Δw_j = -α × (slope of E_p with respect to w_j) = α (t - a) x_j  (8)

Delta Rule
- The error will not always be zero: the unit is always learning something from the input.
- The term (t - a) is known as delta (δ).

               error comparison   theoretical background
  perceptron   output             hyperplane manipulation
  delta rule   activation         gradient descent on squared error

Delta Rule for Sigmoid Units
- For a sigmoid TLU, y = σ(a).
- From E_p = (1/2)(t - y)^2:
  dE_p/da = -(t - σ(a)) σ'(a), where σ'(a) = σ(a)(1 - σ(a))
  Δw_j = -α (dE_p/da) x_j = α (t - y) σ'(a) x_j
- (A short code sketch of this rule follows below.)
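The following is a minimal NumPy sketch (not part of the original slides) of the delta rule for a single sigmoid TLU: gradient descent on the squared error of expressions (5)-(8), using σ'(a) = σ(a)(1 - σ(a)). The AND-like toy data, the bias input, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy training set (assumed for illustration): inputs plus a constant bias input.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)   # last column acts as a bias input
t = np.array([0, 0, 0, 1], dtype=float)  # AND-like targets

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)        # one weight per input (incl. bias)
alpha = 0.5                              # learning rate (assumed value)

for epoch in range(2000):
    for x, target in zip(X, t):
        a = w @ x                        # activation a = sum_i w_i x_i
        y = sigmoid(a)                   # output y = sigma(a)
        # Delta rule for a sigmoid unit: dw_j = alpha * (t - y) * sigma'(a) * x_j
        delta = (target - y) * y * (1.0 - y)
        w += alpha * delta * x

print("trained weights:", w)
print("outputs:", sigmoid(X @ w))        # should approach the targets 0, 0, 0, 1
```

Each update moves w_j against the slope of E_p, so for a sufficiently small α the per-pattern error can only decrease, exactly as in step (4) of the gradient-descent slide.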
Multilayer Nets
[Figure: a two-layer net with inputs (x1, x2), hidden units implementing the (A,B) and (A,D) separations, and an output layer classifying A, B, C, D.]

Backpropagation - Theory
- Consider both hidden and output nodes; hidden nodes cannot be directly accessed for the purpose of training.
- Use the delta rule for the output nodes:
  Δw_ij = α σ'(a_j)(t_j - y_j) x_i  (2)
  where
    (t_j - y_j): the error on the jth node
    σ'(a_j): how quickly the activation can change the output
    x_i: the amount that the ith input has affected the activation

Causal Chain for Determining the Error
- "A --q--> B" means "A influences B via q":
  x --W--> a --σ--> y --(t)--> E
- Define δ_j = σ'(a_j)(t_j - y_j): a measure of the rate of change of the error  (3)
- Δw_ij = α δ_j x_i  (4)

δ of a Hidden Layer Node
- Consider the kth hidden node: the credit assignment problem, i.e. how much influence has this node had on the error?
- For the input i to hidden node k:
  Δw_ik = α δ_k x_i  (5)
- From hidden node k to output node j:
  - how much k can influence output node j: w_kj
  - via this, how output node j affects the error: δ_j
  - fan-out of the kth node: I_k
  δ_k = σ'(a_k) Σ_{j ∈ I_k} δ_j w_kj  (6)

Using the Training Rule
Forward pass
 (1) Present the pattern at the input layer.
 (2) Let the hidden nodes evaluate their output.
 (3) Let the output nodes evaluate their output using the results of step (2).
Backward pass
 (4) Apply the target pattern to the output layer.
 (5) Calculate the δ on the output nodes according to expr. (3).
 (6) Train each output node using gradient descent, expr. (4).
 (7) Calculate the δ on the hidden nodes according to expr. (6).
 (8) Train each hidden node using the δ from step (7) according to expr. (5).
(A combined code sketch of this procedure, with the momentum term, follows at the end of these notes.)

An Example
- Approximating the function g(p) = 1 + sin(πp/4) for -2 ≤ p ≤ 2 with a small network.
  [Figure: plot of g(p) over -2 ≤ p ≤ 2.]

Non-linearly Separable Problems
- Classes A and B are separated by an arbitrarily shaped decision surface.
- More complex decision surfaces need more hidden units.
- Difficulties: choosing the number of hidden units, an inadequate training set, generalization.

Generalization
- Test vectors which were not shown during training are classified correctly.

Overfitting the Decision Surface
- The decision planes align themselves as closely to the training data as possible.
- Test data can then be misclassified: too much freedom.
  [Figure: decision surfaces with 1 hidden unit vs. 2 hidden units.]

Overfitting: Curve Fitting
- Actual output y vs. input x.
- Fewer hidden units: the fit captures the underlying trend in the data.
- More hidden units: the curve follows the training data too sensitively and generalizes poorly.

Local Minima
- Start training at a point p and perform gradient descent; the descent may reach M_l, a local minimum.
- A local minimum corresponds to a partial solution for the network in response to the training data.
- Can be addressed using simulated annealing (the Boltzmann machine).

Speeding up Learning: the Momentum Term (1)
- The speed of learning is set by the learning rate α; if it is too big, learning is unstable and oscillates back and forth across the minimum.
- Alter the training rule from pure gradient descent to include a term for the last weight change:
  Δw_ij(n) = α δ_j x_i + λ Δw_ij(n - 1)  (7)
- If the previous weight change was large, so will the new one be: the weight change carries some momentum into the next iteration.
- λ governs the contribution of the momentum term.

Speeding up Learning: the Momentum Term (2)
[Figure illustrating the effect of the momentum term.]

Number of Nodes in the Hidden Layer: Two-layer Perceptron
- Must be large enough to form a decision region as complex as the problem requires.
- Must not be so large that the required weights cannot be reliably estimated from the training data.

Number of Nodes in the Hidden Layer: Three-layer Perceptron
- Number of nodes in the 2nd layer:
  - must be greater than one when the decision region cannot be formed from a single convex area;
  - in the worst case, equal to the number of disconnected regions.
- Number of nodes in the 1st layer:
  - must be sufficient to provide three or more edges for each convex area generated by every second-layer node;
  - typically more than three times the number of second-layer nodes.
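To tie the training procedure (expressions (3)-(6)), the momentum term (7), and the sine-approximation example together, here is a hedged NumPy sketch of backpropagation for a small 1-H-1 network. The hidden-layer size, learning rate, momentum coefficient λ, the bias terms, and the use of a linear output unit (so σ' = 1 at the output) are assumptions made for illustration, not the configuration used in the original slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Target function from the example slide: g(p) = 1 + sin(pi*p/4), -2 <= p <= 2.
P = np.linspace(-2, 2, 21).reshape(-1, 1)
T = 1.0 + np.sin(np.pi * P / 4.0)

H = 8            # number of hidden units (assumed)
alpha = 0.02     # learning rate (assumed)
lam = 0.8        # momentum coefficient lambda in expr. (7) (assumed)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)   # input -> hidden
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)   # hidden -> output (linear)
dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)            # previous weight changes,
dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)            # kept for the momentum term

for epoch in range(20000):
    for p, t in zip(P, T):
        # Forward pass: steps (1)-(3)
        a1 = p @ W1 + b1          # hidden activations
        y1 = sigmoid(a1)          # hidden outputs
        y2 = y1 @ W2 + b2         # linear output node

        # Backward pass: steps (4)-(8)
        delta2 = (t - y2)                         # expr. (3) with a linear output (sigma' = 1)
        delta1 = y1 * (1 - y1) * (W2 @ delta2)    # expr. (6): sigma'(a_k) * sum_j delta_j w_kj

        # Weight changes with momentum, expr. (7): dw(n) = alpha*delta*x + lambda*dw(n-1)
        dW2 = alpha * np.outer(y1, delta2) + lam * dW2
        db2 = alpha * delta2 + lam * db2
        dW1 = alpha * np.outer(p, delta1) + lam * dW1
        db1 = alpha * delta1 + lam * db1
        W2 += dW2; b2 += db2
        W1 += dW1; b1 += db1

# The trained net should now approximate g(p) on the training interval.
approx = sigmoid(P @ W1 + b1) @ W2 + b2
print("max abs error on training points:", float(np.max(np.abs(approx - T))))
```

The inner loop is the eight-step procedure from "Using the Training Rule": a forward pass per pattern, δ computed at the output node via expr. (3), propagated back to the hidden layer via expr. (6), and weight changes applied via exprs. (4), (5), and the momentum rule (7).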