
What is a neural network (NN)?
Neural Networks
26 April 2001
Renzo Davoli
Sistemi Complessi Adattivi
What is a neural network (NN)?
• According to Haykin, S. (1994), Neural
Networks: A Comprehensive Foundation,
NY: Macmillan, p. 2:
– A neural network is a massively parallel
distributed processor that has a natural
propensity for storing experiential knowledge
and making it available for use. It resembles the
brain in two respects:
• 1. Knowledge is acquired by the network through a learning process.
• 2. Inter-neuron connection strengths known as synaptic weights are used to store the knowledge.
What is a neural network (NN)?
• According to Zurada, J.M. (1992),
Introduction To Artificial Neural Systems,
Boston: PWS Publishing Company, p. xv:
– Artificial neural systems, or neural networks,
are physical cellular systems which can acquire,
store, and utilize experiential knowledge.
• According to the DARPA Neural Network
Study (1988, AFCEA International Press, p.
60):
– ... a neural network is a system composed of
many simple processing elements operating in
parallel whose function is determined by
network structure, connection strengths, and the
processing performed at computing elements or
nodes.
What is a neural network (NN)?
• According to Nigrin, A. (1993), Neural
Networks for Pattern Recognition,
Cambridge, MA: The MIT Press, p. 11:
– A neural network is a circuit composed of a
very large number of simple processing
elements that are neurally based. Each element
operates only on local information.
Furthermore each element operates
asynchronously; thus there is no overall system
clock.
The von Neumann machine and the symbolic paradigm
• The machine must be told in advance, and in great detail, the
exact series of steps required to perform the algorithm. This
series of steps is the computer program.
• The type of data it deals with has to be in a precise format - noisy data confuses the machine.
• The hardware is easily degraded - destroy a few key memory
locations and the machine will stop functioning or `crash'.
• There is a clear correspondence between the semantic
objects being dealt with (numbers, words, database entries
etc) and the machine hardware. Each object can be `pointed
to' in a block of computer memory.
Real Neurons
• Signals are transmitted between neurons by electrical pulses (action-potentials or `spike' trains) travelling along the axon. These pulses impinge on the synapses.
• These are found principally on a set of branching processes emerging from the cell body (soma) known as dendrites.
• Each pulse occurring at a synapse initiates the release of a small amount of chemical substance or neurotransmitter which travels across the synaptic cleft and which is then received at post-synaptic receptor sites on the dendritic side of the synapse. The neurotransmitter becomes bound to molecular sites here which, in turn, initiates a change in the dendritic membrane potential. This post-synaptic-potential (PSP) change may serve to increase (hyperpolarise) or decrease (depolarise) the polarisation of the post-synaptic membrane.
• In the former case, the PSP tends to inhibit generation of pulses in the afferent neuron, while in the latter, it tends to excite the generation of pulses. The size and type of PSP produced will depend on factors such as the geometry of the synapse and the type of neurotransmitter. Each PSP will travel along its dendrite and spread over the soma, eventually reaching the base of the axon (axon-hillock). The afferent neuron sums or integrates the effects of thousands of such PSPs over its dendritic tree and over time. If the integrated potential at the axon-hillock exceeds a threshold, the cell `fires' and generates an action potential or spike which starts to travel along its axon.
• This then initiates the whole sequence of events again in neurons contained in the efferent pathway.
Artificial neurons: the Threshold Logic Unit (TLU)
[McCulloch and Pitts, 1943]
We suppose there are n inputs with signals $x_1, x_2, \ldots, x_n$ and weights $w_1, w_2, \ldots, w_n$.
The signals take on the values `1' or `0' only. That is, the signals are Boolean valued.
The activation a is given by
$$a = \sum_{i=1}^{n} w_i x_i$$
The output y is then given by thresholding the activation:
$$y = \begin{cases} 1 & \text{if } a \geq \theta \\ 0 & \text{if } a < \theta \end{cases}$$
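A minimal sketch of a TLU in Python (the AND weights and threshold here are illustrative, not from the slides):

```python
def tlu(x, w, theta):
    """Threshold Logic Unit: fire (output 1) iff the weighted
    sum of the Boolean inputs reaches the threshold theta."""
    a = sum(wi * xi for wi, xi in zip(w, x))  # activation a
    return 1 if a >= theta else 0

# Example: with weights (1, 1) and threshold 1.5 the TLU computes AND.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, tlu(x, (1, 1), 1.5))
```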
Non-binary signal communication
• It is generally accepted that, in real neurons, information is encoded in terms of the frequency of firing rather than merely the presence or absence of a pulse. There are two ways we can represent this in our artificial neurons.
– First, we may extend the signal range to be positive real numbers.
– We may emulate the real neuron and encode a signal as the frequency of the occurrence of a `1' in a pulse stream.

Theorem of TLU
• A single TLU can classify only linearly separable sets of data: it implements a hyperplane decision boundary, $\sum_i w_i x_i = \theta$, in its input space.
Sigmoid output function
• Encoding frequencies (so managing real numbers instead of binary data) works fine at the input straight away, but the use of a step function limits the output signals to be binary. This may be overcome by `softening' the step function to a continuous `squashing' function like the sigmoid:
$$y = \sigma(a) = \frac{1}{1 + e^{-(a-\theta)/\rho}}$$
• ρ determines the shape of the sigmoid (with threshold θ).
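A minimal sketch of this squashing function in Python (the parameter values tried below are illustrative):

```python
import math

def sigmoid(a, theta=0.0, rho=1.0):
    """Smooth version of the step function: theta shifts the curve
    (threshold), rho controls how steep the transition is."""
    return 1.0 / (1.0 + math.exp(-(a - theta) / rho))

# The smaller rho is, the closer the sigmoid gets to a hard step at theta.
for rho in (1.0, 0.5, 0.1):
    print(rho, [round(sigmoid(a, rho=rho), 3) for a in (-1, 0, 1)])
```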
What can you do with an NN and what not? (1)
• In principle, NNs can compute any computable function, i.e., they can do everything a normal digital computer can do (Valiant, 1988; Siegelmann and Sontag, 1999; Orponen, 2000; Sima and Orponen, 2001).
What can you do with an NN and what not? (2)
• Clearly the style of processing is completely different from von Neumann machines - it is more akin to signal processing than symbol processing. The combining of signals and producing new ones is to be contrasted with the execution of instructions stored in a memory.
• Information is stored in a set of weights rather than a program. The weights are supposed to adapt when the net is shown examples from a training set.
• Nets are robust in the presence of noise: small changes in an input signal will not drastically affect a node's output.
• Nets are robust in the presence of hardware failure: a change in a weight may only affect the output for a few of the possible input patterns.
• High level concepts will be represented as a pattern of activity across many nodes rather than as the contents of a small portion of computer memory.
• The net can deal with `unseen' patterns and generalise from the training set.
• Nets are good at `perceptual' tasks and associative recall. These are just the tasks that the symbolic approach has difficulties with.
What can you do with an NN and what not? (3)
• There are important problems that are so difficult that a neural network will be unable to learn them without memorizing the entire training set, such as:
– Predicting random or pseudo-random numbers.
– Factoring large integers.
– Determining whether a large integer is prime or composite.
– Decrypting anything encrypted by a good algorithm.
• And it is important to understand that there are no
methods for training NNs that can magically create
information that is not contained in the training data.
Categories of NN: Learning
• The two main kinds of learning algorithms are
supervised and unsupervised.
– In supervised learning, the correct results (target values, desired outputs) are known and are given to the NN during training so that the NN can adjust its weights to try to match its outputs to the target values. After training, the NN is tested by giving it only input values, not target values, and seeing how close it comes to outputting the correct target values.
– In unsupervised learning, the NN is not provided with the
correct results during training. Unsupervised NNs usually
perform some kind of data compression, such as
dimensionality reduction or clustering. See "What does
unsupervised learning learn?"
Categories of NN: Topology
• Two major kinds of network topology are feedforward
and feedback.
– In a feedforward NN, the connections between units do not
form cycles. Feedforward NNs usually produce a response to
an input quickly. Most feedforward NNs can be trained using a
wide variety of efficient conventional numerical methods (e.g.
conjugate gradients) in addition to algorithms invented by NN
researchers.
– In a feedback or recurrent NN, there are cycles in the
connections. In some feedback NNs, each time an input is
presented, the NN must iterate for a potentially long time
before it produces a response. Feedback NNs are usually more
difficult to train than feedforward NNs.
Categories of NN: Accepted Data
• Two major kinds of data are categorical and
quantitative.
– Categorical variables take only a finite (technically, countable) number of possible values, and there are usually several or more cases falling into each category. Categorical variables may have symbolic values (e.g., "male" and "female", or "red", "green" and "blue") that must be encoded into numbers before being given to the network (see the sketch after this list). Both supervised learning with categorical target values and unsupervised learning with categorical outputs are called "classification."
– Quantitative variables are numerical measurements of some
attribute, such as length in meters. The measurements must be
made in such a way that at least some arithmetic relations
among the measurements reflect analogous relations among
the attributes of the objects that are measured. Supervised
learning with quantitative target values is called "regression."
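A minimal sketch of such an encoding in Python, using 1-of-C ("one-hot") coding (the colour example follows the text above):

```python
def one_hot(value, categories):
    """Encode a symbolic value as a 1-of-C binary vector."""
    return [1.0 if value == c else 0.0 for c in categories]

colours = ["red", "green", "blue"]
print(one_hot("green", colours))   # [0.0, 1.0, 0.0]
```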
Types of NN: 2 unsupervised learning
• Competitive
– Vector Quantization
• Grossberg - Grossberg (1976)
• Kohonen - Kohonen (1984)
• Conscience - Desieno (1988)
– Self-Organizing Map
• Kohonen - Kohonen (1995), Fausett (1994)
• GTM - Bishop, Svensen and Williams (1997)
• Local Linear - Mulier and Cherkassky (1995)
– Adaptive resonance theory
• ART 1 - Carpenter and Grossberg (1987a), Moore (1988), Fausett (1994)
• ART 2 - Carpenter and Grossberg (1987b), Fausett (1994)
• ART 2-A - Carpenter, Grossberg and Rosen (1991a)
• ART 3 - Carpenter and Grossberg (1990)
• Fuzzy ART - Carpenter, Grossberg and Rosen (1991b)
• DCL: Differential Competitive Learning - Kosko (1992)
• Dimension Reduction - Diamantaras and Kung (1996)
– Hebbian - Hebb (1949), Fausett (1994)
– Oja - Oja (1989)
– Sanger - Sanger (1989)
– Differential Hebbian - Kosko (1992)
• Autoassociation
– Linear autoassociator - Anderson et al. (1977), Fausett (1994)
– BSB: Brain State in a Box - Anderson et al. (1977), Fausett (1994)
– Hopfield - Hopfield (1982), Fausett (1994)
Types of NN: 1 supervised learning
• Feedforward
– Linear
• Hebbian - Hebb (1949), Fausett (1994)
• Perceptron - Rosenblatt (1958), Minsky and Papert (1969/1988), Fausett (1994)
• Adaline - Widrow and Hoff (1960), Fausett (1994)
• Higher Order - Bishop (1995)
• Functional Link - Pao (1989)
– MLP: Multilayer perceptron - Bishop (1995), Reed and Marks (1999), Fausett (1994)
• Backprop - Rumelhart, Hinton, and Williams (1986)
• Cascade Correlation - Fahlman and Lebiere (1990), Fausett (1994)
• Quickprop - Fahlman (1989)
• RPROP - Riedmiller and Braun (1993)
– RBF networks - Bishop (1995), Moody and Darken (1989), Orr (1996)
• OLS: Orthogonal Least Squares - Chen, Cowan and Grant (1991)
• CMAC: Cerebellar Model Articulation Controller - Albus (1975), Brown and Harris (1994)
– Classification only
• LVQ: Learning Vector Quantization - Kohonen (1988), Fausett (1994)
• PNN: Probabilistic Neural Network - Specht (1990), Masters (1993), Hand (1982), Fausett (1994)
– Regression only
• GRNN: General Regression Neural Network - Specht (1991), Nadaraya (1964), Watson (1964)
• Feedback - Hertz, Krogh, and Palmer (1991), Medsker and Jain (2000)
– BAM: Bidirectional Associative Memory - Kosko (1992), Fausett (1994)
– Boltzmann Machine - Ackley et al. (1985), Fausett (1994)
– Recurrent time series
• Backpropagation through time - Werbos (1990)
• Elman - Elman (1990)
• FIR: Finite Impulse Response - Wan (1990)
• Jordan - Jordan (1986)
• Real-time recurrent network - Williams and Zipser (1989)
• Recurrent backpropagation - Pineda (1989), Fausett (1994)
• TDNN: Time Delay NN - Lang, Waibel and Hinton (1990)
• Competitive
– ARTMAP - Carpenter, Grossberg and Reynolds (1991)
– Fuzzy ARTMAP - Carpenter, Grossberg, Markuzon, Reynolds and Rosen (1992), Kasuba (1993)
– Gaussian ARTMAP - Williamson (1995)
– Counterpropagation - Hecht-Nielsen (1987; 1988; 1990), Fausett (1994)
– Neocognitron - Fukushima, Miyake, and Ito (1983), Fukushima, (1988), Fausett (1994)
Training TLUs
(first simple example of supervised learning)
• The training set for the TLU will consist of a set of pairs {v, t}, where v is an input vector and t is the target class or output (`1' or `0') that v belongs to (i.e. the expected output).
• The learning rule (or training rule) is:
$$\mathbf{w}' = \mathbf{w} + \alpha\,(t - y)\,\mathbf{v}$$
• The parameter α is called the learning rate.
• This is named “the Perceptron learning rule”.
Training TLUs: numerical example
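A minimal sketch of the rule in Python, using AND-gate data (the data set, α = 0.25 and the epoch count are illustrative choices, not the original slide's example):

```python
# Perceptron learning rule: w' = w + alpha * (t - y) * v,
# with the threshold handled as an extra weight on a constant input 1.
def train_tlu(data, alpha=0.25, epochs=20):
    w = [0.0, 0.0, 0.0]                     # two weights + bias weight
    for _ in range(epochs):
        for v, t in data:
            v = list(v) + [1.0]             # append bias input
            a = sum(wi * vi for wi, vi in zip(w, v))
            y = 1 if a >= 0 else 0          # TLU output
            w = [wi + alpha * (t - y) * vi for wi, vi in zip(w, v)]
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_tlu(AND))                       # weights of a separating line
```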
Training TLUs: convergence theorem
• If the training set is linearly separable, the Perceptron learning rule is guaranteed to find a separating weight vector in a finite number of steps.
Perceptron (Rosenblatt, 1959)
• note: read d(x) as the target output
Adaline
• An Adaline is a perceptron-like network.
• In a simple physical implementation this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals.
• An Adaline is an array of such computing elements.
Adaline: the delta rule
• The Adaline learning rule is a refinement of the Perceptron rule.
• The Least Mean Square (LMS) procedure finds the values of all the weights that minimize the error function by a method called gradient descent.
Adaline: the delta rule
• The total error E is defined to be
$$E = \sum_p E^p = \frac{1}{2} \sum_p (d^p - y^p)^2$$
where $d^p$ is the desired and $y^p$ the actual output for training pattern p.
• The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:
$$\Delta_p w_j = -\gamma\,\frac{\partial E^p}{\partial w_j}$$
• γ is the learning rate.
Adaline: the delta rule - a bit of calculus...
• For a linear unit $y^p = \sum_j w_j x_j^p$, the chain rule gives
$$\frac{\partial E^p}{\partial w_j} = \frac{\partial E^p}{\partial y^p}\,\frac{\partial y^p}{\partial w_j} = -(d^p - y^p)\,x_j^p$$
• so the weight update becomes
$$\Delta_p w_j = \gamma\,(d^p - y^p)\,x_j^p$$
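A minimal sketch of this gradient descent for a single linear unit in Python (the data set, γ and the epoch count are illustrative):

```python
def train_adaline(data, gamma=0.1, epochs=500):
    """Gradient descent on E = 1/2 * sum((d - y)^2) for a linear unit."""
    w = [0.0] * (len(data[0][0]) + 1)       # weights + bias weight
    for _ in range(epochs):
        for x, d in data:
            x = list(x) + [1.0]             # bias input
            y = sum(wi * xi for wi, xi in zip(w, x))   # linear output
            # delta rule: w_j += gamma * (d - y) * x_j
            w = [wi + gamma * (d - y) * xi for wi, xi in zip(w, x)]
    return w

# Fit y = 2*x1 - x2 + 1 from a few samples.
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2)]
print(train_adaline(data))                  # approaches [2, -1, 1]
```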
Using TLUs and perceptrons as classifiers
• All perceptron-like networks are subject to the TLU theorem: they are able to classify only linearly separable sets of data.
The XOR problem
• XOR is not linearly separable! No single line in the input plane can separate the inputs (0,1) and (1,0) (output 1) from (0,0) and (1,1) (output 0).
Solution of the XOR problem
• A multilayer perceptron is able to solve the XOR problem.
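A minimal sketch in Python of one such multilayer solution (these hidden-unit weights and thresholds are one illustrative choice):

```python
def tlu(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def xor_mlp(x1, x2):
    """Hidden layer computes OR and NAND; the output unit ANDs them."""
    h1 = tlu((x1, x2), (1, 1), 0.5)      # OR
    h2 = tlu((x1, x2), (-1, -1), -1.5)   # NAND
    return tlu((h1, h2), (1, 1), 1.5)    # AND

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_mlp(*x))                # prints the XOR truth table
```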
Theorem: Multi-layer perceptrons can do everything
• Each function $f: \{-1,1\}^n \to \{-1,1\}^m$ can be computed by a multilayer perceptron
• with a single hidden layer,
• but the number of hidden nodes can be up to $2^n$.
The learning rule for multi-layer perceptrons:
the generalized delta rule (calculus...)
• The weight change is proportional to the error signal δ of the receiving unit and the output y of the sending unit:
$$\Delta w_{jk} = \gamma\,\delta_k\,y_j$$
• output layer: $\delta_o = (d_o - y_o)\,F'(a_o)$
• other layers: $\delta_h = F'(a_h)\,\sum_o \delta_o\,w_{ho}$ (the error signals are propagated back from the layer above)

Generalized delta rule: the core of the backprop model
• Weight adjustments with sigmoid activation function: with $F(a) = 1/(1+e^{-a})$ the derivative is $F'(a) = y\,(1-y)$, so
• output layer: $\delta_o = (d_o - y_o)\,y_o\,(1-y_o)$
• hidden layers: $\delta_h = y_h\,(1-y_h)\,\sum_o \delta_o\,w_{ho}$
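A minimal sketch of these updates in Python/numpy, training a 2-2-1 sigmoid network on XOR (the network size, γ = 0.5, iteration count and random seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([[0], [1], [1], [0]], float)        # XOR targets

W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)  # input -> hidden
W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros(1)  # hidden -> output
gamma = 0.5

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(10000):
    # forward pass
    yh = sig(X @ W1 + b1)                        # hidden activities
    yo = sig(yh @ W2 + b2)                       # output activities
    # backward pass: delta terms from the generalized delta rule
    do = (D - yo) * yo * (1 - yo)                # output layer
    dh = yh * (1 - yh) * (do @ W2.T)             # hidden layer
    # weight changes: Delta w_jk = gamma * delta_k * y_j
    W2 += gamma * yh.T @ do; b2 += gamma * do.sum(0)
    W1 += gamma * X.T @ dh;  b1 += gamma * dh.sum(0)

# should approach [0, 1, 1, 0] (backprop may occasionally stall in a
# local minimum; see the deficiencies discussed below)
print(np.round(yo.ravel(), 2))
```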
Learning Rate and Momentum
• Learning is faster with a larger learning rate γ, but too large a rate causes oscillation. A momentum term makes the weight change depend also on the previous change: $\Delta w_{jk}(t+1) = \gamma\,\delta_k\,y_j + \alpha\,\Delta w_{jk}(t)$, where α is the momentum parameter.
Understanding back-propagation
The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations?
The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear. What happens in the above equations is the following. When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units. We know from the delta rule that, in order to reduce an error, we have to adapt its incoming weights according to
$$\Delta w_{ho} = \gamma\,\delta_o\,y_h$$
That's step one. But it alone is not enough: when we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem. In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a value of δ for the hidden units. This is solved by the chain rule, which does the following: distribute the error δ_o of an output unit o to all the hidden units that it is connected to, weighted by this connection.
Deficiencies of back-propagation
• Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden unit or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one. Since the weight adjustments are proportional to $y\,(1-y)$, they then become almost zero and training can come to a virtual standstill.
• Local minima. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby.
Tuning BackProp: # of samples of the learning set
• A network trained on too few samples may reproduce the training set well yet generalise poorly to unseen patterns; generalisation improves as the number of training samples grows.
Tuning BackProp: # of hidden units
• Too few hidden units leave the network unable to represent the target function; too many let it memorise the training set (overfitting) and degrade generalisation.
Associative memories
This slide has been intentionally left blank!
• `Remembering' something in common parlance usually consists
of associating something with a sensory cue. For example,
someone may say something, like the name of a celebrity, and we
immediately recall a chain of events or some experience related
to the celebrity - we may have seen them on TV recently for
example. Or, we may see a picture of a place visited in our
childhood and the image recalls memories of the time. The sense
of smell (olfaction) is known to be especially evocative in this
way.
• On a more mundane level, but still in the same category, we may
be presented with a partially obliterated letter, or one seen
through a window when it is raining (letter + noise) and go on to
recognize the letter.
The nature of associative
memory
The Hopfield network
Hopfield Nets: convergence Theorem
(and correlation net-energy)
A physical analogy with memory
Teaching Hopfield nets to be
associative memories
Hopfield nets: learning rule
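A minimal sketch in Python, assuming the standard Hebbian prescription $w_{ij} = \sum_p x_i^p x_j^p$ with $w_{ii} = 0$ for bipolar (±1) patterns (the 8-unit pattern below is illustrative):

```python
import numpy as np

def hopfield_train(patterns):
    """Hebbian prescription: w_ij = sum over patterns of x_i * x_j."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)                # no self-connections
    return W

def hopfield_recall(W, x, steps=10):
    """Asynchronous updates until the state settles."""
    x = x.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(x)):
            x[i] = 1 if W[i] @ x >= 0 else -1
    return x

# Store one 8-unit bipolar pattern and recall it from a noisy cue.
p = np.array([[1, -1, 1, -1, 1, 1, -1, -1]])
W = hopfield_train(p)
cue = p[0].copy(); cue[0] = -cue[0]       # flip one bit (noise)
print(hopfield_recall(W, cue))            # settles back to the stored pattern
```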
Analogue Hopfield nets: a NN solution to the TSP (travelling salesman problem)

Boltzmann Machines
Self-Organizing Networks
This slide has been intentionally left blank!
Competitive Learning
• Competitive Learning is an unsupervised learning procedure that divides input patterns into clusters.

Winner Selection 1: dot product
• The winning unit k is the one whose weight vector gives the largest activation: $k = \arg\max_i\,\mathbf{w}_i \cdot \mathbf{x}$.

Competitive learning: geometrical meaning
• With normalised weight and input vectors, the winner is the unit whose weight vector points closest to the direction of the input; learning rotates it further towards the input.

Winner Selection 2: Euclidean distance
• The winner is the unit whose weight vector is closest to the input: $k = \arg\min_i \|\mathbf{w}_i - \mathbf{x}\|$.

Learning Vector Quantization (LVQ): Example
LVQ2 strategy
• LVQ2 refines LVQ by adjusting the two codebook vectors nearest to the input when the input falls close to the decision boundary between them.
Kohonen Networks

Kohonen network: learning rule
• The winner k and its topological neighbours are moved towards the input: $\Delta \mathbf{w}_i = \eta\,\Lambda(i, k)\,(\mathbf{x} - \mathbf{w}_i)$, where Λ(i, k) is a neighbourhood function centred on the winner.

Kohonen network: example
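A minimal sketch of a one-dimensional Kohonen map in Python (the map size, η, Gaussian neighbourhood and uniform 2-D inputs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, eta, width = 10, 0.2, 2.0
W = rng.uniform(0, 1, (n_units, 2))            # one weight vector per map unit

for t in range(2000):
    x = rng.uniform(0, 1, 2)                   # random 2-D input pattern
    k = np.argmin(np.linalg.norm(W - x, axis=1))   # Euclidean winner
    # Gaussian neighbourhood centred on the winner, shrinking over time
    d = np.arange(n_units) - k
    Lam = np.exp(-d**2 / (2 * (width * (1 - t / 2000) + 0.1) ** 2))
    W += eta * Lam[:, None] * (x - W)          # move units towards the input

print(np.round(W, 2))   # neighbouring units end up with similar weights
```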
Credits
• Neural Network FAQ: ftp://ftp.sas.com/pub/neural/FAQ.html
• Dr. Leslie Smith's brief on-line introduction to NNs: http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html
• Kevin Gurney, An Introduction to Neural Networks: http://www.shef.ac.uk/psychology/gurney/notes/index.html
• Ben Kröse and Patrick van der Smagt, An Introduction to Neural Networks: ftp://ftp.wins.uva.nl/pub/computer-systems/autsys/reports/neuro-intro/neuro-intro.ps.gz