Neural Networks
Renzo Davoli, Sistemi Complessi Adattivi, 26 April 2001

What is a neural network (NN)?

• According to Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:
  – A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
    1. Knowledge is acquired by the network through a learning process.
    2. Inter-neuron connection strengths known as synaptic weights are used to store the knowledge.
• According to Zurada, J.M. (1992), Introduction to Artificial Neural Systems, Boston: PWS Publishing Company, p. xv:
  – Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.
• According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60):
  – ...a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.
• According to Nigrin, A. (1993), Neural Networks for Pattern Recognition, Cambridge, MA: The MIT Press, p. 11:
  – A neural network is a circuit composed of a very large number of simple processing elements that are neurally based. Each element operates only on local information. Furthermore, each element operates asynchronously; thus there is no overall system clock.

The von Neumann machine and the symbolic paradigm

• The machine must be told in advance, and in great detail, the exact series of steps required to perform the algorithm. This series of steps is the computer program.
• The data it deals with must be in a precise format; noisy data confuses the machine.
• The hardware is easily degraded: destroy a few key memory locations and the machine will stop functioning or 'crash'.
• There is a clear correspondence between the semantic objects being dealt with (numbers, words, database entries, etc.) and the machine hardware. Each object can be 'pointed to' in a block of computer memory.

Real Neurons

• Signals are transmitted between neurons by electrical pulses (action potentials or 'spike' trains) travelling along the axon. These pulses impinge on the synapses, which are found principally on a set of branching processes emerging from the cell body (soma), known as dendrites.
• Each pulse occurring at a synapse initiates the release of a small amount of a chemical substance, or neurotransmitter, which travels across the synaptic cleft and is then received at post-synaptic receptor sites on the dendritic side of the synapse. The neurotransmitter binds to molecular sites there, which in turn initiates a change in the dendritic membrane potential.
• This post-synaptic-potential (PSP) change may serve to increase (hyperpolarise) or decrease (depolarise) the polarisation of the post-synaptic membrane. In the former case the PSP tends to inhibit the generation of pulses in the afferent neuron, while in the latter it tends to excite the generation of pulses.
• The size and type of PSP produced depend on factors such as the geometry of the synapse and the type of neurotransmitter. Each PSP travels along its dendrite and spreads over the soma, eventually reaching the base of the axon (axon hillock).
• The afferent neuron sums or integrates the effects of thousands of such PSPs over its dendritic tree and over time. If the integrated potential at the axon hillock exceeds a threshold, the cell 'fires' and generates an action potential, or spike, which starts to travel along its axon. This then initiates the whole sequence of events again in neurons on the efferent pathway.

Artificial neurons: the Threshold Logic Unit (TLU) [McCulloch and Pitts, 1943]

• We suppose there are n inputs with signals x1, ..., xn and weights w1, ..., wn.
• The signals take on the values '1' or '0' only; that is, the signals are Boolean valued.
• The activation a is given by the weighted sum of the inputs:
    a = Σi wi xi
• The output y is then given by thresholding the activation:
    y = 1 if a ≥ θ, otherwise y = 0,
  where θ is the unit's threshold.
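A minimal sketch of a TLU in Python may make this concrete. The weights and threshold below are illustrative values, chosen so that the unit computes Boolean AND; they are not part of the original formulation.

```python
# A minimal Threshold Logic Unit (TLU):
# activation a = weighted sum of Boolean inputs; output y = 1 iff a >= theta.

def tlu(inputs, weights, theta):
    """Return 1 if the weighted sum of the inputs reaches the threshold."""
    a = sum(w * x for w, x in zip(weights, inputs))  # activation
    return 1 if a >= theta else 0

# Illustrative weights and threshold: this TLU computes the Boolean AND.
weights, theta = [1.0, 1.0], 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", tlu(x, weights, theta))
```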
Non-binary signal communication

• It is generally accepted that, in real neurons, information is encoded in terms of the frequency of firing rather than merely the presence or absence of a pulse. There are two ways we can represent this in our artificial neurons:
  – First, we may extend the signal range to be positive real numbers.
  – Second, we may emulate the real neuron and encode a signal as the frequency of occurrence of a '1' in a pulse stream.

Theorem of TLU

• A single TLU separates its input space with a hyperplane; as the classifier slides below make explicit, perceptron-like units can classify only linearly separable sets of data.

Sigmoid output function

• Encoding frequencies (so managing real numbers instead of binary data) works fine at the input straight away, but the use of a step function limits the output signals to be binary. This may be overcome by 'softening' the step function into a continuous 'squashing' function like the sigmoid:
    y = 1 / (1 + e^(-(a - θ)/ρ))
  where ρ determines the shape of the sigmoid and θ is the threshold.
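A minimal sketch of this squashing function, using the form given above; the particular values of ρ and θ are illustrative:

```python
import math

def sigmoid(a, theta=0.0, rho=1.0):
    """Smooth 'squashing' of the activation a: approaches the hard
    threshold at theta as rho -> 0 and flattens out as rho grows."""
    return 1.0 / (1.0 + math.exp(-(a - theta) / rho))

# The output is now a real number in (0, 1) rather than a hard 0/1.
for a in (-2.0, 0.0, 2.0):
    print(a, round(sigmoid(a, theta=0.0, rho=0.5), 3))
```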
What can you do with an NN and what not? (1)

• In principle, NNs can compute any computable function, i.e., they can do everything a normal digital computer can do (Valiant, 1988; Siegelmann and Sontag, 1999; Orponen, 2000; Sima and Orponen, 2001).

What can you do with an NN and what not? (2)

• Clearly the style of processing is completely different from von Neumann machines: it is more akin to signal processing than symbol processing. The combining of signals and producing of new ones is to be contrasted with the execution of instructions stored in a memory.
• Information is stored in a set of weights rather than a program. The weights are supposed to adapt when the net is shown examples from a training set.
• Nets are robust in the presence of noise: small changes in an input signal will not drastically affect a node's output.
• Nets are robust in the presence of hardware failure: a change in a weight may only affect the output for a few of the possible input patterns.
• High-level concepts are represented as a pattern of activity across many nodes rather than as the contents of a small portion of computer memory.
• The net can deal with 'unseen' patterns and generalise from the training set.
• Nets are good at 'perceptual' tasks and associative recall. These are just the tasks that the symbolic approach has difficulties with.

What can you do with an NN and what not? (3)

• There are important problems that are so difficult that a neural network will be unable to learn them without memorizing the entire training set, such as:
  – Predicting random or pseudo-random numbers.
  – Factoring large integers.
  – Determining whether a large integer is prime or composite.
  – Decrypting anything encrypted by a good algorithm.
• And it is important to understand that there are no methods for training NNs that can magically create information that is not contained in the training data.

Categories of NN: Learning

• The two main kinds of learning algorithms are supervised and unsupervised.
  – In supervised learning, the correct results (target values, desired outputs) are known and are given to the NN during training so that the NN can adjust its weights to try to match its outputs to the target values. After training, the NN is tested by giving it only input values, not target values, and seeing how close it comes to outputting the correct target values.
  – In unsupervised learning, the NN is not provided with the correct results during training. Unsupervised NNs usually perform some kind of data compression, such as dimensionality reduction or clustering. See "What does unsupervised learning learn?" in the Neural Network FAQ.

Categories of NN: Topology

• Two major kinds of network topology are feedforward and feedback.
  – In a feedforward NN, the connections between units do not form cycles. Feedforward NNs usually produce a response to an input quickly. Most feedforward NNs can be trained using a wide variety of efficient conventional numerical methods (e.g. conjugate gradients) in addition to algorithms invented by NN researchers.
  – In a feedback or recurrent NN, there are cycles in the connections. In some feedback NNs, each time an input is presented the NN must iterate for a potentially long time before it produces a response. Feedback NNs are usually more difficult to train than feedforward NNs.

Categories of NN: Accepted Data

• Two major kinds of data are categorical and quantitative.
  – Categorical variables take only a finite (technically, countable) number of possible values, and there are usually several or more cases falling into each category. Categorical variables may have symbolic values (e.g., "male" and "female", or "red", "green" and "blue") that must be encoded into numbers before being given to the network (see the sketch below). Both supervised learning with categorical target values and unsupervised learning with categorical outputs are called "classification".
  – Quantitative variables are numerical measurements of some attribute, such as length in meters. The measurements must be made in such a way that at least some arithmetic relations among the measurements reflect analogous relations among the attributes of the objects that are measured. Supervised learning with quantitative target values is called "regression".
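As an aside on the encoding of categorical values mentioned above, a usual choice is a 1-of-N ('one-hot') code; a minimal sketch, with illustrative category names:

```python
def one_hot(value, categories):
    """Encode a symbolic value as a 1-of-N binary vector."""
    return [1.0 if value == c else 0.0 for c in categories]

colours = ["red", "green", "blue"]
print(one_hot("green", colours))  # [0.0, 1.0, 0.0]
```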
Types of NN: 1 supervised learning

• Feedforward
  – Linear
    • Hebbian - Hebb (1949), Fausett (1994)
    • Perceptron - Rosenblatt (1958), Minsky and Papert (1969/1988), Fausett (1994)
    • Adaline - Widrow and Hoff (1960), Fausett (1994)
    • Higher Order - Bishop (1995)
    • Functional Link - Pao (1989)
  – MLP: Multilayer Perceptron - Bishop (1995), Reed and Marks (1999), Fausett (1994)
    • Backprop - Rumelhart, Hinton, and Williams (1986)
    • Cascade Correlation - Fahlman and Lebiere (1990), Fausett (1994)
    • Quickprop - Fahlman (1989)
    • RPROP - Riedmiller and Braun (1993)
  – RBF networks - Bishop (1995), Moody and Darken (1989), Orr (1996)
    • OLS: Orthogonal Least Squares - Chen, Cowan and Grant (1991)
    • CMAC: Cerebellar Model Articulation Controller - Albus (1975), Brown and Harris (1994)
  – Classification only
    • LVQ: Learning Vector Quantization - Kohonen (1988), Fausett (1994)
    • PNN: Probabilistic Neural Network - Specht (1990), Masters (1993), Hand (1982), Fausett (1994)
  – Regression only
    • GRNN: General Regression Neural Network - Specht (1991), Nadaraya (1964), Watson (1964)
• Feedback - Hertz, Krogh, and Palmer (1991), Medsker and Jain (2000)
  – BAM: Bidirectional Associative Memory - Kosko (1992), Fausett (1994)
  – Boltzmann Machine - Ackley et al. (1985), Fausett (1994)
  – Recurrent time series
    • Backpropagation through time - Werbos (1990)
    • Elman - Elman (1990)
    • FIR: Finite Impulse Response - Wan (1990)
    • Jordan - Jordan (1986)
    • Real-time recurrent network - Williams and Zipser (1989)
    • Recurrent backpropagation - Pineda (1989), Fausett (1994)
    • TDNN: Time Delay NN - Lang, Waibel and Hinton (1990)
• Competitive
  – ARTMAP - Carpenter, Grossberg and Reynolds (1991)
  – Fuzzy ARTMAP - Carpenter, Grossberg, Markuzon, Reynolds and Rosen (1992), Kasuba (1993)
  – Gaussian ARTMAP - Williamson (1995)
  – Counterpropagation - Hecht-Nielsen (1987; 1988; 1990), Fausett (1994)
  – Neocognitron - Fukushima, Miyake, and Ito (1983), Fukushima (1988), Fausett (1994)

Types of NN: 2 unsupervised learning

• Competitive
  – Vector Quantization
    • Grossberg - Grossberg (1976)
    • Kohonen - Kohonen (1984)
    • Conscience - Desieno (1988)
  – Self-Organizing Map
    • Kohonen - Kohonen (1995), Fausett (1994)
    • GTM: Generative Topographic Mapping - Bishop, Svensen and Williams (1997)
    • Local Linear - Mulier and Cherkassky (1995)
  – Adaptive Resonance Theory
    • ART 1 - Carpenter and Grossberg (1987a), Moore (1988), Fausett (1994)
    • ART 2 - Carpenter and Grossberg (1987b), Fausett (1994)
    • ART 2-A - Carpenter, Grossberg and Rosen (1991a)
    • ART 3 - Carpenter and Grossberg (1990)
    • Fuzzy ART - Carpenter, Grossberg and Rosen (1991b)
    • DCL: Differential Competitive Learning - Kosko (1992)
• Dimension Reduction - Diamantaras and Kung (1996)
  – Hebbian - Hebb (1949), Fausett (1994)
  – Oja - Oja (1989)
  – Sanger - Sanger (1989)
  – Differential Hebbian - Kosko (1992)
• Autoassociation
  – Linear autoassociator - Anderson et al. (1977), Fausett (1994)
  – BSB: Brain State in a Box - Anderson et al. (1977), Fausett (1994)
  – Hopfield - Hopfield (1982), Fausett (1994)

Training TLUs (a first simple example of supervised learning)

• The training set for the TLU consists of a set of pairs {v, t}, where v is an input vector and t is the target class or output ('1' or '0') that v belongs to (i.e., the expected output).
• The learning rule (or training rule) is:
    w' = w + α (t - y) v
  where y is the TLU's output in response to v.
• The parameter α is called the learning rate.
• This is named "the Perceptron learning rule".
• (The original slides worked a numerical example at this point; a runnable sketch in the same spirit is given below.)

Training TLUs: convergence theorem

• Perceptron convergence theorem: if the training set is linearly separable, the perceptron learning rule finds a separating weight vector in a finite number of steps.

Perceptron (Rosenblatt 1959)

• (The slide illustrated Rosenblatt's perceptron and its training; note: read d(x) as the target.)

Adaline

• Adaline is a perceptron-like network.
• In a simple physical implementation this device consists of a set of controllable resistors connected to a circuit which can sum up the currents caused by the input voltage signals.
• An adaline is an array of such adaptive linear computing elements.

Adaline: the delta rule

• The Adaline learning rule is a refinement of the Perceptron rule.
• The Least Mean Square (LMS) procedure finds the values of all the weights that minimize the error function by a method called gradient descent.
• The total error E is defined to be
    E = ½ Σp (d^p - y^p)²
  where d^p is the target and y^p the actual output for training pattern p.
• The idea is to make a change in each weight proportional to the negative of the derivative of the error, as measured on the current pattern, with respect to that weight:
    Δp wj = -γ ∂E^p/∂wj
• γ is the learning rate.

Adaline: the delta rule, a bit of calculus...

• For a linear unit y = Σj wj xj, the derivative is ∂E^p/∂wj = -(d^p - y^p) xj, so the update becomes
    Δp wj = γ (d^p - y^p) xj,
  which is the delta (LMS) rule (a runnable sketch follows below).

Using TLUs and perceptrons as classifiers

• All perceptron-like networks are subject to the TLU theorem: they are able to classify only linearly separable sets of data.
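In place of the deck's numerical example, here is a minimal runnable sketch of the perceptron learning rule training a TLU on the linearly separable Boolean AND problem. The learning rate, the zero initial weights, and the trick of learning the threshold as a weight on a constant -1 input are illustrative choices, not prescriptions from the slides:

```python
# Perceptron learning rule for a TLU: w <- w + alpha * (t - y) * v.
# The threshold is trained too, as a weight on a constant -1 input.

def output(w, v):
    return 1 if sum(wi * vi for wi, vi in zip(w, v)) >= 0 else 0

training_set = [  # (input vector, target) pairs for Boolean AND
    ((0, 0, -1), 0), ((0, 1, -1), 0), ((1, 0, -1), 0), ((1, 1, -1), 1),
]
w = [0.0, 0.0, 0.0]   # initial weights (the last one plays the role of theta)
alpha = 0.25          # learning rate

for epoch in range(20):
    errors = 0
    for v, t in training_set:
        y = output(w, v)
        if y != t:
            w = [wi + alpha * (t - y) * vi for wi, vi in zip(w, v)]
            errors += 1
    if errors == 0:   # convergence theorem: guaranteed for separable data
        break

print("weights:", w)
for v, t in training_set:
    print(v[:2], "target", t, "output", output(w, v))
```

Because AND is linearly separable, the convergence theorem guarantees the loop stops after finitely many corrections.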
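And a minimal sketch of the delta (LMS) rule for a single linear unit, as derived above; the noisy linear data set and the learning rate are made up for the example:

```python
import random

random.seed(0)
# Illustrative data: d = 2*x1 - x2 + noise, to be fitted by a linear unit.
data = [((x1, x2, 1.0),                      # constant input provides a bias
         2 * x1 - x2 + random.gauss(0, 0.1))
        for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]

w = [0.0, 0.0, 0.0]
gamma = 0.1                                  # learning rate

for epoch in range(200):
    for x, d in data:
        y = sum(wj * xj for wj, xj in zip(w, x))                 # linear output
        w = [wj + gamma * (d - y) * xj for wj, xj in zip(w, x)]  # delta rule

print("learned weights:", [round(wj, 2) for wj in w])  # close to [2, -1, 0]
```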
The XOR problem

• XOR is not linearly separable! A single TLU or perceptron therefore cannot compute it.

Solution of the XOR problem

• A multilayer perceptron is able to solve the XOR problem: a hidden layer re-maps the inputs into a representation that is linearly separable.

Theorem: multi-layer perceptrons can do everything

• Every function f: {-1,1}^n -> {-1,1}^m can be computed by a multilayer perceptron
  – with a single hidden layer,
  – but the number of hidden nodes can be up to 2^n.

Generalized delta rule (calculus...)

• (The derivation was given on the slides; the resulting rule is summarised below, and a runnable sketch appears at the end of this section.)

The learning rule for multi-layer perceptrons: the generalized delta rule

• Output layer: the error signal of output unit o on pattern p is
    δo^p = (do^p - yo^p) F'(so^p)
• Other layers: the error signal of hidden unit h is obtained by propagating the output deltas backwards,
    δh^p = F'(sh^p) Σo who δo^p
• Every weight is then adjusted by
    Δp wjk = γ δk^p yj^p,
  where yj^p is the signal entering the weight and s is a unit's total input.

Generalized delta rule: the core of the backprop model

• Weight adjustments with the sigmoid activation function: the derivative takes the convenient form F'(s) = y(1 - y), so the adjustments can be computed from the unit outputs alone.
• Learning rate and momentum: a momentum term is commonly added, Δwjk(t+1) = γ δk yj + μ Δwjk(t), to smooth and speed up gradient descent.

Understanding back-propagation

• The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations? The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear.
• When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units.
• We know from the delta rule that, in order to reduce an error, we have to adapt the output unit's incoming weights according to Δwho = γ δo yh. That is step one. But it alone is not enough: if we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem.
• In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a target value for the hidden units. This is solved by the chain rule, which does the following: distribute the error of an output unit o to all the hidden units that it is connected to, weighted by this connection.

Deficiencies of back-propagation

• Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one. Since the weight adjustments are proportional to the derivative of the sigmoid, which is nearly zero in these saturated regions, training can come to a virtual standstill.
• Local minima. The error surface of a complex network is full of hills and valleys. Because of gradient descent, the network can get trapped in a local minimum even when there is a much deeper minimum nearby.

Tuning BackProp: # of samples of the learning set / # of hidden units

• (These slides showed, through figures not reproduced here, how generalisation depends on the number of training samples and on the number of hidden units.)
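Tying the preceding sections together, here is a minimal sketch of a 2-2-1 multilayer perceptron with sigmoid units trained on XOR by the generalized delta rule. The initialisation, learning rate, and epoch count are illustrative, and, as noted under the deficiencies above, an unlucky initialisation can leave plain gradient descent in a local minimum:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# The XOR training set: four input patterns and their targets.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([[0.], [1.], [1.], [0.]])

# A 2-2-1 network: weight matrices plus explicit bias vectors.
W1 = rng.normal(0.0, 1.0, (2, 2)); b1 = np.zeros(2)   # input -> hidden
W2 = rng.normal(0.0, 1.0, (2, 1)); b2 = np.zeros(1)   # hidden -> output
gamma = 0.5                                           # learning rate

for epoch in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)                    # hidden activations
    y = sigmoid(h @ W2 + b2)                    # network outputs
    # Backward pass (generalized delta rule); sigmoid derivative = y(1-y).
    delta_o = (d - y) * y * (1.0 - y)           # deltas of the output unit
    delta_h = (delta_o @ W2.T) * h * (1.0 - h)  # hidden deltas via chain rule
    # Gradient descent on the squared error.
    W2 += gamma * h.T @ delta_o; b2 += gamma * delta_o.sum(axis=0)
    W1 += gamma * X.T @ delta_h; b1 += gamma * delta_h.sum(axis=0)

print(np.round(y.ravel(), 2))  # usually converges to about [0, 1, 1, 0]
```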
Associative memories

(This slide has been intentionally left blank!)

• 'Remembering' something in common parlance usually consists of associating something with a sensory cue. For example, someone may say something, like the name of a celebrity, and we immediately recall a chain of events or some experience related to the celebrity: we may have seen them on TV recently, for example. Or we may see a picture of a place visited in our childhood, and the image recalls memories of the time. The sense of smell (olfaction) is known to be especially evocative in this way.
• On a more mundane level, but still in the same category, we may be presented with a partially obliterated letter, or one seen through a window when it is raining (letter + noise), and go on to recognize the letter.

The nature of associative memory / The Hopfield network

• The Hopfield network (Hopfield, 1982) is a fully connected feedback network that can act as an associative memory: stored patterns are stable states into which noisy cues relax.

Hopfield nets: convergence theorem (and correlation net-energy) / A physical analogy with memory

• (These slides presented the convergence theorem relating the net's energy to the correlation between states, together with a physical analogy with memory; the equations and figures are not reproduced here.)

Teaching Hopfield nets to be associative memories / Hopfield nets: learning rule

• Patterns are stored by a Hebbian prescription on the connection weights; recall then proceeds by repeated asynchronous threshold updates until the net settles into a stored pattern (a runnable sketch is given at the end of this section).

Analogue Hopfield nets: a NN solution to the TSP (travelling salesman problem)

• Analogue (continuous-valued) Hopfield nets can also be used to find approximate solutions to combinatorial optimisation problems such as the travelling salesman problem.

Boltzmann Machines

• (The slide introduced Boltzmann machines, the stochastic relatives of Hopfield nets; details not reproduced here.)

Self-Organizing Networks

(This slide has been intentionally left blank!)

Competitive Learning

• Competitive Learning is an unsupervised learning procedure that divides input patterns into clusters.
• Winner selection 1: dot product. The winner is the unit whose weight vector gives the largest dot product with the input.
• Winner selection 2: Euclidean distance. The winner is the unit whose weight vector is closest to the input; geometrically, competitive learning moves the weight vectors towards the centres of the input clusters.

Learning Vector Quantization (LVQ) and the LVQ2 strategy

• (The slides worked an LVQ example and described the LVQ2 refinement; figures not reproduced.)

Kohonen networks: learning rule and example

• A Kohonen network (self-organizing map) combines competitive learning with a topological neighbourhood: the learning rule moves the winning unit and its neighbours on the map towards the input (a runnable sketch of the winner-take-all core follows below).
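A minimal sketch of the winner-take-all core of competitive learning with Euclidean winner selection. A full Kohonen network would also update the winner's map neighbours with a decaying neighbourhood function, which is omitted here; the cluster data and learning rate are invented for the example:

```python
import random

random.seed(0)

def sample():
    """Draw an illustrative 2-D input from one of three noisy clusters."""
    cx, cy = random.choice([(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)])
    return (cx + random.gauss(0, 0.1), cy + random.gauss(0, 0.1))

units = [(random.random(), random.random()) for _ in range(3)]  # weight vectors
eta = 0.1  # learning rate

for step in range(2000):
    x = sample()
    # Winner selection 2: smallest Euclidean distance to the input.
    win = min(range(len(units)),
              key=lambda j: sum((u - v) ** 2 for u, v in zip(units[j], x)))
    # Competitive update: move only the winner towards the input.
    units[win] = tuple(u + eta * (v - u) for u, v in zip(units[win], x))

# The weight vectors typically end up near the three cluster centres.
print([tuple(round(c, 2) for c in u) for u in units])
```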
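Finally, returning to the Hopfield net described above: a minimal sketch of an associative memory, assuming the standard Hebbian storage prescription wij = Σp xi^p xj^p with wii = 0 and asynchronous threshold updates on bipolar (+1/-1) states; the stored patterns and the noisy probe are illustrative:

```python
import random

random.seed(2)

# Two illustrative bipolar patterns of length 8 to store.
patterns = [[1, 1, 1, 1, -1, -1, -1, -1],
            [1, -1, 1, -1, 1, -1, 1, -1]]
n = len(patterns[0])

# Hebbian storage: w_ij = sum over patterns of x_i * x_j, no self-connections.
W = [[0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(n)]
     for i in range(n)]

def recall(state, steps=100):
    """Asynchronous recall: repeatedly update one randomly chosen unit."""
    state = list(state)
    for _ in range(steps):
        i = random.randrange(n)
        a = sum(W[i][j] * state[j] for j in range(n))  # unit i's activation
        state[i] = 1 if a >= 0 else -1                 # threshold at zero
    return state

# Probe with a corrupted version of the first pattern (one bit flipped).
probe = list(patterns[0])
probe[0] = -probe[0]
print(recall(probe))  # settles back to the stored pattern [1, 1, 1, 1, -1, ...]
```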
Credits

• Neural Network FAQ: ftp://ftp.sas.com/pub/neural/FAQ.html
• Dr. Leslie Smith, brief on-line introduction to NNs: http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html
• Kevin Gurney, An Introduction to Neural Networks: http://www.shef.ac.uk/psychology/gurney/notes/index.html
• Ben Kröse and Patrick van der Smagt, An Introduction to Neural Networks: ftp://ftp.wins.uva.nl/pub/computer-systems/autsys/reports/neuro-intro/neuro-intro.ps.gz