Latest in Deep Learning
Roland Memisevic, LISA lab, Montreal
Oct 30, 2014

Rosenblatt's perceptron (1957)
- [Figure: Mark I perceptron; pictures taken from http://www.rutherfordjournal.org/article040101.html]

Rosenblatt's perceptron (1957)
- [Figure: inputs x1, ..., xD with weights w1, ..., wD feeding a single output unit]
- "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence" (in the NYT, according to Wikipedia)

The XOR problem
- [Figure: the four XOR points in the (x1, x2) plane; no single line separates the classes, but the product feature x1 · x2 does]
- (picture adapted from Bishop 2006)

"It's the features, stupid"
A common vision pipeline (prior to 2012):
1. Find interest points.
2. Crop patches around them.
3. Represent each patch with a sparse local descriptor (f1, ..., fn).
4. Combine the descriptors into a representation for the image.
- [Figure: descriptor space with the classes "cathedral" and "high-rise" separated by a line]
- This creates a representation that even a linear classifier can deal with.

What do good features look like?
- There are many incarnations: SIFT, HOG, GIST, SURF, ...
- Common to all is that they use local edge information, followed by several stages of non-linear processing.

Back to neural nets
- [Figure: multi-layer network; picture adapted from Bishop 2006]

Back-prop
- [Figure: layered composition f(g; wf), g(h; wg), h(x; wh), with fprop passing activations up and bprop passing ∂f/∂g and gradients such as ∂f/∂wf down]
- We can make use of the chain rule to compute derivatives w.r.t. lower-layer parameters.
- Popularized by (Rumelhart, Hinton, Williams; 1986).
- Suffers from the vanishing/exploding gradients problem (more on this later).
- (A small numeric sketch appears below, after the Convolutional networks slide.)

The end
- Questions?

Speech
- [Figure from Yoshua Bengio]

Convolutional networks
- LeCun et al. 1998
- Fukushima 1980 (without learning)
- Training the model to classify objects turns the lowest layers into Gabor-like features ("edge detectors").
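To make the Back-prop slide above concrete, here is a minimal NumPy sketch. It is not code from the talk: the layer sizes, tanh non-linearities, and squared-error loss are assumptions chosen only to keep the example short.

```python
# Minimal back-prop sketch (illustrative assumptions, not from the talk).
# Composition f(g(h(x))) with tanh layers and a squared-error loss.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(5)            # input
t = rng.randn(2)            # target

# Parameters of the three layers h, g, f (sizes are arbitrary assumptions).
Wh = rng.randn(4, 5) * 0.1
Wg = rng.randn(3, 4) * 0.1
Wf = rng.randn(2, 3) * 0.1

# fprop
h = np.tanh(Wh @ x)
g = np.tanh(Wg @ h)
f = Wf @ g
loss = 0.5 * np.sum((f - t) ** 2)

# bprop: chain rule, passing d(loss)/d(activation) down the stack
df = f - t                          # d(loss)/df
dWf = np.outer(df, g)               # d(loss)/dWf
dg = (Wf.T @ df) * (1 - g ** 2)     # d(loss)/dg, through tanh
dWg = np.outer(dg, h)
dh = (Wg.T @ dg) * (1 - h ** 2)
dWh = np.outer(dh, x)

# One SGD step on the lowest layer, to show the gradients are usable.
Wh -= 0.1 * dWh
print(loss, dWh.shape)
```

Note how each bprop step multiplies by another factor such as (1 − activation²); long chains of such factors are exactly what makes gradients vanish or explode in deep (and, later, recurrent) networks.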
ImageNet
- [Figure: example ImageNet images and classification results]

ImageNet high-level features
- Girshick, Donahue, Darrell, Malik (!); 2014

Emotion Recognition in the Wild Challenge 2013

Convnet features for generic recognition
- Non-ImageNet classes: (Donahue et al., 2013)

Word embeddings
- Train vectors to represent words.
- france − paris + germany = berlin
- king − man + woman = queen
- men − man + book = books
- Bengio et al. 2000
- (A toy sketch of this arithmetic appears at the end of this part.)

Word embeddings
- (from Mikolov et al. 2013)
- Works well in sentiment analysis. How about speech? Machine translation?
- Are the vectors capturing syntax or semantics?
- "He spotted the cat with the binoculars" (ex. from Felix Hill)

Kaggle competitions
- "...The winning submission is a bag of 70 neural networks..." (www.kaggle.com)

Fall-out

Research directions / challenges
- Structured outputs, unsupervised learning
- Unconventional nonlinearities / network structures
- RNNs/LSTM

Structured prediction
- Prediction: map an input x to a single output y.
- Structured prediction: map x to a structured output y.
- Scene labeling, text, speech

Solution proposed in 1998
- When layers are complex graphs, back-prop still works (also in LeCun et al. 1998).
- This observation was recycled in 2001 under the name Conditional Random Field.

Streetview
- (Goodfellow et al., ICLR 2014)

Towards scene recognition

A chance for unsupervised learning
- [Figure: a layer of features h placed on top of the outputs y, alongside the input x]
- A possible alternative to complicated graphical models on the outputs is to use feature learning on the outputs themselves.
- (...and potentially the dependence of the features themselves on the inputs, which would require 3-way connections)

Unconventional nonlinearities / network structures

w^T x ?
- Mel, 1994

GoogLeNet
- Exercise in (a) scaling up, (b) unconventional neurons/architectures
- Wins ImageNet 2014 with 6.66% top-5 error rate

"Transistor neurons"
- [Figure: gating unit with inputs x, y and outputs z_k]
- Vision is more than object recognition: most other tasks are based on encoding relations, not things: analogy making, motion understanding, invariance, attention, ...
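To illustrate the analogy arithmetic on the word-embeddings slides earlier in this part, here is a toy NumPy sketch. The four-word vocabulary and 3-dimensional vectors are invented purely for illustration; a real model such as word2vec (mentioned on the Software slide later) learns vectors of a few hundred dimensions from large corpora.

```python
# Toy illustration of embedding analogy arithmetic (vectors are made up).
import numpy as np

# Hypothetical 3-d embeddings; a real model would learn these from text.
E = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def nearest(v, exclude):
    # Return the vocabulary word whose embedding is most cosine-similar to v.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in E if w not in exclude), key=lambda w: cos(E[w], v))

# king - man + woman ~ queen
v = E["king"] - E["man"] + E["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # -> queen
```

With a trained model in gensim, roughly the same query can be posed as model.most_similar(positive=['king', 'woman'], negative=['man']).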
"Transistor neurons"
- [Figure]

Recurrent networks

RNNs and Backprop-Through-Time
- [Picture from http://www.cs.toronto.edu/~asamir/cifar/Ilya slides.pdf]
- Training the network by stepping it T times is just like a T-layer feedforward net.
- The problem: vanishing/exploding gradients now really become an issue.

Long Short-Term Memory (LSTM)
- [Picture from (Hochreiter, Schmidhuber; 1997)]
- (A minimal cell sketch appears after the closing slide.)

RNN applications (thanks to LSTM)
- Machine translation (Sutskever et al., NIPS 2014), (Cho et al., arXiv 2014)
- Speech synthesis (Fan et al., INTERSPEECH 2014)
- Speech recognition
- Handwriting generation: http://www.cs.toronto.edu/~graves/handwriting.html
- Character-based text generation: http://www.cs.toronto.edu/~ilya/rnn.html
- "Neural Turing Machine"

(Sutskever et al., NIPS 2014)
- [Figure]

http://www.cs.toronto.edu/~graves/handwriting.html
- [Demo screenshot]

Can we do better than LSTM?
- LSTM is extremely simplistic: it keeps around a value, using a gate to protect or reset it.
- A 1.0-connection is a special case of the identity matrix. The identity is a special case of an orthogonal matrix.
- → Gating supports orthonormal paths through a network.

"Grammar cells" (Michalski, Memisevic, Konda; 2014)
- A long sequence of balls bouncing around in a box: [video frames]

The future? (besides structured prediction, unconventional non-linearities, RNNs)
- Scaling up, dealing with the optimization problem
- Hardware support
- Applications, fusion of vision, language, speech, robotics
- Reinforcement learning
- Deep learning as a service
- Conv-nets?

Bag of Tricks
- Stochastic gradient descent and momentum are almost always the best way to train a neural net.
- Use gradient clipping to avoid getting thrown out by NaNs too often (especially for recurrent nets). (Sketched after the closing slide.)
- Corrupt neuron inputs during learning (e.g. "dropout").
- Don't worry about non-differentiabilities (nor, in fact, about discontinuities).
- People rarely succeed by just "trying out" models on data; you have to guide the models to make them work.
- When does deep learning work well? When you have lots of data and a strong signal.

Software
- Convnets: overfeat, caffe, cuda-convnet, ..., sklearn-theano
- Word embeddings: word2vec (available in gensim)
- Deep learning: torch, theano

sklearn-theano

theano
- Lots of hands-on examples at: http://www.deeplearning.net/tutorial/

Thank you!
Questions?
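The LSTM slides above only gesture at the mechanism, so here is a minimal, hedged NumPy sketch of an LSTM-style cell: a stored value protected or reset by gates. The gate layout, sizes, and random weights are illustrative assumptions; among other details, the forget gate only appeared in later variants of the 1997 model.

```python
# Caricature of the LSTM idea from the slides: a cell keeps around a value,
# using gates to protect or reset it. Not the exact Hochreiter & Schmidhuber
# (1997) formulation; weights, sizes, and gate set are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One step of a simplified LSTM cell.
    x: input, h: previous hidden state, c: previous cell state,
    W: dict of weight matrices (one per gate), each acting on [x; h]."""
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z)      # input gate: let new information in
    f = sigmoid(W["f"] @ z)      # forget gate: protect or reset the cell
    o = sigmoid(W["o"] @ z)      # output gate
    g = np.tanh(W["g"] @ z)      # candidate content
    c = f * c + i * g            # near-linear path that eases gradient flow
    h = o * np.tanh(c)
    return h, c

# Tiny usage example with random weights.
rng = np.random.RandomState(0)
nx, nh = 3, 4
W = {k: rng.randn(nh, nx + nh) * 0.1 for k in ("i", "f", "o", "g")}
h, c = np.zeros(nh), np.zeros(nh)
for t in range(5):
    h, c = lstm_step(rng.randn(nx), h, c, W)
print(h)
```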
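Finally, the "use gradient clipping" item on the Bag of Tricks slide can be sketched as follows. Clipping by global norm and the threshold of 1.0 are assumptions; the talk does not commit to a particular scheme.

```python
# Gradient clipping by global norm (illustrative; threshold is arbitrary).
import numpy as np

def clip_gradients(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: a huge gradient (as can arise when back-propagating through time)
# gets rescaled before the parameter update instead of blowing up to NaN.
grads = [np.array([1000.0, -2000.0]), np.array([[3000.0]])]
print(clip_gradients(grads))
```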