
Latest in Deep Learning
Roland Memisevic
LISA lab Montreal
Oct 30, 2014
Rosenblatt’s perceptron (1957)
pictures taken from http://www.rutherfordjournal.org/article040101.html
Rosenblatt’s perceptron (1957)
[Figure: perceptron with inputs x1 ... xD and weights w1 ... wD feeding a single output unit]
- “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence” (in the NYT, according to Wikipedia)
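To make the computation concrete, here is a minimal NumPy sketch of the classic perceptron learning rule on toy data (my own illustration; the data, learning loop, and thresholding choices are assumptions, not from the slides):

    import numpy as np

    # Toy, linearly separable data: label is the sign of the sum of the inputs.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)                      # inputs x1, x2
    y = np.where(X.sum(axis=1) > 0, 1, -1)     # targets in {-1, +1}

    w = np.zeros(2)                            # weights w1 ... wD
    b = 0.0                                    # bias

    # Perceptron learning rule: update only on misclassified examples.
    for epoch in range(10):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified
                w += yi * xi
                b += yi

    predictions = np.sign(X.dot(w) + b)
    print("training accuracy:", (predictions == y).mean())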
The XOR problem
[Figure: the four XOR patterns in the (x1, x2) plane; no single line separates the two classes]
[Figure: adding the product feature x1 · x2 makes the XOR patterns linearly separable]
(picture adapted from Bishop 2006)
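A quick NumPy check of the point of these slides (my own toy example, not from the talk): no linear function of x1 and x2 reproduces XOR, but adding the product feature x1·x2 makes it a linear problem.

    import numpy as np

    # The four XOR patterns and their labels.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 1., 1., 0.])

    def best_linear_fit(features, targets):
        """Least-squares linear fit (with bias); returns the fitted outputs."""
        A = np.hstack([features, np.ones((len(features), 1))])
        coeffs, *_ = np.linalg.lstsq(A, targets, rcond=None)
        return A.dot(coeffs)

    # On the raw inputs, the best linear fit cannot reproduce XOR.
    print(np.round(best_linear_fit(X, y), 2))       # all four outputs are 0.5

    # With the extra feature x1*x2, XOR becomes linear: y = x1 + x2 - 2*x1*x2.
    X_aug = np.hstack([X, X[:, :1] * X[:, 1:]])
    print(np.round(best_linear_fit(X_aug, y), 2))   # exactly 0, 1, 1, 0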
“It’s the features, stupid”
A common vision pipeline (prior to 2012)
1. Find interest points.
2. Crop patches around them.
3. Represent each patch with a sparse local descriptor.
4. Combine the descriptors into a representation for the image.
[Figure: local descriptors f1 ... fn, each an M-dimensional vector f_i1 ... f_iM]
[Figure: in descriptor space (f1, f2), “cathedral” and “high-rise” images become separable]
- This creates a representation that even a linear classifier can deal with.
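As a rough illustration of such a pipeline (a hedged sketch, not the exact one from the talk): SIFT descriptors from OpenCV quantized into a bag-of-visual-words histogram, fed to a linear classifier. The image paths, labels, and vocabulary size are placeholders, and cv2.SIFT_create assumes an OpenCV build with SIFT support.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    # Placeholder image paths and labels (hypothetical; use many images in practice).
    paths = ["cathedral_01.jpg", "highrise_01.jpg"]
    labels = [0, 1]

    sift = cv2.SIFT_create()            # steps 1-3: interest points + local descriptors

    def descriptors(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        return desc                     # (num_keypoints, 128) SIFT descriptors

    all_desc = [descriptors(p) for p in paths]

    # Step 4: quantize descriptors into a visual vocabulary and build
    # one histogram ("bag of visual words") per image.
    k = 50
    vocab = KMeans(n_clusters=k, n_init=10).fit(np.vstack(all_desc))
    hists = np.array([np.bincount(vocab.predict(d), minlength=k) for d in all_desc])

    # The resulting fixed-length histograms can be handled by a linear classifier.
    clf = LinearSVC().fit(hists, labels)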
What do good features look like?
- There are many incarnations: SIFT, HOG, GIST, SURF, ...
- Common to all is that they use local edge information, followed by several stages of non-linear processing.
Back to neural nets
(picture adapted from Bishop 2006)
Back-prop
[Figure: layered computation f(g(h(x; w_h); w_g); w_f), with fprop passing activations up, bprop passing ∂f/∂g back down, and grad computing ∂f/∂w_f for each layer]
- We can make use of the chain rule to compute derivatives w.r.t. lower-layer parameters.
- Popularized by (Rumelhart, Hinton, Williams; 1986).
- Suffers from the vanishing/exploding gradients problem (more on this later).
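A minimal NumPy sketch of the chain rule at work in a two-layer net (my own toy example, using tanh layers and a squared-error loss; shapes and constants are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.randn(5, 3)                 # 5 inputs, 3 features
    t = rng.randn(5, 1)                 # targets
    W1, W2 = rng.randn(3, 4), rng.randn(4, 1)

    # fprop: h = tanh(x W1), y = h W2, loss = mean squared error
    h = np.tanh(x.dot(W1))
    y = h.dot(W2)
    loss = 0.5 * np.mean((y - t) ** 2)

    # bprop: apply the chain rule layer by layer.
    dy = (y - t) / len(x)               # d loss / d y
    dW2 = h.T.dot(dy)                   # d loss / d W2
    dh = dy.dot(W2.T)                   # d loss / d h
    dW1 = x.T.dot(dh * (1 - h ** 2))    # d loss / d W1  (tanh' = 1 - tanh^2)

    # Finite-difference check on one weight to confirm the derivation.
    eps, i, j = 1e-5, 0, 0
    W1[i, j] += eps
    loss_plus = 0.5 * np.mean((np.tanh(x.dot(W1)).dot(W2) - t) ** 2)
    print(dW1[i, j], (loss_plus - loss) / eps)   # should be close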
The end
Questions?
Speech
(figure from Yoshua Bengio)
Convolutional networks
- LeCun et al. 1998
- Fukushima 1980 (without learning)
- Training the model to classify objects turns the lowest layers into Gabor-like features (“edge detectors”).
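To make the “edge detector” idea concrete, here is a small NumPy sketch (my own illustration, not from the talk) of what a single convolutional filter does: a hand-crafted vertical-edge filter, similar in spirit to one component of a learned Gabor-like first-layer feature, slid over a toy image.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Naive 'valid' 2-D convolution (really cross-correlation, as in most conv nets)."""
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Toy image: dark left half, bright right half -> one vertical edge.
    image = np.zeros((8, 8))
    image[:, 4:] = 1.0

    # A simple vertical-edge filter; learned first-layer filters look Gabor-like.
    kernel = np.array([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]])

    response = conv2d_valid(image, kernel)
    print(response)      # strongest responses in the columns where the edge sits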
ImageNet
ImageNet high-level features
Girshick, Donahue, Darrell, Malik (!); 2014
Emotion recognition in the wild Challenge 2013
Convnet features for generic recognition
non-ImageNet classes:
(Donahue et al., 2013)
Word embeddings
Train vectors to represent words.
paris − france + germany = berlin
king − man + woman = queen
men − man + book = books
(Bengio et al. 2000)
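A hedged sketch of how such analogies are typically queried with word2vec via gensim (mentioned later on the Software slide). The corpus here is a placeholder, and the vector_size/epochs arguments assume gensim 4.x (older versions call the first one size):

    from gensim.models import Word2Vec

    # Placeholder corpus: in practice this would be a large tokenized text corpus.
    sentences = [["paris", "is", "the", "capital", "of", "france"],
                 ["berlin", "is", "the", "capital", "of", "germany"]]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

    # Analogy query: paris - france + germany ≈ ?
    # (Meaningful answers require a real corpus; this only shows the API.)
    print(model.wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))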
Word embeddings
(from Mikolov et al. 2013)
- Works well in sentiment analysis. How about speech? Machine translation?
- Are the vectors capturing syntax or semantics?
- “He spotted the cat with the binoculars” (example from Felix Hill)
Kaggle competitions
- “...The winning submission is a bag of 70 neural networks...” (www.kaggle.com)
Fall-out
Research directions / challenges
- Structured outputs, unsupervised learning
- Unconventional nonlinearities / network structures
- RNNs / LSTM
Structured prediction
Structured prediction
Prediction: x → y
Structured prediction: x → y, where the output y itself has structure
- Scene labeling, text, speech
Solution proposed in 1998
- When layers are complex graphs, back-prop still works (also in LeCun et al. 1998).
- This observation was recycled in 2001 under the name Conditional Random Field.
Streetview (Goodfellow et al., ICLR 2014)
Towards scene recognition
A chance for unsupervised learning
[Figure: input x, structured output y, with latent features h on the outputs]
- A possible alternative to complicated graphical models on the outputs is to use feature learning on the outputs themselves.
- (...and potentially the dependence of the features themselves on the inputs, which would require 3-way connections)
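One simple form of “feature learning on the outputs” would be an autoencoder trained on output patterns. The following NumPy sketch is my own minimal illustration under that assumption, not the model from the talk; the data, sizes, and learning rate are toy choices.

    import numpy as np

    rng = np.random.RandomState(0)
    Y = (rng.rand(500, 20) > 0.7).astype(float)   # toy output patterns (e.g. binary label maps)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    n_hid, lr = 8, 0.1
    W_enc = 0.1 * rng.randn(20, n_hid)            # encoder: outputs y -> features h
    W_dec = 0.1 * rng.randn(n_hid, 20)            # decoder: features h -> reconstructed y
    b_h, b_y = np.zeros(n_hid), np.zeros(20)

    for step in range(2000):
        h = sigmoid(Y.dot(W_enc) + b_h)           # features of the *outputs*
        Y_rec = sigmoid(h.dot(W_dec) + b_y)
        err = (Y_rec - Y) / len(Y)                # cross-entropy gradient at the output pre-activations
        dh = err.dot(W_dec.T) * h * (1 - h)
        W_dec -= lr * h.T.dot(err);  b_y -= lr * err.sum(axis=0)
        W_enc -= lr * Y.T.dot(dh);   b_h -= lr * dh.sum(axis=0)

    # h is now a learned feature representation of output patterns,
    # which a predictor from x could target instead of the raw outputs y.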
Unconventional nonlinearities / network structures
wᵀx ?
- Mel, 1994
GoogLeNet
- Exercise in (a) scaling up, (b) unconventional neurons/architectures
- Wins ImageNet 2014 with 6.66% top-5 error rate
“Transistor neurons”
[Figure: gating units z_k connecting pairs of inputs x_i and y_j]
- Vision is more than object recognition: most other tasks are based on encoding relations, not things: analogy making, motion understanding, invariance, attention, ...
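The “transistor” metaphor refers to units that gate one input with another. As a hedged illustration (my own sketch, not necessarily the exact model from the talk), a basic multiplicative unit computes z_k = Σ_ij W_ijk x_i y_j, so its response depends on how x and y relate rather than on either input alone:

    import numpy as np

    rng = np.random.RandomState(0)
    n_x, n_y, n_z = 6, 6, 4

    # Three-way weight tensor: each output unit z_k pools products x_i * y_j.
    W = rng.randn(n_x, n_y, n_z)

    def gated_response(x, y):
        # z_k = sum_ij W[i, j, k] * x[i] * y[j]
        return np.einsum('ijk,i,j->k', W, x, y)

    x = rng.randn(n_x)
    y_same = x.copy()                  # y identical to x
    y_shifted = np.roll(x, 1)          # y is a transformed (shifted) version of x

    # The responses differ depending on how y relates to x, which is
    # what lets such units encode transformations/relations.
    print(gated_response(x, y_same))
    print(gated_response(x, y_shifted))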
Recurrent networks
RNNs and Backprop-Through-Time
picture from http://www.cs.toronto.edu/~asamir/cifar/Ilya slides.pdf
- Training the network by stepping it T times is just like a T-layer feedforward net.
- The problem: vanishing/exploding gradients now really become an issue.
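A minimal NumPy sketch (my own illustration) of unrolling a vanilla RNN for T steps and back-propagating through time; the printout shows how the gradient norm shrinks (or blows up) across steps, which is the vanishing/exploding-gradient issue mentioned above.

    import numpy as np

    rng = np.random.RandomState(0)
    T, n_in, n_hid = 20, 3, 5
    W_xh = 0.1 * rng.randn(n_in, n_hid)
    W_hh = rng.randn(n_hid, n_hid) * 0.5     # try * 2.0 to see exploding gradients instead
    xs = rng.randn(T, n_in)

    # fprop: unroll the recurrence h_t = tanh(x_t W_xh + h_{t-1} W_hh) for T steps.
    hs = [np.zeros(n_hid)]
    for t in range(T):
        hs.append(np.tanh(xs[t].dot(W_xh) + hs[-1].dot(W_hh)))

    # bprop through time: push a gradient from the last hidden state back to t = 0.
    dh = np.ones(n_hid)                      # pretend the loss gradient at the last step is all ones
    for t in range(T, 0, -1):
        dh = (dh * (1 - hs[t] ** 2)).dot(W_hh.T)
        if t % 5 == 0:
            print(f"step {t:2d}: gradient norm {np.linalg.norm(dh):.6f}")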
Long Short-Term Memory (LSTM)
picture from (Hochreiter, Schmidhuber; 1997)
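For reference, here is one LSTM step in NumPy. Note this is the common modern formulation with input, forget, and output gates (the forget gate was only added after the 1997 paper); the weight layout and toy usage are my own choices, not from the slides.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step. W maps [x, h_prev] to the four gate pre-activations."""
        z = np.concatenate([x, h_prev]).dot(W) + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        g = np.tanh(g)                                 # candidate cell update
        c = f * c_prev + i * g                         # gated memory cell: protect or overwrite
        h = o * np.tanh(c)
        return h, c

    # Toy usage: 4-dimensional inputs, 3 hidden units, 5 time steps.
    rng = np.random.RandomState(0)
    n_in, n_hid = 4, 3
    W = 0.1 * rng.randn(n_in + n_hid, 4 * n_hid)
    b = np.zeros(4 * n_hid)
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for t in range(5):
        h, c = lstm_step(rng.randn(n_in), h, c, W, b)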
RNN applications (thanks to LSTM)
- Machine translation (Sutskever et al., NIPS 2014), (Cho et al., arXiv 2014)
- Speech synthesis (Fan et al., INTERSPEECH 2014)
- Speech recognition
- Handwriting generation: http://www.cs.toronto.edu/~graves/handwriting.html
- Character-based text generation: http://www.cs.toronto.edu/~ilya/rnn.html
- “Neural Turing Machine”
(Sutskever et al. NIPS 2014)
http://www.cs.toronto.edu/~graves/handwriting.html
Can we do better than LSTM?
- LSTM is extremely simplistic: it keeps around a value, using a gate to protect or reset it.
- A 1.0-connection is a special case of the identity matrix. The identity is a special case of an orthogonal matrix.
  → Gating supports orthonormal paths through a network.
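To illustrate the point about orthogonal matrices (a quick check of my own, not from the talk): multiplying by an orthogonal matrix preserves the norm of a backpropagated gradient, whereas a generic matrix shrinks or amplifies it.

    import numpy as np

    rng = np.random.RandomState(0)
    n, T = 50, 100

    # A random orthogonal matrix (via QR) and a generic random matrix.
    Q, _ = np.linalg.qr(rng.randn(n, n))
    A = 0.9 * rng.randn(n, n) / np.sqrt(n)   # generic matrix; scale it up to see explosion instead

    g_orth = g_generic = rng.randn(n)
    for _ in range(T):
        g_orth = Q.T.dot(g_orth)         # norm stays exactly the same
        g_generic = A.T.dot(g_generic)   # norm shrinks toward zero over the steps

    print(np.linalg.norm(g_orth))        # unchanged from the initial norm
    print(np.linalg.norm(g_generic))     # many orders of magnitude smaller after 100 steps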
“Grammar cells” (Michalski, Memisevic, Konda; 2014)
A long sequence of balls bouncing around in a box:
The future? (besides structured prediction, unconventional non-linearities, RNNs)
- Scaling up, dealing with the optimization problem
- Hardware support
- Applications, fusion of vision, language, speech, robotics
- Reinforcement learning
- Deep learning as a service
- Conv-nets?
Bag of Tricks
- Stochastic gradient descent and momentum are almost always the best way to train a neural net (see the sketch after this list).
- Use gradient clipping to avoid getting thrown out by NaNs too often (especially for recurrent nets).
- Corrupt neuron inputs during learning (e.g. “dropout”).
- Don’t worry about non-differentiabilities (in fact, about discontinuities).
- People rarely successfully “try out” models on data; you guide them to make them work.
When does deep learning work well? When you have lots of data and a strong signal.
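A hedged NumPy sketch (my own, not the talk’s code) combining the first three tricks on a tiny one-layer logistic model: minibatch SGD with momentum, gradient clipping by norm, and dropout on the inputs. The data and hyperparameters are placeholders.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(256, 10)
    y = (X[:, 0] > 0).astype(float)          # toy binary targets

    W = 0.01 * rng.randn(10)
    v = np.zeros_like(W)                     # momentum buffer
    lr, momentum, clip_norm, drop_p = 0.1, 0.9, 1.0, 0.2

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(20):
        for i in range(0, len(X), 32):       # minibatch SGD
            xb, yb = X[i:i + 32], y[i:i + 32]

            # Dropout: corrupt neuron inputs during learning (inverted scaling).
            mask = (rng.rand(*xb.shape) > drop_p) / (1.0 - drop_p)
            xb = xb * mask

            p = sigmoid(xb.dot(W))
            grad = xb.T.dot(p - yb) / len(xb)   # cross-entropy gradient

            # Gradient clipping by norm (mainly useful for recurrent nets).
            norm = np.linalg.norm(grad)
            if norm > clip_norm:
                grad = grad * (clip_norm / norm)

            # SGD with momentum.
            v = momentum * v - lr * grad
            W = W + v

    print("training accuracy:", ((sigmoid(X.dot(W)) > 0.5) == (y > 0.5)).mean())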
Software
- Convnets: overfeat, caffe, cuda-convnet, ..., sklearn-theano
- Word embeddings: word2vec (available in gensim)
- Deep learning: torch, theano
sklearn-theano
Theano
Lots of hands-on examples at:
http://www.deeplearning.net/tutorial/
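For a flavour of the library, a tiny Theano snippet in the style of those tutorials (my own minimal example): symbolically define an expression, compile it, then call it like a function.

    import numpy as np
    import theano
    import theano.tensor as T

    # Build a symbolic expression: a logistic layer applied to a minibatch.
    x = T.dmatrix('x')
    W = theano.shared(np.zeros((3, 2)), name='W')   # parameters live in shared variables
    y = T.nnet.sigmoid(T.dot(x, W))

    # Theano compiles the symbolic graph into a callable; it can also derive
    # gradients symbolically (T.grad), which is what makes back-prop convenient.
    f = theano.function([x], y)
    print(f(np.random.randn(4, 3)))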
Thank you!
Questions?