Convolutional Neural Network Applications and Insights

Christof Angermueller and Alex Kendall
Application 1: Classification
Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks."
Visual Classification
Attention-grabbing image classification performance
Clarifai classification demo
Large Scale Classification
Classification advances driven by:
● Large datasets such as ImageNet, Places with millions
of images
● Annual ImageNet Challenge (ILSVRC)
Depth over width
A function that is invariant to the many nuisance variables (pose,
occlusion, lighting, clutter) is highly complex and nonlinear.
Such functions are represented more efficiently with depth than with width:
● sequential mapping to connected spaces
● deeper layers reuse computation
(On the Number of Linear Regions of Deep Neural Networks)
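To make this concrete, here is a hedged restatement of the bound from that paper, quoted from memory (with $n_0$ the input dimension, $n$ the layer width, and $L$ the number of hidden layers): a deep rectifier network with $n \geq n_0$ can compute functions with a number of linear regions that is at least

$$\Omega\!\left(\left(\frac{n}{n_0}\right)^{(L-1)n_0} n^{n_0}\right),$$

exponential in the depth $L$, whereas the region count of a shallow network of the same total size grows only polynomially in its width.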
Deep architectures consistently outperform
shallow representations in networks of comparable size
(Return of the Devil in the Details: Delving Deep into Convolutional Networks)
Very deep architectures
1989: LeNet, 5 layers
2006: Autoencoders, 7 layers
2012: AlexNet, 9 layers
2014: GoogLeNet, 22 layers and current
ILSVRC winner
(‘Going Deeper with Convolutions’)
What constrains depth?
● GPU Memory - more efficient architectures
○ dimensionality reduction kernels
● Over-fitting
○ data-augmentation
○ dropout
● Back-propagated gradient magnitude decay
○ multi-loss training with auxiliary classifiers
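As a concrete illustration of two of these remedies, here is a minimal PyTorch sketch (a toy fully-connected stand-in with hypothetical layer sizes, not the lecture's network): dropout regularises against over-fitting, and a down-weighted auxiliary classifier on an intermediate layer adds gradient signal that counteracts its decay with depth, as in GoogLeNet.

```python
import torch
import torch.nn as nn

# Toy sketch (hypothetical names and shapes): dropout for over-fitting, and
# an auxiliary classifier to inject gradient into lower layers.
class DeepNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5))
        self.upper = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.5))
        self.aux_head = nn.Linear(64, num_classes)   # auxiliary classifier
        self.main_head = nn.Linear(64, num_classes)  # final classifier

    def forward(self, x):
        h = self.lower(x)
        return self.main_head(self.upper(h)), self.aux_head(h)

model, loss_fn = DeepNet(), nn.CrossEntropyLoss()
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
main_logits, aux_logits = model(x)
# Multi-loss training: the auxiliary loss is down-weighted (GoogLeNet uses 0.3).
loss = loss_fn(main_logits, y) + 0.3 * loss_fn(aux_logits, y)
loss.backward()
```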
Leverage Data Hierarchy
Strong hierarchy in data
● Image recognition: Pixel → edge → texton → motif → part → object
● Text: Character → word → word group → clause → sentence → story
● Speech: Sample → spectral band → sound → phone → word
Strong hierarchy in biological architectures:
Thorpe, Simon, Denis Fize, and Catherine Marlot.
"Speed of processing in the human visual system."
Nature 381.6582 (1996): 520-522.
Understanding deep representations
First-layer filters respond to edges, blobs and other low-level features.
Interestingly, when trained on dual GPUs, a distinction forms between sharp
monochrome features (analogous to rods) and colour blobs (analogous to cones)
(ImageNet Classification with Deep Convolutional Neural Networks)
Hierarchy and Multi-Scales
A neuron’s receptive field increases in size with depth
● Initial layer features are more discriminative
● Deeper layers are more invariant and capture semantics
Different and complementary features exist at different
spatial scales
● Depth multiscale: hypercolumns represent features
over the entire depth abstraction (see the sketch below)
● Spatial multiscale: GoogLeNet uses multi-scale
filters in its inception modules
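A minimal sketch of the hypercolumn idea (a toy two-layer convnet with hypothetical sizes): feature maps from several depths are upsampled to input resolution and stacked, so every pixel gets a feature vector spanning the whole depth abstraction.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy two-layer convnet (hypothetical sizes) used to build hypercolumns.
conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv2 = nn.Conv2d(8, 16, 3, padding=1)

x = torch.rand(1, 3, 32, 32)                 # stand-in image
f1 = F.relu(conv1(x))                        # early: discriminative, high-res
f2 = F.max_pool2d(F.relu(conv2(f1)), 2)      # deeper: more invariant, low-res

# Upsample deep features back to input resolution and stack all depths,
# so each pixel carries features from the entire depth abstraction.
f2_up = F.interpolate(f2, size=x.shape[2:], mode="bilinear", align_corners=False)
hypercolumns = torch.cat([f1, f2_up], dim=1)  # shape (1, 24, 32, 32)
```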
Deconvolution
We can visualise the convolutional filters to find deficiencies in an architecture. As you go
deeper, the filters represent more semantic concepts, similar to the progression from V1 to V4 of the visual
pathway in humans (Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks.")
Summary of Classification Insights
1. Large, augmented datasets
2. Maximise depth (while avoiding overfitting
and vanishing gradients)
3. Use multiscale and multi depth information
Application 2: Instantiation Variable Regression
Multi-Dimensional Regression
Instead of training a softmax classifier,
a Euclidean loss function can be used
to train a regression output.
For example, to regress camera location,
x, and orientation, q, we can use a loss
function of the form sketched below.
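The slide's equation did not survive extraction. A plausible reconstruction, matching the weighted position-plus-orientation loss later published for camera pose regression ($\beta$ is a weighting hyperparameter, and $\hat{x}$, $\hat{q}$ are the network's predictions):

$$\operatorname{loss}(I) = \lVert \hat{x} - x \rVert_2 + \beta \, \Big\lVert \hat{q} - \frac{q}{\lVert q \rVert} \Big\rVert_2$$

The orientation quaternion q is normalised to unit length so that the two terms are comparable.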
Despite being large piecewise-linear functions, convnets can still continuously
regress pose and instantiation variables: they map the data to a space linear in these variables.
● Human pose (Toshev and Szegedy, "DeepPose: Human Pose Estimation via Deep Neural Networks")
● Alex's unpublished work on camera pose localisation
Saliency maps
We can compute the gradient of the output w.r.t. the input pixels
Visualising these back-propagated gradients is called a saliency map
These show the areas of the image, and the features, that matter most to the decision
Back-propagated gradients are a generalisation of deconvolution
(Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image
classification models and saliency maps.")
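A minimal sketch of computing such a saliency map (an untrained toy classifier and a random stand-in image, all shapes hypothetical): back-propagate the top class score to the input, then take the per-pixel maximum absolute gradient over colour channels, following Simonyan et al.

```python
import torch
from torch import nn

# Untrained toy classifier (hypothetical shapes) standing in for a real convnet.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)  # stand-in image
scores = model(image)
scores[0, scores[0].argmax()].backward()  # back-propagate the top class score

# Saliency map: per-pixel maximum absolute gradient over colour channels.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 64, 64)
```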
Summary of Regression Insights
1. Convnet transforms data to a space linear in
a number of instantiation parameters
2. Context is extremely important to understand
the data
Other Applications
1. Image caption generation
2. Text recognition
3. Reinforcement learning
Image Caption Generation
Object detection is combined with a multimodal Recurrent Neural Network architecture that
learns to generate descriptions of the detected image regions
● Karpathy et al., "Deep visual-semantic alignments for generating image descriptions"
● Vinyals et al., "Show and Tell: A Neural Image Caption Generator"
Text Recognition (OCR)
Using superpixels to generate region proposals for convnets has been used for
many applications, e.g. OCR (Reading Text in the Wild with Convolutional Neural Networks;
Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks)
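A minimal sketch of such a proposal stage (hypothetical parameters; scikit-image's SLIC stands in for whatever superpixel method the cited papers use): each superpixel's bounding box becomes a candidate region to crop and feed to the convnet.

```python
import numpy as np
from skimage.segmentation import slic

image = np.random.rand(64, 64, 3)          # stand-in RGB image
segments = slic(image, n_segments=50, start_label=0)

# Each superpixel's bounding box becomes a candidate region for the convnet.
proposals = []
for label in np.unique(segments):
    ys, xs = np.nonzero(segments == label)
    proposals.append((xs.min(), ys.min(), xs.max(), ys.max()))
```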
Reinforcement Learning
Spatial and temporal input is passed through a
convolutional neural network to output joystick
commands for a video game (Mnih et al., "Human-Level Control through Deep Reinforcement Learning")
The same architecture is trained on 49 Atari games,
with separate weights trained for each game to
maximise score
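A sketch of that kind of Q-network (layer sizes follow the early DQN papers as best I recall, so treat them as assumptions rather than the published architecture): stacked recent frames provide the temporal input, and the output is one value per joystick action.

```python
import torch
from torch import nn

# Q-network: stacked recent frames in (temporal input), one value per
# joystick action out; the greedy policy picks the highest-valued action.
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, 18),  # up to 18 Atari joystick actions
)
frames = torch.rand(1, 4, 84, 84)  # last four preprocessed greyscale frames
action = q_net(frames).argmax(dim=1)
```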
Final Insights
● Feature vectors from convolutional neural
networks contain rich representations of images
● Invariant to nuisance variables and linear in a
number of instantiation parameters
● The improvement of convnets over SIFT features is
approximately equal to the improvement of SIFT over
simple RGB patches
Conclusion
● Convnets are pushing state-of-the-art in understanding
data with spatial structure
● Produce powerful and transferable representations
However,
● Can be hard to train and regularise
● Labelled training data is very hard to obtain
● Deep representations tend to lose spatial accuracy