Convolutional Neural Network Applications and Insights
Christof Angermueller and Alex Kendall

Application 1: Classification

Visual Classification
Attention-grabbing image classification performance (Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks."); see also the Clarifai classification demo.

Large Scale Classification
Classification advances are driven by:
● large datasets such as ImageNet and Places, with millions of images
● the annual ImageNet challenge (ILSVRC)

Depth over width
A function that is invariant to the many nuisance variables (pose, occlusion, lighting, clutter) is very complex and nonlinear. Such functions are represented more efficiently with depth than with width:
● each layer maps sequentially to connected spaces
● deeper layers reuse the computation of earlier ones (On the Number of Linear Regions of Deep Neural Networks)
Deep architectures consistently outperform shallow representations in comparable networks (Return of the Devil in the Details: Delving Deep into Convolutional Networks).

Very deep architectures
1989: LeNet, 5 layers
2006: autoencoders, 7 layers
2012: AlexNet, 9 layers
2014: GoogLeNet, 22 layers and current ILSVRC winner ("Going Deeper with Convolutions")

What constrains depth?
● GPU memory, which calls for more efficient architectures
○ dimensionality-reduction kernels
● over-fitting
○ data augmentation
○ dropout
● decay of back-propagated gradient magnitudes
○ multi-loss training with auxiliary classifiers

Leverage Data Hierarchy
Strong hierarchy in data:
● image recognition: pixel → edge → texton → motif → part → object
● text: character → word → word group → clause → sentence → story
● speech: sample → spectral band → sound → phone → word
Strong hierarchy also appears in biological architectures (Thorpe, Simon, Denis Fize, and Catherine Marlot. "Speed of processing in the human visual system." Nature 381.6582 (1996): 520-522).

Understanding deep representations
The first layer filters for edges, blobs and other low-level features. Interestingly, when the network is trained on dual GPUs, a distinction forms between sharp monochrome features (rod-like) and colour blobs (cone-like) (ImageNet Classification with Deep Convolutional Neural Networks).

Hierarchy and Multi-Scales
A neuron's receptive field grows with depth:
● initial-layer features are more discriminative
● deeper-layer features are more invariant and capture semantics
Different and complementary features exist at different spatial scales.
Depth multi-scale: hypercolumns represent features over the entire depth of abstraction.
Spatial multi-scale: GoogLeNet uses multi-scale filters in its inception modules.

Deconvolution
We can visualise the convolutional filters to find deficiencies in an architecture. Going deeper, the filters represent increasingly semantic concepts, similar to the V1-V4 progression of the human visual pathway (Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks.").

Summary of Classification Insights
1. Use large, augmented datasets
2. Maximise depth (while avoiding overfitting and vanishing gradients)
3. Use multi-scale and multi-depth information

Application 2: Instantiation Variable Regression

Multi-Dimensional Regression
Instead of training a softmax classifier, a Euclidean loss function can be used to train a regression output. For example, to regress camera location x and orientation q, we can use a loss function of the form

loss = ||x̂ − x||₂ + β ||q̂ − q||₂

where x̂ and q̂ are the network's predictions and β balances orientation error against position error.

Despite convnets being large piecewise-linear functions, they can still continuously regress pose and instantiation variables, mapping the input to a space linear in these parameters:
● human pose (DeepPose: Human Pose Estimation via Deep Neural Networks)
● Alex's unpublished work on camera pose localisation
A minimal implementation sketch of such a loss follows this list.
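The sketch below shows, assuming PyTorch (the source names no framework), how such a Euclidean pose-regression loss can be written. The class name, the quaternion normalisation step and the β value are illustrative assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

class PoseLoss(nn.Module):
    """Euclidean loss for regressing camera location x and orientation q.

    A hypothetical sketch: combines the position error ||x_hat - x||
    and the orientation error ||q_hat - q||, weighted by beta.
    """
    def __init__(self, beta: float = 250.0):  # beta value is illustrative
        super().__init__()
        self.beta = beta

    def forward(self, x_hat, q_hat, x, q):
        # Normalise the predicted quaternion so it encodes a valid rotation.
        q_hat = q_hat / q_hat.norm(dim=-1, keepdim=True)
        pos_err = (x_hat - x).norm(dim=-1)  # Euclidean position error per sample
        ori_err = (q_hat - q).norm(dim=-1)  # Euclidean orientation error per sample
        return (pos_err + self.beta * ori_err).mean()

# Usage: x_hat and q_hat would come from a convnet regression head
# (random tensors stand in for network outputs here).
loss_fn = PoseLoss()
x_hat = torch.randn(8, 3, requires_grad=True)  # predicted locations
q_hat = torch.randn(8, 4, requires_grad=True)  # predicted quaternions
x, q = torch.randn(8, 3), torch.randn(8, 4)    # ground-truth poses
loss_fn(x_hat, q_hat, x, q).backward()
```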
Saliency maps
We can view the gradient of the output with respect to the input pixels. Visualising these back-propagated gradients is called a saliency map: it shows the areas of the image, and the features, that are most important to the prediction. Back-propagated gradients are a generalisation of deconvolution (Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps."). A minimal code sketch of this idea closes these notes.

Summary of Regression Insights
1. A convnet transforms data to a space that is linear in a number of instantiation parameters
2. Context is extremely important for understanding the data

Other Applications
1. Image caption generation
2. Text recognition
3. Reinforcement learning

Image Caption Generation
Object detection is combined with a multimodal Recurrent Neural Network architecture that uses the detected regions to learn to generate descriptions of image content.
● Karpathy et al., "Deep visual-semantic alignments for generating image descriptions"
● Vinyals et al., "Show and Tell"

Text Recognition (OCR)
Using superpixels to generate region proposals for convnets has served many applications, e.g. OCR (Reading Text in the Wild with Convolutional Neural Networks; Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks).

Reinforcement Learning
Spatial and temporal input is fed through a convolutional neural network to output joystick commands for a video game (Mnih et al., "Human-Level Control through Deep Reinforcement Learning"). The same architecture was trained on 49 Atari games, with separate weights learned for each game to maximise score.

Final Insights
● Feature vectors from convolutional neural networks contain rich representations of images
● These representations are invariant to nuisance variables and linear in a number of instantiation parameters
● The improvement of convnets over SIFT features is approximately equal to the improvement of SIFT over simple RGB patches

Conclusion
● Convnets are pushing the state of the art in understanding data with spatial structure
● They produce powerful and transferable representations
However,
● they can be hard to train and regularise
● labelled training data is very hard to obtain
● deep representations tend to lose spatial accuracy
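As a closing illustration of the saliency-map idea discussed above, here is a minimal sketch, assuming PyTorch and a torchvision-pretrained classifier (both assumptions; the source specifies neither framework nor model). It back-propagates the top class score to the input pixels and keeps the maximum absolute gradient per pixel.

```python
import torch
import torchvision.models as models

# Illustrative pretrained classifier; not the network used in the source.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

# One RGB image, preprocessed as the model expects (random stand-in here).
image = torch.randn(1, 3, 224, 224, requires_grad=True)

# Forward pass, then back-propagate the top class score w.r.t. the pixels.
scores = model(image)
scores[0, scores.argmax()].backward()

# Saliency map: maximum absolute gradient across the colour channels.
# Bright pixels are those whose change most affects the prediction.
saliency = image.grad.abs().max(dim=1).values.squeeze()  # shape (224, 224)
```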