Deep Learning Vocabulary
Vocabulary for Deep Learning
(In 40 Minutes! With Pictures!)
Graham Neubig
Nara Institute of Science and Technology (NAIST)
Prediction Problems
Given x, predict y
Example: Sentiment Analysis
x            y
I am happy   +1
I am sad     -1
Linear Classifiers
y = step(w^T φ(x))

● x: the input
● φ(x): vector of feature functions
● w: the weight vector
● y: the prediction, +1 if “yes”, -1 if “no”

[Figure: plot of step(w·φ), which jumps from -1 to +1 where w·φ crosses 0]
Example Feature Functions: Unigram Features

φ(I am happy) = { I: 1, was: 0, am: 1, sad: 0, the: 0, happy: 1, … }
φ(I am sad)   = { I: 1, was: 0, am: 1, sad: 1, the: 0, happy: 0, … }
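The unigram features above can be sketched as a small function; the name `unigram_features` and the whitespace tokenization are assumptions for illustration, not from the slides.

```python
from collections import defaultdict

def unigram_features(sentence):
    # Count how many times each word appears in the sentence;
    # words that never appear implicitly get 0, matching the sparse "..." above.
    phi = defaultdict(int)
    for word in sentence.split():
        phi[word] += 1
    return dict(phi)

print(unigram_features("I am happy"))  # {'I': 1, 'am': 1, 'happy': 1}
```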
The Perceptron
● Think of it as a “machine” to calculate a weighted sum and insert it into an activation function

[Figure: inputs φ(x) and weights w feed step(w^T φ(x)), which outputs y]
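A minimal sketch of the perceptron "machine" over sparse dict features; the example weights and the tie-breaking convention (step(0) = +1) are assumptions.

```python
def step(v):
    # Step activation: +1 for "yes", -1 for "no" (ties broken toward +1).
    return 1 if v >= 0 else -1

def predict(w, phi):
    # y = step(w^T phi(x)), with w and phi(x) as sparse feature->value dicts.
    score = sum(weight * phi.get(name, 0) for name, weight in w.items())
    return step(score)

w = {"happy": 1.0, "sad": -1.0}  # hypothetical weights, not from the deck
print(predict(w, {"I": 1, "am": 1, "happy": 1}))  # 1
```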
Sigmoid Function (Logistic Function)
● The sigmoid function is a “softened” version of the step function:

  P(y=1|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})

[Figure: the logistic function rises smoothly from 0 to 1, passing through p(y|x) = 0.5 at w·φ(x) = 0; the step function jumps from 0 to 1 at that same point]

● Can account for uncertainty
● Differentiable
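The sigmoid above, written as code; the branch on the sign of the score is a standard numerical-stability refinement, not something the slide specifies.

```python
import math

def sigmoid(score):
    # P(y=1|x) = e^s / (1 + e^s), equivalent to 1 / (1 + e^{-s}),
    # where s = w . phi(x).  Branching avoids overflow for large |s|.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

print(sigmoid(0.0))  # 0.5
```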
Logistic Regression
● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

  ŵ = argmax_w ∏_i P(y_i | x_i ; w)

● How do we solve this?
Stochastic Gradient Descent
● Online training algorithm for probabilistic models (including logistic regression)

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

● In other words:
  ● For every training example, calculate the gradient (the direction that will increase the probability of y)
  ● Move in that direction, multiplied by learning rate α
Gradient of the Sigmoid Function
Take the derivative of the probability:

  d/dw P(y=1|x)  = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

  d/dw P(y=-1|x) = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = −φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

[Figure: the gradient (scaled by φ(x)) peaks where w·φ(x) = 0 and falls toward 0 as |w·φ(x)| grows]
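The SGD pseudocode and the gradients above can be combined into one sketch. It follows the slide's update w += α·dP(y|x)/dw literally (ascending the probability itself rather than the log-likelihood); the sparse dict representation, the function name, and the hyperparameter defaults are assumptions.

```python
import math

def train_lr(data, iterations=100, alpha=0.1):
    """Stochastic gradient descent for logistic regression.

    data: list of (phi, y) pairs, phi a sparse feature->value dict, y in {+1, -1}.
    Gradient from the derivation above, with s = w . phi(x):
      dP(y=1|x)/dw  =  phi(x) * e^s / (1 + e^s)^2
      dP(y=-1|x)/dw = -phi(x) * e^s / (1 + e^s)^2
    """
    w = {}
    for _ in range(iterations):
        for phi, y in data:
            s = sum(w.get(f, 0.0) * v for f, v in phi.items())
            # Both gradients share the magnitude e^s / (1+e^s)^2;
            # only the sign differs, so multiply by y.
            g = math.exp(s) / (1.0 + math.exp(s)) ** 2
            for f, v in phi.items():
                w[f] = w.get(f, 0.0) + alpha * y * g * v
    return w

data = [({"I": 1, "am": 1, "happy": 1}, 1),
        ({"I": 1, "am": 1, "sad": 1}, -1)]
w = train_lr(data)
```

After training on the two sentiment examples, "happy" ends up with a positive weight and "sad" with a negative one, while the shared words "I" and "am" stay near zero.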
Neural Networks
Problem: Linear Constraint
● Perceptrons cannot achieve high accuracy on non-linear functions

[Figure: an XOR-style arrangement of X and O examples that no single line can separate]

● Example: “I am not happy”
Neural Networks
(Multi-Layer Perceptron)
● Neural networks connect multiple perceptrons together

[Figure: features φ(x) and a bias input -1 feeding a network of connected perceptrons]
Example:
● Build two classifiers:

[Figure: the four points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1}, with X marks at x1 and x4 and O marks at x2 and x3]

Classifier 1: y1 = step(w1 · {φ1, φ2, 1}), with w1 = {1, 1, -1}
Classifier 2: y2 = step(w2 · {φ1, φ2, 1}), with w2 = {-1, -1, -1}
Example:
● These classifiers map the points to a new space:

[Figure: the points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1} in φ1-φ2 space map into y1-y2 space as
  y(x1) = {-1, -1}, y(x2) = {1, -1}, y(x3) = {-1, 1}, y(x4) = {-1, -1},
so the two X points x1 and x4 land on the same point {-1, -1}]
Example:
● In the new space, the examples are classifiable!

[Figure: in y1-y2 space the O points y(x2) = {1, -1} and y(x3) = {-1, 1} are separated from the X points y(x1) = y(x4) = {-1, -1} by a single line]

Classifier 3: y3 = step(w3 · {y1, y2, 1}), with w3 = {1, 1, 1}
Example:
● Final neural network:

[Figure: φ1 and φ2 (plus a bias input 1) feed two hidden units with weights w1 = {1, 1, -1} and w2 = {-1, -1, -1}; their outputs y1 and y2 (plus a bias input 1) feed the output unit with weights w3 = {1, 1, 1}, producing y4]
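The final network can be written out directly. The weights below are the ones read off the figure (w1 = {1, 1, -1}, w2 = {-1, -1, -1}, w3 = {1, 1, 1}, with a +1 bias input); the tie convention step(0) = +1 is an assumption.

```python
def step(v):
    # Step activation: +1 if v >= 0, else -1.
    return 1 if v >= 0 else -1

def mlp(phi1, phi2):
    # Hidden layer: the two classifiers from the previous slides.
    y1 = step(1 * phi1 + 1 * phi2 + (-1) * 1)    # w1 = {1, 1, -1}
    y2 = step(-1 * phi1 + -1 * phi2 + (-1) * 1)  # w2 = {-1, -1, -1}
    # Output layer: the third classifier in the new y1-y2 space.
    return step(1 * y1 + 1 * y2 + 1 * 1)         # w3 = {1, 1, 1}
```

On the four example points, the O inputs {1, 1} and {-1, -1} map to +1 and the X inputs {-1, 1} and {1, -1} map to -1, which no single perceptron could do.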
Hidden Layer Activation Functions
● Step Function: step(w·φ)
  - Cannot calculate gradient
● Linear Function: w·φ
  - Whole net becomes linear
● Tanh Function: tanh(w·φ)
  + Standard choice (also the 0-1 sigmoid)
● Rectified Linear Function: relu(w·φ)
  + Gradients at large values

[Figure: plots of the four activation functions]
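The four activation functions above, as one-liners; the ±1 output convention for `step` and the function names are assumptions.

```python
import math

def step(v):
    # No useful gradient: the derivative is 0 everywhere it exists.
    return 1.0 if v >= 0 else -1.0

def linear(v):
    # Stacking linear layers keeps the whole net linear.
    return v

def tanh(v):
    # The "standard" hidden-layer choice on the slide.
    return math.tanh(v)

def relu(v):
    # Keeps a nonzero gradient for arbitrarily large positive inputs.
    return max(0.0, v)
```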
Deep Networks
● Why deep?
● Gradually more detailed functions (e.g. [Lee+ 09])

[Figure: a network with several hidden layers between φ(x) and the output]
Learning Neural Nets
Learning:
Don't Know Derivative for Hidden Units!
● For NNs, we only know the correct tag for the last layer:

  d P(y=1|x) / d w4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})²

  d P(y=1|x) / d w1 = ?
  d P(y=1|x) / d w2 = ?
  d P(y=1|x) / d w3 = ?

[Figure: φ(x) feeds hidden units with weights w1, w2, w3, producing h(x), which feeds the output unit with weights w4; the label y = 1 is only observed at the output]
Answer: Back-Propagation

● Calculate the derivative with the chain rule:

  d P(y=1|x)/d w1 = [d P(y=1|x)/d(w4·h(x))] · [d(w4·h(x))/d h1(x)] · [d h1(x)/d w1]

where the first factor is the error of the next unit, δ4 = e^{w4·h(x)} / (1 + e^{w4·h(x)})², the second is the weight w_{1,4}, and the third is the weight gradient of this unit.

● In general, calculate the gradient for unit i based on the next units j:

  d P(y=1|x)/d w_i = ( Σ_j δ_j w_{i,j} ) · d h_i(x)/d w_i
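The general rule above can be sketched for a net with one tanh hidden layer and a sigmoid output. This is a toy illustration, not the slides' code: the shapes, the tanh activation, and the function names are assumptions.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop(phi, w_hidden, w_out):
    """Gradient of P(y=1|x) w.r.t. each hidden weight vector, via the
    chain rule: grad_i = delta_i * dh_i(x)/dw_i, where the error of
    hidden unit i is delta_i = delta_out * w_out[i]."""
    # Forward pass: hidden activations h, output probability p.
    h = [math.tanh(sum(wi * x for wi, x in zip(w_i, phi))) for w_i in w_hidden]
    s = sum(wo * hi for wo, hi in zip(w_out, h))
    p = sigmoid(s)

    # Error of the output unit: dP/ds = e^s / (1 + e^s)^2 = p * (1 - p).
    delta_out = p * (1.0 - p)

    grads = []
    for i, w_i in enumerate(w_hidden):
        delta_i = delta_out * w_out[i]  # error passed back along w_{i,out}
        dh_ds = 1.0 - h[i] ** 2         # derivative of tanh at this unit
        grads.append([delta_i * dh_ds * x for x in phi])
    return grads
```

With one input, one zero-initialized hidden unit, and output weight 1, the gradient is 0.25 (the peak of the sigmoid's derivative), matching the curve on the gradient slide.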
BP in Deep Networks: Vanishing Gradients
[Figure: as the errors δ are propagated back from y toward φ(x), they shrink layer by layer, from around 0.1 near the output to around 1e-6 near the input]

● Exploding gradients can occur as well
Layerwise Training
● Train one layer at a time

[Figure: a deep network from φ(x) to y with one layer trained at a time while the others are held fixed]
Autoencoders
● Initialize the NN by training it to reproduce its own input

[Figure: a network mapping φ(x) through a hidden layer back to φ(x)]

● Advantage: no need for labels y!
Dropout
● Problem: overfitting
● Deactivate part of the network on each update

[Figure: a network from φ(x) to y with several units crossed out (X) on this update]

● Can be seen as an ensemble method
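Dropout on a layer of activations can be sketched as below. The 1/(1-rate) scaling of surviving units ("inverted dropout", so test-time activations need no rescaling) is a common refinement, not something the slide states.

```python
import random

def dropout(h, rate=0.5, train=True):
    # Deactivate each hidden activation with probability `rate` during
    # training; at test time the layer is left untouched.
    if not train:
        return list(h)
    keep = 1.0 - rate
    return [v / keep if random.random() < keep else 0.0 for v in h]
```

Each update samples a different sub-network, which is why the method can be viewed as training an ensemble.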
Network Architectures
Recurrent Neural Nets
[Figure: the same NET applied at each step, reading x1, x2, x3, x4 and outputting y1, y2, y3, y4, with the hidden state passed from one step to the next]

● Good for modeling sequence data
  ● e.g. speech
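One step of the repeated NET in the figure can be sketched with a simple (Elman-style) recurrence; scalar weights and the tanh activation are assumptions to keep the example tiny.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # The same unit is applied at every position, mixing the current
    # input x_t with the previous hidden state h_{t-1}.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(xs, w_x=1.0, w_h=0.5, b=0.0):
    h = 0.0
    states = []
    for x in xs:  # x1, x2, x3, x4 ... as in the figure
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states
```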
Convolutional Neural Nets
[Figure: a sliding window over the inputs x1 … x5 computes features over (x1-x3), (x2-x4), (x3-x5), which are combined into the output y]

● Good for modeling data with spatial correlation
  ● e.g. images
Recursive Neural Nets
[Figure: NET units combine the inputs x1 … x5 bottom-up along a tree structure, producing y at the root]

● Good for modeling data with tree structures
  ● e.g. language
Other Topics
Other Topics
● Deep belief networks
  ● Restricted Boltzmann machines
  ● Contrastive estimation
● Generating with neural nets
● Visualizing internal states
● Batch/mini-batch update strategies
● Long short-term memory
● Learning on GPUs
● Tools for implementing deep learning