CS224d Deep NLP Lecture 6: Neural Tips and Tricks + Recurrent Neural Networks

CS224d Deep NLP Lecture 6: Neural Tips and Tricks + Recurrent Neural Networks
Richard Socher
[email protected]

Overview Today:
•  Useful NNet techniques / tips and tricks:
   •  Multi-task learning
   •  Nonlinearities
   •  Finite difference gradient check
   •  Momentum, AdaGrad
•  Language Models
•  Recurrent Neural Networks

Deep Learning General Strategy and Tricks

Multi-task learning / Weight sharing
•  Similar to the neural network from last class, but replaces the single scalar score with a Softmax classifier
•  Training is again done via backpropagation, which gives an error signal analogous to the score in the score-based model
•  NLP (almost) from scratch, Collobert et al. 2011
[Figure: window network with inputs x1, x2, x3, +1, hidden units a1, a2, and softmax outputs c1, c2, c3]

The Model - Training
•  We already know the softmax classifier and how to optimize it
•  The interesting twist in deep learning is that the input features x are also learned, similar to learning with a score s
[Figure: the same network shown once with a scalar score s (weights U2, W23) and once with a softmax output layer S]

The Model - Training
•  Main additional idea: We can share both the word vectors AND the hidden layer weights. Only the softmax weights are different.
•  Cost function is just the sum of the two cross entropy errors
[Figure: two networks over the same inputs x1, x2, x3, +1 and shared hidden units a1, a2, with separate softmax layers S1 (outputs c1, c2, c3) and S2 (outputs c4, c5, c6)]
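The weight-sharing idea above can be made concrete in a few lines. Below is a minimal numpy sketch (not the course's code): both tasks share the hidden-layer parameters W and b, each task has its own softmax weights, and the cost is the sum of the two cross entropy errors. All sizes and variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Shared parameters: hidden layer over a concatenated 3-word window
n_in, n_hidden = 3 * 50, 100             # 3 word vectors of dimension 50 (illustrative sizes)
W = rng.normal(scale=0.01, size=(n_hidden, n_in))
b = np.zeros(n_hidden)

# Task-specific softmax weights (e.g., POS with 10 classes, NER with 5 classes)
U_pos = rng.normal(scale=0.01, size=(10, n_hidden))
U_ner = rng.normal(scale=0.01, size=(5, n_hidden))

def forward(x, U):
    """Shared hidden layer, task-specific softmax head."""
    a = np.tanh(W @ x + b)               # shared hidden activations
    return softmax(U @ a)

def multitask_cost(x, y_pos, y_ner):
    """Sum of the two cross entropy errors for one input window x."""
    p_pos = forward(x, U_pos)
    p_ner = forward(x, U_ner)
    return -np.log(p_pos[y_pos]) - np.log(p_ner[y_ner])

x = rng.normal(size=n_in)                # a fake concatenated window of word vectors
print(multitask_cost(x, y_pos=3, y_ner=1))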
The secret sauce is the unsupervised word vector pre-training on a large text collection

System                                                  POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                                       97.24            89.31
Supervised NN                                           96.37            81.47
Word vector pre-training followed by supervised NN**    97.20            88.87
+ hand-crafted features***                              97.29            89.59

* Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer – then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER

Supervised refinement of the unsupervised word representation helps

System                   POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN            96.37            81.47
NN with Brown clusters   96.92            87.15
Fixed embeddings*        97.10            88.87
C&W 2011**               97.29            89.59

* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the previous slide

General Strategy for Successful NNets
1.  Select network structure appropriate for problem
    1.  Structure: Single words, fixed windows, bag of words, recursive vs. recurrent, CNN, sentence based vs. document
    2.  Nonlinearity
2.  Check for implementation bugs with gradient checks
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model “larger”
    2.  If you can overfit: Regularize

Non-linearities: What’s used
logistic (“sigmoid”)    tanh
•  tanh is just a rescaled and shifted sigmoid: tanh(z) = 2·logistic(2z) − 1
•  tanh often performs well for deep nets
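As a quick sanity check of that identity, this throwaway numpy snippet (mine, not from the lecture) verifies tanh(z) = 2·logistic(2z) − 1 numerically:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
lhs = np.tanh(z)
rhs = 2 * logistic(2 * z) - 1
print(np.max(np.abs(lhs - rhs)))   # ~1e-16: the two curves agree up to rounding error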
For many models, tanh is the best!
•  In comparison to sigmoid:
   •  At initialization: values close to 0
   •  Faster convergence in practice
•  Like sigmoid: nice derivative: tanh′(z) = 1 − tanh(z)²

Non-linearities: There are various other choices
hard tanh    soft sign    rectified linear (ReLU)
   softsign(z) = z / (1 + |z|)        rect(z) = max(z, 0)
•  hard tanh is similar to but computationally cheaper than tanh, and saturates hard
•  Glorot and Bengio, AISTATS 2011 discuss softsign and rectifier

MaxOut Network
•  A recent type of nonlinearity/network: Goodfellow et al. (2013)
•  Each maxout unit takes the max over several linear functions of the input, e.g. max(w_1·x + b_1, w_2·x + b_2)
•  This function too is a universal approximator
•  State of the art on several image datasets

Gradient Checks are Awesome!
•  Allow you to know that there are no bugs in your neural network implementation!
•  Steps:
   1.  Implement your gradient
   2.  Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10^-4), and estimating derivatives
   3.  Compare the two and make sure they are almost the same

Gradient Checks are Awesome!
•  For each parameter θ_i, the centered finite difference estimate is
   f′(θ_i) ≈ (J(θ + ε·e_i) − J(θ − ε·e_i)) / (2ε), where e_i is the unit vector for parameter i
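A gradient check of this kind is only a few lines of numpy. The sketch below is illustrative (the function names and the quadratic toy objective are my own, not the course starter code): it compares an analytic gradient against the centered finite difference estimate, parameter by parameter.

import numpy as np

def grad_check(J, grad_J, theta, eps=1e-4, tol=1e-7):
    """Compare the analytic gradient grad_J(theta) to centered finite differences of J."""
    analytic = grad_J(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        numeric[i] = (J(theta + e) - J(theta - e)) / (2 * eps)   # centered difference
    diff = np.max(np.abs(analytic - numeric))
    print("max |analytic - numeric| =", diff)
    return diff < tol

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.random.randn(10)
print(grad_check(lambda t: 0.5 * np.dot(t, t), lambda t: t, theta))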
If your gradient check fails and you don’t know why, what now?
•  Simplify your model until you have no bug!
•  Example: Start from the simplest model, then go to what you want:
   •  Only softmax on fixed input
   •  Backprop into word vectors and softmax
   •  Add single-unit single hidden layer
   •  Add multi-unit single layer
   •  Add bias
   •  Add second layer single unit
   •  Add two softmax units

General Strategy
1.  Select appropriate network structure
    1.  Structure: single words, fixed windows vs. recursive sentence based vs. bag of words
    2.  Nonlinearity
2.  Check for implementation bugs with gradient check
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model “larger”
    2.  If you can overfit: Regularize

Parameter Initialization
•  Initialize hidden layer biases to 0 and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g., mean target or inverse sigmoid of mean target).
•  Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): r = √(6 / (fan-in + fan-out)) for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]
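For concreteness, here is a small numpy helper (my sketch, with assumed layer sizes) implementing the uniform initialization above:

import numpy as np

def init_weights(fan_in, fan_out, unit="tanh", rng=np.random.default_rng(0)):
    """Uniform(-r, r) init with r = sqrt(6 / (fan_in + fan_out)); 4x larger for sigmoid units."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W1 = init_weights(150, 100, unit="tanh")     # hidden layer weights
b1 = np.zeros(100)                           # hidden biases start at 0
print(W1.shape, W1.min(), W1.max())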
Stochastic Gradient Descent (SGD)
•  Gradient descent uses the total gradient over all examples per update; SGD updates after only 1 or a few examples:
   θ_new = θ_old − α ∇_θ J_t(θ)
•  J_t = loss function at the current example, θ = parameter vector, α = learning rate
•  Ordinary gradient descent as a batch method is very slow and should never be used. Use a 2nd order batch method such as L-BFGS.
•  On large datasets, SGD usually wins over all batch methods. On smaller datasets, L-BFGS or Conjugate Gradients win. Large-batch L-BFGS extends the reach of L-BFGS [Le et al. ICML 2011].

Learning Rates
•  Simplest recipe: keep it fixed and use the same for all parameters
•  Collobert scales them by the inverse of the square root of the fan-in of each neuron
•  Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t), because of theoretical convergence guarantees, e.g.
   ε_t = ε_0 · τ / max(t, τ)
   with hyper-parameters ε_0 and τ
•  Better yet: no hand-set learning rates by using L-BFGS or AdaGrad (Duchi, Hazan, & Singer 2011)
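A minimal SGD loop with the O(1/t) schedule above might look like this (a sketch on a least-squares toy problem; ε_0, τ, and the data are made up):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eps0, tau = 0.1, 100.0                     # hyper-parameters of the schedule
for t in range(1, 5001):
    i = rng.integers(len(X))               # pick one example
    grad = (X[i] @ w - y[i]) * X[i]        # gradient of 0.5 * (x_i . w - y_i)^2
    lr = eps0 * tau / max(t, tau)          # fixed for t <= tau, then O(1/t) decay
    w -= lr * grad
print(np.round(w, 2))                      # close to true_w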
Adagrad
•  Standard SGD uses a fixed learning rate α
•  Instead: adaptive learning rates!
•  Related paper: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Duchi et al. 2010
•  Let g_{t,i} be the gradient of the objective with respect to parameter i at time step t; the update for parameter i at time step t is then
   θ_{t,i} = θ_{t−1,i} − α · g_{t,i} / √( Σ_{τ=1..t} g_{τ,i}² )
•  Hence the learning rate adapts differently for each parameter, and rare parameters get larger updates than frequently occurring parameters. Word vectors!
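The per-parameter update above is easy to implement. This sketch (illustrative variable names, same toy least-squares setup as before) accumulates squared gradients and divides the step size by their square root:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
alpha = 0.5                                  # global learning rate
G = np.zeros(5)                              # running sum of squared gradients, per parameter
eps = 1e-8                                   # avoids division by zero early on
for t in range(5000):
    i = rng.integers(len(X))
    g = (X[i] @ w - y[i]) * X[i]             # gradient for this example
    G += g * g                               # accumulate g_{t,i}^2
    w -= alpha * g / (np.sqrt(G) + eps)      # Adagrad update: per-parameter step size
print(np.round(w, 2))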
[Backpropagation recap from the lecture notes: for all layers below the top one, the error vectors satisfy δ^(l) = ((W^(l))ᵀ δ^(l+1)) ∘ f′(z^(l)); this is a very detailed account of the chain rule.]

Long-Term Dependencies and the Clipping Trick
•  In very deep networks such as recurrent networks, the gradient is a product of Jacobian matrices, each associated with a step in the forward computation. This product can become very small or very large quickly [Bengio et al. 1994], and the locality assumption of gradient descent breaks down.
•  The solution, first introduced by Mikolov, is to clip gradients to a maximum value. This makes a big difference in RNNs.
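Clipping is typically applied to the norm of the full gradient vector; below is a small illustrative helper (the threshold value is arbitrary):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm (Mikolov-style clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])                 # an exploding gradient with norm 50
print(clip_gradient(g))                     # rescaled to norm 5: [ 3. -4.]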
General Strategy
1.  Select appropriate network structure
    1.  Structure: single words, fixed windows vs. recursive sentence based vs. bag of words
    2.  Nonlinearity
2.  Check for implementation bugs with gradient check
3.  Parameter initialization
4.  Optimization tricks
5.  Check if the model is powerful enough to overfit
    1.  If not, change model structure or make model “larger”
    2.  If you can overfit: Regularize

Assuming you found the right network structure, implemented it correctly, optimized it properly, and you can make your model overfit on your training data: now it’s time to regularize.

Prevent Overfitting: Model Size and Regularization
•  Simple first step: Reduce model size by lowering the number of units and layers and other parameters
•  Standard L1 or L2 regularization on weights
•  Early stopping: Use the parameters that gave the best validation error
•  Sparsity constraints on hidden activations, e.g., add a penalty on the average hidden activation to the cost
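The L2-plus-early-stopping recipe above can be summarized in a short training skeleton (a hedged sketch; the toy data, λ, and learning rate are assumed values):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 10)); w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=600)
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

lam, lr = 1e-3, 0.01                                             # L2 strength and learning rate
w = np.zeros(10)
best_w, best_val = w.copy(), np.inf
for epoch in range(200):
    for i in rng.permutation(len(X_tr)):
        grad = (X_tr[i] @ w - y_tr[i]) * X_tr[i] + lam * w       # data gradient + L2 term
        w -= lr * grad
    val = np.mean((X_val @ w - y_val) ** 2)                      # validation error
    if val < best_val:                                           # early stopping: keep the best
        best_val, best_w = val, w.copy()
print(round(best_val, 3))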
Prevent Feature Co-adaptation: Dropout (Hinton et al. 2012)
•  Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
•  Test time: halve the model weights (there are now twice as many active inputs)
•  This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
•  A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
•  Can be thought of as a form of model bagging
•  It also acts as a strong regularizer
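As a concrete illustration of the 50%/halving scheme described above (a toy sketch, not the course's implementation), dropout on one hidden layer looks roughly like this:

import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(scale=0.1, size=(100, 50))    # hidden layer weights (illustrative sizes)
x = rng.normal(size=50)

def hidden_train(x):
    """Training time: randomly zero 50% of the inputs to each neuron."""
    mask = rng.random(x.shape) < 0.5         # keep each input with probability 0.5
    return np.tanh(W @ (x * mask))

def hidden_test(x):
    """Test time: use all inputs but halve the weights to match the expected scale."""
    return np.tanh((0.5 * W) @ x)

print(hidden_train(x)[:3], hidden_test(x)[:3])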
Deep Learning Tricks of the Trade
•  Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”
•  Unsupervised pre-training
•  Stochastic gradient descent and setting learning rates
•  Main hyper-parameters:
   •  Learning rate schedule & early stopping
   •  Minibatches
   •  Parameter initialization
   •  Number of hidden units
   •  L1 or L2 weight decay
   •  Sparsity regularization
•  How to efficiently search for hyper-parameter configurations
   •  Short answer: random hyperparameter search (!)

Language Models
•  A language model computes a probability for a sequence of words: P(w_1, …, w_T)
•  The probability is usually conditioned on a window of n previous words: P(w_t | w_{t−n+1}, …, w_{t−1})
•  Very useful for a lot of tasks: can be used to determine whether a sequence is a good / grammatical translation or speech utterance
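Before the neural version, the windowed conditioning can be illustrated with a tiny count-based bigram model (my toy example, not from the slides):

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))     # counts of consecutive word pairs
unigrams = Counter(corpus[:-1])                # counts of each conditioning word

def p_next(word, prev):
    """P(word | prev) estimated from counts, i.e., a window of n = 2."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("cat", "the"))   # 2/3: "the" is followed by "cat" twice and "mat" once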
Original neural language model
•  A Neural Probabilistic Language Model, Bengio et al. 2003
•  Training maximizes the penalized log-likelihood of the corpus:
   L = (1/T) ∑_t log f(w_t, w_{t−1}, …, w_{t−n+1}; θ) + R(θ)
•  Original equations:
   y = b + W x + U tanh(d + H x)
   P̂(w_t | w_{t−1}, …, w_{t−n+1}) = e^(y_{w_t}) / ∑_i e^(y_i)
   x = (C(w_{t−1}), C(w_{t−2}), …, C(w_{t−n+1}))
   where C is the word feature matrix (a table look-up with parameters shared across words) and W is optionally zero (no direct connections from the word features to the output)
[Figure 1 from Bengio et al. 2003, “Neural architecture: f(i, w_{t−1}, …, w_{t−n+1}) = g(i, C(w_{t−1}), …, C(w_{t−n+1}))”: a table look-up in C for each of the n−1 previous words, a tanh hidden layer, and a softmax whose i-th output is P(w_t = i | context)]
•  Problem: fixed window of context for conditioning
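A forward pass of this model is compact in numpy. The sketch below (vocabulary size, dimensions, and the sampled context are all made up) mirrors the equations above with the direct connections W set to zero:

import numpy as np

rng = np.random.default_rng(5)
V, m, h, n = 1000, 30, 50, 4                 # vocab size, embedding dim, hidden units, window (assumed)
C = rng.normal(scale=0.01, size=(V, m))      # word feature matrix, shared across positions
H = rng.normal(scale=0.01, size=(h, (n - 1) * m))
d = np.zeros(h)
U = rng.normal(scale=0.01, size=(V, h))
b = np.zeros(V)

def nplm_probs(context):
    """P(w_t | w_{t-1}, ..., w_{t-n+1}) for every word in the vocabulary."""
    x = np.concatenate([C[w] for w in context])   # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + U @ np.tanh(d + H @ x)                # direct connections W x omitted (W = 0)
    e = np.exp(y - y.max())
    return e / e.sum()                            # softmax over the vocabulary

context = [17, 42, 7]                             # indices of the n-1 = 3 previous words
p = nplm_probs(context)
print(p.shape, p.sum())                           # (1000,) and ~1.0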
Recurrent Neural Networks!
•  Solution: Condition the neural network on all previous words and tie the weights at each time step
[Figure: the network unrolled over time, with hidden states h_{t−1}, h_t, h_{t+1}, inputs x_{t−1}, x_t, x_{t+1}, and the same weights W at every step]

Recurrent Neural Network language model
•  Given a list of word vectors: x_1, …, x_{t−1}, x_t, x_{t+1}, …, x_T
•  At a single time step:
   h_t = σ(W^(hh) h_{t−1} + W^(hx) x_[t])
   ŷ_t = softmax(W^(S) h_t)

Recurrent Neural Network language model
•  Main idea: we use the same set of W weights at all time steps!
•  Everything else is the same:
   •  h_0 is some initialization vector for the hidden layer at time step 0
   •  x_[t] is the column vector of L at index [t] at time step t
   •  ŷ_t is a probability distribution over the vocabulary
•  Same cross entropy loss function, but predicting words
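These recurrences translate directly into code. The sketch below (dimensions, names like W_hh / W_hx / W_s, and the random word indices are my assumptions) runs one forward pass and accumulates the cross entropy loss over a short sequence:

import numpy as np

rng = np.random.default_rng(6)
V, d_w, d_h = 500, 30, 40                    # vocab size, word vector dim, hidden dim (assumed)
L = rng.normal(scale=0.01, size=(d_w, V))    # word vector matrix; x_[t] = L[:, t]
W_hh = rng.normal(scale=0.01, size=(d_h, d_h))
W_hx = rng.normal(scale=0.01, size=(d_h, d_w))
W_s = rng.normal(scale=0.01, size=(V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

words = rng.integers(V, size=11)             # a toy word-index sequence
h = np.zeros(d_h)                            # h_0: initialization vector for the hidden layer
loss = 0.0
for t in range(len(words) - 1):
    x_t = L[:, words[t]]                              # x_[t]: column of L for the current word
    h = np.tanh(W_hh @ h + W_hx @ x_t)                # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_[t])
    y_hat = softmax(W_s @ h)                          # probability distribution over the vocabulary
    loss += -np.log(y_hat[words[t + 1]])              # cross entropy against the next word
print(loss / (len(words) - 1))                        # average per-word loss (~log V at random init)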