CS224D: Deep Learning for Natural Language Processing

Neural Networks in Speech Recognition
Andrew Maas, Stanford University, Spring 2015

Outline
• Speech recognition systems overview
• HMM-DNN (hybrid) acoustic modeling
• What's different about modern HMM-DNNs?
• HMM-free RNN recognition

[Figure: a spoken language pipeline for the query "Are there any good robot movies I can rent tonight?": Noise Reduction, then Transcription (Deep Neural Network), then Understanding (What action? Is the user annoyed? Ask for clarification?). Word-hypothesis labels from the figure: Cat, Clothes, Climbing.]

Conversational Speech Data
Switchboard: 300 hours, 4,870 speakers
Example transcripts (parentheses mark partial words):
  but it was really nice to get back with a telephone and the city and everything and you know
  yeah well (i-) the only way i could bear it was to (pass) (some) to be asleep i was like well it is not gonna (be-) get over until you know (w-) (w-)
  yeah it (re-) really i (th-) i think that is what ruined it for us

Outline
• Speech recognition systems overview
• HMM-DNN (hybrid) acoustic modeling (this section)
• What's different about modern HMM-DNNs?
• HMM-free RNN recognition
Acoustic Modeling with GMMs
Transcription:             Samson
Pronunciation:             S – AE – M – S – AH – N
Sub-phones:                942 – 6 – 37 – 8006 – 4422 …
Hidden Markov Model (HMM): 942 942 6 …
Acoustic Model:            GMM
Audio Input:               acoustic feature vectors
[Figure: per-frame acoustic feature plots; only axis residue and the label "Features" survive extraction.]

GMM acoustic model: P(x|s), where x is the input feature vector and s is the HMM state.
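To make "GMM models P(x|s)" concrete, here is a minimal sketch of a per-state, diagonal-covariance GMM likelihood. The mixture parameters are invented for illustration and are not from any real system.

```python
import numpy as np

# Minimal sketch (invented parameters): a per-state diagonal-covariance
# GMM evaluating log P(x|s) for one feature frame.

def gmm_log_likelihood(x, weights, means, variances):
    """x: (dim,) feature frame. weights: (K,), means/variances: (K, dim).
    Returns log sum_k w_k * N(x; mu_k, diag(var_k))."""
    diff = x - means                                       # (K, dim)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff**2 / variances, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_comp)

# Toy state with K = 2 mixture components over 3-dim features.
x = np.array([0.5, -1.0, 0.2])
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 0.0]])
var = np.ones((2, 3))
print(gmm_log_likelihood(x, w, mu, var))  # log P(x|s)
```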
DNN Hybrid Acoustic Models
Same pipeline (transcription, pronunciation, sub-phones, HMM), but the acoustic model is now a DNN that outputs P(s|x1), P(s|x2), P(s|x3) from feature frames x1, x2, x3.
Use a DNN to approximate P(s|x), then apply Bayes' rule to recover the likelihood the HMM decoder needs:
P(x|s) = P(s|x) * P(x) / P(s)  =  DNN output * constant / state prior
A sketch of this conversion follows.
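The conversion is mechanical enough to sketch. This is a minimal illustration with toy shapes and priors, not the lecture's actual code:

```python
import numpy as np

# The DNN softmax gives P(s|x); dividing by the state prior P(s) gives a
# scaled likelihood P(x|s)/P(x) the HMM decoder can use, since P(x) is
# the same for every state at a given frame.

def scaled_log_likelihoods(log_posteriors, state_priors):
    """log_posteriors: (frames, states) log P(s|x) from the DNN.
    state_priors: (states,) P(s), typically estimated from state
    frequencies in the force-aligned training data."""
    return log_posteriors - np.log(state_priors)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))                 # 3 frames, 4 HMM states
log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
priors = np.array([0.4, 0.3, 0.2, 0.1])          # toy state priors
print(scaled_log_likelihoods(log_post, priors))
```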
Not Really a New Idea
Renals, Morgan, Bourlard, Cohen, & Franco. 1994.

Hybrid MLPs on Resource Management
Renals, Morgan, Bourlard, Cohen, & Franco. 1994.

Modern Systems use DNNs and Senones
Dahl, Yu, Deng, & Acero. 2011.

Hybrid Systems now Dominate ASR
Hinton et al. 2012.

What's Different in Modern DNNs?
• Fast computers = run many experiments
• Deeper nets improve on shallow nets
• Architecture choices (easiest is replacing sigmoid)
• Pre-training does not matter. Initially we thought this was the new trick that made things work
• Many more parameters
Depth Matters (Somewhat)
Warning! Depth can also act as a regularizer because it makes optimization more difficult. This is why you will sometimes see very deep networks perform well on TIMIT or other small tasks.
Yu, Seltzer, Li, Huang, Seide. 2013.

Impact of Depth
[Figure: RT-03 word error rate (axis 20 to 45) versus number of hidden layers (1, 3, 5, 7), with curves for 36M, 100M, and 200M total parameters.]
(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. In submission.)

Replacing Sigmoid Hidden Units
[Figure: TanH and rectified linear (ReL) activation functions plotted over inputs -4 to 4.]
Rectified Linear (ReL): Glorot & Bengio. 2011.
A tiny sketch of both activations follows.
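```python
import numpy as np

# The two hidden-unit nonlinearities from the figure, on invented inputs.

def tanh_unit(z):
    return np.tanh(z)               # saturates at +/-1 for large |z|

def rel_unit(z):
    return np.maximum(0.0, z)       # rectified linear: 0 below zero,
                                    # identity (gradient 1) above

z = np.linspace(-4.0, 4.0, 9)
print(tanh_unit(z))
print(rel_unit(z))
```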
Comparing Nonlinearities
[Figure: Switchboard WER for TanH versus ReL networks with 2, 3, and 4 hidden layers, alongside GMM, MSR 9-layer, and IBM 7-layer MMI reference systems.]
(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. In submission.)

Convolutional Networks
• Slide your filters along the frequency axis of filterbank features (a minimal sketch follows this slide)
• Great for spectral distortions (e.g., short-wave radio)
Sainath, Mohamed, Kingsbury, & Ramabhadran. 2013.
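A minimal sketch of filtering along the frequency axis, with a hypothetical filter and feature shapes; real systems learn many filters and add pooling:

```python
import numpy as np

# Slide a small filter along the FREQUENCY axis of filterbank features,
# so a spectral shift (e.g., a short-wave radio distortion) moves the
# response along the frequency axis instead of breaking it.

def conv_over_frequency(frames, filt):
    """frames: (T, F) log filterbank energies. filt: (W,) filter.
    Returns (T, F - W + 1): one frequency convolution per frame."""
    T, F = frames.shape
    W = len(filt)
    out = np.empty((T, F - W + 1))
    for t in range(T):
        for f in range(F - W + 1):
            out[t, f] = frames[t, f:f + W] @ filt
    return out

frames = np.random.default_rng(0).normal(size=(5, 40))  # 5 frames, 40 bands
print(conv_over_frequency(frames, np.array([0.25, 0.5, 0.25])).shape)
```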
Recurrent DNN Hybrid Acoustic Models
Same hybrid pipeline (transcription, pronunciation, sub-phones, HMM), but the acoustic model passes hidden state between frames while emitting P(s|x1), P(s|x2), P(s|x3) from Features (x1), (x2), (x3).

Scaling up NN Acoustic Models in 1999
0.7M total NN parameters. [Ellis & Morgan. 1999]

Adding More Parameters 15 Years Ago
"Size matters: An empirical study of neural network training for LVCSR." Ellis & Morgan. ICASSP. 1999. Hybrid NN, 1 hidden layer, 54 HMM states, 74-hour broadcast news task.
"…improvements are almost always obtained by increasing either or both of the amount of training data or the number of network parameters … We are now planning to train an 8000 hidden unit net on 150 hours of data … this training will require over three weeks of computation."

Adding More Parameters Now
• Comparing total number of parameters (in millions) of previous work versus our new experiments
[Figure: total DNN parameters (M), axis 0 to 450.]
Maas, Hannun, Qi, Lengerich, Ng, & Jurafsky. In submission.

Scaling Total Parameters
[Figure, shown in two steps: frame error rate and RT-03 word error rate versus model size (GMM, 36M, 100M, 200M, 400M).]
(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. In submission.)
Outline
• Speech recognition systems overview
• HMM-DNN (hybrid) acoustic modeling
• What's different about modern HMM-DNNs?
• HMM-free RNN recognition (this section)
HMM-DNN Speech Recognition (recap)
Transcription: Samson; Pronunciation: S – AE – M – S – AH – N; Sub-phones: 942 – 6 – 37 – 8006 – 4422 …; HMM states: 942 942 6 … A DNN approximates P(s|x), and Bayes' rule gives P(x|s) = P(s|x) * P(x) / P(s) (DNN output * constant / state prior).

HMM-Free Recognition
Replace the pronunciation, sub-phone, and HMM layers with characters and a collapsing function (Graves & Jaitly. 2014):
Transcription:       Samson
Characters:          SAMSON
Collapsing function: SS___AA_M_S___O___NNNN collapses to SAMSON
Acoustic Model:      per-frame character distributions P(a|x1), P(a|x2), P(a|x3), e.g., S, S, _ for the first three frames
Use a DNN to approximate P(a|x), the distribution over characters. A sketch of the collapsing function follows.
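The collapsing function itself is tiny. Here is a minimal sketch (an illustration, not the authors' code) applied to the slide's "Samson" example:

```python
# Merge repeated characters, then drop the blank '_', so that a sequence
# of per-frame outputs maps to a character string.

def collapse(path, blank="_"):
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# The slide's example: per-frame outputs for "Samson".
print(collapse("SS___AA_M_S___O___NNNN"))  # -> SAMSON
```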
CTC Objective Function
• Labels at each time index are conditionally independent (like HMMs)
• Sum over all time-level labelings consistent with the output label
• Output label: AB. Time-level labelings: AB, _AB, A_B, …, _A_B_
• The final objective maximizes the probability of the true labels
(Graves & Jaitly, ICML 2014)
A brute-force version of this sum is sketched below.
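A brute-force sketch of the objective: sum the probability of every time-level labeling whose collapsed form equals the target. Enumeration is exponential in T; real implementations use the CTC forward-backward recursion, and the per-frame distributions here are random stand-ins.

```python
import itertools
import numpy as np

BLANK = "_"

def collapse(path):
    # Merge repeats, then drop blanks, as in the previous sketch.
    out, prev = [], None
    for c in path:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

def ctc_log_prob(log_probs, alphabet, target):
    """log_probs: (T, |alphabet|) per-frame log P(a|x_t), assumed
    conditionally independent across frames (as on the slide)."""
    T = log_probs.shape[0]
    total = -np.inf
    for ids in itertools.product(range(len(alphabet)), repeat=T):
        path = "".join(alphabet[i] for i in ids)
        if collapse(path) == target:
            lp = sum(log_probs[t, i] for t, i in enumerate(ids))
            total = np.logaddexp(total, lp)
    return total

# Output label "AB" over T = 3 frames; consistent labelings include
# AB_, A_B, _AB, AAB, ABB, ...
alphabet = ["A", "B", BLANK]
logits = np.random.default_rng(0).normal(size=(3, 3))
log_p = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(ctc_log_prob(log_p, alphabet, "AB"))
```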
Collapsing Example
Per-frame argmax: [a long per-frame character string, mostly blanks '_', garbled in extraction; it collapses to the hypothesis below]
After collapsing: yet a rehbilitation cru is onhand in the building loogging bricks plaster and blueprins four forty two new betin epartments
Reference: yet a rehabilitation crew is on hand in the building lugging bricks plaster and blueprints for forty two new bedroom apartments
(Hannun, Maas, Jurafsky, & Ng. 2014)

Comparing Alignments
[Figure: HMM-GMM phone probabilities over time (HMM slide from Dan Ellis) versus CTC character probabilities.]

Example Results (WSJ)
CTC:       YET A REHBILITATION CRU IS ONHAND IN THE BUILDING LOOGGING BRICKS PLASTER AND BLUEPRINS FOUR FORTY TWO NEW BETIN EPARTMENTS
Reference: YET A REHABILITATION CREW IS ON HAND IN THE BUILDING LUGGING BRICKS PLASTER AND BLUEPRINTS FOR FORTY TWO NEW BEDROOM APARTMENTS

CTC:       THIS PARCLE GUNA COME BACK ON THIS ILAND SOM DAY SOO
Reference: THE SPARKLE GONNA COME BACK ON THIS ISLAND SOMEDAY SOON

CTC:       TRADE REPRESENTIGD JUIDER WARANTS THAT THE U S WONT BACKCOFF ITS PUSH FOR TRADE BARIOR REDUCTIONS
Reference: TRADE REPRESENTATIVE YEUTTER WARNS THAT THE U S WONT BACK OFF ITS PUSH FOR TRADE BARRIER REDUCTIONS

CTC:       TREASURY SECRETARY BAGER AT ROHIE WOS IN AUGGRAL PRESSED FOUR ARISE IN THE VALUE OF KOREAS CURRENCY
Reference: TREASURY SECRETARY BAKER AT ROH TAE WOOS INAUGURAL PRESSED FOR A RISE IN THE VALUE OF KOREAS CURRENCY

Recurrence Matters!
[Diagram: per-frame outputs S, S, _ from P(a|x1), P(a|x2), P(a|x3) over Features (x1), (x2), (x3), with recurrent connections between frames.]

Architecture                    CER
DNN                             22
+ recurrence                    13
+ bi-directional recurrence     10

(Hannun, Maas, Jurafsky, & Ng. 2014)
A sketch of a bidirectional recurrent layer follows.
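A minimal sketch of the kind of bidirectional recurrent layer the table credits, with hypothetical sizes and untrained random weights; the actual model in Hannun et al. differs in detail.

```python
import numpy as np

# One forward and one backward recurrent pass over the frame sequence,
# summed before the per-frame character softmax would be applied.

def bdrnn_layer(X, W_in, W_fwd, W_bwd):
    """X: (T, d_in) feature frames. Returns (T, d_h) hidden activations."""
    T = X.shape[0]
    d_h = W_fwd.shape[0]
    h_f = np.zeros((T, d_h))
    h_b = np.zeros((T, d_h))
    for t in range(T):                       # forward in time
        prev = h_f[t - 1] if t > 0 else np.zeros(d_h)
        h_f[t] = np.maximum(0.0, X[t] @ W_in + prev @ W_fwd)
    for t in reversed(range(T)):             # backward in time
        nxt = h_b[t + 1] if t + 1 < T else np.zeros(d_h)
        h_b[t] = np.maximum(0.0, X[t] @ W_in + nxt @ W_bwd)
    return h_f + h_b                         # combine both directions

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                  # 6 frames, 8-dim features
W_in, W_fwd, W_bwd = (rng.normal(size=s) * 0.1
                      for s in [(8, 16), (16, 16), (16, 16)])
print(bdrnn_layer(X, W_in, W_fwd, W_bwd).shape)  # (6, 16)
```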
Decoding with a Language Model
Add knowledge sources on top of the character probabilities (e.g., __oo_h__y_e_aa_h):
• Lexicon: [a, …, zebra]
• Language model: p("yeah" | "oh")
[Figure: character error rate (axis 0 to 12) and word error rate (axis 0 to 40) for decoding with no LM, a lexicon, and a bigram LM.]
(Hannun, Maas, Jurafsky, & Ng. 2014)

DBRNN vs LSTM
[Figure: character error rate (axis 0 to 12) and word error rate (axis 0 to 40) for DBRNN versus LSTM with no LM, a lexicon, and a bigram LM*.]
* Graves & Jaitly use lattice rescoring
(Hannun, Maas, Jurafsky, & Ng. 2014)

Rethinking Decoding
Problem: out-of-vocabulary words (syriza, abo--, schmidhuber, bae, sof--) never appear in a fixed lexicon [a, …, zebra] or word-level LM p("yeah" | "oh").
Instead, score hypotheses directly with a character language model, e.g., p(h | o,h,␣,y,e,a), on top of the character probabilities __oo_h__y_e_aa_h.
(Maas*, Xie*, Jurafsky, & Ng. NAACL 2015)
A sketch of character-LM-weighted scoring follows.
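A minimal sketch of LM-weighted scoring under invented assumptions: a toy bigram character LM and a made-up weight alpha. Real decoders integrate this into a prefix beam search rather than rescoring a fixed list.

```python
import math

# Combine the acoustic (CTC) log-probability of a hypothesis with a
# character-LM term: score = log P_ctc + alpha * log P_clm.

def char_lm_log_prob(text, clm):
    """clm maps (context_char, next_char) -> probability; unseen pairs
    back off to a small constant (toy smoothing)."""
    lp = 0.0
    for prev, ch in zip(" " + text, text):
        lp += math.log(clm.get((prev, ch), 1e-4))
    return lp

def rescore(hypotheses, clm, alpha=1.5):
    """hypotheses: list of (text, ctc_log_prob) pairs.
    Returns the best hypothesis under the combined score."""
    return max(hypotheses,
               key=lambda h: h[1] + alpha * char_lm_log_prob(h[0], clm))

toy_clm = {(" ", "o"): 0.3, ("o", "h"): 0.5, ("h", " "): 0.4,
           (" ", "y"): 0.2, ("y", "e"): 0.6, ("e", "a"): 0.5,
           ("a", "h"): 0.4}
hyps = [("oh yeah", -4.1), ("o yea", -3.9)]
print(rescore(hyps, toy_clm))
```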
Lexicon-Free & HMM-Free on Switchboard
[Figure: word error rate (axis 0 to 40) for HMM-GMM, CTC with no LM, CTC + 7-gram, CTC + NN LM, and HMM-DNN.]
(Maas*, Xie*, Jurafsky, & Ng. NAACL 2015)

Transcribing Out-of-Vocabulary Words
Truth:    yeah i went into the i do not know what you think of fidelity but
HMM-GMM:  yeah when the i don't know what you think of fidel it even them
CTC-CLM:  yeah i went to i don't know what you think of fidelity but um

Truth:    no no speaking of weather do you carry a altimeter slash barometer
HMM-GMM:  no i'm not all being the weather do you uh carry a uh helped emitters last brahms her
CTC-CLM:  no no beating of whether do you uh carry a uh a time or less barometer

Truth:    i would ima- well yeah it is i know you are able to stay home with them
HMM-GMM:  i would amount well yeah it is i know um you're able to stay home with them
CTC-CLM:  i would ima- well yeah it is i know uh you're able to stay home with them
(Maas*, Xie*, Jurafsky, & Ng. NAACL 2015)

Comparing Alignments (revisited)
[Figure: HMM-GMM phone probabilities over time (HMM slide from Dan Ellis) versus CTC character probabilities.]

Learning Phonemes and Timing
[Figures, shown across two slides; content lost in extraction.]
(Maas*, Xie*, Jurafsky, & Ng. NAACL 2015)

Conclusion
• HMM-DNN systems are now the default, state-of-the-art approach for speech recognition
• We roughly understand why HMM-DNNs work while older, shallow hybrid models didn't work as well
• Recent work demonstrates the feasibility of pure NN-based speech recognition systems (no HMM)
End
• More on spoken language understanding:
  – cs224s.stanford.edu
  – MSR video: youtu.be/Nu-nlQqFCKg
• Open source speech recognition toolkit (Kaldi):
  – Kaldi.sf.net