Deep Canonical Correlation Analysis

Galen Andrew¹, Raman Arora², Jeff Bilmes¹, Karen Livescu²
¹ University of Washington
² Toyota Technological Institute at Chicago

International Conference on Machine Learning (ICML 2013)

Presented by Shaobo Han, Duke University
Nov. 7, 2014
Outline
1. Background
   - Canonical Correlation Analysis (CCA)
   - Kernel Canonical Correlation Analysis (KCCA)
   - Denoising Autoencoder
2. Deep Canonical Correlation Analysis (DCCA)
3. Experiments
   - MNIST Handwritten Digits
   - Articulatory Speech
Introduction
The problem: learn complex nonlinear transformations of two views, such that the resulting representations are maximally correlated.
Deep CCA (DCCA): learn deep architectures whose outputs are highly correlated
- A nonlinear extension of CCA (which uses linear projections)
- An alternative to kernel CCA (which uses fixed nonlinear projections)
Related work:
- Multimodal autoencoders¹
- Multimodal restricted Boltzmann machines²
Key difference: DCCA learns two separate deep encodings, with the objective that the learned encodings be as correlated as possible.

¹ Ngiam et al., Multimodal deep learning, ICML, 2011
² Srivastava & Salakhutdinov, Multimodal learning with deep Boltzmann machines, NIPS, 2012
Canonical Correlation Analysis
Objective: find pairs of linear projections of the two views, (ω1ᵀX1, ω2ᵀX2), that are maximally correlated:

ρ(X1, X2) = max_{ω1,ω2} corr(ω1ᵀX1, ω2ᵀX2)
          = max_{ω1,ω2} cov(ω1ᵀX1, ω2ᵀX2) / √(var(ω1ᵀX1) var(ω2ᵀX2))
          = max_{ω1,ω2} ω1ᵀΣ12ω2 / √((ω1ᵀΣ11ω1)(ω2ᵀΣ22ω2))

CCA reduces to a generalized eigenvalue problem:

[ 0    Σ12 ] [ω1]     [ Σ11   0  ] [ω1]
[ Σ21   0  ] [ω2] = ρ [ 0    Σ22 ] [ω2]
Given centered data matrices H̄1 ∈ R^(p1×n) and H̄2 ∈ R^(p2×n), one can estimate

Σ̂11 = (1/(n−1)) H̄1H̄1ᵀ + r1 I,   r1 > 0

(and analogously Σ̂22 and Σ̂12; the regularizer r1 I keeps the estimate nonsingular).
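As a concrete illustration (a sketch added here, not from the original slides), the regularized CCA problem above can be solved directly as this generalized eigenvalue problem with numpy/scipy; the function and variable names are illustrative:

import numpy as np
from scipy.linalg import eigh

def linear_cca(H1, H2, r1=1e-4, r2=1e-4):
    # H1: (p1, n), H2: (p2, n) data matrices; columns are samples.
    n = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)   # center each view
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    S11 = H1c @ H1c.T / (n - 1) + r1 * np.eye(H1.shape[0])  # estimate of Σ11
    S22 = H2c @ H2c.T / (n - 1) + r2 * np.eye(H2.shape[0])  # estimate of Σ22
    S12 = H1c @ H2c.T / (n - 1)                             # estimate of Σ12
    # Assemble the block matrices A w = ρ B w from the slide.
    A = np.block([[np.zeros_like(S11), S12],
                  [S12.T, np.zeros_like(S22)]])
    B = np.block([[S11, np.zeros_like(S12)],
                  [np.zeros_like(S12).T, S22]])
    rho, W = eigh(A, B)             # symmetric-definite generalized eigenproblem
    w = W[:, -1]                    # eigenvector of the largest eigenvalue
    return rho[-1], w[:H1.shape[0]], w[H1.shape[0]:]

The eigenvalues of this problem come in ± pairs; the largest one is the top canonical correlation ρ.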
Kernel Canonical Correlation Analysis (1/3)
Objective: find maximally correlated nonlinear projections:

ρ_F = max_{f1∈H1, f2∈H2} corr(f1(X1), f2(X2))
    = max_{f1∈H1, f2∈H2} cov(f1(X1), f2(X2)) / √(var(f1(X1)) var(f2(X2)))

Reproducing property:

f(x) = ⟨K(·, x), f⟩,   ∀f ∈ H   (1)
Let K1 and K2 be Mercer kernels³ with feature maps φ1, φ2. Then

corr(f1(X1), f2(X2)) = corr(⟨φ1(X1), f1⟩, ⟨φ2(X2), f2⟩)   (2)
³ Saitoh, Theory of reproducing kernels and its applications, Longman Scientific & Technical, 1988
Kernel Canonical Correlation Analysis (2/3)
Kernel trick: for any Mercer kernel K(x1, x2), there is a map φ : X → F such that K(x1, x2) = ⟨φ(x1), φ(x2)⟩.
An instantiation: define φ(x) = K(·, x) as the feature map; then

⟨φ(x1), φ(x2)⟩ = ⟨K(·, x1), K(·, x2)⟩ = K(x1, x2)   (3)
Empirical correlations: let f1(x) = α1ᵀK1(·, x) and f2(x) = α2ᵀK2(·, x). Then

côv(⟨φ1(x1), f1⟩, ⟨φ2(x2), f2⟩) = (1/N) α1ᵀK1K2α2
vâr(⟨φ1(x1), f1⟩) = (1/N) α1ᵀK1²α1
vâr(⟨φ2(x2), f2⟩) = (1/N) α2ᵀK2²α2

Kernelized CCA problem:

ρ̂_F = max_{α1,α2 ∈ R^N} α1ᵀK1K2α2 / √((α1ᵀK1²α1)(α2ᵀK2²α2))   (4)

All calculations are performed in the input space.
Kernel Canonical Correlation Analysis (3/3)
Regularized KCCA:

ρ̂_F^r = max_{α1,α2 ∈ R^N} α1ᵀK1K2α2 / √((α1ᵀ(K1 + r1I)²α1)(α2ᵀ(K2 + r2I)²α2))   (5)

KCCA also reduces to a generalized eigenvalue problem:

[ 0      K1K2 ] [α1]     [ (K1 + r1I)²        0       ] [α1]
[ K2K1    0   ] [α2] = ρ [ 0             (K2 + r2I)²  ] [α2]
Drawbacks of KCCA:
1. Representation is limited by a fixed kernel
2. Training time scales poorly with data size
3. Training data needs to be referenced when computing representations
of unseen instances
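A small numpy/scipy sketch of the regularized KCCA eigenproblem in Eq. (5) (ours, not the authors'; for simplicity it assumes the Gram matrices have already been centered in feature space, and the names rbf_gram and kcca are illustrative):

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(X, width):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 width^2)); X is (N, p).
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * width ** 2))

def kcca(K1, K2, r1=1e-3, r2=1e-3):
    # Regularized KCCA, Eq. (5), as the generalized eigenproblem above.
    N = K1.shape[0]
    Z = np.zeros((N, N))
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    R1, R2 = K1 + r1 * np.eye(N), K2 + r2 * np.eye(N)
    B = np.block([[R1 @ R1, Z], [Z, R2 @ R2]])
    rho, W = eigh(A, B)              # an O(N^3) solve: drawback 2 in action
    w = W[:, -1]
    return rho[-1], w[:N], w[N:]     # top correlation and coefficients α1, α2

Note how the sketch exhibits the drawbacks: the N × N eigenproblem is why training scales poorly with data size, and evaluating f1 on an unseen x requires K1(xi, x) against every training point xi.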
Denoising Autoencoder⁴
Objective: find good initial intermediate representations by explicit "fill-in-the-blanks" training
- The clean input x is partially destroyed, yielding a corrupted input x̃
- x̃ is mapped to a hidden representation y = f_θ(x̃) = s(Wx̃ + b)
- From y we reconstruct z = g_θ′(y) = Wᵀy
- Parameters are trained to minimize the reconstruction error

l_a(W, b) = ||Z − X||²_F + λ_a(||W||²_F + ||b||²₂)   (6)
⁴ Vincent et al., Extracting and composing robust features with denoising autoencoders, ICML, 2008
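A minimal numpy sketch of one training step under Eq. (6), added here for illustration; it assumes masking noise, a tanh encoder standing in for s, and the tied linear decoder z = Wᵀy from the slide (function and variable names are ours):

import numpy as np

rng = np.random.default_rng(0)

def dae_step(X, W, b, drop=0.3, lam=1e-4, lr=1e-3):
    # X: (p, n) clean inputs; W: (c, p); b: (c, 1).
    Xt = X * (rng.random(X.shape) > drop)   # corrupt: zero out random entries
    Y = np.tanh(W @ Xt + b)                 # hidden representation y = s(W x̃ + b)
    Z = W.T @ Y                             # reconstruction z = Wᵀ y (tied weights)
    E = Z - X                               # error against the *clean* input
    loss = (E ** 2).sum() + lam * ((W ** 2).sum() + (b ** 2).sum())
    # Backprop: encoder term through tanh, plus the tied-decoder term.
    dPre = (W @ (2 * E)) * (1 - Y ** 2)
    gW = dPre @ Xt.T + Y @ (2 * E).T + 2 * lam * W
    gb = dPre.sum(axis=1, keepdims=True) + 2 * lam * b
    return loss, W - lr * gW, b - lr * gb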
Deep Canonical Correlation Analysis (1/2)
Objective: find maximally correlated representations of the two views by passing them through multiple stacked layers of nonlinear transformations.
For view 1 with input x1 ∈ R^(p1), the first layer computes h1 = s(W_1¹x1 + b_1¹) ∈ R^(c1), where W_1¹ ∈ R^(c1×p1) and b_1¹ ∈ R^(c1); the second layer computes h2 = s(W_2¹h1 + b_2¹) ∈ R^(c2); and for a network with d layers, the final representation is f1(x1) = s(W_d¹h_(d−1) + b_d¹) ∈ R^o. View 2 is mapped through its own network in the same way; a minimal sketch follows.
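A sketch (ours) of one view's forward pass; tanh stands in for the paper's nonlinearity s, and the layer sizes below are only an example:

import numpy as np

def view_forward(x, params, s=np.tanh):
    # params = [(W_1, b_1), ..., (W_d, b_d)]; computes h_l = s(W_l h_{l-1} + b_l).
    h = x
    for W, b in params:
        h = s(W @ h + b)
    return h                          # final o-dimensional representation f_v(x)

# Example shapes: p1 = 392 inputs, one hidden layer of width 2038, o = 50 outputs.
rng = np.random.default_rng(0)
sizes = [392, 2038, 50]
params = [(0.1 * rng.standard_normal((m, n)), np.zeros((m, 1)))
          for n, m in zip(sizes[:-1], sizes[1:])]
f1 = view_forward(rng.standard_normal((392, 1)), params)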
Deep Canonical Correlation Analysis (2/2)
Parameter optimization:

(θ1*, θ2*) = argmax_(θ1,θ2) corr(f1(X1; θ1), f2(X2; θ2))   (7)

1. Pretrain each layer with a denoising autoencoder
2. Compute the gradient of corr(H_1^d, H_2^d) with respect to the top-level representations H_1^d and H_2^d (a sketch of this objective follows)
3. Fine-tune the parameters W_l^v and b_l^v using backpropagation

(All hyperparameters are chosen to optimize the total correlation on a development set.)
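The total-correlation objective of step 2 can be written as the sum of singular values of T = Σ̂11^(−1/2) Σ̂12 Σ̂22^(−1/2); here is a numpy sketch of the objective (ours; see the paper for the closed-form gradient with respect to H_1^d and H_2^d):

import numpy as np

def total_corr(H1, H2, r=1e-4):
    # H1, H2: (o, n) top-level representations; returns corr(H1, H2).
    n = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    S11 = H1c @ H1c.T / (n - 1) + r * np.eye(H1.shape[0])
    S22 = H2c @ H2c.T / (n - 1) + r * np.eye(H2.shape[0])
    S12 = H1c @ H2c.T / (n - 1)
    d1, V1 = np.linalg.eigh(S11)                     # for the inverse square roots
    d2, V2 = np.linalg.eigh(S22)
    T = (V1 * d1 ** -0.5) @ V1.T @ S12 @ (V2 * d2 ** -0.5) @ V2.T
    return np.linalg.svd(T, compute_uv=False).sum()  # sum of singular values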
Non-saturating sigmoid function: if g : R → R is the function g(y) = y³/3 + y, then s(x) = g⁻¹(x).
- Its derivative is a simple function of its value: since g′(y) = y² + 1, we get s′(x) = 1/(s(x)² + 1)
- s is not bounded (unlike the logistic sigmoid)
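Since g is a monotone cubic, s = g⁻¹ has no closed form but is cheap to evaluate numerically; a small sketch (ours) using Newton's method:

import numpy as np

def nonsat_s(x, iters=20):
    # Solve g(y) = y**3/3 + y = x for y by Newton's method; g'(y) = y**2 + 1 >= 1.
    y = np.cbrt(3.0 * x)                              # good start for large |x|, exact at 0
    for _ in range(iters):
        y -= (y ** 3 / 3.0 + y - x) / (y ** 2 + 1.0)  # Newton step on g(y) - x = 0
    return y

# The derivative needed for backprop is a simple function of s's value:
# s'(x) = 1 / g'(s(x)) = 1 / (s(x)**2 + 1).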
MNIST Handwritten Digits
Learn correlated representations of the left and right halves of the images
- 54,000 training, 6,000 development, and 10,000 test images
- Each image is a 28 × 28 matrix of pixels; splitting it in half gives 392 features per view
- For KCCA, radial basis function (RBF) kernels are used
- Selected layer width for the DCCA-50-2 model: 2038 (left half-images), 1608 (right half-images)
[Table: total correlation captured in the 50 most correlated dimensions on the split MNIST dataset]
Articulatory Speech Data (1/2)
Wisconsin X-ray Microbeam Database (XRMB) of simultaneous acoustic and articulatory recordings
- 5 independent experiments; in each, 60% training, 20% development, 20% test
- Acoustic view: MFCC features x1 ∈ R^273; articulatory view: XRMB features x2 ∈ R^112
- For KCCA, RBF kernels or polynomial kernels of degree d are used
- Selected layer width for the DCCA-50-2 model: 1641 (MFCC), 1769 (XRMB)

[Table: total correlation captured in the 50 most correlated dimensions on the articulatory dataset]
Articulatory Speech Data (2/2)
[Figure: total correlation captured by DCCA-112-d, for d ranging from 3 to 8]