The Out-of-Sample Problem for Classical
Multidimensional Scaling: Addendum

Minh Tang∗ and Michael W. Trosset†

Technical Report 10-03
Department of Statistics
Indiana University
Bloomington, IN

November 7, 2010

∗ Formerly School of Informatics & Computing, Indiana University; currently Department of Applied Mathematics & Statistics, Johns Hopkins University. E-mail: [email protected], [email protected]
† Department of Statistics, Indiana University. E-mail: [email protected]
Abstract

Trosset and Priebe [11] based their out-of-sample extension of classical multidimensional scaling (CMDS) on the fact that CMDS approximates fallible inner product data with Euclidean inner products. Applying this criterion to the out-of-sample case results in a nonlinear optimization problem. For the special case of embedding a single point, Trosset and Priebe noted that dropping the quartic term in their quartic objective function results in a convex quadratic objective function. The minimizer of this function is an approximate solution of the out-of-sample embedding problem.

Trosset and Priebe noted that this approximate solution is the out-of-sample embedding technique proposed by Anderson and Robinson [1]. The purpose of this addendum is to show that it is also the embedding technique proposed by de Silva and Tenenbaum [4, 5] for Landmark MDS and by Bengio et al. [3, 2].
Key words: Spectral decomposition, Nyström formula, Landmark MDS.
Contents

1 Introduction
2 Exact and Approximate Solutions
3 Out-of-Sample Embedding for PCA
4 Out-of-Sample Embedding for Kernel Methods
5 Out-of-Sample Embedding in Landmark MDS
1 Introduction
Trosset and Priebe [11] considered the problem of embedding k points in a configuration of
n points previously constructed by classical multidimensional scaling (CMDS).
Their approach was based on the fact that CMDS solves the problem of finding the matrix
of d-dimensional Euclidean inner products that best approximates (in the sense of squared
error) a matrix of fallible inner products. This approximation clearly inspired Torgerson’s
[10] original formulation of CMDS; it was stated explicitly in [8].
Other researchers have proposed solutions for the special case of k = 1. Trosset and
Priebe noted that, “In this case, the optimal out-of-sample embedding can be approximated
by a simple formula, and the resulting approximate out-of-sample embedding turns out to be
identical to the out-of-sample embedding proposed in [1].” The purpose of this addendum
is to show that it is also identical to the out-of-sample embeddings proposed in [3, 2] and
[4, 5].
2 Exact and Approximate Solutions
Let $\Delta^2 = [\delta_{ij}^2]$ denote the matrix of squared dissimilarities of the original $n$ objects and let $x_1, \ldots, x_n \in \Re^d$ denote the configuration constructed by CMDS. Let $a^2 \in \Re^n$ denote the vector of squared dissimilarities of the new object from the original $n$ objects, and let
$$
A^2 = \begin{bmatrix} \Delta^2 & a^2 \\ (a^2)^t & 0 \end{bmatrix}. \tag{1}
$$
For $w$ such that $e^t w \neq 0$, let
$$
\tau_w(A^2) = -\frac{1}{2} \left( I - \frac{e w^t}{e^t w} \right) A^2 \left( I - \frac{w e^t}{e^t w} \right).
$$
Set $w = (e^t, 0)^t$ and
$$
B = \tau_w(A^2) = \begin{bmatrix} \tau_e(\Delta^2) & b \\ b^t & \beta \end{bmatrix}. \tag{2}
$$
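For readers who prefer to see the construction numerically, here is a minimal NumPy sketch of the operator $\tau_w$ and of the partition of $B$ in (2). The function and variable names (tau_w, bordered_B, D2, a2) are ours, introduced only for illustration.

```python
import numpy as np

def tau_w(A2, w):
    """Double-centering operator: tau_w(A2) = -1/2 (I - e w'/e'w) A2 (I - w e'/e'w)."""
    m = A2.shape[0]
    e = np.ones((m, 1))
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    P = np.eye(m) - (e @ w.T) / float(e.T @ w)
    return -0.5 * P @ A2 @ P.T

def bordered_B(D2, a2):
    """Build A2 as in (1), apply tau_w with w = (e', 0)', and return the blocks of (2)."""
    n = D2.shape[0]
    A2 = np.block([[D2, a2.reshape(-1, 1)],
                   [a2.reshape(1, -1), np.zeros((1, 1))]])
    w = np.append(np.ones(n), 0.0)
    B = tau_w(A2, w)
    return B[:n, :n], B[:n, n], B[n, n]     # tau_e(Delta^2), b, beta
```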
The problem of embedding $y$ with respect to the fixed configuration $x_1, \ldots, x_n$ can then be formulated as the nonlinear optimization problem
$$
\min_{y \in \Re^d} \left\| B - \begin{bmatrix} x_1^t \\ \vdots \\ x_n^t \\ y^t \end{bmatrix} \begin{bmatrix} x_1 & \cdots & x_n & y \end{bmatrix} \right\|^2 .
$$
Because the term in the expansion of this squared norm that involves only $\tau_e(\Delta^2)$ and the $x_i$ does not depend on $y$, the problem is equivalent to
$$
\min_{y \in \Re^d} \; 2 \sum_{i=1}^n \left( b_i - x_i^t y \right)^2 + \left( \beta - y^t y \right)^2. \tag{3}
$$
Let $X$ denote the $n \times d$ data matrix whose rows are the $x_i^t$. Trosset and Priebe observed the following:

If the term $(\beta - y^t y)^2$ is dropped from the objective function in (3), then the function that remains is convex, with stationary equation $X^t X y = X^t b$. Assuming that $X$ has full rank (otherwise, a smaller $d$ will suffice), $X^t X$ is invertible and $\hat{y} = (X^t X)^{-1} X^t b$, the unique solution of the stationary equation, approximates the optimal out-of-sample embedding defined by (3). The approximate out-of-sample embedding $\hat{y}$ was previously proposed by Anderson and Robinson [1] for reasons that differ from ours.

They did not comment on connections between the approximate solution and the out-of-sample embedding techniques proposed by de Silva and Tenenbaum [4] for Landmark MDS and by Bengio et al. [3, 2].
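A brief sketch (with synthetic data of our own devising) of how the exact minimizer of (3) and the closed-form approximation $\hat{y} = (X^t X)^{-1} X^t b$ might be computed and compared:

```python
import numpy as np
from scipy.optimize import minimize

def approximate_embedding(X, b):
    """Closed-form approximation: drop (beta - y'y)^2 and solve X'X y = X'b."""
    return np.linalg.solve(X.T @ X, X.T @ b)

def exact_embedding(X, b, beta):
    """Numerically minimize the quartic objective in (3), warm-started at y_hat."""
    def objective(y):
        r = b - X @ y
        return 2.0 * (r @ r) + (beta - y @ y) ** 2
    return minimize(objective, approximate_embedding(X, b), method="BFGS").x

# Toy illustration with synthetic, noiseless inner products (our own data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))           # existing CMDS configuration x_1, ..., x_n
y_true = rng.normal(size=2)            # hypothetical new point
b, beta = X @ y_true, y_true @ y_true  # inner products of the new point
print(approximate_embedding(X, b))     # y_hat = (X'X)^{-1} X'b
print(exact_embedding(X, b, beta))     # minimizer of (3); here both recover y_true
```

In this noiseless toy example the two embeddings coincide; they differ when the inner products $b$ and $\beta$ are fallible, which is precisely when the quartic term in (3) matters.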
3 Out-of-Sample Embedding for PCA
It is instructive to begin by considering how to embed new points in a configuration constructed by principal component analysis (PCA). Gower [6] noted that, if $z_1, \ldots, z_n \in \Re^q$ and $\delta_{ij} = \|z_i - z_j\|$, then the $d$-dimensional representation constructed by CMDS from $\Delta = [\delta_{ij}]$ is identical to the $d$-dimensional PCA representation of $z_1, \ldots, z_n$.
Let $P = I - ee^t/n$. Given an $n \times q$ data matrix $Z$, PCA computes the singular value decomposition of the centered data matrix:
$$
\tilde{Z} = PZ = U \Sigma V^t.
$$
The symmetric matrix of centered pairwise inner products is then
$$
K = \tilde{Z} \tilde{Z}^t = U \Sigma V^t V \Sigma U^t = U \Sigma^2 U^t.
$$
The $d$-dimensional PCA representation of $Z$ is the $n \times d$ data matrix $X_d = U_d \Sigma_d$, where $\Sigma_d$ is the $d \times d$ diagonal matrix that contains the square roots of the $d$ largest eigenvalues of $K$ and $U_d$ is the $n \times d$ matrix that contains the corresponding eigenvectors. Notice that
$$
X_d = U_d \Sigma_d = U \Sigma^2 U^t U_d \Sigma_d^{-1} = K U_d \Sigma_d^{-1}.
$$
Let $x_i \in \Re^d$ denote the PCA representation of $z_i$, i.e., $x_i^t$ is row $i$ of $X_d$. Let $k_i$ denote column $i$ of $K$, i.e.,
$$
k_i = \begin{bmatrix} \langle z_1 - \bar{z}, z_i - \bar{z} \rangle \\ \vdots \\ \langle z_n - \bar{z}, z_i - \bar{z} \rangle \end{bmatrix}.
$$
Because $K$ is symmetric, $k_i^t$ is row $i$ of $K$ and we obtain
$$
x_i^t = k_i^t U_d \Sigma_d^{-1}, \quad \text{or} \quad x_i = \Sigma_d^{-1} U_d^t k_i. \tag{4}
$$
Equation (4) suggests a way to embed another $z \in \Re^q$ in the $d$-dimensional PCA representation of $z_1, \ldots, z_n$: first compute
$$
k = \begin{bmatrix} \langle z_1 - \bar{z}, z - \bar{z} \rangle \\ \vdots \\ \langle z_n - \bar{z}, z - \bar{z} \rangle \end{bmatrix},
$$
then
$$
x = \Sigma_d^{-1} U_d^t k. \tag{5}
$$
Notice that $k$ equals $b$ in (2); hence,
$$
X_d^t X_d x = (U_d \Sigma_d)^t U_d \Sigma_d \Sigma_d^{-1} U_d^t k = \Sigma_d U_d^t U_d U_d^t k = \Sigma_d U_d^t k = X_d^t k = X_d^t b,
$$
which demonstrates that (5) is the approximate solution of (3).
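The following sketch (synthetic data, our variable names) traces the PCA construction and verifies that the out-of-sample coordinate from (5) satisfies the normal equations $X_d^t X_d x = X_d^t k$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, d = 50, 5, 2
Z = rng.normal(size=(n, q))            # rows are the original observations z_i
z_new = rng.normal(size=q)             # the point z to be embedded

z_bar = Z.mean(axis=0)
Z_tilde = Z - z_bar                    # centered data matrix PZ
K = Z_tilde @ Z_tilde.T                # centered inner products <z_i - z_bar, z_j - z_bar>

evals, evecs = np.linalg.eigh(K)       # spectral decomposition of K
idx = np.argsort(evals)[::-1][:d]      # d leading eigenpairs
U_d = evecs[:, idx]
Sigma_d = np.diag(np.sqrt(evals[idx]))
X_d = U_d @ Sigma_d                    # d-dimensional PCA configuration

k = Z_tilde @ (z_new - z_bar)          # <z_i - z_bar, z - z_bar>, i = 1, ..., n
x = np.linalg.solve(Sigma_d, U_d.T @ k)            # formula (5)
assert np.allclose(X_d.T @ X_d @ x, X_d.T @ k)     # normal equations X_d'X_d x = X_d'k
```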
4 Out-of-Sample Embedding for Kernel Methods
In the language of kernel methods, $K$ in Section 3 is a kernel matrix. More generally, we replace $\Re^q$ with a feature space $\Xi$ and suppose that $\gamma : \Xi \times \Xi \to \Re$ is a symmetric positive semidefinite function, i.e., for any $z_1, \ldots, z_n \in \Xi$, the $n \times n$ matrix $\Gamma = [\gamma(z_i, z_j)]$ is symmetric and positive semidefinite. We center $\gamma$ with respect to the centroid (mean) of a given $z_1, \ldots, z_n \in \Xi$ by computing
$$
\tilde{\gamma}(u, v) = \gamma(u, v) - \frac{1}{n} \sum_{j=1}^n \gamma(u, z_j) - \frac{1}{n} \sum_{i=1}^n \gamma(z_i, v) + \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \gamma(z_i, z_j).
$$
Given $z_1, \ldots, z_n \in \Xi$, kernel PCA replaces $K = \tilde{Z} \tilde{Z}^t$ with the centered inner product matrix
$$
K = \tilde{\Gamma} = P \Gamma P = [\tilde{\gamma}(z_i, z_j)].
$$
The derivation of a formula for out-of-sample embedding for kernel PCA is identical to the derivation of formula (5) for PCA. To embed another $z \in \Xi$ in the $d$-dimensional representation of $z_1, \ldots, z_n$, first compute
$$
k = \begin{bmatrix} \tilde{\gamma}(z_1, z) \\ \vdots \\ \tilde{\gamma}(z_n, z) \end{bmatrix}, \tag{6}
$$
then
$$
x = \Sigma_d^{-1} U_d^t k.
$$
This version of (5) appears in [12].
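A sketch of the kernel version, using a Gaussian kernel purely as a stand-in for a generic symmetric positive semidefinite $\gamma$ (the kernel choice, data, and names are ours):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Stand-in for a generic symmetric positive semidefinite gamma."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def kernel_oos_embedding(Z, z_new, d, kernel=gaussian_kernel):
    """Embed z_new via x = Sigma_d^{-1} U_d' k, with k as in (6)."""
    n = len(Z)
    Gamma = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])
    P = np.eye(n) - np.ones((n, n)) / n
    K = P @ Gamma @ P                                     # centered kernel matrix

    evals, evecs = np.linalg.eigh(K)
    idx = np.argsort(evals)[::-1][:d]
    U_d = evecs[:, idx]
    Sigma_d = np.diag(np.sqrt(np.maximum(evals[idx], 0.0)))

    g = np.array([kernel(zi, z_new) for zi in Z])         # gamma(z_i, z)
    k = g - Gamma.mean(axis=1) - g.mean() + Gamma.mean()  # centered values gamma~(z_i, z)
    return np.linalg.solve(Sigma_d, U_d.T @ k)

# Toy usage: embed a new point relative to 40 existing points in R^3.
rng = np.random.default_rng(3)
Z = rng.normal(size=(40, 3))
x = kernel_oos_embedding(Z, rng.normal(size=3), d=2)
```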
Bengio et al. [3, 2] obtained out-of-sample extensions of various kernel methods by applying the above construction with suitable kernel functions. For CMDS, one begins with a dissimilarity function $\delta : \Xi \times \Xi \to \Re$ and transforms it to an approximate kernel function by setting
$$
\tilde{\gamma}(u, v) = -\frac{1}{2} \left( \delta^2(u, v) - \frac{1}{n} \sum_{j=1}^n \delta^2(u, z_j) - \frac{1}{n} \sum_{i=1}^n \delta^2(z_i, v) + \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \delta^2(z_i, z_j) \right). \tag{7}
$$
The function $\tilde{\gamma}$ is approximate in the sense that the matrix $[\tilde{\gamma}(z_i, z_j)]$ may not be positive semidefinite; CMDS approximates this matrix with the nearest symmetric matrix that is positive semidefinite and has rank $\leq d$.
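Transformation (7) amounts to double-centering the squared dissimilarities; a short sketch in our notation:

```python
import numpy as np

def cmds_kernel_vector(delta2_new, Delta2):
    """Vector k of (6) built from (7): double-centered squared dissimilarities.

    delta2_new : length-n array of delta^2(z_i, z), squared dissimilarities to the new point
    Delta2     : n-by-n array of squared dissimilarities among the original points
    """
    row_means = Delta2.mean(axis=1)    # (1/n) sum_j delta^2(z_i, z_j)
    grand_mean = Delta2.mean()         # (1/n^2) sum_{i,j} delta^2(z_i, z_j)
    new_mean = delta2_new.mean()       # (1/n) sum_j delta^2(z, z_j)
    return -0.5 * (delta2_new - row_means - new_mean + grand_mean)
```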
Let $x_1, \ldots, x_n \in \Re^d$ be the configuration constructed by CMDS. To embed $z$, define $k$ by using (7) in (6). Notice that $k$ equals $b$ in (2). The embedding proposed by Bengio et al. [3, 2] is (5), which we have already demonstrated is the approximate solution of (3). Indeed, Bengio et al. [2, Proposition 3] clearly understood that this was the case. Curiously, they explicitly rejected the possibility of minimizing (3):

Note that by doing so, we do not seek to approximate $\beta = \tilde{\gamma}(z, z)$. Future work should investigate embeddings which minimize the empirical reconstruction error of $\tilde{\gamma}$ but ignore the diagonal contributions.1

In our view, CMDS does not ignore diagonal contributions and neither should out-of-sample extensions of CMDS, unless one seeks a less expensive approximate solution.
5 Out-of-Sample Embedding in Landmark MDS
Landmark MDS was proposed in [4] by de Silva and Tenenbaum, who subsequently elaborated in [5]. The basic idea of Landmark MDS is to embed a large set of points by first
embedding a small subset of “landmark” points, then positioning each additional point in
relation to the landmark points. The idea is an old one, first used by Kruskal and Hart [7].
The problem of positioning each additional point is the problem of out-of-sample embedding.
Unlike Kruskal and Hart [7], de Silva and Tenenbaum were seeking an alternative to
applying CMDS to an entire set of points.2 Landmark MDS embeds the landmark points by
CMDS, then embeds the remaining points by “distance-based triangulation.” The resulting
out-of-sample embedding formula [5, Equation (3)] is
$$
x = -\frac{1}{2} L^{\#} \left( a^2 - \frac{1}{n} \Delta^2 e \right), \tag{8}
$$
where $L^{\#} = \Sigma_d^{-1} U_d^t$ is the transposed pseudoinverse of $X_d^t = \Sigma_d U_d^t$, i.e., the pseudoinverse of $X_d = U_d \Sigma_d$. Some manipulation reveals that
$$
-\frac{1}{2} \left( a^2 - \frac{1}{n} \Delta^2 e \right) = b + \alpha e;
$$
hence,
$$
X_d^t X_d x = (U_d \Sigma_d)^t U_d \Sigma_d \Sigma_d^{-1} U_d^t (b + \alpha e) = X_d^t b + \alpha X_d^t e = X_d^t b
$$
(the last equality holds because the CMDS configuration is centered, so $X_d^t e = 0$), which demonstrates that (8) is the approximate solution of (3).
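As a final sketch (synthetic landmark data, variable names ours), the following applies formula (8) and confirms numerically that it coincides with $\hat{y} = (X_d^t X_d)^{-1} X_d^t b$, the approximate solution of (3):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
n, d = 30, 2
Z = rng.normal(size=(n, d))                       # landmark points (synthetic)
z_new = rng.normal(size=d)                        # point to embed out of sample

Delta2 = cdist(Z, Z) ** 2                         # squared dissimilarities among landmarks
a2 = cdist(Z, z_new[None, :]).ravel() ** 2        # squared dissimilarities to the new point

# CMDS on the landmarks.
P = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * P @ Delta2 @ P                         # tau_e(Delta^2)
evals, evecs = np.linalg.eigh(B)
idx = np.argsort(evals)[::-1][:d]
U_d = evecs[:, idx]
Sigma_d = np.diag(np.sqrt(evals[idx]))
X_d = U_d @ Sigma_d                               # landmark configuration

# Landmark MDS embedding of the new point, formula (8).
L_pinv = np.linalg.solve(Sigma_d, U_d.T)          # L# = Sigma_d^{-1} U_d'
x_lmds = -0.5 * L_pinv @ (a2 - Delta2 @ np.ones(n) / n)

# Approximate solution of (3): the vector b of (2) equals -1/2 P (a^2 - Delta^2 e / n).
b = -0.5 * P @ (a2 - Delta2 @ np.ones(n) / n)
y_hat = np.linalg.solve(X_d.T @ X_d, X_d.T @ b)
print(np.allclose(x_lmds, y_hat))                 # True: the two embeddings coincide
```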
Acknowledgments
This research was supported by a grant from the Office of Naval Research.
Footnotes

1. We have quoted Bengio et al. [2, Section 4] verbatim, but substituted our notation for theirs.
2. Isomap [9] uses CMDS to embed shortest path distances. Isomap's embedding step is a computational bottleneck, which de Silva and Tenenbaum sought to mitigate.
References
[1] M. J. Anderson and J. Robinson. Generalized discriminant analysis based on distances.
Australian & New Zealand Journal of Statistics, 45:301–318, 2003.
[2] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 177–184. MIT Press, Cambridge, MA, 2003.
[3] Y. Bengio, J.-F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Technical Report 1238, Département d'Informatique et Recherche Opérationnelle, Université de Montréal, Montréal, Québec, Canada, July 2003.
[4] V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, Cambridge, MA, 2003.
[5] V. de Silva and J. B. Tenenbaum. Sparse multidimensional scaling using landmark points. Available at http://pages.pomona.edu/~vds04747/public/papers/landmarks.pdf, June 2004.
[6] J. C. Gower. Some distance properties of latent root and vector methods in multivariate
analysis. Biometrika, 53:325–338, 1966.
[7] J. B. Kruskal and R. E. Hart. A geometric interpretation of diagnostic data from a
digital machine: Based on a study of the Morris, Illinois electronic central office. Bell
System Technical Journal, 45:1299–1338, 1966.
[8] K. V. Mardia. Some properties of classical multi-dimensional scaling. Communications
in Statistics—Theory and Methods, A7:1233–1241, 1978.
[9] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[10] W. S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika,
17:401–419, 1952.
[11] M. W. Trosset and C. E. Priebe. The out-of-sample problem for classical multidimensional scaling. Computational Statistics and Data Analysis, 52:4635–4642, June 2008.
[12] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.