Understanding and Enhancing the Folding-in Method in Latent Semantic Indexing

Xiang Wang, Xiaoming Jin
Intelligent Data Engineering Group, School of Software
Tsinghua University
17th International Conference on Database and Expert Systems Applications (DEXA ’06)
Outline
1. Background
2. Motivation
3. Our Results
4. Experiments
5. Summary
6. Q&A
Vector Space Model for Text Retrieval
The Problem of VSM
The dimension of the vector space, which equals the number of distinct terms across all documents, can be very high in practice, typically in the thousands.
The effectiveness and efficiency of VSM-based text retrieval suffer from the curse of dimensionality.
Latent Semantic Indexing (LSI) was proposed to address the high dimensionality of VSM.
Latent Semantic Indexing
1. Given a term-document matrix $A$, perform Singular Value Decomposition (SVD) on $A$: $A = U \Sigma V^T$.
2. Reduce the dimension of $A$ to $k$: $A_k = U_k \Sigma_k V_k^T$.
3. Project the original document vectors onto a lower-dimensional subspace: $\bar{d} = U_k^T d$.
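The three steps above can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `lsi_project` and the toy 5-term, 4-document matrix are assumptions made for the example.

```python
import numpy as np

def lsi_project(A, k):
    """Truncated SVD of a term-document matrix A (terms x documents).

    Returns U_k (basis of the k-dimensional subspace), the k largest
    singular values, and the projected documents U_k^T A.
    """
    # Step 1: A = U Sigma V^T (thin SVD, singular values sorted descending)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Step 2: keep only the k leading singular triplets
    U_k = U[:, :k]
    s_k = s[:k]
    # Step 3: each column of D_bar is d_bar = U_k^T d for one document d
    D_bar = U_k.T @ A
    return U_k, s_k, D_bar

# Toy matrix (hypothetical data): rows are terms, columns are documents
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])
U_k, s_k, D_bar = lsi_project(A, k=2)
```

Here `U_k @ D_bar` reproduces $A_k$, since $U_k U_k^T A = U_k \Sigma_k V_k^T$.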
The Folding-in Method
The computational complexity of LSI is high, mainly due to the SVD step.
The folding-in method is used as an approximation to LSI. Instead of performing SVD on $A$, it performs SVD on $A_1$, which is a sample of the columns of $A$. The remaining documents are then folded in by projecting them onto the subspace obtained from $A_1$.
$A_1$ is sometimes called the training set.
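A minimal sketch of folding-in follows, assuming the training columns are given as an index array. The function name `fold_in`, the random data, and the sample size are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def fold_in(A, train_idx, k):
    """Approximate LSI by running SVD only on the training columns A1."""
    A1 = A[:, train_idx]                      # sampled columns of A
    U1, _, _ = np.linalg.svd(A1, full_matrices=False)
    U1_k = U1[:, :k]                          # basis learned from A1 only
    # Fold in every document (training or not) by projecting onto that basis
    D_bar = U1_k.T @ A
    return U1_k, D_bar

rng = np.random.default_rng(0)
A = rng.random((50, 20))                      # hypothetical 50 terms x 20 documents
train_idx = rng.choice(A.shape[1], size=8, replace=False)
U1_k, D_bar = fold_in(A, train_idx, k=3)
```

The SVD cost now depends on the 8 sampled columns rather than all 20, which is the source of the savings; the quality of `U1_k` depends entirely on which columns were sampled.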
Pros and Cons of the Folding-in Method
Pros
Possibly the best choice when no prior knowledge is available.
Easy to implement, with low computational complexity.
Cons
The effectiveness of the folding-in method relies on proper sampling.
There is no explicit way to verify the effectiveness of a selected training set.
Our Contributions
Understanding the Folding-in Method
We showed, from a linear algebra point of view, that the essence of the folding-in method is a subspace tracking process with partial information.
Enhancing the Folding-in Method
We proposed a novel training set selection strategy that is deterministic and more effective than random sampling.
The Folding-in Method as a Subspace Tracking Process
$A_k = U_k \Sigma_k V_k^T$, where $U_k^T U_k = I$.
Denote the range space of $A_k$ by $S_k$; then $U_k U_k^T$ is the orthogonal projection from $\mathbb{R}^m$ onto $S_k$.
$S_k$ is called the semantic subspace, which is considered to represent the latent semantic structure of the original document vectors.
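The projection property can be checked numerically. The sketch below uses a random matrix as a stand-in for a real term-document matrix (an assumption for illustration) and verifies that $U_k U_k^T$ is symmetric, idempotent, and leaves the columns of $A_k$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                         # hypothetical 6 terms x 4 documents
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
A_k = U_k @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of A
P = U_k @ U_k.T                                # candidate projector onto S_k

is_symmetric = np.allclose(P, P.T)             # P^T = P
is_idempotent = np.allclose(P @ P, P)          # P^2 = P
fixes_range = np.allclose(P @ A_k, A_k)        # P acts as identity on S_k
```

Symmetry plus idempotence is the defining property of an orthogonal projector, and `fixes_range` confirms that its range contains the columns of $A_k$.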
The Folding-in Method as a Subspace Tracking Process, cont.
The projection of $d$ in the original vector space $S$ onto the lower-dimensional subspace $S_k$ is $\bar{d}$.
$w = \|\bar{d}\|_2 / \|d\|_2$ equals the cosine of the angle between $d$ and the subspace $S_k$.
A larger $w$ implies that $d$ is closer to the semantic subspace we pursue.
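The cosine claim follows in a few steps. This is a sketch assuming $\bar{d} = U_k^T d$ with $U_k^T U_k = I$, and $P d \neq 0$; the symbols $P = U_k U_k^T$ and $\theta$ (the angle between $d$ and its projection onto $S_k$) are introduced here only for illustration.

```latex
\cos\theta
  = \frac{\langle d,\, P d \rangle}{\|d\|_2 \, \|P d\|_2}
  = \frac{\|P d\|_2^2}{\|d\|_2 \, \|P d\|_2}
  = \frac{\|P d\|_2}{\|d\|_2}
  = \frac{\|U_k^T d\|_2}{\|d\|_2}
  = \frac{\|\bar{d}\|_2}{\|d\|_2}
  = w
```

The second equality uses $P^T P = P$, and the fourth uses the fact that $U_k$ has orthonormal columns, so $\|U_k x\|_2 = \|x\|_2$. It follows that $0 \le w \le 1$, with $w = 1$ exactly when $d$ lies in $S_k$.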
A Novel Training Set Selection Strategy
Principle
The document vectors closest to the target semantic subspace are chosen as the training set.
Algorithm
Input: $A$, $k$, $n_1$
Output: $A_1$
1. Find $U_k$ for $A$.
2. Compute $w_i = \|v_i\|_2$ for all $1 \le i \le n$.
3. The $n_1$ documents with the largest $w_i$ are selected as the columns of $A_1$, which is the training set.
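The algorithm above might be sketched as follows. The slide does not define $v_i$; this sketch assumes $w_i$ is the ratio $\|U_k^T a_i\|_2 / \|a_i\|_2$ for the $i$-th column $a_i$ of $A$, matching the closeness measure $w$ defined earlier. The function name `select_training_set` and the random data are likewise assumptions.

```python
import numpy as np

def select_training_set(A, k, n1):
    """Pick the n1 columns of A closest to the rank-k semantic subspace."""
    # Step 1: find U_k for A
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Step 2: closeness of each document to S_k
    # (assumed form: norm of the projection over norm of the document)
    norms = np.linalg.norm(A, axis=0)
    safe_norms = np.where(norms == 0, 1.0, norms)   # guard against empty documents
    w = np.linalg.norm(U_k.T @ A, axis=0) / safe_norms
    # Step 3: indices of the n1 largest w_i, in descending order of w_i
    idx = np.argsort(-w)[:n1]
    return idx, w

rng = np.random.default_rng(2)
A = rng.random((30, 12))                            # hypothetical 30 terms x 12 documents
idx, w = select_training_set(A, k=3, n1=5)
A1 = A[:, idx]                                      # the training set
```

Unlike random sampling, the selection is deterministic for a given $A$, $k$, and $n_1$, but it requires $U_k$ of the full matrix.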
Implementation of Our Algorithm
Problems
The strength of our method comes from using the latent information contained in the semantic subspace during training set selection. However, as mentioned earlier, it is impractical to compute the semantic subspace over a very large document collection, which is exactly why the folding-in method is adopted.
Solutions
Instead of computing the semantic subspace for all documents, we perform training set selection on different subsets of the original document collection. Further selection can be performed on the preliminary results.
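One way to read this two-stage idea is sketched below. The slides do not say how the subsets are formed or how many documents each stage keeps, so the contiguous chunking and the parameters `chunk_size` and `per_chunk` are assumptions, as are the function names and the closeness measure $\|U_k^T a\|_2 / \|a\|_2$.

```python
import numpy as np

def closeness(A, k):
    """Assumed closeness of each column of A to the rank-k subspace of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    norms = np.linalg.norm(A, axis=0)
    return np.linalg.norm(U_k.T @ A, axis=0) / np.where(norms == 0, 1.0, norms)

def select(A, cols, k, n_keep):
    """Keep the n_keep columns (out of cols) closest to their own subspace."""
    w = closeness(A[:, cols], k)
    return cols[np.argsort(-w)[:n_keep]]

def staged_selection(A, k, n1, chunk_size, per_chunk):
    """Select within each subset first, then select again among the survivors."""
    n = A.shape[1]
    prelim = [select(A, np.arange(start, min(start + chunk_size, n)), k, per_chunk)
              for start in range(0, n, chunk_size)]
    pooled = np.concatenate(prelim)             # preliminary results
    return select(A, pooled, k, n1)             # further selection

rng = np.random.default_rng(3)
A = rng.random((40, 60))                        # hypothetical 40 terms x 60 documents
train_idx = staged_selection(A, k=3, n1=10, chunk_size=20, per_chunk=6)
```

Each SVD here touches at most `chunk_size` columns (or the pooled survivors), so no decomposition of the full collection is needed.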
Data Preparation
Identifier  Documents  Terms  Queries
MED         1033       5735   30
CISI        1460       5544   35
NPL         11429      7536   93
Table: Corpora used in the experiments
Experiment Settings
Similarity search was performed on each data set.
Average precision was used as the evaluation metric.
The results of LSI were used as ground truth.
The competitor is random sampling: 100 different random samples were drawn, and their best and average performances were recorded as Rand-best and Rand-avg, respectively.
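For reference, a common non-interpolated form of average precision is sketched below. The slides do not state which variant was used or how the LSI results were turned into a relevant set, so the function `average_precision` and its inputs are assumptions for illustration only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank    # precision at this rank
    # Relevant items that were never retrieved contribute zero
    return precision_sum / len(relevant)

# Hypothetical ranking: relevant documents 3 and 5 appear at ranks 1 and 4
ap = average_precision([3, 1, 7, 5], {3, 5})   # (1/1 + 2/4) / 2 = 0.75
```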
Experimental Results
Figure: Average precision with respect to LSI over the MED and CISI collections
Experimental Results, cont.
Figure: Average precision with respect to LSI over NPL collection
Experimental Results, cont.
Figure: Retrieval performance of gradual method over NPL collection
Summary
We theoretically justified the effectiveness of the folding-in method from a linear algebra point of view.
A novel, greedy training set selection strategy was proposed.
The idea of incremental subspace tracking can be developed further.
Thank You