Aggregating Anchor Text

Building Enriched Document Representations using Aggregated Anchor Text
Donald Metzler, Jasmine Novak,
Hang Cui, and Srihari Reddy
Yahoo! Labs
http://www.search-engines-book.com/
http://ciir.cs.umass.edu/~strohman/
http://research.microsoft.com/…
http://ciir.cs.umass.edu/~metzler/
The Importance of Anchor Text
•  Anchor text is the most important source of evidence for web search
ranking
–  Even more important than link analysis techniques
–  Queries and anchor text are lexically and semantically similar
•  Anchor text is also useful for other web-based IR tasks
–  Classification
–  Content match (advertising on web pages)
–  Image search
–  Summarization
The Anchor Text Sparsity Problem
•  Anchor text follows a power law
–  A few web pages have a large amount of anchor text
–  A large number of web pages have no anchor text
•  Since anchor text is so important for many web-based IR tasks, web
pages that have very little or no anchor text may be at an unfair disadvantage
–  Web pages with no anchor text are less likely to be retrieved than
pages with anchor text
–  Matching ads to pages without anchor text is likely less effective
than matching ads to pages with anchor text
–  Classification algorithms that use anchor text as a primary source of
evidence may perform poorly on pages with no anchor text
Overcoming Anchor Text Sparsity
•  The goal of this paper is to describe methods for overcoming the anchor
text sparsity problem
•  Three key steps
–  Aggregate anchor text
–  Weight aggregated anchor text
–  Build enriched document representation
•  We will also show that web search suffers from anchor text sparsity and
that our proposed method for aggregating anchor text can help improve
search relevance
Related Work
•  Anchor text representations
–  Using anchor text as meta-data [Brin and Page]
–  Anchor text models [Fujii]
•  Structured document retrieval
–  BM25F [Robertson et al.]
–  Mixtures of language models [Ogilvie and Callan]
•  Our approach is similar in spirit to, but differs from, previously proposed
approaches for spreading activation, link analysis, graph regularization,
score aggregation, and term frequency aggregation
–  These are framed as different problems, but they all tackle similar ones
–  We aggregate textual representations
•  Similar idea was applied to image retrieval [Harmandas et al.]
A Few Definitions
•  Hypertext link: consists of a destination URL and a short description
called the anchor text
–  <a href="http://www.sigir09.org">SIGIR 2009</a>
•  Anchor text line: a unique piece of anchor text and its weight
–  SIGIR 2009 (7.5), SIGIR (5.0), information retrieval (2.0)
•  Internal/external inlinks: links that point to a target URL that originate
from within/outside of the site of the target URL
•  Internal/external anchor text: anchor text associated with internal/
external links
Aggregating Anchor Text
•  Given a target URL u, we define the aggregated anchor text as the
external anchor text of the internal inlinks of u
•  The internal inlinks of u are typically created by the website publisher
–  Tend to link related pages
–  Can generally be trusted, although spammable
•  The external anchor text captures what the world (beyond the publisher)
thinks a page is about
–  Fewer navigational (e.g., “home”, “next”) anchor text lines than internal
anchor text
–  Spammable (e.g., “miserable failure”)
•  By semantic transitivity, the aggregated anchor text is likely to be a good
descriptor of the target URL u
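The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `site` helper (which treats the host name as the site), the `(source, destination, anchor text, weight)` tuple layout, and the `a.example`/`b.example`/`c.example` URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site(url):
    """Assumed notion of 'site': the host name of the URL."""
    return urlparse(url).netloc.lower()


def aggregate_anchor_text(target, links):
    """Collect the external anchor text of the internal inlinks of `target`.

    `links` is a list of (source_url, dest_url, anchor_text, weight) tuples.
    Returns {anchor_text: [weight, ...]}, one weight per contributing link.
    """
    # Internal inlinks: pages on the target's own site that link to it.
    internal_inlinks = {
        src for src, dst, _, _ in links
        if dst == target and src != target and site(src) == site(target)
    }
    aggregated = defaultdict(list)
    for src, dst, text, weight in links:
        # External anchor text: a link into an internal inlink from another site.
        if dst in internal_inlinks and site(src) != site(dst):
            aggregated[text.lower()].append(weight)
    return dict(aggregated)


links = [
    ("http://a.example/hub.html", "http://a.example/target.html", "next", 1.0),
    ("http://a.example/list.html", "http://a.example/target.html", "home", 1.0),
    ("http://b.example/x.html", "http://a.example/hub.html", "Savoy Ballroom", 1.0),
    ("http://c.example/y.html", "http://a.example/list.html", "savoy ballroom", 5.0),
    ("http://c.example/z.html", "http://a.example/list.html", "Lindy hop", 1.0),
    ("http://a.example/", "http://a.example/hub.html", "ballrooms", 1.0),
]
aggregated = aggregate_anchor_text("http://a.example/target.html", links)
```

Note that the navigational internal anchor text ("next", "home") never reaches the target; only text written by other sites about the two internal inlinks does.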
Weighting Aggregated Anchor Text
•  Same line of anchor text, with different weights, may originate from
multiple internal inlinks
•  Apply standard result set fusion approaches (MIN, MAX, MEAN, SUM) to
combine weights
[Figure: example web graph of dance-related pages on ballrooms.com, traveling.com, nyc.com, dancing.com, alldancing.com, and dancesite.com. Page snippets mention “Savoy Ballroom”, “Lindy hop”, and “dances in New York”; links carry numeric weights, and a group of pages within the same site is marked.]
Aggregated anchor text weights for the example, by fusion method:
•  MIN: savoy ballroom: 1, lindy hop: 1, dances in new york: 2
•  MAX: savoy ballroom: 5, lindy hop: 1, dances in new york: 2
•  MEAN: savoy ballroom: 3, lindy hop: 0.5, dances in new york: 1
•  SUM: savoy ballroom: 6, lindy hop: 1, dances in new york: 2
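The four fusion methods can be sketched as follows. This is a minimal illustration, not the authors' code. The function name `fuse` and the way the example weights are split across two internal inlinks are assumptions. The MEAN values in the example are consistent with dividing by the number of internal inlinks (two) rather than by the number of inlinks that carry a given line, so that is the convention assumed here.

```python
def fuse(weights_per_inlink, method):
    """Combine anchor text line weights across internal inlinks.

    `weights_per_inlink` is a list with one {anchor_text: weight} dict per
    internal inlink. `method` is one of "min", "max", "mean", "sum".
    """
    n_inlinks = len(weights_per_inlink)
    collected = {}
    for lines in weights_per_inlink:
        for text, weight in lines.items():
            collected.setdefault(text, []).append(weight)
    fusers = {
        "min": min,
        "max": max,
        "sum": sum,
        # Assumption: average over ALL internal inlinks, so an inlink that
        # lacks a line contributes a weight of zero.
        "mean": lambda ws: sum(ws) / n_inlinks,
    }
    fuser = fusers[method]
    return {text: fuser(ws) for text, ws in collected.items()}


# Hypothetical split of the example's weights across two internal inlinks.
example = [
    {"savoy ballroom": 1, "dances in new york": 2},
    {"savoy ballroom": 5, "lindy hop": 1},
]
fused = {method: fuse(example, method) for method in ("min", "max", "mean", "sum")}
```

Under these assumptions `fused` reproduces the example's numbers for all four methods, for instance 3 for "savoy ballroom" under MEAN and 6 under SUM.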
Enriched Document Representations
•  Aggregated anchor text can be used to build enriched document
representations
•  Three representations
–  Combined: append aggregated anchor text lines to the end of the
external anchor text section
–  Backoff: same as combined, except we only append lines of
aggregated anchor text that are not already part of the external
anchor text field
–  New section: create a new section within the document that
contains the aggregated anchor text
•  List not meant to be exhaustive, but rather to give an idea of how
aggregated anchor text can be used to enhance document
representations
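A rough sketch of the three representations is given below, assuming a document is a dictionary of fields and an anchor text line is a `(text, weight)` pair. The field names `external_anchor_text` and `aggregated_anchor_text` and the function `enrich` are hypothetical, chosen only for illustration.

```python
def enrich(doc, aggregated, mode):
    """Return a copy of `doc` enriched with aggregated anchor text lines.

    doc:        {"external_anchor_text": [(text, weight), ...], ...}
    aggregated: [(text, weight), ...]
    mode:       "combined", "backoff", or "new_section"
    """
    # Copy list-valued fields so the input document is left untouched.
    enriched = {f: list(v) if isinstance(v, list) else v for f, v in doc.items()}
    external = enriched.setdefault("external_anchor_text", [])
    if mode == "combined":
        # Append every aggregated line to the external anchor text section.
        external.extend(aggregated)
    elif mode == "backoff":
        # Append only lines whose text is not already in the external field.
        seen = {text for text, _ in external}
        external.extend((t, w) for t, w in aggregated if t not in seen)
    elif mode == "new_section":
        # Keep aggregated lines apart in their own section.
        enriched["aggregated_anchor_text"] = list(aggregated)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return enriched


doc = {"title": "The Savoy", "external_anchor_text": [("savoy ballroom", 7.5)]}
aggregated = [("savoy ballroom", 6), ("lindy hop", 1)]
combined = enrich(doc, aggregated, "combined")
backoff = enrich(doc, aggregated, "backoff")
new_section = enrich(doc, aggregated, "new_section")
```

Keeping the aggregated lines in a separate section lets a fielded ranker such as BM25F weight them differently from directly observed external anchor text, which the combined and backoff variants cannot do.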
Experiments
•  Our goal is to show that anchor text sparsity is a problem for web search and that
our aggregated anchor text approach can be used to improve search relevance
•  Large-scale web test collection
–  22,822 queries
–  524,418 judged query/URL pairs
–  Judgments are on 5-point scale (Perfect, Excellent, Good, Fair, Bad)
•  Aggregate anchor text across entire web graph
•  Evaluate results using DCG@1, DCG@5, and NDCG
•  Use modified version of BM25F for ranking
–  Baseline uses all document fields available, including anchor text
–  Parameters tuned to optimize NDCG
•  All experiments use 2-fold cross validation
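For readers unfamiliar with the metrics, the sketch below shows one common way to compute DCG@k and NDCG@k from graded judgments. The slides do not state the gain values or the discount function, so the `GAIN` mapping and the exponential-gain, log-discount form used here are assumptions and not necessarily what the experiments used.

```python
import math

# Assumed mapping from the 5-point judgment scale to integer grades.
GAIN = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Bad": 0}


def dcg_at_k(labels, k):
    """DCG@k for a ranked list of judgment labels (best rank first)."""
    return sum(
        (2 ** GAIN[label] - 1) / math.log2(rank + 1)
        for rank, label in enumerate(labels[:k], start=1)
    )


def ndcg_at_k(labels, k):
    """DCG@k divided by the DCG@k of the ideal reordering of the same labels."""
    ideal = dcg_at_k(sorted(labels, key=GAIN.get, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


ranking = ["Good", "Perfect", "Bad", "Fair", "Excellent"]
dcg1 = dcg_at_k(ranking, 1)
dcg5 = dcg_at_k(ranking, 5)
ndcg5 = ndcg_at_k(ranking, 5)
```

DCG@1 looks only at the top result, which is why it is sensitive to navigational queries, while DCG@5 and NDCG reward good results further down the list.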
Overcoming Anchor Text Sparsity
•  URLs with judgments (biased sample)
–  Our method reduces the number of URLs with no anchor text by
38%
–  Average of 34 lines of aggregated anchor text per URL
•  Random sample of 1 million URLs
–  32,715 have external anchor text (about 3%)
–  50,127 have aggregated anchor text (about 5%)
–  43,841 of the 50,127 did not have external anchor text
–  We can nearly double the number of URLs with anchor text
–  Avg. number of anchor text lines per page increases from 1 to 11
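The counts for the random sample can be cross-checked with a little arithmetic. The variable names below are illustrative; the four input numbers are taken directly from the bullets above.

```python
sample = 1_000_000
external = 32_715         # URLs with external anchor text
aggregated = 50_127       # URLs with aggregated anchor text
aggregated_only = 43_841  # of those, URLs without external anchor text

both = aggregated - aggregated_only            # URLs that have both kinds
any_anchor_text = external + aggregated_only   # URLs with at least one kind

print(f"external:   {external / sample:.1%}")
print(f"aggregated: {aggregated / sample:.1%}")
print(f"both kinds: {both:,}")
print(f"any kind:   {any_anchor_text:,} ({any_anchor_text / sample:.1%})")
print(f"growth:     {any_anchor_text / external:.2f}x")
```

So 6,286 URLs have both kinds of anchor text, and the number of sampled URLs with some form of anchor text grows from 32,715 to 76,556.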
Anchor Text Line Distribution
Web Search Results
Using aggregated anchor text enriched document representations
consistently and significantly improves retrieval effectiveness
Web Search Results (Continued…)
Aggregated anchor text helps improve ‘medium’
difficulty queries the most and (slightly) hurts easy
queries, which are mostly navigational
Web Search Results (Continued…)
Aggregated anchor text helps longer queries,
especially 4+ word queries, which tend to be more
difficult informational queries
Impact of Pruning Aggregated Anchor Text
•  Baseline (not shown): 7.576
•  Aggregated AT (all lines): 7.592
•  Aggregated AT (1 line): 7.584
•  Aggregated AT (100 lines): 7.594
Conclusions and Future Work
•  Anchor text plays an important role in web-based search tasks
•  The power law distribution of anchor text means that the many pages
with no anchor text are unfairly disadvantaged
•  Described a method for overcoming anchor text sparsity using the
external anchor text of the internal inlinks
•  Experimental results showed consistent and significant improvements
when aggregated anchor text is used to enrich web documents
•  Future work
–  Weight the aggregated anchor text based on how related it is to the
target page
–  Propagate anchor text weights using random walk model
Questions?