
Lecture 12: Link Analysis for Web Retrieval
Trevor Cohn
COMP90042, 2015, Semester 1
What we’ll learn in this lecture
- The web as a graph
- PageRank method for deriving the importance of pages
- Hubs and authorities method
Up until now . . .
- Documents assumed to be 'equal'
- Usefulness for ranking only affected by matching of query terms (and length)
- E.g., assumed P(R|d) uniform in probabilistic methods
- Can we do better than this?
- . . . some documents are authoritative and should be ranked higher than others.
The web as a graph
- Pages on the Web do not stand alone
- Treating them as independent "documents" is an over-simplification
- There is considerable information in the hyperlink structure
[Figure: example web graph linking pages about "Paraguay", "jaguar", "Mexico", "predator", "lion" and "tiger"]
What information does a hyperlink convey?
- Directs the user's attention to other pages
- Conferral of authority (not always!)
- Anchor text to explain why the linked page is of interest
  - "IBM computers" links to www.ibm.com
  - "search portal" links to www.yahoo.com
  - "click here" links to Adobe Acrobat
  - "evil empire" links to . . .
- Additional source of terms for indexing
- Perhaps the most important pages have more incoming links?
The web as a directed graph
Formally, consider
in-links: the number of incoming edges
out-links: the number of outgoing edges
connected component: a set of nodes in which a path connects every pair of nodes
[Figure: the same example web graph ("Paraguay", "jaguar", "Mexico", "predator", "lion", "tiger") viewed as a directed graph]
Not all links are equal
Who and what to trust?
- outgoing links from reputable sites should carry more weight than user-generated content and links from unknown websites
Web has a "bow-tie" structure, comprising
- "in" pages that only have outgoing edges, into the
- strongly connected component, whose pages are highly interlinked and which also link to the
- "out" pages that only have incoming edges

Typically we don't consider internal links within a web-site (why?)
PageRank
Assumptions
- links convey authority of the source page
- pages with more in-links from authoritative sources are more important
- how do we formalise this in a model?
Random web surfer
Consider a surfer who visits a web page
- then follows a random out-link, uniformly
- occasionally "teleports" to a new random page (types a new URL)
Inference problem: what happens over time? Where does the
surfer end up visiting most often?
Example graph
[Figure: three-node example graph with edges 1 → 2, 2 → 1, 2 → 3, 3 → 2]

Transition probabilities (no teleport for now):

P(1 → 1) = 0     P(1 → 2) = 1     P(1 → 3) = 0
P(2 → 1) = 1/2   P(2 → 2) = 0     P(2 → 3) = 1/2
P(3 → 1) = 0     P(3 → 2) = 1     P(3 → 3) = 0

Example from MRS, Chapter 21.
Example graph
[Figure: the same three-node example graph]

Represent as a matrix, P_ij = P(i → j). I.e.,

        [ 0    1    0   ]
    P = [ 1/2  0    1/2 ]
        [ 0    1    0   ]

Note that P is the adjacency matrix A_ij = ⟦edge exists i → j⟧, normalised such that each row sums to 1.
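A minimal Matlab sketch of this normalisation for the example graph (illustrative, not from the original slides; variable names are mine):

A = [0 1 0; 1 0 1; 0 1 0];      % adjacency matrix, A(i,j) = 1 if edge i -> j
P = diag(1 ./ sum(A, 2)) * A    % divide each row by its out-degree
% P = [0 1 0; 1/2 0 1/2; 0 1 0], matching the matrix above
% (assumes every page has at least one out-link)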
Adding Teleportation
If at each time step we randomly jump to another node in the graph with probability α:
- scale our original P matrix by 1 − α
- add α/N to the resulting matrix

Overall our transition matrix is

    P_ij = (1 − α) A_ij / (Σ_j' A_ij') + α/N

For the example with α = 0.5,

        [ 1/6   2/3   1/6  ]
    P = [ 5/12  1/6   5/12 ]
        [ 1/6   2/3   1/6  ]
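Continuing the earlier Matlab sketch (again illustrative, not from the slides), the teleportation-adjusted matrix can be built as:

A = [0 1 0; 1 0 1; 0 1 0];                       % adjacency matrix of the example graph
alpha = 0.5;                                     % teleportation probability
N = size(A, 1);
P = (1 - alpha) * diag(1 ./ sum(A, 2)) * A + alpha / N
% P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6], as on the slide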




Markov chain
Formally we have defined a Markov chain
- a discrete-time stochastic process
- consists of N states, one per web page, denoted ~x (a row vector)
- starting at page i, we use a "1-hot" representation, i.e., x_i = 1 and x_j = 0 for j ≠ i

Assumptions
- the probability of reaching a state is based only on the previous state:
  p(x_t | x_1, x_2, . . . , x_{t−1}) = p(x_t | x_{t−1})
- this is characterised by the transition matrix
Transitions as matrix multiplication
The probability chain rule can be expressed using matrix multiplication

    ~x P = [P(1) P(2) P(3)] × [ P(1 → 1)  P(1 → 2)  P(1 → 3)
                                P(2 → 1)  P(2 → 2)  P(2 → 3)
                                P(3 → 1)  P(3 → 2)  P(3 → 3) ]

         = [ P(1)P(1 → 1) + P(2)P(2 → 1) + P(3)P(3 → 1) ,
             P(1)P(1 → 2) + P(2)P(2 → 2) + P(3)P(3 → 2) ,
             P(1)P(1 → 3) + P(2)P(2 → 3) + P(3)P(3 → 3) ]

         = [P(X_t+1 = 1)  P(X_t+1 = 2)  P(X_t+1 = 3)]

where P(1) = P(X_t = 1) and P(2 → 3) = P(X_t+1 = 3 | X_t = 2).
Example
- start at state ~x = [0 1 0]
- after one time-step, the probability of the next state is ~x P = [0.4167 0.1667 0.4167]
- after two time-steps, (~x P)P = ~x P^2 = [0.2083 0.5833 0.2083]
- ~x P^3 = [0.3125 0.3750 0.3125]
- . . .
- ~x P^99 = [0.2778 0.4444 0.2778]
- ~x P^100 = ~π = [0.2778 0.4444 0.2778]
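These values can be reproduced with a few lines of Matlab (a sketch, not part of the original slides):

P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];   % transition matrix with teleportation
x = [0 1 0];                                     % start at page 2 (1-hot row vector)
x * P       % [0.4167 0.1667 0.4167]
x * P^2     % [0.2083 0.5833 0.2083]
x * P^99    % [0.2778 0.4444 0.2778], effectively the steady state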
Markov chain convergence
Run sufficiently long, the state distribution converges:
- it reaches a steady state, denoted ~π
- transitions from this state leave the state unmodified
- the state frequencies encode how often the random surfer visits each page in the limit as t → ∞

When will the Markov chain converge? It must have the property of
ergodicity: for any start state i, all states j must be reachable with non-zero probability for all t > T_0, for some constant T_0.
Ergodicity in turn requires
irreducibility (reachability between any i and j) and
aperiodicity (no partitioning of the states into sets with internal cycles).
Computing PageRank
- The definition of a steady state is
    ~π P = ~π
- I.e., once in the steady state, we remain in this state after a transition
- This is a classic linear algebra problem (finding the left eigenvectors), of the form
    ~π P = λ~π
- We can recover several solution vectors, for different values of λ
- We want the principal eigenvector, for which λ = 1

In practice, we may use the power iteration method to handle large graphs.
Iteration method in Matlab
>> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
>> pi0= [1/3 1/3 1/3];
>> pi0 * P
ans =
    0.2500    0.5000    0.2500
>> pi1 = pi0 * P
pi1 =
    0.2500    0.5000    0.2500
>> pi2 = pi1 * P
pi2 =
    0.2917    0.4167    0.2917
>> pi3 = pi2 * P
pi3 =
    0.2708    0.4583    0.2708
>> pi4 = pi3 * P
pi4 =
    0.2812    0.4375    0.2812
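Rather than stepping by hand, the fixed point can be found with a short power-iteration loop (an illustrative sketch, not from the original slides):

P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
pi_vec = [1/3 1/3 1/3];                  % any valid starting distribution
for k = 1:1000
    pi_new = pi_vec * P;                 % one transition step
    if norm(pi_new - pi_vec, 1) < 1e-10  % stop once the distribution stops changing
        break
    end
    pi_vec = pi_new;
end
pi_vec      % converges to [0.2778 0.4444 0.2778]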
Eigenvalue method in Matlab
>> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
>> [V,D,W] = eig(P);
>> pi = W(:,1);
>> pi = pi/sum(pi);
>> pi'
ans =
    0.2778    0.4444    0.2778
>> pi' * P
ans =
    0.2778    0.4444    0.2778
Hubs and authorities (HITS)
Assumes there are two kinds of pages on the web:

authorities: provide authoritative and detailed information
  - Australian Tax Office
  - Bureau of Meteorology
  - Wikipedia page on the Australian Cricket Team

hubs: contain mainly links to lots of pages about a topic
  - DMOZ web directories
  - Someone's Pinterest page
  - Wikipedia disambiguation pages

Depending on our query, we might want one or the other:
- broad topic query, e.g., information about biking in Melbourne
- specific query, e.g., is the Yarra trail sealed?
Hubs and authorities
Circular definition:
- A good hub links to many authorities.
- A good authority is linked to from many hubs.

Define
  ~h  hub scores for each web page
  ~a  authority scores for each web page

Mutually recursive definition, for all pages i:

    h_i ← Σ_{i → j} a_j
    a_i ← Σ_{j → i} h_j

This gives rise to an iterative algorithm for finding ~h and ~a.
Computing Hubs and Authorities
Define
  A  adjacency matrix, as before: A_ij = 1 denotes edge i → j

This leads to the relations

    ~h ← A ~a
    ~a ← A^T ~h

Combining the definition of ~h into that of ~a,

    ~a ← A^T A ~a

which is another eigenvalue problem. The principal eigenvector of A^T A can be used to solve for ~a, provided there is a steady-state solution (~h is found in a similar way).
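A minimal Matlab sketch of the iterative HITS updates (illustrative only, not from the slides; the per-step normalisation is my own addition to keep the scores bounded):

A = [0 1 0; 1 0 1; 0 1 0];          % adjacency matrix, A(i,j) = 1 if edge i -> j
n = size(A, 1);
h = ones(n, 1); a = ones(n, 1);     % initial hub and authority scores
for k = 1:100
    a = A' * h;  a = a / norm(a);   % authority score: sum of hub scores of in-linking pages
    h = A * a;   h = h / norm(h);   % hub score: sum of authority scores of linked-to pages
end
[h a]                               % converged hub and authority scores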
Summary: PageRank and Hubs and Authorities
- Both are static, query-independent measures of web page quality
- Can be run offline to score each web page
- Based on a latent (unobserved) quality metric for each page
  - a single importance score (PageRank)
  - hub and authority scores (HITS)
- Plus a transition mechanism
- Gives rise to document rankings
PageRank and HITS in a retrieval system
- How can we use these scores in a retrieval system?
- Alongside our other features, e.g., TF*IDF, BM25 factors, LM
- Express the model as a combination of factors, e.g.,

    RSV_d = Σ_i α_i h_i(~f_: , ~f_{d,:} , . . .)

- PR/H&A become additional features, each with their own weight α
- Learn the weighting using a machine-learned scoring function
  - to match binary relevance judgements
  - based on click-throughs or query reformulations

These methods can be exploited, e.g., link spam, Google bombs, Google bowling, etc.
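As a purely illustrative sketch of such a combination (the feature values and weights below are invented, not from the slides):

f = [12.3; 4.7; 0.0028];   % hypothetical per-document features: BM25, LM score, PageRank
alpha = [0.6; 0.3; 50];    % hypothetical learned weights
RSV_d = alpha' * f         % weighted linear combination used for ranking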
Looking back and forward
Back
- Link structure of the web gives rise to a graph
- Structure of the graph conveys information about the importance of pages
- PageRank models a random surfer using a Markov chain
- HITS models hubs versus authority nodes
- Solve for the steady state using power iteration or an eigenvalue solver
Looking back and forward
Forward
- Natural language processing, looking in more detail into the structure of text
- Starting with text classification
Further reading
- (Review of eigendecompositions) Section 18.1, "Linear algebra review", of Manning, Raghavan, and Schütze, Introduction to Information Retrieval.
- Chapter 21, "Link Analysis", of Manning, Raghavan, and Schütze, Introduction to Information Retrieval.