Lecture 12: Link Analysis for Web Retrieval
Trevor Cohn
COMP90042, 2015, Semester 1

What we'll learn in this lecture
- The web as a graph
- The PageRank method for deriving the importance of pages
- The hubs and authorities method

Up until now ...
- Documents were assumed to be 'equal'
- Usefulness for ranking was affected only by the matching of query terms (and document length)
- E.g., P(R|d) was assumed uniform in the probabilistic methods
- Can we do better than this?
- ... some documents are authoritative and should be ranked higher than others.

The web as a graph
- Pages on the web do not stand alone
- Treating them as independent "documents" is an over-simplification
- There is considerable information in the hyperlink structure

[Figure: example web graph with pages labelled Paraguay, jaguar, Mexico, predator, lion and tiger, connected by hyperlinks]

What information does a hyperlink convey?
- Directs the user's attention to other pages
- Conferral of authority (not always!)
- Anchor text to explain why the linked page is of interest:
  - "IBM computers" links to www.ibm.com
  - "search portal" links to www.yahoo.com
  - "click here" links to Adobe Acrobat
  - "evil empire" links to ...
- An additional source of terms for indexing
- Perhaps the most important pages have more incoming links?

The web as a directed graph
Formally, consider:
- in-links: the number of incoming edges of a node
- out-links: the number of outgoing edges of a node
- connected component: a set of nodes in which a path connects every pair of nodes

[Figure: the same example web graph, viewed as a directed graph]

Not all links are equal
Who and what to trust?
- Outgoing links from reputable sites should carry more weight than user-generated content and links from unknown websites.
The web has a "bow-tie" structure, comprising:
- "in" pages, which only have outgoing edges, pointing into the
- strongly connected component, whose pages are highly interlinked and which also links to
- "out" pages, which only have incoming edges.
Typically we don't consider internal links within a web site (why?)

PageRank
Assumptions:
- links convey the authority of the source page
- pages with more in-links from authoritative sources are more important
- how do we formalise this in a model?

Random web surfer
Consider a surfer who visits a web page and
- then follows a random out-link, chosen uniformly
- occasionally "teleports" to a new random page (types a new URL)
Inference problem: what happens over time? Where does the surfer end up visiting most often?

Example graph
[Figure: example graph with three nodes, 1, 2 and 3]
Transition probabilities (no teleport for now):

    P(1→1) = 0     P(1→2) = 1     P(1→3) = 0
    P(2→1) = 1/2   P(2→2) = 0     P(2→3) = 1/2
    P(3→1) = 0     P(3→2) = 1     P(3→3) = 0

Example from MRS, Chapter 21.

Example graph
Represent this as a matrix, P_ij = P(i→j), i.e.,

    P = [ 0    1    0
          1/2  0    1/2
          0    1    0  ]

Note that P is the adjacency matrix A_ij = [edge exists i→j], normalised so that each row sums to 1.

Adding teleportation
If at each time step we randomly jump to another node in the graph with probability α:
- scale the original P matrix by 1 − α
- add α/N to the resulting matrix
Overall the transition matrix is

    P_ij = (1 − α) · A_ij / Σ_j' A_ij' + α/N

For the example with α = 0.5:

    P = [ 1/6   2/3   1/6
          5/12  1/6   5/12
          1/6   2/3   1/6  ]
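As a worked check of this construction, here is a minimal Matlab sketch (the variable names and the explicit adjacency matrix are assumptions for illustration) that builds the teleportation-adjusted transition matrix for the example graph:

    % Build the PageRank transition matrix from an adjacency matrix,
    % mixing in teleportation with probability alpha.
    A = [0 1 0; 1 0 1; 0 1 0];          % example graph: edges 1->2, 2->1, 2->3, 3->2
    alpha = 0.5;
    N = size(A, 1);
    P = diag(1 ./ sum(A, 2)) * A;       % row-normalise A: follow a random out-link
    P = (1 - alpha) * P + alpha / N;    % mix in the uniform teleportation step
    % P is now [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6], matching the slide above.

The sketch assumes every page has at least one out-link; pages with no out-links (dead ends) would need their rows handled separately, e.g., by teleporting from them with probability 1.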
Markov chain
Formally, we have defined a Markov chain:
- a discrete-time stochastic process
- consisting of N states, one per web page; the distribution over states is denoted ~x (a row vector)
- starting at page i, we use a "1-hot" representation, i.e., x_i = 1 and x_j = 0 for j ≠ i
Assumptions:
- the probability of reaching a state depends only on the previous state,
      p(x_t | x_1, x_2, ..., x_{t-1}) = p(x_t | x_{t-1})
- this is characterised by the transition matrix

Transitions as matrix multiplication
The probability chain rule can be expressed using matrix multiplication:

    ~x P = [ P(1)  P(2)  P(3) ] × [ P(1→1)  P(1→2)  P(1→3)
                                    P(2→1)  P(2→2)  P(2→3)
                                    P(3→1)  P(3→2)  P(3→3) ]
         = [ P(1)P(1→1) + P(2)P(2→1) + P(3)P(3→1),
             P(1)P(1→2) + P(2)P(2→2) + P(3)P(3→2),
             P(1)P(1→3) + P(2)P(2→3) + P(3)P(3→3) ]
         = [ P(X_{t+1} = 1)  P(X_{t+1} = 2)  P(X_{t+1} = 3) ]

where P(1) = P(X_t = 1) and P(2→3) = P(X_{t+1} = 3 | X_t = 2).

Example
- start with state ~x
- after one time step, the probability of the next state is ~x P
- after two time steps, it is (~x P) P = ~x P^2
- ...
With ~x = [0 1 0]:

    ~x P     = [0.4167 0.1667 0.4167]
    ~x P^2   = [0.2083 0.5833 0.2083]
    ~x P^3   = [0.3125 0.3750 0.3125]
    ...
    ~x P^99  = [0.2778 0.4444 0.2778]
    ~x P^100 = ~π = [0.2778 0.4444 0.2778]

Markov chain convergence
Run sufficiently long, the state distribution converges:
- it reaches a steady state, denoted ~π
- transitions from this state leave it unmodified
- the state frequencies encode how often the random surfer visits each page in the limit as t → ∞

When will the Markov chain converge? It must have the property of ergodicity: for any start state i, all states j must be reachable with non-zero probability for all t > T_0, for some constant T_0. Ergodicity in turn requires irreducibility (reachability between any i and j) and aperiodicity (relating to the states not partitioning into sets with internal cycles).
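To see this convergence concretely on the example chain, a small Matlab sketch (the step count of 100 is an arbitrary choice, matching the iterates above) can run the surfer from each possible start page; because the teleportation-adjusted chain is ergodic, every start reaches the same steady state:

    % Ergodicity in practice: whichever 1-hot start state we choose,
    % repeated transitions converge to the same distribution ~pi.
    P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
    for i = 1:3
        x = zeros(1, 3);
        x(i) = 1;                       % start the surfer at page i
        disp(x * P^100)                 % approx [0.2778 0.4444 0.2778] every time
    end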
Computing PageRank
- The definition of a steady state is ~π P = ~π
- I.e., once in the steady state, we remain in that state after a transition
- This is a classic linear algebra problem (finding the left eigenvectors), of the form ~π P = λ~π
- We can recover several solution vectors for different values of λ
- We want the principal eigenvector, for which λ = 1
In practice, we may use the power iteration method to handle large graphs.

Iteration method in Matlab

    >> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
    >> pi0 = [1/3 1/3 1/3];
    >> pi0 * P
    ans =
        0.2500    0.5000    0.2500
    >> pi1 = pi0 * P
    pi1 =
        0.2500    0.5000    0.2500
    >> pi2 = pi1 * P
    pi2 =
        0.2917    0.4167    0.2917
    >> pi3 = pi2 * P
    pi3 =
        0.2708    0.4583    0.2708
    >> pi4 = pi3 * P
    pi4 =
        0.2812    0.4375    0.2812

Eigenvalue method in Matlab

    >> P = [1/6 2/3 1/6; 5/12 1/6 5/12; 1/6 2/3 1/6];
    >> [V,D,W] = eig(P);
    >> pi = W(:,1);
    >> pi = pi/sum(pi);
    >> pi'
    ans =
        0.2778    0.4444    0.2778
    >> pi' * P
    ans =
        0.2778    0.4444    0.2778

Hubs and authorities (HITS)
Assumes there are two kinds of pages on the web:
- authorities, providing authoritative and detailed information, e.g.:
  - the Australian Tax Office
  - the Bureau of Meteorology
  - the Wikipedia page on the Australian cricket team
- hubs, containing mainly links to lots of pages about a topic, e.g.:
  - DMOZ web directories
  - someone's Pinterest page
  - Wikipedia disambiguation pages
Depending on our query, we might want one or the other:
- a broad topic query, e.g., information about biking in Melbourne
- a specific query, e.g., is the Yarra trail sealed?

Hubs and authorities
Circular definition:
- A good hub links to many authorities.
- A good authority is linked to from many hubs.
Define:
- ~h, the hub scores for each web page
- ~a, the authority scores for each web page
The mutually recursive definition, for all pages i, is

    h_i ← Σ_{j : i→j} a_j      (sum over pages j that i links to)
    a_i ← Σ_{j : j→i} h_j      (sum over pages j that link to i)

This gives rise to an iterative algorithm for finding ~h and ~a (sketched below).

Computing hubs and authorities
Define the adjacency matrix A, as before: A_ij = 1 denotes an edge i → j. This leads to the relations

    ~h ← A ~a
    ~a ← A^T ~h

Substituting the definition of ~h into that of ~a,

    ~a ← A^T A ~a

which is another eigenvalue problem. The principal eigenvector of A^T A can be used to solve for ~a, provided a steady-state solution exists (~h is found in a similar way).
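A minimal Matlab sketch of that iterative algorithm, run on the example graph from earlier (the explicit adjacency matrix, the fixed iteration count and the choice of normalisation are assumptions for illustration; only the relative ordering of the scores matters):

    % Iterative HITS: alternate the hub and authority updates, rescaling
    % each vector so the scores stay bounded.
    A = [0 1 0; 1 0 1; 0 1 0];          % adjacency matrix: A(i,j) = 1 iff edge i->j
    h = ones(3, 1);                     % initial hub scores
    a = ones(3, 1);                     % initial authority scores
    for t = 1:100
        a = A' * h;  a = a / norm(a);   % a_i <- sum of hub scores of pages linking to i
        h = A  * a;  h = h / norm(h);   % h_i <- sum of authority scores of pages i links to
    end
    [h a]                               % page 2 ends up with the highest authority score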
Summary: PageRank and hubs and authorities
- Both are static, query-independent measures of web page quality
- Both can be run offline to score each web page
- Both are based on a latent (unobserved) quality metric for each page:
  - a single importance score (PageRank)
  - hub and authority scores (HITS)
- plus a transition mechanism
- Both give rise to document rankings

PageRank and HITS in a retrieval system
- How can we use these scores in a retrieval system?
- Alongside our other features, e.g., TF*IDF, BM25 factors, LM scores
- Express the model as a combination of factors, e.g.,

      RSV_d = Σ_{i=1..I} α_i h_i(~f_:, ~f_{d,:}, ...)

- PageRank / hubs-and-authorities scores become additional features, each with their own weight α
- Learn the weighting using a machine-learned scoring function
  - to match binary relevance judgements
  - based on click-throughs or query reformulations
- These methods can be exploited, e.g., link spam, Google bombs, Google bowling, etc.

Looking back and forward
Back:
- The link structure of the web gives rise to a graph
- The structure of the graph conveys information about the importance of pages
- PageRank models a random surfer using a Markov chain
- HITS models hubs versus authority nodes
- Solve for the steady state using power iteration or an eigenvalue solver

Forward:
- Natural language processing, looking in more detail into the structure of text
- Starting with text classification

Further reading
- (Review of eigen-decompositions) Section 18.1, "Linear algebra review", of Manning, Raghavan, and Schutze, Introduction to Information Retrieval.
- Chapter 21, "Link Analysis", of Manning, Raghavan, and Schutze, Introduction to Information Retrieval.