Search Query Log Analysis Kristina Lerman

What can we learn from web search
queries?
• Characteristics
– Length has steadily grown over the years
• 1990’s: < 2 terms
• 2001: 2.4 terms
• 2014: long search queries, e.g., “where is the nearest coffee shop”
– Heavy-tailed distribution of term frequency
– Billions of queries
• User intentions
– Aggregate query words with results of search to learn user’s needs,
wants, goals
– Create a database of commonsense knowledge
• Cf. Cyc
• Does data exist?
– AOL search query log
– Google trends
2006 AOL search query log dataset
• ~20M web queries
• ~650K users
• 3-month period: March 1 – May 31, 2006
• Data format
– AnonID – an anonymous user ID number
– Query – the query issued by the user
– QueryTime – time query was submitted
– ItemRank – rank of item clicked in results
– ClickURL – the domain of the clicked item
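A minimal sketch of reading records in this five-field layout, assuming tab-separated text with a header row. The class name, function name, and sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class QueryRecord:
    anon_id: int
    query: str
    query_time: str
    item_rank: Optional[int]   # None when no result was clicked
    click_url: Optional[str]   # None when no result was clicked


def read_records(handle) -> Iterator[QueryRecord]:
    """Yield one QueryRecord per row of a tab-separated log with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rank = row.get("ItemRank") or ""
        url = row.get("ClickURL") or ""
        yield QueryRecord(
            anon_id=int(row["AnonID"]),
            query=row["Query"],
            query_time=row["QueryTime"],
            item_rank=int(rank) if rank else None,
            click_url=url or None,
        )


# Invented sample rows (hypothetical user ID, timestamps, and URL).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\tside effects of xanax pills\t2006-03-01 10:15:00\t2\thttp://www.example.com\n"
    "101\tlong term vioxx use\t2006-03-01 10:20:30\t\t\n"
)

records = list(read_records(io.StringIO(SAMPLE)))
```

Rows without a click keep ItemRank and ClickURL empty, which the sketch maps to None.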
Timeline
• 8/4/06: Announcement to SIG-IRList from AOL
• 8/6/06: TechCrunch slams AOL over privacy
• 8/7/06: Dataset removed
• 8/9/06: NYTimes identifies user 4417749
– Thelma Arnold, 62, from Lilburn, Georgia
• 8/21/06: AOL CTO Maureen Govern resigns
– AOL researcher and supervisor are fired
Weakly-supervised discovery of
named entities using web search
queries
Marius Pasca (Google)
CIKM-07: Conference on Information and Knowledge Management, Lisbon, Portugal
Weakly Supervised Discovery of Named
Entities using Web Search (2007)
• Goal: Automatically extract knowledge
(entities) from texts created by many people
– Discover new instances of classes
• Red Alert is a video game
• Lilburn is a town
• Lorazepam is a drug
• For what purpose?
– Cataloging human knowledge
– Understanding searching users
• #399392 in Lilburn takes Lorazepam, plays Red Alert
Intuition
• Templates in queries
“side effects of xanax pills”
“side effects of birth control pills”
“side effects of lipitor pills”
…
– Prefix: “side effects of”
– Postfix: “pills”
• But, templates are difficult to specify
– Cf. extraction patterns in web information
retrieval
“Weakly”-supervised approach
• Guided by a small set of known seed instances
– Input is a target class and some examples
• Drug: {phentermine, viagra, vicodin, vioxx, xanax}
• City: {london, paris, san francisco, tokyo, toronto}
• Food: {chicken, fish, milk, tomatoes, wheat}
– Identify the patterns seed instances occur in
• Learn many more new instances automatically
– Use patterns to find more instances
Step 1: Identify query templates
• Identify all queries that contain each known class instance
– e.g., vioxx
• Extract left and right context
– “long term vioxx use”
• Prefix: “long term”
• Postfix: “use”
• Infix: “vioxx”
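Step 1 can be sketched as below. This is an illustration under assumptions, not the paper's implementation: the whole left and right remainder of the query is taken as prefix and postfix, matching is on whitespace tokens, and the function name and sample queries are invented.

```python
from collections import Counter
from typing import Iterable, Tuple

Template = Tuple[str, str]  # (prefix, postfix)


def find_templates(queries: Iterable[str], seeds: Iterable[str]) -> Counter:
    """Count the (prefix, postfix) contexts in which seed instances occur."""
    counts: Counter = Counter()
    seed_tokens = [seed.lower().split() for seed in seeds]
    for query in queries:
        tokens = query.lower().split()
        for seed in seed_tokens:
            n = len(seed)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == seed:
                    prefix = " ".join(tokens[:i])
                    postfix = " ".join(tokens[i + n:])
                    if prefix or postfix:  # a bare seed gives no context
                        counts[(prefix, postfix)] += 1
    return counts


# Invented sample queries.
queries = [
    "long term vioxx use",
    "side effects of xanax pills",
    "side effects of vioxx pills",
    "xanax",
]
templates = find_templates(queries, ["vioxx", "xanax"])
```

Each key of `templates` is one query template; the count says how many seed queries produced it.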
Step 2: Generate candidate instances
• Go over the query log again
• Identify all queries that match template
• Collect query infixes as candidate instances
{low blood pressure,
xanax, lamictal, generic
birth control, lipitor,
vicodin, beta blockers, …}
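A minimal sketch of Step 2, assuming templates are (prefix, postfix) pairs matched on whitespace tokens. The helper names and sample queries are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Template = Tuple[str, str]  # (prefix, postfix)


def match_infix(query: str, template: Template) -> Optional[str]:
    """Return the infix if the query fits the template, else None."""
    tokens = query.lower().split()
    pre = template[0].split()
    post = template[1].split()
    if len(tokens) <= len(pre) + len(post):
        return None  # no room for a non-empty infix
    if tokens[:len(pre)] != pre:
        return None
    if post and tokens[-len(post):] != post:
        return None
    return " ".join(tokens[len(pre):len(tokens) - len(post)])


def collect_candidates(queries: Iterable[str],
                       templates: Iterable[Template]) -> Set[str]:
    """Second pass over the log: every matching infix becomes a candidate."""
    templates = list(templates)
    found: Set[str] = set()
    for query in queries:
        for template in templates:
            infix = match_infix(query, template)
            if infix is not None:
                found.add(infix)
    return found


# Invented sample data.
templates = [("side effects of", "pills"), ("long term", "use")]
queries = [
    "side effects of lipitor pills",
    "side effects of birth control pills",
    "long term xanax use",
    "cheap flights to paris",
]
candidates = collect_candidates(queries, templates)
```

Note that candidates such as "birth control" are not drugs; the later ranking steps are what push such noise down the list.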
Step 3: Compile search signatures
• Each candidate is represented as a vector
– Each template is a dimension
– Weighted by frequency in queries
Step 4: Reference signatures
• Vectors for example class instances are combined
• Prototype of search signature for the class
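Steps 3 and 4 can be sketched with sparse count vectors. The slides say only that seed vectors are "combined"; summing them, as done here, is an assumption. Function names and sample data are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Template = Tuple[str, str]  # (prefix, postfix)


def build_signatures(matches: Iterable[Tuple[str, Template]]) -> Dict[str, Counter]:
    """matches holds one (instance, template) pair per matching query.

    The result maps each instance to a sparse vector: template -> frequency.
    """
    signatures: Dict[str, Counter] = defaultdict(Counter)
    for instance, template in matches:
        signatures[instance][template] += 1
    return signatures


def reference_signature(signatures: Dict[str, Counter],
                        seeds: Iterable[str]) -> Counter:
    """Combine the seed vectors into one class prototype (here: by summing)."""
    reference: Counter = Counter()
    for seed in seeds:
        reference.update(signatures.get(seed, Counter()))
    return reference


# Invented sample data.
T1 = ("side effects of", "pills")
T2 = ("long term", "use")
matches = [("xanax", T1), ("xanax", T1), ("xanax", T2),
           ("vioxx", T2), ("lipitor", T1)]

signatures = build_signatures(matches)
reference = reference_signature(signatures, ["xanax", "vioxx"])
```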
Example
Step 5: Compute signature similarity
• Vector similarity between reference signature
and candidate signature
– Jensen-Shannon similarity function
• Output is rank-ordered list
Drug: {viagra, phentermine, ambien, adderall, vicodin, hydrocodone, xanax,
vioxx, oxycontin, cialis, valium, lexapro, ritalin, zoloft, percocet, …}
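A minimal sketch of Step 5, assuming signatures are normalized to probability distributions and candidates are ranked by ascending Jensen-Shannon divergence (base 2, so 0 means identical and 1 means no shared templates). The paper's exact weighting may differ; the template labels and counts below are invented.

```python
import math
from typing import Dict, List, Mapping


def normalize(vector: Mapping[str, float]) -> Dict[str, float]:
    """Scale a count vector so its weights sum to 1."""
    total = sum(vector.values())
    return {k: v / total for k, v in vector.items()} if total else {}


def js_divergence(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two sparse vectors."""
    p, q = normalize(p), normalize(q)
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return divergence


def rank_candidates(reference: Mapping[str, float],
                    candidates: Mapping[str, Mapping[str, float]]) -> List[str]:
    """Most similar (lowest divergence) candidates first."""
    return sorted(candidates,
                  key=lambda name: js_divergence(reference, candidates[name]))


# Invented signatures; t1..t4 stand for query templates.
reference = {"t1": 2, "t2": 2}
candidates = {
    "paris": {"t4": 5},
    "low blood pressure": {"t1": 1, "t3": 3},
    "lipitor": {"t1": 1, "t2": 1},
}
ranking = rank_candidates(reference, candidates)
```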
Evaluation
Repeatability
• Need enormous database of search query logs
– Probably best done at Google or Microsoft
• What can be done with small query
databases?
• What types of social media text could this
method be applied to?
Classifying the user intent of
web queries using k-means
clustering
Ashish Kathuria, Bernard J. Jansen, Carolyn Hafernik,
and Amanda Spink
Problem Introduction
The WWW is a vital tool in many people’s daily lives
• Nearly 70 percent of searchers use a search engine
• Search engines receive hundreds of millions of queries per day
• Billions of results per week in response to these queries
Smart users: novel and increasingly assorted ways of searching
Understanding intent behind
searching
Can help to improve search engine performance via
page ranking, result clustering, advertising, and presentation of
results
Approach
• Automatically classify a large set of queries from a web
search engine log as informational, navigational and
transactional.
• Encode the characteristics of informational, navigational
and transactional queries identified from prior work to
develop an automatic classifier using k-means clustering.
• Use data-mining techniques to classify queries by user intent
automatically and more accurately
• Overcome limitations of previous research:
– Small datasets
– Limited methodology
Classification of Queries
Images from http://moz.com/blog/segmenting-search-intent
Research methodology
• Dataset: Transaction log from Dogpile. Each record has
fields like: User identification, cookie, Time of day, Query
terms, source
Step 1: Creating sessions and removing duplicates
• The fields of Time of day, User identification, Cookie,
and Query were used to locate the initial query of a session
and then recreate the series of actions in the session.
• Collapsed the search using user identification, cookie, and
query to eliminate duplicates of result and null queries
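A minimal sketch of Step 1, assuming one session per (user identification, cookie) pair, time stored as seconds since midnight, and a duplicate meaning the same query text seen again in that session. The field names, encodings, and sample rows are assumptions, not the Dogpile log format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class LogRecord(NamedTuple):
    user_id: str
    cookie: str
    time: int    # seconds since midnight (assumed encoding)
    query: str


def build_sessions(records: Iterable[LogRecord]) -> Dict[Tuple[str, str], List[str]]:
    """Group queries per (user, cookie) in time order, dropping nulls and duplicates."""
    sessions: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.time):
        query = rec.query.strip().lower()
        if not query:
            continue  # null query
        queries = sessions[(rec.user_id, rec.cookie)]
        if query in queries:
            continue  # duplicate of a query already in this session
        queries.append(query)
    return dict(sessions)


# Invented sample rows.
records = [
    LogRecord("u1", "c1", 30, "Dogpile"),
    LogRecord("u1", "c1", 10, "cheap flights"),
    LogRecord("u1", "c1", 20, "cheap flights"),
    LogRecord("u1", "c1", 25, ""),
    LogRecord("u2", "c9", 5, "weather"),
]
sessions = build_sessions(records)
```

The first query in each list is the initial query of the session; the rest are the follow-up actions in order.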
Research methodology
Step 2: Generating additional attributes
• Calculated three additional attributes for each record: query length, query
reformulation, and result page
Step 3: Assignment of terms
1. Navigational:
• Queries containing company/business/organization/people names
• Queries containing portions of URLs or even complete URLs
2. Transactional:
• Identified via key terms related to transactional domains such as
entertainment and e-commerce
3. Informational:
• Queries that use natural language terms
• Longer sessions than for navigational or transactional searching
Research methodology
Step 4 : Textual data to numerical data
Step 5: Converting string to vector
K-means Clustering
The resulting data set had four attributes that could be used for
classification: query length, source, query reformulation rate, user
intent weight of the query
Navigational
Informational
Transactional
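A minimal k-means sketch over the four attributes listed above, written in plain Python. The numeric encodings, sample values, and initial centroids are invented for illustration; a real run would also scale the features and then map the unlabeled clusters to intents.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], centroids: List[Point],
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    centers = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Invented rows: [query length, source code, reformulation rate, intent weight]
points = [
    [1, 0, 0.0, 0.1], [1, 0, 0.1, 0.2],   # short, rarely reformulated
    [5, 0, 0.2, 0.9], [6, 1, 0.3, 0.8],   # long queries
    [2, 2, 0.5, 0.5], [3, 2, 0.6, 0.4],   # mid-length, different source
]
labels, centers = kmeans(points, centroids=[points[0], points[2], points[4]])
```

With k = 3, each cluster would then be inspected and named navigational, informational, or transactional.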
Results
• Performed on various datasets and achieved 94%
accuracy
• Overall, about 76 percent of the queries were
classified as informational, while about 12 percent
were classified as transactional, and 12 percent
were classified as navigational
Results
• Navigational queries: Low rates of reformulation, typically sessions of just
one query.
• Informational queries: Low occurrences of query reformulation,
indicating probably relatively easy informational needs, such as fact finding
• Transactional queries: Shorter queries
Discussion of approach
The approach has a high success rate, uses a large
data set of queries, and does not depend on external
content, making it implementable in real time.
• Limitations:
– Is the Dogpile user population representative of web search engine users in general?
– What if a prototype has multiple user intents associated with it?
– Is relying solely on transaction logs sufficient?
• Future Scope:
– Investigate subcategories
– A laboratory study on how searchers express their underlying intent
– Develop algorithmic approaches for more in-depth analysis of individual queries
Summary
• Identifying the user intent of web queries is very useful for web search
engines because it would allow them to provide more relevant results to
searchers and more precisely targeted sponsored links.
• Classifying queries helps in focused search:
– Informational queries: Provide relevant information and ads
– Navigational queries: Provide links straight to a requested web
page
– Transactional queries: Focus on all commercial links for future
purchase as well
• The use of k-means as an automatic clustering and classification
technique yielded positive results and opened effective ways to improve
performance of web search engines.
-Neha Mundada
Acquiring Explicit Goals from Search
Query Logs
• Understanding human goals is necessary to
– Recognize goals of actions
– Create a plan
• E.g., ‘plan a trip to Vienna’ has subgoals
– ‘contact travel agent’
– ‘book hotel’
– ‘buy concert tickets’, etc.
• Automatically acquire human goals from
search query logs
– Acquire and organize commonsense knowledge
Research overview
• Research Question:
– Whether and how can search query logs be used to overcome
the problem of acquiring knowledge about human goals?
• Following an exploratory research style, we intend to
show that:
– Query logs contain a small but interesting number of user goals
– These goals can be separated by automatic methods
• Results:
– Knowledge about the automatic acquisition of goals out of
search query logs
– Knowledge about the nature of goals extracted from
search query logs
Results of Human Subject Study
• 4 independent raters
• labeled 3000 queries
• Examples
– bug killing devices
– mothers working from home
– how to lose weight
• Classes appear to be separable
Experimental Setup
• AOL search query log
– ~ 20 million search queries
– recorded between March 1 and May 31, 2006
– ethical issues
• pre-processing steps to reduce noise
– 5 million queries
• labeled queries from the human subject study
were utilized as training examples
(controversial queries were omitted)
Classification approach
• Part of speech tagging
– Maximum entropy tagger converts a sequence of
words into a sequence of POS tags
– Example
• Query “buy a car” → buy/VB a/DT car/NN
• Set of words {buy, car}
• Part of speech trigrams
$ VB DT NN $ → {$ VB DT, VB DT NN, DT NN $}
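The two feature types can be sketched as follows, starting from an already tagged query (the tagger itself is not shown). Dropping "a" from the word set is assumed to come from a stop-word list; the list and function names here are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

STOP_WORDS = {"a", "an", "the"}  # hypothetical stop-word list


def word_features(tagged: Sequence[Tuple[str, str]]) -> Set[str]:
    """Set of query words with stop words removed."""
    return {word for word, _ in tagged if word not in STOP_WORDS}


def pos_trigrams(tagged: Sequence[Tuple[str, str]], boundary: str = "$") -> List[str]:
    """Sliding window of three POS tags over the boundary-padded tag sequence."""
    tags = [boundary] + [tag for _, tag in tagged] + [boundary]
    return [" ".join(tags[i:i + 3]) for i in range(len(tags) - 2)]


tagged_query = [("buy", "VB"), ("a", "DT"), ("car", "NN")]
words = word_features(tagged_query)
trigrams = pos_trigrams(tagged_query)
```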
Classification approach (2)
• Linear Support Vector Machine [Dumais98]
– Robust and effective in the area of text classification
– Weka Machine Learning Toolkit
http://www.cs.waikato.ac.nz/ml/weka/
• Performance:
– 10 trials – 3-fold Cross Validation
– Precision, Recall and F1-Measure for the class:
“queries containing goals”
• Precision = 0.77
• Recall = 0.63
• F1-Measure = 0.69
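As a quick check, the reported F1 is consistent with the harmonic mean of the reported precision and recall. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the slide above: 2 * 0.77 * 0.63 / (0.77 + 0.63)
f1 = f1_score(0.77, 0.63)
```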
N-fold cross validation
• Problem: limited amount of labeled data
• Solution: N-fold cross validation
• Divide data into N equal segments (folds)
• Training data: N-1 folds
• Testing data: remaining fold
• Repeat for remaining test folds and average results
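The procedure can be sketched as below. Folds are built by striding through the data, which assumes the items are already shuffled; `evaluate` stands for any train-and-score routine and is a placeholder, not part of the original study.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def n_fold_splits(items: Sequence, n: int) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [list(items[i::n]) for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items: Sequence, n: int,
                   evaluate: Callable[[List, List], float]) -> float:
    """Average the score returned by evaluate over the n folds."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(items, n)]
    return sum(scores) / len(scores)


# Toy usage: 9 items, 3 folds, and a dummy score (share of data held out).
data = list(range(9))
splits = list(n_fold_splits(data, 3))
mean_score = cross_validate(data, 3, lambda train, test: len(test) / len(data))
```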
Goals are diverse
• Rank-Frequency plot of goals is heavy tailed
– Few goals are shared by many users
– Majority of goals are shared by only few users
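A rank-frequency table of this kind can be computed as sketched below. The goal strings and counts are invented for illustration and are not results from the study.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(goals: Iterable[str]) -> List[Tuple[int, str, int]]:
    """Return (rank, goal, frequency) rows, most frequent goal first."""
    counts = Counter(goals)
    return [(rank, goal, freq)
            for rank, (goal, freq) in enumerate(counts.most_common(), start=1)]


# Invented goal occurrences, one entry per user expressing the goal.
observed = (["lose weight"] * 5 + ["get pregnant"] * 3
            + ["make money", "change my name", "be happy"])
table = rank_frequency(observed)

# Share of distinct goals expressed only once (the "tail").
tail_share = sum(1 for _, _, freq in table if freq == 1) / len(table)
```

Plotting frequency against rank on log-log axes is the usual way to inspect the tail.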
Most frequent goals
Most frequent goals with “get”,
“make”, “change” and “be”
Summary
• Web search queries are an abundant, but very
sparse and very noisy, source of data about
needs, desires, intentions of people
• Clever methods can learn from these diverse
data
– Named entities
– Goals
• Can these methods be used in social media?