Search Engines: The players and the field

The mechanics of a typical search.
The search engine wars.
Statistics from search engine logs.
The architecture of a search engine.
The query engine.
Mechanics of a typical search
[Screenshot of a results page with callouts: results and ads returned, ranked; category of first result; result for phrase query]
Search on the Web
Corpus: The publicly accessible Web: static + dynamic
Goal: Retrieve high quality results relevant to the user’s need (not docs!)
Need
- Informational – want to learn about something (e.g. “Low hemoglobin”)
- Navigational – want to go to that page (e.g. “United Airlines”)
- Transactional – want to do something (web-mediated)
  - Access a service (e.g. “Tampere weather”)
  - Downloads (e.g. “Mars surface images”)
  - Shop (e.g. “Nikon CoolPix”)
- Gray areas
  - Find a good hub (e.g. “Car rental Finland”)
  - Exploratory search “see what’s there” (e.g. “Abortion morality”)
Search Engines as Info Gatekeepers
Search engines are becoming the primary entry point for discovering web pages.
Ranking of web pages influences which pages users will view.
Exclusion of a site from search engines will cut off the site from its intended audience.
The privacy policy of a search engine is important.
Introna & Nissenbaum: Defining the Web: The Politics of Search Engines
Hindman et al: Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web
Search Engine Wars
The battle for domination of the web search space is heating up!
The competition is good news for users!
Crucial: advertising is combined with search results!
What if one of the search engines manages to dominate the space?
Yahoo!
Synonymous with the dot-com boom, probably the best known brand on the web.
Started off as a web directory service in 1994, acquired leading search engine technology in 2003.
Has very strong advertising and e-commerce partners.
Lycos!
One of the pioneers of the field.
Introduced innovations that inspired the creation of Google.
Google
The verb “google” has become synonymous with searching for information on the web.
Has raised the bar on search quality.
Has been the most popular search engine in the last few years.
Had a very successful IPO in August 2004.
Is innovative and dynamic.
Has restored the glamour that CS lost in the dot-com bust.
Live Search (was: MSN Search)
Microsoft is synonymous with PC software.
Remember its victory in the browser wars with Netscape.
Developed its own search engine technology only recently, officially launched in Feb. 2005.
May link web search into its next version of Windows.
Ask Jeeves
Specialises in natural language question answering.
Search driven by Teoma.
Cuil
The latest kid on the block
Claims to have indexed 120B pages!
So far, it does not rank!
Experiment with query syntax
Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits.
“+chess” in a query means the user insists that “chess” be present in all hits.
“computer OR chess” means at least one of the keywords must be present in each hit.
Enclosing the query in double quotes, as in “computer chess”, means that the exact phrase must be present in all hits.
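The operator semantics above can be illustrated with a small sketch. The three documents and the helper names (`match_and`, `match_or`, `match_phrase`) are hypothetical, made up for illustration; real engines tokenize differently and may apply these rules less strictly.

```python
# Toy corpus: document id -> text (made-up examples).
DOCS = {
    1: "computer chess programs beat grandmasters",
    2: "chess openings for beginners",
    3: "my computer plays chess badly",
}

def tokens(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()

def match_and(terms):
    """Default AND: every term must occur in the hit."""
    return sorted(d for d, t in DOCS.items() if all(w in tokens(t) for w in terms))

def match_or(terms):
    """OR: at least one term must occur in the hit."""
    return sorted(d for d, t in DOCS.items() if any(w in tokens(t) for w in terms))

def match_phrase(phrase):
    """Phrase query: the terms must occur adjacently and in order."""
    words = tokens(phrase)
    hits = []
    for d, t in DOCS.items():
        toks = tokens(t)
        windows = range(len(toks) - len(words) + 1)
        if any(toks[i:i + len(words)] == words for i in windows):
            hits.append(d)
    return sorted(hits)

print(match_and(["computer", "chess"]))   # documents containing both words
print(match_or(["computer", "chess"]))    # documents containing either word
print(match_phrase("computer chess"))     # documents containing the exact phrase
```

In this toy corpus the phrase query is the most selective of the three, since document 3 contains both words but not next to each other.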
Statistics from search engine logs
Statistic                           AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
average terms per query             2.35               2.30               2.60
average queries per session         2.02               2.80               2.30
average result pages viewed         1.39               1.55               1.70
usage of advanced search features   20.4%              1.0%               10.0%
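As a rough illustration of how statistics like these could be derived from a query log, here is a minimal sketch. The log records, the `(session_id, query)` layout, and the rule for what counts as an advanced feature are all assumptions made for the example, not the method used in the cited studies.

```python
from collections import defaultdict

# Hypothetical log records: (session_id, query string).
LOG = [
    ("s1", "tampere weather"),
    ("s1", "tampere weather forecast"),
    ("s2", "united airlines"),
    ("s3", '"computer chess"'),
    ("s3", "chess +computer"),
    ("s3", "nikon coolpix"),
]

def uses_operator(query):
    """Treat quotes, a leading + or -, and AND/OR as advanced syntax (an assumption)."""
    words = query.split()
    return '"' in query or any(w[0] in "+-" or w in ("AND", "OR") for w in words)

def log_statistics(log):
    """Compute per-query and per-session averages over the whole log."""
    sessions = defaultdict(list)
    for session_id, query in log:
        sessions[session_id].append(query)
    n_queries = len(log)
    return {
        "avg_terms_per_query": sum(len(q.split()) for _, q in log) / n_queries,
        "avg_queries_per_session": n_queries / len(sessions),
        "advanced_usage": sum(uses_operator(q) for _, q in log) / n_queries,
    }

stats = log_statistics(LOG)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```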
The most popular search keywords
AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free               free
applet             sex                sex
porno              download           pictures
mp3                software           new
chat               uk                 nude
Web search Users
Ill-defined queries
- Short length
- Imprecise terms
- Sub-optimal syntax (80% of queries without operator)
- Low effort in defining queries
Wide variance in
- Needs
- Expectations
- Knowledge
- Bandwidth
Specific behavior
- 85% look over one result screen only (mostly above the fold)
- 78% of queries are not modified (1 query/session)
- Follow links – “the scent of information” ...
Query Distribution
Power law: few popular broad queries,
many rare specific queries
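To get a feel for what a power law implies, the sketch below assumes a Zipf-like distribution in which the query at popularity rank r has frequency proportional to 1/r. The exponent of 1 and the vocabulary of 100,000 distinct queries are illustrative assumptions, not measured values.

```python
def zipf_frequencies(n_queries, exponent=1.0):
    """Relative frequency of the query at each popularity rank (1 = most popular)."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def head_share(freqs, k):
    """Fraction of all query traffic accounted for by the k most popular queries."""
    return sum(freqs[:k])

freqs = zipf_frequencies(100_000)
print(f"top 100 queries: {head_share(freqs, 100):.1%} of traffic")
print(f"remaining tail:  {1 - head_share(freqs, 100):.1%} of traffic")
```

Under these assumptions a tiny head of popular queries carries a large share of the traffic, while the long tail of rare queries still accounts for more than half of it.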
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Architecture of a Search Engine
[Figure: architecture diagram. Components: User, Search, Web spider, Indexer, The Web, Indexes, Ad indexes. The user side shows an example results page for the query “miele” (Results 1 - 10 of about 7,310,000, 0.12 seconds), with sponsored links next to the web results.]
Rate of web content change
720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 [Cho00]
Mathematically, what does this seem to be?
What does this suggest for crawling policy?
Diversity
Languages/Encodings
- Hundreds of languages, W3C encodings: 55 (Jul01) [W3C01]
- Home pages (1997): English 82%, Next 15: 13% [Babe97]
- Google (mid 2001): English: 53%, JGCFSKRIP: 30%
Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)
Arts         14.6%    Arts: Music                6.1%
Computers    13.8%    Regional: North America    5.3%
Regional     10.3%    Adult: Image Galleries     4.4%
Society      8.7%     Computers: Software        3.4%
Adult        8%       Computers: Internet        3.2%
Recreation   7.3%     Business: Industries       2.3%
Business     7.2%     Regional: Europe           1.8%
…            …        …                          …
Search Index - Inverted File
[Figure: inverted file, with a “Frequency” label]
Also store the position of each word in the web page (“offset”) and information on HTML structure.
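A minimal sketch of an inverted file that records word offsets is shown below. The two pages, the whitespace tokenization, and the function names `build_inverted_index` and `phrase_pages` are assumptions for illustration; HTML structure information is left out to keep the example short.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to {page_id: [offsets]}; the list length is the frequency."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(offset)
    return index

def phrase_pages(index, first, second):
    """Pages where `second` occurs immediately after `first`, using stored offsets."""
    hits = []
    for page_id, offsets in index.get(first, {}).items():
        following = set(index.get(second, {}).get(page_id, []))
        if any(o + 1 in following for o in offsets):
            hits.append(page_id)
    return sorted(hits)

PAGES = {  # hypothetical pages
    "p1": "chess is a game computer chess is a field",
    "p2": "the computer plays chess",
}
index = build_inverted_index(PAGES)
print(index["chess"])                           # offsets of "chess" per page
print(phrase_pages(index, "computer", "chess"))  # pages with the adjacent pair
```

Storing offsets is what lets the index answer phrase queries without rereading the pages: two words form a phrase when their offsets in the same page differ by one.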
The query engine
The interface between the search index, the user and the web.
Algorithmic details of commercial search engines are kept as trade secrets.
First step is retrieval of potential results from the index.
Second step is the ranking of the results based on their “relevance” to the query.
[Figure labels: Portal, User Interface]
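The two steps of the query engine, retrieval and then ranking, can be sketched as follows. Since commercial ranking algorithms are not public, the TF-IDF score used here is only a stand-in chosen for illustration; the documents and function names are likewise hypothetical.

```python
import math
from collections import Counter

DOCS = {  # hypothetical documents
    "d1": "computer chess computer chess engines",
    "d2": "chess history and chess openings",
    "d3": "buy a new computer for chess and games",
    "d4": "gardening tips",
}

TERM_COUNTS = {doc_id: Counter(text.split()) for doc_id, text in DOCS.items()}

def retrieve(terms):
    """Step 1: candidate documents containing every query term."""
    return [d for d, counts in TERM_COUNTS.items() if all(t in counts for t in terms)]

def idf(term):
    """Inverse document frequency; rarer terms weigh more."""
    df = sum(1 for counts in TERM_COUNTS.values() if term in counts)
    return math.log(len(DOCS) / df) if df else 0.0

def rank(candidates, terms):
    """Step 2: order candidates by a TF-IDF score (a stand-in relevance measure)."""
    def score(doc_id):
        counts = TERM_COUNTS[doc_id]
        length = sum(counts.values())
        return sum(counts[t] / length * idf(t) for t in terms)
    return sorted(candidates, key=score, reverse=True)

def search(query):
    terms = query.lower().split()
    return rank(retrieve(terms), terms)

print(search("computer chess"))
```

Separating the steps keeps the expensive scoring restricted to the candidates that survive retrieval, rather than scoring every document in the index.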
Crawling the Web
Mode of crawl: BFS (breadth-first search)
Frequency of crawl: important
robots.txt gives explicit directions on what not to crawl
Parallel machines crawl all the time
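A breadth-first crawl that respects robots.txt can be sketched as below. To stay runnable without network access, the "web" is a hypothetical in-memory link graph and the robots.txt content is a made-up string; the robots rules are parsed with the standard library's `urllib.robotparser`. A real crawler would also need fetching, politeness delays, and per-host robots files.

```python
from collections import deque
from urllib import robotparser

# Hypothetical in-memory "web": URL -> outgoing links (no network access needed).
WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/private/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": [],
    "http://example.org/private/x": ["http://example.org/secret"],
}

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def bfs_crawl(seed, user_agent="toy-crawler"):
    """Visit pages in breadth-first order, skipping URLs disallowed by robots.txt."""
    robots = robotparser.RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    queue, seen, order = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed: neither fetched nor expanded
        order.append(url)
        for link in WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

crawled = bfs_crawl("http://example.org/")
print(crawled)
```

Because the disallowed page is never fetched, links that appear only on it are never discovered by this crawl.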