Web Mining Research: A Survey April 23rd 2014 CS332

pg 01
Web Mining Research:
A Survey
Authors: Raymond Kosala & Hendrik Blockeel
Presenter: Ryan Patterson
April 23rd 2014
Data Mining
CS332
pg 02
outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Review
• Exam Questions
pg 04
Introduction
“The Web is huge, diverse, and dynamic . . . we
are currently drowning in information and facing
information overload.”
Web users encounter problems:
• Finding relevant information
• Creating new knowledge out of the information available on the Web
• Personalization of the information
• Learning about consumers or individual users
pg 06
Web Mining
“Web mining is the use of data mining
techniques to automatically discover and
extract information from Web documents and
services.”
Web mining subtasks:
1. Resource finding
2. Information selection and pre-processing
3. Generalization
4. Analysis
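The four subtasks can be read as stages of a pipeline. The following is a minimal sketch under stated assumptions: the pages live in a hypothetical in-memory dictionary (`PAGES`) rather than being fetched from the Web, and the function names are illustrative only, not taken from the survey.

```python
import re
from collections import Counter

# Hypothetical in-memory "Web": URL -> raw HTML. A real system would fetch pages.
PAGES = {
    "http://example.org/a": "<html><body><h1>Data mining</h1><p>Mining the Web.</p></body></html>",
    "http://example.org/b": "<html><body><p>Web usage mining studies server logs.</p></body></html>",
    "http://example.org/c": "<html><body><p>Cooking recipes.</p></body></html>",
}

def find_resources(pages, keyword):
    """1. Resource finding: retrieve the documents that mention the keyword."""
    return {url: html for url, html in pages.items() if keyword.lower() in html.lower()}

def preprocess(html):
    """2. Information selection and pre-processing: strip tags, lowercase, tokenize."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

def generalize(token_lists):
    """3. Generalization: count in how many documents each term occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts

def analyze(counts, min_docs):
    """4. Analysis: keep only the terms shared by at least min_docs documents."""
    return sorted(term for term, n in counts.items() if n >= min_docs)

resources = find_resources(PAGES, "mining")
doc_freq = generalize(preprocess(html) for html in resources.values())
common_terms = analyze(doc_freq, min_docs=2)
print(common_terms)
```

Each stage only consumes the output of the previous one, which is why the subtasks can be studied separately.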
pg 07
Web Mining
Information Retrieval & Information Extraction
• Information Retrieval (IR)
o the automatic retrieval of all relevant documents while at the same time retrieving as few of the nonrelevant as possible
• Information Extraction (IE)
o transforming a collection of documents into information that is more readily digested and analyzed
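To make the contrast concrete, here is a small sketch, assuming a toy three-document collection (`DOCS`) and a hand-written extraction pattern; both are illustrative and not from the survey. `retrieve` returns documents (IR), while `extract_rates` returns structured facts pulled out of the documents (IE).

```python
import re

# Hypothetical toy collection.
DOCS = {
    "d1": "The annual interest rate is 4.5% for savings accounts.",
    "d2": "Our bank offers an annual interest rate of 3.9% on loans.",
    "d3": "The football match ended in a draw.",
}

def retrieve(query, docs):
    """IR: rank documents by how many query terms they contain; drop non-matching ones."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        score = len(terms & words)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

def extract_rates(docs):
    """IE: turn free text into structured (doc_id -> rate) facts."""
    pattern = re.compile(r"interest rate (?:is|of) (\d+(?:\.\d+)?)%")
    facts = {}
    for doc_id, text in docs.items():
        match = pattern.search(text)
        if match:
            facts[doc_id] = float(match.group(1))
    return facts

hits = retrieve("interest rate", DOCS)
rates = extract_rates(DOCS)
print(hits)
print(rates)
```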
pg 08
Live demo
pg 10
Web Content Mining
Information Retrieval View
Unstructured Documents
• Most utilize a “bag of words” representation to generate document features
o ignores the sequence in which the words occur
• Document features can be reduced with selection algorithms
o e.g. information gain
• Possible alternative document feature representations:
o word positions in the document
o phrases/terms (e.g. “annual interest rate”)
Semi-Structured Documents
• Utilize additional structural information gleaned from the document
o HTML markup (intra-document structure)
o HTML links (inter-document structure)
pg 11
Web content mining, IR unstructured documents
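The bag-of-words representation and information-gain feature selection described for unstructured documents can be sketched as follows. This is a minimal illustration, assuming a hypothetical four-document labeled corpus and a binary "term present / term absent" split; real systems use larger vocabularies and often weighted counts.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to term counts; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(term, bags, labels):
    """Reduction in label entropy from splitting documents on presence of term."""
    with_term = [lab for bag, lab in zip(bags, labels) if term in bag]
    without = [lab for bag, lab in zip(bags, labels) if term not in bag]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (with_term, without) if part
    )
    return entropy(labels) - remainder

# Hypothetical labeled corpus.
docs = ["low annual interest rate", "interest rate rises",
        "team wins the match", "match ends in a draw"]
labels = ["finance", "finance", "sports", "sports"]

bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))
gains = {t: information_gain(t, bags, labels) for t in vocabulary}
# Keep the two highest-gain terms; ties are broken alphabetically.
selected = sorted(vocabulary, key=lambda t: (-gains[t], t))[:2]
print(selected)
```

Because only counts are kept, `bag_of_words("rate interest")` and `bag_of_words("interest rate")` are identical, which is exactly the loss of word order noted on the slide.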
pg 12
Web content mining, IR semi structured documents
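For semi-structured documents, the extra structural information can be pulled from the HTML itself. A minimal sketch using Python's standard `html.parser` module, assuming a hypothetical page string and treating headings as the intra-document structure and `href` targets as the inter-document structure:

```python
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect heading text (intra-document) and link targets (inter-document)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        # Text is only recorded while inside a heading element.
        if self._in_heading:
            self.headings[-1] += data

# Hypothetical page used only for illustration.
page = ('<html><body><h1>Web Mining</h1><p>See the '
        '<a href="http://example.org/survey">survey</a> and '
        '<a href="/notes.html">notes</a>.</p><h2>Usage</h2></body></html>')

extractor = StructureExtractor()
extractor.feed(page)
print(extractor.headings)
print(extractor.links)
```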
pg 13
Web Content Mining
Database View
“the Database view tries . . . to transform a Web
site to become a database so that . . . querying
on the Web become[s] possible.”
• Uses Object Exchange Model (OEM)
o represents semi-structured data by a labeled graph
• Database view algorithms typically start from manually selected Web sites
o site-specific parsers
• Database view algorithms produce:
o document-level schemas or DataGuides
▪ structural summary of semi-structured data
o frequent substructures (sub-schema)
o a multi-layered database
▪ each layer is obtained by generalizations on lower layers
pg 14
Web content mining, Database view
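The idea of a labeled graph with a structural summary can be illustrated with a small sketch. It assumes a tree-shaped simplification of OEM (real OEM graphs may share nodes and contain cycles) and computes only the set of distinct label paths, which is a DataGuide-like summary rather than a full DataGuide. The `site` data is hypothetical.

```python
# Assumption: each edge is a (label, child) pair; a child is either an atomic
# value or another list of edges. This is a simplification of OEM.
site = [
    ("restaurant", [
        ("name", "Corner Cafe"),
        ("phone", "555-1234"),
        ("address", [("street", "Main St"), ("city", "Burlington")]),
    ]),
    ("restaurant", [
        ("name", "Pizza Place"),
        ("address", "12 Elm St"),  # same label, different structure
    ]),
]

def label_paths(edges, prefix=()):
    """Return every distinct label path reachable from the root."""
    paths = set()
    for label, child in edges:
        path = prefix + (label,)
        paths.add(path)
        if isinstance(child, list):
            paths |= label_paths(child, path)
    return paths

# Each label path appears once in the summary, however often it occurs in the data.
summary = sorted(".".join(p) for p in label_paths(site))
for path in summary:
    print(path)
```

The summary is what makes querying possible: a query over `restaurant.address.city` can be checked against it before touching the data.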
pg 16
Web Structure Mining
“. . . we are interested in the structure of the
hyperlinks within the Web itself”
• Inspired by the study of social networks and citation analysis
o based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)
• Some algorithms calculate the quality/relevancy of each Web page
o e.g. PageRank
• Others measure the completeness of a Web site
o measuring frequency of local links on the same server
o interpreting the nature of hierarchy of hyperlinks on one domain
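As an illustration of link-based page scoring, here is a minimal power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, and the four-page `web` graph are assumptions made for the example; production implementations differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its rank equally among its outgoing links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical link graph: C receives links from A, B, and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
print(best)
```

Page D has no incoming links, so it keeps only the baseline share, while C collects rank from three pages.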
pg 18
Web Usage Mining
“. . . focuses on techniques that could predict
user behavior while the user interacts with
the Web.”
• Web usage is mined by parsing Web server logs
o mapped into relational tables → data mining techniques applied
o log data utilized directly
• Users connecting through proxy servers, and users or ISPs utilizing caching of Web data, result in decreased server log accuracy
• Two applications:
o personalized - user profile or user modeling in adaptive interfaces
o impersonalized - learning user navigation patterns
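The "parse the log, map it into a relational table, then query it" route can be sketched with the standard library. This assumes log lines in the Common Log Format and uses an in-memory SQLite table; the sample lines, table name, and column names are hypothetical.

```python
import re
import sqlite3

# Assumption: Common Log Format, e.g.
# host ident authuser [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

SAMPLE_LOG = """\
10.0.0.1 - - [23/Apr/2014:10:00:01 -0400] "GET /index.html HTTP/1.1" 200 1043
10.0.0.1 - - [23/Apr/2014:10:00:09 -0400] "GET /papers.html HTTP/1.1" 200 2310
10.0.0.2 - - [23/Apr/2014:10:01:30 -0400] "GET /index.html HTTP/1.1" 200 1043
10.0.0.2 - - [23/Apr/2014:10:01:42 -0400] "GET /missing.html HTTP/1.1" 404 -
"""

def parse_log(text):
    """Turn raw log lines into (host, time, path, status) rows; skip malformed lines."""
    rows = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            rows.append((match["host"], match["time"], match["path"], int(match["status"])))
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", parse_log(SAMPLE_LOG))

# A simple impersonalized pattern: how often each page was successfully served.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

The `host` column is also where the accuracy problem shows up: requests arriving through a proxy all carry the proxy's address, and pages served from a cache never reach the log at all.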
pg 20
Review
• Web mining
o 4 subtasks
o IR & IE
• Web content mining
o primarily intra-page analysis
o IR view vs DB view
• Web structure mining
o primarily inter-page analysis
• Web usage mining
o primarily analysis of server activity logs
pg 21
Web mining categories
• Web Content Mining, IR View
o View of Data: Unstructured; Semi structured
o Main Data: Text documents; Hypertext documents
o Representation: Bag of words, n-grams; Terms, phrases; Concepts of ontology; Relational
o Method: TFIDF and variants; Machine learning; Statistical (incl. NLP)
o Application Categories: Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling
• Web Content Mining, DB View
o View of Data: Semi structured; Web site as DB
o Main Data: Hypertext documents
o Representation: Edge-labeled graph (OEM); Relational
o Method: Proprietary algorithms; ILP; (modified) association rules
o Application Categories: Finding frequent substructures; Web site schema discovery
• Web Structure Mining
o View of Data: Links structure
o Main Data: Links structure
o Representation: Graph
o Method: Proprietary algorithms
o Application Categories: Categorization; Clustering
• Web Usage Mining
o View of Data: Interactivity
o Main Data: Server logs; Browser logs
o Representation: Relational table; Graphs
o Method: Machine Learning; Statistical; (modified) association rules
o Application Categories: Site construction, adaptation, and management; Marketing; User modeling
pg 24
Exam Question 1
Q: Of the following Web mining paradigms:
• Information Retrieval
• Information Extraction
Which does a traditional Web search engine (google.com,
bing.com, etc.) attempt to accomplish? Briefly support
your answer.
A: Information Retrieval. The search engine attempts to
provide a list of documents ranked by their relevancy to
the search query.
pg 26
Exam Question 2
Q: State one common problem hampering accurate
Web usage mining. Briefly support your answer.
A:
• Users connecting to a Web site through a proxy server, or
• Users (or their ISPs) utilizing Web data caching,
will result in decreased server log accuracy. Accurate
server logs are required for accurate Web usage mining.
pg 28
Exam Question 3
Q: What is the phrase associated with the most popular
method for Web content mining algorithms to generate
document features from unstructured documents?
A:
“Bag of words” representation.