Download Report

Introduction to
Data Mining
SLIDES BY: SHREE JASWAL
Books
2
 Text Books:
1. Han, Kamber, "Data Mining Concepts and
Techniques", Morgan Kaufmann 3nd
Edition
2. Business Intelligence: Data Mining and
Optimization for Decision Making by Carlo
Vercellis ,Wiley India Publications
Chp1
Slides by: Shree Jaswal
Topics to be covered
3
 What is Data Mining;
 Kind of patterns to be mined;
 Technologies used;
 Major issues in Data Mining
Chp1
Slides by: Shree Jaswal
Evolution of database system technology
4
Chp1
Slides by: Shree Jaswal
What is a data warehouse?
5
Chp1
Slides by: Shree Jaswal
Example of a datawarehouse framework used by an
electronics store
6
Chp1
Slides by: Shree Jaswal
Chp1
Slides by: Shree Jaswal
7
Why Data Mining?
8
 The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability


Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated analysis
of massive data sets
Chp1
Slides by: Shree Jaswal
Why Data Mining?—Potential Applications
9
 Data analysis and decision support

Market analysis and management
 Target
marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation

Risk analysis and management
 Forecasting,
customer retention, improved underwriting, quality
control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)
 Other Applications
Chp1

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis
Slides by: Shree Jaswal
Ex. 1: Market Analysis and Management
10
 Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
 Target marketing


Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
Determine customer purchasing patterns over time
 Cross-market analysis—Find associations/co-relations between product sales,
& predict based on such association
 Customer profiling—What types of customers buy what products (clustering or
classification)
 Customer requirement analysis


Identify the best products for different groups of customers
Predict what factors will attract new customers
 Provision of summary information


Chp1
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Slides by: Shree Jaswal
Ex. 2: Corporate Analysis & Risk Management
11
 Finance planning and asset evaluation

cash flow analysis and prediction

contingent claim analysis to evaluate assets

cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
 Resource planning

summarize and compare the resources and spending
 Competition
Chp1

monitor competitors and market directions

group customers into classes and a class-based pricing procedure

set pricing strategy in a highly competitive market
Slides by: Shree Jaswal
Ex. 3: Fraud Detection & Mining Unusual Patterns
12
 Approaches: Clustering & model construction for frauds, outlier analysis
 Applications: Health care, retail, credit card service, telecomm.

Auto insurance: ring of collisions

Money laundering: suspicious monetary transactions

Medical insurance


Professional patients, ring of doctors, and ring of references

Unnecessary or correlated screening tests
Telecommunications: phone-call fraud


Retail industry


Chp1
Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
Slides by: Shree Jaswal
What Is Data Mining?
13
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown

and potentially useful) patterns or knowledge from huge amount of
data
Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
Chp1
Slides by: Shree Jaswal
Examples: What is (not) Data Mining?
 What is not Data
 What is Data Mining?
Mining?
– Look up phone
– Certain names are more
number in phone
directory
prevalent in certain US locations
(O’Brien, O’Rurke, O’Reilly… in
Boston area)
– Query a Web
– Group together similar
documents returned by search
engine according to their context
(e.g. Amazon rainforest,
Amazon.com,)
search engine for
information about
“Amazon”
KDD Process: Several Key Steps
15
 Learning the application domain

relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation

Find useful features, dimensionality/variable reduction, invariant
representation
 Choosing functions of data mining

summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
Chp1
Slides by: Shree Jaswal
Knowledge Discovery (KDD) Process
16
 This is a view from typical
database systems and data
Pattern Evaluation
warehousing communities
 Data mining plays an essential role
in the knowledge discovery process
Data Mining
Task-relevant Data
Data Warehouse
Data Cleaning
Data Integration
Chp1
Databases
Slides by: Shree Jaswal
Selection
Example: A Web Mining Framework
17
 Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into knowledgebase
Chp1
Slides by: Shree Jaswal
Data Mining in Business Intelligence
18
Increasing potential
to support
business decisions
End User
Decision
Making
Data Presentation
Visualization Techniques
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
Chp1
Slides by: Shree Jaswal
DBA
KDD Process: A Typical View from ML and Statistics
19
Input Data
Data
Mining
Data PreProcessing
Data integration
Normalization
Feature selection
Dimension reduction
Pattern discovery
Association & correlation
Classification
Clustering
Outlier analysis
…………
PostProcessing
Pattern
Pattern
Pattern
Pattern
evaluation
selection
interpretation
visualization
 This is a view from typical machine learning and statistics communities
Chp1
Slides by: Shree Jaswal
Decisions in Data Mining(1)
20
 Data to be mined
Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social and
information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining : Descriptive mining tasks
characterize properties of data in a target data set & predictive mining
tasks perform induction on current data in order to make predictions
 Multiple/integrated functions and mining at multiple levels

Chp1
Slides by: Shree Jaswal
Decisions in Data Mining(2)
21
 Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.

Chp1
Slides by: Shree Jaswal
What Kinds of Patterns Can Be Mined?
22
1.
2.
3.
4.
5.
Chp1
Generalization
Association and Correlation Analysis
Classification
Cluster Analysis
Outlier Analysis
Slides by: Shree Jaswal
Data Mining Function: (1) Generalization
23
 Multidimensional concept description: Characterization
and discrimination

Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet region
 Data characterization is a summarization of the
general characteristics or features of a target class of data
 Data cube technology for computing


Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
 Examples of Output forms : pie charts, MDD cubes, bar
charts, curves etc.
Chp1
Slides by: Shree Jaswal
Data Mining Function: (1) Generalization contd.
24
 Data discrimination is a comparison of the general
features of the target class data objects against the general
features of objects from one or multiple contrasting classes
 Data cube technology for computing


Drill down on any dimension
Discriminant rules: Discrimination descriptions expressed in the form of
rules
 Output forms : same as that of data characterization along
with discrimination descriptions
Chp1
Slides by: Shree Jaswal
Data Mining Function: (2) Association and
Correlation Analysis
25
 Frequent patterns (or frequent itemsets)

What items are frequently purchased together in your mart?
Eg. Milk & bread
 Association, correlation vs. causality

A typical association rule
 Computer
 software [1%, 50%] (support, confidence)
Confidence means that if one buys a computer there is a 50%
chance that she will buy software too. A 1% support means
that 1% of all transactions under analysis show that computer
& software are purchased together
 Association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a
minimum confidence threshold

Chp1
Slides by: Shree Jaswal
Data Mining Function: (3) Classification
26
 Classification and label prediction

Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future prediction
 E.g.,
classify countries based on (climate), or classify cars based
on (gas mileage)

Predict some unknown class labels
 Typical methods

Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
 Typical applications:

Chp1
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
Slides by: Shree Jaswal
Various forms of a classification model
27
Chp1
Slides by: Shree Jaswal
Data Mining Function: (4) Cluster Analysis
28
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Data objects are clustered or grouped based on the principle
of maximizing intraclass similarity and
minimizing interclass similarity
 Many methods and applications
Chp1
Slides by: Shree Jaswal
A 2D plot of customer data with respect to customer
locations in a city showing 3 data clusters
Chp1
Slides by: Shree Jaswal
29
Data Mining Function: (5) Outlier Analysis
30
 Outlier analysis (anomaly mining)

Outlier: A data object that does not comply with the general behavior
of the data

Noise or exception? ― One person’s garbage could be another
person’s treasure

Methods: by product of clustering or regression analysis, …

Useful in fraud detection, rare events analysis
Chp1
Slides by: Shree Jaswal
Are All the “Discovered” Patterns Interesting?
31
 Data mining may generate thousands of patterns: Not all of them are
interesting

Suggested approach: Human-centered, query-based, focused mining
 Interestingness measures

A pattern is interesting if it is easily understood by humans, valid on new
or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures
Chp1

Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.

Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,
actionability, etc. Eg. A large Earthquake often follows a cluster of small
quakes
Slides by: Shree Jaswal
Find All and Only Interesting Patterns?
32
 Find all the interesting patterns: Completeness

Can a data mining system find all the interesting patterns? Do we
need to find all of the interesting patterns?

Heuristic vs. exhaustive search

Association vs. classification vs. clustering
 Search for only interesting patterns: An optimization problem

Can a data mining system find only the interesting patterns?

Approaches
 First
generate all the patterns and then filter out the uninteresting
ones
 Generate
Chp1
only the interesting patterns—mining query optimization
Slides by: Shree Jaswal
Other Pattern Mining Issues
33
 Precise patterns vs. approximate patterns

Association and correlation mining: possible find sets of precise
patterns
 But
approximate patterns can be more compact and sufficient
 How

to find high quality approximate patterns??
Gene sequence mining: approximate patterns are inherent
 How
to derive efficient approximate pattern mining algorithms??
 Constrained vs. non-constrained patterns
Chp1

Why constraint-based mining?

What are the possible kinds of constraints? How to push constraints
into the mining process?
Slides by: Shree Jaswal
Data Mining: Confluence of Multiple Disciplines
34
Machine
Learning
Applications
Algorithm
Chp1
Slides by: Shree Jaswal
Pattern
Recognition
Data Mining
Database
Technology
Statistics
Visualization
High-Performance
Computing
Why Confluence of Multiple Disciplines?
35
 Tremendous amount of data
 Algorithms must be highly scalable to handle such as tera-bytes of
data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Chp1
Slides by: Shree Jaswal
Applications of Data Mining
36
 Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
 Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Chp1
Slides by: Shree Jaswal
Major Issues in Data Mining (1)
37
 Mining Methodology

Mining various and new kinds of knowledge

Mining knowledge in multi-dimensional space

Data mining: An interdisciplinary effort

Boosting the power of discovery in a networked environment

Handling noise, uncertainty, and incompleteness of data

Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
Chp1

Interactive mining

Incorporation of background knowledge

Presentation and visualization of data mining results
Slides by: Shree Jaswal
Major Issues in Data Mining (2)
38
 Efficiency and Scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed, stream, and incremental mining methods
 Diversity of data types

Handling complex types of data

Mining dynamic, networked, and global data repositories
 Data mining and society
Chp1

Social impacts of data mining

Privacy-preserving data mining

Invisible data mining
Slides by: Shree Jaswal
Summary
39
 Data mining: Discovering interesting patterns and knowledge from
massive amount of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
 Mining can be performed in a variety of data
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining technologies and applications
 Major issues in data mining
Chp1
Slides by: Shree Jaswal