
Ch5 Mining Frequent Patterns,
Associations, and Correlations
Dr. Bernard Chen Ph.D.
University of Central Arkansas
Outline




Association Rules
Association Rules with FP tree
Misleading Rules
Multi-level Association Rules
What Is Frequent Pattern
Analysis?


Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that
occurs frequently in a data set
First proposed by Agrawal, Imielinski, and
Swami [AIS93] in the context of frequent
itemsets and association rule mining
What Is Frequent Pattern
Analysis?


Motivation: Finding inherent regularities in data

What products were often purchased together? bread and milk?

What are the subsequent purchases after buying a PC?

What kinds of DNA are sensitive to this new drug?

Can we automatically classify web documents?
Applications

Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
Association Rules


support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
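In symbols, for a rule X => Y over a transaction database D, the two measures can be written as below. The notation count(·), the number of transactions containing an itemset, is introduced here only for illustration.

```latex
\[
\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{|D|},
\qquad
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{count}(X \cup Y)}{\mathrm{count}(X)}
\]
```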
Association Rules

Let’s have an example

TID     Items
T100    1, 2, 5
T200    2, 4
T300    2, 3
T400    1, 2, 4
T500    1, 3
T600    2, 3
T700    1, 3
T800    1, 2, 3, 5
T900    1, 2, 3
Association Rules with Apriori
Minimum support=2/9
Minimum confidence=60%
The Apriori Algorithm

Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
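The pseudo-code can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than a reference implementation. The function name `apriori` and the use of an absolute count `min_count` (2 out of 9 transactions) are assumptions made for this sketch.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unions of two frequent k-itemsets that have size k+1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count step: number of transactions that contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# The nine transactions T100..T900 from the example, minimum support 2/9.
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
frequent = apriori(db, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

On this data the sketch finds 13 frequent itemsets; the largest are {1, 2, 3} and {1, 2, 5}, each with count 2.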
Strong Association Rule


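A strong association rule is a rule derived from a frequent itemset that also satisfies the minimum confidence.
For example, take the frequent itemset {I1, I2} with minimum confidence = 60%:

Confidence(I1->I2) = 4/6 ≈ 66.7% (strong association rule!)
Confidence(I2->I1) = 4/7 ≈ 57.1% (below 60%, so not strong)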
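The two confidence values can be checked directly against the nine example transactions. This is a small sketch; it assumes that I1 and I2 denote items 1 and 2 of the example table, and the helper names are hypothetical.

```python
# The nine example transactions T100..T900.
db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(antecedent, consequent):
    """confidence(A -> C) = count(A and C together) / count(A)."""
    return support_count(antecedent | consequent) / support_count(antecedent)

min_conf = 0.60
for a, c in [({1}, {2}), ({2}, {1})]:
    conf = confidence(a, c)
    verdict = "strong" if conf >= min_conf else "not strong"
    print(f"{sorted(a)} -> {sorted(c)}: {conf:.3f} ({verdict})")
```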
Exercise


A dataset has five transactions. Let min_support = 60% and min_confidence = 80%.
Find all frequent itemsets using Apriori and all strong association rules.

TID    Items_bought
T1     M, O, N, K, E, Y
T2     D, O, N, K, E, Y
T3     M, A, K, E
T4     M, U, C, K, Y
T5     C, O, O, K, I, E
Association Rules with Apriori
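Frequent 1-itemsets: K:5, E:4, M:3, O:3, Y:3
=> Candidate 2-itemsets: KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2
=> Frequent 2-itemsets: KE, KM, KO, KY, EO
=> Frequent 3-itemset: KEO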
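The last step, going from the frequent 2-itemsets to KEO, can be reproduced with a short join-and-prune sketch. It assumes a minimum support count of 3 (60% of 5 transactions) and writes each transaction as a string of item letters.

```python
from itertools import combinations

# Frequent 2-itemsets from the exercise (support count >= 3 out of 5).
L2 = {frozenset(p) for p in ["KE", "KM", "KO", "KY", "EO"]}

# Join: merge pairs of frequent 2-itemsets whose union has exactly 3 items.
joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune: keep a candidate only if all of its 2-item subsets are frequent.
C3 = {c for c in joined
      if all(frozenset(s) in L2 for s in combinations(c, 2))}

# Count the surviving candidates in the five transactions T1..T5.
db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
counts = {"".join(sorted(c)): sum(1 for t in db if c <= t) for c in C3}

print(sorted("".join(sorted(c)) for c in joined))
print(counts)
```

Six 3-item candidates come out of the join, but only the one made of K, E and O has all of its 2-item subsets frequent, and it occurs in three transactions.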
Mining Frequent Itemsets
without Candidate Generation


In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to a good performance gain.
However, it suffers from two nontrivial costs:

It may generate a huge number of candidates (for example, if there are 10^4 frequent 1-itemsets, it may generate more than 10^7 candidate 2-itemsets)
It may need to scan the database many times
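The size of the candidate set in that example is easy to check: every pair of frequent 1-itemsets becomes a candidate 2-itemset. A quick sketch (variable names are illustrative):

```python
import math

n_frequent_items = 10**4
# Every unordered pair of frequent items is a candidate 2-itemset.
candidate_pairs = math.comb(n_frequent_items, 2)
print(candidate_pairs)  # 49995000
```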
Association Rules with Apriori
Minimum support=2/9
Minimum confidence=70%
Bottleneck of Frequent-pattern
Mining


Multiple database scans are costly
Mining long patterns needs many passes of scanning and
generates lots of candidates

To find the frequent itemset i1i2…i100:

# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ !

Bottleneck: candidate-generation-and-test

Can we avoid candidate generation?
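The candidate count for a 100-item pattern can be verified with exact integer arithmetic. The sketch below sums the number of non-empty subsets of 100 items.

```python
import math

# Number of non-empty subsets of a 100-item frequent itemset.
total = sum(math.comb(100, k) for k in range(1, 101))

print(total == 2**100 - 1)  # True
print(f"{total:.2e}")       # 1.27e+30
```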
Mining Frequent Patterns Without
Candidate Generation

Grow long patterns from short ones using local frequent
items

“abc” is a frequent pattern

Get all transactions having “abc”: DB|abc

“d” is a local frequent item in DB|abc => abcd is a frequent pattern
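The idea of a projected database can be illustrated on the nine example transactions used earlier. This is a sketch only; the helper names `project` and `local_frequent_items` are hypothetical, and a minimum support count of 2 is assumed.

```python
from collections import Counter

def project(db, pattern):
    """DB|pattern: transactions containing the pattern, with the pattern removed."""
    pattern = set(pattern)
    return [set(t) - pattern for t in db if pattern <= set(t)]

def local_frequent_items(projected, min_count):
    """Items that are frequent inside the projected database."""
    counts = Counter(item for t in projected for item in t)
    return {item: n for item, n in counts.items() if n >= min_count}

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

projected = project(db, {1, 2})           # DB|{1,2}
local = local_frequent_items(projected, 2)
print(local)
```

Items 3 and 5 are locally frequent in DB|{1,2}, so {1, 2, 3} and {1, 2, 5} are frequent patterns.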
Process of FP growth



Scan DB once, find frequent 1-itemset
(single item pattern)
Sort frequent items in frequency
descending order
Scan DB again, construct FP-tree
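The three steps can be sketched as below, using the nine example transactions with a minimum support count of 2. The `Node` class and `build_fp_tree` function are hypothetical names for this illustration, header-table node-links are omitted, and ties in frequency are broken by item value, which is an assumption.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Frequency-descending order; ties broken by item value (an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    # Scan 2: insert each transaction's frequent items along a shared prefix path.
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, rank

def show(node, depth=0):
    """Print the tree as an indented list of item:count entries."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
root, rank = build_fp_tree(db, min_count=2)
show(root)
```

With this ordering (2, 1, 3, 4, 5) the root has two children: item 2 with count 7 and item 1 with count 2.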
Association Rules

Let’s have an example

TID     Items
T100    1, 2, 5
T200    2, 4
T300    2, 3
T400    1, 2, 4
T500    1, 3
T600    2, 3
T700    1, 3
T800    1, 2, 3, 5
T900    1, 2, 3
FP Tree
Mining the FP tree
Benefits of the FP-tree Structure


Completeness
- Preserves complete information for frequent pattern mining
- Never breaks a long pattern of any transaction
Compactness
- Reduces irrelevant info: infrequent items are gone
- Items are in frequency descending order: the more frequently an item occurs, the more likely it is to be shared
- Never larger than the original database (not counting node-links and the count fields)
- For Connect-4 DB, the compression ratio could be over 100
Exercise


A dataset has five transactions. Let min_support = 60% and min_confidence = 80%.
Find all frequent itemsets using FP Tree.

TID    Items_bought
T1     M, O, N, K, E, Y
T2     D, O, N, K, E, Y
T3     M, A, K, E
T4     M, U, C, K, Y
T5     C, O, O, K, I, E
Association Rules with FP Tree
K:5
E:4
M:3
O:3
Y:3
Association Rules with FP Tree
Item    Conditional pattern base    Conditional FP-tree    Frequent patterns
Y       KEMO:1, KEO:1, KM:1         K:3                    KY
O       KEM:1, KE:2                 KE:3                   KO, EO, KEO
M       KE:2, K:1                   K:3                    KM
E       K:4                         K:4                    KE
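The conditional pattern bases for the exercise can be recomputed without drawing the tree: after ordering each transaction's frequent items as K, E, M, O, Y, the prefix in front of an item is exactly its prefix path. The sketch below assumes a minimum support count of 3 and alphabetical tie-breaking among M, O and Y; the variable names are illustrative.

```python
from collections import Counter

# Transactions T1..T5, written as strings of item letters.
db = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COOKIE"]
min_count = 3  # 60% of 5 transactions

# Count each item once per transaction and keep the frequent ones.
counts = Counter(item for t in db for item in set(t))
order = sorted((i for i in counts if counts[i] >= min_count),
               key=lambda i: (-counts[i], i))  # K, E, M, O, Y

# For every item, collect the prefixes that precede it in the ordered transactions.
bases = {item: Counter() for item in order}
for t in db:
    path = [i for i in order if i in t]
    for pos, item in enumerate(path):
        if pos > 0:
            bases[item]["".join(path[:pos])] += 1

for item in reversed(order):
    print(item, dict(bases[item]))
```

For Y, the prefix contributed by T4 (ordered as K, M, Y) is KM.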
FP-Growth vs. Apriori: Scalability
With the Support Threshold
[Figure: run time (sec.) versus support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime]
Why Is FP-Growth the Winner?

Divide-and-conquer:



decompose both the mining task and DB according to
the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors

no candidate generation, no candidate test

compressed database: FP-tree structure

no repeated scan of entire database

basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
Example 5.8 Misleading
“Strong” Association Rule

Of the 10,000 transactions analyzed, the data show that

6,000 of the customer transactions included computer games,
7,500 included videos,
and 4,000 included both computer games and videos
Misleading “Strong”
Association Rule

For this example:



Support (Game & Video) = 4,000 / 10,000 = 40%
Confidence (Game => Video) = 4,000 / 6,000 ≈ 66.7%
Suppose the rule passes our minimum support and confidence thresholds (30% and 60%, respectively)
Misleading “Strong”
Association Rule



However, the truth is that “computer games and videos are negatively associated”,
which means the purchase of one of these items actually decreases the likelihood of purchasing the other.
(How do we reach this conclusion?)
Misleading “Strong”
Association Rule

If buying games and buying videos were independent of each other:

60% of customers buy the game
75% of customers buy the video
Therefore, we would expect 60% * 75% = 45% of customers to buy both
That equals 4,500 transactions, which is more than the actual value of 4,000
From Association Analysis to
Correlation Analysis

Lift is a simple correlation measure, defined as follows.

The occurrence of itemset A is independent of the occurrence of itemset B if
$P(A \cup B) = P(A)P(B)$
Otherwise, itemsets A and B are dependent and correlated as events.
$\mathrm{lift}(A, B) = \dfrac{P(A \cup B)}{P(A)P(B)}$

If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B
If the value is greater than 1, then A and B are positively correlated
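Applied to the computer games and videos example (10,000 transactions, 6,000 with games, 7,500 with videos, 4,000 with both), lift can be computed from the raw counts. The function below is a small sketch with hypothetical parameter names.

```python
def lift(n_a, n_b, n_ab, n_total):
    """lift(A, B) = P(A and B together) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

value = lift(n_a=6000, n_b=7500, n_ab=4000, n_total=10000)
print(round(value, 3))  # 0.889
```

The value is below 1, which matches the negative association between games and videos.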
Mining Multiple-Level
Association Rules

Items often form hierarchies
Mining Multiple-Level
Association Rules

Flexible support settings

Items at the lower level are expected to have lower support

Item hierarchy:
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

           Uniform support    Reduced support
Level 1    min_sup = 5%       min_sup = 5%
Level 2    min_sup = 5%       min_sup = 3%
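The effect of the two settings on the milk hierarchy can be checked with a few lines. This is a sketch; the dictionary layout and the function name `frequent_items` are assumptions made for illustration.

```python
# item -> (hierarchy level, support), taken from the milk example.
support = {"Milk": (1, 0.10), "2% Milk": (2, 0.06), "Skim Milk": (2, 0.04)}

def frequent_items(min_sup_by_level):
    """Items whose support meets the threshold of their own level."""
    return [name for name, (level, s) in support.items()
            if s >= min_sup_by_level[level]]

uniform = frequent_items({1: 0.05, 2: 0.05})
reduced = frequent_items({1: 0.05, 2: 0.03})
print("uniform support:", uniform)
print("reduced support:", reduced)
```

Under uniform support Skim Milk (4%) is dropped at level 2, while under reduced support all three items are kept.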
Multi-level Association:
Redundancy Filtering



Some rules may be redundant due to “ancestor” relationships between items.
Example

milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule.