Mining Frequent Patterns without Candidate Generation
Authors: Jiawei Han, Jian Pei, Yiwen Yin
In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data
Presenters: 0250207 林世豐, 0350260 黃子軒

Outline
• Introduction
  – Background
  – Problem definition
  – Challenges
• Proposed Method
  – Frequent-Pattern Tree (FP-tree)
  – Architecture
  – FP-tree construction
  – FP-tree mining by pattern fragment growth (FP-growth)
• Experiments
  – Experimental setup
  – Experimental results
• Conclusions & Discussions

Background
Frequent pattern mining plays an essential role in many data mining tasks, and mining frequent patterns in transaction databases has been studied extensively in data mining research. Most previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly.

Problem definition
Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules: {Bread} -> {Milk}, {Diaper} -> {Beer}

Challenges
(1) Candidate set generation is costly when there exist prolific patterns and/or long patterns: given d items, there are 2^d possible candidate itemsets.
(2) It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching.

Proposed Method – FP-tree
• FP-tree (Frequent-Pattern Tree)
• Given a transaction database DB = (T1, T2, …, Tn) and a minimum support threshold ξ.
• Step 1: Identify the frequent items and store them in some sorted order.
• Identify the set of frequent items by one scan of DB. Let ξ = 3 here.
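This first scan could be sketched as a short Python snippet (helper names are illustrative, not from the paper; ties in support are broken arbitrarily, so the order among equally frequent items may differ slightly from the slides):

```python
from collections import Counter

def frequent_items(transactions, xi):
    """One scan of DB: count supports, keep items with support >= xi,
    ordered by descending frequency."""
    counts = Counter(item for t in transactions for item in t)
    return [item for item, c in counts.most_common() if c >= xi]

def order_transaction(transaction, f_list):
    """Drop infrequent items and sort the rest by the global frequency order."""
    rank = {item: i for i, item in enumerate(f_list)}
    return sorted((i for i in transaction if i in rank), key=rank.__getitem__)

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
f_list = frequent_items(db, 3)          # the six items with support >= 3
ordered = [order_transaction(t, f_list) for t in db]
```

Running this on the example database yields the ordered frequent-item lists used to build the FP-tree, e.g. transaction 100 becomes f, c, a, m, p.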
TID  Items bought        (Ordered) frequent items
100  f,a,c,d,g,i,m,p     f,c,a,m,p
200  a,b,c,f,l,m,o       f,c,a,b,m
300  b,f,h,j,o           f,b
400  b,c,k,s,p           c,b,p
500  a,f,c,e,l,p,m,n     f,c,a,m,p

Item supports after the first scan: f:4, c:4, a:3, b:3, m:3, p:3 (all other items fall below ξ = 3).

Proposed Method – FP-tree
Insert the ordered frequent-item lists one transaction at a time, sharing common prefixes. After transaction 100 the tree is the single path root → f:1 → c:1 → a:1 → m:1 → p:1; each later transaction either follows an existing prefix (incrementing its counts) or starts a new branch. After all five transactions the FP-tree is:

root
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Each node stores an item name and the number of transactions whose path passes through it. A header table lists each frequent item (f, c, a, b, m, p) with a head-of-node-link pointer, and all nodes carrying the same item are chained together by node links.

• Node-link property: for any frequent item ai, all possible frequent patterns that contain ai can be obtained by following ai's node links, starting from ai's head in the FP-tree header table.
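The incremental tree construction with node-link threading described above could be sketched as follows (a minimal illustrative implementation, not the paper's code):

```python
class Node:
    """One FP-tree node: item name, path count, parent, children, node link."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}      # item -> child Node
        self.link = None        # next node carrying the same item

def insert(root, ordered_transaction, header):
    node = root
    for item in ordered_transaction:
        child = node.children.get(item)
        if child:                        # shared prefix: bump the count
            child.count += 1
        else:                            # new branch
            child = Node(item, node)
            node.children[item] = child
            if item in header:           # thread onto the item's node-link chain
                tail = header[item]
                while tail.link:
                    tail = tail.link
                tail.link = child
            else:                        # first node for this item: head of links
                header[item] = child
        node = child

root, header = Node(None, None), {}
for t in [['f','c','a','m','p'], ['f','c','a','b','m'], ['f','b'],
          ['c','b','p'], ['f','c','a','m','p']]:
    insert(root, t, header)
# root's f-child now has count 4, matching the f:4 node in the final tree.
```

Following `header['c']` and its `link` chain visits both c nodes (c:3 under f, and c:1 under the root), exactly the traversal the node-link property relies on.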
• There is no need to include items that were already analyzed: mining proceeds one item at a time, and the node-link property guarantees that all possible patterns are still examined.

Proposed Method – FP-tree
Following an item's node links collects the paths containing it. For example:

item  paths ending at the item
p     (f,c,a,m,p:2), (c,b,p:1)
m     (f,c,a,m:2), (f,c,a,b,m:1)
b     (f,c,a,b:1), (f,b:1), (c,b:1)

• Conditional FP-tree: use the sub-pattern base obtained under the condition of the item's existence (its conditional pattern base) to construct a new FP-tree.

item  conditional pattern base       conditional FP-tree
p     (f,c,a,m:2), (c,b:1)           (c:3) | p
m     (f,c,a:2), (f,c,a,b:1)         (f:3, c:3, a:3) | m
b     (f,c,a:1), (f:1), (c:1)        –
a     (f,c:3)                        (f:3, c:3) | a
c     (f:3)                          (f:3) | c
f     –                              –

For example, p's conditional FP-tree is the single path root → c:3, and m's conditional FP-tree is the single path root → f:3 → c:3 → a:3.

Proposed Method – FP-tree
• Single-path FP-tree: when a tree T consists of a single path P, the complete set of frequent patterns of T can be generated by enumerating all combinations of the subpaths of P, with the support of each pattern being the minimum support of the items contained in the subpath.

Proposed Method – FP-growth
• FP-Growth:
Input: an FP-tree and the minimum support threshold ξ.
Output: the complete set of frequent patterns.
Method: call FP-Growth(Tree, null).

Procedure FP-Growth(Tree, α):
  if Tree contains a single path P then
    for each combination β of the nodes in path P do
      generate pattern β ∪ α with support = minimum support of the nodes in β
  else for each item ai in the header of Tree do
    generate pattern β = ai ∪ α with support = ai.support
    construct β's conditional pattern base and β's conditional FP-tree Treeβ
    if Treeβ is not empty then
      call FP-Growth(Treeβ, β)

Experiments – Experimental setup
• Data sets:
  • T25.I10.D10K with 1K items, denoted D1: average transaction size 25, average maximal potentially frequent itemset size 10, 10K transactions.
  • T25.I20.D100K with 10K items, denoted D2: average transaction size 25, average maximal potentially frequent itemset size 20, 100K transactions.
• Baselines: the Apriori algorithm and TreeProjection.

Experiments – Experimental results
• As the support threshold decreases from 3% to 0.1%, FP-growth scales much better than Apriori, because the length of the frequent itemsets increases dramatically at low thresholds.
• The run time per itemset shows that FP-growth has good scalability as the minimum support threshold is reduced; moreover, as the support threshold goes down, the run time per itemset decreases dramatically (the figure's axis is logarithmic).
• Figure 5 shows the scalability with the number of transactions: both FP-growth and Apriori scale linearly from 10K to 100K transactions, and overall FP-growth is about an order of magnitude faster than Apriori on large databases.
• Comparing FP-growth with TreeProjection: both methods are efficient at mining frequent patterns and are faster than Apriori, but FP-growth performs better.
• Figure 7 shows the scalability with the number of transactions.
• Both FP-growth and TreeProjection show linear scalability with the number of transactions from 10K to 100K.
• FP-growth is more scalable than TreeProjection.

Conclusions
• The FP-tree is a novel data structure that stores compressed, crucial information about frequent patterns, and the authors develop a pattern-growth method for efficient mining of frequent patterns in large databases.
• Advantages of FP-growth:
  • It constructs a highly compact FP-tree.
  • Its pattern-growth method avoids costly candidate generation and testing.
  • Its partition-based divide-and-conquer approach dramatically reduces the size of the subsequent conditional pattern bases.

Discussions
• Strongest part of this paper:
  • The FP-tree is a robust and novel method for handling large databases without candidate generation, and the authors prove the completeness of the method.
• Weak points of this paper:
  • If the database is too large to fit in memory, the miss rate of memory requests rises quickly (poor cache locality).
  • The root cause of the poor locality is that each branch of the FP-tree is laid out in memory according to when it was created, not how it is traversed.
  • The height of the FP-tree is bounded by the length of the longest transaction, so long transactions yield deep trees.

Discussions
• Possible improvements:
  • Handling databases too large for memory.
  • A better ordering of transactions to improve locality.
  • Long transactions (which determine the height of the FP-tree).
• Possible extensions and applications:
  • Parallel FP-tree
  • MapReduce FP-tree
  • Transaction-projected FP-tree

END & Thanks for your attention
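As a closing illustration, the FP-Growth procedure from the algorithm slide can be approximated in a few lines of Python (a simplified sketch: prefix paths are kept as plain lists instead of a tree, and the single-path shortcut is omitted; all names are illustrative):

```python
from collections import Counter

def fp_growth(paths, xi, suffix=()):
    """paths: list of (items in descending global frequency order, count).
    Returns {frozenset(pattern): support} for all patterns meeting xi."""
    counts = Counter()
    for items, c in paths:
        for i in items:
            counts[i] += c
    result = {}
    for item, support in counts.items():
        if support < xi:
            continue
        pattern = suffix + (item,)
        result[frozenset(pattern)] = support
        # Conditional pattern base: the prefix preceding `item` on each path.
        cond = [(items[:items.index(item)], c)
                for items, c in paths if item in items]
        result.update(fp_growth(cond, xi, pattern))
    return result

db = [(['f','c','a','m','p'], 1), (['f','c','a','b','m'], 1),
      (['f','b'], 1), (['c','b','p'], 1), (['f','c','a','m','p'], 1)]
res = fp_growth(db, 3)
# res maps each frequent itemset to its support, e.g. {f,c,a,m} -> 3.
```

Because each path is kept in the global frequency order, the prefix before an item contains only more-frequent items, so each pattern is generated exactly once; on the running example this recovers the same patterns as the conditional FP-trees on the earlier slides.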