Association Rule Mining (Decision Support Systems, SS 18)
Frequent Pattern Analysis
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami (1993) in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data.
What products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click-stream) analysis, and DNA sequence analysis.
Why Is Frequent Pattern Mining Important?
Frequent patterns are an intrinsic and important property of datasets, and the foundation for many essential data mining tasks:
Association, correlation, and causality analysis
Sequential and structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative frequent pattern analysis
Cluster analysis: frequent-pattern-based clustering
Data warehousing: iceberg cubes and cube gradients
Semantic data compression: fascicles
Broad applications
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

itemset: a set of one or more items
k-itemset: X = {x_1, ..., x_k}
(absolute) support, or support count, of X: frequency of occurrence of itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold.
Basic Concepts: Association Rules
Using the same transaction table as above, find all rules X → Y with minimum support and confidence:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
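The definitions above can be checked with a short Python sketch on the five-transaction table (a minimal illustration; the function names are my own):

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of X -> Y: support(X ∪ Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
s = support({"Beer", "Diaper"}, db)          # 0.6, matching the 60% above
c1 = confidence({"Beer"}, {"Diaper"}, db)    # 1.0 (100%)
c2 = confidence({"Diaper"}, {"Beer"}, db)    # ≈ 0.75 (75%)
```

This reproduces the two example rules: Beer → Diaper (60%, 100%) and Diaper → Beer (60%, 75%).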
Association Rule Generation (Problem Definition)
Two sub-problems:
1. Finding all frequent itemsets whose occurrence counts exceed a predefined minimum support threshold.
2. Deriving association rules from those frequent itemsets (under the constraint of a minimum confidence threshold).
These are solved iteratively until no new rules emerge. Most of the research focus is on frequent pattern mining, because the second sub-problem is quite straightforward.
Computational Complexity
How many itemsets could be generated in the worst case? The number of frequent itemsets is sensitive to the minsup threshold: when minsup is low, there can be exponentially many frequent itemsets. Worst case: M^N, where M is the number of distinct items and N is the maximum length of a transaction.
Worst-case complexity vs. expected probability. Example: suppose Walmart carries 10^4 kinds of products.
The chance of picking one particular product: 10^-4
The chance of picking one particular set of 10 products: ~10^-40, i.e., 1 out of 10^40
(The sum of the coefficients in (x_1 + x_2 + ... + x_10000)^10 is (10^4)^10 = 10^40.)
What is the chance that this particular set of 10 products is frequent, i.e., occurs 10^3 times in 10^9 transactions?
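The back-of-the-envelope numbers above can be reproduced directly (assuming, as the slide implicitly does, uniform and independent product picks as a simplification):

```python
M = 10**4                 # distinct products
p_one = 1 / M             # chance of picking one particular product: 10^-4
p_set10 = p_one**10       # chance of one particular 10-product set: ~10^-40

n_transactions = 10**9
expected = n_transactions * p_set10  # expected occurrences of that particular set
# expected ≈ 10^-31, vastly below the 10^3 occurrences needed to be frequent,
# so the worst case is astronomically unlikely for any particular large itemset.
```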
Association Rules: Formal Statement of the Problem
I: the set of all items
D: a set of transactions
T: a transaction, containing a set of items
X and Y are itemsets over I: X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
The rule X → Y holds with confidence c if c% of the transactions in D that contain X also contain Y.
The rule X → Y has support s if s% of the transactions in D contain X ∪ Y.
The Apriori Algorithm
Let C_k be the set of candidate itemsets of size k, and L_k the set of frequent itemsets of size k.
Main steps of each iteration:
1. Start from the frequent itemsets L_{k-1}.
2. Join step: C_k is generated by joining L_{k-1} with itself (Cartesian product L_{k-1} × L_{k-1}).
3. Prune step (using the Apriori property): any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset (a member of L_k), hence such candidates are removed from C_k.
4. Count supports to obtain the frequent itemsets L_k, and repeat until L_k = ∅.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
The Apriori Algorithm (Pseudo-Code)
C_k: set of candidate k-itemsets
L_k: set of frequent k-itemsets

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
end
return ∪_k L_k;
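The pseudo-code above can be turned into a runnable Python sketch (a minimal, unoptimized illustration with names of my own choosing; minsup is an absolute support count):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]
    # L1: count each single item in one database scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Count candidate supports in one scan, then filter by minsup
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= minsup}
        frequent.update(L)
        k += 1
    return frequent

db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
freq = apriori(db, minsup=3)  # minsup = 50% of 5 transactions -> count >= 3
```

On the earlier beer/diaper table this yields exactly the frequent patterns listed before: Beer:3, Nuts:3, Diaper:4, Eggs:3, and {Beer, Diaper}:3.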
Apriori Algorithm Example
Consider a database D consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; association rules are then generated using the minimum support and minimum confidence.

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Apriori Algorithm Example
Step 1: Generating the frequent 1-itemsets
Scan D for the count of each candidate 1-itemset, then compare the candidate support counts with the minimum support:
C1: {I1}:6, {I2}:7, {I3}:6, {I4}:2, {I5}:2
In the first iteration of the algorithm, each item is a member of the candidate set C1. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here all five candidates qualify, so L1 = C1.
Apriori Algorithm Example
Step 2: Generating the frequent 2-itemsets
Generate C2 from L1, scan D for the count of each candidate, and compare with the minimum support:
C2: {I1,I2}:4, {I1,I3}:4, {I1,I4}:1, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2, {I3,I4}:0, {I3,I5}:1, {I4,I5}:0
L2 (support count ≥ 2): {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2
Apriori Algorithm Example
Step 3: Generating the frequent 3-itemsets
The generation of the candidate 3-itemsets, C3, involves the Apriori property. The join step first computes C3 = L2 join L2 = {{I1,I2,I3}, {I1,I2,I4}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
The prune step then reduces the size of C3, which helps avoid heavy computation due to a large C_k: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so every candidate with an infrequent 2-subset is removed. For example, {I1,I3,I5} is pruned because {I3,I5} is not in L2. After pruning, C3 = {{I1,I2,I3}, {I1,I2,I5}}.
Scanning D for the counts gives {I1,I2,I3}:2 and {I1,I2,I5}:2. Both satisfy minimum support, so L3 = {{I1,I2,I3}:2, {I1,I2,I5}:2}.
Apriori Algorithm Example
Step 4: Generating the frequent 4-itemsets
The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join yields {I1,I2,I3,I5}, this itemset is pruned since its subset {I2,I3,I5} is not frequent. Thus C4 = ∅, and the algorithm terminates, having found all frequent itemsets. This completes the Apriori algorithm.
What's next? These frequent itemsets will be used to generate strong association rules (rules that satisfy both minimum support and minimum confidence).
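The join and prune steps used in Steps 3 and 4 can be sketched as a single helper (an illustrative sketch; the function name is my own):

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    """Join frequent (k-1)-itemsets into k-itemset candidates, then prune."""
    prev = {frozenset(s) for s in L_prev}
    # Join step: pairwise unions that have exactly k items
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori property): every (k-1)-subset must be frequent
    return {c for c in joined
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}

L2 = [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
      {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]
C3 = gen_candidates(L2, 3)  # only {I1,I2,I3} and {I1,I2,I5} survive pruning

L3 = [{"I1", "I2", "I3"}, {"I1", "I2", "I5"}]
C4 = gen_candidates(L3, 4)  # empty: {I1,I2,I3,I5} is pruned via {I2,I3,I5}
```

Run on the example's L2 and L3, it reproduces C3 = {{I1,I2,I3}, {I1,I2,I5}} and C4 = ∅.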
Generating Association Rules
L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}

Rule       Confidence        Rule       Confidence
I1 → I2    4/6 = 67%         I2 → I1    4/7 = 57%
I1 → I3    4/6 = 67%         I3 → I1    4/6 = 67%
I1 → I5    2/6 = 33%         I5 → I1    2/2 = 100%
I2 → I3    4/7 = 57%         I3 → I2    4/6 = 67%
I2 → I4    2/7 = 29%         I4 → I2    2/2 = 100%
I2 → I5    2/7 = 29%         I5 → I2    2/2 = 100%
Generating Association Rules
L3 = {{I1,I2,I3}, {I1,I2,I5}}

Rule              Confidence       Rule              Confidence
{I1,I2} → I3      2/4 = 50%        {I1,I2} → I5      2/4 = 50%
{I1,I3} → I2      2/4 = 50%        {I1,I5} → I2      2/2 = 100%
{I2,I3} → I1      2/4 = 50%        {I2,I5} → I1      2/2 = 100%
I1 → {I2,I3}      2/6 = 33%        I1 → {I2,I5}      2/6 = 33%
I2 → {I1,I3}      2/7 = 29%        I2 → {I1,I5}      2/7 = 29%
I3 → {I1,I2}      2/6 = 33%        I5 → {I1,I2}      2/2 = 100%

Note that only the rules with one item in the right-hand side are generated in R.
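Rule generation from a frequent itemset can be sketched as follows (an illustrative sketch; names are my own). It enumerates every split of the itemset into antecedent X and consequent Y and keeps the rules meeting minconf:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """All rules X -> (itemset \\ X) whose confidence meets minconf."""
    transactions = [set(t) for t in transactions]
    itemset = frozenset(itemset)
    def count(s):
        return sum(1 for t in transactions if s <= t)
    n_full = count(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            X = frozenset(lhs)
            conf = n_full / count(X)  # conf(X -> Y) = sup(X ∪ Y) / sup(X)
            if conf >= minconf:
                rules.append((set(X), set(itemset - X), conf))
    return rules

db9 = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
       {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
       {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
strong = rules_from_itemset({"I1", "I2", "I5"}, db9, minconf=0.7)
# keeps the three 100%-confidence rules from the table:
# I5 -> {I1,I2}, {I1,I5} -> I2, {I2,I5} -> I1
```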
Interestingness Measure: Correlations (Lift)
play basketball → eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal          2000          1750           3750
Not cereal      1000           250           1250
Sum (col.)      3000          2000           5000

lift(B, C) = (2000/5000) / ((3000/5000)(3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000)(1250/5000)) ≈ 1.33
Lift Calculation and Interpretation (Independent Events Case)
play basketball → eat cereal: support = P(B ∪ C) = 48%, confidence = P(C|B) = 80%. The overall percentage of students eating cereal is P(C) = P(C|B) = 80%.
play basketball → not eat cereal: support = P(B ∪ ¬C) = 12%, confidence = P(¬C|B) = 20%. The overall percentage of students not eating cereal is P(¬C) = P(¬C|B) = 20%.
Lift is a ratio showing how frequently two sets of items A and B occur together compared to the expected frequency when they occur independently:

lift = P(A ∪ B) / (P(A) P(B))

Lift = 1 implies that playing basketball is independent of eating cereal. A rule with lift > 1 is interesting because the items occur together more often than expected under independence. However, lift < 1 can also be interesting if you are interested in two sets of items not occurring together.

             Basketball   Not basketball   Sum (row)
Cereal          2400          1600           4000
Not cereal       600           400           1000
Sum (col.)      3000          2000           5000

lift(B, C) = (2400/5000) / ((3000/5000)(4000/5000)) = 1
lift(B, ¬C) = (600/5000) / ((3000/5000)(1000/5000)) = 1
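Both contingency tables can be checked with a small helper that computes lift from raw counts (a minimal sketch; the function name is my own):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# First table: dependent case
l_bc = lift(2000, 3000, 3750, 5000)       # ≈ 0.89 (negatively correlated)
l_bnc = lift(1000, 3000, 1250, 5000)      # ≈ 1.33 (positively correlated)

# Second table: independent case, lift is 1 for both rules
l_ind1 = lift(2400, 3000, 4000, 5000)     # ≈ 1.0
l_ind2 = lift(600, 3000, 1000, 5000)      # ≈ 1.0
```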
Lift Calculation
L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}

Rule                Lift
I1 → I2, I2 → I1    (4/9) / ((6/9)(7/9)) = 0.857
I1 → I3, I3 → I1    (4/9) / ((6/9)(6/9)) = 1.000
I1 → I5, I5 → I1    (2/9) / ((6/9)(2/9)) = 1.500
I2 → I3, I3 → I2    (4/9) / ((7/9)(6/9)) = 0.857
I2 → I4, I4 → I2    (2/9) / ((7/9)(2/9)) = 1.286
I2 → I5, I5 → I2    (2/9) / ((7/9)(2/9)) = 1.286

(Lift is symmetric in X and Y, so a rule and its reverse always have the same lift.)
Lift Calculation
L3 = {{I1,I2,I3}, {I1,I2,I5}}

Rule              Lift       Rule              Lift
{I1,I2} → I3      0.750      {I1,I2} → I5      2.250
{I1,I3} → I2      0.643      {I1,I5} → I2      1.286
{I2,I3} → I1      0.750      {I2,I5} → I1      1.500
I1 → {I2,I3}      0.750      I1 → {I2,I5}      1.500
I2 → {I1,I3}      0.643      I2 → {I1,I5}      1.286
I3 → {I1,I2}      0.750      I5 → {I1,I2}      2.250

Note that only the rules with one item in the right-hand side are generated in R.
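The table values can be reproduced directly from the nine transactions (an illustrative sketch; the function name is my own):

```python
def lift_rule(X, Y, transactions):
    """lift(X -> Y) = support(X ∪ Y) / (support(X) * support(Y))."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    def sup(s):
        return sum(1 for t in transactions if set(s) <= t) / n
    return sup(set(X) | set(Y)) / (sup(X) * sup(Y))

db9 = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
       {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
       {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

l1 = lift_rule({"I1", "I2"}, {"I5"}, db9)  # (2/9)/((4/9)(2/9)) = 2.25
l2 = lift_rule({"I1"}, {"I3"}, db9)        # (4/9)/((6/9)(6/9)) = 1.00
l3 = lift_rule({"I1", "I5"}, {"I2"}, db9)  # (2/9)/((2/9)(7/9)) ≈ 1.286
```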
Sources
F. Provost and T. Fawcett. Data Science for Business.
J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques.
Srikant, R., & Agrawal, R. (1995). Mining generalized association rules (pp. 407-419). IBM Research Division.
Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases, VLDB (Vol. 1215, pp. 487-499).
Support in an Association Rule. IBM website. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.im.model.doc/c_defining_minimum_support.html
Confidence in an Association Rule. IBM website. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.im.model.doc/c_defining_minimum_confidence.html
Wasilewska, Anita. Apriori Algorithm. http://www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf
Amos Fiat & Idit Haran. Data Mining Seminar based on "Mining Generalized Association Rules", 2003. http://www.cs.tau.ac.il/~fiat/dmsem03/mining%20generalized%20association%20Rules.ppt