Association Rule Mining. Entscheidungsunterstützungssysteme SS 18


Frequent Pattern Analysis

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
- First proposed by Agrawal, Imielinski, and Swami (1993) in the context of frequent itemsets and association rule mining.
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, and DNA sequence analysis.

Why Is Frequent Pattern Mining Important?

- Frequent patterns are an intrinsic and important property of datasets.
- Foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: discriminative frequent pattern analysis
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cubes and cube gradients
  - Semantic data compression: fascicles
- Broad applications

Basic Concepts: Frequent Patterns

Example transaction database:

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

(Slide figure: Venn diagram relating "customer buys beer", "customer buys diaper", and "customer buys both".)

- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- (absolute) support, or support count, of X: frequency of occurrence of the itemset X
- (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold.
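
To make the definitions concrete, here is a minimal Python sketch (not from the slides; the function and variable names are illustrative) that computes absolute and relative support over the toy database above:

```python
# Toy database from the slide (Tid -> items bought)
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Absolute support: number of transactions containing every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)

def support(itemset, db):
    """Relative support: fraction of transactions containing `itemset`."""
    return support_count(itemset, db) / len(db)

print(support_count({"Beer", "Diaper"}, transactions))  # 3
print(support({"Beer", "Diaper"}, transactions))        # 0.6
# With minsup = 50%, {Beer, Diaper} is frequent (0.6 >= 0.5).
```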

Basic Concepts: Association Rules

(Same transaction database and figure as the previous slide.)

Find all rules X -> Y with minimum support and confidence:
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction containing X also contains Y

Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (among many more):
- Beer -> Diaper (support 60%, confidence 100%)
- Diaper -> Beer (support 60%, confidence 75%)
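
Confidence can be computed the same way. A small sketch, again with illustrative names, reproducing the two rules on this slide:

```python
# The five transactions from the slide
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # P(transaction contains rhs | transaction contains lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> Beer -> Diaper, 100% confidence
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> Diaper -> Beer, 75% confidence
```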

Association Rule Generation (Problem Definition)

Two sub-problems:
1. Finding frequent itemsets whose occurrence exceeds a predefined minimum support threshold
2. Deriving association rules from those frequent itemsets (subject to a minimum confidence threshold)

These are solved iteratively until no new rules emerge. Most research focuses on frequent pattern mining, because the second sub-problem is quite straightforward.

Computational Complexity

How many itemsets could potentially be generated in the worst case?
- The number of frequent itemsets to be generated is sensitive to the minsup threshold.
- When minsup is low, there can be an exponential number of frequent itemsets.
- Worst case: M^N, where M = number of distinct items and N = maximum transaction length.

Worst-case complexity vs. expected probability. Example: suppose Walmart carries 10^4 kinds of products.
- The chance of picking up one particular product: 10^-4
- The chance of picking up a particular set of 10 products: ~10^-40, i.e. 1 out of 10^40 (the sum of the coefficients in (x_1 + x_2 + ... + x_10000)^10 is (10^4)^10 = 10^40)
- What is the chance that this particular set of 10 products is frequent, i.e. appears 10^3 times in 10^9 transactions?

Association Rules: Formal Statement of the Problem

- I = the set of all items
- D = the set of transactions
- T = a transaction, containing a set of items
- X and Y are sets of items in I
- An association rule is an implication X -> Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
- The rule has confidence c if c% of the transactions in D that contain X also contain Y.
- The rule has support s if s% of the transactions in D contain X ∪ Y.
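
In symbols, a standard formulation of these definitions (my notation, not copied from the slide):

```latex
s(X \Rightarrow Y) \;=\; \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert},
\qquad
c(X \Rightarrow Y) \;=\; \frac{s(X \cup Y)}{s(X)} \;=\; P(Y \mid X)
```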

The Apriori Algorithm

Let C_k be the set of candidate itemsets of size k, and L_k the set of frequent itemsets of size k.

Main steps of each iteration (see the sketch after this list):
1. Find the frequent itemsets L_{k-1}.
2. Join step: C_k is generated by joining L_{k-1} with itself (L_{k-1} ⋈ L_{k-1}).
3. Prune step (using the Apriori property): any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so candidates in C_k containing an infrequent (k-1)-subset are removed.
4. Count supports to obtain the frequent itemsets L_k, and repeat until L_k = ∅.

Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
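
A minimal Python sketch of the join and prune steps (the function name apriori_gen and the data layout are my own; the L2 used for the demo anticipates the worked example later in these slides):

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Sketch of the join + prune steps: build candidate k-itemsets C_k
    from the frequent (k-1)-itemsets L_{k-1}."""
    prev = set(prev_frequent)                      # L_{k-1} as a set of frozensets
    # Join step: union pairs of (k-1)-itemsets whose union has exactly k items.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori property): drop candidates with an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# Example: L_2 from the worked example later in the slides
L2 = [frozenset(x) for x in (("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5"))]
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```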

The Apriori Algorithm (Pseudo-Code)

C_k: set of candidate k-itemsets
L_k: set of frequent k-itemsets

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in the database do
            increment the count of all candidates in C_{k+1} that are contained in t;
        L_{k+1} = candidates in C_{k+1} with min_support;
    end
    return ∪_k L_k;
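
A compact, self-contained Python sketch of this pseudo-code, assuming an absolute support-count threshold; names are illustrative, not the course's reference implementation. Run on the five-transaction beer/diaper database with minsup = 50% (count >= 3), it reproduces the frequent patterns listed earlier:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}

    def frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    result, L, k = {}, frequent(items), 1          # L_1 = {frequent items}
    while L:                                       # repeat until L_k is empty
        result.update(L)
        k += 1
        # Candidate generation: join L_{k-1} with itself, then prune by the Apriori property
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)                   # L_k = candidates with min_support
    return result

# Demo on the five-transaction beer/diaper database, minsup = 50% (count >= 3):
db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
for itemset, count in sorted(apriori(db, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
# {'Beer'} 3, {'Diaper'} 4, {'Eggs'} 3, {'Nuts'} 3, {'Beer', 'Diaper'} 3
# (element order within each printed set may vary)
```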

Apriori Algorithm: Example

Consider a database D consisting of 9 transactions.
- Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
- Let the minimum confidence required be 70%.
- First we find the frequent itemsets using the Apriori algorithm; then association rules are generated using the minimum support and minimum confidence.

TID  | List of items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Apriori Example, Step 1: Generating the frequent 1-itemsets

Scan D for the count of each candidate 1-itemset (C1), then compare each candidate's support count with the minimum support count to obtain L1:

Itemset | Sup. count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2

In the first iteration, each item is a member of the set of candidate 1-itemsets C1. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.

Apriori Example, Step 2: Generating the frequent 2-itemsets

Generate the candidate set C2 from L1, scan D for the count of each candidate, and compare the counts with the minimum support count:

C2: Itemset | Sup. count
    {I1,I2} | 4
    {I1,I3} | 4
    {I1,I4} | 1
    {I1,I5} | 2
    {I2,I3} | 4
    {I2,I4} | 2
    {I2,I5} | 2
    {I3,I4} | 0
    {I3,I5} | 1
    {I4,I5} | 0

L2: Itemset | Sup. count
    {I1,I2} | 4
    {I1,I3} | 4
    {I1,I5} | 2
    {I2,I3} | 4
    {I2,I4} | 2
    {I2,I5} | 2

Apriori Example, Step 3: Generating the frequent 3-itemsets

From Step 2, L2 = {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2.

Generate C3 from L2, scan D for the count of each candidate, and compare with the minimum support count:

C3 = L3: Itemset    | Sup. count
         {I1,I2,I3} | 2
         {I1,I2,I5} | 2

The generation of the set of candidate 3-itemsets, C3, uses the Apriori property. To find C3, we first compute the join:
L2 ⋈ L2 = {{I1,I2,I3}, {I1,I2,I4}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}

Now the join step is complete, and the prune step is used to reduce the size of C3; the prune step helps avoid heavy computation due to a large C_k.

Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence such candidates are removed from C_k. Here, every joined candidate except {I1,I2,I3} and {I1,I2,I5} contains an infrequent 2-subset (e.g., {I1,I4}, {I3,I5}, {I3,I4}, or {I4,I5}), so those candidates are pruned.

Apriori Example, Step 4: Generating the frequent 4-itemsets

The algorithm uses L3 ⋈ L3 to generate the candidate set of 4-itemsets, C4. Although the join yields {{I1,I2,I3,I5}}, this itemset is pruned because its subset {I2,I3,I5} is not frequent. Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori run.

What's next? These frequent itemsets will be used to generate strong association rules (rules that satisfy both minimum support and minimum confidence).
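
As a cross-check (not part of the slides, and deliberately brute force rather than Apriori), one can enumerate every itemset of size 1 to 3 over the nine transactions and count supports; the frequent ones match L1, L2, and L3 above:

```python
from itertools import combinations

db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
items = sorted({i for t in db for i in t})
min_count = 2

for k in (1, 2, 3):
    counts = {c: sum(1 for t in db if set(c) <= t) for c in combinations(items, k)}
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    print(f"L{k}:", frequent)
# L1: all five items; L2: the six pairs listed in Step 2;
# L3: ('I1','I2','I3'): 2 and ('I1','I2','I5'): 2
```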

Generating Association Rules (from L2)

L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}  (transaction database as in the example above)

Rule     | Confidence      Rule     | Confidence
I1 -> I2 | 4/6 = 67%       I2 -> I1 | 4/7 = 57%
I1 -> I3 | 4/6 = 67%       I3 -> I1 | 4/6 = 67%
I1 -> I5 | 2/6 = 33%       I5 -> I1 | 2/2 = 100%
I2 -> I3 | 4/7 = 57%       I3 -> I2 | 4/6 = 67%
I2 -> I4 | 2/7 = 29%       I4 -> I2 | 2/2 = 100%
I2 -> I5 | 2/7 = 29%       I5 -> I2 | 2/2 = 100%

Generating Association Rules (from L3)

L3 = {{I1,I2,I3}, {I1,I2,I5}}  (transaction database as in the example above)

Rule          | Confidence      Rule          | Confidence
{I1,I2} -> I3 | 2/4 = 50%       {I1,I2} -> I5 | 2/4 = 50%
{I1,I3} -> I2 | 2/4 = 50%       {I1,I5} -> I2 | 2/2 = 100%
{I2,I3} -> I1 | 2/4 = 50%       {I2,I5} -> I1 | 2/2 = 100%
I1 -> {I2,I3} | 2/6 = 33%       I1 -> {I2,I5} | 2/6 = 33%
I2 -> {I1,I3} | 2/7 = 29%       I2 -> {I1,I5} | 2/7 = 29%
I3 -> {I1,I2} | 2/6 = 33%       I5 -> {I1,I2} | 2/2 = 100%

Note that only the rules with a one-item right-hand side are generated in R.
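
A small Python sketch of rule generation from a frequent itemset (illustrative names, assuming the nine-transaction database above): for every split of the itemset into a non-empty antecedent and consequent, compute the confidence and keep the rules meeting minconf = 70%:

```python
from itertools import combinations

db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]

def count(itemset):
    return sum(1 for t in db if itemset <= t)

def rules_from(frequent_itemset, min_conf):
    """Yield (antecedent, consequent, confidence) for every split of the itemset."""
    items = frozenset(frequent_itemset)
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = count(items) / count(lhs)   # conf(X -> Y) = sup(X ∪ Y) / sup(X)
            if conf >= min_conf:
                yield set(lhs), set(items - lhs), conf

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.70):
    print(lhs, "->", rhs, f"{conf:.0%}")
# The three 100%-confidence rules from the table above:
# {I5} -> {I1,I2}, {I1,I5} -> {I2}, {I2,I5} -> {I1}  (output order may vary)
```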

Interestingness Measure: Correlations (Lift)

play basketball => eat cereal [support 40%, confidence 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

Measure of dependent/correlated events, lift:

lift(A => B) = P(A ∪ B) / (P(A) · P(B))

(here A ∪ B denotes that a transaction contains both A and B)

            | Basketball | Not basketball | Sum (row)
Cereal      | 2000       | 1750           | 3750
Not cereal  | 1000       | 250            | 1250
Sum (col.)  | 3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ((3000/5000) · (3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) · (1250/5000)) ≈ 1.33

Lift Calculation and Interpretation (independent events case)

play basketball => eat cereal: support = P(B ∪ C) = 48%, confidence = P(C | B) = 80%
- The overall share of students eating cereal is P(C) = 80% = P(C | B).
play basketball => not eat cereal: support = P(B ∪ ¬C) = 12%, confidence = P(¬C | B) = 20%
- The overall share of students not eating cereal is P(¬C) = 20% = P(¬C | B).

Lift is a ratio showing how frequently two sets of items A and B occur together compared to the expected frequency if they occurred independently:

lift(A => B) = P(A ∪ B) / (P(A) · P(B))

Lift = 1 implies that playing basketball is independent of eating cereal. A rule with lift > 1 is interesting because the items occur together more often than expected. However, lift < 1 can also be interesting if you are interested in two sets of items not occurring together.

            | Basketball | Not basketball | Sum (row)
Cereal      | 2400       | 1600           | 4000
Not cereal  | 600        | 400            | 1000
Sum (col.)  | 3000       | 2000           | 5000

lift(B, C)  = (2400/5000) / ((3000/5000) · (4000/5000)) = 1
lift(B, ¬C) = (600/5000) / ((3000/5000) · (1000/5000)) = 1
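
A tiny sketch of the lift computation from a 2x2 contingency table (the function name and argument layout are my own), reproducing the numbers of both tables:

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B)), estimated from counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# First contingency table (correlated case):
print(round(lift(2000, 3000, 3750, 5000), 2))   # 0.89 -> basketball and cereal
print(round(lift(1000, 3000, 1250, 5000), 2))   # 1.33 -> basketball and not-cereal

# Second contingency table (independent case):
print(round(lift(2400, 3000, 4000, 5000), 2))   # 1.0
print(round(lift(600, 3000, 1000, 5000), 2))    # 1.0
```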

Lift Calculation (rules from L2)

L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}  (nine-transaction database as above)

Rule (lift is symmetric) | Lift
I1 => I2  &  I2 => I1    | (4/9) / ((6/9)·(7/9)) = 0.857
I1 => I3  &  I3 => I1    | (4/9) / ((6/9)·(6/9)) = 1.000
I1 => I5  &  I5 => I1    | (2/9) / ((6/9)·(2/9)) = 1.500
I2 => I3  &  I3 => I2    | (4/9) / ((7/9)·(6/9)) = 0.857
I2 => I4  &  I4 => I2    | (2/9) / ((7/9)·(2/9)) = 1.286
I2 => I5  &  I5 => I2    | (2/9) / ((7/9)·(2/9)) = 1.286

Lift Calculation (rules from L3)

L3 = {{I1,I2,I3}, {I1,I2,I5}}  (nine-transaction database as above)

Rule          | Lift       Rule          | Lift
{I1,I2} => I3 | 0.750      {I1,I2} => I5 | 2.250
{I1,I3} => I2 | 0.643      {I1,I5} => I2 | 1.286
{I2,I3} => I1 | 0.750      {I2,I5} => I1 | 1.500
I1 => {I2,I3} | 0.750      I1 => {I2,I5} | 1.500
I2 => {I1,I3} | 0.643      I2 => {I1,I5} | 1.286
I3 => {I1,I2} | 0.750      I5 => {I1,I2} | 2.250

Note that only the rules with a one-item right-hand side are generated in R.
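
The same lift values can be obtained directly from relative supports; a short sketch (illustrative, spot-checking a few of the table entries above):

```python
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]

def support(itemset):
    return sum(1 for t in db if itemset <= t) / len(db)

def lift(lhs, rhs):
    # lift(X => Y) = s(X ∪ Y) / (s(X) * s(Y)); symmetric in X and Y
    return support(lhs | rhs) / (support(lhs) * support(rhs))

print(round(lift({"I1"}, {"I3"}), 3))          # 1.0
print(round(lift({"I1"}, {"I2"}), 3))          # 0.857
print(round(lift({"I1", "I2"}, {"I5"}), 3))    # 2.25
print(round(lift({"I1", "I5"}, {"I2"}), 3))    # 1.286
```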

Sources

- F. Provost and T. Fawcett, Data Science for Business.
- J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques.
- Srikant, R., & Agrawal, R. (1995). Mining generalized association rules (pp. 407-419). IBM Research Division.
- Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases (VLDB) (Vol. 1215, pp. 487-499).
- Support in an Association Rule. IBM website. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.im.model.doc/c_defining_minimum_support.html
- Confidence in an Association Rule. IBM website. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.im.model.doc/c_defining_minimum_confidence.html
- Wasilewska, Anita. Apriori Algorithm. http://www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf
- Amos Fiat & Idit Haran. Data Mining Seminar based on Mining Generalized Association Rules, 2003. http://www.cs.tau.ac.il/~fiat/dmsem03/mining%20generalized%20association%20Rules.ppt