AN ALTERNATIVE METHOD FOR ASSOCIATION RULES

RECAP

Mining Frequent Itemsets

Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}. A k-itemset is an itemset that contains k items.

Support:
Count σ(X): the frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2
Fraction s(X): the fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5 = 40%

Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

Problem definition:
Input: a set of transactions T over a set of items I, and a minsup value
Output: all itemsets over I with support ≥ minsup

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
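
As a quick check on these definitions, here is a minimal Python sketch (not part of the original slides) that computes σ(X) and s(X) over the example database above:

```python
# Example transaction database from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))                      # 2
print(support_count(X, transactions) / len(transactions))  # 0.4, i.e. 40%
```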

The itemset lattice

A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible itemsets. Too expensive to test all!

The Apriori Principle

Apriori principle (main observation): if an itemset is frequent, then all of its subsets must also be frequent; equivalently, if an itemset is not frequent, then none of its supersets can be frequent.

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets, since every transaction that contains Y also contains each subset X of Y. This is known as the anti-monotone property of support.

Illustration of the Apriori principle [figure: an itemset found to be frequent, with all of its subsets therefore frequent as well]

Illustration of the Apriori principle [figure: the itemset lattice; one itemset is found to be infrequent, so all of its supersets are infrequent and are pruned]

The Apriori algorithm (level-wise approach)

C_k = candidate itemsets of size k
L_k = frequent itemsets of size k

Frequent itemset generation / candidate generation:
1. k = 1, C_1 = all items
2. While C_k is not empty:
3.   Scan the database to find which itemsets in C_k are frequent and put them into L_k
4.   Use L_k to generate a collection C_{k+1} of candidate itemsets of size k+1
5.   k = k+1

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases, 1994.
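
The level-wise loop can be sketched in a few lines of Python. This is an illustrative implementation, not the slides' code; the candidate step here is deliberately naive (all (k+1)-unions of pairs of frequent k-itemsets) and is refined in the next sketch:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: returns all itemsets with support >= minsup_count."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    # k = 1: candidates are all single items; keep the frequent ones.
    items = {i for t in transactions for i in t}
    Lk = [frozenset([i]) for i in sorted(items)
          if support(frozenset([i])) >= minsup_count]
    frequent, k = list(Lk), 1
    while Lk:
        # Naive candidate step: all (k+1)-unions of two frequent k-itemsets.
        Ck = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k + 1}
        # Scan the database; keep candidates that meet the support threshold.
        Lk = [c for c in Ck if support(c) >= minsup_count]
        frequent.extend(Lk)
        k += 1
    return frequent

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
# With minsup count 3 this yields the four frequent items plus
# {Bread,Milk}, {Bread,Diaper}, {Milk,Diaper}, {Diaper,Beer}.
print(apriori(transactions, 3))
```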

Candidate Generation

Basic principle (Apriori): an itemset of size k+1 is a candidate to be frequent only if all of its subsets of size k are known to be frequent.

Main idea: construct a candidate of size k+1 by combining two frequent itemsets of size k, then prune the generated (k+1)-itemsets that have at least one infrequent k-subset.
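
A sketch of this join-and-prune step (illustrative code, assuming itemsets are represented as frozensets): the join combines two frequent k-itemsets that share their first k-1 items in sorted order, and the prune discards candidates that have an infrequent k-subset:

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Join frequent k-itemsets sharing their first k-1 items, then prune
    every (k+1)-candidate that has an infrequent k-subset."""
    frequent = set(Lk)
    prefixes = sorted(tuple(sorted(s)) for s in Lk)
    candidates = set()
    for a, b in combinations(prefixes, 2):
        if a[:-1] == b[:-1]:                       # join: same (k-1)-prefix
            candidates.add(frozenset(a) | frozenset(b))
    # Prune: all k-subsets of a surviving candidate must be frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k))}

L2 = [frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]]
print(generate_candidates(L2, 2))   # only {A,B,C}; {B,C,D} fails the prune
```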

THE FP-TREE AND THE FP-GROWTH ALGORITHM Slides from course lecture of E. Pitoura

Overview

The FP-tree contains a compressed representation of the transaction database. A trie (prefix tree) data structure is used: each transaction is a path in the tree, and paths can overlap. Once the FP-tree is constructed, the recursive, divide-and-conquer FP-Growth algorithm is used to enumerate all frequent itemsets.

FP-tree Construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

The FP-tree is a trie (prefix tree). Since transactions are sets of items, we need to transform them into ordered sequences so that we can have prefixes; otherwise, there is no common prefix between the sets {A,B} and {B,C,A}. We need to impose an order on the items. Initially, assume a lexicographic order.

FP-tree Construction

Initially, the tree is empty.

FP-tree Construction

Reading transaction TID = 1: {A,B} creates the path A:1 → B:1. Each node in the tree has a label consisting of the item and its support (the number of transactions that reach that node, i.e. follow that path). Node label = item:support.

FP-tree Construction

Reading transaction TID = 2: {B,C,D} shares no prefix with {A,B}, so it starts a new path B:1 → C:1 → D:1 from the root. Each transaction is a path in the tree. We add pointers between nodes that refer to the same item.

FP-tree Construction

The Header Table and the pointers assist in computing the itemset support. After reading transactions TID = 1 and 2, the header table has one entry per item (A, B, C, D, E), each pointing to the chain of that item's nodes in the tree [figure: the tree with header-table pointers].
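
Putting the pieces together, here is a minimal construction sketch (an illustration, not the lecture's code); a plain Python list per item stands in for the header table's chain of pointers:

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: an item, its support count, parent and children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                      # item -> child FPNode

def insert_transaction(root, header, transaction):
    """Insert a transaction as a path from the root, sharing prefixes."""
    node = root
    for item in sorted(transaction):            # lexicographic item order
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header[item].append(child)          # thread onto the item's chain
        child.count += 1                        # one more transaction here
        node = child

root, header = FPNode(None), defaultdict(list)
for t in [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}]:
    insert_transaction(root, header, t)
# After TID = 1..3 the chain for A holds a single node with count 2,
# matching the A:2 shown on the next slide.
print(sum(n.count for n in header["A"]))        # 2
```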

FP-tree Construction

Reading transaction TID = 3: {A,C,D,E} shares the prefix A with an existing path, so the count of the A node is incremented (A:2) and a new branch C:1 → D:1 → E:1 is added below it. Each transaction is a path in the tree.

FP-Tree Construction

After reading all ten transactions, each transaction is a path in the tree. The complete FP-tree (with the header table pointing into it) consists of the paths:
A:7 → B:5 → C:3 → D:1
A:7 → B:5 → D:1
A:7 → C:1 → D:1 → E:1
A:7 → D:1 → E:1
B:3 → C:3 → D:1
B:3 → C:3 → E:1
Pointers are used to assist frequent itemset generation.

FP-tree size

Every transaction is a path in the FP-tree, so the size of the tree depends on the compressibility of the data. Extreme case: all transactions are the same, and the FP-tree is a single branch. Extreme case: all transactions are different, and the size of the tree is the same as that of the database (bigger, actually, since we need additional pointers).

Item ordering

The size of the tree also depends on the ordering of the items. Heuristic: order the items according to their frequency, from larger to smaller. We would need to do an extra pass over the dataset to count frequencies. Example:

TID  Items            σ(A)=7, σ(B)=8, σ(C)=7, σ(D)=5, σ(E)=3
1    {A,B}            Ordering: B, A, C, D, E
2    {B,C,D}
3    {A,C,D,E}        TID  Items (reordered)
4    {A,D,E}          1    {B,A}
5    {A,B,C}          2    {B,C,D}
6    {A,B,C,D}        3    {A,C,D,E}
7    {B,C}            4    {A,D,E}
8    {A,B,C}          5    {B,A,C}
9    {A,B,D}          6    {B,A,C,D}
10   {B,C,E}          7    {B,C}
                      8    {B,A,C}
                      9    {B,A,D}
                      10   {B,C,E}
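
A short sketch of this heuristic (illustrative code, not from the slides): one counting pass, then each transaction is reordered before insertion; ties are broken alphabetically for determinism. On the example database it reproduces the ordering B, A, C, D, E:

```python
from collections import Counter

def frequency_order(transactions):
    """Items ordered from most to least frequent (alphabetical tie-break)."""
    counts = Counter(i for t in transactions for i in t)
    return sorted(counts, key=lambda i: (-counts[i], i))

transactions = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
                {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
                {"A","B","D"}, {"B","C","E"}]
order = frequency_order(transactions)            # ['B', 'A', 'C', 'D', 'E']
rank = {item: r for r, item in enumerate(order)}
ordered = [sorted(t, key=rank.get) for t in transactions]
print(ordered[0])                                # ['B', 'A']
```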

Finding Frequent Itemsets

Input: the FP-tree. Output: all frequent itemsets and their support.

Method: divide and conquer. Consider all itemsets that end in E, D, C, B, A. For each possible ending item, consider the itemsets whose second-to-last item is one of the items preceding it in the ordering. E.g., for E, consider all itemsets whose second-to-last item is D, C, B, or A; this way we get all the itemsets ending in DE, CE, BE, AE. Proceed recursively this way, and do this for all items.
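
The following small illustration (assumed code, not the slides') enumerates these suffix-based classes for the lexicographic order A..E, showing that they partition all 2^5 - 1 nonempty itemsets:

```python
from itertools import combinations

# Every nonempty itemset over A..E falls into exactly one suffix class,
# determined by its last item, so the classes partition the search space:
# 16 + 8 + 4 + 2 + 1 = 31 = 2**5 - 1 itemsets.
order = ["A", "B", "C", "D", "E"]
for pos in reversed(range(len(order))):
    suffix, preceding = order[pos], order[:pos]
    cls = [c + (suffix,)
           for r in range(len(preceding) + 1)
           for c in combinations(preceding, r)]
    print(suffix, len(cls))   # E 16, D 8, C 4, B 2, A 1
```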

Frequent itemsets

All itemsets, arranged by suffix class:
E D C B A
DE CE BE AE CD BD AD BC AC AB
CDE BDE ADE BCE ACE ABE BCD ACD ABD ABC
ACDE BCDE ABDE ABCE ABCD
ABCDE

Frequent Itemsets

[figure: the same lattice, asking "Frequent?" of the itemsets at each level] We can generate all itemsets this way, but we expect the FP-tree to contain a lot fewer.

Using the FP-tree to find frequent itemsets

[figure: the transaction database, header table, and full FP-tree as above] Bottom-up traversal of the tree: first itemsets ending in E, then D, etc., each time a suffix-based class.

Finding Frequent Itemsets

Subproblem: find frequent itemsets ending in E [figure: the FP-tree and header table, with the E pointer chain highlighted]. We will then see how to compute the support for the possible itemsets.

Finding Frequent Itemsets

The same subproblems for itemsets ending in D, in C, in B, and in A [figure: the FP-tree with the corresponding pointer chains highlighted in turn].

Algorithm

For each suffix X:
Phase 1: Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers.
Phase 2: If X is frequent, construct the conditional FP-tree for X in the following steps:
1. Recompute support
2. Prune infrequent items
3. Prune leaves and recurse
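
For reference, here is a compact FP-Growth sketch in the standard conditional-pattern-base formulation. It is organized slightly differently from the slides' in-place recompute/truncate/prune phases (each recursion rebuilds a conditional tree from the prefix paths), but it solves the same suffix-based subproblems; on the example database it reproduces the results derived below, e.g. support 2 for {D,E}:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_tree(weighted, minsup):
    """weighted: list of (items, count) pairs. Returns (root, header)."""
    counts = defaultdict(int)
    for items, w in weighted:
        for i in items:
            counts[i] += w
    keep = {i for i, c in counts.items() if c >= minsup}
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted:
        path = sorted((i for i in items if i in keep),
                      key=lambda i: (-counts[i], i))   # most frequent first
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)             # node-link chain
            child.count += w
            node = child
    return root, header

def fpgrowth(weighted, minsup, suffix=frozenset()):
    """All frequent itemsets (with supports) extending `suffix`."""
    _, header = build_tree(weighted, minsup)
    result = {}
    for item in sorted(header,                         # bottom-up traversal
                       key=lambda i: sum(n.count for n in header[i])):
        support = sum(n.count for n in header[item])   # follow the pointers
        itemset = suffix | {item}
        result[itemset] = support
        # Phase 1: the prefix paths of `item` form its conditional pattern base.
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
        # Phase 2: recurse on the conditional FP-tree built from the base.
        result.update(fpgrowth(base, minsup, itemset))
    return result

db = [({"A","B"}, 1), ({"B","C","D"}, 1), ({"A","C","D","E"}, 1),
      ({"A","D","E"}, 1), ({"A","B","C"}, 1), ({"A","B","C","D"}, 1),
      ({"B","C"}, 1), ({"A","B","C"}, 1), ({"A","B","D"}, 1),
      ({"B","C","E"}, 1)]
frequent = fpgrowth(db, 2)
print(frequent[frozenset({"D", "E"})])        # 2, matching the slides
print(frequent[frozenset({"A", "D", "E"})])   # 2
```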

Phase 1: construct the prefix tree

Find all prefix paths that contain E [figure: the FP-tree restricted to the three paths that reach an E node]. Prefix paths for E: {A,C,D,E}, {A,D,E}, {B,C,E}

Compute Support for E (minsup = 2)

How? Follow the pointers while summing up the counts: 1+1+1 = 3 > 2, so E is frequent. Since {E} is frequent, we can now consider the suffixes DE, CE, BE, AE.

E is frequent, so we proceed with Phase 2.

Phase 2: Convert the prefix tree of E into a conditional FP-tree. Two changes: (1) recompute support, (2) prune infrequent items.

Recompute Support

The support counts of some of the nodes include transactions that do not end in E. For example, along the path B → C → E, the count C:3 also includes the transaction {B,C}, which does not contain E. The support of any node is equal to the sum of the supports of the leaves with label E in its subtree.

The support of any node is equal to the sum of the supports of the leaves with label E in its subtree [figure: the E-prefix tree before recomputation].

[figure: the supports are recomputed bottom-up on the E-prefix tree: B:3 becomes B:1, its C:3 becomes C:1, and A:7 becomes A:2]

Truncate

Delete the nodes of E [figure: the three E:1 leaves are removed from the tree].

Prune infrequent

In the conditional FP-tree, some nodes may have support less than minsup; e.g., B:1 needs to be pruned. This means that B appears together with E fewer than minsup times.

The conditional FP-tree for E [figure]. Repeat the algorithm for {D,E}, {C,E}, {A,E}.

Phase 1: Find all prefix paths that contain D (for the suffix DE) in the conditional FP-tree [figure].

Compute the support of {D,E} by following the pointers in the tree: 1+1 = 2 ≥ minsup = 2, so {D,E} is frequent.

Phase 2: Construct the conditional FP-tree: 1. Recompute support 2. Prune nodes

Recompute support and prune nodes [figure: on the D-prefix tree within E's conditional tree, nodes whose support falls below minsup are removed].

The final conditional FP-tree for {D,E} is the single node A:2. The support of A is ≥ minsup, so {A,D,E} is frequent. Since the tree has a single node, we return to the next subproblem.

Back in the conditional FP-tree for E [figure], we repeat the algorithm; next we consider {C,E}.

Phase 1: Find all prefix paths that contain C (for the suffix CE) in the conditional FP-tree [figure].

Compute the support of {C,E} by following the pointers in the tree: 1+1 = 2 ≥ minsup = 2, so {C,E} is frequent.

Phase 2: Construct the conditional FP-tree: 1. Recompute support 2. Prune nodes

Recompute support: A drops to A:1 [figure]. Prune nodes: A:1 is below minsup, so it is pruned.

After pruning, the tree is empty; return to the previous subproblem.

Back in the conditional FP-tree for E [figure], we repeat the algorithm; finally we consider {A,E}.

Phase 1: Find all prefix paths that contain A (for the suffix AE) in the conditional FP-tree [figure].

Compute the support of {A,E} by following the pointers in the tree: 2 ≥ minsup, so {A,E} is frequent. There is no conditional FP-tree for {A,E}, since A has no prefix.

So for E we have the following frequent itemsets: {E}, {D,E}, {A,D,E}, {C,E}, {A,E}. We proceed with D.

Ending in D

[figure: the full FP-tree and header table, with the D pointer chain highlighted]

Phase 1: construct the prefix tree. Find all prefix paths that contain D. Support = 5 > minsup, so D is frequent.

Phase 2: Convert the prefix tree into a conditional FP-tree [figure: the D-prefix tree].

Recompute support [figure: the supports on the D-prefix tree are recomputed bottom-up, step by step].

Prune nodes [figure: the conditional FP-tree for D after recomputation and pruning].

Construct the conditional FP-trees for {C,D}, {B,D}, {A,D}, and so on.

Observations

At each recursive step we solve a subproblem: construct the prefix tree, compute the new support, prune nodes. Subproblems are disjoint, so we never consider the same itemset twice. Support computation is efficient: it happens together with the computation of the frequent itemsets.

Observations

The efficiency of the algorithm depends on the compaction factor of the dataset. If the tree is bushy, the algorithm does not work well, because the number of subproblems that need to be solved increases greatly.