Rough Set Approach for Categorical Data Clustering


International Journal of Database Theory and Application
Vol. 3, No. 1, March, 2010

Tutut Herawan*, Rozaida Ghazali, Iwan Tri Riyadi Yanto, and Mustafa Mat Deris

Department of Mathematics Education, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
Department of Mathematics, Universitas Ahmad Dahlan, Yogyakarta, Indonesia

tutut8@ud.c.d* (corresponding author), rozaida@uthm.edu.my, wn5@gml.com, mmustafa@uthm.edu.my

Abstract

Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we focus our discussion on rough set theory for categorical data clustering. We propose MADE (Maximal Attributes DEpendency), an alternative technique for categorical data clustering using rough set theory that takes into account the maximum degree of attribute dependency in categorical-valued information systems. Experimental results on two benchmark UCI datasets show that the MADE technique outperforms the baseline categorical data clustering technique with respect to computational complexity and cluster purity.

Keywords: Clustering; Categorical data; Information system; Rough set theory; Attributes dependencies.

1. Introduction

Clustering a set of objects into homogeneous classes is a fundamental operation in data mining. The operation is required in a number of data analysis tasks, such as unsupervised classification and data summation, as well as in the segmentation of large heterogeneous datasets into smaller homogeneous subsets that can be easily managed, modeled separately and analyzed. Recently, much attention has been paid to categorical data clustering [1,2], where data objects are made up of non-numerical attributes. For categorical data clustering, several new trends have emerged in techniques for handling uncertainty in the clustering process. One of the popular approaches for handling uncertainty is based on rough set theory [3]. The main idea of rough clustering is that the clustering dataset is mapped to a decision table. This can be done by introducing a decision attribute; consequently, a divide-and-conquer method can be used to partition (cluster) the objects.
The first attempt at a rough set-based technique for selecting a clustering attribute was proposed by Mazlack et al. [4]. They proposed two techniques, i.e., Bi-Clustering and TR, which are based on the bi-valued attribute and the maximum total roughness in each attribute, respectively. One of the most successful pioneering rough clustering techniques is Minimum-Minimum Roughness (MMR), proposed by Parmar [5]. The technique is based on the lower and upper approximations and the quality of approximation of a set []. However, since the application of rough set theory to categorical data clustering is relatively new, the focus of MMR is still on the evaluation of its performance. Computational complexity and cluster purity are still outstanding issues, since all attributes are considered for selection and objects in different classes appear in one cluster, respectively. In this paper, we propose MADE (Maximal Attributes DEpendency), an alternative technique for categorical data clustering. The technique differs from the baseline method in that the rough attribute dependencies in categorical-valued information systems are used to select the clustering attribute based on the maximum degree. Further, we use a divide-and-conquer method to partition (cluster) the objects. We have succeeded in showing that the proposed technique is able to achieve lower computational complexity with higher purity as compared to MMR.

The rest of this paper is organized as follows. Section 2 describes rough set theory. Section 3 describes the analysis and comparison of Mazlack's TR and MMR techniques. Section 4 describes the Maximum Attributes Dependency (MADE) technique. Comparison tests of MADE with the MMR technique based on the Soybean and Zoo datasets are described in section 5. Finally, the conclusion of this work is given in section 6.

An early version of this paper appeared in the Proceeding of International Conference, DTA 2009, held as Part of the Future Generation Information Technology Conference, FGIT 2009, Jeju Island, Korea, December 10-12, 2009, CCIS 64, Springer-Verlag, pp. 179-186, 2009.

2. Rough Set Theory

The syntax of information systems is very similar to relations in relational databases. Entities in relational databases are also represented by tuples of attribute values. An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is an information function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$, called the information (knowledge) function. An information system is also called a knowledge representation system or an attribute-valued system and can be intuitively expressed in terms of an information table (see Table 1).

Table 1. An information system

U          a_1              a_2              ...   a_k              ...   a_|A|
u_1        f(u_1, a_1)      f(u_1, a_2)      ...   f(u_1, a_k)      ...   f(u_1, a_|A|)
u_2        f(u_2, a_1)      f(u_2, a_2)      ...   f(u_2, a_k)      ...   f(u_2, a_|A|)
u_3        f(u_3, a_1)      f(u_3, a_2)      ...   f(u_3, a_k)      ...   f(u_3, a_|A|)
...        ...              ...              ...   ...              ...   ...
u_|U|      f(u_|U|, a_1)    f(u_|U|, a_2)    ...   f(u_|U|, a_k)    ...   f(u_|U|, a_|A|)

The time complexity for computing an information system $S = (U, A, V, f)$ is $|U| \times |A|$, since there are $|U| \times |A|$ values of $f(u_i, a_j)$ to be computed, where $i = 1, 2, 3, \ldots, |U|$ and $j = 1, 2, 3, \ldots, |A|$. Note that $t$ induces a set of maps $t = f(u, a) : U \times A \to V$. Each map is a tuple $t_i = (f(u_i, a_1), f(u_i, a_2), f(u_i, a_3), \ldots, f(u_i, a_{|A|}))$, where $i = 1, 2, 3, \ldots, |U|$. Note that the tuple $t$ is not necessarily associated with an entity uniquely (see Table 7). In an information table, two distinct entities could have the same tuple representation (duplicated or redundant tuple), which is not permissible in relational databases. Thus, the concept of information systems is a generalization of the concept of relational databases.

Definition 1. Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

Definition 2. (See [].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are defined by

$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} \quad \text{and} \quad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$

It is easily seen that the upper approximation of a subset $X \subseteq U$ can be expressed using set complement and lower approximation by

$$\overline{B}(X) = U \setminus \underline{B}(\neg X),$$

where $\neg X$ denotes the complement of $X$ relative to $U$.

The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted $\alpha_B(X)$, is measured by

$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|}, \tag{1}$$

where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \le \alpha_B(X) \le 1$. If $X$ is a union of some equivalence classes, then $\alpha_B(X) = 1$ and the set $X$ is crisp with respect to $B$; otherwise, if $\alpha_B(X) < 1$, $X$ is rough with respect to $B$.

The accuracy of roughness in equation (1) can also be interpreted using the well-known Marczewski-Steinhaus (MZ) metric [7]. By applying the Marczewski-Steinhaus metric to the lower and upper approximations of a subset $X \subseteq U$ in information system $S$, we have

$$D(\underline{B}(X), \overline{B}(X)) = 1 - \frac{|\underline{B}(X) \cap \overline{B}(X)|}{|\underline{B}(X) \cup \overline{B}(X)|} = 1 - \frac{|\underline{B}(X)|}{|\overline{B}(X)|} = 1 - \alpha_B(X). \tag{2}$$

The notion of the dependency of attributes in information systems is given in the following definition.

Definition 3. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. Attribute $D$ is said to depend totally on attribute $C$, denoted $C \Rightarrow D$, if all values of attributes $D$ are uniquely determined by values of attributes $C$.
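Definitions 1 and 2 and equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the toy table are assumptions made for the example and do not come from the paper.

```python
from collections import defaultdict


def partition(table, attrs):
    """U/B: equivalence classes of IND(B) for the attribute list `attrs`."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower_approx(table, attrs, X):
    """Union of the equivalence classes fully contained in X."""
    return {x for c in partition(table, attrs) if c <= X for x in c}


def upper_approx(table, attrs, X):
    """Union of the equivalence classes that intersect X."""
    return {x for c in partition(table, attrs) if c & X for x in c}


def accuracy(table, attrs, X):
    """Accuracy of approximation, equation (1); defined as 1 for the empty set."""
    if not X:
        return 1.0
    return len(lower_approx(table, attrs, X)) / len(upper_approx(table, attrs, X))


def mz_distance(P, Q):
    """Marczewski-Steinhaus metric between two finite sets."""
    union = P | Q
    return 0.0 if not union else 1 - len(P & Q) / len(union)


# Hypothetical toy table (not from the paper): six objects, two attributes.
TOY = {
    1: {"color": "red", "size": "big"},
    2: {"color": "red", "size": "big"},
    3: {"color": "blue", "size": "big"},
    4: {"color": "blue", "size": "small"},
    5: {"color": "green", "size": "small"},
    6: {"color": "green", "size": "small"},
}

X = {1, 2, 3}
low = lower_approx(TOY, ["color"], X)
up = upper_approx(TOY, ["color"], X)
acc = accuracy(TOY, ["color"], X)
print(low, up, acc, mz_distance(low, up))
```

In this toy table the class {3, 4} straddles X, so X is rough with respect to "color", and the MZ distance between the two approximations equals one minus the accuracy, as equation (2) states.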

In other words, attribute $D$ depends totally on attribute $C$ if there exists a functional dependency between the values of $D$ and $C$. The notion of generalized attribute dependency is given in the following definition.

Definition 4. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. The degree of dependency of attribute $D$ on attributes $C$, denoted $C \Rightarrow_k D$, is defined by

$$k = \frac{\sum_{X \in U/D} |\underline{C}(X)|}{|U|}. \tag{3}$$

Obviously, $0 \le k \le 1$. Attribute $D$ is said to depend totally (in degree $k$) on the attribute $C$ if $k = 1$. Otherwise, $D$ depends partially on $C$. Thus, attribute $D$ depends totally (partially) on attribute $C$ if all (some) elements of the universe $U$ can be uniquely classified to equivalence classes of the partition $U/D$, employing $C$.

In the following section, we analyze and compare the Total Roughness (TR) and Min-Min Roughness (MMR) techniques for selecting a clustering attribute.

3. TR and MMR Techniques

3.1. The TR Technique

The definition of an information system follows section 2. From the definition, suppose that attribute $a_i \in A$ has $n$ different values, say $\beta_k$, $k = 1, 2, \ldots, n$. Let $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, be the subset of the objects having the $k$-th value of attribute $a_i$. The roughness in the TR technique of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$, denoted by $R_{a_j}(X \mid a_i = \beta_k)$, is defined by

$$R_{a_j}(X \mid a_i = \beta_k) = \frac{|\underline{X_{a_j}}(a_i = \beta_k)|}{|\overline{X_{a_j}}(a_i = \beta_k)|}, \quad k = 1, 2, \ldots, n. \tag{4}$$

In the TR technique, the mean roughness of attribute $a_i \in A$ with respect to attribute $a_j \in A$, where $i \neq j$, denoted $Rough_{a_j}(a_i)$, is evaluated as follows:

$$Rough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} R_{a_j}(X \mid a_i = \beta_k)}{|V(a_i)|}, \tag{5}$$

where $V(a_i)$ is the set of values of attribute $a_i \in A$.

The total roughness of attribute $a_i \in A$ with respect to the attributes $a_j \in A$, where $i \neq j$, denoted $TR(a_i)$, is obtained by the following formula

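$$TR(a_i) = \frac{\sum_{j=1,\, j \neq i}^{|A|} Rough_{a_j}(a_i)}{|A| - 1}. \tag{6}$$

As stated in Mazlack et al. [4], the attribute with the highest value of TR is the best selection of partitioning attribute.

3.2. The MMR Technique

The definition of an information system follows section 2. From the definition, suppose that attribute $a_i \in A$ has $n$ different values, say $\beta_k$, $k = 1, 2, \ldots, n$. Let $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, be the subset of the objects having the $k$-th value of attribute $a_i$. The roughness in the MMR technique of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$, denoted by $MMR_{a_j}(X \mid a_i = \beta_k)$, is defined by

$$MMR_{a_j}(X \mid a_i = \beta_k) = 1 - \frac{|\underline{X_{a_j}}(a_i = \beta_k)|}{|\overline{X_{a_j}}(a_i = \beta_k)|}, \quad k = 1, 2, \ldots, n. \tag{7}$$

It is clear that the MMR technique uses the MZ metric to measure the roughness of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$.

The mean roughness in the MMR technique is defined by

$$MMRough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} MMR_{a_j}(X \mid a_i = \beta_k)}{|V(a_i)|}. \tag{8}$$

According to Parmar et al. [5], the attribute with the least mean roughness is the best selection of partitioning attribute.

3.3. Comparison of TR and MMR techniques

Proposition 5. The value of roughness in the MMR technique is the complement (with respect to 1) of that in the TR technique.

Proof. Since the MMR technique uses the MZ metric to measure the roughness of the set $X(a_i = \beta_k)$, $k = 1, 2, \ldots, n$, with respect to $a_j$, where $i \neq j$, i.e.,

$$MMR_{a_j}(X \mid a_i = \beta_k) = 1 - \frac{|\underline{X_{a_j}}(a_i = \beta_k)|}{|\overline{X_{a_j}}(a_i = \beta_k)|},$$

then from (4) and (7) we have

$$MMR_{a_j}(X \mid a_i = \beta_k) = 1 - R_{a_j}(X \mid a_i = \beta_k). \tag{9}$$

Thus, the value of mean roughness in the MMR technique is also the complement of that in the TR technique (5), i.e.,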

$$MMRough_{a_j}(a_i) = \frac{\sum_{k=1}^{|V(a_i)|} MMR_{a_j}(X \mid a_i = \beta_k)}{|V(a_i)|} = \frac{\sum_{k=1}^{|V(a_i)|} \left(1 - R_{a_j}(X \mid a_i = \beta_k)\right)}{|V(a_i)|} = \frac{|V(a_i)|}{|V(a_i)|} - \frac{\sum_{k=1}^{|V(a_i)|} R_{a_j}(X \mid a_i = \beta_k)}{|V(a_i)|} = 1 - Rough_{a_j}(a_i), \quad \text{for } i \neq j. \tag{10}$$

The MMR technique is based on the minimum value of mean roughness in (10), without calculating the total roughness (6). This analysis and comparison shows that the TR and MMR techniques provide the same result when used to determine the clustering attribute. To illustrate that MMR and Mazlack's technique provide the same results, we consider the following example.

Example 6. We consider the dataset of the illustrative example in [5].

Table 2. An information system in [5]

U    a1      a2      a3        a4          a5       a6
1    Big     Blue    Hard      Indefinite  Plastic  Negative
2    Medium  Red     Moderate  Smooth      Wood     Neutral
3    Small   Yellow  Soft      Fuzzy       Plush    Positive
4    Medium  Blue    Moderate  Fuzzy       Plastic  Negative
5    Small   Yellow  Soft      Indefinite  Plastic  Neutral
6    Big     Green   Hard      Smooth      Wood     Positive
7    Small   Yellow  Hard      Indefinite  Metal    Positive
8    Small   Yellow  Soft      Indefinite  Plastic  Positive
9    Big     Green   Hard      Smooth      Wood     Neutral
10   Medium  Green   Moderate  Smooth      Plastic  Neutral

In Table 2, there are ten objects with six categorical-valued attributes: $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_6$. Each attribute has more than two values ($|V(a_i)| > 2$, $i = 1, 2, 3, 4, 5, 6$). Since in this case there are no bi-valued attributes, we cannot employ Mazlack's BC technique. The calculation of the TR and MMR techniques must be applied to all of the attribute values to obtain the clustering attribute. The calculation of the TR value is based on formulas (4), (5) and (6).

The TR and MMR techniques are implemented in MATLAB version 7.6.0.324 (R2008a). They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is G and the operating system is Windows XP Professional SP. The results of TR and MMR are given in Tables 3 and 4, respectively.

Table 3. The TR of all attributes of Table 2

Attribute   Mean roughness and TR
a1          .88  .7  .7
a2          .5  .7  .57  .5
a3          .7  .55  .
a4          .  .587
a5          .57  .  .7  .7
a6          .75  .

Table 4. The MMR of all attributes of Table 2

Attribute   Mean roughness with respect to each other attribute
a1          a2: 0.6111   a3: 0.5238   a4: 1        a5: 0.9048   a6: 1
a2          a1: 0.75     a3: 0.8929   a4: 1        a5: 0.9286   a6: 0.75
a3          a1: 0.5238   a2: 0.9444   a4: 1        a5: 0.9074   a6: 1
a4          a1: 1        a2: 0.6667   a3: 1        a5: 0.7639   a6: 1
a5          a1: 1        a2: 0.8819   a3: 1        a4: 1        a6: 0.95
a6          a1: 1        a2: 0.625    a3: 1        a4: 1        a5: 0.9333

Based on Figure 1, attribute $a_1$ (TR value .85) has the highest TR compared with $a_i$, $i = 2, 3, 4, 5, 6$. Thus, attribute $a_1$ is selected as the clustering attribute. Meanwhile, based on Figure 2, two attributes have the same minimum MMR ($a_1$ and $a_3$, i.e., 0.5238). However, the second-lowest value of attribute $a_1$, i.e., 0.6111, is lower than that of $a_3$, i.e., 0.9074. Therefore, attribute $a_1$ is selected as the clustering attribute.
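The complementarity stated in Proposition 5 can be checked numerically on the data of Table 2. The sketch below is illustrative only: the function names are assumptions, exact fractions are used instead of MATLAB floating point, and ties on the minimum MMR are broken by comparing the next-lowest values, as in the example above.

```python
from collections import defaultdict
from fractions import Fraction

ATTRS = ["a1", "a2", "a3", "a4", "a5", "a6"]
ROWS = [
    ("Big", "Blue", "Hard", "Indefinite", "Plastic", "Negative"),
    ("Medium", "Red", "Moderate", "Smooth", "Wood", "Neutral"),
    ("Small", "Yellow", "Soft", "Fuzzy", "Plush", "Positive"),
    ("Medium", "Blue", "Moderate", "Fuzzy", "Plastic", "Negative"),
    ("Small", "Yellow", "Soft", "Indefinite", "Plastic", "Neutral"),
    ("Big", "Green", "Hard", "Smooth", "Wood", "Positive"),
    ("Small", "Yellow", "Hard", "Indefinite", "Metal", "Positive"),
    ("Small", "Yellow", "Soft", "Indefinite", "Plastic", "Positive"),
    ("Big", "Green", "Hard", "Smooth", "Wood", "Neutral"),
    ("Medium", "Green", "Moderate", "Smooth", "Plastic", "Neutral"),
]
TABLE = {i: dict(zip(ATTRS, row)) for i, row in enumerate(ROWS, start=1)}


def classes(attr):
    """Equivalence classes of U induced by a single attribute."""
    groups = defaultdict(set)
    for obj, row in TABLE.items():
        groups[row[attr]].add(obj)
    return list(groups.values())


def approximations(X, aj):
    """Lower and upper approximation of X with respect to attribute aj."""
    blocks = classes(aj)
    lower = set().union(*(b for b in blocks if b <= X))
    upper = set().union(*(b for b in blocks if b & X))
    return lower, upper


def mean_rough(ai, aj):
    """Equation (5): mean of |lower| / |upper| over the values of ai."""
    vals = []
    for X in classes(ai):
        lower, upper = approximations(X, aj)
        vals.append(Fraction(len(lower), len(upper)))
    return sum(vals) / len(vals)


def mean_mmr(ai, aj):
    """Equation (8), computing the MZ metric 1 - |L & U| / |L | U| directly."""
    vals = []
    for X in classes(ai):
        lower, upper = approximations(X, aj)
        vals.append(1 - Fraction(len(lower & upper), len(lower | upper)))
    return sum(vals) / len(vals)


def total_roughness(ai):
    """Equation (6): average of the mean roughness over all other attributes."""
    others = [aj for aj in ATTRS if aj != ai]
    return sum(mean_rough(ai, aj) for aj in others) / len(others)


def mmr_key(ai):
    # Ascending list: ties on the minimum are broken by the next value.
    return sorted(mean_mmr(ai, aj) for aj in ATTRS if aj != ai)


tr_choice = max(ATTRS, key=total_roughness)
mmr_choice = min(ATTRS, key=mmr_key)
print(tr_choice, mmr_choice)
```

Because every mean MMR value is one minus the corresponding TR mean roughness, ranking by the largest roughness and ranking by the smallest MMR lead to the same attribute on this dataset.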

Figure 1. The TR value of all attributes of Table 2

Figure 2. The MMR value of all attributes of Table 2

Table 5. The computation and response time of TR and MMR

Technique   Computation / Response time (sec)
TR          7.7
MMR         7.7

Figure 3. The computation of TR and MMR

Figure 4. The response time of TR and MMR

Based on the results of selecting a clustering attribute in Figures 1, 2, 3 and 4, it is easily seen that the decision, computational complexity and processing time of the TR and MMR techniques are exactly the same. Thus, based on Proposition 5, the statement in the comparison example of [5] that MMR is an extension of the approach proposed by Mazlack et al. is considered incorrect and unreasonable.

On the other hand, to achieve lower computational complexity in selecting a partitioning attribute using MMR, Parmar et al. suggested that the measurement of the roughness be based on the relationship between an attribute $a_i \in A$ and the set defined as $A - \{a_i\}$ instead of calculating the maximum with respect to all $a_j$ where $a_j \neq a_i$ [5]. As we have observed, this technique can only be applied to very special datasets. To illustrate this problem, we consider the following example.

Example 7. In Table 2, if we measure the roughness of attribute $a_i \in A$ with respect to the set of attributes $A - \{a_i\}$, then we get the values of the modified MMR given in Table 6.

Table 6. The modified MMR of all attributes of dataset in [5]

Attribute w.r.t.         Mean roughness            MMR
a1 w.r.t. A - {a1}       Rough_{A-{a1}}(a1)
a2 w.r.t. A - {a2}       Rough_{A-{a2}}(a2)
a3 w.r.t. A - {a3}       Rough_{A-{a3}}(a3)
a4 w.r.t. A - {a4}       Rough_{A-{a4}}(a4)
a5 w.r.t. A - {a5}       Rough_{A-{a5}}(a5)
a6 w.r.t. A - {a6}       Rough_{A-{a6}}(a6)

Based on Table 6, we are not able to select a clustering attribute. Thus, the suggested technique leads to a problem: after calculating the mean roughness of attribute $a_i \in A$ with respect to the set of attributes $A - \{a_i\}$, the value of MMR usually cannot preserve the original decision. This modified technique is therefore not applicable to all types of dataset.

To overcome the problem of the computational complexity of MMR, in section 4 we introduce the Maximum Attributes Dependencies (MADE) technique to deal with the problem of categorical data clustering.

4. Maximum Attributes DEpendencies (MADE) Technique

4.1. MADE technique

The MADE technique for selecting a partitioning attribute is based on the maximum degree of dependency of attributes. The justification that a higher degree of dependency of attributes implies more accuracy in selecting a partitioning attribute is stated in Proposition 8.

Proposition 8. Let $S = (U, A, V, f)$ be an information system and let $D$ and $C$ be any subsets of $A$. If $D$ depends totally on $C$, then $\alpha_D(X) \le \alpha_C(X)$, for every $X \subseteq U$.

Proof. Let $D$ and $C$ be any subsets of $A$ in information system $S = (U, A, V, f)$. From the hypothesis, we have $IND(C) \subseteq IND(D)$. Furthermore, the partition $U/C$ is finer than $U/D$; thus, it is clear that any equivalence class induced by $IND(D)$ is a union of some equivalence classes induced by $IND(C)$. Therefore, for every $x \in X \subseteq U$, we have $[x]_C \subseteq [x]_D$. Hence, for every $X \subseteq U$, we have

$$\underline{D}(X) \subseteq \underline{C}(X) \subseteq X \subseteq \overline{C}(X) \subseteq \overline{D}(X).$$

Consequently,

$$\alpha_D(X) = \frac{|\underline{D}(X)|}{|\overline{D}(X)|} \le \frac{|\underline{C}(X)|}{|\overline{C}(X)|} = \alpha_C(X).$$

4.2. Complexity

Suppose that in an information system $S = (U, A, V, f)$ there are $|A|$ attributes. For MADE, the number of computations of the dependency degree of attribute $a_i$ on attribute $a_j$, where $i \neq j$, is $|A| \times (|A| - 1)$. Thus, the computational complexity of the MADE technique is polynomial, $O(|A| \times (|A| - 1))$. The MADE algorithm for selecting a clustering attribute is given in Figure 5.

Algorithm: MADE
Input: Dataset without clustering attribute
Output: Clustering attribute
Begin
Step 1. Compute the equivalence classes using the indiscernibility relation on each attribute.
Step 2. Determine the dependency degree of attribute $a_i$ with respect to all $a_j$, where $i \neq j$.
Step 3. Select the maximum dependency degree of each attribute.
Step 4. Select the clustering attribute based on the maximum degree of dependency of attributes.
End

Figure 5. The MADE algorithm

Following the same procedure used for selecting the clustering attribute in MMR, when using the MADE technique it is recommended to look at the next-lower dependency degree (the second maximum, and so on) of the attributes that are tied, until the tie is broken.

4.3. Example

The dataset is an animal dataset from Hu [8]. In Table 7, there are nine animals with nine categorical-valued attributes: Hair, Teeth, Eye, Feather, Feet, Eat, Milk, Fly and Swim. The attributes Hair, Eye, Feather, Milk, Fly and Swim have two values. Attribute Teeth has three values, and the other attributes have four values.

a. To obtain the dependency degrees of all attributes, the first step of the technique is to obtain the equivalence classes induced by the indiscernibility relation of singleton attributes, i.e., disjoint classes of objects that contain indiscernible objects.
b. By collecting the equivalence classes, a partition of the objects can be obtained. The partitions are shown in Figure 6.
c. The dependency degree of attributes can be obtained using formula (3). For attribute Hair depending on attributes Teeth, Eye, Feather, Feet, Eat, Milk, Fly and Swim, the degrees are shown in Figure 7.

Table 7. Animal world dataset from [8]

No.  Animal     Hair  Teeth    Eye      Feather  Feet  Eat    Milk  Fly  Swim
1    Tiger      Y     Pointed  Forward  N        Claw  Meat   Y     N    Y
2    Cheetah    Y     Pointed  Forward  N        Claw  Meat   Y     N    Y
3    Giraffe    Y     Blunt    Side     N        Hoof  Grass  Y     N    N
4    Zebra      Y     Blunt    Side     N        Hoof  Grass  Y     N    N
5    Ostrich    N     N        Side     Y        Claw  Grain  N     N    N
6    Penguin    N     N        Side     Y        Web   Fish   N     N    Y
7    Albatross  N     N        Side     Y        Claw  Grain  N     Y    Y
8    Eagle      N     N        Forward  Y        Claw  Meat   N     Y    N
9    Viper      N     Pointed  Forward  N        N     Meat   N     N    N

a. X(Hair = yes) = {1,2,3,4}, X(Hair = no) = {5,6,7,8,9}, U/Hair = {{1,2,3,4}, {5,6,7,8,9}}.
b. X(Teeth = pointed) = {1,2,9}, X(Teeth = blunt) = {3,4}, X(Teeth = no) = {5,6,7,8}, U/Teeth = {{1,2,9}, {3,4}, {5,6,7,8}}.
c. X(Eye = forward) = {1,2,8,9}, X(Eye = side) = {3,4,5,6,7}, U/Eye = {{1,2,8,9}, {3,4,5,6,7}}.
d. X(Feather = no) = {1,2,3,4,9}, X(Feather = yes) = {5,6,7,8}, U/Feather = {{1,2,3,4,9}, {5,6,7,8}}.
e. X(Feet = claw) = {1,2,5,7,8}, X(Feet = hoof) = {3,4}, X(Feet = web) = {6}, X(Feet = no) = {9}, U/Feet = {{1,2,5,7,8}, {3,4}, {6}, {9}}.
f. X(Eat = meat) = {1,2,8,9}, X(Eat = grass) = {3,4}, X(Eat = grain) = {5,7}, X(Eat = fish) = {6}, U/Eat = {{1,2,8,9}, {3,4}, {5,7}, {6}}.
g. X(Milk = yes) = {1,2,3,4}, X(Milk = no) = {5,6,7,8,9}, U/Milk = {{1,2,3,4}, {5,6,7,8,9}}.
h. X(Fly = no) = {1,2,3,4,5,6,9}, X(Fly = yes) = {7,8}, U/Fly = {{1,2,3,4,5,6,9}, {7,8}}.
i. X(Swim = yes) = {1,2,6,7}, X(Swim = no) = {3,4,5,8,9}, U/Swim = {{1,2,6,7}, {3,4,5,8,9}}.

Figure 6. The partitions using singleton attributes
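The partitions of Figure 6, the degree of dependency of equation (3) and the selection steps of the MADE algorithm can be sketched in Python on the data of Table 7. The code is a minimal illustration under stated assumptions: the function names are hypothetical, exact fractions replace truncated decimals, and ties on the maximum degree are broken by comparing the next-lower degrees, with any remaining tie resolved by attribute order.

```python
from collections import defaultdict
from fractions import Fraction

ATTRS = ["Hair", "Teeth", "Eye", "Feather", "Feet", "Eat", "Milk", "Fly", "Swim"]
ANIMALS = {
    "Tiger": ("Y", "Pointed", "Forward", "N", "Claw", "Meat", "Y", "N", "Y"),
    "Cheetah": ("Y", "Pointed", "Forward", "N", "Claw", "Meat", "Y", "N", "Y"),
    "Giraffe": ("Y", "Blunt", "Side", "N", "Hoof", "Grass", "Y", "N", "N"),
    "Zebra": ("Y", "Blunt", "Side", "N", "Hoof", "Grass", "Y", "N", "N"),
    "Ostrich": ("N", "N", "Side", "Y", "Claw", "Grain", "N", "N", "N"),
    "Penguin": ("N", "N", "Side", "Y", "Web", "Fish", "N", "N", "Y"),
    "Albatross": ("N", "N", "Side", "Y", "Claw", "Grain", "N", "Y", "Y"),
    "Eagle": ("N", "N", "Forward", "Y", "Claw", "Meat", "N", "Y", "N"),
    "Viper": ("N", "Pointed", "Forward", "N", "N", "Meat", "N", "N", "N"),
}


def classes(attr):
    """Step 1: equivalence classes induced by a single attribute."""
    idx = ATTRS.index(attr)
    groups = defaultdict(set)
    for name, row in ANIMALS.items():
        groups[row[idx]].add(name)
    return list(groups.values())


def dependency(c, d):
    """Step 2: degree k with which attribute d depends on attribute c.

    A class of U/c counts toward the lower approximations exactly when it
    lies inside one class of U/d, so the sum in equation (3) is the total
    size of those classes.
    """
    d_classes = classes(d)
    covered = sum(len(b) for b in classes(c) if any(b <= X for X in d_classes))
    return Fraction(covered, len(ANIMALS))


def made_key(d):
    # Steps 3 and 4: degrees in descending order, so the first maximum
    # decides and ties fall through to the second maximum, and so on.
    return sorted((dependency(c, d) for c in ATTRS if c != d), reverse=True)


clustering_attribute = max(ATTRS, key=made_key)
print(clustering_attribute, [float(k) for k in made_key(clustering_attribute)])
```

Hair and Milk induce the same partition, so they tie on every degree; `max` returns the first of the two in attribute order.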

Teeth $\Rightarrow_k$ Hair, where $k = \frac{|\{3,4\}| + |\{5,6,7,8\}|}{|U|} = \frac{6}{9} = 0.666$.
Eye $\Rightarrow_k$ Hair, where $k = \frac{0}{9} = 0$.
Feather $\Rightarrow_k$ Hair, where $k = \frac{|\{5,6,7,8\}|}{|U|} = \frac{4}{9} = 0.444$.
Feet $\Rightarrow_k$ Hair, where $k = \frac{|\{3,4\}| + |\{6\}| + |\{9\}|}{|U|} = \frac{4}{9} = 0.444$.
Eat $\Rightarrow_k$ Hair, where $k = \frac{|\{3,4\}| + |\{5,7\}| + |\{6\}|}{|U|} = \frac{5}{9} = 0.555$.
Milk $\Rightarrow_k$ Hair, where $k = \frac{|\{1,2,3,4\}| + |\{5,6,7,8,9\}|}{|U|} = \frac{9}{9} = 1$.
Fly $\Rightarrow_k$ Hair, where $k = \frac{|\{7,8\}|}{|U|} = \frac{2}{9} = 0.222$.
Swim $\Rightarrow_k$ Hair, where $k = \frac{0}{9} = 0$.

Figure 7. The attributes dependencies

Similar calculations are performed for all the attributes. These calculations are summarized in Table 8.

Table 8. The dependencies degree of all attributes from Table 7

Attribute (depends on)   Degree of dependency
Hair      Teeth   Eye     Feather  Feet    Eat     Milk    Fly     Swim
          0.666   0       0.444    0.444   0.555   1       0.222   0
Teeth     Hair    Eye     Feather  Feet    Eat     Milk    Fly     Swim
          0       0       0.444    0.444   0.555   0       0.222   0
Eye       Hair    Teeth   Feather  Feet    Eat     Milk    Fly     Swim
          0       0.555   0        0.444   1       0       0       0
Feather   Hair    Teeth   Eye      Feet    Eat     Milk    Fly     Swim
          0.444   1       0        0.444   0.555   0.444   0.222   0
Feet      Hair    Teeth   Eye      Feather Eat     Milk    Fly     Swim
          0       0.222   0        0       0.555   0       0.222   0
Eat       Hair    Teeth   Eye      Feather Feet    Milk    Fly     Swim
          0       0.555   0.444    0       0.444   0       0       0
Milk      Hair    Teeth   Eye      Feather Feet    Eat     Fly     Swim
          1       0.666   0        0.444   0.444   0.555   0.222   0
Fly       Hair    Teeth   Eye      Feather Feet    Eat     Milk    Swim
          0.444   0.555   0        0.555   0.444   0.333   0.444   0
Swim      Hair    Teeth   Eye      Feather Feet    Eat     Milk    Fly
          0       0.222   0        0       0.444   0.333   0       0

Figure 8. The maximal attributes dependencies

With the MADE technique, the first maximum degree of dependency of attributes, i.e., 1, occurs in attributes Hair (Milk), Eye and Feather, as Figure 8 shows. The second maximum degree of dependency of attributes, i.e., 0.666, occurs in attribute Hair. Thus, based on Figure 8, attribute Hair is selected as the clustering attribute.

4.4. Objects splitting

For objects splitting, we use a divide-and-conquer method. For example, in Table 7 we can cluster (partition) the animals based on the selected decision attribute, i.e., Hair/Milk. Notice that the partition of the set of animals induced by attribute Hair/Milk is {{1,2,3,4}, {5,6,7,8,9}}. We can then split the animals using the hierarchical tree as follows.

The objects: {Tiger, Cheetah, Giraffe, Zebra, Ostrich, Penguin, Albatross, Eagle, Viper}
1st possible clusters: {Tiger, Cheetah, Giraffe, Zebra}, {Ostrich, Penguin, Albatross, Eagle, Viper}
2nd possible clusters: {Tiger, Cheetah}, {Giraffe, Zebra}, {Ostrich, Penguin, Albatross, Eagle}, {Viper}

Figure 9. The objects splitting

The technique is applied recursively to obtain further clusters. At subsequent iterations, the leaf node having more objects is selected for further splitting. The algorithm terminates when it reaches a pre-defined number of clusters. This number is subjective and is pre-decided based either on user requirements or on domain knowledge.

5. Comparison Tests

In order to test MADE and compare it with MMR, we use two datasets obtained from the benchmark UCI Machine Learning Repository: the Soybean and Zoo datasets, with 47 and 101 objects, respectively. The purity of clusters was used as a measure to test the quality of the clusters [5]. The purity of a cluster and the overall purity are defined as

$$\mathrm{Purity}(i) = \frac{\text{the number of data occurring in both the } i\text{th cluster and its corresponding class}}{\text{the number of data in the data set}},$$

$$\text{Overall Purity} = \frac{\sum_{i=1}^{\#\text{ of clusters}} \mathrm{Purity}(i)}{\#\text{ of clusters}}.$$

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 1 [5].

The MMR and MADE algorithms for the Soybean and Zoo datasets are implemented in MATLAB version 7.6.0.324 (R2008a). They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is Gigabyte and the operating system is Windows XP Professional SP.

5.1. Soybean dataset

The Soybean dataset contains 47 objects on diseases in soybeans. Each object can be classified as one of four diseases, namely Diaporthe Stem Canker (D1), Charcoal Rot (D2), Rhizoctonia Root Rot (D3), and Phytophthora Rot (D4), and is described by 35 categorical attributes [9]. The dataset comprises 17 objects for the Phytophthora Rot disease and 10 objects for each of the remaining diseases. Since there are four possible diseases, the objects are split into four clusters. The results are summarized in Table 9. All 47 objects belong to the majority class label of the cluster in which they are classified. Thus, the overall purity of the clusters is 100%.

Table 9. The purity of clusters

Cluster number   D1   D2   D3   D4   Purity
(surviving entry: 7)
Overall Purity

5.2. Zoo dataset

The Zoo dataset comprises 101 objects, where each data point represents information about an animal in terms of 18 categorical attributes [10]. Each animal data point is classified into one of seven classes. Therefore, for MADE, the splitting is set at seven clusters. Table 10 summarizes the results of running the MADE algorithm on the Zoo dataset.

Table 10. The purity of clusters

Cluster number   C1   C2   C3   C4   C5   C6   C7   Purity
(surviving entries: 5 5 8 7)
Overall Purity

All 101 objects belong to the majority class label of the cluster in which they are classified. Thus, the overall purity of the clusters is 100%.

5.3. Comparison

The comparisons of overall purity, computation and response time of MADE and MMR on the Soybean and Zoo datasets are given in Figures 10, 11 and 12, respectively. Based on Table 11, the MADE technique provides a better solution than the MMR technique on both the Soybean and Zoo datasets.

Table 11. The overall improvement of MMR by MADE

Improvement   Clusters Purity   Computation   Response Time
Soybean       7%                %             %
Zoo           %                 77%           7%

Figure 10. The comparison of overall purity
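The purity measure used in these comparison tests can be sketched as follows. This is a hedged illustration, not the authors' MATLAB code: it assumes that the purity of a cluster is the share of its own members carrying the cluster's majority class label, since that reading is the one under which a perfect clustering scores 1. The function names and the label lists are hypothetical; the `perfect` example only mirrors the Soybean class sizes (10, 10, 10 and 17).

```python
from collections import Counter


def cluster_purity(labels):
    """Share of a cluster's members that carry its majority class label."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


def overall_purity(clusters):
    """Mean of the per-cluster purities."""
    return sum(cluster_purity(c) for c in clusters) / len(clusters)


# Hypothetical cluster contents given as lists of true class labels.
perfect = [["D1"] * 10, ["D2"] * 10, ["D3"] * 10, ["D4"] * 17]
mixed = [["D1"] * 8 + ["D2"] * 2, ["D2"] * 8 + ["D1"] * 2]

print(overall_purity(perfect), overall_purity(mixed))
```

With every object in the majority class of its cluster the measure is 1, which corresponds to the 100% overall purity reported for MADE.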

Figure 11. The comparison of computation

Figure 12. The comparison of response time

6. Conclusion

Categorical data clustering has emerged as a new trend in techniques for handling uncertainty in the clustering process. In this paper, we have proposed MADE, an alternative technique for categorical data clustering using rough set theory based on attribute dependencies. We have shown that the MADE technique is a generalization of the MMR technique that achieves lower computational complexity and higher cluster purity. With this approach, we believe that MADE will be applicable in areas such as decision making and clustering very large datasets.

Acknowledgement

This work was supported by the grant of Universiti Tun Hussein Onn Malaysia.

References

[1] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2 (3) (1998) 283-304.
[2] Kim, D., Lee, K., Lee, D. Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognition Letters 25 (11) (2004) 1263-1271.
[3] Pawlak, Z. Rough sets. International Journal of Computer and Information Science 11, 1982, 341-356.
[4] Mazlack, L.J., He, A., Zhu, Y., Coppock, S. A rough set approach in choosing partitioning attributes. Proceedings of the ISCA 13th International Conference, CAINE-2000, 2000, 1-6.
[5] Parmar, D., Wu, T. and Blackhurst, J. MMR: An algorithm for clustering categorical data using rough set theory. Data and Knowledge Engineering 63, 2007, 879-893.
[6] Pawlak, Z. and Skowron, A. Rudiments of rough sets. Information Sciences, 177 (1), 2007, 3-27.
[7] Yao, Y.Y. Two views of the theory of rough sets in finite universes. Approximate Reasoning, 15 (4), 1996, 291-317.
[8] Hu, X. Knowledge discovery in databases: An attribute oriented rough set approach. PhD thesis, University of Regina, 1995.
[9] http://archive.ics.uci.edu/ml/datasets/soybean+%28small%29
[10] http://archive.ics.uci.edu/ml/datasets/zoo

Authors

Tutut Herawan
He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.

Rozaida Ghazali
She received her B.Sc. (Hons) degree in Computer Science from Universiti Sains Malaysia, and her M.Sc. degree in Computer Science from Universiti Teknologi Malaysia. She obtained her Ph.D. degree in Higher Order Neural Networks at Liverpool John Moores University, UK. She is currently a teaching staff member at the Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia (UTHM). Her research area includes neural networks, fuzzy logic, financial time series prediction and physical time series forecasting.

Iwan Tri Riyadi Yanto
He is an M.Sc. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.

Mustafa Mat Deris
He received the B.Sc. from University Putra Malaysia, the M.Sc. from University of Bradford, England and the Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 8 papers in journals and conference proceedings. He was appointed as one of the editorial board members for the International Journal of Information Technology, World Enformatika Society, and as a reviewer of a special issue of the International Journal of Parallel and Distributed Databases, Elsevier, a special issue of the International Journal of Cluster Computing, Kluwer, the IEEE conference on Cluster and Grid Computing, held in Chicago in April, and the Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences and workshops including Grid and Peer-to-Peer Computing (GPP), Autonomic Distributed Data and Storage Systems Management (ADSM), WSEAS, International Association of Science and Technology, IASTED on Database, etc.
