Dawn E. Holmes and Lakhmi C. Jain (Eds.) Data Mining: Foundations and Intelligent Paradigms
Intelligent Systems Reference Library, Volume 25

Editors-in-Chief

Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl

Prof. Lakhmi C. Jain
University of South Australia
Adelaide
Mawson Lakes Campus
South Australia 5095
Australia
E-mail: Lakhmi.jain@unisa.edu.au

Further volumes of this series can be found on our homepage: springer.com

Vol. 1. Christine L. Mumford and Lakhmi C. Jain (Eds.), Computational Intelligence: Collaboration, Fusion and Emergence, 2009. ISBN 978-3-642-01798-8
Vol. 2. Yuehui Chen and Ajith Abraham, Tree-Structure Based Hybrid Computational Intelligence, 2009. ISBN 978-3-642-04738-1
Vol. 3. Anthony Finn and Steve Scheding, Developments and Challenges for Autonomous Unmanned Vehicles, 2010. ISBN 978-3-642-10703-0
Vol. 4. Lakhmi C. Jain and Chee Peng Lim (Eds.), Handbook on Decision Making: Techniques and Applications, 2010. ISBN 978-3-642-13638-2
Vol. 5. George A. Anastassiou, Intelligent Mathematics: Computational Analysis, 2010. ISBN 978-3-642-17097-3
Vol. 6. Ludmila Dymowa, Soft Computing in Economics and Finance, 2011. ISBN 978-3-642-17718-7
Vol. 7. Gerasimos G. Rigatos, Modelling and Control for Intelligent Industrial Systems, 2011. ISBN 978-3-642-17874-0
Vol. 8. Edward H.Y. Lim, James N.K. Liu, and Raymond S.T. Lee, Knowledge Seeker - Ontology Modelling for Information Search and Management, 2011. ISBN 978-3-642-17915-0
Vol. 9. Menahem Friedman and Abraham Kandel, Calculus Light, 2011. ISBN 978-3-642-17847-4
Vol. 10. Andreas Tolk and Lakhmi C. Jain, Intelligence-Based Systems Engineering, 2011. ISBN 978-3-642-17930-3
Vol. 11. Samuli Niiranen and Andre Ribeiro (Eds.), Information Processing and Biological Systems, 2011. ISBN 978-3-642-19620-1
Vol. 12. Florin Gorunescu, Data Mining, 2011. ISBN 978-3-642-19720-8
Vol. 13. Witold Pedrycz and Shyi-Ming Chen (Eds.), Granular Computing and Intelligent Systems, 2011. ISBN 978-3-642-19819-9
Vol. 14. George A. Anastassiou and Oktay Duman, Towards Intelligent Modeling: Statistical Approximation Theory, 2011. ISBN 978-3-642-19825-0
Vol. 15. Antonino Freno and Edmondo Trentin, Hybrid Random Fields, 2011. ISBN 978-3-642-20307-7
Vol. 16. Alexiei Dingli, Knowledge Annotation: Making Implicit Knowledge Explicit, 2011. ISBN 978-3-642-20322-0
Vol. 17. Crina Grosan and Ajith Abraham, Intelligent Systems, 2011. ISBN 978-3-642-21003-7
Vol. 18. Achim Zielesny, From Curve Fitting to Machine Learning, 2011. ISBN 978-3-642-21279-6
Vol. 19. George A. Anastassiou, Intelligent Systems: Approximation by Artificial Neural Networks, 2011. ISBN 978-3-642-21430-1
Vol. 20. Lech Polkowski, Approximate Reasoning by Parts, 2011. ISBN 978-3-642-22278-8
Vol. 21. Igor Chikalov, Average Time Complexity of Decision Trees, 2011. ISBN 978-3-642-22660-1
Vol. 22. Przemysław Różewski, Emma Kusztina, Ryszard Tadeusiewicz, and Oleg Zaikin, Intelligent Open Learning Systems, 2011. ISBN 978-3-642-22666-3
Vol. 23. Dawn E. Holmes and Lakhmi C. Jain (Eds.), Data Mining: Foundations and Intelligent Paradigms, 2012. ISBN 978-3-642-23165-0
Vol. 24. Dawn E. Holmes and Lakhmi C. Jain (Eds.), Data Mining: Foundations and Intelligent Paradigms, 2012. ISBN 978-3-642-23240-4
Vol. 25. Dawn E. Holmes and Lakhmi C. Jain (Eds.), Data Mining: Foundations and Intelligent Paradigms, 2012. ISBN 978-3-642-23150-6
Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms
Volume 3: Medical, Health, Social, Biological and other Applications
Prof. Dawn E. Holmes
Department of Statistics and Applied Probability
University of California, Santa Barbara, CA 93106
USA
E-mail: holmes@pstat.ucsb.edu

Prof. Lakhmi C. Jain
Professor of Knowledge-Based Engineering
University of South Australia
Adelaide
Mawson Lakes, SA 5095
Australia
E-mail: Lakhmi.jain@unisa.edu.au

ISBN 978-3-642-23150-6
e-ISBN 978-3-642-23151-3
DOI 10.1007/978-3-642-23151-3
Intelligent Systems Reference Library ISSN 1868-4394
Library of Congress Control Number: 2011936705

© 2012 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
springer.com
Preface

There are many invaluable books available on data mining theory and applications. However, in compiling a volume titled "DATA MINING: Foundations and Intelligent Paradigms: Volume 3: Medical, Health, Social, Biological and other Applications" we wish to introduce some of the latest developments to a broad audience of both specialists and non-specialists in this field.

The term "data mining" was introduced in the 1990s to describe an emerging field based on classical statistics, artificial intelligence and machine learning. By combining techniques from these areas and developing new ones, researchers are able to analyze large datasets in innovative and productive ways. Patterns found in these datasets are subsequently analyzed with a view to acquiring new knowledge. These techniques have been applied in a broad range of medical, health, social and biological areas.

In compiling this volume we have sought to present innovative research from prestigious contributors in the field of data mining. Each chapter is self-contained and is described briefly in Chapter 1.

This book will prove valuable to theoreticians as well as application scientists and engineers in the area of data mining. Postgraduate students will also find this a useful sourcebook since it shows the direction of current research.

We have been fortunate in attracting top-class researchers as contributors and wish to offer our thanks for their support in this project. We also acknowledge the expertise and time of the reviewers. Finally, we wish to thank Springer for their support.

Dr. Dawn E. Holmes
University of California
Santa Barbara, USA

Dr. Lakhmi C. Jain
University of South Australia
Adelaide, Australia
Contents

Chapter 1
Advances in Intelligent Data Mining
Dawn E. Holmes, Jeffrey W. Tweedale, Lakhmi C. Jain
  1 Introduction
  2 Medical Influences
  3 Health Influences
  4 Social Influences
    4.1 Information Discovery
    4.2 On-Line Communities
  5 Biological Influences
    5.1 Biological Networks
    5.2 Estimations in Gene Expression
  6 Chapters Included in the Book
  7 Conclusion
  References

Chapter 2
Temporal Pattern Mining for Medical Applications
Giulia Bruno, Paolo Garza
  1 Introduction
  2 Types of Temporal Data in Medical Domain
  3 Definitions
  4 Temporal Pattern Mining Algorithms
    4.1 Temporal Pattern Mining from a Set of Sequences
    4.2 Temporal Pattern Mining from a Single Sequence
  5 Medical Applications
  6 Conclusions
  References

Chapter 3
BioKeySpotter: An Unsupervised Keyphrase Extraction Technique in the Biomedical Full-Text Collection
Min Song, Prat Tanapaisankit
  1 Introduction
  2 Backgrounds and Related Work
  3 The Proposed Approach
  4 Evaluation
    4.1 Dataset
    4.2 Comparison Algorithms
    4.3 Experimental Results
  5 Conclusion
  References

Chapter 4
Mining Health Claims Data for Assessing Patient Risk
Ian Duncan
  1 What Is Health Risk?
  2 Traditional Models for Assessing Health Risk
  3 Risk Factor-Based Risk Models
  4 Data Sources
    4.1 Enrollment Data
    4.2 Claims and Coding Systems
    4.3 Interpretation of Claims Codes
  5 Clinical Identification Algorithms
  6 Sensitivity-Specificity Trade-Off
    6.1 Constructing an Identification Algorithm
    6.2 Sources of Algorithms
  7 Construction and Use of Grouper Models
    7.1 Drug Grouper Models
    7.2 Drug-Based Risk Adjustment Models
  8 Summary and Conclusions
  References

Chapter 5
Mining Biological Networks for Similar Patterns
Ferhat Ay, Günhan Gülsoy, Tamer Kahveci
  1 Introduction
  2 Metabolic Network Alignment with One-to-One Mappings
    2.1 Model
    2.2 Problem Formulation
    2.3 Pairwise Similarity of Entities
    2.4 Similarity of Topologies
    2.5 Combining Homology and Topology
    2.6 Extracting the Mapping of Entities
    2.7 Similarity Score of Networks
    2.8 Complexity Analysis
  3 Metabolic Network Alignment with One-to-Many Mappings
    3.1 Homological Similarity of Subnetworks
    3.2 Topological Similarity of Subnetworks
    3.3 Combining Homology and Topology
    3.4 Extracting Subnetwork Mappings
  4 Significance of Network Alignment
    4.1 Identification of Alternative Entities
    4.2 Identification of Alternative Subnetworks
    4.3 One-to-Many Mappings within and across Major Clades
  5 Summary
  6 Further Reading
  References

Chapter 6
Estimation of Distribution Algorithms in Gene Expression Data Analysis
Elham Salehi, Robin Gras
  1 Introduction
  2 Estimation of Distribution Algorithms
    2.1 Model Building in EDA
    2.2 Notation
    2.3 Models with Independent Variables
    2.4 Models with Pairwise Dependencies
    2.5 Models with Multiple Dependencies
  3 Application of EDA in Gene Expression Data Analysis
    3.1 State-of-Art of the Application of EDAs in Gene Expression Data Analysis
  4 Conclusion
  References

Chapter 7
Gene Function Prediction and Functional Network: The Role of Gene Ontology
Erliang Zeng, Chris Ding, Kalai Mathee, Lisa Schneper, Giri Narasimhan
  1 Introduction
    1.1 Gene Function Prediction
    1.2 Functional Gene Network Generation
    1.3 Related Work and Limitations
  2 GO-Based Gene Similarity Measures
  3 Estimating Support for PPI Data with Applications to Function Prediction
    3.1 Mixture Model of PPI Data
    3.2 Data Sets
    3.3 Function Prediction
    3.4 Evaluating the Function Prediction
    3.5 Experimental Results
    3.6 Discussion
  4 A Functional Network of Yeast Genes Using Gene Ontology Information
    4.1 Data Sets
    4.2 Constructing a Functional Gene Network
    4.3 Using Semantic Similarity (SS)
    4.4 Evaluating the Functional Gene Network
    4.5 Experimental Results
    4.6 Discussion
  5 Conclusions
  References

Chapter 8
Mining Multiple Biological Data for Reconstructing Signal Transduction Networks
Thanh-Phuong Nguyen, Tu-Bao Ho
  1 Introduction
  2 Background
    2.1 Signal Transduction Network
    2.2 Protein-Protein Interaction
  3 Constructing Signal Transduction Networks Using Multiple Data
    3.1 Related Work
    3.2 Materials and Methods
    3.3 Clustering and Protein-Protein Interaction Networks
    3.4 Evaluation
  4 Some Results of Yeast STN Reconstruction
  5 Outlook
  6 Summary
  References

Chapter 9
Mining Epistatic Interactions from High-Dimensional Data Sets
Xia Jiang, Shyam Visweswaran, Richard E. Neapolitan
  1 Introduction
  2 Background
    2.1 Epistasis
    2.2 Detecting Epistasis
    2.3 High-Dimensional Data Sets
    2.4 Barriers to Learning Epistasis
    2.5 MDR
    2.6 Bayesian Networks
  3 Discovering Epistasis Using Bayesian Networks
    3.1 A Bayesian Network Model for Epistatic Interactions
    3.2 The BNMBL Score
    3.3 Experiments
  4 Efficient Search
    4.1 Experiments
  5 Discussion, Limitations, and Future Research
  References

Chapter 10
Knowledge Discovery in Adversarial Settings
D.B. Skillicorn
  1 Introduction
  2 Characteristics of Adversarial Modelling
  3 Technical Implications
  4 Conclusion
  References

Chapter 11
Analysis and Mining of Online Communities of Internet Forum Users
Mikołaj Morzy
  1 Introduction
    1.1 What Is Web 2.0?
    1.2 New Forms of Participation: Push or Pull?
    1.3 Internet Forums as New Forms of Conversation
  2 Social-Driven Data
    2.1 What Are Social-Driven Data?
    2.2 Data from Internet Forums
  3 Internet Forums
    3.1 Crawling Internet Forums
    3.2 Statistical Analysis
    3.3 Index Analysis
    3.4 Network Analysis
  4 Related Work
  5 Conclusions
  References

Chapter 12
Data Mining for Information Literacy
Bettina Berendt
  1 Introduction
  2 Background
    2.1 Information Literacy
    2.2 Critical Literacy
    2.3 Educational Data Mining
  3 Towards Critical Data Literacy: A Frame for Analysis and Design
    3.1 A Frame of Analysis: Technique and Object
    3.2 On the Chances of Achieving Critical Data Literacy: Principles of Successful Learning as Description Criteria
  4 Examples: Tools and Other Approaches Supporting Data Mining for Information Literacy
    4.1 Analysing Data: Do-It-Yourself Statistics Visualization
    4.2 Analysing Language: Viewpoints and Bias in Media Reporting
    4.3 Analysing Data Mining: Building, Comparing and Re-using Own and Others' Conceptualizations of a Domain
    4.4 Analysing Actions: Feedback and Awareness Tools
    4.5 Analysing Actions: Role Reversals in Data Collection and Analysis
  5 Summary and Conclusions
  References

Chapter 13
Rule Extraction from Neural Networks and Support Vector Machines for Credit Scoring
Rudy Setiono, Bart Baesens, David Martens
  1 Introduction
  2 Re-RX: Recursive Rule Extraction from Neural Networks
    2.1 Multilayer Perceptron
    2.2 Finding Optimal Network Structure by Pruning
    2.3 Recursive Rule Extraction
    2.4 Applying Re-RX for Credit Scoring
  3 ALBA: Rule Extraction from Support Vector Machines
    3.1 Support Vector Machine
    3.2 ALBA: Active Learning Based Approach to SVM Rule Extraction
    3.3 Applying ALBA for Credit Scoring
  4 Conclusion
  References

Chapter 14
Using Self-Organizing Map for Data Mining: A Synthesis with Accounting Applications
Andriy Andreev, Argyris Argyrou
  1 Introduction
  2 Data Pre-processing
    2.1 Types of Variables
    2.2 Distance Metrics
    2.3 Rescaling Input Variables
  3 Self-Organizing Map
    3.1 Introduction to SOM
    3.2 Formation of SOM
  4 Performance Metrics and Cluster Validity
  5 Extensions of SOM
    5.1 Non-metric Spaces
    5.2 SOM for Temporal Sequence Processing
    5.3 SOM for Cluster Analysis
    5.4 SOM for Visualizing High-Dimensional Data
  6 Financial Applications of SOM
  7 Case Study: Clustering Accounting Databases
    7.1 Data Description
    7.2 Data Pre-processing
    7.3 Experiments
    7.4 Results Presentation and Discussion
  References

Chapter 15
Applying Data Mining Techniques to Assess Steel Plant Operation Conditions
Khan Muhammad Badruddin, Isao Yagi, Takao Terano
  1 Introduction
  2 Brief Description of EAF
    2.1 Performance Evaluation Criteria
    2.2 Innovations in Electric Arc Furnaces
    2.3 Details of the Operation
    2.4 Understanding SCIPs and Stages of a Heat
  3 Problem Description
  4 Data Mining Process
    4.1 Data
    4.2 Data Preprocessing
    4.3 Attribute Pruning
    4.4 The Experiments
    4.5 Data Mining Techniques
  5 Results
    5.1 Discussion
  6 Concluding Remarks
  References

Author Index
Editors

Dr. Dawn E. Holmes serves as Senior Lecturer in the Department of Statistics and Applied Probability and Senior Associate Dean in the Division of Undergraduate Education at UCSB. Her main research area, Bayesian Networks with Maximum Entropy, has resulted in numerous journal articles and conference presentations. Her other research interests include Machine Learning, Data Mining, Foundations of Bayesianism and Intuitionistic Mathematics. Dr. Holmes has co-edited, with Professor Lakhmi C. Jain, the volumes "Innovations in Bayesian Networks" and "Innovations in Machine Learning". Dr. Holmes teaches a broad range of courses, including SAS programming, Bayesian Networks and Data Mining. She was awarded the Distinguished Teaching Award by the Academic Senate, UCSB, in 2008. As well as being Associate Editor of the International Journal of Knowledge-Based and Intelligent Information Systems, Dr. Holmes reviews extensively and is on the editorial board of several journals, including the Journal of Neurocomputing. She serves as Program Scientific Committee Member for numerous conferences, including the International Conference on Artificial Intelligence and the International Conference on Machine Learning. In 2009 Dr. Holmes accepted an invitation to join the Center for Research in Financial Mathematics and Statistics (CRFMS), UCSB. She was made a Senior Member of the IEEE in 2011.

Professor Lakhmi C. Jain is a Director/Founder of the Knowledge-Based Intelligent Engineering Systems (KES) Centre, located at the University of South Australia. He is a Fellow of the Institution of Engineers Australia. His interests focus on artificial intelligence paradigms and their applications in complex systems, art-science fusion, e-education, e-healthcare, unmanned air vehicles and intelligent agents.