Statistics and Computing
Series Editors: J. Chambers, D. Hand, W. Härdle

Statistics and Computing

Brusco/Stahl: Branch-and-Bound Applications in Combinatorial Data Analysis.
Dalgaard: Introductory Statistics with R.
Gentle: Elements of Computational Statistics.
Gentle: Numerical Linear Algebra for Applications in Statistics.
Gentle: Random Number Generation and Monte Carlo Methods, 2nd Edition.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical Computing Environment.
Krause/Olson: The Basics of S-PLUS, 4th Edition.
Lange: Numerical Analysis for Statisticians.
Lemmon/Schafer: Developing Statistical Software in Fortran 95.
Loader: Local Regression and Likelihood.
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to Signal Processing.
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D.
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS.
Venables/Ripley: Modern Applied Statistics with S, 4th Edition.
Venables/Ripley: S Programming.
Wilkinson: The Grammar of Graphics.

Michael J. Brusco
Stephanie Stahl

Branch-and-Bound Applications in Combinatorial Data Analysis

Michael J. Brusco
Department of Marketing
College of Business
Florida State University
Tallahassee, FL 32306-1110
USA

Stephanie Stahl
2352 Hampshire Way
Tallahassee, FL 32309-3138
USA

Series Editors:

J. Chambers
Bell Labs, Lucent Technologies
600 Mountain Avenue
Murray Hill, NJ 07974
USA

D. Hand
Department of Mathematics
South Kensington Campus
Imperial College, London
London SW7 2AZ
United Kingdom

W. Härdle
Institut für Statistik und Ökonometrie
Humboldt-Universität zu Berlin
Spandauer Str. 1
D-10178 Berlin
Germany

Library of Congress Control Number: 2005924426
ISBN-10: 0-387-25037-9
ISBN-13: 978-0387-25037-3

Printed on acid-free paper.

© 2005 Springer Science+Business Media, Inc.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. (EB)

9 8 7 6 5 4 3 2 1

springeronline.com

For Cobol Lipshitz and Snobol Gentiment

Preface

This monograph focuses on the application of the solution strategy known as branch-and-bound to problems of combinatorial data analysis. Combinatorial data analysis problems typically require either the selection of a subset of objects from a larger (master) set, the grouping of a collection of objects into mutually exclusive and exhaustive subsets, or the sequencing of objects. To obtain verifiably optimal solutions for this class of problems, we must evaluate (either explicitly or implicitly) all feasible solutions. Unfortunately, the number of feasible solutions for problems of combinatorial data analysis grows exponentially with problem size. For this reason, the explicit enumeration and evaluation of all solutions is computationally infeasible for all but the smallest problems.

The branch-and-bound solution method is one type of partial enumeration solution strategy that enables some combinatorial data analysis problems to be solved optimally without explicitly enumerating all feasible solutions. To understand the operation of a branch-and-bound algorithm, we distinguish complete solutions from partial solutions. A complete solution is one for which a feasible solution to the optimization problem has been produced (e.g., all objects are assigned to a group, or all objects are assigned a sequence position). A partial solution is an incomplete solution (e.g., some objects are not assigned to a group, or some objects are not assigned a sequence position).

During the execution of a branch-and-bound algorithm, solutions are gradually constructed and are, therefore, only partially completed at most stages. If we can determine that a partial solution cannot possibly lead to an optimal solution, then that partial solution and all possible complete solutions stemming from the partial solution can be eliminated from further consideration. This elimination of partial solutions and, simultaneously, all of the complete and partial solutions that could be generated from them is the cornerstone of the branch-and-bound method.

In the worst case, a branch-and-bound algorithm could require the complete enumeration of all feasible solutions. For this reason, we note from the outset that we will reach a point where branch-and-bound is not computationally feasible, and other solution approaches are required.
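
The distinction between partial and complete solutions, and the elimination of a partial solution together with everything that stems from it, can be illustrated with a small sketch. It is not the authors' algorithm or their distributed software. It is a minimal Python illustration that assumes a minimum-diameter partitioning criterion (the largest dissimilarity between two objects in the same cluster), and the names `min_diameter_partition`, `exhaustive_min_diameter`, `points`, and `d` are hypothetical, introduced only for this example.

```python
from itertools import product


def min_diameter_partition(d, k):
    """Illustrative branch-and-bound: split objects 0..n-1 into k non-empty
    clusters, minimizing the largest within-cluster dissimilarity."""
    n = len(d)
    best = {"value": float("inf"), "labels": None, "nodes": 0}
    labels = [-1] * n  # labels[i] = cluster of object i; -1 means unassigned

    def branch(i, used, diameter):
        best["nodes"] += 1
        if i == n:  # complete solution: every object has a cluster
            if used == k and diameter < best["value"]:
                best["value"], best["labels"] = diameter, labels[:]
            return
        # Feasibility: enough objects must remain to fill the still-empty clusters.
        if n - i < k - used:
            return
        # Try every open cluster plus at most one new one (skips relabeled duplicates).
        for c in range(min(used + 1, k)):
            new_diam = diameter
            for j in range(i):
                if labels[j] == c:
                    new_diam = max(new_diam, d[i][j])
            # Bound: the diameter cannot shrink as objects are added, so a partial
            # solution that already reaches the incumbent value is discarded.
            if new_diam >= best["value"]:
                continue
            labels[i] = c
            branch(i + 1, max(used, c + 1), new_diam)
            labels[i] = -1

    branch(0, 0, 0.0)
    return best["value"], best["labels"], best["nodes"]


def exhaustive_min_diameter(d, k):
    """Explicit enumeration of all k**n labelings, for comparison only."""
    n = len(d)
    best = float("inf")
    for labels in product(range(k), repeat=n):
        if len(set(labels)) < k:  # require k non-empty clusters
            continue
        diam = max(
            (d[i][j] for i in range(n) for j in range(i) if labels[i] == labels[j]),
            default=0.0,
        )
        best = min(best, diam)
    return best


# Hypothetical data: six objects on a line, dissimilarity = absolute difference.
points = [0, 1, 2, 10, 11, 20]
d = [[abs(a - b) for b in points] for a in points]
value, labels, nodes = min_diameter_partition(d, 3)
print(value, labels, nodes)
```

On this toy input both routines agree on the optimal diameter, while the branch-and-bound search visits far fewer nodes than the 3**6 labelings the exhaustive routine scans. The gap depends heavily on the data and on how quickly a good incumbent is found, and in the worst case it disappears.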

We develop and describe a variety of applications of the branch-and-bound paradigm to combinatorial data analysis. Part I of the monograph (Chapters 2 through 6) focuses on applications for partitioning a set of objects based on various criteria. Part II (Chapters 7 through 11) describes branch-and-bound approaches for seriation of a collection of objects. Part III (Chapters 12 through 14) addresses the plausibility of branch-and-bound methods for variable selection in multivariate data analysis, particularly focusing on cluster analysis and regression.

Our development of this monograph was largely inspired by the recent monograph by Hubert, Arabie, and Meulman (2001) titled Combinatorial Data Analysis: Optimization by Dynamic Programming. Like the branch-and-bound method, dynamic programming is a partial enumeration strategy that can produce optimal solutions for certain classes of combinatorial data analysis problems that would be insurmountable for exhaustive enumeration approaches. Many of the problems tackled in Parts I and II are also addressed in the Hubert et al. (2001) monograph, which would make an excellent companion reading for our monograph.

At the end of many of the chapters, we identify available computer programs for implementing the relevant branch-and-bound procedures. This software is offered (without warranty) free of charge, and both the source code and executable programs are available. This code should enable users to reproduce results reported in the chapters, and also to implement the programs for their own data sets.

We are deeply indebted to Phipps Arabie, J. Douglas Carroll, and Lawrence Hubert, whose enthusiasm and encouragement were the principal motivating factors for our pursuing this area of research. We also thank J. Dennis Cradit, who has frequently collaborated with us on related problems of combinatorial data analysis.

Michael J. Brusco
Stephanie Stahl
January, 2005

Contents

Preface

1 Introduction
  1.1 Background
  1.2 Branch-and-Bound
    1.2.1 A Brief History
    1.2.2 Components of a Branch-and-Bound Model
  1.3 An Outline of the Monograph
    1.3.1 Module 1: Cluster Analysis Partitioning
    1.3.2 Module 2: Seriation
    1.3.3 Module 3: Variable Selection
  1.4 Layout for Nonintroductory Chapters

I Cluster Analysis Partitioning

2 An Introduction to Branch-and-Bound Methods for Partitioning
  2.1 Partitioning Indices
  2.2 A Branch-and-Bound Paradigm for Partitioning
    2.2.1 Algorithm Notation
    2.2.2 Steps of the Algorithm
    2.2.3 Algorithm Description

3 Minimum-Diameter Partitioning
  3.1 Overview
  3.2 The INITIALIZE Step
  3.3 The PARTIAL SOLUTION EVALUATION Step
  3.4 A Numerical Example
  3.5 Application to a Larger Data Set
  3.6 An Alternative Diameter Criterion
  3.7 Strengths and Limitations
  3.8 Available Software

4 Minimum Within-Cluster Sums of Dissimilarities Partitioning
  4.1 Overview
  4.2 The INITIALIZE Step
  4.3 The PARTIAL SOLUTION EVALUATION Step
  4.4 A Numerical Example
  4.5 Application to a Larger Data Set
  4.6 Strengths and Limitations of the Within-Cluster Sums Criterion
  4.7 Available Software

5 Minimum Within-Cluster Sums of Squares Partitioning
  5.1 The Relevance of Criterion (2.3)
  5.2 The INITIALIZE Step
  5.3 The PARTIAL SOLUTION EVALUATION Step
  5.4 A Numerical Example
  5.5 Application to a Larger Data Set
  5.6 Strengths and Limitations of the Standardized Within-Cluster Sum of Dissimilarities Criterion
  5.7 Available Software

6 Multiobjective Partitioning
  6.1 Multiobjective Problems in Cluster Analysis
  6.2 Partitioning of an Object Set Using Multiple Bases
  6.3 Partitioning of Objects in a Single Data Set Using Multiple Criteria
  6.4 Strengths and Limitations
  6.5 Available Software

II Seriation

7 Introduction to the Branch-and-Bound Paradigm for Seriation
  7.1 Background
  7.2 A General Branch-and-Bound Paradigm for Seriation

8 Seriation: Maximization of a Dominance Index
  8.1 Introduction to the Dominance Index
  8.2 Fathoming Tests for Optimizing the Dominance Index
    8.2.1 Determining an Initial Lower Bound
    8.2.2 The Adjacency Test
    8.2.3 The Bound Test
  8.3 Demonstrating the Iterative Process
  8.4 EXAMPLES: Extracting and Ordering a Subset
    8.4.1 Tournament Matrices
    8.4.2 Maximum Dominance Index vs. Perfect Dominance for Subsets
  8.5 Strengths and Limitations
  8.6 Available Software

9 Seriation: Maximization of Gradient Indices
  9.1 Introduction to the Gradient Indices
  9.2 Fathoming Tests for Optimizing Gradient Indices
    9.2.1 The Initial Lower Bounds for Gradient Indices
    9.2.2 The Adjacency Test for Gradient Indices
    9.2.3 The Bound Test for Gradient Indices
  9.3 EXAMPLE: An Archaeological Exploration
  9.4 Strengths and Limitations
  9.5 Available Software

10 Seriation: Unidimensional Scaling
  10.1 Introduction to Unidimensional Scaling
  10.2 Fathoming Tests for Optimal Unidimensional Scaling
    10.2.1 Determining an Initial Lower Bound
    10.2.2 Testing for Symmetry in Optimal Unidimensional Scaling
    10.2.3 Adjacency Test Expanded to Interchange Test
    10.2.4 Bound Test
  10.3 Demonstrating the Iterative Process
  10.4 EXAMPLE: Can You Hear Me Now?
  10.5 Strengths and Limitations
  10.6 Available Software

11 Seriation: Multiobjective Seriation
  11.1 Introduction to Multiobjective Seriation
  11.2 Efficient Solutions
  11.3 Maximizing the Dominance Index for Multiple Asymmetric Proximity Matrices
  11.4 UDS for Multiple Symmetric Dissimilarity Matrices
  11.5 Comparing Gradient Indices for a Symmetric Dissimilarity Matrix
  11.6 Multiple Matrices with Multiple Criteria
  11.7 Strengths and Limitations

III Variable Selection

12 Introduction to Branch-and-Bound Methods for Variable Selection
  12.1 Background

13 Variable Selection for Cluster Analysis
  13.1 True Variables and Masking Variables
  13.2 A Branch-and-Bound Approach to Variable Selection
  13.3 A Numerical Example
  13.4 Strengths, Limitations, and Extensions

14 Variable Selection for Regression Analysis
  14.1 Regression Analysis
  14.2 A Numerical Example
  14.3 Application to a Larger Data Set
  14.4 Strengths, Limitations, and Extensions
  14.5 Available Software

Appendix A: General Branch-and-Bound Algorithm for Partitioning
Appendix B: General Branch-and-Bound Algorithm Using Forward Branching for Optimal Seriation Procedures
References
Index