A Statistical Test for Comparing Success Rates

MIC2003: The Fifth Metaheuristics International Conference

Éric D. Taillard
EIVD, University of Applied Sciences of Western Switzerland
Route de Cheseaux 1, Case postale, CH-1401 Yverdon-les-Bains, Switzerland
Eric.Taillard@eivd.ch

1 Introduction

Who has not been perplexed when reading, in an article comparing two optimization methods, numerical results presented in the following form: "We have tested our optimization method A on a set of n problem instances from the literature and we succeeded in solving a of these instances. The competing method B succeeded in solving only b of these instances. However, it has to be noted that method B was tested on only m of the n instances."

The reader is left without an answer to the basic question: is a success rate of a/n significantly higher than a success rate of b/m? Very often, classical statistical tests cannot answer this central question, since they require large sample sizes (at least about 15). In combinatorial optimization, however, sets of problem instances are frequently smaller than 15.

Intuitively, the reader is perfectly convinced that a method A that solved all 10 problem instances of a given set is better than a method B that solved none of them. The reader is much less convinced if the problem set contains only 3 instances. However, assuming that the problem instances have been chosen independently of the solving methods, it can be shown that a 3/3 success rate is significantly higher (with a confidence level above 98%) than a 0/3 success rate.

It could be argued that larger problem sets must be used, so that standard statistical tests can be applied. Unfortunately, this is not always possible. First, for real problems, collecting the data for a single instance may take several weeks. Second, the classical libraries of problem instances (ORLIB [2], QAPLIB [3], TSPLIB [4]) seldom propose more than 10 instances for a given problem size. Third, many optimization methods are very time consuming (for instance, the code for the quadratic assignment problem of [1] took the equivalent of 7 years of CPU time on a sequential computer to solve instance Nug30). In such a case, it might be more interesting to estimate the performance of a method on a few large problem instances than on a multitude of toy instances.
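The 98% figure quoted above for 3/3 against 0/3 can be checked by hand. The following is only a worked sketch: it borrows the probability S and the rejection rule that Section 3 defines, with a = n = 3, b = 0, m = 3 and a confidence level of 98%.

```latex
\[
S(p,3,3,0,3) = p^{3}(1-p)^{3},
\qquad
\frac{d}{dp}\,p^{3}(1-p)^{3} = 3p^{2}(1-p)^{2}(1-2p) = 0
\;\Longrightarrow\; p = \tfrac{1}{2},
\]
\[
\max_{0<p<1} S(p,3,3,0,3) = \left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64} \approx 0.0156 \;\le\; 1 - 0.98 = 0.02 .
\]
```

Since the maximum stays below 2%, the null hypothesis of equal success rates is rejected at the 98% level, which is consistent with the claim.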

2 Standard statistical test

The success rates of two populations A and B are traditionally compared as follows. Let $p_a$ (respectively $p_b$) be the probability of success in population A (respectively population B), and let random samples of size $n$ (respectively $m$) be taken for the experiment. Then the statistic $U = X_a/n - X_b/m$ (where $X_a$ and $X_b$ are the random variables counting the successful experiments in populations A and B) has mean $p_a - p_b$ and variance $p_a q_a/n + p_b q_b/m$, where $q_a = 1 - p_a$ and $q_b = 1 - p_b$.

For conducting a statistical test, the null hypothesis is $p_a = p_b = p$, so that $U$ should have a mean of 0. If both $n$ and $m$ are large, $U$ is approximately normally distributed and the common probability $p$ can be estimated by $\hat{p} = \frac{a+b}{m+n}$, where $a$ and $b$ are the numbers of successes observed in populations A and B. The null hypothesis is rejected (and $p_a$ is considered to be higher than $p_b$) if $\Phi(\hat{u}) > \alpha$, where $\alpha$ is the confidence level, $\Phi$ is the cumulative standard normal distribution and

$$\hat{u} = \frac{a/n - b/m}{\sqrt{\dfrac{a+b}{n+m}\cdot\dfrac{n+m-a-b}{n+m}\cdot\left(\dfrac{1}{n}+\dfrac{1}{m}\right)}}$$

In practice, if both $n$ and $m$ are higher than 14, the null hypothesis can reasonably be rejected if $\hat{u} > 1.65$ ($\alpha = 95\%$), $\hat{u} > 2.06$ ($\alpha = 98\%$), $\hat{u} > 2.33$ ($\alpha = 99\%$), $\hat{u} > 2.58$ ($\alpha = 99.5\%$) or $\hat{u} > 3.09$ ($\alpha = 99.9\%$).

3 A non-parametric statistical test

To answer the question "Do $a$ occurrences of a criterion in a sample of size $n$ represent a rate higher than $b$ occurrences in a sample of size $m$?", one can proceed as follows.

Null hypothesis: the (unknown) rate $p$ of occurrence of the criterion is the same for both samples (i.e. $p_a = p_b = p$). Under the null hypothesis, the probability $S(p, a, n, b, m)$ of observing $a$ successes or more in the first sample (of size $n$) and $b$ successes or less in the second sample (of size $m$) is given by the product of two binomial tail probabilities:

$$S(p, a, n, b, m) = \left(\sum_{i=a}^{n} C_i^n\, p^i (1-p)^{n-i}\right)\cdot\left(\sum_{j=0}^{b} C_j^m\, p^j (1-p)^{m-j}\right)$$

where $C_i^n = \frac{n!}{i!\,(n-i)!}$ and $C_j^m = \frac{m!}{j!\,(m-j)!}$ are the binomial coefficients.

Alternate hypothesis: $p_a > p_b$, i.e. the success rate of method A is higher than the success rate of method B.

The null hypothesis has to be rejected with a confidence level $\alpha$ (and the alternate hypothesis accepted, i.e. an $a/n$ rate is considered higher than a $b/m$ rate) if

$$\max_{0<p<1} S(p, a, n, b, m) \le 1 - \alpha$$
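The two tests of Sections 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names (`standard_u`, `S`, `max_S`, `significantly_higher`) are hypothetical, and the maximum over p is approximated by a repeatedly refined grid search, a scheme chosen here for simplicity because the paper does not say how its numerical maximization was done.

```python
from math import comb, sqrt


def standard_u(a, n, b, m):
    """u-hat of the standard test, built on the pooled estimate (a+b)/(n+m).

    Undefined (division by zero) when the pooled estimate is exactly 0 or 1.
    """
    p_hat = (a + b) / (n + m)
    return (a / n - b / m) / sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))


def S(p, a, n, b, m):
    """P(at least a successes out of n) * P(at most b successes out of m),
    both computed with the same success rate p (the null hypothesis)."""
    at_least_a = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(a, n + 1))
    at_most_b = sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(b + 1))
    return at_least_a * at_most_b


def max_S(a, n, b, m, grid=2000, rounds=4):
    """Approximate the maximum of S over p (assumed scheme: grid search,
    then re-gridding around the best point found). Returns (p, S(p))."""
    lo, hi = 0.0, 1.0
    best_p, best_val = 0.5, S(0.5, a, n, b, m)
    for _ in range(rounds):
        step = (hi - lo) / grid
        for k in range(grid + 1):
            p = lo + k * step
            val = S(p, a, n, b, m)
            if val > best_val:
                best_p, best_val = p, val
        # Zoom in on one grid cell on each side of the current best point.
        lo, hi = max(0.0, best_p - step), min(1.0, best_p + step)
    return best_p, best_val


def significantly_higher(a, n, b, m, alpha=0.95):
    """Rejection rule of Section 3: max over p of S must not exceed 1 - alpha."""
    return max_S(a, n, b, m)[1] <= 1 - alpha


if __name__ == "__main__":
    print(significantly_higher(3, 3, 0, 3, alpha=0.98))  # 3/3 against 0/3
    print(max_S(3, 4, 0, 3))  # maximizer differs from the pooled estimate 3/7
```

A grid search is used rather than a closed form because an analytic maximizer is only easy to obtain in special cases such as a = n and b = 0.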

4 Examples

Let us suppose that all $n$ observations from the first sample were successes and all $m$ observations from the second sample were failures (i.e. $a = n$ and $b = 0$). Assuming that both populations have the same probability of success,

$$S(p, n, n, 0, m) = p^n (1-p)^0 \cdot p^0 (1-p)^m = p^n (1-p)^m.$$

The probability $\hat{p}$ that maximizes $S(p, n, n, 0, m)$ is obtained by solving the equation

$$\frac{dS(p, n, n, 0, m)}{dp} = n\,p^{n-1}(1-p)^m - m\,p^n (1-p)^{m-1} = 0,$$

whose solution in $0 < p < 1$ is $\hat{p} = \frac{n}{n+m}$. For the special case $a = n$ and $b = 0$, the pooled estimate $\hat{p} = \frac{a+b}{m+n}$ is therefore the value that maximizes $S(p, a, n, b, m)$ over $p$. For instance, if $n = 3$ and $m = 2$, $S(3/5, 3, 3, 0, 2) = 108/3125 < 5\%$. So a success rate of 3/3 is significantly higher (with a confidence level of 95%) than a success rate of 0/2.

Unfortunately, for arbitrary values of $a$, $n$, $b$ and $m$, the pooled estimate is not the value that maximizes $S(p, a, n, b, m)$ over $p$. For instance, $S(3/7, 3, 4, 0, 3) < 4/100$ and $S\!\left(\frac{6-2\sqrt{2}}{7}, 3, 4, 0, 3\right) > 4/100$. This means that testing whether a rate of 3/4 is significantly higher than a rate of 0/3 with a confidence level of 96% would lead to an erroneous conclusion if the pooled estimate were used.

In general, an analytic expression for $\hat{p}$ is hard to find in practice. Therefore, we have estimated $\hat{p}$ numerically and provide in Table 1 (respectively Table 2), for various values of $n$ and $m$ and for a confidence level of 95% (respectively 99%), the most extreme couples $(a, b)$ for which an $a/n$ rate of success is significantly higher than a $b/m$ rate.

Reading the tables

Let us suppose that the observed success rate of an optimization method A is 6/10 and the observed success rate of method B is 1/9 (meaning that $a = 6$, $n = 10$, $b = 1$, $m = 9$). In Table 1, the entry for $n = 10$ and $m = 9$ contains the couple (5,1), meaning that a 5/10 success rate is significantly higher than a 1/9 success rate at the 95% confidence level.
Since 6/10 > 5/10, it can be deduced that method A is significantly better than method B (at the 95% confidence level).

Table 1: Couples (a, b) for which a success rate a/n is significantly higher than a success rate b/m, for a confidence level of 95%. Each row corresponds to a pair (m, b) and each column to a value of n; the entry is the value a of the couple (a, b), and "-" marks a cell without a couple.

 m  b |  2  3  4  5  6  7  8  9 10 11 12 13 14 15   (columns: n)
 2  0 |  -  3  4  5  5  6  7  7  8  9  9 10 11 11
 2  1 |  -  -  -  -  -  -  -  -  -  -  -  - 14 15
 3  0 |  2  3  3  4  4  5  5  6  6  7  7  8  9  9
 3  1 |  -  -  -  5  6  7  8  8  9 10 11 12 12 13
 4  0 |  2  3  3  3  4  4  5  5  5  6  6  7  7  8
 4  1 |  -  3  4  5  5  6  7  7  8  9  9 10 11 11
 4  2 |  -  -  -  -  6  7  8  9 10 11 12 12 13 14
 5  0 |  2  2  3  3  3  4  4  4  5  5  5  6  6  7
 5  1 |  -  3  4  4  5  5  6  6  7  8  8  9  9 10
 5  2 |  -  -  4  5  6  7  7  8  9 10 10 11 12 12
 5  3 |  -  -  -  -  -  -  8  9 10 11 12 13 14 15
 6  0 |  2  2  2  3  3  3  4  4  4  5  5  5  5  6
 6  1 |  2  3  3  4  4  5  5  6  6  7  7  8  8  9
 6  2 |  -  3  4  5  5  6  7  7  8  9  9 10 11 11
 6  3 |  -  -  -  5  6  7  8  9  9 10 11 12 12 13
 6  4 |  -  -  -  -  -  -  -  - 10 11 12 13 14 15
 7  0 |  2  2  2  3  3  3  3  4  4  4  4  5  5  5
 7  1 |  2  3  3  4  4  4  5  5  6  6  7  7  7  8
 7  2 |  -  3  4  4  5  6  6  7  7  8  8  9 10 10
 7  3 |  -  -  4  5  6  6  7  8  9  9 10 11 11 12
 7  4 |  -  -  -  -  6  7  8  9 10 10 11 12 13 14
 7  5 |  -  -  -  -  -  -  -  -  - 11 12 13 14 15
 8  0 |  2  2  2  2  3  3  3  3  4  4  4  4  5  5
 8  1 |  2  3  3  3  4  4  5  5  5  6  6  6  7  7
 8  2 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
 8  3 |  -  3  4  5  5  6  7  7  8  9  9 10 10 11
 8  4 |  -  -  -  5  6  7  8  8  9 10 10 11 12 13
 8  5 |  -  -  -  -  -  7  8  9 10 11 12 12 13 14
 8  6 |  -  -  -  -  -  -  -  -  -  -  - 13 14 15
 9  0 |  2  2  2  2  3  3  3  3  3  4  4  4  4  4
 9  1 |  2  2  3  3  4  4  4  5  5  5  6  6  6  7
 9  2 |  2  3  3  4  4  5  5  6  6  7  7  8  8  9
 9  3 |  -  3  4  4  5  6  6  7  7  8  9  9 10 10
 9  4 |  -  -  4  5  6  6  7  8  8  9 10 10 11 12
 9  5 |  -  -  -  5  6  7  8  9  9 10 11 12 12 13
 9  6 |  -  -  -  -  -  -  8  9 10 11 12 13 14 14
 9  7 |  -  -  -  -  -  -  -  -  -  -  -  -  - 15
10  0 |  2  2  2  2  2  3  3  3  3  3  4  4  4  4
10  1 |  2  2  3  3  3  4  4  4  5  5  5  6  6  6
10  2 |  2  3  3  4  4  5  5  5  6  6  7  7  8  8
10  3 |  -  3  4  4  5  5  6  6  7  7  8  9  9 10
10  4 |  -  3  4  5  5  6  7  7  8  9  9 10 10 11
10  5 |  -  -  4  5  6  7  7  8  9  9 10 11 12 12
10  6 |  -  -  -  -  6  7  8  9 10 10 11 12 13 14
10  7 |  -  -  -  -  -  -  -  9 10 11 12 13 14 15
11  0 |  2  2  2  2  2  2  3  3  3  3  3  4  4  4
11  1 |  2  2  3  3  3  3  4  4  4  5  5  5  6  6
11  2 |  2  3  3  4  4  4  5  5  5  6  6  7  7  7
11  3 |  -  3  4  4  5  5  6  6  7  7  7  8  8  9
11  4 |  -  3  4  5  5  6  6  7  7  8  9  9 10 10
11  5 |  -  -  4  5  6  6  7  8  8  9 10 10 11 12
11  6 |  -  -  -  5  6  7  8  8  9 10 11 11 12 13
11  7 |  -  -  -  -  -  7  8  9 10 11 11 12 13 14
11  8 |  -  -  -  -  -  -  -  - 10 11 12 13 14 15
12  0 |  2  2  2  2  2  2  3  3  3  3  3  3  4  4
12  1 |  2  2  3  3  3  3  4  4  4  4  5  5  5  5
12  2 |  2  3  3  3  4  4  4  5  5  6  6  6  7  7
12  3 |  2  3  3  4  4  5  5  6  6  7  7  8  8  8
12  4 |  -  3  4  4  5  5  6  7  7  8  8  9  9 10
12  5 |  -  3  4  5  5  6  7  7  8  8  9 10 10 11
12  6 |  -  -  4  5  6  7  7  8  9  9 10 11 11 12
12  7 |  -  -  -  5  6  7  8  9  9 10 11 12 12 13
12  8 |  -  -  -  -  -  7  8  9 10 11 12 12 13 14
12  9 |  -  -  -  -  -  -  -  -  - 11 12 13 14 15
13  0 |  2  2  2  2  2  2  2  3  3  3  3  3  3  4
13  1 |  2  2  2  3  3  3  3  4  4  4  4  5  5  5
13  2 |  2  3  3  3  4  4  4  5  5  5  6  6  6  7
13  3 |  2  3  3  4  4  5  5  5  6  6  7  7  8  8
13  4 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
13  5 |  -  3  4  5  5  6  6  7  8  8  9  9 10 10
13  6 |  -  -  4  5  6  6  7  8  8  9 10 10 11 11
13  7 |  -  -  -  5  6  7  7  8  9 10 10 11 12 13
13  8 |  -  -  -  -  6  7  8  9 10 10 11 12 13 13
13  9 |  -  -  -  -  -  -  8  9 10 11 12 13 14 14
13 10 |  -  -  -  -  -  -  -  -  -  - 12 13 14 15

Table 2: Couples (a, b) for which a success rate a/n is significantly higher than a success rate b/m, for a confidence level of 99%. The layout is the same as in Table 1.

 m  b |  2  3  4  5  6  7  8  9 10 11 12 13 14 15   (columns: n)
 2  0 |  -  -  -  -  -  7  8  9 10 11 12 12 13 14
 3  0 |  -  -  4  5  6  7  7  8  9  9 10 11 11 12
 3  1 |  -  -  -  -  -  -  -  -  -  - 12 13 14 15
 4  0 |  -  3  4  5  5  6  6  7  8  8  9  9 10 11
 4  1 |  -  -  -  -  6  7  8  9 10 11 11 12 13 14
 5  0 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
 5  1 |  -  -  -  5  6  7  7  8  9 10 10 11 12 12
 5  2 |  -  -  -  -  -  -  -  9 10 11 12 13 14 14
 6  0 |  -  3  3  4  4  5  5  6  6  7  7  8  8  9
 6  1 |  -  -  4  5  6  6  7  8  8  9  9 10 11 11
 6  2 |  -  -  -  -  6  7  8  9 10 10 11 12 13 13
 6  3 |  -  -  -  -  -  -  -  -  - 11 12 13 14 15
 7  0 |  2  3  3  4  4  5  5  5  6  6  7  7  7  8
 7  1 |  -  -  4  5  5  6  6  7  8  8  9  9 10 10
 7  2 |  -  -  -  5  6  7  8  8  9 10 10 11 12 12
 7  3 |  -  -  -  -  -  -  8  9 10 11 12 12 13 14
 7  4 |  -  -  -  -  -  -  -  -  -  -  - 13 14 15
 8  0 |  2  3  3  4  4  4  5  5  5  6  6  6  7  7
 8  1 |  -  3  4  4  5  6  6  7  7  8  8  9  9 10
 8  2 |  -  -  4  5  6  6  7  8  8  9 10 10 11 12
 8  3 |  -  -  -  -  6  7  8  9  9 10 11 12 12 13
 8  4 |  -  -  -  -  -  -  -  9 10 11 12 13 14 14
 8  5 |  -  -  -  -  -  -  -  -  -  -  -  -  - 15
 9  0 |  2  3  3  3  4  4  4  5  5  5  6  6  6  7
 9  1 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
 9  2 |  -  -  4  5  6  6  7  7  8  9  9 10 10 11
 9  3 |  -  -  -  5  6  7  8  8  9 10 10 11 12 12
 9  4 |  -  -  -  -  -  7  8  9 10 11 11 12 13 14
 9  5 |  -  -  -  -  -  -  -  - 10 11 12 13 14 15
10  0 |  2  3  3  3  4  4  4  4  5  5  5  6  6  6
10  1 |  -  3  4  4  5  5  5  6  6  7  7  8  8  9
10  2 |  -  -  4  5  5  6  6  7  8  8  9  9 10 10
10  3 |  -  -  -  5  6  7  7  8  9  9 10 10 11 12
10  4 |  -  -  -  -  6  7  8  9  9 10 11 12 12 13
10  5 |  -  -  -  -  -  -  8  9 10 11 12 13 13 14
10  6 |  -  -  -  -  -  -  -  -  -  - 12 13 14 15
11  0 |  2  3  3  3  3  4  4  4  5  5  5  5  6  6
11  1 |  -  3  4  4  4  5  5  6  6  6  7  7  8  8
11  2 |  -  3  4  5  5  6  6  7  7  8  8  9  9 10
11  3 |  -  -  4  5  6  6  7  8  8  9  9 10 11 11
11  4 |  -  -  -  5  6  7  8  8  9 10 10 11 12 12
11  5 |  -  -  -  -  -  7  8  9 10 10 11 12 13 13
11  6 |  -  -  -  -  -  -  -  9 10 11 12 13 14 14
11  7 |  -  -  -  -  -  -  -  -  -  -  - 13 14 15
12  0 |  2  2  3  3  3  4  4  4  4  5  5  5  5  6
12  1 |  -  3  3  4  4  5  5  5  6  6  7  7  7  8
12  2 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
12  3 |  -  -  4  5  5  6  7  7  8  8  9  9 10 11
12  4 |  -  -  -  5  6  7  7  8  9  9 10 11 11 12
12  5 |  -  -  -  -  6  7  8  9  9 10 11 11 12 13
12  6 |  -  -  -  -  -  -  8  9 10 11 12 12 13 14
12  7 |  -  -  -  -  -  -  -  - 10 11 12 13 14 15
12  8 |  -  -  -  -  -  -  -  -  -  -  -  - 14 15
13  0 |  2  2  3  3  3  3  4  4  4  4  5  5  5  5
13  1 |  2  3  3  4  4  4  5  5  6  6  6  7  7  7
13  2 |  -  3  4  4  5  5  6  6  7  7  7  8  8  9
13  3 |  -  -  4  5  5  6  6  7  7  8  9  9 10 10
13  4 |  -  -  4  5  6  6  7  8  8  9  9 10 11 11
13  5 |  -  -  -  5  6  7  8  8  9 10 10 11 12 12
13  6 |  -  -  -  -  -  7  8  9 10 10 11 12 13 13
13  7 |  -  -  -  -  -  -  8  9 10 11 12 13 13 14
13  8 |  -  -  -  -  -  -  -  -  - 11 12 13 14 15
14  0 |  2  2  3  3  3  3  4  4  4  4  4  5  5  5
14  1 |  2  3  3  4  4  4  5  5  5  6  6  6  7  7
14  2 |  -  3  4  4  5  5  5  6  6  7  7  8  8  8
14  3 |  -  3  4  5  5  6  6  7  7  8  8  9  9 10
14  4 |  -  -  4  5  6  6  7  7  8  9  9 10 10 11
14  5 |  -  -  -  5  6  7  7  8  9  9 10 11 11 12
14  6 |  -  -  -  -  6  7  8  9  9 10 11 11 12 13
14  7 |  -  -  -  -  -  7  8  9 10 11 11 12 13 14
14  8 |  -  -  -  -  -  -  -  9 10 11 12 13 14 14
14  9 |  -  -  -  -  -  -  -  -  -  - 12 13 14 15
15  0 |  2  2  3  3  3  3  3  4  4  4  4  5  5  5
15  1 |  2  3  3  3  4  4  4  5  5  5  6  6  6  7
15  2 |  -  3  4  4  4  5  5  6  6  6  7  7  8  8
15  3 |  -  3  4  4  5  5  6  6  7  7  8  8  9  9
15  4 |  -  -  4  5  5  6  7  7  8  8  9  9 10 10
15  5 |  -  -  -  5  6  6  7  8  8  9 10 10 11 11
15  6 |  -  -  -  5  6  7  8  8  9 10 10 11 12 12
15  7 |  -  -  -  -  -  7  8  9 10 10 11 12 12 13
15  8 |  -  -  -  -  -  -  8  9 10 11 12 12 13 14
15  9 |  -  -  -  -  -  -  -  - 10 11 12 13 14 15
15 10 |  -  -  -  -  -  -  -  -  -  -  - 13 14 15

5 Conclusions

This article presents a non-parametric statistical test that is useful for those who want to compare heuristic algorithms that do not necessarily end with feasible (or satisfying) solutions. The test has been specially designed for very small sample sizes, meaning that a substantial computational effort can be saved when conducting numerical experiments.

When the sample sizes are lower than 15, standard statistical tests for comparing the success rates of two populations cannot be validly used. We have indeed observed that the standard statistical test, when applied outside its domain of validity, provides erroneous results. The same holds for very high confidence levels, even if the sample sizes are larger than 15. Therefore, a non-parametric test has been developed. This test is more accurate and can be applied to any sample size, but it requires relatively heavy computations. For this reason, pre-computed values for the 95% and 99% confidence levels have been tabulated in the present article. The computation of confidence levels can also be done online at the URL: http://ina.eivd.ch/projects/stamp/

When the sample sizes are at least 15, we have observed that the probability of rejecting the null hypothesis when it is true, as computed by the standard test, almost always overestimates the corresponding value obtained with the non-parametric test. This means that the standard test very seldom rejects the null hypothesis when, according to the non-parametric test, it has to be accepted. Very often, the standard test strongly overestimates the probability of the null hypothesis, meaning that the proposed non-parametric test is more powerful than the standard one.

6 Acknowledgements

The author would like to thank F. Taillard and J. Zuber for comments and discussions on early versions of the article, as well as A. Løkketangen for asking the embarrassing questions about the comparison of methods for the satisfiability problem that led to the development of the present statistical test. The online implementation for computing the statistical test is due to Ph. Wälti. The present work is supported by the strategic funds of the Applied University of Western Switzerland, grant LQF01-03.

References

[1] K.M. Anstreicher, N.W. Brixius, J.-P. Goux and J. Linderoth, Solving large quadratic assignment problems on computational grids, to appear in Mathematical Programming, Series B, 2001. Currently available on the Web from http://www.biz.uiowa.edu/faculty/anstreicher/mwqap.ps

[2] J.E. Beasley, OR-Library: distributing test problems by electronic mail, Journal of the Operational Research Society 41(11), 1990, pp. 1069-1072. http://mscmga.ms.ic.ac.uk/info.html

[3] R.E. Burkard, S.E. Karisch and F. Rendl, QAPLIB - A Quadratic Assignment Problem Library, Journal of Global Optimization 10, pp. 391-403, 1997. http://www.opt.math.tugraz.ac.at/qaplib

[4] G. Reinelt, TSPLIB: sample instances for the TSP (and related problems) from various sources and of various types. Available on the web: http://www.iwr.uni-heidelberg.de/groups/comopt/software/tsplib95/index.html