Comparing Extreme Members is a Low-Power Method of Comparing Groups: An Example Using Sex Differences in Chess Performance

Mark E. Glickman, Ph.D. (1, 2)
Christopher F. Chabris, Ph.D. (3)

(1) Center for Health Quality, Outcomes and Economic Research, a VA HSR&D Center of Excellence, Bedford, MA.
(2) Boston University School of Public Health, Department of Health Policy and Management, Boston, MA.
(3) Department of Psychology, Union College, Schenectady, NY.

Author Affiliations

Mark E. Glickman, Ph.D.
Center for Health Quality, Outcomes, and Economic Research
Edith Nourse Rogers Memorial Veterans Hospital
200 Springs Road (152)
Bedford, MA 01730
Tel: 781-687-2875
Fax: 781-687-3106
Email: mg@bu.edu

Christopher F. Chabris, Ph.D.
Department of Psychology
Union College
807 Union Street
Schenectady, NY 12308
Email: chabris@gmail.com

Four years after he won the world chess championship, Garry Kasparov was quoted as saying "there is real chess and there is women's chess ... chess does not fit women properly" (Chelminski, 1989). It is true that no woman has ever come close to winning the world chess championship, and that men vastly outnumber women at the highest levels of chess achievement. However, it is also obvious that men outnumber women at all levels in chess, and this difference in overall participation rates (the proportion of all men and women who choose to enter competitive chess) has been cited to explain the difference in high achievement (Charness & Gerchak, 1996; Howard, 2005; Chabris & Glickman, 2006). The question of sex differences in achievement is equally salient in other fields where more men reach the top levels than women, such as academia, business, and the law.

Chess is an excellent domain in which to study predictors of performance because of its relatively objective rating system developed by Arpad Elo (1986), which assigns to each player in official competitions a numerical value representing his or her strength. The larger the rating difference between two players, the better the higher-rated player is expected to score in a match between them.

Bilalić, Smallbone, McLeod, and Gobet (2009; hereafter BSMG) use chess rating data in an intriguing way to address the question of sex differences in chess ability. BSMG develop an approximation to calculate the expected value of the k-th highest value in a sample of n from a normal distribution, and use the result to compare chess ratings between top males and females. Their work extends a result by Charness and Gerchak (1996), who derive an approximation to the expected maximum of a sample from a normal distribution.
BSMG apply their approximation to the top 100 male and female German tournament chess players, and this analysis shows that it is difficult to conclude that men are better than women on average even though the best men have much higher ratings than the best women. BSMG correctly observe that the larger male German population of chess players would, simply by chance, produce better players even when the averages are the same. They go one step further and conclude that above the levels of the 80th-best men and women players, women are actually higher-rated than expected relative to men based on the sample sizes. They argue, like Charness and Gerchak, that one must account for participation rates when comparing the best achievers before generating new hypotheses (e.g., differences in cognitive ability or training regimens) to explain performance differences between groups.

The main drawback of BSMG's analyses is that they do not account for the inherently high variability of the extreme values in a sample. While differences in the highest chess rating between men and women may be explainable by differential rates of participation, they will tell us very little (with any certainty) about the average differences between men and women. By design, comparing only the highest achievers is a low-power procedure that is not likely to produce useful results.

To see why this is so, suppose that in a sample of n observations from a population with continuous cumulative distribution function F (and density f), we wish to find the approximate distribution of the highest values. Instead of using BSMG's approximation for the k-th largest value of a normally distributed sample, we can use an asymptotic normal approximation to the distribution of the t-th fractile, $X_t$, where $t = (n - k)/n$. The approximate distribution of $X_t$ is normal, and is given by

$$X_t \sim N\left( F^{-1}(t),\ \frac{t(1 - t)}{n\,[f(F^{-1}(t))]^2} \right),$$

where $F^{-1}(t)$ is the value corresponding to the t-th fractile of the distribution. This result can be found in advanced textbooks in statistics (e.g., Lehmann, 1983; Bickel & Doksum, 1977). For most common distributions, the value of $F^{-1}(t)$ can be calculated numerically using standard statistical software packages, such as R (R Development Core Team, 2008).

As an example, consider BSMG's comparison of the 5th best male to the 5th best female. By inspection of the graph in their Figure 2, the observed rating difference is about 290 in favor of the male player. The authors assume that the distribution of all players' ratings is normal, following N(1461, 342²), and that the numbers of male and female chess players are 113,386 and 7,013 respectively. From our formula, the approximate distribution of the 5th-highest rating for men is N(2802, 36.9²) and for women it is N(2552, 44.2²), so that the distribution of the difference is N(250, 57.6²); note that the variance of the difference is the sum of the individual variances. Thus an approximate 95% confidence interval for the rating difference between the 5th best male and female is 137.1 to 362.9, which is arguably too wide an interval to serve as a diagnostic for whether men are stronger chess players than women on average.

A related problem is revealed when we compare the 100th best male and female German tournament chess players. According to the BSMG approximation formula, the 100th best male should be rated near 2495.6 and the 100th best female should be rated near 2066.1, a difference of 429.5 (it is unclear to us how BSMG arrived at the value of 440 mentioned on p. 1162). The observed difference appears to be about 380, based on the graph in their Figure 2. Using our formula, the 100th best male rating has a N(2530.6, 10.0²) distribution, and the 100th best female rating has a N(2210.0, 13.4²) distribution, so that the male-female difference follows a N(320.6, 16.7²) distribution, with a 95% confidence interval of 287.9 to 353.3. The observed difference of about 380 lies above this interval, so relative to our mean, here again it appears as though men outperform women (significantly), which is the reverse of the conclusion presented by BSMG.
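The fractile approximation and the two worked comparisons above can be reproduced with a short script. The following is a minimal sketch in Python rather than R, assuming the N(1461, 342²) rating distribution and the participant counts quoted from BSMG. The function names `order_stat_approx` and `difference_interval` are illustrative and do not come from the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def order_stat_approx(n, k, mean, sd):
    """Approximate mean and sd of the k-th largest of n normal draws.

    Uses the asymptotic result X_t ~ N(F^-1(t), t(1-t) / (n f(F^-1(t))^2))
    with t = (n - k) / n.
    """
    dist = NormalDist(mean, sd)
    t = (n - k) / n
    q = dist.inv_cdf(t)                      # F^-1(t)
    var = t * (1 - t) / (n * dist.pdf(q) ** 2)
    return q, sqrt(var)


def difference_interval(k, n_a, n_b, mean, sd, z=1.96):
    """Approximate distribution of (k-th best of group A) - (k-th best of group B)."""
    m_a, s_a = order_stat_approx(n_a, k, mean, sd)
    m_b, s_b = order_stat_approx(n_b, k, mean, sd)
    diff = m_a - m_b
    se = sqrt(s_a ** 2 + s_b ** 2)           # variances add for independent groups
    return diff, se, (diff - z * se, diff + z * se)


# Counts and rating distribution as quoted in the text (taken from BSMG).
N_MEN, N_WOMEN = 113_386, 7_013
for k in (5, 100):
    diff, se, (lo, hi) = difference_interval(k, N_MEN, N_WOMEN, 1461, 342)
    print(f"k={k}: difference ~ N({diff:.1f}, {se:.1f}^2), 95% CI {lo:.1f} to {hi:.1f}")
```

With these inputs the sketch should give values close to those quoted in the text; small discrepancies in the last digit come from rounding intermediate results.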

In fact, this conclusion is not justified either, because it is sensitive to an unchecked and potentially false assumption. Underlying the calculations made by both BSMG and ourselves is the assumption that chess ratings are distributed normally. This is a crucial assumption, and one that is arguably not satisfied by actual chess rating systems. The apparent justification for assuming a normal distribution in BSMG's analysis is in their Figure 1, which shows a superimposed normal density function having similar features to the empirical rating distribution. It is difficult to determine from this graph whether the right tail of the rating distribution is normal (a normal probability plot might help address this question), but there is nothing in the statistical architecture of chess rating systems that favors ratings being distributed normally (see Glickman, 1995, for a detailed discussion of this issue).

To demonstrate the extent to which the assumed distribution can affect conclusions about the comparison between the top men and women, assume that German chess federation ratings have the mean and standard deviation specified by BSMG, but that the ratings follow a t-distribution with some specified degrees of freedom. Histograms of data coming from a t-distribution and a normal distribution would be virtually indistinguishable, but a t-distribution has tails that are sufficiently heavy to affect the analysis of the extremes. Such t-based models are becoming increasingly popular for robust data analyses (e.g., see Lange et al., 1989). Using our formula, we calculated the estimated ratings of the 100th best male and female assuming chess ratings truly followed a t-distribution with 15 degrees of freedom instead of a normal distribution. These ratings would follow N(2752.7, 19.3²) for the male and N(2243.2, 17.7²) for the female, so that the difference would follow N(509.5, 26.2²).
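The sensitivity to the tail shape can be explored with a similar sketch that swaps the normal distribution for a Student t-distribution. The code below is illustrative only: it uses pure-Python numerical integration and bisection for the t quantile, and it assumes the t-distribution is rescaled so that its standard deviation equals 342. That scaling convention is an assumption of this sketch, so its output is not expected to reproduce the exact figures quoted above; the point is the direction and rough size of the shift. All function names are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the standard Student t-distribution."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_order_stat_approx(n, k, mean, sd, df):
    """Fractile approximation for the k-th largest of n draws from a scaled t."""
    scale = sd * sqrt((df - 2) / df)   # ASSUMPTION: rescale so the sd equals `sd`
    t = (n - k) / n
    q = t_quantile(t, df)
    density = t_pdf(q, df) / scale     # density of the scaled variable at its fractile
    return mean + scale * q, sqrt(t * (1 - t) / n) / density


normal = NormalDist(1461, 342)
for label, n in (("men", 113_386), ("women", 7_013)):
    m_t, s_t = t_order_stat_approx(n, 100, 1461, 342, df=15)
    m_n = normal.inv_cdf((n - 100) / n)
    print(f"{label}: 100th best, normal mean {m_n:.1f}, t(15) mean {m_t:.1f} (sd {s_t:.1f})")
```

Under these assumptions the t-based fractiles sit above the normal ones and carry larger standard deviations, with the gap widening further into the tail.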
Our normal distribution calculation resulted in a mean of 320.6, which is 188.9 less than the estimate based on the t-distribution. This very large discrepancy stems entirely from the different assumptions about the distribution of ratings. Unless the analyst is sure about this distribution, specifically at the right tails, any statistical comparison between top order statistics is highly uncertain not only because extremes tend to vary greatly, but also because the assumed distribution of the data may be incorrect.

If one's goal is to detect average differences among groups, one should choose procedures that are based on less variable statistics than an analysis of extremes, and ones that are more robust to distributional assumptions. An obvious candidate is the sample mean, which is considerably less variable than high-order statistics. Even using lower order statistics, such as the top 10th or 20th percentile of the sample, would reduce the variability appreciably relative to the ones used by BSMG. Using the mean, or even the lower percentiles of the empirical distribution, is also much less sensitive to distributional assumptions than is using the highest values. We took this approach to examine sex differences in chess ability among 250,000 U.S. rated players; we found that the male mean was significantly higher than the female mean, but that this difference itself might result from the much larger number of boys than girls who enter competition (Chabris & Glickman, 2006; see also Maass et al., 2008).

The greater objectivity of Elo-type ratings as compared to other measures of relative ability (peer evaluations, impact analyses, patents, prize winnings, etc.) can mask the fact that they are still imperfect measures of underlying parameters, and the consequence that conclusions derived from them will be subject to variability. Researchers using chess ratings as data to answer questions about patterns of human performance should keep in mind that this variability is greatest for extreme values in a distribution, and that the extremes are also very sensitive to small changes in the underlying form of the distribution. Accordingly, though the conclusion BSMG arrived at could be correct, the procedures they followed do not have the statistical power to support it.
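The contrast in variability between the sample mean and an extreme order statistic can also be checked by simulation. The following Monte Carlo sketch is illustrative: it assumes normally distributed ratings with the mean and standard deviation quoted from BSMG, and it uses a hypothetical group size of 2,000 and 200 replications purely to keep the run time short.

```python
import random
from heapq import nlargest
from statistics import fmean, stdev


def simulate(n=2000, k=5, reps=200, mean=1461, sd=342, seed=0):
    """Return (sd of the sample mean, sd of the k-th largest value) across replications."""
    rng = random.Random(seed)
    means, kth_best = [], []
    for _ in range(reps):
        sample = [rng.gauss(mean, sd) for _ in range(n)]
        means.append(fmean(sample))
        kth_best.append(nlargest(k, sample)[-1])   # k-th largest rating in this sample
    return stdev(means), stdev(kth_best)


sd_mean, sd_kth = simulate()
print(f"sd of sample mean:  {sd_mean:.1f}")
print(f"sd of 5th-best:     {sd_kth:.1f}")
```

In runs of this kind the k-th best value varies several times more than the sample mean, which is the sense in which comparisons of extremes have low power.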

References

Bickel, P.J., & Doksum, K.A. (1977). Mathematical statistics: Basic ideas and selected topics. San Francisco: Holden-Day.

Bilalić, M., Smallbone, K., McLeod, P., & Gobet, F. (2009). Why are (the best) women so good at chess? Participation rates and gender differences in intellectual domains. Proceedings of the Royal Society B, 276, 1161–1165.

Chabris, C.F., & Glickman, M.E. (2006). Sex differences in intellectual performance: Analysis of a large cohort of competitive chess players. Psychological Science, 17, 1009–1107.

Charness, N., & Gerchak, Y. (1996). Participation rates and maximal performance: A log-linear explanation for group differences, such as Russian and male dominance in chess. Psychological Science, 7, 46–51.

Chelminski, R. (1989). Playboy interview: Garry Kasparov. Playboy, November. [http://www.playboy.com/articles/garry-kasparov-1989-interview/index.html]

Elo, A.E. (1986). The rating of chessplayers, past and present (2nd ed.). New York: Arco.

Glickman, M.E. (1995). Chess rating systems. American Chess Journal, 3, 59–102.

Howard, R.W. (2005). Are gender differences in high achievement disappearing? A test in one intellectual domain. Journal of Biosocial Science, 37, 371–380.

Lange, K.L., Little, R.J.A., & Taylor, J.M.G. (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84, 881–896.

Lehmann, E.L. (1983). Theory of point estimation. New York: Wiley.

Maass, A., D'Ettole, C., & Cadinu, M. (2008). Checkmate? The role of gender stereotypes in the ultimate intellectual sport. European Journal of Social Psychology, 38, 231–245.

R Development Core Team (2008). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. [http://www.r-project.org]

Acknowledgments

We thank Christopher Avery, Neil Charness, and Andrew Metrick for their comments on an earlier version of this article.