Statistical Methods in Computer Science

Statistical Methods in Computer Science
Experiment Design
Gal A. Kaminka
galk@cs.biu.ac.il

Experimental Lifecycle
[Figure: lifecycle diagram with boxes labeled "Vague idea", "'groping around' experiences", "Initial observations", "Model/Theory", "Hypothesis", "Experiment", "Data, analysis, interpretation", and "Results & final presentation"]
2006-now Gal Kaminka

A Slightly Revised View...
[Figure: diagram with boxes labeled "Model/Theory", "Analysis", "Hypothesis", and "Experiment"]

Proving a Theory?
We've discussed 4 methods of proving a proposition:
  Everyone knows it
  Someone specific says it
  An experiment supports it
  We can mathematically prove it
Some propositions cannot be verified empirically:
  "This mega-compiler has linear run-time"
  Infinite possible inputs --> cannot prove empirically
But they may still be disproved:
  e.g., code that causes the compiler to run non-linearly

Karl Popper's Philosophy of Science
Popper advanced a particular philosophy of science: falsifiability
For a theory to be considered scientific, it must be falsifiable
  There must be some way to refute it, in principle
  Not falsifiable <==> Not scientific
Examples:
  "All crows are black": falsifiable by finding a white crow
  "Compiles in linear time": falsifiable by non-linear performance
A theory is tested on its predictions

Proving by disproving...
Platt ("Strong Inference", 1964) offers a specific method:
  1) Devise alternative hypotheses for observations
  2) Devise experiment(s) allowing elimination of hypotheses
  3) Carry out experiments to obtain a clean result
  4) Go to 1.
The idea is to eliminate (falsify) hypotheses

Forming Hypotheses
So, to support theory X, we:
  1) Construct falsification hypotheses X_1, ..., X_n, ...
  2) Systematically experiment to disprove X by proving some X_i
  3) If all falsification hypotheses are eliminated, this lends support to X
Note that future falsification hypotheses may be formed
  The theory must continue to hold against attacks
  Popper: scientific evolution, "survival of the fittest theory"
How does this view hold in computer science?

Forming Hypotheses in CS
(1) Carefully identify the theoretical object we are studying:
  e.g., the relation between input-size and run-time is linear
  e.g., the algorithm causes robots to collect pucks better
  e.g., the display improves user performance
(2) Identify the falsification hypothesis (null hypothesis) H_0:
  e.g., there is an input-size for which run-time is non-linear
  e.g., the algorithm will cause robots to collect fewer pucks
  e.g., the display will have no effect on user performance
(3) Now, experiment to eliminate H_0
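The asymmetry between proving and disproving can be sketched in a few lines. This is only an illustration under stated assumptions: the claim "insertion sort makes at most 5n comparisons", the constant 5, and the helper names count_comparisons and find_counterexample are hypothetical and do not come from the slides. Surviving many inputs does not prove the claim; a single counterexample refutes it.

```python
def count_comparisons(values):
    """Insertion-sort a copy of `values` and return the number of comparisons made."""
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        j = i
        while j > 0:
            comparisons += 1
            if items[j - 1] <= items[j]:
                break  # element is in place
            items[j - 1], items[j] = items[j], items[j - 1]
            j -= 1
    return comparisons


def find_counterexample(cost, make_input, sizes, constant):
    """Return the first size n whose cost exceeds constant * n, or None if none is found."""
    for n in sizes:
        if cost(make_input(n)) > constant * n:
            return n
    return None


def sorted_input(n):
    return list(range(n))


def reversed_input(n):
    return list(range(n, 0, -1))


sizes = range(1, 200)
# Hypothetical claim under test: comparisons <= 5 * n for every input of size n.
survives = find_counterexample(count_comparisons, sorted_input, sizes, constant=5)
refuted_at = find_counterexample(count_comparisons, reversed_input, sizes, constant=5)
print("sorted inputs:", survives)      # no counterexample found, which proves nothing
print("reversed inputs:", refuted_at)  # one counterexample is enough to falsify
```

The choice of input generator matters: the claim survives every already-sorted input tried, and only an adversarial input family exposes the non-linear behaviour.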

The Basics of Experiment Design
Experiments identify a relation between variables X, Y, ...
Simple experiments: provide an indication of a relation
  Better/worse, linear or non-linear, ...
Advanced experiments: help identify causes, interactions
  e.g., linear in input size, but the constant factor depends on the type of data

Types of Experiments and Variables
Manipulation experiments
  Manipulate (= set value of) independent variables
  Observe (measure value of) dependent variables
Observation experiments
  Observe predictor variables
  Observe response variables
Other variables:
  Endogenous: on the causal path between independent and dependent
  Extraneous: other variables influencing dependent variables

An example observation experiment
Theory: Gender affects score performance
Falsifying hypothesis: Gender does not affect performance
Cannot use manipulation experiments: cannot control gender
  Must use observation experiments

An example observation experiment (à la Cohen, Empirical Methods for Artificial Intelligence, 1995)
[Figure: causal diagram for two children, shown on four consecutive slides]

  Variable              Child 1    Child 2
  # Siblings            2          3
  Mother                Artist     Doctor
  Gender                Male       Female
  Height                145cm      135cm
  Teacher's attitude    (no value shown)
  Child confidence      (no value shown)
  Test score            650        720

The four slides highlight, in turn:
  Independent (Predictor) Variables
  Dependent (Response) Variables
  Endogenous Variables
  Exogenous Variables

Experiment Design: Introduction
Different experiment types explore different hypotheses
For instance, a very simple design: the treatment experiment
  Sometimes known as a lesion study

  treatment:  Ind_1 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_1
  control:            Ex_1 & Ex_2 & ... & Ex_n ==> Dep_2

Treatment condition: with the independent variable
Control condition: without the independent variable

Comparison Experiments
An improvement over treatment experiments
Allow comparison of different conditions

  treatment 1:  Ind_1 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_1
  treatment 2:  Ind_2 & Ex_1 & Ex_2 & ... & Ex_n ==> Dep_2
  control:              Ex_1 & Ex_2 & ... & Ex_n ==> Dep_3

Compare performance of algorithm A to B to C...
Control condition: optional (e.g., to establish a baseline)

Example of Comparison Experiments
Compare performance of user interface A to B to C... (Kaminka and Elmaliach 2006)
[Figure: bar chart; y-axis "deviation [degree]" ranging from 0 to 6; x-axis labels as extracted: "Split&Tool", "Only Tool", "split"]

Careful!
An effect on the dependent variable may not be as expected
Example: an experiment
  Hypothesis: a fly's ear is on its wings
  Fly with two wings. Make loud noise. Observe flight.
  Fly with one wing. Make loud noise. No flight.
  Conclusion: a fly with only one wing cannot hear!
What's going on here?
  First, interpretation by the experimenter
  But also, lack of sufficient falsifiability:
    there are other possible explanations for why the fly wouldn't fly.

Controlling for other factors
Often, we cannot manipulate all extraneous variables
  Then, we need to make sure they are sampled randomly
  Randomization averages out their effect
This can be difficult
  e.g., suppose we are trying to relate gender and math
  We control for the effect of # of siblings by random sampling
  But # of siblings may be related to gender:
    parents continue to have children hoping for a boy (Beal 1994)
  Thus # of siblings is tied with gender
  Must separate results based on # of siblings
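When subjects can be assigned to conditions, one common way to let chance average out extraneous variables is random assignment. The sketch below is a minimal illustration, not a procedure from the slides; the function name randomly_assign, the subject labels, and the fixed seed are assumptions made for the example.

```python
import random


def randomly_assign(subjects, conditions, seed=None):
    """Shuffle the subjects, then deal them round-robin into the conditions."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, subject in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(subject)
    return groups


# Hypothetical subject pool.
subjects = [f"subject-{i}" for i in range(20)]
groups = randomly_assign(subjects, ["treatment", "control"], seed=0)
for condition, members in groups.items():
    print(condition, len(members))
```

Random assignment does not remove a dependency such as the one between # of siblings and gender; for that, the results still need to be separated by the confounded variable, as the slide notes.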

Factorial Experiment Designs
Every combination of factor values is sampled
  Hope is to exclude or reveal interactions
[Figure: two plots of performance against head turn velocity]
This creates a combinatorial number of experiments
  N factors, k values each = k^N combinations
Strategies for eliminating values:
  Merge values, categories. Skip values.
  Focus on extremes, to get a general trend.
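The k^N growth is easy to see by enumerating a small design. The following is a minimal sketch; apart from head turn velocity, the factor names and all of the values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical factors: N = 3 factors with k = 3 values each.
factors = {
    "head_turn_velocity": ["slow", "medium", "fast"],
    "display": ["A", "B", "C"],
    "lighting": ["dim", "normal", "bright"],
}


def full_factorial(factors):
    """Return one configuration (a dict) per combination of factor values."""
    names = list(factors)
    value_lists = [factors[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configurations = full_factorial(factors)
print(len(configurations))  # k^N = 3^3 = 27
print(configurations[0])
```

Merging or skipping values shrinks k, which is why the strategies on the slide pay off so quickly: dropping to two values per factor leaves 2^3 = 8 configurations.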

Tips for Factorial Experiments
For numerical variables, 2 value ranges are not enough
  They don't give a good sense of the function relating the variables.
Measure, measure, measure.
  Piggybacking measurement: cheaper than re-running experiments
Simplify comparisons:
  Use the same number of data points (trials) for all configurations

Experiment Validity
Types of validity: internal and external
Internal validity:
  The experiment shows a relationship (independent causes dependent)
External validity:
  Degree to which results generalize to other conditions
Threats: uncontrolled conditions threatening validity

Internal validity threats: Examples
Order effects
  Practice effects in human or animal test subjects
  A bug in the testing system leaves the system unclean for the next trial
Demand effects
  The experimenter influences the subject
  e.g., by answering questions of subjects
Confounding effects
  See "fly with no wings cannot hear"

Order Effects
Order effects can confound results
If treatment and control are given in two different orders:
  e.g., an order good for treatment, bad for control (or vice versa)
  Solution: counter-balancing (all possible orders to all groups)
If treatment and control are given in the exact same order:
  Practice effects in humans and animals
  Solution: randomize the order of presentation to subjects
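Both remedies can be sketched briefly. This is an illustrative sketch under assumptions: the function names, the subject labels, and the round-robin allocation of orders are choices made for the example, not part of the slides.

```python
import random
from itertools import permutations


def counterbalanced_orders(conditions, subjects):
    """Cycle through every possible presentation order so each is used equally often."""
    orders = list(permutations(conditions))
    return {subject: orders[i % len(orders)] for i, subject in enumerate(subjects)}


def randomized_orders(conditions, subjects, seed=None):
    """Draw an independent random presentation order for every subject."""
    rng = random.Random(seed)
    assignment = {}
    for subject in subjects:
        order = list(conditions)
        rng.shuffle(order)
        assignment[subject] = tuple(order)
    return assignment


# Hypothetical subjects and conditions.
subjects = ["s1", "s2", "s3", "s4"]
balanced = counterbalanced_orders(["treatment", "control"], subjects)
shuffled = randomized_orders(["treatment", "control"], subjects, seed=1)
print(balanced)
print(shuffled)
```

Full counterbalancing needs a multiple of n! subjects for n conditions, so with more than a few conditions randomized orders are usually the more practical option.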

External threats to validity
Sampling bias: non-representative samples
  e.g., non-representative external factors
Floor and ceiling effects
  Problems tested are too hard or too easy
Regression effects
  Results have no way to go but up or down
Solution approach: run pilot experiments

Sampling Bias
The setup prefers setting/measuring specific values over others
  For instance: including only results that were found by some deadline
Solution: detect, and remove
  e.g., by visualization, looking for non-normal distributions
  e.g., a surprising distribution of dependent data for different values of the independent variable

Baselines: Floor and Ceiling Effects
How do we know A is good? Bad?
  Maybe the problems are too simple? Too hard?
For example:
  A new machine learning algorithm has 95% accuracy
  Is this good?
Controlling for floor/ceiling:
  Establish baselines
  Find the range of inputs
  Check whether a silly approach achieves a close result
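A "silly approach" for a classifier can be as simple as always predicting the most frequent label. The sketch below assumes a hypothetical test set with a 94/6 class split; the numbers are invented to illustrate a ceiling effect, not results from any experiment.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical, heavily imbalanced test set.
labels = ["negative"] * 94 + ["positive"] * 6
baseline = majority_baseline_accuracy(labels)
claimed_accuracy = 0.95  # the figure quoted on the slide

print(f"baseline: {baseline:.2f}")
print(f"margin over baseline: {claimed_accuracy - baseline:.2f}")
```

On such data, 95% accuracy is one point above doing nothing clever, so the headline number says little until the baseline is reported next to it.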

Regression Effects
General phenomenon: regression towards the mean
  Repeated measurement converges towards mean values
Example threat: run a program on 100 different inputs
  Problems 6, 14, 15 get a very low score
  We now fix the problem, and want to re-test
  If chance has anything to do with scoring, then we must re-run all
Why?
  Scores on 6, 14, 15 have nowhere to go but up.
  So re-running only these problems will show improvement by chance
Solution: re-run complete tests, or sample conditions uniformly
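A small simulation makes the threat concrete. It is a sketch under assumed numbers: each problem has a fixed underlying quality, every run adds independent Gaussian noise, and nothing at all is fixed between the two runs. The function name, score range, and noise level are hypothetical.

```python
import random


def simulate_retest(num_problems=100, num_worst=3, noise=10.0, seed=0):
    """Score every problem twice with no change in between.

    Returns (mean gain on the worst first-run problems, mean gain on all problems).
    """
    rng = random.Random(seed)
    quality = [rng.uniform(40, 60) for _ in range(num_problems)]
    first = [q + rng.gauss(0, noise) for q in quality]
    second = [q + rng.gauss(0, noise) for q in quality]  # same program, new noise
    worst = sorted(range(num_problems), key=lambda i: first[i])[:num_worst]
    gain_worst = sum(second[i] - first[i] for i in worst) / num_worst
    gain_all = sum(s - f for f, s in zip(first, second)) / num_problems
    return gain_worst, gain_all


# Average over several simulated experiments to smooth out single-run luck.
results = [simulate_retest(seed=s) for s in range(50)]
mean_gain_worst = sum(r[0] for r in results) / len(results)
mean_gain_all = sum(r[1] for r in results) / len(results)
print(f"mean gain, worst 3 re-run alone: {mean_gain_worst:.1f}")
print(f"mean gain, all 100 re-run:       {mean_gain_all:.1f}")
```

The problems selected for their low first score improve on re-test even though nothing changed, while the full re-run shows a gain near zero; that is the reason to re-run the complete test set.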

Summary
Defensive thinking
  If I were trying to disprove the claim, what would I do?
  Then think of ways to counter any possible attack on the claim
Strong Inference, Popper's falsification ideas
  Science moves by disproving theories (empirically)
Experiment design:
  Carefully think through threats
  Ideal independent variables: easy to manipulate
  Ideal dependent variables: measurable, sensitive, and meaningful
Next week: Hypothesis testing (?)
