Statistical Methods in Computer Science: Experiment Design
Gal A. Kaminka
galk@cs.biu.ac.il
Experimental Lifecycle
[Cycle diagram: vague idea → groping around / experiences → initial observations → Model/Theory → Hypothesis → Experiment → data, analysis, interpretation → Results & final presentation]
2006-now Gal Kaminka
A Slightly Revised View...
[Cycle diagram: Model/Theory → Hypothesis → Experiment → Analysis → back to Model/Theory]
Proving a Theory?
- We've discussed four methods of "proving" a proposition:
  - Everyone knows it
  - Someone specific says it
  - An experiment supports it
  - We can mathematically prove it
- Some propositions cannot be verified empirically:
  - "This mega-compiler has linear run-time"
  - Infinitely many possible inputs --> cannot prove empirically
- But they may still be disproved:
  - e.g., code that causes the compiler to run non-linearly
Karl Popper's Philosophy of Science
- Popper advanced a particular philosophy of science: falsifiability
- For a theory to be considered scientific, it must be falsifiable: there must be some way to refute it, in principle
- Not falsifiable <==> not scientific
- Examples:
  - "All crows are black": falsifiable by finding a white crow
  - "Compiles in linear time": falsifiable by non-linear performance
- A theory is tested on its predictions
Proving by Disproving...
- Platt ("Strong Inference", 1964) offers a specific method:
  1) Devise alternative hypotheses for the observations
  2) Devise experiment(s) allowing elimination of the hypotheses
  3) Carry out the experiments to obtain a clean result
  4) Go to 1
- The idea is to eliminate (falsify) hypotheses
Forming Hypotheses
- So, to support theory X, we:
  1) Construct falsification hypotheses X1, ..., Xn, ... (each Xi, if true, would disprove X)
  2) Systematically experiment to eliminate each Xi
  3) If all falsification hypotheses are eliminated, this lends support to X
- Note that new falsification hypotheses may be formed in the future: the theory must continue to hold against attacks
- Popper: scientific evolution, survival of the fittest theory
- How does this view hold up in computer science?
Forming Hypotheses in CS
- (1) Carefully identify the theoretical claim we are studying:
  - e.g., the relation between input size and run-time is linear
  - e.g., the algorithm causes robots to collect pucks better
  - e.g., the display improves user performance
- (2) Identify the falsification hypothesis (null hypothesis) H0:
  - e.g., there is an input size for which run-time is non-linear
  - e.g., the algorithm causes robots to collect fewer pucks
  - e.g., the display has no effect on user performance
- (3) Now, experiment to eliminate H0
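As an illustration of the first claim, linearity of run-time in input size can be probed (never proven, per Popper) by fitting a line to measurements and inspecting the fit. A minimal sketch in Python; the timing numbers below are made up for illustration, not real measurements:

```python
# Sketch: probing H0 ("run-time is non-linear in input size") by
# fitting a line to (size, time) measurements.

def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def r_squared(xs, ys, a, b):
    """Fraction of variance explained by the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

sizes = [1000, 2000, 4000, 8000, 16000]
times = [0.11, 0.19, 0.42, 0.79, 1.62]   # hypothetical timings, roughly linear

a, b = fit_line(sizes, times)
r2 = r_squared(sizes, times, a, b)
# A high R² is evidence against H0 over the tested range, but a single
# non-linear input would still falsify the linearity claim.
print(round(r2, 3))
```

Note the asymmetry: the fit can only fail to falsify linearity on the sampled sizes; it cannot rule out non-linear behavior at untested input sizes.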
The Basics of Experiment Design
- Experiments identify a relation between variables X, Y, ...
- Simple experiments: provide an indication of a relation (better/worse, linear or non-linear, ...)
- Advanced experiments: help identify causes and interactions (e.g., linear in input size, but the constant factor depends on the type of data)
Types of Experiments and Variables
- Manipulation experiments:
  - Manipulate (= set the value of) independent variables
  - Observe (= measure the value of) dependent variables
- Observation experiments:
  - Observe predictor variables
  - Observe response variables
- Other variables:
  - Endogenous: on the causal path between independent and dependent variables
  - Extraneous: other variables influencing the dependent variables
An Example Observation Experiment
- Theory: gender affects test-score performance
- Falsifying hypothesis: gender does not affect performance
- Cannot use manipulation experiments: we cannot control gender
- Must use observation experiments
An Example Observation Experiment (after "Empirical Methods in AI", Cohen 1995)
[Figure, repeated four times with different highlights: two children (# siblings: 2 / 3; mother: artist / doctor; gender: male / female; height: 145cm / 135cm), each influencing the teacher's attitude and the child's confidence, leading to test scores of 650 and 720.]
- First: highlighting the independent (predictor) variables
- Second: highlighting the dependent (response) variables (the test scores)
- Third: highlighting the endogenous variables (teacher's attitude, child's confidence)
- Fourth: highlighting the exogenous variables
Experiment Design: Introduction
- Different experiment types explore different hypotheses
- A very simple design: the treatment experiment (sometimes known as a lesion study)
  - treatment: Ind1 & Ex1 & Ex2 & ... & Exn ==> Dep1
  - control: Ex1 & Ex2 & ... & Exn ==> Dep2
- Treatment condition: with the independent variable
- Control condition: without the independent variable
Comparison Experiments
- An improvement over treatment experiments: allow comparison of different conditions
  - treatment 1: Ind1 & Ex1 & Ex2 & ... & Exn ==> Dep1
  - treatment 2: Ind2 & Ex1 & Ex2 & ... & Exn ==> Dep2
  - control: Ex1 & Ex2 & ... & Exn ==> Dep3
- Compare the performance of algorithm A to B to C ...
- Control condition: optional (e.g., to establish a baseline)
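The scheme above can be sketched as a small experiment harness. The algorithms, inputs, and scoring function here are hypothetical stand-ins; the essential point is that every condition (treatment 1, treatment 2, control) faces the same extraneous factors, i.e., the same inputs in the same order:

```python
# Sketch: a comparison experiment with two treatments and a control.
import random

def sorted_builtin(xs):      # "treatment 1": algorithm A
    return sorted(xs)

def insertion_sort(xs):      # "treatment 2": algorithm B
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def identity(xs):            # "control": do nothing, to establish a baseline
    return list(xs)

def run_condition(algorithm, inputs):
    """Dependent variable: mean fraction of adjacent pairs in order."""
    scores = []
    for xs in inputs:
        ys = algorithm(xs)
        ok = sum(a <= b for a, b in zip(ys, ys[1:]))
        scores.append(ok / max(1, len(ys) - 1))
    return sum(scores) / len(scores)

# Fixed seed, so all conditions see identical inputs (same extraneous factors).
rng = random.Random(0)
inputs = [[rng.randint(0, 99) for _ in range(20)] for _ in range(30)]

results = {name: run_condition(alg, inputs)
           for name, alg in [("treatment1", sorted_builtin),
                             ("treatment2", insertion_sort),
                             ("control", identity)]}
```

Sharing inputs across conditions is the code-level analogue of holding Ex1 ... Exn fixed while only the independent variable (the algorithm) changes.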
Example of Comparison Experiments
- Compare the performance of user interface A to B to C ... (Kaminka and Elmaliach, 2006)
[Figure: bar chart of deviation in degrees (0 to 6) for three conditions: Split & Tool, Only Tool, Split]
Careful!
- An effect on the dependent variable may not be what it seems
- Example experiment:
  - Hypothesis: a fly's ears are on its wings
  - Fly with two wings: make a loud noise, observe flight
  - Fly with one wing: make a loud noise, no flight
  - Conclusion: a fly with only one wing cannot hear!
- What's going on here?
  - First, over-interpretation by the experimenter
  - But also, lack of sufficient falsifiability: there are other possible explanations for why the fly wouldn't fly
Controlling for Other Factors
- Often, we cannot manipulate all extraneous variables
- Then we need to make sure they are sampled randomly: randomization averages out their effect
- This can be difficult. E.g., suppose we are trying to relate gender and math scores:
  - We control for the effect of # of siblings by random sampling
  - But # of siblings may itself be related to gender: parents may continue to have children hoping for a boy (Beal 1994)
  - Thus # of siblings is tied to gender
  - We must separate results based on # of siblings
Factorial Experiment Designs
- Every combination of factor values is sampled
- The hope is to exclude or reveal interactions
[Figure: two plots of performance vs. head-turn velocity, illustrating an interaction between factors]
- This creates a combinatorial number of experiments: N factors with k values each = k^N combinations
- Strategies for eliminating values:
  - Merge values into categories
  - Skip values
  - Focus on extremes, to get a general trend
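Enumerating a full factorial design is a one-liner with a Cartesian product. The factors and levels below are invented for illustration; the point is the k^N blow-up:

```python
# Sketch: a full factorial design as a Cartesian product of factor levels.
from itertools import product

factors = {
    "algorithm": ["A", "B", "C"],        # 3 levels (hypothetical)
    "input_size": [1000, 10000, 100000], # 3 levels
    "noise": ["low", "high"],            # 2 levels
}

names = list(factors)
design = [dict(zip(names, combo)) for combo in product(*factors.values())]

# k^N growth: 3 * 3 * 2 = 18 configurations here; 5 factors with
# 4 levels each would already need 4**5 = 1024 experiments.
print(len(design))  # 18
```

Merging levels or dropping intermediate values shrinks the per-factor k, which is why the strategies above attack the base of the exponent rather than the exponent itself.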
Tips for Factorial Experiments
- For numerical variables, 2 value ranges are not enough: they don't give a good sense of the function relating the variables
- Measure, measure, measure: piggybacking measurements is cheaper than re-running experiments
- Simplify comparisons: use the same number of data points (trials) for all configurations
Experiment Validity
- Two types of validity: internal and external
- Internal validity: the experiment shows the relationship (the independent variable causes the dependent)
- External validity: the degree to which results generalize to other conditions
- Threats: uncontrolled conditions threatening validity
Internal Validity Threats: Examples
- Order effects:
  - Practice effects in human or animal test subjects
  - A bug in the testing system leaves the system unclean for the next trial
- Demand effects:
  - The experimenter influences the subjects, e.g., by answering subjects' questions
- Confounding effects:
  - See "fly with only one wing cannot hear"
Order Effects
- Order effects can confound results
- If treatment/control are given in two different orders (e.g., a good order for treatment, a bad one for control, or vice versa):
  - Solution: counterbalancing (all possible orders, across groups)
- If treatment/control are given in the exact same order:
  - Practice effects in humans and animals
  - Solution: randomize the order of presentation to subjects
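Both remedies above are easy to mechanize. A sketch with hypothetical condition and subject names, showing counterbalancing (every permutation, assigned round-robin) next to per-subject randomization:

```python
# Sketch: counterbalancing vs. randomizing presentation order.
import random
from itertools import permutations

conditions = ["treatment", "control", "baseline"]   # illustrative names

# Counterbalancing: enumerate every possible presentation order and
# spread them evenly across subjects.
orders = list(permutations(conditions))             # 3! = 6 orders
subjects = [f"subject{i}" for i in range(12)]
assignment = {s: orders[i % len(orders)] for i, s in enumerate(subjects)}

# Randomization: each subject gets an independently shuffled order.
rng = random.Random(42)                             # seed for reproducibility
randomized = {s: rng.sample(conditions, k=len(conditions)) for s in subjects}

print(len(orders))  # 6
```

Counterbalancing guarantees every order appears equally often (here, twice across 12 subjects), while randomization only balances orders in expectation; with few subjects, counterbalancing is usually the safer choice.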
External Threats to Validity
- Sampling bias: non-representative samples (e.g., non-representative external factors)
- Floor and ceiling effects: test problems that are too hard or too easy
- Regression effects: results that have nowhere to go but up (or down)
- Solution approach: run pilot experiments
Sampling Bias
- Preferring to set/measure specific values over others
- For instance: including only results that were found by some deadline
- Solution: detect and remove
  - e.g., by visualization, looking for non-normal distributions
  - e.g., a surprising distribution of the dependent data for different values of the independent variable
Baselines: Floor and Ceiling Effects
- How do we know A is good? Bad? Maybe the problems are too simple? Too hard?
- For example: a new machine learning algorithm has 95% accuracy. Is this good?
- Controlling for floor/ceiling effects:
  - Establish baselines
  - Find the range of inputs
  - Check whether a trivial approach achieves a result close to A's
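The 95%-accuracy example can hide a ceiling effect. A sketch with a synthetic label set: if 95% of examples share one label, always predicting that label already scores 95%, so the impressive-sounding number may sit exactly on a trivial baseline:

```python
# Sketch: a majority-class baseline for the "95% accuracy" example.
from collections import Counter

# Synthetic data: a 95/5 class split (illustrative, not a real dataset).
labels = ["spam"] * 95 + ["ham"] * 5

# A "silly" classifier: always predict the most common label.
majority = Counter(labels).most_common(1)[0][0]
baseline_accuracy = sum(y == majority for y in labels) / len(labels)

# If the learned model also scores ~95%, it has shown nothing beyond
# the baseline: the task ceiling, not the model, explains the number.
print(baseline_accuracy)  # 0.95
```

This is why "establish baselines" comes first: the baseline defines where the floor and ceiling of the task actually lie.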
Regression Effects
- General phenomenon: regression towards the mean (repeated measurements converge towards mean values)
- Example threat:
  - Run a program on 100 different inputs; problems 6, 14, 15 get a very low score
  - We now fix the program, and want to re-test
  - If chance has anything to do with scoring, then we must re-run all inputs
  - Why? Scores on problems 6, 14, 15 have nowhere to go but up, so re-running only these problems will show improvement by chance
- Solution: re-run the complete test suite, or sample conditions uniformly
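The threat above can be simulated. In this sketch every input has the same true skill level plus deterministic toy "noise"; nothing about the program changes between runs, yet re-running only the worst-scoring inputs shows a large apparent improvement:

```python
# Sketch: regression to the mean with deterministic toy noise.
n = 100
noise = [-20, -10, 0, 10, 20]   # cyclic noise pattern, mean 0 (illustrative)

run1 = [50 + noise[i % 5] for i in range(n)]        # first test run
run2 = [50 + noise[(i + 2) % 5] for i in range(n)]  # "after the fix": noise merely shifted

worst = sorted(range(n), key=lambda i: run1[i])[:3]  # the 3 lowest scores
before = sum(run1[i] for i in worst) / 3
after = sum(run2[i] for i in worst) / 3

# The worst scores had nowhere to go but up, while overall
# performance is exactly unchanged.
print(before, after)  # 30.0 50.0
```

The selected problems "improve" from 30 to 50 purely because they were selected for being unlucky, which is exactly why only a full re-run (or uniform sampling) can distinguish a real fix from chance.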
Summary
- Defensive thinking: if I were trying to disprove the claim, what would I do? Then think of ways to counter any possible attack on the claim
- Strong Inference and Popper's falsification ideas: science moves by (empirically) disproving theories
- Experiment design: carefully think through the threats
  - Ideal independent variables: easy to manipulate
  - Ideal dependent variables: measurable, sensitive, and meaningful
- Next week: hypothesis testing (?)