On Feature Selection, Bias-Variance, and Bagging

Art Munson (Department of Computer Science, Cornell University)
Rich Caruana (Microsoft Corporation)

ECML-PKDD 2009
Task: Model Presence/Absence of Birds

Tried: SVMs, boosted decision trees, bagged decision trees, neural networks.

Ultimate goal: understand avian population dynamics.

Ran feature selection to find the smallest feature set with excellent performance.
Bagging Likes Many Noisy Features (?)

[Figure: RMS error vs. # features (0-30) on the European Starling task; curves labeled "bagging" and "all features".]
Surprised Reviewers

Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]."

Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."
Purpose of this Study

Does bagging often benefit from many features? If so, why?
Outline

1. Story Behind the Paper
2. Background
3. Experiment 1: FS and Bias-Variance
4. Experiment 2: Weak, Noisy Features
Review of Bagging

Bagging: simple ensemble learning algorithm [Bre96]:
- Draw a bootstrap sample (random sample with replacement) of the training data.
- Train a model on the sample (e.g. a decision tree).
- Repeat N times (e.g. 25 times).
- Bagged prediction: average the predictions of the N models.
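A minimal Python sketch of this procedure (scikit-learn trees; the function name and defaults are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Bagging sketch: fit one tree per bootstrap sample, average predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap sample: n draws with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)  # bagged prediction = average of the N models
```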
Facts about Bagging

- Surprisingly competitive performance; rarely overfits [BK99].
- Main benefit is reducing the variance of the constituent models [BK99].
- Improves the ability to ignore irrelevant features [AP96].
Review of Bias-Variance Decomposition

Error of a learning algorithm on example x comes from 3 sources:
- noise: intrinsic error/uncertainty in x's true label
- bias: how close, on average, the algorithm is to the optimal prediction
- variance: how much the prediction changes if the training set changes

Error decomposes as: error(x) = noise(x) + bias(x) + variance(x)

On real problems, bias and noise cannot be measured separately.
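For squared error, one standard way to write this decomposition explicitly (here $D$ ranges over training sets, $y$ is the model's prediction on $x$, $\bar{y} = \mathbb{E}_D[y]$ is the average prediction, and $t^*$ is the optimal prediction for $x$):

$$
\mathbb{E}_{D,t}\!\left[(y - t)^2\right]
  = \underbrace{\mathbb{E}_t\!\left[(t - t^*)^2\right]}_{\text{noise}(x)}
  + \underbrace{(t^* - \bar{y})^2}_{\text{bias}(x)}
  + \underbrace{\mathbb{E}_D\!\left[(y - \bar{y})^2\right]}_{\text{variance}(x)}
$$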
Measuring Bias & Variance (Squared Error)

Generate an empirical distribution of the algorithm's predictions [BK99]:
- Randomly sample 1/2 of the training data.
- Train a model on the sample and make predictions y for the test data.
- Repeat R times (e.g. 20 times).
- Compute the average prediction $\bar{y}$ for every test example.

For each test example x with true label t:

$$
\text{bias}(x) = (t - \bar{y})^2, \qquad
\text{variance}(x) = \frac{1}{R} \sum_{i=1}^{R} (\bar{y} - y_i)^2
$$

Average over test cases to get the expected bias & variance for the algorithm.
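A self-contained sketch of this measurement recipe in Python (numpy + scikit-learn; the function name, R=20 default, and the choice of a decision tree are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def estimate_bias_variance(X_train, y_train, X_test, t, R=20, seed=0):
    """Train R models on random half-samples of the training data and
    return the expected bias(+noise) and variance over the test set."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(R):
        idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.asarray(preds)               # shape: R x n_test
    y_bar = preds.mean(axis=0)              # average prediction per test example
    bias = (t - y_bar) ** 2                 # bias + noise (inseparable on real data)
    variance = ((preds - y_bar) ** 2).mean(axis=0)
    return bias.mean(), variance.mean()
```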
Review of Feature Selection

Forward Stepwise Feature Selection
- Start from an empty selected set.
- Evaluate the benefit of adding each non-selected feature (training a model for each choice).
- Select the most beneficial feature.
- Repeat the search until a stopping criterion is met.

Correlation-based Feature Filtering
- Rank features by their individual correlation with the class label.
- Choose a cutoff point (by statistical test or cross-validation).
- Keep features above the cutoff point; discard the rest.
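Hedged sketches of both searches in Python (scikit-learn for model evaluation; the fixed n_select / top-k stopping rules stand in for the criteria described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y, n_select):
    """Greedily add the feature whose inclusion most improves the CV score."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        # evaluate the benefit of adding each non-selected feature
        scores = {f: cross_val_score(DecisionTreeRegressor(),
                                     X[:, selected + [f]], y,
                                     scoring="neg_mean_squared_error").mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # most beneficial feature
        selected.append(best)
        remaining.remove(best)
    return selected

def correlation_filter(X, y, k):
    """Keep the k features with the largest |correlation| with the label."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return list(np.argsort(corr)[::-1][:k])
```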
Experiment 1: Bias-Variance of Feature Selection

Summary:
- 19 datasets.
- Order features using feature selection: forward stepwise feature selection or correlation feature filtering, depending on dataset size.
- Estimate bias & variance at multiple feature set sizes (5-fold cross-validation).

[Figure: dataset sizes; # features (10 to 1e+06, log scale) vs. # samples (100 to 100,000, log scale).]
Case 1: No Improvement from Feature Selection

[Figure: MSE on covtype, split into variance and bias/noise, vs. # features (1-54), for a single decision tree and a bagged decision tree.]
Case 2: FS Improves Non-Bagged Model

[Figure: MSE on medis, split into variance and bias/noise, vs. # features (1-63); the non-bagged model overfits with too many features.]
Take Away Points

- More features: lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance (1 exception).
- The best subset size corresponds to the best bias/variance tradeoff point.
  - Algorithm dependent.
  - Relevant features may be discarded if the variance increase outweighs the extra information.
Why Does Bagging Benefit from so Many Features?

[Figure: MSE on cryst, split into variance and bias/noise, vs. # features (1-1,341).]
Hypothesis

Bagging improves the base learner's ability to benefit from weak, noisy features.
Experiment 2: Noisy Informative Features

Summary:
- Generate synthetic data (6 features).
- Duplicate 1/2 of the features 20 times.
- Corrupt X% of the values in the duplicated features.
- Train single and bagged trees with the corrupted features plus the 3 non-duplicated features.
- Compare to: the ideal, unblemished feature set, and no noisy features (3 non-duplicated only).
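A sketch of the corruption step, assuming corrupted values are replaced by draws from the (shuffled) empirical distribution of the duplicated features; the paper's exact noise model may differ:

```python
import numpy as np

def make_noisy_features(X, frac_corrupt, n_copies=20, seed=0):
    """Duplicate the first half of the features n_copies times and corrupt
    a fraction of the duplicated values (6 features -> 3 clean + 60 noisy)."""
    rng = np.random.default_rng(seed)
    half = X.shape[1] // 2
    noisy = np.tile(X[:, :half], (1, n_copies)).astype(float)
    mask = rng.random(noisy.shape) < frac_corrupt         # values to corrupt
    shuffled = rng.permutation(noisy.ravel()).reshape(noisy.shape)
    noisy[mask] = shuffled[mask]                          # replace with random values
    return np.hstack([X[:, half:], noisy])                # non-duplicated + noisy
```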
Bagging Extracts More Info from Noisy Features

[Figure: MSE, split into variance and bias/noise, vs. fraction of feature values corrupted (0.0-1.0). Reference lines: 6 original features (ideal) and 3 non-duplicated features (baseline); all other curves use the 3 non-duplicated features plus the 60 noisy features.]
Conclusions

After training 9,060,936 decision trees...

Experiment 1:
- More features: lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance.
- The best subset size corresponds to the best bias/variance tradeoff point.

Experiment 2:
- Bagged trees are surprisingly good at extracting useful information from noisy features.
- Different weak features are used in different trees.
Bibliography

[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173-202, 1996.

[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.

[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Exception: Overfitting Pseudo-Identifiers

[Figure: MSE on bunting, split into variance and bias/noise, vs. # features (1-175).]