On Feature Selection, Bias-Variance, and Bagging

Art Munson (1), Rich Caruana (2)
(1) Department of Computer Science, Cornell University
(2) Microsoft Corporation
ECML-PKDD 2009
Munson; Caruana (Cornell; Microsoft) On Feature Selection ECML-PKDD / 22

Task: Model Presence/Absence of Birds
Tried: SVMs, boosted decision trees, bagged decision trees, neural networks.
Ultimate goal: understand avian population dynamics.
Ran feature selection to find the smallest feature set with excellent performance.

Bagging Likes Many Noisy Features (?)
[Figure: European Starling. RMS vs. # features; plot labels: "bagging", "all features".]

Surprised Reviewers
Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]."
Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."

Purpose of this Study
Does bagging often benefit from many features?
If so, why?

Outline
  1. Story Behind the Paper
  2. Background
  3. Experiment 1: FS and Bias-Variance
  4. Experiment 2: Weak, Noisy Features

Review of Bagging
Bagging: a simple ensemble learning algorithm [Bre96]:
  draw a random sample of the training data
  train a model using the sample (e.g. a decision tree)
  repeat N times (e.g. 25 times)
  bagged predictions: average the predictions of the N models
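The four steps above can be sketched in a few lines of standard-library Python. This is only an illustration, not the authors' implementation: the base learner `fit_stump` (a depth-1 regression tree), the function names, and the choice of bootstrap sampling with replacement are all assumptions made for the sketch.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """Hypothetical base learner: a depth-1 regression tree.

    Tries every (feature, threshold) split and keeps the one with the
    lowest sum of squared errors. Returns a prediction function.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            if not left or not right:
                continue  # a split must send examples to both sides
            ml, mr = mean(left), mean(right)
            sse = (sum((t - ml) ** 2 for t in left)
                   + sum((t - mr) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # no valid split exists: predict the mean label
        m = mean(y)
        return lambda row: m
    _, j, thr, ml, mr = best
    return lambda row: ml if row[j] <= thr else mr

def bag(X, y, n_models=25, seed=0, fit=fit_stump):
    """Train n_models base learners on bootstrap samples and average them.

    Assumption: each sample has the same size as the training set and is
    drawn with replacement.
    """
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: mean(m(row) for m in models)
```

Calling `predict = bag(X, y)` returns a function, and `predict(row)` is the average of the 25 stump predictions for `row`. Swapping in a full decision tree only requires passing a different `fit` function.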

Facts about Bagging
  Surprisingly competitive performance & rarely overfits [BK99].
  Main benefit is reducing variance of constituent models [BK99].
  Improves ability to ignore irrelevant features [AP96].

Review of Bias-Variance Decomposition
Error of a learning algorithm on example x comes from 3 sources:
  noise: intrinsic error / uncertainty for x's true label
  bias: how close, on average, the algorithm is to the optimal prediction
  variance: how much the prediction changes if the training set changes
Error decomposes as: error(x) = noise(x) + bias(x) + variance(x)
On real problems, bias and noise cannot be measured separately.
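For squared error, the decomposition can be written out explicitly. The notation here is an assumption, not taken from the slides: y^* = E[t | x] is the optimal prediction, y_D is the prediction of a model trained on a random training set D, and y_m = E_D[y_D] is the average prediction. The last line suggests why bias and noise cannot be separated on real data: a quantity computed from observed labels t and the average prediction y_m estimates their sum.

```latex
\begin{align*}
\mathrm{noise}(x)    &= \mathbb{E}_{t}\big[(t - y^{*})^{2}\big] \\
\mathrm{bias}(x)     &= (y^{*} - y_{m})^{2} \\
\mathrm{variance}(x) &= \mathbb{E}_{D}\big[(y_{m} - y_{D})^{2}\big] \\
\mathbb{E}_{t,D}\big[(t - y_{D})^{2}\big]
                     &= \mathrm{noise}(x) + \mathrm{bias}(x) + \mathrm{variance}(x) \\
\mathbb{E}_{t}\big[(t - y_{m})^{2}\big]
                     &= \mathrm{noise}(x) + \mathrm{bias}(x)
\end{align*}
```

Both identities rely on the cross terms vanishing, which holds when the test label t is independent of the training set D.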

Measuring Bias & Variance (Squared Error)
Generate an empirical distribution of the algorithm's predictions [BK99]:
  Randomly sample 1/2 of the training data.
  Train a model using the sample and make predictions y for the test data.
  Repeat R times (e.g. 20 times).
  Compute the average prediction y_m for every test example.
For each test example x with true label t:
  $\mathrm{bias}(x) = (t - y_m)^2$
  $\mathrm{variance}(x) = \frac{1}{R} \sum_{i=1}^{R} (y_m - y_i)^2$
Average over test cases to get the expected bias & variance for the algorithm.
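The measurement procedure above might look as follows in standard-library Python. This is a sketch under stated assumptions: the half-samples are drawn without replacement, `fit_1nn` is a toy one-feature base learner included only to make the example self-contained, and all names are hypothetical. The function also returns the average squared error, which for this estimator equals bias plus variance exactly.

```python
import random
from statistics import mean

def fit_1nn(X, y):
    """Toy base learner (an assumption): 1-nearest-neighbour on one numeric feature."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bias_variance(fit, X_train, y_train, X_test, y_test, repeats=20, seed=0):
    """Estimate bias (which includes noise), variance, and mean squared error."""
    rng = random.Random(seed)
    n = len(X_train)
    runs = []  # runs[i][k] = prediction of model i on test example k
    for _ in range(repeats):
        # Half of the training data, sampled without replacement (assumption).
        idx = rng.sample(range(n), n // 2)
        model = fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        runs.append([model(x) for x in X_test])

    bias, var, mse = [], [], []
    for k, t in enumerate(y_test):
        preds = [run[k] for run in runs]
        y_m = mean(preds)                               # average prediction
        bias.append((t - y_m) ** 2)                     # bias(x), noise included
        var.append(mean((y_m - p) ** 2 for p in preds)) # variance(x)
        mse.append(mean((t - p) ** 2 for p in preds))   # average squared error
    # Average over test cases.
    return mean(bias), mean(var), mean(mse)
```

Because the observed label t stands in for the optimal prediction, the first returned value is really bias plus noise, which matches the "bias/noise" label used in the plots.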

Review of Feature Selection
Forward Stepwise Feature Selection
  Start from an empty selected set.
  Evaluate the benefit of selecting each non-selected feature (train a model for each choice).
  Select the most beneficial feature.
  Repeat the search until a stopping criterion is met.
Correlation-based Feature Filtering
  Rank features by individual correlation with the class label.
  Choose a cutoff point (by statistical test or cross-validation).
  Keep features above the cutoff point. Discard the rest.
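Both procedures above are short enough to sketch in standard-library Python. Several details are assumptions not given on the slides: `score(subset)` is a caller-supplied function returning a validation error (lower is better), the stepwise search stops when no candidate improves the error, the filter ranks by absolute Pearson correlation, and the cutoff is a fixed count `k` rather than a statistical test.

```python
from math import sqrt

def forward_stepwise(n_features, score, max_features=None):
    """Greedy forward selection over feature indices 0..n_features-1."""
    selected, best_err = [], float("inf")
    limit = n_features if max_features is None else max_features
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        # One model evaluation per candidate feature.
        errs = {j: score(selected + [j]) for j in candidates}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:  # assumed stopping rule: no improvement
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return selected

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return 0.0 if sa == 0 or sb == 0 else cov / (sa * sb)

def correlation_filter(X, y, k):
    """Return the indices of the k features most correlated with the label."""
    cols = list(zip(*X))
    ranked = sorted(range(len(cols)),
                    key=lambda j: abs(pearson(cols[j], y)),
                    reverse=True)
    return ranked[:k]
```

The stepwise search trains one model per remaining feature at every step, while the filter needs a single pass over the columns. That cost difference is a plausible reason to prefer filtering on larger datasets.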

Experiment 1: Bias-Variance of Feature Selection
Summary:
  19 datasets
  order features using feature selection: forward stepwise feature selection or correlation feature filtering, depending on dataset size
  estimate bias & variance at multiple feature set sizes
  5-fold cross-validation
[Figure: Dataset Sizes. Axes: # features and # samples.]

Case 1: No Improvement from Feature Selection
[Figure: covtype. MSE vs. # features, split into variance and bias/noise, for a single decision tree and a bagged decision tree.]

Case 2: FS Improves Non-Bagged Model
[Figure: medis. MSE vs. # features, split into variance and bias/noise. Annotation: overfits with too many features.]

Take Away Points
  More features: lower bias/noise, higher variance.
  Feature selection does not improve bagged model performance (1 exception).
  Best subset size corresponds to the best bias/variance tradeoff point.
    Algorithm dependent.
    Relevant features may be discarded if the variance increase outweighs the extra information.

Why Does Bagging Benefit from so Many Features?
[Figure: cryst. MSE vs. # features, split into variance and bias/noise.]

Hypothesis
Bagging improves the base learner's ability to benefit from weak, noisy features.

Experiment 2: Noisy Informative Features
Summary:
  generate synthetic data (6 features)
  duplicate 1/2 of the features 20 times
  corrupt X% of values in the duplicated features
  train single and bagged trees with the corrupted features and the 3 non-duplicated features
  compare to: the ideal, unblemished feature set; and no noisy features (3 non-duplicated only)
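The feature construction in this summary can be sketched as below. The slides do not say how values are corrupted or which features are duplicated, so both are assumptions here: a corrupted value is replaced by a random draw from the same column, and the caller chooses the duplicated columns. The function name and parameters are hypothetical.

```python
import random

def make_noisy_copies(X, dup_cols, n_copies=20, frac=0.2, seed=0):
    """Build the 'non-duplicated + noisy copies' feature set.

    X        : list of rows (lists of feature values)
    dup_cols : indices of the features to duplicate and corrupt
    frac     : probability that any copied value is corrupted
    The originals of the duplicated features are dropped; only their
    corrupted copies are kept, after the non-duplicated features.
    """
    rng = random.Random(seed)
    keep_cols = [j for j in range(len(X[0])) if j not in dup_cols]
    out = [[row[j] for j in keep_cols] for row in X]
    for j in dup_cols:
        column = [row[j] for row in X]
        for _ in range(n_copies):
            for i, row in enumerate(out):
                value = column[i]
                if rng.random() < frac:
                    # Assumed corruption: replace with a random value
                    # drawn from the same feature's column.
                    value = rng.choice(column)
                row.append(value)
    return out
```

With 6 original features and 3 of them duplicated 20 times, each output row has 3 + 60 = 63 values, matching the "3 non-duplicated features + 60 noisy features" configuration. Sweeping `frac` from 0 to 1 would give the x-axis of a corruption experiment.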

Bagging Extracts More Info from Noisy Features
[Figure: MSE, split into variance and bias/noise, vs. fraction of feature values corrupted; plot labels "damaged" and "core". Reference levels: non-duplicated features (baseline) and 6 original features (ideal). Everything else: 3 non-duplicated features + 60 noisy features.]

Conclusions
After training 9,060,936 decision trees...
Experiment 1:
  More features: lower bias/noise, higher variance.
  Feature selection does not improve bagged model performance.
  Best subset size corresponds to the best bias/variance tradeoff point.
Experiment 2:
  Bagged trees are surprisingly good at extracting useful information from noisy features.
  Different weak features in different trees.

Bibliography
[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3), 1996.
[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2), 1999.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2), 1996.

Exception: Overfitting Pseudo-Identifiers
[Figure: bunting. MSE vs. # features, split into variance and bias/noise.]
