On Feature Selection, Bias-Variance, and Bagging


On Feature Selection, Bias-Variance, and Bagging
Art Munson (Department of Computer Science, Cornell University)
Rich Caruana (Microsoft Corporation)
ECML-PKDD 2009

Task: Model Presence/Absence of Birds
Tried: SVMs, boosted decision trees, bagged decision trees, neural networks.
Ultimate goal: understand avian population dynamics.
We ran feature selection to find the smallest feature set with excellent performance.

Bagging Likes Many Noisy Features (?)
[Figure: RMS error (0.35-0.385) vs. number of features (0-30) for the European Starling task, with a reference line for bagging on all features.]

Surprised Reviewers
Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]."
Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."

Purpose of this Study
Does bagging often benefit from many features? If so, why?

Outline
1. Story Behind the Paper
2. Background
3. Experiment 1: FS and Bias-Variance
4. Experiment 2: Weak, Noisy Features

Review of Bagging
Bagging is a simple ensemble learning algorithm [Bre96]:
- Draw a random sample of the training data.
- Train a model (e.g. a decision tree) on the sample.
- Repeat N times (e.g. 25 times).
- Bagged prediction: average the predictions of the N models.
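The steps above can be sketched in a few lines. This is not the authors' code: the `fit_tree` callback and its `.predict` interface are assumptions standing in for any base learner.

```python
import numpy as np

def bagged_predict(train_X, train_y, test_X, fit_tree, n_models=25, seed=0):
    """Bagging sketch: average the predictions of N models, each trained
    on a bootstrap sample of the training data. `fit_tree` is any
    function (X, y) -> model exposing a .predict(X) method (assumption)."""
    rng = np.random.default_rng(seed)
    n = len(train_y)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)      # bootstrap: sample n rows with replacement
        model = fit_tree(train_X[idx], train_y[idx])
        preds.append(model.predict(test_X))
    return np.mean(preds, axis=0)             # bagged prediction = per-example average
```

In practice `fit_tree` would wrap an unpruned decision tree, since bagging works best with low-bias, high-variance base learners.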

Facts about Bagging
- Surprisingly competitive performance; rarely overfits [BK99].
- Main benefit is reducing the variance of the constituent models [BK99].
- Improves the ability to ignore irrelevant features [AP96].

Review of the Bias-Variance Decomposition
The error of a learning algorithm on example x comes from three sources:
- noise: intrinsic error / uncertainty in x's true label
- bias: how close, on average, the algorithm is to the optimal prediction
- variance: how much the prediction changes if the training set changes
The error decomposes as: error(x) = noise(x) + bias(x) + variance(x).
On real problems, bias and noise cannot be measured separately.

Measuring Bias & Variance (Squared Error)
Generate an empirical distribution of the algorithm's predictions [BK99]:
- Randomly sample 1/2 of the training data.
- Train a model on the sample and make predictions y for the test data.
- Repeat R times (e.g. 20 times).
- Compute the average prediction y_m for every test example.
For each test example x with true label t:
bias(x) = (t - y_m)^2
variance(x) = (1/R) * sum_{i=1}^{R} (y_m - y_i)^2
Average over the test cases to get the expected bias & variance for the algorithm.
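The measurement procedure can be sketched directly from the slide; as before, the `fit` callback returning a model with `.predict` is an assumption, and the bias term here absorbs the irreducible noise, exactly as noted on the previous slide.

```python
import numpy as np

def bias_variance(train_X, train_y, test_X, test_y, fit, R=20, seed=0):
    """Bauer/Kohavi-style estimate under squared error: train on R random
    half-samples of the training set, then decompose test error into
    bias(x) = (t - y_m)^2 and variance(x) = mean_i (y_i - y_m)^2."""
    rng = np.random.default_rng(seed)
    n = len(train_y)
    preds = np.empty((R, len(test_y)))
    for i in range(R):
        idx = rng.choice(n, size=n // 2, replace=False)  # half of the training data
        preds[i] = fit(train_X[idx], train_y[idx]).predict(test_X)
    y_m = preds.mean(axis=0)                    # average prediction per test case
    bias = (test_y - y_m) ** 2                  # bias/noise (not separable)
    variance = ((preds - y_m) ** 2).mean(axis=0)
    return bias.mean(), variance.mean()         # expected bias & variance
```

A model that always predicts a constant has zero variance and all of its error in the bias/noise term, which is a quick sanity check for the estimator.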

Review of Feature Selection
Forward Stepwise Feature Selection
- Start from an empty selected set.
- Evaluate the benefit of adding each non-selected feature (train a model for each choice).
- Select the most beneficial feature.
- Repeat the search until a stopping criterion is met.
Correlation-based Feature Filtering
- Rank features by their individual correlation with the class label.
- Choose a cutoff point (by statistical test or cross-validation).
- Keep the features above the cutoff; discard the rest.
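The filtering variant is simple enough to sketch; this is a minimal illustration assuming plain Pearson correlation and a fixed top-k cutoff (the slide's statistical-test or cross-validation cutoff selection is omitted for brevity).

```python
import numpy as np

def correlation_filter(X, y, k):
    """Correlation-based feature filtering sketch: rank features by the
    absolute Pearson correlation of each column with the label and keep
    the indices of the top k features."""
    Xc = X - X.mean(axis=0)                    # center each feature column
    yc = y - y.mean()                          # center the label
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / denom           # |Pearson r| per feature
    ranked = np.argsort(-corr)                 # strongest correlation first
    return ranked[:k]                          # indices of the kept features
```

Forward stepwise selection differs by scoring features jointly (retraining a model per candidate), which is far more expensive but can capture interactions the filter misses.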

Experiment 1: Bias-Variance of Feature Selection
Summary:
- 19 datasets
- Order features using feature selection (forward stepwise selection or correlation filtering, depending on dataset size).
- Estimate bias & variance at multiple feature set sizes.
- 5-fold cross-validation
[Figure: dataset sizes, plotting # features (10 to 1e+06) against # samples (100 to 100,000).]

Case 1: No Improvement from Feature Selection
[Figure: MSE, split into variance and bias/noise, vs. number of features (1-54) on covtype, for a single decision tree and a bagged decision tree.]

Case 2: FS Improves the Non-Bagged Model
[Figure: MSE, split into variance and bias/noise, vs. number of features (1-63) on medis; the non-bagged model overfits with too many features.]

Take Away Points
- More features: lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance (one exception).
- The best subset size corresponds to the best bias/variance tradeoff point.
  - This point is algorithm dependent.
  - Relevant features may be discarded if the variance increase outweighs the extra information.

Why Does Bagging Benefit from so Many Features?
[Figure: MSE, split into variance and bias/noise, vs. number of features (1-1,341) on cryst.]

Hypothesis
Bagging improves the base learner's ability to benefit from weak, noisy features.

Experiment 2: Noisy Informative Features
Summary:
- Generate synthetic data (6 features).
- Duplicate 1/2 of the features 20 times.
- Corrupt X% of the values in the duplicated features.
- Train single and bagged trees on the corrupted features plus the 3 non-duplicated features.
- Compare to: the ideal, unblemished feature set, and no noisy features (the 3 non-duplicated features only).
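The feature construction above can be sketched as follows. The exact corruption scheme is an assumption for illustration (here, corrupted values are replaced by random draws from the same column); the slide does not specify the paper's mechanism.

```python
import numpy as np

def make_noisy_duplicates(X, dup_cols, n_copies=20, corrupt_frac=0.3, seed=0):
    """Sketch of the noisy-feature construction: duplicate the chosen
    columns n_copies times, then corrupt a fraction of the values in each
    copy. Corruption = replace with a shuffled value from the same block
    (an assumption, not necessarily the paper's scheme)."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        block = X[:, dup_cols].copy()
        mask = rng.random(block.shape) < corrupt_frac   # which values to corrupt
        noise = rng.permutation(block.ravel()).reshape(block.shape)
        block[mask] = noise[mask]                       # overwrite corrupted values
        copies.append(block)
    other = [c for c in range((X.shape[1])) if c not in dup_cols]
    return np.hstack([X[:, other]] + copies)            # clean + noisy features
```

Sweeping `corrupt_frac` from 0.0 to 1.0 and measuring bias/variance of single vs. bagged trees on the resulting matrix reproduces the shape of the experiment.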

Bagging Extracts More Info from Noisy Features
[Figure: MSE, split into variance and bias/noise, vs. fraction of feature values corrupted (0.0-1.0). Reference levels: 6 original features (ideal) and 3 non-duplicated features (baseline); every other curve uses the 3 non-duplicated features + 60 noisy features. "Damaged" and "core" regions are marked.]

Conclusions
After training 9,060,936 decision trees...
Experiment 1:
- More features: lower bias/noise, higher variance.
- Feature selection does not improve bagged model performance.
- The best subset size corresponds to the best bias/variance tradeoff point.
Experiment 2:
- Bagged trees are surprisingly good at extracting useful information from noisy features.
- Different weak features are used in different trees.

Bibliography
[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173-202, 1996.
[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Exception: Overfitting Pseudo-Identifiers
[Figure: MSE, split into variance and bias/noise, vs. number of features (1-175) on bunting.]