Feature Engineering
Digging into Data: Jordan Boyd-Graber
University of Maryland
March 4, 2013



Roadmap
- How to split your dataset
- TV Tropes dataset
- Feature engineering
- Demo of classification in Rattle

Outline
1. Preparing Data for Classification
2. Evaluating Classification
3. TV Tropes
4. Extracting Features
5. Trying Out Classifiers in Rattle

Test Dataset


Partitioning the Data
- Train: learn a model
- Validation: evaluate different models
- Test: see how well your model does (only do this once)
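Rattle partitions the data for you, but the idea is simple enough to sketch by hand. A minimal Python illustration (the 70/15/15 fractions are an assumption for the example, not the course's):

```python
import random

def partition(examples, train_frac=0.7, valid_frac=0.15, seed=0):
    """Shuffle, then split into train / validation / test portions."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # evaluate on this only once!
    return train, valid, test

train, valid, test = partition(range(100))
```

Fixing the seed makes the split reproducible, so every model you compare sees the same partition.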

Overfitting

Consider the error of a hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

  error_train(h) < error_train(h')  and  error_D(h) > error_D(h')

Overfitting in Decision Tree Learning

[Figure: accuracy on training data vs. test data as a function of tree size (number of nodes, up to ~100). Training accuracy keeps climbing as the tree grows, while test accuracy levels off and then declines; the widening gap is overfitting.]

Avoiding Overfitting

How can we avoid overfitting?
- Stop growing when a data split is not statistically significant
- Grow the full tree, then post-prune

How to select the best tree:
- Measure performance over the training data to find many candidate models
- Measure performance over a separate validation set to choose the one that doesn't overfit

Why validate?

Often, what you try doesn't work the first time around:
- Process the data somehow
- Add more features
- Try different models

After a while, you get better numbers on your test dataset. (Rattle does this partitioning automatically.)


Confusion Matrix


When accuracy lies

Moral: If you care about X, make sure your data have it!
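A toy illustration of why accuracy lies on imbalanced data (the counts below are made up for the example): a classifier that always predicts the majority class scores 95% accuracy while never finding a single positive.

```python
def confusion(gold, pred):
    """Count (actual, predicted) pairs into a 2x2 contingency table."""
    table = {}
    for g, p in zip(gold, pred):
        table[(g, p)] = table.get((g, p), 0) + 1
    return table

# Hypothetical imbalanced data: 95 negatives, 5 positives.
gold = [False] * 95 + [True] * 5
pred = [False] * 100  # a "classifier" that always says False
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
# accuracy is 0.95, yet the classifier never finds a positive:
# confusion(gold, pred) == {(False, False): 95, (True, False): 5}
```

The confusion matrix exposes what the single accuracy number hides: the entire True row of correct predictions is empty.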


TV Tropes
- Social media site
- Catalog of tropes
- Functionally like Wikipedia, but...
  - Less formal
  - No notability requirement
  - Focused on popular culture

Example trope: Absent-Minded Professor
- Doc Emmett Brown from Back to the Future.
- The drunk mathematician in Strangers on a Train becomes a plot point: because of his forgetfulness, Guy is suspected of a murder he didn't commit.
- The Muppet Show: Dr. Bunsen Honeydew.

Spoilers

What makes the dataset neat is that it is annotated by users for spoilers. A spoiler: a published piece of information that divulges a surprise, such as a plot twist in a movie.

Spoilers:
- Han Solo arriving just in time to save Luke from Vader and buy Luke the vital seconds needed to send the proton torpedoes into the Death Star's thermal exhaust port.
- Leia, after finding out that despite her (feigned) cooperation, Tarkin intends to destroy Alderaan anyway.
- Luke rushes to the farm, only to find it already raided and his relatives dead; it harkens to an equally distressing scene in The Searchers.

Not spoilers:
- Diving into the garbage chute gets them out of the firefight, but the droids have to save them from the compactor.
- They do some pretty evil things with that Death Star, but we never hear much of how they affect the rest of the Galaxy. A deleted scene between Luke and Biggs explores this somewhat.
- Luke enters Leia's cell in a Stormtrooper uniform, and she calmly starts some banter.


The dataset
- Downloaded the pages associated with a show
- Took complete sentences from the text and split them into ones with spoilers and ones without
- Created a balanced dataset (50% spoilers, 50% not)
- Split into training, development, and test shows

Why is this important?

I'll show results using an SVM; similar results apply to other classifiers.
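Balancing can be done by downsampling the larger class. A hypothetical sketch (not the actual script used to build the dataset):

```python
import random

def balance(spoilers, non_spoilers, seed=0):
    """Downsample the larger class so the dataset is 50% spoilers, 50% not."""
    rng = random.Random(seed)
    n = min(len(spoilers), len(non_spoilers))
    return rng.sample(list(spoilers), n) + rng.sample(list(non_spoilers), n)

data = balance(["s1", "s2", "s3"], ["n%d" % i for i in range(10)])
# 3 spoiler sentences + 3 sampled non-spoiler sentences
```

With a 50/50 class split, accuracy is directly interpretable: chance performance is 0.5, so anything above that reflects real signal.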



Step 1: The obvious

Take every sentence and split on non-letter characters.

Input: "These aren't the droids you're looking for."
Features: These:1 aren:1 t:1 the:1 droids:1 you:1 re:1 looking:1 for:1

Contingency table:
         False  True
  False     56    34
  True     583   605
Accuracy: 0.517

What's wrong with this?
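A sketch of this tokenization in Python (splitting on runs of non-letter characters is my reading of the slide; the classifier itself is omitted):

```python
import re
from collections import Counter

def unigram_features(sentence):
    """Split on runs of non-letter characters and count the tokens."""
    tokens = [t for t in re.split(r"[^A-Za-z]+", sentence) if t]
    return Counter(tokens)

feats = unigram_features("These aren't the droids you're looking for.")
# every token occurs once: These, aren, t, the, droids, you, re, looking, for
```

Note how the apostrophes shatter "aren't" and "you're" into fragments, and "These" keeps its capital letter, which is exactly what the next step fixes.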


Step 2: Normalization

Normalize the words:
- Lowercase everything
- Stem the words (not always a good idea!)

Input: "These aren't the droids you're looking for."
Features: these:1 are:1 t:1 the:1 droid:1 you:1 re:1 look:1 for:1

Contingency table:
         False  True
  False     52    27
  True     587   612
Accuracy: 0.520
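A sketch of the normalization step. The course presumably used a real stemmer (such as Porter's); the `toy_stem` helper below is a deliberately crude stand-in, so its output differs slightly from the slide's:

```python
import re
from collections import Counter

def toy_stem(token):
    """A crude stand-in for a real stemmer: strip an '-ing' suffix
    and a trailing plural 's'. Illustration only."""
    if len(token) > 5 and token.endswith("ing"):
        return token[:-3]
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token

def normalized_features(sentence):
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", sentence) if t]
    return Counter(toy_stem(t) for t in tokens)

feats = normalized_features("These aren't the droids you're looking for.")
# 'droids' -> 'droid', 'looking' -> 'look'
```

Normalization merges surface variants ("droid"/"droids", "look"/"looking") into one feature, so the classifier sees more evidence per feature.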


Step 3: Remove Useless Features

- Use a stoplist
- Remove features that appear in > 10% of observations (and aren't correlated with the label)

Input: "These aren't the droids you're looking for."
Features: droid:1 look:1

Contingency table:
         False  True
  False     59    20
  True     578   621
Accuracy: 0.532
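A sketch of this pruning step. The `STOPLIST` below is a made-up handful of words, not the course's actual stoplist, and the correlation-with-label check is omitted:

```python
from collections import Counter

STOPLIST = {"the", "a", "an", "and", "you", "re", "t", "for", "these", "are"}

def prune_common(docs, stoplist=STOPLIST, max_df=0.10):
    """Drop stopwords and features occurring in more than max_df of documents."""
    df = Counter()
    for feats in docs:
        df.update(set(feats))  # document frequency: one count per document
    keep = {f for f in df
            if f not in stoplist and df[f] / len(docs) <= max_df}
    return [{f: c for f, c in feats.items() if f in keep} for feats in docs]

docs = [{"these": 1}] * 5 + [{"droid": 1, "look": 1}] + [{}] * 4
pruned = prune_common(docs)
# 'these' (in 50% of docs, and stoplisted) is dropped; 'droid' and 'look' survive
```

Features that appear everywhere carry almost no information about the label, so dropping them shrinks the search space without losing signal.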


Step 4: Add Useful Features

- Use bigrams ("these_are") instead of unigrams ("these", "are")
- Creates a lot of features!

Input: "These aren't the droids you're looking for."
Features: these_are:1 aren_t:1 t_the:1 the_droids:1 you_re:1 re_looking:1 looking_for:1

Contingency table:
         False  True
  False    203   104
  True     436   535
Accuracy: 0.578
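Bigram extraction is a one-liner over adjacent token pairs. A sketch (the token list below is a hand-normalized version of the example sentence, for illustration):

```python
from collections import Counter

def bigram_features(tokens):
    """Join each pair of adjacent tokens with '_'."""
    return Counter("_".join(pair) for pair in zip(tokens, tokens[1:]))

toks = ["these", "are", "t", "the", "droids", "you", "re", "looking", "for"]
feats = bigram_features(toks)
# 8 bigrams: these_are, are_t, t_the, the_droids, droids_you,
#            you_re, re_looking, looking_for
```

A sentence with n tokens yields n-1 bigrams, but the vocabulary of possible bigrams is roughly the square of the unigram vocabulary, which is why the feature count explodes.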


Step 5: Prune (Again)

- Not all bigrams appear often
- The SVM has to search a long time and might not find the right answer
- It helps to prune features

Input: "These aren't the droids you're looking for."
Features: these_are:1 the_droids:1 re_looking:1 looking_for:1

Contingency table:
         False  True
  False    410   276
  True     229   363
Accuracy: 0.605
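Rare-feature pruning can be sketched as keeping only features whose collection-wide count clears a threshold (the `min_count=2` value is an arbitrary illustration):

```python
from collections import Counter

def prune_rare(docs, min_count=2):
    """Keep only features seen at least min_count times across the collection."""
    total = Counter()
    for feats in docs:
        total.update(feats)
    return [{f: c for f, c in feats.items() if total[f] >= min_count}
            for feats in docs]

docs = [{"the_droids": 1, "droids_you": 1}, {"the_droids": 1}]
pruned = prune_rare(docs)
# 'the_droids' appears twice and survives; 'droids_you' appears once and is dropped
```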


How do you find new features?

- Make predictions on the development set
- Look at the contingency table: where are the errors? What do you miss?

Error analysis!
- What feature would the classifier need to get this right?
- What features are confusing the classifier?
- If a feature never appears in the development set, it isn't useful
- If it doesn't appear often, it isn't useful

How do you know something is a good feature?

- Make a contingency table for that feature (it should give you good information gain)
- Throw it into your classifier (accuracy should improve)
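Information gain for a single binary feature falls straight out of its contingency table. A sketch (the two tables at the bottom are invented extremes, not data from the course):

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a two-class label distribution."""
    h, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

def information_gain(table):
    """table[(feature_present, label)] -> count of observations.
    IG = H(label) - sum over v of P(feature = v) * H(label | feature = v)."""
    n = sum(table.values())
    pos = table.get((True, True), 0) + table.get((False, True), 0)
    gain = entropy(pos, n - pos)
    for v in (True, False):
        vp, vn = table.get((v, True), 0), table.get((v, False), 0)
        if vp + vn:
            gain -= (vp + vn) / n * entropy(vp, vn)
    return gain

# A perfectly predictive feature on balanced labels -> 1 bit of gain
perfect = {(True, True): 50, (False, False): 50}
# A feature independent of the label -> 0 bits
useless = {(True, True): 25, (True, False): 25,
           (False, True): 25, (False, False): 25}
```

Real features land between these extremes; a feature worth keeping should move the gain measurably above zero.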

Homework 2

- I've given you TV Tropes training data, development data, and test data (no labels)
- You start with only 15 features (they should get you around 56%)
- For these features, it doesn't matter (much) which classifier you use
- Your job: add additional features and see how they do



Selecting a model

- Go to the Model tab and select one of the models
- Make sure the model makes sense:
  - For logistic regression, select "linear" and "logistic"
  - For an SVM, you also need to select a kernel (try linear first, then Gaussian, which will be much slower)
- Output varies by model; SVM output is the least informative (hard to summarize)
- Note that you can click Draw to see decision trees

Decision Trees Have Many Options...

- Prior: the prior class probabilities (in case your training data are skewed)
- Min Split: how many observations a node must have before it can be split (pre-test)
- Min Bucket: how many observations must end up in any resulting leaf (post-test)
- Max Depth: how many levels the tree can have
- Complexity: how many if statements the tree has

Defaults are reasonable; tweak them if you are having complexity issues.

How'd we do?

- Fit the model by clicking the Execute button
- Click on the Evaluate tab and check the boxes for the models you want to compare
- Select specific datasets (e.g., an external CSV file)
- For the weather dataset, SVM does best (.14)
- To get explicit predictions, click the Score button
- We'll learn about the other metrics next week!

RTextTools

library(RTextTools)

# Read the sentences
train.df <- read.csv("train/train.csv")
train.df$sentence <- as.character(train.df$sentence)
dev.df <- read.csv("dev/dev.csv")
dev.df$sentence <- as.character(dev.df$sentence)

# Use a small subset so the demo runs quickly
train.df <- train.df[1:1000, ]
dev.df <- dev.df[1:100, ]

data <- rbind(train.df, dev.df)
train_size <- dim(train.df)[1]
total_size <- dim(data)[1]

matrix <- create_matrix(cbind(data$sentence, data$trope),
                        language = "english",
                        removeNumbers = TRUE,
                        stemWords = FALSE,
                        weighting = weightTfIdf)

# Train on the training rows, classify the development rows
container <- create_container(matrix, data$spoiler,
                              trainSize = 1:train_size,
                              testSize = (train_size + 1):total_size,
                              virgin = FALSE)
models <- train_models(container, algorithms = c("MAXENT", "SVM"))
results <- classify_models(container, models)