Data Mining Misconceptions #1: The 50/50 Problem

By Tim Graettinger

This fall will mark my twentieth year as a data mining professional. Thank you. During that time, I worked at five different companies, mostly startups, and consulted for many, many clients. Changes to the data mining field during that period are startling, in terms of the computational horsepower available, the size of the databases being generated, and the software tools developed to model and analyze them. At the same time, scant progress has been made in educating the public, in general, and clients, in particular, about data mining. There are many untruths, half-truths, and downright statistics floating around about how data mining works and how it is used. In this and future articles, I intend to clear up a few of the most pervasive of these misconceptions.

Some misconceptions arise from simple errors in logic. Often, they stem from a lack of familiarity or experience. None are particularly technical problems. All are easily remedied with simple examples and simple explanations. In this article, I will focus on one misconception that I call the 50/50 problem.

An Example of the 50/50 Problem

Recently, I was working with a very bright, energetic client in the biotech industry. Her firm builds imaging equipment and provides services to pharmaceutical companies. The imaging equipment (calling it a complex, microscope-like camera is far too wordy) generated data that she wanted to use to classify chemical compounds as promising or unpromising candidates for drugs. It turns out, in the vast world of chemical compounds, that there are more unpromising drug candidates than promising ones. A lot more. My job was to use data mining techniques to create a classifier (a mathematical formula or a set of rules) that would successfully distinguish promising drug candidates from unpromising ones, using data produced by the imaging equipment.

After some initial work, I presented a classifier to my client. I happily reported that the classifier correctly labeled promising compounds as promising 10% of the time. My client was completely underwhelmed [1]. Her knee-jerk response was, "But you can do 50% just by flipping a coin!"

Actually, a very simple classifier can do much better than 50%. I mentioned earlier that there are many more unpromising compounds than promising ones. In this project, 999 out of every 1000 compounds were unpromising, or 99.9%. A classifier that labels every compound as unpromising is correct 99.9% of the time. Despite its apparently high accuracy, such a classifier is worthless to a pharmaceutical company. Why? Such a classifier would recommend that no compound ever be developed further as a potential drug. Strictly abiding by the classifier, life-saving research would come to an abrupt halt.

The 50/50 Problem in a Nutshell

Is a misconception becoming evident? My client, like many intelligent people, made a simple error in thinking. She made the assumption, because there were two possible outcomes (promising and unpromising), that the outcomes were both 50% likely. This is the 50/50 problem. My own theory is that many of us are victims of our own education. All of my probability textbooks introduced the subject with discussions about flipping coins. With that as a starting point, perhaps it's no wonder that people make the 50/50 assumption without even thinking about it.

The first step towards a solution is to admit there is a problem. Please repeat after me, "Just because there are two possible outcomes, that does not mean they are equally likely." The second part of a solution is to replace the wrong mental image with the right one. Rather than think about the two outcomes as alternate sides of a coin, think about them as two, clearly unequal, pieces of a pie (see Figure 1). Beneath the crust of the small red piece, picture a filling of delicious fruit. Beneath the large blue piece, picture some mud. Now, consider a blindfolded person plunging a fork into the pie. There are only two possible outcomes, mud or fruit. But the odds of a tasty result are not 50/50, are they?
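The 99.9% figure quoted earlier for a classifier that labels everything as unpromising can be reproduced with a few lines of Python. This is a minimal sketch, assuming a made-up population of 1000 compounds with exactly one promising member; the function name `always_unpromising` is hypothetical and not part of the original project.

```python
# Hypothetical population matching the article's ratio: 1 promising per 1000.
true_labels = ["promising"] * 1 + ["unpromising"] * 999


def always_unpromising(compound):
    """A trivial classifier that ignores its input entirely."""
    return "unpromising"


predictions = [always_unpromising(label) for label in true_labels]

# Accuracy: fraction of compounds whose predicted label matches the truth.
correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

# How many truly promising compounds does this classifier ever flag?
found = sum(
    p == "promising" and t == "promising"
    for p, t in zip(predictions, true_labels)
)

print(f"accuracy: {accuracy:.1%}")
print(f"promising compounds found: {found}")
```

The accuracy is high only because the unpromising slice of the pie is so large; the count of promising compounds found stays at zero, which is why accuracy alone says little here.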

Data Mining Imperatives

In the drug compound project, the outcomes were not equally likely. One outcome (unpromising) was very common while the other (promising) was rare and significantly more valuable. The rare outcome is the one promising compound out of one thousand candidates that may become a multi-billion-dollar blockbuster drug. In a scenario like this, there are two dominant data mining imperatives:

- To improve the odds of uncovering the rare, valuable outcome
- To make the right kind of mistakes

Improving the Odds

Let's consider each of these imperatives in turn. First, the classifier needed to improve the odds of finding good drug candidates. I mentioned above that a classifier is typically a formula or set of rules. In this case, the drug classifier was a mathematical formula [2]. The inputs to the formula were various measurements of a compound, for instance, the size, shape, or color of biological features in a microscopic image. The output of the formula was a number, often called a score. By design, the higher the score, the more likely the outcome of interest; here, the more likely that the candidate compound would be a promising drug. Provided with a large list of compounds and their associated measurements from the imaging equipment, my classifier produced a list of scores, one for each compound.
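To make the idea of a scoring formula concrete, here is a minimal sketch in Python. The article does not give the actual formula, so everything specific below is an assumption made for illustration: the three measurement names, the weights, the bias, and the choice of a logistic function to keep the score between 0 and 1.

```python
import math

# Hypothetical weights for three image-derived measurements. These values
# are invented; a real formula would be derived from data.
WEIGHTS = {"size": 0.8, "shape": 1.5, "color": -0.4}
BIAS = -2.0


def score(measurements):
    """Map a compound's measurements to a score between 0 and 1.

    Higher scores are meant to indicate a more likely promising compound.
    """
    z = BIAS + sum(WEIGHTS[name] * measurements[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing


# Two made-up compounds with made-up measurements.
compounds = {
    "compound_a": {"size": 0.2, "shape": 0.1, "color": 0.9},
    "compound_b": {"size": 1.4, "shape": 1.8, "color": 0.3},
}

# One score per compound, as described in the text.
scores = {name: score(m) for name, m in compounds.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

The only property that matters for what follows is the ordering: a compound with a higher score is treated as a better bet than one with a lower score.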

Without a classifier, the list of compounds is just a list. A researcher could start with a compound anywhere in the list, synthesize it, test it for promise as a drug candidate, and, 1 time in 1000, it may turn out to be promising. With the classifier, the list can be sorted by score, and a researcher can start at the top, synthesizing and testing compounds. With my classifier, 1 time in 10 a compound near the top of the sorted list will be a promising drug candidate. Allow me to repeat: 1 time in 10. The odds are improved by a factor of 100. Suddenly, the classifier seemed much more powerful to my client.

Making the Right Kind of Mistakes

Having passed the first hurdle by improving the odds, we turn to the second imperative: making the right kind of mistakes. What does that mean? Aren't all mistakes created equal? From common experience, though, we know that some errors are worse than others. Taking a misstep in the kitchen is inconsequential. Taking a misstep on a busy city street can have a very serious impact, literally. Let's take a closer look at the mistakes that can be made by the drug candidate classifier.
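As a brief aside on the odds improvement described above, the effect of sorting by score can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the real project: the score distributions, the list size, and the number of compounds tested first (`TOP_K`) are all invented, so the simulated hit rate will not match the article's 1 in 10. Only the first two lines of arithmetic use the article's own figures.

```python
import random
from fractions import Fraction

# The article's figures: 1 in 10 after sorting versus 1 in 1000 before.
article_factor = Fraction(1, 10) / Fraction(1, 1000)
print(f"improvement factor from the article's figures: {article_factor}")

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N = 100_000            # size of the hypothetical compound list
BASE_RATE = 1 / 1000   # the article's 1-in-1000 rate of promising compounds
TOP_K = 100            # how many compounds a researcher tests first

# Each compound is a (score, is_promising) pair. The score distributions
# are invented: promising compounds simply tend to score higher.
compounds = []
for _ in range(N):
    is_promising = rng.random() < BASE_RATE
    score = rng.gauss(3.0 if is_promising else 0.0, 1.0)
    compounds.append((score, is_promising))

# Hit rate when picking from the list in no particular order.
overall_rate = sum(p for _, p in compounds) / N

# Hit rate among the highest-scoring compounds.
ranked = sorted(compounds, key=lambda c: c[0], reverse=True)
top_rate = sum(p for _, p in ranked[:TOP_K]) / TOP_K

print(f"hit rate, unsorted list: {overall_rate:.4f}")
print(f"hit rate, top {TOP_K} by score: {top_rate:.4f}")
```

Even with an imperfect score, the top of the sorted list is far richer in promising compounds than the list as a whole, which is the sense in which a classifier "improves the odds".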

See the diagram in Figure 2. One side represents the true outcome, whether a compound is actually unpromising or promising. The other side represents the predicted outcome from the classifier, again, unpromising or promising. (To make this distinction of predicted outcomes, the classifier scores were cut off: compounds with score values above the cut-off value were labeled as promising, while those with values below the cut-off were labeled as unpromising. More about the choice of cut-off shortly.) As shown in green in the figure, there are two ways the classifier can be correct: by predicting promising when the compound truly is promising, and by predicting unpromising when it truly is unpromising. Conversely, as shown in yellow, there are two ways for the classifier to make a mistake: by predicting that a truly unpromising compound is promising (a "false alarm"), or by predicting that a truly promising compound is unpromising (a "miss").

Now, which mistake is worse for a drug maker? My intuition was that a false alarm was the worse error. After all, I knew that pharmaceutical companies spend many millions of dollars to test and develop a single compound into a drug. Surely, they would want to avoid that expense if they could. While that was true, I learned that a miss was the worse error. Since a promising compound is like a rare gem to the industry, missing any good candidate amounts to missing a potential multi-billion-dollar opportunity. Although costly, some false alarms are acceptable, and even viewed as inevitable, a cost of doing research and business. By setting the cut-off value lower or higher, the classifier can trade misses (billion-dollar missed opportunities) for false alarms (million-dollar dead ends).

Beyond the 50/50 Problem

The 50/50 misconception can fool smart people into thinking that data mining is overhyped, or useless, or both. To counter the misconception, I presented an example of an extremely valuable classifier that was correct just 10% of the time. Conversely, I also described a 99.9% accurate classifier that was useless. Data mining can improve the odds of finding rare, valuable results. Further, with good judgment, data miners can help clients limit the cost of mistakes and missed opportunities. So remember, just because there are two possible outcomes, that does not mean they are equally likely, or equally valuable.
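The trade between misses and false alarms discussed under "Making the Right Kind of Mistakes" can be sketched in Python. The scores and true outcomes below are invented for illustration, and `count_outcomes` is a hypothetical helper; the only rule taken from the article is that scores above the cut-off are labeled promising and the rest unpromising.

```python
# Hypothetical (score, truly_promising) pairs, invented for illustration.
scored = [
    (0.95, True), (0.90, False), (0.80, True), (0.70, False),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
    (0.20, False), (0.10, False),
]


def count_outcomes(scored, cutoff):
    """Count the four cells of the true-vs-predicted diagram for one cut-off."""
    counts = {"hit": 0, "false_alarm": 0, "miss": 0, "correct_rejection": 0}
    for score, truly_promising in scored:
        predicted_promising = score > cutoff
        if predicted_promising and truly_promising:
            counts["hit"] += 1
        elif predicted_promising:
            counts["false_alarm"] += 1
        elif truly_promising:
            counts["miss"] += 1
        else:
            counts["correct_rejection"] += 1
    return counts


high = count_outcomes(scored, cutoff=0.85)
low = count_outcomes(scored, cutoff=0.50)
print("cut-off 0.85:", high)
print("cut-off 0.50:", low)
```

In this toy data, lowering the cut-off from 0.85 to 0.50 removes both misses at the price of two additional false alarms. Where to set the cut-off in practice depends on how the two kinds of mistake are valued, which is a business judgment more than a technical one.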

Tim Graettinger, Ph.D., is the President of Discovery Corps, Inc. (http://www.discoverycorpsinc.com), a Pittsburgh-area company specializing in data mining, visualization, and predictive analytics. Your comments and questions about this article are welcome. Please contact Tim at (724)-743-3642 or tgraettinger@discoverycorpsinc.com.

[1] The reader may also be underwhelmed at this point, but please read on.

[2] In this article, we will ignore the details of how the formula is derived; just know that it can be derived through statistical or other data mining methods.