PHY310: Lecture 14
Statistical Tests: More Complicated Discriminants

Road Map
- When the likelihood discriminant will fail
- The Multi-Layer Perceptron discriminant
- Other heuristics
Constructing the Test Statistic
- Suppose you've got measurements $\vec{x} = (x_1, x_2, \ldots, x_n)$
- And likelihood functions for the alternate hypotheses: $L(\vec{x}; H_0) = g(\vec{x} \mid H_0)$ and $L(\vec{x}; H_1) = g(\vec{x} \mid H_1)$
- The best test statistic is the likelihood ratio (it's called "optimal", by the Neyman-Pearson lemma):
  $t(\vec{x}) = \dfrac{L(\vec{x}; H_0)}{L(\vec{x}; H_1)}$, or $\log t = \log L(\vec{x}; H_0) - \log L(\vec{x}; H_1)$
- The problem is that we usually can't construct the full likelihood
- Approximate it as a product of independent likelihood functions; independent likelihoods are much easier to construct:
  $L(\vec{x}; H_0) \approx L(x_1; H_0)\, L(x_2; H_0) \cdots L(x_n; H_0)$
- The resulting statistic may not be optimal, but remember: any test statistic that works is great (see the sketch below)
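To make the factorized statistic concrete, here is a minimal sketch in Python, assuming each component $x_i$ is modeled by an independent Gaussian p.d.f. under each hypothesis; the means and widths are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import norm

# Independent 1D p.d.f.s for each measured variable under each hypothesis.
# These parameters are made up for illustration.
H0_pdfs = [norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=2.0)]
H1_pdfs = [norm(loc=2.0, scale=1.0), norm(loc=0.0, scale=2.0)]

def log_t(x):
    """log t = sum_i [ log L(x_i; H0) - log L(x_i; H1) ]."""
    return sum(p0.logpdf(xi) - p1.logpdf(xi)
               for xi, p0, p1 in zip(x, H0_pdfs, H1_pdfs))

print(log_t([0.5, 1.5]))  # log t > 0 favors H0, log t < 0 favors H1
```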
Independent Likelihoods Can Fail
- The likelihood method discussed last lecture is very powerful, but sometimes the full likelihood has strong internal correlations
  - The independent approximation won't work
- Example: two classes of data (see the sketch below)
  - The x variable is Gaussian
  - The y variable is correlated with x, but the classes have different offsets
- There is almost no overlap between the data sets, so we expect the discriminant to work very well
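A sketch of how such a toy sample could be generated; the offsets and widths are assumptions for illustration, not the lecture's actual values:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_class(n, offset):
    """x is Gaussian; y tracks x with a class-dependent offset."""
    x = rng.normal(0.0, 1.0, n)
    y = x + offset + rng.normal(0.0, 0.1, n)  # y strongly correlated with x
    return np.column_stack([x, y])

class1 = make_class(1000, offset=+0.5)
class2 = make_class(1000, offset=-0.5)
# The 1D marginals of x (and of y) overlap heavily, but the 2D distributions
# barely overlap -- exactly where the product of independent likelihoods
# throws away the information that separates the classes.
```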
The Simple Likelihood Discriminant
- Black points are measurements correctly identified as data class one
- Red points are measurements correctly identified as data class two
- Green points are measurements that were incorrectly identified
Comparison of the Discriminants
- Not a very good approximation!
- Observation: we only need to know the location of the $\log t = 0$ surface (the "discriminant surface")
Approximating Functions with Multi-Layer Perceptron Networks
- Neural nets don't really have anything to do with neurons
  - They are a way to parameterize a function
  - Modeled on a simple mathematical model of how neurons work
- Advantages of this parameterization
  - Able to fit an arbitrarily complicated function
  - You don't need to know the value of the function!
  - You do need to know whether a change improves the approximation
- Not a new idea: the original papers and research date from the '50s and early '60s!
- Figure: sin(x) parameterized using an MLP neural net with one input [x], two hidden layers of 3 perceptrons each, and one output [sin(x)]
Neurons as Mathematical Functions
- A neuron is a function
  - Input is a vector of values, $\vec{a} = (a_1, a_2, \ldots, a_n)$, with a weight $w_i$ on each input
  - Output is an activation strength, $b = f\!\left(\sum_i w_i a_i + w_0\right)$ (see the sketch below)
- The threshold function is usually implemented as a sigmoid, $f(x) = \dfrac{1}{1 + e^{-x}}$, but can be anything
- Popular alternatives
  - Gaussian for input layers
  - Linear for output layers
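As a minimal illustration of the neuron formula above (the weights and inputs are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    """The usual threshold function, f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, w0):
    """Activation strength b = f(sum_i w_i a_i + w_0)."""
    return sigmoid(np.dot(w, a) + w0)

print(neuron(a=np.array([0.2, -1.0, 0.5]),
             w=np.array([1.0, 0.5, -2.0]),
             w0=0.1))
```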
MLP Networks
- Figure: a network with an input layer, one internal (hidden) layer, and an output layer
- There is a weight for each connection between neurons and a threshold for each non-input neuron; in this network, there are 9 weights and 4 thresholds (see the sketch below)
  - These are the free parameters that can be used to parameterize a function
- There are several methods to find the weights, but all are beyond this class
- All that is needed to fit the weights is a way to determine whether a change in a weight moves the output value closer to correct
  - You don't need to know the correct output value to find the weights!
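A small sketch of the parameter bookkeeping, assuming the pictured network has 2 inputs, 3 hidden neurons, and 1 output (which reproduces the slide's count of 9 weights and 4 thresholds):

```python
def mlp_parameters(layers):
    """layers = (n_inputs, hidden sizes..., n_outputs).
    Returns (weights, thresholds) for a fully connected MLP."""
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    thresholds = sum(layers[1:])  # one threshold per non-input neuron
    return weights, thresholds

print(mlp_parameters((2, 3, 1)))  # -> (9, 4): 9 weights, 4 thresholds
```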
The MLP Parameterization of sin(x)
- To parameterize a function, you need
  - A training set: random inputs, and the desired value for each input
  - A test set: random inputs and the desired value for each input, used to determine how well the network is parameterizing the function
- A three-layer network (2 hidden layers) can provide an arbitrarily good parameterization
- Figure: the resulting network output for sin(x) (a minimal training sketch follows below)
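A minimal training sketch, assuming the slide's 1-3-3-1 architecture (sigmoid hidden layers, linear output) and plain full-batch gradient descent; the learning rate and epoch count are guesses and may need tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Network: 1 input -> 3 sigmoid -> 3 sigmoid -> 1 linear output.
W1, b1 = rng.normal(0, 1, (3, 1)), np.zeros((3, 1))
W2, b2 = rng.normal(0, 1, (3, 3)), np.zeros((3, 1))
W3, b3 = rng.normal(0, 1, (1, 3)), np.zeros((1, 1))

# Training and test sets: random inputs and the desired value sin(x).
x_tr = rng.uniform(-np.pi, np.pi, (1, 200)); y_tr = np.sin(x_tr)
x_te = rng.uniform(-np.pi, np.pi, (1, 200)); y_te = np.sin(x_te)

def forward(x):
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    return h1, h2, W3 @ h2 + b3

lr, n = 0.05, x_tr.shape[1]
for epoch in range(20000):
    h1, h2, y = forward(x_tr)
    d3 = (y - y_tr) / n                 # dLoss/dy for mean-squared error
    d2 = (W3.T @ d3) * h2 * (1 - h2)    # back-propagate through the sigmoids
    d1 = (W2.T @ d2) * h1 * (1 - h1)
    W3 -= lr * d3 @ h2.T;   b3 -= lr * d3.sum(1, keepdims=True)
    W2 -= lr * d2 @ h1.T;   b2 -= lr * d2.sum(1, keepdims=True)
    W1 -= lr * d1 @ x_tr.T; b1 -= lr * d1.sum(1, keepdims=True)

# The test set tells us how well the network parameterizes sin(x).
print("test-set MSE:", np.mean((forward(x_te)[2] - y_te) ** 2))
```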
Using an MLP to Parameterize the Discriminant Surface
- To discriminate between classes of data, we need to find the hypersurface where
  $\log t = \log L(\vec{x}; H_0) - \log L(\vec{x}; H_1) = 0$
- Off the surface, we don't really care about the value of $\log t$
- Parameterizing the likelihood-ratio surface
  - Training set: a set of simulated measurements and the desired log likelihood ratio (usually -1 and +1) for each measurement; the desired value is used to determine whether a weight change is improving the parameterization
  - Test set: a set of simulated measurements and the desired log likelihood ratio for each measurement
- The training sets need to be pretty large, or the surface may be very poorly approximated
  - If there are n input variables and two hidden layers containing $m_1$ and $m_2$ nodes, there are $(n+1)m_1 + (m_1+1)m_2 + (m_2+1)$ free parameters (a worked example follows below)
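As a quick check of this counting, a worked example with assumed sizes $n = 2$ and $m_1 = m_2 = 5$ (not values from the lecture):

$$(2+1)\cdot 5 + (5+1)\cdot 5 + (5+1) = 15 + 30 + 6 = 51,$$

so even a small network carries 51 free parameters that the training set must pin down.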
Constructing the MLP Discriminant
- Choose a canned NN implementation
  - I'm using ROOT
- Choose the network
  - Number of internal layers
  - Number of neurons per layer
- Generate a set of simulated data
  - Use half for the training set
  - Use half for the test set
- Train the network (a stand-in sketch follows below)
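The lecture uses ROOT; as a stand-in, here is the same recipe sketched with scikit-learn's MLPClassifier (an assumed substitute, not the lecture's code), on a toy two-class sample like the one above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy two-class data: Gaussian x, y correlated with x via a class offset.
rng = np.random.default_rng(1)
def make_class(n, offset):
    x = rng.normal(0.0, 1.0, n)
    y = x + offset + rng.normal(0.0, 0.1, n)
    return np.column_stack([x, y])

X = np.vstack([make_class(1000, +0.5), make_class(1000, -0.5)])
labels = np.r_[np.zeros(1000), np.ones(1000)]

# Half for the training set, half for the test set, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0)

# Choose the network: two hidden layers, 5 and 3 neurons (an assumption).
net = MLPClassifier(hidden_layer_sizes=(5, 3), max_iter=2000)
net.fit(X_train, y_train)
print("test-set accuracy:", net.score(X_test, y_test))
```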
Results of the MLP Network
- The MLP network is a high-order fit to the $\log t = 0$ surface
- Many free parameters, so almost any surface can be described
- Lots of free parameters, so the fit can fall into a local minimum!
- The result is more or less a step function
Likelihood Method vs MLP Method
- For some distributions, the MLP can provide much better discrimination than the likelihood method, but...
- Figures: the likelihood method vs the MLP method
Cost of Using an MLP Network
- With MLP networks, you are parameterizing the discriminant with
  - A non-linear function
  - A large number of free parameters
  - All caveats and warnings about multidimensional minimization apply!
- Understanding (or visualizing) the discriminant
  - Very difficult to understand the effect of one variable on the output; with a likelihood, the single-variable p.d.f. is the entire story
  - Very difficult to understand the shape of the discriminant surface; with a likelihood, it comes from the product of the individual p.d.f.s, and you can study each p.d.f. independently
- My biased recommendation:
  - Stick with a likelihood-based discriminant that can be understood
  - Use an MLP network to check whether the discriminant is nearly optimal
  - If it's not optimal, look for a better set of variables
  - Avoid using an MLP network for final results unless it provides a dramatic improvement; consider using another higher-order discriminant
K Nearest Neighbors
- A new definition of brute force
- Start with a large sample of example events
- Find the k nearest neighbors to your test event (just like the name)
  - k is just an integer, for example 15
- Vote to find the right classification
  - Count the number of examples from each class; the most frequent class is assigned to the event
- Figure example: the region around the test point contains 9 red neighbors and 6 black neighbors, so the test point is assigned RED
  - Partial assignment is possible: the test point is 60% red and 40% black
- (A minimal sketch follows below)
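A minimal brute-force sketch of the k-nearest-neighbors vote, with k = 15 as on the slide; the toy example sample is assumed:

```python
import numpy as np

def knn_classify(point, examples, classes, k=15):
    """Vote among the k nearest examples; also return the vote fractions,
    which double as the slide's 'partial assignment'."""
    dist = np.linalg.norm(examples - point, axis=1)  # brute-force distances
    nearest = classes[np.argsort(dist)[:k]]          # labels of k nearest
    votes = np.bincount(nearest, minlength=2)
    return votes.argmax(), votes / k

# Toy example sample: two 2D Gaussian classes with different means.
rng = np.random.default_rng(0)
examples = rng.normal(0.0, 1.0, (500, 2)); examples[250:] += 1.5
classes = np.r_[np.zeros(250, int), np.ones(250, int)]

print(knn_classify(np.array([1.0, 1.0]), examples, classes))
```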
Decision Tree Classification
- Twenty Questions on steroids
- Decision trees are a heuristic, not really a statistical test
- Sequentially choose cuts to classify data
  - This really is Twenty Questions: 20 yes/no questions can isolate 1 out of a million ($2^{20} \approx 10^6$)
  - Questions are chosen to optimize the information entropy (see the sketch below)
- Advantages
  - The explanation is very simple (see above)
  - It's a white box: you can look at each individual decision and understand its effect
  - Easy to implement
- Disadvantages
  - Not all that common in physics
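A sketch of how a single cut could be chosen by information entropy; the variable, the grid of candidate cuts, and the toy data are assumptions for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a set of 0/1 class labels."""
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_cut(x, labels, candidates):
    """Scan candidate cut values; keep the one maximizing information gain."""
    parent = entropy(labels)
    gains = []
    for c in candidates:
        left, right = labels[x < c], labels[x >= c]
        if len(left) == 0 or len(right) == 0:
            gains.append(0.0)
            continue
        child = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(labels)
        gains.append(parent - child)
    i = int(np.argmax(gains))
    return candidates[i], gains[i]

# Toy data: two classes separated in one variable.
rng = np.random.default_rng(0)
x = np.r_[rng.normal(0, 1, 500), rng.normal(2, 1, 500)]
labels = np.r_[np.zeros(500, int), np.ones(500, int)]
print(best_cut(x, labels, np.linspace(-1, 3, 41)))
```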
Finally
- The likelihood discriminant is still your best choice
  - Be on the lookout for when it is sub-optimal
  - When the likelihood is sub-optimal, you have moved into an area of active research
- MLP neural nets are old and understood
  - Unfortunately, after 40 years of research, nobody knows how to understand how each variable affects the output
- K-nearest neighbors is easy
  - Usually, it's not computationally viable
- Decision trees
  - The latest hot item, and it's a simple heuristic
  - Not all that common in physics (yet)

The End