Lecture 3 - Regression


Lecture 3 - Regression
Instructor: Prof. Ganesh Ramakrishnan
July 25, 2016

The Simplest ML Problem: Least Squares Regression
- Curve fitting: motivation
- Error measurement
- Minimizing error
- Method of least squares

Curve Fitting: Motivation
Example scenarios:
- The price of a house, to be fitted as a function of the area of the house
- The temperature of a place, to be fitted as a function of its latitude, longitude, and time of year
- A stock price (or the BSE/Nifty value), to be fitted as a function of company earnings
- The height of students, to be fitted as a function of their weight
One or more observations/parameters in the data are expected to represent the output in the future.

The higher you go, the more expensive the house! Consider the variation of the price (in $) of a house with variations in its height (in m) above ground level. These are specified as the coordinates of 8 points: $(x_1, y_1), \ldots, (x_8, y_8)$.
Desired: find a pattern or curve that characterizes the price as a function of the height.
Figure: Price of a house vs. its height (for illustration purposes only)

Errors and Causes
- (Observable) data is generally collected through measurements or surveys
- Surveys can contain random human errors
- Measurements are subject to the imprecision of the measuring or recording instrument
- Outliers arise from variability in the measurement or from experimental error
- Robustness to errors: minimize the effect of errors on the predicted model
- Data cleansing: handle outliers in a pre-processing step

Curve Fitting: The Process
"Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints" - Wikipedia
- We need quantitative criteria to find the best fit
- An error function $E$ maps a curve $f$ and a dataset $D$ to a real number: $E : (f, D) \to \mathbb{R}$
- The error function must capture the deviation of the prediction from the expected value

Example
Consider the two candidate prediction curves, in blue and red respectively. Which is the better fit?
Figure: Price of a house vs. its height (for illustration purposes only)

Question
What are some options for an error function $E(f, D)$ that measures the deviation of the prediction from the expected value?

Examples of $E$
- $\sum_{D} (f(x_i) - y_i)$
- $\sum_{D} |f(x_i) - y_i|$
- $\sum_{D} (f(x_i) - y_i)^2$
- $\sum_{D} (f(x_i) - y_i)^3$
- and many more
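To make the comparison concrete, here is a minimal sketch (all data values and the candidate curve are made up for illustration) that evaluates each of these error measures for one prediction curve:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])   # observed values (made up)
f = lambda t: 1.0 * t + 0.05         # a candidate prediction curve (made up)

r = f(x) - y                         # deviations f(x_i) - y_i
print(np.sum(r))                     # signed sum: errors of opposite sign cancel
print(np.sum(np.abs(r)))             # sum of absolute deviations
print(np.sum(r ** 2))                # sum of squared deviations
print(np.sum(r ** 3))                # sum of cubed deviations: sign-sensitive

Note how the signed sum can be small even for a poor fit, which is one way to start answering the question below.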

Question
Which of these choices do you think gives us the best-fit curve, and why? Hint: think of these errors as distances.

Squared Error
$\sum_{D} (f(x_i) - y_i)^2$
One best-fit curve corresponds to the $f$ that minimizes the above function. The squared error
1. is continuous and differentiable, and
2. can be visualized as the square of the Euclidean distance between the predicted and observed values.
The mathematical optimization of this function is the topic of the following lectures. This is the Method of Least Squares.
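As a quick illustration (the data is made up; np.polyfit performs the squared-error minimization internally), fitting a straight line by least squares:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

# np.polyfit with deg=1 returns the slope and intercept minimizing the squared error
b1, b0 = np.polyfit(x, y, deg=1)
pred = b1 * x + b0
print("squared error:", np.sum((pred - y) ** 2))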

Regression, More Formally
- Formal definition
- Types of regression
- Geometric interpretation of the least squares solution
- Linear regression as a canonical example
- Optimization (formally deriving the least squares solution)
- Regularization (ridge regression, lasso); Bayesian interpretation (Bayesian linear regression)
- Non-parametric estimation (local linear regression); non-linearity through kernels (support vector regression)

Linear Regression with Illustration
Regression is about learning to predict a set of output variables (dependent variables) as a function of a set of input variables (independent variables).
Example: A company wants to determine how much it should spend on TV commercials to increase sales to a desired level $y$. On what basis? It has previous observations of the form $\langle x_i, y_i \rangle$, where $x_i$ is an amount of money spent on advertisements and $y_i$ is the corresponding observed sales figure.
Suppose the observations support the following linear approximation:
$y = \beta_0 + \beta_1 x$ (1)
Then $x = \frac{y - \beta_0}{\beta_1}$ can be used to determine the money to be spent.
Estimation for regression: determine appropriate values for $\beta_0$ and $\beta_1$ from the past observations.
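A minimal sketch of this estimation step (the advertising and sales figures are made up), using the standard closed-form least-squares estimates for $\beta_0$ and $\beta_1$ and then inverting equation (1):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # ad spend (made up)
y = np.array([25.0, 41.0, 62.0, 78.0, 101.0])  # observed sales (made up)

# Least-squares estimates: beta_1 = cov(x, y) / var(x); beta_0 from the means
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Invert y = beta_0 + beta_1 * x to find the spend for a desired sales level
y_desired = 90.0
x_needed = (y_desired - beta0) / beta1
print(beta0, beta1, x_needed)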

Linear Regression with Illustration
Figure: Linear regression on TV advertising vs. sales figures

What would it mean for sales to be a non-linear function of the investment in advertising?

Basic Notation
Dataset: $D = \{\langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle\}$
Notation (used throughout the course):
- $m$ = number of training examples
- $x$'s = input/independent variables
- $y$'s = output/dependent/target variables
- $(x, y)$ = a single training example
- $(x_j, y_j)$ = a specific (the $j$-th) training example; $j$ is an index into the training set
The $\phi_i$'s are the attribute/basis functions, and
$\phi = \begin{bmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_p(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_m) & \phi_2(x_m) & \cdots & \phi_p(x_m) \end{bmatrix}$ (2)
$y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$ (3)
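As a sketch (the inputs, targets, and the particular basis functions are made up for illustration), the matrix $\phi$ of equation (2) can be assembled column by column:

import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])   # m = 4 training inputs (made up)
y = np.array([1.2, 2.1, 3.3, 4.0])   # corresponding targets (made up)

# Example basis functions (p = 3): constant, identity, square
basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2]

# phi is the m x p matrix with entry phi[j, i] = phi_i(x_j)
phi = np.column_stack([phi_i(x) for phi_i in basis])
print(phi.shape)   # (4, 3)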

Formal Definition
General regression problem: determine a function $f^*$ such that $f^*(x)$ is the best predictor for $y$ with respect to $D$:
$f^* = \operatorname{argmin}_{f \in F} E(f, D)$
Here, $F$ denotes the class of functions over which the error minimization is performed.
Parametrized regression problem: determine parameters $w^*$ for the function $f(\phi(x), w)$ that minimize our error function $E(f(\phi(x), w), D)$:
$w^* = \operatorname{argmin}_{w} E(f(\phi(x), w), D)$

Types of Regression
Regression problems are classified based on the function class and the error function. If $F$ is the space of linear functions, $f(\phi(x), w) = w^T \phi(x) + b$, we get Linear Regression. The problem is then to determine $w^*$ such that
$w^* = \operatorname{argmin}_{w} E(w, D)$ (4)

Types of Regression (contd.)
- Ridge regression: a shrinkage parameter (regularization parameter) is added to the error function to reduce discrepancies due to variance
- Logistic regression: models the conditional probability of the dependent variable given the independent variables, and is used extensively in classification tasks:
$f(\phi(x), w) = \log \frac{\Pr(y \mid x)}{1 - \Pr(y \mid x)} = b + w^T \phi(x)$ (5)
- Lasso regression, stepwise regression, and several others
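A minimal sketch of what equation (5) means in practice (the weights and feature vector are made up): since the log-odds are linear, the conditional probability is recovered with the sigmoid function:

import numpy as np

def sigmoid(z):
    # Inverse of the log-odds: if z = log(p / (1 - p)) then p = sigmoid(z)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.3])     # made-up weights
b = 0.1                       # made-up bias
phi_x = np.array([1.2, 2.0])  # made-up feature vector phi(x)

log_odds = b + w @ phi_x      # equation (5)
print(log_odds, sigmoid(log_odds))   # log-odds and Pr(y | x)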

Least Squares Solution
The form of $E(\cdot)$ should lead to accuracy and tractability. The squared loss is a commonly used error/loss function. It is the sum of squares of the differences between the actual and predicted values:
$E(f, D) = \sum_{j=1}^{m} (f(x_j) - y_j)^2$ (6)
The least squares solution for linear regression is obtained as
$w^* = \operatorname{argmin}_{w} \sum_{j=1}^{m} \left( \sum_{i=1}^{p} w_i \phi_i(x_j) - y_j \right)^2$ (7)
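In matrix form, the objective in equation (7) is $\|\phi w - y\|^2$; a minimal sketch (design matrix and targets made up) that evaluates it for a given $w$:

import numpy as np

def squared_loss(phi, w, y):
    # E(w, D) = sum_j (sum_i w_i phi_i(x_j) - y_j)^2 = ||phi w - y||^2
    r = phi @ w - y
    return r @ r

phi = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])  # made-up 3 x 2 design matrix
y = np.array([1.0, 2.0, 3.0])
print(squared_loss(phi, np.array([0.5, 1.0]), y))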

The minimum value of the squared loss is zero. If zero were attained at $w^*$, we would have
$\forall u,\ \phi(x_u)^T w^* = y_u$, or equivalently $\phi w^* = y$, where
$\phi = \begin{bmatrix} \phi_1(x_1) & \cdots & \phi_p(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_m) & \cdots & \phi_p(x_m) \end{bmatrix}$ and $y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$
This system has a solution if $y$ is in the column space (the subspace of $\mathbb{R}^m$ spanned by the column vectors) of $\phi$.

The minimum value of the squared loss is zero. If zero were NOT attainable at $w^*$, what can be done?

Geometric Interpretation of the Least Squares Solution
Let $y^*$ be a solution in the column space of $\phi$. The least squares solution is such that the distance between $y$ and $y^*$ is minimized. Therefore, the line joining $y$ to $y^*$ should be orthogonal to the column space:
$\phi w^* = y^*$ (8)
$(y - y^*)^T \phi = 0$ (9)
$(y^*)^T \phi = y^T \phi$ (10)
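A quick numeric check of the orthogonality condition (9) (the design matrix and targets are made up): the residual of the least squares fit is orthogonal to every column of $\phi$:

import numpy as np

phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # made-up 3 x 2 design matrix
y = np.array([0.1, 1.1, 1.9])                         # made-up targets

w_star = np.linalg.lstsq(phi, y, rcond=None)[0]  # least squares solution
y_star = phi @ w_star                            # projection of y onto col(phi)
print((y - y_star) @ phi)                        # ~ [0, 0], up to rounding error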

$(\phi w^*)^T \phi = y^T \phi$ (11)
$(w^*)^T \phi^T \phi = y^T \phi$ (12)
$\phi^T \phi w^* = \phi^T y$ (13)
$w^* = (\phi^T \phi)^{-1} \phi^T y$ (14)
Here, $\phi^T \phi$ is invertible only if $\phi$ has full column rank.
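A minimal sketch of solving the normal equations (13)-(14) on the same made-up data as above; numerically, solving the linear system is preferred over forming the inverse explicitly:

import numpy as np

phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # made-up design matrix
y = np.array([0.1, 1.1, 1.9])                         # made-up targets

# Solve phi^T phi w = phi^T y (equation 13) instead of inverting phi^T phi
w_star = np.linalg.solve(phi.T @ phi, phi.T @ y)
print(w_star)
print(np.linalg.lstsq(phi, y, rcond=None)[0])  # same answer, more stable route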

Proof?

Theorem: $\phi^T \phi$ is invertible if and only if $\phi$ has full column rank.
Proof: Given that $\phi$ has full column rank, its columns are linearly independent, so we have $\phi x = 0 \Rightarrow x = 0$. Assume, on the contrary, that $\phi^T \phi$ is not invertible. Then there exists $x \neq 0$ such that $\phi^T \phi x = 0$. Hence $x^T \phi^T \phi x = 0$, i.e., $(\phi x)^T (\phi x) = 0$, so $\phi x = 0$. This is a contradiction. Hence $\phi^T \phi$ is invertible if $\phi$ has full column rank.
Conversely, if $\phi^T \phi$ is invertible, then $\phi x = 0$ implies $\phi^T \phi x = 0$, which in turn implies $x = 0$. Thus the columns of $\phi$ are linearly independent, i.e., $\phi$ has full column rank if $\phi^T \phi$ is invertible. Hence, the theorem is proved.
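A small numeric illustration of the theorem (both matrices are made up): when one column of $\phi$ is a multiple of another, $\phi^T \phi$ becomes singular:

import numpy as np

full_rank = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
rank_deficient = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])  # col 2 = 2 * col 1

for phi in (full_rank, rank_deficient):
    gram = phi.T @ phi
    # Full column rank (rank 2) gives det != 0; rank deficiency gives det == 0
    print(np.linalg.matrix_rank(phi), np.linalg.det(gram))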

Figure: The least squares solution $y^*$ is the orthogonal projection of $y$ onto the column space of $\phi$

What is Next?
- Some more questions on the least squares linear regression model
- More generally: how to minimize a function?
  - Level curves and surfaces
  - Gradient vector
  - Directional derivative
  - Hyperplane; tangential hyperplane
  - Gradient descent algorithm