Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning
Classification, Discriminative learning
Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary & multi-class case, conditional random fields
Marc Toussaint, University of Stuttgart, Summer 2014

Structured Output & Structured Input
regression: R^n → R
structured output:
  R^n → binary class label {0, 1}
  R^n → integer class label {1, 2, .., M}
  R^n → sequence labelling y_{1:T}
  R^n → image labelling y_{1:W, 1:H}
  R^n → graph labelling y_{1:N}
structured input:
  relational database → R
  labelled graph/sequence → R

Examples for Structured Output
Text tagging: X = sentence, Y = tagging of each word; http://sourceforge.net/projects/crftagger
Image segmentation: X = image, Y = labelling of each pixel; http://scholar.google.com/scholar?cluster=3447722994273582
Depth estimation: X = single image, Y = depth map; http://make3d.cs.cornell.edu/

CRFs in image processing
Google "conditional random field image":
  Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
  Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
  Conditional Random Fields for Object Recognition (NIPS 2004)
  Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
  A Conditional Random Field Model for Video Super-resolution (ICPR 2006)

From Regression to Structured Output
Our first step in regression was to define f : R^n → R as
  f(x) = φ(x)^⊤β
We defined a loss function and derived optimal parameters β.
How could we represent a discrete-valued function F : R^n → Y?
→ Discriminative Function

Discriminative Function
Represent a discrete-valued function F : R^n → Y via a discriminative function
  f : R^n × Y → R
such that F : x ↦ argmax_y f(x, y)
A discriminative function f(x, y) maps an input x to an output ŷ(x) = argmax_y f(x, y)
A discriminative function f(x, y) has high value if y is a correct answer to x, and low value if y is a false answer
In that way a discriminative function, e.g., discriminates correct sequence/image/graph-labellings from wrong ones

Example Discriminative Function
Input: x ∈ R^2; output y ∈ {1, 2, 3}
[figure: three surface plots showing p(y=1|x), p(y=2|x), p(y=3|x)]
(here already scaled to the interval [0, 1]... explained later)

How could we parameterize a discriminative function?
Well, linear in features!
  f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^⊤β
Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be
  φ(x, y) = ( [y=1] x, [y=2] x, [y=3] x, [y=1] x², [y=2] x², [y=3] x² )
Example: Let x, y ∈ {0, 1} be both discrete. Features might be
  φ(x, y) = ( [x=0][y=0], [x=0][y=1], [x=1][y=0], [x=1][y=1] )
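
To make the feature parameterization concrete, here is a minimal Python sketch (not part of the lecture material) of the first example above: x ∈ R, y ∈ {1, 2, 3}, the six features [y=c]·x and [y=c]·x², and a purely hypothetical weight vector beta; prediction is the argmax of f(x, y) over y.

```python
import numpy as np

def phi(x, y):
    """Joint input-output features: 6-dim vector, only the block of class y is active."""
    feats = np.zeros(6)
    feats[y - 1] = x            # [y=c] * x
    feats[3 + (y - 1)] = x**2   # [y=c] * x^2
    return feats

def f(x, y, beta):
    """Discriminative function f(x, y) = phi(x, y)^T beta."""
    return phi(x, y) @ beta

def predict(x, beta, classes=(1, 2, 3)):
    """F(x) = argmax_y f(x, y)."""
    return max(classes, key=lambda y: f(x, y, beta))

beta = np.array([1.0, 0.0, -1.0, -0.5, 0.3, 0.1])  # hypothetical weights
print([predict(x, beta) for x in (-2.0, 0.5, 2.0)])
```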

More intuition...
Features "connect" input and output. Each φ_j(x, y) allows f to capture a certain dependence between x and y
If both x and y are discrete, a feature φ_j(x, y) is typically a joint indicator function (logical function), indicating a certain event
Each weight β_j mirrors how important/frequent/infrequent a certain dependence described by φ_j(x, y) is
f(x, y) is also called energy, and the following methods are also called energy-based modelling, esp. in neural modelling
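
As a tiny illustration of such joint indicator features (again only a sketch, with made-up weights): for binary x and y, φ(x, y) is the 4-dimensional indicator of which joint event (x, y) occurred, and each β_j weights one event.

```python
import numpy as np

def phi(x, y):
    feats = np.zeros(4)
    feats[2 * x + y] = 1.0   # one indicator per joint event (x, y)
    return feats

beta = np.array([0.2, -0.1, 1.5, 0.4])   # hypothetical weights, one per event
for x in (0, 1):
    scores = {y: phi(x, y) @ beta for y in (0, 1)}
    print(x, scores, "-> predict y =", max(scores, key=scores.get))
```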

In the remainder:
- Logistic regression: binary case
- Multi-class case
- Preliminary comments on the general structured output case (Conditional Random Fields)

Logistic regression: Binary case

Binary classification example
[figure: training data and decision boundary in the plane]
Input x ∈ R^2, output y ∈ {0, 1}
Example shows RBF Ridge Logistic Regression

A loss function for classification
Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {0, 1}
Bad idea: Squared error regression (see also Hastie 4.2)
Maximum likelihood: We interpret the discriminative function f(x, y) as defining class probabilities
  p(y|x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
p(y|x) should be high for the correct class, and low otherwise
For each (x_i, y_i) we want to maximize the likelihood p(y_i|x_i):
  L^{neg-log-likelihood}(β) = −Σ_{i=1}^n log p(y_i|x_i)
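
A minimal numpy sketch of these two definitions, assuming a generic score function f(x, y) passed in as `score_fn`; the toy data and the linear score at the end are made up for illustration.

```python
import numpy as np

def class_probs(scores):
    """scores: array of f(x, y) for all y; returns p(y|x) via a stable softmax."""
    s = scores - scores.max()     # shift for numerical stability
    e = np.exp(s)
    return e / e.sum()

def neg_log_likelihood(score_fn, data, classes):
    """-sum_i log p(y_i | x_i), with scores f(x, y) given by score_fn."""
    L = 0.0
    for x, y in data:
        p = class_probs(np.array([score_fn(x, c) for c in classes]))
        L -= np.log(p[classes.index(y)])
    return L

# toy usage with a hypothetical linear score in x
data = [(0.5, 1), (-1.0, 0), (2.0, 1)]
score = lambda x, c: (1.0 if c == 1 else -1.0) * x
print(neg_log_likelihood(score, data, classes=[0, 1]))
```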

Logistic regression
In the binary case, we have two functions f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features
  φ(x, y) = [y = 1] φ(x)
with arbitrary input features φ(x) ∈ R^k
We have
  ŷ = argmax_y f(x, y) = 1 if φ(x)^⊤β > 0, else 0
and conditional class probabilities
  p(1|x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))
with the logistic sigmoid function σ(z) = e^z / (e^z + 1) = 1 / (e^{−z} + 1)
[figure: plot of the sigmoid exp(x)/(1+exp(x))]
Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = −Σ_{i=1}^n log p(y_i|x_i) + λ‖β‖²
                  = −Σ_{i=1}^n [ y_i log p(1|x_i) + (1 − y_i) log(1 − p(1|x_i)) ] + λ‖β‖²
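
A quick numerical check (not from the slides) that fixing f(x, 0) = 0 indeed turns the two-class softmax into the logistic sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for _ in range(5):
    f1 = rng.normal()                                   # arbitrary score f(x,1); f(x,0) = 0
    softmax_p1 = np.exp(f1) / (np.exp(0.0) + np.exp(f1))
    assert np.isclose(softmax_p1, sigmoid(f1))
print("p(1|x) = sigma(f(x,1)) holds numerically")
```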

Optimal parameters β
Gradient (see exercises):
  ∂L^{logistic}(β)/∂β = Σ_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^⊤(p − y) + 2λIβ
with p_i := p(y=1|x_i) and X = (φ(x_1)^⊤; ..; φ(x_n)^⊤) ∈ R^{n×k}
∂L^{logistic}(β)/∂β = 0 is non-linear in β (it enters also the calculation of p_i)
→ does not have an analytic solution
Newton algorithm: iterate
  β ← β − H^{−1} ∂L^{logistic}(β)/∂β
with Hessian
  H = ∂²L^{logistic}(β)/∂β² = X^⊤WX + 2λI
W diagonal with W_ii = p_i(1 − p_i)
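
A sketch of this Newton iteration in plain numpy; the feature map, toy data, and λ below are made up for the demo, and no convergence check is included.

```python
import numpy as np

def fit_logistic_newton(X, y, lam=1.0, iters=20):
    """X: (n,k) feature matrix (rows phi(x_i)), y in {0,1}^n. Returns beta."""
    n, k = X.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # p_i = p(y=1|x_i)
        grad = X.T @ (p - y) + 2 * lam * beta         # X^T (p - y) + 2*lambda*I*beta
        W = p * (1 - p)                               # diagonal of W
        H = X.T @ (X * W[:, None]) + 2 * lam * I      # X^T W X + 2*lambda*I
        beta -= np.linalg.solve(H, grad)              # beta <- beta - H^{-1} grad
    return beta

# toy usage: 1D inputs with quadratic polynomial features
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)
X = np.stack([np.ones_like(x), x, x**2], axis=1)
beta = fit_logistic_newton(X, y, lam=1e-2)
print("train accuracy:", np.mean(((X @ beta) > 0) == (y == 1)))
```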

RBF ridge logistic regression:
[figures: training data with decision boundary, and the learned discriminative function]
./x.exe -mode 2 -modelfeaturetype 4 -lambda e+ -rbfbias -rbfwidth .2

polynomial (cubic) logistic regression:
[figures: training data with decision boundary, and the learned discriminative function]
./x.exe -mode 2 -modelfeaturetype 3 -lambda e+

Recap: Regression vs. Classification
Regression:
  parameters β
  predictive function f(x) = φ(x)^⊤β
  least squares loss L^{ls}(β) = Σ_{i=1}^n (y_i − f(x_i))²
Classification:
  parameters β
  discriminative function f(x, y) = φ(x, y)^⊤β
  class probabilities p(y|x) ∝ e^{f(x,y)}
  neg-log-likelihood L^{neg-log-likelihood}(β) = −Σ_{i=1}^n log p(y_i|x_i)

Logistic regression: Multi-class case

Logistic regression: Multi-class case
Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, .., M}
We choose f(x, y) = φ(x, y)^⊤β with
  φ(x, y) = ( [y=1] φ(x), [y=2] φ(x), .., [y=M] φ(x) )
where φ(x) are arbitrary features. We have M (or M−1) parameters β
Conditional class probabilities
  p(y|x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
(optionally we may set f(x, M) = 0 and drop the last entry)
  f(x, y) = log [ p(y|x) / p(y=M|x) ]   (the discriminative functions model "log-ratios")
Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = −Σ_{i=1}^n log p(y=y_i|x_i) + λ‖β‖²

Optimal parameters β
Gradient:
  ∂L^{logistic}(β)/∂β_c = Σ_{i=1}^n (p_{ic} − y_{ic}) φ(x_i) + 2λIβ_c = X^⊤(p_c − y_c) + 2λIβ_c,   p_{ic} = p(y=c|x_i)
Hessian:
  H = ∂²L^{logistic}(β)/(∂β_c ∂β_d) = X^⊤ W_{cd} X + 2 [c = d] λI
W_{cd} diagonal with W_{cd,ii} = p_{ic}([c = d] − p_{id})
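
A sketch that uses this per-class gradient, but with plain gradient descent instead of the (block) Newton step; classes are indexed 0..M−1 here, and the data, learning rate, and λ are made up for the demo.

```python
import numpy as np

def fit_multiclass_logistic(X, y, M, lam=1.0, lr=0.1, iters=500):
    """X: (n,k) rows phi(x_i); y in {0,..,M-1}^n. Returns beta of shape (M,k)."""
    n, k = X.shape
    beta = np.zeros((M, k))
    Y = np.eye(M)[y]                            # one-hot targets y_{ic}
    for _ in range(iters):
        F = X @ beta.T                          # scores f(x_i, c)
        F -= F.max(axis=1, keepdims=True)       # stabilized softmax
        P = np.exp(F)
        P /= P.sum(axis=1, keepdims=True)       # p_{ic} = p(y=c|x_i)
        grad = (P - Y).T @ X + 2 * lam * beta   # row c: X^T(p_c - y_c) + 2*lambda*beta_c
        beta -= lr * grad / n
    return beta

# toy usage: three 2D Gaussian blobs with a constant feature
rng = np.random.default_rng(2)
means = np.array([[0, 0], [3, 0], [0, 3]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in means])
y = np.repeat([0, 1, 2], 50)
X = np.hstack([np.ones((150, 1)), X])
beta = fit_multiclass_logistic(X, y, M=3, lam=1e-2)
print("train accuracy:", np.mean((X @ beta.T).argmax(axis=1) == y))
```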

polynomial (quadratic) ridge 3-class logistic regression:
[figures: training data with p=0.5 decision boundaries, and the class probabilities p(y=1|x), p(y=2|x), p(y=3|x)]
./x.exe -mode 3 -modelfeaturetype 3 -lambda e+

Conditional Random Fields

Conditional Random Fields (CRFs)
CRFs are a generalization of logistic binary and multi-class classification
The output y may be an arbitrary (usually discrete) thing (e.g., sequence/image/graph-labelling)
Hopefully we can maximize efficiently over the output, i.e., compute argmax_y f(x, y)!
f(x, y) should be structured in y so that this optimization is efficient.
The name CRF describes that p(y|x) ∝ e^{f(x,y)} defines a probability distribution (a.k.a. random field) over the output y conditional to the input x. The word "field" usually means that this distribution is structured (a graphical model; see later part of lecture).

CRFs: Core equations
  f(x, y) = φ(x, y)^⊤β
  p(y|x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')} = e^{f(x,y) − Z(x,β)}
  Z(x, β) = log Σ_{y'} e^{f(x,y')}   (log partition function)
  L(β) = −Σ_i log p(y_i|x_i) = −Σ_i [ φ(x_i, y_i)^⊤β − Z(x_i, β) ]
  ∂_β Z(x, β) = Σ_y p(y|x) φ(x, y)
  ∂²_β Z(x, β) = Σ_y p(y|x) φ(x, y) φ(x, y)^⊤ − [∂_β Z][∂_β Z]^⊤
This gives the neg-log-likelihood L(β), its gradient and Hessian
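
A brute-force sketch of these equations for a small, enumerable output space Y; `phis` holds φ(x, y) for every candidate y, and the numbers are made up. Real CRFs exploit the structure of f instead of enumerating Y.

```python
import numpy as np

def log_partition(phis, beta):
    """Z(x,beta) = log sum_y exp(phi(x,y)^T beta); phis: (|Y|, k) feature rows."""
    f = phis @ beta
    m = f.max()
    return m + np.log(np.exp(f - m).sum())       # stable log-sum-exp

def nll_grad(phis, beta, y_idx):
    """-log p(y|x) and its gradient, for one example with true output index y_idx."""
    f = phis @ beta
    Z = log_partition(phis, beta)
    p = np.exp(f - Z)                            # p(y|x) for every y in Y
    nll = -(f[y_idx] - Z)
    grad = -(phis[y_idx] - p @ phis)             # -phi(x,y_i) + E_{p(y|x)}[phi(x,y)]
    return nll, grad

# toy usage: 4 candidate outputs, 3 features (hypothetical numbers)
phis = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 0.]])
beta = np.array([0.5, -0.2, 0.1])
print(nll_grad(phis, beta, y_idx=2))
```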

Training CRFs
Maximize conditional likelihood. But the Hessian is typically too large (images: one label per pixel, very many features)
If f(x, y) has a chain structure over y, the Hessian is usually banded → computation time linear in chain length
Alternative: efficient gradient method, e.g.:
  Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods
Other loss variants, e.g., hinge loss as with Support Vector Machines ("Structured output SVMs")
Perceptron algorithm: minimizes hinge loss using a gradient method
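
A sketch of the perceptron-style update mentioned last (a subgradient step on a hinge-type loss), with a brute-force argmax and the toy 3-class feature map from earlier in the section; all names and weights are illustrative.

```python
import numpy as np

def perceptron_step(phi, beta, x, y_true, candidates):
    """One update: if the argmax is wrong, beta += phi(x, y_true) - phi(x, y_hat)."""
    y_hat = max(candidates, key=lambda y: phi(x, y) @ beta)
    if y_hat != y_true:
        beta = beta + phi(x, y_true) - phi(x, y_hat)
    return beta

# toy usage with the 1D, 3-class feature map from earlier in the section
def phi(x, y):
    v = np.zeros(6)
    v[y - 1], v[3 + y - 1] = x, x**2
    return v

beta = np.zeros(6)
data = [(-2.0, 3), (0.5, 1), (2.0, 2)]
for _ in range(10):
    for x, y in data:
        beta = perceptron_step(phi, beta, x, y, candidates=(1, 2, 3))
print(beta)
```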

CRFs: the structure is in the features
Assume y = (y_1, .., y_l) is a tuple of individual (local) discrete labels
We can assume that f(x, y) is linear in features,
  f(x, y) = Σ_{j=1}^k φ_j(x, y_j) β_j = φ(x, y)^⊤β
where each feature φ_j(x, y_j) depends only on a subset y_j of the labels. φ_j(x, y_j) effectively couples the labels in y_j. Then e^{f(x,y)} is a factor graph.

CRFs: example structures
[figures: example factor-graph structures; the feature decomposition is as on the previous slide]

Example: pair-wise coupled pixel labels
[figure: grid-structured factor graph coupling the image x with pixel labels y_1, .., y_{W·H}]
Each black box corresponds to features φ_j(y_j) which couple neighboring pixel labels y_j
Each gray box corresponds to features φ_j(x_j, y_j) which couple a local pixel observation x_j with a pixel label y_j
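
To make the pairwise coupling concrete, here is a sketch for a 1D chain of binary labels (think of a single image row): the "gray-box" terms couple each observation x_j with its label y_j, the "black-box" terms couple neighbouring labels, and the two weights play the role of β; f is linear in those weights. The argmax is done by brute-force enumeration, which is only feasible for tiny chains (real CRFs would use dynamic programming or message passing); all numbers are made up.

```python
from itertools import product

def f_chain(x, y, w_unary, w_pair):
    """Unary terms reward y_j agreeing with the sign of x_j; pairwise terms
    reward equal neighbouring labels (both encoded as +/-1 features)."""
    unary = sum(w_unary * (1.0 if xj > 0 else -1.0) * (1 if yj == 1 else -1)
                for xj, yj in zip(x, y))
    pair = sum(w_pair * (1 if y[j] == y[j + 1] else -1) for j in range(len(y) - 1))
    return unary + pair

def argmax_bruteforce(x, w_unary, w_pair):
    """Enumerate all binary label sequences (only feasible for tiny chains)."""
    return max(product([0, 1], repeat=len(x)),
               key=lambda y: f_chain(x, y, w_unary, w_pair))

x = [0.9, 0.2, -0.1, -0.8]                      # noisy observations along the chain
print(argmax_bruteforce(x, w_unary=1.0, w_pair=0.6))
```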