Machine Learning
Classification, Discriminative Learning
Structured output, structured input, discriminative function, joint input-output features, likelihood maximization, logistic regression (binary & multi-class case), conditional random fields
Marc Toussaint, University of Stuttgart, Summer 2014
Structured Output & Structured Input

regression:
- R^n → R

structured output:
- R^n → binary class label {0, 1}
- R^n → integer class label {1, 2, .., M}
- R^n → sequence labelling y_{1:T}
- R^n → image labelling y_{1:W,1:H}
- R^n → graph labelling y_{1:N}

structured input:
- relational database → R
- labelled graph/sequence → R
Examples for Structured Output

- Text tagging: X = sentence, Y = tagging of each word
  http://sourceforge.net/projects/crftagger
- Image segmentation: X = image, Y = labelling of each pixel
  http://scholar.google.com/scholar?cluster=3447722994273582
- Depth estimation: X = single image, Y = depth map
  http://make3d.cs.cornell.edu/
CRFs in image processing

Google "conditional random field image":
- Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
- Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
- Conditional Random Fields for Object Recognition (NIPS 2004)
- Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
- A Conditional Random Field Model for Video Super-resolution (ICPR 2006)
From Regression to Structured Output

- Our first step in regression was to define f : R^n → R as
  f(x) = φ(x)^T β
  We defined a loss function and derived optimal parameters β.
- How could we represent a discrete-valued function F : R^n → Y?
  → Discriminative Function
Discriminative Function

- Represent a discrete-valued function F : R^n → Y via a discriminative function
  f : R^n × Y → R
  such that F : x ↦ argmax_y f(x, y)
- A discriminative function f(x, y) maps an input x to an output
  ŷ(x) = argmax_y f(x, y)
- A discriminative function f(x, y) has high value if y is a correct answer to x, and low value if y is a false answer
- In that way a discriminative function discriminates, e.g., correct sequence/image/graph labellings from wrong ones
Example Discriminative Function

- Input: x ∈ R^2; output y ∈ {1, 2, 3}
- [Figure: plots of p(y=1|x), p(y=2|x), p(y=3|x) over the input space]
- (here already scaled to the interval [0,1]... explained later)
How could we parameterize a discriminative function?

- Well, linear in features!
  f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^T β
- Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be
  φ(x, y) = ( [y=1] x,  [y=2] x,  [y=3] x,  [y=1] x²,  [y=2] x²,  [y=3] x² )
- Example: Let x, y ∈ {0, 1} both be discrete. Features might be
  φ(x, y) = ( [x=0][y=0],  [x=0][y=1],  [x=1][y=0],  [x=1][y=1] )
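To make the first example concrete, here is a minimal numerical sketch of joint input-output features and argmax prediction. The weight vector beta is an arbitrary illustrative value, not a trained model; the feature layout follows the example above (x ∈ R, y ∈ {1, 2, 3}, with x and x² features per class).

```python
import numpy as np

def phi(x, y):
    """Joint input-output features for x in R, y in {1, 2, 3}:
    phi(x, y) = ([y=1] x, [y=2] x, [y=3] x, [y=1] x^2, [y=2] x^2, [y=3] x^2)."""
    ind = np.array([y == 1, y == 2, y == 3], dtype=float)
    return np.concatenate([ind * x, ind * x**2])

def f(x, y, beta):
    """Discriminative function, linear in the joint features."""
    return phi(x, y) @ beta

def F(x, beta, classes=(1, 2, 3)):
    """Discrete-valued prediction: F(x) = argmax_y f(x, y)."""
    return max(classes, key=lambda y: f(x, y, beta))

beta = np.array([0.5, -1.0, 0.2, -0.3, 0.1, 0.4])  # illustrative weights, not trained
print(F(1.5, beta))  # predicted class label for x = 1.5
```

With only three classes the argmax is a direct enumeration; the interesting (structured) case later is when the output space is too large to enumerate.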
more intuition...

- Features "connect" input and output. Each φ_j(x, y) allows f to capture a certain dependence between x and y
- If both x and y are discrete, a feature φ_j(x, y) is typically a joint indicator function (logical function), indicating a certain event
- Each weight β_j mirrors how important/frequent/infrequent a certain dependence described by φ_j(x, y) is
- f(x, y) is also called energy, and the following methods are also called energy-based modelling, esp. in neural modelling
In the remainder:
- Logistic regression: binary case
- Logistic regression: multi-class case
- Preliminary comments on the general structured output case (Conditional Random Fields)
Logistic regression: Binary case
Binary classification example

- [Figure: training data and decision boundary]
- Input x ∈ R^2, output y ∈ {0, 1}
- The example shows RBF ridge logistic regression
A loss function for classification

- Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {0, 1}
- Bad idea: squared error regression (see also Hastie 4.2)
- Maximum likelihood: we interpret the discriminative function f(x, y) as defining class probabilities
  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
  p(y | x) should be high for the correct class, and low otherwise
- For each (x_i, y_i) we want to maximize the likelihood p(y_i | x_i):
  L^{neg-log-likelihood}(β) = −Σ_{i=1}^n log p(y_i | x_i)
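A small sketch of this maximum-likelihood view, assuming a generic discriminative function f(x, y, beta) (such as the one sketched earlier) and a label set small enough that the normalization over y can be done by direct enumeration; all names are illustrative.

```python
import numpy as np

def class_probs(x, beta, classes, f):
    """p(y|x) = exp f(x,y) / sum_y' exp f(x,y'), by enumerating the label set."""
    scores = np.array([f(x, y, beta) for y in classes])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def neg_log_likelihood(D, beta, classes, f):
    """L(beta) = - sum_i log p(y_i | x_i) over data D = [(x_i, y_i), ...]."""
    nll = 0.0
    for x_i, y_i in D:
        p = class_probs(x_i, beta, classes, f)
        nll -= np.log(p[classes.index(y_i)])
    return nll
```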
Logistic regression

- In the binary case, we have two functions f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0. Therefore we choose features
  φ(x, y) = [y=1] φ(x)
  with arbitrary input features φ(x) ∈ R^k
- We have
  ŷ = argmax_y f(x, y) = { 1 if φ(x)^T β > 0, 0 otherwise }
  and conditional class probabilities
  p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))
  with the logistic sigmoid function σ(z) = e^z / (e^z + 1) = 1 / (1 + e^{−z})
  [Figure: plot of the sigmoid exp(x)/(1+exp(x))]
- Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = −Σ_{i=1}^n log p(y_i | x_i) + λ‖β‖²
                  = −Σ_{i=1}^n [ y_i log p(1 | x_i) + (1 − y_i) log(1 − p(1 | x_i)) ] + λ‖β‖²
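A minimal sketch of the binary case, with f(x, 0) fixed to zero so that p(1|x) = σ(φ(x)^T β). The feature map phi below (a constant bias plus the raw input) and the ridge parameter lam are illustrative assumptions, not the RBF features used in the plots.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    """Input features: constant bias plus the raw input (placeholder choice)."""
    return np.concatenate([[1.0], np.atleast_1d(x)])

def p1(x, beta):
    """p(y=1 | x) = sigma(phi(x)^T beta), since f(x, 0) is fixed to 0."""
    return sigmoid(phi(x) @ beta)

def predict(x, beta):
    """y_hat = 1 if phi(x)^T beta > 0, else 0."""
    return int(phi(x) @ beta > 0)

def logistic_loss(D, beta, lam):
    """L(beta) = - sum_i [y_i log p(1|x_i) + (1-y_i) log(1 - p(1|x_i))] + lam ||beta||^2."""
    loss = lam * beta @ beta
    for x_i, y_i in D:
        p = p1(x_i, beta)
        loss -= y_i * np.log(p) + (1 - y_i) * np.log(1 - p)
    return loss
```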
Optimal parameters β

- Gradient (see exercises):
  ∂L^{logistic}(β)/∂β = Σ_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^T (p − y) + 2λIβ
  with p_i := p(y=1 | x_i) and X = (φ(x_1), .., φ(x_n))^T ∈ R^{n×k}
- ∂L^{logistic}(β)/∂β is non-linear in β (β also enters the calculation of p_i)
  → no analytic solution
- Newton algorithm: iterate (see the code sketch below)
  β ← β − H^{-1} ∂L^{logistic}(β)/∂β
  with Hessian H = ∂²L^{logistic}(β)/∂β² = X^T W X + 2λI
  W diagonal with W_ii = p_i (1 − p_i)
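The Newton iteration above is easy to write out directly. A sketch, assuming a precomputed feature matrix X with rows φ(x_i) and a 0/1 label vector y (fixed number of iterations, no line search or convergence check).

```python
import numpy as np

def newton_logistic(X, y, lam, iters=20):
    """Newton iterations  beta <- beta - H^{-1} grad  for ridge-regularized logistic regression.
    X: (n, k) matrix with rows phi(x_i);  y: (n,) vector of 0/1 labels;  lam: ridge parameter."""
    n, k = X.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2 * lam * beta      # X^T (p - y) + 2 lam I beta
        W = np.diag(p * (1 - p))                   # W_ii = p_i (1 - p_i)
        H = X.T @ W @ X + 2 * lam * I              # Hessian
        beta = beta - np.linalg.solve(H, grad)
    return beta
```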
RBF ridge logistic regression:

[Figure: training data, decision boundary, and the learned model]

./x.exe -mode 2 -modelfeaturetype 4 -lambda e+ -rbfbias -rbfwidth .2
polynomial (cubic) logistic regression:

[Figure: training data, decision boundary, and the learned model]

./x.exe -mode 2 -modelfeaturetype 3 -lambda e+
Recap: Regression vs. Classification

- Regression: parameters β; predictive function f(x) = φ(x)^T β; least-squares loss L^{ls}(β) = Σ_{i=1}^n (y_i − f(x_i))²
- Classification: parameters β; discriminative function f(x, y) = φ(x, y)^T β; class probabilities p(y|x) ∝ e^{f(x,y)}; neg-log-likelihood L^{neg-log-likelihood}(β) = −Σ_{i=1}^n log p(y_i | x_i)
Logistic regression: Multi-class case
Logistic regression: Multi-class case

- Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, .., M}
- We choose f(x, y) = φ(x, y)^T β with
  φ(x, y) = ( [y=1] φ(x),  [y=2] φ(x),  ..,  [y=M] φ(x) )
  where φ(x) are arbitrary features. We have M (or M−1) blocks of parameters in β
- Conditional class probabilities
  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
  (optionally we may set f(x, M) = 0 and drop the last entry)
- f(x, y) = log [ p(y | x) / p(y=M | x) ]   (the discriminative functions model "log-ratios")
- Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = −Σ_{i=1}^n log p(y = y_i | x_i) + λ‖β‖²
Optimal parameters β

- Gradient:
  ∂L^{logistic}(β)/∂β_c = Σ_{i=1}^n (p_ic − y_ic) φ(x_i) + 2λIβ_c = X^T (p_c − y_c) + 2λIβ_c,   with p_ic = p(y=c | x_i)
- Hessian:
  H = ∂²L^{logistic}(β)/∂β_c ∂β_d = X^T W_cd X + 2 [c=d] λI
  W_cd diagonal with W_cd,ii = p_ic ([c=d] − p_id)
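A sketch of this gradient, stacking the per-class parameters β_c as columns of a matrix B and the one-hot labels y_ic as rows of Y; names and shapes are assumptions for illustration. (The Hessian blocks X^T W_cd X could be assembled analogously but are omitted here.)

```python
import numpy as np

def multiclass_grad(X, Y, B, lam):
    """Gradient of the ridge-regularized multi-class logistic loss.
    X: (n, k) rows phi(x_i);  Y: (n, M) one-hot labels y_ic;  B: (k, M) columns beta_c."""
    scores = X @ B                                  # f(x_i, c) = phi(x_i)^T beta_c
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)               # p_ic = p(y=c | x_i)
    return X.T @ (P - Y) + 2 * lam * B              # column c equals X^T (p_c - y_c) + 2 lam beta_c
```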
polynomial (quadratic) ridge 3-class logistic regression:

[Figure: training data with p=0.5 decision boundaries, and the three class probabilities p(y=1|x), p(y=2|x), p(y=3|x)]

./x.exe -mode 3 -modelfeaturetype 3 -lambda e+
Conditional Random Fields
Conditional Random Fields (CRFs)

- CRFs are a generalization of logistic binary and multi-class classification
- The output y may be an arbitrary (usually discrete) thing (e.g., a sequence/image/graph labelling)
- Hopefully we can compute the maximization
  argmax_y f(x, y)
  over the output efficiently! f(x, y) should be structured in y so that this optimization is efficient.
- The name CRF describes that p(y|x) ∝ e^{f(x,y)} defines a probability distribution (a.k.a. random field) over the output y conditional to the input x. The word "field" usually means that this distribution is structured (a graphical model; see the later part of the lecture).
CRFs: Core equations

f(x, y) = φ(x, y)^T β
p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')} = e^{f(x,y) − Z(x,β)}
Z(x, β) = log Σ_{y'} e^{f(x,y')}   (log partition function)
L(β) = −Σ_i log p(y_i | x_i) = −Σ_i [ φ(x_i, y_i)^T β − Z(x_i, β) ]
∇_β Z(x, β) = Σ_y p(y | x) φ(x, y)
∇²_β Z(x, β) = Σ_y p(y | x) φ(x, y) φ(x, y)^T − [∇_β Z] [∇_β Z]^T

- This gives the neg-log-likelihood L(β), its gradient and Hessian
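A brute-force sketch of these equations, assuming the output space is small enough to enumerate (which of course defeats the point of structure, but makes the log partition function and the gradient of L explicit); `phi`, `outputs`, and `D` are illustrative placeholders.

```python
import numpy as np

def log_Z(x, beta, outputs, phi):
    """Log partition function Z(x, beta) = log sum_y exp(phi(x,y)^T beta), by enumeration."""
    scores = np.array([phi(x, y) @ beta for y in outputs])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def crf_nll_and_grad(D, beta, outputs, phi):
    """Neg-log-likelihood L(beta) = -sum_i [phi(x_i,y_i)^T beta - Z(x_i,beta)] and its gradient."""
    L, grad = 0.0, np.zeros_like(beta)
    for x_i, y_i in D:
        Z = log_Z(x_i, beta, outputs, phi)
        L -= phi(x_i, y_i) @ beta - Z
        # gradient of Z: expected features  sum_y p(y|x) phi(x,y)
        probs = np.array([np.exp(phi(x_i, y) @ beta - Z) for y in outputs])
        expected = sum(p * phi(x_i, y) for p, y in zip(probs, outputs))
        grad += expected - phi(x_i, y_i)
    return L, grad
```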
Training CRFs

- Maximize the conditional likelihood. But the Hessian is typically too large (e.g., image labelling: one output per pixel and very many features). If f(x, y) has a chain structure over y, the Hessian is usually banded → computation time linear in chain length
- Alternative: efficient gradient methods, e.g.
  Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods
  (see the sketch below)
- Other loss variants, e.g., hinge loss as with Support Vector Machines ("structured output SVMs")
- Perceptron algorithm: minimizes hinge loss using a gradient method
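Since the exact Hessian is typically too large, a plain stochastic-gradient loop (in the spirit of the Vishwanathan et al. reference, though without their acceleration) is a common fallback; this sketch reuses the hypothetical `crf_nll_and_grad` from above on single examples and applies the ridge term at every step for simplicity.

```python
import numpy as np

def sgd_train(D, beta0, outputs, phi, lam=1e-2, lr=0.1, epochs=10, seed=0):
    """Stochastic gradient descent on the regularized CRF neg-log-likelihood (one example per step)."""
    rng = np.random.default_rng(seed)
    beta = beta0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(D)):
            _, g = crf_nll_and_grad([D[i]], beta, outputs, phi)
            beta -= lr * (g + 2 * lam * beta)   # per-example gradient plus ridge term
    return beta
```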
CRFs: the structure is in the features

- Assume y = (y_1, .., y_l) is a tuple of individual (local) discrete labels
- We can assume that f(x, y) is linear in features,
  f(x, y) = Σ_{j=1}^k φ_j(x, y_j) β_j = φ(x, y)^T β
  where each feature φ_j(x, y_j) depends only on a subset y_j of labels. φ_j(x, y_j) effectively couples the labels in y_j. Then e^{f(x,y)} is a factor graph.
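When the features couple only neighboring labels of a chain y = (y_1, .., y_T), the argmax over y can be computed in time linear in T by a max-sum (Viterbi-style) recursion. A sketch, assuming f has already been split into unary scores U[t, y_t] and pairwise scores V[y_{t-1}, y_t]; these arrays would be computed from φ and β, and their names are illustrative.

```python
import numpy as np

def chain_argmax(U, V):
    """Max-sum recursion for f(x, y) = sum_t U[t, y_t] + sum_t V[y_{t-1}, y_t].
    U: (T, K) unary scores (already include the input x and the weights beta);
    V: (K, K) pairwise scores. Returns a maximizing label sequence of length T."""
    T, K = U.shape
    M = np.zeros((T, K))                 # M[t, k] = best score of any prefix ending in y_t = k
    back = np.zeros((T, K), dtype=int)   # backpointers
    M[0] = U[0]
    for t in range(1, T):
        cand = M[t - 1][:, None] + V + U[t][None, :]   # (K, K): previous label x current label
        back[t] = cand.argmax(axis=0)
        M[t] = cand.max(axis=0)
    # backtrack the argmax sequence
    y = np.zeros(T, dtype=int)
    y[-1] = M[-1].argmax()
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y
```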
CRFs: example structures

[Figure: example factor-graph structures for the decomposition f(x, y) = Σ_j φ_j(x, y_j) β_j]
Example: pair-wise coupled pixel labels

[Figure: input image x connected to a grid of pixel labels y_11 .. y_HW via a factor graph]

- Each black box corresponds to features φ_j(y_j) which couple neighboring pixel labels y_j
- Each gray box corresponds to features φ_j(x_j, y_j) which couple a local pixel observation x_j with a pixel label y_j
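A sketch of evaluating f(x, y) for this grid example with binary pixel labels: the gray-box (unary) features couple each observation x_ij with its label y_ij, and the black-box (pairwise) features reward equal neighboring labels. The specific feature choices and weights are illustrative assumptions, not the ones used in the lecture.

```python
import numpy as np

def grid_score(x, y, w_unary=1.0, w_pair=1.0):
    """f(x, y) for an image x of shape (H, W) and pixel labels y of shape (H, W) in {0, 1}."""
    # unary part: label 1 is preferred where the observation is large (illustrative choice)
    f = w_unary * np.sum(x * (2 * y - 1))
    # pairwise part: one feature per horizontally / vertically neighboring label pair
    f += w_pair * np.sum(y[:, 1:] == y[:, :-1])
    f += w_pair * np.sum(y[1:, :] == y[:-1, :])
    return f
```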