Learning Structured Predictors
1 Learning Structured Predictors Xavier Carreras Xerox Research Centre Europe
2 Supervised (Structured) Prediction Learning to predict: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y that works well on unseen inputs x Non-Structured Prediction: outputs y are atomic Binary prediction: y ∈ {-1, +1} Multiclass prediction: y ∈ {1, 2, ..., L} Structured Prediction: outputs y are structured Sequence prediction: y are sequences Parsing: y are trees...
3 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006
4 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 y per per - - loc x Jack London went to Paris y per per - - loc x Paris Hilton went to London y per - - loc x Jackie went to Lisdon
5 Part-of-speech Tagging y NNP NNP VBZ NNP. x Ms. Haag plays Elianti.
6 Syntactic Parsing [dependency tree figure: labels ROOT, SBJ, VC, TMP, OBJ, NMOD, NAME, PMOD, LOC, P over "Unesco is now holding its biennial meetings in New York."] x are sentences y are syntactic dependency trees
7 Machine Translation (Galley et al. 2006) [figure residue from the cited paper removed; the figure showed how spans and complement-spans determine which translation rules are extracted, with a minimal rule extracted from each member of the frontier set] x are sentences in Chinese y are sentences in English aligned to x
8 Object Detection (Kumar and Hebert 2003) x are images y are grids labeled with object types
10 Today's Goals Introduce basic concepts for structured prediction We will restrict ourselves to sequence prediction What can we borrow from standard classification? Learning paradigms and algorithms, in essence, work here too However, the computations behind the algorithms are prohibitive What can we borrow from HMMs and other structured formalisms? Representations of structured data in feature spaces Inference/search algorithms for tractable computations E.g., algorithms for HMMs (Viterbi, forward-backward) will play a major role in today's methods
12 Sequence Prediction y per per - - loc x Jack London went to Paris
13 Sequence Prediction x = x_1 x_2 ... x_n are input sequences, x_i ∈ X y = y_1 y_2 ... y_n are output sequences, y_i ∈ {1, ..., L} Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y that works well on unseen inputs x What is the form of our prediction model?
14 Exponentially-many Solutions Let Y = {-, per, loc} The solution space (all output sequences): [lattice over "Jack London went to Paris": each position can take any label in Y; each path through the lattice is a possible solution] For an input sequence of length n, there are |Y|^n possible outputs
16 Approach 1: Local Classifiers? Jack London went to Paris Decompose the sequence into n classification problems: A classifier predicts individual labels at each position ŷ_i = argmax_{l ∈ {loc, per, -}} w·f(x, i, l) f(x, i, l) represents an assignment of label l for x_i w is a vector of parameters, with a weight for each feature of f Use standard classification methods to learn w At test time, predict the best sequence by simply concatenating the best label for each position
18 Indicator Features f(x, i, l) is a vector of d features representing label l for x_i: [ f_1(x, i, l), ..., f_j(x, i, l), ..., f_d(x, i, l) ] What's in a feature f_j(x, i, l)? Anything we can compute using x, i and l Anything that indicates whether l is (not) a good label for x_i Indicator features: binary-valued features looking at a simple pattern of x, the target position i, and the candidate label l for position i: f_j(x, i, l) = 1 if x_i = London and l = loc, 0 otherwise f_k(x, i, l) = 1 if x_{i+1} = went and l = loc, 0 otherwise
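The two indicator features above can be sketched directly as functions (a minimal illustration; the lecture does not prescribe an implementation, and note that Python indexes from 0 while the slides index positions from 1):

```python
# Indicator features as plain functions over (x, i, l); positions are
# 0-indexed here, whereas the slides use 1-indexed positions.

def f_j(x, i, l):
    """1 if x_i = 'London' and the candidate label l = 'loc', else 0."""
    return 1 if x[i] == "London" and l == "loc" else 0

def f_k(x, i, l):
    """1 if the next word x_{i+1} = 'went' and l = 'loc', else 0."""
    return 1 if i + 1 < len(x) and x[i + 1] == "went" and l == "loc" else 0

x = ["Jack", "London", "went", "to", "Paris"]
print(f_j(x, 1, "loc"), f_k(x, 1, "loc"), f_j(x, 4, "loc"))  # 1 1 0
```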
20 Feature Templates Feature templates generate many indicator features mechanically A feature template is identified by a type and a number of values Example: template word extracts the current word: f_{word,a,w}(x, i, l) = 1 if x_i = w and l = a, 0 otherwise A feature of this type is identified by the tuple (word, a, w) The template generates a feature for every label a ∈ Y and every word w, e.g.: a = loc, w = London; a = -, w = London; a = loc, w = Paris; a = per, w = Paris; a = per, w = John; a = -, w = the In feature-based models: Define feature templates manually Instantiate the templates on every set of values in the training data; this generates a very high-dimensional feature space Define a parameter vector w indexed by such feature tuples Let the learning algorithm choose the relevant features
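The instantiation step can be sketched as follows (hypothetical helper names; the word template generates one feature dimension per (label, word) pair observed in training):

```python
from collections import defaultdict

def instantiate_word_template(corpus):
    """Collect a tuple ('word', a, w) for every label a and word w that
    co-occur in the training data; each tuple indexes one feature."""
    feats = set()
    for x, y in corpus:
        for word, label in zip(x, y):
            feats.add(("word", label, word))
    return feats

corpus = [(["Jack", "London", "went", "to", "Paris"],
           ["per", "per", "-", "-", "loc"])]
feats = instantiate_word_template(corpus)
weights = defaultdict(float)   # parameter vector w, indexed by feature tuples
print(("word", "loc", "Paris") in feats)  # True
```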
22 More Features for NE Recognition per per - Jack London went to Paris In practice, construct f(x, i, l) by... Define a number of simple patterns of x and i: the current word x_i; is x_i capitalized?; does x_i have digits?; prefixes/suffixes of size 1, 2, 3, ...; is x_i a known location?; is x_i a known person?; the next word; the previous word; the current and next words together; other combinations Define feature templates by combining patterns with labels l Generate actual features by instantiating templates on training data Main limitation: features can't capture interactions between labels!
23 Approach 2: HMM for Sequence Prediction π_per per T_per,per per - - loc O_per,London Jack London went to Paris Define an HMM where each label is a state Model parameters: π_l: probability of starting with label l T_{l,l'}: probability of transitioning from l to l' O_{l,x}: probability of generating symbol x given label l Predictions: p(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts + smoothing Prediction: Viterbi algorithm
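Under these definitions, scoring a fixed (x, y) pair is a product of parameters, usually computed in log space for numerical stability. A minimal sketch (the toy parameter values are made up for illustration, not taken from the lecture):

```python
import math

def hmm_log_prob(x, y, pi, T, O):
    """log p(x, y) = log pi_{y1} + log O_{y1,x1}
                     + sum_{i>1} [ log T_{y_{i-1},y_i} + log O_{y_i,x_i} ]"""
    lp = math.log(pi[y[0]]) + math.log(O[y[0]][x[0]])
    for i in range(1, len(x)):
        lp += math.log(T[y[i - 1]][y[i]]) + math.log(O[y[i]][x[i]])
    return lp

# Toy parameters over two labels and two words (illustrative only).
pi = {"per": 0.6, "-": 0.4}
T = {"per": {"per": 0.3, "-": 0.7}, "-": {"per": 0.2, "-": 0.8}}
O = {"per": {"Jack": 0.9, "went": 0.1}, "-": {"Jack": 0.05, "went": 0.95}}
p = math.exp(hmm_log_prob(["Jack", "went"], ["per", "-"], pi, T, O))
print(round(p, 4))  # 0.3591 = 0.6 * 0.9 * 0.7 * 0.95
```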
24 Approach 2: Representation in HMM π_per per T_per,per per - - loc O_per,London Jack London went to Paris Label interactions are captured in the transition parameters But interactions between labels and input symbols are quite limited! Only O_{y_i,x_i} = p(x_i | y_i) It is not clear how to exploit patterns such as: capitalization, digits; prefixes and suffixes; next word, previous word; combinations of these with label transitions Why? HMM independence assumptions: given label y_i, token x_i is independent of anything else
26 Local Classifiers vs. HMM Local Classifiers: Form: w·f(x, i, l) Learning: standard classifiers Prediction: independent for each x_i Advantage: feature-rich Drawback: no label interactions HMM: Form: π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts Prediction: Viterbi Advantage: label interactions Drawback: no fine-grained features
27 Approach 3: Global Sequence Predictors y: per per - - loc x: Jack London went to Paris Learn a single classifier from x to y: predict(x_{1:n}) = argmax_{y ∈ Y^n} w·f(x, y) Next questions: How do we represent entire sequences in f(x, y)? There are exponentially many sequences y for a given x; how do we solve the argmax problem?
29 Factored Representations y: per per - - loc x: Jack London went to Paris How do we represent entire sequences in f(x, y)? Look at individual assignments y_i (standard classification) Look at bigrams of output labels (y_{i-1}, y_i) Look at trigrams of output labels (y_{i-2}, y_{i-1}, y_i) Look at n-grams of output labels (y_{i-n+1}, ..., y_{i-1}, y_i) Look at the full label sequence y (intractable) A factored representation will lead to a tractable model
33 Bigram Feature Templates y per per - - loc x Jack London went to Paris A template for word + bigram: f_{wb,a,b,w}(x, i, y_{i-1}, y_i) = 1 if x_i = w and y_{i-1} = a and y_i = b, 0 otherwise e.g., f_{wb,per,per,London}(x, 2, per, per) = 1 f_{wb,per,per,London}(x, 3, per, -) = 0 f_{wb,per,-,went}(x, 3, per, -) = 1
34 More Templates for NER x Jack London went to Paris y per per - - loc y per loc - - loc x My trip to London... y - - - loc f_{w,per,per,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = per f_{w,per,loc,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = loc f_{prep,loc,to}(...) = 1 iff x_{i-1} = to and x_i matches /[A-Z]/ and y_i = loc f_{city,loc}(...) = 1 iff y_i = loc and world-cities(x_i) = 1 f_{fname,per}(...) = 1 iff y_i = per and first-names(x_i) = 1
39 Representations Factored at Bigrams y: per per - - loc x: Jack London went to Paris f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of a label bigram at i Each dimension is typically a boolean indicator (0 or 1) f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of the entire y Aggregated representation obtained by summing bigram feature vectors Each dimension is now a count of a feature pattern
40 Linear Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w·f(x, y) where f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i) Note the linearity of the expression: w·f(x, y) = w·∑_{i=1}^n f(x, i, y_{i-1}, y_i) = ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Next questions: How do we solve the argmax problem? How do we learn w?
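This linearity means the global score can be accumulated bigram by bigram. A minimal sketch with sparse indicator features (the feature function and the weight entries below are hypothetical examples):

```python
def global_score(w, f, x, y, start="<s>"):
    """w·f(x, y) = sum_i w·f(x, i, y_{i-1}, y_i); f returns the list of
    active (indicator) features for one label bigram, w maps them to weights."""
    total, prev = 0.0, start
    for i, cur in enumerate(y):
        total += sum(w.get(feat, 0.0) for feat in f(x, i, prev, cur))
        prev = cur
    return total

# Active features at position i: a word template and the label bigram.
def f(x, i, prev, cur):
    return [("word", cur, x[i]), ("bigram", prev, cur)]

w = {("word", "per", "Jack"): 2.0, ("bigram", "per", "per"): 1.0}
print(global_score(w, f, ["Jack", "London"], ["per", "per"]))  # 3.0
```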
43 Predicting with Factored Sequence Models Consider a fixed w. Given x_{1:n}, find: argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n·|Y|²) Notational change: since w and x_{1:n} are fixed, we will use s(i, a, b) = w·f(x, i, a, b)
44 Viterbi for Factored Sequence Models Given scores s(i, a, b) for each position i and output bigram (a, b), find: argmax_{y ∈ Y^n} ∑_{i=1}^n s(i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n·|Y|²) Intuition: output sequences that share bigrams will share scores [lattice diagram: for each label a, the best subsequence with y_{i-1} = a extends to the best subsequence with y_i = b through the score s(i, a, b)]
47 Intuition for Viterbi Consider a fixed x_{1:n} Assume we have the best sub-sequences up to position i-1 [diagram: best subsequences ending with y_{i-1} = per, loc, -, each connected to position i through the scores s(i, per, loc), s(i, loc, loc), s(i, -, loc)] What is the best sequence up to position i with y_i = loc?
48 Viterbi for Linear Factored Predictors ŷ = argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Definition: score of the optimal sequence for x_{1:i} ending with a ∈ Y: δ(i, a) = max_{y ∈ Y^i : y_i = a} ∑_{j=1}^i s(j, y_{j-1}, y_j) Use the following recursions, for all a ∈ Y: δ(1, a) = s(1, y_0 = null, a) δ(i, a) = max_{b ∈ Y} [ δ(i-1, b) + s(i, b, a) ] The optimal score for x is max_{a ∈ Y} δ(n, a) The optimal sequence ŷ can be recovered through back-pointers
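The recursions above translate directly into code. A sketch (scores come in as a function s(i, prev, a), with prev = None playing the role of y_0 = null, and positions 0-indexed):

```python
def viterbi(n, labels, s):
    """Find argmax_y sum_i s(i, y_{i-1}, y_i) in O(n |Y|^2).
    delta[i][a] is the score of the best length-(i+1) prefix ending in a;
    back-pointers recover the optimal sequence."""
    delta = [{a: s(0, None, a) for a in labels}]
    back = [{}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for a in labels:
            b = max(labels, key=lambda b: delta[i - 1][b] + s(i, b, a))
            delta[i][a] = delta[i - 1][b] + s(i, b, a)
            back[i][a] = b
    last = max(labels, key=lambda a: delta[n - 1][a])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))

# Toy scores that reward one target sequence.
target = ["per", "-", "loc"]
s = lambda i, prev, a: 1.0 if a == target[i] else 0.0
print(viterbi(3, ["per", "loc", "-"], s))  # ['per', '-', 'loc']
```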
49 Linear Factored Sequence Prediction predict(x 1:n ) = argmax y Y n w f(x, y) Factored representation, e.g. based on bigrams Flexible, arbitrary features of full x and the factors Efficient prediction using Viterbi Next, learning w: Probabilistic log-linear models: Local learning, a.k.a. Maximum-Entropy Markov Models Global learning, a.k.a. Conditional Random Fields Margin-based methods: Structured Perceptron Structured SVM
59 The Learner's Game Training Data per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w w_{Lower,-} = +1 w_{Upper,per} = +1 w_{Upper,loc} = +1 w_{Word,per,Maria} = +2 w_{Word,per,Jack} = +2 w_{NextW,per,went} = +2 w_{NextW,org,played} = +2 w_{PrevW,org,against} = +2 w_{UpperBigram,per,per} = +2 w_{UpperBigram,loc,loc} = +2 w_{NextW,loc,played} = -1000
60 Log-linear Models for Sequence Prediction y per per - - loc x Jack London went to Paris
61 Log-linear Models for Sequence Prediction Model the conditional distribution: Pr(y | x; w) = exp{w·f(x, y)} / Z(x; w) where x = x_1 x_2 ... x_n with x_i ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) represents x and y with d features w ∈ R^d are the parameters of the model Z(x; w) is a normalizer called the partition function: Z(x; w) = ∑_{z ∈ Y^n} exp{w·f(x, z)} To predict the best sequence: predict(x_{1:n}) = argmax_{y ∈ Y^n} Pr(y | x)
62 Log-linear Models: Name Let's take the log of the conditional probability: log Pr(y | x; w) = log [ exp{w·f(x, y)} / Z(x; w) ] = w·f(x, y) - log ∑_{y'} exp{w·f(x, y')} = w·f(x, y) - log Z(x; w) Partition function: Z(x; w) = ∑_{y'} exp{w·f(x, y')} log Z(x; w) is a constant for a fixed x In log space, computations are linear, i.e., we model log-probabilities using a linear predictor
63 Making Predictions with Log-Linear Models For tractability, assume f(x, y) decomposes into bigrams: f(x_{1:n}, y_{1:n}) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i) Given w and x_{1:n}, find: argmax_{y_{1:n}} Pr(y_{1:n} | x_{1:n}; w) = argmax_y exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x; w) = argmax_y exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } = argmax_y ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) We can use the Viterbi algorithm
65 Parameter Estimation in Log-Linear Models Pr(y | x; w) = exp{w·f(x, y)} / Z(x; w) How to estimate w given training data? Two approaches: MEMMs: assume that Pr(y | x; w) decomposes CRFs: assume that f(x, y) decomposes
67 Maximum Entropy Markov Models (MEMMs) (McCallum, Freitag, Pereira 00) Similarly to HMMs: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) Pr(y_{2:n} | x_{1:n}, y_1) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{1:i-1}) Assumption under MEMMs: Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}) giving Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{i-1})
68 Parameter Estimation in MEMMs Decompose the sequential problem: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, i, y_{i-1}) Learn local log-linear distributions (i.e. MaxEnt): Pr(y | x, i, y') = exp{w·f(x, i, y', y)} / Z(x, i, y') where x is an input sequence, y' and y are tags, and f(x, i, y', y) is a feature vector of x, the position to be tagged, the previous tag and the current tag Sequence learning is reduced to multi-class logistic regression
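Each local distribution is just a softmax over the candidate tags. A minimal sketch (sparse feature tuples and weights below are illustrative, not from the lecture):

```python
import math

def local_distribution(w, f, x, i, prev, labels):
    """Pr(y | x, i, prev) = exp{w·f(x, i, prev, y)} / Z(x, i, prev):
    one multiclass logistic-regression decision per position."""
    scores = {y: sum(w.get(feat, 0.0) for feat in f(x, i, prev, y))
              for y in labels}
    m = max(scores.values())                 # stabilise the exponentials
    Z = sum(math.exp(v - m) for v in scores.values())
    return {y: math.exp(scores[y] - m) / Z for y in labels}

# Active features: a word template and the tag bigram (hypothetical names).
f = lambda x, i, prev, y: [("word", y, x[i]), ("bigram", prev, y)]
w = {("word", "loc", "Paris"): 2.0}
p = local_distribution(w, f, ["to", "Paris"], 1, "-", ["per", "loc", "-"])
print(max(p, key=p.get))  # loc
```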
69 Conditional Random Fields (Lafferty, McCallum, Pereira 2001) Log-linear model of the conditional distribution: Pr(y | x; w) = exp{w·f(x, y)} / Z(x) where x = x_1 x_2 ... x_n ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) is a feature vector of x and y w are the model parameters To predict the best sequence: ŷ = argmax_{y ∈ Y^n} Pr(y | x) Assumption in CRFs (for tractability): f(x, y) decomposes into factors
70 Parameter Estimation in CRFs Given a training set { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, estimate w Define the conditional log-likelihood of the data: L(w) = ∑_{k=1}^m log Pr(y^(k) | x^(k); w) L(w) measures how well w explains the data. A good value for w will give a high value of Pr(y^(k) | x^(k); w) for all k = 1 ... m We want the w that maximizes L(w)
71 Learning the Parameters of a CRF We pose it as a concave optimization problem. Find: w* = argmax_{w ∈ R^d} L(w) - (λ/2) ||w||² where The first term is the log-likelihood of the data The second term is a regularization term that penalizes solutions with a large norm (similar to norm-minimization in SVMs) λ is a parameter that controls the trade-off between fitting the data and model complexity
72 Learning the Parameters of a CRF Find w* = argmax_{w ∈ R^d} L(w) - (λ/2) ||w||² In general there is no analytical solution to this optimization We use iterative techniques, i.e. gradient-based optimization: 1. Initialize w = 0 2. Take derivatives of L(w) - (λ/2)||w||² and compute the gradient 3. Move w in steps proportional to the gradient 4. Repeat steps 2 and 3 until convergence Fast and scalable algorithms exist
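Steps 1-4 can be sketched generically. Here a toy concave objective stands in for the CRF log-likelihood, whose actual gradient requires forward-backward; the helper and its parameters are illustrative:

```python
def gradient_ascent(grad_L, d, lam=0.1, eta=0.1, iters=200):
    """Maximise L(w) - (lam/2)||w||^2: start at w = 0 and repeatedly step
    along the gradient (step 3); grad_L stands in for the CRF gradient."""
    w = [0.0] * d
    for _ in range(iters):
        g = grad_L(w)
        w = [w[j] + eta * (g[j] - lam * w[j]) for j in range(d)]
    return w

# Toy concave L(w) = -(w0 - 1)^2 - (w1 + 2)^2, gradient (-2(w0-1), -2(w1+2));
# regularisation pulls the optimum slightly towards 0: w* = (2/2.1, -4/2.1).
w = gradient_ascent(lambda w: [-2 * (w[0] - 1), -2 * (w[1] + 2)], d=2)
print([round(v, 3) for v in w])  # [0.952, -1.905]
```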
73 Computing the Gradient in CRFs Consider a parameter w_j and its associated feature f_j: ∂L(w)/∂w_j = ∑_{k=1}^m f_j(x^(k), y^(k)) - ∑_{k=1}^m ∑_{y ∈ Y^n} Pr(y | x^(k); w) f_j(x^(k), y) where f_j(x, y) = ∑_{i=1}^n f_j(x, i, y_{i-1}, y_i) First term: observed value of f_j in the training examples Second term: expected value of f_j under the current w At the optimum, observed = expected
74 Computing the Gradient in CRFs The first term is easy to compute, by explicit counting: ∑_{k=1}^m ∑_i f_j(x^(k), i, y^(k)_{i-1}, y^(k)_i) The second term is more involved: ∑_{k=1}^m ∑_{y ∈ Y^n} Pr(y | x^(k); w) ∑_i f_j(x^(k), i, y_{i-1}, y_i) because it sums over all sequences y ∈ Y^n But there is an efficient solution...
75 Computing the Gradient in CRFs For an example (x^(k), y^(k)): ∑_{y ∈ Y^n} Pr(y | x^(k); w) ∑_{i=1}^n f_j(x^(k), i, y_{i-1}, y_i) = ∑_{i=1}^n ∑_{a,b ∈ Y} μ^k_i(a, b) f_j(x^(k), i, a, b) where μ^k_i(a, b) = Pr(y_{i-1} = a, y_i = b | x^(k); w) = ∑_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x^(k); w) The quantities μ^k_i can be computed efficiently in O(nL²) using the forward-backward algorithm
76 Forward-Backward for CRFs Assume a fixed x. Calculate, in O(n·|Y|²): μ_i(a, b) = ∑_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x; w), for 1 ≤ i ≤ n and a, b ∈ Y Definition: forward and backward quantities α_i(a) = ∑_{y_{1:i} ∈ Y^i : y_i=a} exp{ ∑_{j=1}^i w·f(x, j, y_{j-1}, y_j) } β_i(b) = ∑_{y_{i:n} ∈ Y^{n-i+1} : y_i=b} exp{ ∑_{j=i+1}^n w·f(x, j, y_{j-1}, y_j) } Z = ∑_a α_n(a) μ_i(a, b) = α_{i-1}(a) exp{w·f(x, i, a, b)} β_i(b) Z^{-1} Similarly to Viterbi, α_i(a) and β_i(b) can be computed efficiently in a recursive manner
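These recursions can be sketched directly. For clarity the sketch works with raw exponentials; a practical implementation would stay in log space. Here s(i, prev, a) = w·f(x, i, prev, a), with prev = None at the first position:

```python
import math

def forward_backward(n, labels, s):
    """Return (mu, Z) where mu[i-1][(a, b)] = Pr(y_{i-1}=a, y_i=b | x)
    for bigrams at i = 1..n-1 (0-indexed positions), in O(n |Y|^2)."""
    # alpha[i][a]: sum over prefixes ending at position i in state a
    alpha = [{a: math.exp(s(0, None, a)) for a in labels}]
    for i in range(1, n):
        alpha.append({a: sum(alpha[i - 1][b] * math.exp(s(i, b, a))
                             for b in labels) for a in labels})
    # beta[i][b]: sum over suffixes starting at position i in state b
    beta = [{} for _ in range(n)]
    beta[n - 1] = {b: 1.0 for b in labels}
    for i in range(n - 2, -1, -1):
        beta[i] = {b: sum(math.exp(s(i + 1, b, a)) * beta[i + 1][a]
                          for a in labels) for b in labels}
    Z = sum(alpha[n - 1][a] for a in labels)
    mu = [{(a, b): alpha[i - 1][a] * math.exp(s(i, a, b)) * beta[i][b] / Z
           for a in labels for b in labels} for i in range(1, n)]
    return mu, Z

# With all scores 0 every sequence is equally likely: Z = |Y|^n and each
# of the |Y|^2 bigram marginals at a position equals 1 / |Y|^2.
mu, Z = forward_backward(3, ["per", "loc"], lambda i, b, a: 0.0)
print(Z, mu[0][("per", "loc")])  # 8.0 0.25
```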
79 CRFs: summary so far Log-linear models for sequence prediction, Pr(y | x; w) Computations factorize on label bigrams Model form: argmax_{y ∈ Y^n} ∑_i w·f(x, i, y_{i-1}, y_i) Prediction: uses Viterbi (from HMMs) Parameter estimation: gradient-based methods, in practice L-BFGS Computation of the gradient uses forward-backward (from HMMs) Next Question: MEMMs or CRFs? HMMs or CRFs?
80 MEMMs and CRFs MEMMs: Pr(y | x) = ∏_{i=1}^n exp{w·f(x, i, y_{i-1}, y_i)} / Z(x, i, y_{i-1}; w) CRFs: Pr(y | x) = exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x) Both exploit the same factorization, i.e. the same features Same computations to compute argmax_y Pr(y | x) MEMMs are locally normalized; CRFs are globally normalized MEMMs assume that Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}), which leads to the Label Bias Problem MEMMs are cheaper to train (training reduces to multiclass learning) CRFs are easier to extend to other structures (next lecture)
81 HMMs for sequence prediction x are the observations, y are the hidden states HMMs model the joint distribution Pr(x, y) Parameters: (assume X = {1, ..., k} and Y = {1, ..., l}) π ∈ R^l, π_a = Pr(y_1 = a) T ∈ R^{l×l}, T_{a,b} = Pr(y_i = b | y_{i-1} = a) O ∈ R^{l×k}, O_{a,c} = Pr(x_i = c | y_i = a) Model form: Pr(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} Parameter estimation: maximum likelihood by counting events and normalizing
82 HMMs and CRFs In CRFs: ŷ = argmax_y ∑_i w·f(x, i, y_{i-1}, y_i) In HMMs: ŷ = argmax_y π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} = argmax_y log(π_{y_1} O_{y_1,x_1}) + ∑_{i=2}^n log(T_{y_{i-1},y_i} O_{y_i,x_i}) An HMM can be expressed as a factored linear model, with features f_j(x, i, y', y) and weights w_j: f_j fires when i = 1 and y = a, with w_j = log(π_a) f_j fires when i > 1 and y' = a and y = b, with w_j = log(T_{a,b}) f_j fires when y = a and x_i = c, with w_j = log(O_{a,c}) Hence, HMMs are factored linear models
83 HMMs and CRFs: main differences Representation: HMM features are tied to the generative process. CRF features are very flexible. They can look at the whole input x paired with a label bigram (y i, y i+1 ). In practice, for prediction tasks, good discriminative features can improve accuracy a lot. Parameter estimation: HMMs focus on explaining the data, both x and y. CRFs focus on the mapping from x to y. A priori, it is hard to say which paradigm is better. Same dilemma as Naive Bayes vs. Maximum Entropy.
84 Structured Prediction Perceptron, SVMs, CRFs
85 Learning Structured Predictors Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y with small error on unseen inputs In a CRF: argmax_{y ∈ Y^n} Pr(y | x; w) = argmax_{y ∈ Y^n} exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x; w) = argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) To predict new values, Z(x; w) is not relevant Parameter estimation: w is set to maximize the likelihood Can we learn w more directly, focusing on errors?
87 The Structured Perceptron (Collins, 2002)

Set w = 0
For t = 1 ... T:
  For each training example (x, y):
    1. Compute z = argmax_z w·f(x, z)
    2. If z ≠ y, then w ← w + f(x, y) − f(x, z)
Return w
88 The Structured Perceptron + Averaging (Freund and Schapire, 1998; Collins, 2002)

Set w = 0, w_a = 0
For t = 1 ... T:
  For each training example (x, y):
    1. Compute z = argmax_z w·f(x, z)
    2. If z ≠ y, then w ← w + f(x, y) − f(x, z)
    3. w_a ← w_a + w
Return w_a / (mT), where m is the number of training examples
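The two algorithms above can be sketched end-to-end on a toy tagging problem. The feature map, label set, and data below are made up for illustration, and the argmax is done by exhaustive search (Viterbi would replace it on real inputs):

```python
import itertools

LABELS = ["per", "-"]

def features(x, y):
    """Sparse features: word/tag emissions plus tag bigrams with a start tag."""
    f = {}
    prev = "<s>"
    for word, tag in zip(x, y):
        f[("emit", word, tag)] = f.get(("emit", word, tag), 0) + 1
        f[("trans", prev, tag)] = f.get(("trans", prev, tag), 0) + 1
        prev = tag
    return f

def score(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def predict(w, x):
    # Exhaustive argmax over label sequences; fine for toy-sized inputs.
    return max((list(y) for y in itertools.product(LABELS, repeat=len(x))),
               key=lambda y: score(w, features(x, y)))

def train_averaged(data, T=5):
    w, wa = {}, {}
    for _ in range(T):
        for x, y in data:
            z = predict(w, x)
            if z != y:  # perceptron update on a mistake
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, z).items():
                    w[k] = w.get(k, 0.0) - v
            for k, v in w.items():  # accumulate for averaging
                wa[k] = wa.get(k, 0.0) + v
    return {k: v / (len(data) * T) for k, v in wa.items()}

data = [(["Jack", "went", "home"], ["per", "-", "-"]),
        (["Mary", "saw", "Jack"], ["per", "-", "per"])]
w_avg = train_averaged(data)
```

On this separable toy set the perceptron stops making mistakes after one pass, and the averaged weights tag both training sentences correctly.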
89 Perceptron Updates: Example

y: per  per    -    -  loc
z: per  loc    -    -  loc
x: Jack London went to Paris

Let y be the correct output for x, and say we predict z instead under our current w. The update is:

g = f(x, y) − f(x, z)
  = ∑_i f(x, i, y_{i-1}, y_i) − ∑_i f(x, i, z_{i-1}, z_i)
  = f(x, 2, per, per) − f(x, 2, per, loc) + f(x, 3, per, -) − f(x, 3, loc, -)

Perceptron updates are typically very sparse.
90 Properties of the Perceptron

- Online algorithm; often much more efficient than batch algorithms
- If the data is separable, it converges to parameter values that make 0 errors
- The number of errors before convergence is related to a definition of margin; margin can also be related to generalization properties
- In practice:
  1. Averaging improves performance a lot
  2. It typically reaches a good solution after only a few (say 5) iterations over the training set
  3. It often performs nearly as well as CRFs or SVMs
91 Averaged Perceptron Convergence

[Figure: accuracy vs. number of iterations, on a validation set for a parsing task]
92 Margin-based Structured Prediction

Let f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i)
Model: argmax_{y ∈ Y} w·f(x, y)

Consider an example (x^(k), y^(k)):
- If ∃ y ≠ y^(k) with w·f(x^(k), y^(k)) < w·f(x^(k), y), the model makes an error
- Let y' = argmax_{y ∈ Y : y ≠ y^(k)} w·f(x^(k), y)
- Define γ_k = w·(f(x^(k), y^(k)) − f(x^(k), y'))

The quantity γ_k is a notion of margin on example k:
- γ_k > 0 ⟹ no mistakes on the example
- high γ_k ⟹ high confidence
95 Mistake-augmented Margins (Taskar et al., 2004)

                                         e(y^(k), ·)
x^(k):  Jack London went to  Paris
y^(k):  per  per    -    -   loc         0
y':     per  loc    -    -   loc         1
y'':    -    -      per  per -           5

Def: e(y, y') = ∑_{i=1}^n [y_i ≠ y'_i]
e.g., e(y^(k), y^(k)) = 0, e(y^(k), y') = 1, e(y^(k), y'') = 5

We want a w such that, for all y ≠ y^(k):
w·f(x^(k), y^(k)) > w·f(x^(k), y) + e(y^(k), y)
(the higher the error of y, the larger the separation should be)
98 Structured Hinge Loss

Define a mistake-augmented margin:
γ_{k,y} = w·f(x^(k), y^(k)) − w·f(x^(k), y) − e(y^(k), y)
γ_k = min_{y ≠ y^(k)} γ_{k,y}

Define the loss function on example k as:
L(w, x^(k), y^(k)) = max_{y ∈ Y} { w·f(x^(k), y) + e(y^(k), y) − w·f(x^(k), y^(k)) }

This leads to an SVM for structured prediction. Given a training set, find:
argmin_{w ∈ R^D} ∑_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||²
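Evaluating this loss amounts to one cost-augmented decoding. A sketch with a made-up scoring function and exhaustive search over label sequences (in practice one runs Viterbi with the Hamming cost folded into the per-position scores):

```python
import itertools

LABELS = ["per", "-", "loc"]

def hamming(y, z):
    """e(y, z): number of positions where the two sequences disagree."""
    return sum(a != b for a, b in zip(y, z))

def structured_hinge(score, x, y_gold):
    """max_y { score(x, y) + e(y_gold, y) } - score(x, y_gold)."""
    best = max(score(x, list(z)) + hamming(y_gold, list(z))
               for z in itertools.product(LABELS, repeat=len(x)))
    return best - score(x, y_gold)

def score(x, y):
    # Made-up model: reward 'per' exactly on capitalized words.
    return sum(1.0 if (w[0].isupper()) == (t == "per") else 0.0
               for w, t in zip(x, y))

x = ["Jack", "went", "home"]
loss = structured_hinge(score, x, ["per", "-", "-"])
# Even though this model ranks the gold sequence first, the loss is positive:
# e.g. ["per", "loc", "loc"] scores as high as gold but has Hamming cost 2,
# so the required margin of 2 is violated.
```

This illustrates why the hinge loss keeps pushing even after the model predicts correctly: it demands a separation proportional to the number of label mistakes.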
99 Regularized Loss Minimization

Given a training set { (x^(1), y^(1)), ..., (x^(m), y^(m)) }, find:
argmin_{w ∈ R^D} ∑_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||²

Two common loss functions L(w, x^(k), y^(k)):
- Log-likelihood loss (CRFs): −log P(y^(k) | x^(k); w)
- Hinge loss (SVMs): max_{y ∈ Y} ( w·f(x^(k), y) + e(y^(k), y) − w·f(x^(k), y^(k)) )
100 Learning Structured Predictors: summary so far

Linear models for sequence prediction:
argmax_{y ∈ Y} ∑_i w·f(x, i, y_{i-1}, y_i)

- Computations factorize on label bigrams:
  - Decoding: using Viterbi
  - Marginals: using forward-backward
- Parameter estimation: Perceptron, log-likelihood, SVMs
  - Extensions from classification to the structured case
- Optimization methods:
  - Stochastic (sub)gradient methods (LeCun et al. 98; Shalev-Shwartz et al. 07)
  - Exponentiated gradient (Collins et al. 08)
  - SVMstruct (Tsochantaridis et al. 04)
  - Structured MIRA (McDonald et al. 05)
101 Beyond Linear Sequence Prediction
102 Sequence Prediction, Beyond Bigrams

It is easy to extend the scope of features to k-grams: f(x, i, y_{i-k+1:i-1}, y_i)

In general, think of a state σ_i that remembers the relevant history:
- σ_i = y_{i-1} for bigrams
- σ_i = y_{i-k+1:i-1} for k-grams
- σ_i can be the state at time i of a deterministic automaton generating y

The structured predictor is argmax_{y ∈ Y} ∑_i w·f(x, i, σ_i, y_i)

Viterbi and forward-backward extend naturally, in O(n L^k) time.
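For bigrams (k = 2), the Viterbi recursion behind these factored predictors looks as follows; the score function below is an arbitrary deterministic stand-in for w·f(x, i, y_{i-1}, y_i), and the result is checked against a brute-force argmax over all L^n sequences:

```python
import itertools

def viterbi(n, labels, s):
    """argmax_y sum_i s(i, y[i-1], y[i]) in O(n * L^2); position 0
    conditions on a start symbol '<s>'."""
    delta = {a: s(0, "<s>", a) for a in labels}  # best prefix score per label
    backptrs = []
    for i in range(1, n):
        new_delta, bp = {}, {}
        for b in labels:
            best_prev = max(labels, key=lambda a: delta[a] + s(i, a, b))
            new_delta[b] = delta[best_prev] + s(i, best_prev, b)
            bp[b] = best_prev
        delta = new_delta
        backptrs.append(bp)
    y = [max(labels, key=lambda a: delta[a])]
    for bp in reversed(backptrs):  # follow backpointers right to left
        y.insert(0, bp[y[0]])
    return y

labels = ["A", "B", "C"]
n = 4

def s(i, a, b):  # arbitrary deterministic bigram score for testing
    return ((i + 1) * (len(a) + ord(b[0]))) % 7 - 3

def seq_score(y):
    prevs = ["<s>"] + y
    return sum(s(i, prevs[i], y[i]) for i in range(n))

best_brute = max(seq_score(list(y))
                 for y in itertools.product(labels, repeat=n))
```

The same dynamic program generalizes to k-grams by letting the state carry the last k−1 labels, giving the O(n L^k) bound on the slide.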
103 Dependency Structures

* John saw a movie that he liked today

Directed arcs represent dependencies between a head word and a modifier word. E.g.: movie modifies saw, John modifies saw, today modifies saw.
104 Dependency Parsing: arc-factored models (McDonald et al. 2005)

* John saw a movie that he liked today

Parse trees decompose into single dependencies ⟨h, m⟩:
argmax_{y ∈ Y(x)} ∑_{⟨h,m⟩ ∈ y} w·f(x, h, m)

Some features:
f_1(x, h, m) = [ head = saw ∧ modifier = movie ]
f_2(x, h, m) = [ distance = +2 ]

Tractable inference algorithms exist (tomorrow's lecture)
105 Linear Structured Prediction

Sequence prediction (bigram factorization):
argmax_{y ∈ Y(x)} ∑_i w·f(x, i, y_{i-1}, y_i)

Dependency parsing (arc-factored):
argmax_{y ∈ Y(x)} ∑_{⟨h,m⟩ ∈ y} w·f(x, h, m)

In general, we can enumerate the parts r ∈ y:
argmax_{y ∈ Y(x)} ∑_{r ∈ y} w·f(x, r)
106 Factored Sequence Prediction: from Linear to Non-linear

score(x, y) = ∑_i s(x, i, y_{i-1}, y_i)

Linear: s(x, i, y_{i-1}, y_i) = w·f(x, i, y_{i-1}, y_i)

Non-linear, using a feed-forward neural network:
s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i}·h(f(x, i))
where h(f(x, i)) = σ(W_2 σ(W_1 σ(W_0 f(x, i))))

Remarks:
- The non-linear model computes a hidden representation of the input
- The model is still factored: Viterbi and forward-backward work
- Parameter estimation becomes non-convex; use backpropagation
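A sketch of the non-linear scorer in NumPy; all dimensions, the random weights, and the tanh nonlinearity are illustrative choices, not prescribed by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 8, 16, 3   # made-up feature dim, hidden dim, and label count
W0 = rng.normal(size=(H, D))
W1 = rng.normal(size=(H, H))
W2 = rng.normal(size=(H, H))
W_out = rng.normal(size=(L, L, H))  # one weight vector w_{y',y} per label bigram

def sigma(z):
    return np.tanh(z)  # one common choice of nonlinearity

def hidden(feat):
    """h(f(x, i)) = sigma(W2 sigma(W1 sigma(W0 f(x, i))))."""
    return sigma(W2 @ sigma(W1 @ sigma(W0 @ feat)))

def s(feat, prev, cur):
    """Non-linear bigram score s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i} · h(f(x, i))."""
    return float(W_out[prev, cur] @ hidden(feat))

feat = rng.normal(size=D)  # stands in for the feature vector f(x, i)
# The hidden representation is shared across all L^2 bigram scores at a
# position, so Viterbi still only needs this L x L score table per position.
score_table = np.array([[s(feat, a, b) for b in range(L)] for a in range(L)])
```

Because the nonlinearity only touches the input side, the output is still an L×L table per position, exactly what the factored dynamic programs consume.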
107 Recurrent Sequence Prediction

y_1 y_2 y_3 ... y_n
h_1 h_2 h_3 ... h_n
x_1 x_2 x_3 ... x_n

- Maintains a state: a hidden variable that keeps track of previous observations and predictions
- Exact argmax prediction is not tractable; in practice, use greedy prediction or beam search
- Learning is non-convex
- Popular methods: RNNs, LSTMs, spectral models, ...
108 Thanks!
Simple Large-scale Relation Extraction from Unstructured Text Christos Christodoulopoulos and Arpit Mittal Amazon Research Cambridge Alexa Question Answering Alexa, what books did Carrie Fisher write?
More informationContents 1 Introduction Optical Character Recognition Systems Soft Computing Techniques for Optical Character Recognition Systems
Contents 1 Introduction.... 1 1.1 Organization of the Monograph.... 1 1.2 Notation.... 3 1.3 State of Art.... 4 1.4 Research Issues and Challenges.... 5 1.5 Figures.... 5 1.6 MATLAB OCR Toolbox.... 5 References....
More informationSpeech Processing. Simon King University of Edinburgh. additional lecture slides for
Speech Processing Simon King University of Edinburgh additional lecture slides for 2018-19 assignment Q&A writing exercise Roadmap Modules 1-2: The basics Modules 3-5: Speech synthesis Modules 6-9: Speech
More informationThe game of Bridge: a challenge for ILP
The game of Bridge: a challenge for ILP S. Legras, C. Rouveirol, V. Ventos Véronique Ventos LRI Univ Paris-Saclay vventos@nukk.ai 1 Games 2 Interest of games for AI Excellent field of experimentation Problems
More informationPatterns and random permutations II
Patterns and random permutations II Valentin Féray (joint work with F. Bassino, M. Bouvel, L. Gerin, M. Maazoun and A. Pierrot) Institut für Mathematik, Universität Zürich Summer school in Villa Volpi,
More informationLearning, prediction and selection algorithms for opportunistic spectrum access
Learning, prediction and selection algorithms for opportunistic spectrum access TRINITY COLLEGE DUBLIN Hamed Ahmadi Research Fellow, CTVR, Trinity College Dublin Future Cellular, Wireless, Next Generation
More informationScheduling. Radek Mařík. April 28, 2015 FEE CTU, K Radek Mařík Scheduling April 28, / 48
Scheduling Radek Mařík FEE CTU, K13132 April 28, 2015 Radek Mařík (marikr@fel.cvut.cz) Scheduling April 28, 2015 1 / 48 Outline 1 Introduction to Scheduling Methodology Overview 2 Classification of Scheduling
More informationMikko Myllymäki and Tuomas Virtanen
NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,
More informationArtificial Neural Networks. Artificial Intelligence Santa Clara, 2016
Artificial Neural Networks Artificial Intelligence Santa Clara, 2016 Simulate the functioning of the brain Can simulate actual neurons: Computational neuroscience Can introduce simplified neurons: Neural
More information