Learning Structured Predictors


1 Learning Structured Predictors Xavier Carreras 1/70

2 Supervised (Structured) Prediction Learning to predict: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, learn a predictor x → y that works well on unseen inputs x Non-Structured Prediction: outputs y are atomic Binary prediction: y ∈ {−1, +1} Multiclass prediction: y ∈ {1, 2, ..., L} Structured Prediction: outputs y are structured Sequence prediction: y are sequences Parsing: y are trees ... 2/70

3 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 3/70

4 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 y per per - - loc x Jack London went to Paris y per per - - loc x Paris Hilton went to London y per - - loc x Jackie went to Lisdon 3/70

5 Part-of-speech Tagging y NNP NNP VBZ NNP. x Ms. Haag plays Elianti. 4/70

6 Syntactic Parsing [figure: dependency tree over the sentence, with arcs labeled SBJ, VC, TMP, OBJ, NMOD, PMOD, LOC, NAME, P, ROOT] Unesco is now holding its biennial meetings in New York. x are sentences y are syntactic dependency trees 5/70

7 Machine Translation (Galley et al., 2006) [figure from the cited paper: a source sentence aligned to a syntax tree; spans and complement-spans determine which translation rules are extracted, and a minimal rule is extracted from each constituent in the frontier set] 6/70

8 Object Detection (Kumar and Hebert, 2003) x are images y are grids labeled with object types 7/70

10 Today's Goals Introduce basic concepts for structured prediction We will restrict to sequence prediction What can we borrow from standard classification? Learning paradigms and algorithms, in essence, work here too However, computations behind the algorithms are prohibitive What can we borrow from HMMs and other structured formalisms? Representations of structured data into feature spaces Inference/search algorithms for tractable computations E.g., algorithms for HMMs (Viterbi, forward-backward) will play a major role in today's methods 8/70

12 Sequence Prediction y per per - - loc x Jack London went to Paris 9/70

13 Sequence Prediction x = x_1 x_2 ... x_n are input sequences, x_i ∈ X y = y_1 y_2 ... y_n are output sequences, y_i ∈ {1, ..., L} Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x → y that works well on unseen inputs x What is the form of our prediction model? 10/70

14 Exponentially-many Solutions Let Y = {-, per, loc} The solution space (all output sequences): [trellis figure: each of the words Jack London went to Paris can be labeled -, per, or loc] Each path is a possible solution For an input sequence of size n, there are |Y|^n possible outputs 11/70

16 Approach 1: Local Classifiers? Jack London went to Paris Decompose the sequence into n classification problems: A classifier predicts individual labels at each position ŷ_i = argmax_{l ∈ {loc, per, -}} w · f(x, i, l) f(x, i, l) represents an assignment of label l for x_i w is a vector of parameters, has a weight for each feature of f Use standard classification methods to learn w At test time, predict the best sequence by a simple concatenation of the best label for each position 12/70

18 Indicator Features f(x, i, l) is a vector of d features representing label l for x_i: [ f_1(x, i, l), ..., f_j(x, i, l), ..., f_d(x, i, l) ] What's in a feature f_j(x, i, l)? Anything we can compute using x and i and l Anything that indicates whether l is (not) a good label for x_i Indicator features: binary-valued features looking at: a simple pattern of x and target position i and the candidate label l for position i f_j(x, i, l) = 1 if x_i = London and l = loc, 0 otherwise f_k(x, i, l) = 1 if x_{i+1} = went and l = loc, 0 otherwise 13/70

20 Feature Templates Feature templates generate many indicator features mechanically A feature template is identified by a type, and a number of values Example: template word extracts the current word f_{word,a,w}(x, i, l) = 1 if x_i = w and l = a, 0 otherwise A feature of this type is identified by the tuple (word, a, w) Generates a feature for every label a ∈ Y and every word w, e.g.: a = loc, w = London; a = -, w = London; a = loc, w = Paris; a = per, w = Paris; a = per, w = John; a = -, w = the In feature-based models: Define feature templates manually Instantiate the templates on every set of values in the training data: generates a very high-dimensional feature space Define parameter vector w indexed by such feature tuples Let the learning algorithm choose the relevant features 14/70
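
A minimal sketch of how feature templates can be instantiated as sparse indicator vectors. The template names (word, next_word), the feature-tuple keys, and the helper names are illustrative assumptions, not from the slides.

```python
from collections import Counter

def word_template(x, i, label):
    """'word' template: one indicator per (label, current word) pair."""
    return [("word", label, x[i])]

def next_word_template(x, i, label):
    """Indicator for (label, following word); '</s>' pads the sentence end."""
    nxt = x[i + 1] if i + 1 < len(x) else "</s>"
    return [("next_word", label, nxt)]

TEMPLATES = [word_template, next_word_template]

def local_features(x, i, label):
    """f(x, i, l): sparse indicator vector, keyed by feature tuples."""
    feats = Counter()
    for template in TEMPLATES:
        for key in template(x, i, label):
            feats[key] += 1
    return feats

x = "Jack London went to Paris".split()
print(local_features(x, 1, "loc"))   # two active indicators for (London, loc)
```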

22 More Features for NE Recognition In practice, construct f(x, i, l) by... per per - Jack London went to Paris Define a number of simple patterns of x and i: current word x_i; is x_i capitalized?; x_i has digits?; prefixes/suffixes of size 1, 2, 3, ...; is x_i a known location?; is x_i a known person? Define feature templates by combining patterns with labels l: next word; previous word; current and next words together; other combinations Generate actual features by instantiating templates on training data Main limitation: features can't capture interactions between labels! 15/70

23 Approach 2: HMM for Sequence Prediction per per - - loc Jack London went to Paris (with parameters such as π_per, T_per,per, O_per,London) Define an HMM where each label is a state Model parameters: π_l: probability of starting with label l T_{l,l'}: probability of transitioning from l to l' O_{l,x}: probability of generating symbol x given label l Predictions: p(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts + smoothing Prediction: Viterbi algorithm 16/70
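
A minimal sketch of the "relative counts + smoothing" estimation step, assuming add-alpha smoothing and a data format of (token list, label list) pairs; all names here are illustrative.

```python
from collections import defaultdict

def estimate_hmm(data, labels, vocab, alpha=1.0):
    """Estimate HMM parameters (pi, T, O) by relative counts with add-alpha smoothing.
    data: list of (x, y) pairs, x a list of tokens, y a list of labels."""
    pi = defaultdict(float); T = defaultdict(float); O = defaultdict(float)
    for x, y in data:
        pi[y[0]] += 1
        for i in range(1, len(y)):
            T[(y[i - 1], y[i])] += 1
        for token, label in zip(x, y):
            O[(label, token)] += 1
    n_start = sum(pi.values())
    pi = {a: (pi[a] + alpha) / (n_start + alpha * len(labels)) for a in labels}
    T = {(a, b): (T[(a, b)] + alpha) / (sum(T[(a, c)] for c in labels) + alpha * len(labels))
         for a in labels for b in labels}
    O = {(a, w): (O[(a, w)] + alpha) / (sum(O[(a, v)] for v in vocab) + alpha * len(vocab))
         for a in labels for w in vocab}
    return pi, T, O
```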

24 Approach 2: Representation in HMM per per - - loc Jack London went to Paris Label interactions are captured in the transition parameters But interactions between labels and input symbols are quite limited! Only O_{y_i,x_i} = p(x_i | y_i) Not clear how to exploit patterns such as: Capitalization, digits Prefixes and suffixes Next word, previous word Combinations of these with label transitions Why? HMM independence assumptions: given label y_i, token x_i is independent of anything else 17/70

26 Local Classifiers vs. HMM
Local Classifiers -- Form: w · f(x, i, l); Learning: standard classifiers; Prediction: independent for each x_i; Advantage: feature-rich; Drawback: no label interactions
HMM -- Form: π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i}; Learning: relative counts; Prediction: Viterbi; Advantage: label interactions; Drawback: no fine-grained features
18/70

27 Approach 3: Global Sequence Predictors Learn a single classifier from x to y y: per per - - loc x: Jack London went to Paris predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y) Next questions: How do we represent entire sequences in f(x, y)? There are exponentially-many sequences y for a given x, how do we solve the argmax problem? 19/70

29 Factored Representations y: per per - - loc x: Jack London went to Paris How do we represent entire sequences in f(x, y)? Look at individual assignments y_i (standard classification) Look at bigrams of output labels (y_{i-1}, y_i) Look at trigrams of output labels (y_{i-2}, y_{i-1}, y_i) Look at n-grams of output labels (y_{i-n+1}, ..., y_{i-1}, y_i) Look at the full label sequence y (intractable) A factored representation will lead to a tractable model 20/70

33 Bigram Feature Templates y per per - - loc x Jack London went to Paris A template for word + bigram: f_{wb,a,b,w}(x, i, y_{i-1}, y_i) = 1 if x_i = w and y_{i-1} = a and y_i = b, 0 otherwise e.g., f_{wb,per,per,London}(x, 2, per, per) = 1 f_{wb,per,per,London}(x, 3, per, -) = 0 f_{wb,per,-,went}(x, 3, per, -) = 1 21/70

34 More Templates for NER x Jack London went to Paris y per per - - loc y per loc - - loc y loc - x My trip to London... f_{w,per,per,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = per f_{w,per,loc,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = loc f_{prep,loc,to}(...) = 1 iff x_{i-1} = to and x_i matches /[A-Z]/ and y_i = loc f_{city,loc}(...) = 1 iff y_i = loc and world-cities(x_i) = 1 f_{fname,per}(...) = 1 iff y_i = per and first-names(x_i) = 1 22/70

39 Representations Factored at Bigrams y: per per - - loc x: Jack London went to Paris f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of a label bigram at i Each dimension is typically a boolean indicator (0 or 1) f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of the entire y Aggregated representation by summing bigram feature vectors Each dimension is now a count of a feature pattern 23/70
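
A minimal sketch of this aggregation, with illustrative bigram templates; the feature names and the "<s>" start symbol are assumptions, not from the slides.

```python
from collections import Counter

def bigram_features(x, i, prev_label, label):
    """f(x, i, y_{i-1}, y_i): sparse indicators for one label bigram (illustrative templates)."""
    return Counter({("word", label, x[i]): 1,
                    ("bigram", prev_label, label): 1})

def global_features(x, y):
    """f(x, y) = sum_i f(x, i, y_{i-1}, y_i); dimensions become counts."""
    feats = Counter()
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else "<s>"   # y_0 treated as a start symbol
        feats.update(bigram_features(x, i, prev, y[i]))
    return feats

x = "Jack London went to Paris".split()
y = ["per", "per", "-", "-", "loc"]
print(global_features(x, y)[("bigram", "per", "per")])  # 1
```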

40 Linear Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y), where f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Note the linearity of the expression: w · f(x, y) = w · Σ_{i=1}^n f(x, i, y_{i-1}, y_i) = Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Next questions: How do we solve the argmax problem? How do we learn w? 24/70

43 Predicting with Factored Sequence Models Consider a fixed w. Given x_{1:n}, find: argmax_{y ∈ Y^n} Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n|Y|^2) Notational change: since w and x_{1:n} are fixed we will use s(i, a, b) = w · f(x, i, a, b) 25/70

44 Viterbi for Factored Sequence Models Given scores s(i, a, b) for each position i and output bigram (a, b), find: argmax_{y ∈ Y^n} Σ_{i=1}^n s(i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n|Y|^2) Intuition: output sequences that share bigrams will share scores [trellis figure: the best subsequence ending at position i-1 with each label is extended by s(i, ·, ·) to the best subsequence ending at position i with each label] 26/70

45 Intuition for Viterbi Assume we have the best sub-sequence up to position i-1 ending with each label: best subsequence with y_{i-1} = per; best subsequence with y_{i-1} = loc; best subsequence with y_{i-1} = - What is the best sequence up to position i with y_i = loc? 27/70

47 Intuition for Viterbi Assume we have the best sub-sequence up to position i-1 ending with each label: best subsequence with y_{i-1} = per, extended with s(i, per, loc); best subsequence with y_{i-1} = loc, extended with s(i, loc, loc); best subsequence with y_{i-1} = -, extended with s(i, -, loc) What is the best sequence up to position i with y_i = loc? 27/70

48 Viterbi for Linear Factored Predictors ŷ = argmax_{y ∈ Y^n} Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Definition: score of the optimal sequence for x_{1:i} ending with a ∈ Y: δ(i, a) = max_{y ∈ Y^i : y_i = a} Σ_{j=1}^i s(j, y_{j-1}, y_j) Use the following recursions, for all a ∈ Y: δ(1, a) = s(1, y_0 = null, a) δ(i, a) = max_{b ∈ Y} δ(i-1, b) + s(i, b, a) The optimal score for x is max_{a ∈ Y} δ(n, a) The optimal sequence ŷ can be recovered through back-pointers 28/70
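
A minimal sketch of these recursions (0-indexed positions, with None standing for the null start label y_0); the toy scoring function at the end is purely illustrative.

```python
import math

def viterbi(n, labels, score):
    """argmax_y sum_i score(i, y_{i-1}, y_i); score(0, None, a) scores the first position."""
    delta = [{a: score(0, None, a) for a in labels}]          # delta(1, a)
    back = [{}]
    for i in range(1, n):
        delta.append({}); back.append({})
        for a in labels:
            best_b, best_val = None, -math.inf
            for b in labels:
                val = delta[i - 1][b] + score(i, b, a)        # delta(i-1, b) + s(i, b, a)
                if val > best_val:
                    best_b, best_val = b, val
            delta[i][a] = best_val
            back[i][a] = best_b
    # recover the optimal sequence through back-pointers
    last = max(labels, key=lambda a: delta[n - 1][a])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))

# toy scores: prefer 'per' at the start, 'loc' right after 'to', '-' elsewhere
x = "Jack London went to Paris".split()
def s(i, prev, lab):
    bonus = 2.0 if (lab == "per" and i < 2) or (lab == "loc" and i > 0 and x[i - 1] == "to") else 0.0
    return bonus + (1.0 if lab == "-" else 0.0)
print(viterbi(len(x), ["per", "loc", "-"], s))   # ['per', 'per', '-', '-', 'loc']
```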

49 Linear Factored Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y) Factored representation, e.g. based on bigrams Flexible, arbitrary features of full x and the factors Efficient prediction using Viterbi Next, learning w: Probabilistic log-linear models: Local learning, a.k.a. Maximum-Entropy Markov Models Global learning, a.k.a. Conditional Random Fields Margin-based methods: Structured Perceptron Structured SVM 29/70

50 The Learner's Game per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice Training Data per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w 30/70

59 The Learner's Game per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice Training Data per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w: w_{Lower,-} = +1 w_{Upper,per} = +1 w_{Upper,loc} = +1 w_{Word,per,Maria} = +2 w_{Word,per,Jack} = +2 w_{NextW,per,went} = +2 w_{NextW,org,played} = +2 w_{PrevW,org,against} = +2 w_{UpperBigram,per,per} = +2 w_{UpperBigram,loc,loc} = +2 w_{NextW,loc,played} = ... 30/70

60 Log-linear Models for Sequence Prediction y per per - - loc x Jack London went to Paris 31/70

61 Log-linear Models for Sequence Prediction Model the conditional distribution: Pr(y | x; w) = exp{w · f(x, y)} / Z(x; w) where x = x_1 x_2 ... x_n with x_i ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) represents x and y with d features w ∈ R^d are the parameters of the model Z(x; w) is a normalizer called the partition function: Z(x; w) = Σ_{z ∈ Y^n} exp{w · f(x, z)} To predict the best sequence: predict(x_{1:n}) = argmax_{y ∈ Y^n} Pr(y | x) 32/70
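
A minimal sketch that computes Pr(y | x; w) by brute-force enumeration of Z(x; w); only viable for tiny inputs, but handy as a sanity check for the efficient algorithms that follow. The feature map and weights below are illustrative assumptions.

```python
import itertools, math
from collections import Counter

def dot(w, feats):
    """w · f for a sparse feature Counter and a dict-valued weight vector."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def brute_force_prob(w, x, y, labels, features):
    """Pr(y | x; w) by enumerating all |Y|^n output sequences."""
    num = math.exp(dot(w, features(x, y)))
    Z = sum(math.exp(dot(w, features(x, list(z))))
            for z in itertools.product(labels, repeat=len(x)))
    return num / Z

labels = ["per", "loc", "-"]
x = "Jack London".split()
feats = lambda x, y: Counter({("word", y[i], x[i]): 1 for i in range(len(x))})
w = {("word", "per", "Jack"): 2.0, ("word", "per", "London"): 1.0, ("word", "loc", "London"): 1.5}
print(brute_force_prob(w, x, ["per", "per"], labels, feats))
```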

62 Log-linear Models: Name Let's take the log of the conditional probability: log Pr(y | x; w) = log [ exp{w · f(x, y)} / Z(x; w) ] = w · f(x, y) − log Σ_{y'} exp{w · f(x, y')} = w · f(x, y) − log Z(x; w) Partition function: Z(x; w) = Σ_{y'} exp{w · f(x, y')} log Z(x; w) is a constant for a fixed x In the log space, computations are linear, i.e., we model log-probabilities using a linear predictor 33/70

63 Making Predictions with Log-Linear Models For tractability, assume f(x, y) decomposes into bigrams: f(x_{1:n}, y_{1:n}) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Given w and x_{1:n}, find: argmax_{y_{1:n}} Pr(y_{1:n} | x_{1:n}; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} = argmax_y Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) We can use the Viterbi algorithm 34/70

65 Parameter Estimation in Log-Linear Models Pr(y | x; w) = exp{w · f(x, y)} / Z(x; w) Given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, how do we estimate w? Define the conditional log-likelihood of the data: L(w) = Σ_{k=1}^m log Pr(y^(k) | x^(k); w) L(w) measures how well w explains the data. A good value for w will give a high value of Pr(y^(k) | x^(k); w) for all k = 1 ... m. We want the w that maximizes L(w) 35/70

67 Learning Log-Linear Models: Loss + Regularization Solve: w = argmin_{w ∈ R^d} L(w) + (λ/2) ||w||^2 (loss + regularization) where The first term is the negative conditional log-likelihood The second term is a regularization term, it penalizes solutions with large norm λ ∈ R controls the trade-off between loss and regularization Convex optimization problem → gradient descent Two common losses based on log-likelihood that make learning tractable: Local Loss (MEMM): assume that Pr(y | x; w) decomposes Global Loss (CRF): assume that f(x, y) decomposes 36/70

69 Maximum Entropy Markov Models (MEMM) (McCallum et al., 2000) Similarly to HMMs: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) Pr(y_{2:n} | x_{1:n}, y_1) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{i-1}) Assumption under MEMMs: Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}) 37/70

70 Parameter Estimation in MEMM Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, i, y_{i-1}) The log-linear model is normalized locally (i.e. at each position): Pr(y | x, i, y') = exp{w · f(x, i, y', y)} / Z(x, i, y') The log-likelihood is also local: L(w) = Σ_{k=1}^m Σ_{i=1}^{n^(k)} log Pr(y_i^(k) | x^(k), i, y_{i-1}^(k)) ∂L(w)/∂w_j = Σ_{k=1}^m Σ_{i=1}^{n^(k)} [ f_j(x^(k), i, y_{i-1}^(k), y_i^(k)) (observed) − Σ_{y ∈ Y} Pr(y | x^(k), i, y_{i-1}^(k)) f_j(x^(k), i, y_{i-1}^(k), y) (expected) ] 38/70
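
A minimal sketch of this observed-minus-expected gradient for one training sequence, assuming a bigram feature function f(x, i, y_prev, y) that returns a sparse Counter (as in the earlier sketches) and a "<s>" start symbol; regularization and the outer optimizer (SGD, L-BFGS) are omitted.

```python
import math
from collections import Counter

def memm_local_gradient(w, x, y, labels, f):
    """Gradient of the MEMM local log-likelihood for one sequence:
    sum_i [ f(x,i,y_{i-1},y_i) - E_{Pr(y'|x,i,y_{i-1})} f(x,i,y_{i-1},y') ]."""
    grad = Counter()
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else "<s>"
        # local distribution over labels at position i, conditioned on the gold previous label
        scores = {a: sum(w.get(k, 0.0) * v for k, v in f(x, i, prev, a).items()) for a in labels}
        m = max(scores.values())
        Z = sum(math.exp(s - m) for s in scores.values())
        probs = {a: math.exp(scores[a] - m) / Z for a in labels}
        grad.update(f(x, i, prev, y[i]))                       # observed
        for a in labels:                                       # minus expected
            for k, v in f(x, i, prev, a).items():
                grad[k] -= probs[a] * v
    return grad
```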

71 Conditional Random Fields (Lafferty et al., 2001) Log-linear model of the conditional distribution: Pr(y | x; w) = exp{w · f(x, y)} / Z(x) where x and y are input and output sequences f(x, y) is a feature vector of x and y that decomposes into factors w are the model parameters To predict the best sequence: ŷ = argmax_{y ∈ Y^n} Pr(y | x) Log-likelihood at the global (sequence) level: L(w) = Σ_{k=1}^m log Pr(y^(k) | x^(k); w) 39/70

72 Computing the Gradient in CRFs Consider a parameter w_j and its associated feature f_j: ∂L(w)/∂w_j = Σ_{k=1}^m [ f_j(x^(k), y^(k)) (observed) − Σ_{y ∈ Y^n} Pr(y | x^(k); w) f_j(x^(k), y) (expected) ] where f_j(x, y) = Σ_{i=1}^n f_j(x, i, y_{i-1}, y_i) First term: observed value of f_j in the training examples Second term: expected value of f_j under the current w At the optimum, observed = expected 40/70

73 Computing the Gradient in CRFs The first term is easy to compute, by counting explicitly: Σ_i f_j(x, i, y_{i-1}^(k), y_i^(k)) The second term is more involved: Σ_{y ∈ Y^n} Pr(y | x^(k); w) Σ_i f_j(x^(k), i, y_{i-1}, y_i) because it sums over all sequences y ∈ Y^n But there is an efficient solution... 41/70

74 Computing the Gradient in CRFs For an example (x^(k), y^(k)): Σ_{y ∈ Y^n} Pr(y | x^(k); w) Σ_{i=1}^n f_j(x^(k), i, y_{i-1}, y_i) = Σ_{i=1}^n Σ_{a,b ∈ Y} μ_i^k(a, b) f_j(x^(k), i, a, b) μ_i^k(a, b) is the marginal probability of having labels (a, b) at position i: μ_i^k(a, b) = Pr(⟨i, a, b⟩ | x^(k); w) = Σ_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x^(k); w) The quantities μ_i^k can be computed efficiently in O(nL^2) using the forward-backward algorithm 42/70

75 Forward-Backward for CRFs Assume fixed x and w. For notational convenience, define the score of a label bigram as: s(i, a, b) = exp{w · f(x, i, a, b)} such that we can write Pr(y | x) = exp{w · f(x, y)} / Z(x) = exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x) = ∏_{i=1}^n s(i, y_{i-1}, y_i) / Z Normalizer: Z = Σ_y ∏_{i=1}^n s(i, y_{i-1}, y_i) Marginals: μ(i, a, b) = (1/Z) Σ_{y : y_{i-1}=a, y_i=b} ∏_{j=1}^n s(j, y_{j-1}, y_j) 43/70

76 Forward-Backward for CRFs Definition: forward and backward quantities α_i(a) = Σ_{y_{1:i} ∈ Y^i : y_i = a} ∏_{j=1}^i s(j, y_{j-1}, y_j) β_i(b) = Σ_{y_{i:n} ∈ Y^(n-i+1) : y_i = b} ∏_{j=i+1}^n s(j, y_{j-1}, y_j) Z = Σ_a α_n(a) μ_i(a, b) = α_{i-1}(a) s(i, a, b) β_i(b) / Z Similarly to Viterbi, α_i(a) and β_i(b) can be computed recursively in O(n|Y|^2) 44/70
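
A minimal sketch of these recursions, computing Z and the bigram marginals μ_i(a, b) from un-normalized scores s(i, a, b) (0-indexed, with s(0, None, b) as the start score). In practice one would work in log space or rescale to avoid overflow; the names here are illustrative.

```python
def forward_backward(n, labels, s):
    """Z and bigram marginals mu[i][(a, b)] for scores s(i, a, b) = exp{w · f(x, i, a, b)}."""
    alpha = [{b: s(0, None, b) for b in labels}]                      # alpha_1(b)
    for i in range(1, n):
        alpha.append({b: sum(alpha[i - 1][a] * s(i, a, b) for a in labels) for b in labels})
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {a: 1.0 for a in labels}                            # beta_n(a) = 1
    for i in range(n - 2, -1, -1):
        beta[i] = {a: sum(s(i + 1, a, b) * beta[i + 1][b] for b in labels) for a in labels}
    Z = sum(alpha[n - 1][a] for a in labels)
    mu = [dict() for _ in range(n)]
    for i in range(1, n):                                             # bigram at positions (i-1, i)
        for a in labels:
            for b in labels:
                mu[i][(a, b)] = alpha[i - 1][a] * s(i, a, b) * beta[i][b] / Z
    return Z, mu
```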

79 CRFs: summary so far Log-linear models for sequence prediction, Pr(y | x; w) Computations factorize on label bigrams Model form: argmax_{y ∈ Y^n} Σ_i w · f(x, i, y_{i-1}, y_i) Prediction: uses Viterbi (from HMMs) Parameter estimation: gradient-based methods, in practice L-BFGS Computation of the gradient uses forward-backward (from HMMs) Next Question: MEMMs or CRFs? HMMs or CRFs? 45/70

80 MEMMs and CRFs MEMMs: Pr(y | x) = ∏_{i=1}^n exp{w · f(x, i, y_{i-1}, y_i)} / Z(x, i, y_{i-1}; w) CRFs: Pr(y | x) = exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x) Both exploit the same factorization, i.e. the same features Same computations to compute argmax_y Pr(y | x) MEMMs are locally normalized; CRFs are globally normalized MEMMs assume that Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}), which leads to the Label Bias Problem MEMMs are cheaper to train (training reduces to multiclass learning) CRFs are easier to extend to other structures (next lecture) 46/70

81 HMMs for sequence prediction x are the observations, y are the hidden states HMMs model the joint distribution Pr(x, y) Parameters (assume X = {1, ..., k} and Y = {1, ..., l}): π ∈ R^l, π_a = Pr(y_1 = a) T ∈ R^{l×l}, T_{a,b} = Pr(y_i = b | y_{i-1} = a) O ∈ R^{l×k}, O_{a,c} = Pr(x_i = c | y_i = a) Model form: Pr(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} Parameter estimation: maximum likelihood by counting events and normalizing 47/70

82 HMMs and CRFs In CRFs: ŷ = argmax_y Σ_i w · f(x, i, y_{i-1}, y_i) In HMMs: ŷ = argmax_y π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} = argmax_y log(π_{y_1} O_{y_1,x_1}) + Σ_{i=2}^n log(T_{y_{i-1},y_i} O_{y_i,x_i}) An HMM can be expressed as a factored linear model, with features f_j(x, i, y', y) and weights w_j: [i = 1 and y = a] → log(π_a); [i > 1 and y' = a and y = b] → log(T_{a,b}); [y = a and x_i = c] → log(O_{a,c}) Hence, HMMs are factored linear models 48/70
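
A minimal sketch of this mapping, turning HMM parameters into the weights of the equivalent factored linear model; the dictionary keys and the 'start'/'trans'/'emit' names are illustrative.

```python
import math

def hmm_to_linear_weights(pi, T, O):
    """Map HMM parameters to factored-linear-model weights:
    w['start', a] = log pi_a, w['trans', a, b] = log T_{a,b}, w['emit', a, c] = log O_{a,c}."""
    w = {}
    for a, p in pi.items():
        w[("start", a)] = math.log(p)
    for (a, b), p in T.items():
        w[("trans", a, b)] = math.log(p)
    for (a, c), p in O.items():
        w[("emit", a, c)] = math.log(p)
    return w

def hmm_score(w, x, i, prev, label):
    """s(i, y_{i-1}, y_i) under the HMM features: start/transition weight plus emission weight."""
    trans = w[("start", label)] if i == 0 else w[("trans", prev, label)]
    return trans + w[("emit", label, x[i])]
```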

83 HMMs and CRFs: main differences Representation: HMM features are tied to the generative process. CRF features are very flexible. They can look at the whole input x paired with a label bigram (y_i, y_{i+1}). In practice, for prediction tasks, good discriminative features can improve accuracy a lot. Parameter estimation: HMMs focus on explaining the data, both x and y. CRFs focus on the mapping from x to y. A priori, it is hard to say which paradigm is better. Same dilemma as Naive Bayes vs. Maximum Entropy. 49/70

84 Structured Prediction Perceptron, SVMs, CRFs 50/70

85 Learning Structured Predictors Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, learn a predictor x → y with small error on unseen inputs In a CRF: argmax_{y ∈ Y^n} Pr(y | x; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x; w) = argmax_y Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) To predict new values, Z(x; w) is not relevant Parameter estimation: w is set to maximize likelihood Can we learn w more directly, focusing on errors? 51/70

87 The Structured Perceptron (Collins, 2002) Set w = 0 For t = 1 ... T: For each training example (x, y): 1. Compute z = argmax_z w · f(x, z) 2. If z ≠ y: w ← w + f(x, y) − f(x, z) Return w 52/70

88 The Structured Perceptron + Averaging (Collins, 2002; Freund and Schapire, 1999) Set w = 0, w_a = 0 For t = 1 ... T: For each training example (x, y): 1. Compute z = argmax_z w · f(x, z) 2. If z ≠ y: w ← w + f(x, y) − f(x, z) 3. w_a ← w_a + w Return w_a / (mT), where m is the number of training examples 53/70
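
A minimal sketch of the averaged structured perceptron loop, assuming a decode(w, x) function that solves the argmax (e.g. a Viterbi decoder over w · f) and a features(x, y) function returning the global sparse feature Counter; both names are placeholders, not a fixed API.

```python
from collections import Counter

def averaged_perceptron(data, decode, features, epochs=5):
    """Averaged structured perceptron.
    data: list of (x, y) pairs; decode(w, x) -> argmax_z w·f(x, z); features(x, y) -> Counter."""
    w, w_sum = Counter(), Counter()
    for _ in range(epochs):
        for x, y in data:
            z = decode(w, x)
            if z != y:
                w.update(features(x, y))            # w += f(x, y)
                w.subtract(features(x, z))          # w -= f(x, z)
            w_sum.update(w)                         # accumulate for averaging
    m = len(data) * epochs
    return Counter({k: v / m for k, v in w_sum.items()})
```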

89 Perceptron Updates: Example y per per - - loc z per loc - - loc x Jack London went to Paris Let y be the correct output for x. Say we predict z instead, under our current w. The update is: g = f(x, y) − f(x, z) = Σ_i f(x, i, y_{i-1}, y_i) − Σ_i f(x, i, z_{i-1}, z_i) = f(x, 2, per, per) − f(x, 2, per, loc) + f(x, 3, per, -) − f(x, 3, loc, -) Perceptron updates are typically very sparse 54/70

90 Properties of the Perceptron Online algorithm. Often much more efficient than batch algorithms If the data is separable, it will converge to parameter values with 0 errors Number of errors before convergence is related to a definition of margin. Can also relate margin to generalization properties In practice: 1. Averaging improves performance a lot 2. Typically reaches a good solution after only a few (say 5) iterations over the training set 3. Often performs nearly as well as CRFs, or SVMs 55/70

91 Averaged Perceptron Convergence [plot: accuracy on a validation set vs. training iteration, for a parsing task] 56/70

92 Margin-based Structured Prediction Let f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Model: argmax_{y ∈ Y^n} w · f(x, y) Consider an example (x^(k), y^(k)): if there is a y ≠ y^(k) such that w · f(x^(k), y^(k)) < w · f(x^(k), y), we make an error Let y' = argmax_{y ∈ Y^n : y ≠ y^(k)} w · f(x^(k), y) Define γ_k = w · (f(x^(k), y^(k)) − f(x^(k), y')) The quantity γ_k is a notion of margin on example k: γ_k > 0 → no mistakes in the example; high γ_k → high confidence 57/70

95 Mistake-augmented Margins (Taskar et al., 2003) e(y^(k), ·): x^(k) Jack London went to Paris y^(k) per per - - loc → 0 y' per loc - - loc → 1 y'' - - per per - → 5 Def: e(y, y') = Σ_{i=1}^n [y_i ≠ y'_i] e.g., e(y^(k), y^(k)) = 0, e(y^(k), y') = 1, e(y^(k), y'') = 5 We want a w such that, for all y ≠ y^(k): w · f(x^(k), y^(k)) > w · f(x^(k), y) + e(y^(k), y) (the higher the error of y, the larger the separation should be) 58/70

98 Structured Hinge Loss Define a mistake-augmented margin: γ_{k,y} = w · f(x^(k), y^(k)) − w · f(x^(k), y) − e(y^(k), y), γ_k = min_{y ≠ y^(k)} γ_{k,y} Define the loss function on example k as: L(w, x^(k), y^(k)) = max_{y ∈ Y^n} { w · f(x^(k), y) + e(y^(k), y) } − w · f(x^(k), y^(k)) Leads to an SVM for structured prediction Given a training set, find: argmin_{w ∈ R^d} Σ_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||^2 59/70
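
Because the Hamming cost e(y^(k), y) decomposes over positions, the inner max can be solved with the same Viterbi recursion on cost-augmented scores. Below is a minimal sketch of the resulting hinge subgradient for one example, assuming the viterbi and bigram feature functions sketched earlier and a "<s>" start symbol; an SGD/Pegasos-style trainer would combine this with the λw regularization term.

```python
from collections import Counter

def hinge_subgradient(w, x, y, labels, f, viterbi):
    """Subgradient of the structured hinge loss for one example:
    decode with Hamming-cost-augmented scores, then return f(x, y_hat) - f(x, y)."""
    def cost_aug_score(i, prev, lab):
        prev = prev if prev is not None else "<s>"     # align with the start symbol used in f
        s = sum(w.get(k, 0.0) * v for k, v in f(x, i, prev, lab).items())
        return s + (0.0 if lab == y[i] else 1.0)       # e(y, .) decomposes per position
    y_hat = viterbi(len(x), labels, cost_aug_score)
    if list(y_hat) == list(y):
        return Counter()                               # loss is zero, so is the subgradient
    grad = Counter()
    for i in range(len(x)):
        prev_hat = y_hat[i - 1] if i > 0 else "<s>"
        prev_gold = y[i - 1] if i > 0 else "<s>"
        grad.update(f(x, i, prev_hat, y_hat[i]))
        grad.subtract(f(x, i, prev_gold, y[i]))
    return grad
```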

99 Regularized Loss Minimization Given a training set { (x^(1), y^(1)), ..., (x^(m), y^(m)) }, find: argmin_{w ∈ R^d} Σ_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||^2 Two common loss functions L(w, x^(k), y^(k)): Log-likelihood loss (CRFs): −log Pr(y^(k) | x^(k); w) Hinge loss (SVMs): max_{y ∈ Y^n} ( w · f(x^(k), y) + e(y^(k), y) ) − w · f(x^(k), y^(k)) 60/70

100 Learning Structured Predictors: summary so far Linear models for sequence prediction: argmax_{y ∈ Y^n} Σ_i w · f(x, i, y_{i-1}, y_i) Computations factorize on label bigrams Decoding: using Viterbi Marginals: using forward-backward Parameter estimation: Perceptron, log-likelihood, SVMs Extensions from classification to the structured case Optimization methods: Stochastic (sub)gradient methods (LeCun et al., 1998; Shalev-Shwartz et al., 2011) Exponentiated Gradient (Collins et al., 2008) SVM Struct (Tsochantaridis et al., 2005) Structured MIRA (Crammer et al., 2005) 61/70

101 Beyond Linear Sequence Prediction 62/70

102 Factored Sequence Prediction, Beyond Bigrams It is easy to extend the scope of features to k-grams: f(x, i, y_{i-k+1:i-1}, y_i) In general, think of a state σ_i remembering the relevant history: σ_i = y_{i-1} for bigrams; σ_i = y_{i-k+1:i-1} for k-grams; σ_i can be the state at time i of a deterministic automaton generating y The structured predictor is argmax_{y ∈ Y^n} Σ_i w · f(x, i, σ_i, y_i) Viterbi and forward-backward extend naturally, in O(nL^k) 63/70

103 Dependency Structures * John saw a movie that he liked today Directed arcs represent dependencies between a head word and a modifier word. E.g.: movie modifies saw, John modifies saw, today modifies saw 64/70

104 Dependency Parsing: arc-factored models (McDonald et al., 2005) * John saw a movie that he liked today Parse trees decompose into single dependencies ⟨h, m⟩: argmax_{y ∈ Y(x)} Σ_{⟨h,m⟩ ∈ y} w · f(x, h, m) Some features: f_1(x, h, m) = [ saw → movie ] f_2(x, h, m) = [ distance = +2 ] Tractable inference algorithms exist (tomorrow's lecture) 65/70

105 Linear Structured Prediction Sequence prediction (bigram factorization): argmax_{y ∈ Y(x)} Σ_i w · f(x, i, y_{i-1}, y_i) Dependency parsing (arc-factored): argmax_{y ∈ Y(x)} Σ_{⟨h,m⟩ ∈ y} w · f(x, h, m) In general, we can enumerate parts r ∈ y: argmax_{y ∈ Y(x)} Σ_{r ∈ y} w · f(x, r) 66/70

106 Factored Sequence Prediction: from Linear to Non-linear score(x, y) = Σ_i s(x, i, y_{i-1}, y_i) Linear: s(x, i, y_{i-1}, y_i) = w · f(x, i, y_{i-1}, y_i) Non-linear, using a feed-forward neural network: s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i} · h(f(x, i)), where h(f(x, i)) = σ(W_2 σ(W_1 σ(W_0 f(x, i)))) Remarks: The non-linear model computes a hidden representation of the input Still factored: Viterbi and forward-backward work Parameter estimation becomes non-convex, use backpropagation 67/70
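
A minimal sketch of such a non-linear bigram scorer, using a single hidden layer for brevity (the slide uses three); the per-label-pair output vectors, the '<s>' start label, and all names are illustrative assumptions.

```python
import numpy as np

def make_bigram_scorer(d_in, d_hidden, labels, seed=0):
    """Non-linear factored scorer: s(x,i,y_{i-1},y_i) = w_{y_{i-1},y_i} · h(phi),
    where phi = f(x, i) is an input feature vector and h(phi) = tanh(W1 · phi)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
    pairs = [(a, b) for a in list(labels) + ["<s>"] for b in labels]
    w_out = {pair: rng.normal(scale=0.1, size=d_hidden) for pair in pairs}

    def score(phi, prev, label):
        h = np.tanh(W1 @ phi)                     # hidden representation h(f(x, i))
        return float(w_out[(prev, label)] @ h)    # one output weight vector per label bigram
    return score

# usage: plug into the Viterbi sketch with s(i, a, b) = score(phi_i, a if a is not None else "<s>", b)
```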

107 Recurrent Sequence Prediction [diagram: outputs y_1 ... y_n, hidden states h_1 ... h_n, inputs x_1 ... x_n] Maintains a state: a hidden variable that keeps track of previous observations and predictions Making predictions is not tractable In practice: greedy predictions or beam search Learning is non-convex Popular methods: RNN, LSTM, Spectral Models, ... 68/70

108 Thanks! 69/70

109 References I
Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 1-8. Association for Computational Linguistics, 2002.
Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research, 9, 2008.
Koby Crammer, Ryan McDonald, and Fernando Pereira. Scalable large-margin online learning for structured classification. In NIPS Workshop on Learning With Structured Outputs, 2005.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 1999.
Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
Sanjiv Kumar and Martial Hebert. Man-made structure detection in natural images using a causal multiscale random field. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, USA, June 2003.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01. Morgan Kaufmann Publishers Inc., 2001.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, 2000.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16, 2003.
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep), 2005.
70/70


More information

Neural Network Part 4: Recurrent Neural Networks

Neural Network Part 4: Recurrent Neural Networks Neural Network Part 4: Recurrent Neural Networks Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from

More information

Dynamic Programming in Real Life: A Two-Person Dice Game

Dynamic Programming in Real Life: A Two-Person Dice Game Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,

More information

Reinforcement Learning Agent for Scrolling Shooter Game

Reinforcement Learning Agent for Scrolling Shooter Game Reinforcement Learning Agent for Scrolling Shooter Game Peng Yuan (pengy@stanford.edu) Yangxin Zhong (yangxin@stanford.edu) Zibo Gong (zibo@stanford.edu) 1 Introduction and Task Definition 1.1 Game Agent

More information

Statistical Machine Translation. Machine Translation Phrase-Based Statistical MT. Motivation for Phrase-based SMT

Statistical Machine Translation. Machine Translation Phrase-Based Statistical MT. Motivation for Phrase-based SMT Statistical Machine Translation Machine Translation Phrase-Based Statistical MT Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University October 2009 Probabilistic

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or

More information

Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation

Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation Steve Renals Machine Learning Practical MLP Lecture 4 9 October 2018 MLP Lecture 4 / 9 October 2018 Deep Neural Networks (2)

More information

arxiv: v1 [cs.ni] 23 Jan 2019

arxiv: v1 [cs.ni] 23 Jan 2019 Machine Learning for Wireless Communications in the Internet of Things: A Comprehensive Survey Jithin Jagannath, Nicholas Polosky, Anu Jagannath, Francesco Restuccia, and Tommaso Melodia ANDRO Advanced

More information

Deep Learning for Autonomous Driving

Deep Learning for Autonomous Driving Deep Learning for Autonomous Driving Shai Shalev-Shwartz Mobileye IMVC dimension, March, 2016 S. Shalev-Shwartz is also affiliated with The Hebrew University Shai Shalev-Shwartz (MobilEye) DL for Autonomous

More information

Figure 1. Artificial Neural Network structure. B. Spiking Neural Networks Spiking Neural networks (SNNs) fall into the third generation of neural netw

Figure 1. Artificial Neural Network structure. B. Spiking Neural Networks Spiking Neural networks (SNNs) fall into the third generation of neural netw Review Analysis of Pattern Recognition by Neural Network Soni Chaturvedi A.A.Khurshid Meftah Boudjelal Electronics & Comm Engg Electronics & Comm Engg Dept. of Computer Science P.I.E.T, Nagpur RCOEM, Nagpur

More information

Research on Hand Gesture Recognition Using Convolutional Neural Network

Research on Hand Gesture Recognition Using Convolutional Neural Network Research on Hand Gesture Recognition Using Convolutional Neural Network Tian Zhaoyang a, Cheng Lee Lung b a Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China E-mail address:

More information

Outcome Forecasting in Sports. Ondřej Hubáček

Outcome Forecasting in Sports. Ondřej Hubáček Outcome Forecasting in Sports Ondřej Hubáček Motivation & Challenges Motivation exploiting betting markets performance optimization Challenges no available datasets difficulties with establishing the state-of-the-art

More information

AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE. A Thesis by. Andrew J. Zerngast

AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE. A Thesis by. Andrew J. Zerngast AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE A Thesis by Andrew J. Zerngast Bachelor of Science, Wichita State University, 2008 Submitted to the Department of Electrical

More information

Move Prediction in Go Modelling Feature Interactions Using Latent Factors

Move Prediction in Go Modelling Feature Interactions Using Latent Factors Move Prediction in Go Modelling Feature Interactions Using Latent Factors Martin Wistuba and Lars Schmidt-Thieme University of Hildesheim Information Systems & Machine Learning Lab {wistuba, schmidt-thieme}@ismll.de

More information

Augmenting Self-Learning In Chess Through Expert Imitation

Augmenting Self-Learning In Chess Through Expert Imitation Augmenting Self-Learning In Chess Through Expert Imitation Michael Xie Department of Computer Science Stanford University Stanford, CA 94305 xie@cs.stanford.edu Gene Lewis Department of Computer Science

More information

Game Theory and Randomized Algorithms

Game Theory and Randomized Algorithms Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international

More information

Signal Recovery from Random Measurements

Signal Recovery from Random Measurements Signal Recovery from Random Measurements Joel A. Tropp Anna C. Gilbert {jtropp annacg}@umich.edu Department of Mathematics The University of Michigan 1 The Signal Recovery Problem Let s be an m-sparse

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Perceptron Barnabás Póczos Contents History of Artificial Neural Networks Definitions: Perceptron, Multi-Layer Perceptron Perceptron algorithm 2 Short History of Artificial

More information

A Comparison of Particle Swarm Optimization and Gradient Descent in Training Wavelet Neural Network to Predict DGPS Corrections

A Comparison of Particle Swarm Optimization and Gradient Descent in Training Wavelet Neural Network to Predict DGPS Corrections Proceedings of the World Congress on Engineering and Computer Science 00 Vol I WCECS 00, October 0-, 00, San Francisco, USA A Comparison of Particle Swarm Optimization and Gradient Descent in Training

More information

Pedigree Reconstruction using Identity by Descent

Pedigree Reconstruction using Identity by Descent Pedigree Reconstruction using Identity by Descent Bonnie Kirkpatrick Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2010-43 http://www.eecs.berkeley.edu/pubs/techrpts/2010/eecs-2010-43.html

More information

Coursework 2. MLP Lecture 7 Convolutional Networks 1

Coursework 2. MLP Lecture 7 Convolutional Networks 1 Coursework 2 MLP Lecture 7 Convolutional Networks 1 Coursework 2 - Overview and Objectives Overview: Use a selection of the techniques covered in the course so far to train accurate multi-layer networks

More information

Hash Function Learning via Codewords

Hash Function Learning via Codewords Hash Function Learning via Codewords 2015 ECML/PKDD, Porto, Portugal, September 7 11, 2015. Yinjie Huang 1 Michael Georgiopoulos 1 Georgios C. Anagnostopoulos 2 1 Machine Learning Laboratory, University

More information

Semantic Localization of Indoor Places. Lukas Kuster

Semantic Localization of Indoor Places. Lukas Kuster Semantic Localization of Indoor Places Lukas Kuster Motivation GPS for localization [7] 2 Motivation Indoor navigation [8] 3 Motivation Crowd sensing [9] 4 Motivation Targeted Advertisement [10] 5 Motivation

More information

Recommender Systems TIETS43 Collaborative Filtering

Recommender Systems TIETS43 Collaborative Filtering + Recommender Systems TIETS43 Collaborative Filtering Fall 2017 Kostas Stefanidis kostas.stefanidis@uta.fi https://coursepages.uta.fi/tiets43/ selection Amazon generates 35% of their sales through recommendations

More information

Lecture 5: Pitch and Chord (1) Chord Recognition. Li Su

Lecture 5: Pitch and Chord (1) Chord Recognition. Li Su Lecture 5: Pitch and Chord (1) Chord Recognition Li Su Recap: short-time Fourier transform Given a discrete-time signal x(t) sampled at a rate f s. Let window size N samples, hop size H samples, then the

More information

Carnegie Mellon University, University of Pittsburgh

Carnegie Mellon University, University of Pittsburgh Carnegie Mellon University, University of Pittsburgh Carnegie Mellon University, University of Pittsburgh Artificial Intelligence (AI) and Deep Learning (DL) Overview Paola Buitrago Leader AI and BD Pittsburgh

More information

22c181: Formal Methods in Software Engineering. The University of Iowa Spring Propositional Logic

22c181: Formal Methods in Software Engineering. The University of Iowa Spring Propositional Logic 22c181: Formal Methods in Software Engineering The University of Iowa Spring 2010 Propositional Logic Copyright 2010 Cesare Tinelli. These notes are copyrighted materials and may not be used in other course

More information

Topic 1: defining games and strategies. SF2972: Game theory. Not allowed: Extensive form game: formal definition

Topic 1: defining games and strategies. SF2972: Game theory. Not allowed: Extensive form game: formal definition SF2972: Game theory Mark Voorneveld, mark.voorneveld@hhs.se Topic 1: defining games and strategies Drawing a game tree is usually the most informative way to represent an extensive form game. Here is one

More information

Training a Minesweeper Solver

Training a Minesweeper Solver Training a Minesweeper Solver Luis Gardea, Griffin Koontz, Ryan Silva CS 229, Autumn 25 Abstract Minesweeper, a puzzle game introduced in the 96 s, requires spatial awareness and an ability to work with

More information

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016 Artificial Neural Networks Artificial Intelligence Santa Clara, 2016 Simulate the functioning of the brain Can simulate actual neurons: Computational neuroscience Can introduce simplified neurons: Neural

More information

MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA

MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA M. Pardo, G. Sberveglieri INFM and University of Brescia Gas Sensor Lab, Dept. of Chemistry and Physics for Materials Via Valotti 9-25133 Brescia Italy D.

More information

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 2010

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 2010 Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 21 Peter Bro Miltersen November 1, 21 Version 1.3 3 Extensive form games (Game Trees, Kuhn Trees)

More information

Artificial Intelligence and Deep Learning

Artificial Intelligence and Deep Learning Artificial Intelligence and Deep Learning Cars are now driving themselves (far from perfectly, though) Speaking to a Bot is No Longer Unusual March 2016: World Go Champion Beaten by Machine AI: The Upcoming

More information

Introduction to Markov Models. Estimating the probability of phrases of words, sentences, etc.

Introduction to Markov Models. Estimating the probability of phrases of words, sentences, etc. Introduction to Markov Models Estimating the probability of phrases of words, sentences, etc. But first: A few preliminaries on text preprocessing What counts as a word? A tricky question. CIS 421/521

More information

Mikko Myllymäki and Tuomas Virtanen

Mikko Myllymäki and Tuomas Virtanen NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,

More information

Antennas and Propagation. Chapter 5c: Array Signal Processing and Parametric Estimation Techniques

Antennas and Propagation. Chapter 5c: Array Signal Processing and Parametric Estimation Techniques Antennas and Propagation : Array Signal Processing and Parametric Estimation Techniques Introduction Time-domain Signal Processing Fourier spectral analysis Identify important frequency-content of signal

More information

Introduction to Spring 2009 Artificial Intelligence Final Exam

Introduction to Spring 2009 Artificial Intelligence Final Exam CS 188 Introduction to Spring 2009 Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet, double-sided. Please use non-programmable

More information

PROJECT 5: DESIGNING A VOICE MODEM. Instructor: Amir Asif

PROJECT 5: DESIGNING A VOICE MODEM. Instructor: Amir Asif PROJECT 5: DESIGNING A VOICE MODEM Instructor: Amir Asif CSE4214: Digital Communications (Fall 2012) Computer Science and Engineering, York University 1. PURPOSE In this laboratory project, you will design

More information

Classification with Pedigree and its Applicability to Record Linkage

Classification with Pedigree and its Applicability to Record Linkage Classification with Pedigree and its Applicability to Record Linkage Evan S. Gamble, Sofus A. Macskassy, and Steve Minton Fetch Technologies, 2041 Rosecrans Ave, El Segundo, CA 90245 {egamble,sofmac,minton}@fetch.com

More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 6 Lecture - 37 Divide and Conquer: Counting Inversions

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 6 Lecture - 37 Divide and Conquer: Counting Inversions Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Module 6 Lecture - 37 Divide and Conquer: Counting Inversions Let us go back and look at Divide and Conquer again.

More information

Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1

Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1 Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1 Hidden Unit Transfer Functions Initialising Deep Networks Steve Renals Machine Learning Practical MLP Lecture

More information

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28 Paper Presentation Steve Jan Virginia Tech March 5, 2015 Steve Jan (Virginia Tech) Paper Presentation March 5, 2015 1 / 28 2 paper to present Nonparametric Multi-group Membership Model for Dynamic Networks,

More information

k-means Clustering David S. Rosenberg December 15, 2017 Bloomberg ML EDU David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, / 18

k-means Clustering David S. Rosenberg December 15, 2017 Bloomberg ML EDU David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, / 18 k-means Clustering David S. Rosenberg Bloomberg ML EDU December 15, 2017 David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, 2017 1 / 18 k-means Clustering David S. Rosenberg (Bloomberg ML EDU) ML

More information

CandyCrush.ai: An AI Agent for Candy Crush

CandyCrush.ai: An AI Agent for Candy Crush CandyCrush.ai: An AI Agent for Candy Crush Jiwoo Lee, Niranjan Balachandar, Karan Singhal December 16, 2016 1 Introduction Candy Crush, a mobile puzzle game, has become very popular in the past few years.

More information

Avoiding consecutive patterns in permutations

Avoiding consecutive patterns in permutations Avoiding consecutive patterns in permutations R. E. L. Aldred M. D. Atkinson D. J. McCaughan January 3, 2009 Abstract The number of permutations that do not contain, as a factor (subword), a given set

More information

Regret Minimization in Games with Incomplete Information

Regret Minimization in Games with Incomplete Information Regret Minimization in Games with Incomplete Information Martin Zinkevich maz@cs.ualberta.ca Michael Bowling Computing Science Department University of Alberta Edmonton, AB Canada T6G2E8 bowling@cs.ualberta.ca

More information

AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS

AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS Author: Sandarva Khanal Advisor: Dr. Richard A. Dean Department of Electrical and Computer Engineering Morgan State University ABSTRACT The introduction

More information

CS510 \ Lecture Ariel Stolerman

CS510 \ Lecture Ariel Stolerman CS510 \ Lecture04 2012-10-15 1 Ariel Stolerman Administration Assignment 2: just a programming assignment. Midterm: posted by next week (5), will cover: o Lectures o Readings A midterm review sheet will

More information

Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method

Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 9, NO. 1, Jan. 2015 407 Copyright 2015 KSII Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method Sungho Shin 1, 2,

More information

1. Introduction to Game Theory

1. Introduction to Game Theory 1. Introduction to Game Theory What is game theory? Important branch of applied mathematics / economics Eight game theorists have won the Nobel prize, most notably John Nash (subject of Beautiful mind

More information

Modeling, Analysis and Optimization of Networks. Alberto Ceselli

Modeling, Analysis and Optimization of Networks. Alberto Ceselli Modeling, Analysis and Optimization of Networks Alberto Ceselli alberto.ceselli@unimi.it Università degli Studi di Milano Dipartimento di Informatica Doctoral School in Computer Science A.A. 2015/2016

More information

Singing Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection

Singing Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection Detection Lecture usic Processing Applications of usic Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Important pre-requisite for: usic segmentation

More information