Learning Structured Predictors
1 Learning Structured Predictors Xavier Carreras Xerox Research Centre Europe
2 Supervised (Structured) Prediction Learning to predict: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y that works well on unseen inputs x Non-Structured Prediction: outputs y are atomic Binary prediction: y ∈ {-1, +1} Multiclass prediction: y ∈ {1, 2, ..., L} Structured Prediction: outputs y are structured Sequence prediction: y are sequences Parsing: y are trees...
3 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006
4 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 y per per - - loc x Jack London went to Paris y per per - - loc x Paris Hilton went to London y per - - loc x Jackie went to Lisdon
5 Part-of-speech Tagging y NNP NNP VBZ NNP. x Ms. Haag plays Elianti.
6 Syntactic Parsing [dependency tree figure: labels ROOT, SBJ, VC, TMP, OBJ, NMOD, NAME, PMOD, LOC, P over "Unesco is now holding its biennial meetings in New York."] x are sentences y are syntactic dependency trees
7 Machine Translation (Galley et al. 2006) [figure residue from the cited paper removed; the figure showed how spans and complement-spans determine which translation rules are extracted, with a minimal rule extracted from each member of the frontier set] x are sentences in Chinese y are sentences in English aligned to x
8 Object Detection (Kumar and Hebert 2003) x are images y are grids labeled with object types
10 Today's Goals Introduce basic concepts for structured prediction We will restrict ourselves to sequence prediction What can we borrow from standard classification? Learning paradigms and algorithms, in essence, work here too However, the computations behind the algorithms are prohibitive What can we borrow from HMMs and other structured formalisms? Representations of structured data in feature spaces Inference/search algorithms for tractable computations E.g., algorithms for HMMs (Viterbi, forward-backward) will play a major role in today's methods
12 Sequence Prediction y per per - - loc x Jack London went to Paris
13 Sequence Prediction x = x_1 x_2 ... x_n are input sequences, x_i ∈ X y = y_1 y_2 ... y_n are output sequences, y_i ∈ {1, ..., L} Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y that works well on unseen inputs x What is the form of our prediction model?
14 Exponentially-many Solutions Let Y = {-, per, loc} The solution space (all output sequences): [lattice over "Jack London went to Paris": each position can take any label in Y; each path through the lattice is a possible solution] For an input sequence of length n, there are |Y|^n possible outputs
16 Approach 1: Local Classifiers? Jack London went to Paris Decompose the sequence into n classification problems: A classifier predicts individual labels at each position ŷ_i = argmax_{l ∈ {loc, per, -}} w·f(x, i, l) f(x, i, l) represents an assignment of label l for x_i w is a vector of parameters, with a weight for each feature of f Use standard classification methods to learn w At test time, predict the best sequence by simply concatenating the best label for each position
18 Indicator Features f(x, i, l) is a vector of d features representing label l for x_i: [ f_1(x, i, l), ..., f_j(x, i, l), ..., f_d(x, i, l) ] What's in a feature f_j(x, i, l)? Anything we can compute using x, i and l Anything that indicates whether l is (not) a good label for x_i Indicator features: binary-valued features looking at a simple pattern of x, the target position i, and the candidate label l for position i: f_j(x, i, l) = 1 if x_i = London and l = loc, 0 otherwise f_k(x, i, l) = 1 if x_{i+1} = went and l = loc, 0 otherwise
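The two indicator features above can be sketched directly as functions (a minimal illustration; the lecture does not prescribe an implementation, and note that Python indexes from 0 while the slides index positions from 1):

```python
# Indicator features as plain functions over (x, i, l); positions are
# 0-indexed here, whereas the slides use 1-indexed positions.

def f_j(x, i, l):
    """1 if x_i = 'London' and the candidate label l = 'loc', else 0."""
    return 1 if x[i] == "London" and l == "loc" else 0

def f_k(x, i, l):
    """1 if the next word x_{i+1} = 'went' and l = 'loc', else 0."""
    return 1 if i + 1 < len(x) and x[i + 1] == "went" and l == "loc" else 0

x = ["Jack", "London", "went", "to", "Paris"]
print(f_j(x, 1, "loc"), f_k(x, 1, "loc"), f_j(x, 4, "loc"))  # 1 1 0
```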
20 Feature Templates Feature templates generate many indicator features mechanically A feature template is identified by a type and a number of values Example: template word extracts the current word: f_{word,a,w}(x, i, l) = 1 if x_i = w and l = a, 0 otherwise A feature of this type is identified by the tuple (word, a, w) The template generates a feature for every label a ∈ Y and every word w, e.g.: a = loc, w = London; a = -, w = London; a = loc, w = Paris; a = per, w = Paris; a = per, w = John; a = -, w = the In feature-based models: Define feature templates manually Instantiate the templates on every set of values in the training data; this generates a very high-dimensional feature space Define a parameter vector w indexed by such feature tuples Let the learning algorithm choose the relevant features
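The instantiation step can be sketched as follows (hypothetical helper names; the word template generates one feature dimension per (label, word) pair observed in training):

```python
from collections import defaultdict

def instantiate_word_template(corpus):
    """Collect a tuple ('word', a, w) for every label a and word w that
    co-occur in the training data; each tuple indexes one feature."""
    feats = set()
    for x, y in corpus:
        for word, label in zip(x, y):
            feats.add(("word", label, word))
    return feats

corpus = [(["Jack", "London", "went", "to", "Paris"],
           ["per", "per", "-", "-", "loc"])]
feats = instantiate_word_template(corpus)
weights = defaultdict(float)   # parameter vector w, indexed by feature tuples
print(("word", "loc", "Paris") in feats)  # True
```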
22 More Features for NE Recognition per per - Jack London went to Paris In practice, construct f(x, i, l) by... Define a number of simple patterns of x and i: the current word x_i; is x_i capitalized?; does x_i have digits?; prefixes/suffixes of size 1, 2, 3, ...; is x_i a known location?; is x_i a known person?; the next word; the previous word; the current and next words together; other combinations Define feature templates by combining patterns with labels l Generate actual features by instantiating templates on training data Main limitation: features can't capture interactions between labels!
23 Approach 2: HMM for Sequence Prediction π_per per T_per,per per - - loc O_per,London Jack London went to Paris Define an HMM where each label is a state Model parameters: π_l: probability of starting with label l T_{l,l'}: probability of transitioning from l to l' O_{l,x}: probability of generating symbol x given label l Predictions: p(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts + smoothing Prediction: Viterbi algorithm
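Under these definitions, scoring a fixed (x, y) pair is a product of parameters, usually computed in log space for numerical stability. A minimal sketch (the toy parameter values are made up for illustration, not taken from the lecture):

```python
import math

def hmm_log_prob(x, y, pi, T, O):
    """log p(x, y) = log pi_{y1} + log O_{y1,x1}
                     + sum_{i>1} [ log T_{y_{i-1},y_i} + log O_{y_i,x_i} ]"""
    lp = math.log(pi[y[0]]) + math.log(O[y[0]][x[0]])
    for i in range(1, len(x)):
        lp += math.log(T[y[i - 1]][y[i]]) + math.log(O[y[i]][x[i]])
    return lp

# Toy parameters over two labels and two words (illustrative only).
pi = {"per": 0.6, "-": 0.4}
T = {"per": {"per": 0.3, "-": 0.7}, "-": {"per": 0.2, "-": 0.8}}
O = {"per": {"Jack": 0.9, "went": 0.1}, "-": {"Jack": 0.05, "went": 0.95}}
p = math.exp(hmm_log_prob(["Jack", "went"], ["per", "-"], pi, T, O))
print(round(p, 4))  # 0.3591 = 0.6 * 0.9 * 0.7 * 0.95
```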
24 Approach 2: Representation in HMM π_per per T_per,per per - - loc O_per,London Jack London went to Paris Label interactions are captured in the transition parameters But interactions between labels and input symbols are quite limited! Only O_{y_i,x_i} = p(x_i | y_i) It is not clear how to exploit patterns such as: capitalization, digits; prefixes and suffixes; next word, previous word; combinations of these with label transitions Why? HMM independence assumptions: given label y_i, token x_i is independent of anything else
26 Local Classifiers vs. HMM Local Classifiers: Form: w·f(x, i, l) Learning: standard classifiers Prediction: independent for each x_i Advantage: feature-rich Drawback: no label interactions HMM: Form: π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts Prediction: Viterbi Advantage: label interactions Drawback: no fine-grained features
27 Approach 3: Global Sequence Predictors y: per per - - loc x: Jack London went to Paris Learn a single classifier from x to y: predict(x_{1:n}) = argmax_{y ∈ Y^n} w·f(x, y) Next questions: How do we represent entire sequences in f(x, y)? There are exponentially many sequences y for a given x; how do we solve the argmax problem?
29 Factored Representations y: per per - - loc x: Jack London went to Paris How do we represent entire sequences in f(x, y)? Look at individual assignments y_i (standard classification) Look at bigrams of output labels (y_{i-1}, y_i) Look at trigrams of output labels (y_{i-2}, y_{i-1}, y_i) Look at n-grams of output labels (y_{i-n+1}, ..., y_{i-1}, y_i) Look at the full label sequence y (intractable) A factored representation will lead to a tractable model
33 Bigram Feature Templates y per per - - loc x Jack London went to Paris A template for word + bigram: f_{wb,a,b,w}(x, i, y_{i-1}, y_i) = 1 if x_i = w and y_{i-1} = a and y_i = b, 0 otherwise e.g., f_{wb,per,per,London}(x, 2, per, per) = 1 f_{wb,per,per,London}(x, 3, per, -) = 0 f_{wb,per,-,went}(x, 3, per, -) = 1
34 More Templates for NER x Jack London went to Paris y per per - - loc y per loc - - loc x My trip to London... y - - - loc f_{w,per,per,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = per f_{w,per,loc,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = loc f_{prep,loc,to}(...) = 1 iff x_{i-1} = to and x_i matches /[A-Z]/ and y_i = loc f_{city,loc}(...) = 1 iff y_i = loc and world-cities(x_i) = 1 f_{fname,per}(...) = 1 iff y_i = per and first-names(x_i) = 1
39 Representations Factored at Bigrams y: per per - - loc x: Jack London went to Paris f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of a label bigram at i Each dimension is typically a boolean indicator (0 or 1) f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of the entire y Aggregated representation obtained by summing bigram feature vectors Each dimension is now a count of a feature pattern
40 Linear Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w·f(x, y) where f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i) Note the linearity of the expression: w·f(x, y) = w·∑_{i=1}^n f(x, i, y_{i-1}, y_i) = ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Next questions: How do we solve the argmax problem? How do we learn w?
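This linearity means the global score can be accumulated bigram by bigram. A minimal sketch with sparse indicator features (the feature function and the weight entries below are hypothetical examples):

```python
def global_score(w, f, x, y, start="<s>"):
    """w·f(x, y) = sum_i w·f(x, i, y_{i-1}, y_i); f returns the list of
    active (indicator) features for one label bigram, w maps them to weights."""
    total, prev = 0.0, start
    for i, cur in enumerate(y):
        total += sum(w.get(feat, 0.0) for feat in f(x, i, prev, cur))
        prev = cur
    return total

# Active features at position i: a word template and the label bigram.
def f(x, i, prev, cur):
    return [("word", cur, x[i]), ("bigram", prev, cur)]

w = {("word", "per", "Jack"): 2.0, ("bigram", "per", "per"): 1.0}
print(global_score(w, f, ["Jack", "London"], ["per", "per"]))  # 3.0
```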
43 Predicting with Factored Sequence Models Consider a fixed w. Given x_{1:n}, find: argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n·|Y|²) Notational change: since w and x_{1:n} are fixed, we will use s(i, a, b) = w·f(x, i, a, b)
44 Viterbi for Factored Sequence Models Given scores s(i, a, b) for each position i and output bigram (a, b), find: argmax_{y ∈ Y^n} ∑_{i=1}^n s(i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n·|Y|²) Intuition: output sequences that share bigrams will share scores [lattice diagram: for each label a, the best subsequence with y_{i-1} = a extends to the best subsequence with y_i = b through the score s(i, a, b)]
47 Intuition for Viterbi Consider a fixed x_{1:n} Assume we have the best sub-sequences up to position i-1 [diagram: best subsequences ending with y_{i-1} = per, loc, -, each connected to position i through the scores s(i, per, loc), s(i, loc, loc), s(i, -, loc)] What is the best sequence up to position i with y_i = loc?
48 Viterbi for Linear Factored Predictors ŷ = argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) Definition: score of the optimal sequence for x_{1:i} ending with a ∈ Y: δ(i, a) = max_{y ∈ Y^i : y_i = a} ∑_{j=1}^i s(j, y_{j-1}, y_j) Use the following recursions, for all a ∈ Y: δ(1, a) = s(1, y_0 = null, a) δ(i, a) = max_{b ∈ Y} [ δ(i-1, b) + s(i, b, a) ] The optimal score for x is max_{a ∈ Y} δ(n, a) The optimal sequence ŷ can be recovered through back-pointers
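The recursions above translate directly into code. A sketch (scores come in as a function s(i, prev, a), with prev = None playing the role of y_0 = null, and positions 0-indexed):

```python
def viterbi(n, labels, s):
    """Find argmax_y sum_i s(i, y_{i-1}, y_i) in O(n |Y|^2).
    delta[i][a] is the score of the best length-(i+1) prefix ending in a;
    back-pointers recover the optimal sequence."""
    delta = [{a: s(0, None, a) for a in labels}]
    back = [{}]
    for i in range(1, n):
        delta.append({})
        back.append({})
        for a in labels:
            b = max(labels, key=lambda b: delta[i - 1][b] + s(i, b, a))
            delta[i][a] = delta[i - 1][b] + s(i, b, a)
            back[i][a] = b
    last = max(labels, key=lambda a: delta[n - 1][a])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))

# Toy scores that reward one target sequence.
target = ["per", "-", "loc"]
s = lambda i, prev, a: 1.0 if a == target[i] else 0.0
print(viterbi(3, ["per", "loc", "-"], s))  # ['per', '-', 'loc']
```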
49 Linear Factored Sequence Prediction predict(x 1:n ) = argmax y Y n w f(x, y) Factored representation, e.g. based on bigrams Flexible, arbitrary features of full x and the factors Efficient prediction using Viterbi Next, learning w: Probabilistic log-linear models: Local learning, a.k.a. Maximum-Entropy Markov Models Global learning, a.k.a. Conditional Random Fields Margin-based methods: Structured Perceptron Structured SVM
59 The Learner's Game Training Data per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w w_{Lower,-} = +1 w_{Upper,per} = +1 w_{Upper,loc} = +1 w_{Word,per,Maria} = +2 w_{Word,per,Jack} = +2 w_{NextW,per,went} = +2 w_{NextW,org,played} = +2 w_{PrevW,org,against} = +2 w_{UpperBigram,per,per} = +2 w_{UpperBigram,loc,loc} = +2 w_{NextW,loc,played} = -1000
60 Log-linear Models for Sequence Prediction y per per - - loc x Jack London went to Paris
61 Log-linear Models for Sequence Prediction Model the conditional distribution: Pr(y | x; w) = exp{w·f(x, y)} / Z(x; w) where x = x_1 x_2 ... x_n with x_i ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) represents x and y with d features w ∈ R^d are the parameters of the model Z(x; w) is a normalizer called the partition function: Z(x; w) = ∑_{z ∈ Y^n} exp{w·f(x, z)} To predict the best sequence: predict(x_{1:n}) = argmax_{y ∈ Y^n} Pr(y | x)
62 Log-linear Models: Name Let's take the log of the conditional probability: log Pr(y | x; w) = log [ exp{w·f(x, y)} / Z(x; w) ] = w·f(x, y) - log ∑_{y'} exp{w·f(x, y')} = w·f(x, y) - log Z(x; w) Partition function: Z(x; w) = ∑_{y'} exp{w·f(x, y')} log Z(x; w) is a constant for a fixed x In log space, computations are linear, i.e., we model log-probabilities using a linear predictor
63 Making Predictions with Log-Linear Models For tractability, assume f(x, y) decomposes into bigrams: f(x_{1:n}, y_{1:n}) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i) Given w and x_{1:n}, find: argmax_{y_{1:n}} Pr(y_{1:n} | x_{1:n}; w) = argmax_y exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x; w) = argmax_y exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } = argmax_y ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) We can use the Viterbi algorithm
65 Parameter Estimation in Log-Linear Models Pr(y | x; w) = exp{w·f(x, y)} / Z(x; w) How to estimate w given training data? Two approaches: MEMMs: assume that Pr(y | x; w) decomposes CRFs: assume that f(x, y) decomposes
67 Maximum Entropy Markov Models (MEMMs) (McCallum, Freitag, Pereira 00) Similarly to HMMs: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) Pr(y_{2:n} | x_{1:n}, y_1) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{1:i-1}) Assumption under MEMMs: Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}) giving Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{i-1})
68 Parameter Estimation in MEMMs Decompose the sequential problem: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, i, y_{i-1}) Learn local log-linear distributions (i.e. MaxEnt): Pr(y | x, i, y') = exp{w·f(x, i, y', y)} / Z(x, i, y') where x is an input sequence, y' and y are tags, and f(x, i, y', y) is a feature vector of x, the position to be tagged, the previous tag and the current tag Sequence learning is reduced to multi-class logistic regression
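Each local distribution is just a softmax over the candidate tags. A minimal sketch (sparse feature tuples and weights below are illustrative, not from the lecture):

```python
import math

def local_distribution(w, f, x, i, prev, labels):
    """Pr(y | x, i, prev) = exp{w·f(x, i, prev, y)} / Z(x, i, prev):
    one multiclass logistic-regression decision per position."""
    scores = {y: sum(w.get(feat, 0.0) for feat in f(x, i, prev, y))
              for y in labels}
    m = max(scores.values())                 # stabilise the exponentials
    Z = sum(math.exp(v - m) for v in scores.values())
    return {y: math.exp(scores[y] - m) / Z for y in labels}

# Active features: a word template and the tag bigram (hypothetical names).
f = lambda x, i, prev, y: [("word", y, x[i]), ("bigram", prev, y)]
w = {("word", "loc", "Paris"): 2.0}
p = local_distribution(w, f, ["to", "Paris"], 1, "-", ["per", "loc", "-"])
print(max(p, key=p.get))  # loc
```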
69 Conditional Random Fields (Lafferty, McCallum, Pereira 2001) Log-linear model of the conditional distribution: Pr(y | x; w) = exp{w·f(x, y)} / Z(x) where x = x_1 x_2 ... x_n ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) is a feature vector of x and y w are the model parameters To predict the best sequence: ŷ = argmax_{y ∈ Y^n} Pr(y | x) Assumption in CRFs (for tractability): f(x, y) decomposes into factors
70 Parameter Estimation in CRFs Given a training set { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, estimate w Define the conditional log-likelihood of the data: L(w) = ∑_{k=1}^m log Pr(y^(k) | x^(k); w) L(w) measures how well w explains the data. A good value for w will give a high value of Pr(y^(k) | x^(k); w) for all k = 1 ... m We want the w that maximizes L(w)
71 Learning the Parameters of a CRF We pose it as a concave optimization problem. Find: w* = argmax_{w ∈ R^d} L(w) - (λ/2) ||w||² where The first term is the log-likelihood of the data The second term is a regularization term that penalizes solutions with a large norm (similar to norm-minimization in SVMs) λ is a parameter that controls the trade-off between fitting the data and model complexity
72 Learning the Parameters of a CRF Find w* = argmax_{w ∈ R^d} L(w) - (λ/2) ||w||² In general there is no analytical solution to this optimization We use iterative techniques, i.e. gradient-based optimization: 1. Initialize w = 0 2. Take derivatives of L(w) - (λ/2)||w||² and compute the gradient 3. Move w in steps proportional to the gradient 4. Repeat steps 2 and 3 until convergence Fast and scalable algorithms exist
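Steps 1-4 can be sketched generically. Here a toy concave objective stands in for the CRF log-likelihood, whose actual gradient requires forward-backward; the helper and its parameters are illustrative:

```python
def gradient_ascent(grad_L, d, lam=0.1, eta=0.1, iters=200):
    """Maximise L(w) - (lam/2)||w||^2: start at w = 0 and repeatedly step
    along the gradient (step 3); grad_L stands in for the CRF gradient."""
    w = [0.0] * d
    for _ in range(iters):
        g = grad_L(w)
        w = [w[j] + eta * (g[j] - lam * w[j]) for j in range(d)]
    return w

# Toy concave L(w) = -(w0 - 1)^2 - (w1 + 2)^2, gradient (-2(w0-1), -2(w1+2));
# regularisation pulls the optimum slightly towards 0: w* = (2/2.1, -4/2.1).
w = gradient_ascent(lambda w: [-2 * (w[0] - 1), -2 * (w[1] + 2)], d=2)
print([round(v, 3) for v in w])  # [0.952, -1.905]
```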
73 Computing the Gradient in CRFs Consider a parameter w_j and its associated feature f_j: ∂L(w)/∂w_j = ∑_{k=1}^m f_j(x^(k), y^(k)) - ∑_{k=1}^m ∑_{y ∈ Y^n} Pr(y | x^(k); w) f_j(x^(k), y) where f_j(x, y) = ∑_{i=1}^n f_j(x, i, y_{i-1}, y_i) First term: observed value of f_j in the training examples Second term: expected value of f_j under the current w At the optimum, observed = expected
74 Computing the Gradient in CRFs The first term is easy to compute, by explicit counting: ∑_{k=1}^m ∑_i f_j(x^(k), i, y^(k)_{i-1}, y^(k)_i) The second term is more involved: ∑_{k=1}^m ∑_{y ∈ Y^n} Pr(y | x^(k); w) ∑_i f_j(x^(k), i, y_{i-1}, y_i) because it sums over all sequences y ∈ Y^n But there is an efficient solution...
75 Computing the Gradient in CRFs For an example (x^(k), y^(k)): ∑_{y ∈ Y^n} Pr(y | x^(k); w) ∑_{i=1}^n f_j(x^(k), i, y_{i-1}, y_i) = ∑_{i=1}^n ∑_{a,b ∈ Y} μ^k_i(a, b) f_j(x^(k), i, a, b) where μ^k_i(a, b) = Pr(y_{i-1} = a, y_i = b | x^(k); w) = ∑_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x^(k); w) The quantities μ^k_i can be computed efficiently in O(nL²) using the forward-backward algorithm
76 Forward-Backward for CRFs Assume a fixed x. Calculate, in O(n·|Y|²): μ_i(a, b) = ∑_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x; w), for 1 ≤ i ≤ n and a, b ∈ Y Definition: forward and backward quantities α_i(a) = ∑_{y_{1:i} ∈ Y^i : y_i=a} exp{ ∑_{j=1}^i w·f(x, j, y_{j-1}, y_j) } β_i(b) = ∑_{y_{i:n} ∈ Y^{n-i+1} : y_i=b} exp{ ∑_{j=i+1}^n w·f(x, j, y_{j-1}, y_j) } Z = ∑_a α_n(a) μ_i(a, b) = α_{i-1}(a) exp{w·f(x, i, a, b)} β_i(b) Z^{-1} Similarly to Viterbi, α_i(a) and β_i(b) can be computed efficiently in a recursive manner
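These recursions can be sketched directly. For clarity the sketch works with raw exponentials; a practical implementation would stay in log space. Here s(i, prev, a) = w·f(x, i, prev, a), with prev = None at the first position:

```python
import math

def forward_backward(n, labels, s):
    """Return (mu, Z) where mu[i-1][(a, b)] = Pr(y_{i-1}=a, y_i=b | x)
    for bigrams at i = 1..n-1 (0-indexed positions), in O(n |Y|^2)."""
    # alpha[i][a]: sum over prefixes ending at position i in state a
    alpha = [{a: math.exp(s(0, None, a)) for a in labels}]
    for i in range(1, n):
        alpha.append({a: sum(alpha[i - 1][b] * math.exp(s(i, b, a))
                             for b in labels) for a in labels})
    # beta[i][b]: sum over suffixes starting at position i in state b
    beta = [{} for _ in range(n)]
    beta[n - 1] = {b: 1.0 for b in labels}
    for i in range(n - 2, -1, -1):
        beta[i] = {b: sum(math.exp(s(i + 1, b, a)) * beta[i + 1][a]
                          for a in labels) for b in labels}
    Z = sum(alpha[n - 1][a] for a in labels)
    mu = [{(a, b): alpha[i - 1][a] * math.exp(s(i, a, b)) * beta[i][b] / Z
           for a in labels for b in labels} for i in range(1, n)]
    return mu, Z

# With all scores 0 every sequence is equally likely: Z = |Y|^n and each
# of the |Y|^2 bigram marginals at a position equals 1 / |Y|^2.
mu, Z = forward_backward(3, ["per", "loc"], lambda i, b, a: 0.0)
print(Z, mu[0][("per", "loc")])  # 8.0 0.25
```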
79 CRFs: summary so far Log-linear models for sequence prediction, Pr(y | x; w) Computations factorize on label bigrams Model form: argmax_{y ∈ Y^n} ∑_i w·f(x, i, y_{i-1}, y_i) Prediction: uses Viterbi (from HMMs) Parameter estimation: gradient-based methods, in practice L-BFGS Computation of the gradient uses forward-backward (from HMMs) Next Question: MEMMs or CRFs? HMMs or CRFs?
80 MEMMs and CRFs MEMMs: Pr(y | x) = ∏_{i=1}^n exp{w·f(x, i, y_{i-1}, y_i)} / Z(x, i, y_{i-1}; w) CRFs: Pr(y | x) = exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x) Both exploit the same factorization, i.e. the same features Same computations to compute argmax_y Pr(y | x) MEMMs are locally normalized; CRFs are globally normalized MEMMs assume that Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}), which leads to the Label Bias Problem MEMMs are cheaper to train (training reduces to multiclass learning) CRFs are easier to extend to other structures (next lecture)
81 HMMs for sequence prediction x are the observations, y are the hidden states HMMs model the joint distribution Pr(x, y) Parameters: (assume X = {1, ..., k} and Y = {1, ..., l}) π ∈ R^l, π_a = Pr(y_1 = a) T ∈ R^{l×l}, T_{a,b} = Pr(y_i = b | y_{i-1} = a) O ∈ R^{l×k}, O_{a,c} = Pr(x_i = c | y_i = a) Model form: Pr(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} Parameter estimation: maximum likelihood by counting events and normalizing
82 HMMs and CRFs In CRFs: ŷ = argmax_y ∑_i w·f(x, i, y_{i-1}, y_i) In HMMs: ŷ = argmax_y π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} = argmax_y log(π_{y_1} O_{y_1,x_1}) + ∑_{i=2}^n log(T_{y_{i-1},y_i} O_{y_i,x_i}) An HMM can be expressed as a factored linear model, with features f_j(x, i, y', y) and weights w_j: f_j fires when i = 1 and y = a, with w_j = log(π_a) f_j fires when i > 1 and y' = a and y = b, with w_j = log(T_{a,b}) f_j fires when y = a and x_i = c, with w_j = log(O_{a,c}) Hence, HMMs are factored linear models
83 HMMs and CRFs: main differences Representation: HMM features are tied to the generative process. CRF features are very flexible. They can look at the whole input x paired with a label bigram (y i, y i+1 ). In practice, for prediction tasks, good discriminative features can improve accuracy a lot. Parameter estimation: HMMs focus on explaining the data, both x and y. CRFs focus on the mapping from x to y. A priori, it is hard to say which paradigm is better. Same dilemma as Naive Bayes vs. Maximum Entropy.
84 Structured Prediction Perceptron, SVMs, CRFs
85 Learning Structured Predictors Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x -> y with small error on unseen inputs In a CRF: argmax_{y ∈ Y^n} Pr(y | x; w) = argmax_{y ∈ Y^n} exp{ ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) } / Z(x; w) = argmax_{y ∈ Y^n} ∑_{i=1}^n w·f(x, i, y_{i-1}, y_i) To predict new values, Z(x; w) is not relevant Parameter estimation: w is set to maximize the likelihood Can we learn w more directly, focusing on errors?
87 The Structured Perceptron (Collins, 2002)

Set w = 0
For t = 1 ... T:
  For each training example (x, y):
    1. Compute z = argmax_z w·f(x, z)
    2. If z ≠ y, then w ← w + f(x, y) − f(x, z)
Return w
88 The Structured Perceptron + Averaging (Freund and Schapire, 1998; Collins, 2002)

Set w = 0, w_a = 0
For t = 1 ... T:
  For each training example (x, y):
    1. Compute z = argmax_z w·f(x, z)
    2. If z ≠ y, then w ← w + f(x, y) − f(x, z)
    3. w_a ← w_a + w
Return w_a / (mT), where m is the number of training examples
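The two algorithms above can be sketched end-to-end on a toy tagging problem. The feature map, label set, and data below are made up for illustration, and the argmax is done by exhaustive search (Viterbi would replace it on real inputs):

```python
import itertools

LABELS = ["per", "-"]

def features(x, y):
    """Sparse features: word/tag emissions plus tag bigrams with a start tag."""
    f = {}
    prev = "<s>"
    for word, tag in zip(x, y):
        f[("emit", word, tag)] = f.get(("emit", word, tag), 0) + 1
        f[("trans", prev, tag)] = f.get(("trans", prev, tag), 0) + 1
        prev = tag
    return f

def score(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def predict(w, x):
    # Exhaustive argmax over label sequences; fine for toy-sized inputs.
    return max((list(y) for y in itertools.product(LABELS, repeat=len(x))),
               key=lambda y: score(w, features(x, y)))

def train_averaged(data, T=5):
    w, wa = {}, {}
    for _ in range(T):
        for x, y in data:
            z = predict(w, x)
            if z != y:  # perceptron update on a mistake
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, z).items():
                    w[k] = w.get(k, 0.0) - v
            for k, v in w.items():  # accumulate for averaging
                wa[k] = wa.get(k, 0.0) + v
    return {k: v / (len(data) * T) for k, v in wa.items()}

data = [(["Jack", "went", "home"], ["per", "-", "-"]),
        (["Mary", "saw", "Jack"], ["per", "-", "per"])]
w_avg = train_averaged(data)
```

On this separable toy set the perceptron stops making mistakes after one pass, and the averaged weights tag both training sentences correctly.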
89 Perceptron Updates: Example

y: per  per    -    -  loc
z: per  loc    -    -  loc
x: Jack London went to Paris

Let y be the correct output for x, and say we predict z instead under our current w. The update is:

g = f(x, y) − f(x, z)
  = ∑_i f(x, i, y_{i-1}, y_i) − ∑_i f(x, i, z_{i-1}, z_i)
  = f(x, 2, per, per) − f(x, 2, per, loc) + f(x, 3, per, -) − f(x, 3, loc, -)

Perceptron updates are typically very sparse.
90 Properties of the Perceptron

- Online algorithm; often much more efficient than batch algorithms
- If the data is separable, it converges to parameter values that make 0 errors
- The number of errors before convergence is related to a definition of margin; margin can also be related to generalization properties
- In practice:
  1. Averaging improves performance a lot
  2. It typically reaches a good solution after only a few (say 5) iterations over the training set
  3. It often performs nearly as well as CRFs or SVMs
91 Averaged Perceptron Convergence

[Figure: accuracy vs. number of iterations, on a validation set for a parsing task]
92 Margin-based Structured Prediction

Let f(x, y) = ∑_{i=1}^n f(x, i, y_{i-1}, y_i)
Model: argmax_{y ∈ Y} w·f(x, y)

Consider an example (x^(k), y^(k)):
- If ∃ y ≠ y^(k) with w·f(x^(k), y^(k)) < w·f(x^(k), y), the model makes an error
- Let y' = argmax_{y ∈ Y : y ≠ y^(k)} w·f(x^(k), y)
- Define γ_k = w·(f(x^(k), y^(k)) − f(x^(k), y'))

The quantity γ_k is a notion of margin on example k:
- γ_k > 0 ⟹ no mistakes on the example
- high γ_k ⟹ high confidence
95 Mistake-augmented Margins (Taskar et al., 2004)

                                         e(y^(k), ·)
x^(k):  Jack London went to  Paris
y^(k):  per  per    -    -   loc         0
y':     per  loc    -    -   loc         1
y'':    -    -      per  per -           5

Def: e(y, y') = ∑_{i=1}^n [y_i ≠ y'_i]
e.g., e(y^(k), y^(k)) = 0, e(y^(k), y') = 1, e(y^(k), y'') = 5

We want a w such that, for all y ≠ y^(k):
w·f(x^(k), y^(k)) > w·f(x^(k), y) + e(y^(k), y)
(the higher the error of y, the larger the separation should be)
98 Structured Hinge Loss

Define a mistake-augmented margin:
γ_{k,y} = w·f(x^(k), y^(k)) − w·f(x^(k), y) − e(y^(k), y)
γ_k = min_{y ≠ y^(k)} γ_{k,y}

Define the loss function on example k as:
L(w, x^(k), y^(k)) = max_{y ∈ Y} { w·f(x^(k), y) + e(y^(k), y) − w·f(x^(k), y^(k)) }

This leads to an SVM for structured prediction. Given a training set, find:
argmin_{w ∈ R^D} ∑_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||²
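Evaluating this loss amounts to one cost-augmented decoding. A sketch with a made-up scoring function and exhaustive search over label sequences (in practice one runs Viterbi with the Hamming cost folded into the per-position scores):

```python
import itertools

LABELS = ["per", "-", "loc"]

def hamming(y, z):
    """e(y, z): number of positions where the two sequences disagree."""
    return sum(a != b for a, b in zip(y, z))

def structured_hinge(score, x, y_gold):
    """max_y { score(x, y) + e(y_gold, y) } - score(x, y_gold)."""
    best = max(score(x, list(z)) + hamming(y_gold, list(z))
               for z in itertools.product(LABELS, repeat=len(x)))
    return best - score(x, y_gold)

def score(x, y):
    # Made-up model: reward 'per' exactly on capitalized words.
    return sum(1.0 if (w[0].isupper()) == (t == "per") else 0.0
               for w, t in zip(x, y))

x = ["Jack", "went", "home"]
loss = structured_hinge(score, x, ["per", "-", "-"])
# Even though this model ranks the gold sequence first, the loss is positive:
# e.g. ["per", "loc", "loc"] scores as high as gold but has Hamming cost 2,
# so the required margin of 2 is violated.
```

This illustrates why the hinge loss keeps pushing even after the model predicts correctly: it demands a separation proportional to the number of label mistakes.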
99 Regularized Loss Minimization

Given a training set { (x^(1), y^(1)), ..., (x^(m), y^(m)) }, find:
argmin_{w ∈ R^D} ∑_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||²

Two common loss functions L(w, x^(k), y^(k)):
- Log-likelihood loss (CRFs): −log P(y^(k) | x^(k); w)
- Hinge loss (SVMs): max_{y ∈ Y} ( w·f(x^(k), y) + e(y^(k), y) − w·f(x^(k), y^(k)) )
100 Learning Structured Predictors: summary so far

Linear models for sequence prediction:
argmax_{y ∈ Y} ∑_i w·f(x, i, y_{i-1}, y_i)

- Computations factorize on label bigrams:
  - Decoding: using Viterbi
  - Marginals: using forward-backward
- Parameter estimation: Perceptron, log-likelihood, SVMs
  - Extensions from classification to the structured case
- Optimization methods:
  - Stochastic (sub)gradient methods (LeCun et al. 98; Shalev-Shwartz et al. 07)
  - Exponentiated gradient (Collins et al. 08)
  - SVMstruct (Tsochantaridis et al. 04)
  - Structured MIRA (McDonald et al. 05)
101 Beyond Linear Sequence Prediction
102 Sequence Prediction, Beyond Bigrams

It is easy to extend the scope of features to k-grams: f(x, i, y_{i-k+1:i-1}, y_i)

In general, think of a state σ_i that remembers the relevant history:
- σ_i = y_{i-1} for bigrams
- σ_i = y_{i-k+1:i-1} for k-grams
- σ_i can be the state at time i of a deterministic automaton generating y

The structured predictor is argmax_{y ∈ Y} ∑_i w·f(x, i, σ_i, y_i)

Viterbi and forward-backward extend naturally, in O(n L^k) time.
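For bigrams (k = 2), the Viterbi recursion behind these factored predictors looks as follows; the score function below is an arbitrary deterministic stand-in for w·f(x, i, y_{i-1}, y_i), and the result is checked against a brute-force argmax over all L^n sequences:

```python
import itertools

def viterbi(n, labels, s):
    """argmax_y sum_i s(i, y[i-1], y[i]) in O(n * L^2); position 0
    conditions on a start symbol '<s>'."""
    delta = {a: s(0, "<s>", a) for a in labels}  # best prefix score per label
    backptrs = []
    for i in range(1, n):
        new_delta, bp = {}, {}
        for b in labels:
            best_prev = max(labels, key=lambda a: delta[a] + s(i, a, b))
            new_delta[b] = delta[best_prev] + s(i, best_prev, b)
            bp[b] = best_prev
        delta = new_delta
        backptrs.append(bp)
    y = [max(labels, key=lambda a: delta[a])]
    for bp in reversed(backptrs):  # follow backpointers right to left
        y.insert(0, bp[y[0]])
    return y

labels = ["A", "B", "C"]
n = 4

def s(i, a, b):  # arbitrary deterministic bigram score for testing
    return ((i + 1) * (len(a) + ord(b[0]))) % 7 - 3

def seq_score(y):
    prevs = ["<s>"] + y
    return sum(s(i, prevs[i], y[i]) for i in range(n))

best_brute = max(seq_score(list(y))
                 for y in itertools.product(labels, repeat=n))
```

The same dynamic program generalizes to k-grams by letting the state carry the last k−1 labels, giving the O(n L^k) bound on the slide.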
103 Dependency Structures

* John saw a movie that he liked today

Directed arcs represent dependencies between a head word and a modifier word. E.g.: movie modifies saw, John modifies saw, today modifies saw.
104 Dependency Parsing: arc-factored models (McDonald et al. 2005)

* John saw a movie that he liked today

Parse trees decompose into single dependencies ⟨h, m⟩:
argmax_{y ∈ Y(x)} ∑_{⟨h,m⟩ ∈ y} w·f(x, h, m)

Some features:
f_1(x, h, m) = [ head = saw ∧ modifier = movie ]
f_2(x, h, m) = [ distance = +2 ]

Tractable inference algorithms exist (tomorrow's lecture)
105 Linear Structured Prediction

Sequence prediction (bigram factorization):
argmax_{y ∈ Y(x)} ∑_i w·f(x, i, y_{i-1}, y_i)

Dependency parsing (arc-factored):
argmax_{y ∈ Y(x)} ∑_{⟨h,m⟩ ∈ y} w·f(x, h, m)

In general, we can enumerate the parts r ∈ y:
argmax_{y ∈ Y(x)} ∑_{r ∈ y} w·f(x, r)
106 Factored Sequence Prediction: from Linear to Non-linear

score(x, y) = ∑_i s(x, i, y_{i-1}, y_i)

Linear: s(x, i, y_{i-1}, y_i) = w·f(x, i, y_{i-1}, y_i)

Non-linear, using a feed-forward neural network:
s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i}·h(f(x, i))
where h(f(x, i)) = σ(W_2 σ(W_1 σ(W_0 f(x, i))))

Remarks:
- The non-linear model computes a hidden representation of the input
- The model is still factored: Viterbi and forward-backward work
- Parameter estimation becomes non-convex; use backpropagation
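A sketch of the non-linear scorer in NumPy; all dimensions, the random weights, and the tanh nonlinearity are illustrative choices, not prescribed by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 8, 16, 3   # made-up feature dim, hidden dim, and label count
W0 = rng.normal(size=(H, D))
W1 = rng.normal(size=(H, H))
W2 = rng.normal(size=(H, H))
W_out = rng.normal(size=(L, L, H))  # one weight vector w_{y',y} per label bigram

def sigma(z):
    return np.tanh(z)  # one common choice of nonlinearity

def hidden(feat):
    """h(f(x, i)) = sigma(W2 sigma(W1 sigma(W0 f(x, i))))."""
    return sigma(W2 @ sigma(W1 @ sigma(W0 @ feat)))

def s(feat, prev, cur):
    """Non-linear bigram score s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i} · h(f(x, i))."""
    return float(W_out[prev, cur] @ hidden(feat))

feat = rng.normal(size=D)  # stands in for the feature vector f(x, i)
# The hidden representation is shared across all L^2 bigram scores at a
# position, so Viterbi still only needs this L x L score table per position.
score_table = np.array([[s(feat, a, b) for b in range(L)] for a in range(L)])
```

Because the nonlinearity only touches the input side, the output is still an L×L table per position, exactly what the factored dynamic programs consume.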
107 Recurrent Sequence Prediction

y_1 y_2 y_3 ... y_n
h_1 h_2 h_3 ... h_n
x_1 x_2 x_3 ... x_n

- Maintains a state: a hidden variable that keeps track of previous observations and predictions
- Exact argmax prediction is not tractable; in practice, use greedy prediction or beam search
- Learning is non-convex
- Popular methods: RNNs, LSTMs, spectral models, ...
108 Thanks!
Simple Large-scale Relation Extraction from Unstructured Text Christos Christodoulopoulos and Arpit Mittal Amazon Research Cambridge Alexa Question Answering Alexa, what books did Carrie Fisher write?
More informationContents 1 Introduction Optical Character Recognition Systems Soft Computing Techniques for Optical Character Recognition Systems
Contents 1 Introduction.... 1 1.1 Organization of the Monograph.... 1 1.2 Notation.... 3 1.3 State of Art.... 4 1.4 Research Issues and Challenges.... 5 1.5 Figures.... 5 1.6 MATLAB OCR Toolbox.... 5 References....
More informationSpeech Processing. Simon King University of Edinburgh. additional lecture slides for
Speech Processing Simon King University of Edinburgh additional lecture slides for 2018-19 assignment Q&A writing exercise Roadmap Modules 1-2: The basics Modules 3-5: Speech synthesis Modules 6-9: Speech
More informationThe game of Bridge: a challenge for ILP
The game of Bridge: a challenge for ILP S. Legras, C. Rouveirol, V. Ventos Véronique Ventos LRI Univ Paris-Saclay vventos@nukk.ai 1 Games 2 Interest of games for AI Excellent field of experimentation Problems
More informationPatterns and random permutations II
Patterns and random permutations II Valentin Féray (joint work with F. Bassino, M. Bouvel, L. Gerin, M. Maazoun and A. Pierrot) Institut für Mathematik, Universität Zürich Summer school in Villa Volpi,
More informationLearning, prediction and selection algorithms for opportunistic spectrum access
Learning, prediction and selection algorithms for opportunistic spectrum access TRINITY COLLEGE DUBLIN Hamed Ahmadi Research Fellow, CTVR, Trinity College Dublin Future Cellular, Wireless, Next Generation
More informationScheduling. Radek Mařík. April 28, 2015 FEE CTU, K Radek Mařík Scheduling April 28, / 48
Scheduling Radek Mařík FEE CTU, K13132 April 28, 2015 Radek Mařík (marikr@fel.cvut.cz) Scheduling April 28, 2015 1 / 48 Outline 1 Introduction to Scheduling Methodology Overview 2 Classification of Scheduling
More informationMikko Myllymäki and Tuomas Virtanen
NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,
More informationArtificial Neural Networks. Artificial Intelligence Santa Clara, 2016
Artificial Neural Networks Artificial Intelligence Santa Clara, 2016 Simulate the functioning of the brain Can simulate actual neurons: Computational neuroscience Can introduce simplified neurons: Neural
More information