Learning Structured Predictors


1 Learning Structured Predictors Xavier Carreras 1/70

2 Supervised (Structured) Prediction Learning to predict: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, learn a predictor x → y that works well on unseen inputs x Non-Structured Prediction: outputs y are atomic Binary prediction: y ∈ {−1, +1} Multiclass prediction: y ∈ {1, 2, ..., L} Structured Prediction: outputs y are structured Sequence prediction: y are sequences Parsing: y are trees ... 2/70

3 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 3/70

4 Named Entity Recognition y per - qnt - - org org - time x Jim bought 300 shares of Acme Corp. in 2006 y per per - - loc x Jack London went to Paris y per per - - loc x Paris Hilton went to London y per - - loc x Jackie went to Lisdon 3/70

5 Part-of-speech Tagging y NNP NNP VBZ NNP. x Ms. Haag plays Elianti. 4/70

6 Syntactic Parsing [figure: dependency tree over the sentence, with arcs labeled SBJ, VC, TMP, OBJ, NMOD, PMOD, LOC, NAME, P, ROOT] Unesco is now holding its biennial meetings in New York. x are sentences y are syntactic dependency trees 5/70

7 Machine Translation (Galley et al., 2006) [figure from the cited paper: a source sentence aligned to a syntax tree; spans and complement-spans determine which translation rules are extracted, and a minimal rule is extracted from each constituent in the frontier set] 6/70

8 Object Detection (Kumar and Hebert, 2003) x are images y are grids labeled with object types 7/70

10 Today's Goals Introduce basic concepts for structured prediction We will restrict to sequence prediction What can we borrow from standard classification? Learning paradigms and algorithms, in essence, work here too However, computations behind the algorithms are prohibitive What can we borrow from HMMs and other structured formalisms? Representations of structured data into feature spaces Inference/search algorithms for tractable computations E.g., algorithms for HMMs (Viterbi, forward-backward) will play a major role in today's methods 8/70

12 Sequence Prediction y per per - - loc x Jack London went to Paris 9/70

13 Sequence Prediction x = x_1 x_2 ... x_n are input sequences, x_i ∈ X y = y_1 y_2 ... y_n are output sequences, y_i ∈ {1, ..., L} Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) } learn a predictor x → y that works well on unseen inputs x What is the form of our prediction model? 10/70

14 Exponentially-many Solutions Let Y = {-, per, loc} The solution space (all output sequences): [trellis figure: each of the words Jack London went to Paris can be labeled -, per, or loc] Each path is a possible solution For an input sequence of size n, there are |Y|^n possible outputs 11/70

16 Approach 1: Local Classifiers? Jack London went to Paris Decompose the sequence into n classification problems: A classifier predicts individual labels at each position ŷ_i = argmax_{l ∈ {loc, per, -}} w · f(x, i, l) f(x, i, l) represents an assignment of label l for x_i w is a vector of parameters, has a weight for each feature of f Use standard classification methods to learn w At test time, predict the best sequence by a simple concatenation of the best label for each position 12/70

18 Indicator Features f(x, i, l) is a vector of d features representing label l for x_i: [ f_1(x, i, l), ..., f_j(x, i, l), ..., f_d(x, i, l) ] What's in a feature f_j(x, i, l)? Anything we can compute using x and i and l Anything that indicates whether l is (not) a good label for x_i Indicator features: binary-valued features looking at: a simple pattern of x and target position i and the candidate label l for position i f_j(x, i, l) = 1 if x_i = London and l = loc, 0 otherwise f_k(x, i, l) = 1 if x_{i+1} = went and l = loc, 0 otherwise 13/70

20 Feature Templates Feature templates generate many indicator features mechanically A feature template is identified by a type, and a number of values Example: template word extracts the current word f_{word,a,w}(x, i, l) = 1 if x_i = w and l = a, 0 otherwise A feature of this type is identified by the tuple (word, a, w) Generates a feature for every label a ∈ Y and every word w, e.g.: a = loc, w = London; a = -, w = London; a = loc, w = Paris; a = per, w = Paris; a = per, w = John; a = -, w = the In feature-based models: Define feature templates manually Instantiate the templates on every set of values in the training data: generates a very high-dimensional feature space Define parameter vector w indexed by such feature tuples Let the learning algorithm choose the relevant features 14/70
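
A minimal sketch of how feature templates can be instantiated as sparse indicator vectors. The template names (word, next_word), the feature-tuple keys, and the helper names are illustrative assumptions, not from the slides.

```python
from collections import Counter

def word_template(x, i, label):
    """'word' template: one indicator per (label, current word) pair."""
    return [("word", label, x[i])]

def next_word_template(x, i, label):
    """Indicator for (label, following word); '</s>' pads the sentence end."""
    nxt = x[i + 1] if i + 1 < len(x) else "</s>"
    return [("next_word", label, nxt)]

TEMPLATES = [word_template, next_word_template]

def local_features(x, i, label):
    """f(x, i, l): sparse indicator vector, keyed by feature tuples."""
    feats = Counter()
    for template in TEMPLATES:
        for key in template(x, i, label):
            feats[key] += 1
    return feats

x = "Jack London went to Paris".split()
print(local_features(x, 1, "loc"))   # two active indicators for (London, loc)
```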

22 More Features for NE Recognition In practice, construct f(x, i, l) by... per per - Jack London went to Paris Define a number of simple patterns of x and i: current word x_i; is x_i capitalized?; x_i has digits?; prefixes/suffixes of size 1, 2, 3, ...; is x_i a known location?; is x_i a known person? Define feature templates by combining patterns with labels l: next word; previous word; current and next words together; other combinations Generate actual features by instantiating templates on training data Main limitation: features can't capture interactions between labels! 15/70

23 Approach 2: HMM for Sequence Prediction per per - - loc Jack London went to Paris (with parameters such as π_per, T_per,per, O_per,London) Define an HMM where each label is a state Model parameters: π_l: probability of starting with label l T_{l,l'}: probability of transitioning from l to l' O_{l,x}: probability of generating symbol x given label l Predictions: p(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i} Learning: relative counts + smoothing Prediction: Viterbi algorithm 16/70
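
A minimal sketch of the "relative counts + smoothing" estimation step, assuming add-alpha smoothing and a data format of (token list, label list) pairs; all names here are illustrative.

```python
from collections import defaultdict

def estimate_hmm(data, labels, vocab, alpha=1.0):
    """Estimate HMM parameters (pi, T, O) by relative counts with add-alpha smoothing.
    data: list of (x, y) pairs, x a list of tokens, y a list of labels."""
    pi = defaultdict(float); T = defaultdict(float); O = defaultdict(float)
    for x, y in data:
        pi[y[0]] += 1
        for i in range(1, len(y)):
            T[(y[i - 1], y[i])] += 1
        for token, label in zip(x, y):
            O[(label, token)] += 1
    n_start = sum(pi.values())
    pi = {a: (pi[a] + alpha) / (n_start + alpha * len(labels)) for a in labels}
    T = {(a, b): (T[(a, b)] + alpha) / (sum(T[(a, c)] for c in labels) + alpha * len(labels))
         for a in labels for b in labels}
    O = {(a, w): (O[(a, w)] + alpha) / (sum(O[(a, v)] for v in vocab) + alpha * len(vocab))
         for a in labels for w in vocab}
    return pi, T, O
```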

24 Approach 2: Representation in HMM per per - - loc Jack London went to Paris Label interactions are captured in the transition parameters But interactions between labels and input symbols are quite limited! Only O_{y_i,x_i} = p(x_i | y_i) Not clear how to exploit patterns such as: Capitalization, digits Prefixes and suffixes Next word, previous word Combinations of these with label transitions Why? HMM independence assumptions: given label y_i, token x_i is independent of anything else 17/70

26 Local Classifiers vs. HMM
Local Classifiers -- Form: w · f(x, i, l); Learning: standard classifiers; Prediction: independent for each x_i; Advantage: feature-rich; Drawback: no label interactions
HMM -- Form: π_{y_1} O_{y_1,x_1} ∏_{i>1} T_{y_{i-1},y_i} O_{y_i,x_i}; Learning: relative counts; Prediction: Viterbi; Advantage: label interactions; Drawback: no fine-grained features
18/70

27 Approach 3: Global Sequence Predictors Learn a single classifier from x to y y: per per - - loc x: Jack London went to Paris predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y) Next questions: How do we represent entire sequences in f(x, y)? There are exponentially-many sequences y for a given x, how do we solve the argmax problem? 19/70

29 Factored Representations y: per per - - loc x: Jack London went to Paris How do we represent entire sequences in f(x, y)? Look at individual assignments y_i (standard classification) Look at bigrams of output labels (y_{i-1}, y_i) Look at trigrams of output labels (y_{i-2}, y_{i-1}, y_i) Look at n-grams of output labels (y_{i-n+1}, ..., y_{i-1}, y_i) Look at the full label sequence y (intractable) A factored representation will lead to a tractable model 20/70

33 Bigram Feature Templates y per per - - loc x Jack London went to Paris A template for word + bigram: f_{wb,a,b,w}(x, i, y_{i-1}, y_i) = 1 if x_i = w and y_{i-1} = a and y_i = b, 0 otherwise e.g., f_{wb,per,per,London}(x, 2, per, per) = 1 f_{wb,per,per,London}(x, 3, per, -) = 0 f_{wb,per,-,went}(x, 3, per, -) = 1 21/70

34 More Templates for NER x Jack London went to Paris y per per - - loc y per loc - - loc y loc - x My trip to London... f_{w,per,per,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = per f_{w,per,loc,London}(...) = 1 iff x_i = London and y_{i-1} = per and y_i = loc f_{prep,loc,to}(...) = 1 iff x_{i-1} = to and x_i matches /[A-Z]/ and y_i = loc f_{city,loc}(...) = 1 iff y_i = loc and world-cities(x_i) = 1 f_{fname,per}(...) = 1 iff y_i = per and first-names(x_i) = 1 22/70

39 Representations Factored at Bigrams y: per per - - loc x: Jack London went to Paris f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of a label bigram at i Each dimension is typically a boolean indicator (0 or 1) f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i): a d-dimensional feature vector of the entire y Aggregated representation by summing bigram feature vectors Each dimension is now a count of a feature pattern 23/70
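
A minimal sketch of this aggregation, with illustrative bigram templates; the feature names and the "<s>" start symbol are assumptions, not from the slides.

```python
from collections import Counter

def bigram_features(x, i, prev_label, label):
    """f(x, i, y_{i-1}, y_i): sparse indicators for one label bigram (illustrative templates)."""
    return Counter({("word", label, x[i]): 1,
                    ("bigram", prev_label, label): 1})

def global_features(x, y):
    """f(x, y) = sum_i f(x, i, y_{i-1}, y_i); dimensions become counts."""
    feats = Counter()
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else "<s>"   # y_0 treated as a start symbol
        feats.update(bigram_features(x, i, prev, y[i]))
    return feats

x = "Jack London went to Paris".split()
y = ["per", "per", "-", "-", "loc"]
print(global_features(x, y)[("bigram", "per", "per")])  # 1
```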

40 Linear Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y), where f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Note the linearity of the expression: w · f(x, y) = w · Σ_{i=1}^n f(x, i, y_{i-1}, y_i) = Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Next questions: How do we solve the argmax problem? How do we learn w? 24/70

43 Predicting with Factored Sequence Models Consider a fixed w. Given x_{1:n}, find: argmax_{y ∈ Y^n} Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n|Y|^2) Notational change: since w and x_{1:n} are fixed we will use s(i, a, b) = w · f(x, i, a, b) 25/70

44 Viterbi for Factored Sequence Models Given scores s(i, a, b) for each position i and output bigram (a, b), find: argmax_{y ∈ Y^n} Σ_{i=1}^n s(i, y_{i-1}, y_i) Use the Viterbi algorithm, which takes O(n|Y|^2) Intuition: output sequences that share bigrams will share scores [trellis figure: the best subsequence ending at position i-1 with each label is extended by s(i, ·, ·) to the best subsequence ending at position i with each label] 26/70

45 Intuition for Viterbi Assume we have the best sub-sequence up to position i-1 ending with each label: best subsequence with y_{i-1} = per; best subsequence with y_{i-1} = loc; best subsequence with y_{i-1} = - What is the best sequence up to position i with y_i = loc? 27/70

47 Intuition for Viterbi Assume we have the best sub-sequence up to position i-1 ending with each label: best subsequence with y_{i-1} = per, extended with s(i, per, loc); best subsequence with y_{i-1} = loc, extended with s(i, loc, loc); best subsequence with y_{i-1} = -, extended with s(i, -, loc) What is the best sequence up to position i with y_i = loc? 27/70

48 Viterbi for Linear Factored Predictors ŷ = argmax_{y ∈ Y^n} Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) Definition: score of the optimal sequence for x_{1:i} ending with a ∈ Y: δ(i, a) = max_{y ∈ Y^i : y_i = a} Σ_{j=1}^i s(j, y_{j-1}, y_j) Use the following recursions, for all a ∈ Y: δ(1, a) = s(1, y_0 = null, a) δ(i, a) = max_{b ∈ Y} δ(i-1, b) + s(i, b, a) The optimal score for x is max_{a ∈ Y} δ(n, a) The optimal sequence ŷ can be recovered through back-pointers 28/70
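
A minimal sketch of these recursions (0-indexed positions, with None standing for the null start label y_0); the toy scoring function at the end is purely illustrative.

```python
import math

def viterbi(n, labels, score):
    """argmax_y sum_i score(i, y_{i-1}, y_i); score(0, None, a) scores the first position."""
    delta = [{a: score(0, None, a) for a in labels}]          # delta(1, a)
    back = [{}]
    for i in range(1, n):
        delta.append({}); back.append({})
        for a in labels:
            best_b, best_val = None, -math.inf
            for b in labels:
                val = delta[i - 1][b] + score(i, b, a)        # delta(i-1, b) + s(i, b, a)
                if val > best_val:
                    best_b, best_val = b, val
            delta[i][a] = best_val
            back[i][a] = best_b
    # recover the optimal sequence through back-pointers
    last = max(labels, key=lambda a: delta[n - 1][a])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))

# toy scores: prefer 'per' at the start, 'loc' right after 'to', '-' elsewhere
x = "Jack London went to Paris".split()
def s(i, prev, lab):
    bonus = 2.0 if (lab == "per" and i < 2) or (lab == "loc" and i > 0 and x[i - 1] == "to") else 0.0
    return bonus + (1.0 if lab == "-" else 0.0)
print(viterbi(len(x), ["per", "loc", "-"], s))   # ['per', 'per', '-', '-', 'loc']
```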

49 Linear Factored Sequence Prediction predict(x_{1:n}) = argmax_{y ∈ Y^n} w · f(x, y) Factored representation, e.g. based on bigrams Flexible, arbitrary features of full x and the factors Efficient prediction using Viterbi Next, learning w: Probabilistic log-linear models: Local learning, a.k.a. Maximum-Entropy Markov Models Global learning, a.k.a. Conditional Random Fields Margin-based methods: Structured Perceptron Structured SVM 29/70

50 The Learner's Game per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice Training Data per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w 30/70

59 The Learner's Game per - - Maria is beautiful loc - - Lisbon is beautiful per - - loc Jack went to Lisbon loc - - Argentina is nice Training Data per per - - loc loc Jack London went to South Paris org - - org Argentina played against Germany Weight Vector w: w_{Lower,-} = +1 w_{Upper,per} = +1 w_{Upper,loc} = +1 w_{Word,per,Maria} = +2 w_{Word,per,Jack} = +2 w_{NextW,per,went} = +2 w_{NextW,org,played} = +2 w_{PrevW,org,against} = +2 w_{UpperBigram,per,per} = +2 w_{UpperBigram,loc,loc} = +2 w_{NextW,loc,played} = ... 30/70

60 Log-linear Models for Sequence Prediction y per per - - loc x Jack London went to Paris 31/70

61 Log-linear Models for Sequence Prediction Model the conditional distribution: Pr(y | x; w) = exp{w · f(x, y)} / Z(x; w) where x = x_1 x_2 ... x_n with x_i ∈ X, y = y_1 y_2 ... y_n with y_i ∈ Y and Y = {1, ..., L} f(x, y) represents x and y with d features w ∈ R^d are the parameters of the model Z(x; w) is a normalizer called the partition function: Z(x; w) = Σ_{z ∈ Y^n} exp{w · f(x, z)} To predict the best sequence: predict(x_{1:n}) = argmax_{y ∈ Y^n} Pr(y | x) 32/70
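
A minimal sketch that computes Pr(y | x; w) by brute-force enumeration of Z(x; w); only viable for tiny inputs, but handy as a sanity check for the efficient algorithms that follow. The feature map and weights below are illustrative assumptions.

```python
import itertools, math
from collections import Counter

def dot(w, feats):
    """w · f for a sparse feature Counter and a dict-valued weight vector."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def brute_force_prob(w, x, y, labels, features):
    """Pr(y | x; w) by enumerating all |Y|^n output sequences."""
    num = math.exp(dot(w, features(x, y)))
    Z = sum(math.exp(dot(w, features(x, list(z))))
            for z in itertools.product(labels, repeat=len(x)))
    return num / Z

labels = ["per", "loc", "-"]
x = "Jack London".split()
feats = lambda x, y: Counter({("word", y[i], x[i]): 1 for i in range(len(x))})
w = {("word", "per", "Jack"): 2.0, ("word", "per", "London"): 1.0, ("word", "loc", "London"): 1.5}
print(brute_force_prob(w, x, ["per", "per"], labels, feats))
```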

62 Log-linear Models: Name Let's take the log of the conditional probability: log Pr(y | x; w) = log [ exp{w · f(x, y)} / Z(x; w) ] = w · f(x, y) − log Σ_{y'} exp{w · f(x, y')} = w · f(x, y) − log Z(x; w) Partition function: Z(x; w) = Σ_{y'} exp{w · f(x, y')} log Z(x; w) is a constant for a fixed x In the log space, computations are linear, i.e., we model log-probabilities using a linear predictor 33/70

63 Making Predictions with Log-Linear Models For tractability, assume f(x, y) decomposes into bigrams: f(x_{1:n}, y_{1:n}) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Given w and x_{1:n}, find: argmax_{y_{1:n}} Pr(y_{1:n} | x_{1:n}; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} = argmax_y Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) We can use the Viterbi algorithm 34/70

65 Parameter Estimation in Log-Linear Models Pr(y | x; w) = exp{w · f(x, y)} / Z(x; w) Given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, how do we estimate w? Define the conditional log-likelihood of the data: L(w) = Σ_{k=1}^m log Pr(y^(k) | x^(k); w) L(w) measures how well w explains the data. A good value for w will give a high value of Pr(y^(k) | x^(k); w) for all k = 1 ... m. We want the w that maximizes L(w) 35/70

67 Learning Log-Linear Models: Loss + Regularization Solve: w = argmin_{w ∈ R^d} L(w) + (λ/2) ||w||^2 (loss + regularization) where The first term is the negative conditional log-likelihood The second term is a regularization term, it penalizes solutions with large norm λ ∈ R controls the trade-off between loss and regularization Convex optimization problem → gradient descent Two common losses based on log-likelihood that make learning tractable: Local Loss (MEMM): assume that Pr(y | x; w) decomposes Global Loss (CRF): assume that f(x, y) decomposes 36/70

69 Maximum Entropy Markov Models (MEMM) (McCallum et al., 2000) Similarly to HMMs: Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) Pr(y_{2:n} | x_{1:n}, y_1) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, y_{i-1}) Assumption under MEMMs: Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}) 37/70

70 Parameter Estimation in MEMM Pr(y_{1:n} | x_{1:n}) = Pr(y_1 | x_{1:n}) ∏_{i=2}^n Pr(y_i | x_{1:n}, i, y_{i-1}) The log-linear model is normalized locally (i.e. at each position): Pr(y | x, i, y') = exp{w · f(x, i, y', y)} / Z(x, i, y') The log-likelihood is also local: L(w) = Σ_{k=1}^m Σ_{i=1}^{n^(k)} log Pr(y_i^(k) | x^(k), i, y_{i-1}^(k)) ∂L(w)/∂w_j = Σ_{k=1}^m Σ_{i=1}^{n^(k)} [ f_j(x^(k), i, y_{i-1}^(k), y_i^(k)) (observed) − Σ_{y ∈ Y} Pr(y | x^(k), i, y_{i-1}^(k)) f_j(x^(k), i, y_{i-1}^(k), y) (expected) ] 38/70
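
A minimal sketch of this observed-minus-expected gradient for one training sequence, assuming a bigram feature function f(x, i, y_prev, y) that returns a sparse Counter (as in the earlier sketches) and a "<s>" start symbol; regularization and the outer optimizer (SGD, L-BFGS) are omitted.

```python
import math
from collections import Counter

def memm_local_gradient(w, x, y, labels, f):
    """Gradient of the MEMM local log-likelihood for one sequence:
    sum_i [ f(x,i,y_{i-1},y_i) - E_{Pr(y'|x,i,y_{i-1})} f(x,i,y_{i-1},y') ]."""
    grad = Counter()
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else "<s>"
        # local distribution over labels at position i, conditioned on the gold previous label
        scores = {a: sum(w.get(k, 0.0) * v for k, v in f(x, i, prev, a).items()) for a in labels}
        m = max(scores.values())
        Z = sum(math.exp(s - m) for s in scores.values())
        probs = {a: math.exp(scores[a] - m) / Z for a in labels}
        grad.update(f(x, i, prev, y[i]))                       # observed
        for a in labels:                                       # minus expected
            for k, v in f(x, i, prev, a).items():
                grad[k] -= probs[a] * v
    return grad
```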

71 Conditional Random Fields (Lafferty et al., 2001) Log-linear model of the conditional distribution: Pr(y | x; w) = exp{w · f(x, y)} / Z(x) where x and y are input and output sequences f(x, y) is a feature vector of x and y that decomposes into factors w are the model parameters To predict the best sequence: ŷ = argmax_{y ∈ Y^n} Pr(y | x) Log-likelihood at the global (sequence) level: L(w) = Σ_{k=1}^m log Pr(y^(k) | x^(k); w) 39/70

72 Computing the Gradient in CRFs Consider a parameter w_j and its associated feature f_j: ∂L(w)/∂w_j = Σ_{k=1}^m [ f_j(x^(k), y^(k)) (observed) − Σ_{y ∈ Y^n} Pr(y | x^(k); w) f_j(x^(k), y) (expected) ] where f_j(x, y) = Σ_{i=1}^n f_j(x, i, y_{i-1}, y_i) First term: observed value of f_j in the training examples Second term: expected value of f_j under the current w At the optimum, observed = expected 40/70

73 Computing the Gradient in CRFs The first term is easy to compute, by counting explicitly: Σ_i f_j(x, i, y_{i-1}^(k), y_i^(k)) The second term is more involved: Σ_{y ∈ Y^n} Pr(y | x^(k); w) Σ_i f_j(x^(k), i, y_{i-1}, y_i) because it sums over all sequences y ∈ Y^n But there is an efficient solution... 41/70

74 Computing the Gradient in CRFs For an example (x^(k), y^(k)): Σ_{y ∈ Y^n} Pr(y | x^(k); w) Σ_{i=1}^n f_j(x^(k), i, y_{i-1}, y_i) = Σ_{i=1}^n Σ_{a,b ∈ Y} μ_i^k(a, b) f_j(x^(k), i, a, b) μ_i^k(a, b) is the marginal probability of having labels (a, b) at position i: μ_i^k(a, b) = Pr(⟨i, a, b⟩ | x^(k); w) = Σ_{y ∈ Y^n : y_{i-1}=a, y_i=b} Pr(y | x^(k); w) The quantities μ_i^k can be computed efficiently in O(nL^2) using the forward-backward algorithm 42/70

75 Forward-Backward for CRFs Assume fixed x and w. For notational convenience, define the score of a label bigram as: s(i, a, b) = exp{w · f(x, i, a, b)} such that we can write Pr(y | x) = exp{w · f(x, y)} / Z(x) = exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x) = ∏_{i=1}^n s(i, y_{i-1}, y_i) / Z Normalizer: Z = Σ_y ∏_{i=1}^n s(i, y_{i-1}, y_i) Marginals: μ(i, a, b) = (1/Z) Σ_{y : y_{i-1}=a, y_i=b} ∏_{j=1}^n s(j, y_{j-1}, y_j) 43/70

76 Forward-Backward for CRFs Definition: forward and backward quantities α_i(a) = Σ_{y_{1:i} ∈ Y^i : y_i = a} ∏_{j=1}^i s(j, y_{j-1}, y_j) β_i(b) = Σ_{y_{i:n} ∈ Y^(n-i+1) : y_i = b} ∏_{j=i+1}^n s(j, y_{j-1}, y_j) Z = Σ_a α_n(a) μ_i(a, b) = α_{i-1}(a) s(i, a, b) β_i(b) / Z Similarly to Viterbi, α_i(a) and β_i(b) can be computed recursively in O(n|Y|^2) 44/70
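
A minimal sketch of these recursions, computing Z and the bigram marginals μ_i(a, b) from un-normalized scores s(i, a, b) (0-indexed, with s(0, None, b) as the start score). In practice one would work in log space or rescale to avoid overflow; the names here are illustrative.

```python
def forward_backward(n, labels, s):
    """Z and bigram marginals mu[i][(a, b)] for scores s(i, a, b) = exp{w · f(x, i, a, b)}."""
    alpha = [{b: s(0, None, b) for b in labels}]                      # alpha_1(b)
    for i in range(1, n):
        alpha.append({b: sum(alpha[i - 1][a] * s(i, a, b) for a in labels) for b in labels})
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {a: 1.0 for a in labels}                            # beta_n(a) = 1
    for i in range(n - 2, -1, -1):
        beta[i] = {a: sum(s(i + 1, a, b) * beta[i + 1][b] for b in labels) for a in labels}
    Z = sum(alpha[n - 1][a] for a in labels)
    mu = [dict() for _ in range(n)]
    for i in range(1, n):                                             # bigram at positions (i-1, i)
        for a in labels:
            for b in labels:
                mu[i][(a, b)] = alpha[i - 1][a] * s(i, a, b) * beta[i][b] / Z
    return Z, mu
```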

79 CRFs: summary so far Log-linear models for sequence prediction, Pr(y | x; w) Computations factorize on label bigrams Model form: argmax_{y ∈ Y^n} Σ_i w · f(x, i, y_{i-1}, y_i) Prediction: uses Viterbi (from HMMs) Parameter estimation: gradient-based methods, in practice L-BFGS Computation of the gradient uses forward-backward (from HMMs) Next Question: MEMMs or CRFs? HMMs or CRFs? 45/70

80 MEMMs and CRFs MEMMs: Pr(y | x) = ∏_{i=1}^n exp{w · f(x, i, y_{i-1}, y_i)} / Z(x, i, y_{i-1}; w) CRFs: Pr(y | x) = exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x) Both exploit the same factorization, i.e. the same features Same computations to compute argmax_y Pr(y | x) MEMMs are locally normalized; CRFs are globally normalized MEMMs assume that Pr(y_i | x_{1:n}, y_{1:i-1}) = Pr(y_i | x_{1:n}, y_{i-1}), which leads to the Label Bias Problem MEMMs are cheaper to train (training reduces to multiclass learning) CRFs are easier to extend to other structures (next lecture) 46/70

81 HMMs for sequence prediction x are the observations, y are the hidden states HMMs model the joint distribution Pr(x, y) Parameters (assume X = {1, ..., k} and Y = {1, ..., l}): π ∈ R^l, π_a = Pr(y_1 = a) T ∈ R^{l×l}, T_{a,b} = Pr(y_i = b | y_{i-1} = a) O ∈ R^{l×k}, O_{a,c} = Pr(x_i = c | y_i = a) Model form: Pr(x, y) = π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} Parameter estimation: maximum likelihood by counting events and normalizing 47/70

82 HMMs and CRFs In CRFs: ŷ = argmax_y Σ_i w · f(x, i, y_{i-1}, y_i) In HMMs: ŷ = argmax_y π_{y_1} O_{y_1,x_1} ∏_{i=2}^n T_{y_{i-1},y_i} O_{y_i,x_i} = argmax_y log(π_{y_1} O_{y_1,x_1}) + Σ_{i=2}^n log(T_{y_{i-1},y_i} O_{y_i,x_i}) An HMM can be expressed as a factored linear model, with features f_j(x, i, y', y) and weights w_j: [i = 1 and y = a] → log(π_a); [i > 1 and y' = a and y = b] → log(T_{a,b}); [y = a and x_i = c] → log(O_{a,c}) Hence, HMMs are factored linear models 48/70
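
A minimal sketch of this mapping, turning HMM parameters into the weights of the equivalent factored linear model; the dictionary keys and the 'start'/'trans'/'emit' names are illustrative.

```python
import math

def hmm_to_linear_weights(pi, T, O):
    """Map HMM parameters to factored-linear-model weights:
    w['start', a] = log pi_a, w['trans', a, b] = log T_{a,b}, w['emit', a, c] = log O_{a,c}."""
    w = {}
    for a, p in pi.items():
        w[("start", a)] = math.log(p)
    for (a, b), p in T.items():
        w[("trans", a, b)] = math.log(p)
    for (a, c), p in O.items():
        w[("emit", a, c)] = math.log(p)
    return w

def hmm_score(w, x, i, prev, label):
    """s(i, y_{i-1}, y_i) under the HMM features: start/transition weight plus emission weight."""
    trans = w[("start", label)] if i == 0 else w[("trans", prev, label)]
    return trans + w[("emit", label, x[i])]
```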

83 HMMs and CRFs: main differences Representation: HMM features are tied to the generative process. CRF features are very flexible. They can look at the whole input x paired with a label bigram (y_i, y_{i+1}). In practice, for prediction tasks, good discriminative features can improve accuracy a lot. Parameter estimation: HMMs focus on explaining the data, both x and y. CRFs focus on the mapping from x to y. A priori, it is hard to say which paradigm is better. Same dilemma as Naive Bayes vs. Maximum Entropy. 49/70

84 Structured Prediction Perceptron, SVMs, CRFs 50/70

85 Learning Structured Predictors Goal: given training data { (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }, learn a predictor x → y with small error on unseen inputs In a CRF: argmax_{y ∈ Y^n} Pr(y | x; w) = argmax_y exp{Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i)} / Z(x; w) = argmax_y Σ_{i=1}^n w · f(x, i, y_{i-1}, y_i) To predict new values, Z(x; w) is not relevant Parameter estimation: w is set to maximize likelihood Can we learn w more directly, focusing on errors? 51/70

87 The Structured Perceptron (Collins, 2002) Set w = 0 For t = 1 ... T: For each training example (x, y): 1. Compute z = argmax_z w · f(x, z) 2. If z ≠ y: w ← w + f(x, y) − f(x, z) Return w 52/70

88 The Structured Perceptron + Averaging (Collins, 2002; Freund and Schapire, 1999) Set w = 0, w_a = 0 For t = 1 ... T: For each training example (x, y): 1. Compute z = argmax_z w · f(x, z) 2. If z ≠ y: w ← w + f(x, y) − f(x, z) 3. w_a ← w_a + w Return w_a / (mT), where m is the number of training examples 53/70
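
A minimal sketch of the averaged structured perceptron loop, assuming a decode(w, x) function that solves the argmax (e.g. a Viterbi decoder over w · f) and a features(x, y) function returning the global sparse feature Counter; both names are placeholders, not a fixed API.

```python
from collections import Counter

def averaged_perceptron(data, decode, features, epochs=5):
    """Averaged structured perceptron.
    data: list of (x, y) pairs; decode(w, x) -> argmax_z w·f(x, z); features(x, y) -> Counter."""
    w, w_sum = Counter(), Counter()
    for _ in range(epochs):
        for x, y in data:
            z = decode(w, x)
            if z != y:
                w.update(features(x, y))            # w += f(x, y)
                w.subtract(features(x, z))          # w -= f(x, z)
            w_sum.update(w)                         # accumulate for averaging
    m = len(data) * epochs
    return Counter({k: v / m for k, v in w_sum.items()})
```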

89 Perceptron Updates: Example y per per - - loc z per loc - - loc x Jack London went to Paris Let y be the correct output for x. Say we predict z instead, under our current w. The update is: g = f(x, y) − f(x, z) = Σ_i f(x, i, y_{i-1}, y_i) − Σ_i f(x, i, z_{i-1}, z_i) = f(x, 2, per, per) − f(x, 2, per, loc) + f(x, 3, per, -) − f(x, 3, loc, -) Perceptron updates are typically very sparse 54/70

90 Properties of the Perceptron Online algorithm. Often much more efficient than batch algorithms If the data is separable, it will converge to parameter values with 0 errors Number of errors before convergence is related to a definition of margin. Can also relate margin to generalization properties In practice: 1. Averaging improves performance a lot 2. Typically reaches a good solution after only a few (say 5) iterations over the training set 3. Often performs nearly as well as CRFs, or SVMs 55/70

91 Averaged Perceptron Convergence [plot: accuracy on a validation set vs. training iteration, for a parsing task] 56/70

92 Margin-based Structured Prediction Let f(x, y) = Σ_{i=1}^n f(x, i, y_{i-1}, y_i) Model: argmax_{y ∈ Y^n} w · f(x, y) Consider an example (x^(k), y^(k)): if there is a y ≠ y^(k) such that w · f(x^(k), y^(k)) < w · f(x^(k), y), we make an error Let y' = argmax_{y ∈ Y^n : y ≠ y^(k)} w · f(x^(k), y) Define γ_k = w · (f(x^(k), y^(k)) − f(x^(k), y')) The quantity γ_k is a notion of margin on example k: γ_k > 0 → no mistakes in the example; high γ_k → high confidence 57/70

95 Mistake-augmented Margins (Taskar et al., 2003) e(y^(k), ·): x^(k) Jack London went to Paris y^(k) per per - - loc → 0 y' per loc - - loc → 1 y'' - - per per - → 5 Def: e(y, y') = Σ_{i=1}^n [y_i ≠ y'_i] e.g., e(y^(k), y^(k)) = 0, e(y^(k), y') = 1, e(y^(k), y'') = 5 We want a w such that, for all y ≠ y^(k): w · f(x^(k), y^(k)) > w · f(x^(k), y) + e(y^(k), y) (the higher the error of y, the larger the separation should be) 58/70

98 Structured Hinge Loss Define a mistake-augmented margin: γ_{k,y} = w · f(x^(k), y^(k)) − w · f(x^(k), y) − e(y^(k), y), γ_k = min_{y ≠ y^(k)} γ_{k,y} Define the loss function on example k as: L(w, x^(k), y^(k)) = max_{y ∈ Y^n} { w · f(x^(k), y) + e(y^(k), y) } − w · f(x^(k), y^(k)) Leads to an SVM for structured prediction Given a training set, find: argmin_{w ∈ R^d} Σ_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||^2 59/70
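
Because the Hamming cost e(y^(k), y) decomposes over positions, the inner max can be solved with the same Viterbi recursion on cost-augmented scores. Below is a minimal sketch of the resulting hinge subgradient for one example, assuming the viterbi and bigram feature functions sketched earlier and a "<s>" start symbol; an SGD/Pegasos-style trainer would combine this with the λw regularization term.

```python
from collections import Counter

def hinge_subgradient(w, x, y, labels, f, viterbi):
    """Subgradient of the structured hinge loss for one example:
    decode with Hamming-cost-augmented scores, then return f(x, y_hat) - f(x, y)."""
    def cost_aug_score(i, prev, lab):
        prev = prev if prev is not None else "<s>"     # align with the start symbol used in f
        s = sum(w.get(k, 0.0) * v for k, v in f(x, i, prev, lab).items())
        return s + (0.0 if lab == y[i] else 1.0)       # e(y, .) decomposes per position
    y_hat = viterbi(len(x), labels, cost_aug_score)
    if list(y_hat) == list(y):
        return Counter()                               # loss is zero, so is the subgradient
    grad = Counter()
    for i in range(len(x)):
        prev_hat = y_hat[i - 1] if i > 0 else "<s>"
        prev_gold = y[i - 1] if i > 0 else "<s>"
        grad.update(f(x, i, prev_hat, y_hat[i]))
        grad.subtract(f(x, i, prev_gold, y[i]))
    return grad
```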

99 Regularized Loss Minimization Given a training set { (x^(1), y^(1)), ..., (x^(m), y^(m)) }, find: argmin_{w ∈ R^d} Σ_{k=1}^m L(w, x^(k), y^(k)) + (λ/2) ||w||^2 Two common loss functions L(w, x^(k), y^(k)): Log-likelihood loss (CRFs): −log Pr(y^(k) | x^(k); w) Hinge loss (SVMs): max_{y ∈ Y^n} ( w · f(x^(k), y) + e(y^(k), y) ) − w · f(x^(k), y^(k)) 60/70

100 Learning Structured Predictors: summary so far Linear models for sequence prediction: argmax_{y ∈ Y^n} Σ_i w · f(x, i, y_{i-1}, y_i) Computations factorize on label bigrams Decoding: using Viterbi Marginals: using forward-backward Parameter estimation: Perceptron, log-likelihood, SVMs Extensions from classification to the structured case Optimization methods: Stochastic (sub)gradient methods (LeCun et al., 1998; Shalev-Shwartz et al., 2011) Exponentiated Gradient (Collins et al., 2008) SVM Struct (Tsochantaridis et al., 2005) Structured MIRA (Crammer et al., 2005) 61/70

101 Beyond Linear Sequence Prediction 62/70

102 Factored Sequence Prediction, Beyond Bigrams It is easy to extend the scope of features to k-grams: f(x, i, y_{i-k+1:i-1}, y_i) In general, think of a state σ_i remembering the relevant history: σ_i = y_{i-1} for bigrams; σ_i = y_{i-k+1:i-1} for k-grams; σ_i can be the state at time i of a deterministic automaton generating y The structured predictor is argmax_{y ∈ Y^n} Σ_i w · f(x, i, σ_i, y_i) Viterbi and forward-backward extend naturally, in O(nL^k) 63/70

103 Dependency Structures * John saw a movie that he liked today Directed arcs represent dependencies between a head word and a modifier word. E.g.: movie modifies saw, John modifies saw, today modifies saw 64/70

104 Dependency Parsing: arc-factored models (McDonald et al., 2005) * John saw a movie that he liked today Parse trees decompose into single dependencies ⟨h, m⟩: argmax_{y ∈ Y(x)} Σ_{⟨h,m⟩ ∈ y} w · f(x, h, m) Some features: f_1(x, h, m) = [ saw → movie ] f_2(x, h, m) = [ distance = +2 ] Tractable inference algorithms exist (tomorrow's lecture) 65/70

105 Linear Structured Prediction Sequence prediction (bigram factorization): argmax_{y ∈ Y(x)} Σ_i w · f(x, i, y_{i-1}, y_i) Dependency parsing (arc-factored): argmax_{y ∈ Y(x)} Σ_{⟨h,m⟩ ∈ y} w · f(x, h, m) In general, we can enumerate parts r ∈ y: argmax_{y ∈ Y(x)} Σ_{r ∈ y} w · f(x, r) 66/70

106 Factored Sequence Prediction: from Linear to Non-linear score(x, y) = Σ_i s(x, i, y_{i-1}, y_i) Linear: s(x, i, y_{i-1}, y_i) = w · f(x, i, y_{i-1}, y_i) Non-linear, using a feed-forward neural network: s(x, i, y_{i-1}, y_i) = w_{y_{i-1},y_i} · h(f(x, i)), where h(f(x, i)) = σ(W_2 σ(W_1 σ(W_0 f(x, i)))) Remarks: The non-linear model computes a hidden representation of the input Still factored: Viterbi and forward-backward work Parameter estimation becomes non-convex, use backpropagation 67/70
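
A minimal sketch of such a non-linear bigram scorer, using a single hidden layer for brevity (the slide uses three); the per-label-pair output vectors, the '<s>' start label, and all names are illustrative assumptions.

```python
import numpy as np

def make_bigram_scorer(d_in, d_hidden, labels, seed=0):
    """Non-linear factored scorer: s(x,i,y_{i-1},y_i) = w_{y_{i-1},y_i} · h(phi),
    where phi = f(x, i) is an input feature vector and h(phi) = tanh(W1 · phi)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
    pairs = [(a, b) for a in list(labels) + ["<s>"] for b in labels]
    w_out = {pair: rng.normal(scale=0.1, size=d_hidden) for pair in pairs}

    def score(phi, prev, label):
        h = np.tanh(W1 @ phi)                     # hidden representation h(f(x, i))
        return float(w_out[(prev, label)] @ h)    # one output weight vector per label bigram
    return score

# usage: plug into the Viterbi sketch with s(i, a, b) = score(phi_i, a if a is not None else "<s>", b)
```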

107 Recurrent Sequence Prediction [diagram: outputs y_1 ... y_n, hidden states h_1 ... h_n, inputs x_1 ... x_n] Maintains a state: a hidden variable that keeps track of previous observations and predictions Making predictions is not tractable In practice: greedy predictions or beam search Learning is non-convex Popular methods: RNN, LSTM, Spectral Models, ... 68/70

108 Thanks! 69/70

109 References I
Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 1-8. Association for Computational Linguistics, 2002.
Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. The Journal of Machine Learning Research, 9, 2008.
Koby Crammer, Ryan McDonald, and Fernando Pereira. Scalable large-margin online learning for structured classification. In NIPS Workshop on Learning With Structured Outputs, 2005.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 1999.
Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006.
Sanjiv Kumar and Martial Hebert. Man-made structure detection in natural images using a causal multiscale random field. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, USA, June 2003.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01. Morgan Kaufmann Publishers Inc., 2001.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, 2000.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.
Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16, 2003.
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep), 2005.
70/70


More information

Neural Network Part 4: Recurrent Neural Networks

Neural Network Part 4: Recurrent Neural Networks Neural Network Part 4: Recurrent Neural Networks Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from

More information

Dynamic Programming in Real Life: A Two-Person Dice Game

Dynamic Programming in Real Life: A Two-Person Dice Game Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,

More information

Reinforcement Learning Agent for Scrolling Shooter Game

Reinforcement Learning Agent for Scrolling Shooter Game Reinforcement Learning Agent for Scrolling Shooter Game Peng Yuan (pengy@stanford.edu) Yangxin Zhong (yangxin@stanford.edu) Zibo Gong (zibo@stanford.edu) 1 Introduction and Task Definition 1.1 Game Agent

More information

Statistical Machine Translation. Machine Translation Phrase-Based Statistical MT. Motivation for Phrase-based SMT

Statistical Machine Translation. Machine Translation Phrase-Based Statistical MT. Motivation for Phrase-based SMT Statistical Machine Translation Machine Translation Phrase-Based Statistical MT Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University October 2009 Probabilistic

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or

More information

Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation

Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation Deep Neural Networks (2) Tanh & ReLU layers; Generalisation and Regularisation Steve Renals Machine Learning Practical MLP Lecture 4 9 October 2018 MLP Lecture 4 / 9 October 2018 Deep Neural Networks (2)

More information

arxiv: v1 [cs.ni] 23 Jan 2019

arxiv: v1 [cs.ni] 23 Jan 2019 Machine Learning for Wireless Communications in the Internet of Things: A Comprehensive Survey Jithin Jagannath, Nicholas Polosky, Anu Jagannath, Francesco Restuccia, and Tommaso Melodia ANDRO Advanced

More information

Deep Learning for Autonomous Driving

Deep Learning for Autonomous Driving Deep Learning for Autonomous Driving Shai Shalev-Shwartz Mobileye IMVC dimension, March, 2016 S. Shalev-Shwartz is also affiliated with The Hebrew University Shai Shalev-Shwartz (MobilEye) DL for Autonomous

More information

Figure 1. Artificial Neural Network structure. B. Spiking Neural Networks Spiking Neural networks (SNNs) fall into the third generation of neural netw

Figure 1. Artificial Neural Network structure. B. Spiking Neural Networks Spiking Neural networks (SNNs) fall into the third generation of neural netw Review Analysis of Pattern Recognition by Neural Network Soni Chaturvedi A.A.Khurshid Meftah Boudjelal Electronics & Comm Engg Electronics & Comm Engg Dept. of Computer Science P.I.E.T, Nagpur RCOEM, Nagpur

More information

Research on Hand Gesture Recognition Using Convolutional Neural Network

Research on Hand Gesture Recognition Using Convolutional Neural Network Research on Hand Gesture Recognition Using Convolutional Neural Network Tian Zhaoyang a, Cheng Lee Lung b a Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China E-mail address:

More information

Outcome Forecasting in Sports. Ondřej Hubáček

Outcome Forecasting in Sports. Ondřej Hubáček Outcome Forecasting in Sports Ondřej Hubáček Motivation & Challenges Motivation exploiting betting markets performance optimization Challenges no available datasets difficulties with establishing the state-of-the-art

More information

AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE. A Thesis by. Andrew J. Zerngast

AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE. A Thesis by. Andrew J. Zerngast AN IMPROVED NEURAL NETWORK-BASED DECODER SCHEME FOR SYSTEMATIC CONVOLUTIONAL CODE A Thesis by Andrew J. Zerngast Bachelor of Science, Wichita State University, 2008 Submitted to the Department of Electrical

More information

Move Prediction in Go Modelling Feature Interactions Using Latent Factors

Move Prediction in Go Modelling Feature Interactions Using Latent Factors Move Prediction in Go Modelling Feature Interactions Using Latent Factors Martin Wistuba and Lars Schmidt-Thieme University of Hildesheim Information Systems & Machine Learning Lab {wistuba, schmidt-thieme}@ismll.de

More information

Augmenting Self-Learning In Chess Through Expert Imitation

Augmenting Self-Learning In Chess Through Expert Imitation Augmenting Self-Learning In Chess Through Expert Imitation Michael Xie Department of Computer Science Stanford University Stanford, CA 94305 xie@cs.stanford.edu Gene Lewis Department of Computer Science

More information

Game Theory and Randomized Algorithms

Game Theory and Randomized Algorithms Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international

More information

Signal Recovery from Random Measurements

Signal Recovery from Random Measurements Signal Recovery from Random Measurements Joel A. Tropp Anna C. Gilbert {jtropp annacg}@umich.edu Department of Mathematics The University of Michigan 1 The Signal Recovery Problem Let s be an m-sparse

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Perceptron Barnabás Póczos Contents History of Artificial Neural Networks Definitions: Perceptron, Multi-Layer Perceptron Perceptron algorithm 2 Short History of Artificial

More information

A Comparison of Particle Swarm Optimization and Gradient Descent in Training Wavelet Neural Network to Predict DGPS Corrections

A Comparison of Particle Swarm Optimization and Gradient Descent in Training Wavelet Neural Network to Predict DGPS Corrections Proceedings of the World Congress on Engineering and Computer Science 00 Vol I WCECS 00, October 0-, 00, San Francisco, USA A Comparison of Particle Swarm Optimization and Gradient Descent in Training

More information

Pedigree Reconstruction using Identity by Descent

Pedigree Reconstruction using Identity by Descent Pedigree Reconstruction using Identity by Descent Bonnie Kirkpatrick Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2010-43 http://www.eecs.berkeley.edu/pubs/techrpts/2010/eecs-2010-43.html

More information

Coursework 2. MLP Lecture 7 Convolutional Networks 1

Coursework 2. MLP Lecture 7 Convolutional Networks 1 Coursework 2 MLP Lecture 7 Convolutional Networks 1 Coursework 2 - Overview and Objectives Overview: Use a selection of the techniques covered in the course so far to train accurate multi-layer networks

More information

Hash Function Learning via Codewords

Hash Function Learning via Codewords Hash Function Learning via Codewords 2015 ECML/PKDD, Porto, Portugal, September 7 11, 2015. Yinjie Huang 1 Michael Georgiopoulos 1 Georgios C. Anagnostopoulos 2 1 Machine Learning Laboratory, University

More information

Semantic Localization of Indoor Places. Lukas Kuster

Semantic Localization of Indoor Places. Lukas Kuster Semantic Localization of Indoor Places Lukas Kuster Motivation GPS for localization [7] 2 Motivation Indoor navigation [8] 3 Motivation Crowd sensing [9] 4 Motivation Targeted Advertisement [10] 5 Motivation

More information

Recommender Systems TIETS43 Collaborative Filtering

Recommender Systems TIETS43 Collaborative Filtering + Recommender Systems TIETS43 Collaborative Filtering Fall 2017 Kostas Stefanidis kostas.stefanidis@uta.fi https://coursepages.uta.fi/tiets43/ selection Amazon generates 35% of their sales through recommendations

More information

Lecture 5: Pitch and Chord (1) Chord Recognition. Li Su

Lecture 5: Pitch and Chord (1) Chord Recognition. Li Su Lecture 5: Pitch and Chord (1) Chord Recognition Li Su Recap: short-time Fourier transform Given a discrete-time signal x(t) sampled at a rate f s. Let window size N samples, hop size H samples, then the

More information

Carnegie Mellon University, University of Pittsburgh

Carnegie Mellon University, University of Pittsburgh Carnegie Mellon University, University of Pittsburgh Carnegie Mellon University, University of Pittsburgh Artificial Intelligence (AI) and Deep Learning (DL) Overview Paola Buitrago Leader AI and BD Pittsburgh

More information

22c181: Formal Methods in Software Engineering. The University of Iowa Spring Propositional Logic

22c181: Formal Methods in Software Engineering. The University of Iowa Spring Propositional Logic 22c181: Formal Methods in Software Engineering The University of Iowa Spring 2010 Propositional Logic Copyright 2010 Cesare Tinelli. These notes are copyrighted materials and may not be used in other course

More information

Topic 1: defining games and strategies. SF2972: Game theory. Not allowed: Extensive form game: formal definition

Topic 1: defining games and strategies. SF2972: Game theory. Not allowed: Extensive form game: formal definition SF2972: Game theory Mark Voorneveld, mark.voorneveld@hhs.se Topic 1: defining games and strategies Drawing a game tree is usually the most informative way to represent an extensive form game. Here is one

More information

Training a Minesweeper Solver

Training a Minesweeper Solver Training a Minesweeper Solver Luis Gardea, Griffin Koontz, Ryan Silva CS 229, Autumn 25 Abstract Minesweeper, a puzzle game introduced in the 96 s, requires spatial awareness and an ability to work with

More information

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016 Artificial Neural Networks Artificial Intelligence Santa Clara, 2016 Simulate the functioning of the brain Can simulate actual neurons: Computational neuroscience Can introduce simplified neurons: Neural

More information

MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA

MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA MULTIPLE CLASSIFIERS FOR ELECTRONIC NOSE DATA M. Pardo, G. Sberveglieri INFM and University of Brescia Gas Sensor Lab, Dept. of Chemistry and Physics for Materials Via Valotti 9-25133 Brescia Italy D.

More information

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 2010

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 2010 Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 3 Fall 21 Peter Bro Miltersen November 1, 21 Version 1.3 3 Extensive form games (Game Trees, Kuhn Trees)

More information

Artificial Intelligence and Deep Learning

Artificial Intelligence and Deep Learning Artificial Intelligence and Deep Learning Cars are now driving themselves (far from perfectly, though) Speaking to a Bot is No Longer Unusual March 2016: World Go Champion Beaten by Machine AI: The Upcoming

More information

Introduction to Markov Models. Estimating the probability of phrases of words, sentences, etc.

Introduction to Markov Models. Estimating the probability of phrases of words, sentences, etc. Introduction to Markov Models Estimating the probability of phrases of words, sentences, etc. But first: A few preliminaries on text preprocessing What counts as a word? A tricky question. CIS 421/521

More information

Mikko Myllymäki and Tuomas Virtanen

Mikko Myllymäki and Tuomas Virtanen NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,

More information

Antennas and Propagation. Chapter 5c: Array Signal Processing and Parametric Estimation Techniques

Antennas and Propagation. Chapter 5c: Array Signal Processing and Parametric Estimation Techniques Antennas and Propagation : Array Signal Processing and Parametric Estimation Techniques Introduction Time-domain Signal Processing Fourier spectral analysis Identify important frequency-content of signal

More information

Introduction to Spring 2009 Artificial Intelligence Final Exam

Introduction to Spring 2009 Artificial Intelligence Final Exam CS 188 Introduction to Spring 2009 Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet, double-sided. Please use non-programmable

More information

PROJECT 5: DESIGNING A VOICE MODEM. Instructor: Amir Asif

PROJECT 5: DESIGNING A VOICE MODEM. Instructor: Amir Asif PROJECT 5: DESIGNING A VOICE MODEM Instructor: Amir Asif CSE4214: Digital Communications (Fall 2012) Computer Science and Engineering, York University 1. PURPOSE In this laboratory project, you will design

More information

Classification with Pedigree and its Applicability to Record Linkage

Classification with Pedigree and its Applicability to Record Linkage Classification with Pedigree and its Applicability to Record Linkage Evan S. Gamble, Sofus A. Macskassy, and Steve Minton Fetch Technologies, 2041 Rosecrans Ave, El Segundo, CA 90245 {egamble,sofmac,minton}@fetch.com

More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 6 Lecture - 37 Divide and Conquer: Counting Inversions

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 6 Lecture - 37 Divide and Conquer: Counting Inversions Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Module 6 Lecture - 37 Divide and Conquer: Counting Inversions Let us go back and look at Divide and Conquer again.

More information

Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1

Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1 Are there alternatives to Sigmoid Hidden Units? MLP Lecture 6 Hidden Units / Initialisation 1 Hidden Unit Transfer Functions Initialising Deep Networks Steve Renals Machine Learning Practical MLP Lecture

More information

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28

Paper Presentation. Steve Jan. March 5, Virginia Tech. Steve Jan (Virginia Tech) Paper Presentation March 5, / 28 Paper Presentation Steve Jan Virginia Tech March 5, 2015 Steve Jan (Virginia Tech) Paper Presentation March 5, 2015 1 / 28 2 paper to present Nonparametric Multi-group Membership Model for Dynamic Networks,

More information

k-means Clustering David S. Rosenberg December 15, 2017 Bloomberg ML EDU David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, / 18

k-means Clustering David S. Rosenberg December 15, 2017 Bloomberg ML EDU David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, / 18 k-means Clustering David S. Rosenberg Bloomberg ML EDU December 15, 2017 David S. Rosenberg (Bloomberg ML EDU) ML 101 December 15, 2017 1 / 18 k-means Clustering David S. Rosenberg (Bloomberg ML EDU) ML

More information

CandyCrush.ai: An AI Agent for Candy Crush

CandyCrush.ai: An AI Agent for Candy Crush CandyCrush.ai: An AI Agent for Candy Crush Jiwoo Lee, Niranjan Balachandar, Karan Singhal December 16, 2016 1 Introduction Candy Crush, a mobile puzzle game, has become very popular in the past few years.

More information

Avoiding consecutive patterns in permutations

Avoiding consecutive patterns in permutations Avoiding consecutive patterns in permutations R. E. L. Aldred M. D. Atkinson D. J. McCaughan January 3, 2009 Abstract The number of permutations that do not contain, as a factor (subword), a given set

More information

Regret Minimization in Games with Incomplete Information

Regret Minimization in Games with Incomplete Information Regret Minimization in Games with Incomplete Information Martin Zinkevich maz@cs.ualberta.ca Michael Bowling Computing Science Department University of Alberta Edmonton, AB Canada T6G2E8 bowling@cs.ualberta.ca

More information

AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS

AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS AERONAUTICAL CHANNEL MODELING FOR PACKET NETWORK SIMULATORS Author: Sandarva Khanal Advisor: Dr. Richard A. Dean Department of Electrical and Computer Engineering Morgan State University ABSTRACT The introduction

More information

CS510 \ Lecture Ariel Stolerman

CS510 \ Lecture Ariel Stolerman CS510 \ Lecture04 2012-10-15 1 Ariel Stolerman Administration Assignment 2: just a programming assignment. Midterm: posted by next week (5), will cover: o Lectures o Readings A midterm review sheet will

More information

Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method

Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 9, NO. 1, Jan. 2015 407 Copyright 2015 KSII Building a Business Knowledge Base by a Supervised Learning and Rule-Based Method Sungho Shin 1, 2,

More information

1. Introduction to Game Theory

1. Introduction to Game Theory 1. Introduction to Game Theory What is game theory? Important branch of applied mathematics / economics Eight game theorists have won the Nobel prize, most notably John Nash (subject of Beautiful mind

More information

Modeling, Analysis and Optimization of Networks. Alberto Ceselli

Modeling, Analysis and Optimization of Networks. Alberto Ceselli Modeling, Analysis and Optimization of Networks Alberto Ceselli alberto.ceselli@unimi.it Università degli Studi di Milano Dipartimento di Informatica Doctoral School in Computer Science A.A. 2015/2016

More information

Singing Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection

Singing Voice Detection. Applications of Music Processing. Singing Voice Detection. Singing Voice Detection. Singing Voice Detection Detection Lecture usic Processing Applications of usic Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Important pre-requisite for: usic segmentation

More information