Log-linear models (part III)

1 Log-linear models (part III)
Lecture, Feb 7
CS 690N, Spring 2017: Advanced Natural Language Processing
Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst

2 MaxEnt / Log-Linear models
- x: input (all previous words)
- y: output (next word)
- f(x,y) => R^d: feature function [[domain knowledge here!]]
- v ∈ R^d: parameter vector (weights)

$$p(y \mid x; v) = \frac{\exp(v \cdot f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(v \cdot f(x, y'))}$$

Application to history-based LM:

$$P(w_1..w_T) = \prod_t P(w_t \mid w_1..w_{t-1}) = \prod_t \frac{\exp(v \cdot f(w_1..w_{t-1}, w_t))}{\sum_{w \in \mathcal{V}} \exp(v \cdot f(w_1..w_{t-1}, w))}$$
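As a rough illustration of the definitions on this slide, the sketch below computes p(y | x; v) with sparse features and chains it into a history-based sequence probability. The feature templates (`unigram=...`, `bigram=...`), the toy vocabulary, and the weights are all made up for the example; the slide leaves f(x, y) to domain knowledge.

```python
import math

def features(history, word):
    """Hypothetical sparse feature function f(x, y): feature name -> value."""
    prev = history[-1] if history else "<s>"
    return {
        f"unigram={word}": 1.0,
        f"bigram={prev}_{word}": 1.0,
    }

def score(v, feats):
    """Sparse dot product v . f(x, y); features without a weight count as 0."""
    return sum(v.get(name, 0.0) * value for name, value in feats.items())

def next_word_probs(v, history, vocab):
    """p(y | x; v) for every y in vocab, using a max-shifted softmax for stability."""
    scores = {w: score(v, features(history, w)) for w in vocab}
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def sequence_log_prob(v, words, vocab):
    """log P(w_1..w_T) = sum_t log P(w_t | w_1..w_{t-1})."""
    return sum(
        math.log(next_word_probs(v, words[:t], vocab)[w])
        for t, w in enumerate(words)
    )

# Toy vocabulary and weights, invented for illustration only.
VOCAB = ["the", "cat", "sat"]
WEIGHTS = {"bigram=the_cat": 2.0, "unigram=the": 0.5}
```

Calling `next_word_probs(WEIGHTS, ["the"], VOCAB)` returns a distribution over the three words in which "cat" gets the most mass, because only its bigram feature carries a large weight.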

6 Learning

$$\log p(y \mid x; v) = v \cdot f(x, y) - \log \sum_{y' \in \mathcal{Y}} \exp(v \cdot f(x, y'))$$

Fun with the chain rule:

$$\frac{\partial}{\partial v_j} \log p(y \mid x; v) = f_j(x, y) - \sum_{y'} p(y' \mid x; v)\, f_j(x, y')$$

First term: feature in data? Second term: feature in posterior?

- Gradient at a single example: can it be zero?
- Full dataset gradient: first moments match at the mode
- Log-likelihood is concave
  - At least with regularization, since typically linearly separable
- Is my function convex? Check Boyd and Vandenberghe ch. 3
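The gradient for a single example is the observed feature value minus the feature's expected value under the model. The sketch below computes that quantity for a toy example with two labels. The feature names `f1`, `f2`, `f3`, the label names, and the weights are illustrative assumptions, not taken from the lecture.

```python
import math

def probs(v, feats_by_label):
    """p(y | x; v) for each candidate label y, given sparse features per label."""
    scores = {
        y: sum(v.get(j, 0.0) * val for j, val in f.items())
        for y, f in feats_by_label.items()
    }
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def log_lik(v, feats_by_label, gold):
    """log p(gold | x; v) for one example."""
    return math.log(probs(v, feats_by_label)[gold])

def gradient(v, feats_by_label, gold):
    """d/dv_j log p(gold | x; v) = f_j(x, gold) - sum_y' p(y' | x; v) f_j(x, y')."""
    p = probs(v, feats_by_label)
    grad = dict(feats_by_label[gold])          # observed ("feature in data") term
    for y, f in feats_by_label.items():        # expected ("feature in posterior") term
        for j, val in f.items():
            grad[j] = grad.get(j, 0.0) - p[y] * val
    return grad

# Toy example: f2 fires with the same value for every label.
EXAMPLE = {"a": {"f1": 1.0, "f2": 1.0}, "b": {"f2": 1.0, "f3": 2.0}}
V = {"f1": 0.3, "f3": -0.2}
GRAD = gradient(V, EXAMPLE, "a")
```

In this toy case `f2` takes the same value under both labels, so its observed and expected values coincide and its gradient component is zero. That is one concrete way a single-example gradient entry can vanish.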

7 Gradient descent
- Batch gradient descent (doesn't work well by itself)
- Most commonly used alternatives
  - L-BFGS (adaptive version of batch GD)
    - Call a library implementation with a gradient callback
  - SGD, one example at a time, and adaptive variants: Adagrad, Adam, etc.
    - Intuition
    - Issue: combining per-example sparse updates with regularization updates
      - Lazy updates
      - Occasional regularizer steps (easy to implement)
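One possible reading of "occasional regularizer steps" is sketched below. Each SGD step touches only the features active in the current example, and every `reg_every` steps all weights are shrunk at once to stand in for the L2 updates that were skipped. The toy data, the hyperparameters, and the choice of an L2 penalty are assumptions made for illustration.

```python
import math
import random

def probs(v, feats_by_label):
    """p(y | x; v) for each candidate label, with sparse features per label."""
    scores = {
        y: sum(v.get(j, 0.0) * val for j, val in f.items())
        for y, f in feats_by_label.items()
    }
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def sgd(data, epochs=20, lr=0.5, l2=0.01, reg_every=10, seed=0):
    """SGD on the log-likelihood with batched (occasional) L2 shrinkage."""
    rng = random.Random(seed)
    v, step = {}, 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            feats_by_label, gold = data[i]
            p = probs(v, feats_by_label)   # computed before any weight changes
            # Sparse ascent step: observed features minus expected features.
            for j, val in feats_by_label[gold].items():
                v[j] = v.get(j, 0.0) + lr * val
            for y, f in feats_by_label.items():
                for j, val in f.items():
                    v[j] = v.get(j, 0.0) - lr * p[y] * val
            step += 1
            # Occasional regularizer step: apply reg_every skipped shrinkages at once.
            if step % reg_every == 0:
                shrink = (1.0 - lr * l2) ** reg_every
                for j in v:
                    v[j] *= shrink
    return v

LABELS = ["play", "stay"]

def make_example(x, gold):
    """Hypothetical conjunction features: one indicator per (input, label) pair."""
    return ({y: {f"{x}&{y}": 1.0} for y in LABELS}, gold)

DATA = [make_example("sunny", "play"), make_example("rainy", "stay")]
```

The batched shrinkage only approximates per-step regularization, since the weights change between shrink steps. The trade-off is that the dense pass over all weights happens once every `reg_every` examples instead of on every example.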

8 (Stopped here on 2/7)

9 Engineering
- Sparse dot products are crucial!
- Lots and lots of features?
  - Millions to billions of features: performance often keeps improving!
  - Features seen only once at training time typically help
- Feature name=>number mapping is the problem; the parameter vector is fine
- Feature hashing: make e.g. N(u,v,w) mapping random with collisions (!)
  - Accuracy loss low since features are rare.
  - Works well, great for large-scale data (memory usage constant!)
  - Practically: use a fast string hashing function (e.g. murmurhash or Python's internal one)
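A minimal sketch of feature hashing follows. Feature names are mapped straight to indices in a fixed-size weight array, so no name=>number dictionary is stored and collisions are simply tolerated. `zlib.crc32` is used here only as a deterministic standard-library stand-in for a fast string hash such as murmurhash; the bucket count and feature names are arbitrary choices for the example.

```python
import zlib

NUM_BUCKETS = 2 ** 18  # fixed memory budget, independent of how many features exist

def bucket(name, num_buckets=NUM_BUCKETS):
    """Map a feature name to an index; distinct names may collide."""
    return zlib.crc32(name.encode("utf-8")) % num_buckets

def hashed_features(names, num_buckets=NUM_BUCKETS):
    """Sparse hashed feature vector: bucket index -> summed count."""
    out = {}
    for name in names:
        idx = bucket(name, num_buckets)
        out[idx] = out.get(idx, 0.0) + 1.0
    return out

def score(weights, feats):
    """Sparse dot product between a dense weight list and hashed features."""
    return sum(weights[i] * val for i, val in feats.items())

weights = [0.0] * NUM_BUCKETS
feats = hashed_features(["trigram=the_cat_sat", "bigram=cat_sat", "unigram=sat"])
weights[bucket("bigram=cat_sat")] = 1.5
```

Python's built-in `hash` on strings is randomized per process unless `PYTHONHASHSEED` is set, so a saved model would need a fixed seed or a deterministic hash like the one assumed here.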

10 Feature selection
- Offline feature selection
  - Count cutoffs: computational, not performance benefits
  - Predictive value: mutual info. / info. gain / chi-square
- L1 regularization: encourages θ sparsity

$$\min_\theta \; -\log p_\theta(y \mid x) + \lambda \sum_j |\theta_j|$$

  - L1 optimization: convex but nonsmooth; requires subgradient methods
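The sketch below shows one way a subgradient-style update for the L1-penalized objective might look. It takes the usual gradient step on the negative log-likelihood, then moves each weight toward zero by `lr * lam` using sign(θ_j) as the subgradient of |θ_j|. The optional clipping at zero is a variation added here so that weights can become exactly zero; it is not something the slide specifies, and the function name and arguments are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1; 0 is a valid subgradient of |x| at x = 0."""
    return (x > 0) - (x < 0)

def l1_subgradient_step(theta, grad_nll, lr, lam, clip=True):
    """One update on  -log p_theta(y | x) + lam * sum_j |theta_j|.

    theta and grad_nll are sparse dicts (feature -> value); grad_nll is the
    gradient of the negative log-likelihood. Weights that end at exactly zero
    are dropped from the returned dict, which keeps the model sparse.
    """
    new_theta = {}
    for j in set(theta) | set(grad_nll):
        # Step 1: ordinary gradient step on the smooth loss term.
        w = theta.get(j, 0.0) - lr * grad_nll.get(j, 0.0)
        # Step 2: penalty step along a subgradient of lam * |w|.
        w_pen = w - lr * lam * sign(w)
        # Optional: if the penalty would push the weight across zero, stop at zero.
        if clip and w * w_pen < 0:
            w_pen = 0.0
        if w_pen != 0.0:
            new_theta[j] = w_pen
    return new_theta
```

Without clipping, a small weight tends to oscillate around zero rather than landing on it, which is why plain subgradient steps rarely produce exactly sparse solutions in practice.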

11 Dense representations
- Saul and Pereira 1997?
- Mnih and Hinton 2007: log-bilinear model

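12 Bengio et al. 2003: N-gram MLP

$$f(w_t, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$

[Figure: network architecture. Inputs are the indices for w_{t-n+1}, ..., w_{t-2}, w_{t-1}. Each index is mapped by table look-up in the matrix C (parameters shared across words) to C(w_{t-n+1}), ..., C(w_{t-2}), C(w_{t-1}). These feed a tanh hidden layer, then a softmax output layer (most computation here) whose i-th output = P(w_t = i | context).]

C(i) ∈ R^m: word embedding parameters

$$x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$$

$$y = b + Wx + U \tanh(d + Hx)$$

$$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$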
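The forward pass of this model can be sketched in plain Python as follows. The parameter names C, W, H, U, b, d follow the slide's equations; the sizes, the random initialization range, and the helper names are assumptions chosen only to make the example runnable.

```python
import math
import random

def matvec(M, x):
    """Dense matrix-vector product for list-of-lists M and list x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def init_params(vocab_size, m, context_len, hidden, seed=0):
    """Small random parameters; the init range is arbitrary for this sketch."""
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "C": rand(vocab_size, m),                # word embeddings, C(i) in R^m
        "W": rand(vocab_size, context_len * m),  # direct input-to-output weights
        "H": rand(hidden, context_len * m),      # input-to-hidden weights
        "U": rand(vocab_size, hidden),           # hidden-to-output weights
        "b": [0.0] * vocab_size,
        "d": [0.0] * hidden,
    }

def forward(params, context_ids):
    """Return P(w_t = i | context) for every word index i.

    context_ids is ordered (w_{t-1}, w_{t-2}, ..., w_{t-n+1}), as in x on the slide.
    """
    # x: concatenation of the context words' embeddings (table look-up in C).
    x = [val for i in context_ids for val in params["C"][i]]
    # tanh(d + Hx)
    hidden = [math.tanh(dj + hj) for dj, hj in zip(params["d"], matvec(params["H"], x))]
    # y = b + Wx + U tanh(d + Hx)
    y = [
        bi + wi + ui
        for bi, wi, ui in zip(
            params["b"], matvec(params["W"], x), matvec(params["U"], hidden)
        )
    ]
    # Softmax over the vocabulary, shifted by the max for numerical stability.
    mx = max(y)
    exps = [math.exp(v - mx) for v in y]
    z = sum(exps)
    return [e / z for e in exps]
```

For instance, `forward(init_params(5, 3, 2, 4), [1, 2])` yields a length-5 probability vector for a trigram-style context of two word indices. The softmax over the full vocabulary is the step whose cost grows with vocabulary size.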
