Machine Learning for Language Technology

1 Machine Learning for Language Technology Generative and Discriminative Models Joakim Nivre Uppsala University Department of Linguistics and Philology Machine Learning for Language Technology 1(7)

2 Generative Models
Naive Bayes is a generative probabilistic model because it models the joint distribution of inputs and outputs:
$$P(x, y) = P(y)\,P(x \mid y) = P(y) \prod_{i=1}^{n} P(f_i(x) \mid y)$$
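As a rough illustration of the factorization above, the following sketch computes the joint probability for a toy Naive Bayes model. The class names, the prior, and the per-feature likelihoods are hypothetical values chosen for the example and do not come from the slides; the features are assumed to be binary.

```python
from math import prod

# Hypothetical toy parameters (not from the slides).
prior = {"financial": 0.4, "scientific": 0.6}   # P(y)
# likelihood[y][i] = P(f_i(x) = 1 | y) for two binary features
likelihood = {
    "financial": [0.7, 0.2],
    "scientific": [0.1, 0.5],
}

def joint(features, y):
    """P(x, y) = P(y) * prod_i P(f_i(x) | y), assuming independent binary features."""
    terms = [p if f == 1 else 1.0 - p for f, p in zip(features, likelihood[y])]
    return prior[y] * prod(terms)

x = [1, 0]                       # first feature fires, second does not
p_fin = joint(x, "financial")    # 0.4 * 0.7 * (1 - 0.2)
p_sci = joint(x, "scientific")   # 0.6 * 0.1 * (1 - 0.5)
```

Because the model covers the inputs as well as the labels, the values of `joint` summed over every feature combination and every label add up to 1.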

3 Pros and Cons of Generative Models
Pros:
- Straightforward to do estimation (MLE or MAP)
- Informative: other distributions can be derived:
  - Marginalization: $P(x) = \sum_{y} P(x, y)$
  - Conditionalization: $P(y \mid x) = \frac{P(x, y)}{\sum_{y'} P(x, y')}$
Cons:
- Unnecessary to model input distribution (for classification)
- Necessary to make rigid independence assumptions
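Marginalization and conditionalization can be sketched on a small joint table. The table below is a hypothetical distribution over two inputs and two labels, invented only to make the arithmetic concrete.

```python
# Hypothetical joint distribution P(x, y); the four entries sum to 1.
joint_table = {
    ("x1", "SPAM"): 0.30, ("x1", "HAM"): 0.10,
    ("x2", "SPAM"): 0.15, ("x2", "HAM"): 0.45,
}

def marginal_x(x):
    """P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in joint_table.items() if xi == x)

def conditional_y_given_x(y, x):
    """P(y | x) = P(x, y) / sum over y' of P(x, y')."""
    return joint_table[(x, y)] / marginal_x(x)

p_x1 = marginal_x("x1")                              # 0.30 + 0.10
p_spam_given_x1 = conditional_y_given_x("SPAM", "x1")  # 0.30 / 0.40
```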

4 Discriminative Models
A discriminative (or conditional) probabilistic model only models the conditional distribution of outputs given inputs: $P(y \mid x)$
We can discriminate between different outputs for an input, but we cannot generate input-output pairs.

5 Pros and Cons of Discriminative Models
Pros:
- Only models the distribution relevant for classification
- Less rigid independence assumptions
Cons:
- Harder to do estimation (MLE or MAP)
- Less informative: other distributions cannot be derived

6 From Naive Bayes to Logistic Regression
Logistic regression:
$$P(y \mid x) = \frac{\exp[\mathbf{w} \cdot \mathbf{f}(x, y)]}{\sum_{y'} \exp[\mathbf{w} \cdot \mathbf{f}(x, y')]}$$
- Models the conditional distribution directly
- The discriminative counterpart of Naive Bayes

7 Quiz
Which of the following statements are true?
1. If we know $P(x, y)$, we can derive $P(x \mid y)$.
2. If we know $P(x \mid y)$, we can derive $P(x, y)$.
3. If we know $P(x, y)$, we can derive $P(x)$.
4. If we know $P(x)$, we can derive $P(x, y)$.

8 Machine Learning for Language Technology: Linear Models

9 Introduction
- We want to build a discriminative (or conditional) classifier: $f(x) = \operatorname{argmax}_{y} P(y \mid x)$
- We will do this in two steps:
  1. Build a feature-based linear classifier
  2. Normalize to a conditional probability model
- The result will be a log-linear model

10 Feature Representations
- We assume a mapping from input-output pairs $(x, y)$ to a high-dimensional feature vector:
  $\mathbf{f}(x, y) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^m$
- For any vector $\mathbf{v} \in \mathbb{R}^m$, let $v_j$ be the $j$th value

11 Examples
- x is a document and y is a label:
  $$f_j(x, y) = \begin{cases} 1 & \text{if } x \text{ contains the word "interest" and } y = \text{"financial"} \\ 0 & \text{otherwise} \end{cases}$$
  $f_j(x, y)$ = % of words in x with punctuation, and y = "scientific"
- x is a word and y is a part-of-speech tag:
  $$f_j(x, y) = \begin{cases} 1 & \text{if } x = \text{"bank" and } y = \text{Verb} \\ 0 & \text{otherwise} \end{cases}$$

12 Examples
- x is a name, y is a label classifying the name. Each feature is 1 if its condition holds and 0 otherwise:
  - $f_0(x, y) = 1$ if x contains "George" and y = Person
  - $f_1(x, y) = 1$ if x contains "Washington" and y = Person
  - $f_2(x, y) = 1$ if x contains "Bridge" and y = Person
  - $f_3(x, y) = 1$ if x contains "General" and y = Person
  - $f_4(x, y) = 1$ if x contains "George" and y = Object
  - $f_5(x, y) = 1$ if x contains "Washington" and y = Object
  - $f_6(x, y) = 1$ if x contains "Bridge" and y = Object
  - $f_7(x, y) = 1$ if x contains "General" and y = Object
- x = General George Washington, y = Person → $\mathbf{f}(x, y)$ = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → $\mathbf{f}(x, y)$ = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → $\mathbf{f}(x, y)$ = [0 0 0 0 1 1 0 0]

13 Block Feature Vectors
- x = General George Washington, y = Person → $\mathbf{f}(x, y)$ = [1 1 0 1 | 0 0 0 0]
- x = George Washington Bridge, y = Object → $\mathbf{f}(x, y)$ = [0 0 0 0 | 1 1 1 0]
- x = George Washington George, y = Object → $\mathbf{f}(x, y)$ = [0 0 0 0 | 1 1 0 0]
- Each equal-size block corresponds to one label
- Non-zero values allowed only in one block
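The block structure can be sketched in a few lines. The function below is one possible way to compute the eight binary features for the name example, assuming whitespace tokenization and exact, case-sensitive word matching; the names `WORDS`, `LABELS`, and `features` are illustrative and not part of the slides.

```python
WORDS = ["George", "Washington", "Bridge", "General"]
LABELS = ["Person", "Object"]   # one block of len(WORDS) features per label

def features(x, y):
    """Return the block feature vector f(x, y) as a list of 0/1 values."""
    tokens = x.split()
    vec = []
    for label in LABELS:          # block order: Person first, then Object
        for word in WORDS:
            vec.append(1 if (word in tokens and y == label) else 0)
    return vec

v_person = features("General George Washington", "Person")
v_bridge = features("George Washington Bridge", "Object")
v_repeat = features("George Washington George", "Object")
```

Only the block that belongs to the label `y` can contain non-zero values; the other block is always all zeros.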

14 Linear Classifiers
- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let $\mathbf{w} \in \mathbb{R}^m$ be a weight vector for $\mathbf{f}(x, y) \in \mathbb{R}^m$
- The weight $w_i$ reflects the significance of feature $f_i(x, y)$:
  - If $w_i > 0$, then $f_i(x, y)$ favors class y
  - The larger $w_i$ is, the stronger the association
- Example:
  - $w_3 = 1.7 \Rightarrow$ the word "General" favors the class Person
  - $w_7 = -0.9 \Rightarrow$ the word "General" disfavors the class Object

15 Linear Classifiers
- The score of a class y is the inner product of $\mathbf{f}(x, y)$ and $\mathbf{w}$:
  $$\mathbf{f}(x, y) \cdot \mathbf{w} = \sum_{i=1}^{m} f_i(x, y)\, w_i$$
- The highest scoring class wins:
  $$f(x) = \operatorname{argmax}_{y} \mathbf{f}(x, y) \cdot \mathbf{w} = \operatorname{argmax}_{y} \sum_{i=1}^{m} f_i(x, y)\, w_i$$
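A minimal sketch of scoring and classification with the name features follows. The weights at positions 3 and 7 follow the example on the previous slide, taking the second as negative because the slide says the feature disfavors Object; every other weight is a hypothetical value chosen for the illustration.

```python
WORDS = ["George", "Washington", "Bridge", "General"]
LABELS = ["Person", "Object"]

def features(x, y):
    """Block feature vector f(x, y): one block of word indicators per label."""
    tokens = x.split()
    return [1 if (word in tokens and y == label) else 0
            for label in LABELS for word in WORDS]

# w[3] = 1.7 and w[7] = -0.9 as in the slide example; the rest are hypothetical.
w = [0.5, 0.8, -1.2, 1.7, -0.3, -0.4, 1.5, -0.9]

def score(x, y):
    """Inner product f(x, y) . w."""
    return sum(fi * wi for fi, wi in zip(features(x, y), w))

def classify(x):
    """Return the highest scoring label."""
    return max(LABELS, key=lambda y: score(x, y))

label_general = classify("General George Washington")
label_bridge = classify("George Washington Bridge")
```

With these weights the first name scores higher as Person and the second scores higher as Object; different hypothetical weights would of course give different decisions.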

16 Binary Linear Classifier
Divides all points:
[Figure: a binary linear classifier dividing all points]

17 Multiclass Linear Classifier
Defines regions of space:
- i.e., the + region contains all points $(x, y)$ where $+ = \operatorname{argmax}_{y} \mathbf{w} \cdot \mathbf{f}(x, y)$

18 Quiz
- Suppose that these are all the features used in a spam filter:
  - $f_1(x, y) = 1$ if x contains "buy" and y = SPAM, 0 otherwise
  - $f_2(x, y) = 1$ if x contains "buy" and y = HAM, 0 otherwise
- Suppose the corresponding weights are:
  - $w_1 = 1.0$
  - $w_2 = 0.0$
- Which of the following statements is false?
  1. The SPAM score for a document containing the word "buy" is 1.0
  2. The HAM score for a document containing the word "buy" is 0.0
  3. The SPAM score for a document not containing the word "buy" is 0.0
  4. The HAM score for a document not containing the word "buy" is 1.0

19 Machine Learning for Language Technology: Log-Linear Models

20 Introduction
- We want to build a discriminative (or conditional) classifier:
  $f(x) = \operatorname{argmax}_{y} P(y \mid x)$, where $\sum_{i=1}^{n} P(y_i \mid x) = 1$
- A linear classifier outputs scores in the range $(-\infty, \infty)$
- We need to do two things:
  1. Make sure that all scores are positive
  2. Normalize scores to sum to 1

21 Log-Linear Models
- Linear model: $\mathbf{f}(x, y) \cdot \mathbf{w}$
- Make scores positive: $\exp[\mathbf{f}(x, y) \cdot \mathbf{w}]$
- Normalize:
  $$P(y \mid x) = \frac{\exp[\mathbf{f}(x, y) \cdot \mathbf{w}]}{\sum_{i=1}^{n} \exp[\mathbf{f}(x, y_i) \cdot \mathbf{w}]}$$
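The two steps, exponentiate and normalize, can be sketched as a small function. The input scores below are hypothetical linear scores; subtracting the maximum score before exponentiating is a common numerical safeguard that is not mentioned on the slide and leaves the resulting probabilities unchanged.

```python
import math

def log_linear_probs(scores):
    """Turn raw linear scores {label: f(x, y) . w} into probabilities P(y | x)."""
    m = max(scores.values())                              # shift for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}  # make scores positive
    z = sum(exps.values())                                # normalizing constant
    return {y: e / z for y, e in exps.items()}

# Hypothetical scores for two labels.
probs = log_linear_probs({"Person": 3.0, "Object": -1.6})
```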

22 Log-Linear Models
- Crash course in exponentiation:
  - Note: $\exp x = a^x$ (for some base $a > 1$)
  - $0 < \exp x < 1$ if $x < 0$
  - $\exp x = 1$ if $x = 0$
  - $1 < \exp x$ if $x > 0$
- The inverse of exponentiation is the logarithm: $\log \exp x = x$
- Hence, the log-linear model is linear in log(arithmic) space

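23 Log-Linear Models
- Suppose we have (only) two classes with the following scores:
  $\mathbf{f}(x, y_1) \cdot \mathbf{w} = 1.0$
  $\mathbf{f}(x, y_2) \cdot \mathbf{w} = -2.0$
- Using base 2, we have:
  $\exp[\mathbf{f}(x, y_1) \cdot \mathbf{w}] = 2^{1.0} = 2$
  $\exp[\mathbf{f}(x, y_2) \cdot \mathbf{w}] = 2^{-2.0} = 0.25$
- Normalizing, we get:
  $$P(y_1 \mid x) = \frac{\exp[\mathbf{f}(x, y_1) \cdot \mathbf{w}]}{\exp[\mathbf{f}(x, y_1) \cdot \mathbf{w}] + \exp[\mathbf{f}(x, y_2) \cdot \mathbf{w}]} = \frac{2}{2 + 0.25} = 0.89$$
  $$P(y_2 \mid x) = \frac{\exp[\mathbf{f}(x, y_2) \cdot \mathbf{w}]}{\exp[\mathbf{f}(x, y_1) \cdot \mathbf{w}] + \exp[\mathbf{f}(x, y_2) \cdot \mathbf{w}]} = \frac{0.25}{2 + 0.25} = 0.11$$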
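The arithmetic of this example can be checked directly. The sketch assumes the second score is -2.0, the value consistent with the exponentiated score 0.25 in base 2; the variable names are illustrative.

```python
scores = {"y1": 1.0, "y2": -2.0}   # linear scores f(x, y) . w
base = 2.0                          # the slide uses base 2 rather than e

exps = {y: base ** s for y, s in scores.items()}      # 2^1.0 and 2^-2.0
z = sum(exps.values())                                 # normalizing constant
probs = {y: e / z for y, e in exps.items()}
rounded = {y: round(p, 2) for y, p in probs.items()}   # {"y1": 0.89, "y2": 0.11}
```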

24 Quiz 2
- Suppose a (plain) linear classifier for spam filtering assigns SPAM and HAM the same score for a document d.
- What is $P(\text{SPAM} \mid d)$ in the normalized log-linear model?
  - Impossible to tell

25 Machine Learning for Language Technology: Logistic Regression

26 Logistic Regression
- We know how to do classification with a log-linear model:
  $$f(x) = \operatorname{argmax}_{y} P(y \mid x) = \operatorname{argmax}_{y} \frac{\exp[\mathbf{f}(x, y) \cdot \mathbf{w}]}{\sum_{i=1}^{n} \exp[\mathbf{f}(x, y_i) \cdot \mathbf{w}]}$$
- But how do we learn the weights?

27 Maximum Likelihood Estimation
- For a generative model like NB, we maximize joint likelihood:
  $$\operatorname{argmax} \prod_{i=1}^{n} P(y_i)\, P(x_i \mid y_i)$$
- We can use relative frequencies to get the joint MLE
- Now we want to maximize conditional likelihood:
  $$\operatorname{argmax} \prod_{i=1}^{n} P(y_i \mid x_i)$$
- Bad news: there is no analytical solution
- Good news: the log-likelihood function is concave (equivalently, its negative is convex), so any local maximum is a global maximum

28 Gradient Ascent
- Concavity of the log-likelihood (convexity of its negative) guarantees that any maximum found is the global maximum
- Gradient ascent:
  1. Guess an initial weight vector $\mathbf{w}^{(0)}$ (all weights set to 0.0)
  2. Repeat until convergence:
     2.1 Use the gradient at $\mathbf{w}^{(i)}$ to determine the ascent direction
     2.2 Update $\mathbf{w}^{(i+1)} \leftarrow \mathbf{w}^{(i)}$ + gradient step
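The procedure above can be sketched for a tiny spam filter. Several things here are assumptions rather than content of the slides: the training data and vocabulary are hypothetical, the gradient of the conditional log-likelihood is taken in its standard form (observed feature values minus expected feature values under the current model), and a fixed step size with a fixed number of iterations stands in for a real convergence test.

```python
import math

LABELS = ["SPAM", "HAM"]
VOCAB = ["buy", "meeting"]   # hypothetical vocabulary

def features(x, y):
    """Block feature vector: one block of word indicators per label."""
    tokens = x.split()
    return [1.0 if (word in tokens and y == label) else 0.0
            for label in LABELS for word in VOCAB]

def probs(x, w):
    """P(y | x) for each label under the log-linear model with weights w."""
    scores = [sum(fi * wi for fi, wi in zip(features(x, y), w)) for y in LABELS]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(data, w):
    """Conditional log-likelihood: sum of log P(y_i | x_i)."""
    return sum(math.log(probs(x, w)[LABELS.index(y)]) for x, y in data)

def gradient(data, w):
    """Observed minus expected feature values, summed over the data."""
    g = [0.0] * len(w)
    for x, y in data:
        p = probs(x, w)
        observed = features(x, y)
        per_label = [features(x, label) for label in LABELS]
        for j in range(len(w)):
            expected = sum(p[k] * per_label[k][j] for k in range(len(LABELS)))
            g[j] += observed[j] - expected
    return g

def gradient_ascent(data, step=0.5, iterations=200):
    w = [0.0] * (len(LABELS) * len(VOCAB))             # 1. start from all zeros
    for _ in range(iterations):                         # 2. repeat (fixed count here)
        g = gradient(data, w)                           # 2.1 ascent direction
        w = [wi + step * gi for wi, gi in zip(w, g)]    # 2.2 update
    return w

# Hypothetical training data: (document, label) pairs.
data = [("buy now", "SPAM"), ("buy cheap", "SPAM"), ("buy later", "HAM"),
        ("meeting today", "HAM"), ("meeting now", "SPAM")]
w = gradient_ascent(data)
p_spam_buy = probs("buy now", w)[0]
```

On this data two of the three documents containing "buy" are SPAM, so the learned `p_spam_buy` should approach 2/3, which is the conditional maximum likelihood estimate for such documents.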

29 Linear Models and Logistic Regression
- Linear model:
  - Classifier score is a linear combination of weighted features
- Logistic regression (log-linear model):
  - Learn weights to maximize conditional likelihood
  - Only one of many possible ways to learn weights
- Exponentiating and normalizing does not change which class wins, because exp is monotonically increasing and the denominator is the same for every y:
  $$\operatorname{argmax}_{y} \frac{\exp[\mathbf{f}(x, y) \cdot \mathbf{w}]}{\sum_{i=1}^{n} \exp[\mathbf{f}(x, y_i) \cdot \mathbf{w}]} = \operatorname{argmax}_{y} \mathbf{f}(x, y) \cdot \mathbf{w}$$
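A quick numerical check of this equality, using hypothetical scores for three classes: the class with the highest raw score is also the class with the highest normalized probability.

```python
import math

def softmax(scores):
    """Exponentiate and normalize a list of linear scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [0.1, 0.8, -1.6]   # hypothetical values of f(x, y) . w for three classes
probs = softmax(scores)

best_by_score = max(range(len(scores)), key=scores.__getitem__)
best_by_prob = max(range(len(probs)), key=probs.__getitem__)
```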

30 Naive Bayes and Logistic Regression
Naive Bayes                      | Logistic Regression
Generative model                 | Discriminative model
Estimates $P(x, y)$              | Estimates $P(y \mid x)$
MLE has closed-form solution     | MLE requires numerical optimization
Strong independence assumptions  | No independence assumptions
Better on small training sets    | Better on medium-sized training sets
