Introduction to Markov Models: estimating the probability of phrases, sentences, and other word sequences.
But first: A few preliminaries on text preprocessing
What counts as a word? A tricky question.
How to find sentences??
Q1: How to estimate the probability of a given sentence W? A crucial step in speech recognition (and lots of other applications).
First guess: bag of words. Given the word lattice (form|farm) (subsidy|subsidies) (for|far), score each path by
P̂(W) = ∏_{w ∈ W} P(w)
Unigram counts (in 1.7 × 10^6 words of AP text):
form 183      farm 74
subsidy 15    subsidies 55
for 18185     far 570
The most likely word string under P̂(W) isn't quite right.
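A minimal sketch (mine, not the course's code) of this bag-of-words scoring: each lattice path is scored by the product of unigram probabilities estimated from the AP counts above. The names unigram_score and lattice are made up for illustration.

from itertools import product

# AP unigram counts from the slide above; the corpus size is 1.7 million words.
unigram = {'form': 183, 'farm': 74, 'subsidy': 15, 'subsidies': 55,
           'for': 18185, 'far': 570}
total_words = 1.7e6

def unigram_score(words):
    # P_hat(W) = product of P(w) over the words, with P(w) estimated as count(w) / total
    p = 1.0
    for w in words:
        p *= unigram[w] / total_words
    return p

lattice = [('form', 'farm'), ('subsidy', 'subsidies'), ('for', 'far')]
print(max(product(*lattice), key=unigram_score))
# ('form', 'subsidies', 'for'): the highest unigram score, but not the right reading

The unigram model happily prefers "form subsidies for", which is exactly why the slide says this first guess isn't quite right.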
Predicting a word sequence II
Next guess: products of bigrams. For W = w_1 w_2 w_3 ... w_n,
P̂(W) = ∏_{i=1}^{n-1} P(w_i w_{i+1})
Given the same word lattice (form|farm) (subsidy|subsidies) (for|far), the bigram counts (in 1.7 × 10^6 words of AP text) are:
form subsidy 0      subsidy for 2
form subsidies 0    subsidy far 0
farm subsidy 0      subsidies for 6
farm subsidies 4    subsidies far 0
Much better (if not quite right). (Q: the counts are tiny! Why?)
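The same lattice can be rescored with the bigram counts above. Here is a companion sketch (again my own, with made-up names) that multiplies joint bigram probabilities along each path.

from itertools import product

# AP bigram counts from the slide above.
bigram = {('form', 'subsidy'): 0, ('form', 'subsidies'): 0,
          ('farm', 'subsidy'): 0, ('farm', 'subsidies'): 4,
          ('subsidy', 'for'): 2, ('subsidy', 'far'): 0,
          ('subsidies', 'for'): 6, ('subsidies', 'far'): 0}
total_bigrams = 1.7e6

def bigram_score(words):
    # P_hat(W) = product over i of P(w_i w_{i+1}), each estimated as count / total
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= bigram.get(pair, 0) / total_bigrams
    return p

lattice = [('form', 'farm'), ('subsidy', 'subsidies'), ('for', 'far')]
print(max(product(*lattice), key=bigram_score))
# ('farm', 'subsidies', 'for'): the intended reading, since every other path hits a zero count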
How can we estimate P(W) correctly?
Problem: the Naïve Bayes model for bigrams violates independence assumptions. Let's do this right.
Let W = w_1 w_2 w_3 ... w_n. Then, by the chain rule,
P(W) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1 w_2) * ... * P(w_n | w_1 ... w_{n-1})
We can estimate P(w_2 | w_1) by the Maximum Likelihood Estimator
Count(w_1 w_2) / Count(w_1)
and P(w_3 | w_1 w_2) by
Count(w_1 w_2 w_3) / Count(w_1 w_2)
and so on.
...and finally, estimating P(w_n | w_1 w_2 ... w_{n-1})
Again, we can estimate P(w_n | w_1 w_2 ... w_{n-1}) with the MLE
Count(w_1 w_2 ... w_n) / Count(w_1 w_2 ... w_{n-1})
So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot
Count("Heat up the oil in a large pot") / Count("Heat up the oil in a large")
UNLESS OUR CORPUS IS REALLY HUGE, BOTH COUNTS WILL BE 0, yielding 0/0.
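To make the failure mode concrete, here is a small illustrative sketch (not from the slides) of the full-history MLE as a ratio of n-gram counts; on a toy corpus both counts for a seven-word history are zero, so the estimate is 0/0. mle_conditional is an illustrative name.

def mle_conditional(tokens, history, word):
    # Count(history + word) / Count(history), both counted as subsequences of tokens
    def count_ngram(ngram):
        n = len(ngram)
        return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram)
    numerator = count_ngram(history + [word])
    denominator = count_ngram(history)
    return numerator / denominator if denominator else float('nan')   # 0/0 is undefined

corpus = "the mouse ran up the clock the spider ran up the waterspout".split()
print(mle_conditional(corpus, "heat up the oil in a large".split(), "pot"))   # nan, i.e. 0/0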
The Web is HUGE!! (2016 version)
[Figure: web search result counts for the two phrases, giving an estimate of roughly 48.9/403 ≈ 0.121]
But what if we only have 100 million words for our estimates??
A BOTEC (back-of-the-envelope calculation) Estimate of What We Can Estimate
What parameters can we estimate with 100 million words of training data?
Assuming (for now) a uniform distribution over a vocabulary of only 5,000 words, there are 5,000^2 = 2.5 × 10^7 possible bigrams but 5,000^3 = 1.25 × 10^11 possible trigrams.
So even with 10^8 words of data, already for trigrams we encounter the sparse data problem.
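A quick back-of-the-envelope check (my arithmetic, not a table from the slides) of how many parameters each n-gram order implies for a 5,000-word vocabulary versus 10^8 training tokens:

vocab_size = 5_000
training_tokens = 100_000_000   # 10^8 words of training data
for n in (1, 2, 3):
    parameters = vocab_size ** n                       # distinct n-grams to estimate
    print(f"{n}-grams: {parameters:.3g} parameters, "
          f"{training_tokens / parameters:.3g} training tokens per parameter")
# 3-grams: 1.25e+11 parameters, 0.0008 tokens per parameter: hopelessly sparse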
Review: How can we estimate P(W) correctly?
Problem: the Naïve Bayes model for bigrams violates independence assumptions. Let's do this right.
Let W = w_1 w_2 w_3 ... w_n. Then, by the chain rule,
P(W) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1 w_2) * ... * P(w_n | w_1 ... w_{n-1})
We can estimate P(w_2 | w_1) by the Maximum Likelihood Estimator Count(w_1 w_2) / Count(w_1), and P(w_3 | w_1 w_2) by Count(w_1 w_2 w_3) / Count(w_1 w_2), and so on.
The Markov Assumption: Only the Immediate Past Matters
The Markov Assumption: Estimation
We estimate the probability of each w_i given the previous context by
P(w_i | w_1 w_2 ... w_{i-1}) = P(w_i | w_{i-1})
which can be estimated by
Count(w_{i-1} w_i) / Count(w_{i-1})
So we're back to counting only unigrams and bigrams!! AND we have a correct, practical estimation method for P(W) given the Markov assumption!
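Putting the last few slides together, here is a minimal sketch (my own, not the course's code) of the bigram MLE and of P(W) under the Markov assumption; train_bigram_mle and sentence_probability are illustrative names.

from collections import Counter

def train_bigram_mle(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_probability(words, unigrams, bigrams, total):
    p = unigrams[words[0]] / total                 # P(w_1), unigram MLE
    for prev, cur in zip(words, words[1:]):
        # P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1})
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

tokens = "the mouse ran up the clock the spider ran up the waterspout".split()
unigrams, bigrams = train_bigram_mle(tokens)
print(sentence_probability("the mouse ran up the clock".split(), unigrams, bigrams, len(tokens)))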
Markov Models
Review (and crucial for upcoming homework): Cumulative Distribution Functions (CDFs)
The CDF of a random variable X is denoted by F_X(x) and is defined by F_X(x) = Pr(X ≤ x).
F is monotonic nondecreasing: for all x ≤ y, F_X(x) ≤ F_X(y).
If X is a discrete random variable that attains values x_1, x_2, ..., x_n with probabilities p(x_1), p(x_2), ..., then
F_X(x_i) = Σ_{j ≤ i} p(x_j)
CDF for a very small English corpus
Corpus: "the mouse ran up the clock. The spider ran up the waterspout."
P(the) = 4/12, P(ran) = P(up) = 2/12, P(mouse) = P(clock) = P(spider) = P(waterspout) = 1/12
Arbitrarily fix an order: w_1 = the, w_2 = ran, w_3 = up, w_4 = mouse, ...
Then F(the) = 4/12, F(ran) = 6/12, F(up) = 8/12, F(mouse) = 9/12, ...
[Figure: bar chart of the CDF, with the, ran, up, mouse, clock, spider, waterspout on the x-axis and cumulative probability rising from 1/12 to 1 on the y-axis]
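A short sketch (mine, not the course's starter code) that builds exactly this CDF from the tiny corpus, ordering words by decreasing frequency as on the slide:

from collections import Counter
from itertools import accumulate

corpus = "the mouse ran up the clock the spider ran up the waterspout".split()
counts = Counter(corpus)
total = len(corpus)

vocab = sorted(counts, key=counts.get, reverse=True)    # the, ran, up, mouse, clock, spider, waterspout
cdf = dict(zip(vocab, accumulate(counts[w] / total for w in vocab)))
print(cdf)                                              # F(the)=4/12, F(ran)=6/12, F(up)=8/12, ...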
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given unigram estimates:
Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
For each word position i, 1 ≤ i ≤ n:
  Choose a random value r_i between 0 and 1.
  Choose w_i = the first v_j such that F(v_j) ≥ r_i, i.e. the first v_j such that Σ_{m=1}^{j} P(v_m) ≥ r_i.
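A hedged sketch of the unigram version of the method, assuming the tiny corpus above: invert the CDF with one uniform draw per position. generate_unigram is an illustrative name.

import random
from collections import Counter
from itertools import accumulate

corpus = "the mouse ran up the clock the spider ran up the waterspout".split()
counts = Counter(corpus)
vocab = sorted(counts, key=counts.get, reverse=True)
cdf = dict(zip(vocab, accumulate(counts[w] / len(corpus) for w in vocab)))
cdf[vocab[-1]] = 1.0   # guard against floating-point rounding in the last entry

def generate_unigram(n):
    words = []
    for _ in range(n):
        r = random.random()                                   # r_i, uniform between 0 and 1
        words.append(next(v for v in vocab if cdf[v] >= r))   # first v_j with F(v_j) >= r_i
    return words

print(" ".join(generate_unigram(8)))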
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):
Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
Use the unigram method to generate an initial word w_1.
For each remaining position i, 2 ≤ i ≤ n:
  Choose a random value r_i between 0 and 1.
  Choose w_i = the first v_j such that Σ_{m=1}^{j} P(v_m | w_{i-1}) ≥ r_i.
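And the 1st-order Markov (bigram) version, as a sketch under the same assumptions: for each position, walk the vocabulary accumulating P(v_m | w_{i-1}) until the running sum reaches r_i. next_word and generate_bigram are illustrative names.

import random
from collections import Counter

corpus = "the mouse ran up the clock the spider ran up the waterspout".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab = sorted(unigram_counts, key=unigram_counts.get, reverse=True)

def next_word(prev):
    r, running = random.random(), 0.0
    for v in vocab:
        running += bigram_counts[(prev, v)] / unigram_counts[prev]   # P(v | prev), bigram MLE
        if running >= r:
            return v
    return vocab[0]   # fallback for rounding error or dead-end words with no observed successor

def generate_bigram(n, first_word):
    words = [first_word]
    while len(words) < n:
        words.append(next_word(words[-1]))
    return words

print(" ".join(generate_bigram(8, "the")))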
The Shannon/Miller/Selfridge method trained on Shakespeare (This and next two slides from Jurafsky)
Wall Street Journal just isn't Shakespeare
Shakespeare as corpus: N = 884,647 tokens, V = 29,066 types.
Shakespeare produced 300,000 bigram types out of V^2 ≈ 844 million possible bigrams. So 99.96% of the possible bigrams were never seen (have zero entries in the table).
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
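A quick arithmetic check (mine) of the 99.96% figure quoted above:

V = 29_066
possible_bigrams = V ** 2                         # 844,832,356, roughly 845 million
seen_bigram_types = 300_000
print(1 - seen_bigram_types / possible_bigrams)   # about 0.9996, i.e. 99.96% never seen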
The Sparse Data Problem Again So we smooth. How likely is a 0 count? Much more likely than I let on!!!
English word frequencies are well described by Zipf's Law
Zipf (1949) characterized the relation between word frequency and rank as
f × r = C (for a constant C), i.e. f = C/r, so log(f) = log(C) - log(r)
Purely Zipfian data plots as a straight line on a log-log scale.
*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
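A small sketch (not from the slides) of how one might check this empirically: rank words by frequency and fit log(f) = log(C) - a * log(r); an exponent a near 1 indicates roughly Zipfian data. zipf_fit is an illustrative name, and the fit is only meaningful on a sizeable corpus such as Brown.

import math
from collections import Counter

def zipf_fit(tokens):
    # least-squares slope of log(f) against log(r); pure Zipf predicts an exponent of 1
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# e.g. zipf_fit(brown_tokens), where brown_tokens is any large token list you have loaded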
Word frequency & rank in Brown Corpus vs Zipf Lots of area under the tail of this curve! From: Interactive mathematics http://www.intmath.com
Zipf's law for the Brown corpus
Exploiting Zipf to do Language ID

#The following filters out arabic words that are also frequent in Spanish and English...
arabic_top_12 = ['7ata', 'ana', 'ma', 'w', 'bs', 'fe', 'b3d', '3adou', 'mn', 'kan', 'men', 'ahmed']

#The following filters out urdu words common in English
urdu_top_17 = ['hai', 'ko', 'ki', 'main', 'na', 'se', 'ho', 'bhi', 'mein', 'ka', 'tum', 'nahi', 'meri', 'jo', 'wo', 'dil', 'hain']

spanish_top_16 = ['de', 'la', 'que', 'el', 'en', 'y', 'es', 'un', 'los', 'por', 'se', 'para', 'con']

english_top_20 = ['the', 'to', 'of', 'in', 'i', 'a', 'is', 'and', 'you', 'for', 'on', 'it', 'that', 'are', 'with', 'am', 'my', 'be', 'at', 'not', 'we']
All the code you need.

#TO GET BEST LANGUAGE AS STRING: lid_pick_best(lid_process_tweet(tweet))
import collections
import re

# Assumes: languages = ['english', 'arabic', 'urdu', 'spanish'] and
# topwords = a dict mapping each language name to its word list from the previous slide.
counts = collections.Counter()

def lid_process_tweet(tweet):
    counts.clear()
    # force to ASCII, lowercase, and split on whitespace and trailing punctuation
    text = tweet.encode('ascii', 'replace').decode('ascii').strip().lower()
    for word in re.split(r'[\.?!,]*\s+', text):
        if not re.match(r'http://\S+', word):     # skip URLs
            for lang in languages:                # 'english', 'arabic', ...
                if word in topwords[lang]:
                    counts[lang] += 1
    return counts.most_common()

def lid_pick_best(count_list):
    if count_list:
        return count_list[0][0]
    else:
        return 'UNKNOWN'