Introduction to Markov Models: estimating the probability of word sequences (phrases, sentences, etc.)


But first: A few preliminaries on text preprocessing

What counts as a word? A tricky question.

How do we find sentences? Also a tricky question.

Q1: How do we estimate the probability of a given sentence W? This is a crucial step in speech recognition (and lots of other applications).

First guess: "bag of words". Given the word lattice

  form   subsidy     for
  farm   subsidies   far

estimate

  \hat{P}(W) = \prod_{w \in W} P(w)

Unigram counts (in 1.7 × 10^6 words of AP text):

  form       183     farm        74
  subsidy     15     subsidies   55
  for      18185     far        570

The most likely word string given \hat{P}(W) isn't quite right.
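
A minimal sketch of the bag-of-words scoring above, using the unigram counts from this slide; the variable names (TOTAL, lattice, unigram_score) are illustrative, not from the slides.

from itertools import product

TOTAL = 1_700_000  # corpus size from the slide (1.7 * 10^6 AP words)
unigram = {'form': 183, 'subsidy': 15, 'for': 18185,
           'farm': 74, 'subsidies': 55, 'far': 570}

# Word lattice: one tuple of alternatives per position.
lattice = [('form', 'farm'), ('subsidy', 'subsidies'), ('for', 'far')]

def unigram_score(words):
    # \hat{P}(W) = product of unigram relative frequencies
    p = 1.0
    for w in words:
        p *= unigram[w] / TOTAL
    return p

best = max(product(*lattice), key=unigram_score)
print(best)  # ('form', 'subsidies', 'for') -- each column's most frequent word, context ignored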

Predicting a word sequence II

Next guess: products of bigrams. For W = w_1 w_2 w_3 ... w_n, estimate

  \hat{P}(W) = \prod_{i=1}^{n-1} P(w_i\, w_{i+1})

Given the word lattice:

  form   subsidy     for
  farm   subsidies   far

Bigram counts (in 1.7 × 10^6 words of AP text):

  form subsidy     0     subsidy for      2
  form subsidies   0     subsidy far      0
  farm subsidy     0     subsidies for    6
  farm subsidies   4     subsidies far    0

Much better (if not quite right). (Q: the counts are tiny! Why?)
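
A companion sketch for the bigram-product estimate, using the bigram counts from this slide; it reuses TOTAL and lattice from the unigram sketch above, and the names are again illustrative.

from itertools import product

bigram = {('form', 'subsidy'): 0, ('form', 'subsidies'): 0,
          ('farm', 'subsidy'): 0, ('farm', 'subsidies'): 4,
          ('subsidy', 'for'): 2, ('subsidy', 'far'): 0,
          ('subsidies', 'for'): 6, ('subsidies', 'far'): 0}

def bigram_score(words):
    # \hat{P}(W) = product of bigram relative frequencies
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= bigram[pair] / TOTAL
    return p

best = max(product(*lattice), key=bigram_score)
print(best)  # ('farm', 'subsidies', 'for') -- the only path whose bigrams all have nonzero counts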

How can we estimate P(W) correctly?

Problem: the naïve Bayes model for bigrams violates its independence assumptions. Let's do this right.

Let W = w_1 w_2 w_3 ... w_n. Then, by the chain rule,

  P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})

We can estimate P(w_2 | w_1) by the maximum likelihood estimator

  \frac{Count(w_1 w_2)}{Count(w_1)}

and P(w_3 | w_1 w_2) by

  \frac{Count(w_1 w_2 w_3)}{Count(w_1 w_2)}

and so on.
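
For concreteness, here is the decomposition written out for one three-word string from the lattice example (a worked illustration, not from the slides); N is the number of tokens in the corpus:

\begin{align*}
P(\text{farm subsidies for})
  &= P(\text{farm})\; P(\text{subsidies} \mid \text{farm})\; P(\text{for} \mid \text{farm subsidies}) \\
  &\approx \frac{\mathrm{Count}(\text{farm})}{N}\cdot
           \frac{\mathrm{Count}(\text{farm subsidies})}{\mathrm{Count}(\text{farm})}\cdot
           \frac{\mathrm{Count}(\text{farm subsidies for})}{\mathrm{Count}(\text{farm subsidies})}
\end{align*}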

... and finally, estimating P(w_n | w_1 w_2 ... w_{n-1})

Again, we can estimate P(w_n | w_1 w_2 ... w_{n-1}) with the MLE

  \frac{Count(w_1 w_2 \ldots w_n)}{Count(w_1 w_2 \ldots w_{n-1})}

So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot:

  \frac{Count(\text{"Heat up the oil in a large pot"})}{Count(\text{"Heat up the oil in a large"})}

UNLESS OUR CORPUS IS REALLY HUGE, BOTH COUNTS WILL BE 0, yielding 0/0.

The Web is HUGE!! (2016 version) [Figure: web search hit counts for the phrases above; 48.9/403 ≈ 0.121.]

But what if we only have 100 million words for our estimates?

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with 100 million words of training data? Assume (for now) a uniform distribution over a vocabulary of only 5,000 words. So even with 10^8 words of data, for even trigrams we encounter the sparse data problem.
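
The slide's table is not reproduced in the transcription; here is the back-of-the-envelope arithmetic it points at, under the stated 5,000-word vocabulary assumption:

\begin{align*}
\text{unigrams:} \quad & 5{,}000 \\
\text{bigrams:}  \quad & 5{,}000^2 = 2.5 \times 10^{7} \\
\text{trigrams:} \quad & 5{,}000^3 = 1.25 \times 10^{11} \gg 10^{8} \text{ training tokens}
\end{align*}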

Review: How can we estimate P(W) correctly?

Problem: the naïve Bayes model for bigrams violates its independence assumptions. Let's do this right. Let W = w_1 w_2 w_3 ... w_n. Then, by the chain rule,

  P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})

We can estimate P(w_2 | w_1) by the maximum likelihood estimator Count(w_1 w_2) / Count(w_1), P(w_3 | w_1 w_2) by Count(w_1 w_2 w_3) / Count(w_1 w_2), and so on.

The Markov Assumption: Only the Immediate Past Matters

The Markov Assumption: Estimation

We estimate the probability of each w_i given the previous context by

  P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})

which can be estimated by

  \frac{Count(w_{i-1} w_i)}{Count(w_{i-1})}

So we're back to counting only unigrams and bigrams!! AND we have a correct, practical estimation method for P(W) given the Markov assumption!
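
A minimal sketch of this bigram (first-order Markov) estimate of P(W), using the tiny nursery-rhyme corpus that appears later in these slides; the function and variable names are illustrative.

from collections import Counter

corpus = "the mouse ran up the clock the spider ran up the waterspout".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_markov(sentence):
    # P(W) ~ P(w_1) * prod_i Count(w_{i-1} w_i) / Count(w_{i-1})
    words = sentence.split()
    p = unigram_counts[words[0]] / len(corpus)
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

print(p_markov("the mouse ran up the waterspout"))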

Markov Models

Review (and crucial for the upcoming homework): Cumulative Distribution Functions (CDFs)

The CDF of a random variable X is denoted by F_X(x) and is defined by

  F_X(x) = \Pr(X \le x)

F is monotonic nondecreasing: x \le y \Rightarrow F_X(x) \le F_X(y).

If X is a discrete random variable that attains values x_1, x_2, ..., x_n with probabilities p(x_1), p(x_2), ..., then

  F_X(x_i) = \sum_{j \le i} p(x_j)

CDF for a very small English corpus

Corpus: "the mouse ran up the clock. The spider ran up the waterspout."

P(the) = 4/12, P(ran) = P(up) = 2/12, P(mouse) = P(clock) = P(spider) = P(waterspout) = 1/12.

Arbitrarily fix an order: w_1 = the, w_2 = ran, w_3 = up, w_4 = mouse, ...

Then F(the) = 4/12, F(ran) = 6/12, F(up) = 8/12, F(mouse) = 9/12, ...

[Figure: bar chart of this CDF over the ordered vocabulary the, ran, up, mouse, clock, spider, waterspout.]
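
A minimal sketch that computes these unigram probabilities and the CDF in the slide's fixed order; names are illustrative.

from collections import Counter
from fractions import Fraction

tokens = "the mouse ran up the clock the spider ran up the waterspout".split()
counts = Counter(tokens)
order = ["the", "ran", "up", "mouse", "clock", "spider", "waterspout"]

N = len(tokens)
cdf, running = {}, Fraction(0)
for w in order:
    running += Fraction(counts[w], N)   # add p(w) to the running total
    cdf[w] = running

print(cdf["the"], cdf["ran"], cdf["up"], cdf["mouse"])  # 1/3 1/2 2/3 3/4 (= 4/12, 6/12, 8/12, 9/12)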

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

To generate a sequence of n words given unigram estimates:
- Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
- For each word position i, 1 \le i \le n:
  - Choose a random value r_i between 0 and 1.
  - Choose w_i = the first v_j such that F_V(v_j) \ge r_i, i.e. the first v_j such that \sum_{m=1}^{j} P(v_m) \ge r_i.
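
A minimal sketch of this unigram sampler, reusing the illustrative order and cdf built in the CDF sketch above.

import random

def sample_unigram(n):
    words = []
    for _ in range(n):
        r = random.random()          # r_i in [0, 1)
        for v in order:              # first v_j with F(v_j) >= r_i
            if cdf[v] >= r:
                words.append(v)
                break
    return " ".join(words)

print(sample_unigram(8))  # e.g. "the up the ran waterspout the up clock"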

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method (continued)

To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):
- Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
- Use the unigram method to generate an initial word w_1.
- For each remaining position i, 2 \le i \le n:
  - Choose a random value r_i between 0 and 1.
  - Choose w_i = the first v_j such that \sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i.
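
A companion sketch of the first-order Markov sampler. It assumes conditional probabilities P(v | prev) built from the bigram_counts and unigram_counts of the earlier sketch; all names are illustrative.

import random

bigram_prob = {prev: {v: bigram_counts[(prev, v)] / unigram_counts[prev] for v in order}
               for prev in order}

def sample_markov(n):
    words = [sample_unigram(1)]      # w_1 from the unigram CDF
    for _ in range(2, n + 1):
        r, cumulative = random.random(), 0.0
        for v in order:              # first v_j whose conditional CDF >= r_i
            cumulative += bigram_prob[words[-1]][v]
            if cumulative >= r:
                words.append(v)
                break
        else:                        # dead end (a word seen only corpus-finally): stop early
            break
    return " ".join(words)

print(sample_markov(8))  # e.g. "the spider ran up the clock the mouse"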

The Shannon/Miller/Selfridge method trained on Shakespeare (this and the next two slides are from Jurafsky).

Wall Street Journal just isn't Shakespeare

Shakespeare as corpus

N = 884,647 tokens, V = 29,066 types.

Shakespeare produced 300,000 bigram types out of V^2 ≈ 844 million possible bigrams. So 99.96% of the possible bigrams were never seen (have zero entries in the table).

Quadrigrams are worse: what comes out looks like Shakespeare because it is Shakespeare.
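
As a quick check of the slide's figure:

\[
V^2 = 29{,}066^2 \approx 8.45 \times 10^{8}, \qquad
\frac{3 \times 10^{5}}{8.45 \times 10^{8}} \approx 0.00036,
\]

so roughly 99.96\% of the possible bigrams never occur.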

The Sparse Data Problem Again

So we smooth. How likely is a zero count? Much more likely than I let on!!!

English word frequencies are well described by Zipf's Law

Zipf (1949) characterized the relation between word frequency and rank as

  f \cdot r = C   (for a constant C),  i.e.  f = C / r,  so  \log(r) = \log(C) - \log(f)

Purely Zipfian data plots as a straight line on a log-log scale.

*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
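
A minimal sketch of a Zipf check: compute rank-frequency pairs and the products f * r, which should stay roughly constant for Zipfian data (the file name and naive tokenization are assumptions).

import re
from collections import Counter

with open("corpus.txt") as f:                    # hypothetical plain-text corpus
    words = re.findall(r"[a-z]+", f.read().lower())

freqs = sorted(Counter(words).values(), reverse=True)
for rank, freq in enumerate(freqs[:20], start=1):
    print(rank, freq, freq * rank)               # f * r should hover near a constant C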

Word frequency & rank in the Brown Corpus vs. Zipf [Figure: rank-frequency curve; lots of area under the tail of this curve! From: Interactive Mathematics, http://www.intmath.com]

Zipf's law for the Brown corpus

Exploiting Zipf to do Language ID

# The following filters out Arabic words that are also frequent in Spanish and English...
arabic_top_12 = ['7ata', 'ana', 'ma', 'w', 'bs', 'fe', 'b3d', '3adou', 'mn', 'kan', 'men', 'ahmed']

# The following filters out Urdu words common in English
urdu_top_17 = ['hai', 'ko', 'ki', 'main', 'na', 'se', 'ho', 'bhi', 'mein', 'ka', 'tum', 'nahi', 'meri', 'jo', 'wo', 'dil', 'hain']

spanish_top_16 = ['de', 'la', 'que', 'el', 'en', 'y', 'es', 'un', 'los', 'por', 'se', 'para', 'con']

english_top_20 = ['the', 'to', 'of', 'in', 'i', 'a', 'is', 'and', 'you', 'for', 'on', 'it', 'that', 'are', 'with', 'am', 'my', 'be', 'at', 'not', 'we']

All the code you need.

# TO GET BEST LANGUAGE AS STRING: lid_pick_best(lid_process_tweet(tweet))
import collections
import re

counts = collections.Counter()

def lid_process_tweet(tweet):
    counts.clear()
    text = tweet.encode('ascii', 'replace').decode('ascii').strip().lower()
    for word in re.split(r'[\.?!,]*\s+', text):
        if not re.match(r'http://\S+', word):   # skip URLs
            for lang in languages:              # e.g. ('english', 'arabic', ...)
                if word in topwords[lang]:      # dict of word lists indexed by language
                    counts[lang] += 1
    return counts.most_common()

def lid_pick_best(count_list):
    if count_list:
        return count_list[0][0]
    else:
        return 'UNKNOWN'
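
The code above assumes two globals, topwords and languages, that are not shown in the transcription; here is one way to wire them up from the previous slide's word lists, plus a usage example (the sample tweet is made up).

topwords = {
    'arabic': set(arabic_top_12),
    'urdu': set(urdu_top_17),
    'spanish': set(spanish_top_16),
    'english': set(english_top_20),
}
languages = list(topwords)

print(lid_pick_best(lid_process_tweet("que es la vida para el que no la vive")))  # 'spanish'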