Introduction to Markov Models

But first: a few preliminaries. Estimating the probability of phrases of words, sentences, etc.

What counts as a word? A tricky question. How do we find sentence boundaries?

Q1: How do we estimate the probability of a given sentence W?
A crucial step in speech recognition (and lots of other applications).

First guess: products of unigrams

  \hat{P}(W) \approx \prod_{w \in W} P(w)

Given the word lattice:

  form   subsidy    for
  farm   subsidies  far

Unigram counts (in 1.7 * 10^6 words of AP text):

  form  183    subsidy    15    for  18185
  farm   74    subsidies  55    far    570

Not quite right...

Predicting a word sequence II

Next guess: products of bigrams. For W = w_1 w_2 w_3 ... w_n,

  \hat{P}(W) \approx \prod_{i=1}^{n-1} P(w_i w_{i+1})

Given the same word lattice, the bigram counts (in 1.7 * 10^6 words of AP text) are:

  form subsidy    0    subsidy for    2
  form subsidies  0    subsidy far    0
  farm subsidy    0    subsidies for  6
  farm subsidies  4    subsidies far  0

Better (if not quite right). But the counts are tiny! Why?
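To make the two guesses concrete, here is a minimal Python sketch (mine, not part of the lecture) that scores lattice paths with the unigram product and the bigram product, using the AP counts from the tables above. Using N = 1.7 * 10^6 as the denominator for both estimates, and all names and the choice of paths, are assumptions for illustration.

# Minimal sketch (not from the slides): score word-lattice paths with the
# unigram-product and bigram-product guesses, using the AP counts shown above.
# Treating N = 1.7e6 as the denominator for both estimates is an assumption.

UNIGRAM_COUNTS = {"form": 183, "subsidy": 15, "for": 18185,
                  "farm": 74, "subsidies": 55, "far": 570}
BIGRAM_COUNTS = {("form", "subsidy"): 0, ("subsidy", "for"): 2,
                 ("form", "subsidies"): 0, ("subsidy", "far"): 0,
                 ("farm", "subsidy"): 0, ("subsidies", "for"): 6,
                 ("farm", "subsidies"): 4, ("subsidies", "far"): 0}
N = 1_700_000  # total words of AP text

def unigram_score(words):
    """First guess: product of unigram probabilities P(w) = Count(w) / N."""
    p = 1.0
    for w in words:
        p *= UNIGRAM_COUNTS.get(w, 0) / N
    return p

def bigram_score(words):
    """Next guess: product of bigram probabilities P(w_i w_{i+1}) = Count(w_i w_{i+1}) / N."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= BIGRAM_COUNTS.get((w1, w2), 0) / N
    return p

for path in (["form", "subsidy", "for"], ["farm", "subsidies", "for"]):
    print(" ".join(path), unigram_score(path), bigram_score(path))

The bigram product already prefers "farm subsidies for" (the only path whose bigrams were all observed), but with counts this small the estimates are fragile, which is exactly the problem the next slides take up.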

How can we estimate P correctly?

Problem: the Naïve Bayes model for bigrams violates independence assumptions. Let's do this right.

Let W = w_1 w_2 w_3 ... w_n. Then, by the chain rule,

  P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \ldots w_{n-1})

We can estimate P(w_2 \mid w_1) by the Maximum Likelihood Estimator

  \frac{\mathrm{Count}(w_1 w_2)}{\mathrm{Count}(w_1)}

and P(w_3 \mid w_1 w_2) by

  \frac{\mathrm{Count}(w_1 w_2 w_3)}{\mathrm{Count}(w_1 w_2)}

and so on.

Estimating P(w_n | w_1 w_2 ... w_{n-1})

And finally, we can estimate P(w_n \mid w_1 w_2 \ldots w_{n-1}) with the MLE

  \frac{\mathrm{Count}(w_1 w_2 \ldots w_n)}{\mathrm{Count}(w_1 w_2 \ldots w_{n-1})}

So to decide "pat" vs. "pot" in "Heat up the oil in a large p?t", compute for "pot":

  Count("Heat up the oil in a large pot") = 0
  Count("Heat up the oil in a large")     = 0

Hmm... the Web Changes Things (2008 or so)

Statistics and the Web II: even the web in 2008 yields low counts!
So P(pot | heat up the oil in a large) = 8/49, about 0.16.

But the web has grown! The same counts now give 165/891 = 0.185.
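The chain-rule MLE above is just a ratio of n-gram counts. The sketch below (mine; the toy corpus and the helper names ngram_counts / mle_conditional are invented for illustration) shows the computation, including the zero-denominator case the slide runs into for long histories.

# Minimal sketch of the chain-rule MLE: P(w | history) ~= Count(history + w) / Count(history).
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram up to length max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def mle_conditional(counts, history, word):
    """MLE of P(word | history); None when the history itself was never seen (the 0/0 case)."""
    denom = counts[tuple(history)]
    if denom == 0:
        return None
    return counts[tuple(history) + (word,)] / denom

corpus = "heat up the oil in a large pot and heat up the pan".split()
counts = ngram_counts(corpus, max_n=8)
print(mle_conditional(counts, "heat up the oil in a large".split(), "pot"))  # long history, seen once
print(mle_conditional(counts, "never seen this history".split(), "pot"))     # unseen history -> None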

So... a larger corpus won't help much unless it's HUGE... but the web is!!!

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with 100 million words of training data? Assume (for now) a uniform distribution over a vocabulary of only 5000 words. But what if we only have 100 million words for our estimates? Even with 10^8 words of data, trigrams already hit the sparse data problem: a 5000-word vocabulary gives 5000^3 = 1.25 * 10^11 possible trigrams, far more than the number of training tokens.

The Markov Assumption: Only the Immediate Past Matters

The Markov Assumption: Estimation

We estimate the probability of each w_i given the previous context by

  P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})

which can be estimated by

  \frac{\mathrm{Count}(w_{i-1} w_i)}{\mathrm{Count}(w_{i-1})}

So we're back to counting only unigrams and bigrams, AND we have a correct, practical estimation method for P(W) given the Markov assumption!

Markov Models

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

To generate a sequence of n words given unigram estimates:
- Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
- For each word w_i, 1 <= i <= n:
  - Choose a random value r_i between 0 and 1.
  - Let w_i be the first v_j such that \sum_{m=1}^{j} P(v_m) \ge r_i.
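The unigram generation procedure above is easy to run. Here is a minimal sketch (mine, with an invented toy distribution) of the "first v_j whose cumulative probability reaches r" rule.

# Minimal sketch of the Shannon/Miller/Selfridge unigram generator described above.
import random

def generate_unigram(probs, n):
    """probs maps each vocabulary word v to P(v); the values should sum to ~1."""
    vocab = sorted(probs)                    # fix some ordering v_1 ... v_k
    words = []
    for _ in range(n):
        r = random.random()                  # r_i in [0, 1)
        cumulative = 0.0
        for v in vocab:
            cumulative += probs[v]
            if cumulative >= r:              # first v_j with sum_{m<=j} P(v_m) >= r_i
                words.append(v)
                break
        else:
            words.append(vocab[-1])          # guard against floating-point rounding
    return words

# Toy distribution, invented for illustration only.
print(" ".join(generate_unigram({"the": 0.5, "farm": 0.2, "subsidies": 0.2, "far": 0.1}, n=10)))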

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):
- Fix some ordering of the vocabulary v_1 v_2 v_3 ... v_k.
- Use the unigram method to generate an initial word w_1.
- For each remaining w_i, 2 <= i <= n:
  - Choose a random value r_i between 0 and 1.
  - Let w_i be the first v_j such that \sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i.

(A short code sketch of this first-order variant appears below, after the Zipf slides.)

The Shannon/Miller/Selfridge method trained on Shakespeare
(This and the next two slides are from Jurafsky.)

Wall Street Journal just isn't Shakespeare

Shakespeare as corpus: N = 884,647 tokens, V = 29,066 types.
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen (have zero entries in the table).
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.

The Sparse Data Problem Again

English word frequencies are well described by Zipf's Law. Zipf (1949) characterized the relation between word frequency f and rank r as

  f \cdot r \approx C  (for some constant C)

equivalently r \approx C / f, so \log(r) \approx \log(C) - \log(f).

Purely Zipfian data plots as a straight line on a log-log scale.
How likely is a 0 count? Much more likely than I let on!

*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
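As promised above, here is a minimal sketch (mine) of the 1st-order Markov generator: w_1 is drawn from the unigram distribution, each later w_i from P(. | w_{i-1}). The toy unigram and bigram tables are invented for illustration.

# Minimal sketch of the 1st-order Markov (bigram) generator described above.
import random

def sample_from(probs):
    """Emit the first word, in a fixed vocabulary order, whose cumulative probability reaches r."""
    r = random.random()
    vocab = sorted(probs)
    cumulative = 0.0
    for v in vocab:
        cumulative += probs[v]
        if cumulative >= r:
            return v
    return vocab[-1]                         # guard against floating-point rounding

def generate_bigram(unigram_probs, bigram_probs, n):
    """bigram_probs[w] is a dict giving P(v | w) for each v that can follow w."""
    words = [sample_from(unigram_probs)]     # initial word from the unigram model
    for _ in range(1, n):
        words.append(sample_from(bigram_probs[words[-1]]))
    return words

# Toy tables, invented for illustration only.
unigrams = {"farm": 0.5, "subsidies": 0.3, "for": 0.2}
bigrams = {"farm": {"subsidies": 1.0},
           "subsidies": {"for": 0.7, "farm": 0.3},
           "for": {"farm": 1.0}}
print(" ".join(generate_bigram(unigrams, bigrams, n=8)))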

Word frequency & rank in the Brown Corpus vs. Zipf
Zipf's law for the Brown corpus

[Figures: Brown-corpus word frequency vs. rank compared with Zipf's law, from Interactive Mathematics, http://www.intmath.com. Lots of area under the tail of this curve!]

Smoothing

At least one unknown word is likely per sentence, given Zipf! To fix the zeros this causes, we can smooth the data:
- Assume we know how many types never occur in the data.
- Steal probability mass from types that occur at least once.
- Distribute this probability mass over the types that never occur.

"This black art is why NLP is taught in the engineering school." (Jason Eisner)

Smoothing is like Robin Hood: it steals from the rich and gives to the poor.

Review: Add-One Smoothing

Estimate probabilities \hat{P} by assuming every possible word type v \in V actually occurred one extra time (as if by appending an unabridged dictionary). So if there were N words in our corpus, then instead of estimating

  \hat{P}(w) = \frac{\mathrm{Count}(w)}{N}

we estimate

  \hat{P}(w) = \frac{\mathrm{Count}(w) + 1}{N + |V|}
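A minimal sketch of the add-one estimate above; the toy corpus, the extra vocabulary word, and the function name are invented for illustration.

# Minimal sketch of add-one smoothing for unigrams: every word type in the
# vocabulary gets one extra pretend occurrence.
from collections import Counter

def add_one_unigram(tokens, vocab):
    """Return smoothed P(w) = (Count(w) + 1) / (N + |V|) for every w in vocab."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}

corpus = "the farm subsidies for the farm".split()
vocab = set(corpus) | {"pot"}                 # "pot" never occurs in the corpus
probs = add_one_unigram(corpus, vocab)
print(probs["pot"])                           # unseen word now gets nonzero probability
print(sum(probs.values()))                    # and the distribution still sums to 1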

Add-One Smoothing (again)

Pro:
- Very simple technique.

Cons:
- The probability of frequent n-grams is underestimated.
- The probability of rare (or unseen) n-grams is overestimated.
- Therefore, too much probability mass is shifted towards unseen n-grams.
- All unseen n-grams are smoothed in the same way.

Using a smaller added count improves things, but only somewhat. More advanced techniques (Kneser-Ney, Witten-Bell) use properties of the component (n-1)-grams and the like... (Hint for this homework.)
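The "smaller added count" mentioned above is usually called add-k (or Lidstone) smoothing. This sketch (mine, reusing the same invented toy data as before) just replaces the +1 with +k.

# Minimal sketch of add-k (Lidstone) smoothing; k = 1 recovers add-one.
from collections import Counter

def add_k_unigram(tokens, vocab, k):
    """Return smoothed P(w) = (Count(w) + k) / (N + k * |V|) for every w in vocab."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (counts[w] + k) / (n + k * len(vocab)) for w in vocab}

corpus = "the farm subsidies for the farm".split()
vocab = set(corpus) | {"pot"}
for k in (1.0, 0.1, 0.01):
    print(k, add_k_unigram(corpus, vocab, k)["pot"])   # less mass leaks to the unseen word as k shrinks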