Lecture 4: n-grams in NLP. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han



Objectives (2/2/2017)
  Frequent n-grams in English
  n-grams and statistical NLP
  n-grams and conditional probability
  Large n-gram resources

For fun: most frequent bigrams?

  Count     Bigram
  2551888   of the
  1887475   in the
  1041011   to the
   861798   on the
   676658   and the
   648408   to be
   578806   for the
   561171   at the
   498217   in a
   479627   do n't
   455367   with the
   451460   from the
   443547   of a
   395939   that the
   362176   is a
   361879   going to
   335255   by the
   330828   as a
   319846   with a
   317431   I think

  Source: http://www.ngrams.info/download_coca.asp

Most frequent trigrams?

  Count    Trigram
  198630   I do n't
  140305   one of the
  129406   a lot of
  117289   the United States
   79825   do n't know
   76782   out of the
   75015   as well as
   73540   going to be
   61373   I did n't
   61132   to be a

  Source: http://www.ngrams.info/download_coca.asp
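Frequency lists like the ones above come from counting every run of n adjacent tokens in a corpus. The following is a minimal sketch of that counting step, not the procedure used to build the COCA lists. The sample sentence is made up for illustration, and the helper name `ngrams` is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy, made-up text (an assumption for illustration; not COCA data).
# It is pre-tokenized in the COCA style, where "don't" is split into "do n't".
text = "I do n't know if I do n't want to go"
tokens = text.split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Print the most frequent bigrams in "count bigram" order, as on the slides.
for gram, count in bigram_counts.most_common(3):
    print(count, " ".join(gram))
```

On a real corpus the same loop applies, but tokenization choices (case, punctuation, splitting of contractions) change the counts, so they should match whatever list you are comparing against.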

n-grams and statistical NLP
  You have a good intuition as a native speaker.
  Beyond intuition, it is possible to obtain a highly detailed & accurate set of n-gram statistics.
    How? Through corpus data.
  Corpus-sourced, large-scale n-grams are one of the biggest contributors to the recent advancement of statistical natural language processing (NLP) technologies.
  Used for: spelling correction, machine translation, speech recognition, information extraction...
    JUST ABOUT ANY NLP APPLICATION

n-grams vs. conditional probability
  Suppose 'is' is the current word.
    What is the most likely next word?
    How likely are 'you' and 'your' as the next word?
  Questions of conditional probability
    Can be answered through n-gram data

  (1) 'is' occurs 4,705,743,816 times.*

  Bigram counts:**
    is a       476718990   (2)
    is the     306482559
    is not     276753375
    is an       98762170
    is to       97276807
    is your     17051576   (3)
    is you       1826931   (4)

  'a' is the most likely next word, with (2) / (1) = 0.10 probability.
  'your' as the next word has (3) / (1) = 0.0036 probability.
  'you' as the next word has (4) / (1) = 0.000388 probability.

  *Source: http://norvig.com/ngrams/count_1w.txt
  **Source: http://norvig.com/ngrams/count_2w.txt
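The arithmetic on this slide is the estimate P(next | current) = count(current next) / count(current). A minimal sketch that reproduces it is shown here. The counts are copied from the slide; the dictionary layout and the function name `cond_prob` are assumptions made for illustration.

```python
# Counts copied from the slide (Norvig's count_1w.txt and count_2w.txt).
unigram_count = {"is": 4705743816}
bigram_count = {
    ("is", "a"): 476718990,
    ("is", "the"): 306482559,
    ("is", "not"): 276753375,
    ("is", "an"): 98762170,
    ("is", "to"): 97276807,
    ("is", "your"): 17051576,
    ("is", "you"): 1826931,
}

def cond_prob(current, nxt):
    """Estimate P(nxt | current) as count(current nxt) / count(current)."""
    return bigram_count.get((current, nxt), 0) / unigram_count[current]

# Most likely next word after 'is', among the bigrams listed above.
best = max((pair for pair in bigram_count if pair[0] == "is"),
           key=lambda pair: bigram_count[pair])
print(best[1], round(cond_prob("is", best[1]), 2))

for word in ("your", "you"):
    print(word, cond_prob("is", word))
```

A bigram missing from the table gets probability 0 in this sketch, which is one reason practical systems smooth their estimates.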

Extremely large
  "All our N-gram are Belong to You"
  http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
    Google Web 1T 5-Gram, released in August 2006 through LDC (Linguistic Data Consortium)
    1-5 grams
    Compiled from 1 trillion words of running web text
    24 GB of compressed text
    Source of Norvig's 1- and 2-gram frequency lists
  Publication of this data triggered huge advances in NLP technologies and applications.

Even larger
  Google Books Ngram Corpus
  http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
    Basis for Google Books Ngram Viewer
    1-5 grams
    Freely downloadable (for those who can)
    Compiled from over 5 million books, published up to 2008
    Data has publication dates; good for charting historical trends
    Books were digitized using OCR
    In multiple languages: American/British English, Chinese, French, German, Hebrew, Italian, Russian, Spanish

Large-ish
  COCA n-gram lists
  http://www.ngrams.info/download_coca.asp
    Word 2-5 grams, each containing top ~1 million entries
    Based on COCA (The Corpus of Contemporary American English) (http://corpus.byu.edu/coca/), 520 million words as of Jan 2017
    COCA's full unigram list is not free.
  COCA's top 5000 words/lemmas
  http://www.wordfrequency.info/free.asp
    Contains lemma and POS of top 5,000 words

Excerpted, manageable
  Natural Language Corpus Data: Beautiful Data, by Peter Norvig
  http://norvig.com/ngrams/
    Has lists of large-scale English n-gram data: character level (1- & 2-grams) and word level (1-, 2-, 3-grams)
    Data derived/excerpted from Google Web 1T 5-Gram corpus
    ¼ million most frequent bigrams (Google's original data is 315 mil)

1-grams/word list: Norvig vs. ENABLE

  count_1w.txt (Norvig)
    the      23135851162
    of       13151942776
    and      12997637966
    to       12136980858
    a         9081174698
    in        8469404971
    for       5933321709
    is        4705743816
    on        3750423199
    that      3400031103
    by        3350048871
    this      3228469771
    with      3183110675
    i         3086225277
    ...
    goofel         12711
    gooek          12711
    gooddg         12711
    gooblle        12711
    gollgo         12711
    golgw          12711

  enable1.txt (ENABLE)
    aa
    aah
    aahed
    aahing
    aahs
    aal
    aalii
    aaliis
    aals
    aardvark
    aardvarks
    aardwolf
    aardwolves
    aargh
    aarrgh
    ...
    zymotic
    zymurgies
    zymurgy
    zyzzyva
    zyzzyvas

  Total # of entries: 333K vs. 173K
  Overlap? Usefulness?
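One way to approach the overlap question is to treat each list as a set of word types and compare them. The sketch below assumes the words have already been read into Python lists. The two samples are tiny, made-up stand-ins rather than the contents of count_1w.txt or enable1.txt, and the function name `overlap_stats` is hypothetical.

```python
def overlap_stats(list_a, list_b):
    """Count word types shared by two lists and types unique to each."""
    a, b = set(list_a), set(list_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }

# Toy samples (assumptions for illustration only).
web_words = ["the", "of", "is", "goofel", "gooek"]              # web-derived frequency list
dictionary_words = ["aardvark", "the", "of", "is", "zymurgy"]   # curated word list

stats = overlap_stats(web_words, dictionary_words)
print(stats)
```

Words found only in a web-derived list tend to include typos and names, while words found only in a curated list tend to be rare dictionary entries. That contrast is part of what the usefulness question asks you to weigh.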

2-grams: Norvig vs. COCA

  count_2w.txt (Norvig)
    you get        25183570
    you getting      430987
    you give        3512233
    you go          8889243
    you going       2100506
    you gone         210111
    you gonna        416217
    you good         441878
    you got         4699128
    you gotta        668275
    you graduate     117698
    you grant        103633
    you great        450637
    you grep         120367
    you grew         102321
    you grow         398329
    you guess        186565
    you guessed      295086
    you guys        5968988
    you had         7305583
    you hand         120379

  w2_.txt (COCA)
    39509   you get
       30   you gets
       31   you gettin
      861   you getting
      263   you girls
       24   you git
     5690   you give
      138   you given
      169   you giving
      182   you glad
       46   you glance
    23594   you go
       70   you god
       54   you goddamn
      115   you goin
     9911   you going
     1530   you gon
      262   you gone
      444   you good
       25   you google
    19843   you got

  Compiled from: 1 trillion words vs. 500 million words
  Total # of entries: ¼ million* vs. 1 million
    *NOT Google's fault! Norvig only took top 0.1% of 315 million.
  Usefulness?
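Raw counts from the two lists are not directly comparable, because the source corpora differ in size by a factor of roughly 2,000. A common remedy is to convert each count to a rate per million words. The sketch below does that for a few lines copied from the slide. It assumes whitespace-separated fields with the count last (Norvig) or first (COCA), and it uses the approximate corpus sizes stated on the slide. The function names are hypothetical.

```python
# Approximate corpus sizes, as stated on the slide.
NORVIG_SIZE = 1_000_000_000_000   # about 1 trillion words
COCA_SIZE = 500_000_000           # about 500 million words

def parse_norvig(line):
    """Parse a 'w1 w2 count' line (count is the last field)."""
    *words, count = line.split()
    return tuple(words), int(count)

def parse_coca(line):
    """Parse a 'count w1 w2' line (count is the first field)."""
    count, *words = line.split()
    return tuple(words), int(count)

def per_million(count, corpus_size):
    """Convert a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000

# A few entries copied from the slide.
norvig = dict(parse_norvig(l) for l in
              ["you get 25183570", "you going 2100506", "you guys 5968988"])
coca = dict(parse_coca(l) for l in
            ["39509 you get", "9911 you going", "19843 you got"])

# Compare only the bigrams present in both samples.
for gram in sorted(norvig.keys() & coca.keys()):
    print(" ".join(gram),
          round(per_million(norvig[gram], NORVIG_SIZE), 2),
          round(per_million(coca[gram], COCA_SIZE), 2))
```

Even after normalization the rates differ, which reflects differences in domain and genre between web text and COCA's sources as much as differences in size.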

Know your data
  When using publicly available resources, you must evaluate and understand the data.
    Origin?
    Domain & genre?
    Size?
    Traits?
    Merits and limitations?
    Fit with your project?