Lecture 4: n-grams in NLP
LING 1330/2330: Introduction to Computational Linguistics
Na-Rae Han
Objectives
- Frequent n-grams in English
- n-grams and statistical NLP
- n-grams and conditional probability
- Large n-gram resources
For fun: most frequent bigrams?
2551888  of the
1887475  in the
1041011  to the
 861798  on the
 676658  and the
 648408  to be
 578806  for the
 561171  at the
 498217  in a
 479627  do n't
 455367  with the
 451460  from the
 443547  of a
 395939  that the
 362176  is a
 361879  going to
 335255  by the
 330828  as a
 319846  with a
 317431  I think
(Sequences like 'do n't' reflect the corpus tokenization, which splits contractions into separate tokens.)
Source: http://www.ngrams.info/download_coca.asp
Most frequent trigrams?
198630  I do n't
140305  one of the
129406  a lot of
117289  the United States
 79825  do n't know
 76782  out of the
 75015  as well as
 73540  going to be
 61373  I did n't
 61132  to be a
Source: http://www.ngrams.info/download_coca.asp
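Frequency lists like these can be built from any corpus. Here is a minimal sketch in Python using NLTK's Brown corpus (chosen for illustration because it is freely available; the lists above come from COCA, which is not):

```python
import nltk
from nltk.corpus import brown
from collections import Counter

# nltk.download('brown')  # uncomment on first run to fetch the corpus

words = [w.lower() for w in brown.words()]

# Count all bigrams and trigrams in the corpus
bigram_counts = Counter(nltk.bigrams(words))
trigram_counts = Counter(nltk.trigrams(words))

print("Top 10 bigrams:")
for gram, count in bigram_counts.most_common(10):
    print(count, ' '.join(gram))

print("Top 10 trigrams:")
for gram, count in trigram_counts.most_common(10):
    print(count, ' '.join(gram))
```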
n-grams and statistical NLP
As a native speaker, you have good intuitions about which word sequences are likely. Beyond intuition, it is possible to obtain a highly detailed and accurate set of n-gram statistics. How? Through corpus data.
Corpus-sourced, large-scale n-grams are one of the biggest contributors to the recent advancement of statistical natural language processing (NLP) technologies.
Used for: spelling correction, machine translation, speech recognition, information extraction... JUST ABOUT ANY NLP APPLICATION
n-grams vs. conditional probability
Suppose 'is' is the current word. What is the most likely next word? How likely are 'you' and 'your' as the next word?
These are questions of conditional probability, and they can be answered through n-gram data.
'is' occurs 4,705,743,816 times. (1)*
Bigram counts beginning with 'is':**
is a     476718990  (2)
is the   306482559
is not   276753375
is an     98762170
is to     97276807
is your   17051576  (3)
is you     1826931  (4)
'a' is the most likely next word, with (2)/(1) = 0.10 probability.
'your' as the next word has (3)/(1) = 0.0036 probability.
'you' as the next word has (4)/(1) = 0.000388 probability.
*Source: http://norvig.com/ngrams/count_1w.txt
**Source: http://norvig.com/ngrams/count_2w.txt
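As a sketch, these conditional probabilities can be computed directly from Norvig's count files, assuming count_1w.txt and count_2w.txt have been downloaded from http://norvig.com/ngrams/ into the working directory:

```python
# Estimate P(next_word | 'is') from Norvig's unigram and bigram counts.
# Both files are tab-separated: ngram<TAB>count, one entry per line.

def load_counts(filename):
    counts = {}
    with open(filename) as f:
        for line in f:
            ngram, count = line.strip().split('\t')
            counts[ngram] = int(count)
    return counts

unigrams = load_counts('count_1w.txt')
bigrams = load_counts('count_2w.txt')

is_total = unigrams['is']             # 4,705,743,816
print(bigrams['is a'] / is_total)     # -> about 0.10
print(bigrams['is your'] / is_total)  # -> about 0.0036
print(bigrams['is you'] / is_total)   # -> about 0.000388
```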
Extremely large
"All Our N-gram are Belong to You"
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Web 1T 5-Gram, released in August 2006 through the LDC (Linguistic Data Consortium)
- 1- through 5-grams
- Compiled from 1 trillion words of running web text
- 24 GB of compressed text
- Source of Norvig's 1- and 2-gram frequency lists
Publication of this data triggered huge advances in NLP technologies and applications.
Even larger
Google Books Ngram Corpus
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Basis for the Google Books Ngram Viewer
- 1- through 5-grams
- Freely downloadable (for those who can)
- Compiled from over 5 million books, published up to 2008
- Data includes publication dates, making it good for charting historical trends
- Books were digitized using OCR
- In multiple languages: American/British English, Chinese, French, German, Hebrew, Italian, Russian, Spanish
Large-ish
COCA n-gram lists
http://www.ngrams.info/download_coca.asp
- Word 2- through 5-grams, each list containing the top ~1 million entries
- Based on COCA (the Corpus of Contemporary American English, http://corpus.byu.edu/coca/), 520 million words as of Jan 2017
- COCA's full unigram list is not free.
COCA's top 5,000 words/lemmas
http://www.wordfrequency.info/free.asp
- Contains lemma and POS of the top 5,000 words
Excerpted, manageable
Natural Language Corpus Data: Beautiful Data, by Peter Norvig
http://norvig.com/ngrams/
- Hosts lists of large-scale English n-gram data: character-level (1- and 2-grams) and word-level (1-, 2-, and 3-grams)
- Data derived/excerpted from the Google Web 1T 5-Gram corpus
- ¼ million most frequent bigrams (Google's original data has 315 million)
1-grams/word list: Norvig vs. ENABLE
count_1w.txt (top of list):
the   23135851162
of    13151942776
and   12997637966
to    12136980858
a      9081174698
in     8469404971
for    5933321709
is     4705743816
on     3750423199
that   3400031103
by     3350048871
this   3228469771
with   3183110675
i      3086225277
count_1w.txt (bottom of list):
goofel    12711
gooek     12711
gooddg    12711
gooblle   12711
gollgo    12711
golgw     12711
enable1.txt (top and bottom of list):
aa, aah, aahed, aahing, aahs, aal, aalii, aaliis, aals, aardvark, aardvarks, aardwolf, aardwolves, aargh, aarrgh ... zymotic, zymurgies, zymurgy, zyzzyva, zyzzyvas
Total # of entries: 333K vs. 173K
Overlap? Usefulness?
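The overlap question can be answered directly, as in the sketch below (assuming count_1w.txt and enable1.txt are in the working directory; enable1.txt is one word per line, count_1w.txt is tab-separated):

```python
# Compare Norvig's web-derived word list against the ENABLE dictionary.

norvig_words = set()
with open('count_1w.txt') as f:
    for line in f:
        norvig_words.add(line.split('\t')[0])

with open('enable1.txt') as f:
    enable_words = {line.strip() for line in f}

print('Norvig only:', len(norvig_words - enable_words))
print('ENABLE only:', len(enable_words - norvig_words))
print('In both:    ', len(norvig_words & enable_words))
```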
2-grams: Norvig vs. COCA
count_2w.txt:
you get       25183570
you getting     430987
you give       3512233
you go         8889243
you going      2100506
you gone        210111
you gonna       416217
you good        441878
you got        4699128
you gotta       668275
you graduate    117698
you grant       103633
you great       450637
you grep        120367
you grew        102321
you grow        398329
you guess       186565
you guessed     295086
you guys       5968988
you had        7305583
you hand        120379
w2_.txt:
39509  you get
   30  you gets
   31  you gettin
  861  you getting
  263  you girls
   24  you git
 5690  you give
  138  you given
  169  you giving
  182  you glad
   46  you glance
23594  you go
   70  you god
   54  you goddamn
  115  you goin
 9911  you going
 1530  you gon
  262  you gone
  444  you good
   25  you google
19843  you got
Compiled from: 1 trillion words vs. 500 million words
2-grams: Norvig vs. COCA (continued)
(Same count_2w.txt and w2_.txt excerpts as above.)
Total # of entries: ¼ million* vs. 1 million
*NOT Google's fault! Norvig only took the top 0.1% of 315 million.
Usefulness?
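Because the two source corpora differ in size by roughly three orders of magnitude, raw counts are not directly comparable. One common way to put them on the same scale, sketched below with the 'you get' counts and corpus sizes from the slides above, is to normalize to frequency per million words:

```python
# Normalize raw bigram counts to per-million-word frequencies so that
# Google Web 1T (~1 trillion words) and COCA (~500 million words)
# counts can be compared on the same scale.

GOOGLE_SIZE = 1_000_000_000_000   # ~1 trillion words
COCA_SIZE = 500_000_000           # ~500 million words

google_you_get = 25_183_570      # count from count_2w.txt
coca_you_get = 39_509            # count from w2_.txt

print(google_you_get / GOOGLE_SIZE * 1_000_000)  # ~25.2 per million
print(coca_you_get / COCA_SIZE * 1_000_000)      # ~79.0 per million
```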
Know your data
When using publicly available resources, you must evaluate and understand the data.
- Origin?
- Domain & genre?
- Size?
- Traits?
- Merits and limitations?
- Fit with your project?