Laws of Text. Lecture Objectives. Text Technologies for Data Science INFR Learn about some text laws. This lecture is practical 9/26/2018

Transcription:

Text Technologies for Data Science (INFR11145)
Laws of Text
Instructor: Walid Magdy
26-Sep-2018

Lecture Objectives
Learn about some text laws:
- Zipf's law
- Benford's law
- Heaps' law
- Clumping/contagion
This lecture is practical.

You can try with me
Tools:
- Shell commands: cat, sort, uniq, grep
- Perl (or an alternative)
- Excel (or an alternative)
Download the following:
- Bible: http://www.gutenberg.org/cache/epub/10/pg10.txt

The nature of words
- The word is the basic unit used to represent text.
- The words we use show certain characteristics.
- These characteristics are so consistent that we can state laws for them.
- These laws apply across:
  - different languages
  - different domains of text
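The counting step that the shell tools perform (split the text into words, sort, count duplicates) can be sketched in Python as follows. This is a minimal illustration, not the lecture's own script: the function names and the tokenization rule (lowercase, alphabetic runs only) are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic runs as tokens.

    Assumption for this sketch: digits and punctuation are dropped.
    """
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


sample = "In the beginning God created the heaven and the earth."
freqs = word_frequencies(sample)

# Print the most frequent words first, like `sort | uniq -c | sort -rn`.
for word, count in freqs.most_common(3):
    print(count, word)
```

To try it on the downloaded Bible text, the same function can be applied to the file contents, for example `word_frequencies(open("pg10.txt", encoding="utf-8").read())`, assuming the file was saved locally under that name.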

Frequency of words
- Some words are very frequent, e.g. the, of, to
- Many words are much less frequent, e.g. schizophrenia, bazinga
- ~50% of terms appear only once
- Word frequency decays very steeply with rank
[Figure: word frequency against rank; axes: Frequency, Log(frequency), Log(rank)]

Zipf's Law
For a given collection of text, rank the unique terms according to their frequency. Then:
$r \times P_r \approx \text{const}$
where
- $r$ is the rank of a term according to its frequency
- $P_r$ is the probability of appearance of the term
Equivalently, $P_r \approx \frac{\text{const}}{r}$, which has the shape $f(x) \propto \frac{1}{x}$.
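The relation between rank and probability can be checked on any frequency table with a few lines of Python. The sketch below is illustrative only: `zipf_table` is a hypothetical helper, and the toy counts are made up so that rank times frequency is exactly constant.

```python
from collections import Counter


def zipf_table(freqs):
    """Rank terms by descending frequency.

    Returns rows of (rank, term, frequency, P_r, rank * P_r), where P_r is
    the term's share of all token occurrences.
    """
    total = sum(freqs.values())
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    for rank, (term, freq) in enumerate(ranked, start=1):
        p = freq / total
        rows.append((rank, term, freq, p, rank * p))
    return rows


# Toy counts (invented for the example) that follow Zipf's law exactly.
toy = Counter({"the": 60, "of": 30, "to": 20, "in": 15, "bazinga": 12})
for rank, term, freq, p, rp in zipf_table(toy):
    print(f"{rank:>2} {term:<8} {freq:>3}  P_r={p:.3f}  r*P_r={rp:.3f}")
```

On real text the last column is only roughly constant; it usually drifts for the very top ranks and for the long tail of rare terms.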

Zipf's Law: Wikipedia abstracts
3.5M English abstracts
$r \times P_r \approx \text{const} \Rightarrow r \times \text{freq}_r \approx \text{const}$

Term         Rank   Frequency     r × freq
the             1   5,134,790    5,134,790
of              2   3,102,474    6,204,948
in              3   2,607,875    7,823,625
a               4   2,492,328    9,969,312
is              5   2,181,502   10,907,510
and             6   1,962,326   11,773,956
was             7   1,159,088    8,113,616
to              8   1,088,396    8,707,168
by              9     766,656    6,899,904
an             10     566,970    5,669,700
it             11     557,492    6,132,412
(not shown)    12   (not shown)  5,970,456
for            13     493,374    6,413,862
as             14     480,277    6,723,878
on             15     471,544    7,073,160
from           16     412,785   (not shown)

Distribution of the first digit in the frequencies above?
1) Uniform
2) Exp decay
3) Normal
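One way to explore the first-digit question is simply to tally the leading digits. The sketch below does this for the fifteen frequencies listed on the slide and prints, for reference, the share that Benford's law (introduced right after this question) would predict, using the base-10 form of the formula. The helper names are assumptions for this example, and fifteen numbers are far too few to confirm or reject any distribution.

```python
import math
from collections import Counter

# Term frequencies copied from the Wikipedia-abstracts table on the slide.
FREQUENCIES = [
    5134790, 3102474, 2607875, 2492328, 2181502, 1962326, 1159088, 1088396,
    766656, 566970, 557492, 493374, 480277, 471544, 412785,
]


def leading_digit(n):
    """Return the first decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def leading_digit_counts(numbers):
    """Count how many numbers start with each digit 1..9."""
    counts = Counter(leading_digit(n) for n in numbers)
    return {d: counts.get(d, 0) for d in range(1, 10)}


def benford_probability(d):
    """Benford's law for first digit d: log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)


counts = leading_digit_counts(FREQUENCIES)
total = sum(counts.values())
print("digit  count  observed  benford")
for d in range(1, 10):
    print(f"{d:>5}  {counts[d]:>5}  {counts[d] / total:>8.2f}  {benford_probability(d):>7.3f}")
```

Running the same tally over the full frequency list of a large collection gives a much clearer picture than this small table.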

Benford's Law
The first digit of a number follows a Zipf-like law! This holds for:
- Term frequencies
- Physical constants
- Energy bills
- Population numbers
Benford's law: $P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$, where $d$ is the first digit (1 to 9).

Heaps' Law
While going through documents, the number of new terms noticed decreases over time.
For a book or collection, while reading through, record:
- n: number of words read
- v: number of new (unique) words seen so far
Vocabulary growth: $v(n) = k \cdot n^b$, where $b < 1$; typically $0.4 < b < 0.7$.
[Figure: vocabulary growth curve; axes: v (vocabulary) against n (words)]
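The bookkeeping behind the vocabulary-growth curve (n words read, v unique words seen) can be sketched as below. The function names are hypothetical, and the toy sentence is invented; `heaps_estimate` just evaluates the power law for parameters k and b that you supply.

```python
import re


def vocabulary_growth(tokens):
    """Return (n, v) pairs: after reading n tokens, v distinct tokens were seen."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def heaps_estimate(n, k, b):
    """Vocabulary size predicted by the power law v(n) = k * n**b."""
    return k * n ** b


# Toy text, invented for the example.
tokens = re.findall(r"[a-z]+", "the cat saw the dog and the dog saw the cat")
growth = vocabulary_growth(tokens)
for n, v in growth:
    print(n, v)
```

Plotting v against n for a real collection (for example in Excel) shows the curve flattening without ever becoming level, which is the behaviour the power law with b < 1 describes.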

Heaps' Law: shouldn't it saturate?
- n = 80+ million, but the vocabulary is still growing
- Think about:
  - spelling errors
  - names
  - emails
  - codes
- Accurate for most collections, but with different k and b
- Not very accurate when n is small

Clumping/Contagion in text
From Zipf's law, we notice:
- Most words do not appear that much!
- Once you see a word once, expect to see it again!
Words are like:
- a rare contagious disease
- not rare, independent lightning
Words are rare events, but they are contagious.
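One way to look for this contagion in data is to take the terms that occur exactly twice and measure how far apart their two occurrences are, counted in token positions. The sketch below assumes a pre-tokenized list; the function name and the toy token sequence are invented for the example.

```python
from collections import defaultdict


def gaps_for_terms_seen_twice(tokens):
    """Map each term that occurs exactly twice to the distance between its positions."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return {
        term: places[1] - places[0]
        for term, places in positions.items()
        if len(places) == 2
    }


# Toy token stream, invented for the example.
tokens = "a b zipf c zipf d e f a g heaps h i heaps".split()
gaps = gaps_for_terms_seen_twice(tokens)
for term, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(term, gap)
```

If words were independent rare events, these gaps would be spread across the whole collection; a pile-up of small gaps is the sign of clumping.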

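Clumping/Contagion in text
Wiki abstract collection:
- Identify the terms that appear only twice
- Measure the distance between the two occurrences of each term: $d = n_{\text{occurrence 2}} - n_{\text{occurrence 1}}$
- Plot the density function of $d$
The majority of terms appearing only twice appear close to each other.
[Figure: density against distance (d)]

Applying the laws
Given a collection of 20 billion terms:
- What is the number of unique terms?
  Heaps' law: $v(n) = k \cdot n^b$; assume k = 0.25, b = 0.5
  $v(n) = 0.25 \times (20\text{B})^{0.5} \approx 35\text{M}$ as given on the slide. Evaluated literally, $0.25 \times (2 \times 10^{10})^{0.5} \approx 35{,}000$; a result of 35M corresponds to k = 250 with the same b.
- What is the number of terms appearing once?
  Zipf's law: about half of the unique terms appear only once, so ~17M of the 35M appear only once.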
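The arithmetic in this exercise is easy to check directly. The sketch below evaluates the Heaps power law for n = 20 billion with the stated k = 0.25 and b = 0.5, which comes out at roughly 35 thousand, and again with k = 250, a value assumed here only because it reproduces the 35M figure quoted on the slide. The "half of the unique terms appear once" rule is the ~50% observation from the word-frequency slide; the function names are hypothetical.

```python
def heaps_vocabulary(n, k, b):
    """Number of unique terms predicted by v(n) = k * n**b."""
    return k * n ** b


def zipf_singletons(vocabulary_size):
    """Rough count of terms that appear once: about half of the vocabulary."""
    return vocabulary_size / 2


n = 20e9  # 20 billion terms in the collection

v_stated = heaps_vocabulary(n, k=0.25, b=0.5)  # parameters as written on the slide
v_k250 = heaps_vocabulary(n, k=250, b=0.5)     # assumed k that yields about 35M

print("k=0.25:", round(v_stated), "unique terms")
print("k=250: ", round(v_k250), "unique terms")
print("appearing once (k=250):", round(zipf_singletons(v_k250)))
```

Because b sits in the exponent while k is only a multiplier, a factor of 1000 in k changes the estimate by exactly a factor of 1000, which is the gap between the two results.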

Summary
Text follows well-known phenomena.
Text laws:
- Zipf
- Heaps
- Contagion in text

Resources
Textbook: Search Engines: IR in Practice, chapter 4
Videos:
- Zipf's law, Vsauce: https://www.youtube.com/watch?v=fcn8zs912oe
- Benford's law, Numberphile: https://www.youtube.com/watch?v=xxjlr2ok1km
Tools:
- Unix commands for Windows: https://sourceforge.net/projects/unxutils