Intuitive Considerations Clarifying the Origin and Applicability of the Benford Law

G. Whyman*, E. Shulzinger, Ed. Bormashenko
Ariel University, Faculty of Natural Sciences, Department of Physics, Ariel, P.O.B. 3, 40700, Israel

Abstract

The diverse applications of the Benford law attract investigators working in various fields of physics, biology and sociology. At the same time, the foundations of the Benford law remain obscure. Our paper demonstrates that the Benford law arises from the positional (place-value) notation accepted for representing various sets of data. Formulae alternative to the Benford formula are derived for predicting the distribution of leading digits in statistical data. Application of these formulae to the statistical analysis of infrared spectra of polymers is presented. Violations of the Benford law are discussed.

KEYWORDS: Benford's law, leading digit phenomenon, statistical data, infrared spectra, positional notation.

Introduction

The Benford law is a phenomenological, counterintuitive law observed in many naturally occurring tables of numerical data; it is also called the first-digit law, first digit phenomenon, or leading digit phenomenon. It states that in listings, tables of statistics, etc., the digit 1 tends to occur as the leading digit with a probability of about 30%, much greater than the expected value of 11.1% (i.e., one digit out of 9) [1-2]. The discovery of Benford's law goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm tables (used at that time to perform calculations), the earlier pages (which contained numbers that started with 1) were much more worn and smudged than the later pages. Newcomb noted "that the ten digits do not occur with equal frequency must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones" [1]. The phenomenon was rediscovered by the physicist Frank Benford, who tested it on data extracted from 20 different domains, as different as surface areas of rivers, physical constants, molecular weights, etc. Since then, the law has been credited to Benford [2].

The Benford law is expressed by the following statement: the occurrence of first significant digits in data sets follows a logarithmic distribution:

$$P(n) = \log_{10}\left(1 + \frac{1}{n}\right), \quad n = 1, 2, \ldots, 9 \tag{1}$$

where $P(n)$ is the probability of a number having the first non-zero digit $n$.

Since its formulation, Benford's law has been applied to the analysis of a broad variety of statistical data, including atomic spectra [3], population dynamics [4], magnitude and depth of earthquakes [5], genomic data [6-7], mantissa distributions of pulsars [8], and economic data [9-10]. While Benford's law definitely applies to many situations in the real world, a satisfactory explanation has been given only recently through the works of Hill et al. [11-13], who called the Benford distribution "the law of statistical folklore". Important intuitive physical insights into the foundations of the Benford law, relating its origin to the scale invariance of physical laws, were reported by Pietronero et al. [14]. Engel et al. demonstrated that the Benford law holds approximately for exponentially distributed numbers [15]. Fewster supplied a simple geometrical justification of the Benford law [16]. The breakdown of the Benford law was reported for certain sets of statistical data [17-19]. It should be mentioned that the foundations and applicability of the Benford law remain highly debatable [13]. In spite of this, the Benford law has been effectively exploited for detecting fraud in accounting data [18]. The use of Benford's law to quantify non-stationarity effects on the organization of turbulent motion was reported recently [20]. Our paper supplies intuitive reasoning clarifying the origin of the Benford law.
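As a quick numerical illustration of Eq. (1), the following minimal Python sketch tabulates the nine Benford probabilities and checks that they sum to one. The function name `benford` is our own choice for illustration and does not come from the paper.

```python
import math


def benford(n: int) -> float:
    """Probability that the first significant digit equals n, as in Eq. (1)."""
    if not 1 <= n <= 9:
        raise ValueError("n must be a decimal digit from 1 to 9")
    return math.log10(1 + 1 / n)


# Tabulate P(n) for the nine possible leading digits.
probs = {n: benford(n) for n in range(1, 10)}
for n, p in probs.items():
    print(f"P({n}) = {p:.4f}")

# The terms log10((n + 1) / n) telescope to log10(10) = 1.
total = sum(probs.values())
print(f"sum = {total:.12f}")
```

The leading digit 1 comes out near 30%, against 1/9 (about 11.1%) for a uniform distribution of digits.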
1. New Results

1.1. The Origin of the Benford Law and the Positional (Place-Value) Notation

In practice, measured quantities or analyzed data are restricted by a prescribed accuracy defined by a number of significant digits. This means that the mantissas of decimal numbers, which are simply integers, are bounded from above by some integer, say, $m+1$. With this in mind, consider the set $\{1, 2, \ldots, m\}$. When $m \to \infty$, this set coincides with the full set of positive integers. Let us elucidate how the frequency $f_n(m)$ of numbers beginning with the digit 1 ($n=1$) depends on $m$. The first 6 lines of Table 1 give the values of $m$ for which $f_1(m)$ successively reaches a minimum and a maximum. The frequency changes quasi-periodically with increasing $m$, decreasing and increasing in turn, and reaches its minima and maxima at selected values of $m$ (see Figure 1). Successive minima $f_{\min,n}(k)$ and maxima $f_{\max,n}(k)$ are enumerated by $k = 1, 2, \ldots$

Figure 1. The dependence of the first-digit frequency $f_n(m)$ on the upper mantissa limit $m$ for the digits n=1 (black), 2 (red), and 3 (green).

As another example, the subsequent lines of Table 1 give the minimal and maximal frequencies $f_{\min,5}(k)$, $f_{\max,5}(k)$ and $f_{\min,9}(k)$, $f_{\max,9}(k)$ of integers beginning with the digits 5 and 9. It is seen that the maximal and minimal frequencies decrease for the sequence n=1,5,9: the number of integers beginning with these digits remains the same, but the sizes of the corresponding intervals $[1,m]$ grow (compare $m$ in the third column for different $n$ and the same $k$).

Table 1. Frequencies of integers beginning with different digits.

| First digit, n, of the number | m | Set {1,2,…,m} | Amount, p, of numbers | k | Minimal and maximal frequencies, p/m |
|---|---|---|---|---|---|
| 1 | 9 | 1,2,…,9 | 1 | 1 | 1/9 (min) |
| 1 | 19 | 1,2,…,19 | 11 | 1 | 11/19 (max) |
| 1 | 99 | 1,2,…,99 | 11 | 2 | 11/99 = 1/9 (min) |
| 1 | 199 | 1,2,…,199 | 111 | 2 | 111/199 (max) |
| 1 | 999 | 1,2,…,999 | 111 | 3 | 111/999 = 1/9 (min) |
| 1 | 1999 | 1,2,…,1999 | 1111 | 3 | 1111/1999 (max) |
| 5 | 49 | 1,2,…,49 | 1 | 1 | 1/49 (min) |
| 5 | 59 | 1,2,…,59 | 11 | 1 | 11/59 (max) |
| 5 | 499 | 1,2,…,499 | 11 | 2 | 11/499 (min) |
| 5 | 599 | 1,2,…,599 | 111 | 2 | 111/599 (max) |
| 5 | 4999 | 1,2,…,4999 | 111 | 3 | 111/4999 (min) |
| 5 | 5999 | 1,2,…,5999 | 1111 | 3 | 1111/5999 (max) |
| 9 | 89 | 1,2,…,89 | 1 | 1 | 1/89 (min) |
| 9 | 99 | 1,2,…,99 | 11 | 1 | 11/99 (max) |
| 9 | 899 | 1,2,…,899 | 11 | 2 | 11/899 (min) |
| 9 | 999 | 1,2,…,999 | 111 | 2 | 111/999 (max) |
| 9 | 8999 | 1,2,…,8999 | 111 | 3 | 111/8999 (min) |
| 9 | 9999 | 1,2,…,9999 | 1111 | 3 | 1111/9999 (max) |

As is seen from Table 1, the successive minima and maxima, enumerated by $k$, may be written as

$$f_{\min,n}(k) = \frac{10^{k-1} + 10^{k-2} + \dots + 1}{(n-1)\,10^{k} + 9\cdot 10^{k-1} + 9\cdot 10^{k-2} + \dots + 9}, \tag{2}$$

$$f_{\max,n}(k) = \frac{10^{k} + 10^{k-1} + \dots + 1}{n\,10^{k} + 9\cdot 10^{k-1} + 9\cdot 10^{k-2} + \dots + 9} \tag{3}$$

for $k = 1, 2, 3, \ldots$ All the sums in (2), (3) are sums of geometric sequences, which gives

$$f_{\min,n}(k) = \frac{10^{k} - 1}{9\,(n\,10^{k} - 1)}, \tag{4}$$

$$f_{\max,n}(k) = \frac{10^{k+1} - 1}{9\,[(n+1)\,10^{k} - 1]}. \tag{5}$$

Letting $k$ go to infinity (which also means letting the corresponding values of $m$ in Table 1 go to infinity) results in

$$f_{\min,n} = \lim_{k\to\infty} f_{\min,n}(k) = \frac{1}{9n}, \tag{6}$$

$$f_{\max,n} = \lim_{k\to\infty} f_{\max,n}(k) = \frac{10}{9(n+1)}. \tag{7}$$

The probability of randomly choosing a number beginning with the digit $n$ from the whole set of integers may be estimated as a normalized arithmetic mean or a normalized geometric mean of the minimal (6) and maximal (7) frequencies:

$$P_{arith}(n) = \frac{f_{\min,n} + f_{\max,n}}{\sum_{i=1}^{9}\left(f_{\min,i} + f_{\max,i}\right)}, \qquad P_{geom}(n) = \frac{\sqrt{f_{\min,n}\, f_{\max,n}}}{\sum_{i=1}^{9}\sqrt{f_{\min,i}\, f_{\max,i}}}.$$

The final result is

$$P_{arith}(n) = \frac{\dfrac{10}{n+1} + \dfrac{1}{n}}{\sum_{i=1}^{9}\left(\dfrac{10}{i+1} + \dfrac{1}{i}\right)}, \tag{8}$$

$$P_{geom}(n) = \frac{1}{\sqrt{n(n+1)}} \Bigg/ \sum_{i=1}^{9}\frac{1}{\sqrt{i(i+1)}}. \tag{9}$$

The results of equations (8) and (9) are compared with the Benford formula (1) in Table 2 and Figure 2. As is seen, the normalized geometric mean shows very good agreement, even though the mathematical forms of (1) and (9) are different.
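The counting argument behind Table 1 and the closed forms (4)-(7) can be checked by brute force. The sketch below counts the integers in {1, ..., m} that begin with the digit n and compares the result with Eqs. (4) and (5), using exact rational arithmetic. It is only an illustrative check; the helper names (`first_digit`, `freq`, `f_min`, `f_max`) are ours, and the values of m at which the extrema occur are taken from the pattern seen in Table 1.

```python
from fractions import Fraction


def first_digit(x: int) -> int:
    """Leading decimal digit of a positive integer."""
    while x >= 10:
        x //= 10
    return x


def freq(n: int, m: int) -> Fraction:
    """f_n(m): fraction of integers in {1, ..., m} whose leading digit is n."""
    count = sum(1 for x in range(1, m + 1) if first_digit(x) == n)
    return Fraction(count, m)


def f_min(n: int, k: int) -> Fraction:
    """Closed form of the k-th minimum, Eq. (4)."""
    return Fraction(10**k - 1, 9 * (n * 10**k - 1))


def f_max(n: int, k: int) -> Fraction:
    """Closed form of the k-th maximum, Eq. (5)."""
    return Fraction(10 ** (k + 1) - 1, 9 * ((n + 1) * 10**k - 1))


for n in (1, 5, 9):
    for k in (1, 2, 3):
        m_min = n * 10**k - 1        # 9, 99, 999 for n = 1 (rows of Table 1)
        m_max = (n + 1) * 10**k - 1  # 19, 199, 1999 for n = 1
        # Direct counting must agree exactly with the closed forms.
        assert freq(n, m_min) == f_min(n, k)
        assert freq(n, m_max) == f_max(n, k)
        print(f"n={n} k={k}: f_min={f_min(n, k)} (m={m_min}), "
              f"f_max={f_max(n, k)} (m={m_max})")

# For large k the closed forms approach the limits (6) and (7).
for n in (1, 5, 9):
    print(n, float(f_min(n, 15)), 1 / (9 * n),
          float(f_max(n, 15)), 10 / (9 * (n + 1)))
```

Exact fractions are used so that identities such as 11/99 = 1/9 are compared without rounding error.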

Figure 2. Comparison of equations (8) and (9) with the Benford formula (1): P(n) versus n for the Benford formula (1), the arithmetic mean (8), and the geometric mean (9).

Table 2. Comparison of equations (8) and (9) with the Benford formula (1).

| n | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Benford | 0.3010 | 0.1761 | 0.1249 | 0.09691 | 0.07918 | 0.06695 | 0.05799 | 0.05115 | 0.04576 |
| Geometric mean, Eq. (9) | 0.3046 | 0.1759 | 0.1244 | 0.09632 | 0.07865 | 0.06647 | 0.05756 | 0.05077 | 0.04541 |
| Arithmetic mean, Eq. (8) | 0.2712 | 0.1733 | 0.1281 | 0.1017 | 0.08439 | 0.07212 | 0.06297 | 0.05589 | 0.05023 |

The results (6)-(9) allow an obvious generalization to the case of an arbitrary base $N$ of the positional numeral system:

$$f^{N}_{\min,n} = \frac{1}{(N-1)\,n}, \qquad f^{N}_{\max,n} = \frac{N}{(N-1)(n+1)},$$

$$P^{N}_{arith}(n) = \left(\frac{N}{n+1} + \frac{1}{n}\right) \Bigg/ \sum_{i=1}^{N-1}\left(\frac{N}{i+1} + \frac{1}{i}\right), \tag{10}$$

$$P^{N}_{geom}(n) = \frac{1}{\sqrt{n(n+1)}} \Bigg/ \sum_{i=1}^{N-1}\frac{1}{\sqrt{i(i+1)}}, \tag{11}$$

where $1 \le n \le N-1$. In particular, in the binary system ($N=2$), the right-hand sides of all four of the last equations become 1 for $n=1$ (all numbers written in the binary system begin with 1).

It is well known that in many cases the Benford distribution does not hold. This may happen, e.g., under some restriction on the set of admissible numbers. For example, if the inequality $l < 1000$ is imposed on a random sample of integers $l$ (or mantissas of real numbers), the probability $P(1)$ will be close to 1/9 (see Table 1), and not to the value predicted by the Benford formula or by equations (8), (9), which is about 3 times larger. More generally, the necessary condition is that the set $\{1, 2, \ldots, m\}$ to which a random sample of integers belongs should contain the same numbers of minimal (4) and maximal (5) frequencies. In any case, if some restrictions take place, the following inequalities should be fulfilled:

$$f_{\min,n} \le P(n) \le f_{\max,n},$$

or

$$\frac{1}{9n} \le P(n) \le \frac{10}{9(n+1)} \tag{12}$$

in the decimal system. In numeral systems with a lower base $N$, the appropriate inequalities are stronger:

$$\frac{1}{(N-1)\,n} \le P(n) \le \frac{N}{(N-1)(n+1)}.$$
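Equations (10), (11) and the bounds (12) are easy to evaluate numerically. The following Python sketch does so for an arbitrary base and, for N = 10, prints the quantities listed in Table 2. It is a minimal illustration under our own naming (`p_arith`, `p_geom`, `bounds`, `benford`), not the authors' code.

```python
import math


def p_arith(n: int, base: int = 10) -> float:
    """Normalized arithmetic mean, Eq. (10); equals Eq. (8) for base 10."""
    def weight(i: int) -> float:
        return base / (i + 1) + 1 / i
    return weight(n) / sum(weight(i) for i in range(1, base))


def p_geom(n: int, base: int = 10) -> float:
    """Normalized geometric mean, Eq. (11); equals Eq. (9) for base 10."""
    def weight(i: int) -> float:
        return 1 / math.sqrt(i * (i + 1))
    return weight(n) / sum(weight(i) for i in range(1, base))


def bounds(n: int, base: int = 10) -> tuple[float, float]:
    """Limiting minimal and maximal frequencies, the two sides of (12)."""
    return 1 / ((base - 1) * n), base / ((base - 1) * (n + 1))


def benford(n: int) -> float:
    """Decimal Benford formula, Eq. (1)."""
    return math.log10(1 + 1 / n)


print(" n  Benford   geom     arith    lower    upper")
for n in range(1, 10):
    lo, hi = bounds(n)
    # The Benford value lies between the limiting frequencies of (12).
    assert lo <= benford(n) <= hi
    print(f"{n:2d}  {benford(n):.5f}  {p_geom(n):.5f}  "
          f"{p_arith(n):.5f}  {lo:.5f}  {hi:.5f}")

# Binary system: the only possible leading digit is 1.
print(p_arith(1, base=2), p_geom(1, base=2), bounds(1, base=2))
```

For base 2 the sums contain a single term, so both estimates and both bounds reduce to 1, in line with the remark that every binary number begins with 1.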

A favorable situation for the Benford distribution appears when the admissible numbers belong to the range of a function in the vicinity of an infinite singularity. In this case, restrictions on $m$ are absent, and letting $m$ tend to infinity in (6) and (7) becomes reasonable.

1.2. Exemplification of New Results: Applicability of the Obtained Results to the Analysis of Infrared Spectra of Polymers

In our recent paper we demonstrated that the Benford law holds within the absorbance domain of infrared (IR) spectra of polymers [21]. The IR spectra may be treated as sets of absorbance values corresponding to sets of wavenumbers. Consider now the validity of Eqs. (8)-(9) for the actual distribution of leading digits in the absorbance spectra of polymers studied in Ref. [21] and represented in Fig. 3. The geometric averaging given by Eq. (11), which coincides with Eq. (9) for the decimal system, supplies the best correspondence with the experimental results.
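A comparison of this kind can be sketched in a few lines of Python: extract the leading digit of each absorbance value, build the observed digit frequencies, and correlate them with each predicted distribution. The measured spectra are not reproduced here, so the sketch uses a synthetic, log-uniformly distributed sample as a purely hypothetical stand-in. The use of the Pearson coefficient is also an assumption, since the text does not state which correlation coefficient R was computed.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant decimal digit of a nonzero number."""
    # Scientific notation puts the first significant digit up front.
    return int(f"{abs(x):.15e}"[0])


def digit_frequencies(values) -> list[float]:
    """Observed frequencies of the leading digits 1..9."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return [counts[d] / len(digits) for d in range(1, 10)]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def normalized(weights) -> list[float]:
    total = sum(weights)
    return [w / total for w in weights]


digits = range(1, 10)
benford = [math.log10(1 + 1 / n) for n in digits]                  # Eq. (1)
arith = normalized([10 / (n + 1) + 1 / n for n in digits])         # Eq. (8)
geom = normalized([1 / math.sqrt(n * (n + 1)) for n in digits])    # Eq. (9)

# Hypothetical stand-in for measured absorbances (NOT the data of Ref. [21]).
rng = random.Random(0)
absorbance = [10 ** rng.uniform(-3, 0) for _ in range(5000)]

observed = digit_frequencies(absorbance)
for name, model in (("Benford", benford), ("arithmetic", arith),
                    ("geometric", geom)):
    print(f"R({name}) = {pearson(observed, model):.3f}")
```

With real spectra, `absorbance` would be replaced by the list of measured absorbance values. The numbers printed for the synthetic sample say nothing about the experimental correlations reported for the polymer spectra.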

Figure 3. The actual frequencies P(n) of leading digits n appearing in the set of absorbance spectra (Experiment) vs. the Benford law and Eqs. (10) and (11) (arithmetic mean and geometric mean). The correlation coefficients are: R=0.964 for the Benford distribution, R=0.956 for Eq. (10) (arithmetic average approximation) and R=0.966 for Eq. (11) (geometric average approximation).

Summary

The present article places emphasis on the Benford law as a consequence of the structure of positional numeral systems. From this point of view, attempts at explanation based on scale invariance or base invariance, or even the presentation of the Benford law as a mysterious law of nature, at least call for refinement. A very convincing example is the binary positional system (with a base of 2), for which the Benford law should state that the probability of finding the digit 1 at the first place of a number is 100%. As shown above, a statistical estimate of the probability of finding a given digit at the first place of a number can be given that has a mathematical form different from the Benford law. This form, expressed by the derived equations (10), (11), gives practically the same numerical results as the Benford formula. Limitations from below and from above imposed a priori on admissible numbers lead to violations of the Benford law. The inequalities derived for such violations can be useful.

Acknowledgements

GW thanks the Israel Ministry of Absorption for its generous support over many years and his sister Elena Vaiman for her help.

REFERENCES

[1] Newcomb S. Amer J Math 1881; 4: 39-40.
[2] Benford F. Proc Am Phil Soc 1938; 78: 551-572.
[3] Pain JC. Phys Rev E 2008; 77: 012102.
[4] Mir TA. Phys A 2012; 391: 792-798.
[5] Sambridge M, Tkalčić H, Jackson A. Geophys Res Lett 2010; 37: L22301.
[6] Hernandez Caceres JL. Electronic Journal of Biomedicine 2008; 1: 27-35.
[7] Friar JL, Goldman T, Pérez Mercader J. Plos One 2012; 7: e36624: 9p.
[8] Shao L, Ma BQ. Astrop Phys 2010; 33: 255-262.
[9] Giles DE. Appl Econ Lett 2007; 14: 157-161.
[10] Mir TA, Ausloos M, Cerqueti R. Eur Phys J B 2014; 87: 261.

[11] Hill TP. Proceedings of the AMS 1995; 123: 887-895.
[12] Hill TP. Statist Sci 1995; 10: 354-363.
[13] Berger A, Hill TP. Math Intell 2011; 33: 85-91.
[14] Pietronero L, Tosatti E, Tosatti V, Vespignani A. Phys A 2001; 293: 297-304.
[15] Engel HA, Leuenberger Ch. Stat Probabil Lett 2003; 63: 361-365.
[16] Fewster RM. Am Stat 2009; 63: 26-32.
[17] Ausloos M, Herteliu C, Ileanu B. Phys A 2015; 419: 736-745.
[18] Durtschi C, Hillison W, Pacini C. JFA 2004; 5: 17-34.
[19] Günnel S, Tödter KH. Empirica 2009; 36: 273-292.
[20] Li Q, Fu Z. Commun Nonlinear Sci Numer Simulat 2016; 33: 91-98.
[21] Bormashenko Ed, Shulzinger E, Whyman G, Bormashenko Ye. Phys A 2016; 444: 524-529.