Statistics, Probability and Noise

Statistics, Probability and Noise Claudia Feregrino-Uribe & Alicia Morales-Reyes Original material: Rene Cumplido Autumn 2015, CCC-INAOE

Contents
Signal and graph terminology
Mean and standard deviation
Signal versus underlying process
The histogram, pmf and pdf
The normal distribution
Digital noise generation
Precision and accuracy

Introduction
Statistics and probability are used in DSP to characterize signals and the processes that generate them. For example, a primary use of DSP is to reduce interference, noise, and other undesirable components in acquired data. These components may be an inherent part of the signal being measured, arise from imperfections in the data acquisition system, or be introduced as an unavoidable byproduct of some DSP operation. Statistics and probability allow these disruptive features to be measured and classified, which aids in developing strategies to remove the offending components.

Signal and Graph Terminology
The vertical axis may represent voltage, light intensity, sound pressure, etc. Since we don't know what it represents in this particular case, we will label it amplitude. This parameter also goes by several other names: the y-axis, the dependent variable, the range, and the ordinate.

Signal and Graph Terminology
The horizontal axis represents the other parameter of the signal, going by such names as the x-axis, the independent variable, the domain, and the abscissa. Time is the most common parameter to appear on the horizontal axis of acquired signals, but other parameters are used in specific applications, e.g. rock density measured at equally spaced distances along the surface of the earth. In general, label the horizontal axis sample number. If this were a continuous signal, the label would instead be, e.g., time, distance, x, etc.

Signal and Graph Terminology
The two parameters that form a signal are generally not interchangeable. The parameter on the y-axis (the dependent variable) is said to be a function of the parameter on the x-axis (the independent variable). The independent variable describes how or when each sample is taken, while the dependent variable is the actual measurement. Given a specific value on the x-axis, we can find the corresponding value on the y-axis, but usually not the other way around.

Signal and Graph Terminology
Domain is a very widely used term in DSP. A signal that uses time as the independent variable is said to be in the time domain. Another common signal in DSP uses frequency as the independent variable, resulting in the term frequency domain. Signals that use distance as the independent parameter are said to be in the spatial domain. If the x-axis is labeled with something generic like sample number, the signal is still referred to as being in the time domain.

Signal and Graph Terminology
Although the signals in previous figures are discrete, they are displayed in this figure as continuous lines, because there are too many samples to be distinguishable if they were displayed as individual markers. In graphs that portray shorter signals (fewer than 100 samples) the individual markers are usually shown. Continuous lines may or may not be drawn to connect the markers. A continuous line could imply what is happening between samples, or simply be an aid to help the reader's eye follow a trend in noisy data. Examine the labeling of the horizontal axis to find out whether you are working with a discrete or a continuous signal.

Signal and Graph Terminology
Sampling notation: the variable N is widely used in DSP to represent the total number of samples in a signal. Each sample is assigned a sample number or index. These are the numbers that appear along the horizontal axis. Two notations for assigning sample numbers are commonly used:
1) Sample indexes run from 1 to N (e.g., 1 to 512). This is the usual convention in mathematics.
2) Sample indexes run from 0 to N−1 (e.g., 0 to 511). This is the usual convention in DSP.

Mean
The mean, indicated by µ, is the statistician's jargon for the average value of a signal. Add all of the samples together, and divide by N. Mathematically:
$$\mu = \frac{1}{N} \sum_{i=0}^{N-1} x_i$$

Standard Deviation
The standard deviation is a measure of how far the signal fluctuates from the mean. It is obtained by averaging the squares of the differences between each sample and the mean (dividing by N−1 rather than N), and then taking the square root to compensate for the initial squaring. In equation form:
$$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=0}^{N-1} (x_i - \mu)^2}$$
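
The two definitions can be sketched directly in code. This is a minimal illustration in Python, using 0-based sample indexes and the N−1 divisor; the function names and the sample values are chosen here for illustration and do not come from the slides.

```python
import math

def mean(x):
    """Mean: add all of the samples together and divide by N."""
    return sum(x) / len(x)

def std_dev(x):
    """Standard deviation using the N - 1 divisor."""
    mu = mean(x)
    n = len(x)
    # Sum the squared deviations from the mean, divide by N - 1,
    # then take the square root to undo the squaring.
    return math.sqrt(sum((xi - mu) ** 2 for xi in x) / (n - 1))

# Hypothetical 8-sample signal used only as a worked example.
signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(signal), std_dev(signal))
```

For this signal the squared deviations sum to 32, so the standard deviation is the square root of 32/7.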

Common waveforms
Figure: ratio of the peak-to-peak amplitude to the standard deviation for several common waveforms.
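
The ratio for the deterministic waveforms can be checked numerically. The sketch below samples one period of a sine, square, and triangle wave (the waveform choice and the sample count are assumptions made here, not taken from the figure) and divides the peak-to-peak amplitude by the standard deviation. Analytically the ratios are 2√2 for the sine wave, 2 for the square wave, and √12 for the triangle wave.

```python
import math

def peak_to_peak_over_sigma(x):
    """Peak-to-peak amplitude divided by the standard deviation (N - 1 divisor)."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))
    return (max(x) - min(x)) / sigma

N = 10000  # samples in exactly one period (illustrative choice)
sine = [math.sin(2 * math.pi * i / N) for i in range(N)]
square = [1.0 if i < N // 2 else -1.0 for i in range(N)]
triangle = [4 * abs(i / N - 0.5) - 1 for i in range(N)]

for name, wave in (("sine", sine), ("square", square), ("triangle", triangle)):
    print(name, peak_to_peak_over_sigma(wave))
```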

Running statistics
It is often desirable to recalculate the mean and standard deviation as new samples are acquired and added to the signal. This type of calculation is called running statistics. Only three quantities need to be kept up to date:
1) N, the total number of samples
2) sum, the sum of these samples
3) sum of squares, the sum of the squares of the samples
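
A minimal sketch of running statistics follows. It relies on the algebraic identity σ² = (sum of squares − sum²/N) / (N−1), which is equivalent to the N−1 definition of the standard deviation; the class and method names are hypothetical. Note that this form can lose precision to round-off when the mean is much larger than the standard deviation.

```python
import math

class RunningStats:
    """Keeps N, the sum, and the sum of squares so statistics update per sample."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.sum_squares = 0.0

    def add(self, x):
        # Update the three running quantities with the new sample.
        self.n += 1
        self.total += x
        self.sum_squares += x * x

    def mean(self):
        return self.total / self.n

    def std_dev(self):
        # sigma^2 = (sum_squares - total^2 / N) / (N - 1); needs N >= 2.
        var = (self.sum_squares - self.total ** 2 / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))  # guard against tiny negative round-off

stats = RunningStats()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(sample)
print(stats.mean(), stats.std_dev())
```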

Signal-to-Noise Ratio and Coefficient of Variation
In some situations, the mean describes what is being measured, while the standard deviation represents noise and other interference. The standard deviation is then not important in itself, but only in comparison to the mean. This gives rise to the term signal-to-noise ratio (SNR), which is the mean divided by the standard deviation. Another term used is the coefficient of variation (CV), which is the standard deviation divided by the mean, multiplied by 100 percent. For example, a signal with a CV of 2% has an SNR of 50. Better data means a higher value for the SNR and a lower value for the CV.
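
The relation between the two measures can be written out as a short sketch. The function names and the example values (a mean of 50 and a standard deviation of 1) are assumptions picked to reproduce the CV = 2%, SNR = 50 case.

```python
def snr(mean, std_dev):
    """Signal-to-noise ratio: mean divided by standard deviation."""
    return mean / std_dev

def cv_percent(mean, std_dev):
    """Coefficient of variation: standard deviation over mean, in percent."""
    return 100.0 * std_dev / mean

# Hypothetical measurement: mean of 50 units with a standard deviation of 1 unit.
print(snr(50.0, 1.0), cv_percent(50.0, 1.0))
```

The two quantities carry the same information: SNR = 100 / CV when the CV is expressed in percent.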

Signal versus Underlying Process
Statistics is the science of interpreting numerical data, such as acquired signals. Probability is used in DSP to understand the processes that generate signals. In DSP it is important to distinguish the acquired signal from the underlying process.

Signal versus Underlying Process
The probabilities of the underlying process are constant, but the statistics of the acquired signal change each time the experiment is repeated. For example, consider a signal created by flipping a coin 1000 times, with heads recorded as one and tails as zero. The process that created this signal has a mean of exactly 0.5 (50% heads, 50% tails). The actual 1000-point signal will not necessarily have a mean of exactly 0.5: random chance will make the number of ones and zeros slightly different each time the signal is generated. This random irregularity found in actual data is called by such names as statistical variation, statistical fluctuation, and statistical noise.
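
The coin-flip experiment is easy to simulate. The sketch below repeats the 1000-flip experiment five times and prints the mean of each acquired signal; the seed and the number of repetitions are arbitrary choices made here. The process mean stays at 0.5, while each printed value is only an estimate of it.

```python
import random

def coin_flip_signal(n, rng):
    """Return n samples: 1 for heads, 0 for tails, each with probability 0.5."""
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]

rng = random.Random(2015)  # fixed seed so the run is repeatable
means = [sum(coin_flip_signal(1000, rng)) / 1000 for _ in range(5)]
print(means)  # each value is near 0.5, but generally not exactly 0.5
```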

The Histogram, Pmf and Pdf
The histogram displays the number of samples in a signal that have each of the possible values. A histogram is represented by $H_i$, where i is an index that runs from 0 to M−1 and M is the number of possible values each sample can take. For example, $H_{50}$ is the number of samples that have a value of 50. The sum of all values in the histogram is equal to the number of points in the signal.
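
As a minimal sketch, a histogram of an integer-valued signal can be built with one counter per possible value. The function name and the tiny example signal (M = 4 possible values) are illustrative assumptions.

```python
def histogram(signal, m):
    """Return H, where H[i] counts the samples equal to i, for i in 0..m-1."""
    h = [0] * m
    for sample in signal:
        h[sample] += 1
    return h

signal = [3, 1, 3, 0, 2, 3, 1]  # hypothetical signal with values in 0..3
h = histogram(signal, 4)
print(h)       # [1, 2, 1, 3]
print(sum(h))  # 7, the number of points in the signal
```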

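The Histogram, Pmf and Pdf
The mean and standard deviation can be calculated from the histogram. This is an efficient way to calculate the mean and standard deviation of very large data sets, such as images, because the statistics are calculated per group of samples that share a value rather than per individual sample.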
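
A sketch of the histogram-based calculation is given below. It assumes the standard grouped forms µ = (1/N) Σ i·H_i and σ² = (1/(N−1)) Σ (i−µ)²·H_i, with the sums running over i = 0 to M−1; these formulas and the function name are stated here as assumptions, since the slide's own equations were not preserved.

```python
import math

def stats_from_histogram(h):
    """Mean and standard deviation of a signal, computed from its histogram h."""
    n = sum(h)  # total number of samples
    # Each value i contributes h[i] times, so weight it by its count.
    mu = sum(i * count for i, count in enumerate(h)) / n
    var = sum((i - mu) ** 2 * count for i, count in enumerate(h)) / (n - 1)
    return mu, math.sqrt(var)

h = [1, 2, 1, 3]  # hypothetical histogram: 7 samples with values 0..3
mu, sigma = stats_from_histogram(h)
print(mu, sigma)
```

The loop length is M, the number of possible values, rather than N, the number of samples, which is where the saving comes from when N is much larger than M.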

Probability mass function (pmf)
Important: the acquired signal is a noisy version of the underlying process. The histogram is formed from an acquired signal and is calculated using a finite number of samples. The corresponding curve for the underlying process is the probability mass function (pmf), which is what would be obtained with an infinite number of samples. The pmf can be estimated (inferred) from the histogram by normalization, that is, by dividing each histogram value by the total number of samples, or it may be deduced by some mathematical technique. The pmf is important because it describes the probability that a certain value will be generated. It applies to discrete data only.
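
The normalization step can be sketched as follows; the function name and the example histogram are illustrative assumptions. The result is only an estimate of the pmf, since it comes from a finite number of samples.

```python
def pmf_from_histogram(h):
    """Estimate the pmf by dividing each histogram value by the number of samples."""
    n = sum(h)
    return [count / n for count in h]

h = [1, 2, 1, 3]  # hypothetical histogram of a 7-sample signal
pmf = pmf_from_histogram(h)
print(pmf)
print(sum(pmf))  # the estimated probabilities add up to one
```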

Probability density function (pdf)
The pdf, also called the probability distribution function, is to continuous signals what the pmf is to discrete signals. How can we calculate a probability from it? The pdf's vertical axis is in units of probability density, rather than just probability, so a probability is obtained by integrating the density over a range of values. For example, a pdf of 0.03 at 120.5 does not mean that a voltage of 120.5 millivolts will occur 3% of the time. The probability of exactly 120.5 millivolts occurring is infinitesimally small, because there is an infinite number of possible signal values nearby: 120.49997, 120.49998, 120.49999, etc.

To remember
The histogram, pmf, and pdf are very similar concepts. Try not to confuse them.
The total area under the pdf curve, the integral from −∞ to +∞, is always equal to one.
The sum of all of the pmf values is equal to one.
The sum of all of the histogram values is equal to N.

Figure: examples of continuous waveforms and their pdfs.

The Normal Distribution
Signals formed from random processes usually have a bell-shaped pdf. This is called a normal distribution, a Gauss distribution, or a Gaussian, after the German mathematician Karl F. Gauss (1777-1855). The basic shape of the curve can be generated by:
$$y(x) = e^{-x^2}$$

The Normal Distribution
The raw bell-shaped curve can be converted into the complete Gaussian by adding an adjustable mean, µ, and standard deviation, σ. The equation must also be normalized so that the total area under the curve is equal to one. This gives the general form of the normal distribution:
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}$$
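
The normalization can be checked numerically. The sketch below evaluates the standard normal-distribution formula, P(x) = exp(−(x−µ)²/2σ²) / (√(2π)·σ), and sums the area under it with a simple midpoint rule. The function names, the ±10σ integration limits, and the step count are assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """General form of the normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def area_under_pdf(mu, sigma, steps=20000):
    """Midpoint-rule approximation of the area from mu - 10 sigma to mu + 10 sigma."""
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    dx = (hi - lo) / steps
    return sum(gaussian_pdf(lo + (k + 0.5) * dx, mu, sigma) for k in range(steps)) * dx

print(area_under_pdf(0.0, 1.0))  # close to 1
print(area_under_pdf(3.0, 2.0))  # close to 1 for any mean and standard deviation
```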

Cumulative distribution function (cdf)
Integrating the pdf gives the probability that a signal will be within a certain range of values. The integral of the pdf is called the cumulative distribution function (cdf), Φ(x). The Gaussian's integral is calculated by numerical integration: the continuous Gaussian curve is sampled very finely, from −10σ to +10σ, and the samples of this discrete signal are added to simulate integration.
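
The numerical integration can be sketched as a running sum over fine samples of the standard normal pdf (mean 0, standard deviation 1), starting at −10σ. The function name `phi`, the step size, and the midpoint sampling are assumptions made for this sketch; `math.erf` is used only as an independent cross-check.

```python
import math

def phi(x, step=0.001, lower=-10.0):
    """Approximate the standard normal cdf by summing fine samples of the pdf."""
    if x <= lower:
        return 0.0
    n = max(1, int(round((x - lower) / step)))
    dx = (x - lower) / n
    total = 0.0
    for k in range(n):
        x_mid = lower + (k + 0.5) * dx  # sample at the middle of each strip
        total += math.exp(-x_mid ** 2 / 2) / math.sqrt(2 * math.pi)
    return total * dx

# Probability that a sample falls within one standard deviation of the mean.
print(phi(1.0) - phi(-1.0))
# Cross-check against the closed form based on the error function.
print(0.5 * (1 + math.erf(1.0 / math.sqrt(2))) - 0.5 * (1 + math.erf(-1.0 / math.sqrt(2))))
```

The probability of falling within a range is the difference of two cdf values, which is why a single table of Φ(x) is enough for any range.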

Cumulative distribution function (cdf)
Figure: Φ(x), the cumulative distribution function of the normal distribution (mean = 0, standard deviation = 1).

Digital noise generation
Random noise is an important topic in both electronics and DSP. For example, it limits how small a signal an instrument can measure, the distance over which a radio system can communicate, and how much radiation is required to produce an x-ray image. A common need in DSP is to generate signals that resemble various types of random noise. This is required to test the performance of algorithms that must work in the presence of noise.

Random number generator
The heart of digital noise generation is the random number generator. Most programming languages have this as a standard function. Each random number has a value between zero and one, with an equal probability of being anywhere between these two extremes. The mean of the underlying process that generates this signal is 0.5, the standard deviation is $1/\sqrt{12} \approx 0.29$, and the distribution is uniform between zero and one.
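
These values can be checked empirically. The sketch below uses Python's standard `random` module as the random number generator; the seed and the sample count are arbitrary choices made here. The measured statistics only approximate the process values, since they come from a finite signal.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [rng.random() for _ in range(100000)]  # uniform in [0, 1)

mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / (len(samples) - 1))

print(mu)                        # close to 0.5
print(sigma, 1 / math.sqrt(12))  # both close to 0.29
```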

Digital noise generation (Gaussian)
There are two methods for generating normally distributed signals using a random number generator. To introduce the first, consider a signal obtained by adding two random numbers to form each sample, i.e., X = RND + RND. Since each of the random numbers can run from zero to one, the sum can run from zero to two. The mean is now one, and the standard deviation is $1/\sqrt{6}$. The pdf has changed from a uniform distribution to a triangular distribution: the signal spends more of its time around a value of one, with less time spent near zero or two.
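
A short simulation of X = RND + RND shows the triangular shape. The sketch below counts samples in four equal-width bins over the range zero to two; the bin width, seed, and sample count are arbitrary choices made for illustration.

```python
import math
import random

rng = random.Random(3)
x = [rng.random() + rng.random() for _ in range(100000)]  # X = RND + RND

mu = sum(x) / len(x)
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))

# Coarse histogram with bins [0, 0.5), [0.5, 1), [1, 1.5), [1.5, 2).
bins = [0, 0, 0, 0]
for v in x:
    bins[min(int(v / 0.5), 3)] += 1

print(mu, sigma)  # close to 1 and to 1/sqrt(6)
print(bins)       # the two middle bins hold far more samples than the outer two
```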

Digital noise generation (Gaussian)
Taking this idea a step further, add twelve random numbers to produce each sample. The mean is now six and the standard deviation is one. The pdf has virtually become a Gaussian. This procedure can be used to create a normally distributed noise signal with an arbitrary mean and standard deviation. For each sample in the signal:
1) add twelve random numbers
2) subtract six to make the mean equal to zero
3) multiply by the desired standard deviation
4) add the desired mean
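
The four steps can be sketched as follows. The function name, the seed, and the target mean and standard deviation in the example are assumptions made for illustration.

```python
import math
import random

def gaussian_noise_sum12(n, mean, std_dev, rng):
    """Generate n approximately Gaussian samples by summing twelve uniform numbers."""
    out = []
    for _ in range(n):
        x = sum(rng.random() for _ in range(12))  # 1) add twelve random numbers
        x -= 6.0                                  # 2) subtract six: mean becomes zero
        x *= std_dev                              # 3) scale to the desired std. dev.
        x += mean                                 # 4) shift to the desired mean
        out.append(x)
    return out

rng = random.Random(12)
noise = gaussian_noise_sum12(50000, mean=3.0, std_dev=2.0, rng=rng)
mu = sum(noise) / len(noise)
sigma = math.sqrt(sum((v - mu) ** 2 for v in noise) / (len(noise) - 1))
print(mu, sigma)  # close to 3 and 2
```

One limitation of this approach is that the output can never fall more than six standard deviations from the mean, because the sum of twelve numbers between zero and one is bounded by zero and twelve.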

Central Limit Theorem
The mathematical basis for this algorithm is contained in the Central Limit Theorem, one of the most important concepts in probability. In its simplest form, the Central Limit Theorem states that a sum of random numbers becomes normally distributed as more and more of the random numbers are added together. The Central Limit Theorem does not require the individual random numbers to be from any particular distribution, or even that the random numbers be from the same distribution. It provides the reason why normally distributed signals are seen so widely in nature: whenever many different random forces are interacting, the resulting pdf becomes a Gaussian.
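
The theorem can be illustrated with random numbers that are far from uniform. The sketch below uses exponentially distributed numbers, which have a strongly lopsided pdf (skewness of 2), and measures how the skewness shrinks toward the Gaussian value of zero when 30 of them are summed. The choice of distribution, the number of terms, and the use of skewness as a yardstick are assumptions made for this illustration.

```python
import random

def skewness(x):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m3 = sum((v - mu) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

rng = random.Random(34)
single = [rng.expovariate(1.0) for _ in range(20000)]
summed = [sum(rng.expovariate(1.0) for _ in range(30)) for _ in range(20000)]

print(skewness(single))  # strongly skewed, near 2
print(skewness(summed))  # much closer to 0, the value for a Gaussian
```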

Digital noise generation: second method
A random number generator is invoked twice, to obtain R1 and R2. A normally distributed random number, X, can then be found from:
$$X = \sqrt{-2 \ln R_1}\; \cos(2\pi R_2)$$
Just as before, this approach can generate normally distributed random signals with an arbitrary mean and standard deviation: take each number generated by this equation, multiply it by the desired standard deviation, and add the desired mean.
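
A minimal sketch of this second method uses the Box-Muller form X = sqrt(−2 ln R1) · cos(2π R2). The function name and example parameters are hypothetical. One detail is an assumption added here: Python's `random()` can return exactly zero, so R1 is taken as one minus the generated number to keep the logarithm defined.

```python
import math
import random

def gaussian_noise_two_rnd(mean, std_dev, rng):
    """Return one normally distributed number from two uniform random numbers."""
    r1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    r2 = rng.random()
    x = math.sqrt(-2.0 * math.log(r1)) * math.cos(2.0 * math.pi * r2)
    return x * std_dev + mean  # scale and shift to the desired statistics

rng = random.Random(35)
noise = [gaussian_noise_two_rnd(3.0, 2.0, rng) for _ in range(50000)]
mu = sum(noise) / len(noise)
sigma = math.sqrt(sum((v - mu) ** 2 for v in noise) / (len(noise) - 1))
print(mu, sigma)  # close to 3 and 2
```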

Random number generators
Random number generators operate by starting with a seed, a number between zero and one. When the random number generator is invoked, the seed is passed through a fixed algorithm, resulting in a new number between zero and one. This new number is reported as the random number. It is then stored internally to be used as the seed the next time the random number generator is called. The algorithm that transforms the seed, S, into the new random number, R, is often of the form:
$$R = (aS + b) \bmod c$$
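
One widely used algorithm of this seed-to-seed kind is the linear congruential generator, sketched below. Several details are assumptions made here rather than taken from the slides: the state is kept as an integer and divided by the modulus to report a number between zero and one, and the constants a = 1664525, b = 1013904223, c = 2^32 are a commonly published choice. The class and method names are hypothetical.

```python
class LinearCongruential:
    """Linear congruential generator: new_state = (A * state + B) mod C."""

    A = 1664525
    B = 1013904223
    C = 2 ** 32

    def __init__(self, seed):
        self.state = seed % self.C  # integer seed, stored as the internal state

    def next(self):
        # Pass the stored seed through the fixed algorithm, keep the result
        # as the seed for the next call, and report it scaled to [0, 1).
        self.state = (self.A * self.state + self.B) % self.C
        return self.state / self.C

gen_a = LinearCongruential(seed=42)
gen_b = LinearCongruential(seed=42)
seq_a = [gen_a.next() for _ in range(10000)]
seq_b = [gen_b.next() for _ in range(10000)]
print(seq_a[:3])
print(seq_a == seq_b)  # True: the same seed always gives the same sequence
```

Because the sequence is fully determined by the seed, such numbers are pseudo-random rather than truly random, which is convenient for repeatable tests.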

Precision and Accuracy
Precision and accuracy are terms used to describe systems and methods that measure, estimate, or predict. In all these cases, we wish to know the value of some parameter. This is called the true value, or simply, truth. The method provides a measured value that you want to be as close to the true value as possible. Precision and accuracy are ways of describing the error that can exist between these two values.

Precision and Accuracy
Consider an oceanographer measuring water depth using a sonar system. The mean occurs at the center of the distribution of measurements and is the best estimate of the depth based on all measured data. The standard deviation defines the distribution's width, that is, how much variation occurs between successive measurements.
Figure: good accuracy with poor precision (poor repeatability).

Precision and Accuracy
Precision is a measure of random noise; accuracy is a measure of calibration. Poor accuracy results from systematic errors such as bad calibration, for example in how the sonar's echo time is converted to distance. When deciding which name to call the problem, ask yourself two questions.
First: will averaging successive readings provide a better measurement? If yes, call the error precision; if no, call it accuracy.
Second: will calibration correct the error? If yes, call it accuracy; if no, call it precision.
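
The two questions can be illustrated with a small simulation. In the sketch below, every number is hypothetical: a true depth of 1000 m, one sonar with large random noise and no offset, and another with small noise and a constant 10 m offset. Averaging helps the first but not the second, while calibration (subtracting the known offset) fixes the second.

```python
import random

def measure(true_value, bias, noise_std, n, rng):
    """Simulate n readings: truth plus a constant bias plus Gaussian noise."""
    return [true_value + bias + rng.gauss(0.0, noise_std) for _ in range(n)]

def average(x):
    return sum(x) / len(x)

rng = random.Random(7)
TRUE_DEPTH = 1000.0  # hypothetical true value, in meters

# Poor precision, good accuracy: large random noise, no systematic offset.
noisy = measure(TRUE_DEPTH, bias=0.0, noise_std=5.0, n=10000, rng=rng)
# Good precision, poor accuracy: small noise, constant 10 m offset.
biased = measure(TRUE_DEPTH, bias=10.0, noise_std=0.1, n=10000, rng=rng)

print(abs(average(noisy) - TRUE_DEPTH))   # small: averaging removed the noise
print(abs(average(biased) - TRUE_DEPTH))  # still about 10: averaging cannot help
calibrated = [m - 10.0 for m in biased]   # calibration removes the offset
print(abs(average(calibrated) - TRUE_DEPTH))
```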