Descriptive Statistics II. Graphical summary of the distribution of a numerical variable. Boxplot

MAT 2379 (Spring 2012)

We will present two types of graphs that can be used to describe the distribution of a numerical variable:
- a boxplot;
- a histogram.

Boxplot

The boxplot is a useful visual tool, introduced by John Tukey in 1977, for describing the distribution of a numerical variable. The diagram helps us describe the central tendency and the dispersion of the distribution. It is useful for comparing the central tendency and the dispersion of several groups. It also helps us identify outlying values (i.e. atypical values).

We draw a box from the first quartile Q1 to the third quartile Q3, with a line cutting the box at the median. The distance from Q1 to Q3 is the interquartile range (IQR = Q3 - Q1), which is a measure of dispersion. The median is a measure of central tendency.

To the left of the box, we draw a stem (also called a whisker) down to the smallest value that is within 1.5 times the interquartile range of Q1. To the right, we draw a whisker up to the largest value that is within 1.5 times the interquartile range of Q3.

Each value that lies more than 1.5 × IQR to the left of Q1 or to the right of Q3 is plotted as an individual point. We call these values outliers or atypical values.

[Figure: schematic boxplot showing Q1, the median and Q3, the box of width IQR, and a distance of 1.5 IQR marked on each side of the box.]

Example 3: Consider the radish growth from Example 2. The data are in the file RadishGrowth.txt; see the update of May 24 on the course Web page. To construct the boxplot, use the following command in Minitab (here we suppose that the data are in column C1):

MTB > boxplot c1.

Here is the output:

[Figure: boxplot of radish growth, Minitab output.]

Now suppose that we would like to compare the distribution of growth for radishes kept in darkness with the distribution of growth for radishes that were given 12 hours of light per day. Here are the data for the radishes that received light:

3 10 15 17 18 18 18 20 20 25 25 25 28 29

The data can also be found in the file radishii.txt; see the update of May 29 on the Web page. We will assume that the two variables are in columns C1 and C2 of Minitab. Here are the commands to produce side-by-side boxplots:

MTB > boxplot c1 c2;
SUBC> overlay.

Here is the graph:

[Figure: side-by-side boxplots of radish growth in darkness and with 12 hours of light per day.]
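As a numerical companion to the boxplot, the quantities described above (quartiles, IQR, whisker ends and outliers) can be computed directly for the light group. The following is a minimal sketch in Python, used here only for illustration and not part of the course's Minitab workflow. The function name boxplot_summary is hypothetical. Quartiles are computed with the (n + 1)p position rule of Python's statistics module; packages use different quartile conventions, so the values may differ slightly from another package's output.

```python
import statistics


def boxplot_summary(values):
    """Return the quantities needed to draw a boxplot by hand."""
    data = sorted(values)
    # Quartiles by the (n + 1)p position rule (the module's default,
    # "exclusive" method). Other conventions give slightly different values.
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    # Limits located 1.5 * IQR beyond each end of the box.
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        # Whiskers stop at the most extreme values still inside the limits.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Growth of the radishes that received 12 hours of light per day.
light = [3, 10, 15, 17, 18, 18, 18, 20, 20, 25, 25, 25, 28, 29]
summary = boxplot_summary(light)
for name, value in summary.items():
    print(name, value)
```

Under this quartile convention the sketch gives Q1 = 16.5, median = 19 and Q3 = 25, so IQR = 8.5 and the lower limit is 16.5 - 1.5 × 8.5 = 3.75. The value 3 is the only one outside the limits, and the left whisker stops at 10.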

Discussion: The central tendencies of growth are similar for both groups. But the growth of the radishes that receive light is less variable (i.e. less dispersed) than the growth of the radishes kept in the dark. In addition, one growth value in the group of radishes that receive light is an outlier in comparison to the other growths in its group. This growth of 3 mm is atypical.

Remarks: This is an example of data from an experiment. The response variable is the growth after 3 days (in mm). The factor (or explanatory variable) is access to light. The light factor has two levels: access to light 12 hours a day and total darkness. We call the levels of the factor treatments. The radish seedlings are the experimental units. We randomly assign treatments to these basic units of study. If the conditional distribution of the response varies from treatment to treatment, then we say that there is a treatment effect.

We have data from one experiment; however, these data can vary from experiment to experiment or from sample to sample. We will learn in this course how to determine whether a treatment effect is significant by taking into account the sample-to-sample variability.

Histogram

The frequency distribution of a numerical variable can be displayed with a histogram.

Construction of a histogram:
1. Divide the horizontal axis into sub-intervals (preferably of equal length). Each sub-interval represents a range of values for the random variable. It is often suggested to use between 5 and 20 classes. Often, number of classes ≈ √n works well, where n is the number of observations.
2. Different statistical packages use different techniques to determine the number of sub-intervals. However, the default often works well.
3. Terminology: a sub-interval is often called a bin.
4. For each bin, erect a rectangle whose height is equal to either the frequency, the relative frequency or the density.
5. If you use the density, that is, density = relative frequency / length of bin, then the area of the rectangle, which is density × length of bin, is equal to the relative frequency (i.e. probability).

A histogram is used to describe the shape of the distribution of a numerical variable. Below are examples of histograms that are, respectively, approximately symmetric, skewed to the right and skewed to the left. The asymmetry is in the direction of the atypical values.

[Figure: three histograms, labelled "approximately symmetric", "skewed to the right" and "skewed to the left".]
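The construction steps above can be sketched in Python, used here only for illustration and not part of the course's Minitab workflow. The sketch uses the light-group growth data from Example 3, equal-length bins, and the √n rule for the number of bins. The function name histogram_table is hypothetical, and the bin edges are an assumption of this sketch (they run from the smallest to the largest value), so they will generally not match the default bins of a statistical package.

```python
import math


def histogram_table(values, n_bins=None):
    """Return one row per bin: edges, frequency, relative frequency, density."""
    data = sorted(values)
    n = len(data)
    if n_bins is None:
        # Rule of thumb from the notes: about sqrt(n) classes.
        n_bins = max(1, round(math.sqrt(n)))
    low, high = data[0], data[-1]
    if low == high:
        raise ValueError("all values are equal; bins of positive length are needed")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Bins are [left, right), except the last one, which also
        # includes its right end so that the maximum is counted.
        k = min(int((x - low) / width), n_bins - 1)
        counts[k] += 1
    rows = []
    for k, count in enumerate(counts):
        relative = count / n
        rows.append({
            "left": low + k * width,
            "right": low + (k + 1) * width,
            "frequency": count,
            "relative_frequency": relative,
            # density = relative frequency / length of bin
            "density": relative / width,
        })
    return rows


# Growth of the radishes that received 12 hours of light per day.
light = [3, 10, 15, 17, 18, 18, 18, 20, 20, 25, 25, 25, 28, 29]
table = histogram_table(light)
for row in table:
    print(row)

# The rectangle areas (density * length of bin) add up to 1.
total_area = sum(r["density"] * (r["right"] - r["left"]) for r in table)
print("total area:", total_area)
```

With these 14 values, the √n rule gives 4 bins of length 6.5, with frequencies 1, 2, 6 and 5. Dividing each relative frequency by the bin length gives the density, so the area of each rectangle is again the relative frequency and the total area is 1, up to rounding.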

Example 4: Consider the following histograms. Identify the distributions that are skewed to the right or skewed to the left. Describe the skewness as being weak or strong. Also identify the histograms that are approximately symmetric.

Example 5: Consider the radishes that receive 12 hours of light per day from Example 3. The data can also be found in the file radishii.txt; see the update of May 29 on the Web page. We will construct the histogram for the growth of these radishes. Here we assume that the data are in column C2. Here is the command:

MTB > histogram c2.

Here are the results.

This histogram is a histogram of frequencies. We can also produce a probability histogram (relative frequencies, in percent) with the following commands:

MTB > histogram c2;
SUBC> percent.

Here is the graph.

This histogram is a histogram of relative frequencies (in percent). We can also produce a density histogram with the following commands:

MTB > histogram c2;
SUBC> density.

Here is the graph.