Frequency Distribution and Graphs

Chapter 2 Frequency Distribution and Graphs

2.1 Organizing Qualitative Data

Definition 2.1.1 A categorical frequency distribution lists the number of occurrences for each category of data.

Example 2.1.1 For the following table, construct a frequency distribution of the state of birth of American presidents.

President        State of Birth    President        State of Birth
Washington       Virginia          Lincoln          Kentucky
J. Adams         Massachusetts     A. Johnson       North Carolina
Jefferson        Virginia          Grant            Ohio
Madison          Virginia          Hayes            Ohio
Monroe           Virginia          Garfield         Ohio
J. Q. Adams      Massachusetts     Arthur           Vermont
Jackson          South Carolina    Cleveland        New Jersey
Van Buren        New York          B. Harrison      Ohio
W. H. Harrison   Virginia          Cleveland        New Jersey
Tyler            Virginia          McKinley         Ohio
Polk             North Carolina    T. Roosevelt     New York
Taylor           Virginia          Taft             Ohio
Fillmore         New York          Wilson           Virginia
Pierce           New Hampshire     Harding          Ohio
Buchanan         Pennsylvania

State            Tally    Frequency
Virginia
Massachusetts
South Carolina
New York
North Carolina
New Hampshire
Pennsylvania
Kentucky
Ohio
Vermont
New Jersey

Definition 2.1.2 The relative frequency is the proportion or percent of observations within a category and is found using the formula

\[ \text{Relative frequency} = \frac{\text{frequency}}{\text{sum of all frequencies}}. \]

A relative frequency distribution lists the relative frequency of each category of data.

Example 2.1.2 For the frequency distribution in the previous example, construct the relative frequency distribution.

State            Relative Frequency
Virginia
Massachusetts
South Carolina
New York
North Carolina
New Hampshire
Pennsylvania
Kentucky
Ohio
Vermont
New Jersey

Once the raw data is summarized into tables, we can create graphs.
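As a check on hand tallies, the counting in Examples 2.1.1 and 2.1.2 can be sketched in Python. This is only an illustration: the variable names are made up for this sketch, and the list simply transcribes the "State of Birth" entries from the table in Example 2.1.1 (left column first, then right column), with Cleveland counted twice because the table lists him twice.

```python
from collections import Counter

# State of birth for each row of the table in Example 2.1.1
# (left-hand column first, then right-hand column).
birth_states = [
    "Virginia", "Massachusetts", "Virginia", "Virginia", "Virginia",
    "Massachusetts", "South Carolina", "New York", "Virginia", "Virginia",
    "North Carolina", "Virginia", "New York", "New Hampshire", "Pennsylvania",
    "Kentucky", "North Carolina", "Ohio", "Ohio", "Ohio",
    "Vermont", "New Jersey", "Ohio", "New Jersey", "Ohio",
    "New York", "Ohio", "Virginia", "Ohio",
]

# Frequency: number of occurrences of each category.
frequency = Counter(birth_states)

# Relative frequency: frequency divided by the sum of all frequencies.
total = sum(frequency.values())
relative_frequency = {state: count / total for state, count in frequency.items()}

for state, count in frequency.items():
    print(f"{state:<15} {count:>3}   {relative_frequency[state]:.3f}")
```

The relative frequencies should sum to 1 (up to rounding), which is a quick sanity check on any relative frequency distribution.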

Definition 2.1.3 A bar graph is constructed by labelling each category of data on a horizontal axis and the frequency or relative frequency of the category on the vertical axis. A rectangle of equal width is drawn for each category. The height of the rectangle is equal to the category's frequency or relative frequency.

Example 2.1.3 Construct a bar graph for the frequency of the birth states of American Presidents.

Definition 2.1.4 A Pareto chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.

Example 2.1.4 Construct a Pareto chart for the frequency of the birth states of American Presidents.
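The difference between a bar graph and a Pareto chart is only the ordering of the bars. A minimal text-based sketch, assuming the frequencies below (tallied by hand from the table in Example 2.1.1) and using rows of `#` characters in place of drawn rectangles:

```python
# Frequencies tallied from the table in Example 2.1.1 (an assumption of this
# sketch; recount them if the table changes).
frequency = {
    "Virginia": 8, "Massachusetts": 2, "South Carolina": 1, "New York": 3,
    "North Carolina": 2, "New Hampshire": 1, "Pennsylvania": 1,
    "Kentucky": 1, "Ohio": 7, "Vermont": 1, "New Jersey": 2,
}

# Pareto chart: the same bars, sorted in decreasing order of frequency.
# sorted() is stable, so tied categories keep their original relative order.
pareto_order = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

for state, count in pareto_order:
    # Each '#' stands for one observation; bar length plays the role of height.
    print(f"{state:<15} {'#' * count} ({count})")
```

Printing `frequency.items()` in its original order instead of `pareto_order` would give the plain bar graph of Example 2.1.3.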

If you wish to compare two data sets, you could draw a side-by-side bar graph, in which the bars for the same category are drawn side by side (with no space between them) and spaces are left between the categories. Refer to Example 4 on Page 57 in the textbook.

Definition 2.1.5 A pie chart is a circle divided into sectors where each sector represents a category of data. The area of each sector is proportional to the frequency of the category. The sectors are usually drawn from largest to smallest (clockwise direction).

Example 2.1.5 The data in the following table represent the educational attainment of residents of the United States 25 years or older, based upon data obtained from the 2000 United States Census. Construct a pie chart of the data.

Educational Attainment           2000
Less than 9th grade              12 237 601
9th-12th grade, no diploma       20 343 848
High school diploma              52 395 507
Some college, no degree          36 453 108
Associate's degree               11 487 194
Bachelor's degree                28 603 014
Graduate/professional degree     15 930 061
Total                            177 450 333
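Because the area of each sector is proportional to the category's frequency, the central angle of a sector is its relative frequency times 360 degrees. The following sketch computes those angles for the table in Example 2.1.5; the dictionary and variable names are illustrative only.

```python
# Counts from the table in Example 2.1.5 (2000 United States Census).
attainment = {
    "Less than 9th grade": 12_237_601,
    "9th-12th grade, no diploma": 20_343_848,
    "High school diploma": 52_395_507,
    "Some college, no degree": 36_453_108,
    "Associate's degree": 11_487_194,
    "Bachelor's degree": 28_603_014,
    "Graduate/professional degree": 15_930_061,
}

total = sum(attainment.values())

# Sectors are usually drawn from largest to smallest.
sectors = sorted(attainment.items(), key=lambda item: item[1], reverse=True)

# Central angle (in degrees) = relative frequency * 360.
angles = {label: 360 * count / total for label, count in sectors}

for label, angle in angles.items():
    share = 100 * attainment[label] / total
    print(f"{label:<30} {share:5.1f}%  {angle:6.1f} degrees")
```

The computed total can be compared against the "Total" row of the table, and the angles should add up to 360 degrees.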

2.2 Organizing Quantitative Data I

2.2.1 Stem-and-Leaf Display

A stem-and-leaf display can be thought of as a variation of the histogram (to be discussed in a moment), and is especially useful when the observations are two-digit numbers. To draw a stem-and-leaf display,

a. List the digits 0 to 9 in a column and draw a vertical line. These correspond to the leading digits of the observations.
b. For each observation, record its second digit to the right of this vertical line in the row where the first digit appears.
c. Arrange the second digits in each row so that they are in increasing order.

The column of the first digits is referred to as the stem and the second digits are the leaves.

Example 2.2.1 Construct a Stem-and-Leaf Display of the following Stats 244 grades:

75 98 42 75 84 78 99 90 80 89
15 57 68 60 77 90 62 58 49 52

2.2.2 Histograms of Discrete Data

We can summarize discrete data in tables where the possible values of the discrete variable are used to create the categories of data. A frequency distribution created using discrete data is sometimes referred to as an ungrouped frequency distribution.
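Steps a to c of the stem-and-leaf construction in Section 2.2.1 can be sketched as follows. The function name `stem_and_leaf` is hypothetical, and the sketch assumes every observation is a whole number from 0 to 99, so that the stem is the tens digit and the leaf is the units digit.

```python
grades = [75, 98, 42, 75, 84, 78, 99, 90, 80, 89,
          15, 57, 68, 60, 77, 90, 62, 58, 49, 52]


def stem_and_leaf(values):
    """Return {stem: sorted leaves} for whole numbers between 0 and 99."""
    # Step a: one row per leading digit 0..9.
    rows = {stem: [] for stem in range(10)}
    # Step b: record each second digit in the row of its first digit.
    for value in values:
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # Step c: arrange the leaves in each row in increasing order.
    for leaves in rows.values():
        leaves.sort()
    return rows


display = stem_and_leaf(grades)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Rows with no leaves (stems 0, 2 and 3 for these grades) are still printed, which keeps the shape of the display comparable to a histogram.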

Example 2.2.2 The manager of a Wendy's fast-food restaurant is interested in studying the typical number of customers who arrive during the lunch hour. The following data represent the number of customers who arrive at Wendy's for 24 randomly selected 15-minute intervals of time during lunch. Construct a relative frequency distribution for this data.

7, 5, 2, 6, 2, 6, 6, 4, 6, 6, 7, 5, 2, 2, 8, 6, 6, 6, 1, 5, 9, 6, 2, 9

Definition 2.2.1 A histogram is constructed by drawing rectangles for each class of data. The height of each rectangle is the frequency or relative frequency of the class. The rectangles should all have the same width and adjacent rectangles should touch each other.

Example 2.2.3 Construct a frequency histogram for the above Wendy's data.
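For discrete data such as the Wendy's arrivals, each possible value is its own class. The sketch below builds the ungrouped frequency and relative frequency distribution of Example 2.2.2; it deliberately includes values inside the observed range that never occur (frequency 0), since a histogram needs a class for them too. The names are illustrative.

```python
from collections import Counter

arrivals = [7, 5, 2, 6, 2, 6, 6, 4, 6, 6, 7, 5,
            2, 2, 8, 6, 6, 6, 1, 5, 9, 6, 2, 9]

counts = Counter(arrivals)
n = len(arrivals)

# One class per integer value from the smallest to the largest observation.
# Counter returns 0 for values that never occur, so gaps are kept.
distribution = {
    value: (counts[value], counts[value] / n)
    for value in range(min(arrivals), max(arrivals) + 1)
}

for value, (freq, rel_freq) in distribution.items():
    # The row of '#' characters is a rough text stand-in for a histogram bar.
    print(f"{value}  {freq:>2}  {rel_freq:.3f}  {'#' * freq}")
```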

2.2.3 Grouped Frequency Distributions

Quite often when the range of the data set is large, the individual frequencies do not reveal much information about the data. We commonly group the data into intervals. If the intervals are not given a priori, a rule of thumb is to try ten intervals. This by no means always holds (for instance, when the range is small or when there is a lot of data). With experience, one is able to determine an appropriate number of intervals to represent the data.

The intervals should not overlap and should continuously cover the range of the data. The lower class limit of a class is the smallest value within the class and the upper class limit of a class is the largest value within the class. The class width is the difference between consecutive lower class limits. A set of data is said to be open ended if the last class does not have an upper class limit.

Definition 2.2.2 The class midpoint is found by adding the lower and upper class limits and dividing the result by 2.

\[ \text{Class midpoint} = \frac{\text{Lower Class Limit} + \text{Upper Class Limit}}{2} \]

A template for the Grouped Frequency Distribution Table is as follows:

Interval    Midpoint    Tally    Freq    Cumulative Frequency    Percentage Frequency    Cumulative % Freq

Example 2.2.4 The times (in minutes) of 50 runners in a marathon were:

246 238 246 251 240 243 245 243 241 248
244 246 249 246 245 244 248 240 243 249
242 245 239 244 246 246 248 248 249 248
250 242 243 245 242 242 246 246 245 247
244 240 245 247 248 247 250 247 248 250

Suppose the classes for the runners were defined (inclusively) by:

237-239
240-242
243-245
246-248
249-251

Complete a Grouped Frequency Distribution Table.

Interval    Midpoint    Tally    Freq    Cumulative Frequency    Percentage Frequency    Cumulative % Freq
237-239
240-242
243-245
246-248
249-251

Notes on a Grouped Frequency Distribution Table:

1. Suitable class sizes are subjective and depend on the data. Usually the classes should all have the same width. If the classes are too wide, then too much of the information is lost, and the graphic representation of the Grouped Frequency Distribution (a histogram, to be discussed in the next section) would appear to be "box-like". If the classes are too narrow, then the histogram has many little boxes and little discernible information can be obtained from the graph.

2. It is a good strategy to set the class boundaries (which we use to form the class intervals) at values where no data points lie. Usually the trick we used above works. In the above example, our data consisted of whole numbers, so we were able to use ###.5's as the boundaries (for example, 236.5 and 239.5 for the class 237-239). If our data consisted of data of the form $###.##, our class boundaries would have to be of the form $###.##5.

2.2.4 Graphically Representing Continuous Data

To draw a histogram for continuous data, one would create a grouped frequency distribution table for the data and then create the histogram based on the grouped frequency distribution table.

Definition 2.2.3 A frequency polygon is drawn by plotting a point above each class midpoint on a horizontal axis at a height equal to the frequency of the class. After the points for each class are plotted, straight lines are drawn to connect consecutive points.
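The columns of the Grouped Frequency Distribution Table for the marathon data of Example 2.2.4 can be filled in with a short loop. This is a sketch rather than a prescribed method: the dictionary keys are made-up names, and the class boundaries are computed by moving each limit out by 0.5, which assumes whole-number data as in note 2.

```python
times = [246, 238, 246, 251, 240, 243, 245, 243, 241, 248,
         244, 246, 249, 246, 245, 244, 248, 240, 243, 249,
         242, 245, 239, 244, 246, 246, 248, 248, 249, 248,
         250, 242, 243, 245, 242, 242, 246, 246, 245, 247,
         244, 240, 245, 247, 248, 247, 250, 247, 248, 250]

# Inclusive (lower class limit, upper class limit) pairs from the example.
classes = [(237, 239), (240, 242), (243, 245), (246, 248), (249, 251)]

n = len(times)
table = []
cumulative = 0
for lower, upper in classes:
    freq = sum(lower <= t <= upper for t in times)
    cumulative += freq
    table.append({
        "interval": f"{lower}-{upper}",
        "midpoint": (lower + upper) / 2,           # Definition 2.2.2
        "boundaries": (lower - 0.5, upper + 0.5),  # whole-number data only
        "freq": freq,
        "cum_freq": cumulative,
        "pct": 100 * freq / n,
        "cum_pct": 100 * cumulative / n,
    })

for row in table:
    print(f"{row['interval']}  {row['midpoint']:5.1f}  {row['freq']:>2}  "
          f"{row['cum_freq']:>2}  {row['pct']:5.1f}%  {row['cum_pct']:5.1f}%")
```

The last cumulative frequency should equal the number of observations and the last cumulative percentage should be 100, which gives a quick check that the classes cover the data without overlap.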

Definition 2.2.4 A cumulative frequency distribution displays the aggregate frequency of the category. In other words, for discrete data, it displays the total number of observations less than or equal to the category. For continuous data, it displays the total number of observations less than or equal to the upper class limit of a class.

Definition 2.2.5 A cumulative relative frequency distribution displays the proportion of observations less than or equal to the category for discrete data and the proportion of observations less than or equal to the upper class limit for continuous data.

Definition 2.2.6 An ogive curve (or ogive for short) is a graph that represents the cumulative relative frequency for the class. It is constructed by plotting points whose x-coordinates are the upper class limits and whose y-coordinates are the cumulative relative frequencies. After the points for each class are plotted, straight lines are drawn between consecutive points.

Example 2.2.5 For the marathon example in the previous section, draw a frequency polygon and an ogive curve.

Definition 2.2.7 A time series plot is obtained by plotting the time in which a variable is measured on the horizontal axis and the corresponding value of the variable on the vertical axis. Lines are then drawn connecting the points.
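The points to plot for the frequency polygon and the ogive of Example 2.2.5 follow directly from the definitions. In the sketch below the class frequencies 2, 8, 14, 19, 7 are an assumption tallied by hand from the marathon times of Example 2.2.4, and the variable names are illustrative; the actual drawing (connecting consecutive points with straight lines) is left to graph paper or a plotting tool.

```python
from itertools import accumulate

# (lower class limit, upper class limit) pairs and their frequencies.
classes = [(237, 239), (240, 242), (243, 245), (246, 248), (249, 251)]
frequencies = [2, 8, 14, 19, 7]  # tallied from the 50 marathon times
n = sum(frequencies)

# Frequency polygon: x = class midpoint, y = class frequency.
polygon_points = [
    ((lower + upper) / 2, freq)
    for (lower, upper), freq in zip(classes, frequencies)
]

# Ogive: x = upper class limit, y = cumulative relative frequency.
ogive_points = [
    (upper, cum_freq / n)
    for (_, upper), cum_freq in zip(classes, accumulate(frequencies))
]

print("Frequency polygon:", polygon_points)
print("Ogive:", ogive_points)
```

The y-coordinates of the ogive never decrease and the last one is 1, because every observation is less than or equal to the largest upper class limit.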

Example 2.2.6 Microsoft Corporation's Stock Prices

[Figure: time series plot. Vertical axis: Stock Value (0 to 120). Horizontal axis: Time (Jan-00 to Nov-01).]

2.2.5 Distribution Shapes

The following are examples of histograms that would be considered:

(1) uniform;

(2) bell-shaped (also called symmetric);

[Figure: histogram captioned "Symmetric (also called Normal)".]

(3) skewed to the right (if the right side of the diagram containing half the observations extends much farther out than the left side);

[Figure: histogram captioned "Skewed to the right (also called positively skewed)".]

(4) and skewed to the left (if the left side of the diagram containing half the observations extends much farther out than the right side).

[Figure: histogram captioned "Skewed to the left (also called negatively skewed)".]

2.3 Graphical Misrepresentations of Data

Please read through this section on your own.