## Histograms

### Introduction

We will be following a mathematically-correct procedure in this module.

The reason for pointing this out right at the beginning is that the word histogram appears to be used with a slightly different meaning in some less mathematically-orientated sources, or in spreadsheets.

In a nutshell, histograms are closely related to bar charts, but differ in that they are used to represent frequency distributions.

Strictly speaking, they would be used to represent continuous data (definitions below), and that is the approach this course will be following. What this means visually is that the bars on a histogram will have no spacing between them.

For completeness, we could also state that whereas bar charts have bars all the same width, it is possible on a histogram to have bars of unequal width, in which case it would be the area of the bars that we would be interested in. Obviously it would be simpler if all the bars had the same width, and that is the approach we will adopt here.
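As a rough illustration of the unequal-width case, the sketch below sets each bar's height to the frequency divided by the class width (often called the frequency density, a term not used elsewhere in this module), so that the area of each bar equals its frequency. The classes and the function name `bar_heights` are hypothetical, chosen only for this example.

```python
# Hypothetical classes of unequal width: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 20), (10, 30, 20), (30, 40, 15)]


def bar_heights(classes):
    """Return the bar height (frequency / class width) for each class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]


heights = bar_heights(classes)
for (lower, upper, freq), height in zip(classes, heights):
    area = height * (upper - lower)  # the area of the bar recovers the frequency
    print(f"{lower}-{upper}: height {height}, area {area}")
```

The first two classes have the same frequency, but the second is twice as wide, so its bar is half as tall.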

Before tackling histograms, we need to learn about a few concepts.

### Discrete and Continuous Data

Data or variables that are continuous are those that can take any value within a range. An obvious example is height. The height of human beings can take any value between biologically determined limits. So if someone is 1.63 metres tall and someone else is 1.64 metres tall, then it is possible for someone to have an intermediate height between these two values, e.g. 1.633 metres or 1.637 metres. You can apply this logic ad infinitum to ever-decreasing gaps, in order to show that any height whatsoever within a particular range could correspond to someone's height - there are no gaps within this range which could not correspond to someone's height.

On the other hand, discrete data can only take certain values, e.g. the total number of goals scored by a football team. It could have scored 42 goals, or 43 goals, or 44 goals, ..., but definitely not 42½ or 42¼, etc.

Note that discrete data need not be integer values - for example

1/4, 1/2, 3/4, 1, 5/4, 3/2, ...

is classed as a discrete set of numbers.

### Frequency Distributions - Introduction

The frequency distribution, above left, is hopefully self-explanatory. From the frequency distribution, a bar chart has been drawn up, in line with what you learnt from the Bar Charts module.

### Grouped Frequency Distributions

When we have a large amount of data, encompassing a large range, it might be more convenient to produce a Grouped Frequency Distribution.

For example, if we had 70 items of data recording the ages of 70 different individuals who took part in a particular activity, we could produce the following table

| Age | Tally | Total Frequency |
|---|---|---|
| 1-10 | | 8 |
| 11-20 | | 13 |
| 21-30 | | 18 |
| 31-40 | | 5 |
| 41-50 | | 5 |
| 51-60 | | 18 |
| 61-70 | | 3 |

So, although we have data that can take any whole-number value between 1 and 70, instead of having 70 different intervals, we have sorted the data into just 7 groups.
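The sorting step can be sketched in a few lines of Python. This is only a minimal illustration: the list of ages is hypothetical (the original 70 data items are not reproduced here), and the helper name `class_label` is an assumption made for the example.

```python
from collections import Counter


def class_label(age, width=10):
    """Return the label of the group containing a whole-number age (1-10, 11-20, ...)."""
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Hypothetical ages, for illustration only.
ages = [3, 10, 11, 18, 25, 30, 31, 47, 52, 60, 61, 70]

# Count how many ages fall into each group.
frequencies = Counter(class_label(age) for age in ages)
for label, count in frequencies.items():
    print(label, count)
```

Changing `width` gives a different grouping of the same data, which is a quick way to compare candidate intervals.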

The grouping is arbitrary and it is up to you to decide the appropriate intervals. Sometimes, though, the 'best' intervals to use can be more 'obvious' than at other times. There is, however, still a measure of subjectivity: there is not, in general, a totally correct and unique interval to use.

However, too few groups will destroy the details of the distribution - too many could destroy the pattern of the distribution (for example, some groups might contain no data). As a rough rule of thumb, the number of groups is usually between 5 and 20. The more data you have, the more groups you could have, because each group will still contain a significant amount of data.

Each group is more correctly referred to in Mathematics as a class. We will use this word exclusively from now on.

For completeness, we need to know a few definitions :-

• Class Boundaries define the class interval - the upper class boundary (UCB) is the maximum possible value which would be in the class, and the lower class boundary (LCB) is the minimum value which would be in that class.

The UCB for one particular class will be the LCB for an adjacent class and the LCB for the same particular class will be the UCB for the adjacent class on its other side.

• Class Interval is a statement of the actual range covered by a class. For example a particular class could have the class interval 5.5 to 6.5, and the adjacent class could have the class interval 6.5 to 7.5, and so on.
• Class Width is the difference between the upper class boundary and the lower class boundary.
• Class Midpoint is the point midway between the upper class boundary and the lower class boundary.
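As a small worked example of these definitions, the sketch below computes the class width and class midpoint for the class interval 5.5 to 6.5 mentioned above. The function names are assumptions made for illustration.

```python
def class_width(lcb, ucb):
    """Class width: upper class boundary minus lower class boundary."""
    return ucb - lcb


def class_midpoint(lcb, ucb):
    """Class midpoint: the value halfway between the two class boundaries."""
    return (lcb + ucb) / 2


lcb, ucb = 5.5, 6.5  # the class interval 5.5 to 6.5
print(class_width(lcb, ucb))     # 1.0
print(class_midpoint(lcb, ucb))  # 6.0
```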

### Frequency Distributions - Continuous Variables

When you think about it, data for continuous variables will need to be grouped before we can represent it on a histogram.

If we were to construct a frequency diagram for finishing times for a marathon, we have a choice of class intervals we could use. For example:

10 minutes: ..., 3 hrs - 3 hrs 10 mins, 3 hrs 10 mins - 3 hrs 20 mins, 3 hrs 20 mins - 3 hrs 30 mins, etc.

15 minutes: ..., 3 hrs - 3 hrs 15 mins, 3 hrs 15 mins - 3 hrs 30 mins, 3 hrs 30 mins - 3 hrs 45 mins, etc.

or any other convenient width.

In practice, there is no ambiguity here because the chance of someone finishing in exactly 3 hrs 10 mins is very small. So although we have intervals with common borders, e.g. 3 hrs - 3 hrs 10 mins and 3 hrs 10 mins - 3 hrs 20 mins, it is extremely likely that contestants will fall within definite intervals, and not on a border. (Mathematically speaking, the chance of a contestant finishing in exactly 3 hrs 10 mins is zero. I have used the phrases 'very small' and 'extremely likely' to account for the inability of humans, or indeed machines, to measure with absolute accuracy.)

Example Frequency Distribution

| Time | Tally | Total Frequency |
|---|---|---|
| 2 hrs 45 mins - 3 hrs 00 mins | | 8 |
| 3 hrs 00 mins - 3 hrs 15 mins | | 13 |
| 3 hrs 15 mins - 3 hrs 30 mins | | 18 |
| 3 hrs 30 mins - 3 hrs 45 mins | | 13 |
| 3 hrs 45 mins - 4 hrs 00 mins | | 13 |
| 4 hrs 00 mins - 4 hrs 15 mins | | 6 |
| 4 hrs 15 mins - 4 hrs 30 mins | | 18 |
| 4 hrs 30 mins - 4 hrs 45 mins | | 3 |

It is in charting such data that a histogram would be used. I have already said that histograms resemble bar charts closely. The main difference is that histograms will normally be shown with the bars touching each other (with no spacing), to account for the continuous nature of the data, whereas simple bar charts tend to have spaces to separate the bars.

For completeness, I should state that histograms can be used to chart discrete data, usually where the discrete variables can take a large number of values and it is therefore convenient to group the data. In order to do so, a few 'tricks' need to be introduced, which we will talk about later. For the moment, just consider the continuous variable case.

### Drawing Histograms

Once we have drawn up a grouped frequency table, the drawing of a histogram should be straightforward - remember not to include any spacing between the bars.
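As a rough sketch of this step, the marathon finishing-time table from earlier in this module can be turned into a simple text histogram. The bars run horizontally here rather than vertically, and the function name `text_histogram` and the shortened interval labels are assumptions made for the example. Because every class has the same width (15 minutes), the length of each bar is simply its frequency.

```python
# Frequencies copied from the marathon finishing-time table in this module.
# Each entry is (class interval in hours:minutes, frequency).
marathon = [
    ("2:45-3:00", 8),
    ("3:00-3:15", 13),
    ("3:15-3:30", 18),
    ("3:30-3:45", 13),
    ("3:45-4:00", 13),
    ("4:00-4:15", 6),
    ("4:15-4:30", 18),
    ("4:30-4:45", 3),
]


def text_histogram(table, symbol="#"):
    """Return one line per class: the interval, then a bar whose length is the frequency."""
    return [f"{interval} | {symbol * freq} ({freq})" for interval, freq in table]


lines = text_histogram(marathon)
# Consecutive lines are printed with no blank rows between them,
# mirroring the absence of spacing between the bars of a histogram.
print("\n".join(lines))
```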

Given the following frequency distribution, for observations of thrust from a rocket engine

we can produce the following histogram

### Histograms - Discrete Data

Despite what I have said so far, many examples of histograms that are presented either in books or on the Internet appear to be actually using discrete data. We have to be a bit careful here about our definitions - there is a certain 'fudge factor' involved, because of the way the data may be presented.

It is certainly standard practice in statistics, when the amount of discrete data is large, to use methods primarily intended for continuous data - the classic example is the use of the normal distribution.

However, there are a couple of other scenarios worth looking at:

1. If the data has been rounded before being recorded, e.g.

• lengths rounded to the nearest metre
• ages rounded to the nearest whole number of years

then although the data may appear discrete, it is actually continuous. We need merely to introduce the appropriate class boundaries and proceed as normal.

For example, if data is presented to the nearest metre, then data corresponding to 3 metres will, in reality, be data corresponding to a class interval of 2.5 metres to 3.5 metres.

2. If discrete data covers a range completely, then to all intents and purposes, we can probably consider it as continuous. For example, marks for an exam, which we expect can take any number within a range but are normally quoted as whole numbers. When you think about it, data of this type is probably really continuous but is presented as discrete data out of pure practical considerations.

In this case, we have to introduce a 'fix' and set up class boundaries appropriate to continuous data. For example, if the frequency distribution was drawn up as previously in this module (under Grouped Frequency Distributions) using the groupings

..., 11-20, 21-30, 31-40, ... etc.,

the class intervals would be 'adjusted' to be

..., 10.5-20.5, 20.5-30.5, 30.5-40.5, ... etc.

So seemingly discrete data, as in this table, can still be displayed using a histogram, as shown here
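The two adjustments described in this section can be sketched as follows. This is a minimal illustration, assuming values are recorded to the nearest whole unit by default; the function names and the `precision` parameter are hypothetical.

```python
def boundaries_from_rounded(value, precision=1.0):
    """Class boundaries for a single value recorded to the nearest `precision`."""
    half = precision / 2
    return (value - half, value + half)


def boundaries_from_limits(lower, upper, precision=1.0):
    """Class boundaries for a class quoted with whole-number limits such as 11-20."""
    half = precision / 2
    return (lower - half, upper + half)


print(boundaries_from_rounded(3))      # (2.5, 3.5)
print(boundaries_from_limits(11, 20))  # (10.5, 20.5)
print(boundaries_from_limits(21, 30))  # (20.5, 30.5)
```

Note that the upper boundary of 11-20 and the lower boundary of 21-30 are both 20.5, so the adjusted classes meet with no gap between them.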

### Frequency Polygon

Frequency Polygons are related to histograms in that they are another way of representing the same data.

Stated simply, whereas histograms display a bar relating to a particular class, a frequency polygon requires a point to be marked at the relevant height and at the class midpoint. These points are then joined by straight lines.

Looking at it in an alternative way, the points would coincide with the midpoints of the tops of the rectangles in the histogram.

One advantage of this method of presentation is that several polygons can be drawn on the same graph, allowing direct comparison between different frequency distributions.
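The points of a frequency polygon can be computed directly from a grouped frequency table. The sketch below uses the marathon finishing-time table from earlier in this module, with the class boundaries converted to minutes (so 2 hrs 45 mins becomes 165); the function name `polygon_points` is an assumption made for the example. Joining consecutive points with straight lines gives the polygon.

```python
# Marathon classes from this module, with boundaries converted to minutes.
# Each entry is (lower class boundary, upper class boundary, frequency).
classes = [
    (165, 180, 8), (180, 195, 13), (195, 210, 18), (210, 225, 13),
    (225, 240, 13), (240, 255, 6), (255, 270, 18), (270, 285, 3),
]


def polygon_points(classes):
    """Return (class midpoint, frequency) pairs, one per class."""
    return [((lcb + ucb) / 2, freq) for lcb, ucb, freq in classes]


points = polygon_points(classes)
for midpoint, freq in points:
    print(midpoint, freq)
```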

The diagram below shows a frequency distribution represented by a histogram, below which the same frequency distribution is represented by a frequency polygon.

This is a histogram with the frequency polygon superimposed over it.

Note that in order for the frequency polygon to 'connect' with the axes, an 'imaginary' class has been added to the right and the line drawn down to the horizontal axis at the class midpoint, and to the left a line has been drawn to the vertical axis at a value corresponding to half of the value of the leftmost point (or half the height of the leftmost bar, if we are thinking in histogram terms).

### Past Exam Questions

The list shows the amounts of sodium in different brands of bottled water

1. Mary checks her table. The total number of brands on her table is?

• A    one too few
• B    two too few
• C    one too many
• D    two too many

2. Which of these categories does not contain the correct number of brands?

• A    Low sodium (7 to 10 mg/litre)
• B    Medium sodium (11 to 14 mg/litre)
• C    High sodium (15 to 18 mg/litre)
• D    Very high sodium (more than 18 mg/litre)