2. Introduction to the Mathematical Methods

Measures of spread

You have calculated the mean of your data, but a single value may not tell you enough. You may also want to know how widely the data are spread: do they cluster around one value, or do they cover many values? You can often judge this by eye, but a numerical measure is more useful. Two common measures of the spread of data are the range and the variance.

The range of a dataset is the difference between the smallest and largest values in that dataset. Thus a dataset of the heights of students may contain values {1.27, 1.35, 1.33, 1.64, 1.21, 1.24, 1.26, 1.26, 1.48, 1.30, 1.29, 1.45, 1.51, 1.32, 1.61, 1.63} in which the smallest value is 1.21 and the largest value is 1.64, so that the range is 1.64 - 1.21 = 0.43.
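
As a minimal sketch, the range of the heights listed above can be computed in Python. The function name `data_range` is only an illustrative choice, not part of any library.

```python
# Heights of students, copied from the example in the text.
heights = [1.27, 1.35, 1.33, 1.64, 1.21, 1.24, 1.26, 1.26,
           1.48, 1.30, 1.29, 1.45, 1.51, 1.32, 1.61, 1.63]


def data_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


# Rounded to two decimals to hide floating-point noise.
print(round(data_range(heights), 2))  # 0.43
```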

Do not confuse the range with the domain of a dataset. The domain is the set of all values that can occur, whereas the range is the difference between the two extremes. For a die, the domain is {1, 2, 3, 4, 5, 6}, whilst the range is 6 - 1 = 5.

The variance of a set of observations is the average of the squares of the residuals, that is, of the differences between each observation and the mean, as shown in the equation:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

with μ being the expected value (the population mean).

This is the variance for a population of size N. If we applied the population equation to a sample, using the sample mean in place of μ, the result would be biased: on average it would underestimate the population variance. This bias is avoided by using the equation below, which divides by (n-1) instead of N, where n is the sample size and x̄ is the sample mean:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
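
The difference between dividing by N and dividing by (n-1) may be easier to see in code. The sketch below assumes plain Python lists of numbers. The function names and the `rolls` data are made up for illustration, and the standard-library `statistics` module is used only as a cross-check.

```python
import statistics


def population_variance(values):
    """Sum of squared residuals divided by N (data are the whole population)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared residuals divided by n - 1 (data are a sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


rolls = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean is 5

print(population_variance(rolls))  # 4.0
print(sample_variance(rolls))      # 32 / 7, slightly larger

# Cross-check against the standard library.
print(abs(population_variance(rolls) - statistics.pvariance(rolls)) < 1e-12)
print(abs(sample_variance(rolls) - statistics.variance(rolls)) < 1e-12)
```

The sample value is always the larger of the two, because the same sum is divided by a smaller number.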


Expand to give:

$$s^2 = \frac{x_1^2 - 2x_1\bar{x} + (\bar{x})^2 + x_2^2 - 2x_2\bar{x} + (\bar{x})^2 + \dots + x_n^2 - 2x_n\bar{x} + (\bar{x})^2}{n-1}$$

and group:

$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i \cdot \sum x_i\right)}{n}}{n-1} = \frac{\sum x_i^2 - n(\bar{x})^2}{n-1}$$
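
The grouping step relies on the identity Σx_i = n·x̄, which follows from the definition of the sample mean. Written out for the numerator alone, with all sums running over i = 1, ..., n, the intermediate steps are:

```latex
\begin{aligned}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n(\bar{x})^2 \\
  &= \sum x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2 \\
  &= \sum x_i^2 - n(\bar{x})^2
   = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\end{aligned}
```

Dividing the last line by (n-1) gives both grouped forms of the sample variance.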

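These two forms of the variance equation are algebraically equivalent, and the choice as to which to use is up to the user. The second form is convenient on computers because it needs only one pass through the data: the sum of the values and the sum of their squares can be accumulated together. Be aware, however, that it subtracts two nearly equal numbers when the mean is large compared with the spread, so in floating-point arithmetic it can lose precision where the first form does not.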
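
The sketch below contrasts the two forms. The function names are illustrative, and the data are the student heights from the range example. The one-pass version keeps only running totals, so it also works on data that can be read just once, such as a stream.

```python
def variance_two_pass(values):
    """First form: compute the mean, then sum the squared residuals."""
    n = len(values)
    x_bar = sum(values) / n                      # first pass
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)  # second pass


def variance_one_pass(values):
    """Second form: accumulate sum(x) and sum(x^2) in a single loop."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


heights = [1.27, 1.35, 1.33, 1.64, 1.21, 1.24, 1.26, 1.26,
           1.48, 1.30, 1.29, 1.45, 1.51, 1.32, 1.61, 1.63]

# The two results agree to within floating-point rounding.
difference = abs(variance_two_pass(heights) - variance_one_pass(heights))
print(difference < 1e-12)  # True
```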

With frequency distribution data, you again need to take into account that each bin contains a number of observations. Each squared difference between a bin value v_i and the mean is therefore weighted by the bin count k_i, and the sum of these products is divided by the sum of the bin counts minus one. Written in the second, expanded form, this is:

$$s^2 = \frac{\sum k_i v_i^2 - \bar{x}^2 \sum k_i}{\sum k_i - 1}$$
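
A short sketch of the binned calculation follows. It assumes the bins are given as two equal-length lists, one of bin values and one of bin counts. The function name and the die-roll tally are hypothetical. The result is checked against the variance of the expanded raw data.

```python
import statistics


def binned_variance(bin_values, bin_counts):
    """Sample variance from bin values v_i and bin counts k_i."""
    total_count = sum(bin_counts)                      # sum of k_i
    if total_count < 2:
        raise ValueError("need at least two observations")
    pairs = list(zip(bin_values, bin_counts))
    mean = sum(k * v for v, k in pairs) / total_count  # weighted mean
    sum_sq = sum(k * v * v for v, k in pairs)          # sum of k_i * v_i^2
    return (sum_sq - mean * mean * total_count) / (total_count - 1)


# Hypothetical tally of 20 die rolls: face value and how often it came up.
faces = [1, 2, 3, 4, 5, 6]
counts = [2, 3, 5, 5, 3, 2]

# Expanding the bins back into raw observations gives the same variance.
raw = [v for v, k in zip(faces, counts) for _ in range(k)]
print(abs(binned_variance(faces, counts) - statistics.variance(raw)) < 1e-12)  # True
```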

The square root of the variance is called the standard deviation. It has the same units as your data, and so it can be plotted on your histograms.
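
Because the standard deviation shares the units of the data, it can be added to and subtracted from the mean directly. The sketch below uses the student heights from the range example together with the standard-library `statistics` module. The "one standard deviation either side of the mean" band is just one illustrative way of marking the spread on a histogram.

```python
import math
import statistics

heights = [1.27, 1.35, 1.33, 1.64, 1.21, 1.24, 1.26, 1.26,
           1.48, 1.30, 1.29, 1.45, 1.51, 1.32, 1.61, 1.63]

mean = statistics.mean(heights)
s = statistics.stdev(heights)  # square root of the sample variance (n - 1)

# stdev is the square root of variance, up to floating-point rounding.
print(abs(s - math.sqrt(statistics.variance(heights))) < 1e-12)  # True

# Same units as the heights, so the band can be drawn on the data axis.
print(f"mean = {mean:.3f}, s = {s:.3f}")
print(f"band: {mean - s:.3f} to {mean + s:.3f}")
```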