How To Calculate Standard Deviation Of A Set

The median is known as a measure of location; that is, it tells us where the data are. As noted earlier, we do not need to know all the exact values to calculate the median; if we made the smallest value even smaller or the largest value even larger, it would not change the value of the median. Thus the median does not use all the information in the data, and so it can be shown to be less efficient than the mean or average, which does use all values of the data. To calculate the mean we add up the observed values and divide by the number of them. The total of the values obtained in Table 1.1 was 22.5, which was divided by their number, 15, to give a mean of 1.5. This familiar process is conveniently expressed by the following symbols:

$$\bar{x} = \frac{\sum x}{n}$$

$\bar{x}$ (pronounced "x bar") signifies the mean; $x$ is each of the values of urinary lead; $n$ is the number of these values; and $\Sigma$, the Greek capital sigma (our "S"), denotes "sum of". A major disadvantage of the mean is that it is sensitive to outlying points. For example, replacing 2.2 by 22 in Table 1.1 increases the mean to 2.82, whereas the median will be unchanged.
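The effect of the outlier on the mean can be checked from the totals quoted above, without the individual readings. The following is a minimal Python sketch; the variable names are illustrative, and only the figures 22.5, 15, 2.2, and 22 come from the text.

```python
# Totals quoted in the text for Table 1.1.
total = 22.5   # sum of the 15 urinary lead readings
n = 15         # number of readings

mean = total / n

# Replacing the single reading 2.2 by 22 changes only the total.
outlier_total = total - 2.2 + 22
outlier_mean = outlier_total / n

print(round(mean, 2))          # 1.5
print(round(outlier_mean, 2))  # 2.82
```

One altered reading almost doubles the mean, whereas the median depends only on the middle of the ordered readings.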

As well as measures of location we need measures of how variable the data are. We met two of these measures, the range and interquartile range, in Chapter 1.

The range is an important measurement, for figures at the top and bottom of it denote the findings furthest removed from the generality. However, they do not give much indication of the spread of observations about the mean. This is where the standard deviation (SD) comes in.

The theoretical basis of the standard deviation is complex and need not trouble the ordinary user. We will discuss sampling and populations in Chapter 3. A practical point to note here is that, when the population from which the data arise has a distribution that is approximately "Normal" (or Gaussian), then the standard deviation provides a useful basis for interpreting the data in terms of probability.

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population. The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population. However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.

Many biological characteristics conform to a Normal distribution closely enough for it to be commonly used, for instance, heights of adult men and women, blood pressures in a healthy population, random errors in many types of laboratory measurements, and biochemical data. Figure 2.1 shows a Normal curve calculated from the diastolic blood pressures of 500 men, mean 82 mmHg, standard deviation 10 mmHg. The ranges representing ±1 SD, ±2 SD, and ±3 SD about the mean are marked. A more extensive set of values is given in Table A of the print edition.

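Figure 2.1. Normal curve for the diastolic blood pressures of 500 men (mean 82 mmHg, standard deviation 10 mmHg), with the ranges ±1 SD, ±2 SD, and ±3 SD about the mean marked.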

The reason why the standard deviation is such a useful measure of the scatter of the observations is this: if the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it ($\bar{x} \pm 1\,\mathrm{SD}$) includes about 68% of the observations; a range of two standard deviations above and two below ($\bar{x} \pm 2\,\mathrm{SD}$) about 95% of the observations; and of three standard deviations above and three below ($\bar{x} \pm 3\,\mathrm{SD}$) about 99.7% of the observations. Consequently, if we know the mean and standard deviation of a set of observations, we can obtain some useful information by simple arithmetic. By putting one, two, or three standard deviations above and below the mean we can estimate the ranges that would be expected to include about 68%, 95%, and 99.7% of the observations.
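The 68%, 95%, and 99.7% figures can be reproduced from the Normal distribution function, and the corresponding ranges follow by simple arithmetic. The sketch below uses only Python's standard library; the mean of 82 mmHg and SD of 10 mmHg are the blood pressure figures quoted for Figure 2.1, and the function name `coverage` is illustrative.

```python
import math

def coverage(k):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

mean, sd = 82, 10  # diastolic blood pressure, mmHg

for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"mean ± {k} SD: {low} to {high} mmHg, about {coverage(k):.1%}")
```

Two standard deviations cover slightly more than 95% of a Normal distribution; the multiplier that gives exactly 95% is about 1.96.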

Standard deviation from ungrouped data

The standard deviation is a summary measure of the differences of each observation from the mean. If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero. Consequently the squares of the differences are added. The sum of the squares is then divided by the number of observations minus one to give the mean of the squares, and the square root is taken to bring the measurements back to the units we started with. (The division by the number of observations minus one instead of the number of observations itself to obtain the mean square is because "degrees of freedom" must be used. In these circumstances they are one less than the total. The theoretical justification for this need not trouble the user in practice.)

To gain an intuitive feel for degrees of freedom, consider choosing a chocolate from a box of n chocolates. Every time we come to choose a chocolate we have a choice, until we come to the last one (normally one with a nut in it!), and then we have no choice. Thus we have n − 1 choices, or "degrees of freedom".

The calculation of the variance is illustrated in Table 2.1 with the 15 readings in the preliminary study of urinary lead concentrations (Table 1.2). The readings are set out in column (1). In column (2) the difference between each reading and the mean is recorded. The sum of the differences is 0. In column (3) the differences are squared, and the sum of those squares is given at the bottom of the column.

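Table 2.1. The 15 urinary lead readings (column 1), their differences from the mean (column 2), and the squared differences (column 3).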

The sum of the squares of the differences (or deviations) from the mean, 9.96, is now divided by the total number of observations minus one, to give the variance. Thus,

$$\text{variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$$

In this case we find:

$$\text{variance} = \frac{9.96}{14} = 0.7114$$

Finally, the square root of the variance provides the standard deviation:

$$\mathrm{SD} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

from which we get

$$\mathrm{SD} = \sqrt{0.7114} = 0.843$$

This process illustrates the structure of the standard deviation, in particular that the two extreme values 0.1 and 3.2 contribute most to the sum of the differences squared.
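The column-by-column calculation can be sketched in Python as follows. Table 2.1 is not reproduced in this text, so the 15 readings listed here are an assumption: they are taken to be the table's values, and they reproduce the totals quoted in the text (a sum of 22.5 and a sum of squared differences of 9.96). The standard library's `statistics.stdev` also divides by n − 1 and serves as a cross-check.

```python
import math
import statistics

# Assumed readings for Table 2.1 (the table itself is missing from this text).
readings = [0.1, 0.4, 0.6, 0.8, 1.1, 1.2, 1.3, 1.5,
            1.7, 1.9, 1.9, 2.0, 2.2, 2.6, 3.2]

n = len(readings)
mean = sum(readings) / n

differences = [x - mean for x in readings]  # column (2)
squares = [d ** 2 for d in differences]     # column (3)

sum_of_squares = sum(squares)
variance = sum_of_squares / (n - 1)         # divisor is n - 1, not n
sd = math.sqrt(variance)

print(round(sum_of_squares, 2))
print(round(variance, 4), round(sd, 3))

# Cross-check against the standard library (same n - 1 divisor).
assert math.isclose(sd, statistics.stdev(readings))
```

With these readings the variance comes to about 0.711 and the standard deviation to about 0.843. Dividing by n instead, as `statistics.pstdev` does, would give a slightly smaller value.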

Calculator procedure

Most inexpensive calculators have procedures that enable one to calculate the mean and standard deviation directly, using the "SD" mode. For example, on modern Casio calculators one presses SHIFT and '.' and a little "SD" symbol should appear on the display. On earlier Casios one presses INV and MODE, whereas on a Sharp 2nd F and Stat should be used. The data are stored via the M+ button. Thus, having set the calculator into the "SD" or "Stat" mode, from Table 2.1 we enter 0.1 M+, 0.4 M+, etc. When all the data are entered, we can check that the correct number of observations has been included by Shift and n, and "15" should be displayed. The mean is displayed by Shift and $\bar{x}$ and the standard deviation by Shift and $\sigma_{n-1}$. Avoid pressing Shift and AC between these operations as this clears the statistical memory.

There is another button, $\sigma_n$, on many calculators. This uses the divisor $n$ rather than $n - 1$ in the calculation of the standard deviation. On a Sharp calculator $\sigma_n$ is denoted $\sigma$, whereas $\sigma_{n-1}$ is denoted $s$. The $\sigma_n$ values are the "population" values, and are derived assuming that an entire population is available or that interest focuses solely on the data in hand, and the results are not going to be generalised (see Chapter 3 for details of samples and populations). As this situation very rarely arises, $\sigma_{n-1}$ should be used and $\sigma_n$ ignored, although even for moderate sample sizes the difference is going to be small.

Remember to return to normal mode before resuming calculations because many of the usual functions are not available in "Stat" mode. On a modern Casio this is Shift 0. On earlier Casios and on Sharps one repeats the sequence that calls up the "Stat" mode. Some calculators stay in "Stat" mode even when switched off. Mullee (1) provides advice on choosing and using a calculator.

The calculator formulas use the relationship

$$\frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$$

The right hand expression can be easily memorised by the expression "mean of the squares minus the mean square". The sample variance $s^2$ is obtained from

$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$

The above equation can be seen to be true in Table 2.1, where the sum of the squares of the observations, $\sum x^2$, is given as 43.71.

We thus obtain

$$\sum x^2 - n\bar{x}^2 = 43.71 - 15 \times 1.5^2 = 9.96$$

the same value given for the total in column (3). Care should be taken because this formula involves subtracting two large numbers to get a small one, and can lead to incorrect results if the numbers are very large. For example, try finding the standard deviation of 100001, 100002, 100003 on a calculator. The correct answer is 1, but many calculators will give 0 because of rounding error. The solution is to subtract a large number from each of the observations (say 100000) and calculate the standard deviation on the remainders, namely 1, 2 and 3.
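The rounding problem can be imitated in Python by limiting every intermediate result to a fixed number of significant digits. In this sketch the ten-digit precision is an assumption about a typical calculator, not a property of any particular model, and `shortcut_sd` is an illustrative name. The sum of squared differences is computed by the shortcut, as the sum of the squares minus the squared total divided by n.

```python
from decimal import Decimal, localcontext

def shortcut_sd(values, digits=10):
    """Sample SD by the shortcut formula, with every step rounded to `digits` figures."""
    with localcontext() as ctx:
        ctx.prec = digits                    # assumed calculator precision
        xs = [Decimal(v) for v in values]
        n = len(xs)
        sum_x = sum(xs)
        sum_x2 = sum(x * x for x in xs)      # each square is rounded here
        sum_sq_diff = sum_x2 - sum_x * sum_x / n
        return (sum_sq_diff / (n - 1)).sqrt()

raw = shortcut_sd([100001, 100002, 100003])
shifted = shortcut_sd([x - 100000 for x in (100001, 100002, 100003)])

print(raw)      # 0: the squares need 11 digits, so the differences are lost
print(shifted)  # 1: the remainders 1, 2, 3 are handled exactly
```

Subtracting the same constant from every observation leaves the standard deviation unchanged, which is why the remainders give the correct answer.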

Standard deviation from grouped data

We can also calculate a standard deviation for discrete quantitative variables. For example, in addition to studying the lead concentration in the urine of 140 children, the paediatrician asked how often each of them had been examined by a doctor during the year. After collecting the information he tabulated the data shown in Table 2.2, columns (1) and (2). The mean is calculated by multiplying column (1) by column (2), adding the products, and dividing by the total number of observations.

Table 2.2. Number of visits to a doctor (column 1) and number of children (column 2), with the further columns used to calculate the mean and standard deviation.

As we did for continuous data, to calculate the standard deviation we square each of the observations in turn. In this case the observation is the number of visits, but because we have several children in each class, shown in column (2), each squared number (column (4)) must be multiplied by the number of children. The sum of squares is given at the foot of column (5), namely 1697. We then use the calculator formula to find the variance:

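$$\text{variance} = \frac{\sum f x^2 - n\bar{x}^2}{n - 1}$$

and

$$\mathrm{SD} = \sqrt{\text{variance}}$$

where $f$ is the number of children making $x$ visits, $\sum f x^2 = 1697$, and $n = 140$.

Note that although the number of visits is not Normally distributed, the distribution is reasonably symmetrical about the mean. The approximate 95% range is given by

$$\bar{x} \pm 2\,\mathrm{SD}$$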

This excludes two children with no visits and six children with six or more visits. Thus there are eight of 140 = 5.7% outside the theoretical 95% range.

Note that it is common for discrete quantitative variables to have what are known as skewed distributions, that is, they are not symmetrical. One clue to lack of symmetry from derived statistics is when the mean and the median differ considerably. Another is when the standard deviation is of the same order of magnitude as the mean, but the observations must be non-negative. Sometimes a transformation will convert a skewed distribution into a symmetrical one. When the data are counts, such as number of visits to a doctor, often the square root transformation will help, and if there are no zero or negative values a logarithmic transformation will render the distribution more symmetrical.

Data transformation

An anaesthetist measures the pain of a procedure using a 100 mm visual analogue scale on seven patients. The results are given in Table 2.3, together with the $\log_e$ transformation (the ln button on a calculator).

Table 2.3. Pain scores (mm) for seven patients, with their $\log_e$ values.

The data are plotted in Figure 2.2, which shows that the outlier does not appear so extreme in the logged data. The mean and median are 10.29 and 3, respectively, for the original data, with a standard deviation of 20.22. Where the mean is bigger than the median, the distribution is positively skewed. For the logged data the mean and median are 1.24 and 1.10 respectively, indicating that the logged data have a more symmetrical distribution. Thus it would be better to analyse the log-transformed data in statistical tests than to use the original scale.

Figure 2.2. Pain scores plotted on the original and logged scales.

In reporting these results, the median of the raw data would be given, but it should be explained that the statistical test was carried out on the transformed data. Note that the median of the logged data is the same as the log of the median of the raw data; however, this is not true for the mean. The mean of the logged data is not necessarily equal to the log of the mean of the raw data. The antilog (exp or $e^x$ on a calculator) of the mean of the logged data is known as the geometric mean, and is often a better summary statistic than the mean for data from positively skewed distributions. For these data the geometric mean is 3.45 mm.
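These relationships between the raw and logged summaries can be checked numerically. In the Python sketch below the seven pain scores are an assumption, since Table 2.3 is not reproduced in this text; they are taken to be the table's values and are consistent with the mean of 10.29 and the logged mean of 1.24 quoted above.

```python
import math
import statistics

# Assumed pain scores (mm) for Table 2.3, which is missing from this text.
scores = [1, 1, 2, 3, 3, 6, 56]
logged = [math.log(x) for x in scores]  # natural logarithm (ln)

# The median of the logs equals the log of the median ...
assert math.isclose(statistics.median(logged),
                    math.log(statistics.median(scores)))

# ... but the mean of the logs is not the log of the mean.
mean_of_logs = statistics.mean(logged)
log_of_mean = math.log(statistics.mean(scores))
print(round(mean_of_logs, 2), round(log_of_mean, 2))

# Geometric mean: the antilog of the mean of the logged data.
geometric_mean = math.exp(mean_of_logs)
print(round(geometric_mean, 2))
```

Working from the unrounded mean of the logs gives a geometric mean of about 3.47 mm; taking the antilog of the rounded value 1.24 gives about 3.46, close to the figure quoted in the text.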

Between subjects and within subjects standard deviation

If repeated measurements are made of, say, blood pressure on an individual, these measurements are likely to vary. This is within subject, or intrasubject, variability and we can calculate a standard deviation of these observations. If the observations are close together in time, this standard deviation is often described as the measurement error.

Measurements made on different subjects vary according to between subject, or intersubject, variability. If many observations were made on each individual, and the average taken, then we can assume that the intrasubject variability has been averaged out and the variation in the average values is due solely to the intersubject variability. Single observations on individuals clearly contain a mixture of intersubject and intrasubject variation.

The coefficient of variation (CV%) is the intrasubject standard deviation divided by the mean, expressed as a percentage. It is often quoted as a measure of repeatability for biochemical assays, when an assay is carried out on several occasions on the same sample. It has the advantage of being independent of the units of measurement, but also numerous theoretical disadvantages. It is usually nonsensical to use the coefficient of variation as a measure of between subject variability.
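As a small illustration of the definition, the coefficient of variation of repeated assay results can be computed as below. The five replicate values are hypothetical, invented only to show the arithmetic, and `cv_percent` is an illustrative name.

```python
import statistics

def cv_percent(measurements):
    """Coefficient of variation: SD divided by the mean, as a percentage."""
    return 100 * statistics.stdev(measurements) / statistics.mean(measurements)

# Hypothetical repeat assays of one sample (arbitrary units).
replicates = [4.8, 5.1, 5.0, 4.9, 5.2]
print(round(cv_percent(replicates), 1))

# Changing the units (here multiplying by 1000) leaves the CV unchanged.
rescaled = [1000 * x for x in replicates]
print(round(cv_percent(rescaled), 1))
```

Both calls print the same percentage, which is the sense in which the coefficient of variation is independent of the units of measurement.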

Common questions

When should I use the mean and when should I use the median to describe my data?

It is a commonly held misapprehension that for Normally distributed data one uses the mean, and for non-Normally distributed data one uses the median. Alas this is not so: if the data are Normally distributed the mean and the median will be close; if the data are not Normally distributed then both the mean and the median may give useful information. Consider a variable that takes the value 1 for males and 0 for females. This is clearly not Normally distributed. However, the mean gives the proportion of males in the group, whereas the median merely tells us which group contained more than 50% of the people. Similarly, the mean from ordered categorical variables can be more useful than the median, if the ordered categories can be given meaningful scores. For example, a lecture might be rated as 1 (poor) to 5 (excellent). The usual statistic for summarising the result would be the mean. In the situation where there is a small group at one extreme of a distribution (for example, annual income) then the median will be more "representative" of the distribution.

My data must have values greater than zero and yet the mean and standard deviation are about the same size. How does this happen?

If data have a very skewed distribution, then the standard deviation will be grossly inflated, and is not a good measure of variability to use. As we have shown, occasionally a transformation of the data, such as a log transform, will render the distribution more symmetrical. Alternatively, quote the interquartile range.

References

1. Mullee M A. How to choose and use a calculator. In: How to do it 2. BMJ Publishing Group, 1995:58-62.

Exercises

Exercise 2.1

In the campaign against smallpox a doctor inquired into the number of times 150 people aged 16 and over in an Ethiopian village had been vaccinated. He obtained the following figures: never, 12 people; once, 24; twice, 42; three times, 38; four times, 30; five times, 4. What is the mean number of times those people had been vaccinated and what is the standard deviation?
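A calculation of this kind, from a frequency table, can be sketched in Python as follows. The function name `grouped_mean_sd` is illustrative; the frequencies are those given in the exercise, and the sketch applies the n − 1 divisor used throughout this chapter.

```python
import math

def grouped_mean_sd(frequencies):
    """Mean and sample SD from a {value: count} frequency table."""
    n = sum(frequencies.values())
    sum_x = sum(x * f for x, f in frequencies.items())
    sum_x2 = sum(x * x * f for x, f in frequencies.items())
    mean = sum_x / n
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    return mean, math.sqrt(variance)

# Number of vaccinations -> number of people (Exercise 2.1).
vaccinations = {0: 12, 1: 24, 2: 42, 3: 38, 4: 30, 5: 4}

mean, sd = grouped_mean_sd(vaccinations)
print(f"mean = {mean:.2f}, SD = {sd:.2f}")
```

Each squared value is weighted by its frequency, which mirrors the multiplication of the squared observations by the number of subjects in each class described for grouped data.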