The Museum of HP Calculators

HP Forum Archive 14

 HP 10 Stat problemMessage #1 Posted by Mona on 11 Mar 2004, 1:05 p.m. I have an HP 10B and can't figure out how to compute the standard deviation once you have probabilities and rates of return. For example: probabilities .3, .4, and .3, with associated rates of return 60%, 20%, and -20%. I have the owner's manual, but I can't find instructions for computing the standard deviation for this particular example. Does anyone have the same model calculator and know how to do this? I appreciate your help. Mona Edited: 11 Mar 2004, 1:09 p.m.

 Re: HP 10 Stat problemMessage #2 Posted by Vieira, Luiz C. (Brazil) on 11 Mar 2004, 2:58 p.m.,in response to message #1 by Mona Hello, Mona; which HP10 do you have? HP10B, 10BII, HP10C? I cannot remember another model... The original HP10 (printing adding machine) has no statistical resources, so I guess it's one of the others. If you have an HP10C, the oldest of the three, then you can find the standard deviation for the X-values with the sequence [f][S], where [S] is the orange (yellow, gold...) inscription over the [.] key. To see the standard deviation for the Y-values, use [X<>Y] (X exchange Y, beside the [CLx] key). If yours is an HP10B (brown case), then the standard deviation for the X-values is computed with the sequence [SHIFT][Sx,Sy] ([SHIFT] is the orange key, and [Sx,Sy] is over the [8] key). You need to use [SHIFT][SWAP] to see the standard deviation for the Y-values. If yours is a newer HP10BII, the procedure is the same as on an HP10B, except for the location of the [SWAP] key. BTW, both the HP10B and HP10BII offer the standard deviation for both a population AND a sample; the HP10C offers only the sample standard deviation. Hope this helps. Cheers. Luiz (Brazil)

 Re: HP 10 Stat problemMessage #3 Posted by hugh steers on 11 Mar 2004, 8:40 p.m.,in response to message #2 by Vieira, Luiz C. (Brazil) hi luiz, radically, i would like to suggest that the 10b, along with the vast majority of calculators, is incompetent at calculating standard deviations. try this: ```
cl SUM, 1000000, SUM+, 999999, SUM+, 1000001, SUM+, Sx
``` which yields zero. the answer is 1. for a long time, i thought that accurate computation would necessitate the storage of all values (this is how the 48 does it). but no! a simple algorithm known in the 70's was this: ```
n=0, s=0, m=0.
loop: get x.
      n + 1 -> n.
      x - m -> t.
      m + t/n -> m.
      s + t*(x - m) -> s.
      goto loop.
``` the standard deviation is then sqrt(s/(n-1)). what is interesting is that this way is actually simpler than the textbook sum-of-squares and squared-sum formula. the incompetence of calculators at sd was pointed out to me by a colleague who works in a laboratory where experimental samples are always biased by a significant mean offset, and the raw sample numbers never come out correctly when input to calculators (nor excel, so it seems). best wishes,
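For anyone who wants to experiment, hugh's one-pass recurrence (usually credited to Welford) can be sketched in a few lines of Python; the function name is mine, not from the thread:

```python
import math

def stdev_welford(data):
    """Sample standard deviation via the one-pass running-mean
    recurrence described above (Welford's algorithm)."""
    n = 0
    m = 0.0  # running mean of the values seen so far
    s = 0.0  # running sum of squared deviations from the mean
    for x in data:
        n += 1
        t = x - m          # deviation from the old mean
        m += t / n         # update the mean
        s += t * (x - m)   # t uses the old mean, (x - m) the new one
    return math.sqrt(s / (n - 1))

# the problem case from the post above: the answer is 1, not 0
print(stdev_welford([1000000, 999999, 1000001]))  # → 1.0
```

Because each update works with deviations from the running mean rather than raw squares, no large intermediate sums ever form.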

 Re: HP 10 Stat problemMessage #4 Posted by Iuri Wickert on 12 Mar 2004, 12:26 a.m.,in response to message #3 by hugh steers Quote: cl SUM, 1000000, SUM+, 999999, SUM+, 1000001, SUM+, Sx Very interesting example! Thanks! My Casio FX-82MS shows this behavior, but my spectra ssc200 clone doesn't: Sx=1, σx=0.8164... Quote: the incompetence of calculators at sd was pointed out to me by a colleague who works in a laboratory where experimental samples are always biased by a significant mean factor and the raw sample numbers never work correctly when input to calculators (nor excel, so it seems). Maybe your colleague should take a look at OpenOffice.org 1.1! Best regards, Iuri Wickert

 Good CallMessage #5 Posted by Namir on 12 Mar 2004, 9:51 a.m.,in response to message #3 by hugh steers Thanks Hugh for the valuable insight. Like many, I thought the std-dev stats offered by the calculators were always on the money!!

 Re: HP 10 Stat problemMessage #6 Posted by Namir on 12 Mar 2004, 2:15 p.m.,in response to message #3 by hugh steers Some statisticians advocate transforming the data by subtracting the average value from the observations. This improves accuracy in the calculations for the standard deviation as well as for linear regression slopes and intercepts. Namir

 Re: HP 10 Stat problemMessage #7 Posted by Eric Smith on 12 Mar 2004, 2:30 p.m.,in response to message #3 by hugh steers It's been too many years since I took statistics class, but if I recall correctly, this depends on whether you're computing the standard deviation of a sample or of a population. For small values of n (three in your example) there is a substantial difference.

 Re: HP 10 Stat problemMessage #8 Posted by Norris on 12 Mar 2004, 7:36 p.m.,in response to message #7 by Eric Smith This issue is explicitly discussed in at least some HP calculator manuals. For example, it is addressed on page 11-11 of the 32SII manual, under "Normalizing Close, Large Numbers." The issue affects linear regression calculations as well as the standard deviation. The problem is not due to population vs. sample standard deviation; it's actually due to roundoff error. Most calculators obtain the standard deviation by summing the squares of the entered values. But if large values are entered, the square may have more significant digits than the calculator can keep track of. For example, 999,999 squared should be 999,998,000,001, but many calculators will round this off to 0.999998 E12, or 999,998,000,000. Another digit will get lost when 1,000,001 is squared. So the sum of x^2 will be incorrect, and since this sum is used to calculate the standard deviation, the s.d. will be wrong too. The 32SII manual shows how to work around this problem by normalizing the data. The HP48GX calculates s.d. differently: it sums the square of the difference between each entered value and the mean (as per p. 3-301 of the AURM). The 48GX solves the stated problem correctly without normalizing.
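Norris's roundoff explanation is easy to simulate: round each square to 10 significant digits, as a 10-digit calculator would, and the textbook formula collapses to zero on hugh's data. This is an illustrative sketch of the mechanism, not any particular calculator's internals:

```python
def round10(x):
    """Round to 10 significant digits, mimicking a 10-digit calculator."""
    return float(f"{x:.9e}")

data = [1000000, 999999, 1000001]
n = len(data)
sum_x = sum(data)                           # 3,000,000 -- still exact
sum_x2 = sum(round10(x * x) for x in data)  # each square loses its last digits

# textbook sum-of-squares formula for the sample variance
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
print(variance)  # → 0.0 (the true sample variance is 1)
```

With 999999^2 rounded to 999,998,000,000 and 1000001^2 to 1,000,002,000,000, the roundoff errors cancel the real scatter exactly.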

 Re: HP 10 Stat problemMessage #9 Posted by Norris on 13 Mar 2004, 12:28 a.m.,in response to message #8 by Norris The HP11C manual (pp. 56-57) includes a similar discussion of possible problems when using statistical functions on large, close numbers. The HP20S is also subject to this problem, but its manual (which is relatively thin) does not acknowledge it.

 Re: Rounding errors on calculators' statistical functionsMessage #11 Posted by Norris on 14 Mar 2004, 12:40 p.m.,in response to message #10 by James M. Prange Your procedure is, as you surmise, similar to the "normalizing" procedure outlined in the HP-11C and HP-32SII manuals (possibly other HP manuals as well). The HP-recommended procedure is to simply subtract the same "central value" (i.e., the mean, or an estimate of the mean) from each data value as it is entered into the calculator. The standard deviation obtained by the calculator should then be correct. Multiplying by powers of 10 to make all data integers is not necessary, but could make the subtraction easier if you are doing it in your head. Alternatively, the 11C or 32SII could be programmed to automatically do the subtraction and enter the normalized results.
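The normalization workaround described here can be mimicked numerically: subtract a central value (below, simply the first data point) before feeding the sums-of-squares method, and the simulated 10-digit roundoff problem from the earlier example disappears. A sketch under that same rounding assumption:

```python
import math

def round10(x):
    """Round to 10 significant digits, mimicking a 10-digit calculator."""
    return float(f"{x:.9e}")

data = [1000000, 999999, 1000001]
c = data[0]                       # "central value" subtracted from each entry
shifted = [x - c for x in data]   # [0, -1, 1]: tiny values, squares stay exact

n = len(shifted)
sum_x = round10(sum(shifted))
sum_x2 = sum(round10(x * x) for x in shifted)
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
print(s)  # → 1.0; the shift leaves the standard deviation unchanged
```

Only the mean needs the central value added back; the standard deviation needs no correction at all.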

 Thank you, Norris...Message #12 Posted by Karl Schneider on 14 Mar 2004, 10:45 p.m.,in response to message #11 by Norris ... for bringing much clarity and insight to this discussion.

 Re: Rounding errors on calculators' statistical functionsMessage #13 Posted by James M. Prange on 15 Mar 2004, 2:05 a.m.,in response to message #11 by Norris Quote:Multiplying by powers of 10 to make all data integers is not necessary, but could make the subtraction easier if you are doing it in your head. Yes, I agree that the multiplying/dividing by powers of 10 is entirely unnecessary to avoid rounding errors in any calculator/computer that uses a floating point representation of numbers. Well, unless there's a chance of overflow of the exponent of 10. The "central value" to be subtracted doesn't have to be one that's "easy" to subtract in your head, just one that reduces the number of significant digits enough to avoid rounding of Sum(xi^2). To do it in a program, I'd simply use the first value entered to subtract from all values. In the RPL calculators that keep the entire data set in memory and don't keep a running total of Sum(xi^2), the subtraction is entirely unnecessary (as long as there's no chance that rounding will occur at the Sum(xi) step). Why bother with these steps then? Well, yes, I suppose that multiplying by a power of 10 does make the subtraction step marginally easier to do in your head. More importantly, whether doing it on paper or using a calculator, it avoids the need to write down or key in the decimal point and/or leading/trailing zeros; a trivial consideration in an example problem of only a few numbers, but in a real-world problem of tens, or more likely hundreds, of data points, a very important consideration for me. When doing it on paper (my original use of "coded data"), those decimal points and zeros would need to be accounted for in subsequent steps, including, depending on which method you choose, the squares of |xi-xbar| or the squares of xi. Using my method substantially reduces the amount of writing needed, and, in my opinion, also reduces the likelihood of errors.
Quote:Alternatively, the 11C or 32SII could be programmed to automatically do the subtraction and enter the normalized results. Quite true, but then you have to key in all of the digits and decimal points (extra key-presses and more opportunities for typing errors). Perhaps best would be to write the program to do this automatically; I wonder why the developers didn't include this, saving unwary users from some incorrect results. Then, if the user chose to use my "coded data" or some similar method, it would still work perfectly and save a lot of key-presses. Regards, James Edited: 15 Mar 2004, 2:35 a.m.

 Thank you, HughMessage #14 Posted by Vieira, Luiz C. (Brazil) on 13 Mar 2004, 9:37 p.m.,in response to message #3 by hugh steers Hello, Hugh; I want to thank you for your excellent example AND analysis. I was not aware of this fact, and such brainy gems must always be taken as precious gifts. Thank you. I also remembered that the HP42S accepts [r × 2] matrices as input data for [SIGMA+], but it still uses summation data to compute the standard deviation. I thought the HP48/49 series also had this "behavior", but you have now called my attention to the fact. Cheers and thank you again. Luiz (Brazil)

 Re: HP 10 Stat problemMessage #15 Posted by Tom Sherman on 22 Mar 2004, 4:26 p.m.,in response to message #3 by hugh steers I enjoyed the recent posts by Hugh Steers, Norris, James Prange, and others concerning the methods for doing standard deviations on calculators. The beautiful algorithm presented by Hugh (I will call it "Hugh's algorithm" if he will forgive me) had my mind boggled for several days -- hence my delay in this response. A calculator can, it seems, be programmed in three main ways (with many variations) to do standard deviations. We recall the definition of the standard deviation: that it is the square root of the variance, where the variance is the sum of the squares of displacement of data points from their mean, normalized by dividing by the number of points, n, if the mean is known beforehand (population variance), or by n-1 if the mean has to be established from the data (sample variance). Let the sum of the squares of the displacements (deviations) be denoted by s, and consider the ways that s can be calculated. The first method directly sums the squares of the deviations from the mean to find s. After all, that is how s is defined, so why would we think of doing it any other way? The problem is that the mean has to be found first, by summing the data points and dividing by n, before the squares of deviations from the mean can be found. So this first method requires that the data be handled twice by the program -- that the program have two loops in series, through which the data pass. If the calculator has so much memory that it can assign a memory register to each data point, this method is fine, but if it has not, then the calculator would require that the operator feed it the data twice. No one would want to do that, and so no calculator that has only a few memory registers would be built using this method. 
The second method makes use of the fact that the original definition of s can be easily transformed by expansion of (x-m)^2, and substituting for m its equivalent of Sum(x)/n. The expression for s then becomes: Sum(x^2)-((Sum(x))^2)/n. The calculation of s can then be done with only one loop, with one pass of the data, as Sum(x^2) and Sum(x) can each be incremented as the new data points are introduced. A calculator using this method does not have to remember the original data. It only needs to have three storage registers, for Sum(x^2), Sum(x), and n. Hence this method is available to calculators having limited memory, and seems to have been adopted by most of them that have a standard deviation program. The problem with this second method, as Norris has lucidly explained, is that the squaring of raw data can overrun the number of digits available to the calculator, if the data have many digits. Small differences in data will therefore be lost by the calculator, and incorrect results will be returned. As Norris and James and the HP manuals explain, this problem can be avoided by subtracting an assumed mean from the data, a process which is called "normalizing" the data (a somewhat inappropriate term, since it involves subtraction rather than division). The best assumed mean would be the mean itself, but if we knew that, we would be using the first method rather than the second. It is fairly easy to see, as James has shown us, that the real mean of data that have been normalized can be recovered by adding back the assumed mean to the mean of the normalized data. At first blush of intuition, it is less obvious (at least to me) that an s calculated from normalized data does not have to be corrected in some way to give back the real s for the original data. But no correction is necessary. 
s is a measure only of scatter or dispersion of points along a number line, and so long as the scale of that line is not changed (by multiplication or division), it does not matter where the zero point of that line is taken. The process of normalizing can be pictured as one of sliding the number line under the data points until its zero is close to the mean of the points. The points remain in the same positions relative to one another, and their total scatter is not affected. A little algebra clinches it: the normalized data yield the correct value for s without any adjustment such as is needed for the mean. As James pointed out, it would be nice if a calculator using this second method had a program for normalizing the data, and James further suggested that the program could start by taking the first point as the assumed mean. That is exactly what the algorithm cited by Hugh does. Hugh's algorithm gives us a third method for calculating s, and one that strikes me as novel, beautiful, and at first, mind-boggling. I had to stare at it for a long time -- unraveling it in my mind, iterating it on paper, confirming it with a BASIC program, and finally going through all its algebra, before I fully understood it. Hugh's algorithm, as he well recognizes, combines the high accuracy of the first method with the second method's economy in storage registers. The data is fed through a single loop and the results are given without any need to retain the data. The only storage registers required are for n (the number of data points), m (the mean of the entered data), s (the same s as used above), and t. What is t? Maybe it stands for temporary, or transition, or tension. It is in fact the deviation of a new data point from the mean of the previously entered data. As it turns out, t is not essential. It can be eliminated altogether if desired, and a variation of the algorithm can be written which requires only three variables: n, m, and s. 
The third method then achieves the high accuracy of the first while using only three storage registers, the same number used by the less accurate second method. The novelty of the third method (or so it seems to me) is that it makes us view s in a somewhat different way. In the original definition of s, used by the first method, we are inclined to view s as a summation of independent contributions from the various data points. But now we are reminded that s results from all the inter-relations of the points, and that the s that a new point brings into a previously-existing set of points is not usually the same s that we would calculate for it after it has joined the set -- because the entrance of the new point usually changes the mean of the set. The s value contributed by an earlier point has, in effect, to be recalculated as a new point is added, because the mean has changed. The first data point entered, for example, initially contributes a zero value for s, because the point is its own mean. But once a second point is added, assuming it is different from the first, the mean shifts to a position halfway between the two points, and the increment of s brought in by the second point has to account not only for its own s for the now two-point system, but for the s of the first point as well. At each stage, as new points are brought in, the algorithm determines the mean and s for that number of points. At each step, the new point brings in an increment of s that correctly increases the s for the previous points to allow for the change in mean that the new point has caused. The mean of a group of points represents not only their average value (in the sense of Sum(x)/n), but the point at which the sum of squares of their deviations (their s value) is at a minimum. Hence a change in mean caused by a new point also causes an increase in s for the previous points. 
The strategy of the third method is to continually recalculate the mean as the new points are fed in, and then to use the new mean to calculate the increment in s. It is, in effect, a program to continually "re-normalize" the data as the new points are declared. At each step, the program has the true mean and s value for the declared set, even though it has "forgotten" what the previous points were. And at each step the multiplications are of relatively small numbers -- multiplications of normalized data rather than of the original data. In the variation of the algorithm that eliminates t and works only with n, m, and s, the values of m and s are incremented (as n increases) by the following relations: ```
m(new) = ((n-1)*m(old) + x)/n
s(new) = s(old) + (n*(x-m(new))^2)/(n-1)
``` Ah, no -- I am not the first to work out these relations. Michael Zeltkevic (1995) has them at: http://web.mit.edu/10.001/Web/Course_Notes/Statistics_Notes/Visualization/node4.html A short BASIC program to run this algorithm, using only storage variables n, m, and s, could be something like this: ```
10 S=0
20 PRINT "ENTER 999.111 TO END DATA INPUT"
30 INPUT "X =";X
40 M=X
50 N=1
60 INPUT "X =";X
70 IF X=999.111 THEN 120
80 N=N+1
90 M=((N-1)*M+X)/N
100 S=S+(N*(X-M)^2)/(N-1)
110 GOTO 60
120 PRINT "MEAN =";M
130 PRINT "VARIANCE =";S/(N-1)
140 PRINT "STD.DEV =";SQR(S/(N-1))
150 PRINT "COEF. OF VAR =";(SQR(S/(N-1)))/M
160 END
``` (Since it has no t variable with which to work, this variation of Hugh's algorithm keeps the first x value outside the main loop in order to prevent a division by zero in line 100 -- hence the inelegance of the two input commands. Perhaps you can find a nicer way of doing it.) I want to thank again Hugh, Norris, James, and all the others for such an interesting and illuminating series of posts about the calculation of the standard deviation. 
I think Hugh was right that the makers of many calculators have failed to find the most accurate way to do the calculation when memory is limited. My apologies for being so long-winded in saying this. Cheers, Tom
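Tom's three-register variant translates almost line for line from his BASIC listing; here is a Python rendering (the function name is mine), with the first point seeding the mean outside the loop exactly as in his program:

```python
import math

def three_register_stats(data):
    """Mean and sample standard deviation using only n, m, s,
    via m_new = ((n-1)*m_old + x)/n and
        s_new = s_old + n*(x - m_new)^2/(n-1)."""
    values = iter(data)
    m = next(values)  # first point is its own mean (avoids n-1 = 0 below)
    n = 1
    s = 0.0
    for x in values:
        n += 1
        m = ((n - 1) * m + x) / n
        s += n * (x - m) ** 2 / (n - 1)
    return m, math.sqrt(s / (n - 1))

mean, sd = three_register_stats([1000000, 999999, 1000001])
print(mean, sd)  # → 1000000.0 1.0
```

Like Hugh's version, it handles the large-mean data exactly, because every multiplication involves deviations rather than raw values.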

 Your finance book should show how to do thisMessage #16 Posted by Gene on 11 Mar 2004, 8:56 p.m.,in response to message #1 by Mona This is a classic example in most principles-of-finance books. The HP10BII (and HP10B, for that matter) do not solve standard deviations this way. They expect a list of numbers. You'll have to do this by hand using the formula in your book. Sorry! Gene

 Re: HP 10 Stat problemMessage #17 Posted by Tizedes Csaba on 12 Mar 2004, 6:01 p.m.,in response to message #1 by Mona Hello, if I had to solve this, I would do it like this: ```
On my 32SII:

CLSum
60 Sum+   60 Sum+   60 Sum+
20 Sum+   20 Sum+   20 Sum+   20 Sum+
-20 Sum+  -20 Sum+  -20 Sum+

x_Aver    Result is: 20.0000
s_x       Result is: 32.6599
sigma_x   Result is: 30.9839
``` I hope I didn't make a mistake... Csaba
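Csaba's trick of repeating each value in proportion to its probability (3, 4, and 3 copies for weights .3, .4, .3) is easy to cross-check with Python's standard statistics module:

```python
import statistics

# probabilities .3, .4, .3 over a common denominator of 10 -> 3, 4, 3 copies
data = [60] * 3 + [20] * 4 + [-20] * 3

print(statistics.mean(data))    # → 20
print(statistics.stdev(data))   # ≈ 32.6599 (sample SD, divides by n-1)
print(statistics.pstdev(data))  # ≈ 30.9839 (population SD, divides by n)
```

Note that the sample figure treats the ten copies as ten independent observations, which is why only the population value matches the probability-weighted answer worked out later in the thread.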

 Re: HP 10B Stat problemMessage #18 Posted by Karl Schneider on 13 Mar 2004, 5:59 p.m.,in response to message #1 by Mona Mona -- I also have a 10B, so I should be able to help. I'm afraid that our group's discussions haven't really addressed your question. With rates of return and probabilities given, it would seem that what you want is a weighted sum -- ```
(0.3 * 60) + (0.4 * 20) + (0.3 * -20) = 20.0
``` This can be done by the following procedure ("E" = capital Sigma): ```
CL E
.3 INPUT 60 E+
.4 INPUT 20 E+
.3 INPUT 20 +/- E+
RCL 9
``` (Register 9 holds the E_xy summation.) Or, using the built-in function Xw (shifted "6" key): ```
CL E
60 INPUT .3 E+
20 INPUT .4 E+
20 +/- INPUT .3 E+
Xw
``` To calculate sample standard deviations, do ```
Sx,Sy (shifted "8" key)   (read sample SD of probabilities)
SWAP                      (read sample SD of percentage rates of return)
``` Calculation of population standard deviations is similar, using "σx,σy" (shifted "9" key). -- Karl S.

 It's not exactly correct...Message #19 Posted by Tizedes Csaba on 13 Mar 2004, 6:15 p.m.,in response to message #18 by Karl Schneider Hi Karl, the calculation of the mean is correct, but this is ONE sample, not two. We want to calculate the SD of 60 with probability 0.3, and so on... The calculator's summation doesn't know that! Csaba Ps.: I think I will write a little program for the correct solution...

 ...and neither was your statement...Message #20 Posted by Karl Schneider on 13 Mar 2004, 6:51 p.m.,in response to message #19 by Tizedes Csaba Tizedes -- I admit that I probably didn't understand the problem as Mona stated it, but certainly the weighted mean calculation seemed relevant, and I showed how to calculate standard deviations. Quote: the calculation of the mean is correct, but this is ONE sample, not two. We want to calculate the SD of 60 with probability 0.3, and so on... The calculator's summation doesn't know that! Well, I do! The population SD of any single sample (including "60") is zero, with absolute certainty. The sample SD of a single sample cannot be calculated. Or, did you mean that this was a one-variable calculation? There are many ways to interpret the problem as it was stated, and I question how well the author understood the problem. Maybe the standard deviation of the three expected values could be computed: ```
CL E
.3 * 60 =      E+
.4 * 20 =      E+
.3 * 20 +/- =  E+
RCL 5          (gives weighted mean of 20.0)
(mean, Sample SD, pop SD can then be calculated)
``` -- Karl

 To Karl (and Mona) - The correct wayMessage #21 Posted by Tizedes Csaba on 13 Mar 2004, 8:43 p.m.,in response to message #20 by Karl Schneider Hello, the correct solution is as follows: ```
Given the following data with their probabilities:
--------------
 i    pi    xi
 1   0.3    60
 2   0.4    20
 3   0.3   -20
--------------

The average is:
  xAver = SUMMA(pi*xi) = 20.0000

The SD is:
  xSD = SQRT(SUMMA((xi-xAver)^2*pi)) = 30.9839
``` Best wishes! Csaba Ps.: Dear Karl, I'm so sorry, but I'm not a genius in English, so I don't understand everything in your letter... I'm so sorry, again...! Ps2.: "Or, did you mean that this was a one-variable calculation?" This is a one-variable discrete distribution.
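Csaba's two formulas drop straight into Python; for Mona's numbers they reproduce the figures above:

```python
import math

p = [0.3, 0.4, 0.3]   # probabilities
x = [60, 20, -20]     # rates of return, in percent

# xAver = SUMMA(pi*xi)
mean = sum(pi * xi for pi, xi in zip(p, x))

# xSD = SQRT(SUMMA((xi - xAver)^2 * pi))
sd = math.sqrt(sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, x)))

print(mean)  # ≈ 20.0
print(sd)    # ≈ 30.9839
```

(The formulas assume the probabilities sum to 1, as they should; otherwise each sum must be divided by SUMMA(pi).)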

 Tizedes is right...Message #24 Posted by Karl Schneider on 13 Mar 2004, 11:36 p.m.,in response to message #21 by Tizedes Csaba Tizedes -- Quote: Ps.: Dear Karl, I'm so sorry, but I'm not a genius in English, so I don't understand everything in your letter... I'm so sorry, again...! No need for contrition -- I admit to being a bit "snarky", but when things aren't stated correctly, I'm not bashful about pointing it out. I do acknowledge that you are not a native speaker of English. All along, I believed that the problem was a bit more complicated than a garden-variety SD calculation, but I didn't take the trouble to figure out what exactly was to be calculated. I think you provided the formula for what Mona actually wanted. Your equations/answers of: ```
The average is: xAver = SUMMA(pi*xi) = 20.0000
The SD is:      xSD = SQRT(SUMMA((xi-xAver)^2*pi)) = 30.9839
``` are correct assuming that the sum of the weighting factors is unity (1.00), as they ought to be (and indeed were in this example). Otherwise, each summation must be divided by the sum of the weighting factors SUMMA(pi). I found a similar problem on p. 110 of Schaum's Outline of Statistics (c. 1961). -- Karl