12-31-2017, 12:50 AM
Hi,
I noticed that the HP Prime Statistics 2Var App has got the very useful quadratic regression function, but as I understand it doesn't calculate the correlation coefficient "r" (it just gives NaN).
So, I thought it could be a fun exercise to reimplement the quadratic regression from scratch and including the r calculation as well.
example usage
The returned list gives the coefficient for a parabola with the formula f(x) = A + Bx + Cx^2. The order of the coefficient is chosen to be compatible with the built-in function "polynomial_regression".
The fourth element in the list L2(4) is the correlation coefficient.
You can compare the built-in polynomial_regression(L1, 2) and qreg(L1) and compare the results. From a few, totally non scientific, test that I made the errors seem to be around 1e-10 on average.
Sources:
Formula used http://keisan.casio.com/exec/system/14059932254941
I believe that the coefficient are correct at least to 8 significant digits, but I'm not 100% sure.
Here it is another implementation I did in Python, to check if the algorithm was correct (I find it easier to do a prototype in Python and then translate it to HPPPL later).
The python version have embedded test data and prints the results. As I said, it's just a quick prototype, not intended for real use, so I didn't bother adding proper argument parsing from the command line.
I hope you find it useful, I plan on translating it to the HP 35S later on, if I have time.
EDIT: about the example data in the python code. It's a list of measured diameters (in millimeters) and sizes (in GB) of the reflective layer of several DVD-R that I've had around my house. As all of you may know, you can easily see the occupied area of a DVD-R, so I thought I would calculate a formula to get an approximation of the space left on a disk just by measuring the diameter of the written (lighter) area, and since the area grows with the square of the diameter, it seemed appropriate to use a quadratic regression fitting for this task.
I noticed that the HP Prime Statistics 2Var App has got the very useful quadratic regression function, but as I understand it doesn't calculate the correlation coefficient "r" (it just gives NaN).
So, I thought it could be a fun exercise to reimplement the quadratic regression from scratch and including the r calculation as well.
Code:
EXPORT qreg(data)
BEGIN
LOCAL n := SIZE(data);
LOCAL n_inv := 1.0 / n;
LOCAL sigmaX := 0, sigmaY := 0, sigmaX2 := 0, sigmaX3 := 0, sigmaX4 := 0, sigmaXY := 0, sigmaX2Y := 0;
local i;
FOR i FROM 1 TO n DO
local x := data(i,1);
local y := data(i,2);
local x2 := x * x;
local x3 := x2 * x;
local x4 := x2 * x2;
local xy := x * y;
local x2y := x2 * y;
sigmaX := sigmaX + x;
sigmaY := sigmaY + y;
sigmaX2 := sigmaX2 + x2;
sigmaX3 := sigmaX3 + x3;
sigmaX4 := sigmaX4 + x4;
sigmaXY := sigmaXY + xy;
sigmaX2Y := sigmaX2Y + x2y;
END;
local x_m := sigmaX * n_inv;
local x2_m := sigmaX2 * n_inv;
local x_m2 := x_m * x_m;
local y_m := sigmaY * n_inv;
local Sxx := sigmaX2 * n_inv - x_m2;
local Sxy := sigmaXY * n_inv - x_m * y_m;
local Sxx2 := sigmaX3 * n_inv - x_m * x2_m;
local Sx2x2 := sigmaX4 * n_inv - x2_m * x2_m;
local Sx2y := sigmaX2Y * n_inv - x2_m * y_m;
local comm := 1.0 / (Sxx * Sx2x2 - Sxx2 * Sxx2);
local B := (Sxy * Sx2x2 - Sx2y * Sxx2) * comm;
local C := (Sx2y * Sxx - Sxy * Sxx2) * comm;
local A := y_m - B * x_m - C * x2_m;
local sigmaUp := 0;
local sigmaDown := 0;
FOR i FROM 1 TO n DO
local x := data[i,1];
local y := data[i,2];
sigmaUp := sigmaUp + (y - (A + B * x + C * x * x))^2;
sigmaDown := sigmaDown + (y - y_m)^2;
END;
local r := sqrt(1.0 - sigmaUp / sigmaDown);
RETURN {C, B, A, r};
END;
example usage
Code:
L1 := {{1,1}, {2.1, 4}, {3.98, 16.2}}
L2 := qreg(L1)
The returned list gives the coefficient for a parabola with the formula f(x) = A + Bx + Cx^2. The order of the coefficient is chosen to be compatible with the built-in function "polynomial_regression".
The fourth element in the list L2(4) is the correlation coefficient.
You can compare the built-in polynomial_regression(L1, 2) and qreg(L1) and compare the results. From a few, totally non scientific, test that I made the errors seem to be around 1e-10 on average.
Sources:
Formula used http://keisan.casio.com/exec/system/14059932254941
I believe that the coefficient are correct at least to 8 significant digits, but I'm not 100% sure.
Here it is another implementation I did in Python, to check if the algorithm was correct (I find it easier to do a prototype in Python and then translate it to HPPPL later).
Code:
import math
# The data in pairs (mm, GB)
data = [
(45, 0),
(90, 2.39),
(117, 4.7),
(58, 0.34),
(97, 2.68),
(114, 4.18),
(106, 3.41),
(86, 1.93)
]
# count all the sums
n = len(data)
n_inv = 1.0 / n
sigmaX = sigmaY = sigmaX2 = sigmaX3 = sigmaX4 = sigmaXY = sigmaX2Y = 0
for x,y in data:
x2 = x * x
x3 = x2 * x
x4 = x2 * x2
xy = x * y
x2y = x2 * y
sigmaX += x
sigmaY += y
sigmaX2 += x2
sigmaX3 += x3
sigmaX4 += x4
sigmaXY += xy
sigmaX2Y += x2y
x_m = sigmaX * n_inv
x2_m = sigmaX2 * n_inv
x_m2 = x_m * x_m
y_m = sigmaY * n_inv
# Calcuate the coefficients
Sxx = sigmaX2 * n_inv - x_m2
Sxy = sigmaXY * n_inv - x_m * y_m
Sxx2 = sigmaX3 * n_inv - x_m * x2_m
Sx2x2 = sigmaX4 * n_inv - x2_m * x2_m
Sx2y = sigmaX2Y * n_inv - x2_m * y_m
# Calculate the parabola y = A + Bx + Cx^2
comm = 1.0 / (Sxx * Sx2x2 - Sxx2 * Sxx2)
B = (Sxy * Sx2x2 - Sx2y * Sxx2) * comm
C = (Sx2y * Sxx - Sxy * Sxx2) * comm
A = y_m - B * x_m - C * x2_m
def f(x):
return A + B * x + C * x * x
# calculate the correlation coefficient r
sigmaUp = sigmaDown = 0
for x,y in data:
sigmaUp += (y - f(x)) ** 2
sigmaDown += (y - y_m) ** 2
r = math.sqrt(1.0 - sigmaUp / sigmaDown)
print('f(x) = {} x^2 + {} x + {}'.format(C, B, A))
print('r = {}'.format(r))
The python version have embedded test data and prints the results. As I said, it's just a quick prototype, not intended for real use, so I didn't bother adding proper argument parsing from the command line.
I hope you find it useful, I plan on translating it to the HP 35S later on, if I have time.
EDIT: about the example data in the python code. It's a list of measured diameters (in millimeters) and sizes (in GB) of the reflective layer of several DVD-R that I've had around my house. As all of you may know, you can easily see the occupied area of a DVD-R, so I thought I would calculate a formula to get an approximation of the space left on a disk just by measuring the diameter of the written (lighter) area, and since the area grows with the square of the diameter, it seemed appropriate to use a quadratic regression fitting for this task.