# HP Forums

Full Version: Important Request
The statistics application has many options for regression analysis, but I think a very important one is missing: the "Best Fit" option that the HP 50g has. It matters because it saves you from having to go through every fit option on the calculator to find the curve that best fits the plotted data.
So you are saying you should "numerically" decide which is the best fit on the data without doing any sort of analysis regarding whether the fit is actually appropriate for the situation?

The 50g only has linear fits. Once you move into different fits, you have no equivalent to "R" to do comparison with. What would you propose comparing when there is no equivalent?
No, I mean that the HP 50g has an option for regression analysis called "Best Fit" that is not present in the Prime. This option finds the model with the lowest deviation from the statistical data. Because the Prime lacks it, it is in practice difficult to tell which curve is the best fit.
I'm with Tim here. It is far better to know enough about your data to choose an appropriate fit.

One of the students I did honours statistics with did an exhaustive search for GLIMs on one assignment. All the way up to four factor interaction terms. The resulting best fit was utterly impossible to interpret and a terrible model for the data.

- Pauli
Maybe I have not explained myself well. On the HP 50g, given a data set of points, the "Best Fit" option makes the calculator look for the curve that best fits those points. I think that internally it computes the deviation by the least squares method for each regression model and then chooses the model with the lowest deviation. All of that happens in the "Best Fit" option of the statistics application, and it would be good for the Prime too.
I hesitate to disagree with the teaching experts here, but I agree with math7 on this. In fact, it would be very useful if Prime could simply create -- in one step -- a table of ALL the correlation coefficients for ALL the available curve fit types. My reason is that the BEST teacher is not the expert at the front of the classroom, or the expert who wrote the textbook, but the student's own CURIOSITY which leads him or her to PLAY with the concepts being taught. The best way to learn what correlation really means is by EXPLORING its value generated by different data sets for different curve fit equations. Easy access to that information would stimulate such exploration. But such exploration FAILS to be student-motivated if the calculator makes it a tedious procedure, as Prime currently does, requiring lots of keystrokes (as well as writing the results down on paper!) for each correlation coefficient.

YES, understanding of the function that generated the data is certainly more important than blindly using an automated "best fit" button. Nobody disagrees with that. But that's not a good reason to omit a time-proven tool that stimulates student exploration, especially for the HP Prime whose primary market is students.

In brief: Please stop thinking of "Best Fit" as a way for lazy cheaters to avoid thinking, and start thinking of it as a kind of "Explorer" that curious students will play with and learn from. I've watched many students play with it and learn well from it in the RPL models for over 25 years, and have never once seen it hinder learning.
Hello,

Unfortunately, the "usual" best fit method, based on least (pred(x)-realy)² is actually incorrect as it calculates distances between data and curve on a vertical axis....

The "right" way to do it would be to calculate the distance based on "the" closest perpendicular to the curve that passes through the point...

But this is hard and very computationally intensive to do.

Cyrille
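
To make the vertical versus perpendicular distinction concrete, here is a minimal Python sketch for the one case that has a simple closed form, a straight line. It is only an illustration, not anything the Prime does: the function names are made up, and `orthogonal_line` assumes the x/y covariance is not zero. For curved models the perpendicular version has no such shortcut, which is the computational cost mentioned above.

```python
import math

def _moments(xs, ys):
    # Means and centered sums of squares/products for the data.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy

def ols_line(xs, ys):
    """Slope and intercept minimizing squared VERTICAL distances."""
    mx, my, sxx, _, sxy = _moments(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx

def orthogonal_line(xs, ys):
    """Slope and intercept minimizing squared PERPENDICULAR distances.

    Assumption for this sketch: sxy != 0.
    """
    mx, my, sxx, syy, sxy = _moments(xs, ys)
    d = syy - sxx
    slope = (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 2.0, 1.0, 3.0]
    print("vertical:     ", ols_line(xs, ys))
    print("perpendicular:", orthogonal_line(xs, ys))
```

On noisy data the two lines generally differ; on points that lie exactly on a line they coincide.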
Model selection is a long-standing problem in statistics and machine learning. It is still a very active research topic. There is no superior method, approach, or criterion. (That is, one that always returns the correct answer.) One of the best criteria today is Akaike's Information Criterion, corrected for small sample sizes (AICc). It is a combination of a measure of the reduction in the bias (under-fitting) and a measure of the complexity of the model (over-fitting, which leads to variance). The best model is a trade off between bias and variance that is achieved by minimizing AICc.
AICc can be computed directly from the likelihood function of the model parameters when using maximum likelihood estimation or from the sum of the squared errors when using a least squares method.
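
As a rough sketch of the least-squares route, assuming the usual form AIC = n·ln(SSE/n) + 2k with the small-sample correction 2k(k+1)/(n-k-1): the function name `aicc_from_sse` is made up for this example, and whether k should also count the error variance is a convention left to the caller.

```python
import math

def aicc_from_sse(sse, n, k):
    """AICc for a least-squares fit.

    sse: sum of squared errors of the fitted model
    n:   number of data points
    k:   number of estimated parameters (counting convention is the caller's)
    """
    if sse <= 0:
        raise ValueError("sse must be positive (an exact fit has no finite AICc)")
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    aic = n * math.log(sse / n) + 2 * k
    correction = (2 * k * (k + 1)) / (n - k - 1)
    return aic + correction

if __name__ == "__main__":
    # Two hypothetical fits to the same 10 points: the model with the
    # smaller AICc would be preferred.
    print(aicc_from_sse(sse=10.0, n=10, k=2))
    print(aicc_from_sse(sse=8.0, n=10, k=4))
```

The values are only comparable between models fitted to the same response data, so a fit done on ln(y) cannot be ranked against one done on y this way.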
When there is no equivalent calculation of "error" for ALL fit types currently presented, what would I compare?

I'd be fine to implement something, but I have not seen any math that applies to all fit types! Using one type of error calculation mixed with a different type simply messes things up due to bias in the various types of calculating things.
The programmers of the HP 50g built that and other tools into the calculator; if they did, it was because they considered it useful for statistical calculations and for working out which curve best fits a series of data points. The HP Prime offers 13 curve-fitting options for the data series entered in the Numeric view. Please see the attached pictures. At a glance, the "cubic" model clearly suits most of the points better than the "exponential" model. That is visible to the naked eye, but you should not rely on appearance alone. The calculator could compute the dispersion of the data for all 13 built-in models, instead of the user having to do it one at a time, and the model with the lowest dispersion would be the "best fit" curve. It is cumbersome for the user to test 13 different models to see which fits best; that is what the calculator is for. Derivatives and integrals can also be done by hand, but the calculator provides them so that it, not the user, does the work. A 2 x 2 system of equations can be solved on paper, but the calculator does that by itself too; it comes that way from the factory.

The user should not have to test all 13 models and note the results on paper, as Joe Horn says, in order to decide which curve fits best and so obtain its equation. The Prime should work out which of the 13 models comes closest to all the data entered. If the HP 50g can do it, all the more reason for the HP Prime, with its 400 MHz processor.
I agree with Joe Horn's opinion with regard to students' curiosity. Years ago, a work colleague wrote a BASIC language regression program that tested for many types of models (I believe it was twenty or thirty) and ranked them by their correlation coefficient. Although it can easily be argued that it was a shotgun approach to analyzing data, it did cause one to explore and examine data points. Obviously, some of the regression fits were weird or silly but, still, it was all very informative.

Anything that causes a student to explore the subject they are learning can not be a bad thing. Obviously, it is incumbent upon a student's teacher to teach a student not to blindly rely on a cookbook or shotgun approach to the subject they are learning.

Lastly, the aforementioned program I was referring to, was written on an old, slow IBM computer. How hard would it be to write or do something similar on HP's whizbang HP Prime?
The problem here is that you cannot compare "error" in all the models as there is not a mathematical way to do so except in the case of a *LINEAR* model. Every fit in the 50g is a linear fit. The options other than "linear" just apply some transformations to the data, but they are in reality simple linear fits. This makes comparing them trivial. Until someone can point out a unifying error algorithm that applies to all regressions equally, I don't think there is anything I can do here.
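
For readers who want the transformed-linear comparison spelled out, here is a minimal Python sketch; it is not the 50g's actual code. It assumes the four models are linear, logarithmic, exponential and power, and that "best" means the largest |r| on the transformed data. The names `pearson_r`, `MODELS` and `best_fit` are made up for the illustration, and all x and y values are assumed positive so the logarithms exist.

```python
import math

def pearson_r(xs, ys):
    # Ordinary correlation coefficient of two equal-length lists.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def identity(v):
    return v

# Each model is a pair of transforms (for x, for y); after transforming,
# every model is a plain straight-line fit.
MODELS = {
    "linear":      (identity, identity),   # y = a + b*x
    "logarithmic": (math.log, identity),   # y = a + b*ln(x)
    "exponential": (identity, math.log),   # ln(y) = ln(a) + b*x
    "power":       (math.log, math.log),   # ln(y) = ln(a) + b*ln(x)
}

def best_fit(xs, ys):
    """Return (name of model with largest |r|, dict of r per model)."""
    scores = {}
    for name, (fx, fy) in MODELS.items():
        tx = [fx(x) for x in xs]
        ty = [fy(y) for y in ys]
        scores[name] = pearson_r(tx, ty)
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0 * math.exp(0.5 * x) for x in xs]
    print(best_fit(xs, ys))
```

Each r here is computed on the transformed data, which is why the ranking only makes sense inside this family of fits.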

There is not an equivalent "error" value for polynomial fits (there is a value that is close, but not directly comparable to the linear "error"), logistic fits, or trigonometric fits. In addition, once you allow an Nth polynomial fit, your error by definition will not matter, as simply adjusting to N+1 for your polynomial will perfectly match the data.

I am also completely unaware of *any* statistical package that offers an "auto-fit" for anything but simple, linear regressions, and even those are extremely rare.

The other major problem is that unless your data set is very small, the logistic and sinusoidal fitting is VERY slow and consumes VERY large amounts of memory. This is because you have to do a lot of matrix calculations, inversions, and iterations to attempt to minimize errors. They are not "simple" calculations like the linear or polynomial fits and are very noticeable in calculation time when you have anything other than small data sets. If you attempt to run these on data that is not already a decent fit, you can see seconds of calculation time with poor results at the end. Running that constantly for each display would not be a very good UI experience.

Well then, why not just do the linear fits? That was a possibility. However, explaining "this is a best fit, but only for these and not those fits" doesn't really seem like a very understandable explanation and creates other problems in the workflow/ui. Nor is a "best fit" a feature that has been in high demand (in fact, you are the first person to bring it up). There is also a very strong argument that doing a "Best Fit" is in fact a bad practice that should NOT be encouraged or taught for several important reasons.

The great thing is that the Prime is programmable! You can very easily make your own "best fit" with a few simple commands.

Stick this in your app program:

Code:
```
VIEW "Best-Fit",BestFit()
BEGIN
LOCAL j,r,fit;
FOR j FROM 1 TO 6 DO
  S1(4):=j; //set fit type
  STARTVIEW(0,1); //load settings, redraw
  Do2VStats(); //recalculate

  IF CoefDet > r THEN
    r:=CoefDet; //found a better fit
    fit:=j;
  END;
END;
MSGBOX("CoefDet:" + r + "\n"
   + "Fit Selected:" + fit);
S1(4):=fit; //set our selected
STARTVIEW(1,1); //see the plot
END;
```

To run it, press your VIEW key on the app and you have a nice interface doing the best fit.

I am not against adding a "best fit" type of option, I just would want to do it correctly and right now I don't see any way to do that.
Yes, something like that is what I mean, Tim. I copied your code into the statistics application and it works fine. That is what I want the HP Prime to do: calculate and choose the best fit itself. I think it should not be so difficult: since you have already done it for some models, the complete program for the 13 models would be a little more code, and the calculator would choose the one that fits the data best. It is simply something like that. If it exists on the HP 50g, I do not see why not on the Prime.

Note: I suppose that with a lot of points it does take more processing and memory, but the option would be viable for real-life problems with about 20 or 30 points; I think the calculator is quite capable of computing something like that quickly.
I imagine that the Prime development team includes people who know a lot about mathematics and numerical analysis, the area that studies approximation methods for solving equations, functions, algorithms of different types, etc.
(04-03-2017 11:30 PM)math7 Wrote: I think it should not be so difficult: since you have already done it for some models, the complete program for the 13 models would be a little more code, and the calculator would choose the one that fits the data best. It is simply something like that. If it exists on the HP 50g, I do not see why not on the Prime.

Please re-read carefully my post. I explained in detail several times and reasons why this request is neither simple, nor mathematically possible to the best of my knowledge. There is no such thing as "correlation" for ANYTHING except a linear model. Other methods of measuring error are not compatible.

Also, please note that you are talking to the person that wrote 95+% of the statistics code in Prime. I did spend probably at least a month of time over the course of the 39gII/Prime researching and investigating if this would be possible. I'd love it if I were corrected here, but so far I've not seen anything that would be possible to implement.

I do agree with Joe that a "Best fit" opens up some interesting learning options and investigation paths, but as I've shown a simple app function gives that option. I'm not sure how to build it in, nor the appropriateness of doing so. A lot of statistics educators actually were vehemently against having such an option built in.
(04-03-2017 11:30 PM)math7 Wrote: Yes, something like that is what I mean, Tim. I copied your code into the statistics application and it works fine. That is what I want the HP Prime to do: calculate and choose the best fit itself. I think it should not be so difficult: since you have already done it for some models, the complete program for the 13 models would be a little more code, and the calculator would choose the one that fits the data best. It is simply something like that. If it exists on the HP 50g, I do not see why not on the Prime.

Note: I suppose that with a lot of points it does take more processing and memory, but the option would be viable for real-life problems with about 20 or 30 points; I think the calculator is quite capable of computing something like that quickly.

You did not read Tim's response very carefully. The program he posted only works on regression models whose unknown parameters are linear. Even then, given any n distinct points, a polynomial fit (of degree n-1) will always be exact (i.e. "best fit") since it will pass through every single one of the n points. Thus, the "answer" to the "best fit" question is always a polynomial fit. Of course, this is also nonsense because polynomials are not always the best models of data (consider any set of data corresponding to a periodic phenomenon).

It would NOT be easy to generalize Tim's program to all the other fits because there is no "standard" for comparison. It would be like comparing apples to oranges (because only some of the regression models are linear; others are not). You cannot define "best" when there is no standard for comparison. You cannot even use a "least amount of error" type of comparison because a polynomial fit will always have 0 error unless your points have repeated input values.
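
The zero-error claim is easy to check numerically. Below is a small Python sketch using Lagrange interpolation, which builds the degree n-1 polynomial through n points with distinct x values; the function names and the sample data are made up for the illustration.

```python
def lagrange_eval(xs, ys, t):
    """Value at t of the polynomial of degree <= n-1 through the n points.

    Assumes all values in xs are distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total

def sse_of_interpolant(xs, ys):
    """Sum of squared errors of that polynomial at the data points."""
    return sum((lagrange_eval(xs, ys, x) - y) ** 2 for x, y in zip(xs, ys))

if __name__ == "__main__":
    # Arbitrary, deliberately jagged data: the polynomial still hits every point.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [3.0, -1.0, 4.0, 1.0, -5.0]
    print(sse_of_interpolant(xs, ys))
    # Between the points the curve can swing far from the data's trend.
    print(lagrange_eval(xs, ys, 4.5))
```

So a ranking by raw error alone would always pick the highest-degree polynomial, whatever the data look like.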
What I mean is that the calculator should report the R^2 and r values seen in the figures for every model automatically, instead of the user having to work them out one model at a time; that is very little processing for it.
The closer the R value is to one, the better the model, and therefore the curve, fits the data.