Threaded Mode | Linear Mode

lcwright1964 · (This post was last modified: 09-01-2015 10:40 PM by lcwright1964.)

On his excellent page on Gamma/Factorial approximations, Peter Luschny recommends a simplification of the standard 3-term Stieltjes continued fraction approximation of the Gamma function that is well-suited to calculator programming. Not only does the simplification render all constants as single digits, it just so happens to improve the performance by roughly an extra digit of accuracy over the original version. (Peter doesn't discuss this added benefit--I discovered it when examining the approximation myself in high precision environments on Free42 and in Mathematica.)

The structure of the program is inspired largely by JM Baillard's high quality Stirling series approximation, GAM+. I have in particular adopted Jean-Michel's treatment of how the leading factors get treated (i.e., what gets exponentiated and what is multiplied straight through), and his device of splitting one term into two factors so as to avoid overflow when the argument exceeds roughly 56. To save steps I have eliminated his treatment of integer arguments with the calculator's built-in FACT function.

The work is all done on the stack, except for the synthetic register M, which is used for the accumulation of the Pochhammer shift-and-divide product when the argument is 7 or less. If the program is entered from the keyboard using any regular register for this is fine.

In GAM+, Jean-Michel cleverly uses storage register arithmetic with stack registers to save steps and minimize stack movement--e.g. ST+ X accomplishes the same as ENTER + or 2 *, and ST- Y alters the contents of the Y register while leaving the X register unchanged. I have not used these devices, given that I wish to do a straight port to the HP67/97 forthwith and on those machines that is not possible. If the user wishes to use these concise commands the program can be shortened by 4 or 5 steps at least.

In high precision environments where adequate guard digits are not an issue (e.g., quadruple precision on Free42 Decimal), the approximation gives 10 digits easily (a minimum of about 10.4 exact decimal digits [edd] to be precise), and usually 11 or more across the argument range 0 to 70. Of course, rounding error in the 10-digit HP41 environment compromises this. I am finding in general that the program gives at least 8 good digits, usually 9 (with the 10th off by at most a couple ULP) and occasionally all 10. In my experience this is about as good as GAM+, only mine is smaller and seems to run at least a little faster.

Without further ado, here is the listing:

Code:

 01 LBL "GMST"

 02 7

 03 STO M

 04 ST/ M

 05 X<>Y

 06 LBL 01

 07 X>Y?

 08 GTO 02

 09 ST* M

 10 1

 11 +

 12 GTO 01

 13 LBL 02

 14 ENTER

 15 ENTER

 16 ENTER

 17 4

 18 *

 19 1/X

 20 +

 21 5

 22 *

 23 1/X

 24 X<>Y

 25 6

 26 *

 27 +

 28 ENTER

 29 +

 30 1/X

 31 X<>Y

 32 -

 33 E^X

 34 X<>Y

 35 R^

 36 0.5

 37 -

 38 LASTX

 39 *

 40 Y^X

 41 ENTER

 42 RDN

 43 *

 44 R^

 45 *

 46 PI

 47 ENTER

 48 +

 49 SQRT

 50 *

 51 0

 52 X<> M

 53 /

 54 END

Execution is self-evident--put the argument in the X register and XEQ [alpha]GMST[alpha].

I plan to port this directly and forthwith to the 67/97. I am keen on suggestions to save steps, because I would like to be able to fit the routine into the 49 steps of the HP33C/E or the HP25. And if I ever learn M-CODE programming this is one I will try to adapt as I expect it will be fast and give 10 full digits when 13-digit internal precision is used.

RAW and LIF files are attached in the ZIP file (I can't seem to upload them unzipped). Barcode PDF also attached.

Looking forward to feedback.

Les

Dieter · (This post was last modified: 09-02-2015 08:58 PM by Dieter.)

(09-01-2015 10:31 PM)lcwright1964 Wrote: In high precision environments where adequate guard digits are not an issue (e.g., quadruple precision on Free42 Decimal), the approximation gives 10 digits easily (a minimum of about 10.4 exact decimal digits [edd] to be precise), and usually 11 or more across the argument range 0 to 70.

The error decreases as x increases. So the largest errors occur with small x, like 0...7 – 12 where they may cause the 10th digit to round incorrectly. For x>13 the relative error drops below 1E–11 so that the 10-digit result shoud be within ±0,6 ULP (like the built-in transcendental functions).

(09-01-2015 10:31 PM)lcwright1964 Wrote: I am keen on suggestions to save steps, because I would like to be able to fit the routine into the 49 steps of the HP33C/E or the HP25.

Here is a first idea. The two multiplications after step 40 obviously avoid overflow for large arguments. But they can be coded more straightforward:

Instead of...

Code:

 ...

 40 Y^X

 41 ENTER

 42 RDN

 43 *

 44 R^

 45 *

 ...

...what about this one:

Code:

 ...

 40 Y^X

 41 *

 42 LastX

 43 *

 ...

Also on other calculators the final 0 X<>M may be replaced by a simple RCL M (resp. RCL 0), since the zero only clears the 41's M-register to avoid leaving strange characters in Alpha.

And finally here is another possible improvement in terms of accuracy. The crucial point of the Stieltjes method is the exponential of a number close to x (cf. step 33). If x exceeds 10, the argument of the e^x function is >10 so that 1 ULP equals 1 E–8, which in turn means that the result of e^x has a potential relative error of 1 E–8. So at this point we may lose up to two digits!

A possible solution is...

Code:

...

 30 1/X

 31 E^X

 32 X<>Y

 33 E^X

 34 /

 ...

...this method, i.e. instead of e^(a–x) evaluate e^a / e^x.

Try x=30, 50 or 70 with the original code and compare with the result of the proposed method: In both cases the latter yields one more valid digit, here it is even dead on in two out or three cases. You may even tweak the results slightly by replacing the multiplication by sqrt(2 Pi) with a division by sqrt(0,5/Pi) – this constant rounds better to 10 digits. ;-)

The only drawback is that for integer arguments below 10 the results are off in the last digit while the original code happens to round nicely to the exact integer solution. But then try x=10, 11, 12, 13, 14... Although I haven't done any comprehensive tests, it looks like the proposed changes may yield results with an error within 10 ULP or 1 unit in the 9th digit, even on a 10-digit calculator like the 41 or 67/97. But be sure to do your own tests.

Addendum: here are some results from GMST and a modified version.

Code:

  x         GMST           ULP    modified        ULP   exact

------------------------------------------------------------------------

  2      1,000000000         0   1,000000000        0   1

  3,5    3,323350973        +3   3,323350973       +3   3,323350970

  4,1    6,812622860        -3   6,812622865       +2   6,812622863

  5      24,00000000         0   24,00000001       +1   24

  7      720,0000000         0   720,0000003       +3   720

  8,3    9281,392518        -8   9281,392522       -4   9281,392526

 10      362879,9998        -2   362880,0000        0   362880

 13      479001598,3       -17   479001600,2       +2   479001600

 14,567  2,756346267 E+10  +12   2,756346254 E+10  -1   2,756346255 E+10

 20      1,216451005 E+17   +1   1,216451005 E+17  +1   1,216451004 E+17

 33,3    7,487577618 E+35  +21   7,487577596 E+35  -1   7,487577597 E+35

 40,5    1,286050246 E+47   -2   1,286050248 E+47   0   1,286050248 E+47

 50      6,082818622 E+62  -24   6,082818646 E+62  +6   6,082818640 E+62

 60      1,386831191 E+80   +6   1,386831185 E+80   0   1,386831185 E+80

 70      1,711224527 E+98   +3   1,711224524 E+98   0   1,711224524 E+98

 70,956  9,933230633 E+99  +14   9,933230624 E+99  +5   9,933230619 E+99

 70,957  9,975586583 E+99  -19   9,975586601 E+99  -1   9,975586602 E+99

Dieter

Dieter · 09-03-2015, 08:10 PM

(09-01-2015 10:31 PM)lcwright1964 Wrote: I plan to port this directly and forthwith to the 67/97. I am keen on suggestions to save steps, because I would like to be able to fit the routine into the 49 steps of the HP33C/E or the HP25.

The original GMST routine can be coded in 48 steps on a HP25, including the suggested change of the sqrt(2 Pi) constant. This is easy since the '25 uses line addressing instead of labels. Also the X↔Y R↑ sequence can be replaced by two consecutive R↓ commands (AFAIK the HP25 offers no R↑).

However, with the additional accuracy improvement it's bit tricky to do it in 49 steps (50 is easy), but it can be done with the use of three data registers instead of one.

Code:

01  STO 0

02  STO/ 0

03  7

04  X<>Y

05  X≥Y?

06  GTO 11

07  STO* 0

08  1

09  +

10  GTO 05

11  STO 1

12  4

13  *

14  1/x

15  RCL 1

16  +

17  5

18  *

19  1/x

20  RCL 1

21  6

22  *

23  +

24  ,

25  5

26  STO 2

27  X<>Y

28  /

29  e^x

30  RCL 1

31  e^x

32  /

33  RCL 1

34  RCL 1

35  RCL 2

36  -

37  LastX

38  *

39  y^x

40  *

41  LastX

42  *

43  RCL 2

44  Pi

45  /

46  √x

47  /

48  RCL 0

49  /

Instead of the stack this program uses R1 to store x, and R2 holds the constant 0.5 which is used several times. Since I am not sure whether starting a progam with R/S enables stack lift, the original 7 in the first line was moved and the 1 in R0 is provided by dividing x by x (which cannot be zero anyway). Since I do not own an HP25 I could not test all this, but I think it should work.

Dieter

lcwright1964 · 09-04-2015, 01:54 AM

Hey Dieter, thank you for all that.

There is no doubt now that across the range that your choice to take the antilog of both results and divide the them as opposed to subtract the value from the CF computation and antilog the whole thing is the better choice. Indeed, my approach was taken from JMB's GAM+ routine, where he does this. I wonder if your modification would improve that too?

I had figured out the line addressing issue as a step saver before I read your post, and I was keen to try it out when I got the chance. In my original and in the streamlined version using LastX there is a saving of 3 steps due to loss of the LBL lines but a gain of one step because .5 takes two steps. I also figured that on the 33 and 25 step 49 doesn't have to be an R/S or RTN as it just stops there automatically.

I would like to comment on my choice of 7 as the shift and divide threshold. On Mathematica it was easy to plot out exact decimal digits against x (where EDD = -Log10[Abs[Relative Error]]). The curve sweeps up to about EDD = 10.4 when x = 7. Then there is a spike a little after that, indicating that the approximation briefly becomes really accurate around x = 8 for some reason. The curve then swoops back down to to about EDD = 10.4 at roughly x =8.5, from whence it gradually climbs to EDD = 11 at x = 13 or so as you have observed. With adequate extra precision EDD = 10.4 is plenty to get 10 full digits across the board, and I figured that taking the cut-off higher would slow things down for smaller values of x without much added benefit. So that was my logic there.

I am figuring that tweaks can start to get out of hand after a bit without driving one crazy, but I think we have taken this pretty far. Thanks!

Les

Dieter · 09-04-2015, 06:29 AM

(09-04-2015 01:54 AM)lcwright1964 Wrote: Indeed, my approach was taken from JMB's GAM+ routine, where he does this. I wonder if your modification would improve that too?

The proposed modification divides e^(quite_a_small_value) by e^x, as opposed to e^(quite_a_small_value – x). Since x >> quite_a_small_value, 2 or 3 digits are lost after the subtraction. On the other hand the two individual antilogs are more or less exact (within working precision). That's why the results are more accurate. Since the mentioned small value is close to zero I even tried a method that uses e^x–1, but that does not make much of a difference.

(09-04-2015 01:54 AM)lcwright1964 Wrote: With adequate extra precision EDD = 10.4 is plenty to get 10 full digits across the board,

Let me add a short remark on these EDD figures. EDD=n means a relative error of 10^–n, which in turn means the error can approach 1 unit in the n-th significant digit. If that is the last displayed digit, the error of the displayed result may be up to 1,5 ULP. So EDD=n means that the last digit can be off by 1 or slightly more. 10,4 EDD equals a relative error of 4 E–10 or up to 0,9 ULP including display roundoff. That's not exactly what I'd call "plenty". ;-) I prefer a somewhat stricter measure: for n valid digits the relative error should not exceed 10^–(n+1), i.e. 1 E–11 for 10 digits. This makes sure the n-digit rounded result is within 0,6 ULP, which matches HP's accuracy for most of the transcendental functions. But since we do all the calculations with merely the same 10 digits, we should not be too picky here... ;-)

(09-04-2015 01:54 AM)lcwright1964 Wrote: I am figuring that tweaks can start to get out of hand after a bit without driving one crazy,

Exactly. ;-)

Dieter

lcwright1964 · 09-04-2015, 06:32 PM

Dieter, I like your rewrite so much, using registers and some other tweaks, that I have adopted it in toto for all of the units I am trying to program. Here is the HP41 version, where, again inspired by JMP, I use the synthetic registers M, N, and O in place of your 0, 1, and 2. It seems to run just as fast and works great:

Code:

 01 LBL "GMST"

 02 STO M

 03 ST/ M

 04 7

 05 X<>Y

 06 LBL 01

 07 X>Y?

 08 GTO 02

 09 ST* M

 10 1

 11 +

 12 GTO 01

 13 LBL 02

 14 STO N

 15 4

 16 *

 17 1/X

 18 RCL N

 19 +

 20 5

 21 *

 22 1/X

 23 RCL N

 24 6

 25 *

 26 +

 27 0.5

 28 STO O

 29 X<>Y

 30 /

 31 E^X

 32 RCL N

 33 E^X

 34 /

 35 RCL N

 36 RCL N

 37 RCL O

 38 -

 39 LASTX

 40 *

 41 Y^X

 42 *

 43 LASTX

 44 *

 45 RCL O

 46 PI

 47 /

 48 SQRT

 49 /

 50 RCL M

 51 /

 52 END

Once one gets rid of the three LBL steps and the END statement and adds a line to code .5, one is done to the 49 steps that fit perfectly into the HP33E/C or the HP25. I have a real 33C and Olivier's emulators on my Android and am keen to see how this works out.

Again, for anyone who wants to play with the program in V41, Free42, Emu42, or any of a number of compatible iPhone or Android emulators, I attach the RAW file. I didn't bother with a LIF or barcode file this time--anyone with the level of sophistication to turn to those already knows how to use hp41uc to compile the files themselves. Indeed, that is pretty well what you need to do to get a program into Emu41, which I use as a companion to PILBox.

Les

Dieter · 09-05-2015, 05:49 AM

(09-04-2015 06:32 PM)lcwright1964 Wrote: Dieter, I like your rewrite so much, using registers and some other tweaks, that I have adopted it in toto for all of the units I am trying to program.

Ah, that's nice.

(09-04-2015 06:32 PM)lcwright1964 Wrote: Here is the HP41 version, where, again inspired by JMP, I use the synthetic registers M, N, and O in place of your 0, 1, and 2.

The HP41 version can be done with a single register, similar to your original version. But if you want to use M, N and O be sure to add a CLA at the end to clean up the Alpha register so that no weird characters are left there. ;-)

(09-04-2015 06:32 PM)lcwright1964 Wrote: It seems to run just as fast and works great:

It should deliver about nine digits, and I did not observe any cases where the error exceeded 6 units in the last place. But maybe someone else will. ;-)

Dieter

lcwright1964 · (This post was last modified: 09-05-2015 09:56 PM by lcwright1964.)

GMST(9.25) gives 69106.22680 which is off by 9 ULP (the last digit should be 9), but that is about the worst example I have found so far.

Gerald H · 09-06-2015, 06:40 AM

On the HP 42S GMST returns 69106.2268928 while the inbuilt gamma returns 69106.2268951 for input 9.25

lcwright1964 · 09-06-2015, 03:07 PM

(09-06-2015 06:40 AM)Gerald H Wrote: On the HP 42S GMST returns 69106.2268928 while the inbuilt gamma returns 69106.2268951 for input 9.25

And it would. The rounding error for this particular argument is more acute than most in the 10-digit HP41 environment. The 12-digit HP42S (I assume you mean the real calculator) gives a couple of more digits so the rounding issues would be in the 11th and 12th place as opposed to 9th and 10th. Moreover, for that particular argument even with adequate precision available GMST is only going to give 10 or 11 correct digits anyway.

Les

Dieter · (This post was last modified: 09-09-2015 10:15 PM by Dieter.)

(09-06-2015 03:07 PM)lcwright1964 Wrote: Moreover, for that particular argument even with adequate precision available GMST is only going to give 10 or 11 correct digits anyway.

The algorithm discussed in this thread is a modified three term Stieltjes approximation. The exact third coefficient should be 53/210, but here it is slightly rounded down to 1/4. This way the original continued fraction can be written in a different way with three integer constants 4, 5 and 6.

The slight variation of the third coefficient actually improves the accuracy for smaller arguments. We can now go one step further and tweak the remaining coefficients a bit. With sufficient working precision, e.g. on Free42, you may try replacing the 5 with 5,0000367 while the threshold for the shift-and-divide-routine is increased from 7 to 8. As far as I can see this delivers results with a max. relative error of about 1 E–11. This way Free42 delivers Gamma(9,25) = 69106,2268944 which is off by just 7 units in the 12th digit. With the original coefficients the error was approx. ±4 E–11.

Setting the threshold to 9 and replacing the 5 with 5,00005 will even limit the error to ~2 E–12. But this requires much more precision than the 41 can provide.

Dieter

lcwright1964 · 09-10-2015, 10:51 PM

(09-09-2015 09:21 PM)Dieter Wrote: Setting the threshold to 9 and replacing the 5 with 5,00005 will even limit the error to ~2 E–12. But this requires much more precision than the 41 can provide.

Nonetheless, Dieter, this proves to be a great refinement and I have incorporated it and have changed the two lines in question in the version I now use. I doesn't add any lines or use much more memory. There seems to be minimal impact on speed too. This will work also for a port to the 67/97, but there aren't enough lines in the 25 or 33E/C to accommodate it.

Thank you!

Les

Dieter · (This post was last modified: 09-11-2015 08:22 PM by Dieter.)

I now did some more tests with the Stieltjes algorithm, and I think it is a very nice method that can deliver excellent results without much effort. The resulting accuracy depends on different factors: The number of terms (for our purposes three will usually do, four give even better results), the restriction of integer coefficients (results improve if real values are allowed) and the threshold below which the shift-and-divide method is applied (the higher the better, but a high threshold means more multiplications that may cause some loss of accuracy for small arguments).

Here are some examples coded in VBA. The following examples are not optimized (e.g. there is no provision to prevent overflow), they only are intended to show the basic idea.

First of all a here is a three-term implementation. It uses the simplification that originally led to three integer coefficients 4, 5 and 6.

Code:

Function st3gamma(x)

sqrt2pi = 2.50662827463100

a = 4.024278

b = 5.000135   ' two coefficients

t = 7          ' threshold for shift-and-divide

s = 1

Do While x < t

   s = s * x

   x = x + 1

Loop

q = 1 / 2 / (1 / (b * (x + 1 / (a * x))) + 6 * x)

st3gamma = sqrt2pi * x^(x-0.5) * Exp(q-x) / s

End Function

Two real coefficients, threshold = 7 => error ~ 1,1 E–11.

Set a = 4, b = 5.00003671 and t = 8 => error ~ 1 E–11.
Set a = 4, b = 5.00005017 and t = 9 => error ~ 2 E–12.

With four terms the results get even more accurate. Here the original Stieltjes CF method is used, thus the first Stieltjes coefficients 1/12 and 1/30 show up in the code.

Code:

Function st4gamma(x)

sqrt2pi = 2.50662827463100

a = 2

b = 3.963421   ' two coefficients

t = 7          ' threshold for shift-and-divide

s = 1

Do While x < t

   s = s * x

   x = x + 1

Loop

q = 1 / (12 * (x + 1 / (30 * (x + 1 / (b * (x + 1 / (a * x)))))))

st4gamma = sqrt2pi * x^(x-0.5) * Exp(q-x) / s

End Function

One real coefficient, threshold = 7 => error ~ 7 E–13.
Set a = 2, b = 3.963274 and t = 8 => error ~ 4 E–13.
Set a = 1.957, b = 3.96266 and t = 8 => error ~ 7 E–14.

All this is more accurate than the original Stieltjes continued fraction method with its rational coefficients. The versions above simply trade the extremely high accuracy for large arguments (which is lost) for higher accuracy with smaller arguments down to 7.

Now all we need is a sufficiently accurate calculator that can take advantage of the examples above. ;-) I just tried the second last version with error < 4 E–13 on my trusted 35s and the results I got seem to be off only in the last digit – which is not caused by the algorithm but due to roundoff errors during the calculation.

In an actual implementation there are two crucial points: avoid overflow and evaluate exp(q–x) as accurately as possible. The calculator programs earlier in this thread handle both problems quite well.

Dieter

lcwright1964 · 09-13-2015, 02:06 AM

Hey, Dieter, for HP41 work with 10 digits I think I am sticking with the refinement mentioned above. Expressing the coefficient that is around 5 to a few more digits doesn't seem to confer extra benefit on the HP41, and indeed when I plot the error curve in an arbitrary precision environment I really see no big improvement at all.

I should play around with your four term versions, but I suspect that the advantages of them are lost in a 10 digit environment and, if anything, the extra computations due to the extra term produce their own round off issues. Of course more accuracy is desirable for 12-digit calculators, but the point is more of an academic curiosity, since the newer HPs like the 42s and 35s compute the gamma (actually factorial) on allowable non-integer arguments just fine.

Of course, all of these great refinements are good candidates for M-CODE programming, but I think none of them can top some of the optimized Lanczos approximations we were discussing elsewhere. Getting rid of the shift-divide step for smaller arguments really goes far to speeding things up across the board.

Les

Dieter · (This post was last modified: 09-13-2015 02:09 PM by Dieter.)

(09-13-2015 02:06 AM)lcwright1964 Wrote: Hey, Dieter, for HP41 work with 10 digits I think I am sticking with the refinement mentioned above. Expressing the coefficient that is around 5 to a few more digits doesn't seem to confer extra benefit on the HP41, and indeed when I plot the error curve in an arbitrary precision environment I really see no big improvement at all.

The additional digits only help squeeze the error a tiny bit below the 2 E–12 mark. ;-) You won't notice much of a difference even in a higher precision environment. For the '41 and other 10-digit machines the earlier methods are perfectly OK.

BTW, with a=4,01 and b=5,00008 you can lower the threshold to 8 and still get similar accuracy (error < 5 E–12), while for small x one loop less is required.

(09-13-2015 02:06 AM)lcwright1964 Wrote: I should play around with your four term versions, but I suspect that the advantages of them are lost in a 10 digit environment and, if anything, the extra computations due to the extra term produce their own round off issues.

Sure, all this was done with higher-precision environments in mind. The best I got so far is an error < 3,7 E–14 (threshold = 8), and I tried this on a WP34s in SP mode (16 digits). It's nice to see the 12 digit display show the same result as the built-in Gamma function. ;-)

(09-13-2015 02:06 AM)lcwright1964 Wrote: Of course, all of these great refinements are good candidates for M-CODE programming, but I think none of them can top some of the optimized Lanczos approximations we were discussing elsewhere. Getting rid of the shift-divide step for smaller arguments really goes far to speeding things up across the board.

Right, this shift-and-divide thing slows down everything, especially on older 10-digit calculators like the '41 or 67/97. But in MCode speed should be no big issue, and with some careful, accuracy-preserving coding one may expect something like 11 valid digits.

Finally I would like to sum up some results for three-term Stieltjes approximations in the attached graphics. The y-axis is the EDD figure, i.e. the base-10 log of the relative error, or roughly the number of exact digits (±1 unit). The peaks are the points where the error passes through zero and accuracy becomes infinite. The sawtooth pattern for small x is due to the shift-and-divide method applied there. You can see the original Stieltjes method in the black graph, while the red one shows the improvement due to the a=4 simplification. I especially like the blue version as it reaches an error level slightly below 1 E–11 down to t=8. Yes, you may omit the ...04 at the end of the b coefficient. ;-)

Dieter