Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results

The Museum of HP Calculators

HP Forum Archive 18

Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #1 Posted by Egan Ford on 5 Feb 2008, 2:13 a.m.

48GX/hp48xgcc results:

RESULT: 876 TIME: 3.371399 SEC
#include <hp48/object.h>
#include <hp48/core.h>
#include <math.h>

int main()
{
    int x, y, r, t, n, a[9];
    hp_object *o;
    double s;

    for (n = 10; n > 0; --n) {
        r = 8;
        s = 0;
        x = 0;
        do {
            a[++x] = r;
            do {
                ++s;
                y = x;
                while (y > 1)
                    if (!(t = a[x] - a[--y]) || x - y == abs(t)) {
                        y = 0;
                        while (!--a[x])
                            --x;
                    }
            } while (y != 1);
        } while (x != r);
    }

    o = sys_malloc (5 + 2 * sizeof (double));
    if (!o)
        exit (1);
    o->prolog = 0x2933;
    o->_hide.real = s;
    sys_exit (o);
}

50g/HPGCC3 beta 192 MHz results:

RESULT: 876 TIME: 0.000331 SEC
#include <hpgcc49.h>

int main()
{
    int x, y, r, s, t, n, a[9];

    cpu_setspeed(192 * 1000000);

    for (n = 100000; n > 0; --n) {
        r = 8;
        s = 0;
        x = 0;
        do {
            a[++x] = r;
            do {
                ++s;
                y = x;
                while (y > 1)
                    if (!(t = a[x] - a[--y]) || x - y == abs(t)) {
                        y = 0;
                        while (!--a[x])
                            --x;
                    }
            } while (y != 1);
        } while (x != r);
    }
    sat3_push_dbl_real(s);
    return (0);
}
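
For anyone who wants to check the RESULT value without a calculator toolchain, the loop from the two listings above can be lifted into plain C. This is only a minimal sketch, assuming a hosted compiler with the standard library. The names nqueens_nodes and seconds_per_iteration are made up for illustration and are not part of hp48xgcc or HPGCC3, and clock() merely stands in for whatever timer the calculator builds used.

```c
#include <stdlib.h>  /* abs */
#include <time.h>    /* clock, CLOCKS_PER_SEC */

/* Counts the nodes evaluated before the first 8-queens solution is found.
 * Same loop as the listings above; only the calculator I/O is removed. */
int nqueens_nodes(void)
{
    int a[9] = {0};   /* a[1..8] holds the board; a[0] is unused */
    int r = 8, s = 0, x = 0, y, t;

    do {
        a[++x] = r;
        do {
            ++s;
            y = x;
            while (y > 1)
                if (!(t = a[x] - a[--y]) || x - y == abs(t)) {
                    y = 0;
                    while (!--a[x])
                        --x;
                }
        } while (y != 1);
    } while (x != r);
    return s;
}

/* Hypothetical timing helper: average CPU seconds per run over n runs. */
double seconds_per_iteration(int n)
{
    volatile int sink = 0;   /* volatile so each result is actually used */
    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        sink += nqueens_nodes();
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC / n;
}
```

Calling nqueens_nodes() should give 876, the RESULT reported for both calculators. Timings from seconds_per_iteration on a PC are of course not comparable to the calculator figures.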

Edited: 5 Feb 2008, 2:30 a.m.

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #2 Posted by Raymond Del Tondo on 5 Feb 2008, 3:57 a.m.,
in response to message #1 by Egan Ford

Looks like comparing apples with oranges, especially given the CPU clock difference;-)
Would be nice to know how much overhead the cross-compiled C stuff actually produced on the HP-48.

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #3 Posted by Egan Ford on 5 Feb 2008, 4:07 a.m.,
in response to message #2 by Raymond Del Tondo

I was not trying to state anything or draw any conclusions, I was just supplying data for Xerxes' list.
That said, if you want apples to apples, the GX result above is faster than any other GX on Xerxes' list.
How would you like me to measure overhead? What type of overhead?

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #4 Posted by Raymond Del Tondo on 5 Feb 2008, 7:19 a.m.,
in response to message #3 by Egan Ford

>How would you like me to measure overhead? What type of overhead?
>
The type of overhead can be derived from the type of object produced by the cross-compiler,
and the dimension of the overhead can be (roughly) seen after decompiling the code.
Does the cross-compiler produce pure machine code, SysRPL, UserRPL code, or a mixture?
How big is the object? Does it call XGCC library functions? And so on...
Could you send me the binary of the HP-48 object ?

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #5 Posted by Egan Ford on 5 Feb 2008, 11:18 a.m.,
in response to message #4 by Raymond Del Tondo

Quote:
The type of overhead can be derived from the type of object produced by the cross-compiler, and the dimension of the overhead can be (roughly) seen after decompiling the code.
I'll leave the decompiling of the code to you :-)

Quote:
Does the cross-compiler produce pure machine code, SysRPL, UserRPL code, or a mixture?
AFAIK, machine code only. The binaries require the use of shared libraries.

Quote:
How big is the object? Does it call XGCC library functions? And so on...
The following objects are required to run this benchmark:
object       size (bytes)
----------   ----
nqueens      1367
GCCLDD        292
libcore.sl    103
libgcc.sl     741
GCCLDD, libcore.sl, libgcc.sl, and other .sls can be used by other C programs, minimizing RAM usage.

Quote:
Could you send me the binary of the HP-48 object ?
http://sense.net/~egan/hp48xgcc/xgcc.hp
This object is a directory with everything you need. There are multiple nqueens binaries: NQ1 (one iteration), NQ10 (ten iterations), and NQS (solution).
Use at your own risk :-), you can test with EMU48 first.

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #6 Posted by Xerxes on 6 Feb 2008, 8:48 a.m.,
in response to message #1 by Egan Ford

Hello Egan,
Thank you for these results.
Please allow me to ask you for the result of the 50G at 75 MHz, so that we have the speed-up factor compared to 192 MHz for completeness. Maybe the result is the same as the one already tested with HPGCC2 and unstructured code, but I'm not sure about it.

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #7 Posted by Egan Ford on 6 Feb 2008, 1:08 p.m.,
in response to message #6 by Xerxes

Completeness.
HPGCC3/50g
MHz  Iterations  Time(s)/iteration
---  ----------  -----------------
  6        3125  0.01081921875
 12        6250  0.00543117188
 48       25000  0.00132467285
 75       39062  0.00086321105
120       62500  0.00053928320
152       79166  0.00042776882
192      100000  0.00033105103

Edited: 6 Feb 2008, 1:09 p.m.

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #8 Posted by Xerxes on 6 Feb 2008, 8:53 p.m.,
in response to message #7 by Egan Ford

;-)
The difference between the effective speed-up factors is interesting:
x1.3 for UserRPL @ 203 MHz
x2.6 for HPGCC @ 192 MHz
hp48xgcc does not seem very efficient for a native compiler, considering the results of calculators in the same category.
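
The two factors can be re-derived from numbers that appear in this thread. The sketch below is only illustrative arithmetic: speedup and print_speedups are hypothetical helper names, the HPGCC3 times are the 75 MHz and 192 MHz rows of the table in message #7, and the UserRPL times are the 1:30 and 1:07 entries from Xerxes' list as quoted later in the thread, converted to seconds.

```c
#include <stdio.h>

/* Speed-up factor: how many times faster the second measurement is. */
double speedup(double t_slow, double t_fast)
{
    return t_slow / t_fast;
}

/* Prints the ratios discussed in the thread next to the raw clock ratio. */
void print_speedups(void)
{
    /* HPGCC3 per-iteration times, 75 MHz vs 192 MHz */
    double hpgcc   = speedup(0.00086321105, 0.00033105103);
    /* UserRPL times, 75 MHz vs 203 MHz: 1:30 and 1:07 in seconds */
    double userrpl = speedup(90.0, 67.0);
    /* Clock ratio for the HPGCC3 case, for comparison */
    double clocks  = 192.0 / 75.0;

    printf("HPGCC3  : x%.2f\n", hpgcc);
    printf("UserRPL : x%.2f\n", userrpl);
    printf("clock   : x%.2f\n", clocks);
}
```

With these inputs the HPGCC3 ratio comes out at roughly 2.61 and the UserRPL ratio at roughly 1.34, against a raw clock ratio of 2.56 for 75 to 192 MHz.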

Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #9 Posted by Egan Ford on 6 Feb 2008, 10:31 p.m.,
in response to message #8 by Xerxes

Quote:
The difference between the effective speed-up factors is interesting:
x1.3 for UserRPL @ 203 MHz
x2.6 for HPGCC @ 192 MHz

It may take more than increasing the clock rate to increase the speed of UserRPL under Saturn emulation. Memory speed may be a factor too. The HPGCC version is very small, perhaps it fits in cache. All speculation.

Quote:
hp48xgcc does not seem very efficient for a native compiler, considering the results of calculators in the same category.
What other calculators are you comparing to?
From the comparison below I'd say hp48xgcc was very efficient:
4:02  HP-48GX  UserRPL / Ver.P
1:30  HP-50G   UserRPL
1:07  HP-50G   UserRPL / Fast Mode x1.3 (75->203 MHz)
35.2  HP-48GX  SysRPL / Ver.R
3.37  HP-48GX  C / Structured / HP48XGCC / Cross Compiler


Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #10 Posted by Xerxes on 7 Feb 2008, 9:39 a.m.,
in response to message #9 by Egan Ford

Quote:
What other calculators are you comparing to?

4.28    PC-G850V   (Z80 @ 8.0 MHz)       C / Unstructured / Bytecode
3.37    HP-48GX    (Saturn @ ~4 MHz)     C / Structured / HP48XGCC / Cross Compiler
2.92    Series 3a  (V30 @ 7.68 MHz)      OPL / Bytecode
1.27    PB-2000C   (HD61700 @ 0.91 MHz)  Pascal / DL-Pascal-ROM-Card 1.2 / Compiler
0.136   HP-200LX   (80186 @ 7.9 MHz)     Basic / DEFINT / QuickBasic 4.5 / Compiler
0.0886  HP-200LX   (80186 @ 7.9 MHz)     C / Unstructured / Turbo C 2.01 / Compiler
I can't assess the speed of the Saturn CPU for assembly programs. Probably the Saturn CPU is not very effective for integer-only problems. I would have to study the instruction set more deeply to find out, e.g., whether it's possible to use only the registers, even for storage of the board indices.


Re: Calculator Benchmark 48GX/hp48xgcc and 50g/HPGCC3 results
Message #11 Posted by Egan Ford on 7 Feb 2008, 10:39 a.m.,
in response to message #10 by Xerxes

Quote:
I can't assess the speed of the Saturn CPU for assembly programs.
The only way to rate the efficiency of hp48xgcc is to write a Saturn assembly version of the benchmark. I think I know someone that may be able to do that.

My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #12 Posted by Raymond Del Tondo on 7 Feb 2008, 5:13 p.m.,
in response to message #11 by Egan Ford

Hi,
I don't know if you meant me...
However, I just wrote a real native Saturn assembly version of that benchmark.
I simply had to know how much could still be gained;-)
Conclusion: The xgcc version is not bad, but _way_ off regarding speed!
I haven't disassembled the xgcc output you sent to me yet (thanks for that:),
but it seems that either the lib calls or the chosen data structures,
or a mixture of both produces the relatively huge overhead in run time.
My real native version runs in about 0.9699108 seconds on Emu48 in 'Authentic' speed,
and in about 0.803724365234 seconds on my real HP-48GX revR !
The sample size was 100 runs of the program in each case,
and I took the average of the individual run times.
So it seems my solution runs circles around the xgcc generic assembly code:-)
Should I post the listing here?
Raymond
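
The long decimal tails in these timings are what one would expect from dividing a tick count by the HP-48 clock rate of 8192 ticks per second (the listing posted later in the thread divides by # 2000, i.e. hex 2000 = 8192). A minimal sketch of that conversion follows; average_seconds is a hypothetical name, and the total of 658411 ticks over 100 runs used in the note below is an inference from the reported average, not a number taken from the post.

```c
#include <stdint.h>

/* HP-48 system clock rate in ticks per second (hex 2000). */
#define TICKS_PER_SECOND 8192.0

/* Average run time in seconds for a tick interval covering `runs` runs. */
double average_seconds(uint64_t ticks_start, uint64_t ticks_end, unsigned runs)
{
    return (double)(ticks_end - ticks_start) / TICKS_PER_SECOND / runs;
}
```

For example, average_seconds(0, 658411, 100) evaluates to about 0.803724365234, which matches the figure reported for the real HP-48GX.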

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #13 Posted by Gene Wright on 7 Feb 2008, 5:27 p.m.,
in response to message #12 by Raymond Del Tondo

Yes!

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #14 Posted by Egan Ford on 7 Feb 2008, 5:46 p.m.,
in response to message #12 by Raymond Del Tondo

Quote:
I don't know if you meant me...
However, I just wrote a real native Saturn assembly version of that benchmark.
Thanks. I was hoping you'd do it.
Quote:
Conclusion: The xgcc version is not bad, but _way_ off regarding speed!
Ah, well, king for a day...
Quote:
My real native version runs in about 0.9699108 seconds on Emu48 in 'Authentic' speed, and in about 0.803724365234 seconds on my real HP-48GX revR !
Your stellar results are no surprise to me. I was expecting native assembly to be at least 2x faster. More so since hp48xgcc is half-baked.
Quote:
Should I post the listing here?
Yes. Can you have your version return the solution as well?
Edited: 7 Feb 2008, 5:50 p.m.

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #15 Posted by Raymond Del Tondo on 8 Feb 2008, 8:23 p.m.,
in response to message #14 by Egan Ford

Hi again,
my current version runs in about 0.3363 seconds!
Now the run time factor is about 10 (TEN) between the hp48xgcc version and my native solution;-)
In other words, the xgcc version needs ten times more time than my version.
Now we could say further optimizations will be somewhat difficult...
Have nice weekend:-)

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #16 Posted by Raymond Del Tondo on 7 Feb 2008, 5:46 p.m.,
in response to message #12 by Raymond Del Tondo

Ok, here's the beef!
The code is a translation of the generic BASIC listing given by 'Xerxes'
into pure Saturn machine language, and thus the algo is the same.
Xerxes' bench listings
There may be some places for slight improvements,
especially the ASLC and CSLC parts, but I think it's not bad so far...
...and it's another example for the speed and efficiency of the real HP-48;-)
The (updated) program listed below returns the solved board in stack level 3,
the execution time in seconds in stack level 2,
and the count of evaluated nodes in stack level 1.
The board matrix is kept in CPU register C[9:1] . Nib C[0] is used as scratch.
One of the goals was to reduce the total count of CPU cycles,
and that's why the index pointers X and Y (in B[0] and D[0])
were accessed using P (D=D-1 P) instead of using (D=D-1 A),
where the latter has a shorter opcode but needs more CPU cycles.
Have fun!
Raymond
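
The idea of keeping the whole board in one register can be pictured in C. A Saturn working register is 64 bits wide, i.e. 16 nibbles, and every queen position from 1 to 8 fits in a single nibble. The sketch below only illustrates that data layout under those assumptions; get_nibble and set_nibble are hypothetical helpers, not a translation of the assembly.

```c
#include <stdint.h>

/* Read the 4-bit nibble at position pos (0 = least significant). */
unsigned get_nibble(uint64_t reg, unsigned pos)
{
    return (unsigned)((reg >> (4 * pos)) & 0xFu);
}

/* Return reg with the nibble at position pos replaced by val (0..15). */
uint64_t set_nibble(uint64_t reg, unsigned pos, unsigned val)
{
    uint64_t mask = (uint64_t)0xF << (4 * pos);
    return (reg & ~mask) | ((uint64_t)(val & 0xFu) << (4 * pos));
}
```

With this layout, A(X)=R from the BASIC listing becomes something like board = set_nibble(board, x, 8), and no memory access is needed for the array at all.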

::
  CK0
  CLKTICKS
*                           A B C D Dn Rn P         Cyc
CODE
        GOSBVL  =SAVPTR
L10     LAHEX   888888888
        R0=A                0:R
        C=0     W
        D1=C                1:S
        B=0     A           X=0
*       D=0     A           Y=0
L40     P=      0                                   2
        A=R0                R
        ?A#B    P
        GOYES   L50
        GOTO    L180
L50     B=B+1   P           R  X'  AAAAAAAAx                    * X=X+1
L60     P=      0           0                       2           * A(X)=R
        C=B     P           AAAAAAAAX
        P=C     0           X
        C=A     P           AAAAAAAAX                           * Pushed R to AAAAAAAA at pos X
L70     CD1EX               S                                   * S=S+1
        C=C+1   A
        CD1EX
L80     P=      0           0                       2           * Y=X
        C=B     P           X
        D=C     P           Y=X
L90     D=D-1   P           Y'                                  * Y=Y-1
L100    ?D=0    P
        GOYES   L40
L110    P=      0                                   2           * T=A(X)-A(Y)
        C=D     P           AAAAAAAAY
        P=C     0           Y
        A=C     P           'A(Y)'
L110asl ASLC
        P=P+1                                       3
        GONC    L110asl                                         * On exit: P=0
* Here: P=0 A[0]='A(Y)' 2
        C=A     P                                               * Backup of 'A(Y)'
        A=C     W                                               * Full backup of AAAAAAAA incl A(Y)
        C=B     P           AAAAAAAAX
        P=C     0           X                       6
L110csl CSLC
        P=P+1                                       3
        GONC    L110csl                                         * On exit: P=0
        ACEX    W           A[0]='A(X)' C[0]='A(Y)'
        ?A>=C   P
        GOYES   NoSwp
        ACEX    P
NoSwp   A=A-C   P           ABS(T)
L120    ?A=0    P                                               * IF T=0 THEN 140
        GOYES   L140
L130    C=B     P           X                                   * IF X-Y<>ABS T THEN 90
        C=C-D   P           X-Y
        ?A#C    P
        GOYES   L90
L140    P=      0                                   2
        C=B     P           X                                   * A(X)=A(X)-1
        P=C     0           X                       6
        C=C-1   P
L150    ?C#0    P                                               * IF A(X)<>0 THEN 70
        GOYES   L70
L160    P=      0                                               * X=X-1
        B=B-1   P
L170    ?B#0    P                                               * IF X<>0 THEN 140
        GOYES   L140
L180
*       AD1EX
* PRINT S
*       P=      0                                   2
*       GOVLNG  =PUSH#ALOOP
        CSR     W                                               * Shift right one nib
        AD1EX
        P=      7
        ACEX    WP
        RSTK=C                                                  * Save count
        GOSBVL  =PUSHhxs
        GOSBVL  =SAVPTR
        C=RSTK
        A=C     A
        GOVLNG  =PUSH#ALOOP
ENDCODE
  CLKTICKS      ( *Ticks1 #Board #LCnt Ticks2* )
* ROT
  4ROLL
  bit-
  #>%
  # 2000 UNCOERCE
  %/
  SWAP
  UNCOERCE
;
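
To follow the labels in the listing, it may help to read them next to a goto-style C transcription of the BASIC lines quoted in its comments. This is a sketch under assumptions: nqueens_unstructured is a hypothetical name, and BASIC lines 40 and 100 are not quoted in the comments, so they are inferred here from the ?A#B and ?D=0 tests at L40 and L100.

```c
#include <stdlib.h>  /* abs */

/* Unstructured version; each C label mirrors the same label in the listing. */
int nqueens_unstructured(void)
{
    int a[9] = {0};   /* a[1..8] holds the board; a[0] is unused */
    int r = 8, s = 0, x = 0, y, t;

L40:  if (x == r) goto L180;           /* inferred: all rows placed, done */
      x = x + 1;                       /* L50 */
      a[x] = r;                        /* L60 */
L70:  s = s + 1;                       /* count one evaluated node */
      y = x;                           /* L80 */
L90:  y = y - 1;
      if (y == 0) goto L40;            /* L100, inferred: no conflict found */
      t = a[x] - a[y];                 /* L110 */
      if (t == 0) goto L140;           /* L120: same column */
      if (x - y != abs(t)) goto L90;   /* L130: not on a diagonal */
L140: a[x] = a[x] - 1;
      if (a[x] != 0) goto L70;         /* L150 */
      x = x - 1;                       /* L160: backtrack one row */
      if (x != 0) goto L140;           /* L170 */
L180: return s;
}
```

The function should return 876, the node count reported at the top of the thread, since the algorithm is the same as in the structured C versions.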

Edited: 7 Feb 2008, 6:38 p.m.

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #17 Posted by Xerxes on 7 Feb 2008, 9:01 p.m.,
in response to message #16 by Raymond Del Tondo

Thank you, Raymond, for this interesting implementation; it's an enrichment for the list.
With your permission, I have inserted your listing without the comments, like the other assembly examples. The BASIC listing was used as the pattern for all assembly versions.
Now the execution speed of HP48XGCC is not surprising any more. It seems my suspicion was right that integer-only problems are not the strong point of the Saturn CPU. On the other hand, the speed of the 71B BASIC interpreter shows the advantage of its instruction set.
The information about the clock speed of the 48GX is not really clear. I have found 4.0 MHz, ~4 MHz and 3.7-4.0 MHz. But which is correct?

Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)
Message #18 Posted by Raymond Del Tondo on 8 Feb 2008, 3:44 a.m.,
in response to message #17 by Xerxes

I'm glad I could contribute :-)
About the Saturn CPU: I'm not sure whether the Saturn is weak regarding integer handling,
but the developers' main goal seems to have been good BCD handling.
<OT> I think the combination of Saturn CPU and HP-71B OS was _very_ efficient. Hats off for the developers!
</OT>
About the HP-48G series clock speed: AFAIK the last of your options (3.7-4.0 MHz) comes nearest to reality.

Down to 0.3363 secs! [Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-)]
Message #19 Posted by Raymond Del Tondo on 8 Feb 2008, 7:27 a.m.,
in response to message #17 by Xerxes

BTW did I mention that I just slightly improved the code,
it now needs only about 0.3363 (ZERO POINT THREE) seconds!
I replaced the biggest time wasters by some more efficient code.
Here's the story:
As mentioned in my earlier post, the ASLC and CSLC loops were the ones which could use some refinement.

** The two loops L110asl and L110csl from the original listing take the biggest amount of time.
** In the best case, when P is 9, the loop is run 16-9 times, -> 5 times,
** which sums up to 5*21 + 5*3 + 4*10 (NC case) + 1*3 (C case) = 105 + 15 + 40 + 3 = 163 cycles
** In the worst case, when P = 1, the loop is run 16-1 times, 15 times,
** which sums up to 15*21 + 15*3 + 14*10 (NC case) + 1*3 (C case) = 315 + 45 + 140 + 3 = 503 cycles !!!
**
** The newer method (using ASR W), shown below,
** runs 1 time in the best case, and 8 times in the worst case
** summing up to 2 + 3+d + 6 + 1*3+d + 1*2 + 3 = 2 + 4 + 6 + 19 + 2 + 3 = 36
** or worst 2 + 3+d + 6 + 8*3+d + 8*2 + 7*10 + 3 = 250 cycles !
        P=C     0           Y                       6
        A=C     P           'A(Y)'                  3+d
        P=      0                                   2
        C=-C    P                                   3+d
        P=C     0                                   6
*
L110asr ASR     W                                   3+d
        P=P+1                                       2
        GONC    L110asr                             10/3
**********
        C=A     P                                   3+d         * Backup of 'A(Y)'
        A=C     W                                   3+d         * Full backup of AAAAAAAA incl A(Y)
        C=B     P           AAAAAAAAX               3+d
        C=-C    P                                   3+d
        P=C     0                                   6
L110csr CSR     W                                   3+d
        P=P+1                                       3
        GONC    L110csr                             10/3
        ACEX    W           A[0]='A(X)' C[0]='A(Y)' 3+d

So the second variant was remarkably faster (runtime was about 0.5 seconds!) than the easy and elegant, but slower initial version.
But the worst case (250 cycles) still looked not too good, so I tried some other methods, including self-modifying code in temporary memory.
Not that self-modifying shit which actually modifies itself, but a default code slice in a RAM buffer, outside of the main program,
which got a parameter modified on demand, and then called from the main program. Not bad, too, but the management overhead was slightly too large.
With this method the run time was between 0.37 seconds and 0.38 seconds. Nothing more to gain.
So I finally used the discrete dispatcher version, which I actually had before the 'self-modifying' one.
This version is the fastest one up to now. It has a run time of 0.33630 seconds :-)
The current listing is shown below.
Have fun:-)
Raymond

::
  CK0
  CLKTICKS
*                           A B C D Dn Rn P         Cyc
CODE
        GOSBVL  =SAVPTR
L10     LAHEX   888888888
        R0=A                0:R                     19
        C=0     W                                   3+d
        D1=C                1:S                     8
        B=0     A           X=0                     7
L40     P=      0                                   2
        A=R0                R                       19
        ?A=B    P                                   13+d/6+d
        GOYES   L180
L50     B=B+1   P           R  X'  AAAAAAAAx        3+d         * X=X+1
L60     P=      0           0                       2           * A(X)=R
        C=B     P           AAAAAAAAX               3+d
        P=C     0           X                       6
        C=A     P           AAAAAAAAX               3+d         * Pushed R to AAAAAAAA at pos X
L70     CD1EX               S                       8           * S=S+1
        C=C+1   A                                   7
        CD1EX                                       8
L80     P=      0           0                       2           * Y=X
        C=B     P           X                       3+d
        D=C     P           Y=X                     3+d
L90     D=D-1   P           Y'                      3+d         * Y=Y-1
L100    ?D=0    P                                   13+d/6+d
        GOYES   L40
L110    P=      0                                   2           * T=A(X)-A(Y)
        C=D     P           AAAAAAAAY               3+d
        GOSUB   Ptst
        P=      0
        A=C     P
        C=B     P
        GOSUB   Ptst
        P=      0
        ?A>=C   P                                   13+d/6+d
        GOYES   NoSwp
        ACEX    P                                   3+d
NoSwp   A=A-C   P           ABS(T)                  3+d
L120    ?A=0    P                                   13+d/6+d    * IF T=0 THEN 140
        GOYES   L140
L130    C=B     P           X                       3+d         * IF X-Y<>ABS T THEN 90
        C=C-D   P           X-Y                     3+d
        ?A#C    P                                   13+d/6+d
        GOYES   L90
L140    P=      0                                   2
        C=B     P           X                       3+d         * A(X)=A(X)-1
        P=C     0           X                       6
        C=C-1   P                                   3+d
L150    ?C#0    P                                   13+d/6+d    * IF A(X)<>0 THEN 70
        GOYES   L70
L160    P=      0                                   2           * X=X-1
        B=B-1   P                                   3+d
L170    ?B#0    P                                   13+d/6+d    * IF X<>0 THEN 140
        GOYES   L140
L180    CSR     W                                   3+d         * Shift right one nib
        AD1EX                                       8
        P=      7                                   2
        ACEX    WP                                  3+d
        RSTK=C                                      8           * Save count
        GOSBVL  =PUSHhxs
        GOSBVL  =SAVPTR
        C=RSTK                                      8
        A=C     A                                   7
        GOVLNG  =PUSH#ALOOP
*********
Ptst    P=C     0                                   6
        ?P#     1                                   13/6
        GOYES   tP2
        CPEX    1
        C=P     0
        CPEX    1
        RTNCC
tP2     ?P#     2                                   13/6
        GOYES   tP3
        CPEX    2
        C=P     0
        CPEX    2
        RTNCC
tP3     ?P#     3                                   13/6
        GOYES   tP4
        CPEX    3
        C=P     0
        CPEX    3
        RTNCC
tP4     ?P#     4                                   13/6
        GOYES   tP5
        CPEX    4
        C=P     0
        CPEX    4
        RTNCC
tP5     ?P#     5                                   13/6
        GOYES   tP6
        CPEX    5
        C=P     0
        CPEX    5
        RTNCC
tP6     ?P#     6                                   13/6
        GOYES   tP7
        CPEX    6
        C=P     0
        CPEX    6
        RTNCC
tP7     ?P#     7                                   13/6
        GOYES   tP8
        CPEX    7
        C=P     0
        CPEX    7
        RTNCC
tP8     CPEX    8
        C=P     0
        CPEX    8
        RTNCC
ENDCODE
  CLKTICKS      ( *Ticks1 #Board #LCnt Ticks2* )
  4ROLL
  bit-
  #>%
  # 2000 UNCOERCE
  %/
  SWAP
  UNCOERCE
;
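
The difference between the first version and the dispatcher can be sketched in C. This is a rough analogy rather than a translation: it assumes the reading that the ASLC/CSLC loops rotate the register one nibble at a time until P wraps past 15, and that Ptst copies nibble C[P] into C[0] without any loop. The function names are hypothetical.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by one nibble (analogous to ASLC or CSLC). */
static uint64_t rotl_nibble(uint64_t v)
{
    return (v << 4) | (v >> 60);
}

/* First version: rotate until the pointer wraps from 15 to 0.
 * Brings nibble p (1..15) down to nibble 0; *steps gets the loop count. */
unsigned nibble_by_rotation(uint64_t reg, unsigned p, unsigned *steps)
{
    *steps = 0;
    do {
        reg = rotl_nibble(reg);
        ++*steps;
        p = (p + 1) & 0xFu;   /* like P=P+1; wrapping to 0 ends the loop */
    } while (p != 0);
    return (unsigned)(reg & 0xFu);
}

/* Dispatcher version: one test chain, then a direct pick of the nibble. */
unsigned nibble_by_dispatch(uint64_t reg, unsigned p)
{
    switch (p) {
    case 1: return (unsigned)((reg >>  4) & 0xFu);
    case 2: return (unsigned)((reg >>  8) & 0xFu);
    case 3: return (unsigned)((reg >> 12) & 0xFu);
    case 4: return (unsigned)((reg >> 16) & 0xFu);
    case 5: return (unsigned)((reg >> 20) & 0xFu);
    case 6: return (unsigned)((reg >> 24) & 0xFu);
    case 7: return (unsigned)((reg >> 28) & 0xFu);
    case 8: return (unsigned)((reg >> 32) & 0xFu);
    default: return 0;        /* positions outside 1..8 are not used */
    }
}
```

Both functions return the same nibble for positions 1 to 8, but the rotation variant needs 16 - p loop passes (15 for p = 1), while the dispatcher does a bounded number of tests and no full-width shifts.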

Edited: 8 Feb 2008, 7:06 p.m.

Re: Down to 0.3363 secs! [Re: My native 48GX benchmark implementation outperforms hp48xgcc version:-
Message #20 Posted by Xerxes on 8 Feb 2008, 8:46 p.m.,
in response to message #19 by Raymond Del Tondo

I can't believe it. Your latest creation is about 10(!) times faster than hp48xgcc, which surely doesn't use a register for holding the array.
Your program shows clearly the advantage of efficient hand-coded assembly. Thanks for the digression to Saturn. ;-)
