Anders · (This post was last modified: 08-30-2018 06:33 PM by Anders.)

(08-30-2018 03:57 PM)xerxes Wrote: Thank you for testing, but are you sure with 1.158 sec? The G1 needs 0.346 sec only.

No, it should be 0.158_s (hehe).

One way to explain what is going on is that there is also overhead that you cannot cut out: just by running a program, the OS does memory management and other work in the background. The overhead factor gets smaller if you run the benchmark, say, 10 times in a loop just outside the outer REPEAT/UNTIL. So for this purpose, I modified your benchmark that way and made variables local as well.

Running this program (looping 10 times) I now get 0.855_s +/- 0.05 (on battery, not connected to the PC).

So with that we can say that the overhead has been spread across 10 loops and we get 0.0855_s per loop. It's not a complete apples-to-apples comparison to the G1, but 0.346 vs 0.0855 is a 4x speed-up, in line with the Savage() benchmark improvement I did before.
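
For anyone who wants to check that arithmetic on the calculator, here is a minimal sketch. The program name SPEEDUP and the local variable names are made up for illustration; the numbers are just the ones quoted above, not new measurements.

```
EXPORT SPEEDUP()
BEGIN
  LOCAL tTotal, nLoops, tLoop, tG1;
  tTotal:=0.855;        // measured time for all 10 loops, in seconds
  nLoops:=10;           // number of loops around the algorithm
  tG1:=0.346;           // G1 result quoted above, in seconds
  tLoop:=tTotal/nLoops; // time per loop
  RETURN {tLoop, tG1/tLoop}; // per-loop time and G1-to-G2 ratio
END;
```

The first element is the 0.0855 per-loop time and the second comes out at a little over 4, which is where the 4x figure comes from.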

Code:

EXPORT NQUEENS()
BEGIN
  LOCAL startTime, duration, endTime, R, L1, S, X, I;
  // making everything local
  R:=8;
  L1:=MAKELIST(0,X,1,R,1);
  S:=0;
  X:=0;
  startTime := TICKS(); // inserted code to measure start time
  // inserted code to run the problem 10 times
  FOR I FROM 1 TO 10 DO
    REPEAT
      X:=X+1;
      L1(X):=R;
      REPEAT
        S:=S+1;
        Y:=X;
        WHILE Y>1 DO
          Y:=Y-1;
          T:=L1(X)-L1(Y);
          IF T==0 OR X-Y==ABS(T) THEN
            Y:=0;
            L1(X):=L1(X)-1;
            WHILE L1(X)==0 DO
              X:=X-1;
              L1(X):=L1(X)-1;
            END;
          END;
        END;
      UNTIL Y==1 END;
    UNTIL X==R END;
  END;
  endTime := TICKS(); // inserted code to measure end time
  // inserted code to calculate duration
  duration := (endTime - startTime)/1000;
  RETURN {duration,S};
END;

xerxes · 08-30-2018, 04:56 PM

Very interesting, 1 loop with global vars took 1.158 seconds and 10 loops with local vars only 0.855.
Obviously local vars are much faster than global ones. I'll update the list with the result of 0.0855.
Unfortunately the G1 Prime and the HP-39GII were tested with global vars only in the list.

grsbanks · 08-30-2018, 05:39 PM

(08-30-2018 04:33 PM)Anders Wrote: Running this program (looping 10 times) I now get 0.855_s +/- 0.05 (on battery, not connected to the PC).

I don't think connecting the Prime to a USB port changes anything (unlike the DM42 whose clock ramps up from 24MHz to 80MHz).

Albert Chan · 08-30-2018, 05:45 PM

(08-30-2018 04:56 PM)xerxes Wrote: Very interesting, 1 loop with global vars took 1.158 seconds and 10 loops with local vars only 0.855.

Are you sure the numbers are right ? Seems off by a decimal point ...

Might be better to revert back to global variables, and check the real overhead.

Anders · (This post was last modified: 08-30-2018 07:24 PM by Anders.)

(08-30-2018 05:45 PM)Albert Chan Wrote:
(08-30-2018 04:56 PM)xerxes Wrote: Very interesting, 1 loop with global vars took 1.158 seconds and 10 loops with local vars only 0.855.

Are you sure the numbers are right ? Seems off by a decimal point ...

Might be better to revert back to global variables, and check the real overhead.

No, I made a typo, it should be 0.158_s (hehe)...

Playing with this problem a bit back and forth, it seems to me that the actual work (the two nested REPEAT loops) only consumes part of the time and a rather large % is spent on overhead (OS/memory management etc).

With better understanding of the inner workings of Prime OS, the code can be optimized to minimize the time further.

xerxes · 08-30-2018, 06:41 PM

Thank you very much for your effort. I've updated the list with the new result.

toml_12953 · 08-30-2018, 07:33 PM

(08-30-2018 06:21 PM)Anders Wrote:
(08-30-2018 05:45 PM)Albert Chan Wrote: Are you sure the numbers are right ? Seems off by a decimal point ...

Might be better to revert back to global variables, and check the real overhead.
No, I made a typo, it should be 0.158_s (hehe)..
but I played around and made the important variables global:

I now get 0.707_s +/- 0.1 for this program, a bit less than before. And I was wrong: the first test I did, without the FOR 1 to 10 loop, should of course be 0.158 s. Sorry for the confusion I caused.

If you want to run the program without having to modify it by adding the timing code, you can use TEVAL from the home screen. It's usually best to compare unmodified code. Then you don't have to worry about unintended side effects changing the timing.
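
As a minimal illustration of that suggestion, assuming the unmodified program from the list is stored under the name NQUEENS (an assumption; use whatever name it was saved under), the Home screen entry would be:

```
TEVAL(NQUEENS())
```

TEVAL reports the elapsed time as a value in seconds, in the same form as the 0.158_s figure quoted above. Note that this times everything the call does, the variable setup and MAKELIST included, not only the nested loops.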

Anders · (This post was last modified: 08-30-2018 08:07 PM by Anders.)

(08-30-2018 07:33 PM)toml_12953 Wrote:
(08-30-2018 06:21 PM)Anders Wrote: No, I made a typo, it should be 0.158_s (hehe)..
but I played around and made the important variables global:

I now get 0.707_s +/- 0.1 for this program, a bit less than before. And I was wrong: the first test I did, without the FOR 1 to 10 loop, should of course be 0.158 s. Sorry for the confusion I caused.

If you want to run the program without having to modify it by adding the timing code, you can use TEVAL from the home screen. It's usually best to compare unmodified code. Then you don't have to worry about unintended side effects changing the timing.

Exactly. Before we benchmark we need to decide what we are looking to benchmark, how we design the measurement to do exactly that, and how to simultaneously minimize the overhead the measurement functions themselves cause.
If I run TEVAL() on the whole program, I also measure the MAKELIST, the variable memory allocations, the function call etc., and with a program that runs in milliseconds this overhead is quite significant.

If you want to measure how fast it actually solves the problem where the problem is given, you put the measurement points exactly around the problem solving not the other prep stuff.
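
A rough sketch of that idea, using only TICKS() and the algorithm already posted in this thread: take three clock readings so the preparation and the solving are reported separately. The name NQSPLIT and the variables t0, t1 and t2 are hypothetical, and the loop syntax simply follows the listing above.

```
EXPORT NQSPLIT()
BEGIN
  LOCAL t0, t1, t2, R, L1, S, X, Y, T;
  t0:=TICKS();               // reading before any preparation
  R:=8;
  L1:=MAKELIST(0,X,1,R,1);
  S:=0;
  X:=0;
  t1:=TICKS();               // preparation done, solving starts here
  REPEAT
    X:=X+1;
    L1(X):=R;
    REPEAT
      S:=S+1;
      Y:=X;
      WHILE Y>1 DO
        Y:=Y-1;
        T:=L1(X)-L1(Y);
        IF T==0 OR X-Y==ABS(T) THEN
          Y:=0;
          L1(X):=L1(X)-1;
          WHILE L1(X)==0 DO
            X:=X-1;
            L1(X):=L1(X)-1;
          END;
        END;
      END;
    UNTIL Y==1 END;
  UNTIL X==R END;
  t2:=TICKS();               // solving finished
  // preparation seconds, solving seconds, and the step counter S
  RETURN {(t1-t0)/1000, (t2-t1)/1000, S};
END;
```

On a single run the first figure may well be too small for the millisecond clock to resolve, so this only shows where the measurement points would go.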

toml_12953 · (This post was last modified: 08-30-2018 10:01 PM by toml_12953.)

(08-30-2018 08:05 PM)Anders Wrote:
(08-30-2018 07:33 PM)toml_12953 Wrote: If you want to run the program without having to modify it by adding the timing code, you can use TEVAL from the home screen. It's usually best to compare unmodified code. Then you don't have to worry about unintended side effects changing the timing.

If I run TEVAL() on the whole program, I also measure the MAKELIST, the variable memory allocations, the function call etc., and with a program that runs in milliseconds this overhead is quite significant.

The other calculators' programs were measured including the prep stuff, so in order to compare apples to apples, you'd have to include it for the Prime, too.

Claudio L. · 08-30-2018, 10:02 PM

(08-30-2018 08:05 PM)Anders Wrote: If you want to measure how fast it actually solves the problem where the problem is given, you put the measurement points exactly around the problem solving not the other prep stuff.

If you look at Xerxes's list, you'll see he has done this test on hundreds of machines, so right now is not the time to change the testing pattern. Every other implementation (I think) was tested with all the overhead to solve one problem. If you need to run that multiple times, you run it in a loop, and measure outside the loop, then divide by the number of executions.
Moving what you call "prep work" out of the loop makes your result not comparable to all other measurements in the list.
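
The pattern described here could be sketched as follows. It assumes the as-listed program, without any timing code inside it, is available as NQUEENS; the wrapper name NQBENCH and its parameter n are hypothetical.

```
EXPORT NQBENCH(n)
BEGIN
  LOCAL t0, k;
  t0:=TICKS();                  // clock reading in milliseconds, taken outside the loop
  FOR k FROM 1 TO n DO
    NQUEENS();                  // the complete program, preparation included
  END;
  RETURN (TICKS()-t0)/(1000*n); // average seconds per complete run
END;
```

Calling NQBENCH(100) would then give the average time of one full run, with the preparation counted on every iteration, so the figure stays comparable with the rest of the list.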

toml_12953 · 08-30-2018, 10:18 PM

(08-30-2018 10:02 PM)Claudio L. Wrote:
(08-30-2018 08:05 PM)Anders Wrote: If you want to measure how fast it actually solves the problem where the problem is given, you put the measurement points exactly around the problem solving not the other prep stuff.

If you look at Xerxes's list, you'll see he has done this test on hundreds of machines, so right now is not the time to change the testing pattern. Every other implementation (I think) was tested with all the overhead to solve one problem. If you need to run that multiple times, you run it in a loop, and measure outside the loop, then divide by the number of executions.
Moving what you call "prep work" out of the loop makes your result not comparable to all other measurements in the list.

Exactly my point. Thanks!

Anders · 08-30-2018, 10:56 PM

(08-30-2018 10:02 PM)Claudio L. Wrote:
(08-30-2018 08:05 PM)Anders Wrote: If you want to measure how fast it actually solves the problem where the problem is given, you put the measurement points exactly around the problem solving not the other prep stuff.

If you look at Xerxes's list, you'll see he has done this test on hundreds of machines, so right now is not the time to change the testing pattern. Every other implementation (I think) was tested with all the overhead to solve one problem. If you need to run that multiple times, you run it in a loop, and measure outside the loop, then divide by the number of executions.
Moving what you call "prep work" out of the loop makes your result not comparable to all other measurements in the list.

Yes and you can be sure that I understand that. However,
1. the code is different for every platform, (since they use different languages to implement the same basic algorithm)
2. therefore the implementation is different for every platform (allowing for optimizations)
3. it's not clear how each benchmark was measured and how many times in each test

In other words it is already difficult to compare platforms without understanding this.

If x0% of the time in a benchmark running one loop is due to mallocs or whatever else the OS is doing just to set up the process/thread of the program before it executes the main algorithm, are we measuring what we want to measure (the speed at which it solves a problem executing a particular algorithm, which is the same on all platforms)? I think not.

CyberAngel · 08-31-2018, 12:29 AM

(08-30-2018 10:56 PM)Anders Wrote:
(08-30-2018 10:02 PM)Claudio L. Wrote: If you look at Xerxes's list, you'll see he has done this test on hundreds of machines, so right now is not the time to change the testing pattern. Every other implementation (I think) was tested with all the overhead to solve one problem. If you need to run that multiple times, you run it in a loop, and measure outside the loop, then divide by the number of executions.
Moving what you call "prep work" out of the loop makes your result not comparable to all other measurements in the list.
Yes and you can be sure that I understand that. However,
1. the code is different for every platform, (since they use different languages to implement the same basic algorithm)
2. therefore the implementation is different for every platform (allowing for optimizations)
3. it's not clear how each benchmark was measured and how many times in each test

In other words it is already difficult to compare platforms without understanding this.

If x0% of the time in a benchmark running one loop is due to mallocs or whatever else the OS is doing just to set up the process/thread of the program before it executes the main algorithm, are we measuring what we want to measure (the speed at which it solves a problem executing a particular algorithm, which is the same on all platforms)? I think not.

The timing should be given as algorithm loops per second.
If the measurement is done by a human
then a suitably long time should limit the error involved.
Timing a minute and forty seconds or 100_s is too error prone.
I'd say 16_min 40_s or 1000_s (a kilosecond) would do it.
Unfortunately if some old machine runs the algorithm ONCE in over 3_h
then it's under 0.0001 algorithm/s (about 0.09 per kilosecond)?
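
As a sketch of the unit being proposed, the conversion is just completed runs divided by elapsed seconds. RUNRATE is a hypothetical name, not a built-in.

```
EXPORT RUNRATE(runs, secs)
BEGIN
  RETURN runs/secs;   // completed runs of the algorithm per second
END;
```

For example, RUNRATE(1,3*3600), one run in three hours, works out to roughly 0.0000926 runs per second, or about 0.09 runs per kilosecond.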

How much time does it take for a Ferris wheel to turn one round?
How about a turbine engine?
We use revolutions per minute for car engines.

MIPS = Million Instructions Per Second
That's what we should use for calculators.

Just my flow of thoughts at about 3:30 AM (I woke up)

Claudio L. · 08-31-2018, 01:22 AM

(08-30-2018 10:56 PM)Anders Wrote: Yes and you can be sure that I understand that. However,
1. the code is different for every platform, (since they use different languages to implement the same basic algorithm)
2. therefore the implementation is different for every platform (allowing for optimizations)
3. it's not clear how each benchmark was measured and how many times in each test

In other words it is already difficult to compare platforms without understanding this.

If x0% of the time in a benchmark running one loop is due to mallocs or whatever else the OS is doing just to set up the process/thread of the program before it executes the main algorithm, are we measuring what we want to measure (the speed at which it solves a problem executing a particular algorithm, which is the same on all platforms)? I think not.

I didn't read all of the implementations, but they look pretty similar, save for language differences. Perhaps the only really different ones are the RPL and sysRPL ones because of the heavy stack use, but the algorithm is 100% identical.
I get your point: C code can allocate all local variables, including an 8-element vector, with a single assembler instruction; the HP Prime takes its sweet time doing the MAKELIST(); the RPL machines don't need to create the list in advance, since they use the values on the stack and create the vector at the end. These are very subtle differences, but I think they are "part of the beast". If a particular machine is slow to handle lists, so be it. The RPL code could leave the 8 values on the stack and doesn't really need to create the list either, but it was left there to make it comparable (RPL machines are also quite slow to handle lists).
It's hard to do benchmarks, and they are highly subjective. Xerxes has done a great job for many years collecting these results.
The results are very clear in my opinion: You can see the code that was executed, listed right there. For machines that are too fast, that code as-listed was executed multiple times and the total running time divided by the number of loops.
Let me give you another example: newRPL has extremely fast local variables, so perhaps it would be faster to re-code the algorithm using local variables rather than stackrobatics. However, it was benchmarked using the same code as the HP28,48,49g,49g+ and 50g. It may seem a bad idea to you, but I think it's more useful to have a comparable value than the fastest possible.

Tim Wessman · (This post was last modified: 08-31-2018 02:09 AM by Tim Wessman.)

(08-31-2018 01:22 AM)Claudio L. Wrote: the HP Prime takes its sweet time doing the MAKELIST()

Hmmm? What are you thinking of here? The number of objects in the list is calculated using the start/stop/step, and then pointers are dropped in there as calculated. What is the "sweet time" you are talking about?

xerxes · 08-31-2018, 12:11 PM

After reading the comments about the correct timing of the code, I checked the test code for the G2 more carefully and found a serious bug. X has to be zero at the beginning of the execution, but this is only the case at the first iteration, because X:=0 is outside the FOR loop. Having the MAKELIST outside the loop doesn't seem that critical to me, because its execution time is surely minimal in relation to the test code itself. To minimize the overhead effect, I've simply used more iterations for very fast results. I would suggest 100 iterations in the case of the Prime. I hope that I've made no mistake correcting the test code:

Code:

EXPORT NQUEENS()
BEGIN
  // timing variables declared locally; the algorithm's variables stay global
  LOCAL startTime, endTime, duration;
  R:=8;
  L1:=MAKELIST(0,X,1,R,1);
  startTime := TICKS(); // inserted code to measure start time
  // inserted code to run the problem 100 times
  FOR I FROM 1 TO 100 DO
    S:=0; // reset the step counter for every run
    X:=0; // X has to start at zero for every run
    REPEAT
      X:=X+1;
      L1(X):=R;
      REPEAT
        S:=S+1;
        Y:=X;
        WHILE Y>1 DO
          Y:=Y-1;
          T:=L1(X)-L1(Y);
          IF T==0 OR X-Y==ABS(T) THEN
            Y:=0;
            L1(X):=L1(X)-1;
            WHILE L1(X)==0 DO
              X:=X-1;
              L1(X):=L1(X)-1;
            END;
          END;
        END;
      UNTIL Y==1 END;
    UNTIL X==R END;
  END;
  endTime := TICKS(); // inserted code to measure end time
  // inserted code to calculate duration
  duration := (endTime - startTime)/1000;
  RETURN {duration,S};
END;

Sorry for the complications to Anders.

Claudio L. · (This post was last modified: 08-31-2018 02:44 PM by Claudio L..)

(08-31-2018 01:46 AM)Tim Wessman Wrote:
(08-31-2018 01:22 AM)Claudio L. Wrote: the HP Prime takes its sweet time doing the MAKELIST()

Hmmm? What are you thinking of here? The number of objects in the list is calculated using the start/stop/step, and then pointers are dropped in there as calculated. What is the "sweet time" you are talking about?

Reading post #82 and around, I thought he went from 1.15 seconds to 0.855 seconds by converting to local vars and moving MAKELIST out of the loop, which is roughly a quarter of the execution time. I assumed most of that overhead would be on MAKELIST since they were talking about memory management (and that's pretty much the only statement that allocates any memory). EDIT: I blamed MAKELIST because it evaluates an expression repeatedly, and that's bound to be much slower than just writing a zero-filled list.
But now reading again, that 1.15 seconds was a typo so my thoughts were way off.
If it isn't the case, then even more reason to leave the code as-is and not try to tweak it by moving things out of the loop.
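
One way to settle the MAKELIST question would be to time the two forms directly. This is only a sketch under my own assumptions: LISTTIME, its parameter n and the local names are hypothetical, and no result is claimed here.

```
EXPORT LISTTIME(n)
BEGIN
  LOCAL t0, t1, t2, k, a, X;
  t0:=TICKS();
  FOR k FROM 1 TO n DO
    a:=MAKELIST(0,X,1,8,1);  // expression form used in the benchmark
  END;
  t1:=TICKS();
  FOR k FROM 1 TO n DO
    a:={0,0,0,0,0,0,0,0};    // literal zero-filled list for comparison
  END;
  t2:=TICKS();
  // average seconds per list: MAKELIST form first, literal form second
  RETURN {(t1-t0)/(1000*n), (t2-t1)/(1000*n)};
END;
```

Both figures include the cost of the FOR loop and the assignment, so only the difference between them says anything about MAKELIST itself.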

BruceH · 08-31-2018, 04:24 PM

(08-30-2018 01:48 PM)CyberAngel Wrote: To the Real People out there:
Do you want a slower or faster calculator?
(instead of a wisecrack answer)

I'd be more than happy with a 48SX in the Prime form factor.

Albert Chan · 08-31-2018, 04:40 PM

Hi, Claudio L.,

Spot on !

Cost of overhead is part of the benchmark, and should not be "spread-out".

I was curious: with all this speed, does it drain the battery much faster?
What is typical time for a recharge ?

Voldemar · (This post was last modified: 09-02-2018 02:50 PM by Voldemar.)

Sorry for posting in the wrong place.
How do I exit exam mode in the HP Prime PC emulator? I did not find the answer in the forum. I accidentally entered exam mode in the emulator and finally reset the emulator, which lost all data. Is there any way to exit exam mode in the PC emulator without losing data?