Threaded Mode | Linear Mode

Claudio L. · (This post was last modified: 11-06-2014 12:33 AM by Claudio L..)

Inspired by Patrice's relentless search for speed, I created the following versions of the N-Queens problem, solved initially for the Calculator's Benchmark

1) The original unmodified code was tested as per the article.
2) The same algorithm, but using PICK/UNPICK on the stack instead of lists:

Code:

« 8. 0. 0. 0. \-> R S X Y

  « CLEAR TICKS 1. R

    START 0.

    NEXT

    DO R 'X' INCR UNPICK

      DO 'S' INCR DROP X 'Y' STO

        WHILE Y 1. >

        REPEAT X PICK 'Y' DECR 1. + PICK -

          IF DUP 0. == SWAP ABS X Y - == OR

          THEN 0. 'Y' STO X PICK 1. - X UNPICK

            WHILE X PICK 0. ==

            REPEAT 'X' DECR PICK 1. - X UNPICK

            END

          END

        END

      UNTIL Y 1. ==

      END

    UNTIL X R ==

    END 8. DROPN TICKS SWAP - B\->R 8192. / S

  »

»

3) The original algorithm, but removing the list access from the inner loop:

Code:

« 8. 0. 0. 0. \-> R S X Y

  « CLEAR TICKS 1. R

    START 0.

    NEXT

    DO R 'X' INCR UNPICK

      DO 'S' INCR DROP X 'Y' STO

        WHILE Y 1. >

        REPEAT X PICK 'Y' DECR 1. + PICK -

          IF DUP 0. == SWAP ABS X Y - == OR

          THEN 0. 'Y' STO X PICK 1. - X UNPICK

            WHILE X PICK 0. ==

            REPEAT 'X' DECR PICK 1. - X UNPICK

            END

          END

        END

      UNTIL Y 1. ==

      END

    UNTIL X R ==

    END 8. DROPN TICKS SWAP - B\->R 8192. / S

  »

»

4) The algorithm in 2) removing the indirection from the inner loop:

Code:

« 8. 0. 0. 0. 0. \-> R S X Y Ax

  « CLEAR TICKS 1. R

    START 0.

    NEXT

    DO 'X' INCR DROP R 'Ax' STO

      DO 'S' INCR DROP X 'Y' STO

        WHILE Y 1. >

        REPEAT Ax Y PICK - 'Y' DECR DROP

          IF DUP 0. == SWAP ABS X Y - == OR

          THEN 0. 'Y' STO 'Ax' DECR DROP

            WHILE Ax 0. ==

            REPEAT 'X' DECR PICK 1. - 'Ax' STO

            END

          END

        END

      UNTIL Y 1. ==

      END Ax X UNPICK

    UNTIL X R ==

    END TICKS 10. ROLL - B\->R 8192. / S

  »

»

5) A new recursive algorithm:

Code:

ROUTINE CHECKQUEEN:

--------------------------

« 1. \-> X RES «

IF X 1 > THEN

X PICK

 1 X 1 - FOR I

  DUP I 2 + PICK - ABS X I - ABS

   IF == THEN

     0. 'RES' STO X 'I' STO

   END

   NEXT

   DROP RES

 ELSE

   1.

 END

»

» 

ROUTINE DOLEVEL:

---------------------

«  9 OVER - \-> X LIMIT 

«

 1 8 START LIMIT 8 + ROLLD NEXT

 LIMIT DUPN 1 8 START LIMIT LIMIT 8 + + ROLL NEXT

  DO 9. ROLL X UNPICK

        IF X CHECKQUEEN

        THEN X 1 + DUP 

         IF 8 \<= THEN IF DOLEVEL THEN 0. 9 ROLLD 0. 'LIMIT' STO ELSE X PICK 17 X - ROLLD END

          ELSE 9 ROLLD 0 'LIMIT' STO END

        ELSE X PICK 17 X - ROLLD

        END 'LIMIT' DECR

      UNTIL 0. \<=

      END

      1 9 X - START 9 ROLL DROP NEXT

      IF LIMIT 0. ==

      THEN 0 ELSE 1 END

  »

»

MAIN PROGRAM NQUEENS:

------------------------------

« TICKS

  1. 2. 3. 4. 5. 6. 7. 8.

  0. 0. 0. 0. 0. 0. 0. 0.

  1. DOLEVEL DROP

  1. 8. START 9. ROLL DROP NEXT

  TICKS

  10. ROLL - B\->R 8192. /

»

I previously posted the benchmark results, but I will repost here for completeness.

Benchmark results:

Original: userRPL = 90 seconds. newRPL = 197 ms
Stack: userRPL = 60.6 seconds. newRPL = 154 ms
Original w/optimized inner loop: userRPL = 67.6 seconds. newRPL = 154 ms
Stack w/Optimized inner loop: userRPL = 57.7 seconds. newRPL = 143 ms
Recursive stack: userRPL = 29.7 seconds. newRPL = 101 ms
Werner's stack (see below): userRPL = 22.6 seconds. newRPL = 106 ms

These results are at the normal speed for both environments (userRPL=75 MHz, newRPL=192 MHz), both running on different 50g machines.
The exact same code was executed on both machines. The only difference was that the trailing dots were removed from the numbers on newRPL. The change is justified since they are a quirk of the 50g, to force the use of faster real numbers instead of variable precision integers. newRPL has automatic integer to real promotion and variable precision reals, so there is no need for the explicit use of reals. Leaving the dots would actually cause a slow-down on newRPL due to the forced use of reals.

newRPL produced the wrong result when running the recursive method, which I will investigate, fix it and report back.

UPDATE: after fixing a bug in the compiler, newRPL runs the recursive version properly. The result was added to the list, as well as Werner's algorithm running unmodified on newRPL.

Claudio

Werner · 11-01-2014, 10:03 AM

Here's my entry, still using the same algorithm:
[49G] 125 bytes #A38Ch

Code:

\<<

  0.

  DO

    8. SWAP 1. +              @ Add Queen

    WHILE

                              @ Test if Qn can be caught by Q1..Qn-1

      DUP2                    @ Q1 .. Qn n Qn n

      DO 1. -                 @ Q1 .. Qn n Qn n-i

      UNTIL

        DUP2 5. + PICK - ABS  @ n-i |Qn-Qi|

        DUP2 - * NOT          @ n-i T/F

      END

    REPEAT                    @ Qn can be caught

      DROP

                              @ Next square for Queen

      WHILE SWAP DUP 1. SAME

      REPEAT -                @ Drop Queen - will not remove the first

      END

      1. - SWAP

    END

    DROP

  UNTIL DUP 8. SAME           @ Done

  END

  \->LIST

\>>

Compatible with all RPL models, I think.
timing on a real 49G: 54_s

Cheers, Werner

Claudio L. · 11-01-2014, 01:34 PM

(11-01-2014 10:03 AM)Werner Wrote: timing on a real 49G: 54_s

Excellent! Very clean and efficient. This is a proper implementation of the original algorithm on-stack (mine was just taking the same code and changing GET/PUT with PICK/UNPICK, so not really well thought out). No lists, not even local variables. I think we have a winner here. It will be very difficult for anybody to beat this.
I ran it on the 50g for proper comparison: 22.6 seconds, so it smokes my recursive implementation and then some.
Great work! I'll edit the original post to report your results as well.

Thanks for the entry.
Claudio

Claudio L. · 11-02-2014, 02:24 AM

(10-31-2014 11:21 PM)Claudio L. Wrote:

Recursive stack: userRPL = 29.7 seconds. newRPL = 111 ms

Werner's stack (see below): userRPL = 22.6 seconds. newRPL = 141 ms

I was able to run things on newRPL and updated the original post with the new results.
Interestingly, the recursive algorithm runs faster than Werner's stack algorithm on newRPL, while the opposite is true for userRPL.
This shows how different optimizations work on different platforms. I believe the difference is in the use of local variables: userRPL does a search by name within the environment, while the newRPL compiler is capable of optimizing into PUTLAM/GETLAM style operations at compile time, making the use of named local variables just as efficient as using the stack.

Then, the recursive algorithm is theoretically supposed to be faster, since it only checks 301 cases versus the 876 of the general algorithm. Not 3 times faster as the number of checks would suggest, since the recursive algorithm wastes some time shuffling numbers in the stack, and the discarded cases are so trivial the other algorithm doesn't take too much time on them, but should be a little faster. When running on userRPL it takes a big hit for using variables, enough to overturn the results.

Another interesting observation is that the speedup ratio between the slowest algorithm (the original one) vs the fastest on each platform is very different: 3.98x for userRPL and 2.04x for newRPL. This is also likely due to the smarter compiler in newRPL speeding up things.

DavidM · 11-02-2014, 03:00 AM

(11-02-2014 02:24 AM)Claudio L. Wrote: ...I believe the difference is in the use of local variables: userRPL does a search by name within the environment, while the newRPL compiler is capable of optimizing into PUTLAM/GETLAM style operations at compile time, making the use of named local variables just as efficient as using the stack.

This is a Very Good Thing.

Keep up the good work!

xerxes · 11-02-2014, 08:17 AM

Very interesting! As soon as I'm back from my journey, I'll have a closer look.

patrice · 11-02-2014, 10:36 AM

Well done guys.

Nice to see your efforts to make yours calcs more "Fast and Furious"

Claudio L. · 11-05-2014, 11:57 PM

Applying two simple improvements, one in the instruction execution loop of newRPL (moved some unnecessary things out of the loop), and then replacing memmove() with a version that moves words instead of bytes (which I should've done from the beginning), produced a nice speed improvement between 10 to 25%.
These improvements are nothing special and apply to all programs running on newRPL.
The original went from 226 ms to 198 ms, while Werner's went from 141 ms to 106 ms, being the one most impacted by the improvements.
I'll update the times in the original post with the new marks for recordkeeping.

Han · 11-06-2014, 03:29 AM

Awesome!

xerxes · 11-13-2014, 06:08 PM

Thanks to Claudio for the effort and comparisons.

And special thanks to Werner for this impressive solution and the new test code for userRPL.

Besides the 49G and 50G, I've updated the results of the 28C und 28S (normal and turbo speed).

Unfortunately I had to remove the userRPL results of the 48SX and 48GX and also the sysRPL result
of the 48GX, because these are not comparable any more.

Werner · (This post was last modified: 11-18-2014 10:39 AM by Werner.)

Xerxes,
the RPL is compatible with the 48 (S,G and GX) as well.
The dots after the numbers are only needed for the 49 and 50, as these have the integer type, and I want reals (they are much faster). Depending on the state of flag 51, the dot must be a period (clear) or a comma (set).
As for SysRPL, here's a quick transcript of the RPL:

Code:

::

  ZERO

  BEGIN

    EIGHT SWAP#1+

    ::

    BEGIN

      2DUP

      BEGIN

        #1-

        2DUP FIVE #+PICK 2DUP#> IT SWAP #-

        2DUP#+ 2#0=OR 

      UNTIL

      #0=case AGAIN

      DROPSWAP

      BEGIN DUP#1=

      WHILE #-SWAP

      REPEAT

      #1-SWAP

    AGAIN

    ;

    DROPRDROP

    EIGHT OVER#=

  UNTIL

  ZERO_DO

    8ROLL UNCOERCE

  LOOP

  EIGHT {}N

;

@

Cheers, Werner

xerxes · 11-18-2014, 09:45 PM

It would be very nice to have the results of the 48SX/GX with your userRPL and sysRPL test codes in the list, but I haven't the possibility to make these tests.

Werner · 11-19-2014, 06:22 AM

I have a 48GX, but no SX. Will test at first opportunity, that would be tonight if I don't forget ;-)
Werner

Claudio L. · 11-19-2014, 03:19 PM

(11-18-2014 07:39 AM)Werner Wrote: As for SysRPL, here's a quick transcript of the RPL:

Interesting.
On the 50g, your sysRPL version clocked 2.318 seconds (average of 5 runs doing MEM right before). Not bad at all!
In the list, it fits right in line with the WP-34S (2.1/2.3 seconds).

Comparing this result with your algorithm written in userRPL shows a speedup of sysRPL vs userRPL of almost 10x. Then from sysRPL to newRPL the speedup is 16x, but since it's running at 192 MHz, the real gain due to bypassing the Saturn layer is around 6.4x (at the same clock speed).

Normalizing results, for language efficiency comparison only:

userRPL = 1.0x (by definition)
sysRPL = 9.75x (speedup due to BINT use, lost argument safety checks)
newRPL = 62.6x (speedup due to emulation layer, at 75 MHz)

Notice that sysRPL loses all safety checks to increase efficiency, newRPL and userRPL have all safety checks in place. newRPL uses BINTs automatically, so it's closer to sysRPL than userRPL in this aspect.
N-Queens does not make heavy use of basic arithmetic operations on reals, so it's more oriented towards loop efficiency and integer math/counters. I've been toying with other tests that do some number crunching (mainly some PI finding algorithms, some series expansions, etc). The idea is to measure real computing speed (real as in real numbers), but I can't find a good algorithm, so any ideas are welcome.

Here's a couple of examples, but none of them work for what I want to measure:

Simple product that converges to PI/2:

Code:

<< 1. .5 √ 1. 100. START

           DUP ROT * SWAP .5 * .5 + √ NEXT

 * INV 2. *

 >>

Archimedes's method of finding PI:

Code:

<<

4. 3. √ * 6. -> a b

<< 1. 100. START 

2. a b * * a b + / 'a' STO b a * √ 'b' STO

NEXT a 2 /

>>

>>

Both examples are not good because products that converge tend to 1, and after a few terms, it all becomes "multiply by 1". It's also very fast in userRPL, which makes it difficult to measure in newRPL.

What other problems could we use to benchmark real numbers?

Han · 11-20-2014, 08:53 AM

Fast Fourier transforms or pretty much anything involving matrices would probably be reasonable speed tests. Linear or polynomial regressions on large data sets using vandermonde matrices come to mind for me since I have been working with them recently.

xerxes · 11-20-2014, 09:27 PM

Claudio, I've updated the list with the sysRPL und newRPL results. Thanks for testing.

(11-19-2014 03:19 PM)Claudio L. Wrote: N-Queens does not make heavy use of basic arithmetic operations on reals, so it's more oriented towards loop efficiency and integer math/counters.

At that time I had also difficulties to find a real numbers problem for benchmarking. But as soon as I decided to test assembly languages too,
I switched to an integer problem. A new real number benchmark would be interesting.

Claudio L. · 11-20-2014, 10:57 PM

@Han: Using matrices uses more memory, so memory management starts playing a bigger role. I like the FFT, which is a lot of multiplications but it needs a lot of storage as well.

How about this one: Is the sum of the reciprocal of triangular numbers, which very slowly converges to 2.
The slow convergence is good so we can do thousands of operations without going out of range.
\[ \sum { 2 \over n (n+1) } \]
Here's the code:

Code:

<< 0. 10000. 1. FOR n

   2 n DUP 1. + * / +

   -1 STEP

>>

It has one multiplication, one division and two additions at each step. I chose 10000 terms quite arbitrarily. The last term is in the order 1e-6, so it is still significant even with the 12-digit precision. It takes about 87 seconds on the 50g, so it can be measured comfortably.
Notice that the loop runs backwards, so there's less loss of significant digits.

One drawback I see is that the multiplication is always between two integers...

Any objections or other proposals?

Claudio L. · 11-21-2014, 04:27 PM

Easy solution: replace n=10*k and work with the integers divided by 10, then compensate with the numeric factor, and no more integers anywhere:

Code:

<< 0. 1000. 0.1 FOR n

   0.02 n DUP 0.1 + * / +

   -0.1 STEP

>>

The exact result is:
1 followed by 'x' 9's, then an 8, followed by 'x' 0's, then repeats the sequence with a 1 followed by...
'x' is the exponent of the end of the loop. If the loop starts from 1e3, there's 3 9's and 3 zeroes:
1.999 8 000 1 999 8 000 1 999 8 000 1 999 8 000 ...

So it's easy to check for rounding errors even at higher precisions (which I want to benchmark too).

Claudio L. · (This post was last modified: 11-26-2014 01:32 PM by Claudio L..)

I finally had the time to get some results in.
The following code was tested on a 50g under userRPL and newRPL under various precision settings.

Code:

<< 0. 1000. 0.1 FOR n

   0.02 n DUP 0.1 + * / +

   -0.1 STEP

>>

The times resulted as follows:

userRPL = 87.4 seconds
newRPL = 0.893 seconds (at 36 digits precision)
newRPL = 1.615 seconds (at 108 digits precision)
newRPL = 10.843 seconds (at 1008 digits precision)

A few observations from these results:
a) newRPL vs userRPL shows a speedup of 98x. Not as impressive as the 200x obtained in integer math, but still good, especially considering it's using 36-digit precision vs. 12-digits in userRPL.
b) Execution time seems to be growing fairly linear with precision, which is somewhat unexpected (at least for multiplication and division, which should have a big influence on this test).
c) Since the multiplication involves numbers of known precision (variable n), the time used in the multiplication in this test should be independent of the precision. In this case, the division and the last addition are the only operations affected by the selected precision (not sure if that is enough to justify the linear behavior).

Claudio

Bruno · 09-08-2015, 10:41 AM

(11-18-2014 09:45 PM)xerxes Wrote: It would be very nice to have the results of the 48SX/GX with your userRPL and sysRPL test codes in the list, but I haven't the possibility to make these tests.

Hi,

below are the results using the Werner's Algorithm on my HP48(s) :

UserRPL (125 bytes ; #5987h):

HP48S ROM revJ: 76.97s
HP48SX ROM revE: 78.94s
HP48GX ROM revM (@3.6Mhz): 54.46s

SysRPL (105 bytes ; #D802h):

HP48S ROM revJ: 15.06s
HP48SX ROM revE: 15.98s
HP48GX ROM revM (@3.6Mhz): 10.19s