HP Forums

Full Version: Emulator vs simulator performance
The question is: what is the order of magnitude of the performance penalty of an emulator versus a native application?

(06-05-2020 12:39 AM)Valentin Albillo Wrote: [ -> ]
Quote:An emulator adds the overhead of the CPU/hardware simulation; Free42 is a native application. All depends on the performance of the emulation engine, I can't really judge the x60 ratio but it's the order of magnitude we can expect

I don't concur. A factor of 60x is just too much, no emulation should be that inefficient. Say 10x would be acceptable, if slow, but 60x ? Really ? Converting a 10 seconds running time to 10 minutes ? A 10 minutes running time to 10 hours ? That would be a horribly inefficient emulation, direct-to-garbage-bin class.

Ok, so let's compare Free42 and Emu42, that have similar functionalities and are both high quality software recognized by the community, and let's compare them on the same host system.

You may say, it's not fair, Free42 provides much better accuracy with its 35-digit arithmetic, so let's also include the binary version that has similar arithmetic accuracy.

I will not use a trivial benchmark but the program from Valentin's latest article, which computes the area of the Mandelbrot set (great article, Valentin, I may comment on it later in another thread).
The program is run to evaluate 10,000 points with the other default parameters.
The display is fully static during Valentin's program run (no display updates, no flying goose).

Emu42 and Free42 are run separately; during the run each uses about one core of my Core i3 machine (global CPU load ~25-30%).
Of course, Emu42 is run with authentic calculator speed off.

Emu42 1.24 : 5min08s
Free42 2.5.18 decimal : 6.3s
Free42 2.5.18 binary : 2.0s
Emu42 execution time measured by hand, Free42 execution measured using TIME.

J-F
Interesting. So the 60x slowdown claim appears fairly accurate: decimal Free42 is about 49 times faster than Emu42.

Eric Smith can probably speak more knowledgeably than I can, but here are some thoughts.

The Saturn CPU is pretty different from an x86. Emulating it might be harder than emulating a more traditional CPU. Things like register fields might be hard. And if the Saturn BCD arithmetic doesn't translate well to x86 then that could be a problem.

Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language. As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

Dave
(06-05-2020 03:26 PM)David Hayden Wrote: [ -> ]Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language. As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
On the other hand, in this particular benchmark, we can expect that a significant part of the time is spent in arithmetic operations, which are in assembly in the HP-42S; I'm not sure about the Intel 35-digit library.

My own Emu71/DOS is largely written in assembly (using Intel x86 BCD support for efficiency - at the cost of minor compatibility issues) but it's a 16-bit DOS application and there is no native HP-71 simulator available for comparison.

J-F
Hi, J-F:

(06-05-2020 03:03 PM)J-F Garnier Wrote: [ -> ]I will not use a trivial benchmark but the program from the latest article from Valentin, that computes the area of the Mandelbrot set (great article, Valentin, I may comment it later in an other thread).

Thank you very much, you're most welcome to comment to your heart's content. :-)

Quote:The program is run to evaluate 10,000 points with the other default parameters. The display is fully static during Valentin's program run (no display updates, no flying goose).

(How do you know it's a "goose" and not a "gander" ? Oh, I remember, the gander flies backwards. At least it does in the HP-41C family, as the HP42S (and Free42) uses a little triangle instead)

Quote:Emu42 and Free42 are run separately; during the run each uses about one core of my Core i3 machine (global CPU load ~25-30%).
Of course, Emu42 is run with authentic calculator speed off.

Emu42 1.24 : 5min08s
Free42 2.5.18 decimal : 6.3s
Free42 2.5.18 binary : 2.0s
Emu42 execution time measured by hand, Free42 execution measured using TIME.

Quite a difference, indeed. Emu42 is calculating ~32 points/sec, Free42 BCD is doing ~1,587 points/sec and Free42 binary does 5,000 points/sec.

The difference between 32 p/s and 5,000 p/s is ~156x, which seems to me an unreasonable performance difference for an emulation (Emu42) vs. a simulation (Free42): that's more than two orders of magnitude on the same hardware (and presumably OS).
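For readers who want to redo the arithmetic, here is a small sketch that recomputes the throughput and the ratios from the three timings reported above. The helper names are made up for this illustration. Note that the exact ratio from the raw timings is 154x; the ~156x figure comes from rounding Emu42 to 32 points/sec first.

```python
# Timings reported in the thread for the 10,000-point run, in seconds.
POINTS = 10_000
timings_s = {
    "Emu42 1.24": 5 * 60 + 8,          # 5min08s, measured by hand
    "Free42 2.5.18 decimal": 6.3,
    "Free42 2.5.18 binary": 2.0,
}


def points_per_second(seconds: float) -> float:
    """Throughput for the 10,000-point benchmark."""
    return POINTS / seconds


def slowdown(slow: str, fast: str) -> float:
    """How many times longer `slow` takes than `fast`."""
    return timings_s[slow] / timings_s[fast]


for name, seconds in timings_s.items():
    print(f"{name:24s} {points_per_second(seconds):8.1f} points/s")

print("Emu42 vs Free42 decimal:", round(slowdown("Emu42 1.24", "Free42 2.5.18 decimal"), 1))
print("Emu42 vs Free42 binary :", round(slowdown("Emu42 1.24", "Free42 2.5.18 binary"), 1))
```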

Also, your Free42 BCD is running only ~75% faster on your hardware than mine on my mid-range Samsung tablet. Frankly, I would expect it to run much faster on your system (say, at least 4x).

Puzzling results to me, all of them. :-(

Thanks and regards.
V.
(06-05-2020 03:45 PM)J-F Garnier Wrote: [ -> ]Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
.. but Free42 is compiled, and the 42 SysRPL is interpreted.
Werner
I can only speak about Emu42; I don't know any details about the internals of Free42.

From the early days, the focus of my emulator development has been on rebuilding the original hardware at the functional level as closely as possible. I also expect that an emulation of a real machine is faster than the original, or at minimum runs at equal speed.

The Saturn CPU emulation inside Emu42 was originally built by Sebastien Carlier for Emu48 and was improved by me during the Emu48 development over the years. This emulation core was never optimized for speed.

One anecdote concerns the Saturn opcode dispatcher. The current emulator code decodes the Saturn opcode through tables, one nibble at a time. This keeps the tables small, because 1 nibble = 4 bits -> 2^4 = 16 entries per table. Cyrille thought: why not decode 3 nibbles (12 bits) of the opcode in the first step? So the first opcode dispatcher table had a size of 2^12 = 4096 entries. On a PC, the emulator with the 4096-entry dispatcher table was about 10% faster than the one with 16 entries. Mission accomplished? Not really: the optimization was done for Pocket PCs running Windows CE or Pocket PC 2002. On these devices the new version was massively slower. Why? The 4096-entry dispatcher table didn't fit into the first-level CPU cache, so all accesses to this table had to go through the slow main memory.
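To put numbers on the two dispatcher variants, here is a rough sketch of the table footprints. The 4-byte pointer size is an assumption (32-bit function pointers), and the post does not give the cache sizes of the Pocket PC devices, so this only shows how much the table grows, not where exactly it stops fitting.

```python
NIBBLE_BITS = 4
POINTER_BYTES = 4  # assumption: one 32-bit function pointer per table entry


def table_entries(nibbles_decoded: int) -> int:
    """Number of entries when dispatching on this many opcode nibbles at once."""
    return 2 ** (NIBBLE_BITS * nibbles_decoded)


def table_bytes(nibbles_decoded: int, pointer_bytes: int = POINTER_BYTES) -> int:
    """Memory footprint of a single dispatch table."""
    return table_entries(nibbles_decoded) * pointer_bytes


for nibbles in (1, 3):
    print(f"{nibbles} nibble(s): {table_entries(nibbles):5d} entries, "
          f"{table_bytes(nibbles):6d} bytes")
```

Under that assumption the first-level table grows from 64 bytes to 16 KiB, a factor of 256.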

So what is important for me? The speed difference between the original calculator and the emulation. So I think an emulation that is 170 to 340 times faster is fast enough. This is a benchmark list from 2011 comparing the emulation on different host systems with the real machine:
Code:

Emu42 benchmark results using Erik Ehrling's
"Miller-Rabin Primality Test for the HP-42S"

Prime number: 999,999,999,961

Real HP-42S

ROM REV A: Std clock 1MHz  5m 48s
ROM REV A: Dbl clock 2MHz  2m 52s

C2E6750/2.66GHz/333MHz/DDR2 / 2 GB / Windows XP SP2 / Emu42 v1.10

ROM REV C: Max 1s     Auth 5m 27s

2x E5507/2.26GHz/800MHz/DDR3 / 4 GB / Windows 7 SP1 (x86) / Emu42 v1.14

ROM REV C: Max 2s     Auth 5m 25s

A64X2/3800+/800MHz/DDR2 / 4GB / Windows 7 SP1 (x64) / Emu42 v1.14

ROM REV C: Max 2s     Auth 5m 26s

A64X2/3800+/800MHz/DDR2 / 4GB / Windows XP SP3 / Emu42 v1.14

ROM REV C: Max 2s     Auth 5m 26s

A64X2/3800+/533MHz/DDR2 / 1GB / Windows XP SP2 / Emu42 v1.09beta1

ROM REV C: Max 2s     Auth 5m 26s

P4HT/3.4GHz/400MHz/DDR / 2GB / Windows XP SP3 / Emu42 v1.14

ROM REV C: Max 2s     Auth 5m 27s

P4HT/3.4GHz/400MHz/DDR / 1GB / Windows XP SP2 / Emu42 v1.09beta1

ROM REV C: Max 2s     Auth 5m 27s

P4HT/3.2GHz/400MHz/DDR / 1GB / Windows 2000 SP4 / Emu42 v1.09beta1

ROM REV C: Max 2s     Auth 5m 27s

P4/2.4GHz/533MHz / 1GB / Windows XP SP3 / Emu42 v1.11

ROM REV C: Max 2s     Auth 5m 26s

P4/2.4GHz/533MHz / 256MB / Windows 2000 SP4 / Emu42 v0.98-5

ROM REV C: Max 3s     Auth 5m 28s

P3/1.0GHz/133MHz / 512MB / Windows 2000 SP4 / Emu42 v1.09beta1

ROM REV C: Max 4s     Auth 5m 27s

P3/850MHz/100MHz / 384MB / Windows XP SP1 / Emu42 v0.98-5

ROM REV C: Max 5s     Auth 5m 28s

P3/850MHz/100MHz / 384MB / Windows 2000 SP4 / Emu42 v0.98-5

ROM REV C: Max 5s     Auth 5m 28s

P3/500MHz/100MHz / 256MB / Windows 2000 SP4 / Emu42 v0.98-5

ROM REV C: Max 8s     Auth 5m 24s

P3/500MHz/100MHz / 256MB / Windows 98 / Emu42 v0.98-5

ROM REV C: Max 8s     Auth 5m 24s

P3/450MHz/100MHz / 320MB / Windows 2000 SP4 / Emu42 v0.98-5

ROM REV C: Max 10s    Auth 5m 27s

P3/450MHz/100MHz / 128MB / Windows NT4.0 SP4 / Emu42 v0.98-5

ROM REV C: Max 10s    Auth 5m 24s

P(MMX)/200MHz/66MHz / 96MB / Windows 98SE / Emu42 v0.98-4

ROM REV A: Max 43s    Auth 5m 13s
ROM REV B: Max 44s    Auth 5m 25s
ROM REV C: Max 45s    Auth 5m 25s

P/100MHz/66MHz / 32MB / Windows 95B (OSR2) / Emu42 v0.98-4

ROM REV A: Max 1m 00s Auth 5m 13s
ROM REV B: Max 1m 02s Auth 5m 26s
ROM REV C: Max 1m 01s Auth 5m 25s

ARM PXA310/640MHz / Win Mobile 6 Classic / Emu42PPC v1.10

ROM REV C: Max 17s    Auth 5m 26s

ARM PXA270/624MHz / Win Mobile 5.0 / Emu42PPC v1.09

ROM REV C: Max 17s    Auth 5m 26s

ARM PXA270/624MHz / Win Mobile 2003 SE / Emu42PPC v1.02beta5

ROM REV C: Max 17s    Auth 5m 24s

ARM PXA270/624MHz / Win Mobile 2003 SE / Emu42PPC v1.01

ROM REV C: Max 19s    Auth 5m 24s

ARM PXA270/520MHz / Win Mobile 5.0 / Emu42PPC v1.07beta1

ROM REV C: Max 20s    Auth 40s *1

ARM PXA270/520MHz / Win Mobile 2003 SE / Emu42PPC v1.01

ROM REV C: Max 23s    Auth 5m 24s

ARM PXA255/400MHz / Win Mobile 2003 / Emu42PPC v0.20

ROM REV C: Max 30s    Auth ?m ??s

ARM MSM7200/400MHz / Win Mobile 6 Professional / Emu42PPC v1.09

ROM REV C: Max 30s    Auth 5m 26s

ARM PXA270/312MHz / Win Mobile 2003 SE / Emu42PPC v1.02

ROM REV C: Max 29s    Auth 5m 20s

ARM S3C2410/266MHz / Win Mobile 2003 / Emu42PPC v1.02beta5

ROM REV C: Max 32s    Auth ?m ??s

ARM S3C2410/266MHz / Win Mobile 2003 / Emu42PPC v1.01

ROM REV C: Max 34s    Auth ?m ??s

ARM PXA270/208MHz / Win Mobile 2003 SE / Emu42PPC v1.01

ROM REV C: Max 1m 02s Auth 5m 24s

ARM OMAP850/195MHz / Win Mobile 5.0 / Emu42PPC v1.05beta1

ROM REV C: Max 1m 19s Auth 2m 42s

ARM SA1110/206MHz / Pocket PC 2002 / Emu42PPC v1.05beta1

ROM REV C: Max 1m 34s Auth 5m 26s

ARM SA1110/206MHz / Pocket PC 2000 / Emu42PPC v1.09

ROM REV C: Max 1m 18s Auth 5m 26s


*1 the high performance counter runs at only 1000Hz; this causes trouble in
   connection with timer2 related routines and the "Authentic Speed" setting


Environment

Emu42 v0.98-4/5 and Emu42PPC v0.20-1.09 use the same engine
Since Emu42 v1.12 and Emu42PPC v1.11 Sacajawea hardware support is included,
so implementation got some Lewis/Sacajawea hardware specific switches.

Speed setting in Emu42.ini / registry:

LewisCycles=64

PRM? is the only program in memory. If there are more programs in
memory the position of PRM? has direct influence on the execution
time.

On the Pocket PC / Win Mobile Emu42PPC was the only visible process
running. Tools like Wisbar or running ActiveSync slow down emulation
speed. The Max values on Emu42PPC differs from run to run in a wide
range, the measured values were the fastest ever measured.

Emu42 v1.09beta1 03/05/07
Emu42 v0.98-5    11/12/03
Emu42 v0.98-4    10/30/03

Compiler:
Microsoft Visual C++ 6.0 SP1 <- Emu42 v1.12
Microsoft Visual C++ 6.0 SP5 -> Emu42 v1.13

Settings:
/nologo /Gr /MT /W3 /GX /O2 /Ob2 /D "NDEBUG" /D "WIN32" /D "_WINDOWS"
/D "STRICT" /Fp".\Release/EMU32.pch" /Yu"pch.h" /Fo".\Release/" /Fd".\Release/"
/FD /c

Emu42PPC v0.20 06/09/04
Emu42PPC v1.01

Compiler:
eMbedded Visual C++ 3.0 Edition 2002

Settings:
/nologo /W3 /O2 /Ob0 /D _WIN32_WCE=$(CEVersion) /D "$(CePlatform)" /D "ARM"
/D "_ARM_" /D UNDER_CE=$(CEVersion) /D "UNICODE" /D "_UNICODE" /D "NDEBUG"
/Fp"ARMRel/EMU42.pch" /Yu"pch.h" /Fo"ARMRel/" /Oxs /M$(CECrtMT) /c

Emu42PPC v1.02beta5 01/17/05
Emu42PPC v1.05beta1 01/23/06
Emu42PPC v1.07beta1 07/10/06

Compiler:
eMbedded Visual C++ 3.0 Edition 2002

Settings:
/nologo /W3 /O2 /Ob2 /D _WIN32_WCE=$(CEVersion) /D "$(CePlatform)" /D "ARM"
/D "_ARM_" /D UNDER_CE=$(CEVersion) /D "UNICODE" /D "_UNICODE" /D "NDEBUG"
/Yu"pch.h" /Oxs /M$(CECrtMT) /c


Thanks to Erik Ehrling for contributing the real calculator, P100 and P200
benchmark values.

10/20/11 (c) by Christoph Gießelink, c dot giesselink at gmx dot de
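The "170 to 340 times faster" figure quoted before the list can be checked against the table. A minimal sketch, taking the real HP-42S at standard clock (5m 48s) as the reference and the fastest "Max" results (1s and 2s); the function names are invented for this example:

```python
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


# Real HP-42S, ROM REV A, standard 1MHz clock, from the list above.
REAL_42S_STD = to_seconds(5, 48)


def speedup(max_seconds: int, reference: int = REAL_42S_STD) -> float:
    """Speed factor of an emulated 'Max' run relative to the real calculator."""
    return reference / max_seconds


print(speedup(1))  # fastest entry in the list (Max 1s)
print(speedup(2))  # most desktop entries (Max 2s)
```

This gives 348x and 174x, in line with the rounded 170 to 340 range. With timings of only 1 or 2 seconds measured to the second, the factors are of course rough.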

Making repeatable benchmarks on the HP42S is not as easy as it sounds. First of all, FOCAL code can only search for global labels located before its current position; if the search reaches the top of memory and the label has not been found, the search continues at the .END. position. So if you have many programs with many global labels behind your program, this will slow down program execution. One more detail: there was a bug in the RAW file object loader until Emu42 v1.22. The FOCAL program object loader also allows importing HP41 FOCAL programs saved by the V41 emulator. Because of some internal differences in NULL byte handling, NULL bytes are removed (packing) or added (behind numbers), so the distance between labels changes. The distance for global labels was fixed during the import, the distance for local labels was not. This caused execution errors with HP41 programs. Therefore the import now clears the distance information in all local label jump and execute FOCAL opcodes. Don't worry, the HP42 restores these offsets at the first program run, and because of this the first run of a FOCAL program directly after importing is slower than the following runs.

But now to a further difference between emulation and simulation. The Emu42 emulator has to handle some speed-related issues when running the code of an original ROM. Just remember, the authors of the code never thought about running it 100-400 times faster. So some things are simply done by executing code in a loop, to create a delay or the frequency of the beeper. On the HP48 the backarrow key has an autorepeat function: pressing and holding the backarrow key slowly removes character after character from the command line. When you have a machine which runs 100 times faster and you do the same thing, the input line is empty immediately. This is what happened with Emu48 running an HP48. But back to Emu42. It took me years to implement the Redeye sending and the beeper emulation. In both cases the timing is done by the CPU executing opcodes. Moreover, the CPU strobe frequency is not very accurate, so it is not directly usable for sending the Redeye printer protocol. So the ROM code performs a speed calibration of the CPU before printing.

For this, a loop with a known number of CPU cycles is executed within a time frame given by a timer referenced to the 32768Hz crystal, and the number of loop passes is counted. The bad thing is that the register width of the loop counter is too small for a 150 times faster CPU execution, so the count register overflowed many, many times and the result of the speed measurement was rubbish, affecting the emulation of the Redeye frame transmitter and the beep generation.
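The overflow effect can be illustrated with a toy model. The counter width (8 bits) and the loop count at original speed (100) are purely hypothetical values chosen for the example; the post does not state the real ones. Only the 150x factor is taken from the text.

```python
def measured_loop_count(true_count: int, counter_bits: int) -> int:
    """Value left in a counter of the given width after true_count increments."""
    return true_count % (1 << counter_bits)


COUNTER_BITS = 8      # hypothetical register width
REAL_COUNT = 100      # hypothetical loop passes per time frame at original speed
SPEEDUP = 150         # emulated CPU running 150 times faster

at_real_speed = measured_loop_count(REAL_COUNT, COUNTER_BITS)
at_emulated_speed = measured_loop_count(REAL_COUNT * SPEEDUP, COUNTER_BITS)
wraps = (REAL_COUNT * SPEEDUP) >> COUNTER_BITS

print("original speed :", at_real_speed)
print("150x speed     :", at_emulated_speed, f"(after {wraps} wrap-arounds)")
```

In this toy model the calibration reads 152 instead of 15000, so any timing derived from it is meaningless, which matches the "rubbish" result described above.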

I think this is an important difference between emulation and simulation.

At the last Allschwil meeting I talked about a speed update for my HP-92198 simulation. I printed a large HP71 BASIC program with Emu71, which took 180s. I don't know how fast the original hardware (HP71 and HP-92198 video output) is, but it's slower. Now I cheated: the current HP-92198 simulation does the same thing in 8s. Why do I say cheated? The program is compiled with the same C++ compiler and runs on the same machine. The difference is the display update. The prior version updated the display content after every new character, the current version only every 30ms. So I modified the test conditions; the result for the user is the same, but it's a huge difference for the CPU.
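As a rough illustration of why the update policy matters so much, the sketch below counts redraws for a stream of characters under both policies. The arrival pattern (1000 characters, 1ms apart) is invented for the example and this is not the actual HP-92198 simulation code; only the 30ms interval comes from the post.

```python
def count_redraws(char_times_ms, min_interval_ms=None):
    """Count display redraws for characters arriving at the given times (ms).

    min_interval_ms=None redraws after every character; otherwise a redraw
    happens only if at least min_interval_ms passed since the last one.
    A real implementation would also need a final flush for pending output.
    """
    if min_interval_ms is None:
        return len(char_times_ms)
    redraws = 0
    last_redraw = None
    for t in char_times_ms:
        if last_redraw is None or t - last_redraw >= min_interval_ms:
            redraws += 1
            last_redraw = t
    return redraws


# Hypothetical workload: 1000 characters arriving 1 ms apart.
arrivals = list(range(1000))
print("redraw per character:", count_redraws(arrivals))
print("redraw every 30 ms  :", count_redraws(arrivals, 30))
```

With this made-up workload the number of redraws drops from 1000 to 34, while the user sees the same final screen.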

So when we speak about calculations, we are discussing the numerical results of a mathematical problem. When I change the numerical algorithm for some reason, it's the same cheating as with the display output: I changed the test conditions.

So I think it's quite hard to compare Emu42 with Free42. Use the one which is more suitable for your problem.

BTW how fast is Free42 compared to Wolfram Mathematica solving the same problem?
(06-05-2020 03:45 PM)J-F Garnier Wrote: [ -> ]But Free42 is written in C, not assembly language, so the comparison is fair.
On the other hand, in this particular benchmark, we can expect a significant part of the time is spent in arithmetic operations that are in asm in the HP-42S, not sure in the Intel 35-digit library.

The Intel Decimal Floating-Point Math Library is written in C, and works using 64-bit integer operations under the hood. I don't think there's any assembly language in there but I haven't really looked for it... but when building for ARM, there definitely isn't since that is not a supported target at all, and I had to do a bit of hacking to even get it to build for Android and iOS.

When built for 32-bit targets, the 64-bit integer operations are compiled into multiple 32-bit operations, and that might benefit from coding it in assembly instead to eliminate duplicate operations (e.g. a 64-bit integer multiplication requires four 32-bit multiplications in general, but the special case of squaring a 64-bit number can be done using only three 32-bit multiplications), but this was not done. Free42 for Windows is a 32-bit app, and the Android and iOS versions contain 32-bit and 64-bit code, running the 32-bit code on 32-bit platforms and running 64-bit code otherwise.
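The four-versus-three multiplication count can be seen by splitting each 64-bit operand into 32-bit halves. The sketch below uses Python integers just to show the decomposition (it is not taken from the Intel library, and the function names are invented); each `*` stands for one 32x32-bit multiplication.

```python
MASK32 = 0xFFFFFFFF


def split(x: int):
    """Split a 64-bit value into its (high, low) 32-bit halves."""
    return x >> 32, x & MASK32


def mul64(a: int, b: int) -> int:
    """Full 128-bit product of two 64-bit values: four 32x32 multiplications."""
    ah, al = split(a)
    bh, bl = split(b)
    hh = ah * bh
    hl = ah * bl
    lh = al * bh
    ll = al * bl
    return (hh << 64) + ((hl + lh) << 32) + ll


def square64(a: int) -> int:
    """Full 128-bit square of a 64-bit value: three 32x32 multiplications."""
    ah, al = split(a)
    hh = ah * ah
    hl = ah * al          # the two cross terms are equal, so compute one...
    ll = al * al
    return (hh << 64) + (hl << 33) + ll   # ...and double it with the shift


a = 0xFFFFFFFFFFFFFFFF
b = 0x123456789ABCDEF0
print(mul64(a, b) == a * b, square64(a) == a * a)
```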
(06-05-2020 07:27 PM)Christoph Giesselink Wrote: [ -> ]So what is important for me? The speed difference between original calculator and emulation. So I think a 170 to 340 times faster emulation is fast enough. This is a benchmark list from 2011 comparing the emulation on different host systems comparing to the real machine:
Code:

Emu42 benchmark results using Erik Ehrling's
"Miller-Rabin Primality Test for the HP-42S"

Prime number: 999,999,999,961
...

I agree with you, the improved speed vs the original is an important benefit, but not the only goal of an emulator.
Thanks for the benchmark, I remember it now.

Quote:So I think it's quite hard to compare Emu42 with Free42. Use the one which is more suitable for your problem.

Well, not so hard to compare, since I did it :-) but yes I understand what you mean. My goal was just to have a clearer view of the performance difference. We can't expect an emulator to compete with native applications.

I'm using both Emu42 and Free42, for different usages, as I'm using your Emu71 for Windows in some cases (very precise emulation) and my Emu71/DOS for others (integrated HP-IL environment)

J-F
(06-06-2020 04:37 AM)Thomas Okken Wrote: [ -> ]When built for 32-bit targets, the 64-bit integer operations are compiled into multiple 32-bit operations, and that might benefit from coding it in assembly instead to eliminate duplicate operations (e.g. a 64-bit integer multiplication requires four 32-bit multiplications in general, but the special case of squaring a 64-bit number can be done using only three 32-bit multiplications), but this was not done. Free42 for Windows is a 32-bit app, and the Android and iOS versions contain 32-bit and 64-bit code, running the 32-bit code on 32-bit platforms and running 64-bit code otherwise.

This may explain the observation from Valentin:

(06-05-2020 04:24 PM)Valentin Albillo Wrote: [ -> ]Also, your Free42 BCD is running only ~75% faster on your hardware than mine on my mid-range Samsung tablet. Frankly, I would expect it to run much faster in your system (say, at least 4x).

Also I'm using a modest 1.8GHz Core i3 machine; more powerful machines may give significantly better performance.

J-F
(06-05-2020 03:26 PM)David Hayden Wrote: [ -> ][snip]
The Saturn CPU is pretty different from an x86.

I would put it as *extremely different* :-) The only real similarity is that the x86 and Saturn families of processors both use a "dest = dest op source" machine code format for many of their arithmetic-logic operations, although the Saturn has many "dest = source op dest" machine instructions as well :-)

Quote:Emulating it might be harder than a more traditional one. Things like register fields might be hard.

If you're writing the emulator in a language like C, then dealing with register fields means dealing with a lot of bit-masking and bit-shifts :-) (except maybe for the Emu48 family of emulators, which seem to perform decimal operations (and apparently HEX mode operations as well) in a nibble-serial manner, if I'm not mistaken. This might explain why some of the emulators are so slow at emulating Saturn machine code arithmetic and comparison instructions)

Quote:And if the Saturn BCD arithmetic doesn't translate well to x86 then that could be a problem.

The BCD instructions on modern x86 CPUs are legacy instructions that have been relegated to the CPU's microcode ROM, and are therefore very slow.

The 32-bit x86 CPUs only have instructions that operate on two packed BCD digits at a time. For example :

Code:
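MOV AL, 42h    ; AL = packed BCD 42
MOV CL, 33h    ; CL = packed BCD 33
ADD AL, CL     ; binary add: AL = 75h
DAA            ; decimal adjust: AL stays 75h (packed BCD 75)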

which adds packed BCD 42 to 33, or :

Code:
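MOV AL, 42h    ; AL = packed BCD 42
MOV CL, 33h    ; CL = packed BCD 33
SUB AL, CL     ; binary subtract: AL = 0Fh
DAS            ; decimal adjust: AL = 09h (packed BCD 09)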

which subtracts packed BCD 33 from 42 .

The "DAA" and "DAS" instructions stand for "Decimal Adjust for Addition" and "Decimal Adjust for Subtraction" respectively.
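To show what the adjustment step actually does, here is a small Python model of an 8-bit ADD followed by DAA. It follows the behaviour documented for the instruction as far as AL, the carry flag and the auxiliary carry flag are concerned; the other flags are left out and the function names are invented for this sketch.

```python
def add_al(al: int, value: int):
    """8-bit binary ADD; returns (result, carry_flag, aux_carry_flag)."""
    total = al + value
    aux_carry = ((al & 0x0F) + (value & 0x0F)) > 0x0F   # carry out of low nibble
    return total & 0xFF, total > 0xFF, aux_carry


def daa(al: int, cf: bool, af: bool):
    """Decimal adjust AL after addition; returns (al, carry_flag)."""
    old_al, old_cf = al, cf
    if (al & 0x0F) > 9 or af:
        al = (al + 0x06) & 0xFF      # fix the low digit
    if old_al > 0x99 or old_cf:
        al = (al + 0x60) & 0xFF      # fix the high digit, decimal carry out
        cf = True
    else:
        cf = False
    return al, cf


def bcd_add8(a: int, b: int):
    """Add two packed BCD bytes the way ADD + DAA would."""
    return daa(*add_al(a, b))


result, carry = bcd_add8(0x42, 0x33)
print(f"{result:02X}h carry={carry}")
result, carry = bcd_add8(0x99, 0x01)
print(f"{result:02X}h carry={carry}")
```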

In AMD64 / x64 "long mode", the above DAA and DAS instructions are not available: their opcodes are invalid in 64-bit mode.

In C, especially on a 64-bit x64 machine, it's faster to use a little bit-twiddling and arithmetic trickery to perform packed BCD arithmetic. (A quick Google search turns up many solutions for addition and subtraction of packed BCD integers in C using only bitwise operations and arithmetic, e.g. see this Wikipedia solution in C which uses just ten bitwise and arithmetic operations (no multiplication or division). Also, I do have a vague memory of a bitwise solution that used only *six* operations, but I can't remember where or when I saw it and I can't seem to reproduce it "de novo" either :/ . On many ARM processors, even the above ten operations can be reduced to just *eight* operations because of many ARM processors' "0-cycle" barrel-shifter, which one gets for free with most ARM arithmetic or bitwise instructions :-) )

( EDIT #1 : The above statements about ARM-based processors' "free" or "0-cycle" barrel-shifter may not be entirely accurate, as I haven't done much ARM assembly language in a long while. I believe that the previously "free" shift operation may incur some performance penalties on more modern ARM processors, although I'm no expert on ARM assembly, so I'm not sure )
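For illustration, here is one well-known ten-operation formulation of packed BCD addition, written in Python rather than C. Whether it is exactly the solution linked above is an assumption. It handles operands of up to 7 digits in a 32-bit word, the top nibble being left free for the carry; `to_bcd` is just a helper for the example.

```python
def bcd_add(a: int, b: int) -> int:
    """Add two packed BCD values of up to 7 digits each (32-bit word)."""
    t1 = a + 0x06666666          # pre-add 6 to every digit of a
    t2 = t1 ^ b                  # sum without carry propagation
    t1 = t1 + b                  # provisional binary sum
    t2 = t1 ^ t2                 # all binary carry bits
    t2 = ~t2 & 0x11111110        # marks digits that did NOT carry out
    t2 = (t2 >> 2) | (t2 >> 3)   # a 6 in each of those digits
    return t1 - t2               # remove the pre-added 6 where it was not needed


def to_bcd(n: int) -> int:
    """Helper for the example: decimal integer -> packed BCD (1234 -> 0x1234)."""
    return int(str(n), 16)


print(hex(bcd_add(0x42, 0x33)))
print(hex(bcd_add(to_bcd(9999999), to_bcd(1))))
```

The ten operations are the two additions, two XORs, the NOT, the AND, two shifts, the OR and the final subtraction.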

Quote:Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language.

Indeed :-) RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time :-)

Quote:As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

That sounds about right :-)

Regards,

Jonathan

P.S. The formatting seems to be messed up in the preview of this post with long lines running off-screen which generates a scroll bar.
(06-05-2020 04:56 PM)Werner Wrote: [ -> ]
(06-05-2020 03:45 PM)J-F Garnier Wrote: [ -> ]Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
.. but Free42 is compiled, and the 42 SysRPL is interpreted.
Werner

SysRPL is not exactly interpreted. An interpreter such as HP Basic uses tokens and relies on tables to get the execution address (on HP Basic, it's quite complex and relatively slow with all the possible LEXs to scan).
In SysRPL, the "tokens" are the execution addresses themselves. The right term is probably "threaded code" as for the Forth language. But I'm not a RPL expert :-)
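The distinction can be sketched with a toy stack machine. This is only a Python analogy with invented names, not how HP BASIC or SysRPL are actually implemented: in the token version the program holds small codes that must be looked up in a table at every step, while in the threaded version the program holds the references to the routines themselves.

```python
stack = []


def push_two():
    stack.append(2)


def push_three():
    stack.append(3)


def add():
    stack.append(stack.pop() + stack.pop())


# Token interpreter: each step needs a table lookup to find the routine.
TOKEN_TABLE = {0x01: push_two, 0x02: push_three, 0x03: add}


def run_tokens(program):
    stack.clear()
    for token in program:
        TOKEN_TABLE[token]()
    return stack[-1]


# Threaded code: the "tokens" are the routine references themselves.
def run_threaded(program):
    stack.clear()
    for word in program:
        word()
    return stack[-1]


print(run_tokens([0x01, 0x02, 0x03]))
print(run_threaded([push_two, push_three, add]))
```

Both runs compute 2 + 3; the threaded version simply skips the lookup step.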

(06-09-2020 08:49 PM)Jonathan Busby Wrote: [ -> ]
Quote:Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language.
Indeed :-) RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time :-)

Quote:As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.
That sounds about right :-)

What is the penalty of SysRPL compared to assembly language? I tried to answer by porting Valentin's program to the HP-32SII, emulated in Christoph's Emu42. So the comparison is done at constant CPU speed.

Here is the result, for 10,000 points as in the test above:
HP42S emulated on Emu42 1.24 : 5min08s (as above)
HP32SII emulated on Emu42 1.24 : 2min01s !

So although the 32SII RPN language is not as powerful for complex numbers as that of the 42S, the 32SII is 2.5x more efficient than the HP42S. I didn't expect so much difference, it's a surprise for me.
The difference is not as large on the real machines, since the CPU speed is probably lower on the 32S. The 42S is supposed to run at 1MHz; does somebody know the speed of the 32SII CPU?

For the curious, here is the HP-32S program:
Code:
; Mandelbrot area, for the HP-32S/SII
; based on Valentin Albillo's program "AM" for the HP-42S
;Variables:
;I : loop index
;J : loop index
;M : count
;N : #points
;K : #iterations
;C, D : random point under test
;F : constant = 0.25^2
;A : scratch
; labels used: A, B, C, D, E.


A01 LBL A
; initialization and input
A02 STO N       ;#points
A03 STO I
A04 0.25
A05 x^2
A06 STO F
A07 1
A08 SEED
A09 256        ; #iterations
A10 STO K
A11 CLx
A12 STO M
; main loop
B01 LBL B
B02 RCL K      ; # iterations
B03 STO J      ; in J loop index
B04 RANDOM
B05 2.5
B06 *
B07 2
B08 -
B09 STO C
B10 RANDOM
B11 1.2
B12 *
B13 STO D
; belongs to cardioid?
B14 x^2
B15 x<>y 
B16 x^2
B17 +
B18 SQRT        ; abs(z)
B19 STO A
B20 RCL D
B21 RCL/ A
B22 RCL C
B23 RCL/ A      ; z/abs(z)
B24 0
B25 2
B26 CMPLX-
B27 CMPLX*
B28 x^2
B29 x<>y
B30 x^2
B31 +              
B32 SQRT 
B33 4
B34 /
B35 RCL A
B36 x<y?
B37 GTO D
; belongs to main disk?
B38 1 
B39 RCL+ C
B40 x^2
B41 RCL D
B42 x^2
B43 +
B44 RCL F
B45 x>y?
B46 GTO D
; belongs elsewhere in M?
B47 RCL D
B48 RCL C
C01 LBL C
C02 0
C03 ENTER 
C04 CMPLX+
C05 CMPLX*          ; Z^2
C06 RCL D
C07 RCL C
C08 CMPLX+          ; Z=Z^2+C
C09 0
C10 ENTER
C11 CMPLX+
C12 x^2
C13 x<>y
C14 x^2
C15 +  ; |Z|²
C16 4
C17 x<=y?
C18 GTO E
C19 Rdn
C20 Rdn
C21 DSE J
C22 GTO C     ; next iter    
D01 LBL D
D02 ISG M     ; count++
E01 LBL E
E02 DSE I
E03 GTO B     ; next point
E04 RCL M     ; recall count
E05 RTN

J-F
(06-10-2020 07:57 AM)J-F Garnier Wrote: [ -> ]The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640kHz.

(Edited to fix typo on the 32SII frequency)
(06-10-2020 12:16 PM)Didier Lachieze Wrote: [ -> ]
(06-10-2020 07:57 AM)J-F Garnier Wrote: [ -> ]The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640kHz.

These speed figures on Wikipedia were contributed by me many years ago. But I'm no longer sure about the values for the 1LU7 Bert and 1LR3 Sacajawea chips. First, the PCBs of the low-end Pioneers contain no crystal. The external component on a 1LU7 PCB is a capacitor, and the external components on a 1LR3 PCB are a capacitor and a resistor.

But how did I get the "Authentic Speed" value for Emu42 emulating an HP32SII?

In this case I went a more practical way.

I keyed Katie Wasserman's 99 Digits of PI on an HP 32SII into a real HP32SII and measured the execution time. It took around 11 minutes, as mentioned in the article. With this measured time I adjusted the Sacajawea CPU cycles reference inside Emu42 to a value of 54, so that the real and the emulated calculator need more or less the same time to execute this program.

What does this number 54 mean? Execute 54 CPU cycles in a 16384 Hz time frame, so 54 * 16384 Hz = 884736 Hz.

BTW, the CPU cycles reference in Emu42 for a 1MHz Lewis CPU is 61 (61 * 16384 Hz = 999424 Hz).
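The conversion from the cycles reference to a strobe frequency is a single multiplication; a short sketch (the function name is made up for this example):

```python
TIME_FRAME_HZ = 16384  # Emu42 executes the given number of cycles per 1/16384 s


def strobe_frequency_hz(cycles_per_frame: int) -> int:
    """CPU strobe frequency implied by an Emu42 cycles reference value."""
    return cycles_per_frame * TIME_FRAME_HZ


for name, cycles in (("Sacajawea (HP32SII)", 54), ("Lewis (HP42S)", 61)):
    print(f"{name:20s} {cycles} cycles -> {strobe_frequency_hz(cycles)} Hz")
```

By the same formula, the LewisCycles=64 setting listed in the 2011 benchmark notes would correspond to 1048576 Hz.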

Can I prove this calculated frequency? That depends on the type of memory the assembler code is running from. For all memory devices connected directly to the Saturn bus, the CPU cycles inside Emu42 are correct. But when the memory device is accessed through a Saturn-bus-to-8-bit converter, used to access regular static RAM or ROM devices with an 8-bit data bus interface, the CPU cycles inside Emu42 are wrong. In that case the same opcode needs more cycles for a memory access and, even more, the same opcode may take a different number of cycles depending on whether it is executed at an even or an odd address.

But what does this mean in the case of the HP32SII? Both RAM and ROM are internal devices directly connected to the Saturn bus, so the CPU cycles used should be correct. As a result, I now assume a CPU strobe frequency of about 884 kHz for the HP32SII.
(06-10-2020 07:57 AM)J-F Garnier Wrote: [ -> ][snip]

SysRPL is not exactly interpreted. An interpreter such as HP Basic uses tokens and relies on tables to get the execution address (on HP Basic, it's quite complex and relatively slow with all the possible LEXs to scan).
In SysRPL, the "tokens" are the execution addresses themselves. The right term is probably "threaded code" as for the Forth language. But I'm not a RPL expert :-)

You're actually completely correct :-) See this article. Technically RPL is a "TIL" or "Threaded Interpreted Language" :-)

(06-09-2020 08:49 PM)Jonathan Busby Wrote: [ -> ]Indeed Smile RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time Smile

Actually I misspoke : Although there are usually *three* levels of indirection with most simple RPL objects that eg. just push themselves to the stack, the third level of indirection is just the act of re-entering the RPL inner loop.

Quote:What is the penalty of SysRPL compared to assembly language?

Well, the "RPL inner loop" as implemented on Saturn based HP calculators, follows the following control flow, assuming the current object being executed in the runstream is a pointer to an embedded BINT object ( to demonstrate the various levels of indirection ) :

Code:
A=DAT0        A
D0=D0+        5
PC=(A)        *First level of indirection

Code:
=PRLG   LC(2)   10
        A=A-C   B       
        PC=(A)  *Second level of indirection : PC now points to BINT direct execution code dirbint

Code:
dirbint D0=D0-  5
        AD0EX
        D0=A
        D0=D0+  10
        D=D-1   A
        GOC     OutOfMemory
        D1=D1-  5
        DAT1=A  A
        A=DAT0  A
        D0=D0+  5
        PC=(A) *This is technically a "third" level of indirection, but it's really just the next object/pointer in the runstream being executed

So, I initially misspoke, as there are technically only *two* levels of indirection in most Sys-RPL words.
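To make the two levels of indirection easier to see, here is a toy model of a threaded-code inner loop in Python. It is only a sketch: the dictionary `memory`, the addresses, the `None` end marker and the handler `push_bint` are all hypothetical, and nothing here reflects real Saturn opcodes, object layouts or timings. It just counts how many pointer dereferences happen per object before any useful work is done.

```python
# Toy model of a threaded-code inner loop (NOT real Saturn behaviour).
# Addresses are plain integers and "memory" is a dict; all names are hypothetical.

def push_bint(memory, obj_addr, stack):
    """Hypothetical handler: push the value stored right after the prolog cell."""
    stack.append(memory[obj_addr + 1])

def run(memory, ip):
    """Execute a runstream starting at ip; return the stack and the dereference count."""
    stack = []
    derefs = 0
    while memory[ip] is not None:        # None marks the end of this toy runstream
        obj_addr = memory[ip]            # fetch next object pointer (compare A=DAT0 A)
        ip += 1                          # advance the runstream pointer (compare D0=D0+ 5)
        prolog_addr = memory[obj_addr]   # first indirection: object -> prolog address
        derefs += 1
        handler = memory[prolog_addr]    # second indirection: prolog -> executable code
        derefs += 1
        handler(memory, obj_addr, stack) # the handler does the actual work
    return stack, derefs

PROLOG_BINT = 900                        # made-up address of the toy prolog
memory = {
    900: push_bint,                      # prolog cell holding the handler
    100: PROLOG_BINT, 101: 5,            # object at 100: prolog pointer + value 5
    110: PROLOG_BINT, 111: 7,            # object at 110: prolog pointer + value 7
    0: 100, 1: 110, 2: None,             # runstream: two object pointers, then end
}

stack, derefs = run(memory, 0)
print(stack, derefs)
```

Each object costs two dereferences in this model before its handler runs, which is the overhead a plain assembly routine would not pay.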

As for the Sys-RPL performance penalty: direct object execution involves *two* "PC=(A)" instructions. Each one means that a 5-nibble address has to be read from memory, and memory accesses on the Saturn CPU are notoriously slow. It also involves a 5-nibble absolute jump, as the PC is set to the address just read from memory, and such absolute jumps on the Saturn CPU are slow, although not as slow as memory accesses. On top of that, one must take into account all the other instructions that are executed when an RPL object is directly executed in the runstream, which also adds a lot of overhead.

( EDIT : The reason memory accesses are so slow on the Saturn CPU is not the CPU itself per se, but the Saturn bus. In the original discrete HP71B Saturn chip, I believe that the Saturn bus interface was integrated onto the chip. On later Saturn-based SoCs like the Yorke, the Saturn bus ran at half the speed of the Saturn CPU itself, which slowed down memory accesses by about 2x. This, though, is only one aspect of the Saturn bus that contributes to the slowdown.

For an instruction like "PC=A", the Saturn CPU drives a "LOAD PC" command onto the Saturn bus, followed by a 5-cycle operation in which the CPU transfers the 5 nibbles of the new PC address, which the memory controllers load into their local PCs. There is then a command auto-switch to "PC READ" and a "dummy strobe" on the Saturn bus for memory pipelining.

For an instruction like "A=DAT0 W", the CPU first issues a "LOAD DP" command onto the Saturn bus and then performs a 5-cycle operation in which it successively drives 5 address nibbles onto the bus, which are latched by the memory controllers. There is then a command auto-switch to "DP READ", another 1-cycle "dummy strobe", and then the CPU reads 16 nibbles from the Saturn bus.

So, for the "PC=(A)" instruction, you have 1 cycle for the "LOAD DP" command, 1 cycle for the dummy strobe, 5 cycles to read the data, another 1 cycle for the "LOAD PC" command, 5 cycles for the CPU to transfer the new PC address to the memory controllers and then, finally, a 1-cycle dummy strobe, for a total of 14 cycles, not including instruction decode time. For the "A=DAT0 A" instruction, you have the initial 1-cycle "LOAD DP" bus command, 5 cycles to drive the 5 nibbles of the address, a 1-cycle dummy strobe and then 5 cycles for reading 5 data nibbles, for a total of 12 cycles, not counting instruction decode time.

If this is on the Yorke SoC, then the total cycle count associated with the Saturn bus is around 24 cycles, as the Saturn bus on the Yorke SoC only runs at 2MHz. )
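The bus-cycle totals quoted in this EDIT can be tallied with a few lines of Python. This is just a sketch of the arithmetic: the step lists copy the counts given in the post (instruction decode time excluded), and the factor of 2 for the Yorke is the "bus at half the CPU speed" figure mentioned above, applied here as a simple multiplier, which is an assumption.

```python
# Bus cycle counts per step, copied from the post (decode time not included).
PC_INDIRECT = [                      # "PC=(A)"
    ("LOAD DP command", 1),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
    ("LOAD PC command", 1),
    ("transfer 5 PC address nibbles", 5),
    ("dummy strobe", 1),
]
DAT0_READ_A = [                      # "A=DAT0 A"
    ("LOAD DP command", 1),
    ("drive 5 address nibbles", 5),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
]

# Assumption: a bus running at half the CPU clock simply doubles the cost in CPU cycles.
YORKE_BUS_FACTOR = 2

def total(steps):
    """Sum the cycle counts of a list of (description, cycles) pairs."""
    return sum(cycles for _, cycles in steps)

for name, steps in (("PC=(A)", PC_INDIRECT), ("A=DAT0 A", DAT0_READ_A)):
    bus = total(steps)
    print(f"{name}: {bus} bus cycles, about {bus * YORKE_BUS_FACTOR} CPU cycles on the Yorke")
```

With the numbers from the post, "A=DAT0 A" comes out at 12 bus cycles, which matches the "around 24 cycles" figure once doubled.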

Quote:I tried to answer by porting Valentin's program to the HP-32SII, emulated in Christoph's Emu42. So the comparison is done at constant CPU speed.

Here is the result, for 10,000 points as above test:
HP42S emulated on Emu42 1.24 : 5min08s (as above)
HP32SII emulated on Emu42 1.24 : 2min01s !

So although the 32SII RPN language is not as powerful for complex numbers as that of the 42S, the 32SII is 2.5x more efficient than the HP42S. I didn't expect so much difference, it's a surprise for me.
The difference is not as large on the real machines, since the CPU speed is probably reduced on the 32S. The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

AFAIK, the 32SII Saturn runs at about 640KHz.
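As a rough check of what the emulated timings would mean on real hardware, the sketch below rescales the measured ratio by the CPU clock. It assumes that run time scales inversely with clock frequency and that Emu42 runs both models at the same emulated speed; neither is verified here. The 640 kHz value is the one given in this reply, and ~884 kHz is the strobe frequency estimated earlier in the thread.

```python
def seconds(minutes, secs):
    """Convert a min/s timing to seconds."""
    return 60 * minutes + secs

T_42S = seconds(5, 8)      # HP42S on Emu42 1.24
T_32SII = seconds(2, 1)    # HP32SII on Emu42 1.24
CLOCK_42S_KHZ = 1000       # 42S is supposed to run at 1 MHz

emulated_ratio = T_42S / T_32SII   # how much faster the 32SII is at equal CPU speed

def real_ratio(clock_32sii_khz):
    """Estimated 32SII advantage on real machines (assumes time ~ 1/clock)."""
    return emulated_ratio * clock_32sii_khz / CLOCK_42S_KHZ

print(f"emulated: {emulated_ratio:.2f}x")
for khz in (640, 884):
    print(f"real machines, 32SII at {khz} kHz: {real_ratio(khz):.2f}x")
```

Under these assumptions the 32SII would still be ahead on real hardware, but by a smaller margin than in the emulator.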

Regards,

Jonathan
Performance measurements with Emu71/Win up to v1.11

J-F Garnier ran some performance tests with Emu71/DOS and Emu71/Win and found a severe performance problem in Emu71/Win up to v1.11 (the current version) in connection with the BASIC GOTO command. Emu71/DOS is not affected.

A GOTO loop that you would expect to take ~0.08s at full speed takes ~9s, whereas the same loop at "Authentic Calculator Speed" takes ~20s. This behavior is caused by annunciator settings combined with the display annunciator update, which takes quite a long time.

Thanks to J-F Garnier for finding this.

The problem will be fixed in a future version, so take care when doing performance measurements with Emu71/Win up to v1.11 on programs that make heavy use of GOTO loops.
(06-10-2020 12:16 PM)Didier Lachieze Wrote: [ -> ]
(06-10-2020 07:57 AM)J-F Garnier Wrote: [ -> ]The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640Hz.

This is stuff that former-HPer Eric Vogel would have known right off the top of his head back in the day. He gave a nice presentation on the various Pioneer machines at a Chicago-area CHIP-group meeting back in April of 1991 around the time of the 32SII introduction, where he put chalk to blackboard and made a nice chart of all the ROM/RAM/CPU/display variants up and down the Pioneer series. Those were fun days.
Jake
(06-10-2020 12:16 PM)Didier Lachieze Wrote: [ -> ]
(06-10-2020 07:57 AM)J-F Garnier Wrote: [ -> ]The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640Hz.

Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Regards,

Jonathan
(06-16-2020 07:12 PM)Jonathan Busby Wrote: [ -> ]Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Surely you mean kHz, as in 1000 Hz, and not KHz, as in Kelvin-Hertz?
(06-16-2020 07:23 PM)Thomas Okken Wrote: [ -> ]
(06-16-2020 07:12 PM)Jonathan Busby Wrote: [ -> ]Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Surely you mean kHz, as in 1000 Hz, and not KHz, as in Kelvin-Hertz?

HAHA!

Seems I'm not the only one susceptible to making typos.

Regards,

Jonathan
So I ran Valentin's program on Emu71 too, since the initial question was to compare an HP-71B emulator and Free42.
Here are the results, still with 10,000 points and the other default parameters:
Emu71/Win 1.11 : 1min55s
Emu71/DOS 2.45 run in VirtualBox : 1min58s (not too bad for an old 16-bit DOS program :-)
(for reference, Free42 Decimal : 6.3s )
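For a quick side-by-side view, the timings reported in this thread can be expressed as slowdown factors relative to Free42 Decimal. This is plain arithmetic on the hand-measured numbers above; the dictionary keys are just labels chosen here.

```python
FREE42_DECIMAL_S = 6.3   # reference timing from the thread

# Timings in seconds for 10,000 points, as reported in the thread.
timings_s = {
    "Emu42 1.24 (HP42S)": 5 * 60 + 8,
    "Emu42 1.24 (HP32SII)": 2 * 60 + 1,
    "Emu71/Win 1.11": 1 * 60 + 55,
    "Emu71/DOS 2.45 in VirtualBox": 1 * 60 + 58,
}

# Slowdown factor of each emulator run relative to Free42 Decimal.
slowdown = {name: t / FREE42_DECIMAL_S for name, t in timings_s.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x slower than Free42 Decimal")
```

The three non-42S runs land close to each other, while the emulated HP42S is well above them.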

My first attempt was to adapt the 42S program with the same GOTOs, and that is how I identified the abnormal slow-down in Emu71/Win with GOTOs that Christoph reported above.
Then I replaced the GOTOs with FOR..NEXT loops and other constructs (using GOTOs is considered bad programming style anyway), and the Emu71/Win performance was then similar to Emu42 emulating an HP-32SII.


Here is my HP-71B program (without GOTOs):
Code:
10 ! ------
20 ! Mandelbrot set area
30 ! from Valentin Albillo "AM" program
40 ! 
50 COMPLEX Z,C,A
60 F=.25
70 INPUT "Points?";N
80 T=TIME
90 RANDOMIZE 1
100 K=256
110 M=0
120 ! 
130 FOR I=1 TO N
140 B=0
150 C=(RND*2.5-2,RND*1.2)
160 A=SGN(C) @ IF ABS((A-2)*A)*F>ABS(C) THEN B=1
170 IF ABS(C+1)<F THEN B=1
180 IF B THEN J=K+1 ELSE J=1 @ Z=C
190 B=1
200 FOR J=J TO K
210 Z=Z*Z+C @ IF ABS(Z)>=2 THEN J=K @ B=0
220 NEXT J
230 IF B THEN M=M+1
240 NEXT I
250 DISP M;TIME-T
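
For readers without an HP-71B at hand, here is a rough Python port of the program above. It is a sketch under a few assumptions: SGN of a complex value is taken to mean the unit vector c/|c|, Python's `random` stands in for RND (so the counts will not match the HP-71B's for the same seed), and the names `in_set` and `count_inside` are made up here. The area estimate 6*M/N is also not part of the posted program; it follows from the sampled rectangle (2.5 by 1.2, upper half plane only).

```python
import random

F = 0.25   # same constant as line 60
K = 256    # iteration limit, line 100

def in_set(c, k=K):
    """Mirror lines 140-220 for one sample point c."""
    a = c / abs(c) if c != 0 else 0j   # assumed meaning of SGN(C) for a complex C
    if abs((a - 2) * a) * F > abs(c):  # line 160: quick accept near the main cardioid
        return True
    if abs(c + 1) < F:                 # line 170: disc of radius 1/4 around -1
        return True
    z = c                              # line 180
    for _ in range(k):                 # lines 200-220
        z = z * z + c
        if abs(z) >= 2:                # escaped: not counted
            return False
    return True

def count_inside(n, seed=1):
    """Mirror lines 130-240: count sampled points that did not escape."""
    rng = random.Random(seed)          # stands in for RANDOMIZE 1
    m = 0
    for _ in range(n):
        re = rng.random() * 2.5 - 2    # line 150, real part
        im = rng.random() * 1.2        # line 150, imaginary part
        if in_set(complex(re, im)):
            m += 1
    return m

if __name__ == "__main__":
    n = 10000
    m = count_inside(n)
    print(m, 6 * m / n)                # count and derived area estimate
```

The early exits in `in_set` correspond to the shortcut in line 180, where the inner loop is skipped entirely for points already known to be inside.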

The abnormal slowdown in Emu71/Win up to 1.11 occurs with GOTO/GOSUB and of course with statements directly or indirectly related to the annunciators:
SFLAG/CFLAG/RADIANS/DEGREES/USER

J-F