Emulator vs simulator performance
06-05-2020, 03:03 PM
Post: #1
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
Emulator vs simulator performance
The question is what order of magnitude of performance penalty an emulator incurs compared to a native application.

(06-05-2020 12:39 AM)Valentin Albillo Wrote:
Quote:An emulator adds the overhead of the CPU/hardware simulation; Free42 is a native application. All depends on the performance of the emulation engine, I can't really judge the x60 ratio but it's the order of magnitude we can expect

I don't concur. A factor of 60x is just too much, no emulation should be that inefficient. Say 10x would be acceptable, if slow, but 60x ? Really ? Converting a 10 seconds running time to 10 minutes ? A 10 minutes running time to 10 hours ? That would be a horribly inefficient emulation, direct-to-garbage-bin class.

Ok, so let's compare Free42 and Emu42, that have similar functionalities and are both high quality software recognized by the community, and let's compare them on the same host system.

You may say, it's not fair, Free42 provides much better accuracy with its 35-digit arithmetic, so let's also include the binary version that has similar arithmetic accuracy.

I will not use a trivial benchmark but the program from the latest article from Valentin, which computes the area of the Mandelbrot set (great article, Valentin, I may comment on it later in another thread).
The program is run to evaluate 10,000 points with the other default parameters.
The display is fully static during Valentin's program run (no display updates, no flying goose).

Emu42 and Free42 are run separately; during the run each uses about one core of my core-i3 machine (global CPU load ~25-30%).
Of course, Emu42 is run with authentic calculator speed off.

Emu42 1.24 : 5min08s
Free42 2.5.18 decimal : 6.3s
Free42 2.5.18 binary : 2.0s
Emu42 execution time measured by hand, Free42 execution measured using TIME.

J-F
06-05-2020, 03:26 PM
Post: #2
 David Hayden Senior Member Posts: 405 Joined: Dec 2013
RE: Emulator vs simulator performance
Interesting. So the 60x slowdown claim appears fairly accurate. Decimal free42 is about 49 times faster than emu42.

Eric Smith can probably speak more knowledgeably than I can, but here are some thoughts.

The Saturn CPU is pretty different from an x86. Emulating it might be harder than emulating a more traditional CPU. Things like register fields might be hard. And if the Saturn BCD arithmetic doesn't translate well to x86 then that could be a problem.

Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language. As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

Dave
06-05-2020, 03:45 PM
Post: #3
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-05-2020 03:26 PM)David Hayden Wrote:  Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language. As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
On the other hand, in this particular benchmark, we can expect a significant part of the time is spent in arithmetic operations that are in asm in the HP-42S, not sure in the Intel 35-digit library.

My own Emu71/DOS is largely written in assembly (using Intel x86 BCD support for efficiency - at the cost of minor compatibility issues) but it's a 16-bit DOS application and there is no native HP-71 simulator available for comparison.

J-F
06-05-2020, 04:24 PM
Post: #4
 Valentin Albillo Senior Member Posts: 866 Joined: Feb 2015
RE: Emulator vs simulator performance
.
Hi, J-F:

(06-05-2020 03:03 PM)J-F Garnier Wrote:  I will not use a trivial benchmark but the program from the latest article from Valentin, which computes the area of the Mandelbrot set (great article, Valentin, I may comment on it later in another thread).

Thank you very much, you're most welcome to comment to your heart's content.

Quote:The program is run to evaluate 10,000 points with the other default parameters. The display is fully static during Valentin's program run (no display updates, no flying goose).

(How do you know it's a "goose" and not a "gander" ? Oh, I remember, the gander flies backwards. At least it does in the HP-41C family, as the HP42S (and Free42) uses a little triangle instead)

Quote:Emu42 and Free42 are run separately; during the run each uses about one core of my core-i3 machine (global CPU load ~25-30%).
Of course, Emu42 is run with authentic calculator speed off.

Emu42 1.24 : 5min08s
Free42 2.5.18 decimal : 6.3s
Free42 2.5.18 binary : 2.0s
Emu42 execution time measured by hand, Free42 execution measured using TIME.

Quite a difference, indeed. Emu42 is calculating ~32 points/sec, Free42 BCD is doing ~1,587 points/sec and Free42 binary does 5,000 points/sec.

The difference between 32 p/s and 5,000 p/s is ~156x, which seems to me an unreasonable performance difference for an emulation (Emu42) vs. a simulation (Free42): that's more than two orders of magnitude faster (or slower) on the same hardware (and presumably OS).
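As a quick check of the arithmetic above, here is a minimal Python sketch that recomputes throughput and speed ratios from the three timings reported in the first post. The dictionary keys and function names are only illustrative. Note that the ~156x figure comes from using the rounded 32 points/s; the unrounded timings give about 154x.

```python
# Timings reported in the thread for 10,000 Mandelbrot points.
POINTS = 10_000
timings_s = {
    "Emu42 1.24": 5 * 60 + 8,       # 5min08s, measured by hand
    "Free42 2.5.18 decimal": 6.3,   # measured with TIME
    "Free42 2.5.18 binary": 2.0,    # measured with TIME
}

def points_per_second(seconds: float) -> float:
    """Throughput for the 10,000-point run."""
    return POINTS / seconds

def slowdown(slow: str, fast: str) -> float:
    """How many times longer `slow` takes than `fast`."""
    return timings_s[slow] / timings_s[fast]

for name, seconds in timings_s.items():
    print(f"{name:24s} {points_per_second(seconds):8.1f} points/s")

print(f"Emu42 vs Free42 decimal: {slowdown('Emu42 1.24', 'Free42 2.5.18 decimal'):.1f}x")
print(f"Emu42 vs Free42 binary:  {slowdown('Emu42 1.24', 'Free42 2.5.18 binary'):.1f}x")
```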

Also, your Free42 BCD is running only ~75% faster on your hardware than mine on my mid-range Samsung tablet. Frankly, I would expect it to run much faster in your system (say, at least 4x).

Puzzling results to me, all of them.

Thanks and regards.
V.

All My Articles & other Materials here:  Valentin Albillo's HP Collection

06-05-2020, 04:56 PM
Post: #5
 Werner Senior Member Posts: 721 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-05-2020 03:45 PM)J-F Garnier Wrote:  Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
.. but Free42 is compiled, and the 42 SysRPL is interpreted.
Werner
06-05-2020, 07:27 PM
Post: #6
 Christoph Giesselink Member Posts: 222 Joined: Dec 2013
RE: Emulator vs simulator performance
I can only speak about Emu42; I don't know any details about the internals of Free42.

From the early days I have focused my emulator development on rebuilding the original hardware at the functional level as closely as possible. I also expect an emulation to be faster than the real machine, or at minimum to have equal speed.

The Saturn CPU emulation inside Emu42 was originally built by Sebastien Carlier for Emu48 and was improved by me during the Emu48 development over the years. This emulation core was never optimized for speed.

One anecdote concerns the Saturn opcode dispatcher. The current emulator code decodes the Saturn opcode using tables, one nibble at a time. This keeps the tables small, because 1 nibble = 4 bits -> 2^4 = 16 entries per table. Cyrille thought: why not decode 3 nibbles (12 bits) of the opcode in the first step? So the first opcode dispatcher table had a size of 2^12 = 4096 entries. On a PC, the emulator with the 4096-entry dispatcher table was about 10% faster than the one with 16-entry tables. Mission accomplished? Not really: the optimization was meant for Pocket PCs running Windows CE or Pocket PC 2002. On these devices the new version was massively slower. Why? The 4096-entry dispatcher table didn't fit into the first-level CPU cache, so all accesses to this table had to go through the slow main memory.
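To make the two dispatcher layouts concrete, here is a minimal Python sketch. It is not the Emu42 source: the opcodes `OP_A` and `OP_B` and all table names are hypothetical, and Python cannot show the cache effect itself. It only illustrates the trade-off described above, namely several small 16-entry tables walked nibble by nibble versus one flat 4096-entry table indexed by the first three nibbles.

```python
def unknown():
    return "UNKNOWN"

def op_a():
    return "OP_A"   # hypothetical 1-nibble opcode: 1

def op_b():
    return "OP_B"   # hypothetical 3-nibble opcode: 2 3 4

def make_table():
    """One 16-entry table: one slot per possible nibble value."""
    return [unknown] * 16

# Nested layout: a slot holds either a handler or the next 16-entry table.
level3 = make_table()
level3[0x4] = op_b
level2 = make_table()
level2[0x3] = level3
level1 = make_table()
level1[0x1] = op_a
level1[0x2] = level2

def dispatch_nested(nibbles):
    """Decode one nibble at a time until a handler is reached."""
    entry = level1
    for nibble in nibbles:
        entry = entry[nibble]
        if callable(entry):
            return entry
    raise ValueError("opcode truncated")

# Flat layout: pre-decode every 3-nibble prefix once, 2**12 = 4096 slots.
FLAT = [
    dispatch_nested([(key >> 8) & 0xF, (key >> 4) & 0xF, key & 0xF])
    for key in range(4096)
]

def dispatch_flat(n0, n1, n2):
    """Single table lookup, at the cost of a much larger table."""
    return FLAT[(n0 << 8) | (n1 << 4) | n2]

print(len(level1), "entries per nested table,", len(FLAT), "entries in the flat table")
```

The flat table saves up to two lookups per opcode but is 256 times larger than a single nibble table, which is the kind of footprint that can fall out of a small first-level cache.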

So what is important for me? The speed difference between the original calculator and the emulation. I think an emulation that runs 170 to 340 times faster is fast enough. This is a benchmark list from 2011 comparing the emulation on different host systems with the real machine:
Code:
Emu42 benchmark results using Erik Ehrling's
"Miller-Rabin Primality Test for the HP-42S"

Prime number: 999,999,999,961

Real HP-42S
ROM REV A: Std clock 1MHz  5m 48s
ROM REV A: Dbl clock 2MHz  2m 52s

C2E6750/2.66GHz/333MHz/DDR2 / 2 GB / Windows XP SP2 / Emu42 v1.10
ROM REV C: Max 1s     Auth 5m 27s

2x E5507/2.26GHz/800MHz/DDR3 / 4 GB / Windows 7 SP1 (x86) / Emu42 v1.14
ROM REV C: Max 2s     Auth 5m 25s

A64X2/3800+/800MHz/DDR2 / 4GB / Windows 7 SP1 (x64) / Emu42 v1.14
ROM REV C: Max 2s     Auth 5m 26s

A64X2/3800+/800MHz/DDR2 / 4GB / Windows XP SP3 / Emu42 v1.14
ROM REV C: Max 2s     Auth 5m 26s

A64X2/3800+/533MHz/DDR2 / 1GB / Windows XP SP2 / Emu42 v1.09beta1
ROM REV C: Max 2s     Auth 5m 26s

P4HT/3.4GHz/400MHz/DDR / 2GB / Windows XP SP3 / Emu42 v1.14
ROM REV C: Max 2s     Auth 5m 27s

P4HT/3.4GHz/400MHz/DDR / 1GB / Windows XP SP2 / Emu42 v1.09beta1
ROM REV C: Max 2s     Auth 5m 27s

P4HT/3.2GHz/400MHz/DDR / 1GB / Windows 2000 SP4 / Emu42 v1.09beta1
ROM REV C: Max 2s     Auth 5m 27s

P4/2.4GHz/533MHz / 1GB / Windows XP SP3 / Emu42 v1.11
ROM REV C: Max 2s     Auth 5m 26s

P4/2.4GHz/533MHz / 256MB / Windows 2000 SP4 / Emu42 v0.98-5
ROM REV C: Max 3s     Auth 5m 28s

P3/1.0GHz/133MHz / 512MB / Windows 2000 SP4 / Emu42 v1.09beta1
ROM REV C: Max 4s     Auth 5m 27s

P3/850MHz/100MHz / 384MB / Windows XP SP1 / Emu42 v0.98-5
ROM REV C: Max 5s     Auth 5m 28s

P3/850MHz/100MHz / 384MB / Windows 2000 SP4 / Emu42 v0.98-5
ROM REV C: Max 5s     Auth 5m 28s

P3/500MHz/100MHz / 256MB / Windows 2000 SP4 / Emu42 v0.98-5
ROM REV C: Max 8s     Auth 5m 24s

P3/500MHz/100MHz / 256MB / Windows 98 / Emu42 v0.98-5
ROM REV C: Max 8s     Auth 5m 24s

P3/450MHz/100MHz / 320MB / Windows 2000 SP4 / Emu42 v0.98-5
ROM REV C: Max 10s    Auth 5m 27s

P3/450MHz/100MHz / 128MB / Windows NT4.0 SP4 / Emu42 v0.98-5
ROM REV C: Max 10s    Auth 5m 24s

P(MMX)/200MHz/66MHz / 96MB / Windows 98SE / Emu42 v0.98-4
ROM REV A: Max 43s    Auth 5m 13s
ROM REV B: Max 44s    Auth 5m 25s
ROM REV C: Max 45s    Auth 5m 25s

P/100MHz/66MHz / 32MB / Windows 95B (OSR2) / Emu42 v0.98-4
ROM REV A: Max 1m 00s Auth 5m 13s
ROM REV B: Max 1m 02s Auth 5m 26s
ROM REV C: Max 1m 01s Auth 5m 25s

ARM PXA310/640MHz / Win Mobile 6 Classic / Emu42PPC v1.10
ROM REV C: Max 17s    Auth 5m 26s

ARM PXA270/624MHz / Win Mobile 5.0 / Emu42PPC v1.09
ROM REV C: Max 17s    Auth 5m 26s

ARM PXA270/624MHz / Win Mobile 2003 SE / Emu42PPC v1.02beta5
ROM REV C: Max 17s    Auth 5m 24s

ARM PXA270/624MHz / Win Mobile 2003 SE / Emu42PPC v1.01
ROM REV C: Max 19s    Auth 5m 24s

ARM PXA270/520MHz / Win Mobile 5.0 / Emu42PPC v1.07beta1
ROM REV C: Max 20s    Auth 40s *1

ARM PXA270/520MHz / Win Mobile 2003 SE / Emu42PPC v1.01
ROM REV C: Max 23s    Auth 5m 24s

ARM PXA255/400MHz / Win Mobile 2003 / Emu42PPC v0.20
ROM REV C: Max 30s    Auth ?m ??s

ARM MSM7200/400MHz / Win Mobile 6 Professional / Emu42PPC v1.09
ROM REV C: Max 30s    Auth 5m 26s

ARM PXA270/312MHz / Win Mobile 2003 SE / Emu42PPC v1.02
ROM REV C: Max 29s    Auth 5m 20s

ARM S3C2410/266MHz / Win Mobile 2003 / Emu42PPC v1.02beta5
ROM REV C: Max 32s    Auth ?m ??s

ARM S3C2410/266MHz / Win Mobile 2003 / Emu42PPC v1.01
ROM REV C: Max 34s    Auth ?m ??s

ARM PXA270/208MHz / Win Mobile 2003 SE / Emu42PPC v1.01
ROM REV C: Max 1m 02s Auth 5m 24s

ARM OMAP850/195MHz / Win Mobile 5.0 / Emu42PPC v1.05beta1
ROM REV C: Max 1m 19s Auth 2m 42s

ARM SA1110/206MHz / Pocket PC 2002 / Emu42PPC v1.05beta1
ROM REV C: Max 1m 34s Auth 5m 26s

ARM SA1110/206MHz / Pocket PC 2000 / Emu42PPC v1.09
ROM REV C: Max 1m 18s Auth 5m 26s

*1 high performance counter run only with 1000Hz, this cause trouble in
   connection with timer2 related routines and "Authentic Speed" setting

Environment

Emu42 v0.98-4/5 and Emu42PPC v0.20-1.09 use the same engine

Since Emu42 v1.12 and Emu42PPC v1.11 Sacajawea hardware support is included,
so implementation got some Lewis/Sacajawea hardware specific switches.

Speed setting in Emu42.ini / registry: LewisCycles=64

PRM? is the only program in memory. If there are more programs in memory
the position of PRM? has direct influence on the execution time.

On the Pocket PC / Win Mobile Emu42PPC was the only visible process running.
Tools like Wisbar or running ActiveSync slow down emulation speed.

The Max values on Emu42PPC differs from run to run in a wide range,
the measured values were the fastest ever measured.

Emu42 v1.09beta1 03/05/07
Emu42 v0.98-5    11/12/03
Emu42 v0.98-4    10/30/03
Compiler: Microsoft Visual C++ 6.0 SP1 <- Emu42 v1.12
          Microsoft Visual C++ 6.0 SP5 -> Emu42 v1.13
Settings: /nologo /Gr /MT /W3 /GX /O2 /Ob2 /D "NDEBUG" /D "WIN32" /D "_WINDOWS" /D "STRICT" /Fp".\Release/EMU32.pch" /Yu"pch.h" /Fo".\Release/" /Fd".\Release/" /FD /c

Emu42PPC v0.20 06/09/04
Emu42PPC v1.01
Compiler: eMbedded Visual C++ 3.0 Edition 2002
Settings: /nologo /W3 /O2 /Ob0 /D _WIN32_WCE=$(CEVersion) /D "$(CePlatform)" /D "ARM" /D "_ARM_" /D UNDER_CE=$(CEVersion) /D "UNICODE" /D "_UNICODE" /D "NDEBUG" /Fp"ARMRel/EMU42.pch" /Yu"pch.h" /Fo"ARMRel/" /Oxs /M$(CECrtMT) /c

Emu42PPC v1.02beta5 01/17/05
Emu42PPC v1.05beta1 01/23/06
Emu42PPC v1.07beta1 07/10/06
Compiler: eMbedded Visual C++ 3.0 Edition 2002
Settings: /nologo /W3 /O2 /Ob2 /D _WIN32_WCE=$(CEVersion) /D "$(CePlatform)" /D "ARM" /D "_ARM_" /D UNDER_CE=$(CEVersion) /D "UNICODE" /D "_UNICODE" /D "NDEBUG" /Yu"pch.h" /Oxs /M$(CECrtMT) /c

Thanks to Erik Ehrling for contributing the real calculator, P100 and P200 benchmark values.

10/20/11 (c) by Christoph Gießelink, c dot giesselink at gmx dot de

Making repeatable benchmarks on the HP42S is not as easy as it sounds. First of all, the FOCAL code can only search for global labels located before its current position; if the search reaches the top of memory without finding the label, it continues from the .END. position. So if you have many programs with many global labels behind your program, this will slow down program execution. One more detail: there was a bug in the RAW file object loader until Emu42 v1.22. The FOCAL program object loader also allows importing HP41 FOCAL programs saved by the V41 emulator. Because of some internal differences in NULL byte handling, NULL bytes are removed (packing) or added (behind numbers), and so the distance between labels changes. The distance for global labels was fixed during the import, the distance for local labels was not. This caused execution errors with HP41 programs. Therefore the import now clears the distance information in all local label jump and execute FOCAL opcodes. Don't worry, the HP42 restores these offsets on the first program run, and because of this the first run of a FOCAL program directly after importing is slower than the following runs.

But now to a further difference between emulation and simulation. The Emu42 emulator has to handle some speed-related issues when running the code of an original ROM. Just remember, the authors of the code never thought about running it 100-400 times faster. So some parts simply execute code in a loop to create a delay or the frequency of a beeper. On the HP48 the backarrow key has an autorepeat function, so pressing and holding it slowly removes character after character from the command line. When you have a machine which runs 100 times faster and you do the same thing, the input line is immediately empty. This is what happened with Emu48 running an HP48. But back to Emu42. It took me years to implement the Redeye sending and the beeper emulation. In both cases the timing is done by the CPU executing opcodes. Moreover, the CPU strobe frequency is not very accurate, so it is not usable for sending the Redeye printer protocol. So the ROM code performs a speed calibration of the CPU before printing.

For this, a loop with a known number of CPU cycles is executed in a time frame given by a timer referenced to the 32768Hz crystal, and the number of loops is counted. The bad news is that the register width of the loop counter is too small for a CPU executing 150 times faster, so the count register overran many, many times and the result of the speed measurement was rubbish, affecting the emulation of the Redeye frame transmitter and the beep generation.
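The overflow problem can be illustrated with a small Python sketch. The numbers are assumptions chosen only for illustration: the post gives neither the real loop count nor the register width, so `REAL_LOOPS = 3000` and a 12-bit counter are hypothetical. Only the 150x speed factor comes from the text.

```python
def calibration_count(loops_executed: int, register_bits: int) -> int:
    """Value left in a wrap-around loop counter of the given width."""
    return loops_executed % (1 << register_bits)

REGISTER_BITS = 12    # hypothetical counter width (3 nibbles)
REAL_LOOPS = 3000     # hypothetical loop count on the real calculator
SPEEDUP = 150         # emulation speed factor mentioned in the post

real = calibration_count(REAL_LOOPS, REGISTER_BITS)
emulated_loops = REAL_LOOPS * SPEEDUP
emulated = calibration_count(emulated_loops, REGISTER_BITS)
wraps = emulated_loops // (1 << REGISTER_BITS)

print(f"real hardware: counter = {real}")
print(f"emulator:      counter = {emulated} after {wraps} wrap-arounds")
```

With these assumed values the counter on the fast emulator ends up at a number unrelated to the true loop count, so any timing derived from it is wrong.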

I think this is an important difference between emulation and simulation.

At the last Allschwil meeting I talked about a speed update for my HP-92198 simulation. I printed a large HP71 BASIC program with Emu71, which took 180s. I don't know how fast the original hardware (HP71 and HP-92198 video output) is, but it's slower. Now I cheated: the current HP-92198 simulation does the same thing in 8s. Why do I say cheated? The program is compiled with the same C++ compiler and runs on the same machine. The difference is the display update. The prior version updated the display content after every new character, the current version only every 30ms. So I modified the test conditions; the result for the user is the same, but it's a huge difference for the CPU.
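The effect of the two update policies can be sketched in Python by counting display refreshes over simulated character arrival times. This is not the HP-92198 simulation code: the function name, the one-character-per-millisecond arrival rate, and the 3000-character stream are assumptions for illustration, with only the 30ms interval taken from the post.

```python
def count_refreshes(char_times_ms, min_interval_ms=None):
    """Count display refreshes for characters arriving at the given times.

    With min_interval_ms=None the display is redrawn after every character;
    otherwise a redraw happens only if at least min_interval_ms have passed
    since the previous one.
    """
    refreshes = 0
    last_refresh = None
    for t in char_times_ms:
        if (min_interval_ms is None
                or last_refresh is None
                or t - last_refresh >= min_interval_ms):
            refreshes += 1
            last_refresh = t
    return refreshes

# Assumed workload: 3000 characters, one per millisecond.
arrivals = range(3000)
print("refresh per character:", count_refreshes(arrivals))
print("refresh every 30 ms:  ", count_refreshes(arrivals, 30))
```

A real implementation would also need one final redraw after the last character, so that text arriving inside the last interval still becomes visible.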

So when we speak about calculations, we are discussing numerical results of a mathematical problem. When I change the numerical algorithm for some reason, it's the same cheating as with the display output: I changed the test conditions.

So I think it's quite hard to compare Emu42 with Free42. Use the one which is more suitable for your problem.

BTW how fast is Free42 compared to Wolfram Mathematica solving the same problem?
06-06-2020, 04:37 AM
Post: #7
 Thomas Okken Senior Member Posts: 1,810 Joined: Feb 2014
RE: Emulator vs simulator performance
(06-05-2020 03:45 PM)J-F Garnier Wrote:  But Free42 is written in C, not assembly language, so the comparison is fair.
On the other hand, in this particular benchmark, we can expect a significant part of the time is spent in arithmetic operations that are in asm in the HP-42S, not sure in the Intel 35-digit library.

The Intel Decimal Floating-Point Math Library is written in C, and works using 64-bit integer operations under the hood. I don't think there's any assembly language in there but I haven't really looked for it... but when building for ARM, there definitely isn't since that is not a supported target at all, and I had to do a bit of hacking to even get it to build for Android and iOS.

When built for 32-bit targets, the 64-bit integer operations are compiled into multiple 32-bit operations, and that might benefit from coding it in assembly instead to eliminate duplicate operations (e.g. a 64-bit integer multiplication requires four 32-bit multiplications in general, but the special case of squaring a 64-bit number can be done using only three 32-bit multiplications), but this was not done. Free42 for Windows is a 32-bit app, and the Android and iOS versions contain 32-bit and 64-bit code, running the 32-bit code on 32-bit platforms and running 64-bit code otherwise.
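The multiplication counts mentioned above can be illustrated with a short Python sketch. It is not taken from the Intel library or from Free42; the function names are made up for this example, and Python's unbounded integers are used only to check the result. The general case forms the full 128-bit product from four 32x32 partial products, while squaring needs three because the two cross terms are equal.

```python
MASK32 = 0xFFFFFFFF

def split64(x: int):
    """Split a 64-bit value into (high, low) 32-bit halves."""
    return x >> 32, x & MASK32

def mul64_full(a: int, b: int) -> int:
    """Full 128-bit product using four 32x32-bit multiplications."""
    ah, al = split64(a)
    bh, bl = split64(b)
    ll = al * bl            # multiplication 1
    lh = al * bh            # multiplication 2
    hl = ah * bl            # multiplication 3
    hh = ah * bh            # multiplication 4
    return (hh << 64) + ((lh + hl) << 32) + ll

def sqr64_full(a: int) -> int:
    """Full 128-bit square using three 32x32-bit multiplications."""
    ah, al = split64(a)
    ll = al * al            # multiplication 1
    hh = ah * ah            # multiplication 2
    cross = al * ah         # multiplication 3, used twice
    return (hh << 64) + (cross << 33) + ll   # cross << 33 == 2 * cross << 32

a = 0xFFFFFFFFFFFFFFFF
b = 0x123456789ABCDEF0
print(hex(mul64_full(a, b)))
print(hex(sqr64_full(a)))
```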
06-08-2020, 07:53 AM
Post: #8
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-05-2020 07:27 PM)Christoph Giesselink Wrote:  So what is important for me? The speed difference between original calculator and emulation. So I think a 170 to 340 times faster emulation is fast enough. This is a benchmark list from 2011 comparing the emulation on different host systems comparing to the real machine:
Code:
 Emu42 benchmark results using Erik Ehrling's "Miller-Rabin Primality Test for the HP-42S" Prime number: 999,999,999,961 ...

I agree with you, the improved speed vs the original is an important benefit, but not the only goal of an emulator.
Thanks for the benchmark, I remember it now.

Quote:So I think it's quite hard to compare Emu42 with Free42. Use the one which is more suitable for your problem.

Well, not so hard to compare, since I did it :-) but yes, I understand what you mean. My goal was just to get a clearer view of the performance difference. We can't expect an emulator to compete with native applications.

I'm using both Emu42 and Free42, for different usages, as I'm using your Emu71 for Windows in some cases (very precise emulation) and my Emu71/DOS for others (integrated HP-IL environment)

J-F
06-09-2020, 07:41 AM
Post: #9
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-06-2020 04:37 AM)Thomas Okken Wrote:  When built for 32-bit targets, the 64-bit integer operations are compiled into multiple 32-bit operations, and that might benefit from coding it in assembly instead to eliminate duplicate operations (e.g. a 64-bit integer multiplication requires four 32-bit multiplications in general, but the special case of squaring a 64-bit number can be done using only three 32-bit multiplications), but this was not done. Free42 for Windows is a 32-bit app, and the Android and iOS versions contain 32-bit and 64-bit code, running the 32-bit code on 32-bit platforms and running 64-bit code otherwise.

This may explain the observation from Valentin:

(06-05-2020 04:24 PM)Valentin Albillo Wrote:  Also, your Free42 BCD is running only ~75% faster on your hardware than mine on my mid-range Samsung tablet. Frankly, I would expect it to run much faster in your system (say, at least 4x).

Also, I'm using a modest 1.8GHz core-i3 machine; more powerful machines may give significantly better performance.

J-F
06-09-2020, 08:49 PM (This post was last modified: 06-10-2020 07:38 PM by Jonathan Busby.)
Post: #10
 Jonathan Busby Member Posts: 250 Joined: Nov 2014
RE: Emulator vs simulator performance
(06-05-2020 03:26 PM)David Hayden Wrote:  [snip]
The Saturn CPU is pretty different from an x86.

I would put it as *extremely different*. The only real similarity is that the x86 and Saturn families of processors both use a "dest = dest op source" machine code format for many of their arithmetic-logic operations, although the Saturn has many "dest = source op dest" machine instructions as well.

Quote:Emulating it might be harder than a more traditional one. Things like register fields might be hard.

If you're writing the emulator in a language like C, then dealing with register fields means dealing with a lot of bit-masking and bit-shifts (except maybe for the Emu48 family of emulators, which seem to perform decimal operations, and apparently HEX mode operations as well, in a nibble-serial manner, if I'm not mistaken. This might explain why some of the emulators are so slow at emulating Saturn machine code arithmetic and comparison instructions).

Quote:And if the Saturn BCD arithmetic doesn't translate well to x86 then that could be a problem.

The BCD instructions on modern x86 CPUs are legacy instructions that have been relegated to the CPU's microcode ROM, and are therefore very slow.

The 32-bit x86 CPUs only have instructions that operate on two packed BCD digits at a time. For example :

Code:
MOV AL, 42h
MOV CL, 33h
ADD AL, CL
DAA

which adds packed BCD 42 to 33, or :

Code:
MOV AL, 42h
MOV CL, 33h
SUB AL, CL
DAS

which subtracts packed BCD 33 from 42 .

The "DAA" and "DAS" instructions stand for "Decimal Adjust for Addition" and "Decimal Adjust for Subtraction" respectively.

In AMD64 / x64 "long mode", the above DAA and DAS instructions are not available: their opcodes are invalid in 64-bit mode.

In C, especially on a 64-bit x64 machine, it's faster to use a little bit-twiddling and arithmetic trickery to perform packed BCD arithmetic. (A quick google search turns up many solutions for addition and subtraction of packed BCD integers in C using only bitwise operations and arithmetic, eg. see this Wikipedia solution in C which uses just ten bitwise and arithmetic operations (no multiplication or division). Also, I do have a vague memory of a bitwise solution that only used *six* operations, but I can't remember where or when I saw it and I can't seem to reproduce it "de novo" either :/ . On many ARM processors, even the above ten operations can be reduced to just *eight* operations because of many ARM processors' "0-cycle" barrel-shifter, which one gets for free with most ARM arithmetic or bitwise instructions.)
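For reference, here is a Python transcription of a widely published ten-operation packed BCD addition, which I believe matches the kind of solution referred to above, though the exact linked code is not reproduced in this thread. The names `bcd_add` and `to_bcd` are mine. Python integers are unbounded, so one extra mask emulates a 32-bit register; in C with `uint32_t` that mask would not be needed.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit unsigned register in Python

def to_bcd(n: int) -> int:
    """Pack a non-negative decimal integer, one digit per nibble."""
    return int(str(n), 16)

def bcd_add(a: int, b: int) -> int:
    """Add two packed BCD values of up to 7 digits each."""
    t1 = a + 0x06666666          # pre-bias every digit by 6
    t2 = t1 ^ b                  # sum without carry propagation
    t1 = (t1 + b) & MASK32       # provisional binary sum
    t2 = t1 ^ t2                 # all binary carry bits
    t2 = ~t2 & 0x11111110        # digit positions that produced NO carry
    t2 = (t2 >> 2) | (t2 >> 3)   # a 6 for each digit that did not carry
    return t1 - t2               # remove the bias where it was not needed

print(hex(bcd_add(0x42, 0x33)))  # same operands as the DAA example above
```

The idea is that adding 6 to each digit turns a decimal carry (digit sum of 10 or more) into a binary carry out of the nibble; digits that did not carry get the 6 subtracted again at the end.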

( EDIT #1 : The above statements about ARM-based processors' "free" or "0-cycle" barrel-shifter may not be entirely accurate, as I haven't done much ARM assembly language in a long while. I believe that the previously "free" shift operation may incur some performance penalties on more modern ARM processors, although I'm no expert on ARM assembly, so I'm not sure )

Quote:Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language.

Indeed RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time

Quote:As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.

That sounds about right

Regards,

Jonathan

P.S. The formatting seems to be messed up in the preview of this post with long lines running off-screen which generates a scroll bar.

Aeternitas modo est. Longa non est, paene nil.
06-10-2020, 07:57 AM
Post: #11
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-05-2020 04:56 PM)Werner Wrote:
(06-05-2020 03:45 PM)J-F Garnier Wrote:  Yes, the 42S is internally using RPL. But Free42 is written in C, not assembly language, so the comparison is fair.
.. but Free42 is compiled, and the 42 SysRPL is interpreted.
Werner

SysRPL is not exactly interpreted. An interpreter such as HP Basic uses tokens and relies on tables to get the execution address (on HP Basic, it's quite complex and relatively slow with all the possible LEXs to scan).
In SysRPL, the "tokens" are the execution addresses themselves. The right term is probably "threaded code" as for the Forth language. But I'm not a RPL expert :-)
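The distinction between a token interpreter and threaded code can be sketched in Python. This is only an analogy with made-up words (`push1`, `push2`, `add`), not a model of HP Basic or of the RPL inner loop: in the first runner every token is looked up in a table at run time, in the second the program already consists of the "execution addresses" (here, function references).

```python
def push1(stack):
    stack.append(1)

def push2(stack):
    stack.append(2)

def add(stack):
    stack.append(stack.pop() + stack.pop())

# Token interpreter: each token needs a table lookup before it can run.
TOKEN_TABLE = {"ONE": push1, "TWO": push2, "ADD": add}

def run_tokens(program):
    stack = []
    for token in program:
        handler = TOKEN_TABLE[token]   # lookup on every step
        handler(stack)
    return stack

# Threaded code: the program is a list of execution addresses.
def run_threaded(program):
    stack = []
    for word in program:
        word(stack)                    # jump straight to the code
    return stack

print(run_tokens(["ONE", "TWO", "ADD"]))
print(run_threaded([push1, push2, add]))
```

The lookup in this toy example is a single dictionary access; the post describes the HP Basic case as considerably more involved because of the LEX files that may have to be scanned.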

(06-09-2020 08:49 PM)Jonathan Busby Wrote:
Quote:Also, it's my understanding that the 42s is implemented in SysRPL. That adds another layer of complication and slowdown compared to assembly language.
Indeed RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time

Quote:As I recall working with C on a 50g, the speed difference there is something like native C code is about 5x faster than Saturn assembly, which is about 5x faster than SysRPL, which is about 5x faster than User RPL. Those values may vary somewhat, but I do recall specifically that C code is about 100x faster than userRPL on the 50g.
That sounds about right

What is the penalty of SysRPL compared to assembly language? I tried to answer by porting Valentin's program to the HP-32SII, emulated in Christoph's Emu42. So the comparison is done at constant CPU speed.

Here is the result, for 10,000 points as in the test above:
HP42S emulated on Emu42 1.24 : 5min08s (as above)
HP32SII emulated on Emu42 1.24 : 2min01s !

So although the 32SII RPN language is not as powerful for complex numbers as that of the 42S, the 32SII is 2.5x more efficient than the HP42S. I didn't expect so much difference; it's a surprise for me.
The difference is not as large on the real machines, since the CPU speed is probably reduced on the 32S. The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

For the curious, here is the HP-32S program:
Code:
; Mandelbrot area, for the HP-32S/SII
; based on Valentin Albillo's program "AM" for the HP-42S
;Variables:
;I : loop index
;J : loop index
;M : count
;N : #points
;K : #iterations
;C, D : random point under test
;F : constant = 0.25^2
;A : scratch
; labels used: A, B, C, D, E.
A01 LBL A ; initialization and input
A02 STO N       ;#points
A03 STO I
A04 0.25
A05 x^2
A06 STO F
A07 1
A08 SEED
A09 256        ; #iterations
A10 STO K
A11 CLx
A12 STO M
; main loop
B01 LBL B
B02 RCL K      ; # iterations
B03 STO J      ; in J loop index
B04 RANDOM
B05 2.5
B06 *
B07 2
B08 -
B09 STO C
B10 RANDOM
B11 1.2
B12 *
B13 STO D
; belongs to cardioid?
B14 x^2
B15 x<>y
B16 x^2
B17 +
B18 SQRT        ; abs(z)
B19 STO A
B20 RCL D
B21 RCL/ A
B22 RCL C
B23 RCL/ A      ; z/abs(z)
B24 0
B25 2
B26 CMPLX-
B27 CMPLX*
B28 x^2
B29 x<>y
B30 x^2
B31 +
B32 SQRT
B33 4
B34 /
B35 RCL A
B36 x<y?
B37 GTO D
; belongs to main disk?
B38 1
B39 RCL+ C
B40 x^2
B41 RCL D
B42 x^2
B43 +
B44 RCL F
B45 x>y?
B46 GTO D
; belongs elsewhere in M?
B47 RCL D
B48 RCL C
C01 LBL C
C02 0
C03 ENTER
C04 CMPLX+
C05 CMPLX*          ; Z^2
C06 RCL D
C07 RCL C
C08 CMPLX+          ; Z=Z^2+C
C09 0
C10 ENTER
C11 CMPLX+
C12 x^2
C13 x<>y
C14 x^2
C15 +  ; |Z|²
C16 4
C17 x<=y?
C18 GTO E
C19 Rdn
C20 Rdn
C21 DSE J
C22 GTO C     ; next iter
D01 LBL D
D02 ISG M     ; count++
E01 LBL E
E02 DSE I
E03 GTO B     ; next point
E04 RCL M     ; recall count
E05 RTN

J-F
06-10-2020, 12:16 PM (This post was last modified: 06-18-2020 06:37 PM by Didier Lachieze.)
Post: #12
 Didier Lachieze Senior Member Posts: 1,508 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-10-2020 07:57 AM)J-F Garnier Wrote:  The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640kHz.

(Edited to fix typo on the 32SII frequency)
06-10-2020, 01:53 PM
Post: #13
 Christoph Giesselink Member Posts: 222 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-10-2020 12:16 PM)Didier Lachieze Wrote:
(06-10-2020 07:57 AM)J-F Garnier Wrote:  The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640kHz.

These speed settings on Wikipedia were contributed by me many years ago. But I'm no longer sure about the values for the 1LU7 Bert and 1LR3 Sacajawea chips. First, the PCBs of the low-end Pioneers contain no crystal. The external component on a 1LU7 PCB is a capacitor, and the external components on a 1LR3 PCB are a capacitor and a resistor.

But how did I get the "Authentic Speed" value for Emu42 emulating an HP32SII?

In this case I went a more practical way.

I programmed Katie Wasserman's 99 Digits of PI on a real HP32SII and measured the execution time. It took around 11 minutes, as mentioned in the article. With this measured time I adjusted the Sacajawea CPU cycles reference inside Emu42 to a value of 54, so that the real and the emulated calculator need more or less the same time to execute this program.

What does this number 54 mean? Execute 54 CPU cycles in a 16384 Hz time frame, so 54 * 16384 Hz = 884736 Hz.

BTW, the CPU cycles reference in Emu42 for a 1MHz Lewis CPU is 61 (61 * 16384 Hz = 999424 Hz).
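The conversion between the cycles reference and the equivalent strobe frequency is a single multiplication, sketched here in Python. The function names are mine; the 16384 Hz time frame and the values 54 and 61 come from the post.

```python
TIME_FRAME_HZ = 16384  # time frame rate given in the post

def strobe_frequency_hz(cycles_per_frame: int) -> int:
    """CPU cycles executed per second for a given cycles reference."""
    return cycles_per_frame * TIME_FRAME_HZ

def cycles_reference(target_hz: float) -> int:
    """Nearest integer cycles reference for a target CPU frequency."""
    return round(target_hz / TIME_FRAME_HZ)

print(strobe_frequency_hz(54))       # Sacajawea setting (HP32SII)
print(strobe_frequency_hz(61))       # Lewis setting (HP42S)
print(cycles_reference(1_000_000))   # reference closest to 1 MHz
```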

Can I prove this calculated frequency? This depends on the type of memory the assembler code is running from. For all memory devices connected directly to the Saturn bus, the CPU cycles inside Emu42 are correct. But when the memory device is accessed through a Saturn-bus-to-8-bit converter, in order to use regular static RAM or ROM devices with an 8-bit data bus interface, the CPU cycles inside Emu42 are wrong. In that case the same opcode needs more cycles for a memory access, and moreover the same opcode may take a different number of cycles depending on whether it is executed at an even or an odd address.

But what does this mean in the case of the HP32SII? Both RAM and ROM are internal devices directly connected to the Saturn bus, so the CPU cycles used should be correct. As a result I now assume a CPU strobe frequency of about 884 kHz for the HP32SII.
06-10-2020, 08:32 PM (This post was last modified: 06-12-2020 06:21 PM by Jonathan Busby.)
Post: #14
 Jonathan Busby Member Posts: 250 Joined: Nov 2014
RE: Emulator vs simulator performance
(06-10-2020 07:57 AM)J-F Garnier Wrote:  [snip]

SysRPL is not exactly interpreted. An interpreter such as HP Basic uses tokens and relies on tables to get the execution address (on HP Basic, it's quite complex and relatively slow with all the possible LEXs to scan).
In SysRPL, the "tokens" are the execution addresses themselves. The right term is probably "threaded code" as for the Forth language. But I'm not a RPL expert :-)

You're actually completely correct. See this article. Technically, RPL is a "TIL", or "Threaded Interpreted Language".
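The distinction between a token-based interpreter and threaded code can be illustrated with a small Python sketch. It is deliberately generic: Python function references stand in for "execution addresses", and none of the names below (`run_tokens`, `run_threaded`, the handlers) come from HP BASIC or SysRPL.

```python
# Generic illustration only; not how HP BASIC or SysRPL are actually implemented.

def push_two(stack):
    stack.append(2)

def push_three(stack):
    stack.append(3)

def add(stack):
    stack.append(stack.pop() + stack.pop())

# Token interpreter: the program holds tokens, and every step
# needs a table lookup to find the code to run.
TABLE = {"TWO": push_two, "THREE": push_three, "ADD": add}

def run_tokens(tokens, table):
    stack = []
    for token in tokens:
        handler = table[token]  # extra lookup on every step
        handler(stack)
    return stack

# Threaded code: the program holds the handlers themselves,
# so the "token" already is the thing to execute.
def run_threaded(thread):
    stack = []
    for handler in thread:
        handler(stack)
    return stack

print(run_tokens(["TWO", "THREE", "ADD"], TABLE))
print(run_threaded([push_two, push_three, add]))
```

Both runs compute the same result; the threaded version simply skips the lookup step on every word.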

(06-09-2020 08:49 PM)Jonathan Busby Wrote:  Indeed RPL on Saturn CPU's involves at least *three* levels of indirection a lot of the time

Actually I misspoke : Although there are usually *three* levels of indirection with most simple RPL objects that eg. just push themselves to the stack, the third level of indirection is just the act of re-entering the RPL inner loop.

Quote:What is the penalty of SysRPL compared to assembly language?

Well, the "RPL inner loop" as implemented on Saturn based HP calculators, follows the following control flow, assuming the current object being executed in the runstream is a pointer to an embedded BINT object ( to demonstrate the various levels of indirection ) :

Code:
        A=DAT0  A
        D0=D0+  5
        PC=(A)          *First level of indirection

Code:
=PRLG   LC(2)   10
        A=A-C   B
        PC=(A)          *Second level of indirection : PC now points to BINT direct execution code dirbint

Code:
dirbint D0=D0-  5
        AD0EX
        D0=A
        D0=D0+  10
        D=D-1   A
        GOC     OutOfMemory
        D1=D1-  5
        DAT1=A  A
        A=DAT0  A
        D0=D0+  5
        PC=(A)          *This is technically a "third" level of indirection, but it's really just the next object/pointer in the runstream being executed

So, I initially misspoke, as there are technically only *two* levels of indirection in most Sys-RPL words.

As for the Sys-RPL performance penalty, well, direct object execution involves *two* "PC=(A)" instructions. This means that a 5-nibble address has to be read from memory, and memory accesses on the Saturn CPU are notoriously slow. It also involves a 5-nibble absolute control flow jump, as the PC is set to the address previously read from memory, and such absolute jumps on the Saturn CPU are slow, although not as slow as memory accesses. One must also take into account all the other instructions that are executed when an RPL object is directly executed in the runstream, which adds a lot of overhead as well.

( EDIT : The reason memory accesses are so slow on the Saturn CPU is not the Saturn CPU itself per se, but the Saturn bus. In the original discrete HP71B Saturn chip, I believe that the Saturn bus interface was integrated onto the chip. On later Saturn-based SoCs like the Yorke, the Saturn bus ran at half the speed of the Saturn CPU itself, which slowed down memory accesses by about 2x. This, though, is only one aspect of the Saturn bus that contributes to the slowdown.

For an instruction like "PC=A", the Saturn CPU drives a "LOAD PC" command on the Saturn bus, followed by a 5-cycle operation in which the CPU transfers the 5 nibbles of the new PC address, which the memory controllers load into their local PCs. There is then a command auto-switch to a "PC READ" command and a "dummy strobe" on the Saturn bus for memory pipelining.

For an instruction such as "A=DAT0 W", the CPU first issues a "LOAD DP" command onto the Saturn bus and then performs a 5-cycle operation in which it successively drives 5 address nibbles onto the Saturn bus, which are latched by the memory controllers. There is then a command auto-switch to "DP READ", another 1-cycle "dummy strobe", and then the CPU reads 16 nibbles from the Saturn bus.

So, for the "PC=(A)" instruction, you have 1 cycle for the "LOAD DP" command, 1 cycle for the dummy strobe, 5 cycles to read the data, another 1 cycle for the "LOAD PC" command, 5 cycles for the CPU to transfer the new PC address to the memory controllers and then, finally, a 1-cycle dummy strobe, for a total of 14 cycles, not including instruction decode time. For the "A=DAT0 A" instruction, you have the initial 1-cycle "LOAD DP" bus command, 5 cycles to drive the 5 nibbles of the address, a 1-cycle dummy strobe and then 5 cycles for reading 5 data nibbles, for a total of 12 cycles, not counting instruction decode time.

If this is on the Yorke SoC, then the total cycle length associated with the Saturn bus is around 24 cycles, as the Saturn bus on the Yorke SoC only runs at 2MHz. )
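The cycle counts in the EDIT above can be tallied with a short Python sketch. The step lists are copied from the enumeration in the post exactly as written; the names `PC_INDIRECT`, `A_DAT0_A`, `bus_cycles` and the `bus_divider` parameter are assumptions made for this illustration, and instruction decode time is left out, as in the post.

```python
# Bus steps as enumerated in the post; names and structure are illustrative only.
PC_INDIRECT = [            # "PC=(A)"
    ("LOAD DP command", 1),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
    ("LOAD PC command", 1),
    ("transfer new PC address", 5),
    ("dummy strobe", 1),
]

A_DAT0_A = [               # "A=DAT0 A"
    ("LOAD DP command", 1),
    ("drive 5 address nibbles", 5),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
]


def bus_cycles(steps, bus_divider=1):
    """Sum the bus cycles; bus_divider=2 models a bus at half the CPU clock."""
    return sum(cycles for _, cycles in steps) * bus_divider


print(bus_cycles(PC_INDIRECT))              # total for PC=(A)
print(bus_cycles(A_DAT0_A))                 # total for A=DAT0 A
print(bus_cycles(A_DAT0_A, bus_divider=2))  # A=DAT0 A seen from a half-speed bus
```

With `bus_divider=2` the 12 bus cycles of "A=DAT0 A" become 24 CPU cycles, which matches the figure quoted for the Yorke.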

Quote:I tried to answer by porting Valentin's program to the HP-32SII, emulated in Christoph's Emu42. So the comparison is done at constant CPU speed.

Here is the result, for 10,000 points as above test:
HP42S emulated on Emu42 1.24 : 5min08s (as above)
HP32SII emulated on Emu42 1.24 : 2min01s !

So despite the 32SII RPN language is not as powerful for complex numbers than the 42S, the 32SII is 2.5x more efficient than the HP42S. I didn't expect so much difference, it's a surprise for me.
The difference is not as large on the real machines, since the CPU speed is probably reduced on the 32S. The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

AFAIK, the 32SII Saturn runs at about 640KHz.

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.
06-15-2020, 08:54 PM
Post: #15
 Christoph Giesselink Member Posts: 222 Joined: Dec 2013
RE: Emulator vs simulator performance
Performance measurements with Emu71/Win up to v1.11

J-F Garnier made some performance tests with Emu71/DOS and Emu71/Win and detected a performance catastrophe in Emu71/Win up to v1.11 (the current version) in connection with the BASIC GOTO command. Emu71/DOS is not affected.

A GOTO loop that you would expect to take ~0.08s at full speed takes ~9s, whereas the same loop at "Authentic Calculator Speed" takes ~20s. This behavior is caused by annunciator settings combined with the display annunciator update, which takes quite long.

TNX to J-F Garnier for this detection.

The problem will be fixed in an upcoming version, so take care when you do performance measurements with Emu71/Win up to v1.11 on programs that make heavy use of GOTO loops.
06-16-2020, 06:25 PM
Post: #16
 Jake Schwartz Senior Member Posts: 325 Joined: Dec 2013
RE: Emulator vs simulator performance
(06-10-2020 12:16 PM)Didier Lachieze Wrote:
(06-10-2020 07:57 AM)J-F Garnier Wrote:  The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640Hz.

This is stuff that former-HPer Eric Vogel would have known right off the top of his head back in the day. He gave a nice presentation on the various Pioneer machines at a Chicago-area CHIP-group meeting back in April of 1991 around the time of the 32SII introduction, where he put chalk to blackboard and made a nice chart of all the ROM/RAM/CPU/display variants up and down the Pioneer series. Those were fun days.
Jake
06-16-2020, 07:12 PM
Post: #17
 Jonathan Busby Member Posts: 250 Joined: Nov 2014
RE: Emulator vs simulator performance
(06-10-2020 12:16 PM)Didier Lachieze Wrote:
(06-10-2020 07:57 AM)J-F Garnier Wrote:  The 42S is supposed to run at 1MHz, does somebody know the speed of the 32SII CPU?

According to Wikipedia the Lewis CPU used in the 42S runs at 1MHz and the Sacajawea CPU used in the 32SII runs at 640Hz.

Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.
06-16-2020, 07:23 PM
Post: #18
 Thomas Okken Senior Member Posts: 1,810 Joined: Feb 2014
RE: Emulator vs simulator performance
(06-16-2020 07:12 PM)Jonathan Busby Wrote:  Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Surely you mean kHz, as in 1000 Hz, and not KHz, as in Kelvin-Hertz?
06-16-2020, 07:26 PM
Post: #19
 Jonathan Busby Member Posts: 250 Joined: Nov 2014
RE: Emulator vs simulator performance
(06-16-2020 07:23 PM)Thomas Okken Wrote:
(06-16-2020 07:12 PM)Jonathan Busby Wrote:  Not to nitpick, but I think you made a typo with "640Hz". It should be "640KHz". If it was just 640Hz, then it would be just a little faster than Charles Babbage's mechanical Analytical Engine.

Surely you mean kHz, as in 1000 Hz, and not KHz, as in Kelvin-Hertz?

HAHA!

Seems I'm not the only one susceptible to making typos.

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.
06-18-2020, 06:25 PM (This post was last modified: 06-18-2020 06:28 PM by J-F Garnier.)
Post: #20
 J-F Garnier Senior Member Posts: 701 Joined: Dec 2013
RE: Emulator vs simulator performance
So I ran Valentin's program on Emu71 too, since the initial question was to compare an HP-71B emulator with Free42.
Here are the results, still with 10,000 points and the other default parameters:
Emu71/Win 1.11 : 1min55s
Emu71/DOS 2.45 run in VirtualBox : 1min58s (not too bad for an old 16-bit DOS program :-)
(for reference, Free42 Decimal : 6.3s )
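To put the timings from this thread side by side, here is a small Python sketch that converts them to seconds and expresses each one as a ratio to Free42 decimal. The times are the ones reported in the posts; the helper names `to_seconds` and `ratios_vs` and the table labels are made up for this illustration.

```python
import re


def to_seconds(text):
    """Parse times written like '5min08s' or '6.3s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)min)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# Run times for 10,000 points, as reported in the thread.
TIMES = {
    "HP42S on Emu42 1.24": "5min08s",
    "HP32SII on Emu42 1.24": "2min01s",
    "Emu71/Win 1.11": "1min55s",
    "Emu71/DOS 2.45 (VirtualBox)": "1min58s",
    "Free42 2.5.18 decimal": "6.3s",
}


def ratios_vs(reference, times=TIMES):
    """Return each run time divided by the reference run time."""
    ref = to_seconds(times[reference])
    return {name: to_seconds(t) / ref for name, t in times.items()}


for name, ratio in ratios_vs("Free42 2.5.18 decimal").items():
    print(f"{name:30s} {ratio:5.1f}x")
```

By this measure the HP-71B emulators land at roughly 18x Free42 decimal, and the emulated HP42S at roughly 49x.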

My first attempt was to adapt the 42S program with the same GOTOs, and that is how I identified the abnormal slowdown in Emu71/Win with GOTOs that Christoph reported above.
Then I replaced the GOTOs with FOR..NEXT loops and other constructs (using GOTOs is considered bad programming style anyway), and the Emu71/Win performance was then similar to that of Emu42 emulating an HP-32SII.

Here is my HP-71B program (without GOTOs):
Code:
10 ! ------
20 ! Mandelbrot set area
30 ! from Valentin Albillo "AM" program
40 !
50 COMPLEX Z,C,A
60 F=.25
70 INPUT "Points?";N
80 T=TIME
90 RANDOMIZE 1
100 K=256
110 M=0
120 !
130 FOR I=1 TO N
140 B=0
150 C=(RND*2.5-2,RND*1.2)
160 A=SGN(C) @ IF ABS((A-2)*A)*F>ABS(C) THEN B=1
170 IF ABS(C+1)<F THEN B=1
180 IF B THEN J=K+1 ELSE J=1 @ Z=C
190 B=1
200 FOR J=J TO K
210 Z=Z*Z+C @ IF ABS(Z)>=2 THEN J=K @ B=0
220 NEXT J
230 IF B THEN M=M+1
240 NEXT I
250 DISP M;TIME-T
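For readers without an HP-71B at hand, the logic of the listing can be sketched in Python. This is a rough port, not a reference implementation: it assumes that SGN of a complex number returns the unit vector C/|C| (and 0 for C=0), the names `unit`, `in_set` and `count_inside` are invented here, and Python's random generator is not the HP-71B's RND, so the count will not match the calculator's for the same seed.

```python
import random

F = 0.25  # constant from line 60 of the BASIC listing
K = 256   # iteration limit from line 100


def unit(c):
    """Assumed meaning of SGN for a complex argument: c/|c|, or 0 for c == 0."""
    r = abs(c)
    return c / r if r else 0j


def in_set(c, max_iter=K):
    """Return True if c is counted as inside, following lines 160-220."""
    a = unit(c)
    if abs((a - 2) * a) * F > abs(c):  # line 160: quick accept
        return True
    if abs(c + 1) < F:                 # line 170: quick accept near -1
        return True
    z = c                              # line 180
    for _ in range(max_iter):          # lines 200-220
        z = z * z + c
        if abs(z) >= 2:
            return False               # escaped, so not counted
    return True


def count_inside(n, seed=1):
    """Count random points that pass in_set (the M of the listing)."""
    rng = random.Random(seed)
    m = 0
    for _ in range(n):
        # line 150: real part in [-2, 0.5), imaginary part in [0, 1.2)
        c = complex(rng.random() * 2.5 - 2, rng.random() * 1.2)
        if in_set(c):
            m += 1
    return m


if __name__ == "__main__":
    n = 10_000
    print(count_inside(n), "of", n, "points counted as inside")
```

As in the BASIC version, the two quick tests let many points skip the 256-step iteration entirely, which matters a lot for the run time on a slow CPU.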

The abnormal slowdown in Emu71/Win up to 1.11 occurs with GOTO/GOSUB and of course with statements directly or indirectly related to the annunciators:
SFLAG/CFLAG/RADIANS/DEGREES/USER

J-F