Emulator vs simulator performance
06-10-2020, 08:32 PM (This post was last modified: 06-12-2020 06:21 PM by Jonathan Busby.)
Post: #14
RE: Emulator vs simulator performance
(06-10-2020 07:57 AM)J-F Garnier Wrote:  [snip]

SysRPL is not exactly interpreted. An interpreter such as HP Basic uses tokens and relies on tables to get the execution address (on HP Basic, it's quite complex and relatively slow with all the possible LEXs to scan).
In SysRPL, the "tokens" are the execution addresses themselves. The right term is probably "threaded code" as for the Forth language. But I'm not a RPL expert :-)

You're actually completely correct :-) See this article. Technically, RPL is a "TIL", or "Threaded Interpreted Language".
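
To make the token vs. threaded-code distinction concrete, here is a minimal Python sketch. It is only a toy: the token values, the table and the three words are made up for illustration and do not model real HP BASIC or RPL internals. Function references stand in for the "execution addresses".

```python
# Toy contrast between token dispatch and threaded dispatch.
# All names and token values are hypothetical.

def push_two(stack):
    stack.append(2)

def push_three(stack):
    stack.append(3)

def add(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

# Token interpreter: the program stores small tokens, and every step
# needs a table lookup to find the code to run.
TOKEN_TABLE = {0x01: push_two, 0x02: push_three, 0x03: add}
token_program = [0x01, 0x02, 0x03]

def run_tokens(program):
    stack = []
    for token in program:
        handler = TOKEN_TABLE[token]  # extra lookup on every step
        handler(stack)
    return stack

# Threaded code: the program stores the execution addresses themselves
# (function references here), so no table lookup is needed.
threaded_program = [push_two, push_three, add]

def run_threaded(program):
    stack = []
    for word in program:
        word(stack)
    return stack

print(run_tokens(token_program))       # [5]
print(run_threaded(threaded_program))  # [5]
```

Both programs compute the same thing; the only difference is whether the dispatcher has to translate a token into an address first.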

(06-09-2020 08:49 PM)Jonathan Busby Wrote:  Indeed :-) RPL on Saturn CPUs involves at least *three* levels of indirection a lot of the time.

Actually, I misspoke: although there are usually *three* levels of indirection with most simple RPL objects that, e.g., just push themselves onto the stack, the third level of indirection is just the act of re-entering the RPL inner loop.

Quote:What is the penalty of SysRPL compared to assembly language?

Well, the "RPL inner loop", as implemented on Saturn-based HP calculators, has the following control flow, assuming the current object being executed in the runstream is a pointer to an embedded BINT object (to demonstrate the various levels of indirection):

Code:
A=DAT0        A
D0=D0+        5
PC=(A)        *First level of indirection

Code:
=PRLG   LC(2)   10
        A=A-C   B       
        PC=(A)  *Second level of indirection : PC now points to BINT direct execution code dirbint

Code:
dirbint D0=D0-  5
        AD0EX
        D0=A
        D0=D0+  10
        D=D-1   A
        GOC     OutOfMemory
        D1=D1-  5
        DAT1=A  A
        A=DAT0  A
        D0=D0+  5
        PC=(A) *This is technically a "third" level of indirection, but it's really just the next object/pointer in the runstream being executed

So, I initially misspoke, as there are technically only *two* levels of indirection in most Sys-RPL words :-)
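
The three listings above can be traced with a small Python toy model. This is a sketch under assumptions, not a Saturn emulator: all addresses and the memory layout are invented, the BINT is assumed to sit directly in the runstream, the "B" field subtraction is modelled as a plain subtraction, and the free-memory check (D=D-1 / GOC OutOfMemory) is left out. Only the order of the steps follows the listings.

```python
# Toy model of the control flow in the listings above.
# All addresses are hypothetical; memory is a dict of address -> value.

PRLG, DOBINT, DIRBINT, STOP = 0x100, 0x200, 0x300, 0x400
END_WORD = 0x500  # hypothetical word that ends the toy runstream

mem = {
    DOBINT: PRLG,          # the prolog address cell points at PRLG
    DOBINT - 10: DIRBINT,  # slot read by PRLG after A=A-C
    END_WORD: STOP,
    0x1000: DOBINT,        # runstream: embedded BINT, prolog ...
    0x1005: 42,            # ... and body
    0x100A: END_WORD,      # next runstream entry
}

def run(mem, d0):
    """Return (data stack, indirect jumps spent on executing objects)."""
    stack, jumps = [], 0
    while True:
        a = mem[d0]        # A=DAT0 A : fetch the next runstream entry
        d0 += 5            # D0=D0+ 5
        pc = mem[a]        # PC=(A)   : first level of indirection
        if pc == STOP:
            return stack, jumps
        jumps += 1
        assert pc == PRLG
        a -= 10            # PRLG: LC(2) 10 / A=A-C B
        pc = mem[a]        # PC=(A)   : second level of indirection
        jumps += 1
        assert pc == DIRBINT
        d0 -= 5            # dirbint: D0 back to the start of the object
        obj = d0           # AD0EX / D0=A : A keeps the object address
        d0 += 10           # skip prolog (5 nibbles) + body (5 nibbles)
        stack.append(obj)  # D1=D1- 5 / DAT1=A A : push pointer to object
        # Looping back is the "third" PC=(A): just the next dispatch.

stack, jumps = run(mem, 0x1000)
print([hex(p) for p in stack], jumps)
```

In this toy, executing the one BINT costs two memory-indirect jumps, and the stack ends up holding a pointer to the object rather than the value itself.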

As for the Sys-RPL performance penalty: direct object execution involves *two* "PC=(A)" instructions. Each one has to read a 5-nibble address from memory, and memory accesses on the Saturn CPU are notoriously slow. Each one also performs a 5-nibble absolute jump, as the PC is set to the address just read from memory, and such absolute jumps on the Saturn CPU are slow too, although not as slow as memory accesses :-) On top of that, one must take into account all the other instructions that are executed when an RPL object is directly executed in the runstream, which also adds a lot of overhead.

( EDIT: The reason memory accesses are so slow on the Saturn CPU is not the Saturn CPU itself per se, but the Saturn bus. In the original discrete HP-71B Saturn chip, I believe that the Saturn bus interface was integrated onto the chip. On later Saturn-based SoCs like the Yorke, the Saturn bus ran at half the speed of the Saturn CPU itself, which slowed down memory accesses by about 2x. This, though, is only one aspect of the Saturn bus that contributes to the slowdown.

For an instruction like "PC=A", the Saturn CPU drives a "LOAD PC" command onto the Saturn bus, followed by a 5-cycle operation in which the CPU transfers the 5 nibbles of the new PC address, which the memory controllers load into their local PCs. There is then a command auto-switch to a "PC READ" command and a "dummy strobe" on the Saturn bus for memory pipelining.

For e.g. an "A=DAT0 W" instruction, the CPU first issues a "LOAD DP" command onto the Saturn bus and then performs a 5-cycle operation in which it successively drives 5 address nibbles onto the Saturn bus, which are latched by the memory controllers. There is then a command auto-switch to "DP READ", another 1-cycle "dummy strobe", and then the CPU reads 16 nibbles from the Saturn bus.

So, for the "PC=(A)" instruction, you have 1 cycle for the "LOAD DP" command, 1 cycle for the dummy strobe, 5 cycles to read the data, another 1 cycle for the "LOAD PC" command, 5 cycles for the CPU to transfer the new PC address to the memory controllers and then, finally, a 1-cycle dummy strobe, for a total of 14 cycles, not including instruction decode time. For the "A=DAT0 A" instruction, you have the initial 1-cycle "LOAD DP" bus command, 5 cycles to drive the 5 nibbles of the address, a 1-cycle dummy strobe and then 5 cycles for reading 5 data nibbles, for a total of 12 cycles, not counting instruction decode time.

If this is on the Yorke SoC, then the total cycle length associated with the Saturn bus is around 24 cycles, as the Saturn bus on the Yorke SoC only runs at 2 MHz. )
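
The cycle totals above are easy to re-add in a few lines of Python. This is just the arithmetic from the EDIT, not a bus model: the phase lists are copied as itemised there (so the "PC=(A)" list has no separate address-drive phase for its data read), and `bus_divider=2` is an assumption standing in for "the bus runs at half the CPU speed" on the Yorke.

```python
# Bus phases per instruction, in Saturn bus cycles, as itemised in the post.
PC_IND_A = [
    ("LOAD DP command", 1),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
    ("LOAD PC command", 1),
    ("transfer 5 PC nibbles", 5),
    ("dummy strobe", 1),
]

A_DAT0_A = [
    ("LOAD DP command", 1),
    ("drive 5 address nibbles", 5),
    ("dummy strobe", 1),
    ("read 5 data nibbles", 5),
]

def bus_cycles(phases):
    """Sum the bus cycles of all phases (instruction decode not included)."""
    return sum(cycles for _, cycles in phases)

def cpu_cycles(phases, bus_divider=1):
    """Bus cycles expressed in CPU cycles when the bus runs at CPU clock / bus_divider."""
    return bus_cycles(phases) * bus_divider

print("PC=(A)   :", bus_cycles(PC_IND_A), "bus cycles")
print("A=DAT0 A :", bus_cycles(A_DAT0_A), "bus cycles")
print("A=DAT0 A at half bus speed:", cpu_cycles(A_DAT0_A, bus_divider=2), "CPU cycles")
```

With these lists, the doubled "A=DAT0 A" figure is the one that matches the "around 24 cycles" quoted for the Yorke.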

Quote:I tried to answer by porting Valentin's program to the HP-32SII, emulated in Christoph's Emu42. So the comparison is done at constant CPU speed.

Here is the result, for 10,000 points as above test:
HP42S emulated on Emu42 1.24 : 5min08s (as above)
HP32SII emulated on Emu42 1.24 : 2min01s !

So although the 32SII RPN language is not as powerful for complex numbers as that of the 42S, the 32SII is 2.5x more efficient than the HP42S. I didn't expect so much difference; it's a surprise for me.
The difference is not as large on the real machines, since the CPU speed is probably reduced on the 32S. The 42S is supposed to run at 1 MHz; does somebody know the speed of the 32SII CPU?

AFAIK, the 32SII Saturn runs at about 640 kHz.
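
As a rough back-of-the-envelope check, here is what those numbers would mean on the real machines. It is a sketch that assumes run time scales inversely with the CPU clock and that the clocks really are 1 MHz and about 640 kHz; neither assumption has been measured here.

```python
def seconds(minutes, secs):
    return 60 * minutes + secs

# Emu42 timings quoted above (same emulated CPU speed for both models).
t_42s = seconds(5, 8)     # HP42S:   5min08s
t_32sii = seconds(2, 1)   # HP32SII: 2min01s

emulated_ratio = t_42s / t_32sii  # 32SII advantage at equal clock

# Assumed nominal clocks of the real machines.
clock_42s_hz = 1_000_000
clock_32sii_hz = 640_000

# A slower clock stretches the 32SII run time by clock_42s / clock_32sii,
# which shrinks its advantage by the inverse factor.
real_ratio = emulated_ratio * clock_32sii_hz / clock_42s_hz

print(f"emulated: {emulated_ratio:.2f}x, estimated real hardware: {real_ratio:.2f}x")
```

So under these assumptions the 32SII would still come out ahead on real hardware, by roughly 1.6x instead of 2.5x.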

Regards,

Jonathan

Aeternitas modo est. Longa non est, paene nil.