Threaded Mode | Linear Mode

Alejandro Paz(Germany) · (This post was last modified: 08-09-2017 07:31 AM by Alejandro Paz(Germany).)

I was wondering if someone made any inroads in extending the precision of the algorithms implemented in the woodstock or (could also be nut) saturn based machines.

Let me explain:

Say the word is 32 nibbles long instead of 14 or 16. I now that it needs a new processor, so to say. And the P register and handling needs extension, and so on.

Note: I am aware of the extended precision done to the 41, I read something about it here in the forums. But it not is implemented with extra long working registers, nor has more than 13 digits.

What are your thoughts about it ?

See it as an experiment.

I am working on this parallel version of the woodstock core, and was wondering if one could achieve better precision by extension of the word and fixing of the algorithms.
I wanted to test with square root, I think it is quite simple and I have an over look of it, it doesn't need extra routines (at least the saturn version).

Paul Dale · 08-09-2017, 08:32 AM

I suspect it would be better to produce new algorithm for the higher precision. Nobody is going to make an advanced NUT processor when there are a plethora of adequate processors available already.

Pauli

Massimo Gnerucci · 08-09-2017, 10:00 AM

(08-09-2017 08:32 AM)Paul Dale Wrote: I suspect it would be better to produce new algorithm for the higher precision. Nobody is going to make an advanced NUT processor when there are a plethora of adequate processors available already.

Pauli

Uh, really?

Paul Dale · 08-09-2017, 11:04 AM

The Newt is a fantastic piece of work. It is replicating the NUT with minimal extensions. The registers are the same size, the maths routine likewise. Sure, it's got lots of registers and heaps of memory but the precision isn't increased.

It also isn't a commodity processing -- i.e. the price is high.

Pauli

Massimo Gnerucci · 08-09-2017, 01:22 PM

(08-09-2017 11:04 AM)Paul Dale Wrote: The Newt is a fantastic piece of work. It is replicating the NUT with minimal extensions. The registers are the same size, the maths routine likewise. Sure, it's got lots of registers and heaps of memory but the precision isn't increased.

It also isn't a commodity processing -- i.e. the price is high.

Pauli

It perfectly fits the scope it was developed for.
I was questioning your Nobody is going to make an advanced NUT processor ;)

Alejandro Paz(Germany) · 08-10-2017, 05:58 AM

Quote:I suspect it would be better to produce new algorithm for the higher precision. Nobody is going to make an advanced NUT processor when there are a plethora of adequate processors available already.

I am trying to develop such a processor, a better Saturn if you want. FPGAs are very flexible !
I have other algorithms for ARM/MIPS processors.

Paul Dale · 08-10-2017, 07:42 AM

(08-09-2017 01:22 PM)Massimo Gnerucci Wrote: It perfectly fits the scope it was developed for.
I was questioning your Nobody is going to make an advanced NUT processor

But faster is easy. More accurate is quite another matter entirely....

Pauli

J-F Garnier · 08-10-2017, 07:48 AM

(08-10-2017 05:58 AM)Alejandro Paz(Germany) Wrote:
Quote:I suspect it would be better to produce new algorithm for the higher precision. Nobody is going to make an advanced NUT processor when there are a plethora of adequate processors available already.

I am trying to develop such a processor, a better Saturn if you want. FPGAs are very flexible !
I have other algorithms for ARM/MIPS processors.

Several years ago, I made similar investigations. Not really to get higher precision, but to provide higher speed and/or higher memory capacity.

For the HP41, I considered (in my Emu41) to implement a floating-point unit (using a unused Nut opcode), with the goal to simply re-write the system math routines. I ever made it (and will no more try) but some traces of my thoughts can still be found in my Emu41 sources (nutcpu.c).

For the HP71, I once considered to increase the width or the A address field from 5 to 6 nibbles, to increase the memory space. But the compatibility problems with the existing firmware were too important, and I gave up.
I also considered a FPU extension.
All I did in my Saturn emulation was (as you know) to add the R5-R7 registers (allowed by the opcode map) and increase the stack depth, although the HP71 firmware was not changed to take benefit of it.

Fell free to re-use these ideas!

J-F

Alejandro Paz(Germany) · 08-10-2017, 12:22 PM

Quote:For the HP71, I once considered to increase the width or the A address field from 5 to 6 nibbles, to increase the memory space. But the compatibility problems with the existing firmware were too important, and I gave up.
I also considered a FPU extension.
All I did in my Saturn emulation was (as you know) to add the R5-R7 registers (allowed by the opcode map) and increase the stack depth, although the HP71 firmware was not changed to take benefit of it.

I remember you talked about the extra R registers, I also wondered about the free slots, so to say, in the opcode map.
The increased stack would remove the need for that software stack implementation, it allows for extra 16 levels if I'm not mistaken. I just wonder how many times other software uses it (besides the ROM).

I don't want to add too many unnecessary opcodes but I think there are a couple of tricks being used that could benefit for an extra opcode here and there, like checking if the sign digit is a 9.

Paul Dale · 08-11-2017, 03:07 AM

A single opcode to do the RPL end sequence might be worthwhile. It would save two decode and execute cycles very often.

Pauli

Alejandro Paz(Germany) · 08-11-2017, 06:04 AM

And I'd really like a higher resolution 48 Smile

, not kidding. I have this nice 160x104 LCD sitting on my desk. Together with some 160x160, and that nice Sharp memory display at 400x240 (like what the DM42 has).

That is always the problem how to upgrade something without breaking everything that works.

On the speed:

I talk about the 1LF2 (HP-71):

One of the issues with the Saturn implementation is that with 4 bit memory width, you need as many cycles as nibbles to fetch the opcode. And then one extra cycle per calculated nibble.

With say 16 bit wide memory one could fetch many opcodes in one cycle, most in two. Extending the width of the internal alu to say 16 bits, that of course poses other "hurdles" on implementation but it is doable, one could execute word opcodes in 4 cycles, plus fetch.

A 64 bit parallel ALU could perform word opcodes in 1 cycle, needs 881 LUTs in a MachXO2, 4 A..D registers. Not bad.

There are a couple possibilities for incrementing the speed, rising the clock is one of them, but not the best without addressing memory bandwidth. The HP48 uses byte accesses, read-before-write for RAM. Lots of work, if one wants Smile

Claudio L. · 08-11-2017, 04:09 PM

(08-11-2017 03:07 AM)Paul Dale Wrote: A single opcode to do the RPL end sequence might be worthwhile. It would save two decode and execute cycles very often.

Pauli

That opcode was already done on the 49G+/50G internal Saturn emulator.
Other new opcodes worth adding are MOVEDN and MOVEUP for the memory copying routines, those are responsible for most of the speedup of the 50g vs 49g.

Alejandro Paz(Germany) · 09-14-2017, 11:38 AM

I have been busy doing the "extending the precision" of the algorithms, I coded a simulator for an extended version of the Saturn. I did change the encoding, P has now 5 bits instead of 4. And there are some extra registers.
Calculating the square root of 5 needs as you can guess double the amount of executed opcodes as in the original Saturn.

All the code to the experiment, called Parallel Neptune Core, can be found here.

The source file of the sqrt function with the equivalent Saturn code follows:

Code:

     1            ;

     2            ; Number format extended form

     3            ;

     4            ;

     5            ;                      | <------ A ------> |

     6            ;                              | <-- X --> |

     7            ;                              | XS|  

     8            ;      WS-1              4   3   2   1   0  

     9            ;      +---+--- ... ---+---+---+---+---+---+   

    10            ; A/C  | S |           | S | E | E | E | E |  

    11            ;      +---+--- ... ---+---+---+---+---+---+

    12            ;

    13            ;      +---+--- ... ---+---+---+---+---+---+

    14            ; B/D  | 0 |    WS - 2 digits significand  |

    15            ;      +---+--- ... ---+---+---+---+---+---+

    16            ;

    17            ; 9 as Significand sign means negative number

    18            ; Digits 5..9 as exponent sign means negative exponent

    19            

    20  000000 2000001E          loadp       #30       ;     P=      14

    21  000001 31300005          ldn         b,#5      ;     LC(1)  #5

    22                    

    23            

    24            ;

    25            ; SQRT 

    26            ;

    27            ; A exponent in A field

    28            ; B significand in m field

    29            ;

    30            ;

    31            ;

    32            ;

    33            ;

    34            ;

    35            ;

    36            

    37            SQRT_Full:                    ;

    38                                          ; =SQR15

    39                                          ;     XM=0

    40                                          ;     SB=0

    41                                          ;     SETHEX

    42                                          ;     A=A+1   XS

    43                                          ;     A=A-1   XS

    44                                          ;     SETDEC

    45                                          ;     GONC    =SQR17

    46                                          ;     ?A#0    B

    47                                          ;     RTNYES

    48                                          ;     ?A#0    S

    49                                          ;     GOYES   L_0C563

    50                                          ;     RTN

    51                                          ; *********************

    52                                          ; =SQR17

    53                                          ;     B=0     S

    54  000002 19431F00          mov.w       c,b       ;     C=B     W

    55                                          ;     ?B=0    W

    56                                          ;     GOYES   =SQR70

    57                                          ;     ?A=0    S

    58                                          ;     GOYES   L_0C56D

    59                                          ; L_0C563

    60                                          ;     P=      0

    61                                          ;     LC(2)  #A

    62                                          ;     GOTO    =INVNaN

    63                                          ; L_0C56D

    64  000003 08331F00          add.d.w     b,b       ;     B=B+B   W

    65  000004 08331F00          add.d.w     b,b       ;     B=B+B   W

    66  000005 08341F00          add.d.w     b,c       ;     B=B+C   W

    67  000006 19420400          mov.a       c,a       ;     C=A     A

    68  000007 08440400          add.d.a     c,c       ;     C=C+C   A

    69  000008 19401F00          mov.w       c,0       ;     C=0     M

    70  000009 5300000B          jnc         SQRT_1    ;     GONC    L_0C583

    71  00000A 0E491F03          not.d.m     c         ;     C=-C-1  M

    72            SQRT_1:                       ; L_0C583

    73  00000B 19420400          mov.a       c,a       ;     C=A     A

    74  00000C 19241F03          mov.m       a,c       ;     A=C     W

    75  00000D 08441F00          add.d.w     c,c       ;     C=C+C   W

    76  00000E 08441F00          add.d.w     c,c       ;     C=C+C   W

    77  00000F 08241F00          add.d.w     a,c       ;     A=A+C   W

    78  000010 20000000          loadp       #0        ;     P=      0

    79  000011 1120FFFF          neq.p       a,0       ;     ?A#0    P

    80  000012 54000014          jt          SQRT_2    ;     GOYES   L_0C59B

    81  000013 1C301F00          sr.d.w      b         ;     BSR     W

    82            SQRT_2:                       ; L_0C59B

    83  000014 1C201F00          sr.d.w      a         ;     ASR     W

    84  000015 19401F00          mov.w       c,0       ;     C=0     W

    85  000016 2000001E          loadp       #30       ;     P=      14

    86  000017 31400005          ldn         c,#5      ;     LC(1)  #5

    87  000018 1A341F00          ex.w        b,c       ;     BCEX    W

    88            SQRT_3:                       ; L_0C5A9

    89  000019 1C30FF00          sr.d.wp     b         ;     BSR     WP

    90  00001A 0D31FFFF          sub.d.p     b,#1      ;     B=B-1   P

    91            SQRT_4:                       ; L_0C5AF

    92  00001B 0931FFFF          add.d.p     b,#1      ;     B=B+1   P

    93  00001C 0C431F00          sub.d.w     c,b       ;     C=C-B   W

    94  00001D 5300001B          jnc         SQRT_4    ;     GONC    L_0C5AF

    95  00001E 08431F00          add.d.w     c,b       ;     C=C+B   W

    96  00001F 1D401F00          sl.d.w      c         ;     CSL     W

    97  000020 24000000          decp                  ;     P=P-1

    98  000021 53000019          jnc         SQRT_3    ;     GONC    L_0C5A9

    99                                          ; *********************

   100            SQRT_5:                       ; =SQR70

   101                                          ;     SB=0

   102            SQRT_6:                       ; L_0C5C6

   103  000022 10401F00          eq.w        c,0       ;     ?C=0    W

   104  000023 54000026          jt          SQRT_7    ;     GOYES   L_0C5D1

   105  000024 0D410200          sub.d.x     c,#1      ;     C=C-1   X

   106  000025 1C400200          sr.d.x      c         ;     CSR     X

   107            SQRT_7:                       ; L_0C5D1

   108  000026 41000000          stop                  ;     RTNCC

Alejandro Paz(Germany) · (This post was last modified: 09-14-2017 11:50 AM by Alejandro Paz(Germany).)

In the repository here there is a Parallel Saturn core, written in verilog. It is still a work in progress. I just coded, after much thought, a pre-fetching BUS controller.
The Parallel Saturn should improve the throughput in three different ways:

Increasing memory bandwidth for fetch: I opted for a 16 bit memory width, some opcodes are 2 nibbles long, 2 optimally positioned opcodes could be fetched at once. This condition is extra checked.

Increasing memory bandwidth for data access: again 16 bit accesses should improve transfers. Alignment issues arise here and read-before-write cycles are needed.

Increasing the width of the ALU path: Using 64 bit registers at once, while limits the maximum frequency on the target FPGA (<10 MHz), should provide abundant improvement on long executing opcodes.

The ALU is mostly coded, the bus controller is partially coded, no data accesses yet. The rest is a rehash of my nibble-serial (but fully working) 1LF2 implementation.

Let's see if I can fully realize this other project Smile

. Even at 2 MHz, it should be many times faster than a Yorke, let's see.

sa-penguin · 09-23-2017, 07:48 AM

Oddly enough, I was looking at extended precision math just the other week - on a Z80.
Extended precision data was held in data registers, and the number of registers to use was the main variable (if set to zero, only the Stack registers are used).

I appreciate you guys are talking about CPU level instructions. I'm sill thinking about the higher levels (like how to display the result). Extending the precision by a single register should give proof of concept, check the new math routines, yet not slow down processing too much.

Start with divide by 3, and square root of 5, before jumping to : asin(acos(atan(sin(cos(tan(1/9))))))

Alejandro Paz(Germany) · (This post was last modified: 09-24-2017 08:06 AM by Alejandro Paz(Germany).)

Quote:Start with divide by 3, and square root of 5, before jumping to : asin(acos(atan(sin(cos(tan(1/9))))))

That's why I used 5 as an example for the square root Smile

.

Anecdotally, some TI calculators like the TI-82, TI-83, TI-84, TI-85 and TI-86 among others use a Z80 as main processor. They can be divided in two groups lower and upper end. Both groups use very similar routines, the difference lays in the precision of the algorithms used, really minimal, like 2-4 digits and larger exponents.

But, the interesting part is that the numbers are stored in memory and used in place. The Z80 has a full complement of BCD friendly opcodes add, sub, daa (for addition and subtraction), and 4 bit shifts between accumulator and memory !, and many 16 bit pointers. All these resources are very well exploited in the mentioned models.

The whole math group of routines in the Saturn takes like 4 kbytes of memory, in the case of the Z80 (like in the TI-8x but not the 89) it takes, if I'm not mistaken, like 10 kbytes with the inflexibility of having the numbers always fixed in memory.

One could tailor such routines to use BC, DE, and HL but that means only 6 digits. I think that the way the TIs handle the whole is quite clever. That is a point where the Z80 actually excels in comparison with other more modern processors. Always talking about packed BCD. Nowadays one can achieve quite a bit more performance using base 100 (like I did), base 10000 (like the unix command bc) or greater bases like newRPL and others. Something doable when you have a relatively fast division and multiplication instructions, something totally lacking in the Z80. (The Z180 has a mul instruction) and memory constraints are not as severe as they once were !.

Another anecdote: Some like 20 years ago, I decided to use the H8/300H (A then Hitachi processor much like but not compatible with the MC68k) as the basis for my handheld calculator. This processor has 8 32 bit registers, as you can imagine I used the registers to temporarily contain the fractions of the floating point registers as I performed calculations, limited to 16 packed BCD digits. My target speed was like 8 MHz and that made kind of sense. Today, I'd go with a greater base, it just makes more sense with a RISC processor.

I also developed an AVR based BCD four function package with 16 digits precision. The AVR doesn't help at all with BCD Sad

and unrolled routines needed quite a bit of space.

Alejandro Paz(Germany) · (This post was last modified: 09-28-2017 12:42 PM by Alejandro Paz(Germany).)

I have updated the Parallel Saturn core. At this point many instructions have been implemented. Some instructions are still missing like the ones dealing with device configuration, shutdown, interrupts. Memory read is partially implemented and memory write is still missing. But there is enough to let the synthesizer give us an idea of how big and fast it is. Big and slow for a MachXO2-7000ZE1

Code:

Design Summary

   Number of registers:   1720 out of  7209 (24%)

      PFU registers:         1718 out of  6864 (25%)

      PIO registers:            2 out of   345 (1%)

   Number of SLICEs:      3076 out of  3432 (90%)

      SLICEs as Logic/ROM:   3073 out of  3432 (90%)

      SLICEs as RAM:            3 out of  2574 (0%)

      SLICEs as Carry:        309 out of  3432 (9%)

   Number of LUT4s:        6151 out of  6864 (90%)

      Number used as logic LUTs:        5527

      Number used as distributed RAM:     6

      Number used as ripple logic:      618

      Number used as shift registers:     0

   Number of PIO sites used: 46 + 4(JTAG) out of 115 (43%)

   Number of block RAMs:  16 out of 26 (62%)

Report:    3.212MHz is the maximum frequency for this preference.

90 % of this FLGA is pretty big, and the maximum speed is well, slow.

There are a couple of important points to consider:

The 64-bit computations, ADD, SUB, Logic and a mux are performed between two consecutive flanks of the clock. I haven't attempted any optimization here beyond latching both source arguments (Line 241..248, saturn_alru.v). The slowest path goes through the subtract unit, as expected.

But clock speed doesn't tell the whole story, if we do not know how many clocks are needed for an opcode. Small opcodes like P=n take 4 clocks to complete but fetch can take up to 3 clocks. While the per-fetcher does a pretty good work it can be greatly improved. Jumps take 2 clocks plus fetch. An unaligned opcode needs at least 2 extra clocks for fetch.

For comparison, the nibble serial implementation on the same FPGA can achieve 16 MHz. But opcodes need many more cycles:

Code:

Opcode | Serial | Parallel

-------|--------|---------

B=B+C W|     27 |    3+4

GONC   |      9 |    3+2

27 cycles @ 16 Mhz = 1.68 us
7 cycles @ 3.2 Mhz = 2.18 us

9 cycles @ 16 Mhz = 0.5625 us
5 cycles @ 3.2 Mhz = 2.18 us
[/code]

I think that without improving the pre-fetcher it will be difficult to see any gain Sad

.
Fetch and subsequent execute and streamlining of the fetch state machine. At least one clock can be dropped as ST_INIT is not really needed, imho.
The ALU path can also be extended to two cycles allowing to double the frequency.