Alejandro Paz(Germany) · (This post was last modified: 09-28-2017 12:42 PM by Alejandro Paz(Germany).)

I have updated the Parallel Saturn core. At this point many instructions have been implemented. Some are still missing, such as the ones dealing with device configuration, shutdown, and interrupts. Memory read is partially implemented and memory write is still missing. But there is enough to let the synthesizer give us an idea of how big and fast the core is: big and slow for a MachXO2-7000ZE1.

Code:

Design Summary
   Number of registers:   1720 out of  7209 (24%)
      PFU registers:         1718 out of  6864 (25%)
      PIO registers:            2 out of   345 (1%)
   Number of SLICEs:      3076 out of  3432 (90%)
      SLICEs as Logic/ROM:   3073 out of  3432 (90%)
      SLICEs as RAM:            3 out of  2574 (0%)
      SLICEs as Carry:        309 out of  3432 (9%)
   Number of LUT4s:        6151 out of  6864 (90%)
      Number used as logic LUTs:        5527
      Number used as distributed RAM:     6
      Number used as ripple logic:      618
      Number used as shift registers:     0
   Number of PIO sites used: 46 + 4(JTAG) out of 115 (43%)
   Number of block RAMs:  16 out of 26 (62%)

Report:    3.212MHz is the maximum frequency for this preference.

90 % of this FPGA is pretty big, and the maximum speed is, well, slow.

There are a couple of important points to consider:

The 64-bit computations (ADD, SUB, logic, and a mux) are performed between two consecutive clock edges. I haven't attempted any optimization here beyond latching both source arguments (lines 241..248, saturn_alru.v). The slowest path goes through the subtract unit, as expected.
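To put the reported maximum frequency into perspective, it can be converted into the time budget the slowest path consumes per clock. A minimal Python sketch; the function name is made up for illustration, and the 16 MHz figure is the nibble serial result quoted further down for the same FPGA:

```python
def period_ns(f_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_mhz

parallel_ns = period_ns(3.212)  # maximum frequency from the synthesis report
serial_ns = period_ns(16.0)     # nibble serial core on the same FPGA

print(f"parallel core: {parallel_ns:.1f} ns per clock")
print(f"serial core:   {serial_ns:.1f} ns per clock")
print(f"ratio:         {parallel_ns / serial_ns:.2f}x")
```

So the single-cycle 64-bit path needs a bit more than 311 ns, roughly five times the 62.5 ns period of the serial core.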

But clock speed doesn't tell the whole story if we do not know how many clocks an opcode needs. Small opcodes like P=n take 4 clocks to complete, but fetch can take up to 3 clocks. While the pre-fetcher does a pretty good job, it can be greatly improved. Jumps take 2 clocks plus fetch. An unaligned opcode needs at least 2 extra clocks for fetch.

For comparison, the nibble serial implementation on the same FPGA can achieve 16 MHz. But opcodes need many more cycles:

Code:

Opcode | Serial | Parallel
-------|--------|---------
B=B+C W|     27 |    3+4
GONC   |      9 |    3+2

27 cycles @ 16 MHz = 1.6875 us
7 cycles @ 3.2 MHz = 2.1875 us

9 cycles @ 16 MHz = 0.5625 us
5 cycles @ 3.2 MHz = 1.5625 us
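The per-opcode times are just cycles divided by clock frequency, so they are easy to recompute. A small Python sketch of that arithmetic; reading the "3+4" notation as fetch clocks plus execute clocks is my assumption, and all names are for illustration only:

```python
SERIAL_MHZ = 16.0    # nibble serial core
PARALLEL_MHZ = 3.2   # parallel core, rounded from the synthesis report

# (opcode, serial cycles, parallel fetch clocks, parallel execute clocks)
OPCODES = [
    ("B=B+C W", 27, 3, 4),
    ("GONC", 9, 3, 2),
]

def exec_time_us(cycles: int, f_mhz: float) -> float:
    """Execution time in microseconds: cycles / MHz."""
    return cycles / f_mhz

def compare(opcodes=OPCODES):
    """Return (name, serial time, parallel time) in microseconds per opcode."""
    rows = []
    for name, serial, fetch, execute in opcodes:
        t_serial = exec_time_us(serial, SERIAL_MHZ)
        t_parallel = exec_time_us(fetch + execute, PARALLEL_MHZ)
        rows.append((name, t_serial, t_parallel))
    return rows

for name, t_s, t_p in compare():
    print(f"{name:8s} serial {t_s:.4f} us   parallel {t_p:.4f} us")
```

For both opcodes the parallel core comes out slower despite needing far fewer clocks.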

I think that without improving the pre-fetcher it will be difficult to see any gain :(

Candidates for improvement are fetch and the subsequent execute, and streamlining of the fetch state machine. At least one clock can be dropped, as ST_INIT is not really needed, imho.
The ALU path can also be extended to two cycles, which should allow the frequency to be doubled.
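To get a feel for what these two changes might buy, here is a back-of-the-envelope projection in Python. Everything in it is an assumption, not a measurement: that dropping ST_INIT saves exactly one fetch clock, that the two-cycle ALU path costs one extra execute clock for ALU opcodes only, and that the clock really doubles from 3.2 MHz to 6.4 MHz. The names are hypothetical too.

```python
PROJECTED_MHZ = 6.4  # assumption: twice the current 3.2 MHz

# opcode -> (fetch clocks, execute clocks, uses the ALU path?)
CURRENT = {
    "B=B+C W": (3, 4, True),
    "GONC": (3, 2, False),
}

def projected_time_us(fetch: int, execute: int, uses_alu: bool,
                      f_mhz: float = PROJECTED_MHZ) -> float:
    """Projected opcode time in microseconds under the stated assumptions."""
    fetch -= 1            # assumption: ST_INIT removed, one fetch clock saved
    if uses_alu:
        execute += 1      # assumption: ALU result takes one more clock
    return (fetch + execute) / f_mhz

projection = {name: projected_time_us(*clocks) for name, clocks in CURRENT.items()}

for name, t in projection.items():
    print(f"{name:8s} projected {t:.5f} us")
```

If those assumptions held, B=B+C W would land around 1.09 us and GONC around 0.63 us. Only synthesis and a real cycle count can confirm that.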