11-06-2022, 01:56 PM
Recently started a new Forth project. This time for the Sharp PC-850(V)(S). Perhaps this is interesting to some of you.
Repository: https://github.com/Robert-van-Engelen/Forth850
It's about 8K in size plus 8K free space or more, written from scratch in Z80 assembly and Forth itself. It's an ANS/standard compliant, directly threaded code (DTC) Forth implementation. Words are case insensitive.
No file IO and no floating point yet (no syscalls are published as far as I know), but may look into floating point later, e.g. to use a Z80 32 bit float library, if it fits. (EDIT: floating point is included with the latest version).
Why write from scratch? Well, firstly well-known Z80 Forth implementations like CamelForth, eForth and Jupiter Ace aren't ANS Forth compliant and lack features. Secondly, it's more fun to write from scratch. The Moving Forth article was helpful, but I've decided to use DE as TOS, BC as IP (instruction pointer) and RP (return stack pointer) in memory. It's faster that way, since DE (TOS) and HL (scratch) are exchangeable in only 4 cycles and HL is a versatile register pair. Even faster would be to implement subroutine threaded code (STC) to execute Z80 code directly without DTC fetch+execute, but that will require 50% more space on average for Forth programs. Nice for speed, but perhaps not so much for usability on a small 32K machine.
With a bit of time tinkering, I came up with efficient Z80 code for Forth words, more efficient in some cases than CamelForth, eForth and Jupiter Ace (the latter isn't particularly fast by the way). If someone can point to more efficient Z80 Forth DTC implementations, then please let me know.
For example, my signed 16x16->16 multiplication takes only 51 cycles per bit versus 62 cycles per bit published by Zilog. I'm using the ASZ80 macro assembler with the CODE macro defining a Forth word "*" in machine code with label "star":
EDIT: even faster multiplication is possible. To calculate the high order byte we do not need to iterate over all 8 bits of the high order multiplier stored in register c, but only over the nonzero bits. We also can ignore the lower order result stored in register e. This reduces the max loop iteration cycle time to 32 and 33 per bit. Furthermore, the second loop only runs until the last bit of register c is shifted out. If register c is zero, the second loop does not execute thereby saving hundreds of cycles. We also use jp instead of jr to improve and balance the cycle time per bit:
At the heart of Forth parsing are my new CHOP and TRIM words in assembly that exploit special Z80 ops, which I found to be better suitable and more efficient than CamelForth's approach to parsing. The CHOP and TRIM words:
My case-insensitive dictionary search FIND-WORD should speed up Forth compilation by quite a bit:
The Z80 is alright for Forth (I've used the Z80 back in the 80s and loved it), but not as good as the 6809 and the ESR-L (PC-E500) that have two stack registers. What a luxury compared to the Z80!
- Rob
Repository: https://github.com/Robert-van-Engelen/Forth850
It's about 8K in size plus 8K free space or more, written from scratch in Z80 assembly and Forth itself. It's an ANS/standard compliant, directly threaded code (DTC) Forth implementation. Words are case insensitive.
No file IO and no floating point yet (no syscalls are published as far as I know), but may look into floating point later, e.g. to use a Z80 32 bit float library, if it fits. (EDIT: floating point is included with the latest version).
Why write from scratch? Well, firstly well-known Z80 Forth implementations like CamelForth, eForth and Jupiter Ace aren't ANS Forth compliant and lack features. Secondly, it's more fun to write from scratch. The Moving Forth article was helpful, but I've decided to use DE as TOS, BC as IP (instruction pointer) and RP (return stack pointer) in memory. It's faster that way, since DE (TOS) and HL (scratch) are exchangeable in only 4 cycles and HL is a versatile register pair. Even faster would be to implement subroutine threaded code (STC) to execute Z80 code directly without DTC fetch+execute, but that will require 50% more space on average for Forth programs. Nice for speed, but perhaps not so much for usability on a small 32K machine.
With a bit of time tinkering, I came up with efficient Z80 code for Forth words, more efficient in some cases than CamelForth, eForth and Jupiter Ace (the latter isn't particularly fast by the way). If someone can point to more efficient Z80 Forth DTC implementations, then please let me know.
For example, my signed 16x16->16 multiplication takes only 51 cycles per bit versus 62 cycles per bit published by Zilog. I'm using the ASZ80 macro assembler with the CODE macro defining a Forth word "*" in machine code with label "star":
Code:
CODE *,star
push de ; save TOS
exx ; save bc with ip
pop de ; n2->de
pop bc ; n1->bc
ld hl,0 ; 0->hl
ld a,c ; c->a low order byte of n1
ld c,b ; b->c save high order byte of n1
ld b,8 ; 8->b loop counter
1$: rra ; 4 ; loop, a>>1->a set cf
jr nc,2$ ; 7 ; if cf=1 then
add hl,de ; 11 ; hl+de->hl
2$: sla e ; 8 ;
rl d ; 8 ; de<<1->de
djnz 1$ ; 13(51); until --b=0
ld a,c ; c->a high order byte of n1
ld b,8 ; 8->b loop counter
3$: rra ; 4 ; loop, a>>1->a set cf
jr nc,4$ ; 7 ; if cf=1 then
add hl,de ; 11 ; hl+de->hl
4$: sla e ; 8 ;
rl d ; 8 ; de<<1->de
djnz 3$ ; 13(51); until --b=0
push hl ; save hl with product as TOS
exx ; restore bc with ip
pop de ; pop new TOS
JP_NEXT ; continue
EDIT: even faster multiplication is possible. To calculate the high order byte we do not need to iterate over all 8 bits of the high order multiplier stored in register c, but only over the nonzero bits. We also can ignore the lower order result stored in register e. This reduces the max loop iteration cycle time to 32 and 33 per bit. Furthermore, the second loop only runs until the last bit of register c is shifted out. If register c is zero, the second loop does not execute thereby saving hundreds of cycles. We also use jp instead of jr to improve and balance the cycle time per bit:
Code:
CODE *,star
push de ; save TOS
exx ; save bc with ip
pop de ; n2 -> de
pop bc ; n1 -> bc
ld hl,0 ; 0 -> hl
ld a,c ; c -> a low order byte of n1
ld c,b ; b -> c save high order byte of n1
ld b,8 ; 8 -> b loop counter
1$: rra ; 4 ; loop, a >> 1 -> a set cf
jr nc,2$ ; 7 ; if cf = 1 then
add hl,de ; 11 ; hl + de -> hl
2$: sla e ; 8 ;
rl d ; 8 ; de << 1 -> de
djnz 1$ ; 13(51); until --b = 0
ld a,h ; h -> a do high order, low order is done
jr 5$ ; jump to shift c and loop
3$: add d ; 4 ; loop, a + d -> a
4$: sla d ; 8 ; d << 1 -> d
5$: srl c ; 8 ; c >> 1 -> c set cf and z if no bits left
jr c,3$ ; 12/7(32); until cf = 0 repeat with addition
jp nz,4$ ; 10(33); until c = 0 repeat without addition
ld h,a ; a -> h
push hl ; save hl with product as TOS
exx ; restore bc with ip
pop de ; pop new TOS
JP_NEXT ; continue
At the heart of Forth parsing are my new CHOP and TRIM words in assembly that exploit special Z80 ops, which I found to be better suitable and more efficient than CamelForth's approach to parsing. The CHOP and TRIM words:
Code:
; CHOP c-addr u1 char -- c-addr u2
; truncate string up to matching char
; leaves string if char not found
; char=0x20 (bl) chops 0x00 to 0x20 (white space)
CODE CHOP,chop
ld a,e ; char->a
exx ; save bc with ip
ex af,af' ; save a with char
pop bc ; pop u1->bc
ld e,c ;
ld d,b ; u1->de
ld a,c ;
or b ; test bc=0, 0->cf
jr z,2$ ; if bc=0 then not found
pop hl ;
push hl ; c-addr->hl
ex af,af' ; restore a with char
cp 0x20 ;
jr z,3$ ; if a=0x20 then find white space
or a ; 0->cf not found
; find char in string
cpir ; repeat until a=[hl++] or --bc=0
jr nz,2$ ; if match then
1$: ccf ; complement cf when found, to correct cpi bc--
2$: ex de,hl ; u1->hl
sbc hl,bc ; u1-bc-cf->hl
push hl ; save hl as TOS
exx ; restore bc with ip
pop de ; pop new TOS
JP_NEXT ; continue
; find white space in string
3$: cp (hl) ; 7 ; loop to compare a to [hl]
cpi ; 16 ; hl++,bc--
jr nc,1$ ; 7 ; if [hl]<a then found
jp pe,3$ ; 10 ; until bc=0
jr 1$ ; not found
; TRIM c-addr1 u1 char -- c-addr2 u2
; trim initial chars
; char=0x20 (bl) trims 0x00 to 0x20 (white space)
CODE TRIM,trim
ld a,e ; char->a
exx ; save bc with ip
pop bc ; u1->bc
pop hl ; c-addr1->hl
1$: ex af,af' ; save a
ld a,c ;
or b ;
jr z,3$ ; if bc<>0 then
ex af,af' ; restore a
2$: cpi ; 16 ; loop
jr nz,4$ ; 7 ; while a=[hl++],--bc
jp pe,2$ ; 10 ; until b=0
3$: push hl ; save hl as 2OS
push bc ; save bc as TOS
exx ; restore bc with ip
pop de ; pop new TOS
JP_NEXT ; continue
4$: cp 0x20 ;
jr nz,5$ ; if char=0x20 then
dec hl ;
cp (hl) ;
inc hl ;
jr nc,1$ ; if [hl-1]<=0x20 then keep trimming
5$: inc bc ; correct bc++ for cpi match
dec hl ; correct hl-- for cpi match
jr 3$ ; finalize trimming
My case-insensitive dictionary search FIND-WORD should speed up Forth compilation by quite a bit:
Code:
; FIND-WORD c-addr u -- c-addr 0 | xt 1 | xt -1
; search dictionary for matching word
; leaves execution token and 1=immediate or -1=not immediate
; leaves c-addr and 0 when not found
CODE FIND-WORD,findword
ld a,d ;
or a ; test d=0 high order byte of u
jp nz,zero_next ; if u is too large then set new TOS to 0
sla e ; shift u to compare w/o immediate bit
jp c,zero_next ; if u is too large then set new TOS to 0
jp z,zero_next ; if u=0 then set new TOS to 0
push de ; save de with 2*u
exx ; save bc with ip
pop bc ; pop 2*u->bc
pop de ; pop c-addr->de
ld hl,(context+3) ; CONTEXT->hl
jr 3$ ; start searching
; loop over dictionary
1$: pop de ; restore de with c-addr
2$: pop hl ; 10 ; loop, restore hl with lfa
3$: ld a,(hl) ; 7 ;
inc hl ; 6 ;
ld h,(hl) ; 7 ;
ld l,a ; 4 ; [hl]->hl follow link at hl=lfa
or h ; 4 ;
jr z,6$ ; 7 ; if hl=0 then not found
push hl ; 11 ; save hl with lfa
inc hl ; 6 ;
inc hl ; 6 ; hl+2->hl with nt (nfa)
ld a,(hl) ; 7 ; word length
add a ; 4 ; shift away immediate bit
cp c ; 4 ; test a=c word length match (both shifted)
jr nz,2$ ; 12(95); if lengths differ then continue searching
; compare string to word
push de ; save de with c-addr
inc hl ; hl++ point to nfa chars
ld b,c ; 2*u->b
srl b ; u->b word length (nonzero)
; loop over word chars
4$: ld a,(de) ; 7 ; loop
cp (hl) ; 7 ; compare [de]=[hl]
jr z,5$ ; 12/7 ; if mismatch then
and 0xdf ; 7 ; make upper case
cp 'A ; 7 ;
jr c,1$ ; 7 ; if a<'A' then continue search
cp 'Z+1 ; 7 ;
jr nc,1$ ; 7 ; if a>'Z' then continue search
xor (hl) ; 7 ;
and 0xdf ; 7 ; case insensitive compare [de]=[hl]
jr nz,1$ ; 7 ; if mismatch then continue search
5$: inc de ; 6 ; de++ point to next char of c-addr
inc hl ; 6 ; hl++ point to next char of word
djnz 4$ ; 13(51/102);until --b=0
; found a matching word
pop de ; discard saved c-addr
ex (sp),hl ; save hl with xt as 2OS, restore hl with lfa
inc hl ;
inc hl ; hl+2->hl with nt (nfa)
bit immediate_bit,(hl) ; test immediate bit of [hl] word length
exx ; restore bc with ip
jp nz,one_next ; set new TOS to 1 if word is immediate
jp mone_next ; set new TOS to -1
; not found
6$: push de ; save de with c-addr as 2OS
exx ; restore bc with ip
jp zero_next ; set new TOS to 0
The Z80 is alright for Forth (I've used the Z80 back in the 80s and loved it), but not as good as the 6809 and the ESR-L (PC-E500) that have two stack registers. What a luxury compared to the Z80!
- Rob