A stack-machine CPU with a quirky 18-bit architecture, running on Tang Nano 9K

Tugman SOC

Tugman CPU is an experimental 18-bit stack machine loosely based on James Bowman's J1 CPU. It is a single-cycle stack machine with mostly VLIW instructions.

This is a Tang Nano SOC for experimenting with a Tugman processor.

There is also a fasmg-based assembler, implemented as macros (include tugman_syntax.asm). It is not great, but it is capable of outputting code. fasmg must be installed, and the fasmg command must invoke the proper binary from the command line.

Edit code.asm, then run make to assemble code.hex first.

Build using the GOWIN IDE; yosys will build incorrectly.

Configure using openFPGALoader, make load will do it.

Connect a terminal at 38400 baud.

The Initial Program Loader displays TUGMAN and waits for a binary upload. Send it a sequence of little-endian DWORDS, starting with the load address and the count of DWORDS to follow. See test.asm and Makefile for examples of how to assemble Tugman code and make a binary.
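The upload framing described above can be sketched on the host side in Python. This is only an illustration: `build_upload` is a hypothetical helper, and it assumes each DWORD is an unsigned 32-bit value carrying one 18-bit word in its low bits. The Makefile remains the reference for how the real binary is produced.

```python
import struct

WORD_MASK = (1 << 18) - 1  # Tugman words are 18 bits wide


def build_upload(load_address, words):
    """Pack load address, word count, then the words themselves.

    Assumption: every DWORD is 32 bits, little-endian, with the
    18-bit payload in the low bits and the upper bits zero.
    """
    payload = [load_address, len(words)] + [w & WORD_MASK for w in words]
    return struct.pack("<%dI" % len(payload), *payload)


image = build_upload(0x100, [1, 2])
print(image.hex())
```

The resulting bytes would then be sent over the serial link at 38400 baud once the loader has printed TUGMAN.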

test.asm now contains a hex dump, and a routine to convert keystrokes into hex digits. Soon, a simple monitor.

Prerequisites:

  • GOWIN IDE
  • openFPGALoader
  • fasmg

Rationale

Stack machines are conceptually fun, but in practice require a lot of instructions to do simple things, often leading to an unpleasant coding experience.

An ALU with a selection of sources allows complex operations on memory, registers or input values.

Its quirky instruction set begs for fun opportunistic optimization of code!

Architecture

This is a very quirky CPU. On the one hand it is a minimalist stack machine. On the other, it can perform multiple operations in a single cycle, including ALU operations directly on registers or memory or IO read results, issue simultaneous memory reads, adjust stacks, and even return.

The architecture begs for opportunistic optimization.

Registers

There are two stacks: a return stack and a data stack.

  • TOS = top of datastack

  • NOS = second item on datastack

  • TOR = top of return stack

  • ALU_B = source of ALU operation, also memory address

  • IP = instruction pointer

  • C = carry flag

  • Z = zero flag

Memory

Up to 8K of memory is used. The memory is 18 bits wide and dual-ported.

Instructions are read from the instruction port (read-only).

Data reads are issued on the data port every non-write cycle from ALU_B and the result is available for ALU instructions next cycle. Writes happen immediately from TOS to [ALU_B], and the data written is available from the memory port next cycle.

Make sure to consume memory data on exactly the cycle after ALU_B was set from a valid address source. A read is issued on every non-write cycle, so most memory reads return nonsense!
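The one-cycle read latency can be pictured with a small Python model. It is a sketch under assumptions, not the actual RTL: `DataPort` is a hypothetical class, the memory is taken to be 8192 words, and addresses are assumed to wrap.

```python
WORD_MASK = (1 << 18) - 1


class DataPort:
    """Toy model of the data port: every access becomes visible next cycle."""

    def __init__(self, size=8192):  # size is an assumption
        self.mem = [0] * size
        self.out = 0  # value the ALU sees as MEM during this cycle

    def cycle(self, alu_b, write=None):
        """Run one cycle; return the value MEM presents in this cycle."""
        visible = self.out              # result of the previous cycle's access
        addr = alu_b % len(self.mem)
        if write is not None:
            self.mem[addr] = write & WORD_MASK  # TOS -> [ALU_B]
        self.out = self.mem[addr]       # read (or written data) shows up next cycle
        return visible


port = DataPort()
port.mem[5] = 42
stale = port.cycle(5)   # read issued; MEM still holds old data
fresh = port.cycle(0)   # the value from address 5 arrives now
print(stale, fresh)
```

In this model the value fetched from address 5 is only valid in the second cycle; one cycle later it has already been replaced by whatever ALU_B addressed next.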

Instruction Set Architecture

The instructions are mostly undecoded, VLIW-style.

  XX_...._...._...._....   family  (10=lits, 00=jmps, 01=ALU)
  1X_XXXX_XXXX_XXXX_XXXX   literal
  00_JJJd_oooo_oooo_oooo   jmp (JJJ=call,jmp,jz,jnz,jc,jnc) drop,offset
  01_N..._...._...._....   Negate TOS before ALU operation (for -)
  01_.C.._...._...._....   Carry on
  01_..XX_X..._...._....   ALU op (+/-,  &, |, ^, portB, >>)
  01_...._.XXX_...._....   B mux  (TOS, NOS, TOR, IO, MEM, 1,-1)
  01_...._...._X..._....   return
  01_...._...._.XXX_....   write control (nothing,NOS,TOR,mem,IO)
  01_...._...._...._XX..   RSP inc (signed)
  01_...._...._...._..XX   DSP inc (signed)

There are 3 types of instructions:

  • LIT Literal load into TOS, DSP adjusted automatically;
  • Control transfer -- calls and jumps may be conditional on Z and C
  • ALU operations
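As a rough aid for reading the layout above, the bit positions can be written out in Python. The function names are hypothetical, and the literal family is taken to be any word with the top bit set, following the `1X_...` row of the layout.

```python
def family(word):
    """Classify an 18-bit instruction word by its top bits."""
    if word & (1 << 17):
        return "literal"                      # 1X_...
    return "alu" if word & (1 << 16) else "jump"  # 01 = ALU, 00 = jump


def encode_alu(op, b_mux, negate=0, carry=0, ret=0, write=0, rsp=0, dsp=0):
    """Pack ALU-family fields at the positions drawn in the layout."""
    return ((0b01 << 16) | (negate << 15) | (carry << 14)
            | ((op & 7) << 11) | ((b_mux & 7) << 8) | (ret << 7)
            | ((write & 7) << 4) | ((rsp & 3) << 2) | (dsp & 3))


def decode_alu(word):
    """Split an ALU-family word back into its fields."""
    return {
        "negate": (word >> 15) & 1,
        "carry": (word >> 14) & 1,
        "op": (word >> 11) & 7,
        "b_mux": (word >> 8) & 7,
        "ret": (word >> 7) & 1,
        "write": (word >> 4) & 7,
        "rsp": (word >> 2) & 3,
        "dsp": word & 3,
    }


word = encode_alu(op=0b110, b_mux=0b001, dsp=0b11)
print(hex(word), family(word), decode_alu(word))
```

The numeric field values passed in are raw encodings; which assembler keywords map to which values is defined by the macros in tugman_syntax.asm, not by this sketch.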

Control Transfer

The 3-bit JJJ field selects the control transfer instruction. Six of the eight encodings are defined:

000 CALL
001 JMP
010 JZ
011 JNZ
100 JC
101 JNC
110, 111 reserved

Every control transfer operation can also do a DROP operation if the D bit is set.

The target is computed by adding signed offset to IP.
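A minimal Python sketch of the target computation follows, assuming the 12-bit offset field from the instruction layout. Whether the hardware adds the offset to the address of the jump itself or to the following instruction is not stated here, so the sketch simply adds it to whatever `ip` the caller passes in.

```python
def sign_extend_12(offset):
    """Interpret the low 12 bits as a two's complement value."""
    offset &= 0xFFF
    return offset - 0x1000 if offset & 0x800 else offset


def jump_target(ip, word):
    """Add the signed offset held in the low 12 bits of `word` to `ip`."""
    return ip + sign_extend_12(word)


print(jump_target(100, 0x005))  # short forward jump
print(jump_target(100, 0xFFF))  # offset of -1
```

With 12 signed bits, a single jump can reach 2048 words backward or 2047 words forward.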

ALU

Many things can happen in a cycle during an ALU operation:

  • TOS and carry may be negated prior to operation
  • The ALU operation result is placed into TOS;
  • TOS may be written to NOS, TOR, memory, or any of the output ports;
  • Return stack and/or data stack may be adjusted by +1, -1, or -2;
  • Memory read is issued using ALU_B (unless a memory write is on)
  • RET may direct execution of next instruction to TOR
  • Logical operations can pass on carry, or set it to 0 or 1

ALU operation Encoding

ALU operation encodings contain 4 components:

  • ALU_N bit (negates TOS and carry);
  • ALU_C bit (introduces carry into +/- and >>)
  • ALU_B port selection
  • ALU operation proper

     Operation                ALU_B select
===  ==========          ===  ===============
000  ALU_B +/- TOS       000  TOS
001  ALU_B & TOS         001  NOS
010  ALU_B | TOS         010  TOR
011  ALU_B ^ TOS         011  [ALU_B] memory
100                      100  IO input
101                      101  1
110  ALU_B               110  -1
111  ALU_B >>            111  0

Subtraction is synthesized by using the ALU_N bit, which negates ALU_A (TOS) and inverts carry, turning addition into subtraction.
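The arithmetic behind this can be checked in Python. The sketch assumes that "negate" means a bitwise inversion of TOS and that the inverted carry-in supplies the extra +1 of two's complement; the actual RTL may implement it differently. `add_sub` is a hypothetical name.

```python
MASK = (1 << 18) - 1  # 18-bit datapath


def add_sub(b, tos, negate=False, carry_in=0):
    """Return (18-bit result, raw carry out of bit 17) for ALU_B +/- TOS."""
    if negate:
        tos = ~tos & MASK   # assumed: ALU_N inverts the TOS operand
        carry_in ^= 1       # and inverts the incoming carry
    total = (b & MASK) + (tos & MASK) + carry_in
    return total & MASK, total >> 18


print(add_sub(1, 2))               # plain addition
print(add_sub(5, 3, negate=True))  # ALU_B - TOS
```

With no carry requested, the inverted carry-in becomes 1, so the adder computes B + ~TOS + 1, which is B - TOS modulo 2^18.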

Flags

There are two flags:

C is carry generated by the ALU operation and is available to the next instruction.

  • +/- always sets C to appropriate value;
  • >> sets C to the bit shifted out

For logical operations, ALU_C and ALU_N bits control the outgoing carry value:

C N  Result
===========
0 0  Carry cleared
0 1  Carry set 
1 0  Carry preserved
1 1  Carry inverted

Preserving carry makes it possible to pass the result of a subroutine using C upon return.
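The logical-operation carry table above reduces to a small function. This is a direct Python transcription of the table, with a hypothetical function name:

```python
def logical_carry_out(alu_c, alu_n, carry_in):
    """Outgoing carry for &, |, ^ and pass-through, per the C/N table."""
    if alu_c:
        return carry_in ^ alu_n  # C=1: preserved (N=0) or inverted (N=1)
    return alu_n                 # C=0: cleared (N=0) or set (N=1)


for c in (0, 1):
    for n in (0, 1):
        print(c, n, [logical_carry_out(c, n, cin) for cin in (0, 1)])
```

The carry-preserving row (C=1, N=0) is the one a subroutine relies on when it returns a result in the carry flag.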

For right shift:

N C  Result
===========
0 0  shift right
0 1  rotate right using carry 
1 .  rotate right
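The three right-shift modes can be modelled in Python as follows. This is a sketch that assumes the plain shift is logical (zero fill) and that the rotate-through-carry mode feeds the incoming carry into bit 17; `shift_right` is a hypothetical name.

```python
WIDTH = 18
MASK = (1 << WIDTH) - 1


def shift_right(value, alu_n=0, alu_c=0, carry_in=0):
    """Return (result, carry_out) for the >> operation."""
    value &= MASK
    shifted_out = value & 1          # this bit becomes the new carry
    if alu_n:
        top = shifted_out            # rotate right: bit 0 wraps to bit 17
    elif alu_c:
        top = carry_in & 1           # rotate right using carry
    else:
        top = 0                      # shift right, zero fill (assumed)
    return (top << (WIDTH - 1)) | (value >> 1), shifted_out


print(shift_right(0b11))
print(shift_right(1, alu_n=1))
print(shift_right(2, alu_c=1, carry_in=1))
```

Chaining the carry mode across two words gives a multi-word right shift, since the bit dropped from one word enters the top of the next.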

Z is set when TOS is zero.

DSP and RSP incrementors

The CPU uses signed 2-bit values to adjust the stack pointers. Writes to data or return stack use the adjusted stack pointers, while reads happen prior to adjustment!
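A signed 2-bit field can hold only four values, which a short Python sketch makes explicit. The function name is hypothetical, and the two's complement interpretation is an assumption based on the adjustments the ALU section lists (+1, -1, -2).

```python
def sp_delta(field):
    """Decode a 2-bit two's complement stack pointer adjustment."""
    field &= 0b11
    return field - 4 if field & 0b10 else field


for field in range(4):
    print(format(field, "02b"), sp_delta(field))
```

Under this reading 00 leaves the pointer alone, 01 increments it, 11 decrements it, and 10 decrements it by two.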

The convention is to increment stacks when pushing, and decrement when popping.

Normally, RSP should be decremented in conjunction with a RET bit.

Jump/call instructions perform an implicit DROP when the D bit is set, and automatically decrement DSP.

For CALL, RSP is auto-incremented and IP+1 is saved on the return stack. RET does not auto-decrement RSP!

Write control

A single write may be selected for each cycle:

000  No write
001  TOS -> NOS     ; next cycle, value of NOS will be same as current TOS
010  ALU -> TOR     ; next cycle, value of TOR will be same as ALU result
011  TOS -> [ALU_B] ; write TOS to memory addressed by current ALU_B
1??  TOS -> output  ; write TOS to output port selected

Note that TOR is written from the current ALU result, while other writes use TOS (prior to ALU computation).

Assembler Crash Course

A barely-working assembler is cobbled together from FASMG macros. FASMG rules apply, and all macros and expressions are available. Be careful -- incorrect opcodes will silently fail!

Literals are encoded as: lit ....

Jumps are encoded as jmp, jz, jnz, jc; calls are encoded as jsr. Append a 'd' to make it also DROP (for example, jzd).

ALU opcodes use the keyword op followed by several components:

  • operation (required) OP_B, OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_SHR
  • ALU_B (required) B_TOS, B_NOS, B_TOR, B_MEM, B_IN, B_1, B_N1
  • Write selector WN,WT,WR,W0,W1,W2,W3
  • DSP adjust DSPI, DSPD, DSPD2
  • RSP adjust RSPI, RSPD, RSPD2
  • Carry to be used for + - CIN

Many Forth instructions can generally be synthesized as a single operation:

dup:    op      OP_B,B_TOS,  DSPI,WN    ;inc DSP and copy TOS into it

swap:   op      OP_B,B_NOS,  WN         ;write TOS->NOS, read NOS->TOS

over:   op      OP_B,B_NOS,  DSPI,WN    ;read NOS, push TOS onto dstack

drop:   op      OP_B,B_NOS,  DSPD
       

add:    op      OP_ADD,B_NOS, DSPD      ;add tos+nos, adjust dstack

push:   op      OP_B,B_NOS, DSPD, RSPI,WR ;pop dstack, push onto rstack

dec:    op      OP_ADD,B_N1

not:    op      OP_XOR,B_N1

neg:    op      OP_SUB,B_0

These may be made into macros using fasmg:

macro push?
        op      OP_B,B_NOS,DSPD,RSPI,WR
end macro
...

;       TUCK  (a,b--b,a,b)
tuck:   swap
        over
        
; Is there a better way?
;       ROT (a,b,c--b,c,a)
rot:    push                            ;(a,b--
        swap                            ;(b,a--
        pop                             ;(b,a,c
        swap                            ;(b,c,a
        

jsr, jmp and conditional jumps have versions that also drop. To simulate compare/jump instructions that do not alter the original value:

    lit '0'
    op  OP_SUB,B_NOS        ;(v,test
    jzd .zero               ;(v

Other times it may be useful to test the result of the operation:

    lit '0'
    op  OP_SUB,B_NOS,DSPD   ;( v-$30 
    jc  .too_low
    ...

Memory reads require two instructions, but keep in mind that other operations may be performed simultaneously.

fetch:  op      OP_B,B_TOS              ;issue read from TOS
        op      OP_B,B_MEM              ;result into TOS

Some more interesting examples:


;------------------------------------------------------------------
; double-indirect memory read
        op      OP_B,B_TOS      ;issue read on TOS
        op      OP_B,B_MEM      ;issue read on memory
        op      OP_B,B_MEM      ;issue read on memory
        op      OP_B,B_TOS      ;result in TOS
;------------------------------------------------------------------
; Copy cnt words from src to destination:
;
; (cnt,src,dst--

copy:   push                     ; dst onto return stack
        push                     ; src onto return stack
                                 ; keep count on datastack
                                 ;  D          R     Mem
.loop:  lit 1                    ;(cnt,1      src     
   op   OP_ADD,B_TOR,WR,DSPD     ;(cnt,src++  src++  issued    inc src
   op   OP_B,B_MEM,RSPD          ;(cnt,val,   dst              read val
   op   OP_B,B_TOR,WM,DSPD       ;(cnt,val    dst    write     store
   op   OP_ADD,B_1,WR,DSPD       ;(cnt,dst++  dst++            inc dst
   op   OP_SUB,B_1,RSPI          ;(cnt--      src
   jnz  .loop
   op   OP_B,B_NOS,DSPD, RSPD,RET  ;drop 0 cnt and return     
   
   
;----------------------------------------------------------------
; Pass constants at call site, and return past constant:

        jsr    qqq
        lit   constant1
        ...

qqq:    lit 1                   ;increment value
        op  OP_ADD,B_TOR,WR     ;read constant1 at [B_TOR], TOR+1
        op  OP_B,B_MEM          ;TOS=constant1
        ...
        op  ...   RET

;----------------------------------------------------------------
; Return from subroutine with the carry flag intact:
         ...
        jc      .return
        ...
.return:
        op  OP_AND,B_TOS,RSPD,RET   ;C-preserving nop just to return

IO

There are 4 output ports addressable directly in the instruction as W0, W1, W2, W3. TOS is written to the specified port. The ports are allocated as:

UART_RD_ACK     0    ;After reading UART RX, acknowledge
UART_WR         1
LED             3

There is a single 18-bit input port, available as an ALU input, configured as

{FROM_UART[7:0],rxready,txready}  status bits are active low 
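Read as a bit concatenation, this puts txready in bit 0, rxready in bit 1, and the received byte in bits 9..2. A Python sketch of that decoding, with a hypothetical function name and the upper bits assumed unused:

```python
def decode_input(word):
    """Split the input port value into (char, rx_ready, tx_ready).

    The status bits are active low, so a 0 bit means "ready".
    """
    tx_ready = (word & 0b01) == 0
    rx_ready = (word & 0b10) == 0
    char = (word >> 2) & 0xFF
    return char, rx_ready, tx_ready


# Received 'A', receiver has data (bit 1 = 0), transmitter busy (bit 0 = 1)
print(decode_input((0x41 << 2) | 0b01))
```

This matches the shift-based polling used in the assembly routines: one right shift drops txready into carry, a second drops rxready, and what remains is the received character.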

To read the UART, you can:

rx:     lit   0                       ;(0
rxpoll: op    OP_SHR,B_IN             ;(in>>1
        op    OP_SHR,B_TOS            ;(in>>2  rxready is in C 
        jc    rxpoll                  ;not ready yet, poll again
        lit   0
        op    OP_B,B_NOS,DSPD,W0,RSPD,RET ;strobe ack0, return

To write UART, check for txready first:

tx:     lit     0                           ;(char,0     reserve space
txpoll: op      OP_SHR,B_IN                 ;(char,stat
        jc      txpoll                      ;(char,stat
        op      OP_B,B_NOS,DSPD             ;(char
        op      OP_B,B_NOS,DSPD,W1,RSPD,RET ;(           output,drop


Notes and Observations

On Nano9K, the system should run up to ~40MHz according to reports, but I haven't tried it faster than ~27MHz.

Tugman architecture was my first stab at making a CPU over 10 years ago, originally conceived and sketched out at Tugman State Park in Oregon.

Tradeoffs

ISA design is an exercise in juggling multiple factors. In this case, priority was given to:

  • Squeezing VLIW instructions into 18 bits;
  • Keeping fMAX to around 40MHz on the slow Nano9K
  • Keeping overall resource usage to ~6%

This results in limitations on the muxes used -- adding instructions costs a lot in fMAX and resources. Splitting the ALU into ALU_OP and ALU_B creates a lot of possible instructions at minimal cost.