Tugman SOC
Tugman CPU is an experimental 18-bit stack machine loosely based on James Bowman's J1 CPU. It is a single-cycle stack machine with mostly VLIW instructions.
This is a Tang Nano SOC for experimenting with a Tugman processor.
Also included is a fasmg-based assembler (include the tugman_syntax.asm macros). It is not great, but it is capable of outputting code. fasmg must be installed, and fasmg should invoke the proper binary from the command line.
Edit code.asm, then run make to assemble code.hex first;
Build using the GOWIN IDE (yosys will build incorrectly);
Configure using openFPGALoader; make load will do it.
Connect a terminal at 38400 baud.
The Initial Program Loader displays TUGMAN and waits for a binary upload. Send it a sequence of little-endian DWORDs, starting with the load address and the count of DWORDs to follow. See test.asm and Makefile for examples of how to assemble Tugman code and make a binary.
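As a rough illustration of the upload format described above, here is a minimal Python sketch that packs an image for the loader. It assumes a DWORD is 32 bits, that each 18-bit word travels in the low bits of one DWORD, and that the count covers only the data words; build_upload is a made-up helper name, and the Makefile remains the authoritative example.

```python
import struct

def build_upload(load_address, words):
    """Pack load address, word count, then the words, all as little-endian 32-bit values."""
    header = struct.pack("<II", load_address, len(words))
    # Assumption: each 18-bit Tugman word occupies the low 18 bits of a DWORD.
    body = b"".join(struct.pack("<I", w & 0x3FFFF) for w in words)
    return header + body

# Hypothetical two-word program loaded at address 0x100.
image = build_upload(0x100, [0x20041, 0x00003])
```

The resulting bytes could then be sent over the serial link once the loader has printed TUGMAN.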
test.asm now contains a hex dump, and a routine to convert keystrokes into hex digits. Soon, a simple monitor.
Prerequisites:
- GOWIN IDE
- openFPGALoader
- fasmg
Rationale
Stack machines are conceptually fun, but in practice require a lot of instructions to do simple things, often leading to an unpleasant coding experience.
An ALU with a selection of sources allows complex operations on memory, registers or input values.
Its quirky instruction set begs for fun opportunistic optimization of code!
Architecture
This is a very quirky CPU. On the one hand it is a minimalist stack machine. On the other, it can perform multiple operations in a single cycle, including ALU operations directly on registers or memory or IO read results, issue simultaneous memory reads, adjust stacks, and even return.
The architecture begs for opportunistic optimization.
Registers
There are two stacks: a return stack and a data stack.
- TOS = top of data stack
- NOS = second item on data stack
- TOR = top of return stack
- ALU_B = source of ALU operation, also memory address
- IP = instruction pointer
- C = carry flag
- Z = zero flag
Memory
Up to 8K of memory is used; it is 18 bits wide and dual-ported.
Instructions are read from the instruction port (read-only).
Data reads are issued on the data port every non-write cycle from ALU_B and the result is available for ALU instructions next cycle. Writes happen immediately from TOS to [ALU_B], and the data written is available from the memory port next cycle.
Make sure to read memory data exactly on the next cycle from setting ALU_B to a good address source, as most of the memory reads are nonsense!
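The one-cycle read latency is easy to get wrong, so here is a toy Python model of the data port timing as described above. It is only a sketch: the DataPort class, its method names, the reset value of the port, and the address wrap-around are all assumptions made for illustration, not taken from the Verilog.

```python
class DataPort:
    """Toy model of the data port: one read or write per cycle, one-cycle latency."""

    def __init__(self, size=8192):
        self.mem = [0] * size
        self.out = 0  # what the MEM source shows during the current cycle (assumed reset value)

    def cycle(self, alu_b, write_value=None):
        """Run one cycle and return what the MEM source would show during it."""
        visible = self.out
        addr = alu_b % len(self.mem)  # assumption: addresses wrap to the memory size
        if write_value is None:
            self.out = self.mem[addr]      # read issued, result visible next cycle
        else:
            self.mem[addr] = write_value   # write lands immediately
            self.out = write_value         # and shows on the port next cycle
        return visible

port = DataPort()
port.mem[5] = 42
port.cycle(5)            # cycle 1: ALU_B = 5, read issued
value = port.cycle(0)    # cycle 2: MEM holds mem[5]
stale = port.cycle(0)    # cycle 3: MEM already reflects the cycle-2 address
```

In this model the wanted value is visible for exactly one cycle, which is the reason for reading it on the very next instruction.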
Instruction Set Architecture
The instructions are mostly undecoded, VLIW-style.
XX_...._...._...._.... family (10=lits, 00=jmps, 01=ALU)
1X_XXXX_XXXX_XXXX_XXXX literal
00_JJJd_oooo_oooo_oooo jmp (JJJ=call,jmp,jz,jnz,jc,jnc) drop,offset
01_N..._...._...._.... Negate TOS before ALU operation (for -)
01_.C.._...._...._.... Carry on
01_..XX_X..._...._.... ALU op (+/-, &, |, ^, portB, >>)
01_...._.XXX_...._.... B mux (TOS, NOS, TOR, IO, MEM, 1,-1)
01_...._...._X..._.... return
01_...._...._.XXX_.... write control (nothing,NOS,TOR,mem,IO)
01_...._...._...._XX.. RSP inc (signed)
01_...._...._...._..XX DSP inc (signed)
There are 3 types of instructions:
- LIT Literal load into TOS, DSP adjusted automatically;
- Control transfer -- calls and jumps may be conditional on Z and C
- ALU operations
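The bit layout in the table above can be sketched as a small Python decoder. This is an illustration only: it treats any word with the top bit set as a literal (following the 1X pattern), returns the literal unsigned, and uses made-up dictionary keys; none of this is taken from the Verilog source.

```python
def field(word, hi, lo):
    """Extract bits hi..lo (inclusive) of an 18-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(word):
    """Split an 18-bit word into the fields shown in the encoding table."""
    word &= 0x3FFFF
    if field(word, 17, 17) == 1:                 # 1X_... literal (assumed unsigned here)
        return {"kind": "lit", "value": field(word, 16, 0)}
    if field(word, 17, 16) == 0b00:              # 00_JJJd_oooo_oooo_oooo
        offset = field(word, 11, 0)
        if offset & 0x800:                       # sign-extend the 12-bit offset
            offset -= 0x1000
        return {"kind": "jump", "jjj": field(word, 15, 13),
                "drop": field(word, 12, 12), "offset": offset}
    return {"kind": "alu",                       # 01_...
            "negate": field(word, 15, 15), "carry": field(word, 14, 14),
            "op": field(word, 13, 11), "b_mux": field(word, 10, 8),
            "ret": field(word, 7, 7), "write": field(word, 6, 4),
            "rsp": field(word, 3, 2), "dsp": field(word, 1, 0)}

lit = decode(0b10_0000_0000_0010_1010)
jump = decode(0b00_0011_1111_1111_1110)
alu = decode(0b01_1000_0001_1000_0111)
```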
Control Transfer
The 3-bit JJJ field allows 8 control transfer encodings, of which 6 are defined:
000 CALL
001 JMP
010 JZ
011 JNZ
100 JC
101 JNC
110, 111 reserved
Every control transfer operation can also do a DROP operation if the D bit is set.
The target is computed by adding signed offset to IP.
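A minimal Python sketch of how the JJJ field and the flags might select the next instruction address. The function name next_ip is hypothetical, and two details are assumptions: that the offset is relative to the address of the jump itself, and that an untaken jump falls through to IP+1.

```python
def next_ip(ip, jjj, offset, z, c):
    """Return the next IP for a control transfer, given the Z and C flags."""
    taken = {
        0b000: True,    # CALL
        0b001: True,    # JMP
        0b010: z,       # JZ
        0b011: not z,   # JNZ
        0b100: c,       # JC
        0b101: not c,   # JNC
    }
    if jjj not in taken:
        raise ValueError("reserved JJJ encoding")
    # Assumption: offset is added to the jump's own IP; otherwise fall through.
    return ip + offset if taken[jjj] else ip + 1
```

For example, a JZ with offset -4 at address 100 would go to 96 when Z is set and to 101 otherwise, under the assumptions above.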
ALU
Many things can happen in a cycle during an ALU operation:
- TOS and carry may be negated prior to operation
- The ALU operation result is placed into TOS;
- TOS may be written to NOS, TOR, memory, or any of the output ports;
- Return stack and/or Data stack may be adjusted +1,-1, or -2;
- Memory read is issued using ALU_B (unless a memory write is on)
- RTS may direct execution of next instruction to TOR
- Logical operations can pass on carry, or set it to 0 or 1
ALU operation Encoding
ALU operation encodings contain 4 components:
- ALU_N bit (negates TOS and carry);
- ALU_C bit (introduces carry into +/- and >>);
- ALU_B port selection;
- ALU operation proper.
Operation            ALU_B select
=== ==============   === ===============
000 ALU_B +/- TOS    000 TOS
001 ALU_B & TOS      001 NOS
010 ALU_B | TOS      010 TOR
011 ALU_B ^ TOS      011 [ALU_B] memory
100                  100 IO input
101                  101 1
110 ALU_B            110 -1
111 ALU_B >>         111 0
Subtraction is synthesized by using the ALU_N bit, which negates ALU_A and inverts carry, turning addition into subtraction.
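The usual way an adder is turned into a subtractor is the identity b - a = b + ~a + 1, truncated to the word width. The Python sketch below shows it for 18 bits. Whether the hardware inverts the operand and injects a carry, or forms the negation some other way, and which carry polarity it reports, are assumptions here; add18 and sub18 are made-up names.

```python
MASK = 0x3FFFF  # 18-bit datapath

def add18(a, b, carry_in=0):
    """18-bit add returning (result, carry_out)."""
    total = (a & MASK) + (b & MASK) + carry_in
    return total & MASK, total >> 18

def sub18(b, a):
    """Compute b - a as b + ~a + 1, using only the adder."""
    return add18(~a & MASK, b, carry_in=1)

difference, carry = sub18(10, 3)    # 7, with a carry out in this model
negative, borrow = sub18(3, 10)     # -7 modulo 2**18, no carry out
```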
Flags
There are two flags:
C is carry generated by the ALU operation and is available to the next instruction.
- +/- always sets C to the appropriate value;
- >> sets C to the bit shifted out.
For logical operations, ALU_C and ALU_N bits control the outgoing carry value:
C N Result
===========
0 0 Carry cleared
0 1 Carry set
1 0 Carry preserved
1 1 Carry inverted
Preserving carry makes it possible to pass the result of a subroutine using C upon return.
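The carry table for logical operations reduces to a small function, sketched here in Python. The function name is made up; the behavior follows the four rows of the table.

```python
def logical_carry_out(alu_c, alu_n, carry_in):
    """Outgoing carry of a logical operation, per the C/N table."""
    if alu_c == 0:
        return alu_n            # C=0: N=0 clears carry, N=1 sets it
    return carry_in ^ alu_n     # C=1: N=0 preserves carry, N=1 inverts it
```

This is why a carry-preserving logical operation can serve as a return instruction that leaves C untouched.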
For right shift:
N C Result
===========
0 0 shift right
0 1 rotate right using carry
1 . rotate right
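The three right-shift modes can be modeled in a few lines of Python. This sketch assumes that a plain shift fills the top bit with zero, that rotate-through-carry feeds the incoming carry into bit 17, and that a plain rotate wraps bit 0 around to bit 17; shift_right is a hypothetical name.

```python
WIDTH = 18
MASK = (1 << WIDTH) - 1

def shift_right(value, alu_n, alu_c, carry_in):
    """Return (result, carry_out) for the >> operation in its three modes."""
    value &= MASK
    carry_out = value & 1          # C receives the bit shifted out
    if alu_n:
        top = carry_out            # rotate right: bit 0 wraps to the top
    elif alu_c:
        top = carry_in             # rotate right through carry
    else:
        top = 0                    # plain shift right
    return (value >> 1) | (top << (WIDTH - 1)), carry_out
```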
Z is set when TOS is zero.
DSP and RSP incrementors
The CPU uses signed 2-bit values to adjust the stack pointers. Writes to data or return stack use the adjusted stack pointers, while reads happen prior to adjustment!
The convention is to increment stacks when pushing, and decrement when popping.
Normally, RSP should be decremented in conjunction with a RET bit.
Jump/call instructions perform an implicit DROP when the D bit is set, and automatically decrement DSP.
For CALL, RSP is auto-incremented and IP+1 is saved on the return stack. RET does not auto-decrement RSP!
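A signed 2-bit field can express 0, +1, -2 and -1, which matches the +1, -1 and -2 adjustments listed earlier. The Python sketch below shows the two's-complement reading of the field; which bit pattern each assembler keyword actually emits is an assumption, and sp_delta is a made-up name.

```python
def sp_delta(field):
    """Interpret a 2-bit field as signed: 00 -> 0, 01 -> +1, 10 -> -2, 11 -> -1."""
    field &= 0b11
    return field - 4 if field & 0b10 else field

deltas = [sp_delta(f) for f in range(4)]
```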
Write control
A single write may be selected for each cycle:
000 No write
001 TOS -> NOS ; next cycle, value of NOS will be same as current TOS
010 ALU -> TOR ; next cycle, value of TOR will be same as ALU result
011 TOS -> [ALU_B] ; write TOS to memory addressed by current ALU_B
1?? TOS -> output ; write TOS to output port selected
Note that TOR is written from the current ALU result, while other writes use TOS (prior to ALU computation).
Assembler Crash Course
A barely-working assembler is cobbled together from FASMG macros. FASMG rules apply, and all macros and expressions are available. Be careful -- incorrect opcodes will silently fail!
Literals are encoded as: lit ... .
Jumps are encoded as jmp, jz, jnz, jc, while calls are encoded as jsr. Append a 'd' to make it also DROP.
ALU opcodes use the keyword op followed by several components:
- operation (required)
OP_B, OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_SHR
- ALU_B (required)
B_TOS, B_NOS, B_TOR, B_MEM, B_IN, B_1, B_N1, B_0
- Write selector
WN,WT,WR,W0,W1,W2,W3
- DSP adjust
DSPI, DSPD, DSPD2
- RSP adjust
RSPI, RSPD, RSPD2
- Carry to be used for + -
CIN
Many Forth instructions can generally be synthesized as a single operation:
dup: op OP_B,B_TOS, DSPI,WN ;inc DSP and copy TOS into it
swap: op OP_B,B_NOS, WN ;write TOS->NOS, read NOS->TOS
over: op OP_B,B_NOS, DSPI,WN ;read NOS, push TOS onto dstack
drop: op OP_B,B_NOS, DSPD
add: op OP_ADD,B_NOS, DSPD ;add tos+nos, adjust dstack
push: op OP_B,B_NOS, DSPD, RSPI,WR ;pop dstack, push onto rstack
dec: op OP_ADD,B_N1
not: op OP_XOR,B_N1
neg: op OP_SUB,B_0
These may be made into macros using fasmg:
macro push?
op OP_B,B_NOS,DSPD,RSPI,WR
end macro
...
; TUCK (a,b--b,a,b)
tuck: swap
over
; Is there a better way?
; ROT (a,b,c--b,c,a)
rot: push ;(a,b--
swap ;(b,a--
pop ;(b,a,c
swap ;(b,c,a
jsr, jmp and conditional jumps have versions that also drop. To simulate compare/jump instructions that do not alter the original value:
lit '0'
op OP_SUB,B_NOS ;(v,test
jzd .zero ;(v
Other times it may be useful to test the result of the operation:
lit '0'
op OP_SUB,B_NOS,DSPD ;( v-$30
jc .too_low
...
Memory reads require two instructions, but keep in mind that other operations may be performed simultaneously.
fetch: op OP_B,B_TOS ;issue read from TOS
op OP_B,B_MEM ;result into TOS
Some more interesting examples:
;------------------------------------------------------------------
; double-indirect memory read
op OP_B,B_TOS ;issue read on TOS
op OP_B,B_MEM ;issue read on memory
op OP_B,B_MEM ;issue read on memory
op OP_B,B_TOS ;result in TOS
;------------------------------------------------------------------
; Copy cnt words from src to destination:
;
; (cnt,src,dst--
copy: push ; dst onto return stack
push ; src onto return stack
; keep count on datastack
; D R Mem
.loop: lit 1 ;(cnt,1 src
op OP_ADD,B_TOR,WR,DSPD ;(cnt,src++ src++ issued inc src
op OP_B,B_MEM,RSPD ;(cnt,val, dst read val
op OP_B,B_TOR,WM,DSPD ;(cnt,val dst write store
op OP_ADD,B_1,WR,DSPD ;(cnt,dst++ dst++ inc dst
op OP_SUB,B_1,RSPI ;(cnt-- src
jnz .loop
op OP_B,B_NOS,DSPD, RSPD,RET ;drop 0 cnt and return
;----------------------------------------------------------------
; Pass constants at call site, and return past constant:
jsr qqq
lit constant1
...
qqq: lit 1 ;increment value
op OP_ADD,B_TOR,WR ;read constant1 at [B_TOR], TOR+1
op OP_B,B_MEM ;TOS=constant1
...
op ... RET
;----------------------------------------------------------------
; Return from subroutine with the carry flag intact:
...
jc .return
...
.return:
op OP_AND,B_TOS,RSPD,RET ;C-preserving nop just to return
IO
There are 4 output ports addressable directly in the instruction as W0,W1,W2,W3. TOS is written to the specified port. The ports are allocated as:
UART_RD_ACK 0 ;After reading UART RX, acknowledge
UART_WR 1
LED 3
There is a single 18-bit input port, available as an ALU input, configured as
{FROM_UART[7:0],rxready,txready}
Status bits are active low.
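Read as a Verilog-style concatenation, the rightmost element sits in the lowest bit, so txready would be bit 0, rxready bit 1, and the UART byte bits 9..2. The Python sketch below unpacks a word under that assumption; unpack_input is a hypothetical helper, and the upper bits of the port are not described here, so it ignores them.

```python
def unpack_input(word):
    """Split the input word into (data, rx_ready, tx_ready)."""
    tx_ready = (word & 1) == 0           # bit 0, active low
    rx_ready = ((word >> 1) & 1) == 0    # bit 1, active low
    data = (word >> 2) & 0xFF            # bits 9..2, the received byte
    return data, rx_ready, tx_ready

# 'A' waiting in the receiver while the transmitter is still busy.
sample = unpack_input((0x41 << 2) | 0b01)
```

This bit order is consistent with shifting the port right once to test txready and twice to test rxready via the carry flag.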
To read the UART, you can:
rx: lit 0 ;(0
rxpoll: op OP_SHR,B_IN ;(in>>1
op OP_SHR,B_TOS ;(in>>2 rxready is in C
jc rxpoll ;not ready yet, poll again
lit 0
op OP_B,B_NOS,DSPD,W0,RSPD,RET ;strobe ack0, return
To write UART, check for txready first:
tx: lit 0 ;(char,0 reserve space
txpoll: op OP_SHR,B_IN ;(char,stat
jc txpoll ;(char,stat
op OP_B,B_NOS,DSPD ;(char
op OP_B,B_NOS,DSPD,W1,RSPD,RET ;( output,drop
Notes and Observations
On Nano9K, the system should run up to ~40MHz according to reports, but I haven't tried it faster than ~27MHz.
Tugman architecture was my first stab at making a CPU over 10 years ago, originally conceived and sketched out at Tugman State Park in Oregon.
Tradeoffs
ISA design is an exercise in juggling multiple factors. In this case, priority was given to:
- Squeezing VLIW instructions into 18 bits;
- Keeping fMAX to around 40MHz on the slow Nano9K
- Keeping overall resource usage to ~6%
This results in limitations on the muxes used -- adding instructions costs lots in fMAX and resources. Splitting the ALU into ALU_OP and ALU_B creates a lot of possible instructions with minimum costs.