## SubX

SubX is a notation for a subset of x86 machine code. [The Mu translator](http://akkartik.github.io/mu/html/apps/mu.subx.html)
is implemented in SubX and also emits SubX code.

Here's an example program in SubX that adds 1 and 1 and returns the result to
the parent shell process:

```sh
== code
Entry:
  # ebx = 1
  bb/copy-to-ebx 1/imm32
  # increment ebx
  43/increment-ebx
  # exit(ebx)
  e8/call syscall_exit/disp32
```

## The syntax of SubX instructions

Just like in regular machine code, SubX programs consist mostly of instructions,
which are basically sequences of numbers (always in hex). Instructions consist
of words separated by whitespace. Words may be _opcodes_ (defining the
operation being performed) or _arguments_ (specifying the data the operation
acts on). Any word can have extra _metadata_ attached to it after `/`. Some
metadata is required (like the `/imm32` and `/disp32` above), but unrecognized
metadata is silently skipped so you can attach comments to words (like the
instruction names `/copy-to-ebx` and `/increment-ebx` above).

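To make the word/metadata split concrete, here's a rough model of it in Python. This is only a sketch, not how SubX itself is implemented: `parse_word`, `encode` and the `WIDTHS` table are names I made up for illustration, and the sketch assumes every word is a plain hex number (no labels, no negative values) and that `imm32`/`disp32` arguments are emitted as 4 little-endian bytes, as x86 expects.

```python
def parse_word(word):
    """Split a SubX word into its datum and a list of metadata tags."""
    datum, *metadata = word.split("/")
    return datum, metadata

# Byte widths for argument tags mentioned in this document (standard x86 sizes).
# Hypothetical table for this sketch; any other metadata is treated as a comment.
WIDTHS = {"imm32": 4, "imm8": 1, "disp32": 4, "disp8": 1}

def encode(instruction):
    """Encode an instruction whose words are all plain hex numbers."""
    out = bytearray()
    for word in instruction.split():
        datum, metadata = parse_word(word)
        # The first recognized width tag wins; words without one take 1 byte.
        width = next((WIDTHS[m] for m in metadata if m in WIDTHS), 1)
        out += int(datum, 16).to_bytes(width, "little")
    return bytes(out)

print(parse_word("bb/copy-to-ebx"))               # ('bb', ['copy-to-ebx'])
print(encode("bb/copy-to-ebx 1/imm32").hex(" "))  # bb 01 00 00 00
```

Note how `/copy-to-ebx` has no effect on the output bytes, while `/imm32` decides how many bytes the argument occupies.
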
What do all these numbers mean? SubX supports a small subset of the 32-bit x86
instruction set that likely runs on your computer. (Think of the name as short
for "sub-x86".) The instruction set contains instructions like `89/copy`,
`01/add`, `3d/compare` and `51/push-ecx` which modify registers and a byte-addressable
memory. For a complete list of supported instructions, run `bootstrap help opcodes`.

The registers instructions operate on are as follows:

- Six general-purpose 32-bit registers: `0/eax`, `1/ecx`, `2/edx`, `3/ebx`,
  `6/esi` and `7/edi`.
- Two additional 32-bit registers: `4/esp` and `5/ebp`. (I suggest you only
  use these to manage the call stack.)

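The register numbers aren't arbitrary: they're the standard x86 register codes, and several one-byte opcodes are just a base value plus that code. Here's a small Python sketch of the relationship. The function names are mine, purely for illustration; the base values `0x40`, `0xb8` and `0x50` are the usual x86 ones for increment, copy-immediate and push.

```python
# Standard x86 register codes (the number before the `/` in a register word).
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
             "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def increment_opcode(reg):
    """One-byte 'increment register' opcode: 0x40 + register code."""
    return 0x40 + REGISTERS[reg]

def copy_imm32_opcode(reg):
    """One-byte 'copy imm32 to register' opcode: 0xb8 + register code."""
    return 0xb8 + REGISTERS[reg]

def push_opcode(reg):
    """One-byte 'push register' opcode: 0x50 + register code."""
    return 0x50 + REGISTERS[reg]

print(hex(increment_opcode("ebx")))   # 0x43
print(hex(copy_imm32_opcode("ebx")))  # 0xbb
print(hex(push_opcode("ecx")))        # 0x51
```

These line up with the `43/increment-ebx`, `bb/copy-to-ebx` and `51/push-ecx` words used elsewhere in this document.
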
(SubX doesn't support floating-point registers yet. Intel processors support
an 8-bit mode, 16-bit mode and 64-bit mode. SubX will never support them.
There are also _many_ more instructions that SubX will never support.)

While SubX doesn't provide the usual mnemonics for opcodes, it _does_ provide
error-checking. If you miss an argument or accidentally add an extra argument,
you'll get a nice error. SubX won't arbitrarily interpret bytes of data as
instructions or vice versa.

It's worth distinguishing between an instruction's arguments and its _operands_.
Arguments are provided directly in instructions. Operands are pieces of data
in registers or memory that are operated on by instructions.

Intel processors typically operate on no more than two operands, and at most
one of them (the 'reg/mem' operand) can access memory. The reg/mem operand is
specified by an expression of one of these forms:

- `%reg`: operate on just a register, not memory
- `*reg`: look up memory with the address in some register
- `*(reg + disp)`: add a constant to the address in some register
- `*(base + (index << scale) + disp)` where `base` and `index` are registers,
  and `scale` and `disp` are 2- and 32-bit constants respectively.

Under the hood, SubX turns expressions of these forms into multiple arguments
with metadata in some complex ways. See [the doc on bare SubX](subx_bare.md).

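The address arithmetic behind the `*` forms can be sketched in a few lines of Python. This is just a model of the formulas in the list above, not SubX code: `effective_address` and the register values are hypothetical, and the result is wrapped to 32 bits on the assumption that addresses behave like 32-bit registers.

```python
def effective_address(registers, base, index=None, scale=0, disp=0):
    """Compute base + (index << scale) + disp, wrapped to 32 bits.

    `registers` maps register names to their current values.
    """
    address = registers[base] + disp
    if index is not None:
        address += registers[index] << scale
    return address & 0xFFFFFFFF

# Made-up register contents for illustration.
regs = {"eax": 0x1000, "ecx": 3}

print(hex(effective_address(regs, "eax")))                  # *eax -> 0x1000
print(hex(effective_address(regs, "eax", disp=8)))          # *(eax + 8) -> 0x1008
print(hex(effective_address(regs, "eax", "ecx", 2, 8)))     # *(eax + (ecx << 2) + 8) -> 0x1014
```

The `%reg` form doesn't appear here because it never computes an address: the operand is the register itself.
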
That covers the complexities of the reg/mem operand. The second operand is
simpler. It comes from exactly one of the following argument types:

- `/r32`
- displacement: `/disp8` or `/disp32`
- immediate: `/imm8` or `/imm32`

Putting all this together, here's an example that adds the integer in `eax` to
the one at address `edx`:

```
01/add *edx 0/r32/eax
```

## The syntax of SubX programs

SubX programs map to the same ELF binaries that a conventional Linux system
uses. Linux ELF binaries consist of a series of _segments_. In particular, they
distinguish between code and data. Correspondingly, SubX programs consist of a
series of segments, each starting with a header line: `==` followed by a name
and approximate starting address.

All code must lie in a segment called 'code'.

Segments can be added to.

```sh
== code 0x09000000 # first mention requires starting address
...A...

== data 0x0a000000
...B...

== code # no address necessary when adding
...C...
```

The `code` segment now contains the instructions of `A` as well as `C`.

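If it helps, here's one way to picture how repeated headers accumulate, as a Python sketch. `collect_segments` is a hypothetical helper of mine, not part of SubX, and it ignores starting addresses and comments entirely; it only shows that a later `== code` header appends to the segment opened earlier.

```python
def collect_segments(lines):
    """Group body lines under segment names; repeated headers append."""
    segments = {}   # name -> list of body lines, in order of first mention
    current = None
    for line in lines:
        if line.startswith("=="):
            name = line.split()[1]
            # Reuse the existing list if this segment was already opened.
            current = segments.setdefault(name, [])
        elif line.strip() and current is not None:
            current.append(line.strip())
    return segments

source = [
    "== code 0x09000000",
    "...A...",
    "",
    "== data 0x0a000000",
    "...B...",
    "",
    "== code",
    "...C...",
]
print(collect_segments(source))
# {'code': ['...A...', '...C...'], 'data': ['...B...']}
```
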
Within the `code` segment, each line contains a comment, label or instruction.
Comments start with a `#` and are ignored. Labels should always be the first
word on a line, and they end with a `:`.

Instructions can refer to labels in displacement or immediate arguments, and
they'll obtain a value based on the address of the label: immediate arguments
will contain the address directly, while displacement arguments will contain
the difference between the label's address and the address just past the
current instruction (which is where x86 measures jumps and calls from).
The latter is mostly useful for `jump` and `call` instructions.

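Here's the difference between the two kinds of label argument as a Python sketch. The helper names and all the addresses are invented for illustration. The sketch assumes the usual x86 rule that a displacement is measured from the end of the instruction containing it, so a 5-byte `e8/call` at address `A` measures from `A + 5`.

```python
def immediate_value(label_address):
    """An immediate argument holds the label's address as is."""
    return label_address

def displacement_value(label_address, instruction_address, instruction_length):
    """A displacement holds the distance from the end of the instruction."""
    return label_address - (instruction_address + instruction_length)

# Hypothetical layout: a 5-byte call (opcode + disp32) at 0x09000006.
call_address, call_length = 0x09000006, 5

forward = displacement_value(0x09000100, call_address, call_length)
backward = displacement_value(0x09000000, call_address, call_length)

print(hex(immediate_value(0x09000100)))  # 0x9000100
print(forward)                           # 245
print(backward)                          # -11
# Negative displacements are stored as two's complement, little-endian.
print(backward.to_bytes(4, "little", signed=True).hex(" "))  # f5 ff ff ff
```

Because displacements are relative, code using them keeps working if the whole segment moves to a different starting address.
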
Functions are defined using labels. By convention, labels internal to functions
(that must only be jumped to) start with a `$`. Any other labels must only be
called, never jumped to. All labels must be unique.

Functions are called using the following syntax:

```
(func arg1 arg2 ...)
```

Function arguments must be either literals (integers or strings) or a reg/mem
operand using the syntax in the previous section.

A special label is `Entry`, which can be used to specify/override the entry
point of the program. It doesn't have to be unique, and the latest definition
will override earlier ones.

(The `Entry` label, along with duplicate segment headers, allows programs to
be built up incrementally out of multiple [_layers_](http://akkartik.name/post/wart-layers).)

Another special pair of labels is the block delimiters `{` and `}`. They can
be nested, and jump instructions can take arguments `loop` or `break` that
jump to the enclosing `{` and `}` respectively.

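Matching `loop` and `break` to their braces is a classic stack problem. Here's a sketch of the idea in Python. It's my own illustration rather than SubX's actual algorithm: `resolve_blocks` is a hypothetical name, the jump lines in the example are made up (I'm assuming `eb` as the opcode for an unconditional short jump), and the result is expressed as line indices rather than addresses.

```python
def resolve_blocks(lines):
    """Map the index of each line using `loop`/`break` to its target brace line."""
    open_stack = []   # indices of `{` lines that haven't been closed yet
    pending = []      # (line index, index of enclosing `{`) for each `break`
    close_of = {}     # index of `{` -> index of its matching `}`
    targets = {}
    for i, line in enumerate(lines):
        words = line.split()
        if words == ["{"]:
            open_stack.append(i)
        elif words == ["}"]:
            close_of[open_stack.pop()] = i
        else:
            for word in words:
                datum = word.split("/")[0]
                if datum == "loop":
                    targets[i] = open_stack[-1]          # innermost open `{`
                elif datum == "break":
                    pending.append((i, open_stack[-1]))  # `}` not seen yet
    # Every `}` is known now, so breaks can be resolved.
    for i, opener in pending:
        targets[i] = close_of[opener]
    return targets

program = [
    "{",
    "  {",
    "    eb/jump break/disp8",
    "  }",
    "  eb/jump loop/disp8",
    "}",
]
print(sorted(resolve_blocks(program).items()))  # [(2, 3), (4, 0)]
```

The `break` on line 2 lands on the inner `}` (line 3), while the `loop` on line 4 goes back to the outer `{` (line 0), since the inner block has already closed by then.
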
The data segment consists of labels as before and byte values. Referring to
data labels in either `code` segment instructions or `data` segment values
yields their address.

Automatic tests are an important part of SubX, and there's a simple mechanism
to provide a test harness: all functions that start with `test-` are called in
turn by a special, auto-generated function called `run-tests`. How you choose
to call it is up to you.

I try to keep things simple so that there's less work to do when implementing
SubX in SubX. But there _is_ one convenience: instructions can provide a
string literal surrounded by quotes (`"`) in an `imm32` argument. SubX will
transparently copy it to the `data` segment and replace it with its address.
Strings are the only place where a SubX word is allowed to contain spaces.

That should be enough information for writing SubX programs. The `apps/`
directory provides some fodder for practice in the `apps/ex*.subx` files,
giving a more gradual introduction to SubX features. In particular, you should
work through `apps/factorial4.subx`, which demonstrates all the above ideas in
concert.