mu/subx at e07a3f2886b117970b3cd58f7cd6806cbfe5cc4a - mu


Kartik Agaram e07a3f2886 4537 Streamline the factorial function; we don't need to save a stack variable into a register before operating on it. All instructions can take a stack variable directly. In the process we found two bugs: a) Opcode f7 was not implemented correctly. It was internally consistent but I'd never validated it against a natively running program. Turns out it encodes multiple instructions, not just 'not'. b) The way we look up imm32 operands was sometimes reading them before disp8/disp32 operands.		2018-09-07 22:19:13 -07:00
apps	4537	2018-09-07 22:19:13 -07:00
examples	4535 - support for global variable names	2018-09-01 23:03:50 -07:00
html	4351	2018-07-16 07:55:07 -07:00
000organization.cc	4426 - error on unrecognized sub-commands	2018-07-26 16:58:54 -07:00
001help.cc	4436	2018-07-27 10:50:33 -07:00
002test.cc	4426 - error on unrecognized sub-commands	2018-07-26 16:58:54 -07:00
003trace.cc	4517	2018-08-13 16:49:32 -07:00
003trace.test.cc	4487	2018-08-05 08:30:48 -07:00
010---vm.cc	4520 - several syscalls for files	2018-08-13 21:01:04 -07:00
011run.cc	4531 - automatically compute segment addresses	2018-09-01 20:10:06 -07:00
012elf.cc	4537	2018-09-07 22:19:13 -07:00
013direct_addressing.cc	4537	2018-09-07 22:19:13 -07:00
014indirect_addressing.cc	4537	2018-09-07 22:19:13 -07:00
015immediate_addressing.cc	4537	2018-09-07 22:19:13 -07:00
016index_addressing.cc	4469	2018-08-03 23:42:37 -07:00
017jump_disp8.cc	4469	2018-08-03 23:42:37 -07:00
018jump_disp16.cc	4469	2018-08-03 23:42:37 -07:00
019functions.cc	4469	2018-08-03 23:42:37 -07:00
020syscalls.cc	4522	2018-08-14 10:55:00 -07:00
028translate.cc	4532	2018-09-01 20:38:32 -07:00
029transforms.cc	4482	2018-08-04 22:38:23 -07:00
030---operands.cc	4531 - automatically compute segment addresses	2018-09-01 20:10:06 -07:00
031check_operands.cc	4537	2018-09-07 22:19:13 -07:00
032check_operand_bounds.cc	4499	2018-08-09 21:46:12 -07:00
034compute_segment_address.cc	4535 - support for global variable names	2018-09-01 23:03:50 -07:00
035labels.cc	4535 - support for global variable names	2018-09-01 23:03:50 -07:00
036global_variables.cc	4535 - support for global variable names	2018-09-01 23:03:50 -07:00
100index	4499	2018-08-09 21:46:12 -07:00
Readme.md	4529 - move examples to a sub-directory	2018-09-01 09:39:36 -07:00
build	4403	2018-07-25 13:07:01 -07:00
build_and_test_until	4457	2018-07-30 11:31:09 -07:00
cheatsheet.pdf	4026	2017-10-12 09:36:55 -07:00
clean	4462	2018-07-30 20:28:36 -07:00
gen	4530 - create an apps/ directory	2018-09-01 10:54:20 -07:00
opcodes	3968	2017-07-11 21:41:15 -07:00
run	4530 - create an apps/ directory	2018-09-01 10:54:20 -07:00
subx	4211	2018-02-20 01:38:15 -08:00
subx.vim	4523 - Give up on pass-through phases	2018-08-20 22:19:41 -07:00
test_layers	4530 - create an apps/ directory	2018-09-01 10:54:20 -07:00
vim_errors.subx	4512 - divide labels into two categories	2018-08-12 22:38:36 -07:00
vimrc.vim	4020	2017-10-11 02:32:38 -07:00

Readme.md

What is this?

SubX is a thin layer of syntactic sugar over (32-bit x86) machine code. The SubX translator (it's too simple to be called a compiler, or even an assembler) generates ELF binaries that require just a Unix-like kernel to run. (The translator isn't self-hosted yet; generating the binaries does require a C++ compiler and runtime.)

Thin layer of abstraction over machine code, isn't that just an assembler?

Assemblers try to hide from the programmer the precise instructions they emit. Consider these instructions in Assembly language:

add EBX, ECX
copy EBX, 0
copy ECX, 1

Here are the same instructions in SubX, just a list of numbers (opcodes and operands) with metadata 'comments' after a /:

01/add 3/mod/direct 3/rm32/ebx 1/r32/ecx
bb/copy 0/imm32
b9/copy 1/imm32

Notice that a single instruction, say 'copy', maps to multiple opcodes. That's just the tip of the iceberg of complexity that Assembly languages deal with.
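To make the mapping concrete, here is a rough Python sketch of how the three instructions above turn into bytes. It relies only on standard x86 encoding rules: 'add' with opcode 01 is followed by a ModR/M byte that packs the mod, r32 and rm32 fields, while 'copy imm32 to register' folds the register number into the opcode (b8 plus the register). The helper names are made up for illustration; this is not how the SubX translator itself is organized.

```python
# Register numbers as x86 encodes them in 3-bit fields.
REGISTERS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
             "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm(mod, reg, rm):
    """Pack the ModR/M fields into one byte: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def add_reg_to_reg(dest, src):
    """Opcode 01: add r32 to r/m32, with both operands in registers (mod 3)."""
    return bytes([0x01, modrm(3, REGISTERS[src], REGISTERS[dest])])

def copy_imm32_to_reg(dest, value):
    """Opcodes b8..bf: the destination register is part of the opcode itself."""
    return bytes([0xB8 + REGISTERS[dest]]) + value.to_bytes(4, "little")

program = (add_reg_to_reg("ebx", "ecx")
           + copy_imm32_to_reg("ebx", 0)
           + copy_imm32_to_reg("ecx", 1))
print(program.hex(" "))  # prints: 01 cb bb 00 00 00 00 b9 01 00 00 00
```

The two 'copy' instructions differ only in their first byte (bb versus b9), which is why one mnemonic ends up standing for several opcodes.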

SubX doesn't shield the programmer from these details. Words always contain the actual bits or bytes for machine code. But they also can contain metadata after slashes, and SubX will run cross-checks and give good error messages when there's a discrepancy between code and metadata.
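As an illustration of what such a cross-check could look like, the Python sketch below splits each word into its bits and its metadata, then compares the operand kinds found against what the opcode expects. The table of expected operands, the function names and the error wording are all hypothetical; the real checks live in the translator's C++ layers and may differ.

```python
# Hypothetical table: the operand kinds each opcode is expected to carry.
EXPECTED_OPERANDS = {
    "01": {"mod", "rm32", "r32"},  # add r32 to r/m32
    "bb": {"imm32"},               # copy imm32 to EBX
    "b9": {"imm32"},               # copy imm32 to ECX
}
OPERAND_KINDS = {"mod", "rm32", "r32", "imm32"}

def parse_word(word):
    """Split 'bits/meta1/meta2' into the literal bits and a list of metadata tags."""
    value, *metadata = word.split("/")
    return value, metadata

def check_instruction(line):
    """Return a list of discrepancies between the opcode and the operand metadata."""
    words = [parse_word(w) for w in line.split()]
    opcode = words[0][0]
    if opcode not in EXPECTED_OPERANDS:
        return [f"'{line}': unknown opcode {opcode}"]
    found = set()
    for _, metadata in words[1:]:
        found.update(tag for tag in metadata if tag in OPERAND_KINDS)
    expected = EXPECTED_OPERANDS[opcode]
    errors = [f"'{line}': missing {kind} operand" for kind in sorted(expected - found)]
    errors += [f"'{line}': unexpected {kind} operand" for kind in sorted(found - expected)]
    return errors

print(check_instruction("01/add 3/mod/direct 3/rm32/ebx 1/r32/ecx"))  # prints: []
print(check_instruction("01/add 3/mod/direct 3/rm32/ebx"))
```

The second call reports a missing r32 operand, since opcode 01 needs a source register in addition to the mod and rm32 fields.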

But why not use an assembler?

The long-term goal is to make programming in machine language ergonomic enough that I (or someone else) can build a compiler for a high-level language in it. That is, building a compiler without needing a compiler anywhere among its prerequisites.

Assemblers today are complex enough that they're built in a high-level language, and need a compiler to build. They also tend to be designed to fit into a larger toolchain, to be a back-end for a compiler. Their output is in turn often passed to other tools like a linker. The formats that all these tools use to talk to each other have grown increasingly complex in the face of decades of evolution, usage and backwards-compatibility constraints. All these considerations add to the burden of the assembler developer. Building the assembler in a high-level language helps face up to them.

Assemblers do often accept a far simpler language, just a file format really, variously called 'flat' or 'binary', which gives the programmer complete control over the precise bytes in an executable. SubX is basically trying to be a more ergonomic flat assembler that will one day be bootstrapped from machine code.

Why in the world?

It seems wrong-headed that our computers look polished but are plagued by foundational problems of security and reliability. I'd like to learn to walk before I try to run. The plan: start out using the computer only to check my program for errors rather than to hide low-level details. Force myself to think about security by living with raw machine code for a while. Reintroduce high-level languages (HLLs) only after confidence is regained in the foundations (and when the foundations are ergonomic enough to support developing a compiler in them). Delegate only when I can verify with confidence.
The software in our computers has grown incomprehensible. Nobody understands it all, not even experts. Even simple programs written by a single author require lots of time for others to comprehend. Compilers are a prime example, growing so complex that programmers have to choose to either program them or use them. I think they may also contribute to the incomprehensibility of the stack above them. I'd like to explore how much of an HLL I can build without a monolithic optimizing compiler, and see if deconstructing the work of the compiler can make the stack as a whole more comprehensible to others.
I want to learn about the internals of the infrastructure we all rely on in our lives.

Running

$ git clone https://github.com/akkartik/mu
$ cd mu/subx
$ ./subx

Running subx will transparently compile it as necessary.

Usage

subx currently has the following sub-commands:

subx test: runs all automated tests.
subx translate <input file> <output ELF binary>: translates a text file containing hex bytes and macros into an executable ELF binary.
subx run <ELF binary>: simulates running the ELF binaries emitted by subx translate. Useful for debugging, and also enables more thorough testing of translate.

Putting them together, build and run one of the example programs:

$ ./subx translate examples/ex1.1.subx examples/ex1
$ ./subx run examples/ex1

If you're running on Linux, ex1 will also be runnable directly:

$ examples/ex1

There are a few such example programs in the examples/ directory. At any commit an example's binary should be identical bit for bit with the output of translating the .subx file. The binary should also be natively runnable on a 32-bit Linux system. If either of these invariants is broken it's a bug on my part. The binary should also be runnable on a 64-bit Linux system. I can't guarantee it, but I'd appreciate hearing if it doesn't run.
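The first invariant is easy to check mechanically. Here's a small Python sketch that retranslates a .subx file into a scratch location and compares the result byte for byte with the checked-in binary. The function name and the translate_cmd parameter are my own invention for this example; the only thing taken from SubX is the 'subx translate <input file> <output ELF binary>' command line described above.

```python
import filecmp
import os
import subprocess
import tempfile

def binary_is_up_to_date(source, checked_in_binary,
                         translate_cmd=("./subx", "translate")):
    """Translate `source` into a scratch file and compare it byte for byte."""
    with tempfile.TemporaryDirectory() as scratch:
        fresh = os.path.join(scratch, "out")
        # Runs: ./subx translate <source> <fresh>; raises if the translator fails.
        subprocess.run([*translate_cmd, source, fresh], check=True)
        # shallow=False forces a comparison of contents, not just file metadata.
        return filecmp.cmp(fresh, checked_in_binary, shallow=False)
```

Run from the mu/subx directory, binary_is_up_to_date("examples/ex1.1.subx", "examples/ex1") should return True at any commit where the invariant holds.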

However, not all 32-bit Linux binaries are guaranteed to be runnable by subx. I'm not building general infrastructure here for all of the x86 ISA and ELF format. SubX is about programming with a small, regular subset of 32-bit x86:

Only instructions that operate on the 32-bit E*X registers. (No floating-point yet.)
Only instructions that assume a flat address space; no instructions that use segment registers.
No instructions that check the carry or parity flags; arithmetic operations always operate on signed integers (while bitwise operations always operate on unsigned integers).
Only relative jump instructions (with 8-bit or 16-bit offsets).
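Relative jumps are the one item in this list where the bytes depend on where the instruction sits. As a hedged illustration in Python, here's how an 8-bit offset would be computed under the standard x86 rule that the displacement is measured from the end of the jump instruction. Opcode eb is the x86 unconditional jump with an 8-bit offset; the function name and the addresses are made up for the example.

```python
def jump_disp8(jump_address, target_address):
    """Encode an 8-bit relative jump (opcode eb) located at jump_address."""
    instruction_length = 2  # one opcode byte plus one displacement byte
    # The offset is relative to the address of the *next* instruction.
    displacement = target_address - (jump_address + instruction_length)
    if not -128 <= displacement <= 127:
        raise ValueError(
            f"target is {displacement} bytes away; an 8-bit offset holds -128..127")
    return bytes([0xEB]) + displacement.to_bytes(1, "little", signed=True)

print(jump_disp8(0x10, 0x17).hex(" "))  # prints: eb 05
print(jump_disp8(0x10, 0x10).hex(" "))  # prints: eb fe (a jump to itself)
```

A target that doesn't fit in 8 bits would need the wider offset form instead.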

The ELF binaries generated are statically linked and missing a lot of advanced ELF features as well. But they will run.
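For a sense of how little a bare-bones ELF binary needs, here is a Python sketch of a 32-bit ELF file header with no section headers at all. The field layout follows the ELF specification; the specific choices (one program header placed right after the file header, zeroed section fields, the sample entry address) are assumptions made for illustration and are not necessarily the values subx translate emits.

```python
import struct

def elf32_header(entry, num_program_headers=1):
    """Build a 52-byte ELF32 file header for a static x86 executable."""
    # Magic, 32-bit class, little-endian data, ELF version 1, then zero padding.
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    return struct.pack(
        "<16sHHIIIIIHHHHHH",
        ident,
        2,       # e_type: executable file
        3,       # e_machine: Intel 80386
        1,       # e_version
        entry,   # e_entry: address of the first instruction to run
        52,      # e_phoff: program headers directly follow this header
        0,       # e_shoff: no section header table
        0,       # e_flags
        52,      # e_ehsize: size of this header
        32,      # e_phentsize: size of one ELF32 program header
        num_program_headers,
        0, 0, 0, # e_shentsize, e_shnum, e_shstrndx: unused in this sketch
    )

header = elf32_header(entry=0x08048054)  # hypothetical entry address
print(len(header))  # prints: 52
```

Everything a dynamic linker or debugger would look for (section names, symbol tables, relocation entries) is simply absent.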

For more details on programming in this subset, consult the online help:

$ ./subx help

Resources

Inspirations

“Creating tiny ELF executables”
“Bootstrapping a compiler from nothing”
Forth implementations like StoneKnifeForth