diff --git a/README.md b/README.md
index 75c4ff1f..83a479b6 100644
--- a/README.md
+++ b/README.md
@@ -7,14 +7,14 @@ Mu is not designed to operate in large clusters providing services for millions of people. Mu is designed for _you_, to run one computer. (Or a few.) Running the code you want to run, and nothing else.
- ```sh
- $ git clone https://github.com/akkartik/mu
- $ cd mu
- $ ./translate_mu apps/ex2.mu # emit a.elf
- $ ./a.elf # adds 3 and 4
- $ echo $?
- 7
- ```
+```sh
+$ git clone https://github.com/akkartik/mu
+$ cd mu
+$ ./translate_mu apps/ex2.mu # emit a.elf
+$ ./a.elf # adds 3 and 4
+$ echo $?
+7
+```
 [![Build Status](https://api.travis-ci.org/akkartik/mu.svg?branch=master)](https://travis-ci.org/akkartik/mu)
@@ -82,26 +82,26 @@ result in good error messages. Once generated, ELF binaries can be packaged up with a Linux kernel into a bootable disk image:
- ```sh
- $ ./translate_mu apps/ex2.mu # emit a.elf
- # dependencies
- $ sudo apt install build-essential flex bison wget libelf-dev libssl-dev xorriso
- $ tools/iso/linux a.elf
- $ qemu-system-x86_64 -m 256M -cdrom mu_linux.iso -boot d
- ```
+```sh
+$ ./translate_mu apps/ex2.mu # emit a.elf
+# dependencies
+$ sudo apt install build-essential flex bison wget libelf-dev libssl-dev xorriso
+$ tools/iso/linux a.elf
+$ qemu-system-x86_64 -m 256M -cdrom mu_linux.iso -boot d
+```
 The disk image also runs on [any cloud server that supports custom images](http://akkartik.name/post/iso-on-linode). Mu also runs on the minimal hobbyist OS [Soso](https://github.com/ozkl/soso). (Requires graphics and sudo access. Currently doesn't work on a cloud server.)
- ```sh
- $ ./translate_mu apps/ex2.mu # emit a.elf
- # dependencies
- $ sudo apt install build-essential util-linux nasm xorriso # maybe also dosfstools and mtools
- $ tools/iso/soso a.elf # requires sudo
- $ qemu-system-i386 -cdrom mu_soso.iso
- ```
+```sh
+$ ./translate_mu apps/ex2.mu # emit a.elf
+# dependencies
+$ sudo apt install build-essential util-linux nasm xorriso # maybe also dosfstools and mtools
+$ tools/iso/soso a.elf # requires sudo
+$ qemu-system-i386 -cdrom mu_soso.iso
+```
 ## Syntax
@@ -121,16 +121,16 @@ Here's an example program in Mu: Here's an example program in SubX:
- ```sh
- == code
- Entry:
- # ebx = 1
- bb/copy-to-ebx 1/imm32
- # increment ebx
- 43/increment-ebx
- # exit(ebx)
- e8/call syscall_exit/disp32
- ```
+```sh
+== code
+Entry:
+ # ebx = 1
+ bb/copy-to-ebx 1/imm32
+ # increment ebx
+ 43/increment-ebx
+ # exit(ebx)
+ e8/call syscall_exit/disp32
+```
 [More details on SubX syntax →](subx.md)
diff --git a/mu.md b/mu.md
index cf855615..c7534760 100644
--- a/mu.md
+++ b/mu.md
@@ -46,13 +46,13 @@ Zooming out from single statements, here's a complete sample program in Mu: Mu programs are lists of functions. Each function has the following form:
- ```
- fn _name_ _inout_ ... -> _output_ ... {
- _statement_
- _statement_
- ...
- }
- ```
+```
+fn _name_ _inout_ ... -> _output_ ... {
+ _statement_
+ _statement_
+ ...
+}
+```
 Each function has a header line, and some number of statements, each on a separate line. Headers describe inouts and outputs. Inouts can't be registers,
@@ -64,15 +64,15 @@ outputs in registers, and modify inouts passed in by reference. In addition, there's one more constraint: output registers must match the function header. For example:
- ```
- fn f -> x/eax: int {
- ...
- }
- fn g {
- a/eax <- f # ok
- a/ebx <- f # wrong
- }
- ```
+```
+fn f -> x/eax: int {
+ ...
+}
+fn g {
+ a/eax <- f # ok
+ a/ebx <- f # wrong
+}
+```
 The function `main` is special; it is where the program starts running.
 It must always return a single int in register `ebx` (as the exit status of the
@@ -92,23 +92,23 @@ and `}`, both each alone on a line. Blocks can nest:
- ```
+```
+{
+ _statements_
 {
- _statements_
- {
- _more statements_
- }
+ _more statements_
 }
- ```
+}
+```
 Blocks can be named (with the name ending in a `:` on the same line as the `{`):
- ```
- $name: {
- _statements_
- }
- ```
+```
+$name: {
+ _statements_
+}
+```
 Further down we'll see primitive statements for skipping or repeating blocks. Besides control flow, the other use for blocks is...
@@ -119,10 +119,10 @@ Functions can define new variables at any time with the keyword `var`. There are two variants of the `var` statement, for defining variables in registers or memory.
- ```
- var name: type
- var name/reg: type <- ...
- ```
+```
+var name: type
+var name/reg: type <- ...
+```
 Variables on the stack are never initialized. (They're always implicitly zeroed them out.) Variables in registers are always initialized.
@@ -142,54 +142,54 @@ Here is the list of arithmetic primitive operations supported by Mu. The name `n` indicates a literal integer rather than a variable, and `var/reg` indicates a variable in a register.
- ```
- var/reg <- increment
- increment var
- var/reg <- decrement
- decrement var
- var1/reg1 <- add var2/reg2
- var/reg <- add var2
- add-to var1, var2/reg
- var/reg <- add n
- add-to var, n
+```
+var/reg <- increment
+increment var
+var/reg <- decrement
+decrement var
+var1/reg1 <- add var2/reg2
+var/reg <- add var2
+add-to var1, var2/reg
+var/reg <- add n
+add-to var, n
- var1/reg1 <- sub var2/reg2
- var/reg <- sub var2
- sub-from var1, var2/reg
- var/reg <- sub n
- sub-from var, n
+var1/reg1 <- sub var2/reg2
+var/reg <- sub var2
+sub-from var1, var2/reg
+var/reg <- sub n
+sub-from var, n
- var1/reg1 <- and var2/reg2
- var/reg <- and var2
- and-with var1, var2/reg
- var/reg <- and n
- and-with var, n
+var1/reg1 <- and var2/reg2
+var/reg <- and var2
+and-with var1, var2/reg
+var/reg <- and n
+and-with var, n
- var1/reg1 <- or var2/reg2
- var/reg <- or var2
- or-with var1, var2/reg
- var/reg <- or n
- or-with var, n
+var1/reg1 <- or var2/reg2
+var/reg <- or var2
+or-with var1, var2/reg
+var/reg <- or n
+or-with var, n
- var1/reg1 <- xor var2/reg2
- var/reg <- xor var2
- xor-with var1, var2/reg
- var/reg <- xor n
- xor-with var, n
+var1/reg1 <- xor var2/reg2
+var/reg <- xor var2
+xor-with var1, var2/reg
+var/reg <- xor n
+xor-with var, n
- var/reg <- copy var2/reg2
- copy-to var1, var2/reg
- var/reg <- copy var2
- var/reg <- copy n
- copy-to var, n
+var/reg <- copy var2/reg2
+copy-to var1, var2/reg
+var/reg <- copy var2
+var/reg <- copy n
+copy-to var, n
- compare var1, var2/reg
- compare var1/reg, var2
- compare var/eax, n
- compare var, n
+compare var1, var2/reg
+compare var1/reg, var2
+compare var/eax, n
+compare var, n
- var/reg <- multiply var2
- ```
+var/reg <- multiply var2
+```
 Any statement above that takes a variable in memory can be replaced with a dereference (`*`) of an address variable (of type `(addr ...)`) in a register.
@@ -211,11 +211,11 @@ Since most x86 instructions implicitly load 32 bits at a time from memory, variables of type 'byte' are only allowed in registers, not on the stack. Here are the possible statements for reading bytes to/from memory:
- ```
- var/reg <- copy-byte var2/reg2 # var: byte, var2: byte
- var/reg <- copy-byte *var2/reg2 # var: byte, var2: (addr byte)
- copy-byte-to *var1/reg1, var2/reg2 # var1: (addr byte), var2: byte
- ```
+```
+var/reg <- copy-byte var2/reg2 # var: byte, var2: byte
+var/reg <- copy-byte *var2/reg2 # var: byte, var2: (addr byte)
+copy-byte-to *var1/reg1, var2/reg2 # var1: (addr byte), var2: byte
+```
 In addition, variables of type 'byte' are restricted to (the lowest bytes of) just 4 registers: eax, ecx, edx and ebx.
@@ -228,9 +228,9 @@ jump to the beginning of the containing block. All jumps can take an optional label starting with '$':
- ```
- loop $foo
- ```
+```
+loop $foo
+```
 This instruction jumps to the beginning of the block called $foo. The corresponding `break` jumps to the end of the block. Either jump statement must lie somewhere
@@ -239,83 +239,83 @@ blocks with restraint; jumps to places far away can get confusing.) There are two unconditional jumps:
- ```
- loop
- loop label
- break
- break label
- ```
+```
+loop
+loop label
+break
+break label
+```
 The remaining jump instructions are all conditional. Conditional jumps rely on the result of the most recently executed `compare` instruction. (To keep programs easy to read, keep compare instructions close to the jump that uses them.)
- ```
- break-if-=
- break-if-= label
- break-if-!=
- break-if-!= label
- ```
+```
+break-if-=
+break-if-= label
+break-if-!=
+break-if-!= label
+```
 Inequalities are similar, but have unsigned and signed variants. For simplicity, always use signed integers; use the unsigned variants only to compare addresses.
- ```
- break-if-<
- break-if-< label
- break-if->
- break-if-> label
- break-if-<=
- break-if-<= label
- break-if->=
- break-if->= label
+```
+break-if-<
+break-if-< label
+break-if->
+break-if-> label
+break-if-<=
+break-if-<= label
+break-if->=
+break-if->= label
- break-if-addr<
- break-if-addr< label
- break-if-addr>
- break-if-addr> label
- break-if-addr<=
- break-if-addr<= label
- break-if-addr>=
- break-if-addr>= label
- ```
+break-if-addr<
+break-if-addr< label
+break-if-addr>
+break-if-addr> label
+break-if-addr<=
+break-if-addr<= label
+break-if-addr>=
+break-if-addr>= label
+```
 Similarly, conditional loops:
- ```
- loop-if-=
- loop-if-= label
- loop-if-!=
- loop-if-!= label
+```
+loop-if-=
+loop-if-= label
+loop-if-!=
+loop-if-!= label
- loop-if-<
- loop-if-< label
- loop-if->
- loop-if-> label
- loop-if-<=
- loop-if-<= label
- loop-if->=
- loop-if->= label
+loop-if-<
+loop-if-< label
+loop-if->
+loop-if-> label
+loop-if-<=
+loop-if-<= label
+loop-if->=
+loop-if->= label
- loop-if-addr<
- loop-if-addr< label
- loop-if-addr>
- loop-if-addr> label
- loop-if-addr<=
- loop-if-addr<= label
- loop-if-addr>=
- loop-if-addr>= label
- ```
+loop-if-addr<
+loop-if-addr< label
+loop-if-addr>
+loop-if-addr> label
+loop-if-addr<=
+loop-if-addr<= label
+loop-if-addr>=
+loop-if-addr>= label
+```
 ## Addresses
 Passing objects by reference requires the `address` operation, which returns an object of type `addr`.
- ```
- var/reg: (addr T) <- address var2: T
- ```
+```
+var/reg: (addr T) <- address var2: T
+```
 Here `var2` can't live in a register.
@@ -325,24 +325,24 @@ Mu arrays are size-prefixed so that operations on them can check bounds as necessary at run-time. The `length` statement returns the number of elements in an array.
- ```
- var/reg: int <- length arr/reg: (addr array T)
- ```
+```
+var/reg: int <- length arr/reg: (addr array T)
+```
 The `index` statement takes an `addr` to an `array` and returns an `addr` to one of its elements, that can be read from or written to.
- ```
- var/reg: (addr T) <- index arr/reg: (addr array T), n
- var/reg: (addr T) <- index arr: (array T sz), n
- ```
+```
+var/reg: (addr T) <- index arr/reg: (addr array T), n
+var/reg: (addr T) <- index arr: (array T sz), n
+```
 The index can also be a variable in a register, with a caveat:
- ```
- var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int
- var/reg: (addr T) <- index arr: (array T sz), idx/reg: int
- ```
+```
+var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: int
+var/reg: (addr T) <- index arr: (array T sz), idx/reg: int
+```
 The caveat: the size of T must be 1, 2, 4 or 8 bytes. The x86 instruction set has complex addressing modes that can index into an array in a single instruction
@@ -351,30 +351,30 @@ in these situations. For types in general you'll need to split up the work, performing a `compute-offset` before the `index`.
- ```
- var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem
- var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem
- ```
+```
+var/reg: (offset T) <- compute-offset arr: (addr array T), idx/reg: int # arr can be in reg or mem
+var/reg: (offset T) <- compute-offset arr: (addr array T), idx: int # arr can be in reg or mem
+```
 The `compute-offset` statement returns a value of type `(offset T)` after performing any necessary bounds checking. Now the offset can be passed to `index` as usual:
- ```
- var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T)
- ```
+```
+var/reg: (addr T) <- index arr/reg: (addr array T), idx/reg: (offset T)
+```
 ## Compound types
 Primitive types can be combined together using the `type` keyword.
 For example:
- ```
- type point {
- x: int
- y: int
- }
- ```
+```
+type point {
+ x: int
+ y: int
+}
+```
 Mu programs are currently sequences of `fn` and `type` definitions.
@@ -382,19 +382,19 @@ To access within a compound type, use the `get` instruction. There are two forms. You need either a variable of the type itself (say `T`) in memory, or a variable of type `(addr T)` in a register.
- ```
- var/reg: (addr T_f) <- get var/reg: (addr T), f
- var/reg: (addr T_f) <- get var: T, f
- ```
+```
+var/reg: (addr T_f) <- get var/reg: (addr T), f
+var/reg: (addr T_f) <- get var: T, f
+```
 The `f` here is the field name from the `type` definition, and `T_f` must match the type of `f` in the `type` definition. For example, some legal instructions for the definition of `point` above:
- ```
- var a/eax: (addr int) <- get p, x
- var a/eax: (addr int) <- get p, y
- ```
+```
+var a/eax: (addr int) <- get p, x
+var a/eax: (addr int) <- get p, y
+```
 ## Handles for safe access to the heap
@@ -407,9 +407,9 @@ security issues or hard-to-debug misbehavior. To actually _use_ a `handle`, we have to turn it into an `addr` first using the `lookup` statement.
- ```
- var y/reg: (addr T) <- lookup x
- ```
+```
+var y/reg: (addr T) <- lookup x
+```
 Now operate on the `addr` as usual, safe in the knowledge that you can later recover any writes to its payload from `x`.
@@ -433,26 +433,26 @@ Try to avoid mixing these use cases. You can copy handles to another variable on the stack like this:
- ```
- var x: (handle T)
- # ..some code initializing x..
- var y/eax: (addr handle T) <- address ...
- copy-handle x, y
- ```
+```
+var x: (handle T)
+# ..some code initializing x..
+var y/eax: (addr handle T) <- address ...
+copy-handle x, y
+```
 You can also save handles inside compound types like this:
- ```
- var y/reg: (addr handle T_f) <- get var: (addr T), f
- copy-handle-to *y, x
- ```
+```
+var y/reg: (addr handle T_f) <- get var: (addr T), f
+copy-handle-to *y, x
+```
 Or this:
- ```
- var y/reg: (addr handle T) <- index arr: (addr array handle T), n
- copy-handle-to *y, x
- ```
+```
+var y/reg: (addr handle T) <- index arr: (addr array handle T), n
+copy-handle-to *y, x
+```
 ## Conclusion
diff --git a/subx.md b/subx.md
index b1ab38bc..bd9f9d1f 100644
--- a/subx.md
+++ b/subx.md
@@ -6,16 +6,16 @@ is implemented in SubX and also emits SubX code. Here's an example program in SubX that adds 1 and 1 and returns the result to the parent shell process:
- ```sh
- == code
- Entry:
- # ebx = 1
- bb/copy-to-ebx 1/imm32
- # increment ebx
- 43/increment-ebx
- # exit(ebx)
- e8/call syscall_exit/disp32
- ```
+```sh
+== code
+Entry:
+ # ebx = 1
+ bb/copy-to-ebx 1/imm32
+ # increment ebx
+ 43/increment-ebx
+ # exit(ebx)
+ e8/call syscall_exit/disp32
+```
 ## The syntax of SubX instructions
@@ -78,9 +78,9 @@ simpler. It comes from exactly one of the following argument types: Putting all this together, here's an example that adds the integer in `eax` to the one at address `edx`:
- ```
- 01/add %edx 0/r32/eax
- ```
+```
+01/add %edx 0/r32/eax
+```
 ## The syntax of SubX programs
diff --git a/subx_bare.md b/subx_bare.md
index b2429ed2..f4ee07ff 100644
--- a/subx_bare.md
+++ b/subx_bare.md
@@ -41,9 +41,9 @@ contains `4`. Rather than encoding register `esp`, it means the address is provided by three _whole new_ arguments (`/base`, `/index` and `/scale`) in a _totally_ different way (where `<<` is the left-shift operator):
- ```
- reg/mem = *(base + (index << scale))
- ```
+```
+reg/mem = *(base + (index << scale))
+```
 (There are a couple more exceptions ☹; see [Table 2-2](modrm.pdf) and [Table 2-3](sib.pdf) of the Intel manual for the complete story.)
@@ -130,38 +130,38 @@ This repo includes two translators for bare SubX. The first is [the bootstrap translator](bootstrap.md) implemented in C++. In addition, you can use SubX to translate itself. For example, running natively on Linux:
- ```sh
- # generate translator phases using the C++ translator
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/hex.subx -o hex
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/survey.subx -o survey
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/pack.subx -o pack
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/assort.subx -o assort
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/dquotes.subx -o dquotes
- $ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/tests.subx -o tests
- $ chmod +x hex survey pack assort dquotes tests
+```sh
+# generate translator phases using the C++ translator
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/hex.subx -o hex
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/survey.subx -o survey
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/pack.subx -o pack
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/assort.subx -o assort
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/dquotes.subx -o dquotes
+$ ./bootstrap translate init.linux 0*.subx apps/subx-params.subx apps/tests.subx -o tests
+$ chmod +x hex survey pack assort dquotes tests
- # use the generated translator phases to translate SubX programs
- $ cat init.linux apps/ex1.subx |./tests |./dquotes |./assort |./pack |./survey |./hex > a.elf
- $ chmod +x a.elf
- $ ./a.elf
- $ echo $?
- 42
+# use the generated translator phases to translate SubX programs
+$ cat init.linux apps/ex1.subx |./tests |./dquotes |./assort |./pack |./survey |./hex > a.elf
+$ chmod +x a.elf
+$ ./a.elf
+$ echo $?
+42
- # or, automating the above steps
- $ ./translate_subx init.linux apps/ex1.subx
- $ ./a.elf
- $ echo $?
- 42
- ```
+# or, automating the above steps
+$ ./translate_subx init.linux apps/ex1.subx
+$ ./a.elf
+$ echo $?
+42
+```
 Or, running in a VM on other platforms (much slower):
- ```sh
- $ ./translate_subx_emulated init.linux apps/ex1.subx # generates identical a.elf to above
- $ ./bootstrap run a.elf
- $ echo $?
- 42
- ```
+```sh
+$ ./translate_subx_emulated init.linux apps/ex1.subx # generates identical a.elf to above
+$ ./bootstrap run a.elf
+$ echo $?
+42
+```
 ## Resources
diff --git a/subx_debugging.md b/subx_debugging.md
index d7c0c294..27b98774 100644
--- a/subx_debugging.md
+++ b/subx_debugging.md
@@ -25,20 +25,24 @@ rudimentary but hopefully still workable toolkit: - Generate a trace for the failing test while running your program in emulated mode (`bootstrap run`):
+
 ```
 $ ./bootstrap translate input.subx -o binary
 $ ./bootstrap --trace run binary arg1 arg2 2>trace
 ```
+
 The ability to generate a trace is the essential reason for the existence of `bootstrap run` mode. It gives far better visibility into program internals than running natively.
 - As a further refinement, it is possible to render label names in the trace by adding a second flag to the `bootstrap translate` command:
+
 ```
 $ ./bootstrap --debug translate input.subx -o binary
 $ ./bootstrap --trace run binary arg1 arg2 2>trace
 ```
+
 `bootstrap --debug translate` emits a mapping from label to address in a file called `labels`. `bootstrap --trace run` reads in the `labels` file if it exists and prints out any matching label name as it traces each instruction
@@ -57,9 +61,11 @@ rudimentary but hopefully still workable toolkit: some loop. Function names get similar `run: == label` lines. - One trick when emitting traces with labels:
+
 ```
 $ grep label trace
 ```
+
 This is useful for quickly showing you the control flow for the run, and the function executing when the error occurred.
I find it useful to start with this information, only looking at the complete trace after I've gotten