7024

2020-10-14 10:45:25 -07:00 · 2020-10-14 10:45:25 -07:00 · 0d9e8309da
parent 8350faef52
commit 0d9e8309da
1 changed files with 64 additions and 35 deletions
--- a/mu.md
+++ b/mu.md
@@ -12,11 +12,15 @@ short, the former increments a value in memory, while the latter increments a
 value in a register.

 Most languages start from some syntax and do what it takes to implement it.
-Mu, however, is designed as a safe way to program in [a regular subset of
+Mu, however, is designed as a safe[1] way to program in [a regular subset of
 32-bit x86 machine code](subx.md), _satisficing_ rather than optimizing for a
 clean syntax. To keep the mapping to machine code lightweight, Mu exclusively
 uses statements. Most statements map to a single instruction of machine code.

+[1] While it's designed to be memory-safe, and already performs many safety
+checks, the Mu compiler is still a work in progress and can currently corrupt
+memory just like C can.
+
 Since the x86 instruction set restricts how many memory locations an instruction
 can use, Mu makes registers explicit as well. Variables must be explicitly
 mapped to registers; otherwise they live in memory.
@@ -82,7 +86,8 @@ two signatures:
 - `fn main -> x/ebx: int`
 - `fn main args: (addr array (addr array byte)) -> x/ebx: int`

-(The name of the output is flexible.)
+(The names of the inout and output are flexible. Strings are addresses to
+arrays of bytes, or `(addr array byte)` in Mu.)

 ## Blocks

@@ -124,11 +129,11 @@ var name/reg: type <- ...
 ```

 Variables on the stack are never initialized. (They're always implicitly
-zeroed them out.) Variables in registers are always initialized.
+zeroed out.) Variables in registers are always initialized.

 Register variables can go in 6 integer registers: `eax`, `ebx`, `ecx`, `edx`,
-`esi` and `edi`. Floating-point values can also go in 8 other registers:
-`xmm0`, `xmm1`, `xmm2`, `xmm3`, `xmm4`, `xmm5`, `xmm6` and `xmm7`.
+`esi` and `edi`. Floating-point values can go in 8 other registers: `xmm0`,
+`xmm1`, `xmm2`, `xmm3`, `xmm4`, `xmm5`, `xmm6` and `xmm7`.

 Defining a variable in a register either clobbers the previous variable (if it
 was defined in the same block) or shadows it temporarily (if it was defined in
@@ -138,16 +143,15 @@ Variables exist from their definition until the end of their containing block.
 Register variables may also die earlier if their register is clobbered by a
 new variable.

-Variables on the stack can be of many types (but not `byte`). Variables in
-integer registers can only contain 32-bit values: `int`, `boolean`, `(addr
-...)`. Variables in floating-point registers can only contain values of type
-`float`.
+Variables on the stack can be of many types (but not `byte`). Integer registers
+can only contain 32-bit values: `int`, `byte`, `boolean`, `(addr ...)`. Floating-point
+registers can only contain values of type `float`.

 ## Integer primitives

 Here is the list of arithmetic primitive operations supported by Mu. The name
 `n` indicates a literal integer rather than a variable, and `var/reg` indicates
-a variable in a register.
+a variable in a register, though that's not always valid Mu syntax.

 ```
 var/reg <- increment
@@ -215,8 +219,9 @@ can't dereference variables in memory. You have to load them into a register
 first.

 Excluding dereferences, the above statements must operate on non-address
-primitive types: `int` or `boolean`. (Booleans are really just `int`s, and Mu
-assumes any value but `0` is true.)
+values with primitive types: `int`, `boolean` or `byte`. (Booleans are really
+just `int`s, and Mu assumes any value but `0` is true.) You can copy addresses
+to int variables, but not the other way around.

 ## Floating-point primitives

@@ -266,7 +271,7 @@ Remember, when these instructions use indirect mode, they still use an integer
 register. Floating-point registers can't hold addresses.

 Two instructions in the above list are approximate. According to the Intel
-Manual, `reciprocal` and `inverse-square-root` [go off the rails around the
+manual, `reciprocal` and `inverse-square-root` [go off the rails around the
 fourth decimal place](x86_approx.md). If you need more precision, use `divide`
 separately.

@@ -312,9 +317,10 @@ compare var1/xreg1, var2

 ## Operating on individual bytes

-A special-case is variables of type 'byte'. Mu is a 32-bit platform so for the
+A special case is variables of type `byte`. Mu is a 32-bit platform so for the
 most part only supports types that are multiples of 32 bits. However, we do
-want to support strings in ASCII and UTF-8, which will be arrays of bytes.
+want to support strings in ASCII and UTF-8, which will be arrays of 8-bit
+bytes.

 Since most x86 instructions implicitly load 32 bits at a time from memory,
 variables of type 'byte' are only allowed in registers, not on the stack. Here
@@ -327,7 +333,8 @@ copy-byte-to *var1/reg1, var2/reg2  # var1: (addr byte), var2: byte
 ```

 In addition, variables of type 'byte' are restricted to (the lowest bytes of)
-just 4 registers: eax, ecx, edx and ebx.
+just 4 registers: `eax`, `ecx`, `edx` and `ebx`. As always, this is due to
+constraints of the x86 instruction set.

 ## Primitive jumps

@@ -367,8 +374,7 @@ break-if-!=
 break-if-!= label
 ```

-Inequalities are similar, but have unsigned and signed variants. For simplicity,
-always use signed integers; use the unsigned variants only to compare addresses.
+Inequalities are similar, but have additional variants for addresses and floats.

 ```
 break-if-<
@@ -424,6 +430,15 @@ loop-if-addr<=
 loop-if-addr<= label
 loop-if-addr>=
 loop-if-addr>= label
+
+loop-if-float<
+loop-if-float< label
+loop-if-float>
+loop-if-float> label
+loop-if-float<=
+loop-if-float<= label
+loop-if-float>=
+loop-if-float>= label
 ```

 ## Addresses
@@ -496,12 +511,12 @@ type point {

 Mu programs are currently sequences of `fn` and `type` definitions.

-Compound types can't include `addr` types for safety (use `handle` instead).
-They also can't currently include `array`, `stream` or `byte` types. Since
-arrays and streams carry their size with them, supporting them in compound
-types complicates variable initialization. Instead of defining them inline in
-a type definition, define a `handle` to them. Bytes shouldn't be used for
-anything but arrays of bytes (utf-8 strings).
+Compound types can't include `addr` types for safety (use `handle` instead,
+which is described below). They also can't currently include `array`, `stream`
+or `byte` types. Since arrays and streams carry their size with them, supporting
+them in compound types complicates variable initialization. Instead of
+defining them inline in a type definition, define a `handle` to them. Bytes
+shouldn't be used for anything but UTF-8 strings.

 To access within a compound type, use the `get` instruction. There are two
 forms. You need either a variable of the type itself (say `T`) in memory, or a
@@ -512,8 +527,8 @@ var/reg: (addr T_f) <- get var/reg: (addr T), f
 var/reg: (addr T_f) <- get var: T, f
 ```

-The `f` here is the field name from the `type` definition, and `T_f` must
-match the type of `f` in the `type` definition. For example, some legal
+The `f` here is the field name from the `type` definition, and its type `T_f`
+must match the type of `f` in the `type` definition. For example, some legal
 instructions for the definition of `point` above:

 ```
@@ -523,11 +538,12 @@ var a/eax: (addr int) <- get p, y

 ## Handles for safe access to the heap

-We've seen the `addr` type, but it's intended to be short-lived. In particular,
-you can't save `addr` values inside compound `type`s. To do that you need a
-"fat pointer" called a `handle` that is safe to keep around for extended
-periods and ensures it's used safely without corrupting the heap and causing
-security issues or hard-to-debug misbehavior.
+We've seen the `addr` type, but it's intended to be short-lived. `addr` values
+should never escape from functions. In particular, you can't save `addr` values
+inside compound `type`s. To do that you need a "fat pointer" called a `handle`
+that is safe to keep around for extended periods and ensures it's used safely
+without corrupting the heap and causing security issues or hard-to-debug
+misbehavior.

 To actually _use_ a `handle`, we have to turn it into an `addr` first using
 the `lookup` statement.
@@ -545,7 +561,7 @@ doesn't implement reclamation yet.)

 Having two kinds of addresses takes some getting used to. Do we pass in
 variables by value, by `addr` or by `handle`? In inputs or outputs? Here are 3
-rules:
+rules of thumb:

  * Functions that need to look at the payload should accept an `(addr ...)`.
  * Functions that need to treat a handle as a value, without looking at its
@@ -596,8 +612,21 @@ var y/eax: (addr handle T) <- address ...
 copy-handle x, y
 ```

+## Seams
+
+I said at the start that most instructions map 1:1 to x86 machine code. To
+enforce type- and memory-safety, I was forced to carve out a few exceptions:
+
+* the `index` instruction on arrays, for bounds-checking
+* the `length` instruction on arrays, for translating the array size in bytes
+  into the number of elements
+* the `lookup` instruction on handles, for validating fat-pointer metadata
+* `var` instructions, for initializing memory
+
 ## Conclusion

-Anything not allowed here is forbidden. At least until you modify mu.subx.
-Please [contact me](mailto:ak@akkartik.com) or [report issues](https://github.com/akkartik/mu/issues)
-when you encounter a missing or misleading error message.
+Anything not allowed here is forbidden, even if the compiler doesn't currently
+detect and complain about it. Please [contact me](mailto:ak@akkartik.com) or
+[report issues](https://github.com/akkartik/mu/issues) when you encounter a
+missing or misleading error message. Thank you for bearing with the dust! I'm
+here for the long haul, and everything will be clean and checked in due time.
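
Note on the new "Seams" section: two of the listed exceptions, `length` and `index`, can be pictured with a small C sketch. This is only an illustration under stated assumptions, not the Mu compiler's actual output. The `int_array` struct, the 4-byte size prefix, and the helper names `new_int_array`, `array_length` and `array_index` are hypothetical. The diff only says that arrays carry their size with them, that `length` translates a size in bytes into a number of elements, and that `index` is bounds-checked. Returning `NULL` on a failed check is also an assumption made to keep the sketch testable.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout (assumption): the size in bytes, then the elements. */
typedef struct {
    uint32_t size_in_bytes;
    int32_t data[];          /* flexible array member holding the payload */
} int_array;

/* Allocate a zeroed array of n ints and record its size in bytes. */
static int_array *new_int_array(uint32_t n) {
    int_array *a = calloc(1, sizeof(int_array) + n * sizeof(int32_t));
    if (a == NULL) return NULL;
    a->size_in_bytes = n * (uint32_t)sizeof(int32_t);
    return a;
}

/* Sketch of `length`: translate the stored byte count into an element count. */
static uint32_t array_length(const int_array *a) {
    return a->size_in_bytes / (uint32_t)sizeof(a->data[0]);
}

/* Sketch of `index`: bounds-check, then return the element's address.
   Returning NULL on failure is an assumption of this sketch. */
static int32_t *array_index(int_array *a, uint32_t i) {
    if (i >= array_length(a)) return NULL;
    return &a->data[i];
}
```

Both helpers need more than one machine instruction (a division for `length`, a compare and branch for `index`), which is the kind of departure from the 1:1 mapping that the section describes.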