Fix a couple of subtle bugs.
- the VM was conditionally reading from the instruction stream, so that
other bugs got masked by decoding errors.
- push-n-bytes was clobbering eax.
No support for combining characters. Graphemes are currently just utf-8
encodings of a single Unicode code-point. No support for code-points that
require more than 32 bits in utf-8.
1000+ LoC spent; just 300+ excluding tests.
Still one known gap: we don't check the entirety of an array's element
type if it's a compound. So far we just check whether, say, both sides start
with 'addr'. Obviously that's not good enough.
There's a question of how we should match array types with a capacity against
ones without. For now we're going to do the simplest possible thing and
just make type-match? more robust. It'll always return false if the types
don't match exactly. For ignoring capacity we'll rely on the checks of
the `address` operation (which don't exist yet). So to pass the address
of an array to a function f with signature `f (addr array int)`, we should
write this:
  var a: (array int 3)
  var b/eax: (addr array int) <- address a
  f b
rather than this:
  var a: (array int 3)
  var b/eax: (addr array int 3) <- address a
  f b
Similar reasoning applies to stream types. Arrays and streams are currently
the only types that can have an optional capacity.
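A minimal sketch of the strict matching described above, written in Python rather than Mu. Types are modeled as flat tuples, and `type_match` is a hypothetical stand-in for `type-match?`; the real check lives in mu.subx and walks a different representation.

```python
# Types are modeled as tuples of atoms, e.g. ("addr", "array", "int", 3).
# This representation is an assumption made for illustration only.

def type_match(expected, actual):
    """Return True only if the two types are identical, capacity included."""
    return tuple(expected) == tuple(actual)

sig_param = ("addr", "array", "int")           # f (addr array int)
without_capacity = ("addr", "array", "int")    # (addr array int)
with_capacity = ("addr", "array", "int", 3)    # (addr array int 3)

print(type_match(sig_param, without_capacity))  # exact match
print(type_match(sig_param, with_capacity))     # capacity differs, so no match
```

Comparing whole tuples also covers compound element types, since nothing past the leading 'addr' is skipped.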
For example:
  fn main -> r/ebx: int {
    var x/eax: grapheme <- copy 0x9286e2  # code point 0x2192 in utf-8
    print-grapheme-to-real-screen x
    print-string-to-real-screen "\n"
  }
Graphemes must fit in 4 bytes (21 bits for code points). Unclear what we
should do for longer clusters since graphemes are a fixed-size type at
the moment.
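To see where a literal like 0x9286e2 comes from, here is a small Python sketch (not Mu) that packs the utf-8 bytes of one code point into a 32-bit value, first byte in the lowest position. The helper names `to_grapheme` and `from_grapheme` are made up for this illustration.

```python
def to_grapheme(ch):
    """Pack the utf-8 bytes of one code point into a little-endian integer."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    encoded = ch.encode("utf-8")            # 1 to 4 bytes
    return int.from_bytes(encoded, "little")

def from_grapheme(g):
    """Inverse of to_grapheme: recover the code point from the packed bytes."""
    n = max(1, (g.bit_length() + 7) // 8)   # number of bytes actually used
    return g.to_bytes(n, "little").decode("utf-8")

print(hex(to_grapheme("\u2192")))  # 0x9286e2
print(from_grapheme(0x9286e2) == "\u2192")
```

The utf-8 encoding of code point 0x2192 is the bytes e2 86 92, which read as a little-endian word give 0x9286e2.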
Both have the same size: 4 bytes.
So far I've just renamed print-byte to print-grapheme, but it still behaves
the same.
I'm going to support printing code-points next, but grapheme 'clusters'
spanning multiple code-points won't be supported for some time.
This is a hacky special case. The alternative would be more general support
for generics.
One observation: we might be able to type-check some primitives using `sig`s.
Only if they don't return anything, since primitives usually need to support
arbitrary registers. I'm not doing that yet, though. It eliminates the
possibility of writing tests for them in mu.subx, which can't see 400.mu.
But it's an alternative:
sig allocate out: (addr handle _)
sig populate out: (addr handle array _), n: int
sig populate-stream out: (addr handle stream _), n: int
sig read-from-stream s: (addr stream _T), out: (addr _T)
sig write-to-stream s: (addr stream _T), in: (addr _T)
We could write the tests in Mu. But then we're testing behavior rather
than the code generated. There are trade-offs. By performing type-checking
in mu.subx I retain the option to write both kinds of tests.
Slices contain `addr`s so the same rules apply to them. They can't be stored
in structs and so on. But they may be an efficient temporary while parsing.
Streams are currently a second generic type after arrays, and they're gradually
strengthening the case for just biting the bullet and supporting first-class
generics in Mu.
We need to remember to clear local variables. And there's a good question
here of how Mu supports variables of type stream or table. Or other user-defined
types that inline arrays.
Function signatures can now take type parameters starting with '_'.
Type parameters in a signature match any concrete type in the call. But
they have to be consistent within a single call.
Things I considered but punted on for now:
- having '_' match anything without needing to be consistent. Wildcards
actually seem harder to understand.
- disallowing top-level '_' types. I'll wait until a concrete use case
for disallowing.
We still don't support *defining* types with type parameters, so for now
this is only useful for calling functions on arrays or streams or handles.
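The matching rule for type parameters can be sketched as follows. This is a Python model, not the mu.subx implementation: types are nested tuples, and `match` and `call_matches` are hypothetical names. A name starting with '_' binds to whatever it first meets and must then stay consistent for the rest of the call.

```python
def match(param, arg, bindings):
    """Match one signature type against one concrete type.

    Atoms starting with '_' are type parameters. They match any concrete
    type, but each name must bind to the same type within a single call.
    """
    if isinstance(param, str) and param.startswith("_"):
        if param in bindings:
            return bindings[param] == arg
        bindings[param] = arg
        return True
    if isinstance(param, tuple) and isinstance(arg, tuple):
        return len(param) == len(arg) and all(
            match(p, a, bindings) for p, a in zip(param, arg))
    return param == arg

def call_matches(sig, args):
    """Check every argument of a call, sharing one set of bindings."""
    bindings = {}  # fresh for each call
    return len(sig) == len(args) and all(
        match(p, a, bindings) for p, a in zip(sig, args))

# Modeled on: read-from-stream s: (addr stream _T), out: (addr _T)
sig = [("addr", ("stream", "_T")), ("addr", "_T")]
print(call_matches(sig, [("addr", ("stream", "int")), ("addr", "int")]))
print(call_matches(sig, [("addr", ("stream", "int")), ("addr", "byte")]))
```

In this model a bare '_' is just another parameter name, so it also has to be consistent within a call.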
- allocate var
- populate var, n
Both rely on the type of `var` to compute the size of the allocation. No
need to repeat the name of the type like in C, C++ or Java.
Mu exclusively uses hex everywhere for a consistent programming experience
from machine code up. But we all still tend to say '10' when we mean 'ten'.
Catch that early.
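A hedged sketch of what such a check could look like, in Python. The exact rule Mu applies isn't spelled out here, so this assumes one plausible version: multi-digit literals need a '0x' prefix, while single digits are unambiguous in either base. `check_literal` is a hypothetical name.

```python
import re

HEX_LITERAL = re.compile(r"-?0x[0-9a-f]+")
SINGLE_DIGIT = re.compile(r"-?[0-9]")

def check_literal(token):
    """Return the value of a literal, rejecting ambiguous spellings.

    Assumed rule (not confirmed by the commit message): '10' is rejected
    because it could mean ten or sixteen; '0x10' and '7' are accepted.
    """
    if HEX_LITERAL.fullmatch(token) or SINGLE_DIGIT.fullmatch(token):
        return int(token, 16)
    raise ValueError(f"ambiguous literal {token!r}: add a 0x prefix")

print(check_literal("0x10"))  # sixteen
print(check_literal("7"))
```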
This commit reimplements commit 6515 to happen during type-checking rather
than as early as possible. That way we naturally get a more informative
error message.
The new failing test is now passing, and so is this manual test that had
been throwing a spurious error:
  fn foo {
    var a/eax: int <- copy 0
    var b/ebx: int <- copy 0
    {
      var a1/eax: int <- copy 0
      var b1/ebx: int <- copy a1
    }
    b <- copy a
  }
However, factorial.mu is still throwing a spurious error.
Some history on this commit's fix: When I moved stack-location tracking
out of the parsing phase (commit 6116, Mar 10) I thoughtlessly moved block-depth
tracking as well. And the reason that happened: I'd somehow gotten by without
ever cleaning up vars from a block during parsing. For all my tests, this
is a troubling sign that I'm not testing enough.
The good news: clean-up-blocks works perfectly during parsing.
Before: bytes can't live on the stack, so size(byte) == 1 just for array
elements.
After: bytes mostly can't live on the stack except for function args (which
seem too useful to disallow), so size(byte) == 4 except there's now a new
primitive called element-size for array elements where size(byte) == 1.
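The before/after split can be modeled in a few lines of Python. The table and function names below are assumptions for illustration; only the numbers for 'byte' come from the description above.

```python
# Size of a value when it lives on the stack or is passed as an argument.
# A byte argument occupies a full 4-byte word.
STACK_SIZE = {"int": 4, "byte": 4}

def size_of(type_name):
    """Size in bytes of a variable of this type outside of arrays."""
    return STACK_SIZE[type_name]

def element_size(type_name):
    """Size in bytes of one array element; bytes are packed tightly."""
    return 1 if type_name == "byte" else size_of(type_name)

def array_data_size(type_name, n):
    """Bytes needed for the elements of an n-element array."""
    return n * element_size(type_name)

print(size_of("byte"), element_size("byte"))
print(array_data_size("byte", 3), array_data_size("int", 3))
```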
Now apps/browse.subx starts working again.
Several bugs fixed in the process, and my expectation of further bugs is growing.
I'd somehow started assuming I don't need to have separate cases for rm32
as a register vs mem. That's not right. We might need more reg-reg Primitives.
Most unbelievably, I'd forgotten to pass the output 'out' arg to 'lookup-var'
long before the recent additions of 'err' and 'ed' args. But things continued
to work because an earlier call just happened to leave the arg at just
the right place on the stack. So we only caught all these places when we
had to provide error messages.
Byte-oriented addressing is only supported in a couple of instructions
in SubX. As a result, variables of type 'byte' can't live on the stack,
or in registers 'esi' and 'edi'.
I had a little "optimization" to avoid creating nested blocks if "they weren't
needed". Except, of course, they were. Lose the optimization. Sometimes
we create multiple jumps when a single one would suffice. Ignore that for
now.
The rule: emit spills for a register unless the output is written somewhere
in the current block after the current instruction. Including in nested
blocks.
Let's see if this is right.
Rather than have two ways to decide whether to emit push/pop instructions,
just record for each var on the 'vars' stack whether we emitted a push
for it, and reuse the decision to emit a pop.
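A rough Python model of that bookkeeping. The `Var` record, the function names and the emitted strings are placeholders, not the real SubX output; the point is only that the pop decision reads back what the push decision stored.

```python
from dataclasses import dataclass

@dataclass
class Var:
    name: str
    register: str
    pushed: bool = False   # did we emit a push when this var was declared?

def declare(vars_stack, name, register, needs_spill, out):
    """Declare a register var, recording whether its register was spilled."""
    if needs_spill:
        out.append(f"push {register}")
    vars_stack.append(Var(name, register, pushed=needs_spill))

def clean_up_block(vars_stack, block_start, out):
    """Drop the vars of a block, popping only registers we actually pushed."""
    while len(vars_stack) > block_start:
        var = vars_stack.pop()
        if var.pushed:
            out.append(f"pop {var.register}")

out, vars_stack = [], []
declare(vars_stack, "a", "eax", True, out)
declare(vars_stack, "b", "ebx", False, out)
clean_up_block(vars_stack, 0, out)
print(out)
```

With a single recorded flag, pushes and pops can't disagree, whatever rule is used to decide on the push.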
Observations:
- the orchestration from 'in' to 'addr-in' to '_in-addr' to 'in-addr'
  is quite painful. Once to turn a handle into its address, once to turn
  a handle into the address of its payload, and a third time to switch
  a variable out of the overloaded 'eax' variable to make room for read-byte-buffered.
- I'm starting to use SubX as an escape hatch for features missing in Mu:
  - access to syscalls (which pass args in registers)
  - access to global variables
How did new-literal ever work?! Somehow we had eax silently being clobbered
without affecting behavior over like 5 apps. Unsafe languages suck.
Anyways, factorial.mu is now part of CI.
So far it's unclear how to do this in a series of small commits. Still
nibbling around the edges. In this commit we standardize some terminology:
The length of an array or stream is denominated in the high-level elements.
The _size_ is denominated in bytes.
The thing we encode into the type is always the size, not the length.
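In other words, the two are related by the element size. A tiny Python sketch of the conversion, with hypothetical helper names:

```python
def array_size(length, element_size):
    """Size in bytes of an array holding `length` elements."""
    return length * element_size

def array_length(size, element_size):
    """Number of elements in an array whose data occupies `size` bytes."""
    if size % element_size != 0:
        raise ValueError("size is not a whole number of elements")
    return size // element_size

# An array of 3 ints (4 bytes each): length 3, size 0xc.
print(array_size(3, 4))
print(array_length(12, 4))
```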
There's still an open question of what to do about the Mu `length` operator.
I'd like to modify it to provide the length. Currently it provides the
size. If I can't fix that I'll rename it.
At the lowest level, SubX without syntax sugar uses names without prepositions.
For example, 01 and 03 are both called 'add', irrespective of source and
destination operand. Horizontal space is at a premium, and we rely on the
comments at the end of each line to fully describe what is happening.
Above that, however, we standardize on a slightly different naming convention
across:
a) SubX with syntax sugar,
b) Mu, and
c) the SubX code that the Mu compiler emits.
Conventions, in brief:
- by default, the source is on the left and destination on the right.
  e.g. add %eax, 1/r32/ecx ("add eax to ecx")
- prepositions reverse the direction.
  e.g. add-to %eax, 1/r32/ecx ("add ecx to eax")
       subtract-from %eax, 1/r32/ecx ("subtract ecx from eax")
- by default, comparisons are left to right while 'compare<-' reverses.
Before, I was sometimes swapping args to make the operation more obvious,
but that would complicate the code-generation of the Mu compiler, and it's
nice to be able to read the output of the compiler just like hand-written
code.
One place where SubX differs from Mu: copy opcodes are called '<-' and
'->'. Hopefully that fits with the spirit of Mu rather than the letter
of the 'copy' and 'copy-to' instructions.
At the SubX level we have to put up with null-terminated kernel strings
for commandline args. But so far we haven't done much with them. Rather
than try to support them we'll just convert them transparently to standard
length-prefixed strings.
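The conversion itself is simple. Here is a Python sketch of the idea, assuming a 4-byte little-endian prefix; the function name and the flat `memory` buffer are stand-ins, not the SubX helper.

```python
import struct

def kernel_string_to_length_prefixed(memory, start):
    """Copy the null-terminated string at `start` into length-prefixed form.

    Assumes a 4-byte little-endian prefix holding the number of bytes.
    """
    end = memory.index(b"\x00", start)   # raises ValueError if unterminated
    data = memory[start:end]
    return struct.pack("<I", len(data)) + data

# Two commandline args laid out back to back, as the kernel might provide them.
memory = b"ab\x00cd\x00"
print(kernel_string_to_length_prefixed(memory, 0))
print(kernel_string_to_length_prefixed(memory, 3))
```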
In the process I realized that it's not quite right to treat the combination
of argc and argv as an array of kernel strings. Argc counts the number
of elements, whereas the size recorded for an array is usually denominated in bytes.