Unfortunately the Unicode database doesn't actually provide obvious
metadata for combining characters. Here is the process I followed.
I noticed that GNU Unifont provides the following files for download:
- unifont-13.0.06.hex: All Plane 0 glyphs
- unifont_sample-13.0.06.hex: The above .hex file with combining circles added
Downloading and diffing the two yields all code-points with combining
circles. I assume they are exactly the combining characters I care
about.
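
The diff step above can be sketched in Python. This is a minimal
illustration, not the procedure actually used: it assumes each line of a
Unifont .hex file has the form CODEPOINT:BITMAP in hexadecimal, and the
function names parse_hex and changed_code_points are made up for the
example.

```python
def parse_hex(text):
    """Map code point -> bitmap hex string for each 'CODE:BITMAP' line."""
    glyphs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, bitmap = line.split(":", 1)
        glyphs[int(code, 16)] = bitmap.upper()
    return glyphs


def changed_code_points(plain_text, sample_text):
    """Return sorted code points whose bitmap differs between the two files.

    A code point present only in the sample file also counts as changed.
    """
    plain = parse_hex(plain_text)
    sample = parse_hex(sample_text)
    return sorted(cp for cp, bitmap in sample.items() if plain.get(cp) != bitmap)


if __name__ == "__main__":
    # Hypothetical file contents; the real files are much larger.
    plain = "0041:" + "00" * 16 + "\n0301:" + "00" * 16
    sample = "0041:" + "00" * 16 + "\n0301:" + "FF" * 16
    for cp in changed_code_points(plain, sample):
        print(f"U+{cp:04X}")
```

Whether every changed glyph really is a combining character is the same
assumption made above; the sketch only reports which bitmaps differ.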
One mechanical difficulty is cross-correlating the above files, which
include the code-point in each line, with font.subx, which does not. I
got things to work by modifying the above files in place until they have
the same format as font.subx, using the following Vim commands on each
file:
:%s|.\{64\}|10/size^M00/is-combine^M&|
:%s|^.\{32\}$|08/size^M00/is-combine^M&00000000000000000000000000000000|
:%s|..|& |g
:%s|10 /s iz e|10/size|
:%s|08 /s iz e|08/size|
:%s|00 /i s- co mb in e|00/is-combine|
Now I can update the metadata with a Vim macro which jumps to the next
hunk and increments /is-combine on the previous line.
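
For readers who don't want to replay the Vim substitutions, here is a
rough Python equivalent of the reformatting. It is a sketch under
assumptions: the code-point prefix has already been stripped from each
line (the commands above only make sense on bare bitmaps), the name
to_subx_lines is hypothetical, and the trailing space that the `& `
substitution leaves on each byte line is not reproduced.

```python
def to_subx_lines(bitmap_hex, is_combine=False):
    """Return the three lines for one glyph: size, is-combine, 32 bytes."""
    if len(bitmap_hex) == 64:
        size = "10"                 # 16 pixels wide: 32 bytes of bitmap
    elif len(bitmap_hex) == 32:
        size = "08"                 # 8 pixels wide: 16 bytes of bitmap
        bitmap_hex += "0" * 32      # pad with 16 zero bytes, as in the Vim command
    else:
        raise ValueError("expected 32 or 64 hex digits")
    # Split into two-digit bytes separated by spaces.
    pairs = [bitmap_hex[i:i + 2] for i in range(0, 64, 2)]
    flag = "01" if is_combine else "00"
    return [f"{size}/size", f"{flag}/is-combine", " ".join(pairs)]


if __name__ == "__main__":
    for line in to_subx_lines("18" * 16, is_combine=True):
        print(line)
```

Passing is_combine=True stands in for the Vim macro that increments
/is-combine from 00 to 01.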
https://en.wikipedia.org/wiki/Combining_character
The plan: just draw the combining character in the same space as the
previous character. This will almost certainly not work for some Unicode
blocks (Tibetan?).
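
One way to picture "drawing in the same space" is to overlay two 1-bit
bitmaps. The sketch below is only a model of the idea, assuming each
glyph is a list of integer rows with one bit per pixel; it says nothing
about how Mu actually draws pixels, and overlay is a hypothetical name.

```python
def overlay(base_rows, combining_rows):
    """OR two equal-height 1-bit bitmaps together, row by row."""
    if len(base_rows) != len(combining_rows):
        raise ValueError("bitmaps must have the same number of rows")
    return [b | c for b, c in zip(base_rows, combining_rows)]


if __name__ == "__main__":
    # Tiny made-up 8x2 bitmaps: an accent on the top row, a bar below.
    base = [0b00000000, 0b00111100]
    accent = [0b00011000, 0b00000000]
    for row in overlay(base, accent):
        print(f"{row:08b}")
```

This works when the combining glyph's pixels sit where the base glyph is
empty, which is exactly the assumption that may fail for some blocks.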
This commit only changes the data/memory/disk model to make some space.
As always in Mu, we avoid bit-mask tricks even if that wastes memory.
Yet another gnarly reason to start checking all arg metadata in
linux/pack.subx or something like that. With this bug most of my
programs (including browser-slack!) were working even though the
instruction stream was almost certainly misdecoded halfway through every
attempt to draw glyphs.
Unix text-mode terminals transparently support utf-8 these days, and so
I treat utf-8 sequences (which I call graphemes in Mu) as fundamental.
I then blindly carried over this state of affairs to bare-metal Mu,
where it makes no sense. If you don't have a terminal handling
font-rendering for you, fonts are most often indexed by code points and
not utf-8 sequences.
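
To make the distinction concrete, here is a small Python illustration
(not Mu code) of turning a utf-8 sequence into the code point a font
table would be indexed by. The helper name code_point_of is
hypothetical.

```python
def code_point_of(utf8_bytes):
    """Decode a utf-8 sequence that encodes exactly one code point."""
    decoded = utf8_bytes.decode("utf-8")
    if len(decoded) != 1:
        raise ValueError("expected exactly one code point")
    return ord(decoded)


if __name__ == "__main__":
    sequence = b"\xc3\xa9"          # the two-byte utf-8 encoding of U+00E9
    print(f"U+{code_point_of(sequence):04X}")
```

A terminal consumes the two bytes directly; a font indexed by code point
needs the single number 0xE9 instead.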
We'll need this when rendering 16-bit glyphs. They'll occupy two
8x16 display units on screen, but the grapheme is a single unit as far
as fake screens are concerned.