diff --git a/A-with-dots.png b/A-with-dots.png new file mode 100644 index 0000000..c54be7a Binary files /dev/null and b/A-with-dots.png differ diff --git a/barnold.css b/barnold.css new file mode 100644 index 0000000..591b219 --- /dev/null +++ b/barnold.css @@ -0,0 +1,17 @@ +h2 { + margin-top: 2em; +} + +table, th, td { + border: 1px solid; + border-collapse: collapse; + padding: 0.1em 0.5em 0.1em 0.5em; +} + +pre { + background: #556b2f; + border: 1px solid #f0e68c; + color: #ffffe0; + padding: 4px 8px; + overflow-x: scroll; +} diff --git a/index.html b/index.html index 43801b8..7446354 100644 --- a/index.html +++ b/index.html @@ -1,24 +1,107 @@ -
Recently I tried to teach myself a little UTF-8. My guide was +Markus Kuhn's +FAQ. Its discussion of "combining characters" made sense to +me. These are "code points", in Unicode speak, that identify a sort of +decoration applied to the preceding character. The FAQ compared two +examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This +is a "precomposed character", i.e. you get the capital A and its +little dotted hat together as a single unit. +
+The second example is the same character conceptually, but represented +by two code points: "LATIN CAPITAL LETTER A" followed +by "COMBINING DIAERESIS". The first one gives you a plain capital A and +the second one means "go back to the last character and put two little +dots on top, kthxbai". This combining form is apparently to be +preferred because of its greater flexibility. You don't need to define +every possible combination of plain letter plus decorator (or +"diacritical mark" as the jargon has it). +
+ Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser. +
+
Code point name, value | +Bytes (hex) | +Rendering | +
---|---|---|
LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4 | +xc3, x84 | +Ä | +
LATIN CAPITAL LETTER A, U+0041 | +x41 | +Ä | +
COMBINING DIAERESIS, U+0308 | +xcc, x88 | +
+ Under that Rendering column, you should see the same characters as + below (shown as an image in case your browser renders the characters + differently), +
+ +
+ except I added single quotes around each character, just to show + there was no peculiar white space appearing. I typed those printf + commands in a urxvt terminal emulator running on my laptop, while + connected to a bash shell in tilde.club. The "\xNN" in printf is a + handy sequence to output a byte with the hex value of NN. +
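The byte values in the table can be cross-checked without a terminal. The following is a minimal Python sketch, not part of the original page; the variable names are made up for illustration. It builds both forms from their code points, looks up the official code point names and prints the UTF-8 bytes in hex.

```python
import unicodedata

# Both strings are meant to display as "A with two dots on top".
precomposed = "\u00c4"   # one code point: LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # two code points: LATIN CAPITAL LETTER A + COMBINING DIAERESIS

for label, text in (("precomposed", precomposed), ("combining", combining)):
    # Official Unicode name of every code point in the string.
    names = [unicodedata.name(ch) for ch in text]
    # UTF-8 encoding, shown as space-separated hex bytes (Python 3.8+).
    encoded = text.encode("utf-8")
    print(label, names, encoded.hex(" "))
```

Going the other way, `b"\xc3\x84".decode("utf-8")` plays roughly the same role as the printf "\xNN" escapes: it turns raw bytes back into the character.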
+ What I'm trying to get at with the table and the image is that you + should end up with the selfsame visible character, + or glyph in Unicode speak, whichever of the two methods you + use. In theory you shouldn't be able to tell apart the "precomposed" + (one code point) character from the "composed" (two code points) + character, short of running od(1) or the like. + 
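One way to see the "same character in theory" idea in code is Unicode normalization, which the page itself does not discuss. This is a sketch using Python's standard `unicodedata` module; the variable names are illustrative only. A plain string comparison treats the two forms as different, while normalizing to a common form (NFC composes, NFD decomposes) makes them compare equal.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Compared code point by code point, the two strings differ.
raw_equal = precomposed == combining

# NFC composes "A" + COMBINING DIAERESIS into the single precomposed code point.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed

# NFD does the reverse: it splits the precomposed code point into two.
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

print(raw_equal, nfc_equal, nfd_equal)
```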
+ Theory and practice are a little different. +
+
+barnold@tilde$ printf "\xc3\x84" | wc --chars
+1
+barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
+2
+
+ Though the two forms of "A with a diaeresis" are in principle one + and the same character, wc(1) thinks that the combining form + has two characters, not just one. According to + Markus Kuhn's FAQ, "A combining character is not a full character by + itself" so we have a contradiction here. (You might wonder, if it + isn't a character why did they call it a + "combining character"? I have no answer to that.) +
+ The maintainers of + GNU coreutils + don't regard wc's count of 2 as a bug (I asked on the mailing list) + so it's unlikely to change. After decades of effort in computer + science the question "what's the character count of this string?" + doesn't necessarily have a clear answer. +
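The two possible answers can be put side by side in a short Python sketch. This is an illustration only, not how wc is implemented, and both function names are hypothetical. The first function counts code points, which matches the 1 and 2 that wc reported above. The second skips code points with a non-zero canonical combining class, which is a rough approximation of "characters as a reader sees them" and not full grapheme cluster segmentation.

```python
import unicodedata

def count_code_points(text: str) -> int:
    """Count Unicode code points; a Python str is a sequence of code points."""
    return len(text)

def count_without_combining(text: str) -> int:
    """Count code points whose canonical combining class is 0.

    Rough approximation only: it ignores combining marks such as
    COMBINING DIAERESIS, but it is not a grapheme cluster algorithm.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

for text in ("\u00c4", "A\u0308"):
    print(count_code_points(text), count_without_combining(text))
```

For the precomposed form both counts are 1. For the combining form the code point count is 2 and the approximate visible count is 1, which is the disagreement described above.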