diff --git a/A-with-dots.png b/A-with-dots.png
new file mode 100644
index 0000000..c54be7a
Binary files /dev/null and b/A-with-dots.png differ
diff --git a/barnold.css b/barnold.css
new file mode 100644
index 0000000..591b219
--- /dev/null
+++ b/barnold.css
@@ -0,0 +1,17 @@
+h2 {
+  margin-top: 2em;
+}
+
+table, th, td {
+  border: 1px solid;
+  border-collapse: collapse;
+  padding: 0.1em 0.5em 0.1em 0.5em;
+}
+
+pre {
+  background: #556b2f;
+  border: 1px solid #f0e68c;
+  color: #ffffe0;
+  padding: 4px 8px;
+  overflow-x: scroll;
+}
diff --git a/index.html b/index.html
index 43801b8..7446354 100644
--- a/index.html
+++ b/index.html
@@ -1,24 +1,107 @@
-barnold's tilde.club page - + + + barnold's tilde.club page + + + - -

~~~barnold's tilde.club page~~~

+ +

Counting characters

+ +2024-03-19 + +

Recently I tried to learn myself a little UTF-8. My guide was +Markus Kuhn's +FAQ. Its discussion of "combining characters" made sense to +me. These are "code points", in Unicode speak, that identify a sort of +decoration applied to the preceding character. The FAQ compared two +examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This +is a "precomposed character", i.e. you get the capital A and its +little dotted hat together as a single unit. +

+The second example is the same character conceptually, but represented +by two code points: "LATIN CAPITAL LETTER A" followed +by "COMBINING DIAERESIS". The first one gives you a plain capital A and +the second one means "go back to the last character and put two little +dots on top, kthxbai". This combining form is apparently to be +preferred because of its greater flexibility. You don't need to define +every possible combination of plain letter plus decorator (or +"diacritical mark" as the jargon has it). +

+ Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser. +

+ + + + + + + + + + + + + + + + + + + + +
Code point name, value | Bytes (hex) | Rendering
LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4 | xc3, x84 | Ä
LATIN CAPITAL LETTER A, U+0041 | x41 |
COMBINING DIAERESIS, U+0308 | xcc, x88 |
+ +
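If you'd rather check the table than take it on trust, here's a minimal sketch in Python 3.8 or later. The helper name `describe` is my own invention for illustration; it only leans on the standard `unicodedata` module and `str.encode`.

```python
import unicodedata


def describe(text):
    """Return (name, code point, UTF-8 bytes in hex) for each code point in text."""
    rows = []
    for ch in text:
        rows.append((
            unicodedata.name(ch),          # official Unicode character name
            f"U+{ord(ch):04X}",            # code point in the usual U+XXXX notation
            ch.encode("utf-8").hex(" "),   # UTF-8 bytes, space separated
        ))
    return rows


precomposed = "\u00c4"   # one code point: A with diaeresis
combining = "A\u0308"    # two code points: plain A, then combining diaeresis

for row in describe(precomposed) + describe(combining):
    print(row)
```

The precomposed form should come out as a single row with two bytes, and the combining form as two rows with one byte and two bytes respectively.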

+ Under that Rendering column, you should see the same characters as + below (shown as an image in case your browser renders the characters + differently), +

+ +

+ except I added single quotes around each character, just to show + there was no peculiar white space appearing. I typed those printf + commands in a urxvt terminal emulator running on my laptop, while + connected to a bash shell in tilde.club. The "\xNN" in printf is a + handy sequence to output a byte with the hex value of NN. +

+ What I'm trying to get at with the table and the image is that you + should end up with the selfsame visible character, + or glyph in Unicode speak, whichever of the two methods you + use. In theory you shouldn't be able to tell apart the "precomposed" + (one code point) character from the "composed" (two code points) + character, short of running od(1) or the like. + +
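One way to see the "same character, different code points" idea in code is Unicode normalization. This is a small sketch, assuming Python 3 and its standard `unicodedata` module; the variable names are mine. NFC composes where it can and NFD decomposes, so each form can be turned into the other.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# A plain comparison looks at code points, so the two strings differ.
same_raw = precomposed == combining

# NFC folds the two-code-point form into the precomposed one.
same_nfc = unicodedata.normalize("NFC", combining) == precomposed

# NFD splits the precomposed form back into letter plus combining mark.
same_nfd = unicodedata.normalize("NFD", precomposed) == combining

print(same_raw, same_nfc, same_nfd)
```

So a program that wants to treat the two spellings as equal has to normalize both sides first; nothing does it automatically.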

+ Theory and practice are a little different. +

+

+barnold@tilde$ printf "\xc3\x84" | wc --chars
+1
+barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
+2
+ +

+ Though the two forms of "A with a diaeresis" are in principle one + and the same character, wc(1) thinks that the combining form + has two characters, not just one. According to + Markus Kuhn's FAQ, "A combining character is not a full character by + itself" so we have a contradiction here. (You might wonder, if it + isn't a character why did they call it a + "combining character"? I have no answer to that.) +
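To make the disagreement concrete, here's a rough sketch in Python 3 with two ways of counting. Both function names are hypothetical. The first counts code points, which seems to match what wc reported above. The second skips combining marks, which is only an approximation of "characters as a reader sees them"; proper grapheme segmentation has more rules than this.

```python
import unicodedata


def count_code_points(data: bytes) -> int:
    """Count Unicode code points in UTF-8 encoded bytes."""
    return len(data.decode("utf-8"))


def count_without_combining(data: bytes) -> int:
    """Count code points, skipping those with a non-zero combining class.

    This is a simplification: it treats each combining mark as part of
    the preceding character and ignores every other segmentation rule.
    """
    text = data.decode("utf-8")
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = b"\xc3\x84"    # same bytes as the first printf command
combining = b"\x41\xcc\x88"  # same bytes as the second printf command

for data in (precomposed, combining):
    print(len(data), count_code_points(data), count_without_combining(data))
```

For these two inputs the byte counts are 2 and 3, the code point counts are 1 and 2, and the count that skips combining marks is 1 both times.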

+ The maintainers of + GNU coreutils + don't regard wc's count of 2 as a bug (I asked on the mailing list) + so it's unlikely to change. After decades of effort in computer + science the question "what's the character count of this string?" + doesn't necessarily have a clear answer. +

Best commit message of the year (so far)

2024-01-06