Add item about counting characters.

barnold 2024-03-19 18:30:30 -04:00
parent b73f8ae850
commit ef8027484a
3 changed files with 115 additions and 15 deletions

BIN
A-with-dots.png Normal file

Binary file not shown.

Size: 19 KiB

17
barnold.css Normal file

@@ -0,0 +1,17 @@
h2 {
    margin-top: 2em;
}

table, th, td {
    border: 1px solid;
    border-collapse: collapse;
    padding: 0.1em 0.5em 0.1em 0.5em;
}

pre {
    background: #556b2f;
    border: 1px solid #f0e68c;
    color: #ffffe0;
    padding: 4px 8px;
    overflow-x: scroll;
}


@@ -1,24 +1,107 @@
<html>
<head><title>barnold's tilde.club page</title>
<link rel="stylesheet" href="https://tilde.club/style.css">
<head>
<meta charset="utf-8" />
<title>barnold's tilde.club page</title>
<link rel="stylesheet" href="https://tilde.club/style.css">
<link rel="stylesheet" href="/~barnold/barnold.css">
</head>
</head>
<body>
<style>
pre {
background: #556b2f;
border: 1px solid #f0e68c;
color: #ffffe0;
padding: 4px 8px;
overflow-x: scroll;
}
::-webkit-scrollbar {
height: 40px;
}
</style>
<h1>~~~barnold's tilde.club page~~~</h1>
<h2>Counting characters</h2>
2024-03-19
<p>Recently I tried to learn myself a little UTF-8. My guide was
<a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">Markus Kuhn's
FAQ</a>. Its discussion of "combining characters" made sense to
me. These are "code points", in Unicode speak, that identify a sort of
decoration applied to the preceding character. The FAQ compared two
examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This
is a "precomposed character", i.e. you get the capital A and its
little dotted hat together as a single unit.
<p>
The second example is the same character conceptually, but represented
by <strong>two</strong> code points: "LATIN CAPITAL LETTER A" followed
by "COMBINING DIAERESIS". The first one gives you a plain capital A and
the second one means "go back to the last character and put two little
dots on top, kthxbai". This combining form is apparently to be
preferred because of its greater flexibility. You don't need to define
every possible combination of plain letter plus decorator (or
"diacritical mark" as the jargon has it).
<p>
Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.
<p>
<table>
<tr>
<th>Code point name, value</th>
<th>Bytes (hex)</th>
<th>Rendering</th>
</tr>
<tr>
<td>LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4</td>
<td>xc3, x84</td>
<td>Ä</td>
</tr>
<tr>
<td>LATIN CAPITAL LETTER A, U+0041</td>
<td>x41</td>
<td rowspan=2>A&#x308;</td>
</tr>
<tr>
<td>COMBINING DIAERESIS, U+0308</td>
<td>xcc, x88</td>
</tr>
</table>
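<p>
The table can be cross-checked with a few lines of Python. This is a
minimal sketch, not part of the original terminal session; the
variable names and the describe() helper are made up for illustration.
It uses the standard unicodedata module to look up each code point's
name and prints the UTF-8 bytes in hex.

```python
import unicodedata

# The two spellings of "A with diaeresis" from the table.
precomposed = "\u00c4"      # one code point
combining = "\u0041\u0308"  # two code points: A, then COMBINING DIAERESIS


def describe(text):
    """Return (code point names, UTF-8 bytes as a hex string) for text."""
    names = [unicodedata.name(ch) for ch in text]
    utf8_hex = text.encode("utf-8").hex(" ")  # bytes.hex(sep) needs Python 3.8+
    return names, utf8_hex


for text in (precomposed, combining):
    names, utf8_hex = describe(text)
    print(names, utf8_hex)
# ['LATIN CAPITAL LETTER A WITH DIAERESIS'] c3 84
# ['LATIN CAPITAL LETTER A', 'COMBINING DIAERESIS'] 41 cc 88
```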
<p>
Under that Rendering column, you should see the same characters as in
the image below (I've used an image in case your browser renders the
characters differently),
<p>
<img src="/~barnold/A-with-dots.png"/>
<p>
except that in the image I added single quotes around each character,
just to show that no peculiar whitespace was appearing. I typed those
printf commands in a urxvt terminal emulator running on my laptop,
while connected to a bash shell on tilde.club. The "\xNN" sequence in
printf is a handy way to output a byte with the hex value NN.
<p>
What I'm trying to get at with the table and the image is that you
should end up with the selfsame visible character,
or <em>glyph</em> in Unicode speak, whichever of the two methods you
use. In theory you shouldn't be able to tell apart the "precomposed"
(one code point) character from the "composed" (two code point)
character, short of running <strong>od(1)</strong> or the like.
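<p>
One way to make "the same character in theory" concrete is Unicode
normalization, which the text above doesn't cover, so treat this as a
side note. The sketch below assumes Python's standard unicodedata
module; the variable names are illustrative. A plain comparison sees
two different code point sequences, but once both are normalized to
the same form (NFC composes, NFD decomposes) they compare equal.

```python
import unicodedata

precomposed = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Compared code point by code point, the two strings differ.
raw_equal = precomposed == combining
print(raw_equal)  # False

# NFC turns the two-code-point form into the precomposed one;
# NFD does the reverse.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining
print(nfc_equal, nfd_equal)  # True True
```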
<p>
Theory and practice are a little different.
<p>
<pre>
barnold@tilde$ printf "\xc3\x84" | wc --chars
1
barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
2</pre>
<p>
Though the two forms of "A with a diaeresis" are in principle one
and the same character, wc(1) thinks that the combining form
has <strong>two</strong> characters, not just one. According to
Markus Kuhn's FAQ, "A combining character is not a full character by
itself" so we have a contradiction here. (You might wonder, if it
isn't a character why did they call it a
"combining <em>character</em>"? I have no answer to that.)
<p>
The maintainers of
<a href="https://www.gnu.org/software/coreutils/">GNU coreutils</a>
don't regard wc's count of 2 as a bug (I asked on the mailing list)
so it's unlikely to change. After decades of effort in computer
science, the question "what's the character count of this string?"
doesn't necessarily have a clear answer.
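<p>
To show how the answer depends on what you count, here is a rough
Python sketch with three hypothetical helpers (the names are made up
for illustration). Counting code points gives the same 1 and 2 that wc
reported in the session above. The third count skips code points with
a nonzero canonical combining class. That is only an approximation of
"visible characters" and is not full grapheme cluster segmentation, so
it will miscount other cases such as emoji sequences.

```python
import unicodedata


def count_bytes(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def count_code_points(text):
    """Number of code points; a Python str is a sequence of code points."""
    return len(text)


def count_without_combining(text):
    """Number of code points whose canonical combining class is zero.

    A rough stand-in for "visible characters", not real grapheme
    cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


for text in ("\u00c4", "A\u0308"):
    print(count_bytes(text), count_code_points(text), count_without_combining(text))
# prints "2 1 1" for the precomposed form, then "3 2 1" for the combining form
```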
<h2>Best commit message of the year (so far)</h2>
2024-01-06