Add item about counting characters.
parent
b73f8ae850
commit
ef8027484a
Binary file not shown.
After Width: | Height: | Size: 19 KiB |
|
@ -0,0 +1,17 @@
|
||||||
|
h2 {
|
||||||
|
margin-top: 2em;
|
||||||
|
}
|
||||||
|
|
||||||
|
table, th, td {
|
||||||
|
border: 1px solid;
|
||||||
|
border-collapse: collapse;
|
||||||
|
padding: 0.1em 0.5em 0.1em 0.5em;
|
||||||
|
}
|
||||||
|
|
||||||
|
pre {
|
||||||
|
background: #556b2f;
|
||||||
|
border: 1px solid #f0e68c;
|
||||||
|
color: #ffffe0;
|
||||||
|
padding: 4px 8px;
|
||||||
|
overflow-x: scroll;
|
||||||
|
}
|
113
index.html
|
@ -1,24 +1,107 @@
|
||||||
<html>
|
<html>
|
||||||
<head><title>barnold's tilde.club page</title>
|
<head>
|
||||||
<link rel="stylesheet" href="https://tilde.club/style.css">
|
<meta charset="utf-8" />
|
||||||
|
<title>barnold's tilde.club page</title>
|
||||||
|
<link rel="stylesheet" href="https://tilde.club/style.css">
|
||||||
|
<link rel="stylesheet" href="/~barnold/barnold.css">
|
||||||
|
</head>
|
||||||
</head>
|
</head>
|
||||||
<body>
|
<body>
|
||||||
|
|
||||||
<style>
|
|
||||||
pre {
|
|
||||||
background: #556b2f;
|
|
||||||
border: 1px solid #f0e68c;
|
|
||||||
color: #ffffe0;
|
|
||||||
padding: 4px 8px;
|
|
||||||
overflow-x: scroll;
|
|
||||||
}
|
|
||||||
::-webkit-scrollbar {
|
|
||||||
height: 40px;
|
|
||||||
}
|
|
||||||
</style>
|
|
||||||
|
|
||||||
<h1>~~~barnold's tilde.club page~~~</h1>
|
<h1>~~~barnold's tilde.club page~~~</h1>
|
||||||
|
|
||||||
|
|
||||||
|
<h2>Counting characters</h2>
|
||||||
|
|
||||||
|
2024-03-19
|
||||||
|
|
||||||
|
<p>Recently I tried to learn myself a little UTF-8. My guide was
|
||||||
|
<a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">Markus Kuhn's
|
||||||
|
FAQ</a>. Its discussion of "combining characters" made sense to
|
||||||
|
me. These are "code points", in Unicode speak, that identify a sort of
|
||||||
|
decoration applied to the preceding character. The FAQ compared two
|
||||||
|
examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This
|
||||||
|
is a "precomposed character", i.e. you get the capital A and its
|
||||||
|
little dotted hat together as a single unit.
|
||||||
|
<p>
|
||||||
|
The second example is the same character conceptually, but represented
|
||||||
|
by <strong>two</strong> code points: "LATIN CAPITAL LETTER A" followed
|
||||||
|
by "COMBINING DIAERESIS". The first one gives you a plain capital A and
|
||||||
|
the second one means "go back to the last character and put two little
|
||||||
|
dots on top, kthxbai". This combining form is apparently to be
|
||||||
|
preferred because of its greater flexibility. You don't need to define
|
||||||
|
every possible combination of plain letter plus decorator (or
|
||||||
|
"diacritical mark" as the jargon has it).
|
||||||
|
<p>
|
||||||
|
Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.
|
||||||
|
<p>
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Code point name, value</th>
|
||||||
|
<th>Bytes (hex)</th>
|
||||||
|
<th>Rendering</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4</td>
|
||||||
|
<td>xc3, x84</td>
|
||||||
|
<td>Ä</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>LATIN CAPITAL LETTER A, U+0041</td>
|
||||||
|
<td>x41</td>
|
||||||
|
<td rowspan=2>Ä</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>COMBINING DIAERESIS, U+0308</td>
|
||||||
|
<td>xcc, x88</td>
|
||||||
|
</tr>
|
||||||
|
</table>
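<p>
The Bytes column can be reproduced with a short Python sketch. It uses only the standard library; the helper name <code>utf8_hex</code> is made up for this illustration and is not something from the FAQ.

```python
# Encode each code point from the table as UTF-8 and show its bytes in hex.
code_points = {
    "LATIN CAPITAL LETTER A WITH DIAERESIS": "\u00c4",
    "LATIN CAPITAL LETTER A": "\u0041",
    "COMBINING DIAERESIS": "\u0308",
}


def utf8_hex(text):
    """Return the UTF-8 encoding of text as a list of two-digit hex strings."""
    return [f"{b:02x}" for b in text.encode("utf-8")]


for name, char in code_points.items():
    # ord() gives the code point value; encode() gives the bytes on the wire.
    print(f"U+{ord(char):04X} {name}: {' '.join(utf8_hex(char))}")
```

The code point value and the encoded bytes are different numbers: U+00C4 is one code point but two bytes in UTF-8.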
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Under that Rendering column, you should see the same characters as
|
||||||
|
below (shown as an image in case your browser renders the characters
|
||||||
|
differently),
|
||||||
|
<p>
|
||||||
|
<img src="/~barnold/A-with-dots.png"/>
|
||||||
|
<p>
|
||||||
|
except I added single quotes around each character, just to show
|
||||||
|
there was no peculiar white space appearing. I typed those printf
|
||||||
|
commands in a urxvt terminal emulator running on my laptop, while
|
||||||
|
connected to a bash shell in tilde.club. The "\xNN" in printf is a
|
||||||
|
handy sequence to output a byte with the hex value of NN.
|
||||||
|
<p>
|
||||||
|
What I'm trying to get at with the table and the image is that you
|
||||||
|
should end up with the selfsame visible character,
|
||||||
|
or <em>glyph</em> in Unicode speak, whichever of the two methods you
|
||||||
|
use. In theory you shouldn't be able to tell apart the "precomposed"
|
||||||
|
(one code point) character from the "composed" (two code point)
|
||||||
|
character, short of running <strong>od(1)</strong> or the like.
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Theory and practice are a little different.
|
||||||
|
<p>
|
||||||
|
<pre>
|
||||||
|
barnold@tilde$ printf "\xc3\x84" | wc --chars
|
||||||
|
1
|
||||||
|
barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
|
||||||
|
2</pre>
|
||||||
|
|
||||||
|
<p>
|
||||||
|
Though the two forms of "A with a diaeresis" are in principle one
|
||||||
|
and the same character, wc(1) thinks that the combining form
|
||||||
|
has <strong>two</strong> characters, not just one. According to
|
||||||
|
Markus Kuhn's FAQ, "A combining character is not a full character by
|
||||||
|
itself" so we have a contradiction here. (You might wonder, if it
|
||||||
|
isn't a character why did they call it a
|
||||||
|
"combining <em>character</em>"? I have no answer to that.)
|
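<p>
The two counts match what you get by counting code points. The Python sketch below decodes the same bytes that were fed to printf and names each code point. That wc counts code points in a UTF-8 locale is an assumption drawn from the output above, not something checked against the coreutils source; <code>code_point_names</code> is a hypothetical helper.

```python
import unicodedata


def code_point_names(raw):
    """Decode UTF-8 bytes and return the Unicode name of each code point."""
    return [unicodedata.name(ch) for ch in raw.decode("utf-8")]


precomposed = b"\xc3\x84"    # same bytes as the first printf command
combining = b"\x41\xcc\x88"  # same bytes as the second printf command

for raw in (precomposed, combining):
    names = code_point_names(raw)
    # The length of the list is the number of code points in the input.
    print(len(names), names)
```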
||||||
|
<p>
|
||||||
|
The maintainers of
|
||||||
|
<a href="https://www.gnu.org/software/coreutils/">GNU coreutils</a>
|
||||||
|
don't regard wc's count of 2 as a bug (I asked on the mailing list)
|
||||||
|
so it's unlikely to change. After decades of effort in computer
|
||||||
|
science the question "what's the character count of this string?"
|
||||||
|
doesn't necessarily have a clear answer.
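<p>
One possible alternative definition, sketched here in Python as an illustration only, is to skip code points that have a non-zero canonical combining class. This is not what wc does, and it is only an approximation of "visible characters" since some marks have combining class 0. The function name <code>count_base_characters</code> is hypothetical.

```python
import unicodedata


def count_base_characters(text):
    """Count code points whose canonical combining class is 0.

    Combining marks such as U+0308 have a non-zero class and are skipped,
    so a base letter plus its diacritics counts once.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


precomposed = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

for text in (precomposed, combining):
    # Show bytes, code points, and the base-character count side by side.
    print(len(text.encode("utf-8")), len(text), count_base_characters(text))
```

Under this definition both forms count as one, while the byte counts and code point counts still differ between them.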
|
||||||
|
|
||||||
<h2>Best commit message of the year (so far)</h2>
|
<h2>Best commit message of the year (so far)</h2>
|
||||||
|
|
||||||
2024-01-06
|
2024-01-06
|
||||||
|
|