Add item about counting characters.
parent
b73f8ae850
commit
ef8027484a
Binary file not shown. Size: 19 KiB
@ -0,0 +1,17 @@
h2 {
    margin-top: 2em;
}

table, th, td {
    border: 1px solid;
    border-collapse: collapse;
    padding: 0.1em 0.5em 0.1em 0.5em;
}

pre {
    background: #556b2f;
    border: 1px solid #f0e68c;
    color: #ffffe0;
    padding: 4px 8px;
    overflow-x: scroll;
}
113
index.html
@ -1,24 +1,107 @@
<html>
<head><title>barnold's tilde.club page</title>
<link rel="stylesheet" href="https://tilde.club/style.css">
<head>
<meta charset="utf-8" />
<title>barnold's tilde.club page</title>
<link rel="stylesheet" href="https://tilde.club/style.css">
<link rel="stylesheet" href="/~barnold/barnold.css">
</head>
</head>
<body>

<style>
pre {
    background: #556b2f;
    border: 1px solid #f0e68c;
    color: #ffffe0;
    padding: 4px 8px;
    overflow-x: scroll;
}
::-webkit-scrollbar {
    height: 40px;
}
</style>

<h1>~~~barnold's tilde.club page~~~</h1>

<h2>Counting characters</h2>

2024-03-19

<p>Recently I tried to learn myself a little UTF-8. My guide was
<a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">Markus Kuhn's
FAQ</a>. Its discussion of "combining characters" made sense to
me. These are "code points", in Unicode speak, that identify a sort of
decoration applied to the preceding character. The FAQ compared two
examples. The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This
is a "precomposed character", i.e. you get the capital A and its
little dotted hat together as a single unit.
<p>
The second example is the same character conceptually, but represented
by <strong>two</strong> code points: "LATIN CAPITAL LETTER A" followed
by "COMBINING DIAERESIS". The first one gives you a plain capital A and
the second one means "go back to the last character and put two little
dots on top, kthxbai". This combining form is apparently to be
preferred because of its greater flexibility. You don't need to define
every possible combination of plain letter plus decorator (or
"diacritical mark" as the jargon has it).
<p>
Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.
<p>
<table>
<tr>
<th>Code point name, value</th>
<th>Bytes (hex)</th>
<th>Rendering</th>
</tr>
<tr>
<td>LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4</td>
<td>xc3, x84</td>
<td>Ä</td>
</tr>
<tr>
<td>LATIN CAPITAL LETTER A, U+0041</td>
<td>x41</td>
<td rowspan=2>Ä</td>
</tr>
<tr>
<td>COMBINING DIAERESIS, U+0308</td>
<td>xcc, x88</td>
</tr>
</table>

<p>
Under that Rendering column, you should see the same characters as
below (shown as an image in case your browser renders the characters
differently),
<p>
<img src="/~barnold/A-with-dots.png"/>
<p>
except I added single quotes around each character, just to show
there was no peculiar white space appearing. I typed those printf
commands in a urxvt terminal emulator running on my laptop, while
connected to a bash shell on tilde.club. The "\xNN" in printf is a
handy sequence to output the byte with hex value NN.
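<p>
The byte-to-code-point mapping in the table can be checked with a
minimal Python sketch. This is an illustration only and was not part of
the original terminal session; the helper name <em>describe</em> is
made up for the example. It decodes the same bytes and looks up each
code point's name in the standard unicodedata module.

```python
import unicodedata

def describe(raw):
    """Decode UTF-8 bytes and list each code point as (value, name)."""
    text = raw.decode("utf-8")
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

# The same bytes as in the table, written with Python's \xNN escapes.
precomposed_bytes = b"\xc3\x84"
combining_bytes = b"\x41\xcc\x88"

for raw in (precomposed_bytes, combining_bytes):
    hex_bytes = " ".join("%02x" % b for b in raw)
    print(hex_bytes, describe(raw))
```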
<p>
What I'm trying to get at with the table and the image is that you
should end up with the selfsame visible character,
or <em>glyph</em> in Unicode speak, whichever of the two methods you
use. In theory you shouldn't be able to tell the "precomposed"
(one code point) character apart from the "composed" (two code point)
character, short of running <strong>od(1)</strong> or the like.

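<p>
One way to make "the same character conceptually" concrete is Unicode
normalization, which the text above does not mention, so treat this as
a side note. A short Python sketch, assuming only the standard
unicodedata module: the two strings differ when compared code point by
code point, yet each normalizes to the other.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Compared code point by code point, the two strings are different.
same_raw = precomposed == combining

# NFC composes to the precomposed form where one exists; NFD decomposes.
same_nfc = unicodedata.normalize("NFC", combining) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == combining

print(same_raw, same_nfc, same_nfd)
```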
<p>
Theory and practice are a little different.
<p>
<pre>
barnold@tilde$ printf "\xc3\x84" | wc --chars
1
barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
2</pre>

<p>
Though the two forms of "A with a diaeresis" are in principle one
and the same character, wc(1) thinks that the combining form
has <strong>two</strong> characters, not just one. According to
Markus Kuhn's FAQ, "A combining character is not a full character by
itself", so we have a contradiction here. (You might wonder, if it
isn't a character why did they call it a
"combining <em>character</em>"? I have no answer to that.)
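<p>
The two competing answers can be reproduced in a small Python sketch.
The function names are hypothetical, and the second count is only a
rough approximation: it skips code points with a non-zero canonical
combining class, which is not the same as proper grapheme cluster
segmentation. Counting code points gives 2 for the combining form,
matching the wc output above, while counting only non-combining code
points gives 1.

```python
import unicodedata

def count_code_points(text):
    """Every code point counts, combining marks included."""
    return len(text)

def count_base_characters(text):
    """Rough approximation: skip code points whose canonical
    combining class is non-zero (e.g. COMBINING DIAERESIS)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

precomposed = "\u00c4"   # one code point
combining = "A\u0308"    # two code points

for text in (precomposed, combining):
    print(count_code_points(text), count_base_characters(text))
```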
<p>
The maintainers of
<a href="https://www.gnu.org/software/coreutils/">GNU coreutils</a>
don't regard wc's count of 2 as a bug (I asked on the mailing list),
so it's unlikely to change. After decades of effort in computer
science, the question "what's the character count of this string?"
doesn't necessarily have a clear answer.

<h2>Best commit message of the year (so far)</h2>

2024-01-06