Add item about counting characters.

2024-03-19 18:30:30 -04:00 · 2024-03-19 18:30:30 -04:00 · ef8027484a
parent b73f8ae850
commit ef8027484a
3 changed files with 115 additions and 15 deletions
--- a/A-with-dots.png
+++ b/A-with-dots.png
--- a/barnold.css
+++ b/barnold.css
@@ -0,0 +1,17 @@
+h2 {
+    margin-top: 2em;
+}
+
+table, th, td {
+    border: 1px solid;
+    border-collapse: collapse;
+    padding: 0.1em 0.5em 0.1em 0.5em;
+}
+
+pre {
+    background: #556b2f;
+    border: 1px solid #f0e68c;
+    color: #ffffe0;
+    padding: 4px 8px;
+    overflow-x: scroll;
+}
--- a/index.html
+++ b/index.html
@@ -1,24 +1,107 @@
 <html>
-<head><title>barnold's tilde.club page</title>
-<link rel="stylesheet" href="https://tilde.club/style.css">
+  <head>
+    <meta charset="utf-8" />
+    <title>barnold's tilde.club page</title>
+    <link rel="stylesheet" href="https://tilde.club/style.css">
+    <link rel="stylesheet" href="/~barnold/barnold.css">
+  </head>
 </head>
 <body>

-<style>
-  pre {
-      background: #556b2f;
-      border: 1px solid #f0e68c;
-      color: #ffffe0;
-      padding: 4px 8px;
-      overflow-x: scroll;
-  }
-  ::-webkit-scrollbar {
-      height: 40px;
-  }
-</style>
-
 <h1>~~~barnold's tilde.club page~~~</h1>

+
+<h2>Counting characters</h2>
+
+2024-03-19
+
+<p>Recently I tried to learn myself a little UTF-8. My guide was
+<a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">Markus Kuhn's
+FAQ</a>. Its discussion of "combining characters" made sense to
+me. These are "code points", in Unicode speak, that identify a sort of
+decoration applied to the preceding character. The FAQ compared two
+examples.  The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This
+is a "precomposed character", i.e. you get the capital A and its
+little dotted hat together as a single unit.
+<p>
+The second example is the same character conceptually, but represented
+by <strong>two</strong> code points: "LATIN CAPITAL LETTER A" followed
+by "COMBINING DIAERESIS". The first one gives you a plain capital A and
+the second one means "go back to the last character and put two little
+dots on top, kthxbai". This combining form is apparently to be
+preferred because of its greater flexibility. You don't need to define
+every possible combination of plain letter plus decorator (or
+"diacritical mark" as the jargon has it).
+<p>
+  Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.
+<p>
+  <table>
+    <tr>
+      <th>Code point name, value</th>
+      <th>Bytes (hex)</th>
+      <th>Rendering</th>
+    </tr>
+    <tr>
+      <td>LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4</td>
+      <td>xc3, x84</td>
+      <td>Ä</td>
+    </tr>
+    <tr>
+      <td>LATIN CAPITAL LETTER A, U+0041</td>
+      <td>x41</td>
+      <td rowspan=2>Ä</td>
+    </tr>
+    <tr>
+      <td>COMBINING DIAERESIS, U+0308</td>
+      <td>xcc, x88</td>
+    </tr>
+  </table>
+
+<p>
+  Under that Rendering column, you should see the same characters as
+  below (shown as an image in case your browser renders the characters
+  differently),
+<p>
+  <img src="/~barnold/A-with-dots.png"/>
+<p>
+  except I added single quotes around each character, just to show
+  there was no peculiar white space appearing. I typed those printf
+  commands in a urxvt terminal emulator running on my laptop, while
+  connected to a bash shell in tilde.club. The "\xNN" in printf is a
+  handy sequence to output a byte with the hex value of NN.
+<p>
+  What I'm trying to get at with the table and the image is that you
+  should end up with the selfsame visible character,
+  or <em>glyph</em> in Unicode speak, whichever of the two methods you
+  use. In theory you shouldn't be able to tell apart the "precomposed"
+  (one code point) character from the "composed" (two code point)
+  character, short of running <strong>od(1)</strong> or the like.
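+<p>
+  As a minimal sketch of that claim in Python (an illustration only, not
+  taken from the FAQ; the helper names <code>utf8_hex</code>
+  and <code>describe</code> are made up for this example), the two
+  spellings can be encoded to UTF-8 and compared, both raw and after
+  NFC normalization:

```python
import unicodedata

# The two spellings of "A with diaeresis" from the table above.
PRECOMPOSED = "\u00c4"      # LATIN CAPITAL LETTER A WITH DIAERESIS
COMBINING = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return text.encode("utf-8").hex(" ")


def describe(text):
    """Return (list of code point names, UTF-8 bytes in hex) for text."""
    names = [unicodedata.name(ch) for ch in text]
    return names, utf8_hex(text)


if __name__ == "__main__":
    for label, text in (("precomposed", PRECOMPOSED), ("combining", COMBINING)):
        names, hex_bytes = describe(text)
        print(f"{label}: {names} -> {hex_bytes}")

    # The raw strings differ code point by code point...
    print(PRECOMPOSED == COMBINING)
    # ...but NFC normalization composes the two-code-point form into
    # the single precomposed code point, so the normalized forms match.
    print(unicodedata.normalize("NFC", COMBINING) == PRECOMPOSED)
```

+<p>
+  The raw comparison is false and the normalized one is true: the two
+  forms are distinct sequences of code points that Unicode treats as
+  canonically equivalent.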
+
+<p>
+  Theory and practice are a little different.
+<p>
+  <pre>
+barnold@tilde$ printf "\xc3\x84" | wc --chars
+1
+barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
+2</pre>
+
+<p>
+  Though the two forms of "A with a diaeresis" are in principle one
+  and the same character, wc(1) thinks that the combining form
+  has <strong>two</strong> characters, not just one. According to
+  Markus Kuhn's FAQ, "A combining character is not a full character by
+  itself" so we have a contradiction here. (You might wonder, if it
+  isn't a character why did they call it a
+  "combining <em>character</em>"? I have no answer to that.)
+<p>
+  The maintainers of
+  <a href="https://www.gnu.org/software/coreutils/">GNU coreutils</a>
+  don't regard wc's count of 2 as a bug (I asked on the mailing list)
+  so it's unlikely to change. After decades of effort in computer
+  science the question "what's the character count of this string?"
+  doesn't necessarily have a clear answer.
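+<p>
+  To make that concrete, here is a hedged Python sketch of four
+  different "counts" for the same string. The function names are
+  hypothetical, and nothing here claims to reproduce how wc is
+  implemented. Skipping combining marks is only an approximation of
+  what a reader would call a character; full grapheme cluster
+  segmentation is more involved.

```python
import unicodedata


def count_utf8_bytes(text):
    """Count the bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def count_code_points(text):
    """Count code points; a Python str is a sequence of code points."""
    return len(text)


def count_after_nfc(text):
    """Count code points after composing with NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def count_without_combining(text):
    """Count code points whose canonical combining class is 0.

    Approximation only: this is not full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    combining_form = "\u0041\u0308"  # A followed by COMBINING DIAERESIS
    for counter in (
        count_utf8_bytes,
        count_code_points,
        count_after_nfc,
        count_without_combining,
    ):
        print(counter.__name__, counter(combining_form))
```

+<p>
+  For the combining form the code point count is 2, the same number wc
+  reported, while the two other character-style counts give 1. Note
+  that NFC only collapses a sequence when a precomposed code point
+  exists for it, so even the normalized count is not a general answer.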
+
 <h2>Best commit message of the year (so far)</h2>

 2024-01-06