Add item about counting characters.

2024-03-19 18:30:30 -04:00 · 2024-03-19 18:30:30 -04:00 · ef8027484a
parent b73f8ae850
commit ef8027484a
3 changed files with 115 additions and 15 deletions
--- a/A-with-dots.png
+++ b/A-with-dots.png
--- a/barnold.css
+++ b/barnold.css
@ -0,0 +1,17 @@
 h2 {
    margin-top: 2em;
 }
 table, th, td {
    border: 1px solid;
    border-collapse: collapse;
    padding: 0.1em 0.5em 0.1em 0.5em;
 }
 pre {
    background: #556b2f;
    border: 1px solid #f0e68c;
    color: #ffffe0;
    padding: 4px 8px;
    overflow-x: scroll;
 }
--- a/index.html
+++ b/index.html
@ -1,24 +1,107 @@
 <html>
-<head><title>barnold's tilde.club page</title>
+  <head>
-<link rel="stylesheet" href="https://tilde.club/style.css">
+    <meta charset="utf-8" />
    <title>barnold's tilde.club page</title>
    <link rel="stylesheet" href="https://tilde.club/style.css">
    <link rel="stylesheet" href="/~barnold/barnold.css">
  </head>
 </head>
 <body>
 <style>
  pre {
      background: #556b2f;
      border: 1px solid #f0e68c;
      color: #ffffe0;
      padding: 4px 8px;
      overflow-x: scroll;
  }
  ::-webkit-scrollbar {
      height: 40px;
  }
 </style>
 <h1>~~~barnold's tilde.club page~~~</h1>
 <h2>Counting characters</h2>
 2024-03-19
 <p>Recently I tried to learn myself a little UTF-8. My guide was
 <a href="https://www.cl.cam.ac.uk/~mgk25/unicode.html">Markus Kuhn's
 FAQ</a>. Its discussion of "combining characters" made sense to
 me. These are "code points", in Unicode speak, that identify a sort of
 decoration applied to the preceding character. The FAQ compared two
 examples.  The first is "LATIN CAPITAL LETTER A WITH DIAERESIS". This
 is a "precomposed character", i.e. you get the capital A and its
 little dotted hat together as a single unit.
 <p>
 The second example is the same character conceptually, but represented
 by <strong>two</strong> code points: "LATIN CAPITAL LETTER A" followed
 by "COMBINING DIAERESIS". The first one gives you a plain capital A and
 the second one means "go back to the last character and put two little
 dots on top, kthxbai". This combining form is apparently to be
 preferred because of its greater flexibility. You don't need to define
 every possible combination of plain letter plus decorator (or
 "diacritical mark" as the jargon has it).
 <p>
  Here's a summary of the code points, their encoding in UTF-8 and the result as rendered by your browser.
 <p>
  <table>
    <tr>
      <th>Code point name, value</th>
      <th>Bytes (hex)</th>
      <th>Rendering</th>
    </tr>
    <tr>
      <td>LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4</td>
      <td>xc3, x84</td>
      <td>Ä</td>
    </tr>
    <tr>
      <td>LATIN CAPITAL LETTER A, U+0041</td>
      <td>x41</td>
      <td rowspan=2>Ä</td>
    </tr>
    <tr>
      <td>COMBINING DIAERESIS, U+0308</td>
      <td>xcc, x88</td>
    </tr>
  </table>
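 <p>
  As a cross-check on the table, here is a minimal Python sketch. It is
  not part of the original experiment, which used printf in bash. The
  names describe, precomposed and combining are made up for
  illustration, and the hex() call with a separator assumes Python 3.8
  or later.

```python
import unicodedata

# The two spellings of "A with diaeresis" from the table.
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def describe(text):
    """Return (name, code point, UTF-8 bytes in hex) for each code point."""
    rows = []
    for ch in text:
        name = unicodedata.name(ch)
        value = f"U+{ord(ch):04X}"
        utf8 = ch.encode("utf-8").hex(" ")  # e.g. "c3 84"
        rows.append((name, value, utf8))
    return rows


for s in (precomposed, combining):
    for row in describe(s):
        print(row)
```

 <p>
  The rows printed should line up with the table: one row for the
  precomposed form, two for the combining form.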
 <p>
  Under that Rendering column, you should see the same characters as
  below (shown as an image in case your browser renders the characters
  differently),
 <p>
  <img src="/~barnold/A-with-dots.png"/>
 <p>
  except I added single quotes around each character, just to show
  there was no peculiar white space appearing. I typed those printf
  commands in a urxvt terminal emulator running on my laptop, while
  connected to a bash shell in tilde.club. The "\xNN" in printf is a
  handy sequence to output a byte with the hex value of NN.
 <p>
  What I'm trying to get at with the table and the image is that you
  should end up with the selfsame visible character,
  or <em>glyph</em> in Unicode speak, whichever of the two methods you
  use. In theory you shouldn't be able to tell the "precomposed"
  (one code point) character apart from the "composed" (two code point)
  character, short of running <strong>od(1)</strong> or the like.
 <p>
  Theory and practice are a little different.
 <p>
  <pre>
 barnold@tilde$ printf "\xc3\x84" | wc --chars
 1
 barnold@tilde$ printf "\x41\xcc\x88" | wc --chars
 2</pre>
 <p>
  Though the two forms of "A with a diaeresis" are in principle one
  and the same character, wc(1) thinks that the combining form
  has <strong>two</strong> characters, not just one. According to
  Markus Kuhn's FAQ, "A combining character is not a full character by
  itself" so we have a contradiction here. (You might wonder, if it
  isn't a character why did they call it a
  "combining <em>character</em>"? I have no answer to that.)
 <p>
  The maintainers of
  <a href="https://www.gnu.org/software/coreutils/">GNU coreutils</a>
  don't regard wc's count of 2 as a bug (I asked on the mailing list)
  so it's unlikely to change. After decades of effort in computer
  science, the question "what's the character count of this string?"
  still doesn't necessarily have a clear answer.
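 <p>
  To make the ambiguity concrete, here is a small Python sketch of
  three ways to "count" the two strings. It is an illustration only,
  not a description of how wc(1) is implemented; that wc counts code
  points is an assumption based on the output above. The function
  names are hypothetical. Skipping combining marks is only a rough
  stand-in for counting user-perceived characters; proper grapheme
  cluster segmentation covers more cases than this does.

```python
import unicodedata

precomposed = "\u00c4"   # one code point
combining = "A\u0308"    # two code points


def count_code_points(text):
    """Count code points; a Python str is a sequence of code points."""
    return len(text)


def count_without_combining(text):
    """Count code points, skipping combining marks.

    unicodedata.combining() returns 0 for non-combining characters.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(count_code_points(precomposed), count_code_points(combining))
print(count_without_combining(precomposed), count_without_combining(combining))

# NFC normalization composes the pair into the precomposed form.
print(unicodedata.normalize("NFC", combining) == precomposed)
```

 <p>
  The first count disagrees between the two strings, the second
  agrees, and normalizing to NFC before counting is another way to
  make them agree, at least for pairs that have a precomposed form.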
 <h2>Best commit message of the year (so far)</h2>
 2024-01-06