Add chinese date format article

2021-03-21 18:39:17 +00:00 · 2021-03-21 18:39:17 +00:00 · 54c1b14cef
parent aa8fdf97e3
commit 54c1b14cef
1 changed files with 119 additions and 0 deletions
--- a/content/cn-date.md
+++ b/content/cn-date.md
@@ -0,0 +1,119 @@
 ---
 title: Chinese date format
 ---
 ## Format
 * Traditional Chinese: `二零二零年四月二十八日`
 * Simplified Chinese: `2020年04月28日`
 ## Characters
 | Character | Meaning |
 | --------: | ------: |
 |        零 |       0 |
 |        〇 |       0 |
 |        一 |       1 |
 |        二 |       2 |
 |        三 |       3 |
 |        四 |       4 |
 |        五 |       5 |
 |        六 |       6 |
 |        七 |       7 |
 |        八 |       8 |
 |        九 |       9 |
 |        十 |      10 |
 |      二十 |      20 |
 |      三十 |      30 |
 |        年 |    Year |
 |        月 |   Month |
 |        日 |     Day |
 ## jq implementation
 I implemented a Chinese date parser using [jq][jq] for [itsb][itsb], as seen
 [here][helpers].
 The parser has two parts: `parse_chinese_number`, which is only guaranteed to
 work from 0 to 99, and `parse_chinese_date`, which splits the date into its
 components and sends them to `parse_chinese_number`.
 This parser only works with jq 1.6 or later, as jq 1.5 and earlier had some
 [Unicode issues][unicode] that caused most string manipulations in this parser
 to break.
 ### Number parsing
 In Chinese dates, years are always written out digit by digit, i.e.
 "two zero two zero" for 2020, and not "two thousand twenty".  This makes
 parsing much simpler, as I do not even need to know how thousands or hundreds
 are expressed in Chinese; I do, however, still need tens to handle months and
 days.
 I started by writing a parser that only handles single digits: I made an
 object mapping Chinese characters to their string digits, translated each
 character, then concatenated the results.
 ```
 ($input // "")     # "二零二零"
 | split("")        # ["二", "零", "二", "零"]
 | map($charmap[.]) # ["2", "0", "2", "0"]
 | join("")         # "2020"
 ```
 I then mapped 十 (10) to an empty string, because it can simply be skipped
 when translating digit by digit.  This only works in some cases:
 | Number | Parsed as | Expected |     Actual |
 | ------:| ---------:| --------:| ----------:|
 | 二十八 |      二八 |       28 |         28 |
 |   十八 |        八 |       18 |          8 |
 |   二十 |        二 |       20 |          2 |
 |     十 |      `""` |       10 | Type error |
 I chose to ignore any case where more than one 十 appears, as my goal was
 only to parse numbers in the 1-31 range; 十十十一 is longer than 三一 or
 三十一, so I do not expect such forms to be used.
 The remaining edge cases only occur when `十` is at the start or the end of
 the string, so I handled them in three ways:
 * If the string is exactly `"十"`, return 10 before any other parsing;
 * If the string ends with 十, multiply the output by 10;
 * If the string begins with 十, add 10 to the output.
 To avoid adding complexity to my feed parsing scripts, I also changed the
 `map` to `map($charmap[.] // .)`, which passes unknown characters through
 unchanged.  Combined with some checks made by the date parsing function, this
 makes it possible to parse both traditional and simplified formats without
 many changes.
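The rules above can be sketched outside of jq as well. The following is a minimal Python sketch of the same logic, not the jq code used in itsb; the `CHARMAP` table and the function body are assumptions reconstructed from the description in this section.

```python
# Hypothetical digit table mirroring the article's character map.
# 十 maps to "" so that it is skipped when translating digit by digit.
CHARMAP = {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9", "十": "",
}


def parse_chinese_number(text: str) -> int:
    """Parse a number from 0 to 99, or a year written digit by digit."""
    # Rule 1: a lone 十 is 10.
    if text == "十":
        return 10
    # Unknown characters (such as ASCII digits) pass through unchanged,
    # like `map($charmap[.] // .)` in the jq version.
    digits = "".join(CHARMAP.get(char, char) for char in text)
    value = int(digits)
    # Rule 2: a trailing 十 means the digit before it counts tens.
    if text.endswith("十"):
        value *= 10
    # Rule 3: a leading 十 means ten plus the following digit.
    if text.startswith("十"):
        value += 10
    return value
```

With this sketch, `parse_chinese_number("十八")` gives 18 and `parse_chinese_number("二十")` gives 20, the two cases that the digit-by-digit approach alone gets wrong, while `"04"` still parses as 4.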
 ### Date parsing
 The date components are split using a regex, to avoid sending too much garbage
 to `parse_chinese_number` in the event of a badly formatted date. Once the
 numbers are parsed, we get an object in this format:
 ```json
 {"year": 1234, "month": 12, "day": 3}
 ```
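As an illustration of the splitting step, here is a small Python sketch. The regular expression is an assumption made for this example and is not the one used in itsb; it only accepts ASCII digits and the numeral characters from the table above, so that anything else is rejected before number parsing.

```python
import re
from typing import Optional

# Hypothetical pattern: three runs of digits or Chinese numerals,
# each followed by its 年 / 月 / 日 marker.
NUMERALS = "0-9零〇一二三四五六七八九十"
DATE_RE = re.compile(
    rf"(?P<year>[{NUMERALS}]+)年(?P<month>[{NUMERALS}]+)月(?P<day>[{NUMERALS}]+)日"
)


def split_chinese_date(text: str) -> Optional[dict]:
    """Return the raw year, month and day strings, or None if malformed."""
    match = DATE_RE.fullmatch(text.strip())
    return match.groupdict() if match else None
```

Each returned component is still a string at this point; it would then be handed to the number parser to obtain the integers shown in the object above.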
 Some investigation agencies use years from the Republic of China (Minguo)
 calendar, where year 1 is 1912 in the Gregorian calendar, so a year is
 converted by adding 1911.  I therefore added a check that adds 1911 to any
 year below 1900.  As a result, Gregorian dates are only parsed properly from
 1900 onwards, and dates using this calendar only up to its year 1899
 (Gregorian year 3810).
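The year check is small enough to show in full. This is a Python sketch of the rule as described in the text, with a hypothetical function name; it is not the jq code itself.

```python
def normalize_year(year: int) -> int:
    """Treat any year below 1900 as an offset from 1911."""
    return year + 1911 if year < 1900 else year
```

For example, `normalize_year(110)` gives 2021, while `normalize_year(2020)` is returned unchanged. The largest year that still gets shifted is 1899, which becomes 3810.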
 I then use a rather simple method to get a Unix timestamp:
 `"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`.  I could have used
 a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built
 the same array that `strptime` returns, such as
 `[1234, 11, 3, 0, 0, 0, 0, 0] | mktime`, but that requires some particular
 handling as months are zero-based in this format, while days of the month
 are not.
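For comparison, the same conversion can be sketched in Python. The function name is hypothetical, and the input is assumed to be the parsed object shown earlier; Python's `datetime` uses one-based months and days, so no offset handling is needed here.

```python
from datetime import datetime, timezone


def to_unix_timestamp(date: dict) -> int:
    """Convert {"year", "month", "day"} to a Unix timestamp at midnight UTC."""
    moment = datetime(date["year"], date["month"], date["day"], tzinfo=timezone.utc)
    return int(moment.timestamp())
```

Under these assumptions, `to_unix_timestamp({"year": 2020, "month": 4, "day": 28})` returns `1588032000`.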
 ## Acknowledgements
 Thanks to ~m455 for the Chinese dates crash course on IRC!
 [helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
 [itsb]: https://tilde.town/~lucidiot/itsb/
 [jq]: https://stedolan.github.io/jq/
 [unicode]: https://github.com/stedolan/jq/issues/1166