diff --git a/content/cn-date.md b/content/cn-date.md
new file mode 100644
index 0000000..ddbc5cb
--- /dev/null
+++ b/content/cn-date.md
@@ -0,0 +1,119 @@
+---
+title: Chinese date format
+---
+
+## Format
+
+* Traditional Chinese: `二零二零年四月二十八日`
+* Simplified Chinese: `2020年04月28日`
+
+## Characters
+
+| Character | Meaning |
+| --------: | ------: |
+| 零 | 0 |
+| 〇 | 0 |
+| 一 | 1 |
+| 二 | 2 |
+| 三 | 3 |
+| 四 | 4 |
+| 五 | 5 |
+| 六 | 6 |
+| 七 | 7 |
+| 八 | 8 |
+| 九 | 9 |
+| 十 | 10 |
+| 二十 | 20 |
+| 三十 | 30 |
+| 年 | Year |
+| 月 | Month |
+| 日 | Day |
+
+## jq implementation
+
+I implemented a Chinese date parser using [jq][jq] for [itsb][itsb], as seen
+[here][helpers].
+
+The parser has two parts: `parse_chinese_number`, which is only guaranteed
+to work from 0 to 99, and `parse_chinese_date`, which splits the date into
+its components and sends them to `parse_chinese_number`.
+
+This parser only works with jq ≥ 1.6, as jq 1.5 and earlier had some
+[Unicode issues][unicode] that caused most string manipulations in this parser
+to break.
+
+### Number parsing
+
+In Chinese dates, years are always written digit by digit, i.e.
+"two zero two zero" for 2020, and not "two thousand twenty". This makes the
+parsing much simpler, as I do not even need to know how thousands or hundreds
+are expressed in Chinese; I do still need tens to handle months and days.
+
+I started by writing a parser that only handles single digits: I made an
+object mapping Chinese characters to their string digits, translated each
+character, then concatenated the results.
+
+```
+($input // "") | split("")  # ["二", "零", "二", "零"]
+| map($charmap[.])          # ["2", "0", "2", "0"]
+| join("")                  # "2020"
+```
+
+I then mapped 十 (10) to an empty string, since it can simply be dropped when
+going digit by digit. This only works in some cases:
+
+| Number | Parsed as | Expected |     Actual |
+| ------:| ---------:| --------:| ----------:|
+| 二十八 | 二八 | 28 | 28 |
+| 十八 | 八 | 18 | 8 |
+| 二十 | 二 | 20 | 2 |
+| 十 | `""` | 10 | Type error |
+
+I chose to ignore inputs containing more than one 十, as my goal was only to
+parse the 1-31 range: 十十十一 is longer than 三一 or 三十一, so I can expect
+such forms not to be used.
+
+The remaining edge cases only occur when `十` is at the start or the end of
+the string, so I handled them in three ways:
+
+* If the string is exactly `"十"`, return 10 before any other parsing;
+* If the string ends with `十`, multiply the output by 10;
+* If the string begins with `十`, add 10 to the output.
+
+To avoid adding complexity to my feed parsing scripts, I also changed the
+`map` to `map($charmap[.] // .)`, which passes unknown characters through
+unchanged. Combined with some checks made by the date parsing function, this
+makes it possible to parse both traditional and simplified formats without
+many changes.
+
+### Date parsing
+
+The date components are split using a regex, to avoid sending too much garbage
+to `parse_chinese_number` in the event of a badly formatted date. Once the
+numbers are parsed, we get an object in this format:
+
+```json
+{"year": 1234, "month": 12, "day": 3}
+```
+
+In some cases, investigation agencies use years from the Republic of China
+(Minguo) calendar, where year 0 is year 1911 of the Gregorian calendar. I
+therefore added a check that adds 1911 to the year when it is below 1900; this
+causes the parser to work properly only for years between 1900 and 2811
+inclusive for dates using this calendar.
+
+I then use a rather simple method to get a Unix timestamp:
+`"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`. I could have used
+a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built
+the same array that `strptime` returns, such as
+`[1234, 11, 3, 0, 0, 0, 0, 0] | mktime`, but that requires some particular
+handling as months are zero-based in this format.
+
+## Acknowledgements
+
+Thanks to ~m455 for the Chinese dates crash course on IRC!
+
+[helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
+[itsb]: https://tilde.town/~lucidiot/itsb/
+[jq]: https://stedolan.github.io/jq/
+[unicode]: https://github.com/stedolan/jq/issues/1166