---
title: chinese date format
---

## Format

* Traditional Chinese: `二零二零年四月二十八日`
* Simplified Chinese: `2020年04月28日`

## Characters

| Character | Meaning |
| --------: | ------: |
| 零 | 0 |
| 〇 | 0 |
| 一 | 1 |
| 二 | 2 |
| 三 | 3 |
| 四 | 4 |
| 五 | 5 |
| 六 | 6 |
| 七 | 7 |
| 八 | 8 |
| 九 | 9 |
| 十 | 10 |
| 二十 | 20 |
| 三十 | 30 |
| 年 | Year |
| 月 | Month |
| 日 | Day |

## jq implementation

I implemented a Chinese date parser using [jq][jq] for [itsb][itsb], as seen [here][helpers]. The parsing is split into two functions: `parse_chinese_number`, a parser that is only guaranteed to work from 0 to 99, and `parse_chinese_date`, which splits the date components and sends them to `parse_chinese_number`.

This parser only works with jq 1.6 or later, as jq 1.5 and earlier had some [Unicode issues][unicode] that caused most string manipulations in this parser to break.

### Number parsing

In Chinese dates, years are always written digit by digit: 2020 is "two zero two zero", not "two thousand twenty". This makes the parsing much simpler, as I do not even need to know how thousands or hundreds are expressed in Chinese. I still need tens to handle months and days, however.

I started by writing a parser that only handles single digits: I made an object mapping Chinese characters to their digits as strings, translated each character, then concatenated the results.

```
($input // "") / ""  # ["二", "零", "二", "零"] (dividing by "" splits into characters)
| map($charmap[.])   # ["2", "0", "2", "0"]
| join("")           # "2020"
```

I then mapped 十 (10) to an empty string, because it can simply be ignored when translating character by character. This only works in some cases:

| Number | Parsed as | Expected | Actual     |
| ------:| ---------:| --------:| ----------:|
| 二十八 | 二八      | 28       | 28         |
| 十八   | 八        | 18       | 8          |
| 二十   | 二        | 20       | 2          |
| 十     | `""`      | 10       | Type error |

I chose to ignore any case where 十 appears more than once, as my goal was only to parse numbers in the 1-31 range, and something like 十十十一 is longer than 三一 or 三十一, so I do not expect it to be used.

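
As a rough illustration of why the empty-string trick is not enough on its own, the naive per-character translation can be sketched in Python. This is not the jq code used in itsb: `CHARMAP` and `naive_parse` are names made up for this sketch, and Python's `ValueError` merely stands in for jq's type error. Under those assumptions, it should reproduce the "Actual" column of the table above.

```python
# Hypothetical Python transliteration of the naive approach; not the itsb jq code.
CHARMAP = {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": "",  # 10 is dropped, which is only correct between two digits
}


def naive_parse(text: str) -> int:
    """Translate each character to its digit, join the digits, convert to int."""
    digits = "".join(CHARMAP[char] for char in text)
    return int(digits)  # int("") raises ValueError, e.g. for a lone 十


for sample in ["二零二零", "二十八", "十八", "二十", "十"]:
    try:
        print(sample, "->", naive_parse(sample))
    except ValueError:
        print(sample, "-> error: nothing left to convert")
```

Only the first two samples come out right; a leading or trailing 十 silently loses a factor or a term of ten, and a lone 十 fails outright.
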

The remaining edge cases only occur when `十` is at the start or the end of the string, so I handled them in three ways:

* If the string is exactly `"十"`, return 10 before any other parsing;
* If the string ends with 十, multiply the output by 10;
* If the string begins with 十, add 10 to the output.

To avoid adding complexity to my feed parsing scripts, I also changed the `map` to `map($charmap[.] // .)`, which leaves unknown characters, such as Arabic digits, unchanged instead of failing on them. Combined with some checks made by the date parsing function, this makes it possible to parse both traditional and simplified formats without making many changes.

### Date parsing

The date components are split using a regex, to avoid sending too much garbage to `parse_chinese_number` in the event of a badly formatted date. Once the numbers are parsed, we get an object in this format:

```json
{"year": 1234, "month": 12, "day": 3}
```

In some cases, investigation agencies use years from the Republic of China (Minguo) calendar, where year 1 is 1912 in the Gregorian calendar, so year 0 would be 1911. I therefore added a check that adds 1911 to the year when it is below 1900. As a result, Gregorian years before 1900 are misread as years of this calendar, and years of this calendar are only handled from 0 to 1899, that is 1911 to 3810 in the Gregorian calendar.

I then use a rather simple method to get a Unix timestamp: `"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`. I could have used a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built the same array that `strptime` returns, such as `[1234, 11, 3, 0, 0, 0, 0, 0] | mktime`, but that requires some special handling as months are zero-based in this format.

## Acknowledgements

Thanks to ~m455 for the Chinese dates crash course on IRC!

[helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
[itsb]: https://tilde.town/~lucidiot/itsb/
[jq]: https://stedolan.github.io/jq/
[unicode]: https://github.com/stedolan/jq/issues/1166