Add chinese date format article

~lucidiot 2021-03-21 18:39:17 +00:00
parent aa8fdf97e3
commit 54c1b14cef
1 changed files with 119 additions and 0 deletions

content/cn-date.md

@ -0,0 +1,119 @@
---
title: chinese date format
---
## Format
* Traditional Chinese, with Chinese numerals: `二零二零年四月二十八日`
* Simplified Chinese, with Arabic numerals: `2020年04月28日`
## Characters
| Character | Meaning |
| --------: | ------: |
| 零 | 0 |
| 〇 | 0 |
| 一 | 1 |
| 二 | 2 |
| 三 | 3 |
| 四 | 4 |
| 五 | 5 |
| 六 | 6 |
| 七 | 7 |
| 八 | 8 |
| 九 | 9 |
| 十 | 10 |
| 二十 | 20 |
| 三十 | 30 |
| 年 | Year |
| 月 | Month |
| 日 | Day |
## jq implementation
I implemented a Chinese date parser in [jq][jq] for [itsb][itsb]; the code is
[here][helpers].

The parser has two parts: `parse_chinese_number`, which is only guaranteed to
work from 0 to 99, and `parse_chinese_date`, which splits the date into its
components and sends them to `parse_chinese_number`.

This parser only works with jq ≥ 1.6, as jq 1.5 and earlier had some
[Unicode issues][unicode] that caused most string manipulations in this parser
to break.
### Number parsing
In Chinese dates, years are always written digit by digit, i.e.
"two zero two zero" for 2020 and not "two thousand twenty". This makes
parsing much simpler, as I do not even need to know how thousands or hundreds
are expressed in Chinese; I do still need tens to handle months and days.

I started by writing a parser that only handles single digits: I made an
object mapping Chinese characters to their string digits, translated each
character, then concatenated the results.
```
($input // "")       # "二零二零"
| split("")          # ["二", "零", "二", "零"]
| map($charmap[.])   # ["2", "0", "2", "0"]
| join("")           # "2020"
```
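
As an illustration only, the same digit-by-digit translation can be sketched in Python rather than jq. `CHARMAP` and `translate_digits` are hypothetical names chosen for this sketch, not taken from the itsb helpers.

```python
# Hypothetical stand-in for the jq $charmap object: Chinese digit -> ASCII digit.
CHARMAP = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}


def translate_digits(text: str) -> str:
    """Translate every character to its digit, then concatenate the digits."""
    return "".join(CHARMAP[char] for char in text)


print(translate_digits("二零二零"))  # 2020
```

At this stage the sketch fails on any character missing from the mapping, including 十.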
I then mapped 十 (10) to an empty string, because it can simply be dropped
when translating digit by digit. This only works in some cases:

| Number | Parsed as | Expected | Actual |
| ------:| ---------:| --------:| ----------:|
| 二十八 | 二八 | 28 | 28 |
| 十八 | 八 | 18 | 8 |
| 二十 | 二 | 20 | 2 |
| 十 | `""` | 10 | Type error |
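
The rows of this table can be reproduced with a short Python sketch of the naive approach. This is an assumption-laden illustration, not the jq code: `naive_parse` is a hypothetical name, and Python raises a `ValueError` on the empty string, where the table reports a type error.

```python
# Hypothetical mapping for this sketch; 十 maps to "" so that it is dropped.
CHARMAP = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": "",
}


def naive_parse(text: str) -> int:
    """Drop every 十, translate the remaining digits and convert to int."""
    return int("".join(CHARMAP[char] for char in text))


for number in ["二十八", "十八", "二十", "十"]:
    try:
        print(number, naive_parse(number))
    except ValueError:
        # int("") fails: nothing is left once the lone 十 is dropped.
        print(number, "error")
```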

I chose to ignore any case where more than one 十 appears, as my goal was
only to parse numbers in the 1-31 range; a form such as 十十十一 is longer than
三一 or 三十一, so I can expect it not to be used.
The remaining edge cases only occur when `十` is at the start or the end of
the string, so I handled them in three ways:
* If the string is exactly `"十"`, return 10 before any other parsing;
* If the string ends with 十, multiply the output by 10;
* If the string begins with 十, add 10 to the output.
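
The three rules above can be sketched in Python as follows. This is a minimal illustration under the same assumptions as the article (at most one 十, values up to 99); the names are hypothetical and the code is not a transcription of the jq helper.

```python
# Hypothetical mapping for this sketch; 十 maps to "" so that it is dropped.
CHARMAP = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": "",
}


def parse_chinese_number(text: str) -> int:
    # Rule 1: a lone 十 is exactly 10.
    if text == "十":
        return 10
    value = int("".join(CHARMAP[char] for char in text))
    # Rule 2: a trailing 十 means the digit before it counts tens (二十 -> 20).
    if text.endswith("十"):
        value *= 10
    # Rule 3: a leading 十 means ten plus the following digit (十八 -> 18).
    if text.startswith("十"):
        value += 10
    return value


for number in ["二十八", "十八", "二十", "十", "三十一"]:
    print(number, parse_chinese_number(number))
```

With these rules, the four rows of the earlier table give 28, 18, 20 and 10.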
To avoid adding complexity to my feed parsing scripts, I also changed the
`map` to `map($charmap[.] // .)`, which passes unknown characters through
unchanged instead of failing on them. Combined with some checks made by the
date parsing function, this makes it possible to parse both traditional and
simplified formats without making many changes.
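
In Python terms, the `// .` fallback behaves like `dict.get(char, char)`. The sketch below is a hypothetical Python rendering with made-up names, not the jq code; it shows how the fallback lets ASCII digits through untouched.

```python
# Hypothetical mapping for this sketch; 十 maps to "" so that it is dropped.
CHARMAP = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": "",
}


def parse_chinese_number(text: str) -> int:
    if text == "十":
        return 10
    # Characters missing from CHARMAP (such as ASCII digits) are kept as-is.
    value = int("".join(CHARMAP.get(char, char) for char in text))
    if text.endswith("十"):
        value *= 10
    if text.startswith("十"):
        value += 10
    return value


print(parse_chinese_number("二十八"))  # 28
print(parse_chinese_number("28"))      # 28
print(parse_chinese_number("04"))      # 4
```

Characters that are neither mapped nor digits would still make the integer conversion fail in this sketch, which is why the input is filtered before it reaches the number parser.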
### Date parsing
The date components are split using a regex, to avoid sending too much garbage
to `parse_chinese_number` in the event of a badly formatted date. Once the
numbers are parsed, we get an object in this format:
```json
{"year": 1234, "month": 12, "day": 3}
```
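
A possible shape for this splitting step, sketched in Python: the regular expression here is an assumption made for illustration (the actual pattern in the jq helper may differ), and all names are hypothetical.

```python
import re

# Hypothetical mapping for this sketch; 十 maps to "" so that it is dropped.
CHARMAP = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": "",
}


def parse_chinese_number(text: str) -> int:
    if text == "十":
        return 10
    value = int("".join(CHARMAP.get(char, char) for char in text))
    if text.endswith("十"):
        value *= 10
    if text.startswith("十"):
        value += 10
    return value


# Assumed pattern: only ASCII digits and known Chinese numerals may appear
# before each of the 年, 月 and 日 markers.
NUMBER = "[0-9" + "".join(CHARMAP) + "]+"
DATE_PATTERN = re.compile(f"({NUMBER})年({NUMBER})月({NUMBER})日")


def parse_chinese_date(text: str) -> dict:
    match = DATE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a recognised date: {text!r}")
    year, month, day = (parse_chinese_number(part) for part in match.groups())
    return {"year": year, "month": month, "day": day}


print(parse_chinese_date("二零二零年四月二十八日"))
print(parse_chinese_date("2020年04月28日"))
```

Both calls produce `{'year': 2020, 'month': 4, 'day': 28}`, since the character class keeps anything else away from the number parser.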
Some investigation agencies use years from the Chinese calendar, where year 0
is year 1911 of the Gregorian calendar. I therefore added a check that adds
1911 years when the year is below 1900; this causes the parser to work
properly only for years between 1900 and 2811 inclusive for dates using this
calendar.
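
The year check described above amounts to a one-line rule. Here is a Python sketch of it, with a hypothetical function name; the threshold and offset are the ones stated in the text.

```python
def normalize_year(year: int) -> int:
    """Add 1911 to years below 1900, assuming they use the other calendar."""
    return year + 1911 if year < 1900 else year


print(normalize_year(109))   # 2020
print(normalize_year(2020))  # 2020
```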
I then use a rather simple method to get a Unix timestamp:
`"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`. I could have used
a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built
the same array that `strptime` returns, such as
`[1234, 11, 3, 0, 0, 0, 0, 0] | mktime`, but that requires some particular
handling as months are zero-based in this format (days of the month are not).
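
For comparison, here is how the same two routes might look in Python. This is a sketch with hypothetical function names, not a translation of the jq code. Python's time tuples use one-based months, so the tuple route needs no adjustment here.

```python
import calendar
from datetime import datetime


def timestamp_from_iso(date: dict) -> int:
    """Build an ISO 8601 string at midnight UTC and convert it."""
    iso = (
        f"{date['year']:04d}-{date['month']:02d}-{date['day']:02d}"
        "T00:00:00+00:00"
    )
    return int(datetime.fromisoformat(iso).timestamp())


def timestamp_from_tuple(date: dict) -> int:
    """Build a UTC time tuple (one-based month and day) and convert it."""
    return calendar.timegm((date["year"], date["month"], date["day"], 0, 0, 0))


parsed = {"year": 2020, "month": 4, "day": 28}
print(timestamp_from_iso(parsed))    # 1588032000
print(timestamp_from_tuple(parsed))  # 1588032000
```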
## Acknowledgements
Thanks to ~m455 for the Chinese dates crash course on IRC!
[helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
[itsb]: https://tilde.town/~lucidiot/itsb/
[jq]: https://stedolan.github.io/jq/
[unicode]: https://github.com/stedolan/jq/issues/1166