---
title: chinese date format
---

## Format

* Traditional Chinese: `二零二零年四月二十八日`
* Simplified Chinese: `2020年04月28日`

## Characters

| Character | Meaning |
| --------: | ------: |
| 零 | 0 |
| 〇 | 0 |
| 一 | 1 |
| 二 | 2 |
| 三 | 3 |
| 四 | 4 |
| 五 | 5 |
| 六 | 6 |
| 七 | 7 |
| 八 | 8 |
| 九 | 9 |
| 十 | 10 |
| 二十 | 20 |
| 三十 | 30 |
| 年 | Year |
| 月 | Month |
| 日 | Day |

## jq implementation

I implemented a Chinese date parser using [jq][jq] for [itsb][itsb], as seen
[here][helpers].

The parser has two parts: `parse_chinese_number`, which is only guaranteed to
work from 0 to 99, and `parse_chinese_date`, which splits the date into its
components and sends them to `parse_chinese_number`.

This parser only works with jq 1.6 or later, as jq 1.5 and earlier had some
[Unicode issues][unicode] that caused most string manipulations in this parser
to break.

### Number parsing

In Chinese dates, years are always written out digit by digit, that is,
"two zero two zero" for 2020 and not "two thousand twenty". This makes the
parsing much simpler, as I do not even need to know how thousands or hundreds
are expressed in Chinese; I do, however, still need tens to handle months and
days.

I started by writing a parser that only handles single digits: I made an
object mapping Chinese characters to their digits as strings, translated each
character, then concatenated the results.

```jq
($input / "")        # ["二", "零", "二", "零"]
| map($charmap[.])   # ["2", "0", "2", "0"]
| join("")           # "2020"
```

I then mapped 十 (10) to an empty string, because it can simply be skipped
when reading a number digit by digit. This only works in some cases:

| Number | Parsed as | Expected | Actual |
| ------:| ---------:| --------:| ----------:|
| 二十八 | 二八 | 28 | 28 |
| 十八 | 八 | 18 | 8 |
| 二十 | 二 | 20 | 2 |
| 十 | `""` | 10 | Type error |
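
The behaviour in the table can be reproduced with a small sketch. It is written in Python rather than jq so that it runs standalone, and it is not the code from `helpers.jq`: `CHARMAP` and `naive_parse` are hypothetical names, and Python's `ValueError` stands in for jq's type error.

```python
# Hypothetical Python transliteration of the naive approach described above.
CHARMAP = {char: str(digit) for digit, char in enumerate("零一二三四五六七八九")}
CHARMAP["〇"] = "0"
CHARMAP["十"] = ""  # ten is dropped instead of being translated


def naive_parse(text: str) -> int:
    """Translate each character to a digit, concatenate, then convert."""
    digits = "".join(CHARMAP[char] for char in text)
    return int(digits)  # int("") raises ValueError, the analogue of the type error


for number in ["二十八", "十八", "二十", "十"]:
    try:
        print(number, naive_parse(number))
    except ValueError as error:
        print(number, "error:", error)
```

Running it gives 28, 8 and 2 for the first three rows and an error for a lone 十, matching the "Actual" column.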

I chose to ignore any case with more than one 十, as my goal was only to parse
numbers in the 1-31 range: something like 十十十一 is longer than 三一 or
三十一, so I can expect it not to be used.

The remaining edge cases only occur when `十` is at the start or the end of
the string, so I handled them with three rules:

* If the string is exactly `"十"`, return 10 before any other parsing;
* If the string ends with 十, multiply the output by 10;
* If the string begins with 十, add 10 to the output.

To avoid adding complexity to my feed parsing scripts, I also changed the
`map` to `map($charmap[.] // .)`, which leaves unknown characters, such as the
Arabic digits of the simplified format, unchanged. Combined with some checks
made by the date parsing function, this makes it possible to parse both
traditional and simplified formats without many changes.
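
Putting these rules together, the number parser could look roughly like the sketch below. This is a Python transliteration for illustration only, assuming the behaviour described above; the actual jq code lives in `helpers.jq` and may differ. `CHARMAP` is a hypothetical name, while `parse_chinese_number` reuses the name from the text.

```python
# Hypothetical Python sketch of the rules above; not the jq code from helpers.jq.
CHARMAP = {char: str(digit) for digit, char in enumerate("零一二三四五六七八九")}
CHARMAP["〇"] = "0"
CHARMAP["十"] = ""  # ten is handled separately below


def parse_chinese_number(text: str) -> int:
    """Parse a digit-by-digit number, or a number below 100 written with 十."""
    if text == "十":
        return 10
    # Unknown characters (for example Arabic digits) are kept as they are.
    digits = "".join(CHARMAP.get(char, char) for char in text)
    value = int(digits)
    if text.endswith("十"):
        value *= 10  # 二十 -> "2" -> 20
    if text.startswith("十"):
        value += 10  # 十八 -> "8" -> 18
    return value


print(parse_chinese_number("二零二零"))  # digit by digit, as used for years
print(parse_chinese_number("三十一"))    # tens in the middle are simply skipped
print(parse_chinese_number("04"))        # Arabic digits pass through the map
```

As in the text, strings containing more than one 十 are not handled.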

### Date parsing

The date components are split using a regex, to avoid sending too much garbage
to `parse_chinese_number` in the event of a badly formatted date. Once the
numbers are parsed, we get an object in this format:

```json
{"year": 1234, "month": 12, "day": 3}
```
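
As an illustration of the splitting step, here is a minimal Python sketch. The regex is an assumption, not the one used in `helpers.jq`: it only accepts Arabic digits and the numeral characters from the table above before each of 年, 月 and 日. `NUMERAL`, `DATE_PATTERN` and `split_chinese_date` are hypothetical names.

```python
import re

# Assumed character set: Arabic digits plus the Chinese numerals listed earlier.
NUMERAL = "[0-9零〇一二三四五六七八九十]+"
DATE_PATTERN = re.compile(
    f"(?P<year>{NUMERAL})年(?P<month>{NUMERAL})月(?P<day>{NUMERAL})日"
)


def split_chinese_date(text: str) -> dict:
    """Return the raw year, month and day strings found in the text."""
    match = DATE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a Chinese date: {text!r}")
    return match.groupdict()


print(split_chinese_date("二零二零年四月二十八日"))
print(split_chinese_date("2020年04月28日"))
```

Each of the three strings would then be handed to the number parser to produce the object shown above.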

Some investigation agencies use years from the Republic of China (Minguo)
calendar, where year 0 corresponds to year 1911 of the Gregorian calendar. I
therefore added a check that adds 1911 to the year when it is below 1900; this
causes the parser to work properly only for years between 1900 and 2811
inclusive for dates using this calendar.
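
The year check amounts to a single comparison. A minimal Python sketch, with hypothetical names (`normalize_year`, `CALENDAR_OFFSET`, `GREGORIAN_THRESHOLD`) and the two constants taken from the paragraph above:

```python
CALENDAR_OFFSET = 1911      # year 0 of the offset calendar, per the text
GREGORIAN_THRESHOLD = 1900  # years below this are assumed to need the offset


def normalize_year(year: int) -> int:
    """Convert an offset-calendar year to Gregorian; leave other years alone."""
    if year < GREGORIAN_THRESHOLD:
        return year + CALENDAR_OFFSET
    return year


print(normalize_year(109))   # 2020
print(normalize_year(2020))  # 2020
```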

I then use a rather simple method to get a Unix timestamp:
`"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`. I could have used
a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built
an array like the one `strptime` returns, such as
`[1234, 11, 3, 0, 0, 0, 0] | mktime`, but that requires some particular
handling as months are zero-based in this format.
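
For comparison, the same conversion can be sketched in Python. This is not part of the jq parser, and `to_unix_timestamp` is a hypothetical name; it assumes an object with `year`, `month` and `day` keys as shown earlier, and midnight UTC as the time of day.

```python
from datetime import datetime, timezone


def to_unix_timestamp(date: dict) -> int:
    """Midnight UTC of the given day, as seconds since the Unix epoch."""
    moment = datetime(date["year"], date["month"], date["day"], tzinfo=timezone.utc)
    return int(moment.timestamp())


print(to_unix_timestamp({"year": 2020, "month": 4, "day": 28}))  # 1588032000
```

Python's `datetime` takes one-based months and days, so the parsed components can be passed through unchanged.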

## Acknowledgements

Thanks to ~m455 for the Chinese dates crash course on IRC!

[helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
[itsb]: https://tilde.town/~lucidiot/itsb/
[jq]: https://stedolan.github.io/jq/
[unicode]: https://github.com/stedolan/jq/issues/1166