Add chinese date format article
parent
aa8fdf97e3
commit
54c1b14cef
|
@ -0,0 +1,119 @@
---
title: chinese date format
---

## Format

* Traditional Chinese: `二零二零年四月二十八日`
* Simplified Chinese: `2020年04月28日`

## Characters

| Character | Meaning |
| --------: | ------: |
| 零 | 0 |
| 〇 | 0 |
| 一 | 1 |
| 二 | 2 |
| 三 | 3 |
| 四 | 4 |
| 五 | 5 |
| 六 | 6 |
| 七 | 7 |
| 八 | 8 |
| 九 | 9 |
| 十 | 10 |
| 二十 | 20 |
| 三十 | 30 |
| 年 | Year |
| 月 | Month |
| 日 | Day |

## jq implementation

I implemented a Chinese date parser using [jq][jq] for [itsb][itsb], as seen
[here][helpers].

The parsing is split into two functions: `parse_chinese_number`, a parser that
is only guaranteed to work from 0 to 99, and `parse_chinese_date`, which splits
the date components and sends them to `parse_chinese_number`.

This parser only works with jq 1.6 or later, as jq 1.5 and earlier had some
[Unicode issues][unicode] that caused most string manipulations in this parser
to break.

### Number parsing

In Chinese dates, years are always expressed using all of their digits, i.e.
"two zero two zero" for 2020, and not "two thousand twenty". This makes the
parsing much simpler, as I do not even need to know how thousands or hundreds
are expressed in Chinese; I do still need tens, however, to handle months and
days.

I started by writing a parser that only handles single digits: I made an
object mapping Chinese characters to their string digits, translated each
character, then concatenated the results.

```
($input // "")       # "二零二零"
| split("")          # ["二", "零", "二", "零"]
| map($charmap[.])   # ["2", "0", "2", "0"]
| join("")           # "2020"
```
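
A self-contained version of this first step could look like the sketch below. The function name `parse_chinese_digits` and the inline `$charmap` are assumptions made for illustration, not the code used in itsb, and the sketch assumes jq 1.6 or later.

```jq
# Hypothetical helper: translate a string of Chinese digits into a string of
# Arabic digits. Only single digits are known here; 十 is not handled yet.
def parse_chinese_digits:
  {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"
  } as $charmap
  | split("")           # one array item per character
  | map($charmap[.])    # look each character up in the mapping
  | join("");           # concatenate the translated digits
```

Piping `"二零二零"` into `parse_chinese_digits` gives the string `"2020"`, which `tonumber` can then turn into a number.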

I then mapped 十 (10) to an empty string, because it can simply be ignored when
reading a number digit by digit. This only works in some cases:

| Number | Parsed as | Expected | Actual     |
| ------:| ---------:| --------:| ----------:|
| 二十八 | 二八      | 28       | 28         |
| 十八   | 八        | 18       | 8          |
| 二十   | 二        | 20       | 2          |
| 十     | `""`      | 10       | Type error |

I chose to ignore any case where more than one 十 appears, as my goal was only
to parse numbers in the 1-31 range, and a form such as 十十十一 is longer than
三一 or 三十一, so I can expect it not to be used.

The remaining edge cases only occur when `十` is at the start or the end of
the string, so I handled them with three rules:

* If the string is exactly `"十"`, return 10 before any other parsing;
* If the string ends with 十, multiply the output by 10;
* If the string begins with 十, add 10 to the output.
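
The three rules above can be sketched as follows. This is only an illustration under the stated assumptions: the name `parse_small_number` and the exact structure are hypothetical and may differ from the real `parse_chinese_number`, and inputs with more than one 十 are not handled.

```jq
# Hypothetical sketch of the three rules, for numbers from 0 to 99.
def parse_small_number:
  {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    "十": ""    # ignored when reading digit by digit
  } as $charmap
  | . as $text
  | if $text == "十" then
      10        # rule 1: 十 alone is 10
    else
      ($text | split("") | map($charmap[.]) | join("") | tonumber) as $n
      | if ($text | endswith("十")) then $n * 10        # rule 2: 二十 is 2 * 10
        elif ($text | startswith("十")) then $n + 10    # rule 3: 十八 is 8 + 10
        else $n                                         # 二十八 reads as 28
        end
    end;
```

With this sketch, `"二十八"`, `"十八"`, `"二十"` and `"十"` evaluate to 28, 18, 20 and 10 respectively, which covers the four rows of the table.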

To avoid adding complexity to my feed parsing scripts, I also changed the
`map` to `map($charmap[.] // .)`, which leaves unknown characters, such as
Arabic digits, untouched instead of translating them. Combined with some
checks made by the date parsing function, this makes it possible to parse both
traditional and simplified formats without making many changes.
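
The effect of the fallback can be seen in isolation in the sketch below. The name `translate_digits` is hypothetical. The empty string mapped to 十 is still used as-is, because `//` only falls back on `null` and `false`.

```jq
# Hypothetical helper showing the effect of the `// .` fallback.
def translate_digits:
  {
    "零": "0", "〇": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9", "十": ""
  } as $charmap
  | split("")
  | map($charmap[.] // .)   # characters missing from the mapping are kept
  | join("");
```

Here `"04"` comes out unchanged as `"04"`, while `"四"` becomes `"4"`, so the same code path can serve both formats.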

### Date parsing

The date components are split using a regex, to avoid sending too much garbage
to `parse_chinese_number` in the event of a badly formatted date. Once the
numbers are parsed, we get an object in this format:

```json
{"year": 1234, "month": 12, "day": 3}
```
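
As an illustration of the splitting step, a regex with named groups and jq's `capture` could be used as sketched here. The function name and the exact pattern are assumptions; the regex actually used in itsb may differ.

```jq
# Hypothetical sketch: split a date into its raw year, month and day strings.
# Inputs that do not match the pattern produce no output at all.
def split_chinese_date:
  capture("^(?<year>[^年]+)年(?<month>[^月]+)月(?<day>[^日]+)日$");
```

For `"二零二零年四月二十八日"`, this yields an object whose `year`, `month` and `day` are the strings `"二零二零"`, `"四"` and `"二十八"`; each of them would then go through the number parser to produce the object shown above.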

Some investigation agencies use years from the Republic of China (Minguo)
calendar, where year 0 corresponds to year 1911 of the Gregorian calendar. I
therefore added a check that adds 1911 to the year when it is below 1900. As a
consequence, Gregorian years before 1900 are misread, and dates using this
calendar are only parsed properly up to its year 1899, which is Gregorian year
3810.
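
A minimal sketch of that check, assuming the parsed object shape shown earlier; the name `normalize_year` is hypothetical.

```jq
# Hypothetical sketch of the calendar check: years below 1900 are assumed to
# come from the calendar whose year 0 is Gregorian year 1911.
def normalize_year:
  if .year < 1900 then .year += 1911 else . end;
```

For example, `{"year": 109, "month": 4, "day": 28}` becomes a date in 2020, while an object whose year is already 2020 is left untouched.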

I then use a rather simple method to get a Unix timestamp:
`"\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601`. I could have used
a more conventional method such as `strptime("%Y-%m-%d") | mktime`, or built
the same array that `strptime` returns, such as
`[1234, 11, 3, 0, 0, 0, 0, 0] | mktime`, but that requires some particular
handling as months are zero-based in this format.
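
Both approaches can be compared with the sketch below. The function names are hypothetical, and the array variant assumes jq's documented broken down time layout, in which the month is zero-based and the day of the month is one-based.

```jq
# Variant quoted in the text: build an ISO 8601 string and parse it.
def to_timestamp_iso:
  "\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601;

# Variant based on the kind of array that strptime returns. Only the month is
# shifted; the last two items (day of week, day of year) are placeholders.
def to_timestamp_mktime:
  [.year, .month - 1, .day, 0, 0, 0, 0, 0] | mktime;
```

With `{"year": 2020, "month": 4, "day": 28}` as input, `to_timestamp_mktime` should return `1588032000`, which is midnight UTC on 2020-04-28.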

## Acknowledgements

Thanks to ~m455 for the Chinese dates crash course on IRC!

[helpers]: https://tildegit.org/lucidiot/itsb/src/commit/70ef6b9e978aa58170e80bbb7121f3a6089f8ac3/jq/helpers.jq#L45
[itsb]: https://tilde.town/~lucidiot/itsb/
[jq]: https://stedolan.github.io/jq/
[unicode]: https://github.com/stedolan/jq/issues/1166