<85> in document creates rendering issue #199

Closed
opened 2020-11-04 03:43:20 +00:00 by asdf · 6 comments
Collaborator

Documents containing a special character <85> are not rendered correctly.

  • The page title is missing
  • Wherever this character appears, rendering of that line stops. Content from the previous page remains on screen.

Using Bombadillo 2.3.1 as well as release2.3.3

Steps to reproduce:

  1. Open Bombadillo, open the default start page if necessary
  2. Browse to any of the following links:

gemini://gemini.conman.org:1965/boston/2020/11/03.3
gemini://gemini.conman.org:1965/boston/2020/11/03.2
gemini://gemini.conman.org:1965/boston/2020/11/03.1
gemini://gemini.conman.org:1965/boston/2020/10/12.1

  • Downloading these documents and viewing using the local protocol causes the same issue.
  • Tested in gnome-terminal and st and saw the same issue.

Viewing the documents in vim shows a special character <85>. I think this is U+0085, or NEL, the next line symbol.

In page.go, we can get rid of characters we don't want. Adding '\u0085' to this list makes rendering work. This probably isn't good enough, as the character is meant to be a line break, but I'm not sure exactly how it is meant to be represented yet.
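A rough sketch of the workaround described above, for anyone following along. This is not the actual code in page.go: the `unwanted` set and the `stripUnwanted` name are made up for illustration, and the real list contains whatever Bombadillo already filters. It only shows the idea of dropping NEL before rendering.

```go
package main

import (
	"fmt"
	"strings"
)

// unwanted is a hypothetical stand-in for the list of characters
// that page.go drops; only NEL (U+0085) is shown here.
var unwanted = map[rune]bool{
	'\u0085': true,
}

// stripUnwanted removes every rune found in the unwanted set.
// strings.Map drops a rune when the mapping function returns a negative value.
func stripUnwanted(s string) string {
	return strings.Map(func(r rune) rune {
		if unwanted[r] {
			return -1
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripUnwanted("first\u0085second"))
}
```

Note that this joins the text on either side of the NEL, which is why dropping the character is probably not good enough on its own.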

asdf added the bug and rendering labels 2020-11-04 03:43:20 +00:00
asdf self-assigned this 2020-11-04 03:43:29 +00:00
Owner

Interesting. I had not been familiar with that character. I wonder if there is a list somewhere of characters of that sort (that are not printed characters themselves, but modify the output).

Author
Collaborator

[This article](https://www.unicode.org/reports/tr14/) has an extremely detailed description of Unicode line breaks. There's also the [Wikipedia article](https://en.wikipedia.org/wiki/Newline). I've only partially read these, as they are very long.

I don't think there is anything built in to identify these types of characters. unicode.IsSpace might be the closest.

[This question](https://stackoverflow.com/questions/52594005/golang-replace-any-and-all-newline-characters) shows a similar problem, and how these are identified in [Java](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lineending).

Need to read more.
Author
Collaborator

This section might be the most relevant:

https://en.wikipedia.org/wiki/Newline#Unicode

Rune literals in Go treat Unicode escapes like '\u000A' as equal to '\n', so we already handle most of these in WrapContent, but mostly by ignoring them. We do not handle the following items:

  • NEL \u0085 (next line)
  • LS \u2028 (line separator)
  • PS \u2029 (paragraph separator)

As we are ignoring most of the others, should we also ignore these? It seems like it might be just as complex to implement them.
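To make the rune-literal point concrete, here is a small check. The `isLF` helper is just an illustrative name. It shows that a Unicode escape for line feed is the same rune as '\n', while NEL, LS and PS are separate code points that do not compare equal to it.

```go
package main

import "fmt"

// isLF reports whether r is the line feed character.
// '\n' and '\u000A' are two spellings of the same rune.
func isLF(r rune) bool {
	return r == '\n'
}

func main() {
	fmt.Println(isLF('\u000A')) // line feed written as a Unicode escape
	fmt.Println(isLF('\u0085')) // NEL
	fmt.Println(isLF('\u2028')) // LS
	fmt.Println(isLF('\u2029')) // PS
}
```

Only the first call reports true; the other three are distinct runes that need explicit handling.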

Owner

Awesome! That is great news that Go treats them as equal to \n. I think we should print a newline for any of the above three characters. There should be a rune for them, right? If so, it should be either an && within an else if or another case within a switch (I can't remember what is happening there to know which one).

Is there any downside you can think of to treating them like \n? If we do this as part of the line wrapping it means if someone downloads the file they will still get the original characters as intended (which is good).

Author
Collaborator

I've done a WIP PR on this to help with the explanation.

Just to try to be clearer regarding your first point, the Unicode line terminators from Wikipedia are just \n, \f, \v, \r and \r\n, plus the three I highlighted (NEL, LS and PS), but written using a Unicode reference. I was just confused: the Unicode reference for line feed is literally \u000A, and '\u000A' == '\n' is true in Go. That isn't related to NEL, LS or PS. Go doesn't treat them as equal to \n. Sorry!

As for your last point about downsides, do you mean something like a loss of fidelity? If so, no more than we already accept. A good equivalent for this is supporting \v instead of ignoring it. We wouldn't actually use \v when printing to the screen, but would approximate how it would look using spaces and \n.

The main downside is that it's a complicated topic for a rare occurrence. As noted on Wikipedia: "Recognizing and using the newline codes greater than 0x7F (NEL, LS and PS) is not often done". gnome-terminal and st have only implemented NEL as a line terminator; LS and PS aren't rendered. But at least we are learning something.

asdf closed this issue 2020-11-06 02:34:12 +00:00
Author
Collaborator

Just a note for future reference that NEL, LS and PS are implemented, each as a single line ending.

Reference: sloum/bombadillo#199