incorrect wrapping of line containing en dash #35

Closed
opened 2019-09-10 10:42:25 +00:00 by asdf · 10 comments
Collaborator

Issue found on page gopher://baud.baby:70/0/phlog/fs20190818.txt

The following line is 78 characters long, but the last word is wrapped to a new line when the console is set to 80 characters wide.

   Suldusk – Really cool dark  folk/black metal sort of deal.  The lead singer

When investigating, a unit test on wrapLines was made for this issue, but the line does not wrap in the test.

Vim character count for this line shows two different numbers for any character after the dash.

The dash is an 'en dash' U+2013.

Issue found on page gopher://baud.baby:70/0/phlog/fs20190818.txt The following line is 78 characters long, but the last word is wrapped to a new line when the console is set to 80 characters wide. <pre> Suldusk – Really cool dark folk/black metal sort of deal. The lead singer</pre> When investigating, a unit test on `wrapLines` was made for this issue, but the line does not wrap in the test. Vim character count for this line shows two different numbers for any character after the dash. The dash is an 'en dash' U+2013.
asdf self-assigned this 2019-09-10 10:43:47 +00:00
Owner

Yeah. I have not worked out proper counting for variable width characters. The wrapping definitely needs a lot of work :(

Yeah. I have not worked out proper counting for variable width characters. The wrapping definitely needs a lot of work :(
Owner

Running this on play.golang.org will show what is happening:

package main

import (
	"fmt"
)

func main() {
	fmt.Println(len("–"))
	fmt.Println(len([]rune("–")))
}

This is the return:

3
1

This is due to multibyte characters. Apparently endash is, for some reason, a 3 byte character (getting the len of a string returns the bytes count). So, we can count by character by casting to rune or using a different counting method (there are a few routes to go).

Something that can happen with multibyte characters is that they are wider than other characters. Take for instance, it is a single character, but is wide. If you type A௸ into a terminal, the characters are of a different width. Depending on how specific terminals render the character this could create issues and overages, or (on my system) overlapping characters. A simple and direct solution (though crappy) would be to only support us ascii. I think that is a bad call though. Particularly since go natively supports unicode easily. I'm just not sure how to handle line counting.

What are your thoughts?

Running this on play.golang.org will show what is happening: ``` package main import ( "fmt" ) func main() { fmt.Println(len("–")) fmt.Println(len([]rune("–"))) } ``` This is the return: ``` 3 1 ``` This is due to multibyte characters. Apparently endash is, for some reason, a 3 byte character (getting the len of a string returns the bytes count). So, we can count by character by casting to rune or using a different counting method (there are a few routes to go). Something that can happen with multibyte characters is that they are wider than other characters. Take `௸` for instance, it is a single character, but is wide. If you type `A௸` into a terminal, the characters are of a different width. Depending on how specific terminals render the character this could create issues and overages, or (on my system) overlapping characters. A simple and direct solution (though crappy) would be to only support us ascii. I think that is a bad call though. Particularly since go natively supports unicode easily. I'm just not sure how to handle line counting. What are your thoughts?
Owner

The solution I can see would entail rewriting the rendering engine and parsing the document character by character rather than dealing with lines and words. Then by keeping a character count you could do things like calculate how many spaces a tab will be worth at that point (assuming you manually set tabs on program load to a fixed value, such as 4). Then you could have an option for hard or soft line wrapping and just deal with them that way. Each character could be determined to be a single width char or not and dealt with accordingly. I'll work that into the rendering for v2.0 for sure.

As for here... I'm not sure what the good option is.

The solution I can see would entail rewriting the rendering engine and parsing the document character by character rather than dealing with lines and words. Then by keeping a character count you could do things like calculate how many spaces a tab will be worth at that point (assuming you manually set tabs on program load to a fixed value, such as 4). Then you could have an option for hard or soft line wrapping and just deal with them that way. Each character could be determined to be a single width char or not and dealt with accordingly. I'll work that into the rendering for v2.0 for sure. As for here... I'm not sure what the good option is.
Author
Collaborator

Thanks for all the information on this. I haven't had any experience with this before, so it's good to learn.

I've had a look at the following documents:

I think it's worth doing what we can to support Unicode. I'll go with your suggestions as to how this is done as I don't fully understand everything here yet. If it's something for 2.0 then that is ok too.

For this specifically, I replaced the parts of wrapLines() where the line and word lengths are measured with len([]rune(string)) and it actually wraps those lines correctly now. I also tried using utf8.RuneCountInString and got the same result. Benchmarks indicate similar performance for both, with around 1/9 increase over the original version (although who knows if the benchmark test is useful or not).

I've submitted PR #37 for review of this activity.

Thanks for all the information on this. I haven't had any experience with this before, so it's good to learn. I've had a look at the following documents: - https://blog.golang.org/strings - http://www.joelonsoftware.com/articles/Unicode.html I think it's worth doing what we can to support Unicode. I'll go with your suggestions as to how this is done as I don't fully understand everything here yet. If it's something for 2.0 then that is ok too. For this specifically, I replaced the parts of `wrapLines()` where the line and word lengths are measured with `len([]rune(string))` and it actually wraps those lines correctly now. I also tried using `utf8.RuneCountInString` and got the same result. Benchmarks indicate similar performance for both, with around 1/9 increase over the original version (although who knows if the benchmark test is useful or not). I've submitted PR #37 for review of this activity.
Owner

This conversation has also been really interesting:

https://stackoverflow.com/questions/12668681/how-to-get-the-number-of-characters-in-a-string

I agree that unicode support is the preferable way to go, utf-8 at the very least. I commented on the PR, but I think len([]rune(string)) is probably the way to go, but maybe add a code comment.

That will drastically improve wrapping accuracy. It wont get us the whole way, but it will do a serviceable job for sure. I do not think gopher folks are using a ton of weird multispace emoji currently anyway. But being able to support multiple languages is nice.

Once merged in #37 will close this issue?

This conversation has also been really interesting: https://stackoverflow.com/questions/12668681/how-to-get-the-number-of-characters-in-a-string I agree that unicode support is the preferable way to go, utf-8 at the very least. I commented on the PR, but I think `len([]rune(string))` is probably the way to go, but maybe add a code comment. That will drastically improve wrapping accuracy. It wont get us the whole way, but it will do a serviceable job for sure. I do not think gopher folks are using a ton of weird multispace emoji currently anyway. But being able to support multiple languages is nice. Once merged in #37 will close this issue?
Author
Collaborator

It's impressive and intimidating how deep some of these problems go.

I'm happy to use len([]rune(string)) for greater accuracy now, and consider something like https://github.com/rivo/uniseg for the future.

I'll update the PR and we can close this off.

It's impressive and intimidating how deep some of these problems go. I'm happy to use len([]rune(string)) for greater accuracy now, and consider something like https://github.com/rivo/uniseg for the future. I'll update the PR and we can close this off.
Author
Collaborator

OK the PR is updated, but there is one thing I forgot: I can't figure out why the test passed before when it should have failed.

Reproducing this in the go playground with the old code produces the same result - no lines are being wrapped because the line length doesn't exceed the specified console width of 80.

In fact, nothing in this document should wrap with console width 80. Part of it still does, so there must be something else to it.

This could be because the test text is manually entering in to a slice of strings, and not through the method Bombadillo uses to read content. To validate this, I'd have to figure out how to either

  1. Include content parsing in to the test
  2. Get a copy of the slice of strings that would be passed to wrapLines

This is likely a big task, so if there are suggestions on what this could be it would help.

OK the PR is updated, but there is one thing I forgot: I can't figure out why the test passed before when it should have failed. Reproducing this in the go playground with the old code produces the same result - no lines are being wrapped because the line length doesn't exceed the specified console width of 80. In fact, nothing in this document should wrap with console width 80. Part of it still does, so there must be something else to it. This could be because the test text is manually entering in to a slice of strings, and not through the method Bombadillo uses to read content. To validate this, I'd have to figure out how to either 1. Include content parsing in to the test 1. Get a copy of the slice of strings that would be passed to `wrapLines` This is likely a big task, so if there are suggestions on what this could be it would help.
Owner

I do not have too much advice :-/ You could log the slice to a file? The logging package is supposed to be pretty capable: https://golang.org/pkg/log/

As an aside, v2.0 has been coming along (I am a VERY tired person because of it, but it has been fun to work on). It is a mess of its own sort... but its screen drawing is MUCH faster (no flicker!) and I skipped soft wrapping lines in favor of hard wrap. The branch is up and you are welcome to try it out if you want. Most commands do not work except for entering addresses, link numbers, scrolling, forward, backward, quit. The screen resizing and rewrapping is now automatic and done on a separate thread (technically it is concurrency, not multithreading). Anyway, if you feel like trying it out and taking a peek at the code, go for it. The branch is v2-restructure. Make a backup of your .bombadillo.ini file just in case something goes haywire though.

I do not have too much advice :-/ You could log the slice to a file? The logging package is supposed to be pretty capable: `https://golang.org/pkg/log/` As an aside, v2.0 has been coming along (I am a VERY tired person because of it, but it has been fun to work on). It is a mess of its own sort... but its screen drawing is MUCH faster (no flicker!) and I skipped soft wrapping lines in favor of hard wrap. The branch is up and you are welcome to try it out if you want. Most commands do not work except for entering addresses, link numbers, scrolling, forward, backward, quit. The screen resizing and rewrapping is now automatic and done on a separate thread (technically it is concurrency, not multithreading). Anyway, if you feel like trying it out and taking a peek at the code, go for it. The branch is `v2-restructure`. Make a backup of your `.bombadillo.ini` file just in case something goes haywire though.
Author
Collaborator

v2 is very cool, the effort you have put in really shows. I really like the new wrapping function - it seems more straight-forward and it does much more.

So - the issue with the unicode character en dash is resolved in v2.

Also, I think the tests were no good - they did not consider newline at all whereas the application would have, so the results were due to be different.

The issue I was trying to test for is still present in v2, but I think it might be worth closing this off and referring any further activity on issue #31.

v2 is very cool, the effort you have put in really shows. I really like the new wrapping function - it seems more straight-forward and it does much more. So - the issue with the unicode character en dash is resolved in v2. Also, I think the tests were no good - they did not consider `newline` at all whereas the application would have, so the results were due to be different. The issue I was trying to test for is still present in v2, but I think it might be worth closing this off and referring any further activity on issue #31.
Owner

Sounds good. I am closing this. Anyone looking at this later should refer to #31 for further conversation on the subject.

Sounds good. I am closing this. Anyone looking at this later should refer to #31 for further conversation on the subject.
sloum closed this issue 2019-09-13 15:19:49 +00:00
sloum added the
duplicate
bug
labels 2019-09-13 15:20:08 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: sloum/bombadillo#35
No description provided.