incorrect wrapping of line containing en dash #35
Labels
No Label
blocked
bug
build
documentation
duplicate
enhancement
finger
gemini
gopher
help wanted
http
in progress
invalid
local
needs-info
non-code
non-functional
non-urgent
question
release
rendering
suggestion
telnet
terminal
urgent
wontfix
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: sloum/bombadillo#35
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Issue found on page gopher://baud.baby:70/0/phlog/fs20190818.txt
The following line is 78 characters long, but the last word is wrapped to a new line when the console is set to 80 characters wide.
When investigating, a unit test on
wrapLines
was made for this issue, but the line does not wrap in the test.Vim character count for this line shows two different numbers for any character after the dash.
The dash is an 'en dash' U+2013.
Yeah. I have not worked out proper counting for variable width characters. The wrapping definitely needs a lot of work :(
Running this on play.golang.org will show what is happening:
This is the return:
This is due to multibyte characters. Apparently endash is, for some reason, a 3 byte character (getting the len of a string returns the bytes count). So, we can count by character by casting to rune or using a different counting method (there are a few routes to go).
Something that can happen with multibyte characters is that they are wider than other characters. Take
௸
for instance, it is a single character, but is wide. If you typeA௸
into a terminal, the characters are of a different width. Depending on how specific terminals render the character this could create issues and overages, or (on my system) overlapping characters. A simple and direct solution (though crappy) would be to only support us ascii. I think that is a bad call though. Particularly since go natively supports unicode easily. I'm just not sure how to handle line counting.What are your thoughts?
The solution I can see would entail rewriting the rendering engine and parsing the document character by character rather than dealing with lines and words. Then by keeping a character count you could do things like calculate how many spaces a tab will be worth at that point (assuming you manually set tabs on program load to a fixed value, such as 4). Then you could have an option for hard or soft line wrapping and just deal with them that way. Each character could be determined to be a single width char or not and dealt with accordingly. I'll work that into the rendering for v2.0 for sure.
As for here... I'm not sure what the good option is.
Thanks for all the information on this. I haven't had any experience with this before, so it's good to learn.
I've had a look at the following documents:
I think it's worth doing what we can to support Unicode. I'll go with your suggestions as to how this is done as I don't fully understand everything here yet. If it's something for 2.0 then that is ok too.
For this specifically, I replaced the parts of
wrapLines()
where the line and word lengths are measured withlen([]rune(string))
and it actually wraps those lines correctly now. I also tried usingutf8.RuneCountInString
and got the same result. Benchmarks indicate similar performance for both, with around 1/9 increase over the original version (although who knows if the benchmark test is useful or not).I've submitted PR #37 for review of this activity.
This conversation has also been really interesting:
https://stackoverflow.com/questions/12668681/how-to-get-the-number-of-characters-in-a-string
I agree that unicode support is the preferable way to go, utf-8 at the very least. I commented on the PR, but I think
len([]rune(string))
is probably the way to go, but maybe add a code comment.That will drastically improve wrapping accuracy. It wont get us the whole way, but it will do a serviceable job for sure. I do not think gopher folks are using a ton of weird multispace emoji currently anyway. But being able to support multiple languages is nice.
Once merged in #37 will close this issue?
It's impressive and intimidating how deep some of these problems go.
I'm happy to use len([]rune(string)) for greater accuracy now, and consider something like https://github.com/rivo/uniseg for the future.
I'll update the PR and we can close this off.
OK the PR is updated, but there is one thing I forgot: I can't figure out why the test passed before when it should have failed.
Reproducing this in the go playground with the old code produces the same result - no lines are being wrapped because the line length doesn't exceed the specified console width of 80.
In fact, nothing in this document should wrap with console width 80. Part of it still does, so there must be something else to it.
This could be because the test text is manually entering in to a slice of strings, and not through the method Bombadillo uses to read content. To validate this, I'd have to figure out how to either
wrapLines
This is likely a big task, so if there are suggestions on what this could be it would help.
I do not have too much advice :-/ You could log the slice to a file? The logging package is supposed to be pretty capable:
https://golang.org/pkg/log/
As an aside, v2.0 has been coming along (I am a VERY tired person because of it, but it has been fun to work on). It is a mess of its own sort... but its screen drawing is MUCH faster (no flicker!) and I skipped soft wrapping lines in favor of hard wrap. The branch is up and you are welcome to try it out if you want. Most commands do not work except for entering addresses, link numbers, scrolling, forward, backward, quit. The screen resizing and rewrapping is now automatic and done on a separate thread (technically it is concurrency, not multithreading). Anyway, if you feel like trying it out and taking a peek at the code, go for it. The branch is
v2-restructure
. Make a backup of your.bombadillo.ini
file just in case something goes haywire though.v2 is very cool, the effort you have put in really shows. I really like the new wrapping function - it seems more straight-forward and it does much more.
So - the issue with the unicode character en dash is resolved in v2.
Also, I think the tests were no good - they did not consider
newline
at all whereas the application would have, so the results were due to be different.The issue I was trying to test for is still present in v2, but I think it might be worth closing this off and referring any further activity on issue #31.
Sounds good. I am closing this. Anyone looking at this later should refer to #31 for further conversation on the subject.