Improve multibyte character handling #5

Open
opened 2022-12-03 21:37:02 +00:00 by paladin1 · 4 comments
Owner

Reported by Matěj Cepl:

The CzeCSP module, when loaded into a pane, does not show the correct text in the pane's titlebar. The appropriate titlebar text should be: "Český studijní překlad"

paladin1 changed title from Possible non-English rendering issue to Pane titlebar rendering incorrect 2022-12-27 22:05:56 +00:00
paladin1 added the bug label 2022-12-27 22:06:17 +00:00
Author
Owner

[CzeCSP module information is available here.](https://gitlab.com/crosswire-bible-society/CzeCSP/-/tree/master/)

This does appear to be a bug. Scriptura renders the module description as "Czech Leský studijní p Yeklad──", though every rendering of the module I could find elsewhere has the description as some variant of "Český studijní překlad".

This is possibly due to Scriptura not evaluating or displaying module descriptions as wide-character strings. I notice the 2017 Finnish Bible's title is rendered as "Pyhä Raamattu (STLK 2017)─", which has a similar stray dash at the end. However, SWORD itself doesn't use wide-character strings for module names or descriptions, so it's not yet clear whether that is relevant.
Author
Owner

It is indeed a multibyte rendering issue:

```
sword::SWModule *text = swrd.getModule("CzeCSP");
const char* desc = text->getDescription();
// Print the raw description bytes to stderr, bypassing curses
fprintf(stderr, "%s\n", desc);
```

outputs:

> Czech Český studijní překlad

with xxd output showing the multibyte characters:

```
00000000: 437a 6563 6820 c48c 6573 6bc3 bd20 7374  Czech ..esk.. st
00000010: 7564 696a 6ec3 ad20 70c5 9965 6b6c 6164  udijn.. p..eklad
00000020: 0a
```

Titlebar display currently relies on mvwprintw(), which does not appear to be wide-character aware, so it looks necessary to retool the whole flow of module description retrieval and display for wide characters.
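The dump above shows that the description's byte length and its character count differ, which matters for any titlebar code that sizes or pads text by strlen(). The following is a minimal sketch of that difference, not Scriptura's actual code: the helper names `utf8_codepoints` and `utf8_extra_bytes` are hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of Scriptura or SWORD): counts UTF-8 code
// points by skipping continuation bytes, which all have the form 10xxxxxx.
// Assumes the input is valid UTF-8.
std::size_t utf8_codepoints(const char *s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: bytes that std::strlen() counts but that do not
// start a new character.
std::size_t utf8_extra_bytes(const char *s) {
    return std::strlen(s) - utf8_codepoints(s);
}
```

For the description in the dump, strlen() reports 32 bytes while `utf8_codepoints` reports 28 characters; the 4 extra bytes come from Č, ý, í and ř, each of which takes two bytes. Note that code points are still only an approximation of terminal columns: combining marks and double-width characters would need something like wcwidth().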
paladin1 changed title from Pane titlebar rendering incorrect to Make pane titlebar wide-character aware 2023-01-01 01:54:29 +00:00
paladin1 changed title from Make pane titlebar wide-character aware to Improve multibyte character handling 2023-03-03 00:52:03 +00:00
Author
Owner

It was actually the removal of wide-character use that was required, along with a change to how titlebars are rendered.

Fixed by [0a8acd40cce97a7b77835942875a76cac500f901](https://tildegit.org/paladin1/scriptura/commit/0a8acd40cce97a7b77835942875a76cac500f901)

There is a trade-off in this resolution. Characters in the extended ASCII set (such as Latin-1) do not seem to be printable, as we wipe the locale data to get UTF-8 to render properly. I currently cannot find a way around this. Among the modules I tested, nearly everything is in UTF-8, so perhaps this is not a large trade-off.
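One possible direction for the Latin-1 trade-off, untested against Scriptura, would be to re-encode Latin-1 text as UTF-8 before it reaches the display code, so that only UTF-8 ever has to be printed. The sketch below assumes the caller already knows a given string is Latin-1 (how to determine that is not covered here), and the helper name `latin1_to_utf8` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO-8859-1 (Latin-1) bytes as UTF-8.
// Assumes the caller knows the input really is Latin-1; running it on
// text that is already UTF-8 would double-encode it.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xE4 ("ä") becomes the UTF-8 pair 0xC3 0xA4, which matches how "Pyhä" would need to be encoded for a UTF-8 terminal.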
Author
Owner

Found out later about MarkupFilters, which can force output into a particular character set. From the listserv:

> I do have another question regarding the construction of MarkupFilterMgr.
> If I want to apply the encoding filter for UTF8, but do not need the markup filter manager for XHTML is it ok to perform the construction like this?

```
new MarkupFilterMgr(sword::FMT_UNKNOWN, sword::ENC_UTF8)
```

...

```
SWMgr mgr(new MarkupFilterMgr(sword::FMT_XHTML, sword::ENC_UTF8));
```

This may work but needs some exploration.
paladin1 reopened this issue 2023-04-22 20:58:52 +00:00
Reference: paladin1/scriptura#5