Improve multibyte character handling #5


paladin1 · 2022-12-03T21:37:02Z

paladin1 commented

2022-12-03 21:37:02 +00:00

Reported by Matěj Cepl:

The CzeCSP module, when loaded into a pane, does not show the correct text in the pane's titlebar. The appropriate titlebar text should be: "Český studijní překlad"


paladin1 changed title from ~~Possible non-English rendering issue~~ to Pane titlebar rendering incorrect

2022-12-27 22:05:56 +00:00

paladin1 added the bug label

2022-12-27 22:06:17 +00:00

paladin1 commented

2022-12-27 22:16:13 +00:00

[CzeCSP module information is available here.](https://gitlab.com/crosswire-bible-society/CzeCSP/-/tree/master/)

This does appear to be a bug. Scriptura renders the module description as "Czech Leský studijní p Yeklad──", though every rendering I could find elsewhere gives the module description as some variant of "Český studijní překlad".

This is possibly due to Scriptura not evaluating or displaying module descriptions as wide-character strings. I notice the 2017 Finnish Bible's title is rendered as "Pyhä Raamattu (STLK 2017)─", which has a similar stray dash at the end. However, SWORD itself doesn't use wide-character strings for module names or descriptions, so it's not yet clear whether that's relevant.

paladin1 commented

2023-01-01 01:53:35 +00:00

It is indeed a multibyte rendering issue:

```
sword::SWModule *text = swrd.getModule("CzeCSP");
const char* desc = text->getDescription();
fprintf(stderr, "%s\n", desc);
```

outputs:

> Czech Český studijní překlad

with xxd output showing the multibyte characters:

```
00000000: 437a 6563 6820 c48c 6573 6bc3 bd20 7374  Czech ..esk.. st
00000010: 7564 696a 6ec3 ad20 70c5 9965 6b6c 6164  udijn.. p..eklad
00000020: 0a
```

Titlebar display currently relies on mvwprintw(), which does not appear to be wide-character aware, so it looks necessary to retool the whole flow of module description retrieval and display for wide characters.
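To make the byte-versus-character mismatch in the xxd dump concrete, here is a minimal sketch that counts UTF-8 code points in a description string. The helper name `utf8_codepoints` is hypothetical and is not part of Scriptura or SWORD. It assumes the input is already valid UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from Scriptura or SWORD): count UTF-8 code points
// by counting every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; malformed sequences are not detected.
std::size_t utf8_codepoints(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the description dumped above, `std::string::size()` reports 32 bytes while this helper reports 28 code points, so any width or centering calculation based on byte length would be off by four. Note that code points are still only an approximation of terminal columns: combining marks and double-width characters would need something like `wcwidth()`.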

paladin1 changed title from ~~Pane titlebar rendering incorrect~~ to Make pane titlebar wide-character aware

2023-01-01 01:54:29 +00:00

paladin1 changed title from ~~Make pane titlebar wide-character aware~~ to Improve multibyte character handling

2023-03-03 00:52:03 +00:00

paladin1 commented

2023-03-03 01:16:50 +00:00

What was actually required was removing wide-character use and changing how titlebars are rendered.

Fixed by [0a8acd40cce97a7b77835942875a76cac500f901](https://tildegit.org/paladin1/scriptura/commit/0a8acd40cce97a7b77835942875a76cac500f901)

There is a trade-off with this resolution. Characters in the extended ASCII set (such as Latin-1) do not seem to be printable, because we wipe the locale data to get UTF-8 to render properly. I currently cannot find a way around this. Among the modules I tested, nearly everything appears to be in UTF-8, so perhaps this is not a large trade-off.
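One possible direction for the Latin-1 trade-off, offered only as a sketch and not something tested against Scriptura: re-encode Latin-1 text to UTF-8 before it reaches the display code. The helper name `latin1_to_utf8` is hypothetical. It assumes the caller already knows that a given string is ISO-8859-1, and how to determine that per module is not addressed here.

```cpp
#include <string>

// Hypothetical helper (an assumption, not part of the fix commit):
// re-encode ISO-8859-1 bytes as UTF-8. Every Latin-1 byte maps directly to
// the Unicode code point of the same value, so bytes >= 0x80 become a
// two-byte UTF-8 sequence and ASCII bytes pass through unchanged.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    return out;
}
```

With this, a Latin-1 "Pyhä" (where ä is the single byte 0xE4) becomes the UTF-8 sequence ending in 0xC3 0xA4. Applying it to text that is already UTF-8 would double-encode it, so it only helps if the source encoding is known.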

paladin1 closed this issue

2023-03-03 01:16:50 +00:00

paladin1 commented

2023-04-22 20:58:52 +00:00

Found out later about MarkupFilters, which can force output into a particular character set. From the listserv:

> I do have another question regarding the construction of MarkupFilterMgr.
> If I want to apply the encoding filter for UTF8, but do not need the markup filter manager for XHTML is it ok to perform the construction like this?

```
new MarkupFilterMgr(sword::FMT_UNKNOWN, sword::ENC_UTF8)
```

...

```
SWMgr mgr(new MarkupFilterMgr(sword::FMT_XHTML, sword::ENC_UTF8));
```

This may work but needs some exploration.

paladin1 reopened this issue

2023-04-22 20:58:52 +00:00
