Motivation
This idea started in SetUTF8CharSetByDefault. When switching the character set of a TWiki site using {Site}{CharSet}, say from 'ISO-8859-1' to 'UTF-8', you need to convert the encoding of topic text. The same issue comes up when importing topics from another TWiki site.
This can be solved with two enhancements:
- Add a charset="..." attribute to the META:TOPICINFO
- Add a {Site}{LegacyCharSet} configure setting to indicate the legacy character set of topics that do not have the charset="..." attribute set
Description and Documentation
1. Add a charset="..." attribute to the META:TOPICINFO
- When viewing a topic, the topic text is decoded based on the charset="..." topic info attribute, and if that is missing, based on the {Site}{LegacyCharSet} configure setting.
- This means that decoding cannot be done by Perl's I/O layer: before TWiki has read the META:TOPICINFO, it does not know which encoding to use. It also assumes that the META:TOPICINFO can be interpreted before the encoding is known. We should be safe using Perl's default encoding for that purpose, and then manually decoding the topic text according to the charset.
- The decoding process should be done shortly after reading the topic, in particular before processing INCLUDE{} or other functions which pull in text from different topics (which may have been written in a different encoding).
- When writing a topic, the topic text is encoded as {Site}{CharSet}, and the charset attribute is set accordingly.
- When creating a new topic, the charset="..." attribute is set to the {Site}{CharSet} configure setting.
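The read path described above could be sketched as follows. This is only an illustration in Python, not TWiki's Perl code: the names topic_charset, decode_topic and LEGACY_CHARSET (standing in for the fallback configure setting) are hypothetical, and the sketch assumes the META:TOPICINFO line itself is plain ASCII, so it can be located in the raw bytes before the encoding is known.

```python
import re

# Hypothetical stand-in for the fallback charset configure setting.
LEGACY_CHARSET = "iso-8859-1"

# Both patterns work on raw bytes, because the topic is not decoded yet.
TOPICINFO_RE = re.compile(rb'^%META:TOPICINFO\{(.*?)\}%', re.MULTILINE)
CHARSET_RE = re.compile(rb'\bcharset="([^"]*)"')


def topic_charset(raw: bytes) -> str:
    """Return the charset named in META:TOPICINFO, else the legacy default."""
    info = TOPICINFO_RE.search(raw)
    if info:
        attr = CHARSET_RE.search(info.group(1))
        if attr:
            # Assumption: charset names are ASCII-only.
            return attr.group(1).decode("ascii")
    return LEGACY_CHARSET


def decode_topic(raw: bytes) -> str:
    """Decode the whole topic once, right after reading it from storage."""
    return raw.decode(topic_charset(raw))
```

Decoding each topic immediately after reading it means that anything pulled in later from other topics is already character data, whatever encoding each topic was stored in.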
2. Add a {Site}{LegacyCharSet} configure setting
- This is to indicate the default character set of topics that do not have the charset="..." attribute set.
- Example: All your topics are 'ISO-8859-1', and you want to convert the site to 'UTF-8'. To switch the site's character set, you set {Site}{LegacyCharSet} to 'ISO-8859-1', and {Site}{CharSet} to 'UTF-8'.
- Because of SetUTF8CharSetByDefault, the TWiki distribution ships with {Site}{LegacyCharSet} = 'ISO-8859-1' and {Site}{CharSet} = 'UTF-8' for compatibility.
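To illustrate how the two settings could work together when a legacy topic is saved, here is a minimal sketch in Python (not TWiki's Perl code). The names SITE_CHARSET, LEGACY_CHARSET, stamp_charset and convert_topic are hypothetical, and the sketch assumes the META:TOPICINFO line is plain ASCII and sits at the start of a line.

```python
import re

SITE_CHARSET = "utf-8"          # stand-in for {Site}{CharSet}
LEGACY_CHARSET = "iso-8859-1"   # stand-in for {Site}{LegacyCharSet}

TOPICINFO_RE = re.compile(r'^%META:TOPICINFO\{(.*?)\}%', re.MULTILINE)
CHARSET_ATTR_RE = re.compile(r'\s*\bcharset="[^"]*"')


def stamp_charset(text: str, charset: str) -> str:
    """Set charset="..." in the first META:TOPICINFO line, replacing any old value.

    Text without a META:TOPICINFO line is returned unchanged.
    """
    def rewrite(match):
        attrs = CHARSET_ATTR_RE.sub("", match.group(1)).strip()
        return '%META:TOPICINFO{' + f'{attrs} charset="{charset}"'.strip() + '}%'
    return TOPICINFO_RE.sub(rewrite, text, count=1)


def convert_topic(raw: bytes) -> bytes:
    """Re-encode one stored topic from its current charset to the site charset."""
    # Latin-1 maps every byte to a character, so this probe never fails;
    # it is used only to find the ASCII metadata, not as the real decoding.
    probe = raw.decode("latin-1")
    info = TOPICINFO_RE.search(probe)
    found = re.search(r'\bcharset="([^"]*)"', info.group(1)) if info else None
    source = found.group(1) if found else LEGACY_CHARSET
    text = raw.decode(source)
    return stamp_charset(text, SITE_CHARSET).encode(SITE_CHARSET)
```

With these values, a topic without a charset attribute is read as ISO-8859-1 and written back as UTF-8 with charset="utf-8" stamped into its topic info. Converting an already converted topic leaves it unchanged.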
Examples
Impact
Implementation
-- Contributors: Peter Thoeny - 2020-09-17
Discussion
I did not put myself as CommittedDeveloper due to time commitment. Any takers?
-- Peter Thoeny - 2020-09-17
At the moment I'm rather deep in another project, but I think I can contribute to that task. I have collected some experience migrating stuff towards Unicode when working on Act, which also has its roots in a time when Perl's Unicode support was a bit shaky.
-- Harald Jörg - 2020-09-18
I have added some points about viewing the topics. This is more difficult than writing because several data streams need to be considered in the rendering process: template files, template topics, included topics, formatted search results, query parameters, and even LocalSite.cfg if you happen to have e.g. a {WebMasterName} with an ö.
-- Harald Jörg - 2020-09-18