Add character set to META:TOPICINFO

Summary

Current State: UnderInvestigation
Developer:
Reason: None
Date:
Concerns By:
Bug Tracking:
Proposed For: LimaRelease


Motivation

This idea started in SetUTF8CharSetByDefault. When switching the character set of a TWiki site using {Site}{CharSet}, say from 'ISO-8859-1' to 'UTF-8', you need to convert the encoding of topic text. The same issue comes up when importing topics from another TWiki site.

This can be solved with two enhancements:

  1. Add a charset="..." attribute to the META:TOPICINFO
  2. Add a {Site}{LegacyCharSet} configure setting to indicate the legacy character set of topics that do not have the charset="..." attribute set

Description and Documentation

1. Add a charset="..." attribute to the META:TOPICINFO

  • When viewing a topic, the topic text is decoded based on the charset="..." attribute in META:TOPICINFO and, if that attribute is missing, based on the {Site}{LegacyCharSet} configure setting.
    • This means that decoding cannot be done by Perl's I/O layer: until TWiki has read the META:TOPICINFO, it does not know which encoding to use. It also assumes that the META:TOPICINFO can be parsed before the encoding is known. This should be safe if the topic is read as raw bytes (Perl's default when no I/O layer is set), because the charset="..." attribute itself consists of ASCII characters, which are encoded identically in ISO-8859-1, UTF-8, and other ASCII-compatible encodings. The topic text is then decoded manually according to the charset.
    • Decoding should happen right after the topic is read, in particular before processing INCLUDE{} or other functions that pull in text from other topics (which may have been written in a different encoding).
  • When writing a topic, the topic text is encoded as {Site}{CharSet}, and the charset attribute is set accordingly.
  • When creating a new topic, the charset="..." attribute is set to the {Site}{CharSet} configure setting.
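
The read path described above could look roughly like the following. This is only a sketch of the logic, written in Python for brevity rather than in TWiki's Perl; the names LEGACY_CHARSET, topic_charset and decode_topic are hypothetical, and it assumes that META:TOPICINFO is the first line of the stored topic and that the topic uses an ASCII-compatible encoding.

```python
import re

# Hypothetical stand-in for the proposed {Site}{LegacyCharSet} setting.
LEGACY_CHARSET = "ISO-8859-1"

# The charset attribute is plain ASCII, so it can be matched on raw bytes
# before the encoding of the rest of the topic is known.
# Assumption: META:TOPICINFO is the first line of the stored topic.
_CHARSET_RE = re.compile(rb'^%META:TOPICINFO\{[^\n]*?\bcharset="([^"]*)"')


def topic_charset(raw: bytes) -> str:
    """Return the charset declared in META:TOPICINFO, else the legacy default."""
    match = _CHARSET_RE.match(raw)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    return LEGACY_CHARSET


def decode_topic(raw: bytes) -> str:
    """Decode raw topic bytes using the charset the topic itself declares."""
    return raw.decode(topic_charset(raw))
```

A topic without the attribute is decoded with the legacy character set, while a topic that declares charset="UTF-8" is decoded as UTF-8, so both can be mixed in one INCLUDE once they are decoded text.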

2. Add a {Site}{LegacyCharSet} configure setting

  • This is to indicate the default character set of topics that do not have the charset="..." attribute set.
  • Example: All your topics are 'ISO-8859-1', and you want to convert the site to 'UTF-8'. To switch the site's character set, you set {Site}{LegacyCharSet} to 'ISO-8859-1', and {Site}{CharSet} to 'UTF-8'.
  • Because of SetUTF8CharSetByDefault, the TWiki distribution ships with {Site}{LegacyCharSet} = 'ISO-8859-1' and {Site}{CharSet} = 'UTF-8' for compatibility.
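
The ISO-8859-1 to UTF-8 example can be sketched as a save step that re-encodes a legacy topic and stamps the charset attribute. Again this is an illustration in Python, not TWiki code; SITE_CHARSET, LEGACY_CHARSET, encode_topic and migrate_topic are hypothetical names, and the sketch assumes the stored topic starts with a META:TOPICINFO line that has no charset="..." attribute yet.

```python
import re

# Hypothetical stand-ins for the configure settings in the example above.
SITE_CHARSET = "UTF-8"          # {Site}{CharSet}
LEGACY_CHARSET = "ISO-8859-1"   # {Site}{LegacyCharSet}

# Assumption: META:TOPICINFO is the first line of the topic.
_TOPICINFO_RE = re.compile(r'^%META:TOPICINFO\{([^\n]*)\}%')
_CHARSET_ATTR_RE = re.compile(r'\s*\bcharset="[^"]*"')


def encode_topic(text: str, site_charset: str = SITE_CHARSET) -> bytes:
    """Set charset="..." in META:TOPICINFO and encode the topic for storage."""
    def stamp(match):
        # Drop any previous charset attribute, then append the current one.
        attrs = _CHARSET_ATTR_RE.sub("", match.group(1)).strip()
        attrs = (attrs + " " if attrs else "") + f'charset="{site_charset}"'
        return "%META:TOPICINFO{" + attrs + "}%"

    stamped, count = _TOPICINFO_RE.subn(stamp, text, count=1)
    if count == 0:
        raise ValueError("topic has no META:TOPICINFO line to stamp")
    return stamped.encode(site_charset)


def migrate_topic(raw: bytes, legacy_charset: str = LEGACY_CHARSET) -> bytes:
    """Re-encode a stored topic that lacks charset="..." into the site charset."""
    return encode_topic(raw.decode(legacy_charset))
```

Saving a topic that is already stamped leaves its bytes unchanged, so a site can convert topics lazily as they are edited instead of in one batch.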

Examples

Impact

Implementation

-- Contributors: Peter Thoeny - 2020-09-17

Discussion

I did not put myself down as CommittedDeveloper because of time constraints. Any takers?

-- Peter Thoeny - 2020-09-17

At the moment I'm rather deep in another project, but I think I can contribute to that task. I have collected some experience migrating stuff towards Unicode when working on Act, which also has its roots in a time when Perl's Unicode support was a bit shaky.

-- Harald Jörg - 2020-09-18

I have added some points about viewing the topics. This is more difficult than writing because several data streams need to be considered in the rendering process: Template files, template topics, included topics, formatted search results, query parameters, and even LocalSite.cfg if you happen to have e.g. a {WebMasterName} with an ö.

-- Harald Jörg - 2020-09-18

Topic revision: r2 - 2020-09-18 - HaraldJoerg
 