Motivation
Now that Perl is more I18N-friendly, we should make UTF-8 the default character set.
Description and Documentation
Set this in
lib/TWiki.spec:
$TWiki::cfg{Site}{CharSet} = 'utf-8';
Examples
Impact
Implementation
--
Contributors:
Peter Thoeny - 2020-09-15
Discussion
Any gotchas I might have missed?
--
Peter Thoeny - 2020-09-15
I just changed the character set on TWiki.org's TWiki to utf-8. Test:
- Test 日本語
- Test German Umlaut schräg, müde, köstlich
--
Peter Thoeny - 2020-09-15
I agree: Perl's Unicode support has been very robust for several releases now. Also, today's platforms use UTF-8 as the default encoding, which matters whenever TWiki uses external tools (well, Windows is an exception, but it also lacks those external tools). Moving towards UTF-8 also helps get rid of "locales" as a way to distinguish between different one-byte-per-character encodings.
However, there are gotchas whenever you have existing topics. This is beyond the scope of just changing the default, but probably rather relevant for the TWiki customer base.
In my use cases the lifetime of topics has always been longer than the lifetime of any hardware or software version. German texts saved in ISO-8859-1 are usually invalid as UTF-8. So either I am stuck with the initial encoding, or I need to find a way to migrate, which is tricky. The English language (and therefore the TWiki distribution topics) hides the problem, but issues on twiki.org are visible in personal homepages, e.g. that of BjoernDoering, and are particularly nasty at WikiNamesWithUmlauts. By the way: I can safely enter any Unicode character outside of the 1-byte range, because browsers just submit them HTML-escaped, but I cannot search for such characters.
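Such a migration could be sketched roughly as follows. This is a hypothetical helper (the function name and fallback policy are mine, not existing TWiki code): it leaves content alone if it already decodes cleanly as UTF-8, and otherwise assumes ISO-8859-1, which can decode any byte sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical migration helper: return the UTF-8 bytes for a topic.
# Content that is already valid UTF-8 is left untouched; anything
# else is assumed to be ISO-8859-1 and re-encoded.
sub topic_bytes_to_utf8 {
    my ($bytes) = @_;
    my $copy  = $bytes;   # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $bytes if defined $chars;    # already valid UTF-8, keep as-is
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}
```

Note that the safeguard can still misfire: some short ISO-8859-1 byte sequences happen to be valid UTF-8, so a dry run that reports which topics would change is advisable before converting in place.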
I would consider it a very good step towards UTF-8-ness if TWiki added META:ENCODING information to every topic it writes. The next step would be to use this meta information, when available, instead of the global {Site}{CharSet} for reading topics, but that needs safeguards wherever the previous {CharSet} was different from UTF-8. Today, TWiki's UTF-8 decoding is pretty sloppy and ignores errors, so an edit/save cycle might damage the topic contents.
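A reader honouring such per-topic information might look like this sketch (the %META:ENCODING{...}% line format and the function name are hypothetical; the strict FB_CROAK decode is the safeguard against silently mangling topics):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical topic reader: prefer a per-topic %META:ENCODING{...}%
# line over the site-wide charset, and die on malformed input rather
# than silently ignoring decode errors.
sub read_topic_text {
    my ($raw_bytes, $site_charset) = @_;
    my $enc = $site_charset;
    if ($raw_bytes =~ /^%META:ENCODING\{name="([^"]+)"\}%$/m) {
        $enc = $1;
    }
    return decode($enc, $raw_bytes, FB_CROAK);    # strict decode
}
```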
As for wiki names with umlauts, this is another can of worms. File systems are encoding-agnostic: file names are just bytes, and there is no place where an application can record which encoding it used for Unicode names. Therefore, decoding file names as UTF-8 needs safeguarding as well. Out of habit, I still stick to ASCII for file names.
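The same safeguard idea applies to file names read from disk, for example (again a sketch with a function name of my choosing, not existing TWiki code):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical safeguard: readdir() hands back raw bytes, so decode
# a file name as UTF-8 only if it really is valid UTF-8, and keep
# it as opaque bytes otherwise.
sub decode_filename {
    my ($name_bytes) = @_;
    my $copy  = $name_bytes;
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $chars ? $chars : $name_bytes;
}
```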
--
Harald Jörg - 2020-09-15
I hope we will make TWiki UTF-8 based. It's a significant undertaking. But without doing it, TWiki cannot handle non-ASCII characters correctly in all cases. Simply setting {Site}{CharSet} to 'utf-8' leaves some cases where non-ASCII characters are not displayed correctly.
I've been using TWiki in UTF-8 for more than 10 years with thousands of webs and millions of topics, so I can say that TWiki can handle topics in UTF-8. However, even on a fresh TWiki install, TWiki has subtle but deeply rooted character handling problems. Since Perl 5.8 or so, Perl distinguishes between byte strings and character (UTF-8) strings. To handle non-ASCII characters properly in Perl, you need to put them in character strings rather than byte strings, but TWiki handles non-ASCII characters as byte strings. This was OK as long as CGI.pm was NOT properly handling non-ASCII characters, i.e. while CGI.pm itself worked on byte strings rather than character strings; that is no longer the case. Still, TWiki can handle non-ASCII characters in UTF-8 most of the time. The most notable gotcha is with TWikiForms: if you have non-ASCII characters in select or radio options, they are not properly displayed on the edit page. This is because TWiki handles non-ASCII characters as byte strings, and when CGI.pm gets them, it treats them as ISO-8859-1.
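The byte-string vs. character-string distinction can be seen directly in Perl; this minimal illustration (not TWiki code) shows the same word once as UTF-8 bytes and once as characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "schr\xc3\xa4g";             # the UTF-8 bytes of "schräg"
my $chars = decode('UTF-8', $bytes);     # the same text as a character string

print length($bytes), "\n";    # 7: Perl counts bytes here
print length($chars), "\n";    # 6: Perl counts characters here
```

Handing $bytes to code that expects character strings, as modern CGI.pm does, makes Perl reinterpret each byte as an ISO-8859-1 character, which is exactly the mangling seen in form options.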
--
Hideyo Imazu - 2020-09-16
It looks like we have consensus on enabling UTF-8 by default.
I would ignore topic names with umlauts for now. The create-topic step already transliterates ö to oe, for example.
We can fix issues as we find them. I just fixed the first one: TWikibug:Item7911: I18N: Raw view with UTF-8 charset mangles text. A fix is pending for the related TWikibug:Item7912: I18N: Raw view with UTF-8 charset mangles form field text.
I like the idea of recording the character set in each topic. That makes migrating content between TWiki sites easier (such as on a company merger). Instead of adding a new META:ENCODING meta tag, I think a logical place is a charset="..." attribute on the existing META:TOPICINFO meta tag. To help with legacy topics that do not have that charset="..." attribute set, we can add a new {Site}{LegacyCharSet} to define the character set of those topics, such as 'ISO-8859-1'. Followup in AddCharSetToMetaTopicInfo.
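Under that proposal, the first line of a topic saved by a UTF-8 site might look like this (the charset attribute is the proposed addition; the author, date, and version values are made up for illustration):

```
%META:TOPICINFO{author="PeterThoeny" date="1600128000" format="1.1" version="5" charset="utf-8"}%
```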
--
Peter Thoeny - 2020-09-17
I agree that adding the information to META:TOPICINFO is preferable.
A minor suggestion: though "charset" has been used historically, it isn't exactly appropriate. "Charset" makes sense to distinguish which set of characters should be associated with the bytes 0-255. UTF-8, on the other hand, is an encoding of Unicode, and Unicode can represent any character. HTTP and HTML have kept "charset" as the name for compatibility, and the TWiki configuration variables should do the same, but for TOPICINFO (which can't be directly changed by users) I'd prefer "encoding".
I am also slightly suspicious about "We can fix issues as we find them". In my experience, one of the dangers lies in code which "works" in some paths due to a cancellation of errors, but fails in others. This is difficult to disentangle, because if you fix one of the errors, things get worse, making you apparently stuck with the error. I'd prefer to start with a guideline for when data should be decoded and encoded, and in particular I recommend decoding immediately after reading the data.
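Decoding right at the input boundary can be as simple as opening file handles with an :encoding layer, so the program only ever sees character strings and raw bytes exist only on disk. A self-contained sketch (the file name is a placeholder):

```perl
use strict;
use warnings;
use utf8;    # this source file itself contains UTF-8 literals

my $file = 'charset-demo.txt';    # placeholder file name

# Encode on the way out ...
open my $out, '>:encoding(UTF-8)', $file or die "write: $!";
print {$out} "schräg müde köstlich\n";
close $out or die "close: $!";

# ... and decode immediately on the way in: $line arrives as a
# character string, never as raw bytes.
open my $in, '<:encoding(UTF-8)', $file or die "read: $!";
my $line = <$in>;
close $in;
unlink $file;

print length($line) - 1, "\n";    # 20 characters (23 bytes on disk)
```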
BTW: last year I gave a presentation about encoding with Perl at the German Perl Workshop. The talk (Youtube) is in German, but the slides (PDF) are in English.
--
Harald Jörg - 2020-09-18