Motivation
Now that Perl is more I18N-friendly, we should make UTF-8 the default character set.
Description and Documentation
Set this in
lib/TWiki.spec:
$TWiki::cfg{Site}{CharSet} = 'utf-8';
Examples
Impact
Implementation
--
Contributors:
Peter Thoeny - 2020-09-15
Discussion
Any gotchas I might have missed?
--
Peter Thoeny - 2020-09-15
I just changed the character set on TWiki.org's TWiki to utf-8. Test:
- Test 日本語
- Test German Umlaut schräg, müde, köstlich
--
Peter Thoeny - 2020-09-15
I agree: Perl's Unicode support has been very robust for several releases now. Also, today's platforms use UTF-8 as the default encoding, which matters whenever TWiki uses external tools (well, Windows is an exception, but it also lacks those external tools). Moving towards UTF-8 also helps get rid of "locales" as a way to distinguish between different one-byte-per-character encodings.
However, there are gotchas whenever you have existing topics. This is beyond the scope of just changing the default, but probably rather relevant for the TWiki customer base.
In my use cases the lifetime of topics has always been longer than the lifetime of any hardware or software version. German texts saved in ISO-8859-1 are usually invalid as UTF-8. So either I am stuck with the initial encoding, or I need to find a way to migrate, which is tricky. The English language (and therefore the TWiki distribution topics) hides the problem, but issues on twiki.org are visible in personal homepages, e.g. that of BjoernDoering, and are particularly nasty at WikiNamesWithUmlauts. By the way: I can safely enter any Unicode character outside of the 1-byte range, because browsers just submit them HTML-escaped, but I cannot search for such characters.
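Such a migration could be sketched roughly as follows. This is a hypothetical helper (the function name and fallback policy are mine, not existing TWiki code): it leaves content alone if it already decodes cleanly as UTF-8, and otherwise assumes ISO-8859-1, which can decode any byte sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical migration helper: return the UTF-8 bytes for a topic.
# Content that is already valid UTF-8 is left untouched; anything
# else is assumed to be ISO-8859-1 and re-encoded.
sub topic_bytes_to_utf8 {
    my ($bytes) = @_;
    my $copy  = $bytes;   # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $bytes if defined $chars;    # already valid UTF-8, keep as-is
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}
```

Note that the safeguard can still misfire: some short ISO-8859-1 byte sequences happen to be valid UTF-8, so a dry run that reports which topics would change is advisable before converting in place.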
I would consider it a very good step towards UTF-8-ness if TWiki added META:ENCODING information to every topic it writes. The next step would be to use this meta information, when available, instead of the global {Site}{CharSet} for reading topics, but that needs safeguards wherever the previous {CharSet} was different from UTF-8. Today, TWiki's UTF-8 decoding is pretty sloppy and ignores errors, so an edit/save cycle might damage the topic contents.
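A reader honouring such per-topic information might look like this sketch (the %META:ENCODING{...}% line format and the function name are hypothetical; the strict FB_CROAK decode is the safeguard against silently mangling topics):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical topic reader: prefer a per-topic %META:ENCODING{...}%
# line over the site-wide charset, and die on malformed input rather
# than silently ignoring decode errors.
sub read_topic_text {
    my ($raw_bytes, $site_charset) = @_;
    my $enc = $site_charset;
    if ($raw_bytes =~ /^%META:ENCODING\{name="([^"]+)"\}%$/m) {
        $enc = $1;
    }
    return decode($enc, $raw_bytes, FB_CROAK);    # strict decode
}
```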
As for wiki names with umlauts, this is another can of worms. File systems are encoding-agnostic: file names are just bytes, and there is no place where an application can record which encoding it used for Unicode names. Therefore, decoding file names as UTF-8 needs safeguarding as well. Out of habit, I still stick to ASCII for file names.
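The same safeguard idea applies to file names read from disk, for example (again a sketch with a function name of my choosing, not existing TWiki code):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical safeguard: readdir() hands back raw bytes, so decode
# a file name as UTF-8 only if it really is valid UTF-8, and keep
# it as opaque bytes otherwise.
sub decode_filename {
    my ($name_bytes) = @_;
    my $copy  = $name_bytes;
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $chars ? $chars : $name_bytes;
}
```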
--
Harald Jörg - 2020-09-15
I hope we will make TWiki UTF-8 based. It's a significant undertaking. But without doing it, TWiki cannot handle non-ASCII characters correctly in all cases. Simply setting {Site}{CharSet} to 'utf-8' leaves some cases where non-ASCII characters are not displayed correctly.
I've been using TWiki in UTF-8 for more than 10 years with thousands of webs and millions of topics, so I can say that TWiki can handle topics in UTF-8. However, even on a fresh TWiki install, TWiki has subtle but deeply rooted character handling problems. Since Perl 5.8 or so, Perl distinguishes between byte strings and character (UTF-8) strings. To handle non-ASCII characters properly in Perl, you need to put them in character strings rather than byte strings, but TWiki handles non-ASCII characters as byte strings. This was OK as long as CGI.pm was NOT properly handling non-ASCII characters, i.e. while CGI.pm itself worked on byte strings rather than character strings; that is no longer the case. Still, TWiki can handle non-ASCII characters in UTF-8 most of the time. The most notable gotcha is with TWikiForms: if you have non-ASCII characters in select or radio options, they are not properly displayed on the edit page. This is because TWiki handles non-ASCII characters as byte strings, and when CGI.pm gets them, it treats them as ISO-8859-1.
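The byte-string vs. character-string distinction can be seen directly in Perl; this minimal illustration (not TWiki code) shows the same word once as UTF-8 bytes and once as characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "schr\xc3\xa4g";             # the UTF-8 bytes of "schräg"
my $chars = decode('UTF-8', $bytes);     # the same text as a character string

print length($bytes), "\n";    # 7: Perl counts bytes here
print length($chars), "\n";    # 6: Perl counts characters here
```

Handing $bytes to code that expects character strings, as modern CGI.pm does, makes Perl reinterpret each byte as an ISO-8859-1 character, which is exactly the mangling seen in form options.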
--
Hideyo Imazu - 2020-09-16
It looks like we have consensus on enabling UTF-8 by default.
I would ignore topic names with umlauts for now. The create-topic step already transliterates ö to oe, for example.
We can fix issues as we find them. I just fixed the first one: TWikibug:Item7911: I18N: Raw view with UTF-8 charset mangles text. A fix is pending for the related TWikibug:Item7912: I18N: Raw view with UTF-8 charset mangles form field text.
I like the idea of recording the character set in each topic. That makes migrating content between TWiki sites easier (such as on a company merger). Instead of adding a new META:ENCODING meta tag, I think a logical place is a charset="..." attribute on the existing META:TOPICINFO meta tag. To help with legacy topics that do not have that charset="..." attribute set, we can add a new {Site}{LegacyCharSet} to define the character set of those topics, such as 'ISO-8859-1'. Followup in AddCharSetToMetaTopicInfo.
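Under that proposal, the first line of a topic saved by a UTF-8 site might look like this (the charset attribute is the proposed addition; the author, date, and version values are made up for illustration):

```
%META:TOPICINFO{author="PeterThoeny" date="1600128000" format="1.1" version="5" charset="utf-8"}%
```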
--
Peter Thoeny - 2020-09-17
I agree that adding the information to META:TOPICINFO is preferable.
A minor suggestion: though "charset" has been used historically, it isn't exactly appropriate. "Charset" makes sense to distinguish which set of characters should be associated with the bytes 0-255. UTF-8, on the other hand, is an encoding of Unicode, and Unicode can represent any character. HTTP and HTML have kept "charset" as the name for compatibility, and the TWiki configuration variables should do the same, but for TOPICINFO (which can't be directly changed by users) I'd prefer "encoding".
I am also slightly suspicious about "We can fix issues as we find them". In my experience, one of the dangers lies in code which "works" in some paths due to a cancellation of errors, but fails in others. This is difficult to disentangle, because if you fix one of the errors, things get worse, making you apparently stuck with the error. I'd prefer to start with a guideline for when data should be decoded and encoded, and in particular I recommend decoding immediately after reading the data.
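Decoding right at the input boundary can be as simple as opening file handles with an :encoding layer, so the program only ever sees character strings and raw bytes exist only on disk. A self-contained sketch (the file name is a placeholder):

```perl
use strict;
use warnings;
use utf8;    # this source file itself contains UTF-8 literals

my $file = 'charset-demo.txt';    # placeholder file name

# Encode on the way out ...
open my $out, '>:encoding(UTF-8)', $file or die "write: $!";
print {$out} "schräg müde köstlich\n";
close $out or die "close: $!";

# ... and decode immediately on the way in: $line arrives as a
# character string, never as raw bytes.
open my $in, '<:encoding(UTF-8)', $file or die "read: $!";
my $line = <$in>;
close $in;
unlink $file;

print length($line) - 1, "\n";    # 20 characters (23 bytes on disk)
```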
BTW: last year I gave a presentation about encoding with Perl at the German Perl Workshop. The talk (Youtube) is in German, but the slides (PDF) are in English.
--
Harald Jörg - 2020-09-18