
Use UTF-8

The more work I do on getting I18N support right in WYSIWYG, the more convinced I am that TWiki goes out of its way to make life difficult for users, admins and extension authors by not using UTF-8.

UnderstandingEncodings is a detailed primer on character sets and a discussion of the problems inherent in trying to support non-UTF-8 character sets in the TWiki core. Please read it carefully before commenting. I also highly recommend this overview of Unicode and UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html. RD also recommends The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which is a nice, gentle introduction.

Proposal

The proposal here is to modify TWiki to assume and check for the use of UTF-8 in all content. That means UTF-8 would be assumed in:

  • topic content
  • topic and web names
  • template files
  • URL parameters, including form content
  • external interfaces
  • web browsers

External data will be checked to ensure it is valid UTF-8, which is important for security.

Key concepts

  • UTF-8 is the encoding - see UnderstandingEncodings
  • UTF-8 character mode (aka Perl utf8 mode) - Perl handles the 1 to N bytes of a Unicode character as a single character, not as N bytes. This is the target of this work. See perldoc perlunicode for details.
  • UTF-8 as bytes mode - the legacy approach: Perl happens to process the 1 to N bytes of a Unicode character as N bytes, not as a single character. This is usually a mistake if you are trying to use UTF-8, but is sort-of supported with current TWiki versions (see InstallationWithI18N for when it's used) - you don't get WikiWord support and so on, but the characters should not be mangled. A short example of the difference follows.
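
A minimal Perl sketch of the difference between the two modes (the two-byte string is just an illustrative example, not TWiki code):

<verbatim>
use strict;
use warnings;
use Encode qw(decode);

# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9
my $octets = "\xC3\xA9";
print length($octets), "\n";                 # 2 - bytes mode: two separate bytes
print $octets =~ /\w/ ? "yes\n" : "no\n";    # no - neither byte is a word character

# decode() marks the string as characters, i.e. Perl utf8 mode
my $chars = decode('UTF-8', $octets);
print length($chars), "\n";                  # 1 - character mode: one character
print $chars =~ /\w/ ? "yes\n" : "no\n";     # yes - needed for WikiWord matching
</verbatim>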

Technical Detail - What would need to be done?

The default character encoding for TWiki will be UTF-8, using NFC as the UnicodeNormalisation form, and TWiki will work in Perl's "UTF-8 character" mode.

  • Core cleanup: some deleting of old code would be needed where it is aimed at different character sets, as well as fixing some regressions where core code is not following InternationalisationGuidelines.
  • Fixing and updating of documentation - however, designing the UTF-8 support for much simpler installation and configuration is important
    • Corrections to the documentation
    • Add guidelines for adding localized templates or skins: they need to be (or to be converted to) UTF-8, too
  • Review all extensions (plugins, skins, contribs) for assumptions about character sets (e.g. /[A-Z]+/) and add guidelines for extensions authors
  • All streams opened by the store need to use :encoding(utf-8-strict), not :utf8 - the latter doesn't check for a valid UTF-8 encoding, leading to possible security holes - see the Security item below. ("UTF-8" in Perl is equivalent to "utf-8-strict", but the latter is less vulnerable to typos.)
  • STDOUT and STDERR need to be re-opened with :encoding(utf-8-strict), using the equivalent of binmode(STDxxx, ':encoding(utf-8-strict)'); - this also needs to apply to ModPerl and similar CgiAccelerators, which may not use STDOUT/STDERR. See the sketch after this list.
  • Increment the store version
  • Check very carefully whether input data from a form is indeed UTF-8 encoded - generally, if you force the output page to UTF-8 using the HTTP header and HTML charset, the data returned in a POST or GET will also be in UTF-8, so this check mainly guards against a user explicitly setting their browser to the wrong character set. Fortunately CPAN:Encode can do this very efficiently, certainly faster than the EncodeURLsWithUTF8 regex (see the sketch after this list).
    • CGI.pm does not do any encoding. In fact it can't, because the encoding is not given with the HTTP request. However, CPAN:CGI does turn on Perl utf8 mode in some more recent versions, which has been a problem for the current pre-Unicode versions of TWiki. We might need to test against specific CPAN:CGI versions if we get problems.
  • Evaluate the encoding of all content which is retrieved via other protocols:
    • HTTP (e.g. %INCLUDE{http://somewhere}% and other TWiki::Net interfaces)
    • Mail as in MailInContrib
  • Searching - ensure that searching for Unicode characters works, including case-insensitive searching
    • Forcing all characters into NFC dramatically simplifies searching (see UnicodeNormalisation). There is some complexity with characters that only exist in decomposed form - care is needed to avoid false matches against the unaccented form (or this could simply be defined as a feature, i.e. some support for 'ignore accents' searching?)
    • TBC: whether 'ignore accents' searching is in scope - probably requires some UnicodeNormalisation to strip accents when comparing characters, and may be language-dependent.
  • Sorting
    • Sorting by Unicode codepoint is the default - fast but won't always give the results that people expect.
      • Perl locales change the Perl collation order, but are very buggy when combined with Perl utf8 mode, in RD's experience - best avoided, which is why many Perl Unicode apps don't use locales at all.
    • UnicodeCollation support is the main alternative - this uses UnicodeNormalisation under the covers, which may cause a performance hit.
  • (Possibly) UnicodeNormalisation to enforce use of NFC throughout
    • May be required for environments where a Mac or iPhone can be used as a TWiki client, or an external data source uses NFD - see UnicodeMac.
  • Use UTF-8 for all content which TWiki sends elsewhere
    • Sent mail, as in TWiki's notifications
    • Command parameters for Sandbox commands (need to set the operating system's encoding / locale for these commands)
  • Ensure this all works with WysiwygPlugin and TinyMCEPlugin
  • Ensure this works with ordinary form editing, without WYSIWYG
  • An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code.
  • Unit testcases would be required for:
    • Existing topic in non-UTF-8 charset
    • Topic with broken UTF-8 encodings
    • Check encoding on all pages generated by TWiki
  • Fixes for the following bugs would need to be confirmed: TWikibug:Item3574 TWikibug:Item4074 TWikibug:Item2587 TWikibug:Item3679 TWikibug:Item4292 TWikibug:Item4077 TWikibug:Item4419 TWikibug:Item5133 TWikibug:Item5351 TWikibug:Item5437 TWikibug:Item4946
  • Require a recent Perl 5.8.x release - even when running in "non-Unicode" mode.
  • Batch migration of all TWiki topics including pathnames from pre-Unicode character sets to UTF-8 (see below)
    • To convert from non-UTF-8 character sets other than ASCII, batch migration tools will be provided to convert all topic contents and the pathnames for all topics and attachments.
  • Windows server support
    • Windows may also have some issues with Unicode filenames, but it uses NFC so should be OK. Apache on Windows works best with UTF-8 URLs, so actually our Windows I18N support could improve with UTF-8. No UnicodeNormalisation problems, unlike UnicodeMac.
  • Security
    • TWiki must take care to check that possible UTF-8 data is in fact using only valid UTF-8 codepoints (characters in the encoding) and is not using an 'overlong' encoding - both can lead to security holes.
  • Performance benchmarking and tweaking
    • Benchmarking should be done very early so we get good metrics of how the Unicode changes affect performance. Some optimisations may be possible, though I have no idea what they are. My experience a few years back was a threefold slowdown; hopefully Perl has improved since then.
  • (Possibly) "Unicode mode" toggle
    • Despite the niceness of Unicode, I think it's important to have a simple toggle that globally disables Unicode usage for TWiki. While this disables any I18N, it also allows the user or developer to:
      1. Avoid any Perl utf8 bugs
      2. Ensure best possible performance even if some strings get forced into Perl utf8 mode.
      3. Easily compare the non-utf8 and utf8 modes for unit and system testing.
      4. Run TWiki in non-I18N mode on Perl 5.6
    • It may be a bit of a hassle initially to implement this toggle (e.g. dynamic code in BEGIN blocks, etc.) but I think it's worth it.
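
To make the I/O-layer, form-validation, normalisation and character-class items above more concrete, here is a minimal, hedged Perl sketch. It is not TWiki core code - the file name, parameter name and WikiWord regex are illustrative placeholders only, and a real implementation would live in the store and request-handling layers:

<verbatim>
use strict;
use warnings;
use CGI;
use Encode qw(decode encode FB_CROAK);
use Unicode::Normalize qw(NFC);

# Re-open STDOUT/STDERR with a strict UTF-8 layer; ':encoding(UTF-8)'
# validates the bytes, unlike the lax ':utf8' layer.
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

# Topic files opened by the store get the same strict layer.
open my $fh, '<:encoding(UTF-8)', '/var/twiki/data/Main/SomeTopic.txt'
    or die "open: $!";

# Decode and validate a form parameter; the strict 'UTF-8' encoding plus
# FB_CROAK rejects malformed and overlong sequences instead of passing them on.
my $q     = CGI->new;
my $bytes = $q->param('text');
$bytes = '' unless defined $bytes;
my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
die "Invalid UTF-8 in form data\n" unless defined $text;

# Normalise to NFC so that searching and comparison are consistent.
$text = NFC($text);

# Unicode-aware WikiWord check instead of the ASCII-only /[A-Z][a-z]+.../
my $wikiword = qr/^\p{Lu}\p{Ll}+(?:\p{Lu}[\p{L}\p{Nd}]*)+$/;
print "looks like a WikiWord\n" if $text =~ $wikiword;

# Parameters for external (Sandbox) commands must be encoded back to bytes.
my $arg = encode('UTF-8', $text);
</verbatim>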

Perl bugginess with Unicode:

  • Should be better now, but I ran into some issues a few years back, and we should expect to uncover and work around some Perl bugs.

The following items are out of scope:

  • Pre-Unicode charset support (e.g. ISO-8859-1 and other 8-bit character encodings) - only ASCII and UTF-8 will be supported
  • MacOSX client and server support
    • Requires a lot of additional work specific to MacOS X - see UnicodeMac for details.
  • Mainframe support (see TWikiOnMainframe)
  • Backward compatibility with Perl 5.6 and early 5.8.x releases - this would matter more if we supported pre-Unicode site charsets, but early 5.8.x is a problem for Unicode as well.
    • Once we start doing UseUTF8, TWiki will no longer work with Perl 5.6 due to its broken Unicode support, and will only work on later 5.8 releases - some systems have older 5.8.x versions with too many Unicode bugs to be usable.
    • Perl 5.6 and early 5.8.x support is out of scope even in 'non-Unicode' mode.

Batch migration of TWiki data

  • Migration of topics and filenames - any pre-Unicode encoded non-ASCII data in the topics or filenames (including attachment filenames but not attachment contents) will need to be converted using batch migration tools. The TWiki release will provide a reasonably automated migration tool, though it will be important to back up all TWiki data and to check the results of the migration carefully.
    • Conversion from TWiki files generated using MacOS clients or servers will not be supported, due to UnicodeMac problems.
    • A batch migration process is essential - this goes against the TWiki upgrade philosophy but this is quite a big and complex change to the entire pub and data trees, including existing filenames.
    • Tools to be used probably include iconv (or piconv from CPAN:Encode) for topic data, and convmv for pathnames - see the sketch after this list.
  • Batch migration will preserve history: RCS is a text format (checked with man rcsfile and this page documenting the RCS file format in more detail) and doesn't appear to have any length or checksum fields that would mess this up, so conversion is fairly trivial - just use the iconv utility for the file contents, and convmv for the filename itself (and directory names).
  • There may be some corner cases if people have embedded URL-encoded links within a TWiki page, but that's unlikely and not required with current I18N. The page linked here makes it clear that it is safe to embed UTF-8 in RCS files. The only problem might arise if Asian sites have used (against the TWiki I18N recommendations at InstallationWithI18N) a non-ASCII-safe double-byte character set such as Shift-JIS as the {Site}{CharSet}: when we convert this to UTF-8, RCS may have escaped a conflicting byte within a double-byte character. I suggest we don't bother with this case, as such character sets were never supported (see JapaneseAndChineseSupport as well).
  • Within TWiki topics, HTML/XML entities such as &uuml; or &#874; could be converted to UTF-8 characters. However, this would require parsing all TML within topics, so that entities found within <pre> and <verbatim> blocks, or within embedded JavaScript, are not converted. This is a lot of work and prevents the use of a tool such as iconv to translate at the character level, so it may not be in scope - if it is done, it would be as an option or a second conversion pass.
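
As a rough illustration of the conversion step and the double-conversion guard (see also the discussion below), here is a hedged Perl sketch. The source charset and data path are assumptions, and a real tool would also handle the RCS ,v histories, attachment directories and error reporting - in practice iconv/piconv and convmv may well be the simpler route:

<verbatim>
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);
use File::Find;

my $from = 'iso-8859-1';          # assumed legacy {Site}{CharSet}
my $root = '/var/twiki/data';     # assumed data tree

find( sub {
    return unless -f $_ && /\.txt$/;   # plain topic files only in this sketch

    open my $in, '<:raw', $_ or die "read $_: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    # Guard against double conversion: skip anything that already decodes
    # as strict UTF-8 (pure ASCII also passes, which is harmless).
    my $probe = $bytes;
    return if eval { decode('UTF-8', $probe, FB_CROAK); 1 };

    # Convert the topic contents from the legacy charset to UTF-8.
    my $text = decode($from, $bytes);
    open my $out, '>:raw', $_ or die "write $_: $!";
    print $out encode('UTF-8', $text);
    close $out;

    # Convert the filename itself (convmv does this for whole trees).
    my $newname = encode('UTF-8', decode($from, $_));
    if ( $newname ne $_ ) {
        rename $_, $newname or die "rename $_: $!";
    }
}, $root );
</verbatim>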

Collation

UnicodeCollation doesn't give you language-specific sorting, but it does provide a good default sort order across all languages. People who want correct sorting in Swedish, Danish, Japanese, etc. will need a 'language sort module' that adds language-specific collation rules. This should be handled like the UI internationalisation work: ideally, whenever a translation is done, someone a bit more techie defines the collation rules - with a bit of luck these are available from various sources, but there are very few language-specific modules on CPAN that help. Also, CPAN:Unicode::Collate involves loading a 1.3 MB default collation order file, which could have some performance impact. UnicodeCollation might therefore be enabled only for those who need language-specific collation to be absolutely correct, with the default being to sort by Unicode codepoint - which doesn't always look nice, but at least is fairly fast. Some performance testing is needed, with and without ModPerl. A minimal sketch follows.
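
A minimal sketch of codepoint sorting vs. CPAN:Unicode::Collate (the word list is just an example; language-specific tailoring would be layered on top of this):

<verbatim>
use strict;
use warnings;
use utf8;    # the literals below are written in UTF-8
use Unicode::Collate;

my @words = qw(zebra Ähre apple Äpfel);

# Default Perl sort is by codepoint - fast, but "Ähre" sorts after "zebra".
my @by_codepoint = sort @words;

# Unicode::Collate uses the Default Unicode Collation Element Table; building
# the collator loads the (large) table, so create it once and reuse it.
my $collator = Unicode::Collate->new();
my @collated = $collator->sort(@words);

# level => 1 compares primary weights only, i.e. ignores case and accents -
# one possible basis for 'ignore accents' searching.
my $loose = Unicode::Collate->new( level => 1 );
print "equal at level 1\n" if $loose->eq( 'Äpfel', 'apfel' );
</verbatim>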

Resources

Earlier work by RD on UnicodeSupport - RD can provide his code, which is based on an old TWiki alpha version; it got to the point of running on a semi-public Unicode test site, in real "Perl utf8" mode rather than just "bytes" mode with UTF-8 encoding.

From VictorKasatkin in UnicodeProblemsAndSolutionCandidates: see CGI.pm, Apache, and UTF-8 (linked above).

  1. First, replace use CGI in TWiki.pm with use CGI::as_utf8; based on this Perlmonks thread.
  2. Insert use open IO => ':locale'; in the BEGIN section, and insert use utf8; in edit.pl - I can create a topic in the local UTF-8 encoding (regexes work), but after editing, view.pl hangs (as seen in top under Linux) and does not respond when localization is enabled in the configuration...

-- Contributors: CrawfordCurrie, HaraldJoerg, RichardDonkin, VictorKasatkin

Discussion

(Deleted earlier discussion about accept-charset design option, now dropped)

I think it is feasible to migrate TWiki towards pure UTF-8. It should work for all known languages on the planet Earth and will limit the platform we have to test on to ONE.

With respect to plugins the important steps will be

  • Ensure anything in Func is UTF-8 compatible
  • Migrate the most popular plugins to UTF-8 and in the process document what it is it takes to upgrade a plugin to UTF-8
  • Provide a safe upgrade method. I have tried to upgrade ASCII-type topics by converting to UTF-8, and so far it has worked fine each time. This is a situation where it is difficult - maybe impossible - to do on the fly. But unlike the upgrade scripts suggested for syntax changes, which are always bound to fail because we cannot predict the more advanced ways people build applications, upgrading to UTF-8 happens at the byte level and is a well-known process for which plenty of tools are available. But we need a good way to ensure that topics are not double-converted (converting a topic which has already been converted).

Going to pure UTF-8 will be a task that takes a lot of testing. I am testing UTF-8 at the moment in the 4.2.1 context, and as you know there are a couple of new bugs I opened where I have seen that SEARCH and verbatim are not yet working in UTF-8. There will be many more test steps needed before we can let go of other charsets. But I think the step to make TWiki UTF-8 only should be considered in a positive spirit, because it will make TWiki fully internationalised - which it is not now - with a chance of being stable for non-English users as well.

-- KennethLavrsen - 13 Apr 2008

I agree that UTF-8 should be the way to go, and I fully support moving towards encoding topics in UTF-8 as soon as possible. Easy moving of topics between TWiki installations needs a single encoding, and UTF-8 seems to have no real alternative for that purpose. Topics (and templates) have long-lived encodings; they tend to lie on disk for years without surreptitiously changing their encoding, hence the migration path needs to be carefully paved (as you did in your proposal).

My options do not refer to using UTF-8 for writing topics, but to the encoding used for TWiki's other interface, HTTP/HTML written for browsers. UTF-8 would work fine for me, and maybe for all installations (including Elvish). So probably we could jettison option (2) right now.

So what it boils down to is not that TWiki is using UTF-8 (because, strictly speaking, TWiki is using Perl's internal encoding all the time), but that TWiki expects all its external data interfaces to be encoded in UTF-8. From that point of view, topics are the easiest part, because writing and reading topics is under TWiki's more or less exclusive control. As you wrote, we'll need to carefully collect assumptions about encodings, but also identify unjustified ignorance. Maybe you summarized these cases with your item "An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code." I added some to the list above; hopefully it won't grow too much.

-- HaraldJoerg - 13 Apr 2008

It would be good to look at UnicodeSupport and linked pages, which contain a lot of thinking about this. I've commented at UnicodeProblemsAndSolutionCandidates in detail on some of the issues that would need to be solved, which cover some of the points made above. It would be helpful if the various Unicode pages were interlinked - perhaps UnicodeSupport could be refactored into a 'landing page' for all these topics including latest discussions, to make it easier to find them.

Shame I missed this discussion - I haven't been tracking TWiki for a while now, but would be interested in participating if someone can email me. Unfortunately TWiki.org doesn't have a good way of monitoring 'only pages with certain keywords' that I'm aware of.

  • WebRss supports SEARCH statements to narrow down what you get notified of (and Crawford entered an enhancement request of mine for supporting SEARCH queries (full TML actually) in WebNotify) - SD

-- RichardDonkin - 14 Jun 2008

Thanks for the tip, Sven.

On the options - I think the best one is option 1, i.e. UTF-8 at the presentation level and internally. There should be very few systems these days where UTF-8 is not supported - even on an ancient 486 you can boot a live CD that supports UTF-8 in Lynx - but I'm sure someone will come up with one.

In a possible Phase 2 of UTF-8 adoption, we could implement some charset conversion at the presentation layer, e.g. if someone has a browser or email client that only does a legacy Russian or Japanese character set, perhaps, and they are unable to upgrade their clients. This could perhaps be driven by accept-charset. However, this adds complexity so let's not do it in the first phase of UseUTF8.

See more comments in text prefixed with RD.

-- RichardDonkin - 15 Jun 2008

I've added a key concepts section above to try to differentiate between "UTF-8 character mode" in Perl vs. processing UTF-8 as bytes (which is not what we want), as a result of commenting on Bugs:Item5566.

-- RichardDonkin - 26 Jun 2008

(Comment material re security merged above)

-- ChristianLudwig - 27 Jun 2008

One simple next step might be to agree whether we can dump the accept-charset idea which IMO is not required.

-- RichardDonkin - 28 Jun 2008

I think "keep it simple" has to be the guiding principle here. I think accept-charset falls the wrong side of that line, and should not be used.

The main support problem we have had with encoding support to date has been excessive flexibility coupled with a lack of documentation explaining in simple terms what the casual admin needs to do. I had to research quite a lot to reach my poor level of understanding, and it's unreasonable to expect yer averidge admin to do the same.

So, from a user perspective, I don't want to know it's using UTF-8 (or any other encoding). configure should have no encoding options, just a single, simple option for setting the user interface language. If that means committing to a less-than-100%-flexible approach, then I'm in favour.

-- CrawfordCurrie - 28 Jun 2008

A less flexible approach should be possible since we won't be using locales, and I agree completely with going for simplicity. Some remaining issues though:

  • Batch migration of topics - this is essential to keep core code simple, so it only has to deal with UTF-8
  • Performance - early testing and tuning will be important, covering both the English-only and the I18N-heavy cases. If this can't be optimised, a Unicode-mode toggle as mentioned above will be important, but it could be based on a simple toggle such as {UseInternationalisation}.
    • ModPerl, PersistentPerl - should be tested from start to work with these to get reasonable performance (and ensure that UTF-8 character mode is enabled when stdin/stdout are not relevant)
  • Sorting - if we don't do locales, topic and table column sorting will need UnicodeCollation (unless we sort by codepoint which is very basic). There is a default order, but for language-specific sorting this ideally is based on "the language", which can most simply be derived from the user's language (for message internationalisation). Unicode obviously supports multiple languages but for collation you need to know which language the user is working in, and hence which Unicode collation order to use. The good news is that CPAN:Unicode::Collate does all this for you as long as it's used in any sort routines.

Some expert-level config options may be needed to work around brokenness, but we should try to avoid them wherever possible (like the Unicode mode toggle). If we limit ourselves to Perl 5.8 only, that will simplify matters - if Perl 5.6 must be supported, it could turn off all I18N and use only ASCII.

-- RichardDonkin - 28 Jun 2008

  • I thought the user interface language code required the locale to work?
  • I'm torn on batch migration. Migration on the fly is seductive, and fairly easy to make work, but the performance is likely to stink. Batch migration has the potential to lose the history (unless it rebuilds it using the new encoding)
  • Another issue is Extensions. Authors need comprehensive support to make sure they don't fall into the /[A-Za-z]/ trap.
    • Case detection and conversion (whatever 'case' means)
    • Sort collation
    • Language/encoding information
My personal opinion is that Perl 5.6 is past its sell-by date and should be dropped. This might shut out some hosting providers; I'd be interested to hear if any are still using 5.6.

-- CrawfordCurrie - 29 Jun 2008

Will have to look at the UI language code but I think it only uses locales because the core does. If we go UTF-8 there is no problem, as the translation files are already in UTF-8.

(migration material merged above)

EncodeURLsWithUTF8 may need to be enhanced slightly - haven't thought about the details yet, but limiting ourselves to browsers supporting UTF-8 should help and might even simplify it. Attachment support through UTF-8 URLs will be the main remaining issue - however by making the browser use UTF-8 we force all its URLs to be in UTF-8 format. We might even find TWikiOnMainframe I18N works without special code...
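
A hedged sketch of what strict decoding of a UTF-8 URL path might look like (URI::Escape and PATH_INFO are standard, but the surrounding logic is illustrative only, not the EncodeURLsWithUTF8 implementation):

<verbatim>
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode FB_CROAK);

# e.g. PATH_INFO = "/Main/Caf%C3%A9Topic" for a topic named "CaféTopic"
my $raw   = defined $ENV{PATH_INFO} ? $ENV{PATH_INFO} : '';
my $bytes = uri_unescape($raw);    # still raw UTF-8 octets at this point

# Strict decode rejects malformed or overlong sequences in the URL.
my $path = eval { decode('UTF-8', $bytes, FB_CROAK) };
die "Invalid UTF-8 in URL\n" unless defined $path;

my ( $web, $topic ) = $path =~ m{^/([^/]+)/([^/]+)$};
</verbatim>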

Extensions are a problem, which InternationalisationGuidelines tries to address, but it's really down to the extension author and promoting I18N amongst authors. Many extensions aren't I18N-aware, but I think those that are already I18N-aware will have an easier time converting, and going Unicode makes life easier generally, particularly for extensions that interface to third-party systems that already use UTF-8.

(Collation material merged above)

-- RichardDonkin - 29 Jun 2008

A few more updates above to my comment of 29 Jun, and also some updates to main text - in particular I've removed the accept-charset part since we are agreed we don't want to do this.

-- RichardDonkin - 01 Jul 2008

Any more thoughts on this? I've done some updates to UnicodeCollation including a test script - this isn't hard to do.

-- RichardDonkin - 15 Jul 2008

I'm with you on batch migration. I think extension authors will have to be left to sort out their own houses; though the most common extensions will need to be tested. I don't care much about MacOSX, and until an OS X user with hardware steps forward I doubt anyone else will.

The main problem I foresee is testing. I don't think it makes sense to do any of this without a testing strategy. My preference is for UTF8 testcases to be added to the existing unit test suite, as lack of unit tests in this area has been crippling in the past. And as you say, performance testing is required.

I'd like to make proper UTF8 support a feature of TWiki 5.0, but I think it requires a lot more concentrated effort from interested parties than just the two of us batting ideas around, especially as neither of us is likely to be actively coding anything. Specifically I'd like to hear from community members who actually want to actively use non-western charsets in their day-to-day work, as their experiences would be key to the success of the venture.

-- CrawfordCurrie - 15 Jul 2008

Handy character set detection tool that may be useful in the batch migration of TWiki webs to UTF-8: http://chardet.feedparser.org/ - written in Python, based on the Mozilla character set detection code.

Possibly improved Perl regex to validate UTF-8 data from URLs, form input or WysiwygPlugin.

-- RichardDonkin - 2009-09-22

Lots of new material and revisions in the main DocumentMode section above. I've also refactored the comments, deleting some obsolete material and merging some material into the main section. I now understand the pain that is UnicodeMac much better.

-- RichardDonkin - 2009-10-22
