Tags: view all tags

See UseUTF8 for a more recent discussion of how we can implement UTF-8 support in TWiki - the rest of this page is several years old. --Main.RichardDonkin

Proposed UTF-8 Support for Internationalization

I'm planning to add support for UTF-8 (Unicode's 8-bit encoding) into TWiki in three phases. Here's the plan in outline:

Support for Request URLs in UTF-8 (EncodeURLsWithUTF8) - Goal: avoid need to configure browser - COMPLETE: now implemented for all character sets in TWikiAlphaRelease!
- This avoids having to configure the browser to turn off this encoding (the default in InternetExplorer, OperaBrowser and some MozillaBrowser URLs) and enables support of browsers where only this mode is supported (e.g. OperaBrowser for SonyEricsson P800). A non-UTF-8 character set is still used within TWiki and for other parts of HTML pages.
- Frequently requested on InternationalisationUTF8 and covered in more detail in EncodeURLsWithUTF8 - this issue is preventing some TWiki sites from adopting I18N, including TWiki.org
- The character encoding and character set for use in TWiki itself will remain non-UTF-8 in this phase, e.g. ISO-8859-1, KOI8-R, EUC-JP, etc - hence a conversion from UTF-8 in URLs to the relevant single-byte or multi-byte charset will be needed.
- Does not require real UTF-8 support in Perl or the operating system, and will still work on Perl 5.005_03 as per TWikiSystemRequirements, as long as CPAN:Unicode::MapUTF8 is installed. If you're using ISO-8859-1, there's no need to install CPAN:Unicode::MapUTF8.
- UTF-8 URLs are auto-detected - site can have a mix of users with UTF-8 URLs or conventional site charset URLs.
  - TO DO: Reliability of auto-detection needs testing, particularly for CyrillicSupport and JapaneseAndChineseSupport
- TWiki's current URL format (i.e. using URL-encoded single-byte encoding such as ISO-8859-1) will still be supported transparently alongside UTF-8 URLs
- No data conversion of TWiki files required
- Should support use of MozillaBrowser with TWikiOnMainframe (as long as mainframe web server can convert or pass through UTF-8 URLs)
- Implemented in TWikiAlphaRelease for virtually all character encodings - see EncodeURLsWithUTF8 for details and demo site. More details in comment below.
Basic UTF-8 support - Goal: enable multiple languages in a single TWiki
- All TWiki filenames and file data will be in UTF-8 when this is configured as the $siteCharset, and Perl's UTF-8 support will be used to ensure that references to WikiWords, sorting, searching, etc, all work correctly. TWiki will use UTF-8 URLs and page contents when configured, but will also work in ISO-8859-1 which will remain the default.
- Will require Perl 5.8 for full support due to Unicode regex requirements - however, sites not using Perl 5.8 can still use TWiki in non-UTF-8 mode.
- Data conversion of TWiki files required, including RCS, for existing sites that migrate to UTF-8 for topic data storage.
- Non-UTF-8 browsers will not be supported when in UTF-8 mode.
- Will also address conversion of non-UTF-8 site charset to Unicode XML entities, for use in RSS feeds (see WebRssAndEUCJPAndMojibakeCorruption) - this is easily done in this phase due to the creation of two-way character set translation functions.
Advanced UTF-8 support - Goal: more support, including sorting, normalisation and use of legacy browers when in UTF-8 mode
- (Possible) Non-UTF-8 browsers will be supported through on the fly conversion to non-UTF-8 charsets.
- TWikiOnMainframe support - possibly with use of UTF-EBCDIC, which is similar to UTF-8 but EBCDIC-clean
- Issues to think about for full support:
  - File systems that mandate UnicodeNormalisation forms (e.g. MacOSXFilesystemEncodingWithI18N) - more and admin issue than a TWiki issue
  - Ability to have >1 language in a TWiki installation (TranslationSupport) - UTF-8 is a key enabler only
  - Unicode collation (CPAN:Unicode::Collate) - language-sensitive sorting is a hard problem
  - Unicode normalisation (CPAN:Unicode::Normalize) - configurable, will expect NFC by default and convert to NFC if configured (see UnicodeNormalisation)

Perl 5.8 has good support for UTF-8, so this is consistent with where Perl is going. UTF-8 encoding works well on Unix/Linux boxes since it does not use any ASCII characters (e.g. NUL, '/', '\') within multi-byte encoded characters - see Markus Kuhn's Unicode page for an overview and some demo UTF-8 files. There are more UTF-8 resources on InternationalisationEnhancements in the Unicode section.

Perl 5.8 doesn't have any specific support for normalisation, but CPAN:Unicode::Normalize can be used to normalise data. Now that Perl 5.8.5 and higher are out, it would be best to require the latest 5.8.x for phase 2 or later since they have numerous Unicode bug fixes, but Phase 1 will work with any Perl version including 5.8.0.

Discussion elsewhere

Project UTF-8 is an initiative by Freedesktop.org to promote the use of UTF-8 in FreeSoftware - they've linked to this page from their page on software not supporting UTF-8.

Here are some instructions on I18N setup for Chinese speakers in Taiwan - this is using a UTF-8 locale, like NathanOllerenshaw in Japan. More reports on how well or badly the Feb 2003 TWiki code works with UTF-8 would be great - please include version of Perl. Maybe it would be enough for CairoRelease to simply document how TWiki can be used with UTF-8, including any limitations and Perl version dependencies?

The Indymedia group have been discussing UTF-8 support in TWiki and I'm in contact with some Indymedia people to see if they can test the UTF-8 support. The Indymedia TWikis have a very active multilingual user base.

Setup and UTF-8 patches for grep, diff, etc

GNU grep and diff should both support UTF-8, but some patches may not have made it into your version. It's essential to use GNU grep/diff in any case, as recommended in TWikiSystemRequirements. In particular, grep 2.5 (the current version as of Jan 2004) has a bug that causes 100-times slowdown when using UTF-8 locales - see this UTF-8 Project page for more details and link to a patch. The OpenI18N Group has some more patches for GNU grep and diffutils.

- It's important to set up your environment correctly for UTF-8: see these notes on UTF-8 setup for client and server.

Phase 1 complete

Phase 1 (i.e. EncodeURLsWithUTF8) is now complete - I have some working code running on http://donkin.org/ that allows use of UTF-8 URLs with virtually any $siteCharset. This has been tested with EUC-JP and has the nice spin-off that Japanese characters are now visible in the URL when UTF-8 URLs are used, at least in OperaBrowser and probably in other browsers. This will also work with virtually any non-UTF-8 character set supported by CPAN:Unicode::MapUTF8 (for Perl < 5.8) or CPAN:Encode (for Perl 5.8).

The code requires TWiki to be running with the right $siteCharset and that CPAN:Unicode::MapUTF8 (for Perl < 5.8) is installed, or that Perl 5.8 is used (includes CPAN:Encode). Users can either use UTF-8 URLs or $siteCharset URLs, and auto-detection works OK as far as I've tested it. However, short non-UTF-8 $siteCharset URLs that mimic UTF-8 encoding could be a problem in theory but I don't know any UTF-8 sequences that map onto likely sequences of UTF-8 characters. I do have a plan for disambiguation logic but I'm not going to do this unless necessary.

- Attachments are now working fine with UTF-8 URLs without use of viewfile in most cases, so performance should be basically the same as with non-I18N attachment names.

Issues with building Perl

You don't need any particular version of Perl unless you want UTF-8 as the site character set. If you do want this, Perl 5.8 is recommended, ideally 5.8.5 or higher - you may need to build your own copy, which is not too hard with the standard Perl docs and the Perl 5.8.5 installation docs.

If you build your own copy of Perl 5.8 (or 5.6) and are using TWikiOnWebHostingSites, be sure to check for a problem with building Perl 5.8 whereby the CPAN:Cwd module's getcwd function fails because it can't chdir ".." all the way to the / directory (may be something to do with NFS but more likely just due to security setup at Dreamhost). I've provided a rough patch for Perl 5.8's Cwd.pm module on BuildingPerlCwdIssue.

Early Phase 2 support

- Running with $siteCharset = UTF-8 is not yet supported, but recent testing with Perl 5.8.3 indicates that it mostly works: topics can be created, updated and viewed using UTF-8 for topic names as well as contents. WikiWord handling and searching don't yet work. The statistics script now works properly in UTF-8 mode.

Testers wanted

Anyone interested in alpha/beta testing this? The code is working fine for site character sets of EUC-JP, KOI8-R and ISO-8859-2, and is basically working with UTF-8 character sets (Phase 2). The code is now in TWikiAlphaRelease and has been running at http://donkin.org for some time without any problems.

- you can now test this without installing TWiki by using my UTF-8 test site - this is running with $siteCharset set to UTF-8 and basically working. Some people from http://freedesktop.org are testing the TWikiAlphaRelease through this site.

Things to do

Still TO DO in phase 1:

~~Testing on Perl 5.8 - installed Perl 5.8.3 on Linux, which resolved some UTF-8 related issues. Fixed code to work with CPAN:Encode on Perl 5.8, now in TWikiAlphaRelease~~
~~Attachment support as per EncodeURLsWithUTF8~~
~~TWikiOnMainframe support for EBCDIC characters~~ - needs testing
~~MozillaURLEncodingWithI18N support~~
~~Improved performance for attachments by eliminating viewfile as much as possible through use of URL encoding~~
Robustness of auto-detection of UTF-8, particularly in Cyrillic or South East Asian character sets that use almost all '8 bit' characters
- - seems quite robust, since UTF-8 has a lot of redundancy, but more testing needed. The code tries to interpret URLs as UTF-8 first, then falls back to a non-UTF-8 site character set such as KOI8-R - so the default setup for most modern browsers will work without ambiguity.

-- RichardDonkin - 19 Jan 2004

Comments

How well does this implementation tally with W3C & IETF standards?

http://www.w3.org/International/O-URL-and-ident.html
RFC:2396 - Uniform Resource Identifiers (URI): Generic Syntax
Current internet draft for IRIs (Internationalised Resource Identifiers):
- http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt
- Jumping off point: http://www.w3.org/International/iri-edit/

Were you aware of this when writing code to support EncodeURLsWithUTF8 (and friends) ?

-- MS - 30 Jan 2004

This work is quite well aligned with W3C and IETF standards work, I believe. Your first link, Internationalized Resource Identifiers (IRIs), references an appendix of the HTML 4.0 specification that recommends use of URL-encoded UTF-8 in URLs (now supported by all major browsers to some degree). This appendix was also referenced in EncodeURLsWithUTF8 near the top, so I was aware of the HTML recommendation re UTF-8 URLs/URIs.

My main motivation was simply to get such URLs to work, but since the browsers are following standards, TWiki is as well. The use of UTF-8 in new URI schemes is summarised in Internationalized Resource Identifiers (IRIs) and recommended in RFC:2718. The latter follows the RFC:2936 URI syntax standard, which is a higher level standard that doesn't really apply directly to HTTP/HTML URLs.

IRIs are not yet standardised or implemented: they basically act as a way of directly using Unicode characters (i.e. integers drawn from a 21 bit codepoint space and written as numeric character references, such as &#xNNNNN;) within a resource identifier, rather than having to use UTF-8. The advantage is that IRIs are self-identifying whereas some UTF-8 URLs could in theory be confused for non-UTF-8 URLs (although this is unlikely in practice). IRIs are mapped into URIs by essentially converting the Unicode codepoints into UTF-8 byte strings. So when IRIs do arrive in browsers, I expect they'll be converted into URIs first before being seen by TWiki. If not, it would not be too hard to use the conversion code now in TWiki to convert directly from IRIs. However, I haven't really read this spec yet, and am mainly being driven by what browsers actually do - for example, Mozilla sometimes uses UTF-8 URLs and sometimes uses non-UTF-8 URLs, so TWiki needs to work with both for the foreseeable future.

I also consulted the W3C character model and followed its choice of Unicode Normalisation Form C (NFC) when thinking about Unicode normalisation - see MacOSXFilesystemEncodingWithI18N for some notes on this, although normalisation should typically be optional.

One other standards-related effort to think about is InternationalisedDomainNames (IDNs)...

-- RichardDonkin - 30 Jan 2004

I now have a test site running the latest TWikiAlphaRelease in UTF-8 mode on Perl 5.8.3 - see this demo page, and just register if you'd like an account. TWiki is mainly working in UTF-8 mode for explicitly linked pages, but WikiWord recognition is not there yet and there are some bugs.

UPDATE: There are various small UTF-8 mode fixes in TWikiAlphaRelease, including one for CVS:bin/statistics that makes it work better in UTF-8 mode. Also, the xml:lang attribute is now supported as an initial step towards TranslationSupport - this also helps Unicode rendering since the language is important to selecting the actual glyph (i.e. character shape) - a specific code point must be rendered differently in Chinese and Japanese even though the character is the same. I've also discovered an interesting bi-directional rendering bug due to use of Hebrew, see my UTF-8 bugs page for details.

UPDATE: Attachments are now done, and no longer using viewfile for the most common case. Also, %INTURLENCODE% now does nothing and is deprecated.

-- RichardDonkin - 08 Feb 2004

There is an Apache module, mod_fileiri, that converts IRIs, UTF-8 URLs and legacy URLs into the appropriate legacy character set (e.g. ISO-8859-1) - see this IRI presentation by Martin Dürst. Despite the existence of this module, it's more portable to include the UTF-8 URL support within TWiki since some people don't use Apache and those that do may not find it easy (or even possible) to install modules.

Another useful slide in this pres covers browser support for IRIs - interestingly, this slide seems to use IRI as a virtual synonym for 'UTF-8 URI'. For some discussion of why Mozilla doesn't do UTF-8 URLs by default (this was removed in Mozilla 1.0/Netscape 7), see MozillaBug:129726 and MozillaBug:150376.

For a gory discussion on how Perl interacts with locales and OSs, it's worth reading this long but good Unicode thread on news://perl.unicode.

-- RichardDonkin - 13 Feb 2004

Question about Polish support refactored to PolishLanguageSetup.

-- RichardDonkin - 16 Feb 2004

Moved ScheduledFor to Dakaar, as phase 1 is complete for Cairo.

-- CrawfordCurrie - 01 Jul 2004

I've been researching how to do this, following some email discussions - mostly it's a question of use open in SVN:lib/TWiki/Store.pm, and possibly the same for stdin in other scripts, where using CGI. However, ModPerl doesn't use a pipe to communicate with CGI scripts, so a small wrapper around CPAN:CGI would be needed to set the Perl UTF-8 flag for all incoming parameters. See discussion thread.

-- RichardDonkin - 04 Jan 2005

Unscheduled until someone shows commitment.

-- CrawfordCurrie - 15 Feb 2005

ChangeProposalForm
TopicClassification	FeatureRequest
TopicSummary	Support for UTF-8 in TWiki in three phases, starting with UTF-8 URLs
CurrentState	UnderInvestigation
CommittedDeveloper
ReasonForDecision	None
DateOfCommitment
ConcernRaisedBy
BugTracking
OutstandingIssues	No commitment to implement.
RelatedTopics	CategoryI18N, UseUTF8, UnicodeSupport, InternationalisationUTF8, EncodeURLsWithUTF8, InternationalisationEnhancements (section on Unicode)
InterestedParties
ProposedFor
TWikiContributors

Topic revision: r47 - 2008-09-30 - RichardDonkin

Account
- Log In
- Register User

Edit
Attach

Ideas, requests, problems regarding TWiki? Send feedback. Ask community in the support forum.
Copyright © 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.