Internationalisation with UTF-8

Discussion of UTF-8 URLs refactored into EncodeURLsWithUTF8Discuss - now implemented as EncodeURLsWithUTF8 -- RD.

This controversial Slashdot story raised some FUD that Unicode could not properly deal with Chinese and Japanese. However, the comments refute this fairly comprehensively - in fact, the comments are much more useful than the article.

-- RichardDonkin - 18 Feb 2003

PhpWiki:Utf8Migration has some useful discussion and links - PhpWiki 1.3 has in fact made the move to UTF-8. That page shows some issues due to PHP's UTF-8 support.

-- RichardDonkin - 15 Mar 2003

That Slashdot story is pure FUD, written by someone with an axe to grind. UTF-8 is fine for Japanese and Chinese. Anyone who says otherwise hasn't looked at it enough, or talked to anyone in Japan or China who actually uses it.

We use UTF-8 for all new apps we write, and we don't get any of the problems we used to get with mojibake (literally in Japanese, a string of unintelligible characters).

The problem we currently have with our TWiki is that UTF-8 is getting mangled somewhere in the display code - it gets saved correctly, but the view script mangles it. We are running quite an old version of TWiki - can anyone tell me if this is fixed now? We have the encoding set to UTF-8 because we need to support Russian, Japanese AND English on the one page smile

-- NathanOllerenshaw - 27 Jun 2003

I know the story was probably FUD, but some of the comments to the story were useful. There still seem to be quite a few websites out there that are using non-UTF-8 encodings for Japanese and Chinese - is this because of immature support for UTF-8 in browsers, server tools, or just inertia?

The current TWiki code is not intended to work with UTF-8 - I suspect that there'll be a fairly significant amount of work to get it all sorted out, but patches are welcome (see PatchGuidelines). I assume you've tried setting the charset appropriately in the $siteLocale in TWiki.cfg? This should automatically make the contents of pages work with UTF-8, even if the links don't.

-- RichardDonkin - 27 Jun 2003

The problem seems to be a mix of inertia (why change from Shift_JIS when it works well now?), lack of understanding of the benefits of a single character set, and browser issues. Believe it or not, a lot of people in Japan still use really old versions of Netscape, simply because the localised versions of Netscape forced everything to Shift_JIS ... and if all the sites you view are Shift_JIS, then it works. Lots of webmasters here fail to set the encoding on the document, or use Apache's filename extension system for encoding ...

I've set the charset in $siteLocale. The issue seems to be with just some, not all, characters that get encoded. For example, if I type "nihongo" in kanji (Chinese characters), then this won't be corrupted on display. But if I type the hiragana character "n" followed by a hiragana comma, the "n" turns into a &. Go figure. smile Some kanji gets corrupted randomly like this as well, but more subtly - sometimes changing the meaning but not "looking" like corruption. Very strange.

I'll try upgrading the TWiki we are running and see if I can nail it down. I think it is due to the TWiki regex code doing things it shouldn't. The notwiki tag, however, doesn't seem to do anything.

BTW, if you have a TWiki up that is running in UTF-8 mode and want me to demonstrate these issues, let me know.

-- NathanOllerenshaw - 30 Jun 2003

The & transformation sounds a bit like what's discussed in NbspBreaksRssFeed and related pages - there should be some links on InternationalisationIssues.

Not sure what you mean by 'notwiki tag', that's not a TWiki tag - do you mean <nop>?

I don't have a UTF-8 TWiki, but it would be useful to test one.

-- RichardDonkin - 30 Jun 2003

Sorted.

    $useLocale = 1;
    $siteLocale = "ja_JP.utf8";

Initial testing with the latest stable version of TWiki seems to show that there are no corruption issues in text display.

I have some issues with file uploads - what's the best topic to discuss those on?

-- NathanOllerenshaw - 01 Jul 2003

Just ask a new support question on Support.WebHome having read the SupportGuidelines - if you're sure it's a bug, log a new bug at Codev.WebHome. There are quite a few file attachment issues logged on the Support web, some of which are diagnosed by the latest testenv from CVSget:bin/testenv.

-- RichardDonkin - 01 Jul 2003

Thanks Richard. I don't think it's a bug - in fact it's documented that it isn't. I created a new topic, UpdateAttachmentsDontWorkAsExpected, documenting my problem. It's also mentioned in UsabilityIdeas. I was hoping someone would comment on it. hint hint roll eyes (sarcastic) I'm willing to do the work as necessary - I am a fairly mean Perl hacker, I just need a nudge in the right direction. big grin

-- NathanOllerenshaw - 02 Jul 2003

Hint taken smile ... More info about your UTF-8 setup would be useful - have you got UTF-8 WikiWords working, and if so using which browsers?

-- RichardDonkin - 02 Jul 2003

Haha smile Thanks.

Well, in Japanese there aren't capitals, so I'm not sure how WikiWords are supposed to work with that wink I think that WikiWords in Japanese will never work.

Note however that Romaji (Roman characters) is an acceptable (if rudimentary) way of representing Japanese words. It's perfectly acceptable, I think, to just spell out ni ho n go as nihongo. So I guess people will just use the [] style links to link to Japanese topics, with the topic in Romaji.

Of course, one of the locals here might feel very differently about it, but I don't see how you could make a WikiWord out of Japanese unless you use Romaji. They just don't have upper case characters ... smile

Now, as for WikiWords in other languages that do have case - no, I don't have those working. We will probably have some Russian pages on here soon, and when we do I'll ask the Russians what their experiences with TWiki are like.

Right now it's all good for us - no more mojibake, which is the main thing. It was getting a little frustrating, and was turning people off the idea of TWiki. mad!

-- NathanOllerenshaw - 03 Jul 2003

As a Russian TWiki user, I'd like to note that TWiki in general works perfectly. I can have Russian WikiWords - all I need to do is turn off Unicode URLs in the browser.

The real problem with Unicode is the TWiki storage system. Files are stored under the topic names. Since my local codepage is KOI8-R, and my FFS partition on FreeBSD can handle one-byte charsets, there is no problem.

But I need Unicode in TWiki to support MSIE-based clients, though currently they use Mozilla/Win32 and haven't complained. When TWiki attempts to create a file with a Unicode name on FFS, the result will be unpredictable and, which is worse, unreadable.

I think that we need Unicode at the HTTP/Web level and the local charset at the file system level. Conversions can easily be done with the Text::Iconv Perl module. TWiki installations would need to specify a local charset and a network charset, and voila, everything will work. The local charset may be Unicode as well, if the particular file system allows it.
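
(To illustrate the idea - this is not actual TWiki code, and the charset and variable names are just examples - a minimal Text::Iconv sketch for a KOI8-R file system behind a UTF-8 web interface:)

    use strict;
    use warnings;
    use Text::Iconv;

    # Browser/network side uses UTF-8; the file system side uses KOI8-R.
    my $net_to_local = Text::Iconv->new("utf-8", "koi8-r");
    my $local_to_net = Text::Iconv->new("koi8-r", "utf-8");

    my $topic_from_browser = "WebHome";                            # UTF-8 bytes from the URL or form data
    my $local_name = $net_to_local->convert($topic_from_browser);  # KOI8-R bytes for the topic filename
    my $for_html   = $local_to_net->convert($local_name);          # back to UTF-8 for the HTML response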

-- SergeySolyanik - 03 Jul 2003

Thanks for the feedback - it's nice to hear that TWiki is working OK for Russian users. I've had very little feedback on the I18N code, so I'm not really sure if it meets people's needs. Is there a public URL for your site?

I'm not sure why InternetExplorer would require Unicode in TWiki - I did a lot of testing with IE5.5 while developing I18N and it works fine with KOI8-R or whatever charset.

When we go Unicode, it would be difficult to not use Unicode at the storage level - imagine a Russian TWiki in which someone creates a topic with French accented characters, and then try to figure out which non-Unicode charset should be used for the filename of the French topic. It's much easier to use Unicode - if there are problems in humans reading Unicode filenames, that's something that OS tools (or maybe the filesystem) should address - e.g. ls, vim and so on. However, Unix/Linux has no problem with arbitrary characters in filenames - just as long as there is some tool that can show what the filenames mean (even a simple Perl tool), the files are still manageable. Of course, you can just limit the 'real' charset per web and do the translation, but in that case, why not just use KOI8-R throughout, as in the current code?

Creating Unicode filenames should always be predictable - the filename is just the UTF-8 encoding of the topic name and in fact will be just the topic name plus ".txt", like today.
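
(As a tiny illustration - not actual TWiki code, and the topic name here is just an example - of what that means in Perl 5.8 terms with the Encode module:)

    use strict;
    use warnings;
    use Encode qw(encode);

    my $topic = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";  # a Cyrillic topic name, as Perl characters
    my $file  = encode("UTF-8", $topic) . ".txt";              # the same name as raw UTF-8 bytes, plus ".txt"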

There may be a requirement to translate to/from 8-bit character sets at the web interface level while using UTF-8 internally - this would enable use of KOI8-R, Shift-JIS, and similar 'legacy' character sets for sites where full browser UTF-8 support can't be assumed. It might even be possible to do this on a per-user level through browser charset negotiation tricks or with authenticated view setups.

-- RichardDonkin - 03 Jul 2003

I'm sorry, there is no public access to our TWiki - the network isn't connected to the Internet at all.

About the InternetExplorer requirement - it's not a requirement, it's just a usability problem. InternetExplorer ignores the page charset when sending a URL, and even if the page displays correctly, to follow a link we must tweak preferences, which can be hard to explain to certain users.

About accented characters in a Russian TWiki - it's hard to imagine. In many Russian 8-bit encodings, such as Win-1251, KOI8-R and CP866, the Russian letters occupy exactly the places used for accented characters, so there is very little possibility for Russian users to mix Russian letters and another language in one document. That's a big and long-standing problem, which is out of our scope, so we can easily skip this case. It only becomes a test case when we need to mix Cyrillic, accented and even ideographic characters in one WikiWord. But of course it's better to have a TWiki without any restrictions.

About UTF-8 in filenames - yes, the filesystem is transparent, but we certainly lose some comfort with system utilities such as ls, etc. And another thing - what about migration from the current KOI8-R based topics to UTF-8? We would certainly need some recoding utility.

One of TWiki's advantages is its use of plain text files and RCS, which can be processed manually or by other tools. As long as there is no support for Unicode in the OS, IMHO we can't use Unicode at the filesystem level.

-- SergeySolyanik - 04 Jul 2003

InternetExplorer is not alone in not specifying a charset when sending a URL - this is how HTTP works: the Content-Type charset only applies to the result of a request, not to the request itself. This is true for GET and POST, including form values sent by POST - there's a link about form filling issues on InternationalisationEnhancements.

French accented characters in a Russian TWiki would be quite possible if the user's browser and TWiki were using UTF-8 - I mentioned this to illustrate how it would be difficult to use an 8-bit (non-UTF-8) charset such as KOI8-R for file names when the main part of TWiki is using UTF-8. Mixing charsets in one WikiWord is unlikely, but I can imagine a single page that references Russian and French WikiWords - e.g. to point to translated versions of the current page. Since these WikiWords could be in the same web, it's important to enable UTF-8 for filenames as well.

Most modern OSs now support UTF-8 or Unicode - e.g. Linux, FreeBSD, Solaris, Windows NT - so in practice I'd expect that ls and similar tools are already able to handle UTF-8 characters. File system support matters only to the level of being able to store UTF-8 characters, and perhaps of returning the number of characters in a filename, but it should already be there.

People who prefer to use a single non-Unicode charset throughout a TWiki site can carry on as they are - the main push for UTF-8 is to avoid changing browser setups and to enable more than one language (with conflicting charsets) in a single TWiki site, but UTF-8 will be 'just another charset' to some degree, with some special characteristics.

-- RichardDonkin - 04 Jul 2003

Good news! There does appear to be an algorithm for detecting whether a URL containing escaped characters is in UTF-8 or ISO-8859-1. Here's a (fair-use) extract from the manual for the IBM HTTP Server v5.3 for OS/390, which does exactly this:

DetectUTF8 - Specify whether to detect and convert UTF-8 encoded characters in URLs

  • Use this directive to enable the automatic detection of escaped UTF-8 characters in URLs. When you set this directive to OFF, the Web server assumes all escaped characters are encoded in ISO8859-1 format. When you set the directive to ON, the Web server attempts to detect whether the escape sequence is in UTF-8 format. If any escape sequence is in UTF-8 format, the Web server translates the UTF-8 character to its EBCDIC (IBM-1047) equivalent. If this UTF-8 character does not map to an IBM-1047 character, processing continues with the UTF-8 escape sequence left untouched. If the Web server does not detect an escape sequence in UTF-8 format, the Web server unconditionally assumes that the first byte is a single ISO8859-1 character, and translates the character to IBM-1047 format. The Web server then processes the next escape sequence, performing the same steps as it did on the previous sequence.
    • Note: The Web server can handle a URL that has a mix of escaped UTF-8 characters, escaped ISO8859-1 characters, and unescaped ISO8859-1 characters.

What's important is that the web server looks at each escape sequence (e.g. %E5%F7), trying to interpret one or more bytes as a UTF-8 encoded character - if it fails, it then interprets the first byte in the sequence as ISO-8859-1 and tries again. (The IBM-1047 charset is EBCDIC-based - see RewritingUrlsWithEscapedCharsUnderOs390 for details, but actually that part doesn't matter here, it is just used as an extra validity test.)

I think the key to this solution is limiting the number of charsets considered at any one time - if the TWiki server is configured for (say) ISO-8859-1 and UTF-8, this certainly works. Some combinations, e.g. KOI8-R and UTF-8, may be impossible to support easily, in which case the user must configure their browser and TWiki server appropriately to use (or avoid) UTF-8 URLs - however, many users could just use UTF-8 URLs within the browser and have these translated automagically into the ISO-8859-* type charsets.

Strange that a TWikiOnMainframe web server provided this solution, but it just goes to show that weird and wonderful platforms all help to improve TWiki...

Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:

Some more investigation is needed, but a simple URL charset detector, choosing between one single-byte charset (e.g. ISO-8859-1 or KOI8-R) and UTF-8, should be quite feasible in terms of code size and run time. Despite all the detectors out there, I think the IBM algorithm may be a good choice, and it has the merit of being pre-tested by a large user population, at least for ISO-8859-1 - KOI8-R etc would require more testing.
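
As a rough sketch of what such a detector might look like in Perl - a simplification of the IBM approach (it checks the whole unescaped string rather than falling back escape sequence by escape sequence), and not tested code:

    use strict;
    use warnings;

    # Decode %XX escapes into raw bytes.
    sub url_unescape {
        my ($s) = @_;
        $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        return $s;
    }

    # True if the byte string is well-formed UTF-8 (no overlong forms, no surrogates).
    sub looks_like_utf8 {
        my ($bytes) = @_;
        return $bytes =~ /\A(?:
              [\x00-\x7F]                        # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
          )*\z/x;
    }

    my $bytes = url_unescape(defined $ARGV[0] ? $ARGV[0] : "");
    print looks_like_utf8($bytes)
        ? "decode URL as UTF-8\n"
        : "decode URL as the single-byte site charset (e.g. ISO-8859-1 or KOI8-R)\n";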

-- RichardDonkin - 03 Sep 2003

See ProposedUTF8SupportForI18N for my thinking on phases of UTF-8 support for TWiki.

Here are some resources on Perl 5.8's Unicode support:

Other resources:

-- RichardDonkin - 08 Sep 2003

If anybody looks at the Ken Lunde presentation (Perl and Multi-byte Characters), I have a question: on page 4 he has an example, "full to half width katakana", and below it he talks about three characters becoming two. As a novice to character sets, encodings, glyphs, etc.: does he really mean "characters" or should he have said "glyphs"? (Or, in this case, are the terms interchangeable?)

-- RandyKramer - 17 Sep 2003

I'm not sure, but I believe he is using the terminology correctly, assuming that there are three character set codepoints in the initial string and two in the final string. The actual character set encoding is likely to use multiple bytes per character since this is Japanese. As I understand it, glyphs are the particular shape of a character on the screen, i.e. controlled by a combination of the character set and the font and particular language-specific rendering logic (e.g. in Arabic, a single character (codepoint) can be rendered as many different glyphs depending on adjacent characters).

For a good introduction to some of these concepts, see Amazon's page on the Unicode Demystified book and choose 'Look Inside' to read an excerpt. (I have this book on order along with Ken Lunde's Google:CJKV book.)

I'm still researching the best approach for all this, but am now convinced that UTF-8 is the most sensible storage and processing format for TWiki, with conversion to and from legacy character sets as needed (see ProposedUTF8SupportForI18N, Phase 2). In particular, I'm now aware of just how hard it would be to fully support ISO-2022-JP as a native character encoding format, so we won't be doing that, or any other escape-sequence based encodings, ISO-2022-* or other. There may be a need for some minor tweaks to TWiki rendering to HTML when that's the browser character encoding, but they are just a few lines - the full conversion will be done by CPAN modules.
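
(For the flavour of it, here is a minimal sketch - not actual TWiki code, with the charset name just an example - of that kind of conversion using the Encode module shipped with Perl 5.8:)

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $browser_charset = "shiftjis";   # legacy charset negotiated with the browser

    # Incoming bytes from the browser -> Perl characters -> UTF-8 bytes for storage.
    sub to_storage {
        my ($bytes) = @_;
        return encode("UTF-8", decode($browser_charset, $bytes));
    }

    # UTF-8 bytes from storage -> Perl characters -> legacy bytes for the HTML response.
    sub to_browser {
        my ($utf8_bytes) = @_;
        return encode($browser_charset, decode("UTF-8", $utf8_bytes));
    }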

MHonArc (a mail archive tool) has already implemented this in GPLed Perl, which is a useful reference point since it has good support for ISO-2022-JP and UTF-8 - see their TEXTENCODING feature (converts incoming charset to storage format, usually UTF-8) and CHARSETCONVERTERS feature (renders storage format into HTML, with minor changes depending on character set).

CPAN:MHonArc::Encode is quite interesting here, even if we don't use it directly - it wraps CPAN:Encode (included in Perl 5.8) and CPAN:Unicode::MapUTF8 (available on CPAN, requires Perl 5.6). Mail archive tools have some internal similarities to Wikis - both render essentially plain text as HTML, with some markup (e.g. mailto: and http: links). Mail archive tools have to deal with a much wider range of input character encodings, since the encoding of an email is up to the sender, and email lists are frequently used internationally, so MHonArc at least is ahead in this area.

-- RichardDonkin - 18 Sep 2003

Thanks, Richard!

-- RandyKramer - 18 Sep 2003

I have just read most of the Google:CJKV book and the half-width vs full-width katakana question is clearer - these are separate codepoints in a coded character set, including in Unicode for compatibility (round-trip conversions). However, if you were starting from scratch you might well say there should just be a single codepoint for each kana, i.e. the half-width and full-width forms are really different renderings (glyphs) of the same character.

Joel Spolsky has written a great article on Unicode - well worth reading as an entertaining and brief introduction to the essentials of Unicode.

-- RichardDonkin - 14 Oct 2003

NathanOllerenshaw's observation on 30 Jun 2003 about weirdness with Shift_JIS now appears to be due to this character set not being ASCII-safe - see JapaneseAndChineseSupport for more details. Shift_JIS and other unsafe character sets are excluded from use with DakarRelease and later.

-- RichardDonkin - 30 Nov 2004
