Encode URLs with UTF-8
This page addresses UTF-8 support for URLs only. The overall plan for UTF-8 support for TWiki is described in
ProposedUTF8SupportForI18N.
Official documentation for this feature has moved to TWiki.AppendixEncodeURLsWithUTF8.
Older discussion to be refactored - see also EncodeURLsWithUTF8Discuss
Feature Request: Enable browser encoding of UTF-8 in URLs
URLs are not allowed to contain non-ASCII (8th bit set) characters:
http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
The URLs in
WikiWords, BASE-tags and the variables REVISIONS, SPACEDTOPICS, TOPICMOVED and probably others should therefore be (1) converted by the browser from the page character encoding to UTF-8 and then (2) URL encoded by the browser. The web server should decode the URL-encoding as normal and pass a UTF-8 string to the TWiki code.
Hence, TWiki should avoid generating URLs that use URL-encoding (or if it does, should encode them to UTF-8 before URL encoding) and should accept UTF-8 URLs in requests.
This will enable browsers that UTF-8 encode URLs by default to use TWiki sites without reconfiguration.
Test case
Sandbox.PommesDeTerreGratinée (with browser UTF-8 encoding of URLs turned on)
- This is encoded as ISO-8859-1 in this page (since TWiki.org runs with this charset at present), and dynamically encoded as UTF-8 by many browsers as long as this encoding option is not disabled by the user.
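To make the difference concrete, here is a minimal sketch of what the two kinds of browser would put on the wire for the test topic. It uses Python's standard `urllib.parse.quote` purely for illustration; it is not TWiki code, and the variable names are hypothetical.

```python
from urllib.parse import quote

topic = "PommesDeTerreGratinée"

# A browser that keeps the page charset (ISO-8859-1) for the URL
# sends the e-acute as a single URL-encoded byte.
latin1_url = quote(topic, encoding="iso-8859-1")

# A browser that converts the URL to UTF-8 first sends two URL-encoded bytes.
utf8_url = quote(topic, encoding="utf-8")

print(latin1_url)  # PommesDeTerreGratin%E9e
print(utf8_url)    # PommesDeTerreGratin%C3%A9e
```

The server therefore sees two different byte sequences for the same topic name, which is why TWiki has to accept both.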
Environment
--
KimHansen - 13 Feb 2003
Follow up
This is not a bug as such, but a FeatureEnhancementRequest: the aim is to avoid having to reconfigure browsers so that they stop encoding URLs in UTF-8. Also requested in EncodingUtf8 and NonEnglishTopicName, as well as being referenced in TWikiMiniWebServer, and discussed in InternationalisationUTF8.
--
RichardDonkin - 14 Feb 2003
It doesn't work with MS-IE in its default setup, and we cannot ask our users to disable UTF-8 encoding of URIs, because we are going to use TWiki as a CMS for our public website.
--
KimHansen - 15 Feb 2003
I have attached two patches; they fix
WikiWords, %REVISIONS%, %SPACEDTOPIC% and %META{moved}%. There are probably other URIs that need fixing.
--
KimHansen - 15 Feb 2003
Thanks for the patches - the good thing about fixing this issue is that it should enable UTF8-supporting browsers to use TWiki
I18N without having to be reconfigured to not use UTF8 in URLs. However, significant testing with various browsers is important to avoid introducing new problems.
The patches as supplied would require that TWiki uses UTF-8 for all filenames, which may well cause other I18N features such as sorting of topic names to fail. The rest of this page discusses a more complete solution that uses UTF-8 only for inbound URLs, and a non-UTF-8 character encoding (e.g. ISO-8859-1) for all other purposes within TWiki.
--
RichardDonkin - 16 Feb 2003
Solution Outline
Many modern browsers, e.g.
InternetExplorer 5.5+,
OperaBrowser (7.x+?) or
MozillaBrowser (for POST URLs), translate the URLs from the page's character set (e.g. ISO-8859-1) to UTF-8 and then URL-encode them. This is consistent with
W3C's
Recommendation in the HTML 4.0 specification (as well as their
URL encoding programs page) which recommends that URLs should be URL-encoded as UTF-8.
Some browsers do perform this translation, and it does avoid problems with
TWikiOnMainframe using
I18N, as long as the web server supports UTF-8 URLs. As discussed in
EncodeURLsWithUTF8Discuss, it's important to be able to dynamically detect at least ISO-8859-1 vs. UTF-8 encoded URLs.
Some browsers
only use UTF-8 encoding of URLs (e.g.
OperaBrowser on the
SonyEricsson P800
SmartPhone) - others are incapable of such encoding and just send the URL in the page's character encoding (e.g. ISO-8859-1). So it's important that this is configurable, or even better auto-detects the character encoding used for the URLs where possible.
Alpha and beta implementation
The overall plan for UTF-8 support for TWiki is described in two phases in
ProposedUTF8SupportForI18N - this page addresses the first phase, in which UTF-8 is supported for URLs only.
UTF-8 URL translation to ISO-8859-1 is now implemented in
TWikiAlphaRelease (and
TWikiBetaRelease of 18 Dec 2003 or later) - the code now automatically detects whether a URL is UTF-8 or not, taking care to avoid over-long and illegal UTF-8 encodings that could introduce
security holes (tested against a comprehensive
UTF-8 test file, which IE5.5 fails quite dangerously, and
OperaBrowser passes). Any non-ASCII URLs that are
not valid UTF-8 are then assumed to be directly URL-encoded as a single-byte or multi-byte character set (as now), e.g. EUC-JP.
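The detection step described above can be sketched roughly as follows. This is a Python illustration, not the Perl code in TWikiAlphaRelease; the function name `decode_request_path` and its return convention are assumptions made for the example. It relies on Python's strict UTF-8 decoder, which rejects over-long and otherwise illegal sequences.

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(path, site_charset="iso-8859-1"):
    """Return (text, detected_encoding) for a URL-encoded request path.

    Hypothetical helper: strictly valid UTF-8 is treated as UTF-8,
    anything else is assumed to be in the site charset already.
    """
    raw = unquote_to_bytes(path)

    # Pure ASCII needs no detection at all.
    if all(byte < 0x80 for byte in raw):
        return raw.decode("ascii"), "us-ascii"

    try:
        # Strict decoding fails on over-long forms such as C0 AF
        # (an illegal two-byte spelling of "/") and on truncated sequences.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the browser sent the site charset directly.
        return raw.decode(site_charset, errors="replace"), site_charset


print(decode_request_path("/bin/view/Sandbox/PommesDeTerreGratin%C3%A9e"))
print(decode_request_path("/bin/view/Sandbox/PommesDeTerreGratin%E9e"))
```

Both calls yield the same topic name, with `utf-8` and `iso-8859-1` reported as the detected encodings. Detection of this kind is a heuristic: a site-charset string whose bytes happen to form valid UTF-8 would be misread, although such strings are rare in real topic names.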
The main point is that you can use TWiki with international characters in
WikiWords without changing your browser setup from the default, as long as your site uses ISO-8859-1, and you can also still use TWiki using non-UTF-8 URLs. This works on any Perl version from 5.005_03 onwards and corresponds to Phase 1 of
ProposedUTF8SupportForI18N. You can have different users using different URL formats transparently, as long as your site is using ISO-8859-1.
UTF-8 URLs are automatically converted to ISO-8859-1 if that's your current
$siteCharset
(from the
TWiki.cfg
locale setting). If not, nothing is done to the URL characters - in a future update to TWiki, I'll convert such URLs into any character encoding using suitable
CPAN modules, but that will create an extra installation requirement for such sites, or require Perl 5.8 for its built-in
CPAN:Encode (5.8 is a good idea anyway for Unicode support).
I've tested this a bit in both modes using Opera and IE - testing with Mozilla and friends would be very helpful, particularly with disabled INTURLENCODE (see
TWikiOnMainframe for links, requires some TWiki.pm changes as in related patches). You can try this out without upgrading TWiki at
http://donkin.org/bin/view/Test/TestTopic5.
Issues and Things To Do
- Attachments to topics with I18N names now work OK and avoid the viewfile redirect for performance reasons.
- TWiki generates the whole page in the site charset, e.g. ISO-8859-1, but the browser dynamically UTF-8 encodes the attachment's URL when it's used. Since Apache serves attachment downloads without TWiki being involved, TWiki's code can't do its UTF-8 decoding trick. TWiki therefore URL-encodes such URLs in the site charset (e.g. ISO-8859-1) when generating the page. This bypasses the browser's own UTF-8 encoding and ensures that the URLs and filenames seen by Apache remain in the site charset.
- TWikiOnMainframe uses EBCDIC web servers that typically translate their output to ASCII, UTF-8 or ISO-8859-1 (and URLs in the other direction) since there are so few EBCDIC web browsers. Such web servers don't work with even ISO-8859-1 URLs if they are URL encoded, since the automated translation is bypassed for URL-encoded characters.
- For TWikiOnMainframe, TWiki assumes that the web server will automatically translate UTF-8 URLs into EBCDIC URLs, as long as URL encoding is turned off in TWiki pages.
- OperaBrowser on the P800 smartphone is working for page viewing but leads to corrupt page names in a new way when editing pages.
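The attachment workaround in the list above (pre-encoding the link in the site charset so that the browser has nothing left to re-encode) might look like this in outline. It is a Python sketch under assumptions: the helper name `attachment_url` is hypothetical, and the `/pub/Web/Topic/file` layout is assumed here for illustration rather than taken from this page.

```python
from urllib.parse import quote, unquote_to_bytes


def attachment_url(web, topic, filename, site_charset="iso-8859-1"):
    """Build an attachment link whose non-ASCII bytes are already URL-encoded.

    Hypothetical helper: because every non-ASCII character is emitted as
    %XX in the site charset, the browser passes the URL through unchanged.
    """
    parts = [quote(part, safe="", encoding=site_charset)
             for part in (web, topic, filename)]
    return "/pub/" + "/".join(parts)


url = attachment_url("Sandbox", "PommesDeTerreGratinée", "été.txt")
print(url)  # /pub/Sandbox/PommesDeTerreGratin%E9e/%E9t%E9.txt

# After the web server undoes the URL-encoding, the path bytes are in the
# site charset, matching the filename on disk.
print(unquote_to_bytes(url))
```

This only covers the ordinary ASCII-based server case; as noted above, it is exactly what must be avoided on TWikiOnMainframe, where URL-encoded characters bypass the server's EBCDIC translation.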
Updates
Recently requested at
HowToDealWithGermanUmlaute. This feature is included in the latest
TWikiBetaRelease, leading up to
CairoRelease.
ChristianKohl has been testing this feature for some weeks (in ISO-8859-1 mode) and hasn't found any issues.
I've also figured out how to get
TWikiOnMainframe EBCDIC web servers working with browsers that do UTF-8 URL encoding of attachments (see above). This may not get done until Phase 2 of
ProposedUTF8SupportForI18N, though.
I've revamped this page to be closer to
DocumentMode. Comments or questions are very welcome as always!
--
RichardDonkin - 7 Jan 2004
- I have working code that supports UTF-8 URLs more fully, converting to virtually any character set including KOI8-R, EUC-JP, and so on. ISO-2022-* and HZ-* are now specifically excluded from use as the site character set, since they interfere with
TWikiML, and it's now possible to override the site character set defined in the
$siteLocale
setting. This feature is therefore nearly complete.
See
ProposedUTF8SupportForI18N for more information, and grab a copy of the code from
TWikiAlphaRelease and test it to destruction!
--
RichardDonkin - 20 Jan 2004
This feature is now complete apart from docs - code for attachment support is now in
TWikiAlphaRelease, with improved performance. This should also work with
TWikiOnMainframe. Tested with IE 5.5, Opera 7.11 and Mozilla (Firebird 0.7).
TWiki is also beginning to work in full UTF-8 mode (i.e. all content is UTF-8) - this is phase 2 of
ProposedUTF8SupportForI18N, but there's probably a lot more work. Encoding of
XML entities for RSS feeds can also make use of the Unicode conversion modules now used for URLs.
--
RichardDonkin - 08 Feb 2004
Refactored - discussion from ChristianKohl about why this is needed, from InternationalisationEnhancements.
I've implemented another workaround with the stable release, for Greek (ISO-8859-7)
WikiWords, and it has been working unexpectedly well: I've configured Apache to convert URLs to iso-8859-7 if they're in UTF8. You do this with the following configuration commands:
RewriteEngine on
[ If you have any other rewrite rules, put them here ]
# Important: following rule must be last on the list, otherwise it may impede
# the other rules from executing
RewriteMap uni2iso prg:/usr/local/bin/utf2iso8859-7
RewriteRule (.*) ${uni2iso:$1} [PT]
and you must have a script, /usr/local/bin/utf2iso8859-7, to do the conversion.
Here's the script I'm using, which is in Python. If you make any changes to the script other than really trivial ones, or if you rewrite it in Perl, make sure you read Apache's RewriteMap documentation carefully.
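For readers without access to the attachment, the following is a minimal sketch of what a RewriteMap `prg:` program of this kind could look like. It is not the original script. It assumes that Apache hands over the already URL-decoded path as one newline-terminated key per request and expects exactly one newline-terminated answer, unbuffered, which is how the RewriteMap documentation describes external programs. The names `utf8_to_target` and `serve` are hypothetical.

```python
TARGET_CHARSET = "iso-8859-7"  # Greek site charset from the example above


def utf8_to_target(key, target=TARGET_CHARSET):
    """Convert a UTF-8 path (bytes) to the target charset.

    Keys that are not valid UTF-8, or that cannot be represented in the
    target charset, are returned unchanged so that URLs already in the
    site charset keep working.
    """
    try:
        return key.decode("utf-8").encode(target)
    except UnicodeError:
        return key


def serve(stdin, stdout):
    """Answer one lookup per input line until Apache closes the pipe."""
    for line in iter(stdin.readline, b""):
        key = line.rstrip(b"\r\n")
        stdout.write(utf8_to_target(key) + b"\n")
        # Flush after every answer: a buffered reply would stall the
        # request that is waiting for it.
        stdout.flush()
```

To use it as the map program, the file would end by importing `sys` and calling `serve(sys.stdin.buffer, sys.stdout.buffer)`. Working on bytes throughout avoids any dependence on the locale of the Apache process.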
--
AntoniosChristofides - 04 Mar 2004
Thanks for posting this - it's interesting to see how this can be done with rewrite rules, and this may be useful for people who have Apache admin access. There is also an Apache 2.0 module,
mod_fileiri, that does the same thing - see
this presentation and discussion on
ProposedUTF8SupportForI18N.
However, since your solution requires Apache admin rights and involves loading the Python interpreter and its
encode
module, it is less widely applicable than the
TWikiBetaRelease approach outlined in
EncodeURLsWithUTF8, which is done in Perl and can benefit from
ModPerl to avoid the overhead of re-compiling the
CPAN conversion module(s). The TWiki beta code is quite stable so there's no reason not to use it for sites that need UTF-8 URLs - if Perl 5.8 is used there's no need to install any
CPAN modules.
You might want to try upgrading your site to the
TWikiBetaRelease (not too hard if you just upgrade the scripts and then modify
TWikiPreferences for new variable names - there's already a
CairoRelease upgrade guide on the
TWiki web).
--
RichardDonkin - 04 Mar 2004
MattWilkie has offered to help out clearing up the docs for Cairo.
--
CrawfordCurrie - 02 Jul 2004
Thanks for the cleanup Richard! I blockquoted the "official docs" part of the topic, upped the
DocProgress to 95%, and changed classification to
FeatureDone. Please revert if I'm wrong.
--
MattWilkie - 27 Jul 2004
Fixed the missing quote on the blockquote element, which was gobbling the first paragraph.
Also did various other minor updates.
--
RichardDonkin - 28 Jul 2004
Created an
AppendixEncodeURLsWithUTF8 to hold the official docs for this feature. I don't know if that is the best place for these docs. Move/Rename as makes sense.
--
MattWilkie - 10 Aug 2004