
Encode URLs with UTF-8

This page addresses UTF-8 support for URLs only. The overall plan for UTF-8 support for TWiki is described in ProposedUTF8SupportForI18N.

Official documentation for this feature has moved to TWiki.AppendixEncodeURLsWithUTF8.



Older discussion to be refactored - see also EncodeURLsWithUTF8Discuss

Feature Request: Enable browser encoding of UTF-8 in URLs

URLs are not allowed to contain non-ASCII (8th bit set) characters: http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars

The URLs in WikiWords, BASE-tags and the variables REVISIONS, SPACEDTOPIC, TOPICMOVED and probably others should therefore be (1) converted by the browser from the page character encoding to UTF-8 and then (2) URL-encoded by the browser. The web server should decode the URL-encoding as normal and pass a UTF-8 string to the TWiki code.

Hence, TWiki should avoid generating URLs that use URL-encoding (or if it does, should encode them to UTF-8 before URL encoding) and should accept UTF-8 URLs in requests.

This will enable browsers that UTF-8 encode URLs by default to use TWiki sites without reconfiguration.

Test case

Sandbox.PommesDeTerreGratinée (with browser UTF-8 encoding of URLs turned on)

  • This is encoded as ISO-8859-1 in this page (since TWiki.org runs with this charset at present), and dynamically encoded as UTF-8 by many browsers as long as this encoding option is not disabled by the user.
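As a rough illustration of the two URL forms described above, the following Python sketch URL-encodes the test topic name once as UTF-8 (what a UTF-8-encoding browser would send) and once as ISO-8859-1 (the page charset). Python is used here only for illustration; TWiki itself is written in Perl.

```python
from urllib.parse import quote

topic = "PommesDeTerreGratinée"

# What a browser that UTF-8 encodes URLs would send: "é" becomes two bytes.
utf8_url = quote(topic, encoding="utf-8")

# What a browser that keeps the page charset would send: "é" is one byte.
latin1_url = quote(topic, encoding="iso-8859-1")

print(utf8_url)    # PommesDeTerreGratin%C3%A9e
print(latin1_url)  # PommesDeTerreGratin%E9e
```

The server sees different bytes for the same topic name depending on the browser, which is why the URL encoding has to be either configured or detected.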

Environment

TWiki version: TWikiRelease01Feb2003

-- KimHansen - 13 Feb 2003

Follow up

This is not a bug as such, but a FeatureEnhancementRequest that aims to avoid having to reconfigure browsers so that they stop UTF-8 encoding URLs. Also requested in EncodingUtf8 and NonEnglishTopicName, as well as being referenced in TWikiMiniWebServer, and discussed in InternationalisationUTF8.

-- RichardDonkin - 14 Feb 2003

It doesn't work with MS-IE in its default setup, and we cannot ask our users to disable UTF-8 encoding of URIs, because we are going to use TWiki as a CMS for our public website.

-- KimHansen - 15 Feb 2003

I have attached two patches that fix WikiWords, %REVISIONS%, %SPACEDTOPIC% and %META{moved}%. There are probably other URIs that need fixing.

-- KimHansen - 15 Feb 2003

Thanks for the patches - the good thing about fixing this issue is that it should enable UTF8-supporting browsers to use TWiki I18N without having to be reconfigured to not use UTF8 in URLs. However, significant testing with various browsers is important to avoid introducing new problems.

The patches as supplied would require that TWiki uses UTF-8 for all filenames, which may well cause other I18N features such as sorting of topic names to fail. The rest of this page discusses a more complete solution that uses UTF-8 only for inbound URLs, and a non-UTF-8 character encoding (e.g. ISO-8859-1) for all other purposes within TWiki.

-- RichardDonkin - 16 Feb 2003

Solution Outline

Many modern browsers, e.g. InternetExplorer 5.5+, OperaBrowser (7.x+?) or MozillaBrowser (for POST URLs), translate the URLs from the page's character set (e.g. ISO-8859-1) to UTF-8 and then URL-encode them. This is consistent with W3C's Recommendation in the HTML 4.0 specification (as well as their URL encoding programs page) which recommends that URLs should be URL-encoded as UTF-8.

Some browsers do perform this translation, and it does avoid problems with TWikiOnMainframe using I18N, as long as the web server supports UTF-8 URLs. As discussed in EncodeURLsWithUTF8Discuss, it's important to be able to dynamically detect at least ISO-8859-1 vs. UTF-8 encoded URLs.

Some browsers only use UTF-8 encoding of URLs (e.g. OperaBrowser on the SonyEricsson P800 SmartPhone) - others are incapable of such encoding and just send the URL in the page's character encoding (e.g. ISO-8859-1). So it's important that this is configurable, or even better auto-detects the character encoding used for the URLs where possible.

Alpha and beta implementation

The overall plan for UTF-8 support for TWiki is described in two phases in ProposedUTF8SupportForI18N - this page addresses the first phase, in which UTF-8 is supported for URLs only.

UTF-8 URL translation to ISO-8859-1 is now implemented in TWikiAlphaRelease (and TWikiBetaRelease of 18 Dec 2003 or later) - the code now automatically detects whether a URL is UTF-8 or not, taking care to avoid over-long and illegal UTF-8 encodings that could introduce security holes (tested against a comprehensive UTF-8 test file, which IE5.5 fails quite dangerously, and OperaBrowser passes). Any non-ASCII URLs that are not valid UTF-8 are then assumed to be directly URL-encoded as a single-byte or multi-byte character set (as now), e.g. EUC-JP.
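The detection step described above can be sketched roughly as follows. This is not the TWiki code (which is Perl); it is a minimal Python illustration that relies on Python's strict UTF-8 decoder to reject over-long and illegal sequences. The function name `decode_url_path` and the default site charset are assumptions made for the example.

```python
from urllib.parse import unquote_to_bytes


def decode_url_path(raw_path, site_charset="iso-8859-1"):
    """Return (text, detected_encoding) for a URL-encoded path.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    raw = unquote_to_bytes(raw_path)
    if all(byte < 0x80 for byte in raw):
        # Pure ASCII is the same in every charset considered here.
        return raw.decode("ascii"), "ascii"
    try:
        # Strict decoding rejects over-long forms such as C0 AF and other
        # illegal sequences, so they are never accepted as UTF-8.
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the bytes are already in the site charset.
        return raw.decode(site_charset, errors="replace"), site_charset
```

For instance, `decode_url_path("Gratin%C3%A9e")` and `decode_url_path("Gratin%E9e")` both yield the text `Gratinée`, detected as UTF-8 and ISO-8859-1 respectively, while the over-long sequence `%C0%AF` is never decoded to `/`.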

The main point is that you can use TWiki with international characters in WikiWords without changing your browser setup from the default, as long as your site uses ISO-8859-1, and you can also still use TWiki using non-UTF-8 URLs. This works on any Perl version from 5.005_03 onwards and corresponds to Phase 1 of ProposedUTF8SupportForI18N. You can have different users using different URL formats transparently, as long as your site is using ISO-8859-1.

UTF-8 URLs are automatically converted to ISO-8859-1 if that's your current $siteCharset (from the TWiki.cfg locale setting). If not, nothing is done to the URL characters - in a future update to TWiki, I'll convert such URLs into any character encoding using suitable CPAN modules, but that will create an extra installation requirement for such sites, or require Perl 5.8 for its built-in CPAN:Encode (5.8 is a good idea anyway for Unicode support).
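The conversion to the site charset, including the planned extension to charsets other than ISO-8859-1, might look something like the sketch below. It uses Python's built-in codecs rather than the Perl modules mentioned above, and the helper name `utf8_to_site_charset` is hypothetical. When the bytes are not UTF-8, or the characters cannot be represented in the site charset, the URL bytes are left untouched, mirroring the "nothing is done" behaviour.

```python
def utf8_to_site_charset(url_bytes, site_charset):
    """Re-encode UTF-8 URL bytes in the site charset, or return them unchanged.

    Hypothetical helper for illustration; not part of TWiki.
    """
    try:
        return url_bytes.decode("utf-8").encode(site_charset)
    except (UnicodeDecodeError, UnicodeEncodeError, LookupError):
        # Not UTF-8, not representable in the site charset, or an unknown
        # charset name: leave the URL bytes as they arrived.
        return url_bytes


print(utf8_to_site_charset("Gratinée".encode("utf-8"), "iso-8859-1"))
# b'Gratin\xe9e'
```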

I've tested this a bit in both modes using Opera and IE - testing with Mozilla and friends would be very helpful, particularly with disabled INTURLENCODE (see TWikiOnMainframe for links, requires some TWiki.pm changes as in related patches). You can try this out without upgrading TWiki at http://donkin.org/bin/view/Test/TestTopic5.

Issues and Things To Do

  • UPDATED: Attachments to topics with I18N names now work OK and avoid the viewfile redirect for performance reasons.
    • TWiki generates the whole page in the site charset, e.g. ISO-8859-1, but the browser dynamically UTF-8 encodes the attachment's URL when it's used. Since Apache serves attachment downloads without TWiki being involved, TWiki's code can't do its UTF-8 decoding trick. Instead, TWiki URL-encodes such URLs in the site charset (ISO-8859-1 or whatever it is) when generating the page. This bypasses the browser's own URL encoding, ensuring that the URLs and filenames seen by Apache remain in the site charset.
  • TWikiOnMainframe uses EBCDIC web servers that typically translate their output to ASCII, UTF-8 or ISO-8859-1 (and URLs in the other direction) since there are so few EBCDIC web browsers. Such web servers don't work with even ISO-8859-1 URLs if they are URL encoded, since the automated translation is bypassed for URL-encoded characters.
    • UPDATED: For TWikiOnMainframe, TWiki assumes that the web server will automatically translate UTF-8 URLs into EBCDIC URLs, as long as URL encoding is turned off in TWiki pages.
  • OperaBrowser on the P800 smartphone is working for page viewing but leads to corrupt page names in a new way when editing pages.
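The attachment workaround in the first item above amounts to pre-encoding the URL in the site charset at page generation time, so the browser has nothing left to re-encode. A minimal Python sketch follows; the function `attachment_url`, the `/pub` prefix and the file name are assumptions for illustration, not TWiki's actual code.

```python
from urllib.parse import quote


def attachment_url(pub_url, web, topic, filename, site_charset="iso-8859-1"):
    """Build an attachment URL that is already URL-encoded in the site charset.

    Hypothetical helper: the path layout (pub_url/web/topic/filename) is an
    assumption made for this example.
    """
    # safe="" encodes every reserved character inside each path segment.
    # quote() raises UnicodeEncodeError if a name cannot be represented
    # in the site charset.
    parts = [quote(part, safe="", encoding=site_charset)
             for part in (web, topic, filename)]
    return pub_url.rstrip("/") + "/" + "/".join(parts)


print(attachment_url("/pub", "Sandbox", "PommesDeTerreGratinée", "menu.txt"))
# /pub/Sandbox/PommesDeTerreGratin%E9e/menu.txt
```

Because the emitted URL is pure ASCII, a browser that UTF-8 encodes URLs sends it unchanged, and Apache decodes `%E9` to the single ISO-8859-1 byte used in the filename on disk.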

Updates

Recently requested at HowToDealWithGermanUmlaute. This feature is included in the latest TWikiBetaRelease, leading up to CairoRelease. ChristianKohl has been testing this feature for some weeks (in ISO-8859-1 mode) and hasn't found any issues.

I've also figured out how to get TWikiOnMainframe EBCDIC web servers working with browsers that do UTF-8 URL encoding of attachments (see above). This may not get done until Phase 2 of ProposedUTF8SupportForI18N, though.

I've revamped this page to be closer to DocumentMode. Comments or questions are very welcome as always!

-- RichardDonkin - 7 Jan 2004

NEW - I have working code that supports UTF-8 URLs more fully, converting to virtually any character set including KOI8-R, EUC-JP, and so on. ISO-2022-* and HZ-* are now specifically excluded from use as the site character set, since they interfere with TWikiML, and it's now possible to override the site character set defined in the $siteLocale setting. This feature is therefore nearly complete.
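The charset selection rules described above (derive the charset from the locale, allow an override, refuse ISO-2022-* and HZ-*) could be sketched as below. This is an illustrative Python version, not the TWiki implementation; the function name `site_charset`, its arguments and the error handling are assumptions.

```python
def site_charset(site_locale, override=None):
    """Pick the site character set from a locale name or an explicit override.

    Hypothetical helper: names and error handling are illustrative only.
    """
    if override:
        charset = override
    else:
        # Locale names look like "de_DE.ISO-8859-1" or "ru_RU.KOI8-R@variant".
        _, _, codeset = site_locale.partition(".")
        charset = codeset.split("@", 1)[0]
    if not charset:
        raise ValueError("no character set in locale %r" % site_locale)

    normalized = charset.lower()
    # ISO-2022-* and HZ-* are excluded as site character sets.
    if normalized.startswith(("iso-2022", "iso2022", "hz")):
        raise ValueError("%s cannot be used as the site charset" % charset)
    return normalized


print(site_charset("de_DE.ISO-8859-1"))                    # iso-8859-1
print(site_charset("en_US.ISO-8859-1", override="KOI8-R")) # koi8-r
```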

See ProposedUTF8SupportForI18N for more information, and grab a copy of the code from TWikiAlphaRelease and test it to destruction!

-- RichardDonkin - 20 Jan 2004

This feature is now complete apart from docs - code for attachment support is now in TWikiAlphaRelease, with improved performance. This should also work with TWikiOnMainframe. Tested with IE 5.5, Opera 7.11 and Mozilla (Firebird 0.7).

TWiki is also beginning to work in full UTF-8 mode (i.e. all content is UTF-8) - this is phase 2 of ProposedUTF8SupportForI18N, but there's probably a lot more work. Encoding of XML entities for RSS feeds can also make use of the Unicode conversion modules now used for URLs.
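As a small illustration of the RSS point, text held in the site charset can be turned into ASCII-only XML by escaping markup characters and emitting numeric character references for everything else. This Python sketch is an assumption about one way to do it, not the TWiki code; `to_xml_ascii` is a hypothetical name.

```python
from xml.sax.saxutils import escape


def to_xml_ascii(site_bytes, site_charset="iso-8859-1"):
    """Convert site-charset bytes to ASCII-only XML text (hypothetical helper)."""
    text = site_bytes.decode(site_charset)
    # escape() handles &, < and >; xmlcharrefreplace turns every non-ASCII
    # character into a numeric reference such as &#233;.
    return escape(text).encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_xml_ascii(b"Gratin\xe9e & more"))  # Gratin&#233;e &amp; more
```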

-- RichardDonkin - 08 Feb 2004


Refactored - discussion from ChristianKohl about why this is needed, from InternationalisationEnhancements.

I've implemented another workaround with the stable release, for Greek (ISO-8859-7) WikiWords, and it has been working unexpectedly well: I've configured Apache to convert URLs to iso-8859-7 if they're in UTF8. You do this with the following configuration commands:

  RewriteEngine on
  # If you have any other rewrite rules, put them here
  # Important: following rule must be last on the list, otherwise it may impede
  # the other rules from executing
  RewriteMap uni2iso prg:/usr/local/bin/utf2iso8859-7
  RewriteRule (.*) ${uni2iso:$1} [PT]

and you must have a script, /usr/local/bin/utf2iso8859-7, to do the conversion. Here's the script I'm using, which is in Python. If you make any changes to the script other than really trivial, or if you rewrite it in Perl, make sure you read Apache's RewriteMap documentation carefully.
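The attached script is not reproduced here. For readers who want a starting point, the following is a minimal sketch of what such a RewriteMap program could look like, assuming Python 3 and assuming that keys arrive as raw bytes, one per line. The function names are hypothetical. Apache's `prg:` map type expects one newline-terminated reply per newline-terminated key, and the reply must be flushed immediately or the server will wait forever.

```python
def convert_line(line, target="iso-8859-7"):
    """Convert one lookup key from UTF-8 to the target charset if possible."""
    key = line.rstrip(b"\r\n")
    try:
        return key.decode("utf-8").encode(target)
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Already in the target charset, or not convertible: pass it through.
        return key


def serve(inp, out):
    """Answer one lookup per input line, flushing after every reply."""
    for line in iter(inp.readline, b""):
        out.write(convert_line(line) + b"\n")
        out.flush()  # unbuffered replies are essential for RewriteMap programs


# As a RewriteMap program, the script would finish with:
#     import sys
#     serve(sys.stdin.buffer, sys.stdout.buffer)
```

URLs that are already in ISO-8859-7 are not valid UTF-8 in practice, so they fall through the `except` branch unchanged, which lets both kinds of browser use the same site.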

-- AntoniosChristofides - 04 Mar 2004

Thanks for posting this - it's interesting to see how this can be done with rewrite rules, and this may be useful for people who have Apache admin access. There is also an Apache 2.0 module, mod_fileiri, that does the same thing - see this presentation and discussion on ProposedUTF8SupportForI18N.

However, since your solution requires Apache admin rights and involves loading the Python interpreter and its encode module, it is less widely applicable than the TWikiBetaRelease approach outlined in EncodeURLsWithUTF8, which is done in Perl and can benefit from ModPerl to avoid the overhead of re-compiling the CPAN conversion module(s). The TWiki beta code is quite stable so there's no reason not to use it for sites that need UTF-8 URLs - if Perl 5.8 is used there's no need to install any CPAN modules.

You might want to try upgrading your site to the TWikiBetaRelease (not too hard if you just upgrade the scripts and then modify TWikiPreferences for new variable names) - there's already a CairoRelease upgrade guide on the TWiki web.

-- RichardDonkin - 04 Mar 2004

MattWilkie has offered to help out clearing up the docs for Cairo.

-- CrawfordCurrie - 02 Jul 2004

Thanks for the cleanup Richard! I blockquoted the "official docs" part of the topic, upped the DocProgress to 95%, and changed classification to FeatureDone. Please revert if I'm wrong.

-- MattWilkie - 27 Jul 2004

Fixed the missing quote on the blockquote element, which was gobbling the first paragraph. Also did various other minor updates.

-- RichardDonkin - 28 Jul 2004

Created an AppendixEncodeURLsWithUTF8 to hold the official docs for this feature. I don't know if that is the best place for these docs. Move/Rename as makes sense.

-- MattWilkie - 10 Aug 2004

Topic attachments
Attachment: utf2iso8859-7 (r1, 0.9 K, 2004-03-04 10:50, UnknownUser): Script to convert UTF8 to iso8859-7 for Apache RewriteMap
Topic revision: r38 - 2006-02-15 - PeterThoeny
 