Difference Topic AppendixEncodeURLsWithUTF8 (r1.2 - 14 Aug 2004 - PeterThoeny)

META TOPICPARENT TWikiDocumentation

TWIKI DOCS FEEDBACK: Please help in maintaining high quality documentation: fix this topic if you find errors or incomplete content. Post questions, error notes, and suggestions concerning the documentation on this page in the comments section below! To solve problems you are having using this part of TWiki, see the Support web, rather than commenting here. For more on TWiki feature development, see the Codev web. The docs on TWiki.org reflect the latest development release, not the latest beta and production releases.
Line: 20 to 20

Changed:
<
<
The following 'non-ASCII-safe' character encodings are now excluded from use as the site character set, since they interfere with TWiki markup: ISO-2022-*, HZ-*, Shift-JIS, MS-Kanji, GB2312, GBK, GB18030, Johab and UHC. However, many multi-byte character sets work fine, e.g. EUC-JP, EUC-KR, EUC-TW, and EUC-CN. In addition, UTF-8 can already be used, with some limitations, for East Asian languages where EUC character encodings are not acceptable - see TWiki:Codev.ProposedUTF8SupportForI18N.
>
>
ISO-2022-*, HZ-* and other 'non-ASCII-safe' multi-byte character sets are now specifically excluded from use as the site character set, since they interfere with TWiki ML; however, many multi-byte character sets work fine, e.g. EUC-JP, GB2312, etc.
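
To illustrate why some encodings are 'non-ASCII-safe', here is a minimal Python sketch (not TWiki code; the function name is made up for this example). It checks whether a non-ASCII character, once encoded, yields any byte in the ASCII range, which is the kind of byte that can collide with markup characters.

```python
def has_ascii_range_bytes(text: str, charset: str) -> bool:
    """Return True if any non-ASCII character in `text`, when encoded in
    `charset`, produces at least one byte below 0x80.

    Hypothetical helper for illustration only; it is not part of TWiki.
    """
    for ch in text:
        if ord(ch) < 0x80:
            continue  # plain ASCII characters are not the concern here
        if any(b < 0x80 for b in ch.encode(charset)):
            return True
    return False


sample = "表"  # U+8868; its Shift-JIS form is 0x95 0x5C, and 0x5C is a backslash
print(has_ascii_range_bytes(sample, "shift_jis"))   # True
print(has_ascii_range_bytes(sample, "iso2022_jp"))  # True (7-bit escape sequences)
print(has_ascii_range_bytes(sample, "euc_jp"))      # False
```

In the EUC family every byte of a multi-byte character has the 8th bit set, so such characters cannot be mistaken for ASCII markup.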

It's now possible to override the site character set defined in the $siteLocale setting in TWiki.cfg - this enables you to have a slightly different spelling of the character set in the server locale (e.g. 'eucjp') and the HTTP header sent to the browser (e.g. 'euc-jp').

Line: 68 to 68



Comments & Questions about the Documentation in this Section


Changed:
<
<
Moved official docs here from EncodeURLsWithUTF8. Please proofread. Most likely problems are links back to twiki.org (Codev and Plugin webs). They should be either prefaced with TWiki: or re-worded so as not to create an NonExistantTopic? link.
>
>
Moved official docs here from EncodeURLsWithUTF8. Please proofread. Most likely problems are links back to twiki.org (Codev and Plugin webs). They should be either prefaced with TWiki: or re-worded so as not to create an NonExistantTopic? link.

-- MattWilkie - 10 Aug 2004

Deleted:
<
<
thanks Peter! -- MattWilkie - 17 Aug 2004

This feature is unfortunately broken in CairoRelease - you'll need to apply patches from GermanUmlauteBreakWikiWords to resolve this.

-- RichardDonkin - 01 Oct 2004

Update re non-ASCII-safe multi-byte character encodings - exclusions extended in DakarRelease, see SomeChineseCharactersBreakWikiLinks for details.

-- RichardDonkin - 29 Oct 2004

Difference Topic AppendixEncodeURLsWithUTF8 (r1.1 - 10 Aug 2004 - MattWilkie)

META TOPICPARENT TWikiDocumentation

Line: 9 to 9

This page addresses implemented UTF-8 support for URLs only. The overall plan for UTF-8 support for TWiki is described in TWiki:Codev.ProposedUTF8SupportForI18N .

Changed:
<
<

Current Status

>
>
Current Status:

Changed:
<
<
To simplify use of internationalised characters within WikiWords and attachment names, TWiki now supports UTF-8 URLs, converting on-the-fly to virtually any character set, including ISO-8859-*, KOI8-R, EUC-JP, and so on.
>
>
To simplify use of internationalised characters within WikiWords and attachment names, TWiki now supports UTF-8 URLs, converting on-the-fly to virtually any character set, including ISO-8859-*, KOI8-R, EUC-JP, and so on.

Support for UTF-8 URL encoding avoids having to configure the browser to turn off this encoding in URLs (the default in Internet Explorer, Opera Browser and some Mozilla Browser URLs) and enables support of browsers where only this mode is supported (e.g. Opera Browser for Symbian smartphones). A non-UTF-8 site character set (e.g. ISO-8859-*) is still used within TWiki, and in fact pages are stored and viewed entirely in the site character set - the browser dynamically converts URLs from the site character set into UTF-8, and TWiki converts them back again.

System requirements are updated as follows:

  • ASCII or ISO-8859-1-only sites do not require any additional CPAN modules to be installed.
  • Perl 5.8 sites using any character set do not require additional modules, since CPAN:Encode is installed as part of Perl.
Changed:
<
<
>
>

ISO-2022-*, HZ-* and other 'non-ASCII-safe' multi-byte character sets are now specifically excluded from use as the site character set, since they interfere with TWiki ML; however, many multi-byte character sets work fine, e.g. EUC-JP, GB2312, etc.

It's now possible to override the site character set defined in the $siteLocale setting in TWiki.cfg - this enables you to have a slightly different spelling of the character set in the server locale (e.g. 'eucjp') and the HTTP header sent to the browser (e.g. 'euc-jp').
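
As a rough illustration of the override idea, the following Python sketch derives a charset from a locale string and lets an explicit override win. The function and parameter names are assumptions made for this example, not TWiki.cfg settings, and the ISO-8859-1 fallback is also an assumption.

```python
from typing import Optional


def site_charset(site_locale: str, charset_override: Optional[str] = None) -> str:
    """Pick the charset to announce to the browser.

    `site_locale` is a locale string such as "ja_JP.eucjp"; the optional
    `charset_override` replaces the locale's own spelling of the charset.
    Both names are hypothetical and used only for this sketch.
    """
    if charset_override:
        return charset_override.lower()
    # Take the codeset part after the dot and drop any "@modifier" suffix.
    _, _, codeset = site_locale.partition(".")
    codeset = codeset.split("@", 1)[0]
    # Assumed fallback when the locale names no codeset at all.
    return codeset.lower() or "iso-8859-1"


def content_type_header(charset: str) -> str:
    """Build the HTTP header value sent to the browser."""
    return f"text/html; charset={charset}"


# Server locale spells it 'eucjp', but the browser should be told 'euc-jp'.
print(content_type_header(site_charset("ja_JP.eucjp", "euc-jp")))
```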

Changed:
<
<
This feature should also support use of Mozilla Browser with TWiki:Codev.TWikiOnMainframe (as long as mainframe web server can convert or pass through UTF-8 URLs) - however, this specific combination is not tested. Other browser-server combinations should not have any problems.
>
>
This feature should also support use of MozillaBrowser? with TWikiOnMainframe? (as long as mainframe web server can convert or pass through UTF-8 URLs) - however, this specific combination is not tested. Other browser-server combinations should not have any problems.

Please note that use of UTF-8 as the site character set is not yet supported - see Phase 2 of TWiki:Codev.ProposedUTF8SupportForI18N for plans and work to date in this area.

Line: 38 to 39

URLs are not allowed to contain non-ASCII (8th bit set) characters: http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
Changed:
<
<
The overall plan for UTF-8 support for TWiki is described in two phases in TWiki:Codev.ProposedUTF8SupportForI18N - this page addresses the first phase, in which UTF-8 is supported for URLs only.
>
>
The overall plan for UTF-8 support for TWiki is described in two phases in ProposedUTF8SupportForI18N? - this page addresses the first page, in which UTF-8 is supported for URLs only.

Changed:
<
<
UTF-8 URL translation to virtually any character set is supported as of TWiki Release 01 Sep 2004, but full UTF-8 support (e.g. pages in UTF-8) is not supported yet - this will be phase 2.
>
>
UTF-8 URL translation to virtually any character set is supported as of CairoRelease? (but full UTF-8 support (i.e. pages in UTF-8) is not supported yet - see ProposedUTF8SupportForI18N?, Phase 2 onwards).

Changed:
<
<
The code automatically detects whether a URL is UTF-8 or not, taking care to avoid over-long and illegal UTF-8 encodings that could introduce TWiki:Codev.MajorSecurityProblemWithIncludeFileProcessing (tested against a comprehensive UTF-8 test file, which IE 5.5 fails quite dangerously, and Opera Browser passes). Any non-ASCII URLs that are not valid UTF-8 are then assumed to be directly URL-encoded as a single-byte or multi-byte character set (as now), e.g. EUC-JP.
>
>
The code automatically detects whether a URL is UTF-8 or not, taking care to avoid over-long and illegal UTF-8 encodings that could introduce security holes? (tested against a comprehensive UTF-8 test file, which IE5.5 fails quite dangerously, and OperaBrowser? passes). Any non-ASCII URLs that are not valid UTF-8 are then assumed to be directly URL-encoded as a single-byte or multi-byte character set (as now), e.g. EUC-JP.
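
The detection described above can be sketched in Python as follows. This is not TWiki's Perl implementation: the function name and the default charset are assumptions, and the sketch relies on Python's strict UTF-8 decoder, which rejects over-long and otherwise illegal sequences.

```python
from urllib.parse import unquote_to_bytes


def decode_url_component(component: str, site_charset: str = "iso-8859-1") -> str:
    """Decode one percent-encoded URL component.

    Valid UTF-8 is decoded as UTF-8; anything else is assumed to be
    percent-encoded directly in the site charset. Hypothetical helper,
    for illustration only.
    """
    raw = unquote_to_bytes(component)
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii")  # pure ASCII needs no detection
    try:
        # Strict decoding fails on over-long forms such as C0 AF.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw.decode(site_charset)


print(decode_url_component("%C3%BCber"))  # UTF-8 URL -> über
print(decode_url_component("%FCber"))     # ISO-8859-1 URL -> über
# The over-long encoding of "/" is never turned into a slash:
print("/" in decode_url_component("%C0%AF"))  # False
```

To store or look up the name in the site charset, the returned string would then be encoded with `.encode(site_charset)`. A byte sequence that is valid UTF-8 can occasionally also be legitimate site-charset text, so detection of this kind is a heuristic.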

Changed:
<
<
The main point is that you can use TWiki with international characters in WikiWords without changing your browser setup from the default, and you can also still use TWiki using non-UTF-8 URLs. This works on any Perl version from 5.005_03 onwards and corresponds to Phase 1 of TWiki:Codev.ProposedUTF8SupportForI18N. You can have different users using different URL formats transparently on the same server.
>
>
The main point is that you can use TWiki with international characters in WikiWords without changing your browser setup from the default, and you can also still use TWiki using non-UTF-8 URLs. This works on any Perl version from 5.005_03 onwards and corresponds to Phase 1 of ProposedUTF8SupportForI18N?. You can have different users using different URL formats transparently on the same server.

UTF-8 URLs are automatically converted to the current $siteCharset (from the TWiki.cfg locale setting), using modules such as CPAN:Encode if needed.

TWiki generates the whole page in the site charset, e.g. ISO-8859-1 or EUC-JP, but the browser dynamically UTF-8 encodes an attachment's URL when it is used. Since Apache serves attachment downloads without TWiki being involved, TWiki's code cannot do its UTF-8 decoding trick. Instead, when generating the page, TWiki URL-encodes such URLs in the site charset (e.g. ISO-8859-1); this bypasses the browser's UTF-8 encoding and ensures that the URLs and filenames seen by Apache remain in the site charset.
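
A small Python sketch of this idea follows: percent-encode the attachment path in the site charset so that the browser has nothing left to re-encode. The function name, the "/pub" base path and the web and topic names are assumptions for the example, not taken from TWiki's code.

```python
from urllib.parse import quote


def attachment_url(pub_base: str, web: str, topic: str, filename: str,
                   site_charset: str = "iso-8859-1") -> str:
    """Build an attachment URL whose non-ASCII bytes are percent-encoded
    in the site charset rather than in UTF-8.

    Hypothetical helper for illustration only.
    """
    parts = [quote(p, safe="", encoding=site_charset)
             for p in (web, topic, filename)]
    return pub_base.rstrip("/") + "/" + "/".join(parts)


# 'Ü' is the single byte 0xDC in ISO-8859-1, but C3 9C in UTF-8.
print(attachment_url("/pub", "Main", "TestTopic", "Übersicht.txt"))
# /pub/Main/TestTopic/%DCbersicht.txt
print(attachment_url("/pub", "Main", "TestTopic", "Übersicht.txt", "utf-8"))
# /pub/Main/TestTopic/%C3%9Cbersicht.txt
```

Because the emitted URL is already pure ASCII, the web server receives exactly the bytes of the filename as stored on disk in the site charset.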

Changed:
<
<
TWiki:Codev.TWikiOnMainframe uses EBCDIC web servers that typically translate their output to ASCII, UTF-8 or ISO-8859-1 (and URLs in the other direction) since there are so few EBCDIC web browsers. Such web servers don't work with even ISO-8859-1 URLs if they are URL encoded, since the automated translation is bypassed for URL-encoded characters. For TWiki on Mainframe, TWiki assumes that the web server will automatically translate UTF-8 URLs into EBCDIC URLs, as long as URL encoding is turned off in TWiki pages.
>
>
TWiki:Codev.TWikiOnMainframe uses EBCDIC web servers that typically translate their output to ASCII, UTF-8 or ISO-8859-1 (and URLs in the other direction) since there are so few EBCDIC web browsers. Such web servers don't work with even ISO-8859-1 URLs if they are URL encoded, since the automated translation is bypassed for URL-encoded characters. For TWikiOnMainframe?, TWiki assumes that the web server will automatically translate UTF-8 URLs into EBCDIC URLs, as long as URL encoding is turned off in TWiki pages.


Changed:
<
<

Testing and Limitation

>
>

Testing


It should work with TWiki:Codev.TWikiOnMainframe. Tested with IE 5.5, Opera 7.11 and Mozilla (Firebird 0.7).

Changed:
<
<
Opera Browser on the P800 smartphone is working for page viewing but leads to corrupt page names when editing pages.
>
>
This has been tested quite a lot in both modes using Opera, Mozilla variants and IE - more testing with Mozilla and friends would be very helpful, particularly with disabled INTURLENCODE (see TWiki:Codev.TWikiOnMainframe for links, requires some TWiki.pm changes as in related patches). You can try this out without upgrading TWiki at http://donkin.org/bin/view/Test/TestTopic5.

Changed:
<
<
For up to date information see TWiki:Codev.EncodeURLsWithUTF8
>
>
OperaBrowser on the P800 smartphone is working for page viewing but leads to corrupt page names in a new way when editing pages.

Changed:
<
<
-- TWiki:Main.RichardDonkin - 7 Jan 2004
-- TWiki:Main.MattWilkie - 10 Aug 2004
-- TWiki:Main.PeterThoeny - 14 Aug 2004
>
>
TWiki:Support.HowToDealWithGermanUmlaute (implemented): ChristianKohl has been testing this feature for some weeks (in ISO-8859-1 mode) and hasn't found any issues.

Deleted:
<
<



Comments & Questions about the Documentation in this Section



Deleted:
<
<
Moved official docs here from EncodeURLsWithUTF8. Please proofread. Most likely problems are links back to twiki.org (Codev and Plugin webs). They should be either prefaced with TWiki: or re-worded so as not to create an NonExistantTopic? link.

Changed:
<
<
-- MattWilkie - 10 Aug 2004
>
>

Development discussion of this feature at TWiki:Codev.EncodeURLsWithUTF8 .

Changed:
<
<
thanks Peter! -- MattWilkie - 17 Aug 2004
>
>



Comments & Questions about the Documentation in this Section



Changed:
<
<
This feature is unfortunately broken in CairoRelease - you'll need to apply patches from GermanUmlauteBreakWikiWords to resolve this.
>
>
Moved official docs here from EncodeURLsWithUTF8. Please proofread. Most likely problems are links back to twiki.org (Codev and Plugin webs). They should be either prefaced with TWiki: or re-worded so as not to create an NonExistantTopic? link.

Changed:
<
<
-- RichardDonkin - 01 Oct 2004
>
>
-- MattWilkie - 10 Aug 2004

Revision r1.5 - 29 Oct 2004 - 20:18 - RichardDonkin
Revision r1.2 - 14 Aug 2004 - 07:35 - PeterThoeny