This is for general discussion of InternationalisationEnhancements and related I18N matters. Please log bugs via BugReports as usual and mention them here - InternationalisationIssues has links to current known issues. Early parts of the page cover how the I18N code was developed up to the TWikiRelease01Feb2003.
Development log
(Refactored bug involving Mozilla and MacOS X to MacOSXFilesystemEncodingWithI18N)
StefanLindmark's bug is now fixed in TWikiAlphaRelease, though there are still some oddities in how MacOS names the files used for TWiki topics - see MacOSXFilesystemEncodingWithI18N.
--
RichardDonkin - 03 Dec 2002
A lot of new fixes for i18n went in yesterday, so the latest
TWikiAlphaRelease is the one to use. This has also had a handy spin-off in making TWiki more customisable even for non-I18N users, e.g. it is possible to edit TWiki.pm in one place to redefine the allowed format of Web names (see
WebNameAsWikiName). TWiki now works well with Mozilla despite some questionable use of UTF8 encoding by that browser, and with IE5, Opera, and so on.
Tested on Perl 5.6.1 and Perl 5.005_03, on Linux,
MacOS X and Windows, with working locales - also includes a reasonable workaround for non-working locales.
StefanLindmark has been doing a lot of testing, which is very helpful - more testers are very welcome, of course!
You can create webs with i18n webnames via
ManagingWebs, in CVS.
InterwikiPlugin has also been fixed in CVS to handle 8-bit characters in the site or page name.
testenv now tests the I18N setup for your TWiki installation and warns you if there are likely to be problems - it also generates a list of non-ASCII alphabetic characters for pasting into TWiki.cfg, which enables TWiki to work efficiently even on Perl 5.005 (rather than generating this list at run time). As always, have a look at http://donkin.org/bin/view/Test/TestTopic5 to try this code out - or just download TWikiAlphaRelease of 05 Dec or later.
--
RichardDonkin - 06 Dec 2002
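As a rough illustration of the kind of list testenv generates, the sketch below collects the non-ASCII upper and lower case letters of an 8-bit charset. It is written in Python rather than Perl, and the function name national_letters is hypothetical - this is not the testenv code, just one way such a list could be built.

```python
def national_letters(charset="iso-8859-1"):
    """Return (upper, lower) strings of the non-ASCII letters in an 8-bit charset.

    Hypothetical helper for illustration only; not taken from testenv.
    """
    upper, lower = [], []
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the chosen charset
        if ch.isupper():
            upper.append(ch)
        elif ch.islower():
            lower.append(ch)
    return "".join(upper), "".join(lower)


if __name__ == "__main__":
    upper, lower = national_letters("iso-8859-1")
    print("upper:", upper)
    print("lower:", lower)
```

Precomputing the two strings once and pasting them into a config file avoids scanning the charset on every request, which is the efficiency point made above.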
I've put in fixes for mailnotify with i18n web names and user WikiNames into TWikiAlphaRelease. I've also fixed mapping of i18n WikiNames to/from intranet userids. Anyone with an 8-bit WikiName must log in with an intranet userid, e.g. 'jsmith', because their Basic Authentication login with Apache can't use 8-bit characters. This seems to be a limitation of Apache or perhaps HTTP.
--
RichardDonkin - 07 Dec 2002
The
statistics script now supports
I18N web names and topic names. All
bin scripts in the
TWikiAlphaRelease are now updated for 8-bit
WikiWords and ABBREVs, including rename, view and edit.
upload now supports 8-bit characters in attachment filenames.
register,
installpasswd and
TWiki/Access.pm only support ASCII (7-bit)
WikiNames for users, because Apache Basic Authentication doesn't seem to support login with any 8-bit characters in the username. Fixing the
I18N problem would allow people to create unusable
WikiNames, which seems like a bad idea. Such users can be created manually, and can then log in via an intranet userid (e.g. 'jfrancois') with no 8-bit characters. The Apache
htpasswd tool would need to be tested for its handling of 8-bit userids, of course; some sites use third party tools to manage the
.htpasswd file, so there could be wider problems in enabling such userids.
Most
lib modules are fixed as well.
Prefs.pm doesn't need any i18n fixes;
Form.pm may do, and
Search.pm does. The
Store.pm fixes are similar to those in
rename, and are fairly obscure cases.
So - if anyone wants to test the alpha, it is fairly complete in its
I18N support at the moment, as long as you use ISO-8859-1, and is working quite well. There is a test site up at
http://donkin.org/bin/view/Test/TestTopic5
if you want to try it out without installing it, but I really need people to install it in different environments to flush out any problems.
MacOS X now works, with peculiarities in file naming, and Windows and Linux are well tested - any Perl 5.6 users are particularly welcome!
I'm on vacation/holiday from the 12th to 19th December, but I will pick up any
I18N bugs when I get back - please link them from here, or email me.
--
RichardDonkin - 08 Dec 2002
Charset selection is now implemented, and plural processing turned off for non-English language locales. The charset is extracted from the $siteLocale, e.g.
ru_RU.KOI8-R would use the
KOI8-R
charset for Russian sites. The charset is set in the HTTP headers (which take priority) and the
HTML META tag - skins should use the following sort of markup (ideally just in one file, but this is less critical now):
<meta http-equiv="Content-Type" content="text/html; charset=%CHARSET%" />
Also,
NationalCharactersEncodedInSearchResults is now fixed in CVS, which was stopping non-ISO-8859-1 users from seeing the right characters in search results. The charset is also used in the template for
TWikiSyndication using RichSiteSummary - this is valid XML according to the XML Specification.
Getting the charset name right is important (see
http://www.iana.org/assignments/character-sets
and go for the 'preferred MIME names' if possible), and testing with various browsers and client OSs is recommended. These are site-wide settings because they affect the contents of topics, and transcoding between different charsets on the fly would be rather horrible...
The new code is in TWikiAlphaRelease and would greatly benefit from being tested by people who need these features.
--
RichardDonkin - 08 Dec 2002
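The charset selection described above (taking KOI8-R from ru_RU.KOI8-R) can be sketched as follows. This is Python rather than TWiki's Perl, the function names are hypothetical, and the fallback to ISO-8859-1 when the locale has no charset part is an assumption made for the example, not something stated above.

```python
def charset_from_locale(site_locale, default="ISO-8859-1"):
    """Extract the charset part of a locale name such as 'ru_RU.KOI8-R'.

    The default used when no charset is present is an assumption.
    """
    # Drop any "@modifier" suffix, then take whatever follows the first "."
    base = site_locale.split("@", 1)[0]
    _, sep, charset = base.partition(".")
    return charset if sep and charset else default


def content_type_header(site_locale):
    """Build the HTTP header that should agree with the skin's META tag."""
    return "Content-Type: text/html; charset=" + charset_from_locale(site_locale)


if __name__ == "__main__":
    print(content_type_header("ru_RU.KOI8-R"))
```

Deriving both the HTTP header and the %CHARSET% value from the same site-wide setting keeps the two from disagreeing, which matters because the header takes priority.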
Fixed an issue with Mozilla, which doesn't like URL-encoded anchors - now working on test page and included in latest ZIP file in
TWikiAlphaRelease.
UPDATE: Perl 5.6 is now working again - the
TWikiAlphaRelease ZIP file is up to date with this fix.
--
RichardDonkin - 09 Dec 2002
See JapaneseAndChineseSupport for links to demo pages using Japanese and Chinese characters in TWiki, and CyrillicSupport for a link to a Cyrillic demo page.
The latest
TWikiAlphaRelease code is now running on TWiki.org - this is in
$useLocale = 0 mode, so no
I18N features are turned on, but this is useful to test that the
I18N changes haven't affected operation of English-language sites that don't need any
I18N. The new code does make it easier to customise what's accepted as a legal
WikiWord or web name, incidentally.
--
RichardDonkin - 10 Dec 2002
I'm on vacation/holiday from the 12th to 19th December, but I will pick up any I18N bugs when I get back - please link them from here, or email me. Since the new alpha code is running on TWiki.org, you know who to blame if it breaks... However, it seems to be working OK, and has been tested a lot more on Perl 5.6 recently.
--
RichardDonkin - 10 Dec 2002
(UTF-8 discussion refactored to InternationalisationUTF8)
--
RichardDonkin,
TomKagan - 18-23 Dec 2002
I made a small spec change:
%WEBURLENCODED% etc should be replaced by
%URLENCODE{"%WEB%"}% etc. Above docs are updated.
--
PeterThoeny - 05 Jan 2003
There is a problem with doing this - the
%WEBURLENCODED% variables as defined now can be made to behave differently in the future, depending on site setup, since they are only used for
I18N purposes. This means that when we go to UTF8-based encoding of all URLs in some future version, there is no need to URL-encode Mozilla's URLs to bypass its UTF8-encoding - hence, these variables can become a no-op, allowing the generated URL to be UTF8-encoded by Mozilla rather than URL-encoded by TWiki.
Using the
%URLENCODE{"%WEB%"}% approach, this is not possible - so we need to make sure that we use a different function, e.g. INTENCODE or INTURLENCODE instead of URLENCODE, or revert this change to use the original setup.
I would like some discussion before making spec changes like this - one skin is already being updated to use the previous variable format and I would like to minimise any changes.
--
RichardDonkin - 05 Jan 2003
Oops, I guess I am trying to rush out the release too perfectly and too fast.
I took out the %...ENCODED% variables and replaced them with %URLENCODE{...}%. This is more flexible (you do not need to change the code to escape new variables, e.g. INCLUDINGTOPIC) and it speeds up normal topic processing (because I could remove 6 regular expressions). I did not realize the implications.
Changing the spec of Beta is acceptable since it is Beta. For performance and flexibility I suggest adding a new variable. INTENCODE sounds more like "integer" than "internationalization". Shall we change the spec from %URLENCODE{...}% to %I18NENCODE{...}%?
--
PeterThoeny - 06 Jan 2003
%I18NENCODE{...}% would be OK, though I'd prefer
%INTURLENCODE{...}% for ease of typing and clarity (people will realise what it means from the URLENCODE part, which is the most important bit, and the subtlety re
I18N modes from RTFMing about the INT part, I think). Anyway, this is certainly more elegant and flexible than proliferating variables, and easier to document.
--
RichardDonkin - 06 Jan 2003
TWiki now expands %INTURLENCODE{...}%. All I18N-related %URLENCODE{...}% are now changed to %INTURLENCODE{...}%. TWikiAlphaRelease, TWikiBetaRelease and TWiki.org are updated.
--
PeterThoeny - 08 Jan 2003
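To make the distinction in the discussion above concrete, here is a small Python sketch of an I18N-specific URL encoder that can later be switched to a no-op. The function name int_url_encode and its passthrough flag are hypothetical and only illustrate the idea; they are not how %INTURLENCODE{...}% is implemented in TWiki.

```python
from urllib.parse import quote


def int_url_encode(text, site_charset="iso-8859-1", passthrough=False):
    """URL-encode text using the site's 8-bit charset.

    passthrough=True models the possible future behaviour discussed above,
    where the browser is left to UTF-8-encode the URL itself.
    """
    if passthrough:
        return text
    # "/" is kept as-is so that Web/Topic paths stay readable
    return quote(text, safe="/", encoding=site_charset)


if __name__ == "__main__":
    print(int_url_encode("Main/\u00c9mile"))
    print(int_url_encode("Main/\u00c9mile", passthrough=True))
```

Keeping this separate from a general-purpose URL encoder is what allows the I18N behaviour to change per site setup without affecting other uses of URL encoding.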
Thanks for doing the updates - however, there are some other
I18N fixes to be made before
BeijingRelease, in
lib/TWiki/*.pm modules, so I'm not sure that this alpha/beta is the final release candidate.
All the fixes are fairly minor and unlikely to happen in the real world, but should really be done before we ship. There is also some other cleanup, see my
BeijingRelease to-do list at
RichardDonkin. I have done some of the
lib/TWiki/*.pm fixes locally, but ran into odd behaviour that I couldn't figure out at the time - will have another go at them this week, I hope.
I've updated the docs on skin/template and topic changes, above, and will flag in
Plugins.InternationalisingYourSkin.
--
RichardDonkin - 08 Jan 2003
Hi Peter and Richard. I know you have both been working very hard in this area. I have seen all the CVS checkin emails flying. Since there is still more work to do, what is the real status in terms of the percent complete numbers on
BeijingRelease? It's currently set to 100% specification, 95% implementation and 10% documentation. Some projects (as a general rule) get more difficult the closer they get to completion, and this seems like one of them!
--
GrantBow - 09 Jan 2003
The code is roughly 95% complete - remaining fixes are in quite esoteric areas (e.g. renaming pages with an 8-bit character as the first letter) but should still be done for BeijingRelease even though most people would not notice if they weren't included. The documentation doesn't need to be too complex - the TWiki.cfg already has some text on how to set the config variables, so it's a matter of writing a doc based on that, and another small page about making skins and topics I18N-aware, which is already written above.
--
RichardDonkin - 09 Jan 2003
What's the status now?
--
GrantBow - 28 Jan 2003
Still the same, I'm afraid - my employer has recently merged, and I have a demanding new role, both of which mean I am working very long hours with virtually no time for TWiki coding. As mentioned, the remaining fixes are quite obscure and will probably not be noticed by anyone.
However, there is one minor fix that I want to put in, which is to create a $localeRegexes setting in TWiki.cfg, normally set to 1 - when set to 0, this will force the I18N code into 'Perl 5.005' mode for WikiWords, i.e. doing explicit national character matching rather than using locale-based regexes. I have had to do something like this on my web host for http://donkin.org/, since their recent Perl 5.6 upgrade made the current I18N code use broken locales and stop working... Unfortunately broken locales are quite common.
By the way, the Japanese text support may be quite popular - see my site's web statistics for Jan 2003, which show a lot of hits for the JapaneseText page. This is a very low volume site and this page is only linked from JapaneseAndChineseSupport on TWiki.org.
UPDATE: The
$localeRegexes setting is now in CVS for
TWikiAlphaRelease, tested on donkin.org. It's a very simple change so should cause no problems, but worth testing elsewhere. Default is to assume locales are working, since that is simplest to set up. This should be my last change to
I18N before
BeijingRelease.
--
RichardDonkin - 30 Jan 2003
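The 'explicit national character matching' mode mentioned above can be illustrated with a short Python sketch that builds a WikiWord pattern from hard-coded character lists instead of relying on the locale. The pattern is a simplified WikiWord rule, the two ISO-8859-1 lists are typed in by hand for the example, and the function name is hypothetical - none of this is copied from TWiki.pm.

```python
import re

# Hand-written ISO-8859-1 letter lists for illustration; a real site would
# paste in whatever list matches its own charset.
UPPER_NATIONAL = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ"
LOWER_NATIONAL = "ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"


def wikiword_regex(upper_national="", lower_national=""):
    """Compile a simplified WikiWord pattern from explicit character lists."""
    upper = "A-Z" + re.escape(upper_national)
    lower = "a-z" + re.escape(lower_national)
    # upper-case run, lower-case run, another upper-case letter, then anything alphanumeric
    return re.compile(f"^[{upper}]+[{lower}]+[{upper}][{upper}{lower}0-9]*$")


if __name__ == "__main__":
    ascii_only = wikiword_regex()
    national = wikiword_regex(UPPER_NATIONAL, LOWER_NATIONAL)
    for word in ("WebHome", "ÉcoleNormale", "webhome"):
        print(word, bool(ascii_only.match(word)), bool(national.match(word)))
```

Because the character classes are spelled out, the match does not depend on whether the host's locale support works, which is the point of having such a fallback.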
Comments post-release
(UTF-8 discussion refactored to InternationalisationUTF8)
Note that some ISO-8859-1 characters earlier in InternationalisationEnhancements were corrupted inadvertently a few days ago - this was in revision 1.48 of 30 Jan 2003, done by myself... (TWiki revision tracking is amazingly useful!) I was probably using Phoenix, a Mozilla-based browser, but I'm not 100% sure. Any Phoenix users should test carefully - most of my testing was done with IE5, Mozilla, Opera 6 and (some) K-Meleon, so I'm surprised to see this.
I suspect a UTF-8-related browser bug, since two characters ('ée') were turned into a single '?' character, which probably indicates an invalid UTF-8 encoding. If it were a non-UTF-8 bug, the number of characters would have been preserved. Any browser that correctly uses the Content-Type header, controlled by the new TWiki %CHARSET% variable (currently utf-8 on TWiki.org), should not corrupt TWiki pages, even if it supports UTF-8.
(Please follow up in
InternationalisationUTF8.)
This feature is in the new
TWikiRelease01Feb2003 (formerly
BeijingRelease).
--
RichardDonkin - 04 Feb 2003
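The suspicion above about 'ée' can be checked with a few lines of Python: the ISO-8859-1 bytes for 'ée' are not valid UTF-8, so software that wrongly treats them as UTF-8 has to substitute something. The helper name is_valid_utf8 is hypothetical, and the sketch only shows that the byte sequence is invalid - it does not reproduce what any particular browser did.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "ée".encode("iso-8859-1")  # two bytes: 0xE9 0x65
utf8_bytes = "ée".encode("utf-8")         # three bytes: 0xC3 0xA9 0x65

if __name__ == "__main__":
    print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))
    print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
```

In UTF-8, 0xE9 announces a three-byte sequence, and the following 0x65 is not a continuation byte. A decoder that swallows the following byte together with the bad lead byte would turn both characters into one replacement mark, which would be consistent with the corruption described, though that is only one possible explanation.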
(Bug report and discussion refactored to BuiltinWebPluralisationWithI18N.)
--
RichardDonkin - 16 Feb 2003
I'm curious to find out how many people are running with
I18N using the Feb 2003 release - presumably
ConnyBrunnkvist and
StefanLindmark at least. Anyone else? Any comments on how it works?
I would also like to get TWiki.org to enable the
I18N features so that 8-bit characters can be used in
WikiWords. Probably a good idea to do the remaining fixes first, though.
--
RichardDonkin - 22 Feb 2003
I'm running TWiki with I18N on an IBM Mainframe after some troubles.
Further information on this can be found at RewritingUrlsWithEscapedCharsUnderOs390 or TWikiOnMainframe.
--
OliverEichhorn - 04 Sep 2003
Interesting blog entry about the problems of locales such as Turkish, in which locale-aware upper casing of an English letter (e.g. i) gives a quite different letter (e.g. İ, capital I with dot above, Unicode U+0130) from what would be expected in English (i.e. I). This is fine for Turkish but in some cases the English language upper casing is required, e.g. for internal strings or (non-IDN) domain names.
For TWiki, we mainly want to use locale-aware upper and lower casing, but in some cases, if we are dealing with English language items (e.g. variable names) we would need to use the English-only operations. Can't think of any cases where this matters for TWiki right now, but worth knowing about.
--
RichardDonkin - 01 Mar 2005
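The casing difference described above can be sketched in Python. Python's own str.upper() is not locale-aware, so the Turkish rule is written out by hand here; both function names are hypothetical and the example is only meant to show why internal names may need an English-only operation.

```python
def turkish_upper(text):
    """Upper-case using the Turkish rule: i -> İ (U+0130), ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def ascii_upper(text):
    """Upper-case only a-z, leaving every other character untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


if __name__ == "__main__":
    name = "include"
    print(turkish_upper(name))  # what a Turkish-locale upper casing would give
    print(ascii_upper(name))    # what an internal variable name needs
```

An internal name such as a variable keyword would no longer match its expected spelling if it were upper-cased with the Turkish rule, which is why the English-only operation is worth keeping available.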
(Moved from elsewhere)
Though it doesn't have all the international characters, there is now a
SpecialCharacters document which will help users figure out what
HTML to use for displaying special characters.
--
AmandaSmith - 20 Feb 2006