This is for general discussion of InternationalisationEnhancements and related I18N matters. Please log bugs via BugReports as usual and mention them here - InternationalisationIssues has links to current known issues. Early parts of the page cover how the I18N code was developed up to the TWikiRelease01Feb2003.

Development log

(Refactored bug involving Mozilla and MacOS X to MacOSXFilesystemEncodingWithI18N)

StefanLindmark's bug is now fixed in TWikiAlphaRelease, though there are still some oddities in how MacOS names the filenames used for TWiki topics - see MacOSXFilesystemEncodingWithI18N.

-- RichardDonkin - 03 Dec 2002

A lot of new fixes for i18n went in yesterday, so the latest TWikiAlphaRelease is the one to use. This has also had a handy spin-off in making TWiki more customisable even for non-I18N users, e.g. it is possible to edit TWiki.pm in one place to redefine the allowed format of Web names (see WebNameAsWikiName). TWiki now works well with Mozilla despite some questionable use of UTF8 encoding by that browser, and with IE5, Opera, and so on.

Tested on Perl 5.6.1 and Perl 5.005_03, on Linux, MacOS X and Windows, with working locales - also includes a reasonable workaround for non-working locales. StefanLindmark has been doing a lot of testing, which is very helpful - more testers are very welcome, of course!

You can create webs with i18n webnames via ManagingWebs, in CVS. InterwikiPlugin has also been fixed in CVS to handle 8-bit characters in the site or page name.

testenv now tests the I18N setup for your TWiki installation and warns you if there are likely to be problems - it also generates a list of non-ASCII alphabetic characters for pasting into TWiki.cfg, which enables TWiki to work efficiently even on Perl 5.005 (rather than generating this list at run time). As always, have a look at http://donkin.org/bin/view/Test/TestTopic5 to try this code out - or just download TWikiAlphaRelease of 05 Dec or later.

-- RichardDonkin - 06 Dec 2002

I've put in fixes for mailnotify with i18n web names and user WikiNames into TWikiAlphaRelease. I've also fixed mapping of i18n WikiNames to/from intranet userids. Anyone with an 8-bit WikiName must log in with an intranet userid, e.g. 'jsmith', because their Basic Authentication login with Apache can't use 8-bit characters. This seems to be a limitation of Apache or perhaps HTTP.

-- RichardDonkin - 07 Dec 2002

The statistics script now supports I18N web names and topic names. All bin scripts in the TWikiAlphaRelease are now updated for 8-bit WikiWords and ABBREVs, including rename, view and edit. upload now supports 8-bit characters in attachment filenames.

register, installpasswd and TWiki/Access.pm only support ASCII (7-bit) WikiNames for users, because Apache Basic Authentication doesn't seem to support login with any 8-bit characters in the username. Adding I18N support to these would allow people to create unusable WikiNames, which seems like a bad idea. Such users can be created manually, and can then log in via an intranet userid (e.g. 'jfrancois') with no 8-bit characters. The Apache htpasswd tool would need to be tested for its handling of 8-bit userids, of course; some sites use third-party tools to manage the .htpasswd file, so there could be wider problems in enabling such userids.

Most lib modules are fixed as well. Prefs.pm doesn't need any i18n fixes; Form.pm may do, and Search.pm does. The Store.pm fixes are similar to those in rename, and are fairly obscure cases.

So - if anyone wants to test the alpha, it is fairly complete in its I18N support at the moment, as long as you use ISO-8859-1, and is working quite well. There is a test site up at http://donkin.org/bin/view/Test/TestTopic5 if you want to try it out without installing it, but I really need people to install it in different environments to flush out any problems. MacOS X now works, with peculiarities in file naming, and Windows and Linux are well tested - any Perl 5.6 users are particularly welcome!

I'm on vacation/holiday from the 12th to 19th December, but I will pick up any I18N bugs when I get back - please link them from here, or email me.

-- RichardDonkin - 08 Dec 2002

Charset selection is now implemented, and plural processing is turned off for non-English language locales. The charset is extracted from the $siteLocale, e.g. ru_RU.KOI8-R would use the KOI8-R charset for Russian sites. The charset is set in the HTTP headers (which take priority) and the HTML META tag - skins should use the following sort of markup (ideally just in one file, but this is less critical now):

 <meta http-equiv="Content-Type" content="text/html; charset=%CHARSET%" />
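The charset extraction described above can be illustrated with a minimal Python sketch. This is not TWiki's actual Perl code: the function names, the ISO-8859-1 fallback and the handling of an `@modifier` suffix are all assumptions made for illustration.

```python
def charset_from_locale(site_locale, default="ISO-8859-1"):
    """Return the charset part of a locale name such as 'ru_RU.KOI8-R'."""
    # Locale names commonly look like language_TERRITORY.charset@modifier.
    _, dot, rest = site_locale.partition(".")
    if not dot:
        # No charset in the locale name; this fallback is an assumption.
        return default
    # Drop an optional modifier such as '@euro'.
    charset = rest.partition("@")[0]
    return charset or default


def content_type_header(charset):
    """Build the HTTP header that announces the page charset."""
    return "Content-Type: text/html; charset=" + charset


if __name__ == "__main__":
    print(content_type_header(charset_from_locale("ru_RU.KOI8-R")))
```

The same charset string would be substituted for %CHARSET% in the META tag, so the header and the markup stay in agreement.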

Also, NationalCharactersEncodedInSearchResults is now fixed in CVS, which was stopping non-ISO-8859-1 users from seeing the right characters in search results. The charset is also used in the template for TWikiSyndication using RichSiteSummary - this is valid XML according to the XML Specification.

Getting the charset name right is important (see http://www.iana.org/assignments/character-sets and go for the 'preferred MIME names' if possible), and testing with various browsers and client OSs is recommended. These are site-wide settings because they affect the contents of topics, and transcoding between different charsets on the fly would be rather horrible...

The new code is in TWikiAlphaRelease and would greatly benefit from being tested by people who need these features :-)

-- RichardDonkin - 08 Dec 2002

Fixed an issue with Mozilla, which doesn't like URL-encoded anchors - now working on test page and included in latest ZIP file in TWikiAlphaRelease.

UPDATE: Perl 5.6 is now working again - the TWikiAlphaRelease ZIP file is up to date with this fix.

-- RichardDonkin - 09 Dec 2002

See JapaneseAndChineseSupport for links to demo pages using Japanese and Chinese characters in TWiki, and CyrillicSupport for a link to a Cyrillic demo page.

The latest TWikiAlphaRelease code is now running on TWiki.org - this is in $useLocale = 0 mode, so no I18N features are turned on, but this is useful to test that the I18N changes haven't affected operation of English-language sites that don't need any I18N. The new code does make it easier to customise what's accepted as a legal WikiWord or web name, incidentally.

-- RichardDonkin - 10 Dec 2002

I'm on vacation/holiday from the 12th to 19th December, but I will pick up any I18N bugs when I get back - please link them from here, or email me. Since the new alpha code is running on TWiki.org, you know who to blame if it breaks :-) ... However, it seems to be working OK, and has been tested a lot more on Perl 5.6 recently.

-- RichardDonkin - 10 Dec 2002

(UTF-8 discussion refactored to InternationalisationUTF8)

-- RichardDonkin, TomKagan - 18-23 Dec 2002

I made a small spec change: %WEBURLENCODED% etc should be replaced by %URLENCODE{"%WEB%"}% etc. Above docs are updated.

-- PeterThoeny - 05 Jan 2003

There is a problem with doing this - the %WEBURLENCODED% variables as defined now can be made to behave differently in the future, depending on site setup, since they are only used for I18N purposes. This means that when we go to UTF8-based encoding of all URLs in some future version, there is no need to URL-encode Mozilla's URLs to bypass its UTF8-encoding - hence, these variables can become a no-op, allowing the generated URL to be UTF8-encoded by Mozilla rather than URL-encoded by TWiki.

Using the %URLENCODE{"%WEB%"}% approach, this is not possible - so we need to make sure that we use a different function, e.g. INTENCODE or INTURLENCODE instead of URLENCODE, or revert this change to use the original setup.
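The difference between a plain URL-encode and an I18N-specific one that can later become a no-op can be sketched in Python as follows. The function `int_url_encode`, its `enabled` switch and the default charset are hypothetical names chosen for illustration; they are not TWiki's implementation.

```python
from urllib.parse import quote


def int_url_encode(text, site_charset="iso-8859-1", enabled=True):
    """URL-encode text as bytes in the site charset, or pass it through."""
    if not enabled:
        # Hypothetical future mode: leave the text alone so that the
        # browser can apply its own (e.g. UTF-8) encoding to the URL.
        return text
    # Encode to the site charset first, then percent-escape the bytes.
    return quote(text, safe="/", encoding=site_charset)


if __name__ == "__main__":
    print(int_url_encode("\u00c9t\u00e9"))                  # site charset bytes
    print(int_url_encode("\u00c9t\u00e9", enabled=False))   # no-op mode
```

A general-purpose URLENCODE cannot be switched off like this without breaking its other uses, which is why a separate I18N-only name is being asked for.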

I would like some discussion before making spec changes like this - one skin is already being updated to use the previous variable format and I would like to minimise any changes.

-- RichardDonkin - 05 Jan 2003

Oops, I guess I am trying to rush out the release too perfectly and too fast ;-)

I took out the %...ENCODED% variables and replaced them by %URLENCODE{...}%. This is more flexible (you do not need to change the code to escape new variables, e.g. INCLUDINGTOPIC) and it speeds up normal topic processing (because I could remove 6 regular expressions). I did not realize the implications.

Changing the spec of Beta is acceptable since it is Beta :-) . For performance and flexibility I suggest adding a new variable. INTENCODE sounds more like "integer" than "internationalization". Shall we change the spec from %URLENCODE{...}% to %I18NENCODE{...}%?

-- PeterThoeny - 06 Jan 2003

%I18NENCODE{...}% would be OK, though I'd prefer %INTURLENCODE{...}% for ease of typing and clarity (people will realise what it means from the URLENCODE part, which is the most important bit, and the subtlety re I18N modes from RTFMing about the INT part, I think). Anyway, this is certainly more elegant and flexible than proliferating variables, and easier to document.

-- RichardDonkin - 06 Jan 2003

TWiki now expands %INTURLENCODE{...}%. All I18N-related %URLENCODE{...}% are now changed to %INTURLENCODE{...}%. TWikiAlphaRelease, TWikiBetaRelease and TWiki.org are updated.

-- PeterThoeny - 08 Jan 2003

Thanks for doing the updates - however, there are some other I18N fixes to be made before BeijingRelease, in lib/TWiki/*.pm modules, so I'm not sure that this alpha/beta is the final release candidate.

All the fixes are fairly minor and unlikely to happen in the real world, but should really be done before we ship. There is also some other cleanup, see my BeijingRelease to-do list at RichardDonkin. I have done some of the lib/TWiki/*.pm fixes locally, but ran into odd behaviour that I couldn't figure out at the time - will have another go at them this week, I hope.

I've updated the docs on skin/template and topic changes, above, and will flag in Plugins.InternationalisingYourSkin.

-- RichardDonkin - 08 Jan 2003

Hi Peter and Richard. I know you have both been working very hard in this area. I have seen all the CVS checkin emails flying. Since there is still more work to do, what is the real status in terms of the percent complete numbers on BeijingRelease? It's currently set to 100% specification, 95% implementation and 10% documentation. Some projects (as a general rule) get more difficult the closer they get to completion, and this seems like one of them!

-- GrantBow - 09 Jan 2003

The code is roughly 95% complete - remaining fixes are in quite esoteric areas (e.g. renaming pages with an 8-bit character as the first letter) but should still be done for BeijingRelease even though most people would not notice if they weren't included. The documentation doesn't need to be too complex - the TWiki.cfg already has some text on how to set the config variables, so it's a matter of writing a doc based on that, and another small page about making skins and topics I18N-aware, which is already written above.

-- RichardDonkin - 09 Jan 2003

What's the status now?

-- GrantBow - 28 Jan 2003

Still the same, I'm afraid - my employer has recently merged, and I have a demanding new role, both of which mean I am working very long hours with virtually no time for TWiki coding. As mentioned, the remaining fixes are quite obscure and will probably not be noticed by anyone.

However, there is one minor fix that I want to put in, which is to create a $localeRegexes setting in TWiki.cfg, normally set to 1 - when set to 0, this will force the I18N code into 'Perl 5.005' mode for WikiWords, i.e. doing explicit national character matching rather than using locale-based regexes. I have had to do something like this on my web host for http://donkin.org/, since their Perl 5.6 upgrade recently made the current I18N code use broken locales and stop working... Unfortunately broken locales are quite common.
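Explicit national character matching, as opposed to relying on locale-based regexes, can be sketched in Python like this. The WikiWord pattern is a deliberately simplified assumption (upper, lower, upper, then letters or digits), and both function names are hypothetical; TWiki's real Perl regexes are not reproduced here.

```python
import re


def national_letters(charset="iso-8859-1"):
    """Return (upper, lower) strings of non-ASCII letters in an 8-bit charset."""
    upper, lower = [], []
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        if ch.isupper():
            upper.append(ch)
        elif ch.islower():
            lower.append(ch)
    return "".join(upper), "".join(lower)


def wikiword_regex(upper_extra="", lower_extra=""):
    """Compile a simplified WikiWord matcher from explicit character lists."""
    upper = "A-Z" + re.escape(upper_extra)
    lower = "a-z" + re.escape(lower_extra)
    pattern = "^[%s]+[%s]+[%s]+[%s%s0-9]*$" % (upper, lower, upper, upper, lower)
    return re.compile(pattern)


if __name__ == "__main__":
    up, low = national_letters()
    i18n = wikiword_regex(up, low)
    print(bool(i18n.match("\u00c9coleTopic")))
```

Because the character lists are fixed strings, matching no longer depends on whether the host's locale data is installed or correct, which is the point of forcing this mode on a system with broken locales.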

By the way, the Japanese text support may be quite popular - see my site's web statistics for Jan 2003, which show a lot of hits for the JapaneseText page. This is a very low volume site and this page is only linked from JapaneseAndChineseSupport on TWiki.org.

UPDATE: The $localeRegexes setting is now in CVS for TWikiAlphaRelease, tested on donkin.org. It's a very simple change so should cause no problems, but worth testing elsewhere. Default is to assume locales are working, since that is simplest to set up. This should be my last change to I18N before BeijingRelease.

-- RichardDonkin - 30 Jan 2003

Comments post-release

(UTF-8 discussion refactored to InternationalisationUTF8)

Note that some ISO-8859-1 characters earlier in InternationalisationEnhancements were corrupted inadvertently a few days ago - this was in revision 1.48 of 30 Jan 2003, done by myself :-) ... (TWiki revision tracking is amazingly useful!) I was probably using Phoenix, a Mozilla-based browser, but I'm not 100% sure. Any Phoenix users should test carefully - most of my testing was done with IE5, Mozilla, Opera 6 and (some) K-Meleon, so I'm surprised to see this.

I suspect a UTF-related browser bug, since two characters ('ée') were turned into a single '?' character, which is probably indicating an invalid UTF-8 encoding. If it was a non-UTF-8 bug, the number of characters would have been preserved. Any browser that correctly uses the Content-Type header, controlled by the new TWiki %CHARSET% variable (currently iso-8859-1 on TWiki.org) should not corrupt TWiki pages, even if it supports UTF-8.
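The reasoning about character counts can be checked with a small Python sketch. The `naive_lenient_decode` function is a hypothetical model of a decoder that trusts the UTF-8 lead byte's declared length; it is only one way such a collapse could arise, not a description of what any particular browser actually does.

```python
def is_valid_utf8(data):
    """Report whether a byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def naive_lenient_decode(data):
    """Hypothetical decoder: consume as many bytes as the lead byte
    announces, and emit '?' when that chunk is not valid UTF-8."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            size = 1
        elif lead >= 0xF0:
            size = 4
        elif lead >= 0xE0:
            size = 3
        elif lead >= 0xC0:
            size = 2
        else:
            size = 1  # stray continuation byte
        chunk = data[i:i + size]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            out.append("?")
        i += size
    return "".join(out)


if __name__ == "__main__":
    raw = "\u00e9e".encode("iso-8859-1")   # the two bytes 0xE9 0x65
    print(is_valid_utf8(raw))
    print(len(raw.decode("koi8-r")))       # wrong single-byte charset
    print(naive_lenient_decode(raw))
```

Misreading the bytes in another single-byte charset changes which characters appear but keeps one character per byte, whereas treating 0xE9 as the start of a multi-byte UTF-8 sequence can swallow the following 'e'.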

(Please follow up in InternationalisationUTF8.)

This feature is in the new TWikiRelease01Feb2003 (formerly BeijingRelease).

-- RichardDonkin - 04 Feb 2003

(Bug report and discussion refactored to BuiltinWebPluralisationWithI18N.)

-- RichardDonkin - 16 Feb 2003

I'm curious to find out how many people are running with I18N using the Feb 2003 release - presumably ConnyBrunnkvist and StefanLindmark at least. Anyone else? Any comments on how it works?

I would also like to get TWiki.org to enable the I18N features so that 8-bit characters can be used in WikiWords. Probably a good idea to do the remaining fixes first, though.

-- RichardDonkin - 22 Feb 2003

I'm running TWiki with I18N on an IBM Mainframe after some troubles.

Further information on this can be found at RewritingUrlsWithEscapedCharsUnderOs390 or TWikiOnMainframe.

-- OliverEichhorn - 04 Sep 2003

Interesting blog entry about the problems of locales such as Turkish, in which locale-aware upper casing of an English letter (e.g. i) gives a quite different letter (e.g. İ, capital I with dot above, Unicode U+0130) from what would be expected in English (i.e. I). This is fine for Turkish but in some cases the English language upper casing is required, e.g. for internal strings or (non-IDN) domain names.

For TWiki, we mainly want to use locale-aware upper and lower casing, but in some cases, if we are dealing with English language items (e.g. variable names) we would need to use the English-only operations. Can't think of any cases where this matters for TWiki right now, but worth knowing about.
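The two kinds of upper casing can be contrasted in a short Python sketch. Python's own `str.upper()` does not consult the locale, so the Turkish rule is hand-coded here; both function names are hypothetical and nothing below is taken from TWiki.

```python
import string

# Translation table that touches only the 26 ASCII lower-case letters.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_upper(text):
    """English-only upper casing, suitable for internal names."""
    return text.translate(_ASCII_UPPER)


def turkish_upper(text):
    """Turkish-style upper casing: dotted i becomes U+0130, and
    dotless U+0131 becomes plain I via the default Unicode mapping."""
    return text.replace("i", "\u0130").upper()


if __name__ == "__main__":
    print(ascii_upper("wiki"))
    print(turkish_upper("wiki"))
```

Variable names and similar internal identifiers would go through the English-only operation, while user-visible text such as topic names would use the locale-aware one.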

-- RichardDonkin - 01 Mar 2005

(Moved from elsewhere)

Though it doesn't have all the international characters, there is now a SpecialCharacters document which will help users figure out what HTML to use for displaying special characters.

-- AmandaSmith - 20 Feb 2006

Topic revision: r9 - 2006-02-22 - RichardDonkin
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.