Internationalisation Enhancements

This page is targeted at anyone interested in the development of internationalisation support for TWiki. If you are looking for instructions on configuring your TWiki to work with your local language, see InstallationWithI18N.

  • One code base for the world
  • English is just another language

Slogans borrowed from the Mozilla I18N project

For help in updating plugins or core code for Internationalisation, see InternationalisationGuidelines (NEW)

This page is a gateway and discussion point for developers working on I18N. It is mainly a collection of resources useful to such developers.

Related pages: InternationalisationDiscuss, InternationalisationIssues, UnicodeSupport (NEW), InternationalisationUTF8, ProposedUTF8SupportForI18N, EncodeURLsWithUTF8, CyrillicSupport, JapaneseAndChineseSupport, UserInterfaceInternationalisation, UserInterfaceLocalisation, BiDirectionalText (NEW)

Introduction

The TWiki code, since TWikiRelease01Feb2003, has good support for internationalisation ('I18N') in some key areas. This is primarily to support 8-bit character sets, such as ISO-8859-15 and KOI8-R, though this also helps today with multi-byte character sets such as EUC-JP, and will help with Unicode in the longer term (see below for Unicode links, and InternationalisationUTF8 for discussion).

The key feature is support of 8-bit characters in WikiWords, ensuring that they are auto-linked, displayed and sorted as required. A locale-aware version of grep will be necessary for searching to work properly - GNU grep works fine and is available on virtually any platform.

Use of locales is controlled by a configure setting, and the locale is site-wide for simplicity. More complex setup of locales may be possible in future, but there are security issues with allowing web users to set their own locale variables.

The Move to Unicode

Unicode and UTF-8 support were out of scope for the initial work in late 2002, because the whole area of Unicode is much more complex. UTF-8 support has been investigated, and implemented for URLs in EncodeURLsWithUTF8, but full UTF-8 support is currently on hold because of lack of time, some technical difficulties (see ProposedUTF8SupportForI18N), and the greater importance of UserInterfaceInternationalisation. However, East Asian sites successfully run TWiki with UTF-8 for Japanese, Chinese, Korean, etc. It is to be hoped that I18N may come back into scope in 2008.

Things to do

Multi-byte character support

A few multi-byte character encodings other than Unicode do work already, specifically EUC-JP, EUC-KR, EUC-TW and EUC-CN. Other multi-byte encodings, including Shift-JIS, Big5, GB2312, GBK, UHC and Johab, will not work because they are not 'ASCII safe': some East Asian characters include ASCII bytes, usually as the second byte, that can be confused with special TWiki characters.
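The 'ASCII safe' distinction can be checked directly. The following is a minimal sketch, not TWiki code: it uses the standard Encode module, and the helper name has_ascii_byte is made up for this illustration. It encodes one Katakana character in Shift-JIS and in EUC-JP and reports whether any byte of the result falls in the ASCII range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# True if the encoded form of a multi-byte character contains a byte in
# the ASCII range, i.e. a byte that a parser could mistake for markup.
# (has_ascii_byte is a hypothetical helper, not part of TWiki.)
sub has_ascii_byte {
    my ( $encoding, $char ) = @_;
    my $bytes = encode( $encoding, $char );
    return $bytes =~ /[\x00-\x7F]/ ? 1 : 0;
}

my $so = "\x{30BD}";    # KATAKANA LETTER SO
for my $encoding (qw(shiftjis euc-jp)) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, encode( $encoding, $so );
    printf "%-8s %s  %s\n", $encoding, $hex,
      has_ascii_byte( $encoding, $so ) ? 'not ASCII safe' : 'ASCII safe';
}
```

In Shift-JIS the second byte of this character is 0x5C, the ASCII backslash, which is exactly the kind of collision described above. In EUC-JP both bytes have the 8th bit set.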

Quite a few bugs relating to use of Chinese and similar languages with TWiki have been fixed - see InternationalisationIssues.

UTF-8 support: ProposedUTF8SupportForI18N is under development and UTF-8 can be used in a limited way today as the site character set for East Asian sites, i.e. Chinese, Japanese, Korean and Vietnamese, without WikiWord support beyond ASCII (which is not so important anyway for many East Asian languages).

UserInterfaceInternationalisation describes the work done on a framework to enable localisation of the TWiki user interface (i.e. translating English language text, wherever it appears in templates or other parts of the TWiki user interface, into another language). Through use of CPAN modules for L10N, it should be quite easy to translate TWiki into another language - see UserInterfaceLocalisation for details.

Multilingual TWiki

Enabling multiple languages within a single TWiki page (e.g. French, Russian and Japanese) - this should be implemented through Unicode (specifically UTF-8 support), combined with the UserInterfaceInternationalisation already implemented in DakarRelease (which allows each user to select their preferred translation of the TWiki user interface).

Specific requests:

Older TWiki releases

Browser setup

TWikiRelease01Sep2004 and later do not require any browser setup. If you have an older TWiki version the following may help.

  • NEW: For users of Firefox, Mozilla/SeaMonkey and similar browsers: you can optionally configure your browser for UTF-8 URLs as follows so that non-ASCII characters display properly in the URL bar and when mousing over links. This is not necessary for TWiki to work, but looks nicer. Instructions given are for Firefox 1.0:
    1. Type about:config into the URL bar
    2. Type utf into the filter field that appears
    3. Double-click on the network.standard-url.encode-utf8 line so that it says true
    4. Double-click on the network.standard-url.escape-utf8 line so that it says true
  • This ensures that UTF-8 URL encoding is used for all URLs - note that this does not mean your site needs to use UTF-8. See EncodeURLsWithUTF8 for more details.

  • History:
    • TWiki's Dec 2001 release could link to WikiWords with 8-bit characters in their names, as long as you use [[WikiWord]] type links.
    • TWikiRelease01Feb2003 was the first release with full I18N support for 8-bit WikiWords.
    • In both these releases, you needed to disable UTF-8 (Unicode) encoding of URLs by the browser (which is enabled by default in some browsers):
      • InternetExplorer 5.0 or higher: in Tools | Options | Advanced, uncheck 'always send URLs as UTF-8', then close all IE windows and restart IE. (No changes needed for IE 4.0)
      • OperaBrowser 6.x or higher: in Preferences | Network | International Web Addresses, uncheck 'encode all addresses with UTF-8'.
      • MozillaBrowser 1.x, Netscape 6+ and related browsers: no setup necessary (tested on Mozilla 1.1 and K-Meleon 0.7) - but see the skin/template changes below
      • Netscape 4.x: works fine in general without any setup changes, but may have some problems as in BrowserProblemWithUmlauts (tested briefly on Netscape 4.7) and may 'burp' on loading page
      • Lynx 2.8: no setup necessary (tested briefly on Lynx 2.8.4rel.1 with CygWin)
      • Once you've done this, the following should link to an existing page called LaLangueFrançaise, with the topic name appearing correctly: LaLangueFrançaise - written using [[Sandbox.LaLangueFrançaise][LaLangueFrançaise]]
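To see why the browser's URL encoding setting mattered for older releases, it helps to compare how the same topic name is sent when the browser encodes the URL in the site character set and when it encodes it as UTF-8. This is a rough sketch using the standard Encode module. The function url_encode_as is invented for this example and is not how TWiki itself handles URLs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every byte outside a small set of URL-safe characters,
# after converting the topic name to the given character encoding.
# (url_encode_as is a hypothetical helper for illustration only.)
sub url_encode_as {
    my ( $encoding, $text ) = @_;
    my $bytes = encode( $encoding, $text );
    $bytes =~ s/([^A-Za-z0-9_.~\/-])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $topic = "LaLangueFran\x{E7}aise";    # c-cedilla written as an escape
print url_encode_as( 'iso-8859-1', $topic ), "\n";    # LaLangueFran%E7aise
print url_encode_as( 'UTF-8',      $topic ), "\n";    # LaLangueFran%C3%A7aise
```

A server that expects the first form but receives the second sees two unrelated 8-bit characters in place of the c-cedilla, so the topic is not found unless the server recognises and converts UTF-8 URLs.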

Skin and template changes

In TWikiRelease01Feb2003, some minor skin/template changes are needed to support use of forms with Mozilla and the %CHARSET% variable with any browser. The standard TWiki templates are now fixed in the TWikiAlphaRelease to work with I18N web names and WikiWords using Mozilla - this is because Mozilla decides to UTF8-encode URLs if they are used as a form submission URL, even though the whole page is in ISO-8859-1 mode and other URLs are never encoded...

To make any skin work with the new I18N support, some simple changes are needed to any form submission URLs:

  1. Locate any <form> elements in your skin templates - e.g. grep -i '<form' *.tmpl under Unix/Linux/CygWin.
  2. Change the form submission URL (usually on same line as the <form tag, and always part of the action="http://foo" attribute) so that the variables %WEB%, %BASEWEB%, %INCLUDINGWEB% and %TOPIC% are properly URL encoded. For example, to URL encode .../%WEB%/%TOPIC% write .../%INTURLENCODE{"%WEB%/%TOPIC%"}%.
    • You only need to make this change for form submission URLs - any other URLs don't need to change, e.g. those used for normal links
    • (NEW) Be sure to use %INTURLENCODE{...}% as shown above rather than another encoding variable - this helps to ensure that your skin will work smoothly in the future, when TWiki eventually supports UTF8 throughout.

To support character set selection (which enables any character set to be used in the skin and topic contents), skins should use the following HTML using the new %CHARSET% variable instead of iso-8859-1:

<head>
 ...
 <meta http-equiv="Content-Type" content="text/html; charset=%CHARSET%" />
</head>

Skins for TWikiSyndication should use names of the form 'rss*' - this ensures that the TWiki code knows it is handling RSS data, which requires I18N characters (i.e. with 8th bit set) to be encoded as &#nnn; sequences.

  • Entities written as numeric character references (NCRs) such as &#1562; are drawn from the Unicode (ISO 10646-1) character set, whose first 256 codepoints are the same as ISO-8859-1. These entities always refer to the same character, regardless of the document's character encoding, according to the HTML 4.0 spec.
  • (NEW) See Plugins.InternationalisingYourSkin for more discussion of how to internationalise skins.
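The conversion of 8-bit characters to numeric character references can be sketched in a few lines. This is an illustration only, with a made-up function name (to_ncr), and it assumes the text has already been decoded into Perl characters. For ISO-8859-1 the byte value and the Unicode codepoint coincide, but for other site character sets such as KOI8-R the bytes would need to be decoded first.

```perl
use strict;
use warnings;

# Replace every non-ASCII character with a decimal numeric character
# reference. The input must be a decoded (character) string.
# (to_ncr is a hypothetical name, not a TWiki function.)
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# c-cedilla is U+00E7, decimal 231
print to_ncr("LaLangueFran\x{E7}aise"), "\n";    # LaLangueFran&#231;aise
```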

History - I18N-related TWiki pages

8-bit Wiki words etc

8-bit Interwiki

8-bit external programs

I18N of search results

Selecting browser character set

I18N resources

TWiki sites

These sites are either about I18N or using TWiki I18N features - some old sites using I18N may require UTF-8 URL encoding to be turned off in your browser as per the Browser setup section above, but those using TWikiRelease01Sep2004 or later will not:

Character encodings for internationalisation

Pre-Unicode character encodings

Unicode and UTF-8

I18N of HTML/HTTP

Scripts and languages

  • Ancient Scripts - good coverage of non-Roman writing systems, many of which are still used today despite name of site
  • Language Introductions - tutorials on writing in Russian, Korean, etc.

FAQs and Guides

Perl I18N

Other I18N

Other Wiki i18n efforts

Many other leading Wikis already have i18n features:

Useful newsgroup threads

Updates

One issue is that many locale setups are somewhat broken, particularly on Windows. On Debian GNU/Linux, the \w regex in the fr_FR.ISO8859-1 locale matches '-' as well as '_', which is a minor issue, while on CygWin there is no locale support at all, and on ActivePerl, uppercasing a character can lead to a completely different and even non-alphabetic character! In Perl 5.8 on another Debian system, using the locale fr_FR.UTF8 meant that the collation order was as for ASCII, and a Japanese (Kanji) character was included in the set of alphabetic characters...

This means that workarounds will be essential for many people, so this code will make it easy to avoid using any locale functions if $useLocale is turned off - basically, this will involve typing a list of upper and lower case non-ASCII national characters into TWiki.cfg variable settings. This will help with features handled entirely by TWiki, such as WikiWords, but won't address external programs, for which the only solution is to report the bugs to whoever maintains them, or perhaps install different versions of such programs.

UPDATE: I've coded most of this - all the basic link types are working, apart from anchors and upper casing in spaced-out WikiWords. There's a test page up at http://donkin.org/bin/view/Test/TestTopic5 running on this code - not yet in TWikiAlphaRelease as I'd like to test it a bit more, but it seems to work OK. It's been tested in no-locale mode only so far, so will work on broken locales. I really need Perl 5.6 on a system with working locales to test this - will probably have to install Perl 5.6 on Debian.

-- RichardDonkin - 26 Nov 2002
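As a rough illustration of the no-locale workaround described above: if the administrator types in the upper and lower case national characters, WikiWord matching can be built from plain character classes without calling any locale functions. The variable names and the simplified WikiWord pattern below are assumptions made for this sketch and are not the actual TWiki.cfg settings or regexes.

```perl
use strict;
use warnings;

# Hypothetical site settings: national characters typed in by the admin.
# Here a few ISO-8859-1 letters, written as escapes.
my $upperNational = "\x{C0}\x{C7}\x{C9}\x{D6}";    # A-grave, C-cedilla, E-acute, O-umlaut
my $lowerNational = "\x{E0}\x{E7}\x{E9}\x{F6}";    # the lower-case counterparts

my $upper = "A-Z$upperNational";
my $lower = "a-z$lowerNational";

# Simplified WikiWord: upper, then lower/digits, then upper, then any mix.
my $wikiWordRegex = qr/^[$upper]+[${lower}0-9]+[$upper]+[${upper}${lower}0-9]*$/;

sub is_wiki_word {
    my ($word) = @_;
    return $word =~ $wikiWordRegex ? 1 : 0;
}

for my $word ( "LaLangueFran\x{E7}aise", "\x{C9}t\x{E9}Indien", "lalangue" ) {
    print is_wiki_word($word) ? "WikiWord\n" : "not a WikiWord\n";
}
```

Because the classes list the characters explicitly, the result does not depend on whether the system locale is broken. The cost is that the lists must be maintained by hand for each site character set.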

I've now got sorting of the WikiWords in WebIndex working - turns out that ls on my Debian is locale-unaware, but TWiki sorts the output anyway in Perl, so it works with only a five line change to Search.pm. Locales are also working fine under Perl 5.005_03.

-- RichardDonkin - 29 Nov 2002

Now in TWikiAlphaRelease - please test this out and log any bugs! It's quite easy to set up if you have a working locale on your system. Be sure to review #Browser_setup for a simple browser config change required for this to work.

-- RichardDonkin - 30 Nov 2002

More links about what other Wikis are doing in this area - PhpWiki is quite a way ahead, in that it actually ships with translated pages for several languages and already supports PhpWiki:DoubleByteCharacters. MoinMoin also ships with translated pages and has Unicode character support.

-- RichardDonkin - 02 Dec 2002

Now released as part of TWikiRelease01Feb2003 and running on TWiki.org (with I18N turned off).

(Discussion refactored to InternationalisationDiscuss; any bugs should be reported via BugReports as normal, and linked from InternationalisationIssues as well.)

-- RichardDonkin - 16 Feb 2003

Mainframe (EBCDIC) and UTF-8 support

Update on recent work:

-- RichardDonkin - 11 Sep 2003

Localisation (L10N) of TWiki

Added link above about KwikiWiki's L10N (Kwiki localization). See the links at the end - Kwiki uses standard CPAN modules for that.

What is the difference between L10N and I18N?

-- PeterMasiar (cannot copy-paste sig?)

I18N stands for internationalization (I + 18 chars + N). L10N stands for localization (L + 10 chars + N). Internationalization makes an application ready to be localized into different languages. That is, I18N is the base, making sure the app can handle character sets in multiple languages and provides a framework handling language specific text and formatting (e.g. externalized language files). L10N into a different language is a relatively simple task for an app that has a solid I18N framework.

-- PeterThoeny - 14 Sep 2003
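To make the I18N/L10N split above concrete, here is a toy sketch. The lookup function is the I18N framework part, while each externalised catalogue is an L10N deliverable that a translator can add without touching code. The lexicon contents, language tags and function name are all invented for this example and do not reflect how TWiki or any CPAN module stores translations.

```perl
use strict;
use warnings;

# Hypothetical externalised message catalogues, keyed by language tag.
my %lexicon = (
    fr => { 'Edit' => 'Modifier', 'Attach' => 'Attacher' },
    de => { 'Edit' => 'Bearbeiten' },
);

# Look up a translation, falling back to the English source text when
# the language or the individual string has not been translated.
sub translate {
    my ( $lang, $text ) = @_;
    my $catalogue = $lexicon{$lang} || {};
    return exists $catalogue->{$text} ? $catalogue->{$text} : $text;
}

print translate( 'fr', 'Edit' ),   "\n";    # translated
print translate( 'de', 'Attach' ), "\n";    # missing string, falls back to English
print translate( 'ja', 'Edit' ),   "\n";    # missing language, falls back to English
```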

There are some links on L10N of TWiki in an earlier section (now added to the TOC). There are some people interested in doing translations of TWiki, which would involve development of the infrastructure to support L10N - currently, TWiki I18N is aimed at page editing and display rather than at L10N of TWiki's text output, but that could change if a Perl developer starts writing some patches for this.

-- RichardDonkin - 14 Sep 2003

Localisation: translation of TWiki documentation

See TranslationSupport for more recent discussion.

UTF-8 support in URLs (NEW!)

Significant progress has been made here, so you can now use UTF-8 URLs with virtually any site character set - see EncodeURLsWithUTF8. Now in TWikiAlphaRelease.

-- RichardDonkin - 19 Jan 2004

New for Dakar

  1. Strikeout of Edit and Attach links in edit and preview pane made language-insensitive
  2. Made the "Add form..." and "Replace form..." buttons configurable in templates -- TW


Some refactoring of this page to reflect current work on UserInterfaceInternationalisation being done for DakarRelease and highlight optional FirefoxBrowser setup for more readable display of URLs in the URL bar, and remove or de-emphasise historic info.

-- RichardDonkin - 03 Oct 2005

I'm just wondering if there is a possibility to disable language selection through browser identification and just stick to English. Is there a variable to completely disable that stuff?

-- GerdMeison - 04 Nov 2005

yep; {UseInternationalisation}

-- CrawfordCurrie - 04 Nov 2005

I'm sorry, Crawford, but a "grep -R UseInternationalisation" on my dakar-install doesn't find anything. In which file should that be written? My wiki has a default internationalisation on the user-interface. It's only that part which I want to have always in English.

-- GerdMeison - 04 Nov 2005

{UseInternationalisation} is an option in the configure interface. Check out lib/TWiki.cfg

-- AntonioTerceiro - 04 Nov 2005

No, I mean, it was just renamed to {UserInterfaceInternationalisation} (in SVN).

-- AntonioTerceiro - 06 Nov 2005

Does working (i18n) code exist for capitalizing wiki word to WikiWord?

-- ArthurClemens - 22 Mar 2006

There's some code in SVN:TWiki/Render.pm that looks like this - it will work with I18N as long as locales are properly set up, but it probably won't work in 'locale regexes off' mode:

    # Turn spaced-out names into WikiWords - upper case first letter of
    # whole link, and first of each word. TODO: Try to turn this off,
    # avoiding spaces being stripped elsewhere
    $theTopic =~ s/^(.)/\U$1/;
    $theTopic =~ s/\s([$TWiki::regex{mixedAlphaNum}])/\U$1/go;

So this is something of an I18N bug - requires code that uses upperNational and lowerNational to do upper-casing, which is not trivial since some lower case letters don't exist as upper case (e.g. German ß). Probably not worth fixing unless someone has this issue and the time to fix it.

-- RichardDonkin - 30 Mar 2006
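A locale-free version of the upper-casing discussed above could look roughly like the following. This is only a sketch of the idea, not a patch against Render.pm: the two national character lists are assumed to be parallel (same order, same length), the function names are made up, and characters with no upper-case partner, such as the German sharp s, are simply left alone.

```perl
use strict;
use warnings;

# Hypothetical parallel lists: each lower-case national character and its
# upper-case counterpart at the same position. Sharp s (\x{DF}) is omitted
# because it has no single-character upper-case form in ISO-8859-1.
my $lowerNational = "\x{E0}\x{E7}\x{E9}\x{F6}";
my $upperNational = "\x{C0}\x{C7}\x{C9}\x{D6}";

my %toUpper;
@toUpper{ split //, $lowerNational } = split //, $upperNational;

# Upper-case one character without consulting the locale.
sub upper_char {
    my ($char) = @_;
    return $toUpper{$char} if exists $toUpper{$char};
    return $char =~ /^[a-z]$/ ? uc($char) : $char;
}

# Turn a spaced-out name into a WikiWord: capitalise the first letter of
# the whole name and of each following word, removing the spaces.
sub make_wiki_word {
    my ($name) = @_;
    $name =~ s/^(.)/upper_char($1)/e;
    $name =~ s/\s+(.)/upper_char($1)/ge;
    return $name;
}

print make_wiki_word('web home'), "\n";    # WebHome
```

Unlike the \U approach, this never produces an unexpected character on a system with a broken locale, but it only knows about the characters that have been listed.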

I installed the TWiki DakarRelease, but found that Chinese topic titles do not display correctly. Moreover, they break the page format. I copied the page TWikiQickStart from http://www.stlchina.org (a Chinese TWiki site), but it does not display the same way as it does on stlchina. Please check the attached file for details.

-- ZhengLingxiang - 05 Apr 2006

It's best if you create a new support request under the Support web. See SupportGuidelines on how to do this.

Your raw.txt attachment is quite interesting - it is using either GB2312 or GBK character encoding. Neither of these is supported by TWiki (see JapaneseAndChineseSupport for details) since there are some Chinese characters that include ASCII characters that are processed (parsed) by TWiki (e.g. [), which will cause your page text to be displayed incorrectly.

From your configure.htm output, it seems you are using UTF-8, which explains why pasting in text in GBK didn't work.

-- RichardDonkin - 05 Apr 2006
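The point about GBK characters containing ASCII bytes such as [ can be verified with a short scan. The sketch below uses the standard Encode module, where GBK is available under the name cp936, and searches the basic CJK ideograph range for characters whose encoded form includes a given ASCII character. The function name is an invention for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return up to $limit code points of CJK ideographs whose GBK encoding
# contains the given ASCII character. 'cp936' is Encode's name for GBK.
# (gbk_chars_containing is a hypothetical helper, not part of TWiki.)
sub gbk_chars_containing {
    my ( $ascii, $limit ) = @_;
    my @found;
    for my $cp ( 0x4E00 .. 0x9FA5 ) {
        my $bytes = encode( 'cp936', chr($cp) );
        next unless length($bytes) == 2;    # skip anything not encoded as two bytes
        if ( index( $bytes, $ascii ) >= 0 ) {
            push @found, $cp;
            last if @found >= $limit;
        }
    }
    return @found;
}

for my $cp ( gbk_chars_containing( '[', 3 ) ) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, encode( 'cp936', chr($cp) );
    printf "U+%04X is stored as bytes %s\n", $cp, $hex;
}
```

Any character this reports would be seen by a byte-oriented parser as a lead byte followed by a literal [, which is how page text in such an encoding ends up displayed incorrectly.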

The text is saved in UTF-8 format in the wiki page. If I just preview the topic while editing, everything works fine. But after I save it, the page cannot be displayed properly. The raw.txt is in GBK only because I saved it in that format.

-- ZhengLingxiang - 05 Apr 2006

I'll need more information to help further - the exact error case you are seeing needs to be clearly explained. I don't read Chinese, so please be very specific as to exactly which characters don't work. SupportGuidelines is a good place to start.

-- RichardDonkin - 05 Apr 2006

I did some more tests and created a new support page, ChineseHeadlineBrokenPageFormat.

-- ZhengLingxiang - 06 Apr 2006

As far as I can see, the site lang doesn't get used. I changed line 141 to: my $userLanguage = _normalize_language_tag($session->{prefs}->getPreferencesValue('LANGUAGE')) || $TWiki::cfg{Site}{Lang}; Now it will use the site lang if there is no user pref.

-- AdamHyde - 08 May 2008
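A fallback of the kind proposed above depends on Perl's logical OR (||), which returns the first operand that is true. The bitwise OR (|) looks similar but combines two strings byte by byte, so it cannot be used for this purpose. The sketch below shows the difference. The variable $siteLang merely stands in for the site language setting, and pick_language is a made-up name.

```perl
use strict;
use warnings;

my $siteLang = 'en';    # stands in for the site-wide language setting

# Logical OR: returns the user preference if it is set and non-empty,
# otherwise the site default. (pick_language is a hypothetical helper.)
sub pick_language {
    my ($userPref) = @_;
    return $userPref || $siteLang;
}

print pick_language('fr'), "\n";    # user preference wins
print pick_language(''),   "\n";    # falls back to the site language

# Bitwise OR on two strings ORs the bytes together instead, giving a
# result that is neither 'fr' nor 'en'.
print 'fr' | 'en', "\n";
```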

Correct - and it has been removed.

-- CrawfordCurrie - 31 May 2008

The Lang (more recently Site Lang) was intended for future use when we eventually supported multiple languages, but this was never implemented.

-- RichardDonkin - 14 Jun 2008

Topic attachments
Attachment | History | Size | Date | Who | Comment
configure.html.zip | r1 | 29.4 K | 2006-04-05 - 08:40 | UnknownUser | twiki configure of my server
format_wrong.JPG | r1 | 200.0 K | 2006-04-05 - 08:44 | UnknownUser | wrong page display
raw.txt | r1 | 5.5 K | 2006-04-05 - 08:44 | UnknownUser | raw text of the chinese version TWikiQickStart on stlchina
testpage.txt | r1 | 1.2 K | 2002-12-03 - 13:27 | UnknownUser | Test page for i18n (ISO-8859-1)
Topic revision: r138 - 2008-06-14 - RichardDonkin