
Internationalisation guidelines

This document is targeted at developers (core code, plugin code and user interface developers). If you are looking for instructions on configuring your TWiki to work with your local language, see InstallationWithI18N

The Good News

It's easy to do

  • It really is easy to internationalise your TWiki plugin or core code, so it works with almost any language, not just English!
    • Typically, only a few lines need to change, in very simple ways
  • All the hard work has been done for you already.

It will help your plugin, and TWiki

  • Using I18N lets your plugin (and TWiki) be much more widely used, meaning more feedback, patches and general goodness
  • The I18N regexes make your code more flexible, e.g. there's now a single place to change the definition of a WikiWord across all plugins.
  • They also work across systems from Windows to Linux and even mainframes, and across Perl versions and browsers

What you need to do

You do need to read this page and be a little careful when using regular expressions - any time you see [A-Z], \w or \b, a little alarm bell should go off in your head saying "I should use one of the I18N-aware regular expressions instead".

Goal of this page

TWiki has reasonable Internationalisation (I18N) support today (see InternationalisationEnhancements and UserInterfaceInternationalisation). However, plugin developers and core coders need guidelines to avoid I18N issues in the future, and to ensure their plugins are widely used, not just by people who speak English as their first language.

I18N code has tended to regress over the last few years - partly due to lack of unit tests, but also due to new code being written that doesn't follow these guidelines.

Overview

Internationalisation in TWiki includes use of locales to ensure WikiWords and other page contents work with international characters (e.g. GrødWeb.BlåBærGrød), which means avoiding all use of [A-Z] or \w in regular expressions (except when you really mean A to Z of course, e.g. variable names that only use ASCII alphanumeric).

Fortunately, the changes required to your code are quite simple, and "all the hard work has been done" (to quote SteffenPoulsen!)

UserInterfaceInternationalisation guidelines are now included below - these show you how to make your user interface text ("message strings") internationalised, so that it can be translated as part of UserInterfaceLocalisation efforts.

What if you don't bother?

Unless you are careful about using regular expressions ('regexes') that match alphabetic characters, your plugin or core module probably won't work for users in languages other than English.

Guidelines

Preparing a plugin or core module for I18N

There's one simple thing you absolutely must do in your plugin or core module to allow for I18N. You have to make sure the plugin can "see" the regular expressions set up in TWiki.pm, and that it uses locale information.

You can do this by adding the following lines to the plugin, somewhere after the package declaration (code assumes TWiki 4.0 or higher, CairoRelease is similar but you might as well update your plugin to work with 4.x). This is from Plugins:InterwikiPlugin, which is part of the core and a good example:

BEGIN {
    # Do a dynamic 'use locale' for this module
    if( $TWiki::cfg{UseLocale} ) {
        require locale;
        import locale();
    }
}

Now everything is set up for using the regular expressions from TWiki.pm. This means that you can just write $TWiki::regex{wikiWordRegex} instead of figuring out your own matching rules for matching a WikiWord, across the many sorts of broken I18N locales and different Perl versions. A huge amount of testing and debugging is encapsulated for you, and you are also guaranteed that your regex code will work in future when TWiki has UnicodeSupport!

Fixing regular expressions that match letters

The main thing to do when internationalising plugins is to never use [A-Z] except when you really mean A to Z, ASCII only (e.g. in TWikiVariable names perhaps). Also, you should never, ever use \w or \b: these match 'words' based on locales and I18N characters, but they don't work in many environments that have broken locales (including all Windows systems!). Instead, read on for how to write simple regexes that do the same thing portably across a wide variety of systems.

Whenever you see these patterns, ask yourself 'should this match accented characters as well as A-Z?' and 'is this really trying to match a page (topic) name or a web name'? (Page names are normally WikiWords, but not always.)

Once you have identified these problem areas, you need to use the regexes carefully crafted in the startup code in TWiki.pm. These regexes work across Perl 5.005_03, Perl 5.6, and 5.8 or higher, including environments with very broken Perl locales (e.g. Windows), so they are your best option for cross-version and cross-platform I18N support. Code using these regexes will also work with future UnicodeSupport when that's implemented, despite the actual regexes changing dramatically.
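To see why a hard-coded [A-Z] is a problem, here is a minimal, self-contained sketch. It does not use TWiki at all: the %demoRegex hash and its ISO-8859-1 ranges are assumptions made up for this illustration, standing in for the strings that TWiki.pm works out at startup.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the strings TWiki.pm builds at startup.
# Assumption: an ISO-8859-1 site, so accented letters are the bytes
# 0xC0-0xFF (minus the multiplication and division signs, 0xD7 and 0xF7).
my %demoRegex = (
    upperAlpha => 'A-Z\xc0-\xd6\xd8-\xde',
    lowerAlpha => 'a-z\xdf-\xf6\xf8-\xff',
);

# "BlåBærGrød" written out as ISO-8859-1 bytes
my $topic = "Bl\xe5B\xe6rGr\xf8d";

# ASCII-only pattern: fails, because byte 0xE5 (å) is not in a-z
my $asciiOnly = ( $topic =~ /^(?:[A-Z]+[a-z]+){2,}$/ ) ? 1 : 0;

# Same pattern built from the class strings: accented letters count as letters
my $i18nAware =
  ( $topic =~ /^(?:[$demoRegex{upperAlpha}]+[$demoRegex{lowerAlpha}]+){2,}$/ ) ? 1 : 0;

print "ASCII-only: $asciiOnly, I18N-aware: $i18nAware\n";
```

In real code you would not define these strings yourself. Taking them from TWiki is the whole point, since the right values depend on the platform and locale.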

Quickly find problematic regexes - NEW

To easily check core code or plugins for possible use of regexes that don't take account of I18N, just run the following one-liner under Linux or Cygwin - it will search all *.pm files including any subdirectories for use of [a-z], \w and \b, which are all potential issues unless you really mean A to Z without any international characters:

    find . -name '*.pm' | xargs egrep -in '\[a-z\]|\\w|\\b' >regex-warnings.txt

Some editors such as VimEditor and EmacsEditor can take this output file and help you easily navigate to the right place to see the code in context.
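On systems without find or egrep, roughly the same check can be sketched in Perl. The find_suspect_lines helper below is hypothetical (it is not part of TWiki) and works on a string of source code. Reading the *.pm files and feeding them in is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Return [line number, line text] pairs for every line that mentions
# [a-z], \w or \b (case-insensitively, like egrep -i).
sub find_suspect_lines {
    my ($source) = @_;
    my @hits;
    my $lineNo = 0;
    for my $line ( split /\n/, $source ) {
        $lineNo++;
        push @hits, [ $lineNo, $line ] if $line =~ /\[a-z\]|\\w|\\b/i;
    }
    return @hits;
}

# Three sample lines of plugin code; only the last two should be reported
my $sample = join "\n",
    'my $ok   = ( $x =~ /^[$mixedAlphaNum]+$/ );',
    'my $bad1 = ( $x =~ /^[A-Z]\w+$/ );',
    'my $bad2 = ( $x =~ /\bfoo/ );';

my @hits = find_suspect_lines($sample);
printf "%d: %s\n", @$_ for @hits;
```

As with the one-liner, a hit is only a hint: each reported line still needs a human to decide whether ASCII-only matching was intended.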

Types of pre-defined regexes

The startup code in TWiki.pm pre-defines a number of complete regexes, as well as strings for building character classes inside your own regexes. Naming is used to distinguish these - examples are from the point of view of the calling code:

  • Complete regexes are compiled using qr/.../ and can be used as part of larger regexes, or as is. They are named fooRegex, e.g. $TWiki::regex{wikiWordRegex} and are usually 'concept regexes' that match email addresses, WikiWords, etc.
  • Strings for use in character classes (i.e. within [....] in a regex) are just strings and must be used only in character classes. They are named foo, i.e. no Regex suffix - for example $TWiki::regex{mixedAlphaNum}. On a Perl platform with broken I18N locales, this would be the string "a-zA-Z0-9" - note no square brackets!
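The difference between the two kinds of entry can be sketched as follows. The %demoRegex hash is a hypothetical, ASCII-only stand-in for %TWiki::regex, and the WikiWord regex in it is a simplified assumption rather than TWiki's real definition.

```perl
use strict;
use warnings;

# Hypothetical ASCII-only stand-ins for entries in %TWiki::regex
my %demoRegex = (
    upperAlpha    => 'A-Z',
    lowerAlpha    => 'a-z',
    mixedAlphaNum => 'a-zA-Z0-9',
);

# A complete regex, compiled with qr// from the class strings
# (simplified; not TWiki's real WikiWord definition)
$demoRegex{wikiWordRegex} =
  qr/[$demoRegex{upperAlpha}]+[$demoRegex{lowerAlpha}]+[$demoRegex{upperAlpha}]+[$demoRegex{mixedAlphaNum}]*/;

# A complete regex can be used as is...
my $isWikiWord = ( 'WebHome' =~ /^$demoRegex{wikiWordRegex}$/ ) ? 1 : 0;

# ...or as part of a larger regex
my ( $web, $topic ) =
  'Main.WebHome' =~ /^([$demoRegex{mixedAlphaNum}]+)\.($demoRegex{wikiWordRegex})$/;

# A class string only makes sense inside [...]
my $insideClass = ( 'x9' =~ /^[$demoRegex{mixedAlphaNum}]+$/ ) ? 1 : 0;

# Without the brackets it is just the literal text "a-zA-Z0-9"
my $outsideClass = ( 'x9' =~ /^$demoRegex{mixedAlphaNum}+$/ ) ? 1 : 0;

print "isWikiWord=$isWikiWord web=$web topic=$topic ",
  "insideClass=$insideClass outsideClass=$outsideClass\n";
```

The last case is the classic mistake: forgetting the square brackets does not raise an error, it just silently stops matching.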

Fixing core code regexes

This is similar to the plugin code below, but a bit less verbose as you have direct access to regexes without going through the plugin API. For example, you would change the following:

        if( $topic =~ /^\^\([\_\-a-zA-Z0-9\|]+\)\$$/ ) {

Into:

        if( $topic =~ /^\^\([\_\-$TWiki::regex{mixedAlphaNum}\|]+\)\$$/ ) {

That's still not hugely readable, but more complex regexes can be greatly simplified. The following code is much more readable than using a-z etc, as well as working for I18N:

       $anchorName =~ s/($TWiki::regex{wikiWordRegex})/_$1/go;

      # Prevent automatic WikiWord or CAPWORD linking in explicit links
      $link =~ s/(?<=[\s\(])($TWiki::regex{wikiWordRegex}|[$TWiki::regex{upperAlpha}])/<nop>$1/go;

Fixing plugin regexes

Plugin code is a bit more verbose than core code as it must first get the regexes via the Plugin API - here's an example adapted from Plugin:InterwikiPlugin:

    # Regexes for the Site:page format InterWiki reference
    my $mixedAlphaNum = TWiki::Func::getRegularExpression('mixedAlphaNum');
    my $upperAlpha = TWiki::Func::getRegularExpression('upperAlpha');
    $sitePattern    = "([$upperAlpha][$mixedAlphaNum]+)";
    $pagePattern    = "([${mixedAlphaNum}_\/][$mixedAlphaNum" . '\.\/\+\_\,\;\:\!\?\%\#-]+?)';

Regex efficiency

You may also want to use /o on your regex, or compile it using $fooPatternRegex = qr/$someRegexVar/, which should give better performance if used more than once, e.g. in loops or when running under ModPerl. See perldoc perlop and perldoc perlre for details, and don't use this if the regex (not the substitution right-hand-side) includes 'real' variables that vary between invocations of your code, e.g. user name.
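A minimal sketch of the qr/.../ approach, assuming ASCII stand-ins for the strings that TWiki::Func::getRegularExpression would return (the values and variable names here are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical ASCII stand-ins for the strings returned by
# TWiki::Func::getRegularExpression
my $upperAlpha    = 'A-Z';
my $mixedAlphaNum = 'a-zA-Z0-9';

# Compile once, outside the loop: the class strings do not change
# while the code is running, so there is nothing to re-interpolate
my $siteRegex = qr/^[$upperAlpha][$mixedAlphaNum]+$/;

my @candidates = qw(Wiki 9lives Google x);
my @sites      = grep { $_ =~ $siteRegex } @candidates;

print join( ', ', @sites ), "\n";
```

Compared with /o, a qr// object makes the "compiled once" intent visible at the point of definition, and it can be reused inside other regexes.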

International message strings

As TWiki supports user interface internationalization, you should now avoid putting English language strings directly into Perl code. In addition, you should follow the main InternationalisationGuidelines to ensure that regular expressions and other code work well across multiple languages and locales (i.e. countries or regions).

The TWiki::I18N class encapsulates message text internationalisation, and the i18n field of the TWiki session object is an instance of this class. Thus, wherever you might need to write an English string inside Perl code, you must write it wrapped in a call to the TWiki::I18N::maketext method, like this:

# $session is an instance of TWiki class
my $msg = $session->{i18n}->maketext("Access denied: you don't have access for editing this topic.");

You can also interpolate parameters into the text, and let the translator correctly translate messages while keeping a place for your parameters. Just write placeholders for the parameters, numbered by their order in the maketext call: [_1] for the first parameter, [_2] for the second, and so on. Then you can do things like this:

# $session is an instance of TWiki class
my $msg = $session->{i18n}->maketext("This is topic [_1] on the web [_2].", $topic, $web);

Note that translators can change the order in which parameters appear in translated text (e.g. [_2] appearing before [_1]), but they must keep the text's semantics, so that substituting the first parameter into [_1] and the second parameter into [_2] says the same thing as the original, whatever order they are in.
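The reordering can be tried out without a TWiki installation. The sketch below uses Perl's standard Locale::Maketext module, which understands the same [_1] bracket notation. The Demo::L10N packages and the German wording are made up for this illustration, and nothing here should be read as how TWiki::I18N is implemented internally.

```perl
use strict;
use warnings;

# Hypothetical localisation classes, defined inline for the demo
package Demo::L10N;
use parent 'Locale::Maketext';

package Demo::L10N::en;
our @ISA     = ('Demo::L10N');
our %Lexicon = ( _AUTO => 1 );    # English: use the key itself as the text

package Demo::L10N::de;
our @ISA     = ('Demo::L10N');
our %Lexicon = (
    # The translator has swapped the order of the two placeholders
    'This is topic [_1] on the web [_2].' =>
      'Im Web [_2] liegt das Thema [_1].',
);

package main;

my $en = Demo::L10N->get_handle('en') or die "no English handle";
my $de = Demo::L10N->get_handle('de') or die "no German handle";

# The calling code is identical for both languages
my @args  = ( 'This is topic [_1] on the web [_2].', 'WebHome', 'Main' );
my $msgEn = $en->maketext(@args);
my $msgDe = $de->maketext(@args);

print "$msgEn\n$msgDe\n";
```

The calling code passes the parameters in one fixed order, and each lexicon entry decides where they end up in the sentence.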

See UserInterfaceInternationalisation for guidelines for writing message strings that can be translated.

Testing your fixes

Don't forget to test your code across a number of different I18N areas:

  • Page (WikiWord) and web names with I18N characters
  • Page contents with I18N characters - usually not a problem
  • Attachments with I18N characters in filename or the topic/web that contains attachment
  • Searching for I18N characters - especially if external programs used
  • Sorting to include I18N characters - whether using internal Perl code or external programs

If there is a valid locale that works within Perl, most things should 'just work' once you have fixed the regexes. However, on Windows and other platforms where locales are broken in Perl terms, you will only be able to do I18N for page contents and page/web names.

Example of code that needs fixing

Taking the original UpdateInfoPlugin as an example (now fixed...) - this uses \w to match a WikiWord (which is actually incorrect anyway, as it will match non-WikiWords!), when it should use the relevant WikiWord regex via the plugin API. You frequently find that you end up fixing other bugs when adding I18N support, because it forces you to look closely at the regexes.


Discussion

Any comments on how the I18N documentation is written or could be improved


Richard, I took surprise in realizing just how easy it really is to update the plugins - you did all the thinking already smile

-- SteffenPoulsen - 25 Feb 2005, 01 Apr 2005, 02 Apr 2005

(UserInterfaceInternationalisation:) I think I have now moved all of the error and warning message texts - including those generated inline - into templates. In theory, it should now be possible to switch language just through switching skin & help topics....

-- CrawfordCurrie - 06 Apr 2005

Updated to deprecate use of \w - using \w is almost always the wrong thing to do, so don't do it...

-- RichardDonkin - 24 Jul 2006

Major update to this page, restructuring and using more examples based on TWiki 4.x code. Have also linked from Plugins.ReadmeFirst and CodingStandards, and UserInterfaceInternationalisation.

Any suggestions for updates or linkage?

-- RichardDonkin - 10 Nov 2006

String truncation should be fixed. It appears not to be easy because UTF-8 is a multibyte encoding. When a UTF-8 encoded word passes through such code, a character may be split into invalid parts, and this happens regularly, at least for any Cyrillic language (see Bugs:Item3574 for example).

Here are some code samples which may cause such an error:

Renders.pm:

    $anchorName =~ s/^(.{32})(.*)$/$1/; # limit to 32 chars - FIXME: Use Unicode chars before truncate

Search.pm:

    $searchString = substr($searchString, 0, 1500);

Prefs/PrefsCache.pm

        $val =~ s/^(.{32}).*$/$1..../s;

What should be done with such code? Is it the right way to wrap it as follows?

--- Search.pm.orig      2007-02-05 07:13:00.000000000 +0300
+++ Search.pm   2007-02-10 21:10:39.000000000 +0300
@@ -144,7 +144,14 @@
     $searchString = $1;

     # Limit string length
-    $searchString = substr($searchString, 0, 1500);
+    if( defined($TWiki::cfg{Site}{CharSet}) && ( $] >= 5.008 ) &&
+          $TWiki::cfg{Site}{CharSet} =~ /^utf-?8/i ) {
+            require Encode;
+            $searchString = Encode::decode('utf8', $searchString);
+            $searchString = substr($searchString, 0, 1500);
+            $searchString = Encode::encode('utf8', $searchString);
+          }
+    else{ $searchString = substr($searchString, 0, 1500);}
 }

 =pod

Another question: in the same file I also need

@@ -662,6 +669,9 @@

         } else {

+            require POSIX;
+            import POSIX qw( locale_h LC_CTYPE LC_COLLATE);
+            setlocale(&LC_COLLATE, $TWiki::cfg{Site}{Locale});
             # simple sort, see Codev.SchwartzianTransformMisused
             # note no extraction of topic info here, as not needed
             # for the sort. Instead it will be read lazily, later on.

I do not understand why LC_COLLATE does not carry over from TWiki.pm to this code, but it does not, and I really need such a line here and in TablePlugin/Core.pm and in Users/TWikiUserMapping.pm. It would probably be better to wrap the added lines in an if( defined($TWiki::cfg{UseLocale}) ) { block.

-- SergejZnamenskij - 19 Feb 2007

Sergej - have commented over on Bugs:Item3574. We can do string truncation very simply with regexes in my view if we are using UTF-8 as site charset, without calling CPAN:Encode - regexes would be less code, though perhaps not as fast. Encode would work better if using a non-UTF-8 site charset, although most multi-byte charsets other than UTF-8 should not be used anyway (see JapaneseAndChineseSupport).

As for LC_COLLATE - not sure why you need this - you can just do use locale to get Perl collation/sorting to use I18N, and this has no effect on Unix sort commands (which TWiki doesn't use anyway). See the section above on 'preparing plugins and modules', or UsingPerlLocalesTheRightWay, for how to do the "dynamic use-locale" correctly - should already be done in all modules, but perhaps some new modules have omitted this without anyone noticing.

-- RichardDonkin - 20 Feb 2007
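The regex-based truncation mentioned in the comment above could look something like the following sketch. It assumes the string is held as raw UTF-8 bytes (not decoded into Perl characters). The truncate_utf8_bytes helper is hypothetical, not code from TWiki or from Bugs:Item3574. Encode is used here only to build the sample data.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: keep at most $maxChars characters of a UTF-8 byte string
sub truncate_utf8_bytes {
    my ( $bytes, $maxChars ) = @_;

    # One character is either an ASCII byte, or a lead byte (0xC0-0xFF)
    # followed by its continuation bytes (0x80-0xBF)
    my $char = qr/[\x00-\x7f]|[\xc0-\xff][\x80-\xbf]*/;

    return $bytes =~ /^((?:$char){0,$maxChars})/ ? $1 : '';
}

# Sample data: six Cyrillic letters, two bytes each in UTF-8
my $bytes =
  Encode::encode( 'UTF-8', "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}" );

my $naive = substr( $bytes, 0, 3 );             # cuts the second letter in half
my $safe  = truncate_utf8_bytes( $bytes, 3 );   # three whole letters, six bytes

printf "naive: %d bytes, safe: %d bytes\n", length($naive), length($safe);
```

The pattern does not validate the input: if it meets a stray continuation byte it simply stops there and returns the shorter prefix, which may or may not be what you want.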

Thank You, Richard!

As for LC_COLLATE - I just use the current TWiki 4.11 and did not look at UsingPerlLocalesTheRightWay, as I considered it to be fixed in the current version, but it really was not fixed frown

The UTF8 problem seems to have become much more serious than might be supposed. All the Russian people I know who installed TWiki last year tried to use utf8 and could not repair a completely broken site (Bugs:Item3574 is just one case). Nobody tries a one-byte charset now. We are losing customers and collaborators as a result. There is a simple and robust solution to apply, but nobody can find and use it.

The main problem AFAIK is that there are completely different use cases which cannot be mixed and which require special coding in some places:

  1. Use locale for one-byte charset (fast)
  2. Perl 5.6 utf8 level support (regexps for string truncating and probably non-perfect sorting in 5.8)
  3. Perl 5.8 Use unicode
It is so complicated to support all of these in the same code that it does not seem the right way to go. Keeping several independent branches in a complete, updated and working state is very hard work.

What about changing the structure as follows?

  • Produce separate installation and upgrade packs for installs 1-3.
  • Currently 1 should be an SVN branch, with 2 and 3 maintained as batches of patches to this branch. This would make it possible to properly maintain and test both 2 and 3 against an updated tree. The choice of main branch should reflect the main TWiki site install.

Is there another effective framework idea?

-- SergejZnamenskij - 21 Feb 2007

Re UTF8 problem - still don't understand why Russian users of TWiki are using UTF-8 today, as WikiWords don't work. Please use KOI8-R instead as recommended in CyrillicSupport, it works very well until we get real UnicodeSupport, and TWiki does not support UTF-8 for alphabetic languages (see InstallationWithI18N).

On the use cases - I don't agree that we need independent branches or patches for this code, which would be a lot more work to maintain. I suggest we ignore use case 2 for UnicodeSupport (i.e. Perl 5.6) - anyone who wants Unicode must simply upgrade to a recent version of Perl 5.8.x, since there are huge numbers of bug fixes and performance improvements.

So now we just need to support the traditional I18N code and the newer UnicodeSupport code. Most of the difference is in setup (use open, locales, etc), and can largely be confined to a few places - there may well be some Perl tricks that let us put this code in a special module that still has the required effect on the 'main' module that it's called from.

I think the Unicode work should be done on a development branch, since it can be quite difficult to get working and touches many modules, but should be frequently re-synced from the mainstream development. Also, if it can be broken into phases, it could be implemented in phases, as long as it's always protected by an 'if in Unicode mode' condition.

-- RichardDonkin - 22 Feb 2007

Probably you are right (I have not used SVN before). As for CyrillicSupport in koi8-r for the current 4.11, the two remaining problems are sorting (LC_CTYPE carries over from TWiki.pm but LC_COLLATE is lost; the patches above fix it for me) and Cyrillic UserName registration, which does not work for now. I did not report either as a bug, as I think Unicode support will be good enough in the near future and it has greater priority.

-- SergejZnamenskij - 22 Feb 2007

Interesting that you need to set LC_COLLATE via setlocale - this seems to be a bug in the I18N code, not sure when it was introduced as I did test this originally with sorting. Perhaps it got taken out as part of code cleanup without re-testing. This does need fixing and looks like it would affect all locales. Also seen in LdapPluginDev.

As for Cyrillic usernames, I didn't implement I18N usernames because Apache basic authentication only permits ASCII (or at least it did when I wrote the code). If that's changed, it would be quite easy to change the username regex to permit this - probably a one line fix. This has been reported before, so please Google for earlier reports and check InternationalisationIssues.

It would be good to report both of those as bugs, but particularly the first one, and submit patches, as that would help improve the current code - even with UnicodeSupport done, not everyone will turn this on due to performance and probably Unicode issues in Perl, so it's worth fixing this.

On Unicode - please read ProposedUTF8SupportForI18N - you would need to implement Unicode collation (Phase 3) to get the same sort of sorting you get today with locales (or will when LC_COLLATE is set). Personally I would focus on Phase 2 first (basic Unicode support) to get the core working, and then add features like collation and UnicodeNormalisation.

-- RichardDonkin - 23 Feb 2007

The LoginName was pure English, only the WikiUserName was in Cyrillic - see Bugs:Item3679

It seems that testing in Cyrillic reveals some general localisation bugs much faster than testing in Latin languages.

-- SergejZnamenskij - 23 Feb 2007

I was talking about WikiUsernames. Just had a look at Users.pm and it looks like it's using the correct regex - TWiki::regex{wikiWordRegex}$. Are you sure your webserver setup permits non-ASCII user names? That was the reason I didn't implement this originally.

Cyrillic is a very good test case for I18N because every letter is non-ASCII, rather than only accented letters as with most Latin languages.

-- RichardDonkin - 24 Feb 2007

In TWiki4 the webserver has nothing to do with the WikiUserName if the user's LoginName differs. The user can use a Latin LoginName to log in and authorize, and a quite different WikiUserName will be associated with him via analysis of the TWikiUsers topic in TWikiUserMapping.pm. This function works fine for Latin but does not work for Cyrillic. See details in Bugs:Item3679.

-- SergejZnamenskij - 24 Feb 2007

Ah, I see - never tested Latin usernames + I18N WikiUsernames. Will comment on that bug page.

-- RichardDonkin - 25 Feb 2007

I do Russian (non-TWiki) pages with a windows-1251 heading:

<?xml version="1.0" encoding="windows-1251"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head xml:lang="ru" lang="ru"><meta http-equiv="content-type" content="text/html; charset=windows-1251" xml:lang="ru" lang="ru" />

Any browser displays such pages correctly. They validate perfectly and I can read and edit the code with any editor, as it should be.

The question is how to make TWiki generate pages with the above Windows-1251 heading?

-- DimitriRytsk - 11 May 2007

Dimitri - It's best to raise this as a Support question, but the short answer is that if you can install or generate a Unix/Linux locale on the TWiki server that uses Windows-1251, you can have TWiki use this locale so that all pages will be in this character set. However, this may not be too easy, as Windows-1251 is of course Windows-specific, and may not work so well on Linux servers, whereas KOI8-R is platform-independent.

-- RichardDonkin - 13 May 2007

Interesting that revision 31 of this topic, by Dimitri, seems to have caused the verbatim/code sections to be collapsed into a single line for each section, and has also put spaces at the start of many other lines. Not sure if this is a TWiki issue or something else... I've fixed this by reverting whole topic to rev 30 and pasting in more recent comments - the corruption was quite extensive so I'm sure it was not a deliberate edit.

-- RichardDonkin - 25 May 2007

Minor update to Motivation etc above.

-- RichardDonkin - 23 Aug 2007

More updates adding the Good News section etc.

-- RichardDonkin - 14 Sep 2007

I've added a new section above to quickly check core code or plugins for possible use of regexes that don't take account of I18N - just run the following one-liner under Linux or Cygwin - it will search all *.pm files including any subdirectories for use of [a-z], \w and \b, which are all potential issues unless you really mean A to Z without any international characters:

    find . -name '*.pm' | xargs egrep '\[a-z\]|\\w|\\b' >regex-warnings.txt

-- RichardDonkin - 14 Jun 2008

Topic revision: r40 - 2008-06-15 - RichardDonkin
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.