Internationalisation guidelines
This document is targeted at developers (core code, plugin code and user interface developers). If you are looking for instructions on configuring your TWiki to work with your local language, see
InstallationWithI18N
The Good News
It's easy to do
- It really is easy to internationalise your TWiki plugin or core code, so it works with almost any language, not just English!
- Typically, only a few lines need to change, in very simple ways
- All the hard work has been done for you already.
It will help your plugin, and TWiki
- Using I18N lets your plugin (and TWiki) be much more widely used, meaning more feedback, patches and general goodness
- The I18N regexes make your code more flexible - e.g. there's now a single place to change the definition of a WikiWord across all plugins.
- They also work across systems from Windows to Linux and even mainframes, and across Perl versions and browsers
What you need to do
You do need to read this page and be a little careful when using regular expressions - any time you see
[A-Z]
,
\w
or
\b
, a little alarm bell should go off in your head saying "I should use one of the
I18N-aware regular expressions instead".
Goal of this page
TWiki has reasonable Internationalisation (
I18N) support today (see
InternationalisationEnhancements and
UserInterfaceInternationalisation). However, plugin developers and core coders need guidelines to avoid
I18N issues in the future, and to ensure their plugins are widely used, not just by people who speak English as their first language.
I18N code has tended to regress over the last few years - partly due to lack of unit tests, but also due to new code being written that doesn't follow these guidelines.
Overview
Internationalisation in TWiki includes use of locales to ensure
WikiWords and other page contents work with international characters (e.g.
GrødWeb.BlåBærGrød
), which means avoiding all use of
[A-Z]
or
\w
in regular expressions (except when you really mean A to Z of course, e.g. variable names that only use ASCII alphanumeric).
Fortunately, the changes required to your code are quite simple, and
"all the hard work has been done" (to quote
SteffenPoulsen!)
UserInterfaceInternationalisation guidelines are now included below - these show you how to make your user interface text ("message strings") internationalised, so that it can be translated as part of
UserInterfaceLocalisation efforts.
What if you don't bother?
Unless you are careful about using regular expressions ('regexes') that match alphabetic characters, your plugin or core module probably won't work for users in languages other than English.
Guidelines
Preparing a plugin or core module for I18N
There's one simple thing you absolutely must do in your plugin or core module to allow for
I18N. You have to make sure the plugin can "see" the regular expressions set up in
TWiki.pm
, and that it uses locale information.
You can do this by adding the following lines to the plugin, somewhere after the package declaration (code assumes TWiki 4.0 or higher,
CairoRelease is similar but you might as well update your plugin to work with 4.x). This is from Plugins:InterwikiPlugin, which is part of the core and a good example:
BEGIN {
    # Do a dynamic 'use locale' for this module
    if( $TWiki::cfg{UseLocale} ) {
        require locale;
        import locale();
    }
}
Now everything is set up for using the regular expressions from
TWiki.pm
. This means that you can just write
$TWiki::regex{wikiWordRegex}
instead of figuring out your own matching rules for matching a
WikiWord, across the many sorts of broken
I18N locales and different Perl versions. A huge amount of testing and debugging is encapsulated for you, and you are also guaranteed that your regex code will work in future when TWiki has
UnicodeSupport!
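To make this concrete, here is a minimal standalone sketch. The values below are the ASCII-only fallbacks (roughly what you would get on a platform with broken locales); on a real install you would use $TWiki::regex{wikiWordRegex} or the plugin API rather than building your own hash:

```perl
# Simplified, ASCII-only stand-ins for the regexes TWiki.pm builds at startup.
# Real code should use $TWiki::regex{...} (core) or
# TWiki::Func::getRegularExpression(...) (plugins) instead.
my %regex;
$regex{upperAlpha}    = 'A-Z';
$regex{lowerAlpha}    = 'a-z';
$regex{mixedAlphaNum} = 'a-zA-Z0-9';
$regex{wikiWordRegex} =
    qr/[$regex{upperAlpha}]+[$regex{lowerAlpha}]+[$regex{upperAlpha}]+[$regex{mixedAlphaNum}]*/;

# A WikiWord check that never hard-codes [A-Z] inline:
for my $word ( 'WebHome', 'lowercase' ) {
    print "$word: ", ( $word =~ /^$regex{wikiWordRegex}$/ ? 'WikiWord' : 'not a WikiWord' ), "\n";
}
```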
Fixing regular expressions that match letters
The main thing to do when internationalising plugins is to
never use
[A-Z]
except when you really mean A to Z, ASCII only (e.g. in TWikiVariable names perhaps). Also, you should
never, ever use
\w
or
\b
since these match 'words' based on locales and
I18N characters, but don't work in many environments that have broken locales (including all Windows systems!). Instead, read on for how to write simple regexes that do the same thing portably across a wide variety of systems.
Whenever you see these patterns, ask yourself 'should this match accented characters as well as A-Z?' and 'is this really trying to match a page (topic) name or a web name?' (Page names are normally
WikiWords, but not always.)
Once you have identified these problem areas, you need to use the regexes carefully crafted in the startup code in
TWiki.pm
. These regexes work across Perl 5.005_03, Perl 5.6, and 5.8 or higher, including environments with very broken Perl locales (e.g. Windows), so they are your best option for cross-version and cross-platform
I18N support. Code using these regexes will also work with future
UnicodeSupport when that's implemented, despite the actual regexes changing dramatically.
Quickly find problematic regexes -
To easily check core code or plugins for possible use of regexes that don't take account of
I18N, just run the following one-liner under Linux or Cygwin - it will search all *.pm files including any subdirectories for use of
[a-z]
,
\w
and
\b
, which are all potential issues unless you really mean A to Z without any international characters:
find . -name '*.pm' | xargs egrep -i '(\[a-z\]|\\w|\\b)' >regex-warnings.txt
Some editors such as
VimEditor and
EmacsEditor can take this output file and help you easily navigate to the right place to see the code in context.
Types of pre-defined regexes
The startup code in
TWiki.pm
pre-defines a number of complete regexes, as well as strings for use in building character classes as part of larger regexes. Naming is used to distinguish these - examples are from the point of view of the calling code:
- Complete regexes are compiled using
qr/.../
and can be used as part of larger regexes, or as is. They are named fooRegex
, e.g. $TWiki::regex{wikiWordRegex}
and are usually 'concept regexes' that match email addresses, WikiWords, etc.
- Strings for use in character classes (i.e. within
[....]
in a regex) are just strings and must be used only in character classes. They are named foo
, i.e. no Regex suffix - for example $TWiki::regex{mixedAlphaNum}
. On a Perl platform with broken I18N locales, this would be the string "a-zA-Z0-9"
- note no square brackets!
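A hedged standalone illustration of the two kinds, using the ASCII fallback values rather than the real locale-aware ones from TWiki.pm:

```perl
# Stand-in values; real code would fetch these from $TWiki::regex or the plugin API.
my $mixedAlphaNum = 'a-zA-Z0-9';                          # character-class *string* - no brackets
my $wikiWordRegex = qr/[A-Z]+[a-z]+[A-Z]+[a-zA-Z0-9]*/;   # complete, compiled regex

# Strings go inside a character class that you build yourself:
my $ok = ( 'Web_Name-1' =~ /^[_\-$mixedAlphaNum]+$/ );

# Complete regexes are used as-is, or embedded in a larger pattern:
my $linked = ( 'see WebHome here' =~ /($wikiWordRegex)/ );
print "ok=$ok linked=$linked matched=$1\n";
```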
Fixing core code regexes
This is similar to the plugin code below, but a bit less verbose as you have direct access to regexes without going through the plugin API. For example, you would change the following:
if( $topic =~ /^\^\([\_\-a-zA-Z0-9\|]+\)\$$/ ) {
Into:
if( $topic =~ /^\^\([\_\-$TWiki::regex{mixedAlphaNum}\|]+\)\$$/ ) {
That's still not hugely readable, but more complex regexes can be greatly simplified. The following code is much more readable than using
a-z
etc, as well as working for
I18N:
$anchorName =~ s/($TWiki::regex{wikiWordRegex})/_$1/go;
# Prevent automatic WikiWord or CAPWORD linking in explicit links
$link =~ s/(?<=[\s\(])($TWiki::regex{wikiWordRegex}|[$TWiki::regex{upperAlpha}])/$1/
Fixing plugin regexes
Plugin code is a bit more verbose than core code as it must first get the regexes via the Plugin API - here's an example adapted from Plugins:InterwikiPlugin:
# Regexes for the Site:page format InterWiki reference
my $mixedAlphaNum = TWiki::Func::getRegularExpression('mixedAlphaNum');
my $upperAlpha = TWiki::Func::getRegularExpression('upperAlpha');
$sitePattern = "([$upperAlpha][$mixedAlphaNum]+)";
$pagePattern = "([${mixedAlphaNum}_\/][$mixedAlphaNum" . '\.\/\+\_\,\;\:\!\?\%\#-]+?)';
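For illustration, those patterns can then be interpolated into a full Site:page match. This standalone sketch substitutes the ASCII fallback values for the API calls so it runs anywhere:

```perl
# ASCII fallback values; a plugin would instead call
# TWiki::Func::getRegularExpression('mixedAlphaNum') and ...('upperAlpha').
my $mixedAlphaNum = 'a-zA-Z0-9';
my $upperAlpha    = 'A-Z';

my $sitePattern = "([$upperAlpha][$mixedAlphaNum]+)";
my $pagePattern = "([${mixedAlphaNum}_\/][$mixedAlphaNum" . '\.\/\+\_\,\;\:\!\?\%\#-]+?)';

# Match an InterWiki reference such as Wiki:WikiPedia/Perl
if( 'Wiki:WikiPedia/Perl' =~ /^$sitePattern:$pagePattern$/ ) {
    print "site=$1 page=$2\n";
}
```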
Regex efficiency
You may also want to use
/o
on your regex, or compile it using
$fooPatternRegex = qr/$someRegexVar/
, which should give better performance if used more than once, e.g. in loops or when running under
ModPerl. See
perldoc perlop
and
perldoc perlre
for details, and don't use this if the regex (not the substitution right-hand-side) includes 'real' variables that vary between invocations of your code, e.g. user name.
International message strings
As TWiki supports user interface internationalization, you should now avoid putting English language strings
directly into Perl code. In addition, you should follow the main
InternationalisationGuidelines to ensure that regular expressions and other code work well across multiple languages and locales (i.e. countries or regions).
The
TWiki::I18N
class encapsulates message text internationalisation, and the
i18n
field of the TWiki session object is an instance of this class. Thus, wherever
you might need to write an English string inside Perl code, you must write it wrapped in a call
to the
TWiki::I18N::maketext
method, like this:
# $session is an instance of TWiki class
my $msg = $session->{i18n}->maketext("Access denied: you don't have access for editing this topic.");
You can also interpolate parameters into the text, and let the translator correctly translate messages,
keeping a place for your parameters. Just write placeholders for the parameters, numbered with the
parameter order in the
maketext
call:
[_1]
for the first parameter,
[_2]
for the second,
and so on. Then you can do things like this:
# $session is an instance of TWiki class
my $msg = $session->{i18n}->maketext("This is topic [_1] on the web [_2].", $topic, $web);
Note that translators can change the order in which parameters appear in translated text
(i.e.
[_2]
appearing before
[_1]
), but they must keep the text's semantics, so that
substituting the first parameter into
[_1]
and second parameter into
[_2]
says the
same thing that is said in the original, whatever order they are in.
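This reordering can be seen with plain CPAN:Locale::Maketext, which TWiki::I18N builds on. The Portuguese lexicon entry below is an invented illustration, not a real TWiki translation:

```perl
use strict;
use warnings;

package Demo::I18N;
use base 'Locale::Maketext';

package Demo::I18N::en;
use base 'Demo::I18N';
our %Lexicon = ( _AUTO => 1 );   # English strings fall through unchanged

package Demo::I18N::pt_br;
use base 'Demo::I18N';
our %Lexicon = (
    # The translator swapped [_1] and [_2]; the meaning is preserved.
    'This is topic [_1] on the web [_2].' => 'No web [_2], este e o topico [_1].',
);

package main;
my $lh = Demo::I18N->get_handle('pt-br') or die "no language handle";
print $lh->maketext( 'This is topic [_1] on the web [_2].', 'WebHome', 'Main' ), "\n";
```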
See
UserInterfaceInternationalisation for guidelines for writing message strings that can be translated.
Testing your fixes
Don't forget to test your code across a number of different
I18N areas:
- Page (WikiWord) and web names with I18N characters
- Page contents with I18N characters - usually not a problem
- Attachments with I18N characters in filename or the topic/web that contains attachment
- Searching for I18N characters - especially if external programs used
- Sorting to include I18N characters - whether using internal Perl code or external programs
If there is a valid locale that works within Perl, most things should 'just work' once you have fixed the regexes. However, on Windows and other platforms where locales are broken in Perl terms, you will only be able to do
I18N for page contents and page/web names.
Example of code that needs fixing
Taking the original
UpdateInfoPlugin as an example (now fixed...) - this uses
\w
to match a
WikiWord (which is actually incorrect anyway, as it will match non-WikiWords!), when it should use the relevant
WikiWord regex via the plugin API. You frequently find that you end up fixing other bugs when adding
I18N support, because it forces you to look closely at the regexes.
Discussion
Any comments on how the I18N documentation is written or could be improved
Richard, I was surprised to realise just how easy it really is to update the plugins - you did all the thinking already
--
SteffenPoulsen - 25 Feb 2005, 01 Apr 2005, 02 Apr 2005
(
UserInterfaceInternationalisation:) I think I have now moved all of the error and warning message texts - including those generated inline - into templates. In theory, it should now be possible to switch language just through switching skin & help topics....
--
CrawfordCurrie - 06 Apr 2005
Updated to deprecate use of
\w
- using
\w
is
almost always the wrong thing to do, so don't do it...
--
RichardDonkin - 24 Jul 2006
Major update to this page, restructuring and using more examples based on TWiki 4.x code. Have also linked from
Plugins.ReadmeFirst and
CodingStandards, and
UserInterfaceInternationalisation.
Any suggestions for updates or linkage?
--
RichardDonkin - 10 Nov 2006
String truncating should be fixed. It appears not to be easy because utf-8 is a
multi-byte encoding. When a utf-8 encoded word passes through such code, a character may be split into invalid parts, and this occurs regularly for any Cyrillic language (see
Bugs:Item3574 for example).
Here are some code samples which may cause such an error:
Renders.pm:
$anchorName =~ s/^(.{32})(.*)$/$1/; # limit to 32 chars - FIXME: Use Unicode chars before truncate
Search.pm:
$searchString = substr($searchString, 0, 1500);
Prefs/PrefsCache.pm
$val =~ s/^(.{32}).*$/$1..../s;
What to do with such a code? Is it a right way to surround them as follows?
--- Search.pm.orig 2007-02-05 07:13:00.000000000 +0300
+++ Search.pm 2007-02-10 21:10:39.000000000 +0300
@@ -144,7 +144,14 @@
$searchString = $1;
# Limit string length
- $searchString = substr($searchString, 0, 1500);
+ if( defined($TWiki::cfg{Site}{CharSet}) && ( $] >= 5.008 ) &&
+ $TWiki::cfg{Site}{CharSet} =~ /^utf-?8/i ) {
+ require Encode;
+ $searchString = Encode::decode('utf8', $searchString);
+ $searchString = substr($searchString, 0, 1500);
+ $searchString = Encode::encode('utf8', $searchString);
+ }
+ else{ $searchString = substr($searchString, 0, 1500);}
}
=pod
Another question: I need also in the last file
@@ -662,6 +669,9 @@
} else {
+ require POSIX;
+ import POSIX qw( locale_h LC_CTYPE LC_COLLATE);
+ setlocale(&LC_COLLATE, $TWiki::cfg{Site}{Locale});
# simple sort, see Codev.SchwartzianTransformMisused
# note no extraction of topic info here, as not needed
# for the sort. Instead it will be read lazily, later on.
I do not understand why the LC_COLLATE does not spread from TWiki.pm to this code, but it does not, and I really need such a line here and in
TablePlugin/Core.pm and in Users/TWikiUserMapping.pm. It would probably be better to wrap the last added lines in an
if( defined($TWiki::cfg{UseLocale}) ) {
block.
--
SergejZnamenskij - 19 Feb 2007
Sergej - have commented over on
Bugs:Item3574. We can do string truncation very simply with regexes in my view if we are using UTF-8 as site charset, without calling
CPAN:Encode - regexes would be less code, though perhaps not as fast. Encode would work better if using a non-UTF-8 site charset, though most multi-byte charsets other than UTF-8 should not be used anyway (see
JapaneseAndChineseSupport).
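As a sketch of the Encode-based variant discussed above (assuming a UTF-8 site charset; Encode ships with Perl 5.8+, and the helper name is invented for illustration):

```perl
use Encode ();

# Truncate a UTF-8 byte string to at most $limit *characters*, never
# splitting a multi-byte sequence - the failure mode seen in Bugs:Item3574
# when byte-level substr() cuts a Cyrillic letter in half.
sub truncate_utf8 {
    my( $bytes, $limit ) = @_;
    my $chars = Encode::decode( 'UTF-8', $bytes );
    return Encode::encode( 'UTF-8', substr( $chars, 0, $limit ) );
}

# Example: keep only the first two characters of a four-letter Cyrillic word.
my $word = Encode::encode( 'UTF-8', "\x{442}\x{435}\x{441}\x{442}" );
print length( truncate_utf8( $word, 2 ) ), " bytes kept\n";
```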
As for LC_COLLATE - not sure why you need this - you can just do
use locale
to get Perl collation/sorting to use
I18N, and this has no effect on Unix
sort
commands (which TWiki doesn't use anyway). See the section above on 'preparing plugins and modules', or
UsingPerlLocalesTheRightWay, for how to do the "dynamic use-locale" correctly - should already be done in all modules, but perhaps some new modules have omitted this without anyone noticing.
--
RichardDonkin - 20 Feb 2007
Thank You, Richard!
As for LC_COLLATE - I just use the current TWiki 4.11 and did not look at
UsingPerlLocalesTheRightWay, as this was considered to be fixed in the current version - but it was not really fixed.
The UTF8 problem seems to have become much more serious than might be supposed.
All the Russian people I know who installed TWiki last year tried to install it with utf8 and could not repair the completely broken site (Bugs:Item3574 is just one case). Nobody tries a one-byte charset now. We lose customers and collaborators as a result. There are simple and robust solutions to apply, but nobody can find and use them.
The main problem AFAIK is that there are completely different use cases which cannot be mixed and which require special coding in some places:
- Use locale for a one-byte charset (fast)
- Perl 5.6 utf8-level support (regexps for string truncating and probably non-perfect sorting in 5.8)
- Perl 5.8 use unicode
It is so complicated to support all of these in the same code that this seems not to be the right way to go.
Keeping several independent branches in a complete, updated and working state is very hard work.
What about changing the structure as follows?
- Produce separate installation and upgrade packs for installs 1-3.
- Currently 1 should be an SVN branch, but 2 and 3 should be maintained as batches of patches to this branch. This will make it possible to properly maintain and test both 2 and 3 in an updated tree. The main branch selection should represent the main TWiki site install.
Is there another effective framework idea?
--
SergejZnamenskij - 21 Feb 2007
Re UTF8 problem - still don't understand why Russian users of TWiki are using UTF-8 today, as
WikiWords don't work. Please use KOI8-R instead as recommended in
CyrillicSupport, it works very well until we get real
UnicodeSupport, and TWiki does
not support UTF-8 for alphabetic languages (see
InstallationWithI18N).
On the use cases - I don't agree that we need independent branches or patches for this code, which would be a lot more work to maintain. I suggest we ignore use case 2 for
UnicodeSupport (i.e. Perl 5.6) - anyone who wants Unicode must simply upgrade to a recent version of Perl 5.8.x, since there are huge numbers of bug fixes and performance improvements.
So now we just need to support the traditional
I18N code and the newer
UnicodeSupport code. Most of the difference is in setup (
use open
, locales, etc), and can largely be confined to a few places - there may well be some Perl tricks that let us put this code in a special module that still has the required effect on the 'main' module that it's called from.
I think the Unicode work should be done on a development branch, since it can be quite difficult to get working and touches many modules, but should be frequently re-synced from the mainstream development. Also, if it can be broken into phases, it could be implemented in phases, as long as it's always protected by an 'if in Unicode mode' condition.
--
RichardDonkin - 22 Feb 2007
Probably, you are right (I did not use
SVN before).
As for
CyrillicSupport in koi8-r for the current 4.11, the two remaining problems are sorting (while LC_CTYPE holds from TWiki.pm, LC_COLLATE is lost; the patches above fix it for me) and Cyrillic UserName registration, which does not work for now. I did not report either as a bug, as I think unicode support will be good enough in the near future and has greater priority.
--
SergejZnamenskij - 22 Feb 2007
Interesting that you need to set LC_COLLATE via
setlocale
- this seems to be a bug in the
I18N code, not sure when it was introduced as I did test this originally with sorting. Perhaps it got taken out as part of code cleanup without re-testing. This does need fixing and looks like it would affect all locales. Also seen in
LdapPluginDev.
As for Cyrillic usernames, I didn't implement
I18N usernames because Apache basic authentication only permits ASCII (or at least it did when I wrote the code). If that's changed, it would be quite easy to change the username regex to permit this - probably a one line fix. This has been reported before, so please Google for earlier reports and check
InternationalisationIssues.
It would be good to report both of those as bugs, but particularly the first one, and submit patches, as that would help improve the current code - even with
UnicodeSupport done, not everyone will turn this on due to performance and probably Unicode issues in Perl, so it's worth fixing this.
On Unicode - please read
ProposedUTF8SupportForI18N - you would need to implement Unicode collation (Phase 3) to get the same sort of sorting you get today with locales (or will when LC_COLLATE is set). Personally I would focus on Phase 2 first (basic Unicode support) to get the core working, and then add features like collation and
UnicodeNormalisation.
--
RichardDonkin - 23 Feb 2007
LoginName was pure English, just the
WikiUserName in Cyrillic - see
Bugs:Item3679
It seems that testing in Cyrillic reveals general localisation bugs much faster than testing in Latin languages.
--
SergejZnamenskij - 23 Feb 2007
I was talking about
WikiUsernames. Just had a look at
Users.pm and it looks like it's using the correct regex -
$TWiki::regex{wikiWordRegex}$
. Are you sure your webserver setup permits non-ASCII user names? That was the reason I didn't implement this originally.
Cyrillic is a very good test case for
I18N because every letter is non-ASCII, rather than only accented letters as with most Latin languages.
--
RichardDonkin - 24 Feb 2007
In TWiki4 the webserver has nothing to do with the
WikiUserName if the
LoginName of the user differs. The user can use a Latin
LoginName to log in and authorise, and quite a different
WikiUserName will be associated with him via the
TWikiUsers topic analysis in TWikiUserMapping.pm. This function works fine for Latin but does not work for Cyrillic. See details in
Bugs:Item3679.
--
SergejZnamenskij - 24 Feb 2007
Ah, I see - never tested Latin usernames +
I18N WikiUsernames. Will comment on that bug page.
--
RichardDonkin - 25 Feb 2007
I make Russian (non-TWiki) pages with a windows-1251 heading:
<?xml version="1.0" encoding="windows-1251"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head xml:lang="ru" lang="ru"><meta http-equiv="content-type" content="text/html; charset=windows-1251" xml:lang="ru" lang="ru" />
Any browser displays such pages correctly. It validates perfectly and I can read and edit the code with any editor, as it should be.
The question is how to make TWiki generate pages with the above Windows-1251 heading?
--
DimitriRytsk - 11 May 2007
Dimitri - It's best to raise this as a Support question, but the short answer is that if you can install or generate a Unix/Linux locale on the TWiki server that uses Windows-1251, you can have TWiki use this locale so that all pages will be in this character set. However, this may not be too easy, as Windows-1251 is of course Windows-specific, and may not work so well on Linux servers, whereas KOI8-R is platform-independent.
--
RichardDonkin - 13 May 2007
Interesting that revision 31 of this topic, by Dimitri, seems to have caused the verbatim/code sections to be collapsed into a single line for each section, and has also put spaces at the start of many other lines. Not sure if this is a TWiki issue or something else... I've fixed this by reverting whole topic to rev 30 and pasting in more recent comments - the corruption was quite extensive so I'm sure it was not a deliberate edit.
--
RichardDonkin - 25 May 2007
Minor update to Motivation etc above.
--
RichardDonkin - 23 Aug 2007
More updates adding the Good News section etc.
--
RichardDonkin - 14 Sep 2007
I've added a new section above to quickly check core code or plugins for possible use of regexes that don't take account of
I18N - just run the following one-liner under Linux or Cygwin - it will search all *.pm files including any subdirectories for use of
[a-z]
,
\w
and
\b
, which are all potential issues unless you really mean A to Z without any international characters:
find . -name '*.pm' | xargs egrep -i '(\[a-z\]|\\w|\\b)' >regex-warnings.txt
--
RichardDonkin - 14 Jun 2008