The more work I do on getting I18N support right in WYSIWYG, the more convinced I am that TWiki goes out of its way to make life difficult for users, admins and extension authors by not using UTF8.
UnderstandingEncodings is a detailed primer on character sets and a discussion of the problems inherent in trying to support non-UTF8 character sets in the TWiki core. Please read it carefully before commenting. I also highly recommend the following overview of Unicode and UTF8: http://www.cl.cam.ac.uk/~mgk25/unicode.html. RD also recommends The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which is a nice gentle introduction.
Proposal
The proposal here is to modify TWiki to assume the use of UTF8 in all content. That means UTF8 would be assumed in:
- topic content
- topic and web names
- template files
- url parameters, including form content
Key concepts
- UTF-8 is the encoding - see UnderstandingEncodings
- UTF-8 character mode (aka Perl utf8 mode) - Perl handles the 1 to N bytes of a Unicode character as a single character, not as N bytes. This is the target of this work. See perldoc perlunicode for details.
- UTF-8 as bytes - Perl happens to be processing the 1 to N bytes of a Unicode character as N bytes not as a character. This is usually a mistake if you are trying to "Use UTF8", but is sort-of supported with current TWiki versions (see InstallationWithI18N for when it's used) - you don't get WikiWord support and so on, but the characters aren't mangled.
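The difference between the last two concepts can be seen in a few lines of Perl. This is only an illustrative sketch using the core Encode module; the sample string and variable names are made up for the example and do not come from the TWiki code base.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9";

# UTF-8 as bytes: Perl sees five separate bytes.
my $byte_length = length($bytes);

# UTF-8 character mode: after decoding, Perl sees four characters.
my $chars       = decode('UTF-8', $bytes);
my $char_length = length($chars);

# Character-aware operations only give sensible results in character mode.
my $upper_chars = uc($chars);   # "é" becomes "É" (U+00C9)
my $upper_bytes = uc($bytes);   # only the ASCII bytes change

printf "bytes: %d, characters: %d\n", $byte_length, $char_length;
```

In "bytes" mode the non-ASCII character is never treated as a letter, which is why features such as WikiWord detection cannot work on it.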
Technical Detail - What would need to be done?
- In terms of the core changes required, this would mainly be a case of deleting code. TWiki.pm especially contains a lot of special support which is primarily aimed at different character sets. Most of this support is poorly documented and used incorrectly or not at all in the code (for example, many regexes in code use [A-Z] incorrectly to represent word characters). Correcting and updating the documentation is therefore also essential.
- RD: This isn't my perception (lots of special support) - can't think of anything I wrote that was specific to a character set, apart from EBCDIC which is a special case for TWikiOnMainframe and not a priority. I think some of this code may have been added in recent years though. You are right about [A-Z] though - see InternationalisationGuidelines for a new shell one-liner that helps detect such things, and also \w and \b which are very common and almost always wrong.
- All streams opened by the store need to use :encoding(UTF-8), not :utf8 - the latter doesn't check for a valid UTF-8 encoding, leading to possible security holes - see Security comment below.
- stdout and stderr need to be re-opened in :encoding(UTF-8), using the equivalent of binmode(STDxxx,':encoding(UTF-8)'); - this also needs to apply to ModPerl and similar CgiAccelerators which may not use stdout/stderr.
- Increment the store version
- Check very carefully whether input data from a form is indeed UTF-8 encoded - generally if you force the output page to UTF-8 using HTTP header and HTML charset, the data returned in a POST or GET will also be in UTF-8. So this is mainly useful to guard against a user explicitly setting their browser to the wrong character set. Fortunately CPAN:Encode can do this very efficiently, certainly faster than the EncodeURLsWithUTF8 regex.
- CGI.pm does not do any encoding. In fact it can't, because the encoding is not given with the HTTP request. (RD: However, CPAN:CGI does turn on Perl utf8 mode in some more recent versions, which has been a problem for the current pre-Unicode versions of TWiki. We might need to test against specific CPAN:CGI versions if we get problems.)
- Evaluate the encoding of all content which is retrieved via other protocols:
- HTTP (e.g. %INCLUDE{http://somewhere}% and other TWiki::Net interfaces)
- Mail as in MailInContrib
- Define the encoding of all content which TWiki sends elsewhere
- Sent mail as in TWiki's notifications (ouch: my mail client doesn't handle UTF-8 -- haj)
- Command parameters for Sandbox commands (needs to divine the operating system's default encoding)
- An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code.
- Unit testcases would be required for:
- Existing topic in non-UTF-8 charset
- Topic with broken UTF-8 encodings
- Check encoding on all pages generated by TWiki
- Fixes for the following bugs would need to be confirmed: TWikibug:Item3574 TWikibug:Item4074 TWikibug:Item2587 TWikibug:Item3679 TWikibug:Item4292 TWikibug:Item4077 TWikibug:Item4419 TWikibug:Item5133 TWikibug:Item5351 TWikibug:Item5437 TWikibug:Item4946
- Corrections to the documentation
- Add guidelines for adding localized templates or skins: they need to be (or to be converted to) UTF-8, too
- Review all extensions (plugins, skins, contribs) for assumptions about character sets (e.g. /[A-Z]+/) and add guidelines for extensions authors
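Two of the items above (strict stream layers and checking that input really is UTF-8) can be sketched with the core Encode module. This is a hedged illustration, not TWiki code: decode_strict_utf8 is a hypothetical helper name, and a real implementation would need to decide how to report rejected input rather than just returning undef.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of TWiki): decode bytes that claim to be
# UTF-8, returning undef instead of dying if they are not valid.
sub decode_strict_utf8 {
    my ($bytes) = @_;
    # 'UTF-8' (with hyphen) is Encode's strict decoder; FB_CROAK makes it die
    # on malformed or overlong sequences, LEAVE_SRC keeps $bytes unmodified.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text;
}

# Output streams: encode characters to UTF-8 on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

my $ok       = decode_strict_utf8("na\xC3\xAFve");  # valid UTF-8
my $overlong = decode_strict_utf8("\xC0\xAF");       # overlong encoding of "/"
my $latin1   = decode_strict_utf8("na\xEFve");       # ISO-8859-1 bytes, not UTF-8

print "valid input decoded to ", length($ok), " characters\n";
print "overlong input rejected\n"   unless defined $overlong;
print "ISO-8859-1 input rejected\n" unless defined $latin1;
```

The same strict check applies to file handles when they are opened with the :encoding(UTF-8) layer rather than :utf8.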
The default character set for TWiki would become Unicode. This means that "old" topics (those that predate the change to Unicode) could break TWiki if they include high-bit characters for a non-Unicode character set. To overcome this problem, there needs to be a way to kick into a "compatibility mode" when reading such topics. One possible algorithm is:
- Read content using a byte stream
- If {Site}{CharSet} is set to a non-UTF8 character set ({Site}{CharSet} is basically used as a legacy setting to say "this is the charset that used to be used by this site before the change to UTF8") then
- if (1) content uses high-bit characters and (2) store version is prior to the current version:
- use Encode::decode({Site}{CharSet}, $text) to convert to the Perl internal character representation
- Note that the version in the TOPICINFO may not be useable if the
- otherwise
- use Encode::decode_utf8($text) to decode UTF-8 to the internal representation
- RD: All this assumes we don't do a bulk migration using an offline tool - this is what I'd recommend, see material below.
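The compatibility-mode algorithm above might look roughly like this in Perl. It is a sketch under stated assumptions: decode_topic is a hypothetical helper, $legacy_charset stands in for {Site}{CharSet}, and $is_old_store stands in for the store-version check. It also swaps the lax Encode::decode_utf8 for a strict decode, in line with the security notes on this page.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper sketching the algorithm. None of these names come
# from the TWiki code base.
sub decode_topic {
    my ($bytes, $legacy_charset, $is_old_store) = @_;

    my $has_high_bit = $bytes =~ /[\x80-\xFF]/;
    my $legacy_site  = defined $legacy_charset
        && $legacy_charset !~ /^utf-?8$/i;

    if ($legacy_site && $has_high_bit && $is_old_store) {
        # Compatibility mode: interpret the bytes in the old site charset.
        return decode($legacy_charset, $bytes);
    }

    # Otherwise treat the content as UTF-8, strictly validated
    # (this dies on invalid input; a caller would need to handle that).
    return decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);
}

my $old = decode_topic("caf\xE9",     'iso-8859-1', 1);  # legacy topic
my $new = decode_topic("caf\xC3\xA9", 'iso-8859-1', 0);  # already UTF-8
print "same text\n" if $old eq $new;
```

Both calls end up with the same four-character string, which is the point of the compatibility mode: callers never see which encoding was on disk.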
Some other considerations:
- Security
- TWiki must take care to check that possible UTF-8 data is in fact using only valid UTF-8 codepoints (characters in the encoding) and is not using an 'overlong' encoding - both can lead to security holes. Some specific points:
- Performance benchmarking and tweaking
- RD: suggest that benchmarking is done very early so we get some good metrics of how the Unicode changes are affecting performance. Some optimisations may be possible though I have no idea what they are. My experience a few years back was a 3 times slowdown, hopefully Perl has improved since then.
- Pre-Unicode charset support - are we going to still support pre-Unicode charsets? From a TinyMCE perspective I guess the answer is no as it's quite painful to convert to/from the site's pre-Unicode charset (e.g. ISO-8859-1). However sites that don't use TinyMCE might want to be able to do this.
- MacOSX server support
- MacOS X encodes filenames on HFS+ filesystems in a unique flavour of Unicode, using an Apple-enhanced NFD normalization type (see UnicodeNormalisation), whereas the rest of the world (i.e. W3C, Linux, Windows, etc) uses NFC normalisation - it actually stores them in a 16-bit encoding of Unicode, but NFD is the problem. This means that simply getting TWiki to create Unicode filenames on most Mac server disks may give us some issues - TWiki will try to create NFC UTF-8 filenames, which get converted to either UCS-2 or UTF-16 (16-bit Unicode encodings), but using NFD not NFC. The risk is that NFD is then presented back to the web or email client, and in most cases the I18N characters aren't viewed properly, unless TWiki has converted on the fly from NFD to NFC. This may "just work" without any extra code, but needs testing. MacOSXFilesystemEncodingWithI18N has the gory details. convmv has some support in this area for batch filename conversion to/from MacOS X's NFD flavour.
- Windows server support
- Windows may also have some issues perhaps with Unicode filenames, but it uses NFC so should be OK. Apache on Windows works best with UTF-8 URLs, so actually our Windows I18N support could improve with UTF-8.
- Apache on Windows does have issues with non-UTF-8 PATH_INFO used by TWiki, and some other CGI environment variables - it erroneously tries to convert these to UTF-16, which is what Windows uses in NTFS, despite these environment variables most likely having nothing to do with pathnames (and it does this even if the server is on a FAT filesystem that doesn't use UTF-16). I got a patch into Apache 2.0.54 for this (ApacheBug:32730 and ApacheBug:34985), but I think some bugs may still remain in this area.
- Backward compatibility with Perl 5.6 and early 5.8.x's - this is more relevant if we support pre-Unicode site charsets, but early 5.8.x applies to Unicode as well.
- Once we start doing UnicodeSupport, TWiki will no longer work with Perl 5.6 due to its broken Unicode support. Also, TWiki may only work on later 5.8 versions - some systems have older 5.8.x's with too many Unicode bugs to be usable.
- So it will be important to survey our user base to see how they feel about this. If backward compatibility is seen as important, and we think it is worth the extra hassle, this would require some extra work - one idea on dynamically supporting both Unicode and non-Unicode mode is:
From Bugs:Item772: it's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode:
- Unicode - do a dynamic use open to set utf8 mode on all data read and written (must also cover ModPerl which doesn't use file descriptors to pass data to TWiki scripts, unlike CGI). This code path must never do a use locale or equivalent because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...)
- Non-Unicode - should function as now (assuming this is just a bug)
- Migration of topics and filenames - any pre-Unicode encoded non-ASCII data in the topics or filenames (including attachment filenames but not contents) will need to be converted if we don't support pre-Unicode charsets. There are some tools on most Unix systems that will handle this, but this requires an upgrade step, unlike all other TWiki upgrades. Doing this topic by topic as they are updated will not work, because the older topics won't be viewable properly (unless you have per-topic Unicode mode which is fairly horrible IMO).
- Automating the upgrade process could be quite hairy, particularly the filename changes in a deep directory hierarchy, and would probably not work on MacOS X at all due to MacOSXFilesystemEncodingWithI18N - probably best to ensure the upgrader does good backups and provide some scripts and docs. There might be similar problems on NTFS or FAT filesystems, possibly with variations depending on whether the OS is *nix or Windows - I believe that NTFS translates UTF-8 to UTF-16 but hopefully it doesn't do any UnicodeNormalisation.
- I think a batch migration process is essential - I realise this goes against the TWiki upgrade philosophy but this is quite a big and complex change to the entire pub and data trees, including existing filenames, so I don't see how you can do this 'online' (what if there are some topics that aren't updated for months, but other topics refer to them - which URL should you use? What about attachments where you don't want to re-upload them just to convert the filename on disk?). Batch migration also means that you can use tools such as convmv, which converts the filenames from one format to another, and write a simple Perl script that converts the contents in place, using the original LocalSite.cfg or TWiki.cfg spec for site charset to drive the conversion to Unicode.
- Sorting support
- Locales are very buggy when combined with Perl utf8 mode, in RD's experience - best avoided, and many Perl Unicode apps don't use locales at all.
- UnicodeCollation support is the main alternative, not sure about performance though
- Bugginess of Perl Unicode generally - should be better now, but I ran into some issues and we should expect to uncover and work around some Perl bugs.
- "Unicode mode" toggle
- Despite the niceness of Unicode, I think it's important to have a simple toggle that globally disables Unicode usage for TWiki. While this disables any I18N, it also allows the user or developer to:
- Avoid any Perl utf8 bugs
- Ensure best possible performance even if some strings get forced into Perl utf8 mode.
- Easily compare the non-utf8 and utf8 modes for unit and system testing.
- Run TWiki in non-I18N mode on Perl 5.6
- It may be a bit of a hassle initially to enable this (e.g. dynamic code in BEGIN blocks etc) but I think it's worth it.
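For the MacOS X point above, the on-the-fly NFD to NFC conversion could be sketched with the core Unicode::Normalize module. This is an assumption-laden illustration: filename_to_nfc is a hypothetical helper, and Apple's NFD variant differs from standard NFD in some corner cases that this sketch does not try to cover.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: take filename bytes as they might come back from
# readdir() on an HFS+ volume (UTF-8, decomposed) and return UTF-8 bytes
# in NFC, the form the rest of the world expects.
sub filename_to_nfc {
    my ($raw) = @_;
    my $chars = decode('UTF-8', $raw);
    return encode('UTF-8', NFC($chars));
}

# Decomposed "é": "e" followed by U+0301 COMBINING ACUTE ACCENT,
# which is the bytes 0xCC 0x81 in UTF-8.
my $nfd_name = "Cafe\xCC\x81.txt";
my $nfc_name = filename_to_nfc($nfd_name);

# Precomposed "é" is U+00E9, i.e. the bytes 0xC3 0xA9 in UTF-8.
print "normalised\n" if $nfc_name eq "Caf\xC3\xA9.txt";
```

Whether TWiki needs this at all on MacOS X still has to be established by testing, as noted above.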
Earlier work by RD on UnicodeSupport - can provide my code, which is based on an old TWiki alpha version, and it did get to the point of running on a semi-public Unicode test site, running in real "perl utf8" mode not just "bytes" mode with UTF-8 encoding.
--
Contributors: CrawfordCurrie, HaraldJoerg, RichardDonkin
Discussion
CC asks:
Isn't it easier just to say "if there is no accept-charset, UTF8 is assumed"? - sorry, I don't think it is that simple. In an HTTP request there might be an Accept-Charset header, but in my experience it is simply an indication of browser capabilities, and not related to the accept-charset attribute of a form. If you add accept-charset="utf-8" to all templates, you can catch TWiki's own edit and search workflows, but you'll likely miss most TWiki applications with homegrown topic creators. Unfortunately the specification of the accept-charset attribute is crappy in itself: it allows a list of encodings to be specified in a form tag, but does not provide a way for a browser to indicate which one it actually used to encode. Browser implementations have been reported to behave differently, but I guess this is no longer true for current browsers (for certain values of current).
I see two alternatives:
- Use UTF-8 not only internally, but also externally: Always encode HTTP responses in UTF-8, and set the appropriate HTTP headers (and meta elements). There's no need then to use accept-charset attributes on forms. The configuration setting {Site}{CharSet} would just be used for legacy file conversion. This solution assumes that UTF-8 is good enough for all TWiki users, and might cause problems for those who deliberately use ISO-8859 today because they use their TWiki together with external data sources (e.g. databases or localized skins). TWiki would explicitly need to decode parameters from UTF-8, maybe with appropriate precautions to avoid a crash.
- Keep sending pages in {Site}{CharSet} encoding, and rely on {Site}{CharSet} being the encoding you get for parameters. Again this would not require forms to use accept-charset. Decoding of parameters would be needed iff {Site}{CharSet} is a multibyte encoding. I am not sure, however, how browsers behave if users enter characters in a form field which cannot be encoded in {Site}{CharSet}.
Hell. Getting encoding right is tough, and it has been so since tapes were written in either ASCII or EBCDIC. TWiki got rather far without touching encoding, but it is unlikely to get any further in a world which is moving towards UTF-8.
--
HaraldJoerg - 12 Apr 2008
I think option (2) is dangerous. I have had a lot of bug reports resulting from UTF8 being interpreted as high-bit iso-8859, and crashes from illegal encodings as a result of high-bit iso-8859 being interpreted as UTF8. If TWiki chooses to move to UTF-8 I think it should move everywhere. Imagine the scenario: some poor Russkaya mafiya is trying to import a topic with a Russian name, written in Russian, by a Russian-speaking colleague working undercover in the US department of electronic security, and you are trying to give them support:
- What encoding is used in the US TWiki? iso-8859-1, KOI8-R, UTF8, other?
- What encoding is used in your local TWiki? KOI8-R, UTF8, other?
- What charset is used in your browser?
All questions that Josef Average will find all but impossible to answer. If, on the other hand, you always require UTF8, the worst you will be faced with (hopefully) is:
- "is the charset specified as utf8 or utf-8, because utf8 will cause a bug in internet explorer"
I can't imagine many scenarios where utf8 - and therefore Unicode - would not be enough for users. Unless they are writing in Klingon (the Klingon character set is inexplicably missing from Unicode). Or Elvish.
One possible scenario is where a site is trying to use a plugin that has been coded to assume iso-8859, and the admin doesn't have the will, or a way, to get it fixed. For that reason I added a review of extensions to the todo list above, which should cover your point about external datasources.
--
CrawfordCurrie - 13 Apr 2008
I think it is feasible to migrate TWiki towards pure UTF-8. It should work for all known languages on the planet Earth and will limit the platform we have to test on to ONE.
With respect to plugins the important steps will be
- Ensure anything in Func is UTF-8 compatible
- Migrate the most popular plugins to UTF-8 and in the process document what it takes to upgrade a plugin to UTF-8
- Provide a safe upgrade method. I have tried to upgrade ASCII type topics by converting to UTF-8 and so far it has worked fine each time. This is a situation where it is difficult - maybe impossible - to do it on the fly. But unlike upgrade scripts suggested for syntax changes that are always bound to fail because we cannot predict the more advanced ways to make applications, upgrading to UTF-8 is happening at "byte level" and is a well known process which there are plenty of tools available for. But we need to have a good way to ensure that one does not double convert topics (convert a topic which is already converted).
Going to pure utf-8 will be a task that takes a lot of testing. I am testing UTF-8 at the moment in the 4.2.1 context and as you know there are a couple of new bugs I opened where I have seen that SEARCH and verbatim are not yet working in UTF-8. There will be many more test steps needed before we can let go of other charsets. But I think the step to make TWiki utf-8 only should be considered with a positive spirit because it will make TWiki fully I18N, which it is not now, and with a chance of being stable also for non-English users.
--
KennethLavrsen - 13 Apr 2008
I agree that UTF-8 should be the way to go, and I fully support moving towards encoding topics in UTF-8 as soon as possible. Easy moving of topics between TWiki installations needs a unique encoding, and UTF-8 seems to be without alternative for that purpose. Topics (and templates) have long-lived encodings, they tend to lie on disks for years without surreptitiously changing their encoding, hence the migration path needs to be carefully paved (as you did in your proposal).
My options do not refer to using UTF-8 for writing topics, but to the encoding used for TWiki's other interface, HTTP/HTML written for browsers. UTF-8 would work fine for me, and maybe for all installations (including Elvish). So probably we could jettison option (2) right now.
So what it boils down to is not that TWiki is using UTF-8 (because, strictly speaking, TWiki is using Perl's internal encoding all the time), but that TWiki expects all its external data interfaces to be encoded in UTF-8. From that point of view, topics are the easiest part because writing and reading topics is under TWiki's more or less exclusive control. As you wrote, we'll need to carefully collect assumptions about encodings, but also identify unjustified ignorance. Maybe you summarized these cases with your item "An audit of the core code to find cases where failure to acknowledge the encoding correctly has implicitly broken the code". I added some to the list above, hopefully it won't grow too much.
A minor note about UTF-8 vs. utf8: Internet protocols use 'UTF-8', case-insensitive, and Perl uses 'utf8', always lowercase.
--
HaraldJoerg - 13 Apr 2008
It would be good to look at UnicodeSupport and linked pages, which contain a lot of thinking about this. I've commented at UnicodeProblemsAndSolutionCandidates in detail on some of the issues that would need to be solved, which cover some of the points made above. It would be helpful if the various Unicode pages were interlinked - perhaps UnicodeSupport could be refactored into a 'landing page' for all these topics including latest discussions, to make it easier to find them.
Shame I missed this discussion - I haven't been tracking TWiki for a while now, but would be interested in participating if someone can email me. Unfortunately TWiki.org doesn't have a good way of monitoring 'only pages with certain keywords' that I'm aware of.
- WebRss supports SEARCH statements to narrow down what you get notified of (and Crawford entered an enhancement request of mine for supporting SEARCH queries (full TML actually) in WebNotify) - SD
--
RichardDonkin - 14 Jun 2008
Thanks for the tip, Sven.
On the options - I think the best one is option 1, i.e. UTF-8 at the presentation level and internally. There should be very few systems these days where UTF-8 is not supported - even on an ancient 486 you can boot a live CD that supports UTF-8 in Lynx - but I'm sure someone will come up with one.
In a possible Phase 2 of UTF-8 adoption, we could implement some charset conversion at the presentation layer, e.g. if someone has a browser or email client that only does a legacy Russian or Japanese character set, perhaps, and they are unable to upgrade their clients. This could perhaps be driven by accept-charset. However, this adds complexity so let's not do it in the first phase of UseUTF8.
See more comments in text prefixed with RD.
--
RichardDonkin - 15 Jun 2008
I've added a key concepts section above to try to differentiate between "UTF-8 character mode" in Perl vs. processing UTF-8 as bytes (which is not what we want), as a result of commenting on Bugs:Item5566.
--
RichardDonkin - 26 Jun 2008
One remark on "need to use :utf8": don't use ":utf8", use ":encoding(UTF-8)" instead. Why? Because with ":utf8" Perl doesn't check whether the data really is UTF-8, and this can lead to serious security problems. See PerlMonks: UTF8 related proof of concept exploit released at T-DOSE for an example.
- RD: You can also use "utf-8-strict" as a synonym for "UTF-8" in Perl pragmas, which might be less vulnerable to typos.
--
ChristianLudwig - 27 Jun 2008
Good point - have merged this above where it talks about :utf8 and also added a Security bullet under the 'other considerations' part.
I've added quite a lot of material above, comments would be useful.
One simple next step might be to agree whether we can dump the accept-charset idea which IMO is not required.
--
RichardDonkin - 28 Jun 2008
I think "keep it simple" has to be the guiding principle here. I think accept-charset falls the wrong side of that line, and should not be used.
The main support problem we have had with encoding support to date has been excessive flexibility coupled with a lack of documentation explaining in simple terms what the casual admin needs to do. I had to research quite a lot to reach my poor level of understanding, and it's unreasonable to expect yer averidge admin to do the same.
So, from a user perspective, I don't want to know it's using UTF8 (or any other encoding). configure should have no encoding options, just a single, simple option for setting the user interface language. If that means committing to a less-than-100%-flexible approach, then I'm in favour.
--
CrawfordCurrie - 28 Jun 2008
A less flexible approach should be possible since we won't be using locales, and I agree completely with going for simplicity. Some remaining issues though:
- Batch migration of topics - this is essential to keep core code simple, so it only has to deal with UTF-8
- Performance - early testing and tuning will be important, covering both the English-only and the I18N-heavy cases. If this can't be optimised, a Unicode-mode toggle as mentioned above will be important, but it could be based on a simple toggle such as {UseInternationalisation}.
- Sorting - if we don't do locales, topic and table column sorting will need UnicodeCollation. This has to be based on "the language", which can most simply be derived from the user's language (for message internationalisation). Unicode obviously supports multiple languages but for collation you need to know which language the user is working in, and hence which Unicode collation order to use. The good news is that CPAN:Unicode::Collate does all this for you as long as it's used in any sort routines.
Some expert-level config options may be needed to work around brokenness, but we should try to avoid them wherever possible (like the Unicode mode toggle). If we limit ourselves to Perl 5.8 only that will simplify matters - if Perl 5.6 must be supported it could turn off all I18N and use only ASCII.
--
RichardDonkin - 28 Jun 2008
- I thought the user interface language code required the locale to work?
- I'm torn on batch migration. Migration on the fly is seductive, and fairly easy to make work, but the performance is likely to stink. Batch migration has the potential to lose the history (unless it rebuilds it using the new encoding)
- Another issue is Extensions. Authors need comprehensive support to make sure they don't fall into the /[A-Za-z]/ trap.
- Case detection and conversion (whatever 'case' means)
- Sort collation
- Language/encoding information
My personal opinion is that Perl 5.6 is past its sell-by date and should be dropped. This might shut out some hosting providers; I'd be interested to hear if any are still using 5.6.
--
CrawfordCurrie - 29 Jun 2008
Will have to look at the UI language code but I think it only uses locales because the core does. If we go UTF-8 they simply have to convert all their translation files and make a slight adjustment (IMO, without looking at code yet.)
Non-batch migration is really hard as well as slow:
- Most significantly, how do you search for an I18N word using grep across a mixed set of converted and unconverted pages? A: you have to run two grep searches. With all this complexity you are actually supporting the pre-Unicode character set forever, since you can never know when you will hit a page that nobody has yet viewed or edited.
- How would you handle page A with WikiWord links to pages B and C, where A and B are ISO-8859-1, and C is already converted? A: You would convert page A into UTF-8, and any generated URLs should use UTF-8 for TWiki pages and {Site}{CharSet} for attachments (due to EncodeURLsWithUTF8 and the need for the web server to directly serve attachments). Fortunately the inbound URL conversion of EncodeURLsWithUTF8 helps here with the link to page B, but you have to keep that logic around.
Batch migration would preserve history: since RCS is a text format (just checked with man rcsfile and this page documenting more details of RCS file format) and doesn't appear to have any length or checksum fields that would mess this up, it is fairly trivial - just use the iconv utility for the file, and convmv for the filename itself (and directory names). There may be some corner cases if people have embedded URL-encoded links within a TWiki page, but that's unlikely and not required with current I18N. The page linked here makes it clear that it is safe to embed UTF-8 in RCS files - the only problem might come if Asian sites have (against TWiki I18N recommendations at InstallationWithI18N) used a non-ASCII-safe double-byte character set such as Shift-JIS as the {Site}{CharSet} when we convert this to UTF-8, as RCS may have escaped a conflicting byte within a double-byte character. I suggest we don't bother with this case, as such character sets were never supported (see JapaneseAndChineseSupport as well).
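As an alternative to iconv, the content half of the batch migration can be sketched in Perl with the core Encode module, including a guard against converting the same topic twice. to_utf8_once is a hypothetical helper, and the guard is only a heuristic: legacy text that happens to be valid UTF-8 would be left alone, so this is a starting point rather than a finished tool.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: convert topic bytes from the old site charset to
# UTF-8. Content that already decodes as strict UTF-8 (including pure
# ASCII) is returned unchanged, which avoids double conversion.
sub to_utf8_once {
    my ($bytes, $site_charset) = @_;
    my $already = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $bytes if $already;
    return encode('UTF-8', decode($site_charset, $bytes));
}

# "Grüße" as ISO-8859-1 bytes; in a real script the bytes would be read
# from the .txt and ,v files in raw mode and written back the same way.
my $once  = to_utf8_once("Gr\xFC\xDFe", 'iso-8859-1');
my $twice = to_utf8_once($once, 'iso-8859-1');   # second run is a no-op

print "idempotent\n" if $once eq $twice;
```

The charset argument would be taken from the site's existing {Site}{CharSet} setting, as suggested in the migration notes earlier on this page.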
EncodeURLsWithUTF8 may need to be enhanced slightly - haven't thought about the details yet, but limiting ourselves to browsers supporting UTF-8 should help and might even simplify it. Attachment support through UTF-8 URLs will be the main remaining issue - however by making the browser use UTF-8 we force all its URLs to be in UTF-8 format. We might even find TWikiOnMainframe I18N works without special code...
Extensions are a problem, which InternationalisationGuidelines tries to address, but it's really down to the extension author and promoting I18N amongst authors. Many extensions aren't I18N-aware, but I think those that are already I18N-aware will have an easier time converting, and going Unicode makes life easier generally, particularly for extensions that interface to third-party systems that already use UTF-8.
Support for MacOS X will be something of a challenge I think, requiring UnicodeNormalisation but only on MacOS servers, as mentioned above. Support for Windows servers is not a problem, I now believe - I have updated the comments above to reflect this.
On collation, I did a bit of research yesterday: UnicodeCollation doesn't give you language specific sorting, but it does provide a good default sort order across all languages. People who want correct sorting in Swedish, Danish, Japanese, etc, will need a 'language sort module' that adds some language specific collation rules for their language. This should be done as with the UI internationalisation stuff, ideally so that when you do a translation someone a bit more techie defines the collation rules - these are available from various sources with a bit of luck but there are very few language specific modules on CPAN that help. Also, CPAN:Unicode::Collate involves loading a 1.3 MB default collation order file, which could have some performance impact... UnicodeCollation might need to be enabled only for those who want language-specific collation to be absolutely correct, with the default being to sort by UTF-8 codepoint, which doesn't look nice but at least is fairly fast - some performance testing needed, with and without ModPerl.
--
RichardDonkin - 29 Jun 2008
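To make the collation trade-off in the comment above concrete, here is a small sketch comparing a plain codepoint sort with the default order from Unicode::Collate. The word list is invented for the example, and no language-specific tailoring is applied.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Collate;

# Character-mode strings; the last one is "élan", decoded from UTF-8 bytes.
my $elan   = decode('UTF-8', "\xC3\xA9lan");
my @topics = ("zebra", "Apple", $elan);

# Plain sort compares codepoints: "A" (0x41) < "z" (0x7A) < "é" (0xE9).
my @by_codepoint = sort @topics;

# Unicode::Collate applies the default Unicode Collation Algorithm order,
# where "é" sorts with "e", between "a" and "z".
my $collator     = Unicode::Collate->new;
my @by_collation = $collator->sort(@topics);

binmode(STDOUT, ':encoding(UTF-8)');
print "codepoint: @by_codepoint\n";
print "collation: @by_collation\n";
```

Constructing the collator is the expensive step because it loads the default collation table, so under ModPerl one would presumably create it once and reuse it; that is a guess that needs the performance testing mentioned above.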
A few more updates above to my comment of 29 Jun, and also some updates to main text - in particular I've removed the accept-charset part since we are agreed we don't want to do this.
--
RichardDonkin - 01 Jul 2008
Any more thoughts on this? I've done some updates to UnicodeCollation including a test script - this isn't hard to do.
--
RichardDonkin - 15 Jul 2008
I'm with you on batch migration. I think extension authors will have to be left to sort out their own houses; though the most common extensions will need to be tested. I don't care much about OSX, and until an OSX user with hardware steps forward I doubt anyone else will.
The main problem I foresee is testing. I don't think it makes sense to do any of this without a testing strategy. My preference is for UTF8 testcases to be added to the existing unit test suite, as lack of unit tests in this area has been crippling in the past. And as you say, performance testing is required.
I'd like to make proper UTF8 support a feature of TWiki 5.0, but I think it requires a lot more concentrated effort from interested parties than just the two of us batting ideas around, especially as neither of us is likely to be actively coding anything. Specifically I'd like to hear from community members who actually want to actively use non-western charsets in their day-to-day work, as their experiences would be key to the success of the venture.
--
CrawfordCurrie - 15 Jul 2008