Unicode support
So, let's investigate what's needed for full Unicode support.
From RichardDonkin on Bugs:Item772:
(Digression) It's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode, though that's not in scope for Dakar:
- Unicode - do a dynamic use open to set utf8 mode on all data read and written. This must also cover ModPerl, which, unlike CGI, doesn't use file descriptors to pass data to TWiki scripts. This code path must never do a use locale or equivalent, because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...)
- Non-Unicode - should function as now (assuming this is just a bug)
The hard part is that the switch between (1) and (2) must be dynamic, based on a TWiki.cfg setting. It should NOT be based purely on locale matching /\.utf-?8$/, because some people may validly want to run with a UTF-8 locale and browser character set, but without Unicode mode.
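As a rough illustration of such a dynamic switch, the sketch below chooses the I/O layer at run time with binmode, since the use open pragma itself takes effect at compile time. The %cfg hash, the UseUnicode key and both helper names are assumptions made for this example, not existing TWiki settings or functions. It only covers filehandle-based I/O, so the ModPerl case would still need explicit decoding.

```perl
use strict;
use warnings;

# Hypothetical configuration hash; in TWiki the value would come from TWiki.cfg.
# The key name UseUnicode is an assumption made for this sketch.
my %cfg = ( UseUnicode => 1 );

# Return the PerlIO layer for Unicode mode, or undef for non-Unicode mode.
# Note that the decision is driven by the setting, not by the locale name.
sub io_layer_for {
    my ($cfg) = @_;
    return $cfg->{UseUnicode} ? ':encoding(UTF-8)' : undef;
}

# Apply the layer at run time. Returns the number of handles changed.
sub apply_io_layer {
    my ( $cfg, @handles ) = @_;
    my $layer = io_layer_for($cfg);
    return 0 unless defined $layer;    # non-Unicode: leave handles as they are
    for my $fh (@handles) {
        binmode( $fh, $layer ) or die "Cannot apply $layer: $!";
    }
    return scalar @handles;
}

apply_io_layer( \%cfg, \*STDIN, \*STDOUT );
```

In non-Unicode mode the handles are left untouched, so existing locale-based behaviour would stay as it is now.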
Also, please don't do use utf8 to implement Unicode - it has an entirely different meaning between Perl 5.6 (where it means 'assume all data processed is UTF-8') and 5.8 (where it means 'variable names, literals, etc in this file can be UTF-8').
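Since use utf8 under Perl 5.8 only describes the source file, data has to be decoded separately. A minimal sketch of explicit decoding with the Encode module (bundled with Perl 5.8) follows; the sample string is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes as they might arrive from a file or a form submission.
my $bytes = "caf\xC3\xA9";

# Decode explicitly into Perl characters; no 'use utf8' is involved.
my $chars = decode( 'UTF-8', $bytes );

printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints "5 bytes, 4 characters"
```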
-- AntonioTerceiro - 21 Nov 2005
ProposedUTF8SupportForI18N has a lot of existing thinking and planning and should be a reasonable starting point, though it probably doesn't say enough about performance issues. Requiring a recent Perl 5.8.x version is also important, to avoid annoying bugs and perhaps help with performance.
It would also be worth considering GB18030 support, perhaps only in the browser - this is a 1-1 mapping from Unicode (i.e. really a Unicode Transformation Format analogous to UTF-8) that has been mandated by the Chinese government. More details: Wikipedia:GB18030 and IBM DeveloperWorks article. CPAN:Encode::HanExtra supports GB18030 conversion to/from Unicode.
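A small sketch of what that conversion might look like, assuming Encode::HanExtra is installed from CPAN and registers the encoding under the name gb18030. The helper name gb18030_round_trip is made up for this example, and it returns undef when the module is not available.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode::HanExtra is not a core module, so load it defensively.
my $have_hanextra = eval { require Encode::HanExtra; 1 };

# Hypothetical helper: convert characters to GB18030 bytes and back.
sub gb18030_round_trip {
    my ($text) = @_;
    return undef unless $have_hanextra;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;    # die on bad data, keep inputs intact
    my $bytes = encode( 'gb18030', $text, $check );
    return decode( 'gb18030', $bytes, $check );
}

my $result = gb18030_round_trip("\x{4e2d}\x{6587}");
print defined $result ? "round trip ok\n" : "Encode::HanExtra not installed\n";
```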
-- RichardDonkin - 21 Nov 2005
Worth noting also that some more recent versions of CPAN:CGI set Unicode mode on characters, which can be a good thing for Unicode support, or a bad thing if you don't want Perl's Unicode mode turned on. For some pointers on this, see discussion on ProblemsWithInternationCharactersInOddPlaces.
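One way to cope with parameter values that may or may not already be in Unicode mode is to check the internal flag before decoding. The helper below is a hypothetical sketch, not part of CGI or TWiki. It assumes undecoded input is UTF-8, and it relies on utf8::is_utf8 (Perl 5.8.1 or later), which reports Perl's internal flag rather than proving that a string was decoded, so treat it as a heuristic.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a character string whether or not the
# value was already put into Unicode mode upstream.
sub ensure_characters {
    my ($value) = @_;
    return $value if !defined $value || utf8::is_utf8($value);
    return decode( 'UTF-8', $value );    # assumption: raw input is UTF-8
}

my $from_bytes = ensure_characters("caf\xC3\xA9");    # raw UTF-8 bytes
my $from_chars = ensure_characters("\x{263A}");       # already characters
printf "%d %d\n", length($from_bytes), length($from_chars);
# prints "4 1"
```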
-- RichardDonkin - 31 Jan 2006
One support request that really needs Unicode support is CentralEuropeanCharacters. Since the increasingly popular MediaWiki has excellent Unicode support, I think we need to do something here. Unfortunately I have virtually no time for coding but I have done a lot of research and am happy to advise and review.
-- RichardDonkin - 03 Sep 2006
One thing to watch out for is that Perl 5.8 now distinguishes between "utf8" and "UTF-8" - the former is Perl's looser interpretation, the latter is as specified by the Unicode standards, and is also known as "utf-8-strict". For details, see recent Encode documentation.
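The difference can be seen by asking Encode for the canonical name of each encoding. The sketch below assumes an Encode version recent enough to make the distinction; is_strict_utf8 is a made-up helper that uses the strict encoding to validate incoming bytes.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Canonical names as reported by Encode.
print find_encoding('utf8')->name,  "\n";    # utf8
print find_encoding('UTF-8')->name, "\n";    # utf-8-strict

# Hypothetical helper: true if the bytes are valid under the strict definition.
sub is_strict_utf8 {
    my ($bytes) = @_;    # work on a copy; decode with a CHECK value may modify its input
    return eval { decode( 'UTF-8', $bytes, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

print is_strict_utf8("caf\xC3\xA9"),  "\n";    # 1: well-formed UTF-8
print is_strict_utf8("\xED\xA0\x80"), "\n";    # 0: encoded surrogate, rejected by strict UTF-8
```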
There are also some interesting war stories about doing Unicode with Perl in this blog entry.
Also, this excellent blog entry provides some wrappers around CPAN:CGI and CPAN:DBI to make them work better with UTF-8.
-- RichardDonkin - 04 Nov 2006
Good blog posting about Perl UTF-8 coding including CPAN:encoding::warnings - handy for debugging Unicode.
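A minimal sketch of how encoding::warnings might be used while debugging: it is meant to warn when a byte string containing high-bit characters is silently upgraded, for example by concatenating it with a character string. The sample strings are made up. On Perl 5.26 and later the module is understood to no longer have any effect, so this mainly applies to the 5.8.x versions discussed here.

```perl
use strict;
use warnings;
use encoding::warnings;    # warn on implicit byte-to-character upgrades
use Encode qw(decode);

my $bytes = "caf\xE9";     # ISO-8859-1 bytes, not yet decoded
my $chars = "\x{263A}";    # a character string

# Implicit upgrade: this is the kind of line the module is meant to flag.
my $mixed = $bytes . $chars;

# Explicit decoding first avoids the implicit upgrade.
my $clean = decode( 'iso-8859-1', $bytes ) . $chars;

print length($clean), "\n";    # 5
```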
-- RichardDonkin - 03 Apr 2007