Unicode support
So, let's investigate what's needed for full Unicode support.
From RichardDonkin on Bugs:Item772:
(Digression) It's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode, though that's not in scope for Dakar:
1. Unicode - do a dynamic use open to set utf8 mode on all data read and written (this must also cover ModPerl, which, unlike CGI, doesn't use file descriptors to pass data to TWiki scripts). This code path must never do a use locale or equivalent, because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...).
2. Non-Unicode - should function as now (assuming this is just a bug).
The hard part is that the switch between (1) and (2) must be dynamic, based on a TWiki.cfg setting. It should NOT be based purely on locale matching /\.utf-?8$/, because some people may validly want to run with a UTF-8 locale and browser character set, but without Unicode mode.
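A minimal sketch of that switch, under stated assumptions: the setting name UseUnicode and the helper names io_layer_for and apply_io_layer are hypothetical (the actual TWiki.cfg setting is not named in this discussion), and run-time binmode stands in for a "dynamic use open". It only covers filehandle-based I/O, so the ModPerl path would still need separate handling.

```perl
use strict;
use warnings;

# Choose the PerlIO layer from an explicit configuration flag only.
# {UseUnicode} is a hypothetical setting name; the locale name
# ({SiteLocale} here) is deliberately never inspected.
sub io_layer_for {
    my ($cfg) = @_;
    return $cfg->{UseUnicode} ? ':encoding(UTF-8)' : undef;
}

# Apply the chosen layer to the given filehandles at run time.
# In non-Unicode mode the handles are left exactly as they are.
sub apply_io_layer {
    my ($cfg, @handles) = @_;
    my $layer = io_layer_for($cfg);
    return unless defined $layer;
    binmode($_, $layer) or die "binmode failed: $!" for @handles;
    return $layer;
}
```

With this shape, a site running a UTF-8 locale but with the flag off keeps its current byte-oriented behaviour, which is the case the paragraph above asks to preserve.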
Also, please don't do use utf8 to implement Unicode - it has an entirely different meaning between Perl 5.6 (where it means 'assume all data processed is UTF-8') and 5.8 (where it means 'variable names, literals, etc in this file can be UTF-8').
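To illustrate the point for Perl 5.8 and later: since use utf8 only concerns the source file there, incoming data stays as bytes until it is decoded explicitly. This is a small sketch using the core Encode module; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a file or socket:
# the UTF-8 encoding of e-acute (U+00E9).
my $bytes = "\xC3\xA9";

# Without decoding, Perl sees two separate byte-valued characters.
my $undecoded_length = length($bytes);      # 2

# Explicit decoding turns the bytes into one Unicode character.
my $chars = decode('UTF-8', $bytes);
my $decoded_length = length($chars);        # 1

# Encoding on output turns the character back into the original bytes.
my $round_trip = encode('UTF-8', $chars);
```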
--
AntonioTerceiro - 21 Nov 2005
ProposedUTF8SupportForI18N has a lot of existing thinking and planning - should be a reasonable starting point, though it probably doesn't talk enough about performance issues. Requiring a recent Perl 5.8.x version is important too to avoid annoying bugs and perhaps help with performance.
It would also be worth considering GB18030 support, perhaps only in the browser - this is a 1-1 mapping from Unicode (i.e. really a Unicode Transformation Format analogous to UTF-8) that has been mandated by the Chinese government. More details: Wikipedia:GB18030 and IBM DeveloperWorks article.
CPAN:Encode::HanExtra supports GB18030 conversion to/from Unicode.
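As a rough sketch of what that conversion could look like: the helper gb18030_round_trip is hypothetical, and the code assumes Encode::HanExtra registers its encoding under the name gb18030. The module is loaded defensively because it is not part of core Perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Encode::HanExtra comes from CPAN and may not be installed,
# so load it inside eval instead of with a plain "use".
my $have_hanextra = eval { require Encode::HanExtra; 1 } ? 1 : 0;

# Convert a Perl character string to GB18030 bytes and back again.
# Returns undef when Encode::HanExtra is unavailable.
sub gb18030_round_trip {
    my ($text) = @_;
    return undef unless $have_hanextra;
    my $bytes = encode('gb18030', $text);
    return decode('gb18030', $bytes);
}

my $text = "\x{4E2D}\x{6587}";    # two CJK characters, written as escapes
my $back = gb18030_round_trip($text);
```

Because the mapping is 1-1, the round trip should return the original string whenever the module is present.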
--
RichardDonkin - 21 Nov 2005
Worth noting also that some more recent versions of CPAN:CGI set Unicode mode on characters, which can be a good thing for Unicode support, or a bad thing if you don't want Perl's Unicode mode turned on. For some pointers on this, see discussion on ProblemsWithInternationCharactersInOddPlaces.
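One defensive pattern for code that cannot be sure whether a parameter has already been decoded is sketched below. The helper ensure_decoded is hypothetical, and the check is only a heuristic: Perl's internal flag does not reliably say "this was decoded", so knowing at the input boundary what the CGI layer does is more robust than guessing afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return a character string whether or not the input layer
# (for example a CGI module) has already decoded the value.
# Heuristic only: relies on Perl's internal UTF8 flag.
sub ensure_decoded {
    my ($value) = @_;
    return $value if utf8::is_utf8($value);   # flag on: treat as characters
    return decode('UTF-8', $value);           # flag off: treat as UTF-8 bytes
}
```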
--
RichardDonkin - 31 Jan 2006
One support request that really needs Unicode support is CentralEuropeanCharacters. Since the increasingly popular MediaWiki has excellent Unicode support, I think we need to do something here. Unfortunately I have virtually no time for coding but I have done a lot of research and am happy to advise and review.
--
RichardDonkin - 03 Sep 2006
One thing to watch out for is that Perl 5.8 now distinguishes between "utf8" and "UTF-8" - the former is Perl's looser interpretation, the latter is as specified by the Unicode standards, and is also known as "utf-8-strict". For details, see recent Encode documentation.
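The difference can be seen with a small sketch using the core Encode module. The helper decodes_cleanly is hypothetical; the test input is the UTF-8-style byte sequence for U+D800, a surrogate that strict UTF-8 does not allow.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);

# The two spellings resolve to different encoding objects.
my $lax_name    = find_encoding('utf8')->name;
my $strict_name = find_encoding('UTF-8')->name;

# UTF-8-style bytes for U+D800, a surrogate code point.
my $surrogate_bytes = "\xED\xA0\x80";

# Attempt a decode that dies on malformed input.
# Returns 1 if the decode succeeded, 0 if it died.
sub decodes_cleanly {
    my ($encoding, $bytes) = @_;
    return eval { decode($encoding, $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my $strict_ok = decodes_cleanly('UTF-8', $surrogate_bytes);
my $lax_ok    = decodes_cleanly('utf8',  $surrogate_bytes);

print "utf8 resolves to '$lax_name', accepts the surrogate: $lax_ok\n";
print "UTF-8 resolves to '$strict_name', accepts the surrogate: $strict_ok\n";
```

For data coming from browsers or other untrusted sources, the strict form is usually the safer default, since it rejects input that other software may treat as invalid.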
There are also some interesting war stories about doing Unicode with Perl in this blog entry.
Also, this excellent blog entry provides some wrappers around CPAN:CGI and CPAN:DBI to make them work better with UTF-8.
--
RichardDonkin - 04 Nov 2006
Good blog posting about Perl UTF-8 coding including CPAN:encoding::warnings - handy for debugging Unicode.
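For context, the class of bug that module is aimed at is the silent mixing of undecoded bytes with character strings. The sketch below reproduces the problem with core Perl only and does not use encoding::warnings itself; check that module's documentation for which Perl versions it still has an effect on.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";      # undecoded UTF-8 bytes for e-acute (U+00E9)
my $chars = "\x{263A}";      # an already decoded character string (U+263A)

# Concatenation upgrades the bytes as if they were Latin-1, so the
# e-acute turns into two characters (U+00C3 and U+00A9) instead of one.
my $mixed = $bytes . $chars;

# Decoding first keeps the e-acute as a single character.
my $fixed = decode('UTF-8', $bytes) . $chars;

printf "mixed: %d characters, fixed: %d characters\n",
    length($mixed), length($fixed);
```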
--
RichardDonkin - 03 Apr 2007
Russian characters or encoding.
Why force UTF-8 when not all browsers support it equally and HTML code editors do not copy and paste UTF-8 correctly?
I write Russian (non-TWiki) pages with this heading:
<?xml version="1.0" encoding="windows-1251"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head xml:lang="ru" lang="ru"><meta http-equiv="content-type" content="text/html; charset=windows-1251" xml:lang="ru" lang="ru" />
Any browser displays such pages correctly. The markup validates perfectly, and I can read and edit the code with any editor, as it should be.
Using charset=iso-8859-15, as in current TWiki, never does the job for Russian.
You’d have to switch (any) browser manually to UTF-8 each time you change pages.
The question is: how do I make TWiki's Perl code generate pages with the above Windows-1251 heading?
--
DimitriRytsk - 11 May 2007
See my other response. Windows-1251 is clearly a short term solution that is in no way comparable to proper Unicode support, which supports multiple languages simultaneously. Please don't ask the same question twice on pages that have nothing to do with Russian character set support.
--
RichardDonkin - 13 May 2007