
Unicode support

So, let's investigate what's needed for full Unicode support.

From RichardDonkin on Bugs:Item772:

(Digression) It's worth noting that the locale code needs re-working anyway to cover two cases when we do Unicode, though that's not in scope for Dakar:
  1. Unicode - apply the open pragma (use open) dynamically to set utf8 mode on all data read and written. This must also cover ModPerl, which, unlike CGI, doesn't use file descriptors to pass data to TWiki scripts. This code path must never do a use locale or equivalent, because mixing Unicode and locales breaks things quite comprehensively (a Perl bug-fest, I tried this...).
  2. Non-Unicode - should function as now (assuming this is just a bug)

The hard part is that the switch between (1) and (2) must be dynamic, based on a TWiki.cfg setting. It should NOT be based purely on locale matching /\.utf-?8$/, because some people may validly want to run with a UTF-8 locale and browser character set, but without Unicode mode.
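
A minimal sketch of such a switch, assuming a hypothetical UseUnicode setting (the name is illustrative, not an actual TWiki.cfg key). It uses binmode at run time rather than the compile-time open pragma, and it deliberately never consults the locale string:

```perl
use strict;
use warnings;

# Hypothetical settings, standing in for values read from TWiki.cfg.
my %cfg = (
    UseUnicode => 1,                # explicit switch (assumed name)
    SiteLocale => 'en_US.UTF-8',    # not consulted when choosing the mode
);

# Return the PerlIO layer to apply in Unicode mode, or undef in
# non-Unicode mode, where handles are left exactly as they are now.
sub io_layer_for {
    my ($cfg) = @_;
    return $cfg->{UseUnicode} ? ':encoding(UTF-8)' : undef;
}

# Apply the chosen layer to each handle; do nothing in non-Unicode mode.
sub apply_io_mode {
    my ($cfg, @handles) = @_;
    my $layer = io_layer_for($cfg);
    return 0 unless defined $layer;
    binmode($_, $layer) or die "binmode failed: $!" for @handles;
    return 1;
}

# Usage: write one character (U+017E) through an in-memory handle.
open(my $out, '>', \my $buffer) or die "open failed: $!";
apply_io_mode(\%cfg, $out);
print {$out} "\x{17E}";
close($out);
# $buffer now holds the two UTF-8 bytes 0xC5 0xBE
```

With UseUnicode set to 0 the handles are not touched at all, even though the locale name still ends in UTF-8.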

Also, please don't rely on the utf8 pragma (use utf8) to implement Unicode - it means entirely different things in Perl 5.6 (where it means 'assume all data processed is UTF-8') and in 5.8 (where it means 'variable names, literals, etc in this file can be UTF-8').
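
To illustrate the 5.8 meaning with a small sketch: data is converted from bytes to characters by decoding it explicitly with the core Encode module, and use utf8 plays no part in that. The variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive from a file or socket:
# U+00E9 (e with acute accent) encoded as UTF-8.
my $bytes = "\xC3\xA9";

# Without decoding, Perl sees two separate byte-valued characters.
my $byte_length = length($bytes);       # 2

# Explicit decoding turns the two bytes into one character.
my $chars       = decode('UTF-8', $bytes);
my $char_length = length($chars);       # 1
```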

-- AntonioTerceiro - 21 Nov 2005

ProposedUTF8SupportForI18N has a lot of existing thinking and planning - should be a reasonable starting point, though it probably doesn't talk enough about performance issues. Requiring a recent Perl 5.8.x version is important too to avoid annoying bugs and perhaps help with performance.

It would also be worth considering GB18030 support, perhaps only in the browser - this is a 1-1 mapping from Unicode (i.e. really a Unicode Transformation Format analogous to UTF-8) that has been mandated by the Chinese government. More details: Wikipedia:GB18030 and IBM DeveloperWorks article. CPAN:Encode::HanExtra supports GB18030 conversion to/from Unicode.
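
A rough sketch of what the conversion could look like, assuming Encode::HanExtra is installed from CPAN (it is not a core module) and registers the encoding under the name gb18030. The helper names to_gb18030 and from_gb18030 are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode::HanExtra is optional here, so load it at run time
# and remember whether that worked.
my $have_gb18030 = eval { require Encode::HanExtra; 1 } ? 1 : 0;

# Convert a Perl character string to GB18030 bytes,
# or return undef when the module is not available.
sub to_gb18030 {
    my ($chars) = @_;
    return unless $have_gb18030;
    return encode('gb18030', $chars);
}

# Convert GB18030 bytes back to a Perl character string,
# or return undef when the module is not available.
sub from_gb18030 {
    my ($bytes) = @_;
    return unless $have_gb18030;
    return decode('gb18030', $bytes);
}
```

Because the mapping is 1-1, from_gb18030(to_gb18030($text)) should give back the original text whenever the module is present.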

-- RichardDonkin - 21 Nov 2005

Worth noting also that some more recent versions of CPAN:CGI set Unicode mode on characters, which can be a good thing for Unicode support, or a bad thing if you don't want Perl's Unicode mode turned on. For some pointers on this, see discussion on ProblemsWithInternationCharactersInOddPlaces.
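
When debugging this, it can help to check whether a value already carries Perl's internal Unicode flag. A small sketch using the built-in utf8::is_utf8 function; the helper name flag_state is hypothetical, and the sample values merely stand in for request parameters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Report whether Perl's internal Unicode flag is set on a string.
sub flag_state {
    my ($value) = @_;
    return utf8::is_utf8($value) ? 'flagged' : 'unflagged';
}

my $raw     = "\xC5\xBE";               # UTF-8 bytes for U+017E
my $decoded = decode('UTF-8', $raw);    # the same text as one character

my $raw_state     = flag_state($raw);       # 'unflagged'
my $decoded_state = flag_state($decoded);   # 'flagged'
```

The flag is an internal detail, so this is a diagnostic aid rather than something application logic should branch on.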

-- RichardDonkin - 31 Jan 2006

One support request that really needs Unicode support is CentralEuropeanCharacters. Since the increasingly popular MediaWiki has excellent Unicode support, I think we need to do something here. Unfortunately I have virtually no time for coding but I have done a lot of research and am happy to advise and review.

-- RichardDonkin - 03 Sep 2006

One thing to watch out for is that Perl 5.8 now distinguishes between "utf8" and "UTF-8" - the former is Perl's looser interpretation, the latter is as specified by the Unicode standards, and is also known as "utf-8-strict". For details, see recent Encode documentation.
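
A quick way to see the distinction, as a sketch that relies only on the core Encode module, is to ask Encode which encoding each name resolves to:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# The name without a hyphen selects Perl's looser interpretation.
my $lax_name    = find_encoding('utf8')->name;     # 'utf8'

# The hyphenated name selects the strict, standards-defined form.
my $strict_name = find_encoding('UTF-8')->name;    # 'utf-8-strict'
```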

There are also some interesting war stories about doing Unicode with Perl in this blog entry.

Also, this excellent blog entry provides some wrappers around CPAN:CGI and CPAN:DBI to make them work better with UTF-8.

-- RichardDonkin - 04 Nov 2006

Good blog posting about Perl UTF-8 coding including CPAN:encoding::warnings - handy for debugging Unicode.

-- RichardDonkin - 03 Apr 2007

Topic revision: r11 - 2009-10-15 - RichardDonkin
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.