Question
- TWiki version: 01 Feb 2003
- Perl version: 5.005_03
- Web server & version: Apache/1.3.26 (Unix)
- Server OS: 4.6-STABLE FreeBSD i386
- Web browser & version: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.3) Gecko/20030414
- Client OS: 4.6-STABLE FreeBSD i386
The procedure in TWiki.pm at line number 1158, which encodes characters as XML entities if( $pageMode eq 'rss' ), incorrectly assumes the ISO-8859-1 charset. This breaks international compatibility, e.g. for the KOI8-R charset. The procedure should convert characters from the specified charset into Unicode, then encode them as entities (if needed).
When a browser sees &#229; it always displays a lowercase `a` with ring, NOT the Russian KOI8-R character with value 229, even if the charset of the document is specified as KOI8-R.
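To illustrate the difference, here is a minimal sketch in Python (not TWiki's Perl code; the function names are hypothetical). It contrasts encoding the raw byte value as an entity with decoding the byte from the declared charset first, assuming the charset name is one Python's codecs recognise.

```python
import html


def naive_entities(raw: bytes) -> str:
    """Encode each high-bit byte as &#N; using the raw byte value (the reported behaviour)."""
    return "".join(f"&#{b};" if b >= 0x80 else chr(b) for b in raw)


def charset_aware_entities(raw: bytes, charset: str) -> str:
    """Decode from the given charset to Unicode first, then encode non-ASCII code points."""
    text = raw.decode(charset)
    return "".join(f"&#{ord(ch)};" if ord(ch) >= 0x80 else ch for ch in text)


sample = bytes([229])  # a single KOI8-R byte with value 229

print(naive_entities(sample))                    # &#229;
print(charset_aware_entities(sample, "koi8-r"))  # &#1045;

# A numeric character reference always names a Unicode code point,
# so &#229; resolves to U+00E5 regardless of the document charset.
print(html.unescape("&#229;") == "\u00e5")       # True
```

Byte 229 in KOI8-R is the Cyrillic capital letter IE (U+0415), so the entity that matches the author's intent is &#1045;, not &#229;.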
--
SergeySolyanik - 14 Apr 2003
Answer
You probably have this set up correctly, but just to check... Do you have KOI8-R as your %CHARSET% setting, i.e. the last part of $siteLocale? Also, is %CHARSET% used in your version of CVSget:templates/view.rss.tmpl? If so, displaying &#229; should correctly display the corresponding KOI8-R character, not the ISO-8859-1 å.
The code simply takes a character with the high bit set and encodes it as an HTML entity, without assuming anything about the charset except that it is an 8-bit charset - if you have CHARSET set correctly this should work. See NationalCharactersEncodedInSearchResults, which was the reason for this behaviour - it was done for someone using KOI8-R and tested using this charset with Mozilla and InternetExplorer. Some Cyrillic test pages are at CyrillicSupport, though the RSS feed on my site is broken at the moment.
--
RichardDonkin - 15 Apr 2003
Yes, I have the correct %CHARSET% in view.rss.tmpl, and $siteLocale is ru_RU.KOI8-R. But the assumption about displaying a KOI8-R character entity is incorrect! No standards-compliant browser should apply the document encoding to entities, and that is how Mozilla behaves. &#229; is å in any encoding; that is what entities are for. And the NationalCharactersEncodedInSearchResults example doesn't use HTML entities, it uses 8-bit characters.
I think that using Unicode (UTF-8) will be good enough for RSS feeds. Pages should be mapped from the local charset into UTF-8. Or, if we use another charset, TWiki should not encode characters as entities.
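As a rough sketch of the mapping I mean (in Python rather than TWiki's Perl; `to_utf8` is a hypothetical helper, not an existing TWiki function), the page bytes are decoded from the site charset and re-encoded as UTF-8, and the feed then declares UTF-8 as its encoding:

```python
def to_utf8(raw: bytes, site_charset: str) -> bytes:
    """Transcode text from the site's 8-bit charset to UTF-8."""
    return raw.decode(site_charset).encode("utf-8")


# Hypothetical sample: three Cyrillic letters stored as KOI8-R bytes.
koi8_bytes = "\u0420\u0443\u0441".encode("koi8-r")

utf8_bytes = to_utf8(koi8_bytes, "koi8-r")

# The feed header would then declare the new encoding.
feed = b'<?xml version="1.0" encoding="UTF-8"?>\n<title>' + utf8_bytes + b"</title>"
print(feed.decode("utf-8"))
```

With this approach no numeric entities are needed for national characters at all, because UTF-8 can represent them directly.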
--
SergeySolyanik - 16 Apr 2003
The simplest fix is just to remove all XML entity encoding for search results, for all non-Unicode character sets (UTF-8 not being supported yet). See NationalCharactersEncodedInSearchResults, which I have re-opened as a result, and thanks for logging this. I've attached an example of an RSS feed that includes one URL to a TWiki topic with a KOI8-R name - this should work in any browser since the encoding on this page is KOI8-R, demonstrating that avoiding the use of XML entities will fix this bug. The summary doesn't work in the attachment since it uses XML entities, illustrating the issue.
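A minimal sketch of that idea, in Python for illustration only (this is not the TWiki.pm code, and `escape_markup_only` is a hypothetical name): escape just the characters that are special in XML, pass the 8-bit bytes through untouched, and rely on the encoding declared in the feed. It assumes an ASCII-compatible single-byte charset such as KOI8-R, where byte-level replacement cannot corrupt a character.

```python
def escape_markup_only(raw: bytes) -> bytes:
    """Escape &, < and > but leave high-bit bytes exactly as they are."""
    return (
        raw.replace(b"&", b"&amp;")  # must run first, or it would re-escape the others
           .replace(b"<", b"&lt;")
           .replace(b">", b"&gt;")
    )


summary = b"a<b & " + bytes([229])  # byte 229 is a KOI8-R letter
escaped = escape_markup_only(summary)

# The declared encoding tells the reader how to interpret the raw bytes.
feed = b'<?xml version="1.0" encoding="KOI8-R"?>\n<description>' + escaped + b"</description>"
print(feed.decode("koi8-r"))
```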
--
RichardDonkin - 16 Apr 2003