Question
If I have a topic with Japanese text, and have the whole site set to EUC-JP as per the instructions, all the page display and editing seems to work problem-free.
However, if I try to use an RSS feed, I get the Japanese text all scrambled. I use:
http://mywebsite/bin/view/TWiki/WebRss?skin=rss&contenttype=text/xml
and the file is happily parsed by Opera 7.54, RssReader and Internet Explorer, but the Japanese text cannot be displayed correctly, just mojibaked nonsense! I suspect there is perhaps one conversion to Unicode to build up the XML file, but no conversion back to EUC-JP for delivery to the browser?
Environment
--
KenYasumotoNicolson - 16 Nov 2004
Answer
Maybe you need Ludovic's UTF patch in HeadlinesPluginDev?
--
PeterThoeny - 16 Nov 2004
Can you give us an example of the output of that RSS link, as an attachment to this page? Also, an HTML attachment of the output of testenv would be helpful.
I did put in some code to make RSS I18N work, but it's a little complex and probably not fully working yet: it's in SVNget:lib/TWiki/Render.pm, as follows:
# Encode special chars into XML &#nnn; entities for use in RSS feeds
# - no encoding for HTML pages, to avoid breaking international
# characters. FIXME: Only works for ISO-8859-1 characters, where the
# Unicode encoding (&#nnn;) is identical.
if( $renderMode eq 'rss' ) {
    # FIXME: Issue for EBCDIC/UTF-8
    $htext =~ s/([\x7f-\xff])/"\&\#" . unpack( "C", $1 ) .";"/ge;
}
The problem is that this is a hack that only works for ISO-8859-1 sites - what's needed is some code that uses CPAN:Encode or similar to convert from EUC-JP to Unicode &#nnnnn; entities (aka Numeric Character References). These should then be usable by any RSS reader, provided it supports Unicode NCRs as it should.
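The conversion described above could look roughly like the following. This is only a sketch, not TWiki code: encode_ncr is a hypothetical helper name, and it assumes Perl 5.8 or later with the Encode module and its EUC-JP support installed. It decodes the site-charset bytes to Unicode characters and then replaces everything outside printable ASCII with a numeric character reference. It does not escape XML-special characters such as & and <, which would still need separate handling.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of TWiki): turn text in the site
# character set into ASCII plus Unicode numeric character references.
sub encode_ncr {
    my ( $bytes, $siteCharset ) = @_;

    # Decode the site-charset bytes into Perl's internal Unicode characters
    my $text = decode( $siteCharset, $bytes );

    # Replace DEL and every non-ASCII character with &#nnnnn;
    $text =~ s/([^\x00-\x7e])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# The EUC-JP bytes C6 FC CB DC are the two characters U+65E5 U+672C
my $euc = "\xc6\xfc\xcb\xdc";
print encode_ncr( $euc, 'euc-jp' ), "\n";    # prints &#26085;&#26412;
```

Because the result is pure ASCII, it should survive whatever charset the feed declares, which is the main attraction of NCRs over sending raw EUC-JP bytes.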
A simpler short-term fix for non-ISO-8859-1 sites might be to first make sure the RSS reader supports EUC-JP. Then, since the RSS template (SVNget:templates/view.rss.tmpl) specifies EUC-JP as the charset (this is automatic through the %CHARSET% variable), it is just a matter of commenting out the line beginning with $htext above.
This patch is a slightly more elegant way of doing this which also works for ISO-8859-1 - not tested, please let me know how you get on.
This is a duplicate of PageModeRssEncodeBug, and the fix is the same. Will add the full fix to ProposedUTF8SupportForI18N. BTW, there are no conversions to/from Unicode in TWiki at present, except for conversion from UTF-8 URLs into the site character set (EncodeURLsWithUTF8).
--
RichardDonkin - 18 Nov 2004
I love the word

, it is so cute
--
PeterThoeny - 20 Nov 2004
Go on, tell me what it means - always a risk of cutting and pasting in languages you don't know
--
RichardDonkin - 21 Nov 2004
Where do I insert these changes? The patch file says @@ -1213,10 +1213,9 @@ but there are only 1132 lines in my installed Render.pm and 1142 in the version I downloaded from the link above.
I'll also attach a copy of the mojibaked RSS file and the "Raw text" output from one of the source pages - I'm using Windows, so I hope MS hasn't done some character set conversion on the fly as I cut and paste...
--
KenYasumotoNicolson - 22 Nov 2004
The simplest way to apply these changes is to use the patch program - see PatchGuidelines for a link. This program automates applying patches for TWiki and many other OpenSource projects.
However, you can also make the change by hand: search the file for 'x7f', which should find the line starting '$htext =~ ...' above, and replace that line with the following lines:
# Encode special chars into XML &#nnn; entities for use in RSS feeds
# - no encoding for HTML pages, to avoid breaking international
# characters. Only works for ISO-8859-1 sites, since the Unicode
# encoding (&#nnn;) is identical for first 256 characters.
# I18N TODO: Convert to Unicode from any site character set.
if( $renderMode eq 'rss' and $TWiki::siteCharset =~ /^iso-?8859-?1$/i ) {
    $htext =~ s/([\x7f-\xff])/"\&\#" . unpack( "C", $1 ) .";"/ge;
}
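To see what the charset guard does, here is a stand-alone sketch of the same substitution. rss_escape is a hypothetical wrapper written only for this illustration, and the site charset is passed in as a parameter instead of being read from $TWiki::siteCharset. On an ISO-8859-1 site the high bytes become entities as before. On an EUC-JP site the bytes are left alone, so the feed carries raw EUC-JP and relies on the charset declared by the RSS template.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the guarded substitution,
# for illustration only (not the actual Render.pm code).
sub rss_escape {
    my ( $htext, $siteCharset ) = @_;

    # Only ISO-8859-1 byte values map directly onto Unicode code points
    if ( $siteCharset =~ /^iso-?8859-?1$/i ) {
        $htext =~ s/([\x7f-\xff])/'&#' . unpack( "C", $1 ) . ';'/ge;
    }
    return $htext;
}

# ISO-8859-1 site: e-acute (byte E9) becomes an entity
print rss_escape( "caf\xe9", 'ISO-8859-1' ), "\n";    # prints caf&#233;

# EUC-JP site: multi-byte text passes through untouched
my $euc = "\xc6\xfc\xcb\xdc";
print rss_escape( $euc, 'EUC-JP' ) eq $euc ? "unchanged\n" : "changed\n";
```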
This fix will soon be in the TWiki code (SVN DEVELOP branch).
Thanks for the attachments - these are the best way to provide non-ASCII character data for I18N bug reports.
--
RichardDonkin - 22 Nov 2004