
Question

If I have a topic with Japanese text and the whole site set to EUC-JP as per the instructions, page display and editing both seem to work problem-free.

However, if I try to use an RSS feed, I get the Japanese text all scrambled. I use:

http://mywebsite/bin/view/TWiki/WebRss?skin=rss&contenttype=text/xml
and the file is happily parsed by Opera 7.54, RssReader and Internet Explorer, but the Japanese text cannot be displayed correctly, just mojibaked nonsense! I suspect there is perhaps one conversion to Unicode to build up the XML file, but no conversion back to EUC-JP for delivery to the browser?

Environment

TWiki version: TWikiRelease01Sep2004
TWiki plugins: DefaultPlugin, EmptyPlugin, InterwikiPlugin
Server OS: Windows 2000
Web server: Apache 1.3
Perl version: 5.8.?
Client OS: Windows XP
Web Browser: Opera 7.54, Explorer
Categories: Internationalisation

-- KenYasumotoNicolson - 16 Nov 2004

Answer

Maybe you need Ludovic's UTF patch in HeadlinesPluginDev?

-- PeterThoeny - 16 Nov 2004

Can you give us an example of the output of that RSS link, as an attachment to this page? Also, an HTML attachment of the output of testenv would be helpful.

I did put in some code to make RSS I18N work, but it's a little complex and probably not really fully working yet: it's in SVNget:lib/TWiki/Render.pm, as follows:

    # Encode special chars into XML &#nnn; entities for use in RSS feeds
    # - no encoding for HTML pages, to avoid breaking international 
    # characters. FIXME: Only works for ISO-8859-1 characters, where the
    # Unicode encoding (&#nnn;) is identical.
    if( $renderMode eq 'rss' ) {
        # FIXME: Issue for EBCDIC/UTF-8
        $htext =~ s/([\x7f-\xff])/"\&\#" . unpack( "C", $1 ) .";"/ge;
    }

The problem is that this is a hack that only works for ISO-8859-1 sites. What's needed is code that uses CPAN:Encode or similar to convert from EUC-JP to Unicode &#nnnnn; entities (aka Numeric Character References). These should then be usable by any RSS reader, provided it supports Unicode NCRs, as it should.
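As an illustration of the difference, here is a minimal stand-alone sketch, not TWiki code. It assumes the Encode module (bundled with Perl 5.8) and uses two hypothetical helper names, bytewise_ncr and bytes_to_ncr, that do not exist in TWiki. The first helper mimics the current byte-by-byte substitution; the second decodes the site charset to Unicode before emitting NCRs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: mimics the existing Render.pm substitution, which
# treats every high byte as if it were an ISO-8859-1 character.
sub bytewise_ncr {
    my ($bytes) = @_;
    $bytes =~ s/([\x7f-\xff])/"&#" . unpack( "C", $1 ) . ";"/ge;
    return $bytes;
}

# Hypothetical helper: decode from the site charset to Perl characters
# first, then emit one NCR per character outside printable ASCII.
sub bytes_to_ncr {
    my ( $bytes, $charset ) = @_;
    my $text = decode( $charset, $bytes );
    $text =~ s/([^\x00-\x7e])/"&#" . ord($1) . ";"/ge;
    return $text;
}

# The two characters U+65E5 U+672C encoded in EUC-JP (bytes C6 FC CB DC)
my $euc = "\xc6\xfc\xcb\xdc";

print bytewise_ncr($euc), "\n";              # &#198;&#252;&#203;&#220; (mojibake)
print bytes_to_ncr( $euc, 'euc-jp' ), "\n";  # &#26085;&#26412;
```

The first line of output is what an RSS reader receives today from an EUC-JP site: four Latin-1 characters instead of two Japanese ones. A real fix would take the charset from the site configuration rather than a literal string.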

A simpler short-term fix for non-ISO-8859-1 sites might be to first make sure the RSS reader supports EUC-JP. Then, since the RSS template (SVNget:templates/view.rss.tmpl) specifies EUC-JP as the charset (this is automatic through the %CHARSET% variable), it is just a matter of commenting out the line beginning with $htext above.

The attached patch (rss-i18n.patch) is a slightly more elegant way of doing this which also works for ISO-8859-1. It is not tested, so please let me know how you get on.

This is a duplicate of PageModeRssEncodeBug, and the fix is the same. Will add the full fix to ProposedUTF8SupportForI18N. BTW, there are no conversions to/from Unicode in TWiki at present, except for conversion from UTF-8 URLs into the site character set (EncodeURLsWithUTF8).

-- RichardDonkin - 18 Nov 2004

I love the word mojibake, it is so cute :-)

-- PeterThoeny - 20 Nov 2004

Go on, tell me what it means - always a risk of cutting and pasting in languages you don't know ;-)

-- RichardDonkin - 21 Nov 2004

Where do I insert these changes? The patch file says @@ -1213,10 +1213,9 @@ but there are only 1132 lines in my installed Render.pm and 1142 in the version I downloaded from the link above.

I'll also attach a copy of the mojibaked RSS file and the "Raw text" output from one of the source pages - I'm using Windows, so I hope MS hasn't done some character set conversion on the fly as I cut and paste...

-- KenYasumotoNicolson - 22 Nov 2004

The simplest way to apply these changes is to use the patch program - see PatchGuidelines for a link. This program automates applying patches for TWiki and many other OpenSource projects.

However, you can also just locate the change by hand - just search the file for 'x7f', which should be the line starting '$htext =~ ...' above, and then replace that line with the lines:

    # Encode special chars into XML &#nnn; entities for use in RSS feeds
    # - no encoding for HTML pages, to avoid breaking international
    # characters. Only works for ISO-8859-1 sites, since the Unicode
    # encoding (&#nnn;) is identical for first 256 characters.
    # I18N TODO: Convert to Unicode from any site character set.
    if( $renderMode eq 'rss' and $TWiki::siteCharset =~ /^iso-?8859-?1$/i ) {
        $htext =~ s/([\x7f-\xff])/"\&\#" . unpack( "C", $1 ) .";"/ge;
    }

This fix will soon be in the TWiki code (SVN DEVELOP branch).
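To show which sites the new charset guard affects, here is a small stand-alone sketch. The pattern is copied from the fix above; the helper name is_latin1_site and the list of charset names are assumptions for illustration only and are not part of TWiki.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern used in the fix above.
sub is_latin1_site {
    my ($charset) = @_;
    return $charset =~ /^iso-?8859-?1$/i ? 1 : 0;
}

# Sample charset names (assumed values of the site charset setting)
for my $cs ( 'ISO-8859-1', 'iso8859-1', 'ISO-8859-15', 'EUC-JP', 'UTF-8' ) {
    printf "%-12s %s\n", $cs,
      is_latin1_site($cs) ? 'encode high bytes as &#nnn;' : 'leave bytes unchanged';
}
```

Only the two ISO-8859-1 spellings match, so on an EUC-JP site the feed text is left in the site charset, and the RSS reader must then honour the charset declared in the feed.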

Thanks for the attachments - these are the best way to provide non-ASCII character data for I18N bug reports.

-- RichardDonkin - 22 Nov 2004

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| InnovationsAwards.txt | r1 | 3.0 K | 2004-11-22 - 00:54 | UnknownUser | Source text file direct from twiki/data/News/InnovationsAwards.txt |
| rss-i18n.patch | r1 | 0.8 K | 2004-11-18 - 13:18 | UnknownUser | Patch for I18N of RSS feeds for non-ISO-8859-1 sites |
| webRss.xml | r1 | 17.6 K | 2004-11-22 - 00:51 | UnknownUser | Mojibaked WebRSS xml file |
Topic revision: r8 - 2004-11-22 - RichardDonkin
 