Bug: "&nbsp;" breaks RSS feed

Using an &nbsp; entity in the first few lines of a topic breaks the TWiki.org RSS feed (see RichSiteSummary).

Test case

Using IE5.5 or Mozilla, click on http://twiki.org/cgi-bin/view/Support/WebRss, the RSS feed generated by WebRss. (Parameters removed from the URL since they are no longer needed -- PTh - 22 Sep 2005)
You'll see an XML error (currently) because the description element for the Support.WebHome page includes an &nbsp; entity. Presumably this is not allowed by the RSS spec (see RichSiteSummary's link to RSS 1.0 spec), but I'm not sure.

Another test is to load this URL into FeedReader, which also complains.
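
The same failure can probably be reproduced without a browser. A minimal sketch, assuming a non-validating XML parser behaves like the feed readers above; the two feed fragments are made up for illustration and are not the real TWiki.org output:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments of an RSS item, not copied from TWiki.org.
BROKEN = "<item><description>Welcome&nbsp;to Support</description></item>"
NUMERIC = "<item><description>Welcome&#160;to Support</description></item>"


def parses(xml_text):
    """Return True if xml_text is accepted by the parser, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(BROKEN))   # named entity with no declaration in the document
print(parses(NUMERIC))  # numeric character reference for the same character
```

The first fragment is rejected because nothing in the document declares nbsp; the second is accepted because numeric character references need no declaration.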

Environment

TWiki version:	TWiki.org
TWiki plugins:	various
Server OS:
Web server:
Perl version:
Client OS:	Win2000
Web Browser:	IE 5.5 or FeedReader 1.65b

-- RichardDonkin - 18 Feb 2002

Follow up

One solution is for TWiki to escape or omit such entities (definitely the simplest approach) - however, a bit of research is needed to see what is allowed in an RSS feed and make sure we avoid such problems cropping up again.
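
A rough sketch of what "escape or omit" could look like, in Python rather than TWiki's own code. The function names and the regular expression are assumptions made up for this example; the only entities left alone are the five that XML predefines:

```python
import re

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entities such as &nbsp; or &copy; (not numeric ones like &#160;).
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def omit_unknown_entities(text):
    """Drop every named entity that XML does not predefine."""
    def repl(match):
        return match.group(0) if match.group(1) in XML_PREDEFINED else ""
    return NAMED_ENTITY.sub(repl, text)


def escape_unknown_entities(text):
    """Escape the ampersand of unknown entities so they show up literally."""
    def repl(match):
        if match.group(1) in XML_PREDEFINED:
            return match.group(0)
        return "&amp;" + match.group(1) + ";"
    return NAMED_ENTITY.sub(repl, text)


print(omit_unknown_entities("a&nbsp;b &amp; c"))
print(escape_unknown_entities("a&nbsp;b &amp; c"))
```

Omitting loses the character entirely, while escaping keeps the feed well-formed but shows the raw entity text to the reader, so which one is better depends on how the summary is displayed.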

Another approach is to define the entities allowed - XML apparently includes very few valid entities, whereas HTML includes lots, hence the problem. See http://www.oreillynet.com/cs/user/view/cs_msg/2189 for a thread about this issue with Meerkat RSS feeds. The RSS 0.91 DTD actually includes some additional in-line entity definitions that would get rid of these undefined entity errors.

For RSS 1.0 feeds, the 1.0 spec's Core Syntax section includes some sample code (under 'Entities') on how to include HTML entities for Latin-1 character sets, which might be an easy solution that doesn't require entity mangling. However, the set of entities may depend on whether the included TWiki topic contents are really HTML 3.2, HTML 4.0 or XHTML (no idea if the set of entities varies between these HTML flavours) - see XhtmlConsideredHarmful and ConvertToXHTML10.
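
To illustrate the "define the entities" idea, here is a small sketch that declares nbsp in the document's internal DTD subset and parses it with Python's standard library. The feed fragment is hypothetical, and declaring a single entity inline is a simplification of the spec's approach of pulling in a whole entity file:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment; the DOCTYPE declares nbsp inline.
FEED = """<?xml version="1.0"?>
<!DOCTYPE item [
  <!ENTITY nbsp "&#160;">
]>
<item><description>Welcome&nbsp;to Support</description></item>"""

root = ET.fromstring(FEED)
description = root.find("description").text

# The parser has replaced the entity with the character it stands for.
print(repr(description))
```

Whether a given feed reader also fetches an external entity file is a separate question; many non-validating parsers do not, which is one reason inline definitions may be the safer variant.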

-- RichardDonkin - 18 Feb 2002

Fix record

TWiki now filters &nbsp; out of the summary since it is not needed. Not sure if other entities like &amp; pose a problem for RSS feeds.

In TWikiAlphaRelease and TWiki.org.

-- PeterThoeny - 19 Feb 2002

Thanks for the fix, that does solve the current problem. Most HTML entities do cause issues for RSS feeds; see the 1.0 spec's Core Syntax section for the list of five entities that are OK (&amp; is one of them). The spec recommends including the following at the front of every RSS doc to support all XHTML entities (the xhtml-lat1.ent file should probably be hosted on the TWiki site as part of the TWiki distribution):

<?xml version="1.0"?>

<!DOCTYPE rdf:RDF [
<!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
>

...

By the way, <verbatim> has improved a lot - no editing required to include that extract!

I've reset this to BugAssigned, since I think fixing the bug for all HTML entities is worth doing. Also, there needs to be a bit of investigation into where it's necessary to generate &quot; (double quote) and &apos; (single quote) entities - this seems to be needed only when putting a string such as "Hello", she said into double quotes.
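
As a quick check on the quoting question, Python's xml.sax.saxutils shows the difference between element content and attribute values. This is only an illustration with a made-up string, not a statement about what TWiki should generate:

```python
from xml.sax.saxutils import escape, quoteattr

text = '"Hello", she said & left'

# Element content: only &, < and > need escaping; quotes stay as they are.
element_content = escape(text)

# Attribute value: quoteattr picks a delimiter that avoids escaping if it can.
attribute_value = quoteattr(text)

# Forcing &quot; is only needed when the value must sit inside double quotes.
forced = escape(text, {'"': "&quot;"})

print(element_content)
print(attribute_value)
print(forced)
```

So inside a description element the quotes can be left alone; the entities only matter when the text ends up inside a quoted attribute.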

Numeric entities are always OK, even Unicode ones as in one Test web topic, but it's a lot easier on the user not to have to enter these.
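
One way to keep things easy for the user would be to let them type named entities and convert those to numeric ones when the feed is generated. A sketch under that assumption, using Python's HTML entity table; the function name is made up for this example:

```python
import re
from html.entities import name2codepoint

# Entities XML already knows; these are passed through untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text):
    """Rewrite known HTML named entities as numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unrecognised names as they are.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


print(named_to_numeric("Caf&eacute;&nbsp;&amp; bar"))
```

Names that are not in the table are left untouched here, so they would still break the feed and would need one of the other treatments.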

-- RichardDonkin - 20 Feb 2002

Solved this one differently. Entities are simply removed for the summary, and special chars above #127 are encoded as numeric entities. Tested with Sandbox.TestSpecialCharsForWebRss
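
For illustration only, the encoding step might look roughly like this in Python. This is not the TWiki code; it works on Unicode code points, which may differ from what a byte-oriented implementation does with non-Latin-1 charsets:

```python
def encode_high_chars(text):
    """Replace every character above 127 with a numeric character reference."""
    # The xmlcharrefreplace handler turns unencodable characters into &#NNN;.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(encode_high_chars("caf\u00e9 na\u00efve"))
```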

In TWikiAlphaRelease and at TWiki.org.

-- PeterThoeny - 13 Jun 2002

This is related to the I18N work at InternationalisationEnhancements - character entity encodings such as &nbsp; are really just a way of writing a Unicode character without directly embedding it in the document. As it happens, &nbsp; maps to U+00A0, which is in the first 256 code points of Unicode and therefore is the same as ISO-8859-1 (aka Latin-1). In any case, removing this entity is a good idea.
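
The mapping is easy to confirm with Python's standard HTML entity table (a small check, nothing TWiki-specific):

```python
from html.entities import name2codepoint

code_point = name2codepoint["nbsp"]
char = chr(code_point)

print("U+%04X" % code_point)        # the Unicode code point for nbsp
print(char.encode("iso-8859-1"))    # the single byte it occupies in Latin-1
```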

More at InternationalisationUTF8 and PageModeRssEncodeBug, where encoding of characters above 127 caused problems for Russian users of RSS feeds - hence this was reverted in Dec 2002 as part of the I18N work.

-- RichardDonkin - 14 Oct 2003

WebForm
TopicClassification	BugResolved
TopicSummary
InterestedParties
AssignedTo
AssignedToCore
ScheduledFor
RelatedTopics
SpecProgress
ImplProgress
DocProgress

Topic revision: r8 - 2005-09-23 - PeterThoeny

Copyright © 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
