Bug: "&nbsp;" breaks RSS feed
Using an &nbsp; in the first few lines of a topic breaks the TWiki.org RSS feed (see RichSiteSummary).
Test case
- Another test is to load this URL into FeedReader, which also complains.
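The failure can also be reproduced without a feed reader. The following is a minimal sketch, assuming a stock XML parser (Python's xml.etree.ElementTree here) behaves like the readers above; the feed fragments are hypothetical and cut down, not the real TWiki.org feed.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if a stock XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical feed fragments, for illustration only.
WITH_NBSP = "<item><description>Main&nbsp;topic</description></item>"
WITH_AMP = "<item><description>Main &amp; topic</description></item>"
WITH_NUMERIC = "<item><description>Main&#160;topic</description></item>"

if __name__ == "__main__":
    for label, doc in [("&nbsp;", WITH_NBSP), ("&amp;", WITH_AMP), ("&#160;", WITH_NUMERIC)]:
        print(label, "accepted" if parses_as_xml(doc) else "rejected")
```

Only the fragment with the named HTML entity is rejected; the predefined XML entity and the numeric reference both parse.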
Environment
| TWiki version: | TWiki.org |
| TWiki plugins: | various |
| Server OS: | |
| Web server: | |
| Perl version: | |
| Client OS: | Win2000 |
| Web Browser: | IE 5.5 or FeedReader 1.65b |
--
RichardDonkin - 18 Feb 2002
Follow up
One solution is for TWiki to escape or omit such entities (definitely the simplest approach) - however, a bit of research is needed to see what is allowed in an RSS feed and make sure we avoid such problems cropping up again.
Another approach is to define the entities allowed - XML apparently includes very few valid entities, whereas HTML includes lots, hence the problem. See http://www.oreillynet.com/cs/user/view/cs_msg/2189 for a thread about this issue with Meerkat RSS feeds. The RSS 0.91 DTD actually includes some additional in-line entity definitions that would get rid of these undefined entity errors.
For RSS 1.0 feeds, the 1.0 spec's Core Syntax section includes some sample code (under 'Entities') on how to include HTML entities for Latin-1 character sets, which might be an easy solution that doesn't require entity mangling. However, the set of entities may depend on whether the included TWiki topic contents are really HTML 3.2, HTML 4.0 or XHTML (no idea if the set of entities varies between these HTML flavours) - see XhtmlConsideredHarmful and ConvertToXHTML10.
--
RichardDonkin - 18 Feb 2002
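To illustrate the "define the entities allowed" approach, here is a rough sketch that declares a single entity in the document's internal DTD subset and checks that a parser then accepts it. This is an assumption-laden simplification: it declares only nbsp inline rather than pulling in a full entity file, the document is hypothetical, and Python's xml.etree.ElementTree stands in for a feed reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document: the internal DTD subset declares nbsp
# in terms of its numeric character reference (U+00A0).
DOC = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE item [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    "]>\n"
    "<item><description>Main&nbsp;topic</description></item>"
)


def description_text(document: str) -> str:
    """Parse the document and return the text of its <description> child."""
    root = ET.fromstring(document)
    return root.findtext("description")


if __name__ == "__main__":
    # repr() makes the non-breaking space visible in the output.
    print(repr(description_text(DOC)))
```

Whether a given feed reader honours such declarations is a separate question that would need testing against the readers in use.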
Fix record
TWiki now filters &nbsp; out of the summary since it is not needed. Not sure if other entities like &amp; pose a problem for RSS feeds.
In TWikiAlphaRelease and TWiki.org.
--
PeterThoeny - 19 Feb 2002
Thanks for the fix, that does solve the current problem. Most HTML entities do cause issues for RSS feeds, see the 1.0 spec's Core Syntax section for the listing of five entities that are OK (&amp; is one of them). The spec recommends including the following at the front of every RSS doc to support all XHTML entities (the xhtml-lat1.ent file should probably be hosted on the TWiki site as part of the TWiki distribution):
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
>
...
By the way, <verbatim> has improved a lot - no editing required to include that extract!
I've reset this to BugAssigned, since I think fixing the bug for all HTML entities is worth doing. Also, there needs to be a bit of investigation into where it's necessary to generate &quot; (double quotes) and &apos; (single quote) entities - seems to be only when putting a string such as "Hello", she said into double quotes.
Numeric entities are always OK, even Unicode ones as in one Test web topic, but it's a lot easier on the user not to have to enter these.
--
RichardDonkin - 20 Feb 2002
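Since numeric entities are always acceptable, one way to handle all HTML entities without asking users to type numbers is to convert named entities to numeric ones when the feed is generated. The sketch below is an illustration in Python, not TWiki code; named_to_numeric is a hypothetical helper, and it relies on the HTML 4 entity table in Python's html.entities module.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Rewrite named HTML entities as numeric character references."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: leave it alone rather than guess.
            return match.group(0)
        return f"&#{codepoint};"

    return ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(named_to_numeric("a&nbsp;b &amp; caf&eacute;"))
```

Unknown entity names are passed through unchanged here, so they would still break the feed; dropping or escaping them instead is a design choice left open.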
Solved this one differently. Entities are simply removed for the summary, and special chars above #127 are encoded as numeric entities. Tested with Sandbox.TestSpecialCharsForWebRss.
In TWikiAlphaRelease and at TWiki.org.
--
PeterThoeny - 13 Jun 2002
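The behaviour described in this fix can be approximated as follows. This is a sketch in Python, not the actual TWiki change: clean_summary is a hypothetical name, it assumes that every named entity is dropped (including ones such as &amp;), and it assumes characters above 127 become decimal numeric references.

```python
import re

NAMED_ENTITY_RE = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def clean_summary(text: str) -> str:
    """Drop named entities, then encode non-ASCII characters numerically."""
    # Assumption: all named entities are removed outright.
    without_entities = NAMED_ENTITY_RE.sub("", text)
    # xmlcharrefreplace turns each character above 127 into &#NNN;.
    return without_entities.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(clean_summary("Caf\u00e9&nbsp;menu"))
```

Removing an entity outright can join neighbouring words, as the example output shows; substituting a space would be a reasonable alternative.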
This is related to the I18N work at InternationalisationEnhancements - character entity encodings such as &nbsp; are really just a way of writing a Unicode character without directly embedding it in the document. As it happens, &nbsp; maps to U+00A0, which is in the first 256 code points of Unicode and therefore is the same as ISO-8859-1 (aka Latin-1). In any case, removing this entity is a good idea.
More at InternationalisationUTF8 and PageModeRssEncodeBug, where encoding of characters above 127 caused problems for Russian users of RSS feeds - hence this was reverted in Dec 2002 as part of the I18N work.
--
RichardDonkin - 14 Oct 2003