How to Cache RSS Feeds with Apache RewriteRule
Public TWiki sites may see a lot of traffic from RSS feeds. For example, one third of the topic views on TWiki.org are caused by RSS feeds (151K of 452K total views in any given week).
It is possible to use wget to cache RSS feeds as static XML files and to use Apache's mighty RewriteRule to deliver the static file when the TWiki view script is accessed.
Here is how to set this up:
1. Create a /feeds directory under the htdocs root directory
2. Create a cron job that generates the static XML files. Example for the Codev web:
2,17,32,47 * * * * cd /path/to/htdocs/feeds; /path/to/bin/wget --http-user TWikiGuest --http-passwd guest -O CodevWebRss.xml http://twiki.org/cgi-bin/view/Codev/WebRss?t=t > .log.txt 2>&1
The ?t=t query string makes sure that the rewrite rule does not fire for the cache update; for the same reason the rule does not fire for RSS feeds that have a search parameter. The TWikiGuest login prevents a redirect when a recently changed topic has a view access restriction; such a redirect would result in an invalid file of zero bytes.
In this example, the generated file can be accessed as http://feeds/CodevWebRss.xml
3. Update Apache httpd.conf with these rules for the cgi-bin directory:
RewriteEngine On
RewriteCond %{QUERY_STRING} ^$
RewriteRule view/(Codev|Main|Plugins|Sandbox|Support|TWiki)/WebRss$ /feeds/$1\WebRss.xml [L,T=application/rss+xml]
Make sure the rewrite module (mod_rewrite) is loaded; consult the Apache docs.
4. Restart Apache with sbin/apachectl restart
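Step 2 mentions that a failed fetch can leave an invalid file of zero bytes. A minimal sketch of one way to guard against that, assuming a hypothetical wrapper script: fetch into a temporary file and replace the cached copy only when the download is non-empty. The function names install_if_nonempty and refresh_feed are illustrative and not part of TWiki; the wget options are copied from the cron line above.

```shell
#!/bin/sh
# Hypothetical helper: move a freshly fetched feed into place only if it
# is non-empty; otherwise discard it and keep the previous cached copy.
install_if_nonempty() {
  tmp_file=$1
  out_file=$2
  if [ -s "$tmp_file" ]; then
    mv "$tmp_file" "$out_file" && chmod 644 "$out_file"
  else
    rm -f "$tmp_file"
    return 1
  fi
}

# Hypothetical wrapper: fetch one feed URL into a temporary file next to
# the target, then install it with the guard above.
refresh_feed() {
  url=$1
  out_file=$2
  tmp_file="$out_file.tmp.$$"
  wget -q --http-user TWikiGuest --http-passwd guest -O "$tmp_file" "$url"
  install_if_nonempty "$tmp_file" "$out_file"
}
```

A cron job could then source this file and call, for example, refresh_feed 'http://twiki.org/cgi-bin/view/Codev/WebRss?t=t' /path/to/htdocs/feeds/CodevWebRss.xml, so that a failed fetch leaves the last good feed in place.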
--
Contributors: PeterThoeny, MichaelDaum
Discussion
The itching factor for this "how-to" is described in CacheWebRssFeedForSpeed.
--
PeterThoeny - 09 Mar 2006
The rewrite rule should use application/rss+xml and also cache the WebAtom feed.
Here's a more generic rewrite rule that, by the way, also supports the BlogPlugin's extra feeds:
RewriteEngine On
RewriteCond %{QUERY_STRING} ^$
RewriteRule view/(.*)/(Web(Rss|Atom)(Combined|Comments|Teaser)?)$ /feeds/$1$2.xml [L,T=application/rss+xml]
Note that you have to adjust this if you have hierarchical webs.
I attached a getfeeds shell script to be used in the cron job instead of coding the wget call directly into the crontab. Store it in an arbitrary directory, copy the getfeeds.conf.example file to getfeeds.conf in the same directory, and change the default values in there.
I'd recommend regenerating the feeds at a lower frequency, such as once an hour:
0 * * * * /home/www-data/twiki/getfeeds
(adjust the path to getfeeds).
--
MichaelDaum - 09 Mar 2006
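To see why the generic rule needs adjusting for hierarchical webs, its pattern and substitution can be emulated with sed. This is only an illustration, assuming sed -E treats this particular pattern the same way Apache's regex engine does; map_feed is a hypothetical helper name, and Apache itself matches the pattern against the URL path without the query string.

```shell
#!/bin/sh
# Emulate the generic RewriteRule: print the cache path a request path
# would be rewritten to, or nothing if the pattern does not match.
map_feed() {
  printf '%s\n' "$1" |
    sed -nE 's#^.*view/(.*)/(Web(Rss|Atom)(Combined|Comments|Teaser)?)$#/feeds/\1\2.xml#p'
}

map_feed /cgi-bin/view/Codev/WebRss          # prints /feeds/CodevWebRss.xml
map_feed /cgi-bin/view/Blog/WebAtomComments  # prints /feeds/BlogWebAtomComments.xml
map_feed /cgi-bin/view/Main/Sub/WebRss       # prints /feeds/Main/SubWebRss.xml
map_feed /cgi-bin/view/Codev/WebHome         # prints nothing (no match)
```

For a nested web such as Main/Sub, the rewritten path contains a slash, so the cached file would have to live in a /feeds/Main/ subdirectory, or the rule and the file naming would have to be changed together.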
Thanks Micha for the additional info and script.
On TWiki.org I actually installed many cron jobs, offset from each other by 2 minutes. This refreshes each feed at 15-minute intervals and distributes the load on the server. The getfeeds script is useful but puts a burst of load on the server, depending on the number of feeds you have.
--
PeterThoeny - 09 Mar 2006
Then add SLEEP=60 to your getfeeds.conf file (the default is 1 second).
--
MichaelDaum - 09 Mar 2006
Cool!
--
PeterThoeny - 09 Mar 2006
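The staggered cron jobs described earlier in this discussion can be sketched as generated crontab lines. This is a rough illustration under stated assumptions: stagger_cron is a hypothetical function, and /path/to/refresh-feed stands in for whatever per-web fetch command is used. Each feed refreshes every 15 minutes and consecutive feeds start 2 minutes apart.

```shell
#!/bin/sh
# Print one crontab line per web name given as an argument.
# /path/to/refresh-feed is a placeholder, not an existing script.
stagger_cron() {
  offset=0
  for web in "$@"; do
    printf '%d,%d,%d,%d * * * * /path/to/refresh-feed %s\n' \
      "$offset" "$((offset + 15))" "$((offset + 30))" "$((offset + 45))" "$web"
    # Wrap around so the largest minute value never exceeds 59.
    offset=$(( (offset + 2) % 15 ))
  done
}

stagger_cron Codev Main Plugins
# prints:
# 0,15,30,45 * * * * /path/to/refresh-feed Codev
# 2,17,32,47 * * * * /path/to/refresh-feed Main
# 4,19,34,49 * * * * /path/to/refresh-feed Plugins
```

Compared with a single script that fetches every feed in one run, this spreads the requests across the quarter hour at the cost of one crontab entry per feed.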
I've edited the script that MD provided to use curl instead of wget. wget was working nicely until it started fetching the same topic over and over. I had an intuition that using curl I might have better results, so I changed the script. Doing so helped greatly and sped things up.
On line 42 of getfeeds I have:
# wget -q -O $TMP_FILE $VIEW_URL/$WEB/$TOPIC?t=`date +"%s"` && mv $TMP_FILE $OUT_FILE && chmod go+r $OUT_FILE
curl --compressed -s -G -o $TMP_FILE $VIEW_URL/$WEB/$TOPIC?t=$(date +"%s") && mv $TMP_FILE $OUT_FILE && chmod 644 $OUT_FILE
It's a simple change (oh, and the chmod change too; I just wanted to be sure of the right perms on it). BTW, if you're not using mod_deflate or mod_gzip, LEAVE OFF the --compressed option; it won't work otherwise.
Also, as for having to restart the daemon above: that's overkill. A reload is quite often enough. You only need a restart when you've changed a module (adding or removing one, for instance).
HTH
--
EricCote - 15 Mar 2006
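A reload along the lines suggested above might look like the following sketch. It assumes that "reload" corresponds to apachectl's graceful subcommand and that the binary lives at sbin/apachectl as in step 4; reload_apache is a hypothetical wrapper name. The configuration is checked first so that a typo in the new rewrite rules does not take the server down.

```shell
#!/bin/sh
# Check the configuration syntax, then ask Apache to reload gracefully.
# The first argument is the path to apachectl; the default is an assumption.
reload_apache() {
  apachectl_bin=${1:-sbin/apachectl}
  "$apachectl_bin" configtest && "$apachectl_bin" graceful
}
```

Calling reload_apache /path/to/sbin/apachectl only triggers the graceful reload if configtest succeeds.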
The cached RSS feeds randomly failed on TWiki.org. This happened whenever a recently changed topic had a view access restriction, which triggered a redirect, which in turn resulted in an RSS file of zero bytes. I fixed this by supplying the TWikiGuest user to wget.
--
PeterThoeny - 17 Apr 2006
If the request is view/Main/WebRss, then "RewriteRule view/(.*)/(Web(Rss|Atom)(Combined|Comments|Teaser)?)$ /feeds/$1$2.xml" is OK, but what is the rule if the request is view/Main/WebRss?skin=rss? I tried
"RewriteRule view/(.*)/(Web(Rss|Atom)(Combined|Comments|Teaser)?)\?skin=rss$ /feeds/$1$2.xml", but it does not work.
--
LunaLin - 10 Jan 2007