Bug: Email addresses can be harvested by spammers
When viewing topics with the 'raw' modifier, email addresses are returned unobscured.
Test case
me@examplePLEASENOSPAM.com
The above email address is obscured on the original page at
http://twiki.org/cgi-bin/view/Codev/RawParamLeaksEmailAddresses
but available in clear text at
http://twiki.org/cgi-bin/view/Codev/RawParamLeaksEmailAddresses?raw=on
This allows email harvesting of otherwise obscured email addresses.
Possible countermeasures:
- robots.txt (possible?) - usually not honored by spammers
- deny unauthenticated users read access to URLs using the "raw=on" URL parameter at the TWiki level (possible?)
- deny everybody read/any access to URLs using the "raw=on" URL parameter at the TWiki level (possible?)
- restrict any access to URLs containing "raw=on" at the .htaccess layer (should work by means of rewrite rules)
- restrict any access to URLs containing "raw=on" at the httpd.conf layer (should work by means of rewrite rules)
- add a CAPTCHA-style check for raw=on URLs
- any other ideas?
The goal should be to be as unrestrictive as possible and to keep the functionality of the "raw" option available for as many users as possible while still making original email addresses unavailable to external and internal indexing mechanisms.
This should probably be considered a problem with the default TWiki configuration and the installation instructions, not a bug in the code itself.
Environment
--
AlsterWassermann - 13 Jan 2005
Follow up
Agreed, it should be thought of as a config/documentation problem. I would argue strongly against munging the text produced by "raw", as it is critical to TWikiApplications that rip that text for external processing.
--
CrawfordCurrie - 13 Jan 2005
I'd favour an ALLOWTOPICREADRAW (or just ALLOWTOPICRAW) - not only for email leakage protection but also because there are some sites that consider their TWikiApplications to be intellectual capital: enabling SEARCH/FORMs etc. to be protected from being ripped off by looking at the source could be key to wider adoption of TWiki as a solution.
--
MartinCleaver - 13 Jan 2005
How about just making the raw option require an authenticated user (similar to Martin's option, but his idea is more controllable - would require authenticated view script). Or perhaps a separate viewraw script for ease of .htaccess setup, so only viewraw needs authentication.
--
RichardDonkin - 13 Jan 2005
I would be irritated by someone understanding TWikiApplications as intellectual property unless public property is meant. Check LicensingAndCopyrightFAQ.
--
AlsterWassermann - 18 Jan 2005
I don't understand why this would be classified as just a configuration issue and not a bug. How can I, right now and without mod_rewrite, restrict anonymous access to raw views?
--
MattWilkie - 19 Jan 2005
An alternative is to enhance the BlackListPlugin to watch out for raw parameters and bump up the score by a bigger value. That will catch a harvester quickly.
--
PeterThoeny - 19 Jan 2005
This is now implemented on TWiki.org via BlackListPlugin. Each regular view increases the score by one point, each "raw" view by 20 points.
There is a potential "gotcha". People could get on the blacklist by looking at several pages in raw mode quickly. There is a whitelist for the contributors. As before, please contact me or one of the CoreTeam members with your IP address, and we can put you on the whitelist.
Let us know if there are any issues.
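
The scoring described above can be sketched roughly as follows. This is only an illustration, not the BlackListPlugin code: the 1 and 20 point values come from the comment, while the ViewScorer class, the BAN_THRESHOLD value and the IP addresses are assumptions made up for the example. Any time-based decay or expiry of scores is not modelled.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

REGULAR_VIEW_POINTS = 1   # from the comment above
RAW_VIEW_POINTS = 20      # from the comment above
BAN_THRESHOLD = 100       # assumed value; the thread does not state a threshold


class ViewScorer:
    """Hypothetical per-IP score keeper, not the real plugin API."""

    def __init__(self, whitelist=()):
        self.whitelist = set(whitelist)
        self.scores = defaultdict(int)

    def record_view(self, ip, url):
        """Add points for one view; return True once the IP reaches the threshold."""
        if ip in self.whitelist:
            return False  # whitelisted contributors are never scored
        # keep_blank_values so that "?raw=" and "?raw" are still seen as raw views
        query = parse_qs(urlsplit(url).query, keep_blank_values=True)
        is_raw = "raw" in query
        self.scores[ip] += RAW_VIEW_POINTS if is_raw else REGULAR_VIEW_POINTS
        return self.scores[ip] >= BAN_THRESHOLD


scorer = ViewScorer(whitelist={"192.0.2.10"})
scorer.record_view("198.51.100.7", "/view/Codev/SomeTopic")         # adds 1 point
scorer.record_view("198.51.100.7", "/view/Codev/SomeTopic?raw=on")  # adds 20 points
```

With these assumed numbers, five raw views in a row are enough to reach the threshold, which is the "gotcha" mentioned above for people who browse raw pages quickly.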
--
PeterThoeny - 19 Jan 2005
As AntonAylward noted earlier, this cries out for the recognition that TWikiGuest currently conflates two different classes of users: anonymous but logged-in users, and users who are not logged in. If we create a new NotLoggedIn user then we can do things like DENYCOMMENT = NotLoggedIn, DENYWEBVIEW = NotLoggedIn, and the like. This would also provide an easy way to prevent access to raw page views, spidering of old revision pages, and so on.
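
A rough sketch of how such a check might look, purely to illustrate the idea. NotLoggedIn is the user name proposed above. The function names and the deny_raw setting are hypothetical and do not correspond to existing TWiki code or preferences.

```python
NOT_LOGGED_IN = "NotLoggedIn"  # user name proposed above; not an existing TWiki user


def effective_user(login_name):
    """Map a request's login name (None or empty when nobody is logged in) to a wiki user."""
    return login_name if login_name else NOT_LOGGED_IN


def may_view(login_name, deny_view=(), raw=False, deny_raw=()):
    """Hypothetical access check: deny_view mimics a DENYWEBVIEW-style list,
    deny_raw is an assumed extra list that applies only to raw views."""
    user = effective_user(login_name)
    if user in deny_view:
        return False
    if raw and user in deny_raw:
        return False
    return True


# A visitor without a login can read the rendered page but not the raw text.
may_view(None, raw=False, deny_raw={NOT_LOGGED_IN})
may_view(None, raw=True, deny_raw={NOT_LOGGED_IN})
```

The point of the separate user is that an anonymous but logged-in TWikiGuest would still pass both checks, while a request with no login at all would be stopped only where the site asks for it.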
--
MattWilkie - 19 Jan 2005
Although the BlackListPlugin may stop harvesting from the site directly, there is still the Google cache.
It is possible to search Google for raw user topics and view them in Google's cache.
Therefore we also need SearchEngineIndexOnlyPlainView to stop Google and other search engines from indexing and caching raw pages.
--
SamHasler - 25 Jan 2005
Fixed on my install by restricting these URLs to authenticated users with this RewriteRule:
RewriteCond %{QUERY_STRING} raw=
RewriteRule ^/view/(.*) /viewauth/$1 [R]
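
For anyone who wants to check which URLs the rule above catches, here is a small Python emulation of its matching logic. It is only a sketch: the function name is made up, the paths follow the rule as written (an install with /cgi-bin/view/ URLs would need a different pattern), and it assumes the default mod_rewrite behaviour of carrying the original query string over to the redirect target.

```python
import re
from urllib.parse import urlsplit

RAW_COND = re.compile(r"raw=")          # RewriteCond %{QUERY_STRING} raw=
VIEW_RULE = re.compile(r"^/view/(.*)")  # RewriteRule ^/view/(.*) /viewauth/$1 [R]


def redirect_target(url):
    """Return the path a request would be redirected to, or None if the rule does not apply."""
    parts = urlsplit(url)
    if not RAW_COND.search(parts.query):
        return None  # condition fails: no "raw=" anywhere in the query string
    match = VIEW_RULE.match(parts.path)
    if match is None:
        return None  # not a view URL
    # Assumption: the query string is appended unchanged to the new path.
    return "/viewauth/" + match.group(1) + "?" + parts.query


redirect_target("/view/Codev/SomeTopic?raw=on")  # "/viewauth/Codev/SomeTopic?raw=on"
redirect_target("/view/Codev/SomeTopic")         # None
redirect_target("/view/Codev/SomeTopic?draw=1")  # also redirected, see note below
```

Note that the condition is an unanchored pattern, so any query string merely containing "raw=" (for example "draw=1") is sent to viewauth too. That is harmless here, since it only forces a login.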
--
BenVoui - 23 Feb 2005
Fix record