There are ways to keep sites or pages from being crawled by search engine indexing robots.
Notes
The last time I paid attention to which WikiLearn pages were being indexed by Google, I found a lot of duplicates, because Google was indexing pages like WebChanges and other pages whose content is generated by dynamic searches.
Because a robots.txt file must sit at the top level of a site (like http://twiki.org/robots.txt), and the WikiLearn web is currently hosted at twiki.org along with several other webs, I don't control that file and can't use the robots.txt approach.
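For reference, if I did control that file, a robots.txt excluding just this web might look something like the sketch below (the /cgi-bin/view/WikiLearn/ path is my guess at how the site's view URLs are laid out):

# applies to all robots
User-agent: *
# block every URL under the WikiLearn web
Disallow: /cgi-bin/view/WikiLearn/

Each Disallow line blocks any URL beginning with that prefix, so one line would cover the whole web.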
The alternative, which not all robots honor, is the robots meta tag. From the HTML Author's Guide to the Robots META tag:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
I may try this at the top of the content of some WikiLearn topics to see if it works even though it will not be in the required/recommended location (the head).
See the last link under Resources (Hack #100) for "noarchive" and other possibilities.
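For example (a sketch based on that resource; combine the keywords as needed), adding noarchive alongside the others should also keep Google from offering a cached copy of the page:

<meta name="robots" content="noindex,nofollow,noarchive">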
Pages being tested:
- RobotsTest
- WebChanges (added 7 May 2003)
Resources
See Resource Recommendations. Feel free to add additional resources to these lists, but please follow the guidelines on ResourceRecommendations, including Guidelines_for_Rating_Resources.
Recommended
- (rhk) Robots Exclusion — good introduction, with links to four related pages.
- HACK #100: Removing Your Materials from Google — "How to remove your content from Google's various web properties." Looks useful for Google and other sites; covers Usenet and some other things as well, and mentions noarchive as the way to avoid cached pages.
Contributors
- (rhk) RandyKramer - 15 Apr 2003
- If you edit this page: add your name here (moving this instruction to the next line), and include your comment marker (initials), if you have created one, in parentheses before your WikiName.
Page Ratings