How to prevent Google to index revisions and print versions

To my surprise I found out today that Google has indexed quite a lot of my TWiki site. But to my dismay, many of the indexed pages are older revisions, print skins and the Trash web. I searched all the topics on Codev but couldn't find a solution for this. Is it possible to control the indexing process? With a text file for the bots, maybe?

-- ArthurClemens - 04 Jul 2003

Solution

The robots meta tag in the page source

I believe Google obeys the HTML tag:

<meta name="robots" content="noindex" />

Stick it in the <HTML><HEAD> section of all the templates other than view.tmpl and that should do the trick. I had the reverse problem: I had put that tag in all the templates and nothing was being indexed. :-)

robots.txt file in the site directory

And this robots.txt disallows everything except view (too bad the robots.txt specification doesn't let you say "deny everything except ....").

User-agent: *
Disallow: /bin/attach
Disallow: /bin/changes
Disallow: /bin/edit
Disallow: /bin/geturl
Disallow: /bin/installpasswd
Disallow: /bin/mailnotify
Disallow: /bin/manage
Disallow: /bin/oops
Disallow: /bin/passwd
Disallow: /bin/photonsearch
Disallow: /bin/preview
Disallow: /bin/rdiff
Disallow: /bin/register
Disallow: /bin/rename
Disallow: /bin/save
Disallow: /bin/savecomment
Disallow: /bin/savemulti
Disallow: /bin/search
Disallow: /bin/setlib.cfg
Disallow: /bin/statistics
Disallow: /bin/testenv
Disallow: /bin/upload
Disallow: /bin/viewauth
Disallow: /bin/viewfile
Disallow: /list/

-- MattWilkie - 05 Jul 2003
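
A quick way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser. The sketch below is only an illustration: it uses a shortened copy of the list above, and the /bin/view path and the topic names in the checks are assumptions based on the usual TWiki URL layout.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above. /bin/view is deliberately not listed,
# so it stays crawlable. The paths checked below are assumed examples.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bin/edit
Disallow: /bin/rdiff
Disallow: /bin/search
Disallow: /bin/viewauth
Disallow: /bin/viewfile
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    for path in ("/bin/view/Codev/WebHome", "/bin/edit/Codev/WebHome"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

Because the rules are plain prefix matches, /bin/viewauth and /bin/viewfile do not block /bin/view/... URLs, which is what makes the "everything except view" approach work.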

I am almost 100% certain that the <meta name="robots" content="noindex" /> line also works if it is the first line (or among the first lines) of the text of any page.

I add that line to pages that consist primarily of an embedded search, as finding such pages just adds noise to search results.

-- RandyKramer - 05 Jul 2003

I have added the following to my robots.txt file:

Disallow: /*?skin=print*
Disallow: /*?rev=*
Disallow: /*.png$
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$

-- ArthurClemens - 14 Sep 2003
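
These patterns rely on the Googlebot extensions (* for any sequence of characters, a trailing $ for end of name). As far as I know, Python's urllib.robotparser only does prefix matching, so the sketch below translates the patterns into regular expressions to see which URLs they would catch. The function names and the sample paths are made up for illustration, and the matching rules are my reading of the extension, not Google's implementation.

```python
import re

# The patterns from the comment above.
PATTERNS = ["/*?skin=print*", "/*?rev=*", "/*.png$", "/*.gif$", "/*.jpg$", "/*.jpeg$"]


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


RULES = [pattern_to_regex(p) for p in PATTERNS]


def is_disallowed(url_path: str) -> bool:
    """Return True if any pattern matches from the start of the path."""
    return any(rule.match(url_path) for rule in RULES)


if __name__ == "__main__":
    for path in ("/bin/view/Codev/WebHome?rev=1.3", "/bin/view/Codev/WebHome"):
        print(path, "disallowed" if is_disallowed(path) else "allowed")
```

Under this reading, a pattern such as /*?rev=* only matches when rev is the first query parameter; a URL ending in ?skin=plain&rev=2 would slip through.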

I am satisfied with the search results I get now from Google: no more old revisions and ?skin=print pages.

-- ArthurClemens - 04 Dec 2003

Would it not make sense to modify the view template so that a plugin could selectively enable or disable the meta tag <meta name="robots" content="noindex, nofollow" />? For instance, Arthur's list was fine in 2003, but now there are other skins, plus raw and sortcol.

-- ChristopherOezbek - 27 Jan 2005

It sounds like people have been thinking about it. See NoRobotsMetaTagInPageHead.

Another idea, prompted by a recent announcement by Google, is to add a link attribute to the links that lead to undesired pages:

<a href="http://twiki.org/cgi-bin/view/Codev/PreventGoogleToIndexRevisions?skin=print" rel="nofollow">PreventGoogleToIndexRevisions</a>

The rel attribute is mainly intended to prevent blogs from being abused by spammers to promote their pages, but I am sure that it has a similar effect to the meta tag.

-- ChristopherOezbek - 01 Feb 2005

The other issue with crawlers is that they can hammer the CPU. Meta tags stop pages from being indexed but don't stop the CPU use, and not every administrator has access to the robots.txt for their site. So we need a rel="nofollow" solution. See SearchEngineIndexOnlyPlainView.

-- SamHasler - 02 Feb 2005
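
To make the rel="nofollow" idea concrete, here is a minimal sketch of how a renderer might decide which links get the attribute. It is not TWiki code: the function names are hypothetical, and the parameter list (skin, rev, raw, sortcol) is simply assumed from the parameters mentioned in this discussion.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit

# Query parameters that lead to alternate renderings of a topic.
# Assumed list, taken from the parameters named in this thread.
NOFOLLOW_PARAMS = {"skin", "rev", "raw", "sortcol"}


def needs_nofollow(url: str) -> bool:
    """Return True if the URL carries any of the unwanted parameters."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(name in NOFOLLOW_PARAMS for name in query)


def render_link(url: str, label: str) -> str:
    """Render an anchor tag, adding rel="nofollow" where needed."""
    rel = ' rel="nofollow"' if needs_nofollow(url) else ""
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(label)}</a>'


if __name__ == "__main__":
    print(render_link(
        "http://twiki.org/cgi-bin/view/Codev/PreventGoogleToIndexRevisions?skin=print",
        "PreventGoogleToIndexRevisions",
    ))
```

Plain view links are left untouched, so only the alternate renderings are marked. This helps only with crawlers that honour rel="nofollow".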

Information

A Standard for Robot Exclusion
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt".
Google Information for Webmaster
Googlebot understands some extensions to the robots.txt standard: Disallow patterns may include * to match any sequence of characters, and patterns may end in $ to indicate that the pattern must match the end of a name. This is described on the Google remove page.
Microsoft Search Site Owner Page
Describes the same extended robots.txt features that the Google bot has.
The Robots META tag
The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required. Note that currently only a few robots implement this.
In this simple example a robot should neither index this document, nor analyse it for links:

   <meta name="robots" content="noindex, nofollow" />
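
To check which templates or skins actually emit the tag, the rendered HTML can be inspected with Python's standard html.parser. This is a rough sketch; the class and function names are hypothetical, and fetching the page is left out.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
    print(sorted(robots_directives(page)))
```

An empty result for a page rendered with a print skin or an old revision would suggest that the template for that view is missing the tag.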

This is probably a worthy feature for EdinburghRelease.

This was in Dakar so I'm dropping this down to BasicForm.

-- SamHasler - 29 Apr 2006

BasicForm
TopicClassification	TWikiDeployment
TopicSummary
InterestedParties	WillNorris
RelatedTopics	SearchEngineIndexOnlyPlainView

Topic revision: r14 - 2006-04-29 - SamHasler

Copyright © 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.