
Feature Proposal: Search Engines Should Index Only Plain View

Motivation

Tell search engines to index only the plain topic, not any specialized view of a topic. This reduces the clutter when googling TWiki.org and other public TWikis.

Description

Google indexes too much of TWiki.org and other public TWikis. If you search "with the omitted results included" you get many hits for a particular topic, e.g. http://www.google.com/search?q=blacklistplugin+plugins+site:twiki.org&hl=en&lr=&ie=UTF-8&filter=0 returns BlackListPlugin, but also variants with parameters ?_foo=1.7, ?sortcol=4&table=4&up=0, ?skin=print.pattern, ?raw=on, etc.

There is also the problem highlighted in RawParamLeaksEmailAddresses: "raw" views are also indexed and cached, and in these views email addresses are returned unobscured. It is possible to create a Google search for raw user topics and view them in Google's cache.

Only the plain topic should get indexed, i.e. each TWiki topic should be indexed only once, as the URL without any parameters.

Impact and Available Solutions

Current spec: The view script already adds a <meta name="robots" content="noindex" /> tag if you look at an older topic revision. Technically speaking, the skin has the tag by default, and the view script removes it if you are looking at the top revision.

Proposed new spec: Do not remove the noindex tag if the view script has any URL parameters. This has the desired effect that only the plain topic gets indexed.
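
The proposed rule can be sketched as follows. This is only an illustration in Python, not the TWiki view script (which is Perl); the function name `robots_meta_tag` and the `is_latest_revision` flag are hypothetical, and the sketch assumes "has URL parameters" means a non-empty query string.

```python
from urllib.parse import urlsplit

# The tag quoted in the current spec above.
NOINDEX_TAG = '<meta name="robots" content="noindex" />'


def robots_meta_tag(url: str, is_latest_revision: bool = True) -> str:
    """Return the robots meta tag to emit for a view URL, or '' if indexable.

    Hypothetical helper: keeps noindex for older revisions (current spec)
    and for any URL that carries a query string (proposed spec).
    """
    has_params = bool(urlsplit(url).query)
    if has_params or not is_latest_revision:
        return NOINDEX_TAG
    return ""


plain = "http://twiki.org/cgi-bin/view/Codev/SearchEngineIndexOnlyPlainView"
print(repr(robots_meta_tag(plain)))              # plain topic: no tag
print(repr(robots_meta_tag(plain + "?raw=on")))  # parameterized view: noindex
```

With this rule only the parameter-free URL of the top revision is left indexable.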

In addition, we can make it easier for the search engines by telling them which links not to follow, e.g. we can add a rel="nofollow" attribute to the anchor tags of links such as printable, older revs, table sort, view raw etc. (The BlackListPlugin does that for external links.)
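
A minimal sketch of that link rule, again in Python for illustration only. The helper name `anchor` and the `site_host` parameter are hypothetical, and "links such as printable, older revs, table sort, view raw" is approximated here as "internal links that carry a query string".

```python
from html import escape
from urllib.parse import urlsplit


def anchor(href: str, text: str, site_host: str = "twiki.org") -> str:
    """Build an <a> tag, adding rel="nofollow" to internal links with parameters.

    Assumption: a link is internal if it is relative or points at site_host.
    External links and plain topic links are left untouched.
    """
    parts = urlsplit(href)
    internal = parts.netloc in ("", site_host)
    rel = ' rel="nofollow"' if internal and parts.query else ""
    return f'<a href="{escape(href, quote=True)}"{rel}>{escape(text)}</a>'


print(anchor("/cgi-bin/view/Codev/WebHome", "WebHome"))
print(anchor("/cgi-bin/view/Codev/WebHome?raw=on", "Raw View"))
```

The first call yields an ordinary link; the second carries rel="nofollow" because of the ?raw=on parameter.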

-- PeterThoeny - 23 Jan 2005

Documentation

Examples

logout is a search term which provides enough hits (but not too many) to make it a reasonable test case. Compare:

Filtered, 77 hits (23 Jan 2005): http://www.google.com/search?q=logout+site:twiki.org+-inurl:raw%3D+-inurl:skin%3D+-inurl:oops+-inurl:%2Fviewauth+-inurl:t%3D+-inurl:%2Fattach+-inurl:_foo%3D+-inurl:%2Frdiff+-inurl:sortcol%3D+-inurl:rev%3D+&filter=0&num=100

Unfiltered, 236 hits (23 Jan 2005): http://www.google.com/search?q=logout+site:twiki.org&filter=0&num=100
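
For anyone repeating the comparison with another search term, the two query URLs can be assembled roughly as below. This is a sketch: the function name `google_query_url` is hypothetical, the exclusion list is copied from the filtered URL above, and `urlencode` percent-encodes ":" as %3A, so the output is equivalent to the hand-written URLs but not character-for-character identical.

```python
from urllib.parse import urlencode

# URL fragments excluded in the filtered test query above.
EXCLUDED = ["raw=", "skin=", "oops", "/viewauth", "t=",
            "/attach", "_foo=", "/rdiff", "sortcol=", "rev="]


def google_query_url(term: str, site: str, excluded=()) -> str:
    """Build a site-restricted query URL with optional -inurl: exclusions."""
    words = [term, f"site:{site}"] + [f"-inurl:{frag}" for frag in excluded]
    # filter=0 includes the "omitted results"; num=100 asks for 100 hits per page.
    params = {"q": " ".join(words), "filter": "0", "num": "100"}
    return "http://www.google.com/search?" + urlencode(params)


print(google_query_url("logout", "twiki.org"))            # unfiltered
print(google_query_url("logout", "twiki.org", EXCLUDED))  # filtered
```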

Implementation


Discussion:

Just to be clear, when you say "plain" you mean the default skin, what any first-time visitor would see coming to the site, not the "plain skin", correct?

I agree with the proposal.

-- MattWilkie - 23 Jan 2005

Yes, plain as in "URL with no parameters", e.g. http://twiki.org/cgi-bin/view/Codev/SearchEngineIndexOnlyPlainView for this topic.

-- PeterThoeny - 23 Jan 2005

Note that there are also entries in the Google index for actions other than view: viewauth, attach, oops, and rdiff.

Perhaps a solution lies in a robots.txt which returns only plain view URLs for each topic (similar to WebTopicList), the idea being to tell the bots what is allowed rather than deny access.

-- WillNorris - 23 Jan 2005

I've changed the priority to 5 because Google caches raw views (see the paragraph added to the Description above).

We could also put rel="nofollow" on links with parameters to stop Google from even retrieving the page, which would save bandwidth. Or does Google request only the header and skip the rest of the page if it finds <meta name="robots" content="noindex" />? In that case adding nofollow wouldn't save much bandwidth and would increase page processing unnecessarily.

-- SamHasler - 25 Jan 2005

Ah, I see using rel="nofollow" has already been suggested. However it could lead to little saving if I'm right about headers.

SpeedUpTipsForTWiki20040901#The_robot_problem_is_caused_by_s point 2 highlights the fact that followed links to searches with parameters will eat up CPU resources. OK, they might not get indexed if they have <meta name="robots" content="noindex" /> in them, but the server still has to generate the page.

So it would be a good idea to add rel="nofollow" to any internal links with parameters not just to stop indexing, but to save CPU resources.

-- SamHasler - 01 Feb 2005

Is spending processing time finding links to add rel="nofollow" to the easiest way of doing this? Couldn't we just add Disallow: /*?* to robots.txt?

-- SamHasler - 01 Feb 2005

Seems nice and simple. I think it will work for Google, but the robots.txt validator says wildcards are non-standard (i.e. other spiders might ignore them).

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

-- MattWilkie - 01 Feb 2005
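
To illustrate the wildcard concern above: under the original robots.txt convention a Disallow value is a literal path prefix, so a crawler that does not implement the wildcard extension would never match /*?* against a real URL. The sketch below contrasts the two interpretations; both function names are hypothetical and neither is taken from any particular crawler.

```python
import re


def prefix_match(rule_path: str, url_path: str) -> bool:
    """Original convention: the rule is a literal prefix of the URL path."""
    return url_path.startswith(rule_path)


def wildcard_match(rule_path: str, url_path: str) -> bool:
    """Extended convention: '*' in the rule matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.match(pattern, url_path) is not None


rule = "/*?*"
raw_view = "/cgi-bin/view/Codev/WebHome?raw=on"
plain_view = "/cgi-bin/view/Codev/WebHome"

print(wildcard_match(rule, raw_view))    # wildcard-aware spider: blocked
print(wildcard_match(rule, plain_view))  # plain view stays crawlable
print(prefix_match(rule, raw_view))      # prefix-only spider: rule never applies
```

A prefix-only spider therefore keeps crawling parameterized URLs, which is why the rule cannot be relied on for every search engine.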

Peter pointed out last night that not everyone has access to the robots.txt file for their site, so we will have to implement a rel="nofollow" solution as well.

-- SamHasler - 02 Feb 2005

I recently tried the rel="nofollow" solution in one of our local TWikis and robots.txt in another. The rel attribute took less than a week to take effect with Google, while after 4 weeks I am still waiting for robots.txt to work.

-- ChristopherOezbek - 07 Feb 2005

Maybe due to the size or pagerank of the sites they are not indexed/crawled as frequently. It could also be that Google are specifically looking for sites that use rel="nofollow" to reindex at this time.

-- SamHasler - 07 Feb 2005

Done in DevelopBranch r3675. I added $cfg{NoFollow} that is set to the string rel='nofollow' by default. This is used in building links in code, and expanded as in templates and topics (though it should arguably be %CFG{"nofollow"}%, but that's a discussion for another day).

-- CrawfordCurrie - 20 Feb 2005

Crawford,

  1. does that mean that all URLs have a "nofollow" if set in the configuration? I think this is overkill if this is the case.
  2. I deliberately made the change in the BlackListPlugin since the nofollow feature is not in line with the TWikiMission.
  3. My proposed change is to add the nofollow only to links of topics that should never be indexed, such as rdiff etc (see above "Impact and Available Solutions").
  4. There is another issue: I am interested in keeping twiki.org's Google ranking high. That is, we should not prevent Google from finding twiki.org via thousands of public TWiki sites.
  5. Unless there are good counter arguments I suggest reverting Crawford's last change.
  6. Also, all parameter tags TWiki generates so far use double quotes. Single quotes might be legal, but I am not sure all user agents support them. So it is better to generate rel="nofollow" as the BlackListPlugin does. (This point is N/A if the change is reverted.)

-- PeterThoeny - 22 Feb 2005

1 and 3: I followed your 23 Jan 2005 spec above. As I read it, that was the proposed change i.e. we can add a rel="nofollow" parameter to the anchor tags of links such as printable, older revs, table sort, view raw etc. Normal topic links with no parameters do not carry nofollow. Neither do external links. If you are in doubt, inspect the results of the current code at http://develop.twiki.org/~develop/cgi-bin/view/

2: In what way is nofollow not in line with the TWiki mission? Is it because it is mainly targeted at public TWikis? Many intranets use robots to index their own internal websites; why would nofollow be uninteresting to them?

4: A link from a view page to TWiki.org is an external link and is not marked nofollow. Links to other view pages are not marked nofollow. You can always set $cfg{NoFollow} to the empty string if you want all link types (such as rdiff and ?raw=on) to be followed.

5: Maybe it's just me, but I can't think of a good reason why this change should be reverted.....?

6: RFC1866 (the HTML 2.0 spec, the earliest I can find) says:

   The value of the attribute may be either:
        * A string literal, delimited by single quotes or double
        quotes and not containing any occurrences of the delimiting
        character.

-- CrawfordCurrie - 22 Feb 2005

While "nofollow" is nice, one would think that having a <meta name="robots" ... tag in the skin header would be more effective. It turns out it is there, but it is edited out in View.pm for all except older revisions.

This makes no sense to me. What is the point of making the site unconditionally indexable?

-- AntonAylward - 17 Jul 2005

Not so. In CairoRelease it is never edited out. In DevelopBranch it is edited out only if you have enabled {AntiSpam}{RobotsAreWelcome} in configure.

-- CrawfordCurrie - 18 Jul 2005

Topic revision: r19 - 2005-07-18 - CrawfordCurrie
Copyright © 1999-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.