
TWikiCache

A pluggable caching service and built-in HTML page cache

See also blog entries: Caching for TWiki Part I and Part II

Previous work

There have been a couple of attempts to implement caching for TWiki on different levels. These solutions don't share any code and are all quite different in their use.

There are also plugins that do their own caching of downloaded content or finished computations.

There are certainly more plugins out there that need to temporarily store objects and retrieve them instead of recomputing them. I've looked at all of these implementations for their most useful and interesting features, leaving out what can be done better from within the core engine than from the "outside". Some of the above caching solutions will not profit directly from a central caching service, but indirectly, as the burden on them is relieved a bit. Some are still of great value to ease the "first hit", when the page is not found in the cache and is about to be computed.

Requirements

So the basic requirement here is to have some sort of caching service offered by the TWiki engine itself and made available to subsidiary components, extensions and plugins. This caching service should relieve plugin authors from having to implement the caching store themselves by offering a simple API to a centralized cache. The caching service might be configured to use different backends, chosen depending on the given hosting environment. For example, in a hosted environment it is quite unlikely that you will be allowed to run a memcached daemon; you are, however, still able to use a file-based store. A solution somewhere in the middle would be a (size-aware) memory cache that keeps objects for the timespan of one TWiki request. The caching effect of this solution can be increased by using Perl accelerators like mod_perl or SpeedyCGI that keep the memory cache alive for more than one request.
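
For illustration, here is a minimal sketch of what such a centralized service might look like, assuming a hypothetical TWiki::Cache package that merely delegates to a pluggable backend; the names and calling conventions are illustrative, not the actual interface:

package TWiki::Cache;

use strict;
use warnings;

sub new {
    my ( $class, $backend ) = @_;

    # $backend is any adapter object implementing get/set/delete,
    # e.g. a wrapper around Cache::FileCache or Cache::Memcached
    return bless { backend => $backend }, $class;
}

# fetch a previously stored object; returns undef on a cache miss
sub get {
    my ( $this, $key ) = @_;
    return $this->{backend}->get($key);
}

# store an object, optionally with an expiry time in seconds
sub set {
    my ( $this, $key, $value, $expires ) = @_;
    return $this->{backend}->set( $key, $value, $expires );
}

1;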

There are different areas where there's obviously a need for caching even within the core engine itself, e.g. for preferences or users. Last but not least, the produced html page itself may be cached so that it is available when the same page is requested again. These different requirements establish a system of multiple levels of caching, with different levels of granularity with regard to the handled objects. The outermost and most coarse-grained level is the one caching the complete html page. This is also the level where the biggest caching effects occur and where objects become invalid most frequently.

Cache maintenance

This leads to the question of when objects in the cache actually become outdated. It can't be answered in general and for all levels of caching. The knowledge of which information has been used to create a certain object may be buried in the depths of some plugin, and there is no way for the cache to know when it has to forget an object without help from that plugin itself.

Tracking dependencies

Typically, content management systems deal with cache invalidation by recording "dependencies" between objects and "firing" them on demand, effectively removing the objects from the cache again so that they are recreated the next time they are requested. The needed interfaces here are addDependency() and fireDependency(), to inform the "dependency tracker" about what it actually stores.
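
From a plugin's point of view, usage could look like the following sketch; the two method names come from the text above, while the way a plugin obtains the tracker object is an assumption:

# $cache stands for the dependency tracker handed out by the core
# while rendering the current page, record that it used Main.WebHome
$cache->addDependency('Main.WebHome');

# later, when Main.WebHome is saved, purge everything depending on it
$cache->fireDependency('Main.WebHome');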

So the hard part of the TWikiCache is actually not implementing a pluggable caching service but getting the dependency tracking right, in particular for the html page cache.

There are a couple of different kinds of dependencies.

  1. automatically detected,
  2. external,
  3. manually added and
  4. temporal dependencies.

A lot of dependencies can be detected by the engine itself while rendering a page. Foremost, these are added by reading in other files and topics, like WebPreferences and so on. Basically everything that goes through TWiki::Store::readTopicRaw() adds a dependency to the currently rendered page. But even such little things as WikiWord links to existing or not-yet-existing topics add a dependency. Dependencies like these can be recorded automatically.

External dependencies are added to the current page by some plugin because it says so. These may be automatic dependencies too, but they are not recordable by the core engine itself. The event that fires such a dependency may come from outside, emitted by some update to an external database, for example. We can't do much about this from inside the core engine except offer an API to hook into the dependency tracker.

Manual dependencies are those that the author of a topic may want to add to it. The author may decide to

  • prevent the current page from being cached at all (Set CACHEABLE = no)
  • invalidate a couple of other topics whenever the current one is changed (Set DEPENDENCIES = web.topic, web.topic)
  • invalidate a couple of topics when any topic in the web has changed (Set WEBDEPENDENCIES = ...)

The latter, for example, allows invalidating a topic that has a SEARCH in it on every edit in a web, so that the output of the SEARCH is rendered once again, as the hit set might have changed due to the topic changes.

Last but not least, you may want an object in the cache to expire automatically; a topic author can arrange this by adding a Set CACHEEXPTIME = <seconds> preference value.

Firing dependencies

Once all dependencies of an object are known, it can be purged if one of them "fires". Dependencies are fired by the TWiki::Store when

  • a topic has been saved,
  • a topic has been moved (if the topic is moved to a new web, the target's webdependencies have to be fired too),
  • a new attachment is added or
  • an attachment has been moved (source and target topic are firing their dependencies).

Firing a dependency is a recursive process because an object might depend on an object that itself depends on yet another object. Dependencies are fired "backwards" using the reverse relation: adding a dependency also establishes a reverse dependency in the target object. See "Notes on the implementation" below.
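
A minimal sketch of such a recursive purge, following the pageBucket layout described in "Notes on the implementation" below; the method names are assumptions:

sub fireDependency {
    my ( $this, $topic, $seen ) = @_;

    $seen ||= {};
    return if $seen->{$topic}++;    # guard against dependency cycles

    my $bucket = $this->getPageBucket($topic) or return;

    # purge all cached variations of this topic in one step
    delete $bucket->{variations};

    # walk the reverse relation: everything that used this topic
    # while being rendered has to be recomputed as well
    foreach my $dependent ( keys %{ $bucket->{revDeps} } ) {
        $this->fireDependency( $dependent, $seen );
    }
}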

As can be seen, most cache maintenance overhead happens while saving and renaming objects, not while viewing them. The only overhead added during view is storing the objects into the cache and recording new dependencies, and this only happens on the "first hit".

Dirty areas

Sometimes caching complete pages is really too coarse-grained. Some parts of a page change much too often, while other parts of the same page never change, yet are computed at non-zero cost. In that case the TWikiCache can be told not to cache certain parts of the topic, called "dirty areas": it keeps the TWikiMarkup inside them as it is and patches in the information computed during the request. In a way, cached pages with dirty areas resemble templates.

Backends

The current code implements backends based on
  • Cache::Memcached
  • Cache::FastMmap
  • DB_File
  • TDB_File
  • Cache::FileCache
  • Cache::SizeAwareFileCache
  • Cache::MemoryCache
  • Cache::SizeAwareMemoryCache

where the three most interesting are Cache::Memcached (well known due to its use at http://www.livejournal.com), Cache::FileCache and Cache::FastMmap. All but the first four are part of the Perl-Cache project. The memcached-based backend is the first choice for high-end sites where you are able to run (a pool of) cache servers. The Cache::FileCache backend is a good general-purpose choice for hosted environments where you are not allowed to run persistent daemon processes. Cache::FileCache is also preferable because it allows sharing the same cache between multiple processes (several views). The Cache::SizeAwareFileCache is there only for completeness, as the maintenance done by this backend is far too heavy to be carried out by the TWiki process itself.

The Cache::MemoryCache does not share its cache with other processes, and the cache is cleared when the process dies. This kind of cache can be useful for objects that are created during one request and should not be created more than once per request, e.g. user objects. If you are using mod_perl or SpeedyCGI then this kind of cache will also persist across multiple requests, but it may impose too high memory costs for each persistent Perl interpreter. In this case the Cache::SizeAwareMemoryCache is a good choice to keep control over resources. Compared to Cache::SizeAwareFileCache it can perform its backend maintenance more efficiently, as it does not suffer from disk I/O.
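
As a standalone illustration of the Cache::Cache interface that the file-based backends adapt (the paths and expiry times here are just examples):

use Cache::FileCache;

my $cache = Cache::FileCache->new( {
    namespace          => 'TWikiCache',
    cache_root         => '/tmp/twiki_cache',
    default_expires_in => '10 minutes',
} );

my $html = '<html>...rendered page...</html>';
$cache->set( 'Main.WebHome', $html );          # store a rendered page
my $cached = $cache->get( 'Main.WebHome' );    # undef on miss or after expiry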

Cache::FastMmap is a very fast backend, faster than Cache::FileCache, and it still allows sharing the same cache among multiple processes. It is implemented in C, using a single mmap-ed file and fcntl locking. Unfortunately this backend does not implement per-object expiry, so CACHEEXPTIME has no effect when using it.

The same holds for the DB_File backend: cache expiry has to be implemented as a separate cronjob. In addition, DB_File is not "size-aware". According to the performance comparison at http://cpan.robm.fastmail.fm/cache_perf.html, a Berkeley DB based cache seems to perform astonishingly well.

The TDB_File backend is very similar to the DB_File backend, with the difference that TDB_File uses record-level locking while DB_File uses db-level locking. TDB_File has advantages for concurrent writers, while being slower due to the locking overhead. Performance experiments need to show whether this trade-off pays off under higher loads.

Notes on the implementation

A single topic may be cached several times distinguished by
  • url parameters
  • session values
  • wikiuser
Foremost this means that every user has a separate set of cached topics. This is needed to cope with different user-level preferences, as a site can look different for every single user. It also means that a new copy of the current topic will be cached if it is called with different url parameters. In some cases these variants don't differ in their result, but the extra space is worth spending for the sake of correctness. The same argument holds for values stored on the server side inside the session objects. Unfortunately not all session values are worth distinguishing, and using them all may result in cache thrashing, so a certain list of session values is excluded. There might be a need for a clear specification of how to name session values so that they are excluded from the cache logic (e.g. all session values starting with an underscore are ignored).
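
A sketch of how such a variation key could be derived from the user, the url parameters and the non-excluded session values; the actual key composition in the implementation may differ:

sub variationKey {
    my ( $query, $session, $wikiUser ) = @_;

    my @parts = ($wikiUser);
    foreach my $name ( sort $query->param() ) {
        my $value = $query->param($name);
        push @parts, "$name=" . ( defined $value ? $value : '' );
    }
    foreach my $name ( sort keys %$session ) {
        next if $name =~ /^_/;    # skip excluded session values
        my $value = $session->{$name};
        push @parts, "$name=" . ( defined $value ? $value : '' );
    }
    return join( '::', @parts );
}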

All variants of a topic are stored in the same pageBucket object and cached under the same key; there is one pageBucket per topic. All variants share the same dependencies. If a dependency for this page fires, the pageBucket and all its variants are deleted in one step.

A pageBucket for a topic has the following properties:

  • $pageBucket{deps}{$topic} = 1 ... hash of dependencies
  • $pageBucket{revDeps}{$topic} = 1 ... hash of all reverse dependencies
  • $pageBucket{exptime} = <seconds> ... expiration time
  • $pageBucket{variations}{$key} = <html-text> ... the cached html page for this topic; variations are distinguished using $key

Reverse dependencies are created to ease firing dependencies; they replicate the normal forward dependency in the target pageBucket and are a kind of backlink. So when a topic gets cached, it not only adds the variant to the correct pageBucket but also updates the pageBuckets of all topics it depends on, adding the reverse dependencies there. To be precise, this means that the reverse relation is used to fire dependencies, while the forward relation is only used to establish its reverse counterpart. It also means that there might already be a pageBucket object even if the topic for it does not yet exist. This is needed to invalidate those entries that contain pages with NewWikiWord links: when the NewWikiWord topic comes into existence, it updates its pageBucket, finds the reverse dependencies and fires them recursively.
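
In code, establishing the reverse counterpart might look like this sketch, where getOrCreatePageBucket() is a hypothetical helper that also covers the case of not-yet-existing topics:

sub addDependency {
    my ( $this, $from, $to ) = @_;

    my $fromBucket = $this->getOrCreatePageBucket($from);
    my $toBucket   = $this->getOrCreatePageBucket($to);

    $fromBucket->{deps}{$to}    = 1;    # forward: bookkeeping only
    $toBucket->{revDeps}{$from} = 1;    # reverse: used to fire dependencies
}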

Configuration

  • Select the caching backend:
    $TWiki::cfg{CacheManager} = 'TWiki::Cache::((SizeAware)?(FileCache|MemoryCache))|Memcached|FastMmap|none'; # default none
         
  • Enable/Disable the cache manager:
    $TWiki::cfg{Cache}{Enabled} = 0|1; #default off
         
  • Enable/Disable debug output:
    $TWiki::cfg{Cache}{Debug} = 0|1; # default off
         
  • Namespace in the cache used by this TWiki (must be different for all TWiki instances installed on the same host):
    $TWiki::cfg{Namespace} = $TWiki::cfg{DefaultUrlHost}.':'.$TWiki::cfg{ScriptUrlPath};
         
  • Options specific for size-aware cache backends (SizeAwareMemoryCache, SizeAwareFileCache):
    • max size of cache:
      $TWiki::cfg{Cache}{MaxSize} = 10000;
               
  • Memcached-specific options (see the documentation of memcached for more information):
    • Addresses and ports of the memcached servers (comma separated):
      $TWiki::cfg{Cache}{Servers} = '127.0.0.1:11211';
             
    • Compression of data in the cache:
      $TWiki::cfg{Cache}{CompressionThreshold} = 10000;
               
  • FileCache-specific options:
    • root directory of the cache file system (will be created automatically if not yet present):
      $TWiki::cfg{Cache}{RootDir} = '/tmp/twiki_cache';
             
    • number of sub-directories in the file cache (should be large enough so that no directory has more than a few hundred objects):
      $TWiki::cfg{Cache}{SubDirs} = 3;
              
    • Umask used to create directories:
      $TWiki::cfg{Cache}{Umask} = 077;
               
  • FastMmap-specific options:
    • name of the mmap file:
      $TWiki::cfg{Cache}{ShareFile} = '/tmp/twiki_mmap'; 
            
  • DB_File-specific options
    • name of the db-file:
      $TWiki::cfg{Cache}{DBFile} = '/tmp/twiki_db'; 
            
  • TDB_File-specific options
    • name of the db-file:
      $TWiki::cfg{Cache}{TDBFile} = '/tmp/twiki_tdb'; 
            

Usage

Preventing the current topic from being cached:
  * Set CACHEABLE = no

Auto-expire caching of the current topic:

   * Set CACHEEXPTIME = <seconds>

Prevent certain parts of a topic from being cached, rendering them at request time instead:

  • pointless example:
    <dirtyarea> Don't cache this. </dirtyarea>
  • using TimeSincePlugin to display the age of the current topic
    <dirtyarea> %TIMESINCE% </dirtyarea>
  • never cache the result of this SEARCH:
    <dirtyarea> %SEARCH{...}% </dirtyarea>

Manually add dependencies to other topics; this will invalidate the cache for the listed topics if the current one is changed:

   * Set DEPENDENCIES = Main.ListofAllEmployees, AllOpenActions

Best practice: add a manual dependency to those topics that are searched for regularly, pointing to the topic that contains the SEARCH. Example: there's one ReportTopic that dynamically lists ReportItem topics. So add the following line to all ReportItem topics (best done using a topic template when creating them):

   * Set DEPENDENCIES = ReportTopic

Editing any of the ReportItem topics will then cause the ReportTopic to be recomputed automatically.

Manually add dependencies from all topics of a web to a list of other topics; this will invalidate the cache for the listed topics if any topic in the current web changes (best used in WebPreferences):

   * Set WEBDEPENDENCIES = WebHome, Main.WebHome, People.ListofAllEmployees

You may force the current topic to be recomputed by adding a refresh=on url parameter. The complete cache can be cleared using refresh=all (not implemented for TWiki::Cache::Memcached as this operation is not supported by the backend).

Next

There already is an implementation actively being tested on the WikiRing. The current patch is against the MAIN branch and is ready to be checked in. What is really needed now is more testing, performance tests and rewriting some key plugins. The BlackListPlugin, for one, needs to prevent the current page from being cached when notifying some TWikiGuest that (s)he has just been blacklisted: this page should not be cached and shown to another, independent TWikiGuest. There is a method TWiki::Cache::isCacheable($topic) that decides whether the current page is cacheable. Plugins need a way to intervene here.

The Func API needs to be augmented to allow plugin authors to add dependencies as outlined above.
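
Such a call does not exist yet; a plugin might use it roughly like this, where TWiki::Func::addDependency is the proposed name, not an existing function:

sub commonTagsHandler {
    my ( $text, $topic, $web ) = @_;

    # the rendered page pulls in data from ExternalDataTopic, so it
    # must be purged from the cache whenever that topic changes
    TWiki::Func::addDependency( 'Sandbox', 'ExternalDataTopic' );
}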

The potential of using a central caching service throughout the complete core engine has to be investigated.

There might be a need to use different backend implementations at the same time, for example Cache::FileCache for html pages but Cache::MemoryCache for user objects and preferences. Right now, the implementation allows only one backend at a time, as configured in LocalSite.cfg.

-- Contributors: MichaelDaum - 27 Feb 2007

Discussion

Good idea - not sure that the all-encompassing term 'TWiki cache' should be used for this though, as it could easily be confused with the many types of content cache that people have built as plugins for TWiki, caching rendered pages, variables, or database information. The top hit for Google:TWiki+cache is one of the cache plugins. Maybe this should be TWiki::Cache::Internal or something, leaving scope for TWiki::Cache::Headers for CacheControlHeaders (also includes links to cache plugins).

When I read this I thought 'great, Sven is going to address content caching'... If you make these caches persistent, or ModPerl memory resident, they could actually help with internal caching, though - so perhaps I should have waited till you fleshed out the concept.

Sounds like this is mainly about 'internal caching' as in VarCachePlugin but covering a lot more objects. Does sound like you could speed things up quite a lot if there's a persistent TWikiCache implementation.

-- RichardDonkin - 26 Feb 2007

Actually, I did already implement the TWiki::Cache as outlined here. We discussed it several times on the WikiRing before getting it out to the community. It is actively used to speed up the WikiRing blog.

The description above is not accurate and in fact does not describe what the foremost feature of the current code is. And yes, Richard, your first expectations are right: the TWiki::Cache is a content cache that does the job that some of the plugins you mentioned attempted. Note that I commented on the limitations of a plugin like the TWikiCacheAddOn years ago. So I decided to do it, but do it right.

So what is the TWiki::Cache?

  • It caches the content using memcached before sending it out to the browser.
  • It retrieves content from the cache instead of rendering the same page once again.
  • It tracks all dependencies of each cached page and invalidates pages as needed.
  • It offers an API for TWikiApplications and plugins to hook into the dependency tracker and provide additional knowledge about content dependencies.
  • It can invalidate content automatically based on a timer.
  • It allows excluding pages from being cached.
  • It allows protecting parts of a page from being cached (dirty areas); these are rendered after the page is retrieved from the cache, before it is sent to the browser.
  • It does all this in the most efficient way possible.
  • It reduces server load and rendering time considerably, though I did not have the time to do thorough testing.

In combination with SpeedyCGI, pages can be sent out in under one second each, no matter how complicated the TWikiApplication behind them is. The overhead of maintaining the cache only happens during save: firing dependencies and storing the new page.

Memcached is a highly efficient distributed in-memory caching service. Unfortunately it isn't available in hosted environments, which is why having a filesystem fallback mechanism is desirable.

The caching services described above are only a side product. The most important part of the TWiki::Cache is tracking dependencies for html pages. Frankly, offering a plain caching service isn't of great value if you can't specify when its content is outdated.

The weak point of the current implementation is that it relies on memcached and its Perl API being installed on the system. This can in fact be made pluggable to use different storage mechanisms in the background, e.g. modules that implement the Cache::Cache interface.

The current code is available as a patch against the MAIN branch. I am about to maintain a similar patch for TWiki-4.1.x as some of my clients already use this code on intranet sites.

I'm going to refactor the above introduction to this page asap.

-- MichaelDaum - 26 Feb 2007

Most interesting, I'm looking forward to trying it out. You say this can't be used on hosting. You mean shared hosting, I assume; is that because memcached needs its own process?

What would be great to have is a framework that would allow you to implement your own cache mechanism fairly easily, based on memcached or whatever else you wish to use. But I see you already mention that, Michael.

-- StephaneLenclud - 26 Feb 2007

Sounds very interesting - is it extensible to support CacheControlHeaders? The content invalidation is the hard part, once you have that the HTTP cache headers should be quite easy, and a big win for enterprises with internal proxy caches as well as ISPs that provide proxy caches, such as AOL.

-- RichardDonkin - 26 Feb 2007

I've reworked the description of this FeatureRequest and implemented the other cache backends.

-- MichaelDaum - 27 Feb 2007

TreePlugin for instance certainly ought to make use of such a cache mechanism.

Also look at mod_cache even though it's probably unusable for our purpose.

-- StephaneLenclud - 28 Feb 2007

The key that makes third party caches (like mod_cache) usable is that we need a way to fire dependencies, that is, an API to actively invalidate cache entries. And this must be available in Perl. That's why mod_cache and also varnish are probably out. These http accelerators are simply too "uninformed" about what's going on in the content management system behind them. However, in a more complex server setup, it may still make sense to add another level of caching.

-- MichaelDaum - 28 Feb 2007

I've got the code ready to check it in to the MAIN branch.

-- MichaelDaum - 28 Feb 2007

What about attaching the diff to this topic or to Bugs:Item3695 so that we can have a look at the code and documentation? Of course, I am interested in caching, but I'd guess it needs some installation and configuration items, fallbacks (if neither memcached nor Cache::Cache are installed), and admin and author guides (e.g. which TWiki variables should be avoided for better caching performance).

-- HaraldJoerg - 28 Feb 2007

There's no other fallback than no caching if neither memcached nor Cache::Cache are installed. That's ok for now as there are no other parts in the engine that depend on the cache service to be there. Here is the patch against MAIN, revision 12994. Apply it, install Cache::Cache, add $TWiki::cfg{CacheManager} = 'TWiki::Cache::FileCache'; $TWiki::cfg{Cache}{Enabled} = 1; to your LocalSite.cfg and see if it works for you.

-- MichaelDaum - 28 Feb 2007

Just found this http://cpan.robm.fastmail.fm/cache_perf.html ... and implemented a Cache::FastMmap backend. But unfortunately it fails to cache large html pages and bloats up the memory requirements of the view process on large mmap files.

-- MichaelDaum - 28 Feb 2007

Implemented a DB_File based cache backend as a fallback. I think that a set of sensible cache backends has been implemented now. In fact, the SizeAwareFileCache, SizeAwareMemoryCache and MemoryCache are not worth it and may even degrade performance. Cache::FastMmap needs to be checked to see whether there's an error in the adapter class or whether it is too buggy/resource hungry.

For now I'd propose to only keep

  • Cache::Memcached (for high end sites)
  • Cache::FileCache
  • DB_File (if installing Cache::FileCache is no option)

The configure script could check for Cache::FileCache and fall back to DB_File if it is not available. DB_File comes with Perl, but I am not sure if it was in perl-5.6.1 already.

-- MichaelDaum - 28 Feb 2007

Another alternative to Cache::Cache is the caching library by Chris Leishman. Its documentation says it is a complete reimplementation of Cache::Cache ... but not whether it performs better or worse.

-- MichaelDaum - 28 Feb 2007

Thanks Micha for tackling the caching question on a system-wide level. This is good stuff!

On CPAN dependencies, we must make sure that we ship with a default that "just works out of the box."

On caching a single topic several times: I am wondering what the right balance is between caching everything and caching important stuff only. Caching also brings overhead; it can be counter-productive (performance-wise) if content gets cached and is not used later. It would possibly be good to cache only content that is accessed frequently, such as topics viewed by the TWikiGuest user without any URL params. This simple approach could also simplify the design. As a data point, 96% of the view traffic on twiki.org is by TWikiGuest.

The name <dirtyarea> assumes knowledge that this is part of caching. Most users editing a topic have no idea what this word means. Possibly name it <nocaching> or the like to give a hint that this is cache related.

-- PeterThoeny - 01 Mar 2007

In general, you'd like to cache as much as possible. Some cache implementations slow down the fuller they get (collisions become more frequent, cache directories hold several hundreds of entries). But that can be dealt with on the cache implementation level using different removal strategies, e.g. FIFO (removing the oldest entry first) or LRU (removing the least recently used entry first).
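
As a toy illustration of the LRU strategy (not code from the patch): evict the entry with the oldest access time once the cache exceeds its size limit.

my %cache;       # key => value
my %lastUsed;    # key => last access time
my $maxEntries = 1000;

sub lruSet {
    my ( $key, $value ) = @_;

    if ( keys(%cache) >= $maxEntries && !exists $cache{$key} ) {
        # find and evict the least recently used entry
        my ($oldest) = sort { $lastUsed{$a} <=> $lastUsed{$b} } keys %cache;
        delete $cache{$oldest};
        delete $lastUsed{$oldest};
    }
    $cache{$key}    = $value;
    $lastUsed{$key} = time();
}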

In addition, the dependencies firing on a save reduce the cache size automatically. Even if the view/save ratio is rather high, a single save is very likely to delete lots of cached pages. So in practice, the size of the cache will not grow as dramatically as one would expect.

About CPAN dependencies: does anybody know which Perl versions do not ship DB_File? For now, as there are no subsystems that rely on the caching services being there, TWiki still runs as normal without any html page caching. If, however, we'd like to make use of the caching services more extensively in other areas, you are right that we need to be sure there is always a low-level option ... though DB_File does quite fine wrt performance. There's also the option to bundle TWiki with a CPAN package it can fall back to.

-- MichaelDaum - 01 Mar 2007

Updated patch for latest MAIN branch. Added TDB_File backend.

-- MichaelDaum - 01 Mar 2007

Re DB_File - seems that Perl 5.6 does include this, but Perl 5.8 does not - see the perl core modules list.

As for caching a lot vs a little - I think it's best to cache as much as possible, subject to memory and disk space limits of course. Most TWiki sites are not that large, and as Peter says public Internet TWikis have a lot of guest access, so caching the whole site on disk should be no problem, and even caching in memory is feasible.

Since this module is the first to address the whole area of cache invalidation/maintenance, it would be quite easy to plug in HTTP CacheControlHeaders support, which would make this even more useful for large-scale use across WAN links within enterprises, and on the Internet, by enabling browsers and proxies to cache more effectively. You've done all the hard work, or at least are doing it!

One thing to watch out for is that we don't re-introduce any existing issues solved through cache headers and per-edit-operation URLs, i.e. BackFromPreviewLosesText and RefreshEditPage.

-- RichardDonkin - 03 Mar 2007

Preview and Edit pages are not cached. Only view pages are cached.

Hm, I've downloaded the perl-5.8.8.tar.gz and DB_File definitely is in there.

Wrt small vs big sites: yes, you have to decide on the caching backend and choose one that is appropriate. Small sites can use one that is not "size aware", e.g. DB_File or Cache::FileCache. Large sites should use the memcached backend, which is size aware. I still haven't tried the caching library by Chris Leishman, which might provide a viable size-aware file-based caching backend. The one I've tried so far (Cache::SizeAwareFileCache) was way too slow.

Wrt CacheControlHeaders, this is quite a different issue and could have been addressed independently, working towards better proxy and client-side cache behavior (is it so bad?). This is rather likely to be frustrating work, as the caches you are addressing are not under your control, with implementations not following the specs with regard to cache invalidation. In principle, proxies and browsers can't know when their content is out of date. They only cache based on the assumption that hopefully - within a certain time frame - there's most probably no need to fetch a fresh copy. TWiki can't tell them in advance either. It can only tell them not to come back sooner than x seconds/minutes/... But if you can sacrifice content validity for speed - and this is the case for some kinds of sites - the metadata on which proxies and browsers base their caching should be as cooperative as possible. I don't see that the current TWikiCache implementation is a specific facilitator for continuing work on CacheControlHeaders.

-- MichaelDaum - 04 Mar 2007

Re DB_File, this seems to be in the AnyDBM_File package as part of perl-5.8.8.

I agree that CacheControlHeaders is a separate feature that could be added later, possibly as a plugin if there are suitable core APIs to determine the true 'last modified' date, cacheability, etc, for a given page built using TWikiCache.

TWikiCache would be useful as an enabler for HTTP CacheControlHeaders, because it provides HTML page caching, already provides much of the required data, and handles the whole area of cache invalidation/maintenance. In fact many of the concepts are similar, e.g. the HTTP/1.1 ETag is a unique ID for the page variants that you store within a TWikiCache pageBucket, so there is a lot of synergy.

The expiry time is really a policy decision to be controlled by the TWiki site administrator - e.g. the site could use these headers to prevent or minimise most HTTP caching, if that's preferred. Sites that want to make use of HTTP proxy and browser caching could set parameters that allow most pages to be cached (e.g. all view pages but not those with significant embedded searches).

Cache control headers may also be important for security - proxy caches will often cache items for longer than they should, and in some cases can cache personalised content, but cache control headers provide a way to control this. Use of explicit freshness information (expiry date) and 'validators' (unique URLs and ETags that flag that a particular page version is unique, as with RefreshEditPage) are a good idea for many pages, and in particular for Edit and Preview pages - by using unique URLs and ETags (basically a unique ID for a page/object), together with a long expiration time, browser Back buttons will keep working for Edit and Preview, while proxy caches can be told not to cache Edit and Preview pages. At the very least, using a unique ETag for personalised pages should guarantee that HTTP/1.1 caches will not cache anything that is personalised - this has been a problem with at least one Squid information disclosure bug relating to cookies.

It's also possible to tell proxy caches to never cache a page using a header such as Cache-Control: no-store, which forces the proxy cache to go direct to the web server each time. Other options are to allow caching but force re-validation of the URL+ETag on every client request, i.e. a much shorter request that could perhaps be served by a lookup into TWikiCache.
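
For illustration, emitting such validators from a Perl CGI script could look like this sketch; the ETag value is made up here (e.g. derived from the topic revision):

use CGI;

my $q = CGI->new;
print $q->header(
    -type          => 'text/html',
    -etag          => '"WebHome-r42"',
    -cache_control => 'private, must-revalidate',
    -last_modified => 'Mon, 05 Mar 2007 12:00:00 GMT',
);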

Some useful resources:

-- RichardDonkin - 04 Mar 2007

Why is the already done work not available as sources?

-- KennethLavrsen - 18 Mar 2007

It is; see the patch below. Unfortunately I was not able to attend the last two release meetings, so this work has not yet been formally accepted as a 4.2 feature.

This work is not finished. The following things have to be done:

  1. implement a size-aware file-based cache backend based on the caching library by Chris Leishman
  2. implement a cronjob that purges cache entries; this might ease the runtime of Cache::SizeAwareFileCache and similar cache backends, provided you can bypass their purge-during-set/get behavior
  3. delayed write access in maintenance cycles to reduce cache-twiki communications
  4. debugging re-rendering of dirty areas: rendering might be sensitive to the context a dirty area was found in, so there might be a need to capture the context of a dirty area, as otherwise the re-rendering result might be different
  5. extensive performance measurements to evaluate different aspects of the implementation, i.e. growing cache maintenance overhead on large caches

-- MichaelDaum - 20 Mar 2007

There are for sure some more things that need to be done - and maybe there are still some design challenges:

  1. The cache has user interfaces which are yet undocumented:
    1. For authors: They need to know where and when to introduce <dirtyarea>. In my opinion a cache should not have such an interface at all, or at least the typical culprits (like %SEARCH%, %ACTIONSEARCH%) should imply "don't cache this".
    2. For readers: They need to know when to use the reload url parameter and when to use the reload button in the browser.
  2. A cache entry of the complete rendered page is done (needs to be done) "per user". This gives bad performance characteristics on sites where every access is authenticated (happens in my intranet). Our most frequent scenario of people clicking on URLs reported by WebNotify will inevitably give one new cache entry per view, but rarely a cache re-use.
  3. Caching "on view" instead of "on write" gives bad performance characteristics if you have a search engine periodically visiting your site (happens in my intranet) unless your cache can accommodate all topics for this particular "user".
  4. The cache is "pluggable" with respect to backends, but not with respect to caching techniques: This is "cache complete pages on read", whereas previous ideas went for "cache compiled templates".
  5. Finally, some test cases would be in order, too.

-- HaraldJoerg - 20 Mar 2007

Re (1.1): SEARCHing is the most expensive operation in a TWiki. Most of the time a search is performed, the engine finds the same things again. There's no need to do so unless there was a change within the scope of the search. That means, whenever a topic is edited/created/renamed/deleted within that scope (e.g. a single web), the cached search results need to be invalidated and recomputed. But not if nothing changed. All TWikiApplications are based on some sort of SEARCH; with such a rule, none of them would be cacheable.

Re (1.2): no, readers shall never need to reload a page explicitly. That's what the dependency tracker is for. There is a way to refresh a page manually using "More actions".

Re (3,4): Caching on "view" is fine. You simply keep what you just computed. That's the normal way caches work. Pre-computing html even if you might not need it seems a much bigger waste conceptually: "Don't think about things beforehand, but remember what you just found out in case someone asks again."

Re (1,5): self-evident

Re (2): Right, you get notified about a changed topic. As it was changed by some user, that user invalidated the cache for the new revision. If you click on it and need to authenticate to view it, then the page has to be recomputed for you, because someone else changed it. But if you visit it again and nothing changed, you get the cached page. If you don't need to authenticate to view the page and some other guest has already visited it, then you get a cached version too. Pre-computing (versions of) the topic for all known users is no option imho. Not sure if that's doable at all.

-- MichaelDaum - 20 Mar 2007

I am going to check the current code into MAIN - if nobody disagrees - in the hope to get some more testers.

-- MichaelDaum - 26 Mar 2007

yes please. I'm tending too many patches already to apply major ones like this. But once it's in, I can start to 'just use it', and to code without breaking it.

-- SvenDowideit - 26 Mar 2007

Nothing was implemented before the deadline for 4.2.0.

So this is deferred to GeorgetownRelease

-- KennethLavrsen - 03 Jun 2007

This is implemented and used on a daily basis on the WikiRing. However, there are some issues I had no time to figure out before feature freeze. This deadline came too early to get code like this into 4.2.0 in a safe way. I will keep working on it and might release the code independently as a TWikiCacheContrib. In fact, some of my clients want a backport of the TWikiCache to 4.1.2, so there are already some sponsors to fund this work, though at a lower priority than a couple of other projects.

-- MichaelDaum - 03 Jul 2007

Michael, you really did great work. I decided to put my own (quite close, actually) approach into code as PublicCacheAddOn. It does not have exactly the same goals, and is definitely less polished and complete, but I hope to be able to use it to take a different look at the issues, to help you and the various performance efforts on TWiki.

-- ColasNahaboo - 13 Jan 2008

It does have the same goals! The core of the TWikiCache is (a) the PageCache for full page caching and (b) a dependency tracker to automatically invalidate cache entries if a dependency of a page gets fired. The rest is generic infrastructure one could reuse otherwise.

Too bad you never contacted me or commented here before, and started YACI (yet another cache implementation). frown

-- MichaelDaum - 14 Jan 2008

Sorry Michael, I did not mean to hurt you. I had only hunches and no clear experience on this subject, so I wanted to get my hands dirty personally to gain that experience. Otherwise, I would have just made comments on this page backed up by no real knowledge, which would not have helped you. And I am more and more wary of saying "I am going to do this", as experience proves that I am often sidetracked and fail to deliver on promises. Moreover I am a very bad Perl programmer, with no intimate knowledge of Freetown; letting me loose in your code would have been, like we say in French, "an elephant in a china shop". Also I think I tried a different approach from yours, due to slightly different goals (do not cache different versions of pages), that I really wanted to try. Besides, this is in my free time, and I needed something fun to do, fun for me now being getting bare to the metal. Anyway, I sincerely think our approaches are quite complementary in implementation and that we can gain insight by comparing the two.

-- ColasNahaboo - 14 Jan 2008

After some interesting IRC discussions with Michael (http://koala.ilog.fr/twikiirc/bin/irclogger_log/twiki?date=2008-01-14,Mon&sel=1175-1334#l1171), I think the best way for me is just to avoid taking the same approaches as TWikiCache. For instance, I will drop my idea of analyzing the topic contents to find dependencies, and even once I have a working strategy, try to implement it as varnish rules to see if I can just replace my front-end code with varnish.

-- ColasNahaboo - 15 Jan 2008

I can't find any code for this extension. Is any of it public?

-- ArthurClemens - 18 Mar 2008

There is a patch below but it's from a year ago. Maybe Michael can comment?

-- RichardDonkin - 18 Mar 2008

I am gonna check that in to trunk asap.

-- MichaelDaum - 19 Mar 2008

Here's a silly idea: why implement caching in TWiki at all? Implementing RFC-compliant HTTP caching is quite tricky, and is imho better left to dedicated software such as varnish, squid, or whatever.

Rather than reinventing the above, wouldn't it make sense to provide the proper HTTP headers so that an upstream cache can properly cache content? This is more scalable too.

Next to the headers (Cache-Control, Expires, Last-Modified, ETags etc.), it'd be nice if TWiki could emit proper PURGE requests to upstream caches when content needs to be refreshed, of course.

I.e. what I would like is not caching but better cache control.

-- KoenMartens - 04 Apr 2008

Yep, my current TWikiCache patches do include ETags and gzip compression (if the browser supports it). That enhances upstream caching a lot as well as reducing bandwidth.

The main reason to implement caching in TWiki is its dependency tracker. Only TWiki itself can track dependencies and purge its caches. If no dependencies were fired, it will return the same page again and never do the same thing twice. This by no means supersedes more upstream caching, i.e. using a reverse proxy. TWikiCache follows the idea to never do the same thing twice, i.e. never render the same page twice. External upstream caches only get a caching effect if they sacrifice cache correctness, that is, return the same page for a certain timespan and ask the backend for updates less frequently. TWikiCache does not do that: it gets its caching effect because nothing changed and there's no reason to render exactly the same page twice.

-- MichaelDaum - 04 Apr 2008

Are there patches for this for 4.2?

-- KenGoldenberg - 09 Apr 2008

I would be interested in the 4.1.2 patch if you ever managed to do that backport Michael.

-- StephaneLenclud - 10 Apr 2008

Michael, proper cache control means that the app tracks what has changed and what hasn't, and notifies the upstream cache accordingly, either by sending an http PURGE to the cache or by answering properly to HEAD requests from the cache. Anyway, as I've been losing interest in committing code I should not complain or say how things should be done smile

-- KoenMartens - 14 Apr 2008

I always felt TWiki shouldn't cache entire processed pages but instead pre-calculate tags and wiki markup.

Suppose we add a mechanism for common tags to specify whether the result of the tag is static, somewhat dynamic or very dynamic. Then TWiki could calculate a pre-processed topic text on save, with the static entries and wiki markup replaced, leaving only the dynamic tags.

The difference between "somewhat" and "very" dynamic tags would be that the first only gets updated every x minutes and the second on every page load. For example, a search would probably be "somewhat" dynamic but a user tag would be very dynamic. In contrast, the result of a TeX math formula or syntax highlighting will always be the same.

The "somewhat" dynamic tags could return a list of topics they depend on instead of just relying on a timed update, which ties into the dependency tracking above.

The advantages of this approach are:

  • Fully transparent and backwards-compatible
  • The look and feel (skins etc) are dynamic, but the expensive topic text conversions are optimized
    • If the look and feel are optimized towards client-side processing, they become static and fast as well
  • Most processing is done on save, which is a slow event anyway
  • Tag handlers can be updated one-by-one, concentrating on the most CPU-expensive ones first
  • Plugin writers get an extra incentive to use the common tags handler instead of regexing

Thoughts?

-- WoutMertens - 23 Apr 2008

I agree Wout smile and there is even a patch here in Codev somewhere that caches template evaluations - it makes a measurable difference, and I'll be picking it up again soon. The others in your list - yup, most have been proven to make a big difference, but doing a complete, releasable and compatible change needs work - the largest amount in defining more unit tests, so we can be sure to have changed TWiki as little as possible.

-- SvenDowideit - 23 Apr 2008

The TWikiCache already supports partial caching, in the sense that it allows preventing certain areas from being stored into the cache; these are re-computed for each request instead, like this:

...
static content
...
<dirtyarea>
...
non cacheable content
...
</dirtyarea>
...
static content
...
<dirtyarea>
...
non cacheable content
...
</dirtyarea>
...
static content
...

Nevertheless, I agree that templates could in theory be pre-compiled, although this is a much tougher job than what the TWikiCache, i.e. its page cache, does. That's because templates can be very dynamic.

Modern CMS systems do caching on multiple levels. The TWikiCache's page cache - caching the full html - is just one of these levels, sitting somewhere in between.

-- MichaelDaum - 23 Apr 2008

Wout, all you describe is interesting, but you must remember that all this costs CPU. For instance, in my tests on a slow machine, the big topic TWikiVariables takes 6 seconds to render without cache, 3s with TWikiCache, and... 0.06s with the full-html-cache PublicCacheAddOn, which caches the final, fully processed html. So there is no free lunch: you must take care that the time to compute sophisticated algorithms does not eat up all your efficiency gains...

-- ColasNahaboo - 24 Apr 2008

Colas, I agree that any processing at all will eat CPU, but you gain processing time overall. I feel that caching the full processed html kills the flexibility that TWiki provides: users are no longer able to have personal skin settings etc.

Caching should be a transparent process resulting in a gain in speed without losing accuracy and flexibility.

Sven, Michael: thinking more about template caching, it might be that the real problem is the skin. After all, 99% of the topics on a typical wiki site are completely static, making them prime candidates for regular page caching, but the skin adds site trees, username expansions etc. If we agreed that a complicated skin is only needed/supported on a full-scale browser, skins could use javascript and iframes to load the static topics separately from the current page. Browsers that don't support iframes would have to use a plainer or slower skin.

An example of how this would work:

  • User requests TopicA
  • The skin delivers an html page with nothing but 3 iframes. These iframes are each wiki topics, but with the skin=subskin parameter.
  • The iframes would be the header of the twiki site, the navigation bar and the topic text. Of these, only the navigation bar would typically be dynamic.
  • The navigation bar could even contain a bunch of javascript that loads the topic tree separately, making that iframe static as well.
  • TWikiCache would be able to cache each iframe according to its own dynamics
  • The user would see the static content almost immediately, and faster because of connection parallelism


-- WoutMertens - 24 Apr 2008

Interesting idea, Wout. I'd love to see this explored in reality! You could even use an AjaxSkin instead of (i)frames, as that would be more flexible when each three regions "interact" in some way.

Wrt, user settings being impossible using TWikiCache. Not true. Each page is stored using a sophisticated key that takes a couple of things into account. Besides the plain url - which is the only key a normal upstream cache would use as a key to the cache - are the url params, session values and user identity. These information bits, i.e. the session values, are only available from within TWiki. These are used to calculate so called "page variations". All page variations are stored in one bucket for the url. A purge will always empty a complete bucket, including all of the page variations. I've been experimenting with more finegrained purging on a variations base, but that turned out to be too complicated in terms of CPU and code maintenance.

So you can see that TWikiCache by no means reduces the flexibility of TWiki wrt user settings whatsoever.

Colas, topics like TWikiVariables or, even worse, TWikiDocumentation should be banned. They are too complex for every "system" on the way: twiki, bandwidth, browser and, last but not least, the user, who is simply overwhelmed by such an amount of information. Even the first hit, when the page isn't cached yet, is too expensive. If someone wants such a page, e.g. to print it all out, he should be able to ask for it using a special separate link or button. Any such monster page thwarts the workflow of someone who just wants to look up the documentation of a variable quickly.

Still, your benchmarks are great! Well done.

-- MichaelDaum - 25 Apr 2008

Michael, I see. Hmmm. Does that mean that each topic gets generated at least once per user, or is the code smart enough to notice that some topics are invariant to any or all internal values?

Given that most users will visit a certain page only a few times, not taking such invariance into account means a lot of CPU and disk waste.

If we implement a static/dynamic scheme on plugin execution, that takes care of the invariance, plus it moves a lot of the processing to save time, making things faster at view time.

Of course, any caching done transparently and correctly is better than no caching at all smile

-- WoutMertens - 25 Apr 2008

So basically what I'm proposing (as someone who doesn't have time to code it, sorry frown ) is to automatically generate a "precompiled" version of a topic with the dirty areas marked at save time.

TWikiCache can then be used on top of that at view time.

Templates can be left out of the equation by optimizing the skin as explained above. TWikiCache would then automatically be in the position to cache the proper parts of the page.

Page processing time goes down, user satisfaction goes up.

One interesting metric that should be looked at is the variability in page generation time given "normal" topics, unlike TWikiVariables wink

-- WoutMertens - 25 Apr 2008

I've considered this several times, but dropped it for a simple reason: security and complexity (again). Any page in TWiki can suddenly reveal information that is only visible because the user is authorized to see it. Search results of a FormattedSearch or an INCLUDE are all filtered through TWiki's access control. As far as I investigated it, getting this right is much more complicated than one would think at first. Last but not least, I prefer to render an invariant topic twice and be sure that no unauthorized information is disclosed by someone suddenly seeing page fragments of another, more authorized user. I took this issue so seriously that there deliberately is no sharing of cached pages among registered users at all. Sure, anonymous users will all see the pages as they were cached for the TWikiGuest.

While it seems a pity not to share fragments among registered users, the main reason pages in the cache get purged and need to be recomputed is that wiki content is highly interdependent: a single occurrence of a WikiWord linking to another page creates a dependency on that page. So a couple of edits on strategic topics, e.g. WebPreferences, will purge large amounts of the cache. As a consequence, on a long-running TWikiCache the number of currently valid pages in the cache won't be as high as one might expect: it takes time to capture pages again, but lots are purged with a single edit.

Remember: the more complex the cache algorithm gets, i.e. its dependency tracker, the lower the net value of caching, and the more probable are fundamental flaws in the code that may lead to unwanted information disclosure.

-- MichaelDaum - 25 Apr 2008

You bring up a good point with the security, I hadn't considered it. However I disagree that it means we shouldn't consider precompilation.

All an %INCLUDE% tag handler needs to do is mark the result of including an access-controlled page as dynamic (possibly per user). Also note that tag handlers that don't return a static/dynamic identifier would be handled as dynamic.

But the most important thing to remember is that precompilation is at a different layer than TWikiCache. Even if TWikiCache would not use the static/dynamic hints that would be provided, it would still get a boost from decreased topic compilation times.

So in summary:

  • TWikiCache strives to be correct at all times, at the expense of CPU and storage
  • Letting tag handlers return static/dynamic hints about their results would enable precompilation
  • Precompilation would speed up regular topic compilation times, which would speed up cache-misses for TWikiCache
  • Optionally, TWikiCache could query the staticness of a topic to know if:
    • A cache deletion is really necessary
    • The topic would be invariant to users and therefore could be shared

Right?

Precompilation and TWikiCache would each work without the other. I'm just really curious what the speedup is. Some profiling would tell us whether precompilation is worth pursuing.

-- WoutMertens - 26 Apr 2008

So as a fairly newbie (just installed TWiki 4.2), I've got several questions:

  1. Why does TWiki not support caching out of the box? Clearly (from my own experience) TWiki is a little sluggish, even with the help of mod_perl. Supporting and focusing on a fully functional caching solution would be outstanding.
  2. I think this is the right add-on for me: I'm running a protected TWiki on the public internet on my own server for a small team of people. Everything is protected via .htaccess methods with the TWiki guest function turned off. PublicCacheAddOn looks like another great implementation, but it seems to sacrifice user individuality and other features (although I'm not sure why that sacrifice was made; simplicity?)
  3. Most importantly: how do I install this addition? Is there a linked page with better installation instructions for 4.2? And what does $TWiki::cfg mean? This isn't a command line call... is this a twiki/configure call of some sort? What do some of the rest of the calls mean?

I look forward to the response(s). Thanks

-- RedByer - 10 May 2008

On PublicCacheAddOn:

  • If your TWiki is just write-protected, it is clearly a good solution for you. If it is read-protected via TWiki access control statements, you cannot use it. However, if you are protecting it with .htaccess, I think you can use it: just allow anonymous access from the local machine (via its IP) and it should work (the cache gets the pages by wget from the local machine; it does not save a particular user's view).
  • On "sacrifice": remember that the web works because everybody sees the same thing at the same URL. Personalisation goes against the fundamental web architecture (no Google could exist if Google saw different things than users do), and in my opinion is evil and should be banned. If you keep things "right" you can get efficient caching (100x); otherwise you will only get a 2x speedup, from my tests. I regret deeply that people were not more aware of this and let themselves become prisoners of these "features", trying to obey them rather than ditch them, which explains why there is no caching out of the box: it is too hard to do while trying to accommodate these "features". A personalised left bar, for instance, should be forbidden in the TWiki engine. If you really want it, use non-"core web" techniques like javascript.

On $TWiki::cfg: it is Perl code, used in TWiki add-ons or plugins to access variables set by bin/configure.

-- ColasNahaboo - 11 May 2008

If this were a public site, I'd be all over PublicCacheAddOn, but for now we're hosting a private workgroup of about 5 users. There is to be no public access or anonymous viewing. The controls are done with Apache .htaccess controls to block off access to all the sections (including /pub). The personalization of the webpages is limited to the link at the top left of the web bar that sends users to their homepage.

Still not sure I understand how to apply this patch -- not written for us noobs.

-- RedByer - 12 May 2008

Red, sorry that the code isn't released yet. TWiki-5.0 will have this kind of caching built in, out of the box.

Colas, personalized or role-based web content is quite a common thing. Nothing wrong with it per se, although it obviously hinders different users from sharing fully cached pages.

I wouldn't go so far as to ban personalization from TWiki just because it is hard to cache. For example, people simply get different content because a query returns a different hit set based on their access rights. Banning personalized/role-based content from TWiki would also put an end to any workflow feature where content has different states of clearance etc.

-- MichaelDaum - 17 May 2008

Although I cannot add to the discussion technically, I just wanted to throw in a "thumbs up!" for the caching efforts. I believe that this will help especially large TWiki implementations a lot.

-- MartinSeibert - 18 May 2008

I am setting this to parked and no committed developer. Please feel free to flip that and own & implement.

-- PeterThoeny - 2010-08-01

Topic attachments

  • TWikiCache.patch (55.0 K, 2007-03-02, UnknownUser) - against MAIN, revision 13018
  • untitled.draw (3.3 K, 2008-04-24, WoutMertens) - TWiki Draw draw file
  • untitled.gif (8.8 K, 2008-04-24, WoutMertens) - TWiki Draw GIF file