
Get Rid of Excess Meta Data

Analysis

Recently I've been doing quite a lot of work with topics without a TWiki instance in place, and this has raised again the spectre of duplicate meta-data in TWiki topics.

Specifically, META:TOPICINFO and META:FILEATTACHMENT duplicate information stored in the revision control system (RCS) viz

  • Revision numbers
  • Revision dates
  • Checkin comments

Why is this an issue? Well, for several reasons.

  • First, if you want to version a topic without all the TWiki baggage in place, you have to jump through hoops to update this information. TWiki gets really confused, really easily, if you don't.
  • Second, it's duplicate information, and TWiki is permanently on the horns of a dilemma: should it use the info from meta (which might be wrong) or should it use the info in the revision control system? This has led to confused and complex code.

Why was this done originally? Apparently for performance, so that this information was available just from reading a topic, without reference back to the revision control system or, in the case of file attachments, reading the directory on disc.

Proposal

  1. Remove duplicate meta-data from the stored form of topics
    • Eliminate META:TOPICINFO and META:FILEATTACHMENT completely
  2. Load this information on demand from the revision control system/file system
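As a rough illustration of step 1, here is a minimal Python sketch (not TWiki's own Perl code) that strips the two duplicated meta-data types from a topic's stored text while leaving other meta-data alone. The `%META:NAME{...}%` line shape, the helper name `strip_cache_meta` and the sample topic are assumptions made for the example only.

```python
import re

# Assumed shape of an embedded meta-data line: %META:NAME{key="value" ...}%
META_LINE = re.compile(r'^%META:(?P<name>[A-Z]+)\{(?P<args>.*)\}%$')

# The two meta-data types the proposal wants out of the stored form.
CACHE_TYPES = {"TOPICINFO", "FILEATTACHMENT"}


def strip_cache_meta(topic_text):
    """Return (stored_text, removed_lines) with duplicated meta-data removed.

    Hypothetical helper: other meta-data (forms, parents) is left in place.
    """
    kept, removed = [], []
    for line in topic_text.splitlines():
        match = META_LINE.match(line.strip())
        if match and match.group("name") in CACHE_TYPES:
            removed.append(line)
        else:
            kept.append(line)
    return "\n".join(kept), removed


# Made-up sample topic, for illustration only.
sample = "\n".join([
    '%META:TOPICINFO{author="ExampleUser" date="1160524800" format="1.1" version="1.3"}%',
    'Some topic text.',
    '%META:FORM{name="ExampleForm"}%',
    '%META:FILEATTACHMENT{name="notes.txt" version="1.2" date="1160524800"}%',
])

stored, removed = strip_cache_meta(sample)
print(stored)
```

Under this sketch the revision number, date and attachment details would then have to come from the revision control system or the file system at read time, which is step 2 of the proposal.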

Some will argue that we should repopulate the text of the topic when it is passed to plugins (some of which may depend on it being present :-/ ). Personally I think this is a bad idea. We need to bite the bullet and fix this problem once and for all. Constantly trying to maintain 100% compatibility is killing this project.

Note that this is another move in the direction of a TopicObjectModel.

I'd be delighted to hear comments.

-- Contributors: CrawfordCurrie - 11 Oct 2006

Why we shouldn't remove the excess META data

  1. Performance: caching the TOPICINFO and ATTACHMENT information in the topics reduces the number of files that need to be queried to retrieve the info.
  2. Audit trail: The complete history of changes to a topic must be versionable
    • This can be achieved without the caching - RA
  3. May break some existing plugins in the wild that make assumptions about the topic format.

Discussion

I support this, but I think the argument could be strengthened by providing measurements of the performance impact.

My intuition is that this performance impact is small, as evidenced by the autoattachment capability.

I do not think that there are backwards compatibility issues, as

  • the physical representation of metadata is not part of the published TWiki API and the %META% tags can still yield this information unchanged

-- ThomasWeigert - 11 Oct 2006

But some plugins may depend on the fact that the metadata is intertwined with the topic text. That is part of the published API.

I would say: follow the first step (remove duplication) in the TWiki4 family, but leave the backward compatible code in place. And announce that starting from TWiki5, the META will NOT be intertwined with the topic text. Given our cycle of more than a year per release, that should be time enough for a simple change in the API.

The caveat is that there shouldn't be a big change in the published API for TWiki5 or we may encounter "resistance".

-- RafaelAlvarez - 11 Oct 2006

Here are the design decisions why the META:TOPICINFO and META:FILEATTACHMENT are stored in the topic text: Performance, audit trail and KISS

  1. Performance:
    • Only one topic needs to be consulted to display relevant info on topic view, which is fast. In other words, meta data is a cache. In the rare case where a previous revision is accessed, the slower revision history needs to be consulted.
  2. Audit trail:
    • Everything needs to be versioned, also topic info and file attachment info.
  3. KISS:
    • No extra step is necessary to have complete audit trail, e.g. no need to look in multiple places to get the complete picture.
    • Data is in sync if content is manipulated in the documented way.

I see the architectural value of not caching meta data. I am concerned, however, that if we take the meta data out of the topic we will make topic view more CPU and disk intensive; in other words, performance might suffer.

To make a point, Dakar has about 3 times more disk IO than Cairo. Running strace -f -o ./strace.out -e trace=open ./view; wc ./strace.out reveals that Cairo opens 216 files, and Dakar opens 620 files. That number is higher if there are file attachments since TWiki is discovering attachments at run time (which is expensive). We should try hard to improve the performance and to reduce the complexity of the code.
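A line count from wc treats every line of the trace alike, so it may help to break the numbers down a little. The following is a minimal Python sketch, assuming the usual strace line shape (an optional PID, then open("path", flags) = result); the function name `summarize_opens` and the sample lines are invented for illustration and are not real measurements from Cairo or Dakar.

```python
import re

# Assumed line shape from `strace -f -o FILE -e trace=open`:
#   101 open("/path/to/file", O_RDONLY) = 3
#   101 open("/missing", O_RDONLY) = -1 ENOENT (No such file or directory)
OPEN_CALL = re.compile(
    r'^(?:\d+\s+)?open\("(?P<path>[^"]*)".*\)\s+=\s+(?P<ret>-?\d+)'
)


def summarize_opens(lines):
    """Count attempted opens, successful opens and distinct files opened."""
    attempted = 0
    succeeded = 0
    paths = set()
    for line in lines:
        match = OPEN_CALL.match(line)
        if not match:
            continue  # not an open() call, e.g. an exit notice
        attempted += 1
        if int(match.group("ret")) >= 0:
            succeeded += 1
            paths.add(match.group("path"))
    return {
        "attempted": attempted,
        "succeeded": succeeded,
        "distinct_files": len(paths),
    }


# Invented sample trace, for illustration only.
sample_trace = [
    '101 open("/etc/ld.so.cache", O_RDONLY) = 3',
    '101 open("/data/Main/WebHome.txt", O_RDONLY) = 4',
    '101 open("/data/Main/WebHome.txt", O_RDONLY) = 4',
    '102 open("/lib/Missing.pm", O_RDONLY) = -1 ENOENT (No such file or directory)',
    '102 +++ exited with 0 +++',
]

summary = summarize_opens(sample_trace)
print(summary)
```

To run it against a real trace, pass an open file handle for strace.out in place of `sample_trace`. Separating failed lookups and repeated opens from distinct files gives a slightly better basis for comparing releases than the raw line count alone.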

To support the need to "work with topics without a TWiki instance in place", I would like to rephrase this to the need to "manipulate topics easily on the file level without http". This can be done if we create a TWikiShell that allows admins to easily manipulate topics on the shell level, such as: twikish -set -formfield Status -value Complete Prj.SlicerTestRuns

-- PeterThoeny - 11 Oct 2006

No, that's not the point. I'm manipulating topics on the file level via http - specifically REST handlers. I have to work this way because the expense of running up a TWiki instance is too high.

Interesting stats on the number of open files, but I would point out that opening 3 times as many files does not mean "3 times as much disk IO". Having said that, it is true that far too many files are opened and read.

-- CrawfordCurrie - 12 Oct 2006

For a speedy REST interface we can offer a lightweight TWikiShell that talks to a TWikiDaemon.

-- PeterThoeny - 12 Oct 2006

That's fine, but you are not addressing the fundamental problem here; the duplication of meta-data, and the maintenance problem inherent in that. Getting back on track, thank you for sharing the justification for embedded meta-data, that is most helpful. I think I understand much more clearly now what the original design intent was, and where it went wrong.

The original vision of the .txt file was purely of a cache, right? A simple snapshot of the top of the revision tree. By caching data and meta-data in this file, a single file access could accelerate most-recent-rev access. The cache would be a throwaway, as it could be regenerated at any point from the top of the revision tree.

This has not been my understanding for a long time, and I think many others share my confusion, as evidenced by much of the code and plugins. I suspect most of us have perceived the .txt as the topic, and the .txt,v as a backup history, which is wrong.

  • You can look at it both ways, and it does not really matter as long as the APIs are used. While it is true that the revision file contains all relevant data and you can recover the .txt from it, it is also possible to drop .txt files into the directory tree without worrying about history file and META:TOPICINFO etc. The store will sync that on first save (if any.) -- PTh

So where did it all go pear shaped? Well, major mistakes:

  1. not distinguishing between meta-data - such as file attachment lists, topic parents, and form data - and cache data, such as TOPICINFO and the version information in file attachments.
    • That is a matter of opinion. I consider storing relevant information in one single place for fast retrieval a feature. -- PTh
  2. allowing cache data to be stored in the revision history. This has resulted in temporary cache data being persisted in older revs, which is extremely misleading.
    • I do not understand why this can be misleading. I see it as a feature to be able to retrieve a previous topic and have all relevant information readily available, without the need to look in many places (such as attachment file size). -- PTh
  3. not using APIs internally in the code; i.e. in always assuming that the cache is always bang up to date, and never checking.
    • Can you elaborate? I think the Func API is clear, as well as the store API. The Store is the only place that updates the meta data/topic cache data. -- PTh
  4. not clearly documenting the design decisions that led to this, which has resulted (I suspect) in most people who joined the project since its inception getting the wrong end of the stick. That means most of us working today.
    • Good point, documentation can always be improved. By me, by anyone else. -- PTh
  5. using the same internal structures to store cache data and meta-data (the Meta object) which at the end of the day is probably the cause of all the other mistakes.
    • Duplicate of point 1. -- PTh

I'm tempted at this point just to throw my hands up in horror at the whole mess and walk away. I just can't see a way for the project to recover from such a major long-term failure to impose such a fundamental architectural principle without a major refactoring. Friday 13th is living up to its reputation.

-- CrawfordCurrie - 13 Oct 2006

CDot clearly pointed out the design problems of the current data format. The motivations to put more effort into that area are more far-reaching than outlined here. This is just a starter. So let's just fix it and have the best store implementation we can think of. This must be possible project-wise.

-- MichaelDaum - 13 Oct 2006

While I agree with the frustration expressed above, I also would like to remind everybody that TWiki is a pretty good platform. Let's not throw the baby out with the bathwater... In the end, having this little confusion is not the end of the world, I think....

We are in excellent shape now to drive forward. E.g., swap out the store for a DB in a custom store implementation.

-- ThomasWeigert - 13 Oct 2006

My Friday the 13th actually started with pretty good feedback in a conference call by a WikiChampion of a major airline manufacturer: "TWiki has very good documentation; nice end user functionality with dynamic content (embedded search and include); very good security model; surprised by good performance even without DB backend". I believe it is uplifting to highlight positive things.

I think the friction here is caused by miscommunication (as in many cases) and assuming things without verifying. Crawford, I edited your "major mistakes" section above to address your points. I prefer to stick with the facts and would like to keep the FUD as far away as possible. I am looking forward to constructive feedback. After all, this is a BrainstormingIdea.

-- PeterThoeny - 13 Oct 2006

Let's see it from another side: Can we make a numbered list of why we shouldn't REMOVE the cached information from the topic? Forgetting about alternative implementations or whatever else. Just focusing on explaining it to a crazy developer who one day wakes up in the morning and says "I'm going to remove all the cached meta-data information that is redundant with the RCS history and the pub directory". Just one rule: Don't call up the TWikiMission. Instead explain why the change would be against the TWikiMission.

This will serve to focus on the concerns, and to have a "solution" that takes them into account.

I'll start from what I can collect here, starting with Peter's list above.

-- RafaelAlvarez - 13 Oct 2006

I think with the three points (performance, audit trail and KISS) I stated clearly why the cached information is/should be in the topic. I am open to constructive ideas why we would want to change that (vs. spending brain cycles on TWiki enhancement for usability, performance, cool new Ajax features and the like.)

-- PeterThoeny - 13 Oct 2006

So you would rather we spent our time on CreepingFeaturitis than on "KISS (keep it simple) design is preferred over complicated designs"?

You know, you could have eased any friction by simply saying "Sorry, I didn't realise that neither you nor anyone else understood this point".

Let's just get one thing straight. "You can look at it both ways, and it does not really matter as long as the APIs are used." Rubbish. Absolute, utter, total, complete, rubbish. Before I got there, the only documented API was Func, and the documentation of that didn't once mention this issue. Not once. In fact, it went out of its way to compound the confusion by only providing methods such as saveTopicText, which embed the cache metadata in the text passed to innocent plugin authors, without telling them how to use it. Almost all of the internal API documentation was written by me, so reflects the erroneous assumption. And the use of the Meta object to store both cache and persistent data was bound to create this confusion. APIs are only as good as the doc that tells you how to use them.

You seem to be confirming my analysis of the role of the .txt file above. As such, my original proposal is obviously not the right approach, as it was based on a misunderstanding. As I see it, the most pressing requirement is to clearly document this design decision so that no-one else can get caught by it again. In the Store documentation, and in TWikiMetaData, and in the doc of the Func API, we require something like this:

Certain meta-data are used to cache the state of the topic in the revision control system, for fast presentation during rendering. These cache meta-data are:

    • META:TOPICINFO - all fields except format and reprev
    • META:FILEATTACHMENT - all fields

Cache meta-data duplicates data that is stored by the revision control system. However, because it is possible for other tools to read and write topics and their histories (such as RCS commands issued from the shell), cache meta-data cannot be relied on to be accurate, or even to be present in topics. For historical reasons cache meta-data is stored mixed up with real, permanent meta-data. Plugin authors must be prepared to ignore cache meta-data if it is present in the topic text passed to the plugin, but cannot assume it will be present, and must not assume it is correct. Plugins that require access to the information that is duplicated in cache meta-data should use the Func API methods such as getRevisionInfo, which will ensure that the correct information is loaded from the revision control system. Plugins should not attempt to modify cache meta-data in any way.

The second thing is to ensure that the API really does provide routes to the correct data for plugin authors. The obvious thing to do is to add a method to TWiki::Meta that would do the job, both for the core and for extensions.

---+ ObjectMethod updateCacheFromStore()
Updates the cache meta-data i.e.
   * META:TOPICINFO - all fields except =format= and =reprev=
   * META:FILEATTACHMENT - all fields
from the revision control system. This method may be called on demand by any application (such as a plugin)
that requires accurate meta-data, at the cost of a round-trip to the store to refresh the cache. If you don't call
this method, you can't rely on the correctness of any of the fields in META:TOPICINFO or META:FILEATTACHMENT.

The third thing is a complete and careful review of the use of meta-data in the core. I am fairly sure there is more than one place where the incorrect assumption has been made by coders (and not just by me).

Longer term I really think we need to increase the distance between cache meta-data and real meta-data, but the above approach is a step in the right direction.
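To make the idea of keeping cache meta-data apart from real meta-data more concrete, here is a minimal Python sketch (TWiki itself is Perl, and none of this is actual TWiki code). `TopicMeta`, `FakeStore`, `RevisionInfo` and their fields are hypothetical names; only the notion of refreshing cached revision info from the store on demand comes from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class RevisionInfo:
    """What the revision control system knows about the latest revision."""
    author: str
    date: int
    version: int


class FakeStore:
    """Hypothetical stand-in for the revision control system."""

    def __init__(self, revisions):
        self._revisions = revisions

    def latest(self, topic):
        return self._revisions[topic]


@dataclass
class TopicMeta:
    """Keeps persistent meta-data and cache meta-data in separate places."""
    topic: str
    persistent: dict = field(default_factory=dict)  # forms, parents, ...
    cache: dict = field(default_factory=dict)       # revision info only
    cache_fresh: bool = False

    def update_cache_from_store(self, store):
        # One round-trip to the store; persistent meta-data is not touched.
        info = store.latest(self.topic)
        self.cache["TOPICINFO"] = {
            "author": info.author,
            "date": info.date,
            "version": info.version,
        }
        self.cache_fresh = True

    def revision_info(self, store):
        # Never trust a cache that has not been refreshed in this session.
        if not self.cache_fresh:
            self.update_cache_from_store(store)
        return self.cache["TOPICINFO"]


# Made-up data: the cached version (2) is stale, the store says 5.
store = FakeStore({"Main.Example": RevisionInfo("ExampleUser", 1160524800, 5)})
meta = TopicMeta(
    topic="Main.Example",
    persistent={"FORM": {"name": "ExampleForm"}},
    cache={"TOPICINFO": {"author": "SomeoneElse", "date": 0, "version": 2}},
)

info = meta.revision_info(store)
print(info["version"])
```

Because the stale values live in their own dictionary, a caller cannot mistake them for permanent meta-data, and the cost of accuracy is confined to the one refresh call.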

-- CrawfordCurrie - 14 Oct 2006

Crawford, I am not going to address this further in this form, I am not interested in flamewars. I believe we should try hard to work with each other.

-- PeterThoeny - 15 Oct 2006

Topic revision: r14 - 2006-10-15 - PeterThoeny
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.