
Overview

Note that this section is an overview, and contains repetition of points covered in the individual sections below.

In a multilevel web with an arbitrary number of levels we can go one step further and allow any topic to also be a web. (MartinCleaver and perhaps others have stated the same thing.)

Note: There are other pages in Codev that discuss similar or related topics, including topics like MultiLevelWikiWebs, PackageTWikiStore, and TheBrainVsTwiki.

The fact that a web is stored as a directory while a topic is stored as a file adds some cognitive dissonance to that mental model. (Hey, I get paid extra for using long words. wink ) Another contributor to the cognitive dissonance is the difference in naming rules between a web name and a topic name.

An alternate approach could be to store all topics in one directory, with a file name for each topic that includes the full TWiki multilevel web name. For example, a file named Linux.Scripts.Backup.TWikiBackup.txt would correspond to topic TWikiBackup in subweb Linux.Scripts.Backup (also known as subweb Backup of subweb Scripts of subweb Linux (of the implied single main web of TWiki)).

If there is a topic Backup as well as the subweb Backup, the topic Backup is named Linux.Scripts.Backup.txt, and topics in the subweb Backup are named like Linux.Scripts.Backup.TWikiBackup.txt, Linux.Scripts.Backup.OtherTopic.txt, etc.

In the following I've tried to start a discussion of the advantages and disadvantages of several different methods of backend storage.

NicholasLee feels that the details of the means of storage should be hidden in a module or object. I don't necessarily disagree, but I would like topic data to continue to be stored in individual text files, one per topic (or at least have that as an option), to allow for the possibility of indexing by an external search engine which might not understand HTML or know how to spider a web. And, I think there is something more "wiki like" in a storage scheme that is accessible in ways other than just through a module that attempts to provide an "abstract" interface to the data. (I'm sure that the OO police will issue a warrant for my arrest. wink )

I wonder if this approach was discussed and rejected sometime in the past, and if so, what were the reasons. I remember that searching (or listing the contents of) a directory can get very slow as the number of files increases, but that's just a vague recollection. Does that come into play if we are merely attempting to open a file with a known filename? (I wouldn't think so.) Would the standard TWiki search based on grep slow down "exponentially" as the number of files increases? If that is the case, but opening a file with a known file name does not take longer in a larger directory, then an indexed search approach may be necessary in a scheme like this.

Discussion

I can imagine the following possibilities for storing the pages for multilevel (beyond the current number of levels) wikiwebs, and I've tried to list some advantages and disadvantages of each.

This came to mind while thinking about nesting webs to arbitrary levels and recognizing that TWiki distinguishes between a web and a topic by storing a topic as a file but a web as a directory. Although I recognize that this is workable, I wonder if anything (including the code) would be simplified by looking at it another way. And, I think this may be what Peter alludes to at the end of MultiLevelWikiWebs, in his reference to HierarchicalNavigation.

As you will see, this is unfinished (and a first draft -- sorry!) -- it is intended as food for thought. I hope people will add comments and that eventually I (or someone else) will refactor this to be more clear and concise, with comments by others incorporated. If it makes sense, maybe I and all contributors of comments should be listed as "Contributors" at the bottom of the page after it is refactored, as opposed to signed comments throughout or at the bottom. (Until refactoring, it is useful to sign your comments so that questions can be directed appropriately or we may recognize some contextual information about "where you're coming from".)


I. SubWebs Are Directories

This is (IMO) "closest" to the way TWiki is done today -- the extension to additional subwebs would be by adding subdirectories below directories for existing webs. (This is the approach discussed in MultiLevelWikiWebs, AFAICT.)

I.A. Advantages

Work is proceeding in this direction.

  • For an intranet-type site, it makes it possible to secure some part of a web where sensitive content is stored (through Apache's .htaccess).

I.B. Disadvantages

  • Adding subwebs requires creating subdirectories in data (and, for attachments, pub).

  • The extension to arbitrary multilevel webs carries with it the idea that any topic is also a web. (I think MartinCleaver (and perhaps others) alludes to (or states) the same thing.) If we adopt the mental model that any topic can also be a web, there can be a mental disconnect between that model and the fact that a web is stored as a directory while a topic is stored as a file (ignoring that at some other level of abstraction a directory is a file).

I.C. Effect on storing attachments

I.D. Effect on the Webname.TopicName link syntax

I.E. Effect on searching by an external search engine (local or remote)

See discussion under the database approach.

I.F. Effect on RCS

My guess is minor, because the .txt,v file would still be found in the same directory as the .txt file, with the same filename.

I.G. Ability to create hidden webs

Just like today?

I.H. Ability to create password protected webs

Just like today?

I.I. Ability to provide "per web" functionality, like templates and WebNotify

Just like today?

I.J. Effect on other specific features:

<Please list any you think of, under each approach>

II. SubWebs Incorporated in File Names

In this approach, all topics are stored in one directory, but the file name for each topic includes the full TWiki multilevel web name. For example, a file named Linux.Scripts.Backup.TWikiBackup.txt corresponds to topic TWikiBackup in subweb Backup of subweb Scripts of subweb Linux (of the implied single main web of TWiki). (As you can see, I'm ignoring the effects of any limits on the length of a file name, and the fact that Windows probably has a problem with multiple periods in a file name.)

What if there is a topic Backup as well as the subweb Backup? I think that works OK: the topic Backup is named Linux.Scripts.Backup.txt, and topics in the subweb Backup are named like Linux.Scripts.Backup.TWikiBackup.txt.
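A minimal sketch of that mapping in Perl (the helper names and paths are hypothetical, purely for illustration): going from a web path plus topic name to a file name is just a join on periods, and going back treats the last dot-separated component before .txt as the topic.

    use strict;
    use warnings;

    my $dataDir = '/path/to/twiki/data';            # hypothetical data directory

    # Web path plus topic name -> flat file name.
    sub topicToFileName {
        my ( $web, $topic ) = @_;                   # e.g. ( 'Linux.Scripts.Backup', 'TWikiBackup' )
        return $web ? "$dataDir/$web.$topic.txt" : "$dataDir/$topic.txt";
    }

    # Flat file name -> ( web path, topic name ).
    sub fileNameToTopic {
        my ($file) = @_;
        $file =~ s{^\Q$dataDir\E/}{};               # strip the directory prefix
        $file =~ s{\.txt$}{};                       # strip the extension
        my @parts = split /\./, $file;              # last component is the topic, the rest the web path
        my $topic = pop @parts;
        return ( join( '.', @parts ), $topic );
    }

    print topicToFileName( 'Linux.Scripts.Backup', 'TWikiBackup' ), "\n";
    # prints /path/to/twiki/data/Linux.Scripts.Backup.TWikiBackup.txt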

II.A. Advantages

Avoids the need to create directories for each new subweb. May also help with the mental model discussed under the disadvantages of "SubWebs Are Directories".

II.B. Disadvantages

II.C. Effect on storing attachments

Must directories still be created for storing attachments? Yes, AFAICT. Does this approach make it any harder or easier? Not sure, but it affects my "argument" that this approach makes it easier by not requiring the creation of the whole subdirectory tree structure. However, maybe it is still easier -- maybe we create all the attachment directories under pub with a similar naming convention. In other words, the attachments for the TWikiBackup topic mentioned above go under a subdirectory of pub named "Linux.Scripts.Backup.TWikiBackup". This can be created without creating subdirectories pub/Linux, pub/Linux/Scripts, and pub/Linux/Scripts/Backup. (I'm also ignoring limits on the number of files (in Linux: inodes?) within a directory -- I wonder what those limits are, or whether they are adjustable? Probably different for Linux and Windows.)
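A sketch of what creating such a flattened attachment directory might look like (hypothetical paths; note that only a single mkdir is needed, with no intermediate directories):

    use strict;
    use warnings;

    my $pubDir = '/path/to/twiki/pub';              # hypothetical pub directory

    # One attachment directory per topic, named with the full dotted web.topic string.
    my $attachDir = "$pubDir/Linux.Scripts.Backup.TWikiBackup";

    unless ( -d $attachDir ) {
        mkdir $attachDir, 0755
            or die "Cannot create $attachDir: $!";
    }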

II.D. Effect on the Webname.TopicName link syntax

The strings that Perl would have to parse and recognize as links would have an arbitrary number of components separated by periods. I don't know how difficult that might be to handle. (Wouldn't it be the same, though, for the "SubWebs Are Directories" approach?)
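For what it's worth, the pattern doesn't look much harder than today's two-part Web.TopicName form. A rough sketch in Perl (simplified patterns; real TWiki name rules are stricter):

    use strict;
    use warnings;

    # Simplified patterns: web path components start with an uppercase letter,
    # the final component must look like a WikiWord (the topic).
    my $webPart  = qr/[A-Z][A-Za-z0-9]*/;
    my $wikiWord = qr/[A-Z]+[a-z]+[A-Z][A-Za-z0-9]*/;
    my $link     = qr/((?:$webPart\.)*)($wikiWord)/;

    my $text = 'See Linux.Scripts.Backup.TWikiBackup for details.';
    if ( $text =~ /$link/ ) {
        my ( $web, $topic ) = ( $1, $2 );
        $web =~ s/\.$//;                            # drop the trailing period
        print "web='$web' topic='$topic'\n";        # web='Linux.Scripts.Backup' topic='TWikiBackup'
    }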

II.E. Effect on searching by an external search engine (local or remote)

See discussion under the database approach.

II.F. Effect on RCS

My guess is minor, because the .txt,v file would still be found in the same directory as the .txt file, with the same filename.

II.G. Ability to create hidden webs

Not considered so far -- prefix the name with a period in Linux? Even if that works in Linux, what about Windows? Maybe, similar to the discussion in the next topic, hidden webs should be in a separate "implied" main TWiki web.

II.H. Ability to create password protected webs

Not considered so far -- and I guess I don't know enough at this time to consider it. Can we provide a long list of files in the proper place(s) in .htaccess? Even if we can, that seems rather cumbersome. Can we use include files? Even if we can, that's only a little less cumbersome. Can we specify files with wildcards (or regular expressions), for example, use Linux.Scripts.Backup.* to require a password for all topics in the subweb Linux.Scripts.Backup? (And my use of wildcards seems more like the Windows approach than the Linux one.) Maybe putting wildcards or long lists of files in .htaccess would slow down the system. Maybe another approach is to have more than one "implied" main TWiki web -- one for all non-password-protected webs and topics, and another for password-protected webs and topics, maybe with the restriction that each is a complete hierarchy with no duplication between them.
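For illustration only, a hypothetical .htaccess fragment for the flat data directory, using Apache's FilesMatch directive (which takes a regular expression) rather than wildcards; the paths and names are made up:

    # Protect every topic file whose name starts with "Linux.Scripts.Backup."
    <FilesMatch "^Linux\.Scripts\.Backup\.">
        AuthType Basic
        AuthName "Backup subweb"
        AuthUserFile /path/to/.htpasswd
        Require valid-user
    </FilesMatch>

Note that this expression also catches the topic file Linux.Scripts.Backup.txt itself, so a stricter expression would be needed if the topic Backup and the subweb Backup should be protected differently.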

II.I. Ability to provide "per web" functionality, like templates and WebNotify

The WebNotify topic for the Linux.Scripts.Backup web could be named Linux.Scripts.Backup.WebNotify.txt. The templates for the Linux.Scripts.Backup web could be named, for example, Linux.Scripts.Backup.view.tmpl. Although topic name parsing would have to be done differently, this would also support per topic templates, and a hierarchy of templates. When looking for a view template for Linux.Scripts.Backup.TWikiBackup.txt, first look for Linux.Scripts.Backup.TWikiBackup.view.tmpl, then Linux.Scripts.Backup.view.tmpl, then Linux.Scripts.view.tmpl, then Linux.view.tmpl, and finally just view.tmpl (all under .../data/). A similar approach could be adopted for WebNotify, allowing a WebNotify topic to be placed at any level of the web hierarchy which would notify for changes to all webs and topics at or below that level. (In the case of notify, all "higher level" WebNotify topics would have to be processed so that, for example, people that requested notification for a change to any topic would be notified in addition to those who requested notification for changes to only a specific topic or subweb.)
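A sketch of that lookup order in Perl (hypothetical helper; assumes templates live under data/ as described above):

    use strict;
    use warnings;

    my $dataDir = '/path/to/twiki/data';            # hypothetical data directory

    # Walk from the most specific name to the least specific one and
    # return the first template file that exists.
    sub findTemplate {
        my ( $fullName, $template ) = @_;           # e.g. ( 'Linux.Scripts.Backup.TWikiBackup', 'view' )
        my @parts = split /\./, $fullName;
        while (@parts) {
            my $candidate = "$dataDir/" . join( '.', @parts ) . ".$template.tmpl";
            return $candidate if -e $candidate;
            pop @parts;                             # drop the last component and try again
        }
        return "$dataDir/$template.tmpl";           # finally, just view.tmpl
    }

    print findTemplate( 'Linux.Scripts.Backup.TWikiBackup', 'view' ), "\n";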

II.J. Effect on other specific features

<Please list any you think of, under each approach>

III. Database

I've done even less thinking here. I think a backend database can make many things in TWiki faster and "more efficient", but I am most concerned about the effect on the use of an external search engine.

Aside: I haven't listed two separate approaches for the database ("SubWebs Are Directories" vs. "SubWebs Incorporated in File Names") for reasons that are hard to articulate accurately, but they come down to the fact that the database approach is a fundamentally different mode of storage (and in some sense, can be made to mimic either of the previous approaches to the extent it is applicable).

III.A. Advantages

Faster access, more efficient storage. (And, if I'm not mistaken, there is work taking place in this direction.)

III.B. Disadvantages

  • When I am installing or upgrading a TWiki, it occasionally happens that the System Administrators group gets messed up. When that happens, I fix it by using emacs to edit the appropriate file as root. Afterward, I can continue system administration (and properly check in the changes using normal TWiki channels). This would probably be impossible if the pages are in a database. -- HendrikBoom - 18 Jul 2001

III.C. Effect on storing attachments

Not sure -- attachments could be stored within or external to the database -- if external, they could still use the existing approach or the approach suggested under "SubWebs Incorporated in File Names".

III.D. Effect on the Webname.TopicName link syntax

III.E. Effect on searching by an external search engine (local or remote)

I am not (and at least some others are not) satisfied with the search abilities built into TWiki at this time. I would like (among other things) the ability to do proximity searches (words, phrases, or sentences within x words, lines, or paragraphs of other words, phrases, or sentences). The approach I had in mind (also mentioned by others) is to install an external search engine on my TWiki server (like htdig, zyindex, altavista (personal?), google, or whatever) and use its search capabilities. For several of those tools the best (or perhaps only) approach is to search the .txt files rather than the dynamic HTML "image" of a page. Those tools will have more trouble with a database, for two reasons -- one is understanding and dealing with the storage format, and the other is that, without modification, some of them only report which file contains a match, not the location within a file. Worst case, all searches result in hits in the single file TWiki.db (or whatever it's called).

So, to summarize my rambling about searching with an external search engine, it can create problems and is sort of irrelevant or an aside to a comparison between storage with "SubWebs Are Directories" vs. "SubWebs Incorporated in File Names".

III.F. Effect on RCS

It strikes me that the effect would be more significant here than under "SubWebs Incorporated in File Names", but I'm sure the people working on (or considering) this approach have a plan in mind to deal with it.

III.G. Ability to create hidden webs

I'm sure there's a way.

III.H. Ability to create password protected webs

I'm sure there's a way.

III.I. Ability to provide "per web" functionality, like templates and WebNotify

I'm sure there's a way.

III.J. Effect on other specific features

<Please list any you think of, under each approach>

Contributors

  • RandyKramer - 12 Jun 2001
  • See comments, above and below

Comments

I think this is the wrong approach. The important thing is to hide the complexity of storage in one part of the TWiki application, i.e. all of the above should be possible. So the actual mechanism for storage is not as important as the way in which the data is exposed to the rest of TWiki from the storage handler.

The basic concept I'm chasing at the moment is: "A Web is a Topic with Children. TWiki is a Tree of Topics, or a Tree full of Webs."

I've got some rough notes and I was trying to find some time to post an RFC here in the Codev sometime in the near future. Daily life prevents me at the moment from finding any time to put into TWiki other than casual discussion here.

I can make what I have available to those interested though.

-- NicholasLee - 12 Jun 2001

Nicholas,

I'd be interested in taking a look at your rough notes. I will also add another approach to the list -- do you want to suggest a name? (Without a suggestion, my brain (without thinking) is starting with something like "Storage Hidden in Module" -- I'd hope to improve on that by tomorrow.)

My hope is that, if TWiki goes this way, storing the data in individual text files, one per topic, is still at least an option, and thus there would still be a "back door" means to index the data with a standalone search engine that can handle text only -- one that can't spider the web or handle HTML.

-- RandyKramer - 13 Jun 2001

Nicholas,

I got your email and your notes and started reading -- will probably require multiple reads over a period of time.

-- RandyKramer - 15 Jun 2001

How the content is stored in the back end should usually be up to the end administrator. Of course the difference between, say, a database and flat text might make it impossible to use 3rd party tools outside the TWiki framework to index files. The Storage modules will of course provide an API to access the data, and I'm attempting to consider indexing requirements as part of the overall redesign.

-- NicholasLee - 15 Jun 2001

"I'm attempting to consider indexing requirements as part of the overall redesign" -- thanks, I apreciate that!

Just to provide a little more information, the biggest thing I miss is a proximity search -- for example, the ability to find the topics that have "file", "TWiki", and "topic" within the same paragraph or sentence (although even the ability to find topics with all three within the same topic would be an improvement). Having the option to continue to use text files with one topic per file gives me many search options, including using grep from the command line. (I think I could do "grep file * | grep TWiki * | grep topic *" or something similar -- I'm a Linux newbie.)

The other thing that is becoming painful as I set up my TWiki on SourceForge with a local backup is creating the directories (I'm planning quite a few subwebs). On SourceForge it's worse because I don't have root access to adjust ownership and permissions -- I must post a support request each time I add new directories.

-- RandyKramer - 16 Jun 2001

One way around that on SourceForge (I think Peter uses this) is to create a Perl cgi-script that creates the directories with the appropriate masks.

With regard to proximity searches, I would assume that an index is just a simple word:location type thing. Additional complexity in searching would be added at the application level. In fact it's probably possible to make it pluggable.

-- NicholasLee - 16 Jun 2001
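(Purely for illustration, a rough sketch of the kind of cgi-script mentioned above -- not the script Peter actually uses; the paths and the parameter name are hypothetical:)

    #!/usr/bin/perl
    # Rough sketch of the kind of cgi-script mentioned above -- not the
    # script Peter actually uses. It creates a new web directory under
    # data and pub with masks that keep the directories group-writable.
    use strict;
    use warnings;
    use CGI qw(param header);

    my $twikiRoot = '/home/groups/twiki';   # hypothetical SourceForge path
    my $web       = param('web') || '';

    print header('text/plain');

    # Only accept simple web names -- no slashes, no dots, no leading periods.
    if ( $web !~ /^[A-Z][A-Za-z0-9]*$/ ) {
        print "Bad web name\n";
        exit;
    }

    umask 0002;                             # keep new directories group-writable
    for my $dir ( "$twikiRoot/data/$web", "$twikiRoot/pub/$web" ) {
        if ( -d $dir ) {
            print "$dir already exists\n";
        }
        elsif ( mkdir $dir, 0775 ) {
            print "Created $dir\n";
        }
        else {
            print "Cannot create $dir: $!\n";
        }
    }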

Nicholas,

Thanks for the response! Yeah, I'm now aware that Peter uses a cgi-script (I was confused at first because I thought it was a bash script and it doesn't work as a bash script.)

I don't know a whole lot about the internals of search engines or indexes, but I would think you are right about an index being a word:location thing.

-- RandyKramer - 17 Jun 2001

Proximity searching should be the default IMO -- that's what Google.com uses, which is by far the easiest search engine to use (although it has clever software that does a lot more than that). Watching some new users trying to use TWiki, I've seen that they expect the search box (which I've embedded in WebHome) to handle multiple words separated by spaces and do something intelligent with them. Moving away from regular expressions as the only way of doing multi-word searches would also help a lot.

I really recommend Jakob Nielsen's article on search usability - we should try to make the search features as simple as possible, in the default skin, with advanced features available via a link. Probably we should have a 'basic' default skin (as in BetterSkins) including simple searching, fewer advanced options such as the list of revisions, etc, and an 'advanced' skin (much like the current one).

I quite liked ZWiki:JumpSearch at one point -- this searches the topic names first by default, jumping to the first alphabetical hit if any topic name matches, then searches the full text after that. However, having used it a bit I think this is quite unusable -- you don't get a full list of topic name hits, so it's easy not to get the right topic.

Another issue is removing the Go field from the top of the page -- some people have just assumed this is a search box, and it would be better if it was (i.e. search on every page, as Nielsen recommends). Some people even have a Site Map on every page, though that is perhaps going too far...

Probably this discussion should really move to one of the topics listed in http://twiki.org/cgi-bin/search/Codev/?scope=topic&search=search, as multi-level webs are only a slight refinement to the basic issue of improving search.

-- RichardDonkin - 17 Jun 2001

Richard,

Feel free to move your comment and duplicate some of the other portions. My only point to Nicholas here was that I hoped he was considering search in the design of his storage module, or allowing me my "back door" access to the text files for indexing.

I agree with the comments you make above.

-- RandyKramer - 17 Jun 2001

IMO the right (controllable) way is to build a data-level API for all storage mechanisms to expose the data in a consistent way to applications that need it (such as search engines, etc). So we would start with an indexing step which would involve returning each search term with an application key or handle to where that term was used. When searching, the search engine picks out all the terms and picks out the associated keys. These keys are then passed back to the storage API, which returns the topics (and locations in those topics) that contain the term.

-- MartinCleaver - 16 Jul 2001

Of course searching has to be a considered part of the design for a storage mechanism. I've considered it and have some notes on my thoughts. Basically, as Martin says, it comes down to defining a clear access API.

My thoughts at the moment for doing this tie in with the same structure considered for the SubTopic stuff, i.e. content (being the text data, or meta information concerning that node, or a binary Word document attachment stream) exists in nodes. Once you've got this, it should be easy enough to make the search system, even extend it to be pluggable, i.e. index a Word document as well.

The thing to be aware of is that a search mechanism, given an expression, returns a list of content keys, i.e. it needn't even access the storage module at the time of a given search, especially if a related index table is updated only when content is saved in the storage system.

Furthermore, the search system doesn't even care about display of the information. It just passes this "list of topic keys" to its caller, which can then call the Render pipeline or whatever, i.e. the webnotify view is different from the search script view.

That's all straightforward OO conceptualization; of course the tricky thing is putting that into a structure and an actual design implementation that works with TWiki.

Of course I think that a generalised storage/search mechanism will require an index mechanism to deal with performance concerns. A nice thing would be to generalise it all while still having the old behaviour exist, i.e. create a search/store/index system where an index 'plugin' could just be a grep on the sub-storage text (RCS) files.

-- NicholasLee - 17 Jul 2001
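(As a purely illustrative aside, a minimal Perl sketch of the word:location index discussed in the last few comments -- the key format and helper names are made up:)

    use strict;
    use warnings;

    # Minimal sketch of a word -> location index. Keys are lower-cased words;
    # values are lists of "topic:linenumber" content keys that the storage API
    # could later resolve back to topics and locations within them.
    my %index;

    sub indexTopic {
        my ( $topicKey, $text ) = @_;
        my $line = 0;
        for my $row ( split /\n/, $text ) {
            $line++;
            for my $word ( $row =~ /(\w+)/g ) {
                push @{ $index{ lc $word } }, "$topicKey:$line";
            }
        }
    }

    sub lookup {
        my ($word) = @_;
        return @{ $index{ lc $word } || [] };   # list of content keys; no rendering here
    }

    indexTopic( 'Linux.Scripts.Backup.TWikiBackup',
        "How to back up a TWiki\nRun the backup script nightly\n" );
    print join( ', ', lookup('backup') ), "\n"; # prints Linux.Scripts.Backup.TWikiBackup:2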

Added "numbers" to the headings to word around the Table of Contents problem with matching headings. Added a few notes and similar to hopefully make it a little easier to read. Still very rough.

-- RandyKramer - 09 Feb 2002
