
Feature Proposal: Using multiple disks for DataDir and PubDir


A TWiki site may reach a point where a single disk drive cannot house all of its files. Putting PubDir on a different disk from everything else doesn't help much, because on a large site PubDir takes up by far the most capacity. In that case, there is no option but to use multiple disks for PubDir.

It's possible to have a single multi-terabyte file system. But that doesn't mean a file system of that scale is practical in your environment. You may need to use storage provided by a team you have no control over, and the size limit per file system might be 1 TB or 500 GB.

For reasons described later, it's not possible to utilize additional disks simply by symbolic links without enhancing TWiki code.

Description and Documentation

To utilize multiple disks, in addition to $TWiki::cfg{DataDir} and $TWiki::cfg{PubDir}, you need to have directories specified by $TWiki::cfg{DataDir1}, $TWiki::cfg{PubDir1}, $TWiki::cfg{DataDir2}, $TWiki::cfg{PubDir2}, ...

There needs to be a way to specify which web uses which data directory and which pub directory.


As you will see below, utilizing multiple disks is not trivial. You need to cope with various implications, and some of them cannot be totally hidden from users. Because of that, even with this enhancement, it's better to look for ways to have a single large disk space. NoSQL databases may provide good alternatives.



-- Contributors: HideyoImazu - 2012-07-11

How to turn it on

If $TWiki::cfg{MultipleDisks} and $twiki->{mdrepo} are true, this feature is turned on.

The notion of disks

For ease of management and understanding, DataDirX and PubDirX should reside on the same disk. Based on that, let's call $TWiki::cfg{DataDir} and $TWiki::cfg{PubDir} combined the "default disk". Likewise, let's call DataDir1 and PubDir1 combined disk 1, and DataDir2 and PubDir2 combined disk 2. This leads to the notion of disk IDs: the default disk's ID is '' (the empty string), disk 1's ID is 1, and so on.
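
The disk ID convention can be sketched as a lookup that appends the ID to the configuration key names. This is only an illustration written in Python, not TWiki code: the CFG dictionary is a hypothetical stand-in for $TWiki::cfg, the paths are placeholders, and disk_dirs is an invented helper name.

```python
# Hypothetical stand-in for the TWiki configuration hash ($TWiki::cfg).
# The paths are placeholders chosen for illustration.
CFG = {
    "DataDir": "/disk0/data", "PubDir": "/disk0/pub",
    "DataDir1": "/disk1/data", "PubDir1": "/disk1/pub",
    "DataDir2": "/disk2/data", "PubDir2": "/disk2/pub",
}


def disk_dirs(disk_id, cfg=CFG):
    """Return (data_dir, pub_dir) for a disk ID.

    The empty string '' denotes the default disk, so the key names
    become plain "DataDir" and "PubDir" in that case.
    """
    data_key = "DataDir" + disk_id
    pub_key = "PubDir" + disk_id
    if data_key not in cfg or pub_key not in cfg:
        raise KeyError("no such disk: %r" % disk_id)
    return cfg[data_key], cfg[pub_key]
```

Treating the disk ID as a string suffix means the default disk needs no special case beyond the empty string.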

Metadata repository required

This feature requires the metadata repository to be in use, because a mechanism is needed to specify which web resides on which disk.

Since this feature is for a large site having thousands of webs, it's not practical to use a topic for the disk specification purposes.

Each web has the disk field specifying which disk the web resides in.


Each DataDirN/PubDirN pair needs to have its own trash web TrashxNx, e.g. Trashx1x, Trashx2x, ... For topic, attachment, and web deletion, the proper TrashxNx needs to be selected automatically. %TRASHWEB% is therefore not a constant; it expands to the proper trash web name depending on the current web.

You may think that Trash1, Trash2, ... would be more straightforward and desirable than Trashx1x, Trashx2x, ... Those names are avoided because the Trash web may be aged: every week or every day, Trash10 is deleted, ..., Trash1 is renamed to Trash2, Trash is renamed to Trash1, and a new Trash web is created based on _Trash. The names of the aged Trash webs would clash with the names of the trash webs for the extra disks.
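
The naming rule and the per-web expansion of %TRASHWEB% might look roughly like the following Python sketch. It is not TWiki code: trash_web_for and expand_trashweb are invented names, and web_disk is a hypothetical mapping from web name to disk ID that stands in for the metadata repository.

```python
def trash_web_for(disk_id):
    """Name of the trash web for a disk; '' is the default disk."""
    if disk_id == "":
        return "Trash"
    # The x...x wrapping keeps the name distinct from aged trash webs
    # such as Trash1 or Trash2.
    return "Trashx%sx" % disk_id


def expand_trashweb(current_web, web_disk):
    """What %TRASHWEB% would expand to when viewed from current_web.

    web_disk is a hypothetical dict of web name -> disk ID; webs that
    are missing from it are assumed to live on the default disk.
    """
    return trash_web_for(web_disk.get(current_web, ""))
```

For example, expand_trashweb("Eng", {"Eng": "2"}) gives "Trashx2x", while a web on the default disk gets plain "Trash".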


With %WEBLIST{...}%, the "canmoveto" filter (introduced by ReadOnlyAndMirrorWebs) of the webs parameter needs to eliminate more webs than usual: in addition to its existing checks, it needs to eliminate webs residing on a different disk from the current web.
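
The extra elimination step can be sketched as a same-disk filter. The Python below is an illustration only; can_move_to and the web_disk mapping are hypothetical, and the existing "canmoveto" checks are assumed to have been applied to the candidate list already.

```python
def can_move_to(current_web, candidate_webs, web_disk):
    """Keep only the candidate webs on the same disk as current_web.

    web_disk is a hypothetical dict of web name -> disk ID, with ''
    meaning the default disk (also assumed for webs not listed).
    """
    current_disk = web_disk.get(current_web, "")
    return [web for web in candidate_webs
            if web_disk.get(web, "") == current_disk]
```

With Eng and Trashx2x on disk 2 and Main on the default disk, a topic in Eng could be moved to Trashx2x but not to Main.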

Attachment URLs

Using multiple disks and exposing them in URLs are two different matters. Topic URLs are not affected by the disk on which topics reside. But attachment URLs may be, if attachment retrieval is handled directly by the web server.

After careful consideration, attachment URLs must not be affected by the disk on which the attachments are housed. If an attachment's URL path depends on the disk (e.g. /pub1/FooWeb/BarTopic/file.png), the attachment URL changes when the web moves to a different disk. If %ATTACHURL% or %PUBURL% is used to refer to the attachment, the reference keeps working, but there is no guarantee that users always use those variables; you cannot police users with 100% accuracy.

To achieve that, you need to do either of the following:

  • Putting symbolic links under the directory $TWiki::cfg{PubDir}
  • Rewriting /pub/WEB/TOPIC/FILE to /cgi-bin/viewfile/WEB/TOPIC/FILE
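
The second option is a plain prefix substitution on the URL path. The Python sketch below only illustrates the mapping; on a real site it would presumably be done by the web server's URL rewriting rather than by application code, and pub_to_viewfile is an invented name.

```python
def pub_to_viewfile(url_path,
                    pub_prefix="/pub/",
                    viewfile_prefix="/cgi-bin/viewfile/"):
    """Map /pub/WEB/TOPIC/FILE to /cgi-bin/viewfile/WEB/TOPIC/FILE.

    Paths that do not start with the pub prefix are returned unchanged.
    """
    if not url_path.startswith(pub_prefix):
        return url_path
    return viewfile_prefix + url_path[len(pub_prefix):]
```

Because the rewritten URL carries only the web, topic, and file names, the disk on which the web resides never appears in it.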


For further explanation, let's use a concrete example.


$TWiki::cfg{DataDir}  = '/disk0/data';
$TWiki::cfg{PubDir}   = '/disk0/pub';

$TWiki::cfg{DataDir1} = '/disk1/data';
$TWiki::cfg{PubDir1}  = '/disk1/pub';

$TWiki::cfg{DataDir2} = '/disk2/data';
$TWiki::cfg{PubDir2}  = '/disk2/pub';

Metadata repository's webs file:

Name      Admin            Disk
Eng       EngAdminGroup    2
Main      TWikiAdminGroup
Sales     SalesAdminGroup  1
TWiki     TWikiAdminGroup
Trash     TWikiAdminGroup
Trashx1x  TWikiAdminGroup  1
Trashx2x  TWikiAdminGroup  2
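
With this example data, a disk-independent attachment reference (web, topic, file) can be resolved to a physical path, and the symbolic links needed for the first option can be listed. The Python below is a sketch under assumptions: CFG and WEB_DISK simply restate the example above, the function names are invented, and linking every web that is not on the default disk (including the extra trash webs) is an assumption about how the symbolic link option would be set up.

```python
import posixpath

# The example configuration and webs file from above, restated as dicts.
CFG = {
    "DataDir": "/disk0/data", "PubDir": "/disk0/pub",
    "DataDir1": "/disk1/data", "PubDir1": "/disk1/pub",
    "DataDir2": "/disk2/data", "PubDir2": "/disk2/pub",
}
WEB_DISK = {
    "Eng": "2", "Main": "", "Sales": "1", "TWiki": "",
    "Trash": "", "Trashx1x": "1", "Trashx2x": "2",
}


def attachment_path(web, topic, filename):
    """Physical path of an attachment, chosen by the web's disk ID."""
    pub_dir = CFG["PubDir" + WEB_DISK[web]]
    return posixpath.join(pub_dir, web, topic, filename)


def symlinks_needed():
    """(link, target) pairs for webs that are not on the default disk."""
    return [
        (posixpath.join(CFG["PubDir"], web),
         posixpath.join(CFG["PubDir" + disk_id], web))
        for web, disk_id in sorted(WEB_DISK.items())
        if disk_id != ""
    ]
```

Here attachment_path("Eng", "BarTopic", "file.png") resolves to /disk2/pub/Eng/BarTopic/file.png, while the URL stays /pub/Eng/BarTopic/file.png.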

Federated sites

So far, the discussion has been about a stand-alone (i.e. not federated) large site. This section is about a federation of sites, which is described at RepositoryForSiteAndWebMetadata#Federation_of_sites.

Sites in a federation need to share site data for content mirroring. The 'sites' file of the metadata repository is for that purpose. Here's an example of the 'sites' file of a three-site federation, focusing on storage information. The sites file has other fields, but they are omitted.

Name  DataDir        PubDir        DataDir1        PubDir1        DataDir2        PubDir2
am    /disk0/data    /disk0/pub    /disk1/data     /disk1/pub     /disk2/data     /disk2/pub
eu    /d/twiki/data  /d/twiki/pub  /d1/twiki/data  /d1/twiki/pub  /d2/twiki/data  /d2/twiki/pub
as    /twiki/data    /twiki/pub    /twiki1/data    /twiki1/pub    /twiki2/data    /twiki2/pub

Given that, $TWiki::cfg{DataDir}, $TWiki::cfg{PubDir}, $TWiki::cfg{DataDir1}, $TWiki::cfg{PubDir1}, ... are redundant. As such, if $TWiki::cfg{SiteName} is defined, the sites file record corresponding to the current site is consulted to obtain DataDir1, PubDir1, ... For practical reasons, $TWiki::cfg{DataDir} and $TWiki::cfg{PubDir} still need to be set even though they are redundant.

If the DataDir1 field has no value in the sites file record, the feature falls back to $TWiki::cfg{DataDir1} and $TWiki::cfg{PubDir1}.
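
The lookup order can be sketched as follows. This is a Python illustration, not TWiki code: the sites records are given as dictionaries rather than parsed from the real sites file (whose on-disk format is not shown here), the "xx" site with empty disk 1 fields is made up to show the fallback, and extra_disk_dir is an invented name.

```python
def extra_disk_dir(key, site_name, sites, cfg):
    """Resolve a key such as "DataDir1" or "PubDir1".

    If a site name is configured, the matching sites record is checked
    first; an absent or empty field falls back to the local
    configuration (the stand-in for $TWiki::cfg).
    """
    if site_name:
        value = sites.get(site_name, {}).get(key)
        if value:
            return value
    return cfg.get(key)


# Hypothetical data: "eu" follows the example above; "xx" is a made-up
# site whose disk 1 fields are empty.
SITES = {
    "eu": {"DataDir1": "/d1/twiki/data", "PubDir1": "/d1/twiki/pub"},
    "xx": {"DataDir1": "", "PubDir1": ""},
}
LOCAL_CFG = {"DataDir1": "/disk1/data", "PubDir1": "/disk1/pub"}
```

With this data, extra_disk_dir("DataDir1", "eu", SITES, LOCAL_CFG) comes from the sites record, whereas the same call for "xx" falls back to the local configuration value.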

Why enhancement is required

Simply adding disks and putting symbolic links under PubDir for some webs to off-load the primary disk doesn't work. This is because when a topic is deleted, topic.txt and topic.txt,v are moved from DataDir/WEB to DataDir/Trash, and the PubDir/WEB/topic directory is moved to PubDir/Trash. If PubDir/WEB is a symbolic link to a different disk, then moving PubDir/WEB/topic to PubDir/Trash fails, since a directory cannot simply be renamed across file systems.
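
Whether such a move would cross a file system boundary can be checked by comparing device numbers. The Python sketch below is an illustration only, with an invented function name; it relies on os.stat following symbolic links, so a PubDir/WEB link is judged by the disk its target lives on.

```python
import os


def on_same_filesystem(path_a, path_b):
    """True if both paths live on the same file system.

    os.stat follows symbolic links, so a symlinked web directory is
    reported with the device of the directory it points to.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

When this returns False for the source and the trash directory, a plain rename (os.rename in Python) is expected to fail with a cross-device error, and the data has to be copied and then removed instead.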

Considerations on symbolic link based enhancement

It should be possible to implement multiple disk use based on symbolic links, in which $TWiki::cfg{PubDir} (and probably $TWiki::cfg{DataDir} as well) has symbolic links to off-load the file system in which $TWiki::cfg{PubDir} resides.

Here are the things you need to overcome with that approach.

Having thousands of symbolic links under a directory may not be practical

This is not a total show stopper, but having thousands of symbolic links under a directory is not practical under some circumstances. To utilize additional disks via symbolic links, you may end up with thousands of symbolic links under $TWiki::cfg{PubDir}. Going through a directory bloated by symbolic links may be extremely slow, even though visiting a directory with the same number of subdirectories may cause no problem. This is because a symbolic link takes much more space than a subdirectory entry in a directory.

This is not a theoretical concern. HideyoImazu experienced such a situation.

Copying a large directory takes time

When a topic is moved between disks (including when a topic is deleted), the directory for the topic's attachments needs to be copied and then the original directory needs to be removed. When a web or subweb is moved between disks (including when a web or subweb is deleted), the directory for the topics and the directory for the attachments of the topics need to be copied and then removed.

It has the following issues.

  • This may take a very long time and the browser may time out.
    • Maybe moving a web or subweb between disks should be forbidden, which requires enhancements to %WEBLIST{...}% and to the file and directory move logic. But then, deleting a subweb is forbidden if it resides on a different disk from Trash. Is that acceptable? On a large site having thousands of webs, things should be as self-service as possible.
  • The chance of the copy and remove operation being interrupted by some error is not negligible.
    • The copy and remove operation needs to be made atomic (e.g. copy to a temporary directory name and then rename it). And there needs to be a cleaner for files and directories left behind by an interruption.
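
The copy, rename, and clean-up idea from the list above could be sketched like this in Python. It is a minimal sketch under assumptions, not the actual implementation: the ".mdtmp" suffix and both function names are invented, the destination is assumed not to exist yet, and an interruption between the rename and the final removal would still leave the original directory behind.

```python
import os
import shutil

TMP_SUFFIX = ".mdtmp"  # hypothetical marker for half-copied directories


def move_dir_across_disks(src, dst):
    """Copy src to a temporary name next to dst, rename, then remove src.

    The rename happens within the destination file system, so dst
    appears either completely or not at all.
    """
    tmp = dst + TMP_SUFFIX
    if os.path.exists(tmp):
        shutil.rmtree(tmp)      # stale copy from an earlier failed attempt
    shutil.copytree(src, tmp)   # the slow part; may be interrupted
    os.rename(tmp, dst)         # quick switch to the final name
    shutil.rmtree(src)          # remove the original only after success


def clean_leftovers(parent):
    """Remove temporary directories left behind by interrupted moves."""
    removed = []
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if name.endswith(TMP_SUFFIX) and os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

The cleaner would have to run periodically or at start-up, and it only makes sense if no move is in progress at that moment.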


This looks doable, but I am concerned about complexity. Couldn't you avoid all the added complexity by selecting a file system that handles large data, such as ZFS or XFS? How much disk space do you require for all TWiki content? See related discussion at http://superuser.com/questions/371469/linux-file-system-for-a-big-file-server

Related, long term it would be good to add a NoSQL backend, such as MongoDB or CouchDB.

-- PeterThoeny - 2012-07-11

It is doable - I've been using the extension applied to TWiki 4.1.2 in production for more than 6 months.

In many cases, it should be avoidable. But there are cases where this is the best possible solution. You may have no option but to use network attached storage provided by your infrastructure. It may have a 1 TB or 500 GB size limit per file system, over which you may have no control.

If you have highly available network attached storage infrastructure withstanding a single data center loss, it's challenging to implement a NoSQL backend of the same level of availability at comparable ease and cost as the network attached storage.

I admit this is only for a very large site and this is not for every large site. You may not have highly available storage to begin with. In that case, building a NoSQL backend and using it makes sense.

-- HideyoImazu - 2012-07-13

To summarize my position as discussed in JerusalemReleaseMeeting2012x07x20, I do not object to this feature if it is documented well (i.e. it is OK to implement), but I have a hunch that it is better to avoid the complexity by using a file system that supports large data, or symbolic links with cp & rm where needed. This is more intuitive for users.

-- PeterThoeny - 2012-07-20

I revised the proposal to explain the design more thoroughly. I wrote up my concerns about the "simpler approach" utilizing symbolic links, based on my experience. I think this is a necessary evil for some large sites operated under a sub-terabyte file system size limit.

-- HideyoImazu - 2012-07-22

As mentioned in the Attachment URLs section, I now believe the disk housing a web must not affect attachment URLs.

-- HideyoImazu - 2012-09-20

Topic attachments
Attachment          History  Size    Date                Who
multiple-disks.png  r1       16.0 K  2012-07-11 - 11:01  HideyoImazu
Topic revision: r18 - 2012-09-24 - HideyoImazu
Copyright © 1999-2016 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.