Question

We have been using TWiki for a few weeks now to assess its suitability for collaboration and document management tasks. So far so good, but one question I have been asked more than once relates to scalability: are there any bottlenecks that are likely to be hit as we use it more? I am not really concerned about scalability with the number of users, since twiki.org already has more users than we will have; it is more about the amount of data entered. We have huge amounts of information, documents and other material that could potentially be added to the TWiki site. Has anyone hit performance or other problems when their sites get large? Search might be one area, for example.

Thanks.

-- MartinWatt - 19 May 2002

Answer

If you're attaching lots of huge files, you might find TWiki a little unwieldy. I use TWiki on my site to store pictures, for which I added some functionality to generate thumbnails automatically, but it's still a lot clunkier than a drag-and-drop mounted drive when it comes time to attach 30 pictures. The browser upload works 95% of the time, but it's far from perfect. Too bad there's nothing like the Yahoo Y-Drive that mounts what appears to be a hard drive on your PC for bulk attachment uploading.

Also, if you attach lots of binary files there's a size growth problem, because the versioned archive file stores all previous versions too. But the versioning does make backup easy; I do a nightly rsync to a secondary drive and I don't need previous backups since they're already wrapped up in TWiki. I've thought about auto-deleting all the .jpg, .gif, etc. archive files but figure that would mess up future attachments.
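
A minimal sketch of that kind of nightly job, run from cron (the paths here are made-up examples, not a real layout):

    #!/usr/bin/perl -w
    # Sketch of a nightly TWiki backup via rsync; adjust paths to taste.
    use strict;

    my $twiki_root = '/home/httpd/twiki';    # data/ and pub/ live under here
    my $backup_dir = '/mnt/backup/twiki';    # secondary drive

    # -a preserves permissions and timestamps, --delete mirrors removals.
    # The ,v RCS files carry all old revisions, so one mirrored copy is enough.
    system('rsync', '-a', '--delete',
           "$twiki_root/data", "$twiki_root/pub", $backup_dir) == 0
        or die "rsync failed: $?\n";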

Since TWiki has no centralized server, it essentially rides on the scalability of the filesystem, Apache and Perl. You can throw CPU speed, RAID and dual-processor machines at the problem, but you can't (as far as I know) cluster your way to increased speed like you might with a database-backed system.

And because TWiki has no always-alive server core, it needs to regenerate things like search results or the index each time you request them (largely through massive Perl greps, which run surprisingly fast IMHO). One suggestion: if you're internet-visible, Google can provide search services for your site. (That reminds me: to help with this, I think TWiki should put META search keywords on each page by unpacking the topic name - e.g. LinuxHelpTips becomes "Linux Help Tips" - so that search engines can digest them!)
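
A minimal sketch of the kind of unpacking I mean (this is just an illustration, not existing TWiki code):

    #!/usr/bin/perl -w
    # Turn a WikiWord topic name into space-separated META keywords.
    use strict;

    sub topicToKeywords {
        my ($topic) = @_;
        # Insert a space wherever a capital follows a lower-case letter or
        # digit: "LinuxHelpTips" becomes "Linux Help Tips".
        (my $keywords = $topic) =~ s/([a-z0-9])([A-Z])/$1 $2/g;
        return $keywords;
    }

    my $topic = shift || 'LinuxHelpTips';
    print qq(<meta name="keywords" content=") . topicToKeywords($topic) . qq(">\n);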

Note: don't judge TWiki speed by my site, because it runs on a tired P166. Tip: Apache is a RAM hog, and I have a cron job to restart it, as a memory leak tends to bloat it to a crawl every month or so. Then again, I have only 64 MB...

-- MattWalsh - 20 May 2002

ModPerl will greatly speed up TWiki, and is used in production at at least one TWiki site (DrKW). I got a factor of 10 speedup, which is fairly typical.

Re uploading large numbers of files - you could investigate setting up a Samba server as a way of putting files directly into TWiki. There would need to be some sort of integration that runs a script when a file is added to a Samba directory, invoking the same CGI script that would normally run. Some web hosts do support Samba, but it's not very common.
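
A rough sketch of the file-moving half of such an integration, run periodically from cron against a Samba-exported drop directory (the directory names are made up, and a real version would also have to update the topic's attachment metadata, e.g. by driving TWiki's normal upload script, rather than just copying files):

    #!/usr/bin/perl -w
    # Poll a Samba-exported drop directory and move new files into a
    # topic's pub/ area. Illustrative only; attachment metadata is not
    # updated here.
    use strict;
    use File::Copy;

    my $drop = '/export/twiki-drop/Main/ProjectPhotos';    # what users see via Samba
    my $pub  = '/home/httpd/twiki/pub/Main/ProjectPhotos'; # TWiki attachment dir

    opendir my $dh, $drop or die "Cannot read $drop: $!\n";
    for my $file (grep { -f "$drop/$_" } readdir $dh) {
        next if -e "$pub/$file";    # already picked up on an earlier run
        move("$drop/$file", "$pub/$file")
            or warn "Could not move $file: $!\n";
    }
    closedir $dh;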

Search is probably the main area where a big site would slow down, but that's an area that would need improvement anyway - TWiki's current search mechanism is not very easy for new users (quite a lot of people at my site try a search, get a huge hit list back, then give up and try browsing instead).

-- RichardDonkin - 20 May 2002

Thanks for the replies. We already use mod_perl. The samba thing sounds interesting. We are comfortable with scripting, and it would be tempting to bypass the browser for batch-type file entry. This would be on our intranet, so we have full control over it and could easily use Samba.

Another question - how about running several different TWiki instances rather than one big one? This thought came up when I discovered that, by coincidence, another group in our company was also evaluating TWiki and had set up their own installation. I wondered first about the possibility (and difficulty) of merging our two databases, then thought maybe it's better to keep them separate, as the two uses are fairly unrelated. We could take this further and set up separate TWikis for each distinct purpose, e.g. documentation, bug tracking, discussions etc. You lose the company-wide search option, but assuming the TWikis are split logically, the users should know where to go. And maybe it would be possible to have a master search script that sends search requests to each TWiki and combines the results? Again I'm thinking mainly of the performance benefits of multiple TWikis.
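
Something along these lines is what I have in mind for the master search (the host names are placeholders, and the exact parameters of TWiki's search URL would need to be checked against the real installations):

    #!/usr/bin/perl -w
    # Sketch of a "master search" that queries several TWiki installations
    # and concatenates whatever they return. Hosts and URL format are
    # assumptions for illustration.
    use strict;
    use LWP::Simple;
    use URI::Escape;

    my @instances = (
        'http://docs.example.com/twiki/bin/search/Main',
        'http://bugs.example.com/twiki/bin/search/Main',
    );

    my $query = uri_escape(shift || 'scalability');

    for my $base (@instances) {
        my $html = get("$base/?search=$query&scope=text");
        print defined $html ? "=== $base ===\n$html\n"
                            : "No response from $base\n";
    }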

-- MartinWatt - 20 May 2002

TWiki scales pretty well at my day job. We used to have several TWiki installations but merged them all into one central one, located at headquarters. We currently have over 1000 registered users, 85 webs, 11K topics, around 10K topic changes per month, and 12 GB worth of attachments. We are not running under mod_perl, but are planning to upgrade from aging Sun hardware to a more recent dual-processor system.

We do not have an issue with search, since we have a policy for creating new webs. Each new web needs to go through an approval process, and web names include the business unit in upper case and a group or functional name in lower case. We encourage a few webs rather than many. That way content is structured, and folks tend to search only in the current web, which is fast.

The main scaling issue we currently have is with renaming a topic; it now takes around one minute to find all references.

You can cluster TWiki; in fact that is the setup at SF: a farm of load-balanced web servers with one big storage server at the back end.

-- PeterThoeny - 21 May 2002

Moved this question from the Support web to the Codev web because it is a TWikiDeployment question.

-- PeterThoeny - 21 May 2002

Interesting discussion. I've got a potential project which may stress TWiki somewhat. It's an archive of five years' worth of posts to a now-offline forum. These currently exist as 125,000 flat files in a directory. I've found this is navigable in tolerable time using ReiserFS, but suspect that dealing with it under TWiki would be slow. The fact that there's a threefold increase in file count in TWiki (file, revision file, lock file) doesn't help matters much. Group operations would likely be glacial. Granted, a goal would be to winnow through the stuff and find the gems -- there's an awful lot of chaff in that wheat.

Still, I'm curious as to how this might compare with other projects. My general thought is that directories are much happier capped at ~10K files. This would require a large number of directory partitions, given the volume of files I'm considering.

-- KarstenSelf - 07 Jun 2002

An interesting size of problem. This does sound like a lot of files in one directory, regardless of any TWiki considerations. One thought would be to split them by year to get to a more plausible number. I think the main TWiki limitation is likely to be searching, but if the files are fairly small this might not be a total killer. The RCS files will only be created if you modify a topic, and similarly the lock files will only be created when a topic is first modified - so I don't think you'll suffer the file explosion you expect (an advantage of the file-based nature of TWiki).

-- JohnTalintyre - 07 Jun 2002

Admittedly, this is arguably gross abuse of the filesystem. What I noted first was that ext2fs simply doesn't scale to these sizes - not because of directory listings, but because of file inserts (and possibly deletes). Having to scan the directory list doesn't perform particularly well at the 10K - 15K entry level. This can be readily tested with a shell script. Other operations seem OK, but partitioning lock files and RCS files into their own subdirectories seems advisable.
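
For example, something like this (in Perl rather than shell, but the idea is the same; the batch size and path are arbitrary) shows how file creation slows down as the directory fills:

    #!/usr/bin/perl -w
    # Crude test of how file creation slows down as a directory fills up:
    # create files in batches and report the wall-clock time per batch.
    use strict;
    use File::Path qw(mkpath);

    my $dir   = '/tmp/dirtest';
    my $batch = 5000;

    mkpath($dir);
    for my $round (1 .. 5) {
        my $start = time;
        for my $i (1 .. $batch) {
            my $n = ($round - 1) * $batch + $i;
            open my $fh, '>', "$dir/file$n" or die "create failed: $!\n";
            close $fh;
        }
        printf "files %6d-%6d: %d seconds\n",
            ($round - 1) * $batch + 1, $round * $batch, time - $start;
    }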

-- KarstenSelf - 10 Jun 2002

I'm sure this is easily solved.

It's just a question of where the topics are stored. It is up to TWiki::Store::File (or whatever it is called) to determine this; presently they are stored in a single directory named after the web. That's not the only possible mapping.

I'd suggest that if the underlying filesystem began to strain, we alter the algorithm to partition the storage of each topic's files across multiple directories.

E.g. initially: TopicDataFiles (topic + lock + RCS) all in one directory. As the number increases, partition into two directories (buckets), one each for TopicDataFiles A-L and M-Z.

If these two fill up, partition storage again: A-D, E-L, M-S, T-Z. Once you get beyond twenty-six buckets, choose the directory name based on the first two letters.

All this can be transparent to the rest of TWiki.

You'd have to make allowance in the directory naming conventions for Nested Webs, but that's fairly easy.

-- MartinCleaver - 07 Jun 2002

This change would actually be fairly complicated and doesn't address the most likely issue which is search performance.

-- JohnTalintyre - 07 Jun 2002

If the number of files in a directory is an issue, a first step could be to put all the ,v files into an RCS directory. This is transparent for RCS, and we would just have to make sure that the initial ,v file that is created is put into the RCS directory, rather than in the same directory as the topic. This only reduces the number of files in a directory by half, but it is a simple start.

-- ThomasWeigert - 07 Jun 2002

Actually, RCS directories only reduce the file count by a third. À la Car Talk, the first half of the show is the .txt files, the second half is the ,v RCS files, and the third half is the lock files.

-- KarstenSelf - 10 Jun 2002

Correction: obsolete lock files get removed periodically by the notify cron job, which means there are never more than a few of them, for topics that have been edited recently. So the RCS directory does reduce the file count by half.

There is already a switch in TWiki.cfg to put all repositories in an RCS directory. Note that you can't simply turn on the switch; you need to move the existing repositories at the same time.
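
The move itself is straightforward; a sketch of the kind of one-off script involved (the data path is an assumption, and you should back up before running anything like this):

    #!/usr/bin/perl -w
    # One-off sketch: move existing ,v repository files into an RCS
    # subdirectory of each web, to match the TWiki.cfg switch mentioned
    # above. Back up your data directory first.
    use strict;
    use File::Copy;

    my $data = '/home/httpd/twiki/data';

    opendir my $dh, $data or die "Cannot read $data: $!\n";
    for my $web (grep { -d "$data/$_" && !/^\./ } readdir $dh) {
        my $rcsdir = "$data/$web/RCS";
        mkdir $rcsdir unless -d $rcsdir;

        opendir my $wdh, "$data/$web" or next;
        for my $v (grep { /,v$/ } readdir $wdh) {
            move("$data/$web/$v", "$rcsdir/$v")
                or warn "Could not move $v: $!\n";
        }
        closedir $wdh;
    }
    closedir $dh;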

-- PeterThoeny - 10 Jun 2002

Main.JohnTalintyre: This change would actually be fairly complicated and doesn't address the most likely issue which is search performance.

I'm not sure that I understand your point. Are you talking about the file seek time, i.e. the amount of time needed to retrieve a file from a large directory given the TWiki topic name, or are you talking about the time required to search through all the files, as required by the Search functionality?

If, as I suspect, it is the former, then I question both assertions. If it is the latter, then fair enough - I don't know enough to comment.

In terms of performance, the stated problem is that the filesystem takes significantly longer to seek a file when the directory holds many files than when it holds few. The solution I suggest would change the algorithm so that it looks in one of several small directories instead. This takes the burden off the filesystem to behave efficiently with large directories, as there are now few files in each of many directories.

And what would be complicated? As I see it, the only change would be to the topic-name-to-filename lookup, a function that is currently just a simple append of web to topic name.

For a topic called TopicName:

     filenameForTopicName(Web,TopicName) returns Web/TopicName.txt
becomes:
     filenameForTopicName(Web,TopicName) returns Web/To/TopicName.txt
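
To make that concrete, here is a sketch of the kind of mapping function I mean (the two-letter bucket scheme and the data path are illustrative only, not how the current store works):

    #!/usr/bin/perl -w
    # Sketch of a bucketed topic-name-to-filename mapping.
    use strict;

    my $dataDir = '/home/httpd/twiki/data';    # assumed location

    sub filenameForTopicName {
        my ($web, $topic) = @_;
        my $bucket = substr($topic, 0, 2);     # e.g. "TopicName" -> "To"
        return "$dataDir/$web/$bucket/$topic.txt";
    }

    print filenameForTopicName('Codev', 'TopicName'), "\n";
    # prints /home/httpd/twiki/data/Codev/To/TopicName.txt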

Am I over-simplifying the problem?

-- MartinCleaver - 09 Jun 2002

It is searching the file content I was referring to. Whilst having this many files in a directory is not optimal, it will probably work okay. I think the assumption about how topic names map to files appears in many places, and simply changing a mapping function will not deal with this; that is why I think such a change would be fairly complicated.

-- JohnTalintyre - 09 Jun 2002

I suspect searching large webs is ultimately better handled by creating a periodically updated index. Ideally, this would scan only updated files (and on a reasonably mature TWiki, that is going to be a small set). It should be possible to generate updates several times an hour (say, every five to fifteen minutes). Or indexing could be made part of the edit commit process (I think I like this better). Searches would then run against the index rather than the source files.
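
A minimal sketch of the incremental update idea, assuming a per-web stamp file and a simple word-to-topic map (the paths and index format are made up for illustration; a real version would merge into a persistent index such as a DB_File):

    #!/usr/bin/perl -w
    # Rescan only topics changed since the last run and collect a
    # word -> topic map for them. Illustrative only.
    use strict;

    my $web   = '/home/httpd/twiki/data/Main';
    my $stamp = "$web/.index-stamp";
    my $last  = (stat $stamp)[9] || 0;    # time of the previous run
    my %index;                            # word => { topic => 1 }

    for my $file (glob "$web/*.txt") {
        next unless (stat $file)[9] > $last;    # changed since last run?
        (my $topic = $file) =~ s{.*/|\.txt$}{}g;
        open my $fh, '<', $file or next;
        while (<$fh>) {
            $index{lc $1}{$topic} = 1 while /(\w{3,})/g;
        }
        close $fh;
    }

    for my $word (sort keys %index) {
        print "$word: ", join(' ', sort keys %{ $index{$word} }), "\n";
    }

    open my $st, '>', $stamp and close $st;    # remember this run's time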

-- KarstenSelf - 10 Jun 2002

I suspect searching large webs is ultimately better handled by creating a periodically updated index.

I agree with this - indeed it is what we implemented for IIS in my last job (we fed this back as the IndexServerSearchForMsIisAddOn).

I had originally attempted to use an OpenSource package for this, as that would then have worked on all platforms, but we were unsuccessful in getting it to work on Windows (see SearchAttachments for details).

-- MartinCleaver - 10 Jun 2002

Referring to the text above about reducing the number of files by using an RCS directory: if this is existing content, most of which will not be edited, then this is not an issue, as the RCS file will not get created. One of the nice things about TWiki is that it is happy to work with the history (RCS) file absent, i.e. you can just drop in .txt files and things work.

-- JohnTalintyre - 10 Jun 2002
