Search Engine Swish-E Add-On
A common request from my users was better search that included attachments.
We picked Swish-E because it can index various Microsoft formats: .doc, .ppt, .xls.
The instructions here are for setting it up on Linux; I don't think this will work on Windows.
- Fast search of TWiki topic text and attachments
- Microsoft Word (.doc), PowerPoint (.ppt), and Excel (.xls) attachments are indexed
- Fast: for our site, it is about 20x faster than TWiki's built-in search, with much less server load.
- "Excellent, does the job"
- "We needed this months ago"
- Swish-E does not support incremental index updates, so rebuilding the index is very time consuming, largely because TWiki renders pages slowly
- Requires some unix skill to configure, comparable to setting up TWiki itself.
The current implementation spiders TWiki over HTTP. This made it easy to index the current version of each topic correctly, but at a significant cost in indexing speed.
- Get 'last modified date' to work. Approach (A) is to index the files directly; approach (B) is to integrate some kind of last-modified date into the generated HTTP responses sent to the spider (and probably not to general users)
- Set up swish to index the files directly (both data/**/*.txt and pub). This would allow ranking by "last modified date", and make it faster. A more complicated filter would be required to remove the twiki 'meta' fields from the topics.
- Modify spidering/indexing to include author and web as 'meta' fields for swish. Goal is to enable subset searching (only in this web, or only by this author).
It appears the XLS parsing module does not properly handle non-ASCII content in Excel files.
I've not investigated how the Russian and Japanese content is being handled (I know it is present in the files); the warning printed in the log refers to the bytes retrieved from the data file.
Add-On Installation Instructions
You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.
- Swish itself, from http://www.swish-e.org
- catdoc parsing program, from http://www.45.free.net/~vitus/ice/catdoc/. This program converts MS Word (.doc) and PowerPoint (.ppt) files to a text format that can be indexed.
- Install the CPAN modules for XLS parsing: Spreadsheet::ParseExcel_XlHTML. (I used the Perl module manager via my installation of Webmin. I highly recommend Webmin; it really simplifies Linux administration.)
- PDF conversion tool. I used the recommended one: http://www.foolabs.com/xpdf/download.html. Testing the conversion tools separately is recommended.
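Since indexing depends on these converters being callable by Swish-E, it helps to verify they are on the PATH first. A minimal sketch; the tool names (catdoc, catppt, xls2csv, pdftotext) are the usual ones shipped with the catdoc and xpdf packages and are assumptions here:

```shell
# Check each converter the Swish-E filters will call; count any that are missing.
missing=0
for tool in catdoc catppt xls2csv pdftotext; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
    missing=$((missing + 1))
  fi
done
echo "missing tools: $missing"
```

If anything reports missing, install it before building the index; otherwise the corresponding attachments will be skipped silently or logged as errors.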
I used the directory home/data/swish (see the file list below), which is hardcoded in several of the scripts as required.
See File Details below for the complete list of files included.
- Download the ZIP file from the Add-on Home (see below)
- Unzip into a temporary directory.
- Move the swish/scripts files into a directory that is not part of TWiki.
- Edit those scripts to have the proper hardcoded paths for where they have been installed
- Move the searchtools files into a subdirectory of your twiki/bin directory
- Edit the top of the CGI script to give the path for the swish config
- Edit the swish config to indicate where the index is stored
- If you want to use my customizations of the template, install the other files appropriately
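The move steps above can be sketched roughly as follows. Every path here (the temporary unzip layout, home/data/swish/scripts, and a twiki/bin/search subdirectory) is an assumption to be adjusted for your site, and dummy files stand in for the real ZIP contents:

```shell
# Simulate the install layout with placeholder files (adjust paths for your site).
TMP=$(mktemp -d)
mkdir -p "$TMP/unzipped/swish" "$TMP/unzipped/searchtools"
touch "$TMP/unzipped/swish/spider-only.sh" "$TMP/unzipped/searchtools/swish.cgi"

# Indexing scripts go into a directory that is NOT part of the TWiki tree:
mkdir -p "$TMP/home/data/swish/scripts"
mv "$TMP/unzipped/swish/"* "$TMP/home/data/swish/scripts/"

# The search CGI goes into a subdirectory of twiki/bin:
mkdir -p "$TMP/twiki/bin/search"
mv "$TMP/unzipped/searchtools/"* "$TMP/twiki/bin/search/"

ls "$TMP/home/data/swish/scripts"
```

After the real move, remember the hardcoded-path edits described in the steps above.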
Configure Swish to index TWiki including attachments
This uses the configuration scripts (in the ZIP file), along with these steps:
- Create a new user specifically for use by the spidering program.
- Edit the %MAINWEB%/SpideringEngine page to specify plain as the default skin for this user:
* Set default skin to remove 'print' button etc
* Set SKIN = plain
- Edit the twiki-spider-config.pl settings for your site.
- Make sure the credentials given in twiki-spider-config.pl match the created user.
- Check the limit on the number of URLs. I use 15000, which must account for duplicate URLs; TWiki generates duplicate URLs frequently. The spider will print "Max indexed files Reached" if the limit is too small for your site.
- The default filtering of topics should be fine for most TWikis. The default in twiki-spider-config.pl behaves like this:
- It excludes some specific topics (WebChanges, etc.) that take a long time to render and add little value to the index
- It filters out links with query parameters, to avoid indexing the same topic again via table-sorting links, attachment-management links, etc.
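The query-parameter filtering described above can be illustrated with plain shell tools. The sample URLs are made up, and the real filtering happens in the Perl spider config, not in grep; this only shows which kinds of URLs survive:

```shell
# Keep only URLs with no query string and none of the excluded topic names.
cat <<'EOF' | grep -v '?' | grep -v -E '(WebChanges|WebIndex)'
http://example.com/twiki/bin/view/Main/WebHome
http://example.com/twiki/bin/view/Main/WebChanges
http://example.com/twiki/bin/view/Main/WebHome?sortcol=1
http://example.com/twiki/bin/view/Main/ProjectPlan
EOF
```

Only the WebHome and ProjectPlan view URLs pass; the sort-link duplicate and the excluded topic are dropped.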
Test that spidering is working
Check these things
- Edit twiki-spider-config.pl to enable debug printing at 'info' level (see comments in the file)
- Run the spider tool to create a data file of the content to be indexed. The script spider-only.sh is provided for this.
- The printout while spidering is running should include your URLs and topic names
- Glance through the resulting data file to verify the contents are reasonable. It may be necessary to use tail to view the latter part of the file if it is more than 8 MB. Look for Path-Name at the beginning of a line to locate the start of each indexed URL.
- You don't really need to let it spider the entire TWiki; you can Ctrl-C after a few hundred topics are indexed. The main thing is to make sure the spider is finding your TWiki content properly.
- It is a good idea to verify the attachments get their content converted properly so indexing will include them. This usually requires indexing the entire twiki, or altering the starting URL.
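As a quick way to eyeball the data file, you can search for the Path-Name markers mentioned above. The sample records below are an assumed sketch of the layout, not the exact format Swish-E emits:

```shell
# Write a tiny stand-in for the spider's output file (layout is assumed).
cat > /tmp/spider-sample.txt <<'EOF'
Path-Name: http://example.com/twiki/bin/view/Main/WebHome
Content-Length: 120
...
Path-Name: http://example.com/twiki/bin/view/Main/UserList
Content-Length: 95
EOF

# Count the indexed URLs by their Path-Name markers:
grep -c '^Path-Name' /tmp/spider-sample.txt
```

Run the same grep against your real data file to confirm the expected topics and attachments are present.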
If the spider is not finding your TWiki URLs, or is failing with a username/password error, fix the settings in twiki-spider-config.pl.
Remove the debug line in twiki-spider-config.pl when finished with this step.
Manually build the indexes for testing
Run the script in the attached ZIP file that combines the spidering and indexing.
Be sure to look through the log to ensure it has properly indexed the attachments: it will include error messages if the various filter programs are not properly enabled.
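A simple way to catch filter problems is to scan the log for error lines. The log text below is invented for illustration; match whatever error strings your Swish-E version actually prints:

```shell
# Write a stand-in log file (contents are made up for this example).
cat > /tmp/swish-index.log <<'EOF'
Indexing Data Source: "External-Program"
err: Could not run filter program for attachment
Indexing done: 4000 files indexed.
EOF

# Flag any error lines so filter misconfiguration is not missed:
if grep -qi '^err' /tmp/swish-index.log; then
  echo "filter problems found - check converter setup"
fi
```

Adding a check like this to the index-rebuild script lets cron mail you when a converter breaks.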
The first time, you can reuse the data file created by the spidering if you want, see the Swish documentation for how to do this.
Engage search within the TWiki
This is a simple setup of the swish.cgi search script inside of the TWiki installation.
Be sure to read the swish documentation on customizing the templates, since some customization is always required.
I included our script that has multiple indexes to allow search of our bug database (updated every 3 hours).
Test the search script and index
The URL will depend on where you installed the search CGI script under twiki/bin.
Configure your twiki to have an appropriate "Full Text Search" button
We put a button on the left column that goes to the swish search page
Configure cron to run the indexing as desired
The indexing script should be run periodically to rebuild the index.
Even with the use of SpeedyCGI, TWiki rendering is not fast. As a result, I rebuild the index only once per day, at 4 am. It takes about 1.5 hours to index our TWiki, which has ~4000 topics and attachments.
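A crontab entry for the 4 am rebuild might look like the following; the script name build-index.sh and the log path are assumptions (use whatever wrapper script you installed):

```shell
# Edit with `crontab -e` as the user that runs the indexing.
# m h dom mon dow  command
0 4 * * * /home/data/swish/scripts/build-index.sh >> /home/data/swish/index-rebuild.log 2>&1
```

Redirecting both stdout and stderr into a log file preserves the filter-program error messages mentioned above for later review.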
To create this topic in your local TWiki, view the raw text for TWiki:Plugins/SearchEngineSwishEAddOn and copy the contents to your local TWiki.
Contents of the ZIP file
All of these files will require some tweaks, since they have hardcoded paths in them.
Files in home/data/swish/scripts are used to build the index:
- Swish config file
- Spider.pl config file to decide what to index
- included config file
- run spidering only, for testing
- script to run via cron
Files in the twiki/bin directory (a subdirectory was added for the search script):
- search script, basically the same as distributed
- config for search, to allow searching multiple indexes
- modified template; includes a 'help' link and a link back to TWiki. Search for "CUSTOMIZED" to see what should be reviewed or modified
- "help" topic for the template, to go in the TWiki 'Twiki' data directory
Related Topic: TWikiAddOns
- Set SHORTDESCRIPTION = Search within attached documents, such as doc, html, txt, pdf, ppt, xls
- 16 Aug 2005
Slightly revised version uploaded on 27 Aug 2005.