Search Engine Swish-E Add-On
A common request from my users was better search that included attachments. We picked Swish-E because it can now index various Microsoft formats: .doc, .ppt and .xls.
The instructions here are for setting it up on Linux; I don't think this will work on Windows.
Summary
- Fast search of TWiki topic text and attachments
- Attachments of Microsoft Word (.doc), Powerpoint (.ppt) and Excel (.xls) are indexed
- Fast: for our site, it is about 20x faster than TWiki's built-in search, with much less server load.
User reaction
- "Excellent, does the job"
- "We needed this months ago"
Caveats
- Swish-E does not support incremental index updates, so rebuilding the index is very time consuming given how slow TWiki rendering is
- Requires some unix skill to configure, comparable to setting up TWiki itself.
Future work
The current implementation spiders TWiki over HTTP. This made it easy to index the current version of each topic correctly, but at significant cost in indexing speed.
Future ideas:
- Get 'last modified date' to work. Approach (A) is to index the files directly; approach (B) is to integrate some kind of last-modified-date into the HTTP responses sent to the spider (and probably not to general users)
- Set up Swish to index the files directly (both data/**/*.txt and pub). This would allow ranking by "last modified date" and make indexing faster. A more complicated filter would be required to remove the TWiki 'meta' fields from the topics.
- Modify spidering/indexing to include author and web as 'meta' fields for swish. Goal is to enable subset searching (only in this web, or only by this author).
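As a rough sketch of the filtering mentioned above: TWiki stores topic meta data in lines beginning with %META:, so a minimal filter for direct file indexing could simply drop those lines. This is only an illustration (the sample topic created here is invented); a real filter would need to handle more cases.

```shell
# Minimal sketch: strip TWiki %META: lines from a topic file before
# handing it to swish-e for direct file indexing.
# A stand-in topic file is created here so the commands run end to end.
cat > /tmp/SampleTopic.txt <<'EOF'
%META:TOPICINFO{author="StanleyKnutson" version="1.1"}%
Some topic text that should be indexed.
%META:FILEATTACHMENT{name="report.doc"}%
EOF
grep -v '^%META:' /tmp/SampleTopic.txt > /tmp/SampleTopic.stripped.txt
cat /tmp/SampleTopic.stripped.txt
```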
Known Issues
It appears the Spreadsheet::ParseExcel_XlHTML module does not properly handle non-ASCII content in the Excel files. I have not investigated how the Russian and Japanese content is being handled (I know it is present in the files). The warning printed in the log concerns the pack of the bytes retrieved from the data file.
Add-On Installation Instructions
Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.
Install modules
- Swish itself, from http://www.swish-e.org
- The catdoc parsing program, from http://www.45.free.net/~vitus/ice/catdoc/. This program converts MS Word (.doc) and Powerpoint (.ppt) files to a text format that can be indexed.
- The CPAN modules for XLS parsing: Spreadsheet::ParseExcel and Spreadsheet::ParseExcel_XlHTML. (I used the Perl module manager via my installation of Webmin. I highly recommend Webmin; it really simplifies Linux administration.)
- A PDF conversion tool. I used the recommended one: http://www.foolabs.com/xpdf/download.html
Testing the catdoc and pdf tools separately is recommended.
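A quick way to check that the converters are actually installed before Swish tries to use them is a loop over the tool names. (catppt ships with catdoc, and pdftotext comes with xpdf; this snippet only checks PATH, it does not test conversion quality.)

```shell
# Check that the external converters Swish will call are on PATH.
# Prints one status line per tool.
for tool in catdoc catppt pdftotext; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: NOT FOUND - install it before indexing"
    fi
done
```

Once each tool is found, run it by hand on a sample attachment (for example, catdoc sample.doc | head, or pdftotext sample.pdf -) to confirm the converted text looks sane.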
Install scripts
I used the directory /home/data/swish, which is hardcoded in several of the scripts as required.
See File Details for the complete list of files included.
- Download the ZIP file from the Add-on Home (see below)
- Unzip into a temporary directory.
- Move the swish/scripts files into a directory that is not part of TWiki.
- Edit those scripts to use the proper hardcoded paths for where they have been installed.
- Move the searchtools files into a subdirectory of your TWiki.
- Edit the top of the CGI script to give the path for the swish config.
- Edit the swish config to indicate where the index is stored.
- If you want to use my customizations of the template, install the other files appropriately.
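The steps above amount to creating two directory trees and copying files into them. The sketch below only illustrates the layout (it uses /tmp/example so it can be run safely; substitute your real paths, which for this write-up were /home/data/swish and /home/data/devweb/twiki):

```shell
# Example layout only: the add-on hardcodes its install paths, so adjust
# them to your server. /tmp/example stands in for the filesystem root here.
SWISH_HOME=/tmp/example/home/data/swish            # scripts live outside TWiki
TWIKI_BIN=/tmp/example/home/data/devweb/twiki/bin  # your TWiki bin directory
mkdir -p "$SWISH_HOME/scripts" "$TWIKI_BIN/searchtools"
# After unzipping the add-on into a temporary directory, copy the pieces in:
#   cp swish/scripts/* "$SWISH_HOME/scripts/"     (index-building scripts)
#   cp searchtools/*   "$TWIKI_BIN/searchtools/"  (swish.cgi and its config)
echo "created: $SWISH_HOME/scripts and $TWIKI_BIN/searchtools"
```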
Configure Swish to index TWiki including attachments
This uses the configuration scripts (in the ZIP file), along with these steps:
- Create a new user specifically for use by the spidering program. I used SpideringEngine.
- Edit the %MAINWEB%/SpideringEngine page to specify plain as the default skin for this user:
   * Set default skin to remove the 'print' button etc.
   * Set SKIN = plain
- Edit the twiki-spider-config.pl file:
- Modify the base_url.
- Modify the credentials given in twiki-spider-config.pl to match the created user.
- Check the limit for the number of URLs. I use 15000, which must include any duplicate URLs; TWiki generates duplicate URLs frequently. The spider will print "Max indexed files Reached" if this is too small for your site.
- The default filtering of topics should be OK for most TWikis. The defaults in twiki-spider-config.pl are:
- Exclude some specific topics (WebChanges, etc.) that take a long time to render and add little value to the index.
- Filter out links when query parameters are present, to avoid indexing the same topic repeatedly via table-sorting links, attachment-management links, etc.
Test that spidering is working
- Edit twiki-spider-config.pl to enable debug printing at 'info' level (see comments in the file).
- Run the spider tool to create a data file of the content to be indexed. The script spider-only.sh is provided for this.
Check these things
- The printout while spidering is running should include your URLs and topic names.
- Glance through the resulting data file to verify its contents are reasonable. It may be necessary to use tail to inspect the latter part of the file if it is more than 8 MB. Look for Path-Name at the beginning of a line to locate the start of each URL indexed.
- You don't need to let it spider the entire TWiki; you can press Ctrl-C after a few hundred topics are indexed. The main thing is to make sure the spider is finding your TWiki content properly.
- It is a good idea to verify that attachments get their content converted properly, so indexing will include them. This usually requires spidering the entire TWiki, or altering the starting URL.
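Since each record in the spider's data file starts with a Path-Name line, grep makes a convenient spot-check. A tiny stand-in data file is created here for illustration; point DATA at the real file spider-only.sh produced.

```shell
# Spot-check a spider data file by counting Path-Name records (one per
# indexed URL) and listing the first few URLs.
DATA=/tmp/spider-sample.txt
cat > "$DATA" <<'EOF'
Path-Name: http://twiki.example.com/bin/view/Main/WebHome
Content-Length: 20

<html>WebHome</html>
Path-Name: http://twiki.example.com/bin/view/Main/TWikiUsers
Content-Length: 23

<html>TWikiUsers</html>
EOF
grep -c '^Path-Name' "$DATA"        # number of URLs indexed
grep '^Path-Name' "$DATA" | head -5 # spot-check the first few
```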
If the spider is not finding your TWiki URLs, or is failing with a username/password error, fix the settings in twiki-spider-config.pl.
Remove the debug line in twiki-spider-config.pl when finished with this step.
Manually build the indexes for testing
Use the build-twiki-index.sh script in the attached ZIP file, which combines the spidering and indexing.
Be sure to look through the log to ensure it has properly indexed the attachments: it will include error messages if the various filter programs are not properly enabled.
The first time, you can reuse the data file created by the spidering if you want, see the Swish documentation for how to do this.
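One way to do the log check mentioned above is a case-insensitive grep for common failure words. The log path and contents below are invented for illustration (a stand-in log is created so the commands run end to end); use whatever your build script actually writes.

```shell
# Scan the index-build log for signs that the attachment filters failed.
LOG=/tmp/build-twiki-index.log
cat > "$LOG" <<'EOF'
indexing http://twiki.example.com/pub/Main/report.doc
error: catdoc not found
indexing http://twiki.example.com/bin/view/Main/WebHome
EOF
grep -inE 'error|warn|fail' "$LOG" || echo "no filter errors found"
```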
Engage search within the TWiki
This is a simple setup of the swish.cgi search script inside the TWiki bin directory.
Be sure to read the swish documentation on customizing the templates, since some customization is always required.
I included our script that has multiple indexes to allow search of our bug database (updated every 3 hours).
Test the search script and index
The URL will be %TWIKI%/searchtools/swish.cgi
Configure your twiki to have an appropriate "Full Text Search" button
We put a button on the left column that goes to the swish search page
Set up cron to run the indexing as desired
The script build-twiki-index.sh should be run periodically to rebuild the index.
Even with the use of SpeedyCGI, TWiki rendering is not fast. As a result, I rebuild the index only once per day, at 4 am. It takes about 1.5 hours to index our TWiki, which has ~4000 topics plus attachments.
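The daily 4 am rebuild corresponds to a crontab line along these lines (the script and log paths are assumptions based on the layout used in this write-up; add the line with crontab -e for the user that owns the index files):

```shell
# Show the crontab line for a daily 4:00 am index rebuild, with output
# appended to a log file for later review.
CRON_LINE='0 4 * * * /home/data/swish/scripts/build-twiki-index.sh >> /home/data/swish/build-index.log 2>&1'
echo "$CRON_LINE"
```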
Create Documentation
Create a SearchEngineSwishEAddOn topic in your local TWiki, view the raw text of TWiki:Plugins/SearchEngineSwishEAddOn, and copy the contents to your local TWiki.
Contents of the ZIP file
All of these files will require some tweaks since they have hardcoded paths in them.
| *File:* | *Description:* |
| *home/data/swish/scripts files are used to build the index* ||
| home/data/swish/scripts/twiki-index.config | Swish config file |
| home/data/swish/scripts/twiki-spider-config.pl | Spider.pl config file to decide what to index |
| home/data/swish/scripts/purisma.config | included config file |
| home/data/swish/scripts/spider-only.sh | run spidering only, for testing |
| home/data/swish/scripts/build-twiki-index.sh | script to run via cron |
| *twiki/bin directory: added a subdirectory for the search script* ||
| home/data/devweb/twiki/bin/searchtools/swish.cgi | search script, basically the same as distributed |
| home/data/devweb/twiki/bin/searchtools/swish-cgi.config | config for search, to allow search of multiple indexes |
| *Others* ||
| usr/local/lib/swish-e/perl/SWISH/TemplateDefault.pm | modified template; includes a 'help' link and a link back to TWiki. Search for "CUSTOMIZED" to see what should be reviewed or modified |
| home/data/devweb/twiki/data/TWiki/SwishSearchSyntax.txt | "help" topic for the template, to be placed in the TWiki 'TWiki' data directory |
Add-On Info
- Set SHORTDESCRIPTION = Search within attached documents, such as doc, html, txt, pdf, ppt, xls
Related Topic: TWikiAddOns
--
TWiki:Main.StanleyKnutson - 16 Aug 2005
Slightly revised version uploaded on 27 Aug 2005.