Search Engine Swish-E Add-On

A common request from my users was a better search that also covers attachments. We picked Swish-E because it can now index several Microsoft formats: .doc, .ppt and .xls.

These instructions are for setting it up on Linux. I have not tried it on Windows and do not expect it to work there.

Summary

  • Fast search of TWiki topic text and attachments
  • Attachments in Microsoft Word (.doc), PowerPoint (.ppt) and Excel (.xls) format are indexed
  • Fast: for our site, it is about 20x faster than TWiki's built-in search, with much less server load.

User reaction

  • "Excellent, does the job"
  • "We needed this months ago"

Caveats

  • Swish does not have incremental index updating, so every update is a full rebuild. Because the spider has to fetch each page from TWiki, and TWiki renders slowly, the rebuild is very time consuming.
  • Requires some unix skill to configure, comparable to setting up TWiki itself.

Future work

The current implementation spiders TWiki over HTTP. This made it easy to index the current version of each topic correctly, but at a significant cost in indexing speed.

Future ideas:

  • Get 'last modified date' to work. Approach (A) is to index the files directly; approach (B) is to include a last-modified date in the HTTP response sent to the spider (and probably not in the one sent to general users).
  • Set up swish to index the files directly (both data/**/*.txt and pub). This would allow ranking by "last modified date", and make it faster. A more complicated filter would be required to remove the twiki 'meta' fields from the topics.
  • Modify spidering/indexing to include author and web as 'meta' fields for swish. Goal is to enable subset searching (only in this web, or only by this author).
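
As a starting point for the direct-indexing idea above, here is a minimal sketch of a filter that drops TWiki meta fields from topic text. It assumes each meta field sits on its own line in the form %META:NAME{...}%; the function name and the example path are made up for illustration, and this is not part of the add-on.

```shell
#!/bin/sh
# Sketch: strip TWiki meta lines from topic text before indexing.
# Assumption: each meta field occupies a whole line, %META:NAME{...}%.
strip_twiki_meta() {
    # Delete lines that consist solely of one %META:...{...}% field.
    # With no file arguments, sed reads standard input.
    sed '/^%META:[A-Z][A-Z]*{.*}%$/d' "$@"
}

# Example: print the indexable text of one topic (hypothetical path).
# strip_twiki_meta /home/data/devweb/twiki/data/Main/WebHome.txt
```

A real filter would probably also need to handle TWiki markup and variables in the topic body, which this sketch leaves untouched.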

Known Issues

It appears that the Spreadsheet::ParseExcel_XlHTML module does not properly handle non-ASCII content in Excel files. I have not investigated how the Russian and Japanese content (which I know is present in our files) ends up being handled. The warning printed in the log concerns the pack of the bytes retrieved from the data file.

Add-On Installation Instructions

Note: You do not need to install anything on the browser to use this add-on. The following instructions are for the administrator who installs the add-on on the server where TWiki is running.

Install modules

  1. Swish itself, from http://www.swish-e.org
  2. The catdoc parsing program, from http://www.45.free.net/~vitus/ice/catdoc/. This program is used to convert MS Word (.doc) and PowerPoint (.ppt) files to plain text that can be indexed.
  3. Install the CPAN modules for XLS parsing: Spreadsheet::ParseExcel and Spreadsheet::ParseExcel_XlHTML. (I used the Perl module manager via my installation of Webmin. I highly recommend Webmin, it really simplifies Linux administration.)
  4. PDF conversion tool. I used the recommended one: http://www.foolabs.com/xpdf/download.html

Testing the catdoc and pdf tools separately is recommended.
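
One way to do that check, as a rough sketch: first confirm the converters are on the PATH, then convert a sample file of each type by hand. The check_tool helper and the sample file names are made up for this example; pdftotext is the text converter that ships with xpdf.

```shell
#!/bin/sh
# Report whether a helper program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# catdoc converts .doc files; pdftotext converts PDF files.
missing=0
for tool in catdoc pdftotext; do
    check_tool "$tool" || missing=1
done

# Then convert one sample of each type and look at the text that comes out
# (sample.doc and sample.pdf are placeholder names):
#   catdoc sample.doc | head
#   pdftotext sample.pdf - | head
```

If the output of either conversion is empty or garbled, the index will be too, so it is worth fixing at this stage.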

Install scripts

I used the directory /home/data/swish, which is hardcoded in several of the scripts. Change it as required for your installation.

See File Details for the complete list of files included.

  1. Download the ZIP file from the Add-on Home (see below)
  2. Unzip into a temporary directory.
  3. Move the swish/scripts into a directory that is not part of twiki.
  4. Edit those scripts to have the proper hardcoded paths for where they have bene installed
  5. Move the searchtools files into a subdirectory of your twiki
  6. Edit the top of the cgi to give the path for the swish config
  7. edit the swish config to indicate where the index is stored
  8. If you want to use my customizations of the template, install the other files appropriately
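
To find the hardcoded paths that need editing, something like the following sketch may help. The function name and the unpack directory are made up for this example; the default prefix /home/data covers the paths used on my site.

```shell
#!/bin/sh
# List every line under a directory that mentions a given path prefix.
# Usage: list_hardcoded_paths DIR [PREFIX]
list_hardcoded_paths() {
    dir=$1
    prefix=${2:-/home/data}
    # -r: recurse, -n: show line numbers, -F: treat the prefix as a literal string
    grep -rnF -- "$prefix" "$dir"
}

# Example (the unpack directory is a placeholder):
# list_hardcoded_paths /tmp/swish-unpacked
```

Each output line has the form file:line:text, so you can go straight to the places that need changing.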

Configure Swish to index TWiki including attachments

This uses the configuration scripts (in the ZIP file), along with these steps:

  1. Create a new user specifically for use by the spidering program. I used SpideringEngine
  2. Edit the %MAINWEB%/SpideringEngine page to specify plain as the default skin for this user
    * Set default skin to remove 'print' button etc
       * Set SKIN = plain
  3. Edit the twiki-spider-config.pl file
    1. Modify the base_url
    2. Modify the credentials in twiki-spider-config.pl to match the created user.
  4. Check the limit for the number of URLs. I use 15000, which must include any duplicate URLs. TWiki generates duplicate URLs frequently. The spider will print "Max indexed files Reached" if the limit is too small for your site.
  5. The default filtering of topics should be OK for most TWiki sites. The default in twiki-spider-config.pl does the following:
    1. Excludes some specific topics (WebChanges, etc.) that take a long time to render and add little value to the index
    2. Filters out links that carry query parameters, so the same topic is not indexed again through table sorting links, attachment management links, etc.

Test that spidering is working

  • Edit the twiki-spider-config.pl to enable debug printing at 'info' level (see comments in the file)
  • Run the spider tool to create a data file of the content to be indexed. The script spider-only.sh is provided for this.

Check these things

  • The printout while spidering is running should include your URLs and topic names.
  • Glance through the resulting data file to verify that the contents are reasonable. If the file is larger than about 8 MB, it may be easier to use tail to view just the latter part. Look for Path-Name at the beginning of a line to locate the start of each URL indexed.
  • You don't really need to let it spider the entire TWiki; you can press Ctrl-C after a few hundred topics are indexed. The main thing is to make sure the spider is finding your TWiki content properly.
  • It is a good idea to verify the attachments get their content converted properly so indexing will include them. This usually requires indexing the entire twiki, or altering the starting URL.
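
For the data file check, a small sketch along these lines can save some scrolling. It relies only on the Path-Name lines mentioned above; the function names and the data file name are made up for this example.

```shell
#!/bin/sh
# Count the documents recorded in a spider data file.
# -a makes grep treat the file as text even if attachments left odd bytes in it.
count_spidered_docs() {
    grep -a -c '^Path-Name' "$1"
}

# Show the last few URLs that were spidered (default: 5).
last_spidered_urls() {
    grep -a '^Path-Name' "$1" | tail -n "${2:-5}"
}

# Example (spider-output.dat is a placeholder name):
# count_spidered_docs spider-output.dat
# last_spidered_urls spider-output.dat 10
```

If the count is far below the number of topics you expect, check the URL limit and the credentials before going further.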

If the spider is not finding your TWiki URLs, or is failing with a username/password error, fix the settings in twiki-spider-config.pl.

Remove the debug line in twiki-spider-config.pl when you have finished with this step.

Manually build the indexes for testing

Use the build-twiki-index.sh script in the attached archive, which combines the spidering and indexing steps.

Be sure to look through the log to ensure it has properly indexed the attachments: it will include error messages if the various filter programs are not properly enabled.
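
A quick way to skim a long log is to pull out only the suspicious lines, as in this sketch. The search words "error" and "warn" are my guess at what the filter programs print, not a list of known messages; "Max indexed files Reached" is the spider message mentioned earlier. The function name and the log file name are made up for this example.

```shell
#!/bin/sh
# Print log lines that look like problems, prefixed with their line numbers.
# The patterns are assumptions; adjust them to what your log actually contains.
scan_index_log() {
    grep -n -i -E 'error|warn|Max indexed files Reached' "$1"
}

# Example (index.log is a placeholder name):
# scan_index_log index.log
```

No output does not prove the attachments were indexed; a test search for a word that only occurs in an attachment is still worthwhile.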

The first time, you can reuse the data file created by the spidering test if you want; see the Swish documentation for how to do this.

Engage search within the TWiki

This is a simple setup of the swish.cgi search script inside of the twiki bin directory.

Be sure to read the swish documentation on customizing the templates, since some customization is always required.

I included our script that has multiple indexes to allow search of our bug database (updated every 3 hours).

Test the search script and index

The URL will be %TWIKI%/searchtools/swish.cgi

Configure your twiki to have an appropriate "Full Text Search" button

We put a button in the left column that goes to the swish search page.

Set up cron to run the index as desired

The script build-twiki-index.sh should be run periodically to rebuild the index.

Even with SpeedyCGI, TWiki rendering is not fast. As a result, I rebuild the index only once per day, at 4 am. It takes about 1.5 hours to index our TWiki, which has ~4000 topics and attachments.
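
A daily 4 am rebuild could be set up roughly as sketched below. The run_logged helper and the log path are made up for this example; the script path is the one used on my site and will differ on yours.

```shell
#!/bin/sh
# Run a command, appending its output to a log between start/end timestamps.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    echo "=== start $(date) ===" >> "$log"
    status=0
    "$@" >> "$log" 2>&1 || status=$?
    echo "=== end $(date), exit status $status ===" >> "$log"
    return "$status"
}

# Example call from a wrapper script (log path is a placeholder):
# run_logged /home/data/swish/index.log /home/data/swish/scripts/build-twiki-index.sh

# Example crontab entry that calls the build script directly every day at 4 am:
# 0 4 * * * /home/data/swish/scripts/build-twiki-index.sh >> /home/data/swish/index.log 2>&1
```

Keeping the output in a log makes it possible to check later whether the attachment filters ran cleanly.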

Create Documentation

Create a SearchEngineSwishEAddOn topic in your local TWiki, view the raw text for TWiki:Plugins/SearchEngineSwishEAddOn, and copy the contents to your local TWiki.

Contents of the ZIP file

All of these files will require some tweaks since they have hardcoded paths in them.

Scripts in home/data/swish/scripts, used to build the index:

  • home/data/swish/scripts/twiki-index.config: Swish config file
  • home/data/swish/scripts/twiki-spider-config.pl: Spider.pl config file to decide what to index
  • home/data/swish/scripts/purisma.config: included config file
  • home/data/swish/scripts/spider-only.sh: run spidering only for test
  • home/data/swish/scripts/build-twiki-index.sh: script to run via cron

twiki/bin directory, with an added subdirectory for the search script:

  • home/data/devweb/twiki/bin/searchtools/swish.cgi: search script, basically same as distributed
  • home/data/devweb/twiki/bin/searchtools/swish-cgi.config: config for search to allow search of multiple indexes

Others:

  • usr/local/lib/swish-e/perl/SWISH/TemplateDefault.pm: modified template, includes a 'help' link and link back to twiki. Search for "CUSTOMIZED" to see what should be reviewed or modified
  • home/data/devweb/twiki/data/TWiki/SwishSearchSyntax.txt: "help" topic for the template, to be in the TWiki 'Twiki' data directory

Add-On Info

  • Set SHORTDESCRIPTION = Search within attached documents, such as doc, html, txt, pdf, ppt, xls

Add-on Author: TWiki:Main.StanleyKnutson
Add-on Version: 16 Aug 2005 (v1.000)
Change History:  
27 Aug 2005: Updated slightly, added doc about testing spidering first.
16 Aug 2005: Initial version
CPAN Dependencies: Spreadsheet::ParseExcel and Spreadsheet::ParseExcel_XlHTML
Other Dependencies: Swish-E, catdoc, PDF conversion tool e.g. xpdf
Perl Version: 5.005
License: GPL
Add-on Home: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineSwishEAddOn
Feedback: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineSwishEAddOnDev
Appraisal: http://TWiki.org/cgi-bin/view/Plugins/SearchEngineSwishEAddOnAppraisal

Related Topic: TWikiAddOns

-- TWiki:Main.StanleyKnutson - 16 Aug 2005

Slightly revised version uploaded on 27 Aug 2005.

Topic attachments
  • example-swish-scripts.tgz (r3, 38.6 K, 2005-08-28 03:46, UnknownUser): Updated scripts for indexing twiki via SwishE
Topic revision: r11 - 2007-08-28 - PeterThoeny
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.