Kino Search Engine Add-On
KinoSearch is a Perl implementation of the Lucene search engine (which is implemented in Java). It is the basis of this indexed search engine for TWiki. With KinoSearch you create an index over all webs, including attachments such as Word, Excel and PDF files. That index gives you a very fast search over all topics and their attachments. You need this add-on
- if your TWiki has grown so big that the normal search is too slow, or
- if you want to search not only the topics but also the attachments.
There is already an indexed search based on Plucene: TWiki:Plugins/SearchEnginePluceneAddOn. However, Plucene is relatively slow and is no longer developed. KinoSearch is the successor of Plucene and is much faster and more scalable, so I created this search engine add-on based on the sources of SearchEnginePluceneAddOn.
Screenshot of a search results list
Usage
Indexing with kinoindex
With the script kinoindex you index all the public webs. For each topic, the text body, the title, the form fields and the attached documents are indexed.
For now, you should run this script manually after installation to create the index files used by KinoSearch. If you want, you can also schedule a weekly or monthly cron job to rebuild the index files, or run the script manually when you take down your server for maintenance. To prevent browser access, it has been placed outside the public bin folder.
Updating with kinoupdate
The kinoupdate script uses each web's .changes file to find out about topic modifications. In addition, a .kinoupdate file in each web directory stores the timestamp of the last time the script ran on that web. When the script is executed, it first checks whether any topics have been updated since the last run. The updated topics are removed from the index and then indexed again.
This script should be run by an hourly cron job. Like kinoindex, it has been placed outside the public bin folder.
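The scheduling described above can be sketched as crontab entries. The helper below only prints two candidate lines. The path /home/httpd/twiki/kinosearch/bin and the chosen times are assumptions for illustration, not defaults of the add-on, and the function name kino_cron_lines is hypothetical.

```shell
# Sketch only: print crontab lines for the two maintenance scripts.
# Schedule values and the example path are assumptions, not add-on defaults.
kino_cron_lines() {
    bin_dir="$1"   # absolute path of the kinosearch/bin directory
    # Hourly, on the hour: reindex topics changed since the last run.
    printf '%s\n' "0 * * * * cd $bin_dir && ./kinoupdate"
    # Weekly, Sunday at 03:30: rebuild the complete index.
    printf '%s\n' "30 3 * * 0 cd $bin_dir && ./kinoindex"
}

kino_cron_lines /home/httpd/twiki/kinosearch/bin
```

Review the printed lines and add them with crontab -e for a user that can read the TWiki data and write to the index directory. If a full kinoindex run takes long on your installation, adjust the times so that the two jobs do not overlap.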
Attachment file types to be indexed
All PDF, HTML, DOC, XLS and text attachments are indexed by default. To override this setting, use the TWiki preference KINOSEARCHINDEXEXTENSIONS. You can copy and paste the following lines into your Main.TWikiPreferences topic:
* KinoSearch settings
* Set KINOSEARCHINDEXEXTENSIONS = .pdf, .html, .txt, .doc, .xls
or whatever extensions you want. If you add other file extensions, those files are treated as ASCII files. If needed, you can add more specialised stringifiers for further document types (see Indexing further document types).
Indexing of form fields
All form fields are indexed. To do this, the form templates are checked and the fields they contain are indexed. In addition, the name of a topic's form is stored in the field form_name. How to search for it is described below.
Note: kinoupdate indexes only the form fields that existed when the initial index was created. If you add a form, or add a new field to an existing form, you should therefore create a new index with kinoindex.
Searching with kinosearch
The kinosearch script uses the template kinosearch.pattern.tmpl (if you use the pattern skin). There is also a KinoSearch topic with a form ready to use with the kinosearch script.
Query syntax
- To search for a word, just put that word into the Search box. (Alternatively, add the prefix text: before the word.)
- To search for a phrase, put the phrase in "double quotes".
- Use the + and - operators, just as in Google query syntax, to indicate required and forbidden terms, respectively.
- To search on metadata, prefix the search term with <field>: where <field> is the field name in the metadata (for instance, author).
NOTE: KinoSearch splits compound terms into single words. It reads "something-combined-together" as three words: "something combined together". The same applies to combinations with underscores, so "something_with_underscore" is treated as "something with underscore". This feature is very useful, because you can search for the individual words without knowing the complete term (KinoSearch cannot search with wildcards), but you need to be aware of it. To find "something-combined-together", search for something combined together. If you also enclose the search string in double quotes, the three words must appear in that order, one directly after the other.
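To see what such a term turns into, the following small sketch mimics the splitting on hyphens and underscores. It is an illustration only and is not KinoSearch's tokenizer. The function names split_compound and to_phrase_query are hypothetical.

```shell
# Illustration only: mimic how a compound term is broken into single words.
# This is NOT KinoSearch's tokenizer; it only replaces "-" and "_" by spaces.
split_compound() {
    printf '%s\n' "$1" | tr '_-' '  '
}

# Wrap the split words in double quotes to form a phrase query.
to_phrase_query() {
    printf '"%s"\n' "$(split_compound "$1")"
}

split_compound "something-combined-together"   # prints: something combined together
to_phrase_query "something_with_underscore"    # prints: "something with underscore"
```

The first form matches topics that contain the words in any position. The quoted form requires them in that order, one directly after the other.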
Query examples
- text:kino or just kino
- text:"search engine" or just "search engine"
- author:MarkusHesse (note that to search for a TWiki author, you use their login name)
- form:WebFormName to get all topics with that form attached.
- CONTACTINFO:MarkusHesse if you have declared CONTACTINFO as a variable to be indexed
- type:doc to get all attachments of the given type
- web:Main to get all the topics in a given web
- topic:WebHome to get all the topics of a given name
- +web:Sandbox +topic:Test to get all the topics containing "Test" in their titles and belonging to the Sandbox web.
Note: The current version of KinoSearch does not support wildcards.
Search form
The following form submits text to the kinosearch script. The installation instructions are detailed below.
Indexing further document types
Attached documents are indexed in two steps. In the first step, the content of the document is converted to an ASCII string; this is called stringification. In the second step, this ASCII string is indexed with KinoSearch. This is the usual approach in indexing applications.
To index different types of documents, specialised stringifiers are needed, i.e. classes that extract the ASCII text from the document. This add-on implements a plug-in mechanism, so that additional stringifiers can be added without changing the existing code. All stringifier plugins are stored in the directory lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins.
You can add a new stringifier plugin simply by adding a new file there. At a minimum, the following must be implemented:
- The plugin must inherit from TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase
- The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);
- The plugin must implement the method $text = stringForFile ($filename)
Then extend the list in KINOSEARCHINDEXEXTENSIONS. The new document type should now be indexed using the new stringifier.
NOTE: If you just extend the list without having a special stringifier in place, the document type is treated like an ASCII file. For binary document types, this may lead to problems (improper search results, long indexing times and potential indexing breakdowns).
Add-On Installation Instructions
Note: You do not need to install anything on the browser to use
this add-on. The following instructions are for the administrator who
installs the add-on on the server where TWiki is running.
Backend for indexing Word documents
Install a backend to stringify Word documents if you want to index them. Install either antiword, abiword or wvWare.
Note: This add-on comes with stringifiers for all three of them. The stringifier that matches the installed backend is used.
Note 2: If you install more than one of the three backends, it is not predictable which of them is used. To be sure, delete the unused stringifiers from the directory lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins (DOC_antiword.pm, DOC_abiword.pm or DOC_wv.pm).
Note 3: If you do not install any of the mentioned backends, you should remove .doc from the KINOSEARCHINDEXEXTENSIONS variable.
To install antiword for Debian you can do:
- aptitude install antiword
To install abiword for Debian you can do:
To install wvWare for Debian you can do:
Backends for PDF, PPT
Install xpdf and ppthtml if you want to index attached PDF and PPT files:
- On Debian you can use aptitude install xpdf-utils and aptitude install ppthtml
- If you do not install xpdf, you should remove .pdf from the KINOSEARCHINDEXEXTENSIONS variable.
- If you do not install ppthtml, you should remove .ppt from the KINOSEARCHINDEXEXTENSIONS variable.
Installation of additional CPAN modules
You need to install the following modules: KinoSearch, File::MMagic, Module::Pluggable, HTML::TreeBuilder, Spreadsheet::ParseExcel, CharsetDetector and Encode.
You can do that by running:
- perl -MCPAN -e "install KinoSearch"
- perl -MCPAN -e "install File::MMagic"
- perl -MCPAN -e "install Module::Pluggable"
- perl -MCPAN -e "install HTML::TreeBuilder"
- perl -MCPAN -e "install Spreadsheet::ParseExcel"
- perl -MCPAN -e "install CharsetDetector"
- perl -MCPAN -e "install Encode"
Note for Windows: Make sure you have a C compiler in place. One is normally included with Visual Studio.
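After installing, you may want to confirm that each module actually loads. The following sketch assumes that perl is on the PATH. It relies only on the standard -M and -e switches of perl. The function name module_status and the PERL override variable are hypothetical helpers introduced for this example.

```shell
# Sketch: report whether each required CPAN module can be loaded.
# PERL may be overridden to point at a specific interpreter (assumption).
PERL="${PERL:-perl}"

module_status() {
    # "perl -MSome::Module -e 1" exits non-zero if the module cannot be loaded.
    if "$PERL" -M"$1" -e 1 >/dev/null 2>&1; then
        printf '%s: ok\n' "$1"
    else
        printf '%s: missing\n' "$1"
    fi
}

for module in KinoSearch File::MMagic Module::Pluggable HTML::TreeBuilder \
              Spreadsheet::ParseExcel CharsetDetector Encode; do
    module_status "$module"
done
```

Any module reported as missing can be installed with the corresponding perl -MCPAN command from the list above.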
Installation of the add on itself
Like many other TWiki extensions, this module is shipped with an automatic installer script written using the BuildContrib.
- If you have TWiki 4.1 or later, you can install from the configure interface (go to Plugins->Find More Extensions).
  - The webserver user has to have permission to write to all areas of your installation for this to work.
- If you have a permanent connection to the internet, you are recommended to use the automatic installer script.
  - Just download the BuildContrib_installer perl script and run it.
- Notes:
  - The installer script will
    - copy files into the right places in your local install (even if you have renamed data directories),
    - check in new versions of any installed files that have existing RCS history files in your existing install (such as topics).
    - If the $TWIKI_PACKAGES environment variable is set to point to a directory, the installer will try to get archives from there. Otherwise it will try to download from twiki.org or cpan.org, as appropriate.
    - (Developers only: the script will look for twikiplugins/BuildContrib/BuildContrib.tgz before downloading from TWiki.org)
  - If you don't have a permanent connection, you can still use the automatic installer by downloading all required TWiki archives to a local directory.
    - Point the environment variable $TWIKI_PACKAGES to this directory, and the installer script will look there first for required TWiki packages.
      - $TWIKI_PACKAGES is actually a path; you can list several directories separated by :
    - If you are behind a firewall that blocks access to CPAN, you can pre-install the required CPAN libraries, as described at http://twiki.org/cgi-bin/view/TWiki/HowToInstallCpanModules
- If you don't want to use the installer script, or have problems on your platform (e.g. you don't have Perl 5.8), then you can still install manually:
  1. Download and unpack one of the .zip or .tgz archives to a temporary directory.
  2. Manually copy the contents across to the relevant places in your TWiki installation.
  3. Repeat from step 1 for any missing dependencies.
If you don't use the installer script, you need to install the add-on by hand:
- Download the ZIP file from the Add-on Home (see below)
- Unzip SearchEngineKinoSearchAddOn.zip in your TWiki installation directory. Content:
| File: | Description: |
| bin/kinosearch | script that searches the index files |
| data/TWiki/KinoSearch.txt | Kino search topic |
| data/TWiki/KinoSearch.txt,v | Kino search topic repository |
| data/TWiki/SearchEngineKinoSearchAddOn.txt | Add-on topic |
| pub/TWiki/SearchEngineKinoSearchAddOn/KinoSearchResult.jpg | Attachment |
| templates/kinosearch.pattern.tmpl | template used by new search script for the pattern skin |
| kinosearch/bin/LocalLib.cfg | this file is required and should be modified according to the twiki/lib absolute path of your installation |
| kinosearch/bin/kinoindex | script that indexes all topics |
| kinosearch/bin/kinoupdate | script that updates the index |
| kinosearch/bin/ks_test | script to test stringification |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn.pm | |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/KinoSearch.pm | base script with common functionality |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm | functionality for creating and updating the index |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Search.pm | functionality to search the index |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Stringifier.pm | Class to stringify attached files |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyBase.pm | Base class for stringifier plugins |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins | Directory with stringifier plugins |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC_antiword.pm | Stringifier for MS Word documents using antiword |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC_abiword.pm | Stringifier for MS Word documents using abiword |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC_wv.pm | Stringifier for MS Word documents using wv |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/HTML.pm | Stringifier for HTML files |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PDF.pm | Stringifier for PDF files |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/Text.pm | Stringifier for ASCII files |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/XLS.pm | Stringifier for MS Excel files |
| lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PPT.pm | Stringifier for MS PowerPoint files |
| kinosearch/index/ | directory for index files to be stored |
| kinosearch/logs/ | the index and update logs will be written here; the admin should monitor this folder |
Configuration
This add-on uses several preferences, which should be set in Main.TWikiPreferences. All these preferences are optional. If you are fine with the default values given below, you need not change anything.
* KinoSearch settings
* Set KINOSEARCHINDEXEXTENSIONS = .pdf, .doc, .xml, .html, .txt, .xls, .ppt
* Set KINOSEARCHSEARCHATTACHMENTSONLY = 0
* Set KINOSEARCHSEARCHATTACHMENTSONLYLABEL = Display only attachments
* Set KINOSEARCHINDEXSKIPWEBS = Trash, Sandbox, TWiki
* Set KINOSEARCHINDEXSKIPATTACHMENTS = Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
* Set KINOSEARCHANALYSERLANGUAGE = en
* Set KINOSEARCHSUMMARYLENGTH = 300
* Set KINOSEARCHDEBUG = 0
* Set KINOSEARCHMAXLIMIT = 2000
You can optionally insert the following lines into your lib/LocalSite.cfg to determine where the index and the log files are created.
Note: The directories must exist.
$TWiki::cfg{KinoSearchLogDir} = '/home/httpd/twiki/kinosearch/logs';
$TWiki::cfg{KinoSearchIndexDir} = '/home/httpd/twiki/kinosearch/index';
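Since the directories must exist before the scripts run, they can be created up front. The following is a minimal sketch. The function name make_kino_dirs is hypothetical, and the base path in the usage comment simply mirrors the example values above.

```shell
# Sketch: create the log and index directories below a given base directory.
# make_kino_dirs is a hypothetical helper, not part of the add-on.
make_kino_dirs() {
    base="$1"   # e.g. /home/httpd/twiki/kinosearch
    mkdir -p "$base/logs" "$base/index"
}

# Usage (path taken from the example configuration above):
#   make_kino_dirs /home/httpd/twiki/kinosearch
```

The user that runs kinoindex and kinoupdate needs write access to both directories. How you grant that depends on your server setup.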
Remember to edit the file kinosearch/bin/LocalLib.cfg and modify twikiLibPath according to your configuration.
Test of the installation
- Test if the installation was successful:
- Check that antiword, abiword or wvHtml is in place: type antiword, abiword or wvHtml at the prompt and check that the command exists.
- Check that pdftotext is in place: type pdftotext at the prompt and check that the command exists.
- Check that ppthtml is in place: type ppthtml at the prompt and check that the command exists.
- Change the working directory to kinosearch/bin in your TWiki installation directory.
- Run ./kinoindex
- Once it has finished, open a browser window and point it to the TWiki/KinoSearch topic.
- Just type a query and check the results.
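The backend checks in the list above can also be run in one go. This sketch uses the standard shell built-in command -v to look up each tool on the PATH. The function name check_backend is hypothetical.

```shell
# Sketch: report which stringifier backends are available on the PATH.
check_backend() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found\n' "$1"
    else
        printf '%s: MISSING\n' "$1"
    fi
}

# Tool names as given in the checks above.
for tool in antiword abiword wvHtml pdftotext ppthtml; do
    check_backend "$tool"
done
```

Only one of antiword, abiword and wvHtml needs to be found for Word documents. A missing pdftotext or ppthtml means that .pdf or .ppt should be removed from KINOSEARCHINDEXEXTENSIONS.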
Test of stringification with ks_test
Some users report problems with stringification: the kinoindex script fails or takes too long on attachments, or kinosearch does not yield correct results. Sometimes this results from installation errors, especially in the installation of the stringification backends. ks_test gives you the opportunity to test the stringification in advance.
Usage:
ks_test stringify file_name
(I plan to extend ks_test, but at the moment the only supported command is stringify.)
The output shows which stringifier is used and the result of the stringification.
Example:
/home/httpd/twiki/kinosearch/bin$ ./ks_test stringify /home/httpd/twiki_svn/SearchEngineKinoSearchAddOn/test/unit/SearchEngineKinoSearchAddOn/attachement_examples/Simple_example.doc
Used stringifier: TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::DOC_antiword
Stringified text:
Simple example Keyword: dummy Umlaute: Größer, Überschall, Änderung
You see that the stringifier DOC_antiword is used and the resulting
text seems to be O.K.
Add-On Info
- Set SHORTDESCRIPTION = Fast indexed search with indexing of attachments like Word, Excel, PDF and PPT
| Add-on Author: | TWiki:Main/MarkusHesse |
| Add-on Version: | v1.16 |
| Update note | If you update from an older version to v1.15 or higher, remove the old index and create a complete new one with kinoindex |
| Change History: | |
| 4 Jun 2008: | v 1.16, Bugs:Item5646: Problem with attachments with capital letter suffix |
| 12 May 2008: | v 1.15, Bugs:Item5579, Bugs:Item5580, Bugs:Item5619: Problem with ALLOWWEBVIEW and Forms fixed |
| 23 Apr 2008: | v 1.14, Bugs:Item5273, Bugs:Item5546, Bugs:Item5550, Bugs:Item5552: Use current user in search script |
| 27 Jan 2008: | v 1.13, Bugs:Item5271: Option "show locked topics" now works |
| 19 Jan 2008: | v 1.12, Bugs:Item5270: Enhancement of stringifiers |
| 19 Dec 2007: | v 1.11, Additions on stringifiers, modification of output format |
| 17 Nov 2007: | v 1.10, PPT stringifier added |
| 11 Nov 2007: | v 1.09, Some bugfixing |
| 3 Nov 2007: | v 1.08, Some bugfixing |
| 7 Oct 2007: | v 1.07, Some bugfixing |
| 6 Oct 2007: | v 1.06, Upgrade for 4.1, Release with BuildContrib |
| 29 Sep 2007: | v 1.05, Indexing of form fields |
| 16 Sep 2007: | v 1.04, Stringifier plugins for doc, xls and html |
| 13 Sep 2007: | v 1.03, Indexing of PDF and TXT attachments |
| 08 Sep 2007: | v 1.02, Index and update script enhanced |
| 24 Aug 2007: | v 1.01, Update script included, Result uses highlighter |
| 14 Aug 2007: | Initial version (v1.000) |
| CPAN Dependencies: | CPAN:KinoSearch |
| | CPAN:File::MMagic |
| | CPAN:Module::Pluggable |
| | CPAN:HTML::TreeBuilder |
| | CPAN:Spreadsheet::ParseExcel |
| | CPAN:CharsetDetector |
| | CPAN:Encode |
| Other Dependencies: | pdftotext (part of xpdf-utils) |
| | antiword, abiword or wvWare |
| | ppthtml |
| Perl Version: | Tested with 5.8.0 |
| License: | GPL |
| Add-on Home: | http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOn |
| Feedback: | http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOnDev |
| Appraisal: | http://TWiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOnAppraisal |
-- MarkusHesse - 24 Aug 2007