SearchEngineKinoSearchAddOnDev Discussion: Page for developer collaboration, enhancement requests, patches and improved versions on SearchEngineKinoSearchAddOn contributed by the TWikiCommunity.
• Please let us know what you think of this extension.
• For support, check the existing questions, or ask a new support question in the Support web!
• Please report bugs below

Feedback on SearchEngineKinoSearchAddOn

-- MarkusHesse - 24 Aug 2007

KinoSearch is a Perl implementation of the Lucene search engine (which is itself implemented in Java). It could be the base of an indexed search engine for TWiki.

Why yet another search engine?

AFAIK there are three approaches to set up an indexed search in TWiki:

Google is not always possible, especially if I don't want to open my Wiki to the internet.

SearchEngineSwishEAddOn is complex to install (I never tried it because of the warnings from CrawfordCurrie) and the architectural approach (spidering the HTML output of TWiki) does not seem optimal to me.

In the end Plucene would be optimal, if the performance and scaling issues could be overcome. But Plucene is no longer being developed, if the information on http://plucene.minty.org/ is correct.

While searching the internet, I came across the page on KinoSearch (http://www.rectangular.com/kinosearch/). It seems to be the new Perl port of Lucene. The benchmark numbers in particular are quite promising: KinoSearch indexes nearly as fast as native Lucene and is more than 20 times faster than Plucene (this somewhat confirms my experience in benchmarking Plucene).

I put a first version here.

KinoSearch proves to be very fast: I index 10,000 topics in about five minutes and the search returns answers almost immediately. The memory usage is also much better: the index script never uses more than 50 MB, and the search script uses even less.

-- MarkusHesse - 04 Aug 2007

Thank you Markus for sharing this add-on with the TWikiCommunity. This should make it much easier to index TWiki sites.

Small feedback on add-on topic:

  • Use ---+ heading on top
  • Use only interwiki links to point to twiki.org topics (WikiWord links break when installed on another site)
  • Preferred place to define site preferences settings is the Main.TWikiPreferences topic (so that upgrades are easier)
  • Defining executables and directories as preferences settings can be a security hazard. Better to define KINOSEARCHINDEXPATH and KINOSEARCHATTACHMENTSPATH as configure settings (stored in LocalSite.cfg)
-- PeterThoeny - 28 Aug 2007

You have to improve on the instructions. And it would be great to have Windows instructions too.

-- AnoopR - 30 Aug 2007

I get this error: 'cl' is not recognized as an internal or external command. I don't have Visual C++ and I don't want to install it. Can you help?

-- AnoopR - 30 Aug 2007

If you really want to install KinoSearch on Windows, you will need the MS C compiler cl.exe, which is part of MS Visual C++. I'm afraid there is no other way.

-- MarkusHesse - 02 Sep 2007

Can I run "make" on another system which has Visual C++, then move the result to the target system and run "make install" there?

-- AnoopR - 06 Sep 2007

I did it! I just had to run NMAKE on another system that had Perl and Visual C++. Then I copied the folder to my system and ran "nmake install". It ran and installed successfully.

I also tried to install the Windows SDK on my system. I think I partially installed something, but I am not sure whether that was useful. The full SDK was around 700 MB, so I removed a lot and installed only the few things that made sense...

-- AnoopR - 06 Sep 2007

but now i have a new problem.

my $tmpl = TWiki::Func::readTemplate( "plucsearch" );

is not reading the template file if i request through the web client. but if I do a

perl -wT plucsearch>test.html

then I get a good result. I am running this on Win2k/Apache.

-- AnoopR - 06 Sep 2007

There shouldn't be any "plucsearch". It should be my $tmpl = TWiki::Func::readTemplate( "kinosearch" ); Do you use the latest version?

-- MarkusHesse - 06 Sep 2007

I am sorry for the confusion, it is "kinosearch". It's just that I had the exact same trouble with Plucene search too. Is there any concept of registering the templates? Everything else seems to work fine.

-- AnoopR - 07 Sep 2007

I solved that. It was because I was using the NAT skin at that time. It worked as soon as I removed the skin.

-- AnoopR - 07 Sep 2007

hi Again....

For some weird reason only a part of the result generated by the kinosearch script is being served to the client. But again the perl -wT kinosearch>test.html, gives a very good result.

i have given the page-source that the client receives.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us" lang="en-us">
<head>
....
class="patternSearchResultsBegin"></div><div class="patternSearchResultsHeader" style="background-color:#FFD8AA"><span class="twikiLeft">Search results</span><span class="twikiRight">retrieved at 08:12 (GMT)</span><br class="twikiClear" /></div>

-- AnoopR - 07 Sep 2007

I uploaded a new version. I fixed a bug in the kinoupdate script: if a topic was changed, the changes were added to the index but the old version remained in the index. Thus a search returned the old topic as well as the new one. If you update from an older version, you need to rerun the initial index script (kinoindex) to rebuild the index.

Anoop: Congratulations on installing KinoSearch on Windows. Perhaps you can contribute more details on that so that the installation instructions can be enhanced for Windows users (I will never try to run TWiki on Windows).

-- MarkusHesse - 08 Sep 2007

Markus, I have used the gnu C compiler on Windows with some success for compiling non-gui Linux code for Windows. Surely you could compile using that?

-- CrawfordCurrie - 08 Sep 2007

hi I will load the instructions tomo. Any idea about the partial response error that i mentioned in my previous comment?

-- AnoopR - 11 Sep 2007

Hi Anoop, in my installation the output goes on with

<br class="twikiClear" /></div><div class="patternSearchResults"><div class="twikiTopRow"><div class="twikiLeft"> <a href="/twiki/bin/view/a_web/a_topic" class="twikiLink">a_web.a_topic</a> <span class="twikiAlert"> </span></div><div class="twikiRight twikiSRAuthor"> <a href="/twiki/bin/view/Main/MarkusHesse" class="twikiLink">MarkusHesse</a></div><div class="twikiRight twikiSRRev"><a href="/twiki/bin/rdiff/blabla" rel='nofollow'>08 Sep 2007 - 12:41</a> - r3&nbsp;</div><br class="twikiClear" /></div><div class="twikiBottomRow"><div class="twikiSummary twikiGrayText">bei <strong>Replikaten</strong>!) <ul>
...

Did you check in the apache error log?

-- MarkusHesse - 11 Sep 2007

yup! you are smart!!! and i am really dumb not to have looked at that earlier. a preference variable was not set properly, awesome!!! now everything works except the attachment search.... how do we go ahead....

-- AnoopR - 13 Sep 2007

I am already working on the attachment stuff: I looked at the solution from Plucene. They use products outside of Perl for stringification. The same is true for SocialText (they use KinoSearch as well; see http://www.eu.socialtext.net/open/index.cgi?kinosearch_due_diligence). I would like to avoid this: these products need separate installation, and this installation is always different for Windows and for each Linux distribution. I would prefer something from CPAN, but so far I haven't found anything good:

PDF::API2 has a method called "stringify" but it does something completely different: it puts the PDF into a string from which it can be reproduced (it's more like serialize in Java). All special characters etc. remain.

PDF::OCR::Thorough: Sounds good but it needs ImageMagick. In that case I could just as well use pdftotext directly.

CAM::PDF: Seems to be good but it does not work with all PDF documents: on testing about ten documents, I found one where the content was not read properly and another where it ran into an infinite loop. So this does not seem to be really mature.

Do you have any idea or experience with this stringification stuff?

I think I will try pdftotext and see how it works.

-- MarkusHesse - 13 Sep 2007

I uploaded a new version:

  • I did some refactoring on the code: Some of the common code is now gathered in lib\TWiki\Contrib\KinoSearch.pm. This avoids some of the code duplications (far from perfect but better than nothing).
  • The variable KINOSEARCHATTACHMENTSPATH is no more necessary. Instead I use TWiki::Func::getPubDir().
  • I implemented a first version of indexing attachments. To index PDF, I use pdftotext, a piece of xpdf. Thus the installation of xpdf is necessary. For Debian this is quite simple. Perhaps someone can give the installation instructions for other Linux distributions and for Windows.
-- MarkusHesse - 13 Sep 2007

I uploaded a new version:

Open points:
  • On indexing PDF, I have problems with special characters (German umlauts like ä, ö, ü, ...). I think I need to learn something about encodings. Perhaps someone can give me a hint.
  • Variables from forms are not indexed yet.
The installation is now a bit more complex, because the non Perl modules xpdf and antiword are needed. Perhaps someone could give the installation instruction for other Linux distributions and windows.

I tested only on TWiki 4.0. Perhaps someone could test the stuff on 4.1.

-- MarkusHesse - 16 Sep 2007

It works on 4.1. Thanks for this good plugin and for adding the PDF support! In the installation there are two minor issues though:

  • the kinosearch script doesn't have the right permissions (not executable), so the CGI section in the configure script raises a warning.
  • The new (in 4.1) "Find more extensions" part of the configure script doesn't recognise the plugin correctly. It is listed as available from twiki.org, but the installed version is not shown. Also the plugin version is given as "Tested with 5.8.0" instead of the actual plugin version. As you are working on 4.0. you don't have this dialogue, and I have no idea where you have to change something to make it work. BTW SearchEnginePluceneAddOn has exactly the same problem.
-- StephanMatthiesen - 28 Sep 2007

I've noticed two things that perhaps could be improved

  • In the search results, perhaps the sectioning commands shouldn't be rendered. It's a bit confusing when you get fat headlines in the middle of the search results....
  • Formfields are not searched, are they?
-- StephanMatthiesen - 29 Sep 2007

Hi Stephan,

great to hear, that it works on 4.1. Sooner or later I will upgrade to 4.1 and check for the things you mentioned.

If I understand you right, you have KinoSearch and Plucene running together. Perhaps you can publish some benchmark numbers: time for indexing, time for search and MB used by those processes. It would be very interesting to see those numbers.

On the sectioning in the results: in fact I need a stringifier for TML. At the moment the TWiki pages are indexed as simple ASCII files without any changes. Thus things like ---++ just go into the index and they are shown in the result as well (I don’t use the summary function of TWiki, because I like the one from KinoSearch with the highlighting of the searched words. BTW: This is especially useful for attachments where TWiki cannot give any summary). To fix this problem, I need something that strips the TML elements out of the TWiki text files. Is there something within TWiki? Does anybody know?

Formfields: I am just working on that. Hopefully I can release that this weekend.

-- MarkusHesse - 29 Sep 2007

I uploaded a new version:

  • Form Fields are now indexed
  • The index directory is now configured via LocalSite.cfg (following Peter's advice)
  • The result representation is improved: Headlines are now normal text etc.
  • Many things refactored. Not perfect but I hope it is getting better.
The major functionality of KinoSearch is now complete. The next steps will be more cosmetic things.
  • I am afraid there are still some problems with special characters (ä, ö, ü, ...)
  • Additional stringifiers for PPT etc.
  • Indexing of attached ZIP-files.
-- MarkusHesse - 29 Sep 2007

If I include .xls in the list of searchable extensions, kinoindex runs out of memory every time I try to create an index.

kinoindex: Out of memory during array extend at /usr/local/share/perl/5.8.8/Spreadsheet/ParseExcel.pm line 904.

I'm using Spreadsheet::ParseExcel v0.32.

Has anyone else come across this issue?

-- MartinKaufmann - 01 Oct 2007

I tested this only with a few Excel sheets, but I had some pretty big ones (2.5MB) and found no problems. ???

-- MarkusHesse - 02 Oct 2007

It looks like it was caused by a broken (?) Excel file. The file itself is very small (only 40kB). If I skip this particular file, everything works OK.

-- MartinKaufmann - 02 Oct 2007

Great job guys! Some questions from a VM Debian newbie:

  1. When I type antiword in the command line it returns Command not found . However when I run the index, it is able to pass through small word documents but it hangs on one doc >3 MB. Is it a limitation on size or am I having issues with my install?
  2. I know it's third party question, but is antiword able to index word 2003 docs?
  3. Do I need to setup a cron job for the index? If so, is there a brave soul that would post a step by step kino cron job for the debian noob ?
Thanks a lot and keep up the good work

-- MiloValenzuela - 05 Oct 2007

I can give you a hand with the crontab script. You need at least one script for updating the kino index. On Debian, you can put this script in /etc/cron.hourly or /etc/cron.daily (depending on whether you would like to update the index hourly or daily). To do this (do all these steps as root), type cd /etc/cron.hourly, then open a new script: vi kinoupdate with the following content (make sure you adjust the paths):

#!/bin/sh
cd /path/to/twiki/kinosearch/bin;perl -I/path/to/twiki/lib -I/path/to/twiki/lib/CPAN/lib ./kinoupdate -q > /dev/null 2>&1

Leave out everything after -q if you'd like to get an hourly email with the output of the script (might be handy for debugging at the beginning). To save this script (don't be offended if this is obvious...) type ESC, then :wq.

If you'd like to rerun the whole indexing once a week/month (e.g. to pick up newly created form fields), create a second script ( vi /etc/cron.weekly/kinoindex) with the same content as above, just replace kinoupdate with kinoindex.

Regarding Antiword and Word 2003 documents: The Antiword webpage states the following:

Antiword converts the binary files from Word 2, 6, 7, 97, 2000, 2002 and 2003 to plain text

Having said that, I haven't tried it myself.

-- MartinKaufmann - 05 Oct 2007

On indexing Word files without having antiword installed: kinoindex will not complain about that, but it uses the default stringifier and not the specialised stringifier for Word. The default stringifier just takes the Word file as if it were an ASCII file (just try to read a .doc file with vi and you know what I mean). This results in very long strings to be indexed, which can take a long time or lead to crashes and (even worse) the search results will be incomplete.

So if you don't have antiword in place: Either disable the indexing of .doc files or provide an alternative stringifier.

-- MarkusHesse - 06 Oct 2007
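
The check described above can be done up front, before kinoindex runs. A minimal sketch, assuming a POSIX shell; have_tool is a hypothetical helper and not part of the add-on, and the tool list is simply the set of converters named in this discussion:

```shell
#!/bin/sh
# Report which external stringifier tools are on the PATH.
# have_tool is a hypothetical helper, not part of SearchEngineKinoSearchAddOn.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Converters named in this discussion: antiword (.doc), pdftotext (.pdf), ppthtml (.ppt).
for tool in antiword pdftotext ppthtml; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing - disable that file type or provide another stringifier"
    fi
done
```

If a tool is reported missing, the corresponding extension should be taken out of the list of indexed extensions until the tool is installed.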

I uploaded a new version. The sources are now released with the help of BuildContrib, so installation via the configure dialog in 4.1 is possible. I still have a problem with the additional CPAN modules: they should be installed automatically, but they aren't. Let's see...

I hope nobody tried the version 1.06: I released it too early and it contained some severe bugs. The current version (1.07) runs (as good or as bad) as the 1.05 did.

-- MarkusHesse - 07 Oct 2007

@ MartinKaufmann:

Thanks a lot for your steps. I ended up downloading antiword 0.37 and installing it manually. It indexed documents extremely fast, including Word 2003 files > 3 MB. The KinoSearch topic renders the expected results for topics and attachments correctly.

Regarding my Debian noob-ness: I'm dealing with the TWiki VM (Debian) and I can interact with it through Samba, which lets me handle the Debian file system from Windows. I ended up using PSPad to create the hourly files in the appropriate cron folders. I assumed that vi would be doing exactly the same. Would that suffice for the cron setup? How does it know where to send the hourly debugging email?

Appreciate your help!

-- MiloValenzuela - 09 Oct 2007

Creating cron files through samba should work as well - however, I see the possibility of file permission issues. You can get the emails sent to you by adding the following line to your /etc/crontab file (and adjusting the email address):

MAILTO=your.name@company.com.invalid

This only works if the mail server is set up and working. I don't know whether this is the case for the TWiki VM.

Just a piece of advice: If you intend to work with TWiki (or other Unix/Linux applications) seriously, it's definitely worth investing some time in learning to use the shell. Most tasks can be done if you know just a few commands.

-- MartinKaufmann - 10 Oct 2007

Point well taken Martin. I need to fix my micro$oft addiction. I'll try the approach.

-- MiloValenzuela - 10 Oct 2007

Dear Martin,

The kinosearch script seems to work successfully. However, when I checked the error logs in Apache right after a successful query, it registers:

kinosearch: Use of uninitialized value in concatenation (.) or string at /home/httpd/twiki/lib/TWiki/Contrib/KinoSearch/Search.pm line 112., referer: [Here It Path to WebHome]

And line 112 is:

$tmplSearch =~ s/%SEARCHATTACHMENTSONLY%/<a href="%SCRIPTURLPATH%\/kinosearch\/$webName\/?search=$tempVal\%20\%2Battachment:yes">$attachmentsOnlyLabel<\/a>/go;

Is this something that I need to setup that I forgot?

Thanks a lot for the quick answer and support.

-- MiloValenzuela - 10 Oct 2007

After a new installation I get the following error when running the initial kinoindex:

wiki:/srv/www/twiki/kinosearch/bin # ./kinoindex
KinoSearch index files init
- to suppress all normal output: kinoindex -q
Use of uninitialized value in split at /srv/www/twiki/lib/TWiki/Contrib/KinoSearch/KinoSearch.pm line 48.
Use of uninitialized value in split at /srv/www/twiki/lib/TWiki/Contrib/KinoSearch/KinoSearch.pm line 59.
Use of uninitialized value in concatenation (.) or string at ./kinoindex line 70.
Variables to be indexed:
Required parameter 'invindex' not supplied at /srv/www/twiki/lib/TWiki/Contrib/KinoSearch/Index.pm line 31

In the documentation I didn't really understand "Check in any installed files that have existing ,v files in your existing install (take care not to lock the files when you check in)". I hope I didn't miss some FAQ or similar basics.

-- IngoKappler - 01 Nov 2007

Hi Ingo,

your error message looks a little bit strange: The file kinoindex.pm has only 31 lines in the current version. Thus an error at line 70 indicates that you use an old version.

On the documentation: I copied some stuff from BuildContrib. Maybe some of that can be omitted.

To Milo: There are really some errors causing reports in the apache log. I am just working on that.

-- MarkusHesse - 01 Nov 2007

Hi Markus,

thanks for your quick answer, it pointed me in the right direction. I manually reinstalled the .tgz file and it started working. :-)

I assume the previous issue was somehow caused by using the automatic installer. I tried using it with zip and tgz files and ended up with an error each time, although extraction appeared OK at first. Here are the errors I got; maybe they help to improve the automatic installation process:

./SearchEngineKinoSearchAddOn_installer.pl
...
  inflating: lib/TWiki/Contrib/KinoSearch/StringifierPlugins/XLS.pm
  inflating: lib/TWiki/Contrib/KinoSearch/StringifyBase.pm
   creating: templates/
  inflating: templates/kinosearch.pattern.tmpl
Archive unpacked
Install lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC.pm, permissions 0444
Install failed: No such file or directory



./SearchEngineKinoSearchAddOn_installer.pl
...
Install lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Stringifier.pm, permissions 0444
Install bin/kinosearch, permissions 0544
Install kinosearch/index/, permissions 0755
Could not create kinosearch/index/.bak: Is a directory
Install failed: Is a directory

-- IngoKappler - 02 Nov 2007

Hi Ingo,

great to hear, that it is working now.

Don't know, what went wrong on your first try: In the current version, the directory lib/TWiki/Contrib/KinoSearch does not exist any more. Instead the path is lib/TWiki/Contrib/SearchEngineKinoSearchAddOn. Thus I think you mixed up some old and new stuff, but I have really no idea how you came to that point.

-- MarkusHesse - 02 Nov 2007

Markus --

I've modified some of the Kino syntax instructions on my own installation, and thought you might want to propagate them to the release version. My version is:


Query syntax

  • To search for a word, just put that word into the Search box. (Alternatively, add the prefix text: before the word.)
  • To search for a phrase, put the phrase in "double quotes".
  • Use the + and - operators, just as in Google query syntax, to indicate required and forbidden terms, respectively.
  • To search on metadata, prefix the search term with field: where <field> is the field name in the metadata (for instance, author).

Query examples

  • text:kino or just kino
  • text:"search engine" or just "search engine"
  • author:MarkusHesse — note that to search for a TWiki author, use their login name
  • form:WebFormName to get all topics with that form attached.
  • CONTACTINFO:MarkusHesse if you have declared CONTACTINFO as a variable to be indexed
  • type:doc to get all attachments of given type
  • web:Sandbox to get all the topics in a given web
  • topic:WebHome to get all the topics of a given name
  • +web:Sandbox +topic:Test to get all the topics containing "Test" in their titles and belonging to the Sandbox web.
Note: the current version of KinoSearch does not support wildcards.

-- EaCohen - 26 Nov 2007

I'm having some trouble with Portuguese accents inside doc and ppt files. When KINOSEARCHANALYSERLANGUAGE is "en" (inside TWikiPreferences), the search with accents works, but it shows wrong characters in the search results (characters with accents are replaced by strange symbols). If I change KINOSEARCHANALYSERLANGUAGE to "pt", the strange characters continue to appear, but the search for characters with accents stops working (it can't find these characters any more). I guess this problem is related to the specific parsers (antiword and ppthtml), because xls files don't have this problem with Spreadsheet::ParseExcel.

-- GuilhermeGarnier - 26 Nov 2007

Ehm…. Never tried anything with Portuguese accents (reminds me to do some tests with German umlauts). On KINOSEARCHANALYSERLANGUAGE: this is only used for the indexing done by KinoSearch. The so-called stemming, i.e. the splitting and shortening of words, is done depending on the language. E.g. in English "words" would be stemmed to "word" and should be found that way (at least this is what I understood from the documentation of KinoSearch). Thus you may be right that the problem arises from the stringifiers.

If you could analyse the problem a little bit more and perhaps find a solution or workaround, I can integrate it in the next release.

-- MarkusHesse - 28 Nov 2007

Markus,

I found out that this problem with accents is related to the type of encoding used on the files attached. When I use ppthtml to convert ppt files to html, for example, it creates a file encoded with utf8. But, as twiki uses iso-8859-15, characters with accents are not correctly shown. The same problem occurred when I attached a txt file encoded with utf-8.

To solve this problem, I'm updating the stringifier plugins for each file type, to correctly encode each one in iso-8859-15 (actually, just for HTML and Text, because the other formats are converted to one of these for parsing). As soon as I finish, I can send you the code, if you wish.

About the stemming, I don't think it's really working for Portuguese. I changed KINOSEARCHANALYSERLANGUAGE to "pt", recreated the index and searched for "testes" ("tests" in Portuguese), hoping it would find occurrences of the singular word "teste". It did find "teste", but also found "test", which I didn't want (and which isn't a Portuguese word). I don't know if this problem only occurs when I mix two languages inside TWiki documents and attachments.

-- GuilhermeGarnier - 29 Nov 2007
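
The conversion step described above can be tried out on the command line before touching the stringifier plugins. A minimal sketch, assuming the iconv command-line tool is installed; it stands in for the Perl-side conversion inside the plugins, and to_latin9 is a hypothetical helper name:

```shell
#!/bin/sh
# Convert UTF-8 text on stdin to ISO-8859-15 on stdout.
# to_latin9 is a hypothetical helper name; the iconv command must be installed.
# iconv exits with an error if a character has no ISO-8859-15 equivalent.
to_latin9() {
    iconv -f UTF-8 -t ISO-8859-15
}

# Example (file names are placeholders):
#   to_latin9 < slides-utf8.html > slides-latin9.html
```

Running the output of a converter such as ppthtml through a filter like this shows quickly whether the strange characters in the search results come from the encoding mismatch or from somewhere else.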

Also, I'm making some other updates:

- delete every temp file created during the file conversions;

- inside the XLS.pm plugin, when you have a cell with a numeric value, it shows "GENERAL" in the search results. The solution for that is to replace $cell->Value with $cell->{Val}

-- GuilhermeGarnier - 29 Nov 2007

I found another bug: the KINOSEARCHATTACHMENTSONLYLABEL variable appears as KINOSEARCHSEARCHATTACHMENTSONLYLABEL on the plugin documentation. One of them should be changed.

-- GuilhermeGarnier - 29 Nov 2007

Hi, My platform is "Red Hat Enterprise Linux ES release 4" and have some problems:

  1. Has anyone successfully installed KinoSearch on RedHat for attachment (.doc, .ppt, & .xls) full-text search?
  2. I can't find ppthtml for RedHat....=(
  3. I am using TWiki 4.1, but when I go to Plugins->Find More Extensions, it always times out and leaves a "Consulting TWiki.org..." message on the page.
-- MagicYang - 03 Dec 2007

Hi MagicYang

1. My platform is CentOS 4.4, which is based on RedHat ES 4. It's working here

2. Actually you should search for xlhtml, which includes ppthtml. I installed from the source (http://prdownloads.sf.net/chicago/xlhtml-0.5.tgz)

3. I didn't install this way. I downloaded the plugin, and unzipped it inside my twiki home

-- GuilhermeGarnier - 03 Dec 2007

It works and the result is good!

BTW, has anyone ever tried to search for keywords in languages other than English? I tried some Chinese words and kinosearch can't find anything! Should I change the KINOSEARCHANALYSERLANGUAGE setting?

BTW2, has anyone done kinosearch performance tuning? My index file is 10 MB now, but it needs 25 sec to return a search result!

-- MagicYang - 05 Dec 2007

25 sec for a search is by far too slow. I have a 210 MB index and the result of a search comes out in less than 2 seconds. If the result is very big it may take longer to render the result. But I always see the first result after two seconds (I run TWiki in a very small (150 MB) virtual machine, so with better hardware, even better results should be possible.)

-- MarkusHesse - 05 Dec 2007

I tried to search portuguese words, and it worked fine, except for the problems reported before (29 Nov 2007, my first comment).

My searches here are very fast, but my development environment is really small. I will only have a clue after installing on production.

-- GuilhermeGarnier - 05 Dec 2007

Thanks for your information! I will look into it to see what the problem is!

I want to schedule kinoupdate to regularly update my index:
01 * * * * /userap/users/teamspace/twiki/kinosearch/bin/kinoupdate
But I get the problem below (it looks like a relative path problem):

[teamspace@ts01 ~]$ /userap/users/teamspace/twiki/kinosearch/bin/kinoupdate
Can't locate ../../bin/setlib.cfg in @INC (@INC contains: . /usr/lib/perl5/5.8.5/i386-linux-thread-multi /usr/lib/perl5/5.8.5 /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.4/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.2/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.1/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/perl5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.4/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.3/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.2/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.1/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/vendor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/vendor_perl/5.8.0 /usr/lib/perl5/vendor_perl .) at /userap/users/teamspace/twiki/kinosearch/bin/kinoupdate line 18.
BEGIN failed--compilation aborted at /userap/users/teamspace/twiki/kinosearch/bin/kinoupdate line 18.

Any suggestion?

-- MagicYang - 06 Dec 2007

You have to tell Perl where to look for all the required files. I use the following command:

cd /home/httpd/twiki/kinosearch/bin;perl -I/home/httpd/twiki/lib -I/home/httpd/twiki/lib/CPAN/lib ./kinoupdate -q > /dev/null 2>&1

-- MartinKaufmann - 06 Dec 2007
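
The same command can be wrapped in a small script so that the installation path is set in one place. A minimal sketch, assuming a POSIX shell; TWIKI_ROOT and run_kino are hypothetical names, and the placeholder path must be adjusted to the local installation:

```shell
#!/bin/sh
# Hypothetical cron wrapper; TWIKI_ROOT and run_kino are assumptions for illustration.
TWIKI_ROOT="${TWIKI_ROOT:-/path/to/twiki}"

# Run a KinoSearch script (kinoupdate or kinoindex) from its own directory,
# so that relative paths such as ../../bin/setlib.cfg can be found.
# The function body is a subshell, so the caller's working directory is unchanged.
run_kino() (
    cd "$TWIKI_ROOT/kinosearch/bin" || exit 1
    perl -I"$TWIKI_ROOT/lib" -I"$TWIKI_ROOT/lib/CPAN/lib" "./$1" -q
)

# Only run when the installation directory actually exists.
if [ -d "$TWIKI_ROOT/kinosearch/bin" ]; then
    run_kino "${1:-kinoupdate}"
fi
```

Saved as an executable file, the script can be named in the crontab line in place of the bare kinoupdate path; passing kinoindex as the first argument gives the full re-index.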

I am getting the error below when I execute ./kinoindex

[root@test bin]# ./kinoindex
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
pptHtml - Outputs Power Point files as Html.
Usage: pptHtml <FILE>
KinoSearch index files init
- to suppress all normal output: kinoindex -q
Indexing started
Skipping Trash topics
Invalid parameter: 'create' at /home/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 128
 at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Util/Class.pm line 17
        KinoSearch::Util::Class::new('KinoSearch::InvIndexer', 'invindex', '/home/twiki/pub/../kinosearch/index', 'create', 1, 'analyzer', 'KinoSearch::Analysis::PolyAnalyzer=HASH(0xb502118)') called at /home/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 128
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexer('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0xa02...', 'KinoSearch::Analysis::PolyAnalyzer=HASH(0xb502118)', 1, 'SKIN', 1, 'WIKIWEBMASTER', 1, 'WEBRSSCOPYRIGHT', 1, ...) called at /home/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 58
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::createIndex('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0xa02...', 1) called at ./kinoindex line 29

Any Suggestion to fix above error ?? Thanks in advance...

-- AnjaniKumar - 06 Dec 2007

I cannot reproduce the error in my installation. One idea could be to check on the versions: Did you install the latest version of KinoSearch? I am working with the version 0.16 (but I worked with 0.15 previously and that should work also).

-- MarkusHesse - 06 Dec 2007

Markus,

as I wrote previously, I made some changes in the stringifier plugins, to correct some bugs I found. Do you want to see what I did, so you could update the current version of the plugin?

-- GuilhermeGarnier - 07 Dec 2007

Of course I want to see your enhancements. Just post them here and I will integrate them in the next release.

-- MarkusHesse - 07 Dec 2007

Here it is. I'm attaching SearchEngineKinoSearchAddOn-StringifierPlugins.zip, with the stringifier plugins I changed (I didn't change any other file).

Here is a description of what I did:

- PDF: added ">/dev/null 2>&1" on the first call to pdftotext, so that its header isn't repeated after every attached pdf; closing the filehandler and deleting the temp file;

- PPT: same as PDF;

- XLS: when a cell had a numeric value, it was converted to "GENERAL". I just replaced "$cell->Value" with $cell->{Val};

- Text: I had some trouble with encoding of characters with accents, so I used the Text::Iconv module to convert text to iso-8859-15;

- HTML: Same as Text files, but in some HTML files I've tested, Text::Iconv didn't work, so I used the Encode module here;

- Doc: Antiword can't convert some small files (explanation here: http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg44415.html), so I replaced it with wv. I also tested with Abiword, which also worked, but it has many more prerequisites to install. Anyway, I added 3 versions of Doc.pm inside the zip file: Doc.pm (using wv to convert to HTML), Doc.pm.antiword and Doc.pm.abiword. You can choose any one of them, but I prefer wv.

Besides that, the KINOSEARCHATTACHMENTSONLYLABEL variable appears as KINOSEARCHSEARCHATTACHMENTSONLYLABEL in the plugin documentation. One of them should be changed.

If you choose to use these files in the next version, you should add the Encode and Text::Iconv Perl modules to the plugin prerequisites.
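For readers who want to see what the Encode-based conversion mentioned above could look like, here is a minimal sketch. It is not the code from the attached zip: the helper name `to_latin9` and the UTF-8 default for the source encoding are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of the add-on): convert raw bytes from the
# encoding $from (assumed to be UTF-8 if not given) to iso-8859-15.
# Characters that do not exist in iso-8859-15 are replaced by Encode's
# substitution character.
sub to_latin9 {
    my ($bytes, $from) = @_;
    $from ||= 'UTF-8';
    my $text = decode($from, $bytes);       # bytes -> Perl characters
    return encode('iso-8859-15', $text);    # characters -> Latin-9 bytes
}
```

A stringifier could call such a helper on the extracted text before handing it to the indexer. The source encoding still has to be known or detected first.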

-- GuilhermeGarnier - 07 Dec 2007

Hi Markus, I am using the following to have a kinosearch input field in the upper right corner that behaves and looks exactly like the normal search field there. So it could either replace the default field or be added without interfering with the default look and feel. You may want to incorporate it into the examples delivered with kinosearch.

   * <form name="kinosearch" action="%SCRIPTURLPATH%/kinosearch%SCRIPTSUFFIX%/%INTURLENCODE{"%INCLUDINGWEB%"}%/">
  %IF{" '%URLPARAM{"search" encode="entity" scope="topic"}%' = '' " then="<input type=\"text\" class=\"twikiInputField patternFormFieldDefaultColor\" name=\"search\" value=\"%MAKETEXT{"Search Index"}%\" size=\"14\" onfocus=\"clearDefaultandCSS(this);\" onblur=\"setDefaultText(this);\" />" else="<input type=\"text\" class=\"twikiInputField\" name=\"search\" value=\"%URLPARAM{ "search" encode="entity" scope="topic" }%\" size=\"14\" />"}%</form>

-- IngoKappler - 11 Dec 2007

Hi Markus, I want to change my search keyword appearance to RED (not just bold) in each search result item's summary paragraph. What can I do?

-- MagicYang - 13 Dec 2007

The grey text is at the moment analogous to the grey text in a "normal" TWiki search, where the summary is also printed grey. Thus the summaries of KinoSearch are also grey, but with bold words to highlight the found words. The problem is that the hits are not visible at first glance. But making them red would be too much in my opinion.

The Google-like approach would be to make all text black with bold black hits. I think that would be nicer.

-- MarkusHesse - 13 Dec 2007

Markus, did you take a look at the files I sent?

After updating these files, I updated my production server with them, and I got some other problems with encoding. I was able to correct most of them, but I'm still testing.

-- GuilhermeGarnier - 13 Dec 2007

Dear Markus:

Sometimes attaching a file does not cause a topic version change, and kinoupdate then does not re-index. Is there any solution for this problem?

-- MagicYang - 14 Dec 2007

Indexing finally started with KinoSearch-0.162.tar.gz, but it is skipping attachments such as doc and html files. I checked that antiword and ppthtml are installed.

sh: ppthtml: command not found
sh: ppthtml: command not found

Even after indexing has completed, it does not give the desired result; it should at least return search results for topics.

-- AnjaniKumar - 14 Dec 2007

Anjani: The command-line message "command not found" means that ppthtml is not installed. Perhaps the same is true for antiword.

Magic: I have not seen that, but I have not tested it enough. I am thinking about a mechanism that does not rely on the .changes file but scans all files for changes since a certain time stamp. I don't know when I can implement that.

Guilherme: First of all, thank you for your contribution. I had a first look at the files but have not spent much time on them up to now. If you have an update, please post it.
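The time-stamp scan mentioned above could be sketched roughly as follows. This is only an illustration, not code from the add-on: the helper name `changed_since` is made up, and it only looks at topic files (*.txt) below one directory; attachments would need a similar pass.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return all topic files (*.txt) below $data_dir whose
# modification time is newer than $since (epoch seconds).
sub changed_since {
    my ($data_dir, $since) = @_;
    my @changed;
    find(
        sub {
            return unless -f $_ && /\.txt$/;
            my $mtime = (stat _)[9];    # reuse the stat data from -f
            push @changed, $File::Find::name if $mtime > $since;
        },
        $data_dir
    );
    @changed = sort @changed;
    return @changed;
}
```

An update script could store the time of its last run and pass it as `$since` on the next run, so that no entry in .changes is needed.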

-- MarkusHesse - 14 Dec 2007

OK Markus, I'm finishing my tests and I should send you the new files soon.

May I replace the zip file that I attached before, or should I upload it with a new name?

-- GuilhermeGarnier - 14 Dec 2007

Just upload the new version of the file with the same name. TWiki also manages the versions of attachments. If anybody is interested in an older version of that file, they can download it through manage and the version history of that file.

BTW: If you have written any unit tests, please send them also.

-- MarkusHesse - 15 Dec 2007

Markus: I think it may be a TWiki versioning problem. The test is easy: just upload a file and run kinoupdate, and SOMETIMES nothing happens in the kinoupdate console output...

-- MagicYang - 16 Dec 2007

Markus, I updated the attached zip file with my latest changes.

I updated the following files:

  • Index.pm: when the topic name started with a character with an accent, I got some warnings on kinoindex/kinoupdate. I found out that the problem was in tripFirstchar function;

  • Search.pm: in the search results, characters with accents inside the attachment comments appeared as ".". I commented out a line in this file that made this substitution (is there any reason for that line anyway?);

  • DOC.pm: on some files I got an error message on kinoindex/kinoupdate, and it wasn't deleting the jpg, png, wmf and emf files created by wvHTML;

  • HTML.pm: When I tested some different files, I had more encoding problems. I solved that using CharsetDetector CPAN module to check the HTML file encoding (and using Encode module to convert to iso-8859-15) - some of them were DOC files converted to HTML;

  • PDF.pm: as it wasn't checking the pdftotext result (neither the return value nor whether the text file was actually created), it was crashing when trying to parse corrupted PDFs;

  • PPT.pm: the same problem as PDF.pm;

  • Text.pm: similar to HTML.pm, same solution (I also replaced the Text::Iconv module with Encode). Some of the encoding problems were with PPT and XML files treated as text;

  • XLS.pm: problems with corrupted XLS files. This was more complicated, because it wasn't a problem in the plugin. On line 22, Spreadsheet::ParseExcel::Workbook->Parse simply crashed and exited kinoindex/kinoupdate. I found out the problem was in the OLE::Storage_Lite module (used by Spreadsheet::ParseExcel). To solve that, I had to update /usr/lib/perl5/site_perl/5.8.5/OLE/Storage_Lite.pm and replace line 43 ( die "Error PPS:$iType $sNm\n") with return undef. After that, when kinoindex reads my corrupted XLS file, I see some warnings inside the log file, but kinoindex continues to run normally. I don't think this is the best solution, but it was the simplest I found;
Note that I'm using CharsetDetector CPAN module, which should be added to the prerequisites, and that I'm not using Text::Iconv anymore (it was replaced with Encode).
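The kind of result check described for PDF.pm and PPT.pm can be sketched as below. This is an illustration under assumptions, not the code from the zip file: `run_converter` is a made-up helper that takes the converter command as a list (so no shell is involved) and checks both the exit status and the output file.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external converter and return the text it
# wrote to $out_file. Returns '' if the command fails or if the output
# file is missing or empty, so a corrupted attachment cannot crash the run.
sub run_converter {
    my ($out_file, @cmd) = @_;
    my $status = system(@cmd);
    return '' if $status != 0;        # non-zero exit, or command not started
    return '' unless -s $out_file;    # no output file, or empty file
    open(my $fh, '<', $out_file) or return '';
    local $/;                         # slurp the whole file
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Possible call for a PDF attachment ($pdf_file and $tmp_txt are
# placeholders; -q tells pdftotext not to print messages):
# my $text = run_converter($tmp_txt, 'pdftotext', '-q', $pdf_file, $tmp_txt);
```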

I didn't write unit tests, I simply attached many different attachments, ran kinoindex and searched for different terms.

-- GuilhermeGarnier - 17 Dec 2007

Hi Guilherme,

really great stuff, thank you very much. Could you perhaps send some of the files you used for testing? This way I could update my unit tests. I already did some tests with German umlauts but could not really reproduce all problems.

-- MarkusHesse - 18 Dec 2007

Dear Markus:

I want to ask a performance question again! Just like you said, my search now returns the first page quickly (2~3 sec), but rendering the rest of the results (46 items) needs ~1 min. My index file is 12MB now. Is there any way to tune the performance?

-- MagicYang - 18 Dec 2007

I don't think tuning is necessary, as it takes about 3-4s on my system with an index of 19MB to display 44 search results. I guess there is something wrong with your setup.

-- MartinKaufmann - 18 Dec 2007

Sorry Magic, but I have to give the same answer as Martin: My system renders about 300 items in 20 seconds. My index is 210 MB in size and consists of about 10.000 topics. So there must be something wrong with your system, and I cannot search for a performance problem here.

If you can solve it, perhaps you can contribute the solution, so that others can benefit from your experience.

On thinking about it a little, I have one idea: The add-on fetches the hits from KinoSearch, but then it additionally checks whether the current user is authorised to read that topic (see Search.pm: if (! $self->topicAllowed($restopic, $resweb, $text, $remoteUser))). Somewhere in the Support web I read about performance issues with TWiki authorisation. Perhaps this hint can help you to find the problem.

-- MarkusHesse - 18 Dec 2007

I uploaded a new version. Thanks for the contributions from several sides. I hope I included most of the things.

  • I enhanced the stringifiers a lot with the help of GuilhermeGarnier (New stringifiers using vwWare and abiword, enhancements for encodings)
  • I changed the template a bit: The 1st line is now the same as for normal TWiki search. The summary is now black and not grey, so that the found words are better readable.
  • Several little bugfixes
Guilherme: I took your code for vwWare nearly 1:1. But I see some problems at the lines
    $cmd = "rm -f " . $tmp_file . "*";
    `$cmd`;

This will only work on Unix. On Windows (without Cygwin) this will cause trouble. Perhaps you have some ideas to improve that (NOTE: At the moment I cannot test that code, because I have problems installing vwWare).
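One possible way to avoid the external rm call is to let Perl expand the pattern and delete the files itself. The following is a sketch only, not tested on Windows: `remove_tmp_files` is a made-up name, and it assumes the temp file name contains no glob metacharacters.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical replacement for `rm -f $tmp_file*`: remove every plain file
# whose name starts with $tmp_file, without calling an external command.
# bsd_glob (unlike the built-in glob) does not split the pattern on blanks.
sub remove_tmp_files {
    my ($tmp_file) = @_;
    my @files = grep { -f $_ } bsd_glob($tmp_file . '*');
    return unlink @files;    # number of files actually deleted
}
```

Since the files created next to the temp file are plain files, no recursive removal is needed for this case.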

On XLS.pm I kept the Text::Iconv converter: Without it, I can reproduce problems with German Umlauts. Perhaps, I can rewrite that to use the Text.pm as a general point to do encodings.

-- MarkusHesse - 18 Dec 2007

Markus,

I've tested this new version, with the adjustments you made. Some comments:

  • you forgot to include the Encode module on the prerequisites
  • the DOC parser is wv, not vw...
  • I see you kept the 3 DOC parsers, but how can you choose which DOC parser you want to use? If you have only one of the 3 parsers installed, it will use it, but if you have 2 or 3 of them, I guess it depends on the order in which Perl reads each .pm file. In my tests, kinoindex loaded DOC_vw.pm first, then DOC_abiword.pm and DOC_antiword.pm, and the last one was used for DOC parsing. To choose a specific parser, you would have to delete the pm files you don't want to use
  • about the Windows incompatibility, you're right, I didn't notice that. The "rm -rf" command should be replaced, but I think you would have to implement a recursive function using unlink
  • about the note inside PDF.pm and DOC.pm.antiword ("This way, the encoding of the text is reworked..."), I think it may be true on files that already are iso-8859-15. But if you have utf-8 strings, you should reencode them. As this job is already done inside Text.pm, I thought it would be better to do that way
  • about XLS.pm, I tested with a XLS file that has characters with accents, and the search stopped working for these characters
  • still on XLS.pm, I didn't understand the code you inserted (to check if raw_value is different from formatted_value). When you have a cell with numeric value (ex: 123), it writes "GENERAL 123"
I'm attaching a test.xls file that has numeric cells and accents. When I attached this file, I had the last 2 problems reported above (try to search for "Formatao", it doesn't return any results). When I remove the two $converter->convert lines, it worked again. I also tried to replace Text::Iconv with Encode and it worked too. Maybe you can try this.

-- GuilhermeGarnier - 19 Dec 2007

There appears to be a small mistake regarding the installation description: "Install xpdf and ppthtml, if you want to index attached Word and PDF files:" I assume a corrected statement rather should be similar to: "Install xpdf and ppthtml, if you want to index attached PDF and ppt files:"

-- IngoKappler - 27 Dec 2007

Hi Guilherme, hi Ingo,

thank you for testing and looking at the details I missed. Sorry, that my answer is a bit late, but I was a bit busy the last days. I hope I can fix some points in the next days.

I will fix some documentation stuff (Encode prerequisite, wv/vw...).

On the three DOC stringifiers: I would like to keep all three. The admin may decide which backend tool he wants to use (in some situations there is no choice, because some tools are not available on a certain platform etc.). The current solution takes the correct stringifier when only one backend is installed. If more than one is installed, the admin should delete the unwanted stringifiers (as described in the documentation). I think this is the most flexible solution.

BTW: It would be very interesting to have a kind of benchmark for those three backends:

  • Are there any restrictions? (e.g. antiword has problems with very short documents, but only if they come from Open Office).
  • Does anybody have performance benchmarks on the backends? (on big files, on large number of files,...)
On the encoding itself: I would like to refactor the whole thing again. How about a service method in the base class that all stringifiers can use? Then we have a common solution for all stringifiers, encapsulated in one place, which can be improved for all of them at once if needed. For this it would be good to have more test files (perhaps Guilherme can send some more...).

On XLS: I will check again (thank you for the test file). I never got the problem with "GENERAL". According to the documentation I can get the raw value from $cell->{Val} and the formatted value from $cell->Value. This means for instance, if the raw value is 1000000 the formatted value may be 1.000.000 . By taking both values, the user can search for both of them. I thought that could be a good idea...
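The idea of indexing both values of a cell can be sketched without Spreadsheet::ParseExcel itself. The helper below is hypothetical and not the add-on's code: it takes the raw value (as from $cell->{Val}) and the formatted value (as from $cell->Value) as plain strings. Dropping a formatted value that is the literal "GENERAL" is an assumption based on the "GENERAL 123" output reported above; the raw value is always kept, so a cell that really contains the text GENERAL is still indexed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the strings to index for one spreadsheet cell.
# The raw value is always kept. The formatted value is added only if it
# differs from the raw value and is not the literal "GENERAL".
sub cell_terms {
    my ($raw, $formatted) = @_;
    my @terms;
    push @terms, $raw if defined $raw && length $raw;
    if (defined $formatted && length $formatted) {
        my $same = defined $raw && $formatted eq $raw;
        push @terms, $formatted unless $same || $formatted eq 'GENERAL';
    }
    return @terms;
}
```

With a raw value of 1000000 and a formatted value of 1.000.000, both strings are returned, so a search for either form can match.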

-- MarkusHesse - 28 Dec 2007

Hello, how can I index linked ( SomeDocument ) but not attached localhost html documents?

-- ElviraMiller - 28 Dec 2007

Interesting idea, but on thinking about it, it becomes very complex quite soon: If you index documents on the same host, why not index files in the intranet? Perhaps you can configure a list of file servers that may be indexed. But now you run into problems accessing those servers: not all servers available from your client are automatically available from the TWiki server. Furthermore, what about access rights to the files? Often the TWiki installation has its own users that need not match the users of the intranet. At the moment I see too many problems that are too complex to implement this. But perhaps someone else has some ingenious ideas how to realise this.

-- MarkusHesse - 28 Dec 2007

Hello Markus, thanks for the reply. The files have the rights of www-data (server user rights) and are accessible through TWiki. There is a link. HTMLonTWiki describes how you can search in HTML, but the search result is only the root document and not the linked document which contains the search string. Is it possible to modify the search so that the document which contains the string is listed?

-- ElviraMiller - 29 Dec 2007

If you have the files already in the attachment area and you play around with the TWiki txt files with sed or whatever, why not making the linked files real attachments in the TWiki sense? Just make an attachment via the "normal" TWiki web interface and have a look at the txt file and see, how TWiki saves attachments in the file. There should be something like %META:FILEATTACHMENT{....}. Just do the same with the files you refer to.

-- MarkusHesse - 29 Dec 2007

Hello Markus, I'm sorry for the late answer.

I agree with you about the DOC stringifiers. I just think it would be nice to add a note with your explanation above on the installation document.

I'm sending another update. I corrected some bugs and also added some suggestions my co-workers gave to me.

  • data/TWiki/KinoSearch.txt -> I added a maxlength parameter to the search text, and I added a javascript code to the submit button, to avoid searches with empty or one-character strings - I don't know if you'll want to add this to the plugin, but I think it may be a good idea to at least check for searches with an empty string. Actually, it would be even better to trim white spaces before checking;

  • lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Search.pm -> I decided to set an absolute limit on the search results, because when a search got some hundreds or thousands of results, the browser locked up after some minutes of slowly loading the results. I added a variable "KINOSEARCHMAXLIMIT" on the TWikiPreferences page for the limit (it's used when the limit search field is set to "all"). It's working fine, but maybe a better solution would be to page the results;

  • lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC_*.pm and PPT.pm -> I redirected STDERR to /dev/null, just to avoid getting error messages in the logs (the command result value is already being checked). Also, when the attachment filename had some characters like "(" or ")", I got an error message when running the command line, because these characters should be escaped. To solve that, I enclosed the filename in single quotes (I didn't do that for the tmp filename because I assumed the tmpnam command creates names with just alphanumeric characters);

  • lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/XLS.pm -> I think the problem I got (numeric cells appeared as "GENERAL") occurs only with spreadsheets created with Openoffice (see http://www.perlmonks.org/?node_id=461873). I'm now checking if the cell is numeric or not, and using $cell->{Val} or $cell->Value accordingly. I also replaced Text::Iconv with Encode; I got better results with this (some of my files weren't correctly converted by Text::Iconv);

  • templates/kinosearch.pattern.tmpl -> Some people think it's annoying to have to go back after a search to make a new search, so I added the search form on the top of the search results page (with the same modifications I made to data/TWiki/KinoSearch.txt).
PS: I didn't work on the recursive file removal (to replace "rm -f" inside DOC stringifiers), did you try to do something about that?
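Enclosing the filename in single quotes handles "(" and ")", but it breaks again if the filename itself contains a single quote. A small quoting helper for POSIX shells could look like the sketch below. The name `shell_quote` is an assumption and not part of the add-on, and the quoting rules of the Windows command shell are different.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a filename for a POSIX shell. The name is
# wrapped in single quotes, and each embedded single quote becomes '\''
# (close the quote, add an escaped quote, reopen the quote).
sub shell_quote {
    my ($name) = @_;
    $name =~ s/'/'\\''/g;
    return "'$name'";
}

# Possible use with a redirection, as in the stringifiers
# ($file is a placeholder for the attachment path):
# my $cmd = 'antiword ' . shell_quote($file) . ' 2>/dev/null';
```

Where no redirection is needed, calling system with a list of arguments avoids the shell and the quoting problem altogether.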

-- GuilhermeGarnier - 07 Jan 2008

Hello Markus, does the function "show locked topics" work or not? I tried to lock (edit) a topic (a .lease file appears on the server) and searched again; the results are the same whether I check or uncheck the option!

-- MagicYang - 10 Jan 2008

Hello, Markus. I think there may be a problem indexing words in topics that have a hyphen in them. It looks like the hyphen is converted to a space during indexing, and then if you search for the hyphenated word, it doesn't come up as a match. At least, that's what I'm seeing.

-- PhilipBarrett - 11 Jan 2008

Hi Philip, KinoSearch tries to split compound terms into their single words. Thus it reads "something-combined-together" as three words: "something combined together". The same is true for combinations with underscores. Thus "something_with_underscore" will be treated as "something with underscore". This feature is extremely useful, as you can search for the single words and need not know the complete word (Note: KinoSearch has no possibility to search with wildcards!). But of course you need to know about it (perhaps I will add some words to the documentation). If you want to search for "something-combined-together", you need to search for "something combined together". If you also add the " to the search string, you can be sure that the three words occur in that order, one after the other.

Maybe, we can enhance the add-on, so that it changes queries with "-" or "_" to queries with spaces, but I am not so deep in the KinoSearch details. Perhaps there are more things to be considered.
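A query rewrite of this kind could be sketched as follows. It is only an illustration of the idea and ignores any other KinoSearch query syntax that may need consideration; `normalise_query` is a made-up name, not an existing function of the add-on.

```perl
use strict;
use warnings;

# Hypothetical pre-processing step for the search string: replace hyphens
# and underscores that sit between two alphanumeric characters with a blank,
# so that "something-combined" is searched as "something combined".
# A leading "-" in front of a word is left untouched.
sub normalise_query {
    my ($query) = @_;
    $query =~ s/(?<=[^\W_])[-_]+(?=[^\W_])/ /g;
    return $query;
}
```

Wrapping the rewritten words in double quotes would additionally require that they occur in this order.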

-- MarkusHesse - 13 Jan 2008

Hi, great plugin, just one quick suggestion for improvement. Some of the external programs ( pdftotext, ppthtml) always produce unnecessary messages:

pdftotext version 3.00
Copyright 1996-2004 Glyph & Cog, LLC
pptHtml - Outputs Power Point files as Html.
Usage: pptHtml <FILE>

I suggest that you modify the command to add the quiet option ( pdftotext -q) or redirect to STDERR. Would be good for cron jobs... Thanks!

-- StephanMatthiesen - 16 Jan 2008

Hi MagicYang,

I am coming back to your remark from 14 Dec 2007: The update mechanism of this add-on looks into the file .changes to find out what has been changed. It seems that TWiki does not always write something in there. If you change a topic for the first time, it writes something; if you change it again shortly afterwards, it does not write to .changes. The same is true for attachments. So this seems to be a general problem.

Does anybody else have the same problem? Or does anybody know, if TWiki can be configured (or patched), so that it writes to .changes on any change?

-- MarkusHesse - 27 Jan 2008

Markus, I just added a comment on Bugs:Item5273. I didn't test it yet, but I think it may be a solution.

-- GuilhermeGarnier - 29 Jan 2008

I wonder if it is possible to include the search results in a topic similar to the %SEARCH% variable for the normal search. I was experimenting with

%INCLUDE{"%SCRIPTURL{"kinosearch"}%/%INCLUDINGWEB%?search=something"}%

but get the error that my site doesn't allow includes of URLs (which I don't want to change). Any ideas? Perhaps it would be a nice improvement to the Plugin to define a new variable %KINOSEARCH% somehow?

-- StephanMatthiesen - 02 Feb 2008

Is this plugin already a fully functional search engine? Could anybody post a screenshot of a results page? I would try a lot to overcome the poor results list of today. (I do not mean to offend anybody, but a lot of our staff quizzed me with good questions on why the TWiki search has an alphabetical order. See the discussion in GoodSortingForSearchResults.)

-- MartinSeibert - 04 Feb 2008

I've got a couple of questions regarding KinoSearch:

  1. Has anyone set up KinoSearch to completely replace the default search engine? Are there any drawbacks? Anything to look out for? I guess I'd only have to edit some template files.
  2. I'm considering using KinoSearch to also index and search a (network) directory outside TWiki's scope, maybe even an SMB directory. Has anyone done that already? I'm fully aware that this would require some hacking.
-- MartinKaufmann - 07 Mar 2008

Hi Martin,

I had hoped some of the users would answer you, so I will try to give some information.

You can replace the standard search box of TWiki (the one in the top right corner of each page) very easily. For this, see the contribution of IngoKappler from 11 Dec 2007 (also added as an example to the AddOn).

Another possibility I like very much is to put the search to the search engines of Firefox. Thus you have the search always in the Firefox toolbar and need not open a TWiki page to start searching.

The major handicap of kinosearch against the TWiki search is that it cannot cope with regular expressions and even wildcards are not possible. For most searches this is no problem. I think I never saw any user using regular expressions. In TWiki the regular expressions are mostly used to find values in form fields. For this kinosearch has special syntax (see documentation).

Another handicap is that the current kinosearch is not a plugin. I.e. it cannot be used like %KINOSEARCH…%.

Also, there is no immediate update of the index when adding or changing content. You can only run a regular update script via a cron job. I think this can be changed very easily by implementing a plugin with an appropriate handler.

Indexing other files than just the TWiki content could be very interesting. A similar idea came from ElviraMiller ( see 28 Dec 2007). Perhaps this way a real DMS based on TWiki can be setup.

For the purposes I use SearchEngineKinoSearchAddOn the current implementation is more than sufficient. Thus I am afraid I will not implement any more features (even though it would be great fun). If you want to extend the AddOn, have a look at the sources in SVN. Feel free to add new things. (Note that at the moment most features are checked with unit test. Thus I would appreciate, if on any change, the unit test would be kept up to date.)

-- MarkusHesse - 15 Mar 2008

Help!: Remote user not identified when searching with Kino

I'm running TWiki 4.2 with Kino installed and cannot authenticate users when requesting a kino search. Dumping the $remoteUser variable to a debug file always produces TWikiGuest. I have tried both TWiki and Apache authentication schemes. Client sessions are enabled and I have tried enabling "session IDs in URL" and IP matching. I should also note that the results page produced by a Kino search appears as though no user is logged in even though my Wiki Name appears after returning to any other page. How can I pass the correct Wiki Name to this add-on for authentication? The add-on shows: my $remoteUser = $TWiki::Plugins::SESSION->{user}->{login}||"TWikiGuest" as the process for obtaining the user's Wiki Name. Is this still correct for 4.2? I'd really like to use this add-on.

-- HarrisWong - 29 Mar 2008

Thank you so much, Markus, for creating this Add-On!

There are two serious issues we have here:

  • for a general ("text:") search, we would want all form fields to be included as well. Is there an easy way to get this? At the moment, topics which have the string in a form field will not be shown as a result.
  • the "type:" modifier does not work at all. "type:doc" does not result in any hits even though several doc files are in the index (and can be found with keywords they contain)
Any hints are greatly appreciated!

-- LuziSchucan - 31 Mar 2008

After having the same problems as HarrisWong with the user not being authenticated properly in 4.2, I created a plugin, KinoSearchPlugin, that allows you to embed the KinoSearch results into a page. It is highly dependent on this Add-On, i.e. it uses the search functions and the indexes created using this add-on. It is currently in a very raw state and Perl isn't exactly my best language, so you have been warned.

-- DavidGuest - 07 Apr 2008

I run Kino on 4.2 and found that it always fails when I do the first kinoindex. It always stops at TextFormattingRules with the following error message:

RCS: /usr/bin/rlog -h %FILENAME|F% of .../TWiki/TextFormattingRules.txt,v failed: at /teamspace/twiki/lib/TWiki/Store/RcsWrap.pm line 275.

If I delete this file, it will fail at another .txt,v file again! What is the problem in these 4.2 files? Is it possible to exclude some content (ex: TWiki web) when I do kinoindex & kinoupdate?

-- MagicYang - 10 Apr 2008

An update to previous post:
I deleted all .txt,v files in the 4.2.0 TWiki web and kinoindex can finish the index job successfully!

BTW, I have some other findings and questions here:

  • When I put search form in my topic (ex: my Glossary.WebHome), the search result page look different in 4.1.2 & 4.2, why?
    • Search result page on 4.1.2:
      4.1.2.gif
    • Search result page on 4.2:
      4.2.gif
  • The drop down list in the search form doesn't display all my Webs, why?
    • I can't find anything from the web that is not listed in the drop down list (but the web was actually indexing in the log)
    • The ignored web (ex: TWiki, Trash, Sandbox) should not be listed in the drop down list!!!!
  • How can I customize this search result page in TWiki 4.2? (It seems that it is created on the fly because I don't have Glossary.KinoSearch!!) I customized the result page by modifying kinosearch.pattern.tmpl!
-- MagicYang - 14 Apr 2008

Hi Magic,

on your .txt,v problem: These files are used by RCS to keep the history of the files. If you delete those files, you lose the revision control on them. I cannot reproduce your problem. In my environment (TWiki 4.2, set up completely from scratch on a Debian VM) kinoindex runs through without any problems. In fact kinoindex does not do anything with the .txt,v files. Your error message seems to come from RCS. Perhaps the installation of RCS is not O.K. But to be honest, I have no clue how you got there.

On the difference between 4.1 and 4.2: The header with the search template comes from the file kinosearch.pattern.tmpl, as you already found out. Maybe this worked differently on 4.1 (sorry, I do not test on 4.1 anymore).

On the webs dropdown list: Here some enhancements on the HTML code are needed. If someone could give me a hint on this, I will include that.

-- MarkusHesse - 14 Apr 2008

I have some findings for your reference:

  1. If I put "Set KINOSEARCHINDEXSKIPWEBS = Trash,Sandbox,TWiki" in Main.TWikiPreferences, it doesn't work!!! It works only if I put blank space after commas.
  2. Access control problem:
    • When I search, the login information in the upper left corner disappears (it becomes "Log in") when the result page appears. (This only occurred in 4.2)
    • It seems that I (I am Admin) become TWikiGuest at this moment so that some webs' (protected by web-level access control) search result will not be listed in the result page! (This occurred in both 4.1.2 & 4.2 even login information looks normal in 4.1.2!)
    • When I click and leave search result page, the login information (also with personal sidebar) display normally again.
    • The drop down list (WEBLIST variable) doesn't show all web list also because of access control.
  3. It's my fault that I put different version of kino template in 4.1.2 & 4.2 so that my search result page looks different!!!
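The first finding suggests that the list in KINOSEARCHINDEXSKIPWEBS is split on a comma followed by a blank. A more tolerant way to parse such a value is sketched below. This is an assumption about a possible fix, not the add-on's actual code, and `parse_skip_webs` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical parser for a preference value such as "Trash,Sandbox,TWiki"
# or "Trash, Sandbox, TWiki". Blanks around the commas are optional.
# Returns a hash whose keys are the names of the webs to skip.
sub parse_skip_webs {
    my ($value) = @_;
    return () unless defined $value;
    $value =~ s/^\s+|\s+$//g;    # trim both ends
    my %skip = map { $_ => 1 } grep { length } split /\s*,\s*/, $value;
    return %skip;
}
```

The indexer could then test `$skip{$web}` for each web before indexing it.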
-- MagicYang - 15 Apr 2008

Hi,

I encountered the following error (seems somehow related to using the installer script and I also originally missed installation of:
perl -MCPAN -e "install Module::Pluggable"
perl -MCPAN -e "install CharsetDetector"
perl -MCPAN -e "install Encode")

Couldn't open file '/srv/www/twiki420/kinosearch/index/_1.f0': File exists at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Store/FSInvIndex.pm line 88
        KinoSearch::Store::FSInvIndex::open_outstream('KinoSearch::Store::FSInvIndex=HASH(0x15022b0)', '_1.f0') called at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Index/SegWriter.pm line 40
        KinoSearch::Index::SegWriter::init_instance('KinoSearch::Index::SegWriter=HASH(0x156c600)') called at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Util/Class.pm line 31
        KinoSearch::Util::Class::new('KinoSearch::Index::SegWriter', 'invindex', 'KinoSearch::Store::FSInvIndex=HASH(0x15022b0)', 'seg_name', '_1', 'finfos', 'KinoSearch::Index::FieldInfos=HASH(0x1560eb0)', 'field_sims', 'HASH(0x1501fb0)', ...) called at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/InvIndexer.pm line 152
        KinoSearch::InvIndexer::_delayed_init('KinoSearch::InvIndexer=HASH(0x1502080)') called at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/InvIndexer.pm line 197
        KinoSearch::InvIndexer::new_doc('KinoSearch::InvIndexer=HASH(0x1502080)') called at /srv/www/twiki420/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 437
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexTopic('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x75f...', 'KinoSearch::InvIndexer=HASH(0x1502080)', 'Duesseldorf', 'DuesseldorfExternal', 'FirstName', 1, 'Address', 1, 'OrganisationURL', ...) called at /srv/www/twiki420/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 69
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::createIndex('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x75f...', 1) called at ./kinoindex line 29

I managed to solve it using this procedure:

cd twiki420/kinosearch
rm -r index/
rm -r logs/
cd twiki420
tar xvf SearchEngineKinoSearchAddOn.tgz
chown wwwrun:www -Rv twiki420 | grep -v retained
cd twiki420/kinosearch/bin/
cp LocalLib.cfg.bak LocalLib.cfg
./kinoindex

-- IngoKappler - 15 Apr 2008

Hi Markus,

just FYI there is a slight difference in 4.2 regarding the look and feel between the default search box and the kinosearch search box. I guess that's because kinosearch is not "yet" fully 4.2 adjusted or do I need to adjust that locally?

Regarding the drop-down menu, I'd like to use a selection of webs for the default search instead of all public webs. This would allow indexing everything, including e.g. the TWiki web, without users getting TWiki web results all the time, while ensuring that several other webs are searched by default. Could you imagine implementing something like that?

And last ;-) I am wondering whether I am the only one for whom type:doc or type:pdf doesn't return any results while e.g. web:Main works. It is the same on 4.1 for me.

Related to web:Sandbox: I suggest using web:Main rather than web:Sandbox as the query example in TWiki/KinoSearch, because by default Sandbox and TWiki are not indexed.

Thanks for that great search engine!

-- IngoKappler - 17 Apr 2008

I uploaded a new version. The main change is that now the current user is used in the search script. Thus the access control should now work correctly. The problems reported by HarrisWong (29 Mar 2008) and MagicYang (15 Apr 2008) should be fixed by this.

-- MarkusHesse - 23 Apr 2008

Hi Markus,

I installed kinosearch (version 1.14) on our twiki server running TWiki 4.2 in a Windows environment based on cygwin. I installed antiword, pdftotext and ppthtml and they seem to work (when I call them directly from the shell). The kinoindex script and the kinoupdate script both run without complaints, apart from a complaint about an uninitialized value for some Excel files:

Indexing attachment | IKDB.RisikoManagement | IKDB_ACE_Stand20050725jn002.xls
Use of uninitialized value in pattern match (m//) at /usr/lib/perl5/site_perl/5.8/Spreadsheet/ParseExcel/Utility.pm line 101.
Use of uninitialized value in length at /usr/lib/perl5/site_perl/5.8/Spreadsheet/ParseExcel/Utility.pm line 126.

However, when I make a search using the KinoSearch Plugin, the result is not as expected. These are my observations (positive and negative):

  • an Index for about 1700 topics and 1500 attachments is built in about 45 Minutes (using a standard Intel Pentium PC with 2.6 GHz running Windows 2000).
  • most excel attachments are stringified fast, some excel attachments need considerable time (sometimes several minutes)
  • Searching for a word that is known to be in an attachment produces results for attachments of type .xls.
  • Searching for a word that is known to be in an attachment produces no results for attachments of type .txt, .pdf, .doc and .ppt.
  • Searching for a word that is known to be in a non public web produces no results and the list of searchable webs contains only the public webs after the search. The previously logged-in user is not logged in after executing the search.
I suspect that the strings coming from certain types of attachments are not properly stored in the index while the strings coming from the topics and .xls attachments are.

Is there a possibility to check if the index is properly built or inspect the index?
Has kinosearch been successfully tested with TWiki 4.2?
What can be done in order to make kinosearch work? The users of the TWiki instance that I am running have a strong demand for it.

-- MichaelSchmidt - 28 Apr 2008

Hi Michael,

sorry, but I never tried an installation under Windows. I just retested all the things you mention but I cannot reproduce it under Linux.

  • The user is still the same after running a search
  • I find all attachment types (but I must admit that I used extremely simple examples)
I run kinosearch under a fresh installation of TWiki 4.2.

I assume that the installation of the backends for PDF, DOC and PPT is somehow not O.K. It is very strange that you report a problem with .txt: no backend is used here, so it should run in all environments.

Perhaps some of the other users have experience with an installation on Windows.

-- MarkusHesse - 28 Apr 2008

I've installed the Kino addon on a TWiki 4.2 with hierarchical webs and strict authorization policies...

When I run the kinoindex script I get the following:

AccessControlException: Access to VIEW CCTnuovo/Clienti.RapportoInterventoForm for BaseUserMapping_666 is denied.

Clearly CCTnuovo/Clienti.RapportoInterventoForm is a page in a web that can only be viewed by a group of users... it seems that the script presents itself as an unidentified user that, of course, doesn't have access to protected webs...

Is it possible to specify the user used by kinoindex and kinoupdate? Or, since all the site content should be indexed, to make it use the TWiki administrator?

-- IvanSassi - 29 Apr 2008

Hi Ivan,

can you perhaps give some more details: How can the situation be reproduced? Is there some more in the log files?

Up to now I thought the index script is independent of the user and reads any topic regardless of the access permissions. (A check of the access permissions should only run in the search script.) I also did some tests with topics with access control and that worked well. But I must admit I never tried hierarchical webs.

-- MarkusHesse - 29 Apr 2008

When kinoindex fails it doesn't leave any significant log; it doesn't write anything to the apache2 logs (error and access) or to TWiki's (log and warn); the kino log is truncated at the skipping step:

| 2008/04/30 09:24:23 | Indexing started
| 2008/04/30 09:24:23 | Skipping Sandbox topics
| 2008/04/30 09:24:23 | Skipping Trash topics
| 2008/04/30 09:24:23 | Skipping TWiki topics

No file was created in the kinosearch/index directory...

The problem is surely in the ALLOWWEBVIEW setting of the webs, because if I don't set any user in the web preferences of the nested web CCTdoc/Clienti it stops at another web:

AccessControlException: Access to VIEW CCTnuovo/GiornaleAttivita.PaginaGiornaleAttivitaForm for BaseUserMapping_666 is denied.

I'm not sure that the problem is in the hierarchy... I tried to delete the ALLOWWEBVIEW content for every nested web and it stops anyway on a first-level web:

AccessControlException: Access to VIEW Informazioni.ComunicazioneForm for BaseUserMapping_666 is denied.

One behavior I noted is that it always stops on topics named SomestuffForm; these are all topics containing tables used for some form elsewhere...

An example: CCTnuovo/GiornaleAttivita.PaginaGiornaleAttivitaForm contains the following:

| *Name* | *Type* | *Size* | *Values* | *Tooltip message* | *Attributes* | 
| TipoAttivita | select | 1 |  | Selezionare il tipo di attivita effettuata | |
| Autore | label | 40 | | |
| SvoltoDalCliente | select | 1 | | Selezionare il nominativo del cliente| |
| Descrizione | text | 80 | | |
| Data | text | 15  | | Data dell'intervento | |

I don't see any reason for this, but when I clear ALLOWWEBVIEW for one web after another it ALWAYS stops on a topic SomethingForm...

-- IvanSassi - 30 Apr 2008

Hi Markus,

apparently the kinosearch module shows a different behaviour under Windows/cygwin compared to Unix.

Another observation I made is: When the Index is built, storing topic modifications works. After some modifications and corresponding updates of the index, storing a topic produces a lengthy error message indicating an access problem to some index file.

I hope that somebody with Windows/cygwin experience is able to check this module and fix the observed problems.

-- MichaelSchmidt - 06 May 2008

Hi Ivan,

I had the same problem with a form. It always happened to me when I set ALLOWWEBVIEW on a web which had a WEBFORM.

I found two Solutions

1. Do not set ALLOWWEBVIEW on this web
2. Set KINOSEARCHINDEXSKIPWEBS = in TWikiPreferences and put the web name there

I could not find any other solution

-- MarkusEberius - 07 May 2008

Hi all together,

I found an interesting blog post describing the installation of KinoSearch on Windows: http://arnshea.spaces.live.com/blog/cns!32B8884142441255!436.entry

Can someone verify the described installation steps? If it works, I can add it to the documentation. (I do not have a TWiki installation on Windows and in the short run I will not set up such an installation.)

-- MarkusHesse - 07 May 2008

I tested the two solutions and I can confirm that they work... unfortunately I must use ALLOWWEBVIEW and I can't skip entire webs... maybe adding a parameter that can exclude only specified topics would do the trick... is that possible?

-- IvanSassi - 08 May 2008

Hi Ivan, hi Markus,

I think I have a preliminary solution for your problem: Change the code in lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm in method formsFieldNames from

my $form = TWiki::Form->new($TWiki::Plugins::SESSION, $web, $formName);

to

my $session = new TWiki( "admin", undef);
my $form = TWiki::Form->new($session, $web, $formName);

Question to the TWiki gurus:

Is "admin" always the name of the super admin user who can read everything? If not, where can I get that name (I assume something in the configuration. Perhaps someone could give me a hint.)

-- MarkusHesse - 10 May 2008

Markus's solution works for me! But I need to use some other admin account instead of "admin" in 4.1.2!

-- MagicYang - 14 May 2008

Did you try the new version of this add on? There I use TWiki($TWiki::cfg{AdminUserLogin}). This should give the correct admin user.

-- MarkusHesse - 14 May 2008

Great! The new version works fine with forms and ALLOWWEBVIEW.

However I got another problem, there is some trouble with the .pdf format:

Can't exec "pdftotext": No such file or directory at /srv/www/htdocs/prova/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PDF.pm line 27.

But the pdftotext is available on my system:

lnxprotowebtext:~ # pdftotext -v
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC

-- IvanSassi - 14 May 2008

Another problem: the index is created but the search doesn't work very well, there are no indexed attachments at all in the results... however kinoupdate seems to add new attachments correctly; in a search made after the update, the only attachment in the results is the newly added one...

-- IvanSassi - 16 May 2008

I solved the problem with the .pdf attachment by editing /srv/www/htdocs/prova/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PDF.pm at line 27 and putting the absolute path of pdftotext there:

    unless ((system("/usr/local/bin/pdftotext", $filename, $tmp_file, "-q") == 0) && (-f $tmp_file)) {
        return "";
    }

-- IvanSassi - 19 May 2008

New problem... I've tested the indexing in my production environment and after nearly 11 hours of work (something like 4 GB of attachments, and it is VERY slow at PDF indexing) the procedure crashed with the following error:

Use of uninitialized value in numeric ne (!=) at /usr/local/lib/perl5/site_perl/5.10.0/CharsetDetector.pm line 8289, <$in> line 5256.

When it crashed it was working on a directory containing some PPT files...

-- IvanSassi - 21 May 2008

Wow, 4 GB... I never tried such an amount of data. KinoSearch itself should not have any problems with that much data. Of course, if individual files break the indexing process, this is a big problem. Perhaps you could post the files that cause the breakdown.

I would like to make the indexing more robust, so that crashes on individual files do not stop the whole indexing process. But I need example data for that.

Some examples of very slow PDF documents would also be interesting. At the moment I have not done anything on performance improvements: KinoSearch itself is fast enough (see the benchmark numbers on the KinoSearch web site). Indexing of "normal" TWiki topics is from my point of view also O.K. But the indexing of attachments was not in my scope so far. It was O.K. for me that it runs at all, but perhaps I should have a close look at performance in this area. It would be very helpful if the users/Perl gurus out there could give me some advice on where improvements could be made.

-- MarkusHesse - 21 May 2008

I have to correct myself: it isn't slow only on PDF files, it's slow on any file indexing... it's strange, with some files it works well, with others it hangs with the processor at 99% for incredibly long times...

I'm trying to find the PPT that crashed the procedure by indexing one file at a time... indexing the file CLIENTI_1.PPT takes 45 minutes (!!!)... I tried launching ppthtml manually on the same file and the results are immediate...

The file was indexed fine; however the search result is not exactly the finest thing in the world:

TEST-TWikiKinoSearchAddOn1.jpg

I've found the file that crashes the indexing procedure: CLIENTI_4.PPT

-- IvanSassi - 21 May 2008
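
One possible way to keep a single slow file from blocking the whole run for 45 minutes is to put a time limit around each conversion. The following is only a minimal sketch, assuming a Unix-like system where `alarm` is available; `with_timeout` is a hypothetical helper and not part of the add on, and a conversion that runs in an external program may need additional handling to stop the child process.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds.
# Returns the result of $code, or undef on timeout or any other die.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        my $r = $code->();
        alarm 0;
        $r;
    };
    alarm 0;    # make sure no alarm is left pending after a die
    return $result;
}

# Simulate a conversion that would take five seconds, with a one second limit.
my $text = with_timeout(1, sub { select(undef, undef, undef, 5); 'never reached' });
print defined $text ? "done\n" : "gave up\n";
```

A caller could log the file name when `undef` comes back and move on to the next attachment.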

Hi Ivan,

I just did some experiments with your files and I think I got it: My stringifiers do not recognise the capital letters as the correct postfix. Thus "*.PPT" is not recognised as Power Point but stringified as ASCII text. It must be named "*.ppt". I will fix this asap.

If you need a faster solution, just look in the file ..lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyBase.pm and change the method handler_for in the following way:

    sub handler_for {
        my ($self, $filename, $mime) = @_;
        if (exists $mime_handlers{$mime}) { return $mime_handlers{$mime} }
        $filename = lc($filename);  # NEW LINE
        for my $spec (keys %extension_handlers) {
            if ($filename =~ /$spec$/) { return $extension_handlers{$spec} }
        }
        return DEFAULT_HANDLER;
    }

-- MarkusHesse - 21 May 2008
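
To try the effect of the lower-casing outside TWiki, here is a small self-contained sketch. The extension table and the handler names are made up for illustration and are not the add on's real registry; the sketch also quotes the extension with `\Q...\E` so that the dot is matched literally, which is an addition and not something the workaround above does.

```perl
use strict;
use warnings;

# Hypothetical extension table; the real add on registers its own handlers.
my %extension_handlers = (
    '.ppt' => 'PPT',
    '.pdf' => 'PDF',
    '.doc' => 'DOC',
);

sub handler_for_extension {
    my ($filename) = @_;
    my $lower = lc $filename;    # "CLIENTI_4.PPT" becomes "clienti_4.ppt"
    for my $ext (keys %extension_handlers) {
        # \Q...\E makes the dot literal, so ".ppt" does not match "xppt"
        return $extension_handlers{$ext} if $lower =~ /\Q$ext\E$/;
    }
    return 'Text';               # fallback, as for unknown file types
}

print handler_for_extension('CLIENTI_4.PPT'), "\n";
print handler_for_extension('notes.txt'), "\n";
```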

Ok, I fixed it with your workaround... however I have a problem with the relative paths of the executables: I have to use the absolute path for everything... here is a list of changes:

lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyBase.pm

   sub handler_for {
        my ($self, $filename, $mime) = @_;
        if (exists $mime_handlers{$mime}) { return $mime_handlers{$mime} }
+       $filename = lc($filename);
        for my $spec (keys %extension_handlers) {
            if ($filename =~ /$spec$/) { return $extension_handlers{$spec} }
        }
        return DEFAULT_HANDLER;
    }

Line added to correctly recognize extensions in upper case...

lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PDF.pm

- if (__PACKAGE__->_programExists("pdftotext")){
+ if (__PACKAGE__->_programExists("/usr/local/bin/pdftotext")){

-    unless ((system("pdftotext", $filename, $tmp_file, "-q") == 0) && (-f $tmp_file)) {
+    unless ((system("/usr/local/bin/pdftotext", $filename, $tmp_file, "-q") == 0) && (-f $tmp_file)) {

Added the absolute path of pdftotext; without the path the PDF files were not indexed...

Is there some way to make it work with the system path, including any directory visible to root?

lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/PPT.pm

- if (__PACKAGE__->_programExists("ppthtml")){
+ if (__PACKAGE__->_programExists("/usr/local/bin/ppthtml")){

-    my $cmd = "ppthtml '$filename' > $tmp_file 2>/dev/null";
+    my $cmd = "/usr/local/bin/ppthtml '$filename' > $tmp_file 2>/dev/null";

As for the PDF, I have to add the absolute path of the program or the files are not indexed...

lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/DOC_wv.pm

I've made a lot of tests and it doesn't seem to work... I tried to add the absolute path as for PDF and PPT but with no result... the indexing is very slow and searching for the file doesn't work; I'm almost sure that the program doesn't recognise the doc type and tries to index the file as text...

I'm doing my tests with the file Attivita_SISTEMISTICHE_Varie_SACMI_SAP.doc

-- IvanSassi - 22 May 2008

Hi Ivan,

I just tried your Word file in my test environment. It works fine: I can index it in reasonable time (about 10 sec) and I can successfully search for it.

Your problem with relative paths of the executables seems a bit strange: Did you check that PATH is O.K. (also in the context of the user doing the indexing)? If that is O.K., the relative paths should work (I think).

-- MarkusHesse - 22 May 2008

I checked the PATH for the root user (used to launch kinoindex) and it contains /usr/local/bin, where the conversion programs are... however with relative paths it doesn't work; is it possible that the indexing procedure is using another profile?

-- IvanSassi - 26 May 2008
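
A quick way to see which PATH a Perl process really gets, and whether the conversion programs can be found in it, is a small check script like the sketch below. It is only a diagnostic aid: `find_in_path` is a hypothetical helper, and the script would have to be run in the same way (same user, same cron job or web server) as the indexing to be meaningful.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first executable file named $program
# found in the given directories, or undef if there is none.
sub find_in_path {
    my ($program, @dirs) = @_;
    for my $dir (@dirs) {
        next unless length $dir;
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x _;
    }
    return undef;
}

# Directories from $ENV{PATH} as this process sees them
my @dirs = File::Spec->path();
print "PATH: ", join(':', @dirs), "\n";

for my $program (qw(pdftotext ppthtml antiword)) {
    my $found = find_in_path($program, @dirs);
    print "$program: ", (defined $found ? $found : 'NOT FOUND'), "\n";
}
```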

Ok, I solved the problem with relative paths by creating a symbolic link for each conversion program in /bin; now the procedure works fine without absolute paths...

However the problem with the .doc files remains: it doesn't seem to recognize the document as "application/word" and tries to index it as text (taking a lot of time and creating an inconsistent index)... I've tried wvHtml and abiword (antiword had its last update in 2005 and is no longer maintained); what do you use in your test environment?

I don't have any idea about the reason for this issue; do you have something in mind?

-- IvanSassi - 28 May 2008

I've been working on a weighted search (see OrderSearchResultsMostRelevantFirst), which my clients like, but now they want weighted search of attachments. Indexed searching should also improve runtime performance. Any idea how difficult this would be with Kino?

-- ClifKussmaul - 28 May 2008

Hi Clif,

I followed your work on enhanced search and I am impressed by what is possible with some sophisticated TWiki tricks. But at some point you come to the limits of TWiki search: performance and search inside attachments.

SearchEngineKinoSearchAddOn is designed exactly for this purpose: fast search in huge webs including attachments. Additionally the results are ordered by relevance (mainly the number of appearances of the search strings). Installation should be quite easy (I use a debian virtual machine, installation takes less than 15 minutes plus 15 minutes for indexing 10.000 topics). Try it and post your experiences. See also KinoSearchPlugin for extended use of this add on.

-- MarkusHesse - 04 Jun 2008

My previous comment wasn't clear enough - sorry. I've got Kino working, and generally like it.

I'd like to know more about how relevance is calculated, and if or how weightings could be customized. I'm also talking with some librarians who are search fanatics...

-- ClifKussmaul - 05 Jun 2008

Hi Clif,

for the details of how search results are scored, you should look at the KinoSearch home page (http://www.rectangular.com/kinosearch/). As far as I understand, the score depends on the number of occurrences of a word and the size of the text where it appears. Additionally, parts of the text can be "boosted", i.e. a factor can be defined that multiplies the score of occurrences in that part. In this add on I boost the name of the topic by a factor of three. This means that a word appearing once in the topic name scores the same as the same word appearing three times in the text.

-- MarkusHesse - 06 Jun 2008
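
The "factor of three" can be illustrated with a toy calculation. This is not KinoSearch's actual scoring formula, which according to the description above also takes the size of the text into account; `toy_score` is a hypothetical function that only counts occurrences and multiplies the count in the topic name by the boost.

```perl
use strict;
use warnings;

# Toy model only: count whole-word occurrences of $term and weight the
# occurrences in the title by $title_boost.
sub toy_score {
    my ($term, $title, $body, $title_boost) = @_;
    my $in_title = () = $title =~ /\b\Q$term\E\b/gi;
    my $in_body  = () = $body  =~ /\b\Q$term\E\b/gi;
    return $title_boost * $in_title + $in_body;
}

# One hit in the topic name ...
print toy_score('kino', 'Kino Search', 'nothing relevant here', 3), "\n";
# ... counts as much as three hits in the text.
print toy_score('kino', 'Other Topic', 'kino kino kino', 3), "\n";
```

In this toy model both calls give the same value, which is the equivalence described above.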

I have an index error message like below:

Logfile cannot be opend in /teamspace/twiki/kinosearch/logs/index-20080618.log. at /userap/users/teamspace/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/KinoSearch.pm line 55.

Does anybody have the same error?

-- MagicYang - 18 Jun 2008

Another kinoindex error:

Attachments available for: Activity, BIDW
Can't locate object method "stringForFile" via package "TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::Text" at /teamspace/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Stringifier.pm line 30.

After I got this error and run kinoindex again, I got:

Couldn't require TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifierPlugins::HTML : Can't locat
e CharsetDetector.pm in @INC (@INC contains: /teamspace/twiki/lib . /usr/lib/perl5/5.8.5/i386-linux-
thread-multi /usr/lib/perl5/5.8.5 /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi /usr/lib/pe
rl5/site_perl/5.8.4/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi /
usr/lib/perl5/site_perl/5.8.2/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.1/i386-linux-thre
ad-multi /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.5 /usr/
lib/perl5/site_perl/5.8.4 /usr/lib/perl5/site_perl/5.8.3 /usr/lib/perl5/site_perl/5.8.2 /usr/lib/per
l5/site_perl/5.8.1 /usr/lib/perl5/site_perl/5.8.0 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_per
l/5.8.5/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.4/i386-linux-thread-multi /usr/lib/pe
rl5/vendor_perl/5.8.3/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.2/i386-linux-thread-mul
ti /usr/lib/perl5/vendor_perl/5.8.1/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.0/i386-li
nux-thread-multi /usr/lib/perl5/vendor_perl/5.8.5 /usr/lib/perl5/vendor_perl/5.8.4 /usr/lib/perl5/ve
ndor_perl/5.8.3 /usr/lib/perl5/vendor_perl/5.8.2 /usr/lib/perl5/vendor_perl/5.8.1 /usr/lib/perl5/ven
dor_perl/5.8.0 /usr/lib/perl5/vendor_perl . /teamspace/twiki/lib/CPAN/lib//arch /teamspace/twiki/lib
/CPAN/lib//5.8.5/i386-linux-thread-multi /teamspace/twiki/lib/CPAN/lib//5.8.5 /teamspace/twiki/lib/C
PAN/lib/) at /teamspace/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/HTML.
pm line 18.
BEGIN failed--compilation aborted at /teamspace/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/
StringifierPlugins/HTML.pm line 18.
Compilation failed in require at (eval 40) line 3.
 at /usr/lib/perl5/site_perl/5.8.5/Module/Pluggable.pm line 28
Couldn't require TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifierPlugins::Text : Can't locat
e CharsetDetector.pm in @INC (@INC contains: /teamspace/twiki/lib . /usr/lib/perl5/5.8.5/i386-linux-
thread-multi /usr/lib/perl5/5.8.5 /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi /usr/lib/pe...
....

Does anybody have the same error?

-- MagicYang - 18 Jun 2008

I need TWiki's topic (or at least web) View permissions to be honored for both topic and attachment search. So far, I can't see how Kino, or Lucene partition the indexes - or enable me to combine separate indexes - is this possible??

-- SvenDowideit - 18 Jun 2008

Hi Magic,

your first error ("Logfile cannot be opened") sounds like a file permission error. The second looks as if you did not install the CPAN module CharsetDetector.

Hello Sven,

I am not sure if I understand your question correctly. Regarding the permissions: during indexing (script "kinoindex"), this add on uses the permissions of the TWiki admin, so that it can read everything. Only on search are the permissions checked against those of the user who is actually logged in. This means that internally, in the first step, kino finds everything regardless of the permissions of the user. It then filters the hits against the permissions of the user, so that the user sees only the topics/attachments he is allowed to see.

On the second topic: I never tried to partition or combine indexes. I think KinoSearch itself has some possibilities for this, but I never went into details of this.

Can you tell me the requirement behind your question?

-- MarkusHesse - 18 Jun 2008

Yup ;-) The requirement is basically a massive TWiki install in which the clients have to be totally and utterly convinced that there is zero possibility that their confidential information is loaded when someone else's content is being rendered.

Additionally, I'm considering making indexes of my mp3 collection, my ebook collection, my Email archive (dating back to 1996ish) and my non-TWiki files repository - and I'm not sure that only one index makes sense in that case.

It's also in part due to the fact that I was originally using (and may go back to) Xapian, which has very obvious dynamically combinable indexes.

-- SvenDowideit - 18 Jun 2008

Oh, Markus - I'm currently playing with your kino work, and have a few enhancements - shall I just check them in so you can review them and decide where to place them?

Also, I think we should refactor this work (and presumably the plucene and xapian ones) to use TWiki internals better. Things like removing the custom cgi's index/update and replacing them with rest handlers (which is much more secure, as most users won't know to add the extra scripts to the apache or template auth settings to restrict them).

-- SvenDowideit - 19 Jun 2008

Hi Sven,

don't hesitate to check in your enhancements. That's why the sources are in SVN. I would appreciate if you also look at the unit tests and keep them up to date if you do any changes.

-- MarkusHesse - 19 Jun 2008

oh frig. on my machine i get a seg fault.

   KinoSearchTests::test_new
Segmentation fault

well, that'll slow me down :/, but sweet :-)

-- SvenDowideit - 19 Jun 2008

Anyone successfully using KinoSearch with LdapContrib? I'm having issues displaying results if the web is under restricted viewing permissions. Basically the group mapping from LdapContrib does not let me view the results (even though I am part of the permitted group).

To be more clear: SomeWeb has view permissions for company (ldap group returned by LdapContrib)

KinoSearch for word contained in topics in SomeWeb does not return any hits from SomeWeb. However, if I add my wikiname directly to the SomeWeb.WebPreferences topic explicitly (not as part of a group)....then the results are shown.

-- DeanSpicer - 23 Jun 2008

Has anyone got this AddOn working on a Windows box with TWiki 4.2.0 and ActiveState Perl 5.8.8 please?

-- JamesGMoore - 29 Jun 2008

I installed SearchEngineKinoSearchAddOn via the web installer. I got some errors saying that I should complete the installation manually. I did everything as described in SearchEngineKinoSearchAddOn, such as installing the CPAN modules and Word stringifiers.

But SearchEngineKinoSearchAddOn still does not appear in the installed plugins list on my TWiki configuration page, and in addition I can no longer connect to the TWiki Extension page. It only says "Find TWiki Extensions -- Consulting TWiki.org..." but never shows any results.

Does anyone have a clue how to fix that?

-- MathiasReiche - 30 Jun 2008

Mathias, it does not appear for me either in configure/Plugins. Regardless, it does work provided you extracted the tar in the root of the twiki folder and made sure all prerequisite programs are installed and working.

The only issue I have is with integrating it in to WebTopBar in TWiki 4.2. I would like the input box to look exactly like the built in TWiki search. When you click on Jump or Search the words "Jump" or "Search" disappear and you can type in your search criteria. With the Kino search box this is not the case. I tried hacking up the form but have been unsuccessful.

-- GordonTerrell - 01 Jul 2008

I guess I'll have to use Oracle Ultra Search to search my attachments and everything...

-- JamesGMoore - 03 Jul 2008

I am getting some messages in the httpd error_log but everything appears to be working normally.

[Thu Jul 03 11:03:46 2008] [error] [client 192.168.2.100] [Thu Jul  3 11:03:46 2008] view: "my" variable @topicsToUpdate masks earlier declaration in same scope at /var/www/twiki/lib/TWiki/Plugins/KinoSearchPlugin.pm line 87., referer: http://192.168.2.28/twiki/bin/view/InformationServices/WebHome

-- GordonTerrell - 03 Jul 2008

I am running TWiki 4.0 with Plucene in my production environment and am testing TWiki 4.2 with Kino. One big issue I am having is with the way Kino splits words with - and _ in them (an_example_word). The inconsistency comes from the fact that I get two different sets of results if I search for an_example_word or "an example word". If an attachment has an_example_word in it and I search for "example", I will not get a hit; however, I would get a hit from a plain topic. I have a lot of topics with example SQL code in them, so there might be something like

WHERE C.X_DATAFLEX_FLG='Y'

in it. If I want to search for x_dataflex_flg then I would have to search for it two different ways, depending on if I wanted to get a hit from a plain topic or from within an attachment. With Plucene I can simply search for x_dataflex_flg and get hits from attachments as well as topics.

-- GordonTerrell - 08 Jul 2008

Anyone have any pointers on how to remove word splitting in the index when the word contains a - or _?

-- GordonTerrell - 31 Jul 2008
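
Whether a word like x_dataflex_flg ends up as one term or several depends on the pattern used to cut text into tokens. The plain Perl sketch below only demonstrates that difference with two hypothetical patterns; it does not use KinoSearch, and whether the add on's analyzer can be configured with such a pattern is an assumption that would need checking against the KinoSearch documentation.

```perl
use strict;
use warnings;

# Cut $text into tokens using the given pattern.
sub tokens {
    my ($text, $token_re) = @_;
    return $text =~ /($token_re)/g;
}

my $text = lc "WHERE C.X_DATAFLEX_FLG='Y'";

# Pattern without "_" and "-": the identifier is split into pieces.
my @split = tokens($text, qr/[a-z0-9]+/);
# Pattern that allows "_" and "-" inside a token: the identifier stays whole.
my @whole = tokens($text, qr/[a-z0-9_-]+/);

print join(' | ', @split), "\n";
print join(' | ', @whole), "\n";
```

The same pattern would have to be applied both when indexing and when parsing the query, otherwise topics and attachments could again behave differently.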

now that TWikiRelease04x02x01 is out of the way, I'll be getting back to this. I'll integrate the ideas from KinoSearchPlugin while I'm at it, and also make replacements for the cgi / update scripts using the restHandler. I'm working towards making this (and, when MichaelDaum commits his refactor of the Plucene one, it too) pluggable into the normal %SEARCH% and WebSearch - it's already working on my test :).

-- SvenDowideit - 05 Aug 2008

oh crud. Looks like Kino in its current form is a deal breaker for me: "Note: The current version of KinoSearch does not support wildcards."

I've checked in an experimental SEARCH backend that uses Kino - but it requires changes that I just made that will be in 4.2.2. see Bug:Item5888 .

Markus, can you take a look over what i've done, and if you're happy, release it (or tell me to :)).

-- SvenDowideit - 06 Aug 2008

I have updated this package further, and uploaded a version. There are now some more configuration settings in configure, and I expect we'll move the TWikiPreferences to configure over time too.

The KinoSearchPlugin is now a part of this package - by default you get index update and search restHandlers, and the KINOSEARCH tag. you can optionally configure it to autoupdate the index when topics are changed via a configure parameter.

-- SvenDowideit - 19 Aug 2008

-- GregNiedermaier - 19 Aug 2008

(sorry for the blank comment up above)

The real question: I'm on a Red Hat Enterprise Linux 5 box running TWiki 4.2. I can search through every type of file except .doc. I have antiword installed, and using ks_test it stringifies just fine. Is there any reason that I cannot search through these documents? Also: when using the type:doc modifier, my .doc files come up. I'm not certain if that's related or not. Thanks for the help!

-- GregNiedermaier - 19 Aug 2008

After I finished installing the kinosearch plugin, I ran the kinoindex command to create the index. An error occurred during index creation and interrupted it; the error message is below:

Attachments available for: Fin, WebHome
Modification of non-creatable array value attempted, subscript -1 at /usr/lib/perl5/site_perl/5.8.5/Spreadsheet/ParseExcel.pm line 1578.

I found out that the attached Excel file contains VBA (Visual Basic for Applications). I don't know whether the "SearchEngineKinoSearchAddOn" plugin supports "excel file with VBA" or not?

any ideas or solutions ?

Thanks..

-- BretHuang - 02 Sep 2008

Bret, it's quite plausible that the CPAN module either does not support VBA, or more likely that you've found a bug in it. Markus has an outstanding todo to make the indexer more robust - as you note, when things go wrong, it does not continue on to the next file.

I would suggest 2 options - send your Excel spreadsheet (or a simplified version) to the Spreadsheet::ParseExcel CPAN module developer (who may fix the root cause), or give us the same, and someone should be able to code up a workaround (which would probably skip indexing that file)

-- SvenDowideit - 03 Sep 2008

Thank you SvenDowideit. The Excel file contains VBA (Visual Basic for Applications), but the VBA script in this file has an error.

I wonder whether, because the VBA has an error, running ParseExcel.pm fails, so that kinoindex can't create the index file.

today I installed Spreadsheet-ParseExcel-0.2603.tar.gz from CPAN

I met another problem,

error message:

Character in 'C' format wrapped in pack at /usr/lib/perl5/site_perl/5.8.5/Spreadsheet/ParseExcel/FmtDefault.pm line 68.

The above message is shown, but kinoindex continues to create the index.

Any ideas? Thanks..

-- BretHuang - 03 Sep 2008

I got the KinoSearch plugin up and running fine. Then I added an attachment and tried kinoindex. I got the following error:

KinoSearch index files init - to suppress all normal output: kinoindex -q
Indexing started
adding Main topics
Skipping Sandbox topics
adding TWiki topics
Skipping Trash topics
adding Main topics
Skipping Sandbox topics
adding TWiki topics
Skipping Trash topics
Indexing web | Main
Indexing topic | Main.ClassicSkinUserViewTemplate
Indexing topic | Main.ColinFerguson
Indexing topic | Main.NobodyGroup
Indexing topic | Main.PatternSkinUserViewTemplate
Indexing topic | Main.TWikiAdminGroup
Indexing topic | Main.TWikiAdminUser
Indexing topic | Main.TWikiContributor
Indexing topic | Main.TWikiGroupTemplate
Indexing topic | Main.TWikiGroups
Indexing topic | Main.TWikiGuest
Indexing topic | Main.TWikiGuestLeftBar
Indexing topic | Main.TWikiPreferences
Indexing topic | Main.TWikiRegistrationAgent
Indexing topic | Main.TWikiUsers
Indexing topic | Main.TestTopic
Indexing topic | Main.UnknownUser
Indexing topic | Main.UserHomepageHeader
Indexing topic | Main.UserList
Indexing topic | Main.UserListByDateJoined
Indexing topic | Main.UserListByLocation
Indexing topic | Main.UserListByPhotograph
Indexing topic | Main.UserListHeader
Indexing topic | Main.VoxwareTestTopic
Attachments available for: Main, VoxwareTestTopic
Use of uninitialized value in numeric ne (!=) at /usr/lib/perl5/site_perl/5.8.8/CharsetDetector.pm line 8289, <$in> line 1684.

Any ideas? FYI I had set the search to use antiword. I cleaned up the index using the procedure given by IngoKappler 15 April 2008 above, and switched to abiword. This time when I ran kinoindex the abiword GUI opened in gnome! When I closed the GUI, the script continued and failed in the same way.

-- ColinFerguson - 12 Sep 2008

I have made modifications to the add-on to allow users to subscribe to RSS feeds of search results and checked them in under Bugs:Item6002. I won't upload a new release, as it looks like other enhancements are being actioned by Sven.

-- AndrewRJones - 17 Sep 2008

StringifierPlugins::Text can choke on some binary data while detecting charset encoding via CharsetDetector::detect1(). Perhaps this should be wrapped in an eval {} then $@ checked for an error?

This happens on Windows with IndigoPERL when indexing a PDF that's basically just a scanned image or the fonts are horrendously corrupted. It could happen in other cases though...

-- ArnsheaClayton - 18 Sep 2008
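
The eval-and-check idea suggested above (in Perl: wrap the call in eval { ... } and test $@ afterwards) can be illustrated with a minimal sketch. It is written in Python purely to show the control flow; the add-on itself is Perl. The names detect_charset_safely and fragile_detector are hypothetical stand-ins, not part of CharsetDetector or the add-on.

```python
from typing import Callable, Optional


def detect_charset_safely(
    detect: Callable[[bytes], str],
    data: bytes,
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Run a charset detector and return `fallback` if it raises."""
    try:
        return detect(data)
    except Exception:
        # The detector choked (e.g. on binary data); let the caller decide
        # whether to skip the file or index it without charset conversion.
        return fallback


def fragile_detector(data: bytes) -> str:
    """Hypothetical detector that fails on non-ASCII input."""
    data.decode("ascii")  # raises UnicodeDecodeError on binary bytes
    return "ascii"
```

With such a wrapper, one unreadable attachment yields the fallback value instead of aborting the whole index run.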

I'm getting the following error while creating or updating the index using the rest script:

Couldn't require TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifierPlugins::DOC_antiword : Insecure directory in $ENV{PATH} while running with -T switch at /.../lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyBase.pm line 50.

This happens for all the stringifier plugins in the directory.

The problem may have something to do with the following, which I found at http://www.xav.com/perl/lib/Pod/perldiag.html

Insecure directory in %s
    (F) You can't use system(), exec(), or a piped open in a setuid or setgid script if $ENV{PATH} contains a directory that is writable by the world. See the perlsec manpage. 

None of my TWiki directories are open to the world. Does anyone have any idea? I will keep looking for a solution.

-- AndrewRJones - 02 Oct 2008

Hi All, KinoSearch is not able to search text in pptx or docx files. Is there a way or workaround to support the latest MS Office 2007 document formats (pptx, docx, etc.) and make these files searchable?

-- DeepaMann - 15 Oct 2008

Deepa, you need to find a tool that converts those files into plain text; then they can be added to the engine's index. If you find some, please add links :-)

-- SvenDowideit - 16 Oct 2008

WBNIF SearchEngineKinoSearchPlugin can handle (delete) zombie files from previous runs when creating the index from scratch. :-)

-- OliverKrueger - 27 Oct 2008

[1]. We have often seen the indexing process (the kinoindex and kinoupdate scripts) break while indexing the attachments of topics. It would be a good idea to use the Error Perl module to capture the error and continue indexing.

[2]. Change where the "Skip attachments" information is kept. Currently this information is held in the KINOSEARCHINDEXSKIPATTACHMENTS variable. I think the best way would be to follow the architecture of TagMePlugin: use the directory {TWIKI_ROOT}/working/work_areas/SearchEngineKinoSearchAddOn to keep the file names. The following snippet gives a good idea; "/var/www/twiki" is TWIKI_ROOT in this example.

bash-3.2$ pwd
/var/www/twiki/working/work_areas/SearchEngineKinoSearchAddOn
bash-3.2$ ls
_skip_.Attachmentsweb.MyTopic.txt
bash-3.2$ cat _skip_.Attachmentsweb.MyTopic.txt
ProblematicAttachment.pdf
bash-3.2$

In this snippet, ProblematicAttachment.pdf is attached to the topic "MyTopic" in the "Attachmentsweb" web. kinoindex crashes while indexing this file, so the file is skipped from indexing.

This also means that some API is required to manage these files. SearchEngineKinoSearchPlugin (which is part of the add-on) would be the best place to handle this.

[3]. The Error module mentioned in [1] should capture the error and update the work area mentioned in [2], so that the attachment is skipped the next time the index is built.

[4]. On a fresh TWiki installation, the "index" directory is empty if "kinoindex" or "kinoupdate" has never been run. Modify the templates of SearchEngineKinoSearchAddOn to display a meaningful message when someone runs a kinosearch query in that state.

[5]. Provide some way to monitor the recent "kinoindex" or "kinoupdate" logs through the browser. I think the best way would be to improve SearchEngineKinoSearchPlugin to display the last few lines of the log files, using Perl modules like File::Tail to monitor the files. If the log files do not show the line "Indexing Complete", that tells the "admin" to take further action.

[6]. The "rest" handlers included with SearchEngineKinoSearchPlugin make it possible to start the index/update jobs from the browser. The current code allows anyone to run these jobs, or maybe I am not aware of how to set up such restrictions. Consider adding access control, so that only "admin" or members of the "TWikiAdmin" group can start these jobs.

[7]. Provide some way of cleaning/emptying the "index" directory from the browser. This can go into SearchEngineKinoSearchPlugin, with access restricted to "admin" or "TWikiAdminGroup" members.

[8]. About ".docx", ".xlsx" and ".pptx" attachments: I could not figure out a way to convert these attachments to text content. TWiki really needs a search engine based on better technologies than "Plucene" or "KinoSearch". Something is still missing in these two technologies. Perl deserves a better "Lucene" port.

How about using PHP (http://framework.zend.com/manual/en/zend.search.lucene.html)?

-- SopanShewale - 05 Jan 2009
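
Points [1] to [3] above fit together as one loop: read the per-topic skip file, try each attachment, and record any attachment that fails so it is skipped next time. The following is a minimal sketch of that flow, written in Python for illustration only (the add-on is Perl). The function names are hypothetical; only the _skip_.Web.Topic.txt naming is taken from the snippet in [2].

```python
from pathlib import Path
from typing import Callable, Iterable, List, Set


def skip_file(work_area: Path, web: str, topic: str) -> Path:
    """Path of the per-topic skip file, e.g. _skip_.Attachmentsweb.MyTopic.txt."""
    return work_area / f"_skip_.{web}.{topic}.txt"


def read_skip_list(work_area: Path, web: str, topic: str) -> Set[str]:
    """Return the attachment names listed in the skip file (one per line)."""
    path = skip_file(work_area, web, topic)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def add_to_skip_list(work_area: Path, web: str, topic: str, attachment: str) -> None:
    """Append an attachment name to the skip file unless already listed."""
    if attachment in read_skip_list(work_area, web, topic):
        return
    work_area.mkdir(parents=True, exist_ok=True)
    with skip_file(work_area, web, topic).open("a", encoding="utf-8") as fh:
        fh.write(attachment + "\n")


def index_attachments(
    work_area: Path,
    web: str,
    topic: str,
    attachments: Iterable[str],
    index_one: Callable[[str], None],
) -> List[str]:
    """Index each attachment; on failure, record it in the skip file and go on."""
    skipped = read_skip_list(work_area, web, topic)
    indexed = []
    for name in attachments:
        if name in skipped:
            continue
        try:
            index_one(name)  # hypothetical callback doing the real indexing
        except Exception:
            add_to_skip_list(work_area, web, topic, name)
            continue
        indexed.append(name)
    return indexed
```

Note that this only helps when the failure surfaces as an exception; an external converter that hangs or kills the whole process would need a timeout or a separate process around it.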

Merry new year 2009. About the last point (using Zend): the more widely the search engine back-end is used, developed and supported on the internet, the better. Desktop search engines such as http://projects.gnome.org/tracker are more widely used and supported than the KinoSearch Perl modules, but they won't work on both Unix and Windows systems; zend.search.lucene will work on both. Whatever the search engine's back-end is, it would be great to maintain and improve a single SearchEngineAddOn code base, with the engine-specific command lines and options set by "configure".

-- OlivierThompson - 10 Jan 2009

Made the changes [1] to [7] discussed above. I can't help with [8]. Check Item6177.

I have a comment about kinoupdate: this script re-indexes topics according to the .changes file present in each web directory. The .changes file records a change only if the topic is saved as a major revision, or more than 60 minutes after the previous change. So a topic can miss re-indexing if it was saved as a minor change or within 60 minutes of the previous change.

I think it would be a good idea to record every topic change in a separate file, similar to the .changes file, by using the afterSaveHandler of SearchEngineKinoSearchAddOnPlugin.

-- SopanShewale - 10 Feb 2009
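
The proposal above (log every save in a separate file, then let the updater read that file) can be sketched as follows. This is an illustration in Python, not the add-on's Perl code; the file name .kinochanges, the tab-separated line format and both function names are assumptions made up for this example.

```python
import time
from pathlib import Path
from typing import List, Optional

LOG_NAME = ".kinochanges"  # hypothetical file name, kept next to .changes


def record_change(web_dir: Path, topic: str, now: Optional[float] = None) -> None:
    """Append one line per save; meant to be called from a save handler."""
    stamp = int(time.time() if now is None else now)
    with (web_dir / LOG_NAME).open("a", encoding="utf-8") as fh:
        fh.write(f"{topic}\t{stamp}\n")


def topics_changed_since(web_dir: Path, since: int) -> List[str]:
    """Return each topic saved after `since`, once, in first-seen order."""
    path = web_dir / LOG_NAME
    if not path.exists():
        return []
    seen = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        topic, _, stamp = line.partition("\t")
        if stamp.isdigit() and int(stamp) > since:
            seen[topic] = True  # dicts keep insertion order
    return list(seen)
```

Because every save is logged, minor edits and edits made within 60 minutes of each other are not lost; the updater only has to remember the timestamp of its last run.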

I recently implemented support for Office OpenXML documents (such as .docx, .xlsx and .pptx files) for a TWiki installation which I maintain. Implementation details and code can be found at http://www.clausbrod.de/Blog/DefinePrivatePublic20090720TWikiKinoSearch. Hope you'll find this useful.

-- ClausBrod - 2009-07-24

Cool, thank you Claus for sharing this with the TWiki community!

Related article: Convert OpenXML (.docx, etc.) in Linux using command line, http://www.oooninja.com/2008/01/convert-openxml-docx-etc-in-linux-using.html

-- PeterThoeny - 2009-07-27

For documents with the extension .docx, I guess the best candidate for converting the document into text would be http://docx2txt.sourceforge.net/

I developed a module to stringify docx attachments and checked the code into the development branch. Check Item6177.

-- SopanShewale - 2009-08-11

Here is a poor-man's approach to convert .docx to text. A .docx is simply a zip'ed file of xml files, one of which is the actual content. Quick & dirty docx2txt script to convert a .docx to text format, output to stdout:

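#!/bin/bash
# Usage: docx2txt FILE.docx  (prints the document text to stdout)
if [ ! -f "$1" ]
then
    echo "ERROR: File $1 does not exist" >&2
    exit 1
fi
# Extract the main document part, replace XML tags with spaces, squeeze repeated spaces
unzip -p "$1" word/document.xml | sed 's/<[^>]*>/ /g; s/  */ /g'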

The script from docx2txt.sourceforge.net is probably more reliable than this one. However, the quick & dirty approach can be used to convert .xlsx and .pptx to text.

-- PeterThoeny - 2009-08-12
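
To show how the same quick & dirty idea carries over to .pptx and .xlsx, here is a minimal sketch in Python (chosen for illustration; it is not the Perl stringifier from Item6177). It assumes the usual Office Open XML package layout: word/document.xml for .docx, ppt/slides/slideN.xml for .pptx, and xl/sharedStrings.xml for the text cells of .xlsx. It also assumes the parts are UTF-8 encoded. The function name ooxml_to_text is hypothetical.

```python
import os
import re
import zipfile
from html import unescape

# Archive members that carry the text, per extension (assumed standard layout).
CONTENT_PARTS = {
    ".docx": re.compile(r"word/document\.xml$"),
    ".pptx": re.compile(r"ppt/slides/slide\d+\.xml$"),
    ".xlsx": re.compile(r"xl/sharedStrings\.xml$"),
}


def ooxml_to_text(path: str) -> str:
    """Strip XML tags from the content parts of a .docx/.pptx/.xlsx file."""
    ext = os.path.splitext(path)[1].lower()
    pattern = CONTENT_PARTS[ext]  # KeyError for unsupported extensions
    chunks = []
    with zipfile.ZipFile(path) as archive:
        # Plain sort: slide10.xml sorts before slide2.xml, fine for indexing.
        for name in sorted(archive.namelist()):
            if pattern.match(name):
                xml = archive.read(name).decode("utf-8", errors="replace")
                text = re.sub(r"<[^>]*>", " ", xml)  # same trick as the sed call
                chunks.append(unescape(text))  # turn &amp; etc. back into text
    return re.sub(r"\s+", " ", " ".join(chunks)).strip()
```

As with the shell version, this ignores headers, footers, notes and numeric spreadsheet cells, so a dedicated converter is likely to give better results.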

Added stringifiers for docx, pptx and xlsx. Check Item6177 for more details. The converters are Perl based.

-- SopanShewale - 2009-08-18

Released pptx2txt at http://sourceforge.net/projects/pptx2txt/. Please suggest improvements on that page. Thank you, Twiki, Inc., for allowing me to invest time in the tool development.

-- SopanShewale - 2009-08-25

Here I am with a new problem... the indexing fails with the following error:

Maximum token length is 65535; got 101013 at /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-thread-multi/KinoSearch/Index/SegWriter.pm line 84

KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x19dbde78)', 'KinoSearch::Document::Doc=HASH(0x1b627cc8)') called at /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-thread-multi/KinoSearch/InvIndexer.pm line 224

KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x8c595c8)', 'KinoSearch::Document::Doc=HASH(0x1b627cc8)') called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 606

TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexAttachment('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', 'KinoSearch::InvIndexer=HASH(0x8c595c8)', 'CCTdoc/Sidi/Giemmestile', 'GiemmestileRapportini', 'HASH(0x1dd0cf68)') called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 528

TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexTopic('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', 'KinoSearch::InvIndexer=HASH(0x8c595c8)', 'CCTdoc/Sidi/Giemmestile', 'GiemmestileRapportini', 'InseritaDa', 1, 'ResponsabileRisorsa', 1, 'StatoUtenteTWiki', ...) called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 69

TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::createIndex('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', '') called at ./kinoindex line 29

In that path there are three Microsoft Word documents that were correctly indexed in the past, so the problem must be somewhere else... I hope the problem isn't that there are too many documents to put in the index (as you may remember, I have a very large document repository on my TWiki site; it has now reached about 5 GB of attachments)...

-- IvanSassi - 2009-09-02

Any news?

-- IvanSassi - 2009-09-15

Searching around I found this: http://www.rectangular.com/pipermail/kinosearch/2008-July/005322.html

Could it be useful?

-- IvanSassi - 2009-09-21

Hi IvanSassi - It looks like it will be useful, but we need to test it further. Do you know which document (MS doc) causes the problem when you index? Would you mind sharing it with me, through email, a temporary FTP site, or by attaching it to this topic? Please share it only if it is small (less than 1 or 2 MB).

Accepting the suggestion from the above link regarding token size also means adding a few more dependencies to install :-)

-- SopanShewale - 2009-10-08
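
For readers hitting the "Maximum token length is 65535" error quoted above, one possible workaround (not necessarily what the linked mailing-list post suggests) is to drop oversized runs of characters from the stringified text before it reaches the indexer. The sketch below is in Python for illustration only. It assumes that a "token" is a whitespace-delimited run and that the limit is counted in bytes; KinoSearch's own tokenizer may split text differently, so treat both as assumptions.

```python
MAX_TOKEN_BYTES = 65535  # limit quoted in the error message above


def drop_oversized_tokens(text: str, limit: int = MAX_TOKEN_BYTES) -> str:
    """Remove whitespace-delimited runs whose UTF-8 size exceeds `limit`."""
    kept = [tok for tok in text.split() if len(tok.encode("utf-8")) <= limit]
    return " ".join(kept)
```

Runs that long are usually conversion debris (for example embedded binary data) rather than searchable words, so dropping them should cost little in search quality.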

Released the latest version of SearchEngineKinoSearchAddOn. Most of the features discussed above are implemented, along with indexing support for .pptx, .docx and .xlsx.

Suggestions, improvements, issues: let us discuss them here.

-- SopanShewale - 2009-10-11

Alas, as I wrote, the problem doesn't seem related to any particular document; it occurs with documents that were correctly indexed in the past. I'm now installing the new version of the add-on for a new test and will post the results (but if the changes only concern the .pptx, .docx and .xlsx formats, I don't expect a better outcome).

-- IvanSassi - 2009-10-19

As expected even with the new version the results are the same:

Maximum token length is 65535; got 101013 at /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-thread-multi/KinoSearch/Index/SegWriter.pm line 84
        KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x1a2fc998)', 'KinoSearch::Document::Doc=HASH(0x1ff16bd8)') called at /usr/local/lib/perl5/site_perl/5.10.0/i686-linux-thread-multi/KinoSearch/InvIndexer.pm line 224
        KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x8addeb8)', 'KinoSearch::Document::Doc=HASH(0x1ff16bd8)') called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 606
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexAttachment('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', 'KinoSearch::InvIndexer=HASH(0x8addeb8)', 'CCTdoc/Sidi/Giemmestile', 'GiemmestileRapportini', 'HASH(0x1c2c6760)') called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 528
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::indexTopic('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', 'KinoSearch::InvIndexer=HASH(0x8addeb8)', 'CCTdoc/Sidi/Giemmestile', 'GiemmestileRapportini', 'InseritaDa', 1, 'ResponsabileRisorsa', 1, 'StatoUtenteTWiki', ...) called at /srv/www/htdocs/prototwiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/Index.pm line 69
        TWiki::Contrib::SearchEngineKinoSearchAddOn::Index::createIndex('TWiki::Contrib::SearchEngineKinoSearchAddOn::Index=HASH(0x804...', 1) called at ./kinoindex line 29

Have you done any testing of the above suggestion?

-- IvanSassi - 2009-10-27

Hi IvanSassi - this last suggestion was not added in the latest version. Expect me to work on this soon. Cheers

-- SopanShewale - 2009-10-27

Any news? (sorry to bother you but without the indexed search the potential of TWiki as documentation repository is seriously compromised)

-- IvanSassi - 2009-11-16

No problem, IvanSassi. I have some more ideas to improve the performance, but my issue is bandwidth: I have very few resources to spare for some of the tasks I want to handle...

Do you have some bandwidth? If you are interested, let us get connected via IRC/chat. I can help you to develop/own some of the plugins at TWiki. Cheers

-- SopanShewale - 2010-01-07

Observation:

The indexing time grows much faster than linearly with the size of the attachments (we can call it exponential). It is not true that if a PDF file of "M" MB takes "N" seconds, a PDF of "10xM" MB with a similar type of content will take "10xN" seconds.

Indexing depends entirely on how clean the text fed into the KinoSearch engine is, and how big that text is.

What can be done? One can do the following to improve indexing performance:

  • Restrict the size of attachments: if an attachment is above, say, 2 MB (the limit can be decided based on the memory size of the box), skip it from indexing. This should be similar to "Set KINOSEARCH_ATTACHMENT_LIMIT = 20000".

  • Add some kind of cache for the stringify activity. Store the text-converted attachments on the hard drive; if an attachment has not changed, don't stringify it again, but use the already converted text file.
If anyone is interested in coding/contributing to the community, please let me know. Let us get connected on IRC or personal IM.

-- SopanShewale - 2010-01-15
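
The two suggestions above (a size limit and a cache of converted text) can be combined in one helper. This is a minimal sketch in Python, for illustration only. The function name cached_stringify, the cache layout and the use of a SHA-256 content hash as the "has it changed?" test are assumptions of this example, not the add-on's implementation.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def cached_stringify(
    attachment: Path,
    cache_dir: Path,
    stringify: Callable[[Path], str],
    size_limit: int,
) -> Optional[str]:
    """Return the attachment's text, or None if it exceeds `size_limit` bytes.

    The converted text is stored under a name derived from the file content,
    so an unchanged attachment is never converted twice.
    """
    if attachment.stat().st_size > size_limit:
        return None  # too large: skip it from indexing

    digest = hashlib.sha256(attachment.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")

    text = stringify(attachment)  # hypothetical converter callback
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

Hashing still reads the file once per run, but that is typically far cheaper than launching an external converter; comparing size and modification time instead would avoid even that read, at the cost of missing some edits.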

Added support for the KINOSEARCH_ATTACHMENT_INDEX_SIZELIMIT preference variable. This can save indexing time by skipping really large attachments.

-- SopanShewale - 2010-01-30

I am working with KinoSearch (KS) and TWiki for an industrial client, and I'm looking for feedback and suggestions before working on some enhancements. (I'm posting versions of this message to KinoSearch, TWiki, and FosWiki sites).

We have ~50 MB of wiki pages and ~5000 MB of attachments (eg Adobe PDF, Microsoft Word, PowerPoint, & Excel) hosted on a virtual server. Currently it takes ~30 minutes to make an index (150MB) of just the wiki pages, and we would like to stringify and index the attachments.

We've had problems where the indexing would hang or crash, leaving an incomplete index file, and incorrect search results. I think these problems have been fixed (thanks to Marvin Humphrey). However, I still worry about creating a single index for the entire site. The longer it takes to build, the more risk that something will go wrong.

Would it make sense to use separate indices for each web or for each file type? From skimming the KS forums, it looks like KS doesn't (yet) provide much support for this, though it's being considered: http://www.rectangular.com/pipermail/kinosearch/2006-August/006513.html

Or should multiple indexes be combined into one large index? http://www.rectangular.com/pipermail/kinosearch/2006-July/004847.html

To use multiple indexes, I think I would need to:

  • make the indexer and updater accept parameters for web(s), file type(s), and index location(s)
  • figure out which index(s) to update when a file changes
  • make the searcher accept parameters for web(s), file type(s), and index location(s)
  • make a parent searcher that forwards queries to appropriate children based on web, file type, etc, and then combines the results (modeled on KS's MultiSearcher)
  • maybe make a parent indexer (and updater) that forwards files to appropriate children based on web, file type, etc

Should this be managed within KS or in the wiki plugin/addon code? I'm more familiar with the latter, but the former might be more widely useful.

Separate but related issues:

For TWiki/FosWiki, would it make sense to expose the index path so there is a reasonable default that can be overridden? Among other things, this would make testing easier.

For Excel files, we're using the default stringifier in the wiki addon, which indexes the contents of all spreadsheet cells. I think of cells as containing either numbers, formulas, or text (anything else). Would it make sense for the stringifier to skip numbers & formulas, or to make this a configuration option?

Any suggestions for related things to do while I'm inside the code? Thank you for your time and consideration.

-- ClifKussmaul - 2010-05-10
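
The "parent searcher" idea in the post above can be sketched as a function that forwards a query to one searcher per web and merges the hits by score. This Python sketch is illustrative only and does not use the KinoSearch API; the Hit and Searcher types and the search_all name are hypothetical.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Hit = Tuple[float, str]               # (score, document id)
Searcher = Callable[[str], List[Hit]]  # one searcher per web-specific index


def search_all(
    searchers: Dict[str, Searcher],
    query: str,
    webs: Optional[Iterable[str]] = None,
    limit: int = 10,
) -> List[Hit]:
    """Forward `query` to the selected webs and return the best hits overall."""
    if webs is None:
        selected = list(searchers.values())
    else:
        selected = [searchers[web] for web in webs if web in searchers]

    hits: List[Hit] = []
    for searcher in selected:
        hits.extend(searcher(query))
    # Highest score first, truncated to `limit` results.
    return heapq.nlargest(limit, hits, key=lambda hit: hit[0])
```

One caveat with this design: relevance scores computed against separate indexes are generally not directly comparable, because term statistics differ per index, so a simple merge like this can rank results unevenly across webs.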

Hi ClifKussmaul

Thanks for your posting. You seem to have a reasonably large TWiki to work on! My suggestions would be:

  • Separate index directory per web.
    • This will involve modifying the search and indexing scripts.
    • Let us brainstorm more when you have time. I think we need to handle this at the TWiki end, and maybe not at the KS end.
  • About Excel: one can develop new modules and logic. Currently we are using the standard Spreadsheet::ParseExcel.
  • The better the text extractors, the shorter the indexing time.

-- SopanShewale - 2010-05-12
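
The Excel question raised earlier (skip numbers and formulas, or make that configurable) could look roughly like the sketch below. It is written in Python for illustration and does not use Spreadsheet::ParseExcel; the Cell representation as a (value, formula) pair and the cells_to_text name are assumptions of this example.

```python
from typing import Iterable, Optional, Tuple

Cell = Tuple[object, Optional[str]]  # (value, formula text or None)


def cells_to_text(
    cells: Iterable[Cell],
    skip_numbers: bool = True,
    skip_formulas: bool = True,
) -> str:
    """Join the text of spreadsheet cells, optionally omitting numbers/formulas."""
    words = []
    for value, formula in cells:
        if value is None:
            continue  # empty cell
        if skip_formulas and formula is not None:
            continue
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if skip_numbers and is_number:
            continue
        text = str(value).strip()
        if text:
            words.append(text)
    return " ".join(words)
```

Exposing the two flags as preference settings would let each site decide whether numeric content is worth the extra index size.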

I am currently upgrading to TWiki 5.0.1 and ran into trouble with this add-on (which is very important for us). I had to reinstall the CPAN module KinoSearch, which now comes in version 0.311. But when installing the add-on I get this error message:

TWiki::Plugins::SearchEngineKinoSearchPlugin could not be loaded. Errors were: KinoSearch::InvIndexer has been replaced by KinoSearch::Index::Indexer. at /usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/InvIndexer.pm line 3 require KinoSearch/InvIndexer.pm called at /srv/www/twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/KinoSearch.pm line 26

As far as I understand Perl, the latest KinoSearch CPAN module no longer supports InvIndexer.pm.

I tried to use an older version of KinoSearch from CPAN, but that again led to other errors.

Can anybody help?

-- MichaelGulitz - 2010-12-29

We tested over 170 extensions for TWiki-5.0; the KinoSearch add-on has not yet been tested. Your help is appreciated in testing and fixing this add-on.

-- PeterThoeny - 2010-12-29

I use Apache Tika to stringify - it's one actively maintained package that converts many file formats. I'm happy to provide more detail if there is interest.

-- ClifKussmaul - 2011-12-21

Clif, Yes I'd be interested to know more about how you are using Tika. Thanks!

-- JudBarron - 2012-06-19

Tika was pretty straightforward.

  1. download & install Tika - I don't think any special steps were needed
  2. create a wrapper script (e.g. /usr/local/bin/tika2txt, below) to run java with the tika jarfile and appropriate options.
  3. add a new stringifier (e.g. Tika.pm, attached below) to twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins
  4. check that other stringifiers are disabled

#!/bin/sh
# path to jarfile (used below)
tikajar=/usr/local/tika/tika-app-1.0.jar

# if 0 params, print info, otherwise run on params
# - increase max heap to 512M (from 128M default)
case $# in
    0) echo "Usage: $0 [FILE]..."
       echo "convert file(s) to text. For more info:"
       java -jar $tikajar --help ;;
    *) java -Xmx512m -Djava.io.tmpdir=/twiki/data/tmp -jar "$tikajar" --encoding=iso-8859-15 --text "$@" ;;
esac

-- ClifKussmaul - 2012-06-20

Can the GoogleAjaxSearchPlugin search attachments? It's not clear to me from the references above and the web searching I've done. And if it does, please send me a link to how it works if you can.

-- Mark Ikemoto - 2013-02-28

Hi MarkusHesse,

I have installed the KinoSearch plugin on my TWiki server, but when I try to search using this tool, I get the error below:

Internal Server Error The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, root@localhost and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Any suggestions please...

Thanks, Kalyan

-- Kalyan Pasupuleti - 2013-10-17

So, what is the Apache error_log saying? Best to open a support question in the Support forum.

-- Peter Thoeny - 2013-10-17

Also, the add-on is marked as tested only on TWiki-4.3 and 4.2. It likely needs some work to run on the latest TWiki versions. If you need timely help, I recommend hiring a TWiki consultant.

-- Peter Thoeny - 2013-10-17

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| 4.1.2.gif | r1 | 6.0 K | 2008-04-14 - 05:34 | MagicYang | Search result page on 4.1.2 |
| 4.2.gif | r1 | 7.1 K | 2008-04-14 - 05:35 | MagicYang | Search result page on 4.2 |
| Error_kinosearch.txt | r1 | 6.9 K | 2008-05-06 - 08:21 | MichaelSchmidt | Error message of kinosearch under Windows/cygwin |
| SearchEngineKinoSearchAddOn-StringifierPlugins.zip | r3 r2 r1 | 40.4 K | 2008-01-07 - 21:24 | GuilhermeGarnier | Updated stringifier plugins |
| TEST-TWikiKinoSearchAddOn1.jpg | r1 | 42.9 K | 2008-05-21 - 13:08 | IvanSassi | An image of a ugly search result |
| Tika.pm | r1 | 2.8 K | 2012-06-20 - 15:04 | ClifKussmaul | stringifier using Tika |
| pptx2txt.pl.txt | r1 | 5.3 K | 2009-08-18 - 12:10 | SopanShewale | tool to extract text from pptx files |
| test.xls | r1 | 94.5 K | 2007-12-19 - 15:54 | GuilhermeGarnier | XLS file for testing |
Topic revision: r210 - 2013-10-17 - PeterThoeny
 
Copyright © 1999-2015 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.