SearchEnginePluceneAddOnDev Discussion: Page for developer collaboration, enhancement requests, patches and improved versions on SearchEnginePluceneAddOn contributed by the TWikiCommunity.
• Please let us know what you think of this extension.
• For support, check the existing questions, or ask a new support question in the Support web!
• Please report bugs below

Feedback on the SearchEnginePluceneAddOn

-- JoanMVigo - 18 Nov 2004

Interesting! Many thanks for posting this. Note that you could use the beforeSaveHandler to trigger a re-indexing event. I can't see any way to get Lucene (and I guess therefore Plucene) to maintain the index incrementally, which would be essential to avoid killing the server.

-- CrawfordCurrie - 18 Nov 2004

Plucene (and Lucene too) can do incremental updates in a special way:

  • open an IndexReader, search the doc to be updated with a unique id (could be web+topic, checking that no attachment field exists) and delete it, then close the index reader
  • open an IndexWriter, create the document, set the web and topic values into the fields, add the document to the collection and close the index writer
Something similar could be executed when a document is attached. Also, a crontab entry should be created with a new script, just to open an index writer and optimize it.

I think I could code these operations, but I don't know where to put them. I didn't know about the beforeSaveHandler event. Where should I put the code?

-- JoanMVigo - 18 Nov 2004

It would be useful if we isolated the mechanism you used to override search, generalise it and make this available for other plugins.

I say this because the indexing and searching of attachments is something we did in IndexServerSearchForMsIisAddOn: I imagine there are commonalities.

-- MartinCleaver - 18 Nov 2004

The beforeSaveHandler is used in plugins; you can write a plugin to provide it. Read the source of EmptyPlugin.pm (EmptyPluginDotPm) for information on plugin handlers. The attachment handlers are not documented there, I notice, but they look like this:

sub beforeAttachmentSaveHandler {
   my ( $attrHash, $topic, $web ) = @_;
   # runs before the attachment is saved
}

sub afterAttachmentSaveHandler {
   my ( $attrHash, $topic, $web, $error ) = @_;
   # runs after the attachment is saved
}
I've never used them myself, but I think MartinCleaver has.

-- CrawfordCurrie - 20 Nov 2004

Yes, Crawford is right: afterAttachmentSaveHandler would be ideal for your needs. It is defined to run "sometime after" the attachment is uploaded. (The current implementation is before returning to view but I envisioned a regular job fulfilling a queue of pending changes).

I tried to update the ExamplePlugin EmptyPlugin myself a while back but found that it was not in the Plugins web but rather part of the distro. I've since been granted access to DevelopBranch - I'll update it shortly on there.

-- MartinCleaver - 20 Nov 2004

Well, I just started to code when I realised that things were not just that easy. If a topic is moved, the old indexed topic must be removed while the new is saved. The same happens when you move attachments and the topic metadata is updated.

Another problem arises when two or more users update some topics: there is a write lock for the index.

So I have taken another approach to updating the index: an incremental update script that works like mailnotify. Using the per-web files .changes and .plucupdate, we can tell which topics have been updated and when the plucupdate script last ran. We can then schedule a crontab job for plucupdate every hour (or whatever you like), and the job will process only the most recent changes that have not yet been indexed.
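
The selection step described above can be sketched as follows. This is only a minimal sketch, not the actual plucupdate code: it assumes each .changes line is tab-separated with the topic name in the first field and an epoch timestamp in the third, and that .plucupdate holds the epoch time of the last run. The function name topics_changed_since is hypothetical.

```perl
use strict;
use warnings;

# Return the topics changed after $last_run (epoch seconds), each listed once.
# Assumed line format: "TopicName\tuser\tepoch\trevision".
sub topics_changed_since {
    my ( $last_run, @lines ) = @_;
    my ( %seen, @topics );
    for my $line (@lines) {
        chomp $line;
        my ( $topic, $user, $time ) = split /\t/, $line;
        next unless defined $time && $time =~ /^\d+$/;    # ignore malformed lines
        next if $time <= $last_run;    # already handled by a previous run
        push @topics, $topic unless $seen{$topic}++;
    }
    return @topics;
}

my @changes = (
    "WebHome\tJoanMVigo\t1100000000\t1.3",
    "TestTopic\tJoanMVigo\t1100003600\t1.1",
    "TestTopic\tJoanMVigo\t1100007200\t1.2",
);
my @pending = topics_changed_since( 1100000000, @changes );
print "$_\n" for @pending;    # topics to re-index in this run
```

After a successful run the script would store the current time as the new "last run" value, so the next cron invocation starts from there.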

I have to finish testing, but I hope to update the add-on topic tomorrow and to upload a new ZIP file with the new version (I've also added comments to the script code).

-- JoanMVigo - 22 Nov 2004

Ok. The incremental version is ready; however, I want to improve it because under very heavy usage the plucupdate script may raise the "too many files open" error (I've read on the Plucene mailing list that under certain conditions some file handles aren't closed).

I'd love to get some feedback about its usage. Please, try it and post your comments here.

-- JoanMVigo - 23 Nov 2004

I've just fixed some code to make this add-on compatible with TWiki Cairo release

  • use of formatTime instead of old formatGmTime
  • use of TWiki::Render::getRenderedVersion instead of old TWiki::getRenderedVersion
Performance issues seem to be solved, as plucindex initialized the .plucupdate file for each web.

-- JoanMVigo - 26 Nov 2004

I really want to give this a go, but my plate is already full with the DEVELOP branch. However, rest assured I will try it out and give feedback just as soon as I can!

-- CrawfordCurrie - 26 Nov 2004

Like Crawford, I agree that what you are doing is important, but lack the time to help out. Are there specifics you need assistance with? How do you interface to TWiki's inbuilt search mechanism?

-- MartinCleaver - 29 Nov 2004

The plucsearch script reuses a few lines of code from the Search.pm (retrieving of webs, list of topics for each web, checking access to result topics). However I didn't code the inline search or any options to limit search to a specific web, ordering the results by topic name, author or date.

To Do (unresolved questions)

  • I don't know if a Plucene inline search should be implemented. What do you think? Could be useful an inline search which includes attachments in its search scope? If it were, should it be implemented as a Plugin (%PLUCSEARCH%)?
    Note Meta fields are indexed, however searching for them does not get any hit. Should be investigated
  • Enhancements to the plucsearch script: limit search to a web scope, order results by topic name, author or date
  • Develop more parsing backends (see CPAN:Plucene::SearchEngine::Index), so more document types could be indexed (with help from 3rd party apps, as could be parsing MSWord with antiword)

-- JoanMVigo - 03 Dec 2004

Joan: If you could componentise SearchCgiScript and TWiki/UI/Search so that your search stuff is pluggable, this would be particularly useful, as we could eliminate the plucsearch script and thus get wider adoption of your work.

All: At some point it would be useful for a Windows user to compare and integrate the IIS index server implementation: then a generic indexer interface and generic search become even more important. I appreciate the Windows stuff probably does not interest Joan though. It's an architectural matter that is unlikely to fall squarely on any individual organisation's agenda. As we have no fund to help the architect type people in GettingPaidToDevelopTWiki, I fear that this is unlikely to happen at all.

-- MartinCleaver - 03 Dec 2004

Thanks for the good plugin. I am able to use it for MS Word and MS Excel after creating backend modules inside CPAN:Plucene::SearchEngine::Index.

I have yet to figure out what needs to be done for MS PowerPoint presentations. If anyone has an idea for PPT, please let me know.

-- SopanShewale - 03 Dec 2004

Yes, I am able to search MS PowerPoint presentations as well. I am using ppthtml, which is provided with xlhtml. xlhtml is for viewing Excel files, so I am going to review the Excel search capability with xlhtml too. I have created Plucene::SearchEngine::Index::{DOC.pm, XLS.pm and PPT.pm}. I will try to upload them to CPAN by Tuesday after reviewing and comparing them with the best conversion tools.

-- SopanShewale - 04 Dec 2004

Update to the above discussion: I have written backend parsers for MS Word, Excel and PPT files. Kindly see the attachment ExtraBackendParsers.zip (README is provided). It would be a great help if someone could help add those files as modules to CPAN:Plucene::SearchEngine::Index.

To use these backend parsers, a small change is required in the plucindex and plucupdate scripts. Line no. 117 in plucindex should read as follows.

116    # only pdf, html and txt - for more file types look for Plucene::SearchEngine at search.cpan.org
117      if ( $name =~ m/\.pdf$/ || $name =~ m/\.html$/ || $name =~ m/\.txt$/ || $name =~ m/\.doc$/ || $name =~ m/\.xls$/ || $name =~ m/\.ppt$/ )  {
118             $author = $attachment->{'user'};

Similar changes are required in the plucupdate script at line no. 221.

-- SopanShewale - 08 Dec 2004

Thanks for your efforts, Sopan. I'll try to integrate your parsers into the main branch :-)

However, I'd like to code something that is aware of new implementations, so that the add-on does not need to be modified each time new parsers/indexers become available. Does Perl have some mechanism to check or list the member classes of a class? (Something like the reflection API in Java, so that CPAN:Plucene::SearchEngine::Index becomes aware of the classes belonging to it.)
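
One way to approach this question, sketched here with core Perl only, is to look for module files under the Plucene/SearchEngine/Index/ directory of each library path. This is an illustration, not what the add-on does: the function name backend_extensions is hypothetical, and it assumes that a backend is named after its extension in upper case (DOC.pm for .doc), as the parsers discussed in this topic are.

```perl
use strict;
use warnings;
use File::Spec;

# List the extensions for which a backend module file exists under
# <lib>/Plucene/SearchEngine/Index/ in any of the given library directories.
sub backend_extensions {
    my (@lib_dirs) = @_;
    my %ext;
    for my $lib (@lib_dirs) {
        my $dir = File::Spec->catdir( $lib, 'Plucene', 'SearchEngine', 'Index' );
        next unless -d $dir;
        opendir( my $dh, $dir ) or next;
        for my $file ( readdir $dh ) {
            # Assumption: only all-upper-case names (DOC.pm, XLS.pm) map to
            # an extension; other modules such as Base.pm are ignored.
            $ext{ '.' . lc($1) } = 1 if $file =~ /^([A-Z0-9]+)\.pm$/;
        }
        closedir $dh;
    }
    return sort keys %ext;
}

print join( ',', backend_extensions(@INC) ), "\n";
```

The resulting list could feed the attachment filter directly, so that installing a new backend module is enough to get that file type indexed.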

-- JoanMVigo - 09 Dec 2004

New version available (check the add-on topic). Now, after unzipping the files, you just need to add two new TWiki preferences before you start using the scripts. It is no longer necessary to modify the scripts.

   * Plucene settings
      * Set PLUCENEINDEXPATH = /srv/www/personal/index or where your index folder is located
      * Set PLUCENEINDEXEXTENSIONS = .pdf,.html,.txt,.doc
The PLUCENEINDEXPATH variable should be included in FINALPREFERENCES.

Now, if you use the contribution (ExtraBackendParsers.zip) by TWiki:Main.SopanShewale or any other index library, you can easily tell your Plucene engine which attachment file types should be indexed.
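
As an illustration of how such a preference value can drive the attachment filter, here is a minimal sketch. It is not the add-on's own code: the function name extension_regex is hypothetical, and matching case-insensitively is a design choice made for this example.

```perl
use strict;
use warnings;

# Turn a value such as ".pdf,.html,.txt,.doc" into a single regex that
# matches attachment names by their extension.
sub extension_regex {
    my ($pref) = @_;
    my @exts;
    for my $ext ( split /,/, $pref ) {
        $ext =~ s/^\s+|\s+$//g;    # tolerate ".pdf, .html" style spacing
        push @exts, quotemeta($ext) if length $ext;
    }
    return qr/(?!)/ unless @exts;    # nothing configured: match nothing
    my $alternation = join '|', @exts;
    return qr/(?:$alternation)$/i;
}

my $indexable = extension_regex('.pdf,.html,.txt,.doc');
for my $name ( 'Report.PDF', 'notes.txt', 'photo.png' ) {
    printf "%-12s %s\n", $name, $name =~ $indexable ? 'index' : 'skip';
}
```

Building the pattern once from the preference replaces the hard-coded chain of m/\.pdf$/ || m/\.html$/ ... tests in the scripts.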

I've also attached to this Dev discussion a Plucene::SearchEngine::Index::DOC.pm that uses the antiword utility for indexing DOC files. You should copy it to your Perl - Plucene location, /usr/lib/perl5/site_perl/5.8.0/Plucene/SearchEngine/Index/ for me.

-- JoanMVigo - 15 Dec 2004

Because Plucene lacks wildcard searching, searching on partial topic names is difficult. Supporting partial topic name search requires changes to the plucindex, plucupdate and plucsearch scripts. I have created patches for the scripts that let us search topics as described below.

The topic name: MyNewelyCCreated-Topic

Any of the queries "topic:My", "topic:Newely", "topic:Created", "topic:Topic" returns this topic.
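
The idea can be illustrated with a small sketch that splits a topic name into its capitalised parts, each of which could then be indexed as an additional "topic" term. This is not the logic of the attached patches, only an assumption about one way to do it; topic_parts is a hypothetical name.

```perl
use strict;
use warnings;

# Split a WikiWord-style topic name into its capitalised parts: every
# upper-case letter followed by one or more lower-case letters or digits.
sub topic_parts {
    my ($topic) = @_;
    return $topic =~ /([A-Z][a-z0-9]+)/g;
}

print join( ' ', topic_parts('MyNewelyCCreated-Topic') ), "\n";
# prints "My Newely Created Topic"
```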

The patches are provided in plucenscriptpatches.zip, attached below. Unzipping this file creates the directory patches with three files in it: plucindex.patch, plucsearch.patch and plucupdate.patch. Use the patch command to patch each individual script.

-- SopanShewale - 23 Mar 2005

Hello, I installed the plugin, but now I get

Software error:
Can't use an undefined value as an ARRAY reference at /usr/local/share/perl/5.6.1/Plucene/Search/BooleanQuery.pm line 122.

Does anybody know where my problem could be? Thx!

-- MarkusLitz - 06 May 2005

Does this work equally on Windows and UNIX boxes? Is there an argument in making it the default search engine and rolling it into EdinburghRelease?

-- MartinCleaver - 14 May 2005

If we ship this as the default search engine we'd raise the TWikiSystemRequirements because of the dependencies on external tools. Something I like to avoid.

-- PeterThoeny - 17 May 2005

Well, I've been working hard for my employer, so I couldn't answer you before. Sorry!

First, regarding installation:

  • This plugin has not been tested on any Windows platform. I don't know if CPAN dependencies (Plucene and related) can be compiled/installed on Windows platforms.
  • This plugin has been tested on Linux boxes using Perl 5.8.1 and above. So regarding the problem with an undefined value as an ARRAY reference, maybe some required package is missing, or there is a version conflict between required and installed packages. (Please describe your software environment in a little more detail, thanks!)

Regarding the Plugin:

  • I developed a new version which can sort the hit collection by document score. However, I will not release it until I can patch it with the SopanShewale contribution. It will be soon, I hope - Sopan, thanks for your patience.
  • As I said, I'm not a Perl guru, and a lot of the original capabilities of the grep search engine are missing; it still needs a lot of improvement.
  • However, as it indexes attachments, I think that it should be shipped with TWiki, as the grep search engine is. Once TWiki is installed, you would choose the search engine to be used by switching some configuration parameter in TWiki.cfg or TWikiPreferences. What do you think about it?

-- JoanMVigo - 25 May 2005

Ok. New version released including:

  • results sorted by score
  • partial topic names are also indexed, so you can just search topic:Word (workaround for missing Plucene wildcard search)
Thanks for your patience.

-- JoanMVigo - 02 Jun 2005

Joan - if you could find where in the main Dakar code an interface could be made, and do the coding so that both the existing grep search and your Plucene search use that interface, configurable with the switch you mention, we could make the case that the core team include the stub and provide swappable modules for either search implementation.

-- MartinCleaver - 03 Jun 2005

I see a "problem" implementing the swappable search engine: The Query syntax.

Either:

  1. All the search implementations must be regexp-based (to maintain backward compatibility) OR
  2. State that different implementations may use different syntax and don't make any guarantee of compatibility between them (is this against the TWikiMission?) OR
  3. Every implementation must provide a translator from regexp to its proprietary query syntax, and provide a mechanism to use either the regexp or the proprietary syntax.

-- RafaelAlvarez - 28 Jun 2005

Just for the record, this TWikiAddOns package has the following dependencies as of today (as reported by CPAN):

  • IO::Scalar
  • Time::Piece
  • Class::Accessor::Fast
  • Encode::compat
  • Tie::Array::Sorted
  • File::Slurp
  • Lingua::Stem::En
  • Class::Accessor
  • Bit::Vector::Minimal
  • Class::Virtual

-- RafaelAlvarez - 28 Jun 2005

Hi,

Great plugin. I've installed it instead of the regular search engine. Some remarks and questions:

  1. In the PluceneSearch topic that is installed by the plugin, there is a wrong link to TWiki:Plugins/PluceneSearchEngineAddOn (should be: TWiki:Plugins/SearchEnginePluceneAddOn)
  2. That same topic says that wildcards are not supported. However, searching for "topic", "topic*" and "*topic*" yields three different results. What's happening? BTW: my users expected "*topic*" when they typed "topic"
  3. My end users only know Google. What are the main differences between Google and Plucene (bar the scoring)? Has anybody already tried to explain this to his/her end users?
  4. It would be great if the power of Plucene could be mixed with the control of the traditional Advanced Search with all the options. Especially selecting in which webs to search, or sorting of the results per web.

Thanks for a great plugin!

-- JosMaccabiani - 28 Jun 2005

Hi,

  • In order to manage an index effectively, it is necessary to get feedback about which webs, topics and attachments are indexed. It is also good to know whether the indexing completed successfully or not.
  • You should be able to skip some webs, e.g. Sandbox, from the indexing. You should also be able to skip attachments that our indexing modules find difficult to handle; for example, some .doc files result in a "Segmentation Fault" while running the plucindex script.

I have made changes to the scripts to address the above requirements. I am attaching ScriptsWithLoggingFeatures.zip to this topic; it contains plucindex, plucupdate and a dontindex.cfg file. You have to copy those to the appropriate place.

Some details, using 8 July 2005 as an example:

  • The plucindex script creates the logfile 20050708.log (the format is YYYYMMDD.log).
  • The plucupdate script creates the logfile update20050708.log.
  • The logfiles start with a line similar to "The indexing started at 11-06-2005 17:49:46".
  • The logfiles end with a line similar to "Indexing finished at 11-06-2005 18:31:26".

If the last line does not look like this, something went wrong with the indexing and you have to skip some attachment. Why only attachments? Webs and topics are text documents, and Plucene is very good at indexing those, so they give no issues at all. You only have to worry about attachments (provided you are indexing documents of type .pdf, .ppt, .xls, etc.).

If the "Indexing finished at ..." line is missing and the indexing script is no longer running, look at the last line. If it reads "Indexing attachment : Trash:TrashAttachment:Data-Overview.ppt", then the attachment Data-Overview.ppt, attached to topic TrashAttachment in the Trash web, is causing the problem (maybe a "Segmentation Fault"), so skip it: add the line "attachment:Trash:TrashAttachment:Data-Overview.ppt" to dontindex.cfg and the attachment will be skipped the next time.

The same procedure should be followed for the updateYYYYMMDD.log file.
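
The manual check described above could be automated along these lines. This is a minimal sketch that relies only on the log line formats quoted in this comment; the function name suggest_skip_entry is hypothetical and is not part of the attached scripts.

```perl
use strict;
use warnings;

# Given the lines of an index log, return nothing if the run finished, or the
# dontindex.cfg entry for the attachment that was being indexed when it stopped.
sub suggest_skip_entry {
    my (@log) = @_;
    my @lines = grep { /\S/ } @log;    # ignore blank lines
    return unless @lines;
    chomp( my $last = $lines[-1] );
    return if $last =~ /^Indexing finished at /;
    return "attachment:$1" if $last =~ /^Indexing attachment\s*:\s*(.+?)\s*$/;
    return;    # stopped somewhere else; nothing to suggest
}

my @log = (
    "The indexing started at 11-06-2005 17:49:46\n",
    "Indexing attachment : Trash:TrashAttachment:Data-Overview.ppt\n",
);
my $entry = suggest_skip_entry(@log);
print "$entry\n" if defined $entry;
```

The returned line can be appended to dontindex.cfg before the next run, after checking that the indexing script has really stopped.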

The procedure I followed may not be the best. For example, some might feel that skipping webs or files should be handled with a preference such as * Set SKIPATTACHMENTS = TrashAttachment:Data-Overview.ppt in the WebPreferences topic of the Trash web.

-- SopanShewale - 08 Jul 2005

checked .zip into CVS

-- WillNorris - 19 Jul 2005

Some answers:

  • As I pointed out earlier in this topic, this add-on was originally planned just for indexing and searching the attachments, including the TWiki topics for obvious reasons. And I still think it should not replace the grep engine. In all of the installations I set up, WebSearch is not deleted and is available to everybody. However, the Plucene search form is coded in the WebTopBar, so it is the default option.
    • I did not code the Plucene package, so some search functions are not available: wildcards (note that * is just another char) and regexps. To make it fully compatible with those functions, a lot of code would have to be added to the Plucene package, not to the add-on.
    • I did not code all the options the grep engine has either (inline search, multiple sorting, separate results per web), because I am/was not an experienced Perl programmer (and I did not find all of them necessary for my users).
  • The requirements of Plucene are those that RafaelAlvarez pointed out, plus others that those modules may require if they are not installed in some default setups. Note also that 3rd party tools may be required to extract text from attachments (xpdf, xlhtml, antiword, ...). There are other Lucene ports that could be used to develop a plugin with fewer dependencies. Take a look at Lucene Implementations
  • The grep engine, this add on and Google are just search engines. Each one has its query syntax, its options and all of them list the results in some different way.

About development:

  • I agree with SopanShewale about monitoring the indexing. However, I think new preferences should be used:
    • SKIPATTACHMENTS (per web): attachments not to be indexed
    • INDEXSTATISTICSTOPICS (per web): topic where indexing errors are reported (just like WEBSTATISTICSTOPIC)
Also, maybe some kind of email alert should be sent to the administrator, reporting the attachments that fail to be indexed. What do you think?

-- JoanMVigo - 19 Jul 2005

I guess there are two definitions of search engine: the user query (grep-like by default, but Plucene-like with this add-on) and the SEARCH query (grep-like by default).

I was suggesting only to make the user query pluggable. If this could be done before DevelopBranch goes live, it would greatly ease and increase the adoption of the Plucene-based engine.

-- MartinCleaver - 19 Jul 2005

I was getting these 'Too many open files' errors on Solaris:

Plucene::Store::InputStream cannot open /tmp/QoBg7OKMLk/_627.f24 for reading: Too many open files at /usr/local/lib/perl5/site_perl/5.8.6/Plucene/Store/InputStream.pm line 35.
        (in cleanup) Plucene::Store::InputStream cannot open /tmp/QoBg7OKMLk/_627.f24 for reading: Too many open files at /usr/local/lib/perl5/site_perl/5.8.6/Plucene/Store/InputStream.pm line 35.
I spent half the day looking for a fix. Setting ulimit -n 2000 did nothing; adding use BSD::Resource; setrlimit(RLIMIT_NOFILE, 2000, RLIM_INFINITY); did nothing. Eventually, I found this rather obscure reference: http://plucene.minty.org/cgi-bin/wiki.pl?Totally_Un-Official_Plucene_FAQ#0020 and changed my Plucene::Index::Writer::mergefactor to default to 5 instead of 10, and finally it works! I think this should be settable from a preference.

-- WadeTurland - 25 Aug 2005

Running plucupdate produces the following error:

"my" variable $writer masks earlier declaration in same scope at ./plucupdate line 272.

-- JosMaccabiani - 26 Aug 2005

Hi JosMaccabiani, this is because $writer is already declared somewhere near line 181. Just remove "my" from the line

my $writer = Plucene::Index::Writer->new($idxpath, $analyser, 0);

-- SopanShewale - 29 Aug 2005

Hi, I have installed the SearchEnginePluceneAddOn. It runs fine when I insert the following form

<form action="%SCRIPTURLPATH%/plucsearch%SCRIPTSUFFIX%/%INTURLENCODE{"%INCLUDINGWEB%"}%/">
   <input type="text" name="search" size="32" />
</form>
in a topic page. But when I insert this text in the WebLeftBar, I get no answers. What must I change to make this work correctly?

-- KarlHeinzWichmann - 21 Sep 2005

Hello KarlHeinzWichmann, I faced a similar problem while developing ApplicationAuthenticationAddOn.

The template twiki.pattern.tmpl includes "WebLeftBar" inside an HTML form, so if you add a search form to the WebLeftBar topic, it creates a form within a form, which is a problem for the browser.

Change the following block in twiki.pattern.tmpl from

%TMPL:DEF{"leftbar"}%<div class="twikiLeftBar"><div class="twikiWebIndicator"><b>%WEB%</b></div>
<div class="twikiLeftBarContents"><form name="main" action="%SCRIPTURLPATH%/view%SCRIPTSUFFIX%/%WEB%/%TOPIC%">
%INCLUDE{"WebLeftBar"}%</form></div></div>%TMPL:END%

to

%TMPL:DEF{"leftbar"}%<div class="twikiLeftBar"><div class="twikiWebIndicator"><b>%WEB%</b></div>
%INCLUDE{"WebLeftBar"}%</div>%TMPL:END%

This should solve your problem.

-- SopanShewale - 30 Sep 2005

Hi JosMaccabiani, you had the following question:

> 1. My end users only know Google. What are the main differences between Google and Plucene (bar the scoring)? Has anybody already tried to explain this to his/her end users?

Thanks for raising this question.

The main difference is: Google by default performs "AND" searches, while Plucene performs "OR" searches, so this plugin also performs "OR" searches.

You can change this behavior by modifying the $DefaultOperator to "AND" in Plucene::QueryParser module or by setting the variable $Plucene::QueryParser::DefaultOperator='AND'; in your script.

-- SopanShewale - 30 Sep 2005

There is a book on Lucene available now:

"Lucene in Action" by Otis Gospodnetic and Erik Hatcher, Manning Publications Co. ISBN 1-932-39428-1. See http://www.manning.com for more information.

-- AntonAylward - 29 Oct 2005

Hi All,

I'm trying to get SearchEnginePluceneAddOn to work with Dakar (build 7330). I get perl errors when I try to run the plucindex script. When digging into things, it seems that SearchEnginePluceneAddOn uses methods of the TWiki perl object that existed in Cairo but have been removed since (e.g. TWiki::basicInitialize() and TWiki::Prefs::initializePrefs()). Does anybody know whether there is an upgrade (or a fix) available "out there" somewhere? or whether somebody is working on one?

-- RobertStahr - 18 Nov 2005

I installed this add-on and Sopan's ExtraBackendParsers on a Cairo installation, works nicely. However, now I noticed that the doc, pdf, ppt and xls files are no longer searchable after an hourly update with plucupdate (html, txt and topic text is OK). Running plucindex fixes the issue. Any idea what is going on?

Also, I need this functionality for a document management system TWikiApplication: Is there a way to limit the search scope to only one web? Or, alternatively, all attachments in topics that have form XYZ? Preferably I'd like to hide that search scope from the user, e.g. in a hidden form field. Something like <input type="hidden" name="searchweb" value="%WEB%" />

I support the idea of Google like search. This is the standard people expect nowadays. Idea: The add-on could translate soap +wsdl "web service" -shampoo into the Plucene syntax.
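
A translation of this kind could look roughly like the sketch below. It only rewrites the query string: every term or quoted phrase without a prefix is made mandatory with "+", which gives Google-like "AND" behaviour while explicit "+" and "-" prefixes are kept. It assumes the Plucene query parser accepts "+" and "-" prefixes on terms and phrases; google_to_plucene is a hypothetical name.

```perl
use strict;
use warnings;

# Prefix every bare term or quoted phrase with "+" so that it becomes
# mandatory; tokens that already start with "+" or "-" are left untouched.
sub google_to_plucene {
    my ($query) = @_;
    my @tokens = $query =~ /([+-]?"[^"]*"|\S+)/g;
    return join ' ', map { /^[+-]/ ? $_ : "+$_" } @tokens;
}

print google_to_plucene('soap +wsdl "web service" -shampoo'), "\n";
# prints: +soap +wsdl +"web service" -shampoo
```

Rewriting the query in the search script keeps the change local to the add-on and leaves the Plucene::QueryParser defaults alone.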

-- PeterThoeny - 06 Jan 2006

Peter, thanks for your comments. About the issues with "plucupdate": I am able to use it on my setup (Cairo release); still, I will go through the script to fix the issue.

Limiting search scope to a particular web - that's also my requirement. We have to handle this hidden value appropriately to give the results. Already, if you search "web:Myweb sometext", this returns the results from the Myweb web. I should be able to do this work.

Expectations like Google - please see my comments of 30 Sep 2005. The default behavior is "OR", which can be changed to "AND" by making changes in the Plucene::QueryParser module. The other +, - stuff works similarly to Google. Yes, wildcard search is not yet supported by "Plucene"; that development should happen.

Indexing speed: My intranet site has around 5500 documents (topics and attachments), and it takes around 2 hours to index. Indexing time should be reduced. I think someone should give some thought to using Lucene or some other port of Lucene for indexing.

I am planning to do thorough testing of this add-on for the Dakar release.

-- SopanShewale - 10 Jan 2006

When I try to execute twiki/bin/plucindex, I get "undefined subroutine &Twiki::basicInitialize called at ./plucindex line 42".

Could anybody give me some help?

-- TWikiGuest - 09 Feb 2006

We (SopanShewale and myself) are working on a new version that uses only functions provided by the TWiki::Func module. So it will be compatible at least with Dakar (and Cairo, I hope).

Some of the issues discussed in this topic have been addressed (limiting scope, Google-like search queries, skipping defined webs from indexing), so we hope it will be quite useful. Stay tuned ;-)

-- JoanMVigo - 23 Feb 2006

Great! Just asking, what is the timeline for the new version?

-- PeterThoeny - 27 Feb 2006

Peter, we should be able to release the new version, compatible with Dakar, by Friday, March 3. If it does not work for Cairo, we will provide a separate fileset for Cairo by March 10.

-- SopanShewale - 27 Feb 2006

Finally, new versions of this add-on have been released, one for Cairo and another for Dakar. We would very much appreciate your feedback. Note also that, due to the lack of functionality exposed by TWiki::Func, the two versions of this add-on still use internal core functions of TWiki.

For people interested in Plucene and/or its development, I am posting some links here.

-- JoanMVigo - 02 Mar 2006

Thank you very much for upgrading this add-on! I will try it out.

As for packaging, it is better to have just one zip filename for the two versions, as described in HandlingCairoDakarPluginDifferences. So instead of SearchEnginePluceneAddOn-Cairo.zip and SearchEnginePluceneAddOn-Dakar.zip, it is preferred to overwrite SearchEnginePluceneAddOn.zip with the latest Cairo version, then overwrite it with the latest Dakar version; and in the add-on text, point to the latest Cairo version (with a viewfile link). If you do not like this setup, you could overwrite SearchEnginePluceneAddOn.zip with the Dakar version and keep a separate Cairo zip.

-- PeterThoeny - 02 Mar 2006

One of the first things I was asked after my TWiki was up and running was "how can I search for keywords in MS-Office attachments?" Well, I thought, that's easy. Google for TWiki and MS-Office, and there you go.

But now things are starting to get hairy. Is there something like a PluceneForTWikiQuickStartGuide ? I find myself running perl -wTd to find out what the "required third party tools" might be (hint: all of xlhtml, ppthtml and wv are available as Debian packages)....

-- HaraldJoerg - 03 Mar 2006

Yes, you are right. I should update the topic with the following instructions ...

Building Plucene with

perl -MCPAN -e 'install "Plucene"'
perl -MCPAN -e 'install "Plucene::SearchEngine"'
should make the Plucene installation straightforward.

Regarding document parsers ( 3rd party tools ) :

  • for PDF files, install xpdf
  • for DOC files, install antiword and use the stand alone DOC.pm file
  • for DOC, XLS and PPT you can use also the excellent collection ExtraBackendParsers.zip provided by SopanShewale

You can try to build your own parsers using other text extraction tools. Just download DOC.pm and change lines 1, 3, 8, 12 & 19 to use the corresponding extension, MIME type and external tool, and you will get a brand new parser.

-- JoanMVigo - 03 Mar 2006

Sopan and Joan: I added both of you to the TWikiCommunityGroup so that you can move/delete content. Please review the notes on that group topic.

-- PeterThoeny - 04 Mar 2006

Thanks, Joan, for the explanations. The installation of Plucene and its search engine is straightforward, but takes quite a time on slim installations (like the VM engine) due to the list of dependencies - much longer than installing the plugins and its extra parsers together.

Two notes on the ExtraBackendParsers:

  1. The excel parser relies on CPAN's Spreadsheet::ParseExcel, which seems to croak on the Excel2003 files we use in our office. On the other hand, xlhtml does the trick.
  2. Doc.pm seems to go an extra loop by converting .doc to .pdf, and then .pdf to .html. Is the result better than directly converting .doc to .html (which the wv package can do as well)?

And many thanks for the confirmation how to build parsers. I would have guessed that changing some lines is all it takes, but wasn't sure (and hadn't time to try because the machine was still running test suites for Plucene's dependencies).

I wonder - are there any experiences in the Lucene/Plucene world with converter comparison (e.g. antiword vs. wv with .doc, Spreadsheet::ParseExcel vs. xlhtml with .xls) with regard to indexing performance, suitability for search?

-- HaraldJoerg - 04 Mar 2006

Hi there! Thx for the great search engine. I have tried it with the latest Dakar version, but I only get results for topics which have attachments. No results for "normal" topics. At first, the scripts (plucindex, plucupdate) didn't work, but I fixed that. So my only problem is getting results for "normal" topics. What am I doing wrong?

-- HugoKuegerl - 12 Mar 2006

Hugo, the scripts work fine with the last Dakar release (build 8740). The only changes needed are:

  • set $twikiLibPath in your_twiki_path/plucene/bin/LocalLib.cfg
  • edit TWiki prefs as explained in documentation
Review plucene/logs for problems indexing topics/attachments.

Regarding your searches, topics within webs with NOSEARCHALL = on are not displayed.

-- JoanMVigo - 13 Mar 2006

I installed the latest version on TWiki-4 and ran into some problems, summarized at PluceneAddOnIssues.

-- PeterThoeny - 15 Mar 2006

Hi all. I have just uploaded a new release of this add on which solves an issue while updating when topics have similar names: TestTopic1, TestTopic2, TestTopic3, ...

Also, the PLUCENEINDEXEXTENSIONS TWiki variable values have changed. Each extension needs a DOT before it. Just type Set PLUCENEINDEXEXTENSIONS = .pdf, .html, .txt, .doc as in older versions. Sorry for the inconvenience.

Thanks to PeterThoeny for bringing these issues to light. Also to HugoKuegerl for discovering a bug in index/update operations (Dakar indexing was always reading first version topic texts!!!)

-- JoanMVigo - 21 Mar 2006

I've been helping set up the plucene search add on to an experimental twiki installation. Everything was going swimmingly. We could index all the attachments we were interested in (except .pdf files, though since I installed pdftotext that should be fixed, too). Then, this morning, we found that all searches came up empty. The indexer appears to work normally. The logs look good. But there are no search results. Has anyone else seen this?

-- DavidHoughton - 20 Apr 2006

I installed plucene today and I'm facing the same problem as David. There are no search results.

-- AlokNarula - 11 May 2006

Is there any Apache configuration needed to enable plucsearch? I'm getting no search results even though my index is generated perfectly. However, the Apache error log says this:

Don't know how to turn into an index reader at /home/twiki/bin/plucsearch line 209, referer: http://localhost.localdomain/twiki/bin/view/TWiki/PluceneSearch

-- AlokNarula - 12 May 2006

Same problem here: indexing seems to work properly but there are no search results at all... has no one found a reason for the issue?

-- IvanSassi - 31 May 2006

It does not work for me either. Index finishes successfully but not getting any search results. I have tried setting "Set PLUCENESEARCHATTACHMENTSONLY = 0" to see if I get any non-attachment results from any of the webs and I get back 0 results.

-- GordonTerrell - 05 Jun 2006

Sorry for not replying before.

There are two kinds of errors. Those produced by the script and those by Plucene. Please, give some details of your configuration when submitting errors, because it's very difficult to guess settings or what produces the error. If you don't want to disclose them publicly, send me an email with your settings and some logs.

  • If everything works one day and there are no results the next, the index is probably corrupted. Disable the update, run the index manually and re-enable the update. Try a search first before the update has run, and then again after a few updates.
  • plucsearch must be placed in your twiki/bin folder with executable permissions for your web user, just the way you did with other scripts, like view or edit. You don't have to set up Apache in any special way beyond what the general TWiki install notes specify
  • line 209 of plucsearch tries to open the index folder. If an error occurs, it's very likely your setup is not correct. Verify the PLUCENEINDEXPATH variable in TWikiPreferences (it should point to /your_twiki_path/plucene/index)
  • If enabled, the PLUCENESEARCHATTACHMENTSONLY just displays the 'show only attachments' message while displaying results
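
To narrow down the index folder case, something like the following could be run as the web server user. It is only a sketch: check_index_path is a made-up helper that is not part of the add-on, and it only tests the directory itself, not whether the Plucene index inside it is valid.

```perl
use strict;
use warnings;

# Hypothetical diagnostic, not part of the add-on: report why an index
# directory could not be opened by the user running this script.
sub check_index_path {
    my ($path) = @_;
    return 'PLUCENEINDEXPATH is empty or not set'
        unless defined $path && length $path;
    return "'$path' does not exist"     unless -e $path;
    return "'$path' is not a directory" unless -d $path;
    return "'$path' is not readable/searchable by this user"
        unless -r $path && -x $path;
    return 'ok';
}

# Pass the value you set for PLUCENEINDEXPATH as the first argument.
print check_index_path( $ARGV[0] ), "\n";
```

Running it with the same path you put in TWikiPreferences shows whether the problem is the setting or the permissions.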

Hope this helps!

-- JoanMVigo - 07 Jun 2006

Joan, it is nice to see your reply. But plucene search is still not showing the results. I've checked everything that you've indicated. You can download the current TWiki version from this site and install Plucene. You'll be able to see the results yourself.

-- AlokNarula - 07 Jun 2006

I have just deployed the Plugin in a new fresh installation - latest TWiki available build 9626. Indexing and searching both work fine.

Does your environment involve user authentication? If you have topic authentication enabled, then the script plucsearch by default may be executed as user nobody/TWikiGuest. Please, check this! Also, consider that only allowed topics for the authenticated user may be displayed as results.

If you have user authentication enabled, you should add the following lines to /twiki/bin/.htaccess if using Apache login module

<Files "plucsearch">
       require valid-user
</Files>
Otherwise, if using Template login module, launch /twiki/bin/configure script in your web browser and append plucsearch to {AuthScripts}

-- JoanMVigo - 08 Jun 2006

I have appended plucsearch to {AuthScripts} but plucsearch is unable to search restricted webs even though the index has been generated correctly. plucsearch works fine with public webs. What is the problem?

-- AlokNarula - 13 Jun 2006

Same here with restricted webs. I created a new topic in the Sandbox with an attachment and was able to get search results from both the topic and the attachment. I never get any results from the one web I have with restricted access. For example if I comment out Set ALLOWWEBVIEW = InformationServicesGroup then I get results. I have logged in with several accounts all of which are a part of the InformationServiceGroup and get the same results.

-- GordonTerrell - 19 Jun 2006

I have checked it and finally the problem is that:

  1. plucsearch script gets the user from the SESSION object exposed by the TWiki Func module: my $remoteUser = $TWiki::Plugins::SESSION->{remoteUser}; and ...
  2. when using /twiki/bin/.htaccess configured to authenticate the plucsearch as described above (see my comments 08 Jun 2006), remoteUser is the one you typed, so the results are displayed ok, even with restricted webs, however ...
  3. when not using /twiki/bin/.htaccess, remoteUser is always the user guest even if you are authenticated using TemplateLogin and plucsearch appears in {AuthScripts}, so any restricted web's results are never listed.

I have tested some setup possibilities, and it seems that just editing the plucsearch script and changing line 58, replacing the old one my $remoteUser = $TWiki::Plugins::SESSION->{remoteUser}; with this new one my $remoteUser = $TWiki::Plugins::SESSION->{user}->{login}; will solve this problem, and the plucsearch script will always work, regardless of which auth setup you have chosen.
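
The difference between the two fields can be illustrated without TWiki. This is a sketch only: $session is a plain hash standing in for $TWiki::Plugins::SESSION, session_login is a made-up helper, and the fallback order and the 'guest' default are assumptions for illustration, not what plucsearch does.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Hypothetical helper: pick a login name from a session-like hash.
# Only the field names {user}{login} and {remoteUser} come from the
# discussion above; everything else is illustrative.
sub session_login {
    my ($session) = @_;
    my $user = $session->{user};
    my $login;

    # Prefer the login stored on the user object/hash, if there is one.
    if ( ( reftype($user) || '' ) eq 'HASH' ) {
        $login = $user->{login};
    }

    # Otherwise fall back to remoteUser, then to a placeholder guest name.
    $login = $session->{remoteUser} unless defined $login && length $login;
    return ( defined $login && length $login ) ? $login : 'guest';
}

# Template login case described above: remoteUser says guest, user does not.
my %session = ( remoteUser => 'guest', user => { login => 'alok' } );
print session_login( \%session ), "\n";
```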

Once again, I am sorry for the delayed reply.

-- JoanMVigo - 21 Jun 2006

Thanks Joan. plucsearch can now find text in the restricted webs. Perhaps you can modify the plucsearch script and upload the latest plugin to TWiki.

-- AlokNarula - 21 Jun 2006

The template authentication search bug has been solved and a new release of this plugin is available in the add on topic. Thanks to AlokNarula and GordonTerrell for discovering the issue and for providing feedback.

I have also fixed a bug in updating the index: due to the partial topic name search enhancement, old topics were not removed from the index. Whenever possible, replace plucsearch and plucupdate with the latest versions (for the Cairo version, only the plucupdate replacement is required).

Finally, the addon was tested successfully on the latest Dakar release.

-- JoanMVigo - 27 Jun 2006

The plugin works well now and all my files are indexed, but I don't understand why plucene doesn't search in ppt files, even though they are indexed by plucupdate. It's strange. Thanks for your answers

-- EmmanuelMaatouk - 11 Jul 2006

Emmanuel, to have ppt files indexed, you need to:

Have you completed all the above actions?

-- JoanMVigo - 18 Jul 2006

Found something interesting today. When I use search criteria that have a number in them I get no results returned. For example, if I search for Internet Explorer 5.0 I will get no results but if I put quotes around it I do get results and if I search for just Internet Explorer then I will get results as well. Also, when I do a search with quotes I will get results for non exact items. For example, if I search for "Internet Explorer 5.0" it will return results for just Internet Explorer. Is all of this normal?

-- GordonTerrell - 21 Sep 2006

I think this is an issue in Plucene itself, not in the addon index/search scripts, which do not strip or stem text or analyze tokens. The addon scripts are just the bridge that allows TWiki data to be indexed and searched with the Plucene library...

The good news is that Plucene is quite active. Since Plucene 1.22, the Plucene library maintainers are Tony Bowden and Marty Pauley. Although the current version (1.25 - 26 Aug 2006) still lacks wildcard searches, they have been working on that since v1.24 (its POD file states that it has not yet been fully implemented). I recommend that you update the CPAN:Plucene lib to the latest release (you will need to rebuild the index) and wait for the next one.

-- JoanMVigo - 04 Oct 2006

I noticed a bug in plucindex not indexing my attachments as per PLUCENEINDEXEXTENSIONS. (it was skipping all my attachments)

The source of the problem was on line 189 - a period '.' is the assumed prefix when looking up the indexextensions hash key, though the actual key is without it.

i.e.

if (($indexextensions{".$extension"})
should be
 if (($indexextensions{"$extension"}) 

-- GlennRoberts - 10 Oct 2006

Either that, or include a "." when listing the extensions. Plugin topic suggests extensions should be listed without a ".", Dev-topic suggests they should be included - I guess this Dev topic should be refactored and main plugin topic updated accordingly (.. when we get the time) :-).

-- SteffenPoulsen - 10 Oct 2006
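
One way to make the dot question moot would be to normalise both sides before the lookup. The following is a minimal sketch of that idea, not the code in plucindex: build_extension_lookup and is_indexable are made-up names, and lower-casing the extensions is an extra assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash from a PLUCENEINDEXEXTENSIONS-style
# value, accepting entries with or without a leading dot.
sub build_extension_lookup {
    my ($pref) = @_;
    my %lookup;
    for my $ext ( split /\s*,\s*/, $pref ) {
        $ext =~ s/^\s+|\s+$//g;    # trim stray whitespace
        $ext =~ s/^\.//;           # drop a leading dot if present
        next unless length $ext;
        $lookup{ lc $ext } = 1;    # store the bare, lower-cased extension
    }
    return %lookup;
}

# Hypothetical helper: true if the file name's extension is in the lookup.
sub is_indexable {
    my ( $lookup, $filename ) = @_;
    my ($ext) = $filename =~ /\.([^.\/\\]+)$/;
    return 0 unless defined $ext;
    return $lookup->{ lc $ext } ? 1 : 0;
}

my %extensions = build_extension_lookup('.pdf, html, .txt, doc');
for my $file (qw(Report.PDF notes.odt)) {
    print "$file: ", ( is_indexable( \%extensions, $file ) ? 'index' : 'skip' ), "\n";
}
```

With this approach it does not matter whether the preference lists pdf or .pdf, so the plugin topic and this Dev topic could not disagree in a way that skips every attachment.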

Hello, I need help using Plucene IndexSearch. It all works fine with my TWiki, but I want to attach ComplexHTML Documents, containing sub folders, to topics. How do I have to change the Plucindex Code and which files I have to modify so that all files in the sub folders will be indexed too? Has someone any ideas? I thought about adding a line to the file twiki/bin/plucindex, round about line 250 in the method:

foreach my $attachDefP (@attachmentList){...
#process file
Plucene::SearchEngine::Index::File->examine(..) # adding such a line with the Subfolder to examine, 
but it was only a guess of mine and I couldn't get it to work. I will be thankful for any hints, ideas or solutions.

-- DanielWiechmann - 12 Oct 2006

Yes, you're right. The Dev-topic is not as up to date as the zipped topic included in the release. You should prefix each of your extensions with a dot, as in

      * Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc

Regarding ComplexHTML Documents, if the topic meta has information regarding those files, plucindex should process them. Which extension do you use to attach those Complex files?

-- JoanMVigo - 16 Oct 2006

Yesterday I installed the latest version of the Plucene search engine and the add-on for Dakar. After installing the CPAN module, the engine seemed to be working. But today I have problems. I changed PLUCENEINDEXEXTENSIONS to .pdf and so on. Now plucindex produces a huge number of errors while it is indexing the attachments. The error is

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.8/i586-linux-thread-multi/HTML/Parser.pm line 102

In addition, there is another problem. When I search for a topic with an attachment, it looks good and works fine. But when I search for a topic without an attachment, there is no result, although PLUCENESEARCHATTACHMENTSONLY is 0. All the topics of the web are indexed.

-- MichaelWeber - 18 Oct 2006

Hey Joan, let me explain what I mean by these Complex HTML Documents. Each one consists of one HTML file and one folder. The folder contains additional HTML files that are referenced from the main file. I copy (I do not use the upload function) the file and the folder into the folder that is generated for a topic (for example, topic Project in web Main, which gives the directory structure twiki/pub/Main/Project). When I run the index script, the main HTML file is indexed, but the files in the "attached" folder are not. Any idea how to modify the plucindex code??

-- DanielWiechmann - 19 Oct 2006
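
For the sub folder part of the question, the core File::Find module can collect the files. This is only a sketch of the directory walk: find_indexable_files is a made-up helper, and how each file found would then be handed to plucindex (which, as noted above, works from the topic meta) is left open.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: list every file below $dir (sub folders included)
# whose extension appears in @$extensions (given without the dot).
sub find_indexable_files {
    my ( $dir, $extensions ) = @_;
    my %wanted = map { ( lc($_) => 1 ) } @$extensions;
    my @found;
    find(
        {
            no_chdir => 1,    # keep $_ as the full path of each entry
            wanted   => sub {
                return unless -f $_;
                my ($ext) = /\.([^.\/\\]+)$/;
                push @found, $_ if defined $ext && $wanted{ lc $ext };
            },
        },
        $dir
    );
    my @sorted = sort @found;
    return @sorted;
}

# Example: perl thisscript.pl /your_twiki_path/pub/Main/Project
if (@ARGV) {
    print "$_\n" for find_indexable_files( $ARGV[0], [qw(html htm)] );
}
```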

Sopan & Joan, I added a SHORTDESCRIPTION to the "Add-On Info" section so that this add-on is represented properly in the AddOnPackage topic and query topics. Please feel free to take this into the next release.

-- PeterThoeny - 04 Nov 2006

Is it possible to implement a "FromThisTopic" option? If selected, SearchEnginePluceneAddOn should only give results from a given topic and the topics where it is linking to. Perhaps with the help of DirectedGraphWebMapPlugin.

-- RichardVinke - 11 Nov 2006

I know that this might go beyond the scope of this discussion, but would you please help this newbie on this issue? I am trying to install all the dependencies of this add-on, and I see that it requires wvWare. I got it from wvware.sourceforge.net, but I have no idea on how to install it on a Debian Linux system. Would you be so kind to explain me the steps?

Thanks,

-- MiloValenzuela - 13 Nov 2006

Milo: download libwmf-0.2.8.4.tar.gz. Use commands: "./configure", "make", "make install". Solve errors before proceeding to the next step.

-- RichardVinke - 14 Nov 2006

Thanks for the quick response... I am totally new to Debian Linux. I finally installed everything as indicated (all dependencies, etc.) and set up all the variables. plucindex runs successfully; however, the search returns NO results. I think I correctly defined the variables:

---++ Plucene settings
      * Set PLUCENEINDEXEXTENSIONS = .pdf, .htm, .html, .txt, .doc
      * Set PLUCENEINDEXPATH = /home/httpd/twiki/plucene/index 
      * Set PLUCENEATTACHMENTSPATH = /home/httpd/twiki/pub 
      * Set PLUCENESEARCHATTACHMENTSONLY = 0
      * Set PLUCENESEARCHATTACHMENTSONLYLABEL = Display only attachments
      * Set PLUCENEINDEXVARIABLES = CONTACTINFO, JUSTANOTHERONE
      * Set PLUCENEINDEXSKIPWEBS = Trash, Sandbox
      * Set PLUCENEINDEXSKIPATTACHMENTS = 
      * Set PLUCENEDEBUG = 1

Any ideas?

Thanks!

-- MiloValenzuela - 14 Nov 2006

Milo: What do the log files say about the plucindex? I saw several questions about this problem (above ;-) ), perhaps the answer is here.

-- RichardVinke - 15 Nov 2006

I had a similar problem - see my post above on 10 Oct 2006.

-- GlennRoberts - 15 Nov 2006

I installed all the dependencies as suggested above. The plucindex run is successful. It indexes all the expected attachments successfully. However, the searches only work within topics. It seems that it's not searching within the indexed documents. I don't think it's Glenn's issue because the indexing was successful. The plucene logs contain only the indexing log (which seems successful) and the Apache error log doesn't mention plucene. Is there any other way to debug this plugin?

-- MiloValenzuela - 15 Nov 2006

I have a problem. It does not seem to search the content of form fields. Should it? I have topics with a form attached, and it does find words in the topic text and even (with form:MyForm) the form name, but it does not find text in the form entries. Neither with searching for text nor FormField:text. Is this a local problem on my setup, or does it ignore form fields? And how can I start debugging (the index seems ok)? Otherwise it's a great plugin, thanks!

-- StephanMatthiesen - 04 Feb 2007

Correction: it does search form fields now, but only with the search string of the form FormField:text, but this is only possible when I know the name of the field.

Is it possible to have a general search where it doesn't matter if the search terms is in the topic text or in a form field? Forms are quite important to structure the TWiki, so it would be a shame if they are not really included in the search.

-- StephanMatthiesen - 04 Feb 2007

My plucene search never worked fine for searching within attachments... :-( It might be my uber newbieness with Linux, but I can't seem to cd into the plucene/index directory (which I suspect the search engine can't enter). If I do a "ls -l" it displays the directory but when I attempt a cd into it, it gives me the "no such file or directory" error. Any thoughts or recommendations?

-- MiloValenzuela - 03 Mar 2007

The execution flag is possibly not set on the directory. Try a: chmod 775 index

-- PeterThoeny - 05 Mar 2007

Peter, it tells me that it "cannot access 'index' : No such file or directory". It seems weird to me because ls -l lists the directory with the following characteristics:

 drwxr-sr-x 2 root    www-data 8192 2007-03-01 19:11 index 

...but it cannot do anything with it. Is there such a thing as a "corrupted" directory in Linux or again its just my newbieness?

Thanks,

-- MiloValenzuela - 08 Mar 2007

I am using antiword with the module attached below to index *.doc files. I am wondering if anyone else has run into the issue of indexing RTF formatted files with a *.doc extension.

Word opens them just fine. Antiword on the other hand, doesn't seem to know how to handle this datastream.

-- BrianGupta - 12 Mar 2007
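
RTF files start with the five characters {\rtf whatever their extension, so one possible workaround is to look at the first bytes before choosing a converter. A minimal sketch, assuming that approach; looks_like_rtf is a made-up helper and is not part of DOC.pm or the add-on:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the RTF signature "{\rtf",
# regardless of its extension. Unreadable files count as "not RTF".
sub looks_like_rtf {
    my ($path) = @_;
    open( my $fh, '<', $path ) or return 0;
    binmode $fh;
    my $read = read( $fh, my $head, 5 );
    close $fh;
    return ( defined $read && $read == 5 && $head eq '{\rtf' ) ? 1 : 0;
}

# Example: perl thisscript.pl some_attachment.doc
if (@ARGV) {
    print "$ARGV[0]: ",
        ( looks_like_rtf( $ARGV[0] ) ? 'RTF content' : 'not RTF' ), "\n";
}
```

A .doc file that tests positive could then be sent to an RTF-capable converter instead of antiword; which converter that would be is left open here.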

Hello Milo, try this

      * Set PLUCENEINDEXPATH = /home/httpd/twiki/plucene/index/
      * Set PLUCENEATTACHMENTSPATH = /home/httpd/twiki/pub/
without spaces at the end of the line. That should fix your problem.
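
A script reading such a preference could also defend against this itself by trimming the value before using it as a path. A small sketch with a made-up helper name, clean_path_pref; the add-on scripts are not known to do this:

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace from a
# preference value before it is used as a directory path.
sub clean_path_pref {
    my ($value) = @_;
    return '' unless defined $value;
    $value =~ s/^\s+//;
    $value =~ s/\s+$//;
    return $value;
}

# The trailing space here mimics a preference line with a space at its end.
my $index_path = clean_path_pref('/home/httpd/twiki/plucene/index ');
print "[$index_path]\n";
```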

I have another question: Is it possible to search over directories, that you bind in your topics? For example:

[[file://Server/yourDirectory/]]

How do I have to modify this Addon to do this? Or are there any other Addons that can do this?

Thanks for your help.

-- JoergSchoenknecht - 17 Apr 2007

Hi, when I run ./plucindex, I get the following error:

Can't locate Plucene/Document.pm in @INC (@INC contains: /home/httpd/twiki/lib . /etc/perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .) at ./plucindex line 26.
BEGIN failed--compilation aborted at ./plucindex line 26.

Can anyone please guide me? Thanks

-- TojanJohn - 02 Jul 2007
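
That message means Perl could not find the Plucene module in any of the listed @INC directories, which usually points to the CPAN module not being installed for the Perl that runs the script. A quick way to check is sketched below; module_location is a made-up helper, and the two module names are simply the ones mentioned on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file a module was loaded from,
# or undef if Perl cannot find or load it via @INC.
sub module_location {
    my ($module) = @_;
    return undef unless $module =~ /\A\w+(?:::\w+)*\z/;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    my $ok = eval { require $file; 1 };
    return $ok ? $INC{$file} : undef;
}

for my $module (qw(Plucene::Document Plucene::SearchEngine::Index)) {
    my $where = module_location($module);
    print "$module: ", ( defined $where ? $where : 'not found in @INC' ), "\n";
}
```

Run it with the same perl binary and as the same user as plucindex, since a different Perl installation has a different @INC.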

One of the plucenescriptpatches doesn't work (plucupdate). Has anyone encountered the same problem? Why doesn't plucene have a function that searches partial topic names as well?

-- MaartenDeRuiter - 13 Jul 2007

I've installed Plucene Search Engine for my TWiki, and I can already search within txt and html attachments. But there is a problem with xpdf. It is installed and works fine within the console (pdftotext a.pdf). But when I try ./plucindex, I get the message: Can't exec "pdftotext": No such file or directory at /usr/local/share/perl/5.8.4/Plucene/SearchEngine/Index/PDF.pm line 20.

What can I do?

-- StefanSchauer - 25 Jul 2007

I've solved the problem now. pdftotext has to be copied to the /usr/bin folder, which is listed in the $PATH environment variable. It works fine now. Very useful Add-On!

-- StefanSchauer - 26 Jul 2007
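
The PATH seen by a script is not necessarily the one of your interactive shell, so it can help to check from Perl where (and whether) the converters are found. A minimal sketch using the core File::Spec module; find_in_path is a made-up helper and the tool names are just the ones discussed on this page.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full path of $program if an executable
# file with that name exists in one of the PATH directories, else nothing.
sub find_in_path {
    my ($program) = @_;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $program );
        return $candidate if -f $candidate && -x _;
    }
    return;
}

for my $tool (qw(pdftotext antiword)) {
    my $where = find_in_path($tool);
    print "$tool: ", ( defined $where ? $where : 'NOT FOUND in PATH' ), "\n";
}
```

Run it the same way plucindex is run (same user, same cron or shell environment) to see what that environment can actually execute.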

I am using TWiki on a Windows platform. I get the following errors in the log file.

[Thu Aug 30 15:02:05 2007] [error] [client 127.0.0.1] Use of uninitialized value in -d at C:/perl/site/lib/Plucene/Search/IndexSearcher.pm line 52., referer: http://localhost/twiki/bin/view/TWiki/PluceneSearch
[Thu Aug 30 15:02:05 2007] [error] [client 127.0.0.1] Use of uninitialized value in concatenation (.) or string at C:/perl/site/lib/Plucene/Search/IndexSearcher.pm line 55., referer: http://localhost/twiki/bin/view/TWiki/PluceneSearch
[Thu Aug 30 15:02:05 2007] [error] [client 127.0.0.1] Don't know how to turn into an index reader at C:/Apache2.2/htdocs/TWiki/bin/plucsearch line 209, referer: http://localhost/twiki/bin/view/TWiki/PluceneSearch

pls help?

-- AnoopR - 30 Aug 2007

This is probably because you have not set the PLUCENEINDEXPATH in your TWikiPreferences.

-- AndrewRJones - 31 Aug 2007

thanks for the tip, i got that working. but now i have a new problem.

my $tmpl = TWiki::Func::readTemplate( "plucsearch" );

is not reading the template file if i request through the web client. but if I do a

perl -wT plucsearch>test.html

then I get a good result. I am running this on win2k/apache.

-- AnoopR - 06 Sep 2007

The implementation of this plugin is in quite a messy state. Anybody mind if I clean up a bit?

-- MichaelDaum - 06 Sep 2007

Looks like there is very little activity on this topic. MichaelDaum, I think everybody is looking at KinoSearch, which is promising much more. I am stuck here only until I find a way to install the C/C++ code in the KinoSearch package without installing MS Visual C++... any ideas?

-- AnoopR - 06 Sep 2007

I did it! I just had to run NMAKE on another system that had Perl and Visual C++. Then I copied the folder to my system and ran "nmake install". It ran and installed successfully.

I did also try to install the Windows SDK on my system i think i partially installed something, i am not sure if that was useful. The full SDK was around 700 MB, so i removed a lot and just installed few things that made sense...

-- AnoopR - 06 Sep 2007

Are there any benchmarks available for this plugin? We have a really large twiki with lots and lots of attachments, so I'm worried about how this scales.

-- MichelleHedstrom - 17 Sep 2007

MichaelDaum, I would be thrilled for any cleanup you felt like doing. I have been unable to get this addon to work at my company, and I would very much love to. There's no Solaris version of Kino search, so I can't turn to SearchEngineKinoSearchAddOn, and the installation for SearchEngineSwishEAddOn is involved enough that it would be a last resort.

-- JohnWorsley - 19 Sep 2007

Sorry, I meant there's no Solaris version of antiword, but I guess that's not strictly true. I see now on the antiword site that "Users have reported successful compilations on ... Solaris".

Since I know nothing about compiling from source, and I have to work with Solaris, I have still been more interested in this attachment search engine since I know it can work with other MS Word handlers besides antiword.

-- JohnWorsley - 20 Sep 2007

Hi John,

if antiword is the only problem, you can also use any other program that can extract ASCII text from an MS Word document. Then you need to replace the stringifier plug-in DOC.pm with a new one (see also SearchEngineKinoSearchAddOn#Indexing_further_document_types). This takes only a few lines of Perl code (look at the existing plug-ins, they are really easy). I used antiword for KinoSearch simply because I had problems installing wvWare on my Debian VM.

To MichaelDaum:

If you do some refactoring of this code, how about creating a base module for SearchEnginePluceneAddOn and SearchEngineKinoSearchAddOn and perhaps further indexing machines? As far as I can see, much of the code in both add-ons deals with common tasks like getting the preferences from TWiki, iterating over the webs and topics, extracting the data and meta data from the topics, finding the attachments, stringifying the attachments and rendering the result list. Only a few tasks are really specific to the indexing machine: creating the index, filling and updating the index and searching the index (i.e. getting the hit list for a given query).

With such a base module, the integration of further indexing machines could be much easier: You just create a class with some defined methods for doing the special tasks. The rest is done by the base. Developers for that special classes need not have so much insight to TWiki and we get earlier new (faster, better, …) indexed search machines. And of course, improvements of the base results in improvements for all indexed searches. If you agree we could setup a Brainstorming topic on that (or discuss further on IndexingTWiki). What do you think?

-- MarkusHesse - 22 Sep 2007

Markus, that's exactly what I was aiming at. The current code refactoring I did on the plucene integration goes directly in that direction. Gonna check it in so that you can see what is there so far.

-- MichaelDaum - 24 Sep 2007

I've fixed this code for 4.2. As some user on #twiki requests the same, I'll put it here before having SVN access.

--- ./plucene/bin/plucupdate    2006-06-27 10:39:50.000000000 +0200
+++ .././plucene/bin/plucupdate 2008-06-06 10:34:13.000000000 +0200
@@ -20,6 +20,7 @@
 BEGIN { unshift @INC, '.'; require '../../bin/setlib.cfg' }
 
 use TWiki;
+use TWiki::Func;
 
 use Time::Local;
 
@@ -94,36 +95,34 @@
 
     $debug && print "Checking $web ...";
 
-    # NOTE violates store encapsulation, possible compatibility issue with future releases
-    my $changes= $TWiki::Plugins::SESSION->{store}->readMetaData( $web, 'changes' );
-    my $prevLastmodify = $TWiki::Plugins::SESSION->{store}->readMetaData($web,'plucupdate') || "0";
+    # Get the last time we indexed this web
+    my $lastmodifyDir = TWiki::Func::getWorkArea("Plucene");
+    my $prevLastmodify = 0;
+    if ( open(LAST, "<$lastmodifyDir/$web") ) {
+      my $prevLastmodifyTainted = <LAST>;
+      close LAST;
+      if( $prevLastmodifyTainted =~ /^(\d+)$/ ) {
+        $prevLastmodify = $1;
+      }
+    }
     my $currLastmodify = "";
     
     # do not process the same topic twice
     my %exclude;
 
+    my $changes = TWiki::Func::eachChangeSince( $web, $prevLastmodify );
     # process the web changes
-    foreach( reverse split( /\n/, $changes ) ) {
-      # Parse lines from .changes:
-      #                      
-      my ($topicName, $userName, $changeTime, $revision) = split( /\t/);
-       
-      if( ( ! %exclude ) || ( ! $exclude{ $topicName } ) ) {
-        if( ! $currLastmodify ) {
-          # newest entry
-          $time = &TWiki::Func::formatTime( $prevLastmodify );
-          $currLastmodify = $changeTime;
-          if( $prevLastmodify eq $changeTime ) {
-            # newest entry is same as at time of previous update
-            $debug && print "-> no topics new/changed since $time\n";
-            last;
-          }
-          $debug && print "-> changed topics since $time:\n";
-        }
-        if( $prevLastmodify >= $changeTime ) {
-          # found item of last update
-          last;
-        }
+    $time = &TWiki::Func::formatTime( $prevLastmodify );
+    if( $changes->hasNext() ) {
+      # We have some changes
+      $debug && print "-> changed topics since $time:\n";
+      while( $changes->hasNext() ) {
+        my $change = $changes->next();
+        my ($topicName, $userName, $changeTime, $revision) = @{$change}{qw/
+       topic user time revision/};
+  
+        $currLastmodify = $changeTime;
+        next if defined $exclude{ $topicName };
         $exclude{ $topicName } = "1";
         $debug && print "   * $topicName\n";
         push( @topicsToUpdate, [ $web, $topicName ] );
@@ -133,11 +132,18 @@
           push( @topicsToUpdate, [ $web, "WebHome" ] );
         }
       }
+  
+      if ( open(LAST, ">$lastmodifyDir/$web") ) {
+        print LAST $currLastmodify;
+        close LAST;
+        $debug && print "$lastmodifyDir/$web saved\n";              
+      } else {
+        warn "Couldn't update $lastmodifyDir/$web: $!";
+      }
+    } else { # No new changes
+      $debug && print "-> no topics new/changed since $time\n";
+      $currLastmodify = $time;
     }
-
-    # NOTE violates store encapsulation, possible compatibility issue with future releases
-    $TWiki::Plugins::SESSION->{store}->saveMetaData( $web, 'plucupdate', $currLastmodify );
-    $debug && print "$web .plucupdate saved\n";              
   }
 
   if (@topicsToUpdate > 0) {
--- ./plucene/bin/plucindex     2006-03-21 10:01:33.000000000 +0100
+++ .././plucene/bin/plucindex  2008-06-05 18:26:34.000000000 +0200
@@ -37,6 +37,7 @@
 my $debug = ! ( @ARGV && $ARGV[0] eq "-q" );
 
 # Log stuff: opening the log file 
+use TWiki::Func;
 my $time = TWiki::Func::formatTime( time(), '$year$mo$day', 'servertime');
 my $logfile = "../logs/index-".$time.".log";
 
@@ -118,8 +119,15 @@
     $logtime = TWiki::Func::formatTime( time(), '$rcs', 'servertime' ); 
     print LOGFILE  "| $logtime | Indexing web | $web | |\n";
 
-    # NOTE violates store encapsulation, possible compatibility issue with future releases
-    $TWiki::Plugins::SESSION->{store}->saveMetaData( $web, 'plucupdate', time() );
+    # Saves the last update run for this web
+    my $lastmodifyDir = TWiki::Func::getWorkArea("Plucene");
+    if ( open(LAST, ">$lastmodifyDir/$web") ) {
+      print LAST time();
+      close LAST;
+      $debug && print "$lastmodifyDir/$web saved\n";
+    } else {
+      warn "Couldn't update $lastmodifyDir/$web: $!";
+    }
 
     # get the list of topics
     my @topics = TWiki::Func::getTopicList( $web );

Hope this helps.

-- OlivierRaginel - 28 Jul 2008
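
The core idea of the patch (one small file per web holding the time of the last run, read back through a digits-only regex so the value is untainted) can be tried outside TWiki. The following is a sketch under stated assumptions: read_last_run and write_last_run are made-up names, a temporary directory stands in for the directory returned by TWiki::Func::getWorkArea, and a web name containing a path separator would need extra handling.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: read the epoch time stored for $web in $dir.
# Returns 0 when the file is missing or does not hold a plain integer.
sub read_last_run {
    my ( $dir, $web ) = @_;
    my $file = File::Spec->catfile( $dir, $web );
    open( my $fh, '<', $file ) or return 0;
    my $line = <$fh>;
    close $fh;
    return 0 unless defined $line;
    my ($stamp) = $line =~ /^(\d+)\s*$/;    # the capture also untaints the value
    return defined $stamp ? $stamp : 0;
}

# Hypothetical helper: store $time for $web; returns 1 on success, 0 on failure.
sub write_last_run {
    my ( $dir, $web, $time ) = @_;
    my $file = File::Spec->catfile( $dir, $web );
    my $fh;
    unless ( open( $fh, '>', $file ) ) {
        warn "Couldn't update $file: $!";
        return 0;
    }
    print {$fh} $time;
    return close($fh) ? 1 : 0;
}

my $work_area = tempdir( CLEANUP => 1 );    # stand-in for the Plucene work area
print 'before: ', read_last_run( $work_area, 'Main' ), "\n";
write_last_run( $work_area, 'Main', time() );
print 'after:  ', read_last_run( $work_area, 'Main' ), "\n";
```

Returning 0 for a missing or damaged file means the next update would treat every change as new, which seems the safer failure mode for an index.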

Heya Olivier - I've already enabled your commit access :-) I reckon check it in and release it :-)

-- SvenDowideit - 02 Aug 2008

Sven, if I can figure out where I can modify it, I'd commit it right away, but for now, what's in the trunk are the modifications MichaelDaum made. No idea where I can find this source to tweak it. Also, as MichaelDaum pointed out, TWiki::Func::eachChangeSince only exists since 4.2.0, right? So we need to "fork" this per version of TWiki.

-- OlivierRaginel - 05 Aug 2008

Olivier, I merged your fixes to the changes I made in a way that no fork is needed. I will upload my changes asap.

-- MichaelDaum - 05 Aug 2008

I had to make a few additional fixes on plucsearch to handle permissions, but I think your version is safe Michael. I can send you my changes if you wish, or check the channel logs, as I was doing this for gordho on Tuesday, 5th August 2008, around 7pm CEST.

-- OlivierRaginel - 08 Aug 2008

I added the code updates to plucupdate, but I could not get it to work. Instead I will just run plucindex at 4 AM each day. I hope it won't take too long as our content grows; with the wiki empty it takes about 2 minutes.

-- GregNeugebauer - 2009-10-14

I should note the above was on TWiki 4.3.2

-- GregNeugebauer - 2009-10-14

I ran Plucene on TWiki-5.1.2 (Sun, 07 Oct 2012, build 23565, Plugin API version 1.4). I have subsequently documented here all the necessary changes to files.

-- JavierFernandezSanchez - 2012-10-30

Javier, where have you documented this? We appreciate help in doc improvements.

-- PeterThoeny - 2012-10-30

Start with documentation.

[root@twiki ~]# lsb_release -a
LSB Version: :core-4.0-ia32:core-4.0-noarch:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 5.8 (Final)
Release: 5.8
Codename: Final

-- JavierFernandezSanchez - 2012-10-31

yum repolist
yum --enablerepo=* repolist
yum install gcc make links rcs
rpm -Uvh http://epel.mirrors.arminco.com/5/i386/epel-release-5-4.noarch.rpm
yum list NCFTP
yum install ncftp antiword abiword wv xpdf-utils
yum install ppthtml ERROR********* http://chicago.sourceforge.net/xlhtml/
perl -MCPAN -e shell
install Bundle::CPAN
reload cpan
install Unicode::String
install Archive::Tar
install Test::LeakTrace
install LWP::UserAgent
install File::MMagic
install Module::Pluggable
install Spreadsheet::ParseExcel
install CharsetDetector
install Spreadsheet::XLSX
install Text::Iconv
install HTML::TreeBuilder
install Lingua::Stem::Snowball
install Plucene
install Plucene::SearchEngine
install Plucene::SearchEngine::Index

-- JavierFernandezSanchez - 2012-10-31

Change to file twiki/bin/plucindex: add

use TWiki;
use TWiki::Func;

-- JavierFernandezSanchez - 2012-10-31

Change to file twiki/bin/plucindex:

# NOTE violates store encapsulation, possible compatibility issue with future releases
#$TWiki::Plugins::SESSION->{store}->saveMetaData( $web, 'plucupdate', time() );
# Saves the last update run for this web
my $lastmodifyDir = TWiki::Func::getWorkArea("Plucene");
if ( open(LAST, ">$lastmodifyDir/$web") ) {
  print LAST time();
  close LAST;
  $debug && print "$lastmodifyDir/$web saved\n";
} else {
  warn "Couldn't update $lastmodifyDir/$web: $!";
}

-- JavierFernandezSanchez - 2012-10-31

Change to file twiki/templates/plucsearch.pattern.tmpl: replace %TMPL:INCLUDE{"twiki"}% with %TMPL:INCLUDE{"view"}%

-- JavierFernandezSanchez - 2012-10-31

Change to file twiki/bin/plucsearch: comment out the old line and use the new one instead:

#$icon = $TWiki::Plugins::SESSION->mapToIconFileName($name);
$icon = "%ICON{\"$name\"}%";

-- JavierFernandezSanchez - 2012-10-31

Modify the MAIN topic: add

<form action="%SCRIPTURLPATH%/plucsearch%SCRIPTSUFFIX%/%INTURLENCODE{"%INCLUDINGWEB%"}%/">
<input type="text" name="search" size="32" /> <input type="submit" value="Search text" /> in
<select name="web">
<option value="all">all public webs</option>
<option value="%INCLUDINGWEB%">current web</option>
%WEBLIST{" <option>$name</option>"}%
</select>
<input type="checkbox" name="nosummary" /> do not show summaries
<input type="checkbox" name="nototal" /> do not show total matches
<input type="checkbox" name="showlock" /> show locked topics
limit result count to <input type="text" name="limit" size="5" value="all" />
</form>

-- JavierFernandezSanchez - 2012-10-31

OK, it is necessary to run the plucindex process every day at 2:00 am. The plucupdate process has a problem; I am working on that. The rest of the product is OK.

-- JavierFernandezSanchez - 2012-10-31

Install it in the root user's cron to run at 00:01 am:

crontab -e
1 0 * * * /var/www/twiki/plucene/bin/plucindex

If you would like to run it every hour, then:

1 * * * * /var/www/twiki/plucene/bin/plucindex

-- JavierFernandezSanchez - 2012-10-31

I downloaded

"SearchEnginePluceneAddOn.zip r12 2006-06-27 - 08:46"

from the SearchEnginePluceneAddOn page, and now I find that saveMetaData is broken. Probably because the fixes mentioned above were done in 2008.

So, where is the most recent version available for download?

Thanks!

-- Rusty Carruth - 2015-07-31

Rusty: This extension needs some TLC, it has not been updated for many years and does not seem to work on the latest TWiki release. You could fix the extension (and contribute the changes back to the community), or engage one of the TWikiConsultants to get it fixed.

-- Peter Thoeny - 2015-08-04

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| DOC.pm | r1 | 0.6 K | 2004-12-15 - 11:09 | JoanMVigo | Index DOC files with Plucene::SearchEngine::Index::DOC.pm & antiword |
| ExtraBackendParsers.zip | r1 | 3.5 K | 2004-12-08 - 14:36 | SopanShewale | Backend Parsers to parse MS Word, Excel, PPT files. |
| ScriptsWithLoggingFeatures.zip | r1 | 8.1 K | 2005-07-08 - 12:52 | SopanShewale | Scripts with Logging stuff |
| plucenscriptpatches.zip | r1 | 2.0 K | 2005-03-23 - 12:17 | SopanShewale | The patches for partial-topic name search |
Topic revision: r133 - 2015-08-04 - PeterThoeny
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.