Question

I am using the latest SearchEnginePluceneAddOn on a TWiki-4.0.1 installation. The ExtraBackendParsers.zip is installed as well. I can index and search for .pdf, .ppt, .htm, .html, .txt, .doc, .xls files. Thanks for updating the add-on for TWiki-4!

I ran into some problems:

1. Directory named "doc"

TWiki 4 has a doc directory in the attachment directory of TWiki.JSCalendarContrib. When running plucene/bin/plucindex I get a No such file error. Here is the output:

Plucene index files init
- to suppress all normal output: plucindex -q
Attachment extensions to be indexed: pdf, ppt, htm, html, txt, doc, xls
Variables to be indexed:
Indexing Doc topics
Indexing Main topics
Skipping Sandbox topics
Indexing TWiki topics
Skipping Trash topics
Indexing attachments ...

Current directory: /tmp

Current directory: /tmp

** (process:25170): WARNING **: No such file '/var/www/html/pub/TWiki/JSCalendarContrib/doc'

Current directory: /tmp
Some problem running latex.
Check for Errors in 71iWAfJMcS.log
Continuing...
Conversion into dvi failed
Error: Couldn't open file '/tmp/71iWAfJMcS'
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.
Optimizing index ...
Indexing complete.

If I rename the doc directory to doc-x I no longer get the No such file error. That is an indication that the add-on is trying to process doc as a .doc. It looks like the add-on needs to be made aware of the dots.

End of output:

...
Skipping Trash topics
Indexing attachments ...

Current directory: /tmp

Current directory: /tmp
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.
Optimizing index ...
Indexing complete.

2. Error message "Parsing of undecoded UTF-8"

As seen in the above output, there is a message:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.

I do not see any problems though when searching for attachments.

3. No index update if attachment is updated

I created a new topic, and attached a .txt file to it. I could search for it as expected after a plucene/bin/plucindex. Then I updated the file with some new text and re-attached the file. I can view the updated attachment properly. When I run plucene/bin/plucupdate I get this output:

Plucene index files update
- to suppress all normal output: plucupdate -q
Checking Doc ...-> no topics new/changed since 15 Mar 2006 - 04:29
Doc .plucupdate saved
Checking Main ...-> no topics new/changed since 15 Mar 2006 - 04:34
Main .plucupdate saved
Skipping Sandbox topics
Checking TWiki ...-> no topics new/changed since 14 Mar 2006 - 05:10
TWiki .plucupdate saved
Skipping Trash topics
No index update necessary
Updating index complete.

Notice the timestamp of the Doc web, 15 Mar 2006 - 04:29. The current local time was Tue Mar 14 23:58:55 EST 2006. Not sure if this is a GMT vs local time issue, or possibly connected to the "update topic without bumping up rev" feature (if done within one hour.)

An hour later... It might be related to the "same rev" feature. I attached an updated txt file after one hour, and the topic rev went up. Running plucupdate, I get this output:

Plucene index files update
- to suppress all normal output: plucupdate -q
Checking Doc ...-> changed topics since 15 Mar 2006 - 05:50:
   * DocID20060314213653
Doc .plucupdate saved
Checking Main ...-> changed topics since 15 Mar 2006 - 05:50:
Main .plucupdate saved
Skipping Sandbox topics
Checking TWiki ...-> changed topics since 15 Mar 2006 - 05:50:
TWiki .plucupdate saved
Skipping Trash topics
Now removing old topics
Removing of old topics finished
Indexing new and changed topics
Attachment extensions to be indexed: pdf, ppt, htm, html, txt, doc, xls
Variables to be indexed:
Reindexing attachments ...
Optimizing index ...
Indexing complete.
Updating index complete.

The output looks OK now. But:

4. Index destroyed after running update

After running plucupdate I can search for the attachments that have been updated (as shown by the plucupdate output), but all other attachments can no longer be found. In the above example, only text in the attachment of Doc.DocID20060314213653 can be found. Searching topic text is still OK. Running plucindex again fixes the issue. I can send the content of the index directory by e-mail if needed (after running plucupdate and after running plucindex).

Workaround

I can work around issues 3 and 4 by always performing full indexes instead of updates. This is OK for now, but soon there will be so much content that an hourly full index will not be feasible.

Environment

TWiki version: TWikiRelease04x00x00
TWiki plugins: Dakar default
Server OS: Linux
Web server: Apache
Perl version: 5.8.6
Client OS:  
Web Browser:  
Categories: Search, Add-Ons

-- PeterThoeny - 15 Mar 2006

Answer

1. Dots should be added to the extensions in PLUCENEINDEXEXTENSIONS, as in the previous version. However, the attachment attributes are retrieved from the topic metadata, and the attachment extension is compared with the TWiki variable. I have reviewed TWiki/JSCalendarContrib.txt and there is no meta info, so does TWiki build the meta info by examining the contents of pub/TWiki/JSCalendarContrib? Also, I have checked that the contents of the doc folder are neither listed nor indexed ...

  • I already tried to add the dots in PLUCENEINDEXEXTENSIONS, but no attachments get indexed anymore. Not sure why it picked the doc dir. -- PeterThoeny
    • The plucindex and plucupdate scripts should also be modified to include the dot when comparing an extension with those in PLUCENEINDEXEXTENSIONS. I will update the ZIP file to include this workaround this week. -- JoanMVigo - 15 Mar 2006
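
The dot-aware comparison discussed above could look roughly like the following. This is only a minimal sketch: `has_indexed_extension` is a hypothetical helper, not a function from the add-on, and the extension list is copied from the plucindex output in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the add-on): decide whether an attachment
# name ends in one of the configured extensions.
sub has_indexed_extension {
    my ( $name, @extensions ) = @_;
    for my $ext (@extensions) {
        # Accept the configured value with or without a leading dot.
        ( my $bare = $ext ) =~ s/^\.//;
        # Require a literal dot before the extension, case-insensitively.
        return 1 if $name =~ /\.\Q$bare\E\z/i;
    }
    return 0;
}

# Extension list as printed by plucindex in the question.
my @extensions = qw(pdf ppt htm html txt doc xls);
for my $name (qw(report.doc doc notes.TXT mydoc)) {
    printf "%-10s %s\n", $name,
        has_indexed_extension( $name, @extensions ) ? 'index' : 'skip';
}
```

With this check, a directory named doc is skipped because its name contains no dot before "doc". An additional `-f` test on the full path would skip directories regardless of their name.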

2. Plucene uses HTML/Parser.pm to index html attachments. Also see TWiki:Support/PluceneAndSpecialCharacters

3. Right. The plucupdate script checks the .changes file. If an attachment is updated with a new version but the topic does not get a new revision number, the topic will not be added to the .changes file, so it will be ignored the next time plucupdate is run.

  • Re-attaching files within one hour happens. I think the algorithm should be changed to look at the timestamp (in .changes) instead of rev number. -- PeterThoeny
    • Both pluc scripts look at the timestamp, because the generated .plucupdate file contains just the timestamp of the last index/update run. Is it possible that re-attaching does not update .changes? -- JoanMVigo - 15 Mar 2006
    • Yes, that is the case, the .changes file does not get updated on follow-up saves since the first entry is sufficient for an e-mail notification. But this is not sufficient for updating the index. Not sure though how to solve the issue. -- PeterThoeny
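
To make the discussion above concrete, here is a rough sketch of a timestamp-based filter over .changes entries. It assumes each line is tab-separated as topic, user, epoch time, revision; that layout is an assumption, not something confirmed in this thread. `topics_changed_since`, the topic names, and the timestamps are all hypothetical.

```perl
use strict;
use warnings;

# Assumed .changes layout (not confirmed by this thread): one line per change,
# tab-separated as topic, user, epoch time, revision.
sub topics_changed_since {
    my ( $last_run, @lines ) = @_;
    my ( %seen, @topics );
    for my $line (@lines) {
        chomp $line;
        my ( $topic, $user, $time, $rev ) = split /\t/, $line;
        # Skip malformed lines and entries not newer than the last run.
        next unless defined $time && $time =~ /^\d+$/;
        next if $time <= $last_run;
        push @topics, $topic unless $seen{$topic}++;
    }
    return @topics;
}

my $last_run = 1142400000;    # hypothetical time stored in .plucupdate
my @changes  = (
    "TopicA\tExampleUser\t1142390000\t1\n",    # logged before the last run
    "TopicB\tExampleUser\t1142403600\t2\n",    # logged after the last run
);
print join( ', ', topics_changed_since( $last_run, @changes ) ), "\n";
```

Only TopicB is reported here. If a file is re-attached to TopicA without a new .changes entry being written, nothing in this input reveals the change, which matches the behaviour described in issue 3.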

4. The plucupdate script first removes all topics that have been updated since the last run. Attachments of those removed topics are also removed. Then the new/updated topics and their attachments are indexed. I do not understand why the plucupdate script removes all existing attachments in your installation. Searching with attachment:yes should display all the attachments currently indexed ... I am running Cairo and Dakar on different servers and I have never experienced the problem described here. Please send the console output and both log files (index-yyyymmdd.log and update-yyyymmdd.log) instead of the index folder.

  • Thanks, I will send the log files by e-mail -- PeterThoeny
    • I will study them ;-) -- JoanMVigo - 15 Mar 2006

-- JoanMVigo - 15 Mar 2006

Peter, could you please send the log files to me as well (sopan_shewale@persistentPLEASENOSPAM.co.in)?

-- SopanShewale - 17 Mar 2006

Done. Thanks for checking.

-- PeterThoeny - 17 Mar 2006

Hi Joan! I wrote a comment on 12.03.2006 about a problem finding text in a topic body. Before you released the "Dakar version" of PluceneSearch, I had modified your previous version, and it worked great in Dakar. Then I installed your new version and got the problems described above. I think I have resolved the problem now. If you create a topic and run plucupdate, you can find the correct item. But if you edit the topic, add some text, save it as a new revision, and run plucupdate, you don't get a result. So I changed the line "my ($meta, $text) = TWiki::Func...." to "... = $TWiki::Plugins::SESSION->readTopic(undef, $web, $topic, undef);". From now on, it works. I would appreciate your comments.

-- HugoKuegerl - 20 Mar 2006

Sorry, the correct call: ...SESSION->{store}->readTopic(...

-- HugoKuegerl - 20 Mar 2006

Hugo, the following code is from the lib/TWiki/Func.pm module in the Dakar release (lines 1145 to 1150):

sub readTopic {
    ...
    return $TWiki::Plugins::SESSION->{store}->readTopic( undef, @_ );
}

In theory, both calls should return the same result. The call the plucupdate script currently makes (TWiki::Func::readTopic) is more polite because it uses the TWiki API instead of a TWiki internal call. I have tried to use direct calls ONLY when the API does not provide the required function.

I have tried

my ($meta, $text) = TWiki::Func::readTopic($web, $topic, undef);

instead of the current one

my ($meta, $text) = TWiki::Func::readTopic($web, $topic, 1);

You are right. It was always reading text from version 1. I have updated the last Dakar release, so I think these issues may be considered as answered.

Note that the Cairo release of the add-on calls TWiki::Store::readTopic($web, $topic, 1). In that case, the parameter 1 is used just to bypass access control.
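
The effect of that third argument can be illustrated with a small stand-in. This is only a sketch: `read_topic_stub` and the `%revisions` data are hypothetical and do not reproduce the TWiki code; they only mimic a reader whose third argument selects a revision, with undef meaning the latest one.

```perl
use strict;
use warnings;

# Hypothetical in-memory store standing in for a topic's revision history.
my %revisions = (
    1 => 'first draft',
    2 => 'first draft plus new text',
);

# Stand-in for a reader whose third argument selects a revision;
# an undefined revision means "latest". This is not the TWiki code itself.
sub read_topic_stub {
    my ( $web, $topic, $rev ) = @_;
    if ( !defined $rev ) {
        ($rev) = sort { $b <=> $a } keys %revisions;
    }
    return $revisions{$rev};
}

print read_topic_stub( 'Doc', 'SomeTopic', 1 ),     "\n";    # pinned to revision 1
print read_topic_stub( 'Doc', 'SomeTopic', undef ), "\n";    # latest revision
```

An indexer that always passes 1 to such a reader keeps indexing the first revision's text, so text added in later revisions is never found.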

-- JoanMVigo - 21 Mar 2006

Thanks for fixing the most pressing issues. The unresolved ones are not that urgent: 1. Directory named "doc", and 3. No index update if attachment is updated.

-- PeterThoeny - 21 Mar 2006

Is this ready for TWiki 4?

-- NikhilMulley - 18 Oct 2006

Oops, I should have read the first comment. Sorry about that.

-- NikhilMulley - 18 Oct 2006

[Sat Oct 11 15:25:04 2008] [error] [client 192.168.123.235] [Sat Oct 11 15:25:04 2008] save: Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/httpd/twiki/lib/TWiki/Plugins/WysiwygPlugin/HTML2TML.pm line 110., referer: http://192.168.123.37/twiki/bin/edit/Sandbox/Inventory?cover=kupu&t=1223709775

I have a problem with UTF-8 and have not been able to solve it for a few days.

-- XiaoMei - 07 Oct 2008

Topic revision: r12 - 2008-10-07 - XiaoMei
 