Question
I am using the latest SearchEnginePluceneAddOn on a TWiki-4.0.1 installation. The ExtraBackendParsers.zip is installed as well. I can index and search for .pdf, .ppt, .htm, .html, .txt, .doc, and .xls files. Thanks for updating the add-on for TWiki-4!
I ran into some problems:
1. Directory named "doc"
TWiki 4 has a doc directory in the attachment directory of TWiki.JSCalendarContrib. When running plucene/bin/plucindex I get a No such file error. Here is the output:
Plucene index files init
- to suppress all normal output: plucindex -q
Attachment extensions to be indexed: pdf, ppt, htm, html, txt, doc, xls
Variables to be indexed:
Indexing Doc topics
Indexing Main topics
Skipping Sandbox topics
Indexing TWiki topics
Skipping Trash topics
Indexing attachments ...
Current directory: /tmp
Current directory: /tmp
** (process:25170): WARNING **: No such file '/var/www/html/pub/TWiki/JSCalendarContrib/doc'
Current directory: /tmp
Some problem running latex.
Check for Errors in 71iWAfJMcS.log
Continuing...
Conversion into dvi failed
Error: Couldn't open file '/tmp/71iWAfJMcS'
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.
Optimizing index ...
Indexing complete.
If I rename the doc directory to doc-x I no longer get the No such file error. That indicates that the add-on is treating the doc directory as a .doc file. It looks like the add-on needs to take the dot into account when matching extensions.
End of the output after renaming:
...
Skipping Trash topics
Indexing attachments ...
Current directory: /tmp
Current directory: /tmp
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.
Optimizing index ...
Indexing complete.
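The behavior in problem 1 suggests a check that matches the bare name doc against the extension list. Below is a minimal sketch of a stricter check, assuming the extensions are configured without dots as in the output above. The helper is_indexable_attachment is hypothetical and not part of the add-on; it only illustrates requiring both a regular file and a dot before the extension.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of the add-on): true only if $path is a
# regular file whose name ends in a dot followed by a configured extension.
# Extensions are given without the dot, e.g. "doc", "pdf".
sub is_indexable_attachment {
    my ($path, @extensions) = @_;
    return 0 unless -f $path;               # skips directories such as .../doc
    my ($ext) = $path =~ /\.([^.\/]+)$/;    # text after the last dot in the file name
    return 0 unless defined $ext;
    return ( grep { lc($_) eq lc($ext) } @extensions ) ? 1 : 0;
}

# Demonstration in a scratch directory: a directory named "doc" next to a
# real .doc attachment.
my $dir = tempdir( CLEANUP => 1 );
mkdir "$dir/doc" or die "mkdir failed: $!";
open my $fh, '>', "$dir/notes.doc" or die "open failed: $!";
close $fh;

my @ext = qw(pdf ppt htm html txt doc xls);
for my $candidate ( "$dir/doc", "$dir/notes.doc" ) {
    my $verdict = is_indexable_attachment( $candidate, @ext ) ? 'index' : 'skip';
    print "$verdict $candidate\n";
}
```

With this check the doc directory is skipped and notes.doc is indexed, regardless of whether the configured list carries dots.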
2. Error message "Parsing of undecoded UTF-8"
As seen in the output above, there is a message:
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi/HTML/Parser.pm line 104.
I do not see any problems though when searching for attachments.
3. No index update if attachment is updated
I created a new topic and attached a .txt file to it. I could search for it as expected after running plucene/bin/plucindex. Then I updated the file with some new text and re-attached it. I can view the updated attachment properly. When I run plucene/bin/plucupdate I get this output:
Plucene index files update
- to suppress all normal output: plucupdate -q
Checking Doc ...-> no topics new/changed since 15 Mar 2006 - 04:29
Doc .plucupdate saved
Checking Main ...-> no topics new/changed since 15 Mar 2006 - 04:34
Main .plucupdate saved
Skipping Sandbox topics
Checking TWiki ...-> no topics new/changed since 14 Mar 2006 - 05:10
TWiki .plucupdate saved
Skipping Trash topics
No index update necessary
Updating index complete.
Notice the timestamp of the Doc web, 15 Mar 2006 - 04:29. The current local time was Tue Mar 14 23:58:55 EST 2006. I am not sure if this is a GMT vs. local time issue, or possibly connected to the "update topic without bumping up rev" feature (if done within one hour).
An hour later: it might be related to the "same rev" feature. I attached an updated txt file after one hour, and the topic rev went up. Running plucupdate I get this output:
Plucene index files update
- to suppress all normal output: plucupdate -q
Checking Doc ...-> changed topics since 15 Mar 2006 - 05:50:
* DocID20060314213653
Doc .plucupdate saved
Checking Main ...-> changed topics since 15 Mar 2006 - 05:50:
Main .plucupdate saved
Skipping Sandbox topics
Checking TWiki ...-> changed topics since 15 Mar 2006 - 05:50:
TWiki .plucupdate saved
Skipping Trash topics
Now removing old topics
Removing of old topics finished
Indexing new and changed topics
Attachment extensions to be indexed: pdf, ppt, htm, html, txt, doc, xls
Variables to be indexed:
Reindexing attachments ...
Optimizing index ...
Indexing complete.
Updating index complete.
The output looks OK now. But:
4. Index destroyed after running update
After running plucupdate I can search for the attachments that have been updated (as shown by the plucupdate output), but all other attachments can no longer be found. In the example above, only text in the attachment of Doc.DocID20060314213653 can be found. Searching topic text is still OK. Running plucindex again fixes the issue. I can send the content of the index directory by e-mail if needed (after running plucupdate and after running plucindex).
Workaround
I can work around issues 3 and 4 by always performing full indexes instead of updates. This is OK for now, but soon there will be so much content that an hourly full index will not be feasible.
Environment
--
PeterThoeny - 15 Mar 2006
Answer
1. Dots should be added to the extensions in PLUCENEINDEXEXTENSIONS, as in the previous version. However, the attachment attributes are retrieved from the topic metadata, and the attachment extension is compared with the TWiki variable. I have reviewed TWiki/JSCalendarContrib.txt and there is no meta info, so does TWiki build the meta info by examining the contents of pub/TWiki/JSCalendarContrib? Also, I have checked that the contents of the doc folder are neither listed nor indexed ...
- I already tried adding the dots in PLUCENEINDEXEXTENSIONS, but then no attachments get indexed at all. Not sure why it picked up the doc dir. -- PeterThoeny
- The plucindex and plucupdate scripts should also be modified to include the dot when comparing an extension with those in PLUCENEINDEXEXTENSIONS. I will update the ZIP file to include this workaround this week. -- JoanMVigo - 15 Mar 2006
2. Plucene uses HTML/Parser.pm to index HTML attachments. Also see TWiki:Support/PluceneAndSpecialCharacters
3. Right. The plucupdate script checks the .changes file. If an attachment is updated with a new version but the topic does not get a new revision number, the topic will not be added to the .changes file, so it will be ignored the next time plucupdate is run.
- Re-attaching files within one hour happens. I think the algorithm should be changed to look at the timestamp (in .changes) instead of the rev number. -- PeterThoeny
- Both pluc scripts look at the timestamp, because the generated .plucupdate file contains just the timestamp of the last index/update run. Is it possible that re-attaching does not update .changes? -- JoanMVigo - 15 Mar 2006
- Yes, that is the case: the .changes file does not get updated on follow-up saves, since the first entry is sufficient for an e-mail notification. But this is not sufficient for updating the index. Not sure though how to solve the issue. -- PeterThoeny
4. The plucupdate script first removes all topics that have been updated since the last run. Attachments of those removed topics are also removed. Then new/updated topics and their attachments are indexed. I do not understand why the plucupdate script removes all existing attachments in your installation. Searching with attachment:yes should display all the attachments currently indexed ... I am running Cairo and Dakar on different servers and I have never experienced the problem described here. Please send the console output and both log files (index-yyyymmdd.log and update-yyyymmdd.log) instead of the index folder.
- Thanks, I will send the log files by e-mail -- PeterThoeny
--
JoanMVigo - 15 Mar 2006
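Regarding item 3 above, one possible approach that does not depend on .changes would be to compare the modification time of each attachment file with the time of the last update run. The sketch below is only an illustration under assumptions: attachments_changed_since is a hypothetical helper, the last-run time is taken as epoch seconds (the actual format stored in .plucupdate is not shown in this topic), and RCS history files ending in ,v are assumed to sit next to the attachments and are skipped.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of the add-on): returns the regular files
# below $pub_web_dir whose modification time is later than $last_run
# (epoch seconds), sorted by path.
sub attachments_changed_since {
    my ( $pub_web_dir, $last_run ) = @_;
    my @changed;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                my $path = $File::Find::name;
                return unless -f $path;     # ignore directories
                return if $path =~ /,v$/;   # assumption: skip RCS history files
                my $mtime = ( stat _ )[9];  # reuse the stat done by -f
                push @changed, $path if $mtime > $last_run;
            },
        },
        $pub_web_dir
    );
    my @sorted = sort @changed;
    return @sorted;
}

# Demonstration in a scratch directory standing in for pub/Doc.
my $pub = tempdir( CLEANUP => 1 );
mkdir "$pub/DocID1" or die "mkdir failed: $!";
for my $name (qw(old.txt new.txt)) {
    open my $fh, '>', "$pub/DocID1/$name" or die "open failed: $!";
    close $fh;
}

my $last_run = time - 3600;    # pretend the last update ran an hour ago
utime $last_run - 60, $last_run - 60, "$pub/DocID1/old.txt";
utime $last_run + 60, $last_run + 60, "$pub/DocID1/new.txt";

my @changed = attachments_changed_since( $pub, $last_run );
print "$_\n" for @changed;     # lists only .../DocID1/new.txt
```

A check like this would catch re-attached files even when the topic revision and .changes stay the same, at the cost of walking the pub directory on every update.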
Peter, could you please send the log files to me as well (sopan_shewale@persistentPLEASENOSPAM.co.in)?
--
SopanShewale - 17 Mar 2006
Done. Thanks for checking.
--
PeterThoeny - 17 Mar 2006
Hi Joan! I wrote a comment on 12.03.2006 about a problem finding text in a topic body. Before you released the "Dakar version" of PluceneSearch, I had modified your previous version, and it worked great in Dakar. Then I installed your new version and got the problems described above. I think I have resolved the problem now. If you create a topic and run plucupdate, you can find the correct item. But if you edit the topic, add some text, save it as a new revision, and run plucupdate, you don't get a result. So I changed the line "my ($meta, $text) = TWiki::Func...." to "... = $TWiki::Plugins::SESSION->readTopic(undef, $web, $topic, undef);". From now on, it works. I would appreciate your comment.
--
HugoKuegerl - 20 Mar 2006
Sorry, the correct call: ...SESSION->{store}->readTopic(...
--
HugoKuegerl - 20 Mar 2006
Hugo, the following code is from the lib/TWiki/Func.pm module in the Dakar release (lines 1145 to 1150):
sub readTopic {
...
return $TWiki::Plugins::SESSION->{store}->readTopic( undef, @_ );
}
In theory, both calls should give the same result. The one the plucupdate script currently uses (TWiki::Func::readTopic) is more polite because it uses the TWiki API instead of a TWiki internal call. I have tried to use direct calls ONLY when the API does not provide the required function.
I have tried
my ($meta, $text) = TWiki::Func::readTopic($web, $topic, undef);
instead of the current one
my ($meta, $text) = TWiki::Func::readTopic($web, $topic, 1);
You are right: it was always reading the text from revision 1. I have updated the latest Dakar release, so I think these issues may be considered answered.
Note that the Cairo release of the add-on calls TWiki::Store::readTopic($web, $topic, 1). In this case, the parameter 1 is used just to bypass access control.
--
JoanMVigo - 21 Mar 2006
Thanks for fixing the most pressing issues. The unresolved ones are not that urgent: 1. Directory named "doc", and 3. No index update if attachment is updated.
--
PeterThoeny - 21 Mar 2006
Is this ready for TWiki 4?
--
NikhilMulley - 18 Oct 2006
Oops, I should have read the first comment. Sorry about that.
--
NikhilMulley - 18 Oct 2006
[Sat Oct 11 15:25:04 2008] [error] [client 192.168.123.235] [Sat Oct 11 15:25:04 2008] save: Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/httpd/twiki/lib/TWiki/Plugins/WysiwygPlugin/HTML2TML.pm line 110., referer: http://192.168.123.37/twiki/bin/edit/Sandbox/Inventory?cover=kupu&t=1223709775
I have a problem with UTF-8 and have been unable to solve it for a few days.
--
XiaoMei - 07 Oct 2008