TWiki's grep-based search is amazingly fast. At work we have over 800 topics in our knowledge base web and search is more than good enough. However, it can become an issue on a public site where hundreds of people access the site simultaneously. AFAIK, the TWiki installation at JOS disabled search for just that reason.

There are free search engines available that could be used to spider the TWiki webs.

Here are pluses and minuses compared to grep search:

  • Plus: Fast.
  • Plus: Flexible search with AND / OR combination.
  • Minus: Time lag of indexing (e.g. topic updates don't show up immediately in search)
  • Minus: Can't support all TWiki search options (regular expressions, for example, or KevinKinnell's latest additions)

I have not looked into installing a search engine, but as I understand it, you create some template files for the search engine. When you call its search script from a form, it generates a search result based on the input parameters, formatted using the templates. Ideally it should be possible to do it the other way around in TWiki: create a replacement wikisearch.pm script that calls the search engine's script, captures the result from stdout, and then formats it to TWiki's needs.
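Roughly, the replacement could work like this (just a sketch - the external script name, its output format, and the subroutine name are all made up for illustration):

    # Sketch: call an external search engine and reformat its output for TWiki.
    sub externalSearch
    {
        my( $web, $searchString ) = @_;
        # capture stdout of a hypothetical external search script
        my @hits = `/usr/local/bin/sitesearch --web $web '$searchString'`;
        my $text = "";
        foreach my $line ( @hits ) {
            chomp $line;
            # assume the script prints one "TopicName: excerpt" per line
            my( $topic, $excerpt ) = split( /:\s*/, $line, 2 );
            $text .= "   * [[$web.$topic]]: $excerpt\n";
        }
        return $text;
    }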

Does anybody have experience with search engines?

-- PeterThoeny - 05 May 2000

Have you considered http://search.cpan.org/search?dist=MMM-Text-Search or something similar?

-- NicholasLee - 06 May 2000

Workin' on it in Pure Perl as a .pm and probably using one of the search modules on CPAN. Won't be done for some time; gotta make a living too.

-- KevinKinnell - 07 May 2000

I have some experience with search engines. I've used ht://Dig on the LUF site and SwishE at work. One problem that I ran into with ht://Dig was that the database grew very large, since it stores copies of the pages in the database. Also, it finds words that are in the templates and not just in the .txt files.

I'm thinking of switching to SwishE since you can do a file system search on the .txt files and map them into the TWiki URLs. Also, since SwishE is a command-line driven program, it should be directly integrable with the existing TWiki search script. It supports searching multiple indexes, which would allow searches over individual or multiple webs. It also supports boolean operators and wildcards, and can limit searches to particular fields (i.e. META, TITLE, etc.). With the HTTP indexing feature, I can include other parts of the site, or even index other sites, to be included in the search.
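As a rough illustration of what the command-line side could look like (options quoted from memory - check the SwishE docs before relying on them):

    # build an index per web; the config file would point IndexDir at the web's data directory
    swish-e -c /etc/swish/twiki-main.conf

    # query from the TWiki search script, limiting the number of results
    swish-e -f /var/swish/twiki-main.index -w 'search and engine' -m 20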

-- JamalWills - 09 May 2000

You may also consider SWISH++ ( http://homepage.mac.com/pauljlucas/software/swish/ ) for indexing and searching. The httpindex part can be fed from the command line, so there's no problem indexing only the .txt files, or excluding some files. Think of the unix "find" command for generating the file list. I'm currently working on some combination of searching TWiki webs with SWISH++.

-- MichaelSpringmann - 27 Mar 2001

I'd be very interested in your conclusions.

-- MartinCleaver - 27 Mar 2001

A generic indexing key:text_to_index package would be best in the long run. Consider PackageTWikiStore and the comments at the end of NativeVersionFormat on Web.Topic indexes.

-- NicholasLee - 27 Mar 2001

See also SearchAttachments

-- MartinCleaver - 29 Mar 2001

After looking at the product listings in SearchAttachments, I found that a good starting point would be http://www.perlfect.com/freescripts/search/ (pure Perl, as opposed to the other suggestions). However, I think that any search engine has to be TWiki-ized, that is:

  1. Indexing has to be lightweight, i.e. we cannot trigger full-site indexing on every topic change, nor can we just wait for the next indexing event to happen.
  2. The web/subweb structure has to be maintained by the indexing mechanism.
  3. The TWiki vocabulary of text mark-up and wiki words has to be understood by the search engine and the indexing mechanism (i.e. WikiWords themselves should be broken into component words, as in the sketch after this list).
  4. Meta information should be kept in the indexes to speed searches, i.e. revision/author information, initial set of lines...
  5. Stop words (non-indexed terms) should probably be generated dynamically by the indexing mechanism, and not pre-established (this would reduce language and vocabulary dependencies).
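On point 3, splitting a wiki word into its component words is simple enough; a rough sketch (the regex is only an approximation of TWiki's real WikiWord rules):

    # split "SearchEngineVsGrep" into "Search Engine Vs Grep" before indexing
    sub splitWikiWord
    {
        my( $word ) = @_;
        $word =~ s/([a-z0-9])([A-Z])/$1 $2/g;   # break at lowercase/uppercase boundaries
        return $word;
    }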

The first point is the main one for me, and I have given it some thought, so let me open my ideas up for comments:

  1. On every topic edit, generate a temporary file that contains the current state of the topic (alternatively, this file could just contain the current indexing information for the topic)
    • This could also be done by extracting the previous revision from RCS; however, not all topic revisions are saved, such as those that happen within the topic lock interval.
  2. On topic save, generate new indexing information, remove the old information from the indexes, and place the new information in.
    • We would probably need both forward and backward indexes per web, to speed up the process.
    • However: a global reverse word index for the site would simplify the actual search engine
  3. On some regular interval either re-index the site, or do some consistency check on the indexes, to ensure that no errors have crept into the databases.

Advantages: Much faster than trying to index the whole site, and it can run as a background task after the actual topic save.
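A rough sketch of the per-topic index update in point 2 (the in-memory structures and the helper name are made up; a real implementation would also need an on-disk format and locking):

    # forward index:  topic -> words it contained when last indexed
    # reverse index:  word  -> set of topics containing it
    sub reindexTopic
    {
        my( $topic, $newText, $forward, $reverse ) = @_;
        # drop the topic's old words from the reverse index
        foreach my $word ( @{ $forward->{$topic} || [] } ) {
            delete $reverse->{$word}{$topic};
        }
        # tokenize the new text and re-insert it
        my @words = grep { length $_ > 2 } split( /\W+/, lc $newText );
        $forward->{$topic} = [ @words ];
        $reverse->{$_}{$topic} = 1 foreach @words;
    }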

But there are a couple of stopping points that I can't figure out yet:

  • What to do when two people try to edit the same topic simultaneously (this is actually allowed by TWiki).
    Mmmm... I guess I figured this one out: in keeping with the peer trust system we just need to re-generate indexes on topic saves, so we just look at the current topic before the file is actually saved [ EB: 4/3/01 ]
  • An index locking mechanism has to be put in place, which begs the question: what to do when we need to re-index a topic but it is locked?

Comments???

-- EdgarBrown - 30 Mar 2001

See MultipleCommitPathBehaviour which will define minimum spec for the save-path part of the edit-preview-save cycle.

-- NicholasLee - 03 Apr 2001

We are looking at integrating Ht:dig with TWiki. The primary objective is to provide searching on attachments - this and attachment rev control mean that TWiki is an effective document repository (that actually gets used!). Rather than try and replace the native TWiki search, the idea is to fire both searches in parallel and then to integrate the search results pages.

Advanced search can include the (3) htdig controls and allow users to choose which tool to use.

The question is what to display as the default - is it best to exclude topics from the ht:dig collection, to just show ht:dig results and live with the fact that it indexes overnight, or to somehow include both?

Can anyone offer advice?

-- SteveRoe - 01 Jun 2001

The advantage of using something like ht:dig is that we're not adding the complexity of indexing etc. to TWiki. But what are the disadvantages, and do they matter?

  • Users have to install both TWiki and a search engine
  • Unless some integration is done, the search engine is likely to be out of date by up to, say, a day.

-- JohnTalintyre - 01 Jun 2001

JamalWills said last year (above) that SwishE might be a better option. Has anyone got experience with that?

-- MartinCleaver - 01 Jun 2001

John / Steve / Anyone else - did you get htdig integrated with TWiki? On Windows or on UNIX? I've got a member of staff who can investigate this this week, and I'm finding that having no ability to search attachments is a major impediment on a web with 400 topics but over 3GB of data! Many thanks.

-- MartinCleaver - 02 Oct 2001

At work we have htdig running alongside TWiki - we've done no integration. Users have a choice of the TWiki integrated search or the faster htdig search, which also searches some attachment types (including Word and PDF files).

-- JohnTalintyre - 03 Oct 2001

Aha. I see, thanks. Are you running on NT or UNIX? Have you created a [.Topic] in TWiki from which to invoke the htdig search?

-- MartinCleaver - 03 Oct 2001

Running on unix; modified WebSearch to include ht/dig.

-- JohnTalintyre - 03 Oct 2001

Great, thanks. I'm going to get someone to try it on Windows! (Fingers crossed!!)

  1. Are you ht/digging only the attachment space or do you do the topics as well?
  2. Once you have found something in an attachment, do you relate it back to the topic? If so, how?

-- MartinCleaver - 04 Oct 2001

Just a note -- I found a new (to me) search engine:

  • Namazu 2.0 -- at a quick glance it appears to do everything I want to do except proximity searches. There are some confusing things in the documentation (something like "can't search on another computer" -- but I suspect this is just a translation problem, especially as the web site itself can be searched using namazu 2.0.5) -- the product is written by Japanese developers.

I'm not sure this will be useful -- the thing that intrigues me is that one of the search programs is namazu.cgi, which somehow makes me (as a non-expert in cgi, html, perl, etc.) suspect that maybe this will be convenient to integrate into TWiki. This is at least partially a note to remind me to investigate further.

-- RandyKramer - 27 Nov 2001

Namazu looks like a very useful Intranet tool. I've managed to get the windows binary version half-way working on Windows2k. There are limitations which I'm sure a programmer could fix fairly easily:

  • MSOffice must be installed on the local computer to index .doc & .xls files - Namazu uses Win32::OLE by default. I see code to use wvWare (as well? instead?) but I don't know how to get that working.
  • MSO documents being indexed must be on the local computer - after processing ~100 remote documents I get a crash in winword.exe after which no further office documents are indexed. (There does not appear to be any problem indexing remote .txt or .html files.)

Other stuff might be harder to fix. For instance, indexing MSO documents is slow: 70 minutes to index 1800 local documents totalling 370 MB on a dual PIV 800MHz with 512 MB RAM. Maybe using wvWare instead of OLE would help here.

             Total     Indexed
    files    1826      1451
    size     373 MB    72 MB

Rough RAM use breakdown while indexing:

    perl.exe   52 MB
    winword    15 MB
    excel       8 MB

CGI installation was a snap:

  • copy namazu.cgi.exe to /cgi-bin/ (there are no other namazu files on the webserver machine)
  • edit /cgi-bin/.namazurc (based on $namazu/etc/namazurc-sample, only 2 changes were necessary)
    • add path to index files
    • add a Replace line so the search results are clickable (see the example after this list)
  • point browser to http://mywebserver/cgi-bin/namazu.cgi and it "just works"
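For reference, the two .namazurc changes looked roughly like this (paths and server name are examples, not the real ones):

    # .namazurc (fragment)
    Index   d:/namazu/index
    Replace d:/data/docs/    http://mywebserver/docs/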

-- MattWilkie - 28 Nov 2001

Wonderful! (Please tell me you were working on this before you saw my note, for the sake of my sanity!)

I'll have to look into this, but it sounds simple enough. In my case I'd be indexing the (TWiki .txt) files on the webserver, so I would have the indexing programs there as well, probably run by a cron job. (My (private) web server is on Linux, and I want to put a public site on SourceForge, so, for example, the ".exe" is not applicable, but I suspect the installation would be almost identical except for those kinds of things.)

Thanks for the feedback, and, if you did try this after seeing my note -- thanks for the effort!

PS: Does it look like the search feature could be incorporated on a TWiki template? (To keep a consistent look and feel)

-- RandyKramer - 28 Nov 2001

Keeping a consistent look and feel should be easy enough. Namazu already uses separate header and footer files, which are just simple html fragments (kept in the same directory as the index), and the script which generates the index has a "--with-template" argument.

One other thing I forgot to mention is that the cgi is very fast; the search results come up faster than a regular twiki page.

As for your sanity, sorry, I can't help there because I traded mine in years ago. : )

-- MattWilkie - 29 Nov 2001

Matt, thanks for the follow up!

-- RandyKramer - 30 Nov 2001

Has anybody looked at GNU bool ( http://www.gnu.org/software/bool/bool.html or ftp://ftp.gnu.org/pub/gnu/bool/ )? It supports boolean expressions of the form:

(sanity and there) or because (sorry near help) and gnu

No regexp support unfortunately. While I agree there are some usability issues with booleans, this seems like it might be a good option/plugin for use in the simple search.

I ran

    /tools/bool-0.2/bin/bool.exe -l -i 'fragment and space' *.c *.h

in the bool src directory and came up with:

    bool.c
    context.c

bool also understands HTML 4 and can deal with character entities (or so they claim).

Now how do plugins affect the search pages? Maybe time for a SearchPlugins topic?

-- JohnRouillard - 01 Dec 2001

Thanks for the pointer - bool is really impressive, it just works out of the box with TWiki, and it's very easy to build (./configure; make; make check; make install). Its interpretation of 'near' is good, too - it treats two newlines as a new paragraph and only considers words within a paragraph as 'near'.

I just tried this out by changing TWiki.cfg to point $egrepCmd and $fgrepCmd to bool. The only thing remaining is to change TWiki so that it knows it is using bool and rewrites syntax such as easy wiki into easy NEAR wiki, and "easy wiki" into a phrase-matching search (just apply bool's -F flag).
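That rewrite is mostly string munging; a rough sketch of it (not actual TWiki code):

    # turn a Google-style query into bool syntax, e.g. "easy wiki" -> "easy and wiki";
    # a quoted phrase would instead be handed to bool with its -F flag, as described above
    sub toBoolQuery
    {
        my( $query ) = @_;
        my @words = split( /\s+/, $query );
        return join( " and ", @words );
    }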

I previously implemented AND searching in a simple script called andgrep (see CategorySearchForm), but bool is much better for phrase searching. andgrep is still useful for form searching based on Field1 = foo AND Field2 = bar, though, since RegularExpressions are needed.

I'm adding this page to InterfaceProject since the TWiki search feature is one of the key usability issues I hear about from TWiki users in my company. I'd like to aim for something like Google searching, in its use of proximity searching even if not in intelligence. Ranking search results based on best match is important as well.

Ideally we could support two search options, keeping an identical search syntax at the user level and for embedded searching:

  • bool or similar for out-of-the-box searching on small to medium size TWiki sites
  • an OpenSource search engine for larger sites with more data and users
    • SWISH++, at least, has incremental indexing (i.e. it can add just one page to the index), so it should be possible to index when saving a topic (or perhaps after a short delay if this is a slow operation).

I don't think plugins make much difference to searching, as long as the raw topic text has meaningful keywords - bool type searching happens on the *.txt files not the HTML output, of course.

-- RichardDonkin - 09 Feb 2002

I ended up using Perlfect Search, mostly because it was written in Perl, and that's what I do best. It did require a bit of hacking though because it is not a web spider, so it sees the source of all of the .txt files. I also needed it to index MS Word docs (it only comes with PDF support). Here is a list of what I changed in the sources:

  • Added Word support to the indexer by using "catdoc", a Unix-based conversion utility (http://www.ice.ru/~vitus/catdoc/).
  • Hacked the indexer to ignore meta info.
  • Had the indexer call TWiki functions to convert the topic to HTML before indexing. This allows it to use heading tags for weighting and also makes sure that includes and embedded searches get indexed as part of a given page.
  • Toying with the idea of adding Excel support. This could be done with a utility called xls2csv, which comes as part of more recent catdoc distributions, so it would be easy to add.
  • Had to post-process file paths discovered by the search indexer so that the .txt extension would be dropped and the pages would be opened through the "view" script (see the sketch after this list).
  • Perlfect only supports a single path to the files, and my attachments are not stored with the data, so I needed to hack in support for that as well.
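The path rewriting mentioned above is trivial; roughly (the script path and helper name are just examples):

    # map an indexed data file back to a TWiki view URL, e.g.
    #   .../data/Main/WebHome.txt  ->  /cgi-bin/view/Main/WebHome
    sub dataPathToViewUrl
    {
        my( $path ) = @_;
        if( $path =~ m!([^/]+)/([^/]+)\.txt$! ) {
            return "/cgi-bin/view/$1/$2";
        }
        return $path;   # not a topic file, leave it alone
    }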

I was pretty impressed with the results of the searches. The hits were right on, clips from the pages were displayed in the results with the search terms highlighted (including Word and PDF docs), you can perform searches that exclude terms, and it was easy to use built-in Perlfect params to limit a search to a single web.

Granted, it doesn't do everything... like searching for WebForm data. So I still use the old search, mostly for embedding topic lists in pages based on a form, but that is about it. All in all it was a great step up from the grep search, and finding what I really need is a breeze. Best of all, it's written in Perl, so it should be easy to make further enhancements.

...After reading this page, I figure I ought to check out htdig though, sounds interesting.

BTW - If anyone is interested in my Perlfect search hacks I can upload the changes. Just to warn in advance, the changes aren't well documented and my emphasis was on getting it to work the way I wanted, not to make the code look pretty.

Cheers.

-- RobertHanson - 17 Jul 2002

I'd like to see the changes. Have you considered feeding them back to the Perlfect people?

-- JohnRouillard - 18 Jul 2002

I'll try to get the changes together later today. As for the Perlfect people, I don't know that they would be interested in it at this point, it was a fairly quick hack and needs a lot of cleaning. I also started to make their code object oriented just to make it a little easier to work with... so there is still lots of work/cleaning to be done.

-- RobertHanson - 18 Jul 2002

Well, I keep posting bad news in the form of bugs here, so I thought I'd try something different. Here is a positive, useful addition to TWiki for those who want a bit more kick in the simple search.

I have uploaded a perl wrapper for the bool search tool (see #BoolPrevDisc on this page for further info). This does what RichardDonkin asked for. It turns search requests like twiki problem crash into twiki and problem and crash. It also turns "teamwork tool" twiki into "teamwork tool" and twiki, which implements a phrase search for the words teamwork tool right next to each other, rather than just returning any topic that has the words teamwork and tool somewhere in its text.

It also allows more complex boolean expressions like twiki and (fix near loss) not rouillard, which finds all pages with the word twiki, the word fix within 10 words of the word loss, and that don't have my name on them 8-).

Bool does have a problem with searching stdin. It tries to provide lines of context around the match, and this doesn't work, because it prints multiple file names (and partial file names, up to 60 characters of context) around the matching line. So my wrapper falls back on fgrep for searching topic names.

I've only had this running since this afternoon, but I have found a few caveats: not rouillard does not work as expected. Something like e not rouillard does do the trick; I guess bool wants to be positive and match something. I have tried to do a good job of cleaning the search string, but as always with my stuff, use at your own risk, no warranty implied or provided, etc.

I installed it like so:

  1. install bool
    1. download bool from ftp://ftp.gnu.org/gnu/bool/
    2. build it for your platform (bool builds OOTB on cygwin)
    3. install it
  2. download the wrapper called boolwrapper.
  3. make sure the shebang line points to your perl installation.
  4. change the paths in the wrapper to point to your fgrep and bool install. My path to bool is almost certainly wrong unless you are running a Depot-Lite software installation.
  5. install the wrapper in some directory.
  6. in TWiki.cfg, change your $fgrepCmd to point to the wrapper (see the example below).
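The TWiki.cfg change in step 6 is just one line (the path here is an example; use wherever you installed the wrapper):

    # TWiki.cfg -- point the fixed-string search at the wrapper instead of fgrep
    $fgrepCmd = "/usr/local/bin/boolwrapper";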

I didn't make it take over the role of egrep because I really wanted to keep regular expressions available for those that know how to use them.

On another topic, has anybody tried using perl in slurp-the-entire-file mode for applying regexps? Might be useful for patterns that can extend over multiple lines.
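(For the record, slurp mode just means clearing $/ so the whole topic comes back as one string, letting a regexp span line breaks:)

    # read the whole topic at once so a pattern can match across lines
    local $/ = undef;                      # slurp mode
    open( FILE, "< $file" ) or die "can't open $file: $!";
    my $text = <FILE>;
    close( FILE );
    my $found = ( $text =~ /users.*password/s );   # '.' can cross line boundaries thanks to /s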

-- JohnRouillard - 07 Aug 2002

This looks very useful - I was going to do something similar but work intervened. As you probably know, TWiki does now have SearchWithAnd built in, but the syntax you've implemented here is much better since everyone knows it from Google.

It would be good to map the basic "teamwork tools" twiki type syntax onto SearchWithAnd, which should not be too hard, so that there is an out-of-the-box improved search syntax (perhaps controlled by a new parameter on %SEARCH%?). bool is more flexible for complex boolean searching of course, and good in terms of performance for complex searches as it only reads each file once, whereas SearchWithAnd must launch grep several times and re-scan at least some files.

The other thing that would be great is relevance ranking - not sure if bool can do this, but ranking topics higher if they have multiple hits for the search terms would be useful. Ultimately it would be good to use something like Google's relevance ranking (see Google:Google+PageRank), but that's probably patented and would require a batch index-building process - so a simple ranking based on the number of search terms 'hit' would be useful.

-- RichardDonkin - 07 Aug 2002

Well, I don't know about ranking, but bool (and grep) will return a line count (-c option) that could be used if the TWiki core made use of it. I was also thinking about changing the core so that it parses the output of the grep and uses it as context. E.g. grepping for "user's password" returns (with output truncated to keep the page readable):

    AppendixFileSystem.txt:| =.htpasswd= | Basic Authentication (htaccess) users file with username and encrypted password pairs |
    InstallPassword.txt:an encrypted password generated by a user with ResetPassword.
    InstallPassword.txt:After submitting this form the user's password will be changed.
    MainFeatures.txt:       * *Managing users:* Web-based [[TWikiRegistration][user registration]] ...
    TWikiInstallationGuide.txt:             * ALERT! *NOTE:* When a user registers, a new line with the ...
    TWikiRegistrationPub.txt:To edit pages on this TWiki Collaborative Web, you must have a ...
    TWikiUpgradeGuide.txt:  * *[[TWiki06x00.TWikiUserAuthentication#ChangingPasswords][Change passwords]]* ...

The core could then parse out the file names, use the count of lines as a ranking, and use the rest as lines of context in a formatted search. E.g. context=120 would show 120 characters of returned info for each file, in addition to (or maybe replacing) the summary, depending on the search.
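A sketch of that parsing step (the helper name is made up; it assumes the "File.txt:matching line" output format shown above):

    # group grep/bool output by topic, count hits, keep the matching lines as context
    sub rankGrepOutput
    {
        my( @lines ) = @_;
        my( %count, %context );
        foreach my $line ( @lines ) {
            next unless $line =~ /^(.+?)\.txt:(.*)$/;
            $count{$1}++;
            push( @{ $context{$1} }, $2 );
        }
        # highest hit count first - the "score" used for sortby=score
        return map { [ $_, $count{$_}, $context{$_} ] }
               sort { $count{$b} <=> $count{$a} } keys %count;
    }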

An example output (ascii format) with sortby=score context=160 would look like:

    InstallPassword(2)                 14 Dec 2001 - 02:42 - NEW             AndreaSterbini
    Install an Encrypted Password This form can be used only by the MAINWEB .TWikiAdminGroup
    users to install an encrypted password generated by a user with ResetPassword ...
    >an encrypted password generated by a user with ResetPassword.
    >after submitting this form the user's password will be changed.

    AppendixFileSystem(1)                 18 Jul 2002 - 07:08 - r1.10        PeterThoeny
    TOC STARTINCLUDE #FileSystem # Appendix A: TWiki Filesystem Annotated directory
    and file listings, for the 01-Dec-2001 TWiki production release. Who and What is This ...
    > | =.htpasswd= | Basic Authentication (htaccess) users file with username and encrypted
      password pairs |

The entry for InstallPassword floats to the top since it had the most lines returned; the number of hits is indicated by the number in parens on the top line. You could even (for gnu-type greps) use the -C # option to print additional context lines around the matching line. For bool, you would use -O # for the number of lines and -C # for the number of characters in the context lines.

This would make it a bit more useful. Just removing the -l from the grep and bool command lines would allow us to do a "poor man's" context. Does this sound like a good addition for the core TWiki?

Also, does anybody think I should update the boolwrapper to allow ; to mean and?

-- JohnRouillard

Just a note: The Glimpse search engine handles regular expressions, misspellings, and ranking of results. Their engine works pretty well through an "agrep" command line utility which would make TWiki integration relatively painless. You can find them here: http://www.webglimpse.net and here: http://www.webglimpse.org - Free for non-commercial use and also for those who will help develop and test.

-- TomKagan - 24 Sep 2002

Topic attachments:
  • boolwrap (3.3 K, 2002-08-07, JohnRouillard) - First cut at boolwrap.