CategorySearchForm < Codev

For categorized webs like a TicketWiki it would be very useful to do complex searches on categories. The form should be created directly from the catitems template.

For example, the Know.WebSearch would have something like this:

Proposed rules for converting the catitems to a search / form:

make selects, radiobuttons and checkboxes into checkboxes
leave text fields as is, possibly allowing RegularExpressions
OR the checkboxes of each category, i.e. checking OsHPUX and OsSunOS yields pages for either Sun OR HP
AND all the selected categories, i.e. checking OsHPUX and PublicFAQ yields only HP FAQs
ignore empty categories
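
A minimal sketch of the OR-within-a-category, AND-across-categories rule, written in Python for brevity rather than in TWiki's Perl. The function name, the category name `OperatingSystem` and the dictionary layout are assumptions made up for illustration; text fields are not covered.

```python
def topic_matches(topic_categories, selection):
    """Return True if a topic satisfies a category search form.

    topic_categories: dict mapping category name -> set of values on the topic.
    selection: dict mapping category name -> set of values ticked in the form.
    """
    for category, wanted in selection.items():
        if not wanted:
            continue  # nothing ticked in this category: ignore it
        # OR inside one category: any ticked value is enough
        if not (topic_categories.get(category, set()) & wanted):
            return False  # AND across categories: one failure rejects the topic
    return True


hp_faq = {"OperatingSystem": {"OsHPUX"}, "TopicClassification": {"PublicFAQ"}}
sun_faq = {"OperatingSystem": {"OsSunOS"}, "TopicClassification": {"PublicFAQ"}}

form = {"OperatingSystem": {"OsHPUX"}, "TopicClassification": {"PublicFAQ"}}
print(topic_matches(hp_faq, form))   # True
print(topic_matches(sun_faq, form))  # False
```

Ticking both OsHPUX and OsSunOS in the same category would accept either topic, which is the "Sun OR HP" case from the list.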

Possibly, we could prefix each category with boolean operators like

On the other hand, usability studies (see http://www.useit.com/alertbox/9707b.html) consistently show that most users do not really understand boolean searches. The first model works like the professional ticketing tool from http://www.quintus.com, which I happen to know.

On yet another hand: the target audience for such a mechanism is not a user looking for a specific page, but rather an expert who likes to keep an overview of huge numbers of topics. So we might add it, if it is easy to implement.

So what do you think? Does that make sense?

-- PeterKlausner - 08 Jun 2001 (btw - won't the variable TWikiGuest change with each subsequent save?)

Actually %WIKIUSERNAME% changes with each view.

-- NicholasLee - 08 Jun 2001

I'm implementing some FormTemplateSystem pages that use embedded searches a lot - while there is no user interface as suggested on this page (often called query by forms or query by example in the DB world, by the way), I am going to be implementing AND searching. This will be for embedded searches only, and will require use of regular expressions, but will let me do things like find pages where dUmMy= Approved and SubjectArea = Security. The idea is that navigating to the SecurityArea page will show you all Approved and Draft entries in that area.

I am planning to allow additional attributes for embedded searches, using the and attribute for the first additional criterion, then and2 and so on. These will be implemented using a series of egrep -l commands, probably combined with xargs (which should be used in any case as it's possible for the current code to blow up with very large numbers of topics, by exceeding the allowed size of arguments for a Unix/Linux command). This is not elegant but it should work, and the performance hit should be fairly reasonable.
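
The pass-by-pass narrowing can be sketched as follows. This is Python rather than Perl, and it matches against an in-memory dictionary of topic texts instead of running egrep -l on files, so it only illustrates the idea; the function name and the sample data are assumptions.

```python
import re


def and_search(patterns, topics):
    """Return the sorted names of topics that match every pattern.

    topics: dict mapping topic name -> topic text.
    Each pass only looks at the survivors of the previous pass, much like
    feeding the output of one `egrep -l` into the next.
    """
    candidates = sorted(topics)
    for pattern in patterns:
        # '.' does not cross newlines and MULTILINE anchors ^/$ per line,
        # which roughly mimics grep's line-by-line matching.
        regex = re.compile(pattern, re.MULTILINE)
        candidates = [name for name in candidates if regex.search(topics[name])]
        if not candidates:
            break  # nothing left to narrow down
    return candidates


topics = {
    "TopicA": "AnswerStatus: Approved\nQuestionArea: SecurityArea",
    "TopicB": "AnswerStatus: Approved\nQuestionArea: NetworkArea",
    "TopicC": "AnswerStatus: Draft\nQuestionArea: SecurityArea",
}
print(and_search(["AnswerStatus.*Approved", "QuestionArea.*SecurityArea"], topics))
# ['TopicA']
```

Note that the two terms sit on different lines of TopicA, which is exactly the case a single line-based regular expression cannot handle.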

I did have a look at agrep, mentioned in SearchEnhancementsWithAndOr, which promised to let me do this with no Perl hacking. However, the use of agrep's -d option (required to match AND terms not on the same line) doesn't work with regular expressions. agrep does look nice for non-forms use of searching with AND/OR.

-- RichardDonkin - 08 Jul 2001

I've now implemented AND searching as a separate Perl script called andgrep - it uses agrep-style syntax but is intended only for use with TWiki. This should be usable with any TWiki release, just by installing andgrep and pointing the $egrepCmd setting of the TWiki.cfg or wikicfg.pm file to this command. Unlike agrep, you can use regular expressions and there are no limits on pattern length when using AND.

I've only been using this script for a few hours, but it's very simple and seems to work OK. I'll attach it here when I get into work, as I'm having a few VPN problems at the moment. It relies on 'egrep' being in the path, so you may need to edit the script and/or use a new variable in TWiki.cfg if it isn't.

It's particularly useful with embedded searches - you can now search on field1 AND field2, e.g. this search in a web modelled on TWiki.Know (regex=on is required, space inserted after '%' because of problems with <verbatim>):

% SEARCH{ "AnswerStatus.*<br><\/td><td>.*Approved;QuestionArea.*<br><\/td><td>.*SecurityArea"  regex="on"}%

UPDATE: Now attached as 'andgrep' - have been using this on live TWiki today without problems. Now handles topic-name searching with regular expressions. I'm using the March beta but it should be useable with other releases as well.

-- RichardDonkin - 08 Jul 2001

Great idea Richard to decouple this completely from the TWiki core! Did you do some performance testing?

-- PeterThoeny - 11 Jul 2001

Good idea to keep this separate from the core for now. But shouldn't we consider adding this code to the core in a future release? Reasons:

Clean interface to storage system
Documentation clearly specifies capability
Any plugin or embedded search can assume this functionality is present

-- JohnTalintyre - 12 Jul 2001

I haven't done any performance testing, but it seems reasonably fast, albeit on a very small set of topics. In the worst case, for N regular expressions which occur in all topics, e.g. 'and;the', the search time will be N times greater than for a single RE, since it runs egrep once for each keyword, and the set of filenames scanned will not go down much from one pass to another. However, this is quite unlikely - if you search for two reasonably common terms (say the first hits 30% of topics and the second 10%, and you have 1000 topics), the first pass will scan 1000 files and return 300 filenames, and the second pass will search those 300 files again. The I/O should not be a big problem given a reasonably sized filesystem cache, so the total CPU overhead in this case is 30% over the single RE case. If the less common keyword were searched for first, performance would be better, since it would scan only 1100 files in total (1000 in the first pass, then the 100 that matched) rather than 1300. If you are doing embedded searches, it's not too hard to optimise the order of keyword terms.
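
The arithmetic above can be checked with a small model. It is a rough Python sketch that assumes each term hits the same percentage of the remaining candidates as it does of the whole web (i.e. the terms are independent); the function name is made up.

```python
def total_files_scanned(n_topics, hit_percentages):
    """Count the files scanned over all passes of a multi-pass AND search.

    hit_percentages: integer percentage of files each term matches, in the
    order the terms are searched. Integer arithmetic avoids rounding noise.
    """
    candidates = n_topics
    scanned = 0
    for pct in hit_percentages:
        scanned += candidates                  # this pass scans every survivor
        candidates = candidates * pct // 100   # survivors for the next pass
    return scanned


print(total_files_scanned(1000, [30, 10]))  # 1300: common term first
print(total_files_scanned(1000, [10, 30]))  # 1100: rare term first
```

So putting the rarest term first is the cheap optimisation for embedded searches.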

As for putting this in the core - sounds like a good idea in the longer term, rather than forking another Perl interpreter. I just kept it separate for ease of debugging. It would also be good to enable AND searching through the forms without using REs - in fact, in some ways it should be the default for multiple word searches:

jim fred should turn into the /jim;fred/ RE, doing an AND search
"jim fred" should turn into the /jim fred/ RE, doing a phrase search

RE based searching should probably use the full syntax. Using ';' is not a bad idea since agrep already uses this.
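
A minimal sketch of the proposed query translation, in Python rather than Perl. The function name is an assumption, and terms are passed through unescaped, so regular expression metacharacters in a word keep their meaning.

```python
import re

# Either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def query_to_and_pattern(query):
    """Turn a search box query into the ';'-separated AND pattern.

    Bare words become separate AND terms; a double-quoted phrase stays
    together as one term, spaces included.
    """
    terms = [phrase or word for phrase, word in _TOKEN.findall(query)]
    return ";".join(term for term in terms if term)


print(query_to_and_pattern('jim fred'))    # jim;fred
print(query_to_and_pattern('"jim fred"'))  # jim fred
```

A mixed query such as `"good food" sushi` comes out as `good food;sushi` under the same rule.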

It would be good to support agrep as well, as an alternative grep - this may be better for non-RE AND searching since it does a single pass.

I'd be interested to hear other people's experiences with andgrep - as long as your egrep is in the path, you can try it out immediately without altering TWiki code, just update TWiki.cfg (Dec 2000 onwards) or wikicfg.pm (May 2000).

-- RichardDonkin - 12 Jul 2001

A couple of observations:

Doesn't seem to work on Windows - I haven't found out why yet
Any idea how to pass ";" to the search script in a URL? ";" is as much of a separator as "&", so it can't be used directly in a value.

-- JohnTalintyre - 13 Jul 2001

The Windows problems may be related to this line in the script, which assumes Unix line endings:

    # Get rid of the \n after every filename
    chomp(@filenames);

You may need to play around with getting rid of \r\n if that's what the backticks return on Windows. Or it could be something else, try turning on the debug.
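
For illustration only, here is what the line-ending clean-up needs to achieve, sketched in Python rather than in the script's Perl. Whether line endings really are the cause of the Windows failure is unconfirmed, and the function name is made up.

```python
def split_filenames(output):
    """Split command output into filenames, tolerating \n and \r\n endings."""
    # str.splitlines() drops the line terminator whichever form it takes;
    # empty lines are discarded so a trailing newline adds no bogus entry.
    return [line for line in output.splitlines() if line]


print(split_filenames("TopicA.txt\nTopicB.txt\n"))      # ['TopicA.txt', 'TopicB.txt']
print(split_filenames("TopicA.txt\r\nTopicB.txt\r\n"))  # ['TopicA.txt', 'TopicB.txt']
```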

As for semicolons, you need to use %3B, the URL encoding for semicolon, as mentioned here and in RFC 1738 on URLs. I just tried doing a search from WebSearch with the March beta, typing in 'foo;bar', and it worked fine, just like the embedded searches, so you'll only need this if you are constructing your own URLs in Perl.
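
If you do build the URL yourself, any standard URL-encoding routine will produce the %3B. A short Python sketch follows; the parameter names `search` and `regex` are assumptions about the search script's query string, and the function name is made up.

```python
from urllib.parse import quote, urlencode


def search_query_string(term):
    """Build a query string with the search term safely percent-encoded."""
    return urlencode({"search": term, "regex": "on"})


print(quote("foo;bar"))                # foo%3Bbar
print(search_query_string("foo;bar"))  # search=foo%3Bbar&regex=on
```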

-- RichardDonkin - 15 Jul 2001

Richard - thanks. I've added your code directly into my copy of Search.pm - only a few lines change. I'm happy to put it into CVS when Peter gives the green light. However, I think it would be better to be compatible with most search engines and switch, as some people have suggested, to space being the AND separator, with literals enclosed in double quotes.

-- JohnTalintyre - 18 Jul 2001

I agree with the idea of space meaning AND, etc, and in fact suggested this above :-) It would be good to see this in the core since it's a fairly low-impact change. However, using space to mean AND is not backwards-compatible, of course, so we'd need a bit of discussion to see how non-embedded search usage might be affected.

-- RichardDonkin - 19 Jul 2001

I am not sure if this was possible some other way, but I have modified the andgrep to allow a search to use AND NOT. I needed this for a ticket system internally.

You simply include a '-' at the beginning of the term you do NOT want to find:

% SEARCH{ "AnswerStatus.*<br><\/td><td>.*Approved;-QuestionArea.*<br><\/td><td>.*SecurityArea"  regex="on"}%

The change to andgrep is:

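    # A leading '-' marks a term that must NOT match: strip it and
    # have egrep list the files without a match (-L) instead.
    if ($simplePattern =~ /^-/) {
        $simplePattern =~ s/^-//;
        $doInvertMatch = "-L";
    } else {
        $doInvertMatch = "";
    }

    $cmd = "egrep $doIgnorecase $doListfiles $doInvertMatch '$simplePattern' @filenames";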

Thought it might help someone, somewhere. :-) Any suggestions on a better way to implement this?

-- AdrianLynch - 20 Nov 2001

The following change fixes an error if no results are found. It also changes the 'not' character to \!. I have used this change successfully on our intranet for job tracking.

    if ($simplePattern =~ /^\\\!/i) {
        if (@filenames) {
            $simplePattern =~ s/\\\!//;
            $doInvertMatch = "-L";
        } else {
            die; # No files found to filter.
        }
    } else {
        $doInvertMatch = "";
    }

    $cmd = "egrep $doIgnorecase $doListfiles $doInvertMatch '$simplePattern' @filenames";
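
The same AND / AND NOT logic, sketched in Python over in-memory topic texts instead of egrep runs on files. The names are made up, and unlike the Perl above this sketch returns an empty list rather than dying when no candidates are left.

```python
import re

NOT_PREFIX = "\\!"  # a backslash followed by '!', as in the change above


def and_not_search(patterns, topics):
    """Return sorted topic names matching all plain terms and no negated term.

    topics: dict mapping topic name -> topic text.
    A pattern starting with NOT_PREFIX keeps the topics that do NOT match it.
    """
    candidates = sorted(topics)
    for pattern in patterns:
        if not candidates:
            break  # nothing left to filter
        negate = pattern.startswith(NOT_PREFIX)
        if negate:
            pattern = pattern[len(NOT_PREFIX):]
        regex = re.compile(pattern, re.MULTILINE)
        candidates = [
            name for name in candidates
            if bool(regex.search(topics[name])) != negate
        ]
    return candidates


topics = {
    "TopicA": "AnswerStatus: Approved\nQuestionArea: SecurityArea",
    "TopicB": "AnswerStatus: Approved\nQuestionArea: NetworkArea",
    "TopicC": "AnswerStatus: Draft\nQuestionArea: SecurityArea",
}
print(and_not_search(["AnswerStatus.*Approved", "\\!QuestionArea.*SecurityArea"], topics))
# ['TopicB']
```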

-- AdrianLynch - 30 Jan 2002

Thanks for posting the updates - I missed these due to lack of ConversationTracking :-) In any case, the form-based web where I was using this didn't really get used due to staff changes...

You might also want to look at GNU bool, linked at end of SearchEngineVsGrepSearch - this implements AND, NOT and proximity searching, and works fine with TWiki (just change the $egrepCmd etc to bool). Probably faster than andgrep, but requires a C compiler (try CygWin if on Windows).

-- RichardDonkin - 09 Feb 2002

Just wanted to add my support for andgrep. After finally reaching the conclusion that searches can't span lines (this could be better documented - particularly regarding usage of webforms), and just before giving up on being able to use webforms, I discovered andgrep.

Now we can use webforms and searches to generate the desired index pages.

One of the dozen or so Webs that we're implementing is Software. Within this Web there will be dozens of unique projects. I wanted to use a TopicClassification similar to the TWiki's WebForm, but needed to add another field to identify the project. Now, to generate bug summaries, etc., unique to each project, the SEARCH needs to span multiple lines. This isn't possible with TWiki out-of-the-box. andgrep works beautifully for this purpose.

If for no other reason than to support better use of webforms, TWiki should offer multi-line searches.

-- MartyBacke - 18 March 2002

I've changed this to a FeatureUnderConstruction. I have built search using ";" to do "AND" into my copy of TWiki.pm and am almost ready to upload to CVS.

-- JohnTalintyre - 18 Mar 2002

Yes, let's take the AND functionality into the core code. Probably better in TWiki::Search so that it works from a search form and also from embedded search.

Compatibility is probably no issue for regex search, the chance that existing text already uses ";" is low.

For non-regex search it would be nice to have search engine type syntax, e.g. "good food" +Sushi +Hamachi -Maguro. The question here is compatibility with existing text. A possible solution is a new switch that enables this type of syntax.
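
A rough sketch of how such a query could be split into terms, in Python rather than Perl. Treating unprefixed terms as required is an assumption (it follows the "space means AND" idea discussed above), and the function name is made up.

```python
import re

# Optional +/- sign, then either a double-quoted phrase or a bare word.
_TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a search-engine style query into (required, excluded) terms."""
    required, excluded = [], []
    for sign, phrase, word in _TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue  # e.g. an empty pair of quotes
        if sign == "-":
            excluded.append(term)
        else:
            required.append(term)  # '+' and bare terms are both required here
    return required, excluded


print(parse_query('"good food" +Sushi +Hamachi -Maguro'))
# (['good food', 'Sushi', 'Hamachi'], ['Maguro'])
```

A hyphen inside a word, as in `e-mail`, is left alone because only a leading sign is treated as an operator.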

A QueryByExampleSearch would be nice too, in fact I have a need for that at work.

-- PeterThoeny - 22 Mar 2002

Another small enhancement: SearchScriptWithFormattedSearch

-- PeterThoeny - 22 Mar 2002

I've now added the ability to use AND in a regexp search to CVS - see SearchWithAnd. AND is represented by ";" as discussed above.

-- JohnTalintyre - 23 Mar 2002

I've noticed a security hole in andgrep. Since it doesn't check for ' in the search string, you can terminate the grep part of the command and then execute arbitrary commands on the host machine. I haven't managed to get this to do anything other than give me a word count (wc) on the Wiki files in the 10 minutes I've been playing with it, but I'm pretty sure you could do about anything you wanted that the web server's user can do with a little patience. I'm updating my andgrep script to change ' to . for now; I'll look into a more robust solution and post it here if someone hasn't already posted a better one before I get mine.

To test the hole, try running a WebSearch on

'; wc '

(you must enter the ' as part of the search string) with regex = "on" on any TWiki with andgrep installed.
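
To make the mechanism concrete: the ' is not itself a command separator, it merely closes the quoted argument early, so that the following ; is seen by the shell. The Python sketch below only builds command lines and never runs them; the function names are made up, and it illustrates the general problem of interpolating a search string into a shell command rather than the exact andgrep or TWiki code.

```python
import shlex


def naive_command(pattern, filenames):
    """Unsafe: the pattern is pasted between single quotes as-is."""
    return "egrep -l '" + pattern + "' " + " ".join(filenames)


def quoted_command(pattern, filenames):
    """Safer string form: every argument is shell-quoted first."""
    args = ["egrep", "-l", "-e", pattern] + list(filenames)
    return " ".join(shlex.quote(arg) for arg in args)


def argv_command(pattern, filenames):
    """Safest form: an argument list that never passes through a shell."""
    return ["egrep", "-l", "-e", pattern] + list(filenames)


evil = "'; wc '"
print(naive_command(evil, ["TopicA.txt"]))
# egrep -l ''; wc '' TopicA.txt
print(shlex.split(quoted_command(evil, ["TopicA.txt"])))
# ['egrep', '-l', '-e', "'; wc '", 'TopicA.txt']
```

In the naive form the shell sees two commands, an egrep with an empty pattern followed by a wc. In the quoted form the whole search string stays a single argument to egrep.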

-- BobbyMartin - 30 Oct 2002

Nope, the security hole seems to be a general TWiki hole. I'm running the Dec 2001 release. I turned off my use of andgrep and the wc hole remains. It appears to be fixed in the code running here on twiki.org, though.

-- BobbyMartin - 30 Oct 2002

I do not understand why ' should be a command separator. Could you explain?

-- PeterThoeny - 31 Oct 2002

The andgrep script has been superseded by andgrep-based functionality that is now included in the TWikiAlphaRelease, so use that instead if you can.

-- RichardDonkin - 01 Nov 2002

WebForm
TopicClassification	FeatureUnderConstruction

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
ext	andgrep		manage	2.6 K	2001-07-09 - 17:57	UnknownUser	andgrep

Topic revision: r24 - 2003-03-07 - SvenDowideit
