Feature Proposal: Enhance FormattedSearch to return pattern and count in context

Motivation

These changes make it easier to provide weighted search results, rather than the current alphabetical list of matches, as discussed in OrderSearchResultsMostRelevantFirst.

Description and Documentation

Add 2 variables to $format in Search.pm:

  • $searchpattern returns the search pattern
  • $countcontext returns the number of times the search pattern appears with the given prefix and suffix
For example:
format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt                   $countcontext((---\+[^\n\r+]*?)([^\n\r]*?)) |" 

Examples

Search code like this:

%TABLE{headerrows="2" initsort="5" initdirection="up"}%
|  10 |  5 |  3 |  1 | (Weights) | | | 
| *T* | *H1* | *Hn* | *any* | *Score* | *Web: Topic Date User* | *Summary* |
%SEARCH{ "%URLPARAM{search}%" nosearch="on" nototal="on"
   type="keyword" scope="%URLPARAM{scope}%" web="%URLPARAM{web}%" 
   casesensitive="%URLPARAM{casesensitive}%"
   format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt |\
                   $countcontext((---\+[^+\n\r]*?)([^\n\r]*?)) |\
                   $countcontext((---\+\+[^\n\r]*?)([^\n\r]*?))  |\
                   $countcontext(()())                     |\
                   $percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C4, R1:C1..R1:C4)}$percnt |\
                   [[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname $searchpattern |\
                   $summary |" 
}%

Produces output like this:

| 10 | 5 | 3 | 1 | (Weights) | | |
| T | H1 | Hn | any | Score | Web: Topic Date User | Summary |
| 1 | 1 | 0 | 12 | 27 | Main: TWikiAdminGroup, 03 Apr 2008 - 23:05, ClifKussmaul | TWiki Administrator Group Member list: Set GROUP ATWikiUser, ClifKussmaul Persons/group who can change the list: #Set ALLOWTOPICCHANGE ... |
| 1 | 1 | 0 | 8 | 23 | Main: TWikiGroups, 27 Mar 2005 - 13:45, TWikiContributor | TWiki Groups These groups can be used to define fine grained.TWikiAccessControl in TWiki: New Group: Note: A group topic name must be a .WikiWord ... |
| 1 | 1 | 0 | 6 | 21 | Main: NobodyGroup, 27 Mar 2005 - 13:45, TWikiContributor | Nobody Group Member list: Set GROUP Persons/group who can change the list: Set ALLOWTOPICCHANGE TWikiAdminGroup Used to prevent dangerous... |
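
The Score column is the weighted sum of the four count columns, which is what the $SUMPRODUCT call in the format string computes. As a minimal Perl sketch of the same arithmetic, using the weights and counts from the sample output above (the helper name weighted_score is made up for illustration and is not part of TWiki):

```perl
use strict;
use warnings;

# Weights from the first header row: T, H1, Hn, any.
my @weights = (10, 5, 3, 1);

# Hypothetical helper: weighted sum of one result row's counts.
sub weighted_score {
    my ($weights, $counts) = @_;
    my $score = 0;
    $score += $weights->[$_] * $counts->[$_] for 0 .. $#$weights;
    return $score;
}

# Counts copied from the three sample rows above.
my %rows = (
    TWikiAdminGroup => [1, 1, 0, 12],
    TWikiGroups     => [1, 1, 0, 8],
    NobodyGroup     => [1, 1, 0, 6],
);

for my $topic (sort keys %rows) {
    printf "%-16s %d\n", $topic, weighted_score(\@weights, $rows{$topic});
}
```

For TWikiAdminGroup this gives 10*1 + 5*1 + 3*0 + 1*12 = 27, matching the Score shown in the table.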

Impact

WhatDoesItAffect: Search, Usability, Vars

Implementation

I've implemented this in 4.1.2 with the following changes to Search.pm:

  1. in _countPattern()
    1. add a casesensitive parameter
    2. use it to adjust the search
  2. in searchWeb()
    1. convert @tokens (back) into a regex and a version of the regex with escaped vertical bars
    2. modify the action for $count to use the new countPattern()
    3. add action for $countcontext to use the new countPattern()
    4. add action for $searchpattern to return the escaped regex (above)
Diff is attached.
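
The attached diff is not reproduced here. Purely as a sketch of the idea, and assuming the search pattern is already available as a regex string, counting with a prefix, a suffix and a case-sensitivity switch could look roughly like this (count_in_context and its parameter order are illustrative, not the names or signature used in Search.pm):

```perl
use strict;
use warnings;

# Illustrative helper, not the actual _countPattern() from Search.pm.
# Counts non-overlapping matches of prefix + pattern + suffix in $text.
sub count_in_context {
    my ($text, $pattern, $prefix, $suffix, $casesensitive) = @_;
    # Group the pattern so that alternation (a|b) stays inside the context.
    my $re = $prefix . '(?:' . $pattern . ')' . $suffix;
    my $compiled = $casesensitive ? qr/$re/ : qr/$re/i;
    my $count = 0;
    $count++ while $text =~ /$compiled/g;
    return $count;
}

my $text = "---+ TWiki Groups\n---++ About groups\nSome Groups text.\n";

# H1 context: a heading line with exactly one '+'.
print count_in_context($text, 'groups', '---\+[^+\n\r]*?', '[^\n\r]*?', 0), "\n";
# Any context: empty prefix and suffix, case insensitive.
print count_in_context($text, 'groups', '', '', 0), "\n";
# Any context, case sensitive.
print count_in_context($text, 'groups', '', '', 1), "\n";
```

With this sample text the three calls print 1, 3 and 1.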

-- Contributors: ClifKussmaul - 10 May 2008

Discussion

Useful; certainly a feature for power users.

-- PeterThoeny - 12 May 2008

In response to a question on OrderSearchResultsMostRelevantFirst, I've also attached AdvancedSearch.txt to show a full example of how this could be used. Note that this file will only work if Search.pm is modified as described.

I haven't installed these changes on an Internet-accessible TWiki, but maybe I should so that others can see how it works...

-- ClifKussmaul - 14 May 2008

You had added a date but not your name to the form below. I have now also added your name and today's date so the 14-day clock can start ticking on this proposal.

Note that feature enhancements for the core code are not for patch releases but for next minor or major release. This means that the branch for this is trunk in SVN.

Welcome on board as developer :-)

-- KennethLavrsen - 27 May 2008

IMHO this is potentially a very useful feature. Good work, Clif.

The feature description above leaves quite a bit to the imagination, which concerns me. As I understand it, $countcontext((pre)(post)) counts the number of times the search pattern, s, appears with the prefix pre and the suffix post. I guess this is a case of taking the search regular expression, concatenating with pre.s.post, and counting the number of times this matches in the topics. However:

  1. there are no semantics given for other search types:
    • scope="topic"
    • type="query"
  2. ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH need to be considered in this work, and the syntax made consistent. The proposal suggests using round brackets as delimiters, and eschews the idea of a parameter separator such as comma or semicolon. This isn't a syntax familiar to TWiki users, who are more used to the named parameter syntax of TWiki variables and the comma separator of SpreadSheetPlugin. I'm not saying using round brackets as delimiters is bad, just that all these new operators in format expressions need to use a consistent syntax to avoid melting users' brains.
  3. There is no specification of how a round bracket embedded in one of these expressions is to be handled, nor of the line-ending behaviour of the match (^ and $).

Finally, I know I sound like a cracked record, but the $countcontext feature requires careful unit testing. I can think of several ways to break it.

-- CrawfordCurrie - 28 May 2008

Thanks for the feedback and encouragement.

I wasn't sure if "Committed Developer" meant "developer on committers list" or "developer committed to feature". I'm still not sure which it is, but either way I'm committed...

For "topic", $countcontext could a) search the text b) search only the topic name, c) do nothing and return a constant. To me, a) seems best (and simplest).

Dealing with "query" is less clear. "do nothing" would be easy :-). I can imagine doing more with the query, but the semantics are more complex, and less flexible.

I will work to clarify semantics, and address Crawford's other comments. I welcome ideas & suggestions...

-- ClifKussmaul - 28 May 2008

I was being stupid; there is no issue with the topic and query semantics. Your operators process the topics that are found, and how they are found is incidental.

-- CrawfordCurrie - 28 May 2008

"Developer committed to feature" is the right meaning - or simply - there is at least one person who is willing to implement what is proposed so it is worth for the rest of the community to make a decision on it. It is explained on TWikiFeatureProposals. It is not because I am a red tape loony that I enforce the rule. The reason is technical. Developers inspect the "Proposals where 14-day rule applies" regularly and know that it is important to react if they see a problem. And this table only works when both CommittedDeveloper and DateOfCommitment is populated.

Crawford now that the proposal is more clear do you still have a concern? For the format syntax we actually already have round bracket syntax with $n() and it is also the syntax we know from spreadsheet plugin.

But perhaps the double parenthesis syntax should be replaced with something like $countcontext("regex" prefix="prefix" suffix="suffix")

That syntax would be more in line with other syntax used in twiki and also enables use of () in the regex so you can OR in the regex.

-- KennethLavrsen - 29 May 2008

yes, ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH both suggest that the parameters inside format="" and %SOMETAG{}% should be the same, so that format="$formfield('fieldname' default='aaa')" and %FORMFIELD{'fieldname' default='aaa'}% work the same way.

Similarly, the differences between " and ' would be cleaned up (at the moment some tags work with either, others do not).

-- SvenDowideit - 29 May 2008

Yes, I still have a concern until the specification addresses my points. I think Sven is on the right track, but there needs to be a documented agreement.

-- CrawfordCurrie - 29 May 2008

I'm happy to do whatever seems best with parameters. I also recall some problems because of how much nesting these variables & parameters have (see TML example above). Other format variables use comma-separated parameters - should the proposed variables do the same?

Stepping back a moment: Originally, I was trying to provide OrderSearchResultsMostRelevantFirst, and managed a crude version without any changes to core code. Then I realized that the approach I was using could be much better with some fairly minor additions to Search.pm (described above). In particular, the choice of patterns and their weightings is still in TML, not in core code, so different searches can use different weightings. But the syntax isn't pretty (even with consistent parameters), and there are still issues with ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH, and probably with performance.

Would it be better to rethink the approach to weighted search results, with more extensive changes to Search.pm? I could imagine using preference variables to define the countcontext patterns and weightings, and doing most of the work in perl, rather than in TML with SEARCH and CALC. More work, but probably less brittle. Should we look at the current proposal as a stop-gap measure - useful until we find a better, cleaner way to address the issue?

-- ClifKussmaul - 29 May 2008

As I see it what you have proposed is a simple (from a feature use point of view) enhancement to SEARCH that makes it possible to create some very powerful TWiki applications that can display numeric results of searches. This is extremely useful in applications where you gather data and want to present some status metric or similar. Or you want to create a graph based on a search.

The best is to try and write the spec like we would have written the documentation for the users.

Look at FormattedSearch

Under section 2 format="..." we find the table we want to extend

| $searchpattern | returns the search pattern used in the search (Now that I write this: what exactly is this?) |
| $countcontext('regex' prefix='prefix' suffix='suffix') | returns the number of times the search pattern appears with the given prefix and suffix |

I am a little unsure how the $searchpattern is generated in the different search contexts: keyword, word, literal, regex and query.

Second. We have today a $count. I would assume we know which regex we are searching for. Why is it that one cannot just use $count(prefix searchregex suffix).

So to wrap up...

Why do we need $searchpattern in the first place?

How is $countcontext different from $count?

-- KennethLavrsen - 29 May 2008

On syntax, I suggest to stick with the current convention. DontMakeMeThink. We already have comma list, such as $topic(40, ...) and nested brackets, such as $pattern(.*?\*.*?Email\:\s*([^\n\r]+).*). So, rather than inventing a new syntax like $countcontext('text' prefix='prefix' suffix='suffix') I think it is better to stick with $countcontext(prefix(text)suffix), e.g. same as the existing $pattern().

-- PeterThoeny - 29 May 2008

Good point. On first inspection that slams the door in the face of other functions whose parameter lists may not fit into the xxx(yyy)zzz mould. Fortunately xxx(yyy)zzz is a valid regex, so you can choose:

  1. to interpret xxx(yyy)zzz as a regular expression, and
  2. to read $countcontext(xxx(yyy)zzz) as "like" %COUNTCONTEXT{xxx(yyy)zzz}%, which is valid TWiki variable syntax, and consistent with Sven's proposal.

BTW as I understood it, $countcontext has two parameters (prefix and suffix), not three. Sven, what is regex in your example?

One possibility would be to treat empty braces as a "special token" referring to the just-matched search pattern, thus the requisite expressions for Clif's example would be:

$countcontext(---\+[^+\n\r]*?()[^\n\r]*?)
$countcontext(---\+\+[^\n\r]*?()[^\n\r]*?)
$countcontext(())
which, if a proper parser were developed that normalises the syntax for these embedded functions, would also support:
$countcontext('---\+[^+\n\r]*?()[^\n\r]*?')
$countcontext('---\+\+[^\n\r]*?()[^\n\r]*?')
$countcontext('()')
$pattern(.*?\*.*?Email\:\s*([^\n\r]+).*)
$pattern('.*?\*.*?Email\:\s*([^\n\r]+).*')
May the gods help whoever has to write the parser, though.
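
Leaving the full parser aside, the "special token" part on its own is small. A rough sketch, assuming the argument of $countcontext has already been extracted from the format string (expand_search_token is a hypothetical name, and wrapping the pattern in a non-capturing group is a choice made here, not something specified above):

```perl
use strict;
use warnings;

# Hypothetical helper: replace each unescaped "()" in a $countcontext
# argument with the current search pattern in a non-capturing group.
sub expand_search_token {
    my ($expr, $searchpattern) = @_;
    $expr =~ s/(?<!\\)\(\)/(?:$searchpattern)/g;
    return $expr;
}

my $h1 = expand_search_token('---\+[^+\n\r]*?()[^\n\r]*?', 'group|admin');
print "$h1\n";

# Count H1 headings that contain the search pattern.
my $text  = "---+ Admin page\nThe group list\n";
my $count = 0;
$count++ while $text =~ /$h1/gi;
print "$count\n";
```

The lookbehind leaves escaped brackets such as \(\) alone, so literal parentheses in a prefix or suffix are not mistaken for the token.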

-- CrawfordCurrie - 30 May 2008

Kenneth, I used both $countcontext and $searchpattern to avoid converting search queries into regexs multiple times (possibly with different results) or manually in TML. Neither is needed if the user searches for a single keyword, but if they do something more complex (quoted string, negative keywords, etc), they are more useful. $countcontext lets the user specify a context around the regex, and $searchpattern returns the (escaped) regex, both for debugging and (in the example above) for searching page titles.

I like the idea of empty braces as a special token; could $countcontext then be part of $count? Crawford's final comment worries me, however :-)

OTOH, I've just been working with SearchEngineKinoSearchAddOn, and indexed searching (including attachments) is nice.

I'm sorry I haven't addressed some other issues raised above; I will try to do so in the next few days.

-- ClifKussmaul - 30 May 2008

I'm making some progress, but I'm getting stuck trying to figure out how best to make this work.

As Crawford and Kenneth suggested, here is the start of a more detailed specification for semantics.

| | $searchpattern | returns the search pattern used in the search, in a format suitable for use with SpreadSheetPlugin's CALC. This is most useful with non-trivial keyword searches. |
| option A | $countcontext((prefix)(suffix)) | returns the number of times the search pattern appears between the given prefix and suffix |
| option B | $countcontext(prefix()suffix) | returns the number of times the search pattern appears between the given prefix and suffix |
| option C | $countcontext('regex' prefix='prefix' suffix='suffix') | returns the number of times the regex appears with the given prefix and suffix |

For B and C, we might be able to extend the existing $count rather than adding $countcontext.

When I start thinking about unit tests for prefix and suffix, I get scared. Search.pm doesn't really parse variables in the format string, and there are plenty of ways to get into trouble, e.g.

| context | prefix | suffix |
| H1 | (---\+[^+\n\r]*?) | ([^\n\r]*?) |
| Hn (n>1) | (---\+\+[^\n\r]*?) | ([^\n\r]*?) |
| boldface | ( \*[^ \*\n\r]*?) | ([^ \*\n\r]*?\*) |
| paren | (\([^ \)\n\r]*?) | ([^\)\n\r]*?\)) |
| any | () | () |

particularly because these regexs will be inside $countcontext, inside format, inside SEARCH. I wonder if we should back up and find a cleaner way to do this - e.g. with preference variables.

For type="keyword","word","literal","regex", Search.pm generates a token list, which is easy to convert into $searchpattern. For type="query" we would have to parse the query and try to extract useful keywords. Can we define $searchpattern to return nothing for type='query', at least for now?
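
To make the token-list conversion concrete, here is a small sketch. It assumes the tokens are plain strings and that a leading "-" marks a negative term, which may not match how Search.pm actually represents them; tokens_to_pattern is a made-up name. Word boundaries for type="word" and the escaping needed inside a TWiki table are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: build a $searchpattern string from search tokens.
# Assumption: a leading "-" marks a negative term; a lone "-" is an
# ordinary search for a hyphen and is kept.
sub tokens_to_pattern {
    my ($type, @tokens) = @_;
    return '' if $type eq 'query';    # nothing returned for queries, for now
    my @positive = grep { !/^-./ } @tokens;
    # Regex searches pass through as-is; other types are matched literally.
    my @parts = $type eq 'regex'
        ? @positive
        : map { quotemeta($_) } @positive;
    return join('|', @parts);
}

print tokens_to_pattern('keyword', 'admin', 'user group', '-draft'), "\n";
print tokens_to_pattern('regex', 'gr(ou)?p'), "\n";
print tokens_to_pattern('query', "name~'Web*'"), "\n";
```

The first call drops the negative term and escapes the space in the quoted phrase; the query call returns the empty string.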

$count works for the text, but not for the topic or form fields - the sample above uses CALC to search the title, and presumably we could do something similar for form fields. It would be nice to have a more consistent way of counting - e.g. $count(regex, $topic) and $count(regex, $formfield(name)), but this adds to the implementation problems.

Help!

-- ClifKussmaul - 02 Jun 2008

Yup, it's pretty horrible. I'm assuming:

  1. Options B+C are the long term plan, but for now we are only concerned with option B
  2. () is a special token representing the search pattern
  3. () ($searchpattern) is the empty string for query searches
  4. Unescaped round brackets in valid regexes are always balanced
  5. Counting in formfields is beyond the scope of the current work
Because brackets are always balanced, you should be able to write code that will extract $count() expressions containing balanced round brackets from a format string. For example, given $s is a format, something like this (untested):
my ($nest, $function, $output) = (0, '', '');
foreach my $bit (split(/(\$count|\(|\))/, $s)) {
    # A backslash at the end of the preceding text always escapes the next token
    my $escaped = ($nest ? $function : $output) =~ /\\$/;
    if (!$escaped && $bit eq '$count') {
        die '$count in $count' if $nest > 0;
        $nest = 1;
        $function = '$count';
        next;
    }
    if ($nest) {
        # Inside a $count(...) call, collect every piece into $function
        $function .= $bit;
        next if $escaped;
        if ($bit eq '(') {
            $nest++;
        } elsif ($bit eq ')' && --$nest == 1) {
            # Outermost bracket closed: apply the function and reset the state
            $output .= processFunction($function);
            ($nest, $function) = (0, '');
        }
        next;
    }
    $output .= $bit;
}
processFunction would be responsible for parsing the function call and applying it. Obviously this "mini-parser" can be used to extract other well-formed $x() expressions in the format string. Of course it should also handle balanced unescaped single quotes, which is left as an exercise for the reader.

-- CrawfordCurrie - 03 Jun 2008

How does this proposal get processed further? We have not heard much from the proposer.

-- KennethLavrsen - 15 Sep 2008

I'm sorry I haven't gotten back to this. Knowing that others are interested helps move it up on my priority list. Are others happy with the semantics that Crawford & I discussed in June? Do other issues need to be addressed?

Side note: while SearchEngineKinoSearchAddOn has some problems, it seems to me that indexed search is almost a requirement for large sites, and that TWiki should be moving toward indexed search by default. Is the current feature an evolutionary dead end? Should we be spending our time improving indexed search instead?

-- ClifKussmaul - 16 Sep 2008

No, the current search feature is not a dead end, and if you look at the SearchEngineKinoSearchAddOn Search Algo work I started, you'll see that in twiki5 the 'indexed' search will be plugged into the existing feature :-)

-- SvenDowideit - 16 Sep 2008

REVISED PROPOSAL

Objective: Present "better" search result topics near top of list, by counting occurrences of regexs in result topics, and weighting the counts.

1. Find result topics as usual (see VarSEARCH)

  • e.g. scope="topic","text","all", type="keyword","word","literal","regex","query"

2. Count occurrences of regexs (with FormattedSearch) and put counts (hidden) in results table. Several possible approaches:

  1. count with current syntax (unmodified code)
    • to search text: $count(reg-exp)
    • to search name: $percntCALC{\"$AND($SEARCH(reg-exp, $topic))\"}$percnt
    • to search field: $percntCALC{\"$AND($SEARCH(reg-exp, $formfield(name)))\"}$percnt
  2. add $searchpattern in FormattedSearch (Search.pm)
    • most useful for reusing non-trivial regex searches within SpreadSheetPlugin's CALC (see examples below)
    • for type="literal","regex", return search pattern
    • for type="keyword","word", return search pattern (TODO: ? discard negative search terms, but not searches for "-")
    • for type="query" return nothing (TODO: ? parse query for keywords)
    • ? is there a better name than $searchpattern
    • Clif thinks YES - not difficult to implement, facilitates counting for non-trivial searches
  3. count with $searchpattern (unmodified $count())
    • to search text: $count(.*$searchpattern.*)
    • to search name: $percntCALC{\"$AND($SEARCH($searchpattern, $topic))\"}$percnt
    • to search field: $percntCALC{\"$AND($SEARCH($searchpattern, $formfield(name)))\"}$percnt
    • to search for H1: $count(---\+[^+\n\r]*?$searchpattern[^\n\r]*?.*)
    • to search for Hn: $count(---\+\+[^\n\r]*?$searchpattern[^\n\r]*?.*)
    • to search for bold: $count( \*[^ \*\n\r]*?$searchpattern[^ \*\n\r]*?\*.*)
    • to search for (paren): $count(\([^ \)\n\r]*?$searchpattern[^\)\n\r]*?\).*)
  4. redefine $count() in FormattedSearch (Search.pm)
    • alternatively, add $countcontext(prefix()suffix) to keep $count() simpler
    • returns # of times the search pattern appears between the given prefix and suffix
    • () is a special token referring to current search pattern
    • to search for H1: $count(---\+[^+\n\r]*?()[^\n\r]*?.*)
    • to search for Hn: $count(---\+\+[^\n\r]*?()[^\n\r]*?.*)
    • to search for bold: $count( \*[^ \*\n\r]*?()[^ \*\n\r]*?\*.*)
    • to search for (paren): $count(\([^ \)\n\r]*?()[^\)\n\r]*?\).*) (broken currently due to paren matching)
    • Clif thinks NO - parsing will be difficult and error-prone, and there are alternatives
  5. redefine $count() in FormattedSearch (Search.pm)
    • so count can be used for topic name and form fields
    • to search text: $count(reg-exp, text) (for consistency with topic name and formfield) or $count(reg-exp) (for backwards compatibility)
    • to search name: $count(reg-exp, topic)
    • to search field: $count(reg-exp, formfield(name))
    • TODO: ? other metadata
    • Clif thinks MAYBE - not difficult to implement, easier to use than current approaches
  6. put search pattern or prefix/suffix in preference variable(s), and reference with simpler syntax
    • define SEARCHVAR_A_PREFIX, SEARCHVAR_A_SUFFIX
    • to search text: $count( A, text)
    • expands to $count(SEARCHVAR_A_PREFIX$searchpatternSEARCHVAR_A_SUFFIX)
    • Clif thinks MAYBE - need input on syntax and desirability
  7. avoid performance / DoS problems
    • ? skip/simplify counting if number or total size of result topics exceeds threshold
    • ? specify max count in preference variable SEARCHVAR_MAX_SORT_COUNT
    • Clif thinks MAYBE - need input on syntax and desirability
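
As a sketch of approach 6 only: the expansion of the short form into a full $count() expression could be a single substitution over the format string. The %prefs hash stands in for TWiki preference settings, and expand_named_count and named_count are hypothetical names; none of this is existing TWiki API.

```perl
use strict;
use warnings;

# Stand-in for preference settings (assumed names, following the
# SEARCHVAR_<name>_PREFIX / SEARCHVAR_<name>_SUFFIX convention above).
my %prefs = (
    SEARCHVAR_H1_PREFIX => '---\+[^+\n\r]*?',
    SEARCHVAR_H1_SUFFIX => '[^\n\r]*?',
);

# Expand every "$count( NAME, text)" in a format string.
sub expand_named_count {
    my ($format, $searchpattern) = @_;
    $format =~ s{\$count\(\s*(\w+)\s*,\s*text\s*\)}
                {named_count($1, $searchpattern, $&)}ge;
    return $format;
}

# Build the long form, or return the original text for an unknown name.
sub named_count {
    my ($name, $searchpattern, $original) = @_;
    my $prefix = $prefs{"SEARCHVAR_${name}_PREFIX"};
    my $suffix = $prefs{"SEARCHVAR_${name}_SUFFIX"};
    return $original unless defined $prefix && defined $suffix;
    return '$count(' . $prefix . $searchpattern . $suffix . ')';
}

print expand_named_count('| $count( H1, text) |', 'group'), "\n";
```

Expressions that name an undefined prefix/suffix pair are left untouched, so a typo in the name shows up in the output rather than silently counting nothing.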

3. Compute weighted sum of hidden columns

  1. use weights in header row (with SpreadSheetPlugin)
  2. put sum (visible) in results table

4. Sort result table by weighted sum

  1. ? on server - would have to compute weights, then sort results using weights
  2. using JavaScript on client (current approach)
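
If the sort were done on the server, the last step would be an ordinary descending sort on the score. A minimal sketch, assuming the scores have already been computed per topic (the record layout and the name by_relevance are invented here; the scores are the ones from the sample output earlier on this page):

```perl
use strict;
use warnings;

# Hypothetical result rows with precomputed scores.
my @results = (
    { topic => 'NobodyGroup',     score => 21 },
    { topic => 'TWikiAdminGroup', score => 27 },
    { topic => 'TWikiGroups',     score => 23 },
);

# Highest score first; equal scores fall back to alphabetical topic order.
sub by_relevance {
    my @rows = @_;
    return sort {
        $b->{score} <=> $a->{score} or $a->{topic} cmp $b->{topic}
    } @rows;
}

print "$_->{score} $_->{topic}\n" for by_relevance(@results);
```

The alphabetical tie-break keeps the current behaviour for topics whose scores are equal.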

SAMPLE SEARCH CODE USING DRAFT IMPLEMENTATION

Note that:

  • some columns have two values (which should be equal), from using both $countcontext and $count
  • search target inside () receives negative weight (less important)

%TABLE{headerrows="2" initsort="5" initdirection="up"}%
|  10 |  5 |  3 |  1 |  -1 | (Weights) | | | 
| *T* | *H1* | *Hn* | *any* | *paren* | *Score* | *Web: Topic Date User* | *Summary* |
%SEARCH{ "%URLPARAM{search}%" nosearch="on" nototal="on"
   type="keyword" scope="%URLPARAM{scope}%" web="%URLPARAM{web}%" 
   casesensitive="%URLPARAM{casesensitive}%"
   format="|$percntCALC{\"$AND($SEARCH($searchpattern, $topic))\"}$percnt |\
                   $countcontext((---\+[^+\n\r]*?)([^\n\r]*?))  $count(---\+[^+\n\r]*?$searchpattern[^\n\r]*?.*) |\
                   $countcontext((---\+\+[^\n\r]*?)([^\n\r]*?)) $count(---\+\+[^\n\r]*?$searchpattern[^\n\r]*?.*) |\
                   $countcontext(()()) $count(.*$searchpattern.*) |\
                   $count(\([^ \)\n\r]*?$searchpattern[^\)\n\r]*?\).*) |\
                   $percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C5, R1:C1..R1:C5)}$percnt |\
                   [[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname $searchpattern |\
                   $summary |" 
}%

I've tried to lay out everything in more detail, and it's making more sense, at least to me. :-) I now think that $searchpattern is useful, but changing $count() to support () as a special token will be a lot of trouble for little value. However, it might be more useful to:

  • extend $count() to search in topic name and formfields
  • move the prefix/suffix idea into preference variables
  • consider how to throttle this weighting scheme for large sets of search results, etc.

What do you think?

-- ClifKussmaul - 17 Sep 2008

I only worry about the syntax part and I have removed my concern now. ;-)

-- KennethLavrsen - 19 Sep 2008

This proposal has good potential. Anyone driving this?

-- PeterThoeny - 2009-04-20

No committed developer, I removed ClifKussmaul from committed developer field and parked the proposal. Anyone interested in driving this can take ownership of this proposal.

-- PeterThoeny - 2009-09-28

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| AdvancedSearch.txt | r1 | 10.0 K | 2008-05-14 - 13:28 | ClifKussmaul | |
| Search.pm.diff | r1 | 3.3 K | 2008-05-10 - 15:21 | ClifKussmaul | diff of changes for proposed feature |
| Search.pm.patch | r1 | 2.3 K | 2008-05-27 - 08:56 | UnknownUser | Patch for TWiki 4.2.0 |
Topic revision: r27 - 2009-09-28 - PeterThoeny
 
Copyright © 1999-2016 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.