Feature Proposal: Enhance FormattedSearch to return pattern and count in context

Motivation

These changes make it easier to provide weighted search results, rather than the current alphabetical list of matches, as discussed in OrderSearchResultsMostRelevantFirst.

Description and Documentation

Add 2 variables to $format in Search.pm:

  • $searchpattern returns the search pattern
  • $countcontext returns the number of times the search pattern appears with the given prefix and suffix
For example:

format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt | $countcontext((---\+[^+\n\r]*?)([^\n\r]*?)) |"

Examples

Search code like this:

%TABLE{headerrows="2" initsort="5" initdirection="up"}%
|  10 |  5 |  3 |  1 | (Weights) | | | 
| *T* | *H1* | *Hn* | *any* | *Score* | *Web: Topic Date User* | *Summary* |
%SEARCH{ "%URLPARAM{search}%" nosearch="on" nototal="on"
   type="keyword" scope="%URLPARAM{scope}%" web="%URLPARAM{web}%" 
   casesensitive="%URLPARAM{casesensitive}%"
   format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt |\
                   $countcontext((---\+[^+\n\r]*?)([^\n\r]*?)) |\
                   $countcontext((---\+\+[^\n\r]*?)([^\n\r]*?))  |\
                   $countcontext(()())                     |\
                   $percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C4, R1:C1..R1:C4)}$percnt |\
                   [[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname $searchpattern |\
                   $summary |" 
}%

Produces output like this:

|  10 |  5 |  3 |  1 | (Weights) | | |
| *T* | *H1* | *Hn* | *any* | *Score* | *Web: Topic Date User* | *Summary* |
| 1 | 1 | 0 | 12 | 27 | Main: TWikiAdminGroup, 03 Apr 2008 - 23:05, ClifKussmaul | TWiki Administrator Group Member list: Set GROUP ATWikiUser, ClifKussmaul Persons/group who can change the list: #Set ALLOWTOPICCHANGE ... |
| 1 | 1 | 0 | 8 | 23 | Main: TWikiGroups, 27 Mar 2005 - 13:45, TWikiContributor | TWiki Groups These groups can be used to define fine grained.TWikiAccessControl in TWiki: New Group: Note: A group topic name must be a .WikiWord ... |
| 1 | 1 | 0 | 6 | 21 | Main: NobodyGroup, 27 Mar 2005 - 13:45, TWikiContributor | Nobody Group Member list: Set GROUP Persons/group who can change the list: Set ALLOWTOPICCHANGE TWikiAdminGroup Used to prevent dangerous... |
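
The Score column is the weighted sum that the $SUMPRODUCT call computes from each row's four counts and the weights in the first header row. A minimal Perl sketch of the same arithmetic, using the counts from the sample output (the weighted_score helper is hypothetical and not part of TWiki):

```perl
use strict;
use warnings;

# Weights from the first header row, in column order: T, H1, Hn, any.
my @weights = (10, 5, 3, 1);

# Hypothetical helper: sum of count * weight for one result row,
# which is what $SUMPRODUCT computes in the format string.
sub weighted_score {
    my @counts = @_;
    my $score = 0;
    $score += $counts[$_] * $weights[$_] for 0 .. $#weights;
    return $score;
}

# Counts taken from the three sample rows.
print weighted_score(1, 1, 0, 12), "\n";   # 27
print weighted_score(1, 1, 0, 8),  "\n";   # 23
print weighted_score(1, 1, 0, 6),  "\n";   # 21
```

Because the weights live in the table header rather than in Search.pm, a different search page can rank results differently by changing only the first row.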

Impact

WhatDoesItAffect: Search, Usability, Vars

Implementation

I've implemented this in 4.1.2 with the following changes to Search.pm:

  1. in _countPattern()
    1. add a casesensitive parameter
    2. use it to adjust the search
  2. in searchWeb()
    1. convert @tokens (back) into a regex and a version of the regex with escaped vertical bars
    2. modify the action for $count to use the new countPattern()
    3. add action for $countcontext to use the new countPattern()
    4. add action for $searchpattern to return the escaped regex (above)
Diff is attached.
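
As a rough illustration of step 1 only, counting with a casesensitive switch could look something like the sketch below. This is not the attached diff: count_pattern and its argument order are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a _countPattern() that takes a casesensitive flag.
sub count_pattern {
    my ($text, $pattern, $casesensitive) = @_;
    # Compile inside eval so that a malformed user pattern yields 0
    # instead of killing the search.
    my $re = eval { $casesensitive ? qr/$pattern/ : qr/$pattern/i };
    return 0 unless defined $re;
    # Count with a loop rather than list assignment, so that capture
    # groups in the pattern do not inflate the result.
    my $count = 0;
    $count++ while $text =~ /$re/g;
    return $count;
}

my $text = "---+ Search tips\nUse search wisely. SEARCH is powerful.\n";
print count_pattern($text, 'search', 0), "\n";   # 3
print count_pattern($text, 'search', 1), "\n";   # 1
```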

-- Contributors: ClifKussmaul - 10 May 2008

Discussion

Useful; certainly a feature for power users.

-- PeterThoeny - 12 May 2008

In response to a question on OrderSearchResultsMostRelevantFirst, I've also attached AdvancedSearch.txt to show a full example of how this could be used. Note that this file will only work if Search.pm is modified as described.

I haven't installed these changes on an Internet-accessible TWiki, but maybe I should so that others can see how it works...

-- ClifKussmaul - 14 May 2008

You had added a date but not your name to the form below. I have now added your name and today's date so that the 14-day clock can start ticking on this proposal.

Note that feature enhancements for the core code are not for patch releases but for the next minor or major release. This means that the branch for this is trunk in SVN.

Welcome on board as a developer :-)

-- KennethLavrsen - 27 May 2008

IMHO this is potentially a very useful feature. Good work, Clif.

The feature description above leaves quite a bit to the imagination, which concerns me. As I understand it, $countcontext((pre)(post)) counts the number of times the search pattern, s, appears with the prefix pre and the suffix post. I guess this is a case of taking the search regular expression, concatenating with pre.s.post, and counting the number of times this matches in the topics. However:

  1. there are no semantics given for other search types:
    • scope="topic"
    • type="query"
  2. ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH need to be considered in this work, and the syntax made consistent. The proposal suggests using round brackets as delimiters, and eschews the idea of a parameter separator such as comma or semicolon. This isn't a syntax familiar to TWiki users, who are more used to the named parameter syntax of TWiki variables and the comma separator of SpreadSheetPlugin. I'm not saying using round brackets as delimiters is bad, just that all these new operators in format expressions need to use a consistent syntax to avoid melting users' brains.
  3. There is no specification of how a round bracket embedded in one of these expressions is to be handled, nor of the line-ending behaviour of the match (^ and $).
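
To make point 3 concrete, here is a small sketch of the concatenation reading (prefix, then search pattern, then suffix) that shows how the result depends on whether ^ is allowed to match at every line start. The helper name and the multiline flag are assumptions for illustration; which behaviour the feature should have is exactly what the specification needs to say.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of prefix . pattern . suffix in $text.
sub count_in_context {
    my ($text, $prefix, $pattern, $suffix, $multiline) = @_;
    # Wrap the pattern in a non-capturing group so that an alternation
    # such as "a|b" does not swallow the prefix or suffix.
    my $joined = $prefix . '(?:' . $pattern . ')' . $suffix;
    my $re = $multiline ? qr/$joined/m : qr/$joined/;
    my $count = 0;
    $count++ while $text =~ /$re/g;
    return $count;
}

my $text = "---+ Wiki groups\nSome text\n---+ More wiki\n";
# Prefix anchored with ^ : an H1 line that mentions the pattern.
my @args = ($text, '^---\+[^+\n\r]*?', '[Ww]iki', '[^\n\r]*?');
print count_in_context(@args, 0), "\n";   # 1: ^ matches only at the start of the topic
print count_in_context(@args, 1), "\n";   # 2: with /m, ^ matches at each line start
```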

Finally, I know I sound like a cracked record, but the $countcontext feature requires careful unit testing. I can think of several ways to break it.

-- CrawfordCurrie - 28 May 2008

Thanks for the feedback and encouragement.

I wasn't sure if "Committed Developer" meant "developer on committers list" or "developer committed to feature". I'm still not sure which it is, but either way I'm committed...

For "topic", $countcontext could a) search the text b) search only the topic name, c) do nothing and return a constant. To me, a) seems best (and simplest).

Dealing with "query" is less clear. "do nothing" would be easy :-). I can imagine doing more with the query, but the semantics are more complex, and less flexible.

I will work to clarify semantics, and address Crawford's other comments. I welcome ideas & suggestions...

-- ClifKussmaul - 28 May 2008

I was being stupid; there is no issue with the topic and query semantics. Your operators process the topics that are found, and how they are found is incidental.

-- CrawfordCurrie - 28 May 2008

"Developer committed to feature" is the right meaning - or simply - there is at least one person who is willing to implement what is proposed so it is worth for the rest of the community to make a decision on it. It is explained on TWikiFeatureProposals. It is not because I am a red tape loony that I enforce the rule. The reason is technical. Developers inspect the "Proposals where 14-day rule applies" regularly and know that it is important to react if they see a problem. And this table only works when both CommittedDeveloper and DateOfCommitment is populated.

Crawford, now that the proposal is clearer, do you still have a concern? For the format syntax we actually already have round bracket syntax with $n(), and it is also the syntax we know from SpreadSheetPlugin.

But perhaps the double parenthesis syntax should be replaced with something like $countcontext("regex" prefix="prefix" suffix="suffix")

That syntax would be more in line with other syntax used in TWiki and also enables use of () in the regex so you can OR in the regex.

-- KennethLavrsen - 29 May 2008

Yes, ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH both suggest that the parameters inside format="" and %SOMETAG{}% should be the same, so that format="$formfield('fieldname' default='aaa')" and %FORMFIELD{'fieldname' default='aaa'}% work the same way.

Similarly, the differences between " and ' would be cleaned up (at the moment some tags work with either, others do not).

-- SvenDowideit - 29 May 2008

Yes, I still have a concern until the specification addresses my points. I think Sven is on the right track, but there needs to be a documented agreement.

-- CrawfordCurrie - 29 May 2008

I'm happy to do whatever seems best with parameters. I also recall some problems because of how much nesting these variables & parameters have (see TML example above). Other format variables use comma-separated parameters - should the proposed variables do the same?

Stepping back a moment: Originally, I was trying to provide OrderSearchResultsMostRelevantFirst, and managed a crude version without any changes to core code. Then I realized that the approach I was using could be much better with some fairly minor additions to Search.pm (described above). In particular, the choice of patterns and their weightings is still in TML, not in core code, so different searches can use different weightings. But the syntax isn't pretty (even with consistent parameters), and there are still issues with ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH, and probably with performance.

Would it be better to rethink the approach to weighted search results, with more extensive changes to Search.pm? I could imagine using preference variables to define the countcontext patterns and weightings, and doing most of the work in perl, rather than in TML with SEARCH and CALC. More work, but probably less brittle. Should we look at the current proposal as a stop-gap measure - useful until we find a better, cleaner way to address the issue?

-- ClifKussmaul - 29 May 2008

As I see it what you have proposed is a simple (from a feature use point of view) enhancement to SEARCH that makes it possible to create some very powerful TWiki applications that can display numeric results of searches. This is extremely useful in applications where you gather data and want to present some status metric or similar. Or you want to create a graph based on a search.

The best approach is to write the spec the way we would write the documentation for users.

Look at FormattedSearch.

Under section 2, format="...", we find the table we want to extend:

| $searchpattern | returns the search pattern used in the search (now that I write this: what exactly is this?) |
| $countcontext('regex' prefix='prefix' suffix='suffix') | returns the number of times the search pattern appears with the given prefix and suffix |

I am a little unsure how $searchpattern is generated in the different search contexts: keyword, word, literal, regex and query.

Second, we already have $count today, and I would assume we know which regex we are searching for. Why can one not just use $count(prefix searchregex suffix)?

So to wrap up...

Why do we need $searchpattern in the first place?

How is $countcontext different from $count?

-- KennethLavrsen - 29 May 2008

On syntax, I suggest sticking with the current convention. DontMakeMeThink. We already have comma lists, such as $topic(40, ...), and nested brackets, such as $pattern(.*?\*.*?Email\:\s*([^\n\r]+).*). So, rather than inventing a new syntax like $countcontext('text' prefix='prefix' suffix='suffix'), I think it is better to stick with $countcontext(prefix(text)suffix), i.e. the same as the existing $pattern().

-- PeterThoeny - 29 May 2008

Good point. On first inspection that slams the door in the face of other functions whose parameter lists may not fit into the xxx(yyy)zzz mould. Fortunately xxx(yyy)zzz is a valid regex, so you can choose:

  1. to interpret xxx(yyy)zzz as a regular expression, and
  2. to read $countcontext(xxx(yyy)zzz) as "like" %COUNTCONTEXT{xxx(yyy)zzz}%, which is valid TWiki variable syntax, and consistent with Sven's proposal.

BTW as I understood it, $countcontext has two parameters (prefix and suffix), not three. Sven, what is regex in your example?

One possibility would be to treat empty brackets as a "special token" referring to the just-matched search pattern, thus the requisite expressions for Clif's example would be:

$countcontext(---\+[^+\n\r]*?()[^\n\r]*?)
$countcontext(---\+\+[^\n\r]*?()[^\n\r]*?)
$countcontext(())
which, if a proper parser were developed that normalises the syntax for these embedded functions, would also support:
$countcontext('---\+[^+\n\r]*?()[^\n\r]*?')
$countcontext('---\+\+[^\n\r]*?()[^\n\r]*?')
$countcontext('()')
$pattern(.*?\*.*?Email\:\s*([^\n\r]+).*)
$pattern('.*?\*.*?Email\:\s*([^\n\r]+).*')
May the gods help whoever has to write the parser, though.
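
Setting the parser aside, the "special token" idea itself is easy to sketch: replace each unescaped () in the context expression with the search pattern before counting. The helper below is hypothetical, and wrapping the pattern in a non-capturing group is an assumption made so that alternations stay contained.

```perl
use strict;
use warnings;

# Hypothetical helper: expand the empty-bracket token into the search pattern.
sub expand_context {
    my ($context, $searchpattern) = @_;
    # Only an unescaped "()" is the token; an escaped "\(" is left alone.
    (my $expanded = $context) =~ s/(?<!\\)\(\)/(?:$searchpattern)/g;
    return $expanded;
}

print expand_context('---\+[^+\n\r]*?()[^\n\r]*?', 'group|admin'), "\n";
# ---\+[^+\n\r]*?(?:group|admin)[^\n\r]*?
print expand_context('()', 'group'), "\n";
# (?:group)
```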

-- CrawfordCurrie - 30 May 2008

Kenneth, I used both $countcontext and $searchpattern to avoid converting search queries into regexes multiple times (possibly with different results) or manually in TML. Neither is needed if the user searches for a single keyword, but if they do something more complex (quoted string, negative keywords, etc), they are more useful. $countcontext lets the user specify a context around the regex, and $searchpattern returns the (escaped) regex, both for debugging and (in the example above) for searching page titles.

I like the idea of empty braces as a special token; could $countcontext then be part of $count? Crawford's final comment worries me, however :-)

OTOH, I've just been working with SearchEngineKinoSearchAddOn, and indexed searching (including attachments) is nice.

I'm sorry I haven't addressed some other issues raised above; I will try to do so in the next few days.

-- ClifKussmaul - 30 May 2008

I'm making some progress, but I'm getting stuck trying to figure out how best to make this work.

As Crawford and Kenneth suggested, here is the start of a more detailed specification for semantics.

  • $searchpattern returns the search pattern used in the search, in a format suitable for use with SpreadSheetPlugin's CALC. This is most useful with non-trivial keyword searches.
  • option A: $countcontext((prefix)(suffix)) returns the number of times the search pattern appears between the given prefix and suffix
  • option B: $countcontext(prefix()suffix) returns the number of times the search pattern appears between the given prefix and suffix
  • option C: $countcontext('regex' prefix='prefix' suffix='suffix') returns the number of times the regex appears with the given prefix and suffix

For B and C, we might be able to extend the existing $count rather than adding $countcontext.

When I start thinking about unit tests for prefix and suffix, I get scared. Search.pm doesn't really parse variables in the format string, and there are plenty of ways to get into trouble, e.g.

| *context* | *prefix* | *suffix* |
| H1 | (---\+[^+\n\r]*?) | ([^\n\r]*?) |
| Hn (n>1) | (---\+\+[^\n\r]*?) | ([^\n\r]*?) |
| boldface | ( \*[^ \*\n\r]*?) | ([^ \*\n\r]*?\*) |
| paren | (\([^ \)\n\r]*?) | ([^\)\n\r]*?\)) |
| any | () | () |

particularly because these regexes will be inside $countcontext, inside format, inside SEARCH. I wonder if we should back up and find a cleaner way to do this - e.g. with preference variables.

For type="keyword","word","literal","regex", Search.pm generates a token list, which is easy to convert into $searchpattern. For type="query" we would have to parse the query and try to extract useful keywords. Can we define $searchpattern to return nothing for type='query', at least for now?
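
A minimal sketch of that conversion, under several assumptions: the helper name is invented, the token list is taken to hold only the positive search terms, tokens are joined as alternatives, and vertical bars are escaped with a backslash. The attached diff may well do any of these differently.

```perl
use strict;
use warnings;

# Hypothetical helper: build $searchpattern from the search type and tokens.
sub search_pattern {
    my ($type, @tokens) = @_;
    return '' if $type eq 'query';    # nothing for query searches, for now
    # Treat tokens as literal text unless the user asked for a regex search.
    @tokens = map { quotemeta } @tokens unless $type eq 'regex';
    return join('|', @tokens);
}

my $pattern = search_pattern('keyword', 'admin', 'group');
print "$pattern\n";                   # admin|group

# A raw | would end a TWiki table cell, so the version handed back to
# format="|...|" needs its vertical bars escaped (escaping style assumed).
(my $escaped = $pattern) =~ s/\|/\\|/g;
print "$escaped\n";                   # admin\|group
```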

$count works for the text, but not for the topic or form fields - the sample above uses CALC to search the title, and presumably we could do something similar for form fields. It would be nice to have a more consistent way of counting - e.g. $count(regex, $topic) and $count(regex, $formfield(name)), but this adds to the implementation problems.

Help!

-- ClifKussmaul - 02 Jun 2008

Yup, it's pretty horrible. I'm assuming:

  1. Options B+C are the long term plan, but for now we are only concerned with option B
  2. () is a special token representing the search pattern
  3. () ($searchpattern) is the empty string for query searches
  4. Unescaped round brackets in valid regexes are always balanced
  5. Counting in formfields is beyond the scope of the current work
Because brackets are always balanced, you should be able to write code that will extract $count() expressions containing balanced round brackets from a format string. For example, given $s is a format, something like this (untested):
my ($nest, $function, $output) = (0, '', '');
foreach my $bit (split(/(\$count|\(|\))/, $s)) {
    # A backslash at the end of the text collected so far always escapes
    # the next character
    my $collected = $nest ? $function : $output;
    if ($collected !~ /\\$/) {
        if ($bit eq '$count') {
            die '$count in $count' if $nest > 0;
            $nest = 1;
            $function = '$count';
            next;
        } elsif ($nest && $bit eq '(') {
            $nest++;
        } elsif ($nest && $bit eq ')') {
            $nest--;
            if ($nest == 1) {
                # That was the bracket closing the $count(...) call itself
                $output .= processFunction($function . $bit);
                ($nest, $function) = (0, '');
                next;
            }
        }
    }
    # Inside a call, collect text for processFunction; otherwise pass it through
    if ($nest) {
        $function .= $bit;
    } else {
        $output .= $bit;
    }
}
processFunction would be responsible for parsing the function call and applying it. Obviously this "mini-parser" can be used to extract other well-formed $x() expressions in the format string. Of course it should also handle balanced unescaped single quotes, which is left as an exercise for the reader.

-- CrawfordCurrie - 03 Jun 2008

 
Topic attachments
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| AdvancedSearch.txt | 10.0 K | 14 May 2008 - 13:28 | ClifKussmaul | |
| Search.pm.diff | 3.3 K | 10 May 2008 - 15:21 | ClifKussmaul | diff of changes for proposed feature |
| Search.pm.patch | 2.3 K | 27 May 2008 - 08:56 | AlexanderSeith | Patch for TWiki 4.2.0 |
Topic revision: r19 - 03 Jun 2008 - 09:18:41 - CrawfordCurrie
 