format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt $countcontext((---\+[^\n\r+]*?)([^\n\r]*?)) |"
%TABLE{headerrows="2" initsort="5" initdirection="up"}%
| 10 | 5 | 3 | 1 | (Weights) | | |
| *T* | *H1* | *Hn* | *any* | *Score* | *Web: Topic Date User* | *Summary* |
%SEARCH{ "%URLPARAM{search}%" nosearch="on" nototal="on"
type="keyword" scope="%URLPARAM{scope}%" web="%URLPARAM{web}%"
casesensitive="%URLPARAM{casesensitive}%"
format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt |\
$countcontext((---\+[^+\n\r]*?)([^\n\r]*?)) |\
$countcontext((---\+\+[^\n\r]*?)([^\n\r]*?)) |\
$countcontext(()()) |\
$percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C4, R1:C1..R1:C4)}$percnt |\
[[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname $searchpattern |\
$summary |"
}%
Produces output like this:
| 10 | 5 | 3 | 1 | (Weights) | ||
| T | H1 | Hn | any | Score | Web: Topic Date User | Summary |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 12 | 27 | Main: TWikiAdminGroup 03 Apr 2008 - 23:05 ClifKussmaul | TWiki Administrator Group Member list: Set GROUP ATWikiUser, ClifKussmaul Persons/group who can change the list: #Set ALLOWTOPICCHANGE ... |
| 1 | 1 | 0 | 8 | 23 | Main: TWikiGroups 27 Mar 2005 - 13:45 TWikiContributor | TWiki Groups These groups can be used to define fine grained.TWikiAccessControl in TWiki: New Group: Note: A group topic name must be a .WikiWord ... |
| 1 | 1 | 0 | 6 | 21 | Main: NobodyGroup 27 Mar 2005 - 13:45 TWikiContributor | Nobody Group Member list: Set GROUP Persons/group who can change the list: Set ALLOWTOPICCHANGE TWikiAdminGroup Used to prevent dangerous... |
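The Score column is the weighted sum that $SUMPRODUCT computes over the four count columns and the weights row. As a rough illustration only (Python rather than TML; the function name `weighted_score` is made up for this sketch), the arithmetic for the three rows shown is:

```python
def weighted_score(counts, weights):
    """Sum of count * weight, mirroring $SUMPRODUCT over columns 1..4."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


# Weights row from the table: T, H1, Hn, any
WEIGHTS = [10, 5, 3, 1]

# Count columns (T, H1, Hn, any) of the three sample rows
rows = [
    [1, 1, 0, 12],  # TWikiAdminGroup
    [1, 1, 0, 8],   # TWikiGroups
    [1, 1, 0, 6],   # NobodyGroup
]

scores = [weighted_score(row, WEIGHTS) for row in rows]
print(scores)
```

The weights stay in the first table row, so a different search page can change them without touching any code.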
-- KennethLavrsen - 27 May 2008
IMHO this is potentially a very useful feature. Good work, Clif.
The feature description above leaves quite a bit to the imagination, which concerns me. As I understand it, $countcontext((pre)(post)) counts the number of times the search pattern, s, appears with the prefix pre and the suffix post. I guess this is a case of taking the search regular expression, concatenating it as pre.s.post, and counting the number of times this matches in the topics. However, the specification needs to address:
scope="topic"
type="query"
format expressions need to use a consistent syntax to avoid melting users' brains.
The $countcontext feature requires careful unit testing. I can think of several ways to break it.
-- CrawfordCurrie - 28 May 2008
Thanks for the feedback and encouragement.
I wasn't sure if "Committed Developer" meant "developer on committers list" or "developer committed to feature".
I'm still not sure which it is, but either way I'm committed...
For "topic", $countcontext could a) search the text, b) search only the topic name, or c) do nothing and return a constant. To me, a) seems best (and simplest).
Dealing with "query" is less clear. "do nothing" would be easy :-). I can imagine doing more with the query, but the semantics are more complex, and less flexible.
I will work to clarify semantics, and address Crawford's other comments. I welcome ideas & suggestions...
-- ClifKussmaul - 28 May 2008
I was being stupid; there is no issue with the topic and query semantics. Your operators process the topics that are found, and how they are found is incidental.
-- CrawfordCurrie - 28 May 2008
"Developer committed to feature" is the right meaning. Put simply, at least one person is willing to implement what is proposed, so it is worth the rest of the community's time to make a decision on it. This is explained on TWikiFeatureProposals. I do not enforce the rule because I am a red tape loony; the reason is technical. Developers regularly inspect the "Proposals where 14-day rule applies" table and know that it is important to react if they see a problem. This table only works when both CommittedDeveloper and DateOfCommitment are populated.
Crawford, now that the proposal is clearer, do you still have a concern? For the format syntax we already have round bracket syntax with $n(), and it is also the syntax we know from SpreadSheetPlugin.
But perhaps the double parenthesis syntax should be replaced with something like $countcontext("regex" prefix="prefix" suffix="suffix")
That syntax would be more in line with other syntax used in TWiki and would also allow () in the regex, so you can OR in the regex.
-- KennethLavrsen - 29 May 2008
Yes, ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH both suggest that the parameters inside format="" and %SOMETAG{}% should be the same, so that format="$formfield('fieldname' default='aaa')" and %FORMFIELD{'fieldname' default='aaa'}% work the same way.
Similarly, the differences between " and ' would be cleaned up (at the moment some tags work with either, others do not).
-- SvenDowideit - 29 May 2008
Yes, I still have a concern until the specification addresses my points. I think Sven is on the right track, but there needs to be a documented agreement.
-- CrawfordCurrie - 29 May 2008
I'm happy to do whatever seems best with parameters. I also recall some problems because of how much nesting these variables & parameters have (see TML example above). Other format variables use comma-separated parameters - should the proposed variables do the same?
Stepping back a moment: Originally, I was trying to provide OrderSearchResultsMostRelevantFirst, and managed a crude version without any changes to core code. Then I realized that the approach I was using could be much better with some fairly minor additions to Search.pm (described above). In particular, the choice of patterns and their weightings is still in TML, not in core code, so different searches can use different weightings. But the syntax isn't pretty (even with consistent parameters), and there are still issues with ExtractAndCentralizeFormattingRefactor and SupportAccessToArbitraryMetaDataFromSEARCH, and probably with performance.
Would it be better to rethink the approach to weighted search results, with more extensive changes to Search.pm? I could imagine using preference variables to define the countcontext patterns and weightings, and doing most of the work in perl, rather than in TML with SEARCH and CALC. More work, but probably less brittle. Should we look at the current proposal as a stop-gap measure - useful until we find a better, cleaner way to address the issue?
-- ClifKussmaul - 29 May 2008
As I see it what you have proposed is a simple (from a feature use point of view) enhancement to SEARCH that makes it possible to create some very powerful TWiki applications that can display numeric results of searches. This is extremely useful in applications where you gather data and want to present some status metric or similar. Or you want to create a graph based on a search.
The best approach is to write the spec the way we would write the documentation for the users.
Look at FormattedSearch.
Under section 2, format="...", we find the table we want to extend:
| $searchpattern | returns the search pattern used in the search Now I write this - what exactly is this?? |
| $countcontext('regex' prefix='prefix' suffix='suffix') | returns the number of times the search pattern appears with the given prefix and suffix |
$topic(40, ...) and nested brackets, such as $pattern(.*?\*.*?Email\:\s*([^\n\r]+).*). So, rather than inventing a new syntax like $countcontext('text' prefix='prefix' suffix='suffix') I think it is better to stick with $countcontext(prefix(text)suffix), e.g. same as the existing $pattern().
-- PeterThoeny - 29 May 2008
Good point. On first inspection that slams the door in the face of other functions whose parameter lists may not fit into the xxx(yyy)zzz mould. Fortunately xxx(yyy)zzz is a valid regex, so you can choose: xxx(yyy)zzz as a regular expression, and
$countcontext(xxx(yyy)zzz) as "like" %COUNTCONTEXT{xxx(yyy)zzz}%, which is valid TWiki variable syntax, and consistent with Sven's proposal.
regex in your example?
One possibility would be to treat empty braces as a "special token" referring to the just-matched search pattern, thus the requisite expressions for Clif's example would be:
$countcontext(---\+[^+\n\r]*?()[^\n\r]*?)
$countcontext(---\+\+[^\n\r]*?()[^\n\r]*?)
$countcontext(())
which, if a proper parser were developed that normalises the syntax for these embedded functions, would also support:
$countcontext('---\+[^+\n\r]*?()[^\n\r]*?')
$countcontext('---\+\+[^\n\r]*?()[^\n\r]*?')
$countcontext('()')
$pattern(.*?\*.*?Email\:\s*([^\n\r]+).*)
$pattern('.*?\*.*?Email\:\s*([^\n\r]+).*')
May the gods help whoever has to write the parser, though.
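To make the "special token" idea concrete, here is a minimal sketch in Python (not the Search.pm implementation; `expand_context` and `count_matches` are names invented for this illustration). It assumes the token is an unescaped `()` and that the search pattern is already a valid regex.

```python
import re

# An empty pair of round brackets that is not preceded by a backslash
_TOKEN = re.compile(r"(?<!\\)\(\)")


def expand_context(template, search_pattern):
    """Replace each unescaped () in template with the search pattern."""
    # A function is used as the replacement so that backslashes in the
    # search pattern are inserted literally.
    return _TOKEN.sub(lambda m: "(?:" + search_pattern + ")", template)


def count_matches(template, search_pattern, text):
    """Count non-overlapping matches of the expanded template in text."""
    regex = expand_context(template, search_pattern)
    return sum(1 for _ in re.finditer(regex, text))


h1 = r"---\+[^+\n\r]*?()[^\n\r]*?"
print(expand_context(h1, "Group"))
print(count_matches("()", "Group", "Group and Group"))
```

Wrapping the pattern in a non-capturing group keeps an alternation such as `a|b` from leaking into the surrounding prefix and suffix.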
-- CrawfordCurrie - 30 May 2008
Kenneth,
I used both $countcontext and $searchpattern to avoid converting search queries into regexes multiple times (possibly with different results) or manually in TML. Neither is needed if the user searches for a single keyword, but if they do something more complex (quoted string, negative keywords, etc.), they are more useful.
$countcontext lets the user specify a context around the regex, and $searchpattern returns the (escaped) regex, both for debugging and (in the example above) for searching page titles.
I like the idea of empty braces as a special token;
could $countcontext then be part of $count?
Crawford's final comment worries me, however.
OTOH, I've just been working with SearchEngineKinoSearchAddOn, and indexed searching (including attachments) is nice.
I'm sorry I haven't addressed some other issues raised above; I will try to do so in the next few days.
-- ClifKussmaul - 30 May 2008
I'm making some progress, but I'm getting stuck trying to figure out how best to make this work.
As Crawford and Kenneth suggested, here is the start of a more detailed specification for semantics.
| $searchpattern | returns the search pattern used in the search, in a format suitable for use with SpreadSheetPlugin's CALC. This is most useful with non-trivial keyword searches. |
| option A $countcontext((prefix)(suffix)) | returns the number of times the search pattern appears between the given prefix and suffix |
| option B $countcontext(prefix()suffix) | returns the number of times the search pattern appears between the given prefix and suffix |
| option C $countcontext('regex' prefix='prefix' suffix='suffix') | returns the number of times the regex appears with the given prefix and suffix |
| context | prefix | suffix |
|---|---|---|
| H1 | (---\+[^+\n\r]*?) | ([^\n\r]*?) |
| Hn (n>1) | (---\+\+[^\n\r]*?) | ([^\n\r]*?) |
| boldface | ( \*[^ \*\n\r]*?) | ([^ \*\n\r]*?\*) |
| paren | (\([^ \)\n\r]*?) | ([^\)\n\r]*?\)) |
| any | () | () |
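As a rough check of the semantics in the tables above, the following Python sketch counts how often a search pattern occurs between a given prefix and suffix. It is an illustration under assumptions, not the proposed Search.pm code: `count_context` is a made-up name, the outer brackets of each table entry are dropped, and Python's `re` is assumed to treat these particular patterns the same way Perl does.

```python
import re


def count_context(text, search_pattern, prefix="", suffix=""):
    """Count non-overlapping matches of prefix + pattern + suffix in text."""
    regex = prefix + "(?:" + search_pattern + ")" + suffix
    return sum(1 for _ in re.finditer(regex, text))


# Prefix and suffix pairs taken from the context table
CONTEXTS = {
    "H1": (r"---\+[^+\n\r]*?", r"[^\n\r]*?"),
    "Hn": (r"---\+\+[^\n\r]*?", r"[^\n\r]*?"),
    "any": ("", ""),
}

sample = "---+ Group list\n---++ Group members\nThe Group page\n"

counts = {
    name: count_context(sample, "Group", prefix, suffix)
    for name, (prefix, suffix) in CONTEXTS.items()
}
print(counts)
```

Note that a heading containing the pattern twice is still counted once by the heading contexts, because each match has to start at the heading marker.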
format, inside SEARCH.
I wonder if we should back up and find a cleaner way to do this - e.g. with preference variables.
For type="keyword","word","literal","regex", Search.pm generates a token list, which is easy to convert into $searchpattern.
For type="query" we would have to parse the query and try to extract useful keywords.
Can we define $searchpattern to return nothing for type='query', at least for now?
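One possible reading of these rules, sketched in Python for illustration. Everything here is an assumption rather than the behaviour of Search.pm: the function name `search_pattern`, the idea that tokens are joined with `|`, and the choice to drop negative terms (those starting with "-") while keeping a bare "-".

```python
import re


def search_pattern(search_type, tokens):
    """Build a regex for $searchpattern from a list of search tokens."""
    if search_type == "query":
        # Proposed interim behaviour: return nothing for query searches
        return ""
    if search_type == "regex":
        # Tokens are already regular expressions
        return "|".join(tokens)
    if search_type == "literal":
        return "|".join(re.escape(t) for t in tokens)
    if search_type in ("keyword", "word"):
        # Assumption: negative terms are dropped, a bare "-" is kept
        positive = [t for t in tokens if t == "-" or not t.startswith("-")]
        return "|".join(re.escape(t) for t in positive)
    raise ValueError("unknown search type: " + search_type)


pattern = search_pattern("keyword", ["group", "-nobody", "-"])
print(pattern)
```

Escaping the keyword and literal tokens is what makes the result safe to pass on to a regex-based count or to CALC's $SEARCH.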
$count works for the text, but not for the topic or form fields - the sample above uses CALC to search the title,
and presumably we could do something similar for form fields.
It would be nice to have a more consistent way of counting - e.g. $count(regex, $topic) and $count(regex, $formfield(name)),
but this adds to the implementation problems.
Help!
-- ClifKussmaul - 02 Jun 2008
Yup, it's pretty horrible. I'm assuming: () is a special token representing the search pattern
() ($searchpattern) is the empty string for query searches
$count() expressions containing balanced round brackets from a format string. For example, given $s is a format, something like this (untested):
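my ($nest, $function, $output) = (0, '', '');
foreach my $bit (split(/(\$count|\(|\))/, $s)) {
    # A backslash at the end of the preceding text always escapes the next character
    my $escaped = ($nest ? $function : $output) =~ /\\$/;
    if (!$escaped && $bit eq '$count') {
        die '$count in $count' if $nest > 0;
        $nest = 1;
        $function = '$count';
    } elsif (!$escaped && $nest && $bit eq '(') {
        $nest++;
        $function .= $bit;
    } elsif (!$escaped && $nest && $bit eq ')') {
        $nest--;
        $function .= $bit;
        if ($nest == 1) {
            # The outermost bracket has closed, so the call is complete
            $output .= processFunction($function);
            ($nest, $function) = (0, '');
        }
    } elsif ($nest) {
        # Text inside the call belongs to the function, not to the output
        $function .= $bit;
    } else {
        $output .= $bit;
    }
}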
processFunction would be responsible for parsing the function call and applying it.
Obviously this "mini-parser" can be used to extract other well-formed $x() expressions in the format string. Of course it should also handle balanced unescaped single quotes, which is left as an exercise for the reader.
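For comparison, the same bracket-balancing idea can be sketched in Python. This is an illustration only: `expand_count_calls` and the `process` callback are hypothetical names, only `$count(` is recognised, a backslash escapes the following character inside the call, and quotes and nested `$count` calls are not handled.

```python
def expand_count_calls(fmt, process):
    """Replace each balanced $count(...) in fmt with process(arguments)."""
    marker = "$count("
    out = []
    i = 0
    while i < len(fmt):
        if not fmt.startswith(marker, i):
            out.append(fmt[i])
            i += 1
            continue
        depth = 1
        j = i + len(marker)
        while j < len(fmt) and depth:
            ch = fmt[j]
            if ch == "\\":
                j += 2  # skip the backslash and the character it escapes
                continue
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            j += 1
        if depth:
            raise ValueError("unbalanced brackets in $count()")
        # Pass everything between the outermost brackets to the callback
        out.append(process(fmt[i + len(marker):j - 1]))
        i = j
    return "".join(out)


result = expand_count_calls(
    r"| $count(a(b)c) | $count(\(x\)) |",
    lambda args: "<" + args + ">",
)
print(result)
```

Scanning character by character instead of splitting on tokens makes the escape handling local to the call being parsed, at the cost of a slightly longer loop.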
-- CrawfordCurrie - 03 Jun 2008
How does this proposal get processed further? We have not heard much from the proposer.
-- KennethLavrsen - 15 Sep 2008
I'm sorry I haven't gotten back to this.
Knowing that others are interested helps move it up on my priority list.
Are others happy with the semantics that Crawford & I discussed in June?
Do other issues need to be addressed?
Side note: while SearchEngineKinoSearchAddOn has some problems,
it seems to me that indexed search is almost a requirement for large sites,
and that TWiki should be moving toward indexed search by default.
Is the current feature an evolutionary dead end?
Should we be spending our time improving indexed search instead?
-- ClifKussmaul - 16 Sep 2008
No, the current search feature is not a dead end. If you look at the SearchEngineKinoSearchAddOn Search Algo work I started, you'll see that in twiki5 the 'indexed' search will be plugged into the existing feature.
-- SvenDowideit - 16 Sep 2008
scope="topic","text","all", type="keyword","word","literal","regex","query"
$count(reg-exp)
$percntCALC{\"$AND($SEARCH(reg-exp, $topic))\"}$percnt
$percntCALC{\"$AND($SEARCH(reg-exp, $formfield(name)))\"}$percnt
$searchpattern in FormattedSearch (Search.pm) type="literal","regex", return search pattern
type="keyword","word", return search pattern (TODO: ? discard negative search terms, but not searches for "-")
type="query" return nothing (TODO: ? parse query for keywords)
$searchpattern
$searchpattern (unmodified $count()) $count(.*$searchpattern.*)
$percntCALC{\"$AND($SEARCH($searchpattern, $topic))\"}$percnt
$percntCALC{\"$AND($SEARCH($searchpattern, $formfield(name)))\"}$percnt
$count(---\+[^+\n\r]*?$searchpattern[^\n\r]*?.*)
$count(---\+\+[^\n\r]*?$searchpattern[^\n\r]*?.*)
$count( \*[^ \*\n\r]*?$searchpattern[^ \*\n\r]*?\*.*)
$count(\([^ \)\n\r]*?$searchpattern[^\)\n\r]*?\).*)
$count() in FormattedSearch (Search.pm) $countcontext(prefix()suffix) to keep $count() simpler
$count(---\+[^+\n\r]*?()[^\n\r]*?.*)
$count(---\+\+[^\n\r]*?()[^\n\r]*?.*)
$count( \*[^ \*\n\r]*?()[^ \*\n\r]*?\*.*)
$count(\([^ \)\n\r]*?()[^\)\n\r]*?\).*) (broken currently due to paren matching)
$count() in FormattedSearch (Search.pm) $count(reg-exp, text) (for consistency with topic name and formfield) or $count(reg-exp) (for backwards compatibility)
$count(reg-exp, topic)
$count(reg-exp, formfield(name))
SEARCH_VAR_A_PREFIX, SEARCH_VAR_A_SUFFIX
$count( A, text)
$count(SEARCH_VAR_A_PREFIX$searchpatternSEARCH_VAR_A_SUFFIX)
$countcontext and $count
%TABLE{headerrows="2" initsort="5" initdirection="up"}%
| 10 | 5 | 3 | 1 | -1 | (Weights) | | |
| *T* | *H1* | *Hn* | *any* | *paren* | *Score* | *Web: Topic Date User* | *Summary* |
%SEARCH{ "%URLPARAM{search}%" nosearch="on" nototal="on"
type="keyword" scope="%URLPARAM{scope}%" web="%URLPARAM{web}%"
casesensitive="%URLPARAM{casesensitive}%"
format="|$percntCALC{\"$AND($SEARCH($searchpattern, $topic))\"}$percnt |\
$countcontext((---\+[^+\n\r]*?)([^\n\r]*?)) $count(---\+[^+\n\r]*?$searchpattern[^\n\r]*?.*) |\
$countcontext((---\+\+[^\n\r]*?)([^\n\r]*?)) $count(---\+\+[^\n\r]*?$searchpattern[^\n\r]*?.*) |\
$countcontext(()()) $count(.*$searchpattern.*) |\
$count(\([^ \)\n\r]*?$searchpattern[^\)\n\r]*?\).*) |\
$percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C5, R1:C1..R1:C5)}$percnt |\
[[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname $searchpattern |\
$summary |"
}%
I've tried to lay out everything in more detail, and it's making more sense, at least to me.
I now think that $searchpattern is useful,
but changing $count() to support () as a special token will be a lot of trouble for little value.
However, it might be more useful to extend $count() to search in the topic name and formfields.
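A $count() that can also look at the topic name and form fields might behave roughly like the Python sketch below. This is only an illustration of the $count(reg-exp, text), $count(reg-exp, topic) and $count(reg-exp, formfield(name)) forms listed earlier; the function signature and the dictionary layout of a topic are assumptions made for the example.

```python
import re


def count(regex, topic, where="text"):
    """Count matches of regex in the topic text, name, or a form field."""
    if where == "text":
        target = topic["text"]
    elif where == "topic":
        target = topic["name"]
    elif where.startswith("formfield(") and where.endswith(")"):
        field = where[len("formfield("):-1]
        # A missing form field is treated as empty text
        target = topic["fields"].get(field, "")
    else:
        raise ValueError("unknown target: " + where)
    return sum(1 for _ in re.finditer(regex, target))


# Hypothetical topic record used only for this example
topic = {
    "name": "TWikiGroups",
    "text": "TWiki Groups\nNew Group",
    "fields": {"Owner": "TWikiAdminGroup"},
}

print(count("Group", topic))
print(count("Group", topic, "topic"))
print(count("Group", topic, "formfield(Owner)"))
```

Defaulting the second argument to the topic text keeps the existing one-argument $count(reg-exp) form working unchanged.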
-- KennethLavrsen - 19 Sep 2008
This proposal has good potential. Anyone driving this?
-- PeterThoeny - 2009-04-20
No committed developer. I removed ClifKussmaul from the committed developer field and parked the proposal. Anyone interested in driving this can take ownership of this proposal.
-- PeterThoeny - 2009-09-28
| ChangeProposalForm | |
|---|---|
| TopicClassification | FeatureRequest |
| TopicSummary | Enhance FormattedSearch to return pattern and count in context |
| CurrentState | ParkedProposal |
| CommittedDeveloper | |
| ReasonForDecision | None |
| DateOfCommitment | 2008-05-28 |
| ConcernRaisedBy | |
| BugTracking | |
| OutstandingIssues | |
| RelatedTopics | |
| InterestedParties | |
| ProposedFor | GeorgetownRelease |
| TWikiContributors | |
| I | Attachment | History | Action | Size | Date | Who | Comment |
|---|---|---|---|---|---|---|---|
| | AdvancedSearch.txt | r1 | manage | 10.0 K | 2008-05-14 - 13:28 | UnknownUser | |
| | Search.pm.diff | r1 | manage | 3.3 K | 2008-05-10 - 15:21 | UnknownUser | diff of changes for propsed feature |
| | Search.pm.patch | r1 | manage | 2.3 K | 2008-05-27 - 08:56 | UnknownUser | Patch for TWiki 4.2.0 |