
Question

Is it possible to order the search results in order of relevance? I've spent quite a long time searching for a method, without success, so I presume not.

Obviously, "relevance" is subjective, so the most relevant result for a given search term may differ from one person to another. Much research has been conducted into such searches, so presumably an algorithm could be implemented that takes relevance into account. Google seems to do quite a good job, although its algorithm is proprietary.

When searching for a given term, a page containing that term in its title and many times within its body is likely to be more relevant to most people than a page where the term features only once in the body.

There is some discussion about changes to the order of search results at the link below, but this is all based on either the topic name, or when it was last modified.

SearchOrderAndLimitBehavour

Environment

TWiki version: TWikiRelease04x00x03
TWiki plugins: SpreadSheetPlugin, CalendarPlugin, CommentPlugin, EditTablePlugin, EmptyPlugin, InterwikiPlugin, PreferencesPlugin, ProjectPlannerPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin, TagMePlugin
Server OS:  
Web server:  
Perl version:  
Client OS: MS Windows XP Service Pack 2
Web Browser: Internet Explorer 6
Categories: Search

-- AndrewWhitefield - 13 Feb 2007

Answer


TWiki's default search does not rank the results. To get ranking, you can spider the TWiki content with a search engine; see Tag:search.

Also look into content tagging: the TagMePlugin ranks tag search results by the number of votes each tag gets.

-- PeterThoeny - 13 Feb 2007

Thanks, it looks like the "Google AJAX Search Plugin" will be best for us to use to get results in a more relevant order.

-- AndrewWhitefield - 19 Feb 2007

This is not difficult to do with the default search.

<form action="%SCRIPTURLPATH{"view"}%/%WEB%/%TOPIC%">
  Find Topics: 
  <input type="text" name="query" size="32" value="%URLPARAM{"query"}%" />
  <input type="submit" class="twikiSubmit" value="Search" />
  <input type="hidden" name="table" value="1" />
  <input type="hidden" name="sortcol" value="0" />
  <input type="hidden" name="up" value="1" />
</form>
%SEARCH{ search="%URLPARAM{"query"}%" nosearch="on"
   header="| *Count* | *Topic* | *Summary* |" 
   format="|  $count(.*?(%URLPARAM{"query"}%).*) | $topic %BR% $date %BR% $wikiname | $summary |" 
}%
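
The idea behind the table above (count how often the query occurs in each topic, then sort on that count) can be sketched outside TWiki. This is a minimal Python illustration, not TWiki code: the function names and the in-memory `topics` dictionary are hypothetical, the query is treated as a literal string, and the case-insensitive matching is an assumption.

```python
import re

def count_matches(query, text):
    """Count case-insensitive, non-overlapping occurrences of a literal query."""
    if not query:
        return 0
    return len(re.findall(re.escape(query), text, flags=re.IGNORECASE))

def rank_by_count(query, topics):
    """Return (count, name) pairs for matching topics, highest count first.

    `topics` maps a topic name to its raw text (hypothetical in-memory data).
    """
    scored = [(count_matches(query, text), name) for name, text in topics.items()]
    hits = [pair for pair in scored if pair[0] > 0]
    # Sort by count descending, then by name so ties come out in a fixed order.
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))
```

In the TWiki version the sorting step is done by TablePlugin through the hidden `table`, `sortcol` and `up` form fields rather than in code.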

-- ClifKussmaul - 08 Apr 2008

Thanks Clif, this is a creative way of doing simple ranking. You can use the TablePlugin to pre-sort the table on the first column that has the count.

-- PeterThoeny - 08 Apr 2008

If you spend too much time tweaking things, you can do more complex rankings...

Here's code that runs a query and ranks the results by weighting occurrences in the topic name, top-level headings, other headings, and the body.

<!-- 
    form that links back to this page
    - hidden fields specify table, column, & direction to sort
-->
<form action="%SCRIPTURLPATH{"view"}%/%WEB%/%TOPIC%">
  <input type="hidden" name="table" value="1" />
  <input type="hidden" name="sortcol" value="0" />
  <input type="hidden" name="up" value="1" />
  Find Topics: 
  <input type="text" name="query" size="32" value="%URLPARAM{"query"}%" />
  <input type="submit" class="twikiSubmit" value="Search" />
</form>

<!-- 
    results - use URLPARAM to extract query
    - weighted: title (10), top heading (5), other headings (3), other appearances (1)
    - (?i) specifies case-independent regex
-->
%SEARCH{ search="%URLPARAM{"query"}%" nosearch="on"
   header="| *Weight* | *T* | *H1* | *Hn* | *R* | *Topic* | *Summary* |" 
   format="|    $percntCALC{$EVAL(        10 * $IF($SEARCH((?i)%URLPARAM{"query"}%, $topic),1,0) +         5 * $T(R$ROW():C3) +         3 * $T(R$ROW():C4) +         1 * $T(R$ROW():C5) )}$percnt |    $percntCALC{$IF($SEARCH((?i)%URLPARAM{"query"}%, $topic),1,0)}$percnt |    $count(.*?(?i)---\+[^+]*?(%URLPARAM{"query"}%)[^\n\r]*.*) |    $count(.*?(?i)---\+\+.*?(%URLPARAM{"query"}%)[^\n\r]*.*) |    $count(.*?(?i)(%URLPARAM{"query"}%).*) |    $topic %BR% $date %BR% $wikiname | $summary |" 
}%

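<!-- button to toggle hidden columns - mostly for debugging -->
<input type="button" value="toggle hidden columns" onclick='toggleColumns([1,2,3,4]);'/>
<script type="text/javascript">
// Show or hide the given column indexes in every row of the results table.
// Assumes the search results are the first <table> on the page; with a skin
// that renders other tables first, this selects the wrong table.
function toggleColumns(cols) {
  var table = document.getElementsByTagName('table')[0];
  // Flip the state based on the first listed column of the header row.
  var newstyle = (table.rows[0].cells[cols[0]].style.display == 'none') ? '' : 'none';
  for (var c = 0; c < cols.length; c++) {
    for (var r = 0; r < table.rows.length; r++) {
      table.rows[r].cells[cols[c]].style.display = newstyle;
    }
  }
}
// Hide the helper columns (T, H1, Hn, R) on page load.
toggleColumns([1,2,3,4]);
</script>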
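
To make the weighting easier to follow, here is a rough Python sketch of the same arithmetic: 10 for a hit in the topic name, 5 per hit in a top-level heading, 3 per hit in a deeper heading, and 1 per hit anywhere. It is an illustration under assumptions, not a port of the SEARCH expression: `weighted_score` and `WEIGHTS` are hypothetical names, the query is treated as a literal string, and headings are recognised as lines starting with `---+`.

```python
import re

# Weights taken from the comment in the SEARCH block above.
WEIGHTS = {"title": 10, "h1": 5, "hn": 3, "any": 1}

def weighted_score(query, topic_name, text):
    """Return (score, counts) for one topic; counts has one entry per weight."""
    counts = {"title": 0, "h1": 0, "hn": 0, "any": 0}
    if not query:
        return 0, counts
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    counts["title"] = 1 if pattern.search(topic_name) else 0
    for line in text.splitlines():
        hits = len(pattern.findall(line))
        counts["any"] += hits  # every appearance, headings included
        heading = re.match(r"---(\++)", line)
        if heading:
            # A single '+' is a top-level heading; more than one is a deeper level.
            key = "h1" if len(heading.group(1)) == 1 else "hn"
            counts[key] += hits
    score = sum(WEIGHTS[key] * counts[key] for key in WEIGHTS)
    return score, counts
```

As in the TWiki version, a hit inside a heading also counts once toward the "any" column, so heading hits effectively weigh 6 and 4.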

And yes, I'm being paid by hardware vendors to produce processor-intensive code. :-)

-- ClifKussmaul - 09 Apr 2008

Clif, this is great :-)

On TWikiVMDebianStable it is actually quite fast as well...

-- CarloSchulz - 09 Apr 2008

I'm still tweaking the weighted query code to fix problems and add features - we've even used it to replace the default WebSearch. The big problem is that the child $count and $SEARCH don't handle multiple keywords well, and I'm thinking it might be better to rewrite TWiki's search code (lib/TWiki/Search.pm) rather than continue to work around it this way. Any thoughts?

-- ClifKussmaul - 11 Apr 2008

By the way, the toggle button does not work. No reaction at all...

-- CarloSchulz - 11 Apr 2008

Carlo, can you tell me anything more - e.g. what browser do you have? (The toggle button uses JavaScript)

-- ClifKussmaul - 11 Apr 2008

To expand on my earlier comment - the main problem now is with weighting multiple keywords, quoted strings, and negative keywords. To find the query string, the code uses SpreadSheetPlugin's $SEARCH and FormattedSearch's $count, which don't (can't) convert the query to a regex the way VarSEARCH does.

My hunch is that I need to either:

  1. convert the query into a regex before passing it to anything else, so they all get the same regex.
  2. create a new Plugin based on lib/TWiki/Search.pm - more work, but probably more flexible (and readable) than what I have now.
  3. extend Search.pm, perhaps by extending the $format parameter to include a $contextcount(pre,suf) variable that counts results with given prefix & suffix, or in specific contexts (like the topic title) - this seems most parsimonious. Syntax to count in H1 headers might look like: $contextcount(---\+[^+]*?)([^\n\r]*?)

I'm leaning toward the last option. Does this make sense? Comments or suggestions, anyone?
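
For comparison, the first option (turning the query into one regex up front) might look roughly like the Python sketch below. It only illustrates the idea and is not how VarSEARCH tokenizes queries: `parse_query` and `count_pattern` are hypothetical names, and the rules (double quotes group words, a leading `-` marks a negative keyword) are assumptions.

```python
import re

# One token is an optional '-' followed by a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a keyword query into (required, excluded) lists of literal terms."""
    required, excluded = [], []
    for negated, quoted, bare in TOKEN.findall(query):
        term = quoted or bare
        if not term:
            continue  # skip empty quotes
        (excluded if negated else required).append(term)
    return required, excluded

def count_pattern(query):
    """Build one case-insensitive regex matching any required term, or None."""
    required, _ = parse_query(query)
    if not required:
        return None
    # Longest terms first, so a short term cannot shadow a longer one.
    terms = sorted(required, key=len, reverse=True)
    return re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
```

Every counting step could then share the regex returned by `count_pattern`, while the excluded terms would only be used to filter topics out.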

-- ClifKussmaul - 18 Apr 2008

I've added two variables to $format in Search.pm. $searchpattern returns the search pattern from Search.pm (by join-ing the search tokens) and $countcontext returns the number of times the search pattern appears with the given prefix and suffix. For example (each column shows expressions with and without the new variables):

   format="|$percntCALC{\"$IF($SEARCH($searchpattern, $topic),1,0)\"}$percnt              $percntCALC{\"$IF($SEARCH((?i), $topic),1,0)\"}$percnt |             $countcontext(---\+[^+]*?)([^\n\r]*?) $count(.*?(?i)---\+[^+]*?()[^\n\r]*?.*) |             $countcontext(---\+\+.*?)([^\n\r]*?)  $count(.*?(?i)---\+\+.*?()[^\n\r]*?.*) |             $countcontext()()                     $count(.*?(?i)().*) |             $percntCALC{$SUMPRODUCT(R$ROW():C1..R$ROW():C4, R1:C1..R1:C4)}$percnt |             [[$web.WebHome][$web]]: $web.$topic %BR% $date %BR% %USERSWEB%.$wikiname |             $summary |" 

Do these seem like reasonable changes to Search.pm?
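
Read this way, $countcontext counts matches of prefix + search pattern + suffix. The Python sketch below shows that reading only; it is not the code from the Search.pm patch, `count_context` is a hypothetical name, and the case-insensitive matching is an assumption.

```python
import re

def count_context(text, search_pattern, prefix="", suffix=""):
    """Count matches of search_pattern wrapped by the prefix and suffix regexes."""
    if not search_pattern:
        return 0
    regex = re.compile(prefix + "(?:" + search_pattern + ")" + suffix, re.IGNORECASE)
    return sum(1 for _ in regex.finditer(text))

sample = "---+ Search basics\n---++ More on search\nA search hint, search again.\n"

# All appearances, like $countcontext()().
total = count_context(sample, "search")
# Top-level headings. The prefix here also excludes newlines, which is a
# tightening for this sketch and not the original pattern.
h1 = count_context(sample, "search", r"---\+[^+\n\r]*?", r"[^\n\r]*?")
# Deeper headings, using the prefix and suffix from the post above.
hn = count_context(sample, "search", r"---\+\+.*?", r"[^\n\r]*?")
```

Because each heading prefix has to match again before every hit, a heading line contributes at most one match in this sketch, however often the term occurs in it.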

-- ClifKussmaul - 19 Apr 2008

Looks like useful enhancements. Please file a feature request following the TWikiReleaseManagementProcess. On syntax, I suggest $countcontext((pattern1)(pattern2)) to (1) retain the current convention, (2) make it safer to parse.

-- PeterThoeny - 19 Apr 2008

I've changed the syntax - thanks for the feedback. I will definitely file a feature request, once I resolve a few more issues.

-- ClifKussmaul - 21 Apr 2008

I'd like to try your latest solution but I'm not sure what to copy/change...

Regarding the toggle button: I'm using the latest Firefox with JavaScript enabled.

-- CarloSchulz - 25 Apr 2008

Feature request filed: FormattedSearchPatternAndCountContext

-- ClifKussmaul - 10 May 2008

Cool, does that patch work with TWiki 4.1 and 4.2?

-- MartinSeibert - 12 May 2008

I'm using it with 4.1.2. I haven't installed 4.2 but I guess I should...

-- ClifKussmaul - 13 May 2008

I've applied the diff file from FormattedSearchPatternAndCountContext to Search.pm. What next? How do I get the feature to sort (or get sorted) search results by relevance? Which code do I have to add/change and in which file?

-- AlexanderSeith - 14 May 2008

Alexander, once you've updated Search.pm, you will need to add a new search page or modify the existing search page(s). FormattedSearchPatternAndCountContext includes sample search code and output, and I've just attached a sample AdvancedSearch.txt to it. Does this make sense?

-- ClifKussmaul - 14 May 2008

Perfect, exactly what I needed. Thanks a lot!

-- AlexanderSeith - 15 May 2008

Yippieh! Alexander (see above) implemented the patch in our wiki-implementation. That's a huge improvement. Long live TWiki and its active community! :-)

-- MartinSeibert - 15 May 2008

Is it possible to limit the results to 30 by default? It would also be good if the user could extend the list, when needed, by changing the number of results.

-- MartinSeibert - 16 May 2008

I have requested that this be implemented in the standard distribution: ImplementSortingByRelevanceInStandard

-- MartinSeibert - 16 May 2008

Martin, VarSEARCH's limit parameter should work as normal.

I've considered expanding my AdvancedSearch page to include more options, as in WebSearchAdvanced (and maybe hiding those options with JavaScript, by default).

I guess we could also define preference variables for limit, etc.

-- ClifKussmaul - 16 May 2008

I tried to apply the patch to a Search.pm from TWiki Version 4.2.0 and it failed. It looks like the code from Search.pm is a little bit different from version 4.1.2.

Clif: I'd be deeply in your debt if you could take a look at this. I wish I could make the changes on my own, but I have almost zero experience with Perl. Thanks!

-- AlexanderSeith - 20 May 2008

Hi,

I managed to patch the Search.pm for TWiki 4.2.0. I attached the modified patch to FormattedSearchPatternAndCountContext. However, it seems that the red highlighting for the search criterion doesn't work anymore.

-- AlexanderSeith - 27 May 2008

Sorry I've been dormant for a while. Recently, I've been thinking about other factors that might contribute to "most relevant", in some contexts:

  • document size - is bigger better?
  • document age - is newer better, or is older more reliable?
  • # of occurrences of search pattern in first N characters of the topic
  • # of chars from start of topic to first occurrence of search pattern

Other ideas? I'm trying to make it easier to extract such data so that people can more easily implement searches that make sense for their webs.
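
As a rough illustration of how such data might be pulled out of one topic, here is a Python sketch that computes the four factors listed above. All names, the 500-character default for N, and the use of plain `datetime` values are assumptions; how to combine the factors into a single rank is left open, as in the list.

```python
import re
from datetime import datetime

def relevance_features(query, text, modified, now, first_n=500):
    """Collect size, age and position data for one topic.

    `modified` and `now` are datetime objects; `first_n` is the N in
    "first N characters" (500 is an arbitrary default).
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    first = pattern.search(text) if query else None
    return {
        "size_chars": len(text),
        "age_days": (now - modified).days,
        "hits_in_first_n": len(pattern.findall(text[:first_n])) if query else 0,
        # None means the pattern does not occur in the topic at all.
        "first_offset": first.start() if first else None,
    }
```

A search page could then weight, say, a small `first_offset` or a low `age_days` however makes sense for a particular web.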

-- ClifKussmaul - 20 Sep 2008

Newer is definitely better.

-- MartinSeibert - 21 Sep 2008

I would like to know how I could extend Clif's great work to work with KinoSearch. I set {RCS}{SearchAlgorithm} to TWiki::Store::SearchAlgorithms::Kino, but indexed documents are not weighted. In my opinion they should get at least the "any" rating, and maybe more when the search term is in the document name.

-- JanDreyer - 18 Nov 2008

Topic attachments
SearchByRelevance.txt (r1, 9.7 K, 2008-05-15 08:47, UnknownUser): Topic for search by relevance, slightly modified and translated into German
Topic revision: r31 - 2008-11-18 - JanDreyer
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.