Implemented: Keyword Search with Implicit AND

  • Add a new "keyword" search type in addition to the existing "literal" search and "regex" search. Users expect keyword search: TWiki's search should behave like Google and other modern search engines
  • A keyword search string is split up into words and literal text:
    • Literal text is enclosed in double quotes, like "web service"
    • Words and literal text are delimited by space
    • An AND search is performed for the list of words and literal text
    • A minus sign preceding a word or literal text indicates an AND NOT search; use it to exclude words or literal text, like -"web service"
    • A plus sign preceding a word or literal text is ignored; it implies an AND search
    • If you want to search for a minus sign or plus sign, followed by a word, enclose them in double quotes, like "-shampoo"
    • Example: soap +wsdl "web service" -shampoo searches for topics that have the word "soap", "wsdl", the literal "web service", but not "shampoo"
  • A RegularExpression search string is split up into patterns:
    • Patterns are delimited by a semicolon
    • An AND search is performed for the list of patterns
    • An exclamation point preceding a pattern indicates an AND NOT search; use it to exclude a pattern, like !web service
    • If you want to search for a semicolon or an exclamation point, escape them with a leading backslash, like \!shampoo
    • Example: soap;wsdl;web service;!shampoo searches for topics that have the word "soap", "wsdl", the literal "web service", but not "shampoo"
  • Select the type of search with a new type="" parameter that can be set to "keyword", "literal" or "regex"
  • If not specified, default settings in the TWikiPreferences are used:
    • Default for search form (which calls the search script):
      • Set SEARCHDEFAULTTTYPE = keyword
    • Default for %SEARCH{} variable:
      • Set SEARCHVARDEFAULTTYPE = literal
  • Stop words:
    • Stop words are common words and characters such as "how" and "where" that are excluded from a keyword search
    • Prefix a keyword with a plus sign if you want to search for a word in the stop word list
    • Stop words are defined in the SEARCHSTOPWORDS setting in the TWikiPreferences like this:
      • Set SEARCHSTOPWORDS = a, an, how, or, the, where
  • The regex="on" parameter is deprecated and no longer documented, but remains implemented for compatibility
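
The keyword rules above can be sketched in a few lines. This is an illustrative Python sketch, not the TWiki::Search implementation (which is Perl). The function names, the case-insensitive substring matching, and the choice to apply the stop word list only to bare words without a sign (so quoted literals and minus-prefixed words are never dropped) are assumptions.

```python
import re

# Stop words copied from the SEARCHSTOPWORDS example above.
STOP_WORDS = {"a", "an", "how", "or", "the", "where"}

# One token: optional +/- prefix, then a "quoted literal" or a bare word.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_keyword_query(query, stop_words=STOP_WORDS):
    """Return (required, excluded) term lists for a keyword search string."""
    required, excluded = [], []
    for sign, literal, word in TOKEN_RE.findall(query):
        term = literal or word
        if not term:
            continue  # an empty pair of quotes carries no term
        if sign == "-":
            excluded.append(term)  # AND NOT
        elif sign == "" and word and word.lower() in stop_words:
            continue  # bare stop word; a leading plus would have kept it
        else:
            required.append(term)  # AND (a leading plus is simply dropped)
    return required, excluded


def topic_matches(text, required, excluded):
    """AND over required terms, AND NOT over excluded terms (assumed case-insensitive)."""
    haystack = text.lower()
    return (all(term.lower() in haystack for term in required)
            and not any(term.lower() in haystack for term in excluded))
```

With these assumptions, `parse_keyword_query('soap +wsdl "web service" -shampoo')` yields the required terms `soap`, `wsdl` and `web service`, and the excluded term `shampoo`. A quoted `"-shampoo"` is treated as a required literal that starts with a minus sign.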

Contributors:
-- ArthurClemens - 05 Nov 2003
-- PeterThoeny - 10 Dec 2003
-- SamHasler - 05 Nov 2003

Discussions

For a RegularExpression search, a SearchWithAnd is already possible. The simple search is currently a literal search, sometimes also called a phrase search, where the string is searched for literally. This has historical reasons: it was a straightforward implementation based on the grep tool.

There are some good arguments below to change the simple search to be aligned with search engines, e.g. a

soap wsdl "web service"

search should search for "soap" AND "wsdl" AND the literal "web service" text. See the Google:soap+wsdl+%22web+service%22 result of this example.

The implementation should be simple; it has already been done for reg-ex searches in TWiki::Search.

Question: Can we make a spec change without a new switch, or do we need to consider backward compatibility? I am considering changing the spec without an additional switch, since embedded searches tend to use regular expression search.

-- PeterThoeny - 12 Feb 2003

The second "usability bug" is stated in a way that is strange to me -- to me, the difference in behavior is not the difference between a full text search and a literal search, but rather the difference between the search for a phrase and the search for all of the words in the phrase, in any order.

To clarify that, IIUC, you want "why diverge" to find any page with the word "why" and the word "diverge" anywhere on the page. (This can be done, BTW, with the fairly new AND search, using ";" as the connector under advanced search and checking "Regular Expression".) Instead, "why diverge" currently tries to find pages with the phrase "why diverge" on them.

Older search engines that I used to use (I forget which, maybe zyIndex, maybe askSam) did do a phrase search. Modern search engines (Google) do what you expect.

(I think all the searches that search the body text are "full text searches" in the sense that I would use -- they search all the text in the body.)

Aside: I don't know whether PeterThoeny's comment envisions treating both of these as RFEs or not, but I guess there is nothing to say that nobody can implement them as RFEs either.

-- RandyKramer - 08 Feb 2003

Randy, thanks for your hint, I used the "full text search" in the wrong context. I revised the above proposal accordingly. I hope it's now more comprehensible.

I tried the new AND search using ";" and it's a nice improvement. But I think that's not enough in terms of usability.

IMHO, simple search and the search boxes offered on different pages as conveniences work in a way that will frustrate most users, as they don't get many hits using them. Simple Search just does not work the way they expect it to work. It's an apparent fact that people are used to the "search for all words" style of queries. So why does TWiki walk a different path? Why ignore what's obvious? Why confront users with the ";" syntax? Users should be able to state their queries using a "syntax" they are familiar with. Is there any reason why the current specs are not changed to accommodate those needs?

I think, now that the AND search has been implemented it should be only a small step to get this to work in simple search as well (please without the ";"). Doing so will not raise any compatibility issues as the AND search will find the same pages as the current search and more.

Would you please explain what is the advantage of simple search doing a phrase search. Maybe it's a mere historical feature?

It's a pity Peter changed this to "FER". Mind you, my first proposal to offer a search for all words was on 22 Feb 2001! After doing some reading I figured out that Peter had also suggested a search-engine-like query syntax (see CategorySearchForm). The discussion led to the implementation of the ";" AND search. Richard Donkin recommended using this core functionality to do an all-words search while hiding its syntax from the user (jim fred should turn into the /jim;fred/ RE, doing an AND search).
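
The rewrite of jim fred into jim;fred mentioned above could look roughly like this. It is a hedged Python sketch, not TWiki code: the function name is hypothetical, and it assumes that ";" and "!" inside a word must be backslash-escaped and that all other characters should be matched literally. `re.escape` follows Python's regex rules, which may differ in detail from the grep dialect TWiki uses.

```python
import re


def keywords_to_and_pattern(query):
    """Turn space-separated words into a ';'-joined AND pattern."""
    def escape_word(word):
        # Backslash-escape the two characters the ';' syntax treats specially;
        # every other character goes through re.escape so it matches literally.
        return "".join("\\" + ch if ch in ";!" else re.escape(ch) for ch in word)

    # str.split() with no argument collapses runs of whitespace.
    return ";".join(escape_word(word) for word in query.split())
```

The user keeps typing plain words, and only the generated pattern contains the ";" connector.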

-- DanielKabs - 10 Feb 2003

I agree 100% with Daniel. The most popular search engine, Google, uses space as AND, so most people are trained to use space. Why invent and try to force a different standard?

I slightly reworded RuleOne. Rule #1 of web usability now says: People spend most of their time on other websites. Do it (and call it) as other sites do. Thank you, Daniel, for not accepting an explanation of TWiki's quirky way of doing it, and for asking to do it the right way.

I can see why TWiki developers understand this issue differently: if they use the TWiki engine on an intranet, they spend a substantial amount of time on TWiki and get used to its quirks. But for new and casual users these are quirks, not virtues, and should be handled as such: get rid of them and use web standards.

-- PeterMasiar - 11 Feb 2003

I'm discussing a way to get to a more search engine like feature set in SearchEnhancements.

This topic is dealing with the same issues as SearchSuggestion ( searching topic and/or text ) and SearchIsBroken ( implicit AND / Space as AND).

-- SamHasler - 11 Feb 2003

Ignoring quotes for the moment, would the following work?

***************
*** 233,239 ****
  
      } else {
          $tempVal = $TWiki::fgrepCmd;
!         @tokens = $theSearchVal;
      }
      $cmd =~ s/%GREP%/$tempVal/go;
  
--- 236,244 ----
  
      } else {
          $tempVal = $TWiki::fgrepCmd;
!       # @tokens = $theSearchVal;
!       # Codev.SearchWithImplicitAnd
!       @tokens = split( ' ', $theSearchVal );
      }
      $cmd =~ s/%GREP%/$tempVal/go;
  

There were some other references to $theSearchVal in Search.pl but they didn't appear to be relevant.

-- SamHasler - 12 Feb 2003

Also see the GoIsSearch discussion, which sparked the PhotonSearch script.

To answer the question "why aren't they combined", I'll hazard a guess: the topic name search is just a filesystem dir search (which is why it is case insensitive on NT) while the full text search uses egrep.

-- MattWilkie - 06 Jun 2003

I'm tempted to remove the Go box because I don't think it's being used, and it gets confusing having it and the search box. I tend to use the address bar anyway, it has a bigger history and I can add stuff like ?raw=on or ?rev=1.34 which you can't do with the go box.

-- SamHasler - 06 Jun 2003

... but this is something you have to know, in other words: not for the average user. I would also prefer GoIsSearch.

Refactored:
Combining search in WebSearch in both topic text and title, see patch in SearchSuggestion.

... and Sam's "patch" on this page is still awaiting feedback...

-- ArthurClemens - 16 Aug 2003

It appears that PhotonSearch does alternatively search in topics and topic texts, and does this with implicit AND. Seems almost perfect! So why is PhotonSearch still not integrated into TWiki?

-- ArthurClemens - 02 Sep 2003

Lack of time to review pending contributions in detail. This will change with the expanded CoreTeam and AppealToCodevCommunityByCoreTeam.

-- PeterThoeny - 03 Sep 2003

Moved related discussion from SearchDoesNotWorkAsExpected to here.

-- PeterThoeny - 11 Oct 2003

I would like to follow up on this.

At the same time we should implement a search with inverted match, e.g. a search that shows topics that do not contain the search string.

  • Proposed syntax for normal search:
    • soap wsdl "web service" -shampoo
  • Proposed syntax for regex search:
    • soap;wsdl;web service;!shampoo
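
A minimal Python sketch of how the proposed regex syntax might be taken apart, for illustration only (TWiki itself is Perl). The function names are hypothetical. It assumes a backslash in front of ";" or "!" marks a literal character, and it does not try to handle an escaped backslash directly before a separator.

```python
import re


def split_regex_query(query):
    """Return (include_patterns, exclude_patterns) for a ';'-delimited query."""
    include, exclude = [], []
    # Split on semicolons that are not preceded by a backslash.
    for part in re.split(r"(?<!\\);", query):
        negate = part.startswith("!")
        if negate:
            part = part[1:]
        # Drop the escaping backslash in front of ';' and '!'.
        part = re.sub(r"\\([;!])", r"\1", part)
        if part:
            (exclude if negate else include).append(part)
    return include, exclude


def topic_matches(text, include, exclude):
    """Every include pattern must match; no exclude pattern may match."""
    return (all(re.search(p, text) for p in include)
            and not any(re.search(p, text) for p in exclude))
```

For soap;wsdl;web service;!shampoo this gives three patterns to require and one pattern, shampoo, to exclude.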

Anyone interested in implementing this?

Question: Do we need to be compatible with existing content? The search string can be in an embedded FormattedSearch which could break existing content or TWikiApplication. I assert that most applications use a regular expression search, so the chance is small that this spec change breaks existing content.

-- PeterThoeny - 02 Nov 2003

Please be careful with this. Until we have a repeatable and manageable process to track, resolve and inform the user of problems, I feel very uncomfortable with breaking existing content.

-- SvenDowideit - 03 Nov 2003

I personally cannot think of any custom searches I have which would be broken by the proposed change. I'll dig around to make sure. I agree it will be nice to have tests to automate this part. : )

-- MattWilkie - 04 Nov 2003

Don't forget to include an escape mechanism or to document it if current features provide one, so that you can search for <hyphen>something and similarly for exclamation marks.

-- SamHasler - 05 Nov 2003

It looks like we need a new search type. I updated the proposed spec accordingly. Is "keyword" the right word for the Google style search syntax? Any other feedback?

-- PeterThoeny - 05 Nov 2003

"Google search syntax" is perfect (see hits). Do not use "keyword" in this context, because we are talking about full text search - in contrast to (assigned) keyword searching, where people manually assign keywords to documents/pages.

A nice addition would be to have a stop word list, to decrease the number of irrelevant links. Should be language dependent of course.

In the syntax:

  • soap wsdl "web service" -shampoo +xml
... users can use the plus sign, but they should get the same results as in
  • soap wsdl "web service" -shampoo xml

And another nice addition would be the keyword NEAR, to get results where two search terms both occur in a 'word window'. But let us first have the simple Google syntax!

-- ArthurClemens - 05 Nov 2003

Google help at http://www.google.com/help/basics.html talks about keywords, but agreed, this could be confused for the meta keywords in an HTML doc.

So what is a good value for the type parameter? type="Google search syntax" is too verbose. How about type="default" or type="standard" or type="word" or type="token" or ...?

I added the stop word list, the "+" syntax, and the escape for plus and minus signs to the proposed syntax.

-- PeterThoeny - 06 Nov 2003

type="google" ?

-- Unknown

My opinion is that the search should default to "AND" mode. You don't need a check box for exact phrase mode, as this is provided already by enclosing the terms with quotes. The embedded searches usually use regular expressions but are certainly limited in the number of operators available, but I don't see this as a big problem. The main issue is for users who are new to TWiki understanding how to use the searches.

For example:

"This is an exact phrase search"

This is a search looking for pages including all the words

-- RaymondLutz - 06 Nov 2003

I changed the Google type search to "keyword" since it is the standard term in the industry.

Default for search:

  • Default for the search form (which in turn calls the search script) should be "keyword" (defined by the SEARCHSCRIPTTYPE setting)
  • Default for the %SEARCH{}% variable should remain as "literal" (defined by the SEARCHVARIABLETYPE setting). This is for compatibility with existing apps.
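
The fallback order described here can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual implementation: the function name, the dictionaries standing in for request parameters and preferences, and the precedence of an explicit type over the legacy regex="on" parameter are all hypothetical. Only the setting names and default values come from the text above.

```python
VALID_TYPES = {"keyword", "literal", "regex"}

# Preference setting and built-in default per caller, as listed above.
SETTING_NAME = {"script": "SEARCHSCRIPTTYPE", "variable": "SEARCHVARIABLETYPE"}
BUILTIN_DEFAULT = {"script": "keyword", "variable": "literal"}


def resolve_search_type(params, preferences, caller):
    """Pick the search type for a search form ("script") or %SEARCH{}% ("variable")."""
    requested = params.get("type", "")
    if requested in VALID_TYPES:
        return requested  # an explicit type="" always wins (assumed)
    if params.get("regex") == "on":
        return "regex"  # legacy parameter kept for compatibility
    configured = preferences.get(SETTING_NAME[caller], "")
    if configured in VALID_TYPES:
        return configured  # site or web preference
    return BUILTIN_DEFAULT[caller]
```

Keeping "literal" as the default for the variable means an existing embedded search without a type parameter behaves as before.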

I updated the proposed spec on top.

We can go ahead with the implementation.

-- PeterThoeny - 10 Dec 2003

I like the changes and support the proposed specification, although the "literal" checkbox is superfluous as you can obtain that by enclosing in quotes. Yet, it doesn't hurt too much to have this redundant capability. This is a very important improvement. Thanks.

(May I also suggest that, while you are mucking with this code, the parameters be split up in the Search.pm module to further correct the partitioning? As it is, you have to change TWiki.pm to add a parameter, and this should not be necessary...)

-- RaymondLutz - 10 Dec 2003

The redundant "literal" type is needed for backward compatibility.

I am currently working on the implementation.

Does anyone have a good list of SEARCHSTOPWORDS?

-- PeterThoeny - 03 Jan 2004

Search Engine World has this list, but it seems a bit too large (why exclude 'microsoft', for instance?).

Overview of other lists, such as Google's stop word list.

Stop word lists are generally context dependent, and can be computed: words that occur n times (where n is very big) can be incorporated in the stop list. For general purposes, the Google list seems to be a good start.

So to aid the generation of stop word lists, it would be handy to have a tool that scans all topics and computes a list of words with their occurrences.

which in turn would be very handy for auto-generating crude "what's related" charts -- MattWilkie - 06 Jan 2004
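
A rough Python sketch of such a word-counting tool, for illustration only. The function name, the tokenizing regex, and the idea of passing topic texts as a list of strings are assumptions; a real tool would read the topic files from disk and would need a sensible threshold for the site in question.

```python
import re
from collections import Counter


def stop_word_candidates(topic_texts, min_count):
    """Return words occurring at least min_count times across all topics."""
    counts = Counter()
    for text in topic_texts:
        # Lower-case, then count runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(word for word, n in counts.items() if n >= min_count)
```

The result is only a list of candidates; someone still has to decide which of the frequent words are noise and which are data on that particular site.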

-- ArthurClemens - 04 Jan 2004

Thanks Arthur, added to TWikiPreferences.

-- PeterThoeny - 04 Jan 2004

I believe that we did not add this to the specs: using '+' should overrule the stop word list.

-- ArthurClemens - 04 Jan 2004

Implementation without much testing is done; the enhancement is in TWikiAlphaRelease and at TWiki.org. Docs and testing are pending.

Could someone help out testing it against the above spec? Use WebSearch for search script testing, and FormattedSearchFormTesting for %SEARCH{}% variable testing (in the search string, remember to escape double quotes with a backslash).

-- PeterThoeny - 04 Jan 2004

First observations:

  1. "Scope: topic text" should be "topic title".
  2. Using keyword search I cannot search for contributor names; with literal search I can. I believe that TopicContributor in Main.TopicContributor should be searchable too with keyword search.
  3. If a WikiWord is included in the search, the input field gets pretty messed up (for instance search on: soap CoreTeam)
  4. Is there a way to know the number of search results before displaying them? The number of topics is displayed at the bottom, where it would be more useful at the top.
    • That could be done; best to start a new enhancement request with proposed spec. -- PTh - 04 Jan 2004
  5. I don't get any results when searching for compound words, like "users expect" (keyword or literal)
    • The FormattedSearchFormTesting is a hack for testing, it does not handle double quotes correctly. As mentioned above, prefix double quotes with backslash, e.g., \"users expect\". This does run the search correctly but does not refill the search string in the search field -- PTh - 04 Jan 2004

-- ArthurClemens - 04 Jan 2004

Thanks Arthur, comments added above.

-- PeterThoeny - 04 Jan 2004, -- ArthurClemens - 05 Jan 2004

I've done some more testing, and all seems to be working fine.

A UI enhancement would be to omit stopwords from the line "Search: " when the user enters one. For instance, if the user enters 'search' and 'with':

Search: people
'with' is a very common word and was not included in your search (details)

I copied this from Google. Note that with multiple stopwords they use:
The following words are very common and were not included in your search: with by. (details)
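
A small Python sketch of how the two notices could be produced, for illustration only. The function names are hypothetical, the wording is taken from the two messages quoted above, and the "(details)" link is left out.

```python
def split_stop_words(words, stop_words):
    """Separate the words to search for from the ignored stop words."""
    kept = [w for w in words if w.lower() not in stop_words]
    ignored = [w for w in words if w.lower() in stop_words]
    return kept, ignored


def stop_word_notice(ignored):
    """Build the notice shown under the 'Search: ' line, or '' if nothing was ignored."""
    if not ignored:
        return ""
    if len(ignored) == 1:
        return "'%s' is a very common word and was not included in your search" % ignored[0]
    return ("The following words are very common and were not included in your search: %s."
            % " ".join(ignored))
```

For the words people and with, and a stop list containing with, the "Search: " line would show only people and the single-word notice would be used.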

-- ArthurClemens - 14 Jan 2004

> Does anyone have a good list of SEARCHSTOPWORDS?

Yes -> []

This might seem flippant, but if one thing rubbed off from the search people whilst I was at Inktomi, it was that getting these right depends on the local information domain and that search relevance is difficult*. (Unless you want to start using other people's patented methods for relevance...)

Stopword lists are often generated by any number of techniques, including creating an inverted cross reference index of the contents of text and lopping off words that occur above a certain number of times. Getting this to work right is difficult* and if this is implemented in TWiki I would suggest that such functionality is moved out to a plugin (and indeed making search replaceable into a plugin).

If you think this isn't a problem, try this search phrase - to be or not to be

There's more to it than "just" slapping on a stopword list. One man's stop list is another man's data.

-- MS - 14 Jan 2004

I agree on the inherent problematics of stoplists. At the same time I still find they are useful (but mainly in OR searches) - for AND searches they mean a performance improvement.

Two ways to meet in the middle:

  1. Allow '+' to force inclusion of stopwords in the search: +to +be +or +not. This works with Google: see previous search but with plusses.
  2. Phrases should ignore stopwords: "to be or not to be". This works with Google too: see previous search but now in quotes.

-- ArthurClemens - 14 Jan 2004

On stopwords, the current implementation works like Arthur describes.

-- PeterThoeny - 15 Jan 2004

TWikiVariables, SearchHelp, docs are updated. WebSearch is now an embedded SEARCH with URL parameters; WebSearchAdvanced is pending.

-- PeterThoeny - 18 Jan 2004

Small bug: when I enter the words ICON CALC in the search field, on the results screen this gets changed to ICON TWiki.CALC.

-- ArthurClemens - 14 Feb 2004

This is now fixed, see IncludeFromOtherWebLinksACRONYMS.

-- PeterThoeny - 25 Apr 2004

Topic revision: r44 - 2004-05-20 - PeterThoeny
 
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.