Implemented: Keyword Search with Implicit AND
- Add a new "keyword" search type alongside the existing "literal" and "regex" searches. Users expect keyword search: TWiki's search should behave like Google and other modern search engines
- A keyword search string is split up into words and literal text:
- Literal text is enclosed in double quotes, like
"web service"
- Words and literal text are delimited by space
- An AND search is performed for the list of words and literal text
- A minus sign preceding a word or literal text indicates an AND NOT search; use it to exclude words or literal text, like
-"web service"
- A plus sign preceding a word or literal text is ignored; it implies an AND search
- If you want to search for a minus sign or plus sign, followed by a word, enclose them in double quotes, like
"-shampoo"
- Example:
soap +wsdl "web service" -shampoo
searches for topics that have the word "soap", "wsdl", the literal "web service", but not "shampoo"
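The keyword rules above can be sketched as a small parser. This is a minimal illustration in Python, not TWiki's actual Perl implementation; the function name `parse_keyword_query` and the regular expression are assumptions made for this example, and stop word handling is left out.

```python
import re

# Hypothetical helper, not part of TWiki: one token is an optional sign
# followed by either a double-quoted literal or a run of non-space characters.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_keyword_query(query):
    """Split a keyword search string into (include, exclude) term lists."""
    include, exclude = [], []
    for sign, quoted, word in TOKEN_RE.findall(query):
        term = quoted if quoted else word
        if not term:
            continue  # skip empty literals such as ""
        if sign == "-":
            exclude.append(term)  # AND NOT
        else:
            include.append(term)  # a leading "+" only implies AND
    return include, exclude

include, exclude = parse_keyword_query('soap +wsdl "web service" -shampoo')
print(include, exclude)
```

Because the sign is only recognised outside the quotes, a query such as `"-shampoo"` yields the literal term `-shampoo` rather than an exclusion.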
- A RegularExpression search string is split up into patterns:
- Patterns are delimited by a semicolon
- An AND search is performed for the list of patterns
- An exclamation point preceding a pattern indicates an AND NOT search; use it to exclude a pattern, like
!web service
- If you want to search for a semicolon or an exclamation point, escape them with a leading backslash, like
\!shampoo
- Example:
soap;wsdl;web service;!shampoo
searches for topics that have the word "soap", "wsdl", the literal "web service", but not "shampoo"
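The pattern rules above can be sketched in the same way. Again this is an illustrative Python sketch rather than the real TWiki code; `parse_regex_query` is a hypothetical name, and the lookbehind is a simplification that does not handle a literal backslash placed directly before a semicolon.

```python
import re

def parse_regex_query(query):
    """Split a regex search string into (include, exclude) pattern lists."""
    include, exclude = [], []
    # Split on semicolons that are not escaped with a leading backslash.
    for pattern in re.split(r'(?<!\\);', query):
        if not pattern:
            continue
        negate = pattern.startswith("!")
        if negate:
            pattern = pattern[1:]
        # Turn the escaped forms \; and \! back into plain characters.
        pattern = re.sub(r'\\([;!])', r'\1', pattern)
        (exclude if negate else include).append(pattern)
    return include, exclude

print(parse_regex_query('soap;wsdl;web service;!shampoo'))
```

With this sketch, `\!shampoo` is treated as a pattern to include that starts with a literal exclamation point.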
- Select the type of search with a new type="" parameter that can be set to "keyword", "literal" or "regex"
- If not specified, default settings in the TWikiPreferences are used:
- Default for search form (which calls the search script):
- Set SEARCHDEFAULTTTYPE = keyword
- Default for %SEARCH{}% variable:
- Set SEARCHVARDEFAULTTYPE = literal
- Stop words:
- Stop words are common words and characters such as "how" and "where" that are excluded from a keyword search
- Prefix a keyword with a plus sign if you want to search for a word in the stop word list
- Stop words are defined in the SEARCHSTOPWORDS setting in the TWikiPreferences like this:
- Set SEARCHSTOPWORDS = a, an, how, or, the, where
- The regex="on" parameter is deprecated and no longer documented, but remains implemented for compatibility
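The stop word rules in the spec above could be applied to the raw keyword tokens roughly as follows. This is a hedged Python sketch, not the shipped implementation: the helper names are hypothetical, and case-insensitive matching against the stop list is an assumption the spec does not state.

```python
def parse_stop_words(setting):
    """Turn a SEARCHSTOPWORDS-style value ("a, an, how, ...") into a set."""
    return {word.strip().lower() for word in setting.split(",") if word.strip()}

def drop_stop_words(tokens, stop_words):
    """Remove stop words from keyword tokens; a leading '+' forces inclusion."""
    kept = []
    for token in tokens:
        if token.startswith("+"):
            kept.append(token[1:])      # plus sign overrides the stop list
        elif token.lower() in stop_words:
            continue                    # common word, excluded from the search
        else:
            kept.append(token)
    return kept

stop_words = parse_stop_words("a, an, how, or, the, where")
print(drop_stop_words(["how", "+where", "soap"], stop_words))
```

Tokens with a leading minus sign pass through unchanged here, since they never match an entry in the stop list.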
Contributors:
--
ArthurClemens - 05 Nov 2003
--
PeterThoeny - 10 Dec 2003
--
SamHasler - 05 Nov 2003
Discussions
For a
RegularExpression search, a
SearchWithAnd is already possible. The simple search is currently a literal search, sometimes also called a phrase search, where the string is searched literally. This is for historical reasons: it was a straightforward implementation based on the grep tool.
There are some good arguments below to change the simple search to be aligned with search engines, e.g. a
soap wsdl "web service"
search should search for
"soap"
AND
"wsdl"
AND the literal
"web service"
text. See the
Google:soap+wsdl+%22web+service%22 result of this example.
The implementation should be simple; it has already been done for regex searches in
TWiki::Search
.
Question: Can we make a spec change without a new switch, or do we need to consider backward compatibility? I am considering changing the spec without an additional switch, since embedded searches tend to use regular expression search.
--
PeterThoeny - 12 Feb 2003
The second "usability bug" is stated in a way that is strange to me -- to me, the difference in behavior is not the difference between a full text search and a literal search, but rather the difference between the search for a phrase and the search for all of the words in the phrase, in any order.
To clarify that, IIUC, you want "why diverge" to find any page with the word "why" and the word "diverge" anywhere on the page. (This can be done, BTW, with the fairly new AND search, using ";" as the connector under advanced search and checking "Regular Expression".) Instead, "why diverge" tries to find pages with the phrase "why diverge" on it.
Older search engines that I used to use (forget which, maybe zyIndex, maybe askSam) did do a phrase search. Modern search engines (Google) do what you expect.
(I think all the searches that search the body text are "full text searches" in the sense that I would use -- they search all the text in the body.)
Aside: I don't know whether PeterThoeny's comment envisions treating both of these as RFEs or not, but I guess there is nothing to say that nobody can implement them as RFEs either.
--
RandyKramer - 08 Feb 2003
Randy, thanks for your hint, I used the "full text search" in the wrong context. I revised the above proposal accordingly. I hope it's now more comprehensible.
I tried the new
AND search using ";" and it's a nice improvement. But I think that's not enough in terms of
usability.
IMHO,
simple search and the search boxes offered on different pages as conveniences work in a way that will frustrate most
users as they don't get many hits using it.
Simple Search just does not work the way they expect it to work. It's an
apparent fact that people are used to the
search for all words style of queries. So why does TWiki walk a different path? Why ignore what's obvious? Why confront users with the ";" syntax?
Users should be able to state their queries using a "syntax" they are familiar with. Is there any reason why the
current specs are not changed to accommodate those needs?
I think, now that the
AND search has been implemented it should be only a small step to get this to work in
simple search as well (please without the ";"). Doing so will not raise any compatibility issues as the
AND search will find the same pages as the current search and more.
Would you please explain what is the advantage of
simple search doing a
phrase search. Maybe it's a mere historical feature?
It's a pity, Peter changed this to "FER". Mind you, my first proposal to offer
search all words was on
22 Feb 2001! After doing some reading I figured, Peter had also suggested a search engine like query syntax (see
CategorySearchForm). The discussion led to the implementation of the ";"
AND search. Richard Donkin recommended using this core functionality to do
an
all words search while hiding its syntax from the user (
jim fred should turn into the /jim;fred/ RE, doing an AND search
).
--
DanielKabs - 10 Feb 2003
I agree 100% with Daniel. The most popular search engine, Google, uses space as AND, so most people are trained to use space. Why invent a different standard and try to force it on users?
I reworded slightly
RuleOne. Now Rule #1 of web usability says:
People spend most of their time on other websites. Do it (and call it) as other sites do. Thank you, Daniel, for not accepting an explanation of how quirkily TWiki does it, and for requesting that it be done the right way.
I can see why TWiki developers understand this issue differently: if they use the TWiki engine on an intranet, they spend a substantial amount of time on TWiki and get used to its quirks. But for new and casual users, they are quirks, not virtues, and should be handled as such: get rid of them, use web standards.
--
PeterMasiar - 11 Feb 2003
I'm discussing a way to get to a more search engine like feature set in
SearchEnhancements.
This topic is dealing with the same issues as
SearchSuggestion ( searching topic and/or text ) and
SearchIsBroken ( implicit AND / Space as AND).
--
SamHasler - 11 Feb 2003
Ignoring quotes for the moment, would the following work?
***************
*** 233,239 ****
} else {
$tempVal = $TWiki::fgrepCmd;
! @tokens = $theSearchVal;
}
$cmd =~ s/%GREP%/$tempVal/go;
--- 236,244 ----
} else {
$tempVal = $TWiki::fgrepCmd;
! # @tokens = $theSearchVal;
! # Codev.SearchWithImplicitAnd
! @tokens = split( /\s/, $theSearchVal );
}
$cmd =~ s/%GREP%/$tempVal/go;
There were some other references to
$theSearchVal
in Search.pl but they didn't appear to be relevant.
--
SamHasler - 12 Feb 2003
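The idea behind the patch above, splitting the search string on whitespace and requiring every token to match, can be sketched as follows. This is an illustrative Python sketch, not the Search.pl code: `search_with_implicit_and` and the in-memory `topics` dictionary are assumptions, and a case-sensitive substring test stands in for the fgrep call.

```python
def search_with_implicit_and(topics, search_string):
    """Return names of topics whose text contains every token.

    topics: dict mapping topic name to topic text (a stand-in for the
    files that the real search would pass to fgrep).
    """
    # str.split() with no argument collapses runs of whitespace, so a
    # doubled space does not produce an empty token.
    tokens = search_string.split()
    matches = list(topics)
    for token in tokens:
        # Narrow the candidate list one token at a time (implicit AND).
        matches = [name for name in matches if token in topics[name]]
    return matches

topics = {"TopicA": "soap and wsdl", "TopicB": "soap only"}
print(search_with_implicit_and(topics, "soap wsdl"))
```

One difference from the patch is worth checking: Perl's `split( /\s/, ... )` returns an empty token for each doubled space, whereas `split( /\s+/, ... )` would not.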
Also see the
GoIsSearch discussion, which sparked the
PhotonSearch script.
To answer the question "why aren't they combined", I'll hazard a guess: the topic name search is just a filesystem dir search (which is why it is case insensitive on NT) while the full text search uses egrep.
--
MattWilkie - 06 Jun 2003
I'm tempted to remove the Go box because I don't think it's being used, and it gets confusing having it and the search box. I tend to use the address bar anyway, it has a bigger history and I can add stuff like ?raw=on or ?rev=1.34 which you can't do with the go box.
--
SamHasler - 06 Jun 2003
... but this is something you have to
know, in other words: not for the average user. I would also prefer
GoIsSearch.
Refactored:
Combining search in WebSearch in both topic text and title, see patch in
SearchSuggestion.
... and Sam's "patch" on this page is still awaiting feedback...
--
ArthurClemens - 16 Aug 2003
It appears that
PhotonSearch does alternatively search in topics and topic texts, and does this with implicit AND. Seems almost perfect! So why is PhotonSearch still not integrated into TWiki?
--
ArthurClemens - 02 Sep 2003
Lack of time to review pending contributions in detail. This will change with the expanded
CoreTeam and
AppealToCodevCommunityByCoreTeam.
--
PeterThoeny - 03 Sep 2003
Moved related discussion from
SearchDoesNotWorkAsExpected to here.
--
PeterThoeny - 11 Oct 2003
I would like to follow up on this.
At the same time we should implement a search with inverted match, e.g. a search that shows topics that do
not contain the search string.
- Proposed syntax for normal search:
-
soap wsdl "web service" -shampoo
- Proposed syntax for regex search:
-
soap;wsdl;web service;!shampoo
Anyone interested in implementing this?
Question: Do we need to be compatible with existing content? The search string can be in an embedded
FormattedSearch which could break existing content or
TWikiApplication. I assert that most applications use a regular expression search, so the chance is small that this spec change breaks existing content.
--
PeterThoeny - 02 Nov 2003
Please be careful with this. Until we have a repeatable and manageable process to track, resolve and inform the user of problems, I feel
very uncomfortable with breaking existing content.
--
SvenDowideit - 03 Nov 2003
I personally cannot think of any custom searches I have which would be broken by the proposed change. I'll dig around to make sure. I agree it will be nice to have tests to automate this part. : )
--
MattWilkie - 04 Nov 2003
Don't forget to include an escape mechanism or to document it if current features provide one, so that you can search for
<hyphen>something
and similarly for exclamation marks.
--
SamHasler - 05 Nov 2003
It looks like we need a new search type. I updated the proposed spec accordingly. Is "keyword" the right word for the Google style search syntax? Any other feedback?
--
PeterThoeny - 05 Nov 2003
"Google search syntax" is perfect (see
hits). Do not use "keyword" in this context, because we are talking about
full text search - in contrast to (assigned) keyword searching, where people manually assign keywords to documents/pages.
A nice addition would be to have a stop word list, to decrease the number of irrelevant links. Should be language dependent of course.
In the syntax:
-
soap wsdl "web service" -shampoo +xml
... users can use the plus sign, but they should get the same results as in
-
soap wsdl "web service" -shampoo xml
And another nice addition would be the keyword NEAR, to get results where two search terms both occur in a 'word window'. But let us first have the simple Google syntax!
--
ArthurClemens - 05 Nov 2003
Google help at
http://www.google.com/help/basics.html talks about keywords, but agreed, this could be confused for the meta keywords in an
HTML doc.
So what is a good value for the type parameter?
type="Google search syntax"
is too verbose. How about
type="default"
or
type="standard"
or
type="word"
or
type="token"
or ...?
I added the stop word list, the "+" syntax, and the escape for plus and minus signs to the proposed syntax.
--
PeterThoeny - 06 Nov 2003
type="google" ?
-- Unknown
My opinion is that the search should default to "AND" mode. You don't need a check box for exact phrase mode, as this is provided already by enclosing the terms with quotes. The embedded searches usually use regular expressions but are certainly limited in the number of operators available, but I don't see this as a big problem. The main issue is for users who are new to TWiki understanding how to use the searches.
For example:
"This is an exact phrase search"
This is a search looking for pages including all the words
--
RaymondLutz - 06 Nov 2003
I changed the Google type search to "keyword" since it is the standard term in the industry.
Default for search:
- Default for the search form (which in turn calls the search script) should be "keyword" (defined by the SEARCHSCRIPTTYPE setting)
- Default for the
%SEARCH{}%
variable should remain as "literal" (defined by the SEARCHVARIABLETYPE setting). This is for compatibility with existing apps.
I updated the proposed spec on top.
We can go ahead with the implementation.
--
PeterThoeny - 10 Dec 2003
I like the changes and support the proposed specification, although the "literal" checkbox is superfluous as you can obtain that by enclosing in quotes. Yet, it doesn't hurt too much to have this redundant capability. This is a very important improvement. Thanks.
(May I also suggest that, while you are mucking with this code, the parameters be split up in the Search.pm module to improve the partitioning. As it is, you have to change TWiki.pm to add a parameter, and this should not be necessary...)
--
RaymondLutz - 10 Dec 2003
The redundant "literal" type is needed for backward compatibility.
I am currently working on the implementation.
Does anyone have a good list of SEARCHSTOPWORDS?
--
PeterThoeny - 03 Jan 2004
Search Engine World has
this list, but it seems a bit too large (why exclude 'microsoft' for instance?).
Overview of other lists, such as
Google's stop word list.
Stop word lists are generally context dependent, and can be computed: words that occur n times (where n is very big) can be incorporated in the stop list. For general purposes, the Google list seems to be a good start.
So to aid the generation of stop word lists, it would be handy to have a tool that scans all topics and computes a list of words with their occurrences.
Which in turn would be very handy for auto-generating crude "what's related" charts. -- MattWilkie - 06 Jan 2004
--
ArthurClemens - 04 Jan 2004
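The tool suggested above, one that scans all topics and counts word occurrences, could look roughly like this. It is a minimal Python sketch under stated assumptions: the function names are hypothetical, topic texts are passed in as plain strings, and the threshold `min_count` plays the role of the "n" mentioned in the comment.

```python
import re
from collections import Counter

def word_frequencies(topic_texts):
    """Count how often each lower-cased word occurs across all topic texts."""
    counts = Counter()
    for text in topic_texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def candidate_stop_words(topic_texts, min_count):
    """Words occurring at least min_count times are stop word candidates."""
    counts = word_frequencies(topic_texts)
    return sorted(word for word, n in counts.items() if n >= min_count)

texts = ["the soap and the wsdl", "the web service"]
print(candidate_stop_words(texts, 3))
```

A site would still need to review the candidates by hand, since a word that is frequent in one web can be meaningful data in another.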
Thanks Arthur, added to
TWikiPreferences.
--
PeterThoeny - 04 Jan 2004
I believe that we did not add this to the specs: using '+' should overrule the stop word list.
--
ArthurClemens - 04 Jan 2004
Implementation without much testing is done; enhancement is in
TWikiAlphaRelease and at TWiki.org. Docs and testing are pending.
Could someone help out testing it against above spec? Use
WebSearch for search script testing, and
FormattedSearchFormTesting for
%SEARCH{}%
variable testing (in the search string, remember to escape double quotes with a backslash)
--
PeterThoeny - 04 Jan 2004
First observations:
- "Scope: topic text" should be "topic title".
- Using keyword search I cannot search for contributor names; with literal search I can. I believe that TopicContributor in Main.TopicContributor should be searchable too with keyword search.
- If a WikiWord is included in the search, the input field gets pretty messed up (for instance search on: soap CoreTeam)
- Is there a way to know the number of search results before displaying them? The number of topics is displayed at the bottom, where it would be more useful at the top.
- That could be done; best to start a new enhancement request with proposed spec. -- PTh - 04 Jan 2004
- I don't get any results when searching for compound words, like "users expect" (keyword or literal)
- The FormattedSearchFormTesting is a hack for testing, it does not handle double quotes correctly. As mentioned above, prefix double quotes with backslash, e.g.,
\"users expect\"
. This does run the search correctly but does not refill the search string in the search field -- PTh - 04 Jan 2004
--
ArthurClemens - 04 Jan 2004
Thanks Arthur, comments added above.
--
PeterThoeny - 04 Jan 2004, --
ArthurClemens - 05 Jan 2004
I've done some more testing, and all seems to be working fine.
A UI enhancement: when the user enters one of the stop words, omit it from the "Search: " line. For instance, if the user enters 'search' and 'with':
Search:
people
'with' is a very common word and was not included in your search (details)
I copied this from Google. Note that with multiple stopwords they use:
The following words are very common and were not included in your search: with by. (details)
--
ArthurClemens - 14 Jan 2004
>
Does anyone have a good list of SEARCHSTOPWORDS?
Yes ->
[]
This might seem flippant, but if one thing rubbed off on me from the search people whilst I was at Inktomi, it was that
getting these right depends on the local information domain and that search relevance is difficult*. (Unless you want to start using other people's patented methods for relevance...)
Stopword lists are often generated by any number of techniques, including creating an inverted cross reference index of the contents of text and lopping off words that occur above a certain number of times. Getting this to work right is difficult* and if this is implemented in TWiki I would suggest that such functionality is moved out to a plugin (and indeed making search replaceable into a plugin).
If you think this isn't a problem, try
this search phrase - to be or not to be
There's more to it than "just" slapping on a stopword list. One man's stop list is another man's data.
-- MS - 14 Jan 2004
I agree on the inherent problems with stop lists. At the same time I still find them useful (but mainly in OR searches); for AND searches they mean a performance improvement.
Two ways to meet in the middle:
- Allow '+' to force inclusion of stopwords in the search: +to +be +or +not. This works with Google: see previous search but with plusses.
- Phrases should ignore stopwords: "to be or not to be". This works with Google too: see previous search but now in quotes.
--
ArthurClemens - 14 Jan 2004
On stopwords, the current implementation works like Arthur describes.
--
PeterThoeny - 15 Jan 2004
TWikiVariables,
SearchHelp, docs are updated.
WebSearch is now an embedded SEARCH with URL parameters;
WebSearchAdvanced is pending.
--
PeterThoeny - 18 Jan 2004
Small bug: when I enter the words ICON CALC in the search field, on the results screen this gets changed to ICON TWiki.CALC.
--
ArthurClemens - 14 Feb 2004
This is now fixed, see
IncludeFromOtherWebLinksACRONYMS.
--
PeterThoeny - 25 Apr 2004