RegexToFindTopicSummary < Support

Question

I am attempting to design a RegularExpression within a FormattedSearch (using $pattern variable) to list a topic summary using a convention such as described in AllowDesignationOfSummary.

Following this convention, I have created the variables %SUMMARY% and %ENDSUMMARY% to demarcate and format a summary for some topics. Now I want to create a formatted search that lists the topic name and the text between the summary variables.

I get so frustrated with Regexes! I've tried but can't make it work. Could some kind soul help me formulate a Regex that captures all the text between %SUMMARY% and %ENDSUMMARY%?

TWiki version: 01Feb03

-- LynnwoodBrown - 26 Feb 2003

Answer

I'd like to, but it's been a while since I fooled with regexes. The one suggestion I'd make is to use an editor that supports regexes (like Nedit on Linux), create (or import) a "test page" within Nedit, and then attempt to get a regex working in Nedit to find the text you want. It's possible that the regex would require some modification to then use in TWiki, but it shouldn't be that much, and this approach gives you much faster feedback (I think) than inserting a regex in the TWiki code and then trying to test.

If I find some time, I will look at my notes on WikiLearn with respect to regexes (search for "explain" or "construct" in topic titles). I think you'll find some pages that attempt to "deconstruct" some Perl (or other) regexes.

-- RandyKramer - 27 Feb 2003

I'm back -- three things:

I'm starting to look at some of the relevant pages on WikiLearn:

An alternate and better search might be for topics that begin with Regex.

Just brainstorming, wouldn't the regex include something like %SUMMARY%.*%ENDSUMMARY% -- I know that's pretty "bare" -- I'm sort of intrigued and may spend a little more time looking.

I think I'd like to suggest that you use the tags <SUMMARY> and <ENDSUMMARY> (or lowercase and </summary> for the end tag, or even <sum> and </sum> to make typing easier) so they can be embedded in the ordinary text of a page and delimit a portion of that ordinary text to be treated as the summary. Right now I'm mixed up on what a delimiter surrounded in "%" does -- does it appear in the text of the page or is it hidden as metadata. If it appears in the text of the page, do the delimiters also show up -- it appears they do in the preview. Saving now to try to see, but I think I will remain confused. Well, unless I'm doing something wrong, they show up in the text -- oh, wait, maybe that will be corrected when this function is implemented?

-- RandyKramer - 27 Feb 2003

Back again -- maybe the last time for a while -- can't find much useful on WikiLearn, and I can't access my private TWiki from where I am right now. And, I'm probably confusing the heck out of anyone trying to read this.

Two more things:

In my next iteration of the regex, I'd surround the .* with parentheses; then the "found text" should be accessible as $1. (I'm not sure where that access is usable, though -- maybe just in the regex, so on the same line you need to assign it to some Perl variable, or actually do what you want with it, like print it or whatever?)

I have a second reason to suggest that you not use "%" to surround the delimiters -- I guess those mark a TWiki (and maybe Perl??) variable, and might have something substituted for them -- and maybe that is one of the problems you are having with your attempts -- maybe TWiki / Perl is trying to substitute something for them, but they have not been defined. (Just a very WAG.)
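The parenthesized-capture idea above can be tried quickly outside TWiki. This is only a minimal Python sketch, not TWiki code: the sample text is made up, and Python's `m.group(1)` plays the role of Perl's $1. Note that in the regex itself the % character has no special meaning; whether TWiki substitutes the variables before the search runs is a separate question that this sketch does not answer.

```python
import re

# Made-up sample text with the delimiters left unexpanded.
text = "intro %SUMMARY%the summary text%ENDSUMMARY% rest of the topic"

# The parentheses around .* create capture group 1 (Perl's $1).
match = re.search(r"%SUMMARY%(.*)%ENDSUMMARY%", text)

# Keep the captured text in an ordinary variable for later use.
captured = match.group(1) if match else None
print(captured)
```

If the delimiters never appear in the text being searched (for example because they were already replaced by something else), `match` is `None` and nothing is captured.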

Good luck! When I get home, I may look some more.

-- RandyKramer - 27 Feb 2003

Randy - thanks for responding. I'll take some more time to explore what's on WikiLearn to see what I find but wanted to quickly respond to a couple of your points:

As described in AllowDesignationOfSummary, %SUMMARY% and %ENDSUMMARY% are custom TWiki variables I defined that produce some formatting to highlight the topic summary. My hope is that they also serve as convenient markers to find the summary using a regex.
Your suggestion of trying a regex along the lines of %SUMMARY%.*%ENDSUMMARY% was exactly what I tried first with my VERY rudimentary understanding of regex. It didn't work. The results show the first few characters of the HTML code that the variable inserts.

-- LynnwoodBrown - 27 Feb 2003

Here is some stuff from my "private" TWiki. Because the examples might be considered "derived works" (from Learning Perl, 2nd Ed., Schwartz, Randal L. and Christiansen, Tom, O'Reilly, July, 1997), I have been reluctant to move those pages to WikiLearn before seeking their permission:

/apple(.*)orange\1/
In this case, whatever string is found after apple is remembered and used again via the "\1" reference. Thus, this regular expression would match "appleJUICEorangeJUICE", but not "appleSAUCEorangeJUICE".

If there is more than one set of parentheses, the first remembered substring is referred to as \1, the next as \2, and so on.
The "memorized" strings can also be referred to outside of the regular expression using scalar variables $1, $2, ... corresponding to the \1, \2, ... respectively. The $1, $2, ... are read only variables -- you cannot make assignments to them, and the variables are reassigned after each search -- you will need to preserve the values elsewhere if you expect to use them after some other search.
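The apple/orange example from these notes can be checked with a short Python sketch. Python's `re` module supports the same \1 backreference, and `group(1)` stands in for Perl's $1; the helper name `remembered` is made up for illustration.

```python
import re

# Same pattern as in the notes: whatever follows "apple" must repeat after "orange".
pattern = re.compile(r"apple(.*)orange\1")

def remembered(candidate):
    """Return the remembered substring (Perl's $1), or None if there is no match."""
    m = pattern.fullmatch(candidate)
    return m.group(1) if m else None

print(remembered("appleJUICEorangeJUICE"))
print(remembered("appleSAUCEorangeJUICE"))
```

The first call yields the remembered string JUICE; the second yields None because SAUCE and JUICE differ. Copying `group(1)` into its own variable is the Python counterpart of preserving $1 before the next search overwrites it.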

This indicates to me that the approach I suggested earlier should be close. I'm leery of both the % signs and the fact that %SUMMARY% is a variable (and might be substituted before this regular expression is run). Why don't you start with an example that uses "apple" and "orange" as the delimiters (just like the example, but leave out the "\1" and in the next line of code, do something like "print $1" to see what you get) -- if that works, build from there.

-- RandyKramer - 28 Feb 2003

I think that you are right about the % signs and the variable being substituted before the regular expression is run. I figured this out because the regex was returning the first few characters of the html code that the variable inserted. So I've been trying to construct a regex using the html code as delimiters along the lines of:

.*<tr><td><b>(.*)<\/td><\/tr><\/table>.*

But I get a software error that reads:

Unmatched ( before HERE mark in regex m/.*<tr><td><b>( << HERE .*/ at ../lib/TWiki/Search.pm line 766.
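One possible explanation for this error, offered as an assumption and not something confirmed in this thread: if the FormattedSearch code pulls out the $pattern(...) argument with a non-greedy match that stops at the first `.*)` it sees, then a greedy `(.*)` group inside the pattern cuts the regex short right after `(.*`, which is the fragment quoted in the error message. The Python sketch below imitates that assumed extraction step. `ARG_EXTRACTOR` and both function names are hypothetical stand-ins, not TWiki code.

```python
import re

# ASSUMPTION: the formatter finds the $pattern(...) argument non-greedily,
# ending at the first ".*)" it encounters. This regex is a hypothetical stand-in.
ARG_EXTRACTOR = re.compile(r"\$pattern\((.*?\s*\.\*)\)")

def extract_pattern_argument(format_string):
    """Return the regex text the assumed extractor would pass on to the search."""
    m = ARG_EXTRACTOR.search(format_string)
    return m.group(1) if m else None

def compiles(regex_text):
    """Report whether regex_text is a syntactically valid regex."""
    try:
        re.compile(regex_text)
        return True
    except re.error:
        return False

# Greedy inner group: the extraction stops inside the parentheses.
greedy = extract_pattern_argument(r"$pattern(.*<tr><td><b>(.*)<\/td><\/tr><\/table>.*)")
# Non-greedy inner group: "(.*?)" is not followed directly by ")", so the
# extraction runs to the final ".*)" and the whole pattern survives.
lazy = extract_pattern_argument(r"$pattern(.*<tr><td><b>(.*?)<\/td><\/tr><\/table>.*)")

print(repr(greedy), compiles(greedy))
print(repr(lazy), compiles(lazy))
```

Under that assumption, writing the kept group as `(.*?)` instead of `(.*)` avoids the truncation, because only the final `.*` of the pattern is followed directly by a closing parenthesis.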

That's where I am right now. I'll just keep plugging away at it by trial and error...

Thanks again for the help!

-- LynnwoodBrown - 28 Feb 2003

OK, here we go:

START-SUMMARY Example summary (between START-SUMMARY and END-SUMMARY) we try to extract. END-SUMMARY

Here is the SEARCH on the current topic only, extracting just the summary, formatted as | topic: summary | :

RegexToFindTopicSummary: Example summary (between START-SUMMARY and END-SUMMARY) we try to extract.


Above search pattern is: $pattern(.*?START-SUMMARY[\n\r]*(.*?)[\n\r]*END-SUMMARY.*), that is, a non-greedy scan over the start keyword; scan over new lines; scan and keep text non-greedily; scan and discard new lines and end keyword; scan and discard the remaining text.
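The pattern above can be tried outside TWiki. This is a minimal Python sketch under two assumptions: that the pattern is applied to the whole topic text with `.` allowed to match newlines (hence `re.DOTALL`), and that the first parenthesized group is what gets displayed. The sample topic text and the function name are made up for illustration.

```python
import re

# Same pattern as above; DOTALL lets "." cross line breaks (an assumption
# about how the search treats multi-line topic text).
SUMMARY_PATTERN = re.compile(
    r".*?START-SUMMARY[\n\r]*(.*?)[\n\r]*END-SUMMARY.*",
    re.DOTALL,
)

def extract_summary(topic_text):
    """Return the text between the markers, or None if the markers are absent."""
    m = SUMMARY_PATTERN.fullmatch(topic_text)
    return m.group(1) if m else None

# Hypothetical topic text with the markers on their own lines.
sample_topic = (
    "Some introduction.\n"
    "START-SUMMARY\n"
    "Example summary we try to extract.\n"
    "END-SUMMARY\n"
    "Remaining text of the topic.\n"
)

print(extract_summary(sample_topic))
```

The leading `.*?` and trailing `.*` are what let the pattern cover the entire topic, so that only the kept group remains once everything else is discarded.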

You could extract the first paragraph in a topic after the first heading, e.g. declaring it as the implicit summary. Here we go:

RegexToFindTopicSummary: I am attempting to design a RegularExpression within a FormattedSearch (using $pattern variable) to list a topic summary using a convention such as described in AllowDesignationOfSummary.


Above search pattern is: $pattern(.*?\-\-\-\+[^\n\r]*[\n\r]*([^\n\r]*).*), that is, a non-greedy scan for heading start; scan until end of heading line; scan over new lines; scan and keep all text excluding new lines (aka whole paragraph); scan and discard the remaining text
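The first-paragraph pattern can be checked the same way. Again this is only a Python sketch, assuming whole-text matching with `.` crossing newlines; the sample topic and the function name are invented for the example.

```python
import re

# Same pattern as above: skip to the first "---+" heading, skip the rest of
# that line and any blank lines, then keep everything up to the next newline.
FIRST_PARAGRAPH_PATTERN = re.compile(
    r".*?\-\-\-\+[^\n\r]*[\n\r]*([^\n\r]*).*",
    re.DOTALL,
)

def first_paragraph_after_heading(topic_text):
    """Return the first paragraph after the first heading, or None if no heading."""
    m = FIRST_PARAGRAPH_PATTERN.fullmatch(topic_text)
    return m.group(1) if m else None

# Hypothetical topic text in TWiki heading syntax.
sample_topic = (
    "---+ Question\n"
    "\n"
    "I am attempting to design a regular expression.\n"
    "\n"
    "TWiki version: 01Feb03\n"
)

print(first_paragraph_after_heading(sample_topic))
```

Because the kept group is `[^\n\r]*`, a paragraph that is hard-wrapped across several lines would be cut at its first line break.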

-- PeterThoeny - 28 Feb 2003

Thanks, Peter!

Just doing a (very) little cleanup on my notes -- an extraneous line snuck in from somewhere, and fixed a typo or two.

-- RandyKramer - 28 Feb 2003

Yes, indeed - Thanks Peter! I'm actually learning something out of all this. Unfortunately, I still haven't produced the intended result. One part of the problem seems to be with the fact that I'm searching for variables which, as I indicated above, seem to be expanded before the regex kicks in. Consequently, I've tried to use some of the HTML code for the search along the same lines as Peter's example:

$pattern(.*?<td><b>[\n\r]*(.*?)[\n\r]*<\/td><\/tr>.*)

Unfortunately, this comes up with nothing. If I delete the ? within the search segment I want to keep, I get the error message mentioned before.

Thanks again for the help, guys. I'll keep plugging at this and post the solution if I find it. Ah, the learning opportunity. ;-)

-- LynnwoodBrown - 28 Feb 2003

So:

You're still using something like %SUMMARY% and %ENDSUMMARY% as your delimiters?

Do you think they are being expanded? What do they expand into? Oh, OK I see now -- they expand into those ... table markers. Have you tried searching for just SUMMARY and ENDSUMMARY with them "not defined" so that they don't expand (or to see whether the search is occurring before or after they expand)? (I'm not sure they get expanded before the search, I was just worried that they might.)

If you try Peter's START-SUMMARY END-SUMMARY example, does it work for you? (OK, I guess that's (just about) the same as searching for SUMMARY, etc. without it being defined.)

How about trying it without the parentheses around the (.*?)?

I just noticed the pattern searches for a <b> but doesn't search for a matching </b> (or whatever)?

-- RandyKramer - 02 Mar 2003

Maybe regex gets confused having SUMMARY 3 times (once in expanded text). I like Peter T's suggestion about STARTSUMM and ENDSUMM, might un-confuse regex. <b> should not be used at all in pattern: it's in the expanded text. When it is there, ENDSUMMARY was already substituted for a </table> tag.

-- PeterMasiar - 02 Mar 2003

WebForm
SupportStatus	AnsweredQuestions

Topic revision: r11 - 2003-03-02 - PeterMasiar
