Help please! :) I'm trying to build a plugin for custom TextFormattingRules and I'm having some difficulties. There doesn't seem to be a topic yet for these kinds of questions, so I started this one. Think of it as a CoffeeBreak for wannabe plugin authors.

Custom TextFormattingRules

Skipping HTML tags

I want to run my custom rendering rules on %TEXT% only. I don't care about the page header or footer; I'm saving those for another project. How do I skip over the HTML-style tags? E.g. in the sample below, my custom rendering rules would apply to the text content but ignore anything between matching < >'s.

<div>some stuff is <span class="subdued">"unimportant"</span></div>

This is important because I'm replacing quotes and apostrophes with their typographically nicer representatives: “like this” as opposed to "like that".

thanks much.

-- MattWilkie - 31 Jan 2004, 02 Feb 2004

The short answer is that using bare regexes to parse HTML is fraught with problems, much more so than you would normally expect. Don't do it; use a proper parser to do it for you (e.g. one of the CPAN modules: step through the tree, perform replacements, etc.).

(Parsing HTML correctly using tools like lex & yacc can be just as awkward for that matter)

That doesn't help you if you don't want to do things that way.

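What you're after with regard to your example is not to parse the HTML, but only to deal with what's outside the tags. On the surface of it, a tag matches the form: a <, then a minimal number of non-> characters, then a >.

And a regex for stuff outside would then be: a run of characters that are neither < nor >.

Which then makes an HTML page essentially something like: any number of (stuff outside, tag) pairs, followed by a final run of stuff outside.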
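A minimal sketch of that decomposition, assuming the naive tag definition above holds (no < or > inside attribute values). The names $tag, $outside and tokenize are illustrative only, not part of TWiki:

```perl
use strict;
use warnings;

# Naive patterns; both assume no < or > ever appears inside an attribute.
my $tag     = qr/<[^>]*>/;    # "<", a run of non-">" characters, ">"
my $outside = qr/[^<>]+/;     # a run of characters that are neither "<" nor ">"

# Walk the page from left to right and label each piece as tag or text.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    while ( $html =~ /\G(?:($tag)|($outside))/gc ) {
        push @tokens, defined $1 ? [ tag => $1 ] : [ text => $2 ];
    }
    return @tokens;
}

my @tokens = tokenize('<div>some stuff is <span class="subdued">"unimportant"</span></div>');
print "$_->[0]: $_->[1]\n" for @tokens;
```

Only the pieces labelled text would be candidates for custom rendering rules. Note that the loop simply stops at a stray > that is not part of a tag, which is one of the ways this kind of matching falls short of a real parser.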


Problem is HTML tags can include stuff like:

<font size="+0" title="Hello <World"> Hello World </font>

Which renders for you as: >>> <<< (move your mouse there; it is there)

You might say it's malformed, but then the majority of HTML on the web is malformed.
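To see how that example trips up the simple view of tags, here is a small sketch (the helper name text_runs is made up for illustration). It collects every run that a "non-tag text followed by a tag" pattern would treat as ordinary text:

```perl
use strict;
use warnings;

# Return every run of non-<> characters that is directly followed by
# something that looks like a tag.
sub text_runs {
    my ($html) = @_;
    my @runs;
    while ( $html =~ /([^<>]+)(<[^>]*>)/g ) {
        push @runs, $1;
    }
    return @runs;
}

my $html = '<font size="+0" title="Hello <World"> Hello World </font>';
print "treated as text: [$_]\n" for text_runs($html);
```

The first run reported is the inside of the font tag up to the < in the title attribute, so any quote replacement applied to "text" would rewrite the attribute quotes and break the tag.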

The best you can probably hope for is something close - match any number of consecutive HTML tags followed by something that isn't a tag, and process that with a global eval replacement. (Whilst assuming your HTML doesn't have < or > inside attributes - if it does, you're out of luck)

Something like:

    $text =~ s/([^<>]+)(<[^>]*?>)/handleOutsideHTMLTags($1) . $2/ge;
    $text =~ s/([^<>]+)$/handleOutsideHTMLTags($1)/ge;
Line 1 handles all cases of (NonHTML HTML). The second handles all cases of (NonHTML end-of-string)

(The /o modifier isn't needed since the regex contains no interpolated variables and so never changes)

This is untested, but the approach suggested essentially notes:

  • A page that might contain HTML consists of two types of text - sequences of NonHTML followed by HTML, or NonHTML followed by end of string. You're not interested in the situation where you have the string starting with HTML.

  • The regex for HTML is essentially < followed by a minimal number of non > chars, followed by a > char, with the noted invalid assumption that HTML tags don't contain < or > in tag attributes. (They shouldn't, but that's not the same as whether they do or don't)

  • The regex for non HTML says that it's a sequence of non > or < chars.

It then calls a handleOutsideHTMLTags function, which should return the replacement. (Your replacement of quotes can happen inside that function.)
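As a rough sketch of what such a function might look like: the body of handleOutsideHTMLTags below is hypothetical (only the name comes from the discussion above), and render is a made-up wrapper so the two substitutions can be tried on a sample string.

```perl
use strict;
use warnings;

# Hypothetical body: receives one run of non-tag text and returns it
# with typographic replacements applied.
sub handleOutsideHTMLTags {
    my ($text) = @_;
    $text =~ s/"(.*?)"/&#8220;$1&#8221;/g;     # "quoted" becomes curly double quotes
    $text =~ s/(?<=\w)'(?=\w)/&#8217;/g;       # apostrophe between word characters
    return $text;
}

# Apply the function to text followed by a tag, then to trailing text.
sub render {
    my ($text) = @_;
    $text =~ s/([^<>]+)(<[^>]*?>)/handleOutsideHTMLTags($1) . $2/ge;
    $text =~ s/([^<>]+)\z/handleOutsideHTMLTags($1)/e;
    return $text;
}

print render('<div>some stuff is <span class="subdued">"unimportant"</span></div>'), "\n";
```

On this sample the quotes around unimportant are replaced while class="subdued" is left alone, but only because the sample has no < or > inside its attributes.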

It's worth noting that this approach is pretty simple to break, and that's part of the reason for not parsing HTML using regexes.

-- MS - 03 Feb 2004

Thanks for the response MS, especially for including some example regexes I can play with (btw, the RegexCoach is awesome for someone like me). I'm heartened to learn that it's not because I'm stupid that I haven't been able to get very far with this. On the other hand I wish it was just me being foolish and overlooking the obvious - because then it would be easier to solve!

-- MattWilkie - 05 Feb 2004

Okay, what's wrong with this code? The substitutions are being done, html tags are being skipped, but the page as rendered by twiki doesn't reflect this. EmptyPlugin.pm:

sub endRenderingHandler {
foreach ( split (/(<.*?>)/s, $_[0], -1) ) {
   #print "\n---split: $_ ";

   # only process pieces which aren't html tags
   # (assumes any piece starting with < is a tag)
   if ( $_ =~ /^[^<]/ ) {
      $_ =~ s/\\\\\n/<br>/g;         # manual line break
      $_ =~ s/ --- /\&#8212\;/g;       # long dash
      $_ =~ s/ -- /&#8211;/g;          # short dash
      $_ =~ s/\.\.\./&#8230;/g;        # trailing ellipsis...
      $_ =~ s/"(.*?)"/&#8220;$1&#8221;/g;   # curly quotes
      # Why doesn't the rendered page show the substitutions?
      # the debugs prove the work is being done:
      print "\n---subst: $_ ";
      TWiki::Func::writeDebug( "---subst: \t $_");
   }
}
}

thanks in advance,

-- MattWilkie - 07 Feb 2004

The reason is this part: (paraphrased and line numbers added for clarity)

    1: sub endRenderingHandler {
    2: foreach ( split (/(<.*?>)/s, $_[0], -1) ) {
    3:    # only change $_

Stepping through the logic of what happens here.

  • In TWiki.pm, a call is made to endRenderingHandler via the plugins subsystem:
    • &TWiki::Plugins::endRenderingHandler( $result );
    • That function calls applyHandlers to loop through all the plugins.
    • Key point: $result is handled by reference - if you change $_[0] in the plugin, you change the value of $result in the main codeline.

Note also that the original call, &TWiki::Plugins::endRenderingHandler( $result ), requires the callee to change the value of the variable referenced.
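The by-reference behaviour is plain Perl rather than anything TWiki-specific: the elements of @_ are aliases for the caller's arguments. A small stand-alone illustration (the sub names are made up):

```perl
use strict;
use warnings;

# Elements of @_ are aliases: changing $_[0] changes the caller's variable.
sub modifies_alias {
    $_[0] =~ s/draft/final/;
    return;
}

# Copying into a lexical first breaks the alias; the caller sees no change.
sub modifies_copy {
    my ($text) = @_;
    $text =~ s/draft/final/;
    return;
}

my $result = "draft page";
modifies_copy($result);
print "$result\n";     # still "draft page"
modifies_alias($result);
print "$result\n";     # now "final page"
```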

Stepping through your code:

  1. Your function is called - $_[0] is set to be a reference to $result
  2. You take $_[0] and split it into lots of strings.
    • This creates an array of strings (there are probably optimisations internally, but that's the logic)
    • The strings in this array are copies of the contents of $_[0], not references to the contents of $_[0]
  3. That means when you do your substitutions on $_ you are changing copies, not changing $_[0] . Since $_ doesn't reference anything and isn't saved, at the end of the body of the loop, the work done is thrown away.

After changing copies of all the sections in $_[0] (and throwing the results away) the loop exits and the function exits, leaving the value of $_[0] unchanged.
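The same effect can be reproduced outside TWiki. In this sketch (sample string and variable names are illustrative), the substitution runs on the pieces returned by split, and the original string is printed afterwards:

```perl
use strict;
use warnings;

my $page = 'a "b" <i>c</i>';

# split returns new strings; editing them does not touch $page.
my @pieces = split /(<.*?>)/s, $page, -1;
s/"(.*?)"/&#8220;$1&#8221;/g for @pieces;

print "page:   $page\n";                        # unchanged
print "pieces: ", join( "", @pieces ), "\n";    # carries the substitution
```

The work only becomes visible once the pieces are joined back together and the result is assigned to the original variable.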

An approach that will work: (change %PATT%, %SKIP%, logic to fit :) )

sub endRenderingHandler {
  # my ($result,$snippet) =("","");   # mhw: hangs for some reason
  my $result = "";
  foreach my $snippet ( split (/%PATT%/s, $_[0], -1) ) {
      if ($snippet =~ /%SKIP%/) {
         $snippet = &transform($snippet);
      }
      $result .= $snippet;
  }
  $_[0] = $result; # return result
}

An alternative, which might be quicker depending on how fast string concatenation is (it's pretty fast in Perl, though):

sub endRenderingHandler {
  my @result = ("");
  my $snippet;
  foreach $snippet ( split (/%PATT%/s, $_[0], -1) ) {
      if ($snippet =~ /%SKIP%/) {
         $snippet = &transform($snippet);
      }
      push @result, $snippet;   # collect every piece; without this @result stays empty
  }
  $_[0] = join "", @result; # return result
}

(None of the above are designed to run straight away, just explain what's wrong!) Hope that helps.
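For experimenting outside TWiki, here is a self-contained sketch of the first approach with the placeholders filled in. The split pattern and the "starts with <" test are taken from the question; transform and its rules are assumptions made for illustration, and the handler is called directly rather than through the plugin system.

```perl
use strict;
use warnings;

# Hypothetical transform: dashes and curly quotes, as in the question.
sub transform {
    my ($s) = @_;
    $s =~ s/ --- /&#8212;/g;                  # long dash
    $s =~ s/ -- /&#8211;/g;                   # short dash
    $s =~ s/"(.*?)"/&#8220;$1&#8221;/g;       # curly quotes
    return $s;
}

sub endRenderingHandler {
    my $result = "";
    # The capturing group makes split return the tags as well as the text.
    foreach my $snippet ( split /(<.*?>)/s, $_[0], -1 ) {
        # Assumes any piece starting with "<" is a tag; leave those alone.
        $snippet = transform($snippet) if $snippet =~ /^[^<]/;
        $result .= $snippet;
    }
    $_[0] = $result;    # write back through the alias so the caller sees it
    return;
}

my $page = '<span class="subdued">"unimportant"</span> -- really';
endRenderingHandler($page);
print "$page\n";
```

The tag keeps its plain attribute quotes, while the text between and after the tags gets the replacements.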

-- MS - 07 Feb 2004

Thank you very much for your help! With some tweaking I have been able to get method 1 to work. Method 2 has some syntax errors which I fixed (unmatched braces, which I'll fix here when I get home and have the code in front of me) but it still gives me empty results.

In Method One I had to change:
my ($result,$snippet) =("","");
to:
my $result = "";

otherwise the page would just sit there 'loading' forever. {shrug} I'm happy, I'm finally back to what I was really trying to do a week ago. :) Thanks again for your help.

-- MattWilkie - 11,12 Feb 2004

Great! Pleased to be of help. Seems a bit OTT to me for the desired effect, but it's been useful to me too. Supporting this kind of transform in a sensible way would be cool.

Have fun,

-- MS - 12 Feb 2004

Skipping Rendering Stages

I see by the StepByStepRenderingOrder that rendering plugins will get called at least 3 times (%text%, %metadata%, %template%). Is there any way to flag steps as "clean"? E.g. no need to call me here, I don't do metadata or templates, only text?

-- MattWilkie - 02 Feb 2004

Random Notes

Added a couple of headings since the Q's answers would be distinct. Hope that's OK.

of course it is. increasing clarity is always appreciated. -MW

-- MS - 03 Feb 2004

Topic revision: r9 - 2004-02-12 - MichaelSparks