Tags:
create new tag
view all tags

How to Use Regular Expressions to Parse Nested Structures

2011-09-15 - 08:09:06 by PeterThoeny in Development
Get Involved!
TWiki is an open source project with 10+ years of history, built by a team of volunteers from around the world, and used by millions of people in over 100 countries. The community is focusing on building the best collaboration platform for the workplace. We invite you to get involved!
What is TWiki?
A leading open source enterprise wiki and web application platform used by 50,000 small businesses, many Fortune 500 companies, and millions of people.
MOVED TO... Learn more.
This is an advanced topic that teaches how to to parse nested tree structures such as programming languages and spreadsheet formulas using regular expressions in Perl. The article is generic enough to be used in other languages that support regular expressions.

Regular expressions are very powerful to parse and modify text. They are also very fast in processing large text over and over again. For example, the TWiki engine makes heavy use of regular expressions to turn TML (TWiki Markup Language) into HTML. Can regular expressions be used to parse nested structures, such as to parse a programming language or a nested spreadsheet formula?

Some people say that regular expressions (e.g. finite state automaton) cannot parse nested structures because they are not regular, by definition. They argue, you would need an LL parser such as ANTLR to generate an abstract syntax tree which can be further processed with a tree parser.

All that sounded too complicated when I was going to implement TWiki's SpreadSheetPlugin. Regular expressions are extremely fast, and I thought it might be faster than the LL parser and tree parser approach. So I took the challenge to parse spreadsheet formulas using regular expressions.

For illustration, let's take a spreadsheet formula that needs to be parsed and processed. The example formula has 7 function calls with 5 levels of nesting, some of which have more than one function per level:

$ROUND( $TIME( ), $TIMEDIFF( $TIME( $T( R$ROW( ):C$COLUMN( -1 ) ) ), day ) )

There is a way to use regular expressions to parse, say, up to 5 levels, and with the limitation that you can have only one function call per level. Certainly we need a more flexible solution that does not limit us on the number of levels for nesting and number of function calls per level.

The solution is to parse twice, once to temporarily label the parentheses with level of nesting, then to resolve function calls recursively using matching parentheses identified by the labels. This can be done with regular expressions and recursive function calls.

First, let's determine the level of nesting, indicated by the numbers above the parentheses:

      1      2 2           2      3   4      5 5         5    5 4 3      2 1
$ROUND( $TIME( ), $TIMEDIFF( $TIME( $T( R$ROW( ):C$COLUMN( -1 ) ) ), day ) )

We use a regular expression to add the nesting level to the parentheses, together with an escape token (so that we can easily identify it later for processing and removal). Escape tokens with level are indicated by -esc-N; newlines are added for clarity:

$ROUND-esc-1( $TIME-esc-2( -esc-2), $TIMEDIFF-esc-2( $TIME-esc-3(
  $T-esc-4( R$ROW-esc-5( -esc-5):C$COLUMN-esc-5( -1 -esc-5) -esc-4)
 -esc-3), day -esc-2) -esc-1)

First, we write the main function that parses and resolves a spreadsheet formula:

sub _doFormula
{
    my( $formula ) = @_;
    my $level = 0;
    $formula =~ s/([\(\)])/_addNestingLevel($1, \$level)/geo;
    $formula = _doFunc( "MAIN", $text );
    return $formula;
}

This function does two things:

  1. Add an escape token with the level of nesting to each parenthesis of the formula. This is done using a regular expression that calls function _addNestingLevel(). We use modifiers geo, e.g. g for global (don't stop at the first match), e for execute (to call a function for each match), and o to compile the regular expression pattern only once (because the pattern is fixed).

  2. Call _doFunc() to resolve the functions recursively.

The _addNestingLevel() function adds the escape tokens with the level of nesting to a parenthesis, such as -esc-5( and -esc-5):

my $escToken = "\0";
sub _addNestingLevel
{
    my( $theParen, $theLevelRef ) = @_;
    my $result = "";
    if( $theParen eq "(" ) {
        # Increment level before adding escape token
        $$theLevelRef++;
        $result = "$escToken$$theLevelRef$theParen";
    } else {
        # Decrement level after adding escape token
        $result = "$escToken$$theLevelRef$theParen";
        $$theLevelRef--;
    }
    return $result;
}

The nesting level is passed as a reference so that we can modify it. The escape token is the null character, which is a valid character in a Perl string. For the opening parenthesis "(", the nesting is incremented before adding the escape token with the nesting level. For the closing parenthesis ")", it is decremented after the fact. This ensures that we have the same nesting level for matching parentheses.

The _doFunc function handles all spreadsheet functions:

sub _doFunc
{
    my( $theFunc, $theAttr ) = @_;

    # Handle functions recursively
    $theAttr =~ s/\$([A-Z]+)$escToken([0-9]+)\((.*?)$escToken\2\)/_doFunc($1,$3)/geo;
    # Clean up unbalanced mess
    $theAttr =~ s/$escToken\-*[0-9]+([\(\)])/$1/go;

    my $result = "";
    if( $theFunc eq "MAIN" ) {
        $result = $theAttr;
    } elsif( $theFunc eq "ROUND" ) {
        $result = _handleROUND( $theAttr );
    } elsif( $theFunc eq "TIME" ) {
        $result = _handleTIME( $theAttr );
    # etc.
    }
    return $result;
}

The first regular expression s/\$([A-Z]+)$escToken([0-9]+)\((.*?)$escToken\2\)/_doFunc($1,$3)/geo finds all functions of format $NAME-esc-N( ... -esc-N) and calls the _doFunc function recursively to resolve the spreadsheet function. How can we identify the matching closing parenthesis of a function? The ([0-9]+) after $escToken captures the level. With \((.*?)$escToken\2\) we scan non-greedily over the text to find the first escape token of same level, indicated with \2. We use the same geo modifiers for the regular expression, the g modifier results in parsing the string from left to right to find and handle all spreadsheet functions of the same level.

Spreadsheet functions at a deeper level are resolved first because of the recursion. The combined use of recursion and g modifier results in an inside out / left to right handling of the spreadsheet functions, which is the desired behavior.

The second regular expression s/$escToken\-*[0-9]+([\(\)])/$1/go cleans up escape tokens that are left over. For example $EVAL( (2+7)/3 ) has parentheses that are not part of a function construct. Or, the formula might have unbalanced parentheses.

Finally we have a long if-then-else-if-then switch that handles all spreadsheet functions.

That is basically it! As you can see, with a few lines of code it is possible to parse complex nested structures by using just a few regular expressions and a function recursively.

This is a simplified representation of the actual implementation of the SpreadSheetPlugin. Some spreadsheet functions get special treatment. For example, the $IF( condition, then, else ) function needs to evaluate first the condition, then handle either the then or the else part based on the condition. This complicates the simple inside out / left to right parsing/handling described in this article. Interested folks are invited to download the SpreadSheetPlugin and study the code.

It would be interesting to do a reference implementation using the LL-parser/tree parser approach so that we could compare the performance to the pure regular expression/recursion approach discussed here. I have a hunch that the regular expression approach is an order of magnitude faster. Anyone challenging that? I would be happy to be proven wrong!

I hope this advanced regular expression topic is a learning experience for you as it certainly was for me. May be you can use this for one of your own projects?

References:

Comments

Regular expressions, as implemented by common programming languages, are way more powerful than "regular expressions", as intended by computer science theorists in connection with finite state automata. There are very good reasons behind this power. Even something very basic, such as backreferences, are beyond the capabilities of regular expressions. I found this: http://dev.perl.org/perl6/doc/design/apo/A05.html, where Larry Wall says: "... generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions".

-- Kushal Kumaran - 2011-09-19

Thanks for the pointer Kushal, I see there are dramatic changes/fixes to regexes coming with Perl 6.

-- Peter Thoeny - 2011-09-19

That tagging is a great idea. I've used recursion in some toy code of mine, which works well, but has less than wonderful performance, and it has the limitation that the inner parentheses must be processed away or the outer parentheses cannot be evaluated. I haven't tested this yet, but it does have potential.

That having been said, it looks to me like it'd be better to put the tag after the parentheses. That way, the parenthesis itself indicates the next character will be the first character of a tag. There's no way any non-parenthesis will be confused as a tag temporarily, triggering a back track. You then need to have a sequence to end the tag, but since you can't possibly be looking at a false positive, a single, non-digit character will do.

So, instead of

$ROUND-esc-1( $TIME-esc-2( -esc-2), $TIMEDIFF-esc-2( $TIME-esc-3(
  $T-esc-4( R$ROW-esc-5( -esc-5):C$COLUMN-esc-5( -1 -esc-5) -esc-4)
 -esc-3), day -esc-2) -esc-1)
you have
$ROUND(1. $TIME(2. )2., $TIMEDIFF(2. $TIME(3.
  $T(4. R$ROW(5. )5.:C$COLUMN(5. -1 )5. )4.
 )3., day )2. )1.

-- Ed Grimm - 2011-09-19

Thanks Ed for your insights. Not sure how much performance you gain by swapping the parenthesis with the escape token. The -esc- was just for visuals, in reality it is a null character. Does it make a difference in performance if you scan for .N( vs (N.? Your approach is better in a sense that you can take any non-digit char to terminate the sequence, vs. a null character to start the sequence.

-- Peter Thoeny - 2011-09-20

See Stackoverflow answer using this approach to change nested |x+2| to abs(x+2): https://stackoverflow.com/questions/66557385/regex-javascript-convert-strings-with-absolute-values-x-to-absx

JavaScript code:

function fixAbs(str) {
  const startTag = '{{s%L%}}';
  const endTag   = '{{e%L%}}';
  const absRegex = /\{\{s(\d+)\}\}(.*?)\{\{e\1\}\}/g;
  let level = 0;
  str = str
  .replace(/ /g, '')  // remove all spaces
  .replace(/(\|*)?(\w+)(\|*)?/g, function(m, c1, c2, c3) {
    // regex matches variables with all leading and trailing `|`s
    let s = c2;
    if(c1) {
      // add a start tag to each leading `|`: `{{s0}}`, `{{s1}}`, ...
      // and post-increase level
      s = '';
      for(let i = 0; i < c1.length; i++) {
        s += startTag.replace(/%L%/, level++);
      }
      s += c2;
    }
    if(c3) {
      // decrease level,
      // and add a end tag to each trailing `|`: `{{e2}}`, `{{e1}}`, ...
      for(let i = 0; i < c3.length; i++) {
        s += endTag.replace(/%L%/, --level);
      }
    }
    return s;
  });
  // find matching start and end tag from left to right,
  // repeat for each level
  while(str.match(absRegex)) {
    str = str.replace(absRegex, function(m, c1, c2, c3) {
      return 'abs(' + c2 + ')';
    });
  }
  // clean up tags in case of unbalanced input
  str = str.replace(/\{\{[se]-?\d+\}\}/g, '|'); 
  return str;
}

const testCases = [
  '|x|+12',
  '|x|+|y+|z||',
  '|x|+||y|+z|',
  '|x|+|x+|y|+z|',
  '|x|+|x+|y+|t||+z|',
  '|x|+12+|2+x|',
  '|x|+12+|x+2|'
].forEach(str => {
  let result = fixAbs(str);
  console.log('"' + str + '" ==> "' + result + '"');
});

Output:

"|x|+12" ==> "abs(x)+12"
"|x|+|y+|z||" ==> "abs(x)+abs(y+abs(z))"
"|x|+||y|+z|" ==> "abs(x)+abs(abs(y)+z)"
"|x|+|x+|y|+z|" ==> "abs(x)+abs(x+abs(y)+z)"
"|x|+|x+|y+|t||+z|" ==> "abs(x)+abs(x+abs(y+abs(t))+z)"
"|x|+12+|2+x|" ==> "abs(x)+12+abs(2+x)"
"|x|+12+|x+2|" ==> "abs(x)+12+abs(x+2)"

-- Peter Thoeny - 2021-03-10

.

Edit | Attach | Watch | Print version | History: r7 < r6 < r5 < r4 < r3 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r7 - 2012-07-20 - PeterThoeny
 

Twitter Facebook Digg Google Bookmarks E-mail LinkedIn Reddit    
  • Help
  • Learn about TWiki  
  • Download TWiki
This site is powered by the TWiki collaboration platform Powered by Perl Hosted by OICcam.com Ideas, requests, problems regarding TWiki? Send feedback. Ask community in the support forum.
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.