Presentation: Regular Expressions, To Match Or Not, That is the Question, Silicon Valley Perl, 2017-06-01

This is the presentation material for the talk on "Regular Expressions: To Match Or Not, That is the Question" at Silicon Valley Perl, NVIDIA, Santa Clara, CA, 2017-06-01. Peter Thoeny prepared this talk for developers who want to scan and process text quickly.

This presentation is outdated; see the latest version: Regular Expressions, To Match Or Not, That is the Question.


    Copyright © 2017 by TWiki.org. This presentation may be reproduced as long as the copyright notice is retained and a link is provided back to http://twiki.org/.    


Slide 1: Regular Expressions: To Match Or Not, That is the Question

regex-example.png



Presentation for Silicon Valley Perl, 2017-06-01

-- Peter Thoeny - @PeterThoeny - peter09[at]thoeny.org - TWiki.org

Slide 2: About Peter Thoeny

Slide 3: Manipulating Text: The Conventional Way

  • Task: Process text to:
    • validate input
    • change text from one format to another
    • extract snippets of text (e.g. screen scraping), ...
  • Text manipulation using procedural Perl:
    $i = index( $sString, $sSubString );
    $str = substr( $sOldString, 4 );
    $i = length( $sString );
    $str = join( ':', split( '', $sString ) );
    # chomp, chr, hex, lc, ord, pack, sprintf, uc, ... (details)
    # Many CPAN modules...
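As a rough illustration of the procedural functions listed above, the following sketch applies them to a sample string. The variable names and sample values are made up for this example.

```perl
use strict;
use warnings;

my $text = "Hello World.";

# index() returns the zero-based position of a substring, or -1 if absent
my $pos = index( $text, "World" );

# substr() with a single offset returns everything from that offset on
my $tail = substr( $text, 6 );

# length() counts the characters in the string
my $len = length( $text );

# split on the empty string yields single characters; join glues them together
my $spaced = join( ':', split( '', "Perl" ) );

print "$pos | $tail | $len | $spaced\n";   # prints: 6 | World. | 12 | P:e:r:l
```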

Slide 4: Manipulating Text: The Object Oriented Way

  • Text manipulation using OO Perl:
    use String;
    my $str = new String( "Perl" );
    printf( "Length: %d\n", $str->length );
    printf( "First char: %s\n", $str->charAt( 0 ) );
    printf( "Position of 'er': %d\n", $str->indexOf( 'er' ) );
  • Details: CPAN:String

Slide 5: Regular Expression in Wikipedia

regex-wikipedia.png

Slide 6: Why use regular expressions?

twiki-regex.png
  • "Wildcard on steroids" - *.txt becomes .*\.txt$
  • Process large amounts of text over and over again
  • Very powerful and flexible
  • Extremely fast

  • But: There is a learning curve

Slide 7: xkcd: Regular Expressions

regular_expressions.png

Slide 8: Manipulating Text: Use Regular Expressions

twiki-regex.png
  • Match a string with a regex pattern:
    if( $str =~ m/.../ ) {
      # ...
    }
  • Replace a pattern with a string:
    $str =~ s/.../.../;  # Perl syntax
    var newstr = str.replace( /.../, "..." );  // JavaScript syntax
  • Split up a string using a regex delimiter:
    my @items = grep { /.../ } split( /\s*,\s*/, $str );
  • There is also: (details)
    • pos
    • qr/STRING/
    • quotemeta
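The operations on this slide can be tried out with a small script. This is a minimal sketch with made-up sample data; the patterns stand in for the "..." placeholders above and are not from the original talk.

```perl
use strict;
use warnings;

my $str = "apple , banana,cherry ,  date";

# Match: test whether the string contains a pattern
my $has_banana = $str =~ m/banana/ ? 1 : 0;

# Split on a comma with optional surrounding whitespace,
# then keep only the items that contain an "a"
my @items = grep { /a/ } split( /\s*,\s*/, $str );

# Replace: s/// changes the variable in place, so work on a copy
( my $copy = $str ) =~ s/banana/kiwi/;

# quotemeta escapes metacharacters in literal text,
# qr// compiles a pattern once for later reuse
my $literal = quotemeta( "a.b" );          # a\.b
my $re      = qr/^$literal\z/;
my $exact   = "a.b" =~ $re ? 1 : 0;        # literal dot matches
my $loose   = "axb" =~ $re ? 1 : 0;        # escaped dot does not match "x"

print "$has_banana | @items | $copy | $exact $loose\n";
```

Without quotemeta, the dot in "a.b" would act as a metacharacter and "axb" would match as well.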

Slide 9: Regular Expression Basics

  • Color code used in examples:
    • Hello World. - string we operate on (teletype)
    • regex - regular expression (red)
    • match - match we found in string (green)

  • /l/ - single character: Hello World.
    • second and third match: Hello World.
  • /or/ - character sequence: Hello World.
  • Metacharacters with special meaning: ( ) parentheses, [ ] square brackets, { } curly braces, \ backslash, ^ caret, $ dollar, . period, | vertical bar, * asterisk, + plus, ? question mark
  • /\./ - escape metacharacter: Hello World.
  • /./ - without escape: Hello World.

Slide 10: Regular Expression Basics: Character Sets

  • /[oe]/ - one char out of several chars: Hello World.
    • second and third match: Hello World.
  • /H[ea]llo/ - match Hello and Hallo: Hallo World.
  • /[0-9]/ - match range of chars: ID3735.
  • /#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]/ - match hex color: Color: #ff4444.
  • /W[^x]/ - negate a character set: Hello World.
  • /[^0-9]/ - any non-digit: ID3735.
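The character sets above can be checked with a short script. This is only a sketch: the capture parentheses and the g modifier are covered on later slides, and the second color string is an invented counterexample.

```perl
use strict;
use warnings;

my $id = "ID3735.";

# [0-9] matches one digit; the parentheses capture the first one found
my ($first_digit) = $id =~ /([0-9])/;

# [^0-9] matches any non-digit; removing all of them leaves only the digits
( my $digits = $id ) =~ s/[^0-9]//g;

# Six hex digits after a hash sign, written out as on the slide
my $hex6      = qr/#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]/;
my $is_color  = "Color: #ff4444." =~ $hex6 ? 1 : 0;
my $not_color = "Color: #ff44zz." =~ $hex6 ? 1 : 0;

# W followed by any single character except x
my ($pair) = "Hello World." =~ /(W[^x])/;

print "$first_digit | $digits | $is_color $not_color | $pair\n";
# prints: 3 | 3735 | 1 0 | Wo
```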

Slide 11: Regular Expression Basics: Shorthand Character Sets

  • /\d/ - match a digit: ID3735.
  • /\s/ - match whitespace (spaces, tabs, line breaks): Hello World.
  • /\w/ - match a "word character" (alphanumeric chars & underscore): Hello World.
  • /\s\w/ - whitespace followed by word char: Hello World.

Slide 12: Regular Expression Basics: Special Character Sets

  • /\t/ - match a tab character (ASCII 0x09)
  • /\r/ - match a carriage return (0x0D)
  • /\n/ - match a line feed (0x0A)
  • /\r\n/ - match a Windows line end
  • /\x5E/ - match a specific character by its hexadecimal character code, e.g. a caret ^ (0x5E)

Slide 13: Regular Expression Basics: Period (Dot)

  • /./ - dot matches a single char, except line break chars: Hello World.
    • "dot matches all" or "single line" mode (the s modifier in Perl) makes the dot match line breaks as well
  • /[^\n]/ - long version of /./ on Unix: Hello World.
  • /[^\r\n]/ - long version of /./ on Windows: Hello World.
  • /H.llo/ - grok English and German: Hallo World.
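A small sketch of the dot with and without "single line" mode in Perl. The two-line sample string is an assumption made for this example.

```perl
use strict;
use warnings;

my $text = "line one\nline two";

# By default the dot does not match the line break between "one" and "line"
my $default_mode = $text =~ /one.line/  ? 1 : 0;

# With the s modifier the dot matches the line break too
my $single_line  = $text =~ /one.line/s ? 1 : 0;

# The dot stands for exactly one character, so both spellings match
my $hello = "Hello World." =~ /H.llo/ ? 1 : 0;
my $hallo = "Hallo World." =~ /H.llo/ ? 1 : 0;

print "$default_mode $single_line $hello $hallo\n";   # prints: 0 1 1 1
```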

Slide 14: Regular Expression Basics: Anchors

  • Anchors don't match any characters - they match a position in a string.
  • /^/ - match at the start of the string: |Hello World.
    • "Multi-line" mode (the m modifier in Perl) makes /^/ also match after any line break
  • /$/ - match at the end of the string: Hello World.|
  • /\b/ - match at a word boundary: |Hello World.
    • second match: Hello| World.
  • /\B/ - match at every position where \b cannot match: H|ello World.
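Because anchors match positions rather than characters, one way to see them is to insert a marker at every matching position. The following sketch does that with a "|" marker, mirroring the notation on this slide; the two-line sample text is made up.

```perl
use strict;
use warnings;

# Mark every word boundary with a vertical bar
( my $boundaries = "Hello World." ) =~ s/\b/|/g;

# Mark every position inside a word where \b cannot match
( my $inside = "Hello" ) =~ s/\B/|/g;

my $text = "Hello World.\nHello again.";

# Without the m modifier, ^ matches only at the start of the string
my @first_default = $text =~ /^(\w+)/g;

# With the m modifier, ^ also matches after each line break
my @first_multi = $text =~ /^(\w+)/mg;

print "$boundaries\n";                                          # |Hello| |World|.
print "$inside\n";                                              # H|e|l|l|o
print scalar(@first_default), " ", scalar(@first_multi), "\n";  # 1 2
```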

Slide 15: Regular Expression Basics: Alternation (Logic OR)

  • /dog|cat|fish/ - alternation: I like cats and dogs.
    • second match: I like cats and dogs.

  • What about logic AND? Stay tuned.
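A short sketch of alternation in Perl. The "concat" test string is an invented example, and the (?:...) grouping used in the last pattern is explained on a later slide.

```perl
use strict;
use warnings;

my $text = "I like cats and dogs.";

# With the g modifier in list context, matches are returned in the order
# they occur in the string, not in the order of the alternatives
my @pets = $text =~ /(dog|cat|fish)/g;

# Alternation binds loosely: /^dog|cat$/ means "^dog" OR "cat$"
my $loose  = "concat" =~ /^dog|cat$/     ? 1 : 0;

# Grouping the alternatives applies both anchors to each of them
my $strict = "concat" =~ /^(?:dog|cat)$/ ? 1 : 0;

print "@pets | $loose $strict\n";   # prints: cat dog | 1 0
```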

Slide 16: Regular Expression Basics: Repetition

  • /colou?r/ - a question mark ? matches preceding token in regex zero or one time: Red colour.
  • /<[a-zA-Z][a-zA-Z0-9]*>/ - an asterisk * matches the preceding token zero or more times: <tt> text </tt>
  • hex-color-visual.png /#[0-9a-f]+/ - a plus + matches the preceding token one or more times: Color: #ff4444.
  • /<[a-zA-Z0-9]+>/ - what is the problem with this? <tt> text </tt>
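One possible answer to the question on the last bullet, offered here as an assumption rather than the speaker's own answer: the looser pattern also accepts tag names that start with a digit. The sketch below compares both patterns on a few invented inputs.

```perl
use strict;
use warnings;

# First character must be a letter, followed by zero or more letters or digits
my $tag_strict = qr/<[a-zA-Z][a-zA-Z0-9]*>/;

# One or more letters or digits in any order
my $tag_loose  = qr/<[a-zA-Z0-9]+>/;

for my $input ( "<tt>", "<h1>", "<1>" ) {
    my $strict = $input =~ $tag_strict ? "yes" : "no";
    my $loose  = $input =~ $tag_loose  ? "yes" : "no";
    print "$input strict=$strict loose=$loose\n";
}

# The question mark makes the "u" optional, so both spellings match
my $us = "Red color."  =~ /colou?r/ ? 1 : 0;
my $uk = "Red colour." =~ /colou?r/ ? 1 : 0;
```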

Slide 17: Regular Expression Basics: More Repetition

  • /#[0-9a-f]{6}/ - curly braces { } match a specific number of times: Color: #ff444400.
  • /\b[1-9][0-9]{3}\b/ - match a number between 1,000 and 9,999: Value 3500.
  • /\.[a-zA-Z]{2,6}\// - match a specific range of times: http://bma.art.museum/
  • /\b[1-9][0-9]{3,4}\b/ - match a number between 1,000 and 99,999: Range 3500.
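The counted repetitions above can be verified with a few test values. A minimal sketch; the helper names is_four_digit and is_four_or_five_digit are made up for this example.

```perl
use strict;
use warnings;

# True if the text contains a number between 1,000 and 9,999
sub is_four_digit {
    my ($text) = @_;
    return $text =~ /\b[1-9][0-9]{3}\b/ ? 1 : 0;
}

# True if the text contains a number between 1,000 and 99,999
sub is_four_or_five_digit {
    my ($text) = @_;
    return $text =~ /\b[1-9][0-9]{3,4}\b/ ? 1 : 0;
}

# {6} takes exactly six hex digits, even if more follow
my ($color) = "Color: #ff444400." =~ /(#[0-9a-f]{6})/;

print is_four_digit( "Value 3500." ), " ",
      is_four_digit( "Value 35000." ), " ",
      is_four_or_five_digit( "Value 35000." ), " ",
      "$color\n";    # prints: 1 0 1 #ff4444
```

The word boundaries matter here: without \b, the four-digit pattern would also match inside 35000.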

Slide 18: Regular Expression Basics: Greedy and Lazy Repetition

  • Greedy repetition: "Use up as much stuff as possible" - this is the default
    • /<.+>/ - greedy repetition: <tt> fixed </tt> text.

  • Lazy or non-greedy repetition: "Use as little stuff as possible" - add a ? to the quantifier
    • /<.+?>/ - non-greedy repetition: <tt> fixed </tt> text.
    • /<\/?[^<>]+>/ - similar, but faster: <tt> fixed </tt> text.
      • second match: <tt> fixed </tt> text.
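The difference between the three patterns becomes visible when the matched text is captured and printed. A minimal sketch using the sample string from this slide:

```perl
use strict;
use warnings;

my $text = "<tt> fixed </tt> text.";

# Greedy: .+ runs to the end of the string, then backs up to the last ">"
my ($greedy) = $text =~ /(<.+>)/;

# Lazy: .+? stops at the first ">" it can reach
my ($lazy) = $text =~ /(<.+?>)/;

# Negated character set: cannot run past a ">" in the first place
my @tags = $text =~ /(<\/?[^<>]+>)/g;

print "greedy: $greedy\n";   # greedy: <tt> fixed </tt>
print "lazy:   $lazy\n";     # lazy:   <tt>
print "tags:   @tags\n";     # tags:   <tt> </tt>
```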

Slide 19: Regular Expression Basics: Grouping and Backreferences

  • Purpose: Enclose tokens in parentheses to group them together for later reference.
    • In Perl, $1 holds the content of the first group, $2 the second, ...
  • /Ready, (set)/ - test for mandatory sub-string, set is captured: Ready, set, go.
    • if( $str =~ m/Ready, (set)/ ) { print "found $1\n"; }
  • /Ready, (set)?/ - test for optional sub-string, set is captured: Ready, set, go.

  • Grouping without creating a capturing group:
    • /Ready, (?:set[,\s]*)?/ - test for optional sub-string without capturing: Ready, set, go.
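A sketch of how capturing, optional, and non-capturing groups behave in Perl. The string "Ready, go." and the trailing (\w+) group are additions made for this example.

```perl
use strict;
use warnings;

my $str = "Ready, set, go.";

# Mandatory group: in list context the match returns the captured text
my ($captured) = $str =~ /Ready, (set)/;

# Optional group: the match still succeeds when "set" is missing,
# but the group then holds undef
my @optional = "Ready, go." =~ /Ready, (set)?/;

# (?:...) groups without capturing, so the only capture is the word after it
my @groups = $str =~ /Ready, (?:set[,\s]*)?(\w+)/;

print "captured: $captured\n";
print "optional group is ", ( defined $optional[0] ? "set" : "undefined" ), "\n";
print "groups: @groups\n";
```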

Slide 20: Regular Expression Basics: Grouping for Search and Replace

  • Task: Switch the first two words:
    my $str = "that is the question";
    $str =~ s/(\w+)([^\w]+)(\w+)/$3$2$1/;
    # matched: "that is the question"
    # result:  "is that the question"

Slide 21: Regular Expressions: Complete Example

Slide 22: Performance

is_it_hot_in_here.jpg
  • Example: Read & parse 10K lines of CSV file

  • Native ColdFusion: 100,000 ms
  • ColdFusion with regex: 11,000 ms
  • Source: Ben Nadel blog

  • Native Perl regex with I/O: 15.8 ms
  • Native Perl regex without I/O: 3.5 ms

Slide 23: Regular Expressions: Modifiers

  • Matching operations can have various modifiers to change the behavior.
  • Example: To change a string into title case we have to find the first character of every word and change it to upper case.
    • The "every" part is done with the "g" (global) modifier
    • The "upper case" part is done by calling the uc() function using the "e" (evaluate) modifier, which runs the replacement as Perl code
  • my $str = "this is my regex world.";
    $str =~ s/\b(\w)/uc($1)/ge;
    # matches: this is my regex world.
    # result:  This Is My Regex World.
  • More modifiers, see http://perldoc.perl.org/perlre.html
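A few other commonly used modifiers, shown as a brief sketch. The date pattern and sample values are assumptions made for this example.

```perl
use strict;
use warnings;

# i: ignore case
my $ci = "PERL" =~ /perl/i ? 1 : 0;

# g in list context returns all matches; assigning to () counts them
my $words = () = "this is my regex world." =~ /\b\w/g;

# x: whitespace and comments inside the pattern are ignored,
# which makes longer patterns easier to read
my $date_re = qr/
    (\d{4}) - (\d{2}) - (\d{2})    # year, month, day
/x;
my ( $year, $month, $day ) = "2017-06-01" =~ $date_re;

print "$ci $words $year $month $day\n";   # prints: 1 5 2017 06 01
```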

Slide 24: Advanced Regular Expressions: Lookaround

  • Purpose: Test for "stuff" before or after the current location, without making it part of the match

  • /c(?=k)/ - positive lookahead: Matches quick but not active
  • /c(?!k)/ - negative lookahead: Matches active but not quick
  • /(?<=a)b/ - positive lookbehind
  • /(?<!a)b/ - negative lookbehind
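The lookaround patterns above can be tested as follows. This is a sketch; the words "cab" and "cob" and the thousands-separator substitution are additions for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# Lookahead: test what follows the "c" without including it in the match
my $quick_pos  = "quick"  =~ /c(?=k)/ ? 1 : 0;
my $active_pos = "active" =~ /c(?=k)/ ? 1 : 0;
my $quick_neg  = "quick"  =~ /c(?!k)/ ? 1 : 0;
my $active_neg = "active" =~ /c(?!k)/ ? 1 : 0;

# The lookahead is only a test: the matched text is "c", not "ck"
my ($matched) = "quick" =~ /(c(?=k))/;

# Lookbehind: test what precedes the "b"
my $cab = "cab" =~ /(?<=a)b/ ? 1 : 0;
my $cob = "cob" =~ /(?<!a)b/ ? 1 : 0;

# Both combined: insert a comma at every position that has a digit before it
# and a multiple of three digits after it
( my $number = "1234567" ) =~ s/(?<=\d)(?=(?:\d{3})+\b)/,/g;

print "$quick_pos $active_pos $quick_neg $active_neg $matched $cab $cob $number\n";
# prints: 1 0 0 1 c 1 1 1,234,567
```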

Slide 25: Advanced Regular Expressions: Logical AND

  • Question: How can I do a logical AND using regex?

  • A1: Use logical OR with permutation:
    /foo.*bar|bar.*foo/ - not practical with many ANDs

  • A2: Use positive lookahead to test for each item:
    if( $str =~ m/^(?=.*foo)(?=.*bar)(?=.*baz)/s ) {
        print "found foo, bar and baz\n";
    }
  • Note: If one of the positive lookaheads fails, the whole regex fails, hence logical AND
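The lookahead approach extends to any number of terms when the pattern is built from a list. The helper name contains_all is hypothetical; this is a sketch of the idea, not code from the talk.

```perl
use strict;
use warnings;

# Returns 1 if $str contains every one of @words, in any order
sub contains_all {
    my ( $str, @words ) = @_;

    # One positive lookahead per word; quotemeta keeps metacharacters
    # in the words from being treated as regex syntax
    my $lookaheads = join '', map { '(?=.*' . quotemeta($_) . ')' } @words;

    # All lookaheads are tested from the start of the string;
    # the s modifier lets .* cross line breaks
    return $str =~ /^$lookaheads/s ? 1 : 0;
}

print contains_all( "baz, then bar, then foo", qw( foo bar baz ) ), "\n";  # 1
print contains_all( "foo and bar only",        qw( foo bar baz ) ), "\n";  # 0
```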

Slide 26: Advanced Regular Expressions: Backreference in Regex

  • Backreferences can be used to reuse part of a regex match within the regex itself.
  • \1 contains the content of the first group, \2 the second, ...
  • Use case: For any HTML tag find the matching end tag and capture everything in between.
  • my $str = "This is <i>italic text</i>.";
    if( $str =~ m/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i ) {
        # match: This is <i>italic text</i>.
        print "found $2\n"; # $2 contains: italic text
    }
  • Note: <\/\1> identifies the matching HTML end tag. The \/ is an escaped slash, \1 contains the tag name found in the first group.
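Another common use of a backreference inside the pattern is finding accidentally doubled words. The helper name find_doubled_word is hypothetical and the sample sentences are made up; this is a sketch only.

```perl
use strict;
use warnings;

# Returns the first word that is immediately repeated, or undef if none
sub find_doubled_word {
    my ($text) = @_;

    # (\w+) captures a word, \s+ skips the gap, \1 requires the same word again;
    # the i modifier lets "The the" count as a repetition
    return $text =~ /\b(\w+)\s+\1\b/i ? $1 : undef;
}

my $found = find_doubled_word( "This is is a test." );
my $none  = find_doubled_word( "Paris is nice." );

print "doubled word: $found\n";                          # doubled word: is
print "no doubled word\n" unless defined $none;          # no doubled word
```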

Slide 27: Advanced Regular Expressions: Parse Nested Structures

  • TWiki's SpreadSheetPlugin uses regular expressions to parse and evaluate formulas.
  • Task: Parse and evaluate the following using regular expressions:
    speradsheet-formula.png
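One common way to handle nesting with regular expressions is to evaluate the innermost expression first and repeat until nothing is left to replace. The sketch below illustrates that idea with a hypothetical mini-language that only knows SUM and MAX. It is not the SpreadSheetPlugin's actual code, and the formula shown is not the one from the slide image.

```perl
use strict;
use warnings;
use List::Util qw( sum max );

# Hypothetical function table for this sketch
my %func = (
    SUM => sub { sum( @_ ) },
    MAX => sub { max( @_ ) },
);

sub evaluate {
    my ($formula) = @_;

    # [^()]* cannot cross a parenthesis, so only an innermost call matches.
    # Replacing it with its value exposes the next level for the next pass.
    1 while $formula =~
        s{\$(\w+)\(([^()]*)\)}{ $func{$1}->( split /\s*,\s*/, $2 ) }e;

    return $formula;
}

print evaluate( '$SUM(1, $MAX(2, 3), 4)' ), "\n";   # prints: 8
```

In the example, $MAX(2, 3) is replaced by 3 on the first pass, and $SUM(1, 3, 4) is replaced by 8 on the second. The sketch does no error handling: an unknown function name would make the script die.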

Slide 28: Regular Expression Puzzle

  • Print this puzzle, and use pencil and eraser!
    regex-crossword-puzzle

This regular expression crossword was created by Dan Gulotta from an idea by Palmer Mebane, and was part of MIT's Mystery Hunt in 2013.

Slide 29: Resources: Online References and Books

  • Book: Introducing Regular Expressions, by Michael Fitzgerald, O'Reilly Media, ISBN:1449392687
  • Book: Mastering Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly Media, ISBN:0596528124

Slide 30: Resources: Visualize Regular Expressions

Slide 31: Resources: Online Regular Expression Tester

Slide 32: Questions & Discussions

This presentation:

http://bit.ly/regex2017   (http://twiki.org/cgi-bin/view/Codev/TWikiPresentation2017x06x01Regex)

Slide 33: BACKUP SLIDES

Slide 34: What is TWiki?

  • twiki-logo-200x72.png TWiki is a wiki engine and wiki application platform, established in 1998
  • TWiki is specifically built for the workplace
  • Large number of TWiki Extensions: 200+ actively maintained extensions
  • Open Source software (GPL) with active community, hosted at http://TWiki.org/
  • 1,000 downloads per month, 600,000 total downloads, an estimated 50,000+ installations in 130+ countries
  • Est. $27M of human capital invested (ref. Ohloh)
  • SourceForge 2009 "Best Enterprise Project" Finalist (among 230,000 open source projects)

Slide 35: TWiki Open Source Community

Slide 36: TWiki I/O Architecture

twiki-io-architecture.png

Notes


See also: RegularExpression, What is TWiki, TWiki presentation, public TWiki sites, TWiki screenshots, TWiki.org Blog

-- Author: Peter Thoeny - 2017-06-01

Topic revision: r2 - 2021-01-26 - PeterThoeny
 
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.