
Presentation: Regular Expressions, To Match Or Not, That is the Question, SVPerl, 2013-03-07

This is the presentation material for the talk on "Regular Expressions: To Match Or Not, That is the Question" at Silicon Valley Perl, 2013-03-07. Peter Thoeny prepared this talk for developers who want to scan and process text quickly.



   Copyright © 2013 by TWiki.org. This presentation may be reproduced as long as the copyright notice is retained and a link is provided back to http://twiki.org/.  


Slide 1: Regular Expressions: To Match Or Not, That is the Question

regex-example.png



Presentation for Silicon Valley Perl, 2013-03-07

-- Peter Thoeny - peter09[at]thoeny.org - TWiki.org

Slide 2: About Peter

  • Peter Thoeny
  • CTO and Founder of TWiki.org, an Enterprise Collaboration Platform provider
  • Founder of TWiki, the open source wiki for the enterprise, managing the project for 10+ years
  • Invented the concept of structured wikis - where free form wiki content can be structured with tailored wiki applications
  • Thought-leader in wikis and social software, featured in numerous articles and technology conferences including LinuxWorld, Business Week, Wall Street Journal and more
  • Graduate of the Swiss Federal Institute of Technology in Zurich
  • Lived in Japan for 8 years, developing CASE tools
  • Now in the Silicon Valley for 10+ years
  • Co-author of Wikis for Dummies book

Slide 3: Manipulating Text: The Conventional Way

  • Task: Process text to:
    • validate input
    • change text from one format to another
    • extract snippets of text (e.g. screen scraping), ...
  • Text manipulation using procedural Perl:
    $i = index( $sString, $sSubString );
    $str = substr( $sOldString, 4 );
    $i = length( $sString );
    $str = join( ':', split( '', $sString ) );
    # chomp, chr, hex, lc, ord, pack, sprintf, uc, ...
    # Many CPAN modules...
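
The built-in functions above can be tried out in a short script. This is a minimal sketch; the sample strings and variable names are chosen here for illustration and are not from the talk.

```perl
use strict;
use warnings;

my $text = "Hello World.";

# index() returns the zero-based position of a substring, or -1 if absent.
my $pos = index( $text, "World" );

# substr() with a single offset returns everything from that offset onward.
my $tail = substr( $text, 6 );

# length() counts the characters in the string.
my $len = length( $text );

# Splitting on an empty pattern yields single characters; join glues them
# back together with a separator.
my $spaced = join( ':', split( //, "Perl" ) );

print "$pos | $tail | $len | $spaced\n";   # prints: 6 | World. | 12 | P:e:r:l
```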

Slide 4: Manipulating Text: The Object Oriented Way

  • Text manipulation using OO Perl:
    use String;
    my $str = new String( "Perl" );
    printf( "Length: %d\n", $str->length );
    printf( "First char: %s\n", $str->charAt( 0 ) );
    printf( "Position of 'er': %d\n", $str->indexOf( 'er' ) );
  • Details: CPAN:String

Slide 5: Regular Expression in Wikipedia

regex-wikipedia.png

Slide 6: Why use regular expressions?

twiki-regex.png
  • "Wildcard on steroids" - *.txt becomes .*\.txt$
  • Process large amounts of text over and over again
  • Very powerful and flexible
  • Extremely fast

  • But: There is a learning curve
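
As a small illustration of the "wildcard on steroids" point, the sketch below filters a list of file names with the regex form of *.txt. The file names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical file names, chosen only for illustration.
my @files = ( "notes.txt", "notes.txt.bak", "readme.md", "a.txt" );

# Shell wildcard *.txt expressed as a regex: any characters, a literal dot,
# then "txt" at the end of the string.
my @txt = grep { /.*\.txt$/ } @files;

print join( ", ", @txt ), "\n";   # prints: notes.txt, a.txt
```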

Slide 7: Manipulating Text: Use Regular Expressions

twiki-regex.png
  • Match a string with a regex pattern:
    if( $str =~ m/.../ ) {
      # ...
    }
  • Replace a pattern with a string:
    $str =~ s/.../.../;
  • Split up a string using a regex delimiter:
    my @items = grep { /.../ } split( /\s*,\s*/, $str );
  • There is also:
    • pos
    • qr/STRING/
    • quotemeta
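
The operators on this slide use "..." as a placeholder for the pattern. The following sketch fills in concrete patterns to show each operator at work; the sample data and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $str = "apple, banana ,cherry,  avocado";

# Match: true if the pattern occurs anywhere in the string.
my $has_banana = ( $str =~ m/banana/ ) ? 1 : 0;

# Split on commas with optional surrounding whitespace,
# then keep only the items that start with "a".
my @a_items = grep { /^a/ } split( /\s*,\s*/, $str );

# Replace: s/// changes the first occurrence (here on a copy of $str).
( my $replaced = $str ) =~ s/banana/kiwi/;

# quotemeta() escapes metacharacters so text is matched literally;
# qr// precompiles the resulting pattern for reuse.
my $literal = quotemeta( "1+1=2" );
my $re      = qr/$literal/;
my $found   = ( "Proof: 1+1=2." =~ $re ) ? 1 : 0;

# pos() reports where the last /g match on a string ended.
my $scan  = "aXbXc";
my $after = -1;
if ( $scan =~ /X/g ) {
    $after = pos( $scan );   # offset just past the first "X"
}

print "$has_banana | @a_items | $replaced | $found | $after\n";
```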

Slide 8: Regular Expression Basics

  • Color code used in examples:
    • Hello World. - string we operate on (teletype)
    • regex - regular expression (red)
    • match - match we found in string (green)

  • /l/ - single character: Hello World.
    • second and third match: Hello World.
  • /or/ - character sequence: Hello World.
  • Metacharacters with special meaning: ( ) parentheses, [ ] square brackets, { } curly braces, \ backslash, ^ caret, $ dollar, . period, | vertical bar, * asterisk, + plus, ? question mark
  • /\./ - escape metacharacter: Hello World.
  • /./ - without escape: Hello World.
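
Because the color highlighting of the slides does not survive in plain text, here is a minimal sketch that shows what each pattern on this slide matches in "Hello World.". The variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "Hello World.";

# In list context with /g, a match returns every occurrence.
my @l_matches = $str =~ /l/g;

# A character sequence matches as a unit; $-[0] holds the start offset
# of the last successful match.
my $or_offset = ( $str =~ /or/ ) ? $-[0] : -1;

# An escaped dot matches only a literal period ...
my @literal_dots = $str =~ /\./g;

# ... while an unescaped dot matches any character except a line break.
my @any_chars = $str =~ /./g;

printf( "l: %d, 'or' at offset %d, literal dots: %d, any char: %d\n",
    scalar @l_matches, $or_offset, scalar @literal_dots, scalar @any_chars );
# prints: l: 3, 'or' at offset 7, literal dots: 1, any char: 12
```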

Slide 9: Regular Expression Basics: Character Sets

  • /[oe]/ - one char out of several chars: Hello World.
    • second and third match: Hello World.
  • /H[ea]llo/ - match Hello and Hallo: Hallo World.
  • /[0-9]/ - match range of chars: ID3735.
  • /#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]/ - match hex color: Color: #ff4444.
  • /W[^x]/ - negate a character set: Hello World.
  • /[^0-9]/ - any non-digit: ID3735.
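
The character-set patterns above can be checked with a short script. This is a sketch with illustrative variable names; the parentheses around each pattern are added only to capture the matched text.

```perl
use strict;
use warnings;

# One character out of a set: matches "e", "o" and "o".
my @vowels = "Hello World." =~ /[oe]/g;

# Accept both spellings, reject others.
my @greetings = grep { /H[ea]llo/ } ( "Hello World.", "Hallo World.", "Hullo World." );

# A range of characters: the first digit found.
my ( $digit ) = "ID3735." =~ /([0-9])/;

# Six hex digits after a hash sign.
my ( $color ) = "Color: #ff4444."
    =~ /(#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f])/;

# Negated sets: "W" followed by anything but "x", and the first non-digit.
my ( $w_pair )    = "Hello World." =~ /(W[^x])/;
my ( $non_digit ) = "ID3735."      =~ /([^0-9])/;

print "@vowels | @greetings | $digit | $color | $w_pair | $non_digit\n";
```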

Slide 10: Regular Expression Basics: Shorthand Character Sets

  • /\d/ - match a digit: ID3735.
  • /\s/ - whitespace (space, tabs, line breaks): Hello World.
  • /\w/ - match a "word character" (alphanumeric chars & underscore): Hello World.
  • /\s\w/ - whitespace followed by word char: Hello World.
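
A quick sketch of the shorthand sets, using the same sample strings as the slide. Variable names are illustrative.

```perl
use strict;
use warnings;

# \d matches each digit in turn.
my @digits = "ID3735." =~ /\d/g;

# \s matches the single space; \w matches the ten letters.
my @spaces     = "Hello World." =~ /\s/g;
my @word_chars = "Hello World." =~ /\w/g;

# Whitespace followed by a word character: the space plus "W".
my ( $pair ) = "Hello World." =~ /(\s\w)/;

printf( "digits: %s, spaces: %d, word chars: %d, pair: '%s'\n",
    join( '', @digits ), scalar @spaces, scalar @word_chars, $pair );
# prints: digits: 3735, spaces: 1, word chars: 10, pair: ' W'
```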

Slide 11: Regular Expression Basics: Special Character Sets

  • /\t/ - match a tab character (ASCII 0x09)
  • /\r/ - match a carriage return (0x0D)
  • /\n/ - match a line feed (0x0A)
  • /\r\n/ - match a Windows line end
  • /\x5E/ - match a specific character by its hexadecimal index in the character set, e.g. a caret ^
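
A typical use of these escapes is normalizing line endings or splitting tab-separated data. The following is a minimal sketch with made-up sample strings.

```perl
use strict;
use warnings;

# Convert Windows line ends (\r\n) to Unix line ends (\n) on a copy.
my $dos = "line one\r\nline two\r\n";
( my $unix = $dos ) =~ s/\r\n/\n/g;

# Split a tab-separated record into columns.
my @cols = split( /\t/, "a\tb\tc" );

# \x5E is the caret character, written by its hexadecimal code.
my $has_caret = ( "2^10" =~ /\x5E/ ) ? 1 : 0;

print scalar( @cols ), " columns, caret found: $has_caret\n";
```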

Slide 12: Regular Expression Basics: Period (Dot)

  • /./ - dot matches a single char, except line break chars: Hello World.
    • "Dot matches all" or "single line" mode (the /s modifier in Perl) makes the dot also match line breaks
  • /[^\n]/ - long version of /./ on Unix: Hello World.
  • /[^\r\n]/ - long version of /./ on Windows: Hello World.
  • /H.llo/ - grok English and German: Hallo World.
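
The effect of the line-break rule shows up as soon as a string spans two lines. This sketch compares the default behavior with Perl's /s modifier; the sample strings are assumptions.

```perl
use strict;
use warnings;

my $text = "Hello\nWorld.";

# By default the dot does not match the line break between the words.
my $default_mode = ( $text =~ /Hello.World/ ) ? 1 : 0;

# With /s ("single line" mode) the dot matches the line break as well.
my $dot_all_mode = ( $text =~ /Hello.World/s ) ? 1 : 0;

# The dot stands for exactly one character, so "Hllo" is not matched.
my @greetings = grep { /H.llo/ } ( "Hello World.", "Hallo Welt.", "Hllo" );

print "default: $default_mode, with /s: $dot_all_mode, greetings: ",
    scalar( @greetings ), "\n";
```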

Slide 13: Regular Expression Basics: Anchors

  • Anchors don't match any characters - they match a position in a string.
  • /^/ - match at the start of the string: |Hello World.
    • "Multi-line" mode (the /m modifier in Perl) makes /^/ also match after any line break
  • /$/ - match at the end of the string: Hello World.|
  • /\b/ - match at a word boundary: |Hello World.
    • second match: Hello| World.
  • /\B/ - match at every position where \b cannot match: H|ello World.
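
Since anchors match positions rather than characters, one way to see them is to insert a marker at every position where they match. The sketch below does that for \b and also shows ^ with and without /m; the sample data is illustrative.

```perl
use strict;
use warnings;

my $str = "Hello World.";

# ^ and $ tie a pattern to the start and end of the string.
my $hello_at_start = ( $str =~ /^Hello/ )    ? 1 : 0;
my $world_at_start = ( $str =~ /^World/ )    ? 1 : 0;
my $world_at_end   = ( $str =~ /World\.$/ )  ? 1 : 0;

# Insert "|" at every word boundary to make the positions visible.
( my $marked = $str ) =~ s/\b/|/g;

# With /m, ^ also matches after each line break.
my $lines         = "one\ntwo\nthree";
my @first_default = $lines =~ /^(\w)/g;
my @first_multi   = $lines =~ /^(\w)/mg;

print "$marked\n";   # prints: |Hello| |World|.
```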

Slide 14: Regular Expression Basics: Alternation (Logic OR)

  • /dog|cat|fish/ - alternation: I like cats and dogs.
    • second match: I like cats and dogs.

  • What about logic AND? Stay tuned.
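
A short sketch of alternation follows. It also shows that | has the lowest precedence of all regex operators, so anchors bind to a single alternative unless the alternatives are grouped. The word "catalog" is a made-up test input.

```perl
use strict;
use warnings;

my $str = "I like cats and dogs.";

# Each alternative is tried at every position; matches come back in
# the order they appear in the string.
my @pets = $str =~ /dog|cat|fish/g;

# Without grouping this reads as "^dog" OR "cat" ...
my $loose = ( "catalog" =~ /^dog|cat/ ) ? 1 : 0;

# ... while grouping applies both anchors to every alternative.
my $strict = ( "catalog" =~ /^(?:dog|cat)$/ ) ? 1 : 0;

print "@pets | loose: $loose | strict: $strict\n";   # prints: cat dog | loose: 1 | strict: 0
```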

Slide 15: Regular Expression Basics: Repetition

  • /colou?r/ - a question mark ? matches preceding token in regex zero or one time: Red colour.
  • /<[a-zA-Z][a-zA-Z0-9]*>/ - an asterisk * matches zero or multiple times: <tt> text </tt>
  • /#[0-9a-f]+/ - a plus + matches one or multiple times: Color: #ff4444.
  • /<[a-zA-Z0-9]+>/ - what is the problem with this? <tt> text </tt>
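
The sketch below runs the patterns from this slide. The last two lines show one likely answer to the question in the final bullet (an assumption, since the slide leaves it open): the looser pattern also accepts a "tag" that starts with a digit, such as <2>.

```perl
use strict;
use warnings;

# ? makes the "u" optional, so both spellings match.
my @spellings = grep { /colou?r/ } ( "Red colour.", "Red color.", "Red colr." );

# * allows zero or more further characters after the first letter.
my ( $tag ) = "<tt> text </tt>" =~ /(<[a-zA-Z][a-zA-Z0-9]*>)/;

# + requires at least one hex digit.
my ( $color ) = "Color: #ff4444." =~ /(#[0-9a-f]+)/;

# The looser pattern accepts <2>, the stricter one does not.
my $loose  = ( "if 1 <2> 3" =~ /<[a-zA-Z0-9]+>/ )         ? 1 : 0;
my $strict = ( "if 1 <2> 3" =~ /<[a-zA-Z][a-zA-Z0-9]*>/ ) ? 1 : 0;

print scalar( @spellings ), " | $tag | $color | loose: $loose | strict: $strict\n";
```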

Slide 16: Regular Expression Basics: More Repetition

  • /#[0-9a-f]{6}/ - curly braces { } match a specific number of times: Color: #ff444400.
  • /\b[1-9][0-9]{3}\b/ - match a number between 1,000 and 9,999: Value 3500.
  • /\.[a-zA-Z]{2,6}\// - match a specific range of times: http://bma.art.museum/
  • /\b[1-9][0-9]{3,4}\b/ - match a number between 1,000 and 99,999: Range 3500.
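
To check the counted repetitions, this sketch applies each pattern to a few inputs. The extra numbers (350, 35000, 350000) are test values added here, not from the slide.

```perl
use strict;
use warnings;

# {6} takes exactly six hex digits, so the trailing "00" is left out.
my ( $color ) = "Color: #ff444400." =~ /(#[0-9a-f]{6})/;

# Exactly four digits between word boundaries: 1,000 to 9,999.
my @four = grep { /\b[1-9][0-9]{3}\b/ } ( "350", "3500", "35000" );

# {2,6} accepts two to six letters before the slash.
my ( $tld ) = "http://bma.art.museum/" =~ /(\.[a-zA-Z]{2,6}\/)/;

# Four or five digits: 1,000 to 99,999.
my @four_or_five = grep { /\b[1-9][0-9]{3,4}\b/ } ( "350", "3500", "35000", "350000" );

print "$color | @four | $tld | @four_or_five\n";
# prints: #ff4444 | 3500 | .museum/ | 3500 35000
```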

Slide 17: Regular Expression Basics: Greedy and Lazy Repetition

  • Greedy repetition: "Use up as much stuff as possible" - this is the default
    • /<.+>/ - greedy repetition: <tt> fixed </tt> text.

  • Lazy or non-greedy repetition: "Use as little stuff as possible" - add a ? to the quantifier
    • /<.+?>/ - non-greedy repetition: <tt> fixed </tt> text.
    • /<\/?[^<>]+>/ - similar, but faster: <tt> fixed </tt> text.
      • second match: <tt> fixed </tt> text.
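
The difference between the three patterns is easiest to see by capturing what each one matches in the same string. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $str = "<tt> fixed </tt> text.";

# Greedy: .+ runs to the last ">" in the string.
my ( $greedy ) = $str =~ /(<.+>)/;

# Lazy: .+? stops at the first ">" it can.
my ( $lazy ) = $str =~ /(<.+?>)/;

# Negated set: never crosses a "<" or ">", so each tag is matched on its own.
my @tags = $str =~ /(<\/?[^<>]+>)/g;

print "greedy: $greedy\n";   # prints: greedy: <tt> fixed </tt>
print "lazy:   $lazy\n";     # prints: lazy:   <tt>
print "tags:   @tags\n";     # prints: tags:   <tt> </tt>
```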

Slide 18: Regular Expression Basics: Grouping and Backreferences

  • Purpose: Enclose tokens in parentheses to group them together for later reference.
    • In Perl, $1 holds the content of the first group, $2 the second, ...
  • /Ready, (set)/ - test for mandatory sub-string, set is captured: Ready, set, go.
    • if( $str =~ m/Ready, (set)/ ) { print "found $1\n"; }
  • /Ready, (set)?/ - test for optional sub-string, set is captured: Ready, set, go.

  • Grouping without creating a capturing group:
    • /Ready, (?:set[,\s]*)?/ - test for optional sub-string without capturing: Ready, set, go.
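
The sketch below shows what ends up in the capture variables for the patterns on this slide. The second input string ("Ready, go.") and the trailing (\w+) group are additions made here for illustration.

```perl
use strict;
use warnings;

# Mandatory group: the match returns the captured text.
my ( $word ) = "Ready, set, go." =~ /Ready, (set)/;

# Optional group: the match still succeeds without "set",
# but the capture is then undefined.
my ( $optional ) = "Ready, go." =~ /Ready, (set)?/;

# Non-capturing group: (?: ) groups without using up $1,
# so the first capture is the word that follows.
my ( $next_a ) = "Ready, set, go." =~ /Ready, (?:set[,\s]*)?(\w+)/;
my ( $next_b ) = "Ready, go."      =~ /Ready, (?:set[,\s]*)?(\w+)/;

print "captured: $word, next words: $next_a and $next_b\n";
# prints: captured: set, next words: go and go
```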

Slide 19: Regular Expression Basics: Grouping for Search and Replace

  • Task: Switch the first two words:
    my $str = "that is the question";
    $str =~ s/(\w+)([^\w]+)(\w+)/$3$2$1/;
    # matched: "that is" (the first two words)
    # result:  "is that the question"

Slide 20: Regular Expressions: Complete Example

Slide 21: Performance

is_it_hot_in_here.jpg
  • Example: Read & parse 10K lines of CSV file

  • Native ColdFusion: 100,000 ms
  • ColdFusion with regex: 11,000 ms
  • Source: Ben Nadel blog

  • Native Perl regex with I/O: 15.8 ms
  • Native Perl regex without I/O: 3.5 ms

Slide 22: Regular Expressions: Modifiers

  • Matching operations can have various modifiers to change the behavior.
  • Example: To change a string into title case we have to find the first character of every word and change it to upper case.
    • The "every" part is done with the "g" (global) modifier
    • The "upper case" part is done by calling the uc() function using the "e" (evaluate) modifier
  • my $str = "this is my regex world.";
    $str =~ s/\b(\w)/uc($1)/ge;
    # matches: the first letter of each word (t, i, m, r, w)
    # result:  This Is My Regex World.
  • More modifiers, see http://perldoc.perl.org/perlre.html
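
Besides g and e, two other commonly used modifiers are i (ignore case) and x (allow whitespace and comments inside the pattern). The following sketch combines them; the sample strings are assumptions.

```perl
use strict;
use warnings;

# i: ignore case.
my $ci = ( "HELLO world" =~ /hello/i ) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored, which helps
# document longer expressions.
my $hex = qr/
    \#              # literal hash, escaped because a bare hash starts a comment under x
    ([0-9a-f]{6})   # six hex digits, captured
/xi;
my ( $digits ) = "Color: #FF4444." =~ $hex;

# g and e together, as in the title case example above.
( my $title = "this is my regex world." ) =~ s/\b(\w)/uc($1)/ge;

print "$ci | $digits | $title\n";   # prints: 1 | FF4444 | This Is My Regex World.
```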

Slide 23: Advanced Regular Expressions: Lookaround

  • Purpose: Test for "stuff" before or after the current location, without capturing

  • /c(?=k)/ - positive lookahead: Matches quick but not active
  • /c(?!k)/ - negative lookahead: Matches active but not quick
  • /(?<=a)b/ - positive lookbehind
  • /(?<!a)b/ - negative lookbehind
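
A sketch of all four lookaround forms follows. The words "cab" and "bed" and the thousands-separator example are additions made here for illustration.

```perl
use strict;
use warnings;

# Lookahead: test what follows the "c" without consuming it.
my @k_ahead    = grep { /c(?=k)/ } qw( quick active );
my @no_k_ahead = grep { /c(?!k)/ } qw( quick active );

# The lookahead is not part of the match: only "c" is captured.
my ( $matched ) = "quick" =~ /(c(?=k))/;

# Lookbehind: test what precedes the "b".
my @a_behind    = grep { /(?<=a)b/ } qw( cab bed );
my @no_a_behind = grep { /(?<!a)b/ } qw( cab bed );

# Practical use: insert a comma at every position that has a digit before it
# and a multiple of three digits after it.
( my $grouped = "1234567" ) =~ s/(?<=\d)(?=(?:\d{3})+\b)/,/g;

print "@k_ahead | @no_k_ahead | $matched | @a_behind | @no_a_behind | $grouped\n";
# prints: quick | active | c | cab | bed | 1,234,567
```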

Slide 24: Advanced Regular Expressions: Logical AND

  • Question: How can I do a logical AND using regex?

  • A1: Use logical OR with permutation:
    /foo.*bar|bar.*foo/ - not practical with many ANDs

  • A2: Use positive lookahead to test for each item:
    if( $str =~ m/^(?=.*foo)(?=.*bar)(?=.*baz)/s ) {
        print "found foo, bar and baz\n";
    }
  • Note: If one of the positive lookaheads fails, the whole regex fails, hence logical AND
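
The lookahead approach generalizes to any number of terms. The sketch below builds the pattern from a list of words; contains_all is a hypothetical helper name introduced here, not part of the talk.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains every word in @words, in any order.
sub contains_all {
    my ( $text, @words ) = @_;

    # One positive lookahead per word; quotemeta() keeps the words literal.
    my $pattern = join( '', map { '(?=.*' . quotemeta( $_ ) . ')' } @words );

    # /s lets .* cross line breaks, as in the example above.
    return ( $text =~ /^$pattern/s ) ? 1 : 0;
}

print contains_all( "baz, then bar, then foo", qw( foo bar baz ) ), "\n";   # prints: 1
print contains_all( "foo and bar only",        qw( foo bar baz ) ), "\n";   # prints: 0
```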

Slide 25: Advanced Regular Expressions: Backreference in Regex

  • Backreferences can be used to reuse part of a regex match within the regex itself.
  • \1 contains the content of the first group, \2 the second, ...
  • Use case: For any HTML tag find the matching end tag and capture everything in between.
  • my $str = "This is <i>italic text</i>.";
    if( $str =~ m/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i ) {
        # match: This is <i>italic text</i>.
        print "found $2\n"; # $2 contains: italic text
    }
  • Note: <\/\1> identifies the matching HTML end tag. The \/ is an escaped slash, \1 contains the tag name found in the first group.
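
Another common use of a backreference is finding accidentally doubled words. The sketch below shows that case next to the tag example; it uses the i modifier so that the uppercase character classes also match lowercase tag names. The sample strings are assumptions.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first group matched.
my ( $doubled ) = "This is is a test." =~ /\b(\w+)\s+\1\b/;

# Matching end tag: the closing tag name has to equal the opening one.
my $tag_re = qr/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i;

my ( $name, $content ) = "This is <i>italic text</i>." =~ $tag_re;

# Mismatched tags do not match at all.
my $mismatch = ( "This is <b>broken</i>." =~ $tag_re ) ? 1 : 0;

print "doubled: $doubled, tag: $name, content: $content, mismatch: $mismatch\n";
# prints: doubled: is, tag: i, content: italic text, mismatch: 0
```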

Slide 26: Advanced Regular Expressions: Parse Nested Structures

  • TWiki's SpreadSheetPlugin uses regular expressions to parse and evaluate formulas.
  • Task: Parse and evaluate the following using regular expressions:
    speradsheet-formula.png
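
The formula image is not available in this text version. As a rough sketch of the general idea only (not the SpreadSheetPlugin's actual code), nested calls can be evaluated with a regex by repeatedly replacing the innermost call, the one whose argument list contains no parentheses, with its result. The function names SUM, MAX and MIN, the formula syntax, and the helpers evaluate and apply_function are all assumptions made for this example.

```perl
use strict;
use warnings;
use List::Util qw( sum max min );

# Hypothetical function table for this sketch.
my %func = (
    SUM => sub { sum( @_ ) },
    MAX => sub { max( @_ ) },
    MIN => sub { min( @_ ) },
);

# Evaluate one call, given its name and a flat, comma-separated argument list.
sub apply_function {
    my ( $name, $args ) = @_;
    my $code = $func{$name} or die "Unknown function: $name\n";
    my @values = split( /\s*,\s*/, $args );
    s/^\s+|\s+$//g for @values;          # trim stray whitespace
    return $code->( @values );
}

# Replace the innermost call NAME(args without parentheses) by its value,
# and repeat until no call is left.
sub evaluate {
    my ( $formula ) = @_;
    1 while $formula =~ s/([A-Z]+)\(([^()]*)\)/apply_function( $1, $2 )/e;
    return $formula;
}

print evaluate( "SUM(1, MAX(2, 3), 4)" ), "\n";   # prints: 8
```

Each pass only ever sees a call without nested parentheses, which is why a plain regex is enough even though the overall structure is nested.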

Slide 27: Regular Expression Puzzle

  • Print this puzzle, and use pencil and eraser!
    regex-crossword-puzzle

This regular expression crossword was created by Dan Gulotta from an idea by Palmer Mebane, and was part of MIT's Mystery Hunt in 2013.

Slide 28: Resources: Online References and Books

  • Book: Introducing Regular Expressions, by Michael Fitzgerald, O'Reilly Media, ISBN:1449392687
  • Book: Mastering Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly Media, ISBN:0596528124

Slide 29: Resources: Visualize Regular Expressions

Slide 30: Questions & Discussions








This presentation: http://bit.ly/regexlearn   (http://twiki.org/cgi-bin/view/Codev/TWikiPresentation2013x03x07)

Slide 31: BACKUP SLIDES


















Slide 32: What is TWiki?

  • TWiki is a wiki engine and wiki application platform, established in 1998
  • TWiki is specifically built for the workplace
  • Large number of TWiki Extensions: 200+ actively maintained extensions
  • Open Source software (GPL) with active community, hosted at http://TWiki.org/
  • 4,000+ downloads per month, 600,000 total downloads, estimated 50,000+ installations, 130+ countries
  • Est. $27M of human capital invested (ref. Ohloh)
  • SourceForge 2009 "Best Enterprise Project" Finalist (among 230,000 open source projects)

Slide 33: TWiki Open Source Community

Slide 34: TWiki I/O Architecture

twiki-io-architecture.png


See also: RegularExpression, What is TWiki, TWiki presentation, public TWiki sites, TWiki screenshots, TWiki.org Blog

-- Peter Thoeny - 2013-03-07

Comments

Thanks Peter for sharing and great presentation.

Very informative and team interactive session.

-- Min

Amazing! Looking forward to more such sessions.

-- Ram Govind Krishnan

Great presentation. Thanks Peter!

-- Satoshi Yagi

Great experience, learned a lot of new stuff.

-- Evgeni Stavinov

Excellent presentation, thanks again Peter!

-- Marc Kandel

Above is some feedback from the Meetup.com event page at http://www.meetup.com/SVPerl/events/89342932/

Thanks all, it was a fun session!

-- Peter Thoeny - 2013-03-08

Outstanding slides! Wish I could have been at the presentation.

-- Paul Reiber - 2013-03-09

Peter,

I am enjoying learning Twiki. I am volunteering at internet archive in San Francisco where they have been using Twiki since 2004, so I have something to practice on. I have been using Drupal and WordPress for the other nonprofits where I volunteer and I find them all very similar.

I am looking around for videos and other learning tools online to help me to improve my skills -- I was surprised that there wasn't more on youtube.

Lori Guidos
KE6INO
lori@disabledcommunityPLEASENOSPAM.org

-- Lori Guidos - 2013-05-06

Nice, TWiki at the Internet Archive! At this time we do not have good YouTubeVideos for learning. For now I recommend reading Blog articles, reading TWiki docs, and asking questions in the Support forum. If you like videos you could get involved and produce some how-to videos for TWiki.

-- Peter Thoeny - 2013-05-06
