Presentation: Regular Expressions, To Match Or Not, That is the Question, Silicon Valley Perl, 2017-06-01
This is the presentation material for the talk "Regular Expressions: To Match Or Not, That is the Question" at Silicon Valley Perl, NVIDIA, Santa Clara, CA, 2017-06-01. Peter Thoeny prepared this talk for developers who want to scan and process text quickly.
Copyright © 2017 by TWiki.org. This presentation may be reproduced as long as the copyright notice is retained and a link is provided back to http://twiki.org/.
Slide 1: Regular Expressions: To Match Or Not, That is the Question
Presentation for Silicon Valley Perl, 2017-06-01
-- Peter Thoeny
- @PeterThoeny
- peter09[at]thoeny.org
- TWiki.org
Slide 2: About Peter Thoeny
Slide 3: Manipulating Text: The Conventional Way
- Task: Process text to:
- validate input
- change text from one format to another
- extract snippets of text (e.g. screen scraping), ...
- Text manipulation using procedural Perl:
$i = index( $sString, $sSubString );
$str = substr( $sOldString, 4 );
$i = length( $sString );
$str = join( ':', split( '', $sString ) );
# chomp, chr, hex, lc, ord, pack, sprintf, uc, ...
# Many CPAN modules...
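The built-in functions above can be tried with concrete values. This is a minimal sketch; the sample strings are assumptions chosen for illustration and do not come from the slides.

```perl
use strict;
use warnings;

my $text = 'Hello World.';    # sample input, chosen for illustration

my $pos   = index( $text, 'World' );             # offset of a substring, -1 if absent
my $tail  = substr( $text, 6 );                  # everything from offset 6 onward
my $len   = length( $text );                     # number of characters
my $colon = join( ':', split( //, 'Perl' ) );    # split into chars, rejoin with ':'

print "index:  $pos\n";      # 6
print "substr: $tail\n";     # World.
print "length: $len\n";      # 12
print "join:   $colon\n";    # P:e:r:l
```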
Slide 4: Manipulating Text: The Object Oriented Way
- Text manipulation using OO Perl:
use String;
my $str = new String( "Perl" );
printf( "Length: %d\n", $str->length );
printf( "First char: %s\n", $str->charAt( 0 ) );
printf( "Position of 'er': %d\n", $str->indexOf( 'er' ) );
- Details: CPAN:String
Slide 5: Regular Expression in Wikipedia
Slide 6: Why use regular expressions?
- "Wildcard on steroids" - the wildcard *.txt becomes the regex .*\.txt$
- Process large amounts of text over and over again
- Very powerful and flexible
- Extremely fast
- But: There is a learning curve
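To make the wildcard comparison concrete, here is a minimal sketch that turns a shell-style wildcard into a regex. The helper name wildcard_to_regex is hypothetical, it only handles * and ?, and unlike the pattern on the slide it also anchors the start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a shell-style wildcard into an anchored regex.
# Only "*" and "?" are treated as wildcards; all other characters are literal.
sub wildcard_to_regex {
    my ($wildcard) = @_;
    my $pattern = '';
    for my $char ( split //, $wildcard ) {
        $pattern .= $char eq '*' ? '.*'
                  : $char eq '?' ? '.'
                  :                quotemeta($char);
    }
    return qr/^$pattern$/;
}

my $re = wildcard_to_regex('*.txt');
for my $name ( 'notes.txt', 'notes.txt.bak', 'readme' ) {
    printf "%-14s %s\n", $name, ( $name =~ $re ? 'match' : 'no match' );
}
```

Only notes.txt matches here, because the trailing $ requires the name to end in .txt.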
Slide 7: xkcd: Regular Expressions
Slide 8: Manipulating Text: Use Regular Expressions
- Match a string with a regex pattern:
if( $str =~ m/.../ ) {
# ...
}
- Replace a pattern with a string:
$str =~ s/.../.../; # Perl syntax
var newstr = str.replace( /.../, "..." ); // JavaScript syntax
- Split a string using a regex delimiter:
my @items = grep { /.../ } split( /\s*,\s*/, $str );
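The three operations on this slide can be run end to end with real patterns in place of the "..." placeholders. This is a sketch; the sample string and the patterns are assumptions for illustration.

```perl
use strict;
use warnings;

my $str = 'apple , banana,cherry ,  date';    # sample input, an assumption

# Match: test whether the string contains a pattern.
my $has_banana = $str =~ m/banana/ ? 1 : 0;

# Replace: substitute the first occurrence of a pattern (working on a copy).
( my $replaced = $str ) =~ s/cherry/cranberry/;

# Split: use a regex delimiter, then keep only the items containing an "a".
my @items = grep { /a/ } split( /\s*,\s*/, $str );

print "has banana: $has_banana\n";
print "replaced:   $replaced\n";
print "items:      @items\n";      # apple banana date
```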
Slide 9: Regular Expression Basics
- Color code used in examples:
- Hello World. - string we operate on (teletype)
- regex - regular expression (red)
- match - match we found in string (green)
- /l/ - single character: Hello World.
- second and third match: Hello World.
- /or/ - character sequence: Hello World.
- Metacharacters with special meaning: ( ) parentheses, [ ] square brackets, { } curly braces, \ backslash, ^ caret, $ dollar, . period, | vertical bar, ? question mark, * asterisk, + plus
- /\./ - escape metacharacter: Hello World.
- /./ - without escape: Hello World.
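Since the match highlighting does not survive outside the slides, the following sketch prints where each pattern matches in "Hello World.". The helper name match_offsets is hypothetical.

```perl
use strict;
use warnings;

# Return the start offset of every match of $re in $str.
sub match_offsets {
    my ( $re, $str ) = @_;
    my @offsets;
    while ( $str =~ /$re/g ) {
        push @offsets, $-[0];    # $-[0] is the start offset of the last match
    }
    return @offsets;
}

my $str = 'Hello World.';
for my $test ( [ 'l', qr/l/ ], [ 'or', qr/or/ ], [ '\.', qr/\./ ], [ '.', qr/./ ] ) {
    my ( $label, $re ) = @$test;
    printf "/%s/ matches at offsets: %s\n", $label,
        join( ' ', match_offsets( $re, $str ) );
}
```

The escaped /\./ matches only the final period at offset 11, while the unescaped /./ matches at every one of the 12 characters.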
Slide 10: Regular Expression Basics: Character Sets
- /[oe]/ - one char out of several chars: Hello World.
- second and third match: Hello World.
- /H[ea]llo/ - match Hello and Hallo: Hallo World.
- /[0-9]/ - match range of chars: ID3735.
- /#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]/ - match hex color: Color: #ff4444.
- /W[^x]/ - negate a character set: Hello World.
- /[^0-9]/ - any non-digit: ID3735.
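The character sets above can be checked by listing everything each pattern matches. This is a minimal sketch; the helper name all_matches is hypothetical.

```perl
use strict;
use warnings;

# Return all matches of $re in $str (list-context match with the g modifier).
sub all_matches {
    my ( $re, $str ) = @_;
    my @matches = $str =~ /$re/g;
    return @matches;
}

my $hex_color = qr/#[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]/;

print join( ' ', all_matches( qr/[oe]/,     'Hello World.' ) ),    "\n";   # e o o
print join( ' ', all_matches( qr/H[ea]llo/, 'Hallo World.' ) ),    "\n";   # Hallo
print join( ' ', all_matches( qr/[0-9]/,    'ID3735.' ) ),         "\n";   # 3 7 3 5
print join( ' ', all_matches( $hex_color,   'Color: #ff4444.' ) ), "\n";   # #ff4444
print join( ' ', all_matches( qr/W[^x]/,    'Hello World.' ) ),    "\n";   # Wo
print join( ' ', all_matches( qr/[^0-9]/,   'ID3735.' ) ),         "\n";   # I D .
```

Note that a negated set still consumes a character: /W[^x]/ matches "Wo", not just "W".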
Slide 11: Regular Expression Basics: Shorthand Character Sets
- /\d/ - match a digit: ID3735.
- /\s/ - whitespace (space, tabs, line breaks): Hello World.
- /\w/ - match a "word character" (alphanumeric chars & underscore): Hello World.
- /\s\w/ - whitespace followed by word char: Hello World.
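A short sketch of the shorthand sets applied to the two sample strings from the slide; the variable names are assumptions.

```perl
use strict;
use warnings;

my $id   = 'ID3735.';
my $text = 'Hello World.';

my @digits     = $id   =~ /\d/g;       # every digit
my @spaces     = $text =~ /\s/g;       # every whitespace character
my @word_chars = $text =~ /\w/g;       # every word character
my ($pair)     = $text =~ /(\s\w)/;    # whitespace followed by a word char

print "digits:     @digits\n";                      # 3 7 3 5
print "whitespace: ", scalar(@spaces),     "\n";    # 1
print "word chars: ", scalar(@word_chars), "\n";    # 10
print "pair:       '$pair'\n";                      # ' W'
```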
Slide 12: Regular Expression Basics: Special Character Sets
- /\t/ - match a tab character (ASCII 0x09)
- /\r/ - match a carriage return (0x0D)
- /\n/ - match a line feed (0x0A)
- /\r\n/ - match a Windows line end
- /\x5E/ - match a specific character by its hexadecimal index in the character set, e.g. a caret ^
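As a sketch of where these escapes are useful, the following converts Windows line ends, counts tabs, and matches a caret by its hex code. The sample data is an assumption for illustration.

```perl
use strict;
use warnings;

my $dos_text = "line one\r\nline two\r\n";    # sample text with Windows line ends

# Convert Windows line ends (CR LF) to Unix line ends (LF), working on a copy.
( my $unix_text = $dos_text ) =~ s/\r\n/\n/g;

# Count the tab characters in a tab-separated record.
my $record    = "id\tname\temail";
my $tab_count = () = $record =~ /\t/g;

# \x5E is the caret, so this matches a literal "^" without escaping it.
my $has_caret = 'x^2' =~ /\x5E/ ? 'yes' : 'no';

print "CRs left:  ", ( $unix_text =~ /\r/ ? 'some' : 'none' ), "\n";
print "tabs:      $tab_count\n";    # 2
print "has caret: $has_caret\n";    # yes
```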
Slide 13: Regular Expression Basics: Period (Dot)
- /./ - dot matches a single char, except line break chars: Hello World.
- "dot matches all" or "single line" mode (the s modifier in Perl) makes the dot also match line breaks
- /[^\n]/ - long version of /./ on Unix: Hello World.
- /[^\r\n]/ - long version of /./ on Windows: Hello World.
- /H.llo/ - grok English and German: Hallo World.
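The effect of the line break rule is easiest to see on a two-line string. This is a minimal sketch using Perl's s modifier for "single line" mode; the sample strings are assumptions.

```perl
use strict;
use warnings;

my $text = "Hello\nWorld.";

my ($default)     = $text =~ /(H.+)/;     # dot stops at the line break
my ($single_line) = $text =~ /(H.+)/s;    # with /s the dot also matches "\n"

# The dot stands for exactly one character, so "Hllo" does not match.
my @greetings = grep { /H.llo/ } ( 'Hello', 'Hallo', 'Hllo' );

print "default: ", length($default),     " chars\n";    # 5
print "with /s: ", length($single_line), " chars\n";    # 12
print "H.llo:   @greetings\n";                          # Hello Hallo
```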
Slide 14: Regular Expression Basics: Anchors
- Anchors don't match any characters - they match a position in a string.
- /^/ - match at the start of the string: |Hello World.
- "Multi-line" mode (the m modifier in Perl) makes /^/ also match after any line break
- /$/ - match at the end of the string: Hello World.|
- /\b/ - match at a word boundary: |Hello World.
- second match: Hello| World.
- /\B/ - match at every position where \b cannot match: H|ello World.
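Because anchors match positions rather than characters, a simple way to see them is to substitute a marker at each position. The sketch below does that for \b and also compares /^/ with and without the m modifier; the sample strings are assumptions.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

my @default = $text =~ /^(\w+)/g;     # ^ matches only at the start of the string
my @multi   = $text =~ /^(\w+)/mg;    # with /m, ^ also matches after each "\n"

# Mark every word boundary with a "|" to make the positions visible.
( my $marked = 'Hello World.' ) =~ s/\b/|/g;

print "default: @default\n";    # first
print "with /m: @multi\n";      # first second
print "marked:  $marked\n";     # |Hello| |World|.
```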
Slide 15: Regular Expression Basics: Alternation (Logic OR)
- /dog|cat|fish/ - alternation: I like cats and dogs.
- second match: I like cats and dogs.
- What about logic AND? Stay tuned.
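A minimal sketch of alternation on the sample sentence. The second pattern, which wraps the alternatives in a group to match whole words with an optional plural "s", is an illustrative extension and not from the slide.

```perl
use strict;
use warnings;

my $text = 'I like cats and dogs.';

# Plain alternation: matches are returned in order of appearance.
my @found = $text =~ /dog|cat|fish/g;

# Grouped alternation: whole words only, with an optional plural "s".
my @whole = $text =~ /\b(?:dog|cat|fish)s?\b/g;

print "found: @found\n";    # cat dog
print "whole: @whole\n";    # cats dogs
```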
Slide 16: Regular Expression Basics: Repetition
- /colou?r/ - a question mark ? matches the preceding token zero or one time: Red colour.
- /<[a-zA-Z][a-zA-Z0-9]*>/ - an asterisk * matches the preceding token zero or more times: <tt> text </tt>
- /#[0-9a-f]+/ - a plus + matches the preceding token one or more times: Color: #ff4444.
- /<[a-zA-Z0-9]+>/ - what is the problem with this? <tt> text </tt>
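One likely answer to the question above is that the looser pattern also accepts tag names that start with a digit, such as <1>. The sketch below compares the two patterns; the ^ and $ anchors and the variable names are assumptions added so that each candidate is tested as a whole.

```perl
use strict;
use warnings;

my $strict = qr/^<[a-zA-Z][a-zA-Z0-9]*>$/;    # first char must be a letter
my $loose  = qr/^<[a-zA-Z0-9]+>$/;            # any mix of letters and digits

for my $tag ( '<tt>', '<h1>', '<1>' ) {
    printf "%-5s strict: %-3s loose: %s\n", $tag,
        ( $tag =~ $strict ? 'yes' : 'no' ),
        ( $tag =~ $loose  ? 'yes' : 'no' );
}

# The optional "u" accepts both spellings, but not a doubled "u".
my @spellings = grep { /colou?r/ } qw(color colour colouur);
print "colou?r: @spellings\n";    # color colour
```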
Slide 17: Regular Expression Basics: More Repetition
- /#[0-9a-f]{6}/ - curly braces { } match a specific number of times: Color: #ff444400.
- /\b[1-9][0-9]{3}\b/ - match a number between 1,000 and 9,999: Value 3500.
- /\.[a-zA-Z]{2,6}\// - match a specific range of times: http://bma.art.museum/
- /\b[1-9][0-9]{3,4}\b/ - match a number between 1,000 and 99,999: Range 3500.
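The counted repetitions above can be verified with a few inputs. This is a sketch; the capture groups and the candidate numbers are assumptions added for illustration.

```perl
use strict;
use warnings;

# {6} takes exactly six hex digits, even when more follow.
my ($color) = 'Color: #ff444400.' =~ /(#[0-9a-f]{6})/;

# {2,6} takes two to six letters; ".art" is skipped because no "/" follows it.
my ($suffix) = 'http://bma.art.museum/' =~ /(\.[a-zA-Z]{2,6}\/)/;

# The word boundaries keep the number patterns from matching inside longer numbers.
my @four_digits  = grep { /\b[1-9][0-9]{3}\b/ }   qw(350 3500 35000);
my @four_or_five = grep { /\b[1-9][0-9]{3,4}\b/ } qw(350 3500 35000);

print "color:        $color\n";           # #ff4444
print "suffix:       $suffix\n";          # .museum/
print "four digits:  @four_digits\n";     # 3500
print "four or five: @four_or_five\n";    # 3500 35000
```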
Slide 18: Regular Expression Basics: Greedy and Lazy Repetition
- Greedy repetition: "Use up as much stuff as possible" - this is the default
- /<.+>/ - greedy repetition: <tt> fixed </tt> text.
- Lazy or non-greedy repetition: "Use as little stuff as possible" - add a ? to the quantifier
- /<.+?>/ - non-greedy repetition: <tt> fixed </tt> text.
- /<\/?[^<>]+>/ - similar, but faster: <tt> fixed </tt> text.
- second match: <tt> fixed </tt> text.
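The difference between the three patterns shows up clearly when the matched text is printed. A minimal sketch, with capture groups added as an assumption so the match can be extracted:

```perl
use strict;
use warnings;

my $html = '<tt> fixed </tt> text.';

my ($greedy) = $html =~ /(<.+>)/;          # runs on to the last ">"
my ($lazy)   = $html =~ /(<.+?>)/;         # stops at the first ">"
my @tags     = $html =~ /<\/?[^<>]+>/g;    # negated set: one tag per match

print "greedy: $greedy\n";    # <tt> fixed </tt>
print "lazy:   $lazy\n";      # <tt>
print "tags:   @tags\n";      # <tt> </tt>
```

The negated character set avoids the backtracking that the lazy dot needs, which is the reason the slide calls it faster.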
Slide 19: Regular Expression Basics: Grouping and Backreferences
- Purpose: Enclose tokens in parentheses to group them together for later reference.
- In Perl, $1 holds the content of the first group, $2 the second, ...
- /Ready, (set)/ - test for mandatory sub-string, set is captured: Ready, set, go.
if( $str =~ m/Ready, (set)/ ) { print "found $1\n"; }
- /Ready, (set)?/ - test for optional sub-string, set is captured: Ready, set, go.
- Grouping without creating a capturing group:
- /Ready, (?:set[,\s]*)?/ - test for optional sub-string without capturing: Ready, set, go.
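The sketch below contrasts capturing, optional and non-capturing groups. The trailing (\w+) in the last pattern is an assumption added to show that the non-capturing group does not use up a capture number.

```perl
use strict;
use warnings;

# Capturing group: the first capture holds "set" after a successful match.
my ($captured) = 'Ready, set, go.' =~ /Ready, (set)/;

# Optional group: the match succeeds without "set", but the capture is undef.
my ($optional) = 'Ready, go.' =~ /Ready, (set)?/;

# Non-capturing group: (?:...) groups without creating a capture,
# so the word after the optional "set, " lands in the first capture.
my ($next_word) = 'Ready, set, go.' =~ /Ready, (?:set[,\s]*)?(\w+)/;

print "captured:  $captured\n";                                     # set
print "optional:  ", ( defined $optional ? $optional : 'undef' ), "\n";   # undef
print "next word: $next_word\n";                                    # go
```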
Slide 20: Regular Expression Basics: Grouping for Search and Replace
- Task: Switch the first two words:
my $str = "that is the question";
$str =~ s/(\w+)([^\w]+)(\w+)/$3$2$1/;
# matched: "that is" in "that is the question"
# result: "is that the question"
Slide 21: Regular Expressions: Complete Example
Slide 22: Performance
- Example: Read & parse 10K lines of CSV file
- Native ColdFusion: 100,000 ms
- ColdFusion with regex: 11,000 ms
- Source: Ben Nadel blog
- Native Perl regex with I/O: 15.8 ms
- Native Perl regex without I/O: 3.5 ms
Slide 23: Regular Expressions: Modifiers
- Matching operations can have various modifiers to change the behavior.
- Example: To change a string into title case we have to find the first character of every word and change it to upper case.
- The "every" part is done with the "g" (global) modifier
- The "upper case" part is done by calling the uc() function using the "e" (evaluate) modifier
my $str = "this is my regex world.";
$str =~ s/\b(\w)/uc($1)/ge;
# matches: this is my regex world.
# result: This Is My Regex World.
- More modifiers, see http://perldoc.perl.org/perlre.html
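A few other commonly used modifiers, as a minimal sketch: i (ignore case), x (extended formatting) and r (non-destructive substitution, available since Perl 5.14). The sample strings are assumptions.

```perl
use strict;
use warnings;

# i: ignore case; g: find all matches. The "= () =" idiom counts them.
my $count = () = 'Perl perl PERL' =~ /perl/gi;

# x: allow whitespace and comments inside the pattern for readability.
my $hex_color = qr/
    \#              # a literal hash sign, escaped because x mode uses it for comments
    [0-9a-f]{6}     # exactly six hex digits
/xi;
my ($color) = 'Color: #FF4444.' =~ /($hex_color)/;

# r: return the modified copy and leave the original untouched.
my $original = 'this is my regex world.';
my $title    = $original =~ s/\b(\w)/uc($1)/ger;

print "count:    $count\n";       # 3
print "color:    $color\n";       # #FF4444
print "title:    $title\n";       # This Is My Regex World.
print "original: $original\n";    # this is my regex world.
```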
Slide 24: Advanced Regular Expressions: Lookaround
- Purpose: Test for "stuff" before or after the current location, without making it part of the match
- /c(?=k)/ - positive lookahead: Matches quick but not active
- /c(?!k)/ - negative lookahead: Matches active but not quick
- /(?<=a)b/ - positive lookbehind: matches a b that is preceded by an a
- /(?<!a)b/ - negative lookbehind: matches a b that is not preceded by an a
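The four lookaround forms can be checked against short word lists. The words cab and bed and the thousands-separator example at the end are assumptions added for illustration; the latter combines a lookbehind and a lookahead so that the substitution inserts a comma without consuming any digits.

```perl
use strict;
use warnings;

my @ahead_pos  = grep { /c(?=k)/ }  qw(quick active);    # c followed by k
my @ahead_neg  = grep { /c(?!k)/ }  qw(quick active);    # c not followed by k
my @behind_pos = grep { /(?<=a)b/ } qw(cab bed);         # b preceded by a
my @behind_neg = grep { /(?<!a)b/ } qw(cab bed);         # b not preceded by a

# Insert a comma at every position that has a digit before it and
# a multiple of three digits between it and the end of the string.
( my $number = '1234567' ) =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

print "c(?=k):  @ahead_pos\n";     # quick
print "c(?!k):  @ahead_neg\n";     # active
print "(?<=a)b: @behind_pos\n";    # cab
print "(?<!a)b: @behind_neg\n";    # bed
print "number:  $number\n";        # 1,234,567
```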
Slide 25: Advanced Regular Expressions: Logical AND
- Question: How can I do a logical AND using regex?
- A1: Use logical OR with permutation: /foo.*bar|bar.*foo/ - not practical with many ANDs
- A2: Use positive lookahead to test for each item:
if( $str =~ m/^(?=.*foo)(?=.*bar)(?=.*baz)/s ) {
print "found foo, bar and baz\n";
}
- Note: If one of the positive lookaheads fails, the whole regex fails, hence logical AND
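The lookahead approach generalizes to any number of terms by building the pattern from a list. This is a sketch; the helper name contains_all is hypothetical, and quotemeta is used on the assumption that the terms are literal strings, not regex fragments.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str contains every word in @words, in any order.
sub contains_all {
    my ( $str, @words ) = @_;
    my $lookaheads = '';
    for my $word (@words) {
        $lookaheads .= '(?=.*' . quotemeta($word) . ')';    # one lookahead per word
    }
    return $str =~ /^$lookaheads/s ? 1 : 0;
}

print contains_all( 'baz, then bar, then foo', qw(foo bar baz) ), "\n";    # 1
print contains_all( 'foo and bar only',        qw(foo bar baz) ), "\n";    # 0
```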
Slide 26: Advanced Regular Expressions: Backreference in Regex
- Backreferences can be used to reuse part of a regex match within the regex itself.
- \1 contains the content of the first group, \2 the second, ...
- Use case: For any HTML tag find the matching end tag and capture everything in between.
my $str = "This is <i>italic text</i>.";
if( $str =~ m/<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i ) {
# match: This is <i>italic text</i>.
print "found $2\n"; # $2 contains: italic text
}
- Note: <\/\1> identifies the matching HTML end tag. The \/ is an escaped slash, \1 contains the tag name found in the first group.
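Another common use of a backreference inside the pattern is finding an accidentally doubled word. This is a minimal sketch; the helper name doubled_word and the sample sentences are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first word that is immediately repeated,
# or undef if there is none. \1 must match the same text that (\w+) captured.
sub doubled_word {
    my ($str) = @_;
    return $str =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

for my $sentence ( 'This is is a test.', 'Read the theme.' ) {
    my $word = doubled_word($sentence);
    printf "%-20s %s\n", $sentence, defined $word ? "doubled: $word" : 'no doubled word';
}
```

The trailing \b keeps "the theme" from counting as a doubled "the".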
Slide 27: Advanced Regular Expressions: Parse Nested Structures
- TWiki's SpreadSheetPlugin uses regular expressions to parse and evaluate formulas.
- Task: Parse and evaluate the following using regular expressions:
Slide 28: Regular Expression Puzzle
- Print this puzzle, and use pencil and eraser!
This regular expression crossword was created by Dan Gulotta from an idea by Palmer Mebane, and was part of MIT's Mystery Hunt in 2013.
Slide 29: Resources: Online References and Books
- Book: Introducing Regular Expressions, by Michael Fitzgerald, O'Reilly Media, ISBN:1449392687
- Book: Mastering Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly Media, ISBN:0596528124
Slide 30: Resources: Visualize Regular Expressions
Slide 31: Resources: Online Regular Expression Tester
Slide 32: Questions & Discussions
This presentation:
http://bit.ly/regex2017 (http://twiki.org/cgi-bin/view/Codev/TWikiPresentation2017x06x01Regex)
Slide 33: BACKUP SLIDES
Slide 34: What is TWiki?
- TWiki is a wiki engine and wiki application platform, established in 1998
- TWiki is specifically built for the workplace
- Large number of TWiki Extensions: 200+ actively maintained extensions
- Open Source software (GPL) with active community, hosted at http://TWiki.org/
- 1,000 downloads per month, 600,000 total downloads, an estimated 50,000+ installations, 130+ countries
- Est. $27M of human capital invested (ref. Ohloh)
- SourceForge 2009 "Best Enterprise Project" Finalist (among 230,000 open source projects)
Slide 35: TWiki Open Source Community
Slide 36: TWiki I/O Architecture
Notes
See also: RegularExpression, What is TWiki, TWiki presentation, public TWiki sites, TWiki screenshots, TWiki.org Blog
-- Author: Peter Thoeny - 2017-06-01