Tardate 2016: Explaining Regular Expressions

Sunday, February 24, 2008

Explaining Regular Expressions

Unless you work with regular expressions every day, the details can easily blur. You remember enough to know when a regex could be useful, but not enough to write one without reaching for a reference manual.

There are a few tools out there to help you write and understand regular expressions, including some IDEs that can provide assistance (like Komodo).

Sometimes you just want a quick explanation of a regular expression you have seen in some code. The YAPE::Regex::Explain Perl module is one tool that does just that. Turning the module into a command-line tool takes only a short script:

#!/usr/bin/perl -w
use YAPE::Regex::Explain;
# Explain the regex passed as the first command-line argument
print YAPE::Regex::Explain->new($ARGV[0])->explain;

I've also turned it into a simple CGI utility for those times when my regex memory fails me. You can use it here: regexplainr (sorry, you may find that site offline but here's the source code).

In a previous post, On Parsing CSV and other Delimited/Quoted Formats, I used the following regular expression to parse a whitespace-delimited string:

"([^"]+?)"\s?|([^\s]+)\s?|\s

Regexplainr produces the following commentary:

The regular expression:

(?-imsx:"([^"]+?)"\s?|([^\s]+)\s?|\s)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^"]+?                   any character except: '"' (1 or more
                             times (matching the least amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^\s]+                   any character except: whitespace (\n,
                             \r, \t, \f, and " ") (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

(Permalink to this regexplanation)
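
To see the three alternatives in action, here is a minimal sketch that applies the same pattern in Python rather than Perl. The tokenize helper and the sample input are hypothetical and are not taken from the original CSV post; the sketch only assumes that Python's re module treats this particular pattern the same way Perl does.

```python
import re

# The pattern from the post: a quoted field (\1), or a bare field (\2),
# or a single leftover whitespace character.
PATTERN = re.compile(r'"([^"]+?)"\s?|([^\s]+)\s?|\s')


def tokenize(text):
    """Return the fields found in text (hypothetical helper for illustration)."""
    tokens = []
    for match in PATTERN.finditer(text):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            tokens.append(quoted)   # \1: the text between double quotes
        elif bare is not None:
            tokens.append(bare)     # \2: a run of non-whitespace characters
        # Otherwise the third alternative consumed one whitespace
        # character, which is a separator and not a field.
    return tokens


print(tokenize('one "two three" four'))  # ['one', 'two three', 'four']
```

Each match fills at most one of the two capture groups, so checking which group is set tells you which alternative fired. The trailing \s? in the first two alternatives swallows one separator, and the final \s alternative mops up any extra whitespace between fields.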

Book tip (thanks to Tony): O'Reilly's Mastering Regular Expressions. Available on Google Books, and also from Amazon.

8 comments:

Tony said...: Thanks for the tips, excellent stuff in there. The book: Mastering Regular Expressions, Third Edition
By Jeffrey E. F. Friedl, is one of the best technical books I've ever read and it taught me heaps about regex.; 7:26 PM
Unknown said...: Yes Tony, yet another great O'Reilly book. Totally agree with the recommendation!; 7:59 PM
Anonymous said...: Nice tip, however may I ask if formatting can be restored in the regexplainer output? That would make it even more usable.; 9:27 PM
Unknown said...: @anonymous: hmm, maybe. What's the formatting issue you have? Can you post an example?; 10:01 PM
Unknown said...: Hey Paul,

Actually, when I ran a sample regex, everything appeared on one line, as if the newlines were eaten (alive).

Hence I thought I'd let you know. I used IE; can't test FF from work.

R; 10:13 PM
Unknown said...: Raj, I think I got it - you have a regex expressed over multiple lines, and want that preserved? You are correct - at the moment the regexplainer I posted assumes the input regex is all in one line. I'll put it down as a little project for the weekend;-); 1:39 AM
Unknown said...: Thanks Paul ... I already got it bookmarked. Very handy tool.; 4:09 AM
Unknown said...: Thanks Raj.
btw, that weekend project is still pending. But then I didn't say which weekend;-); 10:12 AM
