Sunday, February 24, 2008

Explaining Regular Expressions

Unless you get to work with regular expressions every day, the details can easily blur. You remember enough to know when a regex could be useful, but not enough to write it without reaching for a reference manual.

There are a few tools out there to help you write and understand regular expressions, including some IDEs that can provide assistance (like Komodo).

Sometimes you just want a quick explanation of a regular expression you might have seen in some code. The YAPE::Regex::Explain Perl module is one tool that helps you do just that. Turning the module into a command-line tool takes only a couple of lines:
#!/usr/bin/perl -w
use YAPE::Regex::Explain;
# Explain the regex passed as the first command-line argument.
print YAPE::Regex::Explain->new($ARGV[0])->explain;
I've also turned it into a simple CGI utility for those times when my regex memory fails me. You can use it here: regexplainr (sorry, you may find that site offline, but here's the source code).

In a previous post, On Parsing CSV and other Delimited/Quoted Formats, I used the following regular expression to parse a whitespace-delimited string:
"([^"]+?)"\s?|([^\s]+)\s?|\s
Regexplainr produces the following commentary:
The regular expression:

(?-imsx:"([^"]+?)"\s?|([^\s]+)\s?|\s)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  "                      '"'
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [^"]+?               any character except: '"' (1 or more
                         times (matching the least amount
                         possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  "                      '"'
----------------------------------------------------------------------
  \s?                    whitespace (\n, \r, \t, \f, and " ")
                         (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  |                      OR
----------------------------------------------------------------------
  (                      group and capture to \2:
----------------------------------------------------------------------
    [^\s]+               any character except: whitespace (\n,
                         \r, \t, \f, and " ") (1 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
  )                      end of \2
----------------------------------------------------------------------
  \s?                    whitespace (\n, \r, \t, \f, and " ")
                         (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  |                      OR
----------------------------------------------------------------------
  \s                     whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
(Permalink to this regexplanation)
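
To see the three alternatives in action, here is a minimal sketch that applies the same pattern to a sample line using only core Perl. The helper name split_fields and the sample input are my own assumptions for illustration; they are not taken from the earlier CSV post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the post: a quoted field (\1), an unquoted field (\2),
# or a single stray whitespace character.
my $token_re = qr/"([^"]+?)"\s?|([^\s]+)\s?|\s/;

# Hypothetical helper (not from the original post): return the fields of one line.
sub split_fields {
    my ($line) = @_;
    my @fields;
    while ($line =~ /$token_re/g) {
        # Only one capture group participates per match; the bare \s branch sets neither.
        my $field = defined $1 ? $1 : $2;
        push @fields, $field if defined $field;
    }
    return @fields;
}

print join('|', split_fields('one "two three" four')), "\n";
# prints: one|two three|four
```

One consequence of [^"]+? requiring at least one character: an empty quoted field ("") does not match the first alternative, so it falls through to the second and comes back as a literal pair of quotes.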

Book tip: (thanks to Tony) O'Reilly's Mastering Regular Expressions. Available on Google Books, and also from Amazon.

8 comments:

Tony said...

Thanks for the tips, excellent stuff in there. The book Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl, is one of the best technical books I've ever read, and it taught me heaps about regex.

Paul said...

Yes Tony, yet another great O'Reilly book. Totally agree with the recommendation!

Anonymous said...

Nice tip. However, may I ask if formatting can be restored in the regexplainer output? That would make it even more usable.

Paul said...

@anonymous: hmm, maybe. What's the formatting issue you have? Can you post an example?

Raj J said...

Hey Paul,

Actually, when I ran a sample regex, everything appeared on one line, as if newlines were eaten (alive).

Hence I thought I'd let you know. I used IE; I can't test FF from work.

R

Paul said...

Raj, I think I got it - you have a regex expressed over multiple lines, and want that preserved? You are correct - at the moment the regexplainer I posted assumes the input regex is all on one line. I'll put it down as a little project for the weekend;-)

Raj J said...

Thanks Paul ... I already got it bookmarked. Very handy tool.

Paul said...

Thanks Raj.
btw, that weekend project is still pending. But then I didn't say which weekend;-)