Sunday, February 10, 2008

On Parsing CSV and other Delimited/Quoted Formats

Parsing delimited text that may have quoted elements is a perennial requirement. Quick-and-dirty parses can be achieved with regular expressions, but for more flexible and encapsulated parsing I've been checking out the opencsv Java library. Hat tip to Jakub Pawlowski for highlighting the library on his blog.

A Regular Expression Approach
Just recently I released and blogged about a JDeveloper Filter Add-in, and it contains a class called ExecShell [API, source] which needs to know how to break a command line into its component arguments. The command line is of course space-delimited, but may use quotes to group an argument with embedded spaces (so a simple split on spaces won't do).

The salient code below uses the REGEX to chop theCmdLine String into theCmdArray Vector of arguments:
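import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Vector<String> theCmdArray = new Vector<String>(0);
// quoted argument | unquoted argument | stray whitespace
String REGEX = "\"([^\"]+?)\"\\s?|([^\\s]+)\\s?|\\s";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(theCmdLine);
while (m.find())
{
    String arg = m.group().trim();
    // a pure-whitespace match trims to "", so skip it
    if (arg.length() > 0) theCmdArray.add(arg);
}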

The regular expression bears a little explaining, and is inspired by this example. Here's how it breaks down:

\"([^\"]+?)\"\\s?
Matches a group within double quotes. The group is a lazy match on one or more characters other than a double quote, optionally followed by a single whitespace character.
|([^\\s]+)\\s?
Or matches a group delimited by whitespace, optionally followed by a single whitespace character.
|\\s
Or matches a lone whitespace character, which is discarded.
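The breakdown above can be tried out in a small standalone class. This is a minimal sketch rather than the actual ExecShell code: the class name CmdLineTokenizer and the method tokenize are made up for illustration. It reads the capture groups instead of trimming the whole match, so the surrounding quotes are dropped and lone-whitespace matches add nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the Filter Add-in.
class CmdLineTokenizer {
    // Same pattern as in the post: quoted group | unquoted group | lone whitespace.
    private static final Pattern ARG =
        Pattern.compile("\"([^\"]+?)\"\\s?|([^\\s]+)\\s?|\\s");

    static List<String> tokenize(String cmdLine) {
        List<String> args = new ArrayList<String>();
        Matcher m = ARG.matcher(cmdLine);
        while (m.find()) {
            if (m.group(1) != null) {
                args.add(m.group(1));   // quoted argument, quotes stripped
            } else if (m.group(2) != null) {
                args.add(m.group(2));   // plain argument
            }
            // otherwise the match was a lone whitespace character: nothing to add
        }
        return args;
    }

    public static void main(String[] unused) {
        // Three spaces before /tmp exercise the lone-whitespace alternative.
        System.out.println(tokenize("cp \"my file.txt\"   /tmp"));
    }
}
```

Whether the quotes should be kept or stripped depends on what consumes the arguments; stripping them here is a choice made for the sketch.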

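In this case, we are using whitespace as the delimiter (appropriate for command lines). The regex can be adapted for other delimiters by replacing \\s with the delimiter. Note that trim() only strips whitespace, so with a non-whitespace delimiter each field should be read from capture group 1 or 2 rather than from the whole match. For example, to handle a comma-separated format: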
String REGEX = "\"([^\"]+?)\",?|([^,]+),?|,";
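Here is a rough sketch of how the comma variant might be used. The class name CommaSplitter and the method split are hypothetical. A match where neither group is set is a lone comma, which this sketch treats as an empty field. It does not handle doubled quotes inside a quoted field, and a trailing comma at the end of the line does not produce a final empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class CommaSplitter {
    // The comma-adapted pattern: quoted field | unquoted field | lone comma.
    private static final Pattern FIELD =
        Pattern.compile("\"([^\"]+?)\",?|([^,]+),?|,");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                fields.add(m.group(1));   // quoted field, may contain commas
            } else if (m.group(2) != null) {
                fields.add(m.group(2));   // unquoted field
            } else {
                fields.add("");           // lone comma: treated as an empty field
            }
        }
        return fields;
    }

    public static void main(String[] unused) {
        System.out.println(split("a,\"b,c\",,d"));
    }
}
```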

Using OpenCSV
The same space-delimited parsing requirement can be met with a couple of lines and the opencsv library:
CSVReader reader = new CSVReader(new StringReader(theCmdLine), ' ');
String[] s = reader.readNext();

Simple, yet currently not so robust. Since we define the delimiter to be a single space (overriding the default comma), other whitespace characters (like a tab) will not be recognised as delimiters. Further, repeated spaces will not be coalesced; each one will be treated as the delimiter for a new element.

Internally, CSVReader parses the input character by character, so adapting it to treat repeated delimiters as one would be reasonably straightforward.
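To give a feel for the idea, here is a hand-rolled character-by-character splitter that treats any run of whitespace as a single delimiter. It is only a sketch of the technique and is not opencsv's implementation; the class name WhitespaceSplitter and the method split are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a standalone sketch, not a patch to CSVReader.
class WhitespaceSplitter {
    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // inside a double-quoted section
        boolean inField = false;    // a field has been started
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
                inField = true;     // so that "" yields an empty field
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (inField) {      // first delimiter after a field ends it
                    fields.add(current.toString());
                    current.setLength(0);
                    inField = false;
                }
                // further delimiters in the same run are skipped
            } else {
                current.append(c);
                inField = true;
            }
        }
        if (inField) {
            fields.add(current.toString());   // flush the last field
        }
        return fields;
    }

    public static void main(String[] unused) {
        System.out.println(split("cp\t\"my file.txt\"   /tmp"));
    }
}
```

Using Character.isWhitespace means tabs and spaces are handled alike, which addresses both limitations noted above.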

8 comments:

Anonymous said...

your regular expression doesn't work.

Paul said...

@anonymous: it would help if you post an example that doesn't work.
It does the job in the code available above

Anonymous said...

Look for the CSV parsing regex over at snippets.dzone.com. Really worth a look!

Paul said...

@anonymous: nice tip! you are talking about this post yes? I'll check it out...

Wolf said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it should be helpful for new coders.

Brx said...

This is great, the RegEx works like a charm. Thanks.

Brx said...

Thanks for the regex, works like a charm :)