Tuesday, January 20, 2009

An interactive Regular Expression Editor

The other day I had to filter an HTTP access log. I figured I could write a dozen or two lines of Java code to do the job, which would take a dozen or so test-fix cycles. But I also knew that it could be done in just a few lines with a regular expression. There are three things I don't like about using regular expressions in Java code. First, the regular expression becomes difficult to edit because backslash and quote characters need to be escaped with an extra backslash. Second, I typically need even more test-fix cycles to get the expression right. Third, a non-trivial regular expression becomes difficult to read.

It occurred to me that most of my reservations could be eliminated with an interactive editor that would let me edit the regular expression in plain text and show me the results immediately. Since I'd been wanting to look at NetBeans' Matisse editor, I thought I'd be able to whip up an editor in less than half an hour and, by doing so, save time not just on the project I was working on, but also whenever I need to edit a regular expression in the future.

The Matisse editor in NetBeans is as easy as advertised. It took me only minutes to put together a basic editor. To my satisfaction, the time savings in editing regular expressions immediately proved to be enormous!

That was a few weeks ago. This week I was on vacation, and to kill time on the airplane, I upgraded the editor further. What I've added is the ability to convert the regular expression in the editor into Java. That is, the tool now takes care of escaping special characters such as backslashes, which makes it a sort of code generator. But how about round-trip engineering? I also threw in an option to take a snippet of Java code containing a regular expression and remove the escape characters. This makes it easy to revisit a regular expression later: you copy the expression from the Java code, paste it into the editor, hit convert, and edit the plain-text regular expression. When done, you generate the Java code for the expression and paste it back into the Java source file.

Sounds easy, no? What else could you wish for? Of course the whole thing could be integrated into NetBeans and/or Eclipse. Maybe an exercise for another plane trip.
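
For the curious, the two conversions can be sketched in a few lines of Java. This is not the editor's actual code, just a minimal illustration of the idea. The class and method names (RegexLiteralSketch, toJavaLiteral, fromJavaLiteral) are made up for this example, and the sketch only handles the escapes that matter most here: backslash, double quote, \n, \r and \t. Unicode escapes and string concatenation with + are left out.

```java
final class RegexLiteralSketch {

    /** Turns a plain-text regular expression into a quoted Java string literal. */
    static String toJavaLiteral(String regex) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : regex.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // one backslash becomes two
                case '"':  sb.append("\\\""); break; // quote gets a backslash
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Reverses toJavaLiteral: strips the quotes and removes the escaping. */
    static String fromJavaLiteral(String literal) {
        String body = literal.trim();
        if (body.length() < 2 || !body.startsWith("\"") || !body.endsWith("\"")) {
            throw new IllegalArgumentException("Not a quoted string literal: " + literal);
        }
        body = body.substring(1, body.length() - 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case 't': sb.append('\t'); break;
                    default:  sb.append(next); // covers \\ and \"
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String plain = "\\d+\\s\"quoted\"";      // the plain text \d+\s"quoted"
        String literal = toJavaLiteral(plain);   // ready to paste into Java source
        System.out.println(literal);
        System.out.println(fromJavaLiteral(literal).equals(plain));
    }
}
```

The round trip is the important property: whatever goes through toJavaLiteral should come back unchanged from fromJavaLiteral.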

Use the Interactive Regular Expression Editor!

I’ve made the application into a JNLP application, so you can launch it from the browser.

Launch Interactive Regular Expression Editor

Sample screenshot. Click to launch the editor

How do you use the editor? The screen consists of three edit panes. The middle pane is where you edit the regular expression. Any time you change the expression, the sample input in the top pane is evaluated, and the results are displayed in the bottom pane. When you're done editing the expression, click the "generate" button, and Java code will appear in the bottom pane. Copy that code and use it in your Java programs. If you later need to edit the expression again, simply paste the Pattern definition into the regular expression pane and hit the "extract" button. The code in the middle pane will be replaced with the plain-text version.

Tips for working with Regular Expressions

Here are some tips for working with regular expressions. First, use the Pattern.COMMENTS flag: it allows you to break up your regular expression into multiple lines, which makes it possible to document the parts that the expression consists of. This leads me to the second tip: document the parts of your expression, so that you make it a lot easier on yourself if you later have to revisit it. Commenting is easy: use the # sign just like you would use // in Java. Note that with this flag, whitespace in the expression is ignored, so if you want to match a literal space, escape it with a backslash. The same goes for a literal # sign.
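
As a small illustration of these tips, here is a sketch of a commented pattern. The "key = value" input format, the class name CommentsFlagSketch and the constant KEY_VALUE are made up for this example. Each line of the pattern ends in \n so that the # comment on it stops at the end of the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentsFlagSketch {

    // Hypothetical pattern for lines such as "color = red".
    static final Pattern KEY_VALUE = Pattern.compile(
          "# Key: one or more word characters\n"
        + "(\\w+)\n"
        + "# Separator: an equals sign with one literal space on each side;\n"
        + "# the spaces are escaped because plain whitespace is ignored\n"
        + "\\ =\\ \n"
        + "# Value: the rest of the line\n"
        + "(.*)",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = KEY_VALUE.matcher("color = red");
        if (m.matches()) {
            System.out.println(m.group(1) + " -> " + m.group(2)); // prints "color -> red"
        }
        // No match: the escaped spaces around '=' are required.
        System.out.println(KEY_VALUE.matcher("color=red").matches()); // prints "false"
    }
}
```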

Parsing text with regular expressions

I knew that regular expressions are great for matching text, but now that I have the interactive regular expression editor, I've also come to appreciate regular expressions for parsing strings, that is, for extracting fragments out of a string. It surely is a lot easier than finding positions in a string and then using String.substring().

For example, here is a regular expression with capturing groups to parse the GlassFish log file:

# Begin marker
\[\#

# Date and time
\|  (\d\d\d\d-\d\d-\d\d)\D(\d\d:\d\d:\d\d\.\d\d\d\D\d\d\d\d)

# Level
\|  (.+?)

# Product
\|  (.+?)

# Category
\|  (.+?)

# Key-value pairs
\|  (.+)?

# Msg text
\|  (.+?)

# Optional stack trace
(^\t at \s \p{javaLowerCase} .*  \.java .*)?

# End marker
\|\#\]

This will result in 8 groups. This looks like a difficult expression, but with the Interactive Regular Expression Editor, it's pretty simple to write. And to show off the editor, this is the Java code it produces:

public static Pattern REGEX = Pattern.compile("# Begin marker\r\n" + 
  "\\[\\#\r\n" + 
  "\r\n" + 
  "# Date and time\r\n" + 
  "\\|  (\\d\\d\\d\\d-\\d\\d-\\d\\d)\\D(\\d\\d:\\d\\d:\\d\\d\\.\\d\\d\\d\\D\\d\\d\\d\\d)\r\n" + 
  "\r\n" + 
  "# Level\r\n" + 
  "\\|  (.+?)\r\n" + 
  "\r\n" + 
  "# Product\r\n" + 
  "\\|  (.+?)\r\n" + 
  "\r\n" + 
  "# Category\r\n" + 
  "\\|  (.+?)\r\n" + 
  "\r\n" + 
  "# Key-value pairs\r\n" + 
  "\\|  (.+)?\r\n" + 
  "\r\n" + 
  "# Msg text\r\n" + 
  "\\|  (.+?)\r\n" + 
  "\r\n" + 
  "# Optional stack trace\r\n" + 
  "(^\\t at \\s \\p{javaLowerCase} .*  \\.java .*)?\r\n" + 
  "\r\n" + 
  "# End marker\r\n" + 
  "\\|\\#\\]", 0 | Pattern.COMMENTS | Pattern.DOTALL | Pattern.MULTILINE);

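To illustrate how the groups come out, here is a small sketch that runs the expression against a single log entry. The entry is made up for this example and only approximates what GlassFish writes; the class name GlassFishLogSketch and the parse helper are hypothetical as well. The pattern and flags are the ones shown above, with the blank lines dropped and \n instead of \r\n as the line ending.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GlassFishLogSketch {

    static final Pattern REGEX = Pattern.compile(
          "# Begin marker\n"
        + "\\[\\#\n"
        + "# Date and time\n"
        + "\\|  (\\d\\d\\d\\d-\\d\\d-\\d\\d)\\D(\\d\\d:\\d\\d:\\d\\d\\.\\d\\d\\d\\D\\d\\d\\d\\d)\n"
        + "# Level\n"
        + "\\|  (.+?)\n"
        + "# Product\n"
        + "\\|  (.+?)\n"
        + "# Category\n"
        + "\\|  (.+?)\n"
        + "# Key-value pairs\n"
        + "\\|  (.+)?\n"
        + "# Msg text\n"
        + "\\|  (.+?)\n"
        + "# Optional stack trace\n"
        + "(^\\t at \\s \\p{javaLowerCase} .*  \\.java .*)?\n"
        + "# End marker\n"
        + "\\|\\#\\]",
        Pattern.COMMENTS | Pattern.DOTALL | Pattern.MULTILINE);

    /** Returns the 8 captured groups of the first entry found, or null if none. */
    static String[] parse(String entry) {
        Matcher m = REGEX.matcher(entry);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i); // null for a group that did not take part
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample entry, not taken from a real log.
        String entry = "[#|2009-01-20T10:15:30.123-0800|INFO|sun-appserver9.1"
            + "|javax.enterprise.system.core|_ThreadID=10;_ThreadName=main;"
            + "|Server started|#]";
        String[] groups = parse(entry);
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + (i + 1) + ": " + groups[i]);
        }
    }
}
```

For this entry, group 1 is the date, group 2 the time, group 7 the message text, and group 8 is null because there is no stack trace. Because the key-value group is greedy and DOTALL is on, the sketch is only meant for one entry at a time.
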
Blogger.com works after all

I had this issue with Blogspot where it would insert a great number of line breaks just before a table. As I found out (thanks to Edward Chou), this behavior can be turned off in the settings page of Blogspot.

I also found out how to create Atom feeds from labels. The help text on Google is incorrect; I found out that to reference a category like Sun, the corresponding URL is: http://frankkieviet.blogspot.com/feeds/posts/default/-/Sun

Now I’m wondering if I should stick with Blogspot or with WordPress…