Wednesday, April 29, 2009

Implementing regular expression matching using Partial derivative (Part 1: Intro)

Regular expression pattern is a common programming tool which allows us to extract a fragment from a series of items (such as characters, symbols etc) if the fragment matches the pattern that we specify. The process that computes the match result is sometime called regular expression pattern matching. Regular expression pattern matching is commonly found in many main stream programming languages such as, Perl, Python and Java. The "traditional" way of implementing regular expression pattern matching is to construct a DFA (deterministic Finite Automata) that represent the regular expression pattern. Execute the DFA over the input will yield the result.

In this article, I propose an alternative implementation of regular expression pattern matching using Partial Derivatives.
In a similar idea, we can also use partial derivative to perform inequality/equality test among regexs. See Martin's post here.

What is the regular expression pattern matching problem?

Let's consider an example. Given a regular expression pattern, say

(x : A*, y : A*)

which matches a sequence of alphabets A and binding zero or more As to the pattern variable x and sending the rest (of As) to the other pattern variable y.

Suppose we have an input sequence


(or we will call it "a word" subsequently) which matches with the above pattern and producing
one of the following possible bindings.

{ x -> AA, y -> () }
{ x -> A, y -> A }
{ x -> (), y -> AA }

The first solution is called the longest matched, because the first sub pattern consumes the longest possible prefix of the word and the remaining goes to the second pattern. On the other hand, the third solution is referred to the shortest matched, because the first sub pattern consumes the shortest possible prefix. In most of this article we will talk about the longest match.