Tuesday, May 5, 2009

Implementing regular expression matching using Partial derivative (Part 2: The word problem and derivatives)

How do we solve the regex matching problem? To start let us consider a simpler problem - the word problem. The word problem is a classic problem in which we test whether a regex accepts a word.

One well-known approach is to solve the word problem using Brzozowski's derivative operation. The derivative operation is defined as follows

d(l,r) = { w | (l,w) in r ]


We describe the algorithm in the following using Haskell.



> import qualified Data.ByteString.Char8 as S

> data RE = Phi -- empty lang
> | Empty -- empty word
> | L Char -- literal
> | Choice RE RE -- r1 + r2
> | Seq RE RE -- (r1,r2)
> | Star RE -- r*


A word is a byte string.

> type Word = S.ByteString

> type Letter = Char



We follow Brzozowski's approach to define the derivative operations.


> deriv :: RE -> Letter -> RE
> deriv Phi l = Phi
> deriv Empty l = Phi
> deriv (L l') l
> | l == l' = Empty
> | otherwise = Phi
> deriv (Choice r1 r2) l = Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l
> | isEmpty r1 =
> let c1 = (Seq (deriv r1 l) r2)
> c2 = deriv r2 l
> in Choice c1 c2
> | otherwise = Seq (deriv r1 l) r2
> deriv (Star r) l = Seq (deriv r l) (Star r)



To establish the base case of the algorithm,
we need a function that checks whether a regexp accepts the empty word.



-- check whether regular expressions are empty

> class IsEmpty a where
> isEmpty :: a -> Bool

> instance IsEmpty RE where
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False



Then the word matching algorithm can be implemented as follows

> match :: Word -> RE -> Bool
> match w r =
> case S.uncons w of
> Nothing -> isEmpty r -- empty
> Just (l,w') -> w' `match` (deriv r l)


The gist of the word matching algorithm is that a non empty word (l,w') matches a regular expr r if we take away l from the word, the remainder w' matches with the derivative (deriv r l).

A sample execution:


*Main> match (S.pack "AA") (Seq (Star (L 'A')) (Star (L 'A')))
True

No comments: