Tuesday, May 5, 2009

Implementing regular expression matching using Partial derivative (Part 2: The word problem and derivatives)

How do we solve the regex matching problem? To start let us consider a simpler problem - the word problem. The word problem is a classic problem in which we test whether a regex accepts a word.

One well-known approach is to solve the word problem using Brzozowski's derivative operation. The derivative operation is defined as follows

d(l,r) = { w | (l,w) in r ]


We describe the algorithm in the following using Haskell.


> import qualified Data.ByteString.Char8 as S

> data RE = Phi -- empty lang
>  | Empty        -- empty word
>  | L Char   -- literal
>  | Choice RE RE -- r1 + r2
>  | Seq RE RE    -- (r1,r2)
>  | Star RE      -- r*


A word is a byte string.

> type Word = S.ByteString

> type Letter = Char



We follow Brzozowski's approach to define the derivative operations.

> deriv :: RE -> Letter -> RE
> deriv Phi l = Phi
> deriv Empty l = Phi
> deriv (L l') l 
>     | l == l' = Empty
>     | otherwise = Phi
> deriv (Choice r1 r2)  l = Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l
>     | isEmpty r1 = 
>           let c1 = (Seq (deriv r1 l) r2)
>               c2 = deriv r2 l
>           in Choice c1 c2
>     | otherwise = Seq (deriv r1 l) r2
> deriv (Star r) l = Seq (deriv r l) (Star r)



To establish the base case of the algorithm,
we need a function that checks whether a regexp accepts the empty word.


-- check whether regular expressions are empty

> class IsEmpty a where
>     isEmpty :: a -> Bool

> instance IsEmpty RE where
>   isEmpty Phi = False
>   isEmpty Empty = True
>   isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
>   isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
>   isEmpty (Star r) = True
>   isEmpty (L _) = False



Then the word matching algorithm can be implemented as follows
> match :: Word -> RE -> Bool
> match w r = 
>     case S.uncons w of
>     Nothing     -> isEmpty r -- empty
>     Just (l,w') -> w' `match` (deriv r l)


The gist of the word matching algorithm is that a non empty word (l,w') matches a regular expr r if we take away l from the word, the remainder w' matches with the derivative (deriv r l).

A sample execution:

*Main> match (S.pack "AA") (Seq (Star (L 'A')) (Star (L 'A')))
True

No comments: