Thursday, May 7, 2009

Implementing regular expression matching using Partial derivative (Part 3: The derivative approach)

The lesson we learned from the word problem is that we solve the word matching problem by reducing

match w r

to

match w' (deriv r l)
where
(l:w') = w



Suppose regex pattern matching can be defined in terms of a function

> allmatch :: Pat -> Word -> [Env]

where the type Pat represents all possible regex patterns.

> data Pat = PVar Int Word RE
> | PPair Pat Pat
> | PChoice Pat Pat
> | PEmpty Pat -- used internally to indicate that mkEmpty function has been applied.
> deriving Show

In the above, we define four kinds of regex patterns, the variable pattern, the pair parttern, the choice pattern,
and the empty pattern. Note that in the variable pattern we use integer to represent pattern variable.
Within the variable pattern, we use a placeholder of type Word to record
the word being matched by this variable pattern. The empty pattern is used internally, we will provide the explanation shortly.


Env denotes the matching results which are mappings between pattern variables and the words.

> type Env = [(Int,Word)]


Applying the same idea, we should be able to solve the regex pattern matching problem by reducing


allmatch p w

to

allmatch derivative_of_p_with_respect_to_l w'
where
(l:w') = w


To implement this, we first need to define the derivatives operation for regex patterns.


> dPat :: Pat -> Letter -> Pat


The derivative operation for variable pattern is pretty straigh-forward.


> dPat (PVar x w r) l =
> let d = deriv r l
> in PVar x (w `S.append` (S.pack [l])) d


In the above, we "reduce" the pattern annotation regex by computing its derivative. In addition, in the resulting derivative we record the information that we have "consumed" the letter l.

The derivative operation for choice pattern is simple, we omit the explanation.

> dPat (PChoice p1 p2) l = PChoice (dPat p1 l) (dPat p2 l)


The derivative operation for pair patterns is more interesting.
Let strip be a function that strips away variable bindings from a pattern.

> strip :: Pat -> RE
> strip (PVar _ w r) = r
> strip (PPair p1 p2) = Seq (strip p1) (strip p2)
> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)
> strip (PEmpty p) = strip p

The "partial" solution we can immediately think of is as follows,

> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair _ (dPat p2 l))
> else PPair (dPat p1 l) p2

When the sub pattern p1 accepts empty word, the resulting derivative must be a choice pattern. In the first alternative, we use p1 to consume l and keep p2 unchanged; while in the second alternative, we skip p1 and use p2 to consume l. How can we skip p1? In other words, what should _ be replaced with?
Because p1 accepts empty words, we could further match incoming letters just with p2. But we cannot discard p1, simply because we might have recorded some part of the input in p1. On the other hand, we cannot keep p1 as it is. We somehow need to indicate that the matching using p1 is finished.

The idea is to replace _ with a variation of p1 where we make all regular expressions empty whichever is possible.


> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair (mkEmpPat p1) (dPat p2 l))
> else PPair (dPat p1 l) p2

> mkEmpPat :: Pat -> Pat
> mkEmpPat (PVar x w r)
> | isEmpty r = PVar x w Empty
> | otherwise = PVar x w Phi
> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)


There is a space for optimization here. Since we have made p1 empty, we should give it a mark, so that in the subsequent
derivative operation, we could immediately skip it. Putting all pieces together, the derivative operation for regex pattern is as follows,


> dPat :: Pat -> Letter -> Pat
> dPat (PVar x w r) l =
> let d = deriv r l
> in PVar x (w `S.append` (S.pack [l])) d
> dPat (PChoice p1 p2) l = PChoice (dPat p1 l) (dPat p2 l)
> dPat (PPair (PEmpty p1) p2) l =
> (PPair (PEmpty p1) (dPat p2 l))
> dPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then PChoice (PPair (dPat p1 l) p2)
> (PPair (PEmpty (mkEmpPat p1)) (dPat p2 l))
> else PPair (dPat p1 l) p2

> mkEmpPat :: Pat -> Pat
> mkEmpPat (PVar x w r)
> | isEmpty r = PVar x w Empty
> | otherwise = PVar x w Phi
> mkEmpPat (PPair p1 p2) = PPair (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PChoice p1 p2) = PChoice (mkEmpPat p1) (mkEmpPat p2)
> mkEmpPat (PEmpty p) = PEmpty p


Recall that the main idea is to perform regex pattern matching by reducing it into derivative forms, how about the base case, i.e. when the input word is empty or fully consumed. We need to collect the bindings from the pattern.


> collect :: Pat -> [Env]
> collect (PVar x w r) | isEmpty r = [[(x,w)]]
> | otherwise = []
> collect (PPair p1 p2) = combine (collect p1) (collect p2)
> collect (PChoice p1 p2) = (collect p1) ++ (collect p2)
> collect (PEmpty p) = collect p

> combine :: [Env] -> [Env] -> [Env]
> combine envs1 envs2 =
> [ env1 ++ env2 | env1 <- envs1, env2 <- envs2 ]


Now we can define our derivative based pattern matching algorithm.


> allmatch :: Pat -> Word -> [Env]
> allmatch p w =
> let p' = S.foldl (\s l -> dPat s l) p w
> in collect p'

> firstmatch :: Pat -> Word -> Maybe Env
> firstmatch p w =
> case allmatch p w of
> [] -> Nothing
> (env:_) -> Just env



Some example,

Kenny's example p4 = ( (x :: A|(AB) , y :: (BAA) | A ), z :: (AC)| C )

> p4 = PPair (PPair p_x p_y) p_z
> where p_x = PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))
> p_y = PVar 2 S.empty (Choice (Seq (L 'B') (Seq (L 'A') (L 'A'))) (L 'A'))
> p_z = PVar 3 S.empty (Choice (Seq (L 'A') (L 'C')) (L 'C'))

> input = S.pack "ABAAC"




The sample execution,

*Main> firstmatch p4 input
Just [(1,"A"),(2,"BAA"),(3,"C")]


The correctness of this implementation can be found in my thesis here.
The correctness is verified under the POSIX matching policy, which is mentioned in Vansumeran's paper.


Burak Emir has demonstrated the same algorithm works in Scala, See here and here.

Related Works

S. Vansummeren. Type inference for unique pattern matching. ACM TOPLAS, 28(3):389–428, May 2006.

No comments: