Monday, May 11, 2009

Implementing regular expression matching using Partial derivative (Part 4: The partial derivative approach)

In the previous post, we have presented a regex pattern matching algorithm that uses derivative. The idea is to extend the derivative operation to handle regex patterns. With this extension, we can perform regex pattern matching by "pushing" all the labels in the input word into the pattern via derivative, and collect the results by examining the resulted pattern (derivative).

This approach is neat but hard to be put to practical use. Here are the reasons.

Issue #1) The derivative operation for regex patterns always builds new patterns. The "structural size" of the derivative pattern is always larger than the original one. This prevents us from getting a compilation scheme.

Issue #2) The derivative based algorithm requires backtracking.
To illustrate, let us consider the following example,

Let p be the pattern (x :: (A|(AB)), y :: (A|C) ),
> p = ( PPair ( PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B'))) ) (PVar 2 S.empty (Choice (L 'A') (L 'C')))

Let w be the input AA
> w = S.pack "AA"

Matching w against p
> firstmatch p w -- (1)

To compute (1), we need to compute the derivatives (dPat (dPat p 'A') 'A')

(dPat (dPat p 'A') 'A')
--> (dPat (PPair ( PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "" (Choice (L 'A') (L 'C'))) 'A')
--> (PChoice
(PPair ( PVar 1 "AA") (Choice Phi Phi)) (PVar 2 "" (Choice (L 'A') (L 'C'))) 'A')
(PPair (PPair ( PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "A" (Choice Empty Phi))) -- (2)

Since we are searching for the firstmatch, we search (2) from left to right for the first successful match. It turns out that the first alternative of (2) leads to a matching failure. Therefore, we have to backtrack, and look at the second alternative, in which we find the match, [(1,"A"), (2,"A")].

We all know that backtracking is costly. The situation could be worse in the context of regex pattern matching, where we not only need to test for successful match but also keep track of bindings.

To address the issue #1, we need to take a different approach which is based on partial derivatives.

So what is the partial derivative? How does it differ from derivative?

Given a regex r, the partial derivatives of r with respect to some letter l are a set of regex which are the possible results of removing l from r.

Both derivative and partial derivative both describe the "states" after removing some leading label l from a regex. The derivate operation yields a single regex. On the other hand, the partial derivative operation yields a set of regexs. We can think of derivatives as states in a DFA because they deterministic, i.e. from one input regex and a letter, we can only get one derivative.
On the contrary, we can think of partial derivatives as states in an NFA because they are non-deterministic, i.e. from one input regex and a letter, we get a set of partial derivatives.

Note that as we pointed out earlier in this section, the set of all possible derivatives of a regex is infinite.
On the other hand, the set of all possible partial derivatives of a given regex is finite. This is one of the important results in Antimirov's work.

We recast Antimirov's definition of partial derivative operations as follows,

> partDeriv :: RE -> Char -> [RE]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l
> | l == l' = [Empty]
> | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l
> | isEmpty r1 =
> let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> s2 = partDeriv r2 l
> in nub (s1 ++ s2)
> | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]

It is not hard to realize that by making use of the partial derivative operation we can define a derivative operation which yields a "minimal" regex.

Exercise: Implement the derivative operation by making use of the partial derivative operation.

Furthermore, we are able to use the partial derivative operation to solve the word problem.

Exercise: Implement an algorithm that solves the word problem using the partial derivative operations.

Like what we did to the derivative operation, we can extend the partial derivative operation to operate on regex patterns.

> pdPat :: Pat -> Letter -> [Pat]
> pdPat (PVar x w r) l =
> let pd = partDeriv r l
> in if null pd then []
> else [PVar x (w `S.append` (S.pack [l])) (resToRE pd)]
> pdPat (PPair p1 p2) l =
> if (isEmpty (strip p1))
> then ([ PPair p1' p2 | p1' <- pdPat p1 l] ++
> [ PPair (mkEmpPat p1) p2' | p2' <- pdPat p2 l])
> else [ PPair p1' p2 | p1' <- pdPat p1 l ]
> pdPat (PChoice p1 p2) l =
> ((pdPat p1 l) ++ (pdPat p2 l))

Summig up a list of regular expressions with choice operation.

> resToRE :: [RE] -> RE
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi

And the partial derivative pattern matching algorithm follows naturally,

> allmatch :: Pat -> Word -> [Env]
> allmatch p w = concat (map collect (allmatch' [p] w))
> where
> allmatch' :: [Pat] -> Word -> [Pat]
> allmatch' ps w =
> case S.uncons w of
> Nothing -> ps
> Just (l,w') -> let ps' = (concat [ pdPat p l | p <- ps ])
> in allmatch' ps' w'

> firstmatch :: Pat -> Word -> Maybe Env
> firstmatch p w =
> case allmatch p w of
> [] -> Nothing
> (env:_) -> Just env

Recall the previous example

Kenny's example

> p4 = PPair (PPair p_x p_y) p_z
> where p_x = PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))
> p_y = PVar 2 S.empty (Choice (Seq (L 'B') (Seq (L 'A') (L 'A'))) (L 'A'))
> p_z = PVar 3 S.empty (Choice (Seq (L 'A') (L 'C')) (L 'C'))

> input = S.pack "ABAAC"

*Main> firstmatch p4 input
Just [(1,"AB"),(2,"A"),(3,"AC")]

Note that there is a slight difference between the result obtained from the derivative based and the one above. The key observation here is that the two algorithms differ when the regex pattern has a left associated nested pairs. In case that the pair patterns are nested to the right. The two algorithms behaves the same, which can be proven easily. In realworld regex pattern matching like Perl or grep, we do not have nesting in pattern sequences.

One key observation is that the set of all partial derivative patterns are finite if we drop the "cumulative" bindings. This gives us an opportunity to turn the above algorithm into a compilation scheme.

And we yet need to address the issue of backtracking (#2).
(To be continued)

No comments: