Monday, April 28, 2014

Getting face recognition to work on Raspberry Pi with Pi Cam

1. Buy a Raspberry Pi and a Pi camera board module.

2. Install Raspbian.

3. Run
$ sudo raspi-config
to turn on the camera module

4. Follow the instructions here to compile and install OpenCV:
http://robertcastle.com/2014/02/installing-opencv-on-a-raspberry-pi/

5. Install the UV4L driver to make your Pi Cam available as a virtual video device (/dev/video0):
http://www.linux-projects.org/modules/sections/index.php?op=viewarticle&artid=14


6. Reconfigure the UV4L driver so that no annoying preview window comes up.
$ sudo pkill uv4l
$ sudo uv4l --driver raspicam --auto-video_nr --width 640 --height 480 --encoding jpeg --nopreview

7. Download
https://github.com/berak/opencv_smallfry
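For example, assuming git is available on the Pi:
$ git clone https://github.com/berak/opencv_smallfry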
8. Run
$ startx
$ cd opencv_smallfry
$ python facerec_online.py your_folder_to_store_training_pics /usr/local/share/OpenCV/haarcascades/haarcascade_frontalface_default.xml
9. Enjoy!

Friday, April 4, 2014

How to sail through the storm with phantom types

The Storm framework is a handy tool when it comes to handling real-time, infinite streams of data. Credit goes to Nathan and his team! In a nutshell, Storm models a flow-based programming paradigm in which processes are modelled using the Java classes spout and bolt, where a spout denotes a data-generating process and a bolt denotes a data-processing/transforming process. To illustrate the idea, let's consider the typical word count example in Storm. For conciseness, we adopt the Scala DSL syntax from ScalaStorm (more credit goes to Evan Chan).
  class RandomSentenceSpout extends StormSpout(outputFields = List("sentence")) {
    val sentences = List("the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature")
    def nextTuple = {
      Thread sleep 100
      emit (sentences(Random.nextInt(sentences.length)))
    }
  }
Viewing a Storm topology as a flow of data processes, a spout is a data-generating process. For example, the RandomSentenceSpout defined above operates by executing the method nextTuple, which randomly picks one sentence out of the predefined set and passes it to the next process in the pipeline.
Let's move on to the next item in the pipeline, the SplitSentence bolt.
  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word) }
      t ack
    }
  }
Unlike a spout, a bolt takes the output of the previous process in the pipeline as its input. Given the input, a bolt invokes the execute() method to compute the output, which is in turn passed to the next process in the pipeline. In this example, the SplitSentence bolt splits the incoming sentences (from the RandomSentenceSpout) by spaces and passes the resulting words to the next process, the WordCount bolt.
  class WordCount extends StormBolt(List("word", "count")) {
    var counts: Map[String, Int] = _
    setup {
      counts = new HashMap[String, Int]().withDefaultValue(0)
    }
    def execute(t: Tuple) = t matchSeq {
      case Seq(word: String) =>
        counts(word) += 1
        using anchor t emit (word, counts(word))
        t ack
    }
  }
The WordCount bolt is the last process in this word count topology. It simply counts the occurrences of the words it collects. Putting all three pieces into a topology, we have:
object WordCountTopology {

  class RandomSentenceSpout extends StormSpout(outputFields = List("sentence")) {
    val sentences = List("the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature")
    def nextTuple = {
      Thread sleep 100
      emit (sentences(Random.nextInt(sentences.length)))
    }
  }

  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word) }
      t ack
    }
  }

  class WordCount extends StormBolt(List("word", "count")) {
    var counts: Map[String, Int] = _
    setup {
      counts = new HashMap[String, Int]().withDefaultValue(0)
    }
    def execute(t: Tuple) = t matchSeq {
      case Seq(word: String) =>
        counts(word) += 1
        using anchor t emit (word, counts(word))
        t ack
    }
  }
  
  def main(args: Array[String]) = {
    val builder = new TopologyBuilder

    builder.setSpout("randsentence", new RandomSentenceSpout, 5)
    builder.setBolt("split", new SplitSentence, 8)
        .shuffleGrouping("randsentence")
    builder.setBolt("count", new WordCount, 12)
        .fieldsGrouping("split", new Fields("word"))

    val conf = new Config
    conf.setDebug(true)
    conf.setMaxTaskParallelism(3)

    val cluster = new LocalCluster
    cluster.submitTopology("word-count", conf, builder.createTopology)
    Thread sleep 10000
    cluster.shutdown
  }
}
Note that the Storm topology is assembled in the main() method, where the TopologyBuilder connects the set of data processes by chaining them together.

Note that processes exchange data using the Tuple data type, an ordered sequence in which each item is named. For example, in the definition of RandomSentenceSpout, we see
outputFields = List("sentence")
And in the definition of SplitSentence, we use
outputFields = List("word")

When a bolt is "connected" to its upstream process, an input declaration has to be made. In the case of the SplitSentence bolt, the distribution of the inputs from the RandomSentenceSpout does not matter, hence shuffleGrouping() is used.

In the case of the WordCount bolt, the input assignment is crucial: the same word must always go to the same bolt instance, otherwise the counting is meaningless. Thus, fieldsGrouping() is used.




Everything is neat and tidy, except that we spot two immediate issues.
Issue number 1:
The connections between spouts and bolts are dynamically typed, i.e. if there is a type mismatch, it will only be discovered at run-time.
For instance, consider a slight adjustment to the definition of SplitSentence:
  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word.toList) }
      t ack
    }
  }

As a result, the bolt emits List[Char] instead of String. The compiler happily accepts and type-checks it; the type mismatch between the output of SplitSentence and the input to WordCount will only be raised as an error at run-time. This is unacceptable, as we are dealing with large sets of data, and this kind of run-time mismatch error can become hard to trace.
Issue number 2:
The construction of a topology requires at least one spout followed by zero or more bolts. It does not make sense to add a bolt to an empty topology. However, the TopologyBuilder does not enforce this constraint statically.
    val builder = new TopologyBuilder
 
    builder.setSpout("randsentence", new RandomSentenceSpout, 5)
    builder.setBolt("split", new SplitSentence, 8)
        .shuffleGrouping("randsentence")
    builder.setBolt("count", new WordCount, 12)
        .fieldsGrouping("split", new Fields("word"))

To solve the above issues, we adopt a well-known technique from the FP world called phantom types. In a nutshell, phantom types are data types whose type variables are not all mentioned in their data constructors.
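For readers new to the idea, here is a minimal Haskell sketch (phantom types originate in the Haskell world; the names Door, Open and Closed are ours, purely for illustration):

  -- the type variable s never occurs in the constructor's field,
  -- so it is "phantom": it exists only to carry compile-time state
  newtype Door s = Door String

  data Open
  data Closed

  openDoor :: Door Closed -> Door Open
  openDoor (Door name) = Door name

  -- opening an already-open door is rejected by the type checker:
  -- openDoor (openDoor (Door "front" :: Door Closed))   -- type error

The run-time representations of Door Open and Door Closed are identical; only the type checker distinguishes them, and that is exactly what we exploit below.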

We follow the style of phantom type encoding described here.
First of all, we introduce two phantom types to capture the input and output types of the spout and the bolt.
  trait StormSpoutT[Out]{def spout:StormSpout}
  trait StormBoltT[In,Out]{def bolt:StormBolt}

Now we are able to declare the input/output types of the spout and the bolts:
  case class RandomSentenceSpoutT (spout: RandomSentenceSpout) extends StormSpoutT[String]
  class RandomSentenceSpout extends StormSpout(outputFields = List("sentence")) {
    val sentences = List("the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature")
    def nextTuple = {
      Thread sleep 100
      emit (sentences(Random.nextInt(sentences.length)))
    }
  }

  case class SplitSentenceT (bolt:SplitSentence) extends StormBoltT[String,String]
  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word) }
      t ack
    }
  }


  case class WordCountT (bolt:WordCount) extends StormBoltT[String,(String,Int)]
  class WordCount extends StormBolt(List("word", "count")) {
    var counts: Map[String, Int] = _
    setup {
      counts = new HashMap[String, Int]().withDefaultValue(0)
    }
    def execute(t: Tuple) = t matchSeq {
      case Seq(word: String) =>
        counts(word) += 1
        using anchor t emit (word, counts(word))
        t ack
    }
  }
Next, we define three possible states of a constructed topology, an ordering among the three states, and a builder type that carries the topology state as a phantom type variable.
  abstract class TopEmpty;
  abstract class TopWithSpout extends TopEmpty;
  abstract class TopWithBolt extends TopWithSpout;

  case class TopologyBuilderT[+State,+Out](builder:TopologyBuilder,output_name:String) {
    def createTopology = builder.createTopology
  }
In addition, we define two combinators for topology construction, which enforce the matching constraint between the output type of one process and the input type of the following process. Furthermore, as the topology is being constructed, we also keep track of its state. For example, when a topology is initiated, it is TopEmpty; once a spout is added, it becomes TopWithSpout. Adding a bolt to an empty topology is not allowed by the type system.
  def addSpout[Out](top:TopologyBuilderT[TopEmpty,_])
    ( spout_name:String
    , ts:StormSpoutT[Out]
    , threadMax:Int) : TopologyBuilderT[TopWithSpout,Out] = 
    top match {
      case TopologyBuilderT(builder,_) => {
        builder.setSpout(spout_name, ts.spout, threadMax)
        TopologyBuilderT[TopWithSpout,Out](builder,spout_name)   
      }
    }
  
  def addBolt[In,Out,State <: TopWithSpout](top:TopologyBuilderT[State,In])
    ( bolt_name:String
    , tb:StormBoltT[In,Out]
    , threadMax:Int) 
    ( inDecl: BoltDeclarer => BoltDeclarer ) : TopologyBuilderT[TopWithBolt,Out] = 
    top match {
      case TopologyBuilderT(builder,output_name) => {
        val i = builder.setBolt(bolt_name,tb.bolt, threadMax)
        inDecl(i)
        TopologyBuilderT[TopWithBolt,Out](builder,bolt_name)   
      }
    } 
Putting everything together, we have a typeful version of the word count topology.
object WordCountTopologyP {

  trait StormSpoutT[Out]{def spout:StormSpout}

  trait StormBoltT[In,Out]{def bolt:StormBolt}


  abstract class TopEmpty;
  abstract class TopWithSpout extends TopEmpty;
  abstract class TopWithBolt extends TopWithSpout;


  case class TopologyBuilderT[+State,+Out](builder:TopologyBuilder,output_name:String) {
    def createTopology = builder.createTopology
  }


  case class RandomSentenceSpoutT (spout: RandomSentenceSpout) extends StormSpoutT[String]
  class RandomSentenceSpout extends StormSpout(outputFields = List("sentence")) {
    val sentences = List("the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature")
    def nextTuple = {
      Thread sleep 100
      emit (sentences(Random.nextInt(sentences.length)))
    }
  }

  case class SplitSentenceT (bolt:SplitSentence) extends StormBoltT[String,String]
  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word) }
      t ack
    }
  }


  case class WordCountT (bolt:WordCount) extends StormBoltT[String,(String,Int)]
  class WordCount extends StormBolt(List("word", "count")) {
    var counts: Map[String, Int] = _
    setup {
      counts = new HashMap[String, Int]().withDefaultValue(0)
    }
    def execute(t: Tuple) = t matchSeq {
      case Seq(word: String) =>
        counts(word) += 1
        using anchor t emit (word, counts(word))
        t ack
    }
  }


  def addSpout[Out](top:TopologyBuilderT[TopEmpty,_])
    ( spout_name:String
    , ts:StormSpoutT[Out]
    , threadMax:Int) : TopologyBuilderT[TopWithSpout,Out] = 
    top match {
      case TopologyBuilderT(builder,_) => {
        builder.setSpout(spout_name, ts.spout, threadMax)
        TopologyBuilderT[TopWithSpout,Out](builder,spout_name)   
      }
    }

  
  def addBolt[In,Out,State <: TopWithSpout](top:TopologyBuilderT[State,In])
    ( bolt_name:String
    , tb:StormBoltT[In,Out]
    , threadMax:Int) 
    ( inDecl: BoltDeclarer => BoltDeclarer ) : TopologyBuilderT[TopWithBolt,Out] = 
    top match {
      case TopologyBuilderT(builder,output_name) => {
        val i = builder.setBolt(bolt_name,tb.bolt, threadMax)
        inDecl(i)
        TopologyBuilderT[TopWithBolt,Out](builder,bolt_name)   
      }
    } 
    

  def main(args: Array[String]) = {
    val builderT = addBolt(
                    addBolt(addSpout(new TopologyBuilderT(new TopologyBuilder,""))("randsentence", new RandomSentenceSpoutT(new RandomSentenceSpout), 8)) 
                      ("split", new SplitSentenceT(new SplitSentence), 8)( _.shuffleGrouping("randsentence"))
                  ) ("count", new WordCountT(new WordCount), 12)( _.fieldsGrouping("split", new Fields("word")))


    val conf = new Config
    conf.setDebug(true)
    conf.setMaxTaskParallelism(3)

    val cluster = new LocalCluster
    cluster.submitTopology("word-count", conf, builderT.createTopology)
    Thread sleep 10000
    cluster.shutdown
  }
}
There is still another problem: the topology construction is unreadable.
    val builderT = addBolt(
                    addBolt(addSpout(new TopologyBuilderT(new TopologyBuilder,""))("randsentence", new RandomSentenceSpoutT(new RandomSentenceSpout), 8)) 
                      ("split", new SplitSentenceT(new SplitSentence), 8)( _.shuffleGrouping("randsentence"))
                  ) ("count", new WordCountT(new WordCount), 12)( _.fieldsGrouping("split", new Fields("word")))
This is because the addBolt() and addSpout() methods are not infix. To fix this problem, we use the type class/Scala implicit style of phantom type encoding described here.
object WordCountTopologyT {

  trait StormSpoutT[Out]{def spout:StormSpout}

  trait StormBoltT[In,Out]{def bolt:StormBolt}


  abstract class TopEmpty;
  abstract class TopWithSpout extends TopEmpty;
  abstract class TopWithBolt extends TopWithSpout;


  case class TopologyBuilderT[+State,Out](builder:TopologyBuilder,output_name:String) {
    def createTopology = builder.createTopology

    def init : TopologyBuilderT[TopEmpty,Out] = 
      new TopologyBuilderT(builder,output_name)

    def >> [NextOut]
      ( spout_name:String
      , ts:StormSpoutT[NextOut]
      , threadMax:Int)
      (implicit evS:State <:< TopEmpty) : TopologyBuilderT[TopWithSpout,NextOut] = {

        builder.setSpout(spout_name, ts.spout, threadMax)
        new TopologyBuilderT[TopWithSpout,NextOut](builder,spout_name)   
      }

    
    def >>> [NextOut]
      ( bolt_name:String
      , tb:StormBoltT[Out,NextOut]
      , threadMax:Int) 
      ( inDecl: BoltDeclarer => BoltDeclarer )
      (implicit evS: State <:< TopWithSpout) : TopologyBuilderT[TopWithBolt,NextOut] = {
        val i = builder.setBolt(bolt_name,tb.bolt, threadMax)
        inDecl(i)
        new TopologyBuilderT[TopWithBolt,NextOut](builder,bolt_name)   
      } 
  }

  case class RandomSentenceSpoutT (spout: RandomSentenceSpout) extends StormSpoutT[String]
  class RandomSentenceSpout extends StormSpout(outputFields = List("sentence")) {
    val sentences = List("the cow jumped over the moon",
                         "an apple a day keeps the doctor away",
                         "four score and seven years ago",
                         "snow white and the seven dwarfs",
                         "i am at two with nature")
    def nextTuple = {
      Thread sleep 100
      emit (sentences(Random.nextInt(sentences.length)))
    }
  }


  case class SplitSentenceT (bolt:SplitSentence) extends StormBoltT[String,String]
  class SplitSentence extends StormBolt(outputFields = List("word")) {
    def execute(t: Tuple) = t matchSeq {
      case Seq(sentence: String) => sentence split " " foreach
        { word => using anchor t emit (word) }
      t ack
    }
  }


  case class WordCountT (bolt:WordCount) extends StormBoltT[String,(String,Int)]
  class WordCount extends StormBolt(List("word", "count")) {
    var counts: Map[String, Int] = _
    setup {
      counts = new HashMap[String, Int]().withDefaultValue(0)
    }
    def execute(t: Tuple) = t matchSeq {
      case Seq(word: String) =>
        counts(word) += 1
        using anchor t emit (word, counts(word))
        t ack
    }
  }


  def main(args: Array[String]) = {
    val builderT = (new TopologyBuilderT(new TopologyBuilder,"")).init
                   .>> ("randsentence", new RandomSentenceSpoutT(new RandomSentenceSpout), 8) 
                   .>>> ("split", new SplitSentenceT(new SplitSentence), 8)( _.shuffleGrouping("randsentence")) 
                   .>>> ("count", new WordCountT(new WordCount), 12)( _.fieldsGrouping("split", new Fields("word")))

    val conf = new Config
    conf.setDebug(true)
    conf.setMaxTaskParallelism(3)

    val cluster = new LocalCluster
    cluster.submitTopology("word-count", conf, builderT.createTopology)
    Thread sleep 10000
    cluster.shutdown
  }
}
The difference is that we migrate the addBolt and addSpout combinators into the TopologyBuilderT class to make them infixable; they are renamed to >> and >>> respectively.
To enforce the constraints among the phantom type variables, we use Scala implicit arguments (similar to Haskell's type class evidence construction), e.g. see the implicit evS parameters of >> and >>> above. For example, the ill-typed SplitSentence variant from Issue 1, once declared as a StormBoltT[String,List[Char]], no longer composes with WordCountT: the application of >>> fails to type check.
One observation is that this formulation is still a bit too verbose, though arguably the phantom types and the definition of TopologyBuilderT should be packaged as a common library, while the definitions of RandomSentenceSpoutT, SplitSentenceT and WordCountT are treated as type annotations.
We believe there should be a way of automatically inferring these annotations from the bodies of RandomSentenceSpout, SplitSentence and WordCount. Maybe the new Scala macros will be one of the options.
The TODOs include:
  1. Support one spout to multiple bolts and multiple spouts to one bolt.
  2. Use Scala macros to infer the spout and bolt type annotations/declarations.

The full source can be found here.

Wednesday, February 6, 2013

GHCi 7.4.2 is finally working on ARM


I got a Cubieboard a few weeks back. I started it off with Linaro (Ubuntu) 12.06.
Today I upgraded the OS to 12.11, which surprisingly came with GHC 7.4.2.

And ghci magically works too.

Here are the upgrade steps:


sudo apt-get install update-manager-core
edit /etc/update-manager/release-upgrades and set
Prompt=normal
sudo apt-get update
sudo apt-get upgrade
sudo do-release-upgrade

Tuesday, December 4, 2012

First day with Raspberry Pi

Finally, I ordered a Raspberry Pi (model B) from Element14. With their efficient service, it shipped within 4 days (including weekends).

Upon its arrival, the first thing I did was hook it up with a keyboard, a mouse, a LAN cable and an HDMI cable to my TV.

Starting off with a 2GB SD card, I downloaded the image from

http://www.raspberrypi.org/downloads

On my Mac, I used the following commands to dump the image onto the SD card (here diskN stands for the SD card's disk identifier, as reported by diskutil list):

$ diskutil unmountDisk /dev/diskN
$ sudo dd if=2012-10-28-wheezy-raspbian.img of=/dev/diskN bs=1m

When this was done, I booted it up and enabled SSH.
The default username and password are 'pi/raspberry'.

Via the command 
$ startx 
the X (UI) was working like a charm. 

For my personal testing, the following packages were installed.
$ sudo apt-get update 
$ sudo apt-get install mplayer 
$ sudo apt-get install vnc-server 
$ sudo apt-get install zd1211-firmware 
$ sudo apt-get install python-setuptools 

zd1211-firmware is needed to enable the wifi dongle (PCI/Planex GW-US54GZL) that I bought 4 years ago for my PSP XLink. With the wifi-conf tool, located on the desktop, configuration of the wifi network is very smooth.

Saturday, June 2, 2012

Extending Glushkov NFA with sub matching over Strings

We consider implementing Glushkov matching automata in Haskell. We first look at the word matching problem. Let w be a list of characters (a.k.a. a string) and r a regular expression; we say w matches r iff w ∈ L(r).

We define regular expressions using a Haskell data type as follows,
data RE where
  Phi :: RE                      -- empty language
  Empty :: RE                    -- empty word
  L :: Char -> Int -> RE                -- single letter with a position number 
  Choice :: RE  -> RE  -> RE     -- r1 + r2
  Seq :: RE  -> RE  -> RE        -- (r1,r2)
  Star :: RE  -> RE              -- r*
 deriving Eq

Phi denotes the empty language, i.e. the empty set. Empty denotes the empty word. L _ _ constructs a letter from a character; note that we also annotate each letter with an integer position, which is needed in the Glushkov NFA construction (we will come to this in detail pretty soon). Choice _ _ implements the choice operator of regular expressions, Seq _ _ implements concatenation, and Star _ implements the Kleene star.

The input string is a ByteString.
type Word = S.ByteString

Step 1 Annotating the regular expression

According to the Glushkov construction, we need to annotate all the letters in the regular expression with positional information, e.g. by annotating (a|b)a* we obtain (a1|b2)a3*, or in Haskell syntax, (Seq (Choice (L 'a' 1) (L 'b' 2)) (Star (L 'a' 3))). Note that for convenience, in the following sections we use the former syntax in places where Haskell syntax is expected.

We implement the annotation operation via a State monad.
newtype State s a = State { runState :: (s -> (a,s)) } 
 
instance Monad (State s) where
   -- return :: a -> State s a
   return a        = State (\s -> (a,s))
   -- (>>=) :: State s a -> (a -> State s b) -> State s b
   (State x) >>= f = State (\s -> let (a,s') = x s 
                                      stb = f a
                                  in (runState stb) s')

run :: s -> State s a -> (a,s)
run s sta = (runState sta) s
We use the data type Env to keep track of the latest position number.
data Env = Env { cnt :: Int
               } deriving Show

initEnv :: Env 
initEnv = Env 0 
The nextCounter operation increments the counter by 1 and returns the new value.
nextCounter :: State Env Int
nextCounter = State (\env -> let c = (cnt env) + 1
                                 env' = c `seq` env{cnt=c}
                             in env' `seq` (c, env'))
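As a quick check of the plumbing, threading two nextCounter calls through the monad yields the positions 1 and 2 and a final counter of 2:

run initEnv (do { i <- nextCounter; j <- nextCounter; return (i,j) })
-- evaluates to ((1,2), Env {cnt = 2})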


With the above operations defined, the annotation process can be implemented straightforwardly as follows,
rAnnotate :: RE -> RE
rAnnotate r = case run initEnv (rAnn r) of
             { (r', _ ) -> r' }  

rAnn :: RE -> State Env RE
rAnn Phi = return Phi
rAnn Empty = return Empty
rAnn (Choice r1 r2) = 
  do { r1' <- rAnn r1
     ; r2' <- rAnn r2
     ; return (Choice r1' r2') }
rAnn (Seq r1 r2) = 
  do { r1' <- rAnn r1
     ; r2' <- rAnn r2
     ; return (Seq r1' r2') }
rAnn (Star r) = 
  do { r' <- rAnn r                
     ; return (Star r') }
rAnn (L c _) = 
  do { i <- nextCounter       
     ; return (L c i) }



Now we turn to the key operations which are used in the Glushkov NFA construction. We implement the first, last and follow operations as follows,
rFirst :: RE -> [(Char, Int)] 
rFirst Phi = []
rFirst Empty = []
rFirst (L c i) = [(c,i)]
rFirst (Star r) = rFirst r
rFirst (Choice r1 r2) = (rFirst r1) ++ (rFirst r2)
rFirst (Seq r1 r2) = if isEmpty r1 then (rFirst r1) ++ (rFirst r2) else rFirst r1

rLast :: RE -> [(Char, Int)] 
rLast Phi = []
rLast Empty = []
rLast (L c i) = [(c,i)]
rLast (Star r) = rLast r
rLast (Choice r1 r2) = (rLast r1) ++ (rLast r2)
rLast (Seq r1 r2) = if isEmpty r2 then (rLast r1) ++ (rLast r2) else rLast r2

rFollow :: RE -> [(Int, Char, Int)]
rFollow Phi = []
rFollow Empty = []
rFollow (L _ _) = []
rFollow (Choice r1 r2) = (rFollow r1) ++ (rFollow r2)
rFollow (Seq r1 r2) = (rFollow r1) ++ (rFollow r2) ++ [ (l,c,f) | (_,l) <- rLast r1, (c,f) <- rFirst r2 ]
rFollow (Star r) = (rFollow r) ++ [ (l,c,f) | (_,l) <- rLast r, (c,f) <- rFirst r ]


Let r be a regular expression,

  • the rFirst function extracts all the possible "leading" letters of r, together with their positions.
  • the rLast function extracts all the possible "trailing" letters of r, together with their positions.
  • the rFollow function builds all the possible transitions of r structurally, leveraging the rFirst and rLast operations.

For example, rFirst (a1|b2)a3* yields [('a',1),('b',2)], rLast (a1|b2)a3* yields [('a',1),('b',2),('a',3)], and rFollow (a1|b2)a3* yields [(1,'a',3),(2,'a',3),(3,'a',3)] (up to ordering); in each transition, the letter is the one consumed at the target position.

Step 2 Assembling the Lego piece by piece

We are ready to construct the NFA. We define the NFA using the following data type.
data NFA a l = NFA { states :: [a]
                   , initStates   :: [a]
                   , finalStates  :: [a]
                   , delta  :: [(a,l,a)] 
                   } deriving Show
To construct the Glushkov NFA, we introduce the following
rGlushkov :: RE -> NFA Int Char
rGlushkov r = let r' = rAnnotate r
              in NFA{ states = 0:(rPos r')
                    , initStates = [0]
                    , finalStates = if isEmpty r then 0:(map snd (rLast r')) else (map snd (rLast r'))
                    , delta = [ (0,c,f) | (c,f) <- rFirst r' ] ++ (rFollow r') }

rPos :: RE -> [Int]
rPos Phi = []
rPos Empty = []
rPos (L _ i) = [i]
rPos (Choice r1 r2) = (rPos r1) ++ (rPos r2) 
rPos (Seq r1 r2) = (rPos r1) ++ (rPos r2)
rPos (Star r) = rPos r
where rPos returns all the positions in r, which constitute the states of the Glushkov NFA besides the initial state 0. The final states are the positions in rLast r'; if r accepts the empty word, state 0 is in the set of final states too. The transition relation delta is constructed from the rFirst operation (for the transitions leaving the initial state) and the rFollow operation.
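As a quick check against the definitions above, here is what the construction yields for our running example (a|b)a*, writing l0 c = L c 0 for an unannotated letter as in the full source below:

rGlushkov (Seq (Choice (l0 'a') (l0 'b')) (Star (l0 'a')))
-- yields
-- NFA { states      = [0,1,2,3]
--     , initStates  = [0]
--     , finalStates = [1,2,3]   -- 0 is not final: (a|b)a* does not accept the empty word
--     , delta       = [(0,'a',1),(0,'b',2),(3,'a',3),(1,'a',3),(2,'a',3)]
--     }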

Step 3 Running the NFA

Let's have a test drive of our newly constructed NFA. To do that, we need the following,
table :: Eq a => [(a,b)] -> [(a,[b])]
table ps = intern [] ps
  where intern t [] = t
        intern t ((k,v):ps) = case lookup k t of 
                              { Nothing -> intern ((k,[v]):t) ps
                              ; Just vs -> intern (update k (vs++[v]) t) ps }
        update k v t = let t' = filter (\(k',_) -> not (k == k')) t
                       in (k,v):t'


runNFA :: (Eq a, Eq l) => NFA a l -> [l] -> Bool
runNFA nfa w = 
   let xs = runIntern (initStates nfa) (table (map (\(f,s,t) -> ((f,s),t)) (delta nfa))) w
   in any (\x -> x `elem` finalStates nfa) xs
   where runIntern :: (Eq a, Eq l) => [a] -> [((a,l),[a])] -> [l] -> [a]
         runIntern currs _ [] = currs
         runIntern currs dels (l:ls) = 
           let nexts = concatMap (\a -> case lookup (a,l) dels of
                                     { Nothing -> []                        
                                     ; Just b  -> b }) currs
           in nexts `seq` runIntern nexts dels ls
where the table function groups all the transitions by source state and letter. The runNFA function takes an NFA and a list of letters, and returns a boolean value indicating whether the match is successful.
runGlushkov :: RE -> String -> Bool
runGlushkov r w = 
   let r' = rAnnotate r   
       nfa = rGlushkov r'          
   in runNFA nfa w 
runGlushkov allows us to pass the regular expression directly into the matching procedure (the annotation is performed internally), e.g. runGlushkov (a|b)a* "baa" yields True.

Step 4 Towards Submatching Automata

To support submatching, we extend our language from regular expressions to regular expression patterns, where v :: r denotes a regular expression r marked with a submatch group label v.

We consider the pattern data type,
data Pat where
 PVar :: Int -> RE -> Pat 
 PPair :: Pat -> Pat -> Pat  
 PChoice :: Pat -> Pat -> Pat 
  deriving Show
Note that we use integers to represent variables. For instance, the pattern ((1 :: a1b2|a3)(2 :: b4a5a6|a7)(3 :: a8c9|c10)) is represented as
p4 = PPair (PPair p_x p_y) p_z
   where p_x = PVar 1 (Choice (Seq (L 'a' 1) (L 'b' 2)) (L 'a' 3))
         p_y = PVar 2 (Choice (Seq (L 'b' 4) (Seq (L 'a' 5) (L 'a' 6))) (L 'a' 7))
         p_z = PVar 3 (Choice (Seq (L 'a' 8) (L 'c' 9)) (L 'c' 10))
And the annotation operation can be extended to regular expression patterns.
pAnnotate :: Pat -> Pat
pAnnotate p = case run initEnv (pAnn p) of { (p', _ ) -> p' }

pAnn :: Pat -> State Env Pat
pAnn (PVar v r) = do { r' <- rAnn r
                       ; return (PVar v r') }
pAnn (PPair p1 p2) = do { p1' <- pAnn p1
                        ; p2' <- pAnn p2 
                        ; return (PPair p1' p2') }
pAnn (PChoice p1 p2) = do { p1' <- pAnn p1
                          ; p2' <- pAnn p2
                          ; return (PChoice p1' p2') }
Extending the first, last and follow operations requires more attention. Note that they make use of strip, which forgets the pattern structure and returns the underlying regular expression; its definition appears in the full source below.
pFirst :: Pat -> [(Char, Int, Int)]
pFirst (PVar v r) = [ (c,i,v) | (c,i) <- rFirst r ]
pFirst (PPair p1 p2) | isEmpty (strip p1) = pFirst p1 ++ pFirst p2
                     | otherwise          = pFirst p1
pFirst (PChoice p1 p2) = pFirst p1 ++ pFirst p2

pLast :: Pat -> [(Char, Int, Int)]
pLast (PVar v r) = [ (c,i,v) | (c,i) <- rLast r ]
pLast (PPair p1 p2) | isEmpty (strip p2) = pLast p1 ++ pLast p2
                    | otherwise          = pLast p2
pLast (PChoice p1 p2) = pLast p1 ++ pLast p2

pFollow :: Pat -> [(Int, Char, Int, Int)]
pFollow (PVar v r) = [ (p, c, q, v) | (p, c, q) <- rFollow r ]
pFollow (PPair p1 p2) = (pFollow p1) ++ (pFollow p2) 
                        ++ [ (l,c,f,v) |  (_,l,_) <- pLast p1, (c,f,v) <- pFirst p2 ]
pFollow (PChoice p1 p2) = (pFollow p1) ++ (pFollow p2)
Note that besides the letter and the position, the pFirst, pLast and pFollow functions also return the pattern variable whose submatch result is to be updated. For example, for the pattern p4 above, pFirst yields [('a',1,1),('a',3,1)]: both possible leading letters belong to submatch group 1. We also need to extend the NFA to capture the updated pattern variable in each transition.
data NFA2 a l = NFA2 { states2 :: [a]
                   , initStates2   :: [a]
                   , finalStates2  :: [a]
                   , delta2  :: [(a,l,a,Int)] 
                   } deriving Show
The construction of the Glushkov NFA from the regular expression pattern can be defined as follows
pGlushkov :: Pat -> NFA2 Int Char
pGlushkov p = let p' = pAnnotate p
              in NFA2{ states2 = 0:(pPos p')
                    , initStates2 = [0]
                    , finalStates2 = if isEmpty (strip p) then 0:(map snd2 (pLast p')) else (map snd2 (pLast p'))
                    , delta2 = [ (0, c, f, v) | (c,f,v) <- pFirst p']  ++ (pFollow p') 
                    }

pPos :: Pat -> [Int]
pPos (PVar v r) = rPos r
pPos (PPair p1 p2) = pPos p1 ++ pPos p2
pPos (PChoice p1 p2) = pPos p1 ++ pPos p2

snd2 :: (a,b,c) -> b
snd2 (_,x,_) = x
Next, let's consider how we make use of the Glushkov NFA to perform regular expression submatching.
runNFA2 :: (Eq a, Eq l) => NFA2 a l -> [l] -> [[Int]]
runNFA2 nfa w = 
   let xvs = runIntern (zip (initStates2 nfa) (repeat [])) 
              (table (map (\(f,s,t,v) -> ((f,s),(t,v))) (delta2 nfa))) w
   in map snd (filter (\(x,v) -> x `elem` finalStates2 nfa) xvs)
   where runIntern :: (Eq a, Eq l) => [(a,[Int])] -> [((a,l),[(a,Int)])] -> [l] -> [(a,[Int])]
         runIntern currs _ [] = currs
         runIntern currs dels (l:ls) = 
           let nexts = concatMap (\(a,vs) -> case lookup (a,l) dels of
                                     { Nothing -> []                        
                                     ; Just bvs -> map (\(b,v) -> (b, (vs++[v]))) bvs }) currs
           in nexts `seq` runIntern nexts dels ls
The runNFA2 function takes an NFA and a list of letters, and returns a list of all possible match results. Each match result is denoted by a sequence of variable names of the same length as the input; it records the variable bound to each letter of the input. For example, matching "abaac" against the pattern above, one match result is [1,1,2,3,3]: 'a' and 'b' are bound to variable 1, the next 'a' to variable 2, and the final "ac" to variable 3.
patMatchGlushkov :: Pat -> String -> [(Int,String)]
patMatchGlushkov p w = 
   let p' = pAnnotate p
       nfa = pGlushkov p'         
       matches = runNFA2 nfa w
   in case matches of 
      { [] -> [] -- no match
      ; (m:_) -> IM.toList (collect m w IM.empty) }
   where collect :: [Int] -> String -> IM.IntMap String -> IM.IntMap String
         collect [] _ m = m
         collect (i:is) (c:cs) m = 
                 case IM.lookup i m of
                     { Just r ->  collect is cs (IM.update (\_ -> Just (r++[c])) i m )
                     ; Nothing -> collect is cs (IM.insert i [c] m) }


For instance, patMatchGlushkov ((1 :: a1b2|a3)(2 :: b4a5a6|a7)(3 :: a8c9|c10)) "abaac" yields  [(1,"ab"), (2, "a"), (3,"ac")].

The Full Source Code

> {-# LANGUAGE GADTs, BangPatterns #-} 

--------------------------------------------------------------------------------
-- Regular Expression Pattern Matching via Glushkov automata (Position based)
--------------------------------------------------------------------------------

> module Main where

> import Control.Monad
> import Data.List
> import Data.Bits
> import Data.Char (ord)
> import GHC.IO
> import Data.Int
> import qualified Data.IntMap as IM
> import qualified Data.ByteString.Char8 as S

> import System.IO.Unsafe

------------------------------
-- regular expressions

> data RE where
>   Phi :: RE                      -- empty language
>   Empty :: RE                    -- empty word
>   L :: Char -> Int -> RE                -- single letter with a position number 
>   Choice :: RE  -> RE  -> RE     -- r1 + r2
>   Seq :: RE  -> RE  -> RE        -- (r1,r2)
>   Star :: RE  -> RE              -- r*
>  deriving Eq

A word is a byte string.

> type Word = S.ByteString

Pretty printing of regular expressions

> instance Show RE where
>     show Phi = "{}"
>     show Empty = "<>"
>     show (L c n) = show c ++ show n
>     show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")
>     show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")
>     show (Star r) = (show r ++ "*")


> class IsEmpty a where
>     isEmpty :: a -> Bool

> instance IsEmpty RE where
>   isEmpty Phi = False
>   isEmpty Empty = True
>   isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
>   isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
>   isEmpty (Star r) = True
>   isEmpty (L _ _) = False


annotate adds position info to the regex

> newtype State s a = State { runState :: (s -> (a,s)) } 
 
> instance Monad (State s) where
>    -- return :: a -> State s a
>    return a        = State (\s -> (a,s))
>    -- (>>=) :: State s a -> (a -> State s b) -> State s b
>    (State x) >>= f = State (\s -> let (a,s') = x s 
>                                       stb = f a
>                                   in (runState stb) s')

> run :: s -> State s a -> (a,s)
> run s sta = (runState sta) s


> data Env = Env { cnt :: Int
>                } deriving Show

> initEnv :: Env 
> initEnv = Env 0 

> nextCounter :: State Env Int
> nextCounter = State (\env -> let c = (cnt env) + 1
>                                  env' = c `seq` env{cnt=c}
>                              in env' `seq` (c, env'))


annotate a regex with position index

> rAnnotate :: RE -> RE
> rAnnotate r = case run initEnv (rAnn r) of
>              { (r', _ ) -> r' }  

> rAnn :: RE -> State Env RE
> rAnn Phi = return Phi
> rAnn Empty = return Empty
> rAnn (Choice r1 r2) = 
>   do { r1' <- rAnn r1
>      ; r2' <- rAnn r2
>      ; return (Choice r1' r2') }
> rAnn (Seq r1 r2) = 
>   do { r1' <- rAnn r1
>      ; r2' <- rAnn r2
>      ; return (Seq r1' r2') }
> rAnn (Star r) = 
>   do { r' <- rAnn r                
>      ; return (Star r') }
> rAnn (L c _) = 
>   do { i <- nextCounter       
>      ; return (L c i) }


> rFirst :: RE -> [(Char, Int)] 
> rFirst Phi = []
> rFirst Empty = []
> rFirst (L c i) = [(c,i)]
> rFirst (Star r) = rFirst r
> rFirst (Choice r1 r2) = (rFirst r1) ++ (rFirst r2)
> rFirst (Seq r1 r2) = if isEmpty r1 then (rFirst r1) ++ (rFirst r2) else rFirst r1

> rLast :: RE -> [(Char, Int)] 
> rLast Phi = []
> rLast Empty = []
> rLast (L c i) = [(c,i)]
> rLast (Star r) = rLast r
> rLast (Choice r1 r2) = (rLast r1) ++ (rLast r2)
> rLast (Seq r1 r2) = if isEmpty r2 then (rLast r1) ++ (rLast r2) else rLast r2

> rFollow :: RE -> [(Int, Char, Int)]
> rFollow Phi = []
> rFollow Empty = []
> rFollow (L _ _) = []
> rFollow (Choice r1 r2) = (rFollow r1) ++ (rFollow r2)
> rFollow (Seq r1 r2) = (rFollow r1) ++ (rFollow r2) ++ [ (l,c,f) | (_,l) <- rLast r1, (c,f) <- rFirst r2 ]
> rFollow (Star r) = (rFollow r) ++ [ (l,c,f) | (_,l) <- rLast r, (c,f) <- rFirst r ]

> rPos :: RE -> [Int]
> rPos Phi = []
> rPos Empty = []
> rPos (L _ i) = [i]
> rPos (Choice r1 r2) = (rPos r1) ++ (rPos r2) 
> rPos (Seq r1 r2) = (rPos r1) ++ (rPos r2)
> rPos (Star r) = rPos r

> data NFA a l = NFA { states :: [a]
>                    , initStates   :: [a]
>                    , finalStates  :: [a]
>                    , delta  :: [(a,l,a)] 
>                    } deriving Show


> table :: Eq a => [(a,b)] -> [(a,[b])]
> table ps = intern [] ps
>   where intern t [] = t
>         intern t ((k,v):ps) = case lookup k t of 
>                               { Nothing -> intern ((k,[v]):t) ps
>                               ; Just vs -> intern (update k (vs++[v]) t) ps }
>         update k v t = let t' = filter (\(k',_) -> not (k == k')) t
>                        in (k,v):t'


> runNFA :: (Eq a, Eq l) => NFA a l -> [l] -> Bool
> runNFA nfa w = 
>    let xs = runIntern (initStates nfa) (table (map (\(f,s,t) -> ((f,s),t)) (delta nfa))) w
>    in any (\x -> x `elem` finalStates nfa) xs
>    where runIntern :: (Eq a, Eq l) => [a] -> [((a,l),[a])] -> [l] -> [a]
>          runIntern currs _ [] = currs
>          runIntern currs dels (l:ls) = 
>            let nexts = concatMap (\a -> case lookup (a,l) dels of
>                                      { Nothing -> []                        
>                                      ; Just b  -> b }) currs
>            in nexts `seq` runIntern nexts dels ls


> rGlushkov :: RE -> NFA Int Char
> rGlushkov r = let r' = rAnnotate r
>               in NFA{ states = 0:(rPos r')
>                     , initStates = [0]
>                     , finalStates = if isEmpty r then 0:(map snd (rLast r')) else (map snd (rLast r'))
>                     , delta = [ (0,c,f) | (c,f) <- rFirst r' ] ++ (rFollow r') }



> runGlushkov :: RE -> String -> Bool
> runGlushkov r w = 
>    let r' = rAnnotate r   
>        nfa = rGlushkov r'          
>    in runNFA nfa w 


label with default position 0

> l0 c = L c 0

r1 = ((a|b)*,c)

> r1 = Seq (Star (Choice (l0 'a') (l0 'b'))) (l0 'c')



> data Pat where
>  PVar :: Int -> RE -> Pat 
>  PPair :: Pat -> Pat -> Pat  
>  PChoice :: Pat -> Pat -> Pat 
>   deriving Show

> strip :: Pat -> RE 
> strip (PVar _ r) = r
> strip (PPair p1 p2) = Seq (strip p1) (strip p2)
> strip (PChoice p1 p2) = Choice (strip p1) (strip p2)


extending annotate operation for patterns

> pAnnotate :: Pat -> Pat
> pAnnotate p = case run initEnv (pAnn p) of { (p', _ ) -> p' }

> pAnn :: Pat -> State Env Pat
> pAnn (PVar v r) = do { r' <- rAnn r
>                        ; return (PVar v r') }
> pAnn (PPair p1 p2) = do { p1' <- pAnn p1
>                         ; p2' <- pAnn p2 
>                         ; return (PPair p1' p2') }
> pAnn (PChoice p1 p2) = do { p1' <- pAnn p1
>                           ; p2' <- pAnn p2
>                           ; return (PChoice p1' p2') }

extending first, last and follow operations for patterns

the results of pFirst and pLast are lists of triples: (the letter, the position, and the pattern variable updated)

> pFirst :: Pat -> [(Char, Int, Int)]
> pFirst (PVar v r) = [ (c,i,v) | (c,i) <- rFirst r ]
> pFirst (PPair p1 p2) | isEmpty (strip p1) = pFirst p1 ++ pFirst p2
>                      | otherwise          = pFirst p1
> pFirst (PChoice p1 p2) = pFirst p1 ++ pFirst p2

> pLast :: Pat -> [(Char, Int, Int)]
> pLast (PVar v r) = [ (c,i,v) | (c,i) <- rLast r ]
> pLast (PPair p1 p2) | isEmpty (strip p2) = pLast p1 ++ pLast p2
>                     | otherwise          = pLast p2
> pLast (PChoice p1 p2) = pLast p1 ++ pLast p2

we also include the updated pattern variable in the result of the follow operation

> pFollow :: Pat -> [(Int, Char, Int, Int)]
> pFollow (PVar v r) = [ (p, c, q, v) | (p, c, q) <- rFollow r ]
> pFollow (PPair p1 p2) = (pFollow p1) ++ (pFollow p2) 
>                         ++ [ (l,c,f,v) |  (_,l,_) <- pLast p1, (c,f,v) <- pFirst p2 ]
> pFollow (PChoice p1 p2) = (pFollow p1) ++ (pFollow p2)


> pPos :: Pat -> [Int]
> pPos (PVar v r) = rPos r
> pPos (PPair p1 p2) = pPos p1 ++ pPos p2
> pPos (PChoice p1 p2) = pPos p1 ++ pPos p2


> snd2 :: (a,b,c) -> b
> snd2 (_,x,_) = x

we need a different NFA type because now delta should keep track of which pattern variable is updated

> data NFA2 a l = NFA2 { states2 :: [a]
>                    , initStates2   :: [a]
>                    , finalStates2  :: [a]
>                    , delta2  :: [(a,l,a,Int)] 
>                    } deriving Show

> -- return a list of variable bindings
> runNFA2 :: (Eq a, Eq l) => NFA2 a l -> [l] -> [[Int]]
> runNFA2 nfa w = 
>    let xvs = runIntern (zip (initStates2 nfa) (repeat [])) 
>               (table (map (\(f,s,t,v) -> ((f,s),(t,v))) (delta2 nfa))) w
>    in map snd (filter (\(x,v) -> x `elem` finalStates2 nfa) xvs)
>    where runIntern :: (Eq a, Eq l) => [(a,[Int])] -> [((a,l),[(a,Int)])] -> [l] -> [(a,[Int])]
>          runIntern currs _ [] = currs
>          runIntern currs dels (l:ls) = 
>            let nexts = concatMap (\(a,vs) -> case lookup (a,l) dels of
>                                      { Nothing -> []                        
>                                      ; Just bvs -> map (\(b,v) -> (b, (vs++[v]))) bvs }) currs
>            in nexts `seq` runIntern nexts dels ls

> pGlushkov :: Pat -> NFA2 Int Char
> pGlushkov p = let p' = pAnnotate p
>               in NFA2{ states2 = 0:(pPos p')
>                     , initStates2 = [0]
>                     , finalStates2 = if isEmpty (strip p) then 0:(map snd2 (pLast p')) else (map snd2 (pLast p'))
>                     , delta2 = [ (0, c, f, v) | (c,f,v) <- pFirst p']  ++ (pFollow p') 
>                     }



> patMatchGlushkov :: Pat -> String -> [(Int,String)]
> patMatchGlushkov p w = 
>    let p' = pAnnotate p
>        nfa = pGlushkov p'         
>        matches = runNFA2 nfa w
>    in case matches of 
>       { [] -> [] -- no match
>       ; (m:_) -> IM.toList (collect m w IM.empty) }
>    where collect :: [Int] -> String -> IM.IntMap String -> IM.IntMap String
>          collect [] _ m = m
>          collect (i:is) (c:cs) m = 
>                  case IM.lookup i m of
>                      { Just r ->  collect is cs (IM.update (\_ -> Just (r++[c])) i m )
>                      ; Nothing -> collect is cs (IM.insert i [c] m) }



> p4 = PPair (PPair p_x p_y) p_z
>    where p_x = PVar 1 (Choice (l0 'A') (Seq (l0 'A') (l0 'B')))
>          p_y = PVar 2 (Choice (Seq (l0 'B') (Seq (l0 'A') (l0 'A'))) (l0 'A'))
>          p_z = PVar 3 (Choice (Seq (l0 'A') (l0 'C')) (l0 'C'))

> s4 = "ABAAC"

Ungreedy matching can be easily adopted in the Glushkov construction.

e.g. consider p = (x :: a1*?, y :: a2*), where 1 and 2 are position tags.

first p = [ ( 'a', 1, x) ,  ('a', 2, y) ]

last p  = [ ('a', 1, x), ('a', 2, y) ]

follow p = (follow (x :: a1*?)) ++ (follow (y :: a2*)) ++
           [ (p1,c2,p2,v2) | (c1,p1,v1) <- last (x :: a1*?), (c2,p2,v2) <- first (y :: a2*) ]
         = [ (p,c,p',x) | (p,c,p') <- follow (a1*?) ] ++
           [ (p,c,p',y) | (p,c,p') <- follow (a2*) ] ++
           [ (p1,c2,p2,v2) | (c1,p1,v1) <- last (x :: a1*?), (c2,p2,v2) <- first (y :: a2*) ]
         = [ (1,'a',1,x) ] ++ [ (2,'a',2,y) ] ++ [ (p1,c2,p2,v2) | (c1,p1,v1) <- [('a',1,x)], (c2,p2,v2) <- [('a',2,y)] ]   -- (1)
         = [ (1,'a',1,x) ] ++ [ (2,'a',2,y) ] ++ [ (1,'a',2,y) ]
Note that in (1) we have all the transitions. Assume that during matching the transitions are tried in left-to-right order. Hence (1,'a',1,x) is always tried before (1,'a',2,y), which leads to greedy matching.

On the other hand, if we swap [ (1,'a',1,x) ] with [ (1,'a',2,y) ], we will have non-greedy matching.

Hence, for a non-greedy match, if a1 is non-greedy:
follow (q1,q2) = [ (p1,c2,p2,v2) | (c1,p1,v1) <- last q1, (c2,p2,v2) <- first q2 ]
                 ++ follow q1 ++ follow q2
       


The end



Friday, April 16, 2010

Regular Expression Matching using Partial Derivative

We have finally finished the first draft of our paper.

Abstract:
Regular expression matching is a classical and well-studied problem.
Prior work applies DFA and Thompson NFA methods for the construction
of the matching automata. We propose the novel use of derivatives and
partial derivatives for regular expression matching.
We show how to obtain algorithms for various
matching policies such as POSIX and greedy left-to-right.
Our benchmarking results show that the run-time performance is promising
and that our approach can be applied in practice.

Monday, May 11, 2009

Implementing regular expression matching using Partial derivative (Part 4: The partial derivative approach)

In the previous post, we presented a regex pattern matching algorithm that uses derivatives. The idea is to extend the derivative operation to handle regex patterns. With this extension, we can perform regex pattern matching by "pushing" all the letters of the input word into the pattern via derivatives, and collect the results by examining the resulting pattern (derivative).

This approach is neat but hard to put to practical use, for the following reasons.

Issue #1) The derivative operation for regex patterns always builds new patterns, and the "structural size" of a derivative pattern is in general larger than that of the original. This prevents us from obtaining a compilation scheme.

Issue #2) The derivative-based algorithm requires backtracking.
To illustrate, let us consider the following example,


Let p be the pattern (x :: (A|(AB)), y :: (A|C) ),
> p = PPair (PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))) (PVar 2 S.empty (Choice (L 'A') (L 'C')))

Let w be the input AA
> w = S.pack "AA"

Matching w against p
> firstmatch p w -- (1)


To compute (1), we need to compute the derivative (dPat (dPat p 'A') 'A').


(dPat (dPat p 'A') 'A')
--> dPat (PPair (PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "" (Choice (L 'A') (L 'C')))) 'A'
--> PChoice                                                                -- (2)
      (PPair (PVar 1 "AA" (Choice Phi Phi)) (PVar 2 "" (Choice (L 'A') (L 'C'))))
      (PPair (PVar 1 "A" (Choice Empty (L 'B'))) (PVar 2 "A" (Choice Empty Phi)))


Since we are searching for the first match, we search (2) from left to right for the first successful match. It turns out that the first alternative of (2) leads to a matching failure: its first component carries the regex (Phi|Phi), which accepts nothing. Therefore, we have to backtrack and look at the second alternative, in which we find the match [(1,"A"), (2,"A")].

We all know that backtracking is costly. The situation could be worse in the context of regex pattern matching, where we not only need to test for a successful match but also keep track of the bindings.

To address issue #1, we need to take a different approach, which is based on partial derivatives.

So what is the partial derivative? How does it differ from derivative?

Given a regex r, the partial derivatives of r with respect to some letter l are a set of regexes which are the possible results of removing l from r.

Derivatives and partial derivatives both describe the "states" after removing some leading letter l from a regex. The derivative operation yields a single regex, whereas the partial derivative operation yields a set of regexes. We can think of derivatives as states in a DFA, because they are deterministic: from one input regex and a letter, we obtain exactly one derivative.
On the contrary, we can think of partial derivatives as states in an NFA, because they are non-deterministic: from one input regex and a letter, we obtain a set of partial derivatives.

Note that, as we pointed out earlier, the set of all possible derivatives of a regex can be infinite.
On the other hand, the set of all possible partial derivatives of a given regex is finite. This is one of the important results in Antimirov's work.

We recast Antimirov's definition of partial derivative operations as follows,

> partDeriv :: RE -> Char -> [RE]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l
>   | l == l'   = [Empty]
>   | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l
>   | isEmpty r1 =
>       let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
>           s2 = partDeriv r2 l
>       in nub (s1 ++ s2)
>   | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]
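For example, taking the regex (A|(AB)) bound to x in the pattern p above:

partDeriv (Choice (L 'A') (Seq (L 'A') (L 'B'))) 'A'
= nub ([Empty] ++ [Seq Empty (L 'B')])
= [Empty, Seq Empty (L 'B')]

i.e. after consuming 'A', what remains is either the empty word (via the left alternative) or, effectively, B (via the right one).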


It is not hard to realize that by making use of the partial derivative operation we can define a derivative operation which yields a "minimal" regex.

Exercise: Implement the derivative operation by making use of the partial derivative operation.

Furthermore, we are able to use the partial derivative operation to solve the word problem.

Exercise: Implement an algorithm that solves the word problem using the partial derivative operations.
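For the second exercise, here is one possible sketch; the name wordMatch is ours, and it relies only on partDeriv above plus isEmpty and nub from the earlier posts.

> -- push each letter through all the current partial derivatives,
> -- then test whether any resulting regex accepts the empty word
> wordMatch :: RE -> [Char] -> Bool
> wordMatch r w = any isEmpty (foldl step [r] w)
>   where step rs l = nub (concatMap (\r' -> partDeriv r' l) rs)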


Like what we did to the derivative operation, we can extend the partial derivative operation to operate on regex patterns.


> pdPat :: Pat -> Letter -> [Pat]
> pdPat (PVar x w r) l =
>   let pd = partDeriv r l
>   in if null pd then []
>      else [PVar x (w `S.append` (S.pack [l])) (resToRE pd)]
> pdPat (PPair p1 p2) l =
>   if (isEmpty (strip p1))
>   then ([ PPair p1' p2 | p1' <- pdPat p1 l ] ++
>         [ PPair (mkEmpPat p1) p2' | p2' <- pdPat p2 l ])
>   else [ PPair p1' p2 | p1' <- pdPat p1 l ]
> pdPat (PChoice p1 p2) l =
>   ((pdPat p1 l) ++ (pdPat p2 l))

Summing up a list of regular expressions with the choice operation:

> resToRE :: [RE] -> RE
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi
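To see pdPat in action on the running example, push the first 'A' of the input into p:

pdPat p 'A'
= [ PPair (PVar 1 "A" (Choice Empty (Seq Empty (L 'B')))) (PVar 2 "" (Choice (L 'A') (L 'C'))) ]

The two partial derivatives of (A|(AB)) with respect to 'A' are summed up by resToRE into a single regex, and the consumed letter is recorded in the binding of variable 1.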



And the partial derivative pattern matching algorithm follows naturally,


> allmatch :: Pat -> Word -> [Env]
> allmatch p w = concat (map collect (allmatch' [p] w))
>   where
>     allmatch' :: [Pat] -> Word -> [Pat]
>     allmatch' ps w =
>       case S.uncons w of
>         Nothing -> ps
>         Just (l,w') -> let ps' = (concat [ pdPat p l | p <- ps ])
>                        in allmatch' ps' w'

> firstmatch :: Pat -> Word -> Maybe Env
> firstmatch p w =
>   case allmatch p w of
>     [] -> Nothing
>     (env:_) -> Just env



Recall the previous example


Kenny's example

> p4 = PPair (PPair p_x p_y) p_z
>   where p_x = PVar 1 S.empty (Choice (L 'A') (Seq (L 'A') (L 'B')))
>         p_y = PVar 2 S.empty (Choice (Seq (L 'B') (Seq (L 'A') (L 'A'))) (L 'A'))
>         p_z = PVar 3 S.empty (Choice (Seq (L 'A') (L 'C')) (L 'C'))

> input = S.pack "ABAAC"




*Main> firstmatch p4 input
Just [(1,"AB"),(2,"A"),(3,"AC")]


Note that there is a slight difference between the result obtained from the derivative-based algorithm and the one above. The key observation here is that the two algorithms differ when the regex pattern has left-associated nested pairs. When the pair patterns are nested to the right, the two algorithms behave the same, which can be proven easily. In real-world regex pattern matching, as in Perl or grep, we do not have nesting in pattern sequences.

One key observation is that the set of all partial derivative patterns is finite if we drop the "cumulative" bindings. This gives us an opportunity to turn the above algorithm into a compilation scheme.
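To make the observation concrete, here is a rough sketch; the names dropBindings and reachable are ours, and we assume an Eq instance for Pat so that nub can compare states.

> -- forget the cumulative bindings, keeping only the pattern structure
> dropBindings :: Pat -> Pat
> dropBindings (PVar x _ r) = PVar x S.empty r
> dropBindings (PPair p1 p2) = PPair (dropBindings p1) (dropBindings p2)
> dropBindings (PChoice p1 p2) = PChoice (dropBindings p1) (dropBindings p2)

> -- saturate the binding-free state space over a given alphabet;
> -- per the observation above, this terminates with a finite set
> reachable :: [Char] -> Pat -> [Pat]
> reachable sigma p = go [dropBindings p]
>   where go ps = let ps' = nub (ps ++ [ dropBindings q | p' <- ps, l <- sigma, q <- pdPat p' l ])
>                 in if length ps' == length ps then ps else go ps'

These binding-free patterns are the states a compilation scheme could enumerate up front.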

We have yet to address the issue of backtracking (#2).
(To be continued)