Fast Regular Expression Searching

Boyer-Moore scanner
For finite automata
Sometimes very fast

Copyright 1998, Sean Barrett - sean at nothings dot org

Abstract

An algorithm analogous to Boyer-Moore string searching, but for regular expressions rather than strings (actually, for searching finite automata), allows searching a text of n characters for a pattern whose shortest match is m characters while examining only n/m characters in the best case, and n in the worst case. Searching while allowing k errors examines kn/m characters in the best case. Worst case performance could be pretty bad, as a new finite automaton is (lazily) constructed, whose possible states number on the order of m*2^(s^2)*2^s, where s is the number of states in the original nfa; this should be compared to 2^s, the number of possible states in the traditional search DFA.

Performance

The algorithm described provides a tight inner loop quite comparable to both traditional DFA scanners and to traditional Boyer-Moore scanners. On simple scans, it should be competitive--but I'm not observing this currently. When I tested it against gnu grep in 1993, it was competitive on simple scans, and on long complex scans (sufficient to defeat certain optimization strategies used by gnu grep), it was faster. I do not see that performance today, however, at least in a "grep" context.

Boyer-Moore scanning is of questionable utility for text-string searching on modern machines. Computation speeds are outrunning memory access speeds significantly. A scanner which reads only every 16th character will incur the same cache penalties as one which reads every character if cache lines are 16 bytes long. I suppose it's still interesting for some specialized cases involving very long patterns. On the other hand, very long patterns tend to imply very large texts, and if the text is not in main memory, I/O time may dominate even over an algorithm that examines every character.

In retrospect, this algorithm seems like a rather obvious, brute-force way of doing things. But since I have yet to see a reference to this approach, it seems best to publish it.

Acknowledgements

I'd like to acknowledge that this is a pretty crappy "paper", but on the other hand, it's entirely unfunded. I did the original research in '90-'91 in my spare time while I was getting my bachelor's degree. I abandoned it when I graduated, but then eventually I implemented it and verified the utility of the research in 1993. Since then, I've tried to write a paper about five times, but each time I became bogged down in having too much to explain, and trying to describe it with carefully exact mathematical formulae.

So, this time I'm just going to crank it out. I will make no reference to other research (which I'm unfortunately out of touch with anyway--I have no idea how much of the wheel I've reinvented). I'm not going to explain stuff in very much detail; this version of my "paper" is aimed at experts; I will not explain what a finite automaton is, etc.

I acknowledge that I could do a lot more work to improve things, but eight years is a long time to not publish. I am continuing to write up a careful mathematical characterization of the underlying principles.

Profiling SGREP under DOS on a Pentium 166 on a short simple scan revealed 66% of the time being spent waiting for the OS to transfer file data to the app (and the file data was all in a main-memory cache). I must build a test jig that simply searches a memory buffer. This means having to actually implement some other algorithm to compare against myself, and there don't seem to be any to compare against. Well, I could compare against a matching-with-errors scanner, but since SGREP does more than just plain text, this probably isn't very fair. I might try it if I can find a simple one, though. But this is unfunded research, what do you expect?

The Algorithm

This is a dumb name, but I call this search technique "suffix-prefix-infix search", or SPI (pronounced "spy").

The algorithm as it would apply to plain string searching can be summarized briefly: keep an anchor point where a match might begin; scan backwards through the text from a point ahead of the anchor; and after each character, apply two heuristics (suffix-prefix and suffix-infix, described below) to decide whether the anchor can safely be shifted forward, and how far.

Analogy with Boyer-Moore

The basic Boyer-Moore scanner searches through the text by moving an anchor point from left to right. The location of the anchor point is the place where a matching string might start. The algorithm places a scanning point further to the right of the anchor point, and scans it leftwards back towards the anchor.

SPI works identically, except the anchor does not need to be at the start of the pattern.

Boyer-Moore scans backwards until it has learned enough about the text that it can shift forwards. Specifically, it must search backwards until the text doesn't match the pattern. At this point, it applies one of two heuristic rules (I'm calling them heuristic because they are approximate, since they fail to exploit all the information available):

   - the "good suffix" rule: shift forward by the minimum distance that realigns the matched suffix with another occurrence of those characters in the pattern (or with a matching prefix, or past them entirely);
   - the "bad character" rule: shift forward so that the mismatched text character lines up with its rightmost occurrence in the pattern (or past the pattern, if it occurs nowhere).

These heuristics are used because they allow small tables to be used--each of them only looks at one of the two pieces of state (the known suffix and the mismatched character). If no characters have been matched, the first rule offers a 0-distance shift, but the second rule always gives a positive shift. Otherwise, the first rule gives a positive shift, but the second rule may give a negative shift.
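
To make "small tables" concrete, here's a minimal C sketch of the classic table behind the second rule (the names are mine; real implementations vary in the details):

   /* Bad-character table: for each byte c, how far the pattern may shift
      when c mismatches against the pattern's last position.  A character
      absent from the pattern allows a shift of the full pattern length. */
   void build_badchar_table(const unsigned char *pat, int m, int shift[256])
   {
      int c, i;
      for (c = 0; c < 256; ++c)
         shift[c] = m;                 /* absent: shift entirely past c */
      for (i = 0; i < m - 1; ++i)
         shift[pat[i]] = m - 1 - i;    /* rightmost occurrence wins */
   }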

SPI also scans backwards until it has learned enough to shift the anchor forward. It also uses two independent heuristics to shift forwards. To compare with Boyer-Moore, we need to be a little bit more explicit about what happens with BM.

Modes of Internal Matching

Both Boyer-Moore heuristics have three possible behaviors, based on the text characters "known" to that heuristic (the known suffix for the first, and the one mismatched character for the second):

   - mismatch: the known characters cannot be aligned with any occurrence in the pattern, so the pattern shifts entirely past them;
   - prefix-match: the known characters match a prefix of the pattern, so the pattern shifts to align its prefix with them;
   - infix-match: the known characters match somewhere in the interior of the pattern, so the pattern shifts to align that interior occurrence with them.

"Prefix-match" is so called because the examined characters between the start of the scan and the new placement of the anchor are a valid prefix of the pattern (as well as having been a valid suffix); similarly for "Infix-match".

The following table demonstrates these principles, using only the first Boyer-Moore heuristic. The mismatched character is represented by '-', indicating a character whose value is NOT known to the heuristic making the shift decision.

text         ??-ere??????    ???-re??????    ????-e??????
pattern      revere          revere          revere
shifts to          revere        revere        revere
description    mismatch      prefix-match    infix-match

SPI Heuristic Summary

The above description of Boyer-Moore makes characterizing SPI quite straightforward. SPI applies two heuristics, both of which are based on all the characters scanned while at this anchor point, and on the full history of the characters up to the anchor point.

The full history at the anchor point is simply the state of a traditional regexp scanning NFA.

The two heuristics are the Suffix-Prefix match, which shifts the anchor all the way past the scanned characters and computes the exact NFA state at the new position, and the Suffix-Infix match, which shifts the anchor only part of the way forward and resets the NFA to its start state. The Suffix-Prefix match also handles mismatches.

Regular Expression Searching

One problem with Boyer-Moore scanning is that it doesn't guarantee it will look at no more than n characters in a text of size n.

This is true even if no matches are found. The problem can be viewed in two ways: one is that a shift by fewer characters than we've scanned makes it possible we'll scan more than n. Another way to view it is that certain shifts lose information. (I assume this is well-known, although I've not read the literature about it; so I've provided an example search pattern in an appendix. It's possible I've made a mistake, I suppose.)

To avoid scanning more than n characters, we will not shift forwards unless the shift amount is larger than the scanning amount. To do this requires one hefty abstract change to the Boyer-Moore algorithm: Since some shifts might violate the size constraint, we may continue scanning backwards even though the being-scanned characters no longer match the pattern.

This change might seem problematic, but in a moment we'll see why it's not.

Because the solution involves the use of finite automata, let's finally step away from Boyer-Moore and examine the problem of doing this backward scan using finite automata.

Backwards Scanning For Finite Automata Searches

Let's take a very simple but crucial operation. There might be a match at our current anchor point, and we want to determine this during the backwards scan.

That is, we ask: what operation corresponds to comparing characters against the source text during the reverse scan?

Suppose we assume that it does match. That means that from our current automaton state S, if we process the next m characters c1..cm in forward order, then we will reach a goal state. That means there is a sequence of transitions that lead from one of our current nfa states to one of the goal states, and those transitions are on c1..cm in order. [1 (technical clarification)]

Symmetrically, that means there is a "reverse" path leading from one of the goal states to one of our current states, where we follow transitions in the automaton backwards. This path will be characterized by the characters cm..c1 in order.

This leads to a simple, efficient solution to the posed problem: create a new automaton which is "reversed" (all the transitions have their endpoints swapped). Let the start state be the set of all goal states from the original. Process the characters from cm..c1 through this automaton. Compare the resultant set of states with the original set S; if the intersection is non-empty, then the pattern matches.
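
Here's a minimal C sketch of this construction (the bit-mask representation and all names are mine, for illustration--this assumes the NFA fits in 32 states):

   #include <stdint.h>

   #define NSTATES 32                      /* assume the NFA fits in 32 states */

   extern uint32_t fwd[256][NSTATES];      /* fwd[c][s]: successors of s on c   */
   uint32_t rev[256][NSTATES];             /* rev[c][t]: predecessors of t on c */

   void build_reverse(void)
   {
      int c, s, t;
      for (c = 0; c < 256; ++c)
         for (s = 0; s < NSTATES; ++s)
            for (t = 0; t < NSTATES; ++t)
               if (fwd[c][s] & (1u << t))
                  rev[c][t] |= 1u << s;     /* swap the edge's endpoints */
   }

   static uint32_t rev_step(uint32_t set, unsigned char c)
   {
      uint32_t out = 0;
      int s;
      for (s = 0; s < NSTATES; ++s)
         if (set & (1u << s))
            out |= rev[c][s];
      return out;
   }

   /* would processing text[0..m-1] forwards from some state in S hit a goal? */
   int match_at_anchor(uint32_t S, uint32_t goals, const unsigned char *text, int m)
   {
      uint32_t set = goals;                /* start from the goal states...   */
      int i;
      for (i = m - 1; i >= 0; --i)         /* ...and run the characters in    */
         set = rev_step(set, text[i]);     /* reverse order                   */
      return (set & S) != 0;               /* non-empty intersection: a match */
   }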

Getting the Work Done

We now reach the crucial point in the design of SPI. Suppose we verify as above that we do not in fact match at the current anchor point. We are now most definitely obliged to advance our anchor point past all of the characters we've just scanned (so as to maintain the no-more-than-n-characters-scanned behavior). How the heck can we figure that out without looking at any more characters?

We just described how we can determine if we reach some goal state G (or set of states) after scanning a particular set of characters in reverse order by the use of a reverse-automaton. There was nothing special to the construction about the eventual destination being a goal state. We can answer the question "can I reach state X" in the same way, by starting in that state in the reverse automaton, and running through all the characters.

So, since we want to know all the states that we will reach, we can simply answer individually each of the questions "will I reach state X" for all the states in the graph.

But Shirley, That's Too Much Work!

If the scanning NFA has s states, then the described algorithm will require us to update s different reverse-NFAs. Even if we convert the reverse-NFA to a DFA, that's clearly too much work, and this isn't even letting us skip any characters--we're just scanning them backwards.

Fortunately, we can do a lot better. As to why we're doing this--it turns out that the solution will provide us with all the text-dependent-information we need to implement the other heuristics. One of the ways I've gotten bogged down trying to explain this before is trying to motivate the solution; here I'm sidestepping it.

One way to do better is to make a mega-automaton. Each mega-automaton state is a collection of all s of the states of the reverse scanning automata.

This is not only one way, it is the right way. To understand why it's not as outrageous as it sounds (each of the sub-automata has as many as 2^s states, and there are s of them, so the mega-automaton can have 2^(s^2) states), we must first pretend we didn't come up with this solution.

Wasted Work

Suppose the input searching NFA is itself a DFA--this is certainly possible, since DFAs are NFAs. The relevant property of DFAs for our concern is that they have exactly one outgoing transition for each possible input character.

Is the reverse scanning NFA (constructed by reversing the transitions) a DFA?

Generally, no. The DFA for scanning "food" has a node corresponding to "if we see a 'd', accept". The only way to reach this node is by seeing an 'o'. That means the only incoming transition is on 'o'. When we reverse the DFA, the only outgoing transition from that state will be on 'o', and hence it will not be a DFA, at least not under this definition.

When we run our reverse search for that state (we'll run one for each state), if the first character we process, cm, is not 'o', then this automaton search will immediately step to the empty set. All further searches for that end state are irrelevant, because we can't get there.

That means we could cull out "dead" searches as we go, and not always have to search all of them.

We can do this more efficiently, however, by looking at the problem in a different light. Picture an automaton as a transition graph with a boolean flag at each node indicating whether it's in the current set or not.

Our prior characterization of the reverse scan was to imagine s of these graphs whose only difference was that the boolean states were different. Let us take a different tack, and superpose all the graphs together.

Characterizing Parallel Reverse Scan

Suppose we simply "or together" all the bits from all of the separate graphs. A moment's thought shows that we can do this initially, or after processing the characters--either way we get the same result, since all of them operate on the same graph.

If we do it initially, then we only have to update one NFA. So, we can ask: what does such an NFA compute? (Since we no longer have the information about the destinations, it does not compute the state our anchor would be in when we get there.)

Assume for a moment that our nfa scanner implements a simple plain string search. The state of the full parallel search provides us with the information for what current states would lead to what destination states. If we throw out the destination state knowledge, then all it tells us is what current states might lead anywhere in the pattern.

(To draw a more useful conclusion, let's put one simple constraint on the imagined search NFA. Assume that the start state is self-looping, but that no other state points back to the start. That guarantees that on the reverse scan, any state which gets to the start stops immediately. Without this assumption, during the reverse scan all states could always be being reached from the start state, and we couldn't infer anything.)

Ok, so it tells us what current states might lead anywhere in the pattern--it presents a list of states which if they were true in the forward direction at that character, would then lead to somewhere else in the pattern.

Ok, let's motivate it a little more clearly. Suppose we've got this reverse finite automaton, but we never bothered giving it any goal states. Suppose we make the start state(s) of the original nfa be the goal state(s) of the reverse one. (Seems obvious, right?)

If the scanner starts at some state x, and processes characters backwards, what does it mean if it reaches a goal state of the reverse automaton? It means it reached the start of the original--the start of the string if it's a string matcher.

It means the scanner matched a prefix while scanning backwards.

And what does it mean if it reaches any old state, not just the start state? So long as the route didn't go through the start state (which is why we made the above self-looping constraint), it means we've matched an "infix"--we've found a subsequence of the string.

A Mapping Reverse Scanner

If we don't throw out the extra information that we did in the prior section, we immediately discover that our reverse-scanning system is finding internal subsequences of the pattern to be searched for, and finding them fully-- it knows if a state x exists in the search from state y, then it's matched a pattern subsequence which in a forward search leads from x to y. Let us recombine the automata in a different way so we don't throw out all this information.

We considered the automaton as a graph with a boolean at each node, indicating whether that node was in the current set of states. Now, instead of a boolean, each node will have one boolean for each "destination state"; we can alternately view this as a list of all the destination states for which that boolean is true.

We began the search with each of the independent automata having one state-- if an automaton was computing if state x was reachable, then it started with state x as its start state. (Or maybe the epsilon-closure of x, depending on how technical you want to get.)

That means that our superposed automaton begins with each state having a list of one element--the state itself.

Before, to update an automaton, we would take each boolean flag that was 1, and "propagate it" along the outgoing transitions (outgoing in the reverse graph) for the input character. Under the new one, that means we propagate all the boolean 1s at once.

In other words, if a given state has more than one item in its list, those items can be updated "simultaneously"--they both propagate to the same nodes. If a destination state becomes unreachable (as in "food" above), it disappears from this composite automaton--no extra testing is necessary to speed up this case.

The above description shows a practical way of accelerating the parallel NFA operations by trying to take advantage of that parallelism. In return, note that the graph is fully populated when we start--every node contains a non-empty list. Thus, clearly it's a lot more work initially.

Now, let us return to what we said it meant. I wrote of the original "it knows if a state x exists in the search from state y, then it's matched a pattern subsequence which in a forward search leads from x to y." Under the new scheme, at each state x we'll have a set of zero or more states Y. This means that a forward automaton starting in state x will reach all of Y (and nothing else) during the forward scan.

Hence Y is the list of all the states reached from x under this circumstance. If we consider the state of the full automaton, all the xs at once, we can imagine it as a set of ordered pairs <x,Y> where Y is a set of states. I call this set a "mapping"--it tells us the outcome of all the states x given the forthcoming (but now otherwise forgotten) characters.

For example, if we determine in the forward direction that our automaton will be in state S, then the mapping tells us how to project those states forward and determine our final state after processing with those unseen characters.
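
Continuing the bit-mask representation from the earlier sketch (again, the names are mine), the composite automaton can be held as one destination-set per state, and one update propagates every destination set at once:

   #include <stdint.h>
   #define NSTATES 32
   extern uint32_t fwd[256][NSTATES];   /* forward transitions, as before */

   uint32_t map[NSTATES];      /* map[x] = destination set Y: the states the
                                  forward scan reaches at the far end of the
                                  scanned characters, if it is in x here */

   void map_init(void)         /* identity mapping: each state maps to itself */
   {
      int x;
      for (x = 0; x < NSTATES; ++x)
         map[x] = 1u << x;
   }

   /* extend the mapping backwards over one more character c: the new mapping
      for x is the union of the old mappings of every state x steps to on c */
   void map_step(unsigned char c)
   {
      uint32_t newmap[NSTATES];
      int x, t;
      for (x = 0; x < NSTATES; ++x) {
         newmap[x] = 0;
         for (t = 0; t < NSTATES; ++t)
            if (fwd[c][x] & (1u << t))
               newmap[x] |= map[t];
      }
      for (x = 0; x < NSTATES; ++x)
         map[x] = newmap[x];   /* x drops out of the domain if its set is empty */
   }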

SPI in Detail

Here's the SPI algorithm:

Keep an anchor point in the text, and the NFA search state resulting from forward-scanning all the characters prior to the anchor point in the text. (The second heuristic makes this claim slightly inaccurate--it is a "safe" approximation to the true state.)

After moving the anchor to a new location, the minimum distance until a match might occur is computed. A scanning pass then begins that many characters ahead, working backwards towards the anchor point. The mapping as described in the previous section is computed. After each character is processed, two heuristics are evaluated to determine if a forward shift of the anchor point is possible. Three pieces of information are available to the heuristics: the anchor point's nfa-state, the current scan distance, and the mapping.
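
In outline, one scanning pass looks something like the following C sketch. All the helper names are hypothetical (the real code precomputes all of this into mega-DFA states); the helpers themselves are sketched in the sections below.

   #include <stdint.h>

   /* helpers sketched in the following sections */
   int  max_lookahead(uint32_t S);
   void map_init(void);
   void map_step(unsigned char c);
   int  suffix_prefix_shift(uint32_t S, int i, int k);
   int  suffix_infix_shift(uint32_t S, int i, int k);

   /* one scanning pass at the current anchor; S is the anchor's NFA state-set;
      indexing relative to the anchor is schematic */
   void spi_pass(uint32_t S, const unsigned char *text, long anchor)
   {
      int k = max_lookahead(S);     /* minimum distance until a match might end */
      int i = k;
      map_init();                   /* reset the mapping to the identity */
      while (i >= 1) {
         map_step(text[anchor + i]);         /* scan one character backwards */
         --i;                                /* i is now the scan location */
         if (suffix_prefix_shift(S, i, k))   /* full shift by k, exact state */
            return;
         if (suffix_infix_shift(S, i, k))    /* partial shift, reset to start */
            return;
      }
      /* scanned all the way back to the anchor: a match is possible here */
   }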

Suffix-Prefix match

This heuristic recognizes when the characters seen so far match a prefix or are a mismatch. (That is, they match only a prefix, so it is correct to shift forward to align the prefix.) Previously I stated that it causes the anchor to be shifted forwards so that the anchor is embedded in those characters so as to match up with the prefix, but that was a lie. In truth, since the mapping is available, Suffix-Prefix moves the anchor point all the way past all of the scanned characters, using the mapping to compute the new anchor-point nfa state which would occur at that point. This guarantees that no characters will be visited more than once, and hence only n characters will be visited.

The actual distinguishing characteristic of this heuristic is that it always does a full length forward shift. It actually isn't a heuristic in this sense; it shifts forward as soon as it can, based on all the information available to it. (Except that I introduce an approximation to simplify the computation.)

Suffix-Infix match

One can imagine that this heuristic attempts to shift forward to match up the pattern with some internal sequence. I think that's pretty much true, but it's not how I think about it in a formal NFA-processing sense.

Really what this heuristic does is answer the following question: How far forward can I shift such that using the start state as my NFA anchor state produces correct results? (In this case, my heuristic is only approximate-- it does not compute the answer perfectly.)

To put this more concretely: if the scanner shifts the anchor point forward to a location where it hasn't seen the previous character, it can't know the actual state the forward-scanning NFA would be in when it got there (because it hasn't seen the previous character).

However, we can know that any such missing states cannot reach a goal state. (Remember, no loops back to the start state from anything but the start state.) We can know this because we've seen "the future" (the characters ahead of that position).

Here's a concrete example. If our pattern is "father", and the text to be compared is "fatfather", then if we examine the 't', looking for an 'r', it is safe to immediately shift forward three, and reset our NFA to the start state. An actual forward scan with the NFA would contain other states at that point, because the leading characters do match. But none of these states matter--they can't reach a goal state, because we've seen that future 't'.

Thus, this heuristic will in practice recognize prefix and infix matches, and also mismatches. The prefix matches are irrelevant because the other heuristic does better (it doesn't throw out information-- it shifts farther).

Because this heuristic throws out information (shifting the anchor throws out the mapping data, so shifting the anchor and resetting the NFA state to the start state is identical to restarting the search--forgetting everything), it could lead to scanning more than n characters. Hence, we constrain the use of this heuristic to cases where that is not the case--it must shift forward by more characters than we've scanned, or else we ignore it and continue reverse scanning.

(As with Boyer-Moore, one can imagine replacing these two heuristics with a single shifting rule. Where one of mine computes the anchor NFA state perfectly and only shifts by the max distance, and the other computes the anchor state to be minimal and shifts by the max amount allowable for that rule, a "perfect" solution could shift forward some intermediate amount, computing an approximate NFA state that is more than just the start state. I haven't tackled how such a thing might work, nor do I plan to.)

Detailed Description of Algorithm

In this section I'll explain the computation of the two heuristics. I believe I've explained the mapping process sufficiently well.

Let k be the distance ahead of the anchor where the scan started. Let i be the current scan location, relative to the anchor.

Suffix-Prefix match

Suffix-Prefix match will shift us all the way past the scanned characters. To do this, we use the mapping to compute the resultant anchor's NFA state.

The mapping lets us answer the question: If we reach state x after i characters, what states will we be in after k? The mapping was built by scanning the k-i characters from position k backwards, which is exactly what is needed to answer this.

The problem confronting us is that we don't know what states we'll be in after i characters at all!

Of course we don't--if we did we could always maximally shift. However, we can make an educated guess about it. In fact, we can put upper and lower bounds on it. (Yes, lower bounds can be larger than just the start state, since regexp matchers can match all characters.)

If we put upper and lower bounds on it, and those bounds are identical, then we know exactly what states we'll reach. That's not much help, obviously. However, if we take the upper bounds and run them through the mapping, and run the lower bounds through the mapping as well, if the results of both of those are identical, then we know we can shift forward fully (and the new state is exactly that result).

Call the lower and upper bounds minEstimate(S, i) and maxEstimate(S,i). These bounds on sets of states are derived directly from bounds on individual states, minEstimate(s, i) and maxEstimate(s,i); the bound for a set of states is the union of the bounds for its member states.

minEstimate(s, i) is the set of all states we'll always reach from s no matter what characters we see. You can view this as: take all strings of length i, and compute the resultant states; intersect all of them together.

maxEstimate(s, i) is the set of all states which might be reached from s after i characters are processed. That is, take all strings of length i, and compute the resultant states; union all of them together.
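
Both estimates can be computed by stepping state-sets one character at a time (and cached per <S,i> pair). A C sketch, continuing the earlier bit-mask representation; note that intersecting character-by-character makes minEstimate a slight underestimate of the true intersection-over-strings, which is safe (and as noted below, the minEstimate I actually use underestimates anyway):

   #include <stdint.h>
   #define NSTATES 32
   extern uint32_t fwd[256][NSTATES];   /* forward transitions, as before */

   /* maxEstimate(S,i): everything reachable from S by SOME string of
      length i; the per-character union is exact */
   uint32_t max_estimate(uint32_t S, int i)
   {
      int j, c, s;
      for (j = 0; j < i; ++j) {
         uint32_t next = 0;
         for (c = 0; c < 256; ++c)
            for (s = 0; s < NSTATES; ++s)
               if (S & (1u << s))
                  next |= fwd[c][s];
         S = next;
      }
      return S;
   }

   /* minEstimate(S,i): states reached from S no matter WHAT i characters
      arrive; intersecting character-by-character may underestimate, which
      is safe (the estimate stays a valid lower bound) */
   uint32_t min_estimate(uint32_t S, int i)
   {
      int j, c, s;
      for (j = 0; j < i; ++j) {
         uint32_t next = ~(uint32_t)0;
         for (c = 0; c < 256; ++c) {
            uint32_t on_c = 0;
            for (s = 0; s < NSTATES; ++s)
               if (S & (1u << s))
                  on_c |= fwd[c][s];
            next &= on_c;
         }
         S = next;
      }
      return S;
   }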

Let project(S, mapping) compute the result of updating S by the mapping; that is, for all of the pairs <x,Y> where x is an element of S, union together all the Y.

To compute the Suffix-Prefix heuristic, let A = project(minEstimate(S,i), mapping); let B = project(maxEstimate(S,i), mapping). If A = B, then shift forward k, and let A be the new NFA state.

We can approximate this so as to compute it more quickly, at the cost of being a little conservative in the case of subpatterns which match all possible characters. Rather than comparing the results of projecting the two estimates through the mapping, we can compare their "domains", that is, the intersection of each estimate with the domain of the mapping. If they are the same, then of course their projections will be the same. This will fail to catch the case where a state in the max-estimate but not the min-estimate only projects to states which are already in the min-estimate's projection. I'm not actually sure in what cases this can occur, though. (Also, in fact, the actual minEstimate function I use underestimates as well and does not capture states which are always reachable due to parts of the pattern matching any text.)

So, let D = domain(mapping) be the set of all x found in mapping. Then let A = minEstimate(S,i) intersect D; let B = maxEstimate(S,i) intersect D. If A = B, then shift forward k, and let project(A, mapping) be the new NFA state.

This can be further optimized. Since A is a subset of B, we can use set subtraction to replace the test A = B with the test B - A = 0. Then the intersection operation can be distributed out, giving us the test:
(maxEstimate(S,i) - minEstimate(S,i)) intersect D = 0.

The term inside the parentheses is independent of the mapping (and D), and hence, when we compute and cache the estimates for a given <S,i> pair, we can cache this difference. Thus, in actual code, we merely test whether maxMinusMin(S,i) intersect D is empty.
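
With the difference cached, the test and the resulting shift come out something like this C sketch (max_minus_min caches the difference above; advance_anchor is a hypothetical helper; map[] and min_estimate are from the earlier sketches):

   #include <stdint.h>
   #define NSTATES 32
   extern uint32_t map[NSTATES];                 /* the mapping, as before  */
   uint32_t min_estimate(uint32_t S, int i);
   uint32_t max_minus_min(uint32_t S, int i);    /* cached max - min        */
   void advance_anchor(int dist, uint32_t newstate);

   /* Suffix-Prefix heuristic at scan location i; the scan started at k.
      Returns 1 if it shifted the anchor fully forward. */
   int suffix_prefix_shift(uint32_t S, int i, int k)
   {
      uint32_t D = 0, A, newstate = 0;
      int x;
      for (x = 0; x < NSTATES; ++x)
         if (map[x])
            D |= 1u << x;              /* D = domain of the mapping */
      if (max_minus_min(S, i) & D)
         return 0;                     /* estimates differ inside D: keep scanning */
      A = min_estimate(S, i) & D;
      for (x = 0; x < NSTATES; ++x)
         if (A & (1u << x))
            newstate |= map[x];        /* project(A, mapping) */
      advance_anchor(k, newstate);     /* shift fully forward; exact new state */
      return 1;
   }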

If D ever shrinks to contain only the start state, it means we had a mismatch, and we can shift fully forward and reset to the start state. However, this case is handled by the above test.

Suffix-Infix match

We wish to know for some j < i whether or not it is "safe" to shift forward to j. It is safe to shift forward to j if no states that we might reach by j but fail to include in our set could ever lead to a goal that we won't otherwise catch.

That is to say, let Z be maxEstimate(S,j). We want to know if any of the states in Z other than the start state are "interesting". (Note that a better heuristic would shift forward to minEstimate(S,j), not the pure start state.) A state is interesting if it might lead to a state we won't otherwise reach. That means: take all strings q of length (i-j). For each string, compute the result of processing Z with it, and of processing the start state with it. Now compare each of those pairs of results. If for every string, the projection of the two sets (one for Z and one for the start state) through the mapping is identical, then it is safe. (Again, this is the same as comparing with the domain of the mapping.)

Unfortunately, this comparison of all possible strings isn't feasible. Thus, we approach it from a different direction (which is very similar but not perfect).

For each state in the initial nfa graph, we can characterize its "distance" from the start state--the smallest path-length (in characters) that reaches it from the start state.

Consider each of the mappings <x,y> (where y is an element of Y from before). What we want to know is whether there is some x in that set which we might currently be able to reach, but which we won't reach if we substitute the start state as our state at j. (Here's where it's different from the above: this wouldn't matter if the y we get from that x is guaranteed to be reached by some other route. This is the only case we miss, and I assume it's rare.)

In other words, we're asking is there a route to x from our current state which doesn't go through the start state at character j. Now we make this independent of our current state: is there ANY route to x that doesn't go through the start state at j? In other words, is there a path of length (i-j) characters which doesn't start at the start state, and reaches any of the x? If so, then we can't shift forward by j.

Well, the answer to the above question is simple. We can characterize each state x by the longest path from the start state to it, without going through the start state again. Call this startDist(x). (It should be infinite if the state is reachable through a cycle.) It is safe to shift the anchor forwards by j if i-j >= startDist(x) for all x in the mapping's domain. (If startDist(x) were larger than i-j, there'd be some route of that length that didn't pass through the start.)

Thus, if D is the domain of the mapping (i.e. all x), then let startDist(D) be the maximum of startDist(x) over x in D. Then shifting by j is safe if i-j >= startDist(D). We can characterize this in terms of j: it's safe if j <= i-startDist(D). Since the goal of our heuristic is to determine the largest possible j for which we can so shift, we immediately determine that the formula is j = i-startDist(D).
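
A C sketch of this heuristic, continuing the earlier representation (startdist[] holds a precomputed startDist for each state; advance_anchor and start_set are hypothetical):

   #include <stdint.h>
   #include <limits.h>
   #define NSTATES 32
   extern uint32_t map[NSTATES];    /* the mapping, as before */
   extern int startdist[NSTATES];   /* precomputed; INT_MAX means "infinite"
                                       (state reachable through a cycle) */
   extern uint32_t start_set;       /* bitmask holding just the start state */
   void advance_anchor(int dist, uint32_t newstate);

   /* Suffix-Infix heuristic at scan location i; the scan started at k.
      (S is unused: the rule is deliberately independent of the anchor state.) */
   int suffix_infix_shift(uint32_t S, int i, int k)
   {
      int x, d = 0, j;
      for (x = 0; x < NSTATES; ++x)
         if (map[x] && startdist[x] > d)
            d = startdist[x];        /* startDist(D): max over the domain */
      if (d == INT_MAX)
         return 0;                   /* some x is reachable via a cycle */
      j = i - d;                     /* largest safe shift */
      if (j <= k - i)
         return 0;                   /* must beat the characters scanned */
      advance_anchor(j, start_set);  /* shift and reset to the start state */
      return 1;
   }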

Summary of Heuristics

Let k be the distance ahead of the anchor where the scan started. Let i be the current scan location, relative to the anchor. Let S be the current anchor point state. Let D be the domain of the mapping. After we advance the anchor point, we reset the mapping to the identity mapping (each state maps to itself). We must also compute k = maxLookahead(S). maxLookahead(S) is the minimum of all maxLookahead(s) for s in S. maxLookahead(s) is the shortest distance from s to a goal state.
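
maxLookahead can be precomputed with a breadth-first search backwards from the goal states. A C sketch, with goal_set as a hypothetical bitmask of the goal states:

   #include <stdint.h>
   #include <limits.h>
   #define NSTATES 32
   extern uint32_t fwd[256][NSTATES];   /* forward transitions, as before */
   extern uint32_t goal_set;            /* bitmask of the NFA's goal states */

   /* maxLookahead(S): the shortest distance (in characters) from any state
      in S to a goal state */
   int max_lookahead(uint32_t S)
   {
      int dist[NSTATES], queue[NSTATES], head = 0, tail = 0;
      int s, t, c, best = INT_MAX;
      for (s = 0; s < NSTATES; ++s)
         dist[s] = -1;
      for (s = 0; s < NSTATES; ++s)
         if (goal_set & (1u << s)) {    /* goals are at distance 0 */
            dist[s] = 0;
            queue[tail++] = s;
         }
      while (head < tail) {             /* BFS along reversed edges */
         t = queue[head++];
         for (c = 0; c < 256; ++c)
            for (s = 0; s < NSTATES; ++s)
               if ((fwd[c][s] & (1u << t)) && dist[s] < 0) {
                  dist[s] = dist[t] + 1;
                  queue[tail++] = s;
               }
      }
      for (s = 0; s < NSTATES; ++s)     /* minimize over the states in S */
         if ((S & (1u << s)) && dist[s] >= 0 && dist[s] < best)
            best = dist[s];
      return best;
   }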

Algorithm Behavior

For patterns in which characters do not have a high probability of matching, SPI works very efficiently. However, there are large classes of patterns where the "obvious" maxLookahead function produces poor results.

We now describe the characteristics of the two heuristics for some archetypical patterns.

pattern: abcde

From the start state, SPI will lookahead five characters (in the sense that the next character is considered a lookahead of 1... I hope that's not misleading terminology).

If the character examined is not in the set [a-e], SPI will shift forward 5.

If the character examined is 'e', SPI must continue searching backwards (because this might be a match).

If the character examined is in the set [a-d], then SPI can shift forwards using the SI heuristic. For example, if the character is 'b', then the mapping will contain the singleton pair mapping from the state representing having parsed 'a' to the state representing having parsed 'ab'. startDist of the mapping's domain will be 1; i is 4 (it decrements from 5 to 4 as we process the character), and so SI shifts forward by 3.

If the final character is an 'a', the SP heuristic will produce a forward shift of 5 (whereas SI only shifts 4). SP also provides the mismatch heuristic.

If the final character is a 'd', we choose to not use the SI heuristic which only shifts by one character, since the SP heuristic will do at least as well. (If the next character is a 'c', SP will continue left scanning; if not, it will shift forward 5 characters, for a ratio of 2/5 characters examined.)

pattern: ababa

Suppose the text being compared against this pattern contains ??cab.

After reading b, the mapping will contain the mappings (a->ab) and (aba->abab). SI offers the possibility of shifting forward one. (This will be rejected since we bias in favor of SP if the two read equal numbers of characters, since SP retains more information.) SP cannot shift since these mappings are "reachable".

After reading the a, the mapping will contain the mappings (->ab) and (ab->abab). SI still offers the possibility of shifting forward one, but we've scanned two characters, so this is rejected. SP cannot shift.

After reading the c, the mapping contains (->ab). At this point, SP will shift forward five characters, leaving the automaton in the "I just saw 'ab'" state.

pattern: aababa

With text "abaabaaba" this is an example where naive Boyer-Moore has problems. Here, the first three characters (from the tail end) match. After reading the fourth, there are no internal matches, but a final prefix match (all four characters match the first four characters). SP shifts forward six characters, leaving the automaton in the state of having just seen an 'aaba'. It will now only lookahead two characters, to see if the 'ba' matches. It sees a 'b', and knows this could match in two different ways.

Here's how it proceeds:


text:    abaabaabaabaabaabaa
pattern: aababa

         ??aaba?????????????
         aababa

         ??aabaab???????????
           aababa

         ?????aabaab????????
              aababa

From here on the search pattern is stable, and will examine every character of the text, scanning backwards in groups of three.

pattern: ab.*cd

maxLookahead from the start state will be 4 (the shortest matching pattern is "abcd").

Whatever character we sample 4 ahead will potentially match the ".". If "." matches all characters, then in truth, the best possible performance for this pattern is to search every other character. SPI will instead scan ahead 4, and be forced to come all the way back. If the second character is neither a nor b, we will then be able to shift forwards by 4, thus scanning 3/4 of the characters in the best case. Optimal performance for this pattern would be gotten by constraining maxLookahead to 2.

pattern: a.*bbbb

SPI will search every character until an 'a' is found. It will do this by searching backwards in chunks of five characters. Once an 'a' is encountered, SPI will search optimally, looking four characters ahead at a time.

In one sense, SPI is sub-optimal for this pattern. A more efficient search would search for the 'bbbb' first, then search backwards for the a. However, since the ".*" could match an arbitrarily long string of characters, it is clear that SPI is probably in some sense doing "the right thing". In the context of a grep with limited line lengths (and the symbol . taken to not match line breaks--that is, matches occur within a single line), and assuming matches are infrequent, SPI is clearly not doing things very efficiently.

However, assuming an infinite search stream, a problem definition only requiring the reporting of the end locations of matches, and an algorithm restricted to a finite buffer, SPI is in some sense doing the "right thing". This is not generally true, though; it merely happens to hold in this particular case. Here is an example where it does not:

pattern: aa.*bbbb

Here, SPI will search 5/6 of the characters (if there are no a's) until it finds an 'aa'. Optimal performance would be to limit the lookahead from the initial state to 2.

"Fixing" this only requires improving the maxLookahead computation; the code and math for the SP and SI heuristics are still valid.

pattern: aa[^a-zA-Z]*bbbb

Replacing the "wild card" from the previous pattern with a variant which "rarely matches" suggests the complexity in determining how one could optimize the maxLookahead. Clearly it can be letter-frequency dependent.

pattern: aa[A-Z]*bbbb

Here a maxLookahead from the start state of 6 is recommended if the incidence of [A-Z] is low. If [A-Z] occurs with probability 50%, then maxLookahead of 6 is just barely worse than maxLookahead of 2. With a longer pattern it would still be better. If incidence of [A-Z] is high (90%), then maxLookahead of 2 is clearly better.

If letter frequencies are quantified (or assumed), it seems likely that some exact rules could be determined for good values for maxLookahead, but I haven't done the work to figure out what sort of graph analysis could do this efficiently.

This is, I suppose, a topic for further research: determine the optimal maxLookahead given a set of character probabilities and a search automaton, assuming matches never occur.

Implementation Notes

Source Code

Here is the inner loop when SPI is running entirely from the mega-dfa:

   for(;;) {
      tr = &state->trans[*input];   /* transition entry for the sampled character */
      input += tr->shift;           /* advance by the precomputed shift for this transition */
      if (input >= end) break;      /* ran off the end of the buffer */
      state = tr->state;            /* move to the next mega-DFA state */
   }

The complete source to implement SPI is, however, about 2000 lines.

Here is the source code. It currently builds with Watcom C 9.5 under DOS, but it used to run under linux, and I don't think I changed anything that would have broken that, so it should work under reasonable unixen. It's an implementation of grep, so certain parts of it may work differently than expected. It also includes an option to do matching-with-errors via a simple nfa construction (k errors produces k+1 total copies of the original nfa graph, with transitions from the k'th to the k+1'th graph representing deletions, insertions, and substitutions).
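
For illustration, here's a C sketch of one forward step of that errors construction, using the earlier bit-mask representation (my representation, not SGREP's). Deletion edges are the epsilon transitions from copy e to copy e+1, and would be handled by a separate closure pass, omitted here:

   #include <stdint.h>
   #define NSTATES 32
   #define MAXERR 8                     /* arbitrary cap for this sketch */
   extern uint32_t fwd[256][NSTATES];   /* forward transitions, as before */

   /* one forward step on character c: level[e] is the state-set of copy e
      (e errors spent so far), for e = 0..kerr */
   void err_step(uint32_t level[], int kerr, unsigned char c)
   {
      uint32_t next[MAXERR + 1];
      int e, s, ch;
      for (e = 0; e <= kerr; ++e) {
         next[e] = 0;
         for (s = 0; s < NSTATES; ++s)
            if (level[e] & (1u << s))
               next[e] |= fwd[c][s];          /* c matches exactly: no new error */
         if (e > 0) {
            next[e] |= level[e - 1];          /* insertion: consume c, stand still */
            for (s = 0; s < NSTATES; ++s)
               if (level[e - 1] & (1u << s))
                  for (ch = 0; ch < 256; ++ch)
                     next[e] |= fwd[ch][s];   /* substitution: consume c but
                                                 follow any edge the pattern offers */
         }
      }
      for (e = 0; e <= kerr; ++e)
         level[e] = next[e];
   }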

SGREP source is protected by copyright and should not be redistributed or used for commercial purposes. (The former restriction will be changed at some point, and probably the latter as well.)

SGREP source code 22k .zip file




[1]
This characterization means that when our automaton gets to the mth character, it will reach a goal state. It doesn't actually mean that the match starts at our current anchor; it could start before or after it. The former is correct behavior given that our anchor isn't really "the beginning of the pattern" anymore; the latter cannot happen if we do not allow our scan to start past the maximum allowable length, which is like the pattern length in Boyer-Moore, but depends on the state S. This is described later.




Appendix 1: Boyer-Moore search that scans more than n characters

text being searched:    abaabaabaabaabaaba

pattern:                aababa
                        ??aaba????????????          (4 samples)

                          aababa
                        ???????b??????????          (1 sample)

                           aababa
                        ?????aaba?????????          (4 samples)
     
                             aababa
                        ??????????b???????          (1 sample)
In this example, we will scan 5 characters to make a total forward shift of 3. Note that this is true no matter what heuristic is used; the above is the best any Boyer-Moore search can do if it throws away all info when it moves the anchor, and if it has to move the anchor on a mismatch.

We can extend the pattern by adding more 'aba's on the end. If we do so, then every other match, all those extra 'aba's will be scanned. Thus, with k aba's at the end, the number of characters scanned will alternate between (3*k+1) and 1. The pattern's length will be 3*k+3; thus scanning such a pattern of length m against this text will require reading m-1 (i.e. 3*k+1+1) characters for every total shift of 3, or (m-1)*(n/3) characters scanned total. Thus the worst case for a Boyer-Moore pattern that never matches is still O(m*n). [I don't know how much worse it can get than the above in practice.]

Of course, if matches occur, then you can get O(m*n) if every character matches (e.g. search for 'aaa' in a text of all 'a's). [That I learned from reading the literature, but I figured the above example before I read about the always-match case.]

