Saturday, April 13, 2013

regexes in interpreted language.. wut?


I realized that in order to do any sort of sensible string-processing, I'd need at least basic regular expressions in µL. So I wrote a regular expression implementation, and I wrote it in µL itself.. partly because I wanted to see how practical it would be in terms of performance, partly for fun, and.. well, we'll get to that.

So far it's pretty basic, covering just the essentials (grouping, alternatives, greedy/non-greedy repetition.. it still needs {n,m} counted repetition, character classes and a bunch of escapes, but those are simple parser extensions), and it doesn't have nice "gimme the match string" or "replace stuff" front-end functions yet.. but it does track sub-expressions (so replace is just a question of front-end), and it runs online (meaning possibly infinite text) in constant space and linear time.. more precisely, O(n*r) time and O(r) space for input length n and regex length r.
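For anyone who hasn't seen the construction: here's a rough sketch in Python (not µL; the names and representation are made up for illustration) of how those operators compile. Each one builds a little NFA fragment with a start state and a list of dangling exits, so the machine stays linear in the size of the pattern:

```python
class State:
    def __init__(self, char=None):
        self.char = char   # None marks an epsilon state; otherwise consumes char
        self.out = []      # successor states

def literal(c):
    s = State(c)
    return (s, [s])        # (start state, dangling exits to patch later)

def concat(f1, f2):
    for s in f1[1]:
        s.out.append(f2[0])        # wire fragment 1's exits into fragment 2
    return (f1[0], f2[1])

def alternate(f1, f2):
    split = State()                # epsilon split for "x|y"
    split.out = [f1[0], f2[0]]
    return (split, f1[1] + f2[1])

def star(f):
    split = State()                # "x*": enter the loop or fall through
    split.out = [f[0]]
    for s in f[1]:
        s.out.append(split)        # loop back to the split
    return (split, [split])

# Built bottom-up, the way a parser for a(b|c)*d would:
frag = concat(literal('a'),
              concat(star(alternate(literal('b'), literal('c'))),
                     literal('d')))
```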

It's not very special; in fact it's a kinda stock "Thompson NFA". Since it's essentially an interpreter for a non-deterministic finite-state machine, written in an interpreted (sort of) language, it's obviously "not very fast".. but I expected it to perform much worse when there are a lot of parallel states to maintain. Turns out it's not nearly so horrible, and the predictable nature of O(n) execution means you can optimize a regex for size instead of trying to work around matching problems. I really wonder why linear-time engines aren't more popular.
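The evaluation side is just as small. Here's a sketch of the state-set stepping (again Python, not the actual µL code; the machine is hand-built for a(b|c)*d using the same representation as above). The live set can never grow past the machine size, so each input character costs O(r) work and the whole run is O(n*r) time in O(r) space, one character at a time:

```python
class State:
    def __init__(self, char=None):
        self.char = char        # None marks an epsilon state
        self.out = []

def closure(states):
    """Follow epsilon edges until the set stops growing."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        if s.char is None:
            for t in s.out:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen

def run(start, accept, text):
    current = closure({start})
    for c in text:              # one pass over the input: online matching
        moved = {t for s in current if s.char == c for t in s.out}
        current = closure(moved)
        if not current:         # no live states left, fail early
            return False
    return accept in current

# Hand-built machine for a(b|c)*d
a, b, c, d = State('a'), State('b'), State('c'), State('d')
loop, accept = State(), State()   # epsilon split and the final state
a.out = [loop]
loop.out = [b, c, d]              # stay in the (b|c)* loop, or exit via d
b.out = c.out = [loop]
d.out = [accept]

print(run(a, accept, "abcbd"))    # True
print(run(a, accept, "abx"))      # False
```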

I guess I've gotten too used to the now-common back-tracking (or "depth first") implementations myself. Those get exponentially slower as your expressions get more complex or your input data more degenerate (they need to try more paths with longer look-aheads), so the performance can vary quite unpredictably. With a linear-time engine, even with all the constant factors being about 2^10 times larger than necessary (as in my case), it still feels quite usable (sort of like TCL on a 486). Obviously you don't get back-references this way.. but I've never managed to use those for anything anyway (it's easier to cascade a regex-lexer with a simple ad-hoc parser), so personally I won't miss them.
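If you want to feel the difference, Python's re module (a back-tracking engine, nothing to do with µL) shows the degenerate case nicely. The pattern below does match its input, but the engine has to try exponentially many ways of distributing the a's before it finds out:

```python
import re, time

for n in range(15, 23):
    pattern = "a?" * n + "a" * n      # matches a^n, but very ambiguously
    text = "a" * n
    t0 = time.perf_counter()
    assert re.fullmatch(pattern, text)
    print(n, f"{time.perf_counter() - t0:.3f}s")  # roughly doubles with each n
```

A Thompson-style engine walks the same input once and is done in linear time.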

Anyway, one cool thing (and the reason I resisted the temptation to write the thing as "primitives" in C) is that nothing (well, almost nothing) in it relies on the input being a string (or even a sequence of characters). You could just as well feed it a list, and use fancy transition predicates that test the individual items against whatever. The regex-parser won't build such machines (not yet anyway), but the evaluator would already process them. And since the predicates can be anything, you could match recursively against other patterns (or just the same one).. which would still run in O(n) time (though it certainly requires worst-case O(n) space), assuming the tree is really a tree (and not a DAG).
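As a sketch of what that could look like (made-up Python, not the µL front-end): a machine whose edges are arbitrary predicates, run over a plain list. This one matches "an int, then any number of strings, then a negative number":

```python
def step(states, item, edges):
    """Advance every live state whose predicate accepts the item."""
    return {dst for s in states for pred, dst in edges.get(s, []) if pred(item)}

edges = {
    0: [(lambda x: isinstance(x, int), 1)],
    1: [(lambda x: isinstance(x, str), 1),                       # loop on strings
        (lambda x: isinstance(x, (int, float)) and x < 0, 2)],   # exit on a negative
}

def matches(seq):
    states = {0}
    for item in seq:
        states = step(states, item, edges)
    return 2 in states

print(matches([7, "a", "b", -1.5]))   # True
print(matches([7, -3]))               # True (zero strings is fine)
print(matches(["a", -1]))             # False (must start with an int)
```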

Curiously, with a slightly more complex algorithm, it should be possible to do similar matching for cyclic graphs too, in O(n*r) time and space, using the rule that if a recursive rule matches all the way around a cycle back to the same node with the same rule (ie. each [inputNode, matcherState] pair is visited at most once), then it's a match. Could be fun to build an engine that can do that, but.. I guess it's not possible to get worst-case O(n*r) for cyclic input without modifying the input (at least temporarily); hash-maps would give a "typical" bound, obviously.
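One concrete instance of that idea, sketched in Python: search the product of the input graph and the machine, marking each (node, state) pair as visited the first time it's reached. With n nodes and r states that's at most n*r pairs, each expanded once; and since the visited pairs live in a hash set, this is exactly the "typical" rather than worst-case bound mentioned above. The graph/machine shapes are made up for illustration:

```python
def path_matches(graph, labels, start_node, nfa, start_state, accepting):
    """Does some path from start_node spell a word the machine accepts?"""
    stack = [(start_node, start_state)]
    seen = set(stack)
    while stack:
        node, state = stack.pop()
        for lab, nxt in nfa.get(state, []):
            if lab == labels[node]:          # consume this node's label...
                if nxt in accepting:
                    return True
                for succ in graph[node]:     # ...then the path moves on
                    if (succ, nxt) not in seen:
                        seen.add((succ, nxt))
                        stack.append((succ, nxt))
    return False

# A two-node cycle labeled a,b; matching "aba" has to go around the cycle
graph, labels = {0: [1], 1: [0]}, {0: "a", 1: "b"}
nfa = {0: [("a", 1)], 1: [("b", 2)], 2: [("a", 3)]}
print(path_matches(graph, labels, 0, nfa, 0, {3}))  # True
print(path_matches(graph, labels, 1, nfa, 0, {3}))  # False (starts on "b")
```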
