Saturday, April 13, 2013

regexes in interpreted language.. wut?


I realized that in order to do any sort of sensible string-processing, I'd need at least basic regular expressions in µL. So I wrote a regular expression implementation, and I wrote it in µL itself.. well.. because I wanted to see how practical it would be in terms of performance.. and well.. because of fun and.. well, we'll get to that.

So far it's pretty basic, with just the basics (grouping, alternatives, greedy/non-greedy.. still need {n,m} counted repetition, character classes and a bunch of escapes.. but those are simple parser extensions) and doesn't have a nice "gimme the match string" or "replace stuff" front-end functions.. but it does track sub-expressions (so replace is just a question of front-end), and runs online (meaning possibly infinite text) in constant space and linear time (well, obviously O(n*r) time and O(r) space for n input and r regex length).

It's not very special; in fact it's kinda stock "Thompson NFA" .. but since it is essentially an interpreter for a non-deterministic finite-state-machine, written in an interpreted (sort-of) language, it's obviously "not very fast" yet I expected it to perform much worse when there's a lot of parallel states to maintain. Turns out it's not nearly so horrible, and the predictable nature of O(n) execution means you can optimize regex for size instead of trying to work-around matching problems. I really wonder why linear time isn't more popular.

I guess I've got too used to the now-common "back-tracking" (or "depth first") implementations myself as well. Those get exponentially slower when your expressions get more complex or input data more degenerate (so they need to try more paths with longer look-aheads), so the performance can vary quite unpredictably. With a linear-time engine, even with all the constant factors being about 2^10 times larger than necessary (as in my case), it still feels quite usable (sort of like TCL on 486). Obviously no back-references this way.. but I've never managed to use them for anything anyway (easier to cascade a regex-lexer with a simple ad-hoc parser), so personally I won't care.

Anyway, one cool thing (and the reason I resisted the temptation to write the thing as "primitives" in C) is that nothing (well, almost nothing) in the thing relies on input being a string (or even a sequence of characters). You could just as well feed it a list, and then use fancy transition predicates that test the individual items against whatever. The regex-parser won't build such machines (not yet anyway) but the evaluator would already process them. Since the predicates can be anything, you could match recursively against other patterns (or just the same one).. which would still run in O(n) time (but certainly requires worst-case O(n) space) assuming the tree is really a tree (and not DAG).

Curiously, with a slightly more complex algorithm, it should be possible to do similar matching for cyclic graphs too, in O(n*r) time and space, using the rule that if a recursive rule matches up to a cycle of the same node with the same rule (ie visit pairs of [inputNode,matcherState] once), then it's a match. Could be fun to build an engine that could do that, but .. I guess it's not possible to get worst-case O(n*r) for cyclic input without modifying the input (at least temporarily); hash-maps would give a "typical" bound, obviously.

Monday, April 8, 2013

µL hacks

Today I randomly got the idea "wouldn't it be cool to be able to quasiquote strings" and tried implementing such a feature into my poor little Lisp-variant. Here's how it works:

When you write {string} it gets parsed exactly as if you wrote "string" except the normal string escapes are disabled so {foo"bar\} results in a string that you would normally have to type as "foo\"bar\\" which on it's own is quite handy. However, you can use ~ (maybe should change that to ^ though) and if you do this, the reader will go back to reading a normal Lisp expression. Once it finishes, it'll keep reading the string again (and read another expression next time it sees ~ and so on) and when it finally sees } it'll turn the whole thing into strcat call (which is string concatenation):

{1 + 1 = ~(int->string (+ 1 1))!}
will read as
(strcat "1 + 1 = " (int->string (+ 1 1)) "!")
and evaluate as
"1 + 1 = 2!"
Naturally, since ~ just calls the reader recursively, you can nest such constructs just like any other code... so with a grand total of 11 short lines of code, I just added a rather natural "template" feature to my little language. :)

PS. Yeah, it'd be nice if such hacks could be done "on the fly" without changing the reader (like you could in some "other Lisps"), but currently the dispatch loop from which { needs to be intercepted isn't table-driven yet (several other things like string escapes, #\newline style character constants and special values like #!eof actually are table-driven, mainly because that way the pretty printer will always match what the reader expects.. should fix the dispatch too, I guess).

Another small (though somewhat larger) µL hack was to add support for "alien" types: these are basically opaque typed pointers from Lisp heap to somewhere else. Now it's simple to add new types for external resources (like files, or C-heap types, whatever), while still handling garbage collection sensibly (both collecting such external objects if desired, and allowing the external objects to keep references to Lisp heap objects that might move arbitrarily). Fortunately it turned out to be ridiculously simple to add finalization and a bunch of visitor callbacks to the garbage collector (which isn't that fancy.. just a simple Cheney with a rudimentary nursery to deal with the worst of "garbage on arrival" which the interpreter creates fast enough that I probably should add proper generational logic).

PS. Should probably upload the whole thing somewhere... but I'd like to fix a few more broken/missing features first, such that it might be even theoretically possible to find some practical utility, other than messing around in the REPL.