Wednesday, December 20, 2017

Closed hashing probes and clustering..

Closed hashing with linear probing has a problem: once we hit a filled slot, there's a good chance we'll be searching O(n) through the cluster of collisions. This is well known and called primary clustering.

So, we'll use quadratic probing? Using (hash+((i+i*i)>>1))%N with an N=2^n table we will still probe all the slots, but probe sequences starting from different slots will diverge. Unfortunately they still coincide if we start from the same slot (we're only using n bits of the hash), which is known as secondary clustering and is also well known.

The usually suggested remedy for secondary clustering is double hashing. We take another hash (or even just hash the original full hash; this is enough to avoid the "only using n low-order bits" problem as long as your second hash is good at mixing the bits.. or even just bit-reversing the first hash could work), let's call it hash2, and use it to control the probe sequence. The standard approach is (hash+(hash2|1)*i)%N (where the bitwise-or with 1 forces a full-length sequence in 2^n tables), and now this is perfect?

What about the case where we are really unlucky and a bunch of entries all get the same low bits in hash2 and then collide? Now they can cluster just like in linear probing, since they end up using the same linear sequence. This obviously shouldn't happen if hash2 is good, just like we shouldn't get collisions in the first place if hash is good.. but maybe we should think about it anyway?

So why not fix that too? It doesn't even seem complicated. See, we can combine the double hash with the quadratic probing and end up with (hash+(hash2|1)*((i+i*i)>>1))%N. That's still guaranteed to hit every slot (in a "randomly" permuted order) of a power-of-two table! If two entries with the same low bits of hash2 collide, then the quadratic term will force them to diverge as long as their home slots are different. If two entries have the same home slot (low bits of hash), then hash2 will force a different permutation.
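Since the full-cycle property is easy to check empirically, here's a small C++ sketch (the function and its name are mine, not from any real hash table) that verifies the combined probe visits every slot of a power-of-two table exactly once:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Empirically verify that slot(i) = (hash + (hash2|1)*((i+i*i)>>1)) % N
// visits every slot of a power-of-two table exactly once in N probes.
// With hash2's low bits all zero this degenerates to plain triangular
// quadratic probing (the classic full-cycle result for N = 2^n), and
// multiplying by an odd constant is a bijection mod 2^n, so the full
// cycle survives the combination.
bool full_cycle(uint32_t hash, uint32_t hash2, uint32_t N) {
    std::vector<bool> seen(N, false);
    uint32_t step = hash2 | 1;               // odd => invertible mod 2^n
    for (uint32_t i = 0; i < N; ++i) {
        uint32_t tri = (i + i * i) >> 1;     // triangular number i(i+1)/2
        uint32_t slot = (hash + step * tri) & (N - 1);
        if (seen[slot]) return false;        // slot revisited: not a full cycle
        seen[slot] = true;
    }
    return true;
}
```

Running this over a pile of arbitrary (hash, hash2) pairs never finds a counter-example, which at least supports the claim experimentally.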

Now, it seems like the above would lead to a rather "fool-proof" collision resolution, with little cost over regular double hashing, and it's even somewhat resistant to the second hash being poor (e.g. you still get a quadratic probe too). The only way you can then cluster is by hitting both primary AND secondary clustering at the same time (i.e. colliding in both hashes at once).

So .. why have I not seen this mentioned anywhere? Am I missing some non-obvious problem?

Wednesday, December 10, 2014

Lore - simple regex library for C++

I finally got tired of not having an incremental (character at a time), linear-time (wanted predictable time and space complexity) regular expression library in C++ that I could use in various situations (from text search to input validation to lexing tokens to whatever else).

Maybe there is already a better library for this, but it didn't seem like any of the "usual suspects" could really fulfill my requirements (either they are exponential-time, need the whole input at once, or quite commonly both), so I decided to port my older regex implementation (written in a custom Lisp) to C++ for use in native projects.

I also tried to make the API as simple and flexible as possible. While the implementation isn't anything special, the code is available on GitHub in case someone else needs something similar.

Thursday, December 19, 2013

Silly pattern-matching macro

One of the things that I somehow always end up doing with Lisp code is transforming one type of list structure into some other type of list structure, where the rules of transformation are simple pattern matches..

That's probably not very surprising, considering some "other" languages are built completely on pattern matching, but one of the cool things about Lisp is that you can add all the missing syntax quite easily. So today I figured I'd hack together a simple pattern matcher in (my own "toy" dialect) µL.

The syntax of the language should be quite obvious to anyone who's ever written any Scheme (it's mostly just shorter), so I won't bother translating it, but the idea is to add a "case-match" form (that should probably get a shorter name) so you can do things like this:

(def (let-format form)
  (case-match form
    (((fn ,args . ,body) . ,binds)
     `(let ,(map list args binds) . ,body))
    (,other other)))

 .r = #!<fn let-format>

.uL.> (let-format '((fn (a b) (+ a b)) 1 2))

 .r = (let
        ((a 1)
         (b 2))
        (+ a b))

See, we match some arbitrary list structure against each of the case patterns, and pick the first one that applies. In a pattern, ,symbol will match anything and bind that symbol to the matched form within the body for that case, essentially like a destructuring bind; without the unquote, symbols match literally.

Nothing too fancy, not worth writing a blog post about.. but the cool thing was how simple the implementation turned out to be. I actually wonder why I've not done this earlier (veteran Lisp programmers, please stop laughing). The generated code only walks the input once per pattern and aborts early, so it's reasonably efficient too. It could be extended further, but the point is: it's not hard at all to add this functionality to Lisp-like languages. It's here:

;;; main dispatch, with the actual macro
;;; simply binds the "what" part to a symbol
;;; so it can be any arbitrary expression
(macro (case-match what . cases)
  (let ((wsym (gensym)))
    `(let ((,wsym ,what))
       ,(case-match-build (gensym) wsym cases))))

;;; This builds the main (if .. (if .. (if ..))) structure.
;;; The code built by case-match-case returns either closure
;;; or nil (no match), so we check and call closure if any.
(def (case-match-build tmp wsym cases)
  `(let ((,tmp ,(case-match-case wsym (car cases))))
     (if ,tmp (,tmp)
       ,(if (cdr cases)
          (case-match-build tmp wsym (cdr cases))))))

;;; Now, what case-match-case does is start a CPS-style
;;; macro-expansion, because we want the final code to only
;;; recurse the match once (for efficiency). So we pass around
;;; the code that we want to insert in the "tail-position" in
;;; pretty much the same way we'd pass around continuations.
(def (case-match-case wsym 1case)
  (let ((pat (car 1case))
        (body (cdr 1case)))
    (case-match-worker wsym pat
      `(fn () . ,body))))

;;; So here we go: build match for a sub-pattern.
;;; The first cond-branch will do the unquote-match,
;;; and introduce bindings. The second one expands the
;;; CPS-style recursion and the last-one will match
;;; a literal, and either keep going, or abort.
(def (case-match-worker wsym pat k)
  (cond
    ((and (cons? pat)
          (eq (car pat) 'unquote)
          (and (cons? (cdr pat))
               (symbol? (cadr pat))
               (not (cddr pat))))
     `(let ((,(cadr pat) ,wsym)) ,k))
    ((cons? pat)
     (let ((pat1 (car pat))
           (patr (cdr pat))
           (ws1 (gensym))
           (wsr (gensym)))
       `(if (cons? ,wsym)
          (let ((,ws1 (car ,wsym))
                (,wsr (cdr ,wsym)))
            ,(case-match-worker ws1 pat1
               (case-match-worker wsr patr k)))
          ())))
    (t `(if (eq ',pat ,wsym) ,k ()))))

That's not too bad; no more manual destructuring for pattern matches. The main thing I don't like is the gensym-spam with a lot of temporary bindings.. which wouldn't be that bad if it wasn't for the lack of "safe-for-space" closures in my current implementation of the language itself. But this thing should help with building a better compiler to fix that.

Saturday, April 13, 2013

regexes in interpreted language.. wut?

I realized that in order to do any sort of sensible string-processing, I'd need at least basic regular expressions in µL. So I wrote a regular expression implementation, and I wrote it in µL itself.. well.. because I wanted to see how practical it would be in terms of performance.. and well.. because of fun and.. well, we'll get to that.

So far it's pretty basic (grouping, alternatives, greedy/non-greedy.. still missing {n,m} counted repetition, character classes and a bunch of escapes, but those are simple parser extensions) and doesn't have nice "gimme the match string" or "replace stuff" front-end functions.. but it does track sub-expressions (so replace is just a question of front-end), and runs online (meaning possibly infinite text) in constant space and linear time (well, obviously O(n*r) time and O(r) space for input length n and regex length r).

It's not very special; in fact it's a kinda stock "Thompson NFA" .. but since it is essentially an interpreter for a non-deterministic finite-state machine, written in an interpreted (sort-of) language, it's obviously "not very fast". Still, I expected it to perform much worse when there are a lot of parallel states to maintain. Turns out it's not nearly so horrible, and the predictable nature of O(n) execution means you can optimize a regex for size instead of trying to work around matching problems. I really wonder why linear-time matching isn't more popular.
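The µL code isn't shown here, but the core stepping loop of a Thompson-style simulation is compact enough to sketch in C++ (this is an illustration of the technique, not the actual µL implementation; the State layout and the hand-wired machine for a(b|c)*d are mine):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Each state is a literal transition (ch > 0), an epsilon split
// (ch == 0, both out edges taken), or the accepting state (ch == -1).
struct State { int ch; int out1, out2; };

// Epsilon-closure: add a state plus everything reachable via splits.
void add_state(const std::vector<State>& nfa, std::set<int>& s, int st) {
    if (s.count(st)) return;
    s.insert(st);
    if (nfa[st].ch == 0) {
        add_state(nfa, s, nfa[st].out1);
        add_state(nfa, s, nfa[st].out2);
    }
}

// One pass over the input, advancing the whole set of active states
// per character: O(n*r) time, O(r) space, no backtracking.
bool match(const std::vector<State>& nfa, const std::string& input) {
    std::set<int> cur;
    add_state(nfa, cur, 0);
    for (char c : input) {
        std::set<int> next;
        for (int st : cur)
            if (nfa[st].ch == c)
                add_state(nfa, next, nfa[st].out1);
        cur.swap(next);
    }
    for (int st : cur)
        if (nfa[st].ch == -1) return true;
    return false;
}

// Hand-wired machine for "a(b|c)*d".
std::vector<State> example_nfa() {
    return {
        {'a', 1, 0},   // 0: 'a' -> 1
        { 0,  2, 5},   // 1: split: enter the (b|c) loop, or go to 'd'
        { 0,  3, 4},   // 2: split: 'b' or 'c'
        {'b', 1, 0},   // 3: 'b' -> back to the loop split
        {'c', 1, 0},   // 4: 'c' -> back to the loop split
        {'d', 6, 0},   // 5: 'd' -> accept
        {-1,  0, 0},   // 6: accept
    };
}
```

The point of the structure is visible right in the loop: no state is ever added twice per step, so a pathological input just keeps the set large, never exponential.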

I guess I've gotten too used to the now-common "back-tracking" (or "depth-first") implementations myself as well. Those get exponentially slower as your expressions get more complex or the input data more degenerate (so they need to try more paths with longer look-aheads), so the performance can vary quite unpredictably. With a linear-time engine, even with all the constant factors being about 2^10 times larger than necessary (as in my case), it still feels quite usable (sort of like Tcl on a 486). Obviously you get no back-references this way.. but I've never managed to use them for anything anyway (it's easier to cascade a regex-lexer with a simple ad-hoc parser), so personally I won't care.

Anyway, one cool thing (and the reason I resisted the temptation to write the thing as "primitives" in C) is that nothing (well, almost nothing) in it relies on the input being a string (or even a sequence of characters). You could just as well feed it a list, and then use fancy transition predicates that test the individual items against whatever. The regex parser won't build such machines (not yet anyway) but the evaluator can already process them. Since the predicates can be anything, you could match recursively against other patterns (or just the same one).. which would still run in O(n) time (but then certainly requires worst-case O(n) space) assuming the tree is really a tree (and not a DAG).

Curiously, with a slightly more complex algorithm, it should be possible to do similar matching on cyclic graphs too, in O(n*r) time and space, using the rule that if a recursive rule matches up to a cycle back to the same node with the same rule (i.e. visit each [inputNode, matcherState] pair once), then it's a match. Could be fun to build an engine that does that, but.. I guess it's not possible to get worst-case O(n*r) for cyclic input without modifying the input (at least temporarily); hash-maps would give a "typical" bound, obviously.

Monday, April 8, 2013

µL hacks

Today I randomly got the idea "wouldn't it be cool to be able to quasiquote strings" and tried implementing such a feature into my poor little Lisp-variant. Here's how it works:

When you write {string} it gets parsed exactly as if you wrote "string" except that the normal string escapes are disabled, so {foo"bar\} results in a string that you would normally have to type as "foo\"bar\\", which on its own is quite handy. However, you can use ~ (maybe I should change that to ^ though), and if you do, the reader will go back to reading a normal Lisp expression. Once it finishes, it'll keep reading the string again (and read another expression next time it sees ~, and so on), and when it finally sees } it'll turn the whole thing into a strcat call (which is string concatenation):

  {1 + 1 = ~(int->string (+ 1 1))!}

will read as

  (strcat "1 + 1 = " (int->string (+ 1 1)) "!")

and evaluate as

  "1 + 1 = 2!"
Naturally, since ~ just calls the reader recursively, you can nest such constructs just like any other code... so with a grand total of 11 short lines of code, I just added a rather natural "template" feature to my little language. :)
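The splitting the reader does can be sketched in C++ for illustration (the real feature lives in the µL reader, which recurses into its normal expression reader at each ~; this sketch assumes every interpolation is a parenthesized form, and all the names here are hypothetical):

```cpp
#include <string>
#include <vector>

// Split a {..}-style template body into literal pieces and
// interpolated expressions. Simplification: the expression after
// '~' is assumed to be a balanced (...) form; the real reader
// accepts any expression by just calling itself recursively.
struct Piece { bool is_expr; std::string text; };

std::vector<Piece> split_template(const std::string& s) {
    std::vector<Piece> out;
    std::string lit;
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '~' && i + 1 < s.size() && s[i + 1] == '(') {
            if (!lit.empty()) { out.push_back({false, lit}); lit.clear(); }
            size_t j = i + 1;
            int depth = 0;
            do {                       // scan the balanced (...) form
                if (s[j] == '(') ++depth;
                if (s[j] == ')') --depth;
                ++j;
            } while (depth > 0 && j < s.size());
            out.push_back({true, s.substr(i + 1, j - i - 1)});
            i = j - 1;                 // resume literal text after the form
        } else {
            lit += s[i];
        }
    }
    if (!lit.empty()) out.push_back({false, lit});
    return out;
}
```

Wrapping the pieces in a strcat call (quoting the literals, splicing the expressions) is then the whole feature, which is why it fits in a handful of lines inside a reader.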

PS. Yeah, it'd be nice if such hacks could be done "on the fly" without changing the reader (like you could in some "other Lisps"), but currently the dispatch loop from which { needs to be intercepted isn't table-driven yet (several other things like string escapes, #\newline style character constants and special values like #!eof actually are table-driven, mainly because that way the pretty printer will always match what the reader expects.. should fix the dispatch too, I guess).

Another (though somewhat larger) µL hack was to add support for "alien" types: these are basically opaque typed pointers from the Lisp heap to somewhere else. Now it's simple to add new types for external resources (like files, or C-heap types, whatever), while still handling garbage collection sensibly (both collecting such external objects if desired, and allowing the external objects to keep references to Lisp heap objects that might move arbitrarily). Fortunately it turned out to be ridiculously simple to add finalization and a bunch of visitor callbacks to the garbage collector (which isn't that fancy.. just a simple Cheney with a rudimentary nursery to deal with the worst of the "garbage on arrival" which the interpreter creates fast enough that I probably should add proper generational logic).

PS. Should probably upload the whole thing somewhere... but I'd like to fix a few more broken/missing features first, such that it might be even theoretically possible to find some practical utility, other than messing around in the REPL.

Sunday, December 16, 2012

More on µL..

I've got µL up to a point, where I can start the REPL, load the compiler with a C-code backend, and type (make-system) to rebuild the whole thing as C-code. The way it works, I have a basic runtime library (ulisp.lib) with just the primitives, garbage collector and other essentials. Then you build an application with µL code "compiled" into C-code (chunks; you could load those dynamically too) that runs the code, without requiring the reader or the compiler. It's still somewhat more like "loading" though, in the sense that it essentially calls what the compiler would have called had it compiled things "on-the-fly."

So last time I was talking about C++, but I actually gave up on C++ for this project. Somehow I couldn't get the library API clean without making it all "C-like", and I figured I could just rewrite it in C. So I did; it's all now plain ANSI C (for practical purposes at least). The rewrite took a day and the result is many times cleaner. Adding new C procedures (whether simple primitive functions or C-closures) is now very straightforward and simple.

I guess it's starting to get to the point where it might make sense to put it online somewhere and attach an open-source license... if I can figure out what that license should look like.

Saturday, December 8, 2012

Let's do µL again.

Back in 2003 or so, I had a pet Lisp-variant project that I called µL, which eventually got buried behind other things, and I kinda lost the sources and all.

A week or so ago (yes, this all happened in a week), I happened to consider the idea of writing a simple Lisp implementation again, and managed to find a moderately usable version of my old µL code. Specifically, the "init" library was what I wanted, as it had most of the "really annoying stuff" done: writing quasiquote without quasiquote is quite painful.. especially without fancy features like let-forms (most other macros depend on quasiquote, because it's just easier that way).

Long story short: I decided to get rid of the old "compiler" and wrote a minimal interpreter in Gambit-C; about 250 lines of verbose Scheme code, not counting primitive definitions (those are under 100 totally trivial lines). This got me running, so I could start writing µL code.

Naturally running µL inside Scheme would be pointless, as I could just use Scheme instead (and macros/redefinitions to make it a bit less call-with-verbose-name). Instead I wanted the whole thing to run inside C++ possibly as multiple copies in a DLL loaded into a third-party application that doesn't care (yeah, plugins, why not).

So.. after some cleanup, I then wrote a basic meta-circular compiler that targets what is essentially CPS with explicit environments. What comes out of the compiler is still fairly high-level, and it "generates code" by calling a bunch of procedures. Currently there exist three such "back-ends": a primitive back-end in the Scheme interpreter to generate Scheme closures, a primitive back-end in the C++ VM to generate closures for the VM, and a back-end written in µL that generates C++ source to build it all at run-time (sort of).

This was easy enough, and getting the C++ VM to work wasn't too bad either. Implementing primitives is obviously tedious, but simple enough. Writing a Cheney-style garbage collector took a day or so. Harder than writing the collector was making sure all temporaries are properly protected; in my setup every allocation can trigger a collection and invalidate unprotected pointers.. setting every allocation to actually collect turned out to be a nice way to find the bugs. I had to redo some of that later when I added generational collection though, because I needed to retrofit the building of the "old to new" back-reference lists.
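The "protect every temporary" discipline can be sketched like this (illustrative C++ only, not the actual µL VM's API; the names Obj, Protect and fake_collect are mine). A toy "collector" that moves every rooted object on demand makes the failure mode, and the fix, visible immediately:

```cpp
#include <cstddef>
#include <vector>

// With a moving (Cheney-style) collector, any allocation may relocate
// every object, so each live temporary is registered on an explicit
// root stack that the collector scans and fixes up.

struct Obj { int value; };

static std::vector<Obj**> g_roots;        // slots the collector updates

struct Protect {                          // RAII root registration
    explicit Protect(Obj*& p) { g_roots.push_back(&p); }
    ~Protect() { g_roots.pop_back(); }
};

// Toy "collector": move every rooted object to a new location and fix
// up the root slots, like a copying GC does after evacuation. Anything
// not rooted would be left dangling, which is exactly the bug class
// that forcing a collection on every allocation flushes out.
void fake_collect() {
    for (Obj** slot : g_roots) {
        Obj* moved = new Obj{(*slot)->value};
        delete *slot;
        *slot = moved;                    // pointer fixed up via the root
    }
}

// Demo: a protected pointer survives (and moves across) a collection.
bool demo() {
    Obj* tmp = new Obj{42};
    Protect guard(tmp);                   // without this, tmp would dangle
    Obj* before = tmp;
    fake_collect();                       // stands in for "any allocation"
    bool ok = (tmp != before) && (tmp->value == 42);
    delete tmp;
    return ok;
}
```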

A bit more work was writing a reader and pretty printer. Debugging the reader proved especially nasty; it was confusing to try to figure out what input was going to the Gambit-C reader, and what was going to my in-development reader, which would crash randomly, leaving the input stream in weird states. Anyway, the reader is partially table-driven and the pretty printer uses the same tables, so simple additions to the reader will automatically print right the moment they are defined.

Which brings us to the fun part: the C++ VM has just enough primitives to work with all the datatypes (cons, char, fixnum, string, symbol, fn), including the back-end to generate functions on the fly, but practically no "higher-level" functionality. The REPL, reader, compiler front-end and printer all run in the VM (and can be removed by garbage collection if no longer referenced). The only I/O primitives required for an interactive prompt are getchar/putchar (that's all the I/O currently).

The VM reports runtime errors (trying to call non-procedures, or primitive argument type errors) back to the µL code in the VM, and there is no way for µL code to crash the VM (this is important for my purposes; the VM must be a safe sandbox by default). There are still two ways to cause a "graceful" panic though: generate and run code that violates the "linkage rules", or run out of memory. Out of memory is tricky, but turning the former into run-time errors would be trivial. However, I'd rather throw type errors at link time (i.e. when the compiler assembles the code, rather than when it's executed), and writing code for that hardly makes sense until the API has stabilized; being high-level, it needs changes (mostly additions) when adding optimizations to the compiler.

Did I mention that not only full tail-calls but also arbitrary first-class continuations work out of the box? In fact, calls to call/cc are also properly tail-recursive (i.e. you can loop through let/cc and never run out of space), though I'm not sure if it's possible to implement dynamic-wind without losing that.

Now, the VM needs a lot of cleanup (it can't currently provide the continuations of runtime errors for restarting purposes, for example) and features to integrate with a host app, but it's stable, doesn't crash (except for the panics mentioned), and seems to run fast enough to be usable. Since inside the sandbox everything is meta-circular and you can redefine pretty much everything on the fly, you can totally break the environment, but it'll still keep running.

Examples of how to break the environment were quite painful when I was extending the macro system to support extension of special forms (there are 6 special forms: if, do, fn, def, set, quote), but now I have a macro for (def (fname . args) ..) style defines (which you can redefine/extend further) while the compiler only supports the (def sym value) style.. and fn-forms (lambdas) support some additional features via macro-expansion too... and so on. Naturally, replacing the compiler on the fly would also work (as far as the VM cares), but it's nice to keep the core language simple (namely 6 special forms and around 50 primitives, not counting the 20 or so that implement the compiler back-end).

Unfortunately, for "full rebuilds" I'm still dependent on the Scheme interpreter, because I don't have file I/O in µL yet (well, in the abstract sense of "file", since the host application should decide what a "file" actually is; remember I want to embed this into applications), which prevents the µL in the C++ VM from generating new bootstrap code (in Scheme it's easier to cheat)... but considering this is still a "days" rather than "weeks", "months" or "years" old project...

...I'm not sure why I've not done it before.