Tuesday, April 27, 2021

Regex Fun from Radix Trees

Despite sleeping at stupid o'clock, I found myself up at when the sun is up.

There isn't any surprise there.

What is truly surprising is after fixing the prettyprinter script to take care of the correct HTML5 nodes, I was dissatisfied. Mainly because I am using a regular expression to identify whether the provided HTML5 tag is one that I should recurse on or not.

That in its own isn't a problem. The problem comes when I want to make the expression not as long as the dumb enumeration mechanism, mostly because this JavaScript file is almost always loaded in all my pages that I author, and so ``every byte counts''. The relevant HTML5 tags share certain common prefixes (like h1, h2, h4), and in some cases, common suffixes (like figcaption, caption) as well. Thus, one can collapse some of these, for example h[124] instead of h1|h2|h4, and (fig|)caption instead of figcaption|caption.

The reason why I was lazy in the first place was that I was doing all these collapsing of cases by hand. It's stupid, error-prone, and unsystematic, not to mention a real pain in the ass to update/fix.

Naturally, I wrote a Python3 program to do the job for me.

The basic idea was to construct a type of radix tree to incrementally add each relevant HTML5 tag, before generating the associated regex using an in-order traversal. The splits are on the longest common prefix of the HTML5 tag, and any recursion on the children nodes just form parenthesised sub-expressions in the ensuing regex.

There's obviously no guarantee that the radix tree itself generates optimal length regexes, and frankly for my purposes it is not really needed. I just want an automated way that does reasonably efficient compression of the regex denoting the set of tags of interest, though the moment I wrote that past line, I am suddenly reminded of Bloom filters, of which I already have a working JavaScript version that auto-scales to any specific false positive probability threshold. But the machinery for that is messy, and increases the total number of bytes for the JavaScript file.

I added some micro-optimisations to the generator code as well. One of them is to collapse many single-character child-nodes into a single character set, i.e. from a|b|c|d to just [abcd]. The threshold where this starts to win is when the number of such single-character child-nodes is greater than 3 (they break even at 3), though this threshold is thrown out of the window if all the child-nodes are just single characters, i.e. a(b|c) versus a[bc].

The other is to handle common suffixes. The prefix-biased radix tree cannot detect common suffixes for collapsing (by design), so ul|ol to [uo]l isn't something that can be done. The hack I used is to look for all non-recursive children trees and construct a prefix tree of reversed strings, which is a suffix tree in disguise. There are some edge cases to take care of which are not important, but those are taken care of eventually.

While I was expecting to see [uo]l, I actually saw this optimisation used in s(|(ectio|pa)n|...) instead, which I find to be cute. This happened only because I expanded the list of tags I cared about since I now have an automated way of generating the hairy-looking regex.

Alright, that's enough nerding out for now. I'm going to head out to... wherever as I vaguely planned.

Till the next update.

No comments: