Thursday, April 29, 2021

Improving the Prettyprinter and a Random Rant

Ah, Thursday. I want to say that I have done a lot of things, but the reality is often very different. I chilled and stared off into space a lot today. But more on that in a bit. Let me quickly summarise the more... concrete things that I have done.

I mentioned nearly a month ago how I used MutationObserver to re-execute the entire tree-walker, to ensure that any newly added nodes were pretty-printed correctly. I also mentioned that doing it correctly would require carefully tracking which [types] of nodes needed to be adjusted and which to ignore. What I didn't mention was that part of the difficulty lay in the hard-to-maintain nature of the tag-pattern checker.

I solved that problem recently, so it seemed like the best time to address the efficiency issues more thoroughly this time round. I refactored the code that determines whether a particular HTML5 node should be processed, making it use the updated tag-name matcher, and factored out the special ``ignore me'' class check. This refactoring allowed me to do two more things to make the MutationObserver-based code more efficient:
  1. Check if the tag-name/class of the node is to be processed;
  2. Recursively walk from that tag-node down to its children.
I only faced one issue: how to ensure that nodes with an ancestor marked ``ignore me'' never get processed, no matter how deep in the tree they sit.
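
In sketch form, the mutation handler now looks something like the following TypeScript. To be clear, PROCESSABLE_TAGS, IGNORE_CLASS, and prettyPrint here are hypothetical stand-ins for my actual helpers, so read this as a sketch of the idea rather than the real code:

    // Hypothetical stand-ins for the real tag-name matcher and helpers.
    const PROCESSABLE_TAGS = new Set(['P', 'DIV', 'PRE', 'BLOCKQUOTE']);
    const IGNORE_CLASS = 'ignore-me';

    function prettyPrint(el: HTMLElement): void {
      // The actual per-node pretty-printing would go here.
    }

    function processSubtree(node: Node): void {
      if (!(node instanceof HTMLElement)) return;
      // Step 1: check the tag-name/class of the node.
      if (node.classList.contains(IGNORE_CLASS)) return; // prune whole subtree
      if (PROCESSABLE_TAGS.has(node.tagName)) prettyPrint(node);
      // Step 2: recursively walk down to the children.
      for (const child of Array.from(node.childNodes)) {
        processSubtree(child);
      }
    }

    const observer = new MutationObserver((mutations) => {
      for (const m of mutations) {
        m.addedNodes.forEach(processSubtree);
      }
    });
    observer.observe(document.body, { childList: true, subtree: true });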

My original plan was to walk up the parent nodes until I reached the root of the processed tree (the body-tag, in this scenario) or a node that said ``ignore me'', returning an appropriate response. Naturally this was slow, since it works out to O(n log(n)), assuming the worst case of adding n nodes with an average tree-height of log(n). Then I tried to memoise the parents, but that was stupid for different reasons.
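
For the curious, that rejected upward walk would have been roughly the following (same hypothetical naming as before):

    // Walk up from each new node towards body, looking for an ancestor
    // carrying the ignore class. O(depth) per node, hence the O(n log(n))
    // figure above for n insertions.
    function isIgnoredSlow(node: Node): boolean {
      let cur: Node | null = node;
      while (cur && cur !== document.body) {
        if (cur instanceof HTMLElement && cur.classList.contains('ignore-me')) {
          return true;
        }
        cur = cur.parentNode;
      }
      return false;
    }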

Then the answer hit me: since I had to walk through the tree anyway on the first load, why not propagate any such ``ignore me'' class information down to all the children? Then, whenever new nodes are added, I just need to check the parent for the flag, and if it exists, propagate it down while doing the right thing. It would be the same as memoisation, except more efficient and significantly less stupid.
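
In sketch form, assuming the same hypothetical names as before:

    // During the initial full walk, stamp the ignore flag onto every
    // descendant of an ignored node, so later checks only need to look
    // one level up.
    function stampIgnoreFlags(el: Element, inherited: boolean): void {
      const ignored = inherited || el.classList.contains('ignore-me');
      if (ignored) el.classList.add('ignore-me');
      for (const child of Array.from(el.children)) {
        stampIgnoreFlags(child, ignored);
      }
    }
    stampIgnoreFlags(document.body, false);

    // For a newly added node, a single parent lookup now suffices.
    function isIgnoredFast(el: HTMLElement): boolean {
      return el.classList.contains('ignore-me')
        || (el.parentElement !== null
            && el.parentElement.classList.contains('ignore-me'));
    }

The stamping costs nothing extra, since the first-load walk visits every node anyway; the pay-off is that the per-mutation check drops to a constant-time parent lookup.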

And that was what I did. Now that code is more efficient than before, and I am glad.

------

I didn't do any ``real'' reading, having only touched a bit of SCP Foundation Tome 5: SCP2000--2499 (page 952/2863). These SCP Tomes are in Mobi format, and thus they get reflowed all the time; the actual page numbers matter less than the proportion of the book they represent. The only reason I even managed to read one SCP entry was that I was waiting in line for a snack at the nearby food place.

I played more Cyberpunk 2077, clearing more side quests in Night City. It's cathartic to run about in a ray-tracing-enhanced game world. There is that subtle difference that makes it that much better than playing without any ray-tracing: driving along a road after an in-game rain looks very different, with all those subtle reflections and distortions. I guess the way graphics programming evolves over time with increasing graphics-card capabilities isn't about expanding the visible horizon, but about increasing the detail of things that are much closer. I mean, I play Minecraft too, and I will say that most of the time, one is really just interacting with a sort of ``close'' environment as opposed to a God's-eye view of the world that requires large expanses of the land to be brought into play.

Such technology limitations [of not having the processing bandwidth to bring in information over a larger area for a wider visual horizon] are often carefully camouflaged by the game design. In old-school game engines running on old hardware without such capabilities, one would notice that level designs attempted to draw out the strengths of the game engine while hiding its weaknesses: the many elevators and corridors that hide the loading times of large area maps; the wide arenas that somehow have just enough obstacles to make a completely unfettered view out to the horizon nearly impossible; or the exact opposite in the Serious Engine by Croteam, which eschews fine detail to support big, open areas with many mobs.

In many ways, there is a parallel between game design + technology and music composition/arrangement + the skill of the players. Good game design (composition/arrangement) emphasises the strengths of the technology (players) while de-emphasising its weaknesses, biasing the consumer (audience) towards a more positive impression of the finished work. So in an objective sort of way, there is a decent enough bar for measuring the success of a game design + technology combination, just as there is a decent enough objective way of measuring the success of a music composition/arrangement + skill-of-players combination.

But these objectivities are then tempered against the preferences of the individual.

This idea of objectivity which is further projected into subjectivity can be seen as the driving concept of a lot of machine learning and data mining.

The big wins are already there---we're talking about the stuff that can already provide 75% precision/recall/F1. We know, objectively, that the big wins are already here because we have textbooks that describe these techniques as though they were fact (compare with the discussion of Newtonian physics, at least at the early education levels). But the large companies that have already been winning with these ``cheap'' but high-impact tools want to raise the barrier to entry, and that is why there is the whole spiel on ``personalised machine learning'', or whatever the current marketing buzzword is.

However, the lines really aren't that clear-cut. It may seem better to let machine learning algorithms tailor their outcomes to a person according to his/her personal data, but that method suffers from a literal lack of good-quality data. Or put another way, a data-driven algorithm that is trained only on [say] your data is not likely to perform well, even for you.

This is because all data-driven methods are necessarily interpolatory in nature. I would go as far as to claim [without proof] that using mathematical formulae alone to extrapolate will never yield perfectly bounded variances, because dynamical systems are chaotic in nature.

Thus, the best way to hedge one's bets is to try and increase the sample size, which paradoxically means pooling ``similar'' people's data together so as to train a more strongly predictive model for the ``personalised machine learning''. Part of the black magic then is how such ``similarity'' can be determined. All these are non-trivial, and most require stupid amounts of data to do anything useful.
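
As a toy illustration of what I mean by pooling, here is a sketch where the feature vectors, the choice of cosine similarity, and the cut-off k are all assumptions on my part for illustration, not a description of any real system:

    // Measure user ``similarity'' as cosine similarity over some
    // per-user feature vectors, then pool the k most similar users'
    // data to train the ``personalised'' model.
    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
    }

    function poolSimilarUsers(
      target: number[],
      users: Map<string, number[]>,
      k: number,
    ): string[] {
      return Array.from(users.entries())
        .map(([id, vec]) => ({ id, sim: cosine(target, vec) }))
        .sort((x, y) => y.sim - x.sim)
        .slice(0, k)
        .map((u) => u.id); // ids whose data get pooled into training
    }

The black magic, of course, is hidden in where those feature vectors come from in the first place.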

I mean, if a problem does not require stupid amounts of data to fit some machine learning model, then it should be straightforward enough to write an old-fashioned [sequential] program to solve it; that's what the whole study of [classical] algorithms in computer science is for, anyway. It is because there are problems for which we don't know the right representation that we find a data-driven approach (i.e. a data-based reconfiguration of a program skeleton, which represents the expressivity) to be superior.

In human terms, anything that doesn't reduce to a [low Kolmogorov complexity] heuristic that we can tell someone gets lumped into the catch-all of ``intuition''.

I don't know what this all means. But I have been staring randomly into space, mulling this over on and off.

I think I'll head back into Night City after this post goes up. Wanna enjoy more of that sweet-sweet graphics and gaming fun.

I also want to point out that the recent large number of anonymous comments was more amusing to me than alarming, mostly because it was anomalous. It is definitely not troll/spam levels of anomalous, but as someone trained in machine learning/data science (urgh, I hate that term), observing patterns and anti-patterns is a thing.

Alright, that's all for now. I think I'll go work on the next Bible study session worksheet tomorrow, and maybe watch Futurama. Yes, I have decided to finish that for the umpteenth time instead of Darkwing Duck. But we'll get there... I do have more than half of my sabbatical left to go.

Till the next update.
