Friday, September 23, 2022

Prototype 5×10 Font Arrives!

Ooo... buckle up. This is a sequel to Stupid O'Clock is the Font of Stupidity.

Our sequel begins where the math was grokked. To wit:
For those, I can bring the terminal screen to use the full extent of the 1920 horizontal pixels. Easy. And without going that far, Vim has no text decoration that eats up cells, which gives us the full 80 characters (I refuse to use line numbers on the side for the simple reason that the ruler in my status line already shows the line number of the cursor). Zipf's Law suggests that very long lines are rare, and thus as long as I can find a font for my terminal emulator that can handle about 161 characters for half the screen (the math says that each character gets at most about 1920/2/161=5.962 pixels), I'm gold.
That 6×11 font that I put together is displayed here:
In comparison, this is the 6×10 font of Opti Small:
Thus, when that post ended, I had decided to stick with Opti Small because I figured that interpreting the required width of 5.962 pixels as 6 pixels was good enough to support the vertical split screen with the 80-character line length limit.

Well, after nearly a week of using that setup, the answer was a resounding no.

Even a simple vertical split would, for some reason, leave one pane capable of showing only 79 characters, and with the word-delimiter soft-wrapping option that I use in Vim, that made everything very misleading.

Thus I had to choose the next best width: 5 pixels. There were, unfortunately, some problems. I mentioned in my previous post that CG Mono had issues. Specifically, this was what I said:
It was fine, but still looked a bit off because the `>' glyph was bogus---it lacked the crispness that came from bitmap fonts, and looked like some other font was chosen to render it instead.
More specifically, this is what CG Mono looks like at 7pt (or 6×10):
Notice the weirdness in the `>' glyph---I didn't do anything to it. All the screenshots are taken directly from whatever mintty rendered.

The only font that I had lying around that had 5 pixels of width was CG Mono at 6pt (or 5×9):
See, the problem with this font is that the most-used characters (i.e. those between 0x20 and 0x7f) have instances where they are smushed into each other in a bid to preserve some of the fine detail (like the tines in `M', `w', and the like). Compare this against the sample of Opti Small above---Opti Small is so much more readable with its enforced vertical column of spacing.

The only problem was that Opti Small was not 5 pixels wide---it was 6.

And so, I decided to take Opti Small as the starting point, and make my own 5×10 font.

Why 5×10 and not 5×9 like the one in CG Mono? It's a trade-off between verticality (108 lines versus 120 lines on a 1080-pixel-tall screen) and the overall need for spacing, plus it reduces the number of dimensions of downsampling I needed to worry about (I only needed to eliminate one column of pixels, as opposed to one column and one row).

The general observation here is that while each Opti Small character sits in a 6×10 cell, the actual glyph is designed to stay within a 5×7 grid, not counting ascenders and descenders. This means that Opti Small effectively has roughly 3 pixels of whitespace separating each line. Comparatively, CG Mono at 6pt (or 5×9) has only 2 pixels of whitespace between lines. That doesn't sound like much, but at these scales it's the difference between quickly deciphering a character and developing a headache.

Thus my real task was to approximate a 4×7 character out of the 5×7 that Opti Small uses. It was mostly quite straightforward---I followed these rules of thumb that I came up with:
  • Eliminate the third column of pixels if possible;
  • If a key feature of the glyph requires pixels in the centre, offset it such that we eliminate pixels to the left of it;
  • When in doubt, eyeball and trust what the eyeball says.
That led to this current version of my cobbled-together 5×10 font:
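Mechanically, the first rule is nothing more than deleting one column from each row of the glyph bitmap. A toy sketch of the idea (the glyph representation and the example `H' are illustrative; this isn't the actual tool I used):

```python
def narrow(glyph, drop_col=2):
    """Drop one column of pixels (default: the third, index 2) from each
    row of a 5-pixel-wide glyph, yielding a 4-pixel-wide approximation."""
    return [row[:drop_col] + row[drop_col + 1:] for row in glyph]

# A made-up 5x7 `H' ('#' = set pixel, '.' = empty).
aitch = [
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
    "#...#",
    "#...#",
]

narrowed = narrow(aitch)  # every row is now 4 pixels wide
```

The second and third rules are where the offsetting and eyeballing come in; no code for those.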
The eagle-eyed might ask: how did I manage to do this so fast, considering that I mentioned I had been test-running the setup for the week before deciding to change things up more?

More pertinently, how did I begin with a bitmap of Opti Small that was accurate and without excessive amounts of labour?

The answer lies in FontForge, and the BitmapFont section of the SFD file format. I loaded the .FON file of Opti Small into FontForge and exported an SFD file. Then I extracted the BitmapFont section, and wrote a simple parser in Python 3 that looked for each BDFChar command, used base64.a85decode() to decode the bitmap string, and wrote it out in a file format compatible with Simon Tatham's font tools.
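The core of that parser sketches out like this (hedged: I'm assuming here that each BDFChar line ends with its ASCII-85 payload as the last whitespace-separated token; the real SFD records carry more fields than this toy handles):

```python
import base64

def extract_bdf_bitmaps(sfd_text):
    """Collect the decoded bitmap bytes from every BDFChar record in the
    BitmapFont section of an SFD dump.

    Simplifying assumption: each BDFChar line carries its ASCII-85
    encoded bitmap as the last whitespace-separated token; the actual
    SFD format has more fields (and quirks) than this sketch covers."""
    bitmaps = []
    for line in sfd_text.splitlines():
        if line.startswith('BDFChar:'):
            payload = line.split()[-1]
            bitmaps.append(base64.a85decode(payload))
    return bitmaps
```

The ASCII-85 alphabet contains no whitespace, which is what makes the lazy split-and-take-last approach workable.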

From that generated file, it was just a straightforward exercise to adjust it into a 5×10 font, which I then built into a .FON file, giving the result shown above.

There are some things I'm still unhappy with in this new 5×10 font, but they mostly concern the characters in the range 0x80 to 0xff, which I rarely use. I suppose I can iteratively refine them over time.

One other thing that I forgot to mention: in a bid to keep the vertical columns of whitespace that make individual character recognition easier, I had to decide how to deal with the ``high contrast'' tines that appear in glyphs like `M', `W', and the like. From working in the text editor of the Pico-8 Fantasy Console, I had learnt of (and gotten used to) its rather tiny and stylised font, so the idea of using thicker lines to represent implicit high-contrast vertical strokes was no longer abhorrent.

And now, with a font that is 5 pixels wide, my half-screen setup permits 1920/2/5=192 characters across. It's wonderful. The ample unused pixels in both the horizontal and vertical directions make reading such a font surprisingly comfortable, and not much different from reading something in Opti Small.

Will I be releasing this 5×10 font? I'm not sure. Like I said, there are still things that I need to tweak and fix. In addition, the original author of Opti Small, Nicolas Botti, left behind a slightly confusing bit of copyright:
Copyright: Nicolas Botti 2004. Use it, distribute it, change it.
I'm no lawyer, so I don't know how best to interpret this. I suppose it is fairly safe to release the 5×10 font, but it'll be ``when it's done'', perhaps even when I've figured out how to generate a version of the font that can be used directly in a Linux desktop GUI terminal emulator.

Anyway, that's all I have for today. Till the next update then.

Friday, September 16, 2022

Stupid O'Clock is the Font of Stupidity. Who'd Thunk?

Man, I slept at 0230hrs this morning and woke up at 0825hrs.

The reason is quite banal---I was creating my own 6×11 bitmap font for use in my terminal emulator when I realised that the one I was using had a bad render of the `>' glyph. What ate much of the time was trying to figure out why the blazes my terminal emulator (mintty for the curious---it's from Cygwin) was still rendering that font as 6×13 instead.

Spoiler alert: the semi-hidden option AutoLeading was screwing up the vertical pixel count. 🤦‍♂️ But more on that later.

I need to begin the story where it will make more sense. I had thought long and hard, and decided to move away from my default maximum line width of 76 characters for source files to 80 characters instead, especially for Python.

The reasons were as follows:
  1. Those missing 4 characters were enough to cause an unnecessary break for a line that would fit exactly within 80 characters;
  2. By Zipf's Law, most of the code lines aren't ever going to be that long anyway, which means that it was usually safe to keep it at 80 characters as opposed to 76;
  3. My old comment on needing the extra columns to allow the additional decoration for TUI editors hasn't been applicable for more than two decades now.
By ``additional decoration'', I mean the following (the text editor is Multi-Edit Lite running in DOSBox-X, originally shareware from Multi-Edit Software, but the company seems dead as at the time of writing):
Notice how line-drawing characters are used to define the borders of the text window, together with the old-school representation of scroll bars and the like? Those take up character cells, and with a starting point of 80, the left and right window edges immediately take it down to 78 characters of visible text without scrolling. The 76 was derived from the observation that if one reaches the end of the line through any regular cursor movement, the cursor will often end up one character beyond the last character, which can trigger a horizontal scroll.

In short, it is annoying to use the full complement of 80 characters, and it was better to use something like 76.

But I'm using Vim now---and have been for a long time. That shouldn't matter any more.

The only thing that still matters is the idea of vertical split screens, either for looking at the differences between two files, or three in the event of a three-way merge.

For those, I can bring the terminal screen to use the full extent of the 1920 horizontal pixels. Easy. And without going that far, Vim has no text decoration that eats up cells, which gives us the full 80 characters (I refuse to use line numbers on the side for the simple reason that the ruler in my status line already shows the line number of the cursor). Zipf's Law suggests that very long lines are rare, and thus as long as I can find a font for my terminal emulator that can handle about 161 characters for half the screen (the math says that each character gets at most about 1920/2/161=5.962 pixels), I'm gold.
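Spelling out that arithmetic (the 161 columns being, presumably, two 80-column Vim panes plus the one-column separator between them):

```python
# Pixel budget: a half-screen-wide terminal must fit a Vim vertical
# split, i.e. two 80-column panes plus (presumably) the 1-column divider.
screen_width_px = 1920
half_screen_px = screen_width_px // 2      # 960 px for the terminal
columns_needed = 2 * 80 + 1                # 161 columns
max_char_width = half_screen_px / columns_needed
print(max_char_width)                      # ~5.96 px per character
```

Hence the hunt for a font no wider than 6 pixels, and ideally 5.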

Vaguely, this means a font that has about 6 pixels of width. My go-to from Proggyfonts (or more specifically, Opti Small) is already 6 pixels wide, but thanks to some fuckery from mintty, it looked off to me. That, and I wanted something that yielded more than 161 characters at the half-screen mark, to have more room for other niceties.

I then started looking into using CG Mono instead. It was fine, but still looked a bit off because the `>' glyph was bogus---it lacked the crispness that came from bitmap fonts, and looked like some other font was chosen to render it instead.

Why? I don't know. Looking at it through FontForge showed nothing out of the ordinary. The time was about 2300hrs or so. And that's when I went to Simon Tatham's Fonts Page to grab his scripts and samples to build my own 6×11 font to fix this issue.

I completed the font, but when I tested it, it was... very off. The vertical extent, compared against a gridded view in MSPaint, was consistently 2 pixels larger than what was stated. I read through Simon's code to see if he was doing anything strange, and found nothing.

It was then that I decided to go read the fine manual of mintty, to see if there was something I was missing.

Oh yeah, I was missing something all right---an option called AutoLeading that is completely inaccessible from the configuration screen. It defaulted to 2 for some damn reason, and there was no way to change it in the pop-up configuration menu. Editing the associated .minttyrc file by hand allowed me to set it to 0, and everything worked well once more.
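For anyone hunting down the same fix, this is the line in question in .minttyrc (the option name is real; the rest of my configuration is omitted):

```
AutoLeading=0
```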

At that point, I switched away from CG Mono and back to my usual font, and it looks peachy.

🤦‍♂️

So what's the moral of the story then?

Don't do weird shit at stupid o'clock---one really ends up doing stupid things at stupid o'clock. I mean, redesigning an entire bitmap font when all that was needed was a tiny configuration change in some semi-obscure file? What an ``excellent'' use of 3 hours.

Till the next update.

Tuesday, September 13, 2022

I Can Haz Steam Deck Plz?

Okay, it's getting late, so I will keep this short.

This week is going to be a short week for me due to me taking leave on Friday for no other reason than to take a day off. After all, that's what paid time off or ``annual leave'' is for.

It's not a privilege but an entitlement. It's good to have a small break every so often just to keep the burn-out monster at bay.

I'm waiting on the Steam Deck to be released in SIN city. So far, there is no indication that it will make its way here, though many signs are pointing towards that, considering how the availability blurb refers to Komodo making it available to South Korea, Taiwan, and Hong Kong, in addition to the originating country of Japan.

Okay, maybe not just SIN city, but SEA countries.

I'd love to have the Steam Deck. I used to have a Dingoo from way back when, and enjoyed the times that I played on it. It was mostly emulator-heavy, which was not a problem. Eventually I stopped playing on it because, let's face it, many old games need to remain in the past since their interfaces are just atrocious compared to what we have now, and because it was actually possible to exhaust the number of retro games I wanted to play. I had to junk it a few months ago because the lithium battery pack had swollen.

But a Steam Deck? I can see myself playing/grinding games from my Steam library on it. The ability to carry a game's progress between systems over Steam is a big part of why the Steam Deck is so appealing despite my earlier loss of interest in the Dingoo. It means that I could, say, grind The Binding of Isaac on either the PC or the Steam Deck without worrying about having to replicate feats to unlock features, and without having to do the synchronisation manually.

I'm not expecting to use the Steam Deck for long gaming periods---that's what Eileen-II is for, and is also why I never got on board with the Nintendo portables, up to and including the Switch. Small 30-minute plays during the commute are the most likely use cases---I can't keep reading forever. There are other handhelds similar to the Steam Deck (like the Aya Neo, or offerings from GPD), but they cost at least 50% more than the most pricey variant of the Steam Deck, and are Windows-exclusive machines.

That latter part is a bit off-putting. For what is meant to be a dedicated gaming portable, running a two-ton gorilla like Windows instead of the gaming-centric SteamOS feels like a recipe for losing a lot of precious battery juice somewhere.

Anyway, as I said, this is short. Till the next time then.

Thursday, September 08, 2022

Consigning Ramblings to Records

To feel loved is something that comes naturally to humanity. Hell, even I cannot run away from it sometimes, as misanthropic as I claim to be.

But what does it mean to ``feel loved''? Is it the same as ``being loved''? Does ``feeling loved'' actively require another entity outside of the soul/consciousness that is currently piloting this meat body? Is it something purely bio-chemical that we attach more significance to due to our insatiable need to derive meaning and symbology? If so, do animals other than humans ``feel loved'' as well?

I've been walking about the past couple of weeks in a semi-trance, in the sense that while I am fully aware and present wherever I am, I cannot help but notice that there is nothing inherently ``real'' about the world. All the physical sensations that we have, be they sight, sound, taste, and all others, are merely emergent behaviour built upon the interactions of unfathomably large numbers of primal components that are smaller than the atoms that we sort of understand.

Much of the world is emptiness, literally. Atomic structure itself is mostly space, though what that space is, is something that I don't think we have an answer to. Feelings of friction, Newton's Third Law observations, and other human-scale phenomena are interactions of invisible electromagnetic fields as atoms repel each other. The smaller-than-human-scale phenomena that allow us to ``run'' a consciousness in the form of biochemistry are even more difficult to comprehend---stupendously large numbers of chemicals are synthesised, interacted with, and broken down, all without necessarily having awareness of the larger ``I'' that has emerged from their collective behaviour.

Where then does ``I'' begin? And if I cannot tell where ``I'' begins, where then can I start talking about ``feeling loved''?

God is unknowable because He isn't one of us, even as we are made in His image. Yet we are reminded to know God relationally, to bear in mind His sovereign will over reality, His Perfect Plan, and His indescribable power in influencing and changing things according to the said Plan.

Is this what it means to ``feel loved''? That, even though every other human may not give a rat's ass about us, we have a Creator God out there beyond our ken, who made us through His cunning ways of manipulating extremely complex behavioural systems of remarkably simple axioms, and who cares enough about us that it does not matter if the rest of the world hates us?

Is ``feeling loved'' something that is bound by time, or in other words, is the existence of a specific direction of time necessary to allow one to ``feel loved''? Time is necessary for cause and effect---the very definition of cause and effect demands that a certain sequence occurs, and from such a sequence, a direction of time may be inferred.

If the house of God is beyond the observable universe, does it also mean that it has no direction of time the way we innately experience and ``know''? If so, when we are in the house of God in the end days, are we still loved, if ``feeling loved'' is indeed dependent on a cause and effect? We have hints in Scripture about the timeless nature of things, at least from the definition as derived from cause and effect, in that our salvation (and thus our reconnection with God the Father) comes not from works but from faith---it defies what we understand as a regular cause and effect.

If that is the purest and best form of love, does it mean that all other earthly feelings of ``being loved'' are just shadows that we need to learn to disregard and discard?

The confused might ask ``MT, where are you going with this?''

And I reply, ``Like hell I know. Not every blog post is about making a point that makes sense.''

Till the next update.

Monday, September 05, 2022

Making the Common Word List Source Technically Correct at 95% by Mentions

Okay, I couldn't let it go. I mentioned that the true count was just one processing step away, so I went and did it. The true total is 2.00T mentions rounded off (1997515570677, versus the original 1991804657887), a difference of 5.71G (5710912790). This difference directly led to 2.23k more words than the original 26.8k.

Will that make a noticeable difference? No clue, because I'm too lazy to diff the two top-95%-by-total-mentions word lists and perform the necessary qualitative analysis---it's just not worth it.

I've updated the bit-array for the Bloom filter, and so things are more correct than before.

I've also added a couple of warnings for when the counting statistics yield something that may make all the estimated readability metrics even more suspect. In addition, I chased down the actual paper regarding the automated readability test, and realised that the Wikipedia article misrepresented it (the horrors!).

The actual links and what-not are updated on the readability test page itself.

Okay, that's about it. I've deliberately set this publishing time to be the next day to not overshadow the last entry's triumph.

Till the next update.

Sunday, September 04, 2022

Readability Test

Hoo-wee. Let's nerd out a little.

This online readability test has been a project that I have been working on, on and off, for quite a while. While it is not something ground-breakingly new, it does get fed some rather beefy data to become the thing it is.

First off, let's get the big picture right. The tool implements three readability tests:
  1. Flesch reading ease;
  2. Flesch-Kincaid grade level; and
  3. Automated readability index.
The first two methods involve messing with word, sentence, and syllable counts, while the last involves only characters, words, and sentences.
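For concreteness, here are the three measures as commonly published (a sketch, not necessarily the exact coefficients my tool uses; and as the September 05 entry notes, the original ARI paper apparently differs from the popular Wikipedia presentation, so treat the last one as the textbook approximation):

```python
def flesch_reading_ease(words, sentences, syllables):
    # Higher scores mean easier text; ~60-70 is "plain English".
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Result maps roughly onto a US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def automated_readability_index(characters, words, sentences):
    # Uses character counts instead of syllables, hence "automated".
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
```

Note that the first two both hinge on the syllables-per-word ratio, which is exactly why syllable counting is the part worth the most effort.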

I know that MSWord has such readability tests built in. However, I don't have MSWord accessible most of the time, so having a tool on my own is probably better.

Character counts, word counts, and sentence counts are lexicographically mechanical in nature---the heuristics for English (my primary language) are quite straightforward. The tricky one is syllable counting---coming up with heuristics that work for a large quantity of words is both challenging and time-consuming.

To deal with that in the laziest way possible, I decided to just build a simple linear model to handle it. In theory, I could just grab a list of words with their syllable counts and use that as-is, but then I'd run into a word like ``omphaloskepsis'' that is rare enough not to appear in the list, and be faced with the question: ``what's the syllable count of this word?''

So some level of generalisability is necessary, and to do that, a model is needed.

Like any model, two big things need to be taken into account: ground truth and features. Roughly, we need ground truth to ``know'' what the mapping of word to syllable count (my learning problem) should be, while features are used to derive general properties of the word that may be generalisable to handle new unseen words.

I'll skip all the experimentation and just describe the stuff in point form.
  • The ground truth is derived from the CMU pronunciation dictionary, using a heuristic that maps one syllable count to one appearance of a vowel phoneme.
  • Features are binary (present or absent), no matter how many times they occur in the word.
  • I used skip-digrams as features: a skip-digram is a digram with a certain number of other characters (between 0 and 4) in between.
  • I used trigrams as features as well.
  • I used ``book-end'' features: it's a trigram made of the first character of the word, and the last two characters of the word.
  • I encoded the length of the word as a one-hot encoded feature for up to 10 characters, and one feature for words longer than 10 characters.
  • To ensure that short words are generally more correct (they are the most likely to be exceptions to any generalisable rule), I also added a special short-word cut-off, storing words of up to and including 4 characters as features in their own right.
  • The weight vector for the features is obtained by solving the system of linear equations using a sparse implementation of the least-squares algorithm.
  • The final stored weight vector is rounded off to 3 decimal places, and the final predicted syllable count from the model is rounded to the nearest whole number.
To use the model, I just created a JSON file of them, and made a small JavaScript interpreter to use that JSON data.
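The whole pipeline sketches out roughly like this in Python (the feature spellings and helper names are my illustration for this post, not the exact ones in the real model, and the toy problem below is tiny; the real 117k×47.6k system is why the sparse solver was needed):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def cmudict_syllables(phones):
    # Ground truth: CMUdict vowel phonemes carry a stress digit
    # (e.g. 'AH0', 'EY1'), so count one syllable per digit-ended phoneme.
    return sum(p[-1].isdigit() for p in phones)

def features(word):
    """Binary features per the list above (names are illustrative)."""
    feats, n = set(), len(word)
    for gap in range(5):                      # skip-digrams, 0-4 chars apart
        for i in range(n - gap - 1):
            feats.add(f'd{gap}:{word[i]}{word[i + gap + 1]}')
    for i in range(n - 2):                    # trigrams
        feats.add(f't:{word[i:i + 3]}')
    if n >= 3:                                # "book-end" trigram
        feats.add(f'b:{word[0]}{word[-2:]}')
    feats.add(f'len:{min(n, 11)}')            # one-hot length, 11 = "over 10"
    if n <= 4:                                # short words kept whole
        feats.add(f'w:{word}')
    return feats

def train(words, syllable_counts):
    vocab, rows = {}, []
    for w in words:
        fs = features(w)
        rows.append(fs)
        for f in fs:
            vocab.setdefault(f, len(vocab))
    X = lil_matrix((len(words), len(vocab)))
    for i, fs in enumerate(rows):
        for f in fs:
            X[i, vocab[f]] = 1.0
    w = lsqr(X.tocsr(), np.asarray(syllable_counts, dtype=float))[0]
    return vocab, np.round(w, 3)              # weights stored to 3 d.p.

def predict(vocab, weights, word):
    score = sum(weights[vocab[f]] for f in features(word) if f in vocab)
    return max(1, round(score))               # round to the nearest count
```

The JSON file mentioned above is then just the vocab-to-weight mapping, which the small JavaScript interpreter replays at prediction time.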

The final model took the 117k words extracted from the CMUdict dataset (~1.34M bytes) down to a 473k-byte JSON file. That's good enough, I suppose.

The logistics of training the model were... funny. Due to the number of features involved (47.6k) and the number of entries (117k), the matrix (5.57G entries, or about 44.6G bytes of memory for double-precision in dense mode) could only be solved using sparse linear algebra techniques. I couldn't run it in Cygwin due to the choice of library, but it could be run in Anaconda on Windows.

The only problem was that it was slow. Running it in a Xubuntu OS in VirtualBox was easily 3× faster.

Why? I have no idea. But it was done. And with only using 300M bytes of memory out of 2G bytes available too.

🤷‍♂️

So that solves the syllable count. Hurrah!

Well, that's not enough. If you've read the limitations of the measures, you'll realise that obscure words that happen to be short can break things. To deal with that, I decided to just add a list of discovered obscure words.

But it's quite nasty to maintain a list of obscure words to compare against---there are far too many of them. It's much easier to create a list of common words, and then report a word as obscure if it does not appear in this list.

There are two issues with that approach:
  1. Where do I get such a word list?
  2. How do I store and search through such a word list in a way that is space- and time-efficient?
Let's start with the second question since the answer is simpler: Bloom filters.

Put simply, Bloom filters are a probabilistic data structure based around hashing and a fixed-size bit array, where the false negative rate is zero. The false positive rate can be set, with a whole bunch of other parameters tuned to fit---I think the linked article does a better job of explaining things.

So for my obscure-words detector, I created a Bloom filter with a false positive probability of 0.00001, or 1 in 100k. The resultant bit-array and supporting code in JavaScript to implement the data structure weighed in at around 217k bytes. All this without actually storing any of the words.
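A minimal Python rendition of the idea (my real implementation is in JavaScript with a precomputed bit-array; the double-hashing scheme off a single SHA-256 digest here is just one reasonable choice, not necessarily mine):

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sized from the expected item count n and the target
    false positive probability p, using the standard formulas."""

    def __init__(self, n, p):
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)   # bits
        self.k = max(1, round(self.m / n * math.log(2)))          # hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from one digest.
        d = hashlib.sha256(word.encode('utf-8')).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:16], 'big')
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, word):
        # All k bits set => "probably present"; any bit clear => "absent".
        return all(self.bits[pos >> 3] >> (pos & 7) & 1
                   for pos in self._positions(word))
```

The false negative rate is zero by construction: every added word sets all of its k bits, so a later membership test can never miss it.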

So how many words went into this Bloom filter? About 26.8k words, constituting the top 95% of word uses.

Which brings us to the first question: where to get such a word list?

I only know one source for this: Google NGrams. More specifically, the 1-gram dataset. I had used the Google NGrams dataset a long time ago to build some tetragrams as part of an evaluator for my experiments on automated cryptogram solvers, but the newest dataset (from 2020) has a slightly different data format.

No matter. Getting the data is easy. Processing the downloaded files yielded 79.1M ``words''. Dropping all the specially marked words (hence the quotation marks) brought it down to 36.9M words. But that wasn't enough, since this was case sensitive. One more round of processing yielded 29.0M case-insensitive words, with a grand total of 1.99T mentions, which I filtered down to the most popular words (by mention) that added up to 95% of the mentions (or about 1.89T mentions). This led to that 26.8k words, which sits at 216k bytes.
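In code, that processing is morally equivalent to the following (the (word, count) row shape and the underscore-tag convention for specially marked entries, e.g. `run_VERB', are from memory; treat the details as assumptions):

```python
from collections import defaultdict

def aggregate(rows):
    """rows: iterable of (word, match_count) pairs summed across years.
    Drops the specially marked (POS-tagged) entries and folds case."""
    counts = defaultdict(int)
    for word, c in rows:
        if '_' in word:            # tagged entries like 'run_VERB'
            continue
        counts[word.lower()] += c
    return counts

def top_share(counts, share=0.95):
    """Most-mentioned words whose counts sum to `share` of all mentions."""
    total = sum(counts.values())
    kept, running = [], 0
    for word, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        if running >= share * total:
            break
        kept.append(word)
        running += c
    return kept
```

The surviving word list then gets fed straight into the Bloom filter.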

The more astute will realise that I should have used a different sum total, because of the ``minimum of 40 mentions to appear in the individual word list'' rule. Yeah, but it's already done, and I am quite satisfied with the results, even though including those extra counts of ultra-obscure words would increase the number of words counted as frequently used.

And so, we have the final result: the online Readability Test.

Phew, that's quite a big nerd out. The only other thing I want to add is the use of tqdm as a quick way to show progress in terms of total time, as well as the rate of processing in iterations per second, all while using unit scaling to show large numbers in nice SI-prefix notation.

Till the next update.