Sunday, September 04, 2022

Readability Test

Hoo-wee. Let's nerd out a little.

This online readability test is a project that I have been working on, on and off, for quite a while. While it is not something ground-breakingly new, it does get fed some rather beefy data to become the thing it is.

First off, let's get the big picture right. The tool implements three readability tests:
  1. Flesch reading ease;
  2. Flesch-Kincaid grade level; and
  3. Automated readability index.
The first two methods involve messing with word, sentence, and syllable counts, while the last involves only characters, words, and sentences.
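For reference, all three measures boil down to a few ratios of those counts. A quick Python sketch using the published coefficients (the counts themselves are assumed to come from elsewhere):

  def flesch_reading_ease(words, sentences, syllables):
      # Higher scores mean easier text; typical range is roughly 0 to 100.
      return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

  def flesch_kincaid_grade(words, sentences, syllables):
      # Maps the same two ratios onto a US school grade level.
      return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

  def automated_readability_index(characters, words, sentences):
      # No syllables needed; characters per word stand in for word difficulty.
      return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43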

I know that MSWord has such readability tests built in. However, I don't have MSWord accessible most of the time, so having a tool of my own is probably better.

Character counts, word counts, and sentence counts are lexicographically mechanical in nature---the heuristics for English (my primary language) are quite straightforward. The tricky one is syllable counting---coming up with heuristics that work for a large variety of words is both challenging and time-consuming.
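To give a feel of how mechanical the first three are, here is a rough sketch of the sort of heuristics I mean (illustrative only, not the exact rules the tool uses):

  import re

  def basic_counts(text):
      # Words: runs of letters and apostrophes; characters: count letters within words.
      words = re.findall(r"[A-Za-z']+", text)
      characters = sum(len(w) for w in words)
      # Sentences: terminator runs followed by whitespace or end-of-text, floor of one.
      sentences = max(1, len(re.findall(r"[.!?]+(?:\s|$)", text)))
      return characters, len(words), sentences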

To deal with the syllable problem in the laziest way possible, I decided to just build a simple linear model to handle it. In theory, I could just grab a list of words with their syllable counts and use that as is, but then you run into a word like ``omphaloskepsis'' that is rare enough not to appear in the list, and you are left with the question of ``what's the syllable count of this word?''.

So some level of generalisability is necessary, and to do that, a model is needed.

Like any model, two big things need to be taken into account: ground truth and features. Roughly, the ground truth lets us ``know'' what the mapping from word to syllable count (my learning problem) should be, while the features capture general properties of a word that can carry over to new, unseen words.

I'll skip all the experimentation and just describe the stuff in point form.
  • The ground truth is derived from the CMU pronunciation dictionary, using a heuristic that counts one syllable for each appearance of a vowel phoneme (see the sketch after this list).
  • Features are binary (present or absent), regardless of how many times they occur in a word.
  • I used skip-digrams as features: a skip-digram is a digram with a certain number of other characters (between 0 and 4) in between.
  • I used trigrams as features as well.
  • I used ``book-end'' features: it's a trigram made of the first character of the word, and the last two characters of the word.
  • I encoded the length of the word as a one-hot encoded feature for up to 10 characters, and one feature for words longer than 10 characters.
  • To ensure that short words (the most likely to be exceptions to any generalisable rule) are handled correctly, I also added a special short-word cut-off, storing whole words of up to and including 4 characters as features.
  • The weight vector for the features is obtained by solving the resulting system of linear equations with a sparse implementation of the least squares algorithm.
  • The stored weight vector is rounded off to 3 decimal places, and the final predicted syllable count from the model is rounded to the nearest whole number.
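To make those bullets a little more concrete, here is a rough sketch of how the ground truth and the features might be derived. The feature tags and function names are mine, purely for illustration; the real pipeline differs in its details.

  import re

  VOWEL = re.compile(r"\d$")  # CMUdict vowel phonemes carry a stress digit (0, 1, or 2)

  def syllables_from_cmudict(pronunciation):
      # e.g. "AH0 B AE1 N D AH0 N" (abandon) -> 3
      return sum(1 for phoneme in pronunciation.split() if VOWEL.search(phoneme))

  def features(word):
      word = word.lower()
      feats = set()
      # Skip-digrams: two characters with 0 to 4 other characters in between.
      for skip in range(5):
          for i in range(len(word) - skip - 1):
              feats.add(f"sd{skip}:{word[i]}{word[i + skip + 1]}")
      # Plain trigrams.
      for i in range(len(word) - 2):
          feats.add("tri:" + word[i:i + 3])
      # "Book-end": first character plus the last two characters.
      if len(word) >= 3:
          feats.add("be:" + word[0] + word[-2:])
      # One-hot word length, with a single bucket for anything longer than 10.
      feats.add("len:" + (str(len(word)) if len(word) <= 10 else ">10"))
      # Short words (up to and including 4 characters) are stored whole as features.
      if len(word) <= 4:
          feats.add("word:" + word)
      return feats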
To use the model, I just created a JSON file of the weights, and made a small JavaScript interpreter to use that JSON data.

The final model took the 117k words extracted from the CMUdict dataset (~1.34M bytes) down to a 473k-byte JSON file. That's good enough, I suppose.

The logistics of training the model were... funny. Due to the number of features involved (47.6k) and the number of entries (117k), the matrix (5.57G entries, or about 44.6G bytes of memory as doubles in dense form) can only be solved using sparse linear algebra techniques. I can't run it in Cygwin due to the choice of library, but it can be run in Anaconda on Windows.
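The solve itself is not much code. A sketch of its shape, assuming scipy's lsqr (not necessarily the exact library I used) and the feature extractor sketched above, with the words, ground-truth counts, and feature-to-column mapping built beforehand:

  import numpy as np
  from scipy.sparse import coo_matrix
  from scipy.sparse.linalg import lsqr

  # Assumed to exist already: `words` (the 117k words), `syllable_counts`
  # (ground truth from the CMUdict heuristic), and `feature_index` mapping
  # each of the 47.6k feature names to a column number.
  rows, cols, vals = [], [], []
  for r, word in enumerate(words):
      for feat in features(word):
          rows.append(r)
          cols.append(feature_index[feat])
          vals.append(1.0)
  A = coo_matrix((vals, (rows, cols)),
                 shape=(len(words), len(feature_index))).tocsr()
  y = np.array(syllable_counts, dtype=float)

  # Solve A w ~= y in the least squares sense without ever densifying A.
  w = lsqr(A, y)[0]
  w = np.round(w, 3)  # the stored weights keep 3 decimal places

  def predict_syllables(word):
      # Sum the weights of the word's features, round to the nearest whole number.
      return round(sum(w[feature_index[f]] for f in features(word) if f in feature_index))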

The only problem was that the solve was slow. Running it in a Xubuntu OS in VirtualBox was easily 3× faster.

Why? I have no idea. But it was done. And it only used 300M bytes of memory out of the 2G bytes available, too.

🤷‍♂️

So that solves the syllable count. Hurrah!

Well, that's not enough. If you've read the limitations of the measures, you'll realise that obscure words that happen to be short can break things. To deal with that, I decided to just add a list of discovered obscure words.

But it's quite nasty to keep a list of obscure words to compare against---there are far too many of them. It's much easier to create a list of common words, and then report a word as obscure if it does not appear in that list.

There are two issues to that approach:
  1. Where do I get such a word list?
  2. How do I store and search through such a word list in a way that is space- and time-efficient?
Let's start with the second question since the answer is simpler: Bloom filters.

Put simply, a Bloom filter is a probabilistic data structure based around hashing and a fixed-size bit array, where the false negative rate is zero. The false positive rate can be chosen up front, with the other parameters tuned to fit it---I think the linked article does a better job of explaining things.

So for my obscure words detector, I created a Bloom filter with a false positive probability of 0.00001, or 1 in 100k. The resultant bit array and the JavaScript support code implementing the data structure weighed in at around 217k bytes. All this without actually storing any of the words.
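As a rough sketch of what such a filter looks like (the sizing formulas are the standard ones; the hashing scheme here is just an illustration, not what the tool actually ships with):

  import hashlib
  import math

  class BloomFilter:
      def __init__(self, n_items, fp_rate):
          # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
          self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
          self.k = max(1, round(self.m / n_items * math.log(2)))
          self.bits = bytearray((self.m + 7) // 8)

      def _positions(self, word):
          # Derive k bit positions via double hashing from one SHA-256 digest.
          digest = hashlib.sha256(word.encode("utf-8")).digest()
          h1 = int.from_bytes(digest[:8], "big")
          h2 = int.from_bytes(digest[8:16], "big") | 1
          return ((h1 + i * h2) % self.m for i in range(self.k))

      def add(self, word):
          for pos in self._positions(word):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, word):
          return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))

  common = BloomFilter(n_items=26_800, fp_rate=0.00001)
  # for word in common_words: common.add(word)
  # `word in common` is never falsely negative for added words, and falsely
  # positive for roughly 1 in 100k absent words.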

So how many words went into this Bloom filter? About 26.8k words, constituting the top 95% of word uses.

Which brings us to the first question: where to get such a word list?

I only know of one source for this: Google NGrams. More specifically, the 1gram dataset. I had used the Google NGrams dataset a long time ago to build some tetragrams as part of an evaluator for my experiments on automated cryptogram solvers, but the newest dataset (from 2020) has a slightly different data format.

No matter. Getting the data is easy. Processing the downloaded files yielded 79.1M ``words''. Dropping all the specially marked words (hence the quotation marks) brought that down to 36.9M words. But that wasn't enough, since this was still case sensitive. One more round of processing yielded 29.0M case-insensitive words. Together they account for 1.99T mentions, from which I kept the most popular words (by mention count) that added up to 95% of the mentions (or about 1.89T). This led to the 26.8k words mentioned earlier, which sit at 216k bytes.
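Roughly, the processing looks like the sketch below. I am going from memory on the 2020 1gram layout (one token per line, followed by tab-separated year,match_count,volume_count triples), so treat the parsing as an assumption; the tagged tokens like ``the_DET'' are the specially marked words being dropped.

  from collections import Counter

  def load_counts(lines):
      counts = Counter()
      for line in lines:
          token, *year_entries = line.rstrip("\n").split("\t")
          if "_" in token:            # drop specially marked entries such as "the_DET"
              continue
          mentions = sum(int(entry.split(",")[1]) for entry in year_entries)
          counts[token.lower()] += mentions   # fold cases together
      return counts

  def top_share(counts, share=0.95):
      # Keep the most-mentioned words until they cover `share` of all mentions.
      target = share * sum(counts.values())
      kept, running = [], 0
      for word, c in counts.most_common():
          if running >= target:
              break
          kept.append(word)
          running += c
      return kept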

The more astute will realise that I should have used a different sum total because of the ``minimum of 40 mentions to appear in the individual word list'' rule. Yeah, but it's already done, and I am quite satisfied with the results, even though including the counts of those ultra-obscure words would increase the number of words that count as frequently used.

And so, we have the final result: the online Readability Test.

Phew, that's quite a big nerd out. The only other thing I want to add is the use of tqdm as a quick way to show progress in terms of total time, as well as the rate of processing in iterations per second, all while using unit scaling to show large numbers in nice SI-prefix notation.
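For anyone who hasn't used it, the incantation is tiny; something along these lines (with process() standing in for whatever work is being done):

  from tqdm import tqdm

  for line in tqdm(lines, unit="line", unit_scale=True):
      process(line)   # the bar shows elapsed/remaining time and the rate,
                      # with SI prefixes for large numbers (e.g. "1.23M line/s")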

Till the next update.
