Monday, September 05, 2022

Making the Common Word List Source Technically Correct at 95% by Mentions

Okay, I couldn't let it go. I mentioned that the true count was only one processing step away, so I just did it. The true total is 2.00T mentions after rounding (1997515570677, versus the original 1991804657887), a difference of 5.71G (5710912790). That extra mass directly added 2.23k words to the original 26.8k words in the top-95% list.
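For the curious, here is a minimal sketch of why a bigger total pulls in more words. It is not my actual pipeline: the function name `words_covering` and the toy counts are made up for illustration, and I'm assuming the cutoff is "keep the most-mentioned words until their cumulative mentions reach 95% of the total".

```python
def words_covering(counts, total_mentions, coverage=0.95):
    """Return the most-mentioned words, in descending order of mentions,
    up to the point where their cumulative mentions first reach
    coverage * total_mentions. (Hypothetical helper, for illustration.)"""
    target = coverage * total_mentions
    selected, running = [], 0
    for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= target:
            break
        selected.append(word)
        running += n
    return selected


# The two totals quoted in this post.
TRUE_TOTAL = 1_997_515_570_677
OLD_TOTAL = 1_991_804_657_887
difference = TRUE_TOTAL - OLD_TOTAL  # the 5.71G gap

# Toy counts (invented) whose true total is 100 mentions.
toy = {"the": 50, "of": 30, "and": 10, "to": 5, "in": 3, "a": 2}
with_true_total = words_covering(toy, 100)   # target is 95 mentions
with_under_total = words_covering(toy, 94)   # target is only 89.3 mentions
```

With the underestimated total, the 95% target is lower, so the loop stops one word earlier. The same thing happened here, just with 2.23k words instead of one.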

Will that make a noticeable difference? No clue, because I'm too lazy to diff the two top-95% word lists (by total mentions) and do the necessary qualitative analysis; it's just not worth it.

I've updated the bit-array for the Bloom filter, and so things are more correct than before.
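As a rough sketch of what "updating the bit-array" amounts to: the filter has to be rebuilt from the new word list, since the bits depend on exactly which words were inserted. The class below is an assumption-laden illustration, not the code actually in use; the SHA-256 double-hashing scheme, the `BloomFilter` name, and the sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a byte-backed bit-array (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions from two 64-bit halves of one digest
        # (double hashing); this scheme is an assumption of the sketch.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


# Rebuild from the (here, tiny and invented) updated word list.
updated_words = ["the", "of", "and", "to", "in", "a"]
bloom = BloomFilter(num_bits=1024, num_hashes=4)
for w in updated_words:
    bloom.add(w)
```

Every word that was added is guaranteed to test as present; words that were never added may occasionally test as present too, which is the usual Bloom filter trade-off.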

I've also added a couple of warnings for when the counting statistics yield something that may make all the estimated readability metrics even more suspect. In addition, I chased down the actual paper regarding the automated readability test, and realised that the Wikipedia article misrepresented it (the horrors!).

The actual links and what-not are updated on the readability test page itself.

Okay, that's about it. I've deliberately set this publishing time to be the next day to not overshadow the last entry's triumph.

Till the next update.
