Friday, March 19, 2021

Python3 Script Tinkering Again

It's a Friday!

I have been spending most of the day attempting to power through OpenStax College: Organisational Behaviour---I am currently at 356/704 pages. It will be a while yet. I'm not sure if this can be considered a type of procrastination, but I did take a short break to read On the Decay of the Art of Lying by Mark Twain as well.

Some time was also spent watching more Linus Tech Tips videos.

But funnily enough, I actually ended up spending quite a bit of the afternoon puttering about with my Python3-based HTML/JavaScript minifier. I was getting a little bored with waiting out the 22 s it takes to run through all the files sequentially to minify and/or compress them, and wanted to use multiprocessing to handle it, just like what I did with the page counting script. But before doing that, I wanted to update the sources yet again, this time to trigger as few pylint complaints as possible, like the recent exercise I did with my other Python3 scripts. I got most of the warnings out of the way, and accidentally fixed a bug that I had been observing forever.
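
The shape of that change is the usual one. A minimal sketch, with minify_file and the site/ root as stand-ins for the real thing rather than the script's actual names:

from multiprocessing import Pool
from pathlib import Path

def minify_file(path: Path) -> None:
    """Stand-in for the real per-file minifier/compressor."""
    text = path.read_text(encoding="utf-8")
    path.write_text(text, encoding="utf-8")  # the real transform goes here

def main() -> None:
    files = sorted(Path("site").rglob("*.html"))
    # Each file is independent, so a process pool should cut the 22 s
    # sequential pass down by roughly the number of cores.
    with Pool() as pool:
        pool.map(minify_file, files)

if __name__ == "__main__":
    main()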

The bug was that for some reason, the minified HTML5 files still had extra spaces here and there between an end-tag and a start-tag. For a long time, I had been looking into it on and off. And today, while going through the code-base again, I realised the problem: I was tagging the #text node equivalent data as Token.DATA instead of Token.TEXT. The significance of this difference lies in how the subsequent re-generator worked with the tokenised output---Token.TEXT nodes would go through a more aggressive white-space eater/coalescer, while Token.DATA nodes would be left very much alone.
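
In miniature, the distinction looks something like this (the Token names are real; the tuple shape and the regex are simplified for illustration):

import enum
import re

class Token(enum.Enum):
    TEXT = enum.auto()  # ordinary #text nodes: fair game for the eater
    DATA = enum.auto()  # raw data (think <pre> or <script>): hands off

def regenerate(tokens):
    """Re-emit tokenised content; only TEXT goes through the coalescer."""
    out = []
    for kind, value in tokens:
        if kind is Token.TEXT:
            value = re.sub(r"\s+", " ", value)  # aggressive white-space coalescing
        out.append(value)
    return "".join(out)

# Mis-tagging inter-tag white-space as DATA lets it survive untouched:
print(repr(regenerate([(Token.DATA, "   ")])))  # '   ' (the bug)
print(repr(regenerate([(Token.TEXT, "   ")])))  # ' '   (the fix)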

Such a simple oversight that got fixed because I was procrastinating on my sabbatical. What a joke.

Anyway, after adjusting things according to what pylint complained about and fixing the bug, I sort of stopped right there---the next step involved some re-factoring and other adjustments, but I was too lazy to work on them since it was ``obvious'' what needed to be done; it was just a little tedious and I wasn't in the mood.

------

Funnily enough, instead of stopping the programming for the day, I did even more programming, this time enhancing the same page counting script I was referring to. I added another field to the current quartet of number of pages, average bytes per page, total bytes for file, and file name. This field is ``modified time of the file in days relative to when the script was executed''.

The purpose of this field was to give me more information on when a particular e-book was assembled---the earlier it was assembled, the larger the magnitude or, to put it more esoterically, the ``smaller the negative number''. So, an e-book that was last modified about 14.7 years ago would show up as −5.4k days.
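
The computation itself is tiny. A sketch, with time.time() as the anchor and the function name made up for illustration:

import os
import time

ANCHOR = time.time()  # when the script starts; everything is relative to this

def mtime_in_days(path: str) -> float:
    """Modification time of path in days relative to the anchor.

    Files modified before the run come out negative (about -5.4k days
    for something last touched 14.7 years ago); anything written during
    the run comes out as a tiny positive number.
    """
    return (os.path.getmtime(path) - ANCHOR) / 86400.0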

As I was tinkering with that, I started to think about all the stupid hacks that I was doing to get mobi-python to barely work. The decoding process was clearly problematic for the .mobi files that I had, because the page-count estimate was about twice as large as it should be, using the output of Sumatra PDF as a gauge, not to mention all the weird hackery that I had to do. While I was writing that blog entry, I soon found that there was another mobi decoding library in Python3---it was aptly named mobi. The reason I did not see this earlier, when I was writing the original handler, was the search engine I used---DuckDuckGo yielded mobi-python as the first entry, while Google yielded mobi as the first entry.
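
For reference, mobi's extraction interface is a single call (some-ebook.mobi below is a placeholder):

import shutil
import mobi  # pip install mobi

# mobi.extract unpacks the book into a fresh temporary directory and
# returns (tempdir, path-to-the-extracted-file).
tempdir, extracted = mobi.extract("some-ebook.mobi")
try:
    with open(extracted, "rb") as handle:
        data = handle.read()  # e.g. hand this to the page-count estimator
finally:
    shutil.rmtree(tempdir)  # clean up the not-so-ideal temporary directory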

At the time of that blog entry, I didn't have the time/wherewithal to experiment and switch over. But today, I decided to give it a go. It generated a temporary directory as part of its decoding process, which was not ideal, but it did get the job done well. And so, I updated my little script to use this new library instead of the really alpha (and really broken) mobi-python library. To avoid causing myself silly problems, I will first bring up the original computation for reference:
Here is the updated version with the replacement mobi module:
So the numbers for the e-books that I have read and can archive are 151.2k pages with 29.3k bytes per page on average. You can also see the new data column, whose units are in days. These numbers are positive because the final computation occurred after the start of the script (the anchor point), and can be seen as a small easter egg of sorts to crudely time the run-time of the script in units of days. Notice that the run-times were between 99.4µ days and 388.7µ days---I will leave you, the reader, to work out how many seconds that is. I could have left it in seconds, but staring at units of millions of seconds was hard to fathom, and so this compromise was created---the output was designed so that sort -k4h would do the right thing.
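
For the record, rendering the day counts with SI-style suffixes is what lets sort -k4h group things sensibly. A sketch of one way to do it, with illustrative thresholds and rounding rather than the script's actual formatting:

def si_days(days: float) -> str:
    """Render a day count compactly, e.g. -5.4k or 99.4µ."""
    mag = abs(days)
    if mag >= 1000.0:
        return f"{days / 1000.0:.1f}k"
    if mag < 0.001:
        return f"{days * 1e6:.1f}µ"  # micro-days, as in the run-time figures
    return f"{days:.1f}"

print(si_days(-5369.0))    # -5.4k
print(si_days(0.0000994))  # 99.4µ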

Also, take my word that the number of pages that I have left to read as of now is 225.2k pages with an average of about 13.9k bytes per page. I think it is much more representative now.

Anyway, that's all I have for today. Till the next update, I suppose.
