Sunday, February 28, 2021

Answering Two sTuPiD qUeStIoNs

Unfortunately, I didn't die. I was just otherwise preoccupied.

The last couple of days or so were socially more interesting than my usual routine. I spent the evening of 元宵 (the Lantern Festival) having dinner at my other sister's parents' place, and then heading off to her own place for a housewarming. The last time I wrote about being at my other sister's place, I was helping with the unpacking. It is rather confusing to look back and realise that a full month has passed since that day. The place is definitely more ready to be lived in now that the carpentry work and other finishing touches have been completed. It is a cosy place, and I am very glad that my other sister and my other brother-in-law are finally together at their own place; their story has been eleven years in the making.

Yesterday itself was also socially more interesting. I had gone to church for service, as always, but after that the care group met up for dinner at Khansama Tandoori Village. While researching the place for this entry, I realised that there are two such restaurants: the main page I linked to is the one purporting to be at 166 Serangoon Road, Singapore 218050, but the one we went to was at 87 Science Park Dr, Singapore 118260 instead. I think the Serangoon one is the main branch, which makes sense because it is right smack in the middle of Little India, while the one we went to is a side branch. Anyway, the food was delicious, and the company was good.

I was subtly reminded of the time I ate at one of the North Indian restaurants out in Little India with the Muramatsu Flutes technicians, back when Chara was in charge of taking them out for dinner during the most recent flute festival organised by MusicGear Singapore and The Band World.

It was a happy memory, bittersweet.

------

I spent the past couple of days working on a small Python3 script that answers a stupid question: just how many pages of reading do I have left out of my stash of e-books?

My first answer counted only PDF pages: 192.2k pages, at an average size of 15.3k bytes per page.
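For the curious, the core of that first pass is nothing fancy. Here is a minimal sketch of the idea, using the PyPDF2 API of the day; the 'ebooks' root directory and the error handling are placeholders rather than my actual script:

    import os
    from PyPDF2 import PdfFileReader
    from PyPDF2.utils import PdfReadError

    total_pages = 0
    total_bytes = 0
    for dirpath, _, filenames in os.walk('ebooks'):  # placeholder root
        for name in filenames:
            if not name.lower().endswith('.pdf'):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    total_pages += PdfFileReader(f, strict=False).getNumPages()
                total_bytes += os.path.getsize(path)
            except PdfReadError:
                pass  # some PDFs refuse to parse; more on those later

    print(f'{total_pages} pages, {total_bytes / max(total_pages, 1):.1f} bytes per page')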

That didn't feel right, because I hadn't processed the .mobi files. Fixing that made me spend upwards of a day researching and debugging the only mobi Python library I could find that didn't generate too much junk (mobi-python). Unfortunately, the code is alpha quality, and it really showed: the hand-rolled LZ77 decompressor mixed up byte and character semantics, which breaks completely under Python3, among other things. I shimmed and patched as much as I could, and somehow managed to get it to work.
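That specific breakage is a classic Python2-to-Python3 trap: indexing a str in Python2 yields a one-character string that gets fed to ord(), while indexing bytes in Python3 already yields an int. The kind of shim needed looks roughly like this (a sketch, not the library's actual code):

    def byte_at(data, i):
        """Return the byte value at index i, whether data is bytes
        (Python3) or a str of raw characters (Python2-style)."""
        b = data[i]
        return b if isinstance(b, int) else ord(b)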

The end result is this:
It's 227.8k pages to read, at an average size of 13.3k bytes per page.

The first version was also a single-process, single-threaded, sequential file processor. It took about 4m 42s to count those pages, including the new mobi file processing.

So, I reworked it to use multiprocessing instead. Using 5 out of the 6 cores yielded a run-time of 1m 18s to count the same pages.
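The rework itself is the textbook multiprocessing.Pool pattern. A sketch, assuming a count_pages() helper that returns a (pages, size_in_bytes) tuple for a single e-book:

    from multiprocessing import Pool

    def count_all(paths, workers=5):  # 5 of the 6 cores, as above
        # count_pages() is the assumed per-file counter; it must live at
        # module level so the worker processes can pickle it.
        with Pool(processes=workers) as pool:
            results = pool.map(count_pages, paths)
        total_pages = sum(p for p, _ in results)
        total_bytes = sum(b for _, b in results)
        return total_pages, total_bytes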

That would have been the job finished, had I not asked the next stupid question: just how many pages of reading had I already done?

Now, some qualifiers. Not everything from my read list exists as an e-book, so whatever estimate I get is, at best, a lower bound. The other thing is that the already-read e-books are packed into archives of about 1G bytes each, just for ease of making back-ups.

So to answer my second stupid question, I needed to process lists of files from within the RAR archives. I used rarfile to handle this, and did a test run on some comic book archive files to check the tool chain, adding yet more capability to the script. It worked well, and I started running it on my actual archive files.
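rarfile's interface is pleasantly thin; listing the e-books inside an archive takes about this much code (the archive name here is a placeholder):

    import rarfile

    rf = rarfile.RarFile('read-books-001.rar')  # placeholder name
    ebooks = [m for m in rf.namelist()
              if m.lower().endswith(('.pdf', '.mobi', '.epub'))]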

The bloody thing crashed. Something about requesting a read of 65k bytes but receiving 0 in return.

I was already sick of debugging and shimming from getting mobi-python to work. I double-checked the RAR archive file with the provided tools, and it checked out; there was just something stupidly wrong with the low-level methods used by RarFile.extractall().

Angrily, I just wrote a direct call to the unrar tool that rarfile was driving in the first place, extracting the contents into a temporary directory so that the original refactored directory-walking, page-counting script could work on them.
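In sketch form (the helper name is mine):

    import os
    import subprocess
    import tempfile

    def extract_with_unrar(archive):
        """Extract archive into a fresh temporary directory; return its path."""
        tmpdir = tempfile.mkdtemp(prefix='pagecount-')
        # 'x' extracts with full paths; the trailing separator tells unrar
        # that the final argument is the destination directory.
        subprocess.run(['unrar', 'x', archive, tmpdir + os.sep], check=True)
        return tmpdir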

There were some additional shenanigans needed to transform the temporary file location (in Cygwin-land) into a path that Windows could understand.
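Cygwin ships cygpath for exactly this purpose; a small wrapper along these lines does the job:

    import subprocess

    def to_windows_path(posix_path):
        """Translate a Cygwin path like /tmp/pagecount-abc into a Windows one."""
        result = subprocess.run(['cygpath', '-w', posix_path],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()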

But to cut a long story short, the answer is 151.4k pages, at 28.6k bytes per page on average.
And so, to answer two stupid questions, I now have a weird Python3 script that can count pages. Hurrah...

I just want to point out that it supports both mobi and epub files, and that the page counts for these reflowable documents are themselves estimates. I should also add that for some of the PDFs that could not be processed by the PyPDF2 module, I ended up using a 1-nearest-neighbour estimator to impute the unknown number of pages before generating the final statistics. Yay...
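The imputation is about as simple as 1-nearest-neighbour gets. Sketched here assuming file size is the matching feature, a choice made purely for the illustration:

    def impute_pages(unreadable_size, parsed):
        """parsed: (size_in_bytes, page_count) pairs for the PDFs that
        PyPDF2 could read. Borrow the page count of the nearest-sized one."""
        _, nearest_pages = min(parsed, key=lambda sp: abs(sp[0] - unreadable_size))
        return nearest_pages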

------

I have slowly been making my way through Deep Learning, reaching 489/802 pages as of now. What a slog it has been, and what a slog it is going to continue to be. I think instead of heading immediately into OpenStax College: Organisational Behaviour as originally planned, I might detour into some fiction for a bit. But we'll see.

Till the next update, I suppose.
