Tuesday, February 23, 2021

Final Solution to Eileen-II's Temperature Issues?

Ooo, a somewhat early entry!

I got quite a few things done today, so let me begin.

The first thing that I did was complete The Outer Worlds, completing both the ``good ending'' (The Hope colonists revived) and the ``bad ending'' (Capitalism forever!). That happened at around 0200hrs this morning, and thus counts as a ``today'' thing, even though it was more of yesterday on extended time.

I spent some time in the morning replacing the elastic strap in my Roi backpack. This elastic strap holds the concert flute (Aurelia, in this case) in position. I have had this backpack since 2015, and after carrying it weekly with a nearly full load of flute/piccolo, music, and flute/piccolo stands, it has slowly fallen apart. The top handle's pleather has flaked off, so it now has a corduroy handle cover (made from an old pant leg) sewn in place by Chara; one of the two zipper sliders no longer aligns the zipper and is kept permanently at one side; and the latest, of course, was the loss of elasticity in the strap that holds the flute case down. I carefully unpicked it and sewed in a new one of 1 in thickness, and it is now as good as new.

I just can't wait for us to get back into rehearsals at the Chinese orchestra. I have a new double-width collapsible music stand that I can't wait to make use of---I only need to either buy a carrying case for it, or make one. I'll probably make one, but to do so would mean that I need to source for cloth (probably canvas), and set up the sewing machine. There is also that possibility that I end up hand-sewing it; it's not something new. I have a tendency of doing that most of the time anyway, except for that one time that I was patching back together my pen case for my fountain pens.

------

The last time I talked about Eileen-II, I mentioned turning off hyper-threading to reduce run-time temperatures. There was only one small problem with that entry that I want to correct: the temperature did not drop to around 80°C---it was more like 89°C to 92°C, as compared to 95°C to 100°C when hyper-threading was turned on.

The other thing that I didn't mention was that the idle temperature was about 66°C.

I don't like these numbers at all. It's partly because such high numbers would imply a shortening of the lifespan of the components (in this case, the CPU), but more importantly such high numbers meant that when I was using the keyboard of Eileen-II herself I would end up actually burning myself slightly should I accidentally touch the interstitial parts of the frame instead of the plastic key caps of the keyboard.

I knew of a solution to this, but I have not had the wherewithal to do it, because of how time consuming it was---it was the pseudo-inverse of overclocking, a technique called undervolting using a tool like ThrottleStop. CPUs, like all electronic components, require a certain potential difference across them to provide enough electrical energy to do the computations that they need to do. Loosely speaking, for ``greater performance'', the CPU should get as much energy as it can take, which is indirectly controlled by regulating the voltage of the supply going into it.

But the catch is that ``greater performance'' is not the same as ``greater apparent performance'', the reason being that if the CPU gets too hot (for instance, 100°C), the CPU is forced to downclock itself to avoid burning itself from the excess heat. Following the math from this subsection on CPU power, this excess energy is proportional to the square of the voltage supplied.
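To put a rough number on that square law: dynamic CPU power is commonly approximated as P ≈ C·V²·f, so at a fixed clock the ratio of power after and before a voltage change is simply (V_new / V_old)². The sketch below is illustrative only. The 1.100 V nominal core voltage is an assumed figure and not a reading from Eileen-II; the −90.8 mV offset is the one I settle on further down.

```python
def relative_dynamic_power(nominal_v: float, offset_mv: float) -> float:
    """Ratio of dynamic power after/before a voltage offset, at a fixed clock.

    Uses the common approximation P ~ C * V**2 * f, so the capacitance C
    and the frequency f cancel out and only the voltages remain.
    """
    new_v = nominal_v + offset_mv / 1000.0  # offset is given in millivolts
    return (new_v / nominal_v) ** 2


# ASSUMPTION: 1.100 V nominal is an illustrative value, not measured here.
ratio = relative_dynamic_power(nominal_v=1.100, offset_mv=-90.8)
print(f"dynamic power is roughly {ratio:.0%} of the original")
```

Leakage power and the exact voltage/frequency curve are ignored here, so this should be read as a back-of-envelope estimate and nothing more.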

Every CPU has its own sweet spot voltage where there is a minimal amount of energy that is supplied that can give the CPU its maximal performance without actually generating excessive thermal energy that forces a temperature-based shut down. For reasons of sanity, the factory defaults for the make of Eileen-II are set to be ``performance-oriented'', i.e. they do not attempt to find such a ``best'' voltage. This ensures that the machine will always attempt to run at its ``best performance'' regardless of the situation.

So yeah, the first thing I did then was to find out what that lowest voltage is. To help me with it, I turned off the Turbo Boost capability from within ThrottleStop, and carefully experimented on the voltage offset for the CPU Core and CPU Cache. This was the most time consuming part because if I ever went too low, the CPU would freeze up, and I would have to do a hard reset. Granted, with Eileen-II's specs, a hard reboot doesn't take too long, but it still takes time to pull off. The final value that I ended up with was −90.8 mV---starting at the default −100.0 mV voltage offset was just too much and made Eileen-II crash. This led to an idle temperature of around 50°C.

I could try to tweak down to the 1.0 mV precision, but at this point, I don't think it is worth it.

The next thing that I tried to tackle was the way Turbo Boost worked. Disabling Turbo Boost was a good way to get low temperatures, but I didn't just want low temperatures---I wanted Eileen-II to actually run more powerfully if there's a need to, but without the whole hot-enough-to-burn-my-finger-tips thing.

Turbo Boost works by applying a multiplier to the basic clock speed. For Eileen-II, the basic clock speed is 99.768 MHz (as reported by ThrottleStop). At the published ``base frequency'' of 2.60 GHz, we are looking at a multiplier of around 26×. The maximum reported frequency that the CPU can support is 5.1 GHz, which works out to around 50--51×. Thus, any multiplier larger than 26 and less than 51 is considered a potential ``Turbo Boost ratio''.
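The arithmetic is simple enough to sketch. The function names here are my own and have nothing to do with ThrottleStop; only the 99.768 MHz base clock and the frequencies come from this entry.

```python
BCLK_MHZ = 99.768  # basic clock speed as reported by ThrottleStop


def ratio_for(freq_mhz: float) -> float:
    """Multiplier needed to reach a given core frequency."""
    return freq_mhz / BCLK_MHZ


def freq_for(ratio: int) -> float:
    """Core frequency (in MHz) produced by a given multiplier."""
    return ratio * BCLK_MHZ


print(round(ratio_for(2600), 2))  # published base frequency -> 26.06
print(round(ratio_for(5100), 2))  # maximum reported frequency -> 51.12
print(round(freq_for(30), 2))     # a 30x ratio in MHz
print(round(freq_for(40), 2))     # a 40x ratio in MHz
```

The last two lines use the 30× and 40× ratios that I end up with later in this entry.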

At this point, I think a screenshot of the Fully Integrated Voltage Regulator (FIVR) screen will make things more enlightening.
Observe the lower left corner frame under ``Turbo Ratio Limits''. The Max value is the maximum ratio that the CPU is reported to support---going any higher than that counts as overclocking (i.e. out of specifications). The table below lists the multipliers to be used when Turbo Boost is triggered, sorted by the number of active Cores.

My understanding is that active Cores are those that are assigned the particular process/thread to operate on [at 100% capacity].

Now I just want to point out that I live in Singapore, and naturally, am operating Eileen-II in Singapore. In case you were wondering, Singapore has a climate that is simply terrible for high performance electronics if one does not have/use air-conditioning.

So those default numbers in Turbo Ratio Limits that start at 50× for one active Core to 43× for 6 active Cores? Yeah, they are waaaaaaay too optimistic. For reference, the highest I ever saw the clock speed go was about 4.2 GHz, and even then, it immediately faced thermal throttling.

I went ahead to get my favourite prime number tester to generate the types of high CPU loads that I needed to benchmark the relationship between the multiplier and resultant temperature. I first did a full 6-core ``Small FFTs'' torture test with the undervolt conditions and Turbo Boost turned off. According to the options, the ``Small FFTs'' torture test tests the L1/L2/L3 caches, generating maximum power/heat/CPU stress, which was exactly what I wanted. The idea for this initial test is to spot the empirical distribution of which Cores have a tendency to run hotter than others. From this empirical distribution, I would then pick the hottest n cores when I am trying to tune the multiplier for n active Cores.

My goal was to ensure that no matter how many Cores are active, the maximum temperature of all the Turbo Boosted Cores does not exceed 85°C. There is no scientific reason for choosing 85°C over other temperatures, but there are two good ones:
  1. 85°C is roughly the top end of the temperature where the frame is merely warm to the touch;
  2. Any lower temperature than 85°C would lead to a multiplier that is basically useless, in which case I might as well just turn the Turbo Boost off.
Empirically, the Cores from hottest to coolest on Eileen-II are Core 5, Core 3, Core 2, Core 4, Core 1, and Core 0.

One last piece of the puzzle was to do the actual assignment of Cores for the job. The way I did this was to pull up ``Task Manager'' with Ctrl-Shift-Esc, get it to show more details (this is the default for me), go to the ``Details'' tab, search for prime95.exe under the process name, right-click on it to pull up the context menu, then select ``Affinity''. And from there, I would choose which Cores (or CPUs in this case) to assign prime95.exe to.
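For the curious, the ``hottest n Cores'' selection and the affinity it implies can be sketched as below. I did all of this by hand in Task Manager, so the code is purely illustrative: the helper names are made up, and it assumes that with hyper-threading off each Core maps to exactly one logical CPU with the same index.

```python
from typing import List

# Hottest-to-coolest ordering observed in this entry.
# ASSUMPTION: hyper-threading is off, so Core i is logical CPU i.
CORES_HOTTEST_FIRST = [5, 3, 2, 4, 1, 0]


def hottest_cores(n: int) -> List[int]:
    """Return the n hottest Cores, used when tuning for n active Cores."""
    if not 1 <= n <= len(CORES_HOTTEST_FIRST):
        raise ValueError("n must be between 1 and the number of Cores")
    return CORES_HOTTEST_FIRST[:n]


def affinity_mask(cores: List[int]) -> int:
    """Build the affinity bitmask: bit i is set when CPU i is allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


for n in range(1, 7):
    cores = hottest_cores(n)
    print(n, cores, hex(affinity_mask(cores)))
```

Ticking the boxes in the ``Affinity'' dialog amounts to setting the same bits. If one wanted to script this instead of clicking, a library such as psutil offers `Process.cpu_affinity()`, which takes a list of CPU indices, though I have not tried that route here.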

Then it was just a case of trial and error on the multipliers with careful execution of the ``Small FFTs'' torture test with the associated number of workers.
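That trial-and-error loop can be outlined roughly as follows. This is only a sketch of the procedure and not something I actually ran: `measure_peak_temp` is a hypothetical stand-in for the manual step of setting the ratio in ThrottleStop, running the ``Small FFTs'' test on the chosen Cores, and reading off the peak temperature. Neither ThrottleStop nor Prime95 is being scripted here.

```python
from typing import Callable

TEMP_LIMIT_C = 85.0  # target ceiling for the hottest Core's spike
BASE_RATIO = 26      # roughly the non-Turbo ratio (2.60 GHz base frequency)


def tune_ratio(active_cores: int,
               start_ratio: int,
               measure_peak_temp: Callable[[int, int], float],
               limit_c: float = TEMP_LIMIT_C,
               floor_ratio: int = BASE_RATIO) -> int:
    """Step the multiplier down until the peak temperature fits the limit.

    measure_peak_temp(active_cores, ratio) is a HYPOTHETICAL callback that
    returns the highest Core temperature seen during one torture-test run.
    """
    ratio = start_ratio
    while ratio > floor_ratio:
        if measure_peak_temp(active_cores, ratio) <= limit_c:
            return ratio  # highest ratio tried that stays within the limit
        ratio -= 1        # too hot: try the next lower multiplier
    return floor_ratio    # nothing above base fits; Turbo gains nothing
```

Starting from the default ratio for each Core count and walking downwards means the first ratio that passes is also the highest one that passes, which is all I was after.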

The multipliers that you see in the FIVR screenshot above were obtained through this process. Notice that at 6 active Cores, the multiplier is 30× (corresponding to a clock frequency of not more than 2993.04 MHz), and at a single active Core, the multiplier is 40× (clock frequency not more than 3990.72 MHz). These numbers are definitely much lower than the very optimistic defaults.

Notice that I am being very conservative here---I chose to ensure that the hottest Core spike does not exceed 85°C; this does not mean that the hottest Core will operate at 85°C. To illustrate this point, I have taken yet another screenshot, this time with a 6-core ``Small FFTs'' torture test with all the settings set up in the FIVR screenshot.
As you can tell, it runs much cooler now, with the lowest spike temperature being 78°C, and the highest spike temperature being 85°C. Notice also how the empirical distribution works for both the instantaneous measurement and the Max measurement.

That was not the end though. I re-ran the ``Small FFTs'' torture tests from 6-core down to 1-core, but this time without forcing the processor affinity. The temperature profiles are definitely much cooler, since having fewer active Cores often meant that the workload was being shuffled randomly among all the available Cores, leading to less sustained work.

Hopefully I don't have to do anything more drastic to keep the temperatures at this comfortable level.

------

Technical nerdry aside, I also started reading a short anthology of poems/prose/plays by Robert Wilks. Rainbow: A Collection of Stories, Poems and Plays (Volume 2) is funny because it was actually my secondary two English Literature text book from back in the day. It is one of a few books that I kept from my secondary school days, mostly because past-me sort of knew that there would be one day where future-me would be interested in reading the wonderful writings from then. Thank you, past-me!

I have also watched Dick Tracy, a 1990 film about the eponymous comic-book detective. It's funny---I remembered that I loved the concept of Dick Tracy (he's a detective, and when I was young, I loved the idea of being a detective, and also a spy), but I never seemed to actually read/watch anything about the character. And so, I rectified that today. To my surprise, Madonna was in it as well.

Woah.

The film itself was fun, but what really drew me in was how normalised all the comic book weirdness was---this is a testament to how well the film was directed/produced. Big coats in all the colours except black, canary yellow hat and coat while trying to be sneaky and somehow not getting seen, colourful back drops despite the 1930s ``hard-boiled'' aesthetic.

And Madonna sings like the nightingale she is. Ho yez.

Anyway, that's all I have for today. It's a long entry, no thanks to nerding out. Till the next update, I suppose.
