Wednesday, July 08, 2015

Hub, Spokes, SLAs

Unless you are not living in SIN city, by now you ought to have heard of the massive MRT failure along the North-South and East-West lines. More interestingly, you ought to have heard of the chaos surrounding the eventual dispersal of all the commuters who were stranded by the failed mass rapid transit.

This post isn't about who's to blame in that incident, nor even about speculations relating to that incident. I will use this incident as an analogy for system architecture design.

There are many different strategies in architecting a system, most of which depend on the nature of the system being developed. For example, if security is of paramount importance, then a hub-spoke strategy is often used, with the server (or more often, a cluster of servers) sitting at the hub, and everything else (the clients) connecting as spokes to the hub. If the ability to ``self-heal'' is more important, then some kind of peer-to-peer based strategy is used.

There are of course more strategies to be used, and I will not go through them. I just want to point out something interesting. The public infrastructure in SIN city is designed around the concept of the hub-spoke strategy, where for the most part people are expected to make use of the mass rapid transit to cross the large distances (it's SIN city, so 40km or 25mi is considered ``far'') before switching to a bus or two for the proverbial last mile. This is the reason why so many new MRT lines have been built over the past fifteen years, political conspiracies aside.

There is a catch that is amply demonstrated through last evening's disruption. If you are using a hub-spoke model for architecting a system, the hub cannot afford to fail at all. Maybe I'm not saying this loud enough: THE HUB CANNOT AFFORD TO FAIL AT ALL. It is the innate risk behind this particular means of architecting. Since all traffic passes through the hub at some point, any downtime of the hub means massive damage to the system at large. As shown in last evening's debacle, the failure of the two biggest routes in the network was enough to cause a spillover of commuters that lasted too long for comfort.
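To make the point concrete, here is a minimal sketch that compares a hub-spoke layout with a simple peer-to-peer ring when one node fails. Everything in it is an assumption for illustration: the helper names `star`, `ring` and `reachable`, the node labels, and the choice of a ring as the peer-to-peer stand-in. It is a toy model, not a description of any real network.

```python
from collections import deque


def star(n):
    """Hypothetical hub-spoke layout: one 'hub' linked to n spokes."""
    adj = {"hub": set()}
    for i in range(n):
        spoke = f"s{i}"
        adj[spoke] = {"hub"}
        adj["hub"].add(spoke)
    return adj


def ring(n):
    """Hypothetical peer-to-peer layout: each peer linked to two neighbours."""
    return {
        f"p{i}": {f"p{(i - 1) % n}", f"p{(i + 1) % n}"}
        for i in range(n)
    }


def reachable(adj, start, failed=frozenset()):
    """Breadth-first search from start, never stepping onto a failed node."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


hub_spoke = star(5)
peers = ring(5)

# Healthy hub: a spoke can reach all 6 nodes (itself, the hub, 4 other spokes).
print(len(reachable(hub_spoke, "s0")))
# Hub down: the same spoke can reach only itself.
print(len(reachable(hub_spoke, "s0", failed={"hub"})))
# One peer down in the ring: the remaining 4 peers still reach each other.
print(len(reachable(peers, "p0", failed={"p2"})))
```

In this toy model, losing the hub strands every spoke, while losing a single peer in the ring leaves the rest connected. That is the trade-off in miniature.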

Thus, effective use of a hub-spoke model for system architecting implies having in place a solid proactive maintenance plan for the hub, to avoid any downtime whatsoever. This is where we find those triple-9 (99.9% uptime, < 8.76hr downtime/year) and quad-9 (99.99% uptime, < 52.6min downtime/year) service level agreements (SLAs).
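For reference, the arithmetic behind those ``nines'' is easy to check. The sketch below assumes a 365-day year (8760 hours) and measures downtime against the whole calendar year; the function name `allowed_downtime_hours` is made up for illustration. Real SLAs often measure per month and exclude scheduled maintenance, so treat this as a rough guide only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(nines):
    """Yearly downtime budget for an availability of `nines` nines.

    3 nines means 99.9% uptime, i.e. an unavailability of 10**-3.
    """
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * unavailability


for n in (3, 4):
    hours = allowed_downtime_hours(n)
    print(f"{n} nines: {hours:.2f} hr/year ({hours * 60:.1f} min/year)")
```

Under these assumptions, three nines allows roughly 8.76 hours of downtime a year, and four nines roughly 52.6 minutes.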

Now that has got me wondering what the SLA is for the running of the MRT lines in SIN city...
