No Downtime - Again!
Every day I think about downtime, wait for it to pass, and record the result.
The last time I wrote about this (did you read that dev blog? It's really good, I promise), Tranquility's auto-reboot on weekends took approximately 4 minutes and 20-40 seconds, just enough for a quick cup of tea. Today's auto-reboot, a year and a half later, took 3 minutes and 34 seconds - just enough for a quicker cup of tea (I measured it while writing this dev blog). Given the improvements we have made since 2019, an auto-reboot downtime of 3 minutes and 30-40 seconds is pretty normal these days.
I could focus on this improvement of 50-60 seconds, a 22.5% reduction between 3 December 2019 and 3 July 2021, and use this super-scientific graph to predict the end of downtime on 19 December 2026, but the reality is more complicated than that.
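For the curious, the "super-scientific" prediction is just a straight line extended until 100% of downtime is gone. Here is a minimal sketch of that arithmetic, using only the dates and the 22.5% figure quoted above; the function name is made up for illustration. It lands in December 2026, within a few days of the date quoted above, and the exact day depends on which downtime measurements feed the percentage.

```python
from datetime import date, timedelta

# Figures quoted in the text.
FIRST_MEASUREMENT = date(2019, 12, 3)
SECOND_MEASUREMENT = date(2021, 7, 3)
IMPROVEMENT_FRACTION = 0.225  # downtime got 22.5% shorter between the two dates


def extrapolate_zero_downtime(start, end, fraction_gained):
    """Naively extend a straight line until the whole downtime is gone.

    Hypothetical helper: it assumes the improvement continues at the same
    constant rate forever, which the text itself says is not realistic.
    """
    elapsed_days = (end - start).days
    days_to_zero = elapsed_days / fraction_gained
    return start + timedelta(days=round(days_to_zero))


predicted = extrapolate_zero_downtime(
    FIRST_MEASUREMENT, SECOND_MEASUREMENT, IMPROVEMENT_FRACTION
)
print(predicted)
```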
There is a (soft) lower bound of approximately 3 minutes, because the three activities during downtime - shutdown, database jobs, startup - last approximately 1 minute each. Going below that requires fundamental changes, and the most fundamental one is still to not have any downtime at all. Downtime will not become much shorter than 160-200 seconds; instead there must first be fewer downtimes and then none at all. Nevertheless, I wanted to start this blog with a concrete example of the improvements made in downtime reduction since last time. And now another no-downtime experiment is planned for 9 September.
The purpose of this second no-downtime experiment is at least four-fold:
- Verify, in the live production environment, the fixes made for the issues discovered in the previous experiment
- Verify that no other code/features have regressed since last time and in general look for further issues
- Observe memory usage
- Verify that our technology platform (which you will hear more about later) is not making any downtime assumptions
So what did we discover last time, I hear you ask?
First and foremost, we discovered a reliance on downtime as the event that marks the beginning of a daily cycle, and a reliance on a daily startup. Symptoms included structures not finishing 24+ hour timers and corporations not joining Faction Warfare. We fixed all the issues that we found, and those you reported to us. Now we want to verify those fixes further (of course they have been tested, but our test environments don't have Tranquility's scale) and look for more such issues.
We also observed time desynchronization (which we fixed), and significant memory usage (which we improved somewhat).
The time desynchronization was a known issue, but last time we wanted to see whether players would notice it by the end of day #2. The target for time desynchronization is a maximum of ±0.5 seconds. With newer hardware, however, we had been observing a desynchronization of 2.25 seconds at the end of a normal run and - predictably - 4.5 seconds at the end of day #2 in the first no-downtime experiment in 2019.
Players started to notice once the desynchronization was above 3 seconds, mostly by noting what felt like module lag or delay when their client and the node hosting their solar system disagreed significantly about when modules were cycling. Time desynchronization is now normally within ±1/100 of a second, well within the maximum of ±0.5 seconds.
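The old drift numbers are easy to play with. The sketch below assumes the desynchronization grew linearly and that one run lasted roughly 24 hours; both are simplifications of mine, and the function names are made up for illustration. Under those assumptions it reproduces the 4.5 seconds at the end of day #2 and estimates when the drift would have crossed the point where players started to notice.

```python
# Figures quoted in the text (pre-fix behaviour).
DRIFT_PER_RUN_S = 2.25      # desync at the end of one run
NOTICE_THRESHOLD_S = 3.0    # players started to notice above this
TARGET_MAX_S = 0.5          # target maximum desync

RUN_LENGTH_H = 24.0         # assumption: one run is roughly a full day


def drift_after(hours):
    """Estimated desync in seconds after `hours` of uptime (linear model)."""
    return DRIFT_PER_RUN_S * hours / RUN_LENGTH_H


def hours_until(threshold_s):
    """Hours of uptime before the linear drift reaches `threshold_s`."""
    return threshold_s * RUN_LENGTH_H / DRIFT_PER_RUN_S


end_of_day_two = drift_after(48)                    # drift after two days
noticeable_after = hours_until(NOTICE_THRESHOLD_S)  # hours until players notice
out_of_target_after = hours_until(TARGET_MAX_S)     # hours until outside ±0.5 s

print(end_of_day_two, noticeable_after, out_of_target_after)
```

In this simple model the drift becomes noticeable partway through day #2, which fits with players reporting module lag during the first experiment.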
Tranquility has always been memory hungry. For better performance, we have always opted to pre-compute values and process data once, storing the results for later reference rather than re-computing them every time they are needed. As an example, the Brain in a Box and Dogma Rewrite projects in 2015 were all about computing and storing skills and their effects (i.e., the characters' brains) and transferring the computed results between solar systems instead of re-computing the brains on each entry to a new solar system. We also never clean up any memory, as the cluster node memory is reset every day anyway, which is itself a reliance on a daily reboot (note: we of course don't clear our DB cache memory or our Redis cache memory, but the main simulation cache memory is cleared by the reboot at each downtime).
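To make the trade-off concrete, here is a toy sketch of the compute-once-and-store pattern. It is not Tranquility's actual code: `BrainCache`, `compute_brain`, and the skill tuples are all invented for illustration. The point is that nothing is ever evicted, so the store only grows until the process restarts.

```python
class BrainCache:
    """Hypothetical store of pre-computed character 'brains'.

    Entries are computed once and kept forever; there is deliberately no
    eviction, mirroring a design that relies on a daily restart to free memory.
    """

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.compute_calls = 0  # how often the expensive path actually ran

    def get(self, character_id, skills):
        if character_id not in self._store:
            self.compute_calls += 1
            self._store[character_id] = self._compute(skills)
        return self._store[character_id]

    def __len__(self):
        return len(self._store)


def compute_brain(skills):
    """Stand-in for an expensive calculation: sum the bonuses per attribute."""
    brain = {}
    for attribute, bonus in skills:
        brain[attribute] = brain.get(attribute, 0) + bonus
    return brain


cache = BrainCache(compute_brain)
skills = [("shield_hp", 5), ("shield_hp", 10), ("cpu", 3)]

first = cache.get(1001, skills)   # computed and stored
second = cache.get(1001, skills)  # reused, e.g. after a jump to another system
print(cache.compute_calls, len(cache))
```

Every new character adds an entry and nothing removes one, which is exactly the kind of growth a no-downtime run has to watch.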
The most memory-hungry nodes in the Tranquility cluster, the Character Services nodes that store those brains I mentioned above (among other things), were at 75% memory pressure at the end of day #2 last time, which is just below our operating tolerance of 80%. Given those 2019 numbers, we might be able to run Tranquility for 3 days (and perhaps 7 hours more) if we were to run the cluster until the first node hits 100% memory usage. In 2019, the day #1 memory pressure was at 55%, but these days it is around 35%, and so we want to rebase our observations.
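The "3 days and a bit" estimate can be roughly reproduced with a straight line through the two 2019 data points. This is only a sketch: it assumes memory pressure grows linearly with uptime, and the helper names are made up for illustration. With the rounded percentages quoted above it lands at a little over three days, in the same ballpark as the estimate in the text.

```python
def fit_line(day_a, pct_a, day_b, pct_b):
    """Return (slope, intercept) of the line through two (day, percent) points."""
    slope = (pct_b - pct_a) / (day_b - day_a)
    intercept = pct_a - slope * day_a
    return slope, intercept


def days_until(pct_limit, slope, intercept):
    """Days of uptime before the fitted line reaches `pct_limit` percent."""
    return (pct_limit - intercept) / slope


# 2019 figures quoted in the text: 55% at the end of day #1, 75% at day #2.
slope, intercept = fit_line(1, 55.0, 2, 75.0)

days_to_tolerance = days_until(80.0, slope, intercept)   # operating tolerance
days_to_full = days_until(100.0, slope, intercept)       # first node at 100%

print(f"growth: {slope:.0f} percentage points per day")
print(f"80% after {days_to_tolerance:.2f} days, 100% after {days_to_full:.2f} days")
```

Once the new experiment produces fresh day #1 and day #2 figures, the same two functions can be re-run to rebase the projection.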
No-downtime is a long-term goal and all our technological advances aim towards that. We have been working for a few years now on a micro-service and message bus technology platform for EVE, and have started using that platform for a number of features. We now want to observe how that ecosystem holds up with no downtime of the primary game cluster, making sure no assumptions have been made about a daily downtime.
See you on Thursday, 9 September, as the second No Downtime experiment commences. And, just like last time, there will be a video coming soon. A whole lotta effort and power is needed for such a heroic stunt.