Apologies for overrunning the estimates
For this update we wanted to be as careful as possible, as it is larger than anything we have done before. Once all the database upgrading was done (several passes over more than 50 GB of data), we started the servers in a way that only we here at CCP and Customer Support at Síminn could connect.
We did this as a dry run of the various systems that were affected by the update. All was looking good, but about 30 minutes into the checklists the nodes in the cluster started to go offline.
This of course caused us to change plans. We immediately rebooted the cluster with full logging and started running through the checklists again. After exactly 30 minutes the same thing happened.
We pulled all logs over to Iceland and started digesting and analysing. After digging around for some time we found the reason for this.
The reason was an old bug that has been in our cluster code since release. The proxies and nodes are configured to health check each other. If they haven't heard from their siblings in a fixed amount of time, they send a heartbeat packet. If the node answers then all is fine, but if the node doesn't answer then the sibling node removes her sister node from the node registry. This we call sororicide.
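For those who like to see such things in code, here is a minimal Python sketch of the health check described above. It is an illustration only, not the actual cluster code: the `NodeRegistry` class, its method names, the `ping` callback and the 30 second timeout used in the example are all made up for this sketch.

```python
import time


class NodeRegistry:
    """Hypothetical registry of sibling nodes and when each was last heard from."""

    def __init__(self, siblings, timeout, now=time.monotonic):
        self.timeout = timeout  # seconds of silence before a heartbeat is sent
        self.now = now          # injectable clock, so the sketch is easy to test
        self.last_heard = {name: now() for name in siblings}

    def heard_from(self, name):
        # Any regular traffic from a sibling counts as proof of life.
        if name in self.last_heard:
            self.last_heard[name] = self.now()

    def check(self, ping):
        """Heartbeat every sibling that has been quiet for too long.

        `ping(name)` is an assumed callback that returns True if the
        sibling answered. Siblings that do not answer are removed from
        the registry; their names are returned.
        """
        removed = []
        for name in list(self.last_heard):
            if self.now() - self.last_heard[name] < self.timeout:
                continue  # heard from recently, no heartbeat needed
            if ping(name):
                self.last_heard[name] = self.now()
            else:
                del self.last_heard[name]  # the "sororicide" step
                removed.append(name)
        return removed
```

In this sketch a busy cluster rarely reaches the `ping` call at all, because normal traffic keeps refreshing `last_heard` through `heard_from`.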
Now when TQ is in full swing there is little need for the heartbeat packets; there is enough packet flow between the nodes and the proxies that they are aware of each other's health.
When we were running said checklists, the load was of course a lot lower than when 5,000 of you guys are playing. This caused the heartbeat system to kick in and reveal an ancient bug. Basically, instead of heartbeating N nodes 1 time each, each node was heartbeating 1 node N times. This caused them to think their sisters were not alive -> sororicide galore.
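The difference between "N nodes 1 time each" and "1 node N times" can be sketched in a few lines of Python. This is only one plausible shape for such a bug, written for illustration; the function names are invented, and the assumption that a sibling which never gets a heartbeat ends up treated as dead is ours for the sake of the example.

```python
def heartbeat_targets_buggy(siblings):
    # Assumed shape of the bug: the loop runs N times,
    # but every iteration addresses the same sibling.
    return [siblings[0] for _ in siblings]


def heartbeat_targets_fixed(siblings):
    # Intended behaviour: each of the N siblings gets exactly one heartbeat.
    return list(siblings)


def survivors(siblings, targets, alive):
    """Siblings still in the registry after one heartbeat round.

    A sibling is kept only if it received a heartbeat and answered it.
    `alive` is the set of siblings that would answer if asked.
    """
    answered = {name for name in targets if name in alive}
    return [name for name in siblings if name in answered]
```

With three perfectly healthy siblings, the buggy variant keeps only the first one in the registry, while the fixed variant keeps all three.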
Now that we have found this old bug, we have an explanation for why we have had more node deaths at night, when there is less traffic: the reduced traffic is exactly what triggers the heartbeat system. Now that the bug is fixed, you can expect the night-time node deaths to stop and the quality of service during the nights to improve.
Don't know if such details interest you but I feel slightly better after giving you the details of this unfortunate event.