Well, I guess it's time to own up...
As always, our policy is that you deserve the truth no matter what. The reason for the server outages yesterday and this evening was a tiny module in the network equipment called a gigabit interface converter (GBIC).
The EVE cluster consists of a row of frontend servers (proxies) and another row of application servers behind those. In between there is a network layer consisting of three Cisco switches connected via fiber-optic cables.
In each switch there is an interface card that the fiber-optic cable plugs into, and it was one of these modules that failed. "No big deal, just replace it," you're probably thinking, which is exactly what we did, and pronto.
Thinking the worst was over, we relaxed a little and made plans to redesign the network, because this outage made the total lack of redundancy in this network layer painfully obvious. The lack of redundancy was intentional, however, because we use remote deployment software from IBM (RDM) which does not work when the spanning tree protocol is enabled on Cisco switches. Given the good track record all the admins on the team had with Cisco products, we figured the likelihood of failure was so small it was a risk worth taking, totally forgetting old Murphy... smart, right? :(
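For the curious, here is a rough sketch of what that tradeoff looks like on an IOS-based Catalyst switch. This is not our actual configuration, and the VLAN and port numbers are made up, but disabling spanning tree really is a one-line decision, and it leaves the single uplink between switches with nothing to fall back on:

  ! spanning tree is on by default; turning it off for the cluster VLAN
  ! is what keeps RDM deployments working, at the cost of redundancy
  no spanning-tree vlan 10

  ! the lone GBIC uplink between two of the switches
  interface GigabitEthernet0/1
   description uplink to the proxy-side switch
   switchport mode trunk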
Well, the on-site personnel replacing the module last night probably did not seat the fiber-optic cable properly, or the new module is itself faulty, because we started getting link flaps on the same interface again this evening (meaning the link between the switches keeps resetting). That eventually caused the Cisco switch to disable the link, and the whole cluster came crashing down because proxy-server communication was cut off. As I write this I am waiting for a technician to arrive on site (I am in Iceland, he is in London) with spare parts for the replacement. The cluster is up and running again and so far there have been no link flaps on the faulty interface, so I am keeping my fingers crossed. Hopefully we can fix this for good by putting a second fiber link between the switches and letting spanning tree work its magic, so that there won't be any need for downtime.
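To give an idea of what that fix could look like, here is another rough sketch, again on a hypothetical IOS-based Catalyst switch with made-up VLAN and port numbers, not something we have rolled out yet: a second fiber uplink between the same pair of switches, spanning tree re-enabled so the extra path sits blocked until the primary link dies, and automatic recovery of a port the switch has shut down because of link flapping.

  ! re-enable spanning tree on the cluster VLAN so the redundant
  ! path becomes a backup instead of a loop
  spanning-tree vlan 10

  ! two GBIC uplinks between the same pair of switches;
  ! spanning tree keeps one blocked until the other fails
  interface GigabitEthernet0/1
   switchport mode trunk
  interface GigabitEthernet0/2
   switchport mode trunk

  ! let the switch bring a link-flapping port back up by itself
  ! instead of leaving it disabled until someone intervenes
  errdisable recovery cause link-flap
  errdisable recovery interval 300

A quick "show interfaces status err-disabled" on the switch shows which ports it has taken down this way.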
We are still debating whether that is a risk worth taking. In any case, come Monday morning we will be taking a long, hard look at the current design and making whatever changes we need to make it more robust. I know the front end is fully redundant, but the internal layers need revising.
So if I didn't say it before: sorry about this. It was totally predictable; too bad none of us had the smarts to spot it...