Fixing Lag: Well, this one doesn't really...
When Dominion was released we started getting reports about pilots being horribly stuck for minutes, sometimes stretching into hours, when jumping en masse into highly loaded systems. We were never able to reproduce this issue in-house, which left us with little investigative material.
After spending weeks, nay, months reading code, analyzing logs, hacking at the console, trying to grab fights in action (you people fight at ungodly hours) to dig into running code, trying to trace a case of a stuck pilot before he relogs or the node dies or everybody just quits, we finally figured out where they were stuck.
When you jump, the handover between systems involves a combination of a server handover and establishing new client connections to the destination system, where the last part is a client session attach which signals to the server that you need a fresh state uploaded. If for any reason you finish the server handover part but fail to provide a session attach, the receiving system will force your ship into space and you can be shot at. This is necessary because there is no reliable way of distinguishing between a disconnect, a cheat or lag. We cannot have pilots in-between systems, and we can't reliably back out of the operation at this point; it's simply too late.
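To make the two halves of a jump concrete, here is a minimal sketch of the handover as a tiny state machine. It is not the actual server code: the class and state names are made up for illustration, and the attach deadline in particular is an assumption, since the real condition that forces a ship into space isn't spelled out here.

```python
from enum import Enum, auto


class JumpState(Enum):
    IN_ORIGIN = auto()
    AWAITING_ATTACH = auto()    # server handover done, client not attached yet
    ATTACHED = auto()           # client attached and got a fresh state
    FORCED_INTO_SPACE = auto()  # no attach arrived; ship is in space anyway


class Jump:
    """Hypothetical model of one pilot's jump, for illustration only."""

    def __init__(self, attach_deadline_s):
        # attach_deadline_s is an assumed knob, not a real server setting.
        self.attach_deadline_s = attach_deadline_s
        self.state = JumpState.IN_ORIGIN
        self.handover_time = None

    def complete_server_handover(self, now):
        # From here on the pilot is "there" as far as the server is concerned.
        self.state = JumpState.AWAITING_ATTACH
        self.handover_time = now

    def client_attach(self):
        # A late attach is still accepted; the ship may already be in space.
        if self.state in (JumpState.AWAITING_ATTACH,
                          JumpState.FORCED_INTO_SPACE):
            self.state = JumpState.ATTACHED
        return self.state

    def tick(self, now):
        # There is no backing out to the origin system: either the attach
        # shows up, or the ship gets put into space without it.
        if (self.state is JumpState.AWAITING_ATTACH
                and now - self.handover_time >= self.attach_deadline_s):
            self.state = JumpState.FORCED_INTO_SPACE
        return self.state
```

The point of the sketch is that there is no transition back to IN_ORIGIN once the server handover has completed.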
When anything is added or removed from a ticking solarsystem, a lock must be held to ensure that the operation isn't interrupted during critical paths. This includes the case when you're attaching the aforementioned client session to the system. If a fleet jumps into a highly loaded system, a queue of client requests forms waiting on this lock and this is where you're stuck; the server handover part has completed, so you're "there" as far as the server is concerned, but your client never finishes attaching to it and thus never gets the initial solarsystem state.
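The shape of the problem can be sketched with plain Python threads. This is an illustration under assumptions, not the real server code: CoarseLockedSystem, attach_session, jump_fleet and build_state are hypothetical names, and a single threading.Lock stands in for the solarsystem lock.

```python
import threading


class CoarseLockedSystem:
    """Hypothetical stand-in for a solarsystem guarded by one big lock."""

    def __init__(self):
        self._lock = threading.Lock()  # one lock for every add/remove
        self.attached = []             # sessions, in the order they got in

    def attach_session(self, session_id, build_state):
        with self._lock:
            # Everything, including the slow state build, runs under the
            # lock, so every other attach (or add/remove) queues up here.
            state = build_state(session_id)
            self.attached.append((session_id, state))


def jump_fleet(system, fleet_size, build_state):
    """Simulate a fleet arriving at once: one attach request per pilot."""
    threads = [
        threading.Thread(target=system.attach_session, args=(i, build_state))
        for i in range(fleet_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return system.attached
```

Everybody gets in eventually, but strictly one at a time, and the last pilot in the queue pays for everyone in front of him.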
Now, this pileup of requests only happens if the node is under heavy load to begin with. At that point a sudden surge in these requests can send the server into a death-spiral where each thread holds the lock a little bit longer and thus the delay is a bit longer between requesting it and getting it, and thus the queue of waiting threads grows larger and thus more time elapses between the lock being released by one thread until the next one requesting it gets to run etc, etc. This is about the same time players get frustrated and start to spam click all their buttons, relog, try their alt characters and such, further compounding the issue and adding to the already problematic load and horror-queuing.
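A back-of-the-envelope model shows why this spirals instead of just queuing. The numbers and the linear slowdown are pure assumptions for illustration (drain_time, base_hold_s and slowdown_per_waiter are made-up names, not measurements from the cluster): each waiting request makes the current lock holder a bit slower.

```python
def drain_time(fleet_size, base_hold_s, slowdown_per_waiter):
    """Total seconds to serve every queued attach, one at a time.

    Assumed model: the hold time for each request grows linearly with the
    number of requests still waiting behind it.
    """
    total = 0.0
    for served in range(fleet_size):
        waiting = fleet_size - served - 1
        total += base_hold_s * (1.0 + slowdown_per_waiter * waiting)
    return total
```

With a 100-pilot fleet, a 10 ms base hold and no slowdown, the queue drains in about 1 second. Add a 5% slowdown per waiter and the same queue takes roughly 3.5 seconds, and the penalty grows with the square of the fleet size rather than linearly.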
So, I did some code monkeying using like, science and stuff, to figure out exactly what the critical paths were that we were guarding with this lock and what exactly we were guarding against. Result being much more granular exclusive locks on session attaches and detaches. Yay.
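One way to picture "more granular" is sketched below. The actual granularity of the new locks isn't described here, so treat this as an assumption: GranularLockedSystem and its methods are hypothetical, with one lock per session for the slow work and a short-lived lock that only guards the shared bookkeeping.

```python
import threading


class GranularLockedSystem:
    """Hypothetical sketch: per-session locks plus a tiny registry lock."""

    def __init__(self):
        self._registry_lock = threading.Lock()  # guards the two dicts only
        self._session_locks = {}
        self.sessions = {}

    def _lock_for(self, session_id):
        # Held just long enough to look up or create the per-session lock.
        with self._registry_lock:
            return self._session_locks.setdefault(session_id, threading.Lock())

    def attach_session(self, session_id, build_state):
        with self._lock_for(session_id):
            # The slow part only blocks attach/detach of this same session.
            state = build_state(session_id)
            with self._registry_lock:
                self.sessions[session_id] = state

    def detach_session(self, session_id):
        with self._lock_for(session_id):
            with self._registry_lock:
                self.sessions.pop(session_id, None)
```

Locks are always taken in the same order (session lock first, then the registry lock, never the other way around while holding one), which is what keeps a scheme like this from deadlocking.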
This was actually done a few months ago, but since this was a very low layer change with massive fallout potential, we left it to simmer on Singularity for a while after testing and then carefully activated the new code manually on selected fleet fight nodes. When nothing blew up, we left it enabled on fleet fight nodes, followed by staggered deployment to Jita and mission hubs and finally making it the default way to do things on the 29th of July.
Now, much tighter locks don't lighten the load on the server at all. The same amount of work still needs to happen for any given jump; we're just mixing the tasks up better so it's fair to those wanting in on it. In effect it means that someone else blocks you for a lot less time when you jump, which means you get added to the fight at the other end sooner, allowing you, dear pilot, to activate your modules and drones and whatnot, thus actually adding to the existing load and increasing the existing lag.
And that is a good thing. Right? You wanted in, right? I mean, you did press the "Jump" button (repeatedly probably) knowing that there was a fleet waiting for you?
- CCP GingerDude
PS. Just to make it clear. Performance improvements are being worked on. Several optimizations are already on Tranquility and more are in the deployment pipeline and even more are being worked on. Lagfixing is not a bugfix. It's a granular process with occasional big wins. Don't go all "why're you doing this when you should've fixed lag?" Infrastructure must be upgraded when you want to optimize.
PPS. We believe we've uncovered a rather rare (statistically speaking) separate old issue of not loading the grid. I realize that it's pretty darn impossible for a player to distinguish between being blocked on a lock and that issue, but in the latter case it's actually a client issue. We are working on that.