Making Our Backside Bigger | EVE Online

Making Our Backside Bigger

2009-06-10 - CCP Valar

I'm Valar, the senior database administrator on the Virtual World Operations team, and I'm here to introduce the changes that we are in the process of making to Tranquility. Since I'm a database administrator, the main focus of this blog is the upgrades to our back end systems, that is, the database servers, but I will touch upon the hardware replacements we've been doing to the rest of the cluster as well.

The new database servers
We brought in a couple of IBM xSeries 3850 M2s, with 128 GB RAM and two 2.6 GHz six-core Xeons (Dunnington), to replace our aging IBM xSeries 3950, with two dual-core 3.5 GHz Xeons and 64 GB RAM. This upgrade doubles the RAM of our database servers, which helps with page cache lifetime and decreases IO and CPU load.

Our old database hardware was not nearing its capacity. However, we want to decrease downtime, and our goal is to do maintenance and other tasks that were previously restricted to downtime while EVE Online is online, so we decided that increasing the capacity of our hardware was prudent.

The new database software
Since we are replacing the database hardware, why not replace the software as well? The new database servers run Windows Server 2008 Enterprise x64. The plan is to upgrade to SQL Server 2008 Enterprise x64 SP1 as well, and I will cover that in more detail later in the blog.
Both the old and the new SQL servers run as a two-node Windows cluster, with a SQL Server failover cluster in active/passive mode. There will be no change there.

Phase 1 - Hardware upgrade
With help from top experts from Microsoft, we created a plan to do both the hardware and software upgrades with a minimum of downtime. The downtime required for the hardware upgrade was so small that we planned to do it within the regular downtime window. We did the upgrade on Monday, May 11th, during the regular downtime. However, due to unforeseen technical problems, the downtime was extended by 20 minutes.

This was accomplished by having the new servers set up with SQL failover clustering, connected to the disk array that was used before we installed the RamSan-500, with the storage laid out in the same way as the live servers. During the downtime we reconfigured our storage, or basically switched the cables so the new servers were connected to the RamSans and the old servers to the old DS4800 disk array. This is an oversimplification to say the least, but this is the essential step that allowed us to do this without an extended downtime.

We are now running on Windows Server 2008, but we are still running on SQL Server 2005. This will be addressed in the next step.
As a result of this part of the upgrade the time to start the server has gone from ~6-8 minutes to ~2.5-3.5 minutes. We have shortened the startup countdown from 10 minutes to 7 minutes and will likely move it even lower if the startup time stays as it is when we have finished this project.
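To put those numbers in perspective, here is a minimal sketch of the arithmetic. It assumes the quoted figures are simple low/high ranges in minutes, and it treats "headroom" as the countdown minus the slowest observed startup, which is an interpretation rather than something defined in this blog. The function and variable names are made up for illustration.

```python
def speedup_range(old, new):
    """Return (worst_case, best_case) speedup for two (low, high) ranges."""
    old_low, old_high = old
    new_low, new_high = new
    # Worst case: fastest old startup compared with slowest new startup.
    # Best case: slowest old startup compared with fastest new startup.
    return old_low / new_high, old_high / new_low


# Figures quoted in the text, in minutes.
old_startup = (6.0, 8.0)
new_startup = (2.5, 3.5)
old_countdown = 10.0
new_countdown = 7.0

worst, best = speedup_range(old_startup, new_startup)

# Headroom (assumed definition): countdown minus the slowest startup.
old_headroom = old_countdown - old_startup[1]
new_headroom = new_countdown - new_startup[1]

print(f"startup is roughly {worst:.1f}x to {best:.1f}x faster")
print(f"countdown headroom: {old_headroom:.1f} min before, {new_headroom:.1f} min now")
```

Under these assumptions, the 7-minute countdown still leaves more slack over the slowest startup than the old 10-minute countdown did, which is consistent with the expectation that it can be lowered further.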

Phase 2 - Upgrade to SQL Server 2008
As I said in the last section, we didn't upgrade to SQL Server 2008 at the same time as we upgraded the hardware. There are a few reasons for this:

  • Measurability. We want to be able to measure the impact of the Windows Server 2008 upgrade on the new hardware separately from the impact of SQL Server 2008.
  • Ability to roll back with as little downtime as possible in phase 1. As soon as you attach a database to SQL Server 2008, it cannot go back to SQL Server 2005.
  • Minimal downtime. By splitting the work into phases, each step can be fitted within a smaller downtime window.
  • Ability to go back to SQL Server 2005 after running on SQL Server 2008 for a while. I'll go into this below.

We've been running test upon test to ensure compatibility and to confirm that EVE does not perform worse on SQL Server 2008. We still want to have the option to go back to our old setup without data loss or extensive downtime, as I mentioned above.

To be able to perform the upgrade to SQL Server 2008 and still be able to go back to SQL Server 2005 without too much downtime and without doubling our storage hardware, we are going to start replication between the new servers and the old servers (which are connected to our DS4800) while both sets of servers are on SQL Server 2005.

We will leave this running for some days, just to make sure this works properly, before scheduling a downtime to do an in-place upgrade to SQL Server 2008 on the new servers.

We originally planned to do this upgrade during the daily scheduled downtime period, but we've decided that it's prudent to have the QA department run their dry-runs on the server after we upgrade to SQL Server 2008, just to make sure that everything is 100% okay.

When the servers are back up, the replication will carry on as if nothing had happened. If we decide to go back to SQL Server 2005 for any reason, we schedule a downtime and use the plan from phase 1 to go back, with the exception that we need to move the datafiles from the DS4800 array to the RamSans before starting up on SQL Server 2005 and the old servers again.

We could then do a storage switch to the new servers on SQL Server 2005 again a few days later.
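The rollback reasoning above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and function names are hypothetical, release years stand in for the on-disk database format, and the only rule modeled is the one from the list of reasons earlier, namely that a database touched by SQL Server 2008 cannot be attached to SQL Server 2005 again.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    location: str     # storage array that holds the data files
    file_format: int  # SQL Server release whose on-disk format the files use


def can_attach(server_version: int, copy: DatabaseCopy) -> bool:
    """A server can attach files in its own format or an older one."""
    return copy.file_format <= server_version


def attach(server_version: int, copy: DatabaseCopy) -> DatabaseCopy:
    """Attaching upgrades the files to the server's format (one-way)."""
    if not can_attach(server_version, copy):
        raise ValueError(
            f"format {copy.file_format} is too new for server {server_version}"
        )
    return DatabaseCopy(copy.location, max(copy.file_format, server_version))


# Before phase 2: live data on the RamSans, replicated copy on the DS4800.
live = DatabaseCopy("RamSan", 2005)
replica = DatabaseCopy("DS4800", 2005)

# Phase 2: the upgrade on the new servers converts the live files.
live = attach(2008, live)

# The live files can no longer be used by SQL Server 2005 ...
live_usable_on_2005 = can_attach(2005, live)
# ... but the replicated copy on the old servers still can.
replica_usable_on_2005 = can_attach(2005, replica)

# Rollback: move the replica's data files to the RamSans, then start the
# old servers on SQL Server 2005 from that copy.
rolled_back = attach(2005, DatabaseCopy("RamSan", replica.file_format))

print(live_usable_on_2005, replica_usable_on_2005, rolled_back)
```

The point of the model is that the replicated copy is the only thing that keeps a SQL Server 2005 rollback possible once the live files have been upgraded.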

Benefits?
The benefits of the upgrade for you are quicker responses from database-intensive operations, such as logging in. The shortened countdown after server startup is another obvious benefit, as is shorter downtime, since the database jobs that run at downtime take less time to run. However, the benefits are mostly in the backend. We, the database administrators, can do more maintenance outside of downtime, leading to fewer downtimes where we have to use the full hour.

It also gives our developers the chance to expose more data through the API and perhaps through other future projects. Reporting for our research and statistics department can also be done in larger part on our live database now.

We can also utilize a new feature of SQL Server 2008 to prevent reporting, websites, maintenance or the API from affecting Tranquility negatively. The feature is called Resource Governor and would, for example, allow us to prevent the API from using more than 10% CPU or 10 GB RAM. It also allows prioritization of workloads, so, for example, the websites would have priority over the API while Tranquility would have priority over everything. If we had a runaway query on the websites, it would not take Tranquility down. There are more things in SQL Server 2008, like backup and datafile compression, filtered indexes, sparse columns, new date and time data types and other things that we are interested in, but that's outside the scope of this blog.
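As a rough illustration of what such a limit could look like, here is a minimal sketch that builds Resource Governor statements for the API example. It is not our actual configuration. The helper function and the pool and group names are hypothetical, and a few assumptions are baked in: Resource Governor memory limits are expressed as a percentage rather than in GB, so the 10 GB figure is converted against the 128 GB in the new servers and rounded down; the CPU limit applies when there is contention for CPU; and a classifier function, not shown here, would still be needed to route API sessions into the workload group.

```python
def resource_governor_sql(name, max_cpu_percent, max_memory_gb,
                          server_memory_gb, importance="MEDIUM"):
    """Build T-SQL statements for one resource pool and workload group.

    Hypothetical helper: it only formats strings and does not talk to a server.
    """
    if importance not in ("LOW", "MEDIUM", "HIGH"):
        raise ValueError("importance must be LOW, MEDIUM or HIGH")
    if not 1 <= max_cpu_percent <= 100:
        raise ValueError("max_cpu_percent must be between 1 and 100")
    if max_memory_gb <= 0 or server_memory_gb <= 0:
        raise ValueError("memory sizes must be positive")

    # Convert the GB cap to a whole percentage, rounded down, clamped to 1..100.
    memory_percent = int(max_memory_gb * 100 // server_memory_gb)
    memory_percent = min(100, max(1, memory_percent))

    return [
        f"CREATE RESOURCE POOL [{name}_pool] WITH "
        f"(MAX_CPU_PERCENT = {max_cpu_percent}, "
        f"MAX_MEMORY_PERCENT = {memory_percent});",
        f"CREATE WORKLOAD GROUP [{name}_group] WITH "
        f"(IMPORTANCE = {importance}) USING [{name}_pool];",
        "ALTER RESOURCE GOVERNOR RECONFIGURE;",
    ]


# The API example from the text: at most 10% CPU and about 10 GB of 128 GB.
statements = resource_governor_sql("api", 10, 10, 128, importance="LOW")
for statement in statements:
    print(statement)
```

Generating the statements from the numbers keeps the GB-to-percent conversion explicit, which matters if the amount of RAM in the servers changes again.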

But you mentioned other hardware upgrades?!
As Mindstar mentioned in apocrypharrrrrdware!, we replaced half of the cluster around Christmas. Around 3 weeks ago we replaced the other half of our sol servers. The new servers have 3.3 GHz Wolfdale CPUs and 16 GB of RAM and replace our old 2.8 GHz AMDs that had 4 GB of RAM. With this upgrade we were able to start running all of our cluster on 64-bit processes. Before, we had to run 32-bit processes on the machines that had 4 GB of RAM due to the RAM usage overhead of 64-bit processes.

By now I've overwhelmed you with tech talk, and if you've read this far, you get a cookie.