EC2 for databases

EC2, or Elastic Compute Cloud, is Amazon’s cloud computing offering.  You rent servers from Amazon at a low hourly rate, micropayment style.  Every byte going into your server, every byte you store for a month - it all adds up on your monthly bill.  This approach applies not just to the hardware: you can effectively rent Windows and SQL Server as well.  The end effect is very low initial costs but slightly higher annual costs - perfect for startups.
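As a rough sketch of how that metered billing adds up - note the rates below are illustrative placeholders I've made up for the example, not Amazon's actual prices:

```python
# Back-of-the-envelope EC2 monthly bill.  All rates are illustrative
# placeholders, NOT real Amazon pricing - check the current price list.
HOURLY_RATE = 0.50     # $ per instance-hour (assumed)
STORAGE_RATE = 0.10    # $ per GB stored per month (assumed)
TRANSFER_RATE = 0.15   # $ per GB transferred (assumed)

def monthly_cost(hours, storage_gb, transfer_gb):
    """Sum the three metered components of one month's bill."""
    return (hours * HOURLY_RATE
            + storage_gb * STORAGE_RATE
            + transfer_gb * TRANSFER_RATE)

# A server running all month (~730 hours), 100 GB stored, 50 GB moved:
print(monthly_cost(730, 100, 50))  # 365 + 10 + 7.5 = 382.5
```

No purchase orders, no up-front hardware spend - just a bill that tracks what you actually used.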

Where it gets interesting is that you do not rent physical machines but virtual machines.  This means that you can clone your virtual machines, destroy them, scale them up, etc. to your heart’s content.  If you want to deploy a big change to production and have pre-prod running for a week then you can - and simply blow it away when you’re finished.  You don’t have to predict future computing needs - simply rent what you need now and rent what you need later well... later.   Think of the benefits that virtualisation provided a few years ago to the server room - Amazon’s cloud offering makes a similar advance to the ease of system management.

Also remember how, when virtualisation first became popular, it had terrible performance? Sadly we see something similar here.  Amazon’s instances do not have hard drives; they’re connected to their storage via the network port.  This means that at best you are going to get gigabit networking, and at times you will get much worse (I’ve had as low as 100 kB/s).
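To put that ceiling in perspective, here is the gigabit figure converted into the MB/s numbers disk vendors quote, alongside the bad day I mentioned:

```python
# Best case: the instance's storage traffic shares a single gigabit link.
GIGABIT_BITS_PER_S = 1_000_000_000

# Theoretical maximum, ignoring all protocol overhead: bits -> bytes -> MB.
max_mb_per_s = GIGABIT_BITS_PER_S / 8 / 1_000_000
print(max_mb_per_s)  # 125.0 MB/s - and that's before any contention

# The bad day observed above: 100 kB/s expressed in MB/s.
bad_day_mb_per_s = 100_000 / 1_000_000
print(max_mb_per_s / bad_day_mb_per_s)  # 1250.0 - three orders of magnitude slower
```

So even the theoretical best case is roughly what one fast local disk can deliver, and the worst case is unusable for database work.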

This brings us back to databases.  Normal databases are heavily I/O bound, with 15k drives commonplace and many organisations transitioning to solid state.  The idea of trying to run a large database on a single USB hard drive would normally be considered a joke, and yet that’s effectively what Amazon provides.  The performance of the complete system is therefore... disappointing, even though on basic specs (RAM, CPU, etc.) it would look quite acceptable.

And there’s nothing you can do about it.  A larger instance will mitigate the issue slightly by giving you a larger share of the network bandwidth, but fundamentally you’re limited to gigabit Ethernet - and even if you control the entire physical machine, you won’t control the SAN.

Unfortunately it gets worse.  Most databases I’m involved with are used for reporting, and modern reporting tools are quite data intensive, generating beautiful visualisations which summarise a large amount of underlying data.  The problem is getting that large amount of data from the database to the visualisation tool.  With Amazon in Virginia and me in New Zealand, I’m lucky if I can sustain 1 MB/s - perfectly acceptable for browsing the internet, but extremely painful when trying to run reporting tools that were designed to operate over a LAN.
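A quick calculation shows why that hurts.  Take a report whose visualisations sit on, say, 1 GB of underlying data (a made-up but plausible figure) and compare the transfer time at my sustained WAN rate versus the LAN the tool was designed for:

```python
# Raw transfer time for a reporting tool pulling its data, no overhead.
def transfer_minutes(data_mb, mb_per_s):
    """Seconds of transfer at a steady rate, converted to minutes."""
    return data_mb / mb_per_s / 60

# 1 GB of underlying data at the ~1 MB/s I can sustain from New Zealand:
print(transfer_minutes(1000, 1.0))  # ~16.7 minutes per report refresh
# The same data over gigabit LAN (~125 MB/s theoretical best case):
print(transfer_minutes(1000, 125))  # ~0.13 minutes, i.e. about 8 seconds
```

A tool that feels instant on a LAN becomes a coffee break per refresh over the Pacific.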

So while cloud computing is interesting, I would advise you to hold off deploying substantial databases to EC2 until Amazon sorts out their I/O issues.