…what is so shocking about this banter is that startups around the globe were essentially blaming a hard drive manufacturer for taking down their sites. I don’t believe I’ve ever heard of a startup blaming NetApp or Seagate for an outage in their hosted environments. People building on the cloud shouldn’t get a pass for poor architectural decisions that put too much emphasis on, essentially, network attached RAID1 storage saving their asses in an outage.
Go read the rest, it’s great. Better than mine.
tl;dr: Amazon had a major outage last week, which took down some popular websites. Despite using a lot of Amazon services, SmugMug didn’t go down because we spread across availability zones and designed for failure to begin with, among other things.
We’ve known for quite some time that SkyNet was going to achieve sentience and attack us on April 21st, 2011. What we didn’t know is that Amazon’s Web Services platform (AWS) was going to be their first target, and that the attack would render many popular websites inoperable while Amazon battled the Terminators.
Sorry about that, that was probably our fault for deploying SkyNet there in the first place.
We’ve been getting a lot of questions about how we survived (SmugMug was minimally impacted, and all major services remained online during the AWS outage) and what we think of the whole situation. So here goes.
HOW WE DID IT
We’re heavy AWS users with many petabytes of storage in their Simple Storage Service (S3) and lots of Elastic Compute Cloud (EC2) instances, load balancers, etc. If you’ve ever visited a SmugMug page or seen a photo or video embedded somewhere on the web (and you probably have), you’ve interacted with our AWS-powered services. Without AWS, we wouldn’t be where we are today – outages or not. We’re still very excited about AWS even after last week’s meltdown.
I wish I could say we had some sort of magic bullet that helped us stay alive. I’d certainly share it if I had one. In reality, our stability during this outage stemmed from four simple things:
First, all of our services in AWS are spread across multiple Availability Zones (AZs). We’d use 4 if we could, but one of our AZs is capacity constrained, so we’re mostly spread across three. (I say “one of our” because your “us-east-1b” is likely different from my “us-east-1b” – every customer is assigned to different AZs and the names don’t match up). When one AZ has a hiccup, we simply use the other AZs. Often this is graceful, but not always – there are certainly tradeoffs.
Second, we designed for failure from day one. Any of our instances, or any group of instances in an AZ, can be “shot in the head” and our system will recover (with some caveats – but they’re known, understood, and tested). I wish we could say this about some of our services in our own datacenter, but we’ve learned from our earlier mistakes and made sure that every piece we’ve deployed to AWS is designed to fail and recover.
Third, we don’t use Elastic Block Storage (EBS), which is the main component that failed last week. We’ve never felt comfortable with the unpredictable performance and sketchy durability that EBS provides, so we’ve never taken the plunge. Everyone (well, except for a few notable exceptions) knows that you need to use some level of RAID across EBS volumes if you want some reasonable level of durability (just like you would with any other storage device like a hard disk), but even so, EBS just hasn’t seemed like a good fit for us. Which also rules out their Relational Database Service (RDS) for us – since I believe RDS is, under the hood, EC2 instances running MySQL on EBS. I’ll be the first to admit that EBS’ lack of predictable performance has been our primary reason for staying away, rather than durability, but durability & availability have been a strong secondary consideration. Hard to advocate a “systems are disposable” strategy when they have such a vital dependency on another service. Clearly, at least to us, it’s not a perfect product for our use case.
Which brings us to the fourth point: we aren’t 100% cloud yet. We’re working as quickly as possible to get there, but the lack of a performant, predictable cloud database at our scale has kept us from going there 100%. As a result, the exact types of data that would have potentially been disabled by the EBS meltdown don’t actually live at AWS at all – it all still lives in our own datacenters, where we can provide predictable performance. This has its own downsides – we had two major outages ourselves this week (we lost a core router and its redundancy earlier, and a core master database server later). I wish I didn’t have to deal with routers or database hardware failures anymore, which is why we’re still marching towards the cloud.
So what did we see when AWS blew up? Honestly, not much. One of our Elastic Load Balancers (ELBs) on a non-critical service lost its mind and stopped behaving properly, especially with regards to communication with the affected AZs. We updated our own status board, and then I tried to work around the problem. We quickly discovered we could just launch another identical ELB, point it at the non-affected zones, and update our DNS. 5 minutes after we discovered this, DNS had propagated, and we were back in business. It’s interesting to note that the ELB itself was affected here – not the instances behind it. I don’t know much about how ELBs operate, but this leads me to believe that ELBs are constructed, like RDS, out of EC2 instances with EBS volumes. That seems like the most logical reason why an ELB would be affected by an EBS outage – but other things like network saturation, network component failures, split-brain, etc could easily cause it as well.
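For the curious, the workaround looks roughly like this if you script it. This is a minimal sketch using boto3, with made-up names (the load balancer name, AZs, and instance ID are placeholders) – it’s not our actual tooling:

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Launch a replacement ELB that only lives in the healthy zones.
new_lb = elb.create_load_balancer(
    LoadBalancerName="frontend-failover",            # hypothetical name
    Listeners=[{
        "Protocol": "HTTP", "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP", "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1c"],  # the unaffected AZs
)

# Attach the same backend instances the broken ELB was fronting.
elb.register_instances_with_load_balancer(
    LoadBalancerName="frontend-failover",
    Instances=[{"InstanceId": "i-0123456789abcdef0"}],  # placeholder ID
)

# Then point your DNS record at the new ELB and wait for propagation.
print("CNAME your hostname to:", new_lb["DNSName"])
```

The whole thing is a handful of API calls, which is exactly why keeping DNS with a short TTL in front of your ELBs pays off.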
Probably the worst part about this whole thing is that the outage in question spread to more than one AZ. In theory, that’s not supposed to happen – I believe each AZ is totally isolated (physically in another building at the very least, if not on the other side of town), so there should be very few shared components. In practice, I’ve often wondered how AWS does capacity planning for total AZ failures. You could easily imagine people’s automated (and even non-automated) systems simply rapidly provisioning new capacity in another AZ if there’s a catastrophic event (like Terminators attacking your facility, say). And you could easily imagine that surge in capacity taking enough toll on one or more AZs to incapacitate them, even temporarily, which could cause a cascade effect. We’ll have to wait for the detailed post-mortem to see if something similar happened here, but I wouldn’t be surprised if a surge in EBS requests to a 2nd AZ had at least a deteriorating effect. Getting that capacity planning done just right is just another crazy difficult problem that I’m glad I don’t have to deal with for all of our AWS-powered services.
This stuff sounds super simple, but it’s really pretty important. If I were starting anew today, I’d absolutely build 100% cloud, and here’s the approach I’d take:
- Spread across as many AZs as you can. Use all four. Don’t be like this guy and put all of the monitoring for your poor cardiac arrest patients in one AZ (!!).
- If your stuff is truly mission critical (banking, government, health, serious money maker, etc), spread across as many Regions as you can. This is difficult, time consuming, and expensive – so it doesn’t make sense for most of us. But for some of us, it’s a requirement. This might not even be live – just for Disaster Recovery (DR).
- Beyond mission critical? Spread across many providers. This is getting more and more difficult as AWS continues to put distance between themselves and their competitors, grow their platform and build services and interfaces that aren’t trivial to replicate, but if your stuff is that critical, you probably have the dough. Check out Eucalyptus and Rackspace Cloud for starters.
- I should note that since spreading across multiple Regions and providers adds crazy amounts of extra complexity, and complex systems tend to be less stable, you could be shooting yourself in the foot unless you really know what you’re doing. Often redundancy has a serious cost – keep your eyes wide open.
- Build for failure. Each component (EC2 instance, etc) should be able to die without affecting the whole system as much as possible. Your product or design may make that hard or impossible to do 100% – but I promise large portions of your system can be designed that way. Ideally, each portion of your system in a single AZ should be killable without long-term (data loss, prolonged outage, etc) side effects. One thing I mentally do sometimes is pretend that all my EC2 instances have to be Spot instances – someone else has their finger on the kill switch, not me. That’ll get you to build right. 🙂
- Understand your components and how they fail. Use any component, such as EBS, only if you fully understand it. For mission-critical data using EBS, that means RAID1/5/6/10/etc locally, and some sort of replication or mirroring across AZs, with some sort of mechanism to get eventually consistent and/or re-instantiate after failure events. There’s a lot of work being done in modern scale-out databases, like Cassandra, for just this purpose. This is an area we’re still researching and experimenting in, but SimpleGeo didn’t seem affected and they use Cassandra on EC2 (and on EBS, as far as I know), so I’d say that’s one big vote.
- Try to componentize your system. Why take the entire thing offline if only a small portion is affected? During the EBS meltdown, a tiny portion of our site (custom on-the-fly rendered photo sizes) was affected. We didn’t have to take the whole site offline, just that one component for a short period to repair it. This is a big area of investment at SmugMug right now, and we now have a number of individual systems that are independent enough from each other to sustain partial outages but keep service online. (Incidentally, it’s AWS that makes this much easier to implement)
- Test your components. I regularly kill off stuff on EC2 just to see what’ll happen. I found and fixed a rare bug related to this over the weekend, actually, that’d been live and in production for quite some time. Verify your slick new eventually consistent datastore is actually eventually consistent. Ensure your amazing replicator will actually replicate correctly or allow you to rebuild in a timely fashion. Start by doing these tests during maintenance windows so you know how it works (there’s a tiny kill-script sketch after this list). Then, once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.
- Relax. Your stuff is gonna break, and you’re gonna have outages. If you did all of the above, your outages will be shorter, less damaging, and less frequent – but they’ll still happen. Gmail has outages, Facebook has outages, your bank’s website has outages. They all have a lot more time, money, and experience than you do and they’re offline or degraded fairly frequently, considering. Your customers will understand that things happen, especially if you can honestly tell them these are things you understand and actively spend time testing and implementing. Accidents happen, whether they’re in your car, your datacenter, or your cloud.
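To make the “test your components” bullet concrete, here’s the flavor of kill script I mean – a minimal, hypothetical sketch using boto3 that terminates one random instance from a tagged pool. The tag name and filter are made up, and obviously don’t point this at anything you aren’t prepared to lose:

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances in a hypothetical "render" worker pool.
resp = ec2.describe_instances(Filters=[
    {"Name": "tag:Role", "Values": ["render"]},
    {"Name": "instance-state-name", "Values": ["running"]},
])
instance_ids = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

# Shoot one in the head and watch whether the system actually recovers.
victim = random.choice(instance_ids)
print("Terminating", victim)
ec2.terminate_instances(InstanceIds=[victim])
```

Run it during a maintenance window first; graduate to business hours once the system survives without anyone noticing.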
Best part? Most of that stuff isn’t difficult or expensive, in large part thanks to the on-demand pricing of cloud computing.
WHAT ABOUT AWS?
Amazon has some explaining to do about how this outage affected multiple AZs, no question. Even so, high volume sites like Netflix and SmugMug remained online, so there are clearly cloud strategies that worked. Many of the affected companies are probably taking good hard looks at their cloud architecture, as well they should. I know we are, even though we were minimally affected.
Still, SmugMug wouldn’t be where we are today without AWS. We had a monster outage (~8.5 hours of total downtime) with AWS a few years ago, where S3 went totally dark, but that’s been the only significant setback. Our datacenter related outages have all been far worse, for a wide range of reasons, as many of our loyal customers can attest. 😦 That’s one of the reasons we’re working so hard to get our remaining services out of our control and into Amazon’s – they’re still better at this than almost anyone else on earth.
Will we suffer outages in the future because of Amazon? Yes. I can guarantee it. Will we have fewer outages? Will we have less catastrophic outages? That’s my bet.
THE CLOUD IS DEAD!
There’s a lot of noise on the net about how cloud computing is dead, stupid, flawed, makes no sense, is coming crashing down, etc. Anyone selling that stuff is simply trying to get page views and doesn’t know what on earth they’re talking about. Cloud computing is just a tool, like any other. Some companies, like Netflix and SimpleGeo, likely understand the tool better. It’s a new tool, so cut the companies that are still learning some slack.
Then send them to my blog. 🙂
And, of course, we’re always hiring. Come see what it’s like to love your job (especially if you’re into cloud computing).
UPDATE: Joe Stump is out with the best blog post about the outage yet, The Cloud is not a Silver Bullet, imho.
Been at the MySQL conference the last few days, and I have to say, I’m really blown away by MySQL 5.5.4’s improvements. Last year I keynoted and I begged Oracle on stage to realize that MySQL and InnoDB under one roof represented opportunity. It’s clear they heard the community – this is some serious progress, and right when we needed it.
Jeremy Zawodny’s blog post covers most of the stuff I’m really excited about, and there are some great detailed technical slides here and here, but I wanted to go into a little more detail on one important improvement: We’ve been plagued by MySQL’s undo slot limits for an awfully long time. Basically, you could have 512 INSERT transactions and 512 UPDATE transactions running at once, for a grand total of 1024. If you use INSERT … ON DUPLICATE KEY UPDATE, though, it takes two of those spots, meaning you get 512 concurrent transactions. On modern hardware, it’s trivially easy to hit this limit.
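If you want to see the old limit for yourself, a toy reproduction looks something like the sketch below. It’s hedged and hypothetical – pymysql and the counters table are stand-ins, and you’ll need max_connections raised well above the default – but the idea is simply to hold a few hundred upsert transactions open at once:

```python
import threading
import time
import pymysql  # any MySQL client works; pymysql is just for illustration

def hold_upsert_open(i):
    conn = pymysql.connect(host="127.0.0.1", user="test", password="test",
                           database="test", autocommit=False)
    try:
        with conn.cursor() as cur:
            # Each upsert needs both an INSERT undo slot and an UPDATE undo slot.
            cur.execute(
                "INSERT INTO counters (id, hits) VALUES (%s, 1) "
                "ON DUPLICATE KEY UPDATE hits = hits + 1", (i,))
        time.sleep(30)  # keep the transaction open so the slots stay allocated
        conn.commit()
    except Exception as exc:
        # On older InnoDB, upserts beyond the undo slot limit start failing here.
        print(f"transaction {i} failed: {exc}")
    finally:
        conn.close()

threads = [threading.Thread(target=hold_upsert_open, args=(i,)) for i in range(600)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```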
I’ve had an Enterprise support ticket open for years on the issue, there’s been a MySQL bug for a long time, and there was basically no movement. In fact, I’d gotten so frustrated about this issue, I’d basically decided this year was our last year of Enterprise MySQL support. It was one of the only reasons we paid for support for the last few years – the promise that a fix was just around the corner. I felt good about voting with my dollars, and contributing back to a core technology we depend on, but enough was enough.
Lo and behold, it’s fixed! You can now have a whopping 128K transactions in flight. Best of all, it’s far more performant than it used to be! And craziest of all? If you run 5.5.4 on a database, then roll back to some older release, the change still takes effect. Backwards bug and performance fixing – that’s a new one on me.
THANK YOU ORACLE!
Shameless plug – we’re hiring. And it’s a blast.
Been asked a few times in the last few days about where my slides are from my MySQL keynote from *last* year.
Um, yeah. Sorry about that. Here’s a link to ‘The SmugMug Tale’ slides, and you can watch the video below:
Sorry for the extreme lag. I suck.
The important highlights go something like this:
- Use transactional replication. Without it, you’re dead in the water. You have no idea where a crashed slave was.
- Use a filesystem that lets you do snapshots. Easily the best way to do backups, spin up new slaves, etc. I love ZFS. You’ll need transactional replication to really make this painless. (There’s a small snapshot sketch after this list.)
- Use SSDs if you can. We can’t afford to be fully deployed on SSDs (terabytes are expensive), but putting them in the write path to lower latency is awesome. The read path might help, too, depending on how much caching you’re already doing. Love hybrid storage pools.
- Use Fishworks (aka Open Storage) if you can. The analytics are unbeatable, plus you get SSDs, snapshots, ZFS, and tons of other goodies.
- Use transactional replication. This is so important I’m repeating it. Patch it into MySQL (Google, Facebook, and Percona have patches) or use XtraDB if you use replication. We use the Percona patch.
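Since people ask, the general shape of a snapshot-based backup looks like the sketch below. It’s a hedged illustration, not our production script – the ZFS dataset name is made up, it assumes the MySQL datadir lives on that dataset, and pymysql is just a stand-in for whatever client you prefer:

```python
import subprocess
import pymysql  # any MySQL client works; pymysql is just for illustration

conn = pymysql.connect(host="127.0.0.1", user="backup", password="secret")
try:
    with conn.cursor() as cur:
        # Briefly quiesce writes so the snapshot is consistent.
        cur.execute("FLUSH TABLES WITH READ LOCK")
        # Record binlog coordinates so a new slave knows where to start.
        cur.execute("SHOW MASTER STATUS")
        print("binlog position:", cur.fetchone())
        # ZFS snapshots are atomic and effectively instant.
        subprocess.run(["zfs", "snapshot", "tank/mysql@backup-latest"], check=True)
        cur.execute("UNLOCK TABLES")
finally:
    conn.close()
```

With transactional replication in the mix, a slave cloned from a snapshot like that comes up knowing exactly where it was – which is the whole point of the first and last bullets.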
Holler in the comments if something in the presentation isn’t clear, I’ll answer. Apologies again.
Shameless plug – we’re hiring. And it’s a blast.
For anyone who lives and dies by MySQL and InnoDB, things are finally starting to heat up and get interesting. I’ve been banging the “MySQL/InnoDB scales poorly” drums for years now, and despite having paid Enterprise licenses, I haven’t been able to get anywhere. I was pretty excited when Sun bought MySQL since their future is intrinsically tied to concurrency, but things have been pretty slow going over there this year.
But the community has finally taken up arms and is fighting the good fight. It’s (finally!) a great time to be a MySQL user because there’s been lots of recent progress. Here’re some of my favorites (and highlights of work left to do):
I can’t sing Percona’s praises enough. They’re probably the most knowledgeable MySQL experts out there (possibly even including Sun). Absolutely the best bang for the buck in terms of MySQL service and support – better than MySQL’s own offering. (If I had to guess why that is, I’d bet that MySQL/Sun don’t want to step on Oracle’s toes by fixing InnoDB – but >99% of what we need is related to InnoDB. Percona has no such tip-toeing limitations.) Let me quickly count the ways they’ve helped me in the last few months:
- They knew of a super obscure configuration setting “back_log”. Have you ever heard of it? I hadn’t. But we started seeing latency on MySQL connections (up to *3 seconds*!) on systems that hadn’t changed recently (exactly 3 seconds sounded awfully suspicious, and sure enough, it was TCP retries). After going through every single kernel, network, and MySQL tuning parameter I know (and I know a lot), I finally called Percona. They dug in, investigated the system, and unearthed ‘back_log’ within an hour or two. Popped that into my configuration and boom, everything was fine again. Whew!
- We have servers that easily exceed InnoDB’s transaction limits. Did you know InnoDB has a concurrent transaction limit of 1024? (Technically, 1024 INSERTs and 1024 UPDATEs. But INSERT … ON DUPLICATE KEY UPDATE manages to chew up one of each). I know all about it – I’ve had bugs open with MySQL Enterprise for more than 2 years on the issue. What’s more, these are low-end systems – 4 cores, 16GB of RAM – and they’re nowhere near CPU or IO bound. It took MySQL months to figure out what the problem was (years, really, to figure out all the final details like the different undo logs for INSERT vs UPDATE). Their final answer? It’ll be fixed in MySQL 6. 😦 Note that 5.1 *just* went GA after years and years. On the other hand, it took Percona one weekend to diagnose the problem, and 13 days to have a preliminary patch ready to extend it to 4072 undo slots. Talk about progress! (And yes, we want Percona to release the patch to the world)
- Solving the CPU scaling problems. These have been plaguing us for years (we have had some older four-socket systems for a while … now with quad-core, it’s even worse), and thanks to Google and Percona, this problem is well on its way to being solved. We’re sponsoring this work and can’t wait to see what happens next.
- XtraDB. This is the biggy. So big it deserves its own heading….
Oracle’s done a terrible job of supporting the community with InnoDB. The conspiracy theorists can all say “I told you so! Oracle bought them to halt MySQL progress” now – history supports them. Which is a shame – Heikki is a great guy and has done amazing work with InnoDB, but the fact remains that it wasn’t moving forward. The InnoDB plugin release was disappointing, to say the least. It addressed none of the CPU or IO scalability issues the community has been crying about for years.
Luckily, Percona finally did what everyone else has been too afraid to do – they forked InnoDB. XtraDB is their storage engine, forked from InnoDB (and then turbocharged!). We’re not running it in production yet, but we are running all of the patches that went into XtraDB and I can tell you they’re great. We’re sponsoring more XtraDB development (and yes, we made sure Percona will be contributing anything they build for us back to the community) with Percona, and I’m sure that’ll continue.
I’ve already blogged a bit about Drizzle, but it sure looks like Drizzle + XtraDB might be a match made in heaven. Drizzle can be thought of as a MySQL engine re-write with an eye towards web workloads and performance, rather than features. MySQL 4.1, 5.0, and 5.1 added a lot of features that bloated the code without offering anything really useful to web-oriented workloads like ours, so the Drizzle team is ripping all that stuff back out and rethinking the approaches to the things that are being left in. Very exciting.
The advent of “cheap enough” super-fast SSD storage is finally upon us. I’ve got Sun S7410 storage appliances in production and they’re blazingly fast. I have a very thorough review coming, but the short version is that even with NFS latencies, we’re able to do obscene write workloads to these boxes (let alone reads). 10000+ write IOPS to 10TB of mirrored, crazy durable (thanks ZFS!) storage is a dream come true. Once you mix in snapshots, clones, replication, and Analytics – well, it just doesn’t get much better than this.
(Don’t get sticker shock looking at the web pricing – no one pays anything even remotely like that. Sign up for Startup Essentials if you can, or talk to your Sun sales rep if you can’t, and you can get them much cheaper. I nearly had a heart attack myself until I got “real” pricing. Tell them I sent you – enough Sun people read this blog, it might just help 🙂 ).
So, all in all, there’s been an awful lot of progress this year, which is great. CPUs are finally scaling under InnoDB, and we finally have storage that isn’t bounded by physical rotation and mechanical arms. Unfortunately, great CPU scaling plus amazing IO capabilities isn’t something InnoDB digests very well. As is common in complicated systems, once you fix one bottleneck, another one elsewhere in the system crops up. This time, it’s IOPS. It was eerie reading Mark Callaghan’s post about this last night – I’d come to the exact same conclusions (from an Operations point of view rather than code-level) just yesterday.
Bottom line: Despite having ample CPU and ample IO, InnoDB isn’t capable of using the IO provided. You can bet we’ll be working with Percona, Google and Sun (read: sitting back and admiring their brilliant work while writing the occasional check and providing production workload information) to look into fixing this.
In the meantime, we’re back to the old standbys: replication and data partitioning. Yes, we’re stacking lots of MySQL instances on each S7410 to maximize both our IOPS and our budget. Fun stuff – more on that later. 🙂
UPDATE: Just occurred to me that there are plenty of *new* readers to my blog who haven’t heard me praise Google and their patches before. Mark Callaghan’s team over at Google definitely deserves a shout-out – they’ve really been a catalyst for much of this work along with Percona.
In high school, I had a great programmable calculator. I’d program it to solve complicated math and science problems “automatically” for me. Most of my teachers got upset if they found out, but I’ll always remember one especially enlightened teacher who didn’t. He said something to the effect of “Hey, if you managed to write software to solve the equation, you must thoroughly understand the problem. Way to go!”.
George Reese wrote up a blog post over at O’Reilly the other day called On Why I Don’t Like Auto-Scaling in the Cloud. His main argument seems to be that auto-scaling is bad and reflects poor capacity planning. In the comments, he specifically calls SmugMug out, saying we’re “using auto-scaling as a crutch for poor or non-existent capacity planning”.
George is like one of those math teachers who doesn’t “get it”. I was tempted not to write this post because he gets it so wrong, I’d hate to spread that meme. SkyNet auto-scales well. No humans at SmugMug are monitoring it and it just hums along, doing its job. Why is it so efficient? Because I understand the equation. I know what metrics drive our capacity planning and I programmed SkyNet to take these into account. It checks an awful lot of data points every minute or so – this isn’t simply “oh, we have idle CPU, let’s kill some instances.” (Though I would argue that, depending on the application, simple auto-scaling based on CPU usage or a similar data point can be very effective, too.)
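If it helps make that concrete, here’s the flavor of what I mean by “understanding the equation.” This is purely hypothetical – the metric names, thresholds, and weighting below are invented for illustration and have nothing to do with SkyNet’s real inputs:

```python
# Purely hypothetical; metric names and thresholds are invented for illustration.
def desired_worker_count(m):
    # Size the fleet from the work that's actually queued...
    by_queue = m["jobs_pending"] / m["jobs_per_instance_per_minute"]
    # ...from how far user-visible latency is from its target...
    by_latency = m["current_instances"] * (m["p95_latency_ms"] / m["target_latency_ms"])
    # ...and only then from raw CPU utilization.
    by_cpu = m["current_instances"] * (m["avg_cpu_percent"] / 70.0)

    # Scale up aggressively, scale down cautiously, and never go below the floor.
    target = max(by_queue, by_latency, by_cpu)
    return max(m["min_instances"], round(target))
```

The point isn’t the specific arithmetic – it’s that the inputs come from understanding the problem, not from a single generic gauge.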
SkyNet has been in production for over a year with only two incidents of note and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?
FYI, Sun is announcing some sweet new storage stuff on Monday at 3:30pm PT.
I’m reviewing a few of the things they’re announcing, and hope to publish my thoughts here soon (one of them joins my production network tonight if all goes well). However, I’m at Disneyland with my kids (first trip!) from Monday through Thursday, so I don’t know (yet) when I’ll be able to write them up. Bear with me if it takes a few days.
But the gear is exciting, and the direction Sun is headed is even more exciting!