S3 outage – We weren't affected
Amazon S3 had an outage today. First I knew about it was reporters emailing and calling me asking if we were knocked out by it.
We weren’t. No customers reported issues, and our systems were all showing typically low and acceptable error rates. To be honest, I’m surprised.
I wasn’t going to blog about it until I understood why we weren’t affected, but I’m really getting inundated with requests now, so I figured this would be a good way to optimize my time rather than spending all day on the phone. 🙂
We’re researching what happened now, but again, I didn’t know about the outage until after it was over, and I haven’t spoken to anyone at Amazon yet. Until I finish my research and speak with Amazon, I’m not going to speculate on what may have happened or why.
I can say, once again, that we pay the same rates everyone else pays and that, other than some early access to upcoming beta services, we don’t get any preferential treatment that I’m aware of.
Some thoughts, though:
- We expect Amazon to have outages. No website I’m aware of doesn’t, whether it’s Google, Amazon, your bank, or SmugMug.
- I’ve written about Amazon S3 outages in the past, but in the last ~12 months, we’ve only seen a single ~2 minute outage window (on January 22nd, 2008 at around 4:38pm Pacific). We also had one recent fairly major hiccup with EC2.
- Yes, I believe there will probably be times where SmugMug is seriously affected, possibly even offline completely, because Amazon (or some other web services provider) is having problems. Today wasn’t that day. Nobody likes outages, especially not us, but we’ve decided the tradeoffs are worth it. You should have your eyes wide open when you make the decision to use AWS or any other external service, though. It will fail from time to time.
- We’ve done our best to build SmugMug in such a way that we handle failures as gracefully as possible. We can’t cover every case, but I think that the fact that we didn’t experience customer-facing outages today is a testament to that. Again, I want to stress that we do expect Amazon to cause us (rare) outages in the future, and that’s unavoidable, but today we dodged that bullet.
- Amazon’s communication about this has been terrible. It took far too long to acknowledge the problem. Fixing a major problem can take forever, which is understandable, but communicating with your customers should happen very rapidly. Amazon’s culture, internally, is very customer focused, so this is a strange anomaly. I will definitely lean on them some about it, and everyone who was affected should rightfully howl too.
- I’ve asked Amazon repeatedly for an “Amazon Web Services Health” page that shows the current expected state of all their services. Then you can tell at a glance (and even poll and work into your own monitoring) whether any of the services are having problems. Something like Keynote’s Internet Health Report would be a good start, but as Jesse Robbins points out, trust.salesforce.com is the gold standard. This page could also double as a mechanism to let customers know what’s being worked on and current ETAs when there are problems.
I’ll try to post a follow-up about why we weren’t affected when I know more. It’s possible that some of the reasons we survived was due to some of our “secret sauce” and I just won’t be able to say, but I kinda doubt it.
Bottom line: While the outage was certainly a big deal to those affected, I think the bigger deal here is how Amazon handled the outage. They need to communicate better about these mission critical services and their health.
If I didn’t answer any questions you’d like me to answer, please post a comment and/or send me an email. I’ll do my best to respond.
UPDATE 1: I’m not sure why there’s all this confusion, but SmugMug *does* use Amazon as our primary data store. We maintain a small “hot cache” in our datacenters of frequently/recently viewed photos and videos, but there are massive numbers of them that are only at Amazon. This is a change from our initial usage of S3, and the change is based on how reliable they’ve been. Yes, we still consider them to be very reliable even after an outage like this. And yes, I suspect our “hot cache” did at least partially enable us to ride out this issue.