Amazon S3: Outages, slowdowns, and problems
First of all, I’m giving a session on Amazon web services (with S3 being the main focus, with a little EC2 and other service love thrown in) at ETech this year. I’ll post a PDF or something of my slides here when I’m done, but if you’re really interested in this stuff, you might want to stop by. Wear some SmugMug gear and I’ll comp your account. 🙂
UPDATE: I’ve posted a call for topics you’re interested in hearing at ETech or in the resulting PDF. Let me know.
So there’s been some noise this month about S3 problems, and I’ve been getting requests about what we do when Amazon has problems and why our site is able to stay up when they do. I’m happy to answer as best I can, and I’d like to remind everyone that I’m not paid by Amazon – it’s the other way around. I pay them a lot of money, so I expect good service. 🙂 That being said, I think they’re getting too much heat, and I’ll explain why.
First, let’s define the issues. During our history with Amazon S3 (since April of 2006), we’ve experienced four large problems. The first two were catastrophic outages – they involved core network switch failures and caused everything to die for 15-30 minutes. And by everything, I mean Amazon.com itself was offline, at least from my network view. (Due to DNS caching issues, even GSLB’d sites can look “down” to part of the world while remaining “up” to other parts. I don’t know if this was the case during these two times). We’ve had core network switch failures here at SmugMug, too, and they’re almost impossible to prevent.
The other two were performance-related. Not outages, because the service still functioned, but massively slower than we were used to. In the first case, which happened right as the BusinessWeek cover article hit newsstands and during the Web 2.0 Summit, our customers were at our gates with pitchforks and torches. Our paying customers were affected and they could tell there was something wrong. Not good.
The second time, though, was in early January, and our customers had no idea. I emailed the S3 team to let them know we were seeing issues, flipped a switch in our software, and we were fine.
So what was the difference? We’ve been playing with using Amazon in a variety of roles and scenarios at SmugMug. At first, we were just using them as a backup copy. That provided some great initial savings and a great deal of customer satisfaction as our customers became aware that their photos were safer than ever. As time went on and we grew more confident in Amazon’s ability to scale and keep their systems reliable, though, we moved Amazon into a more fundamental role at SmugMug and experimented with using them as primary storage. The first of the two performance issues hit the very week we started that experiment, and it shone a bright, glaring light on the downsides of using them this way. We quickly shifted gears and are now quite happy with our current architecture, both from a cost perspective and a reliability perspective.
So what are we doing differently? Simple. Amazon serves as “cold storage” where everyone’s valuable photos go to live in safety. Our own storage clusters are now “hot storage” for photos that need to be served up fast and furious to the millions of unique visitors we get every day. That’s a bit of an oversimplification of our architecture, as you can imagine, but it’s mostly accurate. The end result is that performance problems with S3 are mostly buffered and offset by our local storage, and even outages are mostly handled gracefully, with data resyncing once the outage passes. For the curious, this architecture reduces our actual physical disk usage in our own datacenters by roughly 95%.
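For the developers in the audience, the read path boils down to something like the following. This is just a minimal Python sketch of the hot/cold pattern, not our actual code; the helper names and the boto3-style S3 call are my own illustration.

```python
import logging

log = logging.getLogger("photo-serving")

def fetch_photo(photo_id, hot_storage, s3, bucket):
    """Serve from local 'hot' storage first; fall back to S3 'cold' storage."""
    data = hot_storage.get(photo_id)  # fast local clusters absorb most traffic
    if data is not None:
        return data

    # Miss: the photo currently lives only in cold storage, so pull it from S3.
    # (boto3-style call shown here; any S3 client works the same way conceptually.)
    obj = s3.get_object(Bucket=bucket, Key=photo_id)
    data = obj["Body"].read()

    # Re-warm hot storage so subsequent requests never have to touch S3.
    hot_storage.put(photo_id, data)
    return data
```

Because most requests are answered by hot storage, an S3 slowdown only shows up on the (relatively rare) cold-storage fetches.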
Further, we also have the ability to target specific Amazon S3 clusters. In January, we noticed that their West Coast cluster seemed to be performing more slowly than their East Coast cluster, even though we’re on the West Coast, so we toggled our primary endpoint to the East Coast for a while. This is the switch I mentioned earlier that I flipped, and it worked out beautifully.
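Conceptually, that switch is nothing fancy: a table of endpoints and one setting that says which is primary. Here’s a hedged sketch; the endpoint URLs and names below are placeholders I made up, not the exact mechanism we use.

```python
# Which S3 cluster each logical name maps to. URLs are illustrative only.
S3_ENDPOINTS = {
    "west": "https://s3-us-west-1.amazonaws.com",
    "east": "https://s3.amazonaws.com",
}

# Flip this one value (e.g. via a config push) to redirect all new requests.
PRIMARY_CLUSTER = "east"

def s3_url(bucket, key, cluster=None):
    """Build a request URL against whichever cluster is currently primary."""
    base = S3_ENDPOINTS[cluster or PRIMARY_CLUSTER]
    return f"{base}/{bucket}/{key}"

# s3_url("photos", "12345.jpg") -> "https://s3.amazonaws.com/photos/12345.jpg"
```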
Now, though, I think we come to the real meat of the problem. Are we upset about Amazon’s issues? Do we regret using them? Are we looking elsewhere? Absolutely not, and here’s why:
I can’t think of a particular vendor or service we use that doesn’t have outages, problems, or crashes. From disk arrays to networking gear, everything has bad days. Further, I can’t think of a web site that doesn’t, either. It doesn’t matter if you’re GMail or eBay, you have outages and performance problems from time to time. I knew going into this that Amazon would have problems, and I built our software and our own internal architecture to accommodate occasional issues. This is the key to building good internet infrastructures anyway. Assume every piece of your architecture, including Amazon S3, will fail at some point. What will you do? What will your software do?
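To make that concrete, here’s one generic shape the “assume it will fail” mindset can take in code: every call to an external dependency gets a timeout-friendly bounded retry and an explicit fallback path. Purely illustrative; the function names are mine, not SmugMug’s.

```python
import time

def call_with_fallback(primary, fallback, retries=2, backoff=0.5):
    """Try the primary dependency; on repeated failure, degrade gracefully."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                # Exponential backoff before retrying the primary.
                time.sleep(backoff * (2 ** attempt))
    # The primary is having a bad day; answer from the fallback instead of dying.
    return fallback()

# e.g. prefer S3, but fall back to the local hot-storage copy:
# photo = call_with_fallback(lambda: fetch_from_s3(key),
#                            lambda: fetch_from_hot_storage(key))
```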
Amazon does need to get better about communicating with their customers. They need a page showing the health of their systems, proactive notification of major issues, a 24/7 contact method, and so on. I’m on their Developer Advisory Council, and believe me, they know about these issues. I’m sure they’re working on them.
To put things into perspective, we have vendors we pay hundreds of thousands of dollars each year who seem incapable of providing us with decent support. Amazon is not unique in pairing a great product with average support; ask nearly anyone in IT and I think you’ll find that’s far more common in our industry than it should be.
Finally, S3 is a new service and yet remarkably reliable. Since April 2006, they’ve been more reliable than our own internal systems, which I consider to be quite reliable. Nothing’s perfect, but they’re doing quite well so far for a brand-new service. Oh, and their service has also saved our butts a few times. I’ll try to write those up in the near future, too.
See you at ETech!