
Amazon S3: Outages, slowdowns, and problems

January 30, 2007

First of all, I’m giving a session on Amazon web services (with S3 being the main focus, with a little EC2 and other service love thrown in) at ETech this year. I’ll post a PDF or something of my slides here when I’m done, but if you’re really interested in this stuff, you might want to stop by. Wear some SmugMug gear and I’ll comp your account. 🙂

UPDATE: I’ve posted a call for topics you’re interested in hearing at ETech or in the resulting PDF. Let me know.

So there’s been some noise this month about S3 problems, and I’ve been getting requests about what we do when Amazon has problems and why our site is able to stay up when they do. I’m happy to answer as best I can, and I’d like to remind everyone that I’m not paid by Amazon – it’s the other way around. I pay them a lot of money, so I expect good service. 🙂 That being said, I think they’re getting too much heat, and I’ll explain why.

First, let's define the issues. During our history with Amazon S3 (since April of 2006), we've experienced four large problems. The first two were catastrophic outages: they involved core network switch failures and caused everything to die for 15-30 minutes. And by everything, I mean Amazon.com itself was offline, at least from my network view. (Due to DNS caching, even GSLB'd sites can look "down" to part of the world while remaining "up" to other parts; I don't know whether that was the case during these two incidents.) We've had core network switch failures here at SmugMug, too, and they're almost impossible to prevent.

The other two were performance-related. Not outages, because the service still functioned, but massively slower than we were used to. In the first case, which happened right as the BusinessWeek cover article hit newsstands and during the Web 2.0 Summit, our customers were at our gates with pitchforks and torches. Our paying customers were affected and they could tell there was something wrong. Not good.

The second time, though, was in early January, and our customers had no idea. I emailed the S3 team to let them know we were seeing issues, flipped a switch in our software, and we were fine.

So what was the difference? We've been playing with using Amazon in a variety of different roles and scenarios at SmugMug. At first, we were just using them as a backup copy. That provided some great initial savings and a great deal of customer satisfaction as our customers became aware that their photos were safer than ever. As time went on and we grew more confident in Amazon's ability to scale and keep their systems reliable, though, we moved Amazon into a more fundamental role at SmugMug and experimented with using them as primary storage. The week we started that experiment coincided with the first of the two performance issues, which shone a glaring light on the downsides of using them in this way. We quickly shifted gears and are now quite happy with our current architecture, both from a cost view and a reliability view.

So what are we doing differently? Simple. Amazon serves as “cold storage” where everyone’s valuable photos go to live in safety. Our own storage clusters are now “hot storage” for photos that need to be served up fast and furious to the millions of unique visitors we get every day. That’s a bit of an oversimplification of our architecture, as you can imagine, but it’s mostly accurate. The end result is that performance problems with S3 are mostly buffered and offset by our local storage, and even outages are mostly properly handled while resyncing after the outage passes. For the curious, this architecture reduces our actual physical disk usage in our own datacenters by roughly 95%.
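The hot/cold split above can be sketched as a simple tiered read path: serve from local "hot" storage when possible, fall back to S3 on a miss, and warm the hot tier on the way back. This is a hypothetical illustration, not SmugMug's actual code; all class and method names are made up.

```python
# Hypothetical sketch of a hot/cold tiered read path.
# "hot" = fast local storage cluster; "s3" = Amazon S3 archival copy.

class PhotoStore:
    def __init__(self, hot, s3):
        self.hot = hot    # local cluster: serves the vast majority of reads
        self.s3 = s3      # cold storage: every photo lives here in safety

    def read(self, photo_id):
        data = self.hot.get(photo_id)
        if data is not None:
            return data                   # common case: fast local hit
        data = self.s3.get(photo_id)      # miss: fetch from cold storage
        self.hot.put(photo_id, data)      # warm the hot tier for next time
        return data
```

Because reads rarely touch S3 directly, an S3 slowdown mostly just slows the occasional cache miss rather than the whole site.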

Further, we also have the ability to target specific Amazon S3 clusters. In January, we noticed that their West Coast cluster seemed to be performing more slowly than their East Coast cluster, even though we're on the West Coast, so we toggled our primary endpoint to the East Coast for a while. This is the switch I mentioned earlier that I flipped, and it worked out beautifully.
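The "switch" amounts to a single piece of configuration that all S3 requests read their endpoint from. A minimal sketch, assuming hypothetical endpoint hostnames (the real S3 endpoints differ):

```python
# Hypothetical endpoint toggle; hostnames are placeholders, not real S3 endpoints.

S3_ENDPOINTS = {
    "west": "s3-west.example.amazonaws.com",
    "east": "s3-east.example.amazonaws.com",
}

class S3Config:
    def __init__(self, primary="west"):
        self.primary = primary

    @property
    def endpoint(self):
        # every new S3 request looks this up, so one flip redirects all traffic
        return S3_ENDPOINTS[self.primary]

    def flip(self, region):
        self.primary = region

cfg = S3Config()          # normally pointed at the nearby cluster
cfg.flip("east")          # one config change when the primary is slow
```

Keeping the endpoint behind a single lookup like this is what makes "flip a switch and we were fine" possible with no code deploy.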

Now, though, I think we come to the real meat of the problem. Are we upset about Amazon’s issues? Do we regret using them? Are we looking elsewhere? Absolutely not, and here’s why:

I can’t think of a particular vendor or service we use that doesn’t have outages, problems, or crashes. From disk arrays to networking gear, everything has bad days. Further, I can’t think of a web site that doesn’t, either. It doesn’t matter if you’re GMail or eBay, you have outages and performance problems from time to time. I knew going into this that Amazon would have problems, and I built our software and our own internal architecture to accommodate occasional issues. This is the key to building good internet infrastructures anyway. Assume every piece of your architecture, including Amazon S3, will fail at some point. What will you do? What will your software do?
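Concretely, "assume every piece will fail" means every fetch needs an answer to "then what?": retry with a little backoff, then fall back to another tier. A minimal sketch of that pattern, with placeholder fetch functions standing in for real storage calls:

```python
# Minimal retry-with-fallback pattern for "assume it will fail" design.
# primary_fetch / fallback_fetch are placeholders for real storage calls.

import time

def fetch_with_fallback(photo_id, primary_fetch, fallback_fetch,
                        retries=2, delay=0.1):
    for attempt in range(retries):
        try:
            return primary_fetch(photo_id)
        except IOError:
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    # primary is having a bad day: serve from the other storage tier
    return fallback_fetch(photo_id)
```

The point isn't this particular function; it's that the failure path is written down in advance instead of discovered during an outage.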

Amazon does need to get better about communicating with their customers. They need a page showing the health of their systems, proactive notification of major issues, a 24/7 contact method, and so on. I'm on their Developer Advisory Council, and believe me, they know about these issues. I'm sure they're working on them.

To put things into perspective, we have vendors we pay hundreds of thousands of dollars each year that seem incapable of providing us with decent support. Amazon is hardly unique in pairing a great product with average support; ask nearly anyone in IT and you'll find that's far more common in our industry than it should be.

Finally, S3 is a new service and yet remarkably reliable. Since April 2006, they've been more reliable than our own internal systems, which I consider to be quite reliable. Nothing's perfect, but they're doing quite well so far for a brand-new service. Oh, and their service has also saved our butts a few times. I'll try to write those up in the near future, too.


See you at ETech!

Categories: amazon, business, smugmug, web 2.0
  1. January 30, 2007 at 2:58 pm

    Great post and timing. As a self-funded startup we attempted to move all of our back-end file storage to S3 largely based on SmugMug as the poster-child. I love the concept of S3 and I’m also a big fan of SmugMug. We encountered quite a few issues in January with respect to performance and latencies (some of which were clearly S3 issues, some were the underlying library we were using that connected to it). In the end we decided that we weren’t willing to bet our company on S3 at this time and have rolled back to local storage.

    We are definitely looking at it for backups and other non-essential tasks going forward.

    BTW, I love the new pure javascript interface…kudos to your dev team for thinking outside the box.

  2. January 30, 2007 at 8:19 pm

So are you implementing some type of ILM-esque logic to "cache" recent or frequently requested photos in primary storage while keeping the 95% of photos that almost never get looked at in your tier-2 storage? If so, it's a great idea. But you knew that. 🙂

  3. J. S.
    January 30, 2007 at 11:36 pm

Do you have a date for the early January slowness? I'm curious to see if it coincides with us moving a large amount of traffic there. It was enough for them to notice and call us.

  4. January 31, 2007 at 2:14 am


    I believe we noticed something on December 28th, and some other S3 people talked about issues on January 4th. I’ll bet they were related, but I’m not positive.

  5. January 31, 2007 at 11:04 am


    Thanks. Yea, it’s working great so far. 🙂

  6. Stephen
    January 31, 2007 at 11:18 am

I apologize, I'm new to all the ins and outs of Amazon S3, but I'm curious: how are you guys targeting actual datacenters (geographical clusters)?

  7. J. S.
    January 31, 2007 at 2:15 pm

Yep, Jan 4th. Interesting. Things have been OK since then, so hopefully it was just temporary while they adjusted to our extra traffic. I don't do anything on the tech side, so I don't know whether we go into their west coast cluster or not; we're in northern California.

  8. Geoff
    January 31, 2007 at 2:47 pm

    Ever talk to Jason over at Joyent?

  9. J.S.
    January 31, 2007 at 5:36 pm

    Ok, we’re on the west coast cluster but they’ve done some analysis and we’ll be moving to the east coast cluster. I guess we’re not cool enough like you guys to be able to use both at the flip of a switch. 😛

  10. January 31, 2007 at 5:40 pm


    Love to chat via email and compare notes, since it sounds like you’re high-volume like we are. Drop me a line if you get a second: don at smugmug


  11. April 10, 2007 at 6:32 am

Hi, everybody

  12. Ashwin
    February 28, 2008 at 3:14 am

    Hi Don,
    I’m working on a community portal for language learning and I’m evaluating using S3 as a CDN for our media content. I was wondering if you’d be willing to shed a little more light into how you’re able to switch between specific S3 clusters. Part of our requirement is doing a bit of traffic shaping to allow our users to access content from a node as close to them as possible.

    Any insight into this would be much appreciated.

Comments are closed.