S3 outage – We weren't affected | SmugMug's Don MacAskill

February 15, 2008, by Don MacAskill

Amazon S3 had an outage today. The first I knew of it was when reporters started emailing and calling to ask whether it had knocked us out.

We weren’t. No customers reported issues, and our systems were all showing typically low and acceptable error rates. To be honest, I’m surprised.

I wasn’t going to blog about it until I understood why we weren’t affected, but I’m really getting inundated with requests now, so I figured this would be a good way to optimize my time rather than spending all day on the phone. 🙂

We’re researching what happened now, but again, I didn’t know about the outage until after it was over, and I haven’t spoken to anyone at Amazon yet. Until I finish my research and speak with Amazon, I’m not going to speculate on what may have happened or why.

I can say, once again, that we pay the same rates everyone else pays and that, other than some early access to upcoming beta services, we don’t get any preferential treatment that I’m aware of.

Some thoughts, though:

We expect Amazon to have outages. No website I’m aware of doesn’t, whether it’s Google, Amazon, your bank, or SmugMug.
I’ve written about Amazon S3 outages in the past, but in the last ~12 months, we’ve only seen a single ~2 minute outage window (on January 22nd, 2008 at around 4:38pm Pacific). We also had one recent fairly major hiccup with EC2.
Yes, I believe there will probably be times where SmugMug is seriously affected, possibly even offline completely, because Amazon (or some other web services provider) is having problems. Today wasn’t that day. Nobody likes outages, especially not us, but we’ve decided the tradeoffs are worth it. You should have your eyes wide open when you make the decision to use AWS or any other external service, though. It will fail from time to time.
We’ve done our best to build SmugMug in such a way that we handle failures as gracefully as possible. We can’t cover every case, but I think that the fact that we didn’t experience customer-facing outages today is a testament to that. Again, I want to stress that we do expect Amazon to cause us (rare) outages in the future, and that’s unavoidable, but today we dodged that bullet.
Amazon’s communication about this has been terrible. It took far too long to acknowledge the problem. Fixing a major problem can take forever, which is understandable, but communicating with your customers should happen very rapidly. Amazon’s culture, internally, is very customer focused, so this is a strange anomaly. I will definitely lean on them some about it, and everyone who was affected should rightfully howl too.
I’ve asked Amazon repeatedly for an “Amazon Web Services Health” page that shows the current expected state of all their services. Then you can tell at a glance (and even poll and work into your own monitoring) whether any of the services are having problems. Something like Keynote’s Internet Health Report would be a good start, but as Jesse Robbins points out, trust.salesforce.com is the gold standard. This page could also double as a mechanism to let customers know what’s being worked on and current ETAs when there are problems.
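
To make the "poll and work into your own monitoring" idea concrete, here is a minimal sketch in Python. The health feed is hypothetical: the JSON shape (a `services` list with `name` and `status` fields), the `"ok"` status string, and the function names are all assumptions made up for illustration, not anything Amazon publishes.

```python
import json

# Hypothetical status value; a real health page would define its own.
HEALTHY = "ok"


def unhealthy_services(status_json):
    """Return the sorted names of services whose status is not HEALTHY.

    `status_json` is the text of an assumed JSON document shaped like
    {"services": [{"name": "s3", "status": "ok"}, ...]}.
    """
    report = json.loads(status_json)
    return sorted(
        svc["name"]
        for svc in report.get("services", [])
        if svc.get("status") != HEALTHY
    )


def poll_once(fetch):
    """Fetch the health document via `fetch()` and summarize it.

    `fetch` is any zero-argument callable returning the JSON text, for
    example a small wrapper around urllib.request.urlopen. If the fetch
    itself fails, report the provider as unreachable instead of healthy.
    """
    try:
        body = fetch()
    except OSError:
        return {"reachable": False, "unhealthy": []}
    return {"reachable": True, "unhealthy": unhealthy_services(body)}
```

A monitoring job could call `poll_once` on a timer and raise an alert whenever `reachable` is false or `unhealthy` is non-empty. Passing the fetch step in as a callable keeps the parsing logic easy to test without a network.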

I’ll try to post a follow-up about why we weren’t affected when I know more. It’s possible that some of the reasons we survived are part of our “secret sauce” and I just won’t be able to say, but I kinda doubt it.

Bottom line: While the outage was certainly a big deal to those affected, I think the bigger deal here is how Amazon handled the outage. They need to communicate better about these mission critical services and their health.

If I didn’t answer any questions you’d like me to answer, please post a comment and/or send me an email. I’ll do my best to respond.

UPDATE 1: I’m not sure why there’s all this confusion, but SmugMug *does* use Amazon as our primary data store. We maintain a small “hot cache” in our datacenters of frequently/recently viewed photos and videos, but there are massive numbers of them that are only at Amazon. This is a change from our initial usage of S3, and the change is based on how reliable they’ve been. Yes, we still consider them to be very reliable even after an outage like this. And yes, I suspect our “hot cache” did at least partially enable us to ride out this issue.
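
As a rough illustration of why a hot cache can mask an outage in the primary store, here is a minimal read-through cache sketch in Python. It is not SmugMug's implementation: the `HotCache` class, the `fetch_from_primary` callable, and the least-recently-used eviction policy are all assumptions chosen for the example.

```python
from collections import OrderedDict


class HotCache:
    """A small LRU cache in front of a slower, remote primary store."""

    def __init__(self, fetch_from_primary, capacity=1000):
        # `fetch_from_primary(key)` is a hypothetical callable that reads
        # the object from the remote store and raises OSError on failure.
        self._fetch = fetch_from_primary
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key in self._items:
            # Cache hit: mark as most recently used. The primary store is
            # not contacted, so an outage there does not affect this read.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: this read fails if the primary store is down.
        value = self._fetch(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value
```

In this sketch, reads of recently viewed objects keep working while the primary store is unavailable, and only cache misses surface the error. That matches the "at least partially" caveat: anything outside the cache still depends on the primary store being up.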

Categories: amazon
Tags: amazon, aws, ec2, outage, s3, smugmug, storage, web services

Comments (57)

Brad

February 15, 2008 at 12:52 pm

My understanding is that you simply use S3 as backup storage, so as long as you are queueing your syncs you should be able to handle fairly long outages on their side. I have never noticed an image URL on your site that is actually coming from S3.
Don MacAskill

February 15, 2008 at 12:58 pm

@Brad:

Sorry for the confusion, but S3 is our primary data store, not just backup.
Brad

February 15, 2008 at 1:07 pm

I’ve worked with S3 a bit. How are you serving up images from an http://www.smugmug.com address? Do you house them locally for a certain amount of time until they get committed?
erik

February 15, 2008 at 1:12 pm

Hey Don – I’m glad that you (and my photos) weren’t affected by this downtime. I would echo your suggestion that AWS implement some sort of status page. How about leading by example, though? In this dgrin post, I recommended creating a Smugmug “status” blog that’s hosted in a different DC than the rest of your infrastructure. To my knowledge, no one responded to that suggestion.

Sure, outages are usually reported in Dgrin – that’s not an ideal situation, though, IMHO. If you implemented a status blog, it would make it very simple for your operations staff to post quick updates as they work through problems as well as enabling customers to subscribe to the blog’s RSS feed and get regular updates that way.

Thoughts?
Don MacAskill

February 15, 2008 at 1:18 pm

@erik:

I heard you loud and clear during our last outage, and we now have an offsite location to post updates: http://smugmug.wordpress.com/

We just haven’t had an outage since then to use the blog on – but we will!
erik

February 15, 2008 at 1:20 pm

@Don:

Very nice. I’m already subscribed.

Thanks and keep up the great work!
Matt Johnson

February 15, 2008 at 2:32 pm

@Don:
I have subscribed also… thanks for letting us know about the site.
Jorge Oliveira

February 15, 2008 at 4:39 pm

According to Amazon, only one out of three S3 server locations was affected. We (gladly) did not experience any outages today but we are sure, as you stated, that inevitably one day we will 🙂
bd_

February 15, 2008 at 5:14 pm

Speaking of S3, I was wondering if that article on your EC2 use you hinted at some time ago was coming soon… I’d be interested in hearing how you use it (it seems to me like you wouldn’t have enough resizing load to justify it… but I suppose I must be wrong :))
Arg

February 16, 2008 at 3:15 am

Well, I guess we were lucky enough to have our files at the one that went down. Yippee. Our customers certainly appreciated it. There’s sure nothing like waking up to think all your files are gone to start a day right. Don’t worry about us, though; our reputation will be fine. About the time it’s back to where it was before this, they’ll poop again and it’ll be right back down in the trash.

I’m sure your paying customers will appreciate your sentiments around the fact that Amazon is down less than you would have been if you did your own hosting. Does telling them that really work? If so, I want your customers, because ours are completely the opposite. Of course, yours put up images for fun. Not every customer of S3 is doing this for jollies. Not to lessen what you are doing, but some things are more mission critical. Maybe shame on us then for having an alternative. Have you looked into this, Don? Why not send images to another S3-like service? Then at least a simple DNS change or URL filter could be used. It doubles the cost of storage, but if Amazon ever goes down for, say, a day, it’s going to be a hell of a bad day.
SHG

February 16, 2008 at 5:22 pm

Don, the reason that people are under the impression that S3 is only secondary storage for Smugmug is this post from you on 12 August 2006:

I don’t feel like we “bet the company” on S3 – every photo our customers entrust us with, we keep local copies in our existing distributed storage infrastructure. We use S3 as redundant secondary storage for use in cases of outages, data loss, or other catastrophe.
Don MacAskill

February 16, 2008 at 8:16 pm

@SHG:

That’s so strange to me. Do people still really believe that humans will never fly because a lot was written a few centuries ago about how impossible it is? 🙂

That blog post is ancient history. The Wright Brothers showed us how to fly, and SmugMug has moved on with S3. 🙂

Guess I’d better update those old posts, but it seems silly to me that people wouldn’t do a little research rather than trusting some old blog post….
whoALSE

February 18, 2008 at 6:00 am

No wonder I had some problems uploading at times on Friday (Aus), and there was an incident where I successfully uploaded 5 pics and they didn’t appear in the gallery.

Keep us posted. Thanks
TimC

February 18, 2008 at 5:17 pm

@Arg: if you’re hosting mission critical data on something that openly states they only have 99.9% uptime, then you should be fired for incompetence. 99.9% uptime is NOT mission critical. If the service was advertising 5 9’s, you might have something to gripe about. Right now, you’re basically complaining that you’re having troubles getting a screw out of a board with a hammer.

@Don: If everything is on S3 now, how are there 3 copies of every photo? If it’s all sitting on S3, I guess that scares me. You’re basically entrusting them to never have catastrophic data loss… and I guess I don’t trust any one company that much 😀
Jorge Oliveira

February 19, 2008 at 9:36 am

@TimC: S3 has 3 different server locations.
TimC

February 19, 2008 at 10:39 pm

@Jorge:

They could have 10 different server locations; if they have a failure in their software that results in data loss, it doesn’t really matter how many different locations the failure occurs at.
Jorge Oliveira

February 20, 2008 at 3:34 am

@TimC: you asked “how are there 3 copies of every photo?”
TimC

February 20, 2008 at 11:35 pm

^^Which is why I originally asked, when it was quoted, whether they are now using only S3, as I read it to be.
teki

March 3, 2008 at 9:44 pm

“That’s so strange to me. Do people still really believe that humans will never fly because a lot was written a few centuries ago about how impossible it is?”

That was the reason why I chose Smugmug, and I cannot find the announcement about the change :(.

It ultimately means that I am now paying only for a gallery and Smugmug resells S3 storage to host my photos.

Are you planning any discount/light packages with external S3 accounts? (I mean I am purchasing only the use of the software from Smugmug and the storage from S3)
Hikaye

September 26, 2008 at 6:05 am

thank you
industryfinest

November 19, 2008 at 8:33 pm

I have subscribed
XAЛAШKA

December 24, 2008 at 9:08 am

Yes, such is our modern world, and I'm afraid nothing can be done about it :)
JagreeMagedew

January 5, 2009 at 9:05 am

Good blog
HARRIENCE

January 5, 2009 at 11:26 am

I think this technique is already outdated; there are newer methods.
Aleksander

January 13, 2009 at 2:08 pm

Thanks, I got very interested. Will there be more like this?
Aleksander

January 13, 2009 at 3:11 pm

Something new. Keep writing, I really like it.
botestams

January 16, 2009 at 2:06 pm

Great news, keep it up, and good luck in the future.
Aleksander

January 18, 2009 at 1:57 pm

I'm ready to put your link on my site; I really liked your material.
botestams

January 23, 2009 at 7:07 pm

Should we expect an update?
botestams

January 23, 2009 at 7:24 pm

I can offer a lot of info on this topic. Do you need it?
botestams

January 26, 2009 at 3:51 am

How often do you publish news on this subject?
stitoence

January 28, 2009 at 10:57 pm

Super. Thanks, I've been looking for this material for so long. Huge respect to the author. I'll never forget it now.
Audineappeday

February 1, 2009 at 9:18 am

I love this blog. I had been thinking about this myself at some point.
AlexjooolkT

February 1, 2009 at 12:01 pm

The information is very well chosen. When will there be an update?
MoSuuns

February 2, 2009 at 12:35 pm

Ah, this crisis is ruining everything for us.
Nilofirond

February 2, 2009 at 2:19 pm

Frankly, that's what I thought; so this is what everyone keeps going on about. Well, how about that.
Agrubdokoho

February 4, 2009 at 4:03 am

The material really interested me. What's the source? I'd read more on this.
AnnaMakarovaa

February 4, 2009 at 5:04 pm

Who can help me understand this in more detail?
Зиновий

February 4, 2009 at 6:16 pm

I write about this sort of thing on my blog too, only about movies.
AlexAnderGG

February 5, 2009 at 1:52 pm

Congratulate me, my son has been born!
JagStyleR

February 7, 2009 at 3:40 pm

Guys, can anyone tell me where I can find out more?
AlexaStyleX

February 8, 2009 at 11:31 am

Congratulations to everyone on the upcoming holiday!!!
GivistaHedKo

February 10, 2009 at 11:07 am

Wow. What a good blog. I want to post news here. How can that be done? Thanks.
HouserrLiv

February 11, 2009 at 4:31 am

Congratulations to everyone on the upcoming holiday; I wish you all the best.
GorohoffFre

February 12, 2009 at 9:22 am

Good thing I saw this information. Your posts are very useful to read.
DoronLinFo

February 12, 2009 at 9:54 am

You always publish only the best information. Your blog is simply super. Thanks.
FregtorKo

February 12, 2009 at 11:02 am

The best info I've read in the last few days. Very relevant. Thanks.
GoptotOpG

February 12, 2009 at 1:25 pm

Decent news. I'll be reading your news regularly.
PacMaaan

February 12, 2009 at 4:30 pm

Could you please tell me more specifically where I can look into this topic and find a detailed description?
pete

February 15, 2009 at 4:41 am

Add your RSS in my reader.
Sahok

February 20, 2009 at 11:32 am

Very nice site!
Lexer

February 23, 2009 at 3:17 pm

Your site has run out of traffic.
Angellaa

February 23, 2009 at 5:14 pm

Hmm, very cognitive post.
Is this theme good enough for the Digg?
kilo

March 2, 2009 at 11:29 am

Very good post! Thanks for the work you put in!
лямик

March 6, 2009 at 7:14 pm

Great little site you have. It has its own style. I just throw mine together any old way, and nobody reads me except suckers.
ВебПолитолог

March 8, 2009 at 4:53 am

Happy March 8th to all the women readers of blogs.smugmug.com!!!
mod converter

November 6, 2009 at 3:59 pm

Love it! You got me so excited to get one and start shooting video!