Archive
Amazon S3: Outages, slowdowns, and problems
First of all, I’m giving a session on Amazon web services (S3 is the main focus, with a little EC2 and other service love thrown in) at ETech this year. I’ll post a PDF or something of my slides here when I’m done, but if you’re really interested in this stuff, you might want to stop by. Wear some SmugMug gear and I’ll comp your account. 🙂
UPDATE: I’ve posted a call for topics you’re interested in hearing at ETech or in the resulting PDF. Let me know.
So there’s been some noise this month about S3 problems, and I’ve been getting requests about what we do when Amazon has problems and why our site is able to stay up when they do. I’m happy to answer as best I can, and I’d like to remind everyone that I’m not paid by Amazon – it’s the other way around. I pay them a lot of money, so I expect good service. 🙂 That being said, I think they’re getting too much heat, and I’ll explain why.
First, let’s define the issues. During our history with Amazon S3 (since April of 2006), we’ve experienced four large problems. The first two were catastrophic outages – they involved core network switch failures and caused everything to die for 15-30 minutes. And by everything, I mean Amazon.com itself was offline, at least from my network view. (Due to DNS caching issues, even GSLB’d sites can look “down” to part of the world while remaining “up” to other parts. I don’t know whether that was the case during these two events.) We’ve had core network switch failures here at SmugMug, too, and they’re almost impossible to prevent.
The other two were performance-related. Not outages, because the service still functioned, but massively slower than we were used to. The first, which happened right as the BusinessWeek cover article hit newsstands and during the Web 2.0 Summit, had our customers at our gates with pitchforks and torches. Our paying customers were affected, and they could tell something was wrong. Not good.
The second time, though, was in early January, and our customers had no idea. I emailed the S3 team to let them know we were seeing issues, flipped a switch in our software, and we were fine.
So what was the difference? We’ve been playing with Amazon in a variety of roles and scenarios at SmugMug. At first, we used them just as a backup copy. That provided some great initial savings and a great deal of customer satisfaction as our customers became aware that their photos were safer than ever. As time went on and we grew more confident in Amazon’s ability to scale and keep their systems reliable, we moved Amazon into a more fundamental role at SmugMug and experimented with using them as primary storage. The week we started that experiment coincided with the first of the two performance problems, which shone a glaring light on the downsides of using S3 this way. We quickly shifted gears and are now quite happy with our current architecture, from both a cost view and a reliability view.
So what are we doing differently? Simple. Amazon serves as “cold storage” where everyone’s valuable photos go to live in safety. Our own storage clusters are now “hot storage” for photos that need to be served up fast and furious to the millions of unique visitors we get every day. That’s a bit of an oversimplification of our architecture, as you can imagine, but it’s mostly accurate. The end result is that performance problems with S3 are mostly buffered and offset by our local storage, and even outages are mostly properly handled while resyncing after the outage passes. For the curious, this architecture reduces our actual physical disk usage in our own datacenters by roughly 95%.
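For the technically curious, the hot/cold split boils down to a read path like the sketch below. This is a deliberately simplified illustration with dict-backed stores – the class and names are mine for this post, not our actual code:

```python
class TieredPhotoStore:
    """Simplified sketch of hot/cold tiering: serve from fast local
    ("hot") storage when possible, fall back to durable cold storage
    (S3, in our case) on a miss, and warm the hot tier on the way out."""

    def __init__(self, hot, cold):
        self.hot = hot      # dict-like fast local store
        self.cold = cold    # dict-like durable remote store

    def put(self, key, data):
        # Writes land in both tiers: cold for durability, hot for speed.
        self.cold[key] = data
        self.hot[key] = data

    def get(self, key):
        data = self.hot.get(key)
        if data is None:
            # Hot miss: fetch from cold storage and re-warm the hot
            # tier, so a cold-store slowdown is paid at most once.
            data = self.cold[key]
            self.hot[key] = data
        return data
```

The point is that the hot tier absorbs nearly all of the read traffic, which is why an S3 slowdown is mostly invisible to visitors.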
Further, we also have the ability to target specific Amazon S3 clusters. In January, we noticed that their West Coast cluster seemed to be performing more slowly than their East Coast cluster, even though we’re on the West Coast, so we toggled our primary endpoint to the East Coast for a while. This is the switch I mentioned earlier that I flipped, and it worked out beautifully.
Now, though, I think we come to the real meat of the problem. Are we upset about Amazon’s issues? Do we regret using them? Are we looking elsewhere? Absolutely not, and here’s why:
I can’t think of a particular vendor or service we use that doesn’t have outages, problems, or crashes. From disk arrays to networking gear, everything has bad days. Further, I can’t think of a web site that doesn’t, either. It doesn’t matter if you’re GMail or eBay, you have outages and performance problems from time to time. I knew going into this that Amazon would have problems, and I built our software and our own internal architecture to accommodate occasional issues. This is the key to building good internet infrastructures anyway. Assume every piece of your architecture, including Amazon S3, will fail at some point. What will you do? What will your software do?
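In code, “assume it will fail” mostly means retries plus a fallback – much like the endpoint switch I described above. Here’s a rough sketch; the endpoint names and the injected `fetch` function are made up for illustration, not real Amazon hostnames or our production code:

```python
import time

# Hypothetical endpoint list, ordered by preference.
ENDPOINTS = ["s3-west.example.com", "s3-east.example.com"]

def fetch_with_fallback(key, fetch, endpoints=ENDPOINTS, retries=2):
    """Try each endpoint in order, retrying transient failures,
    on the assumption that any single piece of the architecture
    (including S3 itself) can fail at any time."""
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries):
            try:
                return fetch(endpoint, key)
            except IOError as err:
                last_error = err
                time.sleep(0.1 * (2 ** attempt))  # simple backoff
    # Every endpoint failed every attempt: surface the last error.
    raise last_error
```

Swapping the order of `ENDPOINTS` is, in effect, the “switch” being flipped.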
Amazon does need to get better about communicating with their customers. They need a page showing the health of their systems, proactive notification of major issues, a 24/7 contact method, and so on. I’m on their Developer Advisory Council, and believe me, they know about these issues. I’m sure they’re working on them.
To put things into perspective, we have vendors we pay hundreds of thousands of dollars each year that seem incapable of providing us with decent support. Amazon is not unique in pairing a great product with average support. Ask nearly anyone in IT, and I think you’ll find that’s far more common in our industry than it should be.
Finally, S3 is a new service and yet remarkably reliable. Since April 2006, they’ve been more reliable than our own internal systems, which I consider to be quite reliable. Nothing’s perfect, but they’re doing quite well so far for a brand-new service. Oh, and their services have also saved our butts a few times. I’ll try to write those up in the near future, too.
Other Amazon articles:
- See you at ETech!
Amazon S3: Show me the money
UPDATE 4/30/07: This post was written in November 2006, so these numbers are a little out of date. It’s now been 12 months and we’ve saved almost exactly $1M. You can see the most recent numbers, as of April 2007, in my ETech slides.
I still have some more Web 2.0 Summit stuff to write up if I get a few minutes today, but let me talk about Amazon’s S3 for a minute. At the conference, I was chatting with Michael Arrington of TechCrunch fame (who perfectly handled a blogosphere mini-explosion last week, I thought) and we got to talking about S3. He was impressed with how we were using it, but joked that our $500K saved number sounded like “complete bullsh*t”. I laughed along with him and assured him it was true, but on the way home I got to thinking that it IS a really big number to throw out there without details.
So here are the cold hard facts:
- Our estimate, as you can see in BusinessWeek’s cover story, is that we’re saving $500K per year. We’ve been using S3 for almost 7 months so far (we launched it on or around April 14th), so for my $500K estimate to be in the right ballpark, we should be somewhere near $291K saved to date (well, we don’t grow linearly, so less than that … but let’s do easy math, shall we?).
- We had roughly 64,000,000 photos when we launched S3. We now have close to 110,000,000 photos. Yes, that’s ~72% growth in 7 months.
- To sustain our pre-S3 growth, we were buying roughly $40,000 per month in hard disks plus servers to attach them to. We’re not talking about EMC or other over-priced storage solutions. We’re talking about single-processor commodity Pentium 4 servers attached to really cheap Apple Xserve RAID arrays. Not quite off-the-shelf IDE disks, but once you factor in the reliability and manageability, the TCO comes out in a similar ballpark (we’ve done it both ways).
- If you’re doing the math at home, $40K may seem a little high until you realize how our architecture works: We use RAID-5, with hot spares, and we have two entirely separate storage clusters. That means we have to buy 1.4TB of raw disk to store an actual 500GB.
- To sustain our current, Nov 2006 growth rate, we’d need to buy more like ~$80K per month. Let’s assume over the 7 months, it ramped from $40K to $80K linearly (it was actually more of a curve, but this makes the math easier). $40K + $46K + $53K + $60K + $66K + $73K + $80K = $418K
- Our datacenter space, power, and cooling costs for those arrays come to ~$1.36/month for every $100 of storage (~$544/month @ $40K, ramping to ~$1,088/month @ $80K). $544 + $626 + $721 + $816 + $898 + $993 + $1088 = $5,686.
- It’s cost us some manpower to move everything up to S3. So while I expect to save money on manpower in the long run, currently it’s probably break even – I don’t have to install, manage and maintain new hardware, but I’ve had to copy more than 100TB up to Amazon. (We’re not done copying old data up yet, either)
- Total amount NOT spent over the last 7 months: $423,686
- Total amount spent on S3: $84,255.25
- Total savings: $339,430.75
- That works out to $48,490 / month, which is $581,881 per year. Remember, though, our rate of growth is high, so over the remaining 5 months, the monthly savings will be even greater.
- These are real, hard numbers after using S3 for 7 months, not projections. They closely match (and are actually slightly better than) our projections.
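For anyone who wants to check the math at home, here it is in a few lines of Python. The monthly figures come straight from the bullets above; the drive layout in the last step is one plausible way to arrive at the 2.8x raw-to-usable ratio, shown for illustration rather than as a spec of our exact shelves:

```python
# Hardware we would have bought, month by month (ramping $40K -> $80K),
# plus the corresponding datacenter space/power/cooling costs.
hardware = [40_000, 46_000, 53_000, 60_000, 66_000, 73_000, 80_000]
datacenter = [544, 626, 721, 816, 898, 993, 1088]

not_spent = sum(hardware) + sum(datacenter)   # total NOT spent
s3_bill = 84_255.25                           # total spent on S3
savings = not_spent - s3_bill
annualized = round(savings / 7 * 12)

# One plausible route to the raw-vs-actual ratio: per shelf, 7 drives =
# 5 data + 1 parity + 1 hot spare, times 2 independent clusters.
raw_per_usable = 2 * (7 / 5)   # ~2.8, i.e. 1.4TB raw per 500GB stored
```

Running the sums reproduces the $418K hardware figure, the $423,686 total, the $339,430.75 savings, and the $581,881 annualized rate quoted above.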
So there you have it.
But wait! It gets even better! Because of the stupid way tax law operates in this country, had I spent that $423K buying drives, I would still have had to pay taxes on it (yes, exactly as if the money I spent were profit. Dumb.). That would have meant an additional ~$135K in taxes. Technically, I’d get that back over the next 5 years, so I didn’t want to include it as “savings,” but as you can imagine, the cash flow implications are huge. In a very real sense, the actual cash I’ve conserved so far is about $474,000.
But wait! It gets even better! Amazon has been so reliable over the last 7 months (considerably more reliable than our own internal storage, which I consider to be quite reliable), that just last week we made S3 an even more fundamental part of our storage architecture. I’ll save the details for a future post, but the bottom line is that we’re actually going to start selling up to 90% of our hard drives on eBay or something. So costs I had previously assumed were sunk are actually about to be recouped. We should get many hundreds of thousands of dollars back in cash.
I expect our savings from Amazon S3 to be well over $1M in 2007, maybe as high as $2M.
Perhaps most important, though, is the difficult-to-quantify time, effort, and mental thought we’re saving. We get to spend both that money and all of our extra time and effort on providing a better customer experience and delivering better customer service. Storage was a necessary evil that’s now been nearly removed as a concern.
And I’ll continue to post with more hard details, including our technical architecture and some of our code, as well. And yes, we’re starting to consume other Amazon services like EC2.
Web 2.0 Summit: Jeff Bezos
Jeff Bezos just gave a great presentation and had an interesting chat with Tim O’Reilly here at the Web 2.0 Summit. I’ve written about Amazon’s web services a few times, including the BusinessWeek cover story this week.
In case you don’t want to read the long-winded version, here’s a summary of what I think is really going on here:
- Amazon Web Services isn’t some strange deviation from Amazon’s core business. Instead, it’s an evolution of their business that makes a lot of sense. They’ve learned to scale datacenters well, and companies like ours don’t want to have to learn those same lessons, so we can build on Amazon. Amazon makes money, we save time (which is money) and get to focus on our application, and everyone wins.
- Google gets a lot of press for building a “WebOS” as they release web-based replacements for desktop applications. But they’re really focused on client-side desktop replacements, whereas Amazon is focusing on backend, server-side replacements. It’s less glamorous to the average consumer, but far more glamorous to anyone who needs those services to build their company.
I’m not sure everyone grasps how truly huge this is. I suppose that’s good, since we do and it gives us an edge.
Amazon + Two Guys + $0 = Next YouTube
The next YouTube will be built on Amazon’s Web Services by two guys in a dorm for roughly $0.
BusinessWeek has an article up about Amazon’s push into web services. GigaOM’s got a little coverage, and I see it spreading a bit over on TechMeme and TailRank.
We’re in the article, since we’re a big believer in this “new” vision of Amazon’s. Amazon calls the stuff they’re exposing the “muck” of doing business online, and I think it’s a perfect term. Some people see this as some radical departure from Amazon’s core business, but I don’t at all. Just like much of Amazon’s business, it’s an evolution. They began as a bookstore online (no-one remembers this, but they weren’t the first. BookStacks was relatively huge when Amazon launched), and eventually evolved by adding more and more products. They sell nearly everything, including groceries, now.
Why? Because once they had some of the infrastructure built to sell books, it made sense to add DVDs and CDs. And then once that was built, it made sense to add electronics and video games and gardening supplies and everything else under the sun. Why? Because they had even more infrastructure built. The average person doesn’t appreciate just how difficult the fulfillment piece of Amazon is, from warehousing, inventory control, packing, and shipping, but those who do boggle both at how difficult it is and at how well they do it. May as well leverage that expertise across other product lines where it makes sense, right?
Along the way, they also happened to get extremely good at systems. My father co-founded and successfully ran a direct competitor to Amazon, fatbrain.com, so we got a good, close look at just how good they were. While eBay was having massive outages, Amazon continued to purr along, scaling well and fast. When Toys R Us had a disastrous holiday season one year because they couldn’t scale their systems, who did they call? That’s right, Amazon.
I don’t have any direct knowledge of the chain of events, but I’ll bet it was something like that which caused the initial light bulb to start glowing. Dimly, at first, but glowing nonetheless. The thought process probably went something like this: “Hmm, you know, this letting other businesses like Target, Borders, and Toys R Us build their businesses on ours is turning out to be a good deal. It leverages our existing infrastructure and knowledge to grow our sales. I wonder how we can let other businesses build and grow on ours?”
Enter Amazon Web Services, zShops, Marketplace, and the other programs to let people sell things on Amazon without actually being a part of Amazon. The first Web Service, E-Commerce Services (ECS), allowed anyone to build their own shop online, using their own URL and look-and-feel to sell, say, TVs. But Amazon would handle all the nasty bits of the process, like actually acquiring and shipping the TVs. To use their terminology again, the “muck” of running an online retailer was taken out of the equation – the online TV shop could focus on customer acquisition rather than fulfillment headaches.
From there, it’s really a fairly small step, rather than a giant leap, for Amazon to say “Hey, we really like people building their businesses on ours. What else do we have hiding around here that would help businesses out?” And it turns out the answer revolves around their other core knowledge and infrastructure investment: datacenters, storage, and servers. Just like physical fulfillment, Amazon is one of the few truly experienced web-scale companies in the world. And just like physical fulfillment, the more volume you do, the more efficient you can get and the more you can lower costs. (Assuming you’re talented, that is, which is a large assumption).
As Amazon ships more items, they get better shipping rates. As they buy more bandwidth, they get better bandwidth rates. And that doesn’t even take into account the knowledge, software, and other intangibles that continue to get more precise as they scale, both in their warehouses and their datacenters.
It’s sorta silly that some of this stuff hasn’t become a commodity already. I think if you took a close look at how SmugMug has built and scaled, say, storage and how Flickr or YouTube or any other recent startup has, you’d see we’ve all done it in remarkably similar fashions. We’re all re-inventing and re-building the same wheel, over and over again. And it’s expensive, time consuming, and not core to our value proposition – except that without it, we can’t build our business. In other words, it’s “muck.”
At SmugMug, we want to focus on the customer experience, from user interface to customer service, and not have to worry about storage. For us, it’s a necessary evil that detracts from our ability to deliver better features faster. We’re actively investigating plenty of other web services at Amazon, both announced and otherwise, and are extremely excited about how much more time we’ll be able to spend with our customers instead of our datacenters.
Finally, I think it’s worth mentioning that we love web services of all shapes and sizes. We publish our own API, we consume web services from Google and Yahoo already, and we plan to add more to the mix. But while everyone else’s web services allow us to add whizzy features, like Google Maps, or Yahoo’s Geocoder, Amazon’s solve real hard problems down deep where no-one will notice them. The “muck.”
They’re really building the very foundation of future web applications.
Quickies: Hack Day, Sun T1000, Amazon S3
Really quick…
Yahoo! Hack Day
SmugMug was in the house at Hack Day 2006, and we had a great time! Many thanks to Yahoo for putting on such a great event – we learned a lot about Yahoo technologies and put together a great demo. Anytime they want to throw another one, we’ll be there. Fantastic group of people over at Yahoo.
Best part about it is that our demo will shortly be a shipping product our customers will love and that’ll generate extra revenue for our company. Oh, and BigWebGuy got his official hazing there at Hack Day – he coded for 36 hours straight (no sleep!) his first week on the job even though he was sick! Welcome to the family, Lee!
Sun T1000
The Sun T1000 is very much still on our radar. I don’t want to do an in-depth update until we’re absolutely sure about what’s going on, but here’s a short summary of where we are.
I spent 5 hours over at Sun a few days after our initial results were posted with some very intelligent people. They were as perplexed at the results as I was, and were determined to get to the bottom of it. The good news is we now have a T1000 running Solaris side-by-side with a T1000 running Ubuntu which is side-by-side with our dual dual-core Opteron running Red Hat. The bad news is the Sun guys weren’t able to coax any more performance (yet!) out of the T1000.
We have a theory that we might be saturating the GigE port with raw # of interrupts per second, so it’s getting throttled there and starving the CPUs. So we have a gameplan for what to attack next – I’ve just been too swamped to deal with it for the last few weeks. We’ll get to it, though, I promise and I’ll share all the details.
Amazon S3
I still haven’t posted the in-depth technical details and code samples I promised about our use of Amazon S3, but fear not – I’m actively working on it and will post it as soon as it’s done.
Just wanted you to know I hadn’t forgotten about you. 🙂
Incidentally, Jeremy Zawodny is playing around with using it for his personal backup storage. Sounds sweet!
Amazon S3 = The Holy Grail
I should have posted this a few weeks ago, but better late than never. We now use Amazon S3 for a significant part of our storage solution. We’re absolutely in love with it – and our customers are too (even if they don’t know it).
As you probably know, SmugMug has been profitable since our first year, with no investment capital. We’ve had a great track record for keeping our customers’ priceless photos safe and secure using only the profits we’ve accrued to purchase our storage (yes, I said purchase. We have no debt – we own all of our storage, we don’t lease). And every SmugMug customer gets unlimited storage – so that’s no mean feat. (Currently, unlimited means ~300TB of storage and nearly 500,000,000 images. To put that into perspective, that’s more than 65,000 DVDs or 480,000 CDs).
But Amazon’s S3 takes our storage architecture to the next level:
- Your priceless photos are stored in multiple datacenters, in multiple states, and at multiple companies. They’re orders of magnitude more safe and secure.
- We’d already built a custom, low-cost, commodity-hardware, redundant, scalable storage infrastructure. Nonetheless, it’s significantly cheaper to use S3 than our own – especially when you factor multiple states & datacenters into the equation.
- Perhaps even more importantly, our cash-flow situation is vastly improved. Instead of paying $25,000 for a handful of terabytes of redundant storage up-front, even before they’re used, we now pay $0.15/GB/month as we use it.
- When we have some sort of internal storage outage, it doesn’t matter – Amazon’s always on. They eat their own dogfood – S3 is in production use on dozens of Amazon products. We’ve had storage-related internal outages a few times already, and our customers haven’t been able to tell. We’ll still have rare outages on our site, unfortunately (everyone does), but storage is now vastly less likely to be part of the cause.
- I started writing our S3 interface on a Monday, and by that Friday, we were live and in production. It really is that simple to pick up and use, and it was basically a drop-in addition to our existing storage.
- It’s fast. I don’t mean 15K-SCSI-RAID0-fast, but I do mean internet-latency-fast. It’s basically as fast as our internal local storage + the roundtrip speed of light to Amazon. I can measure the difference with computer timing, but in blind tests, humans haven’t been able to tell the difference. Everything we serve from Amazon feels fast.
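Part of why the interface only took a week: an S3 request is plain HTTP plus an HMAC-SHA1 signature over a handful of headers. Here’s a minimal sketch of the signing step for a PUT. It follows S3’s generic REST signing scheme of this era (no x-amz-* headers, no Content-MD5), not our actual library:

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

def sign_s3_put(secret_key, bucket, key, content_type, date=None):
    """Compute the HMAC-SHA1 signature for an S3 REST PUT request.
    The resulting Authorization header is "AWS <AccessKeyId>:<signature>"."""
    date = date or formatdate(usegmt=True)  # e.g. "Tue, 27 Mar 2007 19:36:42 GMT"
    # StringToSign: verb, Content-MD5, Content-Type, Date, then the
    # canonicalized resource (amz headers omitted since we send none).
    string_to_sign = "\n".join([
        "PUT",
        "",            # Content-MD5 left empty in this sketch
        content_type,
        date,
        "/%s/%s" % (bucket, key),
    ])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode(), date
```

That signature, the Date header, and the photo bytes are essentially all a PUT needs – which is why it dropped into our existing storage layer so easily.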
I hate to admit this, but Amazon has built a playing-field leveler. It’s now much much easier for a competitor of ours to spring fully-formed from two guys in a garage than it was. Anyone who doesn’t get on board with Amazon S3 (or the inevitable S3 competitors) may get left behind. I’m glad we’re first, but I doubt it’ll last.
Tim O’Reilly, technology visionary extraordinaire, recently said of Sun’s new ‘Thumper’, the Sun Fire X4500: “This is the Web 2.0 server.” While I think Tim has perhaps the clearest vision in the industry, and the Thumper does truly look awesome, this time I think he may have missed the mark. The Web 2.0 server is *any* cheap Linux box coupled with utility storage like S3.
Initially this post had a lot of technical detail (I am the ‘Chief Geek’, after all), but I removed it since it was probably getting boring. So this is the quick-and-dirty ‘Business Case for Amazon S3 and How it Helps our Customers’ post. If there’s enough interest, I can write up a detailed post about exactly how we use S3, how it works in conjunction with our own local distributed filesystem, and post our S3 library (which was derived from someone else’s). Post in the comments if that’s of interest.
Also, we’ll be presenting at a storage conference in Florida in late October (I’m sorry, I don’t have the name of the con with me, but I’ll update this post when I do), and have had a few other people request conferences talks on the subject. Comment if that’s of interest, too, so we know where to go speak.
Finally, one last geek thought: Anyone using the SmugMug API is now actually using multiple APIs through ours (depending on what you’re doing, you may be using Google and/or Yahoo, but you’re almost certainly using Amazon). The stack continues to grow.
UPDATE #1: In response to a comment below, I don’t feel like we “bet the company” on S3 – every photo our customers entrust us with, we keep local copies in our existing distributed storage infrastructure. We use S3 as redundant secondary storage for use in cases of outages, data loss, or other catastrophe.




