aws | SmugMug's Don MacAskill

On Why Auto-Scaling in the Cloud Rocks

December 9, 2008 Don MacAskill 70 comments

In high school, I had a great programmable calculator. I’d program it to solve complicated math and science problems “automatically” for me. Most of my teachers got upset if they found out, but I’ll always remember one especially enlightened teacher who didn’t. He said something to the effect of “Hey, if you managed to write software to solve the equation, you must thoroughly understand the problem. Way to go!”

George Reese wrote up a blog post over at O’Reilly the other day called On Why I Don’t Like Auto-Scaling in the Cloud. His main argument seems to be that auto-scaling is bad and reflects poor capacity planning. In the comments, he specifically calls SmugMug out, saying we’re “using auto-scaling as a crutch for poor or non-existent capacity planning”.

George is like one of those math teachers who doesn’t “get it”. I was tempted not to write this post because he gets it so wrong, I’d hate to spread that meme. SkyNet auto-scales well. No humans at SmugMug are monitoring it and it just hums along, doing its job. Why is it so efficient? Because I understand the equation. I know what metrics drive our capacity planning and I programmed SkyNet to take these into account. It checks an awful lot of data points every minute or so – this isn’t simply “oh, we have idle CPU, let’s kill some instances.” (I would argue that, depending on the application, simple auto-scaling based on CPU usage or a similar data point can be very effective, too, though.)
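
To make "understanding the equation" concrete, here is a minimal sketch of a scaling decision that weighs more than one signal. It is not SkyNet's actual code: the metric names, the thresholds, and the `desired_instances` function are all assumptions made up for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class Metrics:
    """One sample of capacity signals (all field names are hypothetical)."""
    queue_depth: int              # jobs waiting to be processed
    jobs_per_instance_min: float  # observed throughput per instance per minute
    avg_cpu: float                # 0.0 to 1.0, averaged across running instances
    running: int                  # instances currently running


def desired_instances(m: Metrics, target_drain_minutes: float = 5.0,
                      min_instances: int = 1, max_instances: int = 50) -> int:
    """Return how many instances we want, combining several signals."""
    # Instances needed to drain the queue within the target window.
    per_instance = max(m.jobs_per_instance_min, 0.001)  # avoid dividing by zero
    wanted = math.ceil(m.queue_depth / (per_instance * target_drain_minutes))
    # Idle queue alone is not enough: do not scale in while instances are busy.
    if m.avg_cpu > 0.80:
        wanted = max(wanted, m.running)
    # Clamp to the assumed fleet limits.
    return max(min_instances, min(max_instances, wanted))
```

With 1,000 queued jobs, 10 jobs per instance per minute, and a 5-minute drain target, this asks for 20 instances. An empty queue on a fleet still running at 90% CPU keeps the fleet size where it is instead of killing instances.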

SkyNet has been in production for over a year with only two incidents of note, and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?

Categories: amazon, cloud computing, datacenter Tags: amazon web services, auto-scaling, aws, cloud computing, ec2, george reese, o'reilly, skynet

Huge EC2 release: Load Balancing & Auto-Scaling!

October 27, 2008 Don MacAskill 7 comments

June 5th, 2008 near Maryville, Missouri by Shane Kirk

In case you didn’t see it, Amazon had a huge EC2 announcement the other day that included:

EC2 is now out of beta.
EC2 has a SLA!
Windows is now available on EC2
SQL Server is now available on EC2

But the really cool bits, if you ask me, are the announcements about the next wave of related services:

Monitoring
Load Balancing
Auto-Scaling
A web-based management console

As frequent readers of my blog and anyone who has sat through my conference talks will know, this means one of the last important building blocks to creating fully cloud-hosted applications *at scale* is nearly ready for primetime.

For those keeping score at home, my personal checklist shows that the only thing now missing is a truly scalable, truly bottomless database-like data store. Neither Elastic Block Store (EBS) nor SimpleDB really solves the entire scope of the problem, though they’re great building blocks that do solve big pieces (or everything, at smaller scale). I’m positive that someone (Amazon or other) will solve this problem and I can start moving more stuff “to the Cloud”.

I can’t wait.

Categories: cloud computing, datacenter Tags: amazon web services, auto-scaling, aws, cloud computing, ebs, ec2, elastic block storage, load balancing, monitoring, simpledb

Amazon S3: Price reduction

October 13, 2008 Don MacAskill 6 comments

I know a lot of you get your Amazon Web Services news from me, so I thought I’d better mention this one. It’s huge!! 🙂

Amazon announced S3 price reductions as you scale. For us, since we’re way beyond 500TB, this is huge. And for any of you who are still in the first tier, it’s something to look forward to. 🙂

DevPay also got a significant new release, pricing-wise, recently, so if you’re interested in that, better check it out.

Thanks Amazon!

Categories: amazon Tags: amazon, aws, devpay, s3, web services

Hot technologies I care about – Sep '08

September 17, 2008 Don MacAskill 30 comments

photo by: ikegami

I’ve been too busy to blog lately, and for that I apologize. But here’s a quickie detailing the technologies (internet related and not) I’m excited about right now:

Drizzle. For years now, I’ve felt that MySQL has been going in a direction in opposition to my use case. Stored procedures, views, etc etc have added bloat and complexity without offering me anything useful. Turns out I’m not alone – and thus Drizzle was born. To say I’m *super* excited about this is a serious understatement.
Google & Percona’s MySQL patches. While I wait for Drizzle, I’m stuck dealing with terrible concurrency issues in MySQL/InnoDB that force us to partition data way before we really should have to, making our system more complex. It’s crazy having a server keel over when it shouldn’t be either CPU-bound *or* IO-bound but that’s life with MySQL and InnoDB these days – or at least, it was until Google and Percona fixed what I couldn’t get MySQL to fix with our Platinum Enterprise subscriptions. Open source rules!
Flash storage. I really wish I could talk about this some more (pesky NDAs), but there are datacenter changes coming that are more dramatic than anything I’ve seen in 14 years of working on them. I hope I’ve talked to everyone in the space (and from the companies I’ve talked to, one of them seems to be the *very* clear winner for this upcoming round), but if you’re a storage vendor working on flash appliances and I haven’t talked to you, ping me. We’re a bleeding edge customer and we’ll put your stuff in production faster than you can deliver it to us. 🙂
ZFS. Regardless of flash storage, ZFS is the filesystem of choice – head and shoulders over everything we’ve used or heard of. The advent of flash just makes this even more compelling. The downside? It’s not on Linux. 😦
OpenSolaris. ZFS is so incredible, my hand has been forced, and we’re about to put our first OpenSolaris system into production. OpenSolaris is, in theory, the Solaris kernel (think ZFS, DTrace, SMF, high concurrency, etc) with the GNU-like userland (think Linux-like). In practice, it’s still extremely painful for a Linux expert and Solaris n00b like me to use – even on a single-purpose machine like a MySQL server. Only ZFS makes the pain worth it. For development, it’s basically unusable for Linuxers (it’s probably fabulous for Solaris guys – lucky ducks).
Nexenta. Unlike OpenSolaris, Nexenta *is* the Solaris kernel plus GNU userland. Unfortunately, it’s not backed by Sun or anyone else I have any relationship with. Sun has been absolutely the very best technology vendor we’ve ever dealt with in terms of support, technical knowledge, and just plain listening to us, so that’s a big issue. I wish Sun had taken Nexenta’s approach (or would just buy them or offer support or something). If OpenSolaris continues to be painful, we may fall back on Nexenta instead – remember, ZFS is the driving factor here.
Amazon Web Services competitors. They’ve been promising they’d be coming out for years now and I’m shocked they’ve given Amazon this much runway. But I believe a few more are getting very close (can’t say more, again, pesky NDAs). Now, we’re extremely happy with Amazon, so we have no plans to switch, but competition is good for everyone – and Amazon is a fierce competitor. Plus there are still gaps in Amazon’s strategy, and if I can mix & match to plug some of those gaps, awesome – sign me up.
Memcached. This one’s been on my list for years, and it’s still way up there. Binary protocol on the verge of shipping, nice patch to resolve some networking issues we’ve seen, and talk about scalability. If you’re building web apps and this isn’t a core part of your infrastructure, you’re doing it wrong.
Big RAM. 4GB DIMMs are dirt cheap, so if you’re not loading your DB and Memcached boxes to the gills, you’re missing the boat. Cheap 2-socket 64GB (and relatively cheap 128GB at 4-sockets) are here.
Sun Fire X4140 and X4440. The best 1U (2-socket) and 2U (4-socket) servers on earth. Despite being late to the game with quad-core, Opteron RAM performance kills Xeon, so these are the servers we’re buying. You can load them to the gills with 4GB DIMMs, enjoy the dual-power supplies (yes, in the 1U box too), and crank out some great stuff.
OpenSocial, Y!OS, etc. The big boys are finally getting real about getting open and cross-pollinating data and I think we’re finally nearing an inflection point. We’re hiring a Sorcerer to do nothing but think and build in this space. I’m sure magic will ensue.
Nikon D90 and Canon 5D MkII. Nikon’s taken the photography world by storm with amazing high-ISO performance, and Canon just announced a DSLR that shoots full 1080p video. Both look amazing and both are game-changers.
Onkyo TX-SR806. I’m an A/V junkie and this thing is amazing. 5 HDMI inputs (need more?), THX Ultra2 Plus (the low-volume enhancements are *awesome* with young kids sleeping at home), automatic room EQ, decodes every modern audio encoding, etc. I don’t even use the amplifier section (I have separates), but it’s turning out to be the best Pre/Pro I’ve ever owned. Sounds fabulous on my gear.
iPhone App Store. That thing is a game changer, and we’re barely seeing the tip of the iceberg. All the other players have to respond – which is great for you and me. And talk about a platform that’s a dream to develop on!

So there you have it. Those are the most important pieces of tech I’m watching these days. I’ll *definitely* be writing up our ZFS experiments as they come along and I have interesting data to share. Stay tuned.

Oh, and if you’re curious about what I *wish* was on the list, there’s really only one thing: iTunes syncing. I have two desktops (one at my office, one at home) and two laptops, plus my wife has accounts on my computers. Keeping those all in sync so that when I update a playlist at the office, the update is waiting for me at home, is a nightmare. I’d pay lots of money if someone could solve that – seems like iTunes + AWS + a smart coder = solved, no? Wish I had some time….

Categories: personal Tags: amazon, aws, drizzle, flash, google, iphone, itunes, memcached, MySQL, nexenta, opensocial, opensolaris, percona, sun, y!os, zfs

S3 outage – We weren't affected

February 15, 2008 Don MacAskill 68 comments

Amazon S3 had an outage today. First I knew about it was reporters emailing and calling me asking if we were knocked out by it.

We weren’t. No customers reported issues, and our systems were all showing typically low and acceptable error rates. To be honest, I’m surprised.

I wasn’t going to blog about it until I understood why we weren’t affected, but I’m really getting inundated with requests now, so I figured this would be a good way to optimize my time rather than spending all day on the phone. 🙂

We’re researching what happened now, but again, I didn’t know about the outage until after it was over, and I haven’t spoken to anyone at Amazon yet. Until I finish my research and speak with Amazon, I’m not going to speculate on what may have happened or why.

I can say, once again, that we pay the same rates everyone else pays and that, other than some early access to upcoming beta services, we don’t get any preferential treatment that I’m aware of.

Some thoughts, though:

We expect Amazon to have outages. No website I’m aware of doesn’t, whether it’s Google, Amazon, your bank, or SmugMug.
I’ve written about Amazon S3 outages in the past, but in the last ~12 months, we’ve only seen a single ~2 minute outage window (on January 22nd, 2008 at around 4:38pm Pacific). We also had one recent fairly major hiccup with EC2.
Yes, I believe there will probably be times where SmugMug is seriously affected, possibly even offline completely, because Amazon (or some other web services provider) is having problems. Today wasn’t that day. Nobody likes outages, especially not us, but we’ve decided the tradeoffs are worth it. You should have your eyes wide open when you make the decision to use AWS or any other external service, though. It will fail from time to time.
We’ve done our best to build SmugMug in such a way that we handle failures as gracefully as possible. We can’t cover every case, but I think that the fact that we didn’t experience customer-facing outages today is a testament to that. Again, I want to stress that we do expect Amazon to cause us (rare) outages in the future, and that’s unavoidable, but today we dodged that bullet.
Amazon’s communication about this has been terrible. It took far too long to acknowledge the problem. Fixing a major problem can take forever, which is understandable, but communicating with your customers should happen very rapidly. Amazon’s culture, internally, is very customer focused, so this is a strange anomaly. I will definitely lean on them some about it, and everyone who was affected should rightfully howl too.
I’ve asked Amazon repeatedly for an “Amazon Web Services Health” page that shows the current expected state of all their services. Then you can tell at a glance (and even poll and work into your own monitoring) whether any of the services are having problems. Something like Keynote’s Internet Health Report would be a good start, but as Jesse Robbins points out, trust.salesforce.com is the gold standard. This page could also double as a mechanism to let customers know what’s being worked on and current ETAs when there are problems.
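
A pollable health page like the one asked for in the last point could be folded into your own monitoring roughly as follows. This is only a sketch under loud assumptions: the URL, the JSON shape (`{"service": "status"}`), and both function names are hypothetical, since no such page or format is described here.

```python
import json
from typing import Callable, Dict, List
from urllib.request import urlopen

# Hypothetical URL and JSON format, used purely for illustration.
HEALTH_URL = "https://example.com/aws-health.json"


def fetch_health(url: str = HEALTH_URL) -> Dict[str, str]:
    """Download a JSON document such as {"s3": "ok", "ec2": "degraded"}."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def unhealthy_services(
    fetch: Callable[[], Dict[str, str]] = fetch_health,
) -> List[str]:
    """Return the sorted names of services whose status is not "ok"."""
    try:
        status = fetch()
    except (OSError, ValueError):
        # An unreachable or unparseable health page is itself worth alerting on.
        return ["health-page"]
    return sorted(name for name, state in status.items() if state != "ok")
```

Passing the fetcher in as a parameter keeps the check testable without a network, and a monitoring loop can simply alert whenever the returned list is non-empty.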

I’ll try to post a follow-up about why we weren’t affected when I know more. It’s possible that some of the reasons we survived were due to our “secret sauce” and I just won’t be able to say, but I kinda doubt it.

Bottom line: While the outage was certainly a big deal to those affected, I think the bigger deal here is how Amazon handled the outage. They need to communicate better about these mission critical services and their health.

If I didn’t answer any questions you’d like me to answer, please post a comment and/or send me an email. I’ll do my best to respond.

UPDATE 1: I’m not sure why there’s all this confusion, but SmugMug *does* use Amazon as our primary data store. We maintain a small “hot cache” in our datacenters of frequently/recently viewed photos and videos, but there are massive numbers of them that are only at Amazon. This is a change from our initial usage of S3, and the change is based on how reliable they’ve been. Yes, we still consider them to be very reliable even after an outage like this. And yes, I suspect our “hot cache” did at least partially enable us to ride out this issue.
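
The general shape of a "hot cache" in front of a remote primary store can be sketched as a small read-through LRU cache. This is an illustration of the idea only, not SmugMug's implementation: the `HotCache` class, its capacity, and the `fetch_primary` callable (a stand-in for a GET against the primary store) are all assumptions.

```python
from collections import OrderedDict
from typing import Callable, Optional


class HotCache:
    """Small LRU cache in front of a slower primary store (a sketch)."""

    def __init__(self, fetch_primary: Callable[[str], bytes],
                 capacity: int = 1000):
        self._fetch_primary = fetch_primary  # hypothetical stand-in for an S3 GET
        self._capacity = capacity
        self._items = OrderedDict()          # key -> bytes, oldest first

    def get(self, key: str) -> Optional[bytes]:
        if key in self._items:
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        try:
            data = self._fetch_primary(key)
        except OSError:
            return None                      # primary is down and key is not cached
        self._items[key] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return data
```

Anything recently viewed keeps being served while the primary store is unavailable; only requests for objects outside the cache fail, which is one plausible way a short outage could go unnoticed by most visitors.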

Categories: amazon Tags: amazon, aws, ec2, outage, s3, smugmug, storage, web services
