On Why Auto-Scaling in the Cloud Rocks

December 9, 2008 Don MacAskill

In high school, I had a great programmable calculator. I’d program it to solve complicated math and science problems “automatically” for me. Most of my teachers got upset if they found out, but I’ll always remember one especially enlightened teacher who didn’t. He said something to the effect of “Hey, if you managed to write software to solve the equation, you must thoroughly understand the problem. Way to go!”

George Reese wrote up a blog post over at O’Reilly the other day called On Why I Don’t Like Auto-Scaling in the Cloud. His main argument seems to be that auto-scaling is bad and reflects poor capacity planning. In the comments, he specifically calls SmugMug out, saying we’re “using auto-scaling as a crutch for poor or non-existent capacity planning”.

George is like one of those math teachers who doesn’t “get it”. I was tempted not to write this post because he gets it so wrong, I’d hate to spread that meme. SkyNet auto-scales well. No humans at SmugMug are monitoring it and it just hums along, doing its job. Why is it so efficient? Because I understand the equation. I know what metrics drive our capacity planning and I programmed SkyNet to take these into account. It checks an awful lot of data points every minute or so – this isn’t simply “oh, we have idle CPU, let’s kill some instances.” (I would argue that, depending on the application, simple auto-scaling based on CPU usage or a similar data point can be very effective, too, though.)
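As a rough illustration of scaling on more than idle CPU, here is a minimal sketch of a decision driven by queued work. It is not SkyNet's code: the `Metrics` fields, the thresholds, and the `desired_instances` function are all hypothetical names chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Metrics:
    """One sample of hypothetical capacity signals (all names are assumptions)."""
    queue_depth: int         # jobs waiting to be processed
    avg_job_seconds: float   # recent average time to process one job
    jobs_per_instance: int   # jobs a single instance works on concurrently


def desired_instances(m: Metrics, target_drain_seconds: float = 300.0,
                      min_instances: int = 1, max_instances: int = 50) -> int:
    """Return how many instances would drain the queue within the target time."""
    # Outstanding work, expressed in instance-seconds.
    work_seconds = m.queue_depth * m.avg_job_seconds / m.jobs_per_instance
    needed = math.ceil(work_seconds / target_drain_seconds)
    # Clamp the answer so one bad sample can never launch an unbounded fleet.
    return max(min_instances, min(max_instances, needed))
```

The clamp is the fail-safe: whatever the metrics say, the result stays between the configured floor and ceiling.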

SkyNet has been in production for over a year with only two incidents of note and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?

Categories: amazon, cloud computing, datacenter Tags: amazon web services, auto-scaling, aws, cloud computing, ec2, george reese, o'reilly, skynet

Comments (68)

david

December 9, 2008 at 9:06 am

Seems like a classic example of saying a technique is bad under the assumption that it's implemented poorly.

Either that or, "The game has changed, and it scares me."
steve

December 9, 2008 at 9:19 am

It doesn't look like the AWS services are knit together yet such that they can be used by a startup to do auto-scaling without some custom code being architected and developed. Have I missed some documentation or a solid open source framework?

So for a startup that is less media-intensive than Smugmug, the question is: does it develop cloud-based auto-scaling (CBAS, you heard it here first) on day one, or when cost and audience growth warrant the need for auto-scaling?
Adam Jacob

December 9, 2008 at 9:29 am

I think (and I said so in a comment on George's blog post) that SmugMug actually is an ideal case for how to do this well. You have an intimate understanding of the particulars of your systems architecture, you monitor those components, and you know exactly how to scale in the face of those issues. The real point is that you (SmugMug) are actually pretty unique – most people hear "auto-scale" and they think to themselves: when the load average goes over Foo, do Bar. You know (and I know) that it's not that simple.. to do it well (and safely) you need to really understand the fundamentals of your architecture.
Benjamin Black

December 9, 2008 at 9:35 am

if you do not build (your software) to scale from the start you will have to scale (your headcount) to build later. the latter is more expensive than the former.
Benjamin Black

December 9, 2008 at 9:37 am

right on and extra emphasis on the last bit: building your own system to autoscale makes autoscaling for your system straightforward (and great). building an autoscaling system to autoscale arbitrary apps to arbitrary scales is not straightforward.
steve

December 9, 2008 at 9:46 am

If the process of building your software to scale is too onerous, you will run out of money before you get to "later."
James D Kirk

December 10, 2008 at 5:01 am

Just a guess on my part, but do you think when George goes to the gas station with his car (assuming he owns one) he only puts in the exact amount of fuel for the journey that will get him where he needs to be, and back again to that station? Guessing not! Companies like SmugMug are clearly on the leading edge, and in my opinion have crossed or are crossing the chasm of CBAS (trademark Steve 😉 !) Companies like RightScale are "knitting" a lot of those auto-scaling capabilities together and allow slubs like me to benefit from the work of (really smart) slubs like Don! Those templates are readily available for putting together intelligent solutions for your company's needs. Hmmm, sort of like just filling up your gas tank when you get fuel. Imagine that!
Pete Skomoroch

December 10, 2008 at 12:20 pm

I agree with David and Adam. George has too many flaws in that post to deal with, but I'll take a stab at this one: "You still add capacity when you need it; you just add it according to a plan rather than willy-nilly based on perceived external events. …You just don't want the system automatically adjusting capacity based on usage."

This all comes down to false alarm rates and classic engineering, and it sounds like SmugMug tracks enough metrics to smooth out the blips. Intelligent feedback systems that anticipate a variable and respond accordingly have been built into products and defense systems for a long time: your car does it, airplanes do it. This is related to another key point the original post seems to miss: a good autoscaling algorithm actually reacts to its own predictions about system state over the next time period (based on current data combined with historical patterns) instead of just reacting to the last 5 minutes of load.
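A minimal sketch of that idea, assuming a simple weighted blend of recent load and the historical average for the same time period. The function names, the 50/50 weighting, and the headroom factor are illustrative assumptions, not anything SmugMug has described.

```python
import math
from statistics import mean


def forecast_load(recent: list[float], same_period_history: list[float],
                  history_weight: float = 0.5) -> float:
    """Blend current load with what this time period has looked like before."""
    current = mean(recent)
    typical = mean(same_period_history)
    return (1 - history_weight) * current + history_weight * typical


def instances_for(load: float, capacity_per_instance: float,
                  headroom: float = 1.25) -> int:
    """Instances needed to serve the forecast load with some spare headroom."""
    return max(1, math.ceil(load * headroom / capacity_per_instance))
```

For example, if the last few samples average 110 requests/sec but this hour has historically averaged 210, the blended forecast is 160, and the fleet is sized for that rather than for the last five minutes alone.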

What he calls "proper capacity planning" is really just a bunch of historical data, assumptions, and a cost function you are trying to optimize. A good autoscaling algorithm builds all that in and handles the day to day fluctuations in a repeatable way. A really good one can generalize well and respond accordingly without you lifting a finger.

Obviously some testing needs to take place and manual overrides / fail-safes should be put in, but I don't buy the argument that everyone using autoscaling is setting themselves up for disaster. If your monitoring is in place, you are going to know when 40 instances are mistakenly fired up and at worst you are out $4.00 (or $200 for 250 XL instances as described in the original SmugMug post). In the long run, autoscaling will give you a much smoother, more cost effective capacity plan than a human could do – and it can adjust to bursty traffic (see the Animoto Facebook example).
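The worst-case figures above are simple to reproduce. This sketch assumes the mistake is caught within one billed hour; the $0.10 rate matches the "10 cent per machine hour" mentioned elsewhere in this thread, while the $0.80 XL rate is inferred from the $200 / 250 figure rather than stated here.

```python
def runaway_cost(instances: int, hourly_rate: float, hours: float = 1.0) -> float:
    """Dollars spent on instances launched by mistake before they are killed."""
    return instances * hourly_rate * hours


small = runaway_cost(40, 0.10)    # 40 small instances for one hour
xl = runaway_cost(250, 0.80)      # 250 XL instances for one hour (rate is inferred)
```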
Steven

December 11, 2008 at 4:07 am

Right On.
Tao

December 11, 2008 at 5:01 am

I think cloud computing is such a bubble. Personally, I think the math just doesn't support the business model of using cloud services. For some businesses, such as a flower shop that gets flooded with requests during a select few holidays, scaling on EC2 is cheap. If you have to rely on AWS 365 days a year, then it doesn't really make economic sense to pay the high price for AWS.

Take S3 for example. Don wrote an article on how he was saving millions using S3 so that he doesn't have to buy terabytes of storage servers. Well, that model is only partially right, and only if you are in the "growth cycle". The only advantage is that you don't have to pay upfront for the hardware, which is a sunk cost (by the way, any one-time sunk cost can be ignored or amortized over the lifespan of the hardware). On the other side of the story, now that we are in an economic meltdown, I assume that growth at Smugmug is declining if not negative. If you paid for your hardware up front, during contraction periods your hardware cost will be 0, since you aren't buying more hardware to fuel your growth. So the variable running costs if you own your hardware are electricity, management, bandwidth, and the annualized depreciation of your hardware.

At Smugmug's scale, management is basically paid for by the salaries of the IT folks. It is not likely that Smugmug will fire the people who were managing the storage servers once storage is moved to S3, so management cost can be seen as a sunk cost no matter what. Now bandwidth. You have to have the scale to go all the way to the last tier to make S3 as cost-effective as it can be: 10 cents per GB plus 0.0001 cent per request. That request fee pretty much means S3 cannot be used to store files smaller than 100KB, or cost will escalate faster than you think. If you store 10KB thumbnails, there are 100K requests per 1GB of bandwidth, and you are effectively paying double the asking rate. At a 100KB file size, you are paying 11 cents per GB. Now, gigabit links are selling for about 5000 dollars/month. Assume that you don't use it all: you can transfer 200GB/month for every Mbps. That means bandwidth costs 5000 dollars for 200TB, or 40GB/dollar, or 2.5 cents per GB. That is 1/4 of what AWS is asking for.
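The per-GB figures in this paragraph can be checked with a short sketch. All prices are the ones quoted in the comment (not current AWS pricing), decimal units are assumed (1 GB = 1,000,000 KB) to match its rounding, and the function names are made up for the example.

```python
TRANSFER_FEE = 0.10            # dollars per GB transferred, as quoted
REQUEST_FEE = 0.01 / 10_000    # dollars per request (1 cent per 10,000 requests)
KB_PER_GB = 1_000_000          # decimal units, matching the comment's arithmetic


def s3_cost_per_gb(file_kb: float) -> float:
    """Effective dollars per GB served when every file is file_kb in size."""
    requests_per_gb = KB_PER_GB / file_kb
    return TRANSFER_FEE + requests_per_gb * REQUEST_FEE


def leased_cost_per_gb(monthly_fee: float = 5000.0, mbps: int = 1000,
                       gb_per_mbps_month: float = 200.0) -> float:
    """Dollars per GB on a leased link at the comment's assumed utilization."""
    return monthly_fee / (mbps * gb_per_mbps_month)
```

With these inputs, 10KB files come to about 20 cents per GB, 100KB files to about 11 cents, and the leased gigabit link to 2.5 cents.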

Electricity cost can be estimated at 150 dollars per Amp per year. Assume that storage servers will last longer than CPUs at 5 year life span. That means for a MD1000 array sucking 3A(15x15k rpm or 15x 7200rpm), you will end up paying 1500 dollars per MD1000 array over the lifespan of the array. That's about 1/4 the cost of the array.

Then you figure out the depreciation schedule of the storage. Again, storage hasn't depreciated as much as CPU has. Assume 5 year life span, you are taking approximately 20% depreciation per year.

I would really love to see Don talk about the cost structure again now that the growth phase is over. If Don were still operating with owned storage servers, he wouldn't be paying for new hardware right now. But he is still paying a lot of money per month for S3. The math is simple. Assume Don is using 500TB of space at Amazon, and that he would use RAID5 with every file backed up. That means Don would need to buy 1500TB of storage. That is 200 MD1000 arrays running 500GB SATA drives attached to commodity 1-socket Opteron servers. That setup would cost him 5K (MD1000 + 15 500GB SATA drives) + 2K (server) = 7K per node, plus some networking equipment. Let's say 10K per node. 200 nodes * 10K = 2 million dollars. So the sunk cost to own the hardware is about 2 million bucks. Electricity is 5A per node, or 750 dollars/node/year. That is 150K per year to keep the hardware running. Over 5 years, that's less than 1 million dollars. So far we are at 3 million without bandwidth. Given the current low interest environment (5%), the savings Don thinks he's getting by not purchasing the hardware outright are smaller than he thinks. Once the bandwidth cost is added, I think S3 costs about 2 times as much at the bare minimum.
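The ownership estimate in this paragraph works out as follows. This is only a sketch of the comment's own assumptions (200 nodes, 10K per node, 5A per node, 150 dollars per amp per year, 5 years); the function names are hypothetical and bandwidth, staff, and interest are left out, as they are in the comment's running total.

```python
def raw_capacity_tb(nodes: int = 200, drives_per_node: int = 15,
                    drive_gb: int = 500) -> float:
    """Raw disk capacity in TB (decimal units)."""
    return nodes * drives_per_node * drive_gb / 1000


def owned_cost(nodes: int = 200, cost_per_node: int = 10_000,
               amps_per_node: int = 5, dollars_per_amp_year: int = 150,
               years: int = 5) -> dict:
    """Hardware plus electricity over the assumed lifespan, in dollars."""
    hardware = nodes * cost_per_node
    electricity = nodes * amps_per_node * dollars_per_amp_year * years
    return {"hardware": hardware, "electricity": electricity,
            "total": hardware + electricity}
```

With the defaults this gives 1,500 TB raw, 2,000,000 dollars of hardware and 750,000 dollars of electricity, or 2.75 million in total, which the comment rounds up to 3 million.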

The same analysis can be done for EC2. AWS simply isn't passing along the cost savings that every new generation of CPU architecture gives you. EC2 pricing has not reflected the CPU unit cost depreciation due to inverse Moore's Law. (CPUs get double the transistors every 18 months or so. Assume the doubled transistors translate to double the cores; then, under virtualization, each CPU unit Amazon sells should cost about 50% less every 18 months.) But Amazon pockets the benefits: you still pay 10 cents per machine hour.
- Don MacAskill
  
  December 11, 2008 at 5:47 am
  
  So there are an awful lot of things wrong with this comment, but chief among them is the assumption that SmugMug's growth has somehow flattened or is in decline. That couldn't be further from the truth – we had explosive growth this year, despite our expectations of slowing thanks to the economic meltdown.
  
  The other element you're overlooking is that when Amazon lowers their S3 pricing (like they did this year), *all* of our data storage costs get reduced, not just the "new" stuff. That's a very different model than capitalizing our own storage – those costs are then not only sunk, but fixed.
  
  Further, capitalizing those storage costs requires cash. Lots of it. It's easy to say you'll just amortize it over 5 years, but you still have to come up with the cash up front or carry a nasty debt load. (Look where the debt load in this country got us this year). We like owning our business, being our own masters, and not owing anyone – investors or creditors – a dime. Amazon's cost structure allows us to do that – and personally, I'm not sure I can put a price on that sort of freedom.
  
  Finally, Amazon has serious incentive to keep lowering prices and keeping us happy. The "cost of moving" from Amazon is roughly equal to one month's "rent" so there's very little barrier to simply switching to something else (in-house storage, Windows Azure, Google AppEngine, etc). That keeps them honest, and keeps our operating cost structure down. The same cannot be said of doing your own storage in-house. Once you own that stuff, you have to monetize it.
  
  So yes, we're still extremely happy with Amazon, and extremely happy with the less-tangible benefits that come along with it (lower headcount, no venture capital, no debt, flexibility, less mental anguish, etc) along with the obvious business savings and optimizations.
botchagalupe

December 11, 2008 at 11:14 am

George was wrong and he admitted it to me on Twitter: He didn't really understand the dynamics of how people were using EC2. I pointed out to him that companies like Animoto and SmugMug were using queued models. Here is his reply…

http://twitter.com/GeorgeReese/status/1045799833

John
johnmwillis.com
- botchagalupe
  
  December 11, 2008 at 11:22 am
  
  Not sure what happened. It looks like it prefixes the URL.
  
  twitter.com/GeorgeReese/status/1045799833
  
  or here is what he said:
  
  @botchagalupe Yeah, and my objectives don't fit outside the realm of web apps anyways
Tao

December 11, 2008 at 11:29 am

Don:

I have been reading your blog for a long time now, and you know I fully respect your opinion. The one thing I don't get is your over-optimism about Amazon AWS and the Sun Storage 7000 series. Both are great products; however, I think you are underestimating the true cost.

1. First of all, congrats if you are still growing. I guess we haven't even entered the dark tunnel yet. To assume that growth will continue in this economic environment is not prudent. I have tried to show that when growth is negative or flat, the true cost of S3 picks up, since if you own the hardware, when growth dies down, you no longer have to add hardware, whereas in Amazon S3's case, you are still paying rent.

2. Again, Don, you are not sharing the numbers, but I bet you that what you pay Amazon can easily also carry the monthly payment of the debt load if you didn't have the cash to pay for hardware in the first place.

3. Amazon this year "attempted" to lower prices. At face value, it did reduce the "storage cost". It now costs less to store stuff there. But they make it up with the 1 cent per 10,000 requests, which could easily double your bandwidth cost if your uploaded files are smaller than 10KB (like thumbnails). I am not convinced that is a true reduction in price.

Anyways, how is your Sun Storage 7410 doing? You said that you were going to review the Sun Hybrid Storage Pool a while ago. Some really interesting SSD drives are now here. Intel X25-M and X25-Es are pretty good. Samsung 25GB/50GB enterprise drives are pretty good and a lot cheaper than Intel's drives per GB. OCZ Vertex is a decent MLC drive to be used for L2ARC in ZFS.
cabbey

December 15, 2008 at 6:21 am

Good to see George came back and actually READ your original post and realized the errors of his hasty judgment.
Alena

December 16, 2008 at 2:20 am

I recently came across your blog and have been reading along. I thought I would leave my first comment. I don't know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.

Alena

http://www.smallbusinessavenues.com
Золотой

December 20, 2008 at 6:07 pm

If only a couple of on-topic pictures were added to this text. That would be absolutely perfect!
Google

December 21, 2008 at 5:20 am

A contest for bloggers from DRUGREVENUE with a prize fund of 3000 dollars, hurry up
Unfaips

January 2, 2009 at 2:41 pm

Top marks, no question about it
Jack

January 13, 2009 at 5:48 am

I had a chuckle reading this. I remember in university being given some math problems. I wrote up a program to solve them on the mainframe, handed in the answers with the program listings and promptly got a D because I didn't show my work. My professor didn't get it.
john

February 2, 2009 at 7:13 am

nice
emimisuddy

February 10, 2009 at 12:58 pm

Wow, how interesting, you're really on a roll. Great!
Rooftfluitig

February 12, 2009 at 9:24 am

Thank you for the site, a very useful resource, I like it a lot
amudge

February 13, 2009 at 9:40 am

A friend sent me a link to your blog over ICQ. It turned out to be worth it. I liked it. Now I'll be reading it regularly
Cryday

February 14, 2009 at 3:25 pm

Your RSS feed is in a broken encoding!
Blealsevew

February 15, 2009 at 2:21 pm

Thanks for the article. It's relevant to me right now. I saved it to reread later.
heiply

February 16, 2009 at 8:50 am

I published your article on my blog and, of course, included a backlink to you. But I came by to see whether the trackback had appeared, and it isn't there…
Ideocresessy

February 17, 2009 at 11:31 am

great design )
cliedy

February 18, 2009 at 12:45 pm

Wow, how interesting, you're really on a roll. Great!
beeteorrerma

February 18, 2009 at 12:47 pm

it was a very entertaining read
interiouborb

February 19, 2009 at 10:09 am

Your RSS feed is in a broken encoding!
Weertylart

February 19, 2009 at 2:41 pm

A friend sent me a link to your blog over ICQ. It turned out to be worth it. I liked it. Now I'll be reading it regularly
smagsbluesse

February 20, 2009 at 3:48 pm

As usual, you delight us with your best lines. Thanks, I'll take it!
EngetteBrene

February 20, 2009 at 5:27 pm

Interesting and informative. Will there be anything more on this topic?
mahNameImpef

February 21, 2009 at 5:23 pm

Sorry if this is the wrong place, but how can I contact the site admin?
steamp

February 23, 2009 at 6:15 am

Tell me, does this blog have an RSS feed?
Engighsnit

February 23, 2009 at 2:40 pm

I think this article was stolen from you and posted on another site. I've already seen it.
clivioustott

February 24, 2009 at 4:14 pm

May I ask a quick question? There's a time shown after each post. Is that Moscow time? Thanks in advance!
sacien

March 3, 2009 at 2:24 pm

Your post got me thinking *went off to think a lot* …
dyroitharrah

March 4, 2009 at 2:57 pm

Thank you so much! Will there be more posts on this topic in the future? I'm really looking forward to them!
CammaFat

March 5, 2009 at 2:10 pm

Interesting and informative. Will there be anything more on this topic?
Наталья

March 22, 2009 at 12:11 pm

Good post, an interesting read
kompr

March 29, 2009 at 6:48 am

These are very interesting. Thanks for sharing.
Sergio

March 30, 2009 at 12:10 pm

Yes, the film defined an era!!
jim winstead

April 28, 2009 at 3:57 am

SkyNet has been in production for over a year with only two incidents of note and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?

doesn't sound like a good thing if you're looking for work as an operations person.
Адриана

May 17, 2009 at 1:22 pm

I've always wanted to watch the Terminator movie… they say it's cool, but I never get around to watching it…
Пацик

May 19, 2009 at 2:26 am

Hmm, cool)))). Write more, we'll be waiting
Generator

May 21, 2009 at 8:12 am

Very cool
Фильмы бесплатно

May 21, 2009 at 1:43 pm

Interesting article. Terminator rules!
mihawin

May 27, 2009 at 9:19 am

not bad, i agree with you, thank you for this nice post, but who is George Reese?
Jade

June 8, 2009 at 3:42 pm

I guess you were a really smart guy to create such a calculator
i agree with your professor 🙂
Amey Kanade

July 9, 2009 at 7:01 pm

Hello Mr. MacAskill,
Your articles are simply amazing….I am working on a project which will be using S3 as a backend for storing mp3….and plan to use a somewhat similar architecture to the one you use at smugmug. Is your detailed architecture open, and is it possible for you to share it?

Do you use Django, Python, Json and S3…? I knw m asking a lot…which u mite not be willing to share….? If you have used Django for smugmug…how has been the experience…?
Germusya

July 11, 2009 at 1:26 pm

Thanks, a very useful article, I took note of a lot!
hotbestfree

July 15, 2009 at 9:24 pm

Thanks. Very interesting. More on the topic can be found here http://hottenbabes.freehost123.com/
http://mjdatsite.k2free.com
CCTV designer

October 18, 2009 at 8:20 pm

Auto-scaling in the cloud really rocks for small companies and startups that cannot afford expensive hosting services.
vadimius

November 1, 2009 at 10:35 pm

Thanks for the article…
Mitrich

November 12, 2009 at 2:25 pm

Thanks, I read it. Will there be a follow-up?
iphone ringtonemaker

November 14, 2009 at 3:34 pm

Hey Don, wish I could have joined you but it's a bit far from Switzerland 😉 Luckily I managed to catch it at IMAX Irvine, OC.
mazda 3 driver

November 20, 2009 at 5:21 pm

Auto-scaling works for your capacity planning so well because you truly understand it. Your calculator example is perfect. George just doesn't get it and probably never will.
Lover of Sadness

November 21, 2009 at 9:50 am

i've also written some program to solve some basic algebra and matrix…
Vladimirs

December 14, 2009 at 6:44 pm

Hello Don!
I'm working on my private website at the moment, which is based on photo galleries. That's why I need a suggestion on how to decrease the size of the photos, so that the website will work faster but the quality of the photos will not suffer. Maybe there are some best practices on how to compress photo images without losses? If it's not "top secret", maybe you can share a sample of your code for doing it directly on the website?
Thanks!
o2 sensors

December 30, 2009 at 5:15 pm

Hello:

I am doing my work with Adobe Photoshop.. Is Silverlight more effective?? Has anyone tried??
Kevin Perod

January 1, 2010 at 11:13 am

Hello, I arrived at this website by accident when I was exploring on Google then I came upon your web site. I have to tell you that your site is very interesting I really like your theme! Kevin Perod