On Why Auto-Scaling in the Cloud Rocks

In high school, I had a great programmable calculator. I’d program it to solve complicated math and science problems “automatically” for me. Most of my teachers got upset if they found out, but I’ll always remember one especially enlightened teacher who didn’t. He said something to the effect of “Hey, if you managed to write software to solve the equation, you must thoroughly understand the problem. Way to go!”.
George Reese wrote up a blog post over at O’Reilly the other day called On Why I Don’t Like Auto-Scaling in the Cloud. His main argument seems to be that auto-scaling is bad and reflects poor capacity planning. In the comments, he specifically calls SmugMug out, saying we’re “using auto-scaling as a crutch for poor or non-existent capacity planning”.
George is like one of those math teachers who doesn’t “get it”. I was tempted not to write this post because he gets it so wrong that I’d hate to spread the meme. SkyNet auto-scales well. No humans at SmugMug are monitoring it; it just hums along, doing its job. Why is it so efficient? Because I understand the equation. I know what metrics drive our capacity planning, and I programmed SkyNet to take them into account. It checks an awful lot of data points every minute or so – this isn’t simply “oh, we have idle CPU, let’s kill some instances.” (Though I would argue that, depending on the application, simple auto-scaling based on CPU usage or a similar data point can be very effective, too.)
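To give a flavor of what “understanding the equation” means in practice, here is a minimal sketch of a queue-driven, multi-metric scaling decision. This is not SkyNet’s actual code; every metric name, threshold, and factor below is hypothetical, purely for illustration:

```python
# A minimal sketch of a multi-metric scaling decision in the spirit
# described above. All names and constants are hypothetical.

def desired_workers(metrics):
    """Size the fleet from the work backlog, not just CPU load."""
    jobs_pending = metrics["queue_depth"]
    # Jobs one worker completes per minute (guard against zero).
    rate = max(metrics["jobs_per_worker_minute"], 0.1)
    target_drain_minutes = 10          # drain the backlog within 10 minutes
    needed = jobs_pending / (rate * target_drain_minutes)
    return max(metrics["min_workers"], int(needed * 1.2))  # 20% headroom

def scaling_decision(metrics, current_workers):
    want = desired_workers(metrics)
    if want > current_workers:
        return ("launch", want - current_workers)
    if want < current_workers:
        # Shrink conservatively, a few instances per cycle.
        return ("terminate", min(current_workers - want, 5))
    return ("hold", 0)

# Example: 4,000 jobs queued, each worker clears 60 jobs/minute.
action = scaling_decision(
    {"queue_depth": 4000, "jobs_per_worker_minute": 60, "min_workers": 2}, 4)
print(action)  # ('launch', 4): 4000/(60*10)*1.2 = 8 workers needed, have 4
```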
SkyNet has been in production for over a year with only two incidents of note and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?
Seems like a classic example of calling a technique bad under the assumption that it's implemented poorly.
Either that or, "The game has changed, and it scares me."
It doesn't look like the AWS services are knit together yet such that they can be used by a startup to do auto-scaling without some custom code being architected and developed. Have I missed some documentation or a solid open source framework?
So for a startup that is less media-intensive than SmugMug, the question is: do you develop cloud-based auto-scaling (CBAS, you heard it here first) from day one, or only when cost and audience growth warrant it?
I think (and I said so in a comment on George's blog post) that SmugMug actually is an ideal case for how to do this well. You have an intimate understanding of the particulars of your systems architecture, you monitor those components, and you know exactly how to scale in the face of those issues. The real point is that you (SmugMug) are actually pretty unique: most people hear "auto-scale" and think to themselves, "when the load average goes over Foo, do Bar." You know (and I know) that it's not that simple; to do it well (and safely) you need to really understand the fundamentals of your architecture.
If you do not build (your software) to scale from the start, you will have to scale (your headcount) to build it later. The latter is more expensive than the former.
Right on, and extra emphasis on the last bit: building your own system to autoscale makes autoscaling for your system straightforward (and great). Building an autoscaling system to autoscale arbitrary apps to arbitrary scales is not straightforward.
If the process of building your software to scale is too onerous, you will run out of money before you get to "later."
Just a guess on my part, but do you think that when George goes to the gas station with his car (assuming he owns one), he only puts in the exact amount of fuel for the journey that will get him where he needs to be, and back again to that station? Guessing not! Companies like SmugMug are clearly on the leading edge and, in my opinion, have crossed or are crossing the chasm of CBAS (trademark Steve 😉!). Companies like RightScale are "knitting" a lot of those auto-scaling capabilities together, allowing slubs like me to benefit from the work of (really smart) slubs like Don! Those templates are readily available for putting together intelligent solutions for your company's needs. Hmmm, sort of like just filling up your gas tank when you get fuel. Imagine that!
I agree with David and Adam. There are too many flaws in George's post to deal with them all, but I'll take a stab at this one: "You still add capacity when you need it; you just add it according to a plan rather than willy-nilly based on perceived external events. …You just don't want the system automatically adjusting capacity based on usage."
This all comes down to false-alarm rates and classic engineering, and it sounds like SmugMug tracks enough metrics to smooth out the blips. Intelligent feedback systems that anticipate a variable and respond accordingly have been built into products and defense systems for a long time: your car does it, airplanes do it. This is related to another key point the original post seems to miss: a good autoscaling algorithm actually reacts to its own predictions about system state over the next time period (based on current data combined with historical patterns) instead of just reacting to the last 5 minutes of load.
What he calls "proper capacity planning" is really just a bunch of historical data, assumptions, and a cost function you are trying to optimize. A good autoscaling algorithm builds all that in and handles the day to day fluctuations in a repeatable way. A really good one can generalize well and respond accordingly without you lifting a finger.
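To make that concrete, here is a toy sketch of the idea: forecast the next period from an hour-of-day baseline blended with the current reading, then pick the capacity that minimizes instance cost plus a penalty for falling short. All constants (the blend weight, the penalty, the 500-instance cap) are made up for illustration, not anyone's production tuning:

```python
# A toy predictive autoscaler: prediction plus a cost function,
# as described above. All constants are illustrative.

HOURLY_BASELINE = [0.0] * 24   # learned hour-of-day average load (zero-seeded here)
ALPHA = 0.3                    # weight for the most recent observation

def forecast(load_now, hour):
    """Blend the historical hour-of-day pattern with the current load."""
    return ALPHA * load_now + (1 - ALPHA) * HOURLY_BASELINE[hour]

def best_capacity(predicted_load, per_instance_capacity,
                  instance_cost=0.10, shortfall_penalty=5.0):
    """Choose the instance count minimizing cost plus under-capacity penalty."""
    best_n, best_cost = 1, float("inf")
    for n in range(1, 500):
        shortfall = max(0.0, predicted_load - n * per_instance_capacity)
        cost = n * instance_cost + shortfall * shortfall_penalty
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n
```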
Obviously some testing needs to take place and manual overrides / fail-safes should be put in, but I don't buy the argument that everyone using autoscaling is setting themselves up for disaster. If your monitoring is in place, you are going to know when 40 instances are mistakenly fired up, and at worst you are out $4.00 (or $200 for 250 XL instances, as described in the original SmugMug post). In the long run, autoscaling will give you a much smoother, more cost-effective capacity plan than a human could, and it can adjust to bursty traffic (see the Animoto Facebook example).
Right On.
I think cloud computing is such a bubble. Personally, I don't think the math supports the business model of using cloud services. For some businesses, such as a flower shop that gets flooded with requests during a select few holidays, scaling on EC2 is cheap. But if you have to rely on AWS 365 days a year, it doesn't really make economic sense to pay AWS's high prices.
Take S3, for example. Don wrote an article on how he was saving millions by using S3 instead of buying terabytes of storage servers. Well, that model is only right while you are in the "growth cycle." The main advantage is that you don't have to pay the hardware cost up front (and any one-time sunk cost can be amortized over the lifespan of the hardware anyway). On the other side of the story, now that we are in an economic meltdown, I assume growth at SmugMug is declining if not negative. If you paid for your hardware up front, then during contraction periods your new-hardware cost is zero, since you aren't buying more hardware to fuel growth. So the variable running costs when you own your hardware are electricity, management, bandwidth, and the annualized depreciation of the hardware.
At SmugMug's scale, management is basically covered by the salaries of the existing IT folks; it is unlikely SmugMug would fire the people who were managing the storage servers once everything moved to S3, so management can be treated as a sunk cost either way. Now, bandwidth. You have to have the scale to reach the last pricing tier to make S3 as cost-effective as possible: 10 cents per GB, plus 0.0001 cents per request. That request charge pretty much means S3 cannot be used to store files smaller than 100KB, or cost escalates faster than you think: if you store 10KB thumbnails, that is 100K requests per 1GB of bandwidth, and you are effectively paying double the asking rate; at a 100KB file size, you are paying 11 cents per GB. Meanwhile, gigabit links are selling for about 5,000 dollars a month. Even assuming you don't use the link fully, you can transfer 200GB per month for every Mbps. That means bandwidth costs 5,000 dollars per 200TB, or 40GB per dollar, or 2.5 cents per GB, which is 1/4 of what AWS is asking.
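To check that arithmetic yourself, here is a quick sketch using only the 2008-era prices quoted above (not current AWS pricing):

```python
# Effective S3 delivery cost per GB as a function of file size,
# using the 2008-era prices quoted in the comment above.
BANDWIDTH_PER_GB = 0.10          # dollars per GB transferred
REQUEST_COST = 0.01 / 10_000     # dollars per GET request

def cost_per_gb(file_size_kb):
    requests_per_gb = (1024 * 1024) / file_size_kb  # KB in a GB / file size
    return BANDWIDTH_PER_GB + requests_per_gb * REQUEST_COST

print(cost_per_gb(10))    # ~0.205: 10KB files nearly double the asking rate
print(cost_per_gb(100))   # ~0.110: 11 cents/GB, matching the comment
print(cost_per_gb(1024))  # ~0.101: request overhead vanishes for 1MB files
```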
Electricity cost can be estimated at 150 dollars per amp per year. Assume storage servers outlast CPUs, with a 5-year life span. That means an MD1000 array drawing 3A (15x 15k-rpm or 15x 7,200-rpm drives) will cost you about 2,250 dollars in power over the lifespan of the array, nearly half the cost of the array itself.
Then figure out the depreciation schedule of the storage. Again, storage hasn't depreciated as fast as CPUs have. Assuming a 5-year life span, you are taking approximately 20% depreciation per year.
I would really love to see Don talk about the cost structure again now that the growth phase is over. If Don were still operating with owned storage servers, he wouldn't be paying for new hardware right now, but he is still paying a lot of money per month for S3. The math is simple. Assume Don is using 500TB of space at Amazon, with RAID5 and every file backed up. That means Don would need to buy 1,500TB of storage: 200 MD1000 arrays running 500GB SATA drives, attached to commodity single-socket Opteron servers. That setup would cost him 5K (MD1000 + 15x 500GB SATA drives) + 2K (server) = 7K per node, plus some networking equipment; call it 10K per node. 200 nodes x 10K = 2 million dollars, so the sunk cost to own the hardware is about 2 million bucks. Electricity is 5A per node, or 750 dollars per node per year; that is 150K per year to keep the hardware running, or under 1 million dollars over 5 years. So far we are at under 3 million without bandwidth. Given the current low-interest environment (5%), the savings Don thinks he's getting by not purchasing the hardware outright are smaller than he thinks. Once the bandwidth cost is added, I think S3 costs about twice as much at a bare minimum.
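For anyone who wants to replay that back-of-envelope comparison, here is a sketch using only the commenter's own assumptions, plus one added assumption of mine: an S3 storage rate of $0.15 per GB-month, roughly the 2008-era first-tier price (real tiered pricing was lower at volume):

```python
# Back-of-envelope ownership-vs-S3 comparison, replaying the comment's
# assumptions. S3_PER_GB_MONTH is my added assumption, not a quoted figure.
USABLE_TB = 500
RAW_TB = USABLE_TB * 3                 # RAID5 plus a full backup copy
TB_PER_NODE = 7.5                      # MD1000 with 15x 500GB drives
NODES = int(RAW_TB / TB_PER_NODE)      # 200 nodes
OWN_HARDWARE = NODES * 10_000          # $10K/node including networking
POWER_PER_NODE_YEAR = 5 * 150          # 5A at $150 per amp-year
YEARS = 5
S3_PER_GB_MONTH = 0.15                 # assumed 2008-era first-tier rate

own_total = OWN_HARDWARE + NODES * POWER_PER_NODE_YEAR * YEARS
s3_total = USABLE_TB * 1024 * S3_PER_GB_MONTH * 12 * YEARS

print(f"own: ${own_total:,.0f} over {YEARS} years")  # own: $2,750,000 over 5 years
print(f"s3:  ${s3_total:,.0f} over {YEARS} years")   # s3:  $4,608,000 over 5 years
```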
The same analysis can be done for EC2. AWS simply isn't passing along the cost savings that each new generation of CPU architecture delivers; EC2 pricing has not reflected the depreciation in CPU unit cost driven by Moore's Law. (CPUs get double the transistors every 18 months or so. Assuming those transistors translate into double the cores, each CPU unit Amazon sells under virtualization should cost about 50% less every 18 months. But Amazon pockets the benefit: you still pay 10 cents per machine hour.)
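The implied price path is pure arithmetic on that assumption (not a claim about Amazon's actual costs):

```python
# If per-unit compute cost halved every 18 months and the savings were
# passed through, a $0.10/hour instance "should" cost:
for months in (0, 18, 36, 54):
    print(f"{months:>2} months: ${0.10 * 0.5 ** (months / 18):.4f}/hour")
# 0 months: $0.1000, 18: $0.0500, 36: $0.0250, 54: $0.0125
```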
So there are an awful lot of things wrong with this comment, but chief among them is the assumption that SmugMug's growth has somehow flattened or is in decline. That couldn't be further from the truth – we had explosive growth this year, despite our expectations of slowing thanks to the economic meltdown.
The other element you're overlooking is that when Amazon lowers their S3 pricing (like they did this year), *all* of our data storage costs get reduced, not just the "new" stuff. That's a very different model than capitalizing our own storage – those costs are then not only sunk, but fixed.
Further, capitalizing those storage costs requires cash. Lots of it. It's easy to say you'll just amortize it over 5 years, but you still have to come up with the cash up front or carry a nasty debt load. (Look where the debt load in this country got us this year). We like owning our business, being our own masters, and not owing anyone – investors or creditors – a dime. Amazon's cost structure allows us to do that – and personally, I'm not sure I can put a price on that sort of freedom.
Finally, Amazon has serious incentive to keep lowering prices and keeping us happy. The "cost of moving" from Amazon is roughly equal to one month's "rent" so there's very little barrier to simply switching to something else (in-house storage, Windows Azure, Google AppEngine, etc). That keeps them honest, and keeps our operating cost structure down. The same cannot be said of doing your own storage in-house. Once you own that stuff, you have to monetize it.
So yes, we're still extremely happy with Amazon, and extremely happy with the less-tangible benefits that come along with it (lower headcount, no venture capital, no debt, flexibility, less mental anguish, etc) along with the obvious business savings and optimizations.
George was wrong and he admitted it to me on Twitter: He didn't really understand the dynamics of how people were using EC2. I pointed out to him that companies like Animoto and SmugMug were using queued models. Here is his reply…
http://twitter.com/GeorgeReese/status/1045799833
John
johnmwillis.com
Not sure what happened. It looks like it prefixes the URL.
twitter.com/GeorgeReese/status/1045799833
or here is what he said:
@botchagalupe Yeah, and my objectives don't fit outside the realm of web apps anyways
Don:
I have been reading your blog for a long time now, and you know I fully respect your opinion. The one thing I don't get is your over-optimism about Amazon AWS and the Sun Storage 7000 series. Both are great products; however, I think you are underestimating their true cost.
1. First of all, congrats if you are still growing. I guess we haven't even entered the dark tunnel yet. To assume that growth will continue in this economic environment is not prudent. I have tried to show that when growth is negative or flat, the true cost of S3 catches up with you: if you own the hardware, you no longer have to add any once growth dies down, whereas with Amazon S3 you are still paying rent.
2. Again, Don, you are not sharing the numbers, but I bet what you pay Amazon could easily cover the monthly payments on the debt load if you didn't have the cash to pay for the hardware in the first place.
3. Amazon this year "attempted" to lower prices. On face value, it did reduce the storage cost; it now costs less to store stuff there. But they make it up with the 1 cent per 10,000 requests, which could easily double your bandwidth cost if your upload file size is less than 10KB (like thumbnails). I am not convinced that is a true reduction in price.
Anyway, how is your Sun Storage 7410 doing? You said a while ago that you were going to review the Sun Hybrid Storage Pool. Some really interesting SSDs are here now: the Intel X25-M and X25-E are pretty good, Samsung's 25GB/50GB enterprise drives are pretty good and a lot cheaper than Intel's per GB, and the OCZ Vertex is a decent MLC drive to use for L2ARC in ZFS.
Good to see George came back and actually READ your original post and realized the errors of his hasty judgment.
I recently came across your blog and have been reading along. I thought I would leave my first comment. I don't know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.
Alena
http://www.smallbusinessavenues.com
Now if only a couple of topical images were added to this text, it would be absolutely perfect!
Five out of five, no argument.
I had a chuckle reading this. I remember in university being given some math problems. I wrote up a program to solve them on the mainframe, handed in the answers with the program listings, and promptly got a D because I didn't show my work. My professor didn't get it.
nice
Crazy how interesting this is, you really lay it out. First class!
Thank you for the site, a very useful resource; I like it a lot.
A friend sent me a link to your blog over ICQ. Turns out it wasn't for nothing; I liked it. Now I'll be reading it regularly.
Your RSS feed has broken character encoding!
Thanks for the article. Very timely for me right now. I've saved it to reread.
I republished your article on my blog, and of course included a backlink to you. But I came to check whether the trackback showed up, and it isn't there…
Great design :)
A very entertaining read.
As usual, you delight us with your best lines. Thanks, I'm taking this!
Interesting and informative. Will there be anything more on this topic?
Sorry if this is the wrong place, but how do I get in touch with the site admin?
Tell me, do you have an RSS feed for this blog?
I think someone stole this article from you and posted it on another site. I've seen it there already.
May I ask a little question? You show a time after each post. Is that Moscow time? Thanks in advance!
Your post got me thinking. *goes off to think a lot* …
Thank you so much! Will there be more posts on this topic in the future? Really looking forward to them!
Good post, an interesting read.
These are very interesting. Thanks for sharing.
Yes, that film defined an era!!
"SkyNet has been in production for over a year with only two incidents of note and SmugMug has more than doubled in size and capacity during that time without adding any new operations people. How on earth is this a bad thing?"
Doesn't sound like a good thing if you're looking for work as an operations person.
I've always wanted to watch the Terminator movie… they say it's cool, but I never manage to get around to it…
Well, that's neat :))). Keep writing, we'll be waiting.
Very cool
An interesting article. Terminator rules!
Not bad; I agree with you. Thank you for this nice post, but who is George Reese?
I guess you were a really smart guy to program a calculator like that.
I agree with your professor 🙂
Hello Mr. MacAskill,
Your articles are simply amazing… I am working on a project that will use S3 as a backend for storing MP3s, and I plan to use an architecture somewhat similar to the one you use at SmugMug. Is your detailed architecture open, and is it possible for you to share it?
Do you use Django, Python, JSON, and S3? I know I'm asking a lot, which you might not be willing to share. If you have used Django for SmugMug, how has the experience been?
Thanks, a very useful article; I noted down a lot!
Auto-scaling in the cloud really rocks for small companies and startups that cannot afford expensive hosting services.
Thanks for the article…
Thanks, I've read it. Will there be a follow-up?
Hey Don, wish I could have joined you but it's a bit far from Switzerland 😉 Luckily I managed to catch it at IMAX Irvine, OC.
Auto-scaling works for your capacity planning so well because you truly understand it. Your calculator example is perfect. George just doesn't get it and probably never will.
I've also written some programs to solve basic algebra and matrix problems…
Hello Don!
I'm working on my private website at the moment, which is based on photo galleries. That's why I need a suggestion on how to decrease the size of the photos so that the website will work faster, but the quality of the photos will not suffer. Maybe there are some best practices on how to compress photo images without losses? If it's not a "top secret," maybe you can share a sample of your code for doing it directly on the website?
Thanks!
Hello:
I do my work with Adobe Photoshop. Is Silverlight more effective? Has anyone tried it?