I don’t want to start a nerdfight here, but it might be inevitable. 🙂
Valleywag ran a story today about how Amazon’s EC2 instances are running at 50% of their stated speed/capacity. They based the story on a blog post by Ted Dziuba, of Persai and Uncov fame, whose writing I really love.
Problem is, this time, he’s just wrong. Completely full of FAIL.
I’ll get to that in a minute, but first, let me explain what I think is happening: Amazon’s done a poor job at setting user expectations around how much compute power an instance has. And, to be fair, this really isn’t their fault – both AMD and Intel have been having a hard time conveying that very concept for a few years now.
All of the other metrics – RAM, storage, etc – have very fixed numbers. A GB of RAM is a GB of RAM. Ditto storage. And a megabit of bandwidth is a megabit of bandwidth. But what on earth is a GHz? And how do you compare a 2006 Xeon GHz to a 2007 Opteron GHz? In reality, for mere mortals, you can’t. Which sucks for you, me, and Amazon – not to mention AMD and Intel.
Luckily, there’s an answer – EC2 is so cheap, you can spin up an instance for an hour or two and run some benchmarks. Compare them yourself to your own hardware, and see where they match up. This is exactly what I did, and why I was so surprised to see Ted’s post. It sounded like he didn’t have any empirical data.
Admittedly, we’re pretty insane when it comes to testing hardware out. Rather than trust the power ratings given by the manufacturers, for example, we get our clamp meters out and measure the machines’ power draw under full load. You’d be surprised how much variance there is.
There was one data point in a thread linked from Ted’s post that had me scratching my head, though, and I began to wonder if the Small EC2 instances actually had some sort of problem. (We only use the XLarge instance sizes.) This guy had written a simple Ruby script and was seeing a 2X performance difference between his local Intel Core 2 Duo machine and the Small EC2 instance online. Can you spot the problem? I missed it, so I headed over to IRC to find Ted and we proceeded to benchmark a bunch of machines we had around, including all three EC2 instance sizes.
Bottom line? EC2 is right on the money. Ted’s 2.0GHz Pentium 4 performed the benchmark almost exactly as fast as the Small (aka 1.7GHz old Xeon) instance. My 866MHz Pentium 3 was significantly slower, and my modern Opteron was significantly faster.
So what about that guy with the Ruby benchmark? Can you see what I missed, now? See, he’s using a Core 2 Duo. The Core line of processors has completely revolutionized Intel’s performance envelope, so the Core processors perform much better per clock cycle than the older Pentium line of CPUs. This is similar to AMD’s approach: they long ago gave up the GHz race, choosing instead to focus on raw performance (or, more accurately, performance per watt).
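If you want to try this kind of comparison yourself, you don’t need anything fancy. A minimal sketch (the function name and workload here are my own invention, not the benchmark Ted and I actually ran): time an identical CPU-bound loop on each machine and compare the wall-clock numbers.

```python
import time

def cpu_benchmark(n=1_000_000):
    """Time a naive CPU-bound loop (sum of integer square roots).

    The workload itself is arbitrary; what matters is running exactly
    the same code on every machine you want to compare.
    """
    start = time.perf_counter()
    total = 0
    for i in range(1, n):
        total += int(i ** 0.5)
    return time.perf_counter() - start

elapsed = cpu_benchmark()
print(f"benchmark took {elapsed:.3f}s")
```

Run it on your own box, then on a cheap one-hour EC2 instance, and you have an apples-to-apples number that GHz ratings will never give you.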
Whew. So, what have we learned?
- All GHz aren’t created equal.
- CPU architecture & generation matter, too, not just GHz.
- AMD GHz have, for years, been more effective than Intel GHz. Recently, Intel GHz have gotten more effective than older Intel GHz.
- Comparing old pre-Core Intel numbers with new Intel Core numbers is useless.
- “top” can be confusing at best, and outright lie at worst, in virtualized instances. Either don’t look at it, or realize the “steal %” column is other VMs on your same hardware doing their thing – not idle CPU you should be able to use.
- Benchmark your own apps yourself to see exactly what the price per compute unit is. Don’t rely on GHz numbers.
- Don’t believe everything you read online (threads, blogs, etc) – including here! People lie and do stupid things (I’m dumb more often than I’m not, for example). Data is king – get your own.
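On the “steal %” point: if you want to watch steal time yourself rather than squint at top, it’s also exposed in /proc/stat on Linux (the eighth value on the “cpu” line, on kernels that report it). A quick sketch, with a hypothetical helper name of my own:

```python
def steal_fraction(stat_line):
    """Parse the aggregate "cpu" line from /proc/stat and return the
    fraction of jiffies stolen by other guests on the same host.

    The fields after the "cpu" label are, in order:
    user nice system idle iowait irq softirq steal ...
    Older kernels omit the trailing fields, so we default steal to 0.
    """
    values = [int(v) for v in stat_line.split()[1:]]
    steal = values[7] if len(values) > 7 else 0
    return steal / sum(values)

# On a live Linux box you'd feed it the real line:
#   with open("/proc/stat") as f:
#       print(steal_fraction(f.readline()))
sample = "cpu 100 0 50 800 10 0 0 40"
print(steal_fraction(sample))  # 0.04
```

A consistently high steal fraction just means your neighbors on the physical host are busy – it isn’t capacity you were promised and aren’t getting.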
Hope that clears that up. And if I’m dumb, I fully expect you to tell me so in the comments – but you’d better have the data to back it up!
(And yes, I’m still prepping a monster EC2 post about how we’re using it. Sorry I suck!)
photo by: Vandana
Choosing Vandana’s gorgeous photo (above) as the winner of Digital Grin’s Last Photographer Standing I must have been tough – just look at all the amazing entries – but man, oh man, talk about stunning! Vandana wins the $7500 grand prize plus a lifetime free SmugMug Pro account. Congrats!
Overall, there were $25K in total prizes given out during the year-long Last Photographer Standing I competition. I can’t wait to see what happens with LPS II. Bring it on!
Sûnnet Beskerming is out with a blog post claiming that we left some privacy holes open with our new scheme. I’m almost 100% positive we did leave some holes open, because this is a new release and we’re bound to have bugs, but they’re just dead wrong about this one. They clearly have an axe to grind (they would like us to hire them, and sound like they’re now pissed that we haven’t).
Since their original post, we’ve been tossing around the idea of hiring someone to periodically review our security & privacy policies/implementation, and they were on the list for consideration. It looks like we probably will hire someone, but given how poorly researched this new article is, it’s clearly not going to be them. I’ll bet we end up going with the brilliant experts over at OmniTI instead.
They made two bad assumptions:
- They somehow assume that just because you know the ImageID and ImageKey, you can get the Original image. As all of our customers know, we let them lock down the Original so that no one can get it.
- They then went on to explain that you could see a photo without providing the proper ImageKey simply by using an ImageKey from another photo in lightBox. Um, no. Apparently the concept of grandfathering older photos is beyond their comprehension. Our customers understood and appreciated it, but this so-called security firm doesn’t. Go figure.
Craziest part of this whole thing is that they chose to blog about their ignorance instead of just emailing us. We could have politely and privately researched the issue, discovered that things were working as designed, and set them straight. Instead they felt like they had to publicly attack and damage our business with a poorly researched story. (Nice way to drum up business, guys. Attack your potential customer AND get it wrong!)
To be clear: If you try their so-called exploit on a ‘new’ photo or video (one uploaded after our privacy changes on February 8th), it just won’t work. If you try it on an ‘old’ photo or video, it will – just like we designed it.
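The grandfathering rule is simple to express in code. Here’s a hypothetical sketch (the names are mine, and this is not our actual implementation – just the logic as described above):

```python
from datetime import date

# Items uploaded after the privacy change require a valid key;
# older items are grandfathered and their key-less URLs still work.
PRIVACY_CUTOVER = date(2008, 2, 8)

def requires_key(uploaded_on):
    """New photos/videos need a key; old ones are grandfathered."""
    return uploaded_on >= PRIVACY_CUTOVER

def may_view(uploaded_on, supplied_key, correct_key):
    if not requires_key(uploaded_on):
        return True  # grandfathered: works by design, not by accident
    return supplied_key == correct_key

print(may_view(date(2008, 1, 1), None, "aB3xQ"))  # True (grandfathered)
print(may_view(date(2008, 3, 1), None, "aB3xQ"))  # False (key required)
```

That second branch is exactly the “exploit” they reported: an old item viewable without its key, behaving precisely as designed.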
We’ve just added a little logic to change that behavior, so that other people who jump to conclusions with no basis in fact will get an error rather than seeing it silently work.
We’re also certainly not claiming our site is perfectly secure (and I can’t imagine we ever will). We think it’s *very* secure, but we’re still combing through all the dark corners of our codebase looking for areas where we can tighten things up. We still haven’t totally fixed a few of the issues brought up during our contest, even, though I can assure you we’re working on them. I’m sure we’ll continue to find more things, and that the community will as well.
Speaking of our wonderful community, now that our release is out and tested, we’re starting to pay the security bounties. Those of you who reported issues should have gotten, or will shortly be getting, an email from Markham. A few people refused their winnings, and refused to even let us donate to any charities in their name, so we’re donating the bounties to a charity of our choice instead.
Amazon S3 had an outage today. First I knew about it was reporters emailing and calling me asking if we were knocked out by it.
We weren’t. No customers reported issues, and our systems were all showing typically low and acceptable error rates. To be honest, I’m surprised.
I wasn’t going to blog about it until I understood why we weren’t affected, but I’m really getting inundated with requests now, so I figured this would be a good way to optimize my time rather than spending all day on the phone. 🙂
We’re researching what happened now, but again, I didn’t know about the outage until after it was over, and I haven’t spoken to anyone at Amazon yet. Until I finish my research and speak with Amazon, I’m not going to speculate on what may have happened or why.
I can say, once again, that we pay the same rates everyone else pays and that, other than some early access to upcoming beta services, we don’t get any preferential treatment that I’m aware of.
Some thoughts, though:
- We expect Amazon to have outages. No website I’m aware of doesn’t, whether it’s Google, Amazon, your bank, or SmugMug.
- I’ve written about Amazon S3 outages in the past, but in the last ~12 months, we’ve only seen a single ~2 minute outage window (on January 22nd, 2008 at around 4:38pm Pacific). We also had one recent fairly major hiccup with EC2.
- Yes, I believe there will probably be times where SmugMug is seriously affected, possibly even offline completely, because Amazon (or some other web services provider) is having problems. Today wasn’t that day. Nobody likes outages, especially not us, but we’ve decided the tradeoffs are worth it. You should have your eyes wide open when you make the decision to use AWS or any other external service, though. It will fail from time to time.
- We’ve done our best to build SmugMug in such a way that we handle failures as gracefully as possible. We can’t cover every case, but I think that the fact that we didn’t experience customer-facing outages today is a testament to that. Again, I want to stress that we do expect Amazon to cause us (rare) outages in the future, and that’s unavoidable, but today we dodged that bullet.
- Amazon’s communication about this has been terrible. It took far too long to acknowledge the problem. Fixing a major problem can take forever, which is understandable, but communicating with your customers should happen very rapidly. Amazon’s culture, internally, is very customer focused, so this is a strange anomaly. I will definitely lean on them some about it, and everyone who was affected should rightfully howl too.
- I’ve asked Amazon repeatedly for an “Amazon Web Services Health” page that shows the current expected state of all their services. Then you can tell at a glance (and even poll and work into your own monitoring) whether any of the services are having problems. Something like Keynote’s Internet Health Report would be a good start, but as Jesse Robbins points out, trust.salesforce.com is the gold standard. This page could also double as a mechanism to let customers know what’s being worked on and current ETAs when there are problems.
I’ll try to post a follow-up about why we weren’t affected when I know more. It’s possible that part of the reason we survived was some of our “secret sauce” and I just won’t be able to say, but I kinda doubt it.
Bottom line: While the outage was certainly a big deal to those affected, I think the bigger deal here is how Amazon handled the outage. They need to communicate better about these mission critical services and their health.
If I didn’t answer any questions you’d like me to answer, please post a comment and/or send me an email. I’ll do my best to respond.
UPDATE 1: I’m not sure why there’s all this confusion, but SmugMug *does* use Amazon as our primary data store. We maintain a small “hot cache” in our datacenters of frequently/recently viewed photos and videos, but there are massive numbers of them that are only at Amazon. This is a change from our initial usage of S3, and the change is based on how reliable they’ve been. Yes, we still consider them to be very reliable even after an outage like this. And yes, I suspect our “hot cache” did at least partially enable us to ride out this issue.
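If it helps picture the “hot cache” arrangement, the pattern looks roughly like this (class and method names are hypothetical, not our actual code): serve from the local cache when we can, and only go to the primary store when we must – so during a primary-store outage, every cache hit is a request that never notices the outage.

```python
class PhotoStore:
    """Sketch of a cache-then-primary read path."""

    def __init__(self, cache, primary):
        self.cache = cache      # dict-like local "hot cache"
        self.primary = primary  # callable fetching from the primary
                                # store (S3); may raise during an outage

    def get(self, photo_id):
        if photo_id in self.cache:
            return self.cache[photo_id]     # outage never noticed
        data = self.primary(photo_id)       # may fail if S3 is down
        self.cache[photo_id] = data         # warm the cache for next time
        return data
```

Since the cache holds frequently/recently viewed items, it absorbs exactly the traffic most likely to arrive during any given window – which would go a long way toward explaining a quiet afternoon while the primary store struggles.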
I told you we’d listen.
After Philipp brought the issue up, we carefully listened to both our current customers and our would-be customers. Our current customers were a mixed bag. Luckily, most didn’t care one way or the other. Of those who did care, many didn’t want this change. 😦 But it was clear that lots of potential customers did. And as I said in my initial post, “Philipp is absolutely right.”
So we fixed the problem.
We made two big mistakes with this situation, one technical and one around setting user expectations. I was dumb for using autoincrement IDs alone, and we were dumb for calling the gallery setting ‘Private’ when that wasn’t clear enough. “Private” means different things to different people, and we should have known better. Both of these things, I believe, have now been remedied.
Here are the gory details and we have a dgrin thread with more:
- Your new galleries, photos, and videos are more private, and secure, than ever before.
- GUIDs did turn out to be both messy and expensive, as I thought they would be. We opted not to go that route.
- Instead, we created Keys for galleries and photos/videos and appended them to the relevant URLs. Kudos to Barnabus for planting this seed.
- The keys are drawn from 57 possible alphanumeric characters and are 5 characters long, making the search space 57^5, or 601,692,057, keys strong. In theory that’s still guessable, but in practice it’s prohibitively expensive/difficult to do. Not to mention the fact that you have all the usual additional security and privacy settings you can turn on.
- Yes, this made our permalinks uglier. No, we’re not happy about it. But we think the tradeoff is worth it.
- Yes, older galleries and photos/videos are grandfathered. Their old URLs without the Keys still work. All new photos/videos, as well as old photos/videos inside of new galleries, require Keys to access. Same with new galleries.
- If you don’t want your older stuff grandfathered, simply create a new gallery and move your photos & videos from your old gallery into the new one. Keyed links will instantly be required for access (if you change your mind, just move them back and they’ll be re-grandfathered). Alternatively, you can set a password and turn off external links.
- The privacy options when creating a gallery and changing a gallery’s setting now use “Public” and “Unlisted” rather than “Public” and “Private” to better explain the difference and match customer expectations.
- When creating a new gallery, there’s a new option called “Lock it down” that’ll take things a step further and set all the right privacy *and* security settings to prevent unwanted access.
- This is a big, complicated release, so there will likely be bugs and bumps along the way. Let us know if you find any and I promise we’ll fix them.
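For the curious, generating a key like this is straightforward. A sketch, assuming the 57-character alphabet is the 62 alphanumerics minus five easily-confused characters – that last part is my guess at how you’d pick 57, not our published alphabet:

```python
import secrets
import string

# Hypothetical alphabet: 52 letters + 10 digits = 62, minus the five
# easily-confused characters 0, O, 1, l, I, leaving 57 symbols.
ALPHABET = "".join(c for c in string.ascii_letters + string.digits
                   if c not in "0O1lI")

def make_key(length=5):
    """Generate a random access key: one of 57**5 = 601,692,057
    possibilities, using a cryptographically secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(make_key())
```

The important detail is using a secure random source rather than a predictable one – with only ~600M possibilities, a guessable generator would undo most of the benefit.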
I’m sorry this change took so long to ship. We were actually in testing last Thursday, January 31st, but then I was traveling from Friday to Wednesday, so we had to put it off. Thanks for your patience while we thought about the problem, discussed it with our community, and put together an update.
Special thanks to our customers and friends who weighed in with lots of detail both about the problem and the implementation, and Philipp for being so passionate and firm about the situation.
We’d love to hear your thoughts about this either here in the comments or over on this dgrin thread.