I’ve long known that YouTube had a secret weapon in their datacenter codenamed ‘Colin’, but yesterday at the MySQL Conference, I met three more secret weapons – codenamed ‘Paul’ and his team (sorry, guys, I’ve forgotten your names!).
Paul and his team are incredible. Paul’s keynote was easily the most interesting thing for me at the entire conference because of how technical and authoritative it was. It certainly helped that he spoke our language – he got down and dirty with his hardware, not just MySQL tuning variables, and discussed real world fixes. Plenty of other MySQL sessions were interesting, but most of them focused at a high level rather than down near the bare metal. We’ve long since left most of the high level stuff behind and are, ourselves, focused on bare metal.
Best of all, the MySQL team at YouTube sees eye-to-eye with us when it comes to DB storage arrays. There are a few differences, I think, but we’re essentially very similar. Hopefully my description of our ideal, perfect, high-performance DB storage array can help out any other startups out there looking for solutions. Certainly having our internal assumptions validated by YouTube helps.
I hate the “queries per second” or “queries per day” metrics, because they tell you absolutely nothing about how complicated or long the queries are, but we do many billions of queries per day, if you’re into those metrics. So we care a great deal about getting good, fast hardware.
- We like DAS for our DB boxes, with RAID controllers in the external enclosure, rather than internal disks. This is one area where I’m not sure YouTube agrees with us (they might, we just didn’t discuss it). Let me explain:
- When a server has some fatal hardware problem, we like to just yank it out of the rack, slide another identical server in place, hook it up to storage, and turn it on. No mess, no fuss.
- Using LVM, we can add more storage and/or more spindles easily.
- We’ve had problems in the past with RAID controllers failing and new ones not correctly picking up the RAID tags on the drives. External enclosures have two controllers, making single card failures less problematic.
- We love spindles. The more the merrier. Our typical RAID 1+0 array has 14 of them, making 7 effective spindles. At best, that means we can do 7 operations concurrently.
- We love fast spindles. Give us 15K drives any day of the week.
- We love enclosures with an odd number of drives. 15 drives, 13 drives, something odd. Why? Because we want *1* hot spare, not 2, and want the rest of the spindles for reads & writes.
- We love big battery-backed write caches. We stick them in write-back mode for super-fast writes (easily the hardest thing in a DB to scale).
- We hate read caching. We disable it entirely. The cache on RAID controllers is relatively puny (128-512MB) compared to the RAM in our DB boxes (32GB), so any reads that aren’t in our DB’s main memory certainly aren’t going to be in the RAID controller cache. We want every byte in the cache for writes. Plus, we don’t want read cache misses to get serialized behind the pending writes.
- We hate prefetching. We disable it entirely. The DB is smart enough to request data it needs without the RAID controllers trying to be smart and tying up disks and the entire I/O path with extra data we don’t need.
- We want very configurable stripe/chunk sizes. Some controllers just have presets, like “DB”, which often have tiny (16K) sizes. Ugh. We want 1MB+ stripes.
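The arithmetic behind a few of the bullets above can be sketched roughly as follows. This is only an illustration, not anything we actually run: the `Raid10Layout` class and `chunks_touched` function are hypothetical names, the 16-slot enclosure is just a comparison case, and the 16KB request at an 8KB offset is an arbitrary example rather than a measured workload.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


@dataclass(frozen=True)
class Raid10Layout:
    """Hypothetical model of a RAID 1+0 array inside one enclosure."""
    enclosure_slots: int  # total drive bays in the enclosure
    hot_spares: int = 1   # drives held back as hot spares

    @property
    def data_drives(self) -> int:
        # Drives left for the array once the hot spares are set aside.
        return self.enclosure_slots - self.hot_spares

    @property
    def mirrored_pairs(self) -> int:
        # RAID 1+0 mirrors drives in pairs: the "effective spindles".
        return self.data_drives // 2

    @property
    def idle_drives(self) -> int:
        # A leftover drive cannot form a pair, so it does no useful I/O.
        return self.data_drives % 2


def chunks_touched(offset: int, length: int, chunk_size: int) -> int:
    """Number of stripe chunks a single request overlaps."""
    if length <= 0:
        return 0
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return last - first + 1


odd = Raid10Layout(enclosure_slots=15)   # 14 data drives + 1 hot spare
even = Raid10Layout(enclosure_slots=16)  # 15 data drives + 1 hot spare

print(odd.mirrored_pairs, odd.idle_drives)    # 7 pairs, nothing idle
print(even.mirrored_pairs, even.idle_drives)  # 7 pairs, 1 drive left over

# Largest controller cache mentioned above as a fraction of DB box RAM.
cache_fraction = (512 * MB) / (32 * GB)
print(f"{cache_fraction:.2%}")

# The same 16KB request with a tiny chunk size vs. a 1MB chunk size.
print(chunks_touched(8 * KB, 16 * KB, 16 * KB))  # straddles 2 chunks
print(chunks_touched(8 * KB, 16 * KB, 1 * MB))   # stays inside 1 chunk
```

With an even number of slots, the extra drive either sits idle or becomes a second hot spare, which is exactly what we’re trying to avoid. And with tiny chunks, a single small request can end up split across two chunks (and so two spindles), while a 1MB chunk keeps it on one.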
Now, unfortunately, finding arrays that do all of this stuff is tough. We end up haggling with vendors, or wrestling with configurations, etc. And usually we have to compromise on one or two of the items. 😦 I think we’re close to finding an ideal one, but we’re not quite there yet. You’ll hear it here first when we do.
If you’re willing to lose DAS, both LSI and (according to YouTube) Adaptec let you get at most of the settings I mention above. I haven’t used 3Ware for a while, but I understand that they do not. If I’m wrong, someone please correct me.
Finally, our typical DB class machine is a Sun X2200 M2 with two dual-core CPUs and 32GB of RAM. The RAM and the disk stuff I talk about above are far far more important for our workload than the # or speed of CPUs, and it sounds like the same holds true at YouTube. We’re popping SAS cards in them and attaching to DAS units.
Anyway, hope that helps any of you out there wondering what to buy. I will still blog about Sun’s storage shortly (and why it didn’t match up to what we needed), I’ve just been busy. This should help add some context, though.
Give me a few more days. 🙂
We’ve had this program for a while, but as always, we suck at PR so you probably didn’t even know. 🙂
It’s pretty simple: Want to build something on SmugMug’s API? You’ll get a lifetime free Pro account ($150/year value) for doing so. Just sign up for a free trial and then drop us a note letting us know what application you’re gonna build, and we’ll take care of the rest.
Already have a SmugMug account? No worries, we’ll upgrade it and make it free for life. Just let us know.
And yes, we’re cool with everyone building commercial apps on top of SmugMug. Thrilled, in fact – we already have hundreds of them. There are only two exceptions we can think of: ‘SmugMug-in-a-box’ where you’re re-selling SmugMug (lots of people sharing a single account), and a situation where SmugMug is basically a database, like an ad server. When in doubt, just ask. Other than those, go wild.
It’s been two months since we divorced Rackable and married Sun as our new server & storage vendor, and lots of people have been asking how it’s going. So while the ‘marriage’ is still early, the server side of things is going really, really well. We’re still starry-eyed in love. Our experience with Sun’s storage hardware isn’t nearly so rosy (in fact, it’s downright bad), but I’ll cover that in a near-future update.
So, what do we love about our new server partner?
- We can standardize on a single server platform for 99% (if not 100%) of our future server needs. The SunFire X2200 M2 servers are 1U and scale up to 2 x dual-core Opterons with 32GB of RAM (and, as important, down to 1 Opteron w/2GB of RAM). For us, that’s huge. Imagine, if you will, some catastrophe befalling one of our database boxes that requires hardware replacement. Instead of having lots of expensive, idle, duplicate hardware around, we could literally crack open a web server, add some more RAM and an external HBA card, and boom, we have a new DB box. There are many reasons Southwest is the most profitable US airline and a huge one is standard components.
- Their lights-out management (LOM) is a dream. I dinged the Sun T1000 last year because its LOM is pretty terrible, but the X2200’s LOM is freaking fantastic. How fantastic? Let me count the ways:
- It’s ethernet rather than serial. Yay!
- It can share the same ethernet port the OS does. One wire for both LOM and OS! Less datacenter mess. Double yay!
- It has a built-in Web UI that lets you see and access all of the features, in addition to telnet and SSH.
- The Web UI lets you actually view the VGA output on the console. Not just serial console redirection – actual video output.
- The LOM lets you remotely mount ISOs, floppy images, etc. Got a CD or DVD on your desktop at the office that you wish was in the drive at your datacenter? No problem.
- Built-in email notification ability for status changes.
- Lots of SNMP settings. Haven’t played with this much yet, but it looks full-featured.
- Lots and lots of hardware details, like motherboard and BIOS versions, NIC details, etc., are all right there.
- All of the statuses (fan speeds, temp readings, voltage indicators, etc.), with tons of detail, are at your fingertips.
- Well built. First of all, it’s amazing what’s crammed into this 1U footprint. But second, it’s gorgeous inside. It’s clear that someone(s) spent a lot of time and energy working on the layout so that everything fit together just right. Feels like a labor of love. Nothing looks out of place.
- I gave the T1000 props for the way Sun does illustrations on their lids to show what parts are hot-swappable vs cold-swappable and the X2200 is no exception. The lid is printed with all kinds of useful diagrams that make servicing the hardware much much easier. I’m a sucker for attention to detail (one reason I love Apple).
- Turnaround time was excellent with both orders we’ve placed so far. We don’t have the luxury of planning for projects months and months in advance, so moving quickly when we need new hardware is key.
- Pricing was great. Thanks to Sun’s AMD (and soon, Intel) server platforms, their pricing is competitive with everyone else. I truly believe that the baseline hardware (CPU, RAM, HDDs) has become commodity and that the differentiating value is in the extra technology (like LOM), service, and support. Sun gets this, I think.
- Their rails just work. This is more rare than you might imagine – sucky rails really suck. Sun’s rails do what they’re supposed to – make it easy to install and, later, get access to your servers.
- Their diagnostic CD was extremely useful and easy to use. This is an often overlooked area, but we were unlucky enough to get some bad RAM (see below), and this came in handy.
- Fast. I thought this went without saying, since the performance bits are commodity components, but as you’ll see from the storage problems we had, speed on paper doesn’t always equal speed in the datacenter. These boxes are as fast as they should be – screaming.
So what’s not to like? Nothing’s bad enough that we’d kick Sun outta bed for eating crackers, but there are some quirks:
- We bought these direct from Sun, with custom configurations, and I believe Sun is still trying to get their head around direct sales (vs VARs). As a result, it turns out that they arrived without all of the RAM already installed. No biggie, we just installed it ourselves. Only thing is, the RAM also wasn’t tested beforehand. We’re used to our systems being fully tested & burned-in prior to delivery, and sure enough, we got a bad piece of RAM. That sucked. For now, we’re just adding a day of burn-in to our install routine, but we’re hoping Sun standardizes on this in the future. UPDATE 1: Just got word from Sun, there is an option to have custom configs burned-in at no cost, but it adds an extra 2-3 weeks to the lead time. We’ll have to think about how to best use this here, since we usually want our gear fast.
- As I mentioned in our engagement announcement, the sales and approval process (not the people) sucks. Having to go through the approval process over and over for each order that’s slightly different isn’t pleasant. Dell excels at this, by comparison. They fire off quotes (and hardware!) with lightning speed. Here’s how I wish it would work:
- Sun goes through the approval process for SmugMug and assigns us a discount.
- From then on, we can just log in to sun.com and place orders for as much (or as little) hardware as we want that day and it automagically applies our discount.
- Should we think our sales volume warrants a bigger discount or something, we re-engage to re-evaluate.
- Our sales team at Sun gets to focus on keeping us up-to-date on new technology, roadmap changes, and everything else without wasting time on the approval process for small orders that are similar to orders we’ve placed in the past.
- We’re happy, Sun’s happy, everyone’s happy.
If we could change anything about them, would we? Of course!
- We’d love to see dual power supplies. Since power supplies are a very common failure point for servers, we like redundancy here. (The moving parts fail far more often than our circuits do, so, surprisingly, we don’t want dual power supplies to handle circuit failures.)
- While we’re dreaming, I’d love to see DC power as an option and remove AC from the equation. We could get lower failure rates, better power utilization, and better redundancy in one fell swoop.
- And if we really want to get pie-in-the-sky, I’d love to see some sort of liquid or gas cooling system so we can get cooling efficiencies too. This is way outside of my field of expertise, so I don’t know how it would work, but Blackbox seems like it has some great stuff along these lines.
Stuff we really haven’t kicked the tires on yet:
- We typically whip out our amp meter and take power readings as soon as we get new hardware in our datacenter, since power & cooling are huge concerns for us. This time, we were under such a time crunch (and so busy with all of the nasty storage problems I’ll blog about soon) that I haven’t had time. I’m hopeful that all of Sun’s noise about power efficiency is reflected in the actual readings, but I won’t know for sure until I get the hardware out and test it.
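For anyone wondering what we do with those amp readings, here is a minimal sketch of the kind of math involved. Every number in it is an assumption for illustration only: the 1.5A per-server reading is made up (it is not a measurement of the X2200), and the 120V supply, 20A breaker, and 80% continuous-load derating are just common figures, not necessarily what your datacenter uses. The function names are hypothetical too.

```python
def apparent_power_va(amps: float, volts: float) -> float:
    """Apparent power in volt-amps from a meter reading.

    Real watts also depend on the power factor, which a plain
    amp reading does not tell you, so treat this as an upper bound.
    """
    return amps * volts


def servers_per_circuit(amps_per_server: float,
                        breaker_amps: float,
                        derate: float = 0.8) -> int:
    """How many servers fit on one circuit at a given derating."""
    usable_amps = breaker_amps * derate
    return int(usable_amps // amps_per_server)


# Hypothetical reading: 1.5A per server on an assumed 120V circuit.
reading_amps = 1.5
print(apparent_power_va(reading_amps, volts=120.0))          # volt-amps per server
print(servers_per_circuit(reading_amps, breaker_amps=20.0))  # servers per 20A circuit
```

Take the reading under realistic load rather than at idle, since that is the number that determines how densely a rack can be filled.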
And finally, everyone at Sun deserves a shout out. They’ve built a great product, and they’ve certainly shown us a great deal of support and personal attention, which we appreciate. If the people we’ve dealt with are any indicator of upcoming success, Sun’s future looks bright. (No pun intended).
I will post a follow-up shortly detailing the nightmare that our quest for fast DB storage became and what we’ve managed to do about it, but for now, I hope this helps anyone looking for server solutions.
Bottom line: I can’t recommend the X2200 M2 highly enough.
In case you’ve been living in a cave, the newspaper business is in trouble.
I don’t know a single person born after 1976 who subscribes to a newspaper. I never have, and no-one I know from my generation has, either. Why would we? We could get all of our news online before I even graduated high school.
But I do read newspapers. I just consume them differently than people historically have – I pick and choose my favorite bits from the world’s papers, instead of reading my local paper cover-to-cover every morning. There’s certainly still a place in this world for newspapers – just maybe sans paper. How do I know?
This article is proof in and of itself. One of the world’s most renowned musicians, Joshua Bell, gave an impromptu concert in the DC Metro with his $3.5M violin, and the Washington Post has one of the best articles I’ve ever read about it. Beautifully written, see for yourself, you’ll love it. And if you needed it, it reminds you why traditional media still has a place in this world. (Thanks Matt!)
Oh, and in case you hadn’t already heard, Sam Zell is an idiot. Newspapers are doomed if you listen to his crap.
I sold mine unopened. How a company could make so many blunders, I’ll never know.
It reminds me of how they blew the music player market too. They owned music. Literally. Discman was theirs. Walkman was theirs. They even owned one of the big music labels. Yet Apple is the one who now dominates the market.
It didn’t take a rocket scientist back then to see how Sony could have won, well before Apple even entered the market. How do I know? Because everyone was talking about it online. The recipe for success was on everyone’s lips, ready for the taking. Sony could have made the dominant music player but chose to ignore the customer.
History repeats itself with PS3.
I pray I never lose sight of what our customers want. Thank goodness our customers are very patient with us and keep reminding us of all of our flaws. I promise – we’re listening and working on them.
Chris MacAskill, my father and SmugMug’s co-founder, covers it best: Be different or be damned. It’s in the April 9th issue.
We were lucky enough to be part of a BusinessWeek cover story last November, too.
As always (I should probably do a blog entry on this), anyone building something for our API or integrating an existing project gets a free lifetime Pro account at SmugMug. Just drop us a line.
And unlike some other sites online, we love it that you build commercial apps on our API. Go to town, make some money, change the world. We’ll help!