The Perfect DB Storage Array
I’ve long known that YouTube had a secret weapon in their datacenter codenamed ‘Colin’, but yesterday at the MySQL Conference, I met three more secret weapons – codenamed ‘Paul’ and his team (sorry, guys, I’ve forgotten your names!).
Paul and his team are incredible. Paul’s keynote was easily the most interesting thing for me at the entire conference because of how technical and authoritative it was. It certainly helped that he spoke our language – he got down and dirty with his hardware, not just MySQL tuning variables, and discussed real-world fixes. Plenty of other MySQL sessions were interesting, but most of them stayed at a high level rather than getting down near the bare metal. We’ve long since left most of the high-level stuff behind and are, ourselves, focused on bare metal.
Best of all, the MySQL team at YouTube sees eye-to-eye with us when it comes to DB storage arrays. There are a few differences, I think, but we’re essentially very similar. Hopefully my description of our ideal, perfect, high-performance DB storage array can help out any other startups out there looking for solutions. Certainly having our internal assumptions validated by YouTube helps.
I hate the “queries per second” or “queries per day” metrics, because they tell you absolutely nothing about how complicated or long-running the queries are – but if you’re into those metrics, we do many billions of queries per day. So we care a great deal about getting good, fast hardware.
- We like DAS for our DB boxes, with the RAID controllers in the external enclosure rather than internal disks. This is one area where I’m not sure YouTube agrees with us (they might, we just didn’t discuss it). Let me explain:
  - When a server has some fatal hardware problem, we like to just yank it out of the rack, slide another identical server into place, hook it up to storage, and turn it on. No mess, no fuss.
  - Using LVM, we can easily add more storage and/or more spindles (see the LVM sketch after this list).
  - We’ve had problems in the past with RAID controllers failing and new ones not correctly picking up the RAID tags on the drives. External enclosures have two controllers, making single-card failures less problematic.
- We love spindles. The more the merrier. Our typical RAID 1+0 array has 14 of them, making 7 effective spindles. At best, that means we can do 7 operations concurrently.
- We love fast spindles. Give us 15K drives any day of the week.
- We love enclosures with an odd number of drives. 15 drives, 13 drives, something odd. Why? Because we want *1* hot spare, not 2, and want the rest of the spindles for reads & writes (the spindle math sketched after this list shows why).
- We love big battery-backed write caches. We stick them in write-back mode for super-fast writes (easily the hardest thing in a DB to scale).
- We hate read caching. We disable it entirely. The cache on RAID controllers is relatively puny (128–512MB) compared to the RAM in our DB boxes (32GB), so any reads that aren’t in our DB’s main memory certainly aren’t going to be in the RAID controller cache. We want every byte of that cache for writes. Plus, we don’t want read-cache misses to get serialized behind the pending writes.
- We hate prefetching. We disable it entirely. The DB is smart enough to request the data it needs without the RAID controllers trying to be smart and tying up disks and the entire I/O path with extra data we don’t need. (The controller-policy sketch after this list shows how we’d set these last three cache behaviors in one pass.)
- We want very configurable stripe/chunk sizes. Some controllers just have presets, like “DB”, which often use tiny (16K) chunks. Ugh. We want 1MB+ stripes (see the stripe arithmetic after this list for why that matters).
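A few of the points above deserve a concrete illustration, so here are some sketches. First, the LVM point: growing a data volume after hooking up another enclosure. The device name (/dev/sdc), volume group (vg_db), logical volume (lv_data), and sizes here are all made up for illustration – the underlying pvcreate/vgextend/lvextend/resize2fs commands are standard LVM2 and ext3 tooling, but treat this as a sketch, not a recipe.

```python
import subprocess

def run(cmd):
    """Echo a command, then execute it, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical names: the new enclosure shows up as /dev/sdc, and the
# data lives on volume group "vg_db", logical volume "lv_data".
new_device = "/dev/sdc"

run(["pvcreate", new_device])                           # label the new array as an LVM physical volume
run(["vgextend", "vg_db", new_device])                  # fold it into the existing volume group
run(["lvextend", "-L", "+500G", "/dev/vg_db/lv_data"])  # grow the logical volume
run(["resize2fs", "/dev/vg_db/lv_data"])                # grow the filesystem to match
```

Depending on your filesystem and kernel, that last resize step may need the volume unmounted first.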
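Second, the spindle arithmetic behind the RAID 1+0 and odd-drive-count points, spelled out:

```python
def effective_spindles(bays, hot_spares=1):
    """RAID 1+0 mirrors drives in pairs, so each mirrored pair is one
    'effective' spindle for concurrent operations."""
    usable = bays - hot_spares
    return usable // 2, usable % 2  # (mirrored pairs, unpaired drives wasted)

# A 15-bay enclosure: 1 hot spare + 14 data drives = 7 pairs, 0 wasted.
print(effective_spindles(15))  # (7, 0)

# A 14-bay enclosure forces a choice: leave a drive unpaired (below) or
# burn it as a second spare -- either way, only 6 effective spindles.
print(effective_spindles(14))  # (6, 1)
```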
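Third, the cache policies. On an LSI controller, these can typically be flipped with the MegaCli utility; the exact property spellings vary across controller generations and firmware, so take this as an assumption-laden sketch rather than gospel.

```python
import subprocess

def set_ld_prop(prop):
    """Apply one logical-drive property to every array on every adapter.
    MegaCli's property spellings vary by version -- check your controller docs."""
    cmd = ["MegaCli", "-LDSetProp", prop, "-LAll", "-aAll"]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

set_ld_prop("WB")      # write-back: acknowledge writes once they hit the battery-backed cache
set_ld_prop("NORA")    # no read-ahead: don't prefetch data the DB didn't ask for
set_ld_prop("Direct")  # direct I/O: don't stage reads through the controller cache
```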
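Finally, why we want big stripes: with a tiny chunk size, a single medium-sized I/O fans out across many spindles and ties them all up; with 1MB chunks, the same I/O usually lands on one spindle and leaves the rest free to serve other queries. A quick back-of-the-envelope (array geometry assumed, not measured):

```python
def spindles_touched(offset, size, chunk, spindles):
    """Count distinct spindles one I/O touches on a striped array."""
    first_chunk = offset // chunk
    last_chunk = (offset + size - 1) // chunk
    return min(last_chunk - first_chunk + 1, spindles)

MB = 1024 * 1024
# One 1MB request against 7 effective spindles:
print(spindles_touched(0, MB, 16 * 1024, 7))  # 16K chunks: 7 -- every spindle tied up
print(spindles_touched(0, MB, MB, 7))         # 1MB chunks: 1 -- six spindles left free
```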
Now, unfortunately, finding arrays that do all of this stuff is tough. We end up haggling with vendors, or wrestling with configurations, etc. And usually we have to compromise on one or two of the items. 😦 I think we’re close to finding an ideal one, but we’re not quite there yet. You’ll hear it here first when we do.
If you’re willing to lose DAS, both LSI and (according to YouTube) Adaptec let you get at most of the settings I mention above. I haven’t used 3Ware in a while, but I understand that they do not. If I’m wrong, someone please correct me.
Finally, our typical DB-class machine is a Sun X2200 M2 with two dual-core CPUs and 32GB of RAM. The RAM and the disk stuff I talk about above are far, far more important for our workload than the number or speed of CPUs, and it sounds like the same holds true at YouTube. We’re popping SAS cards into them and attaching them to DAS units.
Anyway, hope that helps any of you out there wondering what to buy. I will still blog about Sun’s storage shortly (and why it didn’t match up to what we needed); I’ve just been busy. This should help add some context, though.
Give me a few more days. 🙂