Dell MD3000 – Great DAS DB Storage
So I’ve written about storage before, specifically our quest for The Perfect DB Storage Array and how Sun’s storage didn’t stack up with their excellent servers. As you can probably tell, I spend a lot of my time thinking about and investigating storage – both small-and-fast for our DBs and huge-and-slower (like S3) for our photos.
I believe we’ve finally found our best bang-for-the-buck storage arrays: Dell MD3000. Here’s a quick rundown of why we like them so much, how to configure yours to do the same, and where we’re headed next:
- The price is right. I have no idea why these companies continue to show expensive prices on their websites and then quote you much, much cheaper ones (everyone does it), but Dell is no exception. Get a quote – you’ll be shocked at how affordable they really are.
- DAS via SAS. If you’re scaling out, rather than up, DAS makes the most sense and SAS is the fastest, cheapest interconnect.
- 15 spindles at 15K rpm each. Yum. Both fast and odd. Why odd? Because you can make a 14-drive RAID 1+0 and have a nice hot spare standing by.
- 512MB of mirrored battery-backed write cache. Use write-back mode to have nice fast writes that survive almost all failure scenarios.
- You can disable read caching. This is a big one. Given we have relatively massive amounts of RAM (32GB on server vs 512MB on controller) *and* that the DB is intelligent at reading and pre-fetching precisely the stuff it wants, read caching is basically useless. Not only that, but it harms performance by getting in the way of writes – we want super-fast non-blocking writes. That’s the whole point.
- You can disable read-ahead prefetching. Again, our DB does its own pre-fetching already, so why would we want the controller trying to second guess our software? We don’t.
- The stripe sizes are configurable up to 512KB. This is important because if you’re going to read, say, a 16KB page for a DB, you want to involve only a single disk as often as you can. The bigger the stripes, the better the odds are of only using a single disk for each read.
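To make that concrete, here’s a back-of-the-envelope sketch (assuming page starts land at uniformly random offsets within a segment) of how often a 16KB read stays on one disk:

```shell
# For a randomly placed 16KB read, the chance it fits inside a single
# segment is (segment - page) / segment; anything that crosses a
# segment boundary drags a second disk into the read.
page=16
for seg in 64 128 256 512; do
  pct=$(( (seg - page) * 100 / seg ))
  echo "${seg}KB segments: ~${pct}% of 16KB reads touch only one disk"
done
```

So jumping from 64KB to 512KB segments cuts the boundary-crossing reads from roughly 1 in 4 to about 1 in 32.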
- The controller ignores host-based flush commands by default. Thank goodness. The whole point of a battery-backed write-back cache is to get really fast writes, so ignoring those commands from the host is key.
- They support an ‘Enhanced JBOD’ mode whereby you can get access to the “raw” disks as their own LUNs (in this case, 15), but writes still flow through the write cache. Why is this cool? Because you can move to 100% server-controlled software storage systems, whether they’re RAID or LVM or whatever. More on this below…
Ok, sounds good, you’re thinking, but how do I get at all these goodies? Unfortunately, you have to use a lame command-line client to handle most of this stuff and it’s a PITA. However, you asked, so here you go (commands can be combined):
- disable read cache: set virtualDisk["1"] readCacheEnabled=FALSE
- disable read pre-fetching: set virtualDisk["1"] cacheReadPrefetch=FALSE
- change stripe size: read the docs for how to do this on new virtual disks, but to change an existing one online – set virtualDisk["1"] segmentSize=512
- Enhanced JBOD: Just create 15 RAID 0 virtual disks! 🙂
- BONUS! modify write cache flush timings: set virtualDisk["1"] cacheFlushModifier=60 – This is an undocumented command that changes the cache flush interval from the default of 10 seconds to 60. You can also use words like ‘Infinite’ if you’d like. I haven’t played with this much, but 10 seconds seems awfully short, so we will.
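Since the commands can be combined, one way to keep all this repeatable is to drop them into a script file and feed it to the SMcli client in one shot. A sketch, assuming the MD Storage Manager client is installed – the virtual disk name "1" and the management IP are placeholders for your setup:

```shell
# Collect the tuning commands above into a single SMcli script file.
cat > md3000-tune.cmd <<'EOF'
set virtualDisk["1"] readCacheEnabled=FALSE;
set virtualDisk["1"] cacheReadPrefetch=FALSE;
set virtualDisk["1"] segmentSize=512;
set virtualDisk["1"] cacheFlushModifier=60;
EOF

# Run it against the array's management interface (commented out here
# since it needs a live controller; 192.168.0.10 is a placeholder):
# SMcli 192.168.0.10 -f md3000-tune.cmd
echo "queued $(grep -c '^set' md3000-tune.cmd) commands"
```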
Wishlist? Of course I have a wishlist. Don’t I always? 🙂
- This stuff should be exposed in the GUI. The stripe size setting, especially, should be easily selectable when you’re first setting up your disks. It’s just dumb that it’s not.
- Better documentation. After a handy-dandy Google search, it appears as if the Dell MD3000 is a rebranded LSI/Engenio array, which lots of other companies also appear to have rebranded, like the IBM DS4000. But the Engenio docs are more thorough, which is how I found the cacheFlushModifier setting. (On a side note, why do these companies hide who’s building their arrays? They don’t hide that Intel makes the CPUs… Personally, I’d rather know)
- Faster communication. I asked Dell quite a while ago for information on settings like these and had to wait a while for a response. I imagine this might be related to the Engenio connection – Dell may have just not known the answers and had to ask.
- Bigger stripe sizes. I’d love to benchmark 1MB or bigger stripes with our workload.
- Better command-line interface. Come on, can’t we just SSH into the box and type in our commands already?
Ok, so where are we going next?
- ZFS. I believe the ‘Enhanced JBOD’ mode (15 x RAID-0) would be perfect for ZFS, in a variety of modes (striped + mirrored, RAID-Z, etc). So we’re gonna get with Sun and do an apples-to-apples comparison and see what shakes out. Our plan is to take two Sun X2200 M2 servers, hook them up to a Dell MD3000 apiece, run LVM/software RAID on one and ZFS on the other, then put them under a live workload and see which is faster. My hope is that ZFS will win or be close enough that it doesn’t matter. Why? Because I love ZFS’s data integrity and I believe COW will let us more easily add spindles and see a near-linear speed increase.
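As a sketch of what that Enhanced JBOD + ZFS layout might look like – the device names sdb through sdp are placeholders for however the 15 LUNs enumerate on the host – here’s the striped + mirrored variant with a hot spare:

```shell
# 15 LUNs -> 7 two-way mirrors (ZFS stripes across vdevs automatically)
# plus 1 hot spare. We just build and print the zpool command here
# rather than running it, since it needs the real hardware.
devs=(sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp)
cmd="zpool create tank"
for i in 0 2 4 6 8 10 12; do
  cmd="$cmd mirror ${devs[$i]} ${devs[$((i+1))]}"
done
cmd="$cmd spare ${devs[14]}"
echo "$cmd"
```

The RAID-Z comparison would just swap the mirror pairs for raidz groups; either way, the point is that the host, not the controller, owns the redundancy.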
- Flash. We’ve been playing around with the idea of flash storage (SSD) on our own for a while, and have been talking to a number of vendors about their approaches. It’s looking like the best bet may be to move from a two-tier storage system (system RAM + RAID disks) to a three-tier system (system RAM + flash storage + RAID disks) to dramatically improve I/O. If we come across anything that works in practice, rather than theory, I’ll definitely let you know.
- MySQL. We’ve now got boxes which appear to not be CPU-bound *or* I/O bound but are instead bounded by something in software on the boxes, either in MySQL or Linux. Tracking this down is going to be a pain, especially since it’s out of my depth, but we’ve gotta get there. If anyone has any insight or ideas on where to start looking, I’m all ears. We have MySQL Enterprise Platinum licenses so I can probably get MySQL involved fairly easily – I just haven’t had time to start investigating yet.
Also, you might want to check out this review of the MD3000 as well; he’s gone more in-depth on some of the details than I have.
Finally, I’m hoping other storage vendors perk up and pay attention to the real message here: Let us configure our storage. Provide lots of options, because ‘one size fits all’ is the exception, not the rule.
Sun’s announcement today that they’re unifying Storage and Servers under Systems is a good move, I think, but they’ve still got work to do. I believe (and everyone at Sun has heard this from me before) that their storage has been failing because it’s not very good. I hope this change does make a difference – because Jonathan’s right that storage is getting to be more important, not less.
UPDATE: One of the Dell guys who works with us (and helped us get all the nitty gritty details to configure these low-level settings) just posted his contact info in the comments. Feel free to call or email him if you have any questions.