ZFS & MySQL/InnoDB Compression Update
Network.com setup in Vegas, Thumper disk bay, green by Shawn Ferry
As I expected it would, the fact that I used ZFS compression on our MySQL volume in my little OpenSolaris experiment struck a chord in the comments. I chose gzip-9 for our first pass for a few reasons:
- I wanted to see what the “best case” compression ratio was for our dataset (InnoDB tables)
- I wanted to see what the “worst case” CPU usage was for our workload
- I don’t have a lot of time. I need to try something quick & dirty.
I got both those data points with enough granularity to be useful: a 2.12X compression ratio over a large & varied dataset, and the compression was fast enough to not really be noticeable for my end users. The next step, obviously, is to find out what the best ratio of compression and CPU is for our data. So I spent the morning testing exactly that. Here are the details:
- Created 11 new ZFS volumes (compression = [none | lzjb | gzip1-9])
- Grabbed 4 InnoDB tables of varying sizes and compression ratios and loaded them in the disk cache
- Timed the time (using ‘ptime’) it took to read the file from cache and write it to disk (using ‘cp’), watching CPU utilization (using ‘top’, ‘prstat’, and ‘mpstat’)
It quickly became obvious that there’s relatively little difference in compression between gzip-1 and gzip-9 (and, contrary to what people were saying in the comments, relatively little difference between CPU usage, either, in 3 of the 4 cases. The other case, though… yikes!). So I quickly stopped even doing anything but ‘none’, ‘lzjb’, ‘gzip-1′, and ‘gzip-9′. (LZJB is the default compression for ZFS – gzip-N was added later as an option).
Note that all the files were pre-cached in RAM before doing any of the tests, and ‘iostat’ verified we were doing zero reads. Also note that this is writing to two DAS enclosures with 15 x 15K SCSI disks apiece (28 spindles in a striped+mirrored configuration) with 512MB of write cache apiece. So these tests complete very quickly from an I/O perspective because we’re either writing to cache (for the smaller files) or writing to tons of fast spindles at once (the bigger files). In theory, this should mean we’re testing CPU more than we’re testing our IO – which is the whole point.
I ran each ‘cp’ at least 10 times, letting the write cache subside each time, selecting the fastest one as the shown result. Here they are (and be sure to read the CPU utilization note after the tables):
Notes on TABLE1:
- This dataset seems to be small enough that much of time is probably spent in system internals, rather than actually reading, compressing, and writing data, so I view this as only an interesting size datapoint, rather than size and time. Feel free to correct me, though. :)
Notes on TABLE2:
- gzip-9 is massively slower on this particular hunk of data. I’m no expert on gzip, so I have no idea why this would be, but you can see the tradeoff is probably rarely worth it, even if were using precious storage commodities (say, flash or RAM rather than hard disks). I ran this one extra times just to make sure. Seems valid (or a bug).
Notes on TABLE3:
- LZJB really shines here, performance wise. It delivers roughly 3X faster performance while also chewing up roughly 3X less bytes. Awesome.
- gzip’s compression ratios are crazy great on this hunk of data, but the performance is pretty awful. Definitely CPU-bound, not IO-bound.
Notes on TABLE4:
- Again, LZJB performs quite well. 1.5X bytes saved while remaining faster. Nice!
- gzip is again very obviously CPU bound, rather than IO-bound. Dang.
There’s one other very important datapoint here that ‘ptime’ itself didn’t show – CPU utilization. On every run with LZJB, both ‘top’ and ‘mpstat’ showed idle CPU. The most I saw it consume was 70% of the aggregate of all 4 CPUs, but the average was typically 30-40%. gzip, on the other hand, pegged all 4 CPUs on each run. Both ‘top’ and ‘mpstat’ verified that 0% CPU was idle, and interactivity on the bash prompt was terrible on gzip runs.
Some other crazy observations that I can’t explain (yet?):
- After a copy (even to an uncompressed volume), ‘du’ wouldn’t always show the right bytes. It took time (many seconds) before showing the right # of bytes, even after doing things like ‘md5sum’. I have no idea why this might be.
- gzip-9 made a smaller file (1355M vs 1380M) on this new volume as opposed to my big production volume (which is gzip-9 also). I assume this must be due to a different compression dictionary or something, but it was interesting.
- Sometimes I’d get strange error messages trying to copy a file over an existing one (removing the existing one and trying again always worked):
bash-3.2# ptime cp table4.ibd /data/compression/gzip-1
cp: cannot create /data/compression/gzip-1/table4.ibd: Arg list too long
- After running lots of these tests, I wasn't able to start MySQL anymore. It crashed on startup, unable to allocate enough RAM for InnoDB's buffer pool. (You may recall from my last post that MySQL seems to be more RAM limited under OpenSolaris than Linux). I suspect that ZFS's ARC might have sucked up all the RAM and was unwilling to relinquish it, but I wasn't sure. So I rebooted and everything was fine. :(
Conclusion? Unless you care a great deal about eking out every last byte (using a RAM disk, for example), LZJB seems like a much saner compression choice. Performance seem to improve, rather than degrade, and it doesn't hog your CPU. I'm switching my ZFS volume to LZJB right now (on-the-fly changes - woo!) and will copy all my data so it gets the new compression settings. I'll sacrifice some bytes, but that's ok - performance is king. :)
Also, my theory that I'd always have idle CPU with modern multi-core chips so compression wouldn't be a big deal seems to be false. Clearly, with gzip, it's possible to hog your entire CPU if you're doing big long writes. We don't tend to do high-MB/s reads or writes, but it's clearly something to think about. LZJB seems to be the right balance.
So, what should I test next? I wouldn't mind testing compression latencies on very small reads/writes more along the lines of what our DB actually does, but I don't know how to do that in a quick & dirty way like I was able to here.
Also, I have to admit, I'm curious about the different checksum options. Has anyone played with anything other than the default?