
ZFS & MySQL/InnoDB Compression Update

October 13, 2008
Network.com setup in Vegas, Thumper disk bay, green by Shawn Ferry


As I expected it would, the fact that I used ZFS compression on our MySQL volume in my little OpenSolaris experiment struck a chord in the comments. I chose gzip-9 for our first pass for a few reasons:

  1. I wanted to see what the “best case” compression ratio was for our dataset (InnoDB tables)
  2. I wanted to see what the “worst case” CPU usage was for our workload
  3. I don’t have a lot of time, so I needed to try something quick & dirty.

I got both of those data points with enough granularity to be useful: a 2.12X compression ratio over a large & varied dataset, and compression fast enough that my end users didn’t really notice. The next step, obviously, is to find the best trade-off between compression ratio and CPU cost for our data. So I spent the morning testing exactly that. Here are the details:

  • Created 11 new ZFS volumes (compression = [none | lzjb | gzip1-9])
  • Grabbed 4 InnoDB tables of varying sizes and compression ratios and loaded them in the disk cache
  • Timed (using ‘ptime’) how long it took to read each file from cache and write it to disk (using ‘cp’), while watching CPU utilization (using ‘top’, ‘prstat’, and ‘mpstat’)

It quickly became obvious that there’s relatively little difference in compression ratio between gzip-1 and gzip-9 – and, contrary to what people were saying in the comments, relatively little difference in CPU usage, either, in 3 of the 4 cases. (The other case, though… yikes!) So I quickly narrowed the test down to ‘none’, ‘lzjb’, ‘gzip-1’, and ‘gzip-9’. (LZJB is the default compression for ZFS – gzip-N was added later as an option.)
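You can see this effect without ZFS at all. Here’s a minimal sketch using Python’s zlib (the DEFLATE engine behind gzip) on a synthetic, mildly repetitive buffer – the data is made up, so the exact numbers will differ from real InnoDB files, but the level-1 vs. level-9 gap is usually similarly small:

```python
import zlib

# Synthetic stand-in for an InnoDB datafile: mildly repetitive row data.
data = b"mysql innodb row payload, mildly repetitive. " * 20000

c1 = zlib.compress(data, 1)   # gzip-1 analogue: fastest
c9 = zlib.compress(data, 9)   # gzip-9 analogue: best ratio
print(f"raw {len(data)}B  level1 {len(c1)}B  level9 {len(c9)}B")
print(f"ratio1 {len(data)/len(c1):.2f}X  ratio9 {len(data)/len(c9):.2f}X")
```

On most real-world data the ratio difference between the two levels is a few percent, while the CPU cost difference can be large.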

Note that all the files were pre-cached in RAM before doing any of the tests, and ‘iostat’ verified we were doing zero reads. Also note that this is writing to two DAS enclosures with 15 x 15K SCSI disks apiece (28 spindles in a striped+mirrored configuration) with 512MB of write cache apiece. So these tests complete very quickly from an I/O perspective because we’re either writing to cache (for the smaller files) or writing to tons of fast spindles at once (the bigger files). In theory, this should mean we’re testing CPU more than we’re testing our IO – which is the whole point.

I ran each ‘cp’ at least 10 times, letting the write cache subside each time, selecting the fastest one as the shown result. Here they are (and be sure to read the CPU utilization note after the tables):
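For anyone who wants to reproduce the methodology, here’s a rough Python sketch of the loop (file names and the dummy 1 MB payload are placeholders, not my actual .ibd files): copy the source repeatedly and keep the fastest run, the same way I kept the fastest ‘ptime cp’ result:

```python
import os
import shutil
import tempfile
import time

def time_copy(src, dst_dir, runs=10):
    """Copy src into dst_dir `runs` times and return the fastest wall time,
    mimicking repeated 'ptime cp' runs where the fastest result is kept."""
    best = float("inf")
    dst = os.path.join(dst_dir, os.path.basename(src))
    for _ in range(runs):
        start = time.perf_counter()
        shutil.copyfile(src, dst)
        best = min(best, time.perf_counter() - start)
        os.remove(dst)  # clean up so each run writes fresh
    return best

# Self-contained demo with a 1 MB dummy file instead of a real .ibd:
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "table.ibd")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))
    vol = os.path.join(tmp, "vol")   # stand-in for a test ZFS volume
    os.mkdir(vol)
    fastest = time_copy(src, vol, runs=3)
    print(f"fastest copy: {fastest:.4f}s")
```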

TABLE1
compression    size   ratio   time
uncompressed   172M   1       0.207s
lzjb           79M    2.18X   0.234s
gzip-1         50M    3.44X   0.240s
gzip-9         46M    3.73X   0.217s

Notes on TABLE1:

  • This dataset seems to be small enough that much of the time is probably spent in system internals, rather than actually reading, compressing, and writing data, so I view this as an interesting size datapoint only, rather than size and time. Feel free to correct me, though. 🙂
TABLE2
compression    size   ratio   time     speed ratio
uncompressed   631M   1       1.064s   1
lzjb           358M   1.76X   0.668s   1.59X
gzip-1         253M   2.49X   1.302s   0.82X
gzip-9         236M   2.67X   11.1s    0.10X

Notes on TABLE2:

  • gzip-9 is massively slower on this particular hunk of data. I’m no expert on gzip, so I have no idea why this would be, but you can see the tradeoff is probably rarely worth it, even if we were using precious storage commodities (say, flash or RAM rather than hard disks). I ran this one several extra times just to make sure. Seems valid (or a bug).
TABLE3
compression    size    ratio    time      speed ratio
uncompressed   2675M   1        15.041s   1
lzjb           830M    3.22X    5.274s    2.85X
gzip-1         246M    10.87X   44.287s   0.34X
gzip-9         220M    12.16X   52.475s   0.29X

Notes on TABLE3:

  • LZJB really shines here, performance-wise. It’s roughly 3X faster while also using roughly a third of the bytes. Awesome.
  • gzip’s compression ratios are crazy great on this hunk of data, but the performance is pretty awful. Definitely CPU-bound, not IO-bound.
TABLE4
compression    size    ratio   time      speed ratio
uncompressed   2828M   1       17.09s    1
lzjb           1814M   1.56X   14.495s   1.18X
gzip-1         1384M   2.04X   48.895s   0.35X
gzip-9         1355M   2.09X   54.672s   0.31X

Notes on TABLE4:

  • Again, LZJB performs quite well: 1.56X compression while still being faster than no compression at all. Nice!
  • gzip is again very obviously CPU bound, rather than IO-bound. Dang.

There’s one other very important datapoint here that ‘ptime’ itself didn’t show – CPU utilization. On every run with LZJB, both ‘top’ and ‘mpstat’ showed idle CPU. The most I saw it consume was 70% of the aggregate of all 4 CPUs, but the average was typically 30-40%. gzip, on the other hand, pegged all 4 CPUs on each run. Both ‘top’ and ‘mpstat’ verified that 0% CPU was idle, and interactivity on the bash prompt was terrible on gzip runs.
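If you want to see the CPU-time (as opposed to wall-time) gap for yourself without a ZFS box, here’s a small sketch using zlib as a stand-in for the in-kernel gzip path; the half-random, half-zeros buffer is a made-up approximation of mixed data:

```python
import os
import time
import zlib

# Half random (hard to compress), half zeros (easy): a crude mixed workload.
data = os.urandom(512 * 1024) + b"\x00" * (512 * 1024)

def cpu_seconds(level):
    """CPU time (not wall time) consumed compressing `data` at `level`."""
    start = time.process_time()
    out = zlib.compress(data, level)
    elapsed = time.process_time() - start
    assert zlib.decompress(out) == data  # sanity: result round-trips
    return elapsed

t1 = cpu_seconds(1)
t9 = cpu_seconds(9)
print(f"level 1: {t1:.4f}s CPU   level 9: {t9:.4f}s CPU")
```

`time.process_time` counts only CPU seconds, which is closer to what ‘mpstat’ was showing me than a wall-clock timer would be.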

Some other crazy observations that I can’t explain (yet?):

  • After a copy (even to an uncompressed volume), ‘du’ wouldn’t always show the right byte count. It took time (many seconds) before it showed the right number of bytes, even after doing things like ‘md5sum’. I have no idea why this might be.
  • gzip-9 made a smaller file (1355M vs 1380M) on this new volume as opposed to my big production volume (which is gzip-9 also). I assume this must be due to a different compression dictionary or something, but it was interesting.
  • Sometimes I’d get strange error messages trying to copy a file over an existing one (removing the existing one and trying again always worked):

    bash-3.2# ptime cp table4.ibd /data/compression/gzip-1
    cp: cannot create /data/compression/gzip-1/table4.ibd: Arg list too long
  • After running lots of these tests, I wasn’t able to start MySQL anymore. It crashed on startup, unable to allocate enough RAM for InnoDB’s buffer pool. (You may recall from my last post that MySQL seems to be more RAM limited under OpenSolaris than Linux). I suspect that ZFS’s ARC might have sucked up all the RAM and was unwilling to relinquish it, but I wasn’t sure. So I rebooted and everything was fine. 😦

Conclusion? Unless you care a great deal about eking out every last byte (on a RAM disk, for example), LZJB seems like a much saner compression choice. Performance seems to improve, rather than degrade, and it doesn’t hog your CPU. I’m switching my ZFS volume to LZJB right now (on-the-fly changes – woo!) and will copy all my data so it gets the new compression settings. I’ll sacrifice some bytes, but that’s ok – performance is king. 🙂

Also, my theory that I’d always have idle CPU with modern multi-core chips so compression wouldn’t be a big deal seems to be false. Clearly, with gzip, it’s possible to hog your entire CPU if you’re doing big long writes. We don’t tend to do high-MB/s reads or writes, but it’s clearly something to think about. LZJB seems to be the right balance.

So, what should I test next? I wouldn’t mind testing compression latencies on very small reads/writes more along the lines of what our DB actually does, but I don’t know how to do that in a quick & dirty way like I was able to here.

Also, I have to admit, I’m curious about the different checksum options. Has anyone played with anything other than the default?

  1. Justin Swanhart
    October 13, 2008 at 5:32 pm

    Your tests cover the CPU impact of sequential I/O, which is great for table scans, but not random I/O which is generally the type of I/O you will see when joining tables together. Have you done any random I/O testing to see how much CPU activity this generates? Perhaps you could use a script that invokes dd in a loop to read a chunk of data (probably 16K chunk to simulate InnoDB page reads) at random offsets into the file.

    What granularity is the compression at? File, block, page (N blocks), arbitrary? I’m curious how much of the file needs to be uncompressed to satisfy a random read.
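Justin’s dd-in-a-loop suggestion can be sketched in Python instead: read 16 KB chunks (InnoDB’s page size) at random page-aligned offsets. The file here is a 1 MB dummy, standing in for a real .ibd on a compressed volume:

```python
import os
import random
import tempfile

PAGE = 16 * 1024  # InnoDB page size

def random_reads(path, count):
    """Read `count` random page-aligned 16K chunks; return total bytes read."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        for _ in range(count):
            # Align to page boundaries, like InnoDB page reads.
            offset = random.randrange(0, size // PAGE) * PAGE
            total += len(os.pread(fd, PAGE, offset))
    finally:
        os.close(fd)
    return total

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(PAGE * 64))   # 1 MB stand-in datafile
    path = f.name
read_bytes = random_reads(path, 100)
os.unlink(path)
print(f"read {read_bytes} bytes in 100 random page reads")
```

Running this against files on each test volume (while watching ‘mpstat’) would show the decompression CPU cost of random reads rather than sequential ones.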

  2. October 13, 2008 at 5:37 pm

    @Justin Swanhart:

    We don’t do JOINs. As in, really, no JOINs. JOINs are evil. 🙂 Still, our read workload is likely very random, so I see where you’re coming from. But I’m fairly certain that some of the same data applies.

    I have no idea, though, what granularity the compression is at. That’s a good question. I’ll see if I can get an answer. I’ll bet it’s each “record” (aka block), but I could be wrong.

  3. Justin Swanhart
    October 13, 2008 at 5:55 pm

    I don’t subscribe to the JOINs-are-evil camp, but I understand your frustration. Personally I’m a fan of materialized views for fast read performance on aggregated tables, while still keeping the advantages of normalization for OLTP applications.

    But even without JOINs, InnoDB reads may be fairly random if you are using non-PRIMARY indexes. As you probably know, InnoDB tables are organized around the primary key of the table, or around an automatically provided 64-bit surrogate key. Secondary indexes have to do an additional lookup on the primary key.

    Are your tables read-only? Insertion and update may have really big penalties on compressed filesystems too. Also, have you considered the compression options of the new pluggable version of InnoDB?

  4. Tao Shen
    October 13, 2008 at 6:00 pm

    Hi Don:

    Interesting article on MySQL and ZFS with compression.

    I was wondering how file-system-level compression such as ZFS’s compares to the commercial distributed column stores with compression (Vertica and KDB+) which are commonly used for storing financial data (think 10 billion rows a day on a very simple schema).

    Another point: you recently pointed out your frustration with the Dell MD3000, which was once considered the holy grail of DB storage subsystems, and it seems the MD3000 has aged quite a bit since the last mention. It looks as if ZFS’s software RAID can work very nicely with SAS expander cases from Supermicro (15-16 SAS bays for about 1000 dollars). You can hook a SAS controller (like those with 1.2GHz Intel IOP processors, such as the Adaptec 5405) directly to a Supermicro case like the one linked below, and it will probably act like an MD1000 case, with your choice of 15K SAS drives to put in, or maybe flash SAS drives later. (Of course, I wouldn’t know how much discount SmugMug is getting from Dell for their MD1000 and MD3000 enclosures.)


  5. October 13, 2008 at 6:57 pm


    Thanks for posting this info. Very interesting.

  6. October 13, 2008 at 7:53 pm

    Don, typo in table 2? says gzip-9 file was 46M again.

  7. October 13, 2008 at 10:02 pm

    Speaking as the jb in lzjb, this makes me happy. 😉

  8. October 13, 2008 at 10:37 pm

    @Jeff Bonwick:

    I figured you would. 🙂

    Been too long, they must be keeping you busy over there. Don’t be a stranger…

  9. October 13, 2008 at 10:45 pm

    Thanks for writing these entertaining and useful articles.
    Have you considered submitting a talk about these matters for the Users Conference 2009? (http://mysqlconf.com)


  10. Marc
    October 14, 2008 at 6:12 am

    Someone really needs to find a (legal) way to use lzo compression with zfs, this is imho the real contender for lzjb in interactive use (as opposed to archiving where you can afford gzip-9 or something even slower, but block-level may not be so well suited anymore).

    • roland
      December 20, 2008 at 4:29 am

      >Someone really needs to find a (legal) way to use lzo compression with zfs, this is imho the real contender

      yes, indeed.

      I don’t know why this hasn’t attracted Sun developers yet.
      Regarding licensing, I think there is a way to proceed, as the author may (perhaps) give explicit permission to use LZO with ZFS. There is even a commercial LZO implementation with some optimizations. ZFS+LZO would be a "win-win" for Sun and for the LZO author.

      There is a patch for zfs-fuse to enable LZO compression (for demonstration purposes) and the results are really interesting, as LZO can give better performance and better compression than LZJB – see http://www.mail-archive.com/zfs-discuss@opensolar

  11. October 14, 2008 at 11:38 am

    @Don — with record size == 16kb and compression enabled, does ZFS write blocks smaller than 16kb?

    Is your access pattern write-once or write, update, update, …? I am curious about the impact of defragmentation from copy-on-write in ZFS. InnoDB prefetch code thinks that logical file order is correlated to physical order and that is less true with copy-on-write.

  12. Jacques Chester
    October 14, 2008 at 10:14 pm

    I think the difference is that LZJB is a compression algorithm designed for streaming data — it doesn’t rely on a full scan of the data to perform its compression. The gzip algorithm builds a dictionary first, so that explains the big slowdowns on larger datasets.

    Obviously building a dictionary of items for compression purposes lets you do things you can’t do otherwise, which is why gzip is better for total compression ratio.

    A good followup test would be to see performance on a random-access basis. What happens with lots of writes and updates? I’m not sure if gzip requires a from-the-top effort or not. I imagine Jeff Bonwick could tell you.

    Of course, it could be that I’m an undergraduate who knows not whereof he speaks. YMMV.

  13. Ricardo Correia
    October 18, 2008 at 5:03 am

    Did you run “sync” at the end of each “cp” command and wait for the sync to complete before collecting the elapsed time?

    Note that ZFS only compresses data when it’s actually writing to the disks, which typically only happens every 5 seconds given enough RAM. So if you didn’t sync at the end, the first 2 tables are not really useful for compression comparison purposes and the other ones are very suspect 🙂

    Also, if you used sync at the end, the average of the 10 runs would be more representative of the actual performance than only taking the fastest run.

    As for the different checksum options, I see no difference in performance when enabling/disabling the default checksum algorithm, but I noticed that in zfs-fuse the SHA1 checksum is way too heavy on the CPUs (especially considering all the context switching going on), but it’s possible that this isn’t so bad in Solaris/OpenSolaris.
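Ricardo’s sync point can be sketched in a few lines: if you stop the clock before the data is flushed, you’re mostly timing a write into RAM. Calling fsync before taking the end time includes the real I/O (and, on a compressed ZFS volume, the compression work done at write-out). The 4 MB payload and temp file are placeholders:

```python
import os
import tempfile
import time

payload = os.urandom(4 * 1024 * 1024)  # 4 MB dummy write

fd, path = tempfile.mkstemp()
os.close(fd)

start = time.perf_counter()
with open(path, "wb") as f:
    f.write(payload)
    f.flush()
    os.fsync(f.fileno())  # don't stop the clock until data is on disk
synced = time.perf_counter() - start
os.unlink(path)
print(f"write + fsync: {synced:.4f}s")
```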

  14. Steve
    October 20, 2008 at 8:02 am

    Did you set the ZFS record size to 16kb before doing this? The MySQL docs say the ZFS record size should match the InnoDB record size of 16kb.


    If not, I would love to see the results with the record size set to 16kb.

  15. Jacques Chester
    October 21, 2008 at 11:02 pm

    Just to follow up my previous remark.

    LZJB seems to follow the basic scheme in Lempel-Ziv 77 (LZRW1 actually) with some modifications. It has a moving window through which data streams. When an item comes in, the algo checks to see if it’s been seen. If it has, the element is replaced with a pointer to the earliest instance in the window. Otherwise it’s added to a dictionary of such addresses and the algorithm moves to the next item.

    This is ideal for filesystem compression for two reasons:

    1. You don’t need to store header data for the decompressor. All the data needed to decompress the data is already in situ in the data.
    2. You don’t need to see all the data to begin compressing. As soon as you have data streaming in, you can begin to compress it.

    By contrast, gzip uses an algorithm called Deflate, which is actually two algorithms run in series. The first step is a Lempel-Ziv 77 pass like the one in LZJB, followed by a second pass of Huffman coding, where repeated blocks are replaced with short symbolic codes. This involves a tree structure being built and then discarded every 64k or so, on top of the pass already made by the Lempel-Ziv algorithm.

    I think that accounts for the observed differences in performance.

  16. Scott C.
    October 28, 2008 at 12:04 pm

    Since ZFS compresses only one block at a time, the gzip blocks and tree structure discard mentioned by Jacques above do not apply.
    For 16K blocks, what happens is a 16K data block (uncompressed size) is compressed with lzjb or gzip into a new block of smaller size, and stored on disk in the smaller size (which has to be a binary size, IIRC — 8k, 4k, 2k, etc) and like all blocks in ZFS, there is a checksum elsewhere to make sure the data is ok.

    LZJB, as mentioned by Jacques, is a LZ-77 style algorithm. It is supposedly faster and a bit better at compression than the original, and versus the lzo suggested by Marc above. Either way, its basically the same class of small-window streaming compression algorithm.

    Gzip is also a streaming compression algorithm, and as mentioned above is basically just LZ77 followed immediately by Huffman encoding. Most of the time a default Huffman encoder is used and the dictionary isn’t built by analyzing the data first; gzip (or more correctly, deflate) has a default dictionary that is used for almost all streams to avoid this step. If the dictionary is built by analysis, it is done on a smaller chunk. Either way, for ZFS this chunk size is at most the size of the file block.

    Deflate/Gzip are RFC 1951, and 1952. RFC 1950 is the related zlib wrapper format commonly used in coordination with deflate. (just search for RFC 1952 to find more about gzip than you ever want to know).

    For info on LZJB, look at the source code in ZFS 😀 It is well commented from what I recall.
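Scott’s one-block-at-a-time point has a measurable consequence that’s easy to sketch: compressing independent 16 KB records can’t use matches that span records, so the total is usually somewhat larger than compressing the whole file in one stream. This uses zlib on synthetic repetitive data as a rough analogue, not ZFS’s actual code path:

```python
import zlib

RECORD = 16 * 1024  # ZFS record size matching InnoDB's page size
# Synthetic repetitive data; a real .ibd would be the interesting input.
data = (b"row " * 1000 + b"\x00" * 4096) * 200

whole = len(zlib.compress(data, 6))          # one pass over everything
blocks = sum(                                # one pass per 16K record
    len(zlib.compress(data[i:i + RECORD], 6))
    for i in range(0, len(data), RECORD)
)
print(f"whole-file: {whole} bytes   per-record: {blocks} bytes")
```

This may also help explain why the same gzip-9 data came out at different sizes on different volumes earlier in the post, if those volumes had different record sizes.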

  17. Ethan
    March 30, 2009 at 5:14 pm

    I know this is an old posting, but did you settle on using lzjb or gzip in production for InnoDB data files? Also, we ran into memory allocation limits in Solaris as well, and it turned out we just didn't have enough swap. We found we needed swap equal to ~50% of allocated memory. So for a 48G system with a 40G innodb_buffer_pool, we needed about 18-20G of swap.

  18. Guest
    October 12, 2009 at 2:10 am

    Is it an allocation limit in Solaris, or in InnoDB?

  19. Ethan
    October 12, 2009 at 6:57 pm

    @Guest: Not a limit in either, simply a requirement by Solaris (unlike Linux) to have virtual memory backing memory allocation.

  20. November 14, 2009 at 3:36 pm

    Canon FTL. I'll never buy another one of their cameras. This seals it. Peace.
