Great things afoot in the MySQL community
tl;dr: The MySQL community rocks. Percona, XtraDB, Drizzle, SSD storage, InnoDB IO scalability challenges.
For anyone who lives and dies by MySQL and InnoDB, things are finally starting to heat up and get interesting. I’ve been banging the “MySQL/InnoDB scales poorly” drums for years now, and despite having paid Enterprise licenses, I haven’t been able to get anywhere. I was pretty excited when Sun bought MySQL since their future is intrinsically tied to concurrency, but things have been pretty slow going over there this year.
But the community has finally taken up arms and is fighting the good fight. It’s (finally!) a great time to be a MySQL user because there’s been lots of recent progress. Here’re some of my favorites (and highlights of work left to do):
PERCONA
I can’t sing Percona’s praises enough. They’re probably the most knowledgeable MySQL experts out there (possibly even including Sun). Absolutely the best bang for the buck in terms of MySQL service and support – better than MySQL’s own offering. (If I had to guess why that is, I’d bet that MySQL/Sun don’t want to step on Oracle’s toes by fixing InnoDB – but >99% of what we need is related to InnoDB. Percona has no such tip-toeing limitations.) Let me quickly count the ways they’ve helped me in the last few months:
- They knew of a super obscure configuration setting, “back_log”. Have you ever heard of it? I hadn’t. But we started seeing latency on MySQL connections (up to *3 seconds*!) on systems that hadn’t changed recently (exactly 3 seconds sounded awfully suspicious, and sure enough, it was TCP retries). After going through every single kernel, network, and MySQL tuning parameter I know (and I know a lot), I finally called Percona. They dug in, investigated the system, and unearthed ‘back_log’ within an hour or two. Popped that into my configuration and boom, everything was fine again. Whew! (There’s a rough my.cnf sketch after this list.)
- We have servers that easily exceed InnoDB’s transaction limits. Did you know InnoDB has a concurrent transaction limit of 1024? (Technically, 1024 INSERTs and 1024 UPDATEs. But INSERT … ON DUPLICATE KEY UPDATE manages to chew up one of each.) I know all about it – I’ve had bugs open with MySQL Enterprise for more than 2 years on the issue. What’s more, these are low-end systems – 4 cores, 16GB of RAM – and they’re nowhere near CPU or IO bound. It took MySQL months to figure out what the problem was (years, really, to figure out all the final details, like the different undo logs for INSERT vs UPDATE). Their final answer? It’ll be fixed in MySQL 6. 😦 Note that 5.1 *just* went GA after years and years. On the other hand, it took Percona one weekend to diagnose the problem, and 13 days to have a preliminary patch ready that extends the limit to 4072 undo slots. Talk about progress! (And yes, we want Percona to release the patch to the world.)
- Solving the CPU scaling problems. These have been plaguing us for years (we’ve had some older four-socket systems for a while … now with quad-core, it’s even worse), and thanks to Google and Percona, this problem is well on its way to being solved. We’re sponsoring this work and can’t wait to see what happens next.
- XtraDB. This is the biggy. So big it deserves its own heading….
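For the back_log issue above, here’s a minimal, hypothetical my.cnf sketch – the value is illustrative only, not the number Percona actually picked for us. On Linux, the effective listen backlog is also capped by the kernel’s net.core.somaxconn, so the two generally need to be raised together.

[mysqld]
# back_log sets the size of the listen backlog for incoming TCP connections.
# If that queue overflows during a connection burst, the client's SYN gets
# dropped and retried - which is exactly where multi-second stalls come from.
# It is not a dynamic variable, so changing it requires a mysqld restart.
back_log = 1024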
XTRADB
Oracle’s done a terrible job of supporting the community with InnoDB. The conspiracy theorists can all say “I told you so! Oracle bought them to halt MySQL progress” now – history supports them. Which is a shame – Heikki is a great guy and has done amazing work with InnoDB, but the fact remains that it wasn’t moving forward. The InnoDB plugin release was disappointing, to say the least. It addressed none of the CPU or IO scalability issues the community has been crying about for years.
Luckily, Percona finally did what everyone else has been too afraid to do – they forked InnoDB. XtraDB is their storage engine, forked from InnoDB (and then turbocharged!). We’re not running it in production yet, but we are running all of the patches that went into XtraDB and I can tell you they’re great. We’re sponsoring more XtraDB development (and yes, we made sure Percona will be contributing anything they build for us back to the community) with Percona, and I’m sure that’ll continue.
DRIZZLE
I’ve already blogged a bit about Drizzle, but it sure looks like Drizzle + XtraDB might be a match made in heaven. Drizzle can be thought of as a re-write of the MySQL server with an eye towards web workloads and performance, rather than features. MySQL 4.1, 5.0, and 5.1 added a lot of features that bloated the code without offering anything really useful to web-oriented workloads like ours, so the Drizzle team is ripping all that stuff back out and rethinking the approaches to the things that are being left in. Very exciting.
SSD STORAGE
The advent of “cheap enough” super-fast SSD storage is finally upon us. I’ve got Sun S7410 storage appliances in production and they’re blazingly fast. I have a very thorough review coming, but the short version is that even with NFS latencies, we’re able to do obscene write workloads to these boxes (let alone reads). 10000+ write IOPS to 10TB of mirrored, crazy durable (thanks ZFS!) storage is a dream come true. Once you mix in snapshots, clones, replication, and Analytics – well, it just doesn’t get much better than this.
(Don’t get sticker shock looking at the web pricing – no-one pays anything even remotely like that. Sign up for Startup Essentials if you can, or talk to your Sun sales rep if you can’t, and you can get them much cheaper. I nearly had a heart attack myself until I got “real” pricing. Tell them I sent you – enough Sun people read this blog, it might just help 🙂 ).
STILL NEEDED…
So, all in all, there’s been an awful lot of progress this year, which is great. CPUs are finally scaling under InnoDB, and we finally have storage that isn’t bounded by physical rotation and mechanical arms. Unfortunately, great CPU scaling plus amazing IO capabilities isn’t something InnoDB digests very well. As is common in complicated systems, once you fix one bottleneck, another one elsewhere in the system crops up. This time, it’s IOPS. It was eerie reading Mark Callaghan’s post about this last night – I’d come to the exact same conclusions (from an Operations point of view rather than code-level) just yesterday.
Bottom line: Despite having ample CPU and ample IO, InnoDB isn’t capable of using the IO provided. You can bet we’ll be working with Percona, Google and Sun (read: sitting back and admiring their brilliant work while writing the occasional check and providing production workload information) to look into fixing this.
In the meantime, we’re back to the old standbys: replication and data partitioning. Yes, we’re stacking lots of MySQL instances on each S7410 to maximize both our IOPS and our budget. Fun stuff – more on that later. 🙂
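For the curious, here’s a rough sketch of one way to stack multiple mysqld instances on a single box using mysqld_multi. The ports, paths, and instance count are made up for illustration – this isn’t necessarily our exact layout.

# /etc/my.cnf (illustrative)
[mysqld_multi]
mysqld     = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld1]
port    = 3307
socket  = /tmp/mysql1.sock
datadir = /data/mysql1

[mysqld2]
port    = 3308
socket  = /tmp/mysql2.sock
datadir = /data/mysql2

Each instance gets its own port, socket, and datadir; ‘mysqld_multi start 1,2’ then brings them both up.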
UPDATE: Just occurred to me that there are plenty of *new* readers to my blog who haven’t heard me praise Google and their patches before. Mark Callaghan’s team over at Google definitely deserves a shout-out – they’ve really been a catalyst for much of this work along with Percona.
ZFS & MySQL/InnoDB Compression Update

Photo: Network.com setup in Vegas, Thumper disk bay, green (by Shawn Ferry)
As I expected it would, the fact that I used ZFS compression on our MySQL volume in my little OpenSolaris experiment struck a chord in the comments. I chose gzip-9 for our first pass for a few reasons:
- I wanted to see what the “best case” compression ratio was for our dataset (InnoDB tables)
- I wanted to see what the “worst case” CPU usage was for our workload
- I don’t have a lot of time. I need to try something quick & dirty.
I got both those data points with enough granularity to be useful: a 2.12X compression ratio over a large & varied dataset, and compression fast enough that my end users didn’t really notice it. The next step, obviously, is to find the best tradeoff between compression ratio and CPU cost for our data. So I spent the morning testing exactly that. Here are the details:
- Created 11 new ZFS volumes (compression = none, lzjb, and gzip-1 through gzip-9)
- Grabbed 4 InnoDB tables of varying sizes and compression ratios and loaded them into the disk cache
- Measured how long it took (using ‘ptime’) to read each file from cache and write it to disk (using ‘cp’), watching CPU utilization (using ‘top’, ‘prstat’, and ‘mpstat’) as it went. There’s a rough shell sketch of the loop right after this list.
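Roughly, the loop looked like this – the pool name ‘data’ and the file names are stand-ins, not my exact commands:

zfs create data/compression
zfs create -o compression=off data/compression/none
zfs create -o compression=lzjb data/compression/lzjb
zfs create -o compression=gzip-1 data/compression/gzip-1
zfs create -o compression=gzip-9 data/compression/gzip-9

# the table file is already warm in the ARC, so this mostly measures compress + write
ptime cp table2.ibd /data/compression/gzip-1/

# on-disk size and achieved ratio for that volume
du -sh /data/compression/gzip-1
zfs get compressratio data/compression/gzip-1

# in another terminal, watch CPU while the copy runs
mpstat 1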
It quickly became obvious that there’s relatively little difference in compression between gzip-1 and gzip-9 (and, contrary to what people were saying in the comments, relatively little difference in CPU usage either – in 3 of the 4 cases. The other case, though… yikes!). So I quickly stopped testing anything but ‘none’, ‘lzjb’, ‘gzip-1’, and ‘gzip-9’. (LZJB is the default compression for ZFS – gzip-N was added later as an option.)
Note that all the files were pre-cached in RAM before doing any of the tests, and ‘iostat’ verified we were doing zero reads. Also note that this is writing to two DAS enclosures with 15 x 15K SCSI disks apiece (28 spindles in a striped+mirrored configuration) with 512MB of write cache apiece. So these tests complete very quickly from an I/O perspective because we’re either writing to cache (for the smaller files) or writing to tons of fast spindles at once (the bigger files). In theory, this should mean we’re testing CPU more than we’re testing our IO – which is the whole point.
I ran each ‘cp’ at least 10 times, letting the write cache subside each time, selecting the fastest one as the shown result. Here they are (and be sure to read the CPU utilization note after the tables):
TABLE1
compression | size | ratio | time
uncompressed | 172M | 1.00X | 0.207s
lzjb | 79M | 2.18X | 0.234s
gzip-1 | 50M | 3.44X | 0.240s
gzip-9 | 46M | 3.73X | 0.217s
Notes on TABLE1:
- This dataset seems to be small enough that much of the time is probably spent in system internals rather than actually reading, compressing, and writing data, so I view this as only an interesting size datapoint, rather than size and time. Feel free to correct me, though. 🙂
TABLE2
compression | size | size ratio | time | speed ratio
uncompressed | 631M | 1.00X | 1.064s | 1.00X
lzjb | 358M | 1.76X | 0.668s | 1.59X
gzip-1 | 253M | 2.49X | 1.302s | 0.82X
gzip-9 | 236M | 2.67X | 11.1s | 0.10X
Notes on TABLE2:
- gzip-9 is massively slower on this particular hunk of data. I’m no expert on gzip, so I have no idea why this would be, but you can see the tradeoff is probably rarely worth it, even if you were using precious storage commodities (say, flash or RAM rather than hard disks). I ran this one a few extra times just to make sure. Seems valid (or a bug).
TABLE3
compression | size | size ratio | time | speed ratio
uncompressed | 2675M | 1.00X | 15.041s | 1.00X
lzjb | 830M | 3.22X | 5.274s | 2.85X
gzip-1 | 246M | 10.87X | 44.287s | 0.34X
gzip-9 | 220M | 12.16X | 52.475s | 0.29X
Notes on TABLE3:
- LZJB really shines here, performance-wise. It delivers roughly 3X faster writes while also using roughly a third of the bytes. Awesome.
- gzip’s compression ratios are crazy great on this hunk of data, but the performance is pretty awful. Definitely CPU-bound, not IO-bound.
TABLE4
compression | size | size ratio | time | speed ratio
uncompressed | 2828M | 1.00X | 17.090s | 1.00X
lzjb | 1814M | 1.56X | 14.495s | 1.18X
gzip-1 | 1384M | 2.04X | 48.895s | 0.35X
gzip-9 | 1355M | 2.09X | 54.672s | 0.31X
Notes on TABLE4:
- Again, LZJB performs quite well: a 1.56X reduction in bytes while still being faster than no compression at all. Nice!
- gzip is again very obviously CPU bound, rather than IO-bound. Dang.
There’s one other very important datapoint here that ‘ptime’ itself didn’t show – CPU utilization. On every run with LZJB, both ‘top’ and ‘mpstat’ showed idle CPU. The most I saw it consume was 70% of the aggregate of all 4 CPUs, but the average was typically 30-40%. gzip, on the other hand, pegged all 4 CPUs on each run. Both ‘top’ and ‘mpstat’ verified that 0% CPU was idle, and interactivity on the bash prompt was terrible on gzip runs.
Some other crazy observations that I can’t explain (yet?):
- After a copy (even to an uncompressed volume), ‘du’ wouldn’t always show the right bytes. It took time (many seconds) before showing the right # of bytes, even after doing things like ‘md5sum’. I have no idea why this might be.
- gzip-9 made a smaller file (1355M vs 1380M) on this new volume as opposed to my big production volume (which is gzip-9 also). I assume this must be due to a different compression dictionary or something, but it was interesting.
- Sometimes I’d get strange error messages trying to copy a file over an existing one (removing the existing one and trying again always worked):
bash-3.2# ptime cp table4.ibd /data/compression/gzip-1
cp: cannot create /data/compression/gzip-1/table4.ibd: Arg list too long
- After running lots of these tests, I wasn’t able to start MySQL anymore. It crashed on startup, unable to allocate enough RAM for InnoDB’s buffer pool. (You may recall from my last post that MySQL seems to be more RAM limited under OpenSolaris than Linux). I suspect that ZFS’s ARC might have sucked up all the RAM and was unwilling to relinquish it, but I wasn’t sure. So I rebooted and everything was fine. 😦
Conclusion? Unless you care a great deal about eking out every last byte (using a RAM disk, for example), LZJB seems like a much saner compression choice. Performance seems to improve, rather than degrade, and it doesn’t hog your CPU. I’m switching my ZFS volume to LZJB right now (on-the-fly changes – woo!) and will copy all my data so it gets the new compression settings. I’ll sacrifice some bytes, but that’s ok – performance is king. 🙂
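The switch itself is a one-liner (the dataset name here is illustrative). Note that only blocks written after the change get the new compression, which is why the copy step is needed:

zfs set compression=lzjb data/mysql
zfs get compression,compressratio data/mysql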
Also, my theory that I’d always have idle CPU with modern multi-core chips so compression wouldn’t be a big deal seems to be false. Clearly, with gzip, it’s possible to hog your entire CPU if you’re doing big long writes. We don’t tend to do high-MB/s reads or writes, but it’s clearly something to think about. LZJB seems to be the right balance.
So, what should I test next? I wouldn’t mind testing compression latencies on very small reads/writes more along the lines of what our DB actually does, but I don’t know how to do that in a quick & dirty way like I was able to here.
Also, I have to admit, I’m curious about the different checksum options. Has anyone played with anything other than the default?
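For anyone who wants to experiment, checksums are just another per-dataset property (again, the dataset name is a stand-in); I haven’t tried anything beyond the default myself yet:

zfs get checksum data/mysql
# 'on' is the default; fletcher4 and sha256 are the stronger alternatives
zfs set checksum=sha256 data/mysql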
Speaking at Velocity next week

Photo by Andrew Tobin
I’m thrilled that O’Reilly is putting on a conference dedicated to performance and operations, and I’m especially happy to be speaking there. I’m on a great-sounding panel, Success: A Survival Guide. I’m sure you’ll hear about our first few years, when, like clockwork on the same days each year, we got massively hammered with traffic – and about what we did to handle it.
If you’re going to the conference, come say “Hi!”. I’ll be wearing a red SmugMug hat, as always. 🙂
Oh, and if you haven’t signed up yet, use ‘vel08js’ to get a nice discount. 🙂
EC2 isn't 50% slower
I don’t want to start a nerdfight here, but it might be inevitable. 🙂
Valleywag ran a story today about how Amazon’s EC2 instances are running at 50% of their stated speed/capacity. They based the story on a blog post by Ted Dziuba, of Persai and Uncov fame, whose writing I really love.
Problem is, this time, he’s just wrong. Completely full of FAIL.
I’ll get to that in a minute, but first, let me explain what I think is happening: Amazon’s done a poor job at setting user expectations around how much compute power an instance has. And, to be fair, this really isn’t their fault – both AMD and Intel have been having a hard time conveying that very concept for a few years now.
All of the other metrics – RAM, storage, etc – have very fixed numbers. A GB of RAM is a GB of RAM. Ditto storage. And a megabit of bandwidth is a megabit of bandwidth. But what on earth is a GHz? And how do you compare a 2006 Xeon GHz to a 2007 Opteron GHz? In reality, for mere mortals, you can’t. Which sucks for you, me, and Amazon – not to mention AMD and Intel.
Luckily, there’s an answer – EC2 is so cheap, you can spin up an instance for an hour or two and run some benchmarks. Compare them yourself to your own hardware, and see where they match up. This is exactly what I did, and why I was so surprised to see Ted’s post. It sounded like he didn’t have any empirical data.
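If you want a quick-and-dirty comparison of your own, something like this works: run the identical commands on your own hardware and on the instance, then compare the wall-clock times. The workload and sizes here are just an illustration, not the benchmark Ted and I used.

# CPU-bound-ish: push 512MB of zeros through gzip and time it
time dd if=/dev/zero bs=1M count=512 2>/dev/null | gzip -1 > /dev/null

# openssl ships a small built-in CPU benchmark, handy as a second opinion
openssl speed sha1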
Admittedly, we’re pretty insane when it comes to testing hardware out. Rather than trust the power ratings given by the manufacturers, for example, we get our clamp meters out and measure the machines’ power draw under full load. You’d be surprised how much variance there is.
There was one data point in a thread linked from Ted’s post that had me scratching my head, though, and I began to wonder if the Small EC2 instances actually had some sort of problem. (We only use the XLarge instance sizes.) This guy had written a simple Ruby script and was seeing a 2X performance difference between his local Intel Core 2 Duo machine and the Small EC2 instance online. Can you spot the problem? I missed it, so I headed over to IRC to find Ted, and we proceeded to benchmark a bunch of machines we had around, including all three EC2 instance sizes.
Bottom line? EC2 is right on the money. Ted’s 2.0GHz Pentium 4 performed the benchmark almost exactly as fast as the Small (aka 1.7GHz old Xeon) instance. My 866MHz Pentium 3 was significantly slower, and my modern Opteron was significantly faster.
So what about that guy with the Ruby benchmark? Can you see what I missed now? See, he’s using a Core 2 Duo. The Core line of processors has completely revolutionized Intel’s performance envelope, and thus the Core processors perform much better per clock cycle than the older Pentium line of CPUs. This is akin to AMD, which long ago gave up the GHz race, instead choosing to focus on raw performance (or, more accurately, performance per watt).
Whew. So, what have we learned?
- All GHz aren’t created equal.
- CPU architecture & generation matter, too, not just GHz
- AMD GHz have, for years, been more effective than Intel GHz. Recently, Intel’s Core-era GHz have become far more effective than its own older Pentium-era GHz.
- Comparing old pre-Core Intel numbers with new Intel Core numbers is useless.
- “top” can be confusing at best, and outright lie at worst, in virtualized instances. Either don’t look at it, or realize the “steal %” column is other VMs on your same hardware doing their thing – not idle CPU you should be able to use
- Benchmark your own apps yourself to see exactly what the price per compute unit is. Don’t rely on GHz numbers.
- Don’t believe everything you read online (threads, blogs, etc) – including here! People lie and do stupid things (I’m dumb more often than I’m not, for example). Data is king – get your own.
Hope that clears that up. And if I’m dumb, I fully expect you to tell me so in the comments – but you’d better have the data to back it up!
(And yes, I’m still prepping a monster EC2 post about how we’re using it. Sorry I suck!)