Archive
Great things afoot in the MySQL community
tl;dr: The MySQL community rocks. Percona, XtraDB, Drizzle, SSD storage, InnoDB IO scalability challenges.
For anyone who lives and dies by MySQL and InnoDB, things are finally starting to heat up and get interesting. I’ve been banging the “MySQL/InnoDB scales poorly” drums for years now, and despite having paid Enterprise licenses, I haven’t been able to get anywhere. I was pretty excited when Sun bought MySQL since their future is intrinsically tied to concurrency, but things have been pretty slow going over there this year.
But the community has finally taken up arms and is fighting the good fight. It’s (finally!) a great time to be a MySQL user because there’s been lots of recent progress. Here’re some of my favorites (and highlights of work left to do):
PERCONA
I can’t sing Percona’s praises enough. They’re probably the most knowledgeable MySQL experts out there (possibly even including Sun). Absolutely the best bang for the buck in terms of MySQL service and support – better than MySQL’s own offering. (If I had to guess why that is, I’d bet that MySQL/Sun don’t want to step on Oracle’s toes by fixing InnoDB – but >99% of what we need is related to InnoDB. Percona has no such tip-toeing limitations.) Let me quickly count the ways they’ve helped me in the last few months:
- They knew of a super obscure configuration setting “back_log“. Have you ever heard of it? I hadn’t. But we started seeing latency on MySQL connections (up to *3 seconds*!) on systems that hadn’t changed recently (exactly 3 seconds sounded awfully suspicious, and sure enough, it was TCP retries). After going through every single kernel, network, and MySQL tuning parameter I know (and I know a lot), I finally called Percona. They dug in, investigated the system, and unearthed ‘back_log’ within an hour or two. Popped that into my configuration and boom, everything was fine again. Whew!
- We have servers that easily exceed InnoDB’s transaction limits. Did you know InnoDB has a concurrent transaction limit of 1024? (Technically, 1024 INSERTs and 1024 UPDATEs. But INSERT … ON DUPLICATE KEY UPDATE manages to chew up one of each). I know all about it – I’ve had bugs open with MySQL Enterprise for more than 2 years on the issue. What’s more, these are low-end systems – 4 cores, 16GB of RAM – and they’re no-where near CPU or IO bound. It took MySQL months to figure out what the problem was (years, really, to figure out all the final details like the different undo logs for INSERT vs UPDATE). Their final answer? It’ll be fixed in MySQL 6. 😦 Note that 5.1 *just* went GA after years and years. On the other hand, it took Percona one weekend to diagnose the problem, and 13 days to have a preliminary patch ready to extend it to 4072 undo slots. Talk about progress! (And yes, we want Percona to release the patch to the world)
- Solving the CPU scaling problems. These have been plaguing us for years (we have had some older four-socket systems for awhile … now with quad-core, it’s even worse), and thanks to Google and Percona, this problem is well on its way to being solved. We’re sponsoring this work and can’t wait to see what happens next.
- XtraDB. This is the biggy. So big it deserves its own heading….
XTRADB
Oracle’s done a terrible job of supporting the community with InnoDB. The conspiracy theorists can all say “I told you so! Oracle bought them to halt MySQL progress” now – history supports them. Which is a shame – Heikki is a great guy and has done amazing work with InnoDB, but the fact remains that it wasn’t moving forward. The InnoDB plugin release was disappointing, to say the least. It addressed none of the CPU or IO scalability issues the community has been crying about for years.
Luckily, Percona finally did what everyone else has been too afraid to do – they forked InnoDB. XtraDB is their storage engine, forked from InnoDB (and then turbocharged!). We’re not running it in production yet, but we are running all of the patches that went into XtraDB and I can tell you they’re great. We’re sponsoring more XtraDB development (and yes, we made sure Percona will be contributing anything they build for us back to the community) with Percona, and I’m sure that’ll continue.
DRIZZLE
I’ve already blogged a bit about Drizzle, but it sure looks like Drizzle + XtraDB might be a match made in heaven. Drizzle can be though of as a MySQL engine re-write with an eye towards web workloads and performance, rather than features. MySQL 4.1, 5.0, and 5.1 added a lot of features that bloated the code without offering anything really useful to web-oriented workloads like ours, so the Drizzle team is ripping all that stuff back out and rethinking the approaches to the things that are being left in. Very exciting.
SSD STORAGE
The advent of “cheap enough” super-fast SSD storage is finally upon us. I’ve got Sun S7410 storage appliances in production and they’re blazingly fast. I have a very thorough review coming, but the short version is that even with NFS latencies, we’re able to do obscene write workloads to these boxes (let alone reads). 10000+ write IOPS to 10TB of mirrored, crazy durable (thanks ZFS!) storage is a dream come true. Once you mix in snapshots, clones, replication, and Analytics – well, it just doesn’t get much better than this.
(Don’t get sticker shock looking at the web pricing – no-one pays anything even remotely like that. Sign up for Startup Essentials if you can, or talk to your Sun sales rep if you can’t, and you can get them much cheaper. I nearly had a heart attack myself until I got “real” pricing. Tell them I sent you – enough Sun people read this blog, it might just help 🙂 ).
STILL NEEDED…
So, all in all, there’s been an awful lot of progress this year, which is great. CPUs are finally scaling under InnoDB, and we finally have storage that isn’t bounded by physical rotation and mechanical arms. Unfortunately, great CPU scaling plus amazing IO capabilities isn’t something InnoDB digests very well. As is common in complicated systems, once you fix one bottleneck, another one elsewhere in the system crops up. This time, it’s IOPS. It was eerie reading Mark Callaghan’s post about this last night – I’d come to the exact same conclusions (from an Operations point of view rather than code-level) just yesterday.
Bottom line: Despite having ample CPU and ample IO, InnoDB isn’t capable of using the IO provided. You can bet we’ll be working with Percona, Google and Sun (read: sitting back and admiring their brilliant work while writing the occasional check and providing production workload information) to look into fixing this.
In the meantime, we’re back to the old standbys: replication and data partitioning. Yes, we’re stacking lots of MySQL instances on each S7410 to maximize both our IOPS and our budget. Fun stuff – more on that later. 🙂
UPDATE: Just occurred to me that there are plenty of *new* readers to my blog who haven’t heard me praise Google and their patches before. Mark Callaghan’s team over at Google definitely deserves a shout-out – they’ve really been a catalyst for much of this work along with Percona.
Sweet new Sun storage stuff on Monday, Nov 10th
FYI, Sun is announcing some sweet new storage stuff on Monday at 3:30pm PT.
I’m reviewing a few of the things they’re announcing, and hope to publish my thoughts here soon (one of them joins my production network tonight if all goes well). However, I’m at Disneyland with my kids (first trip!) from Monday through Thursday, so I don’t know (yet) when I’ll be able to write them up. Bear with me if it takes a few days.
But the gear is exciting, and the direction Sun is headed is even more exciting!
Success with OpenSolaris + ZFS + MySQL in production!

Pimp My Drive by Richard and Barb
There’s remarkably little information online about using MySQL on ZFS, successfully or not, so I did what any enterprising geek would do: Built a box, threw some data on it, and tossed it into production to see if it would sink or swim. 🙂
I’m a Linux geek, have been since 1993 (Slackware!). All of SmugMug’s datacenters (and our EC2 images) are built on Linux. But the current state of filesystems on Linux is awful, and it’s been awful for at least 8 years. As a result, we’ve put our first OpenSolaris box into production at SmugMug and I’ve been pleasantly surprised with the performance (the userland portions of the OS, though, leave a lot to be desired). Why OpenSolaris?
ZFS.
ZFS is the most amazing filesystem I’ve ever come across. Integrated volume management. Copy-on-write. Transactional. End-to-end data integrity. On-the-fly corruption detection and repair. Robust checksums. No RAID-5 write hole. Snapshots. Clones (writable snapshots). Dynamic striping. Open source software. It’s not available on Linux. Ugh. Ok, that sucks. (GPL is a double-edged sword, and this is a perfect example). Since it’s open-source, it’s available on other OSes, like FreeBSD and Mac OS X, but Linux is a no go. *sigh* I have a feeling Sun is working towards GPL’ing ZFS, but these things take time and I’m sick of waiting.
The OpenSolaris project is working towards making Solaris resemble the Linux (GNU) userland plus the Solaris kernel. They’re not there yet, but the goal is commendable and the package management system has taken a few good steps in the right direction. It’s still frustrating, but massively less so. Despite all the rough edges, though, ZFS is just so compelling I basically have no choice. I need end-to-end data integrity. The rest of the stuff is just icing on an already delicious cake.
The obvious first place to use ZFS was for our database boxes, so that’s what I did. I didn’t have the time, knowledge of OpenSolaris, or inclination to do any synthetic benchmarking or attempt to create an apples-to-apples comparison with our current software setup, so I took the quickest route I could to have a MySQL box up and running. I had two immediate performance metrics I cared about:
- Can a MySQL slave on OpenSolaris with ZFS keep up with the write load with no readers?
- If yes, can the slave shoulder its fair share of the reads, too?
Simple and to the point. Here’s the system:
- SunFire X2200 M2 w/64GB of RAM and 2 x dual-core 2.6GHz Opterons
- Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)
The quickest path to getting the system up and running resulted in lots of variables in the equation changing:
- Linux -> OpenSolaris (snv_95 currently)
- MySQL 5.0 -> MySQL 5.1
- LVM2 + ext3 -> ZFS
- Hardware RAID -> Software RAID
- No compression -> gzip9 volume compression
Whew! Lots of changes. Let me break them down one by one, skipping the obvious first one:
MySQL – MySQL 5.1 is nearing GA, and has a couple of very important bug fixes for us that we’ve been working around for an awfully long time now. When I downloaded the MySQL 5.0 Enterprise Solaris packages and they wouldn’t install properly, that made the decision to dabble with 5.1 even easier – the CoolStack 5.1 binaries from Sun installed just fine. 🙂
Going to MySQL 5.1 on a ~1TB DB is painful, though, I should warn you up front. It forced ‘REPAIR TABLE’ on lots of my tables, so this step took much longer than I expected. Also, we found that the query optimizer in some cases did a poor job of choosing which indexes to use for queries. A few “simple” SELECTs (no JOINs or anything) that would take a few milliseconds on our 5.0 boxes took seconds on our 5.1 boxes. A little bit of code solved the problem and resulted in better efficiency even for the 5.0 boxes, so it was a net win, but painful for a few hours while I tracked it down.
Finally, after running CoolStack for a few days, we switched (on advice from Sun) to the 5.1.28 Community Edition to fix some scalability issues. This made a huge difference so I highly recommend it. (On a side note, I wish MySQL provided Enterprise binaries for 5.1 for their paying customers to test with). The Google & Percona patches should make a monster difference, too.
Volume management and the filesystem – There’s some debate online as to whether ZFS is a “layering violation” or not. I could care less – it’s pure heaven to work with. This is how filesystems should have always been. The commands to create, manage, and extend pools are so simple and logical you basically don’t even need man pages (discovering disk names, on the other hand, isn’t easy. I finally used ‘format’ but even typing it gives me the shivers…). zpool create MYPOOL c0t0d0
You just created a ZFS pool. Want a mirror? zpool create MYPOOL mirror c0t0d0 c0t0d1
Want a striped mirror (RAID-1+0) w/spare? zpool create MYPOOL mirror c0t0d0 c0t0d1 mirror c0t0d2 c0t0d3 spare c0t0d4
Want to add another mirror to an already striped mirror (RAID-1+0) pool? zpool add MYPOOL mirror c0t0d5 c0t0d6
Get the idea? Super-easy. Massively easier than LVM2+ext3 where adding a mirror is at least 4 commands: pvcreate, vgextend, lvextend, resize2fs – usually with an fsck in there too.
Software RAID – This is something we’ve been itching for for quite some time. With modern system architectures and modern CPUs, there’s no real reason “storage” should be separate from “servers”. A storage device should be just a server with some open-source software and lots of disks. (The “open source” part is important. I’m sick of relying on closed-source RAID firmware). The amount of flexibility, performance, reliability and operational cost savings you can achieve with software RAID rather than hardware is enormous. With real datacenter-grade flash storage devices just around the corner, this becomes even more vital. ZFS makes all of this stuff Just Work, including properly adjusting the write caches on the disk, eliminating the RAID-5 write hole, etc. Our first box still has a battery-backed write-cache between the disks and the CPU for write performance, but all the disks are just exposed as JBOD and striped + mirrored using ZFS. It rocks.
Compression – Ok, so this is where the geek in me decided to get a little crazy. ZFS allows you to turn on (and off) a variety of compression mechanisms on-the-fly on your pool. This comes with some unknown (depends on lots of factors, including your workload, CPUs, etc) performance penalty (CPU is required to compress/decompress), but can have performance upsides too (smaller reads and writes = less busy disk).
InnoDB is notoriously bad at disk usage (we see 2X+ space usage using InnoDB) and while it’s not an enormous concern, it’d be something nice to curtail. On most of our DB boxes, we have idle CPU around (we’re not really I/O bound either – MySQL is a strange duck in that you can be concurrency bound without being either CPU or I/O bound fairly easily thanks to poor locking), so I figured I’d go wild and give it a shot.
Lo and behold, it worked! We’re getting a 2.12X compression ratio on our DB, and performance is keeping up just fine. I ran some quick performance tests on large linear reads/writes and we were measuring 45.6MB/s sustained uncompression and 39MB/s sustained compression on a single-threaded app on an Opteron CPU. We’ll probably continue to test compression stuff, and of course if we run into performance bottlenecks, we’ll turn it off immediately, but so far the mad science experiment is working.
Configuration
Configuring everything was relatively painless. I bounced a few questions off of Sun (imho, this is where Sun really shines – they listen to their customers and put technical people with real answers within arms reach) and read the Evil Tuning Guide to ZFS. In the end I really only ended up tweaking two things (plus setting compression to gzip-9):
- I set the recordsize to match InnoDB’s – 16KB.
zfs set recordsize=16K MYPOOL
- I turned off file-level prefetching. See the Evil Tuning Guide. (I’m testing with this on, now, and so far it seems fine).
I believe since ZFS is fully checksummed and transactional (so partial writes never occur) I can disable InnoDB’s doublewrite buffer. I haven’t been brave enough to do this yet, but I plan to. I like performance. 🙂
Performance
This box has been in production in our most important DB cluster for two weeks now. On the metrics I care about (replication lag, query performance, CPU utliization, etc) it’s pulling its fair share of the read load and keeping completely up on replication. Just eyeballing the stats (we haven’t had time to number crunch comparison stats, though we gave some to Sun that I’m hoping they crunch), I can’t tell a difference between this slave and any of the others in the cluster running Linux. I sure feel a lot better about the data integrity, though.
Why not [insert other OS here]?
We could have gone with Nexenta, FreeBSD, Mac OS X, or even *gulp* tried ZFS on FUSE/Linux. To be honest, Nexenta is the most interesting because it actually *is* the Solaris kernel plus Linux userland, exactly what I wanted. I’ve played with it a tiny bit, and plan to play with it more, but this is a mission-critical chunk of data we’re dealing with, so I need a company like Sun in my corner. I find myself wishing Sun had taken the Nexenta route (or offered support for it that I could buy or something). Instead, we’ll be buying software service & support from Sun for this and any other mission-critical OpenSolaris boxes.
FreeBSD also doesn’t have the support I need, Mac OS X wasn’t performant enough the last time I fiddled with it as a server, and most FUSE filesystems are slow so I didn’t even bother.
Gotchas
- On my 64GB Linux boxes, I give InnoDB 54GB of buffer pool size. With otherwise exactly the same my.cnf settings, MySQL on OpenSolaris crashes with anything more than 40GB. 14GB, or 21.9% of my RAM, that I can’t seem to use effectively. Sun is looking into this, I’ll let you know if I find anything out.
- For a Linux geek, OpenSolaris userland is still painful. Bear in mind that this is a single-purpose box, so all I really want to do is install and configure MySQL, then monitor the software and hardware. If this were a developer box, I would have already given up. OpenSolaris is still very early, so I’m still hopeful, but be prepared to invest some time. Some of my biggest peeves:
- Common commands, like ‘ps’, have very different flags.
- Some GNU bins are provided in /usr/gnu/bin – but a better ‘ps’ is missing, as is ‘top’ (no, ‘prstat’ is *not* the same!), ‘screen’, etc (Can anyone even use remote command-line Unix boxes without ‘screen’? If so, how?)
- Packages are crazily named, making finding your stuff to install tough. Like instead of Apache being called ‘apache’ or ‘httpd’, it’s called ‘SUNWapch’. What?
- After finally figuring out how to search for packages to get the names (‘pkg search -r Apache’ – which doesn’t provide pleasant results), I discovered that ‘top’ and ‘screen’ just simply aren’t provided (or they’re named even worse than I thought). Instead, I had to go to a 3rd party repository, BlastWave, to get them. And then, of course, the ‘top’ OpenSolaris package wouldn’t actually install and I had to manually break into the package and extract the binary. Ugh.
Whew! Big post, but there was a lot of ground to cover. I’m sure there are questions, so please post in the comments and I’ll try to do a follow-up. As I fiddle, tweak, and change things I’ll try to post updates, too – but no promises. 🙂
UPDATE: One other gotcha I forgot to mention. When MySQL (or, presumably, anything else running on the box) gets really busy, user interactivity evaporates on OpenSolaris. Just hitting enter or any other key at a bash prompt over SSH can take many seconds to register. I remember when Linux had these sort of issues in the past, but had blissfully forgotten about them.
UPDATE: I went more in depth on ZFS compression testing and blogged the results. Enjoy!
Hot technologies I care about – Sep '08

photo by: ikegami
I’ve been too busy to blog lately, and for that I apologize. But here’s a quicky detailing the technologies (internet related and not) I’m excited about right now:
- Drizzle. For years now, I’ve felt that MySQL has been doing in a direction in opposition to my use case. Stored procedures, views, etc etc have added bloat and complexity without offering me anything useful. Turns out I’m not alone – and thus Drizzle was born. To say I’m *super* excited about this is a serious understatement.
- Google & Percona’s MySQL patches. While I wait for Drizzle, I’m stuck dealing with terrible concurrency issues in MySQL/InnoDB that force us to partition data way before we really should have to, making our system more complex. It’s crazy having a server keel over when it shouldn’t be either CPU-bound *or* IO-bound but that’s life with MySQL and InnoDB these days – or at least, it was until Google and Percona fixed what I couldn’t get MySQL to fix with our Platinum Enterprise subscriptions. Open source rules!
- Flash storage. I really wish I could talk about this some more (pesky NDAs), but there are datacenter changes coming that are more dramatic than anything I’ve seen in 14 years of working on them. I hope I’ve talked to everyone in the space (and from the companies I’ve talked to, one of them seems to be the *very* clear winner for this upcoming round), but if you’re a storage vendor working on flash appliances and I haven’t talked to you, ping me. We’re a bleeding edge customer and we’ll put your stuff in production faster than you can deliver it to us. 🙂
- ZFS. Regardless of flash storage, ZFS is the filesystem of choice – head and shoulders over everything we’ve used or heard of. The advent of flash just makes this even more compelling. The downside? It’s not on Linux. 😦
- OpenSolaris. ZFS is so incredible, my hand has been forced, and we’re about to put our first OpenSolaris system into production. OpenSolaris is, in theory, the Solaris kernel (think ZFS, DTrace, SMF, high concurrency, etc) with the GNU-like userland (think Linux-like). In practice, it’s still extremely painful for a Linux expert and Solaris n00b like me to use – even on a single-purpose machine like a MySQL server. Only ZFS makes the pain worth it. For development, it’s basically unusable for Linuxers (it’s probaby fabulous for Solaris guys – lucky ducks).
- Nexenta. Unlike OpenSolaris, Nexenta *is* the Solaris kernel plus GNU userland. Unfortunately, it’s not backed by Sun or anyone else I have any relationship with. Sun has been absolutely the very best technology vendor we’ve ever dealt with in terms of support, technical knowledge, and just plain listening to us, so that’s a big issue. I wish Sun had taken Nexenta’s approach (or would just buy them or offer support or something). If OpenSolaris continues to be painful, we may fall back on Nexenta instead – remember, ZFS is the driving factor here.
- Amazon Web Services competitors. They’ve been promising they’d be coming out for years now and I’m shocked they’ve given Amazon this much runway. But I believe a few more are getting very close (can’t say more, again, pesky NDAs). Now, we’re extremely happy with Amazon, so we have no plans to switch, but competition is good for everyone – and Amazon is a fierce competitor. Plus there are still gaps in Amazon’s strategy, and if I can mix & match to plug some of those gaps, awesome – sign me up.
- Memcached. This one’s been on my list for years, and it’s still way up there. Binary protocol on the verge of shipping, nice patch to resolve some networking issues we’ve seen, and talk about scabability. If you’re building web apps and this isn’t a core part of your infrastructure, you’re doing it wrong.
- Big RAM. 4GB DIMMs are dirt cheap, so if you’re not loading your DB and Memcached boxes to the gills, you’re missing the boat. Cheap 2-socket 64GB (and relatively cheap 128GB at 4-sockets) are here.
- Sun Fire X4140 and X4440. The best 1U (2-socket) and 2U (4-socket) servers on earth. Despite being late to the game with quad-core, Opteron RAM performance kills Xeon, so these are the servers we’re buying. You can load them to the gills with 4GB DIMMs, enjoy the dual-power supplies (yes, in the 1U box too), and crank out some great stuff.
- OpenSocial, Y!OS, etc. The big boys are finally getting real about getting open and cross-pollinating data and I think we’re finally nearing an inflection point. We’re hiring a Sorcerer to do nothing but think and build in this space. I’m sure magic will ensue.
- Nikon D90 and Canon 5D MkII. Nikon’s taken the photography world by storm with amazing high-ISO performance, and Canon just announced a DSLR that shoots full 1080p video. Both look amazing and both are game-changers.
- Onkyo TX-SR806. I’m an A/V junkie and this thing is amazing. 5 HDMI inputs (need more?), THX Ultra2 Plus (the low-volume enhancements are *awesome* with young kids sleeping at home), automatic room EQ, decodes every modern audio encoding, etc. I don’t even use the amplifier section (I have separates), but it’s turning out to be the best Pre/Pro I’ve ever owned. Sounds fabulous on my gear.
- iPhone App Store. That thing is a game changer, and we’re barely seeing the tip of the iceberg. All the other players have to respond – which is great for you and I. And talk about a platform that’s a dream to develop on!
The Sky is Falling! MySQL charging for features!
There’s quite a bit of buzz on the blogosphere from people I respect a great deal, like Jeremy Cole at Proven Scaling and Vadim at Percona, about MySQL’s new Enterprise backup plans.
The big deal? They’re releasing a Community version that doesn’t have all the same features as the Enterprise version of Online Backup, including compression and encryption. The Community version is open-sourced under GPL, the Enterprise version is not.
Personally, I think this is awesome. Don’t get me wrong – I love open source. We couldn’t have built our business without it, and we love it when we get a chance to contribute back to the community.
But let’s not forget that MySQL is a business. And that business helps the community and improves the software. They have customers (I’m one – we’re a paying MySQL Enterprise Platinum customer), and they have to solve those customers’ problems. This is a virtuous cycle where the community benefits directly as MySQL thrives financially.
Every time a business like us pays MySQL for a service or feature, MySQL can then invest in better software that benefits all. The end result in MySQL’s case is more GPL’d code. In a very real way, without companies like mine, there wouldn’t be a new backup tool at all – let alone the differences this debate is focused on.
Every day, I hear someone saying “Man, I love SmugMug so much! It has [insert features here] which I love! Why isn’t it free?”
The answer? “It wouldn’t be SmugMug if it was free.” MySQL’s situation is very similar.
I wish more open source projects would make it easier for this cycle to ignite. Some of them, like Red Hat, refuse to even take our money. Talk about stupid. There are *lots* of businesses out there willing to pay for extra services and features, and the community can harness that revenue in amazing ways, including getting more (or better) GPL’d code.
Couple more thoughts:
- I wouldn’t be surprised if future releases add new Enterprise-only features and some existing Enterprise-only features migrate down to Community.
- The Community version is open-sourced, so I’m sure the community will develop their own compression and encryption features.
- This is really no different from Enterprise Monitor, which has been only for Enterprise customers for awhile.
- Lots of other projects do this (and I would argue this benefits those projects and their communities, too)
- I’m 99% sure that this was the plan before Sun acquired MySQL.