datacenter | SmugMug's Don MacAskill

Success with OpenSolaris + ZFS + MySQL in production!

October 10, 2008 Don MacAskill 82 comments

There’s remarkably little information online about using MySQL on ZFS, successfully or not, so I did what any enterprising geek would do: Built a box, threw some data on it, and tossed it into production to see if it would sink or swim. 🙂

I’m a Linux geek, have been since 1993 (Slackware!). All of SmugMug’s datacenters (and our EC2 images) are built on Linux. But the current state of filesystems on Linux is awful, and it’s been awful for at least 8 years. As a result, we’ve put our first OpenSolaris box into production at SmugMug and I’ve been pleasantly surprised with the performance (the userland portions of the OS, though, leave a lot to be desired). Why OpenSolaris?

ZFS.

ZFS is the most amazing filesystem I’ve ever come across. Integrated volume management. Copy-on-write. Transactional. End-to-end data integrity. On-the-fly corruption detection and repair. Robust checksums. No RAID-5 write hole. Snapshots. Clones (writable snapshots). Dynamic striping. Open source software. It’s not available on Linux. Ugh. Ok, that sucks. (GPL is a double-edged sword, and this is a perfect example). Since it’s open-source, it’s available on other OSes, like FreeBSD and Mac OS X, but Linux is a no go. *sigh* I have a feeling Sun is working towards GPL’ing ZFS, but these things take time and I’m sick of waiting.

The OpenSolaris project is working towards making Solaris resemble the Linux (GNU) userland plus the Solaris kernel. They’re not there yet, but the goal is commendable and the package management system has taken a few good steps in the right direction. It’s still frustrating, but massively less so. Despite all the rough edges, though, ZFS is just so compelling I basically have no choice. I need end-to-end data integrity. The rest of the stuff is just icing on an already delicious cake.

The obvious first place to use ZFS was for our database boxes, so that’s what I did. I didn’t have the time, knowledge of OpenSolaris, or inclination to do any synthetic benchmarking or attempt to create an apples-to-apples comparison with our current software setup, so I took the quickest route I could to have a MySQL box up and running. I had two immediate performance metrics I cared about:

Can a MySQL slave on OpenSolaris with ZFS keep up with the write load with no readers?
If yes, can the slave shoulder its fair share of the reads, too?

Simple and to the point. Here’s the system:

SunFire X2200 M2 w/64GB of RAM and 2 x dual-core 2.6GHz Opterons
Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)

The quickest path to getting the system up and running resulted in lots of variables in the equation changing:

Linux -> OpenSolaris (snv_95 currently)
MySQL 5.0 -> MySQL 5.1
LVM2 + ext3 -> ZFS
Hardware RAID -> Software RAID
No compression -> gzip-9 volume compression

Whew! Lots of changes. Let me break them down one by one, skipping the obvious first one:

MySQL – MySQL 5.1 is nearing GA, and has a couple of very important bug fixes for us that we’ve been working around for an awfully long time now. When I downloaded the MySQL 5.0 Enterprise Solaris packages and they wouldn’t install properly, that made the decision to dabble with 5.1 even easier – the CoolStack 5.1 binaries from Sun installed just fine. 🙂

Going to MySQL 5.1 on a ~1TB DB is painful, though, I should warn you up front. It forced ‘REPAIR TABLE’ on lots of my tables, so this step took much longer than I expected. Also, we found that the query optimizer in some cases did a poor job of choosing which indexes to use for queries. A few “simple” SELECTs (no JOINs or anything) that would take a few milliseconds on our 5.0 boxes took seconds on our 5.1 boxes. A little bit of code solved the problem and resulted in better efficiency even for the 5.0 boxes, so it was a net win, but painful for a few hours while I tracked it down.

Finally, after running CoolStack for a few days, we switched (on advice from Sun) to the 5.1.28 Community Edition to fix some scalability issues. This made a huge difference so I highly recommend it. (On a side note, I wish MySQL provided Enterprise binaries for 5.1 for their paying customers to test with). The Google & Percona patches should make a monster difference, too.

Volume management and the filesystem – There’s some debate online as to whether ZFS is a “layering violation” or not. I couldn’t care less – it’s pure heaven to work with. This is how filesystems should have always been. The commands to create, manage, and extend pools are so simple and logical you basically don’t even need man pages (discovering disk names, on the other hand, isn’t easy. I finally used ‘format’ but even typing it gives me the shivers…).

zpool create MYPOOL c0t0d0

You just created a ZFS pool. Want a mirror?

zpool create MYPOOL mirror c0t0d0 c0t0d1

Want a striped mirror (RAID-1+0) w/spare?

zpool create MYPOOL mirror c0t0d0 c0t0d1 mirror c0t0d2 c0t0d3 spare c0t0d4

Want to add another mirror to an already striped mirror (RAID-1+0) pool?

zpool add MYPOOL mirror c0t0d5 c0t0d6

Get the idea? Super-easy. Massively easier than LVM2+ext3 where adding a mirror is at least 4 commands: pvcreate, vgextend, lvextend, resize2fs – usually with an fsck in there too.
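For comparison, here’s a rough sketch of what that LVM2+ext3 sequence tends to look like. The device, volume group, and logical volume names (/dev/sdc, vg0, /dev/vg0/data) are made up for illustration, and it assumes the new mirror already shows up as a single block device:

```shell
# Sketch only: /dev/sdc, vg0 and /dev/vg0/data are hypothetical names.
pvcreate /dev/sdc                      # label the new device as an LVM physical volume
vgextend vg0 /dev/sdc                  # add it to the volume group
lvextend -l +100%FREE /dev/vg0/data    # grow the logical volume into the new space
resize2fs /dev/vg0/data                # grow ext3 to fill the logical volume
# An offline resize also needs "e2fsck -f /dev/vg0/data" before resize2fs.
```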

Software RAID – This is something we’ve been itching for for quite some time. With modern system architectures and modern CPUs, there’s no real reason “storage” should be separate from “servers”. A storage device should be just a server with some open-source software and lots of disks. (The “open source” part is important. I’m sick of relying on closed-source RAID firmware). The amount of flexibility, performance, reliability and operational cost savings you can achieve with software RAID rather than hardware is enormous. With real datacenter-grade flash storage devices just around the corner, this becomes even more vital. ZFS makes all of this stuff Just Work, including properly adjusting the write caches on the disk, eliminating the RAID-5 write hole, etc. Our first box still has a battery-backed write-cache between the disks and the CPU for write performance, but all the disks are just exposed as JBOD and striped + mirrored using ZFS. It rocks.

Compression – Ok, so this is where the geek in me decided to get a little crazy. ZFS allows you to turn on (and off) a variety of compression mechanisms on-the-fly on your pool. This comes with some unknown (depends on lots of factors, including your workload, CPUs, etc) performance penalty (CPU is required to compress/decompress), but can have performance upsides too (smaller reads and writes = less busy disk).

InnoDB is notoriously bad at disk usage (we see 2X+ space usage using InnoDB) and while it’s not an enormous concern, it’d be something nice to curtail. On most of our DB boxes, we have idle CPU around (we’re not really I/O bound either – MySQL is a strange duck in that you can be concurrency bound without being either CPU or I/O bound fairly easily thanks to poor locking), so I figured I’d go wild and give it a shot.

Lo and behold, it worked! We’re getting a 2.12X compression ratio on our DB, and performance is keeping up just fine. I ran some quick performance tests on large linear reads/writes and we were measuring 45.6MB/s sustained uncompression and 39MB/s sustained compression on a single-threaded app on an Opteron CPU. We’ll probably continue to test compression stuff, and of course if we run into performance bottlenecks, we’ll turn it off immediately, but so far the mad science experiment is working.

Configuration

Configuring everything was relatively painless. I bounced a few questions off of Sun (imho, this is where Sun really shines – they listen to their customers and put technical people with real answers within arm’s reach) and read the Evil Tuning Guide to ZFS. In the end I really only ended up tweaking two things (plus setting compression to gzip-9):

I set the recordsize to match InnoDB’s – 16KB: zfs set recordsize=16K MYPOOL
I turned off file-level prefetching. See the Evil Tuning Guide. (I’m testing with this on, now, and so far it seems fine).

I believe since ZFS is fully checksummed and transactional (so partial writes never occur) I can disable InnoDB’s doublewrite buffer. I haven’t been brave enough to do this yet, but I plan to. I like performance. 🙂
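Pulled together, the tweaks above look roughly like this. MYPOOL is just the placeholder pool name from earlier, and the doublewrite line is the untested idea, not something running in production:

```shell
# Sketch only: MYPOOL is the placeholder pool name used earlier in this post.
zfs set recordsize=16K MYPOOL       # match InnoDB's 16KB pages; only affects files created afterwards
zfs set compression=gzip-9 MYPOOL   # likewise applies only to data written afterwards

# Confirm the settings and see the achieved compression ratio.
zfs get recordsize,compression,compressratio MYPOOL

# File-level prefetch is a kernel tunable rather than a pool property. Per the
# Evil Tuning Guide it goes in /etc/system and takes effect after a reboot:
#   set zfs:zfs_prefetch_disable = 1

# The untested doublewrite idea would be a my.cnf change under [mysqld]:
#   innodb_doublewrite = 0
```

Since recordsize and compression only apply to data written after they’re set, it’s worth setting both before loading the database onto the pool.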

Performance

This box has been in production in our most important DB cluster for two weeks now. On the metrics I care about (replication lag, query performance, CPU utilization, etc) it’s pulling its fair share of the read load and keeping completely up on replication. Just eyeballing the stats (we haven’t had time to number crunch comparison stats, though we gave some to Sun that I’m hoping they crunch), I can’t tell a difference between this slave and any of the others in the cluster running Linux. I sure feel a lot better about the data integrity, though.

Why not [insert other OS here]?

We could have gone with Nexenta, FreeBSD, Mac OS X, or even *gulp* tried ZFS on FUSE/Linux. To be honest, Nexenta is the most interesting because it actually *is* the Solaris kernel plus Linux userland, exactly what I wanted. I’ve played with it a tiny bit, and plan to play with it more, but this is a mission-critical chunk of data we’re dealing with, so I need a company like Sun in my corner. I find myself wishing Sun had taken the Nexenta route (or offered support for it that I could buy or something). Instead, we’ll be buying software service & support from Sun for this and any other mission-critical OpenSolaris boxes.

FreeBSD also doesn’t have the support I need, Mac OS X wasn’t performant enough the last time I fiddled with it as a server, and most FUSE filesystems are slow so I didn’t even bother.

Gotchas

On my 64GB Linux boxes, I give InnoDB 54GB of buffer pool size. With otherwise exactly the same my.cnf settings, MySQL on OpenSolaris crashes with anything more than 40GB. 14GB, or 21.9% of my RAM, that I can’t seem to use effectively. Sun is looking into this, I’ll let you know if I find anything out.
For a Linux geek, OpenSolaris userland is still painful. Bear in mind that this is a single-purpose box, so all I really want to do is install and configure MySQL, then monitor the software and hardware. If this were a developer box, I would have already given up. OpenSolaris is still very early, so I’m still hopeful, but be prepared to invest some time. Some of my biggest peeves:
- Common commands, like ‘ps’, have very different flags.
- Some GNU bins are provided in /usr/gnu/bin – but a better ‘ps’ is missing, as is ‘top’ (no, ‘prstat’ is *not* the same!), ‘screen’, etc (Can anyone even use remote command-line Unix boxes without ‘screen’? If so, how?)
- Packages are crazily named, making finding your stuff to install tough. Like instead of Apache being called ‘apache’ or ‘httpd’, it’s called ‘SUNWapch’. What?
- After finally figuring out how to search for packages to get the names (‘pkg search -r Apache’ – which doesn’t provide pleasant results), I discovered that ‘top’ and ‘screen’ just simply aren’t provided (or they’re named even worse than I thought). Instead, I had to go to a 3rd party repository, BlastWave, to get them. And then, of course, the ‘top’ OpenSolaris package wouldn’t actually install and I had to manually break into the package and extract the binary. Ugh.

Whew! Big post, but there was a lot of ground to cover. I’m sure there are questions, so please post in the comments and I’ll try to do a follow-up. As I fiddle, tweak, and change things I’ll try to post updates, too – but no promises. 🙂

UPDATE: One other gotcha I forgot to mention. When MySQL (or, presumably, anything else running on the box) gets really busy, user interactivity evaporates on OpenSolaris. Just hitting enter or any other key at a bash prompt over SSH can take many seconds to register. I remember when Linux had these sort of issues in the past, but had blissfully forgotten about them.

UPDATE: I went more in depth on ZFS compression testing and blogged the results. Enjoy!

Categories: datacenter, MySQL Tags: compression, datacenter, dell, ec2, filesystem, filesystem compression, freebsd, fuse, gzip, hardware raid, Linux, lvm, lvm2, mac os x, md3000, MySQL, opensolaris, raid, smugmug, software raid, solaris, sun, sunfire, volume management, volume manager, x2200, zfs

SkyNet Lives! (aka EC2 @ SmugMug)

June 3, 2008 Don MacAskill 53 comments

Everyone knows that SmugMug is a heavy user of S3, storing well over half a petabyte of data (non-replicated) there. What you may not know is that EC2 provides a core part of our infrastructure, too. Thanks to Amazon, the software and hardware that processes all of your high-resolution photos and high-definition video is totally scalable without any human intervention. And when I say scalable, I mean both up and down, just the way it should be. Here’s our approach in a nutshell:

OVERVIEW

The architecture basically consists of three software components: the rendering workers, the batch queuing piece, and the controller. The rendering workers live on EC2, and both the queuing piece and the controller live at SmugMug. We don’t use SQS for our queuing mechanism for a few reasons:

We’d already built a queuing mechanism years ago, and it hasn’t (yet?) hit any performance or reliability bottlenecks.
SQS’s pricing used to be outta whack for what we needed. They’ve since dramatically lowered the pricing and it’s now much more in line with what we’d expect – but by then, we were done.
The controller consumes historical data to make smart decisions, and our existing queuing system was slightly easier to generate the historical data from.

RENDER WORKERS

Our render workers are totally “dumb”. They’re literally bare-bones CentOS 5 AMIs (you can build your own, or use RightScale’s, or whatever you’d like) with a single extra script on them which is executed from /etc/rc.d/rc.local. What does that script do? It fetches intelligence. 🙂

When that script executes, it sends an authenticated request to get a software bundle, extracts the bundle, and starts the software inside. That’s it. Further, the software inside the bundle is self-aware and self-updating, too, automatically fetching updated software, terminating older versions, and relaunching itself. This makes it super-simple to push out new SmugMug software releases – no bundling up new AMIs and testing them or anything else that’s messy. Simply update the software bundle on our servers and all of the render workers automatically get the new release within seconds.

Of course, worker instances might have different roles or be assigned to work with different SmugMug clusters (test vs production, for example), so we have to be able to give it instructions at launch. We do this through the “user-data” launch parameter you can specify for EC2 instances – they give the software all the details needed to choose a role, get software, and launch it. Reading the user-data couldn’t be easier. If you haven’t done it before, just fetch http://169.254.169.254/latest/user-data from your running instance and parse it.
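Here’s a rough sketch of that fetch-and-parse step. The URL is the one above, but the key=value layout and the function names are assumptions for illustration, not our actual format:

```shell
#!/bin/sh
# Sketch only: the key=value layout and function names are assumed for illustration.

USER_DATA_URL="http://169.254.169.254/latest/user-data"

# Fetch the raw user-data blob from inside a running EC2 instance.
fetch_user_data() {
    curl -s --fail "$USER_DATA_URL"
}

# Print the value for key $1 from key=value lines on stdin (first match wins).
user_data_value() {
    awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}
```

On an instance, something like `role=$(fetch_user_data | user_data_value role)` would then pull out a single setting.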

Once they’re up and running, they simply ping the queue service with a “Hi, I’m looking for work. Do you have any?” request, and the queue service either supplies them with work or gives them some other directive (shutdown, software update, take a short nap, etc). Once a job is done (or generated an error), the worker stores the work result on S3 and notifies the queue service that the job is done and asks for more work. Simple.
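The polling loop can be sketched like this. QUEUE_URL, the directive strings ("work:<job-id>", "nap", "update", "shutdown"), and the helper names are all hypothetical stand-ins, not our real protocol:

```shell
#!/bin/sh
# Sketch only: QUEUE_URL, the directive strings, and the helper names are hypothetical.

QUEUE_URL="${QUEUE_URL:-http://queue.example.invalid}"
NAP_SECONDS="${NAP_SECONDS:-5}"

# Ask the queue service for a directive: "work:<job-id>", "nap", "update" or "shutdown".
ask_queue()   { curl -s --fail "$QUEUE_URL/next"; }
# Do the job and store the result on S3 (left as a stub here).
process_job() { echo "processing job $1"; }
# Tell the queue service the job finished.
report_done() { curl -s --fail -X POST "$QUEUE_URL/done/$1" >/dev/null; }

worker_loop() {
    while :; do
        # Treat a failed request like a "nap" so a queue outage doesn't spin the CPU.
        directive=$(ask_queue) || directive=nap
        case "$directive" in
            work:*)
                job=${directive#work:}
                process_job "$job"
                report_done "$job"
                ;;
            update|shutdown)
                echo "$directive"   # caller decides how to relaunch or power off
                return 0
                ;;
            *)
                sleep "$NAP_SECONDS"
                ;;
        esac
    done
}
```

Keeping the worker this dumb is the point: every decision lives in the queue service, so changing behavior never means touching the AMI.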

QUEUE SERVICE

This is your basic queuing service, probably very similar to any other queueing service you’ve seen before. Ours supports job types (new upload, rotate, watermark, etc) and priorities (Pros go to the head of the line, etc) as well as other details. Upon completion, it also logs historical data such as time to completion. It also supports time-based re-queueing in the event of a worker outage, miscommunication, error, or whatever. I haven’t taken a really hard look at SQS in quite some time, but I can’t imagine it would be very difficult to implement on SQS for those of you starting fresh.

CONTROLLER (aka SkyNet)

For me, this was the fun part. Initially we called it RubberBand, but we had an unusual partial outage one day which caused it to go berserk and launch ~250 XL instances (~2000 normal EC2 instances) in a single call. Clearly, it had gained sentience and was trying to take over the world, so we renamed it SkyNet. (We’ve since corrected the problem, and given SkyNet more reasonable thresholds and limits. And yes, I caught it within the hour.)

SkyNet is completely autonomous – it operates with zero human interaction, either watching or providing interactive guidance. No-one at SmugMug even pays attention to it anymore (and we haven’t for many months) since it operates so efficiently. (Yes, I realize that means it’s probably well on its way to world domination. Sorry in advance to everyone killed in the forthcoming man-machine war.)

Roughly once per minute, SkyNet makes an EC2 decision: launch instance(s), terminate instance(s), or sleep. It has a lot of inputs – it checks anywhere from 30-50 pieces of data to make an informed decision. One of the reasons for that is we have a variety of different jobs coming in, some of which (uploads) are semi-predictable. We know that lots of uploads come in every Sunday evening, for example, so we can begin our prediction model there. Other jobs, though, such as watermarking an entire gallery of 10,000 photos with a single click, aren’t predictable in a useful way, and we can only respond once the load hits the queue.

A few of the data points SkyNet looks at are:

How many jobs are pending?
What’s the priority of the jobs?
What type of jobs are they?
How complex are the pending jobs? (ex: HD video vs 1Mpix photo)
How time-sensitive are the pending jobs? (ex: Uploads vs rotations)
Current load of the EC2 cluster
Current # of jobs per sample processed
Average time per job per sample
Historical load and job performance
How close any instances are to the end of their 1-hour cost window
Recent SkyNet actions (start/terminate/etc)

.. and the list goes on.
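To make one of those inputs concrete, here’s a minimal sketch of the 1-hour cost window check. The function name and the 5-minute default margin are assumptions for illustration, not SkyNet’s actual values:

```shell
# Sketch only: the function name and 5-minute default margin are assumed values.
# Succeeds (exit 0) when an instance is close enough to the end of its current
# billed hour that terminating it now wastes little paid-for time.
#   $1 = instance uptime in seconds
#   $2 = margin in minutes (default 5)
near_hour_boundary() {
    uptime_s=$1
    margin_min=${2:-5}
    minutes_into_hour=$(( (uptime_s / 60) % 60 ))
    [ "$minutes_into_hour" -ge $(( 60 - margin_min )) ]
}
```

An instance that’s 56 minutes into its hour passes the check, while one that’s 30 minutes in is better left running, since that time is already paid for.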

Our goal is to keep enough slack around to handle surges of unpredictable batch operations, but not enough so it drains our bank account. We’ve settled on an average of roughly 25% of excess compute capacity available when averaged over a full 24 hour period and SkyNet keeps us remarkably close to that number. We always err on the side of more excess (so we get faster processing times) rather than less when we have to make a decision. It’s great to save a few bucks here and there that we can plow back into better customer service or a new feature – but not if photo uploads aren’t processing, consistently, within 5-30 seconds of upload.

Our workers like lots of threads, so SkyNet does its best to launch c1.xlarge instances (Amazon calls these “High-CPU Instances“), but is smart enough to request equivalent other instance sizes (2 x Large, 8 x Small, etc) in the event it can’t allocate as many c1.xlarge instances as it would like. Our application doesn’t care how big/small the instances are, just that we get lots of CPU cores in aggregate. (We were in the Beta for the High-CPU feature, so we’ve been using it for months).

One interesting thing we had to take into account when writing SkyNet was the EC2 startup lag. Don’t get me wrong – I think EC2 starts up reasonably fast (~5 mins max, usually less), but when SkyNet is making a decision every minute, that means you could launch too many instances if you don’t take recent actions into account to cover startup lag (and, conversely, you need to start instances a little earlier than you might actually need them otherwise you get behind).
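A stripped-down sketch of that sizing decision, with startup lag accounted for, might look like the following. Boiling the inputs down to a single jobs-per-instance figure and a flat slack percentage is a big simplification of what SkyNet really does, and the function name is made up:

```shell
# Sketch only: one "jobs per instance" figure and a flat slack percentage are
# simplifying assumptions; the real controller weighs 30-50 inputs.
#   $1 = pending jobs
#   $2 = jobs one instance can clear in the planning window
#   $3 = instances already running
#   $4 = instances launched recently but still booting
#   $5 = slack percentage to keep in reserve (default 25)
# Prints how many instances to launch (positive), terminate (negative), or 0.
instance_delta() {
    pending=$1; per_instance=$2; running=$3; booting=$4; slack=${5:-25}
    # ceil(pending * (100 + slack) / (100 * per_instance)) using integer math
    denom=$(( 100 * per_instance ))
    needed=$(( (pending * (100 + slack) + denom - 1) / denom ))
    # Booting instances count as capacity, otherwise every 1-minute tick re-launches them.
    echo $(( needed - running - booting ))
}
```

With 1000 pending jobs, 50 jobs per instance, and 10 instances running, this asks for 15 more. Run it again a minute later while those 15 are still booting and it asks for 0, which is exactly the over-launch it’s meant to prevent.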

THE MONEY

SmugMug is a profitable business, and we like to keep it that way. The secrets to efficiently using EC2, at least in our use case, are as follows:

Take advantage of the free S3 transfers. This is a biggy. Our workers get and put almost all of their bytes to/from S3.
Make sure you have scaling down working as well as scaling up. At 3am on an average Wednesday morning, we have very few instances running.
Use the new High-CPU Instances. Twice the CPU resources for the same $$ if you don’t need RAM.
Amazon kindly gives you 30 days to monetize your AWS expenses. Use those 30 days wisely – generate revenues. 🙂

WHY NO WEB SERVERS?

I get asked this question a lot, and it really comes down to two issues, one major and one minor:

No complete DB solution. SimpleDB is interesting, and the new EC2 Persistent Storage is too, but neither provides a complete solution for us. EC2 storage isn’t performant enough without some serious, painful partitioning to a finer grain than we do now – which comes with its own set of challenges, and SimpleDB both isn’t performant enough and doesn’t address all of our use cases. Since latency to our DBs matters a great deal to our web servers, this is a deal-killer – I can’t have EC2 web servers talking to DBs in my datacenters. (There are a few corner cases we’re exploring where we probably can, but they’re the exception – not the rule).
No load balancing API. They’ve got an IP address solution in the form of Elastic IPs, which is awesome and a major step forward, but they don’t have a simple Load Balancer API that I can throw my web boxes behind. Yes, I realize I can manually do it using EC2 instances, but that’s more fragile and difficult (and has unknown scaling properties at our scale). If the DB issue were solved, I’d probably dig into this and figure out how to do it ourselves – but since it’s not, I can keep asking for this in the meantime.

Let me be very clear here: I really don’t want to operate datacenters anymore despite the fact that we’re pretty good at it. It’s a necessary evil because we’re an Internet company, but our mission is to be the best photo sharing site. We’d rather spend our time giving our customers great service and writing great software rather than managing physical hardware. I’d rather have my awesome Ops team interacting with software remotely for 100% of their duties (and mostly just watching software like SkyNet do its thing). We’ll get there – I’m confident of that – we’re just not there yet.

Until then, we’ll stick with a hybrid approach.

Categories: amazon, datacenter Tags: amazon, cloud computing, ec2, photo processing, rubberband, s3, skynet, smugmug, sqs, video rendering

MySQL and the Linux swap problem

May 1, 2008 Don MacAskill 44 comments

Ever since Peter over at Percona wrote about MySQL and swap, I’ve been meaning to write this post. But after I saw Dathan Pattishall’s post on the subject, I knew I’d better actually do it. 🙂

There’s a nasty problem with Linux 2.6 even when you have a ton of RAM. No matter what you do, including setting /proc/sys/vm/swappiness = 0, your OS is going to prefer swapping stuff out rather than freeing up system cache. On a single-use machine, where the application is better at utilizing RAM than the system is, this is incredibly stupid. Our MySQL boxes are a perfect example – they run only MySQL and we want InnoDB to have a lot of RAM (32-64GB … and we’re testing 128GB).

You can’t just not have any swap partitions, though, or kswapd will literally dominate one of your CPU cores doing who-knows-what. But you can’t have it swapping to disk, or your performance goes into the toilet. So what to do?

Our solution is to make swap partitions out of RAM disks. Yes, I realize how insane that sounds, but the Linux kernel’s insanity drove us to it. Best part? It works. Here’s how:

mkdir /mnt/ram0
mkfs.ext3 -m 0 /dev/ram0
mount /dev/ram0 /mnt/ram0
dd bs=1024 count=14634 if=/dev/zero of=/mnt/ram0/swapfile
mkswap /mnt/ram0/swapfile
swapon /mnt/ram0/swapfile

That’ll give you a 14MB swap partition that’s actually in RAM, so it’s super-fast. This assumes your kernel is creating 16MB ramdisk partitions, but you can adjust your kernel parameters and/or the ‘dd’ line above to suit whatever size you want.

We’ve found that anywhere from 20MB-40MB tends to be enough (so use /dev/ram1, /dev/ram2, etc), depending on load of the box. kswapd no longer uses any noticeable CPU, there’s always a few MB of free “swap”, and life is back in the fast lane. Just add those lines to your relevant startup file, like /etc/rc.d/rc.local, and it’ll persist after reboots.
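If you want more than one ramdisk’s worth, the same steps can be wrapped in a loop for your startup file. This is only a sketch: the DRY_RUN switch and the function names are additions for illustration, and it still assumes 16MB ramdisks:

```shell
#!/bin/sh
# Sketch only: DRY_RUN and the function names are additions for illustration.
# With DRY_RUN set, commands are printed instead of executed.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# Create a RAM-backed swap file on each of the first $1 ramdisks (/dev/ram0, /dev/ram1, ...).
setup_ram_swap() {
    i=0
    while [ "$i" -lt "$1" ]; do
        run mkdir -p "/mnt/ram$i"
        run mkfs.ext3 -m 0 "/dev/ram$i"
        run mount "/dev/ram$i" "/mnt/ram$i"
        run dd bs=1024 count=14634 if=/dev/zero of="/mnt/ram$i/swapfile"
        run mkswap "/mnt/ram$i/swapfile"
        run swapon "/mnt/ram$i/swapfile"
        i=$(( i + 1 ))
    done
}
```

Calling `setup_ram_swap 2` as root gives roughly 28MB of RAM-backed swap, and `swapon -s` will show whether both files are active. Try it with DRY_RUN=1 first to see exactly what it’s going to run.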

Some Linux purists will probably hate this approach, others may have more efficient ways of achieving the same thing, but this works for us. Give it a shot. 🙂

Oh, and I hope it goes without saying, but make *darn* sure you know what you’re running on your box and what the maximum RAM footprint will be before you try running with only 20-40MB of swap. We’ve never OOMed (Out-Of-Memory) a production MySQL box – but that’s because we’re careful.

UPDATE: See what happens when I wait to blog? I forget that I read another related post over on Kevin Burton’s blog. Like Kevin, we’re using O_DIRECT, but unlike Kevin, this doesn’t solve the problem for us. Linux still swaps. We use the latest 2.6.18-53.1.14.el5 kernel from CentOS 5, btw. (Sorry, had posted 2.6.9 because I was dumb. We’re fully patched)

Categories: datacenter, MySQL Tags: Innodb, Linux, memory, MySQL, OOM, percona, RAM, swap

New Amazon Features: Status Dashboard & Paid Service

April 17, 2008 Don MacAskill 10 comments

I realize I’m already way behind blogging about other new Amazon Web Services features like the recent EC2 release with static IPs, availability zones, and user kernels not to mention the new block storage service. I’ll still try to get to them – but I didn’t want to wait for this one.

I’ve been pushing Amazon hard to do something like this, and I’m thrilled it’s finally out. They have a great new service status dashboard complete with historical data and a mechanism for communicating to us, their customers, about any issues they may be having. Especially cool is that the data is provided via RSS, so you can programmatically poll the status and take steps as necessary. Awesome! Get all the details here.

One possible gotcha is that it looks like the dashboard is hosted at Amazon. We’ve run into outages (very rare) where all of amazon.com is down. In those cases, it’d be nice to have an externally-hosted site where they could post updates. Our customers asked us for this recently, so on January 29th, we were happy to comply. Perhaps Amazon could post to their TypePad blog in events like these, rare as they may be?

Next, they now offer paid premium support. Need some sort of help that’s not provided on the AWS forums or via searching Google? No worries – whip out your credit card and pay for it. Looks like they have two plans which should cover lots of use cases I’ve seen in my own comments and on the forums.

I’d still like to see a pay-per-incident model, personally, even with an extremely high price-tag for each incident. We rarely use support for AWS, but at the same time, we’re very big customers of theirs, so the monthly price is quite high. But if we really come up against a big problem, it’d be nice to know I could pay for support just that one time. I imagine most of their customers will like their Silver and Gold monthly packages, but for us, they’re just not quite the right fit. Do they work for you?

I’m pretty thrilled about this release, but maybe our use case is different from yours. Do you like these new features? Are they missing things you’d like to see?

Categories: amazon, datacenter Tags: amazon, block storage, ec2, s3, web services

Death of MySQL read replication highly exaggerated

April 16, 2008 Don MacAskill 4 comments

I know I’m a little late to the discussion, but Brian Aker posted a thought-provoking piece on the imminent death of MySQL replication to scale reads. His premise is that memcached is so cool and scales so much better, that read replication scaling is going to become a thing of the past. Other MySQL community people, like Arjen and Farhan, chimed in too.

Now, I love memcached. We use it as a vital layer in our datacenters, and we couldn’t live without it. But it’s not a total solution to all reads, so at least for our use case, it’s not going to kill our replica slaves that we use to scale reads.

Why? Because we still need to do index lookups to get the keys that we can extract from memcached. And we have to do lots of those indexed queries. Most of the row data lives inside of memcached, so this turns out to be a great solution, but we still need read slaves to provide the lists of keys. Bottom line is that we still use read replication heavily – but we use it for different things than we did in years past.

And then, of course, there’s the issue of memcached failure. For us, it’s very rare, and thanks to the way memcached works, it rarely hampers system performance, but when a node fails and needs to be re-filled, we have to go back to disk to get it. And doing that efficiently means read slaves again.

For us, memcached plus MySQL replication is true magic. Brian’s a very smart guy, and I realize he wrote the post to get people thinking and talking about the issue, but at least for us, read slaves are here to stay. 🙂

Categories: datacenter, memcached, MySQL Tags: arjen lentz, brian aker, farhan mashraqi, memcached, MySQL, replication, slaves

The Sky is Falling! MySQL charging for features!

April 16, 2008 Don MacAskill 10 comments

There’s quite a bit of buzz on the blogosphere from people I respect a great deal, like Jeremy Cole at Proven Scaling and Vadim at Percona, about MySQL’s new Enterprise backup plans.

The big deal? They’re releasing a Community version that doesn’t have all the same features as the Enterprise version of Online Backup, including compression and encryption. The Community version is open-sourced under GPL, the Enterprise version is not.

Personally, I think this is awesome. Don’t get me wrong – I love open source. We couldn’t have built our business without it, and we love it when we get a chance to contribute back to the community.

But let’s not forget that MySQL is a business. And that business helps the community and improves the software. They have customers (I’m one – we’re a paying MySQL Enterprise Platinum customer), and they have to solve those customers’ problems. This is a virtuous cycle where the community benefits directly as MySQL thrives financially.

Every time a business like us pays MySQL for a service or feature, MySQL can then invest in better software that benefits all. The end result in MySQL’s case is more GPL’d code. In a very real way, without companies like mine, there wouldn’t be a new backup tool at all – let alone the differences this debate is focused on.

Every day, I hear someone saying “Man, I love SmugMug so much! It has [insert features here] which I love! Why isn’t it free?”

The answer? “It wouldn’t be SmugMug if it was free.” MySQL’s situation is very similar.

I wish more open source projects would make it easier for this cycle to ignite. Some of them, like Red Hat, refuse to even take our money. Talk about stupid. There are *lots* of businesses out there willing to pay for extra services and features, and the community can harness that revenue in amazing ways, including getting more (or better) GPL’d code.

Couple more thoughts:

I wouldn’t be surprised if future releases add new Enterprise-only features and some existing Enterprise-only features migrate down to Community.
The Community version is open-sourced, so I’m sure the community will develop their own compression and encryption features.
This is really no different from Enterprise Monitor, which has been only for Enterprise customers for a while.
Lots of other projects do this (and I would argue this benefits those projects and their communities, too)
I’m 99% sure that this was the plan before Sun acquired MySQL.

In short, I view this as one of the ways we can both build our business and give back to the open source community. Keep it up, MySQL!

Categories: datacenter, MySQL Tags: enterprise, enterprise monitor, MySQL, online backup, smugmug, sun

Thoughts on Google App Engine

April 8, 2008 Don MacAskill 23 comments

First: Very cool.

Next: I think it’s interesting that Google has basically taken a sniper scope out and aimed it at a specific cloud computing target. App Engine is only for web applications. No batch computing, no cron jobs, no CPU/disk/network access, etc.

I think this is very smart of Google. Rather than attacking Amazon head-on, Google has realized there’s a huge playing field for cloud computing, and is attempting to dominate another portion of it, one where they have a lot of expertise. Very good business move, imho.

Will we use it? I wouldn’t be surprised. I’ve long thought that we’ll continue to mix in web services from a variety of providers, and it looks like App Engine can solve a slice of our datacenter needs that other providers don’t yet cover.

I’m more than a little concerned, though, by how much vendor lock-in there is with App Engine. At first glance, it doesn’t look like the apps will be portable at all. If I want to switch providers, or add in other providers so I’m not relying solely on Google, I’m outta luck.

I’m hopeful other languages get supported, too. I think Python is great – don’t get me wrong – but we have a lot more experience with other languages, so there’ll be a learning curve.

Finally, I’m dying to find out what pricing for an application of our scale will look like. I can see some immediate, obvious things I’d like to try to do on App Engine, but the beta limits aren’t gonna cut it for us. 😦

Will it replace Amazon? It sure doesn’t look like it from where I sit. In fact, I don’t see this as much of a competitor to Amazon Web Services. There’s some overlap in some small area (hosted web apps on EC2), but I doubt that’s the bulk of Amazon’s business. As I said, we’ll likely end up using both (and other providers as they come along, too).

My favorite bit? In theory, Google has solved the data scaling problem. I don’t mean raw binary (blob) storage, which S3, SmugFS, MogileFS, and plenty of other things have solved, but the “database” scaling problem. Every popular web app runs into this problem, and it’s typically solved with a combination of memcached, federation, and replication. But it’s messy. In theory, Google has automated that piece for us. I can’t wait to play with it and see if that’s true.
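For anyone who hasn't had to build this by hand, here's a toy sketch of two of those pieces: federation (each user's data lives on one shard) and a memcached-style read-through cache in front. Everything here is hypothetical and simplified. Plain dicts stand in for both the cache and the database shards, `FederatedStore` is a made-up name, and replication is left out entirely.

```python
class FederatedStore:
    """Toy model: N 'database' shards (dicts) fronted by one 'cache' (dict)."""

    def __init__(self, shard_count: int = 4):
        self.shards = [dict() for _ in range(shard_count)]
        self.cache = {}
        self.db_reads = 0  # counts how often a read had to hit a shard

    def shard_for(self, user_id: int) -> dict:
        # Federation: each user's rows live on exactly one shard.
        return self.shards[user_id % len(self.shards)]

    def write(self, user_id: int, value: str) -> None:
        self.shard_for(user_id)[user_id] = value
        # Invalidate rather than update, so the next read repopulates the cache.
        self.cache.pop(user_id, None)

    def read(self, user_id: int):
        # Cache-aside: try the cache first, fall back to the owning shard.
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        value = self.shard_for(user_id).get(user_id)
        if value is not None:
            self.cache[user_id] = value
        return value


if __name__ == "__main__":
    store = FederatedStore(shard_count=4)
    store.write(42, "photo metadata")
    store.read(42)  # miss: goes to the shard, then fills the cache
    store.read(42)  # hit: served from the cache
    print("shard reads:", store.db_reads)
```

Even in toy form you can see where the mess comes from: the application has to know which shard owns what, and every write has to remember the cache.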

I also can’t wait to see who else is going to wade into this fray. Sun? Microsoft? Yahoo? IBM?

Bring it on!

Categories: business, datacenter, web 2.0 Tags: amazon, app engine, appengine, ec2, google, mogilefs, s3, smugfs, web services

EC2 isn't 50% slower

February 27, 2008 Don MacAskill 21 comments

I don’t want to start a nerdfight here, but it might be inevitable. 🙂

Valleywag ran a story today about how Amazon’s EC2 instances are running at 50% of their stated speed/capacity. They based the story on a blog post by Ted Dziuba, of Persai and Uncov fame, whose writing I really love.

Problem is, this time, he’s just wrong. Completely full of FAIL.

I’ll get to that in a minute, but first, let me explain what I think is happening: Amazon’s done a poor job at setting user expectations around how much compute power an instance has. And, to be fair, this really isn’t their fault – both AMD and Intel have been having a hard time conveying that very concept for a few years now.

All of the other metrics – RAM, storage, etc – have very fixed numbers. A GB of RAM is a GB of RAM. Ditto storage. And a megabit of bandwidth is a megabit of bandwidth. But what on earth is a GHz? And how do you compare a 2006 Xeon GHz to a 2007 Opteron GHz? In reality, for mere mortals, you can’t. Which sucks for you, me, and Amazon – not to mention AMD and Intel.

Luckily, there’s an answer – EC2 is so cheap, you can spin up an instance for an hour or two and run some benchmarks. Compare them yourself to your own hardware, and see where they match up. This is exactly what I did, and why I was so surprised to see Ted’s post. It sounded like he didn’t have any empirical data.
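A minimal sketch of that kind of do-it-yourself comparison, in Python: time a fixed CPU-bound workload a few times and keep the best run, then do the same on your own hardware. This is not the benchmark described in this post. `cpu_workload` and `best_time` are names made up for illustration, and the loop is only a stand-in for whatever your real app does.

```python
import time


def cpu_workload(n: int = 200_000) -> int:
    """Hypothetical stand-in for real work: sum of squares modulo a prime."""
    total = 0
    for i in range(n):
        total = (total + i * i) % 1_000_003
    return total


def best_time(func, repeats: int = 5) -> float:
    """Run func several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    seconds = best_time(cpu_workload)
    print(f"best of 5 runs: {seconds:.4f}s")
```

Run the same script on the instance and on a box you already know well. Taking the best of several runs filters out some noise from whatever else is running on the machine.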

Admittedly, we’re pretty insane when it comes to testing hardware out. Rather than trust the power ratings given by the manufacturers, for example, we get our clamp meters out and measure the machines’ power draw under full load. You’d be surprised how much variance there is.

There was one data point in a thread linked from Ted’s post that had me scratching my head, though, and I began to wonder if the Small EC2 instances actually had some sort of problem. (We only use the XLarge instance sizes.) This guy had written a simple Ruby script and was seeing a 2X performance difference between his local Intel Core 2 Duo machine and the Small EC2 instance online. Can you spot the problem? I missed it, so I headed over to IRC to find Ted and we proceeded to benchmark a bunch of machines we had around, including all three EC2 instance sizes.

Bottom line? EC2 is right on the money. Ted’s 2.0GHz Pentium 4 performed the benchmark almost exactly as fast as the Small (aka 1.7GHz old Xeon) instance. My 866MHz Pentium 3 was significantly slower, and my modern Opteron was significantly faster.

So what about that guy with the Ruby benchmark? Can you see what I missed, now? See, he’s using a Core 2 Duo. The Core line of processors has completely revolutionized Intel’s performance envelope, and thus, the Core processors perform much better for each clock cycle than the older Pentium line of CPUs. This is akin to AMD, which long ago gave up the GHz race, instead choosing to focus on raw performance (or, more accurately, performance per watt).

Whew. So, what have we learned?

All GHz aren’t created equal.
CPU architecture & generation matter, too, not just GHz.
AMD GHz have, for years, been more effective than Intel GHz. Recently, Intel GHz have gotten more effective than older Intel GHz.
Comparing old pre-Core Intel numbers with new Intel Core numbers is useless.
“top” can be confusing at best, and outright lie at worst, in virtualized instances. Either don’t look at it, or realize the “steal %” figure (shown as “st” in top’s CPU line) is other VMs on your same hardware doing their thing – not idle CPU you should be able to use.
Benchmark your own apps yourself to see exactly what the price per compute unit is. Don’t rely on GHz numbers.
Don’t believe everything you read online (threads, blogs, etc) – including here! People lie and do stupid things (I’m dumb more often than I’m not, for example). Data is king – get your own.
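If you'd rather not squint at “top”, the steal figure can be worked out from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields). The function names are made up for illustration, and the sample lines in the demo are synthetic, not measurements from any real instance.

```python
def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline().strip()


def parse_cpu_line(line: str) -> list:
    """Parse the aggregate 'cpu' line into a list of integer tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Percent of CPU time taken by the hypervisor between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user, so only the first eight are summed.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    if len(deltas) < 8 or total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


if __name__ == "__main__":
    # Synthetic samples for illustration; on a real box, call read_cpu_line()
    # twice with a short sleep in between.
    sample_before = "cpu 100 0 50 800 10 0 0 40 0 0"
    sample_after = "cpu 200 0 100 1500 20 0 0 180 0 0"
    print(f"steal: {steal_percent(sample_before, sample_after):.1f}%")
```

A high steal number means the hypervisor was running something else during that slice of time, so it shouldn't be read as spare capacity on your instance.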

Hope that clears that up. And if I’m dumb, I fully expect you to tell me so in the comments – but you’d better have the data to back it up!

(And yes, I’m still prepping a monster EC2 post about how we’re using it. Sorry I suck!)

Categories: amazon, datacenter Tags: amazon, amd, cpu, datacenter, ec2, ghz, intel, performance, persai, server, speed, ted dziuba, uncov, valleywag

More on MySQL & Sun

January 16, 2008 Don MacAskill 5 comments

Laura Thomson has an interesting post about the MySQL acquisition. And I think it really highlights a fundamental disconnect that some companies built on providing open source applications for enterprises face:

Their means of getting revenue are at odds with their customers’ needs.

I’m a paying MySQL Enterprise Platinum customer, and I’m seriously considering not renewing for another year if Laura’s thoughts are on target. In a nutshell, here’s why:

I would pay more for a version of MySQL that has Yasufumi Kinoshita and Google’s patches than I would pay for a version without.

In fact, as I mentioned already, I probably wouldn’t pay for MySQL as it stands today. I paid for it in the hopes that, as a paying customer, my feedback that these patches (and others like them) are vital would be listened to. Thus far, it hasn’t.

I couldn’t care less about MySQL’s desire to keep their released, supported software dual-licensed (commercial and GPL). I don’t consider our Enterprise subscription to be for the software – mentally, I’m paying for service and support. And the support (fixing InnoDB’s concurrency problems) is increasingly at odds with the business (releasing a commercial binary-only Enterprise release). The two are on a collision course – I’m not the only one who will stop paying for it, resulting in damage to MySQL’s business.

I believe the right (and admittedly scary) thing to do is provide paid support for the GPL’d version and move the ball forward – accept community patches that fix major problems.

You can bet that I’ll be telling Sun this, over and over again. Since they have a history of listening, I’m optimistic.

(BTW, this problem isn’t unique to MySQL. Red Hat has the same dilemma – and they won’t take my money, no matter how hard I try to throw it their way)

Categories: business, datacenter, web 2.0

Sun acquires MySQL!

January 16, 2008 Don MacAskill 10 comments

Remember when I said Sun was a company that listened? They sure do.

Maybe MySQL will finally start fixing all the performance/concurrency issues with InnoDB (basically, InnoDB’s threading and concurrency aren’t working well with modern multi-core CPUs). Google’s had some fabulous patches for a while, and the brilliant Yasufumi Kinoshita does as well, but they don’t seem to be making their way into MySQL anytime soon.

Personally, I worry they’re focused too much on Falcon and not enough on InnoDB – but luckily Sun listens, so that may change. 🙂

Categories: business, datacenter, web 2.0
