Success with OpenSolaris + ZFS + MySQL in production!

October 10, 2008
Pimp My Drive by Richard and Barb

There’s remarkably little information online about using MySQL on ZFS, successfully or not, so I did what any enterprising geek would do: Built a box, threw some data on it, and tossed it into production to see if it would sink or swim. :)

I’m a Linux geek, have been since 1993 (Slackware!). All of SmugMug’s datacenters (and our EC2 images) are built on Linux. But the current state of filesystems on Linux is awful, and it’s been awful for at least 8 years. As a result, we’ve put our first OpenSolaris box into production at SmugMug and I’ve been pleasantly surprised with the performance (the userland portions of the OS, though, leave a lot to be desired). Why OpenSolaris?

ZFS.

ZFS is the most amazing filesystem I’ve ever come across. Integrated volume management. Copy-on-write. Transactional. End-to-end data integrity. On-the-fly corruption detection and repair. Robust checksums. No RAID-5 write hole. Snapshots. Clones (writable snapshots). Dynamic striping. Open source software. It’s not available on Linux. Ugh. Ok, that sucks. (GPL is a double-edged sword, and this is a perfect example). Since it’s open-source, it’s available on other OSes, like FreeBSD and Mac OS X, but Linux is a no go. *sigh* I have a feeling Sun is working towards GPL’ing ZFS, but these things take time and I’m sick of waiting.

The OpenSolaris project is working towards making Solaris resemble the Linux (GNU) userland plus the Solaris kernel. They’re not there yet, but the goal is commendable and the package management system has taken a few good steps in the right direction. It’s still frustrating, but massively less so. Despite all the rough edges, though, ZFS is just so compelling I basically have no choice. I need end-to-end data integrity. The rest of the stuff is just icing on an already delicious cake.

The obvious first place to use ZFS was for our database boxes, so that’s what I did. I didn’t have the time, knowledge of OpenSolaris, or inclination to do any synthetic benchmarking or attempt to create an apples-to-apples comparison with our current software setup, so I took the quickest route I could to have a MySQL box up and running. I had two immediate performance metrics I cared about:

  • Can a MySQL slave on OpenSolaris with ZFS keep up with the write load with no readers?
  • If yes, can the slave shoulder its fair share of the reads, too?

Simple and to the point. Here’s the system:

  • SunFire X2200 M2 w/64GB of RAM and 2 x dual-core 2.6GHz Opterons
  • Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)

The quickest path to getting the system up and running resulted in lots of variables in the equation changing:

  • Linux -> OpenSolaris (snv_95 currently)
  • MySQL 5.0 -> MySQL 5.1
  • LVM2 + ext3 -> ZFS
  • Hardware RAID -> Software RAID
  • No compression -> gzip-9 volume compression

Whew! Lots of changes. Let me break them down one by one, skipping the obvious first one:

MySQL – MySQL 5.1 is nearing GA, and has a couple of very important bug fixes for us that we’ve been working around for an awfully long time now. When I downloaded the MySQL 5.0 Enterprise Solaris packages and they wouldn’t install properly, that made the decision to dabble with 5.1 even easier – the CoolStack 5.1 binaries from Sun installed just fine. :)

Going to MySQL 5.1 on a ~1TB DB is painful, though, I should warn you up front. It forced ‘REPAIR TABLE’ on lots of my tables, so this step took much longer than I expected. Also, we found that the query optimizer in some cases did a poor job of choosing which indexes to use for queries. A few “simple” SELECTs (no JOINs or anything) that would take a few milliseconds on our 5.0 boxes took seconds on our 5.1 boxes. A little bit of code solved the problem and resulted in better efficiency even for the 5.0 boxes, so it was a net win, but painful for a few hours while I tracked it down.

Finally, after running CoolStack for a few days, we switched (on advice from Sun) to the 5.1.28 Community Edition to fix some scalability issues. This made a huge difference so I highly recommend it. (On a side note, I wish MySQL provided Enterprise binaries for 5.1 for their paying customers to test with). The Google & Percona patches should make a monster difference, too.

Volume management and the filesystem – There’s some debate online as to whether ZFS is a “layering violation” or not. I couldn’t care less – it’s pure heaven to work with. This is how filesystems should have always been. The commands to create, manage, and extend pools are so simple and logical you basically don’t even need man pages (discovering disk names, on the other hand, isn’t easy. I finally used ‘format’ but even typing it gives me the shivers…).

    zpool create MYPOOL c0t0d0

You just created a ZFS pool. Want a mirror?

    zpool create MYPOOL mirror c0t0d0 c0t0d1

Want a striped mirror (RAID-1+0) w/spare?

    zpool create MYPOOL mirror c0t0d0 c0t0d1 mirror c0t0d2 c0t0d3 spare c0t0d4

Want to add another mirror to an already striped mirror (RAID-1+0) pool?

    zpool add MYPOOL mirror c0t0d5 c0t0d6

Get the idea? Super-easy. Massively easier than LVM2+ext3 where adding a mirror is at least 4 commands: pvcreate, vgextend, lvextend, resize2fs – usually with an fsck in there too.
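To make the mirror-plus-spare pattern concrete, here is a small sketch in shell that builds (but does not run) a zpool create command from a list of disks. The helper name build_zpool_create is hypothetical, and the pairing rule (adjacent disks become mirrors, an odd disk out becomes the spare) is an assumption for illustration, not necessarily the layout used on this box.

```sh
#!/bin/sh
# Build (but do not run) a "zpool create" command for a striped mirror.
# Assumption: adjacent disks are paired into mirrors; an odd disk out
# becomes a hot spare. build_zpool_create is a hypothetical helper name.
build_zpool_create() {
    if [ "$#" -lt 3 ]; then
        echo "usage: build_zpool_create POOL DISK DISK [DISK...]" >&2
        return 1
    fi
    pool=$1
    shift
    cmd="zpool create $pool"
    # Consume the disk list two at a time, one mirror per pair.
    while [ "$#" -ge 2 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    # A single leftover disk is added as a spare.
    if [ "$#" -eq 1 ]; then
        cmd="$cmd spare $1"
    fi
    echo "$cmd"
}

# Prints the RAID-1+0 w/spare command from the text for five example disks.
build_zpool_create MYPOOL c0t0d0 c0t0d1 c0t0d2 c0t0d3 c0t0d4
```

Printing the command first and pasting it by hand is deliberate: zpool create is not something I would want a script to run unattended.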

Software RAID – This is something we’ve been itching for for quite some time. With modern system architectures and modern CPUs, there’s no real reason “storage” should be separate from “servers”. A storage device should be just a server with some open-source software and lots of disks. (The “open source” part is important. I’m sick of relying on closed-source RAID firmware). The amount of flexibility, performance, reliability and operational cost savings you can achieve with software RAID rather than hardware is enormous. With real datacenter-grade flash storage devices just around the corner, this becomes even more vital. ZFS makes all of this stuff Just Work, including properly adjusting the write caches on the disk, eliminating the RAID-5 write hole, etc. Our first box still has a battery-backed write-cache between the disks and the CPU for write performance, but all the disks are just exposed as JBOD and striped + mirrored using ZFS. It rocks.

Compression – Ok, so this is where the geek in me decided to get a little crazy. ZFS allows you to turn on (and off) a variety of compression mechanisms on-the-fly on your pool. This comes with some unknown (depends on lots of factors, including your workload, CPUs, etc) performance penalty (CPU is required to compress/decompress), but can have performance upsides too (smaller reads and writes = less busy disk).

InnoDB is notoriously bad at disk usage (we see 2X+ space usage using InnoDB) and while it’s not an enormous concern, it’d be something nice to curtail. On most of our DB boxes, we have idle CPU around (we’re not really I/O bound either – MySQL is a strange duck in that you can be concurrency bound without being either CPU or I/O bound fairly easily thanks to poor locking), so I figured I’d go wild and give it a shot.

Lo and behold, it worked! We’re getting a 2.12X compression ratio on our DB, and performance is keeping up just fine. I ran some quick performance tests on large linear reads/writes and measured 45.6MB/s sustained decompression and 39MB/s sustained compression on a single-threaded app on an Opteron CPU. We’ll probably continue to test compression stuff, and of course if we run into performance bottlenecks, we’ll turn it off immediately, but so far the mad science experiment is working.
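For a feel of what those numbers mean, here is a back-of-the-envelope sketch in shell. The helper names and the example inputs (1000GB of physical space, a 1024MB linear read or write) are assumptions for illustration; only the 2.12X ratio and the 45.6MB/s and 39MB/s rates come from the measurements above.

```sh
#!/bin/sh
# Back-of-the-envelope math for a compressed pool. Helper names are made up.

# effective_gb PHYSICAL_GB RATIO -> GB of logical data that fits
effective_gb() {
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.0f\n", p * r }'
}

# seconds_at MB RATE_MB_PER_S -> seconds one thread needs at that rate
seconds_at() {
    awk -v mb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", mb / rate }'
}

effective_gb 1000 2.12   # logical GB held by 1000GB of disk at 2.12X
seconds_at 1024 45.6     # single-threaded decompression of 1024MB
seconds_at 1024 39       # single-threaded compression of 1024MB
```

On a ZFS host the live ratio can be read with zfs get compressratio MYPOOL.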

Configuration

Configuring everything was relatively painless. I bounced a few questions off of Sun (imho, this is where Sun really shines – they listen to their customers and put technical people with real answers within arm’s reach) and read the Evil Tuning Guide to ZFS. In the end I really only ended up tweaking two things (plus setting compression to gzip-9):

  • I set the recordsize to match InnoDB’s – 16KB. zfs set recordsize=16K MYPOOL
  • I turned off file-level prefetching. See the Evil Tuning Guide. (I’m testing with this on, now, and so far it seems fine).
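Pulling those settings together, here is a dry-run sketch in shell. It only prints the commands unless APPLY=1 is set; the run and tune_pool helpers and the APPLY variable are hypothetical names, and the /etc/system line is the prefetch tunable as I understand it from the Evil Tuning Guide, so check it against the guide before using it.

```sh
#!/bin/sh
# Dry run by default: print each command. APPLY=1 executes it (ZFS host only).
# run, tune_pool and APPLY are hypothetical names for this sketch.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

tune_pool() {
    pool=$1
    run zfs set recordsize=16K "$pool"       # match InnoDB's 16KB pages
    run zfs set compression=gzip-9 "$pool"   # the level used in this post
    run zfs get recordsize,compression "$pool"
}

tune_pool MYPOOL

# File-level prefetch is a kernel tunable rather than a pool property.
# Assumed /etc/system entry (takes effect after a reboot):
echo "set zfs:zfs_prefetch_disable = 1"
```

The recordsize change only affects files written after it is set, so it is best applied before loading any data into the pool.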

I believe since ZFS is fully checksummed and transactional (so partial writes never occur) I can disable InnoDB’s doublewrite buffer. I haven’t been brave enough to do this yet, but I plan to. I like performance. :)

Performance

This box has been in production in our most important DB cluster for two weeks now. On the metrics I care about (replication lag, query performance, CPU utilization, etc) it’s pulling its fair share of the read load and keeping completely up on replication. Just eyeballing the stats (we haven’t had time to number crunch comparison stats, though we gave some to Sun that I’m hoping they crunch), I can’t tell a difference between this slave and any of the others in the cluster running Linux. I sure feel a lot better about the data integrity, though.

Why not [insert other OS here]?

We could have gone with Nexenta, FreeBSD, Mac OS X, or even *gulp* tried ZFS on FUSE/Linux. To be honest, Nexenta is the most interesting because it actually *is* the Solaris kernel plus Linux userland, exactly what I wanted. I’ve played with it a tiny bit, and plan to play with it more, but this is a mission-critical chunk of data we’re dealing with, so I need a company like Sun in my corner. I find myself wishing Sun had taken the Nexenta route (or offered support for it that I could buy or something). Instead, we’ll be buying software service & support from Sun for this and any other mission-critical OpenSolaris boxes.

FreeBSD also doesn’t have the support I need, Mac OS X wasn’t performant enough the last time I fiddled with it as a server, and most FUSE filesystems are slow so I didn’t even bother.

Gotchas

  • On my 64GB Linux boxes, I give InnoDB 54GB of buffer pool size. With otherwise exactly the same my.cnf settings, MySQL on OpenSolaris crashes with anything more than 40GB. 14GB, or 21.9% of my RAM, that I can’t seem to use effectively. Sun is looking into this, I’ll let you know if I find anything out.
  • For a Linux geek, OpenSolaris userland is still painful. Bear in mind that this is a single-purpose box, so all I really want to do is install and configure MySQL, then monitor the software and hardware. If this were a developer box, I would have already given up. OpenSolaris is still very early, so I’m still hopeful, but be prepared to invest some time. Some of my biggest peeves:
    • Common commands, like ‘ps’, have very different flags.
    • Some GNU bins are provided in /usr/gnu/bin – but a better ‘ps’ is missing, as is ‘top’ (no, ‘prstat’ is *not* the same!), ‘screen’, etc (Can anyone even use remote command-line Unix boxes without ‘screen’? If so, how?)
    • Packages are crazily named, making finding your stuff to install tough. Like instead of Apache being called ‘apache’ or ‘httpd’, it’s called ‘SUNWapch’. What?
    • After finally figuring out how to search for packages to get the names (‘pkg search -r Apache’ – which doesn’t provide pleasant results), I discovered that ‘top’ and ‘screen’ just simply aren’t provided (or they’re named even worse than I thought). Instead, I had to go to a 3rd party repository, BlastWave, to get them. And then, of course, the ‘top’ OpenSolaris package wouldn’t actually install and I had to manually break into the package and extract the binary. Ugh.

Whew! Big post, but there was a lot of ground to cover. I’m sure there are questions, so please post in the comments and I’ll try to do a follow-up. As I fiddle, tweak, and change things I’ll try to post updates, too – but no promises. :)

UPDATE: One other gotcha I forgot to mention. When MySQL (or, presumably, anything else running on the box) gets really busy, user interactivity evaporates on OpenSolaris. Just hitting enter or any other key at a bash prompt over SSH can take many seconds to register. I remember when Linux had these sort of issues in the past, but had blissfully forgotten about them.

UPDATE: I went more in depth on ZFS compression testing and blogged the results. Enjoy!

  1. Chris Ryland
    October 10, 2008 at 4:58 pm

    Have you guys ever seriously tried Postgres? Seems like it’s a much more performant “large system” database…

  2. October 10, 2008 at 5:28 pm

    Nice post Don – always keen to hear experiences of using OpenSolaris (especially what things you’re tripping up against) and thanks for your support. Our obvious goal is to avoid making it the frustrating experience it has been in the past. We should have ‘top’ and ‘screen’ both available for 2008.11 (from b100). Search should be drastically improved for 2008.11, but the package refactoring/naming won’t happen until 2009.04. Also interested in any userspace slowness metrics you have too!

  3. October 10, 2008 at 5:49 pm

    Thanks for all of the details and for the awesome mask pictures — http://cmac.smugmug.com/gallery/2504559#131481399_ZnZmK

    I think that compression in ZFS will work better than compression in the storage engine as InnoDB has in the 5.1 plugin because more of it can be done in the background by the kernel rather than by the thread executing a query.

    15 15k disks sounds nice. I wish I had such gear. Do you ever get write bound? Turning off the double write buffer can make a big deal when there is a lot of writing to be done. With it on you do: large (<= 2MB) sequential IO to the doublewrite buffer, sync, <= 128 random IOs to the database, sync.

    Do you run with buffered IO or O_DIRECT? Make MySQL give you a build with support for more InnoDB background IO threads and configurable background IO rate limiting. With a few changes you might be able to get much more out of the very nice system (ZFS + many fast disks + big RAM) you are running.

  4. October 10, 2008 at 5:54 pm

    @Chris Ryland: First of all, I tend to choose my open-source technologies by the size of their user community. So Postgres fell down there first. Second, the replication was just awful the last time I used it. Unusable, basically, which was a deal-killer. Replication (even with the major limitations MySQL currently has) is one of MySQL’s secret weapons.

  5. October 10, 2008 at 5:55 pm

    @Glynn Foster: Hey, great news! Thanks! I updated the blog post about userspace slowness because that was another gotcha I forgot to mention. It gets unbearably slow sometimes (I assume it’s hoarding all the slices for MySQL or something).

  6. October 10, 2008 at 6:04 pm

    @Mark Callaghan: Thank *you* for all the patches. Before Google stepped up to the plate, I was seriously losing hope in InnoDB.

    I agree, filesystem compression sounded like the saner choice to me as well, hence the experiment. So far it seems to really be paying off. I’d like to find out exactly how much latency it adds, but it doesn’t seem to really be human measurable, at least with this load on this system, which is good enough for me to leave it on and keep watching.

    We’re not really write limited currently, no, but I’ve found that having lots of fast disks can mask the concurrency problems with InnoDB in many cases. I assume that’s just because the writes are returning faster so the locks aren’t being held as long, but I don’t know for sure. You’d probably know better than I would. :) I didn’t mention it in the post, but late last night we added another 15 x 15K disks to each of the members in this cluster, so they’re actually all 30 now. Adding to the pool with ZFS was so insanely simple it boggled my mind. (We added the disks due to storage constraints, not I/O constraints)

    ZFS doesn’t support O_DIRECT, so this slave is using buffered IO. The Linux boxes in the cluster all use O_DIRECT and we’ve seen significant gains with it. A few people at Sun have said they’ve seen workloads on ZFS where using a much smaller InnoDB buffer pool and relying on ZFS’s disk cache resulted in performance increases, but that’s deviating even farther from a direct comparison to our Linux slaves, so I haven’t played with it yet. It’s on the list.

    And yes, I’m dying to get a MySQL build with all the great patches you guys have provided, preferably against 5.1.28. It’s on my todo list – I’ll build it myself if I have to, but keeping track of which trees have which patches is getting interesting. We went from a complete drought of patches to a flood in no time – thank goodness! :) Looks like you’re steadily updating the stuff on Launchpad, so I’ll check there first.

  7. October 10, 2008 at 6:36 pm

    @Don – I am thrilled that there are many outlets for patches now. I get to lobby Percona and MySQL and Drizzle to use bits of the Google patch and eventually end-users will benefit. It also helps that MySQL users like you document the problems. People at MySQL are much more aware of the problems because of this.

    With buffered IO there is not much benefit from more background IO threads for writes, but they still help for reads. With respect to shrinking the InnoDB buffer cache, up to half of it can be used for the insert buffer. On my workloads I think that I get at least a 5X reduction in writes for secondary index maintenance because of the insert buffer.

  8. Dan
    October 10, 2008 at 7:22 pm

    OpenSolaris is a win. But, even more than ZFS, dtrace is the reason in my book.

    And I agree that if Postgres replication were better, it would clearly be the way to go over MySQL.

  9. October 10, 2008 at 7:47 pm

    Hi!

    What was the nature of the “A little bit of code solved the problem and resulted in better efficiency” change that you made?

    It is good to see some real world interest in ZFS.

    Cheers,
    -Brian

  10. October 11, 2008 at 12:26 am

    Don,

    As a member of the team building the OpenSolaris packaging system (rogue’s gallery here: http://dp.smugmug.com/gallery/4882941_mCL2n#291253680_XLbpL ), and as a loyal and longtime smugmug customer (and advocate) I can say that we are working hard to improve the packaging system as fast as possible. And I think there is alignment between what you want, and what we are trying to deliver.

    As Glynn said, the 2008.11 release, while not yet perfect, is going to be better. Searches are now case-insensitive by default, as an example. The output of ‘pkg list -a’ will be more useful. Performance of the ‘pkg’ command should be better (there’s more to do, though). We’re more robust in the face of network problems. ‘pkg verify’ works a lot better. The depot server web pages at http://pkg.opensolaris.org should be more attractive and more useful pretty soon. And so forth. On the free software side, things are getting better too, although I’m similarly annoyed about the lack of ‘screen.’ I thought it had already been done, but apparently not. Time to go harass some people.

    While the os may provide you a ‘top’, prstat is better and you should use it. :) Besides being a pile more efficient than top, one especially nice feature in prstat is that process RSS calculations when considering aggregations of processes (like with -a, -Z, or -J) are actually accurate, accounting properly for sharing in the VM system. prstat -m is also amazingly useful because of the much finer grained detail about process events.

    Best wishes, and thanks for being a maverick. I just ordered some prints the other day, on lustre paper, and they came out awesome.

    -dp

  11. rb
    October 11, 2008 at 2:58 am

    you might want to check out Nexenta. it’s a GNU userland based off Ubuntu packages thrown on top of the OpenSolaris kernel. to me, it feels like kind of a bizarre hybrid, but I feel like Solaris sucks so hard, I’d rather use something a little wonky that’s close to Linux than just use something wonky.

    there’s always FBSD too, although it lacks the Zones that the Solaris kernel provides.

  12. October 11, 2008 at 4:17 am

    Since you (currently) want to give a lot of the system’s memory to MySQL, have you considered limiting the size of ZFS’s ARC cache? The Evil Tuning Guide talks about this (including an arcstats script which can be helpful to understand how useful the ARC is being for you).
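As a rough sketch of that suggestion in shell: the ARC cap is an /etc/system tunable given in bytes, so a small helper can do the conversion. The helper name arc_max_line and the 8GB figure are assumptions for illustration, not a recommendation for this box.

```sh
#!/bin/sh
# Print an /etc/system line capping the ZFS ARC at a given number of GB.
# arc_max_line is a hypothetical helper; the tunable takes a byte count.
arc_max_line() {
    bytes=$(( $1 * 1024 * 1024 * 1024 ))
    printf 'set zfs:zfs_arc_max = 0x%x\n' "$bytes"
}

# Example only: an 8GB cap, leaving the rest for the InnoDB buffer pool.
arc_max_line 8
```

Like other /etc/system entries, the line takes effect at the next reboot.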

  13. October 11, 2008 at 6:55 am

    Once you start using dtrace with serious intent, you will never wish for the Linux userspace again. Dtrace is the wind beneath my wings.

  14. October 11, 2008 at 6:58 am

    Excellent and well written post Don.
    I saved it to show for some friends. ;)

  15. October 11, 2008 at 7:59 am

    I would suggest the following:

    1. check what your default scheduler is with “dispadmin -l” and possibly consider using the Fair Share Scheduler – should let interactivity get back to normal without interfering with MySQL.

    2. Alternatively try using gzip on its regular setting instead of with -9; should see almost the same efficiency on disk space, while CPU will be much less. Are the disks busy when you see the issue with interactive response?

    3. For better visibility on your disks, use “iostat -xnc 5” which will give you disk stats every 5 seconds. Look at the average service time and percentage of disk busy.

    4. There may be some system tunables to tweak – possibly the reason you have 14GB unavailable is due to one of the following: max size of shared memory segments, disk cache reserves a set amount of RAM, kernel-related space taken for mapping ZFS and various devices.

  16. Jim Zemlin
    October 11, 2008 at 8:02 am

    We told you ZFS didn’t matter, it’s just a feature, you have to listen, or we’re going to kick you out of the community. On second thought, if you use ZFS, the Linux Foundation will have to sue you.

  17. October 11, 2008 at 9:44 am

    Hi Don, good post!
    I’ve had some pleasant experiences with ZFS as well, it’s a very decent fs. I’m however keeping a very close eye on btrfs as well since it’s got most of the aspects I like with ZFS and it’s got some very good ideas behind it. It’ll be mighty interesting to battle that out against other filesystems.

  18. Logan Shaw
    October 11, 2008 at 10:10 am

    Very interesting article. I do have one piece of advice, though. For the love of god, reconsider that “gzip-9” choice! There’s nothing wrong with using the gzip algorithm, but the “-9” is not even close to the sweet spot in the trade-off between compression ratio and CPU usage. “-9” will only get you something like 0.5% better compression than “-7”, and it will use more than double the CPU time.

    Here’s a quick little test you can do as an illustration of what I mean. It uses the regular gzip command line program, but it’s the same algorithm ZFS will be using. If I’m lucky, the formatting won’t get too mangled:

    #! /bin/sh

    cd /tmp || exit 1

    # create a test file to compress
    ( cd /etc && tar cf - . ) > etc.tar

    # try all 9 levels
    for level in 1 2 3 4 5 6 7 8 9
    do
    echo level $level
    time gzip -v -$level /dev/null
    done

    rm etc.tar

    You should see that after you get past about level 4, the compression ratio only improves a little bit, but the CPU usage really goes through the roof.

    Disk space is just not so expensive that this is worth it, even on modern systems with powerful CPUs. I’d set it to gzip-4 or gzip-3 and try that.

  19. Logan Shaw
    October 11, 2008 at 10:12 am

    WordPress ate my redirection characters and more on the gzip line. The script should be this (hopefully right this time):

    #! /bin/sh

    cd /tmp || exit 1

    # create a test file to compress
    ( cd /etc && tar cf - . ) > etc.tar

    # try all 9 levels
    for level in 1 2 3 4 5 6 7 8 9
    do
        echo level $level
        time gzip -v -$level < etc.tar > /dev/null
    done

    rm etc.tar
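A companion sketch in the same spirit: instead of timing, this prints the compressed size at each level so the diminishing returns show up as byte counts. The function names are hypothetical, and it assumes a gzip binary on the PATH.

```sh
#!/bin/sh
# Print the compressed size of a file at each gzip level.
# gzip_size and report_levels are hypothetical helper names.

# gzip_size LEVEL FILE -> compressed size in bytes
gzip_size() {
    gzip -c "-$1" < "$2" | wc -c | tr -d ' '
}

# report_levels FILE -> one "level N: BYTES bytes" line per level
report_levels() {
    for level in 1 2 3 4 5 6 7 8 9
    do
        echo "level $level: $(gzip_size "$level" "$1") bytes"
    done
}

# Only runs when a file is given, e.g.: sh gzip-levels.sh /tmp/etc.tar
if [ "$#" -ge 1 ]; then
    report_levels "$1"
fi
```

Reading the sizes next to the timings from the script above gives both sides of the trade-off for the same test file.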

  20. October 11, 2008 at 10:50 am

    Great article and nice job leading the way on this. ZFS & Dtrace are definitely where we need to go for our database servers. Plan is Postgres on Nexenta but we may need to go with plain OpenSolaris for the same reasons you suggested. Glad to hear ZFS is giving you the performance you’re needing and hope you’ll continue to post. Hopefully I’ll have some good news in a few weeks.

  21. October 11, 2008 at 11:35 am

    Solaris is the shizz. Don, thanks for opening up so much information. It’s really helpful. I haven’t used Solaris for so many years now, and it kicks linux so hard. I want dtrace :(

  22. October 11, 2008 at 12:26 pm

    @Brian Aker:

    The “little bit of code” was simply removing an ORDER BY on the affected SELECTs and instead sorting on the client – something I prefer to do whenever possible anyway. 5.1 was choosing the wrong (imho) index to use with the ORDER BY, but was fine without it.

  23. October 11, 2008 at 12:43 pm

    @D. Price:

    Awesome! So glad you guys are working on the problem. :)

    It’s not that I’m not fond of ‘prstat’ – I am. I have a ‘prstat’ running constantly in one of my other screens, too (along with mpstat, iostat, and vmstat) But I really really like having a single screen I can quickly glance at and see CPU stats, RAM stats, etc. On that particular screen, I’m not nearly as interested in process stats other than a really quick overview. I view ‘top’ as a really handy quick overview. Then if something is amiss or I don’t understand something, I use the other more detailed tools. Make sense?

  24. October 11, 2008 at 12:45 pm

    @William Hathaway:

    Yep, playing around with limiting the ARC is on the todo list. Thanks! :)

  25. October 11, 2008 at 12:50 pm

    @Logan Shaw:

    Yep, testing the various gzip levels (and the other possible ZFS compression options) is on the todo list as well. I just wanted to do something quick & dirty for now.

    FYI, even with gzip-9, we usually have tons of free CPU available, so I don’t believe it’s impacting system performance significantly.

  26. October 11, 2008 at 2:31 pm

    Don, I’ve been running ZFS in production for almost 2 years now. I recommend LZJB over any of the GZIPs. You’ll find the compression is still roughly 2x, but you’ll get back some CPU, which will be useful while running scrubs, which are extremely CPU intensive. I’ve tried pretty much every setting of compression and there usually isn’t a significant difference between LZJB and GZIP for *most* (but not all) data types. It’s usually within 10%.

  27. jd
    October 11, 2008 at 9:37 pm

    Sounds like OpenSolaris hasn’t come very far along then. The goal of supplying gnu binaries and a real linux-alike environment is spot-on, what’s taking so bloody long to get it right? I guess my shop will be stuck doing the tedious “installing 101 sunfreeware packages to make Solaris 10 even remotely usable, then another 101 to have a reasonable dev/build environment” routine after each new rollout for a long time to come. War and Peace is probably shorter than our tasklist for this :(

  28. October 12, 2008 at 1:08 am

    1. Your SSH issue looks strange … I haven’t had this problem so far on my MySQL systems.
    2. Did you try /usr/ucb/bin/ps?

  29. October 12, 2008 at 8:31 am

    Don. Excellent article. I am a Sr. Database Administrator but almost exclusively Oracle and SQL Server. I have been waiting with bated breath for Oracle to release their latest version Oracle 11gR1 on the Sun Solaris x86_64 (64-bit) Version 10. From their certification site, it shows the status as “pending”. Again, fantastic article!

  30. October 12, 2008 at 8:49 am

    @Joerg M:

    $ ls /usr/ucb/bin
    /usr/ucb/bin: No such file or directory

    No dice. Any other ideas?

  31. Jason
    October 12, 2008 at 11:15 pm

    That should be /usr/ucb/ps (no /bin).

    One big difference between Solaris and Linux is that Solaris supports a lot of different standards for commands. In some cases these might conflict. The solution is that the different versions of these utilities reside in separate directories, and you chose $PATH to give the behavior you desire (GNU utilities in /usr/gnu/bin, various posix revisions in /usr/xpg4 and /usr/xpg5, etc.). In this instance, being more used to the BSD style ps than SysV ps, you want the ps in /usr/ucb (you’ll see other utilities that behave like BSD versions in there as well).
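In shell terms, that choice might look like the sketch below. prepend_path is a hypothetical helper; the directories are the ones named above, and it quietly skips any that do not exist on the box.

```sh
#!/bin/sh
# prepend_path DIR: put DIR first in PATH if it exists and is not listed yet.
# prepend_path is a hypothetical helper name.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                      # already present, leave PATH alone
        *)
            if [ -d "$1" ]; then
                PATH="$1:$PATH"
            fi
            ;;
    esac
}

prepend_path /usr/ucb       # BSD-style ps and friends
prepend_path /usr/gnu/bin   # GNU tools first, where they exist
export PATH

# Show which ps now comes first (or say so if there is none).
command -v ps || echo "ps not found"
```

Because the last directory prepended wins, the order of the calls is the order of preference in reverse.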

  32. October 13, 2008 at 1:00 am

    Great post and very interesting! Small nit about the OpenSolaris userland comments.. I’ve been an avid Linux user/developer/admin for quite a few years and made the switch like you.. While I must say it takes some getting used to the spartan approach to bringing your own shell rc scripts, vimrc and adjusting paths it’s really quite trivial if you know what you’re doing and don’t mind spending an extra minute here and there to take a look at the rosetta stone for these things. Just a thought since OpenSolaris makes a *great* developer box as well, but just wait until IPS breaks something when you reboot ;)

  33. October 13, 2008 at 10:12 am

    @codestr0m:

    See, you nailed it right on the head. Time is absolutely our very most precious commodity (I feel like our business is years behind where we really should be right now), so a minute here or a minute there (multiplied by the # of developers) is actually a serious problem.

    Our goal with our systems is to automate them so much that they don’t suck up *any* time, and so natural for our developers that they can code without mucking around with system settings.

  34. October 13, 2008 at 2:56 pm

    Hi Don, great article. We use OpenSolaris at Joyent as the basis for our Cloud computing IaaS. So, it’s really cool to see this article. We pretty much replaced the packaging system with pkgsrc. The results have been good and now we get comments more like, “OK, so as of right now I have pretty much copied over my environment. The new template is so fully configured that pretty much everything that had been causing me issues is now just installed by default.” To do this we maintain a fairly large pkgsrc repo for our clients. Also we’ve heavily modified the userland. We find that these changes we’ve made along the way have made it a lot easier and usable for people when moving to Solaris. We have a long way to go still but the power of DTrace, SMF, ZFS, stability, and scalability have been great.

  35. October 13, 2008 at 6:33 pm

    There are more compression details in my follow-up: http://blogs.smugmug.com/don/2008/10/13/zfs-mysqlinnodb-compression-update/

  36. October 13, 2008 at 6:38 pm

    Did I say something wrong? Or was my comment not substantial enough to include?

    Thanks, I will have something more substantive after I have time to go back and review some of my configs. I enjoy reading your blog and share your pain seeing the stupidity of Canon USA – ouch.

  37. October 13, 2008 at 6:41 pm

    @Ann E. Mouse:

    Sorry, but I don’t see a comment from you – either posted or in the moderation queue. Would you mind re-posting it? Thanks!

  38. October 13, 2008 at 6:52 pm

    Don, thanks, sorry I thought for sure it went through but got rejected. Can you delete these two and I’ll repost.

  39. Jamie
    October 13, 2008 at 10:04 pm

    So you got Solaris working with a md3000, with or without multipathing? Which HBAs? With AVT or without? I’d love to see some more details on that setup.

  40. October 13, 2008 at 10:31 pm

    @Jamie:

    You know, it didn’t even occur to me that it might not work. Haha. We’re just using an LSI SAS HBA. No multipathing. Dunno what AVT even is (or just don’t recognize the acronym). We just plugged it in and turned it on and it worked. *shrug*

    I’m so used to modern OSes just doing the right thing with device drivers I didn’t even bother to check compatibility first.

  41. UX-admin
    October 13, 2008 at 11:52 pm

    “For a Linux geek, OpenSolaris userland is still painful.”

    Of course it is, since you don’t know / understand System V yet. Learn it. Use it. Love it!

  42. UX-admin
    October 13, 2008 at 11:57 pm

    “$ ls /usr/ucb/bin
    /usr/ucb/bin: No such file or directory

    No dice. Any other ideas?”

    /usr/ucb is SunOS 4.x BSD compatibility. Other than being there for really old BSD scripts from the SunOS 4.x days, it has no business being on a modern System V system. That’s why you don’t have it. And most knowledgeable Solaris sysadmins (and ALL system engineers) will purposely leave it out when rolling their own Solaris builds.

    • David Halko
      January 6, 2009 at 9:44 am

      When running as "root", the /usr/ucb/ps is nice since it will give you VERY LONG command lines, that are (unfortunately) unavailable using the /usr/bin/ps

      $ ps -eo user,pid,stime,args | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
      root 16599 Dec_31 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manage

      $ /usr/ucb/ps -auxww | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
      root 16599 0.5 1.4462144427136 ? S Dec 31 522:01 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manage

      # /usr/ucb/ps -auxww | nawk '!/nawk/ && /ivserver/ && Count<1 { Count+=1 ; print }'
      root 16599 0.5 1.3464640417320 ? S Dec 31 522:02 /opt/InfoVista/Essentials/bin/ivserver -m /opt/InfoVista/Essentials/data/manager_iv2.db -c /opt/InfoVista/Essentials/data/collector_iv2.db -print /opt/InfoVista/Essentials/log/collector_iv2.log -ep 42119 -lrip 0.0.0.0 -lrport 0 -rc /opt/InfoVista/Essentials/init/InfoVista_iv2

      I have used the ucb version of ps since the Solaris 1.x days – I wish Solaris 2.x would fix this so as to allow the super-wide output available under the ucb version!

      Until Solaris supports an equivalent of "ww" on the output line, there will always be a need for /usr/ucb/ps

  43. October 14, 2008 at 12:19 am

    @UX-admin:

    I don’t have the time to learn SysV. I have a business to run and millions of users to support – and Linux dominates now. Solaris needs to adapt or it won’t survive – not the other way around. It’s sad but true.

    • David Halko
      January 6, 2009 at 9:48 am

      I sympathize greatly.

      Having made the personal migration from BSD to SVR4 when moving to another vendor's UNIX platform, moving back to Solaris has been VERY nice, since I have the option of enjoying the best of both (BSD & SVR4) worlds now.

  44. Jamie
    October 14, 2008 at 12:48 am

    OK cause I’ve been trying (and failing) to make Solaris x86 10 update 5 work with a SAS-attached MD3000 including multipathing and failover, and haven’t had any luck without enabling Automatic Volume Transfer, which I’m reluctant to rely on (groundless suspicion on my part really). It appears that the necessary bits (particularly /kernel/misc/scsi_vhci/scsi_vhci_f_asym_lsi) are only part of more recent vintages (OpenSolaris has it from the looks of things).

  45. Alex
    October 14, 2008 at 2:58 am

    I’ve been interested in ZFS and OpenSolaris for quite a while. Whilst some of the things it offers are great, most of your post seems to confirm one thing – it’s a bit of a hassle.

    Are the benefits of ZFS worth it? Is there a decent performance increase to justify it all? At the moment it all sounds like you’re doing a lot of beta testing with OpenSolaris!

  46. Greg
    October 14, 2008 at 12:32 pm

    As Jason suggests above, I just alias /usr/ucb/ps to ps.
    i.e., in .bashrc:
    alias ps='/usr/ucb/ps -auxwww'

  47. marty duffy
    October 15, 2008 at 12:17 pm

    On the problem with getting the package names, I believe there is a bundle, “amp-dev”, that will load Apache2, MySQL, and PHP on OpenSolaris with the command “pkg install amp-dev”.

  48. Unix Emaxer
    October 15, 2008 at 4:55 pm

    JD,

    It seems you don’t have your Solaris build environment put together very well. We use a lot of
    FOSS tools here at work, 100s if not more and all are installed painlessly and effortlessly
    thanks to a well put together jumpstart environment.

    If you are still building Solaris systems with CDs or a basic Jumpstart server, you’re not much of
    an SA, or if you’re not in charge, the guys building systems for you are doing a piss poor job if you
    have to trudge through 100s of FOSS packages each time you build a Solaris system.

    ’nuff said….

    UE

  49. October 22, 2008 at 12:24 pm

    Don

    Thanks for this. Not to channel Hillary Clinton, but ‘it takes a community’! And we’re all going to get ZFS into the hands of millions of users thanks to useful commentary like this that adds to the critical mass of our community.

    ALSO, in one of your responses to comments you say ‘time is our most precious commodity.’ Amen. I’m biased as making ZFS easy to use is our #1 focus (and storage for VMware is next). But I really think tech projects and companies would be much better off if we spent a little more time on ‘it just works’ vs. ‘you can get it to do really cool things if you just spend a day or two on it.’

    In your post you say nice stuff about the NexentaOS and suggest that absent support it isn’t a viable alternative for you. We’re discussing partnering with serious support organizations to deliver support to NexentaOS (nexenta.org) users and hope to be able to announce something by the end of the year. Folks can contact me at evan at nexenta.com if they’d like to discuss.

  50. October 24, 2008 at 10:57 am

    Put /usr/ucb early in your path and all your usual BSD stuff will work.

    export PATH=/usr/ucb:$PATH

  51. Kenneth Lareau
    October 27, 2008 at 12:10 pm

    Just a note, as people keep saying ‘add /usr/ucb to your path’ – the default OpenSolaris install does NOT include most of the files in /usr/ucb (I know, as I did a fresh install from b98 just a short while ago); you need to do ‘pkg install SUNWscp’ and then you should find some of the common BSD-based versions of various programs there (including ‘ps’).
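
    Since /usr/ucb may or may not be present depending on whether SUNWscp is installed, the PATH suggestions above can be made conditional. What follows is only a minimal sketch for a .bashrc or .profile, not anything from the posters themselves; prepend_path_if_dir is a hypothetical helper name, not a Solaris utility.

```shell
# Hypothetical helper (not part of Solaris): prepend a directory to PATH
# only if it exists and is not already listed.
prepend_path_if_dir() {
    dir=$1
    [ -d "$dir" ] || return 1      # e.g. /usr/ucb is absent on a default install
    case ":$PATH:" in
        *":$dir:"*) ;;             # already present, leave PATH unchanged
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}

# Put the BSD tools first when they are installed; otherwise just say so.
prepend_path_if_dir /usr/ucb || echo "/usr/ucb not found; BSD tools not installed" >&2
```

    With this in place the same startup file works on boxes with and without the BSD compatibility package, and sourcing it twice does not duplicate the PATH entry.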

  52. October 28, 2008 at 7:50 am

    Very informative.
    I am deeply glad that Sun purchased MySQL. Just wait a bit until we get 5.1; the road will be paved with better, enterprise-ready software. MySQL isn’t yet ready to compete with Oracle.

    I would like to see more VPS offers using ZFS and OpenSolaris.

  53. Blake
    November 7, 2008 at 4:58 pm

    ALCON:

    I have a ZFS question, perhaps someone can help me out.

    I installed Solaris 10 on a SPARC-based machine onto a ZFS root pool (single drive) of 136GB.

    I have 3 other drives that total 136GB also, and I’d like to now set up a mirror of the root ZFS partition, using the 3 drives totaling 136GB to mirror the one 136GB HD. Any ideas?

    Blake

  54. November 22, 2008 at 5:27 am

    Running a production with MySQL and ZFS ? Of course !

    We have been running MySQL 5.1.26-rc (!) successfully for 6 months now on a production (!) machine of type
    SPARC Enterprise T5220. The company has 5000 people and is among the top 5 of Germany’s IT service providers.

    The JSP application (SGMJ, consisting of 2 Glassfish services, two MySQL database services) is used by approximately 100 Users every day and is mission critical. All services are integrated with smf.

    The app is running in a Solaris Zone, so it is totally isolated from currently 20 other heavily used virtual OS environments on this server. The performance of this web app is overwhelming.

    All this has been built at zero cost for licences. And even better: the whole application is using only 0.1% to 0.2% of the available CPU resources of the machine. So we get all this at nearly no cost. I doubt any other architecture could compete in these 2 categories (cost-effectiveness and performance).

    In other words, we can expect that even 1000 of these applications on the same server would perform reasonably well.
    Unfortunately our customers need fewer than 1000 web applications. Currently only around 300. So using the above architecture we could expect to be using a third of this server for all of these ;-)

    Instead, however, these other 300 applications are currently running on 600 or more Linux machines using JBoss and WebLogic, plus another few hundred for Oracle DBs and so on, costing a fortune of several million dollars a year for energy, cooling and space. All this is due to the lack of supported virtualisation (like Zones) and of ZFS on Linux, and due to the once modern (now old) idea of running all and everything on Intel servers.

    Cheers
    Karsten

  55. December 1, 2008 at 8:29 am

    I like your posts; they make you think :)

  56. Jan Holtzhausen
    December 3, 2008 at 9:12 am

    FYI
    Unscientific, stickit in excel and lookit graph indicates that -3 and -7 are your sweet spots

  57. December 4, 2008 at 5:41 am

    open sol is nice. real nice. but it uses too much memory.

    • David Halko
      January 6, 2009 at 9:54 am

      No joke!

      I want OpenSolaris (with X) on an appliance with 128 Meg of RAM (no, I can't swap memory chips!)… like the original release of Solaris 10.

  58. Eric
    December 4, 2008 at 5:48 am

    The fact that you require the GNU utilities shows that you are just expecting everything to be nice and the way you like it, but in fact, every single part of Solaris/SunOS/OpenSolaris is far, far better than anything GNU has ever put out.

    The GNU people have been trying to make a bad clone of Solaris for decades.

  59. December 4, 2008 at 9:46 am

    USB modem problems on OpenSolaris. Does anyone know anything about that? Thanks.

  60. kevin
    December 4, 2008 at 10:32 am

    It's all jabberwocky to me, yet I've been in IT for twenty years.

  61. Tim
    December 18, 2008 at 11:03 am

    As a Linux user, you may not like the Solaris userspace commands, but we Solaris users loathe the GNU ones.

  62. Andre
    January 3, 2009 at 2:56 am

    "Dell MD3000 w/15 x 15K SCSI disks and mirrored 512MB battery-backed write caches (these are really starting to piss us off, but that’s another post…)"

    Any brief comments about the issues you were having with those MD3000s please? Thanks in advance!

  63. Eddy
    February 18, 2009 at 11:04 pm

    I'd also like to hear more about the issues with the MD3000.

  64. Jon
    August 7, 2009 at 9:30 pm

    That is a pretty awesome setup. I thought that ext4 and LVM 2 were flexible, but it sounds like they're nothing compared to ZFS. I use RAID for personal use, mostly just backups and media storage. Just wondering how easy it is to set up a ZFS solution for personal use? If anybody is interested, I have a tutorial showing how to set up RAID 0, 1, 5, 6 or 10 on Linux with a GUI. This is a nice tutorial for novices to the open source world.

  65. August 20, 2009 at 1:14 pm

    mmm…thanks..

  66. September 4, 2009 at 7:46 am

    Cool. Is it possible to launch an EC2 AMI of OpenSolaris with MySQL master + slave?

  67. November 14, 2009 at 3:36 pm

    Canon FTL. I'll never buy another one of their cameras. This seals it. Peace.
