MySQL and the Linux swap problem

May 1, 2008 Don MacAskill

Ever since Peter over at Percona wrote about MySQL and swap, I’ve been meaning to write this post. But after I saw Dathan Pattishall’s post on the subject, I knew I’d better actually do it. 🙂

There’s a nasty problem with Linux 2.6 even when you have a ton of RAM. No matter what you do, including setting /proc/sys/vm/swappiness = 0, your OS is going to prefer swapping stuff out rather than freeing up system cache. On a single-use machine, where the application is better at utilizing RAM than the system is, this is incredibly stupid. Our MySQL boxes are a perfect example – they run only MySQL and we want InnoDB to have a lot of RAM (32-64GB … and we’re testing 128GB).
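
If you want to check whether your own boxes are doing the same thing, the kernel exposes the relevant numbers under /proc. Here's a rough sketch, not something lifted from our production scripts: the function names and the optional file arguments are made up for illustration. It prints the current swappiness and the cumulative swap-in/swap-out page counters from /proc/vmstat. If pswpout keeps climbing between two readings while the box still has gigabytes of filesystem cache, you're probably in the same boat.

```shell
#!/bin/sh
# Print the cumulative number of pages swapped in and out since boot.
# The optional argument is a hypothetical override so the function can be
# pointed at a saved copy of /proc/vmstat instead of the live file.
swap_counters() {
    vmstat_file="${1:-/proc/vmstat}"
    awk '$1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }' "$vmstat_file"
}

# Print the current vm.swappiness value.
# We run with 0 here, and the kernel still swaps.
show_swappiness() {
    cat "${1:-/proc/sys/vm/swappiness}"
}

# Usage:
#   show_swappiness
#   swap_counters
#   sleep 60; swap_counters   # compare the two readings
```

The counters only ever go up, so it's the difference between two readings that matters, not the absolute values.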

You can’t just not have any swap partitions, though, or kswapd will literally dominate one of your CPU cores doing who-knows-what. But you can’t have it swapping to disk, or your performance goes into the toilet. So what to do?

Our solution is to make swap partitions out of RAM disks. Yes, I realize how insane that sounds, but the Linux kernel’s insanity drove us to it. Best part? It works. Here’s how:

mkdir /mnt/ram0
mkfs.ext3 -m 0 /dev/ram0
mount /dev/ram0 /mnt/ram0
dd bs=1024 count=14634 if=/dev/zero of=/mnt/ram0/swapfile
mkswap /mnt/ram0/swapfile
swapon /mnt/ram0/swapfile

That’ll give you 14MB of swap that’s actually in RAM, so it’s super-fast. This assumes your kernel is creating 16MB ramdisks, but you can adjust your kernel parameters and/or the ‘dd’ line above to suit whatever size you want.

We’ve found that anywhere from 20MB-40MB tends to be enough (so use /dev/ram1, /dev/ram2, etc), depending on load of the box. kswapd no longer uses any noticeable CPU, there’s always a few MB of free “swap”, and life is back in the fast lane. Just add those lines to your relevant startup file, like /etc/rc.d/rc.local, and it’ll persist after reboots.
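
If you need more than one ramdisk to get to 20MB-40MB, the same steps can be wrapped in a loop. This is a minimal sketch under a few assumptions: the function name and the RUN switch are invented for this example, and it assumes /dev/ram0, /dev/ram1, etc. already exist at the default size. By default it only prints the commands so you can eyeball them first.

```shell
#!/bin/sh
# Sketch: turn the first N ramdisks into small swap files.
# RUN is a hypothetical switch: left at its default of "echo" the commands
# are only printed; set RUN="" (as root) to actually execute them.
RUN="${RUN-echo}"

setup_ram_swap() {
    count="$1"            # number of ramdisks to use (/dev/ram0 .. /dev/ramN-1)
    blocks="${2:-14634}"  # 1KB blocks per swap file, same as the dd line above
    i=0
    while [ "$i" -lt "$count" ]; do
        dev="/dev/ram$i"
        mnt="/mnt/ram$i"
        $RUN mkdir -p "$mnt"
        $RUN mkfs.ext3 -m 0 "$dev"
        $RUN mount "$dev" "$mnt"
        $RUN dd bs=1024 count="$blocks" if=/dev/zero of="$mnt/swapfile"
        $RUN mkswap "$mnt/swapfile"
        $RUN swapon "$mnt/swapfile"
        i=$((i + 1))
    done
}

# Example: two ramdisks, roughly 28MB of "swap" in total.
#   setup_ram_swap 2
```

Once it's really running, 'swapon -s' or /proc/swaps should list one swap file per ramdisk.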

Some Linux purists will probably hate this approach, others may have more efficient ways of achieving the same thing, but this works for us. Give it a shot. 🙂

Oh, and I hope it goes without saying, but make *darn* sure you know what you’re running on your box and what the maximum RAM footprint will be before you try running with only 20-40MB of swap. We’ve never OOMed (Out-Of-Memory) a production MySQL box – but that’s because we’re careful.
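
One way to sanity-check that footprint is to compare mysqld's peak resident size against physical RAM. The sketch below is only an illustration: the function names and the margin argument are made up, and it assumes your kernel reports VmHWM (peak resident set size) in /proc/<pid>/status. Keep in mind it only tells you what the process has used so far, not what it might use under a heavier load.

```shell
#!/bin/sh
# Print one field (in kB) from a /proc/<pid>/status or /proc/meminfo style
# file, where lines look like "VmHWM:   123456 kB".
proc_mem_kb() {
    field="$1"
    status_file="$2"
    awk -v f="$field:" '$1 == f { print $2 }' "$status_file"
}

# Succeed if the peak resident size plus a safety margin fits in physical RAM.
# margin_kb is a made-up knob: pick whatever headroom you are comfortable with.
fits_in_ram() {
    status_file="$1"
    meminfo_file="$2"
    margin_kb="$3"
    peak=$(proc_mem_kb VmHWM "$status_file")
    total=$(proc_mem_kb MemTotal "$meminfo_file")
    [ $((peak + margin_kb)) -le "$total" ]
}

# Usage (assumes a single mysqld process; 1048576 kB = 1GB of headroom):
#   fits_in_ram "/proc/$(pidof mysqld)/status" /proc/meminfo 1048576 \
#       && echo "ok" || echo "too tight for 20-40MB of swap"
```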

UPDATE: See what happens when I wait to blog? I forget that I read another related post over on Kevin Burton’s blog. Like Kevin, we’re using O_DIRECT, but unlike Kevin, this doesn’t solve the problem for us. Linux still swaps. We use the latest 2.6.18-53.1.14.el5 kernel from CentOS 5, btw. (Sorry, had posted 2.6.9 because I was dumb. We’re fully patched)

Categories: datacenter, MySQL Tags: Innodb, Linux, memory, MySQL, OOM, percona, RAM, swap

Comments (37)

Peter Zaitsev

May 1, 2008 at 9:06 pm

Don,

Thanks for posting. I think you mentioned this to me and I think this is an interesting idea.
The interesting thing about this swap problem is that it seems to be distribution and workload specific – I constantly run into people having no swap and running just fine while it breaks for others.

BTW – CentOS 5.x has 2.6.18.x kernels – 2.6.9 were CentOS 4.x so make sure you really upgraded 🙂
cabbey

May 1, 2008 at 9:42 pm

Man it hurts to see “current” and “2.6.9” in the same sentence.

If you’re not actively swapping, then the location of swap in memory or on disk won’t matter once you’re up and running at a steady state. If so you can reclaim that memory by moving the swap file to disk. Might also be able to use tmpfs instead of ext3 to save some overhead there… or even skip the fs entirely and just mkswap on /dev/ram/0. 🙂
Don MacAskill

May 1, 2008 at 9:56 pm

@Peter Zaitsev: Woops. We’re on 2.6.18-53.1.14.el5 on our CentOS 5 boxes, not 2.6.9. I did ‘uname -a’ in the wrong window 🙂

@cabbey: Well, the crazy thing is that Linux *was* actively swapping, even though there was no reason to (>2GB in filesystem cache, no other apps running). And the swapping *was* degrading performance – badly. Stupid.
Kevin Burton

May 2, 2008 at 12:20 am

Yes. This drives me CRAZY. It’s 2008 and we don’t even have an OS that swaps correctly.

The sheer stupidity of this makes me want to implode.

Anyway.

At the MySQL conf someone suggested mounting a SMALL swap partition (which I didn’t think of before).

This should fix the CPU issue with kswapd, but I haven’t tried it.

It’s an interesting idea though. Just create a 10MB swap file…

Also, it turns out O_DIRECT only helped alleviate the problem. It still persists …

O_DIRECT bypasses the Linux page cache so there’s no vm pressure on reads.

The problem is that the binary logs aren’t opened with O_DIRECT so those reads put pressure on the page cache.

If only I could tell Linux to NOT cache these files or NOT cache at all….

BTW… this was a setting on RHEL when they had a 2.4 kernel. You could allocate a percentage of memory for the page cache.

Kevin
Robert Milkowski

May 2, 2008 at 4:57 am

Well, try MySQL on Solaris 10 x86 and the problem is gone…
Better, download Sun Cluster for free and your MySQL under a cluster – the HW is already there.
Nils

May 2, 2008 at 5:27 am

On the more insane side, you could also use video card memory as swap…

http://hedera.linuxnews.pl/_news/2002/09/03/_long/1445.html
Tim

May 2, 2008 at 6:16 am

Call me an old skewl baller, but I just don’t like disabling swap, particularly if this is simply due to poor behavior of the kernel. I would say, fix the kernel, don’t hack up your system. Swap is there for a reason and while, with tons of RAM, it is less of an issue, I always prefer to have some swap. It saved many a customer server from corrupt tables or other nasties…
Harrison

May 2, 2008 at 6:54 am

Have you tried using large pages for InnoDB to force the OS to keep it in memory? It is a bit of a pain to setup, but the OS can’t swap it out.

Of course, it might still swap out other bits of memory that MySQL is using, but the thread buffers are normally too short-lived to get swapped out.
Guillaume Lefranc

May 2, 2008 at 7:05 am

I’m OK with Linux swap at the moment (using Ubuntu 6.06 64-bit and kernel 2.6.20), but it looks like the MySQL server tends to allocate more memory than it should over the course of time, i.e. I’ve defined InnoDB buffer pool = 13G, but after ~4 months of uptime the process size has risen to 15.5G, leading to some memory pages being swapped out. This is not so annoying, since the swapped pages do not seem to ever be accessed at all.
PJ

May 2, 2008 at 7:35 am

2.6.18 is current? I’m pretty sure that 2.6.25 was released a week or two ago. 2.6.18 was released… September of 2006. That’s 4 months shy of TWO YEARS ago. Have you tried to see if this bad behavior has been fixed in a more recent kernel?
Nelson Castillo

May 2, 2008 at 8:01 am

It is not that insane. I recently learned about this project that provides a compressed
ramdisk for swap. It is meant for small machines.

http://code.google.com/p/compcache/

Perhaps you should also try a more recent kernel.
mofino

May 2, 2008 at 8:34 am

Are you sure you’re not an idiot? Are you positive this is the behaviour you are witnessing? Do you have a full understanding of how Linux handles memory? I’d say your answer is no for all.
Don MacAskill

May 2, 2008 at 8:57 am

@Tim: Alas, I’m not capable of “fixing” the kernel, and neither is anyone else at SmugMug, so that’s not an option.

@Harrison: I’ve heard too many horror stories about large pages to give it a whirl, especially when this solution is so simple. 🙂

@Guillaume Lefranc: I have production MySQL systems up for longer than a year without constant memory growth, so I’m not sure what you’re seeing – but we don’t seem to be seeing it. We do see a few GB over the InnoDB buffer pool, probably for other things like thread caches and whatnot, but we just factor that into the math. Works for us. *shrug*

@PJ: We’re not running 2.6.18, we’re running 2.6.18-53.1.14.el5. There’s a big difference. First of all, it’s a tested, Enterprise kernel release, second, it’s had more than 53 revisions to fix/enhance/backport stuff. There’s no way on earth I’m gonna run a bleeding edge kernel like 2.6.25 in production at my DB layer. That being said, I’ve heard from others who are more brave that it is not fixed, and further, that Linus doesn’t consider this to be a bug/problem.

@Nelson Castillo: Whoa, neat. Other insane people are around. 🙂

@mofino: On the contrary, I’m sure I’m an idiot. However, there are an awful lot of us idiots (read: the entire MySQL on Linux community), so at least I’m not alone. And my solution is both simple and works, so does it really matter that I’m an idiot?
Tim Desjardins

May 2, 2008 at 11:52 am

Linux file caching tends to dominate memory on systems that get heavy use and do a lot of logging. It is a flaw in Linux that you can’t mount a file system as O_DIRECT only, for instance to do your logging to, thus bypassing the cache and not putting pressure on your real cache. ZFS has this option, which I’m looking forward to using on Linux in the near future. I also wrote a log appender which uses O_DIRECT only, but it hasn’t been put through its paces yet. To summarize, there is no point in caching log files.
Mike

May 2, 2008 at 12:18 pm

That is a really cool idea, but I don’t understand why you would first format the ram disk with an ext3 filesystem only to fill it with a swap file. Why not instead just run “mkswap /dev/ram0” directly? It seems that would be a bit more efficient.
spizzy

May 2, 2008 at 12:36 pm

– yes, just mkswap. Cargo-culting going on I guess.

Still, seems a bit risky. Linux supports swap partition priorities – you could set the pseudoswap to have a high priority, and still supply lower-priority disk swap, so that if there’s _real_ demand for swap, the system can degrade gracefully.

I do wish linux 2.6.x just ran a bit more sensibly with no swap partition though (I work in embedded field).
Alex

May 2, 2008 at 12:39 pm

Wouldn’t this problem be solved by just using raw I/O? This is a good read from the OS perspective. Basically, InnoDB is already behaving as if it knows how to manage memory better than the OS, why not bypass the filesystem’s use of the disk entirely?
Kevin

May 2, 2008 at 12:48 pm

select * from users where clue > 0;
Don MacAskill

May 2, 2008 at 12:48 pm

@Mike: I’m 99% sure I tried this, and Linux balked/freaked out/something. Give it a shot, let me know what you find.

@Alex: I need to post about how in love we are with volume managers (LVM, ZFS) because of the ease of doing backups. So we like filesystems over raw I/O because there are all sorts of side benefits, but you’re right, doing raw I/O would route around this problem. Of course, I already have a *very* simple solution, so doing raw I/O isn’t very compelling. 🙂
Jason

May 2, 2008 at 12:52 pm

FreeBSD’s VM system doesn’t do this; this has come up before:
http://jeremy.zawodny.com/blog/archives/000132.html
with a solution in
http://jeremy.zawodny.com/blog/archives/000697.html

Of course, if you’re exclusively Linux based this doesn’t really help, but you could possibly just move over your single app DB machines.
vicaya

May 2, 2008 at 1:08 pm

Since you have access to the mysql source code, have you tried to use posix_fadvise on the open log files? Let us know, if it works for you. It sure worked for me before.
Frank Mashraqi

May 2, 2008 at 4:51 pm

When I experienced a similar issue on a few of my old and highly tuned Solaris 10 UltraSPARC IIIi boxes, I thought I was hitting the same issue as described by Peter. Luckily, I was wrong and this issue doesn’t exist in Solaris 10.
Kevin Burton

May 2, 2008 at 7:17 pm

@Robert Milkowski Switching to Solaris would invite another set of problems. There are fewer web 2.0 shops running Solaris than BSD … 🙂 I’m not planning on switching any time soon. You can pry my Linux from my cold dead hands 🙂

@Harrison … good point about large pages. I forgot that this was an option

@Don … we’re on 2.6.24 with our newer machines. After 24 hours only 725 blocks have been paged. Not sure if this means the problem was fixed. We’re on a slightly different config. Note that we were on 2.6.18 before so we might be running into the same kernel problem.

@Tim Desjardins … yes. It would be nice to mount an ENTIRE volume O_DIRECT…

@Don What are the horror stories you’ve heard about large pages?
usaf

May 2, 2008 at 8:05 pm

Thanks, very useful post.
zeek

May 2, 2008 at 10:34 pm

I don’t understand what problem you’re having running no swap? We have 50 servers at my work running with no swap. Running a swap(file|partition) is not required for linux.

Swap is a dead paradigm. RAM speeds have gone up by a factor of 100, disk speeds only a factor of 10. Swap stopped making sense when the disks couldn’t keep up.

tldr: don’t configure a swap partition or swapfile!
Kevin Burton

May 2, 2008 at 11:54 pm

Zeek, some people do run without swap but there shouldn’t be a need to disable it or run without swap.

The problem is the OOM killer.

If you’re on a 16GB box… and you have 16GB of memory in InnoDB… an errant ls or top shouldn’t be able to trigger the OOM killer and thus kill -9 your mysqld.

Things like overcommit make it worse…
Olaf van der Spek

May 3, 2008 at 11:36 am

> You can’t just not have any swap partitions, though, or kswapd will literally dominate one of your CPU cores doing who-knows-what.

I assume this bug has been reported, but I don’t see a link. Got one? 😉

> Zeek, some people do run without swap but there shouldn’t be a need to disable it or run without swap.

There shouldn’t be a need to enable swap in the first place.

> The problem is the OOM killer.

> If you’re on a 16GB box… and you have 16GB of memory in InnoDB… an errant ls or top shouldn’t be able to trigger the OOM killer and thus kill -9 your mysqld.

I think what is needed is something like memory quotas. You could allocate/reserve 15 gb for MySQL. The OoM killer can then make a better decision about what to kill when more memory is needed.
Finite

May 3, 2008 at 11:33 pm

This is some kind of joke, right?

As if swapping to a ramdisk wasn’t retarded enough, you’ve decided to throw a filesystem in the middle? Hilarious stuff!

So, what happens when the kernel decides to swap out some of that memory used by the ramdisk?
just a bar

May 4, 2008 at 4:08 am

That’s a known problem with swappiness=0. Set it to 10 and kswapd won’t hog the CPU anymore.
A Linux Fan now gone

May 4, 2008 at 5:15 pm

Why not forget all of Linux’s kernel insanity and just run on OpenSolaris.
Kevin Burton

May 4, 2008 at 8:40 pm

Just FYI… I’m getting the same behavior on 2.6.24 on Debian Etch.

I loaded MySQL and a robot process which took up 3.8G out of 4G… a good bit of it left over.

I then decided to go for a hike in the Marin Headlands for a few hours.

I come back and Linux has decided to swap out 2G of MySQL.

Brilliant. Kill me now please.

At least my hike was good.

Hm…. I wonder if I could LD_PRELOAD the kernel to disable using the page cache……..

Kevin
Bernd Eckenfels

May 5, 2008 at 2:28 pm

Rik van Riel recently commented on that on lkml: 20080505164446.2ec9a543@cuia.bos.redhat.com

One of the problems is that the process pages (anonymous memory) and
page cache pages live on the same LRU, so the kernel cannot always
easily find the page cache pages when it is trying to evict something.

Once that is fixed, and replacement is biased towards evicting page
cache pages, the system may do the right thing by itself.

I realize that this is no quick fix for your issue, but I am working
on a split LRU patch series to make sure Linux does the right thing in
the future.

You can find the latest patch at http://people.redhat.com/riel/splitvm/
and info on the design at http://linux-mm.org/PageReplacementDesign
Don MacAskill

May 5, 2008 at 2:39 pm

@Bernd Eckenfels

Thanks for the info! (For readers just joining, here’s a link to Rik’s post in that thread: http://www.ussg.iu.edu/hypermail/linux/kernel/0805.0/2261.html )
Kevin Burton

May 5, 2008 at 3:20 pm

Thanks Bernd……

I started a thread on LKML about this so hopefully this patch will fix this problem.

Now I need to find some time to apply the patch to our kernel.

Kevin
prasana

September 9, 2008 at 1:04 pm

Don: Curious to hear about the issues with a larger page size and MySQL – we are in the process of trying out 64KB to see if we can get 200+MBps on the MD3000 using MPath. Saw your posting about the “horror” stories and wanted to see if you could provide some pointers to that. Thanks, -prasana AT admob DOT com
Kiran Hota

February 13, 2009 at 1:45 pm

Nice article on swap problem