
Silent data corruption on AMD servers

July 25, 2007

One of my readers, Yusuf Goolamabbas, let me know about a nasty silent data corruption issue on AMD servers with 4GB or more of RAM running Linux. Yikes! This is the sort of thing that keeps me up at night. Yusuf linked me to two bugs on the subject, one at kernel.org and another at Red Hat.

Lots of servers from a variety of manufacturers seem to be affected. It looks like a combination of a problem in Nvidia’s chipset hardware (I’m not an expert, so maybe it’s AMD’s fault, but it doesn’t sound that way to me) and the Linux kernel not handling GART pages correctly. Other OSes don’t seem to be affected, either because they don’t use the hardware iommu at all or because they do things correctly in the first place.

One sucky thing? Apparently Red Hat’s fix isn’t out yet for RHEL5 or RHEL4. Ugh. You can force the kernel to use software iommu instead, but I’m glad I’m not affected.
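
For what it’s worth, the workaround (going by the bug reports, so double-check me) is booting with iommu=soft on the kernel command line, which makes the kernel fall back to its software iommu and bounce buffering instead of the hardware GART. On a GRUB box that’s a one-word addition to the kernel line, roughly like this (the kernel version and root device below are just placeholders, not a tested config):

    # /boot/grub/grub.conf (sketch only; adjust kernel version and root for your box)
    title CentOS (2.6.18-8.el5)
            root (hd0,0)
            kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 iommu=soft
            initrd /initrd-2.6.18-8.el5.img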

Most of our servers have over 4GB of RAM, and as you no doubt know, we’re pretty in love with our SunFire x2200 servers, most of which have 4GB – 32GB of RAM. So I fired off a frantic email late last night to Sun, asking them if our servers have the problem.

The good news? They don’t! Whew. Maybe I’ll get some sleep tonight… 🙂

FYI, there are some Sun servers (and plenty from every other vendor, too) that are affected. Here’s a link to Sun Alert 102790 with more info. Sun was also good enough to send along info on a similar-sounding, but different, issue in Sun Alert 102323.

My next question for Sun will be about how ZFS would handle silent data corruption like this, since it’s supposed to be quite resilient to strange hardware behavior. My bet is that this is probably outside the scope of what ZFS can detect and avoid: I think it’s awesome at catching read errors, but I’m not sure how it could tell that a write didn’t contain the right data in the first place. But then, I’m not as smart as they are. 🙂
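
Just to make my own guess concrete, here’s the mental model in my head, as a toy sketch in Python (emphatically not anything like real ZFS code, and I may be off base): the checksum gets computed over whatever buffer the filesystem is handed at write time, so anything that goes wrong below that point can be caught on read, but a buffer that was already trashed in memory checksums just fine.

    import hashlib

    # Toy model of checksum-on-write / verify-on-read. Not ZFS code;
    # it only illustrates where in the pipeline the checksum is taken.
    disk = {}  # block_id -> (checksum, data)

    def write_block(block_id, data):
        # The checksum covers whatever buffer we were handed. If that buffer
        # was already corrupted in memory, we checksum the garbage and can
        # never tell the difference.
        disk[block_id] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(block_id):
        checksum, data = disk[block_id]
        # Corruption below this layer (disk, cable, controller, DMA) shows up
        # as a mismatch here; a mirror or RAID-Z copy could then repair it.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch on block %r" % block_id)
        return data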

Anyway, hope this info helps some of you out. I know I’d want to know about this stuff.

Categories: datacenter
  1. Wout
    July 26, 2007 at 1:42 am

    Well, from the bugs it seems that data read from the disk is not what was stored. So the system memory is not affected?

    ZFS adds the checksum before the write. As soon as you hand off a block of data to the ZFS subsystem, it’s protected. Anything that happens at a lower level, like disk errors, wire problems, controller issues, etc., will be detected.

    If the data you give to ZFS is wrong, though, there’s nothing that can help you. 🙂 ZFS will simply hand you back the same wrong data when you read it.

    So if this bug silently overwrites main memory, ZFS might not be in a position to help.

  2. Matt Culbreth
    July 26, 2007 at 8:31 am


    I’m assuming you’re running Solaris on your Sun boxes, not Linux?

    P.S.–Great blog. I’m trying out a Sun box here soon mostly on your blog’s recommendation.

  3. July 26, 2007 at 9:28 am


    No, we’re running Linux on our Sun boxes. CentOS 5, to be exact.

    But ZFS has us perpetually curious about Solaris. 🙂

  4. July 27, 2007 at 12:29 am

    Hey Don — I don’t know enough about the bug and the BIOS/Linux interaction to say for sure whether ZFS would save you here. If, as the bug report implies, it’s related to the iommu, that would suggest that the data is corrupted during DMA. In which case, ZFS would detect it on the next read; and if you were running with mirrors or RAID-Z, ZFS would correct it as well.

    Wout is right that if you get silent in-memory corruption *before* the data is written to disk, we currently have no way to detect that. We’ve considered adding an option to keep in-memory buffers checksummed and verify them before any modification and before any disk write. It would be insanely expensive, of course, but could come in handy when trying to track down broken hardware.

  5. July 27, 2007 at 1:02 am


    Thanks for the insight – that’s what I was thinking, too.

  6. September 7, 2007 at 1:07 pm

    Hey Don,
    Just a heads up, looks like that first Kernel.org link is broken. I realize this is a month+ late, though. 😛

    BTW, just started reading random parts of this blog a few weeks ago, and I have to say thanks. Your insight and experience sharing is priceless, and I wish more companies were as open as you are regarding hardware/scaling/experiences. Keep it up.
