Silent data corruption on AMD servers
One of my readers, Yusuf Goolamabbas, let me know about a nasty silent data corruption bug on AMD servers with 4GB or more of RAM running Linux. Yikes! This is the sort of thing that keeps me up at night. Yusuf linked me to two bug reports on the subject, one at kernel.org and another at Red Hat.
Lots of servers from a variety of manufacturers seem to be affected. It looks like a combination of some problem with Nvidia’s hardware (I’m not an expert, so maybe it’s AMD’s fault, but it doesn’t sound that way to me) and the Linux kernel not handling GART pages properly. Other OSes don’t seem to be affected, either because they don’t use the hardware IOMMU or because they do things correctly in the first place.
One sucky thing? Apparently Red Hat’s fix isn’t out yet for RHEL5 or RHEL4. Ugh. In the meantime you can force the kernel to use the software IOMMU instead, but I’m glad I’m not affected.
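If you want to see whether a box is already booted with that workaround, here is a minimal sketch in Python. It assumes the workaround is the iommu=soft kernel boot parameter and that the running kernel's command line is readable from /proc/cmdline; the function names are my own, and the comma-separated handling of iommu= options is an assumption about how the parameter is usually written. Treat it as a quick sanity check, not a substitute for reading the bug reports.

```python
def has_soft_iommu(cmdline: str) -> bool:
    """Return True if a kernel command line asks for the software IOMMU.

    Assumes the workaround is spelled iommu=soft, possibly combined with
    other comma-separated options (e.g. iommu=soft,noagp).
    """
    for token in cmdline.split():
        if token.startswith("iommu="):
            options = token[len("iommu="):].split(",")
            if "soft" in options:
                return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux machine, has_soft_iommu(read_cmdline()) tells you whether the parameter was passed at boot. It says nothing about whether the hardware is actually affected.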
Most of our servers have over 4GB of RAM, and as you no doubt know, we’re pretty in love with our SunFire x2200 servers, most of which have 4GB – 32GB of RAM. So I fired off a frantic email late last night to Sun, asking them if our servers have the problem.
The good news? They don’t! Whew. Maybe I’ll get some sleep tonight… 🙂
FYI, there are some Sun servers (and plenty from every other vendor, too) that are affected. Here’s a link to Sun Alert 102790 with more info. Sun was also good enough to send along info on a similar-sounding, but different, issue in Sun Alert 102323.
My next question for Sun will be about how ZFS would handle silent data corruption like this, since it’s supposed to be quite resilient to strange hardware behavior. My bet is that this is likely outside of the scope of things ZFS can detect and avoid (I think it’s awesome at read error detection, but I’m not sure how it could tell that a write doesn’t contain the right data). But then, I’m not as smart as they are 🙂
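To make that worry concrete, here is a toy sketch of a per-block checksum. It is not ZFS code and says nothing about how ZFS is actually implemented; the helper names and the single-bit "corruption" are made up for illustration. The point is only that a checksum verified on read can catch data that changed after the checksum was computed, but not data that was already wrong when it was checksummed.

```python
import hashlib


def flip_first_bit(data: bytes) -> bytes:
    """Simulate corruption by flipping one bit in the first byte."""
    return bytes([data[0] ^ 1]) + data[1:]


def write_block(data: bytes, corrupt_before: bool = False,
                corrupt_after: bool = False):
    """Simulate a write; return (bytes that reach the disk, stored checksum)."""
    if corrupt_before:
        # Data is damaged before the checksum is computed.
        data = flip_first_bit(data)
    checksum = hashlib.sha256(data).digest()
    # Data is damaged on its way to disk, after the checksum is computed.
    stored = flip_first_bit(data) if corrupt_after else data
    return stored, checksum


def verify(stored: bytes, checksum: bytes) -> bool:
    """Check on read that the stored bytes still match the checksum."""
    return hashlib.sha256(stored).digest() == checksum
```

With corrupt_after=True, verify returns False and the bad block is caught on read. With corrupt_before=True, verify returns True even though the stored bytes are wrong. Which of those two cases this bug falls into is exactly what I want to ask Sun.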
Anyway, hope this info helps some of you out. I know I’d want to know about this stuff.