Linux – Memory cache losing sync with disk


(Ubuntu Linux server, 64-bits)
I was troubleshooting a file (~3.0 GB) that I had just downloaded and that was failing its integrity check when I discovered something really unusual.

First, this is the MD5 of the file right after the download, which didn't match the expected value:

~% md5sum media.iso 
5d74facb904cc1765a468354908a8f34  media.iso

Some time passed. Nothing should have changed the file in the meantime, but when I checked it again:

~% md5sum media.iso
a5b97c5016afb39bd67ccfc3fa6ca59e  media.iso

This was really unexpected. Since I have a lot of RAM, I suspected this was an effect of caching and that something was going awry with it. I decided to retry with the whole file read from disk and, to my surprise:

~% sudo sysctl -w vm.drop_caches=3    # This linux command invalidates
vm.drop_caches = 3                    # everything in the memory cache.
~% md5sum media.iso
2992aa6270f6e1de9154730ed3beedc1  media.iso

I redid it and now the checksum seems to stay consistent, although it still isn't the value I was expecting. Clearly, the contents of the memory cache differed from the contents on disk. This is the big problem.

To fix the download, I created a torrent on the source machine and opened it on the target machine. Five 1 MB chunks out of ~3.0 GB failed the integrity check. I used the torrent to repair these chunks, and now the file integrity is OK.
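
The torrent client finds damaged chunks by hashing each piece separately. A minimal sketch of the same idea, assuming a fixed 1 MiB chunk size and SHA-1 digests (the function names `chunk_digests` and `mismatched_chunks` are hypothetical, not part of any torrent tool): compute the digest list on the source machine, compute it again on the target, and compare.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB; an assumed piece size for illustration


def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return one SHA-1 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests


def mismatched_chunks(good, suspect):
    """Indices of chunks whose digests differ or exist in only one list."""
    longest = max(len(good), len(suspect))
    return [
        i for i in range(longest)
        if i >= len(good) or i >= len(suspect) or good[i] != suspect[i]
    ]
```

Multiplying a reported index by the chunk size gives the byte offset where that chunk starts, which narrows down where to look with `cmp -l`.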

The problem now is to determine where the data got out-of-sync.

  1. I tested the memory with memtest86+, all but the bit fading test. I was expecting to see some failing memory module, but there wasn't anything. Everything is ok.
  2. Filesystem is Ext4, over LVM2, over a 3-disk RAID5 array. Ext4 is considered stable, and if data were inconsistent between disks, mdadm would have warned. But there is nothing in the logs. S.M.A.R.T. error logs are clean, the disks are new (have less than 30 days of "power-on-hours").
  3. I'm looking for information about any data-loss bugs in my current kernel (2.6.35), but there doesn't seem to be anything, as far as I looked.

Any ideas on what else I could check, or where exactly the defect/bug could be?

It is an Ubuntu 10.10 64-bit system with a Core i7 930 and 6 GB of non-ECC RAM.

Update: I confirmed that the files are being written to disk correctly. The pages are being altered in memory, after they have been read from disk. I ran many more memory tests (I left the bit fade test running overnight), and still nothing. All memory modules seem OK.

Some more tests:

~% md5sum media.iso 
cc8bcf1ce67ff7704eadc2222650c087  media.iso
~% cp media.iso tmp1
~% md5sum tmp1 
bde6c54b2d7b03404b43056b908036ed  tmp1
~% md5sum media.iso 
134f607cf4c633ef11d2576d1c635d08  media.iso    # ← THIS IS THE CORRECT VALUE
~% cmp -l media.iso tmp1
  98697009 101 121
~% udiff <(xxd -s ... media.iso) <(xxd -s ... tmp1 )
--- /proc/self/fd/11    2010-11-03 14:52:55.649433000 -0200
+++ /proc/self/fd/13    2010-11-03 14:52:55.649433000 -0200
@@ -13,7 +13,7 @@
 5e1fef1: 280f 5a87 37d2 e6d6 647d bebe f04e 64d8  (.Z.7...d}...Nd.
 5e1ff01: 19a5 2ff4 178b 1e37 afb0 e914 e03f bd62  ../....7.....?.b
 5e1ff11: 2b8d 4245 985f a9f8 a993 1f51 6d31 30e7  +.BE._.....Qm10.
-5e1ff21: 8274 0d35 ab8f 86b7 130f e1d7 20c6 3541  .t.5........ .5A
+5e1ff21: 8274 0d35 ab8f 86b7 130f e1d7 20c6 3551  .t.5........ .5Q
 5e1ff31: 387b f226 6348 fabc 1eae 67ef adda c3b6  8{.&cH....g.....
 5e1ff41: a931 bf29 690f 25f9 8922 6dcc 009f 60a5  .1.)i.%.."m...`.
 5e1ff51: 559a 9d03 92cb fb5c a75f a26e 0954 0af4  U......\._.n.T..

~% md5sum media.iso        
54d67cc4dcad49b6d1bf6619074b471c  media.iso
~% direcat media.iso|md5sum
134f607cf4c633ef11d2576d1c635d08  -
~% direcat media.iso | cmp -l media.iso -
  98697009 121 101
 231297649 146 147
 519630641 177 157
2291859249 377 357
2442055473 127 107
2907131697 171 151

(direcat is a version of cat that reads with O_DIRECT, that is, bypassing page cache)
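
For readers without such a tool, here is a minimal sketch of the same idea. It is not the `direcat` used above; `md5_of_file` is a hypothetical helper, and it assumes Linux, a filesystem that supports `O_DIRECT`, and a chunk size that is a multiple of the device block size. `O_DIRECT` requires an aligned buffer, which the anonymous `mmap` provides.

```python
import hashlib
import mmap
import os


def md5_of_file(path, direct=True, chunk_size=1 << 20):
    """MD5 of a file; with direct=True the reads use O_DIRECT (Linux)."""
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    digest = hashlib.md5()
    try:
        # An anonymous mmap is page-aligned, as O_DIRECT reads require.
        with mmap.mmap(-1, chunk_size) as buf:
            while True:
                n = os.readv(fd, [buf])   # fills buf, returns bytes read
                digest.update(buf[:n])
                if n < chunk_size:        # short read: end of a regular file
                    break
    finally:
        os.close(fd)
    return digest.hexdigest()
```

Calling it once with `direct=False` and once with `direct=True` on the same file gives the cached and on-disk views; if the two digests differ, the page cache and the disk disagree.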

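There is a clear pattern: the corruption always hits the first byte of a 16-byte-aligned block (cmp -l prints 1-based offsets, and every offset above is congruent to 1 modulo 16). In that byte, bit 4 (counting from the LSB as bit 0) almost always flips to one, but there was one instance where bit 0 flipped to zero.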
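
The pattern can be checked mechanically rather than by eye. The sketch below (the helper `analyze_cmp_line` is hypothetical) parses lines of `cmp -l` output, which lists a 1-based decimal byte offset followed by the two differing byte values in octal, and reports the position inside a 16-byte block together with the bits that differ. Bits are numbered from 0 at the LSB.

```python
def analyze_cmp_line(line, block=16):
    """Analyze one line of `cmp -l` output: offset, octal byte, octal byte."""
    offset_text, first_text, second_text = line.split()
    offset = int(offset_text) - 1               # cmp offsets are 1-based
    first, second = int(first_text, 8), int(second_text, 8)
    return {
        "position_in_block": offset % block,
        # Bits set in the first file's byte but not in the second's.
        "bits_only_in_first": [i for i in range(8) if (first & ~second) >> i & 1],
        # Bits set in the second file's byte but not in the first's.
        "bits_only_in_second": [i for i in range(8) if (second & ~first) >> i & 1],
    }


# Lines copied from the `direcat media.iso | cmp -l media.iso -` output above:
# the first value is the cached read, the second is the O_DIRECT read.
sample = """
  98697009 121 101
 231297649 146 147
 519630641 177 157
2291859249 377 357
2442055473 127 107
2907131697 171 151
"""

for row in sample.strip().splitlines():
    print(row.strip(), analyze_cmp_line(row))
```

Because the first operand in that comparison is the cached copy, a bit listed under `bits_only_in_first` was flipped to one in memory, and a bit under `bits_only_in_second` was flipped to zero.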

Best Answer

If the md5sum of a file changes, there are several possible explanations, in decreasing order of likelihood:

  • The file was written to.
  • Your RAM is defective (or another motherboard component, but RAM is by far the most failure-prone).
  • Your storage is defective. (Unlikely, because defective storage usually leads to an unreadable file, not corrupted data.)
  • A kernel bug, perhaps in the filesystem code. (Highly unlikely with ext4.)

Note that “inconsistency between disk and cache” is a symptom, not a cause. It's not even a symptom you observed: what you observed was a difference between memory at time T and memory at time T'.

If you're sure the file wasn't being modified, then defective RAM is the most likely explanation. Memory tests don't always detect bad RAM, unfortunately. If you can get two different copies of the file, compare them (cmp -l file1 file2); if the differences are aligned (e.g. the differences are always on the 42nd bit of a 16-byte sequence) or consist of displaced blocks (the sign of corruption occurring to a pointer variable), all signs point to defective RAM.