Hunting for bit rot

Florian Weimer

I examined multiple copies of my personal data for bit rot.

The analysis covered about 300 GB of data, a large part of which existed in five copies: the current production file system, two copies from a RAID-1-mirrored file system, and two USB hard disks containing progressive incremental backups created with rsync's --link-dest option. The total raw data amounted to about 2 TB. Some files date back to 1998 and earlier, but the incremental backups only went back to the end of 2009, so that is the time frame of the analysis.
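The --link-dest snapshot scheme can be reconstructed roughly as follows (a sketch with made-up paths and dates, not the exact commands used): each new snapshot is a full rsync copy in which unchanged files are hard-linked against the previous snapshot instead of being stored again.

```shell
# Made-up example layout; $tmp/src stands in for the production tree.
tmp=$(mktemp -d)
mkdir -p "$tmp/src"
echo "some file content" > "$tmp/src/notes.txt"

# First (full) snapshot:
rsync -a "$tmp/src/" "$tmp/backup/2009-12-01/"

# Later snapshot; files unchanged since the previous snapshot become
# hard links into it rather than fresh copies:
rsync -a --link-dest="$tmp/backup/2009-12-01" \
    "$tmp/src/" "$tmp/backup/2009-12-02/"

# Unchanged files therefore share an inode across snapshots:
inode1=$(stat -c %i "$tmp/backup/2009-12-01/notes.txt")
inode2=$(stat -c %i "$tmp/backup/2009-12-02/notes.txt")
```

This is why the raw data (about 2 TB) is so much larger than the logical data (about 300 GB): each snapshot looks like a full copy, but unchanged files cost only a directory entry.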

Bit rot was assumed when file contents changed while the modification time (mtime) remained the same across different copies. Metadata was not verified beyond a file system check.
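A minimal sketch of this criterion in Python (my own illustration, not the tool actually used): walk one copy, and flag any file whose whole-second mtime agrees with the file at the same relative path in another copy while the content hashes differ.

```python
import hashlib
import os


def sha256(path):
    """Hash a file's contents in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def find_bit_rot(tree_a, tree_b):
    """Report files present at the same relative path in both trees
    whose mtimes (whole seconds) agree but whose contents differ --
    the bit-rot criterion described in the text."""
    rotten = []
    for dirpath, _, names in os.walk(tree_a):
        for name in names:
            path_a = os.path.join(dirpath, name)
            rel = os.path.relpath(path_a, tree_a)
            path_b = os.path.join(tree_b, rel)
            if not os.path.isfile(path_b):
                continue
            same_mtime = (int(os.stat(path_a).st_mtime)
                          == int(os.stat(path_b).st_mtime))
            if same_mtime and sha256(path_a) != sha256(path_b):
                rotten.append(rel)
    return rotten
```

Files whose mtimes differ are ignored: a content change accompanied by an mtime change is an ordinary edit, not rot.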

Comparison was based on hashes gathered with the no-changes tool. This tool contains an optimization which hashes hard-linked files only once; this was essential for completing the hash computation of the rsync backups at media speed.
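The optimization can be sketched as follows (a hypothetical reimplementation, not no-changes itself): remember one digest per (device, inode) pair, so that every hard-linked file body is read from disk only once no matter how many snapshots link to it.

```python
import hashlib
import os


def hash_tree(root):
    """Map each file's relative path to a content digest, reading each
    hard-linked file body only once (keyed on device and inode)."""
    seen = {}      # (st_dev, st_ino) -> hex digest
    digests = {}   # relative path -> hex digest
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        h.update(chunk)
                seen[key] = h.hexdigest()
            digests[os.path.relpath(path, root)] = seen[key]
    return digests
```

With --link-dest backups, most files in older snapshots are hard links to newer ones, so this turns a multi-terabyte hashing job into one roughly the size of the logical data.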

Only files which remained at the same relative location in the tree were compared with each other. An earlier attempt at locating bit-rotten copies based on file base name (without the directory), length, and mtime proved unworkable: too many parts of the file system contained files with identical names that were updated within the same second. rsync unfortunately does not preserve sub-second precision of file system time stamps, which could have reduced such collisions to a more manageable level.
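The abandoned scheme might have looked roughly like this (a hypothetical sketch): key every file on (base name, size, whole-second mtime), and treat any key with multiple entries as an ambiguous match that cannot be used for comparison.

```python
import collections
import os


def index_by_name_size_mtime(root):
    """Sketch of the abandoned matching scheme: group files by base name
    (directory stripped), size, and whole-second mtime. Keys mapping to
    more than one path are ambiguous and defeat the matching."""
    index = collections.defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            index[(name, st.st_size, int(st.st_mtime))].append(path)
    return index
```

Directory trees full of repetitive names (think generated files or checkouts touched by one command) produce many such ambiguous keys, which is what made the approach unworkable here.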

The analysis did not detect any bit rot at all, even though a lot of the data had been copied at least twice over USB. Tools which use hashes extensively (such as Git) had not reported any bit-rot-related errors over the years, so this result is not too surprising.

