Wednesday, December 29, 2010

Disk Failure, Raid Corruption and CRCs

This week is started with a raid corruption due to a disk failure. Disk has continued to work without notifying anyone about the corruption, but this is somehow acceptable. The inacceptable thing is that the file-system is not aware about its data state. The "Old File-Systems" are not interested in user data and even with journaling, user data is not  a priority.

As a Spare-Time File-System Developer, is really funny saying "With my File-System, This failure would not have happened!"

Simple Solution: B-Tree and CRCs
A really simple way to solve this problem is adding CRC to user data, or better (from file-system point of view) adding a CRC on every block. If the file-system is entirely based on B-Tree (when I say B-Tree, I mean B*Tree or B+Tree) you can simple add CRC for the block in the node header, and CRC for the child in the blocks pointer.

Excluding from a moment the super-block and other things... You start reading the root of the tree (1st block), if node crc check fail there's something wrong with your disk-read (or maybe your memory, but this is less probable). When you start traversing the tree the check is even better and "secure".

Checking the CRC on the block read is good, but having the CRC stored in a different location, and read at a different time is even better.

Storing CRC of the child node within the pointer gives you the ability to do a double integrity check. When data becomes larger and cannot be stored in a tree node, you can do the same thing with the extents (Pointers and CRC).

Another maybe more "intrusive" solution is storing a CRC for every N bytes of data, but I think that this more acceptable for a user-space "crc-fs" implementation (This approach is implemented in Hadoop's ChecksumFileSystem class).

Yesterday night I've implemented a quick and dirty B+Tree, it's not tuned for speed but for a didactic view. It has a couple of nice features like Pointers with CRC and Variable Length nodes to be able to compress nodes on disk. You can find the source code on my github repository (B+Tree Code).