Random bit flips are far more common than most people, even IT professionals, realize. Surprisingly, the problem isn’t widely discussed, even though it silently causes data corruption that can directly impact our jobs, our businesses, and our security. It’s really scary knowing that such corruption happens in the memory of our computers and servers – that is, before the data even reaches the network and storage portions of the stack. Google’s in-depth field study of DRAM errors showed that uncorrectable bit errors are a fact of life. And do you remember when Amazon had to globally reboot their entire S3 service because of a single flipped bit?
The Error-Prone Data Trail
Let’s assume for a moment that your data survives its many passes through a system’s DRAM and emerges intact. That data must then be safely transported over a network to the storage system, where it is written to disk. How do you ensure the data remains unaltered along the way? Well, if you’re using one of the storage protocols that lack end-to-end checksums (e.g. NFSv2, NFSv3, SMBv2), your data remains susceptible to random bit flips and data corruption. Even NFSv4 plus Kerberos 5 with integrity checking (krb5i) doesn’t offer true end-to-end checksums: once the data is extracted from the RPC message, it is unprotected again. Moreover, NFSv4 has never seen widespread adoption, and even fewer deployments enable krb5i.
Over a decade ago, the folks at CERN urged that “checksum mechanisms (…) be implemented and deployed everywhere”. That appeal carries even more weight today, given the storage sizes and daily data transfer rates we now deal with. Data corruption can no longer be dismissed as a merely “theoretical” issue. And if you think modern applications protect against this problem, I’ve got bad news for you: in 2017, researchers at the University of Wisconsin uncovered serious failures when they injected bit errors into well-known and widely used distributed storage systems.
Checksums Came at a Cost That’s Worth Paying Today
When NFS was designed, file writes and overall data volumes were relatively small and checksum computations were very expensive. Hence, the decision to rely on TCP checksums for data protection seemed reasonable. Unfortunately, TCP’s 16-bit checksum proved to be too weak: a randomly corrupted segment has roughly a one-in-65,536 chance of passing the check anyway, and it misses entire classes of errors, such as reordered 16-bit words, outright. When you transfer gigabytes per second, such undetected errors are bound to happen eventually. What about Ethernet checksums, you ask? The Ethernet frame CRC is indeed stronger. However, it only protects individual link hops, not the end-to-end path, and opportunities for data corruption remain manifold: cut-through switches that don’t recompute checksums and buggy NIC drivers are just two examples of where things can go horribly wrong.
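To make the weakness concrete, here’s a minimal Python sketch of the Internet checksum algorithm that TCP uses (the RFC 1071 ones’-complement sum). Because the sum is commutative over 16-bit words, two payloads that differ only in word order produce the identical checksum – the corruption sails right through:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit words, as used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# Two different payloads: the second merely swaps two 16-bit words.
a = b"\x12\x34\xab\xcd"
b = b"\xab\xcd\x12\x34"
assert a != b
assert internet_checksum(a) == internet_checksum(b)  # undetected!
```

The same property means that a pair of compensating bit errors in different words can also cancel out – exactly the kind of failure a stronger CRC would catch.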
Quobyte’s Checksums and the End of Silent Data Corruption
Quobyte has seen such silent data corruption happen, even in mid-sized installations. After we informed one of our customers that their data was being corrupted in transit, they began investigating the network stack. The culprit turned out to be a driver issue: a kernel update had broken the TCP offload feature of their NICs. Tracking down the problem was both difficult and time-consuming.
That’s where end-to-end checksums come in. Quobyte uses them as follows: As soon as our client software receives the data from the operating system, each block (usually 4k bytes, but that can be adjusted in the volume configuration) is checksummed. Because this checksum stays with the data block forever, the data is protected – even against software bugs – as it travels through the software stack. The checksum is validated along the path throughout the life of the data – even at rest when the data isn’t accessed (via periodic disk scrubbing). All this is possible because we don’t rely on a dated legacy protocol like NFS. Instead, we use our own RPC protocol where each data block – and the message itself – is checksum-protected. And since modern CPUs have built-in CRC32 computation capabilities, there’s no longer a performance penalty for using CRCs.
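The per-block checksum idea can be sketched in a few lines of Python. This is an illustration of the concept, not Quobyte’s actual implementation – `zlib.crc32` stands in here for the hardware-accelerated CRC, and the 4096-byte block size mirrors the 4k default mentioned above:

```python
import zlib

BLOCK_SIZE = 4096  # illustrative; mirrors the 4k default block size

def checksum_blocks(payload: bytes) -> list[tuple[bytes, int]]:
    """Split the payload into blocks and attach a CRC32 to each one.
    The (block, checksum) pair then travels together through the stack."""
    return [
        (payload[i:i + BLOCK_SIZE], zlib.crc32(payload[i:i + BLOCK_SIZE]))
        for i in range(0, len(payload), BLOCK_SIZE)
    ]

def verify(block: bytes, crc: int) -> bool:
    """Re-validate at every hop: on receive, before write, during scrubbing."""
    return zlib.crc32(block) == crc

blocks = checksum_blocks(b"some application data" * 1000)
block, crc = blocks[0]
assert verify(block, crc)

# A single flipped bit anywhere in the block is caught:
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
assert not verify(corrupted, crc)
```

Because the checksum is computed once at the edge and carried alongside the block, every later hop can cheaply re-verify it – which is what makes the protection end-to-end rather than hop-by-hop.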
If a Quobyte component detects corrupted data and the corruption was caused by a drive, the data is repaired immediately. Should the corruption happen somewhere outside of Quobyte’s control, you’ll get an alert and will know which nodes are experiencing issues. This level of protection is mandatory when you deal with data in the petabyte range, and it is a prerequisite for hyperscale.
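A scrubbing pass with this repair-or-alert behavior could look roughly like the following sketch. All names here are hypothetical (`scrub`, `read_replica`, `alert`), and the in-memory dict stands in for blocks at rest on a drive:

```python
import zlib

def scrub(stored: dict, read_replica, alert) -> None:
    """Re-verify every block at rest against its stored CRC32. On a
    mismatch, repair from a healthy replica; if no healthy copy exists,
    raise an alert instead. (Hypothetical sketch, not Quobyte's code.)"""
    for block_id, (data, crc) in list(stored.items()):
        if zlib.crc32(data) != crc:
            replica_data, replica_crc = read_replica(block_id)
            if zlib.crc32(replica_data) == replica_crc:
                stored[block_id] = (replica_data, replica_crc)  # repaired
            else:
                alert(block_id)  # corruption outside the storage layer

# Simulate one silently corrupted block plus one healthy replica:
good = b"payload"
stored = {7: (b"paXload", zlib.crc32(good))}   # bit rot on "disk"
replicas = {7: (good, zlib.crc32(good))}
scrub(stored, replicas.__getitem__, lambda b: print("alert:", b))
print(stored[7][0])  # → b'payload'
```

The key point is that the stored checksum lets the scrubber distinguish a bad copy from a good one without any application involvement.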
We understand that nobody can just ditch legacy protocols like NFS overnight. But consider some of the reasons why NFS no longer meets today’s operational goals – its poor failover options, the lack of widespread pNFS adoption for parallel I/O, and so on – and it should become clear why modern storage systems need to move past downright dangerous protocols like NFSv3. The way forward is to switch to protocols that offer proper in-transit data protection (e.g. S3 or Kerberos), or, better yet, true end-to-end checksums. If you ignore checksums, you’re sitting on a ticking time bomb.