This blog post is about data redundancy in distributed storage systems – how to do it the right way in order to properly protect your data, and which methods to avoid. We’ll also talk about the tradeoffs required when doing disaster recovery over long distances.
The Good, the Bad and the Ugly would also have been a good title for this blog post and lends itself nicely to structure it. In storage, the terms mirroring, replication, and DR are often used inconsistently and interchangeably, although they have quite different meanings. To make things worse, they are also used for two very different uses in a storage system – data redundancy and disaster recovery.
The Good: Data redundancy protects your data integrity
One of the main purposes of a storage system designed to store data redundantly is that the failure of a hardware component does not cause data loss or corruptions. To be more precise, the storage system needs to implement strong consistency. We can define this roughly as “If write A happens before read B, B must return the value written by A.” This is also part of the POSIX standard, and you can observe this behavior on a local file system.
Many applications, like databases, rely on these strong consistency guarantees. Most servers use local RAID across multiple drives to achieve data redundancy. For simplicity, we’ll ignore how local caching can violate strong consistency (hey, I’m looking at you NFS).
However, when talking about distributed storage, this becomes much more complex, mostly due to the large number of failure scenarios. You now have to deal with networking issues like packet loss, intermittent outages, broken hardware, partitions, or the failure of an entire server. To make things worse, a server in a distributed system doesn’t know whether another server has crashed or if the network is down. You can observe that when you try to figure out whether your internet connection is broken or if the web server is offline. The fact that a ping doesn’t go through won’t you to tell which component is broken.
A distributed storage system has to deal with all of the above failure scenarios and still needs to make sure that your data is stored redundantly and that strong consistency is guaranteed. There are essentially two methods to implement data redundancy in a distributed system: synchronous replication and erasure coding.
Synchronous replication writes updates to all replicas before acknowledging the write. The tricky part here is to make sure that this works when the above-mentioned network errors occur. To ensure that each replica is being updated even if one of them is unavailable, crashed, or dead, most systems use quorums (in case you wondered, this is where Quobyte got it’s name). There is a wealth of research on how to implement quorum-based replication. In case you want to dig deeper (shameless plug) some of it is from the Quobyte team and describes part of Quobyte’s core technology. The other approach to keep data in sync is erasure coding (see our previous blog post for how that works, and how to use it in Quobyte) with a row commit protocol to achieve data redundancy and consistency across machines.
In both scenarios, a write needs to go to multiple servers before it is acknowledged to the application. This means that the storage system will be as fast (or slow) as the slowest network round-trip time between the servers. As a result, this method is mostly used inside a single data center or metro area where latencies are low. The big benefit here is that both data redundancy mechanisms guarantee strong consistency even when failures happen. Your database will simply continue to work and there is no data loss – any transaction that was committed before the failure will still be there; any file created will be there and it will have the correct content.
Now the obvious question comes up – and we hear it on an almost daily basis – do I have to make the trade-off between latency (the message round-trips) and data safety (consistency)?
Yes, you do. Eric Brewer’s CAP theorem states that out of the three properties, C(onsistency) A(vailability) and P(artition-tolerance), you can only have two. Consistency is exactly our strong consistency. An available system will always give you a non-error response to your requests. Finally, a system is partition-tolerant if it works despite a message loss or delay.
Different storage systems deal differently with CAP. A distributed POSIX file system keeps C and P and has to give up A (cf. http://pubs.opengroup.org/onlinepubs/9699919799/) – it is strongly consistent and should be able to tolerate message delays and drops. The trade-off is that you won’t get an answer from the system when consistency cannot be guaranteed.
Object stores, another category of storage systems, trade in consistency (theirs is eventual consistency, where ‘eventual’ can mean “inconsistent till the end of the universe”). Such systems always give out a response, but it might be totally outdated.
The Bad: How to lose data
Now to the bad part. My favorite example for doing data redundancy the wrong way is mirroring, i.e. implementing RAID1 across two machines to implement a CP system.
The idea is simple – updates are executed on both machines. This will work, until a failure happens, and then it’ll fail horribly and lead to data corruption. The basic problem is that once there is a failure, the machines have no way to tell whether the other machine is down or there is just packet loss or some other network problem. In a worst-case scenario, both will assume the other machine is dead and continue to modify the data locally, in different and inconsistent ways. This is called “split brain,” and means that you’ll end up with two diverging copies of your database. In most cases, you can now either bring in the data forensics team to manually restore the database or say goodbye to some transactions. Whenever a system is based on things like heartbeat, heartbeat links or a STONITH, you know that you are dealing with a system that is very likely broken by design.
Sometimes, mirroring is combined with a lock manager (cluster quorum, witness replicas, and the like) that helps to decide which node is “dead.” This is only a slight improvement, which avoids running into the split-brain situation. However, there are some very realistic scenarios where all the system can do is let you read outdated data or return an error until the second mirror is back. Applications will see data suddenly change, sometimes even back to a previous version. As you can imagine, this will easily corrupt your database and make transactions disappear.
The Ugly: Life is full of trade-offs
Ugly is when you need a second copy of your data in a data center that is far enough away to survive a disaster at site location one. In this case, you have to make a trade-off and give up consistency to avoid having the long latency to the remote site in the IO path. This is often called “geo-replication” or “asynchronous replication”. What you get is an AP system on top that tries to apply changes on a best-effort basis at the remote site. When a disaster hits your primary data center, you can switch over to your copy at the cost of potentially losing some of the latest updates. However, this is often still much better than not having access to the data at all. Like with mirroring in a local data center, adding a witness can only help to prevent the split-brain situation, but not prevent data loss or reading outdated data.
So, when you go shopping for a storage system, please keep two things in mind:
Mechanisms for strong data consistency (CP) in storage systems are bound by the maximum round-trip time between any two servers. The only way to avoid this is to give up consistency and go with an AP system – unless you’ve found a way to increase the speed of light.
Mirroring with two copies is fundamentally broken, and the only place where it’s a good idea is for asynchronous data mirroring between two remote sites for disaster recovery. In that case you have to give up consistency to win availability since the CAP theorem clearly tells us that you can’t have a system that delivers all three properties.
Until we get the faster-than-light subspace communication as seen on StarTrek, you will have to make a choice between consistency and performance/latency. Until then, the mythical “global fully consistent storage” belongs to the realm of science fiction.