When I talk to people about erasure coding, I often find that, even though it is on their radar, they are unsure what exactly it is and how it works. Some topics in distributed storage are complicated and take thorough study to understand (Paxos, for example). Erasure coding isn’t one of them. In fact, if you know RAID, you already know most of what matters about erasure coding.
(Mostly) Just like RAID
RAID with parity (like RAID 5) and erasure coding improve data availability and safety. They enrich the original with redundancy data and spread it across disks. A subset of this data is enough to regenerate the original data. RAID is used for redundant storage of entire disks within a server. Erasure coding is usually used in distributed storage systems. So you can think of erasure coding as generalized and distributed RAID.
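To make the "a subset is enough" idea concrete, here is a minimal sketch of single parity in the style of RAID 5, using byte-wise XOR. The block contents and the helper name `xor_blocks` are made up for illustration, and XOR parity only covers the loss of one block. Tolerating more losses needs a more general code (Reed-Solomon is a common choice), and nothing here is meant to describe Quobyte's actual implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, as they might sit on three different disks (made-up content).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the redundancy data and would live on a fourth disk.
parity = xor_blocks(data)

# Simulate losing the second block and rebuild it from the three survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Any three of the four blocks are enough to recover the fourth, which is exactly the property that RAID with parity and erasure coding share.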
So what value does erasure coding in a distributed storage system like Quobyte add over a mere RAID setup?
- In distributed storage, data is spread across disks in different servers. Distribution improves availability because the system can tolerate the loss of entire machines. Keep the data on a single server instead and it becomes unavailable as soon as that one machine goes down.
- RAID is a disk-level concept. Erasure coding, in contrast, is often implemented in a way that data isn’t restricted to a fixed set of disks. Quobyte’s distributed file system applies erasure coding individually to each file and spreads data independently across disks. During a rebuild, it can read surviving data from many different disks and write regenerated data to any available disk, which yields an enormous rebuild bandwidth (sometimes called “declustered rebuild”). That greatly shortens rebuild time compared to RAID, where the rebuild is confined to the resources of the disk set.
- Erasure coding allows setting arbitrary ratios of original data and coding data. With a ratio of m parts of original data to n parts of coding data, the code can tolerate the loss of any n parts and regenerate the original m parts. For example, a code of m:n = 8:3 enriches every 8 parts of data with 3 parts of coding data and spreads the data across 11 (8 + 3) disks. This encoding can then tolerate the loss of any 3 disks and incurs a storage blow-up of just (m + n)/m = (8 + 3)/8 = 1.375. Compare this with three-way quorum replication, which can tolerate the loss of only 1 or 2 disks and has a blow-up of a factor of 3.
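The arithmetic in the last point is easy to check for other ratios. The following is a small sketch with two hypothetical helper functions (the names are mine, not part of any Quobyte API). For replication it reports the durability limit, meaning data survives as long as one copy remains; a quorum-based system may stop serving data earlier than that.

```python
def erasure_code_profile(m, n):
    """Return (disks used, tolerated losses, storage blow-up) for an m:n code."""
    return m + n, n, (m + n) / m

def replication_profile(copies):
    """Same figures for plain replication; data survives while one copy remains."""
    return copies, copies - 1, float(copies)

print(erasure_code_profile(8, 3))  # (11, 3, 1.375)
print(replication_profile(3))      # (3, 2, 3.0)
```

An 8:3 code survives one more disk loss than three copies do while storing less than half as many bytes.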
When to Use Erasure Coding
Erasure coding performs best with sequential data writes. The erasure coding engine immediately writes the original data to remote disks as the data streams in. It computes the coding parts on the fly and writes them along. If data is written in random order, write performance degrades severely. The coding engine needs to read all data of the coding group first, recompute the coding parts, and then write out the modified original data along with the coding data. That amplifies every random write by m – 1 additional reads and n additional writes. Because erasure coding stores the original data as-is, reading erasure coded data behaves just like reading replicated data, with no caveats around access patterns.
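A rough I/O count shows why the access pattern matters so much. This sketch assumes that a random write replaces exactly one whole data part and that the engine follows the read, recompute, write sequence described above; it counts the modified data part plus the n coding parts as writes. The function names are hypothetical, and real systems may use other update strategies.

```python
def sequential_stripe_io(m, n):
    """I/O for writing one full coding group of m fresh data parts."""
    # All m data parts are in hand, so computing the n coding parts needs no reads.
    return {"reads": 0, "writes": m + n}

def random_write_io(m, n):
    """I/O for overwriting a single data part inside an existing coding group."""
    # Read the other m - 1 data parts, recompute the coding parts,
    # then write the 1 modified data part plus the n coding parts.
    return {"reads": m - 1, "writes": 1 + n}

m, n = 8, 3
seq = sequential_stripe_io(m, n)
rnd = random_write_io(m, n)

# Disk operations needed per data part that the application writes.
print("sequential:", (seq["reads"] + seq["writes"]) / m)  # 1.375
print("random:    ", rnd["reads"] + rnd["writes"])        # 11
```

Under these assumptions, one randomly written part of an 8:3 code costs 11 disk operations, most of them over the network, against 1.375 per part for a sequential stream.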
All this boils down to one essential rule: For sequentially written data (usually written only by one writer and only once) use erasure coding. That’ll give you the benefit of increased data safety and minimal blow-up. For everything else use replication. Quobyte’s policy engine lets you choose between the two down to the level of individual files. One last thing: Note that erasure coding shares this behavior with RAID. But the effect in an erasure coding scenario stands out more since RAID only has small m and n and does not have to read data over the network.
Quobyte: Parallel File System with Direct Erasure Coding
Quobyte implements erasure coding directly as an alternative IO path in the client, rather than re-coding data later, after it has already been written. Quobyte touches and writes data only once for maximum efficiency. Erasure coded files also benefit from the same level of scalability and performance that applies to quorum replicated files. Best of all, because Quobyte is a parallel file system with direct erasure coding, it can be used for both primary storage (think video or HPC) and secondary or archival storage.
Erasure coded files really are first-class citizens in Quobyte:
- They are subject to Quobyte’s policy engine and can be moved transparently in the background.
- They enjoy the same strong end-to-end checksum protection all other data has.
- They are protected against the write hole known from RAID (cf. Jeff Bonwick’s post on RAID-Z).
- They support any access pattern. This way, an occasional out-of-order write does not cause compatibility problems.