Computers and networks are subject to failures and errors. A storage network composed of thousands of computer nodes will regularly see some of those nodes go down temporarily or permanently, which could mean loss of data. Thus, in this context, one might wonder, how can a reliable system be built from unreliable components?
To answer this question properly, the most common types of failures to which these systems are exposed need to be understood:
- An entire node is dropped from the network (due to hardware problems or by choice).
- A node permanently loses its entire disk.
- A node loses some disk sectors but not the entire disk.
These three types of data loss are quite different, and understanding how frequently each occurs informs how storage networks can be optimized. Partial disk failures have become exceedingly rare with modern HDDs, and especially SSDs, meaning that the most frequent sources of data loss are situations where the entirety of a node’s data suddenly becomes unavailable. Therefore, data loss prevention mechanisms should be optimized to handle entire-node-loss cases (i.e., the first two types of failures) as the highest priority, and partial-data-loss cases as the lowest priority. There are two industry-standard mechanisms for data loss prevention: replication and error-correcting codes.
One of the most common and well-known answers to the challenge of building reliable systems from unreliable components is replication. When something is important, it is replicated, so that if one part fails, another can be used instead. Replication is so effective that it is used in mission-critical systems such as spacecraft and nuclear power plants.
The same idea can be used for storage networks. Imagine a large dataset that needs to be stored over 20 nodes, replicated into 5 copies so that multiple node failures can be tolerated. The replicated dataset will require a total of 5x20=100 nodes. The first copy will be placed on nodes 1-20, the second copy on nodes 21-40, and so on until the last copy is stored on nodes 81-100. This way, even if some of those computers go down, the data can still be downloaded from other replicas. But replication has two significant drawbacks: (1) with N replicas, the actual data is only 1/N of the total raw space used, and (2) if nodes 1, 21, 41, 61, and 81 in this example fail, the data is still lost.
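The placement scheme above can be sketched in a few lines. This is a hypothetical layout helper, not any real system's API; the node numbering and function name are illustrative:

```python
# Sketch of the replica placement described above: 5 copies of a
# 20-node dataset spread over 100 nodes (hypothetical layout).
DATASET_NODES = 20
REPLICAS = 5

def replica_nodes(block_index, dataset_nodes=DATASET_NODES, replicas=REPLICAS):
    """Return the node numbers (1-based) holding copies of a given block."""
    return [r * dataset_nodes + block_index + 1 for r in range(replicas)]

# Block 0 lives on nodes 1, 21, 41, 61, 81 -- losing exactly these five
# nodes loses the block despite the 400% storage overhead.
print(replica_nodes(0))  # [1, 21, 41, 61, 81]
```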
These drawbacks mean that data replication has a huge overhead on the space used, yet is not very reliable in terms of how many nodes can be lost before data begins to be lost. A more efficient and reliable data loss prevention mechanism can be used instead: error-correcting codes.
Error-correcting codes (ECC) are used to store data in unreliable environments. There are many different types of ECC. RAID-5 and RAID-6 are simple ECC schemes that allow recovery from one or two lost nodes, respectively. A more advanced scheme, Reed-Solomon (RS), allows the recovery of many more failed nodes. At a high level, RS coding works as follows:
- A dataset is divided into K blocks of equal size, called data blocks.
- RS coding computes M additional blocks of the same size, called parity blocks.
- The original dataset can be restored from any K blocks of those K+M blocks!
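The "any K of K+M" property can be demonstrated with the simplest possible case: a single XOR parity block (M=1), which is essentially how RAID-5 works. Real Reed-Solomon coding generalizes this to arbitrary M using finite-field arithmetic; the sketch below only illustrates the principle:

```python
# Toy illustration of the K+M idea using single XOR parity (the M=1
# special case; Reed-Solomon generalizes this to arbitrary M).
from functools import reduce

def encode(data_blocks):
    """Append one parity block: the bytewise XOR of the K data blocks."""
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_blocks))
    return data_blocks + [parity]

def recover(blocks, missing_index):
    """Rebuild any single missing block by XORing the K remaining ones."""
    present = [b for i, b in enumerate(blocks) if i != missing_index]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*present))

data = [b"AAAA", b"BBBB", b"CCCC"]   # K = 3 data blocks
group = encode(data)                 # K + M = 4 blocks in the ECC group
assert recover(group, 1) == b"BBBB"  # any one lost block can be rebuilt
```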
How does this compare to replication? In the previous 100-node replication example, RS coding could instead be configured so that the raw data occupies 20% of the total space: 20 nodes would store the raw data (K=20) and 80 nodes would store parity blocks (M=80). Any 80 of the 100 nodes could be lost and the original dataset could still be recovered. With replication, in the worst case, 5 node losses (1, 21, 41, 61, 81) would render the dataset unavailable; with RS coding (K=20, M=80), the system tolerates the loss of any 80 nodes. That is, it tolerates up to 20 times more losses than replication with the same number of nodes.
The K and M parameters can also be tuned to offer the same level of protection as replication while requiring much less storage space. The replication example used 5 replicas, i.e., a 400% storage overhead, and in the worst case tolerated a maximum of 4 node failures, using 100 nodes in total. Using RS coding, the loss tolerance is the same (M=4), yet only 24 nodes are needed in total (K=20, K+M=24) instead of 100. This is only 20% storage overhead, i.e., well below the space required even for a single replica, yet it provides significant protection.
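The arithmetic behind these two configurations is worth spelling out. The numbers below come directly from the examples above:

```python
# Comparing the storage overhead of the two schemes discussed above.
def overhead(total_nodes, data_nodes):
    """Storage overhead as a fraction of the raw data size."""
    return (total_nodes - data_nodes) / data_nodes

# Replication: 5 copies of a 20-node dataset -> 100 nodes,
# tolerating only 4 worst-case node losses.
assert overhead(100, 20) == 4.0   # 400% overhead

# RS coding with K=20, M=4 -> 24 nodes, also tolerating any 4 losses.
assert overhead(24, 20) == 0.2    # 20% overhead
```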
Longer isn’t always better
In ECC algorithms, datasets are divided into blocks of equal size. With modern media, the typical data block size may be around 64 KB. With a 64 KB block size, even a modest 64 GB data storage contains one million blocks. With modern multi-terabyte disks and hundreds of nodes participating, the total number of blocks in a dataset may easily reach billions.
While modern error-correcting codes support ECC groups with billions of blocks, such huge groups become impractical. It sounds great when a group can outlive the loss of any 200 million blocks, but there is a significant drawback: in a one-billion-block group with M = 200 million parity blocks, the recovery of a single lost data block requires downloading K blocks - in this case, 800 million of them!
Even for a modest K=1000, recovering a single lost data block requires downloading 1000 blocks, reducing effective bandwidth 1000-fold. Because of this, the ECC group size is usually limited to a reasonable value, typically under a hundred blocks.
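The repair-traffic cost described above is easy to quantify. The block size and K values below are taken from the surrounding text; the function name is illustrative:

```python
# Why huge ECC groups are impractical: repairing ONE lost block costs
# K block downloads (block size and K values from the text above).
BLOCK_SIZE = 64 * 1024   # 64 KB blocks

def repair_traffic_bytes(k, block_size=BLOCK_SIZE):
    """Bytes that must be downloaded to rebuild a single lost block."""
    return k * block_size

# K = 800 million blocks -> roughly 48 TiB downloaded to repair one
# 64 KB block.
print(repair_traffic_bytes(800_000_000) / 1024**4)  # ~47.7 (TiB)

# Even a modest K = 1000 multiplies repair traffic 1000x over the
# size of the block being repaired.
assert repair_traffic_bytes(1000) == 1000 * BLOCK_SIZE
```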
Placing the Blocks
As explained above, we want to maintain small ECC group sizes. But how small is too small? Remember the story of the shop selling ice cream and hot fudge cakes: a system must be optimized for its most frequent errors. Nowadays, total disk/node outages are far more frequent than partial disk errors, so it is better to optimize for those cases.
For each dataset, we make the ECC group size (K+M) equal to the number of nodes participating in the dataset. This means that inside each ECC group, each node stores exactly one block – it may be one of the K data blocks or one of the M parity blocks. As a result, each ECC group can survive a full drop of any M nodes, and less severe damage (e.g., disk sector loss) can also be tolerated.
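This placement rule can be sketched as a simple mapping from node index to block. The labeling function below is a hypothetical illustration, not a real system's layout:

```python
# Sketch of the per-group placement rule described above: with group
# size K + M equal to the node count, each node holds exactly one
# block of the group (hypothetical labeling).
def block_for_node(node_index, k):
    """Which block of one ECC group a given node stores."""
    if node_index < k:
        return ("data", node_index)          # one of the K data blocks
    return ("parity", node_index - k)        # one of the M parity blocks

K, M = 20, 4
layout = [block_for_node(n, K) for n in range(K + M)]
assert sum(1 for kind, _ in layout if kind == "data") == K
assert sum(1 for kind, _ in layout if kind == "parity") == M
# Losing any M whole nodes removes at most M blocks per group,
# which is exactly what RS coding with M parity blocks can repair.
```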
Sometimes it is Good to be Lazy
At a high level, the idea of ECC is simple: extra blocks of data are produced that can be used to recover lost data in the case of a failure. But how does this recovery process occur? How is a node failure discovered? When is recovery triggered? Who is in charge of the data reconstruction?
Most systems implement a continuous monitoring mechanism in which the important components of the network are asked at a given frequency to prove that they are alive. For instance, the nodes in a network send a tiny message every x seconds to demonstrate that they are online. This is called a heartbeat; when the heartbeat line goes flat, i.e., no messages are received for some time, the node is considered gone. In the case of storage networks, we want to test at a given frequency that the data is still present on the nodes. Once this mechanism detects that a block of data is missing, an available node executes the recovery process.
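A minimal heartbeat tracker might look like the sketch below. The class name, timeout value, and node identifiers are all illustrative assumptions, not part of any particular system:

```python
import time

# Minimal heartbeat-monitoring sketch. The 30-second timeout and the
# node names are illustrative assumptions.
HEARTBEAT_TIMEOUT = 30.0   # seconds of silence before a node is presumed gone

class HeartbeatMonitor:
    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}    # node id -> time of last heartbeat

    def heartbeat(self, node_id, now=None):
        """Record a heartbeat message from a node."""
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        """Nodes whose heartbeat line has gone flat."""
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

m = HeartbeatMonitor()
m.heartbeat("node-1", now=0.0)
m.heartbeat("node-2", now=0.0)
m.heartbeat("node-1", now=25.0)              # node-2 stays silent
assert m.dead_nodes(now=40.0) == ["node-2"]  # its line went flat
```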
However, the recovery process is expensive, generating both network traffic and computational load. It requires downloading K blocks from K nodes and executing a computationally intensive and time-consuming decoding algorithm. If this is done every time a node goes down, replacing a single node could impose a substantial cost on the system. Moreover, a node may lose its network connection only briefly.
Here is where being lazy can make the system more efficient. Remember that an RS-coded system can tolerate up to M nodes being lost. Recovery can start at any number of lost nodes between 1 and M. Thus, one could lazily wait until, for example, M/2 nodes are lost before running the recovery process. In this case, the recovery node has to download the same K blocks and execute the same computationally expensive decoding algorithm, but it produces M/2 blocks in a single recovery pass, which is much more efficient than doing the same work to produce only one block. In other words, as the colloquial expression says: “to kill M/2 birds with one stone”. In addition, transient network outages are frequent in these systems, so this mechanism also smooths over the situation where a node loses network connectivity but then comes back.
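The amortization argument above can be made concrete with a small cost model. The function and parameter names are illustrative, and the model only counts block downloads, ignoring decoding time:

```python
# Why lazy repair amortizes cost: one decoding pass over K downloaded
# blocks can regenerate several lost blocks at once. This toy model
# counts only block downloads (names and numbers are illustrative).
def repair_downloads(lost_blocks, k, batch_size):
    """Total block downloads if repair runs once per `batch_size` losses."""
    passes = -(-lost_blocks // batch_size)   # ceiling division
    return passes * k

K, M = 20, 4
eager = repair_downloads(M // 2, K, batch_size=1)       # repair each loss at once
lazy = repair_downloads(M // 2, K, batch_size=M // 2)   # wait for M/2 losses
assert (eager, lazy) == (40, 20)   # lazy repair halves the download traffic
```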
Maintaining data reliability on data networks is not a trivial problem. The naive approach of simply duplicating the data is not scalable and it does not offer the level of resilience that users need. Thus, it is important to pay attention to all the different aspects mentioned in this post, in order to achieve both reliability and performance. In particular, if the storage system is designed to handle massive amounts of data, it is critical to carefully study the design space in order to achieve scalability and efficiency.