# On Data Verification and Sampling

Every time you upload data to the cloud, you trust that the entity hosting your data will not lose it, and that five years from now you will still be able to enjoy the pictures of that fantastic trip you took. This is called *data durability*, and storage providers implement different strategies to guarantee it, as we have covered in the past. For instance, it is common knowledge that triple replication can be used to guarantee that your data won't be lost: when one of the nodes/disks fails, the data is still present on the other two nodes, and it can be replicated onto a third node again. The probability of those three hard disk drives (HDDs) failing simultaneously is very low; however, it is not zero. One could then ask: what is the likelihood of such a thing happening? Well, if we assume that the probability of a node failing is one in a thousand (0.1%), and that those three nodes are completely independent, then the probability of all three failing simultaneously is P = 0.1% × 0.1% × 0.1% = 0.0000001%. In other words, the probability that you can retrieve your data at any moment is 99.9999999%, a number some people refer to as *nine nines*.
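The durability arithmetic above is easy to check in a few lines of Python (a toy calculation using the failure probability assumed in the text, not a real durability model):

```python
# Probability of data loss under triple replication, assuming
# fully independent node failures (numbers from the text above).
p_fail = 0.001            # 0.1% chance that a single node fails

# The data is lost only if all three replicas fail simultaneously.
p_loss = p_fail ** 3
durability = 1 - p_loss

print(f"P(loss)     = {p_loss:.1e}")       # about 1e-09, i.e. 0.0000001%
print(f"P(retrieve) = {durability:.9f}")   # 0.999999999, the "nine nines"
```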

# A Byzantine Challenge

Checking for data availability on centralized storage is as simple as asking the server if the data is still there, because you trust that server. However, the same verification becomes substantially more complex in a decentralized storage system, where storage nodes are independent and cannot be trusted. Most decentralized storage systems implement incentives to compensate storage providers for the data they host, but this opens the door to a wide range of attacks. For instance, a malicious node could pretend to host some data to collect the rewards for that dataset, but erase the data as soon as it receives the rewards in order to host another dataset. Another attack is to be online only 50% of the time to save on bandwidth and energy consumption. Thus, decentralized storage systems need to constantly verify whether any blob of data has been lost, regardless of whether it was an accidental hardware failure or a malicious user trying to game the system.

In our previous blog post, we explained how we protect datasets using erasure codes. Basically, the original dataset is divided into K blocks of data, augmented with M blocks of encoded data, and stored on N=K+M nodes in the P2P network. Of the N nodes storing the dataset, the system can tolerate up to M nodes disappearing from the network simultaneously and still be able to reconstruct the original dataset. In order to know when data has been lost, the system needs to request a proof from the storage nodes demonstrating that they still have the data. Note that just checking that the storage nodes are alive is no longer enough, as malicious nodes could keep responding while having erased the data. These data proofs could be posted to a blockchain for the users to see and check the status of the network.
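As a toy illustration of this recovery property, here is a single-parity sketch in Python. It covers only the special case M=1, where the encoded block is the XOR of the K data blocks; real erasure codes such as Reed-Solomon support arbitrary M, so this is a simplification, not how our system encodes data:

```python
from functools import reduce

def xor_blocks(blocks):
    # XOR equal-length byte blocks together, element by element.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Divide the data into K blocks and compute a single (M=1) parity block.
K = 4
data = b"ABCDEFGHIJKLMNOP"                       # 16-byte toy dataset
size = len(data) // K
blocks = [data[i * size:(i + 1) * size] for i in range(K)]
parity = xor_blocks(blocks)                      # the encoded block

# Simulate one of the N = K + 1 blocks disappearing from the network.
lost = 2
survivors = [b for i, b in enumerate(blocks) if i != lost]

# XOR of the surviving blocks plus the parity reconstructs the lost one.
recovered = xor_blocks(survivors + [parity])
assert recovered == blocks[lost]
print("recovered block:", recovered)             # prints b'IJKL'
```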

# A Storage Problem

The simplest and most accurate way to verify that a node still has a block of data would be to have it send back the whole block, but obviously, this is prohibitively expensive both in bandwidth and in storage. A much more space-efficient proof would be to send the hash of the data. However, this does not demonstrate that the node has the data at this moment, only that it had it at some point in the past: the hash could be computed upon reception of the data, saved in a database, and the data itself discarded. A real verification system needs to integrate some source of randomness that is not predictable by the storage nodes. For instance, one could apply a simple XOR between the hash of the last block produced by a given blockchain (e.g., Ethereum) and the targeted block of data. However, XOR is not suitable either, because it brings us back to the proof-size problem mentioned previously.
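The freshness requirement can be sketched with a hash-based challenge-response (a minimal illustration, not the scheme we use; note the response here is still recomputed over the full block, which is exactly why the succinct proofs discussed next are needed):

```python
import hashlib
import os

def prove(data: bytes, nonce: bytes) -> bytes:
    # The storage node must hash the *fresh* nonce together with the data,
    # so the response cannot be precomputed before discarding the data.
    return hashlib.sha256(nonce + data).digest()

data = b"some stored block of data"
nonce = os.urandom(32)    # stand-in for an unpredictable value,
                          # e.g. the latest Ethereum block hash

# Verifier: compute the expected answer (in practice it would hold a
# commitment instead of the data itself; this is a simplification).
expected = hashlib.sha256(nonce + data).digest()

# An honest node, still holding the data, answers correctly.
assert prove(data, nonce) == expected

# A node that only kept a stale hash of the data cannot answer.
stale_hash = hashlib.sha256(data).digest()
assert stale_hash != expected
```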

Fortunately, there are mathematical functions that take a block of data and a one-time token (i.e., the source of randomness) and produce a short proof that can be stored on the blockchain. An example of such a mathematical tool is the famous succinct non-interactive argument of knowledge, also known as a SNARK. Furthermore, some algebraic primitives allow aggregating multiple such proofs into one, to further reduce bandwidth and storage costs. We will not go into the details of those verification systems in this post; that will be covered in a future article. What is important to retain here is that by using such functions, the system can host terabyte-size datasets and produce proofs that are only a few bytes in size.

# Space and Time

To implement such a verification system, one would usually divide the block of data into small chunks that can be handed to the algebraic function that generates the proof. The proof will generally be only a fraction of the size of the chunk, which is precisely the objective: avoiding sending the whole dataset. Let's use an example to better visualize it. A dataset of 80 GB needs to be stored. The dataset is divided into K=80 blocks of 1 GB, and M=20 blocks of encoded data are generated using erasure coding. There are a total of N=100 blocks of 1 GB that will be dispersed and stored on 100 nodes. Internally, each 1 GB block of data is subdivided into 1024 chunks of 1 MB. Each chunk of data generates a proof that is only 1% the size of the chunk; that is to say, each proof is about 10 KB, so about 10 MB for all the proofs of one node. However, those proofs can be aggregated into one single proof of 10 KB for the entire node. Moreover, they can be aggregated across nodes as well, so the entire 80 GB global dataset could generate a single proof of only 10 KB.
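The sizing in this example is simple back-of-the-envelope arithmetic; the 1% proof-to-chunk ratio is the figure assumed above, not a property of any particular proof system:

```python
# Reproduce the sizing example from the text.
KB, MB, GB = 1024, 1024**2, 1024**3

K, M = 80, 20
N = K + M                                  # 100 nodes, one 1 GB block each
chunks_per_block = (1 * GB) // (1 * MB)    # 1024 chunks of 1 MB per block
proof_size = (1 * MB) // 100               # proof is ~1% of a chunk, ~10 KB

# Without aggregation, each node would ship one proof per chunk.
proofs_per_node = chunks_per_block * proof_size    # ~10 MB per node

# With aggregation, a node (or even all 100 nodes together) produces
# a single ~10 KB proof instead.
print(N, chunks_per_block, proof_size // KB, proofs_per_node // MB)
# prints: 100 1024 10 10
```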

Succinct proofs and proof aggregation have the power to dramatically reduce the space required to store the proofs. So that is it, we solved the problem! Well, not completely; there is something we haven't told you: succinct proofs and proof aggregation are computationally expensive and may take a long time to compute. In addition, the system needs to generate proofs at a relatively high frequency in order to detect failures soon enough to prevent data loss. Thus, in a system with petabytes of data and millions of files to verify, it could take hours or even days to verify all datasets. In other words, the computational cost of such algebraic primitives makes it prohibitively expensive to generate and aggregate all these proofs on a regular basis. We have transformed a space problem into a time problem; indeed, there is no free lunch. However, not everything is lost. To overcome this new issue, we leverage one more mathematical trick: *probabilities*.

# Probabilistic Sampling

Sampling allows us to verify whether someone has the content they say they have, without asking for the whole dataset. The randomness makes it hard for a storage node to remain undetected if it decides to discard part of the content. For instance, if a malicious storage node decides to store only half of a dataset and discard the other half to save storage, detecting this attack is the equivalent of tossing a coin. If you ask for one small chunk of the data, you have a 50% chance of getting the data (heads) and a 50% chance of not getting it (tails), thus detecting that the node is malicious. Now, if you ask for two independent small chunks of data, the probability of not detecting that the node is malicious is only 25%, the probability of getting heads two times in a row. If you ask for three chunks, the probability goes down to 12.5%, and if you ask for 10 chunks, the probability of not detecting a malicious node goes down to 0.098%. Looking at it from the other perspective, if you ask for 10 chunks and receive them all, you can be 99.9% sure that the node is not erasing 50% of the data (i.e., malicious).
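The coin-toss analogy is easy to check numerically (a toy calculation mirroring the percentages above):

```python
# Chance that a node keeping only half the data survives n random,
# independent chunk requests undetected: each request is a coin toss.
def p_undetected(n, kept=0.5):
    return kept ** n

assert p_undetected(2) == 0.25                       # 25%
assert p_undetected(3) == 0.125                      # 12.5%
assert round(p_undetected(10) * 100, 3) == 0.098     # ~0.098%

# Confidence after receiving all 10 requested chunks: ~99.9%
print(f"{1 - p_undetected(10):.4f}")                 # prints 0.9990
```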

Remember the *nine nines* mentioned at the beginning? Well, here we could ask how many chunks we need to request to have nine nines of certainty that the storage node is not pulling off this attack (i.e., 50% data loss), and the answer is 30 chunks. For six nines of certainty we only need to request 20 chunks, and to be 99.9% sure we only need to request 10 chunks. To detect malicious nodes that discard less data, say 10%, we need to verify 66, 132 and 197 chunks to reach three, six and nine nines of certainty, respectively. Thus, instead of requesting thousands of succinct proofs and aggregations, you only need a few dozen proofs to have a very high level of certainty that the storage node has your data. This reduces the computational cost by several orders of magnitude, while still guaranteeing high reliability.
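These chunk counts follow from solving kept^n ≤ 10^(-d) for n, where kept is the fraction of data the attacker retains and d is the number of nines of certainty we want (a small helper to verify the figures, not our actual sampling code):

```python
import math

def chunks_needed(kept, nines):
    # Smallest n such that kept**n <= 10**(-nines),
    # i.e. n >= -nines / log10(kept).
    return math.ceil(-nines / math.log10(kept))

# Node discarding 50% of the data (kept = 0.5):
assert chunks_needed(0.5, 3) == 10    # 99.9% certainty
assert chunks_needed(0.5, 6) == 20    # six nines
assert chunks_needed(0.5, 9) == 30    # nine nines

# Node discarding only 10% (kept = 0.9) is harder to catch:
assert [chunks_needed(0.9, d) for d in (3, 6, 9)] == [66, 132, 197]
```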

# Putting it All Together

To recap, the untrusted setting of a permissionless, open P2P network poses challenges when implementing a decentralized storage system. To solve that issue, one can implement a non-interactive verification system that demonstrates that the storage nodes still have the data they committed to store. However, this can lead to high storage and bandwidth consumption if implemented naively, for instance by resending the whole dataset every time the system wants to verify that the data is still available. Therefore, we use mathematical tools to generate succinct proofs (e.g., SNARKs) that can be aggregated, limiting the bandwidth and storage requirements of the verification system. This solves the storage problem but raises a computational one. To reduce the computational cost of SNARK generation and aggregation, we make use of probabilistic sampling. For more details, please visit our PoR model.