The Codex organisation, together with the wider blockchain community, is building solutions that decentralise data storage and guarantee robust data durability. This article examines the weaknesses and challenges of modern data storage techniques. It then explores decentralised data storage and erasure coding as solutions, the hurdles of implementing them, and what the future holds for a more decentralised, open, and fault-tolerant internet.
The Problem: Honeypots and Powder Kegs
Data storage is the Achilles heel of the internet. Many of the images, text, music, and software applications we enjoy live on centralised servers. Our digital footprints amount to ever-growing volumes of data stored across these servers, including the mega-servers run by Google, Facebook, TikTok, Yahoo, Instagram, and similar services. Most people take it for granted that we should store data in this inefficient and costly manner.
This centralisation of storage into a few repositories represents an unnecessary and dangerous risk to our collective and personal data. When data, like money or other assets, gets stored in a single location, that location becomes both a powder keg and a honeypot.
It is a powder keg because, at any moment, outages, hardware malfunctions, or disk crashes can destroy data or render it inaccessible. It is a honeypot because once data is accumulated and stored in a single location, it attracts hackers, scammers, and thieves, especially if PII (personally identifiable information) or other sensitive info, such as financial or medical records, gets stored en masse.
Ransomware is one example of how bad actors have exploited the situation. Ransomware is an attack in which malicious software encrypts or locks a victim's systems and data until the victim pays a ransom. Any time organisations store large troves of data in one location, they expose themselves to the possibility of ransomware attacks. Astra Security described the growing problem:
‘There are 1.7 million ransomware attacks every day which means every second 19 ransomware attacks. The first half of 2022 saw nearly 236.7 million ransomware attacks worldwide. Ransomware is expected to cost its victims around $265 billion (USD) annually by 2031.’
Decentralisation of data storage is not a foolproof mechanism for preventing ransomware attacks. It is a partial solution. In reality, a mix of decentralisation and strong encryption will help minimise attacks on data. Protecting valuable troves of data against the scourge of accidental outages, North Korean hacker attacks, government seizure of servers, and other unforeseen attacks must be a key priority for all organisations.
Storage Concerns and Vulnerabilities in Web3
Storage becomes more vital as the internet community moves from Web2 to Web3. Web3 is the part of the internet that is directly linked to blockchains and tokenised assets. In this ecosystem, people play games or collect art built on NFTs, or non-fungible tokens: assets that live on Web3 blockchains and belong to individuals.
The problem, as Dr Leonardo Bautista-Gomez pointed out, is that NFTs and similar digital assets point to files, including images and videos, that are too large to store on a blockchain. Many organisations keep these assets in a centralised manner that exposes them to failure, hacking, or censorship risks. Bautista-Gomez explained:
'While NFT metadata may be stored and replicated over thousands of nodes in a decentralised manner on a blockchain (e.g. Ethereum), such metadata often points to a large file (image, music, video) that is prohibitively expensive and inefficient to store in the same manner.'
Another vulnerability within Web3 is that attackers can target the frontends of DApps (decentralised applications). In their current form, these frontends are typically not served from blockchains but from centralised hosting and storage services. Without reliable decentralised storage to mitigate them, data breaches involving DApp frontends could become a threat. Frontend seizures and censorship also plague Web3.
We expect the asset and token ecosystem in Web3 to grow tremendously in the coming years. This growth represents a future risk because data is more closely tied to value. Implementing robust data storage practices will ensure our data's integrity and help maintain the value of digital collectables and assets. In their current form, stored assets are incredibly fragile, and a black swan event could have catastrophic effects on data storage systems.
How do we futureproof against such an immense problem to guarantee a decentralised web while making our assets more antifragile?
The Solution: Decentralised Node Storage, Erasure Coding, and Remote Auditing
We must decentralise storage. That is easier said than done. There are many complexities in decentralising data storage. In the current centralised storage model, companies and individuals tend to copy and store data at offsite locations. This form of data protection is called 'replication.' It is a relatively simple method of providing redundancy to data and having it as a convenient 'backup.' However, replication has two main pitfalls.
- Honeypot. Hackers can still target those offsite locations even if organisations replicate data and store it offsite. They can attempt to access or compromise replicated data. That data remains prone to censorship, deletion, or theft.
- Cost. Replicating large caches of data represents a financial liability for organisations. Imagine the annual storage cost for data replicated across various servers. Yes, it is enormous.
The solution to these problems appears straightforward, at least in theory. Break the data into smaller chunks, then expand and encode them with redundant data. Then, store the encoded chunks across multiple devices on a network. This 'chunking' or 'sharding' of data, combined with redundant encoding, is what a scheme called 'erasure coding' provides. The Storj team provided a clear definition of erasure coding:
'Erasure coding is a means of data protection in which data is broken into pieces, where each piece is expanded and encoded with redundant data.'
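To make the cost pitfall concrete, the sketch below compares the storage overhead and fault tolerance of plain replication with a k-of-n erasure code. It assumes an idealised code in which any k of the n stored pieces are enough to rebuild the data (Reed-Solomon codes have this property). The function names and the example parameters are illustrative and are not taken from Codex or Storj.

```python
from __future__ import annotations


def replication_overhead(copies: int) -> tuple[float, int]:
    """Return (storage multiplier, number of lost copies tolerated)."""
    if copies < 1:
        raise ValueError("need at least one copy")
    # Every copy is a full duplicate; data survives while one copy remains.
    return float(copies), copies - 1


def erasure_overhead(k: int, n: int) -> tuple[float, int]:
    """Return (storage multiplier, lost pieces tolerated) for a k-of-n code.

    Assumes an idealised code where any k of the n pieces rebuild the data.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Each piece holds 1/k of the data, and n pieces are stored in total.
    return n / k, n - k


if __name__ == "__main__":
    print("3 full copies:   ", replication_overhead(3))
    print("10-of-15 erasure:", erasure_overhead(10, 15))
```

Under these assumptions, three full copies cost 3x the original size and survive two losses, while a 10-of-15 code costs 1.5x and survives five lost pieces.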
Having erasure-coded data is essential because decentralised systems can reconstruct that data even if some chunks are missing. In this case, erasure coding is a tactic to gain fault tolerance. Many computing protocols and feedback systems leverage redundant (or erasure-coded) components to minimise faults and restore normal functionality during failures.
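The reconstruction idea can be shown with the simplest possible erasure code: k data chunks plus one XOR parity chunk, which can rebuild any single missing piece. This is a deliberately minimal sketch. Production systems generally use stronger codes that tolerate several losses, and nothing here is claimed to match the code Codex uses. The function names are hypothetical.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) and append one parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\x00")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def recover(pieces: list[bytes | None]) -> list[bytes]:
    """Rebuild at most one missing piece (marked None) by XOR-ing the others."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost piece")
    repaired = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired


if __name__ == "__main__":
    original = b"decentralised storage"
    pieces = encode(original, k=4)
    pieces[2] = None  # simulate one storage node going offline
    restored = recover(pieces)
    rebuilt = b"".join(restored[:4])[:len(original)]
    print(rebuilt == original)
```

The original length is kept separately in the demo so the zero padding can be stripped after reassembly.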
A typical example is aircraft components. The ‘cybernetic’ feedback systems in aircraft machinery maintain redundancy to prevent disastrous errors. This redundancy principle to protect a system from failures and ‘rebuild’ vital functions is similar to erasure coding.
Erasure coding is not new, though. Its roots go back to 1950, when Richard Hamming of Bell Laboratories developed 'forward error correction codes.' However, due to technical hurdles, it has yet to be used to its full potential within decentralised storage.
Developers have primarily implemented erasure coding as one component of larger systems. The Codex team is innovating on several fronts to build a decentralised storage solution for the future.
Let us examine some hurdles facing Codex and other organisations seeking to solve these problems.
Data Durability and Byzantine Generals
Maintaining data durability is one of the roadblocks to building these types of systems. For data to be durable, it must be resistant to failures or breakage. A Redis.com article provides a succinct definition:
'Data durability is a means of safeguarding data from loss or corruption in the event of an outage or failure. Data durability is the process by which one ensures data is (and remains) intact, devoid of any degradation. In essence, durable data means uncompromised data.'
Decentralised systems often struggle to keep data durable, i.e., uncompromised, because of network faults and unreliable or malicious nodes. Computer scientists describe this sticky wicket as the Byzantine Generals Problem. Some similarly refer to the issue as a 'coordination' or 'consensus' problem.
Many have compared this problem to an army encircling a city, preparing for an attack. Multiple generals are in charge of the army's divisions. The attack succeeds only if the divisions act on a common plan, but some generals may want to retreat. Others could be treacherous, seeking to sabotage the attack.
This problem of traitorous generals is also called a 'Byzantine fault.' In the case of data storage, we can represent the issue as a storage node that lies about what it is storing to reap higher rewards. For instance, it could mislead the network into thinking it is harbouring more data than it actually holds. The node could also generate multiple identities in a Sybil attack to deceive the network. The storage provider can then use those identities to dupe the network into thinking it runs multiple independent storage nodes when, in reality, it is storing everything in one location.
Many Byzantine fault attack vectors emerge within the context of network architectures, and solving for those is not always easy. What is the solution?
Remote Auditing and Zero Knowledge
Codex is leveraging a remote auditing scheme to verify that nodes hold the data they claim to possess and that all network nodes play by the rules. This form of remote auditing relies on zero-knowledge proofs (ZKPs), a technique sometimes referred to as 'verifiable computation.' The Ethereum Foundation (EF) describes ZKPs as follows:
'A zero-knowledge proof is a way of proving the validity of a statement without revealing the statement itself. The ‘prover’ is the party trying to prove a claim, while the ‘verifier’ is responsible for validating the claim.'
The privacy-centric use case the EF describes is the one most commonly cited for ZKPs, but ZKPs can also be used in remote auditing schemes. There is a problem, though. The calculations required to remotely audit storage nodes normally consume prohibitively large computational resources, which would make the network untenable to use. The prover must also be able to generate each proof quickly enough to keep up with the required proof frequency.
The ZKPs Codex uses as part of remote auditing to solve this problem are SNARKs, or succinct non-interactive arguments of knowledge. These SNARKs can be verified on the blockchain via a smart contract to provide evidence that a node is honest. In remote auditing, the prover generates a SNARK rather than having to provide a whole dataset. However, generating the proof can also be resource-intensive.
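The shape of a storage challenge can be illustrated without any SNARK machinery. In the sketch below, a verifier that keeps only a Merkle root challenges a prover for one chunk, and the prover answers with the chunk plus its sibling hashes. This is a plain Merkle-proof check, swapped in for illustration: it is not a SNARK, it is not zero-knowledge, and it is not presented as Codex's protocol. All function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest."""
    return hashlib.sha256(data).digest()


def merkle_levels(chunks):
    """Return every tree level, leaves first; odd levels reuse their last node."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(chunks) -> bytes:
    """The only value the verifier needs to keep."""
    return merkle_levels(chunks)[-1][0]


def prove(chunks, index):
    """Prover: return the challenged chunk and the sibling hashes up to the root."""
    path = []
    i = index
    for level in merkle_levels(chunks)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[i ^ 1])  # sibling sits next to node i in its pair
        i //= 2
    return chunks[index], path


def verify(root: bytes, index: int, chunk: bytes, path) -> bool:
    """Verifier: recompute the root from the chunk and the sibling hashes."""
    node = h(chunk)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


if __name__ == "__main__":
    chunks = [bytes([i]) * 32 for i in range(5)]
    root = merkle_root(chunks)
    chunk, path = prove(chunks, 3)
    print(verify(root, 3, chunk, path))        # honest answer
    print(verify(root, 3, b"made up", path))   # node without the real chunk
```

In this toy version the prover has to ship the challenged chunk itself. The appeal of a SNARK, as described above, is that the prover sends a short proof instead of the data. Either way, answering such challenges for every chunk, all the time, would be costly.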
To solve the issue, the Codex 'proving system' requests proofs for verification using probabilistic sampling. Rather than demanding proof for every piece of stored data, the network checks a small random sample of it, which allows a storage host to show it has the data while minimising the resource burden. Because the host cannot predict which pieces will be sampled, missing data is very likely to be caught after only a few checks. Effectively, the proof acts as compressed evidence of the data the host stores. Bautista-Gomez explains how probabilistic sampling lowers the cost:
‘You only need a few dozens of proofs to have a very high level of certainty that the storage node has your data. This reduces the computational cost several orders of magnitude, while still guaranteeing high reliability.’
Probabilistic sampling used in this fashion is the key to minimising computational constraints while keeping storage nodes honest.
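The arithmetic behind 'a few dozens of proofs' can be sketched as follows. The model is an assumption made for illustration: each sample checks one chunk chosen uniformly at random and independently of the others, and the node has lost a fixed fraction of its chunks. Under that model, the chance that every sample misses the gap is (1 - missing_fraction) ** samples. The function names are hypothetical.

```python
import math


def detection_probability(missing_fraction: float, samples: int) -> float:
    """Chance that at least one of `samples` random checks hits a missing chunk."""
    if not 0.0 < missing_fraction < 1.0:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    return 1.0 - (1.0 - missing_fraction) ** samples


def samples_needed(missing_fraction: float, confidence: float) -> int:
    """Smallest sample count whose detection probability reaches `confidence`."""
    if not 0.0 < missing_fraction < 1.0:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - missing_fraction))


if __name__ == "__main__":
    n = samples_needed(missing_fraction=0.10, confidence=0.99)
    print(n, round(detection_probability(0.10, n), 4))
```

Under these assumptions, a node that has silently dropped 10% of its chunks is caught with at least 99% probability after 44 samples, regardless of how many chunks the dataset contains.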
Furthermore, the network can initiate a 'lazy repair' mechanism to rebuild broken data or datasets if the datasets are incomplete or an error occurs. In this way, Codex can protect the network and remedy faults or random data failures (which are notoriously common), providing robust data durability guarantees. We will dive more deeply into remote auditing, SNARKs, and lazy repair in future pieces.
TLDR: Solving Censorship and Failure Risk
Overall, Codex strives to solve the problems around decentralised storage while staying true to data protection principles. Modern, centralised storage solutions represent a real threat to our valuable data, including current Web3 data. Worse than the weaknesses of data durability is the continuous threat of data censorship. When data is stored or backed up in a centralised location, bad actors can still delete or censor that data.
The Codex community will not compromise on its goal to make data storage more robust, efficient, and censorship-resistant. The future of data storage lies in total decentralisation, which requires the organisation to develop new processes, methods, and technologies. The team has come a long way and is working to solve some of the most vexing problems in network architecture, decentralised storage, and zero-knowledge cryptography.