# Suggestion for proof of retrievability

(Originally posted here)

## Scenario

In this scenario, $A$ and $B$ want to distribute segments of their data (preferably encrypted). We do not care about coverage; we are trying to maximize the amount of remotely stored data. The game is essentially that any peer will try to minimize its own storage requirements and thus maximize the other peers’.

## Protocol

Two peers $A$ and $B$ will repeatedly negotiate storage space according to the following protocol, where $A$ is the prover and $B$ the verifier. Let $S(B,A)$ be the set of segments owned by $B$ and stored by $A$. Moreover, let $s \in S(B,A)$ be an arbitrary segment and let $\text{sig}(s)$ denote the signature of $s$.

1. $B$ picks a random string $r$ of length $n$ bits (equal to the length of a signature).
2. $B$ asks $A$ to send the closest segment from $S(B,A)$, i.e., the $s$ such that $\| r - \text{sig}(s) \|_1$ is minimized.
3. $B$ verifies the segment $s$ via its signature. Then, using the distance $\| r - \text{sig}(s) \|_1$, $B$ can estimate the number of segments stored by $A$.

Repeat until the variance is low enough. The procedure is performed in both directions.
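A single round of the protocol can be sketched as follows. This is a minimal illustration, not the actual implementation: the "signature" is modeled as a plain 256-bit hash, and the stored segments are random stand-ins.

```python
import hashlib
import random

n = 256  # challenge/signature length in bits

def sig(segment: bytes) -> int:
    # Hypothetical signature: modeled here as a 256-bit hash of the segment.
    return int.from_bytes(hashlib.sha256(segment).digest(), "big")

def hamming(a: int, b: int) -> int:
    # Hamming distance ||a - b||_1 between two n-bit strings.
    return bin(a ^ b).count("1")

# S(B, A): segments owned by B and stored by A (random stand-ins here).
stored = [random.randbytes(32) for _ in range(1000)]

# 1. B picks a random n-bit string r.
r = random.getrandbits(n)

# 2. A responds with the stored segment whose signature is closest to r.
s = min(stored, key=lambda seg: hamming(r, sig(seg)))

# 3. B verifies the signature and records the distance, from which it
#    will estimate the number of segments A stores.
d = hamming(r, sig(s))
```

Note that $A$'s honest strategy is simply a nearest-neighbor search over its stored signatures; $B$ only needs the returned segment and the distance $d$.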

## Approximating $\delta$

Intuitively, we can think of $S(B,A)$ as a random (non-linear) code $\mathcal{C}$ over $\mathbb{F}_2^n$, and the task is to find a minimum-weight codeword in $\mathcal{C} + r = \{ c + r \mid c \in \mathcal{C}\}$. Then $N$ corresponds to the number of codewords in $\mathcal{C}$.

Assume that $A$ has stored $N$ segments. Without loss of generality, we can assume that $r$ is the all-zero string, transforming the problem into finding the minimum of $N$ binomial variables. Let $X_1, X_2, \dots, X_N \sim \textsf{Bin}(n, \frac12)$ and $Z := \min(X_1, X_2, \dots, X_N)$. Then, with high probability and for larger values of $N$, $\frac n2 - \sqrt{\frac n3 \log N} \geq E(Z) \geq \frac n2 - \sqrt{\frac n2 \log N}.$
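A quick Monte Carlo check of these bounds (a sketch; the parameters $n = 256$ and $N = 10{,}000$ are chosen here only for illustration):

```python
import math
import random

n, N = 256, 10_000

def min_weight(n: int, N: int) -> int:
    # Z = min(X_1, ..., X_N) with X_i ~ Bin(n, 1/2), drawn as the
    # popcounts of N uniformly random n-bit strings.
    return min(bin(random.getrandbits(n)).count("1") for _ in range(N))

# Estimate E(Z) by averaging over independent trials.
trials = 50
z_mean = sum(min_weight(n, N) for _ in range(trials)) / trials

upper = n / 2 - math.sqrt(n / 3 * math.log(N))  # upper bound on E(Z)
lower = n / 2 - math.sqrt(n / 2 * math.log(N))  # lower bound on E(Z)
```

For these parameters the empirical mean should land between roughly $93.7$ and $100.0$, consistent with the two bounds.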

The above can be obtained via a Gaussian approximation and the standard expression for the expectation of the minimum of Gaussian random variables; we omit the proof here. Of course, these bounds can be pre-computed for reasonable values of $N$.

Assume there exists a peer $A$ with infinite computing power that tries to minimize its required storage for a given minimum distance $d$. Acting optimally, $A$ will pick equidistant points in the space $\mathbb{F}_2^n$. This corresponds to a perfect error-correcting code with minimum distance $d$. The covering-coding bound states that $d \leq n-k$. For a perfect code, which has minimal size, we have $d = n-k$. Therefore, the size of the code is $|\mathcal C| = 2^{n-d}$, which is also the number of elements $A$ needs to store.
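As a sanity check on the parameters, the adversary's storage requirement under this bound is easy to compute (a sketch; the value $d = 100$ is just an example):

```python
n = 256

def adversarial_storage(n: int, d: int) -> int:
    # Number of points an optimal adversary must keep so that some stored
    # signature lies within distance d of any challenge: |C| = 2^(n - d),
    # per the covering-coding bound with d = n - k.
    return 2 ** (n - d)

# For n = 256 and d = 100 this is 2^156 points: astronomically more than
# any honest peer would store, so cheating by covering is infeasible here.
cheat_cost = adversarial_storage(n, 100)
```

In other words, for distances $d$ well below $n$, the covering strategy costs the adversary far more storage than honest behavior would.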

A glaring problem is that $A$ cannot arbitrarily pick points (since points must be valid signatures). The best $A$ can do is to decide upon a perfect code $\mathcal{C}$ (or pick one that is closest to the current set of points — which is a very hard problem). Then, $A$ will look for points $v$ and $v'$ that map to the same codeword $c \in \mathcal{C}$ and discard the point that is farthest from $c$. For larger $n$ and smaller $d$, it is not realistic to achieve the covering-coding bound, since in reality there are not that many shared segments. So, even given infinite computation, $A$ will not be able to sample points until a perfect code is achieved.

The covering-coding bound serves as a crude estimation on how to set the parameters.

## Simulating

In Figure $1$, we see that the upper bound for the expected minimum distance approximates $N$ well for values larger than $2^7$, with $n = 256$. Assume that $A$ sends a signature $s$ along with the segment. Let $\delta(r,s) = \|r - \text{sig}(s)\|_1 - \frac n2.$ $B$ can now apply whichever bound it chooses. Assume that $\delta(r,s) = 100 - \frac{256}{2}.$ Then, using the expression $N \approx \exp\left(-\frac 2n \cdot \delta(r,s) \cdot \left|\delta(r,s)\right| \right),$ $B$ determines that $A$ most likely has at least $458$ segments. Using the upper bound on $E(Z)$ instead, $A$ has at least $9775$ segments.
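The estimator can be written in a few lines. This sketch inverts the bound $E(Z) = \frac n2 - \sqrt{\frac nc \log N}$, where $c = 2$ gives the conservative estimate and $c = 3$ the one derived from the upper bound:

```python
import math

def estimate_N(distance: int, n: int = 256, c: float = 2.0) -> float:
    # delta(r, s) = ||r - sig(s)||_1 - n/2; inverting
    # E(Z) = n/2 - sqrt(n/c * log N) gives N = exp(-c/n * delta * |delta|).
    delta = distance - n / 2
    return math.exp(-c / n * delta * abs(delta))

# Observed minimum distance 100 with n = 256, i.e. delta = -28:
conservative = math.ceil(estimate_N(100, c=2.0))  # -> 458
optimistic = math.ceil(estimate_N(100, c=3.0))    # -> 9775
```

This reproduces the two figures quoted above: at least $458$ segments under the conservative bound, at least $9775$ under the other.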

## So, what is all this?

The intention is to create a simplistic protocol for asserting that shared data actually is retrievable, based on a probabilistic model. This could serve as a basis for shared storage, in which segments are distributed among peers. Even if a peer no longer holds the segments it has distributed, it can use the protocol to verify that it could retrieve them, without actually doing so.