Dagger: A Memory-Hard to Compute, Memory-Easy to Verify Scrypt Alternative

Over the past five years of experience with Bitcoin and alternative cryptocurrencies, one important property for proof of work functions that has been discovered is that of "memory-hardness" - computing a valid proof of work should require not only a large number of computations, but also a large amount of memory. Currently, two major categories of memory-hard functions exist, scrypt and Primecoin mining, but both are imperfect: neither requires nearly as much memory as an ideal memory-hard function could, and both suffer from time-memory tradeoff attacks, in which the function can be computed with significantly less memory than intended at the cost of sacrificing some computational efficiency. This paper presents Dagger, a memory-hard proof of work based on moderately connected directed acyclic graphs (DAGs, hence the name) which, while far from optimal, has much stronger memory-hardness properties than anything else in use today.

Why Be Memory-Hard?

The main reason why memory hardness is important is to make the proof of work function resistant to specialized hardware. With Bitcoin, whose mining algorithm requires only a simple SHA256 computation, companies have for over a year been producing application-specific integrated circuits (ASICs), chips designed and configured in silicon for the sole purpose of computing billions of SHA256 hashes in an attempt to "mine" a valid Bitcoin block. These chips have no legitimate applications outside of Bitcoin mining and password cracking, and the presence of these chips, which are thousands of times more efficient per dollar and kilowatt hour at computing hashes than generic CPUs, makes it impossible for ordinary users with generic CPU and GPU hardware to compete.

This dominance of specialized hardware has several detrimental effects:

  1. It negates the democratic distribution aspect of cryptocurrency mining. In a generic hardware-dominated ecosystem, the fact that everyone has a computer guarantees that everyone will have an equal opportunity to earn at least some of the initial money supply. With specialized hardware, this factor does not exist; each economic actor's mining potential is linear (in fact, slightly superlinear) in their quantity of pre-existing capital, potentially exacerbating existing wealth inequalities.
  2. It increases resource waste. In an efficient market, marginal revenue approaches marginal cost. Since mining revenue is a linear function of money spent on mining hardware and electricity, this also implies that total revenue approaches total cost. Hence, in a specialized-hardware-dominated ecosystem, the quantity of resources wasted is close to 100% of the security level of the network. In a CPU- and GPU-dominated ecosystem, because everyone already has a computer, people do not need to buy specialized hardware for the first few hashes per second worth of mining power. Hence, revenue is sublinear in cost - everyone gets a bit of revenue "for free". This implies that the quantity of resources wasted by the network is potentially considerably lower than its security parameter.
  3. It centralizes mining in the hands of a few actors (ie. ASIC manufacturers), making 51% attacks much more likely and potentially opening the network up to regulatory pressure.

Specialized hardware is so powerful because it includes many thousands of circuits specifically designed for computing the proof of work function, allowing the hardware to compute the function thousands of times in parallel. Memory hardness alleviates this problem by making the main limiting factor not CPU power, but memory. One can also make a modest improvement by targeting CPU clock speed, but the extent to which such optimizations can be made is fundamentally limited to a very low value by technological considerations. Thus, improvement through parallelization hits a roadblock: running ten memory-hard computations in parallel requires ten times as much memory. Specialized hardware manufacturers can certainly pack terabytes of memory into their devices, but the effect of this is mitigated by two factors. First, hobbyists can achieve the same effect by simply buying many off-the-shelf memory cards. Second, memory is much more expensive to produce (if measured in laptop equivalents) than SHA256 hashing chips; the RAM used in ordinary computers is already essentially optimal.

Algorithm specification:

Essentially, the Dagger algorithm works by creating a directed acyclic graph (the technical term for a tree where each node is allowed to have multiple parents) with ten levels including the root and a total of 2^25 - 1 values. In levels 1 through 8, the value of each node depends on three nodes in the level above it, and the number of nodes in each level is eight times larger than in the previous. In level 9, the value of each node depends on 16 nodes in the level above, and the level is only twice as large as the previous; the purpose of this is to make the natural time-memory tradeoff attack artificially costly to implement at that first level, so that implementing any time-memory tradeoff optimizations at all would not be a viable strategy. Finally, the algorithm uses the underlying data, combined with a nonce, to pseudorandomly select four bottom-level nodes in the graph, and computes the hash of all of these nodes put together. If the miner finds a nonce such that this resulting hash is below 2^256 divided by the difficulty parameter, the result is a valid proof of work.

Let D be the underlying data (eg. in Bitcoin's case the block header), N be the nonce and || be the string concatenation operator (ie. 'foo' || 'bar' == 'foobar'). The entire code for the algorithm is as follows:

spread(L) = 16 if L == 9 else 3

node(D,xn,0,0) = D
node(D,xn,L,i) =
    with m = spread(L)
         p[k] = sha256(D || xn || L || i || k) mod 8^(L-1) for k in [0...m-1]
    sha256(node(D,xn,L-1,p[0]) || node(D,xn,L-1,p[1]) || ... || node(D,xn,L-1,p[m-1]))

eval(D,N) =
    with xn = floor(N / 2^26)
         p[k] = sha256(D || xn || N || k) mod (8^8 * 2) for k in [0...3]
    sha256(node(D,xn,9,p[0]) || node(D,xn,9,p[1]) || ... || node(D,xn,9,p[3]))

Objective: find N such that eval(D,N) < 2^256 / diff
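
For concreteness, here is a minimal Python sketch of one way to read the spec above. It uses memoized recursion (ie. the low-memory strategy costing roughly 6000 hashes per nonce, rather than the full pre-computed table), and the colon-joined decimal serialization fed into SHA256 is an arbitrary assumption of this sketch, since the spec does not fix an encoding:

import hashlib

def H(*args):
    # SHA256 over a ':'-joined decimal serialization (an encoding assumed
    # by this sketch; the spec does not specify one), as a big integer.
    data = b':'.join(str(a).encode() for a in args)
    return int.from_bytes(hashlib.sha256(data).digest(), 'big')

def spread(L):
    return 16 if L == 9 else 3

def node(D, xn, L, i, memo):
    # Memoized recursive node computation; memo maps (L, i) to node values.
    if L == 0:
        return D
    if (L, i) not in memo:
        parents = [node(D, xn, L - 1, H(D, xn, L, i, k) % 8 ** (L - 1), memo)
                   for k in range(spread(L))]
        memo[(L, i)] = H(*parents)
    return memo[(L, i)]

def eval_dagger(D, N):
    # Evaluate one nonce: pick four level-9 nodes and hash them together.
    xn = N // 2 ** 26
    memo = {}
    leaves = [node(D, xn, 9, H(D, xn, N, k) % (8 ** 8 * 2), memo)
              for k in range(4)]
    return H(*leaves)

def mine(D, diff, max_nonce=2 ** 16):
    # Search for a nonce whose evaluation falls below the target.
    target = 2 ** 256 // diff
    for N in range(max_nonce):
        if eval_dagger(D, N) < target:
            return N
    return None

print(mine('example block header', 4))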

Properties:

  1. With 2^25 memory (ie. 512 MB, since each memory unit is a 32-byte hash), the optimal algorithm is to pre-compute the 2^24 leaf nodes and run eval with all 2^25 possibilities. The main computational difficulty is computing SHA256 hashes. Each node in levels 1 - 8 will take two hashes, so the levels will take up 2^25 - 4 hashes in total (not -2 since the root node requires no hash). Each node in level 9 takes 16 hashes, adding 2^28 hashes, and then each nonce will take 8 hashes, adding 2^28 hashes once again. Thus, running through 2^26 nonces will take 2^29 + 2^25 - 4 hashes altogether.
  2. One potential problem is lazy evaluation; parts of the tree can be evaluated only as needed in order to reduce the number of hashes required. However, because a (pseudo-)random node out of 2^25 is taken 2^28 times, we can statistically estimate that each node has a 1/e^8 chance of remaining unused - only about 0.03% (see the numerical check after this list). Hence, the benefit from lazy evaluation is insubstantial.
  3. It is possible to run the algorithm with very little memory, but then, for each nonce, one must re-compute the four bottom-level nodes that the proof of work result depends on. Naively doing an exponential calculation, we get that 3^8 * 16, or 104976, steps are required to compute a single nonce. In practice, many values are reused, so only an average of almost exactly 6000 hashes is needed, but this is still a very large amount - the memory-based algorithm manages a mere 8 computations per nonce, giving low-memory hardware a 750x penalty.
  4. Verification requires computing one nonce without pre-computation, so it also takes 6000 hashes and about 160 KB of memory.
  5. Because each node in the last level has 16 parents instead of 3, the time-memory tradeoff attack is severely weakened - attempting to store 8 levels instead of 9 reduces memory usage by a factor of 2 but slows down computation by a factor of 16. Thus, no practical time-memory tradeoff attack exists; close to the full 512 MB is required for any reasonable level of efficiency.
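
As a quick sanity check on the arithmetic in properties 2 and 3 above, the following Python lines (plain arithmetic, using only the numbers already stated) reproduce the unused-node fraction, the naive step count and the 750x penalty:

unused = (1 - 1 / 2 ** 25) ** (2 ** 28)  # chance that a given node is never used
print(unused)                            # ~0.000335, ie. about 0.03% (~1/e^8)
print(3 ** 8 * 16)                       # 104976 naive steps per nonce
print(6000 / 8)                          # 750.0x penalty versus the in-memory miner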

Conclusion

This algorithm provides a proof of work mining function with memory hardness properties that are not ideal, but that are nevertheless a massive improvement over anything available previously. It takes 512 MB to evaluate, and about 160 KB of memory and roughly 6000 hashes to verify, and even the tiniest time-memory tradeoff is not worthwhile to implement because of the bottom-level branching adjustment. These parameters allow Dagger to be much more daring in its memory requirements than Primecoin or scrypt, asking for 512 MB of RAM for a single thread. Because the primary determinant of hardness is memory, and not computation, specialized hardware has only a tiny advantage; even an optimal Dagger mining ASIC would have little to offer over a hobbyist purchasing hundreds of gigabytes of memory cards off the shelf and plugging them into a medium-power GPU. And even in such an equilibrium, mining with ordinary CPUs will likely continue to be practical.

Appendix: Why 6000 hashes?

To compute the required four values in level 9, one must look at 64 randomly selected values in level 8. With negligibly less than 100% probability, these values will all be different. From there, suppose that you have computed V[n] different values in level n. You will need to make an expected 3 * V[n] queries to level n-1, which contains 8^(n-1) cells. Because the queries are pseudorandom, after these queries the probability that any individual cell will remain empty is (1 - 1/8^(n-1)) ^ (3 * V[n]). Hence, we can estimate that the expected value of V[n-1], the number of cells that will not be empty (ie. will need to be computed), is close to 8^(n-1) - 8^(n-1) * (1 - 1/8^(n-1)) ^ (3 * V[n]). We thus have a recursive computation:

V[8] = 64
V[7] = 8^7 - 8^7 * (1 - 1/8^7) ^ (64 * 3) = 191.99
V[6] = 8^6 - 8^6 * (1 - 1/8^6) ^ (191.99 * 3) = 575.34
V[5] = 8^5 - 8^5 * (1 - 1/8^5) ^ (575.34 * 3) = 1681.38
V[4] = 8^4 - 8^4 * (1 - 1/8^4) ^ (1681.38 * 3) = 2900.72
V[3] = 8^3 - 8^3 * (1 - 1/8^3) ^ (2900.72 * 3) = 512.00
V[2] = 8^2 - 8^2 * (1 - 1/8^2) ^ (512.00 * 3) = 64.00
V[1] = 8^1 - 8^1 * (1 - 1/8^1) ^ (64.00 * 3) = 8.00
V[0] = 8^0 - 8^0 * (1 - 1/8^0) ^ (8.00 * 3) = 1.00

V[0] + V[1] + ... + V[8] = 5998.43

This computation is not quite accurate, since EV(f(x)) is not the same as f(EV(x)) (EV = expected value) when f is not linear, as is the case here, but because we are not dealing with especially fat-tailed or skewed distributions, the standard deviations are sufficiently low for this inaccuracy to have no substantial effect.
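
The recursion is easy to reproduce; a short Python loop implementing the same expected-value approximation recovers the ~5998 figure:

# Expected number of distinct nodes computed per level during one
# no-precomputation evaluation, per the recursion above.
V = {8: 64.0}
for n in range(8, 0, -1):
    size = 8 ** (n - 1)  # number of cells in level n-1
    V[n - 1] = size * (1 - (1 - 1 / size) ** (3 * V[n]))
print(sum(V.values()))   # ~5998.4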

Acknowledgements