The Effectiveness of Distributed Proving

Guyu, CTO
May 15, 2024

Distributed computing achieves horizontal scaling through massive parallelism across a network of nodes, and we show that it can be effectively applied to ZK proving to reduce both latency and cost. In this technical report, we will provide an overview of our distributed proving methodology and go over benchmarks for PSE’s EVM circuit (k=18) using our latest distributed halo2 prover.

Distributed Commitment

At a high level, ZKP generation is a two-step process: committing to witness values and evaluating constraints. Since witness values are usually laid out as an execution trace table and committed column by column, we can easily distribute commitment work by partitioning the table vertically and assigning each slice to a different node.
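To make the partitioning concrete, below is a minimal Rust sketch of how a worker might pick out and commit its slice of columns. `Scalar`, `Commitment`, and `commit_column` are placeholders for illustration, not halo2's actual types or API:

```rust
// Placeholders standing in for a real field element, a real
// commitment, and the actual commitment routine (an MSM in halo2).
type Scalar = u64;
struct Commitment;

fn commit_column(_col: &[Scalar]) -> Commitment {
    Commitment
}

/// Split the advice columns into contiguous slices, one per worker,
/// and commit to this worker's slice. Every worker generates the full
/// trace locally, so the only coordination required is agreeing on
/// which column range each worker owns.
fn commit_my_slice(
    columns: &[Vec<Scalar>],
    n_workers: usize,
    worker_id: usize,
) -> Vec<Commitment> {
    let per_worker = (columns.len() + n_workers - 1) / n_workers;
    let start = (worker_id * per_worker).min(columns.len());
    let end = (start + per_worker).min(columns.len());
    columns[start..end].iter().map(|c| commit_column(c)).collect()
}
```

Because each column is committed independently, the per-worker results can simply be concatenated afterwards, which is why the overhead of this scheme is negligible.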

With the trace table generated locally on every node, there is virtually no distribution overhead. Our benchmarks for advice column commitment show that two c5a.8xlarge instances are about twice as fast as one.

                              Single Prover  Worker 1/2   Worker 2/2
Instance Type                 c5a.8xlarge    c5a.8xlarge  c5a.8xlarge
vCPUs                         32             32           32
Advice Commitment Time (ms)   5030           2270         2510

Distributed Constraint Evaluation

Distributed constraint evaluation is arguably much more important than distributed commitment since it takes up the majority of proving time, especially for large circuits. For example, 90% of the proving time for the EVM circuit is spent on evaluating the so-called “h poly”, a high-degree polynomial that aggregates a random linear combination of all constraints.
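For context, halo2's h poly is, up to extended-domain details we elide here, the quotient of a challenge-weighted combination of all constraint polynomials by the vanishing polynomial of the evaluation domain:

```latex
% y: the verifier's random challenge, keeping constraints independent
% C_i(X): the i-th constraint polynomial (gates, permutations, lookups)
% n = 2^k: the size of the evaluation domain
h(X) = \frac{\sum_i y^i \, C_i(X)}{X^n - 1}
```

The numerator touches every constraint in the circuit, which is exactly why this step dominates proving time and why splitting it across nodes pays off.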

Constraint evaluation is also more difficult to distribute optimally. A circuit can have tens or hundreds of thousands of constraints, each of which differs in both evaluation complexity and column dependencies. Brute-forcing the most “balanced” split is intractable, so we must instead rely on heuristics. For the EVM circuit, we ran a series of experiments to find the optimal splitting boundary for custom gates, permutations, and lookups. The result is a near-perfect split with a 1.75x speedup on two c5a.8xlarge instances compared to one.
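As an illustration of the kind of heuristic involved, here is a sketch of a greedy longest-processing-time split in Rust: each constraint gets a rough cost estimate, and constraints are assigned one by one to the currently least-loaded worker. This is a stand-in for our experiment-driven search, not the exact procedure we used:

```rust
/// A constraint with a heuristic cost estimate, e.g. derived from its
/// degree and the number of columns it references. (Illustrative only.)
struct Constraint {
    id: usize,
    est_cost: u64,
}

/// Greedy LPT scheduling: sort by descending cost, then always assign
/// to the worker with the smallest accumulated load. Returns, for each
/// worker, the list of constraint ids it should evaluate.
fn greedy_split(mut constraints: Vec<Constraint>, n_workers: usize) -> Vec<Vec<usize>> {
    constraints.sort_by(|a, b| b.est_cost.cmp(&a.est_cost));
    let mut loads = vec![0u64; n_workers];
    let mut assignment = vec![Vec::new(); n_workers];
    for c in constraints {
        // Index of the least-loaded worker so far.
        let (w, _) = loads.iter().enumerate().min_by_key(|(_, l)| **l).unwrap();
        loads[w] += c.est_cost;
        assignment[w].push(c.id);
    }
    assignment
}
```

The hard part in practice is the cost model: two constraints of the same degree can differ wildly in evaluation time depending on which columns they touch, which is why we validate candidate splits experimentally.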

                        Single Prover  Worker 1/2   Worker 2/2
Instance Type           c5a.8xlarge    c5a.8xlarge  c5a.8xlarge
vCPUs                   32             32           32
H Poly Eval Time (ms)   464891         264399       266732

Scaling Beyond Two Nodes

Although the benchmarks look promising, this is only the beginning. We need to scale beyond just two nodes to realize the full potential of distributed proving.

Extrapolating from the numbers we have, it is reasonable to expect at least an order of magnitude faster proving as we pile on more nodes. And since our current setup uses CPU-only machines, any speedup from specialized hardware like GPUs becomes an additional multiplier, allowing us to deliver unprecedentedly low proving latency.

On the flip side, a high degree of distribution allows us to use many cheap low-end machines instead of a single ultra-high-end rig. Since hardware performance scales much more slowly than price, our setup also enables low-cost proving without compromising latency. Moreover, with low-end machines eligible for proving, we can tap into the vast pool of idle commodity hardware in data centers, mining farms, and, most importantly, in the hands of individuals like you and me.

For distributed commitment, scaling to more than two nodes is fairly trivial as we just need to change the way we partition the execution trace. However, it isn’t so simple for distributed constraint evaluation because manually searching for the optimal splitting boundaries becomes impractical as the number of nodes increases (and as we expand to other circuits). To address this issue, our team is actively developing an automated optimizer with intelligent heuristics for generating initial candidates and an orchestration framework that does experimentation-based fine-tuning.
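To sketch what experimentation-based fine-tuning could look like, the loop below rebalances an assignment by repeatedly benchmarking it and moving one constraint from the slowest worker to the fastest until the measured times converge. `run_benchmark` is a hypothetical hook standing in for a real orchestration step, not part of our actual framework:

```rust
/// Iteratively rebalance `assignment` (constraint ids per worker).
/// `run_benchmark` runs a proof with the given assignment and returns
/// per-worker h-poly evaluation times in milliseconds. (Hypothetical.)
fn fine_tune(
    mut assignment: Vec<Vec<usize>>,
    run_benchmark: impl Fn(&[Vec<usize>]) -> Vec<u64>,
    max_rounds: usize,
    tolerance_ms: u64,
) -> Vec<Vec<usize>> {
    for _ in 0..max_rounds {
        let times = run_benchmark(&assignment);
        let slow = (0..times.len()).max_by_key(|&i| times[i]).unwrap();
        let fast = (0..times.len()).min_by_key(|&i| times[i]).unwrap();
        if times[slow] - times[fast] <= tolerance_ms {
            break; // workers are balanced within tolerance
        }
        // Shift one unit of work from the slowest worker to the fastest.
        if let Some(c) = assignment[slow].pop() {
            assignment[fast].push(c);
        }
    }
    assignment
}
```

In practice the optimizer would move the constraint whose estimated cost best closes the gap rather than an arbitrary one, but the measure-adjust-repeat structure is the same.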

In the coming weeks, we will continue to post updates here. We also plan to open-source everything as soon as our distributed halo2 prover is mature enough for an alpha release, so please stay tuned!