
Introducing Sindri

· 6 min read

Introducing Sindri — A Modernized ZK DevOps and Proving Stack

We're proud to unveil Sindri, a modernized zero-knowledge proof (ZKP) DevOps and proving stack, designed with the developer in mind. Sindri represents our contribution to the developer ecosystem and a step towards transforming ZKPs from a niche cryptographic concept into an accessible, universal utility.

Sindri has already generated countless proofs across a myriad of popular proving systems for ZK-enabled apps. We've opened our beta for developers. Feel free to sign up here!

Persistent Problems with ZK Development Today

ZK development is mired in pitfalls from development through production. Many developers fall into the trap of thinking the hard work is done once they've decided on a proving framework, fine-tuned their circuits, and perhaps even self-provisioned "beefy" hosted machines from a cloud provider. The truth is that this is when many of the most complex, recurring tasks begin, and existing solutions fall short.

Today's developers grapple with an array of pervasive ZK-related DevOps and infrastructure challenges, ranging from managing large proving keys and network latency to handling cold starts and versioning. Other important issues include cost control, integrating hardware acceleration, cloud infrastructure, and instance type selection, and establishing intelligent queue logic for workloads. Additionally, constant oversight of infrastructure and ensuring availability are critical. Falling short in any of these areas could lead to failure, yet proficiency in each is now a basic necessity. This underscores the urgent need for modernized tools for today's ZK developers.

(If this sounds familiar to you, email us at hello@sindri.app, and let's work together on how we can help you!)

Sindri: Edge Cryptography

Sindri is our contribution towards transformative change at the DevOps and infrastructure level required to unlock greater developer freedom. Our initial product release allows any developer to easily access and scale production-caliber ZK proving from anywhere on the planet, across any stage in their development cycle, across the DSLs and proving systems they love, and across any ecosystem (blockchain, web2, etc.) with near-zero lines of code through the Sindri SDK and API.

Sindri is a paradigm shift toward what we call "Hybrid Edge Cryptography": offloading intensive cryptographic computation from the source device while keeping the results localized, verifiable, and high-performance for any device that requires them. Embracing this paradigm means continuously driving innovation toward the edge while maintaining the flexibility to shift workloads dynamically to serverless, high-performance operators as required. We've engineered Sindri to be modular, portable, and powerful to fit the current and future needs of this paradigm, and we'll share more soon.

For those eager to learn more about Sindri's architecture today, we encourage you to take a spin through our documentation.

Blockchain is the Substrate for ZK — the Future is Much Bigger

While ZK has clearly found its early product-market fit in blockchain, it is just the beginning of the Sindri story. Our vision is to become the bedrock orchestration layer for ZKP development, integrating seamlessly with various proving systems, hardware vendors, and critical proving participants, and catering to an ever-expanding array of applications and use cases across digital ecosystems:

The Emerging Service Layer: We anticipate the emergence of a service layer, or "Halo Applications," where applications provide multifaceted, auxiliary services to blockchains and rollups leveraging ZK. The earliest of these can be seen in the form of breakthrough technology such as coprocessors and zkML. Expansion at this layer is inevitable and will not only enrich blockchain functionality but also bridge gaps between various digital ecosystems.

AI, Trust, and the Expansion: We're entrusting AI with greater decision-making power as it becomes more advanced. Enhanced cryptographic trust mechanisms are key components for reliably auditing and verifying AI actions, ensuring transparency and accountability. They also enable broader, permissionless access to these AI systems. Sindri is at the forefront of this shift, building streamlined infrastructure for secure, efficient AI interactions.

Sindri Circuit Hub: An organized repository for open source circuits, offering a convenient starting point for circuit developers, aiming to enable greater composability within the circuit developer ecosystem. We will be seeding the Hub with the building blocks for verifiable PageRank in Noir, compression with Gnark, verifiable machine learning inference using Circom, and a float radius tutorial using the Axiom team's Halo2 tooling. We hope these inspire builders looking to take advantage of the unique use cases of ZK technology and we'll dig into each in more depth soon.

Permeation into Web2: The attributes that make ZKP appealing in blockchain environments can be cross-purposed into the Web2 landscape. We're actively engaged in conversations with organizations that seek to bring ZKP into the fray of environments that require high-integrity source data and computation.

Our Team and Advisors

Our team comprises PhD-level mathematicians, engineers, and builders with rich backgrounds in cryptography, network intrusion detection, software development and design, algorithmic optimization, deep learning, and digital signal processing. As developers, operators, and former academics, we bring deep expertise in secure and trustless networks and decades of experience building critical infrastructure. Our team's current and prior work includes creating groundbreaking machine learning techniques that enhanced information retrieval and set new industry standards, pioneering the first use of unsupervised machine learning for particle identification in high-energy nuclear physics, and algorithmic optimization for zero-knowledge processing.

$5M Seed Funding: Accelerating Our Vision

We are also thrilled to announce our $5M Seed funding, led by CoinFund with participation from strategics and angels across chip design and fabrication, signal processing, high-performance computation, and renowned figures in the fields of cryptography and blockchain. This investment will fuel our team expansion, developer network growth, and network seeding.

We're excited to join the CoinFund portfolio of world-class companies advancing zero-knowledge technology and the integration of machine learning, on and off-chain. Together, we're dedicated to innovating with enhanced developer tools and pushing the boundaries of ZK technology, aiming to make blockchains more intelligent and the world's data verifiable.

Join us in shaping the ZK development landscape. For career opportunities, visit our Careers Page. To integrate with Sindri and for partnership opportunities, contact us at hello@sindri.app.

Discover more about Sindri here and stay updated by following us on Sindri's X account.

Friendly Introduction to Sindri's API

· 8 min read

Zero-Knowledge Programs

Inspired by this thread.

Suppose you have a function that takes in two variables and returns some result:

\Large{ \text{result} = doSomething(x, y) }

There are a few consequences of executing doSomething the traditional way:

  1. Anyone wishing to verify result will need both inputs x and y
  2. They will also have to expend compute resources to run doSomething

In a trustless setting, you may want to keep some of the variables private or you may want to convince anyone that the result is legitimate without having to rerun the program. Zero-knowledge proofs modify the execution of your program to look more like this:

\Large{ \text{result}, \text{proof} = doSomething^{ZK}(\text{public}\ x,\ \text{private}\ y) }

so that:

  1. broadcasting result and proof reveals nothing about y
  2. verifying result only requires proof and x
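The cheap-verification half of this picture can be illustrated without any cryptography. The sketch below is a toy in Python (our own example, not part of Sindri's stack) using Freivalds' algorithm: a verifier checks a claimed matrix product far faster than recomputing it. Unlike a real ZKP it hides nothing about the inputs; it only demonstrates that verifying can be much cheaper than rerunning.

```python
import random

random.seed(0)  # deterministic for this demo

def matmul(A, B):
    """Naive O(n^3) multiply -- the expensive 'doSomething' a verifier wants to avoid."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def freivalds_verify(A, B, C, rounds=20):
    """Probabilistically check A @ B == C in O(n^2) work per round.

    Tests A(Bv) == Cv for random 0/1 vectors v; a wrong C slips
    through a single round with probability at most 1/2."""
    n = len(A)
    for _ in range(rounds):
        v = [random.randint(0, 1) for _ in range(n)]
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        ABv = [sum(A[i][j] * Bv[j] for j in range(n)) for i in range(n)]
        Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        if ABv != Cv:
            return False
    return True

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
assert freivalds_verify(A, B, C)          # honest result accepted
bad = [row[:] for row in C]
bad[0][0] += 1
assert not freivalds_verify(A, B, bad)    # tampered result rejected
```

A full ZK proof system adds the second property on top of this one: the check succeeds without revealing the private inputs at all.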

A major caveat is that doSomething^ZK takes much longer to execute. Research on zero-knowledge proofs has proliferated in the last few years, and each line of work attempts to optimize at least one of the following:

  1. The runtime of doSomething^ZK
  2. The verification time given result and proof
  3. The size of proof

Proofs

In this section, we present a high-level overview of the KZG polynomial commitment scheme in order to explain why doSomething^ZK generally takes so long. KZG is a popular protocol choice for blockchain developers because the proof size is small and the verification work is minimized. Furthermore, while proving still needs to happen off-chain, verification is enabled on-chain.

One way to convince someone that you’ve run code is to show your work by publishing the execution trace, i.e. all of the intermediate steps and variable values. You could convince any verifier that you ran the program at the price of giving the verifier full knowledge of private inputs. Additionally, the size of such a proof is unappealing. However, this transformation is still a useful first step towards a zero-knowledge proof.

The diagram above depicts the process of transforming pseudocode into an execution trace, which is sometimes called arithmetization. We will call the right hand side a circuit: it is a matrix with each entry given by a formula representing some intermediate value of the program. For a concrete input pair x and y, filling in the actual values of the circuit is known as witness generation.
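As a concrete illustration, here is witness generation in Python for a hypothetical four-gate program of our own choosing; each dictionary entry corresponds to one wire value in the filled-in circuit for a specific input pair.

```python
def generate_witness(x, y):
    """Fill in the circuit's values for concrete inputs x and y.

    Toy program: out = x*x + x*y + 5, broken into one gate per line
    so that every intermediate wire value appears in the trace."""
    t1 = x * x        # gate 1: multiplication
    t2 = x * y        # gate 2: multiplication
    t3 = t1 + t2      # gate 3: addition
    out = t3 + 5      # gate 4: addition of a constant
    return {"x": x, "y": y, "t1": t1, "t2": t2, "t3": t3, "out": out}

witness = generate_witness(3, 4)
assert witness["out"] == 3 * 3 + 3 * 4 + 5   # 26
```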

The next step turns each column of the circuit into a polynomial. The benefit of this step is not initially obvious. In the diagram below, we have a vector of length n, which turns into a polynomial of degree n − 1, and it takes the same amount of data to describe both the vector and the polynomial. However, we do not intend to publish the entire polynomial for a column. Rather, we will convince a verifier that we know the global structure of the polynomial and that we can predict how it will behave in any local neighborhood.
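A minimal Python sketch of this column-to-polynomial step, using Lagrange interpolation over a toy 97-element field (real systems use prime fields of roughly 256 bits; the helper names here are our own, not Sindri's):

```python
P_MOD = 97  # toy prime field; production systems use ~256-bit primes

def poly_mul_linear(poly, a):
    """Multiply a polynomial (ascending coefficients) by (x - a), mod P_MOD."""
    out = [0] * (len(poly) + 1)
    for k, c in enumerate(poly):
        out[k + 1] = (out[k + 1] + c) % P_MOD
        out[k] = (out[k] - a * c) % P_MOD
    return out

def interpolate(values):
    """Return coefficients of the unique polynomial f with f(i) = values[i]."""
    n = len(values)
    coeffs = [0] * n
    for i, y in enumerate(values):
        # Build the Lagrange basis polynomial L_i(x) = prod_{j != i} (x - j).
        basis = [1]
        denom = 1
        for j in range(n):
            if j != i:
                basis = poly_mul_linear(basis, j)
                denom = denom * (i - j) % P_MOD
        scale = y * pow(denom, -1, P_MOD) % P_MOD  # modular inverse (Python 3.8+)
        for k, c in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * c) % P_MOD
    return coeffs

def evaluate(coeffs, x):
    """Horner evaluation mod P_MOD."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_MOD
    return acc

column = [3, 1, 4, 1, 5]          # one column of the witness matrix
f = interpolate(column)           # 5 values -> degree-4 polynomial
assert all(evaluate(f, i) == v for i, v in enumerate(column))
```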

The first phase of the polynomial argument, the global structure, functions similarly to a hash. Random hash collisions are unlikely, meaning the hash output is a sufficiently identifying property of a polynomial. Additionally, the hash is irreversible. So, an adversary cannot learn much about the pre-image given only the hash output.

The second phase of the polynomial argument, the local structure, opens up the behavior of the polynomial around a random point (note that the prover does not get to choose this point). Consider first the claim: “I have a polynomial and when I plug 12 into it, I get 42 back.” In order to convince a hostile listener that I have full knowledge of this polynomial, I'll need to supply more information. As the diagram below indicates, if I know P(12) = 42, then P(x) − 42 will cross the x-axis at 12. There is a programmatic way (polynomial division) to retrieve a polynomial of degree n − 1 that crosses the x-axis in all the same places as P(x) − 42, except at the root 12. In short, in my role as the prover I need to supply Q so that Q(x)·(x − 12) = P(x) − 42. This is a very convincing argument to the verifier that I know the polynomial P(x). But, to stick to the principles of ZK, I need to withhold both P(x) and Q(x). We already have a hashed version of P, so we'll supply the hashed version of Q as well.
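The quotient-polynomial step can be sketched in a few lines of Python over the same toy prime field (illustrative only; a real KZG opening works over the curve's ~256-bit scalar field and publishes a commitment to Q rather than Q itself):

```python
P_MOD = 97  # toy prime field stand-in for a ~256-bit field

def evaluate(coeffs, x):
    """Horner evaluation of a polynomial (ascending coefficients) mod P_MOD."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P_MOD
    return acc

def quotient(coeffs, z, v):
    """Given P (ascending coeffs) with P(z) = v, return Q such that
    Q(x) * (x - z) == P(x) - v, via synthetic division by (x - z)."""
    a = coeffs[:]
    a[0] = (a[0] - v) % P_MOD       # P(x) - v has a root at z
    n = len(a) - 1
    q = [0] * n
    q[n - 1] = a[n] % P_MOD
    for k in range(n - 1, 0, -1):
        q[k - 1] = (a[k] + z * q[k]) % P_MOD
    remainder = (a[0] + z * q[0]) % P_MOD
    assert remainder == 0, "P(z) != v: no valid quotient exists"
    return q

P = [7, 0, 3, 2]              # P(x) = 2x^3 + 3x^2 + 7
z = 12
v = evaluate(P, z)            # the claimed opening value P(12)
Q = quotient(P, z, v)
# Spot-check the identity Q(x) * (x - z) == P(x) - v at several points.
for x in range(10):
    assert evaluate(Q, x) * (x - z) % P_MOD == (evaluate(P, x) - v) % P_MOD
```

Note that a dishonest prover claiming a wrong value v cannot produce such a Q: the `remainder == 0` assertion is exactly the statement that z is a root of P(x) − v.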

The final act is for the verifier to evaluate the proof they have received. The prover sent along commitments for the two polynomials P and Q, in addition to the claim that P evaluates to 42 when you plug in 12. The verifier is not able to directly check that the polynomial equation Q(x)·(x − 12) = P(x) − 42 holds, but they can check a “hashed” version of this equality, very similar to the way the commitments for P and Q were obtained.

That concludes the steps in a ZK proof. The “not quite zero-knowledge” comment in the diagram above deserves elaboration. If you were to rerun doSomething^ZK many times and reveal a different evaluation point of the polynomial each time (the (12, 42) pair), then an adversary would gain a little information about your polynomial P each time. It would take them only n such openings to completely determine P and work backwards to the private y value. KZG does not initially qualify as a zero-knowledge protocol, but there are straightforward revisions which grant this property.

The discussion above oversimplified a lot of the “moon math” that makes the final verify equation un-cheat-able. Because of that, the complexity might be unclear. Here are some more specific comments:

  1. The commitment function, or hash, is not fast. Another name for this operation is “multi-scalar multiplication,” or MSM. Each of the coefficients of your polynomial is represented with around 250 bits, and the τ values are elliptic curve points which require around 380 bits to represent. While MSM is O(n/log(n)), n is already large enough for this to be a heavy computational burden. Additionally, KZG requires more calls to MSM than the operation described next, so this is the bottleneck.
  2. Calculating Q(x) requires polynomial division. Fast Fourier transform (FFT) techniques place this computation in O(n·log(n)). In addition to the complexity introduced by large n (the degree of your polynomial), recall that the coefficients are each around 250 bits.

Sindri's API

Hopefully the discussion above has convinced you of the difficulties of producing ZK proofs. If you have your own doSomething and want to convert it to doSomething^ZK, then there are three major options available to you:

  1. Perform all the work yourself with your own machines. There are many open source libraries available to you that already provide accelerated proving implementations on certain hardware options. Selecting the implementation that best suits your hardware setup will require a technical deep dive. Additionally, you should not underestimate the work it will take to scale this acceleration beyond your initial prototype and reconfigure the entire stack as needed to stay up to date with a rapidly evolving backdrop.
  2. Plug doSomething into a ZK-VM that will outsource the entire process. As long as you write your program in a way the virtual machine can understand, it will automatically convert it into a doSomething^ZK circuit and host proofs for you. There may be limitations placed on the development of doSomething. For instance, a VM might have trouble with a multi-threaded program, which is very hard to prove. The biggest perk of competitive VM solutions is that proof acceleration is baked into their product, so you reap the benefits of an incredibly technical field with zero headaches. However, a crucial consideration with this option is that you have very limited control. The field of ZKP is still evolving, and the form doSomething^ZK takes currently may become obsolete given research advances. You must place a lot of trust in your VM solution to keep pace with those advances while providing stability to your own application.
  3. Compute the doSomething^ZK circuit definition and upload it to Sindri. Sindri's API offers a middle ground between the previous two options. You'll need to learn a little about the ZK space in order to pick the proving scheme and arithmetization that is right for your use case. Additionally, you'll have to roll up your sleeves and provide the circuit definition for us. After that, the responsibility of running proofs quickly and reliably falls to Sindri's infrastructure.


Sagittal MSM: A Streamlined MSM Implementation

· 4 min read
Lead Developer

TL;DR — Our streamlined approach, which has a deterministic ordering of bucket updates, 3 orders of magnitude fewer buckets, and removes the logic for the double/double-add, queuing, scheduling, and conflict detection/resolution, results in at least a 90% reduction in the number of EC point additions when compared to the pseudo-naive approach.

Sindri Labs developed a streamlined implementation of the multi-scalar multiplication (MSM) algorithm, which computes the inner product of a vector of large (e.g., 256-bit) scalars with a vector of points on an elliptic curve (EC) whose coordinates are in the finite field Fq, with q roughly 384 bits in size. Our improvements are targeted at GPU and FPGA implementations.

The well-known algorithm of Pippenger (the bucket algorithm) allows the computation to use a remarkably small number of EC point additions, but at a cost: proposed implementations require maintaining on the order of 2^20 partial sums (buckets) of EC points. Moreover, the updates to the partial sums are performed in a random order and can cause stalls in feeding the EC point addition pipeline if point additions involving the same bucket are not separated in time by at least the length of the pipeline. Conflict resolution strategies require relatively complex logic.
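For intuition, here is a toy Python version of the bucket algorithm in which plain integers stand in for EC points, so integer addition plays the role of EC point addition (window size and bit widths are shrunk far below production values, and the code is our own sketch, not Sindri's implementation):

```python
WINDOW = 4  # real implementations use much wider windows (hence ~2^20 buckets)

def bucket_msm(scalars, points, bits=16):
    """Compute sum(s * g) using only additions, via Pippenger-style buckets.

    Integers stand in for EC points: `+` is "EC point addition" and the
    final shift stands in for repeated point doubling."""
    total = 0
    for w in range(0, bits, WINDOW):
        # One window: drop each point into a bucket chosen by its scalar's digit.
        buckets = [0] * (1 << WINDOW)
        for s, g in zip(scalars, points):
            digit = (s >> w) & ((1 << WINDOW) - 1)
            buckets[digit] += g              # one "point addition" per point
        # Aggregate sum_j j * buckets[j] with the running-sum trick.
        running, window_sum = 0, 0
        for b in reversed(buckets[1:]):
            running += b
            window_sum += running
        total += window_sum << w             # shift = repeated doubling
    return total

scalars = [13, 200, 77, 1024]
points = [5, 9, 2, 11]
assert bucket_msm(scalars, points) == sum(s * g for s, g in zip(scalars, points))
```

In the real algorithm the bucket updates arrive in scalar-dependent (effectively random) order, which is exactly what creates the pipeline-conflict problem described above.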

Our proposed solution falls short of the theoretical minimum number of EC point additions required for an MSM with vectors of length in excess of 2^20, but several advantages make up for that.

Our solution:

  • Has a simple, deterministic ordering of bucket updates. This guarantees no conflicts and obviates the need for logic to detect and resolve them (queues and schedulers) that the Pippenger approach requires.
  • Removes the logic of the double or double-add method for performing an efficient scalar multiply using additions.
  • Utilizes projective coordinates in Montgomery form.
  • Reduces the memory footprint for the buckets by three orders of magnitude: it uses 2^8 buckets instead of 2^20. With so few buckets, aggregating them in the final phase is of negligible cost, so we can move the finalized buckets back to the host (CPU) and aggregate them there. On an FPGA implementation, this frees block RAM because we do not have to implement the aggregation algorithm on the device.
  • Allows for the precomputation of EC point additions. This is a one-time computation for all proofs that use the same common reference string (CRS). It increases the size of the CRS, so it comes at the expense of either a higher data transfer rate to the device or a larger amount of device RAM. The amount of precomputed EC point additions is an adjustable parameter.
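The bullet points above can be sketched in miniature. In this toy Python version (integers again standing in for EC points, names and parameters our own, not the production design), precomputing shifted copies of each point, which would live in the enlarged CRS, lets every 8-bit digit of every scalar fall into one shared set of 2^8 buckets, updated in a fixed, conflict-free order:

```python
WINDOW = 8        # 2^8 buckets total
SCALAR_BITS = 16  # 253 in the real setting

def precompute(points):
    """For each point G, store [G, 2^8 * G, 2^16 * G, ...].

    The integer shift stands in for repeated EC doubling; this is done
    once per common reference string, trading CRS size for runtime."""
    windows = SCALAR_BITS // WINDOW
    return [[g << (WINDOW * j) for j in range(windows)] for g in points]

def single_bucket_set_msm(scalars, pre):
    buckets = [0] * (1 << WINDOW)  # one shared set of 2^8 buckets, not 2^20
    for s, shifted in zip(scalars, pre):
        for j, g in enumerate(shifted):  # deterministic order: no conflicts
            digit = (s >> (WINDOW * j)) & ((1 << WINDOW) - 1)
            buckets[digit] += g
    # Tiny final aggregation -- cheap enough to hand back to the CPU host.
    running, result = 0, 0
    for b in reversed(buckets[1:]):
        running += b
        result += running
    return result

scalars = [13, 40000, 777]
points = [5, 9, 2]
pre = precompute(points)
assert single_bucket_set_msm(scalars, pre) == sum(s * g for s, g in zip(scalars, points))
```

Because the precomputation absorbs the per-window doublings, the inner loop consists purely of additions into a small, fixed bucket array, which is the property that simplifies the control logic.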

The combined effect of a reduced memory footprint and simpler control logic offers an opportunity to realize multiple EC point addition pipelines on the FPGA, with a consequent improvement from parallelization. We also anticipate that our EC point addition pipeline will be shorter than the schedule of partial-sum updates, which means we can break up particular bottlenecks in the pipeline where doing so allows for a higher clock rate.

Furthermore, this streamlined approach is more adaptable to GPUs and may open the door to more performant GPU zk-prover systems: with queues, scheduling functions, conflict resolution functions, caching, and double/double-add logic all omitted, our approach allows for much greater GPU utilization.

Comparison Against the Pseudo-Naive Approach

Pseudo-naive approach problem definition

Each three-dimensional point, G_k, of an array of points G needs to be scaled by adding it to itself a corresponding number of times, s_k, taken from an array s of the same length, N, as G. The sum of all the scaled points, R_f, needs to be computed and returned.

Let

\large{ R_f = \sum_{k=0}^{N-1} s_k \cdot G_k }

where N is on the order of 100 million, G_k is a three-element array (x, y, z) of 377-bit integers, s_k is a single 253-bit integer, and s_k · G_k is an elliptic curve point scalar multiplication that utilizes the double/double-add method, which uses at most 2·⌈log₂ s_k⌉ EC point additions (about 1.5·⌈log₂ s_k⌉ on average).
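A toy Python version of this baseline, with integers standing in for EC points and an explicit counter for the "EC point additions" (a doubling counted as one addition of a point to itself):

```python
def double_and_add_msm(scalars, points, bits=253):
    """Pseudo-naive MSM: a separate double/double-add scalar multiply
    per term, folding each result into a running total."""
    additions = 0
    total = 0
    for s, g in zip(scalars, points):
        acc = 0
        for bit in reversed(range(bits)):
            if acc != 0:
                acc = acc + acc      # double
                additions += 1
            if (s >> bit) & 1:
                acc = acc + g        # add
                additions += 1
        total += acc                 # fold this term into the total
        additions += 1
    return total, additions

scalars = [13, 200, 77]
points = [5, 9, 2]
total, adds = double_and_add_msm(scalars, points, bits=8)
assert total == sum(s * g for s, g in zip(scalars, points))
```

Counting `additions` for full-width scalars makes the roughly 2·⌈log₂ s_k⌉ additions per term concrete, and gives the baseline against which the bucket-based reduction below is measured.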

Comparison

Our approach results in at least a 90% reduction in the number of EC point additions when compared to the pseudo-naive approach described above.

Want to chat? Give us a visit at Sindri Labs.

References

Aasaraai, Kaveh, et al. FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM. 2022, https://eprint.iacr.org/2022/1396.

Lu, Tao, et al. CuZK: Accelerating Zero-Knowledge Proof with a Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs. 2022, https://eprint.iacr.org/2022/1321.

Xavier, Charles F. pipeMSM: Hardware Acceleration for Multi-Scalar Multiplication. 2022, https://eprint.iacr.org/2022/999.