We’re excited to announce a significant advancement in reducing the script size required for field multiplication on bitcoin. Our recent research paper introduced an efficient method for multiplying two 254-bit prime (BN254 curve) integers using the w-windowed method. Building on this foundation, we have developed TMUL (Tunable MUL), an efficient field multiplication technique for bitcoin. TMUL offers substantial reductions in script size compared to current state-of-the-art methods, with the added flexibility for developers to adjust the script size and maximum stack usage to meet their specific requirements.

Our results show that field multiplication on bitcoin can now be achieved with a script size of approximately 78 KB, while maintaining a maximum stack usage of 150 elements. This marks a reduction in script size by roughly 44% compared to the most optimized field multiplication produced in the BitVM2 implementation. If we allow maximum stack usage to increase to 295 elements, we can achieve a script size of 73 KB which is a 47% reduction in size. Moreover, TMUL enables efficient computation of linear combination of field multiplications. This innovation stands as a crucial milestone in the evolution of on-chain protocols for SNARK verification on bitcoin.

#### Background

Efficient finite field arithmetic is crucial for any SNARK-based system. However, implementing such arithmetic on bitcoin presents unique challenges due to the limitations of bitcoin's scripting language. Bitcoin Script, intentionally non-Turing complete and stack-based, is designed for simplicity and security, imposing strict constraints on both script size and stack elements. The maximum script size is capped by the block size at 4MB and the standardness limit of 400KB. Stack usage is further restricted to a maximum of 1,000 elements. Additionally, arithmetic opcodes like OP_ADD, which are essential for building higher-order primitives, require operands to be 32-bit signed numbers. These limitations necessitate innovative techniques for achieving efficient finite field operations, particularly field multiplication, while minimizing script size.

##### Windowed Representation

In our recent research paper, we introduced the w-windowed method to optimize big integer multiplication on bitcoin. By representing the scalar in a w-bit windowed form, we performed Elliptic Curve point scalar multiplication using the double-and-add algorithm. With further fine-tuning—details of which are in the paper—we achieved a compact script size of ~72KB.

However, the implications of this approach extend far beyond the optimization of big integer multiplication. In the following sections, we will demonstrate how we leveraged the insights from our research to optimize a field multiplication. We then generalized this approach to efficiently compute linear combinations of field multiplications of the form:

$$r = \sum_{i=1}^{n} (-1)^k \ a_i \times b_i \ \ (\text{mod} \ \ p); \ \ \ \ \ k \in \{0,1\}$$

Such algebraic structures are commonly encountered in SNARKs like Groth16. The ability to efficiently compute these structures in batches can significantly reduce the on-chain cost of SNARK verification.

#### Our Approach

We observed that the structure of the double-and-add algorithm, along with the w-windowed representation, allows us to employ neat optimization tricks for field multiplications.

##### Field Multiplication

Recall, we can perform field multiplication through the division-remainder decomposition:

$$x \times y \ \ (\text{mod} \ p) \rightarrow x \times y = p \times q + r $$

$$r = (x \times y) - (p \times q) \\$$

where $q$ is the quotient. Here $p$ is the characteristic of the pairing curve used for SNARK verification and $q$ is passed as an auxiliary parameter to the script. There always exists a unique pair of $0 ≤ (q, r) < p$ that satisfies the above equation.

A straightforward implementation of field multiplication would involve performing two separate multiplications and then subtracting their results. While this offers marginal improvements over the current state-of-the-art, it is not the most efficient approach.

A more effective method is to batch two multiplications together. This technique allows us to leverage the same doubling cost—**the most computationally expensive part of the algorithm**—for both multiplications. Instead of performing $2N$ doublings for two separate multiplications, we can achieve the same result with just $N$ doublings. The trade-off is an increased pre-computation cost, as maintaining two lookup tables becomes necessary. However, this remains well within bitcoin's constraint of 1000 stack elements.

##### Linear Batching of Field Multiplications

We extended this approach further to compute a linear combination of $n$ field multiplications, enabling the use of a single doubling cost across all multiplications.

$$r = \sum_{i=1}^{n} (-1)^k \ a_i \times b_i \ \ (\text{mod} \ \ p); \ \ \ \ \ k \in \{0,1\} $$

$$r = (\sum_{i=1}^{n} (-1)^k \ a_i \times b_i) - (p \times q)$$

The overall cost of batching $n$ field multiplications, including the cost of generating $n+1$ lookup tables, then becomes

$$(n+1)\times\left[2^{w-1}\mathsf{A} + 2^{w-1}\mathsf{D}\right] + \left[\frac{(n+1)N}{w} \mathsf{A} + N \mathsf{D}\right]$$

where $\mathsf{A}$ and $\mathsf{D}$ represents cost of addition and doubling, respectively. The number of lookup tables and the number of addition operations increase linearly with the number of field multiplications, whereas the number of doubling operations remains constant.

##### Further Optimizations

In addition to batching multiplications, we implemented several low-level optimizations to further reduce script size. The key optimizations include:

###### Intermediate Value Optimization

We observed that the intermediate values of the double-and-add accumulator never exceed $(254+w\text{-bits}+1)$ bits for the correct set of inputs $(q,r)$. This means that intermediate results do not need to be represented as full $508$-bit integers. For an incorrect pair $(q,r)$, the program terminates after checking for overflow of the accumulator.

###### Theorem

****If $0 \leq r < p$ where $r = x y - p q$, then every accumulator value $acc_{i} = B acc_{i+1} + y_i x - p_i q$ satisfies

$$-B p < acc_i < B p, \ \ \ \text{where} \ \ B = 2^w.$$

###### Proof

****Let $L$ be the number of base $B$ digits in $p$. For any value $n \in 0,...,L-1$, let

$$X_n = \sum_{i=0}^{n-1} B^i (y_i x - p_i q)$$

$$Y_n = \sum_{i=n}^{L-1} B^i (y_i x - p_i q)$$.

Note that $X_n + Y_n = x y - p q$ by construction. Since every value of $y_i, p_i \in (-B, B)$ and $x, q \in [0, p)$ we have that

$$|X_n| < \sum_{i=0}^{n-1} B^i p (B-1) = p (B-1) \frac{B^{n} - 1}{B - 1} = p (B^n - 1)$$.

Notice that $acc_i B^n = Y_n$ and suppose $|acc_i| \geq p B.$ This means that $|Y_n| = |acc_i| B^n > p B^{n+1}.$ By the bound on $X_n$ we have that

$$|X_n + Y_n| > p (B^{n+1} - B^n + 1) > p$$.

Therefore, if $|acc_i| > p B$ it must be the case that the algorithm will terminate with a value of $r = x y - p q$ that exceeds $p$. By contraposition, if $0 \leq x y - p q < p$ then the accumulator must always lie in the prescribed range. Intuitively, the remaining terms in the sum are never able to compensate for a large accumulator, so the result must lie outside of the range. In our specific case, $p B < 2^{254 + w + 1}$ so $acc_i$ can always be represented in as many bits.

###### Hardcoding Field Characteristic

****This value can be hardcoded since $p$ is the field characteristic of the curve we choose in advance. To squeeze more out of the current implementation, we removed redundant “addition-with-zero” operations for windows that are zero values.

###### Dynamic Window Generation

****Instead of generating a complete set of $w$-bit windows upfront, we generate it dynamically as needed, reducing runtime stack usage.

For a more detailed look, please refer to our implementation.

#### Result

Through these optimization techniques, we achieved two major benefits described below.

###### Significant Reduction in Script Size

We have significantly reduced the script size for single field multiplication from 140KB to 78KB (using $\omega = 3$). This reduction becomes even more impactful in the case of linear batching, where the script size increases only modestly compared to the current state-of-the-art, which required 140KB per field multiplication. With TMUL, the script size scales as 52K + 26K * n and the maximum stack usage scales as 78 + 72 * n elements for $\omega = 3$, where n is the number of linearly combined field multiplications. The following graph illustrates this improvement by comparing the script sizes of TMUL and the SOTA field multiplication in BitVM2 for the linear combination of $n$ field multiplications.

###### Tunable Script Size and Stack Usage

Our approach allows tuning between script size and maximum stack usage. While bitcoin’s stack is a limited resource, it is free compared to the script size. The flexibility to write Bitcoin Script that uses the stack in the most cost-effective way is highly advantageous.

#### Conclusion

A single SNARK verifier can require thousands of modular multiplication operations resulting in sizes in the order of gigabytes. Since executing scripts in the order of gigabytes on-chain is fundamentally impractical, protocols such as BitVM2 introduces interactivity in verification to make it practical.

However, the current costs associated with running these protocols are still high. These costs are highly proportional to the cost of performing field multiplication, which in bitcoin is the size of the script required to perform field multiplication. So, decreasing the size of a field multiplication proportionally decreases the costs associated with verifying SNARKs on-chain. TMUL represents a significant step toward achieving practical, cost-effective SNARK verification on bitcoin without the need for any soft fork.