neptune-crypto / twenty-first Goto Github PK

View Code? Open in Web Editor NEW

69.0 7.0 20.0 4.47 MB

Collection of mathematics routines and cryptography for the twenty-first century

License: GNU General Public License v2.0

Makefile 0.37% Rust 99.63%

cryptography rust zero-knowledge zk-starks algebra merkle-tree post-quantum-cryptography

twenty-first's Introduction

twenty-first

A collection of cryptography primitives written in Rust.

Content of this library

This library contains primarily the following cryptographic primitives:

The Tip5 hash function
- The Tip5 Hash Function for Recursive STARKs
Lattice-crypto
- arithmetic for the quotient ring $\mathbb{F}_ p[X] / \langle X^{64} + 1 \rangle$
- arithmetic for modules over this quotient ring
- a IND-CCA2-secure key encapsulation mechanism
- Lattice-Based Cryptography in Miden VM
BFieldElement, XFieldElement
- The prime-field type $\mathbb{F}_p$ where $p = 2^{64} - 2^{32} + 1$
- The extension field $\mathbb{F}_p[x]/(x^3 - x + 1)$
- A codec trait for encoding and decoding structs as Vecs of BFieldElement
- An efficient prime for number-theoretic transforms
NTT
- Number Theoretic Transform (discrete Fast Fourier Transform)
- Anatomy of a STARK, Part 6: Speeding Things Up
Univariate and multivariate polynomials
Merkle Trees
Merkle Mountain Ranges

Release protocol

While twenty-first's version is 0.x.y, releasing a new version:

Is the release backwards-compatible? Then the new version is 0.x.y+1. Otherwise the new version is 0.x+1.0.
Checkout the last commit on Mjolnir, and run make bench-publish. Save the benchmark's result and verify that there is no performance degredation.
Create a commit that increases version = "0.x.y" in twenty-first/Cargo.toml. The commit message should give a one-line summary of each release change. Include the benchmark result at the bottom.
Have a v0.x.y git tag on this commit created. (git tag v0.x.y [sha], git push upstream --tags)
Have this commit cargo published on crates.io and in GitHub tags.

If you do not have the privilege to create git tags or run cargo publish, submit a PR and the merger will take care of these.

Building

This crate depends on the rs-leveldb crate which wraps the C++ implementation of leveldb. As such, snappy and leveldb must be installed. Eg, on Ubuntu:

sudo apt-get install libleveldb-dev libsnappy-dev

For more detailed buildings instructions, see the description in HACKING.md.

twenty-first's People

Contributors

Stargazers

Watchers

twenty-first's Issues

Speed up NTT-based interpolation

While benchmarking @AlexanderLemmens's modular reduction procedure I observed isolated cases where performance improved by switching from Montgomery representation and multiplication to canonical representation and Alexander reduction. Specifically, for large enough $N$, NTT-based interpolation seems to be faster. The following benchmark snippets are obtained after switching back from Alexander's method to Montgomery.

lagrange_interpolation/NTT-Based interpolation/4096                        
                        time:   [387.90 ms 388.46 ms 389.06 ms]
                        thrpt:  [10.528 Kelem/s 10.544 Kelem/s 10.559 Kelem/s]
                 change:
                        time:   [+1.3519% +2.9690% +4.3549%] (p = 0.00 < 0.05)
                        thrpt:  [-4.1731% -2.8834% -1.3339%]
                        Performance has regressed.

lagrange_interpolation/NTT-Based interpolation/16384                        
                        time:   [4.7962 s 4.8422 s 4.9162 s]
                        thrpt:  [3.3327 Kelem/s 3.3836 Kelem/s 3.4160 Kelem/s]
                 change:
                        time:   [+36.423% +38.074% +40.215%] (p = 0.00 < 0.05)
                        thrpt:  [-28.681% -27.575% -26.699%]
                        Performance has regressed.

All other benchmarks are green, implying that Montgomery is faster.

Focusing on these red ones, I would say it indicates some suboptimal processing. I estimate that there is some performance gain to be had here.

Note that this task has very low priority since NTT-based interpolation is not used anywhere. (We use INTT in Triton.)

Make hashing faster by making the sponge updateable

All hashing now happens via AlgebraicHasher using either the Hashable trait:

pub trait Hashable {
    fn to_sequence(&self) -> Vec<BFieldElement>;
}

or in the case of ProofStream the IntoIterator<Item = BFieldElement>:

impl<Item, H> ProofStream<Item, H>
where
    Item: IntoIterator<Item = BFieldElement> + Clone,
    H: AlgebraicHasher,

In both cases, this means that hashing large, nested objects involves a large degree of unnecessary cloning and concatenating, since hash_slice() (formerly hash_sequence()) can only hash a consecutive slice of BFieldElements. For example, if wanting to impl Hashable for Foo:

pub struct Foo {
    bars: Vec<Bar>,
    bobs_with_digests: Vec<(Bob, Digest)>,
    bunch_of_elems: Vec<Vec<BFieldElement>>,
    thing: JustAnotherThing,
    proof_of_verbosity: Option<(Verbosity, Vec<XFieldElement>)>,
}

impl Hashable for Foo {
    fn to_sequence(&self) -> Vec<BFieldElement> {
        todo!()
    }
}

...then the best one can do wrt. creating one large &[BFieldElement] for hash_slice() is to not materialize the intermediate Vec<BFieldElement> at each scope (one that combines Bars, etc.), and instead .chain() iterables together to perform one giant .collect_vec() at the end. But even then, a Bar, Bob, Digest, JustAnotherThing, Verbosity and Vec<XFieldElement> all need their own impl Hashable that must materialize a consecutive allocation, which is discarded once concatenated in the parent scope, recursively.

The way std::hash::Hash works is:

impl std::hash::Hash for Foo {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        todo!()
    }
}

where recursively defined Hash instances pass on the hasher's mutable state, being the hash function's sponge construction.

RescuePrimeRegular does not currently expose its sponge via its hash_varlen(); if it did, it would be possible for twenty-first's Hashable to gradually update the sponge without first allocating and re-allocating the serialised input. It would be possible to restrict the interface so that only the first RATE elements are addressable. There would be some need in the mutable hasher's state to keep track of how far into the RATE allowed elements one sub-serialization has reached, so that absorbing and padding works correctly regardless of how many elements a given sub-part of some structure has.

Flaky FRI test for BigInt

The test generate_proof_1024_bigint is flaky. Figure out why.

It might be because indices are repeated in the colinearity test.

Error in derived implementation of `BFieldCodec` for vectors

Minimal example

use twenty_first::shared_math::tip5::Digest;

#[derive(Clone, Copy, Debug, Serialize, Deserialize, PartialEq, Eq, Hash, GetSize, BFieldCodec)]
pub struct Foo {
    pub bar: Digest,
}

#[cfg(test)]
mod tests {
    use rand::random;
    use super::*;

    #[test]
    fn foo_vec_test() {
        for i in 0..5 {
            let foos = vec![
                Foo {
                    bar: random(),
                };
                i
            ];
            let encoded = foos.encode();
            let decoded = *Vec::<Foo>::decode(&encoded).unwrap();
            assert_eq!(foos, decoded);
        }
    }

Error message:

thread 'foo_vec_test' panicked at 'called `Result::unwrap()` on an `Err` value: Length indication plus one must match actual sequence length. Claimed vector length was 1. Item size was 5. Sequence length was 7.', ...
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The encoding prepends the size of each element, but the decoder does not expect that.

Fastest NTT

This paper from Microsoft Research seems to contain a pseudo-algorithm for what is probably the optimal NTT implementation.

https://eprint.iacr.org/2016/504.pdf

Another article about a fast NTT:
https://crypto.ethz.ch/publications/files/Seiler18.pdf

impl From<i32> for BFieldElement

It can be very convenient to execute 2.into() and get a BFieldElement. Rust seems to default to i32's often. Thus, a conversion from i32 to BFieldElement might be handy. From a previous conversation here:

@sshine wrote:

I'm uncertain about that. From<i32> for BFieldElement seems bad.

It makes sense if you think of the i32 as a field element that wasn't cast yet.

As with From<u64> for BFieldElement, these instances break certain expectations about arithmetic where you kind of have to accept that i32s and u64s sometimes have modular arithmetic properties with some implicit modulo, if they happen to live near BFieldElements.

I'd expect $\forall x:$BFieldElement::from(x).value() == u64::from(x).

The only perfect conversion that we have great use of is From<u32> for BFieldElement.

If we're looking for syntactic convenience, we could consider a macro.

@jan-ferdinand:
A macro does actually sound like it could make sense.

Calculate Differential Uniformity

Branch asz/tip5 has two tests that exist only to compute parameters or verify correct design principles for the Tip5 hash function. One test, calculate_differential_uniformity needs a lot of computing power. I am interested in the output of this test. If you have spare computing power lying around, please run

cargo test calculate_differential_uniformity -- --include-ignored --nocapture

and tell me what the output is.

MMR Verify Return Type

MmrMembershipProof::verify returns a bool and possibly a hash.

Verify a membership proof for an MMR. If verification succeeds, return the final state of the accumulator hash.

It should return just a bool. Methods that need the hash also need to call a different function. This is a small change but percolates far down.

Implement compatible primitive roots of unity as powers of two

Cf. https://www.ingonyama.com/blogs/goldilocks-ntt-trick some of the primitive roots of unity can be implemented as powers of 2. This holds for the roots with order 4 to 64 (both inclusive).

With this trick, the new orders of unity become:

static PRIMITIVE_ROOTS: phf::Map<u64, u64> = phf_map! {
    2u64 => Q - 1,
    4u64 => 1u64 << 48,
    8u64 => 1u64 << 24,
    16u64 => 1u64 << 12,
    32u64 => 1u64 << 6,
    64u64 => 1u64 << 3,
    128u64 => 2198989700608,
    256u64 => 4404853092538523347,
    512u64 => 6434636298004421797,
    1024u64 => 4255134452441852017,
    2048u64 => 9113133275150391358,
    4096u64 => 4355325209153869931,
    8192u64 => 4308460244895131701,
    16384u64 => 7126024226993609386,
    32768u64 => 1873558160482552414,
    65536u64 => 8167150655112846419,
    131072u64 => 5718075921287398682,
    262144u64 => 3411401055030829696,
    524288u64 => 8982441859486529725,
    1048576u64 => 1971462654193939361,
    2097152u64 => 6553637399136210105,
    4194304u64 => 8124823329697072476,
    8388608u64 => 5936499541590631774,
    16777216u64 => 2709866199236980323,
    33554432u64 => 8877499657461974390,
    67108864u64 => 3757607247483852735,
    134217728u64 => 4969973714567017225,
    268435456u64 => 2147253751702802259,
    536870912u64 => 2530564950562219707,
    1073741824u64 => 1905180297017055339,
    2147483648u64 => 3524815499551269279,
    4294967296u64 => 7277203076849721926,
};

I think the NTT/INTT is only determined up to a constant so there might be some intermediate NTT results that are no longer valid.

Populate Triton VM memory with a Merkle tree

With the release of 0.30.0 the Merkle tree nodes were made private. But some code in tasm-lib relied on those node digests being accessible, specifically

for (i, node) in tree.nodes.into_iter().enumerate().skip(1) {
            for j in 0..DIGEST_LENGTH {
                memory.insert(
                    list_address
                        + BFieldElement::one()
                        + BFieldElement::new((i * DIGEST_LENGTH + j) as u64),
                    node.values()[j],
                );
            }
        }

One option for rewriting this code is to have a get_all_nodes function. Another option is to implement BFieldCodec for the Merkle tree structure. The latter option is probably preferable, as that would ensure that the Merkle tree is put into memory in a canonical representation.

Change name of `CompressedAuthenticationPath`

After speaking with @aszepieniec, we concluded that PartialAuthenticationPath is a better name than CompressedAuthenticationPath.

`slow_lagrange_interpolation` has a bad signature

This function for Polynomial structs has a function pub fn slow_lagrange_interpolation(points: &[(U, U)]) -> Self. Instead of taking a slice of tuples, it should accept to slices, one for x-values and one for y-values, as this is often how we have the values in the caller.

Minimize cycle count for MMR operations

Nervos Network has an MMR implementation in Rust. It's worth looking at their algorithms to see if they can bring down our cycle count:
https://github.com/nervosnetwork/merkle-mountain-range

Mimblewimble also has an implementation:
https://github.com/mimblewimble/grin/blob/0ff6763ee64e5a14e70ddd4642b99789a1648a32/core/src/core/pmmr.rs#L257

And here is a list of other implementations:
https://github.com/topics/merkle-mountain-range

`SimpleRustyStorage` could take a `sync` boolean parameter

SimpleRustyStorage has a persist function that calls the underlyding rusty-leveldb write function. This write function takes a boolean called sync which determines whether the changes should be written to disk or not. Exposing this boolean in SimpleRustyStorage's persist would allow an application using this data model to better control when it writes to disk.

Collatz Sequence STARK

Eliminate extended computational trace from composition polynomial
Verifier must verify the linear combination of the composition polynomial. This is done as follows:
- Get composition polynomial weights from boundary quotient Merkle root
- Get the top-level indices from FRI
- Get the values of the boundary quotients at those indices, and the shifted (next cycle) ones.
- Calculate the trace value from the identity ECT(x) = BQ(x)Z_B(x) + L(x), where L(x) is the boundary interpolant.
- Calculate the AIR values from the identity AIR_{bit,j} = (ECT_j(x))^2 - ECT_j(x), and the AIR_{sum}(x) from the sum AIR equation.
- Calculate the transition quotients from the AIR values: TQ(x) = AIR(x)/Z_{transition}(x).
- Compute the sum from the above arguments for each index and test it against the leaf opened by FRI.

Unused functions in `shared_math/other.rs`

E.g. mod_pow_raw is unused in shared_math/other.rs. It's implemented with the % operator which is much slower than the B-field specialized operations we have for BFieldElement and for XFieldElement. It should be removed, so no one uses it accidently.

Doing this, one should also grep after all other functions in other.rs and remove those that are unused. If a function is only called by a test, it should also be removed.

Don't be multi-crate

After deleting the deprecated Rescue Prime and Brainfuck STARKs in 7659738, this repo no longer needs to be a multi-crate repository.

Implement `Hashable` for `MmrAccumulator`

Before doing that maybe we want to consider the leaf_count as a u64 instead of as a u128 as it is now? I guess once could also implement the leaf_count as a U32s<2> or U32s<4> struct but that might also just make things too complicated.

check for overflows in arithmetic

Due to an oversight, tests currently don't fail on overflows. However, they should.

enable overflow-checks = true in the test profile
fix all uncovered bugs
explicitly mark code where overflows are desired or required

Should we keep a slow STARK prover/verifier for education purposes?

Maybe it'll help onboard people?

support tuple fields in derive macro `BFieldCodec`

Structs with tuples as fields are not supported by the current derive macro BFieldCodec. More specifically, deriving the trait will succeed, and the resulting implementations of encode and decode will compile. However, the derived methods encode and decode won't be each others duals. For example, the following code will panic.

#[derive(BFieldCodec)]
struct Foo((u8, u8));
let encoded = Foo((1, 2)).encode();
let decoded = Foo::decode(&encoded).unwrap();

This is possibly caused by missing or incomplete implementations of GetSize for tuples. Deriving their size using macros like get_size_derive is not possible with their current version v0.1.2.

Figure out a way to support tuple fields for derive macro BFieldCodec.

Montgomery Batch inversion (MBI)

Implement Montgomery Batch inversion as described here
https://paulmillr.com/posts/noble-secp256k1-fast-ecc/

Python code here:
https://www.vitalik.ca/general/2018/07/21/starks_part_3.html

Error in `MPolynomial::degree`

This method seems to wrongly count terms whose coefficient is zero towards the degree of a multivariate polynomial. It shouldn't.

https://github.com/Neptune-Crypto/twenty-first/blob/master/twenty-first/src/shared_math/mpolynomial.rs#L1057-L1071

Change name of MMR function `get_peak_index_and_height`

This function returns the node index of a peak, not index into the MMR accumulator's peak list. Luckily it's a private function in twenty-first/src/util_types/mmr/mmr_membership_proof.rs, but it should still be changed, as we use the term "peak index" to refer to the index into the peaks list.

`BfieldCodec::decode` may not crash at runtime

There's plenty of ways that BFieldCodec::decode can crash because we're not checking the length of the BFieldElement sequence before indexing into it.

Example:
In the following code, we are not checking if the str BFieldElement slice has a sufficient length before indexing into it. The problematic lines are
let t = *T::decode(&str[index..index + len_t])?;
and
let s = *S::decode(&str[index..index + len_s])?;

The length must be checked these expressions are evaluated.

impl<T: BFieldCodec, S: BFieldCodec> BFieldCodec for (T, S) {
    fn decode(str: &[BFieldElement]) -> Result<Box<Self>> {
        let mut index = 0;
        // decode T
        let len_t = match T::static_length() {
            Some(len) => len,
            None => {
                let length = match str.get(index) {
                    Some(bfe) => bfe.value() as usize,
                    None => bail!(
                        "Prepended length of type T satisfying unsized BFieldCodec does not exist"
                    ),
                };
                index += 1;
                length
            }
        };
        let t = *T::decode(&str[index..index + len_t])?;
        index += len_t;

        // decode S
        let len_s = match S::static_length() {
            Some(len) => len,
            None => {
                let length = match str.get(index) {
                    Some(bfe) => bfe.value() as usize,
                    None => bail!(
                        "Prepended length of type S satisfying unsized BFieldCodec does not exist"
                    ),
                };
                index += 1;
                length
            }
        };
        let s = *S::decode(&str[index..index + len_s])?;
        index += len_t;

        if index != str.len() {
            bail!("Error decoding (T,S): length mismatch");
        }

        Ok(Box::new((t, s)))
    }
    
    ...
}

Remove deprecated functions from `MPolynomial`

We removed the two subcrates, the STARKs for Rescue Prime and Brainfuck. They relied on multiple methods for MPolynomial that can now be removed.

impl Sum<XFieldElement> for XFieldElement

Version 0.1.4 includes breaking changes

The crate ecosystem follows semver with a twist: for versions 0.y.z, y is considered to be the major version and 0.y.z+1 should be compatible with 0.y.z. Unfortunately, 0.1.4 has a lot of breaking changes compared to 0.1.3.

https://doc.rust-lang.org/cargo/reference/resolver.html#semver-compatibility

Could you consider yanking 0.1.4, releasing 0.1.3 again as 0.1.5, and releasing 0.1.4 as 0.2.0? This should fix any compat issues downstream.

Regards.

Rename `rescue_prime_digest` to `digest`

The name space rescue_prime_digest is used for the digest type of Rescue Prime, Rescue Prime Optimized, and Tip5. So it should live in a file called digest.rs, not rescue_prime_digest.rs.

Implement `Display` for amount types.

Both DyadicRationals and U32s should implement Display.

impl Display for U32s<N>
impl Display for DyadicRationals

Running `rustc` through `make` is too slow

In Makefile we set export RUSTFLAGS = -Dwarnings. Changing RUSTFLAGS invalidates all cache and forces a rebuild which makes runnning make very slow and it makes a subsequent run of cargo test equally slow as its cache was also deleted.

The solution is probably to specify in the Makefile a custom target directory such that make builds store compiled files somewhere else than the custom place.

Benchmarks in CI

According to this article GitHub's build servers are fairly consistent for running benchmarks:
https://labs.quansight.org/blog/2021/08/github-actions-benchmarks

Let's revisit which benchmarks we want to include in the CI, which features, and perhaps we can get a graph to show the time reports.
https://github.com/marketplace/actions/continuous-benchmark

H/t to @Munksgaard for both links.

`batch_inversion` must panic on division by zero

We should not allow caller to invert a zero-element. batch_inversion must panic for these cases.

Tests should be fixed. Both prime_field_element and prime_field_element_big have a function with this name.

How to commit to low-degreeness of omicron-shifted x values?

To prove the transition quotient's correctness, we need to commit to omicron-shifted trace extension values. How do we prove the low-degreeness of these values?

reduce size of `BFieldCodec`-encoded authentication structure

An “authentication structure” removes redundancies in a collection of Merkle authentication paths. Two authentication paths for the same Merkle tree are bound to overlap. For example, in the tree below, the authentication path for the leaf marked by the blue arrow is indicated by blue nodes, and similarly for the leaf marked by the pink arrow.

When verifying the blue authentication path, the node encircled in pink is computed. Therefore, it does not have to be supplied if both paths are being verified. The same holds true for the node encircled in blue.

In our current implementation, an authentication structure is a Vec<Vec<Option<Digest>>>. Omitted nodes are represented by a None, included nodes are Some(digest). Each authentication path is granted its own Vec<_>. However, the indices of the leaves whose inclusion in the tree is to be proven identify all nodes that can be omitted. An authentication structure could well be a Vec<Digest>.

Some quick experiments show that for a Merkle tree with $2^{15}$ leaves, an authentication structure for 32 revealed elements requires a total of 2487 base field elements for encoding. Of those, 388 (15.6%) are required for encoding Nones. Another 32 (1.3%) are used for length-prepending the inner vectors. In total, the encoded size of an authentication structure could be reduced by about 15%¹².

Above numbers come from a single measurement. Different leaf indices and tree sizes will give different results. ↩
In Triton VM, the total reduction of the proof size is estimated at around 3%. ↩

Is it a problem that this implementation generates the same BFieldElement for two different `usize` values?

This seems to have been a part of the upgrade to 0.2.0 which included a rewrite of the Hasher trait.

Change function signature of `decode` trait function

It shouldn't be necessary to know the length of what you are about to decode before calling decode. A mutable input, or a cursor can probably achieve this.

BFieldCodec encoding must be unique

10a2f14 changes a behavior of BFieldCodec::decode for Option<T> to only accept 0 or 1 as enum indicators representing None and Some, respectively. Prior to that commit any non-zero value would indicate the Some variant.

Since we are hashing these encodings and using them in a cryptographic context, it seems to me that we need to ensure uniqueness in the allowed encoding of all data structures.

Add to the prop test helper function in src/shared_math/bfield_codec.rs some bit flipping of the encoded data structure and ensure that their decoding either fails or returns a different value.
Manually inspect encode and decode implementations to ensure encoding-uniqueness.

Fix MMR authentication path verification

should return bool

Why it is Okay for a FRI Prover to Cheat in a Small Number of Locations

On 2021-08-12 @Sword-Smith drew attention to a peculiar behavior observed while running FRI unit tests. The prover first proves the low-degreeness of a low-degree codeword, and the verifier accepts this proof. Then a few bits in the codeword are flipped, making it a codeword of high degree. The prover "proves" the low-degreeness of this codeword, and the verifier is tasked with verifying the resulting proof just to make sure it rejects it. It turns out, that in a statistically significant proportion of cases, the verifier accepts the codeword as though it were a low-degree one. Clearly this points to a security-critical error in either the code or the maths, right? Well, not really ...

What's Going On?

The introduced errors propagate with high probability through every split-and-fold down to the very last codeword. The verifier samples num_colinearity_checks-many locations of this codeword, and then interpolates between the observed points. If the degree of the interpolant is below a certain bound, the verifier accepts; otherwise he rejects. The reason why the verifier sometimes accepts "invalid" points is because the error locations are not party of the sampled indices. As a result, the verifier sees only points that correspond to a low-degree polynomial.

Why This Isn't Critical for Security

A few bit flips in a codeword does change the degree of the interpolating polynomial from low to high. However, the resulting codeword is still close to a codeword that corresponds to a polynomial of low degree. So in terms of the security proposition of FRI, the prover isn't really cheating with this strategy.

The noisy codeword still uniquely identifies one polynomial of low degree. The prover who commits to the noisy protocol is still making a cryptographically binding commitment to that polynomial.

In order for a cheating prover to succeed, he needs to convince the FRI verifier that a given codeword is close to a codeword of low degree in terms of Hamming weight when in fact it is far from all such codewords. You can think of the entire STARK machinery as forcing the cheating prover to commit to codewords far from low degree ones.

How to Fix the Issue

There are several ways forward.

Do nothing. It is not a security issue and cannot lead to exploits even if it does occur in practice.
Fix the test. Disturb the codeword in a noticeable fraction of locations, and not just in a few select points.
Send the last codeword in full, not just the sampled points. This solution is quite robust in the sense that it makes in impossible for the prover to succeed even if only one location is disturbed. However, it has the drawback of increasing the proof size as well as verifier time.
Set the number of colinearity tests to a power of two. In this way, the entire last codeword is interpolated by the verifier.
Panic. Irrational but fun.

Reduce length-prepending in `BFieldCodec` encoding through the use of `Cursor`

By using a Cursor data struct, I think we can avoid the prepending of the struct length when encoding both Vec<T> and [T; N]. Each element is double length prepended since the length prepending happens in both the encoding implementation for Vec<T> and T when the encoding length is not known at compile time.

I think this can be avoided if we use a type of Cursor<Vec<BFieldElement>> for the decoding process.

Make a ProofStream that is byte-serializable/deserializable and BFieldElement-serializable

The first proof-of-concept STARKs use proof_stream::ProofStream that serializes and deserializes to bytes using a combination of bincode and custom length-prepended items. It had two shortcomings being (1) slightly unsafe deserialization, and more significantly (2) did not serialize to a stream of BFieldElements, which is a prerequisite for implementing a recursive verifier in Triton VM, and for creating a Fiat-Shamir checksum over the content of the proof-stream using the Rescue-Prime hash function (both in the Rust version of verify and in the Triton VM version). Prior, making a Fiat-Shamir checksum with BLAKE3 over a byte array was possible. Here, proof_stream_typed::ProofStream was made which looks like:

impl<Item, H> ProofStream<Item, H>
where
    Item: IntoIterator<Item = BFieldElement> + Clone,
    H: Hasher<Digest = Vec<BFieldElement>>,
...

The implementation of proof_stream_typed::ProofStream has some problems, though:

It does not serialize/deserialize to &[u8], which is needed for the network client.
The way it serializes to &[BFieldElement] isn't particularly efficient (a lot of unnecessary copying).
The Item type is shared between STARK and FRI, creating an unnecessary coupling: We'd like for Fri to be oblivious of any particular STARK it's used in (so that it can live in twenty-first and be used in triton-vm), but proof_stream_typed::ProofStream forces the Item type to be shared, meaning Fri proof-stream items are a subset of constructors in a ProofItem enum in the triton-vm repository.

The task is to create a ProofStream based on the original untyped ProofStream that happens to also serialize to &[BFieldElement].

Leaky abstraction in implementation of `Mul` for `XFieldElement`

The implementation of Mul for XFieldElement contains an optimization that's probably only relevant for NTT: If the right-hand side of the product lies in the B field (i.e., if the right-hand side is a BFieldElement lifted into the X field) then only 3 B field multiplications have to be done as opposed to the regular 12 that have to be performed. This gives a great speedup that we do not want to miss.

But maybe we can create a better interface for this functionality? We could add a function to the FiniteField trait called mul_with_b_field or something like that. Then that function could be called from the NTT and the mul implementation for XFieldElement would avoid this optimization that complicates the code.

The relevant lines are highlighted below, from src/shared_math/ntt.rs:

    let mut m = 1;
    for _ in 0..log_2_of_n {
        let w_m = omega.mod_pow_u32(n / (2 * m));
        let mut k = 0;
        while k < n {
            let mut w = PFElem::one();
            for j in 0..m {
                let mut t = x[(k + j + m) as usize];
                t *= w; # <----- HERE!
                let mut tmp = x[(k + j) as usize];
                tmp -= t;
                x[(k + j + m) as usize] = tmp;
                x[(k + j) as usize] += t;
                w *= w_m;  # <----- AND HERE!
            }

            k += 2 * m;
        }

        m *= 2;
    }

FRI must sample unique locations in each round

Our current FRI implementation probes non-unique locations and points can thus be repeated in each round. They must all be unique so the set of remaining locations must be remembered and the next chosen point must be sampled from there. If there are not enough unique points left, the FRI iterator must stop and return the final codeword which may not be constant but could be of degree e.g. 8. The verifier must verify that this last codeword has the expected degree.

Add benchmarking

Follow this guide to add benchmarking functionality for this repo
http://seenaburns.com/benchmarking-rust-with-cargo-bench/

Write STARK verifier, and test it

The STARK verifier has been implemented in a test, but we need a separate function for it.

Use NTT to evaluate and interpolate polynomials

Our current algorithms for polynomial interpolation and evaluation for multiple points are not asymptotically optimal as they have quadratic runtimes. With NTT, we can go from quadratic rumtimes to Nlog(N) runtimes.

Add `inverse_or_zero()` to {B, X}FieldElement

In Triton VM, it is a common pattern to invert a field element if it the element is non-zero. Instead of writing the same code all the time, supply method mul_inverse_or_zero().

Kill Rescue-Prime

We no longer use this hash function at all.

Separate integer ring Lagrange interpolation from modular arithmetic

We support both Lagrange interpolation for integer rings and for modular arithmetic, this functionality should be split into two files, since it's a mess right now. We should have one file for integer ring polynomials and one for modular arithmetic polynomials.