Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(distributed_sp1): first working version of sp1 distributed prover #302

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Champii
Copy link

@Champii Champii commented Jul 2, 2024

This PR introduces the SP1DistributedProver

Description

These changes add a new TCP server and a new distributed prover for sp1.

It introduce two new roles for a taiko instance that are the orchestrator and the worker.

The orchestrator is responsible for fetching the blocks data as a normal prover would. It will then check in its ip list file distributed.json the list of every worker it has access to, and check their connectivity and send a simple Ping packet to assess their reliability.

It will then compute the execution's checkpoints and then send each checkpoints to a different worker for them to commit the shard. Each worker then send back the commitment data.when all workers answered, the orchestrator instantiate a Challenger to observe the commitments. It then send the challenger and the checkpoints again to each worker so that they can prove them.
The orchestrator then merge them, verify the complete proof and sends it back to the caller.

The checkpoints contain a number of shards that are calculated to be equaly distributed between each worker. In fact, the number of checkpoints match the number of available workers.

In the event of a worker failure during runtime (network or processing error), the unproven checkpoint is pushed back in a queue for any remaining workers to pick.

How to run

The Orchestrator

Create a file distributed.json containing a JSON array of the workers' IP:PORT like this

[
    "1.2.3.4:8081",
    "5.6.7.8:8081"
]

NOTE: The orchestrator can also be a worker in its own pool

Then run raiko as you are used to

export TARGET=sp1
export CPU_OPT=1
# This is used to minimise the RAM usage. Increase this to reduce the proving time
# Note: you don't need to set it on the workers, it is propagated along the request
export SHARD_BATCH_SIZE=1
make run

The Workers

You have to set the ORCHESTRATOR env variable. It will enable the sp1 worker on this instance and only allow connections from this orchestrator address

export TARGET=sp1
export CPU_OPT=1
export ORCHESTRATOR=10.200.0.15
make run

The Script

You have to specify the prover as sp1_distributed.

./script/prove-block.sh taiko_a7 sp1_distributed 333763

What's left to do

  • Add a version number to the WorkerEnvelope to make sure orchestrator and workers are compatible
  • Even better error management
  • Refactor the sp1_specific module that contains the code used for proving
  • Make clippy happy
  • Fix the CI tests
  • Better TCP socket reliability (add a size limit)
  • Add a target to the makefile to run in worker mode

What's next

  • Allow the workers to cancel their work if the orchestrator cancels
  • Encrypt the network traffic
  • Investigate a regression in the time taken by sp1 to prove a block (15% more time spent proving than the previous used version)
  • Remove the patched version of Plonky3 that is used to (de)serialize the DuplexChallenger. Propose a PR for that
  • Add some tests ?

@Champii Champii had a problem deploying to test-environment July 2, 2024 02:30 — with GitHub Actions Failure
@Champii Champii temporarily deployed to test-environment July 2, 2024 02:37 — with GitHub Actions Inactive
@Champii Champii changed the title WIP: First working version of SP1 Distributed Prover [WIP] feat(distributed_sp1): First working version of SP1 Distributed Prover Jul 2, 2024
host/src/lib.rs Show resolved Hide resolved
host/src/server/worker.rs Outdated Show resolved Hide resolved
host/Cargo.toml Show resolved Hide resolved
provers/sp1/driver/Cargo.toml Outdated Show resolved Hide resolved
provers/sp1/driver/src/distributed/worker/envelope.rs Outdated Show resolved Hide resolved
provers/sp1/driver/src/distributed/worker/socket.rs Outdated Show resolved Hide resolved
@Champii Champii temporarily deployed to test-environment July 2, 2024 17:14 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 3229632 to a0272b8 Compare July 2, 2024 17:25
@Champii Champii temporarily deployed to test-environment July 2, 2024 17:26 — with GitHub Actions Inactive
@Champii Champii temporarily deployed to test-environment July 2, 2024 18:11 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 1c98633 to 21a6338 Compare July 2, 2024 18:23
@Champii Champii temporarily deployed to test-environment July 2, 2024 18:25 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 21a6338 to 19084ab Compare July 2, 2024 18:40
@Champii Champii temporarily deployed to test-environment July 2, 2024 18:41 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 19084ab to 0b10003 Compare July 2, 2024 18:45
@Champii Champii temporarily deployed to test-environment July 2, 2024 18:46 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 0b10003 to 03ff234 Compare July 3, 2024 13:54
@Champii Champii temporarily deployed to test-environment July 3, 2024 13:55 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 03ff234 to 1dc27c3 Compare July 3, 2024 14:33
@Champii Champii temporarily deployed to test-environment July 3, 2024 14:35 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 1dc27c3 to c52bc22 Compare July 3, 2024 15:27
@Champii Champii temporarily deployed to test-environment July 3, 2024 15:28 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from c52bc22 to 5435c64 Compare July 3, 2024 15:44
@Champii Champii temporarily deployed to test-environment July 3, 2024 15:45 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 5435c64 to 85fff43 Compare July 3, 2024 16:02
@Champii Champii temporarily deployed to test-environment July 3, 2024 16:03 — with GitHub Actions Inactive
@Champii Champii temporarily deployed to test-environment July 4, 2024 07:02 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from 2680485 to 31ef3a2 Compare July 4, 2024 07:14
@Champii Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 4db0ce9 to e14f6d6 Compare July 10, 2024 13:27
@Champii Champii temporarily deployed to test-environment July 10, 2024 13:28 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from e14f6d6 to b7bf25c Compare July 10, 2024 13:37
@Champii Champii temporarily deployed to test-environment July 10, 2024 13:37 — with GitHub Actions Inactive
@Champii Champii force-pushed the sp1_distributed_prover branch from b7bf25c to 3bad047 Compare July 10, 2024 13:41
@Champii Champii temporarily deployed to test-environment July 10, 2024 13:42 — with GitHub Actions Inactive
Copy link
Contributor

@smtmfft smtmfft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly go through once, seems good to me.
just 1 thinking that maybe we should organize it using a new independent crate/repo and upstream to sp1. Because it actually has nothing (just few if not none) to do with raiko, as it's a complement to sp1 local & sp1 network, we can call sp1 with SP1_DISTRIBUTE="url.xxx.xxx", just like current sp1 network and risc0 bonsai.

@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType {
f.write_str(match self {
ProofType::Native => "native",
ProofType::Sp1 => "sp1",
ProofType::Sp1Distributed => "sp1_distributed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just feel that we don't need to make it explicit, the client does not need to know there is sp1 or sp1_distributed. So merge it into inside sp1 prover might be more reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smtmfft So only behave differently based on compile time flags?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he talks about runtime ENV variable like SP1_DISTRIBUTE which would indeed be more aligned with the current sp1 behaviour

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, like current sp1, you can setup env to customize sp1's behavior without client's awareness.
for example by set/unset SP1_PROVER=network SP1_PRIVATE_KEY=... then raiko get proof from sp1 network/local-cpus, but for client, they are the same sp1 proof.

provers/sp1/driver/src/distributed/worker/pool.rs Outdated Show resolved Hide resolved
@@ -0,0 +1 @@
1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to sth like 0.1.0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should directly include the Cargo.toml version field ?

Maybe having a full semver is a bit overkill, as this field is only here to indicate actual breaking changes and incompatibilities between different workers' version. So tracking minors and patches seems irrelevant to me, what do you think ?

@Champii
Copy link
Author

Champii commented Jul 12, 2024

quickly go through once, seems good to me. just 1 thinking that maybe we should organize it using a new independent crate/repo and upstream to sp1. Because it actually has nothing (just few if not none) to do with raiko, as it's a complement to sp1 local & sp1 network, we can call sp1 with SP1_DISTRIBUTE="url.xxx.xxx", just like current sp1 network and risc0 bonsai.

That was my first approach, but @mratsim advocated for inclusion into raiko directly to avoid having to manage a full sp1 fork. I can put this code back into sp1 but it might adds some complexity

@smtmfft
Copy link
Contributor

smtmfft commented Jul 15, 2024

That was my first approach, but @mratsim advocated for inclusion into raiko directly to avoid having to manage a full sp1 fork. I can put this code back into sp1 but it might adds some complexity

you don't need to manage a full sp1 fork isn't it?, basically what you need is a stable sp1 API set?? I was thinking of it because this solution is more generic than just a raiko sub-module. Anyway, for fork management, I think a reasonable working approach is stick to release to avoid managing unstable forks, as currently taiko-reth does, but sp1 seems do not have one official release so far. BTW: I think once it has, raiko should be switch to that one either to avoid endless upgrading API.

@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType {
f.write_str(match self {
ProofType::Native => "native",
ProofType::Sp1 => "sp1",
ProofType::Sp1Distributed => "sp1_distributed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smtmfft So only behave differently based on compile time flags?

Comment on lines 52 to 61
if let WorkerResponse::Commitment {
commitments,
shards_public_values,
} = response
{
commitments_vec.extend(commitments);
shards_public_values_vec.extend(shards_public_values);
} else {
return Err(WorkerError::InvalidResponse);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be refactored into let ... else form. Not that important, it makes it into a guard clause which is more readable IMO.

Something like:

Suggested change
if let WorkerResponse::Commitment {
commitments,
shards_public_values,
} = response
{
commitments_vec.extend(commitments);
shards_public_values_vec.extend(shards_public_values);
} else {
return Err(WorkerError::InvalidResponse);
}
let WorkerResponse::Commitment {
commitments,
shards_public_values,
} = response else {
return Err(WorkerError::InvalidResponse);
};
commitments_vec.extend(commitments);
shards_public_values_vec.extend(shards_public_values);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed more readable :)

Comment on lines 93 to 97
if let WorkerResponse::Proof(partial_proof) = response {
proofs.extend(partial_proof);
} else {
return Err(WorkerError::InvalidResponse);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here (see above comment)

Comment on lines 57 to 58
WorkerProtocol::Request(req) => write!(f, "Request({})", req),
WorkerProtocol::Response(res) => write!(f, "Response({})", res),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be a variable capture instead of passing it to the macro:

Suggested change
WorkerProtocol::Request(req) => write!(f, "Request({})", req),
WorkerProtocol::Response(res) => write!(f, "Response({})", res),
WorkerProtocol::Request(req) => write!(f, "Request({req})"),
WorkerProtocol::Response(res) => write!(f, "Response({res})"),

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I should definitely use this feature more, I think it didn't exist when I first learned Rust and I've been stuck with this since

async fn read_data(&mut self) -> Result<Vec<u8>, WorkerError> {
let size = self.socket.read_u64().await? as usize;

log::debug!("Receiving data with size: {:?}", size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log::debug!("Receiving data with size: {:?}", size);
log::debug!("Receiving data with size: {size:?}");

// TODO: handle the case where the data is bigger than expected
}
Err(e) => {
log::error!("failed to read from socket; err = {:?}", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log::error!("failed to read from socket; err = {:?}", e);
log::error!("failed to read from socket; err = {e:?}");

@Champii Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 1a7cdae to 810d241 Compare July 24, 2024 09:39
@Champii Champii had a problem deploying to test-environment July 24, 2024 09:41 — with GitHub Actions Failure
@Champii Champii force-pushed the sp1_distributed_prover branch from 810d241 to 0c4a1ea Compare July 24, 2024 09:49
@Champii Champii had a problem deploying to test-environment July 24, 2024 09:51 — with GitHub Actions Failure
@Champii Champii force-pushed the sp1_distributed_prover branch from 0c4a1ea to debe988 Compare July 24, 2024 09:53
@Champii Champii had a problem deploying to test-environment July 24, 2024 09:54 — with GitHub Actions Failure
@Champii Champii force-pushed the sp1_distributed_prover branch from debe988 to d8738fd Compare July 24, 2024 09:59
@Champii Champii had a problem deploying to test-environment July 24, 2024 10:00 — with GitHub Actions Failure
.verify(&proof, &vk)
.map_err(|_| ProverError::GuestError("Sp1: verification failed".to_owned()))?;
client.verify(&proof, &vk).map_err(|e| {
ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned())
ProverError::GuestError(format!("Sp1: verification failed: {e:#?}").to_owned())

sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", branch = "main" }
sp1-sdk = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }
sp1-zkvm = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }
sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the v1.0.1 tag instead of the commit hash, it would be easier for future maintenance as people won't try to hunt "what is the specific feature we depend on that was added in that commit?"

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm waiting for taikoxyz/sp1#1 to be merged in the taiko branch and I'll use that particular revision as a source instead, as the tag v1.0.1 will be obsolete. Should we add our own versioning system on top of sp1's and refer to that ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe v1.0.1-taiko-1, you're the SP1 maintainer now ;). cc @smtmfft @Brechtpd @petarvujovic98

p3-challenger = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
p3-poseidon2 = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
p3-baby-bear = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
p3-symmetric = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be added to the taikoxyz org, or maybe a taikoxyz-patches special org to avoid pollution if across Raiko and Gwyneth we expect lots of long-term patches. cc @Brechtpd

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Could anybody fork Plonky3 so that I can make a PR for that too ? :)

@@ -88,6 +103,10 @@ impl Opts {
"0.0.0.0:8080".to_string()
}

fn default_sp1_worker_address() -> String {
"0.0.0.0:8081".to_string()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"0.0.0.0:8081".to_string()
// Placeholder address
"0.0.0.0:8081".to_string()

AFAIK in some cases 0.0.0.0 is/was use for broadcasting so I think it should be clarified it's a placeholder until we have a valid address.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the listening address of the worker socket, I always saw this format to indicate to listen on every interfaces. I'll do some more research but that's a sensible default in my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants