-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(distributed_sp1): first working version of sp1 distributed prover #302
base: main
Are you sure you want to change the base?
Conversation
3229632
to
a0272b8
Compare
1c98633
to
21a6338
Compare
21a6338
to
19084ab
Compare
19084ab
to
0b10003
Compare
0b10003
to
03ff234
Compare
03ff234
to
1dc27c3
Compare
1dc27c3
to
c52bc22
Compare
c52bc22
to
5435c64
Compare
5435c64
to
85fff43
Compare
2680485
to
31ef3a2
Compare
4db0ce9
to
e14f6d6
Compare
e14f6d6
to
b7bf25c
Compare
b7bf25c
to
3bad047
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quickly go through once, seems good to me.
just 1 thinking that maybe we should organize it using a new independent crate/repo and upstream to sp1. Because it actually has nothing (just few if not none) to do with raiko, as it's a complement to sp1 local & sp1 network, we can call sp1 with SP1_DISTRIBUTE="url.xxx.xxx", just like current sp1 network and risc0 bonsai.
core/src/interfaces.rs
Outdated
@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType { | |||
f.write_str(match self { | |||
ProofType::Native => "native", | |||
ProofType::Sp1 => "sp1", | |||
ProofType::Sp1Distributed => "sp1_distributed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just feel that we don't need to make it explicit, the client does not need to know there is sp1 or sp1_distributed. So merge it into inside sp1 prover might be more reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smtmfft So only behave differently based on compile time flags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he talks about runtime ENV variable like SP1_DISTRIBUTE
which would indeed be more aligned with the current sp1 behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, like current sp1, you can setup env to customize sp1's behavior without client's awareness.
for example by set/unset SP1_PROVER=network SP1_PRIVATE_KEY=...
then raiko get proof from sp1 network/local-cpus, but for client, they are the same sp1 proof.
provers/sp1/driver/worker.version
Outdated
@@ -0,0 +1 @@ | |||
1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to sth like 0.1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should directly include the Cargo.toml
version
field ?
Maybe having a full semver is a bit overkill, as this field is only here to indicate actual breaking changes and incompatibilities between different workers' version. So tracking minors and patches seems irrelevant to me, what do you think ?
That was my first approach, but @mratsim advocated for inclusion into raiko directly to avoid having to manage a full sp1 fork. I can put this code back into sp1 but it might adds some complexity |
you don't need to manage a full sp1 fork isn't it?, basically what you need is a stable sp1 API set?? I was thinking of it because this solution is more generic than just a raiko sub-module. Anyway, for fork management, I think a reasonable working approach is stick to release to avoid managing unstable forks, as currently taiko-reth does, but sp1 seems do not have one official release so far. BTW: I think once it has, raiko should be switch to that one either to avoid endless upgrading API. |
core/src/interfaces.rs
Outdated
@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType { | |||
f.write_str(match self { | |||
ProofType::Native => "native", | |||
ProofType::Sp1 => "sp1", | |||
ProofType::Sp1Distributed => "sp1_distributed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smtmfft So only behave differently based on compile time flags?
if let WorkerResponse::Commitment { | ||
commitments, | ||
shards_public_values, | ||
} = response | ||
{ | ||
commitments_vec.extend(commitments); | ||
shards_public_values_vec.extend(shards_public_values); | ||
} else { | ||
return Err(WorkerError::InvalidResponse); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be refactored into let ... else
form. Not that important, it makes it into a guard clause which is more readable IMO.
Something like:
if let WorkerResponse::Commitment { | |
commitments, | |
shards_public_values, | |
} = response | |
{ | |
commitments_vec.extend(commitments); | |
shards_public_values_vec.extend(shards_public_values); | |
} else { | |
return Err(WorkerError::InvalidResponse); | |
} | |
let WorkerResponse::Commitment { | |
commitments, | |
shards_public_values, | |
} = response else { | |
return Err(WorkerError::InvalidResponse); | |
}; | |
commitments_vec.extend(commitments); | |
shards_public_values_vec.extend(shards_public_values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed more readable :)
if let WorkerResponse::Proof(partial_proof) = response { | ||
proofs.extend(partial_proof); | ||
} else { | ||
return Err(WorkerError::InvalidResponse); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here (see above comment)
WorkerProtocol::Request(req) => write!(f, "Request({})", req), | ||
WorkerProtocol::Response(res) => write!(f, "Response({})", res), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be a variable capture instead of passing it to the macro:
WorkerProtocol::Request(req) => write!(f, "Request({})", req), | |
WorkerProtocol::Response(res) => write!(f, "Response({})", res), | |
WorkerProtocol::Request(req) => write!(f, "Request({req})"), | |
WorkerProtocol::Response(res) => write!(f, "Response({res})"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I should definitely use this feature more, I think it didn't exist when I first learned Rust and I've been stuck with this since
async fn read_data(&mut self) -> Result<Vec<u8>, WorkerError> { | ||
let size = self.socket.read_u64().await? as usize; | ||
|
||
log::debug!("Receiving data with size: {:?}", size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log::debug!("Receiving data with size: {:?}", size); | |
log::debug!("Receiving data with size: {size:?}"); |
// TODO: handle the case where the data is bigger than expected | ||
} | ||
Err(e) => { | ||
log::error!("failed to read from socket; err = {:?}", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log::error!("failed to read from socket; err = {:?}", e); | |
log::error!("failed to read from socket; err = {e:?}"); |
1a7cdae
to
810d241
Compare
810d241
to
0c4a1ea
Compare
0c4a1ea
to
debe988
Compare
debe988
to
d8738fd
Compare
.verify(&proof, &vk) | ||
.map_err(|_| ProverError::GuestError("Sp1: verification failed".to_owned()))?; | ||
client.verify(&proof, &vk).map_err(|e| { | ||
ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned()) | |
ProverError::GuestError(format!("Sp1: verification failed: {e:#?}").to_owned()) |
sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", branch = "main" } | ||
sp1-sdk = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" } | ||
sp1-zkvm = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" } | ||
sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm waiting for taikoxyz/sp1#1 to be merged in the taiko
branch and I'll use that particular revision as a source instead, as the tag v1.0.1 will be obsolete. Should we add our own versioning system on top of sp1's and refer to that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe v1.0.1-taiko-1
, you're the SP1 maintainer now ;). cc @smtmfft @Brechtpd @petarvujovic98
p3-challenger = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" } | ||
p3-poseidon2 = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" } | ||
p3-baby-bear = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" } | ||
p3-symmetric = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be added to the taikoxyz org, or maybe a taikoxyz-patches special org to avoid pollution if across Raiko and Gwyneth we expect lots of long-term patches. cc @Brechtpd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Could anybody fork Plonky3
so that I can make a PR for that too ? :)
@@ -88,6 +103,10 @@ impl Opts { | |||
"0.0.0.0:8080".to_string() | |||
} | |||
|
|||
fn default_sp1_worker_address() -> String { | |||
"0.0.0.0:8081".to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"0.0.0.0:8081".to_string() | |
// Placeholder address | |
"0.0.0.0:8081".to_string() |
AFAIK in some cases 0.0.0.0 is/was use for broadcasting so I think it should be clarified it's a placeholder until we have a valid address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the listening address of the worker socket, I always saw this format to indicate to listen on every interfaces. I'll do some more research but that's a sensible default in my opinion.
This PR introduces the
SP1DistributedProver
Description
These changes add a new TCP server and a new distributed prover for sp1.
It introduce two new roles for a taiko instance that are the
orchestrator
and theworker
.The orchestrator is responsible for fetching the blocks data as a normal prover would. It will then check in its ip list file
distributed.json
the list of every worker it has access to, and check their connectivity and send a simplePing
packet to assess their reliability.It will then compute the execution's checkpoints and then send each checkpoints to a different worker for them to commit the shard. Each worker then send back the commitment data.when all workers answered, the orchestrator instantiate a
Challenger
to observe the commitments. It then send the challenger and the checkpoints again to each worker so that they can prove them.The orchestrator then merge them, verify the complete proof and sends it back to the caller.
The checkpoints contain a number of shards that are calculated to be equaly distributed between each worker. In fact, the number of checkpoints match the number of available workers.
In the event of a worker failure during runtime (network or processing error), the unproven checkpoint is pushed back in a queue for any remaining workers to pick.
How to run
The Orchestrator
Create a file
distributed.json
containing a JSON array of the workers'IP:PORT
like thisNOTE: The orchestrator can also be a worker in its own pool
Then run raiko as you are used to
The Workers
You have to set the
ORCHESTRATOR
env variable. It will enable the sp1 worker on this instance and only allow connections from this orchestrator addressThe Script
You have to specify the prover as
sp1_distributed
../script/prove-block.sh taiko_a7 sp1_distributed 333763
What's left to do
WorkerEnvelope
to make sure orchestrator and workers are compatiblesp1_specific
module that contains the code used for provingclippy
happyWhat's next
Plonky3
that is used to (de)serialize theDuplexChallenger
. Propose a PR for that