-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kad: Fix substream opening and dialing race #222
Conversation
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Other protocols are likely also racy when opening a substream?
After a quick look, it seems this is the only place where we tried to open a substream and then dial on failure. |
Signed-off-by: Alexandru Vasile <[email protected]>
10aa306
to
73ce087
Compare
73ce087
to
55f555e
Compare
## [0.7.0] - 2024-09-05 This release introduces several new features, improvements, and fixes to the litep2p library. Key updates include enhanced error handling, configurable connection limits, and a new API for managing public addresses. ### [Exposing Public Addresses API](#212) A new `PublicAddresses` API has been added, enabling users to manage the node's public addresses. This API allows developers to add, remove, and retrieve public addresses, which are shared with peers through the Identify protocol. ```rust // Public addresses are accessible from the main litep2p instance. let public_addresses = litep2p.public_addresses(); // Add a new public address to the node. if let Err(err) = public_addresses.add_address(multiaddr) { eprintln!("Failed to add public address: {:?}", err); } // Remove a public address from the node. public_addresses.remove_address(&multiaddr); // Retrieve all public addresses of the node. for address in public_addresses.get_addresses() { println!("Public address: {}", address); } ``` **Breaking Change**: The Identify protocol no longer includes public addresses in its configuration. Instead, use the new `PublicAddresses` API. Migration Guide: ```rust // Before: let (identify_config, identify_event_stream) = IdentifyConfig::new( "/substrate/1.0".to_string(), Some(user_agent), config.public_addresses, ); // After: let (identify_config, identify_event_stream) = IdentifyConfig::new("/substrate/1.0".to_string(), Some(user_agent)); // Public addresses must now be added using the `PublicAddresses` API: for address in config.public_addresses { if let Err(err) = public_addresses.add_address(address) { eprintln!("Failed to add public address: {:?}", err); } } ``` ### [Dial Error and List Dial Failures Event](#206) The `DialFailure` event has been enhanced with a new `DialError` enum for more precise error reporting when a dial attempt fails. Additionally, a `ListDialFailures` event has been introduced, listing all dialed addresses and their corresponding errors when multiple addresses are involved. Other litep2p errors, such as `ParseError`, `AddressError`, and `NegotiationError`, have been refactored for improved error propagation. ### [Immediate Dial Error and Request-Response Rejection Reasons](#227) This new feature paves the way for better error handling in the `litep2p` library and moves away from the overarching `litep2p::error::Error` enum. The newly added `ImmediateDialError` enum captures errors occurring before a dial request is sent (e.g., missing peer IDs). It also enhances the `RejectReason` enum for request-response protocols, offering more detailed rejection reasons. ```rust match error { RequestResponseError::Rejected(reason) => { match reason { RejectReason::ConnectionClosed => "connection-closed", RejectReason::DialFailed(Some(ImmediateDialError::AlreadyConnected)) => "already-connected", _ => "other", } } _ => "other", } ``` ### [Connection Limits](#185) Developers can now set limits on the number of inbound and outbound connections to manage resources and optimize performance. ```rust // Configure connection limits for inbound and outbound established connections. let litep2p_config = Config::default() .with_connection_limits(ConnectionLimitsConfig::default() .max_incoming_connections(Some(3)) .max_outgoing_connections(Some(2)) ); ``` ### [Feature Flags for Optional Transports](#192) The library now supports feature flags to selectively enable or disable transport protocols. By default, only the `TCP` transport is enabled. Optional transports include: - `quic` - Enables QUIC transport. - `websocket` - Enables WebSocket transport. - `webrtc` - Enables WebRTC transport. ### [Configurable Keep-Alive Timeout](#155) The keep-alive timeout for connections is now configurable, providing more control over connection lifecycles. ```rust // Set keep alive timeout for connections. let litep2p_config = Config::default() .with_keep_alive_timeout(Duration::from_secs(30)); ``` Thanks for contributing to this @[Ma233](https://github.com/Ma233)! ### Added - errors: Introduce immediate dial error and request-response rejection reasons ([#227](#227)) - Expose API for `PublicAddresses` ([#212](#212)) - transport: Implement `TransportService::local_peer_id()` ([#224](#224)) - find_node: Optimize parallelism factor for slow to respond peers ([#220](#220)) - kad: Handle `ADD_PROVIDER` & `GET_PROVIDERS` network requests ([#213](#213)) - errors: Add `DialError` error and `ListDialFailures` event for better error reporting ([#206](#206)) - kad: Add support for provider records to `MemoryStore` ([#200](#200)) - transport: Add accept_pending/reject_pending for inbound connections and introduce inbound limits ([#194](#194)) - transport/manager: Add connection limits for inbound and outbound established connections ([#185](#185)) - kad: Add query id to log messages ([#174](#174)) ### Changed - transport: Replace trust_dns_resolver with hickory_resolver ([#223](#223)) - crypto/noise: Generate keypair only for Curve25519 ([#214](#214)) - transport: Allow manual setting of keep-alive timeout ([#155](#155)) - kad: Update connection status of an existing bucket entry ([#181](#181)) - Make transports optional ([#192](#192)) ### Fixed - kad: Fix substream opening and dialing race ([#222](#222)) - query-executor: Save the task waker on empty futures ([#219](#219)) - substream: Use write_all instead of manually writing bytes ([#217](#217)) - minor: fix tests without `websocket` feature ([#215](#215)) - Fix TCP, WebSocket, QUIC leaking connection IDs in `reject()` ([#198](#198)) - transport: Fix double lock and state overwrite on disconnected peers ([#179](#179)) - kad: Do not update memory store on incoming `GetRecordSuccess` ([#190](#190)) - transport: Reject secondary connections with different connection IDs ([#176](#176)) ### Testing Done - pulled the latest changes into Substrate - performed warp sync in kusama cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <[email protected]>
This PR makes the substream opening and dialing of query commands more robust.
In the past, the logic of a query action would:
There is a race between the moment when a substream is opened and when the peer is dialed (T0 and T1), where the peer could connect to us. The race manifested after extensively running
FindNode
commands via subp2p-explorer bench-cliTo mitigate this race, retry to open the substream on recoverable errors from dialing (ie Error::PeerAlreadyConnected).
cc @paritytech/networking