PoC scouting bugfix for IPv4LL/IPv6LL employed on multiple NICs #1598
PR missing one of the required labels: {'documentation', 'enhancement', 'bug', 'internal', 'dependencies', 'breaking-change', 'new feature'}
@Mallets, it would be great if you could restart the CI status checks.
@psiegl, the testing and review of this PR is on my backlog; I didn't forget about it :)
After investigating the code further, I believe we have a few roadblocks:
Assumptions
The Hello message is crafted only once with all the locators included and sent untouched on the multicast group. This means that the message going out on a given interface may contain locators that belong to another interface.
Proposed approach
The proposed approach always adds the
Considerations
So, do you have a more detailed use case for using link-local addresses in combination with scouting? And what would be the case where only link-local addresses are configured and no global addresses are available? In the end, should link-local locators simply be removed from scouting?
Well, as mentioned in my very first statement:
Maybe we could emphasise the niche; based on our discussion, it seems to be more of an enhancement. I also noticed that the Hello message gathers all given locators and sends one message back. To me, this looks more like a performance optimisation than a limitation as you state:
However, the gossip scouting seems to be a truly big issue (in case I understood correctly). I guess Zenoh is completely fine with one pure link-local network by itself, or with a node that contains two NICs, where the first NIC obtains both IPv4 and IPv6 via DHCP, while the second NIC performs IPv4LL/IPv6LL.
Link-local addresses are of interest in zero-configuration networking, which might not be the daily driver of Zenoh workloads. However, in my case, due to some requirements, IPv4LL/IPv6LL are of considerable interest (see my very first intro explanation). Considering that, as of today, Zenoh would only be able to fix this issue on Unix derivatives due to tokio, and considering that Zeroconf might not be the typical use case of Zenoh users: I wonder if you have an errata for Zenoh, where you could at least mention the above as a known issue? What do you think?
I agree on all the zero-touch networking; it's a concept I'm pretty familiar with, so it wasn't hard to convince me on that one :) Another option is to handle this directly at link level (e.g. TCP, UDP, etc.): whenever a link-local address is provided, either by multicast/gossip scouting or by the user, Zenoh will simply try to establish a connection on each of the available interfaces in turn. If the link-local address is not reachable on a given interface, the connection establishment at link level will fail and the next interface will be tried. Connection establishment may require a bit more time, but it should always work. This approach would also solve the issue of being forced to provide the interface name when configuring Zenoh via a config file or when using mDNS resolving to IPv4LL/IPv6LL. Clearly, this will be available only on Unix derivatives due to tokio, but still better than nothing :) Regarding known issues and errata, we track them in GitHub or directly in the documentation.
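The try-each-interface idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `connect_on_any_iface` and the `connect` callback are hypothetical names, not Zenoh APIs, and the callback stands in for whatever link-level connection attempt (TCP, UDP, ...) would be bound to the given interface.

```rust
/// Tries `connect` once per interface, in order, and returns the first success
/// together with the interface name that worked.
/// `connect` is a hypothetical stand-in for a link-level connection attempt
/// bound to the given interface; it is not a Zenoh function.
/// If every attempt fails, all errors are returned, keyed by interface name.
fn connect_on_any_iface<T, E>(
    ifaces: &[&str],
    mut connect: impl FnMut(&str) -> Result<T, E>,
) -> Result<(String, T), Vec<(String, E)>> {
    let mut failures = Vec::new();
    for &iface in ifaces {
        match connect(iface) {
            // First reachable interface wins; remaining ones are not tried.
            Ok(link) => return Ok((iface.to_string(), link)),
            // Remember the failure and fall through to the next interface.
            Err(err) => failures.push((iface.to_string(), err)),
        }
    }
    Err(failures)
}
```

The cost is one failed attempt per interface that cannot reach the peer, which matches the remark that connection establishment may take a bit longer.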
I like your idea, because it is truly simple. It is also likely easier to maintain in the long run.
This might be a niche issue; however, it is still a bug in Zenoh.
Below is the simplest description, but the issue could be a bit bigger.
Imagine there are two machines. Machine A has two NICs, where the first is connected to a DHCP IPv4 router.
Furthermore, machine A has a second NIC that is connected to machine B, with both employing IPv4LL/IPv6LL.
This could look as follows:
(For later simplicity, machine B's interface numbering starts at 1.)
The intention is to run a PUB on machine A and a SUB on machine B. Thus, what matters is ethernet 1.
Both machines run Zenoh in peer mode; however, other modes will have the same issue.
The network protocol of interest could be Zenoh's default, TCP.
Both machines shall be listening on IPv6:
One could hardcode the NIC with the Linux feature `#iface=eth1`, but it won't change the issue.
Now, what is important is:
Zenoh's scouting shall be used so that the machines find each other, given the IPv4LL/IPv6LL addresses.
Consequently, a Zenoh config could look as follows (in the case of IPv4 multicast):
Again using the Linux feature of hardcoding the NIC with `#iface=eth1` does not make a difference.
For now, I have not tried IPv6 multicast (something like `[ff00::224]`).
However, from looking at the code, I assume it wouldn't fix the issue.
Well, Zenoh's scouting reaches out to anyone, and a Hello returns with specific endpoints, as described in `zenoh/commons/zenoh-protocol/src/scouting/hello.rs`.
This is what happens when machine A obtains a Locator with machine B's IPv6LL address and port:
See `zenoh/src/net/runtime/orchestrator.rs` -> `Runtime::scout()`
Similarly, machine B obtains a Locator with machine A's IPv6LL address and port.
As soon as both try to establish the connection, machine A fails to do so, and thus the two can't communicate.
The reason is that link-local addresses don't explicitly come with a default route.
An `ip -6 route` on machine A will show:
Thus, despite the fact that machine B is physically connected to machine A via eth1, Zenoh will try to connect via eth0.
The reason is that Zenoh does not determine on which NIC an incoming Hello message truly arrived.
In other words, Zenoh hopes that the routing is set up nicely, which might not be the case for link-local addresses.
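To make the affected case concrete, here is a minimal sketch of how a locator received in a Hello could be classified as link-local, i.e. as one for which the routing table alone may not pick the right NIC. Both helper names are hypothetical and not part of Zenoh, and the locator layout `<proto>/<address>[?<metadata>][#<config>]` is an assumption of this sketch.

```rust
use std::net::{IpAddr, SocketAddr};

/// True for IPv4LL (169.254.0.0/16) and IPv6LL (fe80::/10) addresses.
fn is_link_local(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => v4.is_link_local(),
        // fe80::/10: the top 10 bits of the first segment are 1111 1110 10.
        IpAddr::V6(v6) => (v6.segments()[0] & 0xffc0) == 0xfe80,
    }
}

/// Hypothetical helper: extracts the socket address from a locator string such
/// as "tcp/[fe80::1]:7447" and reports whether it is link-local.
/// Returns None if the string does not match the assumed layout.
fn locator_is_link_local(locator: &str) -> Option<bool> {
    let (_proto, rest) = locator.split_once('/')?;
    // Drop the optional "?metadata" and "#config" suffixes (assumed layout).
    let addr = rest.split(|c| c == '?' || c == '#').next()?;
    let sock: SocketAddr = addr.parse().ok()?;
    Some(is_link_local(sock.ip()))
}
```

Only locators for which this check is true would need the receiving interface to be remembered; locators with routable addresses can keep relying on the routing table.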
While investigating `zenoh/src/net/runtime/orchestrator.rs` -> `Runtime::scout()`, the last time the socket with the incoming Hello message is available is at line:
As soon as this `select_all()`, including the loop, finishes, the information about the receiving NIC is gone.
However, the Locator does not obtain any NIC definition, see `commons/zenoh-protocol/src/core/locator.rs` -> `Locator::new()`
The default value for config is "", thus `Runtime::scout()` won't keep the information about the NIC:
Because of this, my suggestion is to modify each `hello.locator[]` by appending the specific interface via `#iface=<iface found>`
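The appending step can be sketched on plain strings as below. This is not the code of the PR: `with_iface` is a hypothetical helper, and it assumes that the config section of a locator starts at `#` and holds `key=value` pairs separated by `;`. Leaving an existing `iface=` entry untouched is a choice made for this sketch only.

```rust
/// Returns `locator` with `iface=<iface>` added to its config section.
/// Assumed layout: <proto>/<address>[?<metadata>][#<key>=<value>;...]
/// Hypothetical helper working on strings, not on Zenoh's Locator type.
fn with_iface(locator: &str, iface: &str) -> String {
    match locator.split_once('#') {
        // No config section yet: start one.
        None => format!("{}#iface={}", locator, iface),
        Some((head, config)) => {
            if config.split(';').any(|kv| kv.starts_with("iface=")) {
                // An interface is already set: keep the locator as it is.
                locator.to_string()
            } else if config.is_empty() {
                // A bare trailing '#': fill in the config section.
                format!("{}#iface={}", head, iface)
            } else {
                // Other config entries exist: append after a ';' separator.
                format!("{}#{};iface={}", head, config, iface)
            }
        }
    }
}
```

For example, `with_iface("tcp/[fe80::1]:7447", "eth1")` yields `tcp/[fe80::1]:7447#iface=eth1`, which is the form the `#iface=` feature on Linux expects according to the description above.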
Zenoh already provides such a lookup with: `zenoh_util::net::get_interface_names_by_addr(local_addr.ip())`
The only pity is that `get_interface_names_by_addr()` can return multiple NICs; for now, i.e. for this discussion, I fixed it by taking the first one.
Now, similar issues should also be observable with IPv4LL only, in case another machine C were connected to machine A's eth0, so that two link-local networks would be in use, i.e. between A and C as well as between A and B.
Bottom line: scouting has an issue when multiple NICs are present and `ip route` is inconclusive (as in the case of IPv4LL/IPv6LL). However, Zenoh receives any Hello response on a particular socket and thus has the capability to determine on which NIC such a Hello truly arrived. Zenoh only needs to keep this information and use it the moment a connection is established. At least for Linux, the `#iface=` feature is present, which the fix leverages. However, it seems Windows and Mac lack this feature, so it would be wise to implement it properly for these OSes as well.