-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
route: properly handle multiple kernel prefix routes with same dst #385
base: main
Are you sure you want to change the base?
Conversation
When configuring IP(v6) addresses on interfaces, the kernel will install appropriate prefix routes (except when using noprefixroute). When configuring addresses from the same subnet on multiple interfaces, the kernel will happily configure multiple prefix routes. Most prominent example are per interface IPv6 link local fe80::/64 routes. Currently libnl treats these routes as identical, and thus will only keep the newest one of the routes in the cache (as each additional one is treated as an update to the previous route). Therefore we need to extend route_id_attrs_get() to properly treat these routes as different: * For IPv4, these routes will have a preferred source set. * For IPv6, these routes will have a single nexthop pointing to the interface (OIF). Also, only kernel may create these routes, attempts from userspace will be rejected when trying to add a second route for the same prefix. ip route output: $ ip r show table all ... 10.0.0.0/24 dev tun0 proto kernel scope link src 10.0.0.1 10.0.0.0/24 dev enp0s31f6 proto kernel scope link src 10.0.0.2 ... broadcast 10.0.0.255 dev tun0 table local proto kernel scope link src 10.0.0.1 broadcast 10.0.0.255 dev enp0s31f6 table local proto kernel scope link src 10.0.0.2 ... fe80::/64 dev tun0 proto kernel metric 256 pref medium fe80::/64 dev enp0s31f6 proto kernel metric 1024 pref medium ... multicast ff00::/8 dev enp0s31f6 table local proto kernel metric 256 pref medium multicast ff00::/8 dev tun0 table local proto kernel metric 256 pref medium $ ip r show table all | wc -l 37 Before: $ ./src/nl-route-list ... inet 10.0.0.0/24 table main type unicast via dev tun0 ... inet 10.0.0.255 table local type broadcast via dev tun0 ... inet6 fe80::/64 table main type unicast via dev tun0 inet6 fe80::/64 table main type unicast via dev enp0s31f6 ... inet6 ff00::/8 table local type multicast via dev enp0s31f6 $ ./src/nl-route-list | grep -v cache | wc -l 34 After: $ ./src/nl-route-list ... inet 10.0.0.0/24 table main type unicast via dev tun0 inet 10.0.0.0/24 table main type unicast via dev enp0s31f6 ... inet6 fe80::/64 table main type unicast via dev tun0 inet6 fe80::/64 table main type unicast via dev enp0s31f6 ... inet6 ff00::/8 table local type multicast via dev enp0s31f6 inet6 ff00::/8 table local type multicast via dev tun0 $ ./src/nl-route-list | grep -v cache | wc -l 37 (ip route doesn't show cache routes) Signed-off-by: Jonas Gorski <[email protected]>
@@ -387,6 +387,39 @@ static uint32_t route_id_attrs_get(struct nl_object *obj) | |||
if (route->rt_family == AF_MPLS) | |||
rv &= ~ROUTE_ATTR_PRIO; | |||
|
|||
if (route->rt_protocol == RTPROT_KERNEL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's special in the case that only kernel can install "duplicate" routes for the same destination which have the same priority. Anybody else can only do that by setting different priorities (which is already part of the id attrs, so will be treated as unique routes).
At least in my limited testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you use ip route append
or ?ip route prepend
(EDIT: prepend
doesn't exist)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#!/bin/bash
set -x
ip netns del x 2>/dev/null || :
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up
ip -n x addr add 1:2:3::6/64 dev v0
ip -n x route append 1:2:3::99/128 via 1:2:3::1
ip -n x route append 1:2:3::99/128 via 1:2:3::2
ip -n x -6 addr
ip -n x -6 route
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't try append, that indeed did work:
$ip address add 10.0.0.1/24 dev tun0 noprefixroute
$ip address add 10.0.0.2/24 dev enp0s31f6 noprefixroute
$ip route add 10.0.0.0/24 dev tun0 src 10.0.0.1
$ip route append 10.0.0.0/24 dev enp0s31f6 src 10.0.0.2
Still presented to user space as multiple routes though:
$ ip r
10.0.0.0/24 dev tun0 scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 scope link src 10.0.0.2
Likewise for IPv6:
ip netns del x >/dev/null
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up
ip -n x link add v3 type veth peer v2
ip -n x link set v3 up
ip -n x addr add 2001:db8:1::1/64 dev v0
ip -n x addr add 2001:db8:1::2/64 dev v3
ip -n x -6 addr
ip -n x -6 route
=>
10: v0@v1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state LOWERLAYERDOWN qlen 1000
inet6 2001:db8:1::1/64 scope global tentative
valid_lft forever preferred_lft forever
12: v3@v2: <BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state UNKNOWN qlen 1000
inet6 2001:db8:1::2/64 scope global tentative
valid_lft forever preferred_lft forever
2001:db8:1::/64 dev v0 proto kernel metric 256 linkdown pref medium
2001:db8:1::/64 dev v3 proto kernel metric 256 linkdown pref medium
(I have no idea what the v* mean so I'm just cargo culting there).
So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?
Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE
, not as a single route with multiple nexthops (or a single nexthop group).
Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH
to the ID attributes is done to trigger a nexthop comparison by route_compare()
, without it ignores the difference in the (single) nexthop and treats the routes as identical.
So this is only as a "please also check that the nexthop (= inferface) is identical". Not meant that these routes are actually MULTIPATH routes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this special treatment for IPv6 ECMP/multhop does not apply to IPv4 routes. In IPv4, a route can have 1 or more next-hops, and those next-hops are just part of the identity of the route. If you add a second IPv4 route, that differs from the first only by the list of nexthops, then simply a new route is added (and you have two similar IPv4 routes with different nexthop lists). Quite easy. On RTM_NEWROUTE message corresponds to one route object.
ip -n x addr add 2001:db8:1::2/64 dev v3
here you added onlink-routes. I guess, this merging of next-hops does not happen then (as you see). Compare my example, where the nexthops are gateways, then the merging happens.
What maybe would be the right way to model the IPv6 behaviors, is that IPv6 route objects are always only singlehop. Except, that one RTM_NEWROUTE message then can notify about multiple(!) single-hop routes (instead of one multi-hop routes). I did that in NetworkManager. Anyway, that is not what libnl currently does. It would be quite a change in behavior. What it does instead, it tries to search the cache for "similar" routes, and mangle them according to nexthop updates. That seems a questionable approach. It also seems hard to do efficiently (we cannot do a dictionary lookup, if the object's identity contains the next-hop list, but we don't know the object we search for).
So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?
I think the first step is to extend the set of ID attributes. I would just follow what NetworkManager does. We can also add one ID attribute at a time, if you don't want to bother to investigate them all at once. What I think was wrong, is to tie that to if (route->rt_protocol == RTPROT_KERNEL)
.
Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).
Yes. Because they are on-link, right?
Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.
that sounds right. Especially for IPv4.
And for IPv6, I think the proper solution would be that a route object always only has one next hop. In that case, honoring ROUTE_ATTR_MULTIPATH would also be correct (albeit there is nothing "multi"). I think in step 1, we can ignore that problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess there will be three types of routes:
- no nexthop / synthetic nexthop based on RTA_OIF, RTF_VIA etc
- multipath with 1 or more nexthops
- nh id referencing a nexthop (group)
And via ip route append
you can have multiple routes out of these. E.g. I managed to setup:
$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 20 metric 1024 pref medium
nexthop via 2001:db8:1::2 dev v0 weight 1
nexthop via 2001:db8:2::2 dev v3 weight 1
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 via 2001:db8:2::3 dev v3 metric 1024 pref medium
where I maybe even found some weird kernel behavior With this setup, if I do
ip -n x -6 route append 2001:db8:3::/64 nexthop via 2001:db8:1::2 dev v0 nexthop via 2001:db8:2::3 dev v3
RTNETLINK answers: File exists
So the kernel says no, but it didn't do nothing:
$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 metric 1024 pref medium
nexthop via 2001:db8:2::3 dev v3 weight 1
nexthop via 2001:db8:1::2 dev v0 weight 1
It added the route regardless, and deleted the one referencing nh_id 20 and the "single path" route with direct via.
Feels like it shouldn't modify the configuration and return an error. Will need to investigate if this behavior is intended/expected.
|
||
/* | ||
* For IPv6 addresses, the prefix routes will have | ||
* a single dev nexthop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think IPv6 multihop routes need a special treatment to make sense with a concept of "identity" for route objects.
It's clear that the next-hop is part of the identity. If you add a route A with a certain next hop, and then add another route B that only differs by their nexthop, then kernels RTM_NEWROUTE
message will pretend this is one route C with two next hops. So what happend, was that adding route B caused route A to be replaced by route C. That makes only sense, if we think that IPv6 routes can only have one nexthop. With the minor exception, that one RTM_NEWROUTE/RTM_DELROUTE can affect multiple singlehop routes at once.
The idea that one RTM_DELROUTE modifies an existing route's nexthop list (thereby creating another route altogether) makes IMO no sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least the IPv6LL routes are treated as unique routes and not a single ECMP route:
$ strace ip -6 r show
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=
...
/* inet6 fe80::/64 table main type unicast via dev tun0 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 256], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("tun0")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]],
/* inet6 fe80::/64 table main type unicast via dev enp0s31f6 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 1024], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp0s31f6")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]]
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, and for some reason the priorities are different on my system (likely because one if on a VPN tun device), which is why before even shows them as different.
When configuring IP(v6) addresses on interfaces, the kernel will install appropriate prefix routes (except when using noprefixroute).
When configuring addresses from the same subnet on multiple interfaces, the kernel will happily configure multiple prefix routes. Most prominent example are per interface IPv6 link local fe80::/64 routes.
Currently libnl treats these routes as identical, and thus will only keep the newest one of the routes in the cache (as each additional one is treated as an update to the previous route).
Therefore we need to extend route_id_attrs_get() to properly treat these routes as different:
Also, only kernel may create these routes, attempts from userspace will be rejected when trying to add a second route for the same prefix.
ip route output:
Before:
After:
(ip route doesn't show cache routes)