Currently the code is marking the nhg as uninstalled but not
causing that to flood up to the dependent nhgs:
nhg 3 is a group of 1/2
1 -> interface A
2 -> interface B
Suppose A goes down, old code would mark nhg 1 as !VALID and !INSTALLED.
Suppose B then goes down, old code would mark nhg 2 as !VALID and !INSTALLED
But would not mark nhg 3 as !VALID and !INSTALLED (sort of assuming that
it would just be cleaned up by NHG refcounts ). I would prefer that
the code is pedantic about nhg 3 actually being removed from the system.
This code moves the setting of !INSTALLED into zebra_nhg.c where it
really belongs.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The nexthop group debugs were using %u to just display the id.
I found this very hard to figure out what was going on.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add `%pNG` so that a nexthop group can be displayed in debugs/logs
such that it can provide useful information.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Firstly, *keep no change* for `hash_get()` with NULL
`alloc_func`.
Only focus on cases with non-NULL `alloc_func` of
`hash_get()`.
Since `hash_get()` with non-NULL `alloc_func` parameter
shall not fail, just ignore the returned value of it.
The returned value must not be NULL.
So in this case, remove the unnecessary checking NULL
or not for the returned value and add `void` in front
of it.
Importantly, also *keep no change* for the two cases with
non-NULL `alloc_func` -
1) Use `assert(<returned_data> == <searching_data>)` to
ensure it is a created node, not a found node.
Refer to `isis_vertex_queue_insert()` of isisd, there
are many examples of this case in isid.
2) Use `<returned_data> != <searching_data>` to judge it
is a found node, then free <searching_data>.
Refer to `aspath_intern()` of bgpd, there are many
examples of this case in bgpd.
Here, <returned_data> is the returned value from `hash_get()`,
and <searching_data> is the data, which is to be put into
hash table.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Operators are seeing:
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_DELROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_NEWROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [TVM3E-A8ZAG] _netlink_route_build_singlepath: (single-path): 2804:4d48:4000::/42 nexthop via fe80::b6fb:e4ff:fe26:c5d5 if 2 vrf default(0)
Mar 28 07:19:37 kingpin zebra[418]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=140, msg cnt=2
Mar 28 07:19:37 kingpin zebra[418]: [P2XBZ-RAFQ5][EC 4043309074] Failed to install Nexthop ID (68) into the kernel
When `zebra nexthop proto only` is turned on.
Effectively zebra intentionally does not do the nexthop group installation
and the dplane notification in zebra_nhg.c just assumes it was a failure
and prints an error message. Since this act was intentional, let's
just notice that it was intentional and not report the message
as a failure.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently if a end user has something like this:
Routing entry for 192.168.212.1/32
Known via "kernel", distance 0, metric 100, best
Last update 00:07:50 ago
* directly connected, ens5
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.212.1, ens5, src 192.168.212.19, 00:00:15
C>* 192.168.212.0/27 is directly connected, ens5, 00:07:50
K>* 192.168.212.1/32 [0/100] is directly connected, ens5, 00:07:50
And FRR does a link flap, it refigures the route and rejects the default
route:
2022/04/09 16:38:20 ZEBRA: [NZNZ4-7P54Y] default(0:254):0.0.0.0/0: Processing rn 0x56224dbb5b00
2022/04/09 16:38:20 ZEBRA: [ZJVZ4-XEGPF] default(0:254):0.0.0.0/0: Examine re 0x56224dbddc20 (kernel) status: Changed Installed flags: Selected dist 0 metric 100
2022/04/09 16:38:20 ZEBRA: [GG8QH-195KE] nexthop_active_update: re 0x56224dbddc20 nhe 0x56224dbdd950 (7), curr_nhe 0x56224dedb550
2022/04/09 16:38:20 ZEBRA: [T9JWA-N8HM5] nexthop_active_check: re 0x56224dbddc20, nexthop 192.168.212.1, via ens5
2022/04/09 16:38:20 ZEBRA: [M7EN1-55BTH] nexthop_active: Route Type kernel has not turned on recursion
2022/04/09 16:38:20 ZEBRA: [HJ48M-MB610] nexthop_active_check: Unable to find active nexthop
2022/04/09 16:38:20 ZEBRA: [JPJF4-TGCY5] default(0:254):0.0.0.0/0: After processing: old_selected 0x56224dbddc20 new_selected 0x0 old_fib 0x56224dbddc20 new_fib 0x0
So the 192.168.212.1 route is matched for the nexthop but it is not connected and
zebra treats it as a problem. Modify the code such that if a system route
matches through another system route, then it should work imo.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This bug should only really affect kernel routes. To reproduce:
a) Have multiple connected routes that point to the same prefix
swp8 up default 169.254.0.250/30
swp9 up default 169.254.0.250/30
b) Have a kernel route that uses one of those connected routes
7.6.2.8 via 169.254.0.249 dev swp8 proto static
(But have it choose a non-selected connected nexthop)
c) Introduce an event that causes the rib table to be reprocessed,
say a unrelated interface going up / down
This causes the route to be lost with this message:
2022/03/28 21:21:53 ZEBRA: [YXCJP-0WZWV] netlink_nexthop_msg_encode: ID (3454): 169.254.0.249, via swp8(1383) vrf default(0)
2022/03/28 21:21:53 ZEBRA: [YF2E6-J60JH] nexthop_active: 169.254.0.249, via swp8 given ifindex does not match nexthops ifindex found found: directly connected, swp9
Effectively the nexthop that zebra is choosing would not be the one
that the kernel route has choosen and FRR removes the route:
022/03/28 21:21:53 ZEBRA: [NM15X-X83N9] rib_process: (0:254):7.6.2.8/32: rn 0x56042e632e90, removing re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [Y53JX-CBC5H] rib_unlink: (0:254):7.6.2.8/32: rn 0x56042e632e90, re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [KT8QQ-45WQ0] rib_gc_dest: (0:?):7.6.2.8/32: removing dest from table
What is happening?
Zebra is not looking at all connected routes and if any of them
would have the appropriate ifindex and just blindly rejecting
the route.
So when nexthop resolution happens and it matches a connected
route and the dest->selected nexthop ifindex does not match, let's sort
through the rest of them and see if any of them match and if so
let's keep the route.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add support for setting the protodown reason code.
829eb208e8
These patches handle all our netlink code for setting the reason.
For protodown reason we only set `frr` as the reason externally
but internally we have more descriptive reasoning available via
`show interface IFNAME`. The kernel only provides a bitwidth of 32
that all userspace programs have to share so this makes the most sense.
Since this is new functionality, it needs to be added to the dplane
pthread instead. So these patches, also move the protodown setting we
were doing before into the dplane pthread. For this, we abstract it a
bit more to make it a general interface LINK update dplane API. This
API can be expanded to support gernal link creation/updating when/if
someone ever adds that code.
We also move a more common entrypoint for evpn-mh and from zapi clients
like vrrpd. They both call common code now to set our internal flags
for protodown and protodown reason.
Also add debugging code for dumping netlink packets with
protodown/protodown_reason.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Contraints of host routes are too strict in current code:
Host routes with same destination address and nexthop address are forbidden
even when cross VRFs.
Currently host routes with different destination and nexthop address can cross
VRFs, it is ok. But host routes with same addresses are forbidden to cross VRFs,
it is wrong.
Since different VRFs can have the same addresses, leak specific host route with
the same nexthop address ( it means destination address is same to nexthop
address ) to other VRFs is a normal case.
This commit relaxes that contraints. Host routes with same destination address
and nexthop address are forbidden only when not cross VRFs.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When a end operator is doing cross vrf imports in bgp:
router bgp 3239 vrf FOO
address-family ipv4 uni
import vrf BAR
!
and zebra has this configuration:
vrf FOO
ip protocol bgp route-map EVA
!
The current code in zebra_nhg.c was looking up the vrf of the
nexthop and attempting to apply the ip protocol route-map.
For most people the nexthop vrf and the re vrf are one and the
same so they never see a problem.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
I'm seeing this crash in various forms:
Program terminated with signal SIGSEGV, Segmentation fault.
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f418efbc7c0 (LWP 3580253))]
(gdb) bt
(gdb) f 4
267 (*func)(hb, arg);
(gdb) p hb
$1 = (struct hash_bucket *) 0x558cdaafb250
(gdb) p *hb
$2 = {len = 0, next = 0x0, key = 0, data = 0x0}
(gdb)
I've also seen a crash where data is 0x03.
My suspicion is that hash_iterate is calling zebra_nhg_sweep_entry which
does delete the particular entry we are looking at as well as possibly other
entries when the ref count for those entries gets set to 0 as well.
Then we have this loop in hash_iterate.c:
for (i = 0; i < hash->size; i++)
for (hb = hash->index[i]; hb; hb = hbnext) {
/* get pointer to next hash bucket here, in case (*func)
* decides to delete hb by calling hash_release
*/
hbnext = hb->next;
(*func)(hb, arg);
}
Suppose in the previous loop hbnext is set to hb->next and we call
zebra_nhg_sweep_entry. This deletes the previous entry and also
happens to cause the hbnext entry to be deleted as well, because of nhg
refcounts. At this point in time the memory pointed to by hbnext is
not owned by the pthread anymore and we can end up on a state where
it's overwritten by another pthread in zebra with data for other incoming events.
What to do? Let's change the sweep function to a hash_walk and have
it stop iterating and to start over if there is a possible double
delete operation.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Current implementation doesn't copy nexthop_srv6. This causes unexpected
behavior when receiving SID information and nexthop isn't onlink.t
Signed-off-by: Ryoga Saito <contact@proelbtn.com>
At some point we broke the ifp pointer for nhe->ifp such
that it was pointing to an interface even in groups/recurisve
instances.
Add checks here to make it again so that we only set the ifp
pointer if it is a fully resolved singleton NHE.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
With this patch, zclient can intall seg6 rotues when
they set properties "nh_seg6_segs" on struct nexthop
and set ZEBRA_FLAG_SEG6_ROUTE on zapi_route's flag.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
With this patch, zclient can intall seg6local rotues whem
they set properties nh_seg6local_{action,ctx} on struct nexthop
and set ZEBRA_FLAG_SEG6LOCAL_ROUTE on zapi_route's flag.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
This action is initiated by nhrp and has been stubbed when
moving to zebra. Now, a netlink request is forged to set
the link interface of a gre interface if that gre interface
does not have already a link interface.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Properly handle refcounting of Proto-owned NHGs when
zebra is operating under graceful restart and retain
conditions.
We have an extra refcnt of 1 we keep for proto-owned NHGs to
indicate the upper level proto has created and owns it.
When we are reading these in from the kernel, we need to set them
to 1 as appropriate. Without this, we fail in the assert() during
zebra_nhg_proto_add() after the owning daemons resends the NHG
and the refcnts are off by one.
Also add in the same logic we use for routes when sweeping with
respect to uptimes.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add uptime for use with NHEs to keep track of how
long we have had this NHE in our rib without an update.
This is treated exactly the same as the re->uptime for
routes. When we get an update for a route, we reset the
uptime.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add a PROTO_OWNED macro for code readability when checking
ID bounds for whether a NHG is proto owned.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
When capturing backup nexthops with recursive resolution,
ensure that inner labels from the recursive nexthop are
included in each backup (as they are with the resolving
primary nexthops).
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Use the main zebra workqueue for daemon-owned NHGs, in addition
to processing kernel-owned NHGs. The zapi message processing
creates a temporary object that's enqueued to the workqueue,
then processed/installed as part of the workqueue processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Instead of directly configuring the neighbor table after read from zapi
interface, a zebra dplane context is prepared to host the interface and
the family where the neighbor table is updated. Also, some other fields
are hosted: app_probes, ucast_probes, and mcast_probes. More information
on those fields can be found on ip-ntable configuration.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
EVPN neighbor operations were already done in the zebra dataplane
framework. Now that NHRP is able to use zebra to perform neighbor IP
operations (by programming link IP operations), handle this operation
under dataplane framework:
- assign two new operations NEIGH_IP_INSTALL and NEIGH_IP_DELETE; this
is reserved for GRE like interfaces:
example: ip neigh add A.B.C.D lladdr E.F.G.H
- use 'struct ipaddr' to store and encode the link ip address
- reuse dplane_neigh_info, and create an union with mac address
- reuse the protocol type and use it for neighbor operations; this
permits to store the daemon originating this neighbor operation.
a new route type is created: ZEBRA_ROUTE_NEIGH.
- the netlink level functions will handle a pointer, and a type; the
type indicates the family of the pointer: AF_INET or AF_INET6 if the
link type is an ip address, mac address otherwise.
- to keep backward compatibility with old queries, as no extension was
done, an option NEIGH_NO_EXTENSION has been put in place
- also, 2 new state flags are used: NUD_PERMANENT and NUD_FAILED.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This one also needed a bit of shuffling around, but MTYPE_RE is the only
one left used across file boundaries now.
Signed-off-by: David Lamparter <equinox@diac24.net>
Add a control and api for the use of backup nexthops in
recursive resolution. With 'no', we won't try to use installed
backup nexthops when resolving a recursive route.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
like it has been done for iptable contexts, a zebra dplane context is
created for each ipset/ipset entry event. The zebra_dplane_ctx job is
then enqueued and processed by separate thread. Like it has been done
for zebra_pbr_iptable context, the ipset and ipset entry contexts are
encapsulated into an union of structures in zebra_dplane_ctx.
There is a specificity in that when storing ipset_entry structure, there
was a backpointer pointer to the ipset structure that is necessary
to get some complementary information before calling the hook. The
proposal is to use an ipset_entry_info structure next to the ipset_entry,
in the zebra_dplane context. That information is used for ipset_entry
processing. The ipset name and the ipset type are the only fields
necessary.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The iptable processing was not handled in remote dataplane, and was
directly processed by the thread in charge of zapi calls. Now that call
can be handled in the zebra_dplane separate thread. once a
zebra_dplane_ctx is allocated for iptable handling, the hook call is
performed later. Subsequently, a return code may be triggered to zclient
interface if any problem occurs when calling the hook call.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Start reorg of zebra nexthop-resolution so that we can use the
resolution logic for nexthop-groups as well as routes. Change
the signature of the core nexthop_active() api so that it does
not require a route-entry or route-node. Move some of the logic
around so that nexthop-specific logic is in nexthop_active(),
while route-oriented logic is in nexthop_active_check().
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Disallow the resolution to nexthops that are marked duplicate.
When we are resolving to an ecmp group, it's possible this
group has duplicates.
I found this when I hit a bug where we can have groups resolving
to each other and cause the resolved->next->next pointer to increase
exponentially. Sufficiently large ecmp and zebra will grind to a hault.
Like so:
```
D> 4.4.4.14/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:02
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 4.4.4.1 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.2 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.3 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.4 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.5 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.6 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.7 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.8 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.9 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.10 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.11 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.12 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.13 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.15 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.16 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
D> 4.4.4.15/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:09
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:09
via 4.4.4.1 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.2 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.3 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.4 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.5 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.6 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.7 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.8 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.9 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.10 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.11 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.12 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.13 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.14 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.16 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
D> 4.4.4.16/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:19
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:19
via 4.4.4.1 (recursive), weight 1, 00:00:19
via 1.1.1.1, dummy1, weight 1, 00:00:19
via 4.4.4.2 (recursive), weight 1, 00:00:19
...............
................
and on...
```
You can repro the above via:
```
kernel routes:
1.1.1.1 dev dummy1 scope link
4.4.4.0/24 via 1.1.1.1 dev dummy1
==============================
config:
nexthop-group doof
nexthop 1.1.1.1
nexthop 4.4.4.1
nexthop 4.4.4.10
nexthop 4.4.4.11
nexthop 4.4.4.12
nexthop 4.4.4.13
nexthop 4.4.4.14
nexthop 4.4.4.15
nexthop 4.4.4.16
nexthop 4.4.4.2
nexthop 4.4.4.3
nexthop 4.4.4.4
nexthop 4.4.4.5
nexthop 4.4.4.6
nexthop 4.4.4.7
nexthop 4.4.4.8
nexthop 4.4.4.9
!
===========================
Then use sharpd to install 4.4.4.16 -> 4.4.4.1 pointing to that nexthop
group in decending order.
```
With these changes it prevents the growing ecmp above by disallowing
duplicates to be in the resolution decision. These nexthops are not
installed anyways so why should we be resolving to them?
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Send the results of daemons' nhg updates asynchronously,
after the update has actually completed. Capture additional
info about the source daemon in order to locate the correct
zapi session. Simplify the result types considered by the
zebra_nhg module.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
In the case where a routes nexthops cannot be resolved as part
of route processing, immmediately notify the upper level protocol
that their routes failed to install if they are interested in
being informed about this issue.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
A couple NHG messages we were logging as errors are a bit spammy
in usecases where you routinely add/remove interfaces (VM heavy
deployments). Its not really an error a user cares about and
more for a developer to know what went wrong after the fact so
it makes more sense for these to be under a debug rather than
an error since seeing them does not implicitly mean error during
those usecases.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
This includes -
1. non-DF block filter
2. List of es-peers that need to be blocked per-access port (for
split horizon filtering)
3. Backup nexthop group to failover local-es via the VxLAN overlay
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Because the backup nexthop groups currently are more like pseudo-NHEs
(they don't have IDs and are not inserted into the ID table or
hashed), they can't really have this depends/dependents relationship
yet in both directions. Some work needs to be done there to make
them more like first class citizens like "normal" NHGs to enable
this.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When `-r` is specified to zebra, on shutdown we should
not remove any routes from the fib. This was a problem
with nhg's on shutdown due to their ref-count behavior.
Introduce a methodology where on shutdown we don't mess
with the nexthop groups in the kernel. That way on
next startup things will be ok.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Apparantly the dependents backpointer trees for singletons
got broken at some point and we never noticed. There is
not really any code making use of this right now so not
suprising but let's go ahead and fix it for zebra and proto
NHGs.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Let's just track the NHEs we get from the kernel(dplane) for
ID usage with internal routes. I tried to be smart originally
and allow them to be re-used internal to zebra but its proving
to cause more bugs than it's worth.
This doesn't break any functionality. It just means we won't
use NHEs we get from the kernel with our routes, we will create
new ones.
Decided this based on various bugs seen ith the lastest one
being on startup with this kernel state:
```
[root@alfred frr-2]# ip next ls
id 15 via 192.168.161.1 dev doof scope link proto zebra
id 17 group 15 proto zebra
[root@alfred frr-2]# ip ro show 3.3.3.1
3.3.3.1 nhid 17 via 192.168.161.1 dev doof
```
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add a param to the common NHE creation callstack so we can
know if this is one we have read in from the dataplane. We can
add some logic on how to handle these special ones later.
I considered putting this on a struct as a flag or something
but it would have required it being put on struct nexthop
since we have some `*_find_nexthop()` functions that can
be called when given NHEs from the dataplane.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When debugging why a route was not successfully installed into the
rib, it would be preferable that the end user only have to turn
on `debug zebra rib detail` as that is what we have been telling
people to do for the last couple of years. Consolidate *back*
to this.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add the zapi code for encoding/decoding of backup nexthops for when
we are ready for it, but disable it for now so that we revert
to the old way with them.
When zebra gets a proto-NHG with a backup in it, we early fail and
tell the upper level proto. In this case sharpd. Sharpd then reverts
to the old way of installation with the route.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add type to the nhg_proto_del API params for sanity checking
that the types of the route sent by the proto matches the type
found with the ID.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
In scoring our NHEs during shutdown there is a chance we could release mutliple
NHEs at the same time during one iteration. This can cause memory corruption
if the two being released are directly next to each other in the hash table.
hash_iterate accounts for releasing one during the iteration but not
two by setting hbnext before release but if hbnext is also freed,
we obviously can have a problem.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Reject proto NHGs of type blackhole/interface for now.
We need to think a bit more about how to resolve these
given the linux kernel needs to know the Address Family
of the routes that will use them and install it with them.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add a flag to track the released state of a proto-based NHG.
This flag is used to know whether the upper level proto has called
the *_del API. Typically, the NHG would just get removed and uninstalled
at this point but there is a chance we are being sent it while routes
are still being owned or we were sent it multiple times. This flag
and associated code handles that.
Ticket: CM-30369
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
We currently don't support ADD/DEL/REPLACE with proto-based
NHGs that are not already fully resolved and ifindex/onlink
based. If we are handed one that doesn't have ifindex set
i.e. recursive, gracefully fail and with a notification.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
The code was installing the nexthop group again using
the NLM_F_REPLACE function causing extremely large
route installation times. This reduces the time from
installing 1 million routes from sharpd with a nhg
from > 200 seconds ( where I gave up ) to ~15
seconds on my machine for 32 x ecmp. As a side note 1 million
routes using master sharpd takes ~50 seconds to do
the same thing.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Return the proto nhe on del even if their are still possible
route references.
We may get a del before the routes are removed. So we still need
to return this to the caller so they can decrement the ref.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Fix the releasing of proto-owned singletons from the attribute
hashed table. Proto-owned singleton nexthops are hashed so they
can still be shared therefore they are present in this table
and need to be released when the time comes.
This check was only matching on zebra proto before. Changed
to match IDs in zebra allocated range.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Increment the nhg proto score iterator we used to count
leftover NHGs after client disconnect and log.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Fix some reference counting issues seen when replacing
a NHG and deleting one.
For replacement, we should end with the same refcnt on the new
one.
For delete, its the caller's job to decrement its ref after
its done with it.
Further, update routes in the rib with the new pointer after replace.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add code to handle proto-based NHG uninstalling after
the owning client disconnects.
This is handled the same way as rib_score_proto() but for now
we are ignoring instance.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Remove some leftover boilerplate from the old replace
code path. That code ended up in the add API so its no
longer needed.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Make NHG ID allocation smarter so it wraps once it hits
the lower bound for protos and performs a lookup to make
sure we don't already have that ID in use.
Its pretty unlikely we would wrap since the ID space is somewhere
around 24million for Zebra at this point in time.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Determine the NHG ID spacing and lower bound with ZEBRA_ROUTE_MAX
in macros.
Directly set the upperbound to be the lower 28bits of the uint32_t ID
space (the top 4 are reserved for l2-NHGs). Round that number down
a bit to make it more even.
Convert all former lower_bound calls to just use the macro.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When we receive a NHG from the kernel, we set the ID counter
to that to avoid using IDs owned from the kernel.
If we get one outside of zebra's range, lets not update it
since its probably one we created and never deleted anyway.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
For now let's assume proto-NHG-based routes are good to go
(we assume they are onlink/interface based anyway) and bypass
route resolution altogether.
Once we determine how to handle recursive nexthop-resolution for
proto-NHGs we will revisit this.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Implement the ability to replace an NHG sent down
from an upper level proto. With proto-owned NHGs, we make the
assumption they are ecmp and always treat them as a group
to make the replace from 1 -> 2 and 2 -> 1 quite a bit
easier.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
To prevent duplication of singleton NHGs, lets hash
any zebra-ID spaced NHGs sent from an upper level proto.
These would be singleton NHGs anyway and should prevent duplication
of dataplane installs.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add a command/functionality to only install proto-based nexthops.
That is nexthops owned/created by upper level protocols, not ones
implicitly created by zebra.
There are some scenarios where you would not want zebra to be
arbitrarily installing nexthop groups and but you still want
to use ones you have control over via lib/nexthop_group config
and an upper level protocol.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Implement the underlying zebra functionality to Add/Del an
internal zebra and kernel NHG.
These NHGs are managed by the upperlevel protocols that send them
down via zapi messaging.
They are not put into the overall zebra NHG hash table and only
put into to the ID table. Therefore, different protos cannot
and will not share NHGs.
The proto is also set appropriately when sent to the kernel.
Expand the separation of Zebra hashed/shared/created NHGs and
proto created and mangaged NHGs.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Remove the code for setting a NHG as unhashable. Originally
this was to prevent us from attempting to put duplicates from
the kernel in our hashtable.
Now I think its better to not use them in the hashtable at all
and only track them in the ID table. Routes will still be able
to use them if they specify the ID explicitly when sending Zebra
the route, but 'normal' routes we hash the nexthop group on
will not.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Let's not make the entire `depend_finds` function pay
for the data gathering needed for the debug. There
are numerous other places in the code that check
the NEXTHOP_FLAG_RECURSIVE and do the same output.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We can make the Linux kernel send an ARP/NDP request by adding
a neighbour with the 'NUD_INCOMPLETE' state and the 'NTF_USE' flag.
This commit adds new dataplane operation as well as new zapi message
to allow other daemons send ARP/NDP requests.
Signed-off-by: Jakub Urbańczyk <xthaid@gmail.com>
For the sake of Segment Routing (SR) and Traffic Engineering (TE)
Policies there's a need for additional infrastructure within zebra.
The infrastructure in this PR is supposed to manage such policies
in terms of installing binding SIDs and LSPs. Also it is capable of
managing MPLS labels using the label manager, keeping track of
nexthops (for resolving labels) and notifying interested parties about
changes of a policy/LSP state. Further it enables a route map mechanism
for BGP and SR-TE colors such that learned BGP routes can be mapped
onto SR-TE Policies.
This PR does not introduce any usable features by now, it is just
infrastructure for other upcoming PRs which will introduce 'pathd',
a new SR-TE daemon.
Co-authored-by: Renato Westphal <renato@opensourcerouting.org>
Co-authored-by: GalaxyGorilla <sascha@netdef.org>
Signed-off-by: Sebastien Merle <sebastien@netdef.org>
Added a macro to validate the v4 mapped v6 address.
Modified bgp receive & send updates for v4 mapped v6 address as
nexthop and installing it as recursive nexthop in RIB.
Minor change in fpm while sending the routes for nexthop as
v4 mapped v6 address.
Signed-off-by: Kaushik <kaushik@niralnetworks.com>
Improve vty output for routes and lsps with backups, including
json. Simplify or correct some code that uses both primary and
backup nexthops in dplane, nht.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
If we are asked to check if a nexthop is active and it matches a
connected route but the ifindex on it does not match the interface
with the connected route, mark as inactive. This is a bad nexthop.
Before, we would skip this check and just assume any nexthop that matches
on a connected route is valid and return here then fail during
installation. This adds a check for the IPV*_ifindex nexthop case where the
ifindex we have been sent doesn't match.
Old:
F>r 0.0.0.0/0 [200/0] via 20.0.0.2, test, weight 1, 00:00:27
r via 40.4.4.4, lo, weight 1, 00:00:27
New:
F>* 0.0.0.0/0 [200/0] via 20.0.0.2, test, weight 1, 00:00:06
* via 40.4.4.4, lo inactive, weight 1, 00:00:06
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When handling a fib notification event that involves a route
with backup nexthops, be clearer about representing the
installed state of the backups: any installed backup will be
on a dedicated route_entry list.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Remove a special-case clause for static routes - it was the same
as the clause for other recursive routes. Have staticd just tell
zebra that recursion is allowed. Update topotest that was aware
of this 'internal' flag.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Use the right list of daemons to avoid trying to start zebra twice.
Change a zebra log message to INFO level to avoid stderr check
failure.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
This commit is the first step to convert IP rule installation to
use dplane thread.
* Add dataplane's internal representation of a pbr rule
* Add dplane stats related to rules
* Introduce a new type of dplane operation
Signed-off-by: Jakub Urbańczyk <xthaid@gmail.com>
When checking if a nexthop is active, if it has been marked as onlink,
just check on the presence and status of the nexthop's interface. When
handling client request to create a route, if the client says that the
nexthop is onlink, trust it; when internally (in zebra) determining
that the nexthop is onlink, ensure it is only done in the case of an
interface with a /32 IP address which is the case for OSPF unnumbered.
Signed-off-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Reviewed-by: Donald Sharp <sharpd@cumulusnetworks.com>
Reviewed-by: Stephen Worley <sworley@cumulusnetworks.com>