Operators are seeing:
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_DELROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_NEWROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [TVM3E-A8ZAG] _netlink_route_build_singlepath: (single-path): 2804:4d48:4000::/42 nexthop via fe80::b6fb:e4ff:fe26:c5d5 if 2 vrf default(0)
Mar 28 07:19:37 kingpin zebra[418]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=140, msg cnt=2
Mar 28 07:19:37 kingpin zebra[418]: [P2XBZ-RAFQ5][EC 4043309074] Failed to install Nexthop ID (68) into the kernel
When `zebra nexthop proto only` is turned on.
Effectively zebra intentionally does not do the nexthop group installation,
and the dplane notification in zebra_nhg.c just assumes it was a failure
and prints an error message. Since skipping the installation was
intentional, let's notice that and not report the message
as a failure.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently if an end user has something like this:
Routing entry for 192.168.212.1/32
Known via "kernel", distance 0, metric 100, best
Last update 00:07:50 ago
* directly connected, ens5
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.212.1, ens5, src 192.168.212.19, 00:00:15
C>* 192.168.212.0/27 is directly connected, ens5, 00:07:50
K>* 192.168.212.1/32 [0/100] is directly connected, ens5, 00:07:50
And a link flap occurs, FRR refigures the route and rejects the default
route:
2022/04/09 16:38:20 ZEBRA: [NZNZ4-7P54Y] default(0:254):0.0.0.0/0: Processing rn 0x56224dbb5b00
2022/04/09 16:38:20 ZEBRA: [ZJVZ4-XEGPF] default(0:254):0.0.0.0/0: Examine re 0x56224dbddc20 (kernel) status: Changed Installed flags: Selected dist 0 metric 100
2022/04/09 16:38:20 ZEBRA: [GG8QH-195KE] nexthop_active_update: re 0x56224dbddc20 nhe 0x56224dbdd950 (7), curr_nhe 0x56224dedb550
2022/04/09 16:38:20 ZEBRA: [T9JWA-N8HM5] nexthop_active_check: re 0x56224dbddc20, nexthop 192.168.212.1, via ens5
2022/04/09 16:38:20 ZEBRA: [M7EN1-55BTH] nexthop_active: Route Type kernel has not turned on recursion
2022/04/09 16:38:20 ZEBRA: [HJ48M-MB610] nexthop_active_check: Unable to find active nexthop
2022/04/09 16:38:20 ZEBRA: [JPJF4-TGCY5] default(0:254):0.0.0.0/0: After processing: old_selected 0x56224dbddc20 new_selected 0x0 old_fib 0x56224dbddc20 new_fib 0x0
So the 192.168.212.1 route is matched for the nexthop but it is not connected, and
zebra treats that as a problem. Modify the code such that if a system route
matches through another system route, then it should work.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This bug should only really affect kernel routes. To reproduce:
a) Have multiple connected routes that point to the same prefix
swp8 up default 169.254.0.250/30
swp9 up default 169.254.0.250/30
b) Have a kernel route that uses one of those connected routes
7.6.2.8 via 169.254.0.249 dev swp8 proto static
(But have it choose a non-selected connected nexthop)
c) Introduce an event that causes the rib table to be reprocessed,
say a unrelated interface going up / down
This causes the route to be lost with this message:
2022/03/28 21:21:53 ZEBRA: [YXCJP-0WZWV] netlink_nexthop_msg_encode: ID (3454): 169.254.0.249, via swp8(1383) vrf default(0)
2022/03/28 21:21:53 ZEBRA: [YF2E6-J60JH] nexthop_active: 169.254.0.249, via swp8 given ifindex does not match nexthops ifindex found found: directly connected, swp9
Effectively the nexthop that zebra is choosing would not be the one
that the kernel route has chosen and FRR removes the route:
2022/03/28 21:21:53 ZEBRA: [NM15X-X83N9] rib_process: (0:254):7.6.2.8/32: rn 0x56042e632e90, removing re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [Y53JX-CBC5H] rib_unlink: (0:254):7.6.2.8/32: rn 0x56042e632e90, re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [KT8QQ-45WQ0] rib_gc_dest: (0:?):7.6.2.8/32: removing dest from table
What is happening?
Zebra is not looking at all the connected routes to see whether any of them
has the appropriate ifindex; it just blindly rejects the route.
So when nexthop resolution happens and it matches a connected
route but the dest->selected nexthop ifindex does not match, let's sort
through the rest of the nexthops and see if any of them match, and if so
let's keep the route.
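As a rough illustration of the idea, here is a standalone sketch with simplified, hypothetical structures (not zebra's actual `route_entry`/`nexthop` types): the kernel route is accepted as long as any nexthop of the matched connected entry uses the wanted ifindex.
```
#include <stdbool.h>

/* Simplified, hypothetical structures standing in for zebra's types. */
struct nh {
	struct nh *next;
	int ifindex;
};

struct connected_re {
	struct nh *nhlist;	/* all nexthops of the matched connected route */
};

/* Accept the kernel route if *any* nexthop of the matched connected
 * route entry uses the ifindex the kernel route points at, instead of
 * only comparing against the first/selected nexthop. */
static bool connected_matches_ifindex(const struct connected_re *match,
				      int want_ifindex)
{
	const struct nh *nh;

	for (nh = match->nhlist; nh; nh = nh->next)
		if (nh->ifindex == want_ifindex)
			return true;

	return false;
}
```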
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This has already been a requirement for Solaris, it is still a
requirement for some of the autoconf feature checks to work correctly,
and it will be a requirement for `-fms-extensions`.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Since there are two kinds of ESI (Type-0 and Type-3), the warnings
should distinguish between the two cases.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
It's confusing for a user to see 'Tx RA failed' in the logs when
they've enabled RAs (either through interface config or BGP unnumbered)
on an interface that can't send them. Let's avoid sending RAs on
interfaces that are bridge_slaves or don't have a link-local address,
since those are two of the most common reasons for RA Tx failures.
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
The bound vrf of `l2vni/zevpn` wrongly depends on the order
in which the vxlan interface and the svi interface are set.
If the vxlan interface is set with a vlanid first and then the svi interface
is set with a vrf, the vxlan interface correctly inherits the `vrf`
from the svi. But with the reverse sequence (i.e. set svi first, then vxlan),
the vxlan interface can't get the correct `vrf`, because the handling of
`ZEBRA_VXLIF_VLAN_CHANGE` missed inheriting the `vrf` by mistake.
```
host# do show evpn vni 101
VNI: 101
Type: L2
Tenant VRF: vrf1
```
So update `vrf` ("Tenant VRF") of l2vni in `zebra_vxlan_if_update()`.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Since `NDA_VLAN` is no longer manually defined in the header file,
the check for `NDA_VLAN` should be removed.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Like `zvni_map_to_svi_ns()` for `ns_walk_func()`, just use "assert"
instead of unnecessary checks.
Since these parameters for `ns_walk_func()`, e.g. `in_param` and others,
must not be NULL, use `assert` to ensure these parameters are set, and
remove those unnecessary checks.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When a client disconnects, we need to check & remove NHT entries for
other SAFIs too. Otherwise we crash later trying to access stale data.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
There exist code paths in the linux kernel where a dump command
will be interrupted (I am not sure I understand what this really
means) and the data sent back from the kernel is wrong or incomplete.
At this point in time I am not 100% certain what should be done, but
let's start noticing that this has happened so we can formulate a plan
or allow the end operator to know bad stuff is afoot at the circle K.
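For reference, the Linux netlink API marks responses from an interrupted dump with the NLM_F_DUMP_INTR flag. A minimal standalone sketch of noticing that (an illustrative helper, not zebra's actual handler) could look like:
```
#include <linux/netlink.h>
#include <stdio.h>

/* Walk a buffer of netlink responses to a dump request and note any
 * message the kernel flagged as coming from an interrupted dump. */
static void netlink_note_dump_interrupted(void *buf, int len)
{
	struct nlmsghdr *h;

	for (h = (struct nlmsghdr *)buf; NLMSG_OK(h, len);
	     h = NLMSG_NEXT(h, len))
		if (h->nlmsg_flags & NLM_F_DUMP_INTR)
			fprintf(stderr,
				"netlink dump interrupted, data may be wrong or incomplete (type %u seq %u)\n",
				(unsigned)h->nlmsg_type, (unsigned)h->nlmsg_seq);
}
```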
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In the FreeBSD code if you delete the interface
and it has no configuration, the ifp pointer will
be deleted from the system *but* zebra continues
to dereference the just freed pointer.
==58624== Invalid read of size 1
==58624== at 0x48539F3: strlcpy (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x2B0565: ifreq_set_name (ioctl.c:48)
==58624== by 0x2B0565: if_get_flags (ioctl.c:416)
==58624== by 0x2B2D9E: ifan_read (kernel_socket.c:455)
==58624== by 0x2B2D9E: kernel_read (kernel_socket.c:1403)
==58624== by 0x499F46E: thread_call (thread.c:2002)
==58624== by 0x495D2B7: frr_run (libfrr.c:1196)
==58624== by 0x2B40B8: main (main.c:471)
==58624== Address 0x6baa7f0 is 64 bytes inside a block of size 432 free'd
==58624== at 0x484ECDC: free (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x4953A64: if_delete (if.c:283)
==58624== by 0x2A93C1: if_delete_update (interface.c:874)
==58624== by 0x2B2DF3: ifan_read (kernel_socket.c:453)
==58624== by 0x2B2DF3: kernel_read (kernel_socket.c:1403)
==58624== by 0x499F46E: thread_call (thread.c:2002)
==58624== by 0x495D2B7: frr_run (libfrr.c:1196)
==58624== by 0x2B40B8: main (main.c:471)
==58624== Block was alloc'd at
==58624== at 0x4851381: calloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x496A022: qcalloc (memory.c:116)
==58624== by 0x49546BC: if_new (if.c:164)
==58624== by 0x49546BC: if_create_name (if.c:218)
==58624== by 0x49546BC: if_get_by_name (if.c:603)
==58624== by 0x2B1295: ifm_read (kernel_socket.c:628)
==58624== by 0x2A7FB6: interface_list (if_sysctl.c:129)
==58624== by 0x2E99C8: zebra_ns_enable (zebra_ns.c:127)
==58624== by 0x2E99C8: zebra_ns_init (zebra_ns.c:214)
==58624== by 0x2B3FF2: main (main.c:401)
==58624==
Zebra needs to pass back whether or not the ifp pointer
was freed when if_delete_update is called and it should
then check in ifan_read as well as ifm_read that the
ifp pointer is still valid for use.
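One possible shape of that change, as a standalone sketch with simplified stand-in types (not FRR's actual `struct interface` or the real `if_delete_update()` signature):
```
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for struct interface. */
struct iface {
	int ifindex;
	bool has_config;
};

/* One possible shape of the fix: report whether the ifp was actually
 * freed so the routing-socket readers stop using the pointer. */
static bool iface_delete_update(struct iface *ifp)
{
	if (!ifp->has_config) {
		free(ifp);	/* interface is gone from the system */
		return true;	/* caller must not touch ifp any more */
	}

	/* the real code would keep ifp and just mark it inactive */
	return false;
}

/*
 * Callers (the equivalents of ifan_read()/ifm_read()) then do:
 *
 *	if (iface_delete_update(ifp))
 *		return;		// pointer is dead, bail out
 *	// ifp is still valid here, e.g. safe to query its flags
 */
```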
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add new debug output to show the string of the message type that
is currently unhandled:
2022-03-24 18:30:15.284 [DEBG] zebra: [V3NSB-BPKBD] Kernel:
2022-03-24 18:30:15.284 [DEBG] zebra: [HDTM1-ENZNM] Kernel: message seq 792
2022-03-24 18:30:15.284 [DEBG] zebra: [MJD4M-0AAAR] Kernel: pid 594488, rtm_addrs {DST,GENMASK}
2022-03-24 18:30:15.285 [DEBG] zebra: [GRDRZ-0N92S] Unprocessed RTM_type: RTM_NEWMADDR(d)
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When running zebra w/ valgrind, it was noticed that there
were a bunch of places passing uninitialized data to the kernel:
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2AFF39: if_get_mtu (ioctl.c:161)
==38194== by 0x2B12C3: ifm_read (kernel_socket.c:653)
==38194== by 0x2A7F76: interface_list (if_sysctl.c:129)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000967 is on thread 1's stack
==38194== in frame #3, created by if_get_mtu (ioctl.c:155)
==38194==
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2AFED9: if_get_metric (ioctl.c:143)
==38194== by 0x2B12CB: ifm_read (kernel_socket.c:655)
==38194== by 0x2A7F76: interface_list (if_sysctl.c:129)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000967 is on thread 1's stack
==38194== in frame #3, created by if_get_metric (ioctl.c:137)
==38194==
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2B052D: if_get_flags (ioctl.c:419)
==38194== by 0x2B1CF1: ifam_read (kernel_socket.c:930)
==38194== by 0x2A7F57: interface_list (if_sysctl.c:132)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000707 is on thread 1's stack
==38194== in frame #3, created by if_get_flags (ioctl.c:411)
With these changes, valgrind no longer reports these issues.
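The essential part of such fixes is simply zeroing the ifreq before handing it to the kernel. A minimal hedged sketch (the helper name and exact includes are illustrative, not zebra's ioctl.c code):
```
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
/* on *BSD, SIOCGIFMTU lives in <sys/sockio.h> */

/* Zero the ifreq so no uninitialized stack bytes get passed down
 * with the ioctl. */
static int example_if_get_mtu(int fd, const char *ifname, int *mtu)
{
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));	/* the essential part of the fix */
	strncpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name) - 1);

	if (ioctl(fd, SIOCGIFMTU, &ifr) < 0)
		return -1;

	*mtu = ifr.ifr_mtu;
	return 0;
}
```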
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When `zebra_evpn_mac_svi_add()` adds a mac found by
`zebra_evpn_mac_lookup()` and the found mac is without the
svi flag, `zebra_evpn_mac_svi_add()` is called to create
an appropriate mac, but that will call `zebra_evpn_mac_lookup()`
a second time. Looking up twice is redundant.
This is just an optimization to make sure we only look up once.
Modify `zebra_evpn_mac_gw_macip_add()` to check the `macp`
parameter passed by the caller, so it can distinguish whether
a lookup is really needed or not.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When issuing a RTM_DELETE operation and the kernel tells
us that the route is already deleted, let's not complain
about the situation:
2022/03/19 02:40:34 ZEBRA: [EC 100663303] kernel_rtm: 2a10:cc42:1d51::/48: rtm_write() unexpectedly returned -4 for command RTM_DELETE
I can recreate this issue on freebsd by doing this:
a) create a route using sharpd
b) shutdown the nexthop's interface
c) remove the route using sharpd
This would also be true of pretty much any routing protocol's behavior.
Let's just not complain about the situation if a RTM_DELETE
operation is issued and FRR is told that the route does not
exist to delete.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Since `RB_INSERT()` is called only after the element is not found in the RB tree,
it must succeed and return zero. The check of the return value of `RB_INSERT()`
is redundant, so just remove it.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Commit: 5d41413833 added 3 new dplane ops:
DPLANE_OP_INTF_INSTALL
DPLANE_OP_INTF_UPDATE
DPLANE_OP_INTF_DELETE
The build system does not build lua, so zebra_script.c
was not updated. Update it, of course!
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When encoding a response to the upper level protocol the
prefixlen is not something that needs to be part of the
switch statement for handling of a prefix.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the nexthop tracking code is only sending to the requestor
what it was requested to match against. When the nexthop tracking
code was simplified in b8210849b8 for bgpd to no longer need both an import
check and a nexthop check, it was not
noticed that a longer prefix could be the actual match and the requestor
would have no way to tell, because FRR was not sending up both the resolved
route prefix and the route FRR was asked to match against.
This code change causes the nexthop tracking code to pass
back up the matched requested route (so that the calling
protocol can figure out which one it is being told about)
as well as the actual prefix that was matched to.
Fixes: #10766
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Since `zvni_map_to_svi_ns()` is used to find and return one specific interface
based on the passed SVI attributes, the two parameters `in_param` and `p_ifp`
must not be NULL.
Passing a NULL `p_ifp` makes no sense, so the check `if (p_ifp)` is
unnecessary.
Use `assert` to ensure the two parameters are set, and remove that unnecessary check.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Upon 'no advertise-all-vni', clean up the l2vni from
its tenant-vrf's l3vni list, instead of from the passed
zvrf->l3vni, which will not be present in the case
of the default instance.
Reviewed By:
Testing Done:
Before Fix:
----------
TORC12(config-router-af)# advertise-all-vni
TORC12(config-router-af)# end
TORC12# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
Vxlan-Intf: vni4001
State: Up
Router MAC: 44:38:39:ff:ff:01
L2 VNIs: 134217728 0 1000 1002 <-----
After Fix:
----------
TORC12# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
Vxlan-Intf: vni4001
State: Up
Router MAC: 44:38:39:ff:ff:01
L2 VNIs: 1000 1002
Signed-off-by: Chirag Shah <chirag@nvidia.com>
The RMAC now keeps a list of nexthops to keep track
of its existence; remove the (old way) host prefix
mapping.
Ticket: #2798406
Reviewed By:
Testing Done:
TORS1# show evpn rmac vni 4001 mac 44:38:39:ff:ff:01
MAC: 44:38:39:ff:ff:01
Remote VTEP: 36.0.0.11
Refcount: 0
Prefixes:
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Keep the list of remote-vteps/nexthops in
rmac db.
Problem:
In a CLAG deployment there might be a situation
where the CLAG secondary sends an individual ip as nexthop
along with the anycast mac as RMAC. This combination
is updated in zebra's rmac cache.
Upon recovery, the clag secondary sends a withdrawal
of the incorrect rmac and nexthop mapping.
The RMAC entry's mapping to the nh is not cleaned up properly
in the zebra rmac cache.
Fix:
Zebra's rmac db needs to maintain a list of nexthops.
When a bgp withdrawal for an rmac to nexthop mapping
is received, remove the old nexthop from the rmac's nh
list, and if a host reference still remains for
the RMAC, fall back to the nexthop remaining in
the list.
At most two nexthops are expected to be mapped to the RMAC
(in a clag deployment).
Ticket: 2798406
Reviewed By:
Testing Done:
CLAG primary and secondary have advertise-pip enabled
advertise type-5 route (default route) with
individual IP as nh and individual svi mac as rmac.
- disable advertise pip on both clag devices, this
results in advertisement of routes with anycast ip as nh
and anycast mac as rmac.
- disable peerlink on clag primary, this triggers
clag secondary to (transitory) send bgp update with
individual ip as nh and anycast mac as rmac.
- At the remote vtep:
Check the zebra's rmac cache/nh mapping correctly
and points to anycast rmac and anycast ip as nh of the
clag system.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Cleanup the logs in the netlink code for setting
protodown on/off to be more useful to a user parsing them
after an issue.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Use the SET/UNSET/CHECK/COND macros for flag bitfields
where appropriate throughout the protodown code base.
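For readers unfamiliar with these helpers, they are conventionally defined along these lines (a sketch of the typical shape; see lib/zebra.h for the authoritative definitions, and the field/flag names in the usage comment are illustrative):
```
/* Typical shape of these helpers. */
#define CHECK_FLAG(V, F)   ((V) & (F))
#define SET_FLAG(V, F)     ((V) |= (F))
#define UNSET_FLAG(V, F)   ((V) &= ~(F))
#define COND_FLAG(V, F, C) ((C) ? SET_FLAG(V, F) : UNSET_FLAG(V, F))

/* e.g. (hypothetical field and flag names):
 *	COND_FLAG(zif_flags, ZIF_FLAG_PROTODOWN, protodown);
 */
```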
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Ensure we include the old reason when we are updating the reason
code for an evpn-mh bond member. Now that this is a common API
it could include things external to EVPN in this reason code
bitfield (ex: vrrp).
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Make the static netlink protodown function that checks whether
the only protodown reason bit set is FRR's more
easily readable to someone not familiar with the code.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Simplify the code for printing the reason codes via the
show command. Just remove the trailing comma
before printing.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Cleanup the logs in the api for setting protodown on/off
that zapi and others use. Make them more useful to a user parsing
them after an issue.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Avoid initialization in dplane_ctx_intf_init() so
the compiler can warn us about using uninitialized data.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
When we are processing a bond member's protodown we get from
the dataplane, check to make sure we haven't already queued
up a set. If we have, it's likely this is just a notification
we get from the kernel after we set protodown and before we have
processed the result in our dplane pthread.
This change is needed now that we set protodown via the dplane
pthread.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
When setting the protodown reason use the update api
where we can directly update the entire reason bitfield
since we have to set more than one.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Extern the api for setting the protodown reason code
bitfield directly. Some places may want to completely update the
bitfield with more than one reason at a time.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Only clear protodown reason on shutdown/sweep, retain protodown
state.
This is to retain traditional and expected behavior with daemons
like vrrpd setting protodown. They expect it to be set on shutdown
and retained on bring up to prevent traffic from being dropped.
We must cleanup our reason code though to prevent us from blocking
others.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add functionality to clear any reason code set on shutdown
of zebra when we are freeing the interface, in case a bad
client didn't tell us to clear it when it shut down.
Also, in case of a crash or failure to do the above, clear reason
on startup if it is set.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add enums for set/unset of protodown state to handle the mainthread
knowing an update is already queued without actually marking it
as complete.
This is to make the logic conform a bit more to other parts of the code
where we queue dplane updates and not update our internal structs until
success callback is received.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add a command for users to set protodown via frr.conf in
case our default conflicts with another application
they are using.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add support for setting the protodown reason code.
829eb208e8
These patches handle all our netlink code for setting the reason.
For protodown reason we only set `frr` as the reason externally
but internally we have more descriptive reasoning available via
`show interface IFNAME`. The kernel only provides a bitwidth of 32
that all userspace programs have to share so this makes the most sense.
Since this is new functionality, it needs to be added to the dplane
pthread instead. So these patches, also move the protodown setting we
were doing before into the dplane pthread. For this, we abstract it a
bit more to make it a general interface LINK update dplane API. This
API can be expanded to support general link creation/updating when/if
someone ever adds that code.
We also move to a more common entrypoint for evpn-mh and for zapi clients
like vrrpd. They both call common code now to set our internal flags
for protodown and protodown reason.
Also add debugging code for dumping netlink packets with
protodown/protodown_reason.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Constraints on host routes are too strict in the current code:
host routes with the same destination address and nexthop address are forbidden
even when they cross VRFs.
Currently host routes with different destination and nexthop addresses can cross
VRFs, which is ok. But host routes with the same addresses are forbidden to cross
VRFs, which is wrong.
Since different VRFs can have the same addresses, leaking a specific host route with
the same nexthop address (meaning the destination address is the same as the nexthop
address) to other VRFs is a normal case.
This commit relaxes that constraint. Host routes with the same destination address
and nexthop address are forbidden only when they do not cross VRFs.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When an interface goes down, it signals any related NHGs to
re-validate themselves. During zebra shutdown, ensure we remove
any NHGs we've installed.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
In `zebra_evpn_neigh_gw_macip_add()`, it sets `mac->flags` to "ZEBRA_MAC_DEF_GW"
for "advertise-default-gw" mode. But this set is redundant because this "mac"
is already set by `zebra_evpn_mac_gw_macip_add()`.
So remove this redundant assignment.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
In the loop, the local variable `ip` is always set even if the check condition
is not satisfied.
Avoid the redundant set: move the assignment to after the check condition is
satisfied, so `ip` is set only when the check condition is met.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
A zebra crash is seen during shutdown (frr restart).
During shutdown, remote neigh and remote mac cleanup
is triggered first, followed by per-vni cleanup of all neighs
(including local) and macs.
The crash occurs when a remote mac is cleaned up first
while a reference to it remains in a local neigh.
When the local neigh attempts to remove itself from its associated
mac's neigh_list, it triggers an inaccessible-memory crash.
The fix is, during mac deletion, if its neigh_list is non-empty
then retain the MAC in AUTO state.
This can arise when the MAC and neigh duo are in different states
(remote/local). Otherwise, the order of cleanup operations
is neighs followed by macs.
The auto mac will be cleaned up when per-vni all neighs and macs
are cleaned up.
Ticket:CM-29826
Reviewed By:CCR-10369
Testing Done:
Configure evpn symmetric config where
MAC is in remote state and neigh is in local state.
Perform frr restart then crash is not seen.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
With recent changes to the interface up mechanics in if_netlink.c,
FRR was receiving as many as 4 up events for an interface
on ifdown/ifup events. This was causing timing issues
in FRR based upon some fun timings. Prevent this from
happening.
Ticket: CM-31623
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
On a vxlan interface flap (protodown/up) event,
non ptm operative interfaces do not come up,
as the protodown up event does not trigger the "if_up()"
event.
Ticket:CM-30477
Reviewed By:CCR-10681
Testing Done:
validated interface flaps, ip link down, ifdown,
and protodown followed by an UP event. All vxlan interfaces
come up in bgpd post flap.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
FRR needs to handle the protocol-down (protodown) event for vxlan
interfaces.
In an MLAG scenario, one of the pair switches can put
a vxlan port into protodown state, followed by a
tunnel-ip change from the anycast IP to the individual IP.
In the absence of protodown handling, evpn ends up
advertising locally learned EVPN (MAC-IP) routes
with the individual IP as nexthop.
This leads to an issue of overwriting locally learned
entries as remote on the MLAG pair.
Ticket:CM-24545
Reviewed By:CCR-10310
Testing Done:
In an EVPN deployment, restart one of the MLAG
daemons, which puts vxlan interfaces into protodown state.
FRR treats protodown as oper down for vxlan interfaces.
VNI down cleans up/withdraws locally learned routes.
Following a vxlan device UP event, re-advertise the
locally learned routes.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
When an end operator is doing cross vrf imports in bgp:
router bgp 3239 vrf FOO
address-family ipv4 uni
import vrf BAR
!
and zebra has this configuration:
vrf FOO
ip protocol bgp route-map EVA
!
The current code in zebra_nhg.c was looking up the vrf of the
nexthop and attempting to apply the ip protocol route-map.
For most people the nexthop vrf and the re vrf are one and the
same so they never see a problem.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Upon restart zebra reads in the kernel state. Under linux
there is a mechanism to read the route and convert the protocol
to the correct internal FRR protocol to allow the zebra graceful
restart efforts to work properly.
Under *BSD I do not see a mechanism to convey the original FRR
protocol into the kernel and thus back out of it. Thus when
zebra crashes ( or restarts ) the routes read back in are kernel
routes and are effectively lost to the system and FRR cannot
remove them properly. Why? Because FRR sees kernel routes
as routes that it should not own and in general the admin
distance for those routes will be a better one than the
admin distance from a routing protocol. This is even
worse because when the graceful restart timer pops and rib_sweep
is run, FRR becomes out of sync with the state of the kernel forwarding
on *BSD.
On restart, notice that the route is a self route for which there
is no way to know its originating protocol. In this case
let's set the protocol to ZEBRA_ROUTE_STATIC and set the admin
distance to 255.
This way when an upper level protocol reinstalls its route
the general zebra graceful restart code still works. The
high admin distance allows the code to just work in a way
that is graceful( HA! )
The drawback here is that the route shows up as a static
route for the time the system is doing its work. FRR
could introduce *another* route type but this seems like
a bad idea, and the STATIC route type is loosely analogous
to the type of route it has become.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR will crash when the re->type is a ZEBRA_ROUTE_ALL and it
is inserted into the meta-queue. Let's just put some basic
code in place to prevent a crash from happening. No routing
protocol should be using ZEBRA_ROUTE_ALL as a value but
bugs do happen. Let's just accept the weird route type
gracefully and move on.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There exist some interface types that are slow on startup
to fully register their link speed, especially those that
are working with an asic backend. The speed_update timer
associated with each interface would keep trying if the
system returned a MAX_UINT32 as the speed. This speed
means either unknown or no speed at all under linux.
Since some interface types are slow on startup let's modify
FRR to try for at most 4 minutes and give up trying on those
interfaces where we never get any useful data.
Why 4 minutes? I wanted to balance the time associated with
slow interfaces coming up against those that will never give us
a value. So I chose 4 minutes as a good ballpark of time
to keep trying.
Why not track all those interfaces and just not attempt to
do the speed lookup? I would prefer not to keep track of these,
as I do not know all the interface types, nor do I wish
to keep programming them in as new ones come in.
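A minimal sketch of the retry decision being described (a hypothetical helper, not the actual speed_update code):
```
#include <stdbool.h>
#include <stdint.h>

#define SPEED_RETRY_MAX_SECS (4 * 60)	/* give up after roughly 4 minutes */

/* Decide whether the per-interface speed polling timer should be rearmed.
 * UINT32_MAX is what linux reports for "unknown / no speed". */
static bool speed_should_retry(uint32_t reported_speed, uint32_t elapsed_secs)
{
	if (reported_speed != UINT32_MAX)
		return false;	/* got a usable answer, stop polling */

	return elapsed_secs < SPEED_RETRY_MAX_SECS;
}
```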
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Look up linked interface in the correct netns, otherwise, either a wrong
interface or NULL would be used.
For example, enable VRF netns backend, and:
ip netns add ns1
ip link add link eth0 link1 type macvlan
ip link set link1 netns ns1 up
Zebra will crash in zebra_vxlan_macvlan_up because zif->link is NULL.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
An end operator is reporting that they are receiving buffer overruns
when attempting to read from the kernel receive socket. It is
possible to adjust this size to more modern levels, especially
for when the system is under load. Modify the code base
so that *BSD operators can use the zebra `-s XXX` option
to specify a read buffer.
Additionally, set up the default receive buffer size on *BSD
to be 128k instead of 8k so that FRR does not run into
this issue again.
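The underlying mechanism is the standard SO_RCVBUF socket option; a small hedged sketch (the helper name is illustrative, and zebra ties the actual size to its `-s` option):
```
#include <stdio.h>
#include <sys/socket.h>

/* Bump the kernel receive buffer on the routing socket so bursts of
 * messages do not overrun it. */
static void routing_sock_set_rcvbuf(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		perror("setsockopt(SO_RCVBUF)");
}

/* usage: routing_sock_set_rcvbuf(routing_sock, 128 * 1024); */
```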
Fixes: #10666
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Use the dataplane to query and read interface NETCONF data;
add netconf-oriented data to the dplane context object, and
add accessors for it. Add handler for incoming update
processing.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Allow self-produced xxxNETCONF netlink messages through the BPF
filter we use. Just like address-configuration actions, we'll
process NETCONF changes in one path, whether the changes were
generated by zebra or by something else in the host OS.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
a) We'll need to pass the info up via some dataplane control method
(This way bsd and linux can both be zebra agnostic of each other)
b) We'll need to modify `struct interface *` to track this data
and when it changes to notify upper level protocols about it.
c) Work is needed to dump the entire mpls state at the start
so we can gather interface state. This should be done
after interface data gathering from the kernel.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Description:
===========
Change is intended for fixing the NHT resolution logic.
While recursively resolving a nexthop, keep looking for a valid/usable route in the rib,
by not stopping at the first/most-specific route in the rib.
Consider the following set of events taking place on R1:
R1(config)# ip route 2.2.2.0/24 ens192
R1# sharp watch nexthop 2.2.2.32 connected
R1# show ip nht
2.2.2.32(Connected)
resolved via static
is directly connected, ens192
Client list: sharp(fd 33)
-2.2.2.32 NHT is resolved over the above valid static route.
R1# sharp install routes 2.2.2.32 nexthop 2.2.2.32 1
R1# 2.2.2.32(Connected)
resolved via static
is directly connected, ens192
Client list: sharp(fd 33)
-.32/32 comes which is going to resolve through itself, but since this is an invalid route,
it will be marked as inactive and will not affect the NHT.
R1# sharp install routes 2.2.2.31 nexthop 2.2.2.32 1
R1# 2.2.2.32(Connected)
unresolved(Connected)
Client list: sharp(fd 50)
-Now a .31/32 comes which will resolve over .32 route, but as per the current logic,
this will trigger the NHT check, in turn making the NHT unresolved.
-With fix, NHT should stay in resolved state as long as the valid static or connected route stays installed
Fix:
====
-While resolving nexthops, walk up the tree from the most-specific match
without any ZEBRA_NHT_CONNECTED check.
Co-authored-by: Vishal Dhingra <vdhingra@vmware.com>
Co-authored-by: Kantesh Mundaragi <kmundaragi@vmware.com>
Signed-off-by: Iqra Siddiqui <imujeebsiddi@vmware.com>
Two minor changes:
1) Change `zebra_evpn_mac_gw_macip_add()` 's return type to `void`.
2) Since `zebra_evpn_mac_gw_macip_add()` has already `assert`ed the returned
`mac`, checking its return value makes no sense. And keeping the setting of
`mac->flags` inside `zebra_evpn_mac_gw_macip_add()` is more reasonable, so
just move the setting of `mac->flags` inside `zebra_evpn_mac_gw_macip_add()`.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
The recently-added hashtable of nlsock objects needs to be
thread-safe: it's accessed from the main and dplane pthreads.
Add a mutex for it, use wrapper apis when accessing it. Add
a per-OS init/terminate api so we can do init that's not
per-vrf or per-namespace.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
NetDEF CI has been whining about multiline string style.
Make the strings single-line and call it a day.
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
Making multiple ioctl calls on the same ifreq will result in
overwriting the ifreq with garbage data. In the if_get_flags call,
try to keep the flags field safe from another possible ioctl
call before applying the flags field.
Modified code as per Code Review, done by Donald Sharp.
Signed-off-by: Bijan <bijanebrahimi@riseup.net>
Currently when the kernel sends netlink messages to FRR
the buffers to receive this data is of fixed length.
The kernel, with certain configurations, will send
netlink messages that are larger than this fixed length.
This leads to situations where, on startup, zebra gets
really confused about the state of the kernel. Effectively
the current algorithm is this:
read up to buffer in size
while (data to parse)
get netlink message header, look at size
parse if you can
The problem is that there is a 32k buffer we read into.
We get the first message that is, say, 1k in size, and
subtract that 1k, leaving 31k left to parse. We then
get the next header and notice that the length
of the message is 33k, which is obviously larger
than what we read in. FRR has no recovery mechanism
nor is there a way to know, a priori, what the maximum
size is that the kernel will send us.
Modify FRR to look at the kernel message and see if the
buffer is large enough, if not, make it large enough to
read in the message.
This code has to be per netlink socket because of the usage
of pthreads. So add to `struct nlsock` the buffer and current
buffer length. Growing it as necessary.
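One standard way to size such a buffer on Linux netlink sockets is to peek with MSG_PEEK | MSG_TRUNC, which reports the real pending length without consuming the message. A hedged standalone sketch of that approach (not necessarily the exact FRR implementation):
```
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Grow *buf so a pending netlink message is never truncated.  With
 * MSG_PEEK | MSG_TRUNC the kernel reports the real length of the next
 * datagram without consuming it, so we can resize before the real read. */
static ssize_t netlink_recv_dynamic(int fd, char **buf, size_t *bufsiz)
{
	ssize_t need = recv(fd, *buf, *bufsiz, MSG_PEEK | MSG_TRUNC);

	if (need < 0)
		return -1;

	if ((size_t)need > *bufsiz) {
		char *tmp = realloc(*buf, (size_t)need);

		if (!tmp)
			return -1;
		*buf = tmp;
		*bufsiz = (size_t)need;
	}

	return recv(fd, *buf, *bufsiz, 0);	/* the real read */
}
```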
Fixes: #10404
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Store the fd that corresponds to the appropriate `struct nlsock` and pass
that around in the dplane context instead of the pointer to the nlsock.
Modify the kernel_netlink.c code to store in a hash the `struct nlsock`
with the socket fd as the key.
Why do this? The dataplane context is used to pass around the `struct nlsock`
but the zebra code has a bug where the received buffer for kernel netlink
messages from the kernel is not big enough. So we need to dynamically
grow the receive buffer per socket, instead of having a non-dynamic buffer
that we read into. By passing around the fd we can look up the `struct nlsock`
that will soon have the associated buffer and not have to worry about `const`
issues that will arise.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Store and use the sequence number instead of using what is in
the `struct nlsock`. Future commits are going away from storing
the `struct nlsock` and the copy of the nlsock was guaranteeing
unique sequence numbers per message. So let's store the
sequence number to use instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Using memcmp is wrong because struct ipaddr may contain uninitialized
padding bytes that should not be compared.
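A sketch of the safer comparison, using a simplified stand-in type rather than FRR's actual struct ipaddr: compare the discriminator first, then only the address bytes that are actually in use.
```
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/* Simplified stand-in for the address type: a discriminator plus a union.
 * Padding inside such a struct is why memcmp() of the whole object can end
 * up comparing garbage bytes. */
struct ipaddr_example {
	int family;		/* AF_INET or AF_INET6 */
	union {
		struct in_addr v4;
		struct in6_addr v6;
	} ip;
};

static bool ipaddr_example_same(const struct ipaddr_example *a,
				const struct ipaddr_example *b)
{
	if (a->family != b->family)
		return false;

	if (a->family == AF_INET)
		return memcmp(&a->ip.v4, &b->ip.v4, sizeof(a->ip.v4)) == 0;

	return memcmp(&a->ip.v6, &b->ip.v6, sizeof(a->ip.v6)) == 0;
}
```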
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When using wait for install there exist situations where
zebra will issue several route change operations to the kernel
but end up in a state where we shouldn't be,
due to extra data being received. Example:
a) zebra receives from bgp a route change, installs it and sends the
route to the kernel.
b) zebra receives a route deletion from bgp, removes the
struct route entry and then sends to the kernel a deletion.
c) zebra receives an asynchronous notification that (a) succeeded
but we treat this as a new route.
This is the ships in the night problem. In this case if we receive
notification from the kernel about a route that we know nothing
about and we are not in startup and we are doing asic offload
then we can ignore this update.
Ticket: #2563300
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The ctx->zd_is_update is being set in various
spots based upon the same value that we are
passing into dplane_ctx_ns_init. Let's just
consolidate all this into the dplane_ctx_ns_init
so that the zd_is_update value is set at the
same time that we increment the sequence numbers
to use.
As a note for future me's reading this: the sequence
number chosen for the seq number passed to the
kernel works because each context gets a copy of the
appropriate nlsock to use. Since it's a copy
at a point in time, we know we have a unique sequence
number value.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When nl_batch_read_resp gets a full-on failure (-1) or an implicit
ack (0) from the kernel for a batch, let's immediately
mark all of those in the batch pass/fail as needed, instead
of having them marked elsewhere.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
I'm seeing this crash in various forms:
Program terminated with signal SIGSEGV, Segmentation fault.
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f418efbc7c0 (LWP 3580253))]
(gdb) bt
(gdb) f 4
267 (*func)(hb, arg);
(gdb) p hb
$1 = (struct hash_bucket *) 0x558cdaafb250
(gdb) p *hb
$2 = {len = 0, next = 0x0, key = 0, data = 0x0}
(gdb)
I've also seen a crash where data is 0x03.
My suspicion is that hash_iterate is calling zebra_nhg_sweep_entry, which
deletes the particular entry we are looking at, as well as possibly other
entries whose ref count gets set to 0 as a result.
Then we have this loop in hash_iterate.c:
for (i = 0; i < hash->size; i++)
for (hb = hash->index[i]; hb; hb = hbnext) {
/* get pointer to next hash bucket here, in case (*func)
* decides to delete hb by calling hash_release
*/
hbnext = hb->next;
(*func)(hb, arg);
}
Suppose in the previous loop hbnext is set to hb->next and we call
zebra_nhg_sweep_entry. This deletes the previous entry and also
happens to cause the hbnext entry to be deleted as well, because of nhg
refcounts. At this point in time the memory pointed to by hbnext is
not owned by the pthread anymore, and we can end up in a state where
it's overwritten by another pthread in zebra with data for other incoming events.
What to do? Let's change the sweep function to a hash_walk and have
it stop iterating and start over if there is a possible double
delete operation.
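A standalone sketch of that restart-on-possible-double-delete pattern, using a plain linked list and hypothetical refcount/dependency fields instead of the real hash table and NHG structures:
```
#include <stdbool.h>
#include <stdlib.h>

struct entry {
	struct entry *next;
	int refcnt;
	struct entry *dep;	/* entry whose refcnt drops when we are freed */
};

/* Free 'e' (already unlinked).  Dropping its reference may push another
 * entry's refcnt to zero, in which case that entry is unlinked and freed
 * too -- the kind of cascade that invalidates a naive iterator. */
static bool entry_free(struct entry **head, struct entry *e)
{
	bool cascaded = false;
	struct entry **pp;

	if (e->dep && --e->dep->refcnt == 0) {
		for (pp = head; *pp; pp = &(*pp)->next) {
			if (*pp != e->dep)
				continue;
			*pp = e->dep->next;
			free(e->dep);
			cascaded = true;
			break;
		}
	}
	free(e);
	return cascaded;
}

static void sweep_all(struct entry **head)
{
	bool restart;
	struct entry **pp;

	do {
		restart = false;
		for (pp = head; *pp;) {
			struct entry *e = *pp;

			if (e->refcnt != 0) {
				pp = &e->next;
				continue;
			}
			*pp = e->next;		/* unlink e */
			if (entry_free(head, e)) {
				/* something else was freed too; in the real
				 * hash walk any saved next-pointer could now
				 * be stale, so start over from the head */
				restart = true;
				break;
			}
		}
	} while (restart);
}
```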
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add to `show zebra` whether or not RA is compiled into FRR
and whether or not BGP is using RFC 5549 at the moment.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dataplane pthread only processes a limited set of incoming
netlink notifications: only register for that set of events,
reducing duplicate incoming netlink messages.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Current code treats all metaqueues as lists of route_node structures.
However, some queues contain other structures that need to be cleaned up
differently. Casting the elements of those queues to struct route_node
and dereferencing them leads to a crash. The crash may be seen when
executing bgp_multi_vrf_topo2.
Fix the code by using the proper list element types.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
The name 'opaque' is a little general - call the route_entry
struct 're_opaque' to make it more specific.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
RA packets are pretty chatty and when there is a warning from
a misconfiguration on the network, the log file gets filled
up with warnings. Modify the code in rtadv.c to only emit
the warning in these cases at most every 6 hours.
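The rate limit itself is just a timestamp check; a minimal sketch (a hypothetical helper, whereas zebra's actual code tracks this per warning site):
```
#include <stdio.h>
#include <time.h>

#define RA_WARN_INTERVAL (6 * 60 * 60)	/* at most one warning per 6 hours */

/* Emit the warning only if enough time has passed since the last one. */
static void ra_warn_ratelimited(const char *msg)
{
	static time_t last_warn;
	time_t now = time(NULL);

	if (last_warn && now - last_warn < RA_WARN_INTERVAL)
		return;

	last_warn = now;
	fprintf(stderr, "rtadv: %s\n", msg);
}
```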
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
EVPN route add should be queued to preserve the config order.
In particular, against deletion in rib_delete().
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
When an operator has an FRR-based route installed into the
FIB and a better route comes in from the system, there
is code in the data plane to schedule the batching
and continue processing. But in this case we are done,
so we can just return.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
vrf_disable is always called before
vrf_delete. The rnh_table and rnh_table_multicast tables
are already deleted as part of vrf_disable. No need
to do it again.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
VRF name should not be printed in the config since 574445ec. The update
was done for NB config output but I missed it for regular vty output.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Add a thread_ignore_late_timer(struct thread *thread) function
that allows thread.c to ignore when timers are late to the party.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Pass in the route_node that is under consideration
into route_notify_internal to allow calling functions
to reduce stack size as well as looking up data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dest_pfx was pretty much only ever used for
debug output and FRR already knows the rn. So
use that instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dest_p and src_p values were only ever used for
debugs and %pFX, when we already have the rn.
There is no need to do this lookup.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR should give parameters variable names instead of leaving
them unnamed in the .h file. This just cleans up this
problem for redistribute.h.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The function zsend_redistribute_route uses the prefix and
source prefix. Just pass in the route_node instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR is passing around a bunch of data that is encapsulated
within the route node. Let's just pass that around instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR is passing around a bunch of data that is encapsulated
within the route node. Let's just pass that around instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If you have this setup:
router ospf 3
redistribute sharp
!
and then install:
sharp install route 4.5.6.7 nexthop 192.168.100.1 1
sharp install route 4.5.6.8 nexthop 192.168.100.1 1 instance 3
sharp install route 4.5.6.9 nexthop 192.168.100.1 1 instance 4
The .8 and .9 routes are auto redistributed into ospf instance 3:
eva# show ip ospf data
OSPF Instance: 3
OSPF Router with ID (192.168.122.1)
AS External Link States
Link ID ADV Router Age Seq# CkSum Route
4.5.6.7 192.168.122.1 13 0x80000001 0x477c E2 4.5.6.7/32 [0x0]
4.5.6.8 192.168.122.1 5 0x80000001 0x3d85 E2 4.5.6.8/32 [0x0]
4.5.6.9 192.168.122.1 5 0x80000001 0x338e E2 4.5.6.9/32 [0x0]
This cannot be correct behavior. When redistributing in the absence
of an instance number, the default instance of 0 should be used and should
be the only route redistributed. Here is the correct behavior:
eva# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.119.1, enp39s0, 00:00:28
D>* 4.5.6.7/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
D[3]>* 4.5.6.8/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
D[4]>* 4.5.6.9/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
C>* 192.168.100.0/24 is directly connected, virbr1, 00:00:28
C>* 192.168.110.0/24 is directly connected, virbr2, 00:00:28
C>* 192.168.119.0/24 is directly connected, enp39s0, 00:00:28
C>* 192.168.122.0/24 is directly connected, virbr0, 00:00:28
eva# show ip ospf data
OSPF Instance: 3
OSPF Router with ID (192.168.122.1)
AS External Link States
Link ID ADV Router Age Seq# CkSum Route
4.5.6.7 192.168.122.1 6 0x80000001 0x477c E2 4.5.6.7/32 [0x0]
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR allows redistribution to a client with a specific
instance in mind. The code was not allowing you to figure
out which instance was being looked at, so let's clarify this
in the debugs.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Update ospfd and ospf6d to send opaque route attributes to
zebra. Those attributes are stored in the RIB and can be viewed
using the "show ip[v6] route" commands (other than that, they are
completely ignored by zebra).
Example:
```
debian# show ip route 192.168.1.0/24
Routing entry for 192.168.1.0/24
Known via "ospf", distance 110, metric 20, best
Last update 01:57:08 ago
* 10.0.1.2, via eth-rt2, weight 1
OSPF path type : External-2
OSPF tag : 0
debian#
debian# show ip route 192.168.1.0/24 json
{
"192.168.1.0\/24":[
{
"prefix":"192.168.1.0\/24",
"prefixLen":24,
"protocol":"ospf",
"vrfId":0,
"vrfName":"default",
"selected":true,
[snip]
"ospfPathType":"External-2",
"ospfTag":"0"
}
]
}
```
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
Topology:
IXIA-----(ens192)FRR(ens224)------iXIA
Configuration:
1. Create 8 sub-interfaces on ens192 under Default VRF and configure 8
EBGP session between FRR and IXIA.
2. Create 1000 sub-interfaces on ens224 under Default VRF and configure
1000 EBGP session between FRR and IXIA.
3. 2M prefixes distributed from Left side Ixia each with 8 ECMP path.
4. So in total, there are 2M prefixes * 8 ECMP = 16M prefixes entries
in RIB and FIB.
Issue:
Shut ens192 and ens224, this is taking 1hr 15 mins to clean up the routes.
Root Cause:
In the case of route deletion, if the particular route node has
an nht count of 0, we still go to the parent and do an nht evaluation,
which is not needed.
Fix:
If the deleted route node has an nht count > 0, then do an nht
evaluation on the parent node.
Shut ens192 and ens224, it is taking 1 min to clean up the routes
with the fix.
Signed-off-by: Sarita Patra <saritap@vmware.com>
Used for graceful-restart mostly.
Especially for bgp_show_neighbor_graceful_restart_capability_per_afi_safi()
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
This function is sure to return a correct value because of the "assert", so
checking its return value should be removed.
Signed-off-by: anlan_cs <anlan_cs@tom.com>
Currently, it is possible to rename the default VRF either by passing
`-o` option to zebra or by creating a file in `/var/run/netns` and
binding it to `/proc/self/ns/net`.
In both cases, only zebra knows about the rename and other daemons learn
about it only after they connect to zebra. This is a problem, because
daemons may read their config before they connect to zebra. To handle
this rename after the config is read, we have some special code in every
single daemon, which is not very bad but not desirable in my opinion.
But things are getting worse when we need to handle this in northbound
layer as we have to manually rewrite the config nodes. This approach is
already hacky, but still works as every daemon handles its own NB
structures. But it is completely incompatible with the central
management daemon architecture we are aiming for, as mgmtd doesn't even
have a connection with zebra to learn from it. And it shouldn't have it,
because operational state changes should never affect configuration.
To solve the problem and simplify the code, I propose to expand the `-o`
option to all daemons. By using the startup option, we let daemons know
about the rename before they read their configs so we don't need any
special code to deal with it. There's an easy way to pass the option to
all daemons by using `frr_global_options` variable.
Unfortunately, the second way of renaming by creating a file in
`/var/run/netns` is incompatible with the new mgmtd architecture.
Theoretically, we could force daemons to read their configs only after
they connect to zebra, but it means adding even more code to handle a
very specific use-case. And anyway this won't work for mgmtd as it
doesn't have a connection with zebra. So I had to remove this option.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Add optional NHG ID output to `show ip route` dumps. We have
this in json output already as nexthopGroupID but nice
to have the option in a normal dump as well. Not including in main
output for now to avoid breaking screen scrapers.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
On startup we create a thread timer event to do a rib sweep
of the system. On shutdown we never stopped this timer and
as such we have a situation where a thread event could be run
on shutdown after the data for it has been freed. Here is the
crash I am seeing:
(gdb) bt
(gdb)
Save the thread data in zebra_router and stop the thread so we don't
accidentally do work on shutdown we don't mean to. In this case
it happened in our topotests with some severe system load.
Essentially we happened to kill the zebra daemon just as the
graceful_restart timer popped here.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In some cases, zebra may install a nexthop-group id that is
different from the id of the nhe struct attached to a
route-entry. This happens for a singleton recursive nexthop,
for example, where a route is installed with the resolving
nexthop's id.
The installed value is the most useful value - that corresponds
to information in the kernel on linux/netlink platforms that
support nhgs. Display both values if they differ in ascii
output, and include both values in the json form.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Since f60a1188 we store a pointer to the VRF in the interface structure.
There's no need anymore to store a separate vrf_id field.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
During zebra shutdown, we clear out the LSP workqueue. The LSPs
will be uninstalled and freed during the shutdown process, so
just ignore any LSPs that happen to be on the workqueue.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
At some scale we eventually run out of room displaying v4/v6 route
totals for `show zebra client summ`:
janelle# show zebra client summ
Name Connect Time Last Read Last Write IPv4 Routes IPv6 Routes
--------------------------------------------------------------------------------
bgp 04w0d18h 00:00:19 00:01:2411729127/4052681 2037786/903094
This total overran the space in just a little over a week of uptime.
Expand to have a bit more room.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane_ctx_get_pbr_ipset_entry function only
failed when the caller did not pass in a valid
usable pointer. Change the code to assert on
a pointer not being passed in and remove the
bool return.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The only time this function ever failed is when
the developer does not pass in a usable pointer
to place the data in. Change it to an assert
to signify to the end developer that is what
we want and then remove all the if checks
for failure.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The function call dplane_ctx_get_pbr_ipset only
returns false when the calling function fails to
pass in a valid ipset pointer. This should
be an assertion issue since it's a programming
issue as opposed to an actual run time issue.
Change the function to not return
a bool on success/fail, making this a compile-time decision.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We should always treat the VRF interface as a loopback. Currently, this
is not the case, because in some old pre-VRF code we use if_is_loopback
instead of if_is_loopback_or_vrf. To avoid any future problems, the
proposal is to rename if_is_loopback_or_vrf to if_is_loopback and use it
everywhere. if_is_loopback is renamed to if_is_loopback_exact in case
it's ever needed, but currently it's not used anywhere.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
During shutdown, when table_manager_disable is called for the default
VRF, its vrf_id is already set to VRF_UNKNOWN, so the expression is true
and the table manager memory is not freed. Change the expression to
compare the VRF name instead of the id. The check in table_manager_enable
is changed for consistency.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Free the LSP workqueue later during shutdown, so that zebra
has enough time to clean up and uninstall any LSPs.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
42d4b30e introduced per-VRF table manager.
Table manager is allocated when the VRF is created, but it is freed when
the VRF is disabled. When this VRF is re-enabled, zebra ends up with
table manager being NULL pointer and it crashes on any dereference.
Table manager should be freed when the VRF is deleted, not when it's
disabled.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
We don't receive interface down/delete notifications from kernel when a
netns is deleted. Therefore we have to manually replicate the necessary
actions, otherwise interfaces are kept in the system with stale pointers
to the deleted netns.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Problem:
L2-VNI SVI down followed by L2-VNI's vxlan device
deletion leads to stale entry into L3VNI's
L2-VNI list.
Solution:
When L2-VNI associated SVI is down, default vrf
is the new tenant vrf.
Remove L2-VNI from L3VNI's l2vni list as
L3VNI/VRF is no longer valid in absence of associated
SVI.
When SVI is up re-add L2-VNI into associated VRF's
L3VNI.
The above remove/add from the L3VNI's L2VNI list is
already done when the vxlan or L2-VNI is flapped; we just need
to handle when the SVI is flapped.
Ticket:#2817127
Reviewed By:
Testing Done:
After deleting the SVI followed by L2-VNI deletion,
the L3VNI's L2-VNI list deletes the L2-VNI (no stale entry).
After adding back the SVI/L2-VNI, the L3VNI list adds back the
L2-VNI and it is associated with the right tenant VRF.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
In rtadv_timer(), it always uses the zvrf's socket to send RA
packets. In vrf-lite mode, this is right since it uses the default
vrf to send the RA packets. But in netns mode, it uses a socket
in each netns. So the issue only happens in netns mode, because
the zvrf's socket may not be in the same netns as the interface's
netns. In order to be compatible with both vrf-lite and netns mode,
the fix uses if_lookup_by_index() to check whether interfaces
can use the zvrf's socket.
Signed-off-by: LEI BAO <bali.baolei@cn.ibm.com>
Before 42d4b30e, table_manager_enable was called only once and the hook
was also registered once. After the change, the hook is registered per
each VRF that is created in the system. This is wrong.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>