Move a few things into the places they actually belong, and reduce the
number of places we have `#ifdef HAVE_RTADV`. Just overall code
prettification.
... I had actually done this quite a while ago while doing some other
random hacking and thought it more useful to not be sitting on it on my
disk...
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
The parent node of "vrf" MUST be non-NULL, so the check is unnecessary and
misleading. Keeping it implies a NULL-parent branch that can never happen,
so remove it.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
The kernel supports vni filtering on an l3vxlan (l3vni) device,
similar to vlan filtering on a bridge device.
To receive these netlink notifications, FRR has to register
for the new netlink RTNLGRP_TUNNEL message.
This group has to be joined via an additional
socket option because its id is beyond the group bitmap size.
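A minimal sketch of how a group beyond the 32-bit bitmap is joined (the wrapper name is illustrative; RTNLGRP_TUNNEL comes from linux/rtnetlink.h):
```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Groups above 31 cannot be expressed in sockaddr_nl.nl_groups at
 * bind() time, so they have to be joined with a socket option. */
static int netlink_join_group(int fd, unsigned int group)
{
	return setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
			  &group, sizeof(group));
}

/* e.g. netlink_join_group(fd, RTNLGRP_TUNNEL); */
```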
kernel patches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?h=v5.18-rc7&id=7b8135f4df98b155b23754b6065c157861e268f1
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?h=v5.18-rc7&id=f9c4bb0b245cee35ef66f75bf409c9573d934cf9
Ticket:#3073812
Testing Done:
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Currently, `zif->es_info.esi` is set even in a few cases in
`zebra_evpn_local_es_update()` where it is unnecessary.
Delay setting `zif->es_info.esi` and remove the annoying rollback
(i.e. unsetting `zif->es_info.esi`) on the failure path.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
The global vrf in zebra is always non-NULL. In general, it is bound to
the default vrf by `zebra_vrf_init()`, and at other times bound to some
specific vrf; either way it is non-NULL.
So remove all redundant checks on the return value of
`zebra_vrf_get_evpn()`.
Additionally, remove the unnecessary check for `zvrf` in
`zebra_vxlan_cleanup_tables()`.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
RFC 7471 Section 4.2.7:
It is possible for min delay and max delay to be the same value.
Prior to this change, the code required min < avg < max. This
change allows min == avg and avg == max.
test case:
interface eth-rt1
link-params
delay 8000 min 8000 max 8000
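A minimal sketch of the relaxed bounds check described above (the function name is illustrative, not the actual link-params parser):
```c
#include <stdbool.h>
#include <stdint.h>

/* RFC 7471 permits min == avg and avg == max, so only reject a
 * configuration that breaks the ordering min <= avg <= max. */
static bool link_delay_bounds_valid(uint32_t min, uint32_t avg, uint32_t max)
{
	return min <= avg && avg <= max;
}
```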
Signed-off-by: G. Paul Ziemba <paulz@labn.net>
Firstly, calls to `hash_get()` with a NULL `alloc_func` are left
*unchanged*; only the cases with a non-NULL `alloc_func` are touched.
Since `hash_get()` with a non-NULL `alloc_func` cannot fail, the return
value can safely be ignored: it is never NULL. In that case, remove the
unnecessary NULL check on the return value and cast the call to `void`.
Importantly, two patterns with a non-NULL `alloc_func` are also left
*unchanged*:
1) Use `assert(<returned_data> == <searching_data>)` to
ensure the call created a new node rather than finding an existing one.
Refer to `isis_vertex_queue_insert()` of isisd; there
are many examples of this case in isisd.
2) Use `<returned_data> != <searching_data>` to detect that an
existing node was found, then free <searching_data>.
Refer to `aspath_intern()` of bgpd; there are many
examples of this case in bgpd.
Here, <returned_data> is the value returned from `hash_get()`,
and <searching_data> is the data being put into the
hash table.
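A sketch of these patterns, using FRR's `hash_get()` and `hash_alloc_intern()` (the surrounding caller is illustrative only):
```c
#include <assert.h>
#include <stdlib.h>
#include "lib/hash.h"

static void hash_get_patterns(struct hash *table, void *data)
{
	void *ret;

	/* Non-NULL alloc_func cannot fail and never returns NULL, so the
	 * result may simply be ignored with an explicit (void) cast. */
	(void)hash_get(table, data, hash_alloc_intern);

	/* Pattern 1: the caller requires a freshly created node
	 * (cf. isis_vertex_queue_insert() in isisd). */
	ret = hash_get(table, data, hash_alloc_intern);
	assert(ret == data);

	/* Pattern 2: interning (cf. aspath_intern() in bgpd) - if an
	 * existing node was found, free the temporary search copy. */
	ret = hash_get(table, data, hash_alloc_intern);
	if (ret != data)
		free(data);	/* placeholder for the real free routine */
}
```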
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Don't rely on the OS interface name length definition and use the FRR
definition instead.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
There's a common pattern of "get VRF context for CLI node" here, which
first got a helper macro in zebra that then permeated into pimd.
Unfortunately the pimd copy wasn't quite adjusted correctly and thus
caused two coverity warnings (CID 1517453, CID 1517454).
Fix the PIM one, and clean up by providing a common base macro in
`lib/vty.h`.
Also rename the macros (add `_VRF`) to make more clear what they do.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
1. Add a family field to the existing ZEBRA_IPMR_ROUTE_STATS message
to get both ipv4 and ipv6 traffic stats between pim and zebra.
2. Modify the debug to print both v4/v6 prefixes.
pimd: pim6d: Modify pim_zlookup_sg_statistics to get ipv6 stats
Modify the pim_zlookup_sg_statistics api to
get ipv4/ipv6 stats from zebra, making the api
common.
Signed-off-by: Mobashshera Rasool <mrasool@vmware.com>
Modify the structure mcast_route_data to store the ipv4/ipv6
address and lastused multicast information from the kernel.
Adjust the related APIs to parse ipv4/ipv6 information.
Signed-off-by: Mobashshera Rasool <mrasool@vmware.com>
Change this API call to use a `struct ipaddr`, which encodes the type of
IP address along with it. (And rename the command, removing the `IPV4`
from its name.)
Also add a comment explaining that this function call is going to be
obsolete in the long run since pimd needs to move to proper MRIB NHT.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Add initial zebra tracepoint support infrastructure
as well as add a frr_zebra:netlink_interface
callback.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If you are in a situation where you have multiple addresses on an
interface, zebra creates one connected route for them.
The issue is that the rib entry is not created if addresses were
added before the interface was running.
We add the address to a running interface in a typical flow.
Therefore, we handle the route & rib creation within a single ADD event.
In the opposite case, we create the route entries without activating them.
These are considered to be active since ZEBRA_IFC_DOWN is not set.
On the following interface UP, we ignore the same ADDR_ADD as it overlaps
with the existing prefixes -> rib is never created.
The minimal reproducible setup:
-----------------------------------------
ip link add name dummy0 type dummy
ip addr flush dev dummy0
ip link set dummy0 down
ip addr add 192.168.1.7/24 dev dummy0
ip addr add 192.168.1.8/24 dev dummy0
ip link set dummy0 up
vtysh -c 'show ip route' | grep dummy0
Signed-off-by: Volodymyr Huti <v.huti@vyos.io>
Operators are seeing:
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [TZANK-DEMSE] netlink_nexthop_msg_encode: nhg_id 68 (zebra): proto-based nexthops only, ignoring
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_DELROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_NEWROUTE 2804:4d48:4000::/42 vrf 0(254)
Mar 28 07:19:37 kingpin zebra[418]: [TVM3E-A8ZAG] _netlink_route_build_singlepath: (single-path): 2804:4d48:4000::/42 nexthop via fe80::b6fb:e4ff:fe26:c5d5 if 2 vrf default(0)
Mar 28 07:19:37 kingpin zebra[418]: [HYEHE-CQZ9G] nl_batch_send: netlink-dp (NS 0), batch size=140, msg cnt=2
Mar 28 07:19:37 kingpin zebra[418]: [P2XBZ-RAFQ5][EC 4043309074] Failed to install Nexthop ID (68) into the kernel
This happens when `zebra nexthop proto only` is turned on.
Effectively zebra intentionally does not install the nexthop group,
but the dplane notification handler in zebra_nhg.c assumes the install
failed and prints an error message. Since skipping the install was
intentional, let's recognize that and not report it as a failure.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently if an end user has something like this:
Routing entry for 192.168.212.1/32
Known via "kernel", distance 0, metric 100, best
Last update 00:07:50 ago
* directly connected, ens5
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.212.1, ens5, src 192.168.212.19, 00:00:15
C>* 192.168.212.0/27 is directly connected, ens5, 00:07:50
K>* 192.168.212.1/32 [0/100] is directly connected, ens5, 00:07:50
and FRR does a link flap, zebra re-evaluates the routes and rejects the
default route:
2022/04/09 16:38:20 ZEBRA: [NZNZ4-7P54Y] default(0:254):0.0.0.0/0: Processing rn 0x56224dbb5b00
2022/04/09 16:38:20 ZEBRA: [ZJVZ4-XEGPF] default(0:254):0.0.0.0/0: Examine re 0x56224dbddc20 (kernel) status: Changed Installed flags: Selected dist 0 metric 100
2022/04/09 16:38:20 ZEBRA: [GG8QH-195KE] nexthop_active_update: re 0x56224dbddc20 nhe 0x56224dbdd950 (7), curr_nhe 0x56224dedb550
2022/04/09 16:38:20 ZEBRA: [T9JWA-N8HM5] nexthop_active_check: re 0x56224dbddc20, nexthop 192.168.212.1, via ens5
2022/04/09 16:38:20 ZEBRA: [M7EN1-55BTH] nexthop_active: Route Type kernel has not turned on recursion
2022/04/09 16:38:20 ZEBRA: [HJ48M-MB610] nexthop_active_check: Unable to find active nexthop
2022/04/09 16:38:20 ZEBRA: [JPJF4-TGCY5] default(0:254):0.0.0.0/0: After processing: old_selected 0x56224dbddc20 new_selected 0x0 old_fib 0x56224dbddc20 new_fib 0x0
So the 192.168.212.1 route is matched for the nexthop, but it is not connected
and zebra treats that as a problem. Modify the code such that if a system route
resolves through another system route, the resolution is accepted.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This bug should only really affect kernel routes. To reproduce:
a) Have multiple connected routes that point to the same prefix
swp8 up default 169.254.0.250/30
swp9 up default 169.254.0.250/30
b) Have a kernel route that uses one of those connected routes
7.6.2.8 via 169.254.0.249 dev swp8 proto static
(But have it choose a non-selected connected nexthop)
c) Introduce an event that causes the rib table to be reprocessed,
say an unrelated interface going up / down
This causes the route to be lost with this message:
2022/03/28 21:21:53 ZEBRA: [YXCJP-0WZWV] netlink_nexthop_msg_encode: ID (3454): 169.254.0.249, via swp8(1383) vrf default(0)
2022/03/28 21:21:53 ZEBRA: [YF2E6-J60JH] nexthop_active: 169.254.0.249, via swp8 given ifindex does not match nexthops ifindex found found: directly connected, swp9
Effectively the nexthop that zebra is choosing would not be the one
that the kernel route has chosen, so FRR removes the route:
022/03/28 21:21:53 ZEBRA: [NM15X-X83N9] rib_process: (0:254):7.6.2.8/32: rn 0x56042e632e90, removing re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [Y53JX-CBC5H] rib_unlink: (0:254):7.6.2.8/32: rn 0x56042e632e90, re 0x56042e6316e0
2022/03/28 21:21:53 ZEBRA: [KT8QQ-45WQ0] rib_gc_dest: (0:?):7.6.2.8/32: removing dest from table
What is happening?
Zebra is not checking whether any of the other connected routes
has the appropriate ifindex; it just blindly rejects the route.
So when nexthop resolution matches a connected route and the
dest->selected nexthop ifindex does not match, let's sort
through the rest of the nexthops and see if any of them match,
and if so keep the route.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This has already been a requirement for Solaris, it is still a
requirement for some of the autoconf feature checks to work correctly,
and it will be a requirement for `-fms-extensions`.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Since there are two kinds of ESI (Type-0 and Type-3), the warnings
should distinguish between the two cases.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
It's confusing for a user to see 'Tx RA failed' in the logs when
they've enabled RAs (either through interface config or BGP unnumbered)
on an interface that can't send them. Let's avoid sending RAs on
interfaces that are bridge_slaves or don't have a link-local address,
since those are two of the most common reasons for RA Tx failures.
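A rough sketch of the kind of guard this adds; the helper names here are hypothetical, not the actual zebra functions:
```c
#include <stdbool.h>
#include "lib/if.h"

/* hypothetical helpers, not real FRR symbols */
static bool if_is_bridge_member(const struct interface *ifp);
static bool if_has_ipv6_link_local(const struct interface *ifp);

/* Hypothetical guard evaluated before transmitting an RA. */
static bool rtadv_can_tx(const struct interface *ifp)
{
	/* Bridge members and interfaces without an IPv6 link-local address
	 * are the two most common sources of 'Tx RA failed' noise. */
	if (if_is_bridge_member(ifp))
		return false;
	if (!if_has_ipv6_link_local(ifp))
		return false;
	return true;
}
```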
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
The vrf bound to `l2vni/zevpn` wrongly depends on the order
in which the vxlan interface and svi interface are configured.
If the vxlan interface is set with a vlanid first, and the svi interface
is then placed into a vrf, the vxlan interface correctly inherits the `vrf`
from the svi. But with the reverse sequence (i.e. svi first, then vxlan),
the vxlan interface can't get the correct `vrf`, because the handling of
`ZEBRA_VXLIF_VLAN_CHANGE` mistakenly missed inheriting the `vrf`.
```
host# do show evpn vni 101
VNI: 101
Type: L2
Tenant VRF: vrf1
```
So update `vrf` ("Tenant VRF") of l2vni in `zebra_vxlan_if_update()`.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Since `NDA_VLAN` is no longer manually defined in the header file,
the check for `NDA_VLAN` should be removed.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Like `zvni_map_to_svi_ns()` for `ns_walk_func()`, just use `assert`
instead of an unnecessary check.
The parameters passed to these `ns_walk_func()` callbacks, e.g. `in_param`
and others, must not be NULL. So use `assert` to ensure these parameters
and remove the unnecessary checks.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When a client disconnects, we need to check & remove NHT entries for
other SAFIs too. Otherwise we crash later trying to access stale data.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
There exist code paths in the linux kernel where a dump command
will be interrupted (I am not sure I understand what this really
means) and the data sent back from the kernel is wrong or incomplete.
At this point in time I am not 100% certain what should be done, but
let's start noticing that this has happened so we can formulate a plan,
or at least allow the end operator to know bad stuff is afoot at the Circle K.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In the FreeBSD code if you delete the interface
and it has no configuration, the ifp pointer will
be deleted from the system *but* zebra continues
to dereference the just freed pointer.
==58624== Invalid read of size 1
==58624== at 0x48539F3: strlcpy (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x2B0565: ifreq_set_name (ioctl.c:48)
==58624== by 0x2B0565: if_get_flags (ioctl.c:416)
==58624== by 0x2B2D9E: ifan_read (kernel_socket.c:455)
==58624== by 0x2B2D9E: kernel_read (kernel_socket.c:1403)
==58624== by 0x499F46E: thread_call (thread.c:2002)
==58624== by 0x495D2B7: frr_run (libfrr.c:1196)
==58624== by 0x2B40B8: main (main.c:471)
==58624== Address 0x6baa7f0 is 64 bytes inside a block of size 432 free'd
==58624== at 0x484ECDC: free (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x4953A64: if_delete (if.c:283)
==58624== by 0x2A93C1: if_delete_update (interface.c:874)
==58624== by 0x2B2DF3: ifan_read (kernel_socket.c:453)
==58624== by 0x2B2DF3: kernel_read (kernel_socket.c:1403)
==58624== by 0x499F46E: thread_call (thread.c:2002)
==58624== by 0x495D2B7: frr_run (libfrr.c:1196)
==58624== by 0x2B40B8: main (main.c:471)
==58624== Block was alloc'd at
==58624== at 0x4851381: calloc (in /usr/local/libexec/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==58624== by 0x496A022: qcalloc (memory.c:116)
==58624== by 0x49546BC: if_new (if.c:164)
==58624== by 0x49546BC: if_create_name (if.c:218)
==58624== by 0x49546BC: if_get_by_name (if.c:603)
==58624== by 0x2B1295: ifm_read (kernel_socket.c:628)
==58624== by 0x2A7FB6: interface_list (if_sysctl.c:129)
==58624== by 0x2E99C8: zebra_ns_enable (zebra_ns.c:127)
==58624== by 0x2E99C8: zebra_ns_init (zebra_ns.c:214)
==58624== by 0x2B3FF2: main (main.c:401)
==58624==
Zebra needs to pass back whether or not the ifp pointer
was freed when if_delete_update is called and it should
then check in ifan_read as well as ifm_read that the
ifp pointer is still valid for use.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add new debug output to show the string of the message type that
is currently unhandled:
2022-03-24 18:30:15.284 [DEBG] zebra: [V3NSB-BPKBD] Kernel:
2022-03-24 18:30:15.284 [DEBG] zebra: [HDTM1-ENZNM] Kernel: message seq 792
2022-03-24 18:30:15.284 [DEBG] zebra: [MJD4M-0AAAR] Kernel: pid 594488, rtm_addrs {DST,GENMASK}
2022-03-24 18:30:15.285 [DEBG] zebra: [GRDRZ-0N92S] Unprocessed RTM_type: RTM_NEWMADDR(d)
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When running zebra w/ valgrind, it was noticed that zebra
was passing a bunch of uninitialized data to the kernel:
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2AFF39: if_get_mtu (ioctl.c:161)
==38194== by 0x2B12C3: ifm_read (kernel_socket.c:653)
==38194== by 0x2A7F76: interface_list (if_sysctl.c:129)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000967 is on thread 1's stack
==38194== in frame #3, created by if_get_mtu (ioctl.c:155)
==38194==
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2AFED9: if_get_metric (ioctl.c:143)
==38194== by 0x2B12CB: ifm_read (kernel_socket.c:655)
==38194== by 0x2A7F76: interface_list (if_sysctl.c:129)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000967 is on thread 1's stack
==38194== in frame #3, created by if_get_metric (ioctl.c:137)
==38194==
==38194== Syscall param ioctl(generic) points to uninitialised byte(s)
==38194== at 0x4CDF88A: ioctl (in /lib/libc.so.7)
==38194== by 0x49A4031: vrf_ioctl (vrf.c:860)
==38194== by 0x2AFE29: vrf_if_ioctl (ioctl.c:91)
==38194== by 0x2B052D: if_get_flags (ioctl.c:419)
==38194== by 0x2B1CF1: ifam_read (kernel_socket.c:930)
==38194== by 0x2A7F57: interface_list (if_sysctl.c:132)
==38194== by 0x2E9958: zebra_ns_enable (zebra_ns.c:127)
==38194== by 0x2E9958: zebra_ns_init (zebra_ns.c:214)
==38194== by 0x2B3F82: main (main.c:401)
==38194== Address 0x7fc000707 is on thread 1's stack
==38194== in frame #3, created by if_get_flags (ioctl.c:411)
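A minimal sketch of the likely shape of the fix, assuming the request structures are simply zero-initialized before being handed to ioctl() (illustrative, not the exact zebra change):
```c
#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>

static int if_get_mtu_example(int fd, const char *name, int *mtu)
{
	struct ifreq ifreq;

	/* Zero the whole structure so the kernel never sees stack
	 * garbage in the bytes this code does not explicitly set. */
	memset(&ifreq, 0, sizeof(ifreq));
	strncpy(ifreq.ifr_name, name, sizeof(ifreq.ifr_name) - 1);

	if (ioctl(fd, SIOCGIFMTU, &ifreq) < 0)
		return -1;
	*mtu = ifreq.ifr_mtu;
	return 0;
}
```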
Valgrind is no longer reporting these issues.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When `zebra_evpn_mac_svi_add()` finds a mac via
`zebra_evpn_mac_lookup()` and the found mac does not have the
svi flag, it goes on to create an appropriate mac, which ends up
calling `zebra_evpn_mac_lookup()` a second time. Looking up twice
is redundant.
As an optimization, make sure the lookup happens only once.
Modify `zebra_evpn_mac_gw_macip_add()` to check the `macp`
parameter passed by the caller, so it can tell whether a
lookup is really needed or not.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When issuing an RTM_DELETE operation and the kernel tells
us that the route is already deleted, let's not complain
about the situation:
2022/03/19 02:40:34 ZEBRA: [EC 100663303] kernel_rtm: 2a10:cc42:1d51::/48: rtm_write() unexpectedly returned -4 for command RTM_DELETE
I can recreate this issue on freebsd by doing this:
a) create a route using sharpd
b) shutdown the nexthop's interface
c) remove the route using sharpd
This would also be true of pretty much any routing protocol's behavior.
Let's just not complain about the situation if an RTM_DELETE
operation is issued and FRR is told that the route does not
exist to delete.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Since `RB_INSERT()` is only called after the element was not found in the RB
tree, it MUST succeed and return zero. Checking the return value of
`RB_INSERT()` is therefore redundant; just remove those checks.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Commit: 5d41413833 added 3 new dplane ops:
DPLANE_OP_INTF_INSTALL
DPLANE_OP_INTF_UPDATE
DPLANE_OP_INTF_DELETE
The build system does not build lua so zebra_script.c
was not updated. Update of course!
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When encoding a response to the upper level protocol the
prefixlen is not something that needs to be part of the
switch statement for handling of a prefix.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the nexthop tracking code only sends back to the requestor
what it was asked to match against. When the nexthop tracking
code was simplified in b8210849b8 to no longer need both an import check
and a nexthop check for bgpd, it was not
noticed that a longer prefix could match, yet still be reported
as a match, because FRR was not sending up both the resolved
route prefix and the route FRR was asked to match against.
This change causes the nexthop tracking code to pass
back up the requested route that was matched (so that the calling
protocol can figure out which one it is being told about)
as well as the actual prefix that it was matched to.
Fixes: #10766
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
`zvni_map_to_svi_ns()` is used to find and return one specific interface
based on the passed SVI attributes, so the two parameters `in_param` and
`p_ifp` must not be NULL.
Passing a NULL `p_ifp` makes no sense, so the check `if (p_ifp)` is
unnecessary.
Use `assert` to ensure the two parameters, and remove that unnecessary check.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
Upon 'no advertise-all-vni', clean up the l2vni from
its tenant-vrf's l3vni list, instead of from the passed
zvrf->l3vni, which will not be present in the case
of the default instance.
Reviewed By:
Testing Done:
Before Fix:
----------
TORC12(config-router-af)# advertise-all-vni
TORC12(config-router-af)# end
TORC12# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
Vxlan-Intf: vni4001
State: Up
Router MAC: 44:38:39:ff:ff:01
L2 VNIs: 134217728 0 1000 1002 <-----
After Fix:
----------
TORC12# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
Vxlan-Intf: vni4001
State: Up
Router MAC: 44:38:39:ff:ff:01
L2 VNIs: 1000 1002
Signed-off-by: Chirag Shah <chirag@nvidia.com>
The RMAC now keeps a list of nexthops to track
its existence; remove the (old way) host prefix
mapping.
Ticket: #2798406
Reviewed By:
Testing Done:
TORS1# show evpn rmac vni 4001 mac 44:38:39:ff:ff:01
MAC: 44:38:39:ff:ff:01
Remote VTEP: 36.0.0.11
Refcount: 0
Prefixes:
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Keep the list of remote-vteps/nexthops in
rmac db.
Problem:
In a CLAG deployment there might be a situation
where the CLAG secondary sends an individual ip as nexthop
along with the anycast mac as RMAC. This combination
is updated in zebra's rmac cache.
Upon recovery, the clag secondary sends a withdrawal
of the incorrect rmac and nexthop mapping.
The RMAC entry's mapping to the nh is not cleaned up properly
in the zebra rmac cache.
Fix:
The zebra rmac db needs to maintain a list of nexthops.
When a bgp withdrawal for an rmac-to-nexthop mapping
is received, remove the old nexthop from the rmac's nh
list, and if a host reference still remains for
the RMAC, fall back to the one nexthop remaining in
the list.
At most two nexthops are expected to map to an RMAC
(in a clag deployment).
Ticket: 2798406
Reviewed By:
Testing Done:
CLAG primary and secondary have advertise-pip enabled
advertise type-5 route (default route) with
individual IP as nh and individual svi mac as rmac.
- disable advertise pip on both clag devices, this
results in advertisement of routes with anycast ip as nh
and anycast mac as rmac.
- disable peerlink on clag primary, this triggers
clag secondary to (transitory) send bgp update with
individual ip as nh and anycast mac as rmac.
- At the remote vtep:
Check that zebra's rmac cache/nh mapping is correct
and points to the anycast rmac and anycast ip as the nh of the
clag system.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Cleanup the logs in the netlink code for setting
protodown on/off to be more useful to a user parsing them
after an issue.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Use the SET/UNSET/CHECK/COND macros for flag bitfields
where appropriate throughout the protodown code base.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Ensure we include the old reason when we are updating the reason
code for an evpn-mh bond member. Now that this is a common API
it could include things external to EVPN in this reason code
bitfield (ex: vrrp).
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Make the static netlink protodown function that checks
whether FRR's bit is the only protodown reason bit set more
easily readable to someone not familiar with the code.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Simplify the code for printing the reason codes via the
show command. Just remove the trailing comma
before printing.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Cleanup the logs in the api for setting protodown on/off
that zapi and others use. Make them more useful to a user parsing
them after an issue.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Avoid initialization in dplane_ctx_intf_init() so
the compiler can warn us about using uninitialized data.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
When we are processing a bond member's protodown we get from
the dataplane, check to make sure we haven't already queued
up a set. If we have, it's likely this is just a notification
we get from the kernel after we set protodown and before we have
processed the result in our dplane pthread.
This change is needed now that we set protodown via the dplane
pthread.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
When setting the protodown reason use the update api
where we can directly update the entire reason bitfield
since we have to set more than one.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Extern the api for setting the protodown reason code
bitfield directly. Some places may want to completely update the
bitfield with more than one reason at a time.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Only clear protodown reason on shutdown/sweep, retain protodown
state.
This is to retain traditional and expected behavior with daemons
like vrrpd setting protodown. They expect it to be set on shutdown
and retained on bring up to prevent traffic from being dropped.
We must clean up our reason code though to prevent us from blocking
others.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add functionality to clear any reason code set on shutdown
of zebra when we are freeing the interface, in case a bad
client didn't tell us to clear it when it shut down.
Also, in case of a crash or failure to do the above, clear reason
on startup if it is set.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add enums for set/unset of protodown state so the main thread
knows an update is already queued without actually marking it
as complete.
This makes the logic conform a bit more with other parts of the code
where we queue dplane updates and do not update our internal structs until
the success callback is received.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add a command for users to set protodown via frr.conf in
case our default conflicts with another application
they are using.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add support for setting the protodown reason code.
829eb208e8
These patches handle all our netlink code for setting the reason.
For protodown reason we only set `frr` as the reason externally
but internally we have more descriptive reasoning available via
`show interface IFNAME`. The kernel only provides a bitwidth of 32
that all userspace programs have to share so this makes the most sense.
Since this is new functionality, it needs to be added to the dplane
pthread instead. So these patches also move the protodown setting we
were doing before into the dplane pthread. For this, we abstract it a
bit more to make it a general interface LINK update dplane API. This
API can be expanded to support general link creation/updating when/if
someone ever adds that code.
We also provide a more common entrypoint for evpn-mh and for zapi clients
like vrrpd. They both call common code now to set our internal flags
for protodown and protodown reason.
Also add debugging code for dumping netlink packets with
protodown/protodown_reason.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
The constraints on host routes are too strict in the current code:
host routes with the same destination address and nexthop address are forbidden
even when they cross VRFs.
Currently host routes with different destination and nexthop addresses can cross
VRFs, which is fine. But host routes with the same addresses are forbidden from
crossing VRFs, which is wrong.
Since different VRFs can have the same addresses, leaking a specific host route
whose nexthop address equals its destination address into other VRFs is a
normal case.
This commit relaxes that constraint: host routes with the same destination
address and nexthop address are forbidden only when they do not cross VRFs.
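A rough sketch of the relaxed condition (the helper name is hypothetical; only the shape of the check is the point):
```c
#include <stdbool.h>
#include "lib/prefix.h"
#include "lib/nexthop.h"

/* hypothetical helper, not a real FRR symbol */
static bool prefix_addr_equals_nexthop(const struct prefix *p,
				       const struct nexthop *nh);

/* A host route whose destination equals its nexthop is only rejected
 * when both live in the same VRF; leaking such a route into another
 * VRF is a legitimate configuration. */
static bool host_route_forbidden(const struct prefix *p,
				 const struct nexthop *nh, vrf_id_t re_vrf_id)
{
	bool is_host = (p->family == AF_INET &&
			p->prefixlen == IPV4_MAX_BITLEN) ||
		       (p->family == AF_INET6 &&
			p->prefixlen == IPV6_MAX_BITLEN);

	return is_host && prefix_addr_equals_nexthop(p, nh) &&
	       nh->vrf_id == re_vrf_id;
}
```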
Signed-off-by: anlan_cs <vic.lan@pica8.com>
When an interface goes down, it signals any related NHGs to
re-validate themselves. During zebra shutdown, ensure we remove
any NHGs we've installed.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
In `zebra_evpn_neigh_gw_macip_add()`, it sets `mac->flags` to "ZEBRA_MAC_DEF_GW"
for "advertise-default-gw" mode. But this set is redundant because this "mac"
is already set by `zebra_evpn_mac_gw_macip_add()`.
So remove this redundant assignment.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
In the loop, the local variable `ip` is always set even if the check condition
is not satisfied.
Avoid the redundant assignment by moving it to just after the check condition
is satisfied, so `ip` is set only when it is actually needed.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
A zebra crash is seen during shutdown (frr restart).
During shutdown, remote neigh and remote mac cleanup
is triggered first, followed by the per-vni cleanup of all neighs
(including local) and macs.
The crash occurs when a remote mac is cleaned up first
while a reference to it remains in a local neigh.
When the local neigh attempts to remove itself from its associated
mac's neigh_list, it accesses freed memory and crashes.
The fix: during mac deletion, if its neigh_list is non-empty
then retain the MAC in AUTO state.
This can arise when the MAC and neigh pair are in different states
(remote/local); otherwise the order of cleanup operations
is neighs followed by macs.
The auto mac will be cleaned up later when all of the vni's neighs and macs
are cleaned up.
Ticket:CM-29826
Reviewed By:CCR-10369
Testing Done:
Configure evpn symmetric config where
MAC is in remote state and neigh is in local state.
Perform frr restart then crash is not seen.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
With recent changes to interface up mechanics in if_netlink.c
FRR was receiving as many as 4 up events for an interface
on ifdown/ifup events. This was causing timing issues
in FRR based upon some fun timings. Remove this from
happening.
Ticket: CM-31623
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
On a vxlan interface flap (protodown/up) event,
non-ptm operative interfaces do not come up,
as the protodown-up event does not trigger the "if_up()"
handler.
Ticket:CM-30477
Reviewed By:CCR-10681
Testing Done:
Validated interface flaps, ip link down, ifdown
and protodown followed by an UP event. All vxlan interfaces
come up in bgpd post flap.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
FRR needs to handle the protodown event for vxlan
interfaces.
In an MLAG scenario, one switch of the pair can put a
vxlan port into protodown state, followed by a
tunnel-ip change from the anycast IP to the individual IP.
In the absence of protodown handling, evpn ends up
advertising locally learned EVPN (MAC-IP) routes
with the individual IP as nexthop.
This leads to locally learned entries being overwritten
as remote on the MLAG pair.
Ticket:CM-24545
Reviewed By:CCR-10310
Testing Done:
In an EVPN deployment, restart one of the MLAG
daemons, which puts the vxlan interfaces into protodown state.
FRR treats protodown as oper down for vxlan interfaces;
VNI down cleans up/withdraws locally learned routes.
On the following vxlan device UP event, locally learned
routes are re-advertised.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
When an end operator is doing cross vrf imports in bgp:
router bgp 3239 vrf FOO
address-family ipv4 uni
import vrf BAR
!
and zebra has this configuration:
vrf FOO
ip protocol bgp route-map EVA
!
The current code in zebra_nhg.c was looking up the vrf of the
nexthop and attempting to apply the ip protocol route-map.
For most people the nexthop vrf and the re vrf are one and the
same so they never see a problem.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Upon restart zebra reads in the kernel state. Under linux
there is a mechanism to read the route and convert the protocol
to the correct internal FRR protocol to allow the zebra graceful
restart efforts to work properly.
Under *BSD I do not see a mechanism to convey the original FRR
protocol into the kernel and thus back out of it. Thus when
zebra crashes (or restarts) the routes read back in are kernel
routes and are effectively lost to the system and FRR cannot
remove them properly. Why? Because FRR sees kernel routes
as routes that it should not own and in general the admin
distance for those routes will be a better one than the
admin distance from a routing protocol. This is even
worse because when the graceful restart timer pops and rib_sweep
is run, FRR becomes out of sync with the state of the kernel forwarding
on *BSD.
On restart, notice that the route is a self route whose
originating protocol there is no way to know. In this case
let's set the protocol to ZEBRA_ROUTE_STATIC and set the admin
distance to 255.
This way when an upper level protocol reinstalls its route
the general zebra graceful restart code still works. The
high admin distance allows the code to just work in a way
that is graceful( HA! )
The drawback here is that the route shows up as a static
route for the time the system is doing it's work. FRR
could introduce *another* route type but this seems like
a bad idea and the STATIC route type is loosely analogous
to the type of route it has become.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR will crash when the re->type is a ZEBRA_ROUTE_ALL and it
is inserted into the meta-queue. Let's just put some basic
code in place to prevent a crash from happening. No routing
protocol should be using ZEBRA_ROUTE_ALL as a value but
bugs do happen. Let's just accept the weird route type
gracefully and move on.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There exist some interface types that are slow on startup
to fully register their link speed, especially those that
are working with an asic backend. The speed_update timer
associated with each interface would keep trying if the
system returned a MAX_UINT32 as the speed, which under linux
means either unknown or no speed at all.
Since some interface types are slow on startup let's modify
FRR to try for at most 4 minutes and give up trying on those
interfaces where we never get any useful data.
Why 4 minutes? I wanted to balance the time associated with
slow interfaces coming up with those that will never give us
a value. So I chose 4 minutes as a good ballpark of time
to keep trying.
Why not track all those interfaces and just not attempt to
do the speed lookup? I would prefer to not keep track of these
as I do not know all the interface types, nor do I wish
to keep programming as new ones come in.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Look up linked interface in the correct netns, otherwise, either a wrong
interface or NULL would be used.
For example, enable VRF netns backend, and:
ip netns add ns1
ip link add link eth0 link1 type macvlan
ip link set link1 netns ns1 up
Zebra will crash in zebra_vxlan_macvlan_up because zif->link is NULL.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
An end operator is reporting that they are receiving buffer overruns
when attempting to read from the kernel receive socket. It is
possible to adjust this size to more modern levels, especially
for when the system is under load. Modify the code base
so that *BSD operators can use the zebra `-s XXX` option
to specify a read buffer size.
Additionally set the default receive buffer size on *BSD
to 128k instead of 8k so that FRR does not run into
this issue again.
Fixes: #10666
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Use the dataplane to query and read interface NETCONF data;
add netconf-oriented data to the dplane context object, and
add accessors for it. Add handler for incoming update
processing.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Allow self-produced xxxNETCONF netlink messages through the BPF
filter we use. Just like address-configuration actions, we'll
process NETCONF changes in one path, whether the changes were
generated by zebra or by something else in the host OS.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
a) We'll need to pass the info up via some dataplane control method
(This way bsd and linux can both be zebra agnostic of each other)
b) We'll need to modify `struct interface *` to track this data
and when it changes to notify upper level protocols about it.
c) Work is needed to dump the entire mpls state at the start
so we can gather interface state. This should be done
after interface data gathering from the kernel.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Description:
===========
Change is intended to fix the NHT resolution logic:
while recursively resolving a nexthop, keep looking for a valid/usable route
in the rib instead of stopping at the first/most-specific route.
Consider the following set of events taking place on R1:
R1(config)# ip route 2.2.2.0/24 ens192
R1# sharp watch nexthop 2.2.2.32 connected
R1# show ip nht
2.2.2.32(Connected)
resolved via static
is directly connected, ens192
Client list: sharp(fd 33)
-2.2.2.32 NHT is resolved over the above valid static route.
R1# sharp install routes 2.2.2.32 nexthop 2.2.2.32 1
R1# 2.2.2.32(Connected)
resolved via static
is directly connected, ens192
Client list: sharp(fd 33)
-.32/32 comes which is going to resolve through itself, but since this is an invalid route,
it will be marked as inactive and will not affect the NHT.
R1# sharp install routes 2.2.2.31 nexthop 2.2.2.32 1
R1# 2.2.2.32(Connected)
unresolved(Connected)
Client list: sharp(fd 50)
-Now a .31/32 comes which will resolve over .32 route, but as per the current logic,
this will trigger the NHT check, in turn making the NHT unresolved.
-With fix, NHT should stay in resolved state as long as the valid static or connected route stays installed
Fix:
====
-While resolving nexthops, walk up the tree from the most-specific match
without any ZEBRA_NHT_CONNECTED check.
Co-authored-by: Vishal Dhingra <vdhingra@vmware.com>
Co-authored-by: Kantesh Mundaragi <kmundaragi@vmware.com>
Signed-off-by: Iqra Siddiqui <imujeebsiddi@vmware.com>
Two minor changes:
1) Change the return type of `zebra_evpn_mac_gw_macip_add()` to `void`.
2) Since `zebra_evpn_mac_gw_macip_add()` already `assert`s the returned
`mac`, checking its return value makes no sense. It is also more reasonable
to set `mac->flags` inside `zebra_evpn_mac_gw_macip_add()`, so
just move that flag setting into `zebra_evpn_mac_gw_macip_add()`.
Signed-off-by: anlan_cs <vic.lan@pica8.com>
The recently-added hashtable of nlsock objects needs to be
thread-safe: it's accessed from the main and dplane pthreads.
Add a mutex for it, use wrapper apis when accessing it. Add
a per-OS init/terminate api so we can do init that's not
per-vrf or per-namespace.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
NetDEF CI has been whining about multiline string style.
Make the strings single-line and call it a day.
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
Making multiple ioctl calls on the same ifreq will result in
ifreq being overwritten with garbage data. In if_get_flags,
keep the flags field safe from another possible ioctl
call before applying it.
Modified code as per code review done by Donald Sharp.
Signed-off-by: Bijan <bijanebrahimi@riseup.net>
Currently when the kernel sends netlink messages to FRR
the buffers to receive this data are of fixed length.
The kernel, with certain configurations, will send
netlink messages that are larger than this fixed length.
This leads to situations where, on startup, zebra gets
really confused about the state of the kernel. Effectively
the current algorithm is this:
read up to buffer in size
while (data to parse)
get netlink message header, look at size
parse if you can
The problem is that there is a 32k buffer we read into.
We get the first message that is, say, 1k in size, and
subtract that 1k, leaving 31k left to parse. We then
get the next header and notice that the length
of the message is 33k, which is obviously larger
than what we read in. FRR has no recovery mechanism,
nor is there a way to know, a priori, the maximum
size the kernel will send us.
Modify FRR to look at the kernel message and see if the
buffer is large enough, if not, make it large enough to
read in the message.
This code has to be per netlink socket because of the usage
of pthreads. So add to `struct nlsock` the buffer and current
buffer length. Growing it as necessary.
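A rough sketch of the general approach, using a recv() peek with MSG_TRUNC to learn how large the pending message really is before reading it (illustrative only; the real change lives in the `struct nlsock` handling in kernel_netlink.c):
```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Illustrative per-socket receive buffer, grown on demand. */
struct nl_buf {
	void *buf;
	size_t buflen;
};

static ssize_t nl_recv_grow(int fd, struct nl_buf *nl)
{
	/* With MSG_PEEK | MSG_TRUNC the return value is the full length
	 * of the pending message, even if it does not fit in the buffer. */
	ssize_t need = recv(fd, nl->buf, nl->buflen, MSG_PEEK | MSG_TRUNC);

	if (need < 0)
		return need;
	if ((size_t)need > nl->buflen) {
		void *tmp = realloc(nl->buf, need);

		if (!tmp)
			return -1;
		nl->buf = tmp;
		nl->buflen = need;
	}
	return recv(fd, nl->buf, nl->buflen, 0);
}
```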
Fixes: #10404
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Store the fd that corresponds to the appropriate `struct nlsock` and pass
that around in the dplane context instead of the pointer to the nlsock.
Modify the kernel_netlink.c code to store in a hash the `struct nlsock`
with the socket fd as the key.
Why do this? The dataplane context is used to pass around the `struct nlsock`
but the zebra code has a bug where the receive buffer for netlink
messages from the kernel is not big enough. So we need to dynamically
grow the receive buffer per socket, instead of having a non-dynamic buffer
that we read into. By passing around the fd we can look up the `struct nlsock`
that will soon have the associated buffer and not have to worry about `const`
issues that will arise.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Store and use the sequence number instead of using what is in
the `struct nlsock`. Future commits are going away from storing
the `struct nlsock` and the copy of the nlsock was guaranteeing
unique sequence numbers per message. So let's store the
sequence number to use instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Using memcmp is wrong because struct ipaddr may contain uninitialized
padding bytes that should not be compared.
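A sketch of a field-aware comparison, assuming FRR's `struct ipaddr` with its `ipa_type` discriminator (illustrative; not necessarily the exact committed fix):
```c
#include <stdbool.h>
#include "lib/ipaddr.h"
#include "lib/prefix.h"

/* Compare the type first, then only the address bytes that are
 * meaningful for that type, so uninitialized union padding in
 * struct ipaddr is never examined. */
static bool ipaddr_same(const struct ipaddr *a, const struct ipaddr *b)
{
	if (a->ipa_type != b->ipa_type)
		return false;

	switch (a->ipa_type) {
	case IPADDR_V4:
		return IPV4_ADDR_SAME(&a->ipaddr_v4, &b->ipaddr_v4);
	case IPADDR_V6:
		return IPV6_ADDR_SAME(&a->ipaddr_v6, &b->ipaddr_v6);
	default:
		return true;	/* IPADDR_NONE: nothing else to compare */
	}
}
```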
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When using wait-for-install there exist situations where
zebra will issue several route change operations to the kernel
but end up in a state we shouldn't be in,
due to extra data being received. Example:
a) zebra receives a route change from bgp, installs it and sends the
route to the kernel.
b) zebra receives a route deletion from bgp, removes the
struct route entry and then sends to the kernel a deletion.
c) zebra receives an asynchronous notification that (a) succeeded
but we treat this as a new route.
This is the ships in the night problem. In this case if we receive
notification from the kernel about a route that we know nothing
about and we are not in startup and we are doing asic offload
then we can ignore this update.
Ticket: #2563300
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The ctx->zd_is_update is being set in various
spots based upon the same value that we are
passing into dplane_ctx_ns_init. Let's just
consolidate all this into dplane_ctx_ns_init
so that the zd_is_update value is set at the
same time that we increment the sequence numbers
to use.
As a note for future me's reading this: the sequence
number used for the seq number passed to the
kernel works because each context gets a copy of the
appropriate nlsock to use. Since it's a copy
at a point in time, we know we have a unique sequence
number value.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When nl_batch_read_resp gets a full-on failure (-1) or an implicit
ack (0) from the kernel for a batch, let's immediately
mark everything in the batch pass/fail as needed, instead
of having it marked elsewhere.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
I'm seeing this crash in various forms:
Program terminated with signal SIGSEGV, Segmentation fault.
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f418efbc7c0 (LWP 3580253))]
(gdb) bt
(gdb) f 4
267 (*func)(hb, arg);
(gdb) p hb
$1 = (struct hash_bucket *) 0x558cdaafb250
(gdb) p *hb
$2 = {len = 0, next = 0x0, key = 0, data = 0x0}
(gdb)
I've also seen a crash where data is 0x03.
My suspicion is that hash_iterate is calling zebra_nhg_sweep_entry which
does delete the particular entry we are looking at as well as possibly other
entries when the ref count for those entries gets set to 0 as well.
Then we have this loop in hash_iterate.c:
for (i = 0; i < hash->size; i++)
for (hb = hash->index[i]; hb; hb = hbnext) {
/* get pointer to next hash bucket here, in case (*func)
* decides to delete hb by calling hash_release
*/
hbnext = hb->next;
(*func)(hb, arg);
}
Suppose in the previous loop hbnext is set to hb->next and we call
zebra_nhg_sweep_entry. This deletes the previous entry and also
happens to cause the hbnext entry to be deleted as well, because of nhg
refcounts. At this point in time the memory pointed to by hbnext is
not owned by the pthread anymore, and we can end up in a state where
it's overwritten by another pthread in zebra with data for other incoming events.
What to do? Let's change the sweep function to a hash_walk and have
it stop iterating and start over if there is a possible double
delete operation.
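A rough sketch of the reworked sweep, using FRR's `hash_walk()` and its HASHWALK_ABORT convention (the zebra-specific pieces are illustrative):
```c
#include <stdbool.h>
#include "lib/hash.h"

/* illustrative: true if the entry was released (possibly with others) */
static bool nhg_sweep_entry_if_unused(void *data);

static int nhg_sweep_walker(struct hash_bucket *bucket, void *arg)
{
	bool *swept = arg;

	/* Releasing an entry can recursively free other buckets via nhg
	 * refcounts, so abort the walk and let the caller start over. */
	if (nhg_sweep_entry_if_unused(bucket->data)) {
		*swept = true;
		return HASHWALK_ABORT;
	}
	return HASHWALK_CONTINUE;
}

static void nhg_sweep(struct hash *nhgs)
{
	bool swept;

	do {
		swept = false;
		hash_walk(nhgs, nhg_sweep_walker, &swept);
	} while (swept);
}
```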
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add to `show zebra` whether or not RA is compiled into FRR
and whether or not BGP is using RFC 5549 at the moment.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dataplane pthread only processes a limited set of incoming
netlink notifications: only register for that set of events,
reducing duplicate incoming netlink messages.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Current code treats all metaqueues as lists of route_node structures.
However, some queues contain other structures that need to be cleaned up
differently. Casting the elements of those queues to struct route_node
and dereferencing them leads to a crash. The crash may be seen when
executing bgp_multi_vrf_topo2.
Fix the code by using the proper list element types.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
The name 'opaque' is a little general - call the route_entry
struct 're_opaque' to make it more specific.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
RA packets are pretty chatty, and when there is a warning from
a misconfiguration on the network, the log file gets filled
up with warnings. Modify the code in rtadv.c to only spit
out the warning in these cases at most every 6 hours.
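A minimal sketch of the kind of rate limiting described (the function name is illustrative; `monotime()` is FRR's monotonic clock helper):
```c
#include <stdbool.h>
#include <time.h>
#include "lib/monotime.h"

#define RTADV_WARN_INTERVAL (6 * 60 * 60)	/* at most one warning / 6h */

static bool rtadv_warning_allowed(void)
{
	static time_t last_warning;
	time_t now = monotime(NULL);

	/* Suppress the warning if one was already logged recently. */
	if (last_warning && now - last_warning < RTADV_WARN_INTERVAL)
		return false;
	last_warning = now;
	return true;
}
```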
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
EVPN route add should be queued to preserve the config order.
In particular, against deletion in rib_delete().
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
When an operator has an FRR-based route installed into the
FIB and a better route comes in from the system, there
is code in the data plane to schedule the batching
and continue processing. But in this case we are done,
so we can just return.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
vrf_disable is always called before
vrf_delete. The rnh_table and rnh_table_multicast tables
are already deleted as part of vrf_disable. No need
to do it again.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
VRF name should not be printed in the config since 574445ec. The update
was done for NB config output but I missed it for regular vty output.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Add a thread_ignore_late_timer(struct thread *thread) function
that allows thread.c to ignore when timers are late to the party.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Pass in the route_node that is under consideration
into route_notify_internal to allow calling functions
to reduce stack size as well as looking up data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dest_pfx was pretty much only ever used for
debug output and FRR already knows the rn. So
use that instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dest_p and src_p values were only ever used for
debugs and %pFX, when we already have the rn.
There is no need to do this lookup.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR should give parameter names in prototypes instead of leaving
them out in the .h file. This just cleans up that
problem for redistribute.h.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The function zsend_redistribute_route uses the prefix and
source prefix. Just pass in the route_node instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR is passing around a bunch of data that is encapsulated
within the route node. Let's just pass that around instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR is passing around a bunch of data that is encapsulated
within the route node. Let's just pass that around instead.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If you have this setup:
router ospf 3
redistribute sharp
!
and then install:
sharp install route 4.5.6.7 nexthop 192.168.100.1 1
sharp install route 4.5.6.8 nexthop 192.168.100.1 1 instance 3
sharp install route 4.5.6.9 nexthop 192.168.100.1 1 instance 4
The .8 and .9 routes are auto redistributed into ospf instance 3:
eva# show ip ospf data
OSPF Instance: 3
OSPF Router with ID (192.168.122.1)
AS External Link States
Link ID ADV Router Age Seq# CkSum Route
4.5.6.7 192.168.122.1 13 0x80000001 0x477c E2 4.5.6.7/32 [0x0]
4.5.6.8 192.168.122.1 5 0x80000001 0x3d85 E2 4.5.6.8/32 [0x0]
4.5.6.9 192.168.122.1 5 0x80000001 0x338e E2 4.5.6.9/32 [0x0]
This cannot be correct behavior. When redistributing in the absence
of an instance number, the default instance of 0 should be used and its
route should be the only one redistributed. Here is the correct behavior:
eva# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.119.1, enp39s0, 00:00:28
D>* 4.5.6.7/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
D[3]>* 4.5.6.8/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
D[4]>* 4.5.6.9/32 [150/0] via 192.168.100.1, virbr1, weight 1, 00:00:02
C>* 192.168.100.0/24 is directly connected, virbr1, 00:00:28
C>* 192.168.110.0/24 is directly connected, virbr2, 00:00:28
C>* 192.168.119.0/24 is directly connected, enp39s0, 00:00:28
C>* 192.168.122.0/24 is directly connected, virbr0, 00:00:28
eva# show ip ospf data
OSPF Instance: 3
OSPF Router with ID (192.168.122.1)
AS External Link States
Link ID ADV Router Age Seq# CkSum Route
4.5.6.7 192.168.122.1 6 0x80000001 0x477c E2 4.5.6.7/32 [0x0]
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
FRR allows redistribution to a client with a specific
instance in mind. The code was not allowing you to figure
out what instance was being looked at. So let's clarify this
in the debugs.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Update ospfd and ospf6d to send opaque route attributes to
zebra. Those attributes are stored in the RIB and can be viewed
using the "show ip[v6] route" commands (other than that, they are
completely ignored by zebra).
Example:
```
debian# show ip route 192.168.1.0/24
Routing entry for 192.168.1.0/24
Known via "ospf", distance 110, metric 20, best
Last update 01:57:08 ago
* 10.0.1.2, via eth-rt2, weight 1
OSPF path type : External-2
OSPF tag : 0
debian#
debian# show ip route 192.168.1.0/24 json
{
"192.168.1.0\/24":[
{
"prefix":"192.168.1.0\/24",
"prefixLen":24,
"protocol":"ospf",
"vrfId":0,
"vrfName":"default",
"selected":true,
[snip]
"ospfPathType":"External-2",
"ospfTag":"0"
}
]
}
```
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
Topology:
IXIA-----(ens192)FRR(ens224)------iXIA
Configuration:
1. Create 8 sub-interfaces on ens192 under Default VRF and configure 8
EBGP session between FRR and IXIA.
2. Create 1000 sub-interfaces on ens224 under Default VRF and configure
1000 EBGP session between FRR and IXIA.
3. 2M prefixes distributed from Left side Ixia each with 8 ECMP path.
4. So in total, there are 2M prefixes * 8 ECMP = 16M prefixes entries
in RIB and FIB.
Issue:
Shut ens192 and ens224, this is taking 1hr 15 mins to clean up the routes.
Root Cause:
In the case of route deletion, if the particular route node has an
nht count = 0, we still go to the parent and do an nht evaluation,
which is not needed.
Fix:
If the deleted route node has an nht count > 0, then do an nht
evaluation on the parent node.
Shut ens192 and ens224, it is taking 1 min to clean up the routes
with the fix.
Signed-off-by: Sarita Patra <saritap@vmware.com>
Used for graceful-restart mostly.
Especially for bgp_show_neighbor_graceful_restart_capability_per_afi_safi()
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
This function is guaranteed by its `assert` to return a correct value, so
checking its return value should be removed.
Signed-off-by: anlan_cs <anlan_cs@tom.com>
Currently, it is possible to rename the default VRF either by passing
`-o` option to zebra or by creating a file in `/var/run/netns` and
binding it to `/proc/self/ns/net`.
In both cases, only zebra knows about the rename and other daemons learn
about it only after they connect to zebra. This is a problem, because
daemons may read their config before they connect to zebra. To handle
this rename after the config is read, we have some special code in every
single daemon, which is not very bad but not desirable in my opinion.
But things are getting worse when we need to handle this in northbound
layer as we have to manually rewrite the config nodes. This approach is
already hacky, but still works as every daemon handles its own NB
structures. But it is completely incompatible with the central
management daemon architecture we are aiming for, as mgmtd doesn't even
have a connection with zebra to learn from it. And it shouldn't have it,
because operational state changes should never affect configuration.
To solve the problem and simplify the code, I propose to expand the `-o`
option to all daemons. By using the startup option, we let daemons know
about the rename before they read their configs so we don't need any
special code to deal with it. There's an easy way to pass the option to
all daemons by using `frr_global_options` variable.
Unfortunately, the second way of renaming by creating a file in
`/var/run/netns` is incompatible with the new mgmtd architecture.
Theoretically, we could force daemons to read their configs only after
they connect to zebra, but it means adding even more code to handle a
very specific use-case. And anyway this won't work for mgmtd as it
doesn't have a connection with zebra. So I had to remove this option.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Add optional NHG ID output to `show ip route` dumps. We have
this in json output already as nexthopGroupID but nice
to have the option in a normal dump as well. Not including in main
output for now to avoid breaking screen scrapers.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
On startup we create a thread timer event to do a rib sweep
of the system. On shutdown we never stopped this timer and
as such we have a situation where a thread event could be run
on shutdown after the data for it has been freed. Here is the
crash I am seeing:
(gdb) bt
(gdb)
Save the thread data in zebra_router and stop the thread so we don't
accidentally do work on shutdown we don't mean to. In this case
it happened in our topotests with some severe system load.
Essentially we happened to kill the zebra daemon just as the
graceful_restart timer popped here.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In some cases, zebra may install a nexthop-group id that is
different from the id of the nhe struct attached to a
route-entry. This happens for a singleton recursive nexthop,
for example, where a route is installed with the resolving
nexthop's id.
The installed value is the most useful value - that corresponds
to information in the kernel on linux/netlink platforms that
support nhgs. Display both values if they differ in ascii
output, and include both values in the json form.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Since f60a1188 we store a pointer to the VRF in the interface structure.
There's no need anymore to store a separate vrf_id field.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
During zebra shutdown, we clear out the LSP workqueue. The LSPs
will be uninstalled and freed during the shutdown process, so
just ignore any LSPs that happen to be on the workqueue.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
At some scale we eventually run out of room displaying v4/v6 route
totals for `show zebra client summ`:
janelle# show zebra client summ
Name Connect Time Last Read Last Write IPv4 Routes IPv6 Routes
--------------------------------------------------------------------------------
bgp 04w0d18h 00:00:19 00:01:2411729127/4052681 2037786/903094
This total over ran the space in just a little over a week of uptime.
Expand to have a bit more room.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane_ctx_get_pbr_ipset_entry function only
failed when the caller did not pass in a valid
usable pointer. Change the code to assert on
a pointer not being passed in and remove the
bool return
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The only time this function ever failed is when
the developer does not pass in a usable pointer
to place the data in. Change it to an assert
to signify to the end developer that is what
we want and then remove all the if checks
for failure
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The function call dplane_ctx_get_pbr_ipset only
returns false when the calling function fails to
pass in a valid ipset pointer. This should
be an assertion issue since it's a programming
issue as opposed to an actual run time issue.
Change the function signature to not return
a bool for success/fail, making this a compile-time decision.
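For illustration only (hypothetical names, not the actual dplane API), the pattern across these changes looks roughly like this:
```
/* Sketch: hypothetical context type, not the real zebra dplane code. */
#include <assert.h>

struct ctx { int ipset_kind; };

/* Before: returned false when the caller passed a NULL destination.
 * After: a NULL destination is a programming error, so assert on it
 * and return nothing; callers no longer check a bool result. */
static void ctx_get_ipset_kind(const struct ctx *ctx, int *kind)
{
	assert(ctx && kind);	/* caller bug, not a runtime condition */
	*kind = ctx->ipset_kind;
}
```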
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We should always treat the VRF interface as a loopback. Currently, this
is not the case, because in some old pre-VRF code we use if_is_loopback
instead of if_is_loopback_or_vrf. To avoid any future problems, the
proposal is to rename if_is_loopback_or_vrf to if_is_loopback and use it
everywhere. if_is_loopback is renamed to if_is_loopback_exact in case
it's ever needed, but currently it's not used anywhere.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
During shutdown, when table_manager_disable is called for the default
VRF, its vrf_id is already set to VRF_UNKNOWN, so the expression is true
and the table manager memory is not freed. Change the expression to
compare the VRF name instead of the id. The check in table_manager_enable
is changed for consistency.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Free the LSP workqueue later during shutdown, so that zebra
has enough time to clean up and uninstall any LSPs.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
42d4b30e introduced per-VRF table manager.
Table manager is allocated when the VRF is created, but it is freed when
the VRF is disabled. When this VRF is re-enabled, zebra ends up with
table manager being NULL pointer and it crashes on any dereference.
Table manager should be freed when the VRF is deleted, not when it's
disabled.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
We don't receive interface down/delete notifications from kernel when a
netns is deleted. Therefore we have to manually replicate the necessary
actions, otherwise interfaces are kept in the system with stale pointers
to the deleted netns.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Problem:
An L2-VNI SVI down followed by the L2-VNI's vxlan device
deletion leads to a stale entry in the L3VNI's
L2-VNI list.
Solution:
When the L2-VNI's associated SVI is down, the default vrf
becomes the new tenant vrf.
Remove the L2-VNI from the L3VNI's l2vni list, as the
L3VNI/VRF is no longer valid in the absence of the associated
SVI.
When the SVI is up, re-add the L2-VNI into the associated VRF's
L3VNI.
The above remove/add from the L3VNI's L2VNI list is
already done when the vxlan device or L2-VNI is flapped; we just
need to handle when the SVI is flapped.
Ticket:#2817127
Reviewed By:
Testing Done:
After deleting the SVI followed by L2-VNI deletion,
the L3VNI's L2-VNI list deletes the L2-VNI (no stale entry).
After adding back the SVI/L2-VNI, the L3VNI list adds back the
L2-VNI and it is associated with the right tenant VRF.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
In rtadv_timer(), the zvrf's socket is always used to send RA
packets. In vrf-lite mode this is right, since the default vrf is
used to send the RA packets. But in netns mode, a socket is used
in each netns. So the issue only happens in netns mode, because
the zvrf's socket may not be in the same netns as the interface's
netns. In order to be compatible with both vrf-lite and netns mode,
the fix uses if_lookup_by_index() to check whether interfaces
can use the zvrf's socket.
Signed-off-by: LEI BAO <bali.baolei@cn.ibm.com>
Before 42d4b30e, table_manager_enable was called only once and the hook
was also registered once. After the change, the hook is registered per
each VRF that is created in the system. This is wrong.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Currently the NEXTHOP_TYPE_IPV4 and NEXTHOP_TYPE_IPV6 are
not sending up the resolved ifindex for the route. This
is causing upper level protocols that have something like
this:
route-map FOO permit 10
match interface swp13
!
router ospf
redistribute static
!
ip route 4.5.6.7/32 10.10.10.10
where 10.10.10.10 resolves to interface swp13. The route-map
will never match in this case.
Since FRR has the resolved nexthop interface, FRR might as
well send it up to be selected on by the upper level protocol
as needed.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
It appears that without that change, there were no notifications
sent to bgp daemon, after flowspec operations have been sent to
zebra.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The ipset entry needs to know which address family it applies to.
The family is present in the original ipset structure but was not
passed as an attribute in the dataplane ipset_info structure. Add
it.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When injecting an ipset entry into the zebra dataplane context, the
ipset name is stored in a separate structure. This will permit the
flowspec plugin to be able to know which ipset has to be appended with
relevant ipset entry.
The problem was that the zebra dataplane object related to ipset entries
was made up of a union of the ipset structure and the ipset info
structure. This implied that the two structures occupied the same
memory, and when extracting the stored data, the data was incomplete.
Fix this by replacing the union with a plain struct.
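As a rough illustration of the fix (field names here are made up, not the real dplane context layout): with a union the two objects share storage, so filling in one clobbers the other, while a struct keeps both intact.
```
/* Sketch only: illustrative fields, not the actual zebra dplane types. */
struct ipset { char name[32]; int type; };
struct ipset_info { char ipset_name[32]; int family; };

/* Before (sketch): both objects overlapped in memory, so writing the
 * entry info corrupted the ipset data read back later. */
union dplane_ipset_union {
	struct ipset set;
	struct ipset_info info;
};

/* After (sketch): a plain struct keeps the two objects side by side. */
struct dplane_ipset_struct {
	struct ipset set;
	struct ipset_info info;
};
```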
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When the netns is deleted, we should always clear the vrf->ns_ctxt
pointer. Currently, it is not cleared when there are interfaces in the
netns at the time of deletion.
If the netns is re-created, zebra crashes because it tries to use the
stale pointer.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
if_lookup_by_index_all_vrf doesn't work correctly with netns VRF backend
as the same index may be used in multiple netns simultaneously.
In both cases where it's used, we know the VRF in which we need to look
for the interface.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
The kernel can return to us nested attributes for BRIDGE RTM_NEWNEIGH
attributes. Just ensure that we can parse and read them.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
With the addition of resilient hashing for nexthops, the
parsing of nexthops requires telling the decoder functions
that there may be nested attributes. This was found by
code inspection of iproute2/ipnexthop.c when trying to
understand resilient hashing as well as statistics
gathering for nexthops that are / will be in upstream
kernels in the near future.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add actual recent nexthop.h file from kernel
and fix up resulting fallout because FRR's
original nexthop.h did not match upstream
linux kernel.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When GRE information could not be retrieved because the GRE interface
has been deleted, a GRE_UPDATE message may be sent to NHRP. In that
case, the gre values are reset. The tunnel destination value was
missing from this reset; it had been omitted.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
There is a bit of an impedance mismatch in the sequence of events here.
Depending on the dplane behavior, the `ROUTE_ENTRY_SELECTED` bit will be
inconsistent for rib_process_result().
With an asynchronous dataplane:
0. rib_process() is called
1. rib_install_kernel() is called, dplane action is queued
2. rib_install_kernel() returns
3. rib_process() sets the SELECTED bit appropriately, returns
4. dplane is done, triggers rib_process_result()
5. SELECTED bit is seen in "after" state
(5a. NHT code looks at the SELECTED bit, works correctly.)
With a synchronous dataplane:
0. rib_process() is called
1. rib_install_kernel() is called, dplane action is executed
2. dplane (should) trigger rib_process_result()
3. SELECTED bit is seen in "before" state
(3a. NHT code looks at the SELECTED bit, fails.)
4. rib_install_kernel() returns
5. rib_process() sets the SELECTED bit appropriately, too late.
Essentially, poking the dataplane is a sequencing point where control is
handed over to the dplane. Control may or may not return immediately.
Doing /anything/ after triggering the dataplane is a recipe for odd race
conditions.
(FWIW, I'm not sure rib_process_result() is called correctly in the
synchronous case, but that's a separate problem.)
Unfortunately, this change might have some unforeseen side effects. I
haven't dug through the code to see if anything breaks. There
/shouldn't/ be anything looking at the SELECTED bit here, but who knows.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Do not return pointer to the newly created thread from various thread_add
functions. This should prevent developers from storing a thread pointer
into some variable without letting the lib know that the pointer is
stored. When the lib doesn't know that the pointer is stored, it doesn't
prevent rescheduling and it can lead to hard to find bugs. If someone
wants to store the pointer, they should pass a double pointer as the last
argument.
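As a rough sketch of the new calling convention (simplified types; the exact lib/thread.h signature may differ), the caller hands over the location of its pointer instead of stashing the return value:
```
/* Sketch with simplified types; the real lib/thread.h API differs in detail. */
#include <stddef.h>

struct thread;
struct thread_master;

/* After the change the add functions return void; a caller that wants to
 * keep the pointer passes its address as the last argument, so the lib
 * both stores it and knows it is stored (no silent double-scheduling). */
void thread_add_timer(struct thread_master *m,
		      int (*func)(struct thread *), void *arg,
		      long timer_secs, struct thread **ref);

static struct thread *t_sweep;	/* reference managed by the lib */

static void schedule_sweep(struct thread_master *m, int (*cb)(struct thread *))
{
	/* The lib sees t_sweep and will not double-schedule the event. */
	thread_add_timer(m, cb, NULL, 60, &t_sweep);
}
```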
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
rib_update() was mallocing memory and then attempting to schedule,
and if the schedule failed (it was already going to be run)
FRR would then free the memory. Fix this memory usage pattern.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
It allows FRR to read the interface config even when the necessary VRFs
are not yet created and interfaces are in "wrong" VRFs. Currently, such
config is rejected.
For VRF-lite backend, we don't care at all about the VRF of the inactive
interface. When the interface is created in the OS and becomes active,
we always use its actual VRF instead of the configured one. So there's
no need to reject the config.
For netns backend, we may have multiple interfaces with the same name in
different VRFs. So we care about the VRF of inactive interfaces. And we
must allow preconfiguring the interface in a VRF even before it is
moved to the corresponding netns. From now on, we allow creating
multiple configs for the same interface name in different VRFs, and
the necessary config is applied once the OS interface is moved to the
corresponding netns.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When something is used only from zebra and part of its description is
"should be called from zebra only" then it belongs to zebra, not lib.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When an ES is deleted and re-added bgpd can start sending MAC-IP sync updates
before the dataplane and zebra have setup the VLAN membership for the ES. Such
MAC entries are not installed in the dataplane till the ES-EVI is created.
Ticket: #2668488
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
In the window immediately after an ES deletion bgpd can send MAC-IP updates
using that ES. Zebra needs to ignore these updates to prevent creation
of stale entries.
Ticket: #2668488
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
This addresses deletion of ES interfaces that were not completely
configured.
Ticket: #2668488
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
When PTM sends a "cbl status" message it specifies the interface name
but not the VRF name. It is fine for VRF-lite, but doesn't work for
netns because it's possible to have multiple interfaces with the same
name. Be more restrictive in this case and return an error instead of
randomly using the interface with the specified name.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
With netns VRF backend, we may have multiple interfaces with the same
name. Currently, the function output is not deterministic in this case,
it returns the first interface that it finds in the list. Be more
explicit and tell the user that we need the VRF name.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
```
exit1-debian-9# show ip route 172.16.16.1/32
Routing entry for 172.16.16.1/32
Known via "bgp", distance 20, metric 0, best
Last update 00:00:28 ago
* 192.168.0.2, via eth1, weight 1
AS-Path : 65003
Communities : first 65001:2 65001:3
Large-Communities: 65001:1:1 65001:1:2 65001:1:3
Selection reason : First path received
```
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
Currently, the ll_type is set only in `netlink_interface` which is
executed only during startup. If the interface is created when FRR
is already running, the type is not stored.
Fixes #1164.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When a client tells zebra that GR mode is being turned
on, it also passes down the time zebra should hold
onto the routes. Display this time in the output
of the `show zebra client` command as well.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When issuing the `show zebra client` command data about
Graceful Restart state is printed twice.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
On startup, zebra would dump interface information from the kernel in 3
steps without a lock: step 1, get interface information; step 2, get
interface ipv4 addresses; step 3, get interface ipv6 addresses.
If any interface gets added after step 1, but before step 2/3, zebra
would get extra interface addresses in step 2/3 for interfaces that
have not been added to zebra in step 1. Returning an error in the
referenced interface lookup would cause the startup interface
retrieval to be incomplete.
Signed-off-by: Yuan Yuan <yyuanam@amazon.com>
FRR should only ever use the appropriate THREAD_ON/THREAD_OFF
semantics. This is especially true for the functions we
end up calling the thread for.
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
There's a helper function to check whether the interface is loopback or
VRF - if_is_loopback_or_vrf. Let's use it whenever we need to check that.
There's no functional change in this commit.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Pass down the safi for when we need address
resolution. At this point in time we are
hard coding the safi to SAFI_UNICAST.
Future commits will take advantage of this.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
PIM is going to need to be able to send down the address it is
trying to resolve in the multicast rib. We need a way to signal
this to the end developer. Start the conversion by adding the
ability to have a safi. But only allow SAFI_UNICAST at the moment.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The entirety of the import checking no longer needs to be
in zebra, as no one is calling it. Remove the code.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
These are no longer really needed. The client just needs
to call nexthop resolution instead.
So let's remove the zapi types.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There were two identical blocks of code run at init time that
requested info about AF_BRIDGE - don't see any reason to do that
twice, so remove one block.
Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Because the vrf backend may be based on namespaces, each vrf can
use table identifiers in the [16-(2^32-1)] range for daemons that
request them. Extend the table manager to be hosted per vrf.
That possibility is disabled when the vrf backend is vrf-lite.
In that case, all vrf contexts use the same table manager instance.
Add a configuration command to be able to configure the desired
range of tables to use. This is a solution that permits giving
chunks to the bgp daemon when it works with bgp flowspec entries and
wants to use specific iptables that do not override vrf tables.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using bgp evpn rt5 setup, after BGP configuration has been
loaded, if the user attempts to detach and reattach the bridged
vxlan interface from the bridge, then BGP loses its BGP EVPN
contexts, and a refresh of BGP configuration is necessary to
maintain consistency between linux configuration and BGP EVPN
contexts (RIB). The following command can lead to inconsistency:
ip netns exec cust1 ip link set dev vxlan1000 nomaster
ip netns exec cust1 ip link set dev vxlan1000 master br1000
Following these commands, the BGP l2vpn evpn RIB is empty, and the way
to solve this until now has been to reconfigure EVPN like this:
vrf cust1
no vni 1000
vni 1000
exit-vrf
Actually, the link information is correctly handled. In fact,
at the time of the link event, the lower link status of the bridge
interface was not yet up, thus preventing BGP EVPN contexts
from being established. In fact, when a bridge interface does not
have any slave interface, the link status of the bridge interface
is down. That change of status comes a bit after, and is not
detected by slave interfaces, as this event is not intercepted.
This commit intercepts the bridge link up event, and triggers
a check on slaved vxlan interfaces.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When running a bgp evpn rt5 setup, the Rmac sent in BGP updates
stands for the MAC address of the bridge interface. After the frr
configuration has been loaded, the Rmac address is not refreshed.
This issue can be easily reproduced by executing some commands:
ip netns exec cust1 ip link set dev br1000 address 2e:ab:45:aa:bb:cc
Actually, the BGP EVPN contexts are kept unchanged.
This commit proposes to fix this by intercepting the MAC address
change and refreshing the vxlan interfaces attached to the bridge
interface that changed its MAC address.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
We should not be using a `default` case with an enumerated type.
Doing so prevents the developer adding new cases from finding, just
by compiling, the places they need to fix.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Move the handler for incoming interface address events
to a neutral source file - it's not netlink-specific and
shouldn't have been in a netlink file.
Signed-off-by: Mark Stapp <mjs.ietf@gmail.com>
Read incoming interface address change notifications in the
dplane pthread; enqueue the events to the main pthread
for processing. This is netlink-only for now - the bsd
kernel socket path remains unchanged.
Signed-off-by: Mark Stapp <mjs.ietf@gmail.com>
Add new apis for dplane interface address handling, based on
the existing api. The existing api is basically split in two:
the first part processes an incoming netlink message in the
dplane pthread, creating a dplane context with info about
the event. The second part runs in the main pthread and uses
the context data to update an interface or connected object.
Signed-off-by: Mark Stapp <mjs.ietf@gmail.com>
Add a new netlink socket for events coming in from the host OS
to the dataplane system for processing. Rename the existing
outbound dplane socket.
Signed-off-by: Mark Stapp <mjs.ietf@gmail.com>
Description: Currently, IPv4 routes with IPv6 link-local nexthops are
not properly installed in FPM.
The reason is that the netlink decoding truncates the ipv6 LL address
to a 4-byte ipv4 address.
Ex: fe80:: is directly converted to ipv4, resulting in 254.128.0.0
as the nexthop for the routes below:
show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
B>* 2.1.0.0/16 [200/0] via fe80::268a:7ff:fed0:d40, Ethernet0, weight 1,
02:22:26
B>* 5.1.0.0/16 [200/0] via fe80::268a:7ff:fed0:d40, Ethernet0, weight 1,
02:22:26
B>* 10.1.0.2/32 [200/0] via fe80::268a:7ff:fed0:d40, Ethernet0, weight
1, 02:22:26
Hence this fix converts the ipv6-LL address to the ipv4-LL address
(169.254.0.1) before sending it to FPM. This is in line with how these
types of routes are currently programmed into the kernel.
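A minimal standalone sketch of that conversion (not the actual FPM encoder code):
```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Sketch only: if the nexthop is an IPv6 link-local address, substitute
 * the IPv4 link-local 169.254.0.1, mirroring how such routes are
 * programmed into the kernel. */
static void fpm_fix_v4_nexthop(const struct in6_addr *nh6, struct in_addr *nh4)
{
	if (IN6_IS_ADDR_LINKLOCAL(nh6))
		inet_pton(AF_INET, "169.254.0.1", nh4);
}
```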
Signed-off-by: Nikhil Kelapure <nikhil.kelapure@broadcom.com>
The current implementation doesn't copy nexthop_srv6. This causes
unexpected behavior when receiving SID information and the nexthop
isn't onlink.
Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Problem:
When IP1:M1 (local) moved to IP1:M2 (remote-VTEP) bgpd continues to
advertise IP1:M1.
Fix:
Local path del is sent to bgp if the neigh was {local-active||peer-active}.
So path del needs to be called before the sync flags (including peer-active)
are cleared.
Ticket: #2706744
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
When we hand-set the router-id but have chosen a router-id
that is already the `winner`, there is no point in updating anyone
with this data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
At startup there exists a time frame where we might not know
a particular vrf's router id. When zebra gets a request for
it let's not just blindly send whatever we have. Let's be
a bit smart and only respond with one if we have one.
The upper level protocol can wait for it to have one.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
vrf_name_to_id() returned VRF_DEFAULT when the vrf name was
unknown, hiding errors. Per community recommendation, vrf_name_to_id()
is now removed and the few callers now use vrf_lookup_by_name()
directly.
Signed-off-by: G. Paul Ziemba <paulz@labn.net>
When running a bgp evpn rt5 setup with the vrf namespace backend, once
the BGP configuration is loaded, some refreshes, like the config change
of a vxlan interface, are not taken into account. As a consequence, the
BGP l2vpn evpn entries are empty. This can happen by recreating the
vxlan interface as follows:
ip netns exec cust1 ip li del vxlan1000
ip link add vxlan1000 type vxlan id 1000 dev loopback0 local 10.209.36.1 learning
ip link set dev vxlan1000 mtu 9000
ip link set dev vxlan1000 netns cust1
ip netns exec cust1 bash
ip link set dev vxlan1000 up
ip link set dev vxlan1000 master br1000
Actually, changing the learning attribute requires recreating the
interface, and this change then requires manually reloading the frr
configuration.
The update mechanism in zebra about vxlan interface updates is
already put in place, but it does not work well with namespace
based vrf backend. The function zl3vni_from_svi() is then
modified to parse all the interfaces of each namespace.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Description:
Change is intended for fixing the following issues related to vrf route leaking:
Routes with special nexthops i.e. blackhole/sink routes when imported,
are not programmed into the FIB and corresponding nexthop is set as 'inactive',
nexthop interface as 'unknown'.
While importing/leaking routes between VRFs, in the case of a special
nexthop (ipv4/ipv6), once bgp announces the route(s) to zebra, the
nexthop type is incorrectly set as
NEXTHOP_TYPE_IPV6_IFINDEX/NEXTHOP_TYPE_IFINDEX,
i.e. directly connected, even though we are not able to resolve through
an interface. This leads to nexthop_active_check marking the nexthop
!NEXTHOP_FLAG_ACTIVE.
Unable to find any active nexthop(s), the route is not programmed into
the FIB.
Whenever BGP leaks routes, set the correct nexthop type, so that route gets resolved
and correctly programmed into the FIB, in the imported vrf.
Co-authored-by: Kantesh Mundaragi <kmundaragi@vmware.com>
Signed-off-by: Iqra Siddiqui <imujeebsiddi@vmware.com>
Insist on the fact that zclient neighbor state flags are
mapped over netlink state flags. List all the defines
currently known on kernel, and create a netlink API to
convert netlink values to zclient values. The function is
simplified as it is a 1-1 match.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
As NHRP expects some notification of neighboring entries on GRE
interface, when a new interface notification is encountered, the
exact neighbor state flag is found. Previously, the flag passed
to the upper layer was forced to NDM_STATE which is REACHABLE,
as can be seen on below trace:
2021/08/25 10:58:39 NHRP: [QQ0NK-1H449] Netlink: new-neigh 102.1.1.1 dev gre1 lladdr 10.125.0.2 nud 0x2 cache used 1 type 5
When passing the real value, NHRP received another value, like STALE.
2021/08/25 11:28:44 NHRP: [QQ0NK-1H449] Netlink: new-neigh 102.1.1.1 dev gre1 lladdr 10.125.0.2 nud 0x4 cache used 0 type 5
This flag is important for NHRP, as it permits monitoring the link
layer of NHRP entries.
Fixes: d603c0774e ("nhrp, zebra, lib: enforce usage of zapi_neigh_ip structure")
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
"[no] netns NAME" commands are part of the lib, but they are actually
zebra-only:
- they are using vrf_netns_handler_create and its description clearly
says that it "should be called from zebra only"
- vtysh sends these commands only to zebra
- only zebra outputs the netns related config
- zebra notifies other daemons about netns attachment
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
There is a possibility that the same line can be matched as a command in
some node and its parent node. In this case, when reading the config,
this line is always executed as a command of the child node.
For example, with the following config:
```
router ospf
network 193.168.0.0/16 area 0
!
mpls ldp
discovery hello interval 111
!
```
Line `mpls ldp` is processed as command `mpls ldp-sync` inside the
`router ospf` node. This leads to a complete loss of `mpls ldp` node
configuration.
To eliminate this issue and all possible similar issues, let's print an
explicit "exit" at the end of every node config.
This commit also changes indentation for a couple of existing exit
commands so that all existing commands are on the same level as their
corresponding node-entering commands.
Fixes #9206.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
For some reason commit #ef524230a6baa decided
to remove enums and switch to uint16_t, which
is not the right thing to do. Put it back.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
zebra_mpls_transit_lsp() may be called with an empty nexthop, e.g.:
"no mpls lsp (16-1048575)".
So just remove this "gate_str" check. If "gate" is absent from the
command, "gtype" is set to NEXTHOP_TYPE_BLACKHOLE for subsequent
processing.
Signed-off-by: anlan_cs <anlan_cs@tom.com>
When NHRP registers with zebra to receive link layer events related to
gre interfaces, it is also interested in receiving RTM_GETNEIGH
messages.
Fixes ("b3b751046495") nhrpd: link layer registration to notifications
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
In zebra_evpn_proc_remote_nh if we do not pass in a long
enough stream, the stream reads will fail. Ensure that
we have enough data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Handle TYPE_IFINDEX nexthops more consistently in a few places;
be more specific about a few integer return values that were
being treated as booleans.
Signed-off-by: Mark Stapp <mjs.ietf@gmail.com>
When calling rib_add_multipath_nhe ensure that we have
well aligned return codes that mean something so that
interested parties can properly handle the situation.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When receiving a route via zapi, if the route is rejected
there exists a code path where we would not free the corresponding
re that was created.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The command `debug zebra kernel msgdump` is netlink specific.
There is no point at all in allowing this to be configured on
non-netlink platforms.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There were a bunch of places where we converted the
route node to a prefix string via srcdest_rnode2str when
we should have been using %pRN in zebra_rib.c. Just
convert the ones that should use it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When we are calling rib_process and the route_node
in question has no dest, there is no work to do here
at all. As such we should just return before
attempting to do any other work. This is just a tiny bit
of simplification being done.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
There exists a call path where the nhlfe_alloc can return NULL
for blackhole nexthops. In this case we were still trying
to save the nhlfe pointer causing a crash when we attempted
to add it to a self-contained list.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Do not use the `default` case when switching over an enumerated
type. This allows the code to fail to compile when we add a
new enumeration. Thus allowing us developers to know all
the places in the code we'll need to touch.
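A small standalone illustration of why this helps (made-up enum, not an FRR type): with no `default`, compilers that warn on unhandled enum values point at every switch that needs updating when a new value is added.
```
enum op_sketch { OP_ADD, OP_DELETE, OP_UPDATE };

/* No default case: if someone later adds OP_REPLACE to the enum,
 * -Wswitch style warnings flag this switch as incomplete, showing
 * the developer exactly where code needs to be touched. */
static const char *op_name(enum op_sketch op)
{
	switch (op) {
	case OP_ADD:
		return "add";
	case OP_DELETE:
		return "delete";
	case OP_UPDATE:
		return "update";
	}
	return "unknown";	/* unreachable for valid enum values */
}
```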
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
1. This check is absolutely useless. Nothing keeps user from deleting
the address right after this check.
2. This check prevents zebra from correctly reading the user config with
"set src" because of a race with interface startup (see #4249).
3. NO OPERATIONAL DATA USAGE ON VALIDATION STAGE.
Fixes #7319.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
v4 and v6 host/reference prefixes need to be set up separately for
[RMAC, VTEP] entries, as the VTEP is always normalized to a v4 addr.
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
The only difference in daemons' interface node definition is the config
write function. No need to define the node in every daemon, just pass
the callback as an argument to a library function and define the node
there.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
There exists some rare situations where fpm will attempt
to send a route update with no valid nexthops. In that
case an assert would be hit. This is not good for
trying to keep your routing daemons up and running
when we can safely just recover the situation.
Fixes#7588
Signed-off-by: batmancn <batmanustc@gmail.com>
<fixed commit message, and used zlog_err>
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently 'show evpn rmac vni .. mac .. json' includes fields for
localSequence and remoteSequence, which are misleading since they
aren't applicable to MACs in the IP-VRF MAC table (RMAC).
This removes the localSequence + remoteSequence fields from the output.
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
like the other automake variables, setting `xyz_LDFLAGS` causes
`AM_LDFLAGS` to be ignored for `xyz`. For some reason I had in my mind
that automake doesn't do this for LDFLAGS, but... it does. (Which is
consistent with `_CFLAGS` and co.)
So, all the libraries and modules have been ignoring `AM_LDFLAGS` (which
includes `SAN_FLAGS` too). Set up new `LIB_LDFLAGS` and
`MODULE_LDFLAGS` to handle all of this correctly (and move these bits to
a central location.)
Fixes: #9034
Fixes: 0c4285d77e ("build: properly split CFLAGS from AC_CFLAGS")
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
When an ip address on a bsd interface is considered
an alias, let's mark the connected prefix we generate as
SECONDARY.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When a port is removed from its last access vlan, the linux kernel
won't send any vlan info in the netlink message, which might cause
evpn mh to not withdraw EAD-EVI routes.
Signed-off-by: Gord Chen <gord_chen@edge-core.com>
Current code was allowing redistribution of kernel routes from
the non-default non vrf tables once FRR was already up and running.
In the case where we add `redistribute kernel` in an upper level
protocol we never consider the non-default vrf or non-vrf tables
so it is never accepted.
In the case where a kernel route is added after `redistribute kernel`
is already in place we were never looking at the fact that the
route was in a non-default non-vrf table. This code fixes
that issue.
Fixes: #9073
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Move remote VTEP updates from immediate, inline processing
in their ZAPI message handlers to the main workqueue.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Enqueue incoming vxlan remote macip updates on the main
workqueue, instead of performing the updates immediately,
in-line.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Add workqueue subqueue for EVPN/VxLAN updates; migrate the
evpn route and remote ES processing from their ZAPI handlers
to the workqueue.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
At some point we broke the nhe->ifp pointer such
that it was pointing to an interface even in group/recursive
instances.
Add checks here so that we only set the ifp
pointer if it is a fully resolved singleton NHE.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
In the reachability code we auto pass back the fully resolved
nexthops. Modify the ZEBRA_IPV4_NEXTHOP_LOOKUP_MRIB code
to do the exact same thing so that the zclient_lookup_nexthop
code does not need to recursively look for the data that
zebra already has.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Basically, this is handled by the JSON-C library. I've compiled with
the latest release of json-c and it works well.
Didn't test with various distribution versions, but this change is
kinda dependent on the json-c lib version the distro has.
Before:
```
"192.168.100.1\/32":[
{
"prefix":"192.168.100.1\/32",
```
After:
```
"192.168.100.1/32":[
{
"prefix":"192.168.100.1/32",
```
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
There are a few places in the code where we use PREFIX_COPY(_IPV4/IPV6)
macro to copy a prefix. Let's always use prefix_copy function for this.
This should fix CID 1482142 and 1504610.
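A simplified sketch of the difference (types here are illustrative, not the real lib/prefix.h definitions): a single copy function cannot pick the wrong per-family macro variant, and static analysis sees one well-defined copy.
```
#include <string.h>

/* Sketch only: simplified stand-in for the real prefix structure. */
struct prefix_sketch {
	unsigned char family;
	unsigned char prefixlen;
	unsigned char addr[16];	/* big enough for v4 or v6 */
};

static void prefix_copy_sketch(struct prefix_sketch *dst,
			       const struct prefix_sketch *src)
{
	memcpy(dst, src, sizeof(*dst));
}
```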
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
When sending nexthop information, we do not need to reset the
last_write_cmd since that is taken care of in the send routine.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Include the complete set of primary and backup nexthops from
the resolving route for a pseudowire. Add accessors for that
info. Modify the logic that creates the fib set of pw nexthops
so that only installed, labelled nexthops are included.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Modify the pseudowire reachability logic so that it returns
success if there is at least one installed labelled nexthop for
the route resolving the pw destination. We also check for
valid backup nexthops if necessary, in case there's been a
switchover event.
Only OpenBSD requires that _all_ nexthops be labelled, so we
have a more strict version of the logic also.
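Roughly, the check becomes the following (illustrative types; the real nexthop structures and flags differ):
```
/* Sketch: succeed if at least one nexthop resolving the PW destination
 * is installed and carries a label; the stricter OpenBSD variant would
 * require this of every nexthop. */
struct nh_sketch { int installed; int has_label; struct nh_sketch *next; };

static int pw_reachable_sketch(const struct nh_sketch *nhlist)
{
	const struct nh_sketch *nh;

	for (nh = nhlist; nh; nh = nh->next) {
		if (nh->installed && nh->has_label)
			return 1;
	}
	return 0;
}
```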
Signed-off-by: Mark Stapp <mjs@voltanet.io>
When processing bulk messages we need more space to handle more
mroutes. In this case we are doubling the stream size from
16k -> 32k, which should roughly double the number of mroutes
we can handle in one go.
Additionally, if we cannot parse the passed message into
the stream to pass up to pimd, then gracefully stop processing.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a show command so we can easily get info on
which interfaces are turned on per vrf and in
which list.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Rework RA handling for vrf-lite scenarios.
Before, we were using a single FD for polling
across multiple zvrf's. This would cause us to hit this
assert() in some bgp unnumbered and vrrp configs:
```
/*
* What happens if we have a thread already
* created for this event?
*/
if (thread_array[fd])
assert(!"Thread already scheduled for file descriptor");
```
We were scheduling a thread_read on the same FD for every zvrf.
With vrf-lite, RAs and ARPs are not vrf-bound, so we can just use one
rtadv instance to manage them for all VRFs. We will choose the default
VRF for this.
This patch removes the rtadv_sock altogether for zrouter and moves the
functionality this represented to the default VRF. All RAs will be
handled in the default VRF under vrf-lite configs with only one poll
thread started for it.
This patch also extends how we track subscribed interfaces (s or msec)
to use an actual sorted list by interface names rather than just a
counter. With multiple daemons turning interfaces on/off, these counters
can get very wrong during ifup/down events. Making them a sorted list
prevents this from happening by preventing duplicates.
With netns-vrf's nothing should change other than the interface list.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
FPM sends VNI to the data plane with the EVPN prefix. For pure type-5 EVPN
route, nexthop interface of EVPN prefix is L3VNI SVI. Thus, we encode L3VNI
corresponding to the nexthop vrf with rtmsg for this prefix.
For an EVPN type-5 route with a gateway IP overlay index, we support
asymmetric IRB. Thus, the nexthop interface is the L2VNI SVI. So, instead of fetching
vrf VNI, fetch VNI corresponding to the nexthop SVI and encode it in the rtmsg
for EVPN prefix.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
The SVI ifindex for the L2VNI is required in BGP to perform EVPN type-5
to type-2 recursive resolution using the gateway IP overlay index.
Program this svi_ifindex in struct zebra_vni_t as well as in struct bgpevpn.
Changes include:
1. Add svi_if field to struct zebra_evpn_t
2. Add svi_ifindex field to struct bgpevpn
3. When SVI (bridge or VLAN) is bound to a VxLAN interface, store it in the
zebra_evpn_t structure.
4. Add this SVI ifindex to ZEBRA_VNI_ADD
5. Store svi_ifindex in struct bgpevpn
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
When the VRF node is exited using "exit" or "quit", there's still a VRF
pointer stored in the vty context. If you try to configure some router
related command, it will be applied to the previous VRF instead of the
default VRF. For example:
```
(config)# vrf test
(config-vrf)# ip router-id 1.1.1.1
(config-vrf)# do show run
...
!
vrf test
ip router-id 1.1.1.1
exit-vrf
!
...
(config-vrf)# exit
(config)# ip router-id 2.2.2.2
(config)# do show run
...
!
vrf test
ip router-id 2.2.2.2
exit-vrf
!
...
```
`vrf-exit` works correctly, because it stores a pointer to the default
VRF into the vty context (but weirdly keeping the VRF_NODE instead of
changing it to CONFIG_NODE).
Instead of relying on the behavior of exit function, always use the
default VRF when in CONFIG_NODE.
Another problem is missing `VTY_CHECK_CONTEXT`. If someone deletes the
VRF in which node the user enters the command, then zebra applies the
command to the default VRF instead of throwing an error.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
https://github.com/FRRouting/frr/pull/5865#discussion_r597670225
As this comment says, ZEBRA_FLAG_XXX should not have been used
to communicate SRv6 route information. A simple nexthop flag would
have been sufficient for SRv6 information, and I fixed the whole
thing that way.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
An FRRouting operator can install seg6 routes via ZAPI,
but a linux kernel operator can also install seg6 routes
via netlink directly (i.e. iproute2).
This commit makes zebra parse non-frr seg6 route
configuration via netlink and audit zebra's RIB.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
With this patch, zclients can install seg6 routes when
they set the "nh_seg6_segs" property on struct nexthop
and set ZEBRA_FLAG_SEG6_ROUTE on the zapi_route's flags.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
This commit is a part of the #5853 work that adds new CLIs to
configure the SRv6 locator and its show commands.
The following CLIs are added in this commit:
vtysh -c 'conf te' \
-c 'segment-routing' \
-c ' srv6' \
-c ' locators' \
-c ' locator LOC1' \
-c ' prefix A::/64'
- "show segment-routing srv6 sid [json]"
- "show segment-routing srv6 locator [json]"
- "show segment-routing srv6 locator NAME detail [json]"
- "show runnning-config" (make it to print srv6 configuration)
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
This commit is a part of the #5853 work that adds new ZAPIs to
configure the SRv6 locator, which manages chunk prefixes of
SRv6 SID IPv6 addresses for each routing protocol daemon.
NEW-ZAPIs:
* ZEBRA_SRV6_LOCATOR_ADD
* ZEBRA_SRV6_LOCATOR_DELETE
* ZEBRA_SRV6_MANAGER_CONNECT
* ZEBRA_SRV6_MANAGER_GET_LOCATOR_CHUNK
* ZEBRA_SRV6_MANAGER_RELEASE_LOCATOR_CHUNK
A zclient can connect to zebra's srv6-manager with the
ZEBRA_SRV6_MANAGER_CONNECT api, like a label-manager.
Then the zclient uses ZEBRA_SRV6_MANAGER_GET_LOCATOR_CHUNK to
allocate a dedicated locator chunk for its routing protocol.
Zebra only handles the prefix reservation and distributes
the ownership of the locator chunks to zclients.
Then, the zclient installs the SRv6 function with the
ZEBRA_ROUTE_ADD api using the nh_seg6local_* fields.
This feature is already implemented by another PR(#7680).
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
This commit is a part of #5853 that adds new cmd-nodes for SRv6 configuration.
This commit just adds the cmd-nodes and node-moving CLIs; the actual SRv6
config commands aren't added. (They are added in a later commit of this branch.)
new cli nodes:
* SRv6
* SRv6-locators
* SRv6-locator
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
An FRRouting operator can install seg6local routes via ZAPI,
but a linux kernel operator can also install seg6local routes
via netlink directly (i.e. iproute2).
This commit makes zebra parse non-frr seg6local
route configuration via netlink and audit zebra's RIB.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
With this patch, zclients can install seg6local routes when
they set the nh_seg6local_{action,ctx} properties on struct nexthop
and set ZEBRA_FLAG_SEG6LOCAL_ROUTE on the zapi_route's flags.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
This includes community and large-community data.
```
exit1-debian-9# show ip route 172.16.16.1/32
Routing entry for 172.16.16.1/32
Known via "bgp", distance 20, metric 0, best
Last update 00:00:23 ago
* 192.168.0.2, via eth1, weight 1
AS-Path : 65030
Communities : 65001:1 65001:2 65001:3 65001:4 65001:5 65001:6
Large-Communities: 65001:123:1 65001:123:2
```
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
Track 'down' state of connected addresses with a new flag. We
may have multiple addresses on an interface that share a prefix;
in those cases, we need to determine when the first address
is valid, to install a connected route, and similarly detect
when the last address goes 'down', to remove the connected
route.
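A simplified sketch of the bookkeeping (not the real connected/interface structures): the connected route is installed when the first address sharing the prefix becomes usable, and removed when the last one goes down.
```
/* Sketch only: walk the addresses sharing one prefix and report whether
 * any of them is still up; install the connected route on a 0 -> 1
 * transition and remove it on a 1 -> 0 transition. */
struct caddr_sketch { int down; struct caddr_sketch *next; };

static int prefix_has_usable_addr(const struct caddr_sketch *list)
{
	const struct caddr_sketch *a;

	for (a = list; a; a = a->next) {
		if (!a->down)
			return 1;
	}
	return 0;
}
```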
Signed-off-by: Mark Stapp <mjs@voltanet.io>
if_netlink.c created its own nested parsing #define which
is identical to netlink_parse_rtattr_nested. Consolidate
on one instead of having this duality.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In order to parse the netlink message into the
`struct rtattr *tb[size]` it is assumed that the buffer is
memset to 0 before the parsing. As such if you attempt
to read a value that was not returned in the message
you will not crash when you test for it.
The code has places where we memset it and places where we don't.
This *will* lead to crashes when the kernel changes. In
our parsing routines, let's have them do the memset instead of having
to remember to do it before passing the buffer in to the parser.
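A minimal sketch of a parsing helper with the memset folded in (based on the usual rtattr walk; the details of zebra's own netlink_parse_rtattr may differ):
```
#include <string.h>
#include <linux/rtnetlink.h>

/* Zero the table first, then fill in whatever attributes the kernel
 * actually sent; absent attributes stay NULL instead of holding stale
 * pointers from the caller's stack. */
static void parse_rtattr_sketch(struct rtattr **tb, int max,
				struct rtattr *rta, int len)
{
	memset(tb, 0, sizeof(struct rtattr *) * (max + 1));
	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type <= max)
			tb[rta->rta_type] = rta;
	}
}
```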
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When clagd is stopped on secondary device,
all vxlan interfaces (vnis) are kept in protodown state.
FRR treats protodown vxlan interfaces (vnis) as interface down
and sends vni delete to bgpd.
In the event of clagd down, SVIs are flapping as underlying
bridge is going through churn.
When FRR receives an SVI up notification, do not trigger an event to
bgpd if the vnis are operationally down.
Ticket:#2600210 CM-22929
Reviewed By:CCR-11544
Testing Done:
Performed CLAG stop/start on the secondary device; all vxlan devices
remained in protodown. Along with this, validated that the vnis are
cleaned up and added back in bgpd.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Description:
Added a new show command ("show ip zebra route dump") to dump all routes
with detailed information including nexthops, flags, status, etc.
This helps with debugging and is added to support_bundle_command.conf.
Defined this command as a hidden command.
Signed-off-by: Rajesh Girada <rgirada@vmware.com>
When creating a large number of vrf's we are creating a fairly
large number of hash tables per vrf. Reduce memory usage on
startup as well as let us identify the table these things come
from.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We are creating 2 hash tables per vni in zebra. Once we start to
scale the number of vni's we start to see some serious memory
usage in zebra. Let's reduce the memory usage at startup
for scale of vni's.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Current code has an inconsistent behavior with redistribute routes.
Suppose you have a kernel route that is being read w/ a distance
of 255:
eva# show ip route kernel
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
K>* 0.0.0.0/0 [0/100] via 192.168.161.1, enp39s0, 00:06:39
K>* 4.4.4.4/32 [255/8192] via 192.168.161.1, enp39s0, 00:01:26
eva#
If you have redistribution already turned on for kernel routes
you will be notified of the 4.4.4.4/32 route. If you turn
on kernel route redistribution watching after the 4.4.4.4/32 route
has been read by zebra you will never learn of it.
There is no need to look for infinite distance in the redistribution
code. Either we are selected or not. In other words non kernel routes
with an 255 distance are never installed so the checks were pointless.
So let's just remove the distance checking and tell interested parties
about the 255 kernel route if it exists.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently FRR reads the kernel for interface state and FRR
creates a connected route per address on an interface. If
you are in a situation where you have multiple addresses
on an interface just create 1 connected route for them:
sharpd@eva:/tmp/topotests$ vtysh -c "show int dummy302"
Interface dummy302 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
vrf: default
index 3279 metric 0 mtu 1500 speed 0
flags: <UP,BROADCAST,RUNNING,NOARP>
Type: Ethernet
HWaddr: aa:4a:ed:95:9f:18
inet 10.4.1.1/24
inet 10.4.1.2/24 secondary
inet 10.4.1.3/24 secondary
inet 10.4.1.4/24 secondary
inet 10.4.1.5/24 secondary
inet6 fe80::a84a:edff:fe95:9f18/64
Interface Type Other
Interface Slave Type None
protodown: off
sharpd@eva:/tmp/topotests$ vtysh -c "show ip route connected"
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
C>* 10.4.1.0/24 is directly connected, dummy302, 00:10:03
C>* 192.168.161.0/24 is directly connected, enp39s0, 00:10:03
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Since _rnode_zlog was wrapping zlog(), these messages weren't getting an
unique ID assigned through the xref mechanism. Replace macro with a
small extension that prints (almost) the same thing.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Initially the reading of the speed of an interface happened
upon interface creation and happened until the speed of a link
settled down to a single value. The speed of an interface
can also change as that a new optic can be inserted that
changes the speed, in which case FRR would see an interface
down (optic removal) and then an interface up (optic insertion).
In this case FRR would not treat this as an event that changed
the speed. Let's expand the checking a bit more.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
- gre keys are collected and stored locally.
- when gre source set is requested, and the link interface
configured is different, the gre information collected is
pushed in the query, namely source ip or gre keys if present.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
preserve mtu upon interface flapping and tunnel source change.
Signed-off-by: Reuben Dowle <reuben.dowle@4rf.com>
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This action is initiated by nhrp and has been stubbed when
moving to zebra. Now, a netlink request is forged to set
the link interface of a gre interface if that gre interface
does not already have a link interface.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
zebra is able to get information about gre tunnels.
A zebra_gre file is created to handle hooks, but is not yet used.
Also, a debug zebra gre command is added to provide gre traces.
The zebra_gre file is used for complementary actions that may be needed.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When zebra has the vrf backend mapped to namespaces, the polling
of interfaces fixes up all interface linkages. This
was not done on non-default namespaces; do it for the other namespaces.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
There are cases where either link information is not present at
interface creation or the link information changed. Handle this
situation.
Signed-off-by: Philippe.Guibert <philippe.guibert@6wind.com>
zebra dd link
a) `debug zebra kernel` turns off `debug zebra kernel msgdump....`
this is odd and bad
b) `debug zebra kernel msgdump send` turns off receive and vice versa
this is counter intuitive as well
c) `no zebra kernel msgdump ...` turns off all kernel level debugging
we should only turn off msgdump specific debugs
d) `no debug zebra kernel` turns off all kernel level debugging
we should leave msgdump on.
e) Fix `show run` and show debug output
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Encoding a signed int as unsigned is bad practice; since we want to do
it here, let's at least be explicit about it.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Use unsigned value for all RA requests to Zebra
- encoding signed int as unsigned is bad practice
- RA interval is never, and should never be, negative
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
This is always a 16 bit unsigned value.
- signed int is the wrong type to use
- encoding a signed int as a uint32 is bad practice
- decoding a signed int encoded as a uint32 into a uint16 is bad
practice
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
We're firing an event debug log for zebra_redistribute_add, but not one
for zebra_redistribute_delete. Let's make it symmetric.
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
`config.h` has all the defines from autoconf, which may include things
that switch behavior of other included headers (e.g. _GNU_SOURCE
enabling prototypes for additional functions.)
So, the first include in any `.c` file must be either `config.h` (with
the appropriate guard) or `zebra.h` (which includes `config.h` first
thing.)
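In other words, every `.c` file begins with one of these two forms (standard autoconf guard; shown here only as an illustration):
```
/* Option 1: config.h behind its guard, before anything else. */
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

/* Option 2: zebra.h, which pulls in config.h first thing. */
#include <zebra.h>
```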
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Properly handle refcounting of Proto-owned NHGs when
zebra is operating under graceful restart and retain
conditions.
We have an extra refcnt of 1 we keep for proto-owned NHGs to
indicate the upper level proto has created and owns it.
When we are reading these in from the kernel, we need to set them
to 1 as appropriate. Without this, we fail in the assert() during
zebra_nhg_proto_add() after the owning daemon resends the NHG
and the refcnts are off by one.
Also add in the same logic we use for routes when sweeping with
respect to uptimes.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add uptime for use with NHEs to keep track of how
long we have had this NHE in our rib without an update.
This is treated exactly the same as the re->uptime for
routes. When we get an update for a route, we reset the
uptime.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Add a PROTO_OWNED macro for code readability when checking
ID bounds for whether a NHG is proto owned.
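A sketch of what such a macro can look like (the boundary constant below is an assumption, not zebra's actual value):
```
/* Sketch only: IDs at or above an assumed protocol-range boundary are
 * treated as owned by an upper-level protocol rather than by zebra. */
#define NHG_ID_PROTO_LOWER	(1u << 28)	/* assumed boundary */
#define PROTO_OWNED(id)		((id) >= NHG_ID_PROTO_LOWER)
```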
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Handle SR-TE policy changes in the LSP async notification
handler, as we do in the normal LSP dplane results handler.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
When capturing backup nexthops with recursive resolution,
ensure that inner labels from the recursive nexthop are
included in each backup (as they are with the resolving
primary nexthops).
Signed-off-by: Mark Stapp <mjs@voltanet.io>
`CFLAGS` is a "user variable", not intended to be controlled by
configure itself. Let's put all the "important" stuff in AC_CFLAGS and
only leave debug/optimization controls in CFLAGS.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
... by referencing all autogenerated headers relative to the root
directory. (90% of the changes here is `version.h`.)
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Use the main zebra workqueue for daemon-owned NHGs, in addition
to processing kernel-owned NHGs. The zapi message processing
creates a temporary object that's enqueued to the workqueue,
then processed/installed as part of the workqueue processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Do not add a new route type, and consider 0 as a value meaning
that zebra should be the owner.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
zapi_nbr structure is renamed to zapi_neigh_ip.
Initially used to set a neighbor ip entry for gre interfaces, this
structure is used to get events from the zebra layer to nhrp layer.
The ndm state has been added, as it is needed on both sides.
The zebra dplane layer is slightly modified.
Also, to clarify what ZEBRA_NEIGH_ADD/DEL means, a rename is done:
it is now called ZEBRA_NEIGH_IP_ADD/DEL, and it signifies that this
zapi interface permits setting link operations by associating ip
addresses to link addresses.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The first change in this commit is the processing of the VRF termination.
When we terminate the VRF, we should not delete the underlying interfaces,
because there may be pointers to them in the northbound configuration. We
should move them to the default VRF instead.
Because of the first change, the VRF interface itself is also not deleted
when deleting the VRF. It should be handled in netlink_link_change. This
is done by the second change.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Most of these are many, many years out of date. All of them vary
randomly in quality. They show up by default in packages where they
aren't really useful now that we use integrated config. Remove them.
The useful ones have been moved to the docs.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Instead of directly configuring the neighbor table after reading from
the zapi interface, a zebra dplane context is prepared to host the
interface and
the family where the neighbor table is updated. Also, some other fields
are hosted: app_probes, ucast_probes, and mcast_probes. More information
on those fields can be found on ip-ntable configuration.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
EVPN neighbor operations were already done in the zebra dataplane
framework. Now that NHRP is able to use zebra to perform neighbor IP
operations (by programming link IP operations), handle this operation
under dataplane framework:
- assign two new operations NEIGH_IP_INSTALL and NEIGH_IP_DELETE; this
is reserved for GRE like interfaces:
example: ip neigh add A.B.C.D lladdr E.F.G.H
- use 'struct ipaddr' to store and encode the link ip address
- reuse dplane_neigh_info, and create an union with mac address
- reuse the protocol type and use it for neighbor operations; this
permits storing the daemon originating this neighbor operation.
a new route type is created: ZEBRA_ROUTE_NEIGH.
- the netlink level functions will handle a pointer, and a type; the
type indicates the family of the pointer: AF_INET or AF_INET6 if the
link type is an ip address, mac address otherwise.
- to keep backward compatibility with old queries, as no extension was
done, an option NEIGH_NO_EXTENSION has been put in place
- also, 2 new state flags are used: NUD_PERMANENT and NUD_FAILED.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
A neighbor table api in zebra is added. A netlink api is created for
that. The handler is called from the api defined in the previous commit.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When netlink_neigh_update() is called, the link registration was
failing, due to bad request length.
Also, the query was failing if NDA_DST was an ipv6 address.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
A zebra api is extended to offer the ability to add or remove a
neighbor entry from a daemon. This extension also makes it possible to
add a neigh entry not only between IPs and macs, but also between IPs
and NBMA IPs.
This API supports configuring ipv6/ipv4 entries with ipv4/ipv6 lladdr.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
zebra implements a zebra API for configuring link-layer information. That
can be an ARP entry (for IPv4) or an IPv6 neighbor discovery entry. It can
also be an IPv4/IPv6 entry associated with an underlay IPv4 address, as
used on GRE point-to-multipoint interfaces.
This API will also be used for monitoring: a hash list (the VRF bitmap) is
instantiated in zebra, and each client interested in those entries in a
specific VRF will listen for the following messages: entries added,
entries removed, or who-has messages.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Optionally hide route changes that only involve backup nexthop
activation/deactivation. The goal is to avoid route churn during
backup nexthop switchover events, before the resolving routes
re-converge. A UI config enables this 'hiding' behavior.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Description:
After an FRR restart, routes are not redistributed when routes are added
first and the 'redistribute static' command is issued afterwards.
During the FRR restart, vrf_id is unknown,
so irrespective of redistribution, we set the redistribute vrf bitmap.
Later, when we add a route and then issue the 'redistribute' command,
we check the redistribute vrf bitmap and return CMD_WARNING;
zebra_redistribute_add also checks the redistribute vrf bitmap and returns.
Instead of checking the redistribute vrf bitmap, always set it.
Co-authored-by: Santosh P K <sapk@vmware.com>
Co-authored-by: Kantesh Mundaragi <kmundaragi@vmware.com>
Signed-off-by: Abhinay Ramesh <rabhinay@vmware.com>
When certain events occur (e.g. connected route changes)
zebra examines LSPs to see if they might have been affected. For
LSPs with backup nhlfes, skip this immediate processing and
wait for the owning protocol daemon to react.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
This commit introduces the implementation for the north-bound
callbacks for the zebra-specific route-map match and set clauses.
Signed-off-by: NaveenThanikachalam <nthanikachal@vmware.com>
Signed-off-by: Sarita Patra <saritap@vmware.com>
This is to fix the crash reproduced by the following steps:
* ip link add red type vrf table 1
Creates VRF.
* vtysh -c "conf" -c "vrf red"
Creates VRF NB node and marks VRF as configured.
* ip route 1.1.1.0/24 2.2.2.2 vrf red
* no ip route 1.1.1.0/24 2.2.2.2 vrf red
(or similar l3vni set/unset in zebra)
Marks VRF as NOT configured.
* ip link del red
VRF is deleted, because it is marked as not configured, but NB node
stays.
Subsequent attempt to configure something in the VRF leads to a crash
because of the stale pointer in NB layer.
Fixes #8357.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
EVPN nexthops are installed as remote neighs by zebra. This was earlier
done only via VRF IPvX uni routes imported from EVPN routes.
With EVPN-MH these VRF routes now reference an L3NHG which is set up based
on the EAD and doesn't include the RMAC. To work around that, BGP now
consolidates and maintains EVPN nexthops which are then sent to zebra.
zebra sets up these nexthops as L3-VNI nh entries using a dummy type-1
route as reference.
Ticket: CM-31398
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
This one also needed a bit of shuffling around, but MTYPE_RE is the only
one left used across file boundaries now.
Signed-off-by: David Lamparter <equinox@diac24.net>
Back when I put this together in 2015, ISO C11 was still reasonably new
and we couldn't require it just yet. Without ISO C11, there is no
"good" way (only bad hacks) to require a semicolon after a macro that
ends with a function definition. And if you added one anyway, you'd get
"spurious semicolon" warnings on some compilers...
With C11, `_Static_assert()` at the end of a macro will make it so that
the semicolon is properly required, consumed, and not warned about.
Consistently requiring semicolons after "file-level" macros matches
Linux kernel coding style and helps some editors against mis-syntax'ing
these macros.
Signed-off-by: David Lamparter <equinox@diac24.net>
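An illustrative, non-FRR example of the C11 trick described above: ending a file-level macro with `_Static_assert()` makes the trailing semicolon required and consumed, with no spurious-semicolon warning:
```
/* Hypothetical macro, for illustration only. */
#define DEFINE_ANSWER_FN(name, value)                                          \
	static int name(void)                                                  \
	{                                                                      \
		return (value);                                                \
	}                                                                      \
	_Static_assert(1, "please add a trailing semicolon")

DEFINE_ANSWER_FN(answer, 42);	/* the semicolon is now mandatory */
```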
The point of the `-std=gnu99` was to override a `-std=c99` that may be
coming in from net-snmp. However, we want C11, not C99.
Signed-off-by: David Lamparter <equinox@diac24.net>
Add a control and api for the use of backup nexthops in
recursive resolution. With 'no', we won't try to use installed
backup nexthops when resolving a recursive route.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Zebra routing tables are not controlled by the user and can not be
created/deleted manually. The current NB create/destroy callbacks are
incorrectly implemented because instead of creating/deleting the RIB
they only check for its existence. The YANG model should reflect
the real situation.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
There are places in the code where function nb_running_get_entry is used
with abort_if_not_found set to true during the config validation stage.
This is incorrect because when used in transactional CLI, the running
entry won't be set until the apply stage, and such usage leads to a crash.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
As was done for iptable contexts, a zebra dplane context is
created for each ipset/ipset-entry event. The zebra_dplane_ctx job is
then enqueued and processed by a separate thread. As with the
zebra_pbr_iptable context, the ipset and ipset-entry contexts are
encapsulated into a union of structures in zebra_dplane_ctx.
There is one specificity: when storing the ipset_entry structure, there
was a back-pointer to the ipset structure that is necessary
to get some complementary information before calling the hook. The
proposal is to use an ipset_entry_info structure next to the ipset_entry
in the zebra_dplane context. That information is used for ipset_entry
processing; the ipset name and the ipset type are the only fields
necessary.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
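A rough sketch of the encapsulation described above, with hypothetical struct and field names (the real zebra_pbr_ipset* definitions live in zebra/zebra_pbr.h and carry more fields):
```
#include <stdint.h>
#include "zebra/zebra_pbr.h"	/* struct zebra_pbr_ipset, zebra_pbr_ipset_entry */

/* Hypothetical companion info carried instead of the old backpointer. */
struct ipset_entry_info_sketch {
	char ipset_name[64];	/* owning ipset's name */
	uint32_t ipset_type;	/* owning ipset's type */
};

/* Hypothetical union inside the dplane context. */
struct dplane_ipset_ctx_sketch {
	union {
		struct zebra_pbr_ipset ipset;	/* ipset add/delete */
		struct {
			struct zebra_pbr_ipset_entry entry;
			struct ipset_entry_info_sketch info;
		} ipset_entry;			/* ipset-entry add/delete */
	} u;
};
```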
The iptable processing was not handled in the remote dataplane and was
directly processed by the thread in charge of zapi calls. Now that call
can be handled in the separate zebra_dplane thread: once a
zebra_dplane_ctx is allocated for iptable handling, the hook call is
performed later. Subsequently, a return code may be sent to the zclient
interface if any problem occurs when calling the hook.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This was caused by uninitialized netlink attrs in the bond-member
netlink parse API.
PS: It was caught by the upstream topotests on ARM8 (passed everywhere
else).
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
This is needed as the kernel currently doesn't allow a mac replace if the
dst changes from an L2NHG to a single VTEP and vice versa.
Ticket: CM-31561
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a ES-bond is in bypass state MACs learnt on it are linked to the
access port instead of the ES. When LACP converges on the bond it moves
out of bypass and the MACs previously learnt on it are flushed to force
a re-learn on new traffic.
Ticket: CM-31326
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When an ES-bond comes out of bypass FRR needs to flush the local MACs learnt
while the bond was in bypass. To do that efficiently local MACs are linked
to the dest-access port. This only happens if the access-port is in
LACP-bypass or if it is non-ES.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Feature overview:
=================
A 802.3ad bond can be setup to allow lacp-bypass. This is done to enable
servers to pxe boot without a LACP license i.e. allows the bond to go oper
up (with a single link) without LACP converging.
If an ES-bond is oper-up in an "LACP-bypass" state MH treats it as a non-ES
bond. This involves the following special handling -
1. If the bond is in a bypass-state the associated ES is placed in a
bypass state.
2. If an ES is in a bypass state -
a. DF election is disabled (i.e. assumed DF)
b. SPH filter is not installed.
3. MACs learnt via the host bond are advertised with a zero ESI.
When the ES moves out of "bypass" the MACs are moved from a zero-ESI to
the correct non-zero id. This is treated as a local station move.
Implementation:
===============
When (a) an ES is detached from a hostbond or (b) an ES-bond goes into
LACP bypass, zebra deletes all the local macs (with that ES as destination)
in the kernel and its local db. BGP re-sends any imported MAC-IP routes
that may exist with this ES destination as remote routes, i.e. zebra can
end up programming a MAC that was previously local as remote, pointing
to a VTEP-ECMP group.
When an ES is attached to a hostbond or an ES-bond goes
LACP-up (out of bypass), zebra again deletes all the local macs in the
kernel and its local db. At this point BGP resends any imported MAC-IP
routes that may exist with this ES destination as sync routes, i.e.
zebra can end up programming a MAC that was previously remote
as local, pointing to an access port.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
VNI configuration is done without the NB layer in the default VRF. This
leads to the following problems:
```
vtysh -c "conf" -c "vni 1"
vtysh -c "conf" -c "vrf default" -c "no vni"
```
Second command does nothing, because the NB node is not created by the
first command.
```
vtysh -c "conf" -c "vrf default" -c "vni 1"
vtysh -c "conf" -c "no vni 1"
```
Second command doesn't delete the NB node created by the first command.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
This is causing problems with VM moves, i.e. the transition from remote
neigh to local neigh. This transition involves changing the NUD state from
NUD_NOARP to NUD_STALE, and the weak override flag prevents changing
the state from connected (REACHABLE, NOARP, PERMANENT) to STALE.
PS: Weak-override was originally used to prevent race conditions where
FRR can end up making a REACHABLE neigh STALE. We may need to revisit
and address that case at a later point.
Ticket: CM-30273
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Start reorg of zebra nexthop-resolution so that we can use the
resolution logic for nexthop-groups as well as routes. Change
the signature of the core nexthop_active() api so that it does
not require a route-entry or route-node. Move some of the logic
around so that nexthop-specific logic is in nexthop_active(),
while route-oriented logic is in nexthop_active_check().
Signed-off-by: Mark Stapp <mjs@voltanet.io>
For MH the SVI MAC is advertised to prevent flooding of ARP replies.
But because of a bug the SVI MAC was being added to the zebra database
but not sent to bgpd for advertising.
Ticket: CM-33329
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
As a part of FRR shutdown, interfaces are force-flushed (in an arbitrary
order). Interfaces are already down at that point, i.e. resources like the
SVI MAC have already been released. Attempting to clean it up again
as a part of the force-flush was resulting in access of freed memory -
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
==26457== Thread 1:
==26457== Invalid read of size 8
==26457== at 0x1AE6B0: zebra_evpn_acc_bd_svi_set (zebra_evpn_mh.c:606)
==26457== by 0x1B1460: zebra_evpn_if_cleanup (zebra_evpn_mh.c:1040)
==26457== by 0x13CA69: if_zebra_delete_hook (interface.c:244)
==26457== by 0x48A0E34: hook_call_if_del (if.c:59)
==26457== by 0x48A0E34: if_delete_retain (if.c:290)
==26457== by 0x48A2F94: if_delete (if.c:313)
==26457== by 0x48A3169: if_terminate (if.c:1217)
==26457== by 0x48E0024: vrf_delete (vrf.c:254)
==26457== by 0x48E0024: vrf_delete (vrf.c:225)
==26457== by 0x48E02FE: vrf_terminate (vrf.c:551)
==26457== by 0x1442E1: sigint (main.c:203)
==26457== by 0x1442E1: sigint (main.c:141)
==26457== by 0x48CF862: quagga_sigevent_process (sigevent.c:103)
==26457== by 0x48DD324: thread_fetch (thread.c:1404)
==26457== by 0x48A926A: frr_run (libfrr.c:1122)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
(gdb) bt
(gdb) fr 5
1037 zebra/zebra_evpn_mh.c: No such file or directory.
(gdb) p zif->ifp->name
$2 = "vlan131", '\000' <repeats 12 times>
(gdb) p zif->link->info
$5 = (void *) 0x1
(gdb) p/x zif->ifp->flags
$7 = 0x1002
(gdb)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Ticket: CM-32435
Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
A zebra crash is seen while cleaning up an EVPN interface
during a shutdown event.
EVPN interface cleanup is called from the vrf_delete callback.
(gdb) frame 4
(is_up=false, br_zif=0x0, vlan_zif=0x557f31fb36f0) at zebra/zebra_evpn_mh.c:614
614 zebra/zebra_evpn_mh.c: No such file or directory.
(gdb) p tmp_br_zif
$1 = (struct zebra_if *) 0x0
(gdb) p vlan_zif->link
$2 = (struct interface *) 0x557f31fb2d40
(gdb) p vlan_zif->link->info
$3 = (void *) 0x0
(gdb) p zebra_if->ifp->name
No symbol "zebra_if" in current context.
(gdb) p vlan_zif->ifp->name
$4 = "peerlink-3.4094\000\000\000\000"
Ticket:CM-32435
Reviewed By:CCR-10957
Testing Done:
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Added support for advertising SVI MAC if EVPN-MH is enabled.
In the case of EVPN MH, ARP replies from an attached server can be sent to
the ES peer. To prevent flooding of the reply, the SVI MAC needs to be
advertised by default.
Note:
advertise-svi-ip could have been used as an alternate way to advertise
SVI MAC. However that config cannot be turned on if SVI IPs are
re-used (which is done to avoid wasting IP addresses in a subnet).
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
SVI IP is being advertised unconditionally i.e. even if disabled (and
that is the default config). This can be problematic when the SVI address
is re-used across racks.
Added the user config condition in all the relevant places where the
SVI advertisement is triggered.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When looking up the conversion from kernel protocol to
internal protocol family, make sure we use the correct
AF_INET (what the kernel uses) instead of AFI_IP (which
is what FRR uses).
Otherwise routes from OSPF will show up from the kernel as OSPF6 instead
of OSPF, which will cause mayhem.
Ticket: CM-33306
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
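A hedged illustration of the numeric mixup, assuming Linux AF_* values and FRR's AFI_* enum (the real conversion lives in zebra's netlink code; the function and enum names below are placeholders):
```
#include <sys/socket.h>		/* AF_INET = 2, AF_INET6 = 10 on Linux */

/* FRR-internal identifiers: AFI_IP = 1, AFI_IP6 = 2.  Note that AFI_IP6
 * happens to equal AF_INET numerically, which is how IPv4 OSPF routes can
 * be mistaken for OSPF6 when the wrong family constant is compared. */
enum { ROUTE_OSPF_SKETCH, ROUTE_OSPF6_SKETCH };	/* stand-ins for ZEBRA_ROUTE_* */

static int kernel_ospf_type_sketch(int kernel_family)
{
	/* Correct: compare against the kernel's AF_* values...           */
	if (kernel_family == AF_INET6)
		return ROUTE_OSPF6_SKETCH;
	/* ...comparing against AFI_IP6 (== 2 == AF_INET) here would have
	 * tagged every IPv4 OSPF route as OSPF6, as described above.     */
	return ROUTE_OSPF_SKETCH;
}
```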
Neither tabs nor newlines are acceptable in syslog messages. They also
break line-based parsing of file logs.
Signed-off-by: David Lamparter <equinox@diac24.net>
The old VXLAN function for local MAC deletion was still in
existence and being called from the VXLAN code, whilst the new
generic function was not being called at all. Resolve this so
the generic function matches the old function and is called
exclusively.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
Move the pbr hash creation to be after the update release
and dplane install. Now that rules are installed in a separate
dplane pthread, we can have scenarios where an interface is
flapping and we install/remove rules fast enough that we
could issue what we think is an update for an identical rule and
end up releasing the rule right after we created it and sent it to
the dplane. This solves the problem of receiving duplicate rules
during interface flapping.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Disallow resolution to nexthops that are marked duplicate.
When we are resolving to an ecmp group, it's possible this
group has duplicates.
I found this when I hit a bug where we can have groups resolving
to each other and cause the resolved->next->next pointer to increase
exponentially. With sufficiently large ecmp, zebra will grind to a halt.
Like so:
```
D> 4.4.4.14/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:02
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 4.4.4.1 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.2 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.3 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.4 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.5 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.6 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.7 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.8 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.9 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.10 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.11 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.12 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.13 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.15 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 4.4.4.16 (recursive), weight 1, 00:00:02
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
via 1.1.1.1, dummy1, weight 1, 00:00:02
D> 4.4.4.15/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:09
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:09
via 4.4.4.1 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.2 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.3 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.4 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.5 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.6 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.7 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.8 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.9 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.10 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.11 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.12 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.13 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.14 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 4.4.4.16 (recursive), weight 1, 00:00:09
via 1.1.1.1, dummy1 onlink, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
via 1.1.1.1, dummy1, weight 1, 00:00:09
D> 4.4.4.16/32 [150/0] via 1.1.1.1 (recursive), weight 1, 00:00:19
* via 1.1.1.1, dummy1 onlink, weight 1, 00:00:19
via 4.4.4.1 (recursive), weight 1, 00:00:19
via 1.1.1.1, dummy1, weight 1, 00:00:19
via 4.4.4.2 (recursive), weight 1, 00:00:19
...............
................
and on...
```
You can repro the above via:
```
kernel routes:
1.1.1.1 dev dummy1 scope link
4.4.4.0/24 via 1.1.1.1 dev dummy1
==============================
config:
nexthop-group doof
nexthop 1.1.1.1
nexthop 4.4.4.1
nexthop 4.4.4.10
nexthop 4.4.4.11
nexthop 4.4.4.12
nexthop 4.4.4.13
nexthop 4.4.4.14
nexthop 4.4.4.15
nexthop 4.4.4.16
nexthop 4.4.4.2
nexthop 4.4.4.3
nexthop 4.4.4.4
nexthop 4.4.4.5
nexthop 4.4.4.6
nexthop 4.4.4.7
nexthop 4.4.4.8
nexthop 4.4.4.9
!
===========================
Then use sharpd to install 4.4.4.16 -> 4.4.4.1 pointing to that nexthop
group in descending order.
```
With these changes, the growing ecmp above is prevented by disallowing
duplicates in the resolution decision. These nexthops are not
installed anyway, so why should we be resolving to them?
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Description: When we get a new vrf add and a vrf with the same name, but a
different vrf-id, already exists in the database, we should treat the vrf
add as an update.
This happens mostly when there is a lot of vrf and other configuration being
replayed. There may be a stale vrf delete followed by a new vrf add. This
can cause a timing race condition where the vrf delete could be missed and
the subsequently arriving add of the same vrf would get rejected, instead of
treating the last arrived vrf add as an update.
Treat a vrf add for an existing vrf as an update:
Implicitly disable this VRF to clean up routes and other functions as part of vrf disable.
Update vrf_id for the vrf and update the vrf_id tree.
Re-enable the VRF so that all routes are freshly installed.
The above 3 steps are mandatory, since with a config reload it can happen that
stale routes installed in the vrf-1 table contain routes from the older
vrf-0 table, which might have been deleted due to vrf-0 missing in the new
configuration.
Signed-off-by: sudhanshukumar22 <sudhanshu.kumar@broadcom.com>
valgrind is reporting:
2448137-==2448137== Thread 5 zebra_apic:
2448137-==2448137== Syscall param writev(vector[...]) points to uninitialised byte(s)
2448137:==2448137== at 0x4D6FDDD: __writev (writev.c:26)
2448137-==2448137== by 0x4D6FDDD: writev (writev.c:24)
2448137-==2448137== by 0x48A35F5: buffer_flush_available (buffer.c:431)
2448137-==2448137== by 0x48A3504: buffer_flush_all (buffer.c:237)
2448137-==2448137== by 0x495948: zserv_write (zserv.c:263)
2448137-==2448137== by 0x4904B7E: thread_call (thread.c:1681)
2448137-==2448137== by 0x48BD3E5: fpt_run (frr_pthread.c:308)
2448137-==2448137== by 0x4C61EA6: start_thread (pthread_create.c:477)
2448137-==2448137== by 0x4D78DEE: clone (clone.S:95)
2448137-==2448137== Address 0x720c3ce is 62 bytes inside a block of size 4,120 alloc'd
2448137:==2448137== at 0x483877F: malloc (vg_replace_malloc.c:307)
2448137-==2448137== by 0x48D2977: qmalloc (memory.c:110)
2448137-==2448137== by 0x48A30E3: buffer_add (buffer.c:135)
2448137-==2448137== by 0x48A30E3: buffer_put (buffer.c:161)
2448137-==2448137== by 0x49591B: zserv_write (zserv.c:256)
2448137-==2448137== by 0x4904B7E: thread_call (thread.c:1681)
2448137-==2448137== by 0x48BD3E5: fpt_run (frr_pthread.c:308)
2448137-==2448137== by 0x4C61EA6: start_thread (pthread_create.c:477)
2448137-==2448137== by 0x4D78DEE: clone (clone.S:95)
2448137-==2448137== Uninitialised value was created by a stack allocation
2448137:==2448137== at 0x43E490: zserv_encode_vrf (zapi_msg.c:103)
Effectively we are sending `struct vrf_data` without ensuring
data has been properly initialized.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Send the results of daemons' nhg updates asynchronously,
after the update has actually completed. Capture additional
info about the source daemon in order to locate the correct
zapi session. Simplify the result types considered by the
zebra_nhg module.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
The raw zapi apis to encode and decode NHGs don't need to be
public; also add a little more validity-checking.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
When calling fpm_nl_enqueue we should expect a fit-or-not
return value on the outgoing stream. It is not necessary
to check it here because the while loop where we are checking this
has already ensured that the data being written will fit.
CID -> 1499854
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Setting `zebra route-map delay-timer 0` completely turns off any
route-map processing in zebra, which is completely wrong. A timer
of 0 means `do it now`.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If we are running with a delayed timer to handle route-map changes
in zebra and another route-map change is made via the CLI, push
out the timer instead of leaving it unmodified. This will
allow a large set of route-maps to be read in by
the system without ending up in a state where new route-map
changes are being read in and the timer pops in
the middle of it.
Additionally, convert to use THREAD_OFF, preventing a possible
use-after-free as well as aligning the thread API usage
with what we consider correct.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
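A minimal sketch of the "push out the timer" pattern, assuming FRR's thread API of that era (the timer variable and callback names are illustrative):
```
#include "lib/thread.h"		/* struct thread, THREAD_OFF, thread_add_timer */

static struct thread *t_rmap_update;		/* hypothetical pending timer */
static int rmap_update_cb(struct thread *t);	/* does the actual rerun */

static void rmap_config_changed(struct thread_master *master, long delay)
{
	/* Cancel any pending pop first (also avoids a use-after-free on a
	 * stale pointer), then re-arm so the timer fires `delay` seconds
	 * after the *latest* route-map change. */
	THREAD_OFF(t_rmap_update);
	thread_add_timer(master, rmap_update_cb, NULL, delay, &t_rmap_update);
}
```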
Currently, when a route-map changes, the code schedules a rerun of all
routes in the particular table. So if you modify route-map `FOO` of
`ip protocol XX route-map FOO`, all routes will be rechecked. This is
extremely expensive.
Modify zebra to only update the routes associated with the route-map. So
if we have 800k bgp routes and 50 ospf routes and we are route-mapping
the ospf routes, we'll only look at 50 routes.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When we need to cause a reprocessing of data, the code currently
marks all routes as needing to be looked at. Modify the
rib_update_table code to allow us to specify a specific route
type we want to reprocess. At this point none
of the code behaves differently; this is just setup
for a future code change.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
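A rough sketch of what such a type-filtered table walk could look like, using zebra's usual route-table iteration helpers (the actual signature added by this change may differ):
```
#include "lib/table.h"		/* route_top/route_next */
#include "zebra/rib.h"		/* struct route_entry, RNODE_FOREACH_RE, rib_queue_add */

/* Hypothetical variant: only requeue entries of the given route type;
 * ZEBRA_ROUTE_ALL keeps the old "touch everything" behaviour. */
static void rib_update_table_by_type_sketch(struct route_table *table, int rtype)
{
	struct route_node *rn;
	struct route_entry *re;

	for (rn = route_top(table); rn; rn = route_next(rn)) {
		RNODE_FOREACH_RE (rn, re) {
			if (rtype != ZEBRA_ROUTE_ALL && re->type != rtype)
				continue;	/* skip unrelated protocols */
			rib_queue_add(rn);	/* schedule this node for rework */
			break;
		}
	}
}
```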
Use nl_pid from the netlink socket used for programming the kernel
(netlink_dplane) in netlink route messages sent by the 'fpm' module.
This makes 'fpm' consistent with 'dplane_fpm_nl' which already
behaves this way, and allows FPM server implementations to determine
route origin via nlmsg_pid.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
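A minimal sketch of the idea, assuming a hypothetical helper (the real change plumbs the dplane socket's pid through the fpm module's netlink encoding):
```
#include <stdint.h>
#include <linux/netlink.h>	/* struct nlmsghdr */

/* Hypothetical helper: stamp an outgoing FPM netlink message with the pid
 * of zebra's kernel-programming (dplane) socket so an FPM server can
 * attribute the route's origin via nlmsg_pid. */
static void fpm_stamp_origin_sketch(struct nlmsghdr *nlh, uint32_t dplane_nl_pid)
{
	nlh->nlmsg_pid = dplane_nl_pid;
}
```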
Create a function that can dump the mac->flags in human readable
output and convert all debugs to use it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
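A hedged sketch of such a helper; the flag bit names and values below are placeholders, not zebra's real MAC flags:
```
#include <stdint.h>
#include <stdio.h>

/* Placeholder flag bits, for illustration only. */
#define MACF_LOCAL_SKETCH	0x01
#define MACF_REMOTE_SKETCH	0x02
#define MACF_STICKY_SKETCH	0x04

/* Render the flags word as text once, then reuse it in every debug. */
static const char *mac_flags_str_sketch(uint32_t flags, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s%s",
		 (flags & MACF_LOCAL_SKETCH) ? "LOC " : "",
		 (flags & MACF_REMOTE_SKETCH) ? "REM " : "",
		 (flags & MACF_STICKY_SKETCH) ? "STICKY " : "");
	return buf;
}
```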
The re->flags and re->status in debugs were being dumped as hex values.
I can never quickly decode this. Here is an idea. Let's let FRR do
it for me.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In the case where a route's nexthops cannot be resolved as part
of route processing, immediately notify the upper-level protocol
that their routes failed to install, if they are interested in
being informed about this issue.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The zebra route-map delay timer value is a global value,
not a per-vrf setting. As such we should only print it
out one time.
We are seeing this:
zebra route-map delay-timer 33
exit-vrf
zebra route-map delay-timer 33
When we have 2 VRFs configured.
Fix the code to only write it out for the default VRF.
Ticket: CM-32888
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When checking if there is a "hole" behind the current reservation
marker, the calculation of whether the hole is big enough to satisfy
the requested chunk is off by 1. This could result in returning a label
which has already been allocated.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
If the requested chunk size was less than 16 then a chunk
within the reserved block would be returned. Make sure that
we never return labels that are below MPLS_LABEL_UNRESERVED_MIN.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
When dplane_fpm_nl is used the "Please add this protocol(n) to proper
rt_netlink.c handling" debug message is emitted for any route of type
kernel or connected.
This severely reduces performance of dplane_fpm_nl when large numbers
of these routes are present in the RIB.
The messages are not observed when using the original fpm module since
this uses a custom function, netlink_proto_from_route_type().
zebra2proto() now returns RTPROT_KERNEL for ZEBRA_ROUTE_CONNECT and
ZEBRA_ROUTE_KERNEL. This should only impact dplane_fpm_nl's use of
the common netlink routines, since these routes are generally ignored
via the RSYSTEM_ROUTE() check.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
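A hedged sketch of the mapping change; only the relevant cases are shown and the real zebra2proto() in rt_netlink.c covers every route type:
```
#include <linux/rtnetlink.h>	/* RTPROT_KERNEL, RTPROT_ZEBRA */
#include "lib/route_types.h"	/* ZEBRA_ROUTE_CONNECT, ZEBRA_ROUTE_KERNEL */

static int zebra2proto_sketch(int proto)
{
	switch (proto) {
	case ZEBRA_ROUTE_CONNECT:
	case ZEBRA_ROUTE_KERNEL:
		return RTPROT_KERNEL;	/* instead of the noisy default case */
	default:
		return RTPROT_ZEBRA;
	}
}
```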
fpm_nl_process() now ensures that the dataplane thread is rescheduled
if it hits the work limit while processing its incoming work queue.
This would probably already occur due to some other event, such as
fpm_process_queue() enqueuing completed work to the output queue,
however it does no harm to add this explicit reschedule.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
If the dataplane thread hits the work limit while processing the
output queue for any given provider, we now explicitly reschedule
the thread.
Otherwise, if the number of items in the output queue is greater than
the work limit, draining of that output queue is dependent on new
dataplane work.
Routes which are not drained from the output queue are stuck with
the 'q' flag, so this is a similar issue to that observed in
164d8e8608.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
zebra maintains a pseudo interface to hang user config off of after
the interface is deleted in the kernel. If a user tried to configure
an ES against such an interface, zebra would crash with the following
call stack -
at zebra/zebra_evpn_mh.c:2095
sysmac=sysmac@entry=0x55cfbadd3160) at zebra/zebra_evpn_mh.c:2258
at zebra/zebra_evpn_mh.c:3222
argv=<optimized out>, es_lid_str=<optimized out>, es_lid=1, no=0x0, vty=0x55cfbaf4c7b0)
at zebra/zebra_evpn_mh.c:3222
argv=<optimized out>) at ./zebra/zebra_evpn_mh_clippy.c:202
vty=vty@entry=0x55cfbaf4c7b0, cmd=cmd@entry=0x0, filter=FILTER_RELAXED)
at lib/command.c:1073
Ticket: CM-31702
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a local-MAC or local-neigh is not active locally it is not sent to BGP.
At this point, if BGP receives a remote route it accepts it and installs it
in zebra. Zebra was rejecting BGP's update if it had a higher-seq local
(inactive) entry. This would result in bgp and zebra falling out of sync.
In some cases zebra would delete the local-inactive entries after some time
(as a part of the dplane/kernel garbage collection). This would leave zebra
with missing remote entries (which were still present in bgpd).
This change allows lower-seq BGP updates to overwrite zebra's local entry if
that entry happens to be local-inactive.
Note: This logic was already in use for sync-mac-ip updates. Extended the
same logic to remote-mac-ip updates.
Ticket: CM-31626
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a VNI was deleted as a part of FRR/zebra shutdown, the zevpn entry
was being freed without removing its reference in the access-vlan
entry (i.e. without clearing the VLAN->VNI mapping) used by MH.
Ticket: CM-31197
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a netlink/dp notification is rxed for a neigh without the peer-sync
flag FRR re-installs the entry with the right flags. This change is
needed to handle cases where the dataplane and FRR may fall out of
sync because of neigh learning on the network ports (i.e. via
the VxLAN).
Ticket: CM-30693
The problem was found during VM mobility "torture" tests where 100s
of extended VM moves were done.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a remote MAC update is received from BGP with a lower sequence number
than the local one, zebra ignores the MAC update. This typically happens if
there is a race condition (where updates are in flight from zebra to BGP).
There was a bug in zebra because of which the dest ES was being updated
before this check. This left the local MAC pointing to a remote ES.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Relevant Dumps:
===============
root@leaf21:mgmt:~# net show evpn mac vni 101101 mac 00:93:00:00:00:01
MAC: 00:93:00:00:00:01
ESI: 03:00:00:00:77:01:03:00:00:0d
Intf: - VLAN: 101
Sync-info: neigh#: 1 peer-proxy
Local Seq: 3 Remote Seq: 0
Neighbors:
21.1.13.1 Active
root@leaf21:mgmt:~# net sho evpn es
Type: L local, R remote, N non-DF
ESI Type ES-IF VTEPs
03:00:00:00:77:01:02:00:00:0c R - 6.0.0.10,6.0.0.11
03:00:00:00:77:01:03:00:00:0d R - 6.0.0.10,6.0.0.11,6.0.0.12
03:00:00:00:77:01:04:00:00:0e R - 6.0.0.10,6.0.0.11,6.0.0.12,6.0.0.13
03:00:00:00:77:02:02:00:00:16 LR bondP2-H2 6.0.0.15
03:00:00:00:77:02:03:00:00:17 LR bondP2-H3 6.0.0.15,6.0.0.16
03:00:00:00:77:02:04:00:00:18 LR bondP2-H4 6.0.0.15,6.0.0.16,6.0.0.17
root@leaf21:mgmt:~#
Relevant logs:
===============
2020/07/29 15:41:27.110846 ZEBRA: Recv MACIP ADD VNI 101101 MAC 00:93:00:00:00:01 IP 21.1.13.1 flags 0x0 seq 2 VTEP 0.0.0.0 ESI 03:00:00:00:77:01:03:00:00:0d from bgp
2020/07/29 15:41:27.110867 ZEBRA: Ignore remote MACIP ADD VNI 101101 MAC 00:93:00:00:00:01 IP 21.1.13.1 as existing MAC has higher seq 3 flags 0x401
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Ticket: CM-30273
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
With EVPN-MH, Type-2 routes are also used for MAC-IP syncing between
ES peers so a change was done to only treat REACHABLE local neigh
entries as local-active and advertise them as Type-2 routes i.e. STALE
neigh entries are no longer advertised as Type-2s.
This however exposed some unexpected problems with MLAG where a
secondary reboot followed by a primary reboot left a lot of neighs
in STALE state (on the primary) resulting in them not being
advertised. And remote routed traffic to those hosts being
blackholed in a sym-IRB setup.
This commit is a workaround to fix the regression (it doesn't fix
the underlying problems with entries not becoming REACHABLE, which
may be a day-1 problem). The workaround is to continue advertising
STALE neighbors if EVPN-MH is not enabled.
Ticket: CM-30303
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
zebra was crashing when the command was run on a non-existent VNI.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215
VNI 16777215 doesn't exist
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 detail
VNI 16777215 doesn't exist
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 json
[
]
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 detail json
[
]
root@torm-12:mgmt:~#
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Ticket: CM-30232
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
In rib_handle_nhg_replace, do not use `new` as a parameter name, to
allow compilation of C++ code that includes zebra headers.
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
The way a couple of clauses were placed in a loop meant that
some info might not be collected - re-order things just a bit.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Derive the rule family from src if available, otherwise from
dst if available, otherwise assume IPv4. We only support
IPv4/IPv6 currently, so if we can't tell from the src/dst
it must be IPv4 and likely a dsfield match.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Maintain the count of contexts which have been processed in a local
variable, and perform a single atomic update after we have consumed
all queued contexts.
Generally this results in at least one less atomic operation per
context.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
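A small generic sketch of the pattern, not the actual fpm code:
```
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t ctx_processed;	/* shared counter read elsewhere */

static void process_batch_sketch(int queued)
{
	uint64_t done = 0;

	while (queued-- > 0) {
		/* ... handle one dplane context ... */
		done++;			/* cheap thread-local increment */
	}
	/* One atomic publish for the whole batch instead of one per context. */
	atomic_fetch_add_explicit(&ctx_processed, done, memory_order_relaxed);
}
```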
Don't use an atomic operation to determine whether fpm_process_queue()
needs to be re-scheduled. Instead we can simply use a local variable
to determine if we stopped processing because we ran out of buffers.
In the case where we would have re-scheduled due to new context objects
in the queue (enqueued after we stopped processing), fpm_nl_process()
will schedule us (or will have done already).
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Maintain the peak ctxqueue length in a local variable, and perform
a single atomic update after processing all contexts.
Generally this results in at least one less atomic operation per
context.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Reduce code in the critical sections of fpm_nl_process() and
fpm_process_queue() to the bare minimum - basically only enqueue
and dequeue operations on the shared ctxqueue.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
We don't need to use the 'force' flag when processing the
resolve-via-default clis for ip and ipv6: we can just do normal
nht processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
After removal of L3VNI config, the VNI should become an L2VNI if a VxLAN
interface is present for the VNI. This case is not handled in the code.
Changes:
1. After unconfiguring L3VNI, create an L2VNI if VxLAN interface is present
for the VNI.
2. Trigger an update to BGP.
3. Read MAC and ARP entries from kernel.
This PR fixes the issue only for route type-2, 3 and 5. This PR does not address
states regarding route type-1, 4 and multicast group for VxLAN interface.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
When a new ES is created it is held in a non-DF state for 3 seconds,
as specified by RFC 7432. This gives the switch time to import
the Type-4 routes from the peers, and the peers time to receive the new
Type-4 route.
root@torm-11:mgmt:~# vtysh -c "show evpn es 03:44:38:39:ff:ff:01:00:00:01"|grep DF
DF status: non-df
DF delay: 00:00:01
DF preference: 50000
root@torm-11:mgmt:~# vtysh -c "show evpn es 03:44:38:39:ff:ff:01:00:00:01"|grep DF
DF status: df
DF preference: 50000
root@torm-11:mgmt:~#
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When all the uplinks go down the VTEP is disconnected from the
VxLAN overlay and this was handled by proto-downing the ES bonds. When
the uplinks come up again we need to re-enable the ES bonds but that
needs to be done after a delay to allow the EVPN network to converge.
And that is done by firing off the startup-delay timer on first
uplink-up.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
1. When a bond is associated with an ES we may need to re-sync
the dplane protodown state (which may be stale or set by some other
app).
2. Also change the uplink state display to avoid confusion with
protodown reason code (both used to show uplink-up).
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
protodown state is a combination of the dplane and zebra states.
protodown reason is maintained exclusively by zebra. Display this
information on two separate lines to make that ownership clearer.
Also display n/a for bonds as the dplane doesn't support protodowning
the bond device.
Sample output -
==============
root@torm-11:mgmt:~# vtysh -c "show interface hostbond1"|grep -i protodown
protodown: off (n/a)
protodown reasons: (uplinks-down)
root@torm-11:mgmt:~# vtysh -c "show interface swp5"|grep -i protodown
protodown: on
protodown reasons: (uplinks-down)
root@torm-11:mgmt:~#
PS: Cosmetic changes only, no functional change.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The code for this was already there but was not kicking in because of a
zebra local reason-code dup check. Even if the reason-code is the same,
if the dplane and zebra disagree about the protodown state zebra will
need to re-program the dplane.
Fixed a couple of spelling errors in the protodown logs to make greps
easy.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Use the new nested NDA_FDB_EXT_ATTRS attribute to control per-fdb
notifications.
PS: The attributes were updated as a part of the kernel upstreaming,
hence the change.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
New work enqueued to the dplane_fpm_nl provider is initially de-queued
and re-enqueued, in fpm_nl_process(), to be processed by the provider's
own thread.
After performing this initial de-queue/enqueue we return to
dplane_thread_loop() and check the dplane_fpm_nl output queue for any
work which has been completed.
Since this work is being processed in another thread it is very likely
that there will be some (or all) work still outstanding at this point.
The dataplane thread finishes up any other tasks and then waits until
it is next scheduled. In the meantime the dplane_fpm_nl thread is
processing its work queue until completion.
The issue arises here as the dataplane thread is not explicitly
re-scheduled once dplane_fpm_nl has drained its work queue and
populated its output queue with completed work.
This completed work can sit in the output queue for an indeterminate
period of time, depending upon when the dataplane thread is next
scheduled for other work. If the RIB has reached a stable state then
this could be a significant period of time. During this period zebra
marks these routes as queued, even though they have actually been
processed by all dataplane providers.
An un-related RIB change which triggers a FIB update will result in
the dataplane thread being scheduled and this completed work then
being processed. At this point the routes will then no longer be
marked as queued by zebra. However this new FIB update might itself
then fall victim to the same scenario!
We can observe the above behaviour in these detailed dplane logs.
11:24:47 zebra[7282]: dplane: incoming new work counter: 2
11:24:47 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:24:47 zebra[7282]: dplane provider 'Kernel': processing
11:24:47 zebra[7282]: Dplane NEIGH_DISCOVER, ip 192.168.2.2, ifindex 9
11:24:47 zebra[7282]: Dplane NEIGH_DISCOVER, ip 192.168.2.2, ifindex 9
11:24:47 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:24:47 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:24:47 zebra[7282]: dplane dequeues 1 completed work from provider dplane_fpm_nl
11:24:47 zebra[7282]: dplane has 1 completed, 0 errors, for zebra main
2 contexts (all incoming work) have been queued to dplane_fpm_nl - all good.
1 completed context was de-queued, so there is outstanding work.
11:24:58 zebra[7282]: dplane: incoming new work counter: 2
11:24:58 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:24:58 zebra[7282]: dplane provider 'Kernel': processing
11:24:58 zebra[7282]: ID (193) Dplane nexthop update ctx 0x55c429b6fed0 op NH_INSTALL
11:24:58 zebra[7282]: 0:5.5.5.5/32 Dplane route update ctx 0x55c429b79690 op ROUTE_INSTALL
11:24:58 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:24:58 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:24:58 zebra[7282]: dplane dequeues 2 completed work from provider dplane_fpm_nl
11:24:58 zebra[7282]: dplane has 2 completed, 0 errors, for zebra main
A further 2 contexts (all incoming work) have been queued to dplane_fpm_nl - all good.
2 completed contexts were de-queued, which sounds good as that is what we en-queued.
However, there is an outstanding context from earlier, so there is still outstanding
work.
Indeed the new 5.5.5.5/32 route is marked as queued:
O>q 5.5.5.5/32 [110/10] via 192.168.2.2, dp0p1s3, weight 1, 00:01:19
This remains the case until we trigger a FIB update by installation of the
(eg.) 10.10.10.10/32 route:
11:26:41 zebra[7282]: dplane: incoming new work counter: 2
11:26:41 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:26:41 zebra[7282]: dplane provider 'Kernel': processing
11:26:41 zebra[7282]: ID (195) Dplane nexthop update ctx 0x55c429b78ce0 op NH_INSTALL
11:26:41 zebra[7282]: 0:10.10.10.10/32 Dplane route update ctx 0x55c429b7a040 op ROUTE_INSTALL
11:26:41 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:26:41 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:26:41 zebra[7282]: dplane dequeues 2 completed work from provider dplane_fpm_nl
11:26:41 zebra[7282]: dplane has 2 completed, 0 errors, for zebra main
11:26:41 zebra[7282]: zebra2proto: Please add this protocol(2) to proper rt_netlink.c handling
11:26:41 zebra[7282]: Nexthop dplane ctx 0x55c429b6fed0, op NH_INSTALL, nexthop ID (193), result SUCCESS
11:26:41 zebra[7282]: default(0:254):5.5.5.5/32 Processing dplane result ctx 0x55c429b79690, op ROUTE_INSTALL result SUCCESS
We observe the same 2 enqueues and 2 dequeues as before, which again suggests
that there is outstanding work.
As expected, the 5.5.5.5/32 route is no longer marked as queued:
O>* 5.5.5.5/32 [110/10] via 192.168.2.2, dp0p1s3, weight 1, 00:02:06
But the 10.10.10.10/32 route is, as we have not yet processed the completed
context:
C>q 10.10.10.10/32 is directly connected, lo, 00:26:05
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Returns the current number of (completed) contexts in the provider's
output queue (dp_ctx_out_q), allowing access to this data from the
provider itself.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
The following functions, which are part of the label-manager
implementation, aren't called from outside of their file. And all of the
label-manager code lives in zebra/label_manager.c at this time. So these
functions should be unexposed.
Functions:
- create_label_chunk
- assign_label_chunk
- delete_label_chunk
- release_label_chunk
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
In the case where the namespace pointer is already available, feed it at
vrf creation. This prevents a crash if the netlink parsing has already
begun and vrf-lite is not enabled yet.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The following functions use writen to dispatch a message
into the socket, while other functions use zserv_send_message.
This commit does a tiny unification of zapi's socket messaging.
Funcs:
- zsend_assign_label_chunk_response()
- zsend_label_manager_connect_response()
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
The `show ip nht` and `show ipv6 nht` commands were broken.
This is because the recent commit 0154d8ce45
assumed that p must not be NULL, and this is not the case.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a bit of code to allow bgp to send the AS-Path associated with
the route being installed to zebra so it can be displayed and
used as part of the `show ip route A` command in zebra.
eva# show ip route 20.0.0.0/11
Routing entry for 20.0.0.0/11
Known via "bgp", distance 20, metric 0, best
Last update 00:00:00 ago
* 192.168.161.1, via enp39s0, weight 1
AS-Path: 60000 64539 15096 6939 8075
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Just gather the opaque data into the route entry. Later
commits will display this data for end users as well as
send it down.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add the current queue depths for each plugin to the
'show dplane providers' output. Maintain the out-bound queue
max counter properly, that was being ignored.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Zebra accumulates route-entry objects and then processes them
as a group. If that rib processing is delayed, because the
dataplane/fib programming has built up a queue e.g., zebra can
hold multiple deleted route objects in memory. At scale, this can
be a problem. Delete unneeded route entries promptly, if they
can't contribute to rib processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Don't attempt to walk data structures while not connected so we can
save some CPU usage when FPM server is offline.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Instead of checking for next group reset, always do it and skip sending
if next hop group support is disabled.
Also remove unused `*_complete` variables.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
A MAC entry cannot be deleted while a neigh is referencing it. It seems
there is some race condition where this may be happening. The log is
to help identify those cases.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
It is now 4 bits of type and 28 bits of value -
1. type=0 is for L3 NHG
2. type=1 is for L2 NH
3. type=2 is for L2 NHG
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
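A hedged illustration of that split; the macro names are made up, but the arithmetic matches the fdb nexthop IDs shown in the next commit (e.g. 268435461 = 0x10000005 decodes as type 1, value 5):
```
#include <stdint.h>

/* Hypothetical type values for the top 4 bits. */
#define NHG_ID_TYPE_L3		0u	/* L3 NHG */
#define NHG_ID_TYPE_L2_NH	1u	/* L2 NH  */
#define NHG_ID_TYPE_L2_NHG	2u	/* L2 NHG */

#define NHG_ID_MAKE(type, val)	(((uint32_t)(type) << 28) | ((val) & 0x0fffffffu))
#define NHG_ID_TYPE(id)		((uint32_t)(id) >> 28)
#define NHG_ID_VAL(id)		((uint32_t)(id) & 0x0fffffffu)
```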
This is an optimization to reduce the number of L2 nexthops. An
L2 or fdb nexthop simply provides the dataplane with a nexthop IP -
torm-12:mgmt:~# ip nexthop
id 268435461 via 27.0.0.20 scope link fdb
id 268435463 via 27.0.0.20 scope link fdb
id 268435465 via 27.0.0.20 scope link fdb
So there is no need to allocate a nexthop per-ES/per-VTEP. There
can be 100+ ESs per-VTEP so this change cuts the scale down by a
factor of 100.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a local ES flaps there are two modes in which the local
MACs are failed over -
1. Fast failover - A backup NHG (ES-peer group) is programmed in the
dataplane per-access port. When a local ES flaps the MAC entries
are left unaltered i.e. pointing to the down access port. And the
dataplane redirects traffic destined to the oper-down access port
via the backup NHG.
2. Slow failover - This mode needs to be turned on for dataplanes
not capable of redirecting traffic. In this mode local MAC entries
on a down local ES are re-programmed to point to the ES-peers'
NHG. And vice-versa i.e. when the ES comes up the MAC entries
are re-programmed with the access port as dest.
Fast failover is on by default. Slow failover can be enabled via the
following config -
evpn mh redirect-off
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
As a part of extended MM handling, a MAC can be updated from local
to remote while being referenced by SYNC neighs (this is really a
temporary/small window). During this window, if the MAC transitions
back to local again we need to reinforce the previous SYNC flags
(based on the sync-neigh count), as subsequent SYNC updates to the
MAC will be de-duped and ignored.
Ticket: CM-29636
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a local mac is deleted by the dataplane zebra can re-install it
if the MAC is a SYNC MAC (learned from ES peers). The "local_inactive"
bit must be set as a part of the re-install to prevent zebra turning
around and advertising the MAC as locally active.
Also fixed up some debug logs in the slow-fail path to include the VNI.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
NHG and DST (VTEP-IP) are mutually exclusive attributes. If DST is
present the kernel ignores NHG.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
NHG is activated i.e. programmed in the dataplane only if there
are active-VTEPs associated with it. When a NHG is de-activated
all the remote-mac entries associated with it need to be removed
before the NHG is removed.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The lookup for non-default VRFs was always using a tableId; if not
provided, we were defaulting to RT_TABLE_MAIN. This is fine for the
default VRF but not for others. As a result, the command was silently
failing for non-default VRFs unless we also specified the correct tableId.
Fix this by only performing the lookup using the tableId if it is
provided; else use zebra_vrf_table.
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
A couple of NHG messages we were logging as errors are a bit spammy
in use cases where you routinely add/remove interfaces (VM-heavy
deployments). It's not really an error a user cares about and is
more for a developer to know what went wrong after the fact, so
it makes more sense for these to be under a debug rather than
an error, since seeing them does not implicitly mean an error during
those use cases.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
During times of network trauma and when we are at large network scale
the process_remote_macip_add function can issue a zlog_warn for
a common occurrence. Modify the code to be a debug statement.
This behavior is now the same as the process_remote_macip_del function.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add an api that allows a caller in the zebra main pthread to
process the queue of pending dplane updates. The caller supplies
a function to call to test each pending context. Selected
contexts are dequeued, and freed without being processed.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
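A rough sketch of what the shape of such an API could be; the names and signature here are assumptions, not the actual zebra_dplane interface:
```
#include <stdbool.h>

struct zebra_dplane_ctx;	/* opaque dplane context */

/* Caller-supplied predicate: return true if this pending update should be
 * dequeued and freed without being sent to the dataplane. */
typedef bool (*dplane_ctx_drop_test_t)(const struct zebra_dplane_ctx *ctx,
				       void *arg);

/* Hypothetical entry point, called from the zebra main pthread; returns
 * how many pending contexts were removed from the queue. */
int dplane_drop_pending_sketch(dplane_ctx_drop_test_t test, void *arg);
```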
There are two fixes in this commit -
1. Prevent implicit deletion of (*,G) entries during (S,G) cleanup.
This is done by creating a dummy reference on all (*,G) entries.
This is needed for a hash-walk based table cleanup.
2. Free up the SG hash table when the VRF is deleted.
Ticket: CM-30151
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Earlier type-3 ESI was the only format supported for evpn-mh. Updated the
CLI to allow a 10-byte type-0 ESI.
Both type-0 and type-3 ESIs are statically configured; just in two different
ways -
1. type-0 is configured as a complete 10-byte string
2. type-3 is configured as a 6-byte es-sys-mac and a 3-byte
local-discriminator.
Sample config -
!
interface hostbond1
evpn mh es-id 00:44:38:39:ff:ff:01:00:00:01
!
This is a CLI-only change and has no functional impact.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Add routines to walk the LSP table and generate FPM updates for all
entries. A walk of the LSP table is triggered when (re-)connecting
to an FPM.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Export netlink_lsp_msg_encoder() and use it to encode and send netlink
messages concerning LSP updates to connected FPMs.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Dataplane/kernel prints the NHG and NH ids as decimal. Zebra
was printing it as hex (to display type vs. val). This became a
debugging hassle hence normalizing the format.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
DAD is not supported currently with EVPN-MH so we turn it off internally
when the first ES config is detected.
PS: Note that when all local ESs are deleted DAD will stay off and
will need to be cleared via a daemon restart.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The function was originally implemented for the zebra data plane FPM plugin,
but other code places could use it.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
The return from sockunion2hostprefix tells us whether the conversion
succeeded or not. There are places in the code where we
always assume that it just `works`; since it can fail,
notice the failure and try to do the right thing.
Please note that failure of this function for most cases
of sockunion2hostprefix is highly unlikely, as the
sockunion was already created and tested elsewhere;
it's just that this function can fail.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
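A small sketch of the defensive pattern being applied; the helper name is illustrative:
```
#include <stdbool.h>
#include "lib/sockunion.h"	/* union sockunion, sockunion2hostprefix */
#include "lib/prefix.h"		/* struct prefix */
#include "lib/log.h"		/* zlog_warn */

/* Check the conversion result instead of assuming it always works. */
static bool su_to_prefix_sketch(const union sockunion *su, struct prefix *p)
{
	if (!sockunion2hostprefix(su, p)) {
		zlog_warn("%s: unable to convert sockunion to prefix",
			  __func__);
		return false;	/* caller bails out instead of using garbage */
	}
	return true;
}
```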
Add a command that allows FRR to know it's being used with
an underlying asic offload, from the linux kernel perspective.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The linux kernel is getting RTM_F_OFFLOAD_FAILED for kernel routes
that have failed to offload. Write the code
to receive these notifications from the linux kernel
and store that data for display about the routes.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>