Use nl_pid from the netlink socket used for programming the kernel
(netlink_dplane) in netlink route messages sent by the 'fpm' module.
This makes 'fpm' consistent with 'dplane_fpm_nl' which already
behaves this way, and allows FPM server implementations to determine
route origin via nlmsg_pid.
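Roughly, the change amounts to stamping outgoing FPM messages as below (a minimal sketch with illustrative names, not the actual 'fpm' module symbols):
```
#include <stdint.h>
#include <linux/netlink.h>

/* Hypothetical helper: stamp an outgoing FPM netlink message with the
 * nl_pid of the kernel-programming (netlink_dplane) socket so an FPM
 * server can attribute the route via nlmsg_pid. */
static void fpm_stamp_nlmsg_pid(struct nlmsghdr *hdr, uint32_t dplane_nl_pid)
{
	hdr->nlmsg_pid = dplane_nl_pid;
}
```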
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Create a function that can dump the mac->flags in human readable
output and convert all debugs to use it.
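A minimal sketch of the idea, with illustrative flag bits (the real function covers zebra's full MAC flag set):
```
#include <stdio.h>
#include <stdint.h>

#define MAC_FLAG_LOCAL	0x01	/* illustrative bits; the real set */
#define MAC_FLAG_REMOTE	0x02	/* lives in zebra's EVPN MAC code  */

/* Hypothetical dumper: render mac->flags as text for debugs. */
static const char *mac_flags_str(uint32_t flags, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (flags & MAC_FLAG_LOCAL) ? "LOC " : "",
		 (flags & MAC_FLAG_REMOTE) ? "REM " : "");
	return buf;
}
```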
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The re->flags and re->status in debugs were being dumped as hex values.
I can never quickly decode this. Here is an idea. Let's let FRR do
it for me.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In the case where a route's nexthops cannot be resolved as part
of route processing, immediately notify the upper level protocol
that its routes failed to install, if it is interested in
being informed about this issue.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The zebra route-map delay timer value is a global value,
not a per-VRF setting. As such we should only print it
out one time.
We are seeing this:
zebra route-map delay-timer 33
exit-vrf
zebra route-map delay-timer 33
when we have 2 VRFs configured.
Fix the code to only write it out for the default VRF.
Ticket: CM-32888
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When checking if there is a "hole" behind the current reservation
marker, the calculation of whether the hole is big enough to satisfy
the requested chunk is off by one. This could result in returning a label
which has already been allocated.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
If the requested chunk size was less than 16 then a chunk
within the reserved block would be returned. Make sure that
we never return labels that are below MPLS_LABEL_UNRESERVED_MIN.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
When dplane_fpm_nl is used the "Please add this protocol(n) to proper
rt_netlink.c handling" debug message is emitted for any route of type
kernel or connected.
This severely reduces performance of dplane_fpm_nl when large numbers
of these routes are present in the RIB.
The messages are not observed when using the original fpm module since
this uses a custom function, netlink_proto_from_route_type().
zebra2proto() now returns RTPROT_KERNEL for ZEBRA_ROUTE_CONNECT and
ZEBRA_ROUTE_KERNEL. This should only impact dplane_fpm_nl's use of
the common netlink routines, since these routes are generally ignored
via RSYSTEM_ROUTE() checks.
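An abridged sketch of the new mapping (ZEBRA_ROUTE_* are FRR's route types; RTPROT_* come from <linux/rtnetlink.h>):
```
#include <linux/rtnetlink.h>
#include "lib/route_types.h"	/* ZEBRA_ROUTE_*; FRR header */

static int zebra2proto_sketch(int proto)
{
	switch (proto) {
	case ZEBRA_ROUTE_CONNECT:
	case ZEBRA_ROUTE_KERNEL:
		/* mapped to RTPROT_KERNEL so the common netlink code
		 * no longer warns about these very common routes */
		return RTPROT_KERNEL;
	default:
		/* ...remaining protocol cases elided... */
		return RTPROT_ZEBRA;
	}
}
```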
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
fpm_nl_process() now ensures that the dataplane thread is rescheduled
if it hits the work limit while processing its incoming work queue.
This would probably already occur due to some other event, such as
fpm_process_queue() enqueuing completed work to the output queue,
however it does no harm to add this explicit reschedule.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
If the dataplane thread hits the work limit while processing the
output queue for any given provider, we now explicitly reschedule
the thread.
Otherwise, if the number of items in the output queue is greater than
the work limit, draining of that output queue is dependent on new
dataplane work.
Routes which are not drained from the output queue are stuck with
the 'q' flag, so this is a similar issue to that observed in
164d8e8608.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
zebra maintains a pseudo interface to hang user config off of after
the interface is deleted in the kernel. If a user tried to configure
an ES against such an interface zebra would crash with the following
call stack -
at zebra/zebra_evpn_mh.c:2095
sysmac=sysmac@entry=0x55cfbadd3160) at zebra/zebra_evpn_mh.c:2258
at zebra/zebra_evpn_mh.c:3222
argv=<optimized out>, es_lid_str=<optimized out>, es_lid=1, no=0x0, vty=0x55cfbaf4c7b0)
at zebra/zebra_evpn_mh.c:3222
argv=<optimized out>) at ./zebra/zebra_evpn_mh_clippy.c:202
vty=vty@entry=0x55cfbaf4c7b0, cmd=cmd@entry=0x0, filter=FILTER_RELAXED)
at lib/command.c:1073
Ticket: CM-31702
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a local-MAC or local-neigh is not active locally it is not sent to BGP.
At this point if BGP rxes a remote route it accepts it and installs it in
zebra. Zebra was rejecting BGP's update if it had a higher-seq local (inactive)
entry. This would result in bgp and zebra falling out of sync.
In some cases zebra would delete the local-inactive entries after some time (as
a part of the dplane/kernel garbage collection). This would leave zebra
with missing remote entries (which were still present in bgpd).
This change allows lower-seq BGP updates to overwrite zebra's local entry if
that entry happens to be local-inactive.
Note: This logic was already in use for sync-mac-ip updates. Extended the
same logic to remote-mac-ip updates.
Ticket: CM-31626
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a VNI was deleted as a part of FRR/zebra shutdown, the zevpn entry
was being freed without removing its reference in the access VLAN
entry (i.e. without clearing the VLAN->VNI mapping) used by MH.
Ticket: CM-31197
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a netlink/dp notification is rxed for a neigh without the peer-sync
flag FRR re-installs the entry with the right flags. This change is
needed to handle cases where the dataplane and FRR may fall out of
sync because of neigh learning on the network ports (i.e. via
the VxLAN).
Ticket: CM-30693
The problem was found during VM mobility "torture" tests where 100s
of extended VM moves were done.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
If a remote MAC update is rxed from BGP with a lower sequence number than
the local one zebra ignores the MAC update. This typically happens if
there is a race condition (where updates are in flight from zebra to BGP).
There was a bug in zebra because of which the dest ES was being updated
before this check. This left the local MAC pointing to a remote ES.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Relevant Dumps:
===============
root@leaf21:mgmt:~# net show evpn mac vni 101101 mac 00:93:00:00:00:01
MAC: 00:93:00:00:00:01
ESI: 03:00:00:00:77:01:03:00:00:0d
Intf: - VLAN: 101
Sync-info: neigh#: 1 peer-proxy
Local Seq: 3 Remote Seq: 0
Neighbors:
21.1.13.1 Active
root@leaf21:mgmt:~# net sho evpn es
Type: L local, R remote, N non-DF
ESI Type ES-IF VTEPs
03:00:00:00:77:01:02:00:00:0c R - 6.0.0.10,6.0.0.11
03:00:00:00:77:01:03:00:00:0d R - 6.0.0.10,6.0.0.11,6.0.0.12
03:00:00:00:77:01:04:00:00:0e R - 6.0.0.10,6.0.0.11,6.0.0.12,6.0.0.13
03:00:00:00:77:02:02:00:00:16 LR bondP2-H2 6.0.0.15
03:00:00:00:77:02:03:00:00:17 LR bondP2-H3 6.0.0.15,6.0.0.16
03:00:00:00:77:02:04:00:00:18 LR bondP2-H4 6.0.0.15,6.0.0.16,6.0.0.17
root@leaf21:mgmt:~#
Relevant logs:
===============
2020/07/29 15:41:27.110846 ZEBRA: Recv MACIP ADD VNI 101101 MAC 00:93:00:00:00:01 IP 21.1.13.1 flags 0x0 seq 2 VTEP 0.0.0.0 ESI 03:00:00:00:77:01:03:00:00:0d from bgp
2020/07/29 15:41:27.110867 ZEBRA: Ignore remote MACIP ADD VNI 101101 MAC 00:93:00:00:00:01 IP 21.1.13.1 as existing MAC has higher seq 3 flags 0x401
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Ticket: CM-30273
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
With EVPN-MH, Type-2 routes are also used for MAC-IP syncing between
ES peers so a change was done to only treat REACHABLE local neigh
entries as local-active and advertise them as Type-2 routes i.e. STALE
neigh entries are no longer advertised as Type-2s.
This however exposed some unexpected problems with MLAG where a
secondary reboot followed by a primary reboot left a lot of neighs
in STALE state (on the primary), resulting in them not being
advertised, and remote routed traffic to those hosts being
blackholed in a sym-IRB setup.
This commit is a workaround to fix the regression (it doesn't fix
the underlying problems with entries not becoming REACHABLE, which
may be a day-1 problem). The workaround is to continue advertising
STALE neighbors if EVPN-MH is not enabled.
Ticket: CM-30303
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
zebra was crashing when the command was run on a non-existent VNI.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215
VNI 16777215 doesn't exist
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 detail
VNI 16777215 doesn't exist
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 json
[
]
root@torm-12:mgmt:~# net show evpn es-evi vni 16777215 detail json
[
]
root@torm-12:mgmt:~#
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Ticket: CM-30232
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
In rib_handle_nhg_replace, do not use "new" as a parameter name, to
allow compilation of C++ code that includes zebra headers.
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
The way a couple of clauses were placed in a loop meant that
some info might not be collected - re-order things just a bit.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Derive the rule family from src if available, otherwise from
dst if available, otherwise assume IPv4. We only support
IPv4/IPv6 currently, so if we can't tell from the src/dst
it must be IPv4 and likely a dsfield match.
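A minimal sketch of that derivation, assuming FRR's struct prefix (illustrative helper, not the actual zebra rule code):
```
#include <stdint.h>
#include <sys/socket.h>		/* AF_INET */
#include "lib/prefix.h"		/* struct prefix; FRR header */

static uint8_t rule_family_sketch(const struct prefix *src,
				  const struct prefix *dst)
{
	if (src)
		return src->family;
	if (dst)
		return dst->family;

	/* no src/dst (e.g. a dsfield-only match): assume IPv4 */
	return AF_INET;
}
```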
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Maintain the count of contexts which have been processed in a local
variable, and perform a single atomic update after we have consumed
all queued contexts.
Generally this results in at least one less atomic operation per
context.
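The pattern, sketched with illustrative names rather than the actual dplane_fpm_nl symbols:
```
#include <stdatomic.h>
#include <stdint.h>

struct ctx;				/* placeholder context type */
extern struct ctx *ctx_dequeue(void);	/* hypothetical queue ops   */
extern void ctx_handle(struct ctx *c);

static void process_all(atomic_uint_fast64_t *processed_total)
{
	uint64_t processed = 0;
	struct ctx *c;

	while ((c = ctx_dequeue()) != NULL) {
		ctx_handle(c);
		processed++;	/* local counter: no atomic op here */
	}

	/* single atomic update after consuming all queued contexts */
	atomic_fetch_add_explicit(processed_total, processed,
				  memory_order_relaxed);
}
```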
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Don't use an atomic operation to determine whether fpm_process_queue()
needs to be re-scheduled. Instead we can simply use a local variable
to determine if we stopped processing because we ran out of buffers.
In the case where we would have re-scheduled due to new context objects
in the queue (enqueued after we stopped processing), fpm_nl_process()
will schedule us (or will have done already).
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Maintain the peak ctxqueue length in a local variable, and perform
a single atomic update after processing all contexts.
Generally this results in at least one less atomic operation per
context.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Reduce code in the critical sections of fpm_nl_process() and
fpm_process_queue() to the bare minimum - basically only enqueue
and dequeue operations on the shared ctxqueue.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
We don't need to use the 'force' flag when processing the
resolve-via-default CLIs for ip and ipv6: we can just do normal
nht processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
After removal of L3VNI config, the VNI should become an L2VNI if a VxLAN
interface is present for the VNI. This case is not handled in the code.
Changes:
1. After unconfiguring L3VNI, create an L2VNI if VxLAN interface is present
for the VNI.
2. Trigger an update to BGP.
3. Read MAC and ARP entries from kernel.
This PR fixes the issue only for route type-2, 3 and 5. This PR does not address
states regarding route type-1, 4 and multicast group for VxLAN interface.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
When a new ES is created it is held in a non-DF state for 3 seconds,
as specified by RFC 7432. This gives the switch time to import
the Type-4 routes from the peers, and the peers time to rx the new
Type-4 route.
root@torm-11:mgmt:~# vtysh -c "show evpn es 03:44:38:39:ff:ff:01:00:00:01"|grep DF
DF status: non-df
DF delay: 00:00:01
DF preference: 50000
root@torm-11:mgmt:~# vtysh -c "show evpn es 03:44:38:39:ff:ff:01:00:00:01"|grep DF
DF status: df
DF preference: 50000
root@torm-11:mgmt:~#
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When all the uplinks go down the VTEP is disconnected from the
VxLAN overlay and this was handled by proto-downing the ES bonds. When
the uplinks come up again we need to re-enable the ES bonds but that
needs to be done after a delay to allow the EVPN network to converge.
And that is done by firing off the startup-delay timer on first
uplink-up.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
1. When a bond is associated with an ES we may need to re-sync
the dplane protodown state (which may be stale, or set by some other
app).
2. Also change the uplink state display to avoid confusion with
protodown reason code (both used to show uplink-up).
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
protodown state is a combination of the dplane and zebra states.
protodown reason is maintained exclusively by zebra. Display this
information on two separate lines to make that ownership clearer.
Also display n/a for bonds as the dplane doesn't support protodowning
the bond device.
Sample output -
==============
root@torm-11:mgmt:~# vtysh -c "show interface hostbond1"|grep -i protodown
protodown: off (n/a)
protodown reasons: (uplinks-down)
root@torm-11:mgmt:~# vtysh -c "show interface swp5"|grep -i protodown
protodown: on
protodown reasons: (uplinks-down)
root@torm-11:mgmt:~#
PS: Cosmetic changes only, no functional change.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The code for this was already there but was not kicking in because of a
zebra local reason-code dup check. Even if the reason-code is the same,
if the dplane and zebra disagree about the protodown state zebra will
need to re-program the dplane.
Fixed a couple of spelling errors in the protodown logs to make greps
easy.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Use the new nested NDA_FDB_EXT_ATTRS attribute to control per-fdb
notifications.
PS: The attributes were updated as a part of the kernel upstreaming,
hence the change.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
New work enqueued to the dplane_fpm_nl provider is initially de-queued
and re-enqueued, in fpm_nl_process(), to be processed by the provider's
own thread.
After performing this initial de-queue/enqueue we return to
dplane_thread_loop() and check the dplane_fpm_nl output queue for any
work which has been completed.
Since this work is being processed in another thread it is very likely
that there will be some (or all) work still outstanding at this point.
The dataplane thread finishes up any other tasks and then waits until
it is next scheduled. In the meantime the dplane_fpm_nl thread is
processing its work queue until completion.
The issue arises here as the dataplane thread is not explicitly
re-scheduled once dplane_fpm_nl has drained its work queue and
populated its output queue with completed work.
This completed work can sit in the output queue for an indeterminate
period of time, depending upon when the dataplane thread is next
scheduled for other work. If the RIB has reached a stable state then
this could be a significant period of time. During this period zebra
marks these routes as queued, even though they have actually been
processed by all dataplane providers.
An un-related RIB change which triggers a FIB update will result in
the dataplane thread being scheduled and this completed work then
being processed. At this point the routes will then no longer be
marked as queued by zebra. However this new FIB update might itself
then fall victim to the same scenario!
We can observe the above behaviour in these detailed dplane logs.
11:24:47 zebra[7282]: dplane: incoming new work counter: 2
11:24:47 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:24:47 zebra[7282]: dplane provider 'Kernel': processing
11:24:47 zebra[7282]: Dplane NEIGH_DISCOVER, ip 192.168.2.2, ifindex 9
11:24:47 zebra[7282]: Dplane NEIGH_DISCOVER, ip 192.168.2.2, ifindex 9
11:24:47 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:24:47 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:24:47 zebra[7282]: dplane dequeues 1 completed work from provider dplane_fpm_nl
11:24:47 zebra[7282]: dplane has 1 completed, 0 errors, for zebra main
2 contexts (all incoming work) have been queued to dplane_fpm_nl - all good.
1 completed context was de-queued, so there is outstanding work.
11:24:58 zebra[7282]: dplane: incoming new work counter: 2
11:24:58 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:24:58 zebra[7282]: dplane provider 'Kernel': processing
11:24:58 zebra[7282]: ID (193) Dplane nexthop update ctx 0x55c429b6fed0 op NH_INSTALL
11:24:58 zebra[7282]: 0:5.5.5.5/32 Dplane route update ctx 0x55c429b79690 op ROUTE_INSTALL
11:24:58 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:24:58 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:24:58 zebra[7282]: dplane dequeues 2 completed work from provider dplane_fpm_nl
11:24:58 zebra[7282]: dplane has 2 completed, 0 errors, for zebra main
A further 2 contexts (all incoming work) have been queued to dplane_fpm_nl - all good.
2 completed contexts were de-queued, which sounds good as that is what we en-queued.
However, there is an outstanding context from earlier, so there is still outstanding
work.
Indeed the new 5.5.5.5/32 route is marked as queued:
O>q 5.5.5.5/32 [110/10] via 192.168.2.2, dp0p1s3, weight 1, 00:01:19
This remains the case until we trigger a FIB update by installation of the
(eg.) 10.10.10.10/32 route:
11:26:41 zebra[7282]: dplane: incoming new work counter: 2
11:26:41 zebra[7282]: dplane enqueues 2 new work to provider 'Kernel'
11:26:41 zebra[7282]: dplane provider 'Kernel': processing
11:26:41 zebra[7282]: ID (195) Dplane nexthop update ctx 0x55c429b78ce0 op NH_INSTALL
11:26:41 zebra[7282]: 0:10.10.10.10/32 Dplane route update ctx 0x55c429b7a040 op ROUTE_INSTALL
11:26:41 zebra[7282]: dplane dequeues 2 completed work from provider Kernel
11:26:41 zebra[7282]: dplane enqueues 2 new work to provider 'dplane_fpm_nl'
11:26:41 zebra[7282]: dplane dequeues 2 completed work from provider dplane_fpm_nl
11:26:41 zebra[7282]: dplane has 2 completed, 0 errors, for zebra main
11:26:41 zebra[7282]: zebra2proto: Please add this protocol(2) to proper rt_netlink.c handling
11:26:41 zebra[7282]: Nexthop dplane ctx 0x55c429b6fed0, op NH_INSTALL, nexthop ID (193), result SUCCESS
11:26:41 zebra[7282]: default(0:254):5.5.5.5/32 Processing dplane result ctx 0x55c429b79690, op ROUTE_INSTALL result SUCCESS
We observe the same 2 enqueues and 2 dequeues as before, which again suggests
that there is outstanding work.
As expected, the 5.5.5.5/32 route is no longer marked as queued:
O>* 5.5.5.5/32 [110/10] via 192.168.2.2, dp0p1s3, weight 1, 00:02:06
But the 10.10.10.10/32 route is, as we have not yet processed the completed
context:
C>q 10.10.10.10/32 is directly connected, lo, 00:26:05
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Returns the current number of (completed) contexts in the provider's
output queue (dp_ctx_out_q), allowing access to this data from the
provider itself.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
The following functions, which are part of the label-manager
implementation, are not called from outside their file. All of the
label-manager code lives in zebra/label_manager.c at this time, so
these functions should be unexposed.
Functions:
- create_label_chunk
- assign_label_chunk
- delete_label_chunk
- release_label_chunk
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
In the case where the namespace pointer is already available, feed it at
VRF creation. This prevents a crash when netlink parsing has already
begun but vrf-lite is not yet enabled.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The following functions use writen() to dispatch messages onto the
socket, while other code uses zserv_send_message().
This commit does a tiny unification of zapi's socket messaging.
Funcs:
- zsend_assign_label_chunk_response()
- zsend_label_manager_connect_response()
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
The `show ip nht` and `show ipv6 nht` commands were broken.
This is because recent code commit: 0154d8ce45
assumed that p must not be NULL and this is not the case.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a bit of code to allow bgp to send the AS-Path associated with
the route being installed to zebra so it can be displayed and
used as part of the `show ip route A` command in zebra.
eva# show ip route 20.0.0.0/11
Routing entry for 20.0.0.0/11
Known via "bgp", distance 20, metric 0, best
Last update 00:00:00 ago
* 192.168.161.1, via enp39s0, weight 1
AS-Path: 60000 64539 15096 6939 8075
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Just gather the opaque data into the route entry. Later
commits will display this data for end users as well as
send it down.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add the current queue depths for each plugin to the
'show dplane providers' output. Maintain the out-bound queue
max counter properly, that was being ignored.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Zebra accumulates route-entry objects and then processes them
as a group. If that rib processing is delayed because, e.g., the
dataplane/fib programming has built up a queue, zebra can
hold multiple deleted route objects in memory. At scale, this can
be a problem. Delete unneeded route entries promptly, if they
can't contribute to rib processing.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Don't attempt to walk data structures while not connected, so we can
save some CPU usage when the FPM server is offline.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Instead of checking for next group reset, always do it and skip sending
if next hop group support is disabled.
Also remove unused `*_complete` variables.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
A MAC entry cannot be deleted while a neigh is referencing it. It seems
there is some race condition where this may be happening. The log is
to help identify those cases.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
It is now 4 bits of type and 28 bits of value -
1. type=0 is for L3 NHG
2. type=1 is for L2 NH
3. type=2 is for L2 NHG
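Sketched with illustrative macro names (the exact constants live in zebra's EVPN-MH code):
```
#include <stdint.h>

#define NH_ID_TYPE_POS	28		/* top 4 bits: type   */
#define NH_ID_VAL_MASK	0x0fffffffu	/* low 28 bits: value */

enum nh_id_type {
	NH_ID_TYPE_L3_NHG = 0,
	NH_ID_TYPE_L2_NH = 1,
	NH_ID_TYPE_L2_NHG = 2,
};

static inline uint32_t nh_id_make(enum nh_id_type type, uint32_t val)
{
	return ((uint32_t)type << NH_ID_TYPE_POS) | (val & NH_ID_VAL_MASK);
}
```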
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
This is an optimization to reduce the number of L2 nexthops. An
L2 or FDB nexthop simply provides the dataplane with a nexthop IP -
torm-12:mgmt:~# ip nexthop
id 268435461 via 27.0.0.20 scope link fdb
id 268435463 via 27.0.0.20 scope link fdb
id 268435465 via 27.0.0.20 scope link fdb
So there is no need to allocate a nexthop per-ES/per-VTEP. There
can be 100+ ESs per-VTEP so this change cuts the scale down by a
factor of 100.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a local ES flaps there are two modes in which the local
MACs are failed over -
1. Fast failover - A backup NHG (ES-peer group) is programmed in the
dataplane per-access port. When a local ES flaps the MAC entries
are left unaltered i.e. pointing to the down access port. And the
dataplane redirects traffic destined to the oper-down access port
via the backup NHG.
2. Slow failover - This mode needs to be turned on to allow dataplanes
not capable of re-directing traffic. In this mode local MAC entries
on a down local ES are re-programmed to point to the ES-peers'
NHG. And vice-versa i.e. when the ES comes up the MAC entries
are re-programmed with the access port as dest.
Fast failover is on by default. Slow failover can be enabled via the
following config -
evpn mh redirect-off
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
As a part of extended MM handling a MAC can be updated from local
to remote while being referenced by SYNC neighs (this is really a
temporary/small window). During this window, if the MAC transitions
back to local again, we need to reinforce the previous SYNC flags
(based on the sync-neigh count), as subsequent SYNC updates to the
MAC will be de-duped and ignored.
Ticket: CM-29636
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
When a local mac is deleted by the dataplane zebra can re-install it
if the MAC is a SYNC MAC (learned from ES peers). The "local_inactive"
bit must be set as a part of the re-install to prevent zebra from
turning around and advertising the MAC as locally active.
Also fixed up some debug logs in the slow-fail path to include the VNI.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
NHG and DST (VTEP-IP) are mutually exclusive attributes. If DST is
present the kernel ignores NHG.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
NHG is activated i.e. programmed in the dataplane only if there
are active-VTEPs associated with it. When a NHG is de-activated
all the remote-mac entries associated with it need to be removed
before the NHG is removed.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The lookup for non-default VRFs was always using a tableId; if not
provided, we were defaulting to RT_TABLE_MAIN. This is fine for the
default VRF but not for others. As a result, the command was silently
failing for non-default VRFs unless we also specified the correct tableId.
Fix this by only performing the lookup using the tableId if it is
provided; else use zebra_vrf_table.
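A sketch of the fixed lookup order, assuming zebra's zebra_router_find_table() and zebra_vrf_table() helpers (surrounding code elided):
```
/* Illustrative: only use the table ID when the user supplied one. */
static struct route_table *lookup_table_sketch(struct zebra_vrf *zvrf,
					       afi_t afi, safi_t safi,
					       uint32_t table_id)
{
	if (table_id)
		return zebra_router_find_table(zvrf, table_id, afi, safi);

	/* fall back to the VRF's own table for non-default VRFs */
	return zebra_vrf_table(afi, safi, zvrf->vrf->vrf_id);
}
```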
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
A couple of NHG messages we were logging as errors are a bit spammy
in usecases where you routinely add/remove interfaces (VM heavy
deployments). It's not really an error a user cares about, and is
more for a developer to know what went wrong after the fact, so
it makes more sense for these to be under a debug rather than
an error, since seeing them does not implicitly mean error in
those usecases.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
During times of network trauma and when we are at large network scale
the process_remote_macip_add function can issue a zlog_warn for
a common occurrence. Modify the code to be a debug statement.
This behavior now matches the process_remote_macip_del function.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add an api that allows a caller in the zebra main pthread to
process the queue of pending dplane updates. The caller supplies
a function to call to test each pending context. Selected
contexts are dequeued, and freed without being processed.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
There are two fixes in this commit -
1. Prevent implicit deletion of (*,G) entries during (S,G) cleanup.
This is done by creating a dummy reference on all (*,G) entries.
This is needed for a hash-walk based table cleanup.
2. Free up the SG hash table when the VRF is deleted.
Ticket: CM-30151
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Earlier type-3 ESI was the only format supported for evpn-mh. Updated the
CLI to allow a 10-byte type-0 ESI.
Both type-0 and type-3 ESIs are statically configured; just in two different
ways -
1. type-0 is configured as a complete 10-byte string
2. type-3 is configured as a 6-byte es-sys-mac and a 3-byte
local-discriminator.
Sample config -
!
interface hostbond1
evpn mh es-id 00:44:38:39:ff:ff:01:00:00:01
!
This is a CLI-only change and has no functional impact.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Add routines to walk the LSP table and generate FPM updates for all
entries. A walk of the LSP table is triggered when (re-)connecting
to an FPM.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Export netlink_lsp_msg_encoder() and use it to encode and send netlink
messages concerning LSP updates to connected FPMs.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
Dataplane/kernel prints the NHG and NH ids as decimal. Zebra
was printing them as hex (to display type vs. val). This became a
debugging hassle, hence normalizing the format.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
DAD is not supported currently with EVPN-MH so we turn it off internally
when the first ES config is detected.
PS: Note that when all local ESs are deleted DAD will stay off and
will need to be cleared via a daemon restart.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The function was originally implemented for the zebra data plane FPM
plugin, but other places in the code could use it.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
The return from sockunion2hostprefix tells us if the conversion
succeeded or not. There are places in the code where we
always assume that it just `works`; since it can fail,
notice and try to do the right thing.
Please note that failure of this function is highly, highly
unlikely for most callers of sockunion2hostprefix, as the
sockunion was already created and tested elsewhere;
it's just that this function can fail.
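The call-site pattern, as a hypothetical wrapper (names illustrative):
```
#include <stdbool.h>

static bool su_to_prefix_checked(const union sockunion *su,
				 struct prefix *p)
{
	if (!sockunion2hostprefix(su, p)) {
		/* conversion failed: notice and bail instead of
		 * assuming p was filled in */
		zlog_warn("%s: unable to convert sockunion to prefix",
			  __func__);
		return false;
	}
	return true;
}
```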
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a command that allows FRR to know it's being used with
an underlying asic offload, from the linux kernel perspective.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The Linux kernel is getting RTM_F_OFFLOAD_FAILED for kernel routes
that have failed to offload. Write the code
to receive these notifications from the Linux kernel
and store that data for display about the routes.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If we have `ip protocol <proto> route-map FOO` and FOO has
not been defined in any way shape fashion or form, we
should deny the match instead of permitting it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The route_map_object_t was being used to track what protocol we were
being called against, but each protocol was only ever calling itself.
So we had a variable that was only ever being passed in from route_map_apply,
that had to be carried around, and that everyone was testing to see if it
was for their own stack.
Clean up this route_map_object_t from the entire system. We should
speed some stuff up. Yes, I know, not a bunch, but this will add up.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If a route-map in zebra has `set src X` and the interface
X is on has not been configured yet, we are rejecting the command
outright. This is a problem on boot up especially (and where I
found this issue) in that interfaces *can* and *will* be slow
on startup, and config can easily be read in *before* the
interface has an ip address.
Let's modify zebra to just warn the user that we may have a problem
and let the chips fall where they may.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
(ndm_state & NUD_NOARP) - prevents the entry from expiring
(ndm_flags & NTF_STICKY) - prevents station moves on the entry
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Issue:
The bgp routes learnt from peers which are not installed in the kernel
are advertised to peers. This can cause routers to send traffic to these
destinations only to get dropped. The fix is to provide a configurable
option "bgp suppress-fib-pending". When the option is enabled, bgp will
advertise routes only if they are successfully installed in the kernel.
Fix (Part 1):
* Added message ZEBRA_ROUTE_NOTIFY_REQUEST used by client to request
FIB install status for routes
* Added AFI/SAFI to ZAPI messages
* Modified the functions zapi_route_notify_decode(), zsend_route_notify_owner()
and route_notify_internal() to include AFI, SAFI as parameters
Signed-off-by: kssoman <somanks@gmail.com>
Clang SA was saying:
./zebra/zebra_vty_clippy.c: In function ‘show_route’:
zebra/zebra_vty.c:1775:4: warning: ‘zvrf’ may be used uninitialized in this function [-Wmaybe-uninitialized]
do_show_ip_route_all(vty, zvrf, afi, !!fib, !!json, tag,
^
I do not see a way that zvrf could ever be uninitialized in this code
path, but rearrange the code a tiny bit to make the compiler happier.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add extra data about the interfaces used in route updates'
nexthops - some consumers of route updates may want additional
data, but dataplane plugins running in the dplane pthread
cannot safely access the normal zebra data structures. Capturing
this info is optional - a plugin must request it (via an api).
Signed-off-by: Mark Stapp <mjs@voltanet.io>
When we get a route for installation via any method we should
consolidate on 32 bits as the flag size, since we actually
have more than 8 bits of data to pass around.
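In sketch form (illustrative struct, not the actual zapi message layout):
```
#include <stdint.h>

/* The flag field carried with a route install is now 32 bits wide
 * end to end; 8 bits can no longer hold all the flag data. */
struct route_install_msg_sketch {
	uint32_t flags;		/* was effectively 8 bits in places */
};
```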
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Local ethernet segments are held in a protodown or error-disabled state
if access to the VxLAN overlay is not ready -
1. When FRR comes up the local-ESs/access-port are kept protodown
for the startup-delay duration. During this time the underlay and
EVPN routes via it are expected to converge.
2. When all the uplinks/core-links attached to the underlay go down
the access-ports are similarly protodowned.
The ES-bond protodown state is propagated to each ES-bond member
and programmed in the dataplane/kernel (per-bond-member).
Configuring uplinks -
vtysh -c "conf t" vtysh -c "interface swp4" vtysh -c "evpn mh uplink"
Configuring startup delay -
vtysh -c "conf t" vtysh -c "evpn mh startup-delay 100"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
EVPN protodown display -
========================
root@torm-11:mgmt:~# vtysh -c "show evpn"
L2 VNIs: 10
L3 VNIs: 3
Advertise gateway mac-ip: No
Advertise svi mac-ip: No
Duplicate address detection: Disable
Detection max-moves 5, time 180
EVPN MH:
mac-holdtime: 60s, neigh-holdtime: 60s
startup-delay: 180s, start-delay-timer: 00:01:14 <<<<<<<<<<<<
uplink-cfg-cnt: 4, uplink-active-cnt: 4
protodown: startup-delay <<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ES-bond protodown display -
===========================
root@torm-11:mgmt:~# vtysh -c "show interface hostbond1"
Interface hostbond1 is up, line protocol is down
Link ups: 0 last: (never)
Link downs: 1 last: 2020/04/26 20:38:03.53
PTM status: disabled
vrf: default
OS Description: Local Node/s torm-11 and Ports swp5 <==> Remote Node/s hostd-11 and Ports swp1
index 58 metric 0 mtu 9152 speed 4294967295
flags: <UP,BROADCAST,MULTICAST>
Type: Ethernet
HWaddr: 00:02:00:00:00:35
Interface Type bond
Master interface: bridge
EVPN-MH: ES id 1 ES sysmac 00:00:00:00:01:11
protodown: off rc: startup-delay <<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ES-bond member protodown display -
==================================
root@torm-11:mgmt:~# vtysh -c "show interface swp5"
Interface swp5 is up, line protocol is down
Link ups: 0 last: (never)
Link downs: 3 last: 2020/04/26 20:38:03.52
PTM status: disabled
vrf: default
index 7 metric 0 mtu 9152 speed 10000
flags: <UP,BROADCAST,MULTICAST>
Type: Ethernet
HWaddr: 00:02:00:00:00:35
Interface Type Other
Master interface: hostbond1
protodown: on rc: startup-delay <<<<<<<<<<<<<<<<
root@torm-11:mgmt:~#
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Add a type specifier to the `show nexthop-group` command
so we can easily filter by type when using proto created
nexthop groups.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
A local ES can be added to or removed from a bridge after it is
created. When it becomes a bridge port member the dataplane attributes
need to be programmed.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The split horizon filter, non-DF block filter and backup nexthop group
are passed as bridge port attributes to the dataplane.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
This includes -
1. non-DF block filter
2. List of es-peers that need to be blocked per-access port (for
split horizon filtering)
3. Backup nexthop group to failover local-es via the VxLAN overlay
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
1. DF preference is configurable per-ES
!
interface hostbond1
evpn mh es-df-pref 100 >>>>>>>>>>>
evpn mh es-id 1
evpn mh es-sys-mac 00:00:00:00:01:11
!
2. This parameter is sent to BGP and advertised via the ESR.
3. The peer-ESs' DF params are sent to zebra (by BGP) and used
for running the DF election.
4. If the local VTEP becomes non-DF on an ES a block filter is
programmed in the dataplane to drop de-capsulated BUM packets
destined to that ES.
Sample output
=============
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh evpn es
Type: L local, R remote, N non-DF
ESI Type ES-IF VTEPs
03:00:00:00:00:01:11:00:00:01 LRN hostbond1 27.0.0.16
03:00:00:00:00:01:22:00:00:02 LR hostbond2 27.0.0.16
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
torm-11# sh evpn es 03:00:00:00:00:01:11:00:00:01
ESI: 03:00:00:00:00:01:11:00:00:01
Type: Local,Remote
Interface: hostbond1
State: up
Ready for BGP: yes
VNI Count: 10
MAC Count: 2
DF: status: non-df preference: 100 >>>>>>>>
Nexthop group: 0x2000001
VTEPs:
27.0.0.16 df_alg: preference df_pref: 32767 nh: 0x100000d >>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
There are several places where prefix2str was used to convert
a prefix but they were debug guarded and the buffer was
used for flog_err/warn. This would lead to corrupt data
being output in the failure cases if debugs were not turned
on.
Modify the code in zebra_mpls.c to not use prefix2str.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We are loading a buffer with the prefix2str results then
using it in the debugs throughout functions. Replace
with just using %pFX and remove the buffer.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Fixes the valgrind error we were seeing on startup by
initializing the msg header struct:
```
==2534283== Thread 3 zebra_dplane:
==2534283== Syscall param recvmsg(msg) points to uninitialised byte(s)
==2534283== at 0x4D616DD: recvmsg (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x43107C: netlink_recv_msg (kernel_netlink.c:744)
==2534283== by 0x4330E4: nl_batch_read_resp (kernel_netlink.c:1070)
==2534283== by 0x431D12: nl_batch_send (kernel_netlink.c:1201)
==2534283== by 0x431E8B: kernel_update_multi (kernel_netlink.c:1369)
==2534283== by 0x46019B: kernel_dplane_process_func (zebra_dplane.c:3979)
==2534283== by 0x45EB7F: dplane_thread_loop (zebra_dplane.c:4368)
==2534283== by 0x493F5CC: thread_call (thread.c:1585)
==2534283== by 0x48D3450: fpt_run (frr_pthread.c:303)
==2534283== by 0x48D3D41: frr_pthread_inner (frr_pthread.c:156)
==2534283== by 0x4D56431: start_thread (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x4E709D2: clone (in /usr/lib64/libc-2.31.so)
==2534283== Address 0x85cd850 is on thread 3's stack
==2534283== in frame #2, created by nl_batch_read_resp (kernel_netlink.c:1051)
==2534283==
==2534283== Syscall param recvmsg(msg.msg_control) points to unaddressable byte(s)
==2534283== at 0x4D616DD: recvmsg (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x43107C: netlink_recv_msg (kernel_netlink.c:744)
==2534283== by 0x4330E4: nl_batch_read_resp (kernel_netlink.c:1070)
==2534283== by 0x431D12: nl_batch_send (kernel_netlink.c:1201)
==2534283== by 0x431E8B: kernel_update_multi (kernel_netlink.c:1369)
==2534283== by 0x46019B: kernel_dplane_process_func (zebra_dplane.c:3979)
==2534283== by 0x45EB7F: dplane_thread_loop (zebra_dplane.c:4368)
==2534283== by 0x493F5CC: thread_call (thread.c:1585)
==2534283== by 0x48D3450: fpt_run (frr_pthread.c:303)
==2534283== by 0x48D3D41: frr_pthread_inner (frr_pthread.c:156)
==2534283== by 0x4D56431: start_thread (in /usr/lib64/libpthread-2.31.so)
==2534283== by 0x4E709D2: clone (in /usr/lib64/libc-2.31.so)
==2534283== Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==2534283==
```
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Replace all lib/thread cancel macros, use thread_cancel()
everywhere. Only the THREAD_OFF macro and thread_cancel() api are
supported. Also adjust thread_cancel_async() to NULL caller's pointer (if
present).
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Change thread_cancel to take a ** to an event, NULL-check
before dereferencing, and NULL the caller's pointer. Update
many callers to use the new signature.
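A sketch of the new semantics (do_thread_cancel() is a hypothetical internal step; see lib/thread.c for the real implementation):
```
struct thread;				/* lib event/thread object */
extern void do_thread_cancel(struct thread *t);

void thread_cancel(struct thread **threadp)
{
	if (!threadp || !*threadp)
		return;			/* NULL-check before deref */

	do_thread_cancel(*threadp);
	*threadp = NULL;		/* caller's pointer is NULLed */
}
```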
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Because the backup nexthop groups currently are more like pseudo-NHEs
(they don't have IDs and are not inserted into the ID table or
hashed), they can't really have this depends/dependents relationship
yet in both directions. Some work needs to be done there to make
them more like first class citizens like "normal" NHGs to enable
this.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When `-r` is specified to zebra, on shutdown we should
not remove any routes from the fib. This was a problem
with nhg's on shutdown due to their ref-count behavior.
Introduce a methodology where on shutdown we don't mess
with the nexthop groups in the kernel. That way on
next startup things will be ok.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add an alias so people can still type `show ip ro`.
It became ambiguous in a recent release.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Apparently the dependents backpointer trees for singletons
got broken at some point and we never noticed. There is
not really any code making use of this right now, so that's not
surprising, but let's go ahead and fix it for zebra and proto
NHGs.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
This problem was accidentally introduced as a part of another fixup -
[
commit e378f5020d (anuradhak/mh-misc-fixes, mh-misc-fixes)
Author: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Date: Tue Sep 15 16:50:14 2020 -0700
zebra: fix use of freed es during zebra shutdown
]
zif->es_info.es is cleared as a part of zebra_evpn_es_local_info_clear so it
cannot be passed around as a pointer from zebra_evpn_local_es_update/del.
Because of this bug, removing an ES from an interface resulted in
a zebra crash.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Create appropriate accessor functions for the rn->lock
data. We should be accessing this data through accessor
functions since it is private data to the data structure.
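A minimal accessor sketch, assuming the route_node layout in lib/table.h (the actual accessor names may differ):
```
static inline unsigned int route_node_get_lock_count(
	const struct route_node *rn)
{
	return rn->lock;
}
```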
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When zebra is running with debugs turned on there
is a use after free reported by the address sanitizer:
2020/10/16 12:58:02 ZEBRA: rib_delnode: (0:254):4.5.6.16/32: rn 0x60b000026f20, re 0x6080000131a0, removing
2020/10/16 12:58:02 ZEBRA: rib_meta_queue_add: (0:254):4.5.6.16/32: queued rn 0x60b000026f20 into sub-queue 3
=================================================================
==3101430==ERROR: AddressSanitizer: heap-use-after-free on address 0x608000011d28 at pc 0x555555705ab6 bp 0x7fffffffdab0 sp 0x7fffffffdaa8
READ of size 8 at 0x608000011d28 thread T0
#0 0x555555705ab5 in re_list_const_first zebra/rib.h:222
#1 0x555555705b54 in re_list_first zebra/rib.h:222
#2 0x555555711a4f in process_subq_route zebra/zebra_rib.c:2248
#3 0x555555711d2e in process_subq zebra/zebra_rib.c:2286
#4 0x555555711ec7 in meta_queue_process zebra/zebra_rib.c:2320
#5 0x7ffff74701f7 in work_queue_run lib/workqueue.c:291
#6 0x7ffff7450e9c in thread_call lib/thread.c:1581
#7 0x7ffff738eaf7 in frr_run lib/libfrr.c:1099
#8 0x55555561a578 in main zebra/main.c:455
#9 0x7ffff7079cc9 in __libc_start_main ../csu/libc-start.c:308
#10 0x5555555e3429 in _start (/usr/lib/frr/zebra+0x8f429)
0x608000011d28 is located 8 bytes inside of 88-byte region [0x608000011d20,0x608000011d78)
freed by thread T0 here:
#0 0x7ffff768bb6f in __interceptor_free (/lib/x86_64-linux-gnu/libasan.so.6+0xa9b6f)
#1 0x7ffff739ccad in qfree lib/memory.c:129
#2 0x555555709ee4 in rib_gc_dest zebra/zebra_rib.c:746
#3 0x55555570ca76 in rib_process zebra/zebra_rib.c:1240
#4 0x555555711a05 in process_subq_route zebra/zebra_rib.c:2245
#5 0x555555711d2e in process_subq zebra/zebra_rib.c:2286
#6 0x555555711ec7 in meta_queue_process zebra/zebra_rib.c:2320
#7 0x7ffff74701f7 in work_queue_run lib/workqueue.c:291
#8 0x7ffff7450e9c in thread_call lib/thread.c:1581
#9 0x7ffff738eaf7 in frr_run lib/libfrr.c:1099
#10 0x55555561a578 in main zebra/main.c:455
#11 0x7ffff7079cc9 in __libc_start_main ../csu/libc-start.c:308
previously allocated by thread T0 here:
#0 0x7ffff768c037 in calloc (/lib/x86_64-linux-gnu/libasan.so.6+0xaa037)
#1 0x7ffff739cb98 in qcalloc lib/memory.c:110
#2 0x555555712ace in zebra_rib_create_dest zebra/zebra_rib.c:2515
#3 0x555555712c6c in rib_link zebra/zebra_rib.c:2576
#4 0x555555712faa in rib_addnode zebra/zebra_rib.c:2607
#5 0x555555715bf0 in rib_add_multipath_nhe zebra/zebra_rib.c:3012
#6 0x555555715f56 in rib_add_multipath zebra/zebra_rib.c:3049
#7 0x55555571788b in rib_add zebra/zebra_rib.c:3327
#8 0x5555555e584a in connected_up zebra/connected.c:254
#9 0x5555555e42ff in connected_announce zebra/connected.c:94
#10 0x5555555e4fd3 in connected_update zebra/connected.c:195
#11 0x5555555e61ad in connected_add_ipv4 zebra/connected.c:340
#12 0x5555555f26f5 in netlink_interface_addr zebra/if_netlink.c:1213
#13 0x55555560f756 in netlink_information_fetch zebra/kernel_netlink.c:350
#14 0x555555612e49 in netlink_parse_info zebra/kernel_netlink.c:941
#15 0x55555560f9f1 in kernel_read zebra/kernel_netlink.c:402
#16 0x7ffff7450e9c in thread_call lib/thread.c:1581
#17 0x7ffff738eaf7 in frr_run lib/libfrr.c:1099
#18 0x55555561a578 in main zebra/main.c:455
#19 0x7ffff7079cc9 in __libc_start_main ../csu/libc-start.c:308
SUMMARY: AddressSanitizer: heap-use-after-free zebra/rib.h:222 in re_list_const_first
This is happening because we are using the dest pointer after a call into
rib_gc_dest. In process_subq_route, we call rib_process() and if the
dest is deleted, the dest pointer is now garbage. We must reload the
dest pointer in this case.
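An abridged sketch of the fix (rib_process() and rib_dest_from_rnode() are the zebra functions involved):
```
static void process_subq_route_sketch(struct route_node *rnode)
{
	rib_dest_t *dest;

	rib_process(rnode);

	/* rib_process() can free the dest via rib_gc_dest(), so the
	 * pointer must be reloaded, not reused from before the call */
	dest = rib_dest_from_rnode(rnode);
	if (dest)
		zlog_debug("%s: dest still present", __func__);
}
```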
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We support configuration of multiple addresses in the same
subnet on a single interface: make sure that zebra supports
multiple instances of the corresponding connected route.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Let's just track the NHEs we get from the kernel (dplane) for
ID usage with internal routes. I tried to be smart originally
and allow them to be re-used internally in zebra, but it's proving
to cause more bugs than it's worth.
This doesn't break any functionality. It just means we won't
use NHEs we get from the kernel with our routes; we will create
new ones.
Decided this based on various bugs seen, with the latest one
being on startup with this kernel state:
```
[root@alfred frr-2]# ip next ls
id 15 via 192.168.161.1 dev doof scope link proto zebra
id 17 group 15 proto zebra
[root@alfred frr-2]# ip ro show 3.3.3.1
3.3.3.1 nhid 17 via 192.168.161.1 dev doof
```
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
SIOCGIFMEDIA returns the media state.
SIOCGIFDATA returns interface data which includes the link state.
While the status of the former is usually indicitive of the latter,
this is not always the case.
Ifact some recent net80211 changes in at least NetBSD and OpenBSD
have MONITOR media set to active but the link status set to DOWN.
All interfaces will return link state with SIOCGIFDATA, unlike
SIOCGIFMEDIA. However not all BSD's support SIOCGIFDATA - it has
recently been accepted into FreeBSD-13.
However, all BSD's do report the same structure in ifa_data for
AF_LINK addresses from getifaddrs(3) so the information has always
been available.
Signed-off-by: Roy Marples <roy@marples.name>
Add a param to the common NHE creation callstack so we can
know if this is one we have read in from the dataplane. We can
add some logic on how to handle these special ones later.
I considered putting this on a struct as a flag or something
but it would have required it being put on struct nexthop
since we have some `*_find_nexthop()` functions that can
be called when given NHEs from the dataplane.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
In zebra_evpn_es_evi_show_vni, the zevpn pointer passed into
zebra_evpn_es_evi_show_one_evi will cause a crash if it is NULL,
yet the code that checks for a NULL zevpn goes on to call the
function anyway. Add a return to prevent a crash.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
No need to check for n->mac existence, as all paths
leading to this code have already dereferenced n->mac.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Use the same lsp and nexthop/nhlfe objects for 'static' and
dynamic LSPs; remove the 'static' objects and their supporting
code.
Signed-off-by: Mark Stapp <mjs@voltanet.io>
Zebra's clear duplicate detect command is RPC converted.
There is a condition where the CLI fails with a human-readable message.
Use the northbound's errmsg buffer to display the error message to the
user.
Testing:
bharat# clear evpn dup-addr vni 1002 ip 2011:11::11
Error type: generic error
Error description: Requested IP's associated MAC aa:aa:aa:aa:aa:aa is still in duplicate state
bharat# clear evpn dup-addr vni 1002 ip 11.11.11.11
Error type: generic error
Error description: Requested IP's associated MAC aa:aa:aa:aa:aa:aa is still in duplicate state
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Display a human-readable error message on northbound RPC
transaction failure. In the case of the vtysh northbound client,
the error message will be displayed to the user.
Testing:
bharat# clear evpn dup-addr vni 1002 ip 11.11.11.11
Error type: generic error
Error description: Requested IP's associated MAC aa:aa:aa:aa:aa:aa is still
in duplicate state
Signed-off-by: Chirag Shah <chirag@nvidia.com>
NetBSD and DragonFlyBSD support reporting of route(4) overflows
by setting the socket option SO_RERROR.
This is handled the same as on Linux by exiting with a -1 error code.
Signed-off-by: Roy Marples <roy@marples.name>
a) Use appropriate %p modifiers for output
b) Display vrf name in addition to vrf id
c) Remove now unused function
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
During quick ifdown / ifup events from the linux kernel there
exists a situation where a prefix that has both a kernel route
and a static route can be queued up on the meta-q. Suppose the static
route points at a connected route for nexthop resolution
and we receive a series of quick up/down events *after* the
static route and kernel route are queued up for rib reprocessing.
Since the static route and kernel route are queued on meta-q 1
and the connected route is also on meta-q 1, there exists a situation
where the connected route will be resolved after the static route
fails to resolve, leaving the static route in an unresolved state.
Add a new queue level and put connected routes on their own level,
since they are the fundamental building blocks of pretty much
all the other routes.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When zebra is processing routes to determine what to send
to the rib, suppose we have two routes: (a) a route processed
earlier where none of its nexthops were active, and (b)
a route that has good nexthops but a worse admin distance.
rib_process would not relook at (a)'s nexthops because
the ROUTE_ENTRY_CHANGED flag was not true, and it would
win when compared to (b) because its admin distance
was better, leaving us with a state where we would
attempt and fail to install route (a) because it
was not valid.
Modify the code to consider the number of usable nexthops
we have as a determiner of whether we can use the route.
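A sketch of the determiner, assuming FRR's struct nexthop and NEXTHOP_FLAG_ACTIVE (a candidate with zero active nexthops cannot win on admin distance alone):
```
static unsigned int active_nexthop_num(const struct nexthop_group *nhg)
{
	const struct nexthop *nh;
	unsigned int num = 0;

	for (nh = nhg->nexthop; nh; nh = nh->next)
		if (CHECK_FLAG(nh->flags, NEXTHOP_FLAG_ACTIVE))
			num++;

	return num;
}
```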
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In rib_process_update_fib, the function is sent two route entries:
the old (previously installed) and the new (the one to install).
When the function detects that the new is unusable, because
the number of usable nexthops for that route is 0,
we uninstall the old route. The problem here is that
we should not attempt to uninstall any route that is
not owned by FRR. Modify the code to not attempt
this behavior.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
b0e9567ed1 fixed an issue whereby
zebra would abort while building an update for a blackhole route.
The same issue, `assert(data_len)` failing in
`zfpm_build_route_updates()`, can be observed when building updates
for unreachable and prohibit routes.
To address this `netlink_route_info_fill()` is updated to not
indicate failure, due to lack of nexthops, for any blackhole routes.
Signed-off-by: Duncan Eastoe <duncan.eastoe@att.com>
When debugging why a route was not successfully installed into the
rib, it would be preferable that the end user only have to turn
on `debug zebra rib detail` as that is what we have been telling
people to do for the last couple of years. Consolidate *back*
to this.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
An l2vni flap was leading to duplicate entries being created
in the l3vni's l2vni-list.
Use a sorted list add with no duplicates.
root@TORC11:mgmt:~# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
State: Up
...
L2 VNIs: 1000 1000 1000 0 0 1002
root@TORC11:mgmt:~# ip link set down vx-1002
root@TORC11:mgmt:~# ip link set up vx-1002
root@TORC11:mgmt:~# show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
State: Up
...
L2 VNIs: 1000 1000 1000 0 0 1002 1002
Ticket: CM-31545
Reviewed By:
Testing Done:
With Fix:
After multiple flaps the VNI counts remained the same.
root@TORC11:mgmt:~# ip link set down vx-1002
root@TORC11:mgmt:~# ip link set up vx-1002
root@TORC11:mgmt:~# ip link set down vx-1002
root@TORC11:mgmt:~# ip link set up vx-1002
root@TORC11:mgmt:~# net show evpn vni 4001
VNI: 4001
Type: L3
Tenant VRF: vrf1
State: Up
...
L2 VNIs: 1000 1002
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Only set the NHG/backup NHG pointers of the caller if the read
of the nexthops was successful. Otherwise, we might free when not
necessary, or double free.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add the zapi code for encoding/decoding of backup nexthops for when
we are ready for it, but disable it for now so that we revert
to the old way with them.
When zebra gets a proto-NHG with a backup in it, we early fail and
tell the upper level proto. In this case sharpd. Sharpd then reverts
to the old way of installation with the route.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add type to the nhg_proto_del API params for sanity checking
that the type of the NHG sent by the proto matches the type
found with the ID.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Limit the not re-installation of routes with the same NHG ID
to routes that are using the new NHG PROTO API. This would
only include sharpd and EVPN-MH for now.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
In scoring our NHEs during shutdown there is a chance we could release
multiple NHEs at the same time during one iteration. This can cause
memory corruption if the two being released are directly next to each
other in the hash table.
hash_iterate accounts for releasing one during the iteration but not
two by setting hbnext before release but if hbnext is also freed,
we obviously can have a problem.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Reject proto NHGs of type blackhole/interface for now.
We need to think a bit more about how to resolve these
given the linux kernel needs to know the Address Family
of the routes that will use them and install it with them.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Clean up the function names and remove some TODOs that are no
longer needed/hacks we used for testing.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Get the multipath number checks working with proto-based NHG
message decoding in zapi_msg.c.
Modify the function that checks this for routes to work without
being passed a prefix, as is the case with NHG creates.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add a flag to track the released state of a proto-based NHG.
This flag is used to know whether the upper level proto has called
the *_del API. Typically, the NHG would just get removed and uninstalled
at this point, but there is a chance we are sent it while routes
still reference it, or that we were sent it multiple times. This flag
and the associated code handle that.
Ticket: CM-30369
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
We currently don't support ADD/DEL/REPLACE with proto-based
NHGs that are not already fully resolved and ifindex/onlink
based. If we are handed one that doesn't have an ifindex set,
i.e. recursive, gracefully fail with a notification.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Make the message parameters align better with other zapi
notifications and change the ID to correctly be a uint32.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
When the dataplane detects that we have no need to
reinstall the same route, setup the NEXTHOP_FLAG_FIB
appropriately.
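A sketch of the idea, assuming zebra's route_entry layout and the ALL_NEXTHOPS iterator from lib/nexthop_group.h:
```
static void mark_nexthops_installed(struct route_entry *re)
{
	struct nexthop *nexthop;

	/* dataplane says nothing changed: the nexthops are still in
	 * the FIB, so flag them accordingly */
	for (ALL_NEXTHOPS(re->nhe->nhg, nexthop))
		SET_FLAG(nexthop->flags, NEXTHOP_FLAG_FIB);
}
```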
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
The code was installing the nexthop group again using
the NLM_F_REPLACE flag, causing extremely large
route installation times. This reduces the time from
installing 1 million routes from sharpd with a nhg
from > 200 seconds ( where I gave up ) to ~15
seconds on my machine for 32 x ecmp. As a side note 1 million
routes using master sharpd takes ~50 seconds to do
the same thing.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Add some logging for when we choose to ignore a NHG install
for one reason or another. Also, cleanup some of the code
using the same accessor functions for the context object.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Return the proto nhe on del even if there are still possible
route references.
We may get a del before the routes are removed. So we still need
to return this to the caller so they can decrement the ref.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Fix the releasing of proto-owned singletons from the attribute
hash table. Proto-owned singleton nexthops are hashed so they
can still be shared; therefore they are present in this table
and need to be released when the time comes.
This check was only matching on zebra proto before. Changed
to match IDs in zebra allocated range.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Increment the NHG proto score iterator used to count
leftover NHGs after client disconnect, and log them.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Fix some reference counting issues seen when replacing
a NHG and deleting one.
For replacement, we should end with the same refcnt on the new
one.
For delete, it's the caller's job to decrement its ref after
it's done with it.
Further, update routes in the rib with the new pointer after replace.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Add code to handle proto-based NHG uninstalling after
the owning client disconnects.
This is handled the same way as rib_score_proto() but for now
we are ignoring instance.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>