When SRv6 is enabled and an SRv6 locator is specified in the BGP
configuration, BGP may attempt to request SRv6 locator information from
zebra before the connection is fully established. If this occurs, the
request fails with the following error:
```
2025/02/06 16:37:32 BGP: [HR66R-TWQYD][EC 100663302] srv6_manager_get_locator: invalid zclient socket
````
As a result, BGP is unable to obtain the locator information,
preventing SRv6 VPN from working.
This commit fixes the issue by ensuring BGP requests SRv6 locator
information once the connection with zebra is successfully established.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
(cherry picked from commit 16640b615d)
Sometimes, NHRP receives L2 information on a cache entry with the
0.0.0.0 IP address. NHRP considers it as valid and updates the binding
with the new IP address.
> Feb 09 20:09:54 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: new-neigh 10.2.114.238 dev dmvpn1 lladdr 162.251.180.10 nud 0x2 cache used 0 type 4
> Feb 09 20:10:35 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: new-neigh 10.2.114.238 dev dmvpn1 lladdr 162.251.180.10 nud 0x4 cache used 1 type 4
> Feb 09 20:10:48 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: del-neigh 10.2.114.238 dev dmvpn1 lladdr 162.251.180.10 nud 0x4 cache used 1 type 4
> Feb 09 20:10:49 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: who-has 10.2.114.238 dev dmvpn1 lladdr (unspec) nud 0x1 cache used 1 type 4
> Feb 09 20:10:49 aws-sin-vpn01 nhrpd[2695]: [QVXNM-NVHEQ] Netlink: update binding for 10.2.114.238 dev dmvpn1 from c 162.251.180.10 peer.vc.nbma 162.251.180.10 to lladdr (unspec)
> Feb 09 20:10:49 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: new-neigh 10.2.114.238 dev dmvpn1 lladdr 0.0.0.0 nud 0x2 cache used 1 type 4
> Feb 09 20:11:30 aws-sin-vpn01 nhrpd[2695]: [QQ0NK-1H449] Netlink: new-neigh 10.2.114.238 dev dmvpn1 lladdr 0.0.0.0 nud 0x4 cache used 1 type 4
Actually, the 0.0.0.0 IP addressed mentiones in the 'who-has' message is
wrong because the nud state value means that value is incomplete and
should not be handled as a valid entry. Instead of considering it, fix
this by by invalidating the current binding. This step is necessary in
order to permit NHRP to trigger resolution requests again.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
(cherry picked from commit 3202323052)
Blocking all signals on non-main threads is not the way to go, at least
the handlers for SIGSEGV, SIGBUS, SIGILL, SIGABRT and SIGFPE need to run
so we get backtraces. Otherwise the process just exits.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
(cherry picked from commit 13a6ac5b4c)
In bgp route leak, when import vrf x is executed,
it creates bgp instance as hidden with asn value as unspecified.
When router bgp x is configured ensure the correct as,
asnotation is applied otherwise running config shows asn value as 0.
This can lead to frr-reload failure when any FRR config change.
Fix:
Move asn and asnotiation, as_pretty value in common done section,
so when bgp_create gets existing instance but before returning
update asn and required fields in common section.
In bgp_create(): when returning for hidden at least update asn
and required when bgp instance created implicitly due to vrf leak.
if (hidden) {
bgp = bgp_old;
goto peer_init; <<<
}
Before fix:
show running:
router bgp 0 vrf purple
bgp router-id 10.10.3.11
!
address-family ipv4 unicast
redistribute static
import vrf blue
exit-address-family
!
address-family ipv6 unicast
import vrf blue
exit-address-family
!
address-family l2vpn evpn
advertise ipv4 unicast
advertise ipv6 unicast
exit-address-family
exit
Testing:
1) following snippet config:
router bgp 63420 vrf blue
import vrf purple
router bgp 63420 vrf purple
import vrf blue
2) restart frr leads to the running config with 0 asn value.
Signed-off-by: Chirag Shah <chirag@nvidia.com>
(cherry picked from commit 2ff08af78e)
Issue:
When there is no traffic for a group, the LHR and RP take the default KAT+Join timer expiry of
a maximum of 480 seconds to clear the S,G . However, in the FHR, we update the state from JOINED
to NOT Joined, downstream state from PPto NOINFO. This restarts the ET timer, causing S,G on FHR to
take more than 10 minutes to age out.
In other words,
Consider a case where (S,G) is in Join state. When the traffic stops and the KAT (210) expires,
the Join expiry timer restarts. At this time, if we receive a prune, the expectation is to set
PPT to 0 (RFC 4601 sec 4.5.2).
When the PPT expires, we move to the noinfo state and restart the expiry timer one more time. We remove the
(S,G) entry only after ~10 minutes when there is no active traffic.
Summary:
KAT Join ET 210 + PP ET 210 + NOINFO ET 210.
Solution:
Delete the ifchannel when in noinfo state, and KAT is not running.
Ticket: #13703
Signed-off-by: Rajesh Varatharaj <rvaratharaj@nvidia.com>
(cherry picked from commit afed39ea2b)
Use a memory allocation specific type for filter names (to help detect memory
leaks) and fix a memory leak when releasing peer memory.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
(cherry picked from commit d1440dadff)
The dg_update_list access is controlled by the dg_mutex in all
other locations. Let's just add a mutex usage around the initialization
of the dg_update_list even if it's part of the startup, just to keep
things consistent.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
(cherry picked from commit 19af3f3d7a)
Grab the count of streams in ibuf when it is protected
by a mutex. Since this data is written to it in another
pthread.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
(cherry picked from commit f94ad538cf)
msg_new takes a uint16_t, the length passed
down variable is a unsigned int, thus 32 bit.
It's possible, but highly unlikely, that the
msglen could be greater than 16 bit.
Let's just add some checks to ensure that
this could not happen.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
(cherry picked from commit 283cc51178)
Memory is being leaked when processing the eoiu marker.
BGP is creating a dummy dest to contain the data but
it was never freed. As well as the eoiu info was
not being freed either.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
(cherry picked from commit c6b7a993fb)
If we have IPv6-only network and no IPv4 addresses at all, then by default
0.0.0.0 is created which is treated as malformed according to RFC 6286.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
(cherry picked from commit 739f2b566a)
For auto configured value RD value comes as NULL,
switching back to original change will ensure to cover
for both auto and user configured RD value in JSON.
tor-11# show bgp vrf blue ipv4 unicast route-leak json
{
"vrf":"blue",
"afiSafi":"ipv4Unicast",
"importFromVrfs":[
"purple"
],
"importRts":"10.10.3.11:6",
"exportToVrfs":[
"purple"
],
"routeDistinguisher":"(null)", <<<<<
"exportRts":"10.10.3.11:10"
}
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Upon zebra shutdown hash_clean_and_free is called
where user free function is passed,
The free function should not call hash_release
which lead to double free of hash bucket.
Fix:
The fix is to avoid calling hash_release from
free function if its called from hash_clean_and_free
path.
10 0x00007f0422b7df1f in free () from /lib/x86_64-linux-gnu/libc.so.6
11 0x00007f0422edd779 in qfree (mt=0x7f0423047ca0 <MTYPE_HASH_BUCKET>,
ptr=0x55fc8bc81980) at ../lib/memory.c:130
12 0x00007f0422eb97e2 in hash_clean (hash=0x55fc8b979a60,
free_func=0x55fc8a529478 <svd_nh_del_terminate>) at
../lib/hash.c:290
13 0x00007f0422eb98a1 in hash_clean_and_free (hash=0x55fc8a675920
<svd_nh_table>, free_func=0x55fc8a529478 <svd_nh_del_terminate>) at
../lib/hash.c:305
14 0x000055fc8a5323a5 in zebra_vxlan_terminate () at
../zebra/zebra_vxlan.c:6099
15 0x000055fc8a4c9227 in zebra_router_terminate () at
../zebra/zebra_router.c:276
16 0x000055fc8a4413b3 in zebra_finalize (dummy=0x7fffb881c1d0) at
../zebra/main.c:269
17 0x00007f0422f44387 in event_call (thread=0x7fffb881c1d0) at
../lib/event.c:2011
18 0x00007f0422ecb6fa in frr_run (master=0x55fc8b733cb0) at
../lib/libfrr.c:1243
19 0x000055fc8a441987 in main (argc=14, argv=0x7fffb881c4a8) at
../zebra/main.c:584
Signed-off-by: Chirag Shah <chirag@nvidia.com>
Add a new test case that re-add the deleted SIDs and verifies that all
SIDs are added back to the RIB.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Add a new test case that deletes a SID and verifies that only this
SID has been removed from the RIB.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
When a user wants to delete a specific SRv6 SID, he executes the
`no sid X:X::X:X/M` command.
However, by mistake, in addition to deleting the SID requested by the
user, this command also removes all other SIDs.
This happens because `no sid X:X::X:X/M` triggers a destroy operation
on the wrong xpath `frr-staticd:staticd/segment-routing/srv6`.
This commit fixes the issue by replacing the wrong xpath
`frr-staticd:staticd/segment-routing/srv6` with the correct xpath
`frr-staticd:staticd/segment-routing/srv6/static-sids/sid[sid='%s']`.
This ensures that the `no sid X:X::X:X/M` command only deletes the SID
that was requested by the user.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Just to make it simpler for compiling with a different default value.
No change to its default value.
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
When staticd receives a `ZAPI_SRV6_SID_RELEASED` notification from SRv6
SID Manager, it tries to unset the validity flag of `sid`. But since
the `sid` variable is NULL, we get a NULL pointer dereference.
```
=================================================================
==13815==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000060 (pc 0xc14b813d9eac bp 0xffffcb135a40 sp 0xffffcb135a40 T0)
==13815==The signal is caused by a READ memory access.
==13815==Hint: address points to the zero page.
#0 0xc14b813d9eac in static_zebra_srv6_sid_notify staticd/static_zebra.c:1172
#1 0xe44e7aa2c194 in zclient_read lib/zclient.c:4746
#2 0xe44e7a9b69d8 in event_call lib/event.c:1984
#3 0xe44e7a85ac28 in frr_run lib/libfrr.c:1246
#4 0xc14b813ccf98 in main staticd/static_main.c:193
#5 0xe44e7a4773f8 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#6 0xe44e7a4774c8 in __libc_start_main_impl ../csu/libc-start.c:392
#7 0xc14b813cc92c in _start (/usr/lib/frr/staticd+0x1c92c)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV staticd/static_zebra.c:1172 in static_zebra_srv6_sid_notify
==13815==ABORTING
```
This commit fixes the problem by doing a SID lookup first. If the SID
can't be found, we log an error and return. If the SID is found, we go
ahead and unset the validity flag.
Signed-off-by: Carmine Scarpitta <cscarpit@cisco.com>
Just to make it simpler for compiling with a different default value.
No change to its default value.
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
When you have suppress-fib-pending turned on it is possible
to end up in a situation where the prefix is not withdrawn
from downstream peers.
Here is the timing that I believe is happening:
a) have 2 paths to a peer.
b) receive a withdrawal from 1 path, set BGP_NODE_FIB_INSTALL_PENDING
and send the route install to zebra.
c) receive a withdrawal from the other path.
d) At this point we have a dest->flags set BGP_NODE_FIB_INSTALL_PENDING
old_select the path_info going away, new_select is NULL
e) A bit further down we call group_announce_route() which calls
the code to see if we should advertise the path. It sees the
BGP_NODE_FIB_INSTALL_PENDING flag and says, nope.
f) the route is sent to zebra to withdraw, which unsets the
BGP_NODE_FIB_INSTALL_PENDING.
g) This function winds up and deletes the path_info. Dest now
has no path infos.
h) BGP receives the route install(from step b) and unsets the
BGP_NODE_FIB_INSTALL_PENDING flag
i) BGP receives the route removed from zebra (from step f) and
unsets the flag again.
We know if there is no new_select, let's go ahead and just
unset the PENDING flag to allow the withdrawal to go out
at the time when the second withdrawal is received.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When called without caps/privs, just return from "change_caps"
instead of exiting - it's possible that a process may not need
privs, but a lib (for example) may use the api.
Signed-off-by: Mark Stapp <mjs@cisco.com>
When looping through the dplane providers, the worklist was
being populated with items from the last provider and then
the event system was checked to see if we should stop processing.
If the event system says `yes` then the dplane code would stop
and send the worklist to the master zebra pthread for collection.
This obviously skipped the next dplane provider on the list
which is double plus not good.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>