This is the first in a series of commits whose goal is to rename
the thread system in FRR to an event system. There is a continual
problem where people confuse `struct thread` with a true
pthread. In reality, our entire thread.c is an event system.
In this commit, rename the thread.[ch] files to event.[ch].
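A minimal sketch of the usage pattern that motivates the rename
(pre-rename API names, for illustration only): scheduling a
`struct thread` merely queues a callback on the daemon's event loop;
no pthread is ever created.

static void timer_cb(struct thread *t)
{
        /* runs later on the daemon's event loop, not on a new pthread */
}

void schedule_example(struct thread_master *master)
{
        struct thread *t_timer = NULL;

        /* queue timer_cb to fire in 5 seconds; an event, not a thread */
        thread_add_timer(master, timer_cb, NULL, 5, &t_timer);
}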
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The label allocation per nexthop mode requires the use of a nexthop
tracking context. For redistributed routes, a nexthop tracking
context is created, and the resolution helps to know the real
nexthop ip address used. The below configuration example has
been used:
> vrf vrf1
> ip route 172.31.0.14/32 192.0.2.14
> ip route 172.31.0.15/32 192.0.2.12
> ip route 172.31.0.30/32 192.0.2.30
> exit
> router bgp 65500 vrf vrf1
> address-family ipv4 unicast
> redistribute static
> label vpn export per-nexthop
> [..]
The static routes are correctly imported into the BGP IPv4 RIB.
Contrary to the label allocation per vrf mode, some nexthop tracking
contexts are created or reused:
> # show bgp vrf vrf1 nexthop
> 192.0.2.12 valid [IGP metric 0], #paths 3, peer 192.0.2.12
> if r1-eth1
> Last update: Fri Jan 13 15:49:42 2023
> 192.0.2.14 valid [IGP metric 0], #paths 1
> if r1-eth1
> Last update: Fri Jan 13 15:49:42 2023
> 192.0.2.30 valid [IGP metric 0], #paths 1
> if r1-eth1
> Last update: Fri Jan 13 15:49:51 2023
> [..]
This results in having a BGP VPN route for each of the static
routes:
> # show bgp ipv4 vpn
> [..]
> Route Distinguisher: 444:1
> *> 172.31.0.14/32 192.0.2.14@9< 0 32768 ?
> *> 172.31.0.15/32 192.0.2.12@9< 0 32768 ?
> *> 172.31.0.30/32 192.0.2.30@9< 0 32768 ?
> [..]
Without this patch, only the redistributed routes that rely on a
pre-existing nexthop tracking context could be exported.
Also, a comment in the code about redistributed routes is modified
accordingly, to explain that redistributed routes may be subject
to nexthop tracking when label allocation per next-hop is used.
note:
VNC routes have been removed from the redistribution,
because of a test failure in the bgp_l3vpn_to_bgp_direct test.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This commit introduces a new method to associate a label with
prefixes to export to a VPNv4 backbone. All the methods to
associate a label with a BGP update are documented in RFC 4364,
chapter 4.3.2. Initially, only the "single label for an entire
VRF" method was available. This commit adds the "single label
for each attachment circuit" method.
The change impacts the control-plane, because each BGP update
is checked to determine whether the nexthop is reachable in the
VRF or not. If it is, then a unique label for a given
destination IP in the VRF will be picked. This label will
be reused for another BGP update that has the same
nexthop IP address.
The change impacts the data-plane, because the MPLS pop
mechanism applied to incoming labelled packets changes: the
MPLS label is popped, and the packet is directly sent to the
connected nexthop described in the previous outgoing BGP VPN
update.
By default, the per-vrf mode is used, but the user may choose
the per-nexthop mode by using the vty command from the
previous commit. In the latter case, a per-vrf label
will however be allocated to handle networks that are not directly
connected. This is the case for local traffic, for instance.
The change also includes the following:
- ECMP case
When a route is learnt in a given VRF and is resolved via an
ECMP nexthop, exporting the route as a BGP update with label
allocation per nexthop would require picking two possible MPLS
values, which is not possible with the current implementation:
the NLRI for VPNv4 stores one prefix and one single label value,
not two. Today, RFC 8277 with the multiple label capability is
not yet available.
To avoid this corner case, when a route is resolved via more than one
nexthop, label allocation per nexthop will not apply, and the
default per-vrf label will be chosen (see the sketch after this
list).
Let us imagine BGP redistributes a static route using the `172.31.0.20`
nexthop. The nexthop resolution will find two different nexthops for a
unique BGP update.
> r1# show running-config
> [..]
> vrf vrf1
> ip route 172.31.0.30/32 172.31.0.20
> r1# show bgp vrf vrf1 nexthop
> [..]
> 172.31.0.20 valid [IGP metric 0], #paths 1
> gate 192.0.2.11
> gate 192.0.2.12
> Last update: Mon Jan 16 09:27:09 2023
> Paths:
> 1/1 172.31.0.30/32 VRF vrf1 flags 0x20018
To avoid this situation, BGP updates that resolve over multiple
nexthops use the unique per-vrf label.
- recursive route case
Prefixes that need a recursive route to be resolved can
also be eligible for mpls allocation per nexthop. In that
case, the nexthop will be the calculated recursive nexthop.
To achieve this, all nexthop types in bnc contexts are valid,
except for blackhole nexthops.
- network declared prefixes
Nexthop tracking is used to look for the reachability of the
prefixes. When the 'no bgp network import-check' command
is used, network declared prefixes are kept active,
even if there is no active nexthop.
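The resulting label choice can be summarized by the sketch below; the
helper and field names are hypothetical, not the actual implementation:

/* Sketch: pick the label to advertise with an exported VPN path. */
static mpls_label_t vpn_export_label(struct bgp *bgp,
                                     struct bgp_nexthop_cache *bnc)
{
        if (!bgp->per_nexthop_label_mode) /* per-vrf mode (default) */
                return bgp->vpn_label;

        if (bnc->nexthop_num > 1)         /* ECMP: NLRI holds one label */
                return bgp->vpn_label;

        if (bnc_is_blackhole(bnc))        /* blackhole nexthops excluded */
                return bgp->vpn_label;

        /* allocate, or reuse, the label bound to this nexthop */
        return label_per_nexthop_get(bgp, &bnc->prefix);
}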
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This causes an early return: peer->conf is NULL for IPv6 link-local
peering, and the session never establishes.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
RD may be built based on an AS number. Like for the AS, the RD
may use the AS notation. The two below examples can illustrate:
RD 1.1:20 stands for an AS4B:NN RD with AS4B=65536 in dot format.
RD 0.1:20 stands for an AS2B:NNNN RD with AS2B=0.1 in dot+ format.
This commit adds the asnotation mode to prefix_rd2str() API so as
to pick up the relevant display.
Two new printfrr extensions are available to display the RD with
the two above display methods.
- The pRDD extension stands for dot asnotation format
- The pRDE extension stands for dot+ asnotation format.
- The pRD extension has been renamed to pRDP extension
The code is changed each time the '%pRD' printf extension is called.
Since the asnotation may change the output, a macro defines the
asnotation mode to use. A side effect of forcing the mode to use
is that the string cannot be concatenated with other strings in
vty_out and snprintfrr; those functions are therefore called
multiple times. When zlog_debug needs to display the RD with some
other string, the old prefix_rd2str() API is used instead of the
printf extension.
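A hedged usage sketch of the three extensions (the RD_ADDRSTRLEN
buffer size and the prd variable are assumed for illustration):

char buf[RD_ADDRSTRLEN];
struct prefix_rd prd;

snprintfrr(buf, sizeof(buf), "%pRDP", &prd); /* plain notation */
snprintfrr(buf, sizeof(buf), "%pRDD", &prd); /* dot notation */
snprintfrr(buf, sizeof(buf), "%pRDE", &prd); /* dot+ notation */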
Some code has been kept untouched:
- code related to running-config. Actually, wherever an RD is displayed,
its configured name should be dumped.
- bgp rfapi code
- bgp evpn multihoming code (partially done), since the logic is
missing to get the asnotation of 'struct bgp_evpn_es'.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Previous commits have introduced a new 8-bit nh_flag in the attr
struct, which has increased the memory footprint.
Move the mp_nexthop_prefer_global boolean, which takes 8 bits in
the attr structure, to the new nh_flag in order to return to the
previous memory utilization.
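A sketch of the change; the flag name is assumed for illustration:

struct attr {
        /* ... */
        /* bool mp_nexthop_prefer_global; -- removed, now a bit below */
        uint8_t nh_flag; /* the new 8-bit flag field */
};

#define BGP_ATTR_NH_MP_PREFER_GLOBAL (1 << 0) /* assumed flag name */

/* accesses become flag operations, e.g.:
 * CHECK_FLAG(attr->nh_flag, BGP_ATTR_NH_MP_PREFER_GLOBAL)
 */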
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
The following configuration creates an infinite routing leaking loop
because 'rt vpn both' parameters are the same in both VRFs.
> router bgp 5227 vrf r1-cust4
> no bgp network import-check
> bgp router-id 192.168.1.1
> address-family ipv4 unicast
> network 28.0.0.0/24
> rd vpn export 10:12
> rt vpn both 52:100
> import vpn
> export vpn
> exit-address-family
> !
> router bgp 5227 vrf r1-cust5
> no bgp network import-check
> bgp router-id 192.168.1.1
> address-family ipv4 unicast
> network 29.0.0.0/24
> rd vpn export 10:13
> rt vpn both 52:100
> import vpn
> export vpn
> exit-address-family
The previous commit added a routing leak update when a nexthop
update is received from zebra. It indirectly calls
bgp_find_or_add_nexthop(), in which a static route triggers a nexthop
cache entry registration that in turn triggers a nexthop update from
zebra.
Do not register the nexthop cache entry again if BGP_STATIC_ROUTE is
already set.
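A minimal sketch of the guard; its exact placement inside
bgp_find_or_add_nexthop() is paraphrased:

/* a registered static route's entry must not be registered again,
 * or the resulting zebra update re-enters this code path */
if (CHECK_FLAG(bnc->flags, BGP_STATIC_ROUTE))
        return 1; /* already registered; nothing more to do */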
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
If 'network import-check' is defined on the source BGP session,
prefixes stated in the network command cannot be leaked to the other
VRFs' BGP tables, even if they are present in the origin VRF RIB,
when the 'rt import' statement is defined after the 'network
<prefix>' ones.
When a prefix nexthop is updated, update the prefix route leaking. The
current state of nexthop validation is now stored in the attributes of
the bgp path info. Attributes are compared with the previous ones at
route leaking update so that a nexthop validation change now triggers
the update of destination VRF BGP table.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
"if not XX else" statements are confusing.
Replace two "if not XX else" statements with "if XX else" to prepare
for the next commits. The patch is only cosmetic.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Since the commit da0c0ef70c ("bgpd: VRF-Lite fix best path selection"),
the best path selection is made from the comparison of the attributes
of the original route i.e. the ultimate path.
The IGP metric is currently set on the child path instead of the
ultimate path (i.e. the parent path). On eBGP, the ultimate path is
the child path itself. However, for imported routes, the IGP metric
of the ultimate path is always 0, which results in skipping the IGP
metric comparison when selecting the best path.
Set the IGP metric on the ultimate path when a BGP nexthop is added or
updated.
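A sketch of the fix; the parent-chain walk via the path-info extra
structure is paraphrased, not the literal diff:

struct bgp_path_info *bpi_ultimate = path;

/* follow imported paths up to the ultimate (parent) path */
while (bpi_ultimate->extra && bpi_ultimate->extra->parent)
        bpi_ultimate = bpi_ultimate->extra->parent;

/* set the metric where best-path selection will read it */
bpi_ultimate->igpmetric = bnc->metric;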
Fixes: da0c0ef70c ("bgpd: VRF-Lite fix best path selection")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
In the case of BGP unnumbered, BGP fails to free the nexthop
node for the peer if the interface is shut down before
unconfiguring/deleting the BGP neighbor.
This is because, when the interface is shut down, the
peer's LL neighbor address will be cleared. Therefore,
during neighbor deletion, since the peer's neighbor
address is not available, BGP will skip freeing the
nexthop node of this peer. This results in a stale
nexthop node that points to a peer that has already
been freed.
Ticket: 3191547
Signed-off-by: Pooja Jagadeesh Doijode <pdoijode@nvidia.com>
This patch just introduces the callback mechanism for the
resilient nexthop changes so that upper-level daemons
can take advantage of the change. It does nothing
at this point but call some code.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
RFC 4364 describes peerings between multiple AS domains, to ease
the continuity of VPN services across multiple SPs. This commit
implements a subset of the IETF option b) described in chapter 10 b.
The ASBR-to-ASBR approach is taken, with an EBGP peering between
the two routers. The EBGP peering must be directly connected to
the outgoing interface used. Under those conditions, the next hop
is directly connected, and there is no need for a transport
label to convey the VPN label. A new vty command is added on a
per-interface basis.
If enabled, this command permits conveying BGP VPN labels
without any transport labels (i.e. with the implicit-null label).
Restriction:
this command is used only for EBGP directly connected peerings.
Other use cases are not covered.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When a route imported from l3vpn is analysed, the nexthop from the
default VRF is looked up against a valid MPLS path. Generally, this is
done on backbones with an MPLS signaling transport layer like LDP, and
the BGP connection is multiple hops away. That scenario is already
working.
There is a case where it is possible to run L3VPN over GRE interfaces
with no LSP path over that GRE interface: GRE is just there to
tunnel MPLS traffic. In that case, the nexthop given in the path does
not have an MPLS path, but should be authorized to convey MPLS traffic
provided that the user permits it via a configuration command.
This commit introduces a new command that can be activated in a
route-map:
> set l3vpn next-hop encapsulation gre
That command authorizes the nexthop tracking engine to accept paths
that have a GRE interface as output, regardless of the presence or
absence of an LSP path.
A configuration example is given below. When incoming BGP VPNv4
updates are received, the nexthop of the NLRI is 192.168.0.2. Based on
the nexthop tracking service from zebra, BGP knows that the output
interface to reach 192.168.0.2 is r1-gre0. Because that interface is
not MPLS-based but a GRE tunnel, the update will be installed using
that nexthop.
interface r1-gre0
ip address 192.168.0.1/24
exit
router bgp 65500
bgp router-id 1.1.1.1
neighbor 192.168.0.2 remote-as 65500
!
address-family ipv4 unicast
no neighbor 192.168.0.2 activate
exit-address-family
!
address-family ipv4 vpn
neighbor 192.168.0.2 activate
neighbor 192.168.0.2 route-map rmap in
exit-address-family
exit
!
router bgp 65500 vrf vrf1
bgp router-id 1.1.1.1
no bgp network import-check
!
address-family ipv4 unicast
network 10.201.0.0/24
redistribute connected
label vpn export 101
rd vpn export 444:1
rt vpn both 52:100
export vpn
import vpn
exit-address-family
exit
!
route-map rmap permit 1
set l3vpn next-hop encapsulation gre
exit
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Let's convert to our actual library call instead
of using yet another abstraction that makes it fun
for people to switch daemons.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
RFC 4760 states we SHOULD ignore the NEXT_HOP attribute for BGP Update
messages carrying only MP_REACH_NLRI attributes. Thus we should use the
Network Address of Next Hop field of the MP_REACH_NLRI as the nexthop.
Instead of always looking for BGP_ATTR_NEXT_HOP, this commit ensures:
1) we set mp_nexthop_len to BGP_ATTR_NHLEN_IPV4 for v4 bgp_static routes
2) we check mp_nexthop_len when choosing the nexthop to use for nht
3) we check mp_nexthop_len when choosing the nexthop to send to zebra
4) we check mp_nexthop_len when picking the nexthop shown by vtysh
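A condensed sketch of the v4 selection behind points 2) through 4)
(not the literal diff):

static struct in_addr path_nexthop_v4(const struct attr *attr)
{
        /* prefer the MP_REACH_NLRI next hop when one was received */
        if (attr->mp_nexthop_len == BGP_ATTR_NHLEN_IPV4)
                return attr->mp_nexthop_global_in;

        /* otherwise fall back to the legacy NEXT_HOP attribute */
        return attr->nexthop;
}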
Reported-by: Binon Gorbutt <binon@aervivo.com>
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
FRR should create a bnc per peer, not have
ones that write over others. Currently, when
FRR has multiple interface-based peerings, BGP was
creating a single BNC. This is insufficient in that
we were accidentally overwriting the one LL with other
data. This causes issues when there are multiple peers, and
there are weird startup issues with those interfaces
that you are peering over.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Commit:
9f002fa5dd
Accidentally broke the handling of SR color for nexthops
in BGP. Put it back.
Fixes: #11237
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Fix: 06e4e90132
Modified BGP to pay more attention to the prefix returned from
zebra to ensure that an LPM match wasn't accidentally causing BGP
import checks to think they had a match when they did not.
This unfortunately removed the check to handle the route
removal.
This sequence of config and events would leave BGP in a bad state:
ip route 100.100.100.0/24 Null0
router bgp 32932
bgp network import-check
address-family ipv4 uni
network 100.100.100.0/24
Then if you removed the static route the import check would
still think the route existed:
donatas-pc(config)# ip route 100.100.100.0/24 Null0
donatas-pc(config)# do sh ip bgp import-check-table
Current BGP import check cache:
100.100.100.0 valid [IGP metric 0], #paths 1
blackhole
Last update: Sat Apr 23 22:51:34 2022
donatas-pc(config)# do sh ip nht
100.100.100.0
resolved via static
is directly connected, Null0
Client list: bgp(fd 17)
donatas-pc(config)# do sh ip bgp neighbors 192.168.10.123 advertised-routes | include 100.100.100.0
*> 100.100.100.0/24 0.0.0.0 0 32768 i
donatas-pc(config)# no ip route 100.100.100.0/24 Null0
donatas-pc(config)# do sh ip nht
100.100.100.0
resolved via kernel
via 192.168.10.1, enp3s0
Client list: bgp(fd 17)
donatas-pc(config)# do sh ip bgp import-check-table
Current BGP import check cache:
100.100.100.0 valid [IGP metric 0], #paths 1
blackhole
Last update: Sat Apr 23 22:51:34 2022
donatas-pc(config)# do sh ip bgp neighbors 192.168.10.123 advertised-routes | include 100.100.100.0
*> 100.100.100.0/24 0.0.0.0 0 32768 i
donatas-pc(config)#
Fix this by moving the code that handles the prefix check into the
evaluation function, marking the bnc as not matching, and actually
evaluating the bnc.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
In some stress testing, we are seeing type-5 evpn routes being
left in a rejected state in zebra.
Sequence of events as I am seeing it:
a) Interface comes up that the type-5 routes' nexthop depends on
b) zebra processes it, creates the connected route and lets bgp know via nht
c) bgp installs the route to zebra
d) zebra processes and sends install to kernel
e) before the route is installed, the interface the nexthop points at flaps
f) the route install is rejected and zebra is notified
g) the interface comes up
h) zebra gets the notification about the route install rejection
i) zebra processes the down/up and turns it into a single up event
j) BGP never reinstalls the type 5 route
When the events happen quickly enough and/or zebra is extremely
busy, this up event does not translate into a nexthop tracking event,
and bgp never sees that the nexthops changed, however briefly.
This is the same thing that was going on with
https://github.com/FRRouting/frr/pull/7724
in PBR.
To fix this, let's notice the interface up/down events for v4
in bgp now as well.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently the nexthop tracking code only sends to the requestor
what it was requested to match against. When the nexthop tracking
code was simplified in b8210849b8 to not need both an import check
and a nexthop check for bgpd, it was not noticed that a different,
longer prefix could match and would be seen as a match, because FRR
was not sending up both the resolved route's prefix and the route
FRR was asked to match against.
This code change causes the nexthop tracking code to pass
back up the matched requested route (so that the calling
protocol can figure out which one it is being told about)
as well as the actual prefix that was matched.
Fixes: #10766
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Since f60a1188 we store a pointer to the VRF in the interface structure.
There's no need anymore to store a separate vrf_id field.
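The mechanical shape of the change, as a sketch:

vrf_id_t vrf_id;

vrf_id = ifp->vrf_id;      /* before: separate, duplicated field */
vrf_id = ifp->vrf->vrf_id; /* after: derived from the VRF pointer */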
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
We had various forms of min/max macros across multiple daemons
all of which duplicated what we have in compiler.h. Convert
everyone to use the `correct` ones
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
if_lookup_by_index_all_vrf doesn't work correctly with the netns VRF
backend, as the same index may be used in multiple netns simultaneously.
We always know the BGP instance we work with, so use its VRF id for the
interface lookup.
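In sketch form:

/* before: ambiguous with netns, the ifindex may exist in many netns */
ifp = if_lookup_by_index_all_vrf(ifindex);

/* after: scope the lookup to the instance's own VRF */
ifp = if_lookup_by_index(ifindex, bgp->vrf_id);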
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
These are no longer really needed. The client just needs
to call nexthop resolution instead.
So let's remove the zapi types.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Allow bgp to figure out if it cares about address resolution instead
of having zebra care about it. This will allow the removal of the
zapi type for import checking and just use nexthop resolution.
Effectively we just look up the route being returned and
if it is in either table we just handle it instead of
looking for clues from the zapi message type.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Some received BGP updates invite the local router to
install a route through itself. The system will not do it, and
the route should be considered invalid from the start.
This case is detected in zebra, and this detection prevents
trying to install such a route into the local system. However,
the nexthop tracking mechanism is called, and acts as if the route
were valid, which is not the case.
By detecting that use case in BGP, we avoid installing the invalid
routes.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When bgp peers with ipv6 link-local addresses, it may receive a
BGP update with a next-hop containing both LL and GA information.
By default, nexthop tracking applies to the GA and ignores the
presence of the LL when both addresses are present. This is a
problem for resolving the GA as next-hop, as the next-hop
information can be resolved by using the LL address only.
The solution consists of defaulting the nexthop ipv6 choice to the
LL when available, and moving back to the GA if a route-map is
locally configured at inbound.
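A sketch of the selection; the route-map predicate is a hypothetical
helper name:

struct in6_addr nh;

if (attr->mp_nexthop_len == BGP_ATTR_NHLEN_IPV6_GLOBAL_AND_LL
    && !peer_has_inbound_routemap(peer)) /* hypothetical helper */
        nh = attr->mp_nexthop_local;     /* default to the LL */
else
        nh = attr->mp_nexthop_global;    /* move back to the GA */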
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When an EVPN prefix route with a gateway IP overlay index is imported
into the IP vrf at the ingress PE, the BGP nexthop of this route is set
to the gateway IP. For this vrf route to be valid, the following
conditions must be met:
- The gateway IP nexthop of this route should be L3 reachable, i.e.,
the route should be resolved in the RIB.
- A remote MAC/IP route should be present for the gateway IP address in
the EVI (L2VPN table).
To check for the first condition, gateway IP is registered with nht (nexthop
tracking) to receive the reachability notifications for this IP from zebra RIB.
If the gateway IP is reachable, zebra sends the reachability information (i.e.,
nexthop interface) for the gateway IP.
This nexthop interface should be the SVI interface.
Now, to find the type-2 route corresponding to the gateway IP, we need
to fetch the VNI for the above SVI.
To do this VNI lookup efficiently, define a hashtable of struct bgpevpn
with svi_ifindex as key.
struct hash *vni_svi_hash;
An EVI instance is added to vni_svi_hash if its svi_ifindex is nonzero.
Using this hash, we obtain the struct bgpevpn corresponding to the
gateway IP.
For gateway IP overlay index recursive lookup, once we find the correct EVI, we
have to lookup its route table for a MAC/IP prefix. As we have to iterate the
entire route table for every lookup, this lookup is expensive. We can optimize
this lookup by adding all the remote IP addresses in a hash table.
The following hash table is defined for this purpose in struct
bgpevpn:
struct hash *remote_ip_hash;
When a MAC/IP route is installed in the EVI table, it is also added to
remote_ip_hash.
It is possible to have multiple MAC/IP routes with the same IP address
because of host-move scenarios. Thus, for every address addr in
remote_ip_hash, we maintain a list of all the MAC/IP routes having addr
as their IP address.
The following structure defines an address in remote_ip_hash:
struct evpn_remote_ip {
struct ipaddr addr;
struct list *macip_path_list;
};
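A hedged sketch of the hash callbacks such a table needs; the function
names are illustrative, with jhash and ipaddr_cmp assumed as the
library helpers:

static unsigned int evpn_remote_ip_hash_key(const void *p)
{
        const struct evpn_remote_ip *rip = p;

        return jhash(&rip->addr, sizeof(rip->addr), 0);
}

static bool evpn_remote_ip_hash_cmp(const void *p1, const void *p2)
{
        const struct evpn_remote_ip *a = p1, *b = p2;

        return ipaddr_cmp(&a->addr, &b->addr) == 0;
}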
A Boolean field is added to struct bgp_nexthop_cache to indicate that the
nexthop is EVPN gateway IP overlay index.
bool is_evpn_gwip_nexthop;
A flag BGP_NEXTHOP_EVPN_INCOMPLETE is added to struct bgp_nexthop_cache.
This flag is set when the gateway IP is L3 reachable but not yet resolved by a
MAC/IP route.
The following table explains the combination of L3 and L2 reachability
w.r.t. the BGP_NEXTHOP_VALID and BGP_NEXTHOP_EVPN_INCOMPLETE flags:

                | MACIP resolved | MACIP unresolved
----------------|----------------|------------------
 L3 reachable   | VALID = 1      | VALID = 0
                | INCOMPLETE = 0 | INCOMPLETE = 1
----------------|----------------|------------------
 L3 unreachable | VALID = 0      | VALID = 0
                | INCOMPLETE = 0 | INCOMPLETE = 0
The procedure used to check if the gateway IP is resolvable by a
MAC/IP route:
- Find the EVI/L2VRF that the nexthop SVI belongs to, using
vni_svi_hash.
- Check if the gateway IP is present in remote_ip_hash in this EVI.
When the gateway IP is L3 reachable and is also resolved by a MAC/IP
route, unset the BGP_NEXTHOP_EVPN_INCOMPLETE flag and set the
BGP_NEXTHOP_VALID flag.
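The table reduces to the following sketch, where l3_reachable and
macip_resolved stand in for the actual checks:

if (l3_reachable && macip_resolved) {
        SET_FLAG(bnc->flags, BGP_NEXTHOP_VALID);
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_EVPN_INCOMPLETE);
} else if (l3_reachable) {
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_VALID);
        SET_FLAG(bnc->flags, BGP_NEXTHOP_EVPN_INCOMPLETE);
} else {
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_VALID);
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_EVPN_INCOMPLETE);
}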
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
The IP/IPv6 prefix carried with an EVPN RT-5 is imported into the BGP
vrf according to the attached route targets.
If the prefix carries a gateway IP overlay index, this gateway IP
should be installed as the nexthop of the route imported into the BGP
vrf.
The route in the vrf will be marked as VALID only if the nexthop is
resolved in the SVI network.
To receive runtime reachability information for the nexthop, register
it with the nexthop tracking module.
Send this route to zebra after processing.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
The code processing an NHT update was only resetting the
BGP_NEXTHOP_VALID flag, so labeled nexthops were considered valid even
if there was no nexthop. Reset the flag in response to the update, and
also make the isvalid_nexthop functions a little more robust by
checking the number of nexthops.
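In sketch form, with nhr standing for the received zebra nexthop
update:

if (nhr->nexthop_num == 0) {
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_VALID);
        UNSET_FLAG(bnc->flags, BGP_NEXTHOP_LABELED_VALID);
}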
Signed-off-by: Emanuele Di Pascale <emanuele@voltanet.io>
The new LL code in:
8761cd6ddb
introduced the idea of bgp unnumbered peers using interface up/down
events to track the bgp peer's nexthop. This code was not working
properly when a connection was received from a peer in some
circumstances. Effectively, the connection from a peer was immediately
skipping state transitions, and FRR never properly tracked the peer's
nexthop. When we receive the connection attempt, let's track the
nexthop now.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
For link-local IPv6 next hops, the next hop tracking is implemented based
on interface status changes. For this purpose, the ifindex is stored in
the NHT. Reset this value if a change in ifindex is noticed, such as for
example after a restart of the networking service.
Also add some additional debug logs.
Signed-off-by: Vivek Venkatraman <vivek@nvidia.com>
Updates: "bgpd: Switch LL nexthop tracking to be interface based"
Ticket: RM 2575386
Testing Done:
1. Manual verification
2. Precommit (#156), evpn-smoke (#155), bgp-smoke (#157), vrl (#158)
-- Precommit is clean; reported failures in evpn-smoke & vrl are resolved
-- some other tests fail in evpn-smoke, bgp-smoke & vrl; these appear
   to be existing or unrelated failures
The v6 LL commit 8761cd6ddb
was incorrectly setting the metric value to 1 for the underlying
connected interface. Modify the code to use a metric value of 0
instead of 1, which now represents the actual metric value that
was originally passed up.
This was noticed when the `show bgp ipv4 uni` command was
inserting a `(metric 1)` into output where before it was not.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Recent changes allowing bgpd to handle v6 LL slightly
differently in the nexthop tracking code have not
interacted well with the blackhole nexthop change
for peers. Modify the code to do the right thing.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
bgp is currently registering v6 LL as nexthops to be tracked
with zebra. This presents several problems:
a) zebra does not properly track multiple prefixes that match
the same route at this point in time.
b) BGP was receiving nexthops that were just incorrect because
of (a).
c) When a nexthop changed that really didn't affect the v6 LL,
we were responding incorrectly because of this.
Modify the code such that bgp nexthop tracking notices that
we are trying to register a v6 LL. When we do so, shortcut
and watch interface up/down events for this v6 LL and do
the work when an interface goes up / down for this type
of tracking.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When bgp registers for a nexthop that is not reachable due
to the nexthop pointing to a blackhole, bgp is never going
to be able to reach it when attempting to open a connection.
Broken behavior:
<show bgp nexthop>
192.168.161.204 valid [IGP metric 0], #paths 0, peer 192.168.161.204
blackhole
Last update: Thu Feb 11 09:46:10 2021
eva# show bgp ipv4 uni summ fail
BGP router identifier 10.10.3.11, local AS number 3235 vrf-id 0
BGP table version 40
RIB entries 78, using 14 KiB of memory
Peers 2, using 54 KiB of memory
Neighbor EstdCnt DropCnt ResetTime Reason
192.168.161.204 0 0 never Waiting for peer OPEN
The log file fills up with this type of message:
2021-02-09T18:53:11.653433+00:00 nq-sjc6c-cor-01 bgpd[6548]: can't connect to 24.51.27.241 fd 26 : Invalid argument
2021-02-09T18:53:21.654005+00:00 nq-sjc6c-cor-01 bgpd[6548]: can't connect to 24.51.27.241 fd 26 : Invalid argument
2021-02-09T18:53:31.654381+00:00 nq-sjc6c-cor-01 bgpd[6548]: can't connect to 24.51.27.241 fd 26 : Invalid argument
2021-02-09T18:53:41.654729+00:00 nq-sjc6c-cor-01 bgpd[6548]: can't connect to 24.51.27.241 fd 26 : Invalid argument
2021-02-09T18:53:51.655147+00:00 nq-sjc6c-cor-01 bgpd[6548]: can't connect to 24.51.27.241 fd 26 : Invalid argument
This is because the connect to a blackhole is correctly rejected by the kernel.
Fixed behavior:
eva# show bgp ipv4 uni summ
BGP router identifier 10.10.3.11, local AS number 3235 vrf-id 0
BGP table version 40
RIB entries 78, using 14 KiB of memory
Peers 2, using 54 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
annie(192.168.161.2) 4 64539 126264 39 0 0 0 00:01:36 38 40 N/A
192.168.161.178 4 0 0 0 0 0 0 never Active 0 N/A
Total number of neighbors 2
eva# show bgp ipv4 uni summ fail
BGP router identifier 10.10.3.11, local AS number 3235 vrf-id 0
BGP table version 40
RIB entries 78, using 14 KiB of memory
Peers 2, using 54 KiB of memory
Neighbor EstdCnt DropCnt ResetTime Reason
192.168.161.178 0 0 never Waiting for NHT
Total number of neighbors 2
eva# show bgp nexthop
Current BGP nexthop cache:
192.168.161.2 valid [IGP metric 0], #paths 38, peer 192.168.161.2
if enp39s0
Last update: Thu Feb 11 09:52:05 2021
192.168.161.131 valid [IGP metric 0], #paths 0, peer 192.168.161.131
if enp39s0
Last update: Thu Feb 11 09:52:05 2021
192.168.161.178 invalid, #paths 0, peer 192.168.161.178
Must be Connected
Last update: Thu Feb 11 09:53:37 2021
eva#
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
If we are using a nexthop for an MPLS VPN route, make sure the
nexthop is over a labeled path. This new check mirrors the one
in validate_paths (where routes are enabled when a nexthop
becomes reachable). The check is introduced to the code path
where routes are added and the nexthop is looked up.
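A sketch of the added check; the VPN predicate is a hypothetical
helper name:

if (path_is_mpls_vpn(path) /* hypothetical helper */
    && !CHECK_FLAG(bnc->flags, BGP_NEXTHOP_LABELED_VALID))
        return 0; /* nexthop not over a labeled path: path invalid */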
Signed-off-by: Pat Ruddy <pat@voltanet.io>
Two L3 nexthop groups are installed per-VRF per-ES, for v4 and v6.
These NHGs are used as an indirect destination for symmetric IRB host
routes.
Using L3NHGs allows for efficient failover of an ES (similar to the
use of L2NHGs), i.e. when an ES goes down the number of dataplane
updates is limited to 2xN (where N is the number of tenant VRFs
associated with the ES) instead of updating all host routes behind
the ES.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
ES-VRF entries are maintained for the purpose of L3-NHG creation:
1. Each ES-EVI entry is associated with a tenant VRF. This association
triggers the creation of an ES-VRF entry.
2. Type-2/MAC-IP routes are imported into a tenant VRF and programmed
as a /32 or host route entry in the dataplane. If the destination of
the host route is a remote ES, the route is programmed with the
corresponding L3-NHG (keyed by {vrf, ES-id}).
3. The reason for this indirection (route->L3-NHG, L3-NHG->list-of-VTEPs)
is to avoid route updates to the dplane when a remote-ES link flaps,
i.e. instead of updating all the dependent routes, the NHG's contents
are updated. This reduces the amount of dataplane updates (fewer nhg
updates vs. route updates), allowing for a faster failover.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
The `enum zclient_send_status` enum needs to be extended
throughout the code base to use the new states and
to fix up places where we tested against the return
value being non-zero.
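The shape of the fix-ups, as a sketch (call sites vary):

/* before: any non-zero return was treated as failure, which wrongly
 * classified buffered sends as errors */
if (zclient_send_message(zclient))
        goto failed;

/* after: test explicitly against the enum */
if (zclient_send_message(zclient) == ZCLIENT_SEND_FAILURE)
        goto failed;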
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This should never happen; no need to debug guard it, and it's not a
warning. If this isn't working, then NHT is not working at all.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
This function is poorly named; it's really used to allow the FSM to
decide the next valid state based on whether a peer has valid /
reachable nexthops as determined by NHT or BFD.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Since the addition of srte_color to the comparison for bgp nexthops,
it is possible to have several nexthops per prefix. But since zebra
only stores a per-prefix registration, we should not unregister for nh
notifications for a prefix until all the nexthops for that prefix
have been deleted. Otherwise we can get into a deadlock situation
where BGP thinks we have registered but we have unregistered from zebra.
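A sketch of the intended behavior; the counting helper is hypothetical:

/* only tell zebra to unregister when the last bnc for this prefix,
 * across all SR-TE colors, goes away */
if (bnc_count_for_prefix(tree, &bnc->prefix) == 1) /* hypothetical */
        sendmsg_zebra_rnh(bnc, ZEBRA_NEXTHOP_UNREGISTER);
bnc_free(bnc);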
Signed-off-by: Pat Ruddy <pat@voltanet.io>
Extend the NHT code so that only the affected BGP routes are affected
whenever an SR-policy is updated on zebra.
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
First, routing tables aren't the most appropriate data structure
to store nexthops and imported routes since we don't need to do
longest prefix matches with that information.
Second, by converting the NHT code to use rb-trees, we can index
the nexthops using additional information, not only the destination
address. This will be useful later to index bgpd's nexthops by
both destination and SR-TE color.
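A sketch of the comparison function such a tree would use, ordering
entries by SR-TE color first and destination second:

static int bgp_nhc_cmp(const struct bgp_nexthop_cache *a,
                       const struct bgp_nexthop_cache *b)
{
        if (a->srte_color < b->srte_color)
                return -1;
        if (a->srte_color > b->srte_color)
                return 1;

        return prefix_cmp(&a->prefix, &b->prefix);
}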
Co-authored-by: Sebastien Merle <sebastien@netdef.org>
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
Until now, the assumption was made in the bgp flowspec code that the
information contained was an ipv4 flowspec prefix. Now that it is
possible to handle ipv4 or ipv6 flowspec prefixes, that information is
stored in the prefix_flowspec attribute. Also, some unlocking is done
in order to process ipv4 and ipv6 flowspec entries.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Added a macro to validate v4-mapped v6 addresses.
Modified bgp receive & send updates to handle a v4-mapped v6 address
as nexthop and install it as a recursive nexthop in the RIB.
Minor change in fpm while sending the routes whose nexthop is a
v4-mapped v6 address.
Signed-off-by: Kaushik <kaushik@niralnetworks.com>
This is the bulk part extracted from "bgpd: Convert from `struct
bgp_node` to `struct bgp_dest`". It should not result in any functional
change.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
When there is a NHT change and the paths dependent on that NHT are being
evaluated, skip those that are marked for removal or as history.
When a route gets withdrawn, its valid flag is cleared and it is flagged
for removal; in the case of an EVPN route, it is also unimported from
VRFs (L2 and/or L3). bgp_process is then scheduled. Under rare timing
conditions, an NHT update for the route's next hop may arrive right after,
and if routes flagged for removal are not skipped, they may not only be
incorrectly marked as valid but also re-imported in the case of EVPN,
which would be a serious error.
Signed-off-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
Ensure that EVPN import or unimport is invoked only if there is a
change to the path's validity based on the NHT update.
Signed-off-by: Vivek Venkatraman <vivek@cumulusnetworks.com>
It is possible that the if_lookup_by_index() call will return
a NULL value and we would then call zclient_send_interface_radv_req()
with it. Just test that we have a valid interface pointer.
Found by Coverity
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Problem reported that in many circumstances, RAs created in the
process of bringing up numbered IPv6 peers with extended-nexthop
capability enabled (for ipv4 over ipv6) were not stopped on the
interface when those peers were deleted. Found several circumstances
where this occurred and fixed them in this patch.
Ticket: CM-26875
Signed-off-by: Don Slice <dslice@cumulusnetworks.com>
Problem Description:
=====================
+--+ +--+
|R1|-(192.201.202.1)----iBGP----(192.201.202.2)-|R2|
+--+ +--+
Routes on R2:
=============
S>* 202.202.202.202/32 [1/0] via 192.201.78.1, ens256, 00:40:48
Where, the next-hop network, 192.201.78.0/24, is a directly connected network address.
C>* 192.201.78.0/24 is directly connected, ens256, 00:40:48
Configurations on R1:
=====================
!
router bgp 201
bgp router-id 192.168.0.1
neighbor 192.201.202.2 remote-as 201
!
Configurations on R2:
=====================
!
ip route 202.202.202.202/32 192.201.78.1
!
router bgp 201
bgp router-id 192.168.0.2
neighbor 192.201.202.1 remote-as 201
!
address-family ipv4 unicast
redistribute static
exit-address-family
!
Step-1:
=======
R1 receives the route 202.202.202.202/32 from R2.
R1 installs the route in its BGP RIB.
Step-2:
=======
On R1, a connected interface address is added.
The address is the same as the next-hop of the BGP route received from R2 (192.201.78.1).
Point of Failure:
=================
R1 resolves the BGP route even though the route's next-hop is its own connected address.
Even though this appears to be a misconfiguration, it would still be better to safeguard the code against it.
Fix:
====
When BGP receives a connected route from Zebra, it processes the
routes for the next-hop update.
While doing so, BGP must ignore routes whose next-hop address matches
the address of the connected route for which Zebra sent the next-hop update
message.
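In sketch form, for the v4 case, with connected_addr standing for the
address from the connected route update:

if (IPV4_ADDR_SAME(&path->attr->nexthop, &connected_addr))
        continue; /* next hop is our own connected address: ignore */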
Signed-off-by: NaveenThanikachalam <nthanikachal@vmware.com>
Add new function `bgp_node_get_prefix()` and modify
the bgp code base to use it.
This is prep work for the struct bgp_dest rework.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
The current failed reason shown by bgp when you have a peer that
is not online yet is `Waiting for NHT`, even if NHT has
succeeded. Add some code to differentiate this.
eva# show bgp ipv4 uni summ failed
BGP router identifier 192.168.201.135, local AS number 3923 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 2, using 43 KiB of memory
Neighbor EstdCnt DropCnt ResetTime Reason
192.168.44.1 0 0 never Waiting for NHT
192.168.201.139 0 0 never Waiting for Open to Succeed
Total number of neighbors 2
eva#
eva# show bgp nexthop
Current BGP nexthop cache:
192.168.44.1 invalid, peer 192.168.44.1
Must be Connected
Last update: Mon Feb 10 19:05:19 2020
192.168.201.139 valid [IGP metric 0], #paths 0, peer 192.168.201.139
So 192.168.201.139 is a peer for a connected route that has not been
created on .139, while nexthop tracking for 44.1 has not succeeded yet.
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
A bgp nexthop cache update triggers an RA for a global ipv6
nexthop update.
In the case of a blackhole route type, the outgoing interface
information is NULL, which leads to a bgpd crash.
Skip sending the RA for the blackhole nexthop type.
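In sketch form; the blackhole case carries no outgoing interface to
send an RA on:

if (nexthop->type == NEXTHOP_TYPE_BLACKHOLE)
        continue; /* no ifindex: skip the RA trigger */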
Ticket:CM-27299
Reviewed By:
Testing Done:
Configure a bgp neighbor over a global ipv6 address.
Configure a static blackhole route whose prefix includes a
connected ipv6 global address.
Upon link flap, zebra sends a nexthop update to bgp.
The bgp nexthop cache skips sending an RA for the blackhole nexthop
type.
router bgp 65002
bgp router-id 91.189.93.190
...
neighbor 2001:67c:1360::b peer-group internal
static route:
ipv6 route 2001:67c:1360::/48 Null0 254
iface rowlink.4010
address 91.189.93.190/32
address 2001:67c:1360::a/128
Trigger ifdown rowlink.4010; ifup rowlink.4010
Signed-off-by: Chirag Shah <chirag@cumulusnetworks.com>
Problem statement:
When IPv4/IPv6 prefixes are received in BGP, bgp_update function registers the
nexthop of the route with nexthop tracking module. The BGP route is marked as
valid only if the nexthop is resolved.
Even for EVPN RT-5, the route should be marked as valid only if the
nexthop is resolvable.
Code changes:
1. Add nexthop of EVPN RT-5 for nexthop tracking. Route will be marked as valid
only if the nexthop is resolved.
2. Only the valid EVPN routes are imported to the vrf.
3. When an nht update is received in BGP, make sure that the EVPN
routes are imported/unimported based on whether the route becomes
valid/invalid.
Testcases:
1. At rtr-1, advertise EVPN RT-5 with a nexthop 10.100.0.2.
10.100.0.2 is resolved at rtr-2 in default vrf.
At rtr-2, remote EVPN RT-5 should be marked as valid and should be imported into
vrfs.
2. Make the nexthop 10.100.0.2 unreachable at rtr-2
Remote EVPN RT-5 should be marked as invalid and should be unimported from the
vrfs. As this code change deals with EVPN type-5 routes only, other EVPN routes
should be valid.
3. At rtr-2, add a static route to make nexthop 10.100.0.2 reachable.
EVPN RT-5 should again become valid and should be imported into the vrfs.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
Recently had a case where I was attempting to debug a nexthop tracking
issue across multiple bgp vrf's, and since the setup had vrf's with
overlapping address ranges, it became real fun real fast to track
the associated vrf data. Add a bit of code to allow us to figure out
what vrf we are in when we print out debug messages.
Look through the rest of the code and find debugs where we are
not using bgp->name_pretty and switch them over.
The function nexthop_same() does not check the resolved
nexthops, so I don't think this function is even needed
anymore.
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
If bfd comes back up and a bgp reconnection is in progress,
theoretically it should be necessary to wait for the end of the
reconnection process. However, since that reconnection process may
take some time, update the fsm by cancelling the connect timer. That
done, one just has to call the start timer.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Avoid tracking the 0.0.0.0/32 nexthop with the RIB.
When routes are aggregated,
the originator of the route becomes self.
Do not track nexthop self (0.0.0.0) with the rib.
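A sketch of the guard at nexthop registration time:

/* 0.0.0.0 (nexthop self, e.g. for aggregates) can never be
 * resolved via the RIB; don't register it with zebra NHT */
if (nexthop.s_addr == INADDR_ANY)
        return;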
Ticket: CM-24248
Testing Done:
Before fix-
tor-11# show ip nht vrf all
VRF blue:
0.0.0.0
unresolved
Client list: bgp(fd 16)
VRF default:
VRF green:
VRF magenta:
0.0.0.0
unresolved
Client list: bgp(fd 16)
After fix-
tor-11# show ip nht vrf all
VRF blue:
VRF default:
VRF green:
VRF magenta:
Signed-off-by: Chirag Shah <chirag@cumulusnetworks.com>