Consider this scenario:
Lots of peers with a bunch of route information that is changing
fast. One of the peers happens to be really slow for whatever
reason. The way the output queue is filled is that bgpd puts
64 packets at a time and then reschedules itself to send more
in the future. Now suppose that peer has hit it's input Queue
limit and is slow. As such bgp will continue to add data to
the output Queue, irrelevant if the other side is receiving
this data.
Let's limit the Output Queue to the same limit as the Input
Queue. This should prevent bgp eating up large amounts of
memory as stream data when under severe network trauma.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The idea is to drop unwanted attributes from the BGP UPDATE messages and
continue by just ignoring them. This improves the security, flexiblity, etc.
This is the command that Cisco has also.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
When actually creating a peer in BGP, tell the creation if
it is a config node or not. There were cases where the
CONFIG_NODE was being set *after* being placed into
the bgp->peerhash, thus causing collisions between the
doppelganger and the peer and eventually use after free's.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We already have a global knob for graceful-shutdown, but it's handy having
per neighbor knob as well.
Especially when a single neighbor needs to be restarted/shutdown gracefuly.
We can do this route-maps, but this is a faster/cleaner way doing the same
for an operator.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
RPKI revalidation is an possibly expensive operation. Break up
revalidation on a prefix basis by the `struct bgp` pointer.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
An end operator is showing cases with multiple bgp feeds
and a rpki table that calling the revalidation functions
is extremely expensive and they are seeing lots of thread
WARNS about timers being late and eventually the whole
thing gets unresponsive. Let's break up soft reconfiguration
in to a series of events per peer so that all the work
for this is not done at the same exact time.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Simulated latency with:
```
tc qdisc add dev eth3 root netem delay 100ms
```
```
donatas-laptop# sh ip bgp summary failed
IPv4 Unicast Summary (VRF default):
BGP router identifier 192.0.2.252, local AS number 65000 vrf-id 0
BGP table version 28
RIB entries 0, using 0 bytes of memory
Peers 1, using 724 KiB of memory
Neighbor EstdCnt DropCnt ResetTime Reason
192.168.10.65 2 2 00:00:17 Admin. shutdown (RTT)
Displayed neighbors 1
Total number of neighbors 1
donatas-laptop#
```
Another end received:
```
%NOTIFICATION: received from neighbor 192.168.10.17 6/2 (Cease/Administrative Shutdown) "shutdown due to high round-trip-time (104ms > 5ms, hit 21 times)"
```
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Add a default limit to the InQ for messages off the bgp peer
socket. Make the limit configurable via cli.
Adding in this limit causes the messages to be retained in the tcp
socket and allow for tcp back pressure and congestion control to kick
in.
Before this change, we allow the InQ to grow indefinitely just taking
messages off the socket and adding them to the fifo queue, never letting
the kernel know we need to slow down. We were seeing under high loads of
messages and large perf-heavy routemaps (regex matching) this queue
would cause a memory spike and BGP would get OOM killed. Modifying this
leaves the messages in the socket and distributes that load where it
should be in the socket buffers on both send/recv while we handle the
mesages.
Also, changes were made to allow the ringbuffer to hold messages and
continue to be filled by the IO pthread while we wait for the Main
pthread to handle the work on the InQ.
Memory spike seen with large numbers of routes flapping and route-maps
with dozens of regex matching:
```
Memory statistics for bgpd:
System allocator statistics:
Total heap allocated: > 2GB
Holding block headers: 516 KiB
Used small blocks: 0 bytes
Used ordinary blocks: 160 MiB
Free small blocks: 3680 bytes
Free ordinary blocks: > 2GB
Ordinary blocks: 121244
Small blocks: 83
Holding blocks: 1
```
With most of it being held by the inQ (seen from the stream datastructure info here):
```
Type : Current# Size Total Max# MaxBytes
...
...
Stream : 115543 variable 26963208 15970740 3571708768
```
With this change that memory is capped and load is left in the sockets:
RECV Side:
```
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 265350 0 [fe80::4080:30ff:feb0:cee3]%veth1:36950 [fe80::4c14:9cff:fe1d:5bfd]:179 users:(("bgpd",pid=1393334,fd=26))
skmem:(r403688,rb425984,t0,tb425984,f1816,w0,o0,bl0,d61)
```
SEND Side:
```
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 1275012 [fe80::4c14:9cff:fe1d:5bfd]%veth1:179 [fe80::4080:30ff:feb0:cee3]:36950 users:(("bgpd",pid=1393443,fd=27))
skmem:(r0,rb131072,t0,tb1453568,f1916,w1300612,o0,bl0,d0)
```
Signed-off-by: Stephen Worley <sworley@nvidia.com>
In the current implementation of bgpd, SRv6 SIDs can be configured only
under the address-family. This enables bgpd to leak IPv6 routes using
an SRv6 End.DT6 behavior and IPv4 routes using an SRv6 End.DT4
behavior. It is not possible to leak both IPv6 and IPv4 routes using a
single SRv6 SID.
This commit adds a new CLI command
"sid vpn per-vrf export <sid_idx|auto>" that enables bgpd to leak both
IPv6 and IPv4 routes using a single SRv6 SID (End.DT46 behavior).
Signed-off-by: Carmine Scarpitta <carmine.scarpitta@uniroma2.it>
In order to send correct SRv6 L3VPN advertisement, we need to save
srv6_locator_chunk in vpn_policy. With this information, we can
construct correct SRv6 L3VPN advertisement packets.
Signed-off-by: Ryoga Saito <ryoga.saito@linecorp.com>
As an example, Arista EOS allows this behavior.
Configuration something like:
```
neighbor PG peer-group
neighbor PG remote-as 65001
neighbor PG local-as 65001
neighbor 192.168.10.124 peer-group PG
```
Or without peer-group.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
See the BGP message sequence:
R1 R2
| updates |
|------------------>|
| |
| refresh request |
x<------------------|
| |
| updates cont. |
|------------------>|
| |
| end-of-rib |
|------------------>|
| |
When R1 and R2 establish BGP session, R1 begins to send initial updates.
If R2 sends a route-refresh request before EoR, it's silently ignored
by R1, and routes received earlier have no chance to be processed again.
RFC7313 says, "for a BGP speaker that supports the BGP Graceful Restart,
it MUST NOT send a BoRR for an <AFI, SAFI> to a neighbor before it sends
the EoR for the <AFI, SAFI> to the neighbor." But it doesn't forbid
route-refresh request to be sent before receiving EoR.
To handle this scenario, postpone response to refresh request until EoR
is sent.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
RFC4364 describes peerings between multiple AS domains, to ease
the continuity of VPN services across multiple SPs. This commit
implements a sub-set of IETF option b) described in chapter 10 b.
The ASBR to ASBR approach is taken, with an EBGP peering between
the two routers. The EBGP peering must be directly connected to
the outgoing interface used. In those conditions, the next hop
is directly connected, and there is no need to have a transport
label to convey the VPN label. A new vty command is added on a
per interface basis:
This command if enabled, will permit to convey BGP VPN labels
without any transport labels (i.e. with implicit-null label).
restriction:
this command is used only for EBGP directly connected peerings.
Other use cases are not covered.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
TCP keepalive is enabled once BGP connection is established.
New vty commands:
bgp tcp-keepalive <1-65535> <1-65535> <1-30>
no bgp tcp-keepalive
Signed-off-by: Xiaofeng Liu <xiaofeng.liu@6wind.com>
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Let's convert to our actual library call instead
of using yet another abstraction that makes it fun
for people to switch daemons.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Implement forcing L3 auto derivation via configs even when
manually RTs are set. This will allow both to coexist in
BGP RTs. Without using auto config command, it will remove
auto derived RTs when you manually configure your own. To allow
both, use the auto command ond import/export/both.
Implement '*' wildcard import L3 RTs so we can import a route into any AS.
This is necessary to avoid a user from having to configure an L3 RT for
every AS they care to import evpn route from.
Signed-off-by: Stephen Worley <sworley@nvidia.com>
BGP SoO is a tag that is appended on BGP updates to allow a peer to mark
a particular peer as belonging to a particular site. In certain MPLS L3 VPN
configurations, the BGP AS-Path may not provide the granularity needed
prevent a loop in the control-plane. With this in mind, BGP SoO is designed
to fill this gap and prevent a routing loop that may occur.
If we configure for example, `neighbor soo 65000:1` at PEs, routes won't be
announced between CPEs if soo matches. This is especially needed when using
as-override or allowas-in.
Also, this is the automated way of the same behavior as configuring route-maps
for each peer like:
```
bgp extcommunity-list cpe permit soo 65000:1
!
route-map cpe permit 10
set extcommunity soo 65000:1
...
route-map cpe deny 10
match extcommunity cpe
route-map cpe permit 20
...
```
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
A new command is available under SAFI_MPLS_VPN:
With this command, the BGP vpnvx prefixes received are
not kept, if there are no VRF interested in importing
those vpn entries.
A soft refresh is performed if there is a change of
configuration: retain cmd, vrf import settings, or
route-map change.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
These values were named WITHDRAW and UPDATE. Yeah, you guessed it, those
are already #define's elsewhere (bgp_debug.h). Hilarity ensues.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Just adding a support for peer-groups, because now it's not possible to
configure BGP role for peer-groups.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
The command `debug bgp allow-martian` is not actually
a debug command it's a command that when entered allows
bgp to not reset a peering when a martian nexthop is
passed in the nlri.
Add the `bgp allow-martian-nexthop` command and allow it to be
used.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
RFC9234 is a way to establish correct connection roles (Customer/
Provider, Peer or with RS) between bgp speakers. This patch:
- Add a new configuration/terminal option to set the appropriate local
role;
- Add a mechanism for checking used roles, implemented by exchanging
the corresponding capabilities in OPEN messages;
- Add strict mode to force other party to use this feature;
- Add basic support for a new transitive optional bgp attribute - OTC
(Only to Customer);
- Add logic for default setting OTC attribute and filtering routes with
this attribute by the edge speakers, if the appropriate conditions are
met;
- Add two test stands to check role negotiation and route filtering
during role usage.
Signed-off-by: Eugene Bogomazov <eb@qrator.net>
Reduce the scope, to avoid comparing uint16_t vs. size_t in a loop.
```
vty_out(vty,
" Message received that caused BGP to send a NOTIFICATION:\n ");
for (i = 1; i <= p->last_reset_cause_size;
i++) {
```
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Related: https://datatracker.ietf.org/doc/html/draft-ietf-idr-bfd-subcode
When BFD Down notification comes and BGP is configured to track on BFD events,
send BGP Cease/BFD Down notification to the peer.
If RFC 8538 is enabled (Notification support for Graceful-Restart), notification
should be encapsulated into Hard Reset message.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
As described by
https://www.ietf.org/archive/id/draft-spaghetti-idr-bgp-sendholdtimer-04.html
Since this replicates the HoldTime check on the receiver that is already
part of the protocol, I do not believe it necessary to wait for IETF
progress on this draft. It's just replicating an existing element of
the protocol at the other side of the session.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Allow BGP to control the TOS DSCP value in the tcp header
via a new command at the bgp global level `bgp session-dscp <0-63>`
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Signed-off-by: Pavel Shirhov <pavelsh@microsoft.com>
The maxpaths same_clusterlen value was a uint16_t
with a single bit being used. No other values are
being stored. Let's remove the bitfield and simplify
to a bool.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
We will hit this soon because uint32_t will be not enough.
Two more flags gonna be added for rfc8538.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
N-bit flag should be exchanged in BGP OPEN messages, not only when the
bgpd is restarted/started.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Also, add N-Bit (Notification) flag for Graceful Restart.
This is a preparation for RFC8538.
More information: https://datatracker.ietf.org/doc/html/rfc8538
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
Delay BGP configuration until we receive end-configuration hook to make sure
we don't send partial updates to peer which leads to broken Graceful-Restart.
Signed-off-by: David Lamparter <equinox@opensourcerouting.org>
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
For the later patches, this patch changes the behavior of alloc_new sid
so that bgpd record not only SID for VRF, but also Locator of SID.
Signed-off-by: Ryoga Saito <ryoga.saito@linecorp.com>
Conversion of bgp error codes returned for cli input into
an enum and then properly handling all the error cases
in bgp_vty_return.
Because not all error codes returned were properly handled
in this function there existed configuration examples that
were accepted on the cli without an error message but not
saved.
Fixes: #10589
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
confederations are checking to see that the bgp pointer
is non-null. But it's impossible to have a null pointer
in the cli and in all paths we have already deref'ed the bgp
pointer. Let's remove that error code as that it is impossible
to happen.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a 15 minute warning to the logging system when
bgp policy is not setup properly. Operators keep asking
about the missing policy( on upgrade typically ). Let's
try to give them a bit more of a hint when something is
going wrong as that they are clearly missing the other
various places FRR tells them about it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When setting maximum-prefix-out on peer-group, the applied value on
member is 0.
Fix usage of maximum-prefix-out on peer-group.
The peer_maximum_prefix_out_(un)set functions are derived from
peer_maximum_prefix_(un)set.
Fixes: fde246e835 ("bgpd: Add an option to limit outgoing prefixes")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Abstract:
- The command "neighbor PEER maximum-prefix-out NUMBER" cannot be applied
without clearing the BGP neighbor.
- Apply the maximum-prefix-out value as soon as it is modified without
clearing the neighbor.
subgroup_update_packet() and subgroup_withdraw_packet() respectively
manages the announcement and withdrawal BGP message to the peer.
subgrp->scount counter counts the number of sent prefixes.
Before the patch, the maximum out prefix limitation was applied in
subgroup_update_packet() in order that subgrp->scount never exceeds the
limit. Setting a limit inferior to the effective number of sent prefix
did not result in sending any withdrawal message to reduce the number of
sent prefixes. Without clearing the BGP neighbor, the limitation only
applied to the announcement of new prefixes when the limitation was
over.
With the patch, the limitation is checked in subgroup_announce_check().
The function is intended to say whether a prefix has to be announced in
regards to the prefix-list, route-map... Now when a maximum-prefix-out
value is changed/removed, the neighbor AFI/SAFI table is re-parsed in
the same way as for the application of route-map, prefix-lists...
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Extended BGP Administrative Shutdown Communication (rfc9003):
Basically, shutdown message size is increased to 255 from 128.
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
We had various forms of min/max macros across multiple daemons
all of which duplicated what we have in compiler.h. Convert
everyone to use the `correct` ones
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The idea is to disable addpath-rx capability to avoid unnecessary additional
routes installed.
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
BGP can experience a bunch of errors associated with sockets
being manipulated which would prevent the peer from coming up.
Let's add some additional debug information here so that
our operators can do a bit more for themselves.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Current implementation of SRv6 SID allocation algorithm sets most least
2 bytes. But, according to RFC8986, function bits is located in the next
to locator. New allocation alogirithm respects this format.
Signed-off-by: Ryoga Saito <contact@proelbtn.com>
This is to avoid breaking changes between existing deployments of
extended community for bandwidth encoding. By default FRR uses uint32
to encode bandwidth, which is not as the draft requires (IEEE floating-point).
This switch enables the required encoding per-peer.
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
When BGP is notified by RIB that peer address is unreachable then BGP session must be brought
down immediately and not wait for the hold-timer expiry. Today single-hop EBGP already behaves
this way but need to change for iBGP and multi-hop EBGP sessions.
Signed-off-by: Prerana G.B <prerana@vmware.com>, Pushpasis Sarkar <spushpasis@vmware.com>
in bgp_io.c upon packet read of some error we are storing
the peer pointer on a thread to call bgp_packet_process_error.
In this case an event is generated that is not guaranteed to be
run immediately. It could come in *after* the peer data structure
is deleted and as such we now are writing into memory that we
no longer possibly own as a peer data structure.
Modify the code so that the peer can track the thread associated
with the read error and then it can wisely kill that thread
when deleting the peer data structure.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Adds a knob that sets the time between loc-rib scans for conditional
advertisement.
I chose the range (5-240) because 1 second seems dumb and too easy to
hurt yourself at even moderate scale, 5 seconds you can still hurt
yourself but I could see a use case for it, and 4 minutes should be
enough for anyone (tm)
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Introduces bgp->default_af to selectively enable various default
afi/safis to be inherited by new peers.
Makes default_af flag logic consistent for all address-families, i.e.
instead of a "no default" flag for ipv4 and a "default" flag for ipv6,
just use "default" for both and make it true for ipv4 by default.
Removes old BGP_FLAG_NO_DEFAULT_IPV4 and BGP_FLAG_DEFAULT_IPV6, and
cleans up bgp->flags bit definitions to avoid gaps for unused bits.
Signed-off-by: Trey Aspelund <taspelund@nvidia.com>
Gateway IP overlay index of the remote type-5 route is resolved
recursively using remote type-2 route. For the purpose of this
recursive resolution, for each L2VNI, we build a hash table of the
remote IP addresses received by remote type-2 routes.
For the topologies where overlay index resolution is not needed, we
do not need to build this remote-ip-hash.
Thus, make the recursive resolution of the overlay index conditional on
"enable-resolve-overlay-index" configuration.
router bgp 65001
bgp router-id 192.168.100.1
neighbor 10.0.1.2 remote-as 65002
!
address-family l2vpn evpn
neighbor 10.0.1.2 activate
advertise-all-vni
enable-resolve-overlay-index----------> New configuration
exit-address-family
Gateway IP overlay index will be resolved only if this configuration is present.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
When EVPN prefix route with a gateway IP overlay index is imported into the IP
vrf at the ingress PE, BGP nexthop of this route is set to the gateway IP.
For this vrf route to be valid, following conditions must be met.
- Gateway IP nexthop of this route should be L3 reachable, i.e., this route
should be resolved in RIB.
- A remote MAC/IP route should be present for the gateway IP address in the
EVI(L2VPN table).
To check for the first condition, gateway IP is registered with nht (nexthop
tracking) to receive the reachability notifications for this IP from zebra RIB.
If the gateway IP is reachable, zebra sends the reachability information (i.e.,
nexthop interface) for the gateway IP.
This nexthop interface should be the SVI interface.
Now, to find out type-2 route corresponding to the gateway IP, we need to fetch
the VNI for the above SVI.
To do this VNI lookup effitiently, define a hashtable of struct bgpevpn with
svi_ifindex as key.
struct hash *vni_svi_hash;
An EVI instance is added to vni_svi_hash if its svi_ifindex is nonzero.
Using this hash, we obtain struct bgpevpn corresponding to the gateway IP.
For gateway IP overlay index recursive lookup, once we find the correct EVI, we
have to lookup its route table for a MAC/IP prefix. As we have to iterate the
entire route table for every lookup, this lookup is expensive. We can optimize
this lookup by adding all the remote IP addresses in a hash table.
Following hash table is defined for this purpose in struct bgpevpn
Struct hash *remote_ip_hash;
When a MAC/IP route is installed in the EVI table, it is also added to
remote_ip_hash.
It is possible to have multiple MAC/IP routes with the same IP address because
of host move scenarios. Thus, for every address addr in remote_ip_hash, we
maintain list of all the MAC/IP routes having addr as their IP address.
Following structure defines an address in remote_ip_hash.
struct evpn_remote_ip {
struct ipaddr addr;
struct list *macip_path_list;
};
A Boolean field is added to struct bgp_nexthop_cache to indicate that the
nexthop is EVPN gateway IP overlay index.
bool is_evpn_gwip_nexthop;
A flag BGP_NEXTHOP_EVPN_INCOMPLETE is added to struct bgp_nexthop_cache.
This flag is set when the gateway IP is L3 reachable but not yet resolved by a
MAC/IP route.
Following table explains the combination of L3 and L2 reachability w.r.t.
BGP_NEXTHOP_VALID and BGP_NEXTHOP_EVPN_INCOMPLETE flags
* | MACIP resolved | MACIP unresolved
*----------------|----------------|------------------
* L3 reachable | VALID = 1 | VALID = 0
* | INCOMPLETE = 0 | INCOMPLETE = 1
* ---------------|----------------|--------------------
* L3 unreachable | VALID = 0 | VALID = 0
* | INCOMPLETE = 0 | INCOMPLETE = 0
Procedure that we use to check if the gateway IP is resolvable by a MAC/IP
route:
- Find the EVI/L2VRF that belongs to the nexthop SVI using vni_svi_hash.
- Check if the gateway IP is present in remote_ip_hash in this EVI.
When the gateway IP is L3 reachable and it is also resolved by a MAC/IP route,
unset BGP_NEXTHOP_EVPN_INCOMPLETE flag and set BGP_NEXTHOP_VALID flag.
Signed-off-by: Ameya Dharkar <adharkar@vmware.com>
Adds gateway-ip option to advertise ipv4/ipv6 unicast CLI.
dev(config-router-af)# advertise <ipv4|ipv6> unicast
<cr>
gateway-ip Specify EVPN Overlay Index
route-map route-map for filtering specific routes
When gateway-ip is specified, gateway IP field of EVPN RT-5 NLRI is filled with
the BGP nexthop of the vrf prefix being advertised.
No support for ESI overlay index yet.
Test cases:
1) advertise ipv4 unicast
2) advertise ipv4 unicast gateway-ip
3) advertise ipv6 unicast
4) advertise ipv6 unicast gateway-ip
5) Modify from no-overlay-index to gateway-ip
6) Modify from gateway-ip to no-overlay-index
7) CLI with route-map and modify route-map
Author: Sri Mohana Singamsetty <srimohans@gmail.com>
Signed-off-by: Sri Mohana Singamsetty <srimohans@gmail.com>
We are inconsistently using peer_establiahed(peer) with
sometimes using `peer->status == Established`. Just Convert
over to using the function for consistency.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This commit add base-lines for BGP SRv6 VPN support.
srv6_locator_chunks property of struct bgp is used
to store BGPd's own SRv6 locator chunk getting with
ZEBRA_SRV6_MANAGER_GET_LOCATOR_CHUNK api.
And srv6_functions is used to store BGP's srv6
localsids. It's mainly used when new SID reservation
from locator chunks.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
base
BGP_MAX_PACKET_SIZE no longer represented the absolute maximum BGP
packet size as it did before, instead it was defined as 4096 bytes,
which is the maximum unless extended message capability is negotiated,
in which case the maximum goes to 65k.
That introduced at least one bug - last_reset_cause was undersized for
extended messages, and when sending an extended message > 4096 bytes
back to a peer as part of NOTIFY data would trigger a bounds check
assert.
This patch redefines the macro to restore its previous meaning,
introduces a new macro - BGP_STANDARD_MESSAGE_MAX_PACKET_SIZE - to
represent the 4096 byte size, and renames the extended size to
BGP_EXTENDED_MESSAGE_MAX_PACKET_SIZE for consistency. Code locations
that definitely should use the small size have been updated, locations
that semantically always need whatever the max is, no matter what that
is, use BGP_MAX_PACKET_SIZE.
BGP_EXTENDED_MESSAGE_MAX_PACKET_SIZE should only be used as a constant
when storing what the negotiated max size is for use at runtime and to
define BGP_MAX_PACKET_SIZE. Unless there is a future standard that
introduces a third valid size it should not be used for any other
purpose.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
Problem Statement:
=================
In scale setup BGP sessions start flapping.
RCA:
====
In virtualized environment there are multiple places where
MTU need to be set. If there are some places were MTU is not set
properly then there is chances that BGP packets get fragmented,
in scale setup this will lead to BGP session flap.
Fix:
====
A new tcp option is provided as part of this implementation,
which can be configured per neighbor and helps to set the TCP
max segment size. User need to derive the path MTU between the BGP
neighbors and set that value as part of tcp-mss setting.
1. CLI Configuration:
[no] neighbor <A.B.C.D|X:X::X:X|WORD> tcp-mss (1-65535)
2. Running config
frr# show running-config
router bgp 100
neighbor 198.51.100.2 tcp-mss 150 => new entry
neighbor 2001:DB8::2 tcp-mss 400 => new entry
3. Show command
frr# show bgp neighbors 198.51.100.2
BGP neighbor is 198.51.100.2, remote AS 100, local AS 100, internal link
Hostname: frr
Configured tcp-mss is 150, synced tcp-mss is 138 => new display
4. Show command json output
frr# show bgp neighbors 2001:DB8::2 json
{
"2001:DB8::2":{
"remoteAs":100,
"bgpTimerKeepAliveIntervalMsecs":60000,
"bgpTcpMssConfigured":400, => new entry
"bgpTcpMssSynced":388, => new entry
Risk:
=====
Low - This is a config driven feature and it sets the max segment
size for the TCP session between BGP peers.
Tests Executed:
===============
Have done manual testing with three router topology.
1. Executed basic config and un config scenarios
2. Verified if the config is updated in running config
during config and no config operation
3. Verified the show command output in both CLI format and
JSON format.
4. Verified if TCP SYN messages carry the max segment size
in their initial packets.
5. Verified the behaviour during clear bgp session.
6. done packet capture to see if the new segment size
takes effect.
Signed-off-by: Abhinay Ramesh <rabhinay@vmware.com>
As pointed out on code review of BGP extended messages, increasing the
maximum BGP message size has the consequence of growing the dynamically
sized stack buffer up to 650K. While unlikely to exceed modern stack
sizes it is still unreasonably large. Remedy this with a heap buffer.
Signed-off-by: Quentin Young <qlyoung@nvidia.com>
There are multiple problems:
- commit ef7c53e2 introduced a new return value 2 which broke things,
because a lot of code treats non-zero return as an error,
- there is an incorrect error returned when AS number mismatches.
This commit fixes both.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
Description:
FRR doesn't re-install the routes, imported from a tenant VRF,
when bgp instance for source vrf is deleted and re-added again.
When bgp instance is removed and re-added, when import statement is already there,
then route leaking stops between two VRFs.
Every 'router bgp' command should trigger re-export of all the routes
to the importing bgp vrf instances.
When a router bgp is configured, there could be bgp vrf instance(s) importing routes from
this newly configured bgp vrf instance.
We need to export routes from configured bgp vrf to VPN.
This can impact performance, whenever we are testing scale from vrf route-leaking perspective.
We should not trigger re-export for already existing bgp vrf instances.
Co-authored-by: Santosh P K <sapk@vmware.com>
Co-authored-by: Kantesh Mundaragi <kmundaragi@vmware.com>
Signed-off-by: Abhinay Ramesh <rabhinay@vmware.com>
increase the maximum number of neighbors in a bgp group.
Set the maximum value to 50000 instead of 5000.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
In the case of EVPN type-2 routes that use ES as destination, BGP
consolidates the nh (and nh->rmac mapping) and sends it to zebra as
a nexthop add.
This nexthop is the EVPN remote PE and is created by reference of
VRF IPvx unicast paths imported from EVPN Type-2 routes.
zebra uses this nexthop for setting up a remote neigh enty for the PE
and a remote fdb entry for the PE's RMAC.
Ticket: CM-31398
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
This new BGP configuration is akin to "bgp bestpath aspath
multipath-relax". When applied, paths learned from different peer types
will be eligible to be considered for multipath (ECMP). Paths from all
of eBGP, iBGP, and confederation peers may be included in multipaths
if they are otherwise equal cost.
This change preserves the existing bestpath behavior of step 10's result
being returned, not the result from steps 8 and 9, in the case where
both 8+9 and 10 determine a winner.
Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
Remove old BFD API usage and replace it with the new one.
Highlights:
- More shared code: the daemon gets notified with callbacks instead of
having to roll its own code to find the notified sessions.
- Less code to integrate with BFD.
- Remove hidden commands to configure single / multi hop. Use
protocol data instead.
BGP can determine if a peer is single/multi hop according to the
following criteria:
a. If the IP address is a link-local address (single hop)
b. The network is shared with peer (single hop)
c. BGP is configured for eBGP multi hop / TTL security (multi hop)
- Respect the configuration hierarchy:
a. Peer configuration take precendence over peer-group
configuration.
b. When peer group configuration is removed, reset peer
BFD configurations to defaults (unless peer had specific
configs).
Example:
neighbor foo peer-group
neighbor foo bfd profile X
neighbor 192.168.0.2 peer-group foo
neighbor 192.168.0.2 bfd
! If peer-group is removed the profile configuration gets
! removed from peer 192.168.0.2, but BFD will still enabled
! because of the neighbor specific bfd configuration.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
When dumping data about prefixes in bgp. Let's dump the
rpki validation state as well:
Output if rpki is turned on:
janelle# show rpki prefix 2003::/19
Prefix Prefix Length Origin-AS
2003:: 19 - 19 3320
janelle# show bgp ipv6 uni 2003::/19
BGP routing table entry for 2003::/19
Paths: (1 available, best #1, table default)
Not advertised to any peer
15096 6939 3320
::ffff:4113:867a from 65.19.134.122 (193.72.216.231)
(fe80::e063:daff:fe79:1dab) (used)
Origin IGP, valid, external, best (First path received), validation-state: valid
Last update: Sat Mar 6 09:20:51 2021
janelle# show rpki prefix 8.8.8.0/24
Prefix Prefix Length Origin-AS
janelle# show bgp ipv4 uni 8.8.8.0/24
BGP routing table entry for 8.8.8.0/24
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
100.99.229.142
15096 6939 15169
65.19.134.122 from 65.19.134.122 (193.72.216.231)
Origin IGP, valid, external, best (First path received), validation-state: not found
Last update: Sat Mar 6 09:21:25 2021
Example output when rpki is not configured:
eva# show bgp ipv4 uni 8.8.8.0/24
BGP routing table entry for 8.8.8.0/24
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
janelle(192.168.161.137)
64539 15096 6939 15169
192.168.161.137(janelle) from janelle(192.168.161.137) (192.168.44.1)
Origin IGP, valid, external, bestpath-from-AS 64539, best (First path received)
Last update: Sat Mar 6 09:33:51 2021
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When you use a single BGP session for both IPv4 and IPv6 it's a bit
annoying going into ipv6 address-family and explicitly activating it.
Let's get this automatically if enabled with `bgp default ipv6-unicast`.
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
When sending BMP messages for a status change event for a peer whose NHT
has failed, we were sending a Peer Down Reason Code of 1 (Local system
closed, NOTIFICATION follows) with no NOTIFICAION PDU (because there was
none). This is wrong. Also, the reason code of 1 is semantically off, it
should be 2 (Local system closed, FSM event follows).
This patch:
- adds definitions of all BGP FSM event codes per RFC4271
- changes the BMP reason code emitted when a peer changes state due to
NHT failure to 2 and encodes FSM event 18 (TcpConnectionFails)
- changes the catch-all case where we have not yet
implemented the appropriate BMP response to indicate reason code 2
with FSM event 0 (no relevant Event code is defined).
These changes ought to prevent the BMP session from being torn down due
to an improperly formatted message.
Signed-off-by: Quentin Young <qlyoung@qlyoung.net>
Add SNMP support for L3vpn Vrf table as defined in [RFC4382]
Keep track of vrf status for the table and for future traps.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
From RFC4382:
A VRF is
up(1) when there is at least one interface associated
with the VRF whose ifOperStatus is up(1). A VRF is
down(2) when:
a. There does not exist at least one interface whose
ifOperStatus is up(1).
b. There are no interfaces associated with the VRF.
Run through interfaces associated with a vrf and return
true if there is one in the up state.
Signed-off-by: Pat Ruddy <pat@voltanet.io>
Reference: https://www.cmand.org/communityexploration
--y2--
/ | \
c1 ---- x1 ---- y1 | z1
\ | /
--y3--
1. z1 announces 192.168.255.254/32 to y2, y3.
2. y2 and y3 tags this prefix at ingress with appropriate
communities 65004:2 (y2) and 65004:3 (y3).
3. x1 filters all communities at the egress to c1.
4. Shutdown the link between y1 and y2.
5. y1 will generate a BGP UPDATE message regarding the next-hop change.
6. x1 will generate a BGP UPDATE message regarding community change.
To avoid sending duplicate BGP UPDATE messages we should make sure
we send only actual route updates. In this example, x1 will skip
BGP UPDATE to c1 because the actual route is the same
(filtered communities - nothing changes).
Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>
On top of the recent `bgp suppress-fib-pending which
was at a BGP_NODE level, add this command at the CONFIG_NODE
level as well and allow the command to apply to all instances
of bgp running.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Add a bit of code to allow bgp to send the AS-Path associated with
the route being installed to zebra so it can be displayed and
used as part of the `show ip route A` command in zebra.
eva# show ip route 20.0.0.0/11
Routing entry for 20.0.0.0/11
Known via "bgp", distance 20, metric 0, best
Last update 00:00:00 ago
* 192.168.161.1, via enp39s0, weight 1
AS-Path: 60000 64539 15096 6939 8075
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Changes implement dampening profiles for peers and peer groups. This is
achieved by introducing the possibility to have multible existing
dampening configurations with their own sets of parameters and lists of
associated paths.
Signed-off-by: Rafael Zalamena <rzalamena@opensourcerouting.org>
Signed-off-by: David Schweizer <dschweizer@opensourcerouting.org>
ES-VRF entries are maintained for the purpose of L3-NHG creation -
1. Each ES-EVI entry is associated with a tenant VRF. This associaton
triggers the creation of an ES-VRF entry.
2. Type-2/MAC-IP routes are imported into a tenant VRF and programmed as
a /32 or host route entry in the dataplane. If the destination of
the host route is a remote-ES the route is programmed with the
corresponding (keyed in by {vrf,ES-id}) L3-NHG.
3. The reason for this indirection (route->L3-NHG, L3-NHG->list-of-VTEPs)
is to avoid route updates to the dplane when a remote-ES link flaps i.e.
instead of updating all the dependent routes the NHG's contents are
updated. This reduces the amount of dataplane updates (fewer nhg updates vs.
route updates) allowing for a faster failover.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Move the FOREACH_AFI_SAFI macro from bgpd.h to zebra.h( GLOBAL's YOUALL )
Then convert all the places that have the two level for loop to
iterate over all afi/safis
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
* Process FIB update in bgp_zebra_route_notify_owner() and call
group_announce_route() if route is installed
* When bgp update is received for a route which is not installed earlier
(flag BGP_NODE_FIB_INSTALLED is not set) and suppress fib is enabled
set the flag BGP_NODE_FIB_INSTALL_PENDING to indicate fib install is
pending for the route. The route will be advertised when zebra send
ZAPI_ROUTE_INSTALLED status.
* The advertisement delay (BGP_DEFAULT_UPDATE_ADVERTISEMENT_TIME)
is added to allow more routes to be sent in single update message.
This is required since zebra sends route notify message for each route.
The delay will be applied to update group timer which advertises
routes to peers.
Signed-off-by: kssoman <somanks@gmail.com>