This change alters the behavior of existing test code. The
default mode (before any call to luSetWaitType()) is now
"strict".
The historical behavior of luCommand(op="wait) is to ignore
failures to match the specified regexp in the specified time.
In those cases, no result was logged and no error was signaled.
This change introduces a new "strict" mode for luCommand(op="wait):
in "strict" wait mode, each invocation of luCommand(op="wait)
generates an explicit, logged failure result when it fails to match
the specified regexp in the specified time. These failures signal
an error for the test.
Calling luSetWaitType("nostrict") restores the historical behavior.
Calling luSetWaitType("strict") (re)enables the new strict behavior.
Individual calls to luCommand() may also specify op="wait-nostrict"
to override any default and use the historical behavior.
Individual calls to luCommand() may also specify op="wait-strict"
to override any default and use the new behavior.
Signed-off-by: G. Paul Ziemba <paulz@labn.net>
Add an "exist" key to check the existence of a prefix in the BGP RIB.
Useful to check that a prefix has not leaked by error.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Earlier daemon parameter was passed to
start_topology(), which is not needed now,
as new code is implemented to start
feature specific daemons.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Currently topotests starts all daemons by default,
made changes to f/w so only needed daemons can
be started, daemons which are needed to tests
particular test suite.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
This is for run_and_expect_type and run_and_expect topotests method.
Some contributions unintentionally get merged with very low values, that leads
to CI failures, let's guard this a bit.
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
When zebra receives routes from upper level protocols it decodes the
zapi message and places the routes on the metaQ for processing. Suppose
we have a route A that is already installed by some routing protocol.
And there is a route B that has a nexthop that will be recursively
resolved through A. Imagine if a route replace operation for A is
going to happen from an upper level protocol at about the same time
the route B is going to be installed into zebra. If these routes
are received, and decoded, at about the same time there exists a
chance that the metaQ will contain both of them at the same time.
If the order of installation is [ B, A ]. B will be resolved
correctly through A and installed, A will be processed and
re-installed into the FIB. If the nexthops have changed for
A then the owner of B should be notified about the change( and B
can do the correct action here and decide to withdraw or re-install ).
Now imagine if the order of routes received for processing on the
metaQ is [ A, B ]. A will be received, processed and sent to the
dataplane for reinstall. B will then be pulled off the metaQ and
fail the install since A is in a `not Installed` state.
Let's loosen the restriction in nexthop resolution for B such
that if the route we are dependent on is a route replace operation
allow the resolution to suceed. This requires zebra to track a new
route state( ROUTE_ENTRY_ROUTE_REPLACING ) that can be looked at
during nexthop resolution. I believe this is ok because A is
a route replace operation, which could result in this:
-route install failed, in which case B should be nht'ing and
will receive the nht failure and the upper level protocol should
remove B.
-route install succeeded, no nexthop changes. In this case
allowing the resolution for B is ok, NHT will not notify the upper
level protocol so no action is needed.
-route install succeeded, nexthops changes. In this case
allowing the resolution for B is ok, NHT will notify the upper
level protocol and it can decide to reinstall B or not based
upon it's own algorithm.
This set of events was found by the bgp_distance_change topotest(s).
Effectively the tests were looking for the bug ( A, B order in the metaQ )
as the `correct` state. When under very heavy load, the A, B ordering
caused A to just be installed and fully resolved in the dataplane before
B is gotten to( which is entirely possible ).
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Enhanced few exsiting PIM APIs to support both
IPv4 and IPv6 configuration. Added few new APIs
for PIMv6. Tested all existing tests with new
API changes.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
This tests checks that there are no errors when receiving BFD
packets over the various linux vrf interfaces. For example, if
an incoming packet is received by the wrong socket, a VRF
mismatch error would occur, and BFD flapping would be observed.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Issue was reported by Donald, we were hitting
with key not found error and execution was
stopped, which is fixed by this PR.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Verifying and making sure PIM neighbors are
up before sending BSM packet using Scapy.
Verifying static routes are installed before
proceeding fruther.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
When you have a static route with multiple different admin
distances there exists a chance that route will have been
installed multiple times due to system load when inserted
at about the same time. If this is the case then the
verify_rib function can and will select the wrong route
that happens to have a nexthop group that is still installed.
Modify verify_rib to ensure that the route that is going to
be looked at for nexthop correctness is the actual installed
route, not a previous version of it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
"ip vrf exec" command is not possible in the topotest shell.
> root@r1:~# ip vrf exec r1-cust5 bash
> mkdir failed for /sys/fs/cgroup/unified: No such file or directory
> Failed to setup vrf cgroup2 directory
Remount cgroup after remounting sysfs.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
As of now we are logging only JSON output of CLIs
in topotests(topojson) executions and same o/p is
getting printed twice, which is of no use.
Enhanced code to show both plain and JSON output
of CLIs and remove duplicate logging.
It will help in reducing execution logs and in
verification, if sometimes there is mis-match
in CLI plain and JSON outputs.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Under heavy load I am seeing verify_rib failing after 12 seconds
but succeeding after 17:
2022-05-19 18:52:54,374 DEBUG: topolog: Exiting lib API: verify_rib
2022-05-19 18:52:54,374 DEBUG: topolog: Function returned True
2022-05-19 18:52:54,374 WARNING: topolog: RETRY DIAGNOSTIC: SUCCEED after FAILED with requested timeout of 12.0s; however, succeeded in 14.7s, investigate timeout timing
There is no reason to not have the test wait a bit longer for very very
heavily loaded systems. Change the time to 40 seconds.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Lots of tests call verify_rib that takes a list of routes that
need to be verified in some fashion. This verify_rib functionality
will try up to 12 seconds before failing the check that zebra
has the route and has installed it.
Unfortunately the verify_rib code was not looking to see if
the route was queued for installation and was then allowing
tests to immediately do subsuquent steps that depended on
that route actually being installed sometimes causing tests
to fail.
Write a bit of additional code that looks at the queued
status and allows the test to wait a bit longer for zebra
to finish processing before allowing the test to move on
to the next bit.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This test is failing intermittently because sometimes igmp
local join is not getting deleted. I did split the joins means
trying to delete igmp local joins one by one. I tried running
tests multiple times and it seems to be working fine with
current changes.
There was an issue found during debugging this test failure,
which was raised already:
Issue# https://github.com/FRRouting/frr/issues/11105
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
1. Modified pim APIs name to generic one, same APIs would be used for PIMv4 and PIMv6
verifications
2. Modified all affacted scripts and ran multiple times locally to avoid CI failures
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
a) Remove the retry mechanism to continue looking for 75%
of the time for pim code.
This alone saves a bunch of time in tests that use lib/pim.py
Effectively all the times given for retry are already long
enough. Additionally some tests are gathering data with
the expectation that they will not find data so the entire
time is being taken up in retry's. Extending the retry
mechanism makes this even worse. This is especially bad
for pim in that keep alive timers are counting down and
state can be removed due to excessive time waiting.
b) Reduce verify verify_multicast_traffic from 40 seconds
to 20 seconds to gather traffic data.
A bunch of tests are doing this:
a) gather pre test start traffic data( taking about 70
seconds to run, because a bunch of time it was looking
for data that does not exist yet)
b) run a change to introduce a different traffic flow
c) gather post test traffic data ( taking about 70
seconds to run )
Why does this matter? Tests were iterating through
all the different routers looking for traffic flow
as well as different mroute state. This is against
the keepalive timer of 210 seconds. It does not take
long before the stream can be removed and the test is
still looking for data that is no longer there due
to state timeout.
The multicast_pim_sm_topo3/test_multicast_pim_sm_topo3.py
test reduced run time from 398 seconds to 297 seconds.
Greatly reducing keepalive timeout problems.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
1. Handle KeyError
2. logger object is defined in main function and its not not accessible
in other functions so defined it in local functions.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Do not allow the test system to turn off the logging of commands
Some tests use the reload command that is accidently turning off
the logging. Just force the tests to ignore it.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This PR adds support for configuring topotest routers using a single file.
instead of:
```
router.load_config(
TopoRouter.RD_ZEBRA, os.path.join(CWD, "{}/zebra.conf".format(rname))
)
router.load_config(
TopoRouter.RD_OSPF, os.path.join(CWD, "{}/ospfd.conf".format(rname))
)
router.load_config(
TopoRouter.RD_BGP, os.path.join(CWD, "{}/bgpd.conf".format(rname))
)
```
you can now do:
```
router.load_frr_config(
os.path.join(CWD, "{}/frr.conf".format(rname)),
[TopoRouter.RD_ZEBRA, TopoRouter.RD_OSPF, TopoRouter.RD_BGP]
)
```
or just:
```
router.load_frr_config(os.path.join(CWD, "{}/frr.conf".format(rname)))
```
In this latter case, the daemons list will be inferred from frr.conf file.
Signed-off-by: Jafar Al-Gharaibeh <jafar@atcorp.com>
When running a topotest with the --shell or --vtysh argument, the
window titles of the routers are generic.
Set the router name as title to identify correctly the window.
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Opening new tab in screen is not possible when using option --vtysh or
--shell. Error 'No such file or directory'.
Fix the issue.
Fixes: 6a5433ef0b ("tests: NEW micronet replacement for mininet")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Have added topotest to verify below combination.
Auth support for md5
Auth support for hmac-sha-256
Auth support with keychain for md5
Auth support with keychain for hmac-sha-256
Have sussessfully run all 4 test cases in my local setup.
Signed-off-by: Abhinay Ramesh <rabhinay@vmware.com>
VRF name should not be printed in the config since 574445ec. The update
was done for NB config output but I missed it for regular vty output.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>
verify_pim_interface_traffic *fetches* the pim
traffic data. Rename the function to what it
actually does
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Lot's of the GR topotests kill daemons in order to test code
that deals with crashing daemons. Under heavy system load
it was noticed that a kill command was sent and if told to
wait we would sleep 2 seconds send another kill command and
call it good. This was causiing issues when subsuquent
json commands would get errors like `lost connection to daemon`
as the daemon finally shut down after some time due to load.
Modify the kill the daemon function to notice that the daemon
was not actually killed and if we need to wait wait some
more time for it too happen
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently under system load tests that use verify_pim_interface_traffic
immediately after a interface down/up event are not giving any time
for pim to receive and process the data from that event. Give
the test some time to gather this data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Under heavy system load, we are sometimes seeing this
output for addKernelRoute:
2021-11-28 16:17:27,604 INFO: topolog: [DUT: b1]: Running command: [ip route add 224.0.0.13 dev b1-f1-eth0]
2021-11-28 16:17:27,604 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route add 224.0.0.13 dev b1-f1-eth0']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:27,967 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:28,243 DEBUG: topolog: ip route
70.0.0.0/24 dev b1-f1-eth0 proto kernel scope link src 70.0.0.1
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This tells us that the ip route add succeeded but when looking for it
the system failed to immediately find it. Why is this happening?
Probably we are under heavy system load and the two different
commands, 'ip route add..' and 'ip route show' are being executed
on different cpu's and the data has not been copied to the different
cpu yet in the kernel. This is not necessarily something normally
seen but entirely possible. Giving the system a few extra seconds
for the kernel to execute/work the memory barrier system seems
prudent for long term success of our programming.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Update verify_ospf6_neighbor() so we can verify there are no
neighbors in a given router
input_dict = {
"r0": {
"ospf6": {
"neighbors": []
}
}
}
result = verify_ospf6_neighbor(tgen, topo, dut, input_dict)
Signed-off-by: ckishimo <carles.kishimoto@gmail.com>
The interface area command is deprecated under
router ospf6 and should be on the individual interface.
Let's modify the tests to not actually put the
interface foo area 0.0.0.0 command under the
router node.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When using build_config_from_json there exists a timing
window where neighbors can come up before the router-id
is applied. As a precaution, quickly clear the neighbors
to ensure that we get neighbors with the expected router-id.
This can especially happen under high system load.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When testers use the build_config_from_json function
the create_router_ospf function is double creating
the ospfv3 cli to be passed in. This is because
the create_router_ospf loops over both v2 and v3
and then create_router_ospf6 re-adds v3.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Somewhere along the line core-files stopped being generated
with the running of the topotests. With this change we now
see this:
sharpd@eva /t/topotests> find . -name '*.dmp' -print
./ospfv3_basic_functionality.test_ospfv3_asbr_summary_topo1/r0/ospf6d_core-sig_6-pid_430478.dmp
sharpd@eva /t/topotests> sudo gdb /usr/lib/frr/ospf6d ./ospfv3_basic_functionality.test_ospfv3_asbr_summary_topo1/r0/ospf6d_core-sig_6-pid_430478.dmp
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/frr/ospf6d...
[New LWP 430478]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/ospf6d --log file:ospf6d.log --log-level debug -d'.
Program terminated with signal SIGABRT, Aborted.
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
(gdb)
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Our topotests send SIGBUS 2 seconds after a SIGTERM is
initiated. This is bad because under a heavily loaded
topotest system we may have a case where the system has
not had a chance to properly shut down the daemon.
Extend the time greatly before topotests send SIGBUS.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Several tests used the route_map_create functionality
with `metric-type` but never bothered to add the
backend code to ensure it works correctly.
Add it in so it can be used.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Considering that both the GR helper mode and restarting mode can be
enabled at the same time, the "graceful-restart helper-only" command
can be a bit misleading since it implies that only the helper mode
is enabled. Rename the command to "graceful-restart helper enable"
to clarify what the command does.
Start a deprecation cycle of one year before removing the original
command
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
A bunch of tests have this pattern:
a) Install a new prefix into bgp
b) Run this loop:
foreach (router in topology) {
verify_bgp_rib(router)
}
This is to ensure that the prefix is actually disseminated.
The problem with this, of course, is that a wait of 2 seconds
for every item in that loop makes no sense. As that the initial
router verification of it's bgp rib will wait 2 seconds and
all the remaining bgp routers in the topology will have gotten
the data. So we end up waiting a bunch of extra time.
Remove the initial_wait time for verify_bgp_rib. Also
increase the failure wait time to 30 seconds. This is
to give a bigger window for bgp to send it's data for
our test systems that could be under heavy load. In the
normal case tests will never hit this.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Modernize the test a bit, generate expected results rather than load from
file, and add a general json_cmp with retry function and use it.
Signed-off-by: Christian Hopps <chopps@labn.net>
When looking for a implied host route it is not necessary
to add the `/32` to an ip route add. As such masks
will not be set in this case. Set the value of masks
to a known good value so that when the route installation
fails the test for it actually being there will tell you
that the route is not there -vs- complaining about mask
being uninited.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
- Fix xterm support to work, previously it mostly didn't, not it should
in all cases (i.e., single or dist mode).
- Catch when the user tries to use various window requiring topotests
features (e.g., --cli-on-error) but isn't running under supported
system (e.g., byobu/tmux/xterm), and fail the run with an explanation.
Signed-off-by: Christian Hopps <chopps@labn.net>
Create a pid file for the router created by topotest.
By executing nsenter directly against this pid, developers
can execute commands directly from outside the unet shell.
This allows the developer to use script, tab completion, etc.,
and improves efficiency.
Signed-off-by: Hiroki Shirokura <slank.dev@gmail.com>
Refactor the bgp_auth test to create common_config code to allow
non-json based tests to reset routers and load configs in parallel.
Signed-off-by: Christian Hopps <chopps@labn.net>
- In order to run tests in parallel the netns-based vrfs need to
have unique names primarily bc they are all tracked/looked-up in
`/run/netns` which is not network namespace nesting friendly
- use ip(8) exclusively rather than a mix of `ip` and `ifconfig`
and `vconfig`, reducing required pkg count by a couple.
Signed-off-by: Christian Hopps <chopps@labn.net>
- bugs in the support library function `verify_gr_address_family`
allowed this test to pass depending on ordering of python dictinoary
keys. Fix the bugs, fix the test.
Signed-off-by: Christian Hopps <chopps@labn.net>
- A more general fix for the bgp listener test which requires interfaces be
configured in the kernel when the bgpd daemons are launched.
Signed-off-by: Christian Hopps <chopps@labn.net>
- Remove incorrect requirement for `service integrated-vtysh-config`
when producing a delta.
- Add `--test-reset` option which suppresses non-parseable lines from the
produced delta
- Use new features in common_config.py
Signed-off-by: Christian Hopps <chopps@labn.net>
TMUX and Screen support when running topotests inside docker. This
allows the gdb, shell and vtysh features to correctly work even when
running the tests inside docker.
Add options:
--asan-abort :: aborts the process on ASAN errors
--strace-daemons :: strace some or all daemons
Signed-off-by: Christian Hopps <chopps@labn.net>
Using "write memory" to save the daemons' configurations before
restarting them can cause log files to stop working correctly. Add
a new "save_config" to the kill_router_daemons() function to prevent
that from happening when saving the configurations isn't necessary.
Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
Speedup (large topo): OLD: ~6 minutes NEW: ~1 second
(when paired with generate_support_bundle.py changes)
- Collect from each node in parallel
Bug fixes:
- sub-directory test name was the same internal pytest function name
for any test, and not the actual test name.
Signed-off-by: Christian Hopps <chopps@labn.net>
This python fixture was way too complex for what is needed.
Eliminate gratuitous options/over-engineering:
- Change from non-deterministic `wait` and `attempts` to a single
`retry_timeout` value. This is both more deterministic, as well as
what the user should actually be thinking about.
- Use a fixed 2 second pause between executing the wrapped function
rather than a bunch of arbitrary choices of 2, 3 and 4 seconds
spread all over the test code.
- Get rid of the multiple variables for determining what "Positive" and
"Negative" results are. Instead just implement what all the user code
already wants, i.e., boolean False or a str (errormsg) means
"Negative" result otherwise it's a "Positive" result.
- As part of the above the inversion logic is much more comprehensible
in the fixture code (and more correct to boot).
Signed-off-by: Christian Hopps <chopps@labn.net>