- Often millisecond precision is not good enough to differentiate events that
occur directly one after another from events that have some pause in between;
increase the (reporting) timestamp precision to microseconds.
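For reference, a minimal sketch (not munet's actual formatter) of getting
microsecond-precision timestamps out of Python's logging module:

    import logging
    from datetime import datetime

    class UsecFormatter(logging.Formatter):
        """Render log timestamps with microsecond precision."""

        def formatTime(self, record, datefmt=None):
            # record.created is a float epoch time; datetime keeps the
            # sub-millisecond digits that time.strftime() would drop.
            dt = datetime.fromtimestamp(record.created)
            return dt.strftime(datefmt or "%Y-%m-%d %H:%M:%S.%f")

    handler = logging.StreamHandler()
    handler.setFormatter(UsecFormatter("%(asctime)s %(levelname)s: %(message)s"))
    logging.getLogger().addHandler(handler)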
Signed-off-by: Christian Hopps <chopps@labn.net>
After the munet switch we weren't passing the logger on to the low-level
LinuxNamespace and thus Commander parent classes, so the lowest-level
`cmd_status` logs were missing from the more specific log files in the run
directory.
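The shape of the fix, roughly (the class names are munet's, but the signature
shown here is illustrative):

    from munet.base import LinuxNamespace  # import path may vary by munet version

    class Node(LinuxNamespace):
        def __init__(self, name, logger=None, **kwargs):
            # Forward the per-node logger so low-level cmd_status()
            # logging from Commander lands in the node-specific log
            # file instead of being dropped.
            super().__init__(name, logger=logger, **kwargs)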
Signed-off-by: Christian Hopps <chopps@labn.net>
- Remove the .pid and .vty files and then wait for them to show back up.
- Fix the broken BGP GR test so it does not fail now that its bug is exposed. It
only worked because, when starting a daemon, the stale pid file still existed and
blocked the bogus second BGP launch from happening.
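A rough sketch of the new wait, with hypothetical paths and helper name:

    import os
    import time

    def wait_for_daemon_files(rundir, daemon, timeout=10):
        """Remove stale files, then poll until the daemon recreates them."""
        pidfile = os.path.join(rundir, daemon + ".pid")
        vtyfile = os.path.join(rundir, daemon + ".vty")
        for path in (pidfile, vtyfile):
            if os.path.exists(path):
                os.unlink(path)
        # ... the daemon is (re)started here ...
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.exists(pidfile) and os.path.exists(vtyfile):
                return True
            time.sleep(0.1)
        return False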
Signed-off-by: Christian Hopps <chopps@labn.net>
Rather than creating a new global dict and copying all the config into it, just
expose the pytest config globally and use it directly.
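A minimal sketch using pytest's standard pytest_configure hook (the module
that holds the global is illustrative):

    # conftest.py
    from lib import topotest  # wherever the shared global should live

    def pytest_configure(config):
        # Expose the pytest config object itself rather than copying
        # selected values into a separate global dict.
        topotest.pytest_config = config

Callers can then read options directly, e.g.
topotest.pytest_config.getoption("--gdb-routers").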
Signed-off-by: Christian Hopps <chopps@labn.net>
When launching the daemons under gdb it takes a while for them to come up; the
current code only looks for pid files to determine whether a daemon is running.
That check is no good, as these files are left around by previous runs.
For now, do a simple sleep when debugging with gdb to get things working.
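The stopgap amounts to (names illustrative):

    import time

    def wait_for_gdb_startup(gdb_daemons, delay=5):
        # Stale pid files from prior runs make the pid-file check
        # unreliable, so under gdb just wait a fixed time.
        if gdb_daemons:
            time.sleep(delay)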
Signed-off-by: Christian Hopps <chopps@labn.net>
For multicast PIMv6 join and traffic, socat is
used, which was not cleaned up after test execution;
enhanced the kill_socat() API to kill the socat join- and
traffic-specific PIDs during module teardown.
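Conceptually the teardown does something like this (names and signature are
illustrative, not the actual lib/pim.py API):

    import os
    import signal

    def kill_socat(router, pids=None):
        """Kill socat join/traffic processes spawned for this router."""
        if pids:
            for pid in pids:
                try:
                    os.kill(int(pid), signal.SIGKILL)
                except (ProcessLookupError, ValueError):
                    pass  # already gone, or a bad pid string
        else:
            # Fall back to killing any stray socat in the router's netns.
            router.run("pkill -9 socat")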
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Enhanced or added new libraries to support
multicast MLD local join automation.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Co-authored-by: Vijay Kumar Gupta <vijayg@vmware.com>
Enhanced or added new libraries to support
multicast PIMv6 automation.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Co-authored-by: Vijay Kumar Gupta <vijayg@vmware.com>
Add an iproute2 API guard to the SVD test using `bridge fdb get`.
While it SHOULD be present on most systems based on their kernel
version, it may not be present due to kernel/iproute2 version mismatch
weirdness.
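The guard boils down to probing iproute2 and skipping when the subcommand is
missing; a sketch, assuming a topotest-style router handle whose cmd_status()
returns (rc, stdout, stderr):

    import pytest

    def skip_if_no_fdb_get(router):
        """Skip when the installed iproute2 lacks `bridge fdb get`."""
        _, stdout, stderr = router.cmd_status("bridge fdb help")
        # Supporting iproute2 versions list "get" in the usage text.
        if "get" not in stdout + stderr:
            pytest.skip("iproute2 does not support `bridge fdb get`")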
Signed-off-by: Stephen Worley <sworley@nvidia.com>
Added topotests to verify the scenarios below.
1. Verify OSPF flood reduction functionality with OSPF enabled at the process level.
2. Verify OSPF flood reduction functionality with OSPF enabled at the area level.
3. Verify OSPF flood reduction functionality between different areas.
These have been successfully tested in my local setup.
Signed-off-by: nguggarigoud <nguggarigoud@vmware.com>
Testcase: test_pim6_multiple_groups_different_RP_address_p2
was failing because of a bug in the framework; the
bug is fixed in this commit.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Multicast PIM6 static RP tests were failing
when run in parallel using micronet. There
are APIs to clean up mcast traffic before
starting a new test, but these cleanups
are not needed when socat is used.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
This change alters the behavior of existing test code. The
default mode (before any call to luSetWaitType()) is now
"strict".
The historical behavior of luCommand(op="wait") is to ignore
failures to match the specified regexp in the specified time.
In those cases, no result was logged and no error was signaled.
This change introduces a new "strict" mode for luCommand(op="wait"):
in "strict" wait mode, each invocation of luCommand(op="wait")
generates an explicit, logged failure result when it fails to match
the specified regexp in the specified time. These failures signal
an error for the test.
Calling luSetWaitType("nostrict") restores the historical behavior.
Calling luSetWaitType("strict") (re)enables the new strict behavior.
Individual calls to luCommand() may also specify op="wait-nostrict"
to override any default and use the historical behavior.
Individual calls to luCommand() may also specify op="wait-strict"
to override any default and use the new behavior.
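For example (argument order follows lutil's
luCommand(target, command, regexp, op, result, time); details may vary):

    from lib.lutil import luCommand, luSetWaitType

    # Default all waits in this script to strict: a wait that never
    # matches now logs an explicit failure instead of passing silently.
    luSetWaitType("strict")
    luCommand("r1", "show bgp summary", "Established", "wait", "peer up", 90)

    # Per-call override: tolerate a miss for just this one wait.
    luCommand("r1", "show ip route 10.0.0.0/24", r"10\.0\.0\.0/24",
              "wait-nostrict", "route may appear", 10)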
Signed-off-by: G. Paul Ziemba <paulz@labn.net>
Add an "exist" key to check the existence of a prefix in the BGP RIB.
Useful for checking that a prefix has not been leaked by error.
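Usage might look like the following sketch (the route-dict layout and the
verify helper follow the usual topotest conventions, which may differ in
detail):

    from lib.bgp import verify_bgp_rib  # helper name illustrative

    def check_no_leak(tgen):
        input_dict = {
            "r1": {
                "static_routes": [
                    {
                        "network": "10.0.0.0/24",
                        # Assert the prefix is absent from the BGP RIB,
                        # i.e. it has not been leaked by error.
                        "exist": False,
                    }
                ]
            }
        }
        return verify_bgp_rib(tgen, "ipv4", "r1", input_dict)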
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Earlier, a daemon parameter was passed to
start_topology(); it is not needed now,
as new code is implemented to start
feature-specific daemons.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Currently topotests start all daemons by default;
changed the framework so that only the daemons
needed by a particular test suite are started.
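With the topogen-style API the equivalent looks roughly like this; only the
daemons a suite actually exercises get configs loaded, and only those are
started:

    import os
    from lib.topogen import TopoRouter

    CWD = os.path.dirname(os.path.realpath(__file__))

    def setup_routers(tgen):
        for rname, router in tgen.routers().items():
            router.load_config(TopoRouter.RD_ZEBRA,
                               os.path.join(CWD, "{}/zebra.conf".format(rname)))
            router.load_config(TopoRouter.RD_PIM,
                               os.path.join(CWD, "{}/pimd.conf".format(rname)))
        tgen.start_router()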
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
This is for the run_and_expect_type and run_and_expect topotests methods.
Some contributions unintentionally get merged with very low values, which leads
to CI failures; let's guard this a bit.
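The guard is conceptually just a floor on the total retry budget (threshold
illustrative):

    def check_retry_budget(count, wait, minimum=5.0):
        """Reject run_and_expect()-style budgets too small for CI."""
        assert count * wait >= minimum, (
            "retry budget %.1fs (count=%s, wait=%s) is too small"
            % (count * wait, count, wait)
        )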
Signed-off-by: Donatas Abraitis <donatas@opensourcerouting.org>
When zebra receives routes from upper level protocols it decodes the
zapi message and places the routes on the metaQ for processing. Suppose
we have a route A that is already installed by some routing protocol.
And there is a route B that has a nexthop that will be recursively
resolved through A. Imagine if a route replace operation for A is
going to happen from an upper level protocol at about the same time
the route B is going to be installed into zebra. If these routes
are received, and decoded, at about the same time there exists a
chance that the metaQ will contain both of them at the same time.
If the order of installation is [ B, A ]. B will be resolved
correctly through A and installed, A will be processed and
re-installed into the FIB. If the nexthops have changed for
A, then the owner of B should be notified about the change (and B
can do the correct action here and decide to withdraw or re-install).
Now imagine if the order of routes received for processing on the
metaQ is [ A, B ]. A will be received, processed and sent to the
dataplane for reinstall. B will then be pulled off the metaQ and
fail the install since A is in a `not Installed` state.
Let's loosen the restriction in nexthop resolution for B such
that if the route we are dependent on is a route replace operation
allow the resolution to succeed. This requires zebra to track a new
route state (ROUTE_ENTRY_ROUTE_REPLACING) that can be looked at
during nexthop resolution. I believe this is ok because A is
a route replace operation, which could result in this:
-route install failed, in which case B should be nht'ing and
will receive the nht failure and the upper level protocol should
remove B.
-route install succeeded, no nexthop changes. In this case
allowing the resolution for B is ok, NHT will not notify the upper
level protocol so no action is needed.
-route install succeeded, nexthops changes. In this case
allowing the resolution for B is ok, NHT will notify the upper
level protocol and it can decide to reinstall B or not based
upon its own algorithm.
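A conceptual Python model of the loosened check (zebra itself is C; the names
below only mirror the logic described above):

    from dataclasses import dataclass

    @dataclass
    class RouteEntry:
        installed: bool = False
        replacing: bool = False  # models ROUTE_ENTRY_ROUTE_REPLACING

    def may_resolve_through(entry: RouteEntry) -> bool:
        # Previously only `installed` was accepted; a route replace in
        # flight left A "not Installed" and B's install would fail.
        return entry.installed or entry.replacing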
This set of events was found by the bgp_distance_change topotest(s).
Effectively the tests were looking for the bug (A, B order in the metaQ)
as the `correct` state. When under very heavy load, the A, B ordering
caused A to just be installed and fully resolved in the dataplane before
B was gotten to (which is entirely possible).
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Enhanced a few existing PIM APIs to support both
IPv4 and IPv6 configuration. Added a few new APIs
for PIMv6. Tested all existing tests with the new
API changes.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
This test checks that there are no errors when receiving BFD
packets over the various Linux VRF interfaces. For example, if
an incoming packet is received by the wrong socket, a VRF
mismatch error would occur, and BFD flapping would be observed.
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The issue was reported by Donald: we were hitting
a "key not found" error and execution was
stopped, which is fixed by this PR.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Verifying and making sure PIM neighbors are
up before sending the BSM packet using Scapy.
Verifying static routes are installed before
proceeding further.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
When you have a static route with multiple different admin
distances, there exists a chance that the route will have been
installed multiple times, due to system load, when the versions are
inserted at about the same time. If this is the case then the
verify_rib function can and will select the wrong route
that happens to have a nexthop group that is still installed.
Modify verify_rib to ensure that the route that is going to
be looked at for nexthop correctness is the actual installed
route, not a previous version of it.
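The fix amounts to filtering zebra's JSON for the entry actually marked
installed before comparing nexthops; a sketch using the
`show ip route ... json` fields:

    import json

    def get_installed_entry(router, prefix):
        """Return the route entry zebra actually installed for prefix."""
        out = json.loads(router.vtysh_cmd("show ip route %s json" % prefix))
        for entry in out.get(prefix, []):
            # Skip stale duplicates from the other admin distances;
            # only the entry flagged "installed" reflects the FIB.
            if entry.get("installed"):
                return entry
        return None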
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
"ip vrf exec" command is not possible in the topotest shell.
> root@r1:~# ip vrf exec r1-cust5 bash
> mkdir failed for /sys/fs/cgroup/unified: No such file or directory
> Failed to setup vrf cgroup2 directory
Remount cgroup after remounting sysfs.
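A sketch of the remount sequence in Python (mount sources and the exact option
set are assumptions, not taken verbatim from the change):

    import subprocess

    # Once sysfs has been remounted inside the namespace, remount the
    # unified cgroup hierarchy so "ip vrf exec" can create its per-VRF
    # cgroup directory (mount point taken from the error above).
    subprocess.run(["mount", "-t", "sysfs", "sysfs", "/sys"], check=True)
    subprocess.run(
        ["mount", "-t", "cgroup2", "cgroup", "/sys/fs/cgroup/unified"],
        check=True,
    )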
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
As of now we log only the JSON output of CLIs
in topotest (topojson) executions, and the same output
gets printed twice, which is of no use.
Enhanced the code to show both the plain and the JSON output
of CLIs and to remove the duplicate logging.
This will help reduce execution logs and aid
verification whenever there is a mismatch
between the plain and JSON CLI outputs.
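Conceptually (helper name illustrative):

    import json
    import logging

    logger = logging.getLogger(__name__)

    def show_and_parse(router, show_cmd):
        """Log the plain-text CLI output once; return the parsed JSON.

        Logging the plain output alongside the JSON used for checks
        makes plain/JSON mismatches visible without duplicate logs.
        """
        logger.info("Output of '%s':\n%s", show_cmd, router.vtysh_cmd(show_cmd))
        return json.loads(router.vtysh_cmd(show_cmd + " json"))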
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
Under heavy load I am seeing verify_rib failing after 12 seconds
but succeeding after 17:
2022-05-19 18:52:54,374 DEBUG: topolog: Exiting lib API: verify_rib
2022-05-19 18:52:54,374 DEBUG: topolog: Function returned True
2022-05-19 18:52:54,374 WARNING: topolog: RETRY DIAGNOSTIC: SUCCEED after FAILED with requested timeout of 12.0s; however, succeeded in 14.7s, investigate timeout timing
There is no reason not to have the test wait a bit longer on very
heavily loaded systems. Change the time to 40 seconds.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Lots of tests call verify_rib that takes a list of routes that
need to be verified in some fashion. This verify_rib functionality
will try up to 12 seconds before failing the check that zebra
has the route and has installed it.
Unfortunately the verify_rib code was not looking to see if
the route was queued for installation, and was then allowing
tests to immediately perform subsequent steps that depended on
that route actually being installed, sometimes causing tests
to fail.
Write a bit of additional code that looks at the queued
status and allows the test to wait a bit longer for zebra
to finish processing before allowing the test to move on
to the next bit.
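The additional check is essentially this (sketch; zebra reports a "queued"
flag in its JSON route output):

    import json

    def route_fully_processed(router, prefix):
        """True once zebra has the route and it is no longer queued."""
        out = json.loads(router.vtysh_cmd("show ip route %s json" % prefix))
        entries = out.get(prefix)
        if not entries:
            return False
        # A "queued" entry is still in flight to the dataplane; keep
        # waiting rather than letting the test race ahead.
        return all(not entry.get("queued") for entry in entries)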
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
This test was failing intermittently because sometimes the IGMP
local join was not getting deleted. I split the joins, i.e.,
deleted the IGMP local joins one by one. I ran the
tests multiple times and they seem to work fine with
the current changes.
An issue was found while debugging this test failure,
which has already been raised:
Issue: https://github.com/FRRouting/frr/issues/11105
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
1. Modified PIM API names to generic ones; the same APIs will be used for PIMv4 and PIMv6
verifications.
2. Modified all affected scripts and ran them multiple times locally to avoid CI failures.
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>
a) Remove the retry mechanism's habit of continuing to look for an
extra 75% of the allotted time in the pim code.
This alone saves a bunch of time in tests that use lib/pim.py.
Effectively all the times given for retry are already long
enough. Additionally, some tests are gathering data with
the expectation that they will not find data, so the entire
time is being taken up in retries. Extending the retry
mechanism makes this even worse. This is especially bad
for pim in that keepalive timers are counting down and
state can be removed due to excessive waiting.
b) Reduce verify_multicast_traffic from 40 seconds
to 20 seconds when gathering traffic data.
A bunch of tests are doing this:
a) gather pre-test-start traffic data (taking about 70
seconds to run, because much of the time was spent looking
for data that did not exist yet)
b) run a change to introduce a different traffic flow
c) gather post-test traffic data (taking about 70
seconds to run)
Why does this matter? Tests were iterating through
all the different routers looking for traffic flow
as well as different mroute state. This is against
the keepalive timer of 210 seconds. It does not take
long before the stream can be removed and the test is
still looking for data that is no longer there due
to state timeout.
The multicast_pim_sm_topo3/test_multicast_pim_sm_topo3.py
test's run time dropped from 398 seconds to 297 seconds,
greatly reducing keepalive timeout problems.
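A simplified model of (a): the @retry decorator in lib/common_config.py has a
diag_pct knob (historically 0.75, i.e. the extra 75%), and the pim checks now
effectively run with it zeroed so the caller's timeout is honored exactly.
Sketch, not the real decorator:

    import functools
    import time

    def retry(retry_timeout, diag_pct=0):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # diag_pct extends polling past the timeout purely for
                # diagnostics; 0 means stop exactly at retry_timeout.
                deadline = time.time() + retry_timeout * (1 + diag_pct)
                ret = func(*args, **kwargs)
                while ret is not True and time.time() < deadline:
                    time.sleep(1)
                    ret = func(*args, **kwargs)
                return ret
            return wrapper
        return decorator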
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
1. Handle KeyError.
2. The logger object was defined in the main function and was not accessible
in other functions, so define it in the local functions.
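For example (names illustrative):

    import logging

    def verify_feature(input_dict):
        # Define the logger locally; the one created in main() is not
        # visible from here.
        logger = logging.getLogger(__name__)
        # Use .get() so a missing key is reported instead of raising
        # an unhandled KeyError that stops the run.
        routes = input_dict.get("static_routes")
        if routes is None:
            logger.error("static_routes key missing from input dict")
            return False
        return True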
Signed-off-by: Kuldeep Kashyap <kashyapk@vmware.com>