On startup we create a thread timer event to do a rib sweep
of the system. On shutdown we never stop this timer, so the
thread event can fire during shutdown after the data it operates
on has been freed. Here is the crash I am seeing:
(gdb) bt
(gdb)
Save the thread data in zebra_router and stop the thread so we don't
accidentally do work on shutdown that we don't mean to. In this case
it happened in our topotests under severe system load.
Essentially we happened to kill the zebra daemon just as the
graceful_restart timer popped here.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently, under system load, tests that use verify_pim_interface_traffic
immediately after an interface down/up event do not give pim any time
to receive and process the data from that event. Give the test some
time to gather this data.
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Under heavy system load, we are sometimes seeing this
output for addKernelRoute:
2021-11-28 16:17:27,604 INFO: topolog: [DUT: b1]: Running command: [ip route add 224.0.0.13 dev b1-f1-eth0]
2021-11-28 16:17:27,604 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route add 224.0.0.13 dev b1-f1-eth0']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:27,967 DEBUG: topolog.b1: LinuxNamespace(b1): cmd_status("['/bin/bash', '-c', 'ip route']", kwargs: {'encoding': 'utf-8', 'stdout': -1, 'stderr': -2, 'shell': False, 'stdin': None})
2021-11-28 16:17:28,243 DEBUG: topolog: ip route
70.0.0.0/24 dev b1-f1-eth0 proto kernel scope link src 70.0.0.1
This tells us that the ip route add succeeded, but when we looked for
the route immediately afterwards the system failed to find it. Why is
this happening? Probably we are under heavy system load and the two
different commands, 'ip route add..' and 'ip route show', are being
executed on different CPUs, and the kernel has not yet made the new
route visible on the other CPU. This is not something we normally see,
but it is entirely possible. Giving the kernel a few extra seconds to
work through its memory barriers seems prudent for the long-term
success of our programming.
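In code, the change boils down to retrying the verification for a few
seconds instead of failing on the first 'ip route show' that misses the
new route. This is a standalone sketch of that idea using plain
subprocess calls; the real addKernelRoute helper in the topotest
framework differs in its details.

import subprocess
import time


def route_visible(prefix, dev):
    """Return True if 'ip route show dev <dev>' already lists prefix."""
    out = subprocess.run(
        ["ip", "route", "show", "dev", dev],
        capture_output=True, text=True, check=False,
    ).stdout
    return prefix in out


def add_kernel_route(prefix, dev, settle_secs=5):
    """Add a kernel route, then give a loaded kernel a moment to show it."""
    subprocess.run(["ip", "route", "add", prefix, "dev", dev], check=True)
    deadline = time.monotonic() + settle_secs
    while time.monotonic() < deadline:
        if route_visible(prefix, dev):
            return True
        time.sleep(1)  # brief pause so the new route becomes visible under load
    return route_visible(prefix, dev)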
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
During repeated runs I am seeing this test fail. Upon inspecting the
output:
{
"prefix":"10.0.10.0/24",
"prefixLen":24,
"protocol":"isis",
"vrfId":6,
"vrfName":"r1-cust1",
"selected":true,
"destSelected":true,
"distance":115,
"metric":10,
"queued":true,
We can see that the route is still queued. Under heavy system load,
if we do not ensure that isis has had time to send the route to zebra
and that zebra has had time to install it, this test can fail.
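The fix is to wait until zebra reports the route as installed rather
than merely queued. A minimal sketch of that check follows; it shells
out to vtysh directly (the actual test goes through the topotest router
object), and the prefix/vrf arguments are only illustrative.

import json
import subprocess
import time


def route_installed(prefix, vrf="default", timeout=60, interval=5):
    """Poll zebra until prefix has a selected entry no longer marked queued."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["vtysh", "-c", f"show ip route vrf {vrf} {prefix} json"],
            capture_output=True, text=True, check=False,
        ).stdout
        entries = json.loads(out or "{}").get(prefix, [])
        # Installed means at least one selected entry that zebra is no
        # longer holding in its work queue ("queued" absent or false).
        if any(e.get("selected") and not e.get("queued") for e in entries):
            return True
        time.sleep(interval)
    return False


# Example: wait for the isis route from the failing output above.
# assert route_installed("10.0.10.0/24", vrf="r1-cust1")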
Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When the summary-address is deleted, `ospf_aggr_handle_external_info` is
called for each aggregated route to clean it up. It needs to find the
corresponding OSPF instance, and it does so using `ei->instance`, which
is totally wrong because that is the instance from which the route was
redistributed, not the local OSPF instance. A pointer to the correct
OSPF instance is already stored in the external_info structure.
Signed-off-by: Igor Ryzhov <iryzhov@nfware.com>