Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a user changes bonding options, a NETDEV_CHANGEINFODATA
notificatin is generated which results in a rtnelink message to
be sent. While runnig 'ip monitor', we can actually see 2 messages,
one a result of the event, and the other a result of state change
that is generated bo netdev_state_change(). However, this is not
always the case. If bonding changes were done via sysfs or ifenslave
(old ioctl interface), then only 1 message is seen.
This patch removes duplicate messages in the case of using netlink
to configure bonding. It introduceds a separte function that
triggers a netdev event and uses that function in the syfs and ioctl
cases.
This was discovered while auditing all the different envents and
continues the effort of cleaning up duplicated netlink messages.
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the loadbalance arp monitoring scheme, when a slave link change is
detected, the slave->link is immediately updated and slave_state_changed
is set. Later down the function, the rtnl_lock is acquired and the
changes are committed, updating the bond link state.
However, the acquisition of the rtnl_lock can fail. The next time the
monitor runs, since slave->link is already updated, it determines that
link is unchanged. This results in the bond link state permanently out
of sync with the slave link.
This patch modifies bond_loadbalance_arp_mon() to handle link changes
identical to bond_ab_arp_{inspect/commit}(). The new link state is
maintained in slave->new_link until we're ready to commit at which point
it's copied into slave->link.
NOTE: miimon_{inspect/commit}() has a more complex state machine
requiring the use of the bond_{propose,commit}_link_state() functions
which maintains the intermediate state in slave->link_new_state. The arp
monitors don't require that.
Testing: This bug is very easy to reproduce with the following steps.
1. In a loop, toggle a slave link of a bond slave interface.
2. In a separate loop, do ifconfig up/down of an unrelated interface to
create contention for rtnl_lock.
Within a few iterations, the bond link goes out of sync with the slave
link.
Signed-off-by: Nithin Nayak Sujir <nsujir@tintri.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit dc9c4d0fe0, the arp_target array moved from a static global
to a local variable. By the nature of static globals, the array used to
be initialized to all 0. At present, it's full of random data, which
that gets interpreted as arp_target values, when none have actually been
specified. Systems end up booting with spew along these lines:
[ 32.161783] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.168475] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.175089] 8021q: adding VLAN 0 to HW filter on device lacp0
[ 32.193091] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[ 32.204892] lacp0: Setting MII monitoring interval to 100
[ 32.211071] lacp0: Removing ARP target 216.124.228.17
[ 32.216824] lacp0: Removing ARP target 218.160.255.255
[ 32.222646] lacp0: Removing ARP target 185.170.136.184
[ 32.228496] lacp0: invalid ARP target 255.255.255.255 specified for removal
[ 32.236294] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[ 32.243987] lacp0: Removing ARP target 56.125.228.17
[ 32.249625] lacp0: Removing ARP target 218.160.255.255
[ 32.255432] lacp0: Removing ARP target 15.157.233.184
[ 32.261165] lacp0: invalid ARP target 255.255.255.255 specified for removal
[ 32.268939] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[ 32.276632] lacp0: Removing ARP target 16.0.0.0
[ 32.281755] lacp0: Removing ARP target 218.160.255.255
[ 32.287567] lacp0: Removing ARP target 72.125.228.17
[ 32.293165] lacp0: Removing ARP target 218.160.255.255
[ 32.298970] lacp0: Removing ARP target 8.125.228.17
[ 32.304458] lacp0: Removing ARP target 218.160.255.255
None of these were actually specified as ARP targets, and the driver does
seem to clean up the mess okay, but it's rather noisy and confusing, leaks
values to userspace, and the 255.255.255.255 spew shows up even when debug
prints are disabled.
The fix: just zero out arp_target at init time.
While we're in here, init arp_all_targets_value in the right place.
Fixes: dc9c4d0fe0 ("bonding: reduce scope of some global variables")
CC: Mahesh Bandewar <maheshb@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of 7bb11dc9f5 and 0622cab034, bond slaves in a 3ad bond are not
removed from the aggregator when they are down, and the active slave count
is NOT equal to number of ports in the aggregator, but rather the number
of ports in the aggregator that are still enabled. The sysfs spew for
bonding_show_ad_num_ports() has a comment that says "Show number of active
802.3ad ports.", but it's currently showing total number of ports, both
active and inactive. Remedy it by using the same logic introduced in
0622cab034 in __bond_3ad_get_active_agg_info(), so sysfs, procfs and
netlink all report the number of active ports. Note that this means that
IFLA_BOND_AD_INFO_NUM_PORTS really means NUM_ACTIVE_PORTS instead of
NUM_PORTS, and thus perhaps should be renamed for clarity.
Lightly tested on a dual i40e lacp bond, simulating link downs with an ip
link set dev <slave2> down, was able to produce the state where I could
see both in the same aggregator, but a number of ports count of 1.
MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2 <---
Slave Interface: ens10
MII Status: up <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1
MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1 <---
Slave Interface: ens10
MII Status: down <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
On slave list updates, the bonding driver computes its hard_header_len
as the maximum of all enslaved devices's hard_header_len.
If the slave list is empty, e.g. on last enslaved device removal,
ETH_HLEN is used.
Since the bonding header_ops are set only when the first enslaved
device is attached, the above can lead to header_ops->create()
being called with the wrong skb headroom in place.
If bond0 is configured on top of ipoib devices, with the
following commands:
ifup bond0
for slave in $BOND_SLAVES_LIST; do
ip link set dev $slave nomaster
done
ping -c 1 <ip on bond0 subnet>
we will obtain a skb_under_panic() with a similar call trace:
skb_push+0x3d/0x40
push_pseudo_header+0x17/0x30 [ib_ipoib]
ipoib_hard_header+0x4e/0x80 [ib_ipoib]
arp_create+0x12f/0x220
arp_send_dst.part.19+0x28/0x50
arp_solicit+0x115/0x290
neigh_probe+0x4d/0x70
__neigh_event_send+0xa7/0x230
neigh_resolve_output+0x12e/0x1c0
ip_finish_output2+0x14b/0x390
ip_finish_output+0x136/0x1e0
ip_output+0x76/0xe0
ip_local_out+0x35/0x40
ip_send_skb+0x19/0x40
ip_push_pending_frames+0x33/0x40
raw_sendmsg+0x7d3/0xb50
inet_sendmsg+0x31/0xb0
sock_sendmsg+0x38/0x50
SYSC_sendto+0x102/0x190
SyS_sendto+0xe/0x10
do_syscall_64+0x67/0x180
entry_SYSCALL64_slow_path+0x25/0x25
This change addresses the issue avoiding updating the bonding device
hard_header_len when the slaves list become empty, forbidding to
shrink it below the value used by header_ops->create().
The bug is there since commit 54ef313714 ("[PATCH] bonding: Handle large
hard_header_len") but the panic can be triggered only since
commit fc791b6335 ("IB/ipoib: move back IB LL address into the hard
header").
Reported-by: Norbert P <noe@physik.uzh.ch>
Fixes: 54ef313714 ("[PATCH] bonding: Handle large hard_header_len")
Fixes: fc791b6335 ("IB/ipoib: move back IB LL address into the hard header")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier patch 4493b81bea ("bonding: initialize work-queues during
creation of bond") moved the work-queue initialization from bond_open()
to bond_create(). However this caused the link those are created using
netlink 'create bond option' (ip link add bondX type bond); create the
new trunk without initializing work-queues. Prior to the above mentioned
change, ndo_open was in both paths and things worked correctly. The
consequence is visible in the report shared by Joe Stringer -
I've noticed that this patch breaks bonding within namespaces if
you're not careful to perform device cleanup correctly.
Here's my repro script, you can run on any net-next with this patch
and you'll start seeing some weird behaviour:
ip netns add foo
ip li add veth0 type veth peer name veth0+ netns foo
ip li add veth1 type veth peer name veth1+ netns foo
ip netns exec foo ip li add bond0 type bond
ip netns exec foo ip li set dev veth0+ master bond0
ip netns exec foo ip li set dev veth1+ master bond0
ip netns exec foo ip addr add dev bond0 192.168.0.1/24
ip netns exec foo ip li set dev bond0 up
ip li del dev veth0
ip li del dev veth1
The second to last command segfaults, last command hangs. rtnl is now
permanently locked. It's not a problem if you take bond0 down before
deleting veths, or delete bond0 before deleting veths. If you delete
either end of the veth pair as per above, either inside or outside the
namespace, it hits this problem.
Here's some kernel logs:
[ 1221.801610] bond0: Enslaving veth0+ as an active interface with an up link
[ 1224.449581] bond0: Enslaving veth1+ as an active interface with an up link
[ 1281.193863] bond0: Releasing backup interface veth0+
[ 1281.193866] bond0: the permanent HWaddr of veth0+ -
16:bf:fb:e0:b8:43 - is still in use by bond0 - set the HWaddr of
veth0+ to a different address to avoid conflicts
[ 1281.193867] ------------[ cut here ]------------
[ 1281.193873] WARNING: CPU: 0 PID: 2024 at kernel/workqueue.c:1511
__queue_delayed_work+0x13f/0x150
[ 1281.193873] Modules linked in: bonding veth openvswitch nf_nat_ipv6
nf_nat_ipv4 nf_nat autofs4 nfsd auth_rpcgss nfs_acl binfmt_misc nfs
lockd grace sunrpc fscache ppdev vmw_balloon coretemp psmouse
serio_raw vmwgfx ttm drm_kms_helper vmw_vmci netconsole parport_pc
configfs drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect sysimgblt
shpchp mac_hid nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
nf_defrag_ipv4 nf_conntrack libcrc32c lp parport hid_generic usbhid
hid mptspi mptscsih e1000 mptbase ahci libahci
[ 1281.193905] CPU: 0 PID: 2024 Comm: ip Tainted: G W
4.10.0-bisect-bond-v0.14 #37
[ 1281.193906] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[ 1281.193906] Call Trace:
[ 1281.193912] dump_stack+0x63/0x89
[ 1281.193915] __warn+0xd1/0xf0
[ 1281.193917] warn_slowpath_null+0x1d/0x20
[ 1281.193918] __queue_delayed_work+0x13f/0x150
[ 1281.193920] queue_delayed_work_on+0x27/0x40
[ 1281.193929] bond_change_active_slave+0x25b/0x670 [bonding]
[ 1281.193932] ? synchronize_rcu_expedited+0x27/0x30
[ 1281.193935] __bond_release_one+0x489/0x510 [bonding]
[ 1281.193939] ? addrconf_notify+0x1b7/0xab0
[ 1281.193942] bond_netdev_event+0x2c5/0x2e0 [bonding]
[ 1281.193944] ? netconsole_netdev_event+0x124/0x190 [netconsole]
[ 1281.193947] notifier_call_chain+0x49/0x70
[ 1281.193948] raw_notifier_call_chain+0x16/0x20
[ 1281.193950] call_netdevice_notifiers_info+0x35/0x60
[ 1281.193951] rollback_registered_many+0x23b/0x3e0
[ 1281.193953] unregister_netdevice_many+0x24/0xd0
[ 1281.193955] rtnl_delete_link+0x3c/0x50
[ 1281.193956] rtnl_dellink+0x8d/0x1b0
[ 1281.193960] rtnetlink_rcv_msg+0x95/0x220
[ 1281.193962] ? __kmalloc_node_track_caller+0x35/0x280
[ 1281.193964] ? __netlink_lookup+0xf1/0x110
[ 1281.193966] ? rtnl_newlink+0x830/0x830
[ 1281.193967] netlink_rcv_skb+0xa7/0xc0
[ 1281.193969] rtnetlink_rcv+0x28/0x30
[ 1281.193970] netlink_unicast+0x15b/0x210
[ 1281.193971] netlink_sendmsg+0x319/0x390
[ 1281.193974] sock_sendmsg+0x38/0x50
[ 1281.193975] ___sys_sendmsg+0x25c/0x270
[ 1281.193978] ? mem_cgroup_commit_charge+0x76/0xf0
[ 1281.193981] ? page_add_new_anon_rmap+0x89/0xc0
[ 1281.193984] ? lru_cache_add_active_or_unevictable+0x35/0xb0
[ 1281.193985] ? __handle_mm_fault+0x4e9/0x1170
[ 1281.193987] __sys_sendmsg+0x45/0x80
[ 1281.193989] SyS_sendmsg+0x12/0x20
[ 1281.193991] do_syscall_64+0x6e/0x180
[ 1281.193993] entry_SYSCALL64_slow_path+0x25/0x25
[ 1281.193995] RIP: 0033:0x7f6ec122f5a0
[ 1281.193995] RSP: 002b:00007ffe69e89c48 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[ 1281.193997] RAX: ffffffffffffffda RBX: 00007ffe69e8dd60 RCX: 00007f6ec122f5a0
[ 1281.193997] RDX: 0000000000000000 RSI: 00007ffe69e89c90 RDI: 0000000000000003
[ 1281.193998] RBP: 00007ffe69e89c90 R08: 0000000000000000 R09: 0000000000000003
[ 1281.193999] R10: 00007ffe69e89a10 R11: 0000000000000246 R12: 0000000058f14b9f
[ 1281.193999] R13: 0000000000000000 R14: 00000000006473a0 R15: 00007ffe69e8e450
[ 1281.194001] ---[ end trace 713a77486cbfbfa3 ]---
Fixes: 4493b81bea ("bonding: initialize work-queues during creation of bond")
Reported-by: Joe Stringer <joe@ovn.org>
Tested-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding driver changes the skb->dev to the bonding-master before
passing the packet to stack for further processing. This, however
does not make sense for the link-local packets and it loses "the
link info" once its skb->dev is changed to bonding-master. This
patch changes this behavior for link-local packets by not changing
the skb->dev to the bonding-master and maintaining it as it is,
i.e. the link on which the packet arrived.
Signed-off-by: Chonggang Li <chonggangli@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When link transitions from LINK_FAIL to LINK_UP, the commit phase is
not called. This leads to an erroneous state causing slave-link state to
get stuck in "going down" state while its speed and duplex are perfectly
fine. This issue is a side-effect of splitting link-set into propose and
commit phases introduced by de77ecd4ef ("bonding: improve link-status
update in mii-monitoring")
This patch fixes these issues by calling commit phase whenever link
state change is proposed.
Fixes: de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e77315, so the alb code is where most of the changes are.
One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over device with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.
Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0
Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0
Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier patch c4adfc822b ("bonding: make speed, duplex setting
consistent with link state") made an attempt to keep slave state
consistent with speed and duplex settings. Unfortunately link-state
transition is used to change the active link especially when used
in conjunction with mii-mon. The above mentioned patch broke that
logic. Also when speed and duplex settings for a link are updated
during a link-event, the link-status should not be changed to
invoke correct transition logic.
This patch fixes this issue by moving the link-state update outside
of the bond_update_speed_duplex() fn and to the places where this fn
is called and update link-state selectively.
Fixes: c4adfc822b ("bonding: make speed, duplex setting consistent
with link state")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some device drivers reset their stats at down/up events, possibly
fooling bonding stats, since they operate with relative deltas.
It is nearly not possible to fix drivers, since some of them compute the
tx/rx counters based on per rx/tx queue stats, and the queues can be
reconfigured (ethtool -L) between the down/up sequence.
Lets avoid accumulating 'negative' values that render bonding stats
useless.
It is better to lose small deltas, assuming the bonding stats are
fetched at a reasonable frequency.
Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_miimon_commit() marks the link UP after attempting to get the speed
and duplex settings for the link. There is a possibility that
bond_update_speed_duplex() could fail. This is another place where it
could result into an inconsistent bonding link state.
With this patch the link will be marked UP only if the speed and duplex
values retrieved have sane values and processed further.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_update_speed_duplex() retrieves speed and duplex settings. There
is a possibility of failure in retrieving these values but caller has
to assume it's always successful. This leads to having inconsistent
slave link settings. If these (speed, duplex) values cannot be
retrieved, then keeping the link UP causes problems.
The updated bond_update_speed_duplex() returns 0 on success if it
retrieves sane values for speed and duplex. On failure it returns 1
and marks the link down.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The primary issue is that mii-inspect phase updates link-state and
expects changes to be committed during the mii-commit phase. After
the inspect phase if it fails to acquire rtnl-mutex, the commit
phase (bond_mii_commit) doesn't get to run. This partially updated
state stays and makes the internal-state inconsistent.
e.g. setup bond0 => slaves: eth1, eth2
eth1 goes DOWN -> UP
mii_monitor()
mii-inspect()
bond_set_slave_link_state(eth1, UP, DontNotify)
rtnl_trylock() <- fails!
Next mii-monitor round
eth1: No change
mii_monitor()
mii-inspect()
eth1->link == current-status (ethtool_ops->get_link)
no-change-detected
End result:
eth1:
Link = BOND_LINK_UP
Speed = 0xfffff [SpeedUnknown]
Duplex = 0xff [DuplexUnknown]
This doesn't always happen but for some unlucky machines in a large set
of machines it creates problems.
The fix for this is to avoid making changes during inspect phase and
postpone them until acquiring the rtnl-mutex / invoking commit phase.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently
report 0 as their bandwidth.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many of the bond param variables are declared global while it's not
really necessary for these variables to be global. So moving them to
the location these are used.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LACP state-machine defines "port-moved" state when the same ActorSystemID
and Port are seen in a LACPDU received on different port. The state is
never set since it's not implemented. However the state-machine attempts
to clear that state occasionally. LACP state machine is already complicated
and since this state is not implemented, removing it's checks makes the
state-machine little simpler.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate hard-coded value and use the default that is set.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initializing work-queues every time ifup operation performed is unnecessary
and can be performed only once when the port is created.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to move the work-queue initialization to port creation
from current port_open phase. Work-queue initialization does not make
sense every time we do 'ifup/ifdown'. So moving to port creation phase.
Arp monitoring work depends on the bonding mode and that is not tied
to the port creation and can change anytime during the life after port
creation. So this restructuring allows us to move the initialization
at creation without losing the ability to arm the correct work for the
mode user has selected.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix double-free in batman-adv, from Sven Eckelmann.
2) Fix packet stats for fast-RX path, from Joannes Berg.
3) Netfilter's ip_route_me_harder() doesn't handle request sockets
properly, fix from Florian Westphal.
4) Fix sendmsg deadlock in rxrpc, from David Howells.
5) Add missing RCU locking to transport hashtable scan, from Xin Long.
6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.
7) Fix race in NAPI handling between poll handlers and busy polling,
from Eric Dumazet.
8) TX path in vxlan and geneve need proper RCU locking, from Jakub
Kicinski.
9) SYN processing in DCCP and TCP need to disable BH, from Eric
Dumazet.
10) Properly handle net_enable_timestamp() being invoked from IRQ
context, also from Eric Dumazet.
11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.
12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
Melo.
13) Fix use-after-free in netvsc driver, from Dexuan Cui.
14) Fix max MTU setting in bonding driver, from WANG Cong.
15) xen-netback hash table can be allocated from softirq context, so use
GFP_ATOMIC. From Anoob Soman.
16) Fix MAC address change bug in bgmac driver, from Hari Vyas.
17) strparser needs to destroy strp_wq on module exit, from WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
strparser: destroy workqueue on module exit
sfc: fix IPID endianness in TSOv2
sfc: avoid max() in array size
rds: remove unnecessary returned value check
rxrpc: Fix potential NULL-pointer exception
nfp: correct DMA direction in XDP DMA sync
nfp: don't tell FW about the reserved buffer space
net: ethernet: bgmac: mac address change bug
net: ethernet: bgmac: init sequence bug
xen-netback: don't vfree() queues under spinlock
xen-netback: keep a local pointer for vif in backend_disconnect()
netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
netfilter: nf_conntrack_sip: fix wrong memory initialisation
can: flexcan: fix typo in comment
can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
can: gs_usb: fix coding style
can: gs_usb: Don't use stack memory for USB transfers
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
ixgbe: update the rss key on h/w, when ethtool ask for it
...
This restores the ability of setting bond device's mtu to 9000.
Fixes: 91572088e3 ("net: use core MTU range checking in core net infra")
Reported-by: daznis@gmail.com
Reported-by: Brad Campbell <lists2009@fnarfbargle.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit 18bfb924f0 ("net: introduce default neigh_construct/destroy
ndo calls for L2 upper devices") we added these ndos to stacked devices
such as team and bond, so that calls will be propagated to mlxsw.
However, previous commit removed the reliance on these ndos and no new
users of these ndos have appeared since above mentioned commit. We can
therefore safely remove this dead code.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function bond_info_query alwarys returns 0. As such, in the function
bond_do_ioctl, it is not necessary to check the returned value. So the
interface type of the function bond_info_query is changed to void. The
redundant check is removed.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.
Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert alb_send_learning_packets and bond_has_this_ip to use the new
netdev_walk_all_upper_dev_rcu API. In both cases this is just a code
conversion; no functional change is intended.
v2
- removed typecast of data and simplified bond_upper_dev_walk
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull uaccess.h prepwork from Al Viro:
"Preparations to tree-wide switch to use of linux/uaccess.h (which,
obviously, will allow to start unifying stuff for real). The last step
there, ie
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
`git grep -l "$PATT"|grep -v ^include/linux/uaccess.h`
is not taken here - I would prefer to do it once just before or just
after -rc1. However, everything should be ready for it"
* 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
remove a stray reference to asm/uaccess.h in docs
sparc64: separate extable_64.h, switch elf_64.h to it
score: separate extable.h, switch module.h to it
mips: separate extable.h, switch module.h to it
x86: separate extable.h, switch sections.h to it
remove stray include of asm/uaccess.h from cacheflush.h
mn10300: remove a bogus processor.h->uaccess.h include
xtensa: split uaccess.h into C and asm sides
bonding: quit messing with IOCTL
kill __kernel_ds_p off
mn10300: finish verify_area() off
frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h
exceptions: detritus removal
The only remaining users are issuing SIOCGMIIPHY and SIOCGMIIREG,
neither of which deals with userland pointers. Simply calling
->ndo_do_ioctl() is fine; no messing with set_fs() is needed.
It used to mess with SIOCETHTOOL, which would've needed set_fs(),
but that has been killed in "[NET] ethtool ops are the only way"
9 years ago...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
Following few steps will crash kernel -
(a) Create bonding master
> modprobe bonding miimon=50
(b) Create macvlan bridge on eth2
> ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
type macvlan
(c) Now try adding eth2 into the bond
> echo +eth2 > /sys/class/net/bond0/bonding/slaves
<crash>
Bonding does lots of things before checking if the device enslaved is
busy or not.
In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.
This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues multiple work items viz
&bond->mcast_work, &nnw->work, &bond->mii_work, &bond->arp_work,
&bond->alb_work, &bond->mii_work, &bond->ad_work, &bond->slave_arr_work
which require strict execution ordering. Hence, an ordered dedicated
workqueue has been used.
Since, it is a network driver, WQ_MEM_RECLAIM has been set to
ensure forward progress under memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using an IPoIB bond currently only active-backup mode is a valid
use case and this commit strengthens it.
Since commit 2ab82852a2 ("net/bonding: Enable bonding to enslave
netdevices not supporting set_mac_address()") was introduced till
4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence the
fail over mac policy always applied to IPoIB bonds.
With the introduction of commit 492a7e67ff ("IB/IPoIB: Allow setting
the device address"), that doesn't hold and practically IPoIB bonds are
broken as of that. To fix it, lets go to fail over mac if the device
doesn't support the ndo OR this is IPoIB device.
As a by-product, this commit also prevents a stack corruption which
occurred when trying to copy 20 bytes (IPoIB) device address
to a sockaddr struct that has only 16 bytes of storage.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e826eafa65 ("bonding: Call netif_carrier_off after
register_netdevice") moved netif_carrier_off() from bond_init() to
bond_create(), but the latter is called only for initial default
devices and ones created through sysfs:
$ modprobe bonding
$ echo +bond1 > /sys/class/net/bonding_masters
$ ip link add bond2 type bond
$ grep "MII Status" /proc/net/bonding/*
/proc/net/bonding/bond0:MII Status: down
/proc/net/bonding/bond1:MII Status: down
/proc/net/bonding/bond2:MII Status: up
Ensure that carrier is initially off also for devices created through
netlink.
Signed-off-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en.h
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
drivers/net/usb/r8152.c
All three conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, link notifications are not sent by
bond_set_slave_link_state() upon enslavement if
the slave is enslaved when up.
This happens because slave->link default init value
is 0, which is the same as BOND_LINK_UP, resulting
in bond_set_slave_link_state() ignoring this transition.
This patch sets the default value of slave->link to
BOND_LINK_NOCHANGE, assuring it will count as a state
transition and thus trigger notification logic.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L2 upper device needs to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and use them in
team, bond, bridge and vlan.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ether_addr_equal_64bits() requires some care about its arguments,
namely that 8 bytes might be read, even if last 2 byte values are not
used.
KASan detected a violation with null_mac_addr and lacpdu_mcast_addr
in bond_3ad.c
Same problem with mac_bcast[] and mac_v6_allmcast[] in bond_alb.c :
Although the 8-byte alignment was there, KASan would detect out
of bound accesses.
Fixes: 815117adaf ("bonding: use ether_addr_equal_unaligned for bond addr compare")
Fixes: bb54e58929 ("bonding: Verify RX LACPDU has proper dest mac-addr")
Fixes: 885a136c52 ("bonding: use compare_ether_addr_64bits() in ALB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 7bb11dc9f5 ("bonding: unify all places where
actor-oper key needs to be updated."), the logic in bonding to handle
selection between multiple aggregators has not functioned.
This affects only configurations wherein the bonding slaves
connect to two discrete aggregators (e.g., two independent switches, each
with LACP enabled), thus creating two separate aggregation groups within a
single bond.
The cause is a change in 7bb11dc9f5 to no longer set
AD_PORT_BEGIN on a port after a link state change, which would cause the
port to be reselected for attachment to an aggregator as if were newly
added to the bond. We cannot restore the prior behavior, as it
contradicts IEEE 802.1AX 5.4.12, which requires ports that "become
inoperable" (lose carrier, setting port_enabled=false as per 802.1AX
5.4.7) to remain selected (i.e., assigned to the aggregator). As the port
now remains selected, the aggregator selection logic is not invoked.
A side effect of this change is that aggregators in bonding will
now contain ports that are link down. The aggregator selection logic
does not currently handle this situation correctly, causing incorrect
aggregator selection.
This patch makes two changes to repair the aggregator selection
logic in bonding to function as documented and within the confines of the
standard:
First, the aggregator selection and related logic now utilizes the
number of active ports per aggregator, not the number of selected ports
(as some selected ports may be down). The ad_select "bandwidth" and
"count" options only consider ports that are link up.
Second, on any carrier state change of any slave, the aggregator
selection logic is explicitly called to insure the correct aggregator is
active.
Reported-by: Veli-Matti Lintu <veli-matti.lintu@opinsys.fi>
Fixes: 7bb11dc9f5 ("bonding: unify all places where actor-oper key needs to be updated.")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is time to add netdev_lockdep_set_classes() helper
so that lockdep annotations per device type are easier to manage.
This removes a lot of copies and missing annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.
This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_get_stats() can be called from rtnetlink (with RTNL held)
or from /proc/net/dev seq handler (with RCU held)
The logic added in commit 5f0c5f73e5 ("bonding: make global bonding
stats more reliable") kind of assumed only one cpu could run there.
If multiple threads are reading /proc/net/dev, stats can be really
messed up after a while.
A second problem is that some fields are 32bit, so we need to properly
handle the wrap around problem.
Given that RTNL is not always held, we need to use
bond_for_each_slave_rcu().
Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unnecessary set of flag IFF_MULTICAST, since ether_setup
already does this.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
There is presently a race condition between the bonding periodic
link monitor and the updating of a slave's speed and duplex. The former
occurs on a periodic basis, and the latter in response to a driver's
calling of netif_carrier_on.
It is possible for the periodic monitor to run between the
driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE
event that causes bonding to update the slave's speed and duplex. This
manifests most notably as a report that a slave is up and "0 Mbps full
duplex" after enslavement, but in principle could report an incorrect
speed and duplex after any link up event if the device comes up with a
different speed or duplex. This affects the 802.3ad aggregator
selection, as the speed and duplex are selection criteria.
This is fixed by updating the speed and duplex in the periodic
monitor, prior to using that information.
This was done historically in bonding, but the call to
bond_update_speed_duplex was removed in commit 876254ae27 ("bonding:
don't call update_speed_duplex() under spinlocks"), as it might sleep
under lock. Later, the locking was changed to only hold RTNL, and so
after commit 876254ae27 ("bonding: don't call update_speed_duplex()
under spinlocks") this call is again safe.
Tested-by: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: dingtianhong <dingtianhong@huawei.com>
Fixes: 876254ae27 ("bonding: don't call update_speed_duplex() under spinlocks")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac71 ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 'goto' with 'return' to remove unnecessary check at label:
err_undo_flags.
The reason is that 'err_undo_flags' do two things for the first slave device:
1.revert bond mac address if it is set by the slave device.
2.revert bond device type if it's not ARPHRD_ETHER.
It's not necessary for the following three places, they changed neither bond
mac address nor type. It's straightforward to return directly.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value of kzalloc on failure of allocation of memory should
be -ENOMEM and not -1.
Found using Coccinelle. A simplified version of the semantic patch
used is:
//<smpl>
@@
expression *e;
@@
e = kzalloc(...);
if (e == NULL) {
...
return
- -1
+ -ENOMEM
;
}
//</smpl>
The single call site only checks that the return value is not 0,
hence no change is required at the call site.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to require the bond down while changing these settings, the change
will be reflected immediately and the 3ad mode will sort itself out.
For faster convergence set port->ntt to true in order to generate new
LACPDUs immediately.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the bonding allows to set ad_actor_system and prio while the
bond device is down, but these are actually applied only if there aren't
any slaves yet (applied to bond device when first slave shows up, and to
slaves at 3ad bind time). After this patch changes are applied immediately
and the new values can be used/seen after the bond's upped so it's not
necessary anymore to release all and enslave again to see the changes.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove some redudant brackets, use sizeof(*) instead of sizeof(struct x).
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_dbg() will add bond device name, it will be helpful if we print
slave device name.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sample output with this set applied for an active-backup bond:
$ cat /sys/devices/virtual/net/bond0/lower_p7p1/statistics/rx_nohandler
16568
$ cat /sys/devices/virtual/net/bond0/lower_p5p2/statistics/rx_nohandler
16583
$ cat /sys/devices/virtual/net/bond0/statistics/rx_nohandler
33151
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/ethernet/mellanox/mlxsw/spectrum.h
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1f718f0f4f ("bonding: populate neighbour's private on enslave")
undoes the fix provided by commit c2edacf80e ("bonding / ipv6: no addrconf
for slaves separately from master") by effectively setting the slave flag
after the slave has been opened. If the slave comes up quickly enough, it
will go through the IPv6 addrconf before the slave flag has been set and
will get a link local IPv6 address.
In order to ensure that addrconf knows to ignore the slave devices on state
change, set IFF_SLAVE before dev_open() during bonding enslavement.
Fixes: 1f718f0f4f ("bonding: populate neighbour's private on enslave")
Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spew in /proc/net/bonding/bond0 uses netif_carrier_ok() to determine
mii_status, while /sys/class/net/bond0/bonding/mii_status looks at
curr_active_slave, which doesn't actually seem to be set sometimes when
the bond actually is up. A mode 4 bond configured via ifcfg-foo files on a
Red Hat Enterprise Linux system, after boot, comes up clean and
functional, but the sysfs node shows mii_status of down, while proc shows
up. A simple enough fix here seems to be to use the same method for
determining up or down in both places, and I'd opt for the one that seems
to match reality.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.
This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not necessary to use two brackets. As such, the redudant brackets
are removed.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Be correct and symmetric to enslave and set inactive flags during release.
That gives LAG offload drivers - lower state change listeners - possibility
to do proper cleanup.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let netdev notifier listeners know about link and slave state change.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to state notifications.
We allow caller to indicate if the notification should happen now or later,
depending on if he holds rtnl mutex or not. Introduce bond_slave_link_notify
function (similar to bond_slave_state_notify) which is later on called
with rtnl mutex and goes over slaves and executes delayed notification.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize netdev_lag_upper_info structure by TX type according to
current bonding mode and pass it along via netdev_master_upper_dev_link.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate netdev_master_upper_dev_link_private and pass priv directly as
a parameter of netdev_master_upper_dev_link.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to other speeds, add 100G to bonding 802.3ad code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 7d5cd2ce529b, when bond_enslave fails on devices that
are not ARPHRD_ETHER, if needed, it resets the bonding device back to
ARPHRD_ETHER by calling ether_setup.
Unfortunately, ether_setup clobbers dev->flags, clearing IFF_UP
if the bond device is up, leaving it in a quasi-down state without
having actually gone through dev_close. For bonding, if any periodic
work queue items are active (miimon, arp_interval, etc), those will
remain running, as they are stopped by bond_close. At this point, if
the bonding module is unloaded or the bond is deleted, the system will
panic when the work function is called.
This panic is resolved by calling dev_close on the bond itself
prior to calling ether_setup.
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Fixes: 7d5cd2ce52 ("bonding: correctly handle bonding type change on enslave failure")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Old logic of updating state-machine is not required since
ad_update_actor_keys() does it implicitly. The only loss is
the notification differentiation between speed vs. duplex
change. Now only one unified notification is printed.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
actor_admin, and actor_oper key is changed at multiple locations in
the code. This patch brings all those updates into one location in
an attempt to avoid possible inconsistent updates causing LACP state
machine to go in weird state.
The unified place is ad_update_actor_key() with simple state-machine
logic -
(a) If port is "duplex" then only it can participate in LACP
(b) Speed change reinitializes the LACP state-machine.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate 'else' clause by simply initializing variable
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If using a sixtofour device on top of a bonding device,
skb segmentation of TCP traffic is done right before calling
bonding xmit, because bonding only enables TSO for IPv4.
This patch improves single flow performance by about 120 % on my hosts,
because segmentation is deferred right before calling slave xmit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing
read and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc. mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
/T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
30ZYFuW8evTL+YQcaV65
=EHLg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull inifiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
The flags argument will allow control of the dissection process (for
instance whether to parse beyond L3).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some consumers of the netdev events API would like to know who is the
active slave when a NETDEV_CHANGEUPPER or NETDEV_BONDING_FAILOVER
events occur. For example, when managing RoCE GIDs, GIDs based on the
bond's ips should only be set on the port which corresponds to active
slave netdevice.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cavium/Kconfig
The cavium conflict was overlapping dependency
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the first slave is added (such as during bootup) the first
gratuitous ARP gets dropped. We don't see this drop during a failover.
The packet gets dropped in qdisc (noop_enqueue).
The fix is to delay the sending of gratuitous ARPs till the bond dev's
carrier is present.
It can also be worked around by setting num_grat_arp to more than 1.
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tlb_dynamic_lb could be set only via sysfs, this patch allows it to be
set via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
num_grat_arp wasn't converted to the new bonding option API, so do this
now and remove the specific sysfs store option in order to use the
standard one. num_grat_arp is the same as num_unsol_na so add it as an
alias with the same option settings. An important difference is the option
name which is matched in bond_sysfs_store_option().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The "follow" fail_over_mac policy is useful for multiport devices that
either become confused or incur a performance penalty when multiple
ports are programmed with the same MAC address, but the same MAC
address still may happened by this steps for this policy:
1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
bond0 has the same mac address with eth0, it is MAC1.
2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
eth1 is backup, eth1 has MAC2.
3) ifconfig eth0 down
eth1 became active slave, bond will swap MAC for eth0 and eth1,
so eth1 has MAC1, and eth0 has MAC2.
4) ifconfig eth1 down
there is no active slave, and eth1 still has MAC1, eth2 has MAC2.
5) ifconfig eth0 up
the eth0 became active slave again, the bond set eth0 to MAC1.
Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same
MAC address, it will break this policy for ACTIVE_BACKUP mode.
This patch will fix this problem by finding the old active slave and
swap them MAC address before change active slave.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the bond is enslaving a device with different type it will be setup
by it, but if after being setup the enslave fails the bond doesn't
switch back its type and also keeps pointers to foreign structures that can
be long gone. Thus revert back any type changes if the enslave failed and
the bond had to change its type.
Example:
Before patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default
link/loopback 16:54:78:34:bd:41 brd 00:00:00:00:00:00
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
(notice the MASTER flag is gone)
After patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
link/ether 6e:66:94:f6:07:fc brd ff:ff:ff:ff:ff:ff
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: e36b9d16c6 ("bonding: clean muticast addresses when device changes type")
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bonding is being unloaded and the netdevice notifier is
unregistered it executes NETDEV_UNREGISTER for each device which should
remove the bond's proc entry but if the device enslaved is not of
ARPHRD_ETHER type and is in front of the bonding, it may execute
bond_release_and_destroy() first which would release the last slave and
destroy the bond device leaving the proc entry and thus we will get the
following error (with dynamic debug on for bond_netdev_event to see the
events order):
[ 908.963051] eql: event: 9
[ 908.963052] eql: IFF_SLAVE
[ 908.963054] eql: event: 2
[ 908.963056] eql: IFF_SLAVE
[ 908.963058] eql: event: 6
[ 908.963059] eql: IFF_SLAVE
[ 908.963110] bond0: Releasing active interface eql
[ 908.976168] bond0: Destroying bond bond0
[ 908.976266] bond0 (unregistering): Released all slaves
[ 908.984097] ------------[ cut here ]------------
[ 908.984107] WARNING: CPU: 0 PID: 1787 at fs/proc/generic.c:575
remove_proc_entry+0x112/0x160()
[ 908.984110] remove_proc_entry: removing non-empty directory
'net/bonding', leaking at least 'bond0'
[ 908.984111] Modules linked in: bonding(-) eql(O) 9p nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev qxl drm_kms_helper
snd_hda_codec_generic aesni_intel ttm aes_x86_64 glue_helper pcspkr lrw
gf128mul ablk_helper cryptd snd_hda_intel virtio_console snd_hda_codec
psmouse serio_raw snd_hwdep snd_hda_core 9pnet_virtio 9pnet evdev joydev
drm virtio_balloon snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core
pvpanic acpi_cpufreq parport_pc parport processor thermal_sys button
autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sg sr_mod cdrom
ata_generic virtio_blk virtio_net floppy ata_piix e1000 libata ehci_pci
virtio_pci scsi_mod uhci_hcd ehci_hcd virtio_ring virtio usbcore
usb_common [last unloaded: bonding]
[ 908.984168] CPU: 0 PID: 1787 Comm: rmmod Tainted: G W O
4.2.0-rc2+ #8
[ 908.984170] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 908.984172] 0000000000000000 ffffffff81732d41 ffffffff81525b34
ffff8800358dfda8
[ 908.984175] ffffffff8106c521 ffff88003595af78 ffff88003595af40
ffff88003e3a4280
[ 908.984178] ffffffffa058d040 0000000000000000 ffffffff8106c59a
ffffffff8172ebd0
[ 908.984181] Call Trace:
[ 908.984188] [<ffffffff81525b34>] ? dump_stack+0x40/0x50
[ 908.984193] [<ffffffff8106c521>] ? warn_slowpath_common+0x81/0xb0
[ 908.984196] [<ffffffff8106c59a>] ? warn_slowpath_fmt+0x4a/0x50
[ 908.984199] [<ffffffff81218352>] ? remove_proc_entry+0x112/0x160
[ 908.984205] [<ffffffffa05850e6>] ? bond_destroy_proc_dir+0x26/0x30
[bonding]
[ 908.984208] [<ffffffffa057540e>] ? bond_net_exit+0x8e/0xa0 [bonding]
[ 908.984217] [<ffffffff8142f407>] ? ops_exit_list.isra.4+0x37/0x70
[ 908.984225] [<ffffffff8142f52d>] ?
unregister_pernet_operations+0x8d/0xd0
[ 908.984228] [<ffffffff8142f58d>] ?
unregister_pernet_subsys+0x1d/0x30
[ 908.984232] [<ffffffffa0585269>] ? bonding_exit+0x23/0xdba [bonding]
[ 908.984236] [<ffffffff810e28ba>] ? SyS_delete_module+0x18a/0x250
[ 908.984241] [<ffffffff81086f99>] ? task_work_run+0x89/0xc0
[ 908.984244] [<ffffffff8152b732>] ?
entry_SYSCALL_64_fastpath+0x16/0x75
[ 908.984247] ---[ end trace 7c006ed4abbef24b ]---
Thus remove the proc entry manually if bond_release_and_destroy() is
used. Because of the checks in bond_remove_proc_entry() it's not a
problem for a bond device to change namespaces (the bug fixed by the
Fixes commit) but since commit
f939981492 ("bonding: Don't allow bond devices to change network
namespaces.") that can't happen anyway.
Reported-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: a64d49c3dd ("bonding: Manage /proc/net/bonding/ entries from
the netdev events")
Tested-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of these:
drivers/net/bonding//bond_main.c: In function ‘bond_update_slave_arr’:
drivers/net/bonding//bond_main.c:3754:6: warning: variable
‘slaves_in_agg’ set but not used [-Wunused-but-set-variable]
int slaves_in_agg;
^
CC [M] drivers/net/bonding//bond_3ad.o
drivers/net/bonding//bond_3ad.c: In function
‘ad_marker_response_received’:
drivers/net/bonding//bond_3ad.c:1870:61: warning: parameter ‘marker’
set but not used [-Wunused-but-set-parameter]
static void ad_marker_response_received(struct bond_marker *marker,
^
drivers/net/bonding//bond_3ad.c:1871:19: warning: parameter ‘port’ set
but not used [-Wunused-but-set-parameter]
struct port *port)
^
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When "primary_reselect" is set to "failure", primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a "primary" interface and "primary_reselect"
is set to "failure" then whenever primary interface's link get back(up)
it become active slave even if current active slave is still up.
With this patch, "bond_find_best_slave" will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.
Signed-off-by: Mazhar Rana <mazhar.rana@cyberoam.com>
Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Actor and Partner details can be accessed via proc-fs, sys-fs
entries or netlink interface. These interfaces are world readable
at this moment. The earlier patch-series made the LACP communication
secure to avoid nuisance attack from within the same L2 domain but
it did not prevent "someone unprivileged" looking at that information
on host and perform the same act.
This patch essentially avoids spitting those entries if the user
in question does not have enough privileges.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export the partner_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
partner_oper state, it is already exported via bond's proc entry.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export the actor_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
actor_oper state, it is already exported via bond's proc entry.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how may words to input to jhash2.
We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.
With this patch, Ethertype and IP protocol are now included in the
flow hash input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops
v3: updated to sync with named union changes.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
causes the kernel to send a rtnl message for the bond2 interface, with an
ifindex 0.
'ip monitor' shows:
0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default
link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
[snip]
The patch fixes the spotted bug by checking in bond driver if the interface
is registered before calling the notifier chain.
It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
future.
Fixes: d4261e5650 ("bonding: create netlink event when bonding option is changed")
CC: Jiri Pirko <jiri@resnulli.us>
Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa said remove the feature flag for this series and she'll work on
bringing it back if needed at a later date.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rocker, bonding and team and switch over to the new
switchdev_port_bridge_dellink to avoid duplicating code in each driver.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rocker, bonding, and team can now use the switchdev bridge setlink to parse
raw netlink; no need to duplicate this code in each driver.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds netlink support for the following bonding options:
* BOND_OPT_AD_ACTOR_SYS_PRIO
* BOND_OPT_AD_ACTOR_SYSTEM
* BOND_OPT_AD_USER_PORT_KEY
When setting the actor system mac address we assume the netlink message
contains a binary mac and not a string representation of a mac.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
[jt: completed the setting side of the netlink attributes]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The port key has three components - user-key, speed-part, and duplex-part.
The LSBit is for the duplex-part, next 5 bits are for the speed while the
remaining 10 bits are the user defined key bits. Get these 10 bits
from the user-space (through the SysFs interface) and use it to form the
admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
it is not provided then use zero for the user-key-bits (default).
It can set using following example code -
# modprobe bonding mode=4
# usr_port_key=$(( RANDOM & 0x3FF ))
# echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* fixed up context from change in ad_actor_sys_prio patch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an AD system, the communication between actor and partner is the
business between these two entities. In the current setup anyone on the
same L2 can "guess" the LACPDU contents and then possibly send the
spoofed LACPDUs and trick the partner causing connectivity issues for
the AD system. This patch allows to use a random mac-address obscuring
it's identity making it harder for someone in the L2 is do the same thing.
This patch allows user-space to choose the mac-address for the AD-system.
This mac-address can not be NULL or a Multicast. If the mac-address is set
from user-space; kernel will honor it and will not overwrite it. In the
absence (value from user space); the logic will default to using the
masters' mac as the mac-address for the AD-system.
It can be set using example code below -
# modprobe bonding mode=4
# sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
$(( (RANDOM & 0xFE) | 0x02 )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )))
# echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: fixed up style issues reported by checkpatch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows user to randomize the system-priority in an ad-system.
The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
does not specify this value, the system defaults to 0xFFFF, which is
what it was before this patch.
Following example code could set the value -
# modprobe bonding mode=4
# sys_prio=$(( 1 + RANDOM + RANDOM ))
# echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* changed how the default value is set in bond_check_params(), this
makes the default consistent between what gets set for a new bond
and what the default is claimed to be in the bonding options.]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a Kernel Panic in bonding driver debugfs file: rlb_hash_table.
$> modprobe bonding mode=6
$> cat /sys/kernel/debug/bonding/bond0/rlb_hash_table
This will crash the kernel. The struct alb_bond_info is initialized only when
the bonding interface is initialized (ip link set bond0 up) and not at the time
it is allocated. If we try to read the table before that, it'll result in a
kernel panic.
The patch applies against both net and net-next
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION
Fixing that by defining a private bonding_priv.h
header files which includes those defines.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_3ad_bind_slave() calls ad_initialize_port() and then immediately
assigns correct values making some of that initialization unnecessary.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch breaks the rich assignments into it's own statements
and removes some duplicate code where admin-key, & oper-key are
updated.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/asix_common.c
drivers/net/usb/sr9800.c
drivers/net/usb/usbnet.c
include/linux/usb/usbnet.h
net/ipv4/tcp_ipv4.c
net/ipv6/tcp_ipv6.c
The TCP conflicts were overlapping changes. In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.
With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
When mii-mon discovers that the link is up, it will call
bond_3ad_handle_link_change() but we forget to add the LACP_ENABLED
flag when we discover the speed and duplex for the slave link are
normal.
Change-Id: Ie8b268ecfeea0f99bf9fdcd72706c0653f9d9e49
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AD_PORT_ACTOR_CHURN and AD_PORT_PARTNER_CHURN are already present and
essentially BOND_MONITOR_CHURNED is a combination of these two definitions.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before commit 3900f29021 ("bonding: slight
optimizztion for bond_slave_override()") the override logic was to send packets
with non-zero queue_id through the slave with corresponding queue_id, under two
conditions only - if the slave can transmit and it's up.
The above mentioned commit changed this logic by introducing an additional
condition - whether the bond is active (indirectly, using the slave_can_tx and
later - bond_is_active_slave), that prevents the user from implementing more
complex policies according to the Documentation/networking/bonding.txt.
Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: Alexey Bogoslavsky <alexey@swortex.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding devices don't need to segment multiple tagged packets since their
slaves can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patches implements the poll_controller support for all
bonding driver. If the slaves have poll_controller net_op defined,
this implementation calls them. This is mode agnostic implementation
and iterates through all slaves (based on mode) and calls respective
handler.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Churn Detection machines detect the situation where a port is operable,
but the Actor and Partner have not attached the link to an Aggregator and
brought the link into operation within a bound time period. Under normal
operation of the LACP, agreement between Actor and Partner should be reached
very rapidly. Continued failure to reach agreement can be symptomatic of
device failure.
Actor-churn-detection state-machine
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
===================================
BEGIN=True + PortEnable=False
|
v
+------------------------+ ActorPort.Sync=True +------------------+
| ACTOR_CHURN_MONITOR | ---------------------> | NO_ACTOR_CHURN |
|========================| |==================|
| ActorChurn=False | ActorPort.Sync=False | ActorChurn=False |
| ActorChurn.Timer=Start | <--------------------- | |
+------------------------+ +------------------+
| ^
| |
ActorChurn.Timer=Expired |
| ActorPort.Sync=True
| |
| +-----------------+ |
| | ACTOR_CHURN | |
| |=================| |
+--------------> | ActorChurn=True | ------------+
| |
+-----------------+
Similar for the Partner-churn-detection.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 802.1AX standard states:
"The DA in LACPDUs is the Slow_Protocols_Multicast address."
This patch enforces that and drops LACPDUs with destination MAC
addresses other than Slow_Protocols_Multicast address
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicate code.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When queuing work to send the NETDEV_BONDING_INFO netdev event, it's
possible that when the work is executed, the pointer to the slave
becomes invalid. This can happen if between queuing the event and the
execution of the work, the net-device was un-ensvaled and re-enslaved.
Fix that by queuing a work with the data of the slave instead of the
slave structure.
Fixes: 69e6113343 ('net/bonding: Notify state change on slaves')
Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use notifier chain to dispatch an event upon a change in slave state.
Event is dispatched with slave specific info.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move slave state changes to a helper function, this is a pre-step for adding
functionality of dispatching an event when this helper is called.
This commit doesn't add new functionality.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want bond to pick up the offload flag if any of its slaves have it.
NETIF_F_HW_SWITCH_OFFLOAD flag is added to the mask, so that
netdev_increment_features does not ignore it.
This also adds ndo_bridge_setlink and ndo_bridge_dellink handlers.
These currently point to the default handlers provided by the
switchdev api.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix sparse warning about non-static function
drivers/net/bonding/bond_main.c:3737:5: warning: symbol
'bond_3ad_xor_xmit' was not declared. Should it be static?
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a slave is added to a bond and it is not in full duplex mode,
AD_PORT_LACP_ENABLED flag is cleared, due to this LACP PDU is not sent
on slave. When the duplex is changed to full, the flag needs to be set
to send LACP PDU.
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch attempts to fix the following problems when an actor or
partner's aggregator is not active:
1. a slave's lacp port state is marked as AD_STATE_SYNCHRONIZATION
even if it is attached to an inactive aggregator. LACP advertises
this state to the partner, making the partner think he can move
into COLLECTING_DISTRIBUTING state even though this link will not
pass traffic on the local side
2. a slave goes into COLLECTING_DISTRIBUTING state without checking
if the aggregator is actually active
3. when in COLLECTING_DISTRIBUTING state, the partner parameters may
change, e.g. the partner_oper_port_state.SYNCHRONIZATION. The
local mux machine is not reacting to the change and continue to
keep the slave and bond up
4. When bond slave leaves an inactive aggregator and joins an active
aggregator, the actor oper port state need to update to SYNC state.
v2:
* fix style issues in bond_3ad.c
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mode 802.3ad, fix incorrect bond slave active state when slave is not in
active aggregator. During bond_open(), the bonding driver always sets
the slave active flag to true if the bond is not in active-backup, alb,
or tlb modes. Bonding should let the aggregator selection logic set the
active flag when in 802.3ad mode.
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5a7baa7885 ("bonding: Advertize vxlan offload features when
supported"), Or Gerlitz added support conditional vxlan offload.
In this patch I also add support for all kind of tunnels,
but we allow a bonding device to not require segmentation,
as it is always better to make this segmentation at the very last stage,
if a particular slave device requires it.
Tested:
Setup a GRE tunnel,
on a physical NIC not having tx-gre-segmentation.
Results on bnx2x are even better, as we no longer have to segment
in software.
ethtool -K bond0 tx-gre-segmentation off
super_netperf 50 --google-pacing-rate 30000000 -H 10.7.8.152 -l 15
7538.32
ethtool -K bond0 tx-gre-segmentation on
super_netperf 50 --google-pacing-rate 30000000 -H 10.7.8.152 -l 15
10200.5
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the empty array element initializer and size the array with
BOND_OPT_LAST so the compiler will complain if more elements are in
there than should be.
An interesting unwanted side effect of this initializer is that if one
inserts new options into the middle of the array then this initializer
will zero out the option that equals BOND_OPT_TLB_DYNAMIC_LB+1.
Example:
Extend the OPTS enum:
enum {
...
BOND_OPT_TLB_DYNAMIC_LB,
BOND_OPT_LACP_NEW1,
BOND_OPT_LAST
};
Now insert into bond_opts array:
static const struct bond_option bond_opts[] = {
...
[BOND_OPT_LACP_RATE] = { .... unchanged stuff .... },
[BOND_OPT_LACP_NEW1] = { ... new stuff ... },
...
[BOND_OPT_TLB_DYNAMIC_LB] = { .... unchanged stuff ....},
{ } // MARK A
};
Since BOND_OPT_LACP_NEW1 = BOND_OPT_TLB_DYNAMIC_LB+1, the last
initializer (MARK A) will overwrite the contents of BOND_OPT_LACP_NEW1
and can be easily viewed with the crash utility.
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __bond_release_one(), when the interface is not a slave or not a slave of
"this" master, it log error message.
The message actually should be a debug message matching what bond_enslave()
does.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/amd/xgbe/xgbe-desc.c
drivers/net/ethernet/renesas/sh_eth.c
Overlapping changes in both conflict cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 7f28fa10 ("bonding: add arp_ip_target netlink support")
Reported-by: John Fastabend <john.fastabend@gmail.com>
Cc: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since both tx and rx paths work with skb->vlan_tci, there's no need for
this function anymore. Switch users directly to __vlan_hwaccel_put_tag.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds [2.5|20|40|56] Gbps enum definition, and fixes
aggregated bandwidth calculation based on above slave links.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port Key was determined as 16 bits according to the link speed,
duplex and user key (which is yet not supported). In the old
speed field, 5 bits are for speed [1|10|100|1000|10000]Mbps as
below:
--------------------------------------------------------------
Port key :| User key | Speed | Duplex|
--------------------------------------------------------------
16 6 1 0
This patch keeps the old layout, but changes AD_LINK_SPEED_BITMASK
from bit type to an enum type. In this way, the speed field can
expand speed type from 5 to 32.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 6fde8f037e ("bonding: fix locking in
bond_loadbalance_arp_mon()") we can have a stale bond carrier state and
stale curr_active_slave when using arp monitoring in loadbalance modes. The
reason is that in bond_loadbalance_arp_mon() we can't have
do_failover == true but slave_state_changed == false, whenever do_failover
is true then slave_state_changed is also true. Then the following piece
from bond_loadbalance_arp_mon():
if (slave_state_changed) {
bond_slave_state_change(bond);
if (BOND_MODE(bond) == BOND_MODE_XOR)
bond_update_slave_arr(bond, NULL);
} else if (do_failover) {
block_netpoll_tx();
bond_select_active_slave(bond);
unblock_netpoll_tx();
}
will execute only the first branch, always and regardless of do_failover.
Since these two events aren't related in such way, we need to decouple and
consider them separately.
For example this issue could lead to the following result:
Bonding Mode: load balancing (round-robin)
*MII Status: down*
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 192.168.9.2
Slave Interface: ens12
*MII Status: up*
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:0f:53:01:42:2c
Slave queue ID: 0
Slave Interface: eth1
*MII Status: up*
Speed: Unknown
Duplex: Unknown
Link Failure Count: 70
Permanent HW addr: 52:54:00:2f:0f:8e
Slave queue ID: 0
Since some interfaces are up, then the status of the bond should also be
up, but it will never change unless something invokes bond_set_carrier()
(i.e. enslave, bond_select_active_slave etc). Now, if I force the
calling of bond_select_active_slave via for example changing
primary_reselect (it can change in any mode), then the MII status goes to
"up" because it calls bond_select_active_slave() which should've been done
from bond_loadbalance_arp_mon() itself.
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>
Fixes: 6fde8f037e ("bonding: fix locking in bond_loadbalance_arp_mon()")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large receive offloading is known to cause problems if received packets
are passed to other host. Therefore the kernel disables it by calling
dev_disable_lro() whenever a network device is enslaved in a bridge or
forwarding is enabled for it (or globally). For virtual devices we need
to disable LRO on the underlying physical device (which is actually
receiving the packets).
Current dev_disable_lro() code handles this propagation for a vlan
(including 802.1ad nested vlan), macvlan or a vlan on top of a macvlan.
It doesn't handle other stacked devices and their combinations, in
particular propagation from a bond to its slaves which often causes
problems in virtualization setups.
As we now have generic data structures describing the upper-lower device
relationship, dev_disable_lro() can be generalized to disable LRO also
for all lower devices (if any) once it is disabled for the device
itself.
For bonding and teaming devices, it is necessary to disable LRO not only
on current slaves at the moment when dev_disable_lro() is called but
also on any slave (port) added later.
v2: use lower device links for all devices (including vlan and macvlan)
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ways drivers like cxgb4 don't need to do ugly relative includes.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because bonding stats are usually sum of slave stats, it was
not easy to account for tx drops at bonding layer.
We can use dev->tx_dropped for this, as this counter is later
added to the device stats (in dev_get_stats())
This extends the idea we had in commit ee63771474 ("bonding: Simplify
the xmit function for modes that use xmit_hash") for bond_3ad_xor_xmit()
to other bonding modes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Most notable changes in here:
1) By far the biggest accomplishment, thanks to a large range of
contributors, is the addition of multi-send for transmit. This is
the result of discussions back in Chicago, and the hard work of
several individuals.
Now, when the ->ndo_start_xmit() method of a driver sees
skb->xmit_more as true, it can choose to defer the doorbell
telling the driver to start processing the new TX queue entires.
skb->xmit_more means that the generic networking is guaranteed to
call the driver immediately with another SKB to send.
There is logic added to the qdisc layer to dequeue multiple
packets at a time, and the handling mis-predicted offloads in
software is now done with no locks held.
Finally, pktgen is extended to have a "burst" parameter that can
be used to test a multi-send implementation.
Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
virtio_net
Adding support is almost trivial, so export more drivers to
support this optimization soon.
I want to thank, in no particular or implied order, Jesper
Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
David Tat, Hannes Frederic Sowa, and Rusty Russell.
2) PTP and timestamping support in bnx2x, from Michal Kalderon.
3) Allow adjusting the rx_copybreak threshold for a driver via
ethtool, and add rx_copybreak support to enic driver. From
Govindarajulu Varadarajan.
4) Significant enhancements to the generic PHY layer and the bcm7xxx
driver in particular (EEE support, auto power down, etc.) from
Florian Fainelli.
5) Allow raw buffers to be used for flow dissection, allowing drivers
to determine the optimal "linear pull" size for devices that DMA
into pools of pages. The objective is to get exactly the
necessary amount of headers into the linear SKB area pre-pulled,
but no more. The new interface drivers use is eth_get_headlen().
From WANG Cong, with driver conversions (several had their own
by-hand duplicated implementations) by Alexander Duyck and Eric
Dumazet.
6) Support checksumming more smoothly and efficiently for
encapsulations, and add "foo over UDP" facility. From Tom
Herbert.
7) Add Broadcom SF2 switch driver to DSA layer, from Florian
Fainelli.
8) eBPF now can load programs via a system call and has an extensive
testsuite. Alexei Starovoitov and Daniel Borkmann.
9) Major overhaul of the packet scheduler to use RCU in several major
areas such as the classifiers and rate estimators. From John
Fastabend.
10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
Duyck.
11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
Dumazet.
12) Add Datacenter TCP congestion control algorithm support, From
Florian Westphal.
13) Reorganize sk_buff so that __copy_skb_header() is significantly
faster. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
netlabel: directly return netlbl_unlabel_genl_init()
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
net: description of dma_cookie cause make xmldocs warning
cxgb4: clean up a type issue
cxgb4: potential shift wrapping bug
i40e: skb->xmit_more support
net: fs_enet: Add NAPI TX
net: fs_enet: Remove non NAPI RX
r8169:add support for RTL8168EP
net_sched: copy exts->type in tcf_exts_change()
wimax: convert printk to pr_foo()
af_unix: remove 0 assignment on static
ipv6: Do not warn for informational ICMP messages, regardless of type.
Update Intel Ethernet Driver maintainers list
bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
tipc: fix bug in multicast congestion handling
net: better IFF_XMIT_DST_RELEASE support
net/mlx4_en: remove NETDEV_TX_BUSY
3c59x: fix bad split of cpu_to_le32(pci_map_single())
net: bcmgenet: fix Tx ring priority programming
...
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.
Current handling of IFF_XMIT_DST_RELEASE is not optimal.
Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y
The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.
As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.
This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.
Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :
dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
netif_keep_dst(dev);
Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.
The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier change to use usable slave array for TLB mode had an additional
performance advantage. So extending the same logic to all other modes
that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
Also consolidating this with the earlier TLB change.
The main idea is to build the usable slaves array in the control path
and use that array for slave selection during xmit operation.
Measured performance in a setup with a bond of 4x1G NICs with 200
instances of netperf for the modes involved (3ad, xor, tlb)
cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
Mode TPS-Before TPS-After
802.3ad : 468,694 493,101
TLB (lb=0): 392,583 392,965
XOR : 475,696 484,517
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's a trivial fix to display xmit_hash_policy for this new TLB mode
since it uses transmit-hash-poilicy as part of bonding-master info
(/proc/net/bonding/<bonding-interface).
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the code stands today, bonding stats are based simply on the stats
from the member interfaces. If a member was to be removed from a bond,
the stats would instantly drop. This would be confusing to an admin
would would suddonly see interface stats drop while traffic is still
flowing.
In addition to preventing the stats drops mentioned above, new members
will now be added to the bond and only traffic received after the member
was added to the bond will be counted as part of bonding stats. Bonding
counters will also be updated when any slaves are dropped to make sure
the reported stats are reliable.
v2: Changes suggested by Nik to properly allocate/free stats memory.
v3: Properly destroy workqueue and fix netlink configuration path.
v4: Moved cached stats into bonding and slave structs as there does not
seem to be a complexity/performance benefit to using alloc'd memory vs
in-struct memory.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the slave is the curr_active_slave, no need to check
whether the slave is active or not, it is always active.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate the calls to ASSERT_RTNL() before bond_select_active_slave()
inside bond_select_active_slave() itself and remove the ASSERT_RTNL()
from bond_hw_addr_swap() as it's not exported and its only caller -
bond_change_active_slave() already has an ASSERT_RTNL().
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First adjust a couple of locking comments that were left inaccurate,
then adjust comments to use the netdev styling and remove extra new
lines where necessary and add a couple of new lines between declarations
and code. These are all trivial styling changes, no functional change.
Also removed a couple of outdated or obvious comments.
This patch is by no means a complete fix of all netdev style violations
but it gets the bonding closer.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__rlb_next_rx_slave() is a copy of rlb_next_rx_slave() with the
difference that it uses rcu primitives to walk the slave list. We don't
need the two functions and can make rlb_next_rx_slave() a wrapper for
callers which hold RTNL.
So add a comment and ASSERT_RTNL() to make sure what is intended.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that locks have been removed, remove some unnecessary comments and
adjust others to reflect reality. Also add a comment to "mode_lock" to
describe its current users and give a brief summary why they need it.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have bond->mode_lock, we can remove the state_machine_lock
and use it in its place. There're no fast paths requiring the per-port
spinlocks so it should be okay to consolidate them into mode_lock.
Also move it inside the unbinding function as we don't want to expose
mode_lock outside of the specific modes.
Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ALB/TLB specific spinlocks are no longer necessary as we now have
bond->mode_lock for this purpose, so convert them and remove them from
struct alb_bond_info.
Also remove the unneeded lock/unlock functions and use spin_lock/unlock
directly.
Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
curr_slave_lock is now a misleading name, a much better name is
mode_lock as it'll be used for each mode's purposes and it's no longer
necessary to use a rwlock, a simple spinlock is enough.
Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly all users of curr_slave_lock already have RTNL as we've discussed
previously so there's no point in using it, the one case where the lock
must stay is the 3ad code, in fact it's the only one.
It's okay to remove it from bond_do_fail_over_mac() as it's called with
RTNL and drops the curr_slave_lock anyway.
bond_change_active_slave() is one of the main places where
curr_slave_lock was used, it's okay to remove it as all callers use RTNL
these days before calling it, that's why we move the ASSERT_RTNL() in
the beginning to catch any potential offenders to this rule.
The RTNL argument actually applies to all of the places where
curr_slave_lock has been removed from in this patch.
Also remove the unnecessary bond_deref_active_protected() macro and use
rtnl_dereference() instead.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First in rlb_teach_disabled_mac_on_primary() it's okay to remove
curr_slave_lock as all callers except bond_alb_monitor() already hold
RTNL, and in case bond_alb_monitor() is executing we can at most have a
period with bad throughput (very unlikely though).
In bond_alb_monitor() it's okay to remove the read_lock as the slave
list is walked with RCU and the worst that could happen is another
transmitter at the same time and thus for a period which currently is 10
seconds (bond_alb.h: BOND_ALB_LP_TICKS).
And bond_alb_handle_active_change() is okay because it's always called
with RTNL. Removed the ASSERT_RTNL() because it'll be inserted in the
parent function in a following patch.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the read_lock in bond_3ad_lacpdu_recv() since when the slave is
being released its rx_handler is removed before 3ad unbind, so even if
packets arrive, they won't see the slave in an inconsistent state.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds missing space between "interface" and "by"
in bonding module parameter description.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The usage of bond->lock in bond_main.c was completely unnecessary as it
didn't help to sync with anything, most of the spots already had RTNL.
Since there're no more users of bond->lock, remove it.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're safe to remove the bond->lock use from the arp targets because
arp_rcv_probe no longer acquires bond->lock, only rcu_read_lock.
Also setting the primary slave is safe because noone uses the bond->lock
as a syncing mechanism for that anymore.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU to protect against slave release, the proc show function will sync
with the bond destruction by the proc locks and the fact that the bond is
released after NETDEV_UNREGISTER which causes the bonding to remove the
proc entry.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is necessary mainly for two bonding call sites: procfs and
sysfs as it was dereferenced without any real protection.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can remove the lock/unlock as it's no longer necessary since
RTNL should be held while calling bond_alb_set_mac_address().
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 3ad mode the only syncing needed by bond->lock is for the wq
and the recv handler, so change them to use curr_slave_lock.
There're no locking dependencies here as 3ad doesn't use
curr_slave_lock at all.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for slave_changelink to the bonding and uses it
to give the ability to change the queue_id of the enslaved devices via
netlink. It sets slave_maxtype and uses bond_changelink as a prototype for
bond_slave_changelink.
Example/test command after the iproute2 patch:
ip link set eth0 type bond_slave queue_id 10
CC: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix spelling typo in printk within vairous
part of the code.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Userspace needs to be notified if one changes some option.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes this replacement.
The following Coccinelle semantic patch was used for solving it:
@@
@@
(
if(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
|
while(
(<+...
- rcu_dereference
+ rcu_access_pointer
(...)
...+>)) {...}
)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise the name of the kobject isn't getting freed and other stuff from
kobject_cleanup() isn't getting called. kobject_put() will call
kobject_del() on its own in kobject_cleanup().
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This test is reversed so the memory is always leaked. It's better style
to remove the test anyway.
Fixes: 3e403a7777 ('bonding: make it possible to have unlimited nested upper vlans')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're limited by a constant level of vlan nestings, and fail to
find anything beyound that level (currently 2).
To fix this - remove the limit of nestings when going through device tree,
and when the end device is found - allocate the needed amount of vlan tags
and return them, instead of found/not found.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we might arrive to bond_net_exit() with some bonds left (that
were created while the module is unloading). We take care of that by
destroying sysfs (the last possibility to add new bonds) and then
destroying all the remaining bonds.
However, we destroy the /proc/net/bonding directory before destroying those
last bonds, and get a warning that we're trying to destroy a non-empty
proc directory (containing /proc/net/bonding/bondX).
Fix this by moving bond_destroy_proc_dir() after all the bonds are
destroyed, so that we're sure that no bonds exist.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it's always called with RTNL held, via dev_set_allmulti/promiscuity.
Also, remove the wrong comment.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current L2 hash helper calculates destination eth addr and
source ether addr as L2 hash factors. This patch is adding
packet type ID field into L2 hash factors. While one of
BOND_XMIT_POLICY_LAYER2 or BOND_XMIT_POLICY_{LAYER|ENCAP}23
is applied, for the 2nd level hash, enhanced hash method can
help to distribute different types of packets like IPv4/IPv6
packets to different slave devices.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
CC: Pan Jiafei <Jiafei.Pan@freescale.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In TLB mode if tlb_dynamic_lb is NOT set, slaves from the bond
group are selected based on the hash distribution. This does not
exclude dead links which are part of the bond. Also if there is a
temporary link event which brings down the interface, packets
hashed on that interface would be dropped too.
This patch fixes these issues and distributes flows across the
UP links only. Also the array construction of links which are
capable of sending packets happen in the control path leaving
only link-selection during the data-path.
One possible side effect of this is - at a link event; all
flows will be shuffled to get good distribution. But impact of
this should be minimum with the assumption that a member or
members of the bond group are not available is a very temporary
situation.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To maintain the same message structure as netdev_* functions print.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To maintain the same message structure as netdev_* functions print.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One occurance left intact as it's unrelated to net_device.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To maintain the same message structure as netdev_* functions print.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To maintain the same message structure as netdev_* functions print.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several functions left out cause we might not have at that time a valid
bond/slave/port.
Also, converted severa pr_ratelimited into net_ratelimited.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To maintain the same message structure as netdev_* functions print.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converted only the parts where we've had a valid net_device, skipping the
init/deinit and options verification.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During the conversion to "static" functions this one got left out, only its
prototype was converted, thus resulting in:
drivers/net/bonding//bond_options.c:674:5: warning: symbol
'bond_option_mode_set' was not declared. Should it be static?
Fix it by making it static and also break the line in two as it was too
long.
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we exit if the slave isn't the first slave, doesn't support mac
address setting and fail_over_mac isn't FOM_ACTIVE. It's wrong because we
only require ndo_set_mac_address in case bonding is in active-backup mode
and FOM isn't FOM_ACTIVE.
To fix this - only exit with an error if we're in a/b mode and have
fail_over_mac != FOM_ACTIVE.
Also, maintain current behaviour on the first slave (forcibly change fom to
FOM_ACTIVE) to not break anyone's configuration.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using __rcu annotation actually helps to spot all accesses to
bond->current_arp_slave are correctly protected, with LOCKDEP support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU was added to bonding in linux-3.12 but lacked proper sparse annotations.
Using __rcu annotation actually helps to spot all accesses to bond->curr_active_slave
are correctly protected, with LOCKDEP support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
curr_active_slave is rcu protected, and bonding_show_mii_status() only
wants to check if pointer is NULL or not.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only keep bond_option_active_slave_get_rcu() helper.
bond_fill_info() uses a new bond_option_active_slave_get_ifindex()
helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Obvious copy/paste error when I converted the ad_select to the new
option API. "lacp_rate" there should be "ad_select" so we can get the
proper value.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Fixes: 9e5f5eebe7 ("bonding: convert ad_select to use the new option
API")
Reported-by: Karim Scheik <karim.scheik@prisma-solutions.at>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These warnings are no longer relevant. Even when last slave is
removed, there is a valid address assigned to bond (random).
The correct functionality of vlans is ensured by maintaining unicast
list in vlan_sync_address().
Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This limitation maybe had some reason in the past, but now there is not
one -> removing this.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the underlying device supports TCP offloads for VXLAN/UDP
encapulated traffic, we need to reflect that through the hw_enc_features
field of the bonding net-device. This will cause the xmit path
in the core networking stack to provide bonding with encapsulated
GSO frames to offload into the HW etc.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make TLB mode work, the patch allows learning packets
to be sent using mac addresses assigned to macvlan devices,
also taking into an account vlans that may be between the
bond and macvlan device.
To make RLB work, all we have to do is accept ARP packets
for addresses added to the bond dev->uc list. Since RLB
mode will take care to update the peers directly with
correct mac addresses, learning packets for these addresses
do not have be send to switch.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bonding devices manage the unicast filters of the underlying
interfaces, but do not turn on IFF_UNICAST_FLT flag. Thus
anytime a unicast address is added to the bond, the bond is
places in promiscuous mode.
Turn on IFF_UNICAST_FLT on the bond device so that the bond does
not go into promiscuous mode needlesly. If an underlying device
does not support unicast filtering, that device will automaticall
enter promiscuous mode already.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
ALB learning packets are currentlyalways sent using the slave mac
address for all vlans configured on top of bond. This is not always
correct, as vlans may change their mac address.
This patch introduced a concept of strict matching where the
source of learning packets can either strictly match the address
passed in, or it can determine a more correct address to use.
There are 3 casese to consider:
1) Switchover. In this case, we have a new active slave and we need
tell the switch about all addresses available on the slave.
2) Monitor. We'll periodically refresh learning info for all slaves.
In this case, we refresh all addresses for current active, and just
the slave address for other slaves.
3) Teaching of disabled adddress. This happens as part of the
failover and in this case, we alwyas to use just the address
provided.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every caller relies on the result being the actual bond, so this
verification just masks the real problem.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new bond_free_slave() needs new_slave->bond to verify if additional
structures were allocated, so populate it early so that, in case of failure
in bond_enslave(), we would be able to get it.
Also populate the new_slave->dev field, as it's too one of the most needed
things to assign early.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TLB/ALB learning packets always assume 802.1Q vlan protocol, but
that is no longer the case since we now have support for Q-in-Q
on top of bonding. Pass the vlan protocol to alb_send_lp_vid()
so that the packets are properly tagged.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_compute_features() uses netdev_increment_features() to
combine vlan_features of slaves into vlan_features of the bond.
As netdev_increment_features() only adds most features and we
start with BOND_VLAN_FEATURES, we can end up with features none
of the slaves provided.
If there is at least one slave, initialize vlan_features only
with the flags in NETIF_F_ALL_FOR_ALL. Right now there is none
in BOND_VLAN_FEATURES but stating it explicitely will make the
code more future proof.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
ALB/TLB learning packets use all vlans configured on top
of the bond. This ends up being incorrect if we have a stack
of vlans on top of the bond. ALB/TLB should only use
first level/outer most vlans in its announcements.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to commit fbd929f2dc
bonding: support QinQ for bond arp interval
the arp monitoring code allowed for proper detection of devices
stacked on top of vlans. Since the above commit, the
code can still detect a device stacked on top of single
vlan, but not a device stacked on top of Q-in-Q configuration.
The search will only set the inner vlan tag if the route
device is the vlan device. However, this is not always the
case, as it is possible to extend the stacked configuration.
With this patch it is possible to provision devices on
top Q-in-Q vlan configuration that should be used as
a source of ARP monitoring information.
For example:
ip link add link bond0 vlan10 type vlan proto 802.1q id 10
ip link add link vlan10 vlan100 type vlan proto 802.1q id 100
ip link add link vlan100 type macvlan
Note: This patch limites the number of stacked VLANs to 2,
just like before. The original, however had another issue
in that if we had more then 2 levels of VLANs, we would end
up generating incorrectly tagged traffic. This is no longer
possible.
Fixes: fbd929f2dc (bonding: support QinQ for bond arp interval)
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Patric McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They're verifying the same thing (except of IFF_UP, which is implied for
netif_running(), which is also a prerequisite).
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, remove the IFF_UP verification cause we can't be netif_running() with
being also IFF_UP.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, use standard IP primitives to check the address.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the name a bit to better reflect its scope, and update some
comments. Two functions added - one which takes bond as a param and the
other which takes the mode.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, change its name to better reflect its scope, and skip the "no"
part.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, make it accept bonding as a parameter and change the name a bit to
better reflect its scope.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's used only in an inline function and is useless.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've missed to add a NULL entry to the bond_intmax_tbl when I introduced
it with the conversion of arp_interval so add it now.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Fixes: 7bdb04ed0d ("bonding: convert arp_interval to use the new option API")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct ad_slave_info is very huge, and only be used for 802.3ad mode,
so alloc the structure dynamically could save 356 Bits for every slave in
non 802.3ad mode.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a generic option store function for sysfs and remove the
specific ones. The attribute name is used to match against the option
which is to be set.
Also adjust the "name" of tlb_dynamic_lb option to match the sysfs
entry and fix the comments and comment style in bond_sysfs.c
The comments which showed obvious behaviour (i.e. behaviour that's seen
in the option's entry) are removed, the ones that explained important
points about the setting function have been moved above the respective
set function in bond_options.c
There's only 1 exception: num_unsol_na/num_grat_arp since it has 2 names
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The argument slave is not used for slave_do_arp_validate_only(), so no need
to keep it, make the function more simple.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the structure bond_net forward, and remove the unnecessary structure declaration.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix format string mismatch in bonding_show_min_links().
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The aggresive load balancing causes packet re-ordering as active
flows are moved from a slave to another within the group. Sometime
this aggresive lb is not necessary if the preference is for less
re-ordering. This parameter if used with value "0" disables
this dynamic flow shuffling minimizing packet re-ordering. Of course
the side effect is that it has to live with the static load balancing
that the hashing distribution provides. This impact is less severe if
the correct xmit-hashing-policy is used for the tlb setup.
The default value of the parameter is set to "1" mimicing the earlier
behavior.
Ran the netperf test with 200 stream for 1 min between two hosts with
4x1G trunk (xmit-lb mode with xmit-policy L3+4) before and after these
changes. Following was the command used for those 200 instances -
netperf -t TCP_RR -l 60 -s 5 -H <host> -- -r81920,81920
Transactions per second:
Before change: 1,367.11
After change: 1,470.65
Change-Id: Ie3f75c77282cf602e83a6e833c6eb164e72a0990
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Re-organized the xmit function for the lb mode separating tlb xmit
from the alb mode. This will enable use of the hashing policies
like 802.3ad mode. Also extended use of xmit-hash-policy to tlb mode.
Now the tlb-mode defaults to BOND_XMIT_POLICY_LAYER2 if the xmit policy
module parameter is not set (just like 802.3ad, or Xor mode).
Change-Id: I140257403d272df75f477b380207338d0f04963e
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Separating the actual xmit part from the function in a separate
function that can be used in the tlb_xmit in the next patch. Also
there is no reason do_tx_balance to be an int so changing it to
bool type.
Change-Id: I9c48ff30487810f68587e621a191db616f49bd3b
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modified the hash function to return just hash separating from the
modulo operation that can be performed by the caller. This is to
make way for the tlb mode to use the same hashing policies that
are used in the 802.3ad and Xor mode.
Change-Id: I276609e87e0ca213c4d1b17b79c5e0b0f3d0dd6f
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the bonding debug_fs entries when the
module initialization fails. The debug_fs
entries should be removed together with all other
already allocated resources.
Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_open is not setting the inactive flag correctly for some modes (alb and
tlb), resulting in error behavior if the bond has been administratively set
down and then back up. This effect should not occur when slaves are added while
the bond is up; it's something that only happens after a down/up bounce of the
bond.
For example, in bond tlb or alb mode, domu send some ARP request which go out
from dom0 bond's active slave, then the ARP broadcast request packets go back to
inactive slave from switch, because the inactive slave's inactive flag is zero,
kernel will receive the packets and pass them to bridge that cause dom0's bridge
map domu's MAC address to port of bond, bridge should map domu's MAC to port of
vif.
Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gfp parameter was added in:
commit 47be03a28c
Author: Amerigo Wang <amwang@redhat.com>
Date: Fri Aug 10 01:24:37 2012 +0000
netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reason for the gfp parameter was removed in:
commit c4cdef9b71
Author: dingtianhong <dingtianhong@huawei.com>
Date: Tue Jul 23 15:25:27 2013 +0800
bonding: don't call slave_xxx_netpoll under spinlocks
The slave_xxx_netpoll will call synchronize_rcu_bh(),
so the function may schedule and sleep, it should't be
called under spinlocks.
bond_netpoll_setup() and bond_netpoll_cleanup() are always
protected by rtnl lock, it is no need to take the read lock,
as the slave list couldn't be changed outside rtnl lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing else that calls __netpoll_setup or ndo_netpoll_setup
requires a gfp paramter, so remove the gfp parameter from both
of these functions making the code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unnecessary log and add net_ratelimit to the others, in order to
avoid spam the log.
Cc: Joe Perches <joe@perches.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond send arp request to indicate that the slave is active, and if the bond dev
is a vlan dev, it will set the vlan tag in skb to notice the vlan group, but the
bond could only send a skb with 802.1q proto, not support for QinQ.
So add outer tag for lower vlan tag and inner tag for upper vlan tag to support QinQ,
The new skb will be consist of two vlan tag just like this:
dst mac | src mac | outer vlan tag | inner vlan tag | data | .....
If We don't need QinQ, the inner vlan tag could be set to 0 and use outer vlan tag
as a normal vlan group.
Using "ip link" to configure the bond for QinQ and add test log:
ip link add link bond0 bond0.20 type vlan proto 802.1ad id 20
ip link add link bond0.20 bond0.20.200 type vlan proto 802.1q id 200
ifconfig bond0.20 11.11.20.36/24
ifconfig bond0.20.200 11.11.200.36/24
echo +11.11.200.37 > /sys/class/net/bond0/bonding/arp_ip_target
90:e2:ba:07:4a:5c (oui Unknown) > Broadcast, ethertype 802.1Q-QinQ (0x88a8),length 50: vlan 20, p 0,ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 11.11.200.37 tell 11.11.200.36, length 28
90:e2:ba:06:f9:86 (oui Unknown) > 90:e2:ba:07:4a:5c (oui Unknown), ethertype 802.1Q-QinQ (0x88a8), length 50: vlan 20, p 0, ethertype 802.1Q, vlan 200, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Reply 11.11.200.37 is-at 90:e2:ba:06:f9:86 (oui Unknown), length 28
v1->v2: remove the comment "TODO: QinQ?".
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It may spam if the system is out of the memory, add ratelimit for it.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add unlikely() micro to the unlikely conditions in the bond
xmit path for slight optimization.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only ratelimit the ones that might spam, omiting the ones from
enslave/deslave.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's using the wrong ETH_P_LOOP type, which is sometimes treated
as packet length instead of ether type (because it's 0x0060).
Use the new ETH_P_LOOPBACK type.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit d3ab3ffd1d
(bonding: use rlb_client_info->vlan_id instead of ->tag)
remove the rlb_client_info->tag, but occur some issues,
The vlan_get_tag() will return 0 for success and -EINVAL for
error, so the client_info->vlan_id always be set to 0 if the
vlan_get_tag return 0 for success, so the client_info would
never get a correct vlan id.
We should only set the vlan id to 0 when the vlan_get_tag return error.
Fixes: d3ab3ffd1d (bonding: use rlb_client_info->vlan_id instead of ->tag)
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace kfree_skb with dev_kfree_skb_any in functions that can
be called in hard irq and other contexts.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The option code is taking IP address and putting it into a generic
container. Force cast to silence sparse warnings.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a fixup patch to resolve issues with const from my earlier patch.
Make all the setter functions use const on input parameter.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We didn't correctly check cases where the value for lp_interval is not
within the legal range due to a missing table terminator.
This would let userspace trigger a kernel panic by specifying a value out
of range:
echo -1 > /sys/devices/virtual/net/bond0/bonding/lp_interval
Introduced by commit 4325b374f8 ("bonding: convert lp_interval to use
the new option API").
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make local functions static (ie. only used in bond_options.c)
Make bond options parsing tables constant.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are defined but no longer used.
Compile tested only.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're using GFP_KERNEL, however there are some path(s) where we
can hold some spinlocks, specifically bond->curr_slave_lock:
[ 4.722916] BUG: sleeping function called from invalid context at mm/slub.c:965
[ 4.724438] in_atomic(): 1, irqs_disabled(): 0, pid: 940, name: ifup-eth
[ 4.726034] 5 locks held by ifup-eth/940:
...snip...
[ 4.734646] #4: (&bond->curr_slave_lock){+...+.}, at: [<ffffffffa00badc6>] bond_enslave+0xda6/0xdd0 [bonding]
...snip...
[ 4.759081] [<ffffffffa00b6f11>] bond_change_active_slave+0x191/0x3b0 [bonding]
[ 4.760917] [<ffffffffa00b7227>] bond_select_active_slave+0xf7/0x1d0 [bonding]
[ 4.762751] [<ffffffffa00badce>] bond_enslave+0xdae/0xdd0 [bonding]
...snip...
As it's out of hot path and is a really rare event - change the gfp_t flags
to GFP_ATOMIC to avoid sleeping under spinlock.
v2: convert new notify calls to GFP_ATOMIC.
CC: Thomas Glanzmann <thomas@glanzmann.de>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're only sending arp requests if we have a route to the target
(and, thus, can find out the source ip address).
There are some use cases, however, where we don't want/need to set an ip
address (or set up a specific route) for bonding to use arp monitoring *for
traffic generation*. We can easily send arp probes (arp requests with src
ip == 0) to generate arp broadcast responses from the target ip and use
them for determining if the target is up.
This, obviously, won't work with arp validation - because we don't have the
ip address set and, thus, will filter out the responses. So in that case -
print a warning.
CC: François CACHEREUL <f.cachereul@alphalink.fr>
CC: Zhenjie Chen <zhchen@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enslaving a bond to itself leads to an endless loop and hangs the kernel.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Tested-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a bug in the slave release function which leads the transmit
functions which use the bond->slave_cnt to a div by 0 because we might
just have released our last slave and made slave_cnt == 0 but at the same
time we may have a transmitter after the check for an empty list which will
fetch it and use it in the slave id calculation.
Fix it by moving the slave_cnt after synchronize_rcu so if this was our
last slave any new transmitters will see an empty slave list which is
checked after rcu lock but before calling the mode transmit functions
which rely on bond->slave_cnt.
Fixes: 278b208375 ("bonding: initial RCU conversion")
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Veaceslav has reported and fix this problem by commit f2ebd477f1
(bonding: restructure locking of bond_ab_arp_probe()). According Jay's
opinion, the current solution is not very well, because the notification
is to indicate that the interface has actually changed state in a meaningful
way, but these calls in the ab ARP monitor are internal settings of the flags
to allow the ARP monitor to search for a slave to become active when there are
no active slaves. The flag setting to active or backup is to permit the ARP
monitor's response logic to do the right thing when deciding if the test
slave (current_arp_slave) is up or not.
So the best way to fix the problem is that we should not send a notification
when the slave is in testing state, and check the state at the end of the
monitor, if the slave's state recover, avoid to send pointless notification
twice. And RTNL is really a big lock, hold it regardless the slave's state
changed or not when the current_active_slave is null will loss performance
(every 100ms), so we should hold it only when the slave's state changed and
need to notify.
I revert the old commit and add new modifications.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem was introduced by the commit 1d3ee88ae0
(bonding: add netlink attributes to slave link dev).
The bond_set_active_slave() and bond_set_backup_slave()
will use rtmsg_ifinfo to send slave's states, so these
two functions should be called in RTNL.
In 802.3ad mode, acquiring RTNL for the __enable_port and
__disable_port cases is difficult, as those calls generally
already hold the state machine lock, and cannot unconditionally
call rtnl_lock because either they already hold RTNL (for calls
via bond_3ad_unbind_slave) or due to the potential for deadlock
with bond_3ad_adapter_speed_changed, bond_3ad_adapter_duplex_changed,
bond_3ad_link_change, or bond_3ad_update_lacp_rate. All four of
those are called with RTNL held, and acquire the state machine lock
second. The calling contexts for __enable_port and __disable_port
already hold the state machine lock, and may or may not need RTNL.
According to the Jay's opinion, I don't think it is a problem that
the slave don't send notify message synchronously when the status
changed, normally the state machine is running every 100 ms, send
the notify message at the end of the state machine if the slave's
state changed should be better.
I fix the problem through these steps:
1). add a new function bond_set_slave_state() which could change
the slave's state and call rtmsg_ifinfo() according to the input
parameters called notify.
2). Add a new slave parameter which called should_notify, if the slave's state
changed and don't notify yet, the parameter will be set to 1, and then if
the slave's state changed again, the param will be set to 0, it indicate that
the slave's state has been restored, no need to notify any one.
3). the __enable_port and __disable_port should not call rtmsg_ifinfo
in the state machine lock, any change in the state of slave could
set a flag in the slave, it will indicated that an rtmsg_ifinfo
should be called at the end of the state machine.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_xxx_info_query() was already in RTNL, so no need to use
bond lock to protect the bond slave list, so remove it.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_info_show_master already in RCU read-side critical section,
and the we access curr_active_slave without the curr_slave_lock, we
could not sure whether the curr_active_slave will be changed during
the processing, so use RCU to protected the pointer.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __netpoll_setup() will check the slave's flag and ndo_poll_controller just
like the slave_dev_support_netpoll() does, and slave_dev_support_netpoll() was
not used by any place, so remove it.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond->curr_active_slave can be changed between its deferences, even to
NULL, and thus we might panic.
We're always holding the rcu (rx_handler->bond_handle_frame()->bond_arp_rcv())
so fix this by rcu_dereferencing() it and using the saved.
Reported-by: Ding Tianhong <dingtianhong@huawei.com>
Fixes: aeea64a ("bonding: don't trust arp requests unless active slave really works")
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the error case return early.
Make the normal return at the bottom of the function.
Reduces indent for readability.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's unnecessary and less readable after a clause ending in a goto.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's smaller and faster for some architectures.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_3ad.h
drivers/net/bonding/bond_main.c
Two minor conflicts in bonding, both of which were overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the new meaning.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slave->jiffies is updated every time the slave becomes active, which, for
bonding, means that its link is 'up'.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all the logic is handled via last_arp_rx, we don't need to use
last_rx.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that last_arp_rx correctly show the last time we've received an ARP, we
can use it safely instead of slave->dev->last_rx.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that last_arp_rx really has the last time we've received any (validated or
not) ARP, we can use it in slave_last_rx() instead of slave->dev->last_rx.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the options are in place - arp_validate can be set to receive all
the traffic or only arp packets to verify if the slave is up, when the
slave isn't validated.
CC: Rob Landley <rob@landley.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Nikolay Aleksandrov <nikolay@redhat.com>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we can either receive any traffic as a proff of slave being up,
or only *validated* arp traffic (i.e. with src/dst ip checked).
Add an option to be able to specify if we want to receive non-validated arp
traffic only.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we only set bond_arp_rcv() if we're using arp_validate, however
this makes us skip updating last_arp_rx if we're not validating incoming
ARPs - thus, if arp_validate is off, last_arp_rx will never be updated.
Fix this by always setting up recv_probe = bond_arp_rcv, even if we're not
using arp_validate.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're updating the last_arp_rx only when we've validate the
packet, however afterwards we use it as 'ANY last packet received', but not
only validated ARPs.
Fix this by updating it in case of any packet received. It won't break the
arp_validation=0 because we, anyway, return the correct slave->dev->last_rx in
slave_last_rx().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's disabled because it's sometimes hard, in typical configs, to
make it work - because of the nature how the loadbalance modes work - as
it's hard to deliver valid arp replies to correct slaves by the switch.
However we still can use arp_validation in loadbalance with several other
configs, per example with arp_validate == 2 for backup with one broadcast
domain, without the switch(es) doing any balancing - this way we'd be (a
bit more) sure that the slave is up.
So, enable it to let users decide which one works/suits them best. Also
correct the mode limitation from BOND_OPT_ARP_VALIDATE.
CC: Nikolay Aleksandrov <nikolay@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're always called with rcu_read_lock() held (bond_arp_rcv() is only
called from bond_handle_frame(), which is rx_handler and always called
under rcu from __netif_receive_skb_core() ).
The slave active/passive and/or bonding params can change in-flight, however
we don't really care about that - we only modify the last time packet was
received, which is harmless.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aggregator_identifier is used to assign unique aggregator identifiers
to aggregators of a bond during device enslaving.
aggregator_identifier is currently a global variable that is zeroed in
bond_3ad_initialize().
This sequence will lead to duplicate aggregator identifiers for eth1 and eth3:
create bond0
change bond0 mode to 802.3ad
enslave eth0 to bond0 //eth0 gets agg id 1
enslave eth1 to bond0 //eth1 gets agg id 2
create bond1
change bond1 mode to 802.3ad
enslave eth2 to bond1 //aggregator_identifier is reset to 0
//eth2 gets agg id 1
enslave eth3 to bond0 //eth3 gets agg id 2
Fix this by making aggregator_identifier private to the bond.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).
This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ether_addr_copy is smaller and faster for some architectures.
This relies on a stack frame being at least __aligned(2)
for one use of an Ethernet address on the stack.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing terminating newlines.
Convert uses of pr_info to pr_cont in bond_check_params.
Standardize upper/lower case styles.
Typo fixes, remove unnecessary parentheses and periods.
Alignment neatening.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use more current logging style.
Coalesce formats, realign arguments, drop unnecessary periods.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/bonding/bond_options.c includes rwlock.h directly,
which is a nono, and which also breaks RT kernel build.
Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_valid_name() will check the buffer length for input name, no need to
check it twice.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_set_mac_address() will check the dev->netdev_ops->ndo_set_mac_address,
so no need to check it in bond_set_mac_address().
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver take write locks and spin locks that are shared
by the tx path in enslave processing and notification processing,
If the netconsole is in use, the bonding can call printk which puts
us in the netpoll tx path, if the netconsole is attached to the bonding
driver, result in deadlock.
So add protection for these place, by checking the netpoll_block_tx
state, we can defer the sending of the netconsole frames until a later
time using the retransmit feature of netpoll_send_skb that is triggered
on the return code NETDEV_TX_BUSY.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond enslave processing don't hold bond->lock anymore,
so release an unlocked rw lock will cause warning message,
remove the unwanted read_unlock(&bond->lock).
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fail_over_mac could be set to active or follow in any time for all modes,
so if the fail_over_mac is not none and the current mode is not active-backup,
the bond_set_mac_address() could not change the master and slave's MAC address.
In bond_set_mac_address(), the fail_over_mac should only affect AB mode, so modify
to check the mode in addition to fail_over_mac when setting bond's MAC address.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to bonding.txt, the fail_over_ma should only affect active-backup mode,
but I found that the fail_over_mac could be set to active or follow in all
modes, this will cause new slave could not be set to bond's MAC address at
enslave processing and restore its own MAC address at removal processing.
The correct way to fix the problem is that we should not add restrictions when
setting options, just need to modify the bond enslave and removal processing
to check the mode in addition to fail_over_mac when setting a slave's MAC during
enslavement. The change active slave processing already only calls the fail_over_mac
function when in active-backup mode.
Thanks for Jay's suggestion.
The patch also modify the pr_warning() to pr_warn().
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 1d3ee88ae0
(bonding: add netlink attributes to slave link dev)
has add rtmsg_ifinfo() in bond_set_active_slave() and
bond_set_backup_slave(), so the two function need to
called in RTNL lock, but bond_loadbalance_arp_mon()
only calling these functions in RCU, warning message
will occurs.
fix this by add a new function bond_slave_state_change(),
which will reset the slave's state after slave link check,
so remove the bond_set_xxx_slave() from the cycle and only
record the slave_state_changed, this will call the new
function to set all slaves to new state in RTNL later.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're calling it from under RCU context, however we're using some
functions that require rtnl to be held.
Fix this by restructuring the locking - don't call it under any locks,
aquire rcu_read_lock() if we're sending _only_ (i.e. we have the active
slave present), and use rtnl locking otherwise - if we need to modify
(in)active flags of a slave.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently bond_ab_arp_probe() is always called under rcu_read_lock(),
however to work with curr_active_slave we're still holding the
curr_slave_lock.
To remove that curr_slave_lock - rcu_dereference the bond's
curr_active_slave and use it further - so that we're sure the slave won't
go away, and we don't care if it will change in the meanwhile.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) BPF debugger and asm tool by Daniel Borkmann.
2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.
3) Correct reciprocal_divide and update users, from Hannes Frederic
Sowa and Daniel Borkmann.
4) Currently we only have a "set" operation for the hw timestamp socket
ioctl, add a "get" operation to match. From Ben Hutchings.
5) Add better trace events for debugging driver datapath problems, also
from Ben Hutchings.
6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
have a small send and a previous packet is already in the qdisc or
device queue, defer until TX completion or we get more data.
7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.
8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
Borkmann.
9) Share IP header compression code between Bluetooth and IEEE802154
layers, from Jukka Rissanen.
10) Fix ipv6 router reachability probing, from Jiri Benc.
11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.
12) Support tunneling in GRO layer, from Jerry Chu.
13) Allow bonding to be configured fully using netlink, from Scott
Feldman.
14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
already get the TCI. From Atzm Watanabe.
15) New "Heavy Hitter" qdisc, from Terry Lam.
16) Significantly improve the IPSEC support in pktgen, from Fan Du.
17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
Herbert.
18) Add Proportional Integral Enhanced packet scheduler, from Vijay
Subramanian.
19) Allow openvswitch to mmap'd netlink, from Thomas Graf.
20) Key TCP metrics blobs also by source address, not just destination
address. From Christoph Paasch.
21) Support 10G in generic phylib. From Andy Fleming.
22) Try to short-circuit GRO flow compares using device provided RX
hash, if provided. From Tom Herbert.
The wireless and netfilter folks have been busy little bees too.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
net/cxgb4: Fix referencing freed adapter
ipv6: reallocate addrconf router for ipv6 address when lo device up
fib_frontend: fix possible NULL pointer dereference
rtnetlink: remove IFLA_BOND_SLAVE definition
rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
qlcnic: update version to 5.3.55
qlcnic: Enhance logic to calculate msix vectors.
qlcnic: Refactor interrupt coalescing code for all adapters.
qlcnic: Update poll controller code path
qlcnic: Interrupt code cleanup
qlcnic: Enhance Tx timeout debugging.
qlcnic: Use bool for rx_mac_learn.
bonding: fix u64 division
rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
sfc: Use the correct maximum TX DMA ring size for SFC9100
Add Shradha Shah as the sfc driver maintainer.
net/vxlan: Share RX skb de-marking and checksum checks with ovs
tulip: cleanup by using ARRAY_SIZE()
ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
net/cxgb4: Don't retrieve stats during recovery
...
After the option conversion downdelay and updelay divide a u64
and on a 32 bit this causes the following errors:
ERROR: "__udivdi3" [drivers/net/bonding/bonding.ko] undefined!
ERROR: "__umoddi3" [drivers/net/bonding/bonding.ko] undefined!
Fix it by using a normal int instead because newval->value is capped
at INT_MAX by the way the option is defined.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like bridge, bonding as netdevice doesn't cross netns boundaries.
Bonding ports and bonding itself live in same netns.
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow user to identify easily what the attributes are related to. Change
the name of the group to "bonding_slave" to be similar to master which
is named "bonding".
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the new primay is not matching any slave in the bond, the bond should
record it to params, clean the primary slave and select a new active slave.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so slaves would use
the new bonding option API. Also move the option to its own set function
in bond_options.c and fix some style errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so lp_interval would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so resend_igmp would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so all_slaves_active would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so queue_id would use
the new bonding option API. Also move it to its own set function in
bond_options.c and fix some style errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so active_slave would use
the new bonding option API. Also some trivial/style fixes.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so use_carrier would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so primary_reselect would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so primary would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so miimon would use
the new bonding option API. The "default" definition has been removed as
it was 0.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so num_peer_notif would use
the new bonding option API.
When the auto-sysfs generation is done an alias should be added for
this option as there're currently 2 entries in sysfs for it.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so ad_select would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so min_links would use
the new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so lacp_rate would use
the new bonding option API. Also some trivial/style error fixes.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so updelay would use
the new bonding option API. Also some trivial style fixes.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so downdelay would use
the new bonding option API. Also some trivial style fixes.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so arp_ip_target would use
the new bonding option API. This option is an exception because of
the way it's currently implemented that's why its netlink code is
a bit different from the other options to keep the functionality as
before and at the same time to have a single set function.
This patch also fixes a few stylistic errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so arp_interval would use
the new bonding option API. The "default" definition has been removed as
it was 0.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so fail_over_mac would use
the new bonding option API. Also fixes a trivial copy/paste error in
bond_check_params where the wrong variable was used for the error msg.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so arp_all_targets would use the
new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so arp_validate would use the
new bonding option API. Also fix some trivial/style errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so xmit_hash_policy would use the
new bonding option API. Also fix some trivial/style errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary changes so packets_per_slave would use the
new bonding option API.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the bond's mode setting use the new option API and
adds support for dependency printing which relies on having an entry for
the mode option in the bond_opts[] array.
Also add the ability to print the mode name when mode dependency fails
and fix some trivial/style errors.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary basic infrastructure to support
centralized and unified option manipulation API for the bonding. The new
structure bond_option will be used to describe each option with its
dependencies on modes which will be checked automatically thus removing a
lot of duplicated code. Also automatic range checking is added for
some options. Currently the option setting function requires RTNL to
be acquired prior to calling it, since many options already rely on RTNL
it seemed like the best choice to protect all against common race
conditions.
In order to add an option the following steps need to be done:
1. Add an entry BOND_OPT_<option> to bond_options.h so it gets a unique id
and a bit corresponding to the id
2. Add a bond_option entry to the bond_opts[] array in bond_options.c which
describes the option, its dependencies and its manipulation function
3. Add code to export the option through sysfs and/or as a module parameter
(the sysfs export will be made automatically in the future)
The options can have different flags set, currently the following are
supported:
BOND_OPTFLAG_NOSLAVES - require that the bond device has no slaves prior
to setting the option
BOND_OPTFLAG_IFDOWN - require that the bond device is down prior to
setting the option
BOND_OPTFLAG_RAWVAL - don't parse the value but return it raw for the
option to parse
There's a new value structure to describe different types of values
which can have the following flags:
BOND_VALFLAG_DEFAULT - marks the default option (permanent string alias
to this option is "default")
BOND_VALFLAG_MIN - the minimum value that this option can have
BOND_VALFLAG_MAX - the maximum value that this option can have
An example would be nice here, so if we have an option which can have
the values "off"(2), "special"(4, default) and supports a range, say
16 - 32, it should be defined as follows:
"off", 2,
"special", 4, BOND_VALFLAG_DEFAULT,
"rangemin", 16, BOND_VALFLAG_MIN,
"rangemax", 32, BOND_VALFLAG_MAX
So we have the valid intervals: [2, 2], [4, 4], [16, 32]
Also the valid strings: "off" = 2, "special" and "default" = 4
"rangemin" = 16, "rangemax" = 32
BOND_VALFLAG_(MIN|MAX) can be used to specify a valid range for an
option, if MIN is omitted then 0 is considered as a minimum. If an
exact match is found in the values[] table it will be returned,
otherwise the range is tried (if available).
The option parameter passing is done by using a special structure called
bond_opt_value which can take either a string or a value to parse. One
of the bond_opt_init(val|str) macros should be used depending on which
one does the user want to parse (string or value). Then a call to
__bond_opt_set should be done under RTNL.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.
This has been fixed by Eric Dumazet in commit aee636c480
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:
1) Initialization:
int l = ceil(log_2 d)
uword m' = floor((1<<32)*((1<<l)-d)/d)+1
int sh_1 = min(l,1)
int sh_2 = max(l-1,0)
2) For q = n/d, all uword:
uword t = (n*m')>>32
q = (t+((n-t)>>sh_1))>>sh_2
The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.
Joint work with Daniel Borkmann.
[1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
[2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
[3] https://gmplib.org/~tege/division-paper.pdf
[4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
[5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
[6] http://www.agner.org/optimize/asmlib.zip
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU updates from Ingo Molnar:
- add RCU torture scripts/tooling
- static analysis improvements
- update RCU documentation
- miscellaneous fixes
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
rcu: Remove "extern" from function declarations in kernel/rcu/rcu.h
rcu: Remove "extern" from function declarations in include/linux/*rcu*.h
rcu/torture: Dynamically allocate SRCU output buffer to avoid overflow
rcu: Don't activate RCU core on NO_HZ_FULL CPUs
rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
rcu: Add an RCU_INITIALIZER for global RCU-protected pointers
rcu: Make rcu_assign_pointer's assignment volatile and type-safe
bonding: Use RCU_INIT_POINTER() for better overhead and for sparse
rcu: Add comment on evaluate-once properties of rcu_assign_pointer().
rcu: Provide better diagnostics for blocking in RCU callback functions
rcu: Improve SRCU's grace-period comments
rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values
rcu: Fix coccinelle warnings
rcutorture: Stop tracking FSF's postal address
rcutorture: Move checkarg to functions.sh
rcutorture: Flag errors and warnings with color coding
rcutorture: Record results from repeated runs of the same test scenario
rcutorture: Test summary at end of run with less chattiness
rcutorture: Update comment in kvm.sh listing typical RCU trace events
rcutorture: Add tracing-enabled version of TREE08
...
If link is IFF_SLAVE, extend link dev netlink attributes to include
slave attributes with new IFLA_SLAVE nest. Add netlink notification
(RTM_NEWLINK) when slave status changes from backup to active, or
visa-versa.
Adds new ndo_get_slave op to net_device_ops to fill skb with IFLA_SLAVE
attributes. Currently only used by bonding driver, but could be
used by other aggregating devices with slaves.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add sub-directory under /sys/class/net/<interface>/slave with
read-only attributes for slave. Directory only appears when
<interface> is a slave.
$ tree /sys/class/net/eth2/slave/
/sys/class/net/eth2/slave/
├── ad_aggregator_id
├── link_failure_count
├── mii_status
├── perm_hwaddr
├── queue_id
└── state
$ cat /sys/class/net/eth2/slave/*
2
0
up
40:02:10:ef:06:01
0
active
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if a slave's name change, we just pass it by. However, if the
slave is a current primary_slave, then we end up with using a slave, whose
name != params.primary, for primary_slave. And vice-versa, if we don't have
a primary_slave but have params.primary set - we will not detected a new
primary_slave.
Fix this by catching the NETDEV_CHANGENAME event and setting primary_slave
accordingly. Also, if the primary_slave was changed, issue a reselection of
the active slave, cause the priorities have changed.
Reported-by: Ding Tianhong <dingtianhong@huawei.com>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following call chain indicates that bond_do_ioctl() is protected
under rtnl_lock. If we use __dev_get_by_name() instead of
dev_get_by_name() to find interface handler in it, this would
help us avoid to change reference counter of interface once.
dev_ioctl()
rtnl_lock()
dev_ifsioc()
bond_do_ioctl()
rtnl_unlock()
Additionally we also change the coding style in bond_do_ioctl(),
letting it more readable for us.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the implementation is meaningless - once again, we take the
slave structure and use it after we've exited RCU critical section.
Fix this by removing the rcu_read_lock() from __get_active_agg(), and
ensuring that all its callers are holding RCU.
Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the RCU read lock usage is just wrong - it gets the slave struct
under RCU and continues to use it when RCU lock is released.
However, it's still safe to do this cause we didn't need the
rcu_read_lock() initially - all of the __get_first_agg() callers are either
holding RCU read lock or the RTNL lock, so that we can't sync while in it.
Fixes: be79bd048 ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, its usage is just plainly wrong. It first gets a slave under
RCU, and, after releasing the RCU lock, continues to use it - whilst it can
be freed.
Fix this by ensuring that bond_3ad_set_carrier() holds RCU till it uses its
slave (or its agg).
Fixes: be79bd048a ("bonding: add RCU for bond_3ad_state_machine_handler()")
CC: dingtianhong@huawei.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That code has been around for ages without being used.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's a huge mess currently, that is really hard to read. This cleanup
doesn't touch the logic at all, it only breaks easy-to-fix long lines and
updates comment styles.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The
will cause several issues:
- NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan
instead of lower device which misses the necessary txq synchronization for
lower device such as txq stopping or frozen required by dev watchdog or
control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
watchdog.
- dev_hard_start_xmit() does not check txq everywhere which will lead a crash
when tso is disabled for lower device.
Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.
With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.
In the future, it was also required for macvtap l2 forwarding support since it
provides a necessary synchronization method.
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
It returns 0 in case of success, !0 error otherwise. Fix the improper error
verification.
Fixes: 2c9839c143 ("bonding: add num_grat_arp attribute netlink support")
CC: sfeldma@cumulusnetworks.com
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add bounds checking for params defined with parm tbl.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing space for IFLA_BOND_ARP_IP_TARGET nest header.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nested IFLA_BOND_AD_INFO for bonding 802.3ad info.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_AD_SELECT to allow get/set of bonding parameter
ad_select via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_AD_LACP_RATE to allow get/set of bonding parameter
lacp_rate via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More functions in bonding that can be declared static because
they are only used in one file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net_device.dev_addr have more than 2 bytes of additional data after
the mac addr, so it is safe to use the ether_addr_equal_64bits().
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm sure the operand slave and bond for the function will not
be NULL, so the check for the bond will not make any sense, so
remove the judgement, and the return value was useless here,
remove the unwanted return value.
The comments for the bond 3ad is too old, cleanup some errors
and warming.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value for bond_dev_queue_xmit() will not be used anymore,
so remove the return value.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the skb is xmit by the function bond_slave_override(),
it will have duplicate judgement for slave state, and I think it
will consumes a little performance, maybe it is negligible,
so I simplify the function and remove the unwanted judgement.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_alb_xmit will check the return value for
bond_dev_queue_xmit() every time, but the bond_dev_queue_xmit()
is always return 0, it is no need to check the value every time,
so remove the unneed judgement for the xmit path.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_dev_queue_xmit() will always return 0, and as a fast path,
it is inappropriate to check the res value when xmit every package,
so remove the res check and avoid once judgement for xmit.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use possibly more efficient ether_addr_equal_64bits
to instead of memcmp.
Modify the MAC_ADDR_COMPARE to MAC_ADDR_EQUAL, this looks more
appropriate.
The comments for the bond 3ad is too old, cleanup some errors
and warming.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond 3ad and TLB/ALB has the same check path, so combine them.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond driver could set the lp_interval when loading module.
Suggested-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_xmit_slave_id is only used in main.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_PACKETS_PER_SLAVE to allow get/set of bonding parameter
packets_per_slave via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_LP_INTERVAL to allow get/set of bonding parameter
lp_interval via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_MIN_LINKS to allow get/set of bonding parameter
min_links via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_ALL_SLAVES_ACTIVE to allow get/set of bonding parameter
all_slaves_active via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_NUM_PEER_NOTIF to allow get/set of bonding parameter
num_grat_arp via netlink. Bonding parameter num_unsol_na is
synonymous with num_grat_arp, so add only one netlink attribute
to represent both bonding parameters.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_3ad_handle_link_change is called with RTNL only,
and the function will modify the port's information with
no further locking, it will not mutex against bond state
machine and incoming LACPDU which do not hold RTNL, So I
add __get_state_machine_lock to protect the port.
But it is not a critical bug, it exist since day one, and till
now it has never been hit and reported, because changes to
speed is very rare, and will not occur critical problem.
The comments in the function is very old, cleanup it and
add a new pr_debug to debug the port message.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Vosburgh said that the bond_3ad_adapter_duplex_changed is
called with RTNL only, and the function will modify the port's
information with no further locking, it will not mutex against
bond state machine and incoming LACPDU which do not hold RTNL,
So I add __get_state_machine_lock to protect the port.
But it is not a critical bug, it exist since day one, and till
now it has never been hit and reported, because changes to
speed is very rare, and will not occur critical problem.
The comments in the function is very old, cleanup it.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Vosburgh said that the bond_3ad_adapter_speed_changed is
called with RTNL only, and the function will modify the port's
information with no further locking, it will not mutex against
bond state machine and incoming LACPDU which do not hold RTNL,
So I add __get_state_machine_lock to protect the port.
But it is not a critical bug, it exist since day one, and till
now it has never been hit and reported, because changes to
speed is very rare, and will not occur critical problem.
The comment in the function is very old, cleanup it.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_RESEND_IGMP to allow get/set of bonding parameter
resend_igmp via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_XMIT_HASH_POLICY to allow get/set of bonding parameter
xmit_hash_policy via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_FAIL_OVER_MAC to allow get/set of bonding parameter
fail_over_mac via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_PRIMARY_SELECT to allow get/set of bonding parameter
primary_select via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_PRIMARY to allow get/set of bonding parameter
primary via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_resend_igmp_join_requests_delayed() and
bond_resend_igmp_join_requests() should be integrated,
because the bond_resend_igmp_join_requests_delayed() did
nothing except bond_resend_igmp_join_requests().
The bond igmp_retrans could only be changed in bond_change_active_slave
and here, bond_change_active_slave will be called in RTNL and curr_slave_lock,
the bond_resend_igmp_join_requests already hold RTNL, so no need
to free RTNL and hold curr_slave_lock again, it may be a small optimization,
so move the igmp_retrans in RTNL and remove the curr_slave_lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_select_active_slave() will not release and acquire
bond lock, so it is no need to read the bond lock for them,
and the bond_store_primaryxxx() is already in RTNL, so remove the
unwanted lock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_option_active_slave_set() is always called in RTNL,
the RTNL could protect bond slave list, so remove the unwanted
bond lock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_3ad_state_machine_handler() use the bond lock to protect
the bond slave list and slave port together, but it is not enough,
the bond slave list was link and unlink in RTNL, not bond lock,
so I add RCU to protect the slave list from leaving.
The bond lock is still used here, because when the slave has been
removed from the list by the time the state machine runs, it appears
to be possible for both function to manupulate the same aggregator->lag_ports
by finding the aggregator via two different ports that are both members of
that aggregator (i.e., port A of the agg is being unbound, and port B
of the agg is runing its state machine).
If I remove the bond lock, there are nothing to mutex changes
to aggregator->lag_ports between bond_3ad_state_machine_handler and
bond_3ad_unbind_slave, So the bond lock is the simplest way to protect
aggregator->lag_ports.
There was a lot of function need RCU protect, I have two choice
to make the function in RCU-safe, (1) create new similar functions
and make the bond slave list in RCU. (2) modify the existed functions
and make them in read-side critical section, because the RCU
read-side critical sections may be nested.
I choose (2) because it is no need to create more similar functions.
The nots in the function is still too old, clean up the nots.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_change_active_slave() and bond_select_active_slave()
do't need bond lock anymore, so remove the unwanted bond lock
for these two functions.
The bond_select_active_slave() will release and acquire
curr_slave_lock, so the curr_slave_lock need to protect
the function.
In bond enslave and bond release, the bond slave list is also
protected by RTNL, so bond lock is no need to exist, remove
the lock and clean the functions.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_activebackup_arp_mon() use the bond lock for read to
protect the slave list, it is no effect, and the RTNL is only
called for bond_ab_arp_commit() and peer notify, for the performance
better, use RCU to replace with the bond lock, to the bond slave
list need to called in RCU, add a new bond_first_slave_rcu()
to get the first slave in RCU protection.
In bond_ab_arp_probe(), the bond->current_arp_slave may changd
if bond release slave, just like:
bond_ab_arp_probe() bond_release()
cpu 0 cpu 1
...
if (bond->current_arp_slave...) ...
... bond->current_arp_slave = NULl
bond->current_arp_slave->dev->name ...
So the current_arp_slave need to dereference in the section.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_first_slave_rcu() will be used to instead of bond_first_slave()
in rcu_read_lock().
According to the Jay Vosburgh's suggestion, the struct netdev_adjacent
should hide from users who wanted to use it directly. so I package a
new function to get the first slave of the bond.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_loadbalance_arp_mon() use the bond lock to protect the
bond slave list, it is no effect, so I could use RTNL or RCU to
replace it, considering the performance impact, the RCU is more
better here, so the bond lock replace with the RCU.
The bond_select_active_slave() need RTNL and curr_slave_lock
together, but there is no RTNL lock here, so add a rtnl_rtylock.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_alb_monitor use bond lock to protect the bond slave list,
it is no effect here, we need to use RTNL or RCU to replace bond lock,
the bond_alb_monitor will called 10 times one second, RTNL may loss
performance here, so I replace bond lock with RCU to protect the
bond slave list, also the RTNL is preserved, the logic of the monitor
did not changed.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_mii_monitor() still use bond lock to protect bond slave list,
it is no effect, I have 2 way to fix the problem, move the RTNL to the
top of the function, or add RCU to protect the bond slave list,
according to the Jay Vosburgh's opinion, 10 times one second is a
truely big performance loss if use RTNL to protect the whole monitor,
so I would take the advice and use RCU to protect the bond slave list.
The bond_has_slave() will not protect by anything, there will no things
happen if the slave list is be changed, unless the bond was free, but
it will not happened before the monitor, the bond will closed before
be freed.
The peers notify for the bond will calling curr_active_slave, so
derefence the slave to make sure we will accessing the same slave
if the curr_active_slave changed, as the rcu dereference need in
read-side critical sector and bond_change_active_slave() will call
it with no RCU hold, so add peer notify in rcu_read_lock which
will be nested in monitor.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list was no longer protected by bond lock and only
protected by RTNL or RCU, so anywhere that use bond lock to protect
slave list is meaningless.
remove the release and acquire bond lock for bond_select_active_slave().
The curr_active_slave could only be changed in 3 place:
1. enslave slave.
2. release slave.
3. change_active_slave.
all above place were holding bond lock, RTNL and curr_slave_lock
together, it is tedious and meaningless, obviously bond lock is no
need here, but RTNL or curr_slave_lock is needed, so if you want
to access active slave, you have to choose one lock, RTNL or
curr_slave_lock, if RTNL is exist, no need to add curr_slave_lock,
otherwise curr_slave_lock is better, because of the performance.
there are several place calling bond_select_active_slave() and
bond_change_active_slave(), the next step I will clean these place
and remove the no effect lock.
there are some document changed together when update the function.
Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_ARP_ALL_TARGETS to allow get/set of bonding parameter
arp_all_targets via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_ARP_VALIDATE to allow get/set of bonding parameter
arp_validate via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_ARP_IP_TARGET to allow get/set of bonding parameter
arp_ip_target via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_ARP_INTERVAL to allow get/set of bonding parameter
arp_interval via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_USE_CARRIER to allow get/set of bonding parameter
use_carrier via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_DOWNDELAY to allow get/set of bonding parameter
downdelay via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_UPDELAY to allow get/set of bonding parameter
updelay via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BOND_MIIMON to allow get/set of bonding parameter
miimon via netlink.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although rcu_assign_pointer() can be used to assign a constant
NULL pointer, doing so gets you an unnecessary memory barrier and
in some circumstances, sparse warnings. This commit therefore
changes the rcu_assign_pointer() of NULL in __bond_release_one() to
RCU_INIT_POINTER().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Help of this function says: "in_dev: only on this interface, 0=any interface",
but since commit 39a6d06300 ("[NETNS]: Process inet_confirm_addr in the
correct namespace."), the code supposes that it will never be NULL. This
function is never called with in_dev == NULL, but it's exported and may be used
by an external module.
Because this patch restore the ability to call inet_confirm_addr() with in_dev
== NULL, I partially revert the above commit, as suggested by Julian.
CC: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's an issue when showing the value of packets_per_slave due to
using signed integer. The value may be < 0 and thus not put through
reciprocal_value() before showing. This patch makes it use unsigned
integer when showing it.
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Paul Mackerras <paulus@samba.org>
CC: Ian Campbell <ian.campbell@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I install the bonding with the wrong arp_ip_target,
just like arp_ip_target=500.500.500.500, the arp_ip_target
was transfored to 245.245.245.244 and stored in the ip
target success, it is uncorrect, so I add checks to avoid
adding wrong address.
The in4_pton() will set wrong ip address to 0.0.0.0 and
return 0, also use the micro IS_IP_TARGET_UNUSABLE_ADDRESS
to simplify the code.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because the ARP monitoring is not support for 802.3ad, but I still
could change the mode to 802.3ad from ab mode while ARP monitoring
is running, it is incorrect.
So add a check for 802.3ad in bonding_store_mode to fix the problem,
and make a new macro BOND_NO_USES_ARP() to simplify the code.
v2: according to the Dan Williams's suggestion, bond mode is the most
important bond option, it should override any of the other sub-options.
So when the mode is changed, the conficting values should be cleared
or reset, otherwise the user has to duplicate more operations to modify
the logic. I disable the arp and enable mii monitoring when the bond mode
is changed to AB, TB and 8023AD if the arp interval is true.
v3: according to the Nik's suggestion, the default value of miimon should need
a name, there is several place to use it, and the bond_store_arp_interval()
could use micro BOND_NO_USES_ARP to make the code more simpify.
Suggested-by: Dan Williams <dcbw@redhat.com>
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I met a Bug when I add ip target with the wrong ip address:
echo +500.500.500.500 > /sys/class/net/bond0/bonding/arp_ip_target
the wrong ip address will transfor to 245.245.245.244 and add
to the ip target success, it is uncorrect, so I add checks to avoid
adding wrong address.
The in4_pton() will set wrong ip address to 0.0.0.0, it will return by
the next check and will not add to ip target.
v2
According Veaceslav's opinion, simplify the code.
v3
According Veaceslav's opinion, add broadcast check and make a micro
definition to package it.
v4
Solve the problem of the format which David point out.
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes two race conditions between bond_store_updelay/downdelay
and bond_store_miimon which could lead to division by zero as miimon can
be set to 0 while either updelay/downdelay are being set and thus miss the
zero check in the beginning, the zero div happens because updelay/downdelay
are stored as new_value / bond->params.miimon. Use rtnl to synchronize with
miimon setting.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the ARP monitoring is not supported with 802.3ad, and it's
prohibited to use it via the module params.
However we still can set it afterwards via sysfs, cause we only check for
*LB modes there.
To fix this - add a check for 802.3ad mode in bonding_store_arp_interval.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) The addition of nftables. No longer will we need protocol aware
firewall filtering modules, it can all live in userspace.
At the core of nftables is a, for lack of a better term, virtual
machine that executes byte codes to inspect packet or metadata
(arriving interface index, etc.) and make verdict decisions.
Besides support for loading packet contents and comparing them, the
interpreter supports lookups in various datastructures as
fundamental operations. For example sets are supports, and
therefore one could create a set of whitelist IP address entries
which have ACCEPT verdicts attached to them, and use the appropriate
byte codes to do such lookups.
Since the interpreted code is composed in userspace, userspace can
do things like optimize things before giving it to the kernel.
Another major improvement is the capability of atomically updating
portions of the ruleset. In the existing netfilter implementation,
one has to update the entire rule set in order to make a change and
this is very expensive.
Userspace tools exist to create nftables rules using existing
netfilter rule sets, but both kernel implementations will need to
co-exist for quite some time as we transition from the old to the
new stuff.
Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
worked so hard on this.
2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
to our pseudo-random number generator, mostly used for things like
UDP port randomization and netfitler, amongst other things.
In particular the taus88 generater is updated to taus113, and test
cases are added.
3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
and Yang Yingliang.
4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
Sujir.
5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
Neal Cardwell, and Yuchung Cheng.
6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
control message data, much like other socket option attributes.
From Francesco Fusco.
7) Allow applications to specify a cap on the rate computed
automatically by the kernel for pacing flows, via a new
SO_MAX_PACING_RATE socket option. From Eric Dumazet.
8) Make the initial autotuned send buffer sizing in TCP more closely
reflect actual needs, from Eric Dumazet.
9) Currently early socket demux only happens for TCP sockets, but we
can do it for connected UDP sockets too. Implementation from Shawn
Bohrer.
10) Refactor inet socket demux with the goal of improving hash demux
performance for listening sockets. With the main goals being able
to use RCU lookups on even request sockets, and eliminating the
listening lock contention. From Eric Dumazet.
11) The bonding layer has many demuxes in it's fast path, and an RCU
conversion was started back in 3.11, several changes here extend the
RCU usage to even more locations. From Ding Tianhong and Wang
Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
Falico.
12) Allow stackability of segmentation offloads to, in particular, allow
segmentation offloading over tunnels. From Eric Dumazet.
13) Significantly improve the handling of secret keys we input into the
various hash functions in the inet hashtables, TCP fast open, as
well as syncookies. From Hannes Frederic Sowa. The key fundamental
operation is "net_get_random_once()" which uses static keys.
Hannes even extended this to ipv4/ipv6 fragmentation handling and
our generic flow dissector.
14) The generic driver layer takes care now to set the driver data to
NULL on device removal, so it's no longer necessary for drivers to
explicitly set it to NULL any more. Many drivers have been cleaned
up in this way, from Jingoo Han.
15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.
16) Improve CRC32 interfaces and generic SKB checksum iterators so that
SCTP's checksumming can more cleanly be handled. Also from Daniel
Borkmann.
17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
using the interface MTU value. This helps avoid PMTU attacks,
particularly on DNS servers. From Hannes Frederic Sowa.
18) Use generic XPS for transmit queue steering rather than internal
(re-)implementation in virtio-net. From Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
random32: add test cases for taus113 implementation
random32: upgrade taus88 generator to taus113 from errata paper
random32: move rnd_state to linux/random.h
random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
random32: add periodic reseeding
random32: fix off-by-one in seeding requirement
PHY: Add RTL8201CP phy_driver to realtek
xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
macmace: add missing platform_set_drvdata() in mace_probe()
ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
ixgbe: add warning when max_vfs is out of range.
igb: Update link modes display in ethtool
netfilter: push reasm skb through instead of original frag skbs
ip6_output: fragment outgoing reassembled skb properly
MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
net_sched: tbf: support of 64bit rates
ixgbe: deleting dfwd stations out of order can cause null ptr deref
ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
...
This patch aims to extend round-robin mode with a new option called
packets_per_slave which can have the following values and effects:
0 - choose a random slave
1 (default) - standard round-robin, 1 packet per slave
>1 - round-robin when >1 packets have been transmitted per slave
The allowed values are between 0 and 65535.
This patch also fixes the comment style in bond_xmit_roundrobin().
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an extra semi-colon so bond_get_size() doesn't return the
correct value.
Fixes: ec76aa4985 ('bonding: add Netlink support active_slave option')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 4d961a101e, reversing
changes made to a00f6fcc7d.
Revert bond locking changes, they cause regressions and Veaceslav Falico
doesn't like how the commit messages were done at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list may change when the monitor is running, the slave list is no longer
protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
to call call_netdevice_notifiers() in write lock.
2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list may change when the monitor is running, the slave list is no longer
protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
to call call_netdevice_notifiers() in write lock.
2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list may change when the monitor is running, the slave list is no longer
protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
to call call_netdevice_notifiers() in write lock.
2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
so I remove the bond->lock and add the rtnl lock to protect the whole monitor function.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list may change when the monitor is running, the slave list is no longer
protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
to call call_netdevice_notifiers() in write lock.
2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond slave list may change when the monitor is running, the slave list is no longer
protected by bond->lock, only protected by rtnl lock(), so we have 3 ways to modify it:
1.add bond_master_upper_dev_link() and bond_upper_dev_unlink() in bond->lock, but it is unsafe
to call call_netdevice_notifiers() in write lock.
2.remove unused bond->lock for monitor function, only use the existing rtnl lock().
3.use rcu_read_lock() to protect it, of course, it will transform bond_for_each_slave to
bond_for_each_slave_rcu() and performance is better, but in slow path, it is ignored.
so I remove the bond->lock and move the rtnl lock to protect the whole monitor function.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Jiri noted, currently we first do all bonding-specific initialization
(specifically - bond_select_active_slave(bond)) before we actually attach
the slave (so that it becomes visible through bond_for_each_slave() and
friends). This might result in bond_select_active_slave() not seeing the
first/new slave and, thus, not actually selecting an active slave.
Fix this by moving all the bond-related init part after we've actually
completely initialized and linked (via bond_master_upper_dev_link()) the
new slave.
Also, remove the bond_(de/a)ttach_slave(), it's useless to have functions
to ++/-- one int.
After this we have all the initialization of the new slave *before*
linking, and all the stuff that needs to be done on bonding *after* it. It
has also a bonus effect - we can remove the locking on the new slave init
completely, and only use it for bond_select_active_slave().
Reported-by: Jiri Pirko <jiri@resnulli.us>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong@huawei.com
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
no longer needed since bond_option_active_slave_set() can be used
instead.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_for_each_slave() will not be protected by read_lock(),
only protected by rtnl_lock(), so need to replace read_lock()
with rtnl_lock().
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 278b208375
(bonding: initial RCU conversion) has convert the roundrobin,
active-backup, broadcast and xor xmit path to rcu protection,
the performance will be better for these mode, so this time,
convert xmit path for alb mode.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 278b208375
(bonding: initial RCU conversion) has convert the roundrobin,
active-backup, broadcast and xor xmit path to rcu protection,
the performance will be better for these mode, so this time,
convert xmit path for 3ad mode.
Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, in TLB mode we change mac addresses only by memcpy-ing the to
net_device->dev_addr, without actually setting them via
dev_set_mac_address(). This permits us to receive all the traffic always on
one mac address.
However, in case the interface flips, some drivers might enforce the
mac filtering for its FW/HW based on current ->dev_addr, and thus we won't
be able to receive traffic on that interface, in case it will be selected
as active in TLB mode.
Fix it by setting the mac address forcefully on every new active slave that
we select in TLB mode.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Yuval Mintz <yuvalmin@broadcom.com>
Reported-by: Yuval Mintz <yuvalmin@broadcom.com>
Tested-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two new hash policy modes which use skb_flow_dissect:
3 - Encapsulated layer 2+3
4 - Encapsulated layer 3+4
There should be a good improvement for tunnel users in those modes.
It also changes the old hash functions to:
hash ^= (__force u32)flow.dst ^ (__force u32)flow.src;
hash ^= (hash >> 16);
hash ^= (hash >> 8);
Where hash will be initialized either to L2 hash, that is
SRCMAC[5] XOR DSTMAC[5], or to flow->ports which should be extracted
from the upper layer. Flow's dst and src are also extracted based on the
xmit policy either directly from the buffer or by using skb_flow_dissect,
but in both cases if the protocol is IPv6 then dst and src are obtained by
ipv6_addr_hash() on the real addresses. In case of a non-dissectable
packet, the algorithms fall back to L2 hashing.
The bond_set_mode_ops() function is now obsolete and thus deleted
because it was used only to set the proper hash policy. Also we trim a
pointer from struct bonding because we no longer need to keep the hash
function, now there's only a single hash function - bond_xmit_hash that
works based on bond->params.xmit_policy.
The hash function and skb_flow_dissect were suggested by Eric Dumazet.
The layer names were suggested by Andy Gospodarek, because I suck at
semantics.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/usb/qmi_wwan.c
drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
include/net/netfilter/nf_conntrack_synproxy.h
include/net/secure_seq.h
The conflicts are of two varieties:
1) Conflicts with Joe Perches's 'extern' removal from header file
function declarations. Usually it's an argument signature change
or a function being added/removed. The resolutions are trivial.
2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
a new value, another changes an existing value. That sort of
thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we rely on rtnl locking in bond_set_rx_mode(), however it's not
always the case:
RTNL: assertion failed at drivers/net/bonding/bond_main.c (3391)
...
[<ffffffff81651ca5>] dump_stack+0x54/0x74
[<ffffffffa029e717>] bond_set_rx_mode+0xc7/0xd0 [bonding]
[<ffffffff81553af7>] __dev_set_rx_mode+0x57/0xa0
[<ffffffff81557ff8>] __dev_mc_add+0x58/0x70
[<ffffffff81558020>] dev_mc_add+0x10/0x20
[<ffffffff8161e26e>] igmp6_group_added+0x18e/0x1d0
[<ffffffff81186f76>] ? kmem_cache_alloc_trace+0x236/0x260
[<ffffffff8161f80f>] ipv6_dev_mc_inc+0x29f/0x320
[<ffffffff8161f9e7>] ipv6_sock_mc_join+0x157/0x260
...
Fix this by using RCU primitives.
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently grabbed this report:
https://bugzilla.redhat.com/show_bug.cgi?id=1005567
Of an issue in which the bonding driver, with an attached vlan encountered the
following errors when bond0 was taken down and back up:
dummy1: promiscuity touches roof, set promiscuity failed. promiscuity feature of
device might be broken.
The error occurs because, during __bond_release_one, if we release our last
slave, we take on a random mac address and issue a NETDEV_CHANGEADDR
notification. With an attached vlan, the vlan may see that the vlan and bond
mac address were in sync, but no longer are. This triggers a call to dev_uc_add
and dev_set_rx_mode, which enables IFF_PROMISC on the bond device. Then, when
we complete __bond_release_one, we use the current state of the bond flags to
determine if we should decrement the promiscuity of the releasing slave. But
since the bond changed promiscuity state during the release operation, we
incorrectly decrement the slave promisc count when it wasn't in promiscuous mode
to begin with, causing the above error
Fix is pretty simple, just cache the bonding flags at the start of the function
and use those when determining the need to set promiscuity.
This is also needed for the ALLMULTI flag
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Mark Wu <wudxw@linux.vnet.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
Reported-by: Mark Wu <wudxw@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's a forgotten function declaration, which was removed some time ago
already.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no users left, so it's safe to remove.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need the circular loop there and it's the only current user of
bond_next_slave() - so just use the standard bond_for_each_slave().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has no users, so it's safe to remove it completely.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all instances of
for (agg = __get_first_agg(); agg; agg = __get_next_port)
to the standard bond_for_each_slave().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all instances of
for (agg = __get_first_agg(); agg; agg = __get_next_port)
to the standard bond_for_each_slave(). Also, remove the useless checks
before calling bond_3ad_set_carrier() - if we have something NULL - it
would fire long ago, in __get_first/next_port(), per example.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we're relying on suboptimal construct
for (; aggregator; aggregator = __get_next_agg(aggregator)) {
where aggregator is an argument of __get_active_agg() which is _always_ the
first slave's aggregator - judging by all the callers, comments in the
ad_agg_selection_logic() and by logic.
Convert it to use the standard bond_for_each_slave().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ad_port_selection_logic() uses
for (aggregator = __get_first_agg(port); aggregator;
aggregator = __get_next_agg(aggregator)) {
construct, however it's suboptimal, difficult to read and understand.
Change it to a standard bond_for_each_slave(), so that we won't need
__get_first/next_agg() and have it more readable.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we have only one user of it, so it's kind of useless and just
obfusicates things.
Remove it and move the logic to the only user -
bond_3ad_state_machine_handler().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this function is only used in constructs like
for (port = __get_first_port(bond); port; port = __get_next_port(port))
which is basicly the same as
bond_for_each_slave(bond, slave, iter) {
port = &(SLAVE_AD_INFO(slave).port);
but a more time consuming.
Remove the function and convert the users to bond_for_each_slave().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 1f718f0f4f ("bonding: populate
neighbour's private on enslave"), we've moved the unlinking of the slave
to the earliest position possible - so that nobody will see an
half-uninited slave.
However, bond_3ad_unbind_slave() relied that, even while removing the last
slave, it is still accessible - via __get_first_agg() (and, eventually,
bond_first_slave()).
Fix that by verifying if the aggregator return is an actual aggregator, but
not NULL.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 1f718f0f4f ("bonding: populate
neighbour's private on enslave"), we've moved the actual 'linking' in the
end of the function - so that, once linked, the slave is ready to be used,
and is not still in the process of enslaving.
However, 802.3ad verified if it's the first slave by looking at the
if (bond_first_slave(bond) == new_slave)
which, because we've moved the linking to the end, became broken - on the
first slave bond_first_slave(bond) returns NULL.
Fix this by verifying if the prev_slave, that equals bond_last_slave(), is
actually populated - if it is - then it's not the first slave, and vice
versa.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysfs ns (namespace) implementation became more convoluted than
necessary while trying to hide ns information from visible interface.
The relatively recent attr ns support is a good example.
* attr ns tag is determined by sysfs_ops->namespace() callback while
dir tag is determined by kobj_type->namespace(). The placement is
arbitrary.
* Instead of performing operations with explicit ns tag, the namespace
callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(),
class_attr_namespace(), class_attr->namespace(). It's not simpler
in any sense. The only thing this convolution does is traversing
the whole stack backwards.
The namespace callbacks are unncessary because the operations involved
are inherently synchronous. The information can be provided in in
straight-forward top-down direction and reversing that direction is
unnecessary and against basic design principles.
This backward interface is unnecessarily convoluted and hinders
properly separating out sysfs from driver model / kobject for proper
layering. This patch updates attr ns support such that
* sysfs_ops->namespace() and class_attr->namespace() are dropped.
* sysfs_{create|remove}_file_ns(), which take explicit @ns param, are
added and sysfs_{create|remove}_file() are now simple wrappers
around the ns aware functions.
* ns handling is dropped from sysfs_chmod_file(). Nobody uses it at
this point. sysfs_chmod_file_ns() can be added later if necessary.
* Explicit @ns is propagated through class_{create|remove}_file_ns()
and netdev_class_{create|remove}_file_ns().
* driver/net/bonding which is currently the only user of attr
namespace is updated to use netdev_class_{create|remove}_file_ns()
with @bh->net as the ns tag instead of using the namespace callback.
This patch should be an equivalent conversion without any functional
difference. It makes the code easier to follow, reduces lines of code
a bit and helps proper separation and layering.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Also, remove the same functionality from bonding - it will be already done
for any device that links to its lower/upper neighbour.
The links will be created for dev's kobject, and will look like
lower_eth0 for lower device eth0 and upper_bridge0 for upper device
bridge0.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we can have only one master upper neighbour, so it would be
useful to create a symlink to it in the sysfs device directory, the way
that bonding now does it, for every device. Lower devices from
bridge/team/etc will automagically get it, so we could rely on it.
Also, remove the same functionality from bonding.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And all the initialization.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new function __bond_next_slave().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new function, __bond_next_slave(), which uses neighbours to find the
next slave after the slave provided. It will be further used to gradually
go start using neighbour netdev_adjacent infrastructure instead of
bonding's own lists.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't really need it, and it's really hard to RCUify the list->prev.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For that, use netdev_adjacent_get_private(list_head) on bond's lower
neighbour list members. Also, add a small macro - bond_slave_list(bond),
which returns the bond list via neighbour list.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same way as it was used for its own slave_list.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we verify if we have slaves by checking if bond->slave_list is
empty. Create a define bond_has_slaves() and use it, a bit more readable
and easier to change in the future.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It has no users, so we can remove it.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it uses the hard-to-rcuify bond_for_each_slave_from(), and also
it doesn't check every slave for disrepencies between the actual
IS_UP(slave) and the slave->link == BOND_LINK_UP, but only till we find the
next suitable slave.
Fix this by using bond_for_each_slave() and storing the first good slave in
*before till we find the current_arp_slave, after that we store the first good
slave in new_slave. If new_slave is empty - use the slave stored in before,
and if it's also empty - then we didn't find any suitable slave.
Also, in the meanwhile, check for each slave status.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_find_best_slave() does not have to be balanced - i.e. return the slave
that is *after* some other slave, but rather return the best slave that
suits, except of bond->primary_slave - in which case we just return it if
it's suitable.
After that we just look through all the slaves and return either first up
slave or the slave whose link came back earliest.
We also don't care about curr_active_slave lock cause we use it in
bond_should_change_active() only and there we take it right away - i.e. it
won't go away.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we're using bond_for_each_slave_from(), which is really hard to
implement under RCU and/or neighbour list.
Remove it and use bond_for_each_slave() instead, taking care of the last
used slave.
Also, rename next_rx_slave to rx_slave and store the current (last)
rx_slave.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, there are two loops - first we find the first slave in an
aggregator after the xmit_hash_policy() returned number, and after that we
loop from that slave, over bonding head, and till that slave to find any
suitable slave to send the packet through.
Replace it by just one bond_for_each_slave() loop, which first loops
through the requested number of slaves, saving the first suitable one, and
after that we've hit the requested number of slaves to skip - search for
any up slave to send the packet through. If we don't find such kind of
slave - then just send the packet through the first suitable slave found.
Logic remains unchainged, and we skip two loops. Also, refactor it a bit
for readability.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're safe agains removal there, cause we use neighbours primitives.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It needs a list_head *iter, so add it wherever needed. Use both non-rcu and
rcu variants.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only use it in rollback scenarios and can easily use the standart
bond_for_each_dev() instead.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It should be used under rtnl/bonding lock, so use the non-RCU version.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new provided function when attaching the lower slave to populate
its ->private with struct slave *new_slave. Also, move it to the end to
be able to 'find' it only after it was completely initialized, and
deinitialize in the first place on release.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments a fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
running bonding in ALB mode requires that learning packets be sent periodically,
so that the switch knows where to send responding traffic. However, depending
on switch configuration, there may not be any need to send traffic at the
default rate of 3 packets per second, which represents little more than wasted
data. Allow the ALB learning packet interval to be made configurable via sysfs
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Acked-by: Veaceslav Falico <vfalico@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We make bond_arp_rcv global so it can be used in bond_sysfs if the bond
interface is up and arp_interval is being changed to a positive value
and cleared otherwise as per Jay's suggestion.
This also fixes a problem where bond_arp_rcv was set even though
arp_validate was disabled while the bond was up by unsetting recv_probe
in bond_store_arp_validate and respectively setting it if enabled.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to protect store_arp_validate via rtnl because it can race with
mode changing and we can end up having arp_validate set in a mode
different from active-backup.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_compute_features is always called with RTNL held, so we can safely
drop the read bond->lock.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're protected by RTNL so nothing can happen and we can safely drop the
read bond->lock.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can drop the use of bond->lock for mutual exclusion in
bond_3ad_update_lacp_rate and use RTNL in the sysfs store function
instead. This way we'll prevent races with mode change and interface
up/down as well as simplify update_lacp_rate by removing the check for
port->slave because it'll always be initialized (done while enslaving
with RTNL). This change will also help in the future removal of reader
bond->lock from bond_enslave.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't have to release all slaves when closing the bond dev, so remove
the outdated comment and the braces around the left single statement.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch aims to remove a use of the bond->lock for mutual exclusion
which will later allow easier migration to RCU of the users of this
functionality. We use RTNL as a synchronizing mechanism since it's
always held when send_peer_notif is set, and when it is decremented from
the notifier function. We can also drop some locking, and fix the
leakage of the send_peer_notif counter.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Store VID in ->vlan_id (if any), and remove the useless ->tag.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're using it currently to verify if we have vlans before getting the tag
from the skb we're about to send. It's useless because the vlan_get_tag()
verifies if the skb has the tag (and returns an error if not), and we can
receive tagged skbs only if we *already* have vlans.
Plus, the current RCUed implementation is kind of useless anyway - the we
can remove the last vlan in the moment we return from the function.
So remove the only usage of it and the whole function.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They're simply annoying and will spam dmesg constantly if we hit them, so
convert to pr_debug so that we still can access them in case of debugging.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently there are no real users of vlan_list/current_alb_vlan, only the
helpers which maintain them, so remove them.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if there are vlans on top of bond, alb_send_learning_packets()
will never send LPs from the bond itself (i.e. untagged), which might leave
untagged clients unupdated.
Also, the 'circular vlan' logic (i.e. update only MAX_LP_BURST vlans at a
time, and save the last vlan for the next update) is really suboptimal - in
case of lots of vlans it will take a lot of time to update every vlan. It
is also never called in any hot path and sends only a few small packets -
thus the optimization by itself is useless.
So remove the whole current_alb_vlan/MAX_LP_BURST logic from
alb_send_learning_packets(). Instead, we'll first send a packet untagged
and then traverse the upper dev list, sending a tagged packet for each vlan
found. Also, remove the MAX_LP_BURST define - we already don't need it.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create alb_send_lp_vid(), which will handle the skb/lp creation, vlan
tagging and sending, and use it in alb_send_learning_packets().
This way all the logic remains in alb_send_learning_packets(), which
becomes a lot more cleaner and easier to understand.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We always hold the rtnl_lock() in __bond_release_one(), so use
vlan_uses_dev() instead of bond_vlan_used().
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, bond_has_this_ip() is aware only of vlan upper devices, and thus
will return false if the address is associated with the upper bridge or any
other device, and thus will break the arp logic.
Fix this by using the upper device list. For every upper device we verify
if the address associated with it is our address, and if yes - return true.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>