Commit Graph

1054 Commits

Author SHA1 Message Date
Taehee Yoo
544f287b84 bonding: check error value of register_netdevice() immediately
If register_netdevice() is failed, net_device should not be used
because variables are uninitialized or freed.
So, the routine should be stopped immediately.
But, bond_create() doesn't check return value of register_netdevice()
immediately. That will result in a panic because of using uninitialized
or freed memory.

Test commands:
    modprobe netdev-notifier-error-inject
    echo -22 > /sys/kernel/debug/notifier-error-inject/netdev/\
actions/NETDEV_REGISTER/error
    modprobe bonding max_bonds=3

Splat looks like:
[  375.028492][  T193] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[  375.033207][  T193] CPU: 2 PID: 193 Comm: kworker/2:2 Not tainted 5.8.0-rc4+ #645
[  375.036068][  T193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  375.039673][  T193] Workqueue: events linkwatch_event
[  375.041557][  T193] RIP: 0010:dev_activate+0x4a/0x340
[  375.043381][  T193] Code: 40 a8 04 0f 85 db 00 00 00 8b 83 08 04 00 00 85 c0 0f 84 0d 01 00 00 31 d2 89 d0 48 8d 04 40 48 c1 e0 07 48 03 83 00 04 00 00 <48> 8b 48 10 f6 41 10 01 75 08 f0 80 a1 a0 01 00 00 fd 48 89 48 08
[  375.050267][  T193] RSP: 0018:ffff9f8facfcfdd8 EFLAGS: 00010202
[  375.052410][  T193] RAX: 6b6b6b6b6b6b6b6b RBX: ffff9f8fae6ea000 RCX: 0000000000000006
[  375.055178][  T193] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f8fae6ea000
[  375.057762][  T193] RBP: ffff9f8fae6ea000 R08: 0000000000000000 R09: 0000000000000000
[  375.059810][  T193] R10: 0000000000000001 R11: 0000000000000000 R12: ffff9f8facfcfe08
[  375.061892][  T193] R13: ffffffff883587e0 R14: 0000000000000000 R15: ffff9f8fae6ea580
[  375.063931][  T193] FS:  0000000000000000(0000) GS:ffff9f8fbae00000(0000) knlGS:0000000000000000
[  375.066239][  T193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  375.067841][  T193] CR2: 00007f2f542167a0 CR3: 000000012cee6002 CR4: 00000000003606e0
[  375.069657][  T193] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  375.071471][  T193] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  375.073269][  T193] Call Trace:
[  375.074005][  T193]  linkwatch_do_dev+0x4d/0x50
[  375.075052][  T193]  __linkwatch_run_queue+0x10b/0x200
[  375.076244][  T193]  linkwatch_event+0x21/0x30
[  375.077274][  T193]  process_one_work+0x252/0x600
[  375.078379][  T193]  ? process_one_work+0x600/0x600
[  375.079518][  T193]  worker_thread+0x3c/0x380
[  375.080534][  T193]  ? process_one_work+0x600/0x600
[  375.081668][  T193]  kthread+0x139/0x150
[  375.082567][  T193]  ? kthread_park+0x90/0x90
[  375.083567][  T193]  ret_from_fork+0x22/0x30

Fixes: e826eafa65 ("bonding: Call netif_carrier_off after register_netdevice")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:12:21 -07:00
Jarod Wilson
f548a47626 bonding: don't need RTNL for ipsec helpers
The bond_ipsec_* helpers don't need RTNL, and can potentially get called
without it being held, so switch from rtnl_dereference() to
rcu_dereference() to access bond struct data.

Lightly tested with xfrm bonding, no problems found, should address the
syzkaller bug referenced below.

Reported-by: syzbot+582c98032903dcc04816@syzkaller.appspotmail.com
CC: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08 16:09:07 -07:00
Jarod Wilson
5cd24cbe7d bonding: deal with xfrm state in all modes and add more error-checking
It's possible that device removal happens when the bond is in non-AB mode,
and addition happens in AB mode, so bond_ipsec_del_sa() never gets called,
which leaves security associations in an odd state if bond_ipsec_add_sa()
then gets called after switching the bond into AB. Just call add and
delete universally for all modes to keep things consistent.

However, it's also possible that this code gets called when the system is
shutting down, and the xfrm subsystem has already been disconnected from
the bond device, so we need to do some error-checking and bail, lest we
hit a null ptr deref.

Fixes: a3b658cfb6 ("bonding: allow xfrm offload setup post-module-load")
CC: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-08 15:37:14 -07:00
Jarod Wilson
a3b658cfb6 bonding: allow xfrm offload setup post-module-load
At the moment, bonding xfrm crypto offload can only be set up if the bonding
module is loaded with active-backup mode already set. We need to be able to
make this work with bonds set to AB after the bonding driver has already
been loaded.

So what's done here is:

1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used
by both bond_main.c and bond_options.c
2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than
only when loading in AB mode
3) wire up xfrmdev_ops universally too
4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB
5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a
performance hit from traversing into the underlying drivers
5) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call
netdev_change_features() from bond_option_mode_set()

In my local testing, I can change bonding modes back and forth on the fly,
have hardware offload work when I'm in AB, and see no performance penalty
to non-AB software encryption, despite having xfrm bits all wired up for
all modes now.

Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Reported-by: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 15:53:32 -07:00
Nathan Chancellor
18c955b730 bonding: Remove extraneous parentheses in bond_setup
Clang warns:

drivers/net/bonding/bond_main.c:4657:23: warning: equality comparison
with extraneous parentheses [-Wparentheses-equality]
        if ((BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP))
             ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~

drivers/net/bonding/bond_main.c:4681:23: warning: equality comparison
with extraneous parentheses [-Wparentheses-equality]
        if ((BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP))
             ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~

This warning occurs when a comparision has two sets of parentheses,
which is usually the convention for doing an assignment within an
if statement. Since equality comparisons do not need a second set of
parentheses, remove them to fix the warning.

Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Link: https://github.com/ClangBuiltLinux/linux/issues/1066
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: kernelci.org bot <bot@kernelci.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 12:18:40 -07:00
Jarod Wilson
bdfd2d1fa7 bonding/xfrm: use real_dev instead of slave_dev
Rather than requiring every hw crypto capable NIC driver to do a check for
slave_dev being set, set real_dev in the xfrm layer and xso init time, and
then override it in the bonding driver as needed. Then NIC drivers can
always use real_dev, and at the same time, we eliminate the use of a
variable name that probably shouldn't have been used in the first place,
particularly given recent current events.

CC: Boris Pismenny <borisp@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-23 15:19:55 -07:00
Jarod Wilson
18cb261afd bonding: support hardware encryption offload to slaves
Currently, this support is limited to active-backup mode, as I'm not sure
about the feasilibity of mapping an xfrm_state's offload handle to
multiple hardware devices simultaneously, and we rely on being able to
pass some hints to both the xfrm and NIC driver about whether or not
they're operating on a slave device.

I've tested this atop an Intel x520 device (ixgbe) using libreswan in
transport mode, succesfully achieving ~4.3Gbps throughput with netperf
(more or less identical to throughput on a bare NIC in this system),
as well as successful failover and recovery mid-netperf.

v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22 15:38:57 -07:00
Cong Wang
845e0ebb44 net: change addr_list_lock back to static key
The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:

CPU 0:				CPU 1:
(RCU read lock)			(RTNL lock)
dev_mc_seq_show()		netdev_update_lockdep_key()
				  -> lockdep_unregister_key()
 -> netif_addr_lock_bh()

because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.

In commit 1a33e10e4a ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68fc2
("net: core: add generic lockdep keys").

This patch reverts the rest and also part of commit f3b0a18bb6
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().

And hopefully this reduces some syzbot lockdep noises too.

Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-09 12:59:45 -07:00
Saeed Mahameed
76cd622fe2 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge includes updates to bonding driver needed for the rdma stack,
to avoid conflicts with the RDMA branch.

Maor Gottlieb Says:

====================
Bonding: Add support to get xmit slave

The following series adds support to get the LAG master xmit slave by
introducing new .ndo - ndo_get_xmit_slave. Every LAG module can
implement it and it first implemented in the bond driver.
This is follow-up to the RFC discussion [1].

The main motivation for doing this is for drivers that offload part
of the LAG functionality. For example, Mellanox Connect-X hardware
implements RoCE LAG which selects the TX affinity when the resources
are created and port is remapped when it goes down.

The first part of this patchset introduces the new .ndo and add the
support to the bonding module.

The second part adds support to get the RoCE LAG xmit slave by building
skb of the RoCE packet based on the AH attributes and call to the new
.ndo.

The third part change the mlx5 driver driver to set the QP's affinity
port according to the slave which found by the .ndo.
====================

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-09 01:05:30 -07:00
Eric Dumazet
ae46f184bc bonding: propagate transmit status
Currently, bonding always returns NETDEV_TX_OK to its caller.

It is worth trying to be more accurate : TCP for instance
can have different recovery strategies if it can have more
precise status, if packet was dropped by slave qdisc.

This is especially important when host is under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Cong Wang
e7511f560f bonding: remove useless stats_lock_key
After commit b3e80d44f5
("bonding: fix lockdep warning in bond_get_stats()") the dynamic
key is no longer necessary, as we compute nest level at run-time.
So, we can just remove it to save some lockdep key entries.

Test commands:
 ip link add bond0 type bond
 ip link add bond1 type bond
 ip link set bond0 master bond1
 ip link set bond0 nomaster
 ip link set bond1 master bond0

Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 12:05:56 -07:00
Cong Wang
1a33e10e4a net: partially revert dynamic lockdep key changes
This patch reverts the folowing commits:

commit 064ff66e2b
"bonding: add missing netdev_update_lockdep_key()"

commit 53d374979e
"net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"

commit 1f26c0d3d2
"net: fix kernel-doc warning in <linux/netdevice.h>"

commit ab92d68fc2
"net: core: add generic lockdep keys"

but keeps the addr_list_lock_key because we still lock
addr_list_lock nestedly on stack devices, unlikely xmit_lock
this is safe because we don't take addr_list_lock on any fast
path.

Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 12:05:56 -07:00
Maor Gottlieb
33720aaf8c bonding: Implement ndo_get_xmit_slave
Add implementation of ndo_get_xmit_slave. Find the slave by using the
helper function according to the bond mode. If the caller set all_slaves
to true, then it assumes that all slaves are available to transmit.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:38 -07:00
Maor Gottlieb
6b447e76ed bonding: Add array of all slaves
Keep all slaves in array so it could be used to get the xmit slave
assume all the slaves are active.
The logic to add slave to the array is like the usable slaves, except
that we also add slaves that currently can't transmit - not up or active.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:38 -07:00
Maor Gottlieb
5a19f1c1a2 bonding: Add function to get the xmit slave in active-backup mode
Add helper function to get the xmit slave in active-backup mode.
It's only one line function that return the curr_active_slave,
but it will used both in the xmit flow and by the new .ndo to get
the xmit slave.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:38 -07:00
Maor Gottlieb
29d5bbccb3 bonding: Add helper function to get the xmit slave in rr mode
Add helper function to get the xmit slave when bond is in round
robin mode. Change bond_xmit_slave_id to bond_get_slave_by_id, then
the logic for find the next slave for transmit could be used
both by the xmit flow and the .ndo to get the xmit slave.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Maor Gottlieb
c071d91d2a bonding: Add helper function to get the xmit slave based on hash
Both xor and 802.3ad modes use bond_xmit_hash to get the xmit slave.
Export the logic to helper function so it could be used in the
following patches by the .ndo to get the xmit slave.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Maor Gottlieb
ed7d4f023b bonding: Rename slave_arr to usable_slaves
Rename slave_arr to usable_slaves, since we will have two arrays,
one for the usable slaves and the other to all slaves.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Maor Gottlieb
119d48fd42 bonding: Export skip slave logic to function
As a preparation for following change that add array of
all slaves, extract code that skip slave to function.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Leon Romanovsky
2b526b56e3 net/bond: Delete driver and module versions
The in-kernel code has already unique version, which is based
on Linus's tag, update the bond driver to be consistent with that
version.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:23:36 -08:00
David S. Miller
e65ee2fb54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflict resolution of ice_virtchnl_pf.c based upon work by
Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 13:39:34 -08:00
Julian Wiedmann
2e92a2d0e4 net: use netif_is_bridge_port() to check for IFF_BRIDGE_PORT
Trivial cleanup, so that all bridge port-specific code can be found in
one go.

CC: Johannes Berg <johannes@sipsolutions.net>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:10:32 -08:00
Taehee Yoo
b3e80d44f5 bonding: fix lockdep warning in bond_get_stats()
In the "struct bonding", there is stats_lock.
This lock protects "bond_stats" in the "struct bonding".
bond_stats is updated in the bond_get_stats() and this function would be
executed concurrently. So, the lock is needed.

Bonding interfaces would be nested.
So, either stats_lock should use dynamic lockdep class key or stats_lock
should be used by spin_lock_nested(). In the current code, stats_lock is
using a dynamic lockdep class key.
But there is no updating stats_lock_key routine So, lockdep warning
will occur.

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond0 master bond1
    ip link set bond0 nomaster
    ip link set bond1 master bond0

Splat looks like:
[   38.420603][  T957] 5.5.0+ #394 Not tainted
[   38.421074][  T957] ------------------------------------------------------
[   38.421837][  T957] ip/957 is trying to acquire lock:
[   38.422399][  T957] ffff888063262cd8 (&bond->stats_lock_key#2){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[   38.423528][  T957]
[   38.423528][  T957] but task is already holding lock:
[   38.424526][  T957] ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[   38.426075][  T957]
[   38.426075][  T957] which lock already depends on the new lock.
[   38.426075][  T957]
[   38.428536][  T957]
[   38.428536][  T957] the existing dependency chain (in reverse order) is:
[   38.429475][  T957]
[   38.429475][  T957] -> #1 (&bond->stats_lock_key){+.+.}:
[   38.430273][  T957]        _raw_spin_lock+0x30/0x70
[   38.430812][  T957]        bond_get_stats+0x90/0x4d0 [bonding]
[   38.431451][  T957]        dev_get_stats+0x1ec/0x270
[   38.432088][  T957]        bond_get_stats+0x1a5/0x4d0 [bonding]
[   38.432767][  T957]        dev_get_stats+0x1ec/0x270
[   38.433322][  T957]        rtnl_fill_stats+0x44/0xbe0
[   38.433866][  T957]        rtnl_fill_ifinfo+0xeb2/0x3720
[   38.434474][  T957]        rtmsg_ifinfo_build_skb+0xca/0x170
[   38.435081][  T957]        rtmsg_ifinfo_event.part.33+0x1b/0xb0
[   38.436848][  T957]        rtnetlink_event+0xcd/0x120
[   38.437455][  T957]        notifier_call_chain+0x90/0x160
[   38.438067][  T957]        netdev_change_features+0x74/0xa0
[   38.438708][  T957]        bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[   38.439522][  T957]        bond_enslave+0x3639/0x47b0 [bonding]
[   38.440225][  T957]        do_setlink+0xaab/0x2ef0
[   38.440786][  T957]        __rtnl_newlink+0x9c5/0x1270
[   38.441463][  T957]        rtnl_newlink+0x65/0x90
[   38.442075][  T957]        rtnetlink_rcv_msg+0x4a8/0x890
[   38.442774][  T957]        netlink_rcv_skb+0x121/0x350
[   38.443451][  T957]        netlink_unicast+0x42e/0x610
[   38.444282][  T957]        netlink_sendmsg+0x65a/0xb90
[   38.444992][  T957]        ____sys_sendmsg+0x5ce/0x7a0
[   38.445679][  T957]        ___sys_sendmsg+0x10f/0x1b0
[   38.446365][  T957]        __sys_sendmsg+0xc6/0x150
[   38.447007][  T957]        do_syscall_64+0x99/0x4f0
[   38.447668][  T957]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   38.448538][  T957]
[   38.448538][  T957] -> #0 (&bond->stats_lock_key#2){+.+.}:
[   38.449554][  T957]        __lock_acquire+0x2d8d/0x3de0
[   38.450148][  T957]        lock_acquire+0x164/0x3b0
[   38.450711][  T957]        _raw_spin_lock+0x30/0x70
[   38.451292][  T957]        bond_get_stats+0x90/0x4d0 [bonding]
[   38.451950][  T957]        dev_get_stats+0x1ec/0x270
[   38.452425][  T957]        bond_get_stats+0x1a5/0x4d0 [bonding]
[   38.453362][  T957]        dev_get_stats+0x1ec/0x270
[   38.453825][  T957]        rtnl_fill_stats+0x44/0xbe0
[   38.454390][  T957]        rtnl_fill_ifinfo+0xeb2/0x3720
[   38.456257][  T957]        rtmsg_ifinfo_build_skb+0xca/0x170
[   38.456998][  T957]        rtmsg_ifinfo_event.part.33+0x1b/0xb0
[   38.459351][  T957]        rtnetlink_event+0xcd/0x120
[   38.460086][  T957]        notifier_call_chain+0x90/0x160
[   38.460829][  T957]        netdev_change_features+0x74/0xa0
[   38.461752][  T957]        bond_compute_features.isra.45+0x4e6/0x6f0 [bonding]
[   38.462705][  T957]        bond_enslave+0x3639/0x47b0 [bonding]
[   38.463476][  T957]        do_setlink+0xaab/0x2ef0
[   38.464141][  T957]        __rtnl_newlink+0x9c5/0x1270
[   38.464897][  T957]        rtnl_newlink+0x65/0x90
[   38.465522][  T957]        rtnetlink_rcv_msg+0x4a8/0x890
[   38.466215][  T957]        netlink_rcv_skb+0x121/0x350
[   38.466895][  T957]        netlink_unicast+0x42e/0x610
[   38.467583][  T957]        netlink_sendmsg+0x65a/0xb90
[   38.468285][  T957]        ____sys_sendmsg+0x5ce/0x7a0
[   38.469202][  T957]        ___sys_sendmsg+0x10f/0x1b0
[   38.469884][  T957]        __sys_sendmsg+0xc6/0x150
[   38.470587][  T957]        do_syscall_64+0x99/0x4f0
[   38.471245][  T957]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   38.472093][  T957]
[   38.472093][  T957] other info that might help us debug this:
[   38.472093][  T957]
[   38.473438][  T957]  Possible unsafe locking scenario:
[   38.473438][  T957]
[   38.474898][  T957]        CPU0                    CPU1
[   38.476234][  T957]        ----                    ----
[   38.480171][  T957]   lock(&bond->stats_lock_key);
[   38.480808][  T957]                                lock(&bond->stats_lock_key#2);
[   38.481791][  T957]                                lock(&bond->stats_lock_key);
[   38.482754][  T957]   lock(&bond->stats_lock_key#2);
[   38.483416][  T957]
[   38.483416][  T957]  *** DEADLOCK ***
[   38.483416][  T957]
[   38.484505][  T957] 3 locks held by ip/957:
[   38.485048][  T957]  #0: ffffffffbccf6230 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x457/0x890
[   38.486198][  T957]  #1: ffff888065fd2cd8 (&bond->stats_lock_key){+.+.}, at: bond_get_stats+0x90/0x4d0 [bonding]
[   38.487625][  T957]  #2: ffffffffbc9254c0 (rcu_read_lock){....}, at: bond_get_stats+0x5/0x4d0 [bonding]
[   38.488897][  T957]
[   38.488897][  T957] stack backtrace:
[   38.489646][  T957] CPU: 1 PID: 957 Comm: ip Not tainted 5.5.0+ #394
[   38.490497][  T957] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   38.492810][  T957] Call Trace:
[   38.493219][  T957]  dump_stack+0x96/0xdb
[   38.493709][  T957]  check_noncircular+0x371/0x450
[   38.494344][  T957]  ? lookup_address+0x60/0x60
[   38.494923][  T957]  ? print_circular_bug.isra.35+0x310/0x310
[   38.495699][  T957]  ? hlock_class+0x130/0x130
[   38.496334][  T957]  ? __lock_acquire+0x2d8d/0x3de0
[   38.496979][  T957]  __lock_acquire+0x2d8d/0x3de0
[   38.497607][  T957]  ? register_lock_class+0x14d0/0x14d0
[   38.498333][  T957]  ? check_chain_key+0x236/0x5d0
[   38.499003][  T957]  lock_acquire+0x164/0x3b0
[   38.499800][  T957]  ? bond_get_stats+0x90/0x4d0 [bonding]
[   38.500706][  T957]  _raw_spin_lock+0x30/0x70
[   38.501435][  T957]  ? bond_get_stats+0x90/0x4d0 [bonding]
[   38.502311][  T957]  bond_get_stats+0x90/0x4d0 [bonding]
[ ... ]

But, there is another problem.
The dynamic lockdep class key is protected by RTNL, but bond_get_stats()
would be called outside of RTNL.
So, it would use an invalid dynamic lockdep class key.

In order to fix this issue, stats_lock uses spin_lock_nested() instead of
a dynamic lockdep key.
The bond_get_stats() calls bond_get_lowest_level_rcu() to get the correct
nest level value, which will be used by spin_lock_nested().
The "dev->lower_level" indicates lower nest level value, but this value
is invalid outside of RTNL.
So, bond_get_lowest_level_rcu() returns valid lower nest level value in
the RCU critical section.
bond_get_lowest_level_rcu() will be work only when LOCKDEP is enabled.

Fixes: 089bca2cae ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:32:11 -08:00
Taehee Yoo
064ff66e2b bonding: add missing netdev_update_lockdep_key()
After bond_release(), netdev_update_lockdep_key() should be called.
But both ioctl path and attribute path don't call
netdev_update_lockdep_key().
This patch adds missing netdev_update_lockdep_key().

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ifenslave bond0 bond1
    ifenslave -d bond0 bond1
    ifenslave bond1 bond0

Splat looks like:
[   29.501182][ T1046] WARNING: possible circular locking dependency detected
[   29.501945][ T1039] hardirqs last disabled at (1962): [<ffffffffac6c807f>] handle_mm_fault+0x13f/0x700
[   29.503442][ T1046] 5.5.0+ #322 Not tainted
[   29.503447][ T1046] ------------------------------------------------------
[   29.504277][ T1039] softirqs last  enabled at (1180): [<ffffffffade00678>] __do_softirq+0x678/0x981
[   29.505443][ T1046] ifenslave/1046 is trying to acquire lock:
[   29.505886][ T1039] softirqs last disabled at (1169): [<ffffffffac19c18a>] irq_exit+0x17a/0x1a0
[   29.509997][ T1046] ffff88805d5da280 (&dev->addr_list_lock_key#3){+...}, at: dev_mc_sync_multiple+0x95/0x120
[   29.511243][ T1046]
[   29.511243][ T1046] but task is already holding lock:
[   29.512192][ T1046] ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding]
[   29.514124][ T1046]
[   29.514124][ T1046] which lock already depends on the new lock.
[   29.514124][ T1046]
[   29.517297][ T1046]
[   29.517297][ T1046] the existing dependency chain (in reverse order) is:
[   29.518231][ T1046]
[   29.518231][ T1046] -> #1 (&dev->addr_list_lock_key#4){+...}:
[   29.519076][ T1046]        _raw_spin_lock+0x30/0x70
[   29.519588][ T1046]        dev_mc_sync_multiple+0x95/0x120
[   29.520208][ T1046]        bond_enslave+0x448d/0x47b0 [bonding]
[   29.520862][ T1046]        bond_option_slaves_set+0x1a3/0x370 [bonding]
[   29.521640][ T1046]        __bond_opt_set+0x1ff/0xbb0 [bonding]
[   29.522438][ T1046]        __bond_opt_set_notify+0x2b/0xf0 [bonding]
[   29.523251][ T1046]        bond_opt_tryset_rtnl+0x92/0xf0 [bonding]
[   29.524082][ T1046]        bonding_sysfs_store_option+0x8a/0xf0 [bonding]
[   29.524959][ T1046]        kernfs_fop_write+0x276/0x410
[   29.525620][ T1046]        vfs_write+0x197/0x4a0
[   29.526218][ T1046]        ksys_write+0x141/0x1d0
[   29.526818][ T1046]        do_syscall_64+0x99/0x4f0
[   29.527430][ T1046]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   29.528265][ T1046]
[   29.528265][ T1046] -> #0 (&dev->addr_list_lock_key#3){+...}:
[   29.529272][ T1046]        __lock_acquire+0x2d8d/0x3de0
[   29.529935][ T1046]        lock_acquire+0x164/0x3b0
[   29.530638][ T1046]        _raw_spin_lock+0x30/0x70
[   29.531187][ T1046]        dev_mc_sync_multiple+0x95/0x120
[   29.531790][ T1046]        bond_enslave+0x448d/0x47b0 [bonding]
[   29.532451][ T1046]        bond_option_slaves_set+0x1a3/0x370 [bonding]
[   29.533163][ T1046]        __bond_opt_set+0x1ff/0xbb0 [bonding]
[   29.533789][ T1046]        __bond_opt_set_notify+0x2b/0xf0 [bonding]
[   29.534595][ T1046]        bond_opt_tryset_rtnl+0x92/0xf0 [bonding]
[   29.535500][ T1046]        bonding_sysfs_store_option+0x8a/0xf0 [bonding]
[   29.536379][ T1046]        kernfs_fop_write+0x276/0x410
[   29.537057][ T1046]        vfs_write+0x197/0x4a0
[   29.537640][ T1046]        ksys_write+0x141/0x1d0
[   29.538251][ T1046]        do_syscall_64+0x99/0x4f0
[   29.538870][ T1046]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   29.539659][ T1046]
[   29.539659][ T1046] other info that might help us debug this:
[   29.539659][ T1046]
[   29.540953][ T1046]  Possible unsafe locking scenario:
[   29.540953][ T1046]
[   29.541883][ T1046]        CPU0                    CPU1
[   29.542540][ T1046]        ----                    ----
[   29.543209][ T1046]   lock(&dev->addr_list_lock_key#4);
[   29.543880][ T1046]                                lock(&dev->addr_list_lock_key#3);
[   29.544873][ T1046]                                lock(&dev->addr_list_lock_key#4);
[   29.545863][ T1046]   lock(&dev->addr_list_lock_key#3);
[   29.546525][ T1046]
[   29.546525][ T1046]  *** DEADLOCK ***
[   29.546525][ T1046]
[   29.547542][ T1046] 5 locks held by ifenslave/1046:
[   29.548196][ T1046]  #0: ffff88806044c478 (sb_writers#5){.+.+}, at: vfs_write+0x3bb/0x4a0
[   29.549248][ T1046]  #1: ffff88805af00890 (&of->mutex){+.+.}, at: kernfs_fop_write+0x1cf/0x410
[   29.550343][ T1046]  #2: ffff88805b8b54b0 (kn->count#157){.+.+}, at: kernfs_fop_write+0x1f2/0x410
[   29.551575][ T1046]  #3: ffffffffaecf4cf0 (rtnl_mutex){+.+.}, at: bond_opt_tryset_rtnl+0x5f/0xf0 [bonding]
[   29.552819][ T1046]  #4: ffff8880460f2280 (&dev->addr_list_lock_key#4){+...}, at: bond_enslave+0x4482/0x47b0 [bonding]
[   29.554175][ T1046]
[   29.554175][ T1046] stack backtrace:
[   29.554907][ T1046] CPU: 0 PID: 1046 Comm: ifenslave Not tainted 5.5.0+ #322
[   29.555854][ T1046] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   29.557064][ T1046] Call Trace:
[   29.557504][ T1046]  dump_stack+0x96/0xdb
[   29.558054][ T1046]  check_noncircular+0x371/0x450
[   29.558723][ T1046]  ? print_circular_bug.isra.35+0x310/0x310
[   29.559486][ T1046]  ? hlock_class+0x130/0x130
[   29.560100][ T1046]  ? __lock_acquire+0x2d8d/0x3de0
[   29.560761][ T1046]  __lock_acquire+0x2d8d/0x3de0
[   29.561366][ T1046]  ? register_lock_class+0x14d0/0x14d0
[   29.562045][ T1046]  ? find_held_lock+0x39/0x1d0
[   29.562641][ T1046]  lock_acquire+0x164/0x3b0
[   29.563199][ T1046]  ? dev_mc_sync_multiple+0x95/0x120
[   29.563872][ T1046]  _raw_spin_lock+0x30/0x70
[   29.564464][ T1046]  ? dev_mc_sync_multiple+0x95/0x120
[   29.565146][ T1046]  dev_mc_sync_multiple+0x95/0x120
[   29.565793][ T1046]  bond_enslave+0x448d/0x47b0 [bonding]
[   29.566487][ T1046]  ? bond_update_slave_arr+0x940/0x940 [bonding]
[   29.567279][ T1046]  ? bstr_printf+0xc20/0xc20
[   29.567857][ T1046]  ? stack_trace_consume_entry+0x160/0x160
[   29.568614][ T1046]  ? deactivate_slab.isra.77+0x2c5/0x800
[   29.569320][ T1046]  ? check_chain_key+0x236/0x5d0
[   29.569939][ T1046]  ? sscanf+0x93/0xc0
[   29.570442][ T1046]  ? vsscanf+0x1e20/0x1e20
[   29.571003][ T1046]  bond_option_slaves_set+0x1a3/0x370 [bonding]
[ ... ]

Fixes: ab92d68fc2 ("net: core: add generic lockdep keys")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:32:11 -08:00
Mahesh Bandewar
5d485ed88d bonding: fix active-backup transition after link failure
After the recent fix in commit 1899bb3251 ("bonding: fix state
transition issue in link monitoring"), the active-backup mode with
miimon initially come-up fine but after a link-failure, both members
transition into backup state.

Following steps to reproduce the scenario (eth1 and eth2 are the
slaves of the bond):

    ip link set eth1 up
    ip link set eth2 down
    sleep 1
    ip link set eth2 up
    ip link set eth1 down
    cat /sys/class/net/eth1/bonding_slave/state
    cat /sys/class/net/eth2/bonding_slave/state

Fixes: 1899bb3251 ("bonding: fix state transition issue in link monitoring")
CC: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-14 16:22:34 -08:00
Eric Dumazet
9e99bfefdb bonding: fix bond_neigh_init()
1) syzbot reported an uninit-value in bond_neigh_setup() [1]

 bond_neigh_setup() uses a temporary on-stack 'struct neigh_parms parms',
 but only clears parms.neigh_setup field.

 A stacked bonding device would then enter bond_neigh_setup()
 and read garbage from parms->dev.

 If we get really unlucky and garbage is matching @dev, then we
 could recurse and eventually crash.

 Let's make sure the whole structure is cleared to avoid surprises.

2) bond_neigh_setup() can be called while another cpu manipulates
 the master device, removing or adding a slave.
 We need at least rcu protection to prevent use-after-free.

Note: Prior code does not support a stack of bonding devices,
      this patch does not attempt to fix this, and leave a comment instead.

[1]

BUG: KMSAN: uninit-value in bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655
CPU: 0 PID: 11256 Comm: syz-executor.0 Not tainted 5.4.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
 __msan_warning+0x57/0xa0 mm/kmsan/kmsan_instr.c:245
 bond_neigh_setup+0xa4/0x110 drivers/net/bonding/bond_main.c:3655
 bond_neigh_init+0x216/0x4b0 drivers/net/bonding/bond_main.c:3626
 ___neigh_create+0x169e/0x2c40 net/core/neighbour.c:613
 __neigh_create+0xbd/0xd0 net/core/neighbour.c:674
 ip6_finish_output2+0x149a/0x2670 net/ipv6/ip6_output.c:113
 __ip6_finish_output+0x83d/0x8f0 net/ipv6/ip6_output.c:142
 ip6_finish_output+0x2db/0x420 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 mld_sendpack+0xebd/0x13d0 net/ipv6/mcast.c:1682
 mld_send_cr net/ipv6/mcast.c:1978 [inline]
 mld_ifc_timer_expire+0x116b/0x1680 net/ipv6/mcast.c:2477
 call_timer_fn+0x232/0x530 kernel/time/timer.c:1404
 expire_timers kernel/time/timer.c:1449 [inline]
 __run_timers+0xd60/0x1270 kernel/time/timer.c:1773
 run_timer_softirq+0x2d/0x50 kernel/time/timer.c:1786
 __do_softirq+0x4a1/0x83a kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:375 [inline]
 irq_exit+0x230/0x280 kernel/softirq.c:416
 exiting_irq+0xe/0x10 arch/x86/include/asm/apic.h:536
 smp_apic_timer_interrupt+0x48/0x70 arch/x86/kernel/apic/apic.c:1138
 apic_timer_interrupt+0x2e/0x40 arch/x86/entry/entry_64.S:835
 </IRQ>
RIP: 0010:kmsan_free_page+0x18d/0x1c0 mm/kmsan/kmsan_shadow.c:439
Code: 4c 89 ff 44 89 f6 e8 82 0d ee ff 65 ff 0d 9f 26 3b 60 65 8b 05 98 26 3b 60 85 c0 75 24 e8 5b f6 35 ff 4c 89 6d d0 ff 75 d0 9d <48> 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 0b 0f
RSP: 0018:ffffb328034af818 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffffe2d7471f8360 RCX: 0000000000000000
RDX: ffffffffadea7000 RSI: 0000000000000004 RDI: ffff93496fcda104
RBP: ffffb328034af850 R08: ffff934a47e86d00 R09: ffff93496fc41900
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000246 R14: 0000000000000000 R15: ffffe2d7472225c0
 free_pages_prepare mm/page_alloc.c:1138 [inline]
 free_pcp_prepare mm/page_alloc.c:1230 [inline]
 free_unref_page_prepare+0x1d9/0x770 mm/page_alloc.c:3025
 free_unref_page mm/page_alloc.c:3074 [inline]
 free_the_page mm/page_alloc.c:4832 [inline]
 __free_pages+0x154/0x230 mm/page_alloc.c:4840
 __vunmap+0xdac/0xf20 mm/vmalloc.c:2277
 __vfree mm/vmalloc.c:2325 [inline]
 vfree+0x7c/0x170 mm/vmalloc.c:2355
 copy_entries_to_user net/ipv6/netfilter/ip6_tables.c:883 [inline]
 get_entries net/ipv6/netfilter/ip6_tables.c:1041 [inline]
 do_ip6t_get_ctl+0xfa4/0x1030 net/ipv6/netfilter/ip6_tables.c:1709
 nf_sockopt net/netfilter/nf_sockopt.c:104 [inline]
 nf_getsockopt+0x481/0x4e0 net/netfilter/nf_sockopt.c:122
 ipv6_getsockopt+0x264/0x510 net/ipv6/ipv6_sockglue.c:1400
 tcp_getsockopt+0x1c6/0x1f0 net/ipv4/tcp.c:3688
 sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3110
 __sys_getsockopt+0x533/0x7b0 net/socket.c:2129
 __do_sys_getsockopt net/socket.c:2144 [inline]
 __se_sys_getsockopt+0xe1/0x100 net/socket.c:2141
 __x64_sys_getsockopt+0x62/0x80 net/socket.c:2141
 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45d20a
Code: b8 34 01 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 8d 8b fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 37 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 6a 8b fb ff c3 66 0f 1f 84 00 00 00 00 00
RSP: 002b:0000000000a6f618 EFLAGS: 00000212 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 0000000000a6f640 RCX: 000000000045d20a
RDX: 0000000000000041 RSI: 0000000000000029 RDI: 0000000000000003
RBP: 0000000000717cc0 R08: 0000000000a6f63c R09: 0000000000004000
R10: 0000000000a6f740 R11: 0000000000000212 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000029 R15: 0000000000715b00

Local variable description: ----parms@bond_neigh_init
Variable was created at:
 bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617
 bond_neigh_init+0x8c/0x4b0 drivers/net/bonding/bond_main.c:3617

Fixes: 9918d5bf32 ("bonding: modify only neigh_parms owned by us")
Fixes: 234bcf8a49 ("net/bonding: correctly proxy slave neigh param setup ndo function")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 09:49:07 -08:00
Eric Dumazet
f394722fb0 neighbour: remove neigh_cleanup() method
neigh_cleanup() has not been used for seven years, and was a wrong design.

Messing with shared pointer in bond_neigh_init() without proper
memory barriers would at least trigger syzbot complains eventually.

It is time to remove this stuff.

Fixes: b63b70d877 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 09:48:47 -08:00
Matteo Croce
df98be06c9 bonding: symmetric ICMP transmit
A bonding with layer2+3 or layer3+4 hashing uses the IP addresses and the ports
to balance packets between slaves. With some network errors, we receive an ICMP
error packet by the remote host or a router. If sent by a router, the source IP
can differ from the remote host one. Additionally the ICMP protocol has no port
numbers, so a layer3+4 bonding will get a different hash than the previous one.
These two conditions could let the packet go through a different interface than
the other packets of the same flow:

    # tcpdump -qltnni veth0 |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 |sed 's/^/1: /' &
    # hping3 -2 192.168.0.2 -p 9
    0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36

An ICMP error packet contains the header of the packet which caused the network
error, so inspect it and match the flow against it, so we can send the ICMP via
the same interface of the previous packet in the flow.
Move the IP and port dissect code into a generic function bond_flow_ip() and if
we are dissecting an ICMP error packet, call it again with the adjusted offset.

    # hping3 -2 192.168.0.2 -p 9
    1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
    1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
    0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
    0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
    0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16 13:02:53 -08:00
David S. Miller
14684b9301 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 11:04:37 -08:00
Jay Vosburgh
1899bb3251 bonding: fix state transition issue in link monitoring
Since de77ecd4ef ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.

	Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding.  This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.

	The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN.  The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.

	Resolve this by combining the two variables into one.

Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Reported-by: Sha Zhang <zhangsha.zhang@huawei.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Fixes: de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:40:16 -08:00
David S. Miller
d31e95585c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 13:54:56 -07:00
Matteo Croce
58deb77cc5 bonding: balance ICMP echoes in layer3+4 mode
The bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are sent all to the same device:

    # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64

But some ICMP packets have an Identifier field which is
used to match packets within sessions, let's use this value in the hash
function to balance these packets between bond slaves:

    # ping -qc1 192.168.0.2
    0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64

Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:

    # ping -q 192.168.1.2 -c1
    0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
    # ping -q 192.168.1.2 -c1
    1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Taehee Yoo
ad9bd8daf2 bonding: fix using uninitialized mode_lock
When a bonding interface is being created, it setups its mode and options.
At that moment, it uses mode_lock so mode_lock should be initialized
before that moment.

rtnl_newlink()
	rtnl_create_link()
		alloc_netdev_mqs()
			->setup() //bond_setup()
	->newlink //bond_newlink
		bond_changelink()
		register_netdevice()
			->ndo_init() //bond_init()

After commit 089bca2cae ("bonding: use dynamic lockdep key instead of
subclass"), mode_lock is initialized in bond_init().
So in the bond_changelink(), un-initialized mode_lock can be used.
mode_lock should be initialized in bond_setup().
This patch partially reverts commit 089bca2cae ("bonding: use dynamic
lockdep key instead of subclass")

Test command:
    ip link add bond0 type bond mode 802.3ad lacp_rate 0

Splat looks like:
[   60.615127] INFO: trying to register non-static key.
[   60.615900] the code is fine but needs lockdep annotation.
[   60.616697] turning off the locking correctness validator.
[   60.617490] CPU: 1 PID: 957 Comm: ip Not tainted 5.4.0-rc3+ #109
[   60.618350] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   60.619481] Call Trace:
[   60.619918]  dump_stack+0x7c/0xbb
[   60.620453]  register_lock_class+0x1215/0x14d0
[   60.621131]  ? alloc_netdev_mqs+0x7b3/0xcc0
[   60.621771]  ? is_bpf_text_address+0x86/0xf0
[   60.622416]  ? is_dynamic_key+0x230/0x230
[   60.623032]  ? unwind_get_return_address+0x5f/0xa0
[   60.623757]  ? create_prof_cpu_mask+0x20/0x20
[   60.624408]  ? arch_stack_walk+0x83/0xb0
[   60.625023]  __lock_acquire+0xd8/0x3de0
[   60.625616]  ? stack_trace_save+0x82/0xb0
[   60.626225]  ? stack_trace_consume_entry+0x160/0x160
[   60.626957]  ? deactivate_slab.isra.80+0x2c5/0x800
[   60.627668]  ? register_lock_class+0x14d0/0x14d0
[   60.628380]  ? alloc_netdev_mqs+0x7b3/0xcc0
[   60.629020]  ? save_stack+0x69/0x80
[   60.629574]  ? save_stack+0x19/0x80
[   60.630121]  ? __kasan_kmalloc.constprop.4+0xa0/0xd0
[   60.630859]  ? __kmalloc_node+0x16f/0x480
[   60.631472]  ? alloc_netdev_mqs+0x7b3/0xcc0
[   60.632121]  ? rtnl_create_link+0x2ed/0xad0
[   60.634388]  ? __rtnl_newlink+0xad4/0x11b0
[   60.635024]  lock_acquire+0x164/0x3b0
[   60.635608]  ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[   60.636463]  _raw_spin_lock_bh+0x38/0x70
[   60.637084]  ? bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[   60.637930]  bond_3ad_update_lacp_rate+0x91/0x200 [bonding]
[   60.638753]  ? bond_3ad_lacpdu_recv+0xb30/0xb30 [bonding]
[   60.639552]  ? bond_opt_get_val+0x180/0x180 [bonding]
[   60.640307]  ? ___slab_alloc+0x5aa/0x610
[   60.640925]  bond_option_lacp_rate_set+0x71/0x140 [bonding]
[   60.641751]  __bond_opt_set+0x1ff/0xbb0 [bonding]
[   60.643217]  ? kasan_unpoison_shadow+0x30/0x40
[   60.643924]  bond_changelink+0x9a4/0x1700 [bonding]
[   60.644653]  ? memset+0x1f/0x40
[   60.742941]  ? bond_slave_changelink+0x1a0/0x1a0 [bonding]
[   60.752694]  ? alloc_netdev_mqs+0x8ea/0xcc0
[   60.753330]  ? rtnl_create_link+0x2ed/0xad0
[   60.753964]  bond_newlink+0x1e/0x60 [bonding]
[   60.754612]  __rtnl_newlink+0xb9f/0x11b0
[ ... ]

Reported-by: syzbot+8da67f407bcba2c72e6e@syzkaller.appspotmail.com
Reported-by: syzbot+0d083911ab18b710da71@syzkaller.appspotmail.com
Fixes: 089bca2cae ("bonding: use dynamic lockdep key instead of subclass")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:03:36 -07:00
Taehee Yoo
f3b0a18bb6 net: remove unnecessary variables and callback
This patch removes variables and callback these are related to the nested
device structure.
devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables are added and
they replace old private nest_level variable.
So, this patch removes all 'nest_level' variables.

In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
to get lockdep subclass value, which is actually lower nested depth value.
But now, they use the dynamic lockdep key to avoid lockdep warning instead
of the subclass.
So, this patch removes ->ndo_get_lock_subclass() callback.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:49 -07:00
Taehee Yoo
089bca2cae bonding: use dynamic lockdep key instead of subclass
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.

Test commands:
    ip link add bond0 type bond

    for i in {1..5}
    do
	    let A=$i-1
	    ip link add bond$i type bond
	    ip link set bond$i master bond$A
    done
    ip link set bond5 master bond0

Splat looks like:
[  307.992912] WARNING: possible recursive locking detected
[  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
[  307.994367] --------------------------------------------
[  307.995092] ip/761 is trying to acquire lock:
[  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.997045]
	       but task is already holding lock:
[  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.999215]
	       other info that might help us debug this:
[  308.000251]  Possible unsafe locking scenario:

[  308.001137]        CPU0
[  308.001533]        ----
[  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.003302]
		*** DEADLOCK ***

[  308.004310]  May be due to missing lock nesting notation

[  308.005319] 3 locks held by ip/761:
[  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[  308.009422]
	       stack backtrace:
[  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
[  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  308.012179] Call Trace:
[  308.012601]  dump_stack+0x7c/0xbb
[  308.013089]  __lock_acquire+0x269d/0x3de0
[  308.013669]  ? register_lock_class+0x14d0/0x14d0
[  308.014318]  lock_acquire+0x164/0x3b0
[  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.015520]  _raw_spin_lock_nested+0x2e/0x60
[  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
[  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
[  308.019710]  ? rcu_read_lock_held+0x90/0xa0
[  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
[  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
[  308.021953]  dev_get_stats+0x1ec/0x270
[  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]

Fixes: d3fff6c443 ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:48 -07:00
Taehee Yoo
65de65d903 bonding: fix unexpected IFF_BONDING bit unset
The IFF_BONDING means bonding master or bonding slave device.
->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets
IFF_BONDING flag.

bond0<--bond1

Both bond0 and bond1 are bonding device and these should keep having
IFF_BONDING flag until they are removed.
But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
do not check whether the slave device is the bonding type or not.
This patch adds the interface type check routine before removing
IFF_BONDING flag.

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond1 master bond0
    ip link set bond1 nomaster
    ip link del bond1 type bond
    ip link add bond1 type bond

Splat looks like:
[  226.665555] proc_dir_entry 'bonding/bond1' already registered
[  226.666440] WARNING: CPU: 0 PID: 737 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
[  226.667571] Modules linked in: bonding af_packet sch_fq_codel ip_tables x_tables unix
[  226.668662] CPU: 0 PID: 737 Comm: ip Not tainted 5.4.0-rc3+ #96
[  226.669508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  226.670652] RIP: 0010:proc_register+0x2a9/0x3e0
[  226.671612] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 a0 0b 14 9f 48 8b b0 e
0 00 00 00 e8 07 e7 88 ff <0f> 0b 48 c7 c7 40 2d a5 9f e8 59 d6 23 01 48 8b 4c 24 10 48 b8 00
[  226.675007] RSP: 0018:ffff888050e17078 EFLAGS: 00010282
[  226.675761] RAX: dffffc0000000008 RBX: ffff88805fdd0f10 RCX: ffffffff9dd344e2
[  226.676757] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806c9f6b8c
[  226.677751] RBP: ffff8880507160f3 R08: ffffed100d940019 R09: ffffed100d940019
[  226.678761] R10: 0000000000000001 R11: ffffed100d940018 R12: ffff888050716008
[  226.679757] R13: ffff8880507160f2 R14: dffffc0000000000 R15: ffffed100a0e2c1e
[  226.680758] FS:  00007fdc217cc0c0(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000
[  226.681886] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  226.682719] CR2: 00007f49313424d0 CR3: 0000000050e46001 CR4: 00000000000606f0
[  226.683727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  226.684725] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  226.685681] Call Trace:
[  226.687089]  proc_create_seq_private+0xb3/0xf0
[  226.687778]  bond_create_proc_entry+0x1b3/0x3f0 [bonding]
[  226.691458]  bond_netdev_event+0x433/0x970 [bonding]
[  226.692139]  ? __module_text_address+0x13/0x140
[  226.692779]  notifier_call_chain+0x90/0x160
[  226.693401]  register_netdevice+0x9b3/0xd80
[  226.694010]  ? alloc_netdev_mqs+0x854/0xc10
[  226.694629]  ? netdev_change_features+0xa0/0xa0
[  226.695278]  ? rtnl_create_link+0x2ed/0xad0
[  226.695849]  bond_newlink+0x2a/0x60 [bonding]
[  226.696422]  __rtnl_newlink+0xb9f/0x11b0
[  226.696968]  ? rtnl_link_unregister+0x220/0x220
[ ... ]

Fixes: 0b680e7537 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:48 -07:00
Taehee Yoo
ab92d68fc2 net: core: add generic lockdep keys
Some interface types could be nested.
(VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
These interface types should set lockdep class because, without lockdep
class key, lockdep always warn about unexisting circular locking.

In the current code, these interfaces have their own lockdep class keys and
these manage itself. So that there are so many duplicate code around the
/driver/net and /net/.
This patch adds new generic lockdep keys and some helper functions for it.

This patch does below changes.
a) Add lockdep class keys in struct net_device
   - qdisc_running, xmit, addr_list, qdisc_busylock
   - these keys are used as dynamic lockdep key.
b) When net_device is being allocated, lockdep keys are registered.
   - alloc_netdev_mqs()
c) When net_device is being free'd llockdep keys are unregistered.
   - free_netdev()
d) Add generic lockdep key helper function
   - netdev_register_lockdep_key()
   - netdev_unregister_lockdep_key()
   - netdev_update_lockdep_key()
e) Remove unnecessary generic lockdep macro and functions
f) Remove unnecessary lockdep code of each interfaces.

After this patch, each interface modules don't need to maintain
their lockdep keys.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:48 -07:00
Eric Dumazet
a7137534b5 bonding: fix potential NULL deref in bond_update_slave_arr
syzbot got a NULL dereference in bond_update_slave_arr() [1],
happening after a failure to allocate bond->slave_arr

A workqueue (bond_slave_arr_handler) is supposed to retry
the allocation later, but if the slave is removed before
the workqueue had a chance to complete, bond->slave_arr
can still be NULL.

[1]

Failed to build slave-array.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
Modules linked in:
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:bond_update_slave_arr.cold+0xc6/0x198 drivers/net/bonding/bond_main.c:4039
RSP: 0018:ffff88018fe33678 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000290b000
RDX: 0000000000000000 RSI: ffffffff82b63037 RDI: ffff88019745ea20
RBP: ffff88018fe33760 R08: ffff880170754280 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88019745ea00 R14: 0000000000000000 R15: ffff88018fe338b0
FS:  00007febd837d700(0000) GS:ffff8801dad00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004540a0 CR3: 00000001c242e005 CR4: 00000000001626f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff82b5b45e>] __bond_release_one+0x43e/0x500 drivers/net/bonding/bond_main.c:1923
 [<ffffffff82b5b966>] bond_release drivers/net/bonding/bond_main.c:2039 [inline]
 [<ffffffff82b5b966>] bond_do_ioctl+0x416/0x870 drivers/net/bonding/bond_main.c:3562
 [<ffffffff83ae25f4>] dev_ifsioc+0x6f4/0x940 net/core/dev_ioctl.c:328
 [<ffffffff83ae2e58>] dev_ioctl+0x1b8/0xc70 net/core/dev_ioctl.c:495
 [<ffffffff83995ffd>] sock_do_ioctl+0x1bd/0x300 net/socket.c:1088
 [<ffffffff83996a80>] sock_ioctl+0x300/0x5d0 net/socket.c:1196
 [<ffffffff81b124db>] vfs_ioctl fs/ioctl.c:47 [inline]
 [<ffffffff81b124db>] file_ioctl fs/ioctl.c:501 [inline]
 [<ffffffff81b124db>] do_vfs_ioctl+0xacb/0x1300 fs/ioctl.c:688
 [<ffffffff81b12dc6>] SYSC_ioctl fs/ioctl.c:705 [inline]
 [<ffffffff81b12dc6>] SyS_ioctl+0xb6/0xe0 fs/ioctl.c:696
 [<ffffffff8101ccc8>] do_syscall_64+0x528/0x770 arch/x86/entry/common.c:305
 [<ffffffff84400091>] entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: ee63771474 ("bonding: Simplify the xmit function for modes that use xmit_hash")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 16:07:27 -07:00
YueHaibing
d595b03de2 bonding: Add vlan tx offload to hw_enc_features
As commit 30d8177e8a ("bonding: Always enable vlan tx offload")
said, we should always enable bonding's vlan tx offload, pass the
vlan packets to the slave devices with vlan tci, let them to handle
vlan implementation.

Now if encapsulation protocols like VXLAN is used, skb->encapsulation
may be set, then the packet is passed to vlan device which based on
bonding device. However in netif_skb_features(), the check of
hw_enc_features:

	 if (skb->encapsulation)
                 features &= dev->hw_enc_features;

clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results
in same issue in commit 30d8177e8a like this:

vlan_dev_hard_start_xmit
  -->dev_queue_xmit
    -->validate_xmit_skb
      -->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared
      -->validate_xmit_vlan
        -->__vlan_hwaccel_push_inside //skb->tci is cleared
...
 --> bond_start_xmit
   --> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34
     --> __skb_flow_dissect // nhoff point to IP header
        -->  case htons(ETH_P_8021Q)
             // skb_vlan_tag_present is false, so
             vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
             //vlan point to ip header wrongly

Fixes: b2a103e6d0 ("bonding: convert to ndo_fix_features")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 22:01:44 -07:00
Thomas Falcon
12185dfe44 bonding: Force slave speed check after link state recovery for 802.3ad
The following scenario was encountered during testing of logical
partition mobility on pseries partitions with bonded ibmvnic
adapters in LACP mode.

1. Driver receives a signal that the device has been
   swapped, and it needs to reset to initialize the new
   device.

2. Driver reports loss of carrier and begins initialization.

3. Bonding driver receives NETDEV_CHANGE notifier and checks
   the slave's current speed and duplex settings. Because these
   are unknown at the time, the bond sets its link state to
   BOND_LINK_FAIL and handles the speed update, clearing
   AD_PORT_LACP_ENABLE.

4. Driver finishes recovery and reports that the carrier is on.

5. Bond receives a new notification and checks the speed again.
   The speeds are valid but miimon has not altered the link
   state yet.  AD_PORT_LACP_ENABLE remains off.

Because the slave's link state is still BOND_LINK_FAIL,
no further port checks are made when it recovers. Though
the slave devices are operational and have valid speed
and duplex settings, the bond will not send LACPDU's. The
simplest fix I can see is to force another speed check
in bond_miimon_commit. This way the bond will update
AD_PORT_LACP_ENABLE if needed when transitioning from
BOND_LINK_FAIL to BOND_LINK_UP.

CC: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22 12:09:32 -07:00
David S. Miller
af144a9834 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two cases of overlapping changes, nothing fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:48:57 -07:00
Vincent Bernat
07a4ddec3c bonding: add an option to specify a delay between peer notifications
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.

Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.

When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:

    20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28

In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.

iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.

Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 12:30:48 -07:00
Cong Wang
9d1bc24b52 bonding: validate ip header before check IPPROTO_IGMP
bond_xmit_roundrobin() checks for IGMP packets but it parses
the IP header even before checking skb->protocol.

We should validate the IP header with pskb_may_pull() before
using iph->protocol.

Reported-and-tested-by: syzbot+e5be16aa39ad6e755391@syzkaller.appspotmail.com
Fixes: a2fd940f4c ("bonding: fix broken multicast with round-robin mode")
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 13:26:12 -07:00
Eric Dumazet
b8bd72d317 bonding/main: fix NULL dereference in bond_select_active_slave()
A bonding master can be up while best_slave is NULL.

[12105.636318] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[12105.638204] mlx4_en: eth1: Linkstate event 1 -> 1
[12105.648984] IP: bond_select_active_slave+0x125/0x250
[12105.653977] PGD 0 P4D 0
[12105.656572] Oops: 0000 [#1] SMP PTI
[12105.660487] gsmi: Log Shutdown Reason 0x03
[12105.664620] Modules linked in: kvm_intel loop act_mirred uhaul vfat fat stg_standard_ftl stg_megablocks stg_idt stg_hdi stg elephant_dev_num stg_idt_eeprom w1_therm wire i2c_mux_pca954x i2c_mux mlx4_i2c i2c_usb cdc_acm ehci_pci ehci_hcd i2c_iimc mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [last unloaded: kvm_intel]
[12105.685686] mlx4_core 0000:03:00.0: dispatching link up event for port 2
[12105.685700] mlx4_en: eth2: Linkstate event 2 -> 1
[12105.685700] mlx4_en: eth2: Link Up (linkstate)
[12105.724452] Workqueue: bond0 bond_mii_monitor
[12105.728854] RIP: 0010:bond_select_active_slave+0x125/0x250
[12105.734355] RSP: 0018:ffffaf146a81fd88 EFLAGS: 00010246
[12105.739637] RAX: 0000000000000003 RBX: ffff8c62b03c6900 RCX: 0000000000000000
[12105.746838] RDX: 0000000000000000 RSI: ffffaf146a81fd08 RDI: ffff8c62b03c6000
[12105.754054] RBP: ffffaf146a81fdb8 R08: 0000000000000001 R09: ffff8c517d387600
[12105.761299] R10: 00000000001075d9 R11: ffffffffaceba92f R12: 0000000000000000
[12105.768553] R13: ffff8c8240ae4800 R14: 0000000000000000 R15: 0000000000000000
[12105.775748] FS:  0000000000000000(0000) GS:ffff8c62bfa40000(0000) knlGS:0000000000000000
[12105.783892] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12105.789716] CR2: 0000000000000000 CR3: 0000000d0520e001 CR4: 00000000001626f0
[12105.796976] Call Trace:
[12105.799446]  [<ffffffffac31d387>] bond_mii_monitor+0x497/0x6f0
[12105.805317]  [<ffffffffabd42643>] process_one_work+0x143/0x370
[12105.811225]  [<ffffffffabd42c7a>] worker_thread+0x4a/0x360
[12105.816761]  [<ffffffffabd48bc5>] kthread+0x105/0x140
[12105.821865]  [<ffffffffabd42c30>] ? rescuer_thread+0x380/0x380
[12105.827757]  [<ffffffffabd48ac0>] ? kthread_associate_blkcg+0xc0/0xc0
[12105.834266]  [<ffffffffac600241>] ret_from_fork+0x51/0x60

Fixes: e2a7420df2 ("bonding/main: convert to using slave printk macros")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02 15:14:48 -07:00
David S. Miller
d96ff269a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.

In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.

The aquantia driver conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27 21:06:39 -07:00
YueHaibing
30d8177e8a bonding: Always enable vlan tx offload
We build vlan on top of bonding interface, which vlan offload
is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is
BOND_XMIT_POLICY_ENCAP34.

Because vlan tx offload is off, vlan tci is cleared and skb push
the vlan header in validate_xmit_vlan() while sending from vlan
devices. Then in bond_xmit_hash, __skb_flow_dissect() fails to
get information from protocol headers encapsulated within vlan,
because 'nhoff' is points to IP header, so bond hashing is based
on layer 2 info, which fails to distribute packets across slaves.

This patch always enable bonding's vlan tx offload, pass the vlan
packets to the slave devices with vlan tci, let them to handle
vlan implementation.

Fixes: 278339a42a ("bonding: propogate vlan_features to bonding master")
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26 08:56:35 -07:00
Jarod Wilson
e2a7420df2 bonding/main: convert to using slave printk macros
All of these printk instances benefit from having both master and slave
device information included, so convert to using a standardized macro
format and remove redundant information.

Suggested-by: Joe Perches <joe@perches.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:36:01 -07:00
Jarod Wilson
f43b653026 bonding: fix error messages in bond_do_fail_over_mac
Passing the bond name again to debug output when referencing slave is wrong.
We're trying to set the bond's MAC to that of the new_active slave, so adjust
the error message slightly and pass in the slave's name, not the bond's.
Then we're trying to set the MAC on the old active slave, but putting the
new active slave's name in the output. While we're at it, clarify the
error messages so you know which one actually triggered.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:36:01 -07:00
Jarod Wilson
75466dce4d bonding: improve event debug usability
Seeing bonding debug log data along the lines of "event: 5" is a bit spartan,
and often requires a lookup table if you don't remember what every event is.
Make use of netdev_cmd_to_name for an improved debugging experience, so for
the prior example, you'll see: "bond_netdev_event received NETDEV_REGISTER"
instead (both are prefixed with the device for which the event pertains).

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:36:01 -07:00
Ariel Levkovich
2e770b507c net: bonding: Inherit MPLS features from slave devices
When setting the bonding interface net device features,
the kernel code doesn't address the slaves' MPLS features
and doesn't inherit them.

Therefore, HW offloads that enhance performance such as
checksumming and TSO are disabled for MPLS tagged traffic
flowing via the bonding interface.

The patch add the inheritance of the MPLS features from the
slave devices with a similar logic to setting the bonding device's
VLAN and encapsulation features.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:49:38 -07:00
Jarod Wilson
334031219a bonding/802.3ad: fix slave link initialization transition states
Once in a while, with just the right timing, 802.3ad slaves will fail to
properly initialize, winding up in a weird state, with a partner system
mac address of 00:00:00:00:00:00. This started happening after a fix to
properly track link_failure_count tracking, where an 802.3ad slave that
reported itself as link up in the miimon code, but wasn't able to get a
valid speed/duplex, started getting set to BOND_LINK_FAIL instead of
BOND_LINK_DOWN. That was the proper thing to do for the general "my link
went down" case, but has created a link initialization race that can put
the interface in this odd state.

The simple fix is to instead set the slave link to BOND_LINK_DOWN again,
if the link has never been up (last_link_up == 0), so the link state
doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed
in this case, it simply hasn't been up yet, and this prevents the
unnecessary state change from DOWN to FAIL and getting stuck in an init
failure w/o a partner mac.

Fixes: ea53abfab9 ("bonding/802.3ad: fix link_failure_count tracking")
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Tested-by: Heesoon Kim <Heesoon.Kim@stratus.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 13:28:23 -07:00
David S. Miller
6b0a7f84ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict resolution of af_smc.c from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 11:26:25 -07:00
Sabrina Dubroca
92480b3977 bonding: fix event handling for stacked bonds
When a bond is enslaved to another bond, bond_netdev_event() only
handles the event as if the bond is a master, and skips treating the
bond as a slave.

This leads to a refcount leak on the slave, since we don't remove the
adjacency to its master and the master holds a reference on the slave.

Reproducer:
  ip link add bondL type bond
  ip link add bondU type bond
  ip link set bondL master bondU
  ip link del bondL

No "Fixes:" tag, this code is older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:22:09 -07:00
Paolo Abeni
a350eccee5 net: remove 'fallback' argument from dev->ndo_select_queue()
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.

We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.

Tested with ixgbe and CONFIG_FCOE=m

With pktgen using queue xmit:
threads		vanilla 	patched
		(kpps)		(kpps)
1		2334		2428
2		4166		4278
4		7895		8100

 v1 -> v2:
 - rebased after helper's name change

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20 11:18:55 -07:00
Florian Fainelli
47f706262f net: Remove switchdev.h inclusion from team/bond/vlan
This is no longer necessary after eca59f6915 ("net: Remove support for bridge bypass ndos from stacked devices")

Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 17:40:46 -08:00
Michal Soltys
3c963a3306 bonding: fix PACKET_ORIGDEV regression
This patch fixes a subtle PACKET_ORIGDEV regression which was a side
effect of fixes introduced by:

6a9e461f6f bonding: pass link-local packets to bonding master also.

... to:

b89f04c61e bonding: deliver link-local packets with skb->dev set to link that packets arrived on

While 6a9e461f6f restored pre-b89f04c61efe presence of link-local
packets on bonding masters (which is required e.g. by linux bridges
participating in spanning tree or needed for lab-like setups created
with group_fwd_mask) it also caused the originating device
information to be lost due to cloning.

Maciej Żenczykowski proposed another solution that doesn't require
packet cloning and retains original device information - instead of
returning RX_HANDLER_PASS for all link-local packets it's now limited
only to packets from inactive slaves.

At the same time, packets passed to bonding masters retain correct
information about the originating device and PACKET_ORIGDEV can be used
to determine it.

This elegantly solves all issues so far:

- link-local packets that were removed from bonding masters
- LLDP daemons being forced to explicitly bind to slave interfaces
- PACKET_ORIGDEV having no effect on bond interfaces

Fixes: 6a9e461f6f (bonding: pass link-local packets to bonding master also.)
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 13:20:08 -08:00
Willem de Bruijn
001e465f09 bonding: update nest level on unlink
A network device stack with multiple layers of bonding devices can
trigger a false positive lockdep warning. Adding lockdep nest levels
fixes this. Update the level on both enslave and unlink, to avoid the
following series of events ..

    ip netns add test
    ip netns exec test bash
    ip link set dev lo addr 00:11:22:33:44:55
    ip link set dev lo down

    ip link add dev bond1 type bond
    ip link add dev bond2 type bond

    ip link set dev lo master bond1
    ip link set dev bond1 master bond2

    ip link set dev bond1 nomaster
    ip link set dev bond2 master bond1

.. from still generating a splat:

    [  193.652127] ======================================================
    [  193.658231] WARNING: possible circular locking dependency detected
    [  193.664350] 4.20.0 #8 Not tainted
    [  193.668310] ------------------------------------------------------
    [  193.674417] ip/15577 is trying to acquire lock:
    [  193.678897] 00000000a40e3b69 (&(&bond->stats_lock)->rlock#3/3){+.+.}, at: bond_get_stats+0x58/0x290
    [  193.687851]
    	       but task is already holding lock:
    [  193.693625] 00000000807b9d9f (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0x58/0x290

    [..]

    [  193.851092]        lock_acquire+0xa7/0x190
    [  193.855138]        _raw_spin_lock_nested+0x2d/0x40
    [  193.859878]        bond_get_stats+0x58/0x290
    [  193.864093]        dev_get_stats+0x5a/0xc0
    [  193.868140]        bond_get_stats+0x105/0x290
    [  193.872444]        dev_get_stats+0x5a/0xc0
    [  193.876493]        rtnl_fill_stats+0x40/0x130
    [  193.880797]        rtnl_fill_ifinfo+0x6c5/0xdc0
    [  193.885271]        rtmsg_ifinfo_build_skb+0x86/0xe0
    [  193.890091]        rtnetlink_event+0x5b/0xa0
    [  193.894320]        raw_notifier_call_chain+0x43/0x60
    [  193.899225]        netdev_change_features+0x50/0xa0
    [  193.904044]        bond_compute_features.isra.46+0x1ab/0x270
    [  193.909640]        bond_enslave+0x141d/0x15b0
    [  193.913946]        do_set_master+0x89/0xa0
    [  193.918016]        do_setlink+0x37c/0xda0
    [  193.921980]        __rtnl_newlink+0x499/0x890
    [  193.926281]        rtnl_newlink+0x48/0x70
    [  193.930238]        rtnetlink_rcv_msg+0x171/0x4b0
    [  193.934801]        netlink_rcv_skb+0xd1/0x110
    [  193.939103]        rtnetlink_rcv+0x15/0x20
    [  193.943151]        netlink_unicast+0x3b5/0x520
    [  193.947544]        netlink_sendmsg+0x2fd/0x3f0
    [  193.951942]        sock_sendmsg+0x38/0x50
    [  193.955899]        ___sys_sendmsg+0x2ba/0x2d0
    [  193.960205]        __x64_sys_sendmsg+0xad/0x100
    [  193.964687]        do_syscall_64+0x5a/0x460
    [  193.968823]        entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 7e2556e400 ("bonding: avoid lockdep confusion in bond_get_stats()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-10 16:49:39 -05:00
Petr Machata
1caf40dec1 net: bonding: Issue NETDEV_PRE_CHANGEADDR
Give interested parties an opportunity to veto an impending HW address
change.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-13 18:41:39 -08:00
Petr Machata
b924591428 net: bonding: Give bond_set_dev_addr() a return value
Before NETDEV_CHANGEADDR, bond driver should emit NETDEV_PRE_CHANGEADDR,
and allow consumers to veto the address change. To propagate further the
return code from NETDEV_PRE_CHANGEADDR, give the function that
implements address change a return value.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-13 18:41:39 -08:00
Petr Machata
3a37a9636c net: dev: Add extack argument to dev_set_mac_address()
A follow-up patch will add a notifier type NETDEV_PRE_CHANGEADDR, which
allows vetoing of MAC address changes. One prominent path to that
notification is through dev_set_mac_address(). Therefore give this
function an extack argument, so that it can be packed together with the
notification. Thus a textual reason for rejection (or a warning) can be
communicated back to the user.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-13 18:41:38 -08:00
Petr Machata
00f54e6892 net: core: dev: Add extack argument to dev_open()
In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. One prominent API through which the notification is
invoked is dev_open().

Therefore extend dev_open() with and extra extack argument and update
all users. Most of the calls end up just encoding NULL, but bond and
team drivers have the extack readily available.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:06 -08:00
Jarod Wilson
ea53abfab9 bonding/802.3ad: fix link_failure_count tracking
Commit 4d2c0cda07 set slave->link to
BOND_LINK_DOWN for 802.3ad bonds whenever invalid speed/duplex values
were read, to fix a problem with slaves getting into weird states, but
in the process, broke tracking of link failures, as going straight to
BOND_LINK_DOWN when a link is indeed down (cable pulled, switch rebooted)
means we broke out of bond_miimon_inspect()'s BOND_LINK_DOWN case because
!link_state was already true, we never incremented commit, and never got
a chance to call bond_miimon_commit(), where slave->link_failure_count
would be incremented. I believe the simple fix here is to mark the slave
as BOND_LINK_FAIL, and let bond_miimon_inspect() transition the link from
_FAIL to either _UP or _DOWN, and in the latter case, we now get proper
incrementing of link_failure_count again.

Fixes: 4d2c0cda07 ("bonding: speed/duplex update at NETDEV_UP event")
CC: Mahesh Bandewar <maheshb@google.com>
CC: David S. Miller <davem@davemloft.net>
CC: netdev@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-04 16:44:44 -08:00
Debabrata Banerjee
c9fbd71f73 netpoll: allow cleanup to be synchronous
This fixes a problem introduced by:
commit 2cde6acd49 ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")

When using netconsole on a bond, __netpoll_cleanup can asynchronously
recurse multiple times, each __netpoll_free_async call can result in
more __netpoll_free_async's. This means there is now a race between
cleanup_work queues on multiple netpoll_info's on multiple devices and
the configuration of a new netpoll. For example if a netconsole is set
to enable 0, reconfigured, and enable 1 immediately, this netconsole
will likely not work.

Given the reason for __netpoll_free_async is it can be called when rtnl
is not locked, if it is locked, we should be able to execute
synchronously. It appears to be locked everywhere it's called from.

Generalize the design pattern from the teaming driver for current
callers of __netpoll_free_async.

CC: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 17:01:43 -07:00
Mahesh Bandewar
0f3b914c9c bonding: fix warning message
RX queue config for bonding master could be different from its slave
device(s). With the commit 6a9e461f6f ("bonding: pass link-local
packets to bonding master also."), the packet is reinjected into stack
with skb->dev as bonding master. This potentially triggers the
message:

   "bondX received packet on queue Y, but number of RX queues is Z"

whenever the queue that packet is received on is higher than the
numrxqueues on bonding master (Y > Z).

Fixes: 6a9e461f6f ("bonding: pass link-local packets to bonding master also.")
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02 15:55:16 -07:00
Mahesh Bandewar
d4859d749a bonding: avoid possible dead-lock
Syzkaller reported this on a slightly older kernel but it's still
applicable to the current kernel -

======================================================
WARNING: possible circular locking dependency detected
4.18.0-next-20180823+ #46 Not tainted
------------------------------------------------------
syz-executor4/26841 is trying to acquire lock:
00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652

but task is already holding lock:
00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline]
00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (rtnl_mutex){+.+.}:
       __mutex_lock_common kernel/locking/mutex.c:925 [inline]
       __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
       rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
       bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline]
       bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320
       process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153
       worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
       kthread+0x35a/0x420 kernel/kthread.c:246
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

-> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}:
       process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129
       worker_thread+0x189/0x13c0 kernel/workqueue.c:2296
       kthread+0x35a/0x420 kernel/kthread.c:246
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415

-> #0 ((wq_completion)bond_dev->name){+.+.}:
       lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
       flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
       drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
       destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
       __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
       bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
       register_netdevice+0x337/0x1100 net/core/dev.c:8410
       bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
       rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
       rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
       netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
       sock_sendmsg_nosec net/socket.c:622 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:632
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
       __sys_sendmsg+0x11d/0x290 net/socket.c:2153
       __do_sys_sendmsg net/socket.c:2162 [inline]
       __se_sys_sendmsg net/socket.c:2160 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock((work_completion)(&(&nnw->work)->work));
                               lock(rtnl_mutex);
  lock((wq_completion)bond_dev->name);

 *** DEADLOCK ***

1 lock held by syz-executor4/26841:

stack backtrace:
CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ #46
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222
 check_prev_add kernel/locking/lockdep.c:1862 [inline]
 check_prevs_add kernel/locking/lockdep.c:1975 [inline]
 validate_chain kernel/locking/lockdep.c:2416 [inline]
 __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412
 lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
 flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655
 drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820
 destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155
 __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138
 bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734
 register_netdevice+0x337/0x1100 net/core/dev.c:8410
 bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453
 rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099
 rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:632
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115
 __sys_sendmsg+0x11d/0x290 net/socket.c:2153
 __do_sys_sendmsg net/socket.c:2162 [inline]
 __se_sys_sendmsg net/socket.c:2160 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457089
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089
RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26 20:22:19 -07:00
Mahesh Bandewar
6a9e461f6f bonding: pass link-local packets to bonding master also.
Commit b89f04c61e ("bonding: deliver link-local packets with
skb->dev set to link that packets arrived on") changed the behavior
of how link-local-multicast packets are processed. The change in
the behavior broke some legacy use cases where these packets are
expected to arrive on bonding master device also.

This patch passes the packet to the stack with the link it arrived
on as well as passes to the bonding-master device to preserve the
legacy use case.

Fixes: b89f04c61e ("bonding: deliver link-local packets with skb->dev set to link that packets arrived on")
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26 20:21:27 -07:00
Eric Dumazet
93f62ad5e8 bonding: use netpoll_poll_dev() helper
We want to allow NAPI drivers to no longer provide
ndo_poll_controller() method, as it has been proven problematic.

team driver must not look at its presence, but instead call
netpoll_poll_dev() which factorize the needed actions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23 21:55:24 -07:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Eric Dumazet
7e2556e400 bonding: avoid lockdep confusion in bond_get_stats()
syzbot found that the following sequence produces a LOCKDEP splat [1]

ip link add bond10 type bond
ip link add bond11 type bond
ip link set bond11 master bond10

To fix this, we can use the already provided nest_level.

This patch also provides correct nesting for dev->addr_list_lock

[1]
WARNING: possible recursive locking detected
4.18.0-rc6+ #167 Not tainted
--------------------------------------------
syz-executor751/4439 is trying to acquire lock:
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426

but task is already holding lock:
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
(____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&bond->stats_lock)->rlock);
  lock(&(&bond->stats_lock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by syz-executor751/4439:
 #0: (____ptrval____) (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77
 #1: (____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:310 [inline]
 #1: (____ptrval____) (&(&bond->stats_lock)->rlock){+.+.}, at: bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426
 #2: (____ptrval____) (rcu_read_lock){....}, at: bond_get_stats+0x0/0x560 include/linux/compiler.h:215

stack backtrace:
CPU: 0 PID: 4439 Comm: syz-executor751 Not tainted 4.18.0-rc6+ #167
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
 check_deadlock kernel/locking/lockdep.c:1809 [inline]
 validate_chain kernel/locking/lockdep.c:2405 [inline]
 __lock_acquire.cold.64+0x1fb/0x486 kernel/locking/lockdep.c:3435
 lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144
 spin_lock include/linux/spinlock.h:310 [inline]
 bond_get_stats+0xb4/0x560 drivers/net/bonding/bond_main.c:3426
 dev_get_stats+0x10f/0x470 net/core/dev.c:8316
 bond_get_stats+0x232/0x560 drivers/net/bonding/bond_main.c:3432
 dev_get_stats+0x10f/0x470 net/core/dev.c:8316
 rtnl_fill_stats+0x4d/0xac0 net/core/rtnetlink.c:1169
 rtnl_fill_ifinfo+0x1aa6/0x3fb0 net/core/rtnetlink.c:1611
 rtmsg_ifinfo_build_skb+0xc8/0x190 net/core/rtnetlink.c:3268
 rtmsg_ifinfo_event.part.30+0x45/0xe0 net/core/rtnetlink.c:3300
 rtmsg_ifinfo_event net/core/rtnetlink.c:3297 [inline]
 rtnetlink_event+0x144/0x170 net/core/rtnetlink.c:4716
 notifier_call_chain+0x180/0x390 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394 [inline]
 raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1735
 call_netdevice_notifiers net/core/dev.c:1753 [inline]
 netdev_features_change net/core/dev.c:1321 [inline]
 netdev_change_features+0xb3/0x110 net/core/dev.c:7759
 bond_compute_features.isra.47+0x585/0xa50 drivers/net/bonding/bond_main.c:1120
 bond_enslave+0x1b25/0x5da0 drivers/net/bonding/bond_main.c:1755
 bond_do_ioctl+0x7cb/0xae0 drivers/net/bonding/bond_main.c:3528
 dev_ifsioc+0x43c/0xb30 net/core/dev_ioctl.c:327
 dev_ioctl+0x1b5/0xcc0 net/core/dev_ioctl.c:493
 sock_do_ioctl+0x1d3/0x3e0 net/socket.c:992
 sock_ioctl+0x30d/0x680 net/socket.c:1093
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 __do_sys_ioctl fs/ioctl.c:708 [inline]
 __se_sys_ioctl fs/ioctl.c:706 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:706
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440859
Code: e8 2c af 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc51a92878 EFLAGS: 00000213 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000440859
RDX: 0000000020000040 RSI: 0000000000008990 RDI: 0000000000000003
RBP: 0000000000000000 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000022d5880 R11: 0000000000000213 R12: 0000000000007390
R13: 0000000000401db0 R14: 0000000000000000 R15: 0000000000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:34:07 -07:00
Alexander Duyck
4f49dec907 net: allow ndo_select_queue to pass netdev
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-07-09 13:41:34 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
John Hurley
f44aa9ef79 net: include hash policy in LAG changeupper info
LAG upper event notifiers contain the tx type used by the LAG device.
Extend this to also include the hash policy used for tx types that
utilize hashing.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24 23:10:57 -04:00
Willem de Bruijn
8eea1ca82b gso: limit udp gso to egress-only virtual devices
Until the udp receive stack supports large packets (UDP GRO), GSO
packets must not loop from the egress to the ingress path.

Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual
devices through NETIF_F_GSO_ENCAP_ALL as this included devices that
may loop packets, such as veth and macvlan.

Instead add it to specific devices that forward to another device's
egress path, bonding and team.

Fixes: 83aa025f53 ("udp: add gso support to virtual devices")
CC: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 14:48:44 -04:00
Tonghao Zhang
7e878b605f bonding: introduce link change helper
Introduce an new common helper to avoid redundancy.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17 15:51:13 -04:00
Debabrata Banerjee
b3c898e20b Revert "bonding: allow carrier and link status to determine link state"
This reverts commit 1386c36b30.

We don't want to encourage drivers to not report carrier status
correctly, therefore remove this commit.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 14:03:50 -04:00
Debabrata Banerjee
1386c36b30 bonding: allow carrier and link status to determine link state
In a mixed environment it may be difficult to tell if your hardware
support carrier, if it does not it can always report true. With a new
use_carrier option of 2, we can check both carrier and link status
sequentially, instead of one or the other

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 12:15:11 -04:00
Debabrata Banerjee
e79c105574 bonding: allow use of tx hashing in balance-alb
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 12:15:11 -04:00
David S. Miller
b2d6cee117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The bpf syscall and selftests conflicts were trivial
overlapping changes.

The r8169 change involved moving the added mdelay from 'net' into a
different function.

A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts.  I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 20:53:22 -04:00
Tonghao Zhang
ae35c6f7a5 bonding: use the skb_get/set_queue_mapping
Use the skb_get_queue_mapping, skb_set_queue_mapping
and skb_rx_queue_recorded for skb queue_mapping in bonding
driver, but not use it directly.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:08:44 -04:00
Tonghao Zhang
dbdc8a2175 bonding: replace the return value type
The method ndo_start_xmit is defined as returning a
netdev_tx_t, which is a typedef for an enum type,
but the implementation in this driver returns an int.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:08:44 -04:00
Debabrata Banerjee
21706ee8a4 bonding: send learning packets for vlans on slave
There was a regression at some point from the intended functionality of
commit f60c3704e8 ("bonding: Fix alb mode to only use first level
vlans.")

Given the return value vlan_get_encap_level() we need to store the nest
level of the bond device, and then compare the vlan's encap level to
this. Without this, this check always fails and learning packets are
never sent.

In addition, this same commit caused a regression in the behavior of
balance_alb, which requires learning packets be sent for all interfaces
using the slave's mac in order to load balance properly. For vlan's
that have not set a user mac, we can send after checking one bit.
Otherwise we need send the set mac, albeit defeating rx load balancing
for that vlan.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 11:50:41 -04:00
Xin Long
ddea788c63 bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
After Commit 8a8efa22f5 ("bonding: sync netpoll code with bridge"), it
would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
if bond->dev->npinfo was set.

However now slave_dev npinfo is set with bond->dev->npinfo before calling
slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
It causes that the lower dev of this slave dev can't set its npinfo.

One way to reproduce it:

  # modprobe bonding
  # brctl addbr br0
  # brctl addif br0 eth1
  # ifconfig bond0 192.168.122.1/24 up
  # ifenslave bond0 eth2
  # systemctl restart netconsole
  # ifenslave bond0 br0
  # ifconfig eth2 down
  # systemctl restart netconsole

The netpoll won't really work.

This patch is to remove that slave_dev npinfo setting in bond_enslave().

Fixes: 8a8efa22f5 ("bonding: sync netpoll code with bridge")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 11:52:35 -04:00
Xin Long
9f5a90c107 bonding: process the err returned by dev_set_allmulti properly in bond_enslave
When dev_set_promiscuity(1) succeeds but dev_set_allmulti(1) fails,
dev_set_promiscuity(-1) should be done before going to the err path.
Otherwise, dev->promiscuity will leak.

Fixes: 7e1a1ac1fb ("bonding: Check return of dev_set_promiscuity/allmulti")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:51:06 -04:00
Xin Long
ae42cc62a9 bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave
Beniamino found a crash when adding vlan as slave of bond which is also
the parent link:

  ip link add bond1 type bond
  ip link set bond1 up
  ip link add link bond1 vlan1 type vlan id 80
  ip link set vlan1 master bond1

The call trace is as below:

  [<ffffffffa850842a>] queued_spin_lock_slowpath+0xb/0xf
  [<ffffffffa8515680>] _raw_spin_lock+0x20/0x30
  [<ffffffffa83f6f07>] dev_mc_sync+0x37/0x80
  [<ffffffffc08687dc>] vlan_dev_set_rx_mode+0x1c/0x30 [8021q]
  [<ffffffffa83efd2a>] __dev_set_rx_mode+0x5a/0xa0
  [<ffffffffa83f7138>] dev_mc_sync_multiple+0x78/0x80
  [<ffffffffc084127c>] bond_enslave+0x67c/0x1190 [bonding]
  [<ffffffffa8401909>] do_setlink+0x9c9/0xe50
  [<ffffffffa8403bf2>] rtnl_newlink+0x522/0x880
  [<ffffffffa8403ff7>] rtnetlink_rcv_msg+0xa7/0x260
  [<ffffffffa8424ecb>] netlink_rcv_skb+0xab/0xc0
  [<ffffffffa83fe498>] rtnetlink_rcv+0x28/0x30
  [<ffffffffa8424850>] netlink_unicast+0x170/0x210
  [<ffffffffa8424bf8>] netlink_sendmsg+0x308/0x420
  [<ffffffffa83cc396>] sock_sendmsg+0xb6/0xf0

This is actually a dead lock caused by sync slave hwaddr from master when
the master is the slave's 'slave'. This dead loop check is actually done
by netdev_master_upper_dev_link. However, Commit 1f718f0f4f ("bonding:
populate neighbour's private on enslave") moved it after dev_mc_sync.

This patch is to fix it by moving dev_mc_sync after master_upper_dev_link,
so that this loop check would be earlier than dev_mc_sync. It also moves
if (mode == BOND_MODE_8023AD) into if (!bond_uses_primary) clause as an
improvement.

Note team driver also has this issue, I will fix it in another patch.

Fixes: 1f718f0f4f ("bonding: populate neighbour's private on enslave")
Reported-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:51:05 -04:00
Xin Long
5c78f6bfae bonding: fix the err path for dev hwaddr sync in bond_enslave
vlan_vids_add_by_dev is called right after dev hwaddr sync, so on
the err path it should unsync dev hwaddr. Otherwise, the slave
dev's hwaddr will never be unsync when this err happens.

Fixes: 1ff412ad77 ("bonding: change the bond's vlan syncing functions with the standard ones")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:51:05 -04:00
Linus Torvalds
5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Linus Torvalds
8e9a2dba86 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Another attempt at enabling cross-release lockdep dependency
     tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
     with better performance and fewer false positives. (Byungchul Park)

   - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
     open-coded equivalents to lockdep variants. (Frederic Weisbecker)

   - Add down_read_killable() and use it in the VFS's iterate_dir()
     method. (Kirill Tkhai)

   - Convert remaining uses of ACCESS_ONCE() to
     READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
     driven. (Mark Rutland, Paul E. McKenney)

   - Get rid of lockless_dereference(), by strengthening Alpha atomics,
     strengthening READ_ONCE() with smp_read_barrier_depends() and thus
     being able to convert users of lockless_dereference() to
     READ_ONCE(). (Will Deacon)

   - Various micro-optimizations:

        - better PV qspinlocks (Waiman Long),
        - better x86 barriers (Michael S. Tsirkin)
        - better x86 refcounts (Kees Cook)

   - ... plus other fixes and enhancements. (Borislav Petkov, Juergen
     Gross, Miguel Bernal Marin)"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
  rcu: Use lockdep to assert IRQs are disabled/enabled
  netpoll: Use lockdep to assert IRQs are disabled/enabled
  timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
  sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
  irq_work: Use lockdep to assert IRQs are disabled/enabled
  irq/timings: Use lockdep to assert IRQs are disabled/enabled
  perf/core: Use lockdep to assert IRQs are disabled/enabled
  x86: Use lockdep to assert IRQs are disabled/enabled
  smp/core: Use lockdep to assert IRQs are disabled/enabled
  timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
  timers/nohz: Use lockdep to assert IRQs are disabled/enabled
  workqueue: Use lockdep to assert IRQs are disabled/enabled
  irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
  locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
  locking/pvqspinlock: Implement hybrid PV queued/unfair locks
  locking/rwlocks: Fix comments
  x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
  block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
  workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
  ...
2017-11-13 12:38:26 -08:00
David S. Miller
4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Jay Vosburgh
055db6957e bonding: fix slave stuck in BOND_LINK_FAIL state
The bonding miimon logic has a flaw, in that a failure of the
rtnl_trylock can cause a slave to become permanently stuck in
BOND_LINK_FAIL state.

	The sequence of events to cause this is as follows:

	1) bond_miimon_inspect finds that a slave's link is down, and so
calls bond_propose_link_state, setting slave->new_link_state to
BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns
non-zero.

	2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is
rescheduled.  No change is committed.

	3) bond_miimon_inspect is called again, but this time the slave
from step 1 has recovered.  slave->new_link is reset to NOCHANGE, and, as
slave->link was never changed, the switch enters the BOND_LINK_UP case,
and does nothing.  The pending BOND_LINK_FAIL state from step 1 remains
pending, as new_link_state is not reset.

	4) The state from step 3 persists until another slave changes link
state and causes bond_miimon_inspect to return non-zero.  At this point,
the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed,
and the slave will remain stuck in BOND_LINK_FAIL state even though it
is actually link up.

	The remedy for this is to initialize new_link_state on each entry
to bond_miimon_inspect, as is already done with new_link.

Fixes: fb9eb899a6 ("bonding: handle link transition from FAIL to UP correctly")
Reported-by: Alex Sidorenko <alexandre.sidorenko@hpe.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 16:07:10 +09:00
Hangbin Liu
b5f862180d bonding: discard lowest hash bit for 802.3ad layer3+4
After commit 07f4c90062 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), we will try to use even ports for connect(). Then if an
application (seen clearly with iperf) opens multiple streams to the same
destination IP and port, each stream will be given an even source port.

So the bonding driver's simple xmit_hash_policy based on layer3+4 addressing
will always hash all these streams to the same interface. And the total
throughput will limited to a single slave.

Change the tcp code will impact the whole tcp behavior, only for bonding
usage. Paolo Abeni suggested fix this by changing the bonding code only,
which should be more reasonable, and less impact.

Fix this by discarding the lowest hash bit because it contains little entropy.
After the fix we can re-balance between slaves.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:28:43 +09:00
Mark Rutland
6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
Xin Long
4597efe312 bonding: remove rtmsg_ifinfo called in bond_master_upper_dev_link
Since commit 42e52bf9e3 ("net: add netnotifier event for upper device
change"), netdev_master_upper_dev_link has generated NETDEV_CHANGEUPPER
event which would send a notification to userspace in rtnetlink_event.

There's no need to call rtmsg_ifinfo to send the notification any more.
So this patch is to remove it from bond_master_upper_dev_link as well
as bond_upper_dev_unlink to avoid the redundant notifications.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-25 10:54:39 +09:00
David Ahern
759088bda2 net: bonding: Add extack messages for some enslave failures
A number of bond_enslave errors are logged using the netdev_err API.
Return those messages to userspace via the extack facility.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:34 -07:00
David Ahern
42ab19ee90 net: Add extack to upper device linking
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
David Ahern
33eaf2a6eb net: Add extack to ndo_add_slave
Pass extack to do_set_master and down to ndo_add_slave

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
Mahesh Bandewar
4d2c0cda07 bonding: speed/duplex update at NETDEV_UP event
Some NIC drivers don't have correct speed/duplex settings at the
time they send NETDEV_UP notification and that messes up the
bonding state. Especially 802.3ad mode which is very sensitive
to these settings. In the current implementation we invoke
bond_update_speed_duplex() when we receive NETDEV_UP, however,
ignore the return value. If the values we get are invalid
(UNKNOWN), then slave gets removed from the aggregator with
speed and duplex set to UNKNOWN while link is still marked as UP.

This patch fixes this scenario. Also 802.3ad mode is sensitive to
these conditions while other modes are not, so making sure that it
doesn't change the behavior for other modes.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03 14:32:25 -07:00
Alexey Dobriyan
6ade97da60 arp: make arp_hdr_len() return unsigned int
Negative ARP header length are not a thing.

Constify arguments while I'm at it.

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
	function                        old     new   delta
	arpt_do_table                  1163    1160      -3

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 10:29:36 -07:00
Nikolay Aleksandrov
f13ad104b4 net: bonding: fix tlb_dynamic_lb default value
Commit 8b426dc54c ("bonding: remove hardcoded value") changed the
default value for tlb_dynamic_lb which lead to either broken ALB mode
(since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
with tlb_dynamic_lb equal to 0.
The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
when switching to ALB mode, but the default value is still wrong and
we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
changed via netlink or sysfs. In order to restore the previous behaviour
and default value simply remove the mode check around the default param
initialization for tlb_dynamic_lb which will always set it to 1 as
before.

Fixes: 8b426dc54c ("bonding: remove hardcoded value")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-12 20:58:12 -07:00
Andreas Born
11e9d7829d bonding: ratelimit failed speed/duplex update warning
bond_miimon_commit() handles the UP transition for each slave of a bond
in the case of MII. It is triggered 10 times per second for the default
MII Polling interval of 100ms. For device drivers that do not implement
__ethtool_get_link_ksettings() the call to bond_update_speed_duplex()
fails persistently while the MII status could remain UP. That is, in
this and other cases where the speed/duplex update keeps failing over a
longer period of time while the MII state is UP, a warning is printed
every MII polling interval.

To address these excessive warnings net_ratelimit() should be used.
Printing a warning once would not be sufficient since the call to
bond_update_speed_duplex() could recover to succeed and fail again
later. In that case there would be no new indication what went wrong.

Fixes: b5bf0f5b16 (bonding: correctly update link status during mii-commit phase)
Signed-off-by: Andreas Born <futur.andy@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 20:01:38 -07:00
Andreas Born
ad729bc9ac bonding: require speed/duplex only for 802.3ad, alb and tlb
The patch c4adfc822b ("bonding: make speed, duplex setting consistent
with link state") puts the link state to down if
bond_update_speed_duplex() cannot retrieve speed and duplex settings.
Assumably the patch was written with 802.3ad mode in mind which relies
on link speed/duplex settings. For other modes like active-backup these
settings are not required. Thus, only for these other modes, this patch
reintroduces support for slaves that do not support reporting speed or
duplex such as wireless devices. This fixes the regression reported in
bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547).

Fixes: c4adfc822b ("bonding: make speed, duplex setting consistent
with link state")
Signed-off-by: Andreas Born <futur.andy@googlemail.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:21:42 -07:00
WANG Cong
d94708a553 bonding: commit link status change after propose
Commit de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
moves link status commitment into bond_mii_monitor(), but it still relies
on the return value of bond_miimon_inspect() as the hint. We need to return
non-zero as long as we propose a link status change.

Fixes: de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
Reported-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Tested-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-26 13:41:21 -07:00
Kosuke Tatsukawa
cbf5ecb305 net: bonding: Fix transmit load balancing in balance-alb mode
balance-alb mode used to have transmit dynamic load balancing feature
enabled by default.  However, transmit dynamic load balancing no longer
works in balance-alb after commit 8b426dc54c ("bonding: remove
hardcoded value").

Both balance-tlb and balance-alb use the function bond_do_alb_xmit() to
send packets.  This function uses the parameter tlb_dynamic_lb.
tlb_dynamic_lb used to have the default value of 1 for balance-alb, but
now the value is set to 0 except in balance-tlb.

Re-enable transmit dyanmic load balancing by initializing tlb_dynamic_lb
for balance-alb similar to balance-tlb.

Fixes: 8b426dc54c ("bonding: remove hardcoded value")
Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-20 15:34:40 -07:00
WANG Cong
f51048c3e0 bonding: avoid NETDEV_CHANGEMTU event when unregistering slave
As Hongjun/Nicolas summarized in their original patch:

"
When a device changes from one netns to another, it's first unregistered,
then the netns reference is updated and the dev is registered in the new
netns. Thus, when a slave moves to another netns, it is first
unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
the bonding driver. The driver calls bond_release(), which calls
dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
the old netns).
"

This is a very special case, because the device is being unregistered
no one should still care about the NETDEV_CHANGEMTU event triggered
at this point, we can avoid broadcasting this event on this path,
and avoid touching inetdev_event()/addrconf_notify() path.

It requires to export __dev_set_mtu() to bonding driver.

Reported-by: Hongjun Li <hongjun.li@6wind.com>
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08 11:23:29 +01:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Arkadi Sharshevsky
eca59f6915 net: Remove support for bridge bypass ndos from stacked devices
Remove support for bridge bypass ndos from stacked devices. At this point
no driver which supports stack device behavior offload supports operation
with SELF flag. The case for upper device is already taken care of in both
of the following cases:

1. FDB add/del - driver should check at the notification cb if the
                 stacked device contains his ports.

2. Port attribute - calls switchdev code directly which checks
                    for case of stack device.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 14:16:28 -04:00
David S. Miller
cf124db566 net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:53:24 -04:00
Vlad Yasevich
7a7e96e09d bonding: Prevent duplicate userspace notification
Whenever a user changes bonding options, a NETDEV_CHANGEINFODATA
notificatin is generated which results in a rtnelink message to
be sent.  While runnig 'ip monitor', we can actually see 2 messages,
one a result of the event, and the other a result of state change
that is generated bo netdev_state_change().  However, this is not
always the case. If bonding changes were done via sysfs or ifenslave
(old ioctl interface), then only 1 message is seen.

This patch removes duplicate messages in the case of using netlink
to configure bonding.  It introduceds a separte function that
triggers a netdev event and uses that function in the syfs and ioctl
cases.

This was discovered while auditing all the different envents and
continues the effort of cleaning up duplicated netlink messages.

CC: David Ahern <dsa@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-27 18:51:41 -04:00
Nithin Sujir
797a93647a bonding: Don't update slave->link until ready to commit
In the loadbalance arp monitoring scheme, when a slave link change is
detected, the slave->link is immediately updated and slave_state_changed
is set. Later down the function, the rtnl_lock is acquired and the
changes are committed, updating the bond link state.

However, the acquisition of the rtnl_lock can fail. The next time the
monitor runs, since slave->link is already updated, it determines that
link is unchanged. This results in the bond link state permanently out
of sync with the slave link.

This patch modifies bond_loadbalance_arp_mon() to handle link changes
identical to bond_ab_arp_{inspect/commit}(). The new link state is
maintained in slave->new_link until we're ready to commit at which point
it's copied into slave->link.

NOTE: miimon_{inspect/commit}() has a more complex state machine
requiring the use of the bond_{propose,commit}_link_state() functions
which maintains the intermediate state in slave->link_new_state. The arp
monitors don't require that.

Testing: This bug is very easy to reproduce with the following steps.
1. In a loop, toggle a slave link of a bond slave interface.
2. In a separate loop, do ifconfig up/down of an unrelated interface to
create contention for rtnl_lock.
Within a few iterations, the bond link goes out of sync with the slave
link.

Signed-off-by: Nithin Nayak Sujir <nsujir@tintri.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 14:47:35 -04:00
Jarod Wilson
72ccc471e1 bonding: fix randomly populated arp target array
In commit dc9c4d0fe0, the arp_target array moved from a static global
to a local variable. By the nature of static globals, the array used to
be initialized to all 0. At present, it's full of random data, which
that gets interpreted as arp_target values, when none have actually been
specified. Systems end up booting with spew along these lines:

[   32.161783] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[   32.168475] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[   32.175089] 8021q: adding VLAN 0 to HW filter on device lacp0
[   32.193091] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
[   32.204892] lacp0: Setting MII monitoring interval to 100
[   32.211071] lacp0: Removing ARP target 216.124.228.17
[   32.216824] lacp0: Removing ARP target 218.160.255.255
[   32.222646] lacp0: Removing ARP target 185.170.136.184
[   32.228496] lacp0: invalid ARP target 255.255.255.255 specified for removal
[   32.236294] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[   32.243987] lacp0: Removing ARP target 56.125.228.17
[   32.249625] lacp0: Removing ARP target 218.160.255.255
[   32.255432] lacp0: Removing ARP target 15.157.233.184
[   32.261165] lacp0: invalid ARP target 255.255.255.255 specified for removal
[   32.268939] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
[   32.276632] lacp0: Removing ARP target 16.0.0.0
[   32.281755] lacp0: Removing ARP target 218.160.255.255
[   32.287567] lacp0: Removing ARP target 72.125.228.17
[   32.293165] lacp0: Removing ARP target 218.160.255.255
[   32.298970] lacp0: Removing ARP target 8.125.228.17
[   32.304458] lacp0: Removing ARP target 218.160.255.255

None of these were actually specified as ARP targets, and the driver does
seem to clean up the mess okay, but it's rather noisy and confusing, leaks
values to userspace, and the 255.255.255.255 spew shows up even when debug
prints are disabled.

The fix: just zero out arp_target at init time.

While we're in here, init arp_all_targets_value in the right place.

Fixes: dc9c4d0fe0 ("bonding: reduce scope of some global variables")
CC: Mahesh Bandewar <maheshb@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 14:38:22 -04:00
Linus Torvalds
8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Paolo Abeni
19cdead3e2 bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
On slave list updates, the bonding driver computes its hard_header_len
as the maximum of all enslaved devices's hard_header_len.
If the slave list is empty, e.g. on last enslaved device removal,
ETH_HLEN is used.

Since the bonding header_ops are set only when the first enslaved
device is attached, the above can lead to header_ops->create()
being called with the wrong skb headroom in place.

If bond0 is configured on top of ipoib devices, with the
following commands:

ifup bond0
for slave in $BOND_SLAVES_LIST; do
	ip link set dev $slave nomaster
done
ping -c 1 <ip on bond0 subnet>

we will obtain a skb_under_panic() with a similar call trace:
	skb_push+0x3d/0x40
	push_pseudo_header+0x17/0x30 [ib_ipoib]
	ipoib_hard_header+0x4e/0x80 [ib_ipoib]
	arp_create+0x12f/0x220
	arp_send_dst.part.19+0x28/0x50
	arp_solicit+0x115/0x290
	neigh_probe+0x4d/0x70
	__neigh_event_send+0xa7/0x230
	neigh_resolve_output+0x12e/0x1c0
	ip_finish_output2+0x14b/0x390
	ip_finish_output+0x136/0x1e0
	ip_output+0x76/0xe0
	ip_local_out+0x35/0x40
	ip_send_skb+0x19/0x40
	ip_push_pending_frames+0x33/0x40
	raw_sendmsg+0x7d3/0xb50
	inet_sendmsg+0x31/0xb0
	sock_sendmsg+0x38/0x50
	SYSC_sendto+0x102/0x190
	SyS_sendto+0xe/0x10
	do_syscall_64+0x67/0x180
	entry_SYSCALL64_slow_path+0x25/0x25

This change addresses the issue avoiding updating the bonding device
hard_header_len when the slaves list become empty, forbidding to
shrink it below the value used by header_ops->create().

The bug is there since commit 54ef313714 ("[PATCH] bonding: Handle large
hard_header_len") but the panic can be triggered only since
commit fc791b6335 ("IB/ipoib: move back IB LL address into the hard
header").

Reported-by: Norbert P <noe@physik.uzh.ch>
Fixes: 54ef313714 ("[PATCH] bonding: Handle large hard_header_len")
Fixes: fc791b6335 ("IB/ipoib: move back IB LL address into the hard header")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 16:04:05 -04:00
Mahesh Bandewar
ea8ffc0818 bonding: fix wq initialization for links created via netlink
Earlier patch 4493b81bea ("bonding: initialize work-queues during
creation of bond") moved the work-queue initialization from bond_open()
to bond_create(). However this caused the link those are created using
netlink 'create bond option' (ip link add bondX type bond); create the
new trunk without initializing work-queues. Prior to the above mentioned
change, ndo_open was in both paths and things worked correctly. The
consequence is visible in the report shared by Joe Stringer -

I've noticed that this patch breaks bonding within namespaces if
you're not careful to perform device cleanup correctly.

Here's my repro script, you can run on any net-next with this patch
and you'll start seeing some weird behaviour:

ip netns add foo
ip li add veth0 type veth peer name veth0+ netns foo
ip li add veth1 type veth peer name veth1+ netns foo
ip netns exec foo ip li add bond0 type bond
ip netns exec foo ip li set dev veth0+ master bond0
ip netns exec foo ip li set dev veth1+ master bond0
ip netns exec foo ip addr add dev bond0 192.168.0.1/24
ip netns exec foo ip li set dev bond0 up
ip li del dev veth0
ip li del dev veth1

The second to last command segfaults, last command hangs. rtnl is now
permanently locked. It's not a problem if you take bond0 down before
deleting veths, or delete bond0 before deleting veths. If you delete
either end of the veth pair as per above, either inside or outside the
namespace, it hits this problem.

Here's some kernel logs:
[ 1221.801610] bond0: Enslaving veth0+ as an active interface with an up link
[ 1224.449581] bond0: Enslaving veth1+ as an active interface with an up link
[ 1281.193863] bond0: Releasing backup interface veth0+
[ 1281.193866] bond0: the permanent HWaddr of veth0+ -
16:bf:fb:e0:b8:43 - is still in use by bond0 - set the HWaddr of
veth0+ to a different address to avoid conflicts
[ 1281.193867] ------------[ cut here ]------------
[ 1281.193873] WARNING: CPU: 0 PID: 2024 at kernel/workqueue.c:1511
__queue_delayed_work+0x13f/0x150
[ 1281.193873] Modules linked in: bonding veth openvswitch nf_nat_ipv6
nf_nat_ipv4 nf_nat autofs4 nfsd auth_rpcgss nfs_acl binfmt_misc nfs
lockd grace sunrpc fscache ppdev vmw_balloon coretemp psmouse
serio_raw vmwgfx ttm drm_kms_helper vmw_vmci netconsole parport_pc
configfs drm i2c_piix4 fb_sys_fops syscopyarea sysfillrect sysimgblt
shpchp mac_hid nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4
nf_defrag_ipv4 nf_conntrack libcrc32c lp parport hid_generic usbhid
hid mptspi mptscsih e1000 mptbase ahci libahci
[ 1281.193905] CPU: 0 PID: 2024 Comm: ip Tainted: G        W
4.10.0-bisect-bond-v0.14 #37
[ 1281.193906] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[ 1281.193906] Call Trace:
[ 1281.193912]  dump_stack+0x63/0x89
[ 1281.193915]  __warn+0xd1/0xf0
[ 1281.193917]  warn_slowpath_null+0x1d/0x20
[ 1281.193918]  __queue_delayed_work+0x13f/0x150
[ 1281.193920]  queue_delayed_work_on+0x27/0x40
[ 1281.193929]  bond_change_active_slave+0x25b/0x670 [bonding]
[ 1281.193932]  ? synchronize_rcu_expedited+0x27/0x30
[ 1281.193935]  __bond_release_one+0x489/0x510 [bonding]
[ 1281.193939]  ? addrconf_notify+0x1b7/0xab0
[ 1281.193942]  bond_netdev_event+0x2c5/0x2e0 [bonding]
[ 1281.193944]  ? netconsole_netdev_event+0x124/0x190 [netconsole]
[ 1281.193947]  notifier_call_chain+0x49/0x70
[ 1281.193948]  raw_notifier_call_chain+0x16/0x20
[ 1281.193950]  call_netdevice_notifiers_info+0x35/0x60
[ 1281.193951]  rollback_registered_many+0x23b/0x3e0
[ 1281.193953]  unregister_netdevice_many+0x24/0xd0
[ 1281.193955]  rtnl_delete_link+0x3c/0x50
[ 1281.193956]  rtnl_dellink+0x8d/0x1b0
[ 1281.193960]  rtnetlink_rcv_msg+0x95/0x220
[ 1281.193962]  ? __kmalloc_node_track_caller+0x35/0x280
[ 1281.193964]  ? __netlink_lookup+0xf1/0x110
[ 1281.193966]  ? rtnl_newlink+0x830/0x830
[ 1281.193967]  netlink_rcv_skb+0xa7/0xc0
[ 1281.193969]  rtnetlink_rcv+0x28/0x30
[ 1281.193970]  netlink_unicast+0x15b/0x210
[ 1281.193971]  netlink_sendmsg+0x319/0x390
[ 1281.193974]  sock_sendmsg+0x38/0x50
[ 1281.193975]  ___sys_sendmsg+0x25c/0x270
[ 1281.193978]  ? mem_cgroup_commit_charge+0x76/0xf0
[ 1281.193981]  ? page_add_new_anon_rmap+0x89/0xc0
[ 1281.193984]  ? lru_cache_add_active_or_unevictable+0x35/0xb0
[ 1281.193985]  ? __handle_mm_fault+0x4e9/0x1170
[ 1281.193987]  __sys_sendmsg+0x45/0x80
[ 1281.193989]  SyS_sendmsg+0x12/0x20
[ 1281.193991]  do_syscall_64+0x6e/0x180
[ 1281.193993]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1281.193995] RIP: 0033:0x7f6ec122f5a0
[ 1281.193995] RSP: 002b:00007ffe69e89c48 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[ 1281.193997] RAX: ffffffffffffffda RBX: 00007ffe69e8dd60 RCX: 00007f6ec122f5a0
[ 1281.193997] RDX: 0000000000000000 RSI: 00007ffe69e89c90 RDI: 0000000000000003
[ 1281.193998] RBP: 00007ffe69e89c90 R08: 0000000000000000 R09: 0000000000000003
[ 1281.193999] R10: 00007ffe69e89a10 R11: 0000000000000246 R12: 0000000058f14b9f
[ 1281.193999] R13: 0000000000000000 R14: 00000000006473a0 R15: 00007ffe69e8e450
[ 1281.194001] ---[ end trace 713a77486cbfbfa3 ]---

Fixes: 4493b81bea ("bonding: initialize work-queues during creation of bond")
Reported-by: Joe Stringer <joe@ovn.org>
Tested-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:28:37 -04:00
Chonggang Li
b89f04c61e bonding: deliver link-local packets with skb->dev set to link that packets arrived on
Bonding driver changes the skb->dev to the bonding-master before
passing the packet to stack for further processing. This, however
does not make sense for the link-local packets and it loses "the
link info" once its skb->dev is changed to bonding-master.  This
patch changes this behavior for link-local packets by not changing
the skb->dev to the bonding-master and maintaining it as it is,
i.e. the link on which the packet arrived.

Signed-off-by: Chonggang Li <chonggangli@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 15:45:47 -04:00
Mahesh Bandewar
fb9eb899a6 bonding: handle link transition from FAIL to UP correctly
When link transitions from LINK_FAIL to LINK_UP, the commit phase is
not called. This leads to an erroneous state causing slave-link state to
get stuck in "going down" state while its speed and duplex are perfectly
fine. This issue is a side-effect of splitting link-set into propose and
commit phases introduced by de77ecd4ef ("bonding: improve link-status
update in mii-monitoring")

This patch fixes these issues by calling commit phase whenever link
state change is proposed.

Fixes: de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:48:59 -04:00
Jarod Wilson
faeeb317a5 bonding: attempt to better support longer hw addresses
People are using bonding over Infiniband IPoIB connections, and who knows
what else. Infiniband has a hardware address length of 20 octets
(INFINIBAND_ALEN), and the network core defines a MAX_ADDR_LEN of 32.
Various places in the bonding code are currently hard-wired to 6 octets
(ETH_ALEN), such as the 3ad code, which I've left untouched here. Besides,
only alb is currently possible on Infiniband links right now anyway, due
to commit 1533e77315, so the alb code is where most of the changes are.

One major component of this change is the addition of a bond_hw_addr_copy
function that takes a length argument, instead of using ether_addr_copy
everywhere that hardware addresses need to be copied about. The other
major component of this change is converting the bonding code from using
struct sockaddr for address storage to struct sockaddr_storage, as the
former has an address storage space of only 14, while the latter is 128
minus a few, which is necessary to support bonding over device with up to
MAX_ADDR_LEN octet hardware addresses. Additionally, this probably fixes
up some memory corruption issues with the current code, where it's
possible to write an infiniband hardware address into a sockaddr declared
on the stack.

Lightly tested on a dual mlx4 IPoIB setup, which properly shows a 20-octet
hardware address now:

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: mlx4_ib0 (primary_reselect always)
Currently Active Slave: mlx4_ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100

Slave Interface: mlx4_ib0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:08:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:1d:67:01
Slave queue ID: 0

Slave Interface: mlx4_ib1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr:
80:00:02:09:fe:80:00:00:00:00:00:01:e4:1d:2d:03:00:1d:67:02
Slave queue ID: 0

Also tested with a standard 1Gbps NIC bonding setup (with a mix of
e1000 and e1000e cards), running LNST's bonding tests.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 18:44:54 -07:00
Mahesh Bandewar
3f3c278c94 bonding: fix active-backup transition
Earlier patch c4adfc822b ("bonding: make speed, duplex setting
consistent with link state") made an attempt to keep slave state
consistent with speed and duplex settings. Unfortunately link-state
transition is used to change the active link especially when used
in conjunction with mii-mon. The above mentioned patch broke that
logic. Also when speed and duplex settings for a link are updated
during a link-event, the link-status should not be changed to
invoke correct transition logic.

This patch fixes this issue by moving the link-state update outside
of the bond_update_speed_duplex() fn and to the places where this fn
is called and update link-state selectively.

Fixes: c4adfc822b ("bonding: make speed, duplex setting consistent
with link state")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 07:15:58 -07:00
Eric Dumazet
142c6594ac bonding: refine bond_fold_stats() wrap detection
Some device drivers reset their stats at down/up events, possibly
fooling bonding stats, since they operate with relative deltas.

It is nearly not possible to fix drivers, since some of them compute the
tx/rx counters based on per rx/tx queue stats, and the queues can be
reconfigured (ethtool -L) between the down/up sequence.

Lets avoid accumulating 'negative' values that render bonding stats
useless.

It is better to lose small deltas, assuming the bonding stats are
fetched at a reasonable frequency.

Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 12:40:34 -07:00
Mahesh Bandewar
b5bf0f5b16 bonding: correctly update link status during mii-commit phase
bond_miimon_commit() marks the link UP after attempting to get the speed
and duplex settings for the link. There is a possibility that
bond_update_speed_duplex() could fail. This is another place where it
could result into an inconsistent bonding link state.

With this patch the link will be marked UP only if the speed and duplex
values retrieved have sane values and processed further.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:11:50 -07:00
Mahesh Bandewar
c4adfc822b bonding: make speed, duplex setting consistent with link state
bond_update_speed_duplex() retrieves speed and duplex settings. There
is a possibility of failure in retrieving these values but caller has
to assume it's always successful. This leads to having inconsistent
slave link settings. If these (speed, duplex) values cannot be
retrieved, then keeping the link UP causes problems.

The updated bond_update_speed_duplex() returns 0 on success if it
retrieves sane values for speed and duplex. On failure it returns 1
and marks the link down.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:11:50 -07:00
Mahesh Bandewar
de77ecd4ef bonding: improve link-status update in mii-monitoring
The primary issue is that mii-inspect phase updates link-state and
expects changes to be committed during the mii-commit phase. After
the inspect phase if it fails to acquire rtnl-mutex, the commit
phase (bond_mii_commit) doesn't get to run. This partially updated
state stays and makes the internal-state inconsistent.

e.g. setup bond0 => slaves: eth1, eth2
eth1 goes DOWN -> UP
   mii_monitor()
	mii-inspect()
	    bond_set_slave_link_state(eth1, UP, DontNotify)
	rtnl_trylock() <- fails!

Next mii-monitor round
eth1: No change
   mii_monitor()
	mii-inspect()
	    eth1->link == current-status (ethtool_ops->get_link)
	    no-change-detected

End result:
    eth1:
      Link = BOND_LINK_UP
      Speed = 0xfffff  [SpeedUnknown]
      Duplex = 0xff    [DuplexUnknown]

This doesn't always happen but for some unlucky machines in a large set
of machines it creates problems.

The fix for this is to avoid making changes during inspect phase and
postpone them until acquiring the rtnl-mutex / invoking commit phase.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:11:49 -07:00
Mahesh Bandewar
dc9c4d0fe0 bonding: reduce scope of some global variables
Many of the bond param variables are declared global while it's not
really necessary for these variables to be global. So moving them to
the location these are used.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:33:29 -08:00
Mahesh Bandewar
8b426dc54c bonding: remove hardcoded value
Eliminate hard-coded value and use the default that is set.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:33:29 -08:00
Mahesh Bandewar
4493b81bea bonding: initialize work-queues during creation of bond
Initializing work-queues every time ifup operation performed is unnecessary
and can be performed only once when the port is created.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:33:29 -08:00
Mahesh Bandewar
d5e73f7be8 bonding: restructure arp-monitor
In preparation to move the work-queue initialization to port creation
from current port_open phase. Work-queue initialization does not make
sense every time we do 'ifup/ifdown'. So moving to port creation phase.

Arp monitoring work depends on the bonding mode and that is not tied
to the port creation and can change anytime during the life after port
creation. So this restructuring allows us to move the initialization
at creation without losing the ability to arm the correct work for the
mode user has selected.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:33:29 -08:00
WANG Cong
31c05415f5 bonding: use ETH_MAX_MTU as max mtu
This restores the ability of setting bond device's mtu to 9000.

Fixes: 91572088e3 ("net: use core MTU range checking in core net infra")
Reported-by: daznis@gmail.com
Reported-by: Brad Campbell <lists2009@fnarfbargle.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-02 14:43:20 -08:00
Ido Schimmel
a8eca32615 net: remove ndo_neigh_{construct, destroy} from stacked devices
In commit 18bfb924f0 ("net: introduce default neigh_construct/destroy
ndo calls for L2 upper devices") we added these ndos to stacked devices
such as team and bond, so that calls will be propagated to mlxsw.

However, previous commit removed the reliance on these ndos and no new
users of these ndos have appeared since above mentioned commit. We can
therefore safely remove this dead code.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 11:25:57 -05:00
Zhu Yanjun
3d67576da1 bonding: Remove unnecessary returned value check
The function bond_info_query alwarys returns 0. As such, in the function
bond_do_ioctl, it is not necessary to check the returned value. So the
interface type of the function bond_info_query is changed to void. The
redundant check is removed.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03 16:25:17 -05:00
stephen hemminger
bc1f44709c net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 17:51:44 -05:00
Alexey Dobriyan
c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
Philippe Reynes
d46b634945 net: bonding: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 17:23:39 -04:00
David Ahern
b3208b2024 net: bonding: Flip to the new dev walk API
Convert alb_send_learning_packets and bond_has_this_ip to use the new
netdev_walk_all_upper_dev_rcu API. In both cases this is just a code
conversion; no functional change is intended.

v2
- removed typecast of data and simplified bond_upper_dev_walk

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:58 -04:00
Linus Torvalds
4cdf8dbe2d Merge branch 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess.h prepwork from Al Viro:
 "Preparations to tree-wide switch to use of linux/uaccess.h (which,
  obviously, will allow to start unifying stuff for real). The last step
  there, ie

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
    sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
            `git grep -l "$PATT"|grep -v ^include/linux/uaccess.h`

  is not taken here - I would prefer to do it once just before or just
  after -rc1.  However, everything should be ready for it"

* 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  remove a stray reference to asm/uaccess.h in docs
  sparc64: separate extable_64.h, switch elf_64.h to it
  score: separate extable.h, switch module.h to it
  mips: separate extable.h, switch module.h to it
  x86: separate extable.h, switch sections.h to it
  remove stray include of asm/uaccess.h from cacheflush.h
  mn10300: remove a bogus processor.h->uaccess.h include
  xtensa: split uaccess.h into C and asm sides
  bonding: quit messing with IOCTL
  kill __kernel_ds_p off
  mn10300: finish verify_area() off
  frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h
  exceptions: detritus removal
2016-10-11 23:38:39 -07:00
Al Viro
4ad41c1e26 bonding: quit messing with IOCTL
The only remaining users are issuing SIOCGMIIPHY and SIOCGMIIREG,
neither of which deals with userland pointers.  Simply calling
->ndo_do_ioctl() is fine; no messing with set_fs() is needed.
It used to mess with SIOCETHTOOL, which would've needed set_fs(),
but that has been killed in "[NET] ethtool ops are the only way"
9 years ago...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:15:21 -04:00
David S. Miller
b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Mahesh Bandewar
24b27fc4cd bonding: Fix bonding crash
Following few steps will crash kernel -

  (a) Create bonding master
      > modprobe bonding miimon=50
  (b) Create macvlan bridge on eth2
      > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
	   type macvlan
  (c) Now try adding eth2 into the bond
      > echo +eth2 > /sys/class/net/bond0/bonding/slaves
      <crash>

Bonding does lots of things before checking if the device enslaved is
busy or not.

In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.

This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:41:12 -07:00
Bhaktipriya Shridhar
f9f225ebf5 bonding: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.

The workqueue "wq" queues multiple work items viz
&bond->mcast_work, &nnw->work, &bond->mii_work, &bond->arp_work,
&bond->alb_work, &bond->mii_work, &bond->ad_work, &bond->slave_arr_work
which require strict execution ordering. Hence, an ordered dedicated
workqueue has been used.

Since, it is a network driver, WQ_MEM_RECLAIM has been set to
ensure forward progress under memory pressure.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:41:07 -07:00
Zhu Yanjun
0d039f337f bonding: fix the typo
The message "803.ad" should be "802.3ad".

Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 14:57:14 -07:00
Mark Bloch
1533e77315 net/bonding: Enforce active-backup policy for IPoIB bonds
When using an IPoIB bond currently only active-backup mode is a valid
use case and this commit strengthens it.

Since commit 2ab82852a2 ("net/bonding: Enable bonding to enslave
netdevices not supporting set_mac_address()") was introduced till
4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence the
fail over mac policy always applied to IPoIB bonds.

With the introduction of commit 492a7e67ff ("IB/IPoIB: Allow setting
the device address"), that doesn't hold and practically IPoIB bonds are
broken as of that. To fix it, lets go to fail over mac if the device
doesn't support the ndo OR this is IPoIB device.

As a by-product, this commit also prevents a stack corruption which
occurred when trying to copy 20 bytes (IPoIB) device address
to a sockaddr struct that has only 16 bytes of storage.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25 10:18:58 -07:00
David S. Miller
30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
Aviv Heller
a30b016808 bonding: fix enslavement slave link notifications
Currently, link notifications are not sent by
bond_set_slave_link_state() upon enslavement if
the slave is enslaved when up.

This happens because slave->link default init value
is 0, which is the same as BOND_LINK_UP, resulting
in bond_set_slave_link_state() ignoring this transition.

This patch sets the default value of slave->link to
BOND_LINK_NOCHANGE, assuring it will count as a state
transition and thus trigger notification logic.

Signed-off-by: Aviv Heller <avivh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 11:51:55 -07:00
Jiri Pirko
18bfb924f0 net: introduce default neigh_construct/destroy ndo calls for L2 upper devices
L2 upper device needs to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and use them in
team, bond, bridge and vlan.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 09:06:28 -07:00
Eric Dumazet
d3fff6c443 net: add netdev_lockdep_set_classes() helper
It is time to add netdev_lockdep_set_classes() helper
so that lockdep annotations per device type are easier to manage.

This removes a lot of copies and missing annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 13:28:37 -07:00
Eric Dumazet
f9eb8aea2a net_sched: transform qdisc running bit into a seqcount
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.

This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 16:37:13 -07:00
Eric Dumazet
fe30937b65 bonding: fix bond_get_stats()
bond_get_stats() can be called from rtnetlink (with RTNL held)
or from /proc/net/dev seq handler (with RCU held)

The logic added in commit 5f0c5f73e5 ("bonding: make global bonding
stats more reliable") kind of assumed only one cpu could run there.

If multiple threads are reading /proc/net/dev, stats can be really
messed up after a while.

A second problem is that some fields are 32bit, so we need to properly
handle the wrap around problem.

Given that RTNL is not always held, we need to use
bond_for_each_slave_rcu().

Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 23:14:15 -04:00
Zhang Shengju
1098cee6e1 bonding: remove duplicate set of flag IFF_MULTICAST
Remove unnecessary set of flag IFF_MULTICAST, since ether_setup
already does this.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-18 22:14:47 -04:00
David Decotigny
9856909c2a net: bonding: use __ethtool_get_ksettings
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 22:06:46 -05:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Jay Vosburgh
266b495f11 bonding: don't use stale speed and duplex information
There is presently a race condition between the bonding periodic
link monitor and the updating of a slave's speed and duplex.  The former
occurs on a periodic basis, and the latter in response to a driver's
calling of netif_carrier_on.

	It is possible for the periodic monitor to run between the
driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE
event that causes bonding to update the slave's speed and duplex.  This
manifests most notably as a report that a slave is up and "0 Mbps full
duplex" after enslavement, but in principle could report an incorrect
speed and duplex after any link up event if the device comes up with a
different speed or duplex.  This affects the 802.3ad aggregator
selection, as the speed and duplex are selection criteria.

	This is fixed by updating the speed and duplex in the periodic
monitor, prior to using that information.

	This was done historically in bonding, but the call to
bond_update_speed_duplex was removed in commit 876254ae27 ("bonding:
don't call update_speed_duplex() under spinlocks"), as it might sleep
under lock.  Later, the locking was changed to only hold RTNL, and so
after commit 876254ae27 ("bonding: don't call update_speed_duplex()
under spinlocks") this call is again safe.

Tested-by: "Tantilov, Emil S" <emil.s.tantilov@intel.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: dingtianhong <dingtianhong@huawei.com>
Fixes: 876254ae27 ("bonding: don't call update_speed_duplex() under spinlocks")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 15:14:01 -05:00
Jay Vosburgh
21a75f0915 bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.

	This logic can fail if there is no currently active slave.  In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave.  The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.

	This can arise if the bonding slaves and switch are working, but
the ARP target is not responding.  When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave.  This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.

	Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.

	This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.

Fixes: aeea64ac71 ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13 05:54:55 -05:00
Zhang Shengju
1e2a8868e0 bonding: use return instead of goto
Replace 'goto' with 'return' to remove unnecessary check at label:
err_undo_flags.

The reason is that 'err_undo_flags' do two things for the first slave device:
1.revert bond mac address if it is set by the slave device.
2.revert bond device type if it's not ARPHRD_ETHER.

It's not necessary for the following three places, they changed neither bond
mac address nor type. It's straightforward to return directly.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 09:57:00 -05:00
Zhang Shengju
d66bd9050b bonding: trivial: style fixes
remove some redudant brackets, use sizeof(*) instead of sizeof(struct x).

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08 10:45:42 -05:00
Zhang Shengju
c6140a299b bonding: add slave device name for debug
netdev_dbg() will add bond device name, it will be helpful if we print
slave device name.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:37:03 -05:00
Jarod Wilson
f344b0d940 bond: track sum of rx_nohandler for all slaves
Sample output with this set applied for an active-backup bond:

$ cat /sys/devices/virtual/net/bond0/lower_p7p1/statistics/rx_nohandler
16568
$ cat /sys/devices/virtual/net/bond0/lower_p5p2/statistics/rx_nohandler
16583
$ cat /sys/devices/virtual/net/bond0/statistics/rx_nohandler
33151

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 02:59:51 -05:00
David S. Miller
9d367eddf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/ethernet/mellanox/mlxsw/spectrum.h
	drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 23:55:43 -05:00
Karl Heiss
03d84a5f83 bonding: Prevent IPv6 link local address on enslaved devices
Commit 1f718f0f4f ("bonding: populate neighbour's private on enslave")
undoes the fix provided by commit c2edacf80e ("bonding / ipv6: no addrconf
for slaves separately from master") by effectively setting the slave flag
after the slave has been opened.  If the slave comes up quickly enough, it
will go through the IPv6 addrconf before the slave flag has been set and
will get a link local IPv6 address.

In order to ensure that addrconf knows to ignore the slave devices on state
change, set IFF_SLAVE before dev_open() during bonding enslavement.

Fixes: 1f718f0f4f ("bonding: populate neighbour's private on enslave")
Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:55:41 -05:00
Tom Herbert
a188222b6e net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:08 -05:00
yzhu1
ce3ea1c705 net: bonding: remove redudant brackets
It is not necessary to use two brackets. As such, the redudant brackets
are removed.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-04 00:07:14 -05:00
Jiri Pirko
57beaca8ec bonding: set inactive flags on release
Be correct and symmetric to enslave and set inactive flags during release.
That gives LAG offload drivers - lower state change listeners - possibility
to do proper cleanup.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:27 -05:00
Jiri Pirko
f7c7eb7f7a bonding: implement lower state change propagation
Let netdev notifier listeners know about link and slave state change.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:27 -05:00
Jiri Pirko
5d397061ca bonding: allow notifications for bond_set_slave_link_state
Similar to state notifications.

We allow caller to indicate if the notification should happen now or later,
depending on if he holds rtnl mutex or not. Introduce bond_slave_link_notify
function (similar to bond_slave_state_notify) which is later on called
with rtnl mutex and goes over slaves and executes delayed notification.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:27 -05:00
Jiri Pirko
41f0b04964 bonding: fill-up LAG changeupper info struct and pass it along
Initialize netdev_lag_upper_info structure by TX type according to
current bonding mode and pass it along via netdev_master_upper_dev_link.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:26 -05:00
Jiri Pirko
29bf24afb2 net: add possibility to pass information about upper device via notifier
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:25 -05:00
Jiri Pirko
6dffb0447c net: propagate upper priv via netdev_master_upper_dev_link
Eliminate netdev_master_upper_dev_link_private and pass priv directly as
a parameter of netdev_master_upper_dev_link.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:25 -05:00
Jay Vosburgh
40baec2257 bonding: fix panic on non-ARPHRD_ETHER enslave failure
Since commit 7d5cd2ce529b, when bond_enslave fails on devices that
are not ARPHRD_ETHER, if needed, it resets the bonding device back to
ARPHRD_ETHER by calling ether_setup.

	Unfortunately, ether_setup clobbers dev->flags, clearing IFF_UP
if the bond device is up, leaving it in a quasi-down state without
having actually gone through dev_close.  For bonding, if any periodic
work queue items are active (miimon, arp_interval, etc), those will
remain running, as they are stopped by bond_close.  At this point, if
the bonding module is unloaded or the bond is deleted, the system will
panic when the work function is called.

	This panic is resolved by calling dev_close on the bond itself
prior to calling ether_setup.

Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Fixes: 7d5cd2ce52 ("bonding: correctly handle bonding type change on enslave failure")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-07 13:17:32 -05:00
Mahesh Bandewar
52bc671681 bonding: simplify / unify event handling code for 3ad mode.
Old logic of updating state-machine is not required since
ad_update_actor_keys() does it implicitly. The only loss is
the notification differentiation between speed vs. duplex
change. Now only one unified notification is printed.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:52:24 -05:00
Eric Dumazet
e87eb4051e bonding: support encapsulated ipv6 TSO
If using a sixtofour device on top of a bonding device,
skb segmentation of TCP traffic is done right before calling
bonding xmit, because bonding only enables TSO for IPv4.

This patch improves single flow performance by about 120 % on my hosts,
because segmentation is deferred right before calling slave xmit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:29:28 -07:00
Eric Dumazet
4b1b865e4e bonding: use l4 hash if available
If skb carries a l4 hash, no need to perform a flow dissection.

Performance is slightly better :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39012e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39393e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.39988e+06

After patch :

lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.43579e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44304e+06
lpaa5:~# ./super_netperf 200 -H lpaa6 -t TCP_RR -l 100
2.44312e+06

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:01:05 -07:00
Tom Herbert
cd79a2382a flow_dissector: Add flags argument to skb_flow_dissector functions
The flags argument will allow control of the dissection process (for
instance whether to parse beyond L3).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-01 15:06:22 -07:00
David S. Miller
06fb4e701b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-30 21:45:01 -07:00
Nikolay Aleksandrov
b0d4943eec bonding: fix bond_poll_controller bh_enable warning
The problem is rcu_read_unlock_bh() which triggers a warning when irqs are
disabled. ndo_poll_controller should run with irqs disabled always so we
can drop the rcu_read_lock_bh.

[   98.502922] bond0: making interface eth1 the new active one
[   98.503039] ------------[ cut here ]------------
[   98.503039] WARNING: CPU: 0 PID: 1744 at kernel/softirq.c:150 __local_bh_enable_ip+0x96/0xc0()
[   98.503039] Modules linked in: bonding(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netconsole ppdev joydev parport_pc serio_raw parport i2c_piix4 video acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net e1000 ata_generic pcnet32 mii virtio_pci virtio_ring virtio pata_acpi
[   98.503039] CPU: 0 PID: 1744 Comm: ifenslave Tainted: G           OE   4.2.0-rc7+ #56
[   98.503039] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   98.503039]  0000000000000000 00000000e96ba230 ffff880020c236b8 ffffffff8183f105
[   98.503039]  0000000000000000 0000000000000000 ffff880020c236f8 ffffffff810a9496
[   98.503039]  ffff88002ea99e08 0000000000000200 ffffffffa02a8e06 ffff88002ea99e08
[   98.503039] Call Trace:
[   98.503039]  [<ffffffff8183f105>] dump_stack+0x4c/0x65
[   98.503039]  [<ffffffff810a9496>] warn_slowpath_common+0x86/0xc0
[   98.503039]  [<ffffffffa02a8e06>] ? bond_poll_controller+0x146/0x250 [bonding]
[   98.503039]  [<ffffffff810a95ca>] warn_slowpath_null+0x1a/0x20
[   98.503039]  [<ffffffff810ae376>] __local_bh_enable_ip+0x96/0xc0
[   98.503039]  [<ffffffffa02a8e2f>] bond_poll_controller+0x16f/0x250 [bonding]
[   98.503039]  [<ffffffffa02a8cf3>] ? bond_poll_controller+0x33/0x250 [bonding]
[   98.503039]  [<ffffffff810feaed>] ? trace_hardirqs_off+0xd/0x10
[   98.503039]  [<ffffffff81848afb>] ? _raw_spin_unlock_irqrestore+0x5b/0x60
[   98.503039]  [<ffffffff816ec48e>] netpoll_poll_dev+0x6e/0x350
[   98.503039]  [<ffffffff816eb977>] ? netpoll_start_xmit+0x137/0x1d0
[   98.503039]  [<ffffffff816b2e8b>] ? __alloc_skb+0x5b/0x210
[   98.503039]  [<ffffffff816ec89d>] netpoll_send_skb_on_dev+0x12d/0x2a0
[   98.503039]  [<ffffffff816eccde>] netpoll_send_udp+0x2ce/0x430
[   98.503039]  [<ffffffffa0190850>] write_msg+0xb0/0xf0 [netconsole]
[   98.503039]  [<ffffffff81116b63>] call_console_drivers.constprop.25+0x133/0x260
[   98.503039]  [<ffffffff81117934>] console_unlock+0x2f4/0x580
[   98.503039]  [<ffffffff81117ea5>] ? vprintk_emit+0x2e5/0x630
[   98.503039]  [<ffffffff81117ee5>] vprintk_emit+0x325/0x630
[   98.503039]  [<ffffffff81118379>] vprintk_default+0x29/0x40
[   98.503039]  [<ffffffff8183de4f>] printk+0x55/0x6b
[   98.503039]  [<ffffffff816c754c>] __netdev_printk+0x16c/0x260
[   98.503039]  [<ffffffff816c7a12>] netdev_info+0x62/0x80
[   98.503039]  [<ffffffffa02ab464>] bond_change_active_slave+0x134/0x6a0 [bonding]
[   98.503039]  [<ffffffffa02aba95>] bond_select_active_slave+0xc5/0x310 [bonding]
[   98.503039]  [<ffffffffa02aeb78>] bond_enslave+0x1088/0x10c0 [bonding]
[   98.503039]  [<ffffffffa02af46b>] bond_do_ioctl+0x37b/0x400 [bonding]
[   98.503039]  [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10
[   98.503039]  [<ffffffff816dc437>] ? rtnl_lock+0x17/0x20
[   98.503039]  [<ffffffff816e5fd1>] dev_ifsioc+0x331/0x3e0
[   98.503039]  [<ffffffff816e62dc>] dev_ioctl+0xec/0x6c0
[   98.503039]  [<ffffffff816a6c6a>] sock_do_ioctl+0x4a/0x60
[   98.503039]  [<ffffffff816a7300>] sock_ioctl+0x1c0/0x250
[   98.503039]  [<ffffffff81271bfe>] do_vfs_ioctl+0x2ee/0x540
[   98.503039]  [<ffffffff810fd943>] ? up_read+0x23/0x40
[   98.503039]  [<ffffffff81070993>] ? __do_page_fault+0x1d3/0x420
[   98.503039]  [<ffffffff8127e246>] ? __fget_light+0x66/0x90
[   98.503039]  [<ffffffff81271ec9>] SyS_ioctl+0x79/0x90
[   98.503039]  [<ffffffff8184936e>] entry_SYSCALL_64_fastpath+0x12/0x76
[   98.503039] ---[ end trace 00cfa804b0670051 ]---

Fixes: 616f45416c ("bonding: implement bond_poll_controller()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 22:25:43 -07:00
Phil Sutter
1e6f20ca6c net: bonding: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
David S. Miller
182ad468e7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cavium/Kconfig

The cavium conflict was overlapping dependency
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 16:23:11 -07:00
Venkat Venkatsubra
b02e3e948d bonding: Gratuitous ARP gets dropped when first slave added
When the first slave is added (such as during bootup) the first
gratuitous ARP gets dropped. We don't see this drop during a failover.
The packet gets dropped in qdisc (noop_enqueue).

The fix is to delay the sending of gratuitous ARPs till the bond dev's
carrier is present.

It can also be worked around by setting num_grat_arp to more than 1.

Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12 14:37:33 -07:00
David S. Miller
c5e40ee287 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bridge/br_mdb.c

br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23 00:41:16 -07:00
dingtianhong
a951bc1e6b bonding: correct the MAC address for "follow" fail_over_mac policy
The "follow" fail_over_mac policy is useful for multiport devices that
either become confused or incur a performance penalty when multiple
ports are programmed with the same MAC address, but the same MAC
address still may happened by this steps for this policy:

1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
   bond0 has the same mac address with eth0, it is MAC1.

2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
   eth1 is backup, eth1 has MAC2.

3) ifconfig eth0 down
   eth1 became active slave, bond will swap MAC for eth0 and eth1,
   so eth1 has MAC1, and eth0 has MAC2.

4) ifconfig eth1 down
   there is no active slave, and eth1 still has MAC1, eth2 has MAC2.

5) ifconfig eth0 up
   the eth0 became active slave again, the bond set eth0 to MAC1.

Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same
MAC address, it will break this policy for ACTIVE_BACKUP mode.

This patch will fix this problem by finding the old active slave and
swap them MAC address before change active slave.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:29:40 -07:00
Nikolay Aleksandrov
7d5cd2ce52 bonding: correctly handle bonding type change on enslave failure
If the bond is enslaving a device with different type it will be setup
by it, but if after being setup the enslave fails the bond doesn't
switch back its type and also keeps pointers to foreign structures that can
be long gone. Thus revert back any type changes if the enslave failed and
the bond had to change its type.
Example:
 Before patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default
    link/loopback 16:54:78:34:bd:41 brd 00:00:00:00:00:00
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
(notice the MASTER flag is gone)

 After patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
    link/ether 6e:66:94:f6:07:fc brd ff:ff:ff:ff:ff:ff
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: e36b9d16c6 ("bonding: clean muticast addresses when device changes type")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 16:23:06 -07:00
Nikolay Aleksandrov
06f6d1094a bonding: fix destruction of bond with devices different from arphrd_ether
When the bonding is being unloaded and the netdevice notifier is
unregistered it executes NETDEV_UNREGISTER for each device which should
remove the bond's proc entry but if the device enslaved is not of
ARPHRD_ETHER type and is in front of the bonding, it may execute
bond_release_and_destroy() first which would release the last slave and
destroy the bond device leaving the proc entry and thus we will get the
following error (with dynamic debug on for bond_netdev_event to see the
events order):
[  908.963051] eql: event: 9
[  908.963052] eql: IFF_SLAVE
[  908.963054] eql: event: 2
[  908.963056] eql: IFF_SLAVE
[  908.963058] eql: event: 6
[  908.963059] eql: IFF_SLAVE
[  908.963110] bond0: Releasing active interface eql
[  908.976168] bond0: Destroying bond bond0
[  908.976266] bond0 (unregistering): Released all slaves
[  908.984097] ------------[ cut here ]------------
[  908.984107] WARNING: CPU: 0 PID: 1787 at fs/proc/generic.c:575
remove_proc_entry+0x112/0x160()
[  908.984110] remove_proc_entry: removing non-empty directory
'net/bonding', leaking at least 'bond0'
[  908.984111] Modules linked in: bonding(-) eql(O) 9p nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev qxl drm_kms_helper
snd_hda_codec_generic aesni_intel ttm aes_x86_64 glue_helper pcspkr lrw
gf128mul ablk_helper cryptd snd_hda_intel virtio_console snd_hda_codec
psmouse serio_raw snd_hwdep snd_hda_core 9pnet_virtio 9pnet evdev joydev
drm virtio_balloon snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core
pvpanic acpi_cpufreq parport_pc parport processor thermal_sys button
autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sg sr_mod cdrom
ata_generic virtio_blk virtio_net floppy ata_piix e1000 libata ehci_pci
virtio_pci scsi_mod uhci_hcd ehci_hcd virtio_ring virtio usbcore
usb_common [last unloaded: bonding]

[  908.984168] CPU: 0 PID: 1787 Comm: rmmod Tainted: G        W  O
4.2.0-rc2+ #8
[  908.984170] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  908.984172]  0000000000000000 ffffffff81732d41 ffffffff81525b34
ffff8800358dfda8
[  908.984175]  ffffffff8106c521 ffff88003595af78 ffff88003595af40
ffff88003e3a4280
[  908.984178]  ffffffffa058d040 0000000000000000 ffffffff8106c59a
ffffffff8172ebd0
[  908.984181] Call Trace:
[  908.984188]  [<ffffffff81525b34>] ? dump_stack+0x40/0x50
[  908.984193]  [<ffffffff8106c521>] ? warn_slowpath_common+0x81/0xb0
[  908.984196]  [<ffffffff8106c59a>] ? warn_slowpath_fmt+0x4a/0x50
[  908.984199]  [<ffffffff81218352>] ? remove_proc_entry+0x112/0x160
[  908.984205]  [<ffffffffa05850e6>] ? bond_destroy_proc_dir+0x26/0x30
[bonding]
[  908.984208]  [<ffffffffa057540e>] ? bond_net_exit+0x8e/0xa0 [bonding]
[  908.984217]  [<ffffffff8142f407>] ? ops_exit_list.isra.4+0x37/0x70
[  908.984225]  [<ffffffff8142f52d>] ?
unregister_pernet_operations+0x8d/0xd0
[  908.984228]  [<ffffffff8142f58d>] ?
unregister_pernet_subsys+0x1d/0x30
[  908.984232]  [<ffffffffa0585269>] ? bonding_exit+0x23/0xdba [bonding]
[  908.984236]  [<ffffffff810e28ba>] ? SyS_delete_module+0x18a/0x250
[  908.984241]  [<ffffffff81086f99>] ? task_work_run+0x89/0xc0
[  908.984244]  [<ffffffff8152b732>] ?
entry_SYSCALL_64_fastpath+0x16/0x75
[  908.984247] ---[ end trace 7c006ed4abbef24b ]---

Thus remove the proc entry manually if bond_release_and_destroy() is
used. Because of the checks in bond_remove_proc_entry() it's not a
problem for a bond device to change namespaces (the bug fixed by the
Fixes commit) but since commit
f939981492 ("bonding: Don't allow bond devices to change network
namespaces.") that can't happen anyway.

Reported-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: a64d49c3dd ("bonding: Manage /proc/net/bonding/ entries from
                      the netdev events")
Tested-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:56:11 -07:00
Nikolay Aleksandrov
22f94e6256 bonding: trivial: remove unused variables
Get rid of these:
drivers/net/bonding//bond_main.c: In function ‘bond_update_slave_arr’:
drivers/net/bonding//bond_main.c:3754:6: warning: variable
‘slaves_in_agg’ set but not used [-Wunused-but-set-variable]
  int slaves_in_agg;
      ^
  CC [M]  drivers/net/bonding//bond_3ad.o
drivers/net/bonding//bond_3ad.c: In function
‘ad_marker_response_received’:
drivers/net/bonding//bond_3ad.c:1870:61: warning: parameter ‘marker’
set but not used [-Wunused-but-set-parameter]
 static void ad_marker_response_received(struct bond_marker *marker,
                                                             ^
drivers/net/bonding//bond_3ad.c:1871:19: warning: parameter ‘port’ set
but not used [-Wunused-but-set-parameter]
      struct port *port)
                   ^

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:50:32 -07:00
Mazhar Rana
b5a983f314 bonding: "primary_reselect" with "failure" is not working properly
When "primary_reselect" is set to "failure", primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a "primary" interface and "primary_reselect"
is set to "failure" then whenever primary interface's link get back(up)
it become active slave even if current active slave is still up.

With this patch, "bond_find_best_slave" will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.

Signed-off-by: Mazhar Rana <mazhar.rana@cyberoam.com>
Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:06:08 -07:00
Tom Herbert
c3f8324188 net: Add full IPv6 addresses to flow_keys
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how may words to input to jhash2.

We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.

With this patch, Ethertype and IP protocol are now included in the
flow hash input.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-04 15:44:30 -07:00
Samudrala, Sridhar
45d4122ca7 switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops

v3: updated to sync with named union changes.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:49:09 -04:00
Jiri Pirko
06635a35d1 flow_dissect: use programable dissector in skb_flow_dissect and friends
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
1bd758eb1c net: change name of flow_dissector header to match the .c file name
add couple of empty lines on the way.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Scott Feldman
7889cbee83 switchdev: remove NETIF_F_HW_SWITCH_OFFLOAD feature flag
Roopa said remove the feature flag for this series and she'll work on
bringing it back if needed at a later date.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
85fdb95672 switchdev: cut over to new switchdev_port_bridge_getlink
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
54ba5a0bbc switchdev: cut over to new switchdev_port_bridge_dellink
Rocker, bonding and team and switch over to the new
switchdev_port_bridge_dellink to avoid duplicating code in each driver.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
fc8f40d864 switchdev: cut over to new switchdev_port_bridge_setlink
Rocker, bonding, and team can now use the switchdev bridge setlink to parse
raw netlink; no need to duplicate this code in each driver.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:54 -04:00
Jiri Pirko
ebb9a03a59 switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:52 -04:00
Mahesh Bandewar
d22a5fc0c3 bonding: Implement user key part of port_key in an AD system.
The port key has three components - user-key, speed-part, and duplex-part.
The LSBit is for the duplex-part, next 5 bits are for the speed while the
remaining 10 bits are the user defined key bits. Get these 10 bits
from the user-space (through the SysFs interface) and use it to form the
admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
it is not provided then use zero for the user-key-bits (default).

It can set using following example code -

   # modprobe bonding mode=4
   # usr_port_key=$(( RANDOM & 0x3FF ))
   # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
   # echo +eth1 > /sys/class/net/bond0/bonding/slaves
   ...
   # ip link set bond0 up

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
     * fixed up context from change in ad_actor_sys_prio patch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:59:32 -04:00
Mahesh Bandewar
7451495755 bonding: Allow userspace to set actors' macaddr in an AD-system.
In an AD system, the communication between actor and partner is the
business between these two entities. In the current setup anyone on the
same L2 can "guess" the LACPDU contents and then possibly send the
spoofed LACPDUs and trick the partner causing connectivity issues for
the AD system. This patch allows to use a random mac-address obscuring
it's identity making it harder for someone in the L2 is do the same thing.

This patch allows user-space to choose the mac-address for the AD-system.
This mac-address can not be NULL or a Multicast. If the mac-address is set
from user-space; kernel will honor it and will not overwrite it. In the
absence (value from user space); the logic will default to using the
masters' mac as the mac-address for the AD-system.

It can be set using example code below -

   # modprobe bonding mode=4
   # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                    $(( (RANDOM & 0xFE) | 0x02 )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )) \
                    $(( RANDOM & 0xFF )))
   # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
   # echo +eth1 > /sys/class/net/bond0/bonding/slaves
   ...
   # ip link set bond0 up

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: fixed up style issues reported by checkpatch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:59:32 -04:00
Mahesh Bandewar
6791e4661c bonding: Allow userspace to set actors' system_priority in AD system
This patch allows user to randomize the system-priority in an ad-system.
The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
does not specify this value, the system defaults to 0xFFFF, which is
what it was before this patch.

Following example code could set the value -
    # modprobe bonding mode=4
    # sys_prio=$(( 1 + RANDOM + RANDOM ))
    # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
    # echo +eth1 > /sys/class/net/bond0/bonding/slaves
    ...
    # ip link set bond0 up

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
     * changed how the default value is set in bond_check_params(), this
       makes the default consistent between what gets set for a new bond
       and what the default is claimed to be in the bonding options.]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:59:31 -04:00
Pai
e913fb279c net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table
This patch fixes a Kernel Panic in bonding driver debugfs file: rlb_hash_table.

$> modprobe bonding mode=6
$> cat /sys/kernel/debug/bonding/bond0/rlb_hash_table

This will crash the kernel. The struct alb_bond_info is initialized only when
the bonding interface is initialized (ip link set bond0 up) and not at the time
it is allocated. If we try to read the table before that, it'll result in a
kernel panic.

The patch applies against both net and net-next

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 15:37:19 -04:00
Matan Barak
73b5a6f2a7 net/bonding: Make DRV macros private
The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION

Fixing that by defining a private bonding_priv.h
header files which includes those defines.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-26 22:59:53 -04:00
David S. Miller
9f0d34bc34 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	drivers/net/usb/sr9800.c
	drivers/net/usb/usbnet.c
	include/linux/usb/usbnet.h
	net/ipv4/tcp_ipv4.c
	net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes.  In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:16:53 -04:00
Anton Nayshtut
f5e2dc5d7f bonding: Bonding Overriding Configuration logic restored.
Before commit 3900f29021 ("bonding: slight
optimizztion for bond_slave_override()") the override logic was to send packets
with non-zero queue_id through the slave with corresponding queue_id, under two
conditions only - if the slave can transmit and it's up.

The above mentioned commit changed this logic by introducing an additional
condition - whether the bond is active (indirectly, using the slave_can_tx and
later - bond_is_active_slave), that prevents the user from implementing more
complex policies according to the Documentation/networking/bonding.txt.

Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: Alexey Bogoslavsky <alexey@swortex.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:49:45 -04:00
Toshiaki Makita
4847f04982 bonding: Don't segment multiple tagged packets on bonding device
Bonding devices don't need to segment multiple tagged packets since their
slaves can segment them.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 13:33:24 -07:00
Mahesh Bandewar
616f45416c bonding: implement bond_poll_controller()
This patches implements the poll_controller support for all
bonding driver. If the slaves have poll_controller net_op defined,
this implementation calls them. This is mode agnostic implementation
and iterates through all slaves (based on mode) and calls respective
handler.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:40:42 -05:00
Mahesh Bandewar
950ddcb1c1 bonding: simple code refactor
Remove duplicate code.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20 17:43:29 -05:00
Moni Shoua
92e584fe44 net/bonding: Fix potential bad memory access during bonding events
When queuing work to send the NETDEV_BONDING_INFO netdev event, it's
possible that when the work is executed, the pointer to the slave
becomes invalid. This can happen if between queuing the event and the
execution of the work, the net-device was un-ensvaled and re-enslaved.

Fix that by queuing a work with the data of the slave instead of the
slave structure.

Fixes: 69e6113343 ('net/bonding: Notify state change on slaves')
Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 14:03:53 -08:00
Moni Shoua
69e6113343 net/bonding: Notify state change on slaves
Use notifier chain to dispatch an event upon a change in slave state.
Event is dispatched with slave specific info.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 16:14:24 -08:00
Moni Shoua
69a2338e05 net/bonding: Move slave state changes to a helper function
Move slave state changes to a helper function, this is a pre-step for adding
functionality of dispatching an event when this helper is called.

This commit doesn't add new functionality.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 16:14:24 -08:00
Roopa Prabhu
c158cba38c bonding: handle NETIF_F_HW_SWITCH_OFFLOAD flag and add ndo_bridge_setlink/dellink handlers
We want bond to pick up the offload flag if any of its slaves have it.

NETIF_F_HW_SWITCH_OFFLOAD flag is added to the mask, so that
netdev_increment_features does not ignore it.

This also adds ndo_bridge_setlink and ndo_bridge_dellink handlers.
These currently point to the default handlers provided by the
switchdev api.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-01 23:16:34 -08:00
Jonathan Toppins
303691042d bonding: cleanup and remove dead code
fix sparse warning about non-static function

drivers/net/bonding/bond_main.c:3737:5: warning: symbol
'bond_3ad_xor_xmit' was not declared. Should it be static?

Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 17:09:04 -08:00
Wilson Kok
8bbe71a595 bonding: fix bond_open() don't always set slave active flag
Mode 802.3ad, fix incorrect bond slave active state when slave is not in
active aggregator. During bond_open(), the bonding driver always sets
the slave active flag to true if the bond is not in active-backup, alb,
or tlb modes. Bonding should let the aggregator selection logic set the
active flag when in 802.3ad mode.

Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 17:09:03 -08:00
Jonathan Toppins
2477bc9a3d bonding: update bond carrier state when min_links option changes
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 17:09:03 -08:00
Eric Dumazet
24f87d4ce1 bonding: handle more gso types
In commit 5a7baa7885 ("bonding: Advertize vxlan offload features when
supported"), Or Gerlitz added support conditional vxlan offload.

In this patch I also add support for all kind of tunnels,
but we allow a bonding device to not require segmentation,
as it is always better to make this segmentation at the very last stage,
if a particular slave device requires it.

Tested:

 Setup a GRE tunnel,
 on a physical NIC not having tx-gre-segmentation.
 Results on bnx2x are even better, as we no longer have to segment
 in software.

ethtool -K bond0 tx-gre-segmentation off

super_netperf 50 --google-pacing-rate 30000000 -H 10.7.8.152 -l 15
7538.32

ethtool -K bond0 tx-gre-segmentation on

super_netperf 50 --google-pacing-rate 30000000 -H 10.7.8.152 -l 15
10200.5

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 23:34:23 -08:00
Wengang Wang
a22a9e4141 bonding: change error message to debug message in __bond_release_one()
In __bond_release_one(), when the interface is not a slave or not a slave of
"this" master, it log error message.

The message actually should be a debug message matching what bond_enslave()
does.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27 02:20:55 -05:00
David S. Miller
1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Jiri Pirko
62749e2cb3 vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko
b4bef1b575 vlan: kill vlan_put_tag helper
Since both tx and rx paths work with skb->vlan_tci, there's no need for
this function anymore. Switch users directly to __vlan_hwaccel_put_tag.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Nikolay Aleksandrov
b8e4500f42 bonding: fix curr_active_slave/carrier with loadbalance arp monitoring
Since commit 6fde8f037e ("bonding: fix locking in
bond_loadbalance_arp_mon()") we can have a stale bond carrier state and
stale curr_active_slave when using arp monitoring in loadbalance modes. The
reason is that in bond_loadbalance_arp_mon() we can't have
do_failover == true but slave_state_changed == false, whenever do_failover
is true then slave_state_changed is also true. Then the following piece
from bond_loadbalance_arp_mon():
                if (slave_state_changed) {
                        bond_slave_state_change(bond);
                        if (BOND_MODE(bond) == BOND_MODE_XOR)
                                bond_update_slave_arr(bond, NULL);
                } else if (do_failover) {
                        block_netpoll_tx();
                        bond_select_active_slave(bond);
                        unblock_netpoll_tx();
                }

will execute only the first branch, always and regardless of do_failover.
Since these two events aren't related in such way, we need to decouple and
consider them separately.

For example this issue could lead to the following result:
Bonding Mode: load balancing (round-robin)
*MII Status: down*
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 192.168.9.2

Slave Interface: ens12
*MII Status: up*
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:0f:53:01:42:2c
Slave queue ID: 0

Slave Interface: eth1
*MII Status: up*
Speed: Unknown
Duplex: Unknown
Link Failure Count: 70
Permanent HW addr: 52:54:00:2f:0f:8e
Slave queue ID: 0

Since some interfaces are up, then the status of the bond should also be
up, but it will never change unless something invokes bond_set_carrier()
(i.e. enslave, bond_select_active_slave etc). Now, if I force the
calling of bond_select_active_slave via for example changing
primary_reselect (it can change in any mode), then the MII status goes to
"up" because it calls bond_select_active_slave() which should've been done
from bond_loadbalance_arp_mon() itself.

CC: Veaceslav Falico <vfalico@gmail.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>

Fixes: 6fde8f037e ("bonding: fix locking in bond_loadbalance_arp_mon()")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:09:04 -05:00
Michal Kubeček
fbe168ba91 net: generic dev_disable_lro() stacked device handling
Large receive offloading is known to cause problems if received packets
are passed to other host. Therefore the kernel disables it by calling
dev_disable_lro() whenever a network device is enslaved in a bridge or
forwarding is enabled for it (or globally). For virtual devices we need
to disable LRO on the underlying physical device (which is actually
receiving the packets).

Current dev_disable_lro() code handles this  propagation for a vlan
(including 802.1ad nested vlan), macvlan or a vlan on top of a macvlan.
It doesn't handle other stacked devices and their combinations, in
particular propagation from a bond to its slaves which often causes
problems in virtualization setups.

As we now have generic data structures describing the upper-lower device
relationship, dev_disable_lro() can be generalized to disable LRO also
for all lower devices (if any) once it is disabled for the device
itself.

For bonding and teaming devices, it is necessary to disable LRO not only
on current slaves at the moment when dev_disable_lro() is called but
also on any slave (port) added later.

v2: use lower device links for all devices (including vlan and macvlan)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 14:48:56 -05:00
David S. Miller
1ef8019be8 net: Move bonding headers under include/net
This ways drivers like cxgb4 don't need to do ugly relative includes.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 13:27:49 -05:00
Eric Dumazet
31aa860e0a bonding: add bond_tx_drop() helper
Because bonding stats are usually sum of slave stats, it was
not easy to account for tx drops at bonding layer.

We can use dev->tx_dropped for this, as this counter is later
added to the device stats (in dev_get_stats())

This extends the idea we had in commit ee63771474 ("bonding: Simplify
the xmit function for modes that use xmit_hash") for bond_3ad_xor_xmit()
to other bonding modes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31 16:09:03 -04:00
Eric Dumazet
0287587884 net: better IFF_XMIT_DST_RELEASE support
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.

Current handling of IFF_XMIT_DST_RELEASE is not optimal.

Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y

The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.

As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.

This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.

Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :

	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
	netif_keep_dst(dev);

Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.

The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07 13:22:11 -04:00
Mahesh Bandewar
ee63771474 bonding: Simplify the xmit function for modes that use xmit_hash
Earlier change to use usable slave array for TLB mode had an additional
performance advantage. So extending the same logic to all other modes
that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
Also consolidating this with the earlier TLB change.

The main idea is to build the usable slaves array in the control path
and use that array for slave selection during xmit operation.

Measured performance in a setup with a bond of 4x1G NICs with 200
instances of netperf for the modes involved (3ad, xor, tlb)
cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5

Mode        TPS-Before   TPS-After

802.3ad   : 468,694      493,101
TLB (lb=0): 392,583      392,965
XOR       : 475,696      484,517

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06 17:13:07 -04:00
Andy Gospodarek
5f0c5f73e5 bonding: make global bonding stats more reliable
As the code stands today, bonding stats are based simply on the stats
from the member interfaces.  If a member was to be removed from a bond,
the stats would instantly drop.  This would be confusing to an admin
would would suddonly see interface stats drop while traffic is still
flowing.

In addition to preventing the stats drops mentioned above, new members
will now be added to the bond and only traffic received after the member
was added to the bond will be counted as part of bonding stats.  Bonding
counters will also be updated when any slaves are dropped to make sure
the reported stats are reliable.

v2: Changes suggested by Nik to properly allocate/free stats memory.
v3: Properly destroy workqueue and fix netlink configuration path.
v4: Moved cached stats into bonding and slave structs as there does not
seem to be a complexity/performance benefit to using alloc'd memory vs
in-struct memory.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:20:07 -04:00
David S. Miller
1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
dingtianhong
37ab7ddf3f bonding: remove the unnecessary notes for bond_xmit_broadcast()
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 15:21:01 -04:00
dingtianhong
a64d044e39 bonding: slight optimization for bond_xmit_roundrobin()
When the slave is the curr_active_slave, no need to check
whether the slave is active or not, it is always active.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 15:21:00 -04:00
Nikolay Aleksandrov
e0974585e7 bonding: consolidate ASSERT_RTNL()s and remove the unnecessary
Consolidate the calls to ASSERT_RTNL() before bond_select_active_slave()
inside bond_select_active_slave() itself and remove the ASSERT_RTNL()
from bond_hw_addr_swap() as it's not exported and its only caller -
bond_change_active_slave() already has an ASSERT_RTNL().

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15 17:19:50 -04:00
Nikolay Aleksandrov
547942cace bonding: trivial: style and comment fixes
First adjust a couple of locking comments that were left inaccurate,
then adjust comments to use the netdev styling and remove extra new
lines where necessary and add a couple of new lines between declarations
and code. These are all trivial styling changes, no functional change.
Also removed a couple of outdated or obvious comments.
This patch is by no means a complete fix of all netdev style violations
but it gets the bonding closer.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15 17:19:49 -04:00
Nikolay Aleksandrov
9a72c2da69 bonding: fix div by zero while enslaving and transmitting
The problem is that the slave is first linked and slave_cnt is
incremented afterwards leading to a div by zero in the modes that use it
as a modulus. What happens is that in bond_start_xmit()
bond_has_slaves() is used to evaluate further transmission and it becomes
true after the slave is linked in, but when slave_cnt is used in the xmit
path it is still 0, so fetch it once and transmit based on that. Since
it is used only in round-robin and XOR modes, the fix is only for them.
Thanks to Eric Dumazet for pointing out the fault in my first try to fix
this.

Call trace (took it out of net-next kernel, but it's the same with net):
[46934.330038] divide error: 0000 [#1] SMP
[46934.330041] Modules linked in: bonding(O) 9p fscache
snd_hda_codec_generic crct10dif_pclmul
[46934.330041] bond0: Enslaving eth1 as an active interface with an up
link
[46934.330051]  ppdev joydev crc32_pclmul crc32c_intel 9pnet_virtio
ghash_clmulni_intel snd_hda_intel 9pnet snd_hda_controller parport_pc
serio_raw pcspkr snd_hda_codec parport virtio_balloon virtio_console
snd_hwdep snd_pcm pvpanic i2c_piix4 snd_timer i2ccore snd soundcore
virtio_blk virtio_net virtio_pci virtio_ring virtio ata_generic
pata_acpi floppy [last unloaded: bonding]
[46934.330053] CPU: 1 PID: 3382 Comm: ping Tainted: G           O
3.17.0-rc4+ #27
[46934.330053] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[46934.330054] task: ffff88005aebf2c0 ti: ffff88005b728000 task.ti:
ffff88005b728000
[46934.330059] RIP: 0010:[<ffffffffa0198c33>]  [<ffffffffa0198c33>]
bond_start_xmit+0x1c3/0x450 [bonding]
[46934.330060] RSP: 0018:ffff88005b72b7f8  EFLAGS: 00010246
[46934.330060] RAX: 0000000000000679 RBX: ffff88004b077000 RCX:
000000000000002a
[46934.330061] RDX: 0000000000000000 RSI: ffff88004b3f0500 RDI:
ffff88004b077940
[46934.330061] RBP: ffff88005b72b830 R08: 00000000000000c0 R09:
ffff88004a83e000
[46934.330062] R10: 000000000000ffff R11: ffff88004b1f12c0 R12:
ffff88004b3f0500
[46934.330062] R13: ffff88004b3f0500 R14: 000000000000002a R15:
ffff88004b077940
[46934.330063] FS:  00007fbd91a4c740(0000) GS:ffff88005f080000(0000)
knlGS:0000000000000000
[46934.330064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46934.330064] CR2: 00007f803a8bb000 CR3: 000000004b2c9000 CR4:
00000000000406e0
[46934.330069] Stack:
[46934.330071]  ffffffff811e6169 00000000e772fa05 ffff88004b077000
ffff88004b3f0500
[46934.330072]  ffffffff81d17d18 000000000000002a 0000000000000000
ffff88005b72b8a0
[46934.330073]  ffffffff81620108 ffffffff8161fe0e ffff88005b72b8c4
ffff88005b302000
[46934.330073] Call Trace:
[46934.330077]  [<ffffffff811e6169>] ?
__kmalloc_node_track_caller+0x119/0x300
[46934.330084]  [<ffffffff81620108>] dev_hard_start_xmit+0x188/0x410
[46934.330086]  [<ffffffff8161fe0e>] ? harmonize_features+0x2e/0x90
[46934.330088]  [<ffffffff81620b06>] __dev_queue_xmit+0x456/0x590
[46934.330089]  [<ffffffff81620c50>] dev_queue_xmit+0x10/0x20
[46934.330090]  [<ffffffff8168f022>] arp_xmit+0x22/0x60
[46934.330091]  [<ffffffff8168f090>] arp_send.part.16+0x30/0x40
[46934.330092]  [<ffffffff8168f1e5>] arp_solicit+0x115/0x2b0
[46934.330094]  [<ffffffff8160b5d7>] ? copy_skb_header+0x17/0xa0
[46934.330096]  [<ffffffff8162875a>] neigh_probe+0x4a/0x70
[46934.330097]  [<ffffffff8162979c>] __neigh_event_send+0xac/0x230
[46934.330098]  [<ffffffff8162a00b>] neigh_resolve_output+0x13b/0x220
[46934.330100]  [<ffffffff8165f120>] ? ip_forward_options+0x1c0/0x1c0
[46934.330101]  [<ffffffff81660478>] ip_finish_output+0x1f8/0x860
[46934.330102]  [<ffffffff81661f08>] ip_output+0x58/0x90
[46934.330103]  [<ffffffff81661602>] ? __ip_local_out+0xa2/0xb0
[46934.330104]  [<ffffffff81661640>] ip_local_out_sk+0x30/0x40
[46934.330105]  [<ffffffff81662a66>] ip_send_skb+0x16/0x50
[46934.330106]  [<ffffffff81662ad3>] ip_push_pending_frames+0x33/0x40
[46934.330107]  [<ffffffff8168854c>] raw_sendmsg+0x88c/0xa30
[46934.330110]  [<ffffffff81612b31>] ? skb_recv_datagram+0x41/0x60
[46934.330111]  [<ffffffff816875a9>] ? raw_recvmsg+0xa9/0x1f0
[46934.330113]  [<ffffffff816978d4>] inet_sendmsg+0x74/0xc0
[46934.330114]  [<ffffffff81697a9b>] ? inet_recvmsg+0x8b/0xb0
[46934.330115] bond0: Adding slave eth2
[46934.330116]  [<ffffffff8160357c>] sock_sendmsg+0x9c/0xe0
[46934.330118]  [<ffffffff81603248>] ?
move_addr_to_kernel.part.20+0x28/0x80
[46934.330121]  [<ffffffff811b4477>] ? might_fault+0x47/0x50
[46934.330122]  [<ffffffff816039b9>] ___sys_sendmsg+0x3a9/0x3c0
[46934.330125]  [<ffffffff8144a14a>] ? n_tty_write+0x3aa/0x530
[46934.330127]  [<ffffffff810d1ae4>] ? __wake_up+0x44/0x50
[46934.330129]  [<ffffffff81242b38>] ? fsnotify+0x238/0x310
[46934.330130]  [<ffffffff816048a1>] __sys_sendmsg+0x51/0x90
[46934.330131]  [<ffffffff816048f2>] SyS_sendmsg+0x12/0x20
[46934.330134]  [<ffffffff81738b29>] system_call_fastpath+0x16/0x1b
[46934.330144] Code: 48 8b 10 4c 89 ee 4c 89 ff e8 aa bc ff ff 31 c0 e9
1a ff ff ff 0f 1f 00 4c 89 ee 4c 89 ff e8 65 fb ff ff 31 d2 4c 89 ee 4c
89 ff <f7> b3 64 09 00 00 e8 02 bd ff ff 31 c0 e9 f2 fe ff ff 0f 1f 00
[46934.330146] RIP  [<ffffffffa0198c33>] bond_start_xmit+0x1c3/0x450
[bonding]
[46934.330146]  RSP <ffff88005b72b7f8>

CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
Fixes: 278b208375 ("bonding: initial RCU conversion")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 17:16:56 -04:00
Nikolay Aleksandrov
8c0bc55028 bonding: adjust locking comments
Now that locks have been removed, remove some unnecessary comments and
adjust others to reflect reality. Also add a comment to "mode_lock" to
describe its current users and give a brief summary why they need it.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:29:07 -04:00
Nikolay Aleksandrov
e470259fa1 bonding: 3ad: convert to bond->mode_lock
Now that we have bond->mode_lock, we can remove the state_machine_lock
and use it in its place. There're no fast paths requiring the per-port
spinlocks so it should be okay to consolidate them into mode_lock.
Also move it inside the unbinding function as we don't want to expose
mode_lock outside of the specific modes.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:29:07 -04:00
Nikolay Aleksandrov
4bab16d7c9 bonding: alb: convert to bond->mode_lock
The ALB/TLB specific spinlocks are no longer necessary as we now have
bond->mode_lock for this purpose, so convert them and remove them from
struct alb_bond_info.
Also remove the unneeded lock/unlock functions and use spin_lock/unlock
directly.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:29:07 -04:00
Nikolay Aleksandrov
b743562819 bonding: convert curr_slave_lock to a spinlock and rename it
curr_slave_lock is now a misleading name, a much better name is
mode_lock as it'll be used for each mode's purposes and it's no longer
necessary to use a rwlock, a simple spinlock is enough.

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:29:07 -04:00
Nikolay Aleksandrov
1c72cfdc96 bonding: clean curr_slave_lock use
Mostly all users of curr_slave_lock already have RTNL as we've discussed
previously so there's no point in using it, the one case where the lock
must stay is the 3ad code, in fact it's the only one.
It's okay to remove it from bond_do_fail_over_mac() as it's called with
RTNL and drops the curr_slave_lock anyway.
bond_change_active_slave() is one of the main places where
curr_slave_lock was used, it's okay to remove it as all callers use RTNL
these days before calling it, that's why we move the ASSERT_RTNL() in
the beginning to catch any potential offenders to this rule.
The RTNL argument actually applies to all of the places where
curr_slave_lock has been removed from in this patch.
Also remove the unnecessary bond_deref_active_protected() macro and use
rtnl_dereference() instead.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:29:06 -04:00
Masanari Iida
37b7021d9d net:bonding: Add missing space in bonding driver parameter description
This patch adds missing space between "interface" and "by"
in bonding module parameter description.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 20:38:03 -07:00
Nikolay Aleksandrov
87163ef9cd bonding: remove last users of bond->lock and bond->lock itself
The usage of bond->lock in bond_main.c was completely unnecessary as it
didn't help to sync with anything, most of the spots already had RTNL.
Since there're no more users of bond->lock, remove it.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 17:31:36 -07:00
Nikolay Aleksandrov
059b47e8aa bonding: convert primary_slave to use RCU
This is necessary mainly for two bonding call sites: procfs and
sysfs as it was dereferenced without any real protection.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 17:31:35 -07:00
Nikolay Aleksandrov
bdbc5f1303 bonding: 3ad: use curr_slave_lock instead of bond->lock
In 3ad mode the only syncing needed by bond->lock is for the wq
and the recv handler, so change them to use curr_slave_lock.
There're no locking dependencies here as 3ad doesn't use
curr_slave_lock at all.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 17:31:35 -07:00
Dan Carpenter
a67eed571a bonding: fix a memory leak in bond_arp_send_all()
This test is reversed so the memory is always leaked.  It's better style
to remove the test anyway.

Fixes: 3e403a7777 ('bonding: make it possible to have unlimited nested upper vlans')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28 17:27:47 -07:00
Veaceslav Falico
3e403a7777 bonding: make it possible to have unlimited nested upper vlans
Currently we're limited by a constant level of vlan nestings, and fail to
find anything beyound that level (currently 2).

To fix this - remove the limit of nestings when going through device tree,
and when the end device is found - allocate the needed amount of vlan tags
and return them, instead of found/not found.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-20 20:35:00 -07:00
Veaceslav Falico
23fa5c2caa bonding: destroy proc directory only after all bonds are gone
Currently we might arrive to bond_net_exit() with some bonds left (that
were created while the module is unloading). We take care of that by
destroying sysfs (the last possibility to add new bonds) and then
destroying all the remaining bonds.

However, we destroy the /proc/net/bonding directory before destroying those
last bonds, and get a warning that we're trying to destroy a non-empty
proc directory (containing /proc/net/bonding/bondX).

Fix this by moving bond_destroy_proc_dir() after all the bonds are
destroyed, so that we're sure that no bonds exist.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17 16:33:27 -07:00
Veaceslav Falico
14056e7930 bonding: use rtnl_deref in bond_change_rx_flags()
As it's always called with RTNL held, via dev_set_allmulti/promiscuity.
Also, remove the wrong comment.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17 16:06:52 -07:00
Jianhua Xie
ce04d63502 bonding: enhance L2 hash helper with packet type
Current L2 hash helper calculates destination eth addr and
source ether addr as L2 hash factors.  This patch is adding
packet type ID field into L2 hash factors.  While one of
BOND_XMIT_POLICY_LAYER2 or BOND_XMIT_POLICY_{LAYER|ENCAP}23
is applied, for the 2nd level hash, enhanced hash method can
help to distribute different types of packets like IPv4/IPv6
packets to different slave devices.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
CC: Pan Jiafei <Jiafei.Pan@freescale.com>

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17 16:03:27 -07:00
David S. Miller
1a98c69af1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 14:09:34 -07:00
Veaceslav Falico
f338532327 bonding: remove pr_fmt from bond_main.c
To maintain the same message structure as netdev_* functions print.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 23:15:58 -07:00
Veaceslav Falico
76444f5052 bonding: convert bond_main.c to use netdev_printk instead of pr_
Converted only the parts where we've had a valid net_device, skipping the
init/deinit and options verification.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 23:15:57 -07:00
Veaceslav Falico
f54424412b bonding: permit enslaving interfaces without set_mac support
Currently we exit if the slave isn't the first slave, doesn't support mac
address setting and fail_over_mac isn't FOM_ACTIVE. It's wrong because we
only require ndo_set_mac_address in case bonding is in active-backup mode
and FOM isn't FOM_ACTIVE.

To fix this - only exit with an error if we're in a/b mode and have
fail_over_mac != FOM_ACTIVE.

Also, maintain current behaviour on the first slave (forcibly change fom to
FOM_ACTIVE) to not break anyone's configuration.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 22:54:49 -07:00
Eric Dumazet
8574171833 bonding: add proper __rcu annotation for current_arp_slave
Using __rcu annotation actually helps to spot all accesses to
bond->current_arp_slave are correctly protected, with LOCKDEP support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 17:49:42 -07:00
Eric Dumazet
4740d63827 bonding: add proper __rcu annotation for curr_active_slave
RCU was added to bonding in linux-3.12 but lacked proper sparse annotations.

Using __rcu annotation actually helps to spot all accesses to bond->curr_active_slave
are correctly protected, with LOCKDEP support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 17:49:42 -07:00
Tom Gundersen
c835a67733 net: set name_assign_type in alloc_netdev()
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
all users to pass NET_NAME_UNKNOWN.

Coccinelle patch:

@@
expression sizeof_priv, name, setup, txqs, rxqs, count;
@@

(
-alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
+alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
|
-alloc_netdev_mq(sizeof_priv, name, setup, count)
+alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
|
-alloc_netdev(sizeof_priv, name, setup)
+alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
)

v9: move comments here from the wrong commit

Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:12:48 -07:00
Nikolay Aleksandrov
548d28bd0e bonding: fix ad_select module param check
Obvious copy/paste error when I converted the ad_select to the new
option API. "lacp_rate" there should be "ad_select" so we can get the
proper value.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>

Fixes: 9e5f5eebe7 ("bonding: convert ad_select to use the new option
API")
Reported-by: Karim Scheik <karim.scheik@prisma-solutions.at>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 14:36:58 -07:00
Jiri Pirko
e721f87d80 bonding: remove no longer relevant vlan warnings
These warnings are no longer relevant. Even when last slave is
removed, there is a valid address assigned to bond (random).
The correct functionality of vlans is ensured by maintaining unicast
list in vlan_sync_address().

Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07 21:31:54 -07:00
Jiri Pirko
763e0ecd72 bonding: allow to add vlans on top of empty bond
This limitation maybe had some reason in the past, but now there is not
one -> removing this.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01 18:57:43 -07:00
Or Gerlitz
5a7baa7885 bonding: Advertize vxlan offload features when supported
When the underlying device supports TCP offloads for VXLAN/UDP
encapulated traffic, we need to reflect that through the hw_enc_features
field of the bonding net-device. This will cause the xmit path
in the core networking stack to provide bonding with encapsulated
GSO frames to offload into the HW etc.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-18 16:49:51 -07:00
Vlad Yasevich
14af9963ba bonding: Support macvlans on top of tlb/rlb mode bonds
To make TLB mode work, the patch allows learning packets
to be sent using mac addresses assigned to macvlan devices,
also taking into an account vlans that may be between the
bond and macvlan device.

To make RLB work, all we have to do is accept ARP packets
for addresses added to the bond dev->uc list.  Since RLB
mode will take care to update the peers directly with
correct mac addresses, learning packets for these addresses
do not have be send to switch.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-04 15:13:54 -07:00
Vlad Yasevich
c565b488c6 bonding: Turn on IFF_UNICAST_FLT on bond devices
Bonding devices manage the unicast filters of the underlying
interfaces, but do not turn on IFF_UNICAST_FLT flag.  Thus
anytime a unicast address is added to the bond, the bond is
places in promiscuous mode.

Turn on IFF_UNICAST_FLT on the bond device so that the bond does
not go into promiscuous mode needlesly.  If an underlying device
does not support unicast filtering, that device will automaticall
enter promiscuous mode already.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-04 15:13:54 -07:00