This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
function br_multicast_toggle now always return 0,
so the variable 'err' is unneeded.
Also cleanup dead branch in br_changelink.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements ndo_fdb_get for the bridge
fdb.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a port device seeks approval of a potential new MAC address, make
sure that should the bridge device end up using this address, all
interested parties would agree with it.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a port is attached to a bridge, the address of the bridge in
question may change as well. Even if it would not change at this
point (because the current bridge address is lower), it might end up
changing later as a result of detach of another port, which can't be
vetoed.
Therefore issue NETDEV_PRE_CHANGEADDR regardless of whether the address
will be used at this point or not, and make sure all involved parties
would agree with the change.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, bridge driver has extack argument available to
pass to switchdev. Therefore extend switchdev_port_obj_add() with this
argument, updating all callers, and passing the argument through to
switchdev_port_obj_notify().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ndo_bridge_setlink has been updated in the previous patch to have extack
available, and changelink RTNL op has had this argument since the time
extack was added. Propagate both through the bridge driver to eventually
reach br_switchdev_port_vlan_add(), where it will be used by subsequent
patches.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers may not be able to implement a VLAN addition or reconfiguration.
In those cases it's desirable to explain to the user that it was
rejected (and why).
To that end, add extack argument to ndo_bridge_setlink. Adapt all users
to that change.
Following patches will use the new argument in the bridge driver.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a driver unoffloads all FDB entries en bloc, it's inefficient to
send the switchdev notification one by one. Add a helper that unsets the
offload flag on FDB entries on a given bridge port and VLAN.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge's default hash_max was 512 which is rather conservative, now that
we're using the generic rhashtable API which autoshrinks let's increase
it to 4096 and move it to a define in br_private.h.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the bridge multicast uses the generic rhashtable interface we
can drop the hash_elasticity option as that is already done for us and
it's hardcoded to a maximum of RHT_ELASTICITY (16 currently). Add a
warning about the obsolete option when the hash_elasticity is set.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge multicast code has been using a mix of RCU and RCU-bh flavors
sometimes in questionable way. Since we've moved to rhashtable just use
non-bh RCU everywhere. In addition this simplifies freeing of objects
and allows us to remove some unnecessary callback functions.
v3: new patch
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge multicast code currently uses a custom resizable hashtable
which predates the generic rhashtable interface. It has many
shortcomings compared and duplicates functionality that is presently
available via the generic rhashtable, so this patch removes the custom
rhashtable implementation in favor of the kernel's generic rhashtable.
The hash maximum is kept and the rhashtable's size is used to do a loose
check if it's reached in which case we revert to the old behaviour and
disable further bridge multicast processing. Also now we can support any
hash maximum, doesn't need to be a power of 2.
v3: add non-rcu br_mdb_get variant and use it where multicast_lock is
held to avoid RCU splat, drop hash_max function and just set it
directly
v2: handle when IGMP snooping is undefined, add br_mdb_init/uninit
placeholders
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the function only works for the bridge device itself, but
subsequent patches will need to be able to query the PVID of a given
bridge port, so extend the function.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have at least one bool option, we can export all of the
supported bool options via optmask when dumping them.
v2: new patch
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new boolopt API to add an option which disables learning from
link-local packets. The default is kept as before and learning is
enabled. This is a simple map from a boolopt bit to a bridge private
flag that is tested before learning.
v2: pass NULL for extack via sysfs
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been adding many new bridge options, a big number of which are
boolean but still take up netlink attribute ids and waste space in the skb.
Recently we discussed learning from link-local packets[1] and decided
yet another new boolean option will be needed, thus introducing this API
to save some bridge nl space.
The API supports changing the value of multiple boolean options at once
via the br_boolopt_multi struct which has an optmask (which options to
set, bit per opt) and optval (options' new values). Future boolean
options will only be added to the br_boolopt_id enum and then will have
to be handled in br_boolopt_toggle/get. The API will automatically
add the ability to change and export them via netlink, sysfs can use the
single boolopt function versions to do the same. The behaviour with
failing/succeeding is the same as with normal netlink option changing.
If an option requires mapping to internal kernel flag or needs special
configuration to be enabled then it should be handled in
br_boolopt_toggle. It should also be able to retrieve an option's current
state via br_boolopt_get.
v2: WARN_ON() on unsupported option as that shouldn't be possible and
also will help catch people who add new options without handling
them for both set and get. Pass down extack so if an option desires
it could set it on error and be more user-friendly.
[1] https://www.spinics.net/lists/netdev/msg532698.html
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent change added a null check on p->dev after p->dev was being
dereferenced by the ns_capable check on p->dev. It turns out that
neither the p->dev and p->br null checks are necessary, and can be
removed, which cleans up a static analyis warning.
As Nikolay Aleksandrov noted, these checks can be removed because:
"My reasoning of why it shouldn't be possible:
- On port add new_nbp() sets both p->dev and p->br before creating
kobj/sysfs
- On port del (trickier) del_nbp() calls kobject_del() before call_rcu()
to destroy the port which in turn calls sysfs_remove_dir() which uses
kernfs_remove() which deactivates (shouldn't be able to open new
files) and calls kernfs_drain() to drain current open/mmaped files in
the respective dir before continuing, thus making it impossible to
open a bridge port sysfs file with p->dev and p->br equal to NULL.
So I think it's safe to remove those checks altogether. It'd be nice to
get a second look over my reasoning as I might be missing something in
sysfs/kernfs call path."
Thanks to Nikolay Aleksandrov's suggestion to remove the check and
David Miller for sanity checking this.
Detected by CoverityScan, CID#751490 ("Dereference before null check")
Fixes: a5f3ea54f3 ("net: bridge: add support for raw sysfs port options")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow querying bridge port flags so that drivers capable of performing
VxLAN learning will update the bridge driver only if learning is enabled
on its bridge port corresponding to the VxLAN device.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzbot reported a use-after-free of the global vlan context on port vlan
destruction. When I added per-port vlan stats I missed the fact that the
global vlan context can be freed before the per-port vlan rcu callback.
There're a few different ways to deal with this, I've chosen to add a
new private flag that is set only when per-port stats are allocated so
we can directly check it on destruction without dereferencing the global
context at all. The new field in net_bridge_vlan uses a hole.
v2: cosmetic change, move the check to br_process_vlan_info where the
other checks are done
v3: add change log in the patch, add private (in-kernel only) flags in a
hole in net_bridge_vlan struct and use that instead of mixing
user-space flags with private flags
Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes assumption than vlan_tci != 0 when tag is present.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently a check was added which prevents marking of routers with zero
source address, but for IPv6 that cannot happen as the relevant RFCs
actually forbid such packets:
RFC 2710 (MLDv1):
"To be valid, the Query message MUST
come from a link-local IPv6 Source Address, be at least 24 octets
long, and have a correct MLD checksum."
Same goes for RFC 3810.
And also it can be seen as a requirement in ipv6_mc_check_mld_query()
which is used by the bridge to validate the message before processing
it. Thus any queries with :: source address won't be processed anyway.
So just remove the check for zero IPv6 source address from the query
processing function.
Fixes: 5a2de63fd1 ("bridge: do not add port to router list when receives query with source 0.0.0.0")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on RFC 4541, 2.1.1. IGMP Forwarding Rules
The switch supporting IGMP snooping must maintain a list of
multicast routers and the ports on which they are attached. This
list can be constructed in any combination of the following ways:
a) This list should be built by the snooping switch sending
Multicast Router Solicitation messages as described in IGMP
Multicast Router Discovery [MRDISC]. It may also snoop
Multicast Router Advertisement messages sent by and to other
nodes.
b) The arrival port for IGMP Queries (sent by multicast routers)
where the source address is not 0.0.0.0.
We should not add the port to router list when receives query with source
0.0.0.0.
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
readability improvements for the formatted output, some LICENSES updates
including the addition of the ISC license, the removal of the unloved and
unmaintained 00-INDEX files, the deprecated APIs document from Kees, more
MM docs from Mike Rapoport, and the usual pile of typo fixes and
corrections.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbztcuAAoJEI3ONVYwIuV6nTAP/0Be+5dNPGJmSnb/RbkwBuBV
zAFVUj2sx4lZlRmWRZ0r7AOef2eSw3IvwBix/vnmllYCVahjp+BdRbhXQAijjyeb
FWWjOH50/J+BaxSthAINiLRLvuoe0D/M08OpmXQfRl5q0S8RufeV3BDtEABx9j2n
IICPGTl8LpPUgSMA4cw8zPhHdauhZpbmL2mGE9LXZ27SJT4S8lcHMwyPU1n5S+Jd
ChEz5g9dYr3GNxFp712pkI5GcVL3tP2nfoVwK7EuGf1tvSnEnn2kzac8QgMqorIh
xB2+Sh4XIUCbHYpGHpxIniD+WI4voNr/E7STQioJK5o2G4HTuxLjktvTezNF8paa
hgNHWjPQBq0OOCdM/rsffONFF2J/v/r7E3B+kaRg8pE0uZWTFaDMs6MVaL2fL4Ls
DrFhi90NJI/Fs7uB4sriiviShAhwboiSIRXJi4VlY/5oFJKHFgqes+R7miU+zTX3
2qv0k4mWZXWDV9w1piPxSCZSdRzaoYSoxEihX+tnYpCyEcYd9ovW/X1Uhl/wCWPl
Ft+Op6rkHXRXVfZzTLuF6PspZ4Udpw2PUcnA5zj5FRDDBsjSMFR31c19IFbCeiNY
kbTIcqejJG1WbVrAK4LCcFyVSGxbrr281eth4rE06cYmmsz3kJy1DB6Lhyg/2vI0
I8K9ZJ99n1RhPJIcburB
=C0wt
-----END PGP SIGNATURE-----
Merge tag 'docs-4.20' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"This is a fairly typical cycle for documentation. There's some welcome
readability improvements for the formatted output, some LICENSES
updates including the addition of the ISC license, the removal of the
unloved and unmaintained 00-INDEX files, the deprecated APIs document
from Kees, more MM docs from Mike Rapoport, and the usual pile of typo
fixes and corrections"
* tag 'docs-4.20' of git://git.lwn.net/linux: (41 commits)
docs: Fix typos in histogram.rst
docs: Introduce deprecated APIs list
kernel-doc: fix declaration type determination
doc: fix a typo in adding-syscalls.rst
docs/admin-guide: memory-hotplug: remove table of contents
doc: printk-formats: Remove bogus kobject references for device nodes
Documentation: preempt-locking: Use better example
dm flakey: Document "error_writes" feature
docs/completion.txt: Fix a couple of punctuation nits
LICENSES: Add ISC license text
LICENSES: Add note to CDDL-1.0 license that it should not be used
docs/core-api: memory-hotplug: add some details about locking internals
docs/core-api: rename memory-hotplug-notifier to memory-hotplug
docs: improve readability for people with poorer eyesight
yama: clarify ptrace_scope=2 in Yama documentation
docs/vm: split memory hotplug notifier description to Documentation/core-api
docs: move memory hotplug description into admin-guide/mm
doc: Fix acronym "FEKEK" in ecryptfs
docs: fix some broken documentation references
iommu: Fix passthrough option documentation
...
This fixes a problem introduced by:
commit 2cde6acd49 ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
When using netconsole on a bond, __netpoll_cleanup can asynchronously
recurse multiple times, each __netpoll_free_async call can result in
more __netpoll_free_async's. This means there is now a race between
cleanup_work queues on multiple netpoll_info's on multiple devices and
the configuration of a new netpoll. For example if a netconsole is set
to enable 0, reconfigured, and enable 1 immediately, this netconsole
will likely not work.
Given the reason for __netpoll_free_async is it can be called when rtnl
is not locked, if it is locked, we should be able to execute
synchronously. It appears to be locked everywhere it's called from.
Generalize the design pattern from the teaming driver for current
callers of __netpoll_free_async.
CC: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, an FDB entry only ceases being offloaded when it is deleted.
This changes with VxLAN encapsulation.
Devices capable of performing VxLAN encapsulation usually have only one
FDB table, unlike the software data path which has two - one in the
bridge driver and another in the VxLAN driver.
Therefore, bridge FDB entries pointing to a VxLAN device are only
offloaded if there is a corresponding entry in the VxLAN FDB.
Allow clearing the offload indication in case the corresponding entry
was deleted from the VxLAN FDB.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After per-port vlan stats, vlan stats should be released
when fail to add vlan
Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
CC: bridge@lists.linux-foundation.org
cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an option to have per-port vlan stats instead of the
default global stats. The option can be set only when there are no port
vlans in the bridge since we need to allocate the stats if it is set
when vlans are being added to ports (and respectively free them
when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
largest user of options. The current stats design allows us to add
these without any changes to the fast-path, it all comes down to
the per-vlan stats pointer which, if this option is enabled, will
be allocated for each port vlan instead of using the global bridge-wide
one.
CC: bridge@lists.linux-foundation.org
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update br_mdb_dump for strict data checking. If the flag is set,
the dump request is expected to have a br_port_msg struct as the
header. All elements of the struct are expected to be 0 and no
attributes can be appended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 13cefad2f2 ("net: bridge: convert and rename mcast disabled")
converted the 'multicast_disabled' field to an option bit named
'BROPT_MULTICAST_ENABLED'.
While the old field was implicitly initialized to 0, the new field is
not initialized, resulting in the bridge defaulting to multicast
disabled state and breaking existing applications.
Fix this by explicitly initializing the option.
Fixes: 13cefad2f2 ("net: bridge: convert and rename mcast disabled")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to explicitly zero is_sticky when creating a new fdb, otherwise
we might get a stale value for a new entry.
Fixes: 435f2e7cc0 ("net: bridge: add support for sticky fdb entries")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Further reduce the size of net_bridge with 8 bytes and reduce the number of
holes in it:
Before: holes: 5, sum holes: 15
After: holes: 3, sum holes: 7
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the last remaining bool option to a bit thus reducing the overall
net_bridge size further by 8 bytes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the neigh_suppress_enabled option to a bit.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the rest of the mcast options to bits. It also packs
the mcast options a little better by moving multicast_mld_version to an
existing hole, reducing the net_bridge size by 8 bytes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mcast disabled to an option bit and while doing so convert the
logic to check if multicast is enabled instead. That is make the logic
follow the option value - if it's set then mcast is enabled and vice versa.
This avoids a few confusing places where we inverted the value that's being
set to follow the mcast_disabled logic.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert group_addr_set internal bridge opt to a bit.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No functional change, convert of nf_call_[ip|ip6|arp]tables to bits.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge options have usually been added as separate fields all over the
net_bridge struct taking up space and ending up in different cache lines.
Let's move them to a single bitfield to save up space and speedup lookups.
This patch adds a simple API for option modifying and retrieving using
bitops and converts the first user of the API - the bridge vlan options
(vlan_enabled and vlan_stats_enabled).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we have a mix of opening brackets on new lines and on the same
line, let's move them all on the same line.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: ed842faeb2 ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some documentation files received recent changes and are
pointing to wrong places.
Those references can easily fixed with the help of a
script:
$ ./scripts/documentation-file-ref-check --fix
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
For starters, the bridge netfilter code registers operations that
are invoked any time nh_hook is called. Specifically, ip_sabotage_in
watches for nested calls for NF_INET_PRE_ROUTING when a bridge is in
the stack.
Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
then resets in_prerouting flag to 0 and the packet continues up the
stack. The packet eventually makes it to the VRF driver and it invokes
nf_hook for NF_INET_PRE_ROUTING in case any rules have been added against
the vrf device.
Because of the registered operations the call to nf_hook causes
ip_sabotage_in to be invoked. That function sees the nf_bridge on the
skb and that in_prerouting is not set. Thinking it is an invalid nested
call it steals (drops) the packet.
Update ip_sabotage_in to recognize that the bridge or one of its upper
devices (e.g., vlan) can be enslaved to a VRF (L3 master device) and
allow the packet to go through the nf_hook a second time.
Fixes: 73e20b761a ("net: vrf: Add support for PREROUTING rules on vrf device")
Reported-by: D'Souza, Nelson <ndsouza@ciena.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for entries which are "sticky", i.e. will not change their port
if they show up from a different one. A new ndm flag is introduced for that
purpose - NTF_STICKY. We allow to set it only to non-local entries.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The err is not modified after initalization, So remove it and make
it to be void function.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter exposes standard hook priorities in case of ipv4, ipv6 and
arp but not in case of bridge.
This patch exposes the hook priority values of the bridge family (which are
different from the formerly mentioned) via uapi so that they can be used by
user-space applications just like the others.
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
allows to set a backup port to be used for known unicast traffic if the
port has gone carrier down. The backup pointer is rcu protected and set
only under RTNL, a counter is maintained so when deleting a port we know
how many other ports reference it as a backup and we remove it from all.
Also the pointer is in the first cache line which is hot at the time of
the check and thus in the common case we only add one more test.
The backup port will be used only for the non-flooding case since
it's a part of the bridge and the flooded packets will be forwarded to it
anyway. To remove the forwarding just send a 0/non-existing backup port.
This is used to avoid numerous scalability problems when using MLAG most
notably if we have thousands of fdbs one would need to change all of them
on port carrier going down which takes too long and causes a storm of fdb
notifications (and again when the port comes back up). In a Multi-chassis
Link Aggregation setup usually hosts are connected to two different
switches which act as a single logical switch. Those switches usually have
a control and backup link between them called peerlink which might be used
for communication in case a host loses connectivity to one of them.
We need a fast way to failover in case a host port goes down and currently
none of the solutions (like bond) cannot fulfill the requirements because
the participating ports are actually the "master" devices and must have the
same peerlink as their backup interface and at the same time all of them
must participate in the bridge device. As Roopa noted it's normal practice
in routing called fast re-route where a precalculated backup path is used
when the main one is down.
Another use case of this is with EVPN, having a single vxlan device which
is backup of every port. Due to the nature of master devices it's not
currently possible to use one device as a backup for many and still have
all of them participate in the bridge (which is master itself).
More detailed information about MLAG is available at the link below.
https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
Further explanation and a diagram by Roopa:
Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new alternative store callback for port sysfs options
which takes a raw value (buf) and can use it directly. It is needed for the
backup port sysfs support since we have to pass the device by its name.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating various bridge objects in /sys/class/net/... make sure
that they belong to the container's owner instead of global root (if
they belong to a container/namespace).
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the nft_reject_br_send_v4_tcp_reset(), a ttl is set by the
nf_reject_iphdr_put(). so, below code is unnecessary.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Reject non-null terminated helper names from xt_CT, from Gao Feng.
2) Fix KASAN splat due to out-of-bound access from commit phase, from
Alexey Kodanev.
3) Missing conntrack hook registration on IPVS FTP helper, from Julian
Anastasov.
4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.
5) Fix inverted check on packet xmit to non-local addresses, also from
Julian.
6) Fix ebtables alignment compat problems, from Alin Nastac.
7) Hook mask checks are not correct in xt_set, from Serhey Popovych.
8) Fix timeout listing of element in ipsets, from Jozsef.
9) Cap maximum timeout value in ipset, also from Jozsef.
10) Don't allow family option for hash:mac sets, from Florent Fourcot.
11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this
Florian.
12) Another bug reported by KASAN in the rbtree set backend, from
Taehee Yoo.
13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
From Gao Feng.
14) Missing initialization of match/target in ebtables, from Florian
Westphal.
15) Remove useless nft_dup.h file in include path, from C. Labbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of br_fdb_find() need to hold the hash lock, which
br_fdb_find_port() doesn't do. However, since br_fdb_find_port() is not
doing any actual FDB manipulation, the hash lock is not really needed at
all. So convert to br_fdb_find_rcu(), surrounded by rcu_read_lock() /
_unlock() pair.
The device pointer copied from inside the FDB entry is then kept alive
by the RTNL lock, which br_fdb_find_port() asserts.
Fixes: 4d4fd36126 ("net: bridge: Publish bridge accessor functions")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reports following splat:
BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450
net/bridge/netfilter/ebt_stp.c:162
ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162
xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506
ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline]
ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline]
The uninitialised access is
xt_mtchk_param->nft_compat
... which should be set to 0.
Fix it by zeroing the struct beforehand, same for tgchk.
ip(6)tables targetinfo uses c99-style initialiser, so no change
needed there.
Reported-by: syzbot+da4494182233c23a5fcf@syzkaller.appspotmail.com
Fixes: 55917a21d0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsXFUEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIoomg//eRNpc6x9kxTijN670AC2uD0CBTlZ
2z6mHuJaOhG8bTxjZxQfUBoo6/eZJ2YC1yq6ornGFNzw4sfKsR/j86ujJim2HAmo
opUhziq3SILGEvjsxfPkREe/wb49jy0AA/WjZqciitB1ig8Hz7xzqi0lpNaEspFh
QJFB6XXkojWGFGrRzruAVJnPS+pDWoTQR0qafs3JWKnpeinpOdZnl1hPsysAEHt5
Ag8o4qS/P9xJM0khi7T+jWECmTyT/mtWqEtFcZ0o+JLOgt/EMvNX6DO4ETDiYRD2
mVChga9x5r78bRgNy2U8IlEWWa76WpcQAEODvhzbijX4RxMAmjsmLE+e+udZSnMZ
eCITl2f7ExxrL5SwNFC/5h7pAv0RJ+SOC19vcyeV4JDlQNNVjUy/aNKv5baV0aeg
EmkeobneMWxqHx52aERz8RF1in5pT8gLOYoYnWfNpcDEmjLrwhuZLX2asIzUEqrS
SoPJ8hxIDCxceHOWIIrz5Dqef7x28Dyi46w3QINC8bSy2RnR/H3q40DRegvXOGiS
9WcbbwbhnM4Kau413qKicGCvdqTVYdeyZqo7fVelSciD139Vk7pZotyom4MuU25p
fIyGfXa8/8gkl7fZ+HNkZbba0XWNfAZt//zT095qsp3CkhVnoybwe6OwG1xRqErq
W7OOQbS7vvN/KGo=
=10u6
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Another reasonable chunk of audit changes for v4.18, thirteen patches
in total.
The thirteen patches can mostly be broken down into one of four
categories: general bug fixes, accessor functions for audit state
stored in the task_struct, negative filter matches on executable
names, and extending the (relatively) new seccomp logging knobs to the
audit subsystem.
The main driver for the accessor functions from Richard are the
changes we're working on to associate audit events with containers,
but I think they have some standalone value too so I figured it would
be good to get them in now.
The seccomp/audit patches from Tyler apply the seccomp logging
improvements from a few releases ago to audit's seccomp logging;
starting with this patchset the changes in
/proc/sys/kernel/seccomp/actions_logged should apply to both the
standard kernel logging and audit.
As usual, everything passes the audit-testsuite and it happens to
merge cleanly with your tree"
[ Heh, except it had trivial merge conflicts with the SELinux tree that
also came in from Paul - Linus ]
* tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Fix wrong task in comparison of session ID
audit: use existing session info function
audit: normalize loginuid read access
audit: use new audit_context access funciton for seccomp_actions_logged
audit: use inline function to set audit context
audit: use inline function to get audit context
audit: convert sessionid unset to a macro
seccomp: Don't special case audited processes when logging
seccomp: Audit attempts to modify the actions_logged sysctl
seccomp: Configurable separator for the actions_logged string
seccomp: Separate read and write code for actions_logged sysctl
audit: allow not equal op for audit by executable
audit: add syscall information to FEATURE_CHANGE records
the ebtables evaluation loop expects targets to return
positive values (jumps), or negative values (absolute verdicts).
This is completely different from what xtables does.
In xtables, targets are expected to return the standard netfilter
verdicts, i.e. NF_DROP, NF_ACCEPT, etc.
ebtables will consider these as jumps.
Therefore reject any target found due to unspec fallback.
v2: also reject watchers. ebtables ignores their return value, so
a target that assumes skb ownership (and returns NF_STOLEN) causes
use-after-free.
The only watchers in the 'ebtables' front-end are log and nflog;
both have AF_BRIDGE specific wrappers on kernel side.
Reported-by: syzbot+2b43f681169a2a0d306a@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On arm64, ebt_entry_{match,watcher,target} structs are 40 bytes long
while on 32-bit arm these structs have a size of 36 bytes.
COMPAT_XT_ALIGN() macro cannot be used here to determine the necessary
padding for the CONFIG_COMPAT because it imposes an 8-byte boundary
alignment, condition that is not found in 32-bit ebtables application.
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In order to allocate icmpv6 skb, sizeof(struct ipv6hdr) should be used.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
A driver might need to react to changes in settings of brentry VLANs.
Therefore send switchdev port notifications for these as well. Reuse
SWITCHDEV_OBJ_ID_PORT_VLAN for this purpose. Listeners should use
netif_is_bridge_master() on orig_dev to determine whether the
notification is about a bridge port or a bridge.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract the code that deals with adding a preexisting VLAN to bridge CPU
port to a separate function. A follow-up patch introduces a need to roll
back operations in this block due to an error, and this split will make
the error-handling code clearer.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A call to switchdev_port_obj_add() or switchdev_port_obj_del() involves
initializing a struct switchdev_obj_port_vlan, a piece of code that
repeats on each call site almost verbatim. While in the current codebase
there is just one duplicated add call, the follow-up patches add more of
both add and del calls.
Thus to remove the duplication, extract the repetition into named
functions and reuse.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for a new port flag - BR_ISOLATED. If it is set
then isolated ports cannot communicate between each other, but they can
still communicate with non-isolated ports. The same can be achieved via
ACLs but they can't scale with large number of ports and also the
complexity of the rules grows. This feature can be used to achieve
isolated vlan functionality (similar to pvlan) as well, though currently
it will be port-wide (for all vlans on the port). The new test in
should_deliver uses data that is already cache hot and the new boolean
is used to avoid an additional source port test in should_deliver.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Recognizing that the audit context is an internal audit value, use an
access function to retrieve the audit context pointer for the task
rather than reaching directly into the task struct to get it.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in auditsc.c and selinuxfs.c, checkpatch.pl fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix handling of simultaneous open TCP connection in conntrack,
from Jozsef Kadlecsik.
2) Insufficient sanitify check of xtables extension names, from
Florian Westphal.
3) Skip unnecessary synchronize_rcu() call when transaction log
is already empty, from Florian Westphal.
4) Incorrect destination mac validation in ebt_stp, from Stephen
Hemminger.
5) xtables module reference counter leak in nft_compat, from
Florian Westphal.
6) Incorrect connection reference counting logic in IPVS
one-packet scheduler, from Julian Anastasov.
7) Wrong stats for 32-bits CPU in IPVS, also from Julian.
8) Calm down sparse error in netfilter core, also from Florian.
9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct
and nfnetlink_cthelper, again from Florian.
10) Missing module alias in icmp and icmp6 xtables extensions,
from Florian Westphal.
11) Base chain statistics in nf_tables may be unset/null, from Florian.
12) Fix handling of large matchinfo size in nft_compat, this includes
one preparation for before this fix. From Florian.
13) Fix bogus EBUSY error when deleting chains due to incorrect reference
counting from the preparation phase of the two-phase commit protocol.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mirroring offload in mlxsw needs to check that a given VLAN is allowed
to ingress the bridge device. br_vlan_get_info() is the function that is
used for this, however currently it only supports bridge port devices.
Extend it to support bridge masters as well.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The destination mac (destmac) is only valid if EBT_DESTMAC flag
is set. Fix by changing the order of the comparison to look for
the flag first.
Reported-by: syzbot+5c06e318fc558cc27823@syzkaller.appspotmail.com
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, more relevant updates in this batch are:
1) Add Maglev support to IPVS. Moreover, store lastest server weight in
IPVS since this is needed by maglev, patches from from Inju Song.
2) Preparation works to add iptables flowtable support, patches
from Felix Fietkau.
3) Hand over flows back to conntrack slow path in case of TCP RST/FIN
packet is seen via new teardown state, also from Felix.
4) Add support for extended netlink error reporting for nf_tables.
5) Support for larger timeouts that 23 days in nf_tables, patch from
Florian Westphal.
6) Always set an upper limit to dynamic sets, also from Florian.
7) Allow number generator to make map lookups, from Laura Garcia.
8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat.
9) Extend ip6tables SRH match to support previous, next and last SID,
from Ahmed Abdelsalam.
10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez.
11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot.
12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables,
from Taehee Yoo.
13) Unify meta bridge with core nft_meta, then make nft_meta built-in.
Make rt and exthdr built-in too, again from Florian.
14) Missing initialization of tbl->entries in IPVS, from Cong Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not automatically bail out on sending notifications about activity on
non-user-added FDB entries. Instead, notify about this activity except
for cases where the activity itself originates in a notification, to
avoid sending duplicate notifications.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch enables sending notifications also for events on FDB
entries that weren't added by the user. Give the drivers the information
necessary to distinguish between the two origins of FDB entries.
To maintain the current behavior, have switchdev-implementing drivers
bail out on notifications about non-user-added FDB entries. In case of
mlxsw driver, allow a call to mlxsw_sp_span_respin() so that SPAN over
bridge catches up with the changed FDB.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While handling netdevice events, br_device_event() sometimes uses
br_stp_(disable|enable)_port which unconditionally send a notification,
but then a second notification for the same event is sent at the end of
the br_device_event() function. To avoid sending duplicate notifications
in such cases, check if one has already been sent (i.e.
br_stp_enable/disable_port have been called).
The patch is based on a change by Satish Ashok.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a couple new functions to allow querying FDB and vlan settings of a
bridge.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we set a bond slave's master to bridge via ioctl, we only check
the IFF_BRIDGE_PORT flag. Although we will find the slave's real master
at netdev_master_upper_dev_link() later, it already does some settings
and allocates some resources. It would be better to return as early
as possible.
v1 -> v2:
use netdev_master_upper_dev_get() instead of netdev_has_any_upper_dev()
to check if we have a master, because not all upper devs are masters,
e.g. vlan device.
Reported-by: syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use hlist_entry_safe() instead of open-coding it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It overcomplicates things for no reason.
nft_meta_bridge only offers retrieval of bridge port interface name.
Because of this being its own module, we had to export all nft_meta
functions, which we can then make static again (which even reduces
the size of nft_meta -- including bridge port retrieval...):
before:
text data bss dec hex filename
1838 832 0 2670 a6e net/bridge/netfilter/nft_meta_bridge.ko
6147 936 1 7084 1bac net/netfilter/nft_meta.ko
after:
5826 936 1 6763 1a6b net/netfilter/nft_meta.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebt_get_target similar to {ip/ip6/arp}t_get_target.
and ebt_get_target_c similar to {ip/ip6/arp}t_get_target_c.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
EBT_MATCH and EBT_NOMATCH are used to change return value.
match functions(ebt_xxx.c) return false when received frame is not matched
and returns true when received frame is matched.
but, EBT_MATCH_ITERATE understands oppositely.
so, to change return value, EBT_MATCH and EBT_NOMATCH are used.
but, we can use operation '!' simply.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A ebt_free_table_info frees all of chainstacks.
It similar to xt_free_table_info. this inline function
reduces code line.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Dmitry reports 32bit ebtables on 64bit kernel got broken by
a recent change that returns -EINVAL when ruleset has no entries.
ebtables however only counts user-defined chains, so for the
initial table nentries will be 0.
Don't try to allocate the compat array in this case, as no user
defined rules exist no rule will need 64bit translation.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 7d7d7e0211 ("netfilter: compat: reject huge allocation requests")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As Roopa noted today the biggest source of problems when configuring
bridge and ports is that the bridge MTU keeps changing automatically on
port events (add/del/changemtu). That leads to inconsistent behaviour
and network config software needs to chase the MTU and fix it on each
such event. Let's improve on that situation and allow for the user to
set any MTU within ETH_MIN/MAX limits, but once manually configured it
is the user's responsibility to keep it correct afterwards.
In case the MTU isn't manually set - the behaviour reverts to the
previous and the bridge follows the minimum MTU.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently the bridge was changed to automatically set maximum MTU on port
events (add/del/changemtu) when vlan filtering is enabled, but that
actually changes behaviour in a way which breaks some setups and can lead
to packet drops. In order to still allow that maximum to be set while being
compatible, we add the ability for the user to tune the bridge MTU up to
the maximum when vlan filtering is enabled, but that has to be done
explicitly and all port events (add/del/changemtu) lead to resetting that
MTU to the minimum as before.
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree. This batch comes with more input sanitization for xtables to
address bug reports from fuzzers, preparation works to the flowtable
infrastructure and assorted updates. In no particular order, they are:
1) Make sure userspace provides a valid standard target verdict, from
Florian Westphal.
2) Sanitize error target size, also from Florian.
3) Validate that last rule in basechain matches underflow/policy since
userspace assumes this when decoding the ruleset blob that comes
from the kernel, from Florian.
4) Consolidate hook entry checks through xt_check_table_hooks(),
patch from Florian.
5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject
very large compat offset arrays, so we have a reasonable upper limit
and fuzzers don't exercise the oom-killer. Patches from Florian.
6) Several WARN_ON checks on xtables mutex helper, from Florian.
7) xt_rateest now has a hashtable per net, from Cong Wang.
8) Consolidate counter allocation in xt_counters_alloc(), from Florian.
9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch
from Xin Long.
10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from
Felix Fietkau.
11) Consolidate code through flow_offload_fill_dir(), also from Felix.
12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward()
to remove a dependency with flowtable and ipv6.ko, from Felix.
13) Cache mtu size in flow_offload_tuple object, this is safe for
forwarding as f87c10a8aa describes, from Felix.
14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too
modular infrastructure, from Felix.
15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from
Ahmed Abdelsalam.
16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei.
17) Support for counting only to nf_conncount infrastructure, patch
from Yi-Hung Wei.
18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes
to nft_ct.
19) Use boolean as return value from ipt_ah and from IPVS too, patch
from Gustavo A. R. Silva.
20) Remove useless parameters in nfnl_acct_overquota() and
nf_conntrack_broadcast_help(), from Taehee Yoo.
21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo.
22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu.
23) Fix typo in xt_limit, from Geert Uytterhoeven.
24) Do no use VLAs in Netfilter code, again from Gustavo.
25) Use ADD_COUNTER from ebtables, from Taehee Yoo.
26) Bitshift support for CONNMARK and MARK targets, from Jack Ma.
27) Use pr_*() and add pr_fmt(), from Arushi Singhal.
28) Add synproxy support to ctnetlink.
29) ICMP type and IGMP matching support for ebtables, patches from
Matthias Schiffer.
30) Support for the revision infrastructure to ebtables, from
Bernie Harris.
31) String match support for ebtables, also from Bernie.
32) Documentation for the new flowtable infrastructure.
33) Use generic comparison functions in ebt_stp, from Joe Perches.
34) Demodularize filter chains in nftables.
35) Register conntrack hooks in case nftables NAT chain is added.
36) Merge assignments with return in a couple of spots in the
Netfilter codebase, also from Arushi.
37) Document that xtables percpu counters are stored in the same
memory area, from Ben Hutchings.
38) Revert mark_source_chains() sanity checks that break existing
rulesets, from Florian Westphal.
39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
One module per supported filter chain family type takes too much memory
for very little code - too much modularization - place all chain filter
definitions in one single file.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use WARN_ON() instead since it should not happen that neither family
goes over NFPROTO_NUMPROTO nor there is already a chain of this type
already registered.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use nft_ prefix. By when I added chain types, I forgot to use the
nftables prefix. Rename enum nft_chain_type to enum nft_chain_types too,
otherwise there is an overlap.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of unnecessary const declarations, use the generic functions to
save a little object space.
$ size net/bridge/netfilter/ebt_stp.o*
text data bss dec hex filename
1250 144 0 1394 572 net/bridge/netfilter/ebt_stp.o.new
1344 144 0 1488 5d0 net/bridge/netfilter/ebt_stp.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently ebtables assumes that the revision number of all match
modules is 0, which is an issue when trying to use existing
xtables matches with ebtables. The solution is to modify ebtables
to allow extensions to specify a revision number, similar to
iptables. This gets passed down to the kernel, which is then able
to find the match module correctly.
To main binary backwards compatibility, the size of the ebt_entry
structures is not changed, only the size of the name field is
decreased by 1 byte to make room for the revision field.
Signed-off-by: Bernie Harris <bernie.harris@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer the direct use of octal for permissions.
Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.
Miscellanea:
o Whitespace neatening around these conversions.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use br_vlan_enabled() helper otherwise we'll break builds
without bridge vlans:
net/bridge//br_if.c: In function ‘br_mtu’:
net/bridge//br_if.c:458:8: error: ‘const struct net_bridge’ has no
member named ‘vlan_enabled’
if (br->vlan_enabled)
^
net/bridge//br_if.c:462:1: warning: control reaches end of non-void
function [-Wreturn-type]
}
^
scripts/Makefile.build:324: recipe for target 'net/bridge//br_if.o'
failed
Fixes: 419d14af9e ("bridge: Allow max MTU when multiple VLANs present")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the bridge is allowing multiple VLANs, some VLANs may have
different MTUs. Instead of choosing the minimum MTU for the
bridge interface, choose the maximum MTU of the bridge members.
With this the user only needs to set a larger MTU on the member
ports that are participating in the large MTU VLANS.
Signed-off-by: Chas Williams <3chas3@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We already have ICMPv6 type/code matches (which can be used to distinguish
different types of MLD packets). Add support for IPv4 IGMP matches in the
same way.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already have ICMPv6 type/code matches. This adds support for IPv4 ICMP
matches in the same way.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xtables uses ADD_COUNTER macro to increase
packet and byte count. ebtables also can use this.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.
commit c4585a2823 ("bridge: ebt_among: add missing match size checks")
added validation for pool size, but missed fact that the macros
ebt_among_wh_src/dst can already return out-of-bound result because
they do not check value of wh_src/dst_ofs (an offset) vs. the size
of the match that userspace gave to us.
v2:
check that offset has correct alignment.
Paolo Abeni points out that we should also check that src/dst
wormhash arrays do not overlap, and src + length lines up with
start of dst (or vice versa).
v3: compact wormhash_sizes_valid() part
NB: Fixes tag is intentionally wrong, this bug exists from day
one when match was added for 2.6 kernel. Tag is there so stable
maintainers will notice this one too.
Tested with same rules from the earlier patch.
Fixes: c4585a2823 ("bridge: ebt_among: add missing match size checks")
Reported-by: <syzbot+bdabab6f1983a03fc009@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The last rule in the blob has next_entry offset that is same as total size.
This made "ebtables32 -A OUTPUT -d de:ad:be:ef:01:02" fail on 64 bit kernel.
Fixes: b718121685 ("netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All of the conflicts were cases of overlapping changes.
In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.
Signed-off-by: David S. Miller <davem@davemloft.net>
should have no impact, function still always returns 0.
This patch is only to ease review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These pernet_operations use nf_log_set() and nf_log_unset()
in their methods:
nf_log_bridge_net_ops
nf_log_arp_net_ops
nf_log_ipv4_net_ops
nf_log_ipv6_net_ops
nf_log_netdev_net_ops
Nobody can send such a packet to a net before it's became
registered, nobody can send a packet after all netdevices
are unregistered. So, these pernet_operations are able
to be marked as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations use ebt_register_table() and
ebt_unregister_table() to act on the tables, which
are used as argument in ebt_do_table(), called from
ebtables hooks.
Since there are no net-related bridge packets in-flight,
when the init and exit methods are called, these
pernet_operations are safe to be executed in parallel
with any other.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Put back reference on CLUSTERIP configuration structure from the
error path, patch from Florian Westphal.
2) Put reference on CLUSTERIP configuration instead of freeing it,
another cpu may still be walking over it, also from Florian.
3) Refetch pointer to IPv6 header from nf_nat_ipv6_manip_pkt() given
packet manipulation may reallocation the skbuff header, from Florian.
4) Missing match size sanity checks in ebt_among, from Florian.
5) Convert BUG_ON to WARN_ON in ebtables, from Florian.
6) Sanity check userspace offsets from ebtables kernel, from Florian.
7) Missing checksum replace call in flowtable IPv4 DNAT, from Felix
Fietkau.
8) Bump the right stats on checksum error from bridge netfilter,
from Taehee Yoo.
9) Unset interface flag in IPv6 fib lookups otherwise we get
misleading routing lookup results, from Florian.
10) Missing sk_to_full_sk() in ip6_route_me_harder() from Eric Dumazet.
11) Don't allow devices to be part of multiple flowtables at the same
time, this may break setups.
12) Missing netlink attribute validation in flowtable deletion.
13) Wrong array index in nf_unregister_net_hook() call from error path
in flowtable addition path.
14) Fix FTP IPVS helper when NAT mangling is in place, patch from
Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations only unregister nf hooks.
So, they are able to be marked as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations are similar to bond_net_ops. Exit method
unregisters all net bridge devices, and it looks like another
pernet_operations are not interested in foreign net bridge list.
So, it's possible to mark them async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VLAN is added on a port, a reference is taken on the
corresponding master VLAN entry. If it does not already exist, then it
is created and a reference taken.
However, in the second case a reference is not really taken when
CONFIG_REFCOUNT_FULL is enabled as refcount_inc() is replaced by
refcount_inc_not_zero().
Fix this by using refcount_set() on a newly created master VLAN entry.
Fixes: 2512775985 ("net, bridge: convert net_bridge_vlan.refcnt from atomic_t to refcount_t")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the ip_rcv, IPSTATS_MIB_CSUMERRORS is increased when
checksum error is occurred.
bridge netfilter routine should increase IPSTATS_MIB_CSUMERRORS.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We need to make sure the offsets are not out of range of the
total size.
Also check that they are in ascending order.
The WARN_ON triggered by syzkaller (it sets panic_on_warn) is
changed to also bail out, no point in continuing parsing.
Briefly tested with simple ruleset of
-A INPUT --limit 1/s' --log
plus jump to custom chains using 32bit ebtables binary.
Reported-by: <syzbot+845a53d13171abf8bf29@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All of these conditions are not fatal and should have
been WARN_ONs from the get-go.
Convert them to WARN_ONs and bail out.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.
Therefore it must check that the size of the match structure
provided from userspace is sane by making sure em->match_size
is at least the minimum size of the expected structure.
The module has such a check, but its only done after accessing
a structure that might be out of bounds.
tested with: ebtables -A INPUT ... \
--among-dst fe:fe:fe:fe:fe:fe
--among-dst fe:fe:fe:fe:fe:fe --among-src fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fb,fe:fe:fe:fe:fc:fd,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe
--among-src fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fa,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe,fe:fe:fe:fe:fe:fe
Reported-by: <syzbot+fe0b19af568972814355@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains large batch with Netfilter fixes for
your net tree, mostly due to syzbot report fixups and pr_err()
ratelimiting, more specifically, they are:
1) Get rid of superfluous unnecessary check in x_tables before vmalloc(),
we don't hit BUG there anymore, patch from Michal Hock, suggested by
Andrew Morton.
2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang.
3) Drop socket lock that results in circular locking dependency, patch
from Paolo Abeni.
4) Drop packet if case of malformed blob that makes backpointer jump
in x_tables, from Florian Westphal.
5) Fix refcount leak due to race in ipt_CLUSTERIP in
clusterip_config_find_get(), from Cong Wang.
6) Several patches to ratelimit pr_err() for x_tables since this can be
a problem where CAP_NET_ADMIN semantics can protect us in untrusted
namespace, from Florian Westphal.
7) Missing .gitignore update for new autogenerated asn1 state machine
for the SNMP NAT helper, from Zhu Lingshan.
8) Missing timer initialization in xt_LED, from Paolo Abeni.
9) Do not allow negative port range in NAT, also from Paolo.
10) Lock imbalance in the xt_hashlimit rate match mode, patch from
Eric Dumazet.
11) Initialize workqueue before timer in the idletimer match,
from Eric Dumazet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ebt_among still uses pr_err -- these errors indicate ebtables tool bug,
not a usage error.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now br_sysfs_if file flush doesn't have attr show. To read it will
cause kernel panic after users chmod u+r this file.
Xiong found this issue when running the commands:
ip link add br0 type bridge
ip link add type veth
ip link set veth0 master br0
chmod u+r /sys/devices/virtual/net/veth0/brport/flush
timeout 3 cat /sys/devices/virtual/net/veth0/brport/flush
kernel crashed with NULL a pointer dereference call trace.
This patch is to fix it by return -EINVAL when brport_attr->show
is null, just the same as the check for brport_attr->store in
brport_store().
Fixes: 9cf637473c ("bridge: add sysfs hook to flush forwarding table")
Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With gcc-4.1.2.:
net/bridge/br_fdb.c: In function ‘br_fdb_sync_static’:
net/bridge/br_fdb.c:996: warning: ‘err’ may be used uninitialized in this function
Indeed, if the list is empty, err will be uninitialized, and will be
propagated up as the function return value.
Fix this by preinitializing err to zero.
Fixes: eb7935830d ("net: bridge: use rhashtable for fdbs")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return statements in functions returning bool should use
true/false instead of 1/0.
This issue was detected with the help of Coccinelle.
Fixes: 85b3526932 ("bridge: Fix build error when IGMP_SNOOPING is not enabled")
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several reasons for this:
* Several modules maintain internal version numbers, that they print at
boot/module load time, that are not exposed to userspace, as a
primitive mechanism to make revision number control from the earlier
days of Netfilter.
* IPset shows the protocol version at boot/module load time, instead
display this via module description, as Jozsef suggested.
* Remove copyright notice at boot/module load time in two spots, the
Netfilter codebase is a collective development effort, if we would
have to display copyrights for each contributor at boot/module load
time for each extensions we have, we would probably fill up logs with
lots of useless information - from a technical standpoint.
So let's be consistent and remove them all.
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove the infrastructure to register/unregister nft_af_info structure,
this structure stores no useful information anymore.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that we have a single table list for each netns, we can get rid of
one pointer per family and the global afinfo list, thus, shrinking
struct netns for nftables that now becomes 64 bytes smaller.
And call __nft_release_afinfo() from __net_exit path accordingly to
release netnamespace objects on removal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already validate the hook through bitmask, so this check is
superfluous. When removing this, this patch is also fixing a bug in the
new flowtable codebase, since ctx->afi points to the table family
instead of the netdev family which is where the flowtable is really
hooked in.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This abstraction has no clients anymore, remove it.
This is what remains from previous authors, so correct copyright
statement after recent modifications and code removal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We cannot make a direct call to nf_ip6_reroute() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define reroute indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We cannot make a direct call to nf_ip6_route() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define route indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is only used by nf_queue.c and this function comes with no symbol
dependencies with IPv6, it just refers to structure layouts. Therefore,
we can replace it by a direct function call from where it belongs.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies. Therefore, define checksum_partial indirection in
nf_ipv6_ops where this really belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define checksum indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
They don't belong to the family definition, move them to the filter
chain type definition instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.
Before:
text data bss dec hex filename
2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o
After:
text data bss dec hex filename
2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
no need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
struct net contains:
struct nf_hook_entries __rcu *hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
which store the hook entry point locations for the various protocol
families and the hooks.
Using array results in compact c code when doing accesses, i.e.
x = rcu_dereference(net->nf.hooks[pf][hook]);
but its also wasting a lot of memory, as most families are
not used.
So split the array into those families that are used, which
are only 5 (instead of 13). In most cases, the 'pf' argument is
constant, i.e. gcc removes switch statement.
struct net before:
/* size: 5184, cachelines: 81, members: 46 */
after:
/* size: 4672, cachelines: 73, members: 46 */
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Lots of overlapping changes. Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.
Include a necessary change by Jakub Kicinski, with log message:
====================
cls_bpf no longer takes care of offload tracking. Make sure
netdevsim performs necessary checks. This fixes a warning
caused by TC trying to remove a filter it has not added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use
%pM to print MAC
mac_pton() to convert it from ASCII to binary format, and
ether_addr_copy() to copy.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The early call to br_stp_change_bridge_id in bridge's newlink can cause
a memory leak if an error occurs during the newlink because the fdb
entries are not cleaned up if a different lladdr was specified, also
another minor issue is that it generates fdb notifications with
ifindex = 0. Another unrelated memory leak is the bridge sysfs entries
which get added on NETDEV_REGISTER event, but are not cleaned up in the
newlink error path. To remove this special case the call to
br_stp_change_bridge_id is done after netdev register and we cleanup the
bridge on changelink error via br_dev_delete to plug all leaks.
This patch makes netlink bridge destruction on newlink error the same as
dellink and ioctl del which is necessary since at that point we have a
fully initialized bridge device.
To reproduce the issue:
$ ip l add br0 address 00:11:22:33:44:55 type bridge group_fwd_mask 1
RTNETLINK answers: Invalid argument
$ rmmod bridge
[ 1822.142525] =============================================================================
[ 1822.143640] BUG bridge_fdb_cache (Tainted: G O ): Objects remaining in bridge_fdb_cache on __kmem_cache_shutdown()
[ 1822.144821] -----------------------------------------------------------------------------
[ 1822.145990] Disabling lock debugging due to kernel taint
[ 1822.146732] INFO: Slab 0x0000000092a844b2 objects=32 used=2 fp=0x00000000fef011b0 flags=0x1ffff8000000100
[ 1822.147700] CPU: 2 PID: 13584 Comm: rmmod Tainted: G B O 4.15.0-rc2+ #87
[ 1822.148578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1822.150008] Call Trace:
[ 1822.150510] dump_stack+0x78/0xa9
[ 1822.151156] slab_err+0xb1/0xd3
[ 1822.151834] ? __kmalloc+0x1bb/0x1ce
[ 1822.152546] __kmem_cache_shutdown+0x151/0x28b
[ 1822.153395] shutdown_cache+0x13/0x144
[ 1822.154126] kmem_cache_destroy+0x1c0/0x1fb
[ 1822.154669] SyS_delete_module+0x194/0x244
[ 1822.155199] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1822.155773] entry_SYSCALL_64_fastpath+0x23/0x9a
[ 1822.156343] RIP: 0033:0x7f929bd38b17
[ 1822.156859] RSP: 002b:00007ffd160e9a98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
[ 1822.157728] RAX: ffffffffffffffda RBX: 00005578316ba090 RCX: 00007f929bd38b17
[ 1822.158422] RDX: 00007f929bd9ec60 RSI: 0000000000000800 RDI: 00005578316ba0f0
[ 1822.159114] RBP: 0000000000000003 R08: 00007f929bff5f20 R09: 00007ffd160e8a11
[ 1822.159808] R10: 00007ffd160e9860 R11: 0000000000000202 R12: 00007ffd160e8a80
[ 1822.160513] R13: 0000000000000000 R14: 0000000000000000 R15: 00005578316ba090
[ 1822.161278] INFO: Object 0x000000007645de29 @offset=0
[ 1822.161666] INFO: Object 0x00000000d5df2ab5 @offset=128
Fixes: 30313a3d57 ("bridge: Handle IFLA_ADDRESS correctly when creating bridge device")
Fixes: 5b8d5429da ("bridge: netlink: register netdevice before executing changelink")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch the bridge used a fixed 256 element hash table which
was fine for small use cases (in my tests it starts to degrade
above 1000 entries), but it wasn't enough for medium or large
scale deployments. Modern setups have thousands of participants in a
single bridge, even only enabling vlans and adding a few thousand vlan
entries will cause a few thousand fdbs to be automatically inserted per
participating port. So we need to scale the fdb table considerably to
cope with modern workloads, and this patch converts it to use a
rhashtable for its operations thus improving the bridge scalability.
Tests show the following results (10 runs each), at up to 1000 entries
rhashtable is ~3% slower, at 2000 rhashtable is 30% faster, at 3000 it
is 2 times faster and at 30000 it is 50 times faster.
Obviously this happens because of the properties of the two constructs
and is expected, rhashtable keeps pretty much a constant time even with
10000000 entries (tested), while the fixed hash table struggles
considerably even above 10000.
As a side effect this also reduces the net_bridge struct size from 3248
bytes to 1344 bytes. Also note that the key struct is 8 bytes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
all of these can be compiled as a module, so use new
_module version to make sure module can no longer be removed
while callback/dump is in use.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first member of an IPSEC route bundle chain sets it's dst->path to
the underlying ipv4/ipv6 route that carries the bundle.
Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.
This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.
When we don't have the top of an IPSEC bundle 'dst->path == dst'.
Move it down into xfrm_dst and key off of dst->xfrm.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Found another missing port flag policy entry for IFLA_BRPORT_VLAN_TUNNEL
so add it now.
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the host joins or leaves a multicast group, use switchdev to add
an object to the hardware to forward traffic for the group to the
host.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The host can join or leave a multicast group on the brX interface, as
indicated by IGMP snooping. This is tracked within the bridge
multicast code. Send a notification when this happens, in the same way
a notification is sent when a port of the bridge joins/leaves a group
because of IGMP snooping.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The boolean mglist indicates the host has joined a particular
multicast group on the bridge interface. It is badly named, obscuring
what is means. Rename it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, they are:
1) Speed up table replacement on busy systems with large tables
(and many cores) in x_tables. Now xt_replace_table() synchronizes by
itself by waiting until all cpus had an even seqcount and we use no
use seqlock when fetching old counters, from Florian Westphal.
2) Add nf_l4proto_log_invalid() and nf_ct_l4proto_log_invalid() to speed
up packet processing in the fast path when logging is not enabled, from
Florian Westphal.
3) Precompute masked address from configuration plane in xt_connlimit,
from Florian.
4) Don't use explicit size for set selection if performance set policy
is selected.
5) Allow to get elements from an existing set in nf_tables.
6) Fix incorrect check in nft_hash_deactivate(), from Florian.
7) Cache netlink attribute size result in l4proto->nla_size, from
Florian.
8) Handle NFPROTO_INET in nf_ct_netns_get() from conntrack core.
9) Use power efficient workqueue in conntrack garbage collector, from
Vincent Guittot.
10) Remove unnecessary parameter, in conntrack l4proto functions, also
from Florian.
11) Constify struct nf_conntrack_l3proto definitions, from Florian.
12) Remove all typedefs in nf_conntrack_h323 via coccinelle semantic
patch, from Harsha Sharma.
13) Don't store address in the rbtree nodes in xt_connlimit, they are
never used, from Florian.
14) Fix out of bound access in the conntrack h323 helper, patch from
Eric Sesterhenn.
15) Print symbols for the address returned with %pS in IPVS, from
Helge Deller.
16) Proc output should only display its own netns in IPVS, from
KUWAZAWA Takuya.
17) Small clean up in size_entry_mwt(), from Colin Ian King.
18) Use test_and_clear_bit from nf_nat_proto_clean() instead of separated
non-atomic test and then clear bit, from Florian Westphal.
19) Consolidate prefix length maps in ipset, from Aaron Conole.
20) Fix sparse warnings in ipset, from Jozsef Kadlecsik.
21) Simplify list_set_memsize(), from simran singhal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
buf is initialized to buf_start and then set on the next statement
to buf_start + offsets[i]. Clean this up to just initialize buf
to buf_start + offsets[i] to clean up the clang build warning:
"Value stored to 'buf' during its initialization is never read"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
switch to using the new timer_setup() and from_timer() api's.
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switch to using the new timer_setup() and from_timer() api's.
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lan9303 driver defines eth_stp_addr as a synonym to
eth_reserved_addr_base to get the STP ethernet address 01:80:c2:00:00:00.
eth_reserved_addr_base is also used to define the start of Bridge Reserved
ethernet address range, which happen to be the STP address.
br_dev_setup refer to eth_reserved_addr_base as a definition of STP
address.
Clean up by:
- Move the eth_stp_addr definition to linux/etherdevice.h
- Use eth_stp_addr instead of eth_reserved_addr_base in br_dev_setup.
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the bridge device doesn't generate any notifications upon vlan
modifications on itself because it doesn't use the generic bridge
notifications.
With the recent changes we know if anything was modified in the vlan config
thus we can generate a notification when necessary for the bridge device
so add support to br_ifinfo_notify() similar to how other combined
functions are done - if port is present it takes precedence, otherwise
notify about the bridge. I've explicitly marked the locations where the
notification should be always for the port by setting bridge to NULL.
I've also taken the liberty to rearrange each modified function's local
variables in reverse xmas tree as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an entry for IFLA_BRPORT_NEIGH_SUPPRESS to bridge port policies.
Fixes: 821f1b21ca ("bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch there was no way to tell if the vlan add operation
actually changed anything, thus we would always generate a notification
on adds. Let's make the notifications more precise and generate them
only if anything changed, so use the new bool parameter to signal that the
vlan was updated. We cannot return an error because there are valid use
cases that will be broken (e.g. overlapping range add) and also we can't
risk masking errors due to calls into drivers for vlan add which can
potentially return anything.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch we had cases that either sent notifications when there
were in fact no changes (e.g. non-existent vlan delete) or didn't send
notifications when there were changes (e.g. vlan add range with an error in
the middle, port flags change + vlan update error). This patch sends down
a boolean to the functions setlink/dellink use and if there is even a
single configuration change (port flag, vlan add/del, port state) then
we always send a notification. This is all done to keep backwards
compatibility with the opportunistic vlan delete, where one could
specify a vlan range that has missing vlans inside and still everything
in that range will be cleared, this is mostly used to clear the whole
vlan config with a single call, i.e. range 1-4094.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit dc709f3757 ("rtnetlink: bring NETDEV_CHANGEUPPER event
process back in rtnetlink_event"), rtnetlink_event would process the
NETDEV_CHANGEUPPER event and send a notification to userspace.
In add_del_if, it would generate NETDEV_CHANGEUPPER event by whether
netdev_master_upper_dev_link or netdev_upper_dev_unlink. There's
no need to call rtmsg_ifinfo to send the notification any more.
So this patch is to remove it from add_del_if also to avoid redundant
notifications.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were quite a few overlapping sets of changes here.
Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.
Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly. If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.
In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().
Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.
The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
When vlan tunnels were introduced, vlan range errors got silently
dropped and instead 0 was returned always. Restore the previous
behaviour and return errors to user-space.
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an access function that, given a bridge netdevice, returns whether the
bridge device is currently an mrouter or not. The function uses the already
existing br_multicast_is_router function to check that.
This function is needed in order to allow ports that join an already
existing bridge to know the current mrouter state of the bridge device.
Together with the bridge device mrouter ports switchdev notifications, it
is possible to have full offloading of the semantics of the bridge device
mcast router state.
Due to the fact that the bridge multicast router status can change in
packet RX path, take the multicast_router bridge spinlock to protect the
read.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the SWITCHDEV_ATTR_ID_BRIDGE_MROUTER switchdev notification type, used
to indicate whether the bridge is or isn't mrouter. Notify when the bridge
changes its state, similarly to the already existing bridged port mrouter
notifications.
The notification uses the switchdev_attr.u.mrouter boolean flag to indicate
the current bridge mrouter status. Thus, it only indicates whether the
bridge is currently used as an mrouter or not, and does not indicate the
exact mrouter state of the bridge (learning, permanent, etc.).
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch avoids flooding and proxies ndisc packets
for BR_NEIGH_SUPPRESS ports.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch avoids flooding and proxies arp packets
for BR_NEIGH_SUPPRESS ports.
Moves existing br_do_proxy_arp to br_do_proxy_suppress_arp
to support both proxy arp and neigh suppress.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
suppress arp and nd flood on bridge ports. It implements
rfc7432, section 10.
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_PROXYARP* flags but has a few semantic differences to conform
to EVPN standard. Unlike the existing flags, this new flag suppresses
flood of all neigh discovery packets (arp and nd) to tunnel ports.
Supports both vlan filtering and non-vlan filtering bridges.
In case of EVPN, it is mainly used to avoid flooding
of arp and nd packets to tunnel ports like vxlan.
This patch adds netlink and sysfs support to set this bridge port
flag.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack arg to br_add_if. Add messages for a couple of failures
and pass arg to netdev_master_upper_dev_link.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass extack to do_set_master and down to ndo_add_slave
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible for ebt_in_hook to be triggered before ebt_table is assigned
resulting in a NULL-pointer dereference. Make sure hooks are
registered as the last step.
Fixes: aee12a0a37 ("ebtables: remove nf_hook_register usage")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We need to be able to transparently forward most link-local frames via
tunnels (e.g. vxlan, qinq). Currently the bridge's group_fwd_mask has a
mask which restricts the forwarding of STP and LACP, but we need to be able
to forward these over tunnels and control that forwarding on a per-port
basis thus add a new per-port group_fwd_mask option which only disallows
mac pause frames to be forwarded (they're always dropped anyway).
The patch does not change the current default situation - all of the others
are still restricted unless configured for forwarding.
We have successfully tested this patch with LACP and STP forwarding over
VxLAN and qinq tunnels.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, there is a difference in netlink events received when an
interface is modified through bridge ioctl() or through netlink. This
patch generates additional events when an interface is added to or
removed from a bridge via ioctl().
When adding then removing an interface from a bridge with netlink, we
get:
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
Deleted 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
When using ioctl():
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
Deleted 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 9e:da:60:ee:cf:c8
5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff
Without this patch, the last netlink notification is not sent.
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 00ba4cb36d.
Discussion with David Ahern determined that this change is
actually not needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when an interface is released from a bridge via
ioctl(), we get a RTM_DELLINK event through netlink:
Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 6e:23:c2:54:3a:b3
Userspace has to interpret that as a removal from the bridge, not as a
complete removal of the interface. When an bridged interface is
completely removed, we get two events:
Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 master bridge0 state DOWN
link/ether 6e:23:c2:54:3a:b3
Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 6e:23:c2:54:3a:b3 brd ff:ff:ff:ff:ff:ff
In constrast, when an interface is released from a bond, we get a
RTM_NEWLINK with only the new characteristics (no master):
3: dummy1: <BROADCAST,NOARP,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond0 state UNKNOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
link/ether ca:c8:7b:66:f8:25 brd ff:ff:ff:ff:ff:ff
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff
Userland may be confused by the fact we say a link is deleted while
its characteristics are only modified. A first solution would have
been to turn the RTM_DELLINK event in del_nbp() into a RTM_NEWLINK
event. However, maybe some piece of userland is relying on this
RTM_DELLINK to detect when a bridged interface is released. Instead,
we also emit a RTM_NEWLINK event once the interface is
released (without master info).
Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN
link/ether 8a:bb:e7:94:b1:f8
2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 8a:bb:e7:94:b1:f8 brd ff:ff:ff:ff:ff:ff
This is done only when using ioctl(). When using Netlink, such an
event is already automatically emitted in do_setlink().
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using ifdef in the C file.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they
are no longer required. Replace _ASSERT() macros with WARN_ON().
Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Basically, updates to the conntrack core, enhancements for
nf_tables, conversion of netfilter hooks from linked list to array to
improve memory locality and asorted improvements for the Netfilter
codebase. More specifically, they are:
1) Add expection to hashes after timer initialization to prevent
access from another CPU that walks on the hashes and calls
del_timer(), from Florian Westphal.
2) Don't update nf_tables chain counters from hot path, this is only
used by the x_tables compatibility layer.
3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
Hooks are always guaranteed to run from rcu read side, so remove
nested rcu_read_lock() where possible. Patch from Taehee Yoo.
4) nf_tables new ruleset generation notifications include PID and name
of the process that has updated the ruleset, from Phil Sutter.
5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
the nf_family netdev family. Patch from Pablo M. Bermudo.
6) Add support for nft_fib in nf_tables netdev family, also from Pablo.
7) Use deferrable workqueue for conntrack garbage collection, to reduce
power consumption, from Patch from Subash Abhinov Kasiviswanathan.
8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
Westphal.
9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.
10) Drop references on conntrack removal path when skbuffs has escaped via
nfqueue, from Florian.
11) Don't queue packets to nfqueue with dying conntrack, from Florian.
12) Constify nf_hook_ops structure, from Florian.
13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter.
14) Add nla_strdup(), from Phil Sutter.
15) Rise nf_tables objects name size up to 255 chars, people want to use
DNS names, so increase this according to what RFC 1035 specifies.
Patch series from Phil Sutter.
16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
registration on demand, suggested by Eric Dumazet, patch from Florian.
17) Remove unused variables in compat_copy_entry_from_user both in
ip_tables and arp_tables code. Patch from Taehee Yoo.
18) Constify struct nf_conntrack_l4proto, from Julia Lawall.
19) Constify nf_loginfo structure, also from Julia.
20) Use a single rb root in connlimit, from Taehee Yoo.
21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.
22) Use audit_log() instead of open-coding it, from Geliang Tang.
23) Allow to mangle tcp options via nft_exthdr, from Florian.
24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
a fix for a miscalculation of the minimal length.
25) Simplify branch logic in h323 helper, from Nick Desaulniers.
26) Calculate netlink attribute size for conntrack tuple at compile
time, from Florian.
27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
From Florian.
28) Remove holes in nf_conntrack_l4proto structure, so it becomes
smaller. From Florian.
29) Get rid of print_tuple() indirection for /proc conntrack listing.
Place all the code in net/netfilter/nf_conntrack_standalone.c.
Patch from Florian.
30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
off. From Florian.
31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
Florian.
32) Fix broken indentation in ebtables extensions, from Colin Ian King.
33) Fix several harmless sparse warning, from Florian.
34) Convert netfilter hook infrastructure to use array for better memory
locality, joint work done by Florian and Aaron Conole. Moreover, add
some instrumentation to debug this.
35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
per batch, from Florian.
36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.
37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.
38) Remove unused code in the generic protocol tracker, from Davide
Caratti.
I think I will have material for a second Netfilter batch in my queue if
time allow to make it fit in this merge window.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6bc506b4fb ("bridge: switchdev: Add forward mark support for
stacked devices") added the 'offload_fwd_mark' bit to the skb in order
to allow drivers to indicate to the bridge driver that they already
forwarded the packet in L2.
In case the bit is set, before transmitting the packet from each port,
the port's mark is compared with the mark stored in the skb's control
block. If both marks are equal, we know the packet arrived from a switch
device that already forwarded the packet and it's not re-transmitted.
However, if the packet is transmitted from the bridge device itself
(e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
stored in the skb's control block isn't valid.
This scenario can happen in rare cases where a packet was trapped during
L3 forwarding and forwarded by the kernel to a bridge device.
Fixes: 6bc506b4fb ("bridge: switchdev: Add forward mark support for stacked devices")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Yotam Gigi <yotamg@mellanox.com>
Tested-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends bridge fdb table tracepoints to also cover
learned fdb entries in the br_fdb_update path. Note that
unlike other tracepoints I have moved this to when the fdb
is modified because this is in the datapath and can generate
a lot of noise in the trace output. br_fdb_update is also called
from added_by_user context in the NTF_USE case which is already
traced ..hence the !added_by_user check.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few useful tracepoints to trace bridge forwarding
database updates.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make this const as it is only passed to a const argument of the function
ebt_register_table.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the storage and layout of netfilter hook entries from a
linked list to an array. After this commit, hook entries will be
stored adjacent in memory. The next pointer is no longer required.
The ops pointers are stored at the end of the array as they are only
used in the register/unregister path and in the legacy br_netfilter code.
nf_unregister_net_hooks() is slower than needed as it just calls
nf_unregister_net_hook in a loop (i.e. at least n synchronize_net()
calls), this will be addressed in followup patch.
Test setup:
- ixgbe 10gbit
- netperf UDP_STREAM, 64 byte packets
- 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter):
empty mangle and raw prerouting, mangle and filter input hooks:
353.9
this patch:
364.2
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The returns on some if statements are not indented correctly,
add in the missing tab.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.
This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
At this point no driver supports FDB add/del through switchdev object
but rather via notification chain, thus, it is removed.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer place these on a list so they can be const.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With 802.1ad support the vlan_ingress code started checking for vlan
protocol mismatch which causes the current tag to be inserted and the
bridge vlan protocol & pvid to be set. The vlan tag insertion changes
the skb mac_header and thus the lookup mac dest pointer which was loaded
prior to calling br_allowed_ingress in br_handle_frame_finish is VLAN_HLEN
bytes off now, pointing to the last two bytes of the destination mac and
the first four of the source mac causing lookups to always fail and
broadcasting all such packets to all ports. Same thing happens for locally
originated packets when passing via br_dev_xmit. So load the dest pointer
after the vlan checks and possible skb change.
Fixes: 8580e2117c ("bridge: Prepare for 802.1ad vlan filtering support")
Reported-by: Anitha Narasimha Murthy <anitha@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains two Netfilter fixes for your net tree,
they are:
1) Fix memleak from netns release path of conntrack protocol trackers,
patch from Liping Zhang.
2) Uninitialized flags field in ebt_log, that results in unpredictable
logging format in ebtables, also from Liping.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
current code silently ignores change of port in the request
message. This patch makes sure the port is modified and
notification is sent to userspace.
Fixes: cf6b8e1eed ("bridge: add API to notify bridge driver of learned FBD on offloaded device")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:
X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
dying bit to let the CPU release them.
X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
conntrack from all namespace.
X) Restart iteration on hashtable resizing, since both may occur at
the same time.
X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
mapping on module removal.
X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
module removal, from Liping Zhang.
X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
if user requests this, also from Liping.
X) Add net_ns_barrier() and use it from FTP helper, so make sure
no concurrent namespace removal happens at the same time while
the helper module is being removed.
X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
module size. Same thing in nf_tables.
Updates for the nf_tables infrastructure:
X) Prepare usage of the extended ACK reporting infrastructure for
nf_tables.
X) Remove unnecessary forward declaration in nf_tables hash set.
X) Skip set size estimation if number of element is not specified.
X) Changes to accomodate a (faster) unresizable hash set implementation,
for anonymous sets and dynamic size fixed sets with no timeouts.
X) Faster lookup function for unresizable hash table for 2 and 4
bytes key.
And, finally, a bunch of asorted small updates and cleanups:
X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
to device events and look up for index from the packet path, this
is fixing an issue that is present since the very beginning, patch
from Xin Long.
X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.
X) Use ebt_invalid_target() whenever possible in the ebtables tree,
from Gao Feng.
X) Calm down compilation warning in nf_dup infrastructure, patch from
stephen hemminger.
X) Statify functions in nftables rt expression, also from stephen.
X) Update Makefile to use canonical method to specify nf_tables-objs.
From Jike Song.
X) Use nf_conntrack_helpers_register() in amanda and H323.
X) Space cleanup for ctnetlink, from linzhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
2645 896 0 3541 dd5 net/bridge/br_sysfs_br.o
File size After adding 'const':
text data bss dec hex filename
2701 832 0 3533 dcd net/bridge/br_sysfs_br.o
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"struct nf_loginfo li;" is a local variable, so we should set the flags
to 0 explicitly, else, packets maybe truncated unexpectedly when copied
to the userspace.
Fixes: 7643507fe8 ("netfilter: xt_NFLOG: nflog-range does not truncate packets")
Cc: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for extended error reporting.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new helper function ebt_invalid_target instead of the old
macro INVALID_TARGET and other duplicated codes to enhance the readability.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.
Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:
@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)
@@
expression E, SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)
@@
expression SKB, LEN;
identifier fn = { skb_push, __skb_push, skb_push_rcsum };
@@
- fn(SKB, LEN)[0]
+ *(u8 *)fn(SKB, LEN)
Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.
An spatch similar to the one for skb_put_zero() converts many
of the places using it:
@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)
@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)
@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);
(again, manually post-processed to retain some comments)
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.
The following spatch found many more and also removes the
now unnecessary casts:
@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_zero(skb, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)
@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)
@@
expression skb, len;
@@
-memset(skb_put(skb, len), 0, len);
+skb_put_zero(skb, len);
Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new static FDB is added to the bridge a notification is sent to
the driver for offload. In case of successful offload the driver should
notify the bridge back, which in turn should mark the FDB as offloaded.
Currently, externally learned is equivalent for being offloaded which is
not correct due to the fact that FDBs which are added from user-space are
also marked as externally learned. In order to specify if an FDB was
successfully offloaded a new flag is introduced.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the bridge doesn't notify the underlying devices about new
FDBs learned. The FDB sync is placed on the switchdev notifier chain
because devices may potentially learn FDB that are not directly related
to their ports, for example:
1. Mixed SW/HW bridge - FDBs that point to the ASICs external devices
should be offloaded as CPU traps in order to
perform forwarding in slow path.
2. EVPN - Externally learned FDBs for the vtep device.
Notification is sent only about static FDB add/del. This is done due
to fact that currently this is the only scenario supported by switch
drivers.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done as a preparation to moving the switchdev notifier chain
to be atomic. The FDB external learning should be called under rtnl
or rcu.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the flood, learning and learning_sync port attributes are
offloaded by setting the SELF flag. Add support for offloading the
flood and learning attribute through the bridge code. In case of
setting an unsupported flag on a offloded port the operation will
fail.
The learning_sync attribute doesn't have any software representation
and cannot be offloaded through the bridge code.
Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
We might call br_afspec() with p == NULL which is a valid use case if
the action is on the bridge device itself, but the bridge tunnel code
dereferences the p pointer without checking, so check if p is null
first.
Reported-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the transition of NO_STP -> KERNEL_STP was fixed by always calling
mod_timer in br_stp_start, it introduced a new regression which causes
the timer to be armed even when the bridge is down, and since we stop
the timers in its ndo_stop() function, they never get disabled if the
device is destroyed before it's upped.
To reproduce:
$ while :; do ip l add br0 type bridge hello_time 100; brctl stp br0 on;
ip l del br0; done;
CC: Xin Long <lucien.xin@gmail.com>
CC: Ivan Vecera <cera@cera.cz>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Fixes: 6d18c732b9 ("bridge: start hello_timer when enabling KERNEL_STP in br_stp_start")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During enslavement to a bridge, after the CHANGEUPPER is sent, the
multicast enabled state of the bridge isn't propagated down to the
offloading driver unless it's changed.
This patch allows such drivers to query the multicast enabled state from
the bridge, so that they'll be able to correctly configure their flood
tables during port enslavement.
In case multicast is disabled, unregistered multicast packets can be
treated as broadcast and be flooded through all the bridge ports.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful for drivers supporting bridge offload to be able to query
the bridge's VLAN filtering state.
Currently, upon enslavement to a bridge master, the offloading driver
will only learn about the bridge's VLAN filtering state after the bridge
device was already linked with its slave.
Being able to query the bridge's VLAN filtering state allows such
drivers to forbid enslavement in case resource couldn't be allocated for
a VLAN-aware bridge and also choose the correct initialization routine
for the enslaved port, which is dependent on the bridge type.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current bridge code incorrectly handles starting/stopping of hello and
hold timers during STP enable/disable.
1. Timers are stopped in br_stp_start() during NO_STP->USER_STP
transition. The timers are already stopped in NO_STP state so
this is confusing no-op.
2. During USER_STP->NO_STP transition the timers are started. This
does not make sense and is confusion because the timer should not be
active in NO_STP state.
Cc: davem@davemloft.net
Cc: sashok@cumulusnetworks.com
Cc: stephen@networkplumber.org
Cc: bridge@lists.linux-foundation.org
Cc: lucien.xin@gmail.com
Cc: nikolay@cumulusnetworks.com
Signed-off-by: Ivan Vecera <cera@cera.cz>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 76b91c32dd ("bridge: stp: when using userspace stp stop
kernel hello and hold timers"), bridge would not start hello_timer if
stp_enabled is not KERNEL_STP when br_dev_open.
The problem is even if users set stp_enabled with KERNEL_STP later,
the timer will still not be started. It causes that KERNEL_STP can
not really work. Users have to re-ifup the bridge to avoid this.
This patch is to fix it by starting br->hello_timer when enabling
KERNEL_STP in br_stp_start.
As an improvement, it's also to start hello_timer again only when
br->stp_enabled is KERNEL_STP in br_hello_timer_expired, there is
no reason to start the timer again when it's NO_STP.
Fixes: 76b91c32dd ("bridge: stp: when using userspace stp stop kernel hello and hold timers")
Reported-by: Haidong Li <haili@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) When using IPVS in direct-routing mode, normal traffic from the LVS
host to a back-end server is sometimes incorrectly NATed on the way
back into the LVS host. Patch to fix this from Julian Anastasov.
2) Calm down clang compilation warning in ctnetlink due to type
mismatch, from Matthias Kaehlcke.
3) Do not re-setup NAT for conntracks that are already confirmed, this
is fixing a problem that was introduced in the previous nf-next batch.
Patch from Liping Zhang.
4) Do not allow conntrack helper removal from userspace cthelper
infrastructure if already in used. This comes with an initial patch
to introduce nf_conntrack_helper_put() that is required by this fix.
From Liping Zhang.
5) Zero the pad when copying data to userspace, otherwise iptables fails
to remove rules. This is a follow up on the patchset that sorts out
the internal match/target structure pointer leak to userspace. Patch
from the same author, Willem de Bruijn. This also comes with a build
failure when CONFIG_COMPAT is not on, coming in the last patch of
this series.
6) SYNPROXY crashes with conntrack entries that are created via
ctnetlink, more specifically via conntrackd state sync. Patch from
Eric Leblond.
7) RCU safe iteration on set element dumping in nf_tables, from
Liping Zhang.
8) Missing sanitization of immediate date for the bitwise and cmp
expressions in nf_tables.
9) Refcounting logic for chain and objects from set elements does not
integrate into the nf_tables 2-phase commit protocol.
10) Missing sanitization of target verdict in ebtables arpreply target,
from Gao Feng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This bit was introduced with commit 5a21232983 ("net: Support for
csum_bad in skbuff") to reduce the stack workload when processing RX
packets carrying a wrong Internet Checksum. Up to now, only one driver and
GRO core are setting it.
Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is allowed to set the default pvid of a bridge to a value
above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and
returns -EINVAL in case the pvid is out of bounds.
Reproduce by calling:
[root@test ~]# ip l a type bridge
[root@test ~]# ip l a type dummy
[root@test ~]# ip l s bridge0 type bridge vlan_filtering 1
[root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999
[root@test ~]# ip l s dummy0 master bridge0
[root@test ~]# bridge vlan
port vlan ids
bridge0 9999 PVID Egress Untagged
dummy0 9999 PVID Egress Untagged
Fixes: 0f963b7592 ("bridge: netlink: add support for default_pvid")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Tobias Jungel <tobias.jungel@bisdn.de>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The info->target comes from userspace and it would be used directly.
So we need to add the sanity check to make sure it is a valid standard
target, although the ebtables tool has already checked it. Kernel needs
to validate anything coming from userspace.
If the target is set as an evil value, it would break the ebtables
and cause a panic. Because the non-standard target is treated as one
offset.
Now add one helper function ebt_invalid_target, and we would replace
the macro INVALID_TARGET later.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When looking up an iptables rule, the iptables binary compares the
aligned match and target data (XT_ALIGN). In some cases this can
exceed the actual data size to include padding bytes.
Before commit f77bc5b23f ("iptables: use match, target and data
copy_to_user helpers") the malloc()ed bytes were overwritten by the
kernel with kzalloced contents, zeroing the padding and making the
comparison succeed. After this patch, the kernel copies and clears
only data, leaving the padding bytes undefined.
Extend the clear operation from data size to aligned data size to
include the padding bytes, if any.
Padding bytes can be observed in both match and target, and the bug
triggered, by issuing a rule with match icmp and target ACCEPT:
iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
Fixes: f77bc5b23f ("iptables: use match, target and data copy_to_user helpers")
Reported-by: Paul Moore <pmoore@redhat.com>
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The attribute sizes for IFLA_BRPORT_MCAST_FLOOD and
IFLA_BRPORT_BCAST_FLOOD weren't accounted for in br_port_info_size()
when they were added. Do so now and also add the corresponding policy
entries:
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: Mike Manning <mmanning@brocade.com>
Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Fixes: 99f906e9ad ("bridge: add per-port broadcast flood flag")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS/OVS fixes for net
The following patchset contains a rather large batch of Netfilter, IPVS
and OVS fixes for your net tree. This includes fixes for ctnetlink, the
userspace conntrack helper infrastructure, conntrack OVS support,
ebtables DNAT target, several leaks in error path among other. More
specifically, they are:
1) Fix reference count leak in the CT target error path, from Gao Feng.
2) Remove conntrack entry clashing with a matching expectation, patch
from Jarno Rajahalme.
3) Fix bogus EEXIST when registering two different userspace helpers,
from Liping Zhang.
4) Don't leak dummy elements in the new bitmap set type in nf_tables,
from Liping Zhang.
5) Get rid of module autoload from conntrack update path in ctnetlink,
we don't need autoload at this late stage and it is happening with
rcu read lock held which is not good. From Liping Zhang.
6) Fix deadlock due to double-acquire of the expect_lock from conntrack
update path, this fixes a bug that was introduced when the central
spinlock got removed. Again from Liping Zhang.
7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock
protection that was selected when the central spinlock was removed was
not really protecting anything at all.
8) Protect sequence adjustment under ct->lock.
9) Missing socket match with IPv6, from Peter Tirsek.
10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
Linus Luessing.
11) Don't give up on evaluating the expression on new entries added via
dynset expression in nf_tables, from Liping Zhang.
12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
with non-linear skbuffs.
13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
from Paolo Abeni.
14) Missing mutex release in error path of xt_find_table_lock(), from
Dan Carpenter.
15) Update maintainers files, Netfilter section. Add Florian to the
file, refer to nftables.org and change project status from Supported
to Maintained.
16) Bail out on mismatching extensions in element updates in nf_tables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:
1) Check for ct->status bit instead of using nfct_nat() from IPVS and
Netfilter codebase, patch from Florian Westphal.
2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.
3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.
4) Introduce nft_is_base_chain() helper function.
5) Enforce expectation limit from userspace conntrack helper,
from Gao Feng.
6) Add nf_ct_remove_expect() helper function, from Gao Feng.
7) NAT mangle helper function return boolean, from Gao Feng.
8) ctnetlink_alloc_expect() should only work for conntrack with
helpers, from Gao Feng.
9) Add nfnl_msg_type() helper function to nfnetlink to build the
netlink message type.
10) Get rid of unnecessary cast on void, from simran singhal.
11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
also from simran singhal.
12) Use list_prev_entry() from nf_tables, from simran signhal.
13) Remove unnecessary & on pointer function in the Netfilter and IPVS
code.
14) Remove obsolete comment on set of rules per CPU in ip6_tables,
no longer true. From Arushi Singhal.
15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.
16) Remove unnecessary nested rcu_read_lock() in
__nf_nat_decode_session(). Code running from hooks are already
guaranteed to run under RCU read side.
17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.
18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
also from Aaron.
19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.
20) Don't propagate NF_DROP error to userspace via ctnetlink in
__nf_nat_alloc_null_binding() function, from Gao Feng.
21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
from Gao Feng.
22) Kill the fake untracked conntrack objects, use ctinfo instead to
annotate a conntrack object is untracked, from Florian Westphal.
23) Remove nf_ct_is_untracked(), now obsolete since we have no
conntrack template anymore, from Florian.
24) Add event mask support to nft_ct, also from Florian.
25) Move nf_conn_help structure to
include/net/netfilter/nf_conntrack_helper.h.
26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
Thus, we don't deal with variable conntrack extensions anymore.
Make sure userspace conntrack helper doesn't go over that size.
Remove variable size ct extension infrastructure now this code
got no more clients. From Florian Westphal.
27) Restore offset and length of nf_ct_ext structure to 8 bytes now
that wraparound is not possible any longer, also from Florian.
28) Allow to get rid of unassured flows under stress in conntrack,
this applies to DCCP, SCTP and TCP protocols, from Florian.
29) Shrink size of nf_conntrack_ecache structure, from Florian.
30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
from Gao Feng.
31) Register SYNPROXY hooks on demand, from Florian Westphal.
32) Use pernet hook whenever possible, instead of global hook
registration, from Florian Westphal.
33) Pass hook structure to ebt_register_table() to consolidate some
infrastructure code, from Florian Westphal.
34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
SYNPROXY code, to make sure device stats are not fooled, patch
from Gao Feng.
35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
don't need anymore if we just select a fixed size instead of
expensive runtime time calculation of this. From Florian.
36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
from Florian.
37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
Florian.
38) Attach NAT extension on-demand from masquerade and pptp helper
path, from Florian.
39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.
40) Speed up netns by selective calls of synchronize_net(), from
Florian Westphal.
41) Silence stack size warning gcc in 32-bit arch in snmp helper,
from Florian.
42) Inconditionally call nf_ct_ext_destroy(), even if we have no
extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb
entries") added the ability to "take over an entry which was previously
learned via HW when it shows up from a SW port".
However, if an entry was learned via HW and then a control packet
(e.g., ARP request) was trapped to the CPU, the bridge driver will
update the entry and remove the externally learned flag, although the
entry is still present in HW. Instead, only clear the externally learned
flag in case of roaming.
Fixes: 7e26bf45e4 ("net: bridge: allow SW learn to take over HW fdb entries")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Arkadi Sharashevsky <arkadis@mellanox.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for l2 multicast flood control was added in commit b6cb5ac833
("net: bridge: add per-port multicast flood flag"). It allows broadcast
as it was introduced specifically for unknown multicast flood control.
But as broadcast is a special case of multicast, this may also need to
be disabled. For this purpose, introduce a flag to disable the flooding
of received l2 broadcasts. This approach is backwards compatible and
provides flexibility in filtering for the desired packet types.
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to ip_register_table, pass nf_hook_ops to ebt_register_table().
This allows to handle hook registration also via pernet_ops and allows
us to avoid use of legacy register_hook api.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
During removing a bridge device, if the bridge is still up, a new mdb entry
still can be added in br_multicast_add_group() after all mdb entries are
removed in br_multicast_dev_del(). Like the path:
mld_ifc_timer_expire ->
mld_sendpack -> ...
br_multicast_rcv ->
br_multicast_add_group
The new mp's timer will be set up. If the timer expires after the bridge
is freed, it may cause use-after-free panic in br_multicast_group_expired.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
Call Trace:
<IRQ>
[<ffffffff81094536>] call_timer_fn+0x36/0x110
[<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
[<ffffffff81096967>] run_timer_softirq+0x237/0x340
[<ffffffff8108dcbf>] __do_softirq+0xef/0x280
[<ffffffff8169889c>] call_softirq+0x1c/0x30
[<ffffffff8102c275>] do_softirq+0x65/0xa0
[<ffffffff8108e055>] irq_exit+0x115/0x120
[<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
Nikolay also found it would cause a memory leak - the mdb hash is
reallocated and not freed due to the mdb rehash.
unreferenced object 0xffff8800540ba800 (size 2048):
backtrace:
[<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
[<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
[<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
[<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
[<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
[<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
[<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
[<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
[<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
[<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
[<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
[<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
[<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
[<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
[<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
[<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
This could happen when ip link remove a bridge or destroy a netns with a
bridge device inside.
With Nikolay's suggestion, this patch is to clean up bridge multicast in
ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
that netif_running check in br_multicast_add_group can avoid this issue.
v1->v2:
- fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
of calling dev_close in br_dev_delete.
(NOTE: Depends upon b6fe0440c6 ("bridge: implement missing ndo_uninit()"))
Fixes: e10177abf8 ("bridge: multicast: fix handling of temp and perm entries")
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trying to redirect bridged frames to the bridge device itself or
a bridge port (brouting) via the dnat target then this currently fails:
The ethernet destination of the frame is dnat'ed to the MAC address of
the bridge device or port just fine. However, the IP code drops it in
the beginning of ip_input.c/ip_rcv() as the dnat target left
the skb->pkt_type as PACKET_OTHERHOST.
Fixing this by resetting skb->pkt_type to an appropriate type after
dnat'ing.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.
This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently we added support for SW fdbs to take over HW ones, but that
results in changing a user-visible fdb flag thus we need to send a
notification, also it's consistent with how HW takes over SW entries.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter reported a kernel oops when executing the following command:
$ ip link add name test type bridge vlan_default_pvid 1
[13634.939408] BUG: unable to handle kernel NULL pointer dereference at
0000000000000190
[13634.939436] IP: __vlan_add+0x73/0x5f0
[...]
[13634.939783] Call Trace:
[13634.939791] ? pcpu_next_unpop+0x3b/0x50
[13634.939801] ? pcpu_alloc+0x3d2/0x680
[13634.939810] ? br_vlan_add+0x135/0x1b0
[13634.939820] ? __br_vlan_set_default_pvid.part.28+0x204/0x2b0
[13634.939834] ? br_changelink+0x120/0x4e0
[13634.939844] ? br_dev_newlink+0x50/0x70
[13634.939854] ? rtnl_newlink+0x5f5/0x8a0
[13634.939864] ? rtnl_newlink+0x176/0x8a0
[13634.939874] ? mem_cgroup_commit_charge+0x7c/0x4e0
[13634.939886] ? rtnetlink_rcv_msg+0xe1/0x220
[13634.939896] ? lookup_fast+0x52/0x370
[13634.939905] ? rtnl_newlink+0x8a0/0x8a0
[13634.939915] ? netlink_rcv_skb+0xa1/0xc0
[13634.939925] ? rtnetlink_rcv+0x24/0x30
[13634.939934] ? netlink_unicast+0x177/0x220
[13634.939944] ? netlink_sendmsg+0x2fe/0x3b0
[13634.939954] ? _copy_from_user+0x39/0x40
[13634.939964] ? sock_sendmsg+0x30/0x40
[13634.940159] ? ___sys_sendmsg+0x29d/0x2b0
[13634.940326] ? __alloc_pages_nodemask+0xdf/0x230
[13634.940478] ? mem_cgroup_commit_charge+0x7c/0x4e0
[13634.940592] ? mem_cgroup_try_charge+0x76/0x1a0
[13634.940701] ? __handle_mm_fault+0xdb9/0x10b0
[13634.940809] ? __sys_sendmsg+0x51/0x90
[13634.940917] ? entry_SYSCALL_64_fastpath+0x1e/0xad
The problem is that the bridge's VLAN group is created after setting the
default PVID, when registering the netdevice and executing its
ndo_init().
Fix this by changing the order of both operations, so that
br_changelink() is only processed after the netdevice is registered,
when the VLAN group is already initialized.
Fixes: b6677449df ("bridge: netlink: call br_changelink() during br_dev_newlink()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Peter V. Saveliev <peter@svinota.eu>
Tested-by: Peter V. Saveliev <peter@svinota.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
While the bridge driver implements an ndo_init(), it was missing a
symmetric ndo_uninit(), causing the different de-initialization
operations to be scattered around its dellink() and destructor().
Implement a symmetric ndo_uninit() and remove the overlapping operations
from its dellink() and destructor().
This is a prerequisite for the next patch, as it allows us to have a
proper cleanup upon changelink() failure during the bridge's newlink().
Fixes: b6677449df ("bridge: netlink: call br_changelink() during br_dev_newlink()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove & from function pointers to conform to the style found elsewhere
in the file. Done using the following semantic patch
// <smpl>
@r@
identifier f;
@@
f(...) { ... }
@@
identifier r.f;
@@
- &f
+ f
// </smpl>
Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The following Coccinelle script was used to detect this:
@r@
expression x;
void* e;
type T;
identifier f;
@@
(
*((T *)e)
|
((T *)x)[...]
|
((T*)x)->f
|
- (T*)
e
)
Unnecessary parantheses are also remove.
Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is an include loop between netdevice.h, dsa.h, devlink.h because
of NETDEV_ALIGN, making it impossible to use devlink structures in
dsa.h.
Break this loop by taking dsa.h out of netdevice.h, add a forward
declaration of dsa_switch_tree and netdev_set_default_ethtool_ops()
function, which is what netdevice.h requires.
No longer having dsa.h in netdevice.h means the includes in dsa.h no
longer get included. This breaks a few other files which depend on
these includes. Add these directly in the affected file.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NTF_EXT_LEARNED flag was added for switchdev and externally learned
entries, but it can also be used for entries learned via a software
in user-space which requires dynamic entries that do not expire.
One such case that we have is with quagga and evpn which need dynamic
entries but also require to age them themselves.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow to take over an entry which was previously learned via HW when it
shows up from a SW port. This is analogous to how HW takes over SW learned
entries already.
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your
net-next tree. A couple of new features for nf_tables, and unsorted
cleanups and incremental updates for the Netfilter tree. More
specifically, they are:
1) Allow to check for TCP option presence via nft_exthdr, patch
from Phil Sutter.
2) Add symmetric hash support to nft_hash, from Laura Garcia Liebana.
3) Use pr_cont() in ebt_log, from Joe Perches.
4) Remove some dead code in arp_tables reported via static analysis
tool, from Colin Ian King.
5) Consolidate nf_tables expression validation, from Liping Zhang.
6) Consolidate set lookup via nft_set_lookup().
7) Remove unnecessary rcu read lock side in bridge netfilter, from
Florian Westphal.
8) Remove unused variable in nf_reject_ipv4, from Tahee Yoo.
9) Pass nft_ctx struct to object initialization indirections, from
Florian Westphal.
10) Add code to integrate conntrack helper into nf_tables, also from
Florian.
11) Allow to check if interface index or name exists via
NFTA_FIB_F_PRESENT, from Phil Sutter.
12) Simplify resolve_normal_ct(), from Florian.
13) Use per-limit spinlock in nft_limit and xt_limit, from Liping Zhang.
14) Use rwlock in nft_set_rbtree set, also from Liping Zhang.
15) One patch to remove a useless printk at netns init path in ipvs,
and several patches to document IPVS knobs.
16) Use refcount_t for reference counter in the Netfilter/IPVS code,
from Elena Reshetova.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrei reported a false alarm of lockdep at net/bridge/br_fdb.c:109,
this is because in Andrei's case, a spin_bug() was already triggered
before this, therefore the debug_locks is turned off, lockdep_is_held()
is no longer accurate after that. We should use lockdep_assert_held_once()
instead of lockdep_is_held() to respect debug_locks.
Fixes: 410b3d48f5 ("bridge: fdb: add proper lock checks in searching functions")
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree, a
rather large batch of fixes targeted to nf_tables, conntrack and bridge
netfilter. More specifically, they are:
1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
From Florian Westphal.
2) SCTP protocol tracker assumes that ICMP error messages contain the
checksum field, what results in packet drops. From Ying Xue.
3) Fix inconsistent handling of AH traffic from nf_tables.
4) Fix new bitmap set representation with big endian. Fix mismatches in
nf_tables due to incorrect big endian handling too. Both patches
from Liping Zhang.
5) Bridge netfilter doesn't honor maximum fragment size field, cap to
largest fragment seen. From Florian Westphal.
6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
bits are now used to store the ctinfo. From Steven Rostedt.
7) Fix element comments with the bitmap set type. Revert the flush
field in the nft_set_iter structure, not required anymore after
fixing up element comments.
8) Missing error on invalid conntrack direction from nft_ct, also from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andreas reports kernel oops during rmmod of the br_netfilter module.
Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
Problem is that br_netfilter has the nasty concept of adding a fake
rtable to skb->dst; this happens in a br_netfilter prerouting hook.
A second hook (in bridge LOCAL_IN) is supposed to remove these again
before the skb is handed up the stack.
However, on module unload hooks get unregistered which means an
skb could traverse the prerouting hook that attaches the fake_rtable,
while the 'fake rtable remove' hook gets removed from the hooklist
immediately after.
Fixes: 34666d467c ("netfilter: bridge: move br_netfilter out of the core")
Reported-by: Andreas Karis <akaris@redhat.com>
Debugged-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
consider a bridge with mtu 9000, but end host sending smaller
packets to another host with mtu < 9000.
In this case, after reassembly, bridge+defrag would refragment,
and then attempt to send the reassembled packet as long as it
was below 9k.
Instead we have to cap by the largest fragment size seen.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
as comment says, the function is always called with rcu read lock held.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we want to validate the expr's dependency or hooks, we must do two
things to accomplish it. First, write a X_validate callback function
and point ->validate to it. Second, call X_validate in init routine.
This is very common, such as fib, nat, reject expr and so on ...
It is a little ugly, since we will call X_validate in the expr's init
routine, it's better to do it in nf_tables_newexpr. So we can avoid to
do this again and again. After doing this, the second step listed above
is not useful anymore, remove them now.
Patch was tested by nftables/tests/py/nft-test.py and
nftables/tests/shell/run-tests.sh.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Logging output was changed when simple printks without KERN_CONT
are now emitted on a new line and KERN_CONT is required to continue
lines so use pr_cont.
Miscellanea:
o realign arguments
o use print_hex_dump instead of a local variant
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Fix double-free in batman-adv, from Sven Eckelmann.
2) Fix packet stats for fast-RX path, from Joannes Berg.
3) Netfilter's ip_route_me_harder() doesn't handle request sockets
properly, fix from Florian Westphal.
4) Fix sendmsg deadlock in rxrpc, from David Howells.
5) Add missing RCU locking to transport hashtable scan, from Xin Long.
6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.
7) Fix race in NAPI handling between poll handlers and busy polling,
from Eric Dumazet.
8) TX path in vxlan and geneve need proper RCU locking, from Jakub
Kicinski.
9) SYN processing in DCCP and TCP need to disable BH, from Eric
Dumazet.
10) Properly handle net_enable_timestamp() being invoked from IRQ
context, also from Eric Dumazet.
11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.
12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
Melo.
13) Fix use-after-free in netvsc driver, from Dexuan Cui.
14) Fix max MTU setting in bonding driver, from WANG Cong.
15) xen-netback hash table can be allocated from softirq context, so use
GFP_ATOMIC. From Anoob Soman.
16) Fix MAC address change bug in bgmac driver, from Hari Vyas.
17) strparser needs to destroy strp_wq on module exit, from WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
strparser: destroy workqueue on module exit
sfc: fix IPID endianness in TSOv2
sfc: avoid max() in array size
rds: remove unnecessary returned value check
rxrpc: Fix potential NULL-pointer exception
nfp: correct DMA direction in XDP DMA sync
nfp: don't tell FW about the reserved buffer space
net: ethernet: bgmac: mac address change bug
net: ethernet: bgmac: init sequence bug
xen-netback: don't vfree() queues under spinlock
xen-netback: keep a local pointer for vif in backend_disconnect()
netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
netfilter: nf_conntrack_sip: fix wrong memory initialisation
can: flexcan: fix typo in comment
can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
can: gs_usb: fix coding style
can: gs_usb: Don't use stack memory for USB transfers
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
ixgbe: update the rss key on h/w, when ethtool ask for it
...
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Even with multicast flooding turned off, IPv6 ND should still work so
that IPv6 connectivity is provided. Allow this by continuing to flood
multicast traffic originated by us.
Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix error path order in nbp_vlan_init, so if switchdev_port_attr_set
call failes, the vlan_hash wouldn't be destroyed before inited.
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that %z is standartised in C99 there is no reason to support %Z.
Unlike %L it doesn't even make format strings smaller.
Use BUILD_BUG_ON in a couple ATM drivers.
In case anyone didn't notice lib/vsprintf.o is about half of SLUB which
is in my opinion is quite an achievement. Hopefully this patch inspires
someone else to trim vsprintf.c more.
Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
added_by_external_learn fdb entries are added and expired by
external entities like switchdev driver or external controllers.
ageing is already disabled for such entries. Hence, don't
indicate expiry for such fdb entries.
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can simplify the logic of entries pointing to the bridge by
converging the fdb_delete_by functions, this would allow us to use the
same function for both cases since the fdb's dst is set to NULL if it is
pointing to the bridge thus we can always check for a port match.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid new errors add checks to br_fdb_find and fdb_find_rcu
functions. The first requires hash_lock, the second obviously RCU.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch we had 3 different fdb searching functions which was
confusing. This patch reduces all of them to one - fdb_find_rcu(), and
two flavors: br_fdb_find() which requires hash_lock and br_fdb_find_rcu
which requires RCU. This makes it clear what needs to be used, we also
remove two abusers of __br_fdb_get which called it under hash_lock and
replace them with br_fdb_find().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload the mc router ports list, whenever it is being changed.
It is done because in some cases mc packets needs to be flooded to all
the ports in this list.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are three places where a port gets deleted from the mc router port
list. This patch join the actual deletion to one function.
It will be helpful for later patch that will offload changes in the mc
router ports list.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload multicast disabled flag, for more accurate mc flood behavior:
When it is on, the mdb should be ignored.
When it is off, unregistered mc packets should be flooded to mc router
ports.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug and cleans up tunnelid range size
calculation code by using consistent variable names
and checks in size calculation and fill functions.
tested for a few cases of vlan-vni range mappings:
(output from patched iproute2):
$bridge vlan showtunnel
port vid tunid
vxlan0 100-105 1000-1005
200 2000
210 2100
211-213 2100-2102
214 2104
216-217 2108-2109
219 2119
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiffies is volatile so read it once.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vtbegin should not be NULL in this function, Its already checked by the
caller.
this should silence the below smatch complaint:
net/bridge/br_netlink_tunnel.c:144 br_fill_vlan_tinfo_range()
error: we previously assumed 'vtbegin' could be null (see line 130)
net/bridge/br_netlink_tunnel.c
129
130 if (vtbegin && vtend && (vtend->vid - vtbegin->vid) > 0) {
^^^^^^^
Check for NULL.
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These checks should go after the attributes have been parsed otherwise
we're using tb uninitialized.
Fixes: efa5356b0d ("bridge: per vlan dst_metadata netlink support")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error check on err is redundant as it is being checked
previously each time it has been updated. Remove this redundant
check.
Detected with CoverityScan, CID#140030("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Writing once per jiffy is enough to limit the bridge's false sharing.
After this change the bridge doesn't show up in the local load HitM stats.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fdb's used and updated fields are written to on every packet forward and
packet receive respectively. Thus if we are receiving packets from a
particular fdb, they'll cause false-sharing with everyone who has looked
it up (even if it didn't match, since mac/vid share cache line!). The
"used" field is even worse since it is updated on every packet forward
to that fdb, thus the standard config where X ports use a single gateway
results in 100% fdb false-sharing. Note that this patch does not prevent
the last scenario, but it makes it better for other bridge participants
which are not using that fdb (and are only doing lookups over it).
The point is with this move we make sure that only communicating parties
get the false-sharing, in a later patch we'll show how to avoid that too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the fdb garbage collector to a workqueue which fires at least 10
milliseconds apart and cleans chain by chain allowing for other tasks
to run in the meantime. When having thousands of fdbs the system is much
more responsive. Most importantly remove the need to check if the
matched entry has expired in __br_fdb_get that causes false-sharing and
is completely unnecessary if we cleanup entries, at worst we'll get 10ms
of traffic for that entry before it gets deleted.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move around net_bridge so the vlan fields are in the beginning since
they're checked on every packet even if vlan filtering is disabled.
For the port move flags & vlan group to the beginning, so they're in the
same cache line with the port's state (both flags and state are checked
on each packet).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 18bfb924f0 ("net: introduce default neigh_construct/destroy
ndo calls for L2 upper devices") we added these ndos to stacked devices
such as team and bond, so that calls will be propagated to mlxsw.
However, previous commit removed the reliance on these ndos and no new
users of these ndos have appeared since above mentioned commit. We can
therefore safely remove this dead code.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, they are:
1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
sk_buff so we only access one single cacheline in the conntrack
hotpath. Patchset from Florian Westphal.
2) Don't leak pointer to internal structures when exporting x_tables
ruleset back to userspace, from Willem DeBruijn. This includes new
helper functions to copy data to userspace such as xt_data_to_user()
as well as conversions of our ip_tables, ip6_tables and arp_tables
clients to use it. Not surprinsingly, ebtables requires an ad-hoc
update. There is also a new field in x_tables extensions to indicate
the amount of bytes that we copy to userspace.
3) Add nf_log_all_netns sysctl: This new knob allows you to enable
logging via nf_log infrastructure for all existing netnamespaces.
Given the effort to provide pernet syslog has been discontinued,
let's provide a way to restore logging using netfilter kernel logging
facilities in trusted environments. Patch from Michal Kubecek.
4) Validate SCTP checksum from conntrack helper, from Davide Caratti.
5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
a copy&paste from the original helper, from Florian Westphal.
6) Reset netfilter state when duplicating packets, also from Florian.
7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
nft_meta, from Liping Zhang.
8) Add missing code to deal with loopback packets from nft_meta when
used by the netdev family, also from Liping.
9) Several cleanups on nf_tables, one to remove unnecessary check from
the netlink control plane path to add table, set and stateful objects
and code consolidation when unregister chain hooks, from Gao Feng.
10) Fix harmless reference counter underflow in IPVS that, however,
results in problems with the introduction of the new refcount_t
type, from David Windsor.
11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
from Davide Caratti.
12) Missing documentation on nf_tables uapi header, from Liping Zhang.
13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- ingress hook:
- if port is a tunnel port, use tunnel info in
attached dst_metadata to map it to a local vlan
- egress hook:
- if port is a tunnel port, use tunnel info attached to
vlan to set dst_metadata on the skb
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to attach per vlan tunnel info dst
metadata. This enables bridge driver to map vlan to tunnel_info
at ingress and egress. It uses the kernel dst_metadata infrastructure.
The initial use case is vlan to vni bridging, but the api is generic
to extend to any tunnel_info in the future:
- Uapi to configure/unconfigure/dump per vlan tunnel data
- netlink functions to configure vlan and tunnel_info mapping
- Introduces bridge port flag BR_LWT_VLAN to enable attach/detach
dst_metadata to bridged packets on ports. off by default.
- changes to existing code is mainly refactor some existing vlan
handling netlink code + hooks for new vlan tunnel code
- I have kept the vlan tunnel code isolated in separate files.
- most of the netlink vlan tunnel code is handling of vlan-tunid
ranges (follows the vlan range handling code). To conserve space
vlan-tunid by default are always dumped in ranges if applicable.
Use case:
example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis).
iproute2 example (patched and pruned iproute2 output to just show
relevant fdb entries):
example shows same host mac learnt on two vni's and
vlan 100 maps to vni 1000, vlan 101 maps to vni 1001
before (netdev per vni):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
after this patch with collect metdata in bridged mode (single netdev):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge
00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 69b34fb996 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding
kernel log of the host. The plan was to keep it that way until syslog
namespace implementation allows containers to log in a safe way.
However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows to do so by setting
/proc/sys/net/netfilter/nf_log_all_netns
to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The only caller of this new function is inside of an #ifdef checking
for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation
there too, in order to avoid this harmless warning:
net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function]
Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements an optional, per bridge port flag and feature to deliver
multicast packets to any host on the according port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.
multicast-to-unicast works on top of the multicast snooping feature of
the bridge. Which means unicast copies are only delivered to hosts which
are interested in it and signalized this via IGMP/MLD reports
previously.
This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).
However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.
The initial patch and idea is from Felix Fietkau.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
[linus.luessing@c0d3.blue: various bug + style fixes, commit message]
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any bridge options specified during link creation (e.g. ip link add)
are ignored as br_dev_newlink() does not process them.
Use br_changelink() to do it.
Fixes: 1332351617 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Ivan Vecera <cera@cera.cz>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed type of csum field in struct igmpv3_query from __be16 to
__sum16 to eliminate type warning, made same change in struct
igmpv3_report for consistency.
Fixed up an ntohs() where htons() should have been used instead.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In matches and targets that define a kernel-only tail to their
xt_match and xt_target data structs, add a field .usersize that
specifies up to where data is to be shared with userspace.
Performed a search for comment "Used internally by the kernel" to find
relevant matches and targets. Manually inspected the structs to derive
a valid offsetof.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Convert ebtables to copying entries, matches and targets one by one.
The solution is analogous to that of generic xt_(match|target)_to_user
helpers, but is applied to different structs.
Convert existing helpers ebt_make_XXXname helpers that overwrite
fields of an already copy_to_user'd struct with ebt_XXX_to_user
helpers that copy all relevant fields of the struct from scratch.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.
Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains accumulated Netfilter fixes for your
net tree:
1) Ensure quota dump and reset happens iff we can deliver numbers to
userspace.
2) Silence splat on incorrect use of smp_processor_id() from nft_queue.
3) Fix an out-of-bound access reported by KASAN in
nf_tables_rule_destroy(), patch from Florian Westphal.
4) Fix layer 4 checksum mangling in the nf_tables payload expression
with IPv6.
5) Fix a race in the CLUSTERIP target from control plane path when two
threads run to add a new configuration object. Serialize invocations
of clusterip_config_init() using spin_lock. From Xin Long.
6) Call br_nf_pre_routing_finish_bridge_finish() once we are done with
the br_nf_pre_routing_finish() hook. From Artur Molchanov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Problem:
br_nf_pre_routing_finish() calls itself instead of
br_nf_pre_routing_finish_bridge(). Due to this bug reverse path filter drops
packets that go through bridge interface.
User impact:
Local docker containers with bridge network can not communicate with each
other.
Fixes: c5136b15ea ("netfilter: bridge: add and use br_nf_hook_thresh")
Signed-off-by: Artur Molchanov <artur.molchanov@synesis.ru>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
802.1D [1] specifies that the bridges must use a short value to age out
dynamic entries in the Filtering Database for a period, once a topology
change has been communicated by the root bridge.
Add a bridge_ageing_time member in the net_bridge structure to store the
bridge ageing time value configured by the user (ioctl/netlink/sysfs).
If we are using in-kernel STP, shorten the ageing time value to twice
the forward delay used by the topology when the topology change flag is
set. When the flag is cleared, restore the configured ageing time.
[1] "8.3.5 Notifying topology changes ",
http://profesores.elo.utfsm.cl/~agv/elo309/doc/802.1D-1998.pdf
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a __br_set_topology_change helper to set the topology change value.
This can be later extended to add actions when the topology change flag
is set or cleared.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME switchdev attr is actually set
when initializing a bridge port, and when configuring the bridge ageing
time from ioctl/netlink/sysfs.
Add a __set_ageing_time helper to offload the ageing time to physical
switches, and add the SWITCHDEV_F_DEFER flag since it can be called
under bridge lock.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains a large Netfilter update for net-next,
to summarise:
1) Add support for stateful objects. This series provides a nf_tables
native alternative to the extended accounting infrastructure for
nf_tables. Two initial stateful objects are supported: counters and
quotas. Objects are identified by a user-defined name, you can fetch
and reset them anytime. You can also use a maps to allow fast lookups
using any arbitrary key combination. More info at:
http://marc.info/?l=netfilter-devel&m=148029128323837&w=2
2) On-demand registration of nf_conntrack and defrag hooks per netns.
Register nf_conntrack hooks if we have a stateful ruleset, ie.
state-based filtering or NAT. The new nf_conntrack_default_on sysctl
enables this from newly created netnamespaces. Default behaviour is not
modified. Patches from Florian Westphal.
3) Allocate 4k chunks and then use these for x_tables counter allocation
requests, this improves ruleset load time and also datapath ruleset
evaluation, patches from Florian Westphal.
4) Add support for ebpf to the existing x_tables bpf extension.
From Willem de Bruijn.
5) Update layer 4 checksum if any of the pseudoheader fields is updated.
This provides a limited form of 1:1 stateless NAT that make sense in
specific scenario, eg. load balancing.
6) Add support to flush sets in nf_tables. This series comes with a new
set->ops->deactivate_one() indirection given that we have to walk
over the list of set elements, then deactivate them one by one.
The existing set->ops->deactivate() performs an element lookup that
we don't need.
7) Two patches to avoid cloning packets, thus speed up packet forwarding
via nft_fwd from ingress. From Florian Westphal.
8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
prevent infinite loops, patch from Dwip Banerjee. And one minor
refactoring from Gao feng.
9) Revisit recent log support for nf_tables netdev families: One patch
to ensure that we correctly handle non-ethernet packets. Another
patch to add missing logger definition for netdev. Patches from
Liping Zhang.
10) Three patches for nft_fib, one to address insufficient register
initialization and another to solve incorrect (although harmless)
byteswap operation. Moreover update xt_rpfilter and nft_fib to match
lbcast packets with zeronet as source, eg. DHCP Discover packets
(0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.
11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
been broken in many-cast mode for some little time, let's give them a
chance by placing them at the same level as other existing protocols.
Thus, users don't explicitly have to modprobe support for this and
NAT rules work for them. Some people point to the lack of support in
SOHO Linux-based routers that make deployment of new protocols harder.
I guess other middleboxes outthere on the Internet are also to blame.
Anyway, let's see if this has any impact in the midrun.
12) Skip software SCTP software checksum calculation if the NIC comes
with SCTP checksum offload support. From Davide Caratti.
13) Initial core factoring to prepare conversion to hook array. Three
patches from Aaron Conole.
14) Gao Feng made a wrong conversion to switch in the xt_multiport
extension in a patch coming in the previous batch. Fix it in this
batch.
15) Get vmalloc call in sync with kmalloc flags to avoid a warning
and likely OOM killer intervention from x_tables. From Marcelo
Ricardo Leitner.
16) Update Arturo Borrero's email address in all source code headers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to facilitate converting from a singly-linked list to an array
of elements.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Function br_sysfs_addbr() does not set error code when the call
kobject_create_and_add() returns a NULL pointer. It may be better to
return "-ENOMEM" when kobject_create_and_add() fails.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188781
Signed-off-by: Pan Bian <bianpan2016@163.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdev family, we will handle non ethernet packets, so using
eth_hdr(skb)->h_proto is incorrect.
Meanwhile, we can use socket(AF_PACKET...) to sending packets, so
skb->protocol is not always set in bridge family.
Add an extra parameter into nf_log_l2packet to solve this issue.
Fixes: 1fddf4bad0 ("netfilter: nf_log: add packet logging for netdev family")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds basic support for MLDv2 queries, the default is MLDv1
as before. A new multicast option - multicast_mld_version, adds the
ability to change it between 1 and 2 via netlink and sysfs.
The MLD option is disabled if CONFIG_IPV6 is disabled.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds basic support for IGMPv3 queries, the default is IGMPv2
as before. A new multicast option - multicast_igmp_version, adds the
ability to change it between 2 and 3 via netlink and sysfs. The option
struct member is in a 4 byte hole in net_bridge.
There also a few minor style adjustments in br_multicast_new_group and
br_multicast_add_group.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This field is only useful for nf_queue, so store it in the
nf_queue_entry structure instead, away from the core path. Pass
hook_head to nf_hook_slow().
Since we always have a valid entry on the first iteration in
nf_iterate(), we can use 'do { ... } while (entry)' loop instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.
This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.
This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NF_STOP is only used by br_netfilter these days, and it can be emulated
with a combination of NF_STOLEN plus explicit call to the ->okfn()
function as Florian suggests.
To retain binary compatibility with userspace nf_queue application, we
have to keep NF_STOP around, so libnetfilter_queue userspace userspace
applications still work if they use NF_STOP for some exotic reason.
Out of tree modules using NF_STOP would break, but we don't care about
those.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patch c5136b15ea ("netfilter: bridge: add and use br_nf_hook_thresh")
introduced br_nf_hook_thresh().
Replace NF_HOOK_THRESH() by br_nf_hook_thresh from
br_nf_forward_finish(), so we have no more callers for this macro.
As a result, state->thresh and explicit thresh parameter in the hook
state structure is not required anymore. And we can get rid of
skip-hook-under-thresh loop in nf_iterate() in the core path that is
only used by br_netfilter to search for the filter hook.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:
1) Add fib lookup expression for nf_tables, from Florian Westphal. This
new expression provides a native replacement for iptables addrtype
and rp_filter matches. This is more flexible though, since we can
populate the kernel flowi representation to inquire fib to
accomodate new usecases, such as RTBH through skb mark.
2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
new expression allow you to access skbuff route metadata, more
specifically nexthop and classid fields.
3) Add notrack support for nf_tables, to skip conntracking, requested by
many users already.
4) Add boilerplate code to allow to use nf_log infrastructure from
nf_tables ingress.
5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
the xtables cluster match, from Liping Zhang.
6) Move socket lookup code into generic nf_socket_* infrastructure so
we can provide a native replacement for the xtables socket match.
7) Make sure nfnetlink_queue data that is updated on every packets is
placed in a different cache from read-only data, from Florian Westphal.
8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.
9) Start round robin number generation in nft_numgen from zero,
instead of n-1, for consistency with xtables statistics match,
patch from Liping Zhang.
10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
given we retry with a smaller allocation on failure, from Calvin Owens.
11) Cleanup xt_multiport to use switch(), from Gao feng.
12) Remove superfluous check in nft_immediate and nft_cmp, from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.
This patch adds the boiler plate code to register the netdev logging
family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we receive a PIM Hello message on a port we can consider that it
has a multicast router attached, thus it is correct to add it to the
router list. The only catch is it shouldn't be considered for a querier.
Using Daniel's description:
leaf-11 leaf-12 leaf-13
\ | /
bridge-1
/ \
host-11 host-12
- all ports in bridge-1 are in a single vlan aware bridge
- leaf-11 is the IGMP querier
- leaf-13 is the PIM DR
- host-11 TXes packets to 226.10.10.10
- bridge-1 only forwards the 226.10.10.10 traffic out the port to
leaf-11, it should also forward this traffic out the port to leaf-13
Suggested-by: Daniel Walton <dwalton@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple overlapping changes.
For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
closer inspection and testing
macvlan:
- set min/max_mtu
tun:
- set min/max_mtu, remove tun_net_change_mtu
vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
change_mtu function
- This one is also not as straight-forward and could use closer inspection
and testing from vxlan folks
bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
change_mtu function
openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
is the largest possible size supported
sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
macsec:
- min_mtu = 0, max_mtu = 65535
macvlan:
- min_mtu = 0, max_mtu = 65535
ntb_netdev:
- min_mtu = 0, max_mtu = 65535
veth:
- min_mtu = 68, max_mtu = 65535
8021q:
- min_mtu = 0, max_mtu = 65535
CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Satish reported a problem with the perm multicast router ports not getting
reenabled after some series of events, in particular if it happens that the
multicast snooping has been disabled and the port goes to disabled state
then it will be deleted from the router port list, but if it moves into
non-disabled state it will not be re-added because the mcast snooping is
still disabled, and enabling snooping later does nothing.
Here are the steps to reproduce, setup br0 with snooping enabled and eth1
added as a perm router (multicast_router = 2):
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
<empty>
At this point we have mcast enabled and eth1 as a perm router (value = 2)
but it is not in the router list which is incorrect.
After this change:
1. $ echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2. $ ip l set eth1 down
^ This step deletes the interface from the router list
3. $ ip l set eth1 up
^ This step does not add it again because mcast snooping is disabled
4. $ echo 1 > /sys/class/net/br0/bridge/multicast_snooping
5. $ bridge -d -s mdb show
router ports on br0: eth1
Note: we can directly do br_multicast_enable_port for all because the
querier timer already has checks for the port state and will simply
expire if it's in blocking/disabled. See the comment added by
commit 9aa6638216 ("bridge: multicast: add a comment to
br_port_state_selection about blocking state")
Fixes: 561f1103a2 ("bridge: Add multicast_snooping sysfs toggle")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added the multicast flood control flag, I also added an attribute
for it for sysfs similar to other flags, but I forgot to add it to
brport_attrs.
Fixes: b6cb5ac833 ("net: bridge: add per-port multicast flood flag")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds vlan and address to warning messages printed
in the bridge fdb code for debuggability.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using bridge without bridge netfilter enabled the message
displayed is rather confusing and leads to belive that a deprecated
feature is in use. Use IS_MODULE to be explicit that the message only
affects users which use bridge netfilter as module and reword the
message.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/core.c
net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
ip_hdr assignment") from the net tree and commit ddc8b6027a
("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.
So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.
Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.
In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call. This is just a cleanup, as the locking
code gracefully handles this situation.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.
The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.
The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.
It's used only in the recursion cases of br_netfilter. It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If /sbin/bridge-stp is available on the system, bridge tries to execute
it instead of the kernel implementation when starting/stopping STP.
If anything goes wrong with /sbin/bridge-stp, bridge silently falls back
to kernel STP, making hard to debug userspace STP.
This patch adds a br_stp_call_user helper to start/stop userspace STP
and debug errors from the program: abnormal exit status is stored in the
lower byte and normal exit status is stored in higher byte.
Below is a simple example on a kernel with dynamic debug enabled:
# ln -s /bin/false /sbin/bridge-stp
# brctl stp br0 on
br0: failed to start userspace STP (256)
# dmesg
br0: /sbin/bridge-stp exited with code 1
br0: failed to start userspace STP (256)
br0: using kernel STP
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some codes of netfilter module which did not check the return
value of nft_register_chain_type. Add the checks now.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Consolidate pktinfo setup and validation by using the new generic
functions so we converge to the netdev family codebase.
We only need a linear IPv4 and IPv6 header from the reject expression,
so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate()
to net/bridge/netfilter/nft_reject_bridge.c.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization all of pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.
This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates if transport protocol information is
available. Remain fields are zeroed.
The meta expression has been also updated to check to tprot_set in first
place given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number so we can't rely on
this value as tprot unset.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a per-port flag to control the unknown multicast flood, similar to the
unknown unicast flood flag and break a few long lines in the netlink flag
exports.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unicast flag and introduce an exact pkt_type. That would help us
for the upcoming per-port multicast flood flag and also slightly reduce the
tests in the input fast path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.
To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.
In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.
Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065
real 1m11.791s
user 0m0.070s
sys 1m8.395s
(existing code does not return all macs)
after patch:
$time bridge fdb show | wc -l
35085
real 0m2.017s
user 0m0.113s
sys 0m1.942s
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.
CC: Kyeyoon Park <kyeyoonp@codeaurora.org>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bc8c20acae ("bridge: multicast: treat igmpv3 report with
INCLUDE and no sources as a leave") seems to have accidentally reverted
commit 47cc84ce0c ("bridge: fix parsing of MLDv2 reports"). This
commit brings back a change to br_ip6_multicast_mld2_report() where
parsing of MLDv2 reports stops when the first group is successfully
added to the MDB cache.
Fixes: bc8c20acae ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_log_set is an interface function, so it should do the strict sanity
check of parameters. Convert the return value of nf_log_set as int instead
of void. When the pf is invalid, return -EOPNOTSUPP.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.
It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.
This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.
The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.
Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).
Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added support to export the vlan entry flags via xstats I forgot to
add support for the pvid since it is manually matched, so check if the
entry matches the vlan_group's pvid and set the flag appropriately.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bcf4934288 ("netfilter: ebtables: Fix extension lookup with
identical name") added a second lookup in case the extension that was
found during the first lookup matched another extension with the same
name, but didn't release the reference on the incorrect module.
Fixes: bcf4934288 ("netfilter: ebtables: Fix extension lookup with identical name")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
"meta pkttype set" is only supported on prerouting chain with bridge
family and ingress chain with netdev family.
But the validate check is incomplete, and the user can add the nft
rules on input chain with bridge family, for example:
# nft add table bridge filter
# nft add chain bridge filter input {type filter hook input \
priority 0 \;}
# nft add chain bridge filter test
# nft add rule bridge filter test meta pkttype set unicast
# nft add rule bridge filter input jump test
This patch fixes the problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use one of the vlan xstats padding fields to export the vlan flags. This is
needed in order to be able to distinguish between master (bridge) and port
vlan entries in user-space when dumping the bridge vlan stats.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the bridge driver we usually have the same function working for both
port and bridge. In order to follow that logic and also avoid code
duplication, consolidate the bridge_ and brport_ linkxstats calls into
one since they share most of their code. As a side effect this allows us
to dump the vlan stats also via the slave call which is in preparation for
the upcoming per-port vlan stats and vlan flag dumping.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding fdb entries pointing to the bridge device uses fdb_insert(),
which lacks various checks and does not respect added_by_user flag.
As a result, some inconsistent behavior can happen:
* Adding temporary entries succeeds but results in permanent entries.
* Same goes for "dynamic" and "use".
* Changing mac address of the bridge device causes deletion of
user-added entries.
* Replacing existing entries looks successful from userspace but actually
not, regardless of NLM_F_EXCL flag.
Use the same logic as other entries and fix them.
Fixes: 3741873b4f ("bridge: allow adding of fdb entries pointing to the bridge device")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") caused LLDP packets arriving through a
bridge port to be re-injected to the Rx path with skb->dev set to the
bridge device, but this breaks the lldpad daemon.
The lldpad daemon opens a packet socket with protocol set to ETH_P_LLDP
for any valid device on the system, which doesn't not include soft
devices such as bridge and VLAN.
Since packet sockets (ptype_base) are processed in the Rx path after the
Rx handler, LLDP packets with skb->dev set to the bridge device never
reach the lldpad daemon.
Fix this by making the bridge's Rx handler re-inject LLDP packets with
RX_HANDLER_PASS, which effectively restores the behaviour prior to the
mentioned commit.
This means netfilter will never receive LLDP packets coming through a
bridge port, as I don't see a way in which we can have okfn() consume
the packet without breaking existing behaviour. I've already carried out
a similar fix for STP packets in commit 56fae404fb ("bridge: Fix
incorrect re-injection of STP packets").
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the ageing_time type in br_set_ageing_time() from u32 to what it
is expected to be, i.e. a clock_t.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_stp_enable_bridge() does take the br->lock spinlock. Fix its wrongly
pasted comment and use the same as br_stp_disable_bridge().
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch we had two flavors of most forwarding functions -
_forward and _deliver, the difference being that the latter are used
when the packets are locally originated. Instead of all this function
pointer passing and code duplication, we can just pass a boolean noting
that the packet was locally originated and use that to perform the
necessary checks in __br_forward. This gives a minor performance
improvement but more importantly consolidates the forwarding paths.
Also add a kernel doc comment to explain the exported br_forward()'s
arguments.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if the packet is going to be received locally we set skb0 or
sometimes called skb2 variables to the original skb. This can get
confusing and also we can avoid one conditional on the fast path by
simply using a boolean and passing it around. Thanks to Roopa for the
name suggestion.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes one conditional from the unicast path by using the fact
that skb is NULL only when the packet is multicast or is local.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial style changes in br_handle_frame_finish.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As was suggested this patch adds support for the different versions of MLD
and IGMP query types. Since the user visible structure is still in net-next
we can augment it instead of adding netlink attributes.
The distinction between the different IGMP/MLD query types is done as
suggested in Section 7.1, RFC 3376 [1] and Section 8.1, RFC 3810 [2] based
on query payload size and code for IGMP. Since all IGMP packets go through
multicast_rcv() and it uses ip_mc_check_igmp/ipv6_mc_check_mld we can be
sure that at least the ip/ipv6 header can be directly used.
[1] https://tools.ietf.org/html/rfc3376#section-7
[2] https://tools.ietf.org/html/rfc3810#section-8.1
Suggested-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en.h
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
drivers/net/usb/r8152.c
All three conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next,
they are:
1) Don't use userspace datatypes in bridge netfilter code, from
Tobin Harding.
2) Iterate only once over the expectation table when removing the
helper module, instead of once per-netns, from Florian Westphal.
3) Extra sanitization in xt_hook_ops_alloc() to return error in case
we ever pass zero hooks, xt_hook_ops_alloc():
4) Handle NFPROTO_INET from the logging core infrastructure, from
Liping Zhang.
5) Autoload loggers when TRACE target is used from rules, this doesn't
change the behaviour in case the user already selected nfnetlink_log
as preferred way to print tracing logs, also from Liping Zhang.
6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields
by cache lines, increases the size of entries in 11% per entry.
From Florian Westphal.
7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.
8) Remove useless defensive check in nf_logger_find_get() from Shivani
Bhardwaj.
9) Remove zone extension as place it in the conntrack object, this is
always include in the hashing and we expect more intensive use of
zones since containers are in place. Also from Florian Westphal.
10) Owner match now works from any namespace, from Eric Bierdeman.
11) Make sure we only reply with TCP reset to TCP traffic from
nf_reject_ipv4, patch from Liping Zhang.
12) Introduce --nflog-size to indicate amount of network packet bytes
that are copied to userspace via log message, from Vishwanath Pai.
This obsoletes --nflog-range that has never worked, it was designed
to achieve this but it has never worked.
13) Introduce generic macros for nf_tables object generation masks.
14) Use generation mask in table, chain and set objects in nf_tables.
This allows fixes interferences with ongoing preparation phase of
the commit protocol and object listings going on at the same time.
This update is introduced in three patches, one per object.
15) Check if the object is active in the next generation for element
deactivation in the rbtree implementation, given that deactivation
happens from the commit phase path we have to observe the future
status of the object.
16) Support for deletion of just added elements in the hash set type.
17) Allow to resize hashtable from /proc entry, not only from the
obscure /sys entry that maps to the module parameter, from Florian
Westphal.
18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
anymore since we tear down the ruleset whenever the netdevice
goes away.
19) Support for matching inverted set lookups, from Arturo Borrero.
20) Simplify the iptables_mangle_hook() by removing a superfluous
extra branch.
21) Introduce ether_addr_equal_masked() and use it from the netfilter
codebase, from Joe Perches.
22) Remove references to "Use netfilter MARK value as routing key"
from the Netfilter Kconfig description given that this toggle
doesn't exists already for 10 years, from Moritz Sichert.
23) Introduce generic NF_INVF() and use it from the xtables codebase,
from Joe Perches.
24) Setting logger to NONE via /proc was not working unless explicit
nul-termination was included in the string. This fixes seems to
leave the former behaviour there, so we don't break backward.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
L2 upper device needs to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and use them in
team, bond, bridge and vlan.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter uses multiple FWINV #defines with identical form that hide a
specific structure variable and dereference it with a invflags member.
$ git grep "#define FWINV"
include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg))
net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg))
net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg)))
net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg)))
net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg)))
Consolidate these macros into a single NF_INVF macro.
Miscellanea:
o Neaten the alignment around these uses
o A few lines are > 80 columns for intelligibility
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are code duplications of a masked ethernet address comparison here
so make it a separate function instead.
Miscellanea:
o Neaten alignment of FWINV macro uses to make it clearer for the reader
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds stats support for the currently used IGMP/MLD types by the
bridge. The stats are per-port (plus one stat per-bridge) and per-direction
(RX/TX). The stats are exported via netlink via the new linkxstats API
(RTM_GETSTATS). In order to minimize the performance impact, a new option
is used to enable/disable the stats - multicast_stats_enabled, similar to
the recent vlan stats. Also in order to avoid multiple IGMP/MLD type
lookups and checks, we make use of the current "igmp" member of the bridge
private skb->cb region to record the type on Rx (both host-generated and
external packets pass by multicast_rcv()). We can do that since the igmp
member was used as a boolean and all the valid IGMP/MLD types are positive
values. The normal bridge fast-path is not affected at all, the only
affected paths are the flooding ones and since we make use of the IGMP/MLD
type, we can quickly determine if the packet should be counted using
cache-hot data (cb's igmp member). We add counters for:
* IGMP Queries
* IGMP Leaves
* IGMP v1/v2/v3 reports
* MLD Queries
* MLD Leaves
* MLD v1/v2 reports
These are invaluable when monitoring or debugging complex multicast setups
with bridges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
which allows to export per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in instead of having to dump all ports and searching only
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361 ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge is falsly dropping ipv6 mulitcast packets if there is:
1. No ipv6 address assigned on the brigde.
2. No external mld querier present.
3. The internal querier enabled.
When the bridge fails to build mld queries, because it has no
ipv6 address, it slilently returns, but keeps the local querier enabled.
This specific case causes confusing packet loss.
Ipv6 multicast snooping can only work if:
a) An external querier is present
OR
b) The bridge has an ipv6 address an is capable of sending own queries
Otherwise it has to forward/flood the ipv6 multicast traffic,
because snooping cannot work.
This patch fixes the issue by adding a flag to the bridge struct that
indicates that there is currently no ipv6 address assinged to the bridge
and returns a false state for the local querier in
__br_multicast_querier_exists().
Special thanks to Linus Lüssing.
Fixes: d1d81d4c3d ("bridge: check return value of ipv6_dev_get_saddr()")
Signed-off-by: Daniel Danzberger <daniel@dd-wrt.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") fixed incorrect usage of NF_HOOK's
return value by consuming packets in okfn via br_pass_frame_up().
However, this function re-injects packets to the Rx path with skb->dev
set to the bridge device, which breaks kernel's STP, as all STP packets
appear to originate from the bridge device itself.
Instead, if STP is enabled and bridge isn't a 802.1ad bridge, then learn
packet's SMAC and inject it back to the Rx path for further processing
by the packet handlers.
The patch also makes netfilter's behavior consistent with regards to
packets destined to the Bridge Group Address, as no hook registered at
LOCAL_IN will ever be called, regardless if STP is enabled or not.
Cc: Florian Westphal <fw@strlen.de>
Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The missing br_vlan_should_use() test caused creation of an unneeded
local fdb entry on changing mac address of a bridge device when there is
a vlan which is configured on a bridge port but not on the bridge
device.
Fixes: 2594e9064a ("bridge: vlan: add per-vlan struct and move to rhashtables")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checkpatch produces data type 'checks'.
This patch amends them by changing, for example:
uint8_t -> u8
Signed-off-by: Tobin C Harding <me@tobin.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly introduced helper functions the skb pulling is hidden
in the checksumming function - and undone before returning to the
caller.
The IGMP and MLD query parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header.
If there is a querier somewhere else, then this either causes
the multicast snooping to stay disabled even though it could be
enabled. Or, if we have the querier enabled too, then this can
create unnecessary IGMP / MLD query messages on the link.
Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_bridge_ifindices() is used from the old "deviceless" bridge ioctl
calls which aren't called with rtnl held. The comment above says that it is
called with rtnl but that is not really the case.
Here's a sample output from a test ASSERT_RTNL() which I put in
get_bridge_ifindices and executed "brctl show":
[ 957.422726] RTNL: assertion failed at net/bridge//br_ioctl.c (30)
[ 957.422925] CPU: 0 PID: 1862 Comm: brctl Tainted: G W O
4.6.0-rc4+ #157
[ 957.423009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[ 957.423009] 0000000000000000 ffff880058adfdf0 ffffffff8138dec5
0000000000000400
[ 957.423009] ffffffff81ce8380 ffff880058adfe58 ffffffffa05ead32
0000000000000001
[ 957.423009] 00007ffec1a444b0 0000000000000400 ffff880053c19130
0000000000008940
[ 957.423009] Call Trace:
[ 957.423009] [<ffffffff8138dec5>] dump_stack+0x85/0xc0
[ 957.423009] [<ffffffffa05ead32>]
br_ioctl_deviceless_stub+0x212/0x2e0 [bridge]
[ 957.423009] [<ffffffff81515beb>] sock_ioctl+0x22b/0x290
[ 957.423009] [<ffffffff8126ba75>] do_vfs_ioctl+0x95/0x700
[ 957.423009] [<ffffffff8126c159>] SyS_ioctl+0x79/0x90
[ 957.423009] [<ffffffff8163a4c0>] entry_SYSCALL_64_fastpath+0x23/0xc1
Since it only reads bridge ifindices, we can use rcu to safely walk the net
device list. Also remove the wrong rtnl comment above.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race-condition when updating the mdb offload flag without using
the mulicast_lock. This reverts commit 9e8430f8d6 ("bridge: mdb:
Passing the port-group pointer to br_mdb module").
This patch marks offloaded MDB entry as "offload" by changing the port-
group flags and marks it as MDB_PG_FLAGS_OFFLOAD.
When switchdev PORT_MDB succeeded and adds a multicast group, a completion
callback is been invoked "br_mdb_complete". The completion function
locks the multicast_lock and finds the right net_bridge_port_group and
marks it as offloaded.
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is duplicate code that translates br_mdb_entry to br_ip let's wrap it
in a common function.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for ioctl.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
of store_flag.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_br. and we also need to remove some
rtnl_trylock in old functions so that we can call it in a common one.
For group_addr_store, we cannot make it use store_bridge_parm, because
it's not a string-to-long convert, we will add notification on it
individually.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in stp_state_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in forward_delay_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some repetitive codes in flush_store, we can remove
them by calling store_bridge_parm, also, it would send rtnl notification
after we add it in store_bridge_parm in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
your net-next tree.
1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong.
2) Define and register netfilter's afinfo for the bridge family,
this comes in preparation for native nfqueue's bridge for nft,
from Stephane Bryant.
3) Add new attributes to store layer 2 and VLAN headers to nfqueue,
also from Stephane Bryant.
4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes
coming from userspace, from Stephane Bryant.
5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit
in IPv6 SYNPROXY, from Liping Zhang.
6) Remove unnecessary check for dst == NULL in nf_reject_ipv6,
from Haishuang Yan.
7) Deinline ctnetlink event report functions, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If a requested extension exists as module and is not loaded,
ebt_check_match() might accidentally use an NFPROTO_UNSPEC one with same
name and fail.
Reproduced with limit match: Given xt_limit and ebt_limit both built as
module, the following would fail:
modprobe xt_limit
ebtables -I INPUT --limit 1/s -j ACCEPT
The fix is to make ebt_check_match() distrust a found NFPROTO_UNSPEC
extension and retry after requesting an appropriate module.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This fixes the incorrect variable assignment on error path in
br_sysfs_addbr for when the call to kobject_create_and_add
fails to assign the value of -EINVAL to the returned variable of
err rather then incorrectly return zero making callers think this
function has succeededed due to the previous assignment being
assigned zero when assigning it the successful return value of
the call to sysfs_create_group which is zero.
Signed-off-by: Bastien Philbert <bastienphilbert@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP,
we should ignore this error code and continue to set the ageing time.
Fixes: c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This just adds and registers a nf_afinfo for the ethernet
bridge, which enables queuing to userspace for the AF_BRIDGE
family. No checksum computation is done.
Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit fa50d974d1 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It can be useful to lower max_gso_segs on NIC with very low
number of TX descriptors like bcmgenet.
However, this is defeated by bridge since it does not propagate
the lower value of max_gso_segs and max_gso_size.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petri Gynther <pgynther@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zefir Kurtisi reported kernel panic with an openwrt specific patch.
However, it turns out that mainline has a similar bug waiting to happen.
Once NF_HOOK() returns the skb is in undefined state and must not be
used. Moreover, the okfn must consume the skb to support async
processing (NF_QUEUE).
Current okfn in this spot doesn't consume it and caller assumes that
NF_HOOK return value tells us if skb was freed or not, but thats wrong.
It "works" because no in-tree user registers a NFPROTO_BRIDGE hook at
LOCAL_IN that returns STOLEN or NF_QUEUE verdicts.
Once we add NF_QUEUE support for nftables bridge this will break --
NF_QUEUE holds the skb for async processing, caller will erronoulsy
return RX_HANDLER_PASS and on reinject netfilter will access free'd skb.
Fix this by pushing skb up the stack in the okfn instead.
NB: It also seems dubious to use LOCAL_IN while bypassing PRE_ROUTING
completely in this case but this is how its been forever so it seems
preferable to not change this.
Cc: Felix Fietkau <nbd@openwrt.org>
Cc: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a regression in the bridge ageing time caused by:
commit c62987bbd8 ("bridge: push bridge setting ageing_time down to switchdev")
There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next tree,
they are:
1) Remove useless debug message when deleting IPVS service, from
Yannick Brosseau.
2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
several spots of the IPVS code, from Arnd Bergmann.
3) Add prandom_u32 support to nft_meta, from Florian Westphal.
4) Remove unused variable in xt_osf, from Sudip Mukherjee.
5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
since fixing af_packet defragmentation issues, from Joe Stringer.
6) On-demand hook registration for iptables from netns. Instead of
registering the hooks for every available netns whenever we need
one of the support tables, we register this on the specific netns
that needs it, patchset from Florian Westphal.
7) Add missing port range selection to nf_tables masquerading support.
BTW, just for the record, there is a typo in the description of
5f6c253ebe ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
inconditionally) the one that is scheduled for removal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves bridge hooks to a register-when-needed scheme.
We use a device notifier to register the 'call-iptables' netfilter hooks
only once a bridge gets added.
This means that if the initial namespace uses a bridge, newly created
network namespaces no longer get the PRE_ROUTING ipt_sabotage hook.
It will registered in that network namespace once a bridge is created
within that namespace.
A few modules still use global hooks:
- conntrack
- bridge PF_BRIDGE hooks
- IPVS
- CLUSTER match (deprecated)
- SYNPROXY
As long as these modules are not loaded/used, a new network namespace has
empty hook list and NF_HOOK() will boil down to single list_empty test even
if initial namespace does stateless packet filtering.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After commit 52bd2d62ce ("net: better skb->sender_cpu and skb->napi_id cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow for more multicast router port information to be dumped such as
timer and type attributes. For that that purpose we need to extend the
MDBA_ROUTER_PORT attribute similar to how it was done for the mdb entries
recently. The new format is thus:
[MDBA_ROUTER_PORT] = { <- nested attribute
u32 ifindex <- router port ifindex for user-space compatibility
[MDBA_ROUTER_PATTR attributes]
}
This way it remains compatible with older users (they'll simply retrieve
the u32 in the beginning) and new users can parse the remaining
attributes. It would also allow to add future extensions to the router
port without breaking compatibility.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for a temporary router port which doesn't depend only on the
incoming query. It can be refreshed if set to the same value, which is
a no-op for the rest.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is needed for the upcoming temporary port router. There's no point
to go through the logic if the value is the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using raw values makes it difficult to extend and also understand the
code, give them names and do explicit per-option manipulation in
br_multicast_set_port_router.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On bridge needed_headroom changes, the enslaved devices are
notified via the ndo_set_rx_headroom method
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the send skbuff reaches the end, nlmsg_put and friends returns
-EMSGSIZE but it is silently thrown away in ndo_fdb_dump. It is called
within a for_each_netdev loop and the first fdb entry of a following
netdev could fit in the remaining skbuff. This breaks the mechanism
of cb->args[0] and idx to keep track of the entries that are already
dumped, which results missing entries in bridge fdb show command.
Signed-off-by: Minoura Makoto <minoura@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently mdb entries are exported directly as a structure inside
MDBA_MDB_ENTRY_INFO attribute, we can't really extend it without
breaking user-space. In order to export new mdb fields, I've converted
the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
with struct br_mdb_entry (without header, as it's casted directly in
iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
keep compatibility with older users and can export new data.
I've tested this with iproute2, both with and without support for the
added attribute and it works fine.
So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
So the new structure is:
[MDBA_MDB] = {
[MDBA_MDB_ENTRY] = {
[MDBA_MDB_ENTRY_INFO]
[MDBA_MDB_ENTRY_INFO] { <- Nested attribute
struct br_mdb_entry <- nla_put_nohdr()
[MDBA_MDB_ENTRY attributes] <- normal netlink attributes
}
}
}
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the port check and skip if it's null, this allows us to reduce one
indentation level.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the shared br_log_state function and print the info directly in
br_set_state, where the net_bridge_port state is actually changed.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VLANs are created / destroyed on a VLAN filtering bridge (MASTER
flag set), the configuration is passed down to the hardware. However,
when only the flags (e.g. PVID) are toggled, the configuration is done
in the software bridge alone.
While it is possible to pass these flags to hardware when invoked with
the SELF flag set, this creates inconsistency with regards to the way
the VLANs are initially configured.
Pass the flags down to the hardware even when the VLAN already exists
and only the flags are toggled.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent change to the mdb code confused the compiler to the point
where it did not realize that the port-group returned from
br_mdb_add_group() is always valid when the function returns a nonzero
return value, so we get a spurious warning:
net/bridge/br_mdb.c: In function 'br_mdb_add':
net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
__br_mdb_notify(dev, entry, RTM_NEWMDB, pg);
Slightly rearranging the code in br_mdb_add_group() makes the problem
go away, as gcc is clever enough to see that both functions check
for 'ret != 0'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9e8430f8d6 ("bridge: mdb: Passing the port-group pointer to br_mdb module")
Signed-off-by: David S. Miller <davem@davemloft.net>
Passing the port-group to br_mdb in order to allow direct access to the
structure. br_mdb will later use the structure to reflect HW reflection
status via "state" variable.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change net_bridge_port_group 'state' member to 'flags' and define new set
of flags internal to the kernel.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When switchdev drivers process FDB notifications from the underlying
device they resolve the netdev to which the entry points to and notify
the bridge using the switchdev notifier.
However, since the RTNL mutex is not held there is nothing preventing
the netdev from disappearing in the middle, which will cause
br_switchdev_event() to dereference a non-existing netdev.
Make switchdev drivers hold the lock at the beginning of the
notification processing session and release it once it ends, after
notifying the bridge.
Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
when RTNL mutex is held.
Fixes: 03bf0c2812 ("switchdev: introduce switchdev notifier")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload MDB changes per port to hardware
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) Release nf_tables objects on netns destructions via
nft_release_afinfo().
2) Destroy basechain and rules on netdevice removal in the new netdev
family.
3) Get rid of defensive check against removal of inactive objects in
nf_tables.
4) Pass down netns pointer to our existing nfnetlink callbacks, as well
as commit() and abort() nfnetlink callbacks.
5) Allow to invert limit expression in nf_tables, so we can throttle
overlimit traffic.
6) Add packet duplication for the netdev family.
7) Add forward expression for the netdev family.
8) Define pr_fmt() in conntrack helpers.
9) Don't leave nfqueue configuration on inconsistent state in case of
errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
him.
10) Skip queue option handling after unbind.
11) Return error on unknown both in nfqueue and nflog command.
12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.
13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
from Carlos Falgueras.
14) Add support for 64 bit byteordering changes nf_tables, from Florian
Westphal.
15) Add conntrack byte/packet counter matching support to nf_tables,
also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Notifying hardware about newly bridged port vlan-aware changes.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notifying hardware about bridge vlan-aware changes.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disallow adding interfaces to a bridge when vlan filtering operation
failed. Send the failure code to the user.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[I stole this patch from Eric Biederman. He wrote:]
> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name which if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device could cause problems.
[Hannes: changed patch using netns_eq]
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.
The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
kobj_to_dev has been defined in linux/device.h, so I replace to_dev
with it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge's ageing time is offloaded to hardware when:
1) A port joins a bridge
2) The ageing time of the bridge is changed
In the first case the ageing time is offloaded as jiffies, but in the
second case it's offloaded as clock_t, which is what existing switchdev
drivers expect to receive.
Fixes: 6ac311ae8b ("Adding switchdev ageing notification on port bridged")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:
1) Two patches to include dependencies in our netfilter userspace
headers to resolve compilation problems, from Mikko Rapeli.
2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.
3) Remove duplicate include in the netfilter reject infrastructure,
from Stephen Hemminger.
4) Two patches to simplify the netfilter defragmentation code for IPv6,
patch from Florian Westphal.
5) Fix root ownership of /proc/net netfilter for unpriviledged net
namespaces, from Philip Whineray.
6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.
7) Add mangling support to our nf_tables payload expression, from
Patrick McHardy.
8) Introduce a new netlink-based tracing infrastructure for nf_tables,
from Florian Westphal.
9) Change setter functions in nfnetlink_log to be void, from
Rami Rosen.
10) Add netns support to the cttimeout infrastructure.
11) Add cgroup2 support to iptables, from Tejun Heo.
12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.
13) Add support for mangling pkttype in the nf_tables meta expression,
also from Florian.
BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev drivers need to know the netdev on which the switchdev op was
invoked. For example, the STP state of a VLAN interface configured on top
of a port can change while being member in a bridge. In this case, the
underlying driver should only change the STP state of that particular
VLAN and not of all the VLANs configured on the port.
However, current switchdev infrastructure only passes the port netdev down
to the driver. Solve that by passing the original device down to the
driver as part of the required switchdev object / attribute.
This doesn't entail any change in current switchdev drivers. It simply
enables those supporting stacked devices to know the originating device
and act accordingly.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve conflict between commit 264640fc2c ("ipv6: distinguish frag
queues by device for multicast and link-local packets") from the net
tree and commit 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations") from the nf-next tree.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/ipv6/netfilter/nf_conntrack_reasm.c
Only needed when meta nftrace rule(s) were added.
The assumption is that no such rules are active, so the call to
nft_trace_init is "never" needed.
When nftrace rules are active, we always call the nft_trace_* functions,
but will only send netlink messages when all of the following are true:
- traceinfo structure was initialised
- skb->nf_trace == 1
- at least one subscriber to trace group.
Adding an extra conditional
(static_branch ... && skb->nf_trace)
nft_trace_init( ..)
Is possible but results in a larger nft_do_chain footprint.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate netdev_master_upper_dev_link_private and pass priv directly as
a parameter of netdev_master_upper_dev_link.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate some checkpatch issues by improved layout of if statements.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change brace placement to eliminate checkpatch error.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Changes comments to use netdev style.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes a bunch of issues detected by checkpatch with regards to code
indentation.
No changes detected by objdiff.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When NET_SWITCHDEV=n, switchdev_port_attr_set simply returns EOPNOTSUPP.
In this case we should not emit errors and warnings to the kernel log.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: 0bc05d585d ("switchdev: allow caller to explicitly request
attr_set as deferred")
Fixes: 6ac311ae8b ("Adding switchdev ageing notification on port
bridged")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 34c2d9fb04.
There are 2 reasons for this revert:
1) The commit in question doesn't do what it says it does. The
description reads: "Allow bridge forward delay to be configured
when Spanning Tree is enabled." This was already the case before
the commit was made. What the commit actually do was disallow
invalid values or 'forward_delay' when STP was turned off.
2) The above change was actually a change in the user observed
behavior and broke things like libvirt and other network configs
that set 'forward_delay' to 0 without enabling STP. The value
of 0 is actually used when STP is turned off to immediately mark
the bridge as forwarding.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_should_learn() is protected by RCU and not by RTNL, so use correct
flavor of nbp_vlan_group().
Fixes: 907b1e6e83 ("bridge: vlan: use proper rcu for the vlgrp
member")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The flag used to indicate if a VLAN should be used for filtering - as
opposed to context only - on the bridge itself (e.g. br0) is called
'brentry' and not 'brvlan'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding a port to a bridge we initialize VLAN filtering on it. We do
not bail out in case an error occurred in nbp_vlan_init, as it can be
used as a non VLAN filtering bridge.
However, if VLAN filtering is required and an error occurred in
nbp_vlan_init, we should set vlgrp to NULL, so that VLAN filtering
functions (e.g. br_vlan_find, br_get_pvid) will know the struct is
invalid and will not try to access it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Problem Description:
We can add fdbs pointing to the bridge with NULL ->dst but that has a
few race conditions because br_fdb_insert() is used which first creates
the fdb and then, after the fdb has been published/linked, sets
"is_local" to 1 and in that time frame if a packet arrives for that fdb
it may see it as non-local and either do a NULL ptr dereference in
br_forward() or attach the fdb to the port where it arrived, and later
br_fdb_insert() will make it local thus getting a wrong fdb entry.
Call chain br_handle_frame_finish() -> br_forward():
But in br_handle_frame_finish() in order to call br_forward() the dst
should not be local i.e. skb != NULL, whenever the dst is
found to be local skb is set to NULL so we can't forward it,
and here comes the problem since it's running only
with RCU when forwarding packets it can see the entry before "is_local"
is set to 1 and actually try to dereference NULL.
The main issue is that if someone sends a packet to the switch while
it's adding the entry which points to the bridge device, it may
dereference NULL ptr. This is needed now after we can add fdbs
pointing to the bridge. This poses a problem for
br_fdb_update() as well, while someone's adding a bridge fdb, but
before it has is_local == 1, it might get moved to a port if it comes
as a source mac and then it may get its "is_local" set to 1
This patch changes fdb_create to take is_local and is_static as
arguments to set these values in the fdb entry before it is added to the
hash. Also adds null check for port in br_forward.
Fixes: 3741873b4f ("bridge: allow adding of fdb entries pointing to the bridge device")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().
The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).
This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.
Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.
Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Configure ageing time to the HW for newly bridged device
CC: Scott Feldman <sfeldma@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This merge resolves conflicts with 75aec9df3a ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/bridge/br_netfilter_hooks.c
since commit 8405a8fff3 ("netfilter: nf_qeueue: Drop queue entries on
nf_unregister_hook") all pending queued entries are discarded.
So we can simply remove all of the owner handling -- when module is
removed it also needs to unregister all its hooks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since spinlock is held here, defer the switchdev operation. Also, ensure
that defered switchdev ops are processed before port master device
is unlinked.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When object is used in deferred work, we cannot use pointers in
switchdev object structures because the memory they point at may be already
used by someone else. So rather do local copy of the value.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caller should know if he can call attr_set directly (when holding RTNL)
or if he has to defer the att_set processing for later.
This also allows drivers to sleep inside attr_set and report operation
status back to switchdev core. Switchdev core then warns if status is
not ok, instead of silent errors happening in drivers.
Benefit from newly introduced switchdev deferred ops infrastructure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel reported a problem with switchdev devices because of the
order change of del_nbp operations, more specifically the move of
nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
rx_handler has been unregistered. So in order to fix this move
vlan_flush back where it was and make it destroy the rhtable after
NULLing vlgrp and waiting a grace period to make sure noone can see it.
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Ido Schimmel pointed out the vlan_vid_del() code in nbp_vlan_flush is
unnecessary (and is actually a remnant of the old vlan code) so we can
remove it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fill_ifinfo is called by br_ifinfo_notify which can be called from
many contexts with different locks held, sometimes it relies upon
bridge's spinlock only which is a problem for the vlan code, so use
explicitly rcu for that to avoid problems.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge and port's vlgrp member is already used in RCU way, currently
we rely on the fact that it cannot disappear while the port exists but
that is error-prone and we might miss places with improper locking
(either RCU or RTNL must be held to walk the vlan_list). So make it
official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
accessors and use them consistently throughout the code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's possible for someone to send a vlan range to the kernel
with the pvid flag set which will result in the pvid bouncing from a
vlan to vlan and isn't correct, it also introduces problems for hardware
where it doesn't make sense having more than 1 pvid. iproute2 already
enforces this, so let's enforce it on kernel-side as well.
Reported-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers need to implement both switchdev vlan ops and
vid_add/kill ndos. For that to work in bridge code, we need to try
switchdev op first when adding/deleting vlan id.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the ARM mini2440_defconfig, the bridge netfilter code gets
built with both CONFIG_NF_DEFRAG_IPV4 and CONFIG_NF_DEFRAG_IPV6
disabled, which leads to a harmless gcc warning:
net/bridge/br_netfilter_hooks.c: In function 'br_nf_dev_queue_xmit':
net/bridge/br_netfilter_hooks.c:792:2: warning: label 'drop' defined but not used [-Wunused-label]
This gets rid of the warning by cleaning up the code to avoid
the respective #ifdefs causing this problem, and replacing them
with if(IS_ENABLED()) checks. I have verified that the resulting
object code is unchanged, and an additional advantage is that
we now get compile coverage of the unused functions in more
configurations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: dd302b59bd ("netfilter: bridge: don't leak skb in error paths")
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use SWITCHDEV_F_SKIP_EOPNOTSUPP to skip over ports in bridge that don't
support setting ageing_time (or setting bridge attrs in general).
If push fails, don't update ageing_time in bridge and return err to user.
If push succeeds, update ageing_time in bridge and run gc_timer now to
recalabrate when to run gc_timer next, based on new ageing_time.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables adding of fdb entries pointing to the bridge device.
This can be used to propagate mac address of vlan interfaces
configured on top of the vlan filtering bridge.
Before:
$bridge fdb add 44:38:39:00:27:9f dev bridge
RTNETLINK answers: Invalid argument
After:
$bridge fdb add 44:38:39:00:27:9f dev bridge
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_MULTICAST_ROUTER to allow setting/getting port's
multicast_router via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_FLUSH to allow flushing port's fdb similar to sysfs's
flush.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following attributes in order to export port's timer values:
IFLA_BRPORT_MESSAGE_AGE_TIMER, IFLA_BRPORT_FORWARD_DELAY_TIMER and
IFLA_BRPORT_HOLD_TIMER.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_TOPOLOGY_CHANGE_ACK and IFLA_BRPORT_CONFIG_PENDING to
allow getting port's topology_change_ack and config_pending respectively
via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_(ID|NO) to allow getting port's port_id and port_no
respectively via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_DESIGNATED_(COST|PORT) to allow getting the port's
designated cost and port respectively via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_BRIDGE_ID to allow getting the designated bridge id via
netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BRPORT_ROOT_ID to allow getting the designated root id via
netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When KASAN is enabled the frame size grows > 2048 bytes and we get a
warning, so make it smaller.
net/bridge/br_netlink.c: In function 'br_fill_info':
>> net/bridge/br_netlink.c:1110:1: warning: the frame size of 2160 bytes
>> is larger than 2048 bytes [-Wframe-larger-than=]
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman says:
====================
net: Pass net through ip fragmention
This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.
This round focuses on passing net through ip fragmentation which we seem
to call from about everywhere. That is the main ip output paths, the
bridge netfilter code, and openvswitch. This has to happend at once
accross the tree as function pointers are involved.
First some prep work is done, then ipv4 and ipv6 are converted and then
temporary helper functions are removed.
====================
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_VLAN_DEFAULT_PVID to allow setting/getting bridge's
default_pvid via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to allow getting/setting netfilter tables settings.
Currently these are IFLA_BR_NF_CALL_IPTABLES, IFLA_BR_NF_CALL_IP6TABLES
and IFLA_BR_NF_CALL_ARPTABLES.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to set/get all of the igmp's configurable intervals via
netlink. These currently are:
IFLA_BR_MCAST_LAST_MEMBER_INTVL
IFLA_BR_MCAST_MEMBERSHIP_INTVL
IFLA_BR_MCAST_QUERIER_INTVL
IFLA_BR_MCAST_QUERY_INTVL
IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
IFLA_BR_MCAST_STARTUP_QUERY_INTVL
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_STARTUP_QUERY_CNT to allow setting/getting
br->multicast_startup_query_count via netlink. Also align the ifla
comments.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_LAST_MEMBER_CNT to allow setting/getting
br->multicast_last_member_count via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_HASH_MAX to allow setting/getting br->hash_max via
netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_HASH_ELASTICITY to allow setting/getting
br->hash_elasticity via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_QUERIER to allow setting/getting br->multicast_querier
via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_QUERY_USE_IFADDR to allow setting/getting
br->multicast_query_use_ifaddr via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_SNOOPING to allow enabling/disabling multicast
snooping via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_MCAST_ROUTER to allow setting and retrieving
br->multicast_router when igmp snooping is enabled.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple attribute that flushes the bridge's fdb.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_GROUP_ADDR attribute to allow setting and retrieving the
group_addr via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export the following bridge timers (also exported via sysfs):
IFLA_BR_HELLO_TIMER, IFLA_BR_TCN_TIMER, IFLA_BR_TOPOLOGY_CHANGE_TIMER,
IFLA_BR_GC_TIMER via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_TOPOLOGY_CHANGE and IFLA_BR_TOPOLOGY_CHANGE_DETECTED and
export them via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_PATH_COST and export it via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_PORT and export it via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_BRIDGE_ID and export br->bridge_id via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_ROOT_ID and export br->designated_root via netlink. For this
purpose add struct ifla_bridge_id that would represent struct bridge_id.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IFLA_BR_GROUP_FWD_MASK attribute to allow setting and retrieving the
group_fwd_mask via netlink.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The checks that lead to num_vlans change are always what
br_vlan_should_use checks for, namely if the vlan is only a context or
not and depending on that it's either not counted or counted
as a real/used vlan respectively.
Also give better explanation in br_vlan_should_use's comment.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's only one user now and we can include the flag directly.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce br_vlan_(get|put)_master which take a reference (or create the
master vlan first if it didn't exist) and drop a reference respectively.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I did the conversion to rhashtable I missed the required locking of
one important user of the vlan list - br_get_link_af_size_filtered()
which is called:
br_ifinfo_notify() -> br_nlmsg_size() -> br_get_link_af_size_filtered()
and the notifications can be sent without holding rtnl. Before this
conversion the function relied on using rcu and since we already use rcu to
destroy the vlans, we can simply migrate the list to use the rcu helpers.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace "void *obj" with a generic structure. Introduce couple of
helpers along that.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be aligned with obj.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not pass the original flags when creating a context vlan only
because they may contain some flags that change behaviour in the bridge.
The new global context should be with minimal set of flags, so pass 0
and let br_vlan_add() set the master flag only.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new port is being added we need to make vlgrp available after
rhashtable has been initialized and when removing a port we need to
flush the vlans and free the resources after we're sure noone can use
the port, i.e. after it's removed from the port list and synchronize_rcu
is executed.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One obvious way to converge more code (which was also used by the
previous vlan code) is to move pvid inside net_bridge_vlan_group. This
allows us to simplify some and remove other port-specific functions.
Also gives us the ability to simply pass the vlan group and use all of the
contained information.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While a new port is being initialized the rx_handler gets set, but the
vlans get initialized later in br_add_if() and in that window if we
receive a frame with a link-local address we can try to dereference
p->vlgrp in:
br_handle_frame() -> br_handle_local_finish() -> br_should_learn()
Fix this by checking vlgrp before using it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Stephen pointed out the default initial size is more than we need, so
let's start small (4 elements, thus nelem_hint = 3). Also limit the hash
locks to the number of CPUs as we don't need any write-side scaling and
this looks like the minimum.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following pull request contains Netfilter/IPVS updates for net-next
containing 90 patches from Eric Biederman.
The main goal of this batch is to avoid recurrent lookups for the netns
pointer, that happens over and over again in our Netfilter/IPVS code. The idea
consists of passing netns pointer from the hook state to the relevant functions
and objects where this may be needed.
You can find more information on the IPVS updates from Simon Horman's commit
merge message:
c3456026ad ("Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next").
Exceptionally, this time, I'm not posting the patches again on netdev, Eric
already Cc'ed this mailing list in the original submission. If you need me to
make, just let me know.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev add and del operations to:
int switchdev_port_obj_add/del(struct net_device *dev,
enum switchdev_obj_id id, void *obj);
This allows the caller to pass a specific switchdev_obj_* structure
instead of the generic switchdev_obj one.
Drivers implementation of these operations and switchdev have been
changed accordingly.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the bridge vlan implementation to use rhashtables
instead of bitmaps. The main motivation behind this change is that we
need extensible per-vlan structures (both per-port and global) so more
advanced features can be introduced and the vlan support can be
extended. I've tried to break this up but the moment net_port_vlans is
changed and the whole API goes away, thus this is a larger patch.
A few short goals of this patch are:
- Extensible per-vlan structs stored in rhashtables and a sorted list
- Keep user-visible behaviour (compressed vlans etc)
- Keep fastpath ingress/egress logic the same (optimizations to come
later)
Here's a brief list of some of the new features we'd like to introduce:
- per-vlan counters
- vlan ingress/egress mapping
- per-vlan igmp configuration
- vlan priorities
- avoid fdb entries replication (e.g. local fdb scaling issues)
The structure is kept single for both global and per-port entries so to
avoid code duplication where possible and also because we'll soon introduce
"port0 / aka bridge as port" which should simplify things further
(thanks to Vlad for the suggestion!).
Now we have per-vlan global rhashtable (bridge-wide) and per-vlan port
rhashtable, if an entry is added to a port it'll get a pointer to its
global context so it can be quickly accessed later. There's also a
sorted vlan list which is used for stable walks and some user-visible
behaviour such as the vlan ranges, also for error paths.
VLANs are stored in a "vlan group" which currently contains the
rhashtable, sorted vlan list and the number of "real" vlan entries.
A good side-effect of this change is that it resembles how hw keeps
per-vlan data.
One important note after this change is that if a VLAN is being looked up
in the bridge's rhashtable for filtering purposes (or to check if it's an
existing usable entry, not just a global context) then the new helper
br_vlan_should_use() needs to be used if the vlan is found. In case the
lookup is done only with a port's vlan group, then this check can be
skipped.
Things tested so far:
- basic vlan ingress/egress
- pvids
- untagged vlans
- undef CONFIG_BRIDGE_VLAN_FILTERING
- adding/deleting vlans in different scenarios (with/without global ctx,
while transmitting traffic, in ranges etc)
- loading/removing the module while having/adding/deleting vlans
- extracting bridge vlan information (user ABI), compressed requests
- adding/deleting fdbs on vlans
- bridge mac change, promisc mode
- default pvid change
- kmemleak ON during the whole time
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is easiliy available in state->net so use it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Allow bridge forward delay to be configured when Spanning Tree is enabled.
Signed-off-by: Ian Wilson <iwilson@brocade.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/arp.c
The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Premkumar Jonnala <pjonnala@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only pass the void *priv parameter out of the nf_hook_ops. That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As xt_action_param lives on the stack this does not bloat any
persistent data structures.
This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum
This simplifies the code, makes it more readable, and likely reduces
cache line misses. Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Nearly everything thing of interest to ebt_do_table is already present
in nf_hook_state. Simplify ebt_do_table by just passing in the skb,
nf_hook_state, and the table. This make the code easier to read and
maintenance easier.
To support this create an nf_hook_state on the stack in ebt_broute
(the only caller without a nf_hook_state already available). This new
nf_hook_state adds no new computations to ebt_broute, but does use a
few more bytes of stack.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.
As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.
To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of saying "net = dev_net(state->in?state->in:state->out)"
just say "state->net". As that information is now availabe,
much less confusing and much less error prone.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.
This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".
In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.
The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_receive_skb_sk is only called once in the bridge code, replace
it with a bridge specific function that calls netif_receive_skb.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is prep work for passing net to the netfilter hooks.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When struct net starts being passed through the ipv4 and ipv6 fragment
routines br_nf_push_frag_xmit will need to take a net parameter.
Prepare br_nf_push_frag_xmit before that is needed and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a prep work for passing struct net through ip_do_fragment and
later the netfilter okfn. Doing this independently makes the later
code changes clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_skb_sk,
and have br_send_bpdu_finish call dev_queue_xmit.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the newly introduced helper functions the skb pulling is hidden in
the checksumming function - and undone before returning to the caller.
The IGMPv3 and MLDv2 report parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header,
breaking the message parsing and creating packet loss.
Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unnecessary switchdev.h include from br_netlink.c.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since __vlan_del can return an error code, change its inner function
__vlan_vid_del to return an eventual error from switchdev_port_obj_del.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree.
In sum, patches to address fallout from the previous round plus updates from
the IPVS folks via Simon Horman, they are:
1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm
directs network connections to the server with the highest weight that is
currently available and overflows to the next when active connections exceed
the node's weight. From Raducu Deaconu.
2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch
from Julian Anastasov.
3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From
Julian Anastasov.
4) Enhance multicast configuration for the IPVS state sync daemon. Also from
Julian.
5) Resolve sparse warnings in the nf_dup modules.
6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set.
7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative
subsets of code 1. From Andreas Herz.
8) Revert the jumpstack size calculation from mark_source_chains due to chain
depth miscalculations, from Florian Westphal.
9) Calm down more sparse warning around the Netfilter tree, again from Florian
Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge/netfilter/ebtables.c:290:26: warning: incorrect type in assignment (different modifiers)
-> remove __pure annotation.
ipv6/netfilter/ip6t_SYNPROXY.c:240:27: warning: cast from restricted __be16
-> switch ntohs to htons and vice versa.
netfilter/core.c:391:30: warning: symbol 'nfq_ct_nat_hook' was not declared. Should it be static?
-> delete it, got removed
net/netfilter/nf_synproxy_core.c:221:48: warning: cast to restricted __be32
-> Use __be32 instead of u32.
Tested with objdiff that these changes do not affect generated code.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While looking into fixing the local entries scalability issue I noticed
that the structure is badly arranged because vlan_id would fall in a
second cache line while keeping rcu which is used only when deleting
in the first, so re-arrange the structure and push rcu to the end so we
can get 16 bytes which can be used for other fields (by pushing rcu
fully in the second 64 byte chunk). With this change all the core
necessary information when doing fdb lookups will be available in a
single cache line.
pahole before (note vlan_id):
struct net_bridge_fdb_entry {
struct hlist_node hlist; /* 0 16 */
struct net_bridge_port * dst; /* 16 8 */
struct callback_head rcu; /* 24 16 */
long unsigned int updated; /* 40 8 */
long unsigned int used; /* 48 8 */
mac_addr addr; /* 56 6 */
unsigned char is_local:1; /* 62: 7 1 */
unsigned char is_static:1; /* 62: 6 1 */
unsigned char added_by_user:1; /* 62: 5 1 */
unsigned char added_by_external_learn:1; /* 62: 4 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
__u16 vlan_id; /* 64 2 */
/* size: 72, cachelines: 2, members: 11 */
/* sum members: 65, holes: 1, sum holes: 1 */
/* bit holes: 1, sum bit holes: 4 bits */
/* padding: 6 */
/* last cacheline: 8 bytes */
}
pahole after (note vlan_id):
struct net_bridge_fdb_entry {
struct hlist_node hlist; /* 0 16 */
struct net_bridge_port * dst; /* 16 8 */
long unsigned int updated; /* 24 8 */
long unsigned int used; /* 32 8 */
mac_addr addr; /* 40 6 */
__u16 vlan_id; /* 46 2 */
unsigned char is_local:1; /* 48: 7 1 */
unsigned char is_static:1; /* 48: 6 1 */
unsigned char added_by_user:1; /* 48: 5 1 */
unsigned char added_by_external_learn:1; /* 48: 4 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 7 bytes hole, try to pack */
struct callback_head rcu; /* 56 16 */
/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
/* size: 72, cachelines: 2, members: 11 */
/* sum members: 65, holes: 1, sum holes: 7 */
/* bit holes: 1, sum bit holes: 4 bits */
/* last cacheline: 8 bytes */
}
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables bridge vlan_protocol to be configured through netlink.
When CONFIG_BRIDGE_VLAN_FILTERING is disabled, kernel behaves the
same way as this feature is not implemented.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
.maxtype should match .policy. Probably just been getting lucky here
because IFLA_BRPORT_MAX > IFLA_BR_MAX.
Fixes: 13323516 ("bridge: implement rtnl_link_ops->changelink")
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent refactoring of the IGMP and MLD parsing code into
ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash /
BUG() invocation for bridges:
I wrongly assumed that skb_get() could be used as a simple reference
counter for an skb which is not the case. skb_get() bears additional
semantics, a user count. This leads to a BUG() invocation in
pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb
with a user count greater than one - unfortunately the refactoring did
just that.
Fixing this by removing the skb_get() call and changing the API: The
caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to
additionally check whether the returned skb_trimmed is a clone.
Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cavium/Kconfig
The cavium conflict was overlapping dependency
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the ability to toggle the vlan filtering support via
netlink. Since we're already running with rtnl in .changelink() we don't
need to take any additional locks.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The address in the switchdev_obj_fdb structure is currently represented
as a pointer. Replacing it for a 6-byte array allows switchdev to carry
addresses directly read from hardware registers, not stored by the
switch chip driver (as in Rocker).
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 842a9ae08a ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) A couple of cleanups for the netfilter core hook from Eric Biederman.
2) Net namespace hook registration, also from Eric. This adds a dependency with
the rtnl_lock. This should be fine by now but we have to keep an eye on this
because if we ever get the per-subsys nfnl_lock before rtnl we have may
problems in the future. But we have room to remove this in the future by
propagating the complexity to the clients, by registering hooks for the init
netns functions.
3) Update nf_tables to use the new net namespace hook infrastructure, also from
Eric.
4) Three patches to refine and to address problems from the new net namespace
hook infrastructure.
5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
only applies to a very special case, the TEE target, but Eric Dumazet
reports that this is slowing down things for everyone else. So let's only
switch to the alternate jumpstack if the tee target is in used through a
static key. This batch also comes with offline precalculation of the
jumpstack based on the callchain depth. From Florian Westphal.
6) Minimal SCTP multihoming support for our conntrack helper, from Michal
Kubecek.
7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
Westphal.
8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.
9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch when a vid was not specified, the entry was added with
vid 0 which is useless when vlan_filtering is enabled. This patch makes
the entry to be added on all configured vlans when vlan filtering is
enabled and respectively deleted from all, if the entry vid is 0.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.
Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port vlan ids
eth3 1 PVID Egress Untagged
64
128
eth4 1 PVID Egress Untagged
256
1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp
After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge devices don't need to segment multiple tagged packets since thier
ports can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checkpatch.pl "ERROR: do not initialise statics to 0 or NULL" for
all statics explicitly initialized to 0.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We can use union for most of the temporary cruft (original ipv4/ipv6
address, source mac, physoutdev) since they're used during different
stages of br netfilter traversal.
Also get rid of the last two ->mask users.
Shrinks struct from 48 to 32 on 64bit arch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since mdb states were introduced when deleting an entry the state was
left as it was set in the delete request from the user which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification which is wrong since
the entry was permanent, the state in a delete is always reported as
"temp" regardless of the real state of the entry.
After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
There's one important note to make here that the state is actually not
matched when doing a delete, so one can delete a permanent entry by
stating "temp" in the end of the command, I've chosen this fix in order
not to break user-space tools which rely on this (incorrect) behaviour.
So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
Note the state of the entry that got deleted is correct in the
notification.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
Since the network header is exepected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources with many
ports because the hold_timer is per port and fires each second and the
hello timer fires each 2 seconds even though it's global. While in
user-space STP mode these timers are completely unnecessary so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send notifications on router port add and del/expire, re-use the already
existing MDBA_ROUTER and send NEWMDB/DELMDB netlink notifications
respectively.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since slave_changelink support was added there have been a few race
conditions when using br_setport() since some of the port functions it
uses require the bridge lock. It is very easy to trigger a lockup due to
some internal spin_lock() usage without bh disabled, also it's possible to
get the bridge into an inconsistent state.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 3ac636b859 ("bridge: implement rtnl_link_ops->slave_changelink")
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes setting of vinfo.flags in the br_fill_ifvlaninfo_range() method. The
assignment of vinfo.flags &= ~BRIDGE_VLAN_INFO_RANGE_BEGIN has no effect and is
unneeded, as vinfo.flags value is overriden by the immediately following
vinfo.flags = flags | BRIDGE_VLAN_INFO_RANGE_END assignement.
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix:
net/bridge/br_if.c: In function 'br_dev_delete':
>> net/bridge/br_if.c:284:2: error: implicit declaration of function
>> 'br_multicast_dev_del' [-Werror=implicit-function-declaration]
br_multicast_dev_del(br);
^
cc1: some warnings being treated as errors
when igmp snooping is not defined.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bridge (or port) is brought down/up flush only temp entries and
leave the perm ones. Flush perm entries only when deleting the bridge
device or the associated port.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Group notifications were not sent when a group expired or was deleted
due to bridge/port device being deleted. So add br_mdb_notify() to
br_multicast_del_pg().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the mdb add/del code was introduced there have been 2 br_mdb_notify
calls when doing br_mdb_add() resulting in 2 notifications on each add.
Example:
Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
Before patch:
root@debian:~# bridge monitor all
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
After patch:
root@debian:~# bridge monitor all
[MDB]dev br0 port eth1 grp 239.0.0.1 permanent
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: cfd5675435 ("bridge: add support of adding and deleting mdb entries")
Signed-off-by: David S. Miller <davem@davemloft.net>
A report with INCLUDE/Change_to_include and empty source list should be
treated as a leave, specified by RFC 3376, section 3.1:
"If the requested filter mode is INCLUDE *and* the requested source
list is empty, then the entry corresponding to the requested
interface and multicast address is deleted if present. If no such
entry is present, the request is ignored."
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).
Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now when a querier was present static entries couldn't be deleted.
Fix this and allow the user to manipulate the mdb with or without a
querier.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c29390c6df ("xps: must clear sender_cpu before forwarding")
fixed an issue in normal forward path, caused by sender_cpu & napi_id
skb fields being an union.
Bridge is another point where skb can be forwarded, so we need
the same cure.
Bug triggers if packet was received on a NIC using skb_mark_napi_id()
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Bob Liu <bob.liu@oracle.com>
Tested-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fill also the port group state when sending notifications.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. This batch
mostly comes with patches to address fallout from the previous merge window
cycle, they are:
1) Use entry->state.hook_list from nf_queue() instead of the global nf_hooks
which is not valid when used from NFPROTO_NETDEV, this should cause no
problems though since we have no userspace queueing for that family, but
let's fix this now for the sake of correctness. Patch from Eric W. Biederman.
2) Fix compilation breakage in bridge netfilter if CONFIG_NF_DEFRAG_IPV4 is not
set, from Bernhard Thaler.
3) Use percpu jumpstack in arptables too, now that there's a single copy of the
rule blob we can't store the return address there anymore. Patch from
Florian Westphal.
4) Fix a skb leak in the xmit path of bridge netfilter, problem there since
2.6.37 although it should be not possible to hit invalid traffic there, also
from Florian.
5) Eric Leblond reports that when loading a large ruleset with many missing
modules after a fresh boot, nf_tables can take long time commit it. Fix this
by processing the full batch until the end, even on missing modules, then
abort only once and restart processing.
6) Add bridge netfilter files to the MAINTAINER files.
7) Fix a net_device refcount leak in the new IPV6 bridge netfilter code, from
Julien Grall.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
there's a check in br_ip_equal() for a matching vlan id, but the mdb
functions were not modified to use (or at least zero it) so when an
entry was added it would have a garbage vlan id (from the local br_ip
variable in __br_mdb_add/del) and this would prevent it from being
matched and also deleted. So zero out the whole local ip var to protect
ourselves from future changes and also to fix the current bug, since
there's no vlan id support in the mdb uapi - use always vlan id 0.
Example before patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
RTNETLINK answers: Invalid argument
After patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
Signed-off-by: David S. Miller <davem@davemloft.net>
Start the delete timer when adding temp static entries so they can expire.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit efb6de9b4b "netfilter: bridge:
forward IPv6 fragmented packets" introduced a new function
br_validate_ipv6 which take a reference on the inet6 device. Although,
the reference is not released at the end.
This will result to the impossibility to destroy any netdevice using
ipv6 and bridge.
It's possible to directly retrieve the inet6 device without taking a
reference as all netfilter hooks are protected by rcu_read_lock via
nf_hook_slow.
Spotted while trying to destroy a Xen guest on the upstream Linux:
"unregister_netdevice: waiting for vif1.0 to become free. Usage count = 1"
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Cc: Bernhard Thaler <bernhard.thaler@wvnet.at>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: fw@strlen.de
Cc: ian.campbell@citrix.com
Cc: wei.liu2@citrix.com
Cc: Bob Liu <bob.liu@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Vlan ids 0 and 4095 were disallowed by commit:
8adff41c3d ("bridge: Don't use VID 0 and 4095 in vlan filtering")
but then the check was removed when vlan ranges were introduced by:
bdced7ef78 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
So reintroduce the vlan range check.
Before patch:
[root@testvm ~]# bridge vlan add vid 0 dev eth0 master
(succeeds)
After Patch:
[root@testvm ~]# bridge vlan add vid 0 dev eth0 master
RTNETLINK answers: Invalid argument
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: bdced7ef78 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_nf_dev_queue_xmit must free skb in its error path.
NF_DROP is misleading -- its an okfn, not a netfilter hook.
Fixes: 462fb2af97 ("bridge : Sanitize skb before it enters the IP stack")
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_nf_ip_fragment() is not needed when neither CONFIG_NF_DEFRAG_IPV4 nor
CONFIG_NF_DEFRAG_IPV6 is set.
struct brnf_frag_data must be available if either CONFIG_NF_DEFRAG_IPV4
or CONFIG_NF_DEFRAG_IPV6 is set.
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
Add a new argument to br_fdb_delete_by_port which allows to specify a
vid to match when flushing entries and use it in nbp_vlan_delete() to
flush the dynamically learned entries of the vlan/port pair when removing
a vlan from a port. Before this patch only the local mac was being
removed and the dynamically learned ones were left to expire.
Note that the do_all argument is still respected and if specified, the
vid will be ignored.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a comment to explain why we're not disabling port's multicast when it
goes in blocking state. Since there's a check in the timer's function which
bypasses the timer if the port's in blocking/disabled state, the timer will
simply expire and stop without sending more queries.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use vid_begin/end to be consistent with BRIDGE_VLAN_INFO_RANGE_BEGIN/END.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a port goes through a link down/up the multicast router configuration
is not restored.
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 0909e11758 ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When STP is running in user-space and querier is configured, the
querier timer is not started when a port goes to a non-blocking state.
This patch unifies the user- and kernel-space stp multicast port enable
path and enables it in all states different from blocking. Note that when a
port goes in BR_STATE_DISABLED it's not enabled because that is handled
in the beginning of the port list loop.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve compilation breakage when CONFIG_IPV6 is not set by moving the IPv6
code into a separated br_netfilter_ipv6.c file.
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After the ->set() spinlocks were removed br_stp_set_bridge_priority
was left running without any protection when used via sysfs. It can
race with port add/del and could result in use-after-free cases and
corrupted lists. Tested by running port add/del in a loop with stp
enabled while setting priority in a loop, crashes are easily
reproducible.
The spinlocks around sysfs ->set() were removed in commit:
14f98f258f ("bridge: range check STP parameters")
There's also a race condition in the netlink priority support that is
fixed by this change, but it was introduced recently and the fixes tag
covers it, just in case it's needed the commit is:
af615762e9 ("bridge: add ageing_time, stp_state, priority over netlink")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: 14f98f258f ("bridge: range check STP parameters")
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to delete from offload the device externally learnded fdbs when any
one of these events happen:
1) Bridge ages out fdb. (When bridge is doing ageing vs. device doing
ageing. If device is doing ageing, it would send SWITCHDEV_FDB_DEL
directly).
2) STP state change flushes fdbs on port.
3) User uses sysfs interface to flush fdbs from bridge or bridge port:
echo 1 >/sys/class/net/BR_DEV/bridge/flush
echo 1 >/sys/class/net/BR_PORT/brport/flush
4) Offload driver send event SWITCHDEV_FDB_DEL to delete fdb entry.
For rocker, we can now get called to delete fdb entry in wait and nowait
contexts, so set NOWAIT flag when deleting fdb entry.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2:
Move struct switchdev_obj automatics to inner scope where there used.
v1:
To maintain backward compatibility with the existing iproute2 "bridge vlan"
command, let bridge's setlink/dellink handler call into either the port
driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops.
This allows port driver to choose 8021q ops or the newer
bridge_setlink/dellink ops when implementing VLAN add/del filtering on the
device. The iproute "bridge vlan" command does not need to be modified.
To summarize using the "bridge vlan" command examples, we have:
1) bridge vlan add|del vid VID dev DEV
Here iproute2 sets MASTER flag. Bridge's bridge_setlink/dellink is called.
Vlan is set on bridge for port. If port driver implements ndo 8021q ops,
call those to port driver can install vlan filter on device. Otherwise, if
port driver implements bridge_setlink/dellink ops, call those to install
vlan filter to device. This option only works if port is bridged.
2) bridge vlan add|del vid VID dev DEV master
Same as 1)
3) bridge vlan add|del vid VID dev DEV self
Bridge's bridge_setlink/dellink isn't called. Port driver's
bridge_setlink/dellink is called, if implemented. This option works if
port is bridged or not. If port is not bridged, a VLAN can still be
added/deleted to device filter using this variant.
4) bridge vlan add|del vid VID dev DEV master self
This is a combination of 1) and 3), but will only work if port is bridged.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This a bit large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:
1) Warn the user when there is a better protocol conntracker available, from
Marcelo Ricardo Leitner.
2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
Thaler. This comes with several patches to prepare the change in first place.
3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
is not needed anymore since now we use the largest fragment size to
refragment, from Florian Westphal.
4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
follow up patch to refine it from Eric Dumazet.
6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
7) Get rid of parens in Netfilter Kconfig files.
8) Attach the net_device to the basechain as opposed to the initial per table
approach in the nf_tables netdev family.
9) Subscribe to netdev events to detect the removal and registration of a
device that is referenced by a basechain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If bridge netfilter is used with both
bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled
then ip fragments in VLAN frames are sent without the vlan header.
This has never worked reliably. Turns out this relied on pre-3.5
behaviour where skb frag_list was used to store ip fragments;
ip_fragment() then re-used these skbs.
But since commit 3cc4949269
("ipv4: use skb coalescing in defragmentation") this is no longer
the case. ip_do_fragment now needs to allocate new skbs, but these
don't contain the vlan tag information anymore.
Fix it by storing vlan information of the ressembled skb in the
br netfilter percpu frag area, and restore them for each of the
fragments.
Fixes: 3cc4949269 ("ipv4: use skb coalescing in defragmentation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
since commit d6b915e29f
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.
Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv6 fragmented packets are not forwarded on an ethernet bridge
with netfilter ip6_tables loaded. e.g. steps to reproduce
1) create a simple bridge like this
modprobe br_netfilter
brctl addbr br0
brctl addif br0 eth0
brctl addif br0 eth2
ifconfig eth0 up
ifconfig eth2 up
ifconfig br0 up
2) place a host with an IPv6 address on each side of the bridge
set IPv6 address on host A:
ip -6 addr add fd01:2345:6789:1::1/64 dev eth0
set IPv6 address on host B:
ip -6 addr add fd01:2345:6789:1::2/64 dev eth0
3) run a simple ping command on host A with packets > MTU
ping6 -s 4000 fd01:2345:6789:1::2
4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge
IPv6 fragmented packets traverse the bridge cleanly until somebody runs.
"ip6tables -t nat -nvL". As soon as it is run (and netfilter modules are
loaded) IPv6 fragmented packets do not traverse the bridge any more (you
see no more responses in ping's output).
After applying this patch IPv6 fragmented packets traverse the bridge
cleanly in above scenario.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
[pablo@netfilter.org: small changes to br_nf_dev_queue_xmit]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prepare check_hbh_len() to be called from newly introduced
br_validate_ipv6() in next commit.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_parse_ip_options() does not parse any IP options, it validates IP
packets as a whole and the function name is misleading.
Rename br_parse_ip_options() to br_validate_ipv4() and remove unneeded
commments.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently frag_max_size is member of br_input_skb_cb and copied back and
forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed or
used.
Attach frag_max_size to nf_bridge_info and set value in pre_routing and
forward functions. Use its value in forward and xmit functions.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv4 iptables allows to REDIRECT/DNAT/SNAT any traffic over a bridge.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-iptables=1
$ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This does not work with ip6tables on a bridge in NAT66 scenario
because the REDIRECT/DNAT/SNAT is not correctly detected.
The bridge pre-routing (finish) netfilter hook has to check for a possible
redirect and then fix the destination mac address. This allows to use the
ip6tables rules for local REDIRECT/DNAT/SNAT REDIRECT similar to the IPv4
iptables version.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
$ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
on a bridge with two interfaces using SNAT/DNAT NAT66 rules.
Reported-by: Artie Hamilton <artiemhamilton@yahoo.com>
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
[bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
[bernhard.thaler@wvnet.at: rebased, split into separate patches]
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Put br_nf_pre_routing_finish_ipv6() after daddr_was_changed() and
br_nf_pre_routing_finish_bridge() to prepare calling these functions
from there.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
use binary AND on complement of BRNF_NF_BRIDGE_PREROUTING to unset
bit in nf_bridge->mask.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the addition of sysfs multicast router support if one set
multicast_router to "2" more than once, then the port would be added to
the hlist every time and could end up linking to itself and thus causing an
endless loop for rlist walkers.
So to reproduce just do:
echo 2 > multicast_router; echo 2 > multicast_router;
in a bridge port and let some igmp traffic flow, for me it hangs up
in br_multicast_flood().
Fix this by adding a check in br_multicast_add_router() if the port is
already linked.
The reason this didn't happen before the addition of multicast_router
sysfs entries is because there's a !hlist_unhashed check that prevents
it.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: 0909e11758 ("bridge: Add multicast_router sysfs entries")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch the user-specified bridge port was ignored when
deleting an fdb entry and thus one could delete an entry that belonged
to any port.
Example (eth0 and eth1 are br0 ports):
bridge fdb add 00:11:22:33:44:55 dev eth0 master
bridge fdb del 00:11:22:33:44:55 dev eth1 master
(succeeds)
after the patch:
bridge fdb add 00:11:22:33:44:55 dev eth0 master
bridge fdb del 00:11:22:33:44:55 dev eth1 master
RTNETLINK answers: No such file or directory
Based on a patch by Wilson Kok.
Reported-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to disable softirqs because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. The spin locks in br_fdb_update were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Using local_bh_disable/enable around br_fdb_update() allows us to keep
using the spin_lock/unlock in br_fdb_update for the fast-path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to use spin_lock_bh because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. These locks were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fix for net
The following patch reverts the ebtables chunk that enforces counters that was
introduced in the recently applied d26e2c9ffa ('Revert "netfilter: ensure
number of counters is >0 in do_replace()"') since this breaks ebtables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This partially reverts commit 1086bbe97a ("netfilter: ensure number of
counters is >0 in do_replace()") in net/bridge/netfilter/ebtables.c.
Setting rules with ebtables does not work any more with 1086bbe97a place.
There is an error message and no rules set in the end.
e.g.
~# ebtables -t nat -A POSTROUTING --src 12:34:56:78:9a:bc -j DROP
Unable to update the kernel. Two possible causes:
1. Multiple ebtables programs were executing simultaneously. The ebtables
userspace tool doesn't by default support multiple ebtables programs
running
Reverting the ebtables part of 1086bbe97a makes this work again.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_multicast_query_expired() querier argument is a pointer to
a struct bridge_mcast_querier :
struct bridge_mcast_querier {
struct br_ip addr;
struct net_bridge_port __rcu *port;
};
Intent of the code was to clear port field, not the pointer to querier.
Fixes: 2cd4143192 ("bridge: memorize and export selected IGMP/MLD querier port")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Linus Lüssing <linus.luessing@c0d3.blue>
Cc: Linus Lüssing <linus.luessing@web.de>
Cc: Steinar H. Gunderson <sesse@samfundet.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check in fdb_add_entry() if the source port should learn, similar
check is used in br_fdb_update.
Note that new fdb entries which are added manually or
as local ones are still permitted.
This patch has been tested by running traffic via a bridge port and
switching the port's state, also by manually adding/removing entries
from the bridge's fdb.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network managers like netifd (used in OpenWRT for instance) try to
configure interface options after creation but before setting the
interface up.
Unfortunately the sysfs / bridge currently only allows to configure the
hash_max and multicast_router options when the bridge interface is up.
But since br_multicast_init() doesn't start any timers and only sets
default values and initializes timers it should be save to reconfigure
the default values after that, before things actually get active after
the bridge is set up.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.
This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.
This only happens when there is a MLDv2 querier in the network.
The fix will only break out of the loop when there is a failure adding a
multicast address.
The mdb before the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
After the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
Fixes: 08b202b672 ("bridge br_multicast: IPv6 MLD support.")
Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c055d5b03b.
There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.
But thats only one part of the story. This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.
Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.
However, I forgot about SNPT/DNPT which is stateless.
So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.
Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:
warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0
It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.
The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.
However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.
This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Briefly
speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
Serget Popovich, more incremental updates to make br_netfilter a better
place from Florian Westphal, ARP support to the x_tables mark match /
target from and context Zhang Chunyu and the addition of context to know
that the x_tables runs through nft_compat. More specifically, they are:
1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.
2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.
3) Use skb->network_header to calculate the transport offset in
ip_set_get_ip{4,6}_port(). From Alexander Drozdov.
4) Reduce memory consumption per element due to size miscalculation,
this patch and follow up patches from Sergey Popovich.
5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
mtype_data_reset_flags(), also from Sergey.
6) Small clean for ipset macro trickery.
7) Fix error reporting when both ip_set_get_hostipaddr4() and
ip_set_get_extensions() from per-set uadt functions.
8) Simplify IPSET_ATTR_PORT netlink attribute validation.
9) Introduce HOST_MASK instead of hardcoded 32 in ipset.
10) Return true/false instead of 0/1 in functions that return boolean
in the ipset code.
11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.
12) Allow to dereference from ext_*() ipset macros.
13) Get rid of incorrect definitions of HKEY_DATALEN.
14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.
15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.
16) Release nf_bridge_info after POSTROUTING since this is only needed
from the physdev match, also from Florian.
17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
from Denys Vlasenko.
18) Oneliner to add ARP support to the x_tables mark match/target, from
Zhang Chunyu.
19) Add context to know if the x_tables extension runs from nft_compat,
to address minor problems with three existing extensions.
20) Correct return value in several seqfile *_show() functions in the
netfilter tree, from Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we have four xtables extensions that cannot be used from the
xt over nft compat layer. The problem is that they need real access to
the full blown xt_entry to validate that the rule comes with the right
dependencies. This check was introduced to overcome the lack of
sufficient userspace dependency validation in iptables.
To resolve this problem, this patch introduces a new field to the
xt_tgchk_param structure that tell us if the extension is run from
nft_compat context.
The three affected extensions are:
1) CLUSTERIP, this target has been superseded by xt_cluster. So just
bail out by returning -EINVAL.
2) TCPMSS. Relax the checking when used from nft_compat. If used with
the wrong configuration, it will corrupt !syn packets by adding TCP
MSS option.
3) ebt_stp. Relax the check to make sure it uses the reserved
destination MAC address for STP.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
nf_bridge information is only needed for -m physdev, so we can always free
it after POST_ROUTING. This has the advantage that allocation and free will
typically happen on the same cpu.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The neigh_header is only needed when we detect DNAT after prerouting
and neigh cache didn't have a mac address for us.
The output port has not been chosen yet so we can re-use the storage
area, bringing struct size down to 32 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs. Fix this by using named union on structs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is revert of:
commit 68e331c785 ("bridge: offload bridge port attributes to switch asic
if feature flag set")
Restore br_dellink back to original and don't call into SELF port driver.
rtnetlink.c:bridge_dellink() already does a call into port driver for SELF.
bridge vlan add/del cmd defaults to MASTER. From man page for bridge vlan
add/del cmd:
self the vlan is configured on the specified physical device.
Required if the device is the bridge device.
master the vlan is configured on the software bridge (default).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is revert of:
commit 68e331c785 ("bridge: offload bridge port attributes to switch asic
if feature flag set")
Restore br_setlink back to original and don't call into SELF port driver.
rtnetlink.c:bridge_setlink() already does a call into port driver for SELF.
bridge set link cmd defaults to MASTER. From man page for bridge link set
cmd:
self link setting is configured on specified physical device
master link setting is configured on the software bridge (default)
The link setting has two values: the device-side value and the software
bridge-side value. These are independent and settable using the bridge
link set cmd by specifying some combination of [master] | [self].
Furthermore, the device-side and bridge-side settings have their own
initial value, viewable from bridge -d link show cmd.
Restoring br_setlink back to original makes rocker (the only in-kernel user
of SELF link settings) work as first implement: two-sided values.
It's true that when both MASTER and SELF are specified from the command,
two netlink notifications are generated, one for each side of the settings.
The user-space app can distiquish between the two notifications by
observing the MASTER or SELF flag.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
STP update is just a settable port attribute, so convert
switchdev_port_stp_update to an attr set.
For DSA, the prepare phase is skipped and STP updates are only done in the
commit phase. This is because currently the DSA drivers don't need to
allocate any memory for STP updates and the STP update will not fail to HW
(unless something horrible goes wrong on the MDIO bus, in which case the
prepare phase wouldn't have been able to predict anyway).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
/sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
some IEEE 802.1D Table 7-10 Reserved addresses:
(MAC Control) 802.3 01-80-C2-00-00-01
(Link Aggregation) 802.3 01-80-C2-00-00-02
802.1AB LLDP 01-80-C2-00-00-0E
Change BR_GROUPFWD_RESTRICTED to allow to forward LLDP frames and document
group_fwd_mask.
e.g.
echo 16384 > /sys/class/net/brX/bridge/group_fwd_mask
allows to forward LLDP frames.
This may be needed for bridge setups used for network troubleshooting or
any other scenario where forwarding of LLDP frames is desired (e.g. bridge
connecting a virtual machine to real switch transmitting LLDP frames that
virtual machine needs to receive).
Tested on a simple bridge setup with two interfaces and host transmitting
LLDP frames on one side of this bridge (used lldpd). Setting group_fwd_mask
as described above lets LLDP frames traverse bridge.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's use these new, neat helpers.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: e5a55a8987 ("net: create generic bridge ops")
Fixes: 815cccbf10 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Sathya Perla <sathya.perla@emulex.com>
CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
CC: Ajit Khaparde <ajit.khaparde@emulex.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko <jiri@resnulli.us>
CC: Scott Feldman <sfeldma@gmail.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: 37a393bc49 ("bridge: notify mdb changes via netlink")
CC: Cong Wang <amwang@redhat.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.
The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.
To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.
Userspace fully passes the testsuite using both old and new register
addressing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helper functions to parse and dump register values in netlink attributes.
These helpers will later be changed to take care of translation between the
old 128 bit and the new 32 bit register numbers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simple conversion to use u32 pointers to the beginning of the registers
to keep follow up patches smaller.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.
This is needed to seperate the verdict from the data registers completely,
so their size can be changed.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All users of nft_validate_register_store() first invoke
nft_validate_output_register(). There is in fact no use for using it
on its own, so simplify the code by folding the functionality into
nft_validate_register_store() and kill it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The existing name is ambiguous, data is loaded as well when we read from
a register. Rename to nft_validate_register_store() for clarity and
consistency with the upcoming patch to introduce its counterpart,
nft_validate_register_load().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For values spanning multiple registers, we need to validate that enough
space is available from the destination register onwards. Add a len
argument to nft_validate_data_load() and consolidate the existing length
validations in preparation of that.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
They are:
* nf_tables set timeout infrastructure from Patrick Mchardy.
1) Add support for set timeout support.
2) Add support for set element timeouts using the new set extension
infrastructure.
4) Add garbage collection helper functions to get rid of stale elements.
Elements are accumulated in a batch that are asynchronously released
via RCU when the batch is full.
5) Add garbage collection synchronization helpers. This introduces a new
element busy bit to address concurrent access from the netlink API and the
garbage collector.
5) Add timeout support for the nft_hash set implementation. The garbage
collector peridically checks for stale elements from the workqueue.
* iptables/nftables cgroup fixes:
6) Ignore non full-socket objects from the input path, otherwise cgroup
match may crash, from Daniel Borkmann.
7) Fix cgroup in nf_tables.
8) Save some cycles from xt_socket by skipping packet header parsing when
skb->sk is already set because of early demux. Also from Daniel.
* br_netfilter updates from Florian Westphal.
9) Save frag_max_size and restore it from the forward path too.
10) Use a per-cpu area to restore the original source MAC address when traffic
is DNAT'ed.
11) Add helper functions to access physical devices.
12) Use these new physdev helper function from xt_physdev.
13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
state information.
14) Annotate original layer 2 protocol number in nf_bridge info, instead of
using kludgy flags.
15) Also annotate the pkttype mangling when the packet travels back and forth
from the IP to the bridge layer, instead of using a flag.
* More nf_tables set enhancement from Patrick:
16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending
stale elements scheduled to be released.
18) Restrict loop checks to set maps.
19) Add support for dynamic set updates from the packet path.
20) Add support to store optional user data (eg. comments) per set element.
BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
More recent GCC warns about two kinds of switch statement uses:
1) Switching on an enumeration, but not having an explicit case
statement for all members of the enumeration. To show the
compiler this is intentional, we simply add a default case
with nothing more than a break statement.
2) Switching on a boolean value. I think this warning is dumb
but nevertheless you get it wholesale with -Wswitch.
This patch cures all such warnings in netfilter.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_bridge_info->mask is used for several things, for example to
remember if skb->pkt_type was set to OTHER_HOST.
For a bridge, OTHER_HOST is expected case. For ip forward its a non-starter
though -- routing expects PACKET_HOST.
Bridge netfilter thus changes OTHER_HOST to PACKET_HOST before hook
invocation and then un-does it after hook traversal.
This information is irrelevant outside of br_netfilter.
After this change, ->mask now only contains flags that need to be
known outside of br_netfilter in fast-path.
Future patch changes mask into a 2bit state field in sk_buff, so that
we can remove skb->nf_bridge pointer for good and consider all remaining
places that access nf_bridge info content a not-so fastpath.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
->mask is a bit info field that mixes various use cases.
In particular, we have flags that are mutually exlusive, and flags that
are only used within br_netfilter while others need to be exposed to
other parts of the kernel.
Remove BRNF_8021Q/PPPoE flags. They're mutually exclusive and only
needed within br_netfilter context.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't access skb->nf_bridge directly, this pointer will be removed soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_netfilter maintains an extra state, nf_bridge_info, which is attached
to skb via skb->nf_bridge pointer.
Amongst other things we use skb->nf_bridge->data to store the original
mac header for every processed skb.
This is required for ip refragmentation when using conntrack
on top of bridge, because ip_fragment doesn't copy it from original skb.
However there is no need anymore to do this unconditionally.
Move this to the one place where its needed -- when br_netfilter calls
ip_fragment().
Also switch to percpu storage for this so we can handle fragmenting
without accessing nf_bridge meta data.
Only user left is neigh resolution when DNAT is detected, to hold
the original source mac address (neigh resolution builds new mac header
using bridge mac), so rename ->data and reduce its size to whats needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.
There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We also need to save/store in forward, else br_parse_ip_options call
will zero frag_max_size as well.
Fixes: 93fdd47e5 ('bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING')
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next.
Basically, more incremental updates for br_netfilter from Florian
Westphal, small nf_tables updates (including one fix for rb-tree
locking) and small two-liner to add extra validation for the REJECT6
target.
More specifically, they are:
1) Use the conntrack status flags from br_netfilter to know that DNAT is
happening. Patch for Florian Westphal.
2) nf_bridge->physoutdev == NULL already indicates that the traffic is
bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian.
3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc,
from Joe Perches.
4) Consolidation of nf_tables_newtable() error path.
5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(),
from Florian Westphal.
6) Access rb-tree root node inside the lock and remove unnecessary
locking from the get path (we already hold nfnl_lock there), from
Patrick McHardy.
7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't
support interval, also from Patrick.
8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is
actually restricting matches to TCP.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The br_netfilter frag output function calls skb_cow_head() so in
case it needs a larger headroom to e.g. re-add a previously stripped PPPOE
or VLAN header things will still work (at cost of reallocation).
We can then move nf_bridge_encap_header_len to br_netfilter.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c
The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().
So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.
The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
Its not needed anymore since 2bf540b73e
([NETFILTER]: bridge-netfilter: remove deferred hooks).
Before this it was possible to have physoutdev set for locally generated
packets -- this isn't the case anymore:
BRNF_STATE_BRIDGED flag is set when we assign nf_bridge->physoutdev,
so physoutdev != NULL means BRNF_STATE_BRIDGED is set.
If physoutdev is NULL, then we are looking at locally-delivered and
routed packet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ask conntrack instead of storing ipv4 address in nf_bridge_info->data.
Ths avoids the need to use ->data during NF_PRE_ROUTING.
Only two functions that need ->data remain.
These will be addressed in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On adding an interface br_add_if() sets the MTU to the min of
all the interfaces. Do the same thing on removing an interface too
in br_del_if.
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e5de75b ("netfilter: bridge: move DNAT helper to br_netfilter") results
in the following link problem:
net/bridge/br_device.c:29: undefined reference to `br_nf_prerouting_finish_bridge`
Moreover it creates a hard dependency between br_netfilter and the
bridge core, which is what we've been trying to avoid so far.
Resolve this problem by using a hook structure so we reduce #ifdef
pollution and keep bridge netfilter specific code under br_netfilter.c
which was the original intention.
Reported-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
make C=1 CF=-D__CHECK_ENDIAN__ shows following:
net/bridge/netfilter/nft_reject_bridge.c:65:50: warning: incorrect type in argument 3 (different base types)
net/bridge/netfilter/nft_reject_bridge.c:65:50: expected restricted __be16 [usertype] protocol [..]
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: cast from restricted __be16
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: incorrect type in argument 1 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:121:50: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:168:52: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:233:52: warning: incorrect type in argument 3 (different base types) [..]
Caused by two (harmless) errors:
1. htons() instead of ntohs()
2. __be16 for protocol in nf_reject_ipXhdr_put API, use u8 instead.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After my change to neigh_hh_init to obtain the protocol from the
neigh_table there are no more users of protocol in struct dst_ops.
Remove the protocol field from dst_ops and all of it's initializers.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Basically, improvements for the packet rejection infrastructure,
deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for
br_netfilter. More specifically they are:
1) Send packet to reset flow if checksum is valid, from Florian Westphal.
2) Fix nf_tables reject bridge from the input chain, also from Florian.
3) Deprecate the CLUSTERIP target, the cluster match supersedes it in
functionality and it's known to have problems.
4) A couple of cleanups for nf_tables rule tracing infrastructure, from
Patrick McHardy.
5) Another cleanup to place transaction declarations at the bottom of
nf_tables.h, also from Patrick.
6) Consolidate Kconfig dependencies wrt. NF_TABLES.
7) Limit table names to 32 bytes in nf_tables.
8) mac header copying in bridge netfilter is already required when
calling ip_fragment(), from Florian Westphal.
9) move nf_bridge_update_protocol() to br_netfilter.c, also from
Florian.
10) Small refactor in br_netfilter in the transmission path, again from
Florian.
11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Only one caller, there is no need to keep this in a header.
Move it to br_netfilter.c where this belongs to.
Based on patch from Florian Westphal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The mac header only has to be copied back into the skb for
fragments generated by ip_fragment(), which only happens
for bridge forwarded packets with nf-call-iptables=1 && active nf_defrag.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This extends the design in commit 958501163d ("bridge: Add support for
IEEE 802.11 Proxy ARP") with optional set of rules that are needed to
meet the IEEE 802.11 and Hotspot 2.0 requirements for ProxyARP. The
previously added BR_PROXYARP behavior is left as-is and a new
BR_PROXYARP_WIFI alternative is added so that this behavior can be
configured from user space when required.
In addition, this enables proxyarp functionality for unicast ARP
requests for both BR_PROXYARP and BR_PROXYARP_WIFI since it is possible
to use unicast as well as broadcast for these frames.
The key differences in functionality:
BR_PROXYARP:
- uses the flag on the bridge port on which the request frame was
received to determine whether to reply
- block bridge port flooding completely on ports that enable proxy ARP
BR_PROXYARP_WIFI:
- uses the flag on the bridge port to which the target device of the
request belongs
- block bridge port flooding selectively based on whether the proxyarp
functionality replied
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the STP timer fires, it can call br_ifinfo_notify(),
which in turn ends up in the new br_get_link_af_size().
This function is annotated to be using RTNL locking, which
clearly isn't the case here, and thus lockdep warns:
===============================
[ INFO: suspicious RCU usage. ]
3.19.0+ #569 Not tainted
-------------------------------
net/bridge/br_private.h:204 suspicious rcu_dereference_protected() usage!
Fix this by doing RCU locking here.
Fixes: b7853d73e3 ("bridge: add vlan info to bridge setlink and dellink notification messages")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/rocker/rocker.c
The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.
Signed-off-by: David S. Miller <davem@davemloft.net>
make build fail if structure no longer fits into ->cb storage.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge reject handling is not straightforward, there are many subtle
differences depending on configuration.
skb->dev is either the bridge port (PRE_ROUTING) or the bridge
itself (INPUT), so we need to use indev instead.
Also, checksum validation will only work reliably if we trim skb
according to the l3 header size.
While at it, add csum validation for ipv6 and skip existing tests
if skb was already checked e.g. by GRO.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
my previous patch skipped vlan range optimizations during skb size
calculations for simplicity.
This incremental patch considers vlan ranges during
skb size calculations. This leads to a bit of code duplication
in the fill and size calculation functions. But, I could not find a
prettier way to do this. will take any suggestions.
Previously, I had reused the existing br_get_link_af_size size calculation
function to calculate skb size for notifications. Reusing it this time
around creates some change in behaviour issues for the usual
.get_link_af_size callback.
This patch adds a new br_get_link_af_size_filtered() function to
base the size calculation on the incoming filter flag and include
vlan ranges.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan add/deletes are not notified to userspace today. This patch adds
vlan info to bridge newlink/dellink notifications generated from the
bridge driver. Notifications use the RTEXT_FILTER_BRVLAN_COMPRESSED
flag to compress vlans into ranges whereever applicable.
The size calculations does not take ranges into account for
simplicity. This has the potential for allocating a larger skb than
required.
There is an existing inconsistency with bridge NEWLINK and DELLINK
change notifications. Both generate NEWLINK notifications. Since its
always a NEWLINK notification, this patch includes all vlans the port
belongs to in the notification. The NEWLINK and DELLINK request
messages however only include the vlans to be added and deleted.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_SYSCTL=n:
net/bridge/br_netfilter.c: In function ‘br_netfilter_init’:
net/bridge/br_netfilter.c:996: warning: label ‘err1’ defined but not used
Move the label and the code after it inside the existing #ifdef to get
rid of the warning.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge's default_pvid adds a vid by default, by which we cannot add a
non-vlan fdb entry by default, because br_fdb_add() adds fdb entries for
all vlans instead of a non-vlan one when any vlan is configured.
# ip link add br0 type bridge
# ip link set eth0 master br0
# bridge fdb add 12:34:56:78:90:ab dev eth0 master temp
# bridge fdb show brport eth0 | grep 12:34:56:78:90:ab
12:34:56:78:90:ab dev eth0 vlan 1 static
We expect a non-vlan fdb entry as well as vlan 1:
12:34:56:78:90:ab dev eth0 static
To fix this, we need to insert a non-vlan fdb entry if vlan is not
specified, even when any vlan is configured.
Fixes: 5be5a2df40 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a missing bridge port check caught by smatch.
setlink/dellink of attributes like vlans can come for a bridge device
and there is no need to offload those today. So, this patch adds a bridge
port check. (In these cases however, the BRIDGE_SELF flags will always be set
and we may not hit a problem with the current code).
smatch complaint:
The patch 68e331c785: "bridge: offload bridge port attributes to
switch asic if feature flag set" from Jan 29, 2015, leads to the
following Smatch complaint:
net/bridge/br_netlink.c:552 br_setlink()
error: we previously assumed 'p' could be null (see line 518)
net/bridge/br_netlink.c
517
518 if (p && protinfo) {
^
Check for NULL.
Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
When 'learned_sync' flag is turned on, the offloaded switch
port syncs learned MAC addresses to bridge's FDB via switchdev notifier
(NETDEV_SWITCH_FDB_ADD). Currently, FDB entries learnt via this mechanism are
wrongly being deleted by bridge aging logic. This patch ensures that FDB
entries synced from offloaded switch ports are not deleted by bridging logic.
Such entries can only be deleted via switchdev notifier
(NETDEV_SWITCH_FDB_DEL).
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to set/del bridge port attributes in hardware from
the bridge driver.
With this, when the user sends a bridge setlink message with no flags or
master flags set,
- the bridge driver ndo_bridge_setlink handler sets settings in the kernel
- calls the swicthdev api to propagate the attrs to the switchdev
hardware
You can still use the self flag to go to the switch hw or switch port
driver directly.
With this, it also makes sure a notification goes out only after the
attributes are set both in the kernel and hw.
The patch calls switchdev api only if BRIDGE_FLAGS_SELF is not set.
This is because the offload cases with BRIDGE_FLAGS_SELF are handled in
the caller (in rtnetlink.c).
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge flags are needed inside ndo_bridge_setlink/dellink handlers to
avoid another call to parse IFLA_AF_SPEC inside these handlers
This is used later in this series
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are all either written once or extremly rarely (e.g. from init
code), so we can move them to the .data..read_mostly section.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static checkers complain that we should maybe set "ret" before we do the
"goto out;". They interpret the NULL return from br_port_get_rtnl() as
a failure and forgetting to set the error code is a common bug in this
situation.
The code is confusing but it's actually correct. We are returning zero
deliberately. Let's re-write it a bit to be more clear.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA-enabled master network devices with a switch tagging protocol should
strip the protocol specific format before handing the frame over to
higher layer.
When adding such a DSA master network device as a bridge member, we go
through the following code path when receiving a frame:
__netif_receive_skb_core
-> first ptype check against ptype_all is not returning any
handler for this skb
-> check and invoke rx_handler:
-> deliver frame to the bridge layer: br_handle_frame
DSA registers a ptype handler with the fake ETH_XDSA ethertype, which is
called *after* the bridge-layer rx_handler has run. br_handle_frame()
tries to parse the frame it received from the DSA master network device,
and will not be able to match any of its conditions and jumps straight
at the end of the end of br_handle_frame() and returns
RX_HANDLER_CONSUMED there.
Since we returned RX_HANDLER_CONSUMED, __netif_receive_skb_core() stops
RX processing for this frame and returns NET_RX_SUCCESS, so we never get
a chance to call our switch tag packet processing logic and deliver
frames to the DSA slave network devices, and so we do not get any
functional bridge members at all.
Instead of cluttering the bridge receive path with DSA-specific checks,
and rely on assumptions about how __netif_receive_skb_core() is
processing frames, we simply deny adding the DSA master network device
(conduit interface) as a bridge member, leaving only the slave DSA
network devices to be bridge members, since those will work correctly in
all circumstances.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user can crash the kernel if it uses any of the existing NAT
expressions from the wrong hook, so add some code to validate this
when loading the rule.
This patch introduces nft_chain_validate_hooks() which is based on
an existing function in the bridge version of the reject expression.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch benefits from newly introduced switchdev notifier and uses it
to propagate fdb learn events from rocker driver to bridge. That avoids
direct function calls and possible use by other listeners (ovs).
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
problems with bridge getlink/setlink notifications today:
- bridge setlink generates two notifications to userspace
- one from the bridge driver
- one from rtnetlink.c (rtnl_bridge_notify)
- dellink generates one notification from rtnetlink.c. Which
means bridge setlink and dellink notifications are not
consistent
- Looking at the code it appears,
If both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF were set,
the size calculation in rtnl_bridge_notify can be wrong.
Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
in a setlink request to rocker dev, rtnl_bridge_notify will
allocate skb for one set of bridge attributes, but,
both the bridge driver and rocker dev will try to add
attributes resulting in twice the number of attributes
being added to the skb. (rocker dev calls ndo_dflt_bridge_getlink)
There are multiple options:
1) Generate one notification including all attributes from master and self:
But, I don't think it will work, because both master and self may use
the same attributes/policy. Cannot pack the same set of attributes in a
single notification from both master and slave (duplicate attributes).
2) Generate one notification from master and the other notification from
self (This seems to be ideal):
For master: the master driver will send notification (bridge in this
example)
For self: the self driver will send notification (rocker in the above
example. It can use helpers from rtnetlink.c to do so. Like the
ndo_dflt_bridge_getlink api).
This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').
v1->v2 :
- rtnl_bridge_notify is now called only for self,
so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
earlier. So, I have changed the notification from br_dellink to
go as RTM_NEWLINK
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a cleanup, because in the current code MDBA_SET_ENTRY_MAX ==
MDBA_SET_ENTRY.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:
net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'
This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bridge/br_netlink.c: In function ‘br_fill_ifinfo’:
net/bridge/br_netlink.c:146:32: warning: ‘vid_range_flags’ may be used uninitialized in this function [-Wmaybe-uninitialized]
err = br_fill_ifvlaninfo_range(skb, vid_range_start,
^
net/bridge/br_netlink.c:108:6: note: ‘vid_range_flags’ was declared here
u16 vid_range_flags;
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds new function to pack vlans into ranges
whereever applicable using the flags BRIDGE_VLAN_INFO_RANGE_BEGIN
and BRIDGE VLAN_INFO_RANGE_END
Old vlan packing code is moved to a new function and continues to be
called when filter_mask is RTEXT_FILTER_BRVLAN.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes bridge IFLA_AF_SPEC netlink attribute parser to
look for more than one IFLA_BRIDGE_VLAN_INFO attribute. This allows
userspace to pack more than one vlan in the setlink msg.
The dumps were already sending more than one vlan info in the getlink msg.
This patch also adds bridge_vlan_info flags BRIDGE_VLAN_INFO_RANGE_BEGIN and
BRIDGE_VLAN_INFO_RANGE_END to indicate start and end of vlan range
This patch also deletes unused ifla_br_policy.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently a bridge device turns off TSO feature if no bridge ports
support it. We can always enable it, since packets can be segmented on
ports by software as well as on the bridge device.
This will reduce the number of packets processed in the bridge.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add checking whether the call to ndo_dflt_fdb_dump is needed.
It is not expected to call ndo_dflt_fdb_dump unconditionally
by some drivers (i.e. qlcnic or macvlan) that defines
own ndo_fdb_dump. Other drivers define own ndo_fdb_dump
and don't want ndo_dflt_fdb_dump to be called at all.
At the same time it is desirable to call the default dump
function on a bridge device.
Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
Add extra checking in br_fdb_dump to avoid duplicate entries
as now filter_dev can be NULL.
Following tests for filtering have been performed before
the change and after the patch was applied to make sure
they are the same and it doesn't break the filtering algorithm.
[root@localhost ~]# cd /root/iproute2-3.18.0/bridge
[root@localhost bridge]# modprobe dummy
[root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
[root@localhost bridge]# brctl addbr br0
[root@localhost bridge]# brctl addif br0 dummy0
[root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
[root@localhost bridge]# # show all
[root@localhost bridge]# ./bridge fdb show
33:33:00:00:00:01 dev p2p1 self permanent
01:00:5e:00:00:01 dev p2p1 self permanent
33:33:ff:ac:ce:32 dev p2p1 self permanent
33:33:00:00:02:02 dev p2p1 self permanent
01:00:5e:00:00:fb dev p2p1 self permanent
33:33:00:00:00:01 dev p7p1 self permanent
01:00:5e:00:00:01 dev p7p1 self permanent
33:33:ff:79:50:53 dev p7p1 self permanent
33:33:00:00:02:02 dev p7p1 self permanent
01:00:5e:00:00:fb dev p7p1 self permanent
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by bridge
[root@localhost bridge]# ./bridge fdb show br br0
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by port
[root@localhost bridge]# ./bridge fdb show brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]# # filter by port + bridge
[root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]#
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains netfilter updates for net-next. Basically,
enhancements for xt_recent, skip zeroing of timer in conntrack, fix
linking problem with recent redirect support for nf_tables, ipset
updates and a couple of cleanups. More specifically, they are:
1) Rise maximum number per IP address to be remembered in xt_recent
while retaining backward compatibility, from Florian Westphal.
2) Skip zeroing timer area in nf_conn objects, also from Florian.
3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using
using meta l4proto and transport layer header, from Alvaro Neira.
4) Fix linking problems in the new redirect support when CONFIG_IPV6=n
and IP6_NF_IPTABLES=n.
And ipset updates from Jozsef Kadlecsik:
5) Support updating element extensions when the set is full (fixes
netfilter bugzilla id 880).
6) Fix set match with 32-bits userspace / 64-bits kernel.
7) Indicate explicitly when /0 networks are supported in ipset.
8) Simplify cidr handling for hash:*net* types.
9) Allocate the proper size of memory when /0 networks are supported.
10) Explicitly add padding elements to hash:net,net and hash:net,port,
because the elements must be u32 sized for the used hash function.
Jozsef is also cooking ipset RCU conversion which should land soon if
they reach the merge window in time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
When the swdev device learns a new mac/vlan on a port, it sends some async
notification to the driver and the driver installs an FDB in the device.
To give a holistic system view, the learned mac/vlan should be reflected
in the bridge's FBD table, so the user, using normal iproute2 cmds, can view
what is currently learned by the device. This API on the bridge driver gives
a way for the swdev driver to install an FBD entry in the bridge FBD table.
(And remove one).
This is equivalent to the device running these cmds:
bridge fdb [add|del] <mac> dev <dev> vid <vlan id> master
This patch needs some extra eyeballs for review, in paricular around the
locking and contexts.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To notify switch driver of change in STP state of bridge port, add new
.ndo op and provide switchdev wrapper func to call ndo op. Use it in bridge
code then.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current name might seem that this actually offloads the fdb entry to
hw. So rename it to clearly present that this for hardware address
addition/removal.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the missing bits to allow to match per meta l4proto from
the bridge. Example:
nft add rule bridge filter input ether type {ip, ip6} meta l4proto udp counter
Signed-off-by: Alvaro Neira Ayuso <alvaroneay@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch exports the functions nft_reject_iphdr_validate and
nft_reject_ip6hdr_validate to use it in follow up patches.
These functions check if the IPv4/IPv6 header is correct.
Signed-off-by: Alvaro Neira Ayuso <alvaroneay@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: c2d3babf ("bridge: implement multicast fast leave")
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter/ipvs updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, this includes the NAT redirection support for nf_tables, the
cgroup support for nft meta and conntrack zone support for the connlimit
match. Coming after those, a bunch of sparse warning fixes, missing
netns bits and cleanups. More specifically, they are:
1) Prepare IPv4 and IPv6 NAT redirect code to use it from nf_tables,
patches from Arturo Borrero.
2) Introduce the nf_tables redir expression, from Arturo Borrero.
3) Remove an unnecessary assignment in ip_vs_xmit/__ip_vs_get_out_rt().
Patch from Alex Gartrell.
4) Add nft_log_dereference() macro to the nf_log infrastructure, patch
from Marcelo Leitner.
5) Add some extra validation when registering logger families, also
from Marcelo.
6) Some spelling cleanups from stephen hemminger.
7) Fix sparse warning in nf_logger_find_get().
8) Add cgroup support to nf_tables meta, patch from Ana Rey.
9) A Kconfig fix for the new redir expression and fix sparse warnings in
the new redir expression.
10) Fix several sparse warnings in the netfilter tree, from
Florian Westphal.
11) Reduce verbosity when OOM in nfnetlink_log. User can basically do
nothing when this situation occurs.
12) Add conntrack zone support to xt_connlimit, again from Florian.
13) Add netnamespace support to the h323 conntrack helper, contributed
by Vasily Averin.
14) Remove unnecessary nul-pointer checks before free_percpu() and
module_put(), from Markus Elfring.
15) Use pr_fmt in nfnetlink_log, again patch from Marcelo Leitner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ieee802154/fakehard.c
A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.
Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.
Signed-off-by: David S. Miller <davem@davemloft.net>
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ebtables on the OUTPUT chain (NF_BR_LOCAL_OUT) would not work as expected
for both locally generated IGMP and MLD queries. The IP header specific
filter options are off by 14 Bytes for netfilter (actual output on
interfaces is fine).
NF_HOOK() expects the skb->data to point to the IP header, not the
ethernet one (while dev_queue_xmit() does not). Luckily there is an
br_dev_queue_push_xmit() helper function already - let's just use that.
Introduced by eb1d164143
("bridge: Add core IGMP snooping support")
Ebtables example:
$ ebtables -I OUTPUT -p IPv6 -o eth1 --logical-out br0 \
--log --log-level 6 --log-ip6 --log-prefix="~EBT: " -j DROP
before (broken):
~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \
MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \
SRC=64a4:39c2:86dd:6000:0000:0020:0001:fe80 IPv6 \
DST=0000:0000:0000:0004:64ff:fea4:39c2:ff02, \
IPv6 priority=0x3, Next Header=2
after (working):
~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \
MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \
SRC=fe80:0000:0000:0000:0004:64ff:fea4:39c2 IPv6 \
DST=ff02:0000:0000:0000:0000:0000:0000:0001, \
IPv6 priority=0x0, Next Header=0
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net/bridge/br_netfilter.c:870:6: symbol 'br_netfilter_enable' was not declared. Should it be static?
no; add include
net/ipv4/netfilter/nft_reject_ipv4.c:22:6: symbol 'nft_reject_ipv4_eval' was not declared. Should it be static?
yes
net/ipv6/netfilter/nf_reject_ipv6.c:16:6: symbol 'nf_send_reset6' was not declared. Should it be static?
no; add include
net/ipv6/netfilter/nft_reject_ipv6.c:22:6: symbol 'nft_reject_ipv6_eval' was not declared. Should it be static?
yes
net/netfilter/core.c:33:32: symbol 'nf_ipv6_ops' was not declared. Should it be static?
no; add include
net/netfilter/xt_DSCP.c:40:57: cast truncates bits from constant value (ffffff03 becomes 3)
net/netfilter/xt_DSCP.c:57:59: cast truncates bits from constant value (ffffff03 becomes 3)
add __force, 3 is what we want.
net/ipv4/netfilter/nf_log_arp.c:77:6: symbol 'nf_log_arp_packet' was not declared. Should it be static?
yes
net/ipv4/netfilter/nf_reject_ipv4.c:17:6: symbol 'nf_send_reset' was not declared. Should it be static?
no; add include
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix:
net/bridge/netfilter/nft_reject_bridge.c:
In function 'nft_reject_br_send_v6_unreach':
net/bridge/netfilter/nft_reject_bridge.c:240:3:
error: implicit declaration of function 'csum_ipv6_magic'
csum_ipv6_magic(&nip6h->saddr, &nip6h->daddr,
^
make[3]: *** [net/bridge/netfilter/nft_reject_bridge.o] Error 1
Seen with powerpc:allmodconfig.
Fixes: 523b929d54 ("netfilter: nft_reject_bridge: don't use IP stack to reject traffic")
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restrict the reject expression to the prerouting and input bridge
hooks. If we allow this to be used from forward or any other later
bridge hook, if the frame is flooded to several ports, we'll end up
sending several reject packets, one per cloned packet.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the packet is received via the bridge stack, this cannot reject
packets from the IP stack.
This adds functions to build the reject packet and send it from the
bridge stack. Comments and assumptions on this patch:
1) Validate the IPv4 and IPv6 headers before further processing,
given that the packet comes from the bridge stack, we cannot assume
they are clean. Truncated packets are dropped, we follow similar
approach in the existing iptables match/target extensions that need
to inspect layer 4 headers that is not available. This also includes
packets that are directed to multicast and broadcast ethernet
addresses.
2) br_deliver() is exported to inject the reject packet via
bridge localout -> postrouting. So the approach is similar to what
we already do in the iptables reject target. The reject packet is
sent to the bridge port from which we have received the original
packet.
3) The reject packet is forged based on the original packet. The TTL
is set based on sysctl_ip_default_ttl for IPv4 and per-net
ipv6.devconf_all hoplimit for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This feature is defined in IEEE Std 802.11-2012, 10.23.13. It allows
the AP devices to keep track of the hardware-address-to-IP-address
mapping of the mobile devices within the WLAN network.
The AP will learn this mapping via observing DHCP, ARP, and NS/NA
frames. When a request for such information is made (i.e. ARP request,
Neighbor Solicitation), the AP will respond on behalf of the
associated mobile device. In the process of doing so, the AP will drop
the multicast request frame that was intended to go out to the wireless
medium.
It was recommended at the LKS workshop to do this implementation in
the bridge layer. vxlan.c is already doing something very similar.
The DHCP snooping code will be added to the userspace application
(hostapd) per the recommendation.
This RFC commit is only for IPv4. A similar approach in the bridge
layer will be taken for IPv6 as well.
Signed-off-by: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 462fb2af97
bridge : Sanitize skb before it enters the IP stack
broke when IP options are actually used because it mangles the
skb as if it entered the IP stack which is wrong because the
bridge is supposed to operate below the IP stack.
Since nobody has actually requested for parsing of IP options
this patch fixes it by simply reverting to the previous approach
of ignoring all IP options, i.e., zeroing the IPCB.
If and when somebody who uses IP options and actually needs them
to be parsed by the bridge complains then we can revisit this.
Reported-by: David Newall <davidn@davidnewall.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As we may defragment the packet in IPv4 PRE_ROUTING and refragment
it after POST_ROUTING we should save the value of frag_max_size.
This is still very wrong as the bridge is supposed to leave the
packets intact, meaning that the right thing to do is to use the
original frag_list for fragmentation.
Unfortunately we don't currently guarantee that the frag_list is
left untouched throughout netfilter so until this changes this is
the best we can do.
There is also a spot in FORWARD where it appears that we can
forward a packet without going through fragmentation, mark it
so that we can fix it later.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains another batch with Netfilter/IPVS updates
for net-next, they are:
1) Add abstracted ICMP codes to the nf_tables reject expression. We
introduce four reasons to reject using ICMP that overlap in IPv4
and IPv6 from the semantic point of view. This should simplify the
maintainance of dual stack rule-sets through the inet table.
2) Move nf_send_reset() functions from header files to per-family
nf_reject modules, suggested by Patrick McHardy.
3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
code now that br_netfilter can be modularized. Convert remaining spots
in the network stack code.
4) Use rcu_barrier() in the nf_tables module removal path to ensure that
we don't leave object that are still pending to be released via
call_rcu (that may likely result in a crash).
5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
idea was to probe the word size based on the xtables match/target info
size, but this assumption is wrong when you have to dump the information
back to userspace.
6) Allow to filter from prerouting and postrouting in the nf_tables bridge.
In order to emulate the ebtables NAT chains (which are actually simple
filter chains with no special semantics), we have support filtering from
this hooks too.
7) Add explicit module dependency between xt_physdev and br_netfilter.
This provides a way to detect if the user needs br_netfilter from
the configuration path. This should reduce the breakage of the
br_netfilter modularization.
8) Cleanup coding style in ip_vs.h, from Simon Horman.
9) Fix crash in the recently added nf_tables masq expression. We have
to register/unregister the notifiers to clean up the conntrack table
entries from the module init/exit path, not from the rule addition /
deletion path. From Arturo Borrero.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when vlan filtering is turned on on the bridge, the bridge
will drop all traffic untill the user configures the filter. This
isn't very nice for ports that don't care about vlans and just
want untagged traffic.
A concept of a default_pvid was recently introduced. This patch
adds filtering support for default_pvid. Now, ports that don't
care about vlans and don't define there own filter will belong
to the VLAN of the default_pvid and continue to receive untagged
traffic.
This filtering can be disabled by setting default_pvid to 0.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, if the pvid is not set, we return an illegal vlan value
even though the pvid value is set to 0. Since pvid of 0 is currently
invalid, just return 0 instead. This makes the current and future
checks simpler.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows the user to set and retrieve default_pvid
value. A new value can only be stored when vlan filtering
is disabled.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
You can use physdev to match the physical interface enslaved to the
bridge device. This information is stored in skb->nf_bridge and it is
set up by br_netfilter. So, this is only available when iptables is
used from the bridge netfilter path.
Since 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
the br_netfilter code is modular. To reduce the impact of this change,
we can autoload the br_netfilter if the physdev match is used since
we assume that the users need br_netfilter in place.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows us to emulate the NAT table in ebtables, which is actually
a plain filter chain that hooks at prerouting, output and postrouting.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces the NFT_REJECT_ICMPX_UNREACH type which provides
an abstraction to the ICMP and ICMPv6 codes that you can use from the
inet and bridge tables, they are:
* NFT_REJECT_ICMPX_NO_ROUTE: no route to host - network unreachable
* NFT_REJECT_ICMPX_PORT_UNREACH: port unreachable
* NFT_REJECT_ICMPX_HOST_UNREACH: host unreachable
* NFT_REJECT_ICMPX_ADMIN_PROHIBITED: administratevely prohibited
You can still use the specific codes when restricting the rule to match
the corresponding layer 3 protocol.
I decided to not overload the existing NFT_REJECT_ICMP_UNREACH to have
different semantics depending on the table family and to allow the user
to specify ICMP family specific codes if they restrict it to the
corresponding family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In preparation for being able to propagate port states to e.g: notifiers
or other kernel parts, do not manipulate the port state directly, but
instead use a helper function which will allow us to do a bit more than
just setting the state.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric reports build failure with
CONFIG_BRIDGE_NETFILTER=n
We insist to build br_nf_core.o unconditionally, but we must only do so
if br_netfilter was enabled, else it fails to build due to
functions being defined to empty stubs (and some structure members
being defined out).
Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.
Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core)
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
pull request: netfilter/ipvs updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:
1) Four patches to make the new nf_tables masquerading support
independent of the x_tables infrastructure. This also resolves a
compilation breakage if the masquerade target is disabled but the
nf_tables masq expression is enabled.
2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
skbinfo extension that allows you to store packet metainformation in the
elements. This can be used to fetch and restore this to the packets through
the iptables SET target, patches from Anton Danilov.
3) Add the hash:mac set type to ipset, from Jozsef Kadlecsick.
4) Add simple weighted fail-over scheduler via Simon Horman. This provides
a fail-over IPVS scheduler (unlike existing load balancing schedulers).
Connections are directed to the appropriate server based solely on
highest weight value and server availability, patch from Kenny Mathis.
5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
Simon Horman informs that the motivation for this is to allow more
flexibility in the choice of IP version offered by both virtual-servers
and real-servers as they no longer need to match: An IPv4 connection
from an end-user may be forwarded to a real-server using IPv6 and
vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
and Julian Anastasov.
6) Add global generation ID to the nf_tables ruleset. When dumping from
several different object lists, we need a way to identify that an update
has ocurred so userspace knows that it needs to refresh its lists. This
also includes a new command to obtain the 32-bits generation ID. The
less significant 16-bits of this ID is also exposed through res_id field
in the nfnetlink header to quickly detect the interference and retry when
there is no risk of ID wraparound.
7) Move br_netfilter out of the bridge core. The br_netfilter code is
built in the bridge core by default. This causes problems of different
kind to people that don't want this: Jesper reported performance drop due
to the inconditional hook registration and I remember to have read complains
on netdev from people regarding the unexpected behaviour of our bridging
stack when br_netfilter is enabled (fragmentation handling, layer 3 and
upper inspection). People that still need this should easily undo the
damage by modprobing the new br_netfilter module.
8) Dump the set policy nf_tables that allows set parameterization. So
userspace can keep user-defined preferences when saving the ruleset.
From Arturo Borrero.
9) Use __seq_open_private() helper function to reduce boiler plate code
in x_tables, From Rob Jones.
10) Safer default behaviour in case that you forget to load the protocol
tracker. Daniel Borkmann and Florian Westphal detected that if your
ruleset is stateful, you allow traffic to at least one single SCTP port
and the SCTP protocol tracker is not loaded, then any SCTP traffic may
be pass through unfiltered. After this patch, the connection tracking
classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
been compiled with support for these modules.
====================
Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
netfilter skbuff members around, and the netfilter tree adjusted the
ifdef guards for the bridging info pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper reported that br_netfilter always registers the hooks since
this is part of the bridge core. This harms performance for people that
don't need this.
This patch modularizes br_netfilter so it can be rmmod'ed, thus,
the hooks can be unregistered. I think the bridge netfilter should have
been a separated module since the beginning, Patrick agreed on that.
Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.
However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.
On top of that, the plan is that nftables will not rely on this software
layer, but integrate the connection tracking into the bridge layer to
enable stateful filtering and NAT, which is was bridge netfilter users
seem to require.
This patch still keeps the fake_dst_ops in the bridge core, since this
is required by when the bridge port is initialized. So we can safely
modprobe/rmmod br_netfilter anytime.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
Move nf_bridge_copy_header() as static inline in netfilter_bridge.h
header file. This patch prepares the modularization of the br_netfilter
code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
As Toshiaki Makita pointed out, the BRIDGE_INPUT_SKB_CB will
not be initialized in br_should_learn() as that function
is called only from br_handle_local_finish(). That is
an input handler for link-local ethernet traffic so it perfectly
correct to check br->vlan_enabled here.
Reported-by: Toshiaki Makita<toshiaki.makita1@gmail.com>
Fixes: 20adfa1 bridge: Check if vlan filtering is enabled only once.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, it is possible to modify the vlan filter
configuration to add pvid or untagged support.
For example:
bridge vlan add vid 10 dev eth0
bridge vlan add vid 10 dev eth0 untagged pvid
The second statement will modify vlan 10 to
include untagged and pvid configuration.
However, it is currently impossible to go backwards
bridge vlan add vid 10 dev eth0 untagged pvid
bridge vlan add vid 10 dev eth0
Here nothing happens. This patch correct this so
that any modifiers not supplied are removed from
the configuration.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge code checks if vlan filtering is enabled on both
ingress and egress. When the state flip happens, it
is possible for the bridge to currently be forwarding packets
and forwarding behavior becomes non-deterministic. Bridge
may drop packets on some interfaces, but not others.
This patch solves this by caching the filtered state of the
packet into skb_cb on ingress. The skb_cb is guaranteed to
not be over-written between the time packet entres bridge
forwarding path and the time it leaves it. On egress, we
can then check the cached state to see if we need to
apply filtering information.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
nf-next pull request
The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:
1) use __u8 instead of u_int8_t in arptables header, from
Mike Frysinger.
2) Add support to match by skb->pkttype to the meta expression, from
Ana Rey.
3) Add support to match by cpu to the meta expression, also from
Ana Rey.
4) A smatch warning about IPSET_ATTR_MARKMASK validation, patch from
Vytas Dauksa.
5) Fix netnet and netportnet hash types the range support for IPv4,
from Sergey Popovich.
6) Fix missing-field-initializer warnings resolved, from Mark Rustad.
7) Dan Carperter reported possible integer overflows in ipset, from
Jozsef Kadlecsick.
8) Filter out accounting objects in nfacct by type, so you can
selectively reset quotas, from Alexey Perevalov.
9) Move specific NAT IPv4 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.
11) Move specific NAT IPv6 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.
13) Refactor code to add nft_delrule(), which can be reused in the
enhancement of the NFT_MSG_DELTABLE to remove a table and its
content, from Arturo Borrero.
14) Add a helper function to unregister chain hooks, from
Arturo Borrero.
15) A cleanup to rename to nft_delrule_by_chain for consistency with
the new nft_*() functions, also from Arturo.
16) Add support to match devgroup to the meta expression, from Ana Rey.
17) Reduce stack usage for IPVS socket option, from Julian Anastasov.
18) Remove unnecessary textsearch state initialization in xt_string,
from Bojan Prtvar.
19) Add several helper functions to nf_tables, more work to prepare
the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.
20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
Arturo Borrero.
21) Support NAT flags in the nat expression to indicate the flavour,
eg. random fully, from Arturo.
22) Add missing audit code to ebtables when replacing tables, from
Nicolas Dichtel.
23) Generalize the IPv4 masquerading code to allow its re-use from
nf_tables, from Arturo.
24) Generalize the IPv6 masquerading code, also from Arturo.
25) Add the new masq expression to support IPv4/IPv6 masquerading
from nf_tables, also from Arturo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes an unncessary check in the br_afspec() method of
br_netlink.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to set bridge master info via IFLA_INFO_DATA attr
This initial part implements forward_delay, hello_time, max_age options.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to get bridge master info in IFLA_INFO_DATA attr
This initial part implements forward_delay, hello_time, max_age options.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to set port info via IFLA_INFO_SLAVE_DATA attr
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow rtnetlink users to get port info in IFLA_INFO_SLAVE_DATA attr
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The thing is that netdev_master_upper_dev_link calls
call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev). That generates rtnl
link message and during that, rtnl_link_ops->fill_slave_info is called.
But with current ordering, rx_handler and IFF_BRIDGE_PORT are not set
yet so there would have to be check for that in fill_slave_info callback.
Resolve this by reordering to similar what bonding and team does to
avoid the check.
Also add removal of IFF_BRIDGE_PORT flag into error path.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is already done for x_tables (family AF_INET and AF_INET6), let's
do it for AF_BRIDGE also.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the functionality to untag traffic on input resides
as part of the vlan module and is build only when VLAN support
is enabled in the kernel. When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets. This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.
There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver. These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb. When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.
The following is a packet trace taken on the receiver where a
sender is a VM with a VLAN configued. The host VM is running on
doest not have VLAN support and the outging interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0
This connection finally times out.
I've only access to the TG3 hardware in this configuration thus have
only tested this with TG3 driver. There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffere from a similar issue.
The patch attempt to fix this another way. It moves the vlan header
stipping code out of the vlan module and always builds it into the
kernel network core. This way, even if vlan is not supported on
a virtualizatoin host, the virtual machines running on top of such
host will still work with VLANs enabled.
CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet reports that getsockopt() or setsockopt() sometimes
returns -EINTR instead of -ENOPROTOOPT, causing headaches to
application developers.
This patch replaces all the mutex_lock_interruptible() by mutex_lock()
in the netfilter tree, as there is no reason we should sleep for a
long time there.
Reported-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.
The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c
Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.
In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
unnecessary operation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6cbdceeb1c
bridge: Dump vlan information from a bridge port
introduced a comment in an attempt to explain the
code logic. The comment is unfinished so it confuses more
than it explains, remove it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
An FDB entry with vlan_id 0 doesn't mean it is used in vlan 0, but used when
vlan_filtering is disabled.
There is inconsistency around NDA_VLAN whose payload is 0 - even if we add
an entry by RTM_NEWNEIGH without any NDA_VLAN, and even though adding an
entry with NDA_VLAN 0 is prohibited, we get an entry with NDA_VLAN 0 by
RTM_GETNEIGH.
Dumping an FDB entry with vlan_id 0 shouldn't include NDA_VLAN.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ulog targets were recently killed. A few references to the Kconfig
macros CONFIG_IP_NF_TARGET_ULOG and CONFIG_BRIDGE_EBT_ULOG were left
untouched. Kill these too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
So you can reject IPv4 and IPv6 packets from bridge tables. If the ether
proto is now known, default on dropping the packet instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains updates for your net-next tree,
they are:
1) Use kvfree() helper function from x_tables, from Eric Dumazet.
2) Remove extra timer from the conntrack ecache extension, use a
workqueue instead to redeliver lost events to userspace instead,
from Florian Westphal.
3) Removal of the ulog targets for ebtables and iptables. The nflog
infrastructure superseded this almost 9 years ago, time to get rid
of this code.
4) Replace the list of loggers by an array now that we can only have
two possible non-overlapping logger flavours, ie. kernel ring buffer
and netlink logging.
5) Move Eric Dumazet's log buffer code to nf_log to reuse it from
all of the supported per-family loggers.
6) Consolidate nf_log_packet() as an unified interface for packet logging.
After this patch, if the struct nf_loginfo is available, it explicitly
selects the logger that is used.
7) Move ip and ip6 logging code from xt_LOG to the corresponding
per-family loggers. Thus, x_tables and nf_tables share the same code
for packet logging.
8) Add generic ARP packet logger, which is used by nf_tables. The
format aims to be consistent with the output of xt_LOG.
9) Add generic bridge packet logger. Again, this is used by nf_tables
and it routes the packets to the real family loggers. As a result,
we get consistent logging format for the bridge family. The ebt_log
logging code has been intentionally left in place not to break
backward compatibility since the logging output differs from xt_LOG.
10) Update nft_log to explicitly request the required family logger when
needed.
11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families.
Allowing selection between netlink and kernel buffer ring logging.
12) Several fixes coming after the netfilter core logging changes spotted
by robots.
13) Use IS_ENABLED() macros whenever possible in the netfilter tree,
from Duan Jiong.
14) Removal of a couple of unnecessary branch before kfree, from Fabian
Frederick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Actually better than brctl showmacs because we can filter by bridge
port in the kernel.
The current bridge netlink interface doesnt scale when you have many
bridges each with large fdbs or even bridges with many bridge ports
And now for the science non-fiction novel you have all been
waiting for..
//lets see what bridge ports we have
root@moja-1:/configs/may30-iprt/bridge# ./bridge link show
8: eth1 state DOWN : <BROADCAST,MULTICAST> mtu 1500 master br0 state
disabled priority 32 cost 19
17: sw1-p1 state DOWN : <BROADCAST,NOARP> mtu 1500 master br0 state
disabled priority 32 cost 100
// show all..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//filter by bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
// bridge sw1 has no ports attached..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br sw1
//filter by port
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show brport eth1
02:00:00:12:01:02 vlan 0 master br0 permanent
00:17:42:8a:b4:05 vlan 0 master br0 permanent
00:17:42:8a:b4:07 self permanent
33:33:00:00:00:01 self permanent
// filter by port + bridge
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0 brport
sw1-p1
da:ac:46:27:d9:53 vlan 0 master br0 permanent
33:33:00:00:00:01 self permanent
// for shits and giggles (as they say in New Brunswick), lets
// change the mac that br0 uses
// Note: a magical fdb entry with no brport is added ...
root@moja-1:/configs/may30-iprt/bridge# ip link set dev br0 address
02:00:00:12:01:04
// lets see if we can see the unicorn ..
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
33:33:00:00:00:01 dev bond0 self permanent
33:33:00:00:00:01 dev dummy0 self permanent
33:33:00:00:00:01 dev ifb0 self permanent
33:33:00:00:00:01 dev ifb1 self permanent
33:33:00:00:00:01 dev eth0 self permanent
01:00:5e:00:00:01 dev eth0 self permanent
33:33:ff:22:01:01 dev eth0 self permanent
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
33:33:00:00:00:01 dev gretap0 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
//can we see it if we filter by bridge?
root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
00:17:42:8a:b4:07 dev eth1 self permanent
33:33:00:00:00:01 dev eth1 self permanent
02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
33:33:00:00:00:01 dev sw1-p1 self permanent
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dumping a bridge fdb dumps every fdb entry
held. With this change we are going to filter
on selected bridge port.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch other modules are able to ask the bridge whether an
IGMP or MLD querier exists on the according, bridged link layer.
Multicast snooping can only be performed if a valid, selected querier
exists on a link.
Just like the bridge only enables its multicast snooping if a querier
exists, e.g. batman-adv too can only activate its multicast
snooping in bridged scenarios if a querier is present.
For instance this export avoids having to reimplement IGMP/MLD
querier message snooping and parsing in e.g. batman-adv, when
multicast optimizations for bridged scenarios are added in the
future.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the generic plain text packet loggger for bridged packets.
It routes the logging message to the real protocol packet logger.
I decided not to refactor the ebt_log code for two reasons:
1) The ebt_log output is not consistent with the IPv4 and IPv6
Netfilter packet loggers. The output is different for no good
reason and it adds redundant code to handle packet logging.
2) To avoid breaking backward compatibility for applications
outthere that are parsing the specific ebt_log output, the ebt_log
output has been left as is. So only nftables will use the new
consistent logging format for logged bridged packets.
More decisions coming in this patch:
1) This also removes ebt_log as default logger for bridged packets.
Thus, nf_log_packet() routes packet to this new packet logger
instead. This doesn't break backward compatibility since
nf_log_packet() is not used to log packets in plain text format
from anywhere in the ebtables/netfilter bridge code.
2) The new bridge packet logger also performs a lazy request to
register the real IPv4, ARP and IPv6 netfilter packet loggers.
If the real protocol logger is no available (not compiled or the
module is not available in the system, not packet logging happens.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that legacy ulog targets are not available anymore in the tree, we
can have up to two possible loggers:
1) The plain text logging via kernel logging ring.
2) The nfnetlink_log infrastructure which delivers log messages
to userspace.
This patch replaces the list of loggers by an array of two pointers
per family for each possible logger and it also introduces a new field
to the nf_logger structure which indicates the position in the logger
array (based on the logger type).
This prepares a follow up patch that consolidates the nf_log_packet()
interface by allowing to specify the logger as parameter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This has been marked as deprecated for quite some time and the NFLOG
target replacement has been also available since 2006.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
use list_for_each_entry_continue_reverse to rollback in fdb_add_hw
when add address failed
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some fields in "struct net_bridge" aren't available when compiling the
kernel without IPv6 support. Therefore adding a check/macro to skip the
complaining code sections in that case.
Introduced by 2cd4143192
("bridge: memorize and export selected IGMP/MLD querier port")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
"New smatch warnings:
net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error:
we previously assumed 'group' could be null (see line 1349)"
In the rare (sort of broken) case of a query having a Maximum
Response Delay of zero, we could create a potential null pointer
dereference.
Fixing this by skipping the multicast specific MLD Query parsing again
if no multicast group address is available.
Introduced by dc4eb53a99
("bridge: adhere to querier election mechanism specified by RFCs")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables us to change the vlan protocol for vlan filtering.
We come to be able to filter frames on the basis of 802.1ad vlan tags
through a bridge.
This also changes br->group_addr if it has not been set by user.
This is needed for an 802.1ad bridge.
(See IEEE 802.1Q-2011 8.13.5.)
Furthermore, this sets br->group_fwd_mask_required so that an 802.1ad
bridge can forward the Nearest Customer Bridge group addresses except
for br->group_addr, which should be passed to higher layer.
To change the vlan protocol, write a protocol in sysfs:
# echo 0x88a8 > /sys/class/net/br0/bridge/vlan_protocol
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a bridge is an 802.1ad bridge, it must forward another bridge group
addresses (the Nearest Customer Bridge group addresses).
(For details, see IEEE 802.1Q-2011 8.6.3.)
As user might not want group_fwd_mask to be modified by enabling 802.1ad,
introduce a new mask, group_fwd_mask_required, which indicates addresses
the bridge wants to forward. This will be set by enabling 802.1ad.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables a bridge to have vlan protocol informantion and allows vlan
tag manipulation (retrieve, insert and remove tags) according to the vlan
protocol.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge device doesn't need to embed S-tag into skb->data.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding bridge support to the batman-adv multicast optimization requires
batman-adv knowing about the existence of bridged-in IGMP/MLD queriers
to be able to reliably serve any multicast listener behind this same
bridge.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this new, exported function br_multicast_list_adjacent(net_dev) a
list of IPv4/6 addresses is returned. This list contains all multicast
addresses sensed by the bridge multicast snooping feature on all bridge
ports of the bridge interface of net_dev, excluding addresses from the
specified net_device itself.
Adding bridge support to the batman-adv multicast optimization requires
batman-adv knowing about the existence of bridged-in multicast
listeners to be able to reliably serve them with multicast packets.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
MLDv1 (RFC2710 section 6), MLDv2 (RFC3810 section 7.6.2), IGMPv2
(RFC2236 section 3) and IGMPv3 (RFC3376 section 6.6.2) specify that the
querier with lowest source address shall become the selected
querier.
So far the bridge stopped its querier as soon as it heard another
querier regardless of its source address. This results in the "wrong"
querier potentially becoming the active querier or a potential,
unnecessary querying delay.
With this patch the bridge memorizes the source address of the currently
selected querier and ignores queries from queriers with a higher source
address than the currently selected one. This slight optimization is
supposed to make it more RFC compliant (but is rather uncritical and
therefore probably not necessary to be queued for stable kernels).
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current naming of these two structs is very random, in that
reversing their naming would not make any semantical difference.
This patch tries to make the naming less confusing by giving them a more
specific, distinguishable naming.
This is also useful for the upcoming patches reintroducing the
"struct bridge_mcast_querier" but for storing information about the
selected querier (no matter if our own or a foreign querier).
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_manage_promisc() incorrectly expects br_auto_port() to return only 0
or 1, while it actually returns flags, i.e., a subset of BR_AUTO_MASK.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
(This patch was previously posted as RFC at
http://patchwork.ozlabs.org/patch/352677/)
This patch adds NDA_MASTER attribute to neighbour attributes enum for
bridge/master ifindex. And adds NDA_MASTER to bridge fdb notify msgs.
Today bridge fdb notifications dont contain bridge information.
Userspace can derive it from the port information in the fdb
notification. However this is tricky in some scenarious.
Example, bridge port delete notification comes before bridge fdb
delete notifications. And we have seen problems in userspace
when using libnl where, the bridge fdb delete notification handling code
does not understand which bridge this fdb entry is part of because
the bridge and port association has already been deleted.
And these notifications (port membership and fdb) are generated on
separate rtnl groups.
Fixing the order of notifications could possibly solve the problem
for some cases (I can submit a separate patch for that).
This patch chooses to add NDA_MASTER to bridge fdb notify msgs
because it not only solves the problem described above, but also helps
userspace avoid another lookup into link msgs to derive the master index.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_handle_local_finish() is allowing us to insert an FDB entry with
disallowed vlan. For example, when port 1 and 2 are communicating in
vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
interfere with their communication by spoofed src mac address with
vlan id 10.
Note: Even if it is judged that a frame should not be learned, it should
not be dropped because it is destined for not forwarding layer but higher
layer. See IEEE 802.1Q-2011 8.13.10.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There has been a number incidents recently where customers running KVM have
reported that VM hosts on different Hypervisors are unreachable. Based on
pcap traces we found that the bridge was broadcasting the ARP request out
onto the network. However some NICs have an inbuilt switch which on occasions
were broadcasting the VMs ARP request back through the physical NIC on the
Hypervisor. This resulted in the bridge changing ports and incorrectly learning
that the VMs mac address was external. As a result the ARP reply was directed
back onto the external network and VM never updated it's ARP cache. This patch
will notify the bridge command, after a fdb has been updated to identify such
port toggling.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit 2796d0c648 ("bridge: Automatically manage port
promiscuous mode."), make the add_if use dev_set_allmulti
instead of dev_set_promiscuous, so when add_if failed, we
should do dev_set_allmulti(dev, -1).
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before f5efc69 ("netfilter: nf_tables: Add meta expression key for
bridge interface name"), the entire net/bridge/netfilter/ directory
depended on BRIDGE_NF_EBTABLES, ie. on ebtables. However, that
directory already contained the nf_tables bridge extension that
we should allow to compile separately. In f5efc69, we tried to
generalize this by using CONFIG_BRIDGE_NETFILTER which was not a good
idea since this option already existed and it is dedicated to enable
the Netfilter bridge IP/ARP filtering.
Let's try to fix this mess by:
1) making net/bridge/netfilter/ dependent on the toplevel
CONFIG_NETFILTER option, just like we do with the net/netfilter and
net/ipv{4,6}/netfilter/ directories.
2) Changing 'selects' to 'depends on' NETFILTER_XTABLES for
BRIDGE_NF_EBTABLES. I believe this problem was already before
f5efc69:
warning: (BRIDGE_NF_EBTABLES) selects NETFILTER_XTABLES which has
unmet direct dependencies (NET && INET && NETFILTER)
3) Fix ebtables/nf_tables bridge dependencies by making NF_TABLES_BRIDGE
and BRIDGE_NF_EBTABLES dependent on BRIDGE and NETFILTER:
warning: (NF_TABLES_BRIDGE && BRIDGE_NF_EBTABLES) selects
BRIDGE_NETFILTER which has unmet direct dependencies (NET && BRIDGE &&
NETFILTER && INET && NETFILTER_ADVANCED)
net/built-in.o: In function `br_parse_ip_options':
br_netfilter.c:(.text+0x4a5ba): undefined reference to `ip_options_compile'
br_netfilter.c:(.text+0x4a5ed): undefined reference to `ip_options_rcv_srr'
net/built-in.o: In function `br_nf_pre_routing_finish':
br_netfilter.c:(.text+0x4a8a4): undefined reference to `ip_route_input_noref'
br_netfilter.c:(.text+0x4a987): undefined reference to `ip_route_output_flow'
make: *** [vmlinux] Error 1
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net/bridge/br_notify.c into net/bridge/br.c,
since it has only br_device_event() and br.c is small.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/nftables updates for net-next
The following patchset contains Netfilter/nftables updates for net-next,
most relevantly they are:
1) Add set element update notification via netlink, from Arturo Borrero.
2) Put all object updates in one single message batch that is sent to
kernel-space. Before this patch only rules where included in the batch.
This series also introduces the generic transaction infrastructure so
updates to all objects (tables, chains, rules and sets) are applied in
an all-or-nothing fashion, these series from me.
3) Defer release of objects via call_rcu to reduce the time required to
commit changes. The assumption is that all objects are destroyed in
reverse order to ensure that dependencies betweem them are fulfilled
(ie. rules and sets are destroyed first, then chains, and finally
tables).
4) Allow to match by bridge port name, from Tomasz Bursztyka. This series
include two patches to prepare this new feature.
5) Implement the proper set selection based on the characteristics of the
data. The new infrastructure also allows you to specify your preferences
in terms of memory and computational complexity so the underlying set
type is also selected according to your needs, from Patrick McHardy.
6) Several cleanup patches for nft expressions, including one minor possible
compilation breakage due to missing mark support, also from Patrick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
fix build when BRIDGE_VLAN_FILTERING is not set
Fixes: 2796d0c648 ("bridge: Automatically manage port promiscuous mode")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There exist configurations where the administrator or another management
entity has the foreknowledge of all the mac addresses of end systems
that are being bridged together.
In these environments, the administrator can statically configure known
addresses in the bridge FDB and disable flooding and learning on ports.
This makes it possible to turn off promiscuous mode on the interfaces
connected to the bridge.
Here is why disabling flooding and learning allows us to control
promiscuity:
Consider port X. All traffic coming into this port from outside the
bridge (ingress) will be either forwarded through other ports of the
bridge (egress) or dropped. Forwarding (egress) is defined by FDB
entries and by flooding in the event that no FDB entry exists.
In the event that flooding is disabled, only FDB entries define
the egress. Once learning is disabled, only static FDB entries
provided by a management entity define the egress. If we provide
information from these static FDBs to the ingress port X, then we'll
be able to accept all traffic that can be successfully forwarded and
drop all the other traffic sooner without spending CPU cycles to
process it.
Another way to define the above is as following equations:
ingress = egress + drop
expanding egress
ingress = static FDB + learned FDB + flooding + drop
disabling flooding and learning we a left with
ingress = static FDB + drop
By adding addresses from the static FDB entries to the MAC address
filter of an ingress port X, we fully define what the bridge can
process without dropping and can thus turn off promiscuous mode,
thus dropping packets sooner.
There have been suggestions that we may want to allow learning
and update the filters with learned addresses as well. This
would require mac-level authentication similar to 802.1x to
prevent attacks against the hw filters as they are limited
resource.
Additionally, if the user places the bridge device in promiscuous mode,
all ports are placed in promiscuous mode regardless of the changes
to flooding and learning.
Since the above functionality depends on full static configuration,
we have also require that vlan filtering be enabled to take
advantage of this. The reason is that the bridge has to be
able to receive and process VLAN-tagged frames and the there
are only 2 ways to accomplish this right now: promiscuous mode
or vlan filtering.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a static fdb entry is created, add the mac address
from this fdb entry to any ports that are currently running
in non-promiscuous mode. These ports need this data so that
they can receive traffic destined to these addresses.
By default ports start in promiscuous mode, so this feature
is disabled.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a BR_PROMISC per-port flag that will help us track if the
current port is supposed to be in promiscuous mode or not. For now,
always start in promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add code that allows static fdb entires to be synced to the
hw list for a specified port. This will be used later to
program ports that can function in non-promiscuous mode.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, ports on the bridge are capable of automatic
discovery of nodes located behind the port. This is accomplished
via flooding of unknown traffic (BR_FLOOD) and learning the
mac addresses from these packets (BR_LEARNING).
If the above functionality is disabled by turning off these
flags, the port requires static configuration in the form
of static FDB entries to function properly.
This patch adds functionality to keep track of all ports
capable of automatic discovery. This will later be used
to control promiscuity settings.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn the flag change macro into a function to allow
easier updates and to reduce space.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The __vlan_find_dev_deep should always called in RCU, according
David's suggestion, rename to __vlan_find_dev_deep_rcu looks more
reasonable.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently bridge can silently drop ipv4 fragments.
If node have loaded nf_defrag_ipv4 module but have no nf_conntrack_ipv4,
br_nf_pre_routing defragments incoming ipv4 fragments
but nfct check in br_nf_dev_queue_xmit does not allow re-fragment combined
packet back, and therefore it is dropped in br_dev_queue_push_xmit without
incrementing of any failcounters
It seems the only way to hit the ip_fragment code in the bridge xmit
path is to have a fragment list whose reassembled fragments go over
the mtu. This only happens if nf_defrag is enabled. Thanks to
Florian Westphal for providing feedback to clarify this.
Defragmentation ipv4 is required not only in conntracks but at least in
TPROXY target and socket match, therefore #ifdef is changed from
NF_CONNTRACK_IPV4 to NF_DEFRAG_IPV4
Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When bridge device is created with IFLA_ADDRESS, we are not calling
br_stp_change_bridge_id(), which leads to incorrect local fdb
management and bridge id calculation, and prevents us from receiving
frames on the bridge device.
Reported-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFT_META_BRI_IIFNAME to get packet input bridge interface name
NFT_META_BRI_OIFNAME to get packet output bridge interface name
Such meta key are accessible only through NFPROTO_BRIDGE family, on a
dedicated nft meta module: nft_meta_bridge.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_allowed_ingress() has two problems.
1. If br_allowed_ingress() is called by br_handle_frame_finish() and
vlan_untag() in br_allowed_ingress() fails, skb will be freed by both
vlan_untag() and br_handle_frame_finish().
2. If br_allowed_ingress() is called by br_dev_xmit() and
br_allowed_ingress() fails, the skb will not be freed.
Fix these two problems by freeing the skb in br_allowed_ingress()
if it fails.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
All xtables variants suffer from the defect that the copy_to_user()
to copy the counters to user memory may fail after the table has
already been exchanged and thus exposed. Return an error at this
point will result in freeing the already exposed table. Any
subsequent packet processing will result in a kernel panic.
We can't copy the counters before exposing the new tables as we
want provide the counter state after the old table has been
unhooked. Therefore convert this into a silent error.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use existing function instead of trying to use our own.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
The gfp parameter was added in:
commit 47be03a28c
Author: Amerigo Wang <amwang@redhat.com>
Date: Fri Aug 10 01:24:37 2012 +0000
netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reason for the gfp parameter was removed in:
commit c4cdef9b71
Author: dingtianhong <dingtianhong@huawei.com>
Date: Tue Jul 23 15:25:27 2013 +0800
bonding: don't call slave_xxx_netpoll under spinlocks
The slave_xxx_netpoll will call synchronize_rcu_bh(),
so the function may schedule and sleep, it should't be
called under spinlocks.
bond_netpoll_setup() and bond_netpoll_cleanup() are always
protected by rtnl lock, it is no need to take the read lock,
as the slave list couldn't be changed outside rtnl lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing else that calls __netpoll_setup or ndo_netpoll_setup
requires a gfp paramter, so remove the gfp parameter from both
of these functions making the code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the vlan filtering is enabled on the bridge, but
the filter is not configured on the bridge device itself,
running tcpdump on the bridge device will result in a
an Oops with NULL pointer dereference. The reason
is that br_pass_frame_up() will bypass the vlan
check because promisc flag is set. It will then try
to get the table pointer and process the packet based
on the table. Since the table pointer is NULL, we oops.
Catch this special condition in br_handle_vlan().
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a bridge with vlan_filtering enabled receives frames with stacked
vlan tags, i.e., they have two vlan tags, br_vlan_untag() strips not
only the outer tag but also the inner tag.
br_vlan_untag() is called only from br_handle_vlan(), and in this case,
it is enough to set skb->vlan_tci to 0 here, because vlan_tci has already
been set before calling br_handle_vlan().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge vlan code (br_vlan_get_tag()) assumes that all frames have vlan_tci
if they are tagged, but if vlan tx offload is manually disabled on bridge
device and frames are sent from vlan device on the bridge device, the tags
are embedded in skb->data and they break this assumption.
Extract embedded vlan tags and move them to vlan_tci at ingress.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this check someone could easily create a denial of service
by injecting multicast-specific queries to enable the bridge
snooping part if no real querier issuing periodic general queries
is present on the link which would result in the bridge wrongly
shutting down ports for multicast traffic as the bridge did not learn
about these listeners.
With this patch the snooping code is enabled upon receiving valid,
general queries only.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
General IGMP and MLD queries are supposed to have the multicast
link-local all-nodes address as their destination according to RFC2236
section 9, RFC3376 section 4.1.12/9.1, RFC2710 section 8 and RFC3810
section 5.1.15.
Without this check, such malformed IGMP/MLD queries can result in a
denial of service: The queries are ignored by most IGMP/MLD listeners
therefore they will not respond with an IGMP/MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e688a60480 ("net: introduce DST_NOPEER dst flag") introduced
DST_NOPEER because because of crashes in ipv6_select_ident called from
udp6_ufo_fragment.
Since commit 916e4cf46d ("ipv6: reuse ip6_frag_id from
ip6_ufo_append_data") we don't call ipv6_select_ident any more from
ip6_ufo_append_data, thus this flag lost its purpose and can be removed.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MLD queries are supposed to have an IPv6 link-local source address
according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch
adds a sanity check to ignore such broken MLD queries.
Without this check, such malformed MLD queries can result in a
denial of service: The queries are ignored by any MLD listener
therefore they will not respond with an MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the uses of memcpy to ether_addr_copy because
for some architectures it is smaller and faster.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the more obvious uses of memcpy to ether_addr_copy.
There are still uses of memcpy that could be converted but
these addresses are __aligned(2).
Convert a couple uses of 6 in gr_private.h to ETH_ALEN.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() calls fdb_insert()/fdb_delete() without
br->hash_lock.
These hash list updates are racy with br_fdb_update()/br_fdb_cleanup().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan codes unconditionally delete local fdb entries.
We should consider the possibility that other ports have the same
address and vlan.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev eth0 vid 10
bridge vlan add dev eth1 vid 10
bridge vlan add dev br0 vid 10 self
We will have fdb entry such that f->dst == eth0, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, delete eth0 vlan 10.
bridge vlan del dev eth0 vid 10
In this case, we still need the entry for br0, but it will be deleted.
Note that br0 needs the entry even though its mac address is not set
manually. To delete the entry with proper condition checking,
fdb_delete_local() is suitable to use.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_delete_by_port() doesn't care about vlan and mac address of the
bridge device.
As the check is almost the same as mac address changing, slightly modify
fdb_delete_local() and use it.
Note that we can always set added_by_user to 0 in fdb_delete_local() because
- br_fdb_delete_by_port() calls fdb_delete_local() for local entries
regardless of its added_by_user. In this case, we have to check if another
port has the same address and vlan, and if found, we have to create the
entry (by changing dst). This is kernel-added entry, not user-added.
- br_fdb_changeaddr() doesn't call fdb_delete_local() for user-added entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() doesn't check if the local entry has the
same address as any of bridge ports.
Although I'm not sure when it is beneficial, current implementation allow
the bridge device to receive any mac address of its ports.
To preserve this behavior, we have to check if the mac address of the
entry being deleted is identical to that of any port.
As this check is almost the same as that in br_fdb_changeaddr(), create
a common function fdb_delete_local() and call it from
br_fdb_changeadddr() and br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should take into account the followings when deleting a local fdb
entry.
- nbp_vlan_find() can be used only when vid != 0 to check if an entry is
deletable, because a fdb entry with vid 0 can exist at any time while
nbp_vlan_find() always return false with vid 0.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address 12:34:56:78:90:ab
brctl addif br0 eth0
brctl addif br0 eth1
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb entry 12:34:56:78:90:ab will be deleted even though the
bridge port eth1 still has that address.
- The port to which the bridge device is attached might needs a local entry
if its mac address is set manually.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
brctl addif br0 eth0
ip link set br0 address 12:34:56:78:90:ab
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb still must have the entry 12:34:56:78:90:ab, but it will be
deleted.
We can use br->dev->addr_assign_type to check if the address is manually
set or not, but I propose another approach.
Since we delete and insert local entries whenever changing mac address
of the bridge device, we can change dst of the entry to NULL regardless of
addr_assign_type when deleting an entry associated with a certain port,
and if it is found to be unnecessary later, then delete it.
That is, if changing mac address of a port, the entry might be changed
to its dst being NULL first, but is eventually deleted when recalculating
and changing bridge id.
This approach is especially useful when we want to share the code with
deleting vlan in which the bridge device might want such an entry regardless
of addr_assign_type, and makes things easy because we don't have to consider
if mac address of the bridge device will be changed or not at the time we
delete a local entry of a port, which means fdb code will not be bothered
even if the bridge id calculating logic is changed in the future.
Also, this change reduces inconsistent state, where frames whose dst is the
mac address of the bridge, can't reach the bridge because of premature fdb
entry deletion. This change reduces the possibility that the bridge device
replies unreachable mac address to arp requests, which could occur during
the short window between calling del_nbp() and br_stp_recalculate_bridge_id()
in br_del_if(). This will effective after br_fdb_delete_by_port() starts to
use the same code by following patch.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan code may need fdb change when changing mac address of bridge device
even if it is caused by the mac address changing of a bridge port.
Example configuration:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev br0 vid 10 self
bridge vlan add dev eth0 vid 10
We will have fdb entry such that f->dst == NULL, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, change the mac address of eth0 to greater value.
ip link set eth0 address ee:ff:12:34:56:78
Then, mac address of br0 will be recalculated and set to aa:bb:cc:dd:ee:ff.
However, an entry aa:bb:cc:dd:ee:ff will not be created and we will be not
able to communicate using br0 on vlan 10.
Address this issue by deleting and adding local entries whenever
changing the mac address of the bridge device.
If there already exists an entry that has the same address, for example,
in case that br_fdb_changeaddr() has already inserted it,
br_fdb_change_mac_address() will simply fail to insert it and no
duplicated entry will be made, as it was.
This approach also needs br_add_if() to call br_fdb_insert() before
br_stp_recalculate_bridge_id() so that we don't create an entry whose
dst == NULL in this function to preserve previous behavior.
Note that this is a slight change in behavior where the bridge device can
receive the traffic to the new address before calling
br_stp_recalculate_bridge_id() in br_add_if().
However, it is not a problem because we have already the address on the
new port and such a way to insert new one before recalculating bridge id
is taken in br_device_event() as well.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been always failed to delete the old entry at
br_fdb_change_mac_address() because br_set_mac_address() updates
dev->dev_addr before calling br_fdb_change_mac_address() and
br_fdb_change_mac_address() uses dev->dev_addr to find the old entry.
That update of dev_addr is completely unnecessary because the same work
is done in br_stp_change_bridge_id() which is called right away after
calling br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit bc9a25d21e ("bridge: Add vlan support for local fdb entries"),
br_fdb_changeaddr() has inserted a new local fdb entry only if it can
find old one. But if we have two ports where they have the same address
or user has deleted a local entry, there will be no entry for one of the
ports.
Example of problematic case:
ip link set eth0 address aa:bb:cc:dd:ee:ff
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # eth1 will not have a local entry due to dup.
ip link set eth1 address 12:34:56:78:90:ab
Then, the new entry for the address 12:34:56:78:90:ab will not be
created, and the bridge device will not be able to communicate.
Insert new entries regardless of whether we can find old entries or not.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_changeaddr() assumes that there is at most one local entry per port
per vlan. It used to be true, but since commit 36fd2b63e3 ("bridge: allow
creating/deleting fdb entries via netlink"), it has not been so.
Therefore, the function might fail to search a correct previous address
to be deleted and delete an arbitrary local entry if user has added local
entries manually.
Example of problematic case:
ip link set eth0 address ee:ff:12:34:56:78
brctl addif br0 eth0
bridge fdb add 12:34:56:78:90:ab dev eth0 master
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the address 12:34:56:78:90:ab might be deleted instead of
ee:ff:12:34:56:78, the original mac address of eth0.
Address this issue by introducing a new flag, added_by_user, to struct
net_bridge_fdb_entry.
Note that br_fdb_delete_by_port() has to set added_by_user to 0 in cases
like:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
bridge fdb add aa:bb:cc:dd:ee:ff dev eth0 master
brctl addif br0 eth1
brctl delif br0 eth0
In this case, kernel should delete the user-added entry aa:bb:cc:dd:ee:ff,
but it also should have been added by "brctl addif br0 eth1" originally,
so we don't delete it and treat it a new kernel-created entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 93d8bf9fb8 ("bridge: cleanup netpoll code") introduced
a check in br_netpoll_enable(), but this check is incorrect for
br_netpoll_setup(). This patch moves the code after the check
into __br_netpoll_enable() and calls it in br_netpoll_setup().
For br_add_if(), the check is still needed.
Fixes: 93d8bf9fb8 ("bridge: cleanup netpoll code")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_handle_vlan() pushes HW accelerated vlan tag into skbuff when outgoing
port is the bridge device.
This is unnecessary because __netif_receive_skb_core() can handle skbs
with HW accelerated vlan tag. In current implementation,
__netif_receive_skb_core() needs to extract the vlan tag embedded in skb
data. This could cause low network performance especially when receiving
frames at a high frame rate on the bridge device.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And it can become static.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.
Make the AF-specific hook functions the default and make the optional
chain type hooks override them.
As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:
nf_tables_ipv4_init_net | -24
nft_do_chain_ipv4 | -113
2 functions changed, 137 bytes removed, diff: -137
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use DEVICE_ATTR_RO/RW macros to simplify bridge sysfs attribute definitions.
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.
br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held . This can happen quite easily when any of the bridge multicast
timers expire, which try to take the same lock.
The fix here is to use spin_lock_bh(), preventing other softirqs from
executing on this CPU.
Steps to reproduce:
1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.
# brctl addbr br0
# brctl addif br0 eth1 eth2 eth3 eth4
# brctl setmcqi br0 2
# brctl setmcquerier br0 1
# while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done
Signed-off-by: Curt Brune <curt@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function __rtnl_af_register is never called outside this
code, and the return value is always 0.
Signed-off-by: David S. Miller <davem@davemloft.net>
That open brace { should be on the previous line.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spaces required before the open parenthesis '(', before the open
brace '{', after that ',' and around that '?/:'.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return is not a function, parentheses are not required.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because err is always negative, remove unnecessary condition
judgment.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix spelling errors in bridge driver.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various spelling fixes in networking stack
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
br_stp_rcv() is reached by non-rx_handler path. That means there is no
guarantee that dev is bridge port and therefore simple NULL check of
->rx_handler_data is not enough. There is need to check if dev is really
bridge port and since only rcu read lock is held here, do it by checking
->rx_handler pointer.
Note that synchronize_net() in netdev_rx_handler_unregister() ensures
this approach as valid.
Introduced originally by:
commit f350a0a873
"bridge: use rx_handler_data pointer to store net_bridge_port pointer"
Fixed but not in the best way by:
commit b5ed54e94d
"bridge: fix RCU races with bridge port"
Reintroduced by:
commit 716ec052d2
"bridge: fix NULL pointer deref of br_port_get_rcu"
Please apply to stable trees as well. Thanks.
RH bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1025770
Reported-by: Laine Stump <laine@redhat.com>
Debugged-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: netfilter@vger.kernel.org
CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter fixes for net
The following patchset contains fixes for your net tree, they are:
* Remove extra quote from connlimit configuration in Kconfig, from
Randy Dunlap.
* Fix missing mss option in syn packets sent to the backend in our
new synproxy target, from Martin Topholm.
* Use window scale announced by client when sending the forged
syn to the backend, from Martin Topholm.
* Fix IPv6 address comparison in ebtables, from Luís Fernando
Cornachioni Estrozi.
* Fix wrong endianess in sequence adjustment which breaks helpers
in NAT configurations, from Phil Oester.
* Fix the error path handling of nft_compat, from me.
* Make sure the global conntrack counter is decremented after the
object has been released, also from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge dev
When the following commands are executed:
brctl addbr br0
ifconfig br0 hw ether <addr>
rmmod bridge
The calltrace will occur:
[ 563.312114] device eth1 left promiscuous mode
[ 563.312188] br0: port 1(eth1) entered disabled state
[ 563.468190] kmem_cache_destroy bridge_fdb_cache: Slab cache still has objects
[ 563.468197] CPU: 6 PID: 6982 Comm: rmmod Tainted: G O 3.12.0-0.7-default+ #9
[ 563.468199] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 563.468200] 0000000000000880 ffff88010f111e98 ffffffff814d1c92 ffff88010f111eb8
[ 563.468204] ffffffff81148efd ffff88010f111eb8 0000000000000000 ffff88010f111ec8
[ 563.468206] ffffffffa062a270 ffff88010f111ed8 ffffffffa063ac76 ffff88010f111f78
[ 563.468209] Call Trace:
[ 563.468218] [<ffffffff814d1c92>] dump_stack+0x6a/0x78
[ 563.468234] [<ffffffff81148efd>] kmem_cache_destroy+0xfd/0x100
[ 563.468242] [<ffffffffa062a270>] br_fdb_fini+0x10/0x20 [bridge]
[ 563.468247] [<ffffffffa063ac76>] br_deinit+0x4e/0x50 [bridge]
[ 563.468254] [<ffffffff810c7dc9>] SyS_delete_module+0x199/0x2b0
[ 563.468259] [<ffffffff814e0922>] system_call_fastpath+0x16/0x1b
[ 570.377958] Bridge firewalling registered
--------------------------- cut here -------------------------------
The reason is that when the bridge dev's address is changed, the
br_fdb_change_mac_address() will add new address in fdb, but when
the bridge was removed, the address entry in the fdb did not free,
the bridge_fdb_cache still has objects when destroy the cache, Fix
this by flushing the bridge address entry when removing the bridge.
v2: according to the Toshiaki Makita and Vlad's suggestion, I only
delete the vlan0 entry, it still have a leak here if the vlan id
is other number, so I need to call fdb_delete_by_port(br, NULL, 1)
to flush all entries whose dst is NULL for the bridge.
Suggested-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Suggested-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
This bug was introduced on commit 0898f99a2. This just recovers two
checks that existed before as suggested by Bart De Schuymer.
Signed-off-by: Luís Fernando Cornachioni Estrozi <lestrozi@uolinc.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should call vlan_vid_del for all vids at nbp_vlan_flush to prevent
vid_info->refcount from being leaked when detaching a bridge port.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should use wrapper functions vlan_vid_[add/del] instead of
ndo_vlan_rx_[add/kill]_vid. Otherwise, we might be not able to communicate
using vlan interface in a certain situation.
Example of problematic case:
vconfig add eth0 10
brctl addif br0 eth0
bridge vlan add dev eth0 vid 10
bridge vlan del dev eth0 vid 10
brctl delif br0 eth0
In this case, we cannot communicate via eth0.10 because vlan 10 is
filtered by NIC that has the vlan filtering feature.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pablo Neira Ayuso says:
====================
This batch contains fives nf_tables patches for your net-next tree,
they are:
* Fix possible use after free in the module removal path of the
x_tables compatibility layer, from Dan Carpenter.
* Add filter chain type for the bridge family, from myself.
* Fix Kconfig dependencies of the nf_tables bridge family with
the core, from myself.
* Fix sparse warnings in nft_nat, from Tomasz Bursztyka.
* Remove duplicated include in the IPv4 family support for nf_tables,
from Wei Yongjun.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This is another batch containing Netfilter/IPVS updates for your net-next
tree, they are:
* Six patches to make the ipt_CLUSTERIP target support netnamespace,
from Gao feng.
* Two cleanups for the nf_conntrack_acct infrastructure, introducing
a new structure to encapsulate conntrack counters, from Holger
Eitzenberger.
* Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann.
* Skip checksum recalculation in SCTP support for IPVS, also from
Daniel Borkmann.
* Fix behavioural change in xt_socket after IP early demux, from
Florian Westphal.
* Fix bogus large memory allocation in the bitmap port set type in ipset,
from Jozsef Kadlecsik.
* Fix possible compilation issues in the hash netnet set type in ipset,
also from Jozsef Kadlecsik.
* Define constants to identify netlink callback data in ipset dumps,
again from Jozsef Kadlecsik.
* Use sock_gen_put() in xt_socket to replace xt_socket_put_sk,
from Eric Dumazet.
* Improvements for the SH scheduler in IPVS, from Alexander Frolkin.
* Remove extra delay due to unneeded rcu barrier in IPVS net namespace
cleanup path, from Julian Anastasov.
* Save some cycles in ip6t_REJECT by skipping checksum validation in
packets leaving from our stack, from Stanislav Fomichev.
* Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from
Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently multicast code attempts to extrace the vlan id from
the skb even when vlan filtering is disabled. This can lead
to mdb entries being created with the wrong vlan id.
Pass the already extracted vlan id to the multicast
filtering code to make the correct id is used in
creation as well as lookup.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the filter chain type which is required to
create filter chains in the bridge family from userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
when CONFIG_NF_TABLES[_MODULE] is not enabled,
but CONFIG_NF_TABLES_BRIDGE is enabled:
net/bridge/netfilter/nf_tables_bridge.c: In function 'nf_tables_bridge_init_net':
net/bridge/netfilter/nf_tables_bridge.c:24:5: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:25:9: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:28:2: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:30:34: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:35:11: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c: In function 'nf_tables_bridge_exit_net':
net/bridge/netfilter/nf_tables_bridge.c:41:27: error: 'struct net' has no member named 'nft'
net/bridge/netfilter/nf_tables_bridge.c:42:11: error: 'struct net' has no member named 'nft'
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pekka Pietikäinen reports xt_socket behavioural change after commit
00028aa37098o (netfilter: xt_socket: use IP early demux).
Reason is xt_socket now no longer does an unconditional sk lookup -
it re-uses existing skb->sk if possible, assuming ->sk was set by
ip early demux.
However, when netfilter is invoked via bridge, this can cause 'bogus'
sockets to be examined by the match, e.g. a 'tun' device socket.
bridge netfilter should orphan the skb just like the routing path
before invoking ipv4/ipv6 netfilter hooks to avoid this.
Reported-and-tested-by: Pekka Pietikäinen <pp@ee.oulu.fi>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
The following patchset contains three netfilter fixes for your net
tree, they are:
* A couple of fixes to resolve info leak to userspace due to uninitialized
memory area in ulogd, from Mathias Krause.
* Fix instruction ordering issues that may lead to the access of
uninitialized data in x_tables. The problem involves the table update
(producer) and the main packet matching (consumer) routines. Detected in
SMP ARMv7, from Will Deacon.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
While this commit was a good attempt to fix issues occuring when no
multicast querier is present, this commit still has two more issues:
1) There are cases where mdb entries do not expire even if there is a
querier present. The bridge will unnecessarily continue flooding
multicast packets on the according ports.
2) Never removing an mdb entry could be exploited for a Denial of
Service by an attacker on the local link, slowly, but steadily eating up
all memory.
Actually, this commit became obsolete with
"bridge: disable snooping if there is no querier" (b00589af3b)
which included fixes for a few more cases.
Therefore reverting the following commits (the commit stated in the
commit message plus three of its follow up fixes):
====================
Revert "bridge: update mdb expiration timer upon reports."
This reverts commit f144febd93.
Revert "bridge: do not call setup_timer() multiple times"
This reverts commit 1faabf2aab.
Revert "bridge: fix some kernel warning in multicast timer"
This reverts commit c7e8e8a8f7.
Revert "bridge: only expire the mdb entry when query is received"
This reverts commit 9f00b2e7cf.
====================
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently set the value that variable vid is pointing, which will be
used in FDB later, to 0 at br_allowed_ingress() when we receive untagged
or priority-tagged frames, even though the PVID is valid.
This leads to FDB updates in such a wrong way that they are learned with
VID 0.
Update the value to that of PVID if the PVID is applied.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are using the VLAN_TAG_PRESENT bit to detect whether the PVID is
set or not at br_get_pvid(), while we don't care about the bit in
adding/deleting the PVID, which makes it impossible to forward any
incomming untagged frame with vlan_filtering enabled.
Since vid 0 cannot be used for the PVID, we can use vid 0 to indicate
that the PVID is not set, which is slightly more efficient than using
the VLAN_TAG_PRESENT.
Fix the problem by getting rid of using the VLAN_TAG_PRESENT.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that when we receive priority-tagged (VID 0) frames
use the PVID for the port as its VID.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Apply the PVID to not only untagged frames but also priority-tagged frames.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that:
- VID 0 shall not be configured as a PVID, or configured in any Filtering
Database entry.
- VID 4095 shall not be configured as a PVID, or transmitted in a tag
header. This VID value may be used to indicate a wildcard match for the VID
in management operations or Filtering Database entries.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Don't accept adding these VIDs in the vlan_filtering implementation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit be4f154d5e
bridge: Clamp forward_delay when enabling STP
had a typo when attempting to clamp maximum forward delay.
It is possible to set bridge_forward_delay to be higher then
permitted maximum when STP is off. When turning STP on, the
higher then allowed delay has to be clamed down to max value.
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 9f00b2e7cf
bridge: only expire the mdb entry when query is received
changed the mdb expiration timer to be armed only when QUERY is
received. Howerver, this causes issues in an environment where
the multicast server socket comes and goes very fast while a client
is trying to send traffic to it.
The root cause is a race where a sequence of LEAVE followed by REPORT
messages can race against QUERY messages generated in response to LEAVE.
The QUERY ends up starting the expiration timer, and that timer can
potentially expire after the new REPORT message has been received signaling
the new join operation. This leads to a significant drop in multicast
traffic and possible complete stall.
The solution is to have REPORT messages update the expiration timer
on entries that already exist.
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the memset/memcpy uses of 6 to ETH_ALEN
where appropriate.
Also convert some struct definitions and u8 array
declarations of [6] to ETH_ALEN.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ulog messages leak heap bytes by the means of padding bytes and
incompletely filled string arrays. Fix those by memset(0)'ing the
whole struct before filling it.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NULL deref happens when br_handle_frame is called between these
2 lines of del_nbp:
dev->priv_flags &= ~IFF_BRIDGE_PORT;
/* --> br_handle_frame is called at this time */
netdev_rx_handler_unregister(dev);
In br_handle_frame the return of br_port_get_rcu(dev) is dereferenced
without check but br_port_get_rcu(dev) returns NULL if:
!(dev->priv_flags & IFF_BRIDGE_PORT)
Eric Dumazet pointed out the testing of IFF_BRIDGE_PORT is not necessary
here since we're in rcu_read_lock and we have synchronize_net() in
netdev_rx_handler_unregister. So remove the testing of IFF_BRIDGE_PORT
and by the previous patch, make sure br_port_get_rcu is called in
bridging code.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
current br_port_get_rcu is problematic in bridging path
(NULL deref). Change these calls in netlink path first.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At some point limits were added to forward_delay. However, the
limits are only enforced when STP is enabled. This created a
scenario where you could have a value outside the allowed range
while STP is disabled, which then stuck around even after STP
is enabled.
This patch fixes this by clamping the value when we enable STP.
I had to move the locking around a bit to ensure that there is
no window where someone could insert a value outside the range
while we're in the middle of enabling STP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the message_age_timer calculation to use the BPDU's max age as
opposed to the local bridge's max age. This is in accordance with section
8.6.2.3.2 Step 2 of the 802.1D-1998 sprecification.
With the current implementation, when running with very large bridge
diameters, convergance will not always occur even if a root bridge is
configured to have a longer max age.
Tested successfully on bridge diameters of ~200.
Signed-off-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking changes from David Miller:
"Noteworthy changes this time around:
1) Multicast rejoin support for team driver, from Jiri Pirko.
2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the later because there are broken middleware devices which
scramble the timestamp.
From Yuchung Cheng.
3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsend user data. From Eric Dumazet.
4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.
5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.
6) Add a networking development FAQ, from Paul Gortmaker.
7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.
8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.
9) Add SCTP support to openvswitch, from Joe Stringer.
10) Add EF10 chip support to SFC driver, from Ben Hutchings.
11) Add new SYNPROXY netfilter target, from Patrick McHardy.
12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.
13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.
14) Add ipv6 support to vxlan driver, from Cong Wang"
Resolved conflicts as per discussion.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...
Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c
The conflicts were minor:
1) sit.c changes overlap with change to ip_tunnel_xmit() signature.
2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.
3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.
Signed-off-by: David S. Miller <davem@davemloft.net>
The multicast snooping code should have matured enough to be safely
applicable to IPv6 link-local multicast addresses (excluding the
link-local all nodes address, ff02::1), too.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if there is no listener for a certain group then IPv6 packets
for that group are flooded on all ports, even though there might be no
host and router interested in it on a port.
With this commit they are only forwarded to ports with a multicast
router.
Just like commit bd4265fe36 ("bridge: Only flood unregistered groups
to routers") did for IPv4, let's do the same for IPv6 with the same
reasoning.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
Get rid of MLDV2_MRC and use our new macros for mantisse and
exponent to calculate Maximum Response Delay out of the Maximum
Response Code.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>