mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-18 13:21:25 +00:00
d845cd901b
77420 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f7a8b10bfd |
wifi: cfg80211: fix 6 GHz scan request building
The 6 GHz scan request struct allocated by cfg80211_scan_6ghz() is
meant to be formed this way:
[base struct][channels][ssids][6ghz_params]
It is allocated with [channels] as the maximum number of channels
supported by the driver in the 6 GHz band, since allocation is
before knowing how many there will be.
However, the inner pointers are set incorrectly: initially, the
6 GHz scan parameters pointer is set:
[base struct][channels]
^ scan_6ghz_params
and later the SSID pointer is set to the end of the actually
_used_ channels.
[base struct][channels]
^ ssids
If many APs were to be discovered, and many channels used, and
there were many SSIDs, then the SSIDs could overlap the 6 GHz
parameters.
Additionally, the request->ssids for most of the function points
to the original request still (given the struct copy) but is used
normally, which is confusing.
Clear this up, by actually using the allocated space for 6 GHz
parameters _after_ the SSIDs, and set up the SSIDs initially so
they are used more clearly. Just like in nl80211.c, set them
only if there actually are SSIDs though.
Finally, also copy the elements (ie/ie_len) so they're part of
the same request, not pointing to the old request.
Co-developed-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://msgid.link/20240510113738.4190692ef4ee.I0cb19188be17a8abd029805e3373c0a7777c214c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
||
|
|
177c6ae972 |
wifi: mac80211: handle tasklet frames before stopping
The code itself doesn't want to handle frames from the driver if it's already stopped, but if the tasklet was queued before and runs after the stop, then all bets are off. Flush queues before actually stopping, RX should be off at this point since all the interfaces are removed already, etc. Reported-by: syzbot+8830db5d3593b5546d2e@syzkaller.appspotmail.com Link: https://msgid.link/20240515135318.b05f11385c9a.I41c1b33a2e1814c3a7ef352cd7f2951b91785617@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> |
||
|
|
02c665f048 |
wifi: mac80211: apply mcast rate only if interface is up
If the interface isn't enabled, don't apply multicast rate changes immediately. Reported-by: syzbot+de87c09cc7b964ea2e23@syzkaller.appspotmail.com Link: https://msgid.link/20240515133410.d6cffe5756cc.I47b624a317e62bdb4609ff7fa79403c0c444d32d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> |
||
|
|
ab904521f4 |
wifi: cfg80211: pmsr: use correct nla_get_uX functions
The commit |
||
|
|
642f89daa3 |
wifi: cfg80211: Lock wiphy in cfg80211_get_station
Wiphy should be locked before calling rdev_get_station() (see lockdep
assert in ieee80211_get_station()).
This fixes the following kernel NULL dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000050
Mem abort info:
ESR = 0x0000000096000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x06: level 2 translation fault
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=0000000003001000
[0000000000000050] pgd=0800000002dca003, p4d=0800000002dca003, pud=08000000028e9003, pmd=0000000000000000
Internal error: Oops: 0000000096000006 [#1] SMP
Modules linked in: netconsole dwc3_meson_g12a dwc3_of_simple dwc3 ip_gre gre ath10k_pci ath10k_core ath9k ath9k_common ath9k_hw ath
CPU: 0 PID: 1091 Comm: kworker/u8:0 Not tainted 6.4.0-02144-g565f9a3a7911-dirty #705
Hardware name: RPT (r1) (DT)
Workqueue: bat_events batadv_v_elp_throughput_metric_update
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : ath10k_sta_statistics+0x10/0x2dc [ath10k_core]
lr : sta_set_sinfo+0xcc/0xbd4
sp : ffff000007b43ad0
x29: ffff000007b43ad0 x28: ffff0000071fa900 x27: ffff00000294ca98
x26: ffff000006830880 x25: ffff000006830880 x24: ffff00000294c000
x23: 0000000000000001 x22: ffff000007b43c90 x21: ffff800008898acc
x20: ffff00000294c6e8 x19: ffff000007b43c90 x18: 0000000000000000
x17: 445946354d552d78 x16: 62661f7200000000 x15: 57464f445946354d
x14: 0000000000000000 x13: 00000000000000e3 x12: d5f0acbcebea978e
x11: 00000000000000e3 x10: 000000010048fe41 x9 : 0000000000000000
x8 : ffff000007b43d90 x7 : 000000007a1e2125 x6 : 0000000000000000
x5 : ffff0000024e0900 x4 : ffff800000a0250c x3 : ffff000007b43c90
x2 : ffff00000294ca98 x1 : ffff000006831920 x0 : 0000000000000000
Call trace:
ath10k_sta_statistics+0x10/0x2dc [ath10k_core]
sta_set_sinfo+0xcc/0xbd4
ieee80211_get_station+0x2c/0x44
cfg80211_get_station+0x80/0x154
batadv_v_elp_get_throughput+0x138/0x1fc
batadv_v_elp_throughput_metric_update+0x1c/0xa4
process_one_work+0x1ec/0x414
worker_thread+0x70/0x46c
kthread+0xdc/0xe0
ret_from_fork+0x10/0x20
Code: a9bb7bfd 910003fd a90153f3 f9411c40 (f9402814)
This happens because STA has time to disconnect and reconnect before
batadv_v_elp_throughput_metric_update() delayed work gets scheduled. In
this situation, ath10k_sta_state() can be in the middle of resetting
arsta data when the work queue get chance to be scheduled and ends up
accessing it. Locking wiphy prevents that.
Fixes:
|
||
|
|
e296c95eac |
wifi: cfg80211: fully move wiphy work to unbound workqueue
Previously I had moved the wiphy work to the unbound
system workqueue, but missed that when it restarts and
during resume it was still using the normal system
workqueue. Fix that.
Fixes:
|
||
|
|
4dc3a3893d |
wifi: cfg80211: validate HE operation element parsing
Validate that the HE operation element has the correct
length before parsing it.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
44c06bbde6 |
wifi: mac80211: Fix deadlock in ieee80211_sta_ps_deliver_wakeup()
The ieee80211_sta_ps_deliver_wakeup() function takes sta->ps_lock to
synchronizes with ieee80211_tx_h_unicast_ps_buf() which is called from
softirq context. However using only spin_lock() to get sta->ps_lock in
ieee80211_sta_ps_deliver_wakeup() does not prevent softirq to execute
on this same CPU, to run ieee80211_tx_h_unicast_ps_buf() and try to
take this same lock ending in deadlock. Below is an example of rcu stall
that arises in such situation.
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 2-....: (42413413 ticks this GP) idle=b154/1/0x4000000000000000 softirq=1763/1765 fqs=21206996
rcu: (t=42586894 jiffies g=2057 q=362405 ncpus=4)
CPU: 2 PID: 719 Comm: wpa_supplicant Tainted: G W 6.4.0-02158-g1b062f552873 #742
Hardware name: RPT (r1) (DT)
pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : queued_spin_lock_slowpath+0x58/0x2d0
lr : invoke_tx_handlers_early+0x5b4/0x5c0
sp : ffff00001ef64660
x29: ffff00001ef64660 x28: ffff000009bc1070 x27: ffff000009bc0ad8
x26: ffff000009bc0900 x25: ffff00001ef647a8 x24: 0000000000000000
x23: ffff000009bc0900 x22: ffff000009bc0900 x21: ffff00000ac0e000
x20: ffff00000a279e00 x19: ffff00001ef646e8 x18: 0000000000000000
x17: ffff800016468000 x16: ffff00001ef608c0 x15: 0010533c93f64f80
x14: 0010395c9faa3946 x13: 0000000000000000 x12: 00000000fa83b2da
x11: 000000012edeceea x10: ffff0000010fbe00 x9 : 0000000000895440
x8 : 000000000010533c x7 : ffff00000ad8b740 x6 : ffff00000c350880
x5 : 0000000000000007 x4 : 0000000000000001 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffff00000ac0e0e8
Call trace:
queued_spin_lock_slowpath+0x58/0x2d0
ieee80211_tx+0x80/0x12c
ieee80211_tx_pending+0x110/0x278
tasklet_action_common.constprop.0+0x10c/0x144
tasklet_action+0x20/0x28
_stext+0x11c/0x284
____do_softirq+0xc/0x14
call_on_irq_stack+0x24/0x34
do_softirq_own_stack+0x18/0x20
do_softirq+0x74/0x7c
__local_bh_enable_ip+0xa0/0xa4
_ieee80211_wake_txqs+0x3b0/0x4b8
__ieee80211_wake_queue+0x12c/0x168
ieee80211_add_pending_skbs+0xec/0x138
ieee80211_sta_ps_deliver_wakeup+0x2a4/0x480
ieee80211_mps_sta_status_update.part.0+0xd8/0x11c
ieee80211_mps_sta_status_update+0x18/0x24
sta_apply_parameters+0x3bc/0x4c0
ieee80211_change_station+0x1b8/0x2dc
nl80211_set_station+0x444/0x49c
genl_family_rcv_msg_doit.isra.0+0xa4/0xfc
genl_rcv_msg+0x1b0/0x244
netlink_rcv_skb+0x38/0x10c
genl_rcv+0x34/0x48
netlink_unicast+0x254/0x2bc
netlink_sendmsg+0x190/0x3b4
____sys_sendmsg+0x1e8/0x218
___sys_sendmsg+0x68/0x8c
__sys_sendmsg+0x44/0x84
__arm64_sys_sendmsg+0x20/0x28
do_el0_svc+0x6c/0xe8
el0_svc+0x14/0x48
el0t_64_sync_handler+0xb0/0xb4
el0t_64_sync+0x14c/0x150
Using spin_lock_bh()/spin_unlock_bh() instead prevents softirq to raise
on the same CPU that is holding the lock.
Fixes:
|
||
|
|
6f6291f09a |
wifi: mac80211: mesh: init nonpeer_pm to active by default in mesh sdata
With a ath9k device I can see that: iw phy phy0 interface add mesh0 type mp ip link set mesh0 up iw dev mesh0 scan Will start a scan with the Power Management bit set in the Frame Control Field. This is because we set this bit depending on the nonpeer_pm variable of the mesh iface sdata and when there are no active links on the interface it remains to NL80211_MESH_POWER_UNKNOWN. As soon as links starts to be established, it wil switch to NL80211_MESH_POWER_ACTIVE as it is the value set by befault on the per sta nonpeer_pm field. As we want no power save by default, (as expressed with the per sta ini values), lets init it to the expected default value of NL80211_MESH_POWER_ACTIVE. Also please note that we cannot change the default value from userspace prior to establishing a link as using NL80211_CMD_SET_MESH_CONFIG will not work before NL80211_CMD_JOIN_MESH has been issued. So too late for our initial scan. Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Link: https://msgid.link/20240527141759.299411-1-nico.escande@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> |
||
|
|
b7d7f11a29 |
wifi: mac80211: mesh: Fix leak of mesh_preq_queue objects
The hwmp code use objects of type mesh_preq_queue, added to a list in
ieee80211_if_mesh, to keep track of mpath we need to resolve. If the mpath
gets deleted, ex mesh interface is removed, the entries in that list will
never get cleaned. Fix this by flushing all corresponding items of the
preq_queue in mesh_path_flush_pending().
This should take care of KASAN reports like this:
unreferenced object 0xffff00000668d800 (size 128):
comm "kworker/u8:4", pid 67, jiffies 4295419552 (age 1836.444s)
hex dump (first 32 bytes):
00 1f 05 09 00 00 ff ff 00 d5 68 06 00 00 ff ff ..........h.....
8e 97 ea eb 3e b8 01 00 00 00 00 00 00 00 00 00 ....>...........
backtrace:
[<000000007302a0b6>] __kmem_cache_alloc_node+0x1e0/0x35c
[<00000000049bd418>] kmalloc_trace+0x34/0x80
[<0000000000d792bb>] mesh_queue_preq+0x44/0x2a8
[<00000000c99c3696>] mesh_nexthop_resolve+0x198/0x19c
[<00000000926bf598>] ieee80211_xmit+0x1d0/0x1f4
[<00000000fc8c2284>] __ieee80211_subif_start_xmit+0x30c/0x764
[<000000005926ee38>] ieee80211_subif_start_xmit+0x9c/0x7a4
[<000000004c86e916>] dev_hard_start_xmit+0x174/0x440
[<0000000023495647>] __dev_queue_xmit+0xe24/0x111c
[<00000000cfe9ca78>] batadv_send_skb_packet+0x180/0x1e4
[<000000007bacc5d5>] batadv_v_elp_periodic_work+0x2f4/0x508
[<00000000adc3cd94>] process_one_work+0x4b8/0xa1c
[<00000000b36425d1>] worker_thread+0x9c/0x634
[<0000000005852dd5>] kthread+0x1bc/0x1c4
[<000000005fccd770>] ret_from_fork+0x10/0x20
unreferenced object 0xffff000009051f00 (size 128):
comm "kworker/u8:4", pid 67, jiffies 4295419553 (age 1836.440s)
hex dump (first 32 bytes):
90 d6 92 0d 00 00 ff ff 00 d8 68 06 00 00 ff ff ..........h.....
36 27 92 e4 02 e0 01 00 00 58 79 06 00 00 ff ff 6'.......Xy.....
backtrace:
[<000000007302a0b6>] __kmem_cache_alloc_node+0x1e0/0x35c
[<00000000049bd418>] kmalloc_trace+0x34/0x80
[<0000000000d792bb>] mesh_queue_preq+0x44/0x2a8
[<00000000c99c3696>] mesh_nexthop_resolve+0x198/0x19c
[<00000000926bf598>] ieee80211_xmit+0x1d0/0x1f4
[<00000000fc8c2284>] __ieee80211_subif_start_xmit+0x30c/0x764
[<000000005926ee38>] ieee80211_subif_start_xmit+0x9c/0x7a4
[<000000004c86e916>] dev_hard_start_xmit+0x174/0x440
[<0000000023495647>] __dev_queue_xmit+0xe24/0x111c
[<00000000cfe9ca78>] batadv_send_skb_packet+0x180/0x1e4
[<000000007bacc5d5>] batadv_v_elp_periodic_work+0x2f4/0x508
[<00000000adc3cd94>] process_one_work+0x4b8/0xa1c
[<00000000b36425d1>] worker_thread+0x9c/0x634
[<0000000005852dd5>] kthread+0x1bc/0x1c4
[<000000005fccd770>] ret_from_fork+0x10/0x20
Fixes:
|
||
|
|
fb66df20a7 |
net/sched: taprio: extend minimum interval restriction to entire cycle too
It is possible for syzbot to side-step the restriction imposed by the
blamed commit in the Fixes: tag, because the taprio UAPI permits a
cycle-time different from (and potentially shorter than) the sum of
entry intervals.
We need one more restriction, which is that the cycle time itself must
be larger than N * ETH_ZLEN bit times, where N is the number of schedule
entries. This restriction needs to apply regardless of whether the cycle
time came from the user or was the implicit, auto-calculated value, so
we move the existing "cycle == 0" check outside the "if "(!new->cycle_time)"
branch. This way covers both conditions and scenarios.
Add a selftest which illustrates the issue triggered by syzbot.
Fixes:
|
||
|
|
e634134180 |
net/sched: taprio: make q->picos_per_byte available to fill_sched_entry()
In commit |
||
|
|
e8ded22ef0 |
netfilter: nft_fib: allow from forward/input without iif selector
This removes the restriction of needing iif selector in the
forward/input hooks for fib lookups when requested result is
oif/oifname.
Removing this restriction allows "loose" lookups from the forward hooks.
Fixes:
|
||
|
|
21a673bddc |
netfilter: tproxy: bail out if IP has been disabled on the device
syzbot reports:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[..]
RIP: 0010:nf_tproxy_laddr4+0xb7/0x340 net/ipv4/netfilter/nf_tproxy_ipv4.c:62
Call Trace:
nft_tproxy_eval_v4 net/netfilter/nft_tproxy.c:56 [inline]
nft_tproxy_eval+0xa9a/0x1a00 net/netfilter/nft_tproxy.c:168
__in_dev_get_rcu() can return NULL, so check for this.
Reported-and-tested-by: syzbot+b94a6818504ea90d7661@syzkaller.appspotmail.com
Fixes:
|
||
|
|
33c563ebf8 |
netfilter: nft_payload: skbuff vlan metadata mangle support
Userspace assumes vlan header is present at a given offset, but vlan offload allows to store this in metadata fields of the skbuff. Hence mangling vlan results in a garbled packet. Handle this transparently by adding a parser to the kernel. If vlan metadata is present and payload offset is over 12 bytes (source and destination mac address fields), then subtract vlan header present in vlan metadata, otherwise mangle vlan metadata based on offset and length, extracting data from the source register. This is similar to: |
||
|
|
4b4647add7 |
sock_map: avoid race between sock_map_close and sk_psock_put
sk_psock_get will return NULL if the refcount of psock has gone to 0, which will happen when the last call of sk_psock_put is done. However, sk_psock_drop may not have finished yet, so the close callback will still point to sock_map_close despite psock being NULL. This can be reproduced with a thread deleting an element from the sock map, while the second one creates a socket, adds it to the map and closes it. That will trigger the WARN_ON_ONCE: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 7220 at net/core/sock_map.c:1701 sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Modules linked in: CPU: 1 PID: 7220 Comm: syz-executor380 Not tainted 6.9.0-syzkaller-07726-g3c999d1ae3c7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 RIP: 0010:sock_map_close+0x2a2/0x2d0 net/core/sock_map.c:1701 Code: df e8 92 29 88 f8 48 8b 1b 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 79 29 88 f8 4c 8b 23 eb 89 e8 4f 15 23 f8 90 <0f> 0b 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 13 26 3d 02 RSP: 0018:ffffc9000441fda8 EFLAGS: 00010293 RAX: ffffffff89731ae1 RBX: ffffffff94b87540 RCX: ffff888029470000 RDX: 0000000000000000 RSI: ffffffff8bcab5c0 RDI: ffffffff8c1faba0 RBP: 0000000000000000 R08: ffffffff92f9b61f R09: 1ffffffff25f36c3 R10: dffffc0000000000 R11: fffffbfff25f36c4 R12: ffffffff89731840 R13: ffff88804b587000 R14: ffff88804b587000 R15: ffffffff89731870 FS: 000055555e080380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000207d4000 CR4: 0000000000350ef0 Call Trace: <TASK> unix_release+0x87/0xc0 net/unix/af_unix.c:1048 __sock_release net/socket.c:659 [inline] sock_close+0xbe/0x240 net/socket.c:1421 __fput+0x42b/0x8a0 fs/file_table.c:422 __do_sys_close fs/open.c:1556 [inline] __se_sys_close fs/open.c:1541 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1541 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb37d618070 Code: 00 00 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d4 e8 10 2c 00 00 80 3d 31 f0 07 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c RSP: 002b:00007ffcd4a525d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fb37d618070 RDX: 0000000000000010 RSI: 00000000200001c0 RDI: 0000000000000004 RBP: 0000000000000000 R08: 0000000100000000 R09: 0000000100000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Use sk_psock, which will only check that the pointer is not been set to NULL yet, which should only happen after the callbacks are restored. If, then, a reference can still be gotten, we may call sk_psock_stop and cancel psock->work. As suggested by Paolo Abeni, reorder the condition so the control flow is less convoluted. After that change, the reproducer does not trigger the WARN_ON_ONCE anymore. Suggested-by: Paolo Abeni <pabeni@redhat.com> Reported-by: syzbot+07a2e4a1a57118ef7355@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=07a2e4a1a57118ef7355 Fixes: |
||
|
|
f4dca95fc0 |
tcp: reduce accepted window in NEW_SYN_RECV state
Jason commit made checks against ACK sequence less strict
and can be exploited by attackers to establish spoofed flows
with less probes.
Innocent users might use tcp_rmem[1] == 1,000,000,000,
or something more reasonable.
An attacker can use a regular TCP connection to learn the server
initial tp->rcv_wnd, and use it to optimize the attack.
If we make sure that only the announced window (smaller than 65535)
is used for ACK validation, we force an attacker to use
65537 packets to complete the 3WHS (assuming server ISN is unknown)
Fixes:
|
||
|
|
be008726d0 |
net: gro: initialize network_offset in network layer
Syzkaller was able to trigger
kernel BUG at net/core/gro.c:424 !
RIP: 0010:gro_pull_from_frag0 net/core/gro.c:424 [inline]
RIP: 0010:gro_try_pull_from_frag0 net/core/gro.c:446 [inline]
RIP: 0010:dev_gro_receive+0x242f/0x24b0 net/core/gro.c:571
Due to using an incorrect NAPI_GRO_CB(skb)->network_offset.
The referenced commit sets this offset to 0 in skb_gro_reset_offset.
That matches the expected case in dev_gro_receive:
pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
&gro_list->list, skb);
But syzkaller injected an skb with protocol ETH_P_TEB into an ip6gre
device (by writing the IP6GRE encapsulated version to a TAP device).
The result was a first call to eth_gro_receive, and thus an extra
ETH_HLEN in network_offset that should not be there. First issue hit
is when computing offset from network header in ipv6_gro_pull_exthdrs.
Initialize both offsets in the network layer gro_receive.
This pairs with all reads in gro_receive, which use
skb_gro_receive_network_offset().
Fixes:
|
||
|
|
7b05ab85e2 |
ipv4: Fix address dump when IPv4 is disabled on an interface
Cited commit started returning an error when user space requests to dump
the interface's IPv4 addresses and IPv4 is disabled on the interface.
Restore the previous behavior and do not return an error.
Before cited commit:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff
inet6 fe80::e040:68ff:fe98:d018/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff
After cited commit:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 32:2d:69:f2:9c:99 brd ff:ff:ff:ff:ff:ff
inet6 fe80::302d:69ff:fef2:9c99/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
RTNETLINK answers: No such device
Dump terminated
With this patch:
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff
inet6 fe80::dc17:56ff:febb:57c0/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 67
# ip address show dev dummy1
10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff
I fixed the exact same issue for IPv6 in commit
|
||
|
|
2786ae339e |
bpf-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZlTGFAAKCRDbK58LschI g5NXAP0QRn8nBSxJHIswFSOwRiCyhOhR7YL2P0c+RGcRMA+ZSAD9E1cwsYXsPu3L ummQ52AMaMfouHg6aW+rFIoupkGSnwc= =QctA -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-05-27 We've added 15 non-merge commits during the last 7 day(s) which contain a total of 18 files changed, 583 insertions(+), 55 deletions(-). The main changes are: 1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread while the promise was to filter by process, from Andrii Nakryiko. 2) Fix the recent influx of syzkaller reports to sockmap which triggered a locking rule violation by performing a map_delete, from Jakub Sitnicki. 3) Fixes to netkit driver in particular on skb->pkt_type override upon pass verdict, from Daniel Borkmann. 4) Fix an integer overflow in resolve_btfids which can wrongly trigger build failures, from Friedrich Vock. 5) Follow-up fixes for ARC JIT reported by static analyzers, from Shahab Vahedi. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Cover verifier checks for mutating sockmap/sockhash Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem" bpf: Allow delete from sockmap/sockhash only if update is allowed selftests/bpf: Add netkit test for pkt_type selftests/bpf: Add netkit tests for mac address netkit: Fix pkt_type override upon netkit pass verdict netkit: Fix setting mac address in l2 mode ARC, bpf: Fix issues reported by the static analyzers selftests/bpf: extend multi-uprobe tests with USDTs selftests/bpf: extend multi-uprobe tests with child thread case libbpf: detect broken PID filtering logic for multi-uprobe bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic bpf: fix multi-uprobe PID filtering logic bpf: Fix potential integer overflow in resolve_btfids MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT ==================== Link: https://lore.kernel.org/r/20240527203551.29712-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
3b9ce0491a |
Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"
This reverts commit
|
||
|
|
51d1b25a72 |
af_unix: Read sk->sk_hash under bindlock during bind().
syzkaller reported data-race of sk->sk_hash in unix_autobind() [0],
and the same ones exist in unix_bind_bsd() and unix_bind_abstract().
The three bind() functions prefetch sk->sk_hash locklessly and
use it later after validating that unix_sk(sk)->addr is NULL under
unix_sk(sk)->bindlock.
The prefetched sk->sk_hash is the hash value of unbound socket set
in unix_create1() and does not change until bind() completes.
There could be a chance that sk->sk_hash changes after the lockless
read. However, in such a case, non-NULL unix_sk(sk)->addr is visible
under unix_sk(sk)->bindlock, and bind() returns -EINVAL without using
the prefetched value.
The KCSAN splat is false-positive, but let's silence it by reading
sk->sk_hash under unix_sk(sk)->bindlock.
[0]:
BUG: KCSAN: data-race in unix_autobind / unix_autobind
write to 0xffff888034a9fb88 of 4 bytes by task 4468 on cpu 0:
__unix_set_addr_hash net/unix/af_unix.c:331 [inline]
unix_autobind+0x47a/0x7d0 net/unix/af_unix.c:1185
unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
__sys_connect_file+0xd7/0xe0 net/socket.c:2048
__sys_connect+0x114/0x140 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x40/0x50 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x46/0x4e
read to 0xffff888034a9fb88 of 4 bytes by task 4465 on cpu 1:
unix_autobind+0x28/0x7d0 net/unix/af_unix.c:1134
unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373
__sys_connect_file+0xd7/0xe0 net/socket.c:2048
__sys_connect+0x114/0x140 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x40/0x50 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x46/0x4e
value changed: 0x000000e4 -> 0x000001e3
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 6.8.0-12822-gcd51db110a7e #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Fixes:
|
||
|
|
97e1db06c7 |
af_unix: Annotate data-race around unix_sk(sk)->addr.
Once unix_sk(sk)->addr is assigned under net->unx.table.locks and unix_sk(sk)->bindlock, *(unix_sk(sk)->addr) and unix_sk(sk)->path are fully set up, and unix_sk(sk)->addr is never changed. unix_getname() and unix_copy_addr() access the two fields locklessly, and commit |
||
|
|
3998d18426 |
netkit: Fix pkt_type override upon netkit pass verdict
When running Cilium connectivity test suite with netkit in L2 mode, we
found that compared to tcx a few tests were failing which pushed traffic
into an L7 proxy sitting in host namespace. The problem in particular is
around the invocation of eth_type_trans() in netkit.
In case of tcx, this is run before the tcx ingress is triggered inside
host namespace and thus if the BPF program uses the bpf_skb_change_type()
helper the newly set type is retained. However, in case of netkit, the
late eth_type_trans() invocation overrides the earlier decision from the
BPF program which eventually leads to the test failure.
Instead of eth_type_trans(), split out the relevant parts, meaning, reset
of mac header and call to eth_skb_pkt_type() before the BPF program is run
in order to have the same behavior as with tcx, and refactor a small helper
called eth_skb_pull_mac() which is run in case it's passed up the stack
where the mac header must be pulled. With this all connectivity tests pass.
Fixes:
|
||
|
|
6d69b6c12f |
NFS client updates for Linux 6.10
Highlights include: Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmZPpYMACgkQZwvnipYK APITOw//acjE9YTZcST9kgkf2bfwuHFcdxvMZAr4MV0YsfqMesU2MYmaK/5YMLyo iNCHjLmlfE2iLAUqvFtakc1F3guACJqqFfMdnMHa1MwPznrL3yNNClGnBamovbPd XK2MBgpQBXb+xLxqH0A2TtOK2ofk0CFzb3x9eaziox8omBM2j3v6ZARsDHYehuhM Hig8IxW/kZ7kx5jxqSVktrgW3gDKqIuLssF6fJVINzh45jHC5QO98cuSwetx6Mi1 Aw04HbOE6B66ORrzC1wyGN3PwOkTW2kgAiyB6UNNt+Hnvr0RD5TEqf3s3mzmhP9N 7LJ3H1Okxdcpn0G/bR4LBUg26r5BWxhfPiTYG/l9vAQk65yt2LO1kFzXbECBEfaG ctGG7/7mMLVPs05kIFYm5S0cIYW2dYNuE20JY50LMaCIopjThdfruQj3yR4xibSt bHrAbG9wW4qg/cgx860t5h7nbZnD5OOYIqKOCDRNrUfP7P0mK/tD49HggLjDo47M vIMlYS3bTNSF7uEPTrv6bFr8XOD1I3BVXDQwGaJMZ8zyhkUIQtKO70+i4xM1E/Wl Jw5Z6NpM8saDD449ZqX4IRUPDAhvz4v00QqD3Tqr4MHEc5sWi898S7XcJgL3bEai QMJmBkAK8aDAP/suPw8VQc9wqplFNlB+QEh87p2WO+yRoEucn+A= =HMSC -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement |
||
|
|
66ad4829dd |
Quite smaller than usual. Notably it includes the fix for the unix
regression you have been notified of in the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled. - eth: ixgbe: fix link breakage vs cisco switches. - eth: ice: prevent ethtool from corrupting the channels. Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support. - tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). Misc: - a bunch of selftests stabilization patches. Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmZPXmUSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOk/o4QAJTA/LcQmHkObgQWyJ7vSykhRFmxSsfR Qc/DstWuNkM+xDbasdjlxaM+BPgf0RduyB/bsPOr8UvGw0S0NUwQBC9V9bgQ0p67 D9qrZH6gEDRbzG+mkbF49SXksJMSdNSygWc4YnYaCW+eufpCaZwN15q+4pAgAWfW UmSra9wCkgl9nRc7N4+UEJbhhi0Lso/yaRlHUUUooHOP0ENDe3JSKidUyS3UuhYc Ah75gKIMm9BygUhg/+mrsRyeb1kfXMfJ54ku/uEIimErG4rTntCJCAc+dBoRXtob pImg4xfgr1OBL1wQKTHM+nvhE+DThLAJOSguX2RYvTvklx/l00tL1PQkA/kn6XNM HdQGnDoN1JpUs3xw90hxWp0gzOwJ1XCjbXT/Dx2kp+ltFj0A1EZViTNNTgh6y2E0 B5oo8NFD0y02ilMdaGW/KOpceglO82p2P4DEc0kBAYvCICQ8MKMdtThuubQeB0FK EO7Xs7lKbDXLJUDtmN4EiE1sofvLVD+1htGt5FG2jtizyQ5Ho/b2aTk2uq0kRN3F mZgaXcNR3sOJGBdaTvzquALZ2Dt69w0D3EHGv/30tD5zwQO8j71W5OoWTnjknWUp Nh7ytL/YlqvwJI47UuuTeDBh95jb/KpTWFv8EYsQLI0JOTfa1VXsoDxidg6rnHuX mvLdIOtzTZqU =zd2T -----END PGP SIGNATURE----- Merge tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ... |
||
|
|
d6a326d694 |
tracing: Remove second argument of __assign_str()
The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit |
||
|
|
2ef32ad224 |
virtio: features, fixes, cleanups
Several new features here: - virtio-net is finally supported in vduse. - Virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster. Fixes, cleanups all over the place. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmZN570PHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRp2JUH/1K3fZOHymop6Y5Z3USFS7YdlF+dniedY/vg TKyWERkXOlxq1d9DVxC0mN7tk72DweuWI0YJjLXofrEW1VuW29ecSbyFXxpeWJls b7ErffxDAFRas5jkMCngD8TuFnbEegU0mGP5kbiHpEndBydQ2hH99Gg0x7swW+cE xsvU5zonCCLwLGIP2DrVrn9qGOHtV6o8eZfVKDVXfvicn3lFBkUSxlwEYsO9RMup aKxV4FT2Pb1yBicwBK4TH1oeEXqEGy1YLEn+kAHRbgoC/5L0/LaiqrkzwzwwOIPj uPGkacf8CIbX0qZo5EzD8kvfcYL1xhU3eT9WBmpp2ZwD+4bINd4= =nax1 -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ... |
||
|
|
6671e35249 |
nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()
When nci_rx_work() receives a zero-length payload packet, it should not
discard the packet and exit the loop. Instead, it should continue
processing subsequent packets.
Fixes:
|
||
|
|
26afda78cd |
net: relax socket state check at accept time.
Christoph reported the following splat: WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0 Modules linked in: CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759 Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80 RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293 RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64 R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000 R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800 FS: 000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786 do_accept+0x435/0x620 net/socket.c:1929 __sys_accept4_file net/socket.c:1969 [inline] __sys_accept4+0x9b/0x110 net/socket.c:1999 __do_sys_accept net/socket.c:2016 [inline] __se_sys_accept net/socket.c:2013 [inline] __x64_sys_accept+0x7d/0x90 net/socket.c:2013 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x4315f9 Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300 R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055 </TASK> The reproducer invokes shutdown() before entering the listener status. After commit |
||
|
|
378979e94e |
tcp: remove 64 KByte limit for initial tp->rcv_wnd value
Recently, we had some servers upgraded to the latest kernel and noticed the indicator from the user side showed worse results than before. It is caused by the limitation of tp->rcv_wnd. In 2018 commit |
||
|
|
91e61dd7a0 |
tls: fix missing memory barrier in tls_init
In tls_init(), a write memory barrier is missing, and store-store
reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
CPU0 CPU1
----- -----
// In tls_init()
// In tls_ctx_create()
ctx = kzalloc()
ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
// In update_sk_prot()
WRITE_ONCE(sk->sk_prot, tls_prots) -(2)
// In sock_common_setsockopt()
READ_ONCE(sk->sk_prot)->setsockopt()
// In tls_{setsockopt,getsockopt}()
ctx->sk_proto->setsockopt() -(3)
In the above scenario, when (1) and (2) are reordered, (3) can observe
the NULL value of ctx->sk_proto, causing NULL dereference.
To fix it, we rely on rcu_assign_pointer() which implies the release
barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is
initialized, we can ensure that ctx->sk_proto are visible when
changing sk->sk_prot.
Fixes:
|
||
|
|
2c92ca849f |
tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.
This means that with:
__string(field, mystring)
Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.
There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:
git grep -l __assign_str | while read a ; do
sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
mv /tmp/test-file $a;
done
I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.
Note, the same updates will need to be done for:
__assign_str_len()
__assign_rel_str()
__assign_rel_str_len()
I tested this with both an allmodconfig and an allyesconfig (build only for both).
[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
|
||
|
|
b1c16d4a33 |
vsock/virtio: drop owner assignment
virtio core already sets the .owner, so driver does not need to. Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-19-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> |
||
|
|
d26dd255ce |
net: 9p: virtio: drop owner assignment
virtio core already sets the .owner, so driver does not need to. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-18-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> |
||
|
|
2a8120d7b4 |
more s390 updates for 6.10 merge window
- Switch read and write software bits for PUDs - Add missing hardware bits for PUDs and PMDs - Generate unwind information for C modules to fix GDB unwind error for vDSO functions - Create .build-id links for unstripped vDSO files to enable vDSO debugging with symbols - Use standard stack frame layout for vDSO generated stack frames to manually walk stack frames without DWARF information - Rework perf_callchain_user() and arch_stack_walk_user() functions to reduce code duplication - Skip first stack frame when walking user stack - Add basic checks to identify invalid instruction pointers when walking stack frames - Introduce and use struct stack_frame_vdso_wrapper within vDSO user wrapper code to automatically generate an asm-offset define. Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document that the code works with user space stack - Clear the backchain of the extra stack frame added by the vDSO user wrapper code. This allows the user stack walker to detect and skip the non-standard stack frame. Without this an incorrect instruction pointer would be added to stack traces. - Rewrite psw_idle() function in C to ease maintenance and further enhancements - Remove get_vtimer() function and use get_cpu_timer() instead - Mark psw variable in __load_psw_mask() as __unitialized to avoid superfluous clearing of PSW - Remove obsolete and superfluous comment about removed TIF_FPU flag - Replace memzero_explicit() and kfree() with kfree_sensitive() to fix warnings reported by Coccinelle - Wipe sensitive data and all copies of protected- or secure-keys from stack when an IOCTL fails - Both do_airq_interrupt() and do_io_interrupt() functions set CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code - Provide iucv_alloc_device() and iucv_release_device() helpers, which can be used to deduplicate more or less identical IUCV device allocation and release code in four different drivers - Make use of iucv_alloc_device() and iucv_release_device() helpers to get rid of quite some code and also remove a cast to an incompatible function (clang W=1) - There is no user of iucv_root outside of the core IUCV code left. Therefore remove the EXPORT_SYMBOL - __apply_alternatives() contains a runtime check which verifies that the size of the to be patched code area is even. Convert this to a compile time check - Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands from 128 to 240 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally allowed - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL block on 'scp_data' sysfs attribute update - Initialize the correct fields of the NVMe dump block, which were confused with FCP fields - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce code duplication - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such as dumpconf passing additional kernel command line parameters to a stand-alone dumper - Rework the CPACF query functions to use the correct RRE or RRF instruction formats and set instruction register fields correctly - Instead of calling BUG() at runtime force a link error during compile when a unsupported opcode is used with __cpacf_query() or __cpacf_check_opcode() functions - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask or /sys/bus/ap/aqmask sysfs file update with a relative mask value - Fix "bindings complete" udev event which should be sent once all AP devices have been bound to device drivers and again when unbind/bind actions take place and all AP devices are bound again - Facility list alt_stfle_fac_list is nowhere used in the decompressor, therefore remove it there - Remove custom kprobes insn slot allocator in favour of the standard module_alloc() one, since kernel image and module areas are located within 4GB - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid calling memset() with a large byte count and get rid of the sparse warning as result -----BEGIN PGP SIGNATURE----- iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZkyx2BccYWdvcmRlZXZA bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8PYZAP9KxEfTyUmIh61Gx8+m3BW5dy7p E2Q8yotlUpGj49ul+AD8CEAyTiWR95AlMOVZZLV/0J7XIjhALvpKAGfiJWkvXgc= =pife -----END PGP SIGNATURE----- Merge tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Alexander Gordeev: - Switch read and write software bits for PUDs - Add missing hardware bits for PUDs and PMDs - Generate unwind information for C modules to fix GDB unwind error for vDSO functions - Create .build-id links for unstripped vDSO files to enable vDSO debugging with symbols - Use standard stack frame layout for vDSO generated stack frames to manually walk stack frames without DWARF information - Rework perf_callchain_user() and arch_stack_walk_user() functions to reduce code duplication - Skip first stack frame when walking user stack - Add basic checks to identify invalid instruction pointers when walking stack frames - Introduce and use struct stack_frame_vdso_wrapper within vDSO user wrapper code to automatically generate an asm-offset define. Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document that the code works with user space stack - Clear the backchain of the extra stack frame added by the vDSO user wrapper code. This allows the user stack walker to detect and skip the non-standard stack frame. Without this an incorrect instruction pointer would be added to stack traces. - Rewrite psw_idle() function in C to ease maintenance and further enhancements - Remove get_vtimer() function and use get_cpu_timer() instead - Mark psw variable in __load_psw_mask() as __unitialized to avoid superfluous clearing of PSW - Remove obsolete and superfluous comment about removed TIF_FPU flag - Replace memzero_explicit() and kfree() with kfree_sensitive() to fix warnings reported by Coccinelle - Wipe sensitive data and all copies of protected- or secure-keys from stack when an IOCTL fails - Both do_airq_interrupt() and do_io_interrupt() functions set CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code - Provide iucv_alloc_device() and iucv_release_device() helpers, which can be used to deduplicate more or less identical IUCV device allocation and release code in four different drivers - Make use of iucv_alloc_device() and iucv_release_device() helpers to get rid of quite some code and also remove a cast to an incompatible function (clang W=1) - There is no user of iucv_root outside of the core IUCV code left. Therefore remove the EXPORT_SYMBOL - __apply_alternatives() contains a runtime check which verifies that the size of the to be patched code area is even. Convert this to a compile time check - Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands from 128 to 240 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally allowed - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL block on 'scp_data' sysfs attribute update - Initialize the correct fields of the NVMe dump block, which were confused with FCP fields - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce code duplication - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such as dumpconf passing additional kernel command line parameters to a stand-alone dumper - Rework the CPACF query functions to use the correct RRE or RRF instruction formats and set instruction register fields correctly - Instead of calling BUG() at runtime force a link error during compile when a unsupported opcode is used with __cpacf_query() or __cpacf_check_opcode() functions - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask or /sys/bus/ap/aqmask sysfs file update with a relative mask value - Fix "bindings complete" udev event which should be sent once all AP devices have been bound to device drivers and again when unbind/bind actions take place and all AP devices are bound again - Facility list alt_stfle_fac_list is nowhere used in the decompressor, therefore remove it there - Remove custom kprobes insn slot allocator in favour of the standard module_alloc() one, since kernel image and module areas are located within 4GB - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid calling memset() with a large byte count and get rid of the sparse warning as result * tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits) s390/zcrypt: Use kvcalloc() instead of kvmalloc_array() s390/kprobes: Remove custom insn slot allocator s390/boot: Remove alt_stfle_fac_list from decompressor s390/ap: Fix bind complete udev event sent after each AP bus scan s390/ap: Fix crash in AP internal function modify_bitmap() s390/cpacf: Make use of invalid opcode produce a link error s390/cpacf: Split and rework cpacf query functions s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data' s390/ipl: Fix incorrect initialization of nvme dump block s390/ipl: Fix incorrect initialization of len fields in nvme reipl block s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds s390/alternatives: Convert runtime sanity check into compile time check s390/iucv: Unexport iucv_root tty: hvc-iucv: Make use of iucv_alloc_device() s390/smsgiucv_app: Make use of iucv_alloc_device() s390/netiucv: Make use of iucv_alloc_device() s390/vmlogrdr: Make use of iucv_alloc_device() s390/iucv: Provide iucv_alloc_device() / iucv_release_device() ... |
||
|
|
30a92c9e3d |
openvswitch: Set the skbuff pkt_type for proper pmtud support.
Open vSwitch is originally intended to switch at layer 2, only dealing with
Ethernet frames. With the introduction of l3 tunnels support, it crossed
into the realm of needing to care a bit about some routing details when
making forwarding decisions. If an oversized packet would need to be
fragmented during this forwarding decision, there is a chance for pmtu
to get involved and generate a routing exception. This is gated by the
skbuff->pkt_type field.
When a flow is already loaded into the openvswitch module this field is
set up and transitioned properly as a packet moves from one port to
another. In the case that a packet execute is invoked after a flow is
newly installed this field is not properly initialized. This causes the
pmtud mechanism to omit sending the required exception messages across
the tunnel boundary and a second attempt needs to be made to make sure
that the routing exception is properly setup. To fix this, we set the
outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
to the openvswitch module via a port device or packet command.
Even for bridge ports as users, the pkt_type needs to be reset when
doing the transmit as the packet is truly outgoing and routing needs
to get involved post packet transformations, in the case of
VXLAN/GENEVE/udp-tunnel packets. In general, the pkt_type on output
gets ignored, since we go straight to the driver, but in the case of
tunnel ports they go through IP routing layer.
This issue is periodically encountered in complex setups, such as large
openshift deployments, where multiple sets of tunnel traversal occurs.
A way to recreate this is with the ovn-heater project that can setup
a networking environment which mimics such large deployments. We need
larger environments for this because we need to ensure that flow
misses occur. In these environment, without this patch, we can see:
./ovn_cluster.sh start
podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache
podman exec ovn-chassis-1 ip netns exec sw01p1 \
ping 21.0.0.3 -M do -s 1300 -c2
PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
--- 21.0.0.3 ping statistics ---
...
Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
sent into the server.
With this patch, setting the pkt_type, we see the following:
podman exec ovn-chassis-1 ip netns exec sw01p1 \
ping 21.0.0.3 -M do -s 1300 -c2
PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
ping: local error: message too long, mtu=1222
--- 21.0.0.3 ping statistics ---
...
In this case, the first ping request receives the FRAG_NEEDED message and
a local routing exception is created.
Tested-by: Jaime Caamano <jcaamano@redhat.com>
Reported-at: https://issues.redhat.com/browse/FDP-164
Fixes:
|
||
|
|
25460d6f39 |
net/9p: fix uninit-value in p9_client_rpc()
Syzbot with the help of KMSAN reported the following error:
BUG: KMSAN: uninit-value in trace_9p_client_res include/trace/events/9p.h:146 [inline]
BUG: KMSAN: uninit-value in p9_client_rpc+0x1314/0x1340 net/9p/client.c:754
trace_9p_client_res include/trace/events/9p.h:146 [inline]
p9_client_rpc+0x1314/0x1340 net/9p/client.c:754
p9_client_create+0x1551/0x1ff0 net/9p/client.c:1031
v9fs_session_init+0x1b9/0x28e0 fs/9p/v9fs.c:410
v9fs_mount+0xe2/0x12b0 fs/9p/vfs_super.c:122
legacy_get_tree+0x114/0x290 fs/fs_context.c:662
vfs_get_tree+0xa7/0x570 fs/super.c:1797
do_new_mount+0x71f/0x15e0 fs/namespace.c:3352
path_mount+0x742/0x1f20 fs/namespace.c:3679
do_mount fs/namespace.c:3692 [inline]
__do_sys_mount fs/namespace.c:3898 [inline]
__se_sys_mount+0x725/0x810 fs/namespace.c:3875
__x64_sys_mount+0xe4/0x150 fs/namespace.c:3875
do_syscall_64+0xd5/0x1f0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Uninit was created at:
__alloc_pages+0x9d6/0xe70 mm/page_alloc.c:4598
__alloc_pages_node include/linux/gfp.h:238 [inline]
alloc_pages_node include/linux/gfp.h:261 [inline]
alloc_slab_page mm/slub.c:2175 [inline]
allocate_slab mm/slub.c:2338 [inline]
new_slab+0x2de/0x1400 mm/slub.c:2391
___slab_alloc+0x1184/0x33d0 mm/slub.c:3525
__slab_alloc mm/slub.c:3610 [inline]
__slab_alloc_node mm/slub.c:3663 [inline]
slab_alloc_node mm/slub.c:3835 [inline]
kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852
p9_tag_alloc net/9p/client.c:278 [inline]
p9_client_prepare_req+0x20a/0x1770 net/9p/client.c:641
p9_client_rpc+0x27e/0x1340 net/9p/client.c:688
p9_client_create+0x1551/0x1ff0 net/9p/client.c:1031
v9fs_session_init+0x1b9/0x28e0 fs/9p/v9fs.c:410
v9fs_mount+0xe2/0x12b0 fs/9p/vfs_super.c:122
legacy_get_tree+0x114/0x290 fs/fs_context.c:662
vfs_get_tree+0xa7/0x570 fs/super.c:1797
do_new_mount+0x71f/0x15e0 fs/namespace.c:3352
path_mount+0x742/0x1f20 fs/namespace.c:3679
do_mount fs/namespace.c:3692 [inline]
__do_sys_mount fs/namespace.c:3898 [inline]
__se_sys_mount+0x725/0x810 fs/namespace.c:3875
__x64_sys_mount+0xe4/0x150 fs/namespace.c:3875
do_syscall_64+0xd5/0x1f0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
If p9_check_errors() fails early in p9_client_rpc(), req->rc.tag
will not be properly initialized. However, trace_9p_client_res()
ends up trying to print it out anyway before p9_client_rpc()
finishes.
Fix this issue by assigning default values to p9_fcall fields
such as 'tag' and (just in case KMSAN unearths something new) 'id'
during the tag allocation stage.
Reported-and-tested-by: syzbot+ff14db38f56329ef68df@syzkaller.appspotmail.com
Fixes:
|
||
|
|
041933a1ec |
af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS
GC attempts to explicitly drop oob_skb's reference before purging the hit
list.
The problem is with embryos: kfree_skb(u->oob_skb) is never called on an
embryo socket.
The python script below [0] sends a listener's fd to its embryo as OOB
data. While GC does collect the embryo's queue, it fails to drop the OOB
skb's refcount. The skb which was in embryo's receive queue stays as
unix_sk(sk)->oob_skb and keeps the listener's refcount [1].
Tell GC to dispose embryo's oob_skb.
[0]:
from array import array
from socket import *
addr = '\x00unix-oob'
lis = socket(AF_UNIX, SOCK_STREAM)
lis.bind(addr)
lis.listen(1)
s = socket(AF_UNIX, SOCK_STREAM)
s.connect(addr)
scm = (SOL_SOCKET, SCM_RIGHTS, array('i', [lis.fileno()]))
s.sendmsg([b'x'], [scm], MSG_OOB)
lis.close()
[1]
$ grep unix-oob /proc/net/unix
$ ./unix-oob.py
$ grep unix-oob /proc/net/unix
0000000000000000: 00000002 00000000 00000000 0001 02 0 @unix-oob
0000000000000000: 00000002 00000000 00010000 0001 01 6072 @unix-oob
Fixes:
|
||
|
|
3ebc46ca86 |
tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
In dctcp_update_alpha(), we use a module parameter dctcp_shift_g
as follows:
alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g);
...
delivered_ce <<= (10 - dctcp_shift_g);
It seems syzkaller started fuzzing module parameters and triggered
shift-out-of-bounds [0] by setting 100 to dctcp_shift_g:
memcpy((void*)0x20000080,
"/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47);
res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul,
/*flags=*/2ul, /*mode=*/0ul);
memcpy((void*)0x20000000, "100\000", 4);
syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul);
Let's limit the max value of dctcp_shift_g by param_set_uint_minmax().
With this patch:
# echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
# cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g
10
# echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
-bash: echo: write error: Invalid argument
[0]:
UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12
shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int')
CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f3561 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114
ubsan_epilogue lib/ubsan.c:231 [inline]
__ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468
dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143
tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline]
tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948
tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711
tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937
sk_backlog_rcv include/net/sock.h:1106 [inline]
__release_sock+0x20f/0x350 net/core/sock.c:2983
release_sock+0x61/0x1f0 net/core/sock.c:3549
mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907
mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976
__mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072
mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127
inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437
__sock_release net/socket.c:659 [inline]
sock_close+0xc0/0x240 net/socket.c:1421
__fput+0x41b/0x890 fs/file_table.c:422
task_work_run+0x23b/0x300 kernel/task_work.c:180
exit_task_work include/linux/task_work.h:38 [inline]
do_exit+0x9c8/0x2540 kernel/exit.c:878
do_group_exit+0x201/0x2b0 kernel/exit.c:1027
__do_sys_exit_group kernel/exit.c:1038 [inline]
__se_sys_exit_group kernel/exit.c:1036 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x67/0x6f
RIP: 0033:0x7f6c2b5005b6
Code: Unable to access opcode bytes at 0x7f6c2b50058c.
RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6
RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0
R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
</TASK>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Yue Sun <samsun1006219@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/
Fixes:
|
||
|
|
efb9f4f19f |
ipv6: sr: fix memleak in seg6_hmac_init_algo
seg6_hmac_init_algo returns without cleaning up the previous allocations
if one fails, so it's going to leak all that memory and the crypto tfms.
Update seg6_hmac_exit to only free the memory when allocated, so we can
reuse the code directly.
Fixes:
|
||
|
|
9841991a44 |
af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.
Billy Jheng Bing-Jhong reported a race between __unix_gc() and
queue_oob().
__unix_gc() tries to garbage-collect close()d inflight sockets,
and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
will drop the reference and set NULL to it locklessly.
However, the peer socket still can send MSG_OOB message and
queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading
NULL pointer dereference. [0]
To fix the issue, let's update unix_sk(sk)->oob_skb under the
sk_receive_queue's lock and take it everywhere we touch oob_skb.
Note that we defer kfree_skb() in manage_oob() to silence lockdep
false-positive (See [1]).
[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000008
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: events delayed_fput
RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
FS: 0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<TASK>
unix_release_sock (net/unix/af_unix.c:654)
unix_release (net/unix/af_unix.c:1050)
__sock_release (net/socket.c:660)
sock_close (net/socket.c:1423)
__fput (fs/file_table.c:423)
delayed_fput (fs/file_table.c:444 (discriminator 3))
process_one_work (kernel/workqueue.c:3259)
worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
Modules linked in:
CR2: 0000000000000008
Link: https://lore.kernel.org/netdev/a00d3993-c461-43f2-be6d-07259c98509a@rbox.co/ [1]
Fixes:
|
||
|
|
aff5c01fa1 |
netfilter: nft_payload: restore vlan q-in-q match support
Revert |
||
|
|
c1193d9bbb |
netfilter: ipset: Add list flush to cancel_gc
Flushing list in cancel_gc drops references to other lists right away,
without waiting for RCU to destroy list. Fixes race when referenced
ipsets can't be destroyed while referring list is scheduled for destroy.
Fixes:
|
||
|
|
daa121128a |
dma-mapping updates for Linux 6.10
- optimize DMA sync calls when they are no-ops (Alexander Lobakin) - fix swiotlb padding for untrusted devices (Michael Kelley) - add documentation for swiotb (Michael Kelley) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmZLV+gLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPO7hAAlKuXigzwcrVEUnfRGRdaZ28xbmffyC1dPfw8HRZe xJqvD51aJ/VOoOCcUyt3hNLEQHwtjEk4eM0xGcAASMdwceU58doJCcDJBpbbgbDK CPKJgBLQBC1JfAJUpRiJkV4RsudRhAyndIzUPVgkz0WObpEgDpfO0ClHRF/0Pavy 1sBFVFMbB1ewb/D8ffpp+DWfwrwu0oMC3A2LkYu2F5SQFWuVOpbNemrnZ6K2ckPt 2mcLpJ308+sti8Ka/LrI2akU8JCLYMYDQnue/44v3X3Gm63cMcEx/fj5M5x6m71n P+cxAkjsGDHybnfjbUvR842to8msRsH4CI4Zbb69+5HDlWSadM8JhQd74oeii6o6 RiGPrrFEk7vCxFOkUsqGFYMykEX+71wXfQ1Mpp/b4QgdqBLkxW4ozQ3Ya7ASUs2z TLLmQvIXtYKGnyU+RdOkvS6piHjd4wVHOhuGVdXqVT7WrbaPeovY4TNSTV2ZA1gE 9Y5RCdrX9xeGGNjsYXKwsWGvXVsm6UTQmQVUsatQb3ic+K3S6tQR9pwzk0HmhMuM BscWHSAEL7T8ZZ5Ydph45Cw/6xdH7LggD+nRtLcdAuzCika12eabZHsO0DrF533n qXYOjZOgsMEZWICynxq6+EGQKGWY+F+GyKDMU2w2Es5OgMa9Bqb40aSF+Q887s96 xwI= =Pa8W -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - optimize DMA sync calls when they are no-ops (Alexander Lobakin) - fix swiotlb padding for untrusted devices (Michael Kelley) - add documentation for swiotb (Michael Kelley) * tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping: dma: fix DMA sync for drivers not calling dma_set_mask*() xsk: use generic DMA sync shortcut instead of a custom one page_pool: check for DMA sync shortcut earlier page_pool: don't use driver-set flags field directly page_pool: make sure frag API fields don't span between cachelines iommu/dma: avoid expensive indirect calls for sync operations dma: avoid redundant calls for sync operations dma: compile-out DMA sync op calls when not used iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices swiotlb: remove alloc_size argument to swiotlb_tbl_map_single() Documentation/core-api: add swiotlb documentation |
||
|
|
4836da2197 |
rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL
Under the scenario of IB device bonding, when bringing down one of the
ports, or all ports, we saw xprtrdma entering a non-recoverable state
where it is not even possible to complete the disconnect and shut it
down the mount, requiring a reboot. Following debug, we saw that
transport connect never ended after receiving the
RDMA_CM_EVENT_DEVICE_REMOVAL callback.
The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is
connected, and ESTABLISHED may not have happened. So need to work with
each of these states accordingly.
Fixes:
|
||
|
|
0dc9f43002 |
sunrpc: fix NFSACL RPC retry on soft mount
It used to be quite awhile ago since |
||
|
|
9b62ef6d23 |
SUNRPC: fix handling expired GSS context
In the case where we have received a successful reply to an RPC request, but while processing the reply the client in rpc_decode_header() finds an expired context, the code ends up propagating the error to the caller instead of getting a new context and retrying the request. To give more details, in rpc_decode_header() we call rpcauth_checkverf() will call into the gss and internally will at some point call gss_validate() which has a check if the current’s context lifetime expired, and it would fail. The reason for the failure gets ‘scrubbed’ and translated to EACCES so when we get back to rpc_decode_header() we just go to “out_verifier” which for that error would get converted to “out_garbage” (ie it’s treated as garballed reply) and the next action is call_encode. Which (1) doesn’t reencode or re-send (not to mention no upcall happens because context expires as that reason just not known) and it again fails in the same decoding process. After re-trying it 3 times the error is propagated back to the caller (ie nfs4_write_done_cb() in the case a failing write). To fix this, instead we need to look to the case where the server decides that context has expired and replies with an RPC auth error. In that case, the rpc_decode_header() goes to "out_msg_denied" in that we return EKEYREJECTED which in call_decode() is sent to “call_reserve” which triggers an upcalls and a re-try of the operation. The proposed fix is in case of a failed rpc_decode_header() to check if credentials were set to be invalid and use that as a proxy for deciding that context has expired and then treat is same way as receiving an auth error. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> |
||
|
|
e4a87abf58 |
nfc: nci: Fix uninit-value in nci_rx_work
syzbot reported the following uninit-value access issue [1]
nci_rx_work() parses received packet from ndev->rx_q. It should be
validated header size, payload size and total packet size before
processing the packet. If an invalid packet is detected, it should be
silently discarded.
Fixes:
|
||
|
|
5447f9708d |
ipv6: sr: fix missing sk_buff release in seg6_input_core
The seg6_input() function is responsible for adding the SRH into a packet, delegating the operation to the seg6_input_core(). This function uses the skb_cow_head() to ensure that there is sufficient headroom in the sk_buff for accommodating the link-layer header. In the event that the skb_cow_header() function fails, the seg6_input_core() catches the error but it does not release the sk_buff, which will result in a memory leak. This issue was introduced in commit |
||
|
|
dc21c6cc3d |
netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
syzbot reported that nf_reinject() could be called without rcu_read_lock() :
WARNING: suspicious RCU usage
6.9.0-rc7-syzkaller-02060-g5c1672705a1a #0 Not tainted
net/netfilter/nfnetlink_queue.c:263 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz-executor.4/13427:
#0: ffffffff8e334f60 (rcu_callback){....}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
#0: ffffffff8e334f60 (rcu_callback){....}-{0:0}, at: rcu_do_batch kernel/rcu/tree.c:2190 [inline]
#0: ffffffff8e334f60 (rcu_callback){....}-{0:0}, at: rcu_core+0xa86/0x1830 kernel/rcu/tree.c:2471
#1: ffff88801ca92958 (&inst->lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
#1: ffff88801ca92958 (&inst->lock){+.-.}-{2:2}, at: nfqnl_flush net/netfilter/nfnetlink_queue.c:405 [inline]
#1: ffff88801ca92958 (&inst->lock){+.-.}-{2:2}, at: instance_destroy_rcu+0x30/0x220 net/netfilter/nfnetlink_queue.c:172
stack backtrace:
CPU: 0 PID: 13427 Comm: syz-executor.4 Not tainted 6.9.0-rc7-syzkaller-02060-g5c1672705a1a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
lockdep_rcu_suspicious+0x221/0x340 kernel/locking/lockdep.c:6712
nf_reinject net/netfilter/nfnetlink_queue.c:323 [inline]
nfqnl_reinject+0x6ec/0x1120 net/netfilter/nfnetlink_queue.c:397
nfqnl_flush net/netfilter/nfnetlink_queue.c:410 [inline]
instance_destroy_rcu+0x1ae/0x220 net/netfilter/nfnetlink_queue.c:172
rcu_do_batch kernel/rcu/tree.c:2196 [inline]
rcu_core+0xafd/0x1830 kernel/rcu/tree.c:2471
handle_softirqs+0x2d6/0x990 kernel/softirq.c:554
__do_softirq kernel/softirq.c:588 [inline]
invoke_softirq kernel/softirq.c:428 [inline]
__irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
</IRQ>
<TASK>
Fixes:
|
||
|
|
61307b7be4 |
The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs. Notable
series include:
- Lucas Stach has provided some page-mapping
cleanup/consolidation/maintainability work in the series "mm/treewide:
Remove pXd_huge() API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one
test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being allocated:
number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in largely
similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene" Johannes
Weiner has fixed the page allocator's handling of migratetype requests,
with resulting improvements in compaction efficiency.
- In the series "make the hugetlb migration strategy consistent" Baolin
Wang has fixed a hugetlb migration issue, which should improve hugetlb
allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when memory
almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting" Kairui
Song has optimized pagecache insertion, yielding ~10% performance
improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various page->flags
cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb
functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series
"mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs. This
is a simple first-cut implementation for now. The series is "support
multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in the
series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts in
the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call it
GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to
use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes
the initialization code so that migration between different memory types
works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant driver
in the series "mm: follow_pte() improvements and acrn follow_pte()
fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to folio
in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size THP's
in the series "mm: add per-order mTHP alloc and swpout counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled
and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His series
"mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes
the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback instrumentation
in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series "Fix
and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the
series "Improve anon_vma scalability for anon VMAs". Intel's test bot
reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA
jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB
nvA4E0DcPrUAFy144FNM0NTCb7u9vAw=
=V3R/
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
|
||
|
|
61ea647ed1 |
NFSD 6.10 Release Notes
This is a light release containing mostly optimizations, code clean-
ups, and minor bug fixes. This development cycle has focused on non-
upstream kernel work:
1. Continuing to build upstream CI for NFSD, based on kdevops
2. Backporting NFSD filecache-related fixes to selected LTS kernels
One notable new feature in v6.10 NFSD is the addition of a new
netlink protocol dedicated to configuring NFSD. A new user space
tool, nfsdctl, is to be added to nfs-utils. Lots more to come here.
As always I am very grateful to NFSD contributors, reviewers,
testers, and bug reporters who participated during this cycle.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZHdB8ACgkQM2qzM29m
f5cSMhAApukZZQCSR9lcppVrv48vsTKHFup4qFG5upOtdHR8yuSI4HOfSb5F9gsO
fHJABFFtvlsTWVFotwUY7ljjtin00PK0bfn9kZnekvcZ1A5Yoly8SJmxK+jjnmAw
jaXT7XzGYWShYiRkIXb3NYE9uiC1VZOYURYTVbMwklg3jbsyp2M7ylnRKIqUO2Qt
bin6tdqnDx2H4Hou9k4csMX4sZJlXQZjQxzxhWuL1XrjEMlXREnklfppLzIlnJJt
eHFxTRhwPdcJ9CbGVsae7GNQeGUdgq7P/AIFuHWIruvxaknY7ZOp2Z/xnxaifeU+
O2Psh/9G7zmqFkeH01QwItita8rUdBwgTv0r7QPw8/lCd0xMieqFynNGtTGwWv0Q
1DC8RssM3axeHHfpTgXtkqfwFvKIyE6xKrvTCBZ8Pd8hsrWzbYI4d/oTe8rwXLZ6
sMD5wgsfagl6fd6G+4/9adFniOgpUi2xHmqJ5yyALyzUDeHiiqsOmxM2Rb0FN5YR
ixlNj7s9lmYbbMwQshNRhV/fOPQRvKvicHAyKO7Yko/seDf8NxwQfPX6M2j2esUG
Ld8lW1hGpBDWpF1YnA6AsC+Jr12+A4c2Lg95155R9Svumk6Fv/4MIftiWpO8qf/g
d66Q35eGr3BSSypP9KFEa7aegZdcJAlUpLhsd0Wj2rbei7gh0kU=
=tpVD
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This is a light release containing mostly optimizations, code clean-
ups, and minor bug fixes. This development cycle has focused on non-
upstream kernel work:
1. Continuing to build upstream CI for NFSD, based on kdevops
2. Backporting NFSD filecache-related fixes to selected LTS kernels
One notable new feature in v6.10 NFSD is the addition of a new netlink
protocol dedicated to configuring NFSD. A new user space tool,
nfsdctl, is to be added to nfs-utils. Lots more to come here.
As always I am very grateful to NFSD contributors, reviewers, testers,
and bug reporters who participated during this cycle"
* tag 'nfsd-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (29 commits)
NFSD: Force all NFSv4.2 COPY requests to be synchronous
SUNRPC: Fix gss_free_in_token_pages()
NFS/knfsd: Remove the invalid NFS error 'NFSERR_OPNOTSUPP'
knfsd: LOOKUP can return an illegal error value
nfsd: set security label during create operations
NFSD: Add COPY status code to OFFLOAD_STATUS response
NFSD: Record status of async copy operation in struct nfsd4_copy
SUNRPC: Remove comment for sp_lock
NFSD: add listener-{set,get} netlink command
SUNRPC: add a new svc_find_listener helper
SUNRPC: introduce svc_xprt_create_from_sa utility routine
NFSD: add write_version to netlink command
NFSD: convert write_threads to netlink command
NFSD: allow callers to pass in scope string to nfsd_svc
NFSD: move nfsd_mutex handling into nfsd_svc callers
lockd: host: Remove unnecessary statements'host = NULL;'
nfsd: don't create nfsv4recoverydir in nfsdfs when not used.
nfsd: optimise recalculate_deny_mode() for a common case
nfsd: add tracepoint in mark_client_expired_locked
nfsd: new tracepoint for check_slot_seqid
...
|
||
|
|
ff9a79307f |
Kbuild updates for v6.10
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent
code generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX
2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa
msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj
GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi
inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ
lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv
JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm
Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw
y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL
orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2
aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd
ZCSnZ31h7woGfNho
=D5B/
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Avoid 'constexpr', which is a keyword in C23
- Allow 'dtbs_check' and 'dt_compatible_check' run independently of
'dt_binding_check'
- Fix weak references to avoid GOT entries in position-independent code
generation
- Convert the last use of 'optional' property in arch/sh/Kconfig
- Remove support for the 'optional' property in Kconfig
- Remove support for Clang's ThinLTO caching, which does not work with
the .incbin directive
- Change the semantics of $(src) so it always points to the source
directory, which fixes Makefile inconsistencies between upstream and
downstream
- Fix 'make tar-pkg' for RISC-V to produce a consistent package
- Provide reasonable default coverage for objtool, sanitizers, and
profilers
- Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc.
- Remove the last use of tristate choice in drivers/rapidio/Kconfig
- Various cleanups and fixes in Kconfig
* tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits)
kconfig: use sym_get_choice_menu() in sym_check_prop()
rapidio: remove choice for enumeration
kconfig: lxdialog: remove initialization with A_NORMAL
kconfig: m/nconf: merge two item_add_str() calls
kconfig: m/nconf: remove dead code to display value of bool choice
kconfig: m/nconf: remove dead code to display children of choice members
kconfig: gconf: show checkbox for choice correctly
kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal
Makefile: remove redundant tool coverage variables
kbuild: provide reasonable defaults for tool coverage
modules: Drop the .export_symbol section from the final modules
kconfig: use menu_list_for_each_sym() in sym_check_choice_deps()
kconfig: use sym_get_choice_menu() in conf_write_defconfig()
kconfig: add sym_get_choice_menu() helper
kconfig: turn defaults and additional prompt for choice members into error
kconfig: turn missing prompt for choice members into error
kconfig: turn conf_choice() into void function
kconfig: use linked list in sym_set_changed()
kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED
kconfig: gconf: remove debug code
...
|
||
|
|
89721e3038 |
net-accept-more-20240515
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmZFcFwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuP1EADKJOJRtcvO/av2cUR+HZFDC+s/jBwHIJK+ 4UY633zQlxjqc7dt4rX7zk/uk4mkhZnsGY+wS6xH08kB3VO9YksrwREVt6Ur9lP8 UXNVPpPcZ7fFcIp41rYkZX9pTDp2N8z2qsVg7V8wcXJ7EeTXd6L4ZLjhfCiHvs2s i6yIEwLrW+voYuqSFV7vWBIM3mSXSRTIiO2DqRAOtT2lsj374DOthvP2lOSSb5wq 6TF4s4z3HMGs+HF3rjP5kJ6ic6RdC6i31lzEivUMhwCiKN1AZXdp96KXaC+NVPRV t5//EdS+pSenQgkg6XH7d5kzFoCUFJfVZt05w0GCqMA081Q9ySjUymN1zedJbGd9 8CDlW01N8XLqG6+F9yakJLSFY+mUFGduPuueTNiUJWP8kTkQCtYIRzZDeyjxQrE5 c17NW5S1uWkf26Ucyi1r+gxw9N4kGkuB3+NitC6DOc7BW5CocEIoqLWi/UH7cEZe 0v6loTakqBAdgh03RCDMUj9Rt/37pQs2KFT9/CazVpbkvkKsue4xK4K2CUFsxqOj qcoc/LD62at4S3AUWwhUIs3YaQ7v/6AY5hIktqAwsFHmDffUbPdRrXWY1keKIprJ 4qS/sY0M+kvKGnp+80fPVHab9l6/fMLfabIyFuh0M3W/M4eHGt2YfKWreoGEy/1x xLq2iq+ehw== =S6Xt -----END PGP SIGNATURE----- Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux Pull more io_uring updates from Jens Axboe: "This adds support for IORING_CQE_F_SOCK_NONEMPTY for io_uring accept requests. This is very similar to previous work that enabled the same hint for doing receives on sockets. By far the majority of the work here is refactoring to enable the networking side to pass back whether or not the socket had more pending requests after accepting the current one, the last patch just wires it up for io_uring. Not only does this enable applications to know whether there are more connections to accept right now, it also enables smarter logic for io_uring multishot accept on whether to retry immediately or wait for a poll trigger" * tag 'net-accept-more-20240515' of git://git.kernel.dk/linux: io_uring/net: wire up IORING_CQE_F_SOCK_NONEMPTY for accept net: pass back whether socket was empty post accept net: have do_accept() take a struct proto_accept_arg argument net: change proto and proto_ops accept type |
||
|
|
f08a1e912d |
Including fix from Andrii for the issue mentioned in our net-next PR,
the rest is unremarkable.
Current release - regressions:
- virtio_net: fix missed error path rtnl_unlock after control queue
locking rework
Current release - new code bugs:
- bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup,
caused by missing nested map handling
- drv: dsa: correct initialization order for KSZ88x3 ports
Previous releases - regressions:
- af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
fix performance regression
- ipv6: fix route deleting failure when metric equals 0, don't assume
0 means not set / default in this case
Previous releases - always broken:
- bridge: couple of syzbot-driven fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZHtJQACgkQMUZtbf5S
Irsfyw//ZhCFzvXKLENNalHHMXwq7DsuXe6KWlljEOzLHH0/9QqqNC9ggYKRI5OE
unB//YC3sqtAUczQnED+UOh553Pu4Kvq9334LTX5m4HJQTYLLq1aGM/UZplsBTHx
3MsXUApYFth8pqCZvIcKOZcOddeViBfzEQ9jEAsgIyaqFy3XaiH4Zf6pJAAMyUbE
19CRiK/1TYNrX01XPOeV/9vYGj9rzepo6S5zpHKsWsFZArCcRPBsea/KWYYfLjW7
ExA2Cb+eUnPkNL4bTeH6dwgQGVL8Jo/OsKmsa/tdQffnj1pshdePXtP3TBEynMJF
jSSwwUMq56yE+uok4karE3wIhciUEYvTwfgt5FErYVqfqDiX1+7AZGtdZVDX/mgH
F0etKHDhX9F1zxHVMFwOMA4rLN6cvfpe7Pg+dt4B9E0o18SyNekOM1Ngdu/1ALtd
QV41JFHweHInDRrMLdj4aWW4EYPR5SUuvg66Pec4T7x5hAAapzIJySS+RIydC+ND
guPztYxO5cn5Q7kug1FyUBSXUXZxuCNRACb6zD4/4wbVRZhz7l3OTcd13QADCiwv
Tr61r2bS1Bp/HZ3iIHBY85JnKMvpdwNXN2SPsYQQwVrv9FLj9iskH9kjwqVDG4ja
W3ivZZM+CcZbnB81JynK7Ge54PT+SiPy3Nw4RIVxFl1QlzXC21E=
=7eys
-----END PGP SIGNATURE-----
Merge tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Current release - regressions:
- virtio_net: fix missed error path rtnl_unlock after control queue
locking rework
Current release - new code bugs:
- bpf: fix KASAN slab-out-of-bounds in percpu_array_map_gen_lookup,
caused by missing nested map handling
- drv: dsa: correct initialization order for KSZ88x3 ports
Previous releases - regressions:
- af_packet: do not call packet_read_pending() from
tpacket_destruct_skb() fix performance regression
- ipv6: fix route deleting failure when metric equals 0, don't assume
0 means not set / default in this case
Previous releases - always broken:
- bridge: couple of syzbot-driven fixes"
* tag 'net-6.10-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (30 commits)
selftests: net: local_termination: annotate the expected failures
net: dsa: microchip: Correct initialization order for KSZ88x3 ports
MAINTAINERS: net: Update reviewers for TI's Ethernet drivers
dt-bindings: net: ti: Update maintainers list
l2tp: fix ICMP error handling for UDP-encap sockets
net: txgbe: fix to control VLAN strip
net: wangxun: match VLAN CTAG and STAG features
net: wangxun: fix to change Rx features
af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
virtio_net: Fix missed rtnl_unlock
netrom: fix possible dead-lock in nr_rt_ioctl()
idpf: don't skip over ethtool tcp-data-split setting
dt-bindings: net: qcom: ethernet: Allow dma-coherent
bonding: fix oops during rmmod
net/ipv6: Fix route deleting failure when metric equals 0
selftests/net: reduce xfrm_policy test time
selftests/bpf: Adjust btf_dump test to reflect recent change in file_operations
selftests/bpf: Adjust test_access_variable_array after a kernel function name change
selftests/net/lib: no need to record ns name if it already exist
net: qrtr: ns: Fix module refcnt
...
|
||
|
|
91b6163be4 |
sysctl changes for v6.10-rc1
Summary
* Removed sentinel elements from ctl_table structs in kernel/*
Removing sentinels in ctl_table arrays reduces the build time size and
runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/,
mm/, ipc/ and security/ are set to go into mainline through their respective
subsystems making the next release the most likely place where the final
series that removes the check for proc_name == NULL will land. This PR adds
to removals already in arch/, drivers/ and fs/.
* Adjusted ctl_table definitions and references to allow constification
Adjustments:
- Removing unused ctl_table function arguments
- Moving non-const elements from ctl_table to ctl_table_header
- Making ctl_table pointers const in ctl_table_root structure
Making the static ctl_table structs const will increase safety by keeping the
pointers to proc_handler functions in .rodata. Though no ctl_tables where
made const in this PR, the ground work for making that possible has started
with these changes sent by Thomas Weißschuh.
Testing
* These changes went into linux-next after v6.9-rc4; giving it a good month of
testing.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmZFvBMACgkQupfNUreW
QU/eGAv9EWeiXKxr3EVSMAsb9MWbJq7C99I/pd5hMf+qH4PgJpKDH7w/sb2e8h8+
unGiW83ikgrtph7OS4/xM3Y9r3Nvzd6C/OztqgMnNKeRFdMgP7wu9HaSNs05ordb
CqJdhvL93quc5HxrGTS9sdLK/wLJWOHwuWMXhX4qS44JNxTdPV2q10Rb7DZyHZ6O
C9qp61L2Q2CrnOBKIx8MoeCh20ynJQAo3b0pTN63ZYF4D0vqCcnYNNTPkge4ID8/
ULJoP5hAQY0vJ4g4fC4Gmooa5GECpm8MfZUf3SdgPyauqM/sm3dVdsLXAWD4Phcp
TsG2a/5KMYwnLHrUGwDW7bFfEemRU88h0Iam56+SKMl1kMlEpWaLL9ApQXoHFayG
e10izS+i/nlQiqYIHtuczCoTimT4/LGnonCLcdA//C3XzBT5MnOd7xsjuaQSpFWl
/CV9SZa4ABwzX7u2jty8ik90iihLCFQyKj1d9m1mDVbgb6r3iUOxVuHBgMtY7MF7
eyaEmV7l
=/rQW
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove sentinel elements from ctl_table structs in kernel/*
Removing sentinels in ctl_table arrays reduces the build time size
and runtime memory consumed by ~64 bytes per array. Removals for
net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline
through their respective subsystems making the next release the most
likely place where the final series that removes the check for
proc_name == NULL will land.
This adds to removals already in arch/, drivers/ and fs/.
- Adjust ctl_table definitions and references to allow constification
- Remove unused ctl_table function arguments
- Move non-const elements from ctl_table to ctl_table_header
- Make ctl_table pointers const in ctl_table_root structure
Making the static ctl_table structs const will increase safety by
keeping the pointers to proc_handler functions in .rodata. Though no
ctl_tables where made const in this PR, the ground work for making
that possible has started with these changes sent by Thomas
Weißschuh.
* tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: drop now unnecessary out-of-bounds check
sysctl: move sysctl type to ctl_table_header
sysctl: drop sysctl_is_perm_empty_ctl_table
sysctl: treewide: constify argument ctl_table_root::permissions(table)
sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
bpf: Remove the now superfluous sentinel elements from ctl_table array
delayacct: Remove the now superfluous sentinel elements from ctl_table array
kprobes: Remove the now superfluous sentinel elements from ctl_table array
printk: Remove the now superfluous sentinel elements from ctl_table array
scheduler: Remove the now superfluous sentinel elements from ctl_table array
seccomp: Remove the now superfluous sentinel elements from ctl_table array
timekeeping: Remove the now superfluous sentinel elements from ctl_table array
ftrace: Remove the now superfluous sentinel elements from ctl_table array
umh: Remove the now superfluous sentinel elements from ctl_table array
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
|
||
|
|
6e828dc60e |
l2tp: fix ICMP error handling for UDP-encap sockets
Since commit |
||
|
|
581073f626 |
af_packet: do not call packet_read_pending() from tpacket_destruct_skb()
trafgen performance considerably sank on hosts with many cores after the blamed commit. packet_read_pending() is very expensive, and calling it in af_packet fast path defeats Daniel intent in commit |
||
|
|
e03e7f20eb |
netrom: fix possible dead-lock in nr_rt_ioctl()
syzbot loves netrom, and found a possible deadlock in nr_rt_ioctl [1]
Make sure we always acquire nr_node_list_lock before nr_node_lock(nr_node)
[1]
WARNING: possible circular locking dependency detected
6.9.0-rc7-syzkaller-02147-g654de42f3fc6 #0 Not tainted
------------------------------------------------------
syz-executor350/5129 is trying to acquire lock:
ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_node_lock include/net/netrom.h:152 [inline]
ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:464 [inline]
ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
but task is already holding lock:
ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:462 [inline]
ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_rt_ioctl+0x10a/0x1090 net/netrom/nr_route.c:697
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (nr_node_list_lock){+...}-{2:2}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
_raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
spin_lock_bh include/linux/spinlock.h:356 [inline]
nr_remove_node net/netrom/nr_route.c:299 [inline]
nr_del_node+0x4b4/0x820 net/netrom/nr_route.c:355
nr_rt_ioctl+0xa95/0x1090 net/netrom/nr_route.c:683
sock_do_ioctl+0x158/0x460 net/socket.c:1222
sock_ioctl+0x629/0x8e0 net/socket.c:1341
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&nr_node->node_lock){+...}-{2:2}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
_raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
spin_lock_bh include/linux/spinlock.h:356 [inline]
nr_node_lock include/net/netrom.h:152 [inline]
nr_dec_obs net/netrom/nr_route.c:464 [inline]
nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
sock_do_ioctl+0x158/0x460 net/socket.c:1222
sock_ioctl+0x629/0x8e0 net/socket.c:1341
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(nr_node_list_lock);
lock(&nr_node->node_lock);
lock(nr_node_list_lock);
lock(&nr_node->node_lock);
*** DEADLOCK ***
1 lock held by syz-executor350/5129:
#0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
#0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:462 [inline]
#0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_rt_ioctl+0x10a/0x1090 net/netrom/nr_route.c:697
stack backtrace:
CPU: 0 PID: 5129 Comm: syz-executor350 Not tainted 6.9.0-rc7-syzkaller-02147-g654de42f3fc6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
_raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
spin_lock_bh include/linux/spinlock.h:356 [inline]
nr_node_lock include/net/netrom.h:152 [inline]
nr_dec_obs net/netrom/nr_route.c:464 [inline]
nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
sock_do_ioctl+0x158/0x460 net/socket.c:1222
sock_ioctl+0x629/0x8e0 net/socket.c:1341
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
|
||
|
|
bb48727238 |
net/ipv6: Fix route deleting failure when metric equals 0
Problem ========= After commit |
||
|
|
fd76e5ccc4 |
net: qrtr: ns: Fix module refcnt
The qrtr protocol core logic and the qrtr nameservice are combined into
a single module. Neither the core logic or nameservice provide much
functionality by themselves; combining the two into a single module also
prevents any possible issues that may stem from client modules loading
inbetween qrtr and the ns.
Creating a socket takes two references to the module that owns the
socket protocol. Since the ns needs to create the control socket, this
creates a scenario where there are always two references to the qrtr
module. This prevents the execution of 'rmmod' for qrtr.
To resolve this, forcefully put the module refcount for the socket
opened by the nameservice.
Fixes:
|
||
|
|
621cde16e4 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Cross merge. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
3a7c1661ae |
net: bridge: mst: fix vlan use-after-free
syzbot reported a suspicious rcu usage[1] in bridge's mst code. While
fixing it I noticed that nothing prevents a vlan to be freed while
walking the list from the same path (br forward delay timer). Fix the rcu
usage and also make sure we are not accessing freed memory by making
br_mst_vlan_set_state use rcu read lock.
[1]
WARNING: suspicious RCU usage
6.9.0-rc6-syzkaller #0 Not tainted
-----------------------------
net/bridge/br_private.h:1599 suspicious rcu_dereference_protected() usage!
...
stack backtrace:
CPU: 1 PID: 8017 Comm: syz-executor.1 Not tainted 6.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
lockdep_rcu_suspicious+0x221/0x340 kernel/locking/lockdep.c:6712
nbp_vlan_group net/bridge/br_private.h:1599 [inline]
br_mst_set_state+0x1ea/0x650 net/bridge/br_mst.c:105
br_set_state+0x28a/0x7b0 net/bridge/br_stp.c:47
br_forward_delay_timer_expired+0x176/0x440 net/bridge/br_stp_timer.c:88
call_timer_fn+0x18e/0x650 kernel/time/timer.c:1793
expire_timers kernel/time/timer.c:1844 [inline]
__run_timers kernel/time/timer.c:2418 [inline]
__run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2429
run_timer_base kernel/time/timer.c:2438 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2448
__do_softirq+0x2c6/0x980 kernel/softirq.c:554
invoke_softirq kernel/softirq.c:428 [inline]
__irq_exit_rcu+0xf2/0x1c0 kernel/softirq.c:633
irq_exit_rcu+0x9/0x30 kernel/softirq.c:645
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
RIP: 0010:lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
Code: 2b 00 74 08 4c 89 f7 e8 ba d1 84 00 f6 44 24 61 02 0f 85 85 01 00 00 41 f7 c7 00 02 00 00 74 01 fb 48 c7 44 24 40 0e 36 e0 45 <4b> c7 44 25 00 00 00 00 00 43 c7 44 25 09 00 00 00 00 43 c7 44 25
RSP: 0018:ffffc90013657100 EFLAGS: 00000206
RAX: 0000000000000001 RBX: 1ffff920026cae2c RCX: 0000000000000001
RDX: dffffc0000000000 RSI: ffffffff8bcaca00 RDI: ffffffff8c1eaa60
RBP: ffffc90013657260 R08: ffffffff92efe507 R09: 1ffffffff25dfca0
R10: dffffc0000000000 R11: fffffbfff25dfca1 R12: 1ffff920026cae28
R13: dffffc0000000000 R14: ffffc90013657160 R15: 0000000000000246
Fixes:
|
||
|
|
8bd67ebb50 |
net: bridge: xmit: make sure we have at least eth header len bytes
syzbot triggered an uninit value[1] error in bridge device's xmit path
by sending a short (less than ETH_HLEN bytes) skb. To fix it check if
we can actually pull that amount instead of assuming.
Tested with dropwatch:
drop at: br_dev_xmit+0xb93/0x12d0 [bridge] (0xffffffffc06739b3)
origin: software
timestamp: Mon May 13 11:31:53 2024 778214037 nsec
protocol: 0x88a8
length: 2
original length: 2
drop reason: PKT_TOO_SMALL
[1]
BUG: KMSAN: uninit-value in br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65
br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65
__netdev_start_xmit include/linux/netdevice.h:4903 [inline]
netdev_start_xmit include/linux/netdevice.h:4917 [inline]
xmit_one net/core/dev.c:3531 [inline]
dev_hard_start_xmit+0x247/0xa20 net/core/dev.c:3547
__dev_queue_xmit+0x34db/0x5350 net/core/dev.c:4341
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
__bpf_tx_skb net/core/filter.c:2136 [inline]
__bpf_redirect_common net/core/filter.c:2180 [inline]
__bpf_redirect+0x14a6/0x1620 net/core/filter.c:2187
____bpf_clone_redirect net/core/filter.c:2460 [inline]
bpf_clone_redirect+0x328/0x470 net/core/filter.c:2432
___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
__bpf_prog_run512+0xb5/0xe0 kernel/bpf/core.c:2238
bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
__bpf_prog_run include/linux/filter.h:657 [inline]
bpf_prog_run include/linux/filter.h:664 [inline]
bpf_test_run+0x499/0xc30 net/bpf/test_run.c:425
bpf_prog_test_run_skb+0x14ea/0x1f20 net/bpf/test_run.c:1058
bpf_prog_test_run+0x6b7/0xad0 kernel/bpf/syscall.c:4269
__sys_bpf+0x6aa/0xd90 kernel/bpf/syscall.c:5678
__do_sys_bpf kernel/bpf/syscall.c:5767 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5765 [inline]
__x64_sys_bpf+0xa0/0xe0 kernel/bpf/syscall.c:5765
x64_sys_call+0x96b/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:322
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
|
||
|
|
1b294a1f35 |
Networking changes for 6.10.
Core & protocols
----------------
- Complete rework of garbage collection of AF_UNIX sockets.
AF_UNIX is prone to forming reference count cycles due to fd passing
functionality. New method based on Tarjan's Strongly Connected Components
algorithm should be both faster and remove a lot of workarounds
we accumulated over the years.
- Add TCP fraglist GRO support, allowing chaining multiple TCP packets
and forwarding them together. Useful for small switches / routers which
lack basic checksum offload in some scenarios (e.g. PPPoE).
- Support using SMP threads for handling packet backlog i.e. packet
processing from software interfaces and old drivers which don't
use NAPI. This helps move the processing out of the softirq jumble.
- Continue work of converting from rtnl lock to RCU protection.
Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address
labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files,
MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs,
neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link
information available via rtnetlink.
- Small optimizations from Eric to UDP wake up handling, memory accounting,
RPS/RFS implementation, TCP packet sizing etc.
- Allow direct page recycling in the bulk API used by XDP, for +2% PPS.
- Support peek with an offset on TCP sockets.
- Add MPTCP APIs for querying last time packets were received/sent/acked,
and whether MPTCP "upgrade" succeeded on a TCP socket.
- Add intra-node communication shortcut to improve SMC performance.
- Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver.
- Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.
- Add reset reasons for tracing what caused a TCP reset to be sent.
- Introduce direction attribute for xfrm (IPSec) states.
State can be used either for input or output packet processing.
Things we sprinkled into general kernel code
--------------------------------------------
- Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().
This required touch-ups and renaming of a few existing users.
- Add Endian-dependent __counted_by_{le,be} annotations.
- Make building selftests "quieter" by printing summaries like
"CC object.o" rather than full commands with all the arguments.
Netfilter
---------
- Use GFP_KERNEL to clone elements, to deal better with OOM situations
and avoid failures in the .commit step.
BPF
---
- Add eBPF JIT for ARCv2 CPUs.
- Support attaching kprobe BPF programs through kprobe_multi link in
a session mode, meaning, a BPF program is attached to both function entry
and return, the entry program can decide if the return program gets
executed and the entry program can share u64 cookie value with return
program. "Session mode" is a common use-case for tetragon and bpftrace.
- Add the ability to specify and retrieve BPF cookie for raw tracepoint
programs in order to ease migration from classic to raw tracepoints.
- Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86, ARM64 and RISC-V JITs.
This allows inlining functions which need to access per-CPU state.
- Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86 instruction.
Support BPF arena on ARM64.
- Add a new bpf_wq API for deferring events and refactor process-context
bpf_timer code to keep common code where possible.
- Harden the BPF verifier's and/or/xor value tracking.
- Introduce crypto kfuncs to let BPF programs call kernel crypto APIs.
- Support bpf_tail_call_static() helper for BPF programs with GCC 13.
- Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled.
Driver API
----------
- Skip software TC processing completely if all installed rules are
marked as HW-only, instead of checking the HW-only flag rule by rule.
- Add support for configuring PoE (Power over Ethernet), similar to
the already existing support for PoDL (Power over Data Line) config.
- Initial bits of a queue control API, for now allowing a single queue
to be reset without disturbing packet flow to other queues.
- Common (ethtool) statistics for hardware timestamping.
Tests and tooling
-----------------
- Remove the need to create a config file to run the net forwarding tests
so that a naive "make run_tests" can exercise them.
- Define a method of writing tests which require an external endpoint
to communicate with (to send/receive data towards the test machine).
Add a few such tests.
- Create a shared code library for writing Python tests. Expose the YAML
Netlink library from tools/ to the tests for easy Netlink access.
- Move netfilter tests under net/, extend them, separate performance tests
from correctness tests, and iron out issues found by running them
"on every commit".
- Refactor BPF selftests to use common network helpers.
- Further work filling in YAML definitions of Netlink messages for:
nftables, team driver, bonding interfaces, vlan interfaces, VF info,
TC u32 mark, TC police action.
- Teach Python YAML Netlink to decode attribute policies.
- Extend the definition of the "indexed array" construct in the specs
to cover arrays of scalars rather than just nests.
- Add hyperlinks between definitions in generated Netlink docs.
Drivers
-------
- Make sure unsupported flower control flags are rejected by drivers,
and make more drivers report errors directly to the application rather
than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen).
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support multiple RSS contexts and steering traffic to them
- support XDP metadata
- make page pool allocations more NUMA aware
- Intel (100G, ice, idpf):
- extract datapath code common among Intel drivers into a library
- use fewer resources in switchdev by sharing queues with the PF
- add PFCP filter support
- add Ethernet filter support
- use a spinlock instead of HW lock in PTP clock ops
- support 5 layer Tx scheduler topology
- nVidia/Mellanox:
- 800G link modes and 100G SerDes speeds
- per-queue IRQ coalescing configuration
- Marvell Octeon:
- support offloading TC packet mark action
- Ethernet NICs consumer, embedded and virtual:
- stop lying about skb->truesize in USB Ethernet drivers, it messes up
TCP memory calculations
- Google cloud vNIC:
- support changing ring size via ethtool
- support ring reset using the queue control API
- VirtIO net:
- expose flow hash from RSS to XDP
- per-queue statistics
- add selftests
- Synopsys (stmmac):
- support controllers which require an RX clock signal from the MII
bus to perform their hardware initialization
- TI:
- icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
- icssg_prueth: add SW TX / RX Coalescing based on hrtimers
- cpsw: minimal XDP support
- Renesas (ravb):
- support describing the MDIO bus
- Realtek (r8169):
- add support for RTL8168M
- Microchip Sparx5:
- matchall and flower actions mirred and redirect
- Ethernet switches:
- nVidia/Mellanox:
- improve events processing performance
- Marvell:
- add support for MV88E6250 family internal PHYs
- Microchip:
- add DCB and DSCP mapping support for KSZ switches
- vsc73xx: convert to PHYLINK
- Realtek:
- rtl8226b/rtl8221b: add C45 instances and SerDes switching
- Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup.
- Ethernet PHYs:
- Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
- micrel: lan8814: add support for PPS out and external timestamp trigger
- WiFi:
- Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers.
Modern devices can only be configured using nl80211.
- mac80211/cfg80211
- handle color change per link for WiFi 7 Multi-Link Operation
- Intel (iwlwifi):
- don't support puncturing in 5 GHz
- support monitor mode on passive channels
- BZ-W device support
- P2P with HE/EHT support
- re-add support for firmware API 90
- provide channel survey information for Automatic Channel Selection
- MediaTek (mt76):
- mt7921 LED control
- mt7925 EHT radiotap support
- mt7920e PCI support
- Qualcomm (ath11k):
- P2P support for QCA6390, WCN6855 and QCA2066
- support hibernation
- ieee80211-freq-limit Device Tree property support
- Qualcomm (ath12k):
- refactoring in preparation of multi-link support
- suspend and hibernation support
- ACPI support
- debugfs support, including dfs_simulate_radar support
- RealTek:
- rtw88: RTL8723CS SDIO device support
- rtw89: RTL8922AE Wi-Fi 7 PCI device support
- rtw89: complete features of new WiFi 7 chip 8922AE including
BT-coexistence and Wake-on-WLAN
- rtw89: use BIOS ACPI settings to set TX power and channels
- rtl8xxxu: enable Management Frame Protection (MFP) support
- Bluetooth:
- support for Intel BlazarI and Filmore Peak2 (BE201)
- support for MediaTek MT7921S SDIO
- initial support for Intel PCIe BT driver
- remove HCI_AMP support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZD6sQACgkQMUZtbf5S
IrtLYw/+I73ePGIye37o2jpbodcLAUZVfF3r6uYUzK8hokEcKD0QVJa9w7PizLZ3
UO45ClOXFLJCkfP4reFenLfxGCel2AJI+F7VFl2xaO2XgrcH/lnVrHqKZEAEXjls
KoYMnShIolv7h2MKP6hHtyTi2j1wvQUKsZC71o9/fuW+4fUT8gECx1YtYcL73wrw
gEMdlUgBYC3jiiCUHJIFX6iPJ2t/TC+q1eIIF2K/Osrk2kIqQhzoozcL4vpuAZQT
99ljx/qRelXa8oppDb7nM5eulg7WY8ZqxEfFZphTMC5nLEGzClxuOTTl2kDYI/D/
UZmTWZDY+F5F0xvNk2gH84qVJXBOVDoobpT7hVA/tDuybobc/kvGDzRayEVqVzKj
Q0tPlJs+xBZpkK5TVnxaFLJVOM+p1Xosxy3kNVXmuYNBvT/R89UbJiCrUKqKZF+L
z/1mOYUv8UklHqYAeuJSptHvqJjTGa/fsEYP7dAUBbc1N2eVB8mzZ4mgU5rYXbtC
E6UXXiWnoSRm8bmco9QmcWWoXt5UGEizHSJLz6t1R5Df/YmXhWlytll5aCwY1ksf
FNoL7S4u7AZThL1Nwi7yUs4CAjhk/N4aOsk+41S0sALCx30BJuI6UdesAxJ0lu+Z
fwCQYbs27y4p7mBLbkYwcQNxAxGm7PSK4yeyRIy2njiyV4qnLf8=
=EsC2
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Complete rework of garbage collection of AF_UNIX sockets.
AF_UNIX is prone to forming reference count cycles due to fd
passing functionality. New method based on Tarjan's Strongly
Connected Components algorithm should be both faster and remove a
lot of workarounds we accumulated over the years.
- Add TCP fraglist GRO support, allowing chaining multiple TCP
packets and forwarding them together. Useful for small switches /
routers which lack basic checksum offload in some scenarios (e.g.
PPPoE).
- Support using SMP threads for handling packet backlog i.e. packet
processing from software interfaces and old drivers which don't use
NAPI. This helps move the processing out of the softirq jumble.
- Continue work of converting from rtnl lock to RCU protection.
Don't require rtnl lock when reading: IPv6 routing FIB, IPv6
address labels, netdev threaded NAPI sysfs files, bonding driver's
sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics,
TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot
of the link information available via rtnetlink.
- Small optimizations from Eric to UDP wake up handling, memory
accounting, RPS/RFS implementation, TCP packet sizing etc.
- Allow direct page recycling in the bulk API used by XDP, for +2%
PPS.
- Support peek with an offset on TCP sockets.
- Add MPTCP APIs for querying last time packets were received/sent/acked
and whether MPTCP "upgrade" succeeded on a TCP socket.
- Add intra-node communication shortcut to improve SMC performance.
- Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol
driver.
- Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver.
- Add reset reasons for tracing what caused a TCP reset to be sent.
- Introduce direction attribute for xfrm (IPSec) states. State can be
used either for input or output packet processing.
Things we sprinkled into general kernel code:
- Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS().
This required touch-ups and renaming of a few existing users.
- Add Endian-dependent __counted_by_{le,be} annotations.
- Make building selftests "quieter" by printing summaries like
"CC object.o" rather than full commands with all the arguments.
Netfilter:
- Use GFP_KERNEL to clone elements, to deal better with OOM
situations and avoid failures in the .commit step.
BPF:
- Add eBPF JIT for ARCv2 CPUs.
- Support attaching kprobe BPF programs through kprobe_multi link in
a session mode, meaning, a BPF program is attached to both function
entry and return, the entry program can decide if the return
program gets executed and the entry program can share u64 cookie
value with return program. "Session mode" is a common use-case for
tetragon and bpftrace.
- Add the ability to specify and retrieve BPF cookie for raw
tracepoint programs in order to ease migration from classic to raw
tracepoints.
- Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86, ARM64 and RISC-V
JITs. This allows inlining functions which need to access per-CPU
state.
- Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86
instruction. Support BPF arena on ARM64.
- Add a new bpf_wq API for deferring events and refactor
process-context bpf_timer code to keep common code where possible.
- Harden the BPF verifier's and/or/xor value tracking.
- Introduce crypto kfuncs to let BPF programs call kernel crypto
APIs.
- Support bpf_tail_call_static() helper for BPF programs with GCC 13.
- Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled.
Driver API:
- Skip software TC processing completely if all installed rules are
marked as HW-only, instead of checking the HW-only flag rule by
rule.
- Add support for configuring PoE (Power over Ethernet), similar to
the already existing support for PoDL (Power over Data Line)
config.
- Initial bits of a queue control API, for now allowing a single
queue to be reset without disturbing packet flow to other queues.
- Common (ethtool) statistics for hardware timestamping.
Tests and tooling:
- Remove the need to create a config file to run the net forwarding
tests so that a naive "make run_tests" can exercise them.
- Define a method of writing tests which require an external endpoint
to communicate with (to send/receive data towards the test
machine). Add a few such tests.
- Create a shared code library for writing Python tests. Expose the
YAML Netlink library from tools/ to the tests for easy Netlink
access.
- Move netfilter tests under net/, extend them, separate performance
tests from correctness tests, and iron out issues found by running
them "on every commit".
- Refactor BPF selftests to use common network helpers.
- Further work filling in YAML definitions of Netlink messages for:
nftables, team driver, bonding interfaces, vlan interfaces, VF
info, TC u32 mark, TC police action.
- Teach Python YAML Netlink to decode attribute policies.
- Extend the definition of the "indexed array" construct in the specs
to cover arrays of scalars rather than just nests.
- Add hyperlinks between definitions in generated Netlink docs.
Drivers:
- Make sure unsupported flower control flags are rejected by drivers,
and make more drivers report errors directly to the application
rather than dmesg (large number of driver changes from Asbjørn
Sloth Tønnesen).
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support multiple RSS contexts and steering traffic to them
- support XDP metadata
- make page pool allocations more NUMA aware
- Intel (100G, ice, idpf):
- extract datapath code common among Intel drivers into a library
- use fewer resources in switchdev by sharing queues with the PF
- add PFCP filter support
- add Ethernet filter support
- use a spinlock instead of HW lock in PTP clock ops
- support 5 layer Tx scheduler topology
- nVidia/Mellanox:
- 800G link modes and 100G SerDes speeds
- per-queue IRQ coalescing configuration
- Marvell Octeon:
- support offloading TC packet mark action
- Ethernet NICs consumer, embedded and virtual:
- stop lying about skb->truesize in USB Ethernet drivers, it
messes up TCP memory calculations
- Google cloud vNIC:
- support changing ring size via ethtool
- support ring reset using the queue control API
- VirtIO net:
- expose flow hash from RSS to XDP
- per-queue statistics
- add selftests
- Synopsys (stmmac):
- support controllers which require an RX clock signal from the
MII bus to perform their hardware initialization
- TI:
- icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices
- icssg_prueth: add SW TX / RX Coalescing based on hrtimers
- cpsw: minimal XDP support
- Renesas (ravb):
- support describing the MDIO bus
- Realtek (r8169):
- add support for RTL8168M
- Microchip Sparx5:
- matchall and flower actions mirred and redirect
- Ethernet switches:
- nVidia/Mellanox:
- improve events processing performance
- Marvell:
- add support for MV88E6250 family internal PHYs
- Microchip:
- add DCB and DSCP mapping support for KSZ switches
- vsc73xx: convert to PHYLINK
- Realtek:
- rtl8226b/rtl8221b: add C45 instances and SerDes switching
- Many driver changes related to PHYLIB and PHYLINK deprecated API
cleanup
- Ethernet PHYs:
- Add a new driver for Airoha EN8811H 2.5 Gigabit PHY.
- micrel: lan8814: add support for PPS out and external timestamp trigger
- WiFi:
- Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices
drivers. Modern devices can only be configured using nl80211.
- mac80211/cfg80211
- handle color change per link for WiFi 7 Multi-Link Operation
- Intel (iwlwifi):
- don't support puncturing in 5 GHz
- support monitor mode on passive channels
- BZ-W device support
- P2P with HE/EHT support
- re-add support for firmware API 90
- provide channel survey information for Automatic Channel Selection
- MediaTek (mt76):
- mt7921 LED control
- mt7925 EHT radiotap support
- mt7920e PCI support
- Qualcomm (ath11k):
- P2P support for QCA6390, WCN6855 and QCA2066
- support hibernation
- ieee80211-freq-limit Device Tree property support
- Qualcomm (ath12k):
- refactoring in preparation of multi-link support
- suspend and hibernation support
- ACPI support
- debugfs support, including dfs_simulate_radar support
- RealTek:
- rtw88: RTL8723CS SDIO device support
- rtw89: RTL8922AE Wi-Fi 7 PCI device support
- rtw89: complete features of new WiFi 7 chip 8922AE including
BT-coexistence and Wake-on-WLAN
- rtw89: use BIOS ACPI settings to set TX power and channels
- rtl8xxxu: enable Management Frame Protection (MFP) support
- Bluetooth:
- support for Intel BlazarI and Filmore Peak2 (BE201)
- support for MediaTek MT7921S SDIO
- initial support for Intel PCIe BT driver
- remove HCI_AMP support"
* tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits)
selftests: netfilter: fix packetdrill conntrack testcase
net: gro: fix napi_gro_cb zeroed alignment
Bluetooth: btintel_pcie: Refactor and code cleanup
Bluetooth: btintel_pcie: Fix warning reported by sparse
Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1
Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config
Bluetooth: btintel_pcie: Fix compiler warnings
Bluetooth: btintel_pcie: Add *setup* function to download firmware
Bluetooth: btintel_pcie: Add support for PCIe transport
Bluetooth: btintel: Export few static functions
Bluetooth: HCI: Remove HCI_AMP support
Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init()
Bluetooth: qca: Fix error code in qca_read_fw_build_info()
Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning
Bluetooth: btintel: Add support for Filmore Peak2 (BE201)
Bluetooth: btintel: Add support for BlazarI
LE Create Connection command timeout increased to 20 secs
dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth
Bluetooth: compute LE flow credits based on recvbuf space
Bluetooth: hci_sync: Use cmd->num_cis instead of magic number
...
|
||
|
|
effb835726 |
s390/iucv: Unexport iucv_root
There is no user of iucv_root outside of the core IUCV code left. Therefore remove the EXPORT_SYMBOL. Acked-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://lore.kernel.org/r/20240506194454.1160315-7-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
|
|
4452e8ef8c |
s390/iucv: Provide iucv_alloc_device() / iucv_release_device()
Provide iucv_alloc_device() and iucv_release_device() helper functions, which can be used to deduplicate more or less identical IUCV device allocation and release code in four different drivers. Suggested-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://lore.kernel.org/r/20240506194454.1160315-2-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
|
|
654de42f3f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
e77f43d531 |
Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1
If hdev->le_num_of_adv_sets is set to 1 it means that only handle 0x00
can be used, but since the MGMT interface instances start from 1
(instance 0 means all instances in case of MGMT_OP_REMOVE_ADVERTISING)
the code needs to map the instance to handle otherwise users will not be
able to advertise as instance 1 would attempt to use handle 0x01.
Fixes:
|
||
|
|
84a4bb6548 |
Bluetooth: HCI: Remove HCI_AMP support
Since BT_HS has been remove HCI_AMP controllers no longer has any use so
remove it along with the capability of creating AMP controllers.
Since we no longer need to differentiate between AMP and Primary
controllers, as only HCI_PRIMARY is left, this also remove
hdev->dev_type altogether.
Fixes:
|
||
|
|
a5b862c6a2 |
Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init()
l2cap_le_flowctl_init() can cause both div-by-zero and an integer
overflow since hdev->le_mtu may not fall in the valid range.
Move MTU from hci_dev to hci_conn to validate MTU and stop the connection
process earlier if MTU is invalid.
Also, add a missing validation in read_buffer_size() and make it return
an error value if the validation fails.
Now hci_conn_add() returns ERR_PTR() as it can fail due to the both a
kzalloc failure and invalid MTU value.
divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 67 Comm: kworker/u5:0 Tainted: G W 6.9.0-rc5+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: hci0 hci_rx_work
RIP: 0010:l2cap_le_flowctl_init+0x19e/0x3f0 net/bluetooth/l2cap_core.c:547
Code: e8 17 17 0c 00 66 41 89 9f 84 00 00 00 bf 01 00 00 00 41 b8 02 00 00 00 4c
89 fe 4c 89 e2 89 d9 e8 27 17 0c 00 44 89 f0 31 d2 <66> f7 f3 89 c3 ff c3 4d 8d
b7 88 00 00 00 4c 89 f0 48 c1 e8 03 42
RSP: 0018:ffff88810bc0f858 EFLAGS: 00010246
RAX: 00000000000002a0 RBX: 0000000000000000 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffff88810bc0f7c0 RDI: ffffc90002dcb66f
RBP: ffff88810bc0f880 R08: aa69db2dda70ff01 R09: 0000ffaaaaaaaaaa
R10: 0084000000ffaaaa R11: 0000000000000000 R12: ffff88810d65a084
R13: dffffc0000000000 R14: 00000000000002a0 R15: ffff88810d65a000
FS: 0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000100 CR3: 0000000103268003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
l2cap_le_connect_req net/bluetooth/l2cap_core.c:4902 [inline]
l2cap_le_sig_cmd net/bluetooth/l2cap_core.c:5420 [inline]
l2cap_le_sig_channel net/bluetooth/l2cap_core.c:5486 [inline]
l2cap_recv_frame+0xe59d/0x11710 net/bluetooth/l2cap_core.c:6809
l2cap_recv_acldata+0x544/0x10a0 net/bluetooth/l2cap_core.c:7506
hci_acldata_packet net/bluetooth/hci_core.c:3939 [inline]
hci_rx_work+0x5e5/0xb20 net/bluetooth/hci_core.c:4176
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0x90f/0x1530 kernel/workqueue.c:3335
worker_thread+0x926/0xe70 kernel/workqueue.c:3416
kthread+0x2e3/0x380 kernel/kthread.c:388
ret_from_fork+0x5c/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Fixes:
|
||
|
|
ea9e148c80 |
Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Also, -Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally. So, use the `DEFINE_FLEX()` helper for an on-stack definition of a flexible structure where the size of the flexible-array member is known at compile-time, and refactor the rest of the code, accordingly. With these changes, fix the following warning: net/bluetooth/hci_conn.c:669:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Link: https://github.com/KSPP/linux/issues/202 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
21d74b6b4e |
LE Create Connection command timeout increased to 20 secs
On our DUT, we can see that the host issues create connection cancel command after 4-sec if there is no connection complete event for LE create connection cmd. As per core spec v5.3 section 7.8.5, advertisement interval range is- Advertising_Interval_Min Default : 0x0800(1.28s) Time Range: 20ms to 10.24s Advertising_Interval_Max Default : 0x0800(1.28s) Time Range: 20ms to 10.24s If the remote device is using adv interval of > 4 sec, it is difficult to make a connection with the current timeout value. Also, with the default interval of 1.28 sec, we will get only 3 chances to capture the adv packets with the 4 sec window. Hence we want to increase this timeout to 20sec. Signed-off-by: Mahesh Talewad <mahesh.talewad@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
ce60b9231b |
Bluetooth: compute LE flow credits based on recvbuf space
Previously LE flow credits were returned to the
sender even if the socket's receive buffer was
full. This meant that no back-pressure
was applied to the sender, thus it continued to
send data, resulting in data loss without any
error being reported. Furthermore, the amount
of credits was essentially fixed to a small
amount, leading to reduced performance.
This is fixed by computing the number of returned
LE flow credits based on the estimated available
space in the receive buffer of an L2CAP socket.
Consequently, if the receive buffer is full, no
credits are returned until the buffer is read and
thus cleared by user-space.
Since the computation of available receive buffer
space can only be performed approximately (due to
sk_buff overhead) and the receive buffer size may
be changed by user-space after flow credits have
been sent, superfluous received data is temporary
stored within l2cap_pinfo. This is necessary
because Bluetooth LE provides no retransmission
mechanism once the data has been acked by the
physical layer.
If receive buffer space estimation is not possible
at the moment, we fall back to providing credits
for one full packet as before. This is currently
the case during connection setup, when MPS is not
yet available.
Fixes:
|
||
|
|
73b2652cbb |
Bluetooth: hci_sync: Use cmd->num_cis instead of magic number
At the moment of the check, `cmd->num_cis` holds the value of 0x1f, which is the max number of elements in the `cmd->cis[]` array at declaration, which is 0x1f. So, avoid using 0x1f directly, and instead use `cmd->num_cis`. Similarly to this other patch[1]. Link: https://lore.kernel.org/linux-hardening/ZivaHUQyDDK9fXEk@neat/ [1] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
d6bb8782b4 |
Bluetooth: hci_conn: Use struct_size() in hci_le_big_create_sync()
Use struct_size() instead of the open-coded version. Similarly to this other patch[1]. Link: https://lore.kernel.org/linux-hardening/ZiwwPmCvU25YzWek@neat/ [1] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
c90748b898 |
Bluetooth: hci_conn: Use __counted_by() to avoid -Wfamnae warning
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Also, -Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally. So, use the `DEFINE_FLEX()` helper for an on-stack definition of a flexible structure where the size of the flexible-array member is known at compile-time, and refactor the rest of the code, accordingly. With these changes, fix the following warning: net/bluetooth/hci_conn.c:2116:50: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Link: https://github.com/KSPP/linux/issues/202 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
c4585edf70 |
Bluetooth: hci_conn, hci_sync: Use __counted_by() to avoid -Wfamnae warnings
Prepare for the coming implementation by GCC and Clang of the
__counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time
via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE
(for strcpy/memcpy-family functions).
Also, -Wflex-array-member-not-at-end is coming in GCC-14, and we are
getting ready to enable it globally.
So, use the `DEFINE_FLEX()` helper for multiple on-stack definitions
of a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.
Notice that, due to the use of `__counted_by()` in `struct
hci_cp_le_create_cis`, the for loop in function `hci_cs_le_create_cis()`
had to be modified. Once the index `i`, through which `cp->cis[i]` is
accessed, falls in the interval [0, cp->num_cis), `cp->num_cis` cannot
be decremented all the way down to zero while accessing `cp->cis[]`:
net/bluetooth/hci_event.c:4310:
4310 for (i = 0; cp->num_cis; cp->num_cis--, i++) {
...
4314 handle = __le16_to_cpu(cp->cis[i].cis_handle);
otherwise, only half (one iteration before `cp->num_cis == i`) or half
plus one (one iteration before `cp->num_cis < i`) of the items in the
array will be accessed before running into an out-of-bounds issue. So,
in order to avoid this, set `cp->num_cis` to zero just after the for
loop.
Also, make use of `aux_num_cis` variable to update `cmd->num_cis` after
a `list_for_each_entry_rcu()` loop.
With these changes, fix the following warnings:
net/bluetooth/hci_sync.c:1239:56: warning: structure containing a flexible
array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/hci_sync.c:1415:51: warning: structure containing a flexible
array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/hci_sync.c:1731:51: warning: structure containing a flexible
array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/hci_sync.c:6497:45: warning: structure containing a flexible
array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
Link: https://github.com/KSPP/linux/issues/202
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
||
|
|
94c603c28e |
Bluetooth: Remove 3 repeated macro definitions
Macros HCI_REQ_DONE, HCI_REQ_PEND and HCI_REQ_CANCELED are repeatedly defined twice with hci_request.h, so remove a copy of definition. Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
d68d8a7a2c |
Bluetooth: hci_conn: Remove a redundant check for HFP offload
Remove a redundant check !hdev->get_codec_config_data. Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
1c08108f30 |
Bluetooth: L2CAP: Avoid -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting
ready to enable it globally.
There are currently a couple of objects (`req` and `rsp`), in a couple
of structures, that contain flexible structures (`struct l2cap_ecred_conn_req`
and `struct l2cap_ecred_conn_rsp`), for example:
struct l2cap_ecred_rsp_data {
struct {
struct l2cap_ecred_conn_rsp rsp;
__le16 scid[L2CAP_ECRED_MAX_CID];
} __packed pdu;
int count;
};
in the struct above, `struct l2cap_ecred_conn_rsp` is a flexible
structure:
struct l2cap_ecred_conn_rsp {
__le16 mtu;
__le16 mps;
__le16 credits;
__le16 result;
__le16 dcid[];
};
So, in order to avoid ending up with a flexible-array member in the
middle of another structure, we use the `struct_group_tagged()` (and
`__struct_group()` when the flexible structure is `__packed`) helper
to separate the flexible array from the rest of the members in the
flexible structure:
struct l2cap_ecred_conn_rsp {
struct_group_tagged(l2cap_ecred_conn_rsp_hdr, hdr,
... the rest of members
);
__le16 dcid[];
};
With the change described above, we now declare objects of the type of
the tagged struct, in this example `struct l2cap_ecred_conn_rsp_hdr`,
without embedding flexible arrays in the middle of other structures:
struct l2cap_ecred_rsp_data {
struct {
struct l2cap_ecred_conn_rsp_hdr rsp;
__le16 scid[L2CAP_ECRED_MAX_CID];
} __packed pdu;
int count;
};
Also, when the flexible-array member needs to be accessed, we use
`container_of()` to retrieve a pointer to the flexible structure.
We also use the `DEFINE_RAW_FLEX()` helper for a couple of on-stack
definitions of a flexible structure where the size of the flexible-array
member is known at compile-time.
So, with these changes, fix the following warnings:
net/bluetooth/l2cap_core.c:1260:45: warning: structure containing a
flexible array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/l2cap_core.c:3740:45: warning: structure containing a
flexible array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/l2cap_core.c:4999:45: warning: structure containing a
flexible array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
net/bluetooth/l2cap_core.c:7116:47: warning: structure containing a
flexible array member is not at the end of another structure
[-Wflex-array-member-not-at-end]
Link: https://github.com/KSPP/linux/issues/202
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
||
|
|
d356c924e7 |
Bluetooth: ISO: Handle PA sync when no BIGInfo reports are generated
In case of a Broadcast Source that has PA enabled but no active BIG, a Broadcast Sink needs to establish PA sync and parse BASE from PA reports. This commit moves the allocation of a PA sync hcon from the BIGInfo advertising report event to the PA sync established event. After the first complete PA report, the hcon is notified to the ISO layer. A child socket is allocated and enqueued in the parent's accept queue. BIGInfo reports also need to be processed, to extract the encryption field and inform userspace. After the first BIGInfo report is received, the PA sync hcon is notified again to the ISO layer. Since a socket will be found this time, the socket state will transition to BT_CONNECTED and the userspace will be woken up using sk_state_change. Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
311527e9da |
Bluetooth: ISO: Make iso_get_sock_listen generic
This makes iso_get_sock_listen more generic, to return matching socket in the state provided as argument. Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
2e2515c1ba |
Bluetooth: hci_event: Set DISCOVERY_FINDING on SCAN_ENABLED
This makes sure that discovery state is properly synchronized otherwise reports may not generate MGMT DeviceFound events as it would be assumed that it was not initiated by a discovery session. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
7c2cc5b1db |
Bluetooth: Add proper definitions for scan interval and window
This adds proper definitions for scan interval and window and then make use of them instead their values. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> |
||
|
|
5c1672705a |
net: revert partially applied PHY topology series
The series is causing issues with PHY drivers built as modules. Since it was only partially applied and the merge window has opened let's revert and try again for v6.11. Revert |
||
|
|
7fad5b3756 |
mptcp: include inet_common in mib.h
So this file is now self-contained: it can be compiled alone with analytic tools. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-9-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
76a86686e3 |
mptcp: move mptcp_pm_gen.h's include
Nothing from protocol.h depends on mptcp_pm_gen.h, only code from pm_netlink.c and pm_userspace.c depends on it. So this include can be moved where it is needed to avoid a "unused includes" warning. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-8-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
00797af95f |
mptcp: remove unnecessary else statements
The 'else' statements are not needed here, because their previous 'if' block ends with a 'return'. This fixes CheckPatch warnings: WARNING: else is not generally useful after a break or return Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-7-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
5eae7a8202 |
mptcp: prefer strscpy over strcpy
strcpy() performs no bounds checking on the destination buffer. This could result in linear overflows beyond the end of the buffer, leading to all kinds of misbehaviors. The safe replacement is strscpy() [1]. This is in preparation of a possible future step where all strcpy() uses will be removed in favour of strscpy() [2]. This fixes CheckPatch warnings: WARNING: Prefer strscpy over strcpy Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Link: https://github.com/KSPP/linux/issues/88 [2] Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-6-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
73c900aa36 |
mptcp: add net.mptcp.available_schedulers
The sysctl lists the available schedulers that can be set using net.mptcp.scheduler similarly to net.ipv4.tcp_available_congestion_control. Signed-off-by: Gregory Detal <gregory.detal@gmail.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-5-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
ce5f6f71b0 |
mptcp: sockopt: info: stop early if no buffer
Up to recently, it has been recommended to use getsockopt(MPTCP_INFO) to check if a fallback to TCP happened, or if the client requested to use MPTCP. In this case, the userspace app is only interested by the returned value of the getsocktop() call, and can then give 0 for the option length, and NULL for the buffer address. An easy optimisation is then to stop early, and avoid filling a local buffer -- which now requires two different locks -- if it is not needed. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20240514011335.176158-4-martineau@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
bd11dc4fb9 |
mptcp: fix full TCP keep-alive support
SO_KEEPALIVE support has been added a while ago, as part of a series
"adding SOL_SOCKET" support. To have a full control of this keep-alive
feature, it is important to also support TCP_KEEP* socket options at the
SOL_TCP level.
Supporting them on the setsockopt() part is easy, it is just a matter of
remembering each value in the MPTCP sock structure, and calling
tcp_sock_set_keep*() helpers on each subflow. If the value is not
modified (0), calling these helpers will not do anything. For the
getsockopt() part, the corresponding value from the MPTCP sock structure
or the default one is simply returned. All of this is very similar to
other TCP_* socket options supported by MPTCP.
It looks important for kernels supporting SO_KEEPALIVE, to also support
TCP_KEEP* options as well: some apps seem to (wrongly) consider that if
the former is supported, the latter ones will be supported as well. But
also, not having this simple and isolated change is preventing MPTCP
support in some apps, and libraries like GoLang [1]. This is why this
patch is seen as a fix.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/383
Fixes:
|
||
|
|
a65198136e |
mptcp: SO_KEEPALIVE: fix getsockopt support
SO_KEEPALIVE support has to be set on each subflow: on each TCP socket,
where sk_prot->keepalive is defined. Technically, nothing has to be done
on the MPTCP socket. That's why mptcp_sol_socket_sync_intval() was
called instead of mptcp_sol_socket_intval().
Except that when nothing is done on the MPTCP socket, the
getsockopt(SO_KEEPALIVE), handled in net/core/sock.c:sk_getsockopt(),
will not know if SO_KEEPALIVE has been set on the different subflows or
not.
The fix is simple: simply call mptcp_sol_socket_intval() which will end
up calling net/core/sock.c:sk_setsockopt() where the SOCK_KEEPOPEN flag
will be set, the one used in sk_getsockopt().
So now, getsockopt(SO_KEEPALIVE) on an MPTCP socket will return the same
value as the one previously set with setsockopt(SO_KEEPALIVE).
Fixes:
|
||
|
|
11f46ea981 |
tcp: rstreason: fully support in tcp_check_req()
We're going to send an RST due to invalid syn packet which is already checked whether 1) it is in sequence, 2) it is a retransmitted skb. As RFC 793 says, if the state of socket is not CLOSED/LISTEN/SYN-SENT, then we should send an RST when receiving bad syn packet: "fourth, check the SYN bit,...If the SYN is in the window it is an error, send a reset" Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-6-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
22a3255775 |
tcp: rstreason: handle timewait cases in the receive path
There are two possible cases where TCP layer can send an RST. Since they happen in the same place, I think using one independent reason is enough to identify this special situation. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240510122502.27850-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
7951e36ac6 |
net: pass back whether socket was empty post accept
This adds an 'is_empty' argument to struct proto_accept_arg, which can be used to pass back information on whether or not the given socket has more connections to accept post the one just accepted. To utilize this information, the caller should initialize the 'is_empty' field to, eg, -1 and then check for 0/1 after the accept. If the field has been set, the caller knows whether there are more pending connections or not. If the field remains -1 after the accept call, the protocol doesn't support passing back this information. This patch wires it up for ipv4/6 TCP. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
0645fbe760 |
net: have do_accept() take a struct proto_accept_arg argument
In preparation for passing in more information via this API, change do_accept() to take a proto_accept_arg struct pointer rather than just the file flags separately. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
92ef0fd55a |
net: change proto and proto_ops accept type
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
6e62702feb |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZkGcZAAKCRDbK58LschI g6o6APwLsqhrM2w71VUN5ciCxu4H5VDtZp6wkdqtVbxxU4qNxQEApKgYgKt8ZLF3 Kily5c7m+S4ZXhMX21rb8JhSAz0dfQk= =5Dk7 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-05-13 We've added 119 non-merge commits during the last 14 day(s) which contain a total of 134 files changed, 9462 insertions(+), 4742 deletions(-). The main changes are: 1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi. 2) Add BPF range computation improvements to the verifier in particular around XOR and OR operators, refactoring of checks for range computation and relaxing MUL range computation so that src_reg can also be an unknown scalar, from Cupertino Miranda. 3) Add support to attach kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. Session mode is a common use-case for tetragon and bpftrace, from Jiri Olsa. 4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf as well as BPF selftest's struct_ops handling, from Andrii Nakryiko. 5) Improvements to BPF selftests in context of BPF gcc backend, from Jose E. Marchesi & David Faust. 6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test- -style in order to retire the old test, run it in BPF CI and additionally expand test coverage, from Jordan Rife. 7) Big batch for BPF selftest refactoring in order to remove duplicate code around common network helpers, from Geliang Tang. 8) Another batch of improvements to BPF selftests to retire obsolete bpf_tcp_helpers.h as everything is available vmlinux.h, from Martin KaFai Lau. 9) Fix BPF map tear-down to not walk the map twice on free when both timer and wq is used, from Benjamin Tissoires. 10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL, from Alexei Starovoitov. 11) Change BTF build scripts to using --btf_features for pahole v1.26+, from Alan Maguire. 12) Small improvements to BPF reusing struct_size() and krealloc_array(), from Andy Shevchenko. 13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions, from Ilya Leoshkevich. 14) Extend TCP ->cong_control() callback in order to feed in ack and flag parameters and allow write-access to tp->snd_cwnd_stamp from BPF program, from Miao Xu. 15) Add support for internal-only per-CPU instructions to inline bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs, from Puranjay Mohan. 16) Follow-up to remove the redundant ethtool.h from tooling infrastructure, from Tushar Vyavahare. 17) Extend libbpf to support "module:<function>" syntax for tracing programs, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) bpf: make list_for_each_entry portable bpf: ignore expected GCC warning in test_global_func10.c bpf: disable strict aliasing in test_global_func9.c selftests/bpf: Free strdup memory in xdp_hw_metadata selftests/bpf: Fix a few tests for GCC related warnings. bpf: avoid gcc overflow warning in test_xdp_vlan.c tools: remove redundant ethtool.h from tooling infra selftests/bpf: Expand ATTACH_REJECT tests selftests/bpf: Expand getsockname and getpeername tests sefltests/bpf: Expand sockaddr hook deny tests selftests/bpf: Expand sockaddr program return value tests selftests/bpf: Retire test_sock_addr.(c|sh) selftests/bpf: Remove redundant sendmsg test cases selftests/bpf: Migrate ATTACH_REJECT test cases selftests/bpf: Migrate expected_attach_type tests selftests/bpf: Migrate wildcard destination rewrite test selftests/bpf: Migrate sendmsg6 v4 mapped address tests selftests/bpf: Migrate sendmsg deny test cases selftests/bpf: Migrate WILDCARD_IP test selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases ... ==================== Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
36e56b1b00 |
ax25: Fix reference count leak issue of net_device
There is a reference count leak issue of the object "net_device" in
ax25_dev_device_down(). When the ax25 device is shutting down, the
ax25_dev_device_down() drops the reference count of net_device one
or zero times depending on if we goto unlock_put or not, which will
cause memory leak.
In order to solve the above issue, decrease the reference count of
net_device after dev->ax25_ptr is set to null.
Fixes:
|
||
|
|
b505e03198 |
ax25: Fix reference count leak issues of ax25_dev
The ax25_addr_ax25dev() and ax25_dev_device_down() exist a reference
count leak issue of the object "ax25_dev".
Memory leak issue in ax25_addr_ax25dev():
The reference count of the object "ax25_dev" can be increased multiple
times in ax25_addr_ax25dev(). This will cause a memory leak.
Memory leak issues in ax25_dev_device_down():
The reference count of ax25_dev is set to 1 in ax25_dev_device_up() and
then increase the reference count when ax25_dev is added to ax25_dev_list.
As a result, the reference count of ax25_dev is 2. But when the device is
shutting down. The ax25_dev_device_down() drops the reference count once
or twice depending on if we goto unlock_put or not, which will cause
memory leak.
As for the issue of ax25_addr_ax25dev(), it is impossible for one pointer
to be on a list twice. So add a break in ax25_addr_ax25dev(). As for the
issue of ax25_dev_device_down(), increase the reference count of ax25_dev
once in ax25_dev_device_up() and decrease the reference count of ax25_dev
after it is removed from the ax25_dev_list.
Fixes:
|
||
|
|
a7d6e36b9a |
ax25: Use kernel universal linked list to implement ax25_dev_list
The origin ax25_dev_list implements its own single linked list, which is complicated and error-prone. For example, when deleting the node of ax25_dev_list in ax25_dev_device_down(), we have to operate on the head node and other nodes separately. This patch uses kernel universal linked list to replace original ax25_dev_list, which make the operation of ax25_dev_list easier. We should do "dev->ax25_ptr = ax25_dev;" and "dev->ax25_ptr = NULL;" while holding the spinlock, otherwise the ax25_dev_device_up() and ax25_dev_device_down() could race. Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/85bba3af651ca0e1a519da8d0d715b949891171c.1715247018.git.duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
364798056f |
l2tp: Support different protocol versions with same IP/port quadruple
|
||
|
|
b56035101e |
netdev: Add queue stats for TX stop and wake
TX queue stop and wake are counted by some drivers. Support reporting these via netdev-genl queue stats. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
8ec9897ec2 |
netlabel: fix RCU annotation for IPv4 options on socket creation
Xiumei reports the following splat when netlabel and TCP socket are used:
=============================
WARNING: suspicious RCU usage
6.9.0-rc2+ #637 Not tainted
-----------------------------
net/ipv4/cipso_ipv4.c:1880 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ncat/23333:
#0: ffffffff906030c0 (rcu_read_lock){....}-{1:2}, at: netlbl_sock_setattr+0x25/0x1b0
stack backtrace:
CPU: 11 PID: 23333 Comm: ncat Kdump: loaded Not tainted 6.9.0-rc2+ #637
Hardware name: Supermicro SYS-6027R-72RF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0 07/26/2013
Call Trace:
<TASK>
dump_stack_lvl+0xa9/0xc0
lockdep_rcu_suspicious+0x117/0x190
cipso_v4_sock_setattr+0x1ab/0x1b0
netlbl_sock_setattr+0x13e/0x1b0
selinux_netlbl_socket_post_create+0x3f/0x80
selinux_socket_post_create+0x1a0/0x460
security_socket_post_create+0x42/0x60
__sock_create+0x342/0x3a0
__sys_socket_create.part.22+0x42/0x70
__sys_socket+0x37/0xb0
__x64_sys_socket+0x16/0x20
do_syscall_64+0x96/0x180
? do_user_addr_fault+0x68d/0xa30
? exc_page_fault+0x171/0x280
? asm_exc_page_fault+0x22/0x30
entry_SYSCALL_64_after_hwframe+0x71/0x79
RIP: 0033:0x7fbc0ca3fc1b
Code: 73 01 c3 48 8b 0d 05 f2 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 29 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 f1 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fff18635208 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fbc0ca3fc1b
RDX: 0000000000000006 RSI: 0000000000000001 RDI: 0000000000000002
RBP: 000055d24f80f8a0 R08: 0000000000000003 R09: 0000000000000001
R10: 0000000000020000 R11: 0000000000000246 R12: 000055d24f80f8a0
R13: 0000000000000000 R14: 000055d24f80fb88 R15: 0000000000000000
</TASK>
The current implementation of cipso_v4_sock_setattr() replaces IP options
under the assumption that the caller holds the socket lock; however, such
assumption is not true, nor needed, in selinux_socket_post_create() hook.
Let all callers of cipso_v4_sock_setattr() specify the "socket lock held"
condition, except selinux_socket_post_create() _ where such condition can
safely be set as true even without holding the socket lock.
Fixes:
|
||
|
|
c084ebd77a |
tcp: socket option to check for MPTCP fallback to TCP
A way for an application to know if an MPTCP connection fell back to TCP is to use getsockopt(MPTCP_INFO) and look for errors. The issue with this technique is that the same errors -- EOPNOTSUPP (IPv4) and ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the kernel doesn't support this socket option. The userspace then has to look at the kernel version to understand what the errors mean. It is not clean, and it doesn't take into account older kernels where the socket option has been backported. A cleaner way would be to expose this info to the TCP socket level. In case of MPTCP socket where no fallback happened, the socket options for the TCP level will be handled in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value 1, while TCP has to set 0. If the socket option is not supported, one of these two errors will be reported: - EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets - ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the socket received after an 'accept()', when the client didn't request to use MPTCP: this socket will be a TCP one, even if the listen socket was an MPTCP one. With this new option, the kernel can return a clear answer to both "Is this kernel new enough to tell me the fallback status?" and "If it is new enough, is it currently a TCP or MPTCP socket?" questions, while not breaking the previous method. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
4b0ebbca3e |
net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP
flows.
To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
6.94% [kernel] [k] inet_gro_receive
3.02% [kernel] [k] tcp_gro_receive
patch applied:
4.27% [kernel] [k] tcp_gro_receive
4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
10.09% [kernel] [k] inet_gro_receive
2.08% [kernel] [k] tcp_gro_receive
patch applied:
6.97% [kernel] [k] inet_gro_receive
3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
|
|
186b1ea73a |
net: gro: use cb instead of skb->network_header
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
1af7f88af2 |
inet: fix inet_fill_ifaddr() flags truncation
I missed that (struct ifaddrmsg)->ifa_flags was only 8bits,
while (struct in_ifaddr)->ifa_flags is 32bits.
Use a temporary 32bit variable as I did in set_ifa_lifetime()
and check_lifetime().
Fixes:
|
||
|
|
c85e41bfe7 |
netfilter pull request 24-05-12
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmZA578ACgkQ1V2XiooU IOS18g//Zyuv+23GcUM7+FEXrlMN658xJyWiYvKjOaZZx5ZiV0QdZc4cbfPFD44p qZBMmVC/WVoC89SLdwH1W47KoJU0xsK3/OdqGHHeNJ69111wIQMpOLfAetS2K0mb F+Ue2vyWg1GQDICGsCdenHX7ihtVvnJJkomxc+3ObxtLCNsb2Dsr6JM5hMVP5Bil 4UZnPsrgfWy3A8O92burlPVE1sTWDFfFUGIf8geJc4QadwkgufkzxMhXNO7xHlpG EZ99s8FPyD3R6tRPjf4gwdjr7JjinrdrYjZDuS4d3Uv8pKlUqcx8PgXG51/unr/y qlynLXtEc1QU6SO2jENosHAG2/LQG2zsYEiiLFCP+a1JOtOxevZQKx8MyAFW8xDX +RQhcBpTBocIyJ/tCDoM9lp69iYTR196Ct48v6pSGMNhZcddT4K4BkUL47GEs8T3 IA5x8h5gV2Q9ECMgqSaycdUsfLNgE/6fWx0ROs/wo3tMsgWrXCSJi8RFtN1sNbIO rfuNnQiETIFBkQxBi7um8jadxdfIHm65cjgZBCyVbNNml3JwjYvXLxCXt2G7LxC4 Sg4nZvIbqWIifoMc1aQKypvFZjzzsWtFmYCuEUVLrnpj2SFTyh5CNzNo3MlHf7LG sRb/XubdY6e0spLzd5VDjwH5qOT3poWAccatRr5BVUarxXCD5Vs= =RD9p -----END PGP SIGNATURE----- Merge tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 skips transaction if object type provides no .update interface. Patch #2 skips NETDEV_CHANGENAME which is unused. Patch #3 enables conntrack to handle Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) as untracked opposed to invalid packets. From Linus Luessing. Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of dropping them, from Jason Xing. Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0, also from Jason. Patch #6 removes reference in netfilter's sysctl documentation on pickup entries which were already removed by Florian Westphal. Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which allows to evict entries from the conntrack table, also from Florian. Patches #8 to #16 updates nf_tables pipapo set backend to allocate the datastructure copy on-demand from preparation phase, to better deal with OOM situations where .commit step is too late to fail. Series from Florian Westphal. Patch #17 adds a selftest with packetdrill to cover conntrack TCP state transitions, also from Florian. Patch #18 use GFP_KERNEL to clone elements from control plane to avoid quick atomic reserves exhaustion with large sets, reporter refers to million entries magnitude. * tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: allow clone callbacks to sleep selftests: netfilter: add packetdrill based conntrack tests netfilter: nft_set_pipapo: remove dirty flag netfilter: nft_set_pipapo: move cloning of match info to insert/removal path netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone netfilter: nft_set_pipapo: merge deactivate helper into caller netfilter: nft_set_pipapo: prepare walk function for on-demand clone netfilter: nft_set_pipapo: prepare destroy function for on-demand clone netfilter: nft_set_pipapo: make pipapo_clone helper return NULL netfilter: nft_set_pipapo: move prove_locking helper around netfilter: conntrack: remove flowtable early-drop test netfilter: conntrack: documentation: remove reference to non-existent sysctl netfilter: use NF_DROP instead of -NF_DROP netfilter: conntrack: dccp: try not to drop skb in conntrack netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler netfilter: nf_tables: skip transaction if update object is not implemented ==================== Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
9961a78594 |
for-6.10/io_uring-20240511
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YdYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnmVEADBq8QT9Oa3HTIONHwxjmGMOalr7PSrBP89 S6Inv/l+3xDlyolyLh1HIXUC84iS9Ihi2pNC3dZct4fNcpA99H0CFaHDGwZ5rVri MrFaubZAps1qSzeypqEq3zWGKVUoaYWaOKhuOjye5Ei2tKymbguhDKl1WiKibD21 E9qOYbhSUFdub/xtx9Rv4BS05QW5bHZ2Y/tTFqB8MY4JUsdb9g/deVZkyGUQYRSd 40mDallRldjQQTQ8iU4H6/ORdGIN/90aLPbmzMdFtQcymnmRyid3rOEwhwWYe4NO ljnI8m1SJQilZz1d5oHBXBB5QubVptY1JWxbk8GQCSmOU5wrCq+ARCJXUtBXwniJ K4VFsGm9MkZcc5vsIwIzvsrk8DODla6EVo/jyDy8iFceZcNWfVxdwa5NS67V/6QT macbF785XDsmA5E4UjslbZqU047w+A5N1yazcZWzMk0coJDeB8AtsA1/C2WZOm8p HVoiAzsqt81hvPItnjCyZluL/YW+BKeOTnq04QbpQKcJpZBzszO4ZLtuD+IXkE69 8ZZPGFPnPS4ZMQojKkwsBr+Yo65S18oBDkib36mr2lsdnoWTpGq47C7ScUDBbqGm iI7U8tYMnVVkQQHVVmGI4KOr5/4lxxp8398kqCaxfW3D5BQhbtUOF/OBjBHj1ZSV 9aZx87CyhA== =DwAV -----END PGP SIGNATURE----- Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ... |
||
|
|
ef31ea6c27 |
vfs-6.10.netfs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3PiAAKCRCRxhvAZXjc
ojXMAP4vIKnxNOf0qXNDHkMvIXw9gYxtHXQfOWCEokcRdBPxlQEArhZNz/TBWhH2
lEbE/mM1PUYhpqGh+K19IX503l87NQA=
=gyKJ
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This reworks the netfslib writeback implementation so that pages read
from the cache are written to the cache through ->writepages(),
thereby allowing the fscache page flag to be retired.
The reworking also:
- builds on top of the new writeback_iter() infrastructure
- makes it possible to use vectored write RPCs as discontiguous
streams of pages can be accommodated
- makes it easier to do simultaneous content crypto and stream
division
- provides support for retrying writes and re-dividing a stream
- replaces the ->launder_folio() op, so that ->writepages() is used
instead
- uses mempools to allocate the netfs_io_request and
netfs_io_subrequest structs to avoid allocation failure in the
writeback path
Some code that uses the fscache page flag is retained for
compatibility purposes with nfs and ceph. The code is switched to
using the synonymous private_2 label instead and marked with
deprecation comments.
The merge commit contains additional details on the new algorithm that
I've left out of here as it would probably be excessively detailed.
On top of the netfslib infrastructure this contains the work to
convert cifs over to netfslib"
* tag 'vfs-6.10.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
cifs: Enable large folio support
cifs: Remove some code that's no longer used, part 3
cifs: Remove some code that's no longer used, part 2
cifs: Remove some code that's no longer used, part 1
cifs: Cut over to using netfslib
cifs: Implement netfslib hooks
cifs: Make add_credits_and_wake_if() clear deducted credits
cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs
cifs: Set zero_point in the copy_file_range() and remap_file_range()
cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c
cifs: Replace the writedata replay bool with a netfs sreq flag
cifs: Make wait_mtu_credits take size_t args
cifs: Use more fields from netfs_io_subrequest
cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest
cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest
cifs: Use alternative invalidation to using launder_folio
netfs, afs: Use writeback retry to deal with alternate keys
netfs: Miscellaneous tidy ups
netfs: Remove the old writeback code
netfs: Cut over to using new writeback code
...
|
||
|
|
fe6532b44a |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into net-accept-more
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1557 commits) net: qede: use extack in qede_parse_actions() net: qede: propagate extack through qede_flow_spec_validate() net: qede: use faked extack in qede_flow_spec_to_rule() net: qede: use extack in qede_parse_flow_attr() net: qede: add extack in qede_add_tc_flower_fltr() net: qede: use extack in qede_flow_parse_udp_v4() net: qede: use extack in qede_flow_parse_udp_v6() net: qede: use extack in qede_flow_parse_tcp_v4() net: qede: use extack in qede_flow_parse_tcp_v6() net: qede: use extack in qede_flow_parse_v4_common() net: qede: use extack in qede_flow_parse_v6_common() net: qede: use extack in qede_set_v4_tuple_to_profile() net: qede: use extack in qede_set_v6_tuple_to_profile() net: qede: use extack in qede_flow_parse_ports() net: usb: smsc95xx: stop lying about skb->truesize net: dsa: microchip: Fix spellig mistake "configur" -> "configure" af_unix: Add dead flag to struct scm_fp_list. net: ethernet: adi: adin1110: Replace linux/gpio.h by proper one octeontx2-pf: Reuse Transmit queue/Send queue index of HTB class gve: Use ethtool_sprintf/puts() to fill stats strings ... |
||
|
|
ad1978dbbd |
Merge branch 'for-6.10/io_uring' into net-accept-more
* for-6.10/io_uring: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ... |
||
|
|
160e9d2752 |
ipv6: sr: fix invalid unregister error path
The error path of seg6_init() is wrong in case CONFIG_IPV6_SEG6_LWTUNNEL is not defined. In that case if seg6_hmac_init() fails, the genl_unregister_family() isn't called. This issue exist since commit |
||
|
|
6e370a771d |
ipv6: sr: fix incorrect unregister order
Commit |
||
|
|
3321687e32 |
ipv6: sr: add missing seg6_local_exit
Currently, we only call seg6_local_exit() in seg6_init() if
seg6_local_init() failed. But forgot to call it in seg6_exit().
Fixes:
|
||
|
|
7c988176b6 |
net: openvswitch: fix overwriting ct original tuple for ICMPv6
OVS_PACKET_CMD_EXECUTE has 3 main attributes:
- OVS_PACKET_ATTR_KEY - Packet metadata in a netlink format.
- OVS_PACKET_ATTR_PACKET - Binary packet content.
- OVS_PACKET_ATTR_ACTIONS - Actions to execute on the packet.
OVS_PACKET_ATTR_KEY is parsed first to populate sw_flow_key structure
with the metadata like conntrack state, input port, recirculation id,
etc. Then the packet itself gets parsed to populate the rest of the
keys from the packet headers.
Whenever the packet parsing code starts parsing the ICMPv6 header, it
first zeroes out fields in the key corresponding to Neighbor Discovery
information even if it is not an ND packet.
It is an 'ipv6.nd' field. However, the 'ipv6' is a union that shares
the space between 'nd' and 'ct_orig' that holds the original tuple
conntrack metadata parsed from the OVS_PACKET_ATTR_KEY.
ND packets should not normally have conntrack state, so it's fine to
share the space, but normal ICMPv6 Echo packets or maybe other types of
ICMPv6 can have the state attached and it should not be overwritten.
The issue results in all but the last 4 bytes of the destination
address being wiped from the original conntrack tuple leading to
incorrect packet matching and potentially executing wrong actions
in case this packet recirculates within the datapath or goes back
to userspace.
ND fields should not be accessed in non-ND packets, so not clearing
them should be fine. Executing memset() only for actual ND packets to
avoid the issue.
Initializing the whole thing before parsing is needed because ND packet
may not contain all the options.
The issue only affects the OVS_PACKET_CMD_EXECUTE path and doesn't
affect packets entering OVS datapath from network interfaces, because
in this case CT metadata is populated from skb after the packet is
already parsed.
Fixes:
|
||
|
|
540bf24fba |
af_unix: Fix data races in unix_release_sock/unix_stream_sendmsg
A data-race condition has been identified in af_unix. In one data path, the write function unix_release_sock() atomically writes to sk->sk_shutdown using WRITE_ONCE. However, on the reader side, unix_stream_sendmsg() does not read it atomically. Consequently, this issue is causing the following KCSAN splat to occur: BUG: KCSAN: data-race in unix_release_sock / unix_stream_sendmsg write (marked) to 0xffff88867256ddbb of 1 bytes by task 7270 on cpu 28: unix_release_sock (net/unix/af_unix.c:640) unix_release (net/unix/af_unix.c:1050) sock_close (net/socket.c:659 net/socket.c:1421) __fput (fs/file_table.c:422) __fput_sync (fs/file_table.c:508) __se_sys_close (fs/open.c:1559 fs/open.c:1541) __x64_sys_close (fs/open.c:1541) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffff88867256ddbb of 1 bytes by task 989 on cpu 14: unix_stream_sendmsg (net/unix/af_unix.c:2273) __sock_sendmsg (net/socket.c:730 net/socket.c:745) ____sys_sendmsg (net/socket.c:2584) __sys_sendmmsg (net/socket.c:2638 net/socket.c:2724) __x64_sys_sendmmsg (net/socket.c:2753 net/socket.c:2750 net/socket.c:2750) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x01 -> 0x03 The line numbers are related to commit |
||
|
|
7172dc93d6 |
af_unix: Add dead flag to struct scm_fp_list.
Commit |
||
|
|
fa23e0d4b7 |
netfilter: nf_tables: allow clone callbacks to sleep
Sven Auhagen reports transaction failures with following error: ./main.nft:13:1-26: Error: Could not process rule: Cannot allocate memory percpu: allocation failed, size=16 align=8 atomic=1, atomic alloc failed, no space left This points to failing pcpu allocation with GFP_ATOMIC flag. However, transactions happen from user context and are allowed to sleep. One case where we can call into percpu allocator with GFP_ATOMIC is nft_counter expression. Normally this happens from control plane, so this could use GFP_KERNEL instead. But one use case, element insertion from packet path, needs to use GFP_ATOMIC allocations (nft_dynset expression). At this time, .clone callbacks always use GFP_ATOMIC for this reason. Add gfp_t argument to the .clone function and pass GFP_KERNEL or GFP_ATOMIC flag depending on context, this allows all clone memory allocations to sleep for the normal (transaction) case. Cc: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> |
||
|
|
532aec7e87 |
netfilter: nft_set_pipapo: remove dirty flag
After previous change: ->clone exists: ->dirty is always true ->clone == NULL ->dirty is always false So remove this flag. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> |
||
|
|
3f1d886cc7 |
netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
This set type keeps two copies of the sets' content, priv->match (live version, used to match from packet path) priv->clone (work-in-progress version of the 'future' priv->match). All additions and removals are done on priv->clone. When transaction completes, priv->clone becomes priv->match and a new clone is allocated for use by next transaction. Problem is that the cloning requires GFP_KERNEL allocations but we cannot fail at either commit or abort time. This patch defers the clone until we get an insertion or removal request. This allows us to handle OOM situations correctly. This also allows to remove ->dirty in a followup change: If ->clone exists, ->dirty is always true If ->clone is NULL, ->dirty is always false, no elements were added or removed (except catchall elements which are external to the specific set backend). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> |
||
|
|
a238106703 |
netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
The helper uses priv->clone unconditionally which will fail once we do the clone conditionally on first insert or removal. 'nft get element' from userspace needs to use priv->match since this runs from rcu read side lock section. Prepare for this by passing the match backend data as argument. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> |
||
|
|
1cd354fe1e |
net: ipv6: fix wrong start position when receive hop-by-hop fragment
In IPv6, ipv6_rcv_core will parse the hop-by-hop type extension header and increase skb->transport_header by one extension header length.
But if there are more other extension headers like fragment header at this time, the skb->transport_header points to the second extension header,
not the transport layer header or the first extension header.
This will result in the start and nexthdrp variable not pointing to the same position in ipv6frag_thdr_trunced,
and ipv6_skip_exthdr returning incorrect offset and frag_off.Sometimes,the length of the last sharded packet is smaller than the calculated incorrect offset, resulting in packet loss.
We can use network header to offset and calculate the correct position to solve this problem.
Fixes:
|
||
|
|
383eed2de5 |
tcp: get rid of twsk_unique()
DCCP is going away soon, and had no twsk_unique() method. We can directly call tcp_twsk_unique() for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240507164140.940547-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
33fb988b67 |
net/sched: adjust device watchdog timer to detect stopped queue at right time
Applications are sensitive to long network latency, particularly heartbeat monitoring ones. Longer the tx timeout recovery higher the risk with such applications on a production machines. This patch remedies, yet honoring device set tx timeout. Modify watchdog next timeout to be shorter than the device specified. Compute the next timeout be equal to device watchdog timeout less the how long ago queue stop had been done. At next watchdog timeout tx timeout handler is called into if still in stopped state. Either called or not called, restore the watchdog timeout back to device specified. Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com> Link: https://lore.kernel.org/r/20240508133617.4424-1-praveen.kannoju@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
b1992c3772 |
kbuild: use $(src) instead of $(srctree)/$(src) for source directory
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for
checked-in source files. It is merely a convention without any functional
difference. In fact, $(obj) and $(src) are exactly the same, as defined
in scripts/Makefile.build:
src := $(obj)
When the kernel is built in a separate output directory, $(src) does
not accurately reflect the source directory location. While Kbuild
resolves this discrepancy by specifying VPATH=$(srctree) to search for
source files, it does not cover all cases. For example, when adding a
header search path for local headers, -I$(srctree)/$(src) is typically
passed to the compiler.
This introduces inconsistency between upstream and downstream Makefiles
because $(src) is used instead of $(srctree)/$(src) for the latter.
To address this inconsistency, this commit changes the semantics of
$(src) so that it always points to the directory in the source tree.
Going forward, the variables used in Makefiles will have the following
meanings:
$(obj) - directory in the object tree
$(src) - directory in the source tree (changed by this commit)
$(objtree) - the top of the kernel object tree
$(srctree) - the top of the kernel source tree
Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced
with $(src).
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
|
||
|
|
e7073830cc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c |
||
|
|
628bc3e5a1 |
l2tp: Support several sockets with same IP/port quadruple
Some l2tp providers will use 1701 as origin port and open several tunnels for the same origin and target. On the Linux side, this may mean opening several sockets, but then trafic will go to only one of them, losing the trafic for the tunnel of the other socket (or leaving it up to userland, consuming a lot of cpu%). This can also happen when the l2tp provider uses a cluster, and load-balancing happens to migrate from one origin IP to another one, for which a socket was already established. Managing reassigning tunnels from one socket to another would be very hairy for userland. Lastly, as documented in l2tpconfig(1), as client it may be necessary to use 1701 as origin port for odd firewalls reasons, which could prevent from establishing several tunnels to a l2tp server, for the same reason: trafic would get only on one of the two sockets. With the V2 protocol it is however easy to route trafic to the proper tunnel, by looking up the tunnel number in the network namespace. This fixes the three cases altogether. Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Link: https://lore.kernel.org/r/20240506215336.1470009-1-samuel.thibault@ens-lyon.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
bafa6b4d95 |
SUNRPC: Fix gss_free_in_token_pages()
Dan Carpenter says: > Commit |
||
|
|
2ddc0dd7fe |
net/smc: fix neighbour and rtable leak in smc_ib_find_route()
In smc_ib_find_route(), the neighbour found by neigh_lookup() and rtable
resolved by ip_route_output_flow() are not released or put before return.
It may cause the refcount leak, so fix it.
Link: https://lore.kernel.org/r/20240506015439.108739-1-guwen@linux.alibaba.com
Fixes:
|
||
|
|
83127ecada |
wireless-next patches for v6.10
The third, and most likely the last, "new features" pull request for v6.10 with changes both in stack and in drivers. In ath12k and rtw89 we disabled Wireless Extensions just like with iwlwifi earlier. Wi-Fi 7 devices will not support Wireless Extensions (WEXT) anymore so if someone is still using the legacy WEXT interface it's time to switch to nl80211 now! We merged wireless into wireless-next as we decided not to send a wireless pull request to v6.9 this late in the cycle. Also an immutable branch with MHI subsystem was merged to get ath11k and ath12k hibernation working. Major changes: mac80211/cfg80211 * handle color change per link mt76 * mt7921 LED control * mt7925 EHT radiotap support * mt7920e PCI support ath12k * debugfs support * dfs_simulate_radar debugfs file * disable Wireless Extensions * suspend and hibernation support * ACPI support * refactoring in preparation of multi-link support ath11k * support hibernation (required changes in qrtr and MHI subsystems) * ieee80211-freq-limit Device Tree property support ath10k * firmware-name Device Tree property support rtw89 * complete features of new WiFi 7 chip 8922AE including BT-coexistence and WoWLAN * use BIOS ACPI settings to set TX power and channels * disable Wireless Extensios on Wi-Fi 7 devices iwlwifi * block_esr debugfs file * support again firmware API 90 (was reverted earlier) * provide channel survey information for Automatic Channel Selection (ACS) -----BEGIN PGP SIGNATURE----- iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmY7aisRHGt2YWxvQGtl cm5lbC5vcmcACgkQbhckVSbrbZuILQgAmcfU9EJca08qqI2dvcu/X3C/WBVaOQUC 3sdvdGfaO9bTXDjbt0KfmhYNc0BHIM1Jzbegmhe80k6/4cnkWh0JgGZbgMMZGL+V KI8ejbjY1BpfReApiL37DzSUWh176u4jNhUhifRXrLYxBkoZ5goUX8u6meMZHuj4 8EBfW7xGhtmMn2Po+bUYNXsczWR+mcfxfNF3vfyYIIsuJz2wbONcRuc0Hvq1mHmC Ebtq7myY/SQvLtILGZxwAhdgHH8AgCN9R5WpzcCHze8t8xmmt4CjdwS6BGwmaBNM NoGnzHvDG0u+dUYcY/x1tiky9PKRjMtMW+GvYPUvjy5j3nDDXeq51w== =tf9q -----END PGP SIGNATURE----- Merge tag 'wireless-next-2024-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.10 The third, and most likely the last, "new features" pull request for v6.10 with changes both in stack and in drivers. In ath12k and rtw89 we disabled Wireless Extensions just like with iwlwifi earlier. Wi-Fi 7 devices will not support Wireless Extensions (WEXT) anymore so if someone is still using the legacy WEXT interface it's time to switch to nl80211 now! We merged wireless into wireless-next as we decided not to send a wireless pull request to v6.9 this late in the cycle. Also an immutable branch with MHI subsystem was merged to get ath11k and ath12k hibernation working. Major changes: mac80211/cfg80211 * handle color change per link mt76 * mt7921 LED control * mt7925 EHT radiotap support * mt7920e PCI support ath12k * debugfs support * dfs_simulate_radar debugfs file * disable Wireless Extensions * suspend and hibernation support * ACPI support * refactoring in preparation of multi-link support ath11k * support hibernation (required changes in qrtr and MHI subsystems) * ieee80211-freq-limit Device Tree property support ath10k * firmware-name Device Tree property support rtw89 * complete features of new WiFi 7 chip 8922AE including BT-coexistence and WoWLAN * use BIOS ACPI settings to set TX power and channels * disable Wireless Extensios on Wi-Fi 7 devices iwlwifi * block_esr debugfs file * support again firmware API 90 (was reverted earlier) * provide channel survey information for Automatic Channel Selection (ACS) * tag 'wireless-next-2024-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (214 commits) wifi: mwl8k: initialize cmd->addr[] properly wifi: iwlwifi: Ensure prph_mac dump includes all addresses wifi: iwlwifi: mvm: don't request statistics in restart wifi: iwlwifi: mvm: exit EMLSR if secondary link is not used wifi: iwlwifi: mvm: add beacon template version 14 wifi: iwlwifi: mvm: align UATS naming with firmware wifi: iwlwifi: Force SCU_ACTIVE for specific platforms wifi: iwlwifi: mvm: record and return channel survey information wifi: iwlwifi: mvm: add the firmware API for channel survey wifi: iwlwifi: mvm: Fix race in scan completion wifi: iwlwifi: mvm: Add a print for invalid link pair due to bandwidth wifi: iwlwifi: mvm: add a debugfs for reading EMLSR blocking reasons wifi: iwlwifi: mvm: Add active EMLSR blocking reasons prints wifi: iwlwifi: bump FW API to 90 for BZ/SC devices wifi: iwlwifi: mvm: fix primary link setting wifi: iwlwifi: mvm: use already determined cmd_id wifi: iwlwifi: mvm: don't reset link selection during restart wifi: iwlwifi: Print EMLSR states name wifi: iwlwifi: mvm: Block EMLSR when a p2p/softAP vif is active wifi: iwlwifi: mvm: fix typo in debug print ... ==================== Link: https://lore.kernel.org/r/20240508120726.85A10C113CC@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
4db783d68b |
ipv6: prevent NULL dereference in ip6_output()
According to syzbot, there is a chance that ip6_dst_idev()
returns NULL in ip6_output(). Most places in IPv6 stack
deal with a NULL idev just fine, but not here.
syzbot reported:
general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7]
CPU: 0 PID: 9775 Comm: syz-executor.4 Not tainted 6.9.0-rc5-syzkaller-00157-g6a30653b604a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:ip6_output+0x231/0x3f0 net/ipv6/ip6_output.c:237
Code: 3c 1e 00 49 89 df 74 08 4c 89 ef e8 19 58 db f7 48 8b 44 24 20 49 89 45 00 49 89 c5 48 8d 9d e0 05 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 38 84 c0 4c 8b 74 24 28 0f 85 61 01 00 00 8b 1b 31 ff
RSP: 0018:ffffc9000927f0d8 EFLAGS: 00010202
RAX: 00000000000000bc RBX: 00000000000005e0 RCX: 0000000000040000
RDX: ffffc900131f9000 RSI: 0000000000004f47 RDI: 0000000000004f48
RBP: 0000000000000000 R08: ffffffff8a1f0b9a R09: 1ffffffff1f51fad
R10: dffffc0000000000 R11: fffffbfff1f51fae R12: ffff8880293ec8c0
R13: ffff88805d7fc000 R14: 1ffff1100527d91a R15: dffffc0000000000
FS: 00007f135c6856c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 0000000064096000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
NF_HOOK include/linux/netfilter.h:314 [inline]
ip6_xmit+0xefe/0x17f0 net/ipv6/ip6_output.c:358
sctp_v6_xmit+0x9f2/0x13f0 net/sctp/ipv6.c:248
sctp_packet_transmit+0x26ad/0x2ca0 net/sctp/output.c:653
sctp_packet_singleton+0x22c/0x320 net/sctp/outqueue.c:783
sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline]
sctp_outq_flush+0x6d5/0x3e20 net/sctp/outqueue.c:1212
sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
sctp_do_sm+0x59cc/0x60c0 net/sctp/sm_sideeffect.c:1169
sctp_primitive_ASSOCIATE+0x95/0xc0 net/sctp/primitive.c:73
__sctp_connect+0x9cd/0xe30 net/sctp/socket.c:1234
sctp_connect net/sctp/socket.c:4819 [inline]
sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834
__sys_connect_file net/socket.c:2048 [inline]
__sys_connect+0x2df/0x310 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x7a/0x90 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
|
||
|
|
4893b8b3ef |
hsr: Simplify code for announcing HSR nodes timer setup
Up till now the code to start HSR announce timer, which triggers sending
supervisory frames, was assuming that hsr_netdev_notify() would be called
at least twice for hsrX interface. This was required to have different
values for old and current values of network device's operstate.
This is problematic for a case where hsrX interface is already in the
operational state when hsr_netdev_notify() is called, so timer is not
configured to trigger and as a result the hsrX is not sending supervisory
frames to HSR ring.
This error has been discovered when hsr_ping.sh script was run. To be
more specific - for the hsr1 and hsr2 the hsr_netdev_notify() was
called at least twice with different IF_OPER_{LOWERDOWN|DOWN|UP} states
assigned in hsr_check_carrier_and_operstate(hsr). As a result there was
no issue with sending supervisory frames.
However, with hsr3, the notify function was called only once with
operstate set to IF_OPER_UP and timer responsible for triggering
supervisory frames was not fired.
The solution is to use netif_oper_up() and netif_running() helper
functions to assess if network hsrX device is up.
Only then, when the timer is not already pending, it is started.
Otherwise it is deactivated.
Fixes:
|
||
|
|
58a4ff5d77 |
phonet: no longer hold RTNL in route_dumpit()
route_dumpit() already relies on RCU, RTNL is not needed. Also change return value at the end of a dump. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Remi Denis-Courmont <courmisch@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240507121748.416287-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
d101291b26 |
ipv6: fib6_rules: avoid possible NULL dereference in fib6_rule_action()
syzbot is able to trigger the following crash [1],
caused by unsafe ip6_dst_idev() use.
Indeed ip6_dst_idev() can return NULL, and must always be checked.
[1]
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 31648 Comm: syz-executor.0 Not tainted 6.9.0-rc4-next-20240417-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:__fib6_rule_action net/ipv6/fib6_rules.c:237 [inline]
RIP: 0010:fib6_rule_action+0x241/0x7b0 net/ipv6/fib6_rules.c:267
Code: 02 00 00 49 8d 9f d8 00 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 f9 32 bf f7 48 8b 1b 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 e0 32 bf f7 4c 8b 03 48 89 ef 4c
RSP: 0018:ffffc9000fc1f2f0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1a772f98c8186700
RDX: 0000000000000003 RSI: ffffffff8bcac4e0 RDI: ffffffff8c1f9760
RBP: ffff8880673fb980 R08: ffffffff8fac15ef R09: 1ffffffff1f582bd
R10: dffffc0000000000 R11: fffffbfff1f582be R12: dffffc0000000000
R13: 0000000000000080 R14: ffff888076509000 R15: ffff88807a029a00
FS: 00007f55e82ca6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b31d23000 CR3: 0000000022b66000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
fib_rules_lookup+0x62c/0xdb0 net/core/fib_rules.c:317
fib6_rule_lookup+0x1fd/0x790 net/ipv6/fib6_rules.c:108
ip6_route_output_flags_noref net/ipv6/route.c:2637 [inline]
ip6_route_output_flags+0x38e/0x610 net/ipv6/route.c:2649
ip6_route_output include/net/ip6_route.h:93 [inline]
ip6_dst_lookup_tail+0x189/0x11a0 net/ipv6/ip6_output.c:1120
ip6_dst_lookup_flow+0xb9/0x180 net/ipv6/ip6_output.c:1250
sctp_v6_get_dst+0x792/0x1e20 net/sctp/ipv6.c:326
sctp_transport_route+0x12c/0x2e0 net/sctp/transport.c:455
sctp_assoc_add_peer+0x614/0x15c0 net/sctp/associola.c:662
sctp_connect_new_asoc+0x31d/0x6c0 net/sctp/socket.c:1099
__sctp_connect+0x66d/0xe30 net/sctp/socket.c:1197
sctp_connect net/sctp/socket.c:4819 [inline]
sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834
__sys_connect_file net/socket.c:2048 [inline]
__sys_connect+0x2df/0x310 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x7a/0x90 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes:
|
||
|
|
e2d09e5a1e |
net: dst_cache: minor optimization in dst_cache_set_ip6()
There is no need to use this_cpu_ptr(dst_cache->cache) twice. Compiler is unable to optimize the second call, because of per-cpu constraints. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240507132717.627518-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
3b09b2bd0d |
net: dst_cache: annotate data-races around dst_cache->reset_ts
dst_cache->reset_ts is read or written locklessly, add READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240507132000.614591-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
012b720691 |
rxrpc: Only transmit one ACK per jumbo packet received
Only generate one ACK packet for all the subpackets in a jumbo packet. If
we would like to generate more than one ACK, we prioritise them base on
their reason code, in the order, highest first:
OutOfSeq > NoSpace > ExceedsWin > Duplicate > Requested > Delay > Idle
For the first four, we reference the lowest offending subpacket; for the
last three, the highest.
This reduces the number of ACKs we end up transmitting to one per UDP
packet transmitted to reduce network loading and packet parsing.
Fixes:
|
||
|
|
ba4e103848 |
rxrpc: Fix congestion control algorithm
Make the following fixes to the congestion control algorithm:
(1) Don't vary the cwnd starting value by the size of RXRPC_TX_SMSS since
that's currently held constant - set to the size of a jumbo subpacket
payload so that we can create jumbo packets on the fly. The current
code invariably picks 3 as the starting value.
Further, the starting cwnd needs to be an even number because we ack
every other packet, so set it to 4.
(2) Don't cut ssthresh when we see an ACK come from the peer with a
receive window (rwind) less than ssthresh. ssthresh keeps track of
characteristics of the connection whereas rwind may be reduced by the
peer for any reason - and may be reduced to 0.
Fixes:
|
||
|
|
1d3985ed0d |
ax25: Remove superfuous "return" from ax25_ds_set_timer
Remove the explicit call to "return" in the void ax25_ds_set_timer
function that was introduced in
|
||
|
|
2b696a2a10 |
ipvs: allow some sysctls in non-init user namespaces
Let's make all IPVS sysctls writtable even when network namespace is owned by non-initial user namespace. Let's make a few sysctls to be read-only for non-privileged users: - sync_qlen_max - sync_sock_size - run_estimation - est_cpulist - est_nice I'm trying to be conservative with this to prevent introducing any security issues in there. Maybe, we can allow more sysctls to be writable, but let's do this on-demand and when we see real use-case. This patch is motivated by user request in the LXC project [1]. Having this can help with running some Kubernetes [2] or Docker Swarm [3] workloads inside the system containers. Link: https://github.com/lxc/lxc/issues/4278 [1] Link: |
||
|
|
643bb5dbae |
ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh
Cc: Julian Anastasov <ja@ssi.bg> Cc: Simon Horman <horms@verge.net.au> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
4e13d3a9c2 |
ipv6: Fix potential uninit-value access in __ip6_make_skb()
As it was done in commit |
||
|
|
b7ffab29a8 |
net: bridge: switchdev: Improve error message for port_obj_add/del functions
Enhance the error reporting mechanism in the switchdev framework to
provide more informative and user-friendly error messages.
Following feedback from users struggling to understand the implications
of error messages like "failed (err=-28) to add object (id=2)", this
update aims to clarify what operation failed and how this might impact
the system or network.
With this change, error messages now include a description of the failed
operation, the specific object involved, and a brief explanation of the
potential impact on the system. This approach helps administrators and
developers better understand the context and severity of errors,
facilitating quicker and more effective troubleshooting.
Example of the improved logging:
[ 70.516446] ksz-switch spi0.0 uplink: Failed to add Port Multicast
Database entry (object id=2) with error: -ENOSPC (-28).
[ 70.516446] Failure in updating the port's Multicast Database could
lead to multicast forwarding issues.
[ 70.516446] Current HW/SW setup lacks sufficient resources.
This comprehensive update includes handling for a range of switchdev
object IDs, ensuring that most operations within the switchdev framework
benefit from clearer error reporting.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
||
|
|
2e82a58d6c |
appletalk: Improve handling of broadcast packets
When a broadcast AppleTalk packet is received, prefer queuing it on the socket whose address matches the address of the interface that received the packet (and is listening on the correct port). Userspace applications that handle such packets will usually send a response on the same socket that received the packet; this fix allows the response to be sent on the correct interface. If a socket matching the interface's address is not found, an arbitrary socket listening on the correct port will be used, if any. This matches the implementation's previous behavior. Fixes atalkd's responses to network information requests when multiple network interfaces are configured to use AppleTalk. Link: https://lore.kernel.org/netdev/20200722113752.1218-2-vincent.ldev@duvert.net/ Link: https://gist.github.com/VinDuv/4db433b6dce39d51a5b7847ee749b2a4 Signed-off-by: Vincent Duvert <vincent.ldev@duvert.net> Signed-off-by: Doug Brown <doug@schmorgal.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
db3efdcf70 |
net/ipv4: add tracepoint for icmp_send
Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.
1. Giving an usecase example:
=============================
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.
2. Operation Instructions:
==========================
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable
3. Result View:
================
udp_client_erro-11370 [002] ...s.12 124.728002:
icmp_send: icmp_send: type=3, code=3.
From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23
skbaddr=00000000589b167a
Signed-off-by: Peilin He <he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Liu Chun <liu.chun2@zte.com.cn>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|