Commit Graph

2338 Commits

Author SHA1 Message Date
Karol Kolacinski
a33a302b50 ice: change SMA pins to SDP in PTP API
This change aligns E810 PTP pin control to all other products.

Currently, SMA/U.FL port expanders are controlled together with SDP pins
connected to 1588 clock. To align this, separate this control by
exposing only SDP20..23 pins in PTP API on adapters with DPLL.

Clear error for all E810 on absent NVM pin section or other errors to
allow proper initialization on SMA E810 with NVM section.

Use ARRAY_SIZE for pin array instead of internal definition.

Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09 09:56:18 -07:00
Arkadiusz Kubalewski
2dd5d03c77 ice: redesign dpll sma/u.fl pins control
DPLL-enabled E810 NIC driver provides user with list of input and output
pins. Hardware internal design impacts user control over SMA and U.FL
pins. Currently end-user view on those dpll pins doesn't provide any layer
of abstraction. On the hardware level SMA and U.FL pins are tied together
due to existence of direction control logic for each pair:
- SMA1 (bi-directional) and U.FL1 (only output)
- SMA2 (bi-directional) and U.FL2 (only input)
The user activity on each pin of the pair may impact the state of the
other.

Previously all the pins were provided to the user as is, without the
control over SMA pins direction.

Introduce a software controlled layer of abstraction over external board
pins, instead of providing the user with access to raw pins connected to
the dpll:
- new software controlled SMA and U.FL pins,
- callback operations directing user requests to corresponding hardware
  pins according to the runtime configuration,
- ability to control SMA pins direction.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09 09:56:18 -07:00
Martyna Szapar-Mudlaw
e7aee24a89 ice: add link_down_events statistic
Introduce a link_down_events counter to the ice driver, incremented
each time the link transitions from up to down.
This counter can help diagnose issues related to link stability,
such as port flapping or unexpected link drops.

The value is exposed via ethtool's get_link_ext_stats() interface.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09 09:56:18 -07:00
Jacob Keller
141d0c9037 net: intel: move RSS packet classifier types to libie
The Intel i40e, iavf, and ice drivers all include a definition of the
packet classifier filter types used to program RSS hash enable bits. For
i40e, these bits are used for both the PF and VF to configure the PFQF_HENA
and VFQF_HENA registers.

For ice and iAVF, these bits are used to communicate the desired hash
enable filter over virtchnl via its struct virtchnl_rss_hashena. The
virtchnl.h header makes no mention of where the bit definitions reside.

Maintaining a separate copy of these bits across three drivers is
cumbersome. Move the definition to libie as a new pctype.h header file.
Each driver can include this, and drop its own definition.

The ice implementation also defined a ICE_AVF_FLOW_FIELD_INVALID, intending
to use this to indicate when there were no hash enable bits set. This is
confusing, since the enumeration is using bit positions. A value of 0
*should* indicate the first bit. Instead, rewrite the code that uses
ICE_AVF_FLOW_FIELD_INVALID to just check if the avf_hash is zero. From
context this should be clear that we're checking if none of the bits are
set.

The values are kept as bit positions instead of encoding the BIT_ULL
directly into their value. While most users will simply use BIT_ULL
immediately, i40e uses the macros both with BIT_ULL and test_bit/set_bit
calls.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09 09:56:18 -07:00
Jacob Keller
78b2d9908b net: intel: rename 'hena' to 'hashcfg' for clarity
i40e, ice, and iAVF all use 'hena' as a shorthand for the "hash enable"
configuration. This comes originally from the X710 datasheet 'xxQF_HENA'
registers. In the context of the registers the meaning is fairly clear.

However, on its own, hena is a weird name that can be more difficult to
understand. This is especially true in ice. The E810 hardware doesn't even
have registers with HENA in the name.

Replace the shorthand 'hena' with 'hashcfg'. This makes it clear the
variables deal with the Hash configuration, not just a single boolean
on/off for all hashing.

Do not update the register names. These come directly from the datasheet
for X710 and X722, and it is more important that the names can be searched.

Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-06-09 09:56:18 -07:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Michal Kubiak
73145e6d81 ice: fix rebuilding the Tx scheduler tree for large queue counts
The current implementation of the Tx scheduler allows the tree to be
rebuilt as the user adds more Tx queues to the VSI. In such a case,
additional child nodes are added to the tree to support the new number
of queues.
Unfortunately, this algorithm does not take into account that the limit
of the VSI support node may be exceeded, so an additional node in the
VSI layer may be required to handle all the requested queues.

Such a scenario occurs when adding XDP Tx queues on machines with many
CPUs. Although the driver still respects the queue limit returned by
the FW, the Tx scheduler was unable to add those queues to its tree
and returned one of the errors below.

Such a scenario occurs when adding XDP Tx queues on machines with many
CPUs (e.g. at least 321 CPUs, if there is already 128 Tx/Rx queue pairs).
Although the driver still respects the queue limit returned by the FW,
the Tx scheduler was unable to add those queues to its tree and returned
the following errors:

     Failed VSI LAN queue config for XDP, error: -5
or:
     Failed to set LAN Tx queue context, error: -22

Fix this problem by extending the tree rebuild algorithm to check if the
current VSI node can support the requested number of queues. If it
cannot, create as many additional VSI support nodes as necessary to
handle all the required Tx queues. Symmetrically, adjust the VSI node
removal algorithm to remove all nodes associated with the given VSI.
Also, make the search for the next free VSI node more restrictive. That is,
add queue group nodes only to the VSI support nodes that have a matching
VSI handle.
Finally, fix the comment describing the tree update algorithm to better
reflect the current scenario.

Fixes: b0153fdd7e ("ice: update VSI config dynamically")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-30 13:54:43 -07:00
Michal Kubiak
6fa2942578 ice: create new Tx scheduler nodes for new queues only
The current implementation of the Tx scheduler tree attempts
to create nodes for all Tx queues, ignoring the fact that some
queues may already exist in the tree. For example, if the VSI
already has 128 Tx queues and the user requests for 16 new queues,
the Tx scheduler will compute the tree for 272 queues (128 existing
queues + 144 new queues), instead of 144 queues (128 existing queues
and 16 new queues).
Fix that by modifying the node count calculation algorithm to skip
the queues that already exist in the tree.

Fixes: 5513b920a4 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-30 13:54:35 -07:00
Michal Kubiak
0153f36041 ice: fix Tx scheduler error handling in XDP callback
When the XDP program is loaded, the XDP callback adds new Tx queues.
This means that the callback must update the Tx scheduler with the new
queue number. In the event of a Tx scheduler failure, the XDP callback
should also fail and roll back any changes previously made for XDP
preparation.

The previous implementation had a bug that not all changes made by the
XDP callback were rolled back. This caused the crash with the following
call trace:

[  +9.549584] ice 0000:ca:00.0: Failed VSI LAN queue config for XDP, error: -5
[  +0.382335] Oops: general protection fault, probably for non-canonical address 0x50a2250a90495525: 0000 [#1] SMP NOPTI
[  +0.010710] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.14.0-net-next-mar-31+ #14 PREEMPT(voluntary)
[  +0.010175] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
[  +0.010946] RIP: 0010:__ice_update_sample+0x39/0xe0 [ice]

[...]

[  +0.002715] Call Trace:
[  +0.002452]  <IRQ>
[  +0.002021]  ? __die_body.cold+0x19/0x29
[  +0.003922]  ? die_addr+0x3c/0x60
[  +0.003319]  ? exc_general_protection+0x17c/0x400
[  +0.004707]  ? asm_exc_general_protection+0x26/0x30
[  +0.004879]  ? __ice_update_sample+0x39/0xe0 [ice]
[  +0.004835]  ice_napi_poll+0x665/0x680 [ice]
[  +0.004320]  __napi_poll+0x28/0x190
[  +0.003500]  net_rx_action+0x198/0x360
[  +0.003752]  ? update_rq_clock+0x39/0x220
[  +0.004013]  handle_softirqs+0xf1/0x340
[  +0.003840]  ? sched_clock_cpu+0xf/0x1f0
[  +0.003925]  __irq_exit_rcu+0xc2/0xe0
[  +0.003665]  common_interrupt+0x85/0xa0
[  +0.003839]  </IRQ>
[  +0.002098]  <TASK>
[  +0.002106]  asm_common_interrupt+0x26/0x40
[  +0.004184] RIP: 0010:cpuidle_enter_state+0xd3/0x690

Fix this by performing the missing unmapping of XDP queues from
q_vectors and setting the XDP rings pointer back to NULL after all those
queues are released.
Also, add an immediate exit from the XDP callback in case of ring
preparation failure.

Fixes: efc2214b60 ("ice: Add support for XDP")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-30 13:54:17 -07:00
Jakub Kicinski
33e1b1b399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 09:42:41 -07:00
Dave Ertman
6c778f1b83 ice: Fix LACP bonds without SRIOV environment
If an aggregate has the following conditions:
- The SRIOV LAG DDP package has been enabled
- The bond is in 802.3ad LACP mode
- The bond is disqualified from supporting SRIOV VF LAG
- Both interfaces were added simultaneously to the bond (same command)

Then there is a chance that the two interfaces will be assigned different
LACP Aggregator ID's.  This will cause a failure of the LACP control over
the bond.

To fix this, we can detect if the primary interface for the bond (as
defined by the driver) is not in switchdev mode, and exit the setup flow
if so.

Reproduction steps:

%> ip link add bond0 type bond mode 802.3ad miimon 100
%> ip link set bond0 up
%> ifenslave bond0 eth0 eth1
%> cat /proc/net/bonding/bond0 | grep Agg

Check for Aggregator IDs that differ.

Fixes: ec5a6c5f79 ("ice: process events created by lag netdev event handler")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-19 08:38:22 -07:00
Jacob Keller
bbd95160a0 ice: fix vf->num_mac count with port representors
The ice_vc_repr_add_mac() function indicates that it does not store the MAC
address filters in the firmware. However, it still increments vf->num_mac.
This is incorrect, as vf->num_mac should represent the number of MAC
filters currently programmed to firmware.

Indeed, we only perform this increment if the requested filter is a unicast
address that doesn't match the existing vf->hw_lan_addr. In addition,
ice_vc_repr_del_mac() does not decrement the vf->num_mac counter. This
results in the counter becoming out of sync with the actual count.

As it turns out, vf->num_mac is currently only used in legacy made without
port representors. The single place where the value is checked is for
enforcing a filter limit on untrusted VFs.

Upcoming patches to support VF Live Migration will use this value when
determining the size of the TLV for MAC address filters. Fix the
representor mode function to stop incrementing the counter incorrectly.

Fixes: ac19e03ef7 ("ice: allow process VF opcodes in different ways")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-19 08:38:22 -07:00
Jakub Kicinski
cc42263172 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux
Tony Nguyen says:

====================
Prepare for Intel IPU E2000 (GEN3)

This is the first part in introducing RDMA support for idpf.

----------------------------------------------------------------
Tatyana Nikolova says:

To align with review comments, the patch series introducing RDMA
RoCEv2 support for the Intel Infrastructure Processing Unit (IPU)
E2000 line of products is going to be submitted in three parts:

1. Modify ice to use specific and common IIDC definitions and
   pass a core device info to irdma.

2. Add RDMA support to idpf and modify idpf to use specific and
   common IIDC definitions and pass a core device info to irdma.

3. Add RDMA RoCEv2 support for the E2000 products, referred to as
   GEN3 to irdma.

This first part is a 5 patch series based on the original
"iidc/ice/irdma: Update IDC to support multiple consumers" patch
to allow for multiple CORE PCI drivers, using the auxbus.

Patches:
1) Move header file to new name for clarity and replace ice
   specific DSCP define with a kernel equivalent one in irdma
2) Unify naming convention
3) Separate header file into common and driver specific info
4) Replace ice specific DSCP define with a kernel equivalent
   one in ice
5) Implement core device info struct and update drivers to use it
----------------------------------------------------------------

v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com

IWL reviews:
[v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com
[v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com
[v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com
[v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com
[v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux:
  iidc/ice/irdma: Update IDC to support multiple consumers
  ice: Replace ice specific DSCP mapping num with a kernel define
  iidc/ice/irdma: Break iidc.h into two headers
  iidc/ice/irdma: Rename to iidc_* convention
  iidc/ice/irdma: Rename IDC header file

====================

Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12 18:48:27 -07:00
Dave Ertman
c24a65b6a2 iidc/ice/irdma: Update IDC to support multiple consumers
In preparation of supporting more than a single core PCI driver
for RDMA, move ice specific structs like qset_params, qos_info
and qos_params from iidc_rdma.h to iidc_rdma_ice.h.

Previously, the ice driver was just exporting its entire PF struct
to the auxiliary driver, but since each core driver will have its own
different PF struct, implement a universal struct that all core drivers
can provide to the auxiliary driver through the probe call.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Co-developed-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Co-developed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Co-developed-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-09 11:35:43 -07:00
Jakub Kicinski
6b02fd7799 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc6).

No conflicts.

Adjacent changes:

net/core/dev.c:
  08e9f2d584 ("net: Lock netdevices during dev_shutdown")
  a82dc19db1 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08 08:59:02 -07:00
Przemek Kitszel
0093cb194a ice: use DSN instead of PCI BDF for ice_adapter index
Use Device Serial Number instead of PCI bus/device/function for
the index of struct ice_adapter.

Functions on the same physical device should point to the very same
ice_adapter instance, but with two PFs, when at least one of them is
PCI-e passed-through to a VM, it is no longer the case - PFs will get
seemingly random PCI BDF values, and thus indices, what finally leds to
each of them being on their own instance of ice_adapter. That causes them
to don't attempt any synchronization of the PTP HW clock usage, or any
other future resources.

DSN works nicely in place of the index, as it is "immutable" in terms of
virtualization.

Fixes: 0e2bddf9e5 ("ice: add ice_adapter for shared data across PFs on the same NIC")
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505161939.2083581-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06 18:27:14 -07:00
Jakub Kicinski
337079d31f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc5).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 15:11:38 -07:00
Tatyana Nikolova
8239b771b9 ice: Replace ice specific DSCP mapping num with a kernel define
Replace ice driver specific DSCP mapping number defines
ICE_DSCP_NUM_VAL and IIDC_MAX_DSCP_MAPPING with
an equivalent kernel define DSCP_MAX.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30 13:09:08 -07:00
Dave Ertman
d9251a560b iidc/ice/irdma: Break iidc.h into two headers
In preparation of supporting more than a single core PCI driver
for RDMA, break the iidc_rdma.h header file into two more focused
headers.

Only the elements universal to all Intel drivers will remain in
the generic iidc_rdma.h header. Move the ice specific information
to an ice specific header file named iidc_rdma_ice.h.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30 13:09:08 -07:00
Dave Ertman
97b5631aae iidc/ice/irdma: Rename to iidc_* convention
In preparation of supporting more than a single core PCI driver
for RDMA, homogenize naming to iidc_rdma_* and IIDC_RDMA_*
form.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30 13:09:06 -07:00
Dave Ertman
468d8b462a iidc/ice/irdma: Rename IDC header file
To prepare for the IDC upgrade to support different CORE
PCI drivers, rename header file from iidc.h to iidc_rdma.h
since this files functionality is specifically for RDMA support.

Use net/dscp.h include in irdma osdep.h and DSCP_MAX type.h,
instead of iidc header and define.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-30 08:31:49 -07:00
Xuanqiang Luo
425c5f266b ice: Check VF VSI Pointer Value in ice_vc_add_fdir_fltr()
As mentioned in the commit baeb705fd6 ("ice: always check VF VSI
pointer values"), we need to perform a null pointer check on the return
value of ice_get_vf_vsi() before using it.

Fixes: 6ebbe97a48 ("ice: Add a per-VF limit on number of FDIR filters")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250425222636.3188441-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:59:13 -07:00
Paul Greenwalt
3ffcd7b657 ice: fix Get Tx Topology AQ command error on E830
The Get Tx Topology AQ command (opcode 0x0418) has different read flag
requirements depending on the hardware/firmware. For E810, E822, and E823
firmware the read flag must be set, and for newer hardware (E825 and E830)
it must not be set.

This results in failure to configure Tx topology and the following warning
message during probe:

  DDP package does not support Tx scheduling layers switching feature -
  please update to the latest DDP package and try again

The current implementation only handles E825-C but not E830. It is
confusing as we first check ice_is_e825c() and then set the flag in the set
case. Finally, we check ice_is_e825c() again and set the flag for all other
hardware in both the set and get case.

Instead, notice that we always need the read flag for set, but only need
the read flag for get on E810, E822, and E823 firmware. Fix the logic to
check the MAC type and set the read flag in get only on the older devices
which require it.

Fixes: ba1124f58a ("ice: Add E830 device IDs, MAC type and registers")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250425222636.3188441-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:59:13 -07:00
Jacob Keller
d9f3e9ecc4 net: ptp: introduce .supported_perout_flags to ptp_clock_info
The PTP_PEROUT_REQUEST2 ioctl has gained support for flags specifying
specific output behavior including PTP_PEROUT_ONE_SHOT,
PTP_PEROUT_DUTY_CYCLE, PTP_PEROUT_PHASE.

Driver authors are notorious for not checking the flags of the request.
This results in misinterpreting the request, generating an output signal
that does not match the requested value. It is anticipated that even more
flags will be added in the future, resulting in even more broken requests.

Expecting these issues to be caught during review or playing whack-a-mole
after the fact is not a great solution.

Instead, introduce the supported_perout_flags field in the ptp_clock_info
structure. Update the core character device logic to explicitly reject any
request which has a flag not on this list.

This ensures that drivers must 'opt in' to the flags they support. Drivers
which don't set the .supported_perout_flags field will not need to check
that unsupported flags aren't passed, as the core takes care of this.

Update the drivers which do support flags to set this new field.

Note the following driver files set n_per_out to a non-zero value but did
not check the flags at all:

 • drivers/ptp/ptp_clockmatrix.c
 • drivers/ptp/ptp_idt82p33.c
 • drivers/ptp/ptp_fc3.c
 • drivers/net/ethernet/ti/am65-cpts.c
 • drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
 • drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
 • drivers/net/dsa/sja1105/sja1105_ptp.c
 • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
 • drivers/net/ethernet/mscc/ocelot_vsc7514.c
 • drivers/net/ethernet/intel/i40e/i40e_ptp.c

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-2-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 20:20:58 -07:00
Jacob Keller
7c571ac57d net: ptp: introduce .supported_extts_flags to ptp_clock_info
The PTP_EXTTS_REQUEST(2) ioctl has a flags field which specifies how the
external timestamp request should behave. This includes which edge of the
signal to timestamp, as well as a specialized "offset" mode. It is expected
that more flags will be added in the future.

Driver authors routinely do not check the flags, often accepting requests
with flags which they do not support. Even drivers which do check flags may
not be future-proofed to reject flags not yet defined. Thus, any future
flag additions often require manually updating drivers to reject these
flags.

This approach of hoping we catch flag checks during review, or playing
whack-a-mole after the fact is the wrong approach.

Introduce the "supported_extts_flags" field to the ptp_clock_info
structure. This field defines the set of flags the device actually
supports.

Update the core character device logic to check this field and reject
unsupported requests. Getting this right is somewhat tricky. First, to
avoid unnecessary repetition and make basic functionality work when
.supported_extts_flags is 0, the core always accepts the PTP_ENABLE_FEATURE
flag. This flag is used to set the 'on' parameter to the .enable function
and is thus always 'supported' by all drivers.

For backwards compatibility, the PTP_RISING_EDGE and PTP_FALLING_EDGE flags
are merely "hints" when using the old PTP_EXTTS_REQUEST ioctl, and are not
expected to be enforced. If the user issues PTP_EXTTS_REQUEST2, the
PTP_STRICT_FLAGS flag is added which is supposed to inform the driver to
strictly validate the flags and reject unsupported requests. To handle
this, first check if the driver reports PTP_STRICT_FLAGS support. If it
does not, then always allow the PTP_RISING_EDGE and PTP_FALLING_EDGE flags.
This keeps backwards compatibility with the original PTP_EXTTS_REQUEST
ioctl where these flags are not guaranteed to be honored.

This way, drivers which do not set the supported_extts_flags will continue
to accept requests for the original PTP_EXTTS_REQUEST ioctl. The core will
automatically reject requests with new flags, and correctly reject requests
with PTP_STRICT_FLAGS, where the driver is supposed to strictly validate
the flags.

Update the various drivers, refactoring their validation logic into the
.supported_extts_flags field. For consistency and readability,
PTP_ENABLE_FEATURE is not set in the supported flags list, and
PTP_EXTTS_EDGES is expanded to PTP_RISING_EDGE | PTP_FALLING_EDGE in all
cases.

Note the following driver files set n_ext_ts to a non-zero value but did
not check flags at all:

 • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
 • drivers/net/ethernet/freescale/enetc/enetc_ptp.c
 • drivers/net/ethernet/intel/i40e/i40e_ptp.c
 • drivers/net/ethernet/marvell/octeontx2/nic/otx2_ptp.c
 • drivers/net/ethernet/renesas/ravb_ptp.c
 • drivers/net/ethernet/renesas/rtsn.c
 • drivers/net/ethernet/renesas/rtsn.h
 • drivers/net/ethernet/ti/am65-cpts.c
 • drivers/net/ethernet/ti/cpts.h
 • drivers/net/ethernet/ti/icssg/icss_iep.c
 • drivers/net/ethernet/xscale/ptp_ixp46x.c
 • drivers/net/phy/bcm-phy-ptp.c
 • drivers/ptp/ptp_ocp.c
 • drivers/ptp/ptp_pch.c
 • drivers/ptp/ptp_qoriq.c

These drivers behavior does change slightly: they will now reject the
PTP_EXTTS_REQUEST2 ioctl, because they do not strictly validate their
flags. This also makes them no longer incorrectly accept PTP_EXT_OFFSET.

Also note that the renesas ravb driver does not support PTP_STRICT_FLAGS.
We could leave the .supported_extts_flags as 0, but I added the
PTP_RISING_EDGE | PTP_FALLING_EDGE since the driver previously manually
validated these flags. This is equivalent to 0 because the core will allow
these flags regardless unless PTP_STRICT_FLAGS is also set.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-1-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 20:20:57 -07:00
Colin Ian King
fee4a79a12 ice: make const read-only array dflt_rules static
Don't populate the const read-only array dflt_rules on the stack at run
time, instead make it static.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:57 -07:00
Martyna Szapar-Mudlaw
6cb10c063d ice: improve error message for insufficient filter space
When adding a rule to switch through tc, if the operation fails
due to not enough free recipes (-ENOSPC), provide a clearer
error message: "Unable to add filter: insufficient space available."

This improves user feedback by distinguishing space limitations from
other generic failures.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:57 -07:00
Karol Kolacinski
e2193f9f9e ice: enable timesync operation on 2xNAC E825 devices
According to the E825C specification, SBQ address for ports on a single
complex is device 2 for PHY 0 and device 13 for PHY1.
For accessing ports on a dual complex E825C (so called 2xNAC mode),
the driver should use destination device 2 (referred as phy_0) for
the current complex PHY and device 13 (referred as phy_0_peer) for
peer complex PHY.

Differentiate SBQ destination device by checking if current PF port
number is on the same PHY as target port number.

Adjust 'ice_get_lane_number' function to provide unique port number for
ports from PHY1 in 'dual' mode config (by adding fixed offset for PHY1
ports). Cache this value in ice_hw struct.

Introduce ice_get_primary_hw wrapper to get access to timesync register
not available from second NAC.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Karol Kolacinski
1fd9c91f7e ice: refactor ice_sbq_msg_dev enum
Rename ice_sbq_msg_dev to ice_sbq_dev_id to reflect the meaning of this
type more precisely. This enum type describes RDA (Remote Device Access)
client ids, accessible over SB (Side Band) interface.
Rename enum elements to make a driver namespace more cleaner and
consistent with other definitions within SB
Remove unused 'rmn_x' entries, specific to unsupported E824 device.
Adjust clients '2' and '13' names (phy_0 and phy_0_peer respectively) to
be compliant with EAS doc. According to the specification, regardless of
the complex entity (single or dual), when accessing its own ports,
they're accessed always as 'phy_0' client. And referred as 'phy_0_peer'
when handling ports connected to the other complex.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Karol Kolacinski
1e05c5a05d ice: remove SW side band access workaround for E825
Due to the bug in FW/NVM autoload mechanism (wrong default
SB_REM_DEV_CTL register settings), the access to peer PHY and CGU
clients was disabled by default.

As the workaround solution, the register value was overwritten by the
driver at the probe or reset handling.
Remove workaround as it's not needed anymore. The fix in autoload
procedure has been provided with NVM 3.80 version.

NOTE: at the time the fix was provided in NVM, the E825C product was
not officially available on the market, so it's not expected this change
will cause regression when running with older driver/kernel versions.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Larysa Zaremba
517f7a08ca ice: enable LLDP TX for VFs through tc
Only a single VSI can be in charge of sending LLDP frames, sometimes it is
beneficial to assign this function to a VF, that is possible to do with tc
capabilities in the switchdev mode. It requires first blocking the PF from
sending the LLDP frames with a following command:

tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop

Then it becomes possible to configure a forward rule from a VF port
representor to uplink instead.

tc filter add dev <vf_ifname> ingress protocol lldp flower skip_sw
action mirred egress redirect dev <ifname>

How LLDP exclusivity was done previously is LLDP traffic was blocked for a
whole port by a single rule and PF was bypassing that. Now at least in the
switchdev mode, every separate VSI has to have its own drop rule. Another
complication is the fact that tc does not respect when the driver refuses
to delete a rule, so returning an error results in a HW rule still present
with no way to reference it through tc. This is addressed by allowing the
PF rule to be deleted at any time, but making the VF forward rule "dormant"
in such case, this means it is deleted from HW but stays in tc and driver's
bookkeeping to be restored when drop rule is added back to the PF.

Implement tc configuration handling which enables the user to transmit LLDP
packets from VF instead of PF.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:45:52 -07:00
Larysa Zaremba
40f42dc1cb ice: support egress drop rules on PF
tc clsact qdisc allows us to add offloaded egress rules with commands such
as the following one:

tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop

Support the egress rule drop action when added to PF, with a few caveats:
* in switchdev mode, all PF traffic has to go uplink with an exception for
  LLDP that can be delegated to a single VSI at a time
* in legacy mode, we cannot delegate LLDP functionality to another VSI, so
  such packets from PF should not be blocked.

Also, simplify the rule direction logic, it was previously derived from
actions, but actually can be inherited from the tc block (and flipped in
case of port representors).

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:47:43 -07:00
Larysa Zaremba
5787179c51 ice: remove headers argument from ice_tc_count_lkups
Remove the headers argument from the ice_tc_count_lkups() function, because
it is not used anywhere.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:46:37 -07:00
Mateusz Pacuszka
2296345416 ice: receive LLDP on trusted VFs
When a trusted VF tries to configure an LLDP multicast address, configure a
rule that would mirror the traffic to this VF, untrusted VFs are not
allowed to receive LLDP at all, so the request to add LLDP MAC address will
always fail for them.

Add a forwarding LLDP filter to a trusted VF when it tries to add an LLDP
multicast MAC address. The MAC address has to be added after enabling
trust (through restarting the LLDP service).

Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:45:30 -07:00
Larysa Zaremba
4d5a1c4e6d ice: do not add LLDP-specific filter if not necessary
Commit 34295a3696 ("ice: implement new LLDP filter command")
introduced the ability to use LLDP-specific filter that directs all
LLDP traffic to a single VSI. However, current goal is for all trusted VFs
to be able to see LLDP neighbors, which is impossible to do with the
special filter.

Make using the generic filter the default choice and fall back to special
one only if a generic filter cannot be added. That way setups with "NVMs
where an already existent LLDP filter is blocking the creation of a filter
to allow LLDP packets" will still be able to configure software Rx LLDP on
PF only, while all other setups would be able to forward them to VFs too.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:44:26 -07:00
Mateusz Pacuszka
a808691df3 ice: fix check for existing switch rule
In case the rule already exists and another VSI wants to subscribe to it
new VSI list is being created and both VSIs are moved to it.
Currently, the check for already existing VSI with the same rule is done
based on fdw_id.hw_vsi_id, which applies only to LOOKUP_RX flag.
Change it to vsi_handle. This is software VSI ID, but it can be applied
here, because vsi_map itself is also based on it.

Additionally change return status in case the VSI already exists in the
VSI map to "Already exists". Such case should be handled by the caller.

Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:32:31 -07:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Jakub Kicinski
023b1e9d26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.15 net-next PR.

No conflicts, adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  919f9f497d ("eth: bnxt: fix out-of-range access of vnic_info array")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-26 09:32:10 -07:00
Mateusz Polchlopek
1388dd5641 ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw()
Fix using the untrusted value of proto->raw.pkt_len in function
ice_vc_fdir_parse_raw() by verifying if it does not exceed the
VIRTCHNL_MAX_SIZE_RAW_PACKET value.

Fixes: 99f419df8a ("ice: enable FDIR filters from raw binary patterns for VFs")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Lukasz Czapnik
c5be6562de ice: fix input validation for virtchnl BW
Add missing validation of tc and queue id values sent by a VF in
ice_vc_cfg_q_bw().
Additionally fixed logged value in the warning message,
where max_tx_rate was incorrectly referenced instead of min_tx_rate.
Also correct error handling in this function by properly exiting
when invalid configuration is detected.

Fixes: 015307754a ("ice: Support VF queue rate limit and quanta size configuration")
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jan Glaza
e2f7d3f733 ice: validate queue quanta parameters to prevent OOB access
Add queue wraparound prevention in quanta configuration.
Ensure end_qid does not overflow by validating start_qid and num_queues.

Fixes: 015307754a ("ice: Support VF queue rate limit and quanta size configuration")
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jan Glaza <jan.glaza@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jan Glaza
f91d0efcc3 ice: stop truncating queue ids when checking
Queue IDs can be up to 4096, fix invalid check to stop
truncating IDs to 8 bits.

Fixes: bf93bf791c ("ice: introduce ice_virtchnl.c and ice_virtchnl.h")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jan Glaza <jan.glaza@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jesse Brandeburg
7fd71f3172 ice: fix reservation of resources for RDMA when disabled
If the CONFIG_INFINIBAND_IRDMA symbol is not enabled as a module or a
built-in, then don't let the driver reserve resources for RDMA. The result
of this change is a large savings in resources for older kernels, and a
cleaner driver configuration for the IRDMA=n case for old and new kernels.

Implement this by avoiding enabling the RDMA capability when scanning
hardware capabilities.

Note: Loading the out-of-tree irdma driver in connection to the in-kernel
ice driver, is not supported, and should not be attempted, especially when
disabling IRDMA in the kernel config.

Fixes: d25a0fc41c ("ice: Initialize RDMA support")
Signed-off-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Karol Kolacinski
53ce7166cb ice: ensure periodic output start time is in the future
On E800 series hardware, if the start time for a periodic output signal is
programmed into GLTSYN_TGT_H and GLTSYN_TGT_L registers, the hardware logic
locks up and the periodic output signal never starts. Any future attempt to
reprogram the clock function is futile as the hardware will not reset until
a power on.

The ice_ptp_cfg_perout function has logic to prevent this, as it checks if
the requested start time is in the past. If so, a new start time is
calculated by rounding up.

Since commit d755a7e129 ("ice: Cache perout/extts requests and check
flags"), the rounding is done to the nearest multiple of the clock period,
rather than to a full second. This is more accurate, since it ensures the
signal matches the user request precisely.

Unfortunately, there is a race condition with this rounding logic. If the
current time is close to the multiple of the period, we could calculate a
target time that is extremely soon. It takes time for the software to
program the registers, during which time this requested start time could
become a start time in the past. If that happens, the periodic output
signal will lock up.

For large enough periods, or for the logic prior to the mentioned commit,
this is unlikely. However, with the new logic rounding to the period and
with a small enough period, this becomes inevitable.

For example, attempting to enable a 10MHz signal requires a period of 100
nanoseconds. This means in the *best* case, we have 99 nanoseconds to
program the clock output. This is essentially impossible, and thus such a
small period practically guarantees that the clock output function will
lock up.

To fix this, add some slop to the clock time used to check if the start
time is in the past. Because it is not critical that output signals start
immediately, but it *is* critical that we do not brick the function, 0.5
seconds is selected. This does mean that any requested output will be
delayed by at least 0.5 seconds.

This slop is applied before rounding, so that we always round up to the
nearest multiple of the period that is at least 0.5 seconds in the future,
ensuring a minimum of 0.5 seconds to program the clock output registers.

Finally, to ensure that the hardware registers programming the clock output
complete in a timely manner, add a write flush to the end of
ice_ptp_write_perout. This ensures we don't risk any issue with PCIe
transaction batching.

Strictly speaking, this fixes a race condition all the way back at the
initial implementation of periodic output programming, as it is
theoretically possible to trigger this bug even on the old logic when
always rounding to a full second. However, the window is narrow, and the
code has been refactored heavily since then, making a direct backport not
apply cleanly.

Fixes: d755a7e129 ("ice: Cache perout/extts requests and check flags")
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Przemek Kitszel
fa8eda1901 ice: health.c: fix compilation on gcc 7.5
GCC 7 is not as good as GCC 8+ in telling what is a compile-time
const, and thus could be used for static storage.
Fortunately keeping strings as const arrays is enough to make old
gcc happy.

Excerpt from the report:
My GCC is: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0.

  CC [M]  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.o
drivers/net/ethernet/intel/ice/devlink/health.c:35:3: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
   ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:35:3: note: (near initialization for 'ice_health_status_lookup[0].solution')
drivers/net/ethernet/intel/ice/devlink/health.c:35:31: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
                               ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:35:31: note: (near initialization for 'ice_health_status_lookup[0].data_label[0]')
drivers/net/ethernet/intel/ice/devlink/health.c:37:46: error: initializer element is not constant
   "Change or replace the module or cable.", {ice_port_number_label}},
                                              ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:37:46: note: (near initialization for 'ice_health_status_lookup[1].data_label[0]')
drivers/net/ethernet/intel/ice/devlink/health.c:39:3: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
   ^~~~~~~~~~~~~~~~~~~~~~~~~

Fixes: 85d6164ec5 ("ice: add fw and port health reporters")
Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/netdev/CY8PR11MB7134BF7A46D71E50D25FA7A989F72@CY8PR11MB7134.namprd11.prod.outlook.com
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Karol Kolacinski
50f4ffac91 ice: E825C PHY register cleanup
Minor PTP register refactor, including logical grouping E825C 1-step
timestamping registers. Remove unused register definitions
(PHY_REG_GPCS_BITSLIP, PHY_REG_REVISION).
Also, apply preferred GENMASK macro (instead of ICE_M) for register
fields definition affected by this patch.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-5-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Karol Kolacinski
66a1b7e09f ice: Refactor E825C PHY registers info struct
Simplify ice_phy_reg_info_eth56g struct definition to include base
address for the very first quad. Use base address info and 'step'
value to determine address for specific PHY quad.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-4-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Karol Kolacinski
178edd2633 ice: rename ice_ptp_init_phc_eth56g function
Refactor the code by changing ice_ptp_init_phc_eth56g function
name to ice_ptp_init_phc_e825, to be consistent with the naming pattern
for other devices.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-3-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Paul Greenwalt
905d1a220e ice: Add E830 checksum offload support
E830 supports raw receive and generic transmit checksum offloads.

Raw receive checksum support is provided by hardware calculating the
checksum over the whole packet, regardless of type. The calculated
checksum is provided to driver in the Rx flex descriptor. Then the driver
assigns the checksum to skb->csum and sets skb->ip_summed to
CHECKSUM_COMPLETE.

Generic transmit checksum support is provided by hardware calculating the
checksum given two offsets: the start offset to begin checksum calculation,
and the offset to insert the calculated checksum in the packet. Support is
advertised to the stack using NETIF_F_HW_CSUM feature.

E830 has the following limitations when both generic transmit checksum
offload and TCP Segmentation Offload (TSO) are enabled:

1. Inner packet header modification is not supported. This restriction
   includes the inability to alter TCP flags, such as the push flag. As a
   result, this limitation can impact the receiver's ability to coalesce
   packets, potentially degrading network throughput.
2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents
   support of Maximum Transmission Unit (MTU) greater than 1063 bytes.

Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually
exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not
enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the
defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and
NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity
of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features().

When NETIF_F_HW_CSUM is requested the provided skb->csum_start and
skb->csum_offset are passed to hardware in the Tx context descriptor
generic checksum (GCS) parameters. Hardware calculates the 1's complement
from skb->csum_start to the end of the packet, and inserts the result in
the packet at skb->csum_offset.

Co-developed-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Co-developed-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-2-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Paolo Abeni
941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Przemek Kitszel
2a3e89a148 ice: register devlink prior to creating health reporters
ice_health_init() was introduced in the commit 2a82874a3b ("ice: add
Tx hang devlink health reporter"). The call to it should have been put
after ice_devlink_register(). It went unnoticed until next reporter by
Konrad, which receives events from FW. FW is reporting all events, also
from prior driver load, and thus it is not unlikely to have something
at the very beginning. And that results in a splat:
[   24.455950]  ? devlink_recover_notify.constprop.0+0x198/0x1b0
[   24.455973]  devlink_health_report+0x5d/0x2a0
[   24.455976]  ? __pfx_ice_health_status_lookup_compare+0x10/0x10 [ice]
[   24.456044]  ice_process_health_status_event+0x1b7/0x200 [ice]

Do the analogous thing for deinit patch.

Fixes: 85d6164ec5 ("ice: add fw and port health reporters")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Konrad Knitter <konrad.knitter@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:45:45 -08:00
Marcin Szycik
dce97cb0a3 ice: Fix switchdev slow-path in LAG
Ever since removing switchdev control VSI and using PF for port
representor Tx/Rx, switchdev slow-path has been working improperly after
failover in SR-IOV LAG. LAG assumes that the first uplink to be added to
the aggregate will own VFs and have switchdev configured. After
failing-over to the other uplink, representors are still configured to
Tx through the uplink they are set up on, which fails because that
uplink is now down.

On failover, update all PRs on primary uplink to use the currently
active uplink for Tx. Call netif_keep_dst(), as the secondary uplink
might not be in switchdev mode. Also make sure to call
ice_eswitch_set_target_vsi() if uplink is in LAG.

On the Rx path, representors are already working properly, because
default Tx from VFs is set to PF owning the eswitch. After failover the
same PF is receiving traffic from VFs, even though link is down.

Fixes: defd52455a ("ice: do Tx through PF netdev in slow-path")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:43:05 -08:00
Grzegorz Nitka
23d97f1890 ice: fix memory leak in aRFS after reset
Fix aRFS (accelerated Receive Flow Steering) structures memory leak by
adding a checker to verify if aRFS memory is already allocated while
configuring VSI. aRFS objects are allocated in two cases:
- as part of VSI initialization (at probe), and
- as part of reset handling

However, VSI reconfiguration executed during reset involves memory
allocation one more time, without prior releasing already allocated
resources. This led to the memory leak with the following signature:

[root@os-delivery ~]# cat /sys/kernel/debug/kmemleak
unreferenced object 0xff3c1ca7252e6000 (size 8192):
  comm "kworker/0:0", pid 8, jiffies 4296833052
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 0):
    [<ffffffff991ec485>] __kmalloc_cache_noprof+0x275/0x340
    [<ffffffffc0a6e06a>] ice_init_arfs+0x3a/0xe0 [ice]
    [<ffffffffc09f1027>] ice_vsi_cfg_def+0x607/0x850 [ice]
    [<ffffffffc09f244b>] ice_vsi_setup+0x5b/0x130 [ice]
    [<ffffffffc09c2131>] ice_init+0x1c1/0x460 [ice]
    [<ffffffffc09c64af>] ice_probe+0x2af/0x520 [ice]
    [<ffffffff994fbcd3>] local_pci_probe+0x43/0xa0
    [<ffffffff98f07103>] work_for_cpu_fn+0x13/0x20
    [<ffffffff98f0b6d9>] process_one_work+0x179/0x390
    [<ffffffff98f0c1e9>] worker_thread+0x239/0x340
    [<ffffffff98f14abc>] kthread+0xcc/0x100
    [<ffffffff98e45a6d>] ret_from_fork+0x2d/0x50
    [<ffffffff98e083ba>] ret_from_fork_asm+0x1a/0x30
    ...

Fixes: 28bf26724f ("ice: Implement aRFS")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:40:41 -08:00
Larysa Zaremba
3be83ee9de ice: do not configure destination override for switchdev
After switchdev is enabled and disabled later, LLDP packets sending stops,
despite working perfectly fine before and during switchdev state.
To reproduce (creating/destroying VF is what triggers the reconfiguration):

devlink dev eswitch set pci/<address> mode switchdev
echo '2' > /sys/class/net/<ifname>/device/sriov_numvfs
echo '0' > /sys/class/net/<ifname>/device/sriov_numvfs

This happens because LLDP relies on the destination override functionality.
It needs to 1) set a flag in the descriptor, 2) set the VSI permission to
make it valid. The permissions are set when the PF VSI is first configured,
but switchdev then enables it for the uplink VSI (which is always the PF)
once more when configured and disables when deconfigured, which leads to
software-generated LLDP packets being blocked.

Do not modify the destination override permissions when configuring
switchdev, as the enabled state is the default configuration that is never
modified.

Fixes: 1a1c40df2e ("ice: set and release switchdev environment")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:30:26 -08:00
Gal Pressman
e0c032d26d ice: dpll: Remove newline at the end of a netlink error message
Netlink error messages should not have a newline at the end of the
string.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250226093904.6632-6-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 18:11:38 -08:00
Jakub Kicinski
357660d759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c74 ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7 ("ice: Fix deinitializing VF in error path")
  a203163274 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c5206 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0a ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c2 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 10:20:58 -08:00
Ahmed Zaki
4063af2967 ice: use napi's irq affinity and rmap IRQ notifiers
Delete the driver CPU affinity and aRFS rmap info, use the core's
API instead.

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-5-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:51:37 -08:00
Ahmed Zaki
30b78ba3d4 ice: clear NAPI's IRQ numbers in ice_vsi_clear_napi_queues()
We set the NAPI's IRQ number in ice_vsi_set_napi_queues(). Clear the
NAPI's IRQ in ice_vsi_clear_napi_queues().

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-4-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:51:37 -08:00
Marcin Szycik
5c07be96d8 ice: Avoid setting default Rx VSI twice in switchdev setup
As part of switchdev environment setup, uplink VSI is configured as
default for both Tx and Rx. Default Rx VSI is also used by promiscuous
mode. If promisc mode is enabled and an attempt to enter switchdev mode
is made, the setup will fail because Rx VSI is already configured as
default (rule exists).

Reproducer:
  devlink dev eswitch set $PF1_PCI mode switchdev
  ip l s $PF1 up
  ip l s $PF1 promisc on
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs

In switchdev setup, use ice_set_dflt_vsi() instead of plain
ice_cfg_dflt_vsi(), which avoids repeating setting default VSI for Rx if
it's already configured.

Fixes: 50d62022f4 ("ice: default Tx rule instead of to queue")
Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com
Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:36 -08:00
Marcin Szycik
79990cf5e7 ice: Fix deinitializing VF in error path
If ice_ena_vfs() fails after calling ice_create_vf_entries(), it frees
all VFs without removing them from snapshot PF-VF mailbox list, leading
to list corruption.

Reproducer:
  devlink dev eswitch set $PF1_PCI mode switchdev
  ip l s $PF1 up
  ip l s $PF1 promisc on
  sleep 1
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs
  sleep 1
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs

Trace (minimized):
  list_add corruption. next->prev should be prev (ffff8882e241c6f0), but was 0000000000000000. (next=ffff888455da1330).
  kernel BUG at lib/list_debug.c:29!
  RIP: 0010:__list_add_valid_or_report+0xa6/0x100
   ice_mbx_init_vf_info+0xa7/0x180 [ice]
   ice_initialize_vf_entry+0x1fa/0x250 [ice]
   ice_sriov_configure+0x8d7/0x1520 [ice]
   ? __percpu_ref_switch_mode+0x1b1/0x5d0
   ? __pfx_ice_sriov_configure+0x10/0x10 [ice]

Sometimes a KASAN report can be seen instead with a similar stack trace:
  BUG: KASAN: use-after-free in __list_add_valid_or_report+0xf1/0x100

VFs are added to this list in ice_mbx_init_vf_info(), but only removed
in ice_free_vfs(). Move the removing to ice_free_vf_entries(), which is
also being called in other places where VFs are being removed (including
ice_free_vfs() itself).

Fixes: 8cd8a6b17d ("ice: move VF overflow message count into struct ice_mbx_vf_info")
Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com
Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:36 -08:00
Gal Pressman
ecdff89338 ethtool: Symmetric OR-XOR RSS hash
Add an additional type of symmetric RSS hash type: OR-XOR.
The "Symmetric-OR-XOR" algorithm transforms the input as follows:

(SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)

Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap
of supported RXH_XFRM_* types.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 18:31:04 -08:00
Jakub Kicinski
0f375d90c4 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice, iavf: Add support for Rx timestamping

Mateusz Polchlopek says:

Initially, during VF creation it registers the PTP clock in
the system and negotiates with PF it's capabilities. In the
meantime the PF enables the Flexible Descriptor for VF.
Only this type of descriptor allows to receive Rx timestamps.

Enabling virtual clock would be possible, though it would probably
perform poorly due to the lack of direct time access.

Enable timestamping should be done using userspace tools, e.g.
hwstamp_ctl -i $VF -r 14

In order to report the timestamps to userspace, the VF extends
timestamp to 40b.

To support this feature the flexible descriptors and PTP part
in iavf driver have been introduced.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add support for Rx timestamps to hotpath
  iavf: handle set and get timestamps ops
  iavf: Implement checking DD desc field
  iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors
  iavf: define Rx descriptors as qwords
  libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
  iavf: periodically cache PHC time
  iavf: add support for indirect access to PHC time
  iavf: add initial framework for registering PTP clock
  iavf: negotiate PTP capabilities
  iavf: add support for negotiating flexible RXDID format
  virtchnl: add enumeration for the rxdid format
  ice: support Rx timestamp on flex descriptor
  virtchnl: add support for enabling PTP on iAVF
====================

Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17 17:09:44 -08:00
Dan Carpenter
c2ddb619fa ice: Fix signedness bug in ice_init_interrupt_scheme()
If pci_alloc_irq_vectors() can't allocate the minimum number of vectors
then it returns -ENOSPC so there is no need to check for that in the
caller.  In fact, because pf->msix.min is an unsigned int, it means that
any negative error codes are type promoted to high positive values and
treated as success.  So here, the "return -ENOMEM;" is unreachable code.
Check for negatives instead.

Now that we're only dealing with error codes, it's easier to propagate
the error code from pci_alloc_irq_vectors() instead of hardcoding
-ENOMEM.

Fixes: 79d97b8cf9 ("ice: remove splitting MSI-X between features")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/b16e4f01-4c85-46e2-b602-fce529293559@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14 17:18:00 -08:00
Jacob Keller
2a86e210f1 iavf: add support for negotiating flexible RXDID format
Enable support for VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC, to enable the VF
driver the ability to determine what Rx descriptor formats are
available. This requires sending an additional message during
initialization and reset, the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS. This
operation requests the supported Rx descriptor IDs available from the
PF.

This is treated the same way that VLAN V2 capabilities are handled. Add
a new set of extended capability flags, used to process send and receipt
of the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message.

This ensures we finish negotiating for the supported descriptor formats
prior to beginning configuration of receive queues.

This change stores the supported format bitmap into the iavf_adapter
structure. Additionally, if VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC is enabled
by the PF, we need to make sure that the Rx queue configuration
specifies the format.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Simei Su
7c1178a9df ice: support Rx timestamp on flex descriptor
To support Rx timestamp offload, VIRTCHNL_OP_1588_PTP_CAPS is sent by
the VF to request PTP capability and responded by the PF what capability
is enabled for that VF.

Hardware captures timestamps which contain only 32 bits of nominal
nanoseconds, as opposed to the 64bit timestamps that the stack expects.
To convert 32b to 64b, we need a current PHC time.
VIRTCHNL_OP_1588_PTP_GET_TIME is sent by the VF and responded by the
PF with the current PHC time.

Signed-off-by: Simei Su <simei.su@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:57:45 -08:00
Jakub Kicinski
4e41231249 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e)

For ice:

Karol, Jake, and Michal add PTP support for E830 devices. Karol
refactors and cleans up PTP code. Jake allows for a common
cross-timestamp implementation to be shared for all devices and
Michal adds E830 support.

Mateusz cleans up initial Flow Director rule creation to loop rather
than duplicate repeated similar calls.

For igc:

Siang adjust calls to remove need for close and open calls on loading
XDP program.

For e1000e:

Gerhard Engleder batches register writes for writing multicast table
on real-time kernels.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  e1000e: Fix real-time violations on link up
  igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
  ice: refactor ice_fdir_create_dflt_rules() function
  ice: Implement PTP support for E830 devices
  ice: Refactor ice_ptp_init_tx_*
  ice: Add unified ice_capture_crosststamp
  ice: Process TSYN IRQ in a separate function
  ice: Use FIELD_PREP for timestamp values
  ice: Remove unnecessary ice_is_e8xx() functions
  ice: Don't check device type when checking GNSS presence
====================

Link: https://patch.msgid.link/20250210192352.3799673-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11 19:51:16 -08:00
Alexander Lobakin
2fc6b26ac8 ice: use generic unrolled_count() macro
ice, same as i40e, has a custom loop unrolling macros for unrolling
Tx descriptors filling on XSk xmit.
Replace ice defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be usual for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 17:54:43 -08:00
Mateusz Polchlopek
5a7b0b6ff4 ice: refactor ice_fdir_create_dflt_rules() function
The Flow Director function ice_fdir_create_dflt_rules() calls few
times function ice_create_init_fdir_rule() each time with different
enum ice_fltr_ptype parameter. Next step is to return error code if
error occurred.

Change the code to store all necessary default rules in constant array
and call ice_create_init_fdir_rule() in the loop. It makes it easy to
extend the list of default rules in the future, without the need of
duplicate code more and more.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Michal Michalik
f003075227 ice: Implement PTP support for E830 devices
Add specific functions and definitions for E830 devices to enable
PTP support.

E830 devices support direct write to GLTSYN_ registers without shadow
registers and 64 bit read of PHC time.

Enable PTM for E830 device, which is required for cross timestamp and
and dependency on PCIE_PTM for ICE_HWTS.

Check X86_FEATURE_ART for E830 as it may not be present in the CPU.

Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Karol Kolacinski
381d577962 ice: Refactor ice_ptp_init_tx_*
Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X.
This simplifies the code for the future use with new MAC types.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Jacob Keller
92456e795a ice: Add unified ice_capture_crosststamp
Devices supported by ice driver use essentially the same logic for
performing a crosstimestamp. The only difference is that E830 hardware
has different offsets. Instead of having multiple implementations,
combine them into a single ice_capture_crosststamp() function.

To support both hardware types, the ice_capture_crosststamp function
must be able to determine the appropriate registers to access. To handle
this, pass a custom context structure instead of the PF pointer. This
structure, ice_crosststamp_ctx, contains a pointer to the PF, and
a pointer to the device configuration structure. This new structure also
will make it easier to implement historic snapshot support in a future
commit.

The device configuration structure is a static const data which defines
the offsets and flags for the various registers. This includes the lock
register, the cross timestamp control register, the upper and lower ART
system time capture registers, and the upper and lower device time
capture registers for each timer index.

Use the configuration structure to access all of the registers in
ice_capture_crosststamp(). Ensure that we don't over-run the device time
array by checking that the timer index is 0 or 1. Previously this was
simply assumed, and it would cause the device to read an incorrect and
likely garbage register.

It does feel like there should be a kernel interface for managing
register offsets like this, but the closest thing I saw was
<linux/regmap.h> which is interesting but not quite what we're looking
for...

Use rd32_poll_timeout() to read lock_reg and ctl_reg.

Add snapshot system time for historic interpolation.

Remove X86_FEATURE_ART and X86_FEATURE_TSC_KNOWN_FREQ from all E82X
devices because those are SoCs, which will always have those features.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski
f9472aaabd ice: Process TSYN IRQ in a separate function
Simplify TSYN IRQ processing by moving it to a separate function and
having appropriate behavior per PHY model, instead of multiple
conditions not related to HW, but to specific timestamping modes.

When PTP is not enabled in the kernel, don't process timestamps and
return IRQ_HANDLED.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski
ea7029fe10 ice: Use FIELD_PREP for timestamp values
Instead of using shifts and casts, use FIELD_PREP after reading 40b
timestamp values.

Rename a couple defines for better clarity and consistency.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski
9973ac9f23 ice: Remove unnecessary ice_is_e8xx() functions
Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use
MAC type where applicable.

Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because
in reality it depends on the ready bitmap, which only E810 does not
have.

Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further
function calls check the MAC type anyway and this allows simpler code
in the future with addition of the new MAC types.

Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:36 -08:00
Karol Kolacinski
e2c6737e6e ice: Don't check device type when checking GNSS presence
Don't check if the device type is E810T as non-E810T devices can support
GNSS too and PCA9575 check is enough to determine if GNSS is present or
not.

Rename ice_gnss_is_gps_present() to ice_gnss_is_module_present()
because GNSS module supports multiple GNSS providers, not only GPS.

Move functions related to PCA9575 from ice_ptp_hw.c to ice_common.c
to be able to access them when PTP is disabled in the kernel, but GNSS
is enabled.

Remove logical AND with ICE_AQC_LINK_TOPO_NODE_TYPE_M in
ice_get_pca9575_handle(), which has no effect, and reorder device type
checks to check the device_id first, then set other variables.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 08:52:04 -08:00
Jakub Kicinski
26db4dbb74 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice: managing MSI-X in driver

Michal Swiatkowski says:

It is another try to allow user to manage amount of MSI-X used for each
feature in ice. First was via devlink resources API, it wasn't accepted
in upstream. Also static MSI-X allocation using devlink resources isn't
really user friendly.

This try is using more dynamic way. "Dynamic" across whole kernel when
platform supports it and "dynamic" across the driver when not.

To achieve that reuse global devlink parameter pf_msix_max and
pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount
of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for
VFs also). Having pf_msix_max allow user to statically set how many
MSI-X he wants on PF and how many should be reserved for VFs.

pf_msix_min is used to set minimum number of MSI-X with which ice driver
should probe correctly.

Meaning of this field in case of dynamic vs static allocation:
- on system with dynamic MSI-X allocation support
 * alloc pf_msix_min as static, rest will be allocated dynamically
- on system without dynamic MSI-X allocation support
 * try alloc pf_msix_max as static, minimum acceptable result is
 pf_msix_min

As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an
probably should) be stored in NVM. This patchset isn't implementing
that.

Dynamic (kernel or driver) way means that splitting MSI-X across the
RDMA and eth in case there is a MSI-X shortage isn't correct. Can work
when dynamic is only on driver site, but can't when dynamic is on kernel
site.

Let's remove this code and move to MSI-X allocation feature by feature.
If there is no more MSI-X for a feature, a feature is working with less
MSI-X or it is turned off.

There is a regression here. With MSI-X splitting user can run RDMA and
eth even on system with not enough MSI-X. Now only eth will work. RDMA
can be turned on by changing number of PF queues (lowering) and reprobe
RDMA driver.

Example:
72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147.

We set pf_msix_min = 2, pf_msix_max = 128

OICR: 1
eth: 72
flow director: 1
RDMA: 128 - 74 = 54

We can change number of queues on pf to 36 and do devlink reinit

OICR: 1
eth: 36
RDMA: 73
flow director: 1

We can also (implemented in "ice: enable_rdma devlink param") turned
RDMA off.

OICR: 1
eth: 72
RDMA: 0 (turned off)
flow director: 1

After this changes we have a static base vector for SRIOV (SIOV probably
in the feature). Last patch from this series is simplifying managing VF
MSI-X code based on static vector.

Now changing queues using ethtool is also changing MSI-X. If there is
enough MSI-X it is always one to one. When there is not enough there
will be more queues than MSI-X.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: init flow director before RDMA
  ice: simplify VF MSI-X managing
  ice: enable_rdma devlink param
  ice: treat dyn_allowed only as suggestion
  ice, irdma: move interrupts code to irdma
  ice: get rid of num_lan_msix field
  ice: remove splitting MSI-X between features
  ice: devlink PF MSI-X max and min parameter
  ice: count combined queues using Rx/Tx count
====================

Link: https://patch.msgid.link/20250205185512.895887-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 17:34:36 -08:00
Michal Swiatkowski
d67627e7b5 ice: init flow director before RDMA
Flow director needs only one MSI-X. Load it before RDMA to save MSI-X
for it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:58 -08:00
Michal Swiatkowski
a203163274 ice: simplify VF MSI-X managing
After implementing pf->msix.max field, base vector for other use cases
(like VFs) can be fixed. This simplify code when changing MSI-X amount
on particular VF, because there is no need to move a base vector.

A fixed base vector allows to reserve vectors from the beginning
instead of from the end, which is also simpler in code.

Store total and rest value in the same struct as max and min for PF.
Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also
use for other none PF use cases (SIOV).

Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
87181cd698 ice: enable_rdma devlink param
Implement enable_rdma devlink parameter to allow user to turn RDMA
feature on and off.

It is useful when there is no enough interrupts and user doesn't need
RDMA feature.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
a8c2d3932c ice: treat dyn_allowed only as suggestion
It can be needed to have some MSI-X allocated as static and rest as
dynamic. For example on PF VSI. We want to always have minimum one MSI-X
on it, because of that it is allocated as a static one, rest can be
dynamic if it is supported.

Change the ice_get_irq_res() to allow using static entries if they are
free even if caller wants dynamic one.

Adjust limit values to the new approach. Min and max in limit means the
values that are valid, so decrease max and num_static by one.

Set vsi::irq_dyn_alloc if dynamic allocation is supported.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
3e0d3cb3fb ice, irdma: move interrupts code to irdma
Move responsibility of MSI-X requesting for RDMA feature from ice driver
to irdma driver. It is done to allow simple fallback when there is not
enough MSI-X available.

Change amount of MSI-X used for control from 4 to 1, as it isn't needed
to have more than one MSI-X for this purpose.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
ad61cd9c67 ice: get rid of num_lan_msix field
Remove the field to allow having more queues than MSI-X on VSI. As
default the number will be the same, but if there won't be more MSI-X
available VSI can run with at least one MSI-X.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
79d97b8cf9 ice: remove splitting MSI-X between features
With dynamic approach to alloc MSI-X there is no sense to statically
split MSI-X between PF features.

Splitting was also calculating needed MSI-X. Move this part to separate
function and use as max value.

Remove ICE_ESWITCH_MSIX, as there is no need for additional MSI-X for
switchdev.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski
b2657259fc ice: devlink PF MSI-X max and min parameter
Use generic devlink PF MSI-X parameter to allow user to change MSI-X
range.

Add notes about this parameters into ice devlink documentation.

Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:02:44 -08:00
Michal Swiatkowski
c3a392bdd3 ice: count combined queues using Rx/Tx count
Previous implementation assumes that there is 1:1 matching between
vectors and queues. It isn't always true.

Get minimum value from Rx/Tx queues to determine combined queues number.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-04 15:08:57 -08:00
Jakub Kicinski
88be092224 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
ice: fix Rx data path for heavy 9k MTU traffic

Maciej Fijalkowski says:

This patchset fixes a pretty nasty issue that was reported by RedHat
folks which occurred after ~30 minutes (this value varied, just trying
here to state that it was not observed immediately but rather after a
considerable longer amount of time) when ice driver was tortured with
jumbo frames via mix of iperf traffic executed simultaneously with
wrk/nginx on client/server sides (HTTP and TCP workloads basically).

The reported splats were spanning across all the bad things that can
happen to the state of page - refcount underflow, use-after-free, etc.
One of these looked as follows:

[ 2084.019891] BUG: Bad page state in process swapper/34  pfn:97fcd0
[ 2084.025990] page:00000000a60ee772 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x97fcd0
[ 2084.035462] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 2084.041990] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[ 2084.049730] raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
[ 2084.057468] page dumped because: nonzero _refcount
[ 2084.062260] Modules linked in: bonding tls sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 irqd
[ 2084.137829] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Not tainted 5.14.0-427.37.1.el9_4.x86_64 #1
[ 2084.147039] Hardware name: Dell Inc. PowerEdge R750/0216NK, BIOS 1.13.2 12/19/2023
[ 2084.154604] Call Trace:
[ 2084.157058]  <IRQ>
[ 2084.159080]  dump_stack_lvl+0x34/0x48
[ 2084.162752]  bad_page.cold+0x63/0x94
[ 2084.166333]  check_new_pages+0xb3/0xe0
[ 2084.170083]  rmqueue_bulk+0x2d2/0x9e0
[ 2084.173749]  ? ktime_get+0x35/0xa0
[ 2084.177159]  rmqueue_pcplist+0x13b/0x210
[ 2084.181081]  rmqueue+0x7d3/0xd40
[ 2084.184316]  ? xas_load+0x9/0xa0
[ 2084.187547]  ? xas_find+0x183/0x1d0
[ 2084.191041]  ? xa_find_after+0xd0/0x130
[ 2084.194879]  ? intel_iommu_iotlb_sync_map+0x89/0xe0
[ 2084.199759]  get_page_from_freelist+0x11f/0x530
[ 2084.204291]  __alloc_pages+0xf2/0x250
[ 2084.207958]  ice_alloc_rx_bufs+0xcc/0x1c0 [ice]
[ 2084.212543]  ice_clean_rx_irq+0x631/0xa20 [ice]
[ 2084.217111]  ice_napi_poll+0xdf/0x2a0 [ice]
[ 2084.221330]  __napi_poll+0x27/0x170
[ 2084.224824]  net_rx_action+0x233/0x2f0
[ 2084.228575]  __do_softirq+0xc7/0x2ac
[ 2084.232155]  __irq_exit_rcu+0xa1/0xc0
[ 2084.235821]  common_interrupt+0x80/0xa0
[ 2084.239662]  </IRQ>
[ 2084.241768]  <TASK>

The fix is mostly about reverting what was done in commit 1dc1a7e7f4
("ice: Centrallize Rx buffer recycling") followed by proper timing on
page_count() storage and then removing the ice_rx_buf::act related logic
(which was mostly introduced for purposes from cited commit).

Special thanks to Xu Du for providing reproducer and Jacob Keller for
initial extensive analysis.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  ice: stop storing XDP verdict within ice_rx_buf
  ice: gather page_count()'s of each frag right before XDP prog call
  ice: put Rx buffers after being done with current frame
====================

Link: https://patch.msgid.link/20250131185415.3741532-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03 13:18:44 -08:00
Jiasheng Jiang
a8aa6a6ddc ice: Add check for devm_kzalloc()
Add check for the return value of devm_kzalloc() to guarantee the success
of allocation.

Fixes: 42c2eb6b1f ("ice: Implement devlink-rate API")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250131013832.24805-1-jiashengjiangcool@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-01 16:57:50 -08:00
Maciej Fijalkowski
468a1952df ice: stop storing XDP verdict within ice_rx_buf
Idea behind having ice_rx_buf::act was to simplify and speed up the Rx
data path by walking through buffers that were representing cleaned HW
Rx descriptors. Since it caused us a major headache recently and we
rolled back to old approach that 'puts' Rx buffers right after running
XDP prog/creating skb, this is useless now and should be removed.

Get rid of ice_rx_buf::act and related logic. We still need to take care
of a corner case where XDP program releases a particular fragment.

Make ice_run_xdp() to return its result and use it within
ice_put_rx_mbuf().

Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 10:07:46 -08:00
Maciej Fijalkowski
11c4aa074d ice: gather page_count()'s of each frag right before XDP prog call
If we store the pgcnt on few fragments while being in the middle of
gathering the whole frame and we stumbled upon DD bit not being set, we
terminate the NAPI Rx processing loop and come back later on. Then on
next NAPI execution we work on previously stored pgcnt.

Imagine that second half of page was used actively by networking stack
and by the time we came back, stack is not busy with this page anymore
and decremented the refcnt. The page reuse algorithm in this case should
be good to reuse the page but given the old refcnt it will not do so and
attempt to release the page via page_frag_cache_drain() with
pagecnt_bias used as an arg. This in turn will result in negative refcnt
on struct page, which was initially observed by Xu Du.

Therefore, move the page count storage from ice_get_rx_buf() to a place
where we are sure that whole frame has been collected, but before
calling XDP program as it internally can also change the page count of
fragments belonging to xdp_buff.

Fixes: ac07533911 ("ice: Store page count inside ice_rx_buf")
Reported-and-tested-by: Xu Du <xudu@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 10:06:26 -08:00
Maciej Fijalkowski
743bbd93cf ice: put Rx buffers after being done with current frame
Introduce a new helper ice_put_rx_mbuf() that will go through gathered
frags from current frame and will call ice_put_rx_buf() on them. Current
logic that was supposed to simplify and optimize the driver where we go
through a batch of all buffers processed in current NAPI instance turned
out to be broken for jumbo frames and very heavy load that was coming
from both multi-thread iperf and nginx/wrk pair between server and
client. The delay introduced by approach that we are dropping is simply
too big and we need to take the decision regarding page
recycling/releasing as quick as we can.

While at it, address an error path of ice_add_xdp_frag() - we were
missing buffer putting from day 1 there.

As a nice side effect we get rid of annoying and repetitive three-liner:

	xdp->data = NULL;
	rx_ring->first_desc = ntc;
	rx_ring->nr_frags = 0;

by embedding it within introduced routine.

Fixes: 1dc1a7e7f4 ("ice: Centrallize Rx buffer recycling")
Reported-and-tested-by: Xu Du <xudu@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 09:51:47 -08:00
Mateusz Polchlopek
c5cc2a27e0 ice: remove invalid parameter of equalizer
It occurred that in the commit 70838938e8 ("ice: Implement driver
functionality to dump serdes equalizer values") the invalid DRATE parameter
for reading has been added. The output of the command:

  $ ethtool -d <ethX>

returns the garbage value in the place where DRATE value should be
stored.

Remove mentioned parameter to prevent return of corrupted data to
userspace.

Fixes: 70838938e8 ("ice: Implement driver functionality to dump serdes equalizer values")
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:49:42 -08:00
Przemek Kitszel
18625e26fe ice: fix ice_parser_rt::bst_key array size
Fix &ice_parser_rt::bst_key size. It was wrongly set to 10 instead of 20
in the initial impl commit (see Fixes tag). All usage code assumed it was
of size 20. That was also the initial size present up to v2 of the intro
series [2], but halved by v3 [3] refactor described as "Replace magic
hardcoded values with macros." The introducing series was so big that
some ugliness was unnoticed, same for bugs :/

ICE_BST_KEY_TCAM_SIZE and ICE_BST_TCAM_KEY_SIZE were differing by one.
There was tmp variable @j in the scope of edited function, but was not
used in all places. This ugliness is now gone.
I'm moving ice_parser_rt::pg_prio a few positions up, to fill up one of
the holes in order to compensate for the added 10 bytes to the ::bst_key,
resulting in the same size of the whole as prior to the fix, and minimal
changes in the offsets of the fields.

Extend also the debug dump print of the key to cover all bytes. To not
have string with 20 "%02x" and 20 params, switch to
ice_debug_array_w_prefix().

This fix obsoletes Ahmed's attempt at [1].

[1] https://lore.kernel.org/intel-wired-lan/20240823230847.172295-1-ahmed.zaki@intel.com
[2] https://lore.kernel.org/intel-wired-lan/20230605054641.2865142-13-junfeng.guo@intel.com
[3] https://lore.kernel.org/intel-wired-lan/20230817093442.2576997-13-junfeng.guo@intel.com

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/intel-wired-lan/b1fb6ff9-b69e-4026-9988-3c783d86c2e0@stanley.mountain
Fixes: 9a4c07aaa0 ("ice: add parser execution main loop")
CC: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:49:30 -08:00
Linus Torvalds
0ad9617c78 Networking changes for 6.14.
Core
 ----
 
  - More core refactoring to reduce the RTNL lock contention,
    including preparatory work for the per-network namespace RTNL lock,
    replacing RTNL lock with a per device-one to protect NAPI-related
    net device data and moving synchronize_net() calls outside such
    lock.
 
  - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and
    more specific TCP coverage.
 
  - Reduce network namespace tear-down time by removing per-subsystems
    synchronize_net() in tipc and sched.
 
  - Add flow label selector support for fib rules, allowing traffic
    redirection based on such header field.
 
 Netfilter
 ---------
 
  - Do not remove netdev basechain when last device is gone, allowing
    netdev basechains without devices.
 
  - Revisit the flowtable teardown strategy, dealing better with fin,
    reset and re-open events.
 
  - Scale-up IP-vs connection dumping by avoiding linear search on
    each restart.
 
 Protocols
 ---------
 
  - A significant XDP socket refactor, consolidating and optimizing
    several helpers into the core
 
  - Better scaling of ICMP rate-limiting, by removing false-sharing in
    inet peers handling.
 
  - Introduces netlink notifications for multicast IPv4 and IPv6
    address changes.
 
  - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
    aggregation and fragmentation of the inner IP.
 
  - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets,
    to avoid local port exhaustion issues when the average connection
    lifetime is very short.
 
  - Support updating keys (re-keying) for connections using kernel
    TLS (for TLS 1.3 only).
 
  - Support ipv4-mapped ipv6 address clients in smc-r v2.
 
  - Add support for jumbo data packet transmission in RxRPC sockets,
    gluing multiple data packets in a single UDP packet.
 
  - Support RxRPC RACK-TLP to manage packet loss and retransmission in
    conjunction with the congestion control algorithm.
 
 Driver API
 ----------
 
  - Introduce a unified and structured interface for reporting PHY
    statistics, exposing consistent data across different H/W via
    ethtool.
 
  - Make timestamping selectable, allow the user to select the desired
    hwtstamp provider (PHY or MAC) administratively.
 
  - Add support for configuring a header-data-split threshold (HDS)
    value via ethtool, to deal with partial or buggy H/W implementation.
 
  - Consolidate DSA drivers Energy Efficiency Ethernet support.
 
  - Add EEE management to phylink, making use of the phylib
    implementation.
 
  - Add phylib support for in-band capabilities negotiation.
 
  - Simplify how phylib-enabled mac drivers expose the supported
    interfaces.
 
 Tests and tooling
 -----------------
 
  - Make the YNL tool package-friendly to make it easier to deploy it
    separately from the kernel.
 
  - Increase TCP selftest coverage importing several packetdrill
    test-cases.
 
  - Regenerate the ethtool uapi header from the YNL spec,
    to ease maintenance and future development.
 
  - Add YNL support for decoding the link types used in net
    self-tests, allowing a single build to run both net and
    drivers/net.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox (mlx5):
      - add cross E-Switch QoS support
      - add SW Steering support for ConnectX-8
      - implement support for HW-Managed Flow Steering, improving the
        rule deletion/insertion rate
      - support for multi-host LAG
    - Intel (ixgbe, ice, igb):
      - ice: add support for devlink health events
      - ixgbe: add initial support for E610 chipset variant
      - igb: add support for AF_XDP zero-copy
    - Meta:
      - add support for basic RSS config
      - allow changing the number of channels
      - add hardware monitoring support
    - Broadcom (bnxt):
      - implement TCP data split and HDS threshold ethtool support,
        enabling Device Memory TCP.
    - Marvell Octeon:
      - implement egress ipsec offload support for the cn10k family
    - Hisilicon (HIBMC):
      - implement unicast MAC filtering
 
  - Ethernet NICs embedded and virtual:
    - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
      contented atomic operations for drop counters
    - Freescale:
      - quicc: phylink conversion
      - enetc: support Tx and Rx checksum offload and improve TSO
        performances
    - MediaTek:
      - airoha: introduce support for ETS and HTB Qdisc offload
    - Microchip:
      - lan78XX USB: preparation work for phylink conversion
    - Synopsys (stmmac):
      - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
      - refactor EEE support to leverage the new driver API
      - optimize DMA and cache access to increase raw RX performances
        by 40%
    - TI:
      - icssg-prueth: add multicast filtering support for VLAN
        interface
    - netkit:
      - add ability to configure head/tailroom
    - VXLAN:
      - accepts packets with user-defined reserved bit
 
  - Ethernet switches:
    - Microchip:
      - lan969x: add RGMII support
      - lan969x: improve TX and RX performance using the FDMA engine
    - nVidia/Mellanox:
      - move Tx header handling to PCI driver, to ease XDP support
 
  - Ethernet PHYs:
    - Texas Instruments DP83822:
      - add support for GPIO2 clock output
    - Realtek:
      - 8169: add support for RTL8125D rev.b
      - rtl822x: add hwmon support for the temperature sensor
    - Microchip:
      - add support for RDS PTP hardware
      - consolidate periodic output signal generation
 
  - CAN:
    - several DT-bindings to DT schema conversions
    - tcan4x5x:
      - add HW standby support
      - support nWKRQ voltage selection
    - kvaser:
      - allowing Bus Error Reporting runtime configuration
 
  - WiFi:
    - the on-going Multi-Link Operation (MLO) effort continues, affecting
      both the stack and in drivers
    - mac80211/cfg80211:
      - Emergency Preparedness Communication Services (EPCS) station mode
        support
      - support for adding and removing station links for MLO
      - add support for WiFi 7/EHT mesh over 320 MHz channels
      - report Tx power info for each link
    - RealTek (rtw88):
      - enable USB Rx aggregation and USB 3 to improve performance
      - LED support
    - RealTek (rtw89):
      - refactor power save to support Multi-Link Operations
      - add support for RTL8922AE-VS variant
    - MediaTek (mt76):
      - single wiphy multiband support (preparation for MLO)
      - p2p device support
      - add TP-Link TXE50UH USB adapter support
    - Qualcomm (ath10k):
      - support for the QCA6698AQ IP core
    - Qualcomm (ath12k):
      - enable MLO for QCN9274
 
  - Bluetooth:
    - Allow sysfs to trigger hdev reset, to allow recovering devices
      not responsive from user-space
    - MediaTek: add support for MT7922, MT7925, MT7921e devices
    - Realtek: add support for RTL8851BE devices
    - Qualcomm: add support for WCN785x devices
    - ISO: allow BIG re-sync
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmePf5YSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkUcMQALblhkGTxurnfT+yK+Bsuhn2LoHl2RPN
 4u2Kjkzm+2FYgcw6lS17cFXsnfAPlRIpmhnmKk1EBgsBdkuL29c+jtqnljA2bboD
 tIMhMgWiaLS3xgEMrLeKnseIo0G9mviQRphGeZPFTaLb4Ww/bd5LAp4ZGc5oij76
 tURatC3b6MuO4Lt5U+jWKnRwviXku8udHkVHXlvPdirawHCVinmx3tvce/BI/MaD
 eUOp6ZeJCPCOLtk7b8WEyxxvdY0f6D9ed82qfPDHjb94SJv+Vxb38RZtNuApIjn9
 S0KdlNih/4flDy17LDxGYSyFps78lUFRbpqmsUlnZkyLXpsph7/WTvAmMAFcrX0K
 UgQ/F/q5GAvcP5WZcCj5+tZaRmfKQraQirXMtYU/Uj50qCnSU7ssyACASt23GLZ8
 OF8tCLlm9lLOU1B6Ofkul1Dbo5f0Xpaghga4dFb0kzSfbm78fTUnqBNsJ7jIkWfi
 fD6dO+fg+p2ZMD0CACGo3CNxQuJmaQWg6BIDeno6God8kZ6qBMxY/sFr4qozrvFH
 x/FgQq8dgc8WLmaPejKiNIPkdQepXrIiv3T9jgMVyEjJnWB/LBfyWKSQOdTfnLs+
 rgr4YMV6XW4bx0fYqTI8B9jZ+FCWbG6sn4UtRTHITKcd3FSvd8Y+PHa5YyCUWvJM
 l8pePMGF0XVF
 =hrsp
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "This is slightly smaller than usual, with the most interesting work
  being still around RTNL scope reduction.

  Core:

   - More core refactoring to reduce the RTNL lock contention, including
     preparatory work for the per-network namespace RTNL lock, replacing
     RTNL lock with a per device-one to protect NAPI-related net device
     data and moving synchronize_net() calls outside such lock.

   - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge
     and more specific TCP coverage.

   - Reduce network namespace tear-down time by removing per-subsystems
     synchronize_net() in tipc and sched.

   - Add flow label selector support for fib rules, allowing traffic
     redirection based on such header field.

  Netfilter:

   - Do not remove netdev basechain when last device is gone, allowing
     netdev basechains without devices.

   - Revisit the flowtable teardown strategy, dealing better with fin,
     reset and re-open events.

   - Scale-up IP-vs connection dumping by avoiding linear search on each
     restart.

  Protocols:

   - A significant XDP socket refactor, consolidating and optimizing
     several helpers into the core

   - Better scaling of ICMP rate-limiting, by removing false-sharing in
     inet peers handling.

   - Introduces netlink notifications for multicast IPv4 and IPv6
     address changes.

   - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
     aggregation and fragmentation of the inner IP.

   - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to
     avoid local port exhaustion issues when the average connection
     lifetime is very short.

   - Support updating keys (re-keying) for connections using kernel TLS
     (for TLS 1.3 only).

   - Support ipv4-mapped ipv6 address clients in smc-r v2.

   - Add support for jumbo data packet transmission in RxRPC sockets,
     gluing multiple data packets in a single UDP packet.

   - Support RxRPC RACK-TLP to manage packet loss and retransmission in
     conjunction with the congestion control algorithm.

  Driver API:

   - Introduce a unified and structured interface for reporting PHY
     statistics, exposing consistent data across different H/W via
     ethtool.

   - Make timestamping selectable, allow the user to select the desired
     hwtstamp provider (PHY or MAC) administratively.

   - Add support for configuring a header-data-split threshold (HDS)
     value via ethtool, to deal with partial or buggy H/W
     implementation.

   - Consolidate DSA drivers Energy Efficiency Ethernet support.

   - Add EEE management to phylink, making use of the phylib
     implementation.

   - Add phylib support for in-band capabilities negotiation.

   - Simplify how phylib-enabled mac drivers expose the supported
     interfaces.

  Tests and tooling:

   - Make the YNL tool package-friendly to make it easier to deploy it
     separately from the kernel.

   - Increase TCP selftest coverage importing several packetdrill
     test-cases.

   - Regenerate the ethtool uapi header from the YNL spec, to ease
     maintenance and future development.

   - Add YNL support for decoding the link types used in net self-tests,
     allowing a single build to run both net and drivers/net.

  Drivers:

   - Ethernet high-speed NICs:
      - nVidia/Mellanox (mlx5):
         - add cross E-Switch QoS support
         - add SW Steering support for ConnectX-8
         - implement support for HW-Managed Flow Steering, improving the
           rule deletion/insertion rate
         - support for multi-host LAG
      - Intel (ixgbe, ice, igb):
         - ice: add support for devlink health events
         - ixgbe: add initial support for E610 chipset variant
         - igb: add support for AF_XDP zero-copy
      - Meta:
         - add support for basic RSS config
         - allow changing the number of channels
         - add hardware monitoring support
      - Broadcom (bnxt):
         - implement TCP data split and HDS threshold ethtool support,
           enabling Device Memory TCP.
      - Marvell Octeon:
         - implement egress ipsec offload support for the cn10k family
      - Hisilicon (HIBMC):
         - implement unicast MAC filtering

   - Ethernet NICs embedded and virtual:
      - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
        contented atomic operations for drop counters
      - Freescale:
         - quicc: phylink conversion
         - enetc: support Tx and Rx checksum offload and improve TSO
           performances
      - MediaTek:
         - airoha: introduce support for ETS and HTB Qdisc offload
      - Microchip:
         - lan78XX USB: preparation work for phylink conversion
      - Synopsys (stmmac):
         - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
         - refactor EEE support to leverage the new driver API
         - optimize DMA and cache access to increase raw RX performances
           by 40%
      - TI:
         - icssg-prueth: add multicast filtering support for VLAN
           interface
      - netkit:
         - add ability to configure head/tailroom
      - VXLAN:
         - accepts packets with user-defined reserved bit

   - Ethernet switches:
      - Microchip:
         - lan969x: add RGMII support
         - lan969x: improve TX and RX performance using the FDMA engine
      - nVidia/Mellanox:
         - move Tx header handling to PCI driver, to ease XDP support

   - Ethernet PHYs:
      - Texas Instruments DP83822:
         - add support for GPIO2 clock output
      - Realtek:
         - 8169: add support for RTL8125D rev.b
         - rtl822x: add hwmon support for the temperature sensor
      - Microchip:
         - add support for RDS PTP hardware
         - consolidate periodic output signal generation

   - CAN:
      - several DT-bindings to DT schema conversions
      - tcan4x5x:
         - add HW standby support
         - support nWKRQ voltage selection
      - kvaser:
         - allowing Bus Error Reporting runtime configuration

   - WiFi:
      - the on-going Multi-Link Operation (MLO) effort continues,
        affecting both the stack and in drivers
      - mac80211/cfg80211:
         - Emergency Preparedness Communication Services (EPCS) station
           mode support
         - support for adding and removing station links for MLO
         - add support for WiFi 7/EHT mesh over 320 MHz channels
         - report Tx power info for each link
      - RealTek (rtw88):
         - enable USB Rx aggregation and USB 3 to improve performance
         - LED support
      - RealTek (rtw89):
         - refactor power save to support Multi-Link Operations
         - add support for RTL8922AE-VS variant
      - MediaTek (mt76):
         - single wiphy multiband support (preparation for MLO)
         - p2p device support
         - add TP-Link TXE50UH USB adapter support
      - Qualcomm (ath10k):
         - support for the QCA6698AQ IP core
      - Qualcomm (ath12k):
         - enable MLO for QCN9274

   - Bluetooth:
      - Allow sysfs to trigger hdev reset, to allow recovering devices
        not responsive from user-space
      - MediaTek: add support for MT7922, MT7925, MT7921e devices
      - Realtek: add support for RTL8851BE devices
      - Qualcomm: add support for WCN785x devices
      - ISO: allow BIG re-sync"

* tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits)
  net/rose: prevent integer overflows in rose_setsockopt()
  net: phylink: fix regression when binding a PHY
  net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup
  net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup
  net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path
  ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
  ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
  ipv6: Move lifetime validation to inet6_rtm_newaddr().
  ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
  ipv6: Pass dev to inet6_addr_add().
  ipv6: Convert inet6_ioctl() to per-netns RTNL.
  ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
  ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
  ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
  ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
  ipv6: Add __in6_dev_get_rtnl_net().
  net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
  net: mii: Fix the Speed display when the network cable is not connected
  sysctl net: Remove macro checks for CONFIG_SYSCTL
  eth: bnxt: update header sizing defaults
  ...
2025-01-22 08:28:57 -08:00
Linus Torvalds
1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handle
 CPU-hotplug operations and CPU isolation. Each of which do it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 _ kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness check to ensure kthread_bind() is always called
   before the first kthread wake up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handle
  CPU-hotplug operations and CPU isolation. Each of which do it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness check to ensure kthread_bind() is always
     called before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
Jakub Kicinski
ba0209bd18 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice: support FW Recovery Mode

Konrad Knitter says:

Enable update of card in FW Recovery Mode

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: support FW Recovery Mode
  devlink: add devl guard
  pldmfw: enable selected component update
====================

Link: https://patch.msgid.link/20250116212059.1254349-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17 19:45:35 -08:00
Konrad Knitter
1de25c6b98 ice: support FW Recovery Mode
Recovery Mode is intended to recover from a fatal failure scenario in
which the device is not accessible to the host, meaning the firmware is
non-responsive.

The purpose of the Firmware Recovery Mode is to enable software tools to
update firmware and/or device configuration so the fatal error can be
resolved.

Recovery Mode Firmware supports a limited set of admin commands required
for NVM update.
Recovery Firmware does not support hardware interrupts so a polling mode
is used.

The driver will expose only the minimum set of devlink commands required
for the recovery of the adapter.

Using an appropriate NVM image, the user can recover the adapter using
the devlink flash API.

Prior to 4.20 E810 Adapter Recovery Firmware supports only the update
and erase of the "fw.mgmt" component.

E810 Adapter Recovery Firmware doesn't support selected preservation of
cards settings or identifiers.

The following command can be used to recover the adapter:

$ devlink dev flash <pci-address> <update-image.bin> component fw.mgmt
  overwrite settings overwrite identifier

Newer FW versions (4.20 or newer) supports update of "fw.undi" and
"fw.netlist" components.

$ devlink dev flash <pci-address> <update-image.bin>

Tested on Intel Corporation Ethernet Controller E810-C for SFP
FW revision 3.20 and 4.30.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Konrad Knitter <konrad.knitter@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-16 13:05:06 -08:00
Jakub Kicinski
2ee738e90e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc8).

Conflicts:

drivers/net/ethernet/realtek/r8169_main.c
  1f691a1fc4 ("r8169: remove redundant hwmon support")
  152d00a913 ("r8169: simplify setting hwmon attribute visibility")
https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  152f4da05a ("bnxt_en: add support for rx-copybreak ethtool command")
  f0aa6a37a3 ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref")

drivers/net/ethernet/intel/ice/ice_type.h
  50327223a8 ("ice: add lock to protect low latency interface")
  dc26548d72 ("ice: Fix quad registers read on E825")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 10:34:59 -08:00
Karol Kolacinski
914639464b ice: Add in/out PTP pin delays
HW can have different input/output delays for each of the pins.

Currently, only E82X adapters have delay compensation based on TSPLL
config and E810 adapters have constant 1 ms compensation, both cases
only for output delays and the same one for all pins.

E825 adapters have different delays for SDP and other pins. Those
delays are also based on direction and input delays are different than
output ones. This is the main reason for moving delays to pin
description structure.

Add a field in ice_ptp_pin_desc structure to reflect that. Delay values
are based on approximate calculations of HW delays based on HW spec.

Implement external timestamp (input) delay compensation.

Remove existing definitions and wrappers for periodic output propagation
delays.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller
ef9a64c072 ice: implement low latency PHY timer updates
Programming the PHY registers in preparation for an increment value change
or a timer adjustment on E810 requires issuing Admin Queue commands for
each PHY register. It has been found that the firmware Admin Queue
processing occasionally has delays of tens or rarely up to hundreds of
milliseconds. This delay cascades to failures in the PTP applications which
depend on these updates being low latency.

Consider a standard PTP profile with a sync rate of 16 times per second.
This means there is ~62 milliseconds between sync messages. A complete
cycle of the PTP algorithm

1) Sync message (with Tx timestamp) from source
2) Follow-up message from source
3) Delay request (with Tx timestamp) from sink
4) Delay response (with Rx timestamp of request) from source
5) measure instantaneous clock offset
6) request time adjustment via CLOCK_ADJTIME systemcall

The Tx timestamps have a default maximum timeout of 10 milliseconds. If we
assume that the maximum possible time is used, this leaves us with ~42
milliseconds of processing time for a complete cycle.

The CLOCK_ADJTIME system call is synchronous and will block until the
driver completes its timer adjustment or frequency change.

If the writes to prepare the PHY timers get hit by a latency spike of 50
milliseconds, then the PTP application will be delayed past the point where
the next cycle should start. Packets from the next cycle may have already
arrived and are waiting on the socket.

In particular, LinuxPTP ptp4l may start complaining about missing an
announce message from the source, triggering a fault. In addition, the
clockcheck logic it uses may trigger. This clockcheck failure occurs
because the timestamp captured by hardware is compared against a reading of
CLOCK_MONOTONIC. It is assumed that the time when the Rx timestamp is
captured and the read from CLOCK_MONOTONIC are relatively close together.
This is not the case if there is a significant delay to processing the Rx
packet.

Newer firmware supports programming the PHY registers over a low latency
interface which bypasses the Admin Queue. Instead, software writes to the
REG_LL_PROXY_L and REG_LL_PROXY_H registers. Firmware reads these registers
and then programs the PHY timers.

Implement functions to use this interface when available to program the PHY
timers instead of using the Admin Queue. This avoids the Admin Queue
latency and ensures that adjustments happen within acceptable latency
bounds.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller
a5c69d45df ice: check low latency PHY timer update firmware capability
Newer versions of firmware support programming the PHY timer via the low
latency interface exposed over REG_LL_PROXY_L and REG_LL_PROXY_H. Add
support for checking the device capabilities for this feature.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00