This scnprintf() uses the wrong limit. It should be
"LPFC_FPIN_WWPN_LINE_SZ - len" instead of LPFC_FPIN_WWPN_LINE_SZ.
Link: https://lore.kernel.org/r/20210916132251.GD25094@kili
Fixes: 428569e66f ("scsi: lpfc: Expand FPIN and RDF receive logging")
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, we hold off unregistering with NVMe transport layer until GID_FT
or ADISC completes upon receipt of RSCN. In the ADISC discovery routine,
for nodes not found in the GID_FT response, the nodes are unregistered from
the SCSI transport but not UNREG_RPI'd. Meaning outstanding WQEs continue
to be outstanding and were not failed back to the OS. If an NVMe device,
this mean there wasn't initial termination of the I/Os so they could be
issued on a different NVMe path.
Fix by unregistering the RPI so that I/O is cancelled.
Link: https://lore.kernel.org/r/20210910233159.115896-8-jsmart2021@gmail.com
Fixes: 0614568361 ("scsi: lpfc: Delay unregistering from transport until GIDFT or ADISC completes")
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In pt-2-pt mode, the initiator does not log into the target after a PRLI
error. In pt-2-pt mode, the target responded to the PRLI by sending a
LOGO. The LOGO causes all ELS and I/Os to be aborted. This caused the PRLI
to fail. The PRLI completion path caused the discovery node to be dropped
to avoid being stick in an UNUSED (not logged in) state. As the node was
dropped there is no retry of the login and as it is pt-2-pt, there is no
RSCN to retrigger discovery. Thus the other end is not seen by the OS.
Fix by ensuring the discovery node is not dropped if connecting pt-2-pt.
This will cause PLOGI to be retried.
Link: https://lore.kernel.org/r/20210910233159.115896-7-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A test scenario encountered an unload hang while an FLOGI ELS was in flight
when a link down condition occurred. The driver fails unload as it never
releases the fport node.
For most nodes, when the link drops, devloss tmo is started and the timeout
will cause the final node release. For the Fport, as it has not yet
registered with the SCSI transport, there is no devloss timer to be
started, so there is no final release. Additionally, the link down
sequence causes ABORTS to be issued for pending ELS's. The completions from
the ABORTS perform the release of node references. However, as the adapter
is being reset to be unloaded, those completions will never occur.
Fix by the following:
- In the ELS cleanup, recognize when unloading and place the ELS's on a
different list that immediately cleans up/completes the ELS's. It's
recognized that this condition primarily affects only the fport, with
other ports having normal clean up logic that handles things.
- Resolve the devloss issue by, when cleaning up nodes on after link down,
recognizing when the fabric node does not have a completed state (its
state is UNUSED) and removing a reference so the node can delete after
the ELS reference is released.
Link: https://lore.kernel.org/r/20210910233159.115896-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A test scenario has a target issuing a TPLS after accepting the driver's
PRLI. TPLS is not supported by the driver so it rejects the ELS. However,
the reject was only happening on the primary N_Port. If the TPLS was to a
NPIV vport, not only would it reject the ELS, but it would act on the TPLS,
starting devloss, then unregister from the SCSI transport and release the
node. When devloss expired, it would access the node again and cause a page
faul.
Fix by altering the NPIV code to recognize that a correctly registered node
can reject unsolicited ELS I/O and to not unregister with the SCSI
transport and tear the node down. Add a check of the fc4_xpt_flags so that
only a zero value allows the unreg and teardown.
Link: https://lore.kernel.org/r/20210910233159.115896-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In a rarely executed path, FLOGI failure, there is a refcounting error. If
FLOGI completed with an error, typically a timeout, the initial completion
handler would remove the job reference. However, the job completion isn't
the actual end of the job/exchange as the timeout usually initiates an
ABTS, and upon that ABTS completion, a final completion is sent. The driver
removes the reference again in the final completion. Thus the imbalance.
In the buggy cases, if there was a link bounce while the delayed response
is outstanding, the fport node may be referenced again but there was no
additional reference as it is already present. The delayed completion then
occurs and removes the last reference freeing the node and causing issues
in the link up processed that is using the node.
Fix this scenario by removing the snippet that removed the reference in the
initial FLOGI completion. The bad snippet was poorly trying to identify the
FLOGI as OK to do so by realizing the node was not registered with either
SCSI or NVMe transport.
Link: https://lore.kernel.org/r/20210910233159.115896-3-jsmart2021@gmail.com
Fixes: 618e2ee146 ("scsi: lpfc: Fix FLOGI failure due to accessing a freed node")
Cc: <stable@vger.kernel.org> # v5.13+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The kernel test robot reported the following sparse warning:
".../lpfc_els.c:3984:25: sparse: sparse: cast from restricted __be16"
For the error being flagged, using be32_to_cpu() on a be16 data type, it
was simple enough. But a review of other elements and warnings were also
evaluated.
This patch corrected several items in the original patch:
- Using be32_to_cpu() on a be16 data type
- cpu_to_le32() used on a std uint32_t (CPU) data type.
Note: This is a byte array, but stored in LE layout by hardware at
32-bit boundaries. So it possibly needed conversion.
- Using cpu_to_le32() on a std uint16_t and assigned to a char typeA
- Using le32_to_cpu() on a le16 type
- Missing cpu_to_le16() on an assignment
Link: https://lore.kernel.org/r/20210830231243.6227-1-jsmart2021@gmail.com
Fixes: 9064aeb2df ("scsi: lpfc: Add EDC ELS support")
Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Allow abbreviated cm framework status information to be obtained via sysfs.
Link: https://lore.kernel.org/r/20210816162901.121235-14-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add the logic to move the congestion management and event information into
the cmd statistics buffer maintained for the adapter. The update includes
rolling up values for the last minute, hour, and day information.
Link: https://lore.kernel.org/r/20210816162901.121235-12-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Complete the enablement of the cm framework feature in the adapter. Perform
the following:
- Detect the presence of the congestion management framework feature.
When the cm framework is present:
- Issue the SET_FEATURE command to enable the feature.
- Register the cm statistics buffer with the adapter.
- Read the cm enablement buffer to determine the cm framework state for cm
management.
When cm management is enabled:
- Monitor all FPIN and congestion signalling events, incrementing
counters.
- Regularly sync with the adapter to communicate congestion events and to
receive an rx request limit.
- Monitor requests for rx data and ensure that no more than the
adapter prescribed limit is issued on the link. If the limit is
exceeded, SCSI and/or NVMe traffic is temporarily suspended.
- Maintain the minute, hourly, daily statistics buffer.
- Monitor for congestion enablement change events, causing a reread of the
enablement buffer and acting on any change in enablement.
And:
- Add teardown logic, including buffer deregistration, on adapter
detachment or reset.
Link: https://lore.kernel.org/r/20210816162901.121235-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When congestion management is enabled, issue EDC ELS to register congestion
signaling capabilities with the fabric. The response handling will process
the fabric parameters and set the reporting parameters.
Similarly, add support for receiving an EDC request from the fabric
generating a corresponding response.
Implement handlers for congestion signals from the fabric and maintain
statistics for them.
Link: https://lore.kernel.org/r/20210816162901.121235-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Expand FPIN logging:
- Display Attached Port Names for Link Integrity and Peer Congestion
events
- Log Delivery, Peer Congestion, and Congestion events
- Sanity check FPIN descriptor lengths when processing FPIN descriptors.
Log RDF events when congestion logging is enabled.
Link: https://lore.kernel.org/r/20210816162901.121235-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Update routines to support 256 Gb link speed for LPe37000/LPe38000
adapters. 256 Gb speeds can be seen on trunk links.
Link: https://lore.kernel.org/r/20210722221721.74388-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During RSCN storms, some instances of LIP on SLI-3 adapters lead to a
situation where FLOGIs keep failing with firmware indicating an illegal
command error code. This situation was preceded by an ADISC completion
that was processed while the link was down. This path on SLI-3 performs a
CLEAR_LA and attempts to activate a VPI with REG_VPI. Later, as the FLOGI
completes, it's no longer in sync with the VPI state. In SLI-3 it is
illegal to have an active VPI during FLOGI.
Resolve by circumventing the SLI-3 path that performs the CLEAR_LA and
REG_VPI. The path will be taken after the FLOGI after the next Link Up.
Link: https://lore.kernel.org/r/20210707184351.67872-18-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In the PLOGI and ADISC completion handling, the device removal event could
be skipped during some link errors. This could leave a stale node in UNUSED
state. Driver unload would hang for a long time waiting for this node to
be freed.
Resolve by taking the following steps:
- Always post ADISC completion events to discovery state machine upon
ADISC completion.
- In case of a completion error for PLOGI/ADISC, ensure that init refcount
is dropped if not registered with transport.
Link: https://lore.kernel.org/r/20210707184351.67872-17-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
On an RSCN event, the nodes specified in RSCN payload and in MAPPED state
are moved to NPR state in order to revalidate the login. This triggers an
immediate unregister from SCSI/NVMe backend. The assumption is that the
node may be missing. The re-registration with the backend happens after
either relogin (PLOGI/PRLI; if ADISC is disabled or login truly lost) or
when ADISC completes successfully (rediscover with ADISC enabled).
However, the NVMe-FC standard provides for an RSCN to be triggered when
the remote port supports a discovery controller and there was a change
of discovery log content. As the remote port typically also supports
storage subsystems, this unregister causes all storage controller
connections to fail and require reconnect.
Correct by reworking the code to ensure that the unregistration only occurs
when a login state is truly terminated, thereby leaving the NVMe storage
controllers in place.
The changes made are:
- Retain node state in ADISC_ISSUE when scheduling ADISC ELS retry.
- Do not clear wwpn/wwnn values upon ADISC failure.
- Move MAPPED nodes to NPR during RSCN processing, but do not unregister
with transport. On GIDFT completion, identify missing nodes (not marked
NLP_NPR_2B_DISC) and unregister them.
- Perform unregistration for nodes that will go through ADISC processing
if ADISC completion fails.
- Successful ADISC completion will move node back to MAPPED state.
Link: https://lore.kernel.org/r/20210707184351.67872-16-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Since the REG_LOGIN to the fabric controller happens in parallel with SCR,
it may or may not be completed by the time RDF is sent. RDF and SCR are
sent to the fabric in parallel, so checking for a completed REG_LOGIN in
the RDF submit path is not needed.
Remove the REG_LOGI check from the RDF submission path.
Link: https://lore.kernel.org/r/20210707184351.67872-11-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The ELS job request structure, that is allocated while issuing ELS RDF/SCR
request path, is not being released in an error path causing a memory leak
message on driver unload.
Free the ELS job structure in the error paths.
Link: https://lore.kernel.org/r/20210707184351.67872-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
RDF ELS handling for NPIV ports may result in an incorrect NDLP reference
count. In the event of a persistent link down, this may lead to premature
release of an NDLP structure and subsequent NULL ptr dereference panic.
Remove extraneous lpfc_nlp_put() call in NPIV port RDF processing.
Link: https://lore.kernel.org/r/20210707184351.67872-9-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If a LOGO is received for a node that is in the NPR state, post a DEVICE_RM
event to allow clean up of the logged out node.
Clearing the NLP_NPR_2B_DISC flag upon receipt of a LOGO ACC may cause
skipping of processing outstanding PLOGIs triggered by parallel RSCN
events. If an outstanding PLOGI is being retried and receipt of a LOGO ACC
occurs, then allow the discovery state machine's PLOGI completion to clean
up the node.
Link: https://lore.kernel.org/r/20210707184351.67872-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Implement ELS commands QFPA and UVEM for the priority tagging appid
support.
Link: https://lore.kernel.org/r/20210608043556.274139-8-muneendra.kumar@broadcom.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Gaurav Srivastava <gaurav.srivastava@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
FC-LS-5 specifies that a received RDF implies a possible change to fabric
supported diagnostic functions. Endpoints are to re-perform the RDF
exchange with the fabric to enable possible new features or adapt to
changes in values.
This patch adds the logic to RDF receive to re-perform the RDF exchange
with the switch.
Link: https://lore.kernel.org/r/20210514195559.119853-11-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During link bounce testing, RPI counts were seen to differ from the number
of nodes. For fabric and domain controllers, a temporary RPI is assigned,
but the code isn't registering it. If the nodes do go away, such as on link
down, the temporary RPI isn't being released.
Change the way these two fabric services are managed, make them behave like
any other remote port. Register the RPI and register with the transport.
Never leave the nodes in a NPR or UNUSED state where their RPI is in limbo.
This allows them to follow normal dev_loss_tmo handling, RPI refcounting,
and normal removal rules. It also allows fabric I/Os to use the RPI for
traffic requests.
Note: There is some logic that still has a couple of exceptions when the
Domain controller (0xfffcXX). There are cases where the fabric won't have a
valid login but will send RDP. Other times, it will it send a LOGO then an
RDP. It makes for ad-hoc behavior to manage the node. Exceptions are
documented in the code.
Link: https://lore.kernel.org/r/20210514195559.119853-7-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When lpfc is handling a solicited and unsolicited PLOGI with another
initiator, the remote initiator is never recovered. The node for the
initiator is erroneouosly removed and all resources released.
In lpfc_cmpl_els_plogi(), when lpfc_els_retry() returns a failure code, the
driver is calling the state machine with a device remove event because the
remote port is not currently registered with the SCSI or NVMe
transports. The issue is that on a PLOGI "collision" the driver correctly
aborts the solicited PLOGI and allows the unsolicited PLOGI to complete the
process, but this process is interrupted with a device_rm event.
Introduce logic in the PLOGI completion to capture the PLOGI collision
event and jump out of the routine. This will avoid removal of the node.
If there is no collision, the normal node removal will occur.
Fixes: 52edb2caf6 ("scsi: lpfc: Remove ndlp when a PLOGI/ADISC/PRLI/REG_RPI ultimately fails")
Cc: <stable@vger.kernel.org> # v5.11+
Link: https://lore.kernel.org/r/20210514195559.119853-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An 'unexpected timeout' message may be seen in a point-2-point topology.
The message occurs when a PLOGI is received before the driver is notified
of FLOGI completion. The FLOGI completion failure causes discovery to be
triggered for a second time. The discovery timer is restarted but no new
discovery activity is initiated, thus the timeout message eventually
appears.
In point-2-point, when discovery has progressed before the FLOGI completion
is processed, it is not a failure. Add code to FLOGI completion to detect
that discovery has progressed and exit the FLOGI handling (noop'ing it).
Link: https://lore.kernel.org/r/20210514195559.119853-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While testing NPIV and watching logins and used RPI levels, it was seen the
used RPI count was much higher than the number of remote ports discovered.
Code inspection showed that remote port removals on any NPIV instance are
releasing the RPI, but not performing an UNREG_RPI with the adapter thus
the reference counting never fully drops and the RPI is never fully
released. This was happening on NPIV nodes due to a log of fabric ELS's to
fabric addresses. This lack of UNREG_RPI was introduced by a prior node
rework patch that performed the UNREG_RPI as part of node cleanup.
To resolve the issue, do the following:
- Restore the RPI release code, but move the location to so that it is in
line with the new node cleanup design.
- NPIV ports now release the RPI and drop the node when the caller sets
the NLP_RELEASE_RPI flag.
- Set the NLP_RELEASE_RPI flag in node cleanup which will trigger a
release of RPI to free pool.
- Ensure there's an UNREG_RPI at LOGO completion so that RPI release is
completed.
- Stop offline_prep from skipping nodes that are UNUSED. The RPI may
not have been released.
- Stop the default RPI handling in lpfc_cmpl_els_rsp() for SLI4.
- Fixed up debugfs RPI displays for better debugging.
Fixes: a70e63eee1 ("scsi: lpfc: Fix NPIV Fabric Node reference counting")
Link: https://lore.kernel.org/r/20210514195559.119853-2-jsmart2021@gmail.com
Cc: <stable@vger.kernel.org> # v5.11+
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Clean up minor issues spotted by tools and code review:
- Spelling Errors
- Spurious characters and errors in function headers
- nvme_info wqerr and err fields source data reversed
- Extraneous new line in log message 0466
- Spacing error in log message 0109
- Messages 0140 and 0141 have portname and nodename reversed
- Incorrect function labelling in comment
Link: https://lore.kernel.org/r/20210412013127.2387-13-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During target port swap, the swap logic ignores the DROPPED flag in the
nodes. As a node then moves into the UNUSED state, the reference count will
be dropped. If a node is later reused and moved out of the UNUSED state, an
access can result in a use-after-free assert.
Fix by having the port swap logic propagate the DROPPED flag when switching
nodes. This will avoid reference from being dropped.
Link: https://lore.kernel.org/r/20210412013127.2387-8-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During target port-swap testing with link flips, the initiator could
encounter PRLI errors. If the target node disappears permanently, the ndlp
is found stuck in UNUSED state with ref count of 1. The rmmod of the driver
will hang waiting for this node to be freed.
While handling a link error in PRLI completion path, the code intends to
skip triggering the discovery state machine. However this is causing the
final reference release path to be skipped. This causes the node to be
stuck with ref count of 1
Fix by ensuring the code path triggers the device removal event on the node
state machine.
Link: https://lore.kernel.org/r/20210412013127.2387-6-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Remove hbalock dependency for lpfc_abts_els_sgl_list and
lpfc_abts_nvmet_ctx_list. The lists are adaquately synchronized with the
sgl_list_lock and abts_nvmet_buf_list_lock.
Link: https://lore.kernel.org/r/20210412013127.2387-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Call traces are being seen that result from a nodelist structure ref
counting error. They are typically seen after transmission of an LS_RJT ELS
response.
Aged code in lpfc_cmpl_els_rsp() calls lpfc_nlp_not_used() which, if the
ndlp reference count is exactly 1, will decrement the reference count.
Previously lpfc_nlp_put() was within lpfc_els_free_iocb(), and the 'put'
within the free would only be invoked if cmdiocb->context1 was not NULL.
Since the nodelist structure reference count is decremented when exiting
lpfc_cmpl_els_rsp() the lpfc_nlp_not_used() calls are no longer required.
Calling them is causing the reference count issue.
Fix by removing the lpfc_nlp_not_used() calls.
Link: https://lore.kernel.org/r/20210412013127.2387-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Code inspection revealed stale comments in function headers for functions
that call lpfc_prep_els_iocb(). Changes in ndlp reference counting were not
reflected in function headers.
Update the stale comments in function headers to more accurately indicate
ndlp reference counting.
Link: https://lore.kernel.org/r/20210301171821.3427-21-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While testing target port swap test with ADISC enabled, several nodes
remain in UNUSED state. These nodes are never freed and rmmod hangs for
long time before finising with "0233 Nodelist not empty" error.
During PLOGI completion lpfc_plogi_confirm_nport() looks for existing nodes
with same WWPN. If found, the existing node is used to continue discovery.
The node on which plogi was performed is freed. When ADISC is enabled, an
ADISC els request is triggered in response to an RSCN. It's possible that
the ADISC may be rejected by the remote port causing the ADISC completion
handler to clear the port and node name in the node. If this occurs, if a
PLOGI is received it causes a node lookup based on wwpn to now fail,
causing the port swap logic to kick in which allocates a new node and swaps
to it. This effectively orphans the original node structure.
Fix the situation by detecting when the lookup fails and forgo the node
swap and node allocation by using the node on which the PLOGI was issued.
Link: https://lore.kernel.org/r/20210301171821.3427-15-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When connected in pt2pt mode, there is a scenario where the remote port
significantly delays sending a response to our FLOGI, but acts on the FLOGI
it sent us and proceeds to PLOGI/PRLI. The FLOGI ends up timing out and
kicks off recovery logic. End result is a lot of unnecessary state changes
and lots of discovery messages being logged.
Fix by terminating the FLOGI and noop'ing its completion if we have already
accepted the remote ports FLOGI and are now processing PLOGI.
Link: https://lore.kernel.org/r/20210301171821.3427-13-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An unlikely error exit path from lpfc_els_retry() returns incorrect status
to a caller, erroneously indicating that a retry has been successfully
issued or scheduled.
Change error exit path to indicate no retry.
Link: https://lore.kernel.org/r/20210301171821.3427-12-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There are several code paths where the following sequence occurs:
- An ndlp pointer is assigned to an iocb via a nlp_get()
- An attempt is made to issue the iocb, but it fails
- The failure case does a put on the ndlp then calls lpfc_els_free_iocb()
The put may free the ndlp structure, but the els_free_iocb may reference
the now-stale ndlp pointer and cause a crash.
Fix by ensuring that the lpfc_els_free_iocb() occurs before the
lpfc_nlp_put().
While fixing, refactor the code to better ensure this calling sequence.
Link: https://lore.kernel.org/r/20210301171821.3427-11-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
It is possible to call lpfc_issue_els_plogi() passing a did for which no
matching ndlp is found. A call is then made to lpfc_prep_els_iocb() with a
null pointer to a lpfc_nodelist structure resulting in a null pointer
dereference.
Fix by returning an error status if no valid ndlp is found. Fix up comments
regarding ndlp reference counting.
Link: https://lore.kernel.org/r/20210301171821.3427-10-jsmart2021@gmail.com
Fixes: 4430f7fd09 ("scsi: lpfc: Rework locations of ndlp reference taking")
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Driver crashed in lpfc_debugfs_disc_trc() due to null ndlp pointer. In
some calling cases, the ndlp is null and the did is looked up.
Fix by using the local did variable that is set appropriately based on ndlp
value.
Link: https://lore.kernel.org/r/20210301171821.3427-7-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After an initial successful FLOGI into the switch, if a subsequent FLOGI
fails the driver crashed accessing a node struct. On FLOGI error, the flogi
completion logic triggers the final dereference on the node structure
without checking if it is registered with a backend. The devloss logic is
triggered after node is freed leading to the access of freed node.
Fix by adjusting the error path to not take the final dereferece if there
is an outstanding transport registration. Let the transport devloss call
remove the final reference.
Link: https://lore.kernel.org/r/20210301171821.3427-6-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Whenever an RRQ needs to be triggered, the DID from the node structure and
node pointer are stored in the RRQ data structure and the RRQ is scheduled
for later transmission. However, at the point in time that the timer
triggers, there's no validation on the node pointer. Reference counters may
have freed the structure. Additionally the DID in the node may no longer be
valid.
Fix by not tracking the node pointer in the RRQ, only the DID. At the time
of the timer expiration, look up the node with the did and if present, send
the RRQ. If no node exists, no need to send the RRQ.
Link: https://lore.kernel.org/r/20210301171821.3427-5-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Several errors have occurred where the adapter stops or fails but does not
raise the register values for the driver to detect failure. Thus driver is
unaware of the failure. The failure typically results in I/O timeouts, the
I/O timeout handler failing (after several seconds), and the error handler
escalating recovery policy and resulting in more errors. Eventually, the
driver is in a position where things have spiraled and it can't do recovery
because other recovery ops are still outstanding and it becomes unusable.
Resolve the situation by having the I/O timeout handler (actually a els,
SCSI I/O, NVMe ls, or NVMe I/O timeout), in addition to aborting the I/O,
perform a mailbox command and look for a response from the hardware. If
the mailbox command fails, it will mark the adapter offline and then invoke
the adapter reset handler to clean up.
The new I/O timeout test will be limited to a test every 5s. If there are
multiple I/O timeouts concurrently, only the 1st I/O timeout will generate
the mailbox command. Further testing will only occur once a timeout occurs
after a 5s delay from the last mailbox command has expired.
Link: https://lore.kernel.org/r/20210104180240.46824-14-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Target reset is failed by the target as an invalid command.
The Target Reset TMF has been obsoleted in T10 for a while, but continues
to be used. On (newer) devices, the TMF is rejected causing the reset
handler to escalate to adapter resets.
Fix by having Target Reset TMF rejections be translated into a LOGO and
re-PLOGI with the target device. This provides the same semantic action
(although, if the device also supports nvme traffic, it will terminate nvme
traffic as well - but it's still recoverable).
Link: https://lore.kernel.org/r/20210104180240.46824-10-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Under some pt2pt situations, the other end of the link may issue a LOGO
after successfully completing PLOGI and assigning addresses to the port.
Thus the driver may attempt a new PLOGI to re-create the login, but the
LOGO handling cleared the address back to 0. Once this happens, the other
end, which may be address 0, gets all confused and this cannot be resolved
without an administrative action to bounce the link.
Fix by assuming that address assignment only occurs on the 1st PLOGI after
link up, and regardless of login state, the address assignment sticks. The
FC standards aren't particularly clear in this situation (it only describes
initial PLOGI), but there is nothing that contradicts this and behaviors on
the devices tested appears to conform to the understanding.
Thus, don't reset the port address to 0 as part of LOGO handling. Port
addresses will only reset on link down.
Link: https://lore.kernel.org/r/20210104180240.46824-2-jsmart2021@gmail.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
smatch correctly called out a logic error with accessing a pointer after
checking it for null:
drivers/scsi/lpfc/lpfc_els.c:2043 lpfc_cmpl_els_plogi()
error: we previously assumed 'ndlp' could be null (see line 1942)
Adjust the exit point to avoid the trace printf ndlp reference. A trace
entry was already generated when the ndlp was checked for null.
Link: https://lore.kernel.org/r/20201130181226.16675-1-james.smart@broadcom.com
Fixes: 4430f7fd09 ("scsi: lpfc: Rework locations of ndlp reference taking")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Remove local variables that are set but not used.
Link: https://lore.kernel.org/r/20201119203340.121819-1-james.smart@broadcom.com
Fixes: c6adba1501 ("scsi: lpfc: Rework remote port lock handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently there is an error return path that neglects to free the
allocation for lcb_context. Fix this by adding a new error free exit path
that kfree's lcb_context before returning. Use this new kfree exit path in
another exit error path that also kfree's the same object, allowing a line
of code to be removed.
Link: https://lore.kernel.org/r/20201118141314.462471-1-colin.king@canonical.com
Fixes: 4430f7fd09 ("scsi: lpfc: Rework locations of ndlp reference taking")
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Addresses-Coverity: ("Resource leak")
This patch reworks the abort interfaces such that SLI-3 retains the
iocb-based formatting and completions and SLI-4 now uses native WQEs and
completion routines.
The following changes are made:
- The code is refactored from a confusing 2 routine sequence of
xx_abort_iotag_issue(), which creates/formats and abort cmd, and
xx_issue_abort_tag(), which then issues and handles the completion of
the abort cmd - into a single interface of xx_issue_abort_iotag(). The
new interface will determine whether SLI-3 or SLI-4 and then call the
appropriate handler. A completion handler can now be specified to
address the differences in completion handling. Note: original code is
all iocb based, with SLI-4 converting to SLI-3 for the SCSI/ELS path,
and NVMe natively using wqes.
- The SLI-3 side is refactored:
The older iocb-base lpfc_sli_issue_abort_iotag() routine is combined
with the logic of lpfc_sli_abort_iotag_issue() as well as the
iocb-specific code in lpfc_abort_handler() and lpfc_sli_abort_iocb() to
create the new single SLI-3 abort routine that formats and issues the
iocb.
- The SLI-4 side is refactored and added to:
The native WQE abort code in NVMe is moved to the new SLI-4
issue_abort_iotag() routine. Items in SCSI that set fields not set by
NVMe is migrated into the new routine. Thus the routine supports NVMe
and SCSI initiators. The nvmet block (target) formats the abort slightly
different (like the old NVMe initiator) thus it has its own prep routine
stolen from NVMe initiator and it retains the current code it has for
issuing the WQE (does not use the commonized routine the initiators
do). SLI-4 completion handlers were also added.
- lpfc_abort_handler now becomes a wrapper that determines whether
SLI-3 or SLI-4 and calls the proper abort handler.
Link: https://lore.kernel.org/r/20201115192646.12977-16-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While testing initiator-side cable swaps with NPIV, oops occur. The
reference counts for the Fabric nodes on the NPIV vports isn't balanced,
resulting in premature node removal.
The following fixes were made:
- Removed the FC_LBIT check in lpfc_linkup_port. This removed the special
case for vports that didn't have them clean up just like the physical
port.
- Removed the unreg_rpi call in lpfc_cleanup_node. In this section, the
node is being removed in the context of a reference count release and a
mailbox command can't be issued at this point.
- Remove special case handling in the default mailbox completion handler
that allowed the skipping of a node reference. Now, reference counting
always requires the removal of the reference.
- Move the location of the DEVICE_RM event is done during LOGO handling as
the driver has additional work to do on the ndlp before puts/releases
can be performed.
Link: https://lore.kernel.org/r/20201115192646.12977-10-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While testing NPIV and link bounces, the vport would not show a fabric node
for the F_Port, would not transition into NPR state during a link fault, or
leave the FDMI node untouched during error injection. Cause for this was
determined to be an inconsistent manner in which F_Port, Nameserver, and
FDMI controller nodes were created and linked. In some cases, the nodes
would never be unregistered from the transport, leaving references
active. In other cases, the fabric nodes may register with the transport
multiple times while still registered.
The following changes were made:
- Fix the FDISC issue routine, which starts vport (re)creation, to mark
the F_Port as a fabric node (NLP_FABRIC) and allow the F_Port node to
fully be created and show up in the node list.
- When remote ports are cleaned up on vport termination, cleanup the
nameserver and FDMI controller nodes on the vport so they unregister
from the transport.
- On link bounces, don't exclude the NPIV Fabric remote ports from
transitioning to the NPR state, allowing them to avoid re-registration
if already registered.
Link: https://lore.kernel.org/r/20201115192646.12977-9-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a target swap happens, under certain conditions the node sends a
LOGO. The unsolicited ELS logic responds with a reject. The logic may
allocate a new node to handle this. Afterward, the new nodes are dropped
incorrectly leaving them in a mis-matched state and refcounting causes a
use-after-free situation leading to a crash.
It is also possible that the unsolicited els handling finds a node which is
in an UNUSED state. The handling moves these nodes to NPR state with a
refcount of 1. Although the end of the discovery logic assumes a final put
will free such a node, there are codes paths which could increment the
reference count, thus the node is in NPR state and not released.
Eventually this mismatch in state and refcount leads to premature release
of the node causing a crash.
Fix by always using the discovery engine DEVICE RM event to decrement and
release the nodes (rather than explicit code that tried to do it before).
This will take care of moving the node to the UNUSED state and then removes
the final ref count. If there is a trigger to reuse this node, the
transition from the UNUSED state clearly indicates that the initial
reference is then incremented and use can continue.
Link: https://lore.kernel.org/r/20201115192646.12977-8-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a PLOGI/ADISC/PRLI/REG_RPI fails, the node remains in the nodelist in
that state. Although the driver now frees a node when the ref count goes
to zero, in this case the ref cnt doesn't reach zero because there isn't a
mechanism to release the final reference. Discovery just stops.
Fix by calling the node discovery state machine DEVICE_RM event whenever
one of these commands fail. This will remove the final reference count and
trigger node release.
Link: https://lore.kernel.org/r/20201115192646.12977-7-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently the discovery layers within the driver use the SCSI midlayer
host_lock to access node-specific structures. This can contend with the I/O
path and is too coarse of a lock.
Rework the driver so that it uses a lock specific to the remote port node
structure when accessing the structure contents. A few of the changes
brought out spots were some slightly reorganized routines worked better.
Link: https://lore.kernel.org/r/20201115192646.12977-6-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Due to bug history and code review, the node reference counting approach in
the driver isn't implemented consistently with how the scsi and nvme
transport perform registrations and unregistrations and their callbacks.
This resulted in many bad/stale node pointers.
Reword the driver so that reference handling is performed as follows:
- The initial node reference is taken on structure allocation
- Take a reference on any add/register call to the transport
- Remove a reference on any delete/unregister call to the transport
- After the node has fully removed from both the SCSI and NVMEe transports
(dev_loss_callbacks have called back) call the discovery engine
DEVICE_RM event which will remove the final reference and release the
node structure.
- Alter dev_loss handling when a vport or base port is unloading.
- Remove the put_node handling - no longer needed.
- Rewrite the vport_delete handling on reference counts. Part of this
effort was driven from the FDISC not registering with the transport and
disrupting the model for node reference counting.
- Deleted lpfc_nlp_remove. Pushed it's remaining ops into
lpfc_nlp_release.
- Several other small code cleanups.
Link: https://lore.kernel.org/r/20201115192646.12977-5-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The lpfc driver is calling get_device and put_device on scsi_fc_transport
device structure. When this code was removed, the driver triggered an oops
in "scsi_is_host_dev" when the first SCSI target was unregistered from the
transport.
The reason the calls were necessary is that the driver is calling
scsi_remove_host too early, before the target rports are unregistered and
the scsi devices disconnected from the scsi_host. The fc_host was torn
down during fc_remove_host.
Fix by moving the lpfc_pci_remove_one_s3/s4 calls to scsi_remove_host to
after the nodes are cleaned up. Remove the get_device and put_device calls
and the supporting code.
Link: https://lore.kernel.org/r/20201115192646.12977-4-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Now that the driver has gone to a normal ref interface (with no odd logic)
the discovery logic needs to be updated to reworked so that it properly
takes references when it should and give them up when it should.
Rework the driver for the following get/put model:
- Move gets to just before an I/O is issued. Add gets for places where an
I/O was issued without one.
- Ensure that failures from lpfc_nlp_get() are handled by the driver.
- Check and fix the placement of lpfc_nlp_puts relative to io completions.
Note: some of these paths may not release the reference on the exact io
completion as the reference is held as the code takes another step in
the discovery thread and which may cause another io to be issued.
- Rearrange some code for error processing and calling lpfc_nlp_put.
- Fix some places of incorrect reference freeing that was causing the
premature releasing of the structure.
- Nvmet plogi handling performs unreg_rpi's. The reference counts were
unbalanced resulting in premature node removal. In some cases this
caused loss of node discovery. Corrected the reftaking around nvmet
plogis.
Nodes that experience devloss now get released from the node list now that
there is a proper reference taking.
Link: https://lore.kernel.org/r/20201115192646.12977-3-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a remote port is disconnected and disappears, its node structure
(ndlp) stays allocated and on a vport node list. While on the list it can
be matched, thus requires validation checks on state to be added in
numerous code paths. If the node comes back, its possible for there to be
multiple node structures for the same device on the vport node list. There
is no reason to keep the node structure around after it is no longer in
existence, and the current implementation creates problems for itself
(multiple nodes) and lots of unnecessary code for state validation.
Additionally, the reference taking on the node structure didn't follow the
normal model used by the kernel kref api. It included lots of odd logic to
match state with reference count. The combination of this odd logic plus
the way it was implicitly used in the discovery engine made its reference
taking implementation suspect and extremely hard to follow.
Change the driver such that the reference taking routines are now normal
ref increments/decrements and callout on refcount=0.
With this in place, the rework can be done such that the node structure is
fully removed and deallocated when the remote port no longer exists and all
references are removed. This removal logic, and the basic ref counting are
intrically tied, thus in a single patch.
Link: https://lore.kernel.org/r/20201115192646.12977-2-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Created new attribute lpfc_enable_mi, which by default is enabled.
Add command definition bits for SLI-4 parameters that recognize whether the
adapter has MIB information support and what revision of MIB data. Using
the adapter information, register vendor-specific MIB support with FDMI.
The registration will be done every link up.
During FDMI registration, encountered a couple of errors when reverting to
FDMI rev1. Code needed to exist once reverting. Fixed these.
Link: https://lore.kernel.org/r/20201020202719.54726-8-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Eleven fixes, mostly in drivers or minor fixes in driver related
infrastructure libraries (target, libfc and libsas). Most of the bugs
fixed only show up under rare circumstances, the exception being the
endianness problem in qla2xxx which is used as a device on some sparc
systems.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCX1egVyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishShgAP9McHKn
9T/m2mAjBr8IMZRlD4Y6nC+ToxDdRDsYT6iAOQD/ZO1FajxEcGC4xEtWznvXtqGk
H1Lfr/ta19GSEPqaZ44=
=rDo9
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Eleven fixes, mostly in drivers or minor fixes in driver related
infrastructure libraries (target, libfc and libsas).
Most of the bugs fixed only show up under rare circumstances, the
exception being the endianness problem in qla2xxx which is used as a
device on some sparc systems"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mpt3sas: Don't call disable_irq from IRQ poll handler
scsi: megaraid_sas: Don't call disable_irq from process IRQ poll
scsi: target: iscsi: Fix hang in iscsit_access_np() when getting tpg->np_login_sem
scsi: libsas: Set data_dir as DMA_NONE if libata marks qc as NODATA
scsi: target: iscsi: Fix data digest calculation
scsi: lpfc: Update lpfc version to 12.8.0.4
scsi: lpfc: Extend the RDF FPIN Registration descriptor for additional events
scsi: lpfc: Fix FLOGI/PLOGI receive race condition in pt2pt discovery
scsi: lpfc: Fix setting IRQ affinity with an empty CPU mask
scsi: qla2xxx: Fix regression on sparc64
scsi: libfc: Fix for double free()
scsi: pm8001: Fix memleak in pm8001_exec_internal_task_abort
Currently the driver registers for Link Integrity events only.
This patch adds registration for the following FPIN types:
- Delivery Notifications
- Congestion Notification
- Peer Congestion Notification
Link: https://lore.kernel.org/r/20200828175332.130300-4-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver is unable to successfully login with remote device. During pt2pt
login, the driver completes its FLOGI request with the remote device having
WWN precedence. The remote device issues its own (delayed) FLOGI after
accepting the driver's and, upon transmitting the FLOGI, immediately
recognizes it has already processed the driver's FLOGI thus it transitions
to sending a PLOGI before waiting for an ACC to its FLOGI.
In the driver, the FLOGI is received and an ACC sent, followed by the PLOGI
being received and an ACC sent. The issue is that the PLOGI reception
occurs before the response from the adapter from the FLOGI ACC is
received. Processing of the PLOGI sets state flags to perform the REG_RPI
mailbox command and proceed with the rest of discovery on the port. The
same completion routine used by both FLOGI and PLOGI is generic in
nature. One of the things it does is clear flags, and those flags happen to
drive the rest of discovery. So what happened was the PLOGI processing set
the flags, the FLOGI ACC completion cleared them, thus when the PLOGI ACC
completes it doesn't see the flags and stops.
Fix by modifying the generic completion routine to not clear the rest of
discovery flag (NLP_ACC_REGLOGIN) unless the completion is also associated
with performing a mailbox command as part of its handling. For things such
as FLOGI ACC, there isn't a subsequent action to perform with the adapter,
thus there is no mailbox cmd ptr. PLOGI ACC though will perform REG_RPI
upon completion, thus there is a mailbox cmd ptr.
Link: https://lore.kernel.org/r/20200828175332.130300-3-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
With port bounce/address swaps and timing between initiator GID queries vs
remote port FC4 support registrations, the driver may be in a situation
where it sends PRLIs for both FCP and NVME even though the target may not
support one of the protocols. In this case, the remote port will reject the
PRLI and usually indicate it does not support the request. However, the
driver currently ignores the status of the failure and immediately retries
the PRLI, which is pointless. In the case of this one remote port, the
reception of the PRLI retry caused it to decide to send a LOGO. The LOGO
restarted the process and the same results happened. It made the remote
port undiscoverable to either protocol.
Add logic to detect the non-support status and not attempt the retry
of the PRLI.
Link: https://lore.kernel.org/r/20200803210229.23063-6-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fixes the following W=1 kernel build warning(s):
drivers/scsi/lpfc/lpfc_els.c:3619: warning: Function parameter or member 't' not described in 'lpfc_els_retry_delay'
drivers/scsi/lpfc/lpfc_els.c:3619: warning: Excess function parameter 'ptr' description in 'lpfc_els_retry_delay'
drivers/scsi/lpfc/lpfc_els.c:4877: warning: Function parameter or member 'rejectError' not described in 'lpfc_els_rsp_reject'
drivers/scsi/lpfc/lpfc_els.c:7900: warning: Function parameter or member 't' not described in 'lpfc_els_timeout'
drivers/scsi/lpfc/lpfc_els.c:7900: warning: Excess function parameter 'ptr' description in 'lpfc_els_timeout'
drivers/scsi/lpfc/lpfc_els.c:8272: warning: Function parameter or member 'payload' not described in 'lpfc_send_els_event'
drivers/scsi/lpfc/lpfc_els.c:8272: warning: Excess function parameter 'cmd' description in 'lpfc_send_els_event'
drivers/scsi/lpfc/lpfc_els.c:8355: warning: Function parameter or member 'tlv' not described in 'lpfc_els_rcv_fpin_li'
drivers/scsi/lpfc/lpfc_els.c:8355: warning: Excess function parameter 'lnk_not' description in 'lpfc_els_rcv_fpin_li'
drivers/scsi/lpfc/lpfc_els.c:9688: warning: Function parameter or member 't' not described in 'lpfc_fabric_block_timeout'
drivers/scsi/lpfc/lpfc_els.c:9688: warning: Excess function parameter 'ptr' description in 'lpfc_fabric_block_timeout'
Link: https://lore.kernel.org/r/20200723122446.1329773-2-lee.jones@linaro.org
Cc: James Smart <james.smart@broadcom.com>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The current logging methods typically end up requesting a reproduction with
a different logging level set to figure out what happened. This was mainly
by design to not clutter the kernel log messages with things that were
typically not interesting and the messages themselves could cause other
issues.
When looking to make a better system, it was seen that in many cases when
more data was wanted was when another message, usually at KERN_ERR level,
was logged. And in most cases, what the additional logging that was then
enabled was typically. Most of these areas fell into the discovery machine.
Based on this summary, the following design has been put in place: The
driver will maintain an internal log (256 elements of 256 bytes). The
"additional logging" messages that are usually enabled in a reproduction
will be changed to now log all the time to the internal log. A new logging
level is defined - LOG_TRACE_EVENT. When this level is set (it is not by
default) and a message marked as KERN_ERR is logged, all the messages in
the internal log will be dumped to the kernel log before the KERN_ERR
message is logged.
There is a timestamp on each message added to the internal log. However,
this timestamp is not converted to wall time when logged. The value of the
timestamp is solely to give a crude time reference for the messages.
Link: https://lore.kernel.org/r/20200630215001.70793-14-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In order to create or activate a new node, lpfc_els_unsol_buffer() invokes
lpfc_nlp_init() or lpfc_enable_node() or lpfc_nlp_get(), all of them will
return a reference of the specified lpfc_nodelist object to "ndlp" with
increased refcnt.
When lpfc_els_unsol_buffer() returns, local variable "ndlp" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
lpfc_els_unsol_buffer(). When "ndlp" in DEV_LOSS, the function forgets to
decrease the refcnt increased by lpfc_nlp_init() or lpfc_enable_node() or
lpfc_nlp_get(), causing a refcnt leak.
Fix this issue by calling lpfc_nlp_put() when "ndlp" in DEV_LOSS.
Link: https://lore.kernel.org/r/1590416184-52592-1-git-send-email-xiyuyang19@fudan.edu.cn
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During code reviews several instances of duplicate module unloading checks
were found.
Remove the duplicate checks.
Link: https://lore.kernel.org/r/20200421203354.49420-1-jsmart2021@gmail.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There is a spelling mistake in a lpfc_printf_vlog info message. Fix it.
[mkp: fix spelling mistake in commit description]
Link: https://lore.kernel.org/linux-scsi/20200221154841.77791-1-colin.king@canonical.com
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch modifies lpfc to register for Link Integrity events via the use
of an RDF ELS and to perform Link Integrity FPIN logging.
Specifically, the driver was modified to:
- Format and issue the RDF ELS immediately following SCR registration.
This registers the ability of the driver to receive FPIN ELS.
- Adds decoding of the FPIN els into the received descriptors, with
logging of the Link Integrity event information. After decoding, the ELS
is delivered to the scsi fc transport to be delivered to any user-space
applications.
- To aid in logging, simple helpers were added to create enum to name
string lookup functions that utilize the initialization helpers from the
fc_els.h header.
- Note: base header definitions for the ELS's don't populate the
descriptor payloads. As such, lpfc creates it's own version of the
structures, using the base definitions (mostly headers) and additionally
declaring the descriptors that will complete the population of the ELS.
Link: https://lore.kernel.org/r/20200210173155.547-3-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Update copyrights to 2020 for files modified in the 12.6.0.4 patch set.
Link: https://lore.kernel.org/r/20200128002312.16346-13-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There was report of an odd "Fix me..." log message, which was tracked down
to the lpfc_els_rcv_rps() routine. This was in handling of a very old and
obsolete ELS - Read Port Status. The RPS ELS was defined in FC-LS-1, but
deprecated in FC-LS-2, and removed from all later FC-LS revisions. It was
replaced by the Read Diagnostic Parameters (RDP) ELS and the Link Error
Status Block descriptor.
There should be no support for the RSP ELS. Remove support from driver.
Link: https://lore.kernel.org/r/20200128002312.16346-9-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Coverity reported the following:
*** CID 101747: Null pointer dereferences (FORWARD_NULL)
/drivers/scsi/lpfc/lpfc_els.c: 4439 in lpfc_cmpl_els_rsp()
4433 kfree(mp);
4434 }
4435 mempool_free(mbox, phba->mbox_mem_pool);
4436 }
4437 out:
4438 if (ndlp && NLP_CHK_NODE_ACT(ndlp)) {
vvv CID 101747: Null pointer dereferences (FORWARD_NULL)
vvv Dereferencing null pointer "shost".
4439 spin_lock_irq(shost->host_lock);
4440 ndlp->nlp_flag &= ~(NLP_ACC_REGLOGIN | NLP_RM_DFLT_RPI);
4441 spin_unlock_irq(shost->host_lock);
4442
4443 /* If the node is not being used by another discovery thread,
4444 * and we are sending a reject, we are done with it.
Fix by adding a check for non-null shost in line 4438.
The scenario when shost is set to null is when ndlp is null.
As such, the ndlp check present was sufficient. But better safe
than sorry so add the shost check.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 101747 ("Null pointer dereferences")
Fixes: 2e0fef85e0 ("[SCSI] lpfc: NPIV: split ports")
CC: James Bottomley <James.Bottomley@SteelEye.com>
CC: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
CC: linux-next@vger.kernel.org
Link: https://lore.kernel.org/r/20191111230401.12958-3-jsmart2021@gmail.com
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During heavy RCN activity and log_verbose = 0 we see these messages:
2754 PRLI failure DID:521245 Status:x9/xb2c00, data: x0
0231 RSCN timeout Data: x0 x3
0230 Unexpected timeout, hba link state x5
This is due to delayed RSCN activity.
Correct by avoiding the timeout thus the messages by restarting the
discovery timeout whenever an rscn is received.
Filter PRLI responses such that severity depends on whether expected for
the configuration or not. For example, PRLI errors on a fabric will be
informational (they are expected), but Point-to-Point errors are not
necessarily expected so they are raised to an error level.
Link: https://lore.kernel.org/r/20191105005708.7399-5-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When debugging a recent discovery customer problem it was very hard to tell
what was happening with the existing discovery log messages. To fully debug
the issue additional log messages were necessary.
Add or extend log messages so that sufficient information is present for
debugging.
Link: https://lore.kernel.org/r/20191018211832.7917-16-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Log message conditional upon vport being NULL dereferences vport to
determine log verbose setting.
Changed to use lpfc_print_log which uses phba to determine the active log
verbose setting.
Fixes: 43bfea1bff ("scsi: lpfc: Fix coverity errors on NULL pointer checks")
Link: https://lore.kernel.org/r/20191018211832.7917-8-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
While reviewing the CT behavior, issues with spinlock_irq were seen. The
driver should be using spinlock_irqsave/irqrestore in the els flush
routine.
Changed to spinlock_irqsave/irqrestore.
Link: https://lore.kernel.org/r/20190922035906.10977-15-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After study, it was determined there was a double free of a CT iocb during
execution of lpfc_offline_prep and lpfc_offline. The prep routine issued
an abort for some CT iocbs, but the aborts did not complete fast enough for
a subsequent routine that waits for completion. Thus the driver proceeded
to lpfc_offline, which releases any pending iocbs. Unfortunately, the
completions for the aborts were then received which re-released the ct
iocbs.
Turns out the issue for why the aborts didn't complete fast enough was not
their time on the wire/in the adapter. It was the lpfc_work_done routine,
which requires the adapter state to be UP before it calls
lpfc_sli_handle_slow_ring_event() to process the completions. The issue is
the prep routine takes the link down as part of it's processing.
To fix, the following was performed:
- Prevent the offline routine from releasing iocbs that have had aborts
issued on them. Defer to the abort completions. Also means the driver
fully waits for the completions. Given this change, the recognition of
"driver-generated" status which then releases the iocb is no longer
valid. As such, the change made in the commit 296012285c is reverted.
As recognition of "driver-generated" status is no longer valid, this
patch reverts the changes made in
commit 296012285c ("scsi: lpfc: Fix leak of ELS completions on adapter reset")
- Modify lpfc_work_done to allow slow path completions so that the abort
completions aren't ignored.
- Updated the fdmi path to recognize a CT request that fails due to the
port being unusable. This stops FDMI retries. FDMI will be restarted on
next link up.
Link: https://lore.kernel.org/r/20190922035906.10977-14-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Coverity flagged several scenarios where checking of null pointer values
wasn't consistent.
Fix the code to that be consistent on checking.
Link: https://lore.kernel.org/r/20190922035906.10977-12-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
FC-NVMe-2 added support for sequence level error recovery in the FC-NVME
protocol. This allows for the detection of errors and lost frames and
immediate retransmission of data to avoid exchange termination, which
escalates into NVMeoFC connection and association failures. A significant
RAS improvement.
The driver is modified to indicate support for SLER in the NVMe PRLI is
issues and to check for support in the PRLI response. When both sides
support it, the driver will set a bit in the WQE to enable the recovery
behavior on the exchange. The adapter will take care of all detection and
retransmission.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Added code to support driver loopback with MDS Diagnostics. This style of
diagnostics passes frames from the fabric to the driver who then echo them
back out the link. SEND_FRAME WQEs are used to transmit the frames. Added
the SOF and EOF field location definitions for use by SEND_FRAME.
Also ensure that enable_mds_diags is a RW parameter.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In order to see real addresses, convert %p with %px for kernel addresses
and replace %p with %pf for functions.
While converting, standardize on "x%px" throughout (not %px or 0x%px).
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Running on Coverity produced the following errors:
- coding style (indentation)
- memset size mismatch errors
note: comment cases where it is purposely a mismatch
Fix the errors.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In situations where zoning is not being used, thus NVMe initiators see
other NVMe initiators as well as NVMe targets, a link bounce on an
initiator will cause the NVMe initiators to spew "6169" State Error
messages.
The driver is not qualifying whether the remote port is a NVMe targer or
not before calling the lpfc_nvme_rescan_port(), which validates the role
and prints the message if its only an NVMe initiator.
Fix by the following:
- Before calling lpfc_nvme_rescan_port() ensure that the node is a NVMe
storage target or a NVMe discovery controller.
- Clean up implementation of lpfc_nvme_rescan_port. remoteport pointer
will always be NULL if a NVMe initiator only. But, grabbing of
remoteport pointer should be done under lock to coincide with the
registering of the remote port with the fc transport.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
It's possible for the driver to initiate an FLOGI and before it completes,
another link down/up transition occurs requiring a new FLOGI. Currently,
nothing is done to abort/noop the older FLOGI request to the adapter, so if
this transition occurs and the FLOGI completion is received after the link
down/up transition, the driver may erroneously act on the older FLOGI. In
most cases, the adapter properly terminates/fails the FLOGI, but there is a
timing condition where the FLOGI may complete on the wire prior to the
transition, but the response may not be seen/processed by the driver before
the driver sees the link transition.
Fix by having the link down handler in the driver run through any
outstanding ELS's and change the completion handler of the ELS so that it
will be no-op'd and released.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This is mostly update of the usual drivers: qla2xxx, hpsa, lpfc, ufs,
mpt3sas, ibmvscsi, megaraid_sas, bnx2fc and hisi_sas as well as the
removal of the osst driver (I heard from Willem privately that he
would like the driver removed because all his test hardware has
failed). Plus number of minor changes, spelling fixes and other
trivia.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXSTl4yYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdcxAQDCJVbd
fPUX76/V1ldupunF97+3DTharxxbst+VnkOnCwD8D4c0KFFFOI9+F36cnMGCPegE
fjy17dQLvsJ4GsidHy8=
=aS5B
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly update of the usual drivers: qla2xxx, hpsa, lpfc, ufs,
mpt3sas, ibmvscsi, megaraid_sas, bnx2fc and hisi_sas as well as the
removal of the osst driver (I heard from Willem privately that he
would like the driver removed because all his test hardware has
failed). Plus number of minor changes, spelling fixes and other
trivia.
The big merge conflict this time around is the SPDX licence tags.
Following discussion on linux-next, we believe our version to be more
accurate than the one in the tree, so the resolution is to take our
version for all the SPDX conflicts"
Note on the SPDX license tag conversion conflicts: the SCSI tree had
done its own SPDX conversion, which in some cases conflicted with the
treewide ones done by Thomas & co.
In almost all cases, the conflicts were purely syntactic: the SCSI tree
used the old-style SPDX tags ("GPL-2.0" and "GPL-2.0+") while the
treewide conversion had used the new-style ones ("GPL-2.0-only" and
"GPL-2.0-or-later").
In these cases I picked the new-style one.
In a few cases, the SPDX conversion was actually different, though. As
explained by James above, and in more detail in a pre-pull-request
thread:
"The other problem is actually substantive: In the libsas code Luben
Tuikov originally specified gpl 2.0 only by dint of stating:
* This file is licensed under GPLv2.
In all the libsas files, but then muddied the water by quoting GPLv2
verbatim (which includes the or later than language). So for these
files Christoph did the conversion to v2 only SPDX tags and Thomas
converted to v2 or later tags"
So in those cases, where the spdx tag substantially mattered, I took the
SCSI tree conversion of it, but then also took the opportunity to turn
the old-style "GPL-2.0" into a new-style "GPL-2.0-only" tag.
Similarly, when there were whitespace differences or other differences
to the comments around the copyright notices, I took the version from
the SCSI tree as being the more specific conversion.
Finally, in the spdx conversions that had no conflicts (because the
treewide ones hadn't been done for those files), I just took the SCSI
tree version as-is, even if it was old-style. The old-style conversions
are perfectly valid, even if the "-only" and "-or-later" versions are
perhaps more descriptive.
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (185 commits)
scsi: qla2xxx: move IO flush to the front of NVME rport unregistration
scsi: qla2xxx: Fix NVME cmd and LS cmd timeout race condition
scsi: qla2xxx: on session delete, return nvme cmd
scsi: qla2xxx: Fix kernel crash after disconnecting NVMe devices
scsi: megaraid_sas: Update driver version to 07.710.06.00-rc1
scsi: megaraid_sas: Introduce various Aero performance modes
scsi: megaraid_sas: Use high IOPS queues based on IO workload
scsi: megaraid_sas: Set affinity for high IOPS reply queues
scsi: megaraid_sas: Enable coalescing for high IOPS queues
scsi: megaraid_sas: Add support for High IOPS queues
scsi: megaraid_sas: Add support for MPI toolbox commands
scsi: megaraid_sas: Offload Aero RAID5/6 division calculations to driver
scsi: megaraid_sas: RAID1 PCI bandwidth limit algorithm is applicable for only Ventura
scsi: megaraid_sas: megaraid_sas: Add check for count returned by HOST_DEVICE_LIST DCMD
scsi: megaraid_sas: Handle sequence JBOD map failure at driver level
scsi: megaraid_sas: Don't send FPIO to RL Bypass queue
scsi: megaraid_sas: In probe context, retry IOC INIT once if firmware is in fault
scsi: megaraid_sas: Release Mutex lock before OCR in case of DCMD timeout
scsi: megaraid_sas: Call disable_irq from process IRQ poll
scsi: megaraid_sas: Remove few debug counters from IO path
...
This patch updates RSCN receive processing to check for the remote
port being an NVME port, and if so, invoke the nvme_fc callback to
rescan the remote port. The rescan will generate a discovery udev
event.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This patch adds general RSCN support:
- The ability to transmit an RSCN to the port on the other end of
the link (regular port if pt2pt, or fabric controller if fabric).
- And general recognition of an RSCN ELS when an ELS is received.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Arun Easi <aeasi@marvell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Under heavy load the target stops responding, the drivers aborts
timeout and we start recovery by logging out of the target, but
the target is never logged into again.
In a point-to-point scenario, there were battling PLOGI's. When we
received a PLOGI request after having sent one, the driver cancels
the processing of the original plogi. However, the completion path
of the remaining plogi was coded to skip the reg_rpi that should
be happening on the 2nd plogi.
Correct by adding a simple pt2pt check such that the 2nd plogi isn't
skipped and the reg_login occurs.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
There was a missing qualification of a valid ndlp structure when calling to
send an RRQ for an abort. Add the check.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch adds support to recognize FPIN ELS's that are received. When
one is received, the fc transport will be called to handle the the FPIN.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch does not change any functionality but avoids that the compiler
complains about set-but-not-used variables when building with W=1.
Cc: James Smart <james.smart@broadcom.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch avoids that the compiler warns about missing fall-through
annotation when building with W=1.
Cc: James Smart <james.smart@broadcom.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
For files modified as part of 12.2.0.0 patches, update copyright to 2019
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The conversion to enable SCSI and NVME fc4 support ran into an issue with
NPIV support. With NVME, NPIV is not currently supported, but with SCSI it
was. The driver reverted to its lowest setting meaning NPIV with SCSI was
not allowed.
Convert the NPIV checks and implementation so that SCSI can continue to
allow NPIV support.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Addition of support for if_type=6 missed several checks for interface type,
resulting in the failure of several key management features such as
firmware dump and loopback testing.
Correct the checks on the if_type so that both SLI4 IF_TYPE's 2 and 6 are
supported.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This reverts commit 287aba2592.
We killed the bad firmware and this mod is no longer necessary.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The current discovery state machine the driver treated FLOGI oddly. When
point to point, an FLOGI is to be exchanged by the two ports, with the port
with the most significant WWN then proceeding with PLOGI. The
implementation in the driver was keyed to closely with "what have I sent",
not with what has happened between the two endpoints. Thus, it blatantly
would ACC an FLOGI, but reject PLOGI's until it had its FLOGI ACC'd. The
problem is - the sending of FLOGI may be delayed for some reason, or the
response to FLOGI held off by the other side. In the failing situation the
other side sent an FLOGI, which was ACC'd, then sent PLOGIs which were then
rjt'd until the retry count for the PLOGIs were exceeded and the port gave
up. The FLOGI may have been very late in transmit, or the response held off
until the PLOGIs failed. Given the other port had the higher WWN, no PLOGIs
would occur and communication stopped.
Correct the situation by changing the FLOGI handling. Defer any response to
an FLOGI until the driver has sent its FLOGI as well. Then, upon either
completion of the sent FLOGI, or upon sending an ACC to a received FLOGI
(which may be received before or just after FLOGI was sent). the driver
will act on who has the higher WWN. if the other port does, the driver will
noop any handling of an FLOGI response (if outstanding) and wait for PLOGI.
If the local port does, the driver will transition to sending PLOGI and
will noop any action on responding to an FLOGI (if not yet received).
Fortunately, to implement this, it only took another state flag and
deferring any FLOGI response if the FLOGI has yet to be transmit. All
subsequent actions were already in place.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In some link initialization sequences, the fw generates an erroneous FLOGI
payload to the driver without an intervening link bounce. The driver, when
it sees a 2nd FLOGI without an intervening link bounce, automatically
performs a link bounce. In this, the link bounce causes the situate to
repeat and in a nasty loop of link bounces.
Resolve the issue by validating the FLOGI payload. The erroneous FLOGI will
contain VVL signatures that are not normal. When the driver sees these, it
will simply reject the flogi rather than bouncing the link. The reject is
consumed within the firmware.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Two initiator ports were cable swapped and after swap both went down. The
driver internally swaps the nlp nodes based on matching node wwn's but not
the same nport id as before. After detecting a change in the nodes RPI, the
driver sends an UNREG_RPI command and clears the NLP_RPI_REGISTERED flag,
then swaps the node information with the other node. But the other node's
NLP_RPI_REGISTERED flag is also cleared, but it is done so without an
UNREG_RPI being sent, which causes the later REG_RPI for that other node to
fail as the hardware believes its still registered.
Additionally, if the node swap occurred while the two nodes had PLOGI's in
flight, the fc4_types weren't properly getting swapped such that when the
PLOGIs commpleted and PRLI's were then sent, the PRLI's acted on bad
protocol types so the PRLI was for the wrong protocol. NVME devices saw
SCSI FCP PRLIs and vice versa.
Clean up the node swap so that the NLP_RPI_REGISTERED flag is handled
properly.
Fix the handling of the fc4_types when the nodes are swapped as well
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Driver is hitting null pring pointers in lpfc_do_work().
Pointer assignment occurs based on SLI-revision. If recovering after an
error, its possible the sli revision for the port was cleared, making the
lpfc_phba_elsring() not return a ring pointer, thus the null pointer.
Add SLI revision checking to lpfc_phba_elsring() and status checking to all
callers.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver is getting hit with 100s of RSCNs during remote port address
changes. Each of those RSCN's ends up generating UNREG_RPI and REG_PRI
mailbox commands. The discovery engine within the driver doesn't wait for
the mailbox command completions. Instead it sets state flags and moves
forward. At some point, there's a massive backlog of mailbox commands which
take time for the adapter to process. Additionally, it appears there were
duplicate events from the switch so the driver generated duplicate mailbox
commands for the same remote port. During this window, failures on PLOGI
and PRLI ELS's are see as the adapter is rejecting them as they are for
remote ports that still have pending mailbox commands.
Streamline the discovery engine so that PLOGI log checks for outstanding
UNREG_RPIs and defer the processing until the commands complete. This
better synchronizes the ELS transmission vs the RPI registrations.
Filter out multiple UNREG_RPIs being queued up for the same remote port.
Beef up log messages in this area.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver data structure for managing a mailbox command contained two
context fields. Unfortunately, the context were considered "generic" to be
used at the whim of the command code. Of course, one section of code used
fields this way, while another did it that way, and eventually there were
mixups.
Refactored the structure so that the generic contexts become a node context
and a buffer context and all code standardizes on their use.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add trunking support to the driver. Trunking is found on more recent
asics. In general, trunking appears as a single "port" to the driver
and overall behavior doesn't differ. Link speed is reported as an
aggregate value, while link speed control is done on a per-physical
link basis with all links in the trunk symmetrical. Some commands
returning port information are updated to additionally provide
trunking information. And new ACQEs are generated to report physical
link events relative to the trunk.
This patch contains the following modifications:
- Added link speed settings of 128GB and 256GB.
- Added handling of trunk-related ACQEs, mainly logging and trapping
of physical link statuses.
- Added additional bsg interface to query trunk state by applications.
- Augment link_state sysfs attribtute to display trunk link status
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The switches seem to respond faster to GID_PT vs GID_FT NameServer
queries. Add support for GID_PT to be used over GID_FT to enable
faster storage failover detection. Includes addition of new module
parameter to select between GID_PT and GID_FT (GID_FT is default).
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An address change for a remote port cause PRLI for the wrong protocol
to be sent. The node copy done in the discovery code skipped copying
the fc4 protocols supported as well.
Fix the copy logic for the address change. Beefed up log messages in
this area as well.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Testing a point-to-point topology and a case of re-FLOGI without
intervening link bouncing, showed an odd interaction with firmware and
a resulting scenario where the driver no longer probed after accepting
the new FLOGI.
Work around the firmware issue by issuing a link bounce if a FLOGI is
received after the link is already up and FLOGI's accepted.
While debugging the issue, realized that some debug traces should be
clarified to help in the future.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When LCB's are rejected, if beaconing was already in progress, the
Reason Code Explanation was not being set. Should have been set to
command in progress.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
On FCoE adapters, when running link bounce test in a loop, initiator
failed to login with switch switch and required driver reload to
recover. Switch reached a point where all subsequent FLOGIs would be
LS_RJT'd. Further testing showed the condition to be related to not
performing FCF discovery between FLOGI's.
Fix by monitoring FLOGI failures and once a repeated error is seen
repeat FCF discovery.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, PLOGI failures are infinitely delayed/retried. There have
been some fabric situations where the PLOGI's were to the nameserver
and it stopped responding. The retries would never clear up. A better
resolution in this situation is to retry a couple of times, then drop
the link and reinit. This brings back connectivity to the nameserver.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After a LOGO in response to an ABTS timeout, a PLOGI wasn't issued to
re-establish the login. An nlp_type check in the LOGO completion
handler failed to restart discovery for NVME targets. Revised the
nlp_type check for NVME as well as SCSI.
While reviewing the LOGO handling a few other issues were seen and
were addressed:
- Better lock synchronization around ndlp data types
- When the ABTS times out, unregister the RPI before sending the LOGO
so that all local exchange contexts are cleared and nothing received
while awaiting LOGO/PLOGI handling will be accepted.
- LOGO handling optimized to:
Wait only R_A_TOV for a response.
It doesn't need to be retried on timeout. If there wasn't a
response, a PLOGI will be sent, thus an implicit logout
applies as well when the other port sees it.
If there is a response, any kind of response is considered "good"
and the XRI quarantined for a exchange qualifier window.
- PLOGI is issued as soon a LOGO state is resolved.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When taking the board offline while performing i/o, unsafe locking errors
occurred and irq level isn't properly managed.
In lpfc_sli_hba_down, spin_lock_irqsave(&phba->hbalock, flags) does not
disable softirqs raised from timer expiry. It is possible that a softirq is
raised from the lpfc_els_retry_delay routine and recursively requests the same
phba->hbalock spinlock causing deadlock.
Address the deadlocks by creating a new port_list lock. The softirq behavior
can then be managed a level deeper into the calling sequences.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After memory allocation for the LCB response frame, the memory wasn't zero
initialized, and not all fields are set. Thus garbage shows up in the
payload.
Fix by zeroing the memory at allocation. Also properly set the Capability
field based on duration support.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Change references from "Broadcom Limited" to "Broadcom Inc." in the
copyright message. Update copyright duration if not yet updated for 2018.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Current implementation missed setting the duration field. Correct the code
to set the field.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During target-side port faults, the driver would not recover all target
port logins. This resulted in a loss of nvme device discovery.
The driver is coded to wait for all GID_FT requests to complete before
restarting discovery. A fault is seen where the outstanding GIT_FT
counts are not properly decremented, thus discovery would never
start. Another fault was found in the clearing of the gidft_inp counter
that would be skipped in this condition. And a third fault found with
lpfc_nvme_register_port that would remove a reverence on the ndlp which
then allows a node swap on a port address change to prematurely remove
the reference and release the ndlp.
The following changes are made:
- Correct the decrementing of the outstanding GID_FT counters.
- In RSCN handling, no longer zero the counter before calling to issue
another GID_FT.
- No longer remove the reference on the dlp when the ndlp->nrport value
is not yet null.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a port is configured for NVME and SCSI Initiator support and it probes
a target supporting both SCSI and NVME, NVME devices are discovered, but
SCSI devices are not.
The nlp_fc4_type for all NPorts should be cleared on Link Up or just before
GID_FTs get issued, as opposed to just during GID_FT cmpl. RSCN activity as
well as Link Up can trigger GID_FT. One GID_FT may complete before the next
one is issued.
Fix by clearng nlp_fc4_type on link up and just before both GID_FTs are
issued. During port swapping, copy nlp_fc4_type to the new ndlp
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The G7 adapter supports 64G link speeds. Add support to the driver.
In addition, a small cleanup to replace the odd bitmap logic with
a switch case.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Updated Copyright in files updated 11.4.0.7
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Revise the NVME PRLI to indicate CONF support.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Several statements are indented too far, fix these
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When enabled for both SCSI and NVME support, and connected pt2pt to a
SCSI only target, the driver nodelist entry for the remote port is left
in PRLI_ISSUE state and no SCSI LUNs are discovered. Works fine if only
configured for SCSI support.
Error was due to some of the prli points still reflecting the need to
send only 1 PRLI. On a lot of fabric configs, targets were NVME only,
which meant the fabric-reported protocol attributes were only telling
the driver one protocol or the other. Thus things worked fine. With
pt2pt, the driver must send a PRLI for both protocols as there are no
hints on what the target supports. Thus pt2pt targets were hitting the
multiple PRLI issues.
Complete the dual PRLI support. Track explicitly whether scsi (fcp) or
nvme prli's have been sent. Accurately track protocol support detected
on each node as reported by the fabric or probed by PRLI traffic.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Handling a rcv'ed PRLI incorrectly can cause the ndlp to end up in the
wrong state or the driver to ACC and PRLI when it should send LS_RJT.
The cause was due to the driver not properly looking at the PRLI type
and taking the multiple protocol support into consideration.
Resolved by adding checks in the various PRLI receive points to validate
PRLI type and reject if not valid for the enabled protocols and mode
(host vs target).
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During RSCN storms, the driver does not rediscover some targets. The
driver marks some RSCN as to be handled after the ones it's working
on. The driver missed processing some deferred RSCN.
Move where the driver checks for deferred RSCNs and initiate deferred
RSCN handling if the flag was set. Also revise nport state within the
RSCN confirm routine. Add some state data to a possible debug print to
aid future debugging.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
pt2pt ndlp ref count prematurely goes to 0. There was reference removed
that should only be removed if connected to a switch, not if in
point-to-point mode.
Add a mode check before the reference remove.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver does not respond to PLOGI from the direct attach target. The
driver uses incorrect S_ID in CONFIG_LINK, after FLOGI completion
Correct by issuing CONFIG_LINK with the correct S_ID after receiving the
PLOGI from the target
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When the HBA is connected to a private loop, the driver reports FLOGI
loop-open failure as functional error. This is an expected condition.
Mark loop-open failure as a warning instead of error.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
megaraid_sas, pm80xx, mpt3sas, be2iscsi, hpsa. and a host of minor
updates.
There's no major behaviour change or additions to the core in all of
this, so the potential for regressions should be small (biggest
potential being in the scsi error handler changes).
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJaCxtCAAoJEAVr7HOZEZN4d9EQAI+OHP6ss6zjKKC21c9jNPcH
NhLrNv37gHg/LA2VXeUEL9RGUjCGLIUrI4HsrxzkFAMLKP4TkshMs8/2RvczY+Sa
VpayPqVybEKLIS6ipQyM1SLIQff2nvtDVcN/T+8z1lkk45TrbA6ZGuwUwd2aJyEA
2V2wtg51ObnL0Nr9QPPll0JrtL1AnCZyRlu9XrwTZuuSBZwk93opIuuvbZm/3dVg
Ir4GSS4Y+PuHIfu4cxqdsPMdzRdY9I2me1YiE4jeFSn1/VTAjL4HBz7fO9eITT42
VhXSpDz1XvFsa9dJ0ubkqoALpJzCfOcBw+EuGvSydLEvOBoEVwMccdfaD9lT1zc5
L9e1Z5qqJoq7hTA6xTXCYfWG73I9HYvljtmc8yudKHhADOdnSTUXhaO6uBF0RNqD
OxPSA1RZwRx3c6lDOcK6BTtvLAkTEuYKdrWSKJi0w+QXJAyQ6etqbmsKpmPdRim7
Z4ZSpJFro2gyo9gcdJO0ykTG+z3U7Z/ay1sNgnuprsv+eU/QjUdlAPl18o79EkRf
H54zZggZ4wC6q/cFVVt4Vx+V+oqIeu38s7NDXS9UltLoTZPm2EzDW6pXd/38Z4Tf
a1oBAUET8kYLC90P8sVZxUIHZjITlpgDbyE2Lq00PMYXhk8S4IxF0aMN5RvVqzUv
+7N2HrHkSSgG1nhw1t+E
=3O85
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
megaraid_sas, pm80xx, mpt3sas, be2iscsi, hpsa. and a host of minor
updates.
There's no major behaviour change or additions to the core in all of
this, so the potential for regressions should be small (biggest
potential being in the scsi error handler changes)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits)
scsi: lpfc: Fix hard lock up NMI in els timeout handling.
scsi: mpt3sas: remove a stray KERN_INFO
scsi: mpt3sas: cleanup _scsih_pcie_enumeration_event()
scsi: aacraid: use timespec64 instead of timeval
scsi: scsi_transport_fc: add 64GBIT and 128GBIT port speed definitions
scsi: qla2xxx: Suppress a kernel complaint in qla_init_base_qpair()
scsi: mpt3sas: fix dma_addr_t casts
scsi: be2iscsi: Use kasprintf
scsi: storvsc: Avoid excessive host scan on controller change
scsi: lpfc: fix kzalloc-simple.cocci warnings
scsi: mpt3sas: Update mpt3sas driver version.
scsi: mpt3sas: Fix sparse warnings
scsi: mpt3sas: Fix nvme drives checking for tlr.
scsi: mpt3sas: NVMe drive support for BTDHMAPPING ioctl command and log info
scsi: mpt3sas: Add-Task-management-debug-info-for-NVMe-drives.
scsi: mpt3sas: scan and add nvme device after controller reset
scsi: mpt3sas: Set NVMe device queue depth as 128
scsi: mpt3sas: Handle NVMe PCIe device related events generated from firmware.
scsi: mpt3sas: API's to remove nvme drive from sml
scsi: mpt3sas: API 's to support NVMe drive addition to SML
...
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
Cc: James Smart <james.smart@broadcom.com>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Support RDP and Multiple Frames
If the remote Nport is not logged in, the driver would not populate all
the descriptors in the RDP response payload. Doing so would create a
payload length that requires multiple frames due to exceeding the
default rx buffer size without an explicit login. Currently FC-LS
explicitly states the RDP response must be a single frame sequence.
Thus we did not violate the standard.
Recently, a modification to FC-LS was accepted which allows multi-frame
sequences and all vendors have indicated they are interoperable with the
change. As such, extend RDP support with the additional fields and send
a multi-frame sequence.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The driver crashes when attempting to use a freed ndpl pointer.
The pci_remove_one handler runs on a separate kernel thread. The order
of the removal is starting by freeing all of the ndlps and then
disabling interrupts. In between these two events the driver can still
receive an ELS and process it. When it tries to use the ndlp pointer
will be NULL
Change the order of the pci_remove_one vs disable interrupts so that
interrupts are disabled before the ndlp's are freed.
Cc: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Add Buffer to buffer credit recovery support to the driver. This is a
negotiated feature with the peer that allows for both sides to detect
dropped RRDY's and FC Frames and recover credit.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When using fabric-assigned WWNs, the switch doesn't like copy of the
FLOGI payload, which includes valid VVL bits, to be used as the FDISC
payload.
Rather than wait for corrected switch firmware, ensure the VVL bits are
marked invalid on FDISCs.
[mkp: typo]
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A race condition was found whereby the initiator would receive the RSCN
for a new NVME device before it had a chance to register its FC4 support
with the fabric. Thus, when queried by the initiator, it would see that
the target supported FC-NVME.
Corrected by making the assumption that the target always supports
FC-NVME thus a PRLI is sent. It's ok for the target to reject it.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After lip, the driver sometimes would have two rports for the same
device, allowing the namespaces to be duplicated by nvme.
In lpfc_plogi_confirm_nport() the driver was not swapping the nrport
maintained by the ndlp's undergoing address swapping. This allowed the
2nd rport to sneak in as it was considered a separate device.
This patch adds the fixes to Swap the nrport in each ndlp and take care
of the reference counts on the ndlps similar to FCP rports.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After link bounce in a NVME Pt2Pt config, the driver managed to map the
same nport twice, resulting in multiple device nodes for the same
namespace.
In Pt2Pt, the driver must send PRLI's for both (scsi) FCP and NVME
rather than using fabric aids. The driver was inconsistent on handling
various PRLI completions, especially rejects, which had reject codes
cross the different protocol PRLI completions.
Fixed to perform the following: if nvmet mode (fc port can only be a
nvme target) - rejects all unsolicitly FCP PRLI's. Never issues a FCP
PRLI.
The multiple protocol PRLI's are sent simultaneously. However, driver
will now only state transition after both PRLI's are complete. New flags
were added to aid tracking the responses from the different PRLI's.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Message "0271 Illegal State Transition: node" seen in logs, all luns are
unuseable for that target.
A window exists in the rcv_plogi path where if the state is plogi issue
but the driver has not issued a plogi, then two reglogins will be sent
for the same RPI. The first one to complete will advance the state to
prli issue the second one will be detected as an illegal state, and
leave the node in an unusable state.
Correct the completion routine for the PLOGI ACC that detects the state
change when the driver starts discovery on the node again and drop the
REGLOGIN mailbox command.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Beacon OFF from switch is rejected by driver.
Driver fails Beacon OFF if frequency is set to 0. As per fc-ls spec,
status, capability, frequency and duration fields are only applicable
for Beacon ON.
Remove frequency and type checks. Reject Beacon ON if duration is non
zero.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In a server with an 8G adapter and a 32G adapter, running NVME and FCP,
the server would crash with the following stack.
RIP: 0010: ... lpfc_nvme_register_port+0x38/0x420 [lpfc]
lpfc_nlp_state_cleanup+0x154/0x4f0 [lpfc]
lpfc_nlp_set_state+0x9d/0x1a0 [lpfc]
lpfc_cmpl_prli_prli_issue+0x35f/0x440 [lpfc]
lpfc_disc_state_machine+0x78/0x1c0 [lpfc]
lpfc_cmpl_els_prli+0x17c/0x1f0 [lpfc]
lpfc_sli_sp_handle_rspiocb+0x39b/0x6b0 [lpfc]
lpfc_sli_handle_slow_ring_event_s3+0x134/0x2d0 [lpfc]
lpfc_work_done+0x8ac/0x13b0 [lpfc]
lpfc_do_work+0xf1/0x1b0 [lpfc]
Crash, on the 8G adapter, is due to a vport which does not have a nvme
local port structure. It's not supposed to have one. NVME is not
supported on the 8G adapter, so the NVME PRLI, which started this flow
shouldn't have been sent in the first place.
Correct discovery engine to recognize when on an SLI3 rport, which
doesn't support SLI3, if the rport supports only NVME, don't send a NVME
PRLI. Instead, as no FC4 will be used, a LOGO is sent. If rport is FCP
and NVME, only execute the SCSI PRLI.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The nvmet driver was rejecting the initiator's PRLI because its reg_rpi
for the PLOGI was still outstanding. The initiator would resend the
PRLI without delay and get the same answer. The PRLI retries would
exhaust causing the nvme initiator to set the nvmet ndlp to UNMAPPED.
The driver's lpfc_els_retry handler did not have a policy for an LS_RJT
with explanation CMD_IN_PROGRESS for PRLI or NVME_PRLI. This caused the
delay to remain at 0 but retry set 1.
Fix: When the ELS response is LS_RJT, TPC and the command was PRLI or
NVME_PRLI, just set the delay to 1000 mS to get a 1 second delay on the
PRLI retry. This was enough to allow the REG_RPI to complete at the
target.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Recent commit on patchset "lpfc updates for 11.2.0.14" fixed an issue
about dereferencing a NULL pointer on port reset. The specific commit,
named "lpfc: Fix system crash when port is reset.", is missing a check
against NULL pointer on lpfc_els_flush_cmd() though.
Since we destroy the queues on adapter resets, like in PCI error
recovery path, we need the validation present on this patch in order to
avoid a NULL pointer dereference when trying to flush commands of ELS
wq, after it has been destroyed (which would lead to a kernel oops).
Tested-by: Raphael Silva <raphasil@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Added code to support Cisco MDS loopback diagnostic. The diagnostics run
various loopbacks including one which loops-back frame through the
driver.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During driver boot, a latency in the NVMET driver side causes the
incoming NVMEI PRLI to get rejected by the NVMET driver. When this
happens, the NVMEI driver runs out of PRLI retries. Bouncing the link
does not fix the situation.
If the NVMEI driver decides, on PRLI completion failures, to retry the
PRLI, always decrement the fc4_prli_sent counter. This allows the PRLI
completion to resolve to UNMAPPED when NVMET rejects the PRLI.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
With 255 vports created a link trasition can casue a crash.
When going through discovery after a link bounce the driver is using
rpis before the cmd FCOE_POST_HDR_TEMPLATES completes. By doing that the
next rpi bumps the rpi range out of the boundary.
The fix it to increment the next_rpi only when the
FCOE_POST_HDR_TEMPLATE succeeds.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Christoph writes:
"A couple more updates for 4.12. The biggest pile is fc and lpfc
updates from James, but there are various small fixes and cleanups as
well."
Fixes up a few merge issues, and also a warning in
lpfc_nvmet_rcv_unsol_abort() if CONFIG_NVME_TARGET_FC isn't enabled.
Signed-off-by: Jens Axboe <axboe@fb.com>
NVMET didn't have any RSCN handling at all and
would not execute implicit LOGO when receiving a PLOGI
from an rport that NVMET had in state UNMAPPED.
Clean up the logic in lpfc_nlp_state_cleanup for
initiators (FCP and NVME). NVMET should not respond to
RSCN including allocating new ndlps so this code was
conditionalized when nvmet_support is true. The check
for NLP_RCV_PLOGI in lpfc_setup_disc_node was moved
below the check for nvmet_support to allow the NVMET
to recover initiator nodes correctly. The implicit
logo was introduced with lpfc_rcv_plogi when NVMET gets
a PLOGI on an ndlp in UNMAPPED state. The RSCN handling
was modified to not respond to an RSCN in NVMET. Instead
NVMET sends a GID_FT and determines if an NVMEP_INITIATOR
it has is UNMAPPED but no longer in the zone membership.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Adding support for Fabric assigned WWPN and WWNN.
Firmware sends first FLOGI to fabric with vendor version changes.
On link up driver gets updated service parameter with FAWWN assigned port
name. Driver sends 2nd FLOGI with updated fawwpn and modifies the
vport->fc_portname in driver.
Note:
Soft wwpn will not be allowed when fawwpn is enabled.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
When RPI is not available, driver sends WQE with invalid RPI value and
rejected by HBA.
lpfc 0000:82:00.3: 1:3154 BLS ABORT RSP failed, data: x3/xa0320008
and
lpfc :2753 PLOGI failure DID:FFFFFA Status:x3/xa0240008
In this case, driver accesses rpi_ids array out of bounds.
Fix:
Check return value of lpfc_sli4_alloc_rpi(). Do not allocate
lpfc_nodelist entry if RPI is not available.
When RPI is not available, we will get discovery timeouts and
command drops for some of the vports as seen below.
lpfc :0273 Unexpected discovery timeout, vport State x0
lpfc :0230 Unexpected timeout, hba link state x5
lpfc :0111 Dropping received ELS cmd Data: x0 xc90c55 x0
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
lpfc cannot establish connection with targets that send PRLI in P2P
configurations.
If lpfc rejects a PRLI that is sent from a target the target will not
resend and will reject the PRLI send from the initiator.
[mkp: applied by hand]
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>