alistair23-linux

redonkable

Author	SHA1	Message	Date
James Smart	7ec7630548	scsi: lpfc: Fix invalid sleeping context in lpfc_sli4_nvmet_alloc() commit `62e3a931db` upstream. The following calltrace was seen: BUG: sleeping function called from invalid context at mm/slab.h:494 ... Call Trace: dump_stack+0x9a/0xf0 ___might_sleep.cold.63+0x13d/0x178 slab_pre_alloc_hook+0x6a/0x90 kmem_cache_alloc_trace+0x3a/0x2d0 lpfc_sli4_nvmet_alloc+0x4c/0x280 [lpfc] lpfc_post_rq_buffer+0x2e7/0xa60 [lpfc] lpfc_sli4_hba_setup+0x6b4c/0xa4b0 [lpfc] lpfc_pci_probe_one_s4.isra.15+0x14f8/0x2280 [lpfc] lpfc_pci_probe_one+0x260/0x2880 [lpfc] local_pci_probe+0xd4/0x180 work_for_cpu_fn+0x51/0xa0 process_one_work+0x8f0/0x17b0 worker_thread+0x536/0xb50 kthread+0x30c/0x3d0 ret_from_fork+0x3a/0x50 A prior patch introduced a spin_lock_irqsave(hbalock) in the lpfc_post_rq_buffer() routine. Call trace is seen as the hbalock is held with interrupts disabled during a GFP_KERNEL allocation in lpfc_sli4_nvmet_alloc(). Fix by reordering locking so that hbalock not held when calling sli4_nvmet_alloc() (aka rqb_buf_list()). Link: https://lore.kernel.org/r/20201020202719.54726-2-james.smart@broadcom.com Fixes: `411de511c6` ("scsi: lpfc: Fix RQ empty firmware trap") Cc: <stable@vger.kernel.org> # v4.17+ Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-12-30 11:51:44 +01:00
James Smart	4935732e88	scsi: lpfc: Fix RQ buffer leakage when no IOCBs available [ Upstream commit `39c4f1a965` ] The driver is occasionally seeing the following SLI Port error, requiring reset and reinit: Port Status Event: ... error 1=0x52004a01, error 2=0x218 The failure means an RQ timeout. That is, the adapter had received asynchronous receive frames, ran out of buffer slots to place the frames, and the driver did not replenish the buffer slots before a timeout occurred. The driver should not be so slow in replenishing buffers that a timeout can occur. When the driver received all the frames of a sequence, it allocates an IOCB to put the frames in. In a situation where there was no IOCB available for the frame of a sequence, the RQ buffer corresponding to the first frame of the sequence was not returned to the FW. Eventually, with enough traffic encountering the situation, the timeout occurred. Fix by releasing the buffer back to firmware whenever there is no IOCB for the first frame. [mkp: typo] Link: https://lore.kernel.org/r/20200128002312.16346-2-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-01 13:17:32 +02:00
James Smart	0c5733a962	scsi: lpfc: Fix crash after handling a pci error [ Upstream commit `4cd7089130` ] Injecting EEH on a 32GB card is causing kernel oops The pci error handler is doing an IO flush and the offline code is also doing an IO flush. When the 1st flush is complete the hdwq is destroyed (freed), yet the second flush accesses the hdwq and crashes. Added a check in lpfc_sli4_fush_io_rings to check both the HBA_IOQ_FLUSH flag and the hdwq pointer to see if it is already set and not already freed. Link: https://lore.kernel.org/r/20200322181304.37655-6-jsmart2021@gmail.com Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-04-29 16:32:57 +02:00
James Smart	9d1062c4dd	scsi: lpfc: Fix kasan slab-out-of-bounds error in lpfc_unreg_login [ Upstream commit `38503943c8` ] The following kasan bug was called out: BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc] Read of size 2 at addr ffff889fc7c50a22 by task lpfc_worker_3/6676 ... Call Trace: dump_stack+0x96/0xe0 ? lpfc_unreg_login+0x7c/0xc0 [lpfc] print_address_description.constprop.6+0x1b/0x220 ? lpfc_unreg_login+0x7c/0xc0 [lpfc] ? lpfc_unreg_login+0x7c/0xc0 [lpfc] __kasan_report.cold.9+0x37/0x7c ? lpfc_unreg_login+0x7c/0xc0 [lpfc] kasan_report+0xe/0x20 lpfc_unreg_login+0x7c/0xc0 [lpfc] lpfc_sli_def_mbox_cmpl+0x334/0x430 [lpfc] ... When processing the completion of a "Reg Rpi" login mailbox command in lpfc_sli_def_mbox_cmpl, a call may be made to lpfc_unreg_login. The vpi is extracted from the completing mailbox context and passed as an input for the next. However, the vpi stored in the mailbox command context is an absolute vpi, which for SLI4 represents both base + offset. When used with a non-zero base component, (function id > 0) this results in an out-of-range access beyond the allocated phba->vpi_ids array. Fix by subtracting the function's base value to get an accurate vpi number. Link: https://lore.kernel.org/r/20200322181304.37655-2-jsmart2021@gmail.com Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-04-29 16:32:57 +02:00
James Smart	484cc15ad0	scsi: lpfc: fix inlining of lpfc_sli4_cleanup_poll_list() [ Upstream commit `d480e57809` ] Compilation can fail due to having an inline function reference where the function body is not present. Fix by removing the inline tag. Fixes: `93a4d6f401` ("scsi: lpfc: Add registration for CPU Offline/Online events") Link: https://lore.kernel.org/r/20191111230401.12958-4-jsmart2021@gmail.com Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-04-17 10:50:26 +02:00
James Smart	f48e759352	scsi: lpfc: Add registration for CPU Offline/Online events [ Upstream commit `93a4d6f401` ] The recent affinitization didn't address cpu offlining/onlining. If an interrupt vector is shared and the low order cpu owning the vector is offlined, as interrupts are managed, the vector is taken offline. This causes the other CPUs sharing the vector will hang as they can't get io completions. Correct by registering callbacks with the system for Offline/Online events. When a cpu is taken offline, its eq, which is tied to an interrupt vector is found. If the cpu is the "owner" of the vector and if the eq/vector is shared by other CPUs, the eq is placed into a polled mode. Additionally, code paths that perform io submission on the "sharing CPUs" will check the eq state and poll for completion after submission of new io to a wq that uses the eq. Similarly, when a cpu comes back online and owns an offlined vector, the eq is taken out of polled mode and rearmed to start driving interrupts for eq. Link: https://lore.kernel.org/r/20191105005708.7399-9-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-04-17 10:50:24 +02:00
James Smart	b1b105a633	scsi: lpfc: use hdwq assigned cpu for allocation [ Upstream commit `4583a4f66b` ] Looking at the recent conversion from smp_processor_id() to raw_smp_processor_id(), realized that the allocation should be based on the cpu the hdwq is bound to, not the executing cpu. Revise to pull cpu number from the hdwq Fixes: `765ab6cdac` ("scsi: lpfc: Fix a kernel warning triggered by lpfc_get_sgl_per_hdwq()") Link: https://lore.kernel.org/r/20191116003847.6141-1-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-01-23 08:23:01 +01:00
Bart Van Assche	0ec3e3ba47	scsi: lpfc: Fix a kernel warning triggered by lpfc_get_sgl_per_hdwq() commit `765ab6cdac` upstream. Fix the following kernel bug report: BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/954 Fixes: `d79c9e9d4b` ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.") Link: https://lore.kernel.org/r/20191107052158.25788-2-bvanassche@acm.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-23 08:22:59 +01:00
James Smart	51a2104cc4	scsi: lpfc: Fix hdwq sgl locks and irq handling commit `a4c21acca2` upstream. Many of the sgl-per-hdwq paths are locking with spin_lock_irq() and spin_unlock_irq() and may unwittingly raising irq when it shouldn't. Hard deadlocks were seen around lpfc_scsi_prep_cmnd(). Fix by converting the locks to irqsave/irqrestore. Fixes: `d79c9e9d4b` ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.") Link: https://lore.kernel.org/r/20190922035906.10977-16-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-23 08:22:59 +01:00
James Smart	287a465e12	scsi: lpfc: Fix list corruption detected in lpfc_put_sgl_per_hdwq commit `35a635af54` upstream. In lpfc_release_io_buf, an lpfc_io_buf is returned to the 'available' pool before any associated sgl or cmd and rsp buffers are returned via their respective 'put' routines. If xri rebalancing occurs and an lpfc_io_buf structure is reused quickly, there may be a race condition between release of old and association of new resources. Re-ordered lpfc_release_io_buf to release sgl and cmd/rsp buffer lists before releasing the lpfc_io_buf structure for re-use. Fixes: `d79c9e9d4b` ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.") Link: https://lore.kernel.org/r/20190922035906.10977-17-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-23 08:22:59 +01:00
James Smart	858f090696	scsi: lpfc: Fix rpi release when deleting vport commit `97acd0019d` upstream. A prior use-after-free mailbox fix solved it's problem by null'ing a ndlp pointer. However, further testing has shown that this change causes a later state change to occasionally be skipped, which results in a reference count never being decremented thus the rpi is never released, which causes a vport delete to never succeed. Revise the fix in the prior patch to no longer null the ndlp. Instead the RELEASE_RPI flag is set which will drive the release of the rpi. Given the new code was added at a deep indentation level, refactor the code block using a new routine that avoids the indentation issues. Fixes: `9b16406864` ("scsi: lpfc: Fix use-after-free mailbox cmd completion") Link: https://lore.kernel.org/r/20190922035906.10977-6-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-01-09 10:20:01 +01:00
James Smart	11ff350c9b	scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow [ Upstream commit `7cfd5639d9` ] If the driver receives a login that is later then LOGO'd by the remote port (aka ndlp), the driver, upon the completion of the LOGO ACC transmission, will logout the node and unregister the rpi that is being used for the node. As part of the unreg, the node's rpi value is replaced by the LPFC_RPI_ALLOC_ERROR value. If the port is subsequently offlined, the offline walks the nodes and ensures they are logged out, which possibly entails unreg'ing their rpi values. This path does not validate the node's rpi value, thus doesn't detect that it has been unreg'd already. The replaced rpi value is then used when accessing the rpi bitmask array which tracks active rpi values. As the LPFC_RPI_ALLOC_ERROR value is not a valid index for the bitmask, it may fault the system. Revise the rpi release code to detect when the rpi value is the replaced RPI_ALLOC_ERROR value and ignore further release steps. Link: https://lore.kernel.org/r/20191105005708.7399-2-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-01-04 19:17:17 +01:00
James Smart	03d0de2da8	scsi: lpfc: Fix list corruption in lpfc_sli_get_iocbq [ Upstream commit `15498dc1a5` ] After study, it was determined there was a double free of a CT iocb during execution of lpfc_offline_prep and lpfc_offline. The prep routine issued an abort for some CT iocbs, but the aborts did not complete fast enough for a subsequent routine that waits for completion. Thus the driver proceeded to lpfc_offline, which releases any pending iocbs. Unfortunately, the completions for the aborts were then received which re-released the ct iocbs. Turns out the issue for why the aborts didn't complete fast enough was not their time on the wire/in the adapter. It was the lpfc_work_done routine, which requires the adapter state to be UP before it calls lpfc_sli_handle_slow_ring_event() to process the completions. The issue is the prep routine takes the link down as part of it's processing. To fix, the following was performed: - Prevent the offline routine from releasing iocbs that have had aborts issued on them. Defer to the abort completions. Also means the driver fully waits for the completions. Given this change, the recognition of "driver-generated" status which then releases the iocb is no longer valid. As such, the change made in the commit `296012285c` is reverted. As recognition of "driver-generated" status is no longer valid, this patch reverts the changes made in commit `296012285c` ("scsi: lpfc: Fix leak of ELS completions on adapter reset") - Modify lpfc_work_done to allow slow path completions so that the abort completions aren't ignored. - Updated the fdmi path to recognize a CT request that fails due to the port being unusable. This stops FDMI retries. FDMI will be restarted on next link up. Link: https://lore.kernel.org/r/20190922035906.10977-14-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-01-04 19:16:28 +01:00
James Smart	a51f92387f	scsi: lpfc: Fix locking on mailbox command completion [ Upstream commit `07b8582430` ] Symptoms were seen of the driver not having valid data for mailbox commands. After debugging, the following sequence was found: The driver maintains a port-wide pointer of the mailbox command that is currently in execution. Once finished, the port-wide pointer is cleared (done in lpfc_sli4_mq_release()). The next mailbox command issued will set the next pointer and so on. The mailbox response data is only copied if there is a valid port-wide pointer. In the failing case, it was seen that a new mailbox command was being attempted in parallel with the completion. The parallel path was seeing the mailbox no long in use (flag check under lock) and thus set the port pointer. The completion path had cleared the active flag under lock, but had not touched the port pointer. The port pointer is cleared after the lock is released. In this case, the completion path cleared the just-set value by the parallel path. Fix by making the calls that clear mbox state/port pointer while under lock. Also slightly cleaned up the error path. Link: https://lore.kernel.org/r/20190922035906.10977-8-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-01-04 19:16:24 +01:00
James Smart	64c8e5afcb	scsi: lpfc: Fix bad ndlp ptr in xri aborted handling commit `324e1c4020` upstream. In cases where I/O may be aborted, such as driver unload or link bounces, the system will crash based on a bad ndlp pointer. Example: RIP: 0010:lpfc_sli4_abts_err_handler+0x15/0x140 [lpfc] ... lpfc_sli4_io_xri_aborted+0x20d/0x270 [lpfc] lpfc_sli4_sp_handle_abort_xri_wcqe.isra.54+0x84/0x170 [lpfc] lpfc_sli4_fp_handle_cqe+0xc2/0x480 [lpfc] __lpfc_sli4_process_cq+0xc6/0x230 [lpfc] __lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc] process_one_work+0x14c/0x390 Crash was caused by a bad ndlp address passed to I/O indicated by the XRI aborted CQE. The address was not NULL so the routine deferenced the ndlp ptr. The bad ndlp also caused the lpfc_sli4_io_xri_aborted to call an erroneous io handler. Root cause for the bad ndlp was an lpfc_ncmd that was aborted, put on the abort_io list, completed, taken off the abort_io list, sent to lpfc_release_nvme_buf where it was put back on the abort_io list because the lpfc_ncmd->flags setting LPFC_SBUF_XBUSY was not cleared on the final completion. Rework the exchange busy handling to ensure the flags are properly set for both scsi and nvme. Fixes: `c490850a09` ("scsi: lpfc: Adapt partitioned XRI lists to efficient sharing") Cc: <stable@vger.kernel.org> # v5.1+ Link: https://lore.kernel.org/r/20191018211832.7917-6-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-12-17 19:55:26 +01:00
Daniel Wagner	535fb49e73	scsi: lpfc: Check queue pointer before use The queue pointer might not be valid. The rest of the code checks the pointer before accessing it. lpfc_sli4_process_missed_mbox_completions is the only place where the check is missing. Fixes: `657add4e5e` ("scsi: lpfc: Fix poor use of hardware queues if fewer irq vectors") Cc: James Smart <jsmart2021@gmail.com> Link: https://lore.kernel.org/r/20191018162111.8798-1-dwagner@suse.de Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-10-18 17:59:27 -04:00
James Smart	4fb86a6bc5	scsi: lpfc: Fix reset recovery paths that are not recovering A recent patch unconditionally marks the hba as in error as part of resetting the adapter. The driver flow that called the adapter reset was a recovery path, which expects the adapter to not be in an error state in order to finish the recovery. Given the new error state being set, the recovery fails and the adapter is left in limbo. Revise the adapter reset routine so that it will only mark the adapter in error if it was unable to reset the adapter. Fixes: `8c24a4f643` ("scsi: lpfc: Fix crash due to port reset racing vs adapter error handling") Link: https://lore.kernel.org/r/20190903215441.10490-1-jsmart2021@gmail.com Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-09-07 16:28:15 -04:00
Sakari Ailus	2d44d165e9	scsi: lpfc: Convert existing %pf users to %ps Convert the remaining %pf users to %ps to prepare for the removal of the old %pf conversion specifier support. Fixes: `3235066449` ("scsi: lpfc: Migrate to %px and %pf in kernel print calls") Link: https://lore.kernel.org/r/20190904160423.3865-1-sakari.ailus@linux.intel.com Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-09-07 16:26:40 -04:00
James Smart	01f2ef6d18	scsi: lpfc: fix 12.4.0.0 GPF at boot The 12.4.0.0 patch that merged WQ/CQ pairs into single per-cpu pair contained a bug: a local variable was set to the queue pair by index. This should have allowed the local variable to be natively used. Instead, the code reused the index relative to the local variable, obtaining a random pointer value that when used eventually faulted the system Convert offending code to use local variable. Fixes: `c00f62e6c5` ("scsi: lpfc: Merge per-protocol WQ/CQ pairs into single per-cpu pair") Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-29 18:15:07 -04:00
James Smart	c00f62e6c5	scsi: lpfc: Merge per-protocol WQ/CQ pairs into single per-cpu pair Currently, each hardware queue, typically allocated per-cpu, consists of a WQ/CQ pair per protocol. Meaning if both SCSI and NVMe are supported 2 WQ/CQ pairs will exist for the hardware queue. Separate queues are unnecessary. The current implementation wastes memory backing the 2nd set of queues, and the use of double the SLI-4 WQ/CQ's means less hardware queues can be supported which means there may not always be enough to have a pair per cpu. If there is only 1 pair per cpu, more cpu's may get their own WQ/CQ. Rework the implementation to use a single WQ/CQ pair by both protocols. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:12 -04:00
James Smart	d79c9e9d4b	scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware. Typical SLI-4 hardware supports up to 2 4KB pages to be registered per XRI to contain the exchanges Scatter/Gather List. This caps the number of SGL elements that can be in the SGL. There are not extensions to extend the list out of the 2 pages. The G7 hardware adds a SGE type that allows the SGL to be vectored to a different scatter/gather list segment. And that segment can contain a SGE to go to another segment and so on. The initial segment must still be pre-registered for the XRI, but it can be a much smaller amount (256Bytes) as it can now be dynamically grown. This much smaller allocation can handle the SG list for most normal I/O, and the dynamic aspect allows it to support many MB's if needed. The implementation creates a pool which contains "segments" and which is initially sized to hold the initial small segment per xri. If an I/O requires additional segments, they are allocated from the pool. If the pool has no more segments, the pool is grown based on what is now needed. After the I/O completes, the additional segments are returned to the pool for use by other I/Os. Once allocated, the additional segments are not released under the assumption of "if needed once, it will be needed again". Pools are kept on a per-hardware queue basis, which is typically 1:1 per cpu, but may be shared by multiple cpus. The switch to the smaller initial allocation significantly reduces the memory footprint of the driver (which only grows if large ios are issued). Based on the several K of XRIs for the adapter, the 8KB->256B reduction can conserve 32MBs or more. It has been observed with per-cpu resource pools that allocating a resource on CPU A, may be put back on CPU B. While the get routines are distributed evenly, only a limited subset of CPUs may be handling the put routines. This can put a strain on the lpfc_put_cmd_rsp_buf_per_cpu routine because all the resources are being put on a limited subset of CPUs. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:12 -04:00
James Smart	e62245d923	scsi: lpfc: Add MDS driver loopback diagnostics support Added code to support driver loopback with MDS Diagnostics. This style of diagnostics passes frames from the fabric to the driver who then echo them back out the link. SEND_FRAME WQEs are used to transmit the frames. Added the SOF and EOF field location definitions for use by SEND_FRAME. Also ensure that enable_mds_diags is a RW parameter. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:12 -04:00
James Smart	3235066449	scsi: lpfc: Migrate to %px and %pf in kernel print calls In order to see real addresses, convert %p with %px for kernel addresses and replace %p with %pf for functions. While converting, standardize on "x%px" throughout (not %px or 0x%px). Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:11 -04:00
James Smart	d9f492a1a1	scsi: lpfc: Fix coverity warnings Running on Coverity produced the following errors: - coding style (indentation) - memset size mismatch errors note: comment cases where it is purposely a mismatch Fix the errors. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:11 -04:00
James Smart	84f2ddf8cf	scsi: lpfc: Fix hang when downloading fw on port enabled for nvme As part of firmware download, the adapter is reset. On the adapter the reset causes the function to stop and all outstanding io is terminated (without responses). The reset path then starts teardown of the adapter, starting with deregistration of the remote ports with the nvme-fc transport. The local port is then deregistered and the driver waits for local port deregistration. This never finishes. The remote port deregistrations terminated the nvme controllers, causing them to send aborts for all the outstanding io. The aborts were serviced in the driver, but stalled due to its state. The nvme layer then stops to reclaim it's outstanding io before continuing. The io must be returned before the reset on the controller is deemed complete and the controller delete performed. The remote port deregistration won't complete until all the controllers are terminated. And the local port deregistration won't complete until all controllers and remote ports are terminated. Thus things hang. The issue is the reset which stopped the adapter also stopped all the responses that would drive i/o completions, and the aborts were also stopped that stopped i/o completions. The driver, when resetting the adapter like this, needs to be generating the completions as part of the adapter reset so that I/O complete (in error), and any aborts are not queued. Fix by adding flush routines whenever the adapter port has been reset or discovered in error. The flush routines will generate the completions for the scsi and nvme outstanding io. The abort ios, if waiting, will be caught and flushed as well. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:10 -04:00
James Smart	8c24a4f643	scsi: lpfc: Fix crash due to port reset racing vs adapter error handling If the adapter encounters a condition which causes the adapter to fail (driver must detect the failure) simultaneously to a request to the driver to reset the adapter (such as a host_reset), the reset path will be racing with the asynchronously-detect adapter failure path. In the failing situation, one path has started to tear down the adapter data structures (io_wq's) while the other path has initiated a repeat of the teardown and is in the lpfc_sli_flush_xxx_rings path and attempting to access the just-freed data structures. Fix by the following: - In cases where an adapter failure is detected, rather than explicitly calling offline_eratt() to start the teardown, change the adapter state and let the later calls of posted work to the slowpath thread invoke the adapter recovery. In essence, this means all requests to reset are serialized on the slowpath thread. - Clean up the routine that restarts the adapter. If there is a failure from brdreset, don't immediately error and leave things in a partial state. Instead, ensure the adapter state is set and finish the teardown of structures before returning. - If in the scsi host reset handler and the board fails to reset and restart (which can be due to parallel reset/recovery paths), instead of hard failing and explicitly calling offline_eratt() (which gets into the redundant path), just fail out and let the asynchronous path resolve the adapter state. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:10 -04:00
James Smart	b95b21193c	scsi: lpfc: Fix loss of remote port after devloss due to lack of RPIs In tests with remote ports contantly logging out/logging coupled with occassional local link bounce, if a remote port is disocnnected for longer than devloss_tmo and then subsequently reconnected, eventually the test will fail to login with the remote port and remote port connectivity is lost. When devloss_tmo expires, the driver does not free the node struct until the port or npiv instances is being deleted. The node is left allocated but the state set to UNUSED. If the node was in the process of logging in when the local link drop occurred, meaning the RPI was allocated for the node in order to send the ELS, but not yet registered which comes after successful login, the node is moved to the NPR state, and if devloss expires, to UNUSED state. If the remote port comes back, the node associated with it is restarted and this path happens to allocate a new RPI and overwrites the prior RPI value. In the cases where the port was logged in and loggs out, the path did release the RPI but did not set the node rpi value. In the cases where the remote port never finished logging in, the path never did the call to release the rpi. In this latter case, when the node is subsequently restore, the new rpi allocation overwrites the rpi that was not released, and the rpi is now leaked. Eventually the port will run out of RPI resources to log into new remote ports. Fix by following changes: - When an rpi is released, do so under locks and ensure the node rpi value is set to a non-allocated value (LPFC_RPI_ALLOC_ERROR). Note: refactored to a small service routine to avoid indentation issues. - When re-enabling a node, check the rpi value to determine if a new allocation is necessary. If already set, use the prior rpi. Enhanced logging to help in the future. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:09 -04:00
James Smart	4b0a42be26	scsi: lpfc: Fix irq raising in lpfc_sli_hba_down The adapter reset path (lpfc_sli_hba_down) is taking/releasing a lock with irq. But, the path is already under the hbalock which raised irq so it's unnecessary. Convert to simple lock/unlock. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:09 -04:00
James Smart	296012285c	scsi: lpfc: Fix leak of ELS completions on adapter reset If the adapter is reset while there are outstanding ELS's, subsequent reinitialization of the adapter will fail as it has not recovered all of the io contexts relative to the ELS's. If an ELS timed out or otherwise failed and an the ELS was attempted to be aborted (which changes the ELS completion context), in causes where the driver generates completions for the outstanding IO as the adapter would not due to being reset, the driver released only the ELS context and failed to release the abort context. When the adapter went to reinit, as it had not received all of the contexts, it failed to reinit. Fix by having the ELS completion handler identify the driver-generated completion status and release the abort context. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:08 -04:00
James Smart	4f1a2fef2a	scsi: lpfc: Fix PLOGI failure with high remoteport count When connected to a high number of remote ports, the driver is encountering PLOGI errors. The errors are due to adapter detected failures indicating illegal field values. Turns out the driver was prematurely clearing an RPI bitmask before waiting for an UNREG_RPI mailbox completion. This allowed the RPI to be reused before it was actually available. Fix by clearing RPI bitmask only after UNREG_RPI mailbox completion. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:41:08 -04:00
Fuqian Huang	ee9a256cd8	scsi: lpfc: remove redundant code Remove the redundant initialization code. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-08-19 22:07:50 -04:00
Linus Torvalds	ba6d10ab80	SCSI misc on 20190709 This is mostly update of the usual drivers: qla2xxx, hpsa, lpfc, ufs, mpt3sas, ibmvscsi, megaraid_sas, bnx2fc and hisi_sas as well as the removal of the osst driver (I heard from Willem privately that he would like the driver removed because all his test hardware has failed). Plus number of minor changes, spelling fixes and other trivia. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXSTl4yYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdcxAQDCJVbd fPUX76/V1ldupunF97+3DTharxxbst+VnkOnCwD8D4c0KFFFOI9+F36cnMGCPegE fjy17dQLvsJ4GsidHy8= =aS5B -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: qla2xxx, hpsa, lpfc, ufs, mpt3sas, ibmvscsi, megaraid_sas, bnx2fc and hisi_sas as well as the removal of the osst driver (I heard from Willem privately that he would like the driver removed because all his test hardware has failed). Plus number of minor changes, spelling fixes and other trivia. The big merge conflict this time around is the SPDX licence tags. Following discussion on linux-next, we believe our version to be more accurate than the one in the tree, so the resolution is to take our version for all the SPDX conflicts" Note on the SPDX license tag conversion conflicts: the SCSI tree had done its own SPDX conversion, which in some cases conflicted with the treewide ones done by Thomas & co. In almost all cases, the conflicts were purely syntactic: the SCSI tree used the old-style SPDX tags ("GPL-2.0" and "GPL-2.0+") while the treewide conversion had used the new-style ones ("GPL-2.0-only" and "GPL-2.0-or-later"). In these cases I picked the new-style one. In a few cases, the SPDX conversion was actually different, though. As explained by James above, and in more detail in a pre-pull-request thread: "The other problem is actually substantive: In the libsas code Luben Tuikov originally specified gpl 2.0 only by dint of stating: * This file is licensed under GPLv2. In all the libsas files, but then muddied the water by quoting GPLv2 verbatim (which includes the or later than language). So for these files Christoph did the conversion to v2 only SPDX tags and Thomas converted to v2 or later tags" So in those cases, where the spdx tag substantially mattered, I took the SCSI tree conversion of it, but then also took the opportunity to turn the old-style "GPL-2.0" into a new-style "GPL-2.0-only" tag. Similarly, when there were whitespace differences or other differences to the comments around the copyright notices, I took the version from the SCSI tree as being the more specific conversion. Finally, in the spdx conversions that had no conflicts (because the treewide ones hadn't been done for those files), I just took the SCSI tree version as-is, even if it was old-style. The old-style conversions are perfectly valid, even if the "-only" and "-or-later" versions are perhaps more descriptive. * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (185 commits) scsi: qla2xxx: move IO flush to the front of NVME rport unregistration scsi: qla2xxx: Fix NVME cmd and LS cmd timeout race condition scsi: qla2xxx: on session delete, return nvme cmd scsi: qla2xxx: Fix kernel crash after disconnecting NVMe devices scsi: megaraid_sas: Update driver version to 07.710.06.00-rc1 scsi: megaraid_sas: Introduce various Aero performance modes scsi: megaraid_sas: Use high IOPS queues based on IO workload scsi: megaraid_sas: Set affinity for high IOPS reply queues scsi: megaraid_sas: Enable coalescing for high IOPS queues scsi: megaraid_sas: Add support for High IOPS queues scsi: megaraid_sas: Add support for MPI toolbox commands scsi: megaraid_sas: Offload Aero RAID5/6 division calculations to driver scsi: megaraid_sas: RAID1 PCI bandwidth limit algorithm is applicable for only Ventura scsi: megaraid_sas: megaraid_sas: Add check for count returned by HOST_DEVICE_LIST DCMD scsi: megaraid_sas: Handle sequence JBOD map failure at driver level scsi: megaraid_sas: Don't send FPIO to RL Bypass queue scsi: megaraid_sas: In probe context, retry IOC INIT once if firmware is in fault scsi: megaraid_sas: Release Mutex lock before OCR in case of DCMD timeout scsi: megaraid_sas: Call disable_irq from process IRQ poll scsi: megaraid_sas: Remove few debug counters from IO path ...	2019-07-11 15:14:01 -07:00
James Smart	f60cb93bbf	lpfc: add support to generate RSCN events for nport This patch adds general RSCN support: - The ability to transmit an RSCN to the port on the other end of the link (regular port if pt2pt, or fabric controller if fabric). - And general recognition of an RSCN ELS when an ELS is received. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-06-21 11:08:37 +02:00
YueHaibing	d7b761b069	scsi: lpfc: Make some symbols static Fix sparse warnings: drivers/scsi/lpfc/lpfc_sli.c:115:1: warning: symbol 'lpfc_sli4_pcimem_bcopy' was not declared. Should it be static? drivers/scsi/lpfc/lpfc_sli.c:7854:1: warning: symbol 'lpfc_sli4_process_missed_mbox_completions' was not declared. Should it be static? drivers/scsi/lpfc/lpfc_nvmet.c:223:27: warning: symbol 'lpfc_nvmet_get_ctx_for_xri' was not declared. Should it be static? drivers/scsi/lpfc/lpfc_nvmet.c:245:27: warning: symbol 'lpfc_nvmet_get_ctx_for_oxid' was not declared. Should it be static? drivers/scsi/lpfc/lpfc_init.c:75:10: warning: symbol 'lpfc_present_cpu' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-06-18 19:46:24 -04:00
James Smart	657add4e5e	scsi: lpfc: Fix poor use of hardware queues if fewer irq vectors While fixing the resources per socket, realized the driver was not using hardware queues (up to 1 per cpu) if there were fewer interrupt vectors. The driver was only using the hardware queue assigned to the cpu with the vector. Rework the affinity map check to use the additional hardware queue elements that had been allocated. If the cpu count exceeds the hardware queue count - share, but choose what is shared with by: hyperthread peer, core peer, socket peer, or finally similar cpu in a different socket. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-06-18 19:46:22 -04:00
James Smart	04d210c98e	scsi: lpfc: Fix memory leak in abnormal exit path from lpfc_eq_create eq create is leaking mailbox memory if it encounters an error. rework error path to free the memory. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-06-18 19:46:21 -04:00
James Smart	d74a89aab9	scsi: lpfc: Separate CQ processing for nvmet_fc upcalls Currently the driver is notified of new command frame receipt by CQEs. As part of the CQE processing, the driver upcalls the nvmet_fc transport to deliver the command. nvmet_fc, as part of receiving the command builds out a context for it, where one of the first steps is to allocate memory for the io. When running with tests that do large ios (1MB), it was found on some systems, the total number of outstanding I/O's, at 1MB per, completely consumed the system's memory. Thus additional ios were getting blocked in the memory allocator. Given that this blocked the lpfc thread processing CQEs, there were lots of other commands that were received and which are then held up, and given CQEs are serially processed, the aggregate delays for an IO waiting behind the others became cummulative - enough so that the initiator hit timeouts for the ios. The basic fix is to avoid the direct upcall and instead schedule a work item for each io as it is received. This allows the cq processing to complete very quickly, and each io can then run or block on it's own. However, this general solution hurts latency when there are few ios. As such, implemented the fix such that the driver watches how many CQEs it has processed sequentially in one run. As long as the count is below a threshold, the direct nvmet_fc upcall will be made. Only when the count is exceeded will it revert to work scheduling. Given that debug of this showed a surprisingly long delay in cq processing, the io timer stats were updated to better reflect the processing of the different points. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-06-18 19:46:21 -04:00
James Smart	e2a8be5696	scsi: lpfc: resolve lockdep warnings There were a number of erroneous comments and incorrect older lockdep checks that were causing a number of warnings. Resolve the following: - Inconsistent lock state warnings in lpfc_nvme_info_show(). - Fixed comments and code on sequences where ring lock is now held instead of hbalock. - Reworked calling sequences around lpfc_sli_iocbq_lookup(). Rather than locking prior to the routine and have routine guess on what lock, take the lock within the routine. The lockdep check becomes unnecessary. - Fixed comments and removed erroneous hbalock checks. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> CC: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-05-13 20:32:49 -04:00
Bart Van Assche	d6d189ceab	scsi: lpfc: Change smp_processor_id() into raw_smp_processor_id() This patch avoids that a kernel warning appears when smp_processor_id() is called with preempt debugging enabled. Cc: James Smart <james.smart@broadcom.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-04-03 23:11:36 -04:00
Bart Van Assche	d8c2040bf9	scsi: lpfc: Remove unused functions Remove those functions that are not called from outside the removed functions. Cc: James Smart <james.smart@broadcom.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-04-03 23:11:36 -04:00
Bart Van Assche	ffd43814d9	scsi: lpfc: Fix indentation and balance braces This patch avoid that smatch complains about misleading indentation. Cc: James Smart <james.smart@broadcom.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-04-03 23:11:35 -04:00
Bart Van Assche	3999df75bc	scsi: lpfc: Declare local functions static This patch avoids that the compiler complains about missing declarations when building with W=1. Cc: James Smart <james.smart@broadcom.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-04-03 23:11:35 -04:00
James Smart	92f3b32718	scsi: lpfc: Fixup eq_clr_intr references Declaring interrupt clear routines as inline is bogus as they are used as an indirect pointer. Remove the inline references. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-20 20:03:47 -04:00
James Bottomley	c88725dd14	scsi: lpfc: Fix build error You can't declare a function inline in a header if it doesn't have a body available to the compiler. So realistically you either don't declare it inline or you make it a static inline in the header. I think the latter applies in this case, so this should be the fix Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-20 20:03:47 -04:00
James Smart	c1a21ebc0f	scsi: lpfc: Specify node affinity for queue memory allocation Change the SLI4 queue creation code to use NUMA node based memory allocation based on the cpu the queues will be related to. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 13:15:09 -04:00
James Smart	9afbee3d62	scsi: lpfc: Reduce memory footprint for lpfc_queue Currently the driver maintains a sideband structure which has a pointer for each queue element. However, at 8 bytes per pointer, and up to 4k elements per queue, and 100s of queues, this can take up a lot of memory. Convert the driver to using an access routine that calculates the element address based on its index rather than using the pointer table. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 13:15:09 -04:00
James Smart	b3b4f3e1d5	scsi: lpfc: Correct boot bios information to FDMI registration The driver is currently reporting the firmware revision not the actual boot bios version in FDMI data. Modify the driver to obtain the boot bios version from the adapter and use that data in the FMDI data sent to the switch. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 13:15:09 -04:00
James Smart	e8869f5b0a	scsi: lpfc: Fix mailbox hang on adapter init The adapter initialization sequence enables interrupts, initializes the adapter link_state to LINK_DOWN, then issues commands to initialize the adapter. The interrupt handler on the adapter validates the link_state (has to be at least LINK_DOWN) and if invalid, will discard the interrupting event. In most cases, there is not a command completion, thus an interrupt until the initialization commands have been sent which is post the setting of state to LINK_DOWN. However, in cases of firmware reset, the reset will modify the link_state to an invalid value (indicating a reset of the adapter) and there occasionally are cases where the adapter will generate an asynchronous event which shares the eq/cq used for mailbox commands. In the failure case, an interrupt is generated immediately after enabling them due to the async event. As link_state is invalid, the eq is list and the CQ not serviced. At this point link_state is initialized and the mailbox command sent. As the CQ has not been serviced, it is not armed, so no interrupt event is generated when the mailbox command completes. Modify the initialization sequence so that interrupts are enabled after link_state is properly initialized, which avoids the race condition with the async event. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 13:15:08 -04:00
James Smart	e2ffe4d5dc	scsi: lpfc: Convert bootstrap mbx polling from msleep to udelay Current code is using msleep when polling for hw ready. Unfortunately the msleep routine isn't very accurate on rescheduling. In fact, on a busy systems which reset the adapter, it became 10s of seconds before it was rescheduled. Fix by busy waiting using udelay. As we're now busy waiting, significantly reduce the wait time so that we can exit the pool loop as soon as possible. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 12:57:02 -04:00
James Smart	4645f7b56a	scsi: lpfc: Coordinate adapter error handling with offline handling The driver periodically checks for adapter error in a background thread. If the thread detects an error, the adapter will be reset including the deletion and reallocation of workqueues on the adapter. Simultaneously, there may be a user-space request to offline the adapter which may try to do many of the same steps, in parallel, on a different thread. As memory was deallocated while unexpected, the parallel offline request hit a bad pointer. Add coordination between the two threads. The error recovery thread has precedence. So, when an error is detected, a flag is set on the adapter to indicate the error thread is terminating the adapter. But, before doing that work, it will look for a flag that is set by the offline flow, and if set, will wait for it to complete before then processing the error handling path. Similarly, in the offline thread, it first checks for whether the error thread is resetting the adapter, and if so, will then wait for the error thread to finish. Only after it has finished, will it set its flag and offline the adapter. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-03-19 12:57:02 -04:00

1 2 3 4 5 ...

518 Commits (redonkable)