svn commit: r351747 - head/sys/dev/nvme
Warner Losh
imp at bsdimp.com
Wed Sep 4 01:20:35 UTC 2019
On Tue, Sep 3, 2019 at 7:19 PM Warner Losh <imp at bsdimp.com> wrote:
>
>
> On Tue, Sep 3, 2019 at 2:38 PM Ravi Pokala <rpokala at freebsd.org> wrote:
>
>> Hi Warner,
>>
>> + /*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> +	 */
>> +	nvme_ctrlr_delete_qpairs(ctrlr);
>> +	nvme_ctrlr_disable_qpairs(ctrlr);
>> +	DELAY(100*1000);
>> +	nvme_ctrlr_shutdown(ctrlr);
>>
>> This seems backwards to me. From that section:
>>
>> > The host should perform the following actions in sequence for a normal shutdown:
>> > 1. Stop submitting any new I/O commands to the controller and allow any outstanding commands to complete;
>> > 2. If the controller implements I/O queues, then the host should delete all I/O Submission Queues, using the Delete I/O Submission Queue command. A result of the successful completion of the Delete I/O Submission Queue command is that any remaining commands outstanding are aborted;
>> > 3. If the controller implements I/O queues, then the host should delete all I/O Completion Queues, using the Delete I/O Completion Queue command; and
>> > 4. The host should set the Shutdown Notification (CC.SHN) field to 01b to indicate a normal shutdown operation. The controller indicates when shutdown processing is completed by updating the Shutdown Status (CSTS.SHST) field to 10b.
>>
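Step (4) of that sequence is a plain register handshake, which in this driver is nvme_ctrlr_shutdown()'s job. A minimal sketch of the idea follows; the register fields come from the spec, but the function name example_shutdown is hypothetical and the accessor/constant names are modeled on (not guaranteed to match) the driver's macros:

	/*
	 * Request a normal shutdown (CC.SHN = 01b), then poll until the
	 * controller reports shutdown processing complete (CSTS.SHST = 10b).
	 */
	static void
	example_shutdown(struct nvme_controller *ctrlr)
	{
		uint32_t ccreg, cstsreg;
		int ms = 5000;		/* arbitrary timeout budget, in ms */

		ccreg = nvme_mmio_read_4(ctrlr, cc);
		ccreg &= ~(NVME_CC_REG_SHN_MASK << NVME_CC_REG_SHN_SHIFT);
		ccreg |= NVME_SHN_NORMAL << NVME_CC_REG_SHN_SHIFT;
		nvme_mmio_write_4(ctrlr, cc, ccreg);

		while (ms-- > 0) {
			cstsreg = nvme_mmio_read_4(ctrlr, csts);
			if (NVME_CSTS_GET_SHST(cstsreg) == NVME_SHST_COMPLETE)
				return;
			DELAY(1000);	/* 1 ms between polls */
		}
		/* Timed out; the drive may still be flushing metadata. */
	}
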
>> You are calling nvme_ctrlr_delete_qpairs() -- which implements steps (2) and (3) -- before you are calling nvme_ctrlr_disable_qpairs() -- which implements step (1).
>>
>> Or am I missing something?
>>
>
> The system stops all I/O and makes sure all I/O is complete before
> starting to suspend. Done in the reverse order, I don't have the queues...
>
Doh!
disable_qpairs() makes it impossible for delete_qpairs() to send its
commands as well, so I have to do the delete first. Not entirely sure why,
but since the OS enforces the I/O ban, not the driver, I'm good. Otherwise,
I'd have to drain all my queues, etc., before starting this and come up with
some interlock to prevent new I/O from being submitted...
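The constraint falls out of the queue-delete operations themselves being admin commands: as the diff below shows, nvme_ctrlr_disable_qpairs() calls nvme_admin_qpair_disable() along with the per-queue nvme_io_qpair_disable() calls, so once it runs there is no path left over which to submit the Delete I/O Submission/Completion Queue commands. Annotated, the order the commit settles on is:

	nvme_ctrlr_delete_qpairs(ctrlr);   /* spec steps (2) and (3): admin cmds */
	nvme_ctrlr_disable_qpairs(ctrlr);  /* driver-side qpair teardown */
	DELAY(100*1000);                   /* paranoia, mirroring hw_reset */
	nvme_ctrlr_shutdown(ctrlr);        /* spec step (4): CC.SHN handshake */

Step (1), quiescing new I/O, is handled above the driver by the suspend machinery.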
Warner
> Warner
>
>
>
>>     When we suspend, we need to properly shut down the NVMe controller. The
>>     controller may go into D3 state (or may have the power removed), and to
>>     properly flush the metadata to non-volatile RAM, we must complete a
>>     normal shutdown. This consists of deleting the I/O queues and setting
>>     the shutdown bit. We have to do some extra stuff to make sure we reset
>>     the software state of the queues as well.
>>
>>     On resume, we have to reset the card twice, for reasons described in
>>     the attach function. Once we've done that, we can restart the card. If
>>     any of this fails, we'll fail the NVMe card, just like we do when a
>>     reset fails.
>>
>>     Set is_resetting for the duration of the suspend / resume. This keeps
>>     the reset taskqueue from running a concurrent reset, and also is needed
>>     to prevent any hw completions from queueing more I/O to the card. Pass
>>     the resetting flag to nvme_ctrlr_start. It doesn't need to get that
>>     from the global state of the ctrlr. Wait for any pending reset to
>>     finish. All queued I/O will get sent to the hardware as part of
>>     nvme_ctrlr_start(), though the upper layers shouldn't send any down.
>>     Disabling the qpairs is the other failsafe to ensure all I/O is queued.
>>
>>     Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
>>     confusion with all the other destroy functions. It just removes the
>>     queues in hardware, while the other _destroy_ functions tear down
>>     driver data structures.
>>
>>     Split parts of the hardware reset function up so that I can do part of
>>     the reset in suspend. Split out the software disabling of the qpairs
>>     into nvme_ctrlr_disable_qpairs.
>>
>>     Finally, fix a couple of spelling errors in comments related to this.
>>
>> Relnotes: Yes
>> MFC After: 1 week
>> Reviewed by: scottl@ (prior version)
>> Differential Revision: https://reviews.freebsd.org/D21493
>>
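To make the "split parts of the hardware reset function" item concrete, here is a rough sketch of the shape nvme_ctrlr_hw_reset() takes after the split. This is an outline reconstructed from the hunks below, not a verbatim quote of the committed source; the exact tail may differ:

	int
	nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
	{
		int err;

		/* Software-side qpair disable, now shared with suspend. */
		nvme_ctrlr_disable_qpairs(ctrlr);

		DELAY(100*1000);	/* paranoia; pending I/O should be gone */

		/* Hardware disable, then re-enable the controller. */
		err = nvme_ctrlr_disable(ctrlr);
		if (err != 0)
			return (err);
		return (nvme_ctrlr_enable(ctrlr));
	}

The same disable-qpairs/delay prefix is what nvme_ctrlr_suspend() reuses, minus the hardware re-enable.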
>> Modified:
>> head/sys/dev/nvme/nvme.c
>> head/sys/dev/nvme/nvme_ctrlr.c
>> head/sys/dev/nvme/nvme_pci.c
>> head/sys/dev/nvme/nvme_private.h
>>
>> Modified: head/sys/dev/nvme/nvme.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme.c   Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme.c   Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -137,9 +137,10 @@ nvme_attach(device_t dev)
>> }
>>
>> /*
>> -	 * Reset controller twice to ensure we do a transition from cc.en==1
>> -	 * to cc.en==0. This is because we don't really know what status
>> -	 * the controller was left in when boot handed off to OS.
>> +	 * Reset controller twice to ensure we do a transition from cc.en==1 to
>> +	 * cc.en==0. This is because we don't really know what status the
>> +	 * controller was left in when boot handed off to OS. Linux doesn't do
>> +	 * this, however. If we adopt that policy, see also nvme_ctrlr_resume().
>> */
>> status = nvme_ctrlr_hw_reset(ctrlr);
>> if (status != 0) {
>>
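The cc.en transition that comment talks about is nvme_ctrlr_disable()'s job. A minimal sketch of the idea, with the function name example_ctrlr_disable hypothetical and the field accessors modeled on (not guaranteed to match) the driver's register macros:

	/*
	 * Force an EN 1 -> 0 edge: if the controller is enabled, first
	 * wait for it to report ready (so we never clear EN mid
	 * transition), then clear EN and wait for RDY to drop.
	 */
	static int
	example_ctrlr_disable(struct nvme_controller *ctrlr)
	{
		uint32_t ccreg = nvme_mmio_read_4(ctrlr, cc);

		if (NVME_CC_GET_EN(ccreg)) {
			nvme_ctrlr_wait_for_ready(ctrlr, 1);
			ccreg &= ~(NVME_CC_REG_EN_MASK << NVME_CC_REG_EN_SHIFT);
			nvme_mmio_write_4(ctrlr, cc, ccreg);
		}
		return (nvme_ctrlr_wait_for_ready(ctrlr, 0));
	}

Running the reset twice from an unknown boot-time state guarantees at least one full 1 -> 0 -> 1 cycle, which is the property the comment is after.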
>> Modified: head/sys/dev/nvme/nvme_ctrlr.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_ctrlr.c     Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_ctrlr.c     Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -118,8 +118,8 @@ nvme_ctrlr_construct_io_qpairs(struct nvme_controller
>>
>> /*
>> * Our best estimate for the maximum number of I/Os that we should
>> -	 * noramlly have in flight at one time. This should be viewed as a hint,
>> -	 * not a hard limit and will need to be revisitted when the upper layers
>> +	 * normally have in flight at one time. This should be viewed as a hint,
>> +	 * not a hard limit and will need to be revisited when the upper layers
>>  	 * of the storage system grows multi-queue support.
>>  	 */
>>  	ctrlr->max_hw_pend_io = num_trackers * ctrlr->num_io_queues * 3 / 4;
>> @@ -344,10 +344,10 @@ nvme_ctrlr_enable(struct nvme_controller *ctrlr)
>> return (nvme_ctrlr_wait_for_ready(ctrlr, 1));
>> }
>>
>> -int
>> -nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +static void
>> +nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
>> {
>> - int i, err;
>> + int i;
>>
>> nvme_admin_qpair_disable(&ctrlr->adminq);
>> /*
>> @@ -359,7 +359,15 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> nvme_io_qpair_disable(&ctrlr->ioq[i]);
>> }
>> +}
>>
>> +int
>> +nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +{
>> + int err;
>> +
>> + nvme_ctrlr_disable_qpairs(ctrlr);
>> +
>> DELAY(100*1000);
>>
>> err = nvme_ctrlr_disable(ctrlr);
>> @@ -481,7 +489,7 @@ nvme_ctrlr_create_qpairs(struct nvme_controller *ctrlr
>> }
>>
>> static int
>> -nvme_ctrlr_destroy_qpairs(struct nvme_controller *ctrlr)
>> +nvme_ctrlr_delete_qpairs(struct nvme_controller *ctrlr)
>> {
>> struct nvme_completion_poll_status status;
>> struct nvme_qpair *qpair;
>> @@ -820,7 +828,7 @@ nvme_ctrlr_configure_int_coalescing(struct nvme_contro
>> }
>>
>> static void
>> -nvme_ctrlr_start(void *ctrlr_arg)
>> +nvme_ctrlr_start(void *ctrlr_arg, bool resetting)
>> {
>> struct nvme_controller *ctrlr = ctrlr_arg;
>> uint32_t old_num_io_queues;
>> @@ -833,7 +841,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>> * the number of I/O queues supported, so cannot reset
>> * the adminq again here.
>> */
>> - if (ctrlr->is_resetting)
>> + if (resetting)
>> nvme_qpair_reset(&ctrlr->adminq);
>>
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> @@ -854,7 +862,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>>  	 * explicit specify how many queues it will use. This value should
>>  	 * never change between resets, so panic if somehow that does happen.
>> */
>> - if (ctrlr->is_resetting) {
>> + if (resetting) {
>> old_num_io_queues = ctrlr->num_io_queues;
>> if (nvme_ctrlr_set_num_qpairs(ctrlr) != 0) {
>> nvme_ctrlr_fail(ctrlr);
>> @@ -894,7 +902,7 @@ nvme_ctrlr_start_config_hook(void *arg)
>>
>> if (nvme_ctrlr_set_num_qpairs(ctrlr) == 0 &&
>> nvme_ctrlr_construct_io_qpairs(ctrlr) == 0)
>> - nvme_ctrlr_start(ctrlr);
>> + nvme_ctrlr_start(ctrlr, false);
>> else
>> nvme_ctrlr_fail(ctrlr);
>>
>> @@ -923,7 +931,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
>> */
>> pause("nvmereset", hz / 10);
>> if (status == 0)
>> - nvme_ctrlr_start(ctrlr);
>> + nvme_ctrlr_start(ctrlr, true);
>> else
>> nvme_ctrlr_fail(ctrlr);
>>
>> @@ -946,7 +954,7 @@ nvme_ctrlr_poll(struct nvme_controller *ctrlr)
>> }
>>
>> /*
>> - * Poll the single-vector intertrupt case: num_io_queues will be 1 and
>> + * Poll the single-vector interrupt case: num_io_queues will be 1 and
>> * there's only a single vector. While we're polling, we mask further
>> * interrupts in the controller.
>> */
>> @@ -1012,7 +1020,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>> if (is_user_buffer) {
>> /*
>>  		 * Ensure the user buffer is wired for the duration of
>> - * this passthrough command.
>> + * this pass-through command.
>> */
>> PHOLD(curproc);
>> buf = uma_zalloc(pbuf_zone, M_WAITOK);
>> @@ -1031,7 +1039,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>> } else
>> req = nvme_allocate_request_null(nvme_pt_done, pt);
>>
>> - /* Assume userspace already converted to little-endian */
>> + /* Assume user space already converted to little-endian */
>> req->cmd.opc = pt->cmd.opc;
>> req->cmd.fuse = pt->cmd.fuse;
>> req->cmd.rsvd2 = pt->cmd.rsvd2;
>> @@ -1206,7 +1214,7 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, dev
>>
>> if (ctrlr->is_initialized) {
>> if (!gone)
>> - nvme_ctrlr_destroy_qpairs(ctrlr);
>> + nvme_ctrlr_delete_qpairs(ctrlr);
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> nvme_io_qpair_destroy(&ctrlr->ioq[i]);
>> free(ctrlr->ioq, M_NVME);
>> @@ -1305,4 +1313,88 @@ nvme_ctrlr_get_data(struct nvme_controller *ctrlr)
>> {
>>
>> return (&ctrlr->cdata);
>> +}
>> +
>> +int
>> +nvme_ctrlr_suspend(struct nvme_controller *ctrlr)
>> +{
>> + int to = hz;
>> +
>> + /*
>> + * Can't touch failed controllers, so it's already suspended.
>> + */
>> + if (ctrlr->is_failed)
>> + return (0);
>> +
>> + /*
>> +	 * We don't want the reset taskqueue running, since it does similar
>> +	 * things, so prevent it from running after we start. Wait for any reset
>> +	 * that may have been started to complete. The reset process we follow
>> +	 * will ensure that any new I/O will queue and be given to the hardware
>> +	 * after we resume (though there should be none).
>> +	 */
>> +	while (atomic_cmpset_32(&ctrlr->is_resetting, 0, 1) == 0 && to-- > 0)
>> +		pause("nvmesusp", 1);
>> +	if (to <= 0) {
>> +		nvme_printf(ctrlr,
>> +		    "Competing reset task didn't finish. Try again later.\n");
>> + return (EWOULDBLOCK);
>> + }
>> +
>> + /*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> + */
>> + nvme_ctrlr_delete_qpairs(ctrlr);
>> + nvme_ctrlr_disable_qpairs(ctrlr);
>> + DELAY(100*1000);
>> + nvme_ctrlr_shutdown(ctrlr);
>> +
>> + return (0);
>> +}
>> +
>> +int
>> +nvme_ctrlr_resume(struct nvme_controller *ctrlr)
>> +{
>> +
>> + /*
>> + * Can't touch failed controllers, so nothing to do to resume.
>> + */
>> + if (ctrlr->is_failed)
>> + return (0);
>> +
>> + /*
>> +	 * Have to reset the hardware twice, just like we do on attach. See
>> +	 * nvme_attach() for why.
>> + */
>> + if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> + goto fail;
>> + if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> + goto fail;
>> +
>> + /*
>> +	 * Now that we've reset the hardware, we can restart the controller. Any
>> +	 * I/O that was pending is requeued. Any admin commands are aborted with
>> +	 * an error. Once we've restarted, take the controller out of reset.
>> + */
>> + nvme_ctrlr_start(ctrlr, true);
>> + atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> +
>> + return (0);
>> +fail:
>> + /*
>> +	 * Since we can't bring the controller out of reset, announce and fail
>> +	 * the controller. However, we have to return success for the resume
>> +	 * itself, due to questionable APIs.
>> + */
>> + nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
>> + nvme_ctrlr_fail(ctrlr);
>> + atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> + return (0);
>> }
>>
>> Modified: head/sys/dev/nvme/nvme_pci.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_pci.c       Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_pci.c       Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
>> static int nvme_pci_probe(device_t);
>> static int nvme_pci_attach(device_t);
>> static int nvme_pci_detach(device_t);
>> +static int nvme_pci_suspend(device_t);
>> +static int nvme_pci_resume(device_t);
>>
>> static void nvme_ctrlr_setup_interrupts(struct nvme_controller *ctrlr);
>>
>> @@ -51,6 +53,8 @@ static device_method_t nvme_pci_methods[] = {
>> DEVMETHOD(device_probe, nvme_pci_probe),
>> DEVMETHOD(device_attach, nvme_pci_attach),
>> DEVMETHOD(device_detach, nvme_pci_detach),
>> + DEVMETHOD(device_suspend, nvme_pci_suspend),
>> + DEVMETHOD(device_resume, nvme_pci_resume),
>> DEVMETHOD(device_shutdown, nvme_shutdown),
>> { 0, 0 }
>> };
>> @@ -331,4 +335,22 @@ nvme_ctrlr_setup_interrupts(struct nvme_controller *ct
>> }
>>
>> ctrlr->msix_enabled = 1;
>> +}
>> +
>> +static int
>> +nvme_pci_suspend(device_t dev)
>> +{
>> + struct nvme_controller *ctrlr;
>> +
>> + ctrlr = DEVICE2SOFTC(dev);
>> + return (nvme_ctrlr_suspend(ctrlr));
>> +}
>> +
>> +static int
>> +nvme_pci_resume(device_t dev)
>> +{
>> + struct nvme_controller *ctrlr;
>> +
>> + ctrlr = DEVICE2SOFTC(dev);
>> + return (nvme_ctrlr_resume(ctrlr));
>> }
>>
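Nothing in the driver calls these two methods directly; newbus invokes them when the system sleeps or the device is suspended administratively. A rough sketch of the parent-bus side of the contract, modeled on bus_generic_suspend() (the function name example_bus_suspend is hypothetical, and the error unwinding that resumes already-suspended siblings is elided):

	static int
	example_bus_suspend(device_t bus)
	{
		device_t *children;
		int error, i, nchildren;

		if (device_get_children(bus, &children, &nchildren) != 0)
			return (ENXIO);
		error = 0;
		for (i = 0; i < nchildren && error == 0; i++)
			error = DEVICE_SUSPEND(children[i]); /* -> nvme_pci_suspend() */
		free(children, M_TEMP);
		return (error);
	}

On resume the walk repeats with DEVICE_RESUME(), which lands in nvme_pci_resume() and from there in nvme_ctrlr_resume().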
>> Modified: head/sys/dev/nvme/nvme_private.h
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_private.h   Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_private.h   Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -556,4 +556,7 @@ void nvme_notify_ns(struct nvme_controller *ctrlr, int
>> void nvme_ctrlr_intx_handler(void *arg);
>> void nvme_ctrlr_poll(struct nvme_controller *ctrlr);
>>
>> +int nvme_ctrlr_suspend(struct nvme_controller *ctrlr);
>> +int nvme_ctrlr_resume(struct nvme_controller *ctrlr);
>> +
>> #endif /* __NVME_PRIVATE_H__ */