svn commit: r351747 - head/sys/dev/nvme
Warner Losh
imp at bsdimp.com
Wed Sep 4 01:20:35 UTC 2019
On Tue, Sep 3, 2019 at 7:19 PM Warner Losh <imp at bsdimp.com> wrote:
>
>
> On Tue, Sep 3, 2019 at 2:38 PM Ravi Pokala <rpokala at freebsd.org> wrote:
>
>> Hi Warner,
>>
>> + /*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> +	 */
>> +	nvme_ctrlr_delete_qpairs(ctrlr);
>> +	nvme_ctrlr_disable_qpairs(ctrlr);
>> +	DELAY(100*1000);
>> +	nvme_ctrlr_shutdown(ctrlr);
>>
>> This seems backwards to me. From that section:
>>
>> > The host should perform the following actions in sequence for a normal shutdown:
>> > 1. Stop submitting any new I/O commands to the controller and allow any outstanding commands to complete;
>> > 2. If the controller implements I/O queues, then the host should delete all I/O Submission Queues, using the Delete I/O Submission Queue command. A result of the successful completion of the Delete I/O Submission Queue command is that any remaining commands outstanding are aborted;
>> > 3. If the controller implements I/O queues, then the host should delete all I/O Completion Queues, using the Delete I/O Completion Queue command; and
>> > 4. The host should set the Shutdown Notification (CC.SHN) field to 01b to indicate a normal shutdown operation. The controller indicates when shutdown processing is completed by updating the Shutdown Status (CSTS.SHST) field to 10b.
>>
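Step (4) of that sequence is a plain register handshake, which in this driver is nvme_ctrlr_shutdown()'s job. A minimal sketch of the idea follows; the register fields come from the spec, but the function name example_shutdown is hypothetical and the accessor/constant names are modeled on (not guaranteed to match) the driver's macros:

	/*
	 * Request a normal shutdown (CC.SHN = 01b), then poll until the
	 * controller reports shutdown processing complete (CSTS.SHST = 10b).
	 */
	static void
	example_shutdown(struct nvme_controller *ctrlr)
	{
		uint32_t ccreg, cstsreg;
		int ms = 5000;		/* arbitrary timeout budget, in ms */

		ccreg = nvme_mmio_read_4(ctrlr, cc);
		ccreg &= ~(NVME_CC_REG_SHN_MASK << NVME_CC_REG_SHN_SHIFT);
		ccreg |= NVME_SHN_NORMAL << NVME_CC_REG_SHN_SHIFT;
		nvme_mmio_write_4(ctrlr, cc, ccreg);

		while (ms-- > 0) {
			cstsreg = nvme_mmio_read_4(ctrlr, csts);
			if (NVME_CSTS_GET_SHST(cstsreg) == NVME_SHST_COMPLETE)
				return;
			DELAY(1000);	/* 1 ms between polls */
		}
		/* Timed out; the drive may still be flushing metadata. */
	}
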
>> You are calling nvme_ctrlr_delete_qpairs() -- which implements steps (2) and (3) -- before you are calling nvme_ctrlr_disable_qpairs() -- which implements step (1).
>>
>> Or am I missing something?
>>
>
> The system stops all I/O and makes sure all I/O is complete before
> starting to suspend. Done in the reverse order, I don't have the queues...
>
Doh!
disable_qpairs() makes it impossible for delete_qpairs() to send its
commands as well, so I have to do the delete first. Not entirely sure why,
but since the OS enforces the I/O ban, not the driver, I'm good. Otherwise,
I'd have to drain all my queues, etc., before starting this and come up with
some interlock to prevent new I/O from being submitted...
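The constraint falls out of the queue-delete operations themselves being admin commands: as the diff below shows, nvme_ctrlr_disable_qpairs() calls nvme_admin_qpair_disable() along with the per-queue nvme_io_qpair_disable() calls, so once it runs there is no path left over which to submit the Delete I/O Submission/Completion Queue commands. Annotated, the order the commit settles on is:

	nvme_ctrlr_delete_qpairs(ctrlr);   /* spec steps (2) and (3): admin cmds */
	nvme_ctrlr_disable_qpairs(ctrlr);  /* driver-side qpair teardown */
	DELAY(100*1000);                   /* paranoia, mirroring hw_reset */
	nvme_ctrlr_shutdown(ctrlr);        /* spec step (4): CC.SHN handshake */

Step (1), quiescing new I/O, is handled above the driver by the suspend machinery.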
Warner
> Warner
>
>
>
>>     When we suspend, we need to properly shut down the NVMe controller. The
>>     controller may go into D3 state (or may have the power removed), and to
>>     properly flush the metadata to non-volatile RAM, we must complete a
>>     normal shutdown. This consists of deleting the I/O queues and setting
>>     the shutdown bit. We have to do some extra stuff to make sure we reset
>>     the software state of the queues as well.
>>
>>     On resume, we have to reset the card twice, for reasons described in
>>     the attach function. Once we've done that, we can restart the card. If
>>     any of this fails, we'll fail the NVMe card, just like we do when a
>>     reset fails.
>>
>>     Set is_resetting for the duration of the suspend / resume. This keeps
>>     the reset taskqueue from running a concurrent reset, and also is needed
>>     to prevent any hw completions from queueing more I/O to the card. Pass
>>     the resetting flag to nvme_ctrlr_start. It doesn't need to get that
>>     from the global state of the ctrlr. Wait for any pending reset to
>>     finish. All queued I/O will get sent to the hardware as part of
>>     nvme_ctrlr_start(), though the upper layers shouldn't send any down.
>>     Disabling the qpairs is the other failsafe to ensure all I/O is queued.
>>
>>     Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
>>     confusion with all the other destroy functions. It just removes the
>>     queues in hardware, while the other _destroy_ functions tear down
>>     driver data structures.
>>
>>     Split parts of the hardware reset function up so that I can do part of
>>     the reset in suspend. Split out the software disabling of the qpairs
>>     into nvme_ctrlr_disable_qpairs.
>>
>>     Finally, fix a couple of spelling errors in comments related to this.
>>
>> Relnotes: Yes
>> MFC After: 1 week
>> Reviewed by: scottl@ (prior version)
>> Differential Revision: https://reviews.freebsd.org/D21493
>>
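To make the "split parts of the hardware reset function" item concrete, here is a rough sketch of the shape nvme_ctrlr_hw_reset() takes after the split. This is an outline reconstructed from the hunks below, not a verbatim quote of the committed source; the exact tail may differ:

	int
	nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
	{
		int err;

		/* Software-side qpair disable, now shared with suspend. */
		nvme_ctrlr_disable_qpairs(ctrlr);

		DELAY(100*1000);	/* paranoia; pending I/O should be gone */

		/* Hardware disable, then re-enable the controller. */
		err = nvme_ctrlr_disable(ctrlr);
		if (err != 0)
			return (err);
		return (nvme_ctrlr_enable(ctrlr));
	}

The same disable-qpairs/delay prefix is what nvme_ctrlr_suspend() reuses, minus the hardware re-enable.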
>> Modified:
>> head/sys/dev/nvme/nvme.c
>> head/sys/dev/nvme/nvme_ctrlr.c
>> head/sys/dev/nvme/nvme_pci.c
>> head/sys/dev/nvme/nvme_private.h
>>
>> Modified: head/sys/dev/nvme/nvme.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme.c   Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme.c   Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -137,9 +137,10 @@ nvme_attach(device_t dev)
>> }
>>
>> /*
>> -	 * Reset controller twice to ensure we do a transition from cc.en==1
>> -	 * to cc.en==0. This is because we don't really know what status
>> -	 * the controller was left in when boot handed off to OS.
>> +	 * Reset controller twice to ensure we do a transition from cc.en==1 to
>> +	 * cc.en==0. This is because we don't really know what status the
>> +	 * controller was left in when boot handed off to OS. Linux doesn't do
>> +	 * this, however. If we adopt that policy, see also nvme_ctrlr_resume().
>> */
>> status = nvme_ctrlr_hw_reset(ctrlr);
>> if (status != 0) {
>>
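The cc.en transition that comment talks about is nvme_ctrlr_disable()'s job. A minimal sketch of the idea, with the function name example_ctrlr_disable hypothetical and the field accessors modeled on (not guaranteed to match) the driver's register macros:

	/*
	 * Force an EN 1 -> 0 edge: if the controller is enabled, first
	 * wait for it to report ready (so we never clear EN mid
	 * transition), then clear EN and wait for RDY to drop.
	 */
	static int
	example_ctrlr_disable(struct nvme_controller *ctrlr)
	{
		uint32_t ccreg = nvme_mmio_read_4(ctrlr, cc);

		if (NVME_CC_GET_EN(ccreg)) {
			nvme_ctrlr_wait_for_ready(ctrlr, 1);
			ccreg &= ~(NVME_CC_REG_EN_MASK << NVME_CC_REG_EN_SHIFT);
			nvme_mmio_write_4(ctrlr, cc, ccreg);
		}
		return (nvme_ctrlr_wait_for_ready(ctrlr, 0));
	}

Running the reset twice from an unknown boot-time state guarantees at least one full 1 -> 0 -> 1 cycle, which is the property the comment is after.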
>> Modified: head/sys/dev/nvme/nvme_ctrlr.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_ctrlr.c     Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_ctrlr.c     Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -118,8 +118,8 @@ nvme_ctrlr_construct_io_qpairs(struct nvme_controller
>>
>> /*
>> * Our best estimate for the maximum number of I/Os that we should
>> -	 * noramlly have in flight at one time. This should be viewed as a hint,
>> -	 * not a hard limit and will need to be revisitted when the upper layers
>> +	 * normally have in flight at one time. This should be viewed as a hint,
>> +	 * not a hard limit and will need to be revisited when the upper layers
>>  	 * of the storage system grows multi-queue support.
>>  	 */
>>  	ctrlr->max_hw_pend_io = num_trackers * ctrlr->num_io_queues * 3 / 4;
>> @@ -344,10 +344,10 @@ nvme_ctrlr_enable(struct nvme_controller *ctrlr)
>> return (nvme_ctrlr_wait_for_ready(ctrlr, 1));
>> }
>>
>> -int
>> -nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +static void
>> +nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
>> {
>> - int i, err;
>> + int i;
>>
>> nvme_admin_qpair_disable(&ctrlr->adminq);
>> /*
>> @@ -359,7 +359,15 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> nvme_io_qpair_disable(&ctrlr->ioq[i]);
>> }
>> +}
>>
>> +int
>> +nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +{
>> + int err;
>> +
>> + nvme_ctrlr_disable_qpairs(ctrlr);
>> +
>> DELAY(100*1000);
>>
>> err = nvme_ctrlr_disable(ctrlr);
>> @@ -481,7 +489,7 @@ nvme_ctrlr_create_qpairs(struct nvme_controller *ctrlr
>> }
>>
>> static int
>> -nvme_ctrlr_destroy_qpairs(struct nvme_controller *ctrlr)
>> +nvme_ctrlr_delete_qpairs(struct nvme_controller *ctrlr)
>> {
>> struct nvme_completion_poll_status status;
>> struct nvme_qpair *qpair;
>> @@ -820,7 +828,7 @@ nvme_ctrlr_configure_int_coalescing(struct nvme_contro
>> }
>>
>> static void
>> -nvme_ctrlr_start(void *ctrlr_arg)
>> +nvme_ctrlr_start(void *ctrlr_arg, bool resetting)
>> {
>> struct nvme_controller *ctrlr = ctrlr_arg;
>> uint32_t old_num_io_queues;
>> @@ -833,7 +841,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>> * the number of I/O queues supported, so cannot reset
>> * the adminq again here.
>> */
>> - if (ctrlr->is_resetting)
>> + if (resetting)
>> nvme_qpair_reset(&ctrlr->adminq);
>>
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> @@ -854,7 +862,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>>  	 * explicit specify how many queues it will use. This value should
>>  	 * never change between resets, so panic if somehow that does happen.
>> */
>> - if (ctrlr->is_resetting) {
>> + if (resetting) {
>> old_num_io_queues = ctrlr->num_io_queues;
>> if (nvme_ctrlr_set_num_qpairs(ctrlr) != 0) {
>> nvme_ctrlr_fail(ctrlr);
>> @@ -894,7 +902,7 @@ nvme_ctrlr_start_config_hook(void *arg)
>>
>> if (nvme_ctrlr_set_num_qpairs(ctrlr) == 0 &&
>> nvme_ctrlr_construct_io_qpairs(ctrlr) == 0)
>> - nvme_ctrlr_start(ctrlr);
>> + nvme_ctrlr_start(ctrlr, false);
>> else
>> nvme_ctrlr_fail(ctrlr);
>>
>> @@ -923,7 +931,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
>> */
>> pause("nvmereset", hz / 10);
>> if (status == 0)
>> - nvme_ctrlr_start(ctrlr);
>> + nvme_ctrlr_start(ctrlr, true);
>> else
>> nvme_ctrlr_fail(ctrlr);
>>
>> @@ -946,7 +954,7 @@ nvme_ctrlr_poll(struct nvme_controller *ctrlr)
>> }
>>
>> /*
>> - * Poll the single-vector intertrupt case: num_io_queues will be 1 and
>> + * Poll the single-vector interrupt case: num_io_queues will be 1 and
>> * there's only a single vector. While we're polling, we mask further
>> * interrupts in the controller.
>> */
>> @@ -1012,7 +1020,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>> if (is_user_buffer) {
>> /*
>>  		 * Ensure the user buffer is wired for the duration of
>> - * this passthrough command.
>> + * this pass-through command.
>> */
>> PHOLD(curproc);
>> buf = uma_zalloc(pbuf_zone, M_WAITOK);
>> @@ -1031,7 +1039,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>> } else
>> req = nvme_allocate_request_null(nvme_pt_done, pt);
>>
>> - /* Assume userspace already converted to little-endian */
>> + /* Assume user space already converted to little-endian */
>> req->cmd.opc = pt->cmd.opc;
>> req->cmd.fuse = pt->cmd.fuse;
>> req->cmd.rsvd2 = pt->cmd.rsvd2;
>> @@ -1206,7 +1214,7 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, dev
>>
>> if (ctrlr->is_initialized) {
>> if (!gone)
>> - nvme_ctrlr_destroy_qpairs(ctrlr);
>> + nvme_ctrlr_delete_qpairs(ctrlr);
>> for (i = 0; i < ctrlr->num_io_queues; i++)
>> nvme_io_qpair_destroy(&ctrlr->ioq[i]);
>> free(ctrlr->ioq, M_NVME);
>> @@ -1305,4 +1313,88 @@ nvme_ctrlr_get_data(struct nvme_controller *ctrlr)
>> {
>>
>> return (&ctrlr->cdata);
>> +}
>> +
>> +int
>> +nvme_ctrlr_suspend(struct nvme_controller *ctrlr)
>> +{
>> + int to = hz;
>> +
>> + /*
>> + * Can't touch failed controllers, so it's already suspended.
>> + */
>> + if (ctrlr->is_failed)
>> + return (0);
>> +
>> + /*
>> +	 * We don't want the reset taskqueue running, since it does similar
>> +	 * things, so prevent it from running after we start. Wait for any reset
>> +	 * that may have been started to complete. The reset process we follow
>> +	 * will ensure that any new I/O will queue and be given to the hardware
>> +	 * after we resume (though there should be none).
>> +	 */
>> +	while (atomic_cmpset_32(&ctrlr->is_resetting, 0, 1) == 0 && to-- > 0)
>> +		pause("nvmesusp", 1);
>> +	if (to <= 0) {
>> +		nvme_printf(ctrlr,
>> +		    "Competing reset task didn't finish. Try again later.\n");
>> + return (EWOULDBLOCK);
>> + }
>> +
>> + /*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> + */
>> + nvme_ctrlr_delete_qpairs(ctrlr);
>> + nvme_ctrlr_disable_qpairs(ctrlr);
>> + DELAY(100*1000);
>> + nvme_ctrlr_shutdown(ctrlr);
>> +
>> + return (0);
>> +}
>> +
>> +int
>> +nvme_ctrlr_resume(struct nvme_controller *ctrlr)
>> +{
>> +
>> + /*
>> + * Can't touch failed controllers, so nothing to do to resume.
>> + */
>> + if (ctrlr->is_failed)
>> + return (0);
>> +
>> + /*
>> +	 * Have to reset the hardware twice, just like we do on attach. See
>> +	 * nvme_attach() for why.
>> + */
>> + if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> + goto fail;
>> + if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> + goto fail;
>> +
>> + /*
>> +	 * Now that we've reset the hardware, we can restart the controller. Any
>> +	 * I/O that was pending is requeued. Any admin commands are aborted with
>> +	 * an error. Once we've restarted, take the controller out of reset.
>> + */
>> + nvme_ctrlr_start(ctrlr, true);
>> + atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> +
>> + return (0);
>> +fail:
>> + /*
>> +	 * Since we can't bring the controller out of reset, announce and fail
>> +	 * the controller. However, we have to return success for the resume
>> +	 * itself, due to questionable APIs.
>> + */
>> + nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
>> + nvme_ctrlr_fail(ctrlr);
>> + atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> + return (0);
>> }
>>
>> Modified: head/sys/dev/nvme/nvme_pci.c
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_pci.c       Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_pci.c       Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
>> static int nvme_pci_probe(device_t);
>> static int nvme_pci_attach(device_t);
>> static int nvme_pci_detach(device_t);
>> +static int nvme_pci_suspend(device_t);
>> +static int nvme_pci_resume(device_t);
>>
>> static void nvme_ctrlr_setup_interrupts(struct nvme_controller *ctrlr);
>>
>> @@ -51,6 +53,8 @@ static device_method_t nvme_pci_methods[] = {
>> DEVMETHOD(device_probe, nvme_pci_probe),
>> DEVMETHOD(device_attach, nvme_pci_attach),
>> DEVMETHOD(device_detach, nvme_pci_detach),
>> + DEVMETHOD(device_suspend, nvme_pci_suspend),
>> + DEVMETHOD(device_resume, nvme_pci_resume),
>> DEVMETHOD(device_shutdown, nvme_shutdown),
>> { 0, 0 }
>> };
>> @@ -331,4 +335,22 @@ nvme_ctrlr_setup_interrupts(struct nvme_controller *ct
>> }
>>
>> ctrlr->msix_enabled = 1;
>> +}
>> +
>> +static int
>> +nvme_pci_suspend(device_t dev)
>> +{
>> + struct nvme_controller *ctrlr;
>> +
>> + ctrlr = DEVICE2SOFTC(dev);
>> + return (nvme_ctrlr_suspend(ctrlr));
>> +}
>> +
>> +static int
>> +nvme_pci_resume(device_t dev)
>> +{
>> + struct nvme_controller *ctrlr;
>> +
>> + ctrlr = DEVICE2SOFTC(dev);
>> + return (nvme_ctrlr_resume(ctrlr));
>> }
>>
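Nothing in the driver calls these two methods directly; newbus invokes them when the system sleeps or the device is suspended administratively. A rough sketch of the parent-bus side of the contract, modeled on bus_generic_suspend() (the function name example_bus_suspend is hypothetical, and the error unwinding that resumes already-suspended siblings is elided):

	static int
	example_bus_suspend(device_t bus)
	{
		device_t *children;
		int error, i, nchildren;

		if (device_get_children(bus, &children, &nchildren) != 0)
			return (ENXIO);
		error = 0;
		for (i = 0; i < nchildren && error == 0; i++)
			error = DEVICE_SUSPEND(children[i]); /* -> nvme_pci_suspend() */
		free(children, M_TEMP);
		return (error);
	}

On resume the walk repeats with DEVICE_RESUME(), which lands in nvme_pci_resume() and from there in nvme_ctrlr_resume().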
>> Modified: head/sys/dev/nvme/nvme_private.h
>>
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_private.h   Tue Sep  3 14:55:19 2019        (r351746)
>> +++ head/sys/dev/nvme/nvme_private.h   Tue Sep  3 15:26:11 2019        (r351747)
>> @@ -556,4 +556,7 @@ void nvme_notify_ns(struct nvme_controller *ctrlr, int
>> void nvme_ctrlr_intx_handler(void *arg);
>> void nvme_ctrlr_poll(struct nvme_controller *ctrlr);
>>
>> +int nvme_ctrlr_suspend(struct nvme_controller *ctrlr);
>> +int nvme_ctrlr_resume(struct nvme_controller *ctrlr);
>> +
>> #endif /* __NVME_PRIVATE_H__ */