svn commit: r351747 - head/sys/dev/nvme

Warner Losh imp at bsdimp.com
Wed Sep 4 01:19:15 UTC 2019


On Tue, Sep 3, 2019 at 2:38 PM Ravi Pokala <rpokala at freebsd.org> wrote:

> Hi Warner,
>
>     +   /*
>     +    * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>     +    * delete the hardware I/O queues, and then shutdown. This properly
>     +    * flushes any metadata the drive may have stored so it can survive
>     +    * having its power removed and prevents the unsafe shutdown count from
>     +    * incrementing. Once we delete the qpairs, we have to disable them
>     +    * before shutting down. The delay is out of paranoia in
>     +    * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>     +    * pending I/O that the delay copes with).
>     +    */
>     +   nvme_ctrlr_delete_qpairs(ctrlr);
>     +   nvme_ctrlr_disable_qpairs(ctrlr);
>     +   DELAY(100*1000);
>     +   nvme_ctrlr_shutdown(ctrlr);
>
> This seems backwards to me. From that section:
>
> > The host should perform the following actions in sequence for a normal
> > shutdown:
> >     1. Stop submitting any new I/O commands to the controller and allow
> > any outstanding commands to complete;
> >     2. If the controller implements I/O queues, then the host should
> > delete all I/O Submission Queues, using the Delete I/O Submission Queue
> > command. A result of the successful completion of the Delete I/O Submission
> > Queue command is that any remaining commands outstanding are aborted;
> >     3. If the controller implements I/O queues, then the host should
> > delete all I/O Completion Queues, using the Delete I/O Completion Queue
> > command; and
> >     4. The host should set the Shutdown Notification (CC.SHN) field to
> > 01b to indicate a normal shutdown operation. The controller indicates when
> > shutdown processing is completed by updating the Shutdown Status
> > (CSTS.SHST) field to 10b.
>
> You are calling nvme_ctrlr_delete_qpairs() -- which implements steps (2)
> and (3) -- before you are calling nvme_ctrlr_disable_qpairs() -- which
> implements step (1).
>
> Or am I missing something?
>

The system stops all I/O and makes sure all I/O is complete before starting
to suspend, so step (1) is already handled by the time we get here. Done in
the reverse order, I don't have the queues in a state where I can issue the
delete commands, since the software disable tears them down first. The sketch
below shows how the calls line up with the spec's steps.
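
To make the mapping concrete, here is a minimal sketch, not the committed
code: the function name nvme_suspend_order_sketch() is made up for
illustration, and it assumes the static helpers from nvme_ctrlr.c in the
patch below are in scope. It just annotates the call sequence from
nvme_ctrlr_suspend() with the step numbers from NVMe 1.4 section 7.6.2.

static void
nvme_suspend_order_sketch(struct nvme_controller *ctrlr)
{
	/*
	 * Step 1: stop submitting new I/O and let outstanding I/O drain.
	 * In the patch this happens before we get here: is_resetting is
	 * set and the upper layers are expected not to send anything new
	 * down during suspend.
	 */

	/*
	 * Steps 2 and 3: delete the hardware submission and completion
	 * queues with Delete I/O Queue admin commands.  This has to run
	 * while the admin queue is still usable, which is why it comes
	 * before the software disable below.
	 */
	nvme_ctrlr_delete_qpairs(ctrlr);

	/*
	 * Not a spec step: tear down the driver's software view of the
	 * queues so the reset done on resume starts from a clean state.
	 */
	nvme_ctrlr_disable_qpairs(ctrlr);
	DELAY(100 * 1000);	/* paranoia, mirrors nvme_ctrlr_hw_reset() */

	/* Step 4: set CC.SHN to 01b and wait for CSTS.SHST to read 10b. */
	nvme_ctrlr_shutdown(ctrlr);
}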

Warner



>  When we suspend, we need to properly shut down the NVMe controller. The
>       controller may go into D3 state (or may have the power removed), and
>       to properly flush the metadata to non-volatile RAM, we must complete a
>       normal shutdown. This consists of deleting the I/O queues and setting
>       the shutdown bit. We have to do some extra stuff to make sure we
>       reset the software state of the queues as well.
>
>       On resume, we have to reset the card twice, for reasons described in
>       the attach function. Once we've done that, we can restart the card. If
>       any of this fails, we'll fail the NVMe card, just like we do when a
>       reset fails.
>
>       Set is_resetting for the duration of the suspend / resume. This keeps
>       the reset taskqueue from running a concurrent reset, and also is
>       needed to prevent any hw completions from queueing more I/O to the
>       card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
>       that from the global state of the ctrlr. Wait for any pending reset to
>       finish. All queued I/O will get sent to the hardware as part of
>       nvme_ctrlr_start(), though the upper layers shouldn't send any
>       down. Disabling the qpairs is the other failsafe to ensure all I/O is
>       queued.
>
>       Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
>       confusion with all the other destroy functions.  It just removes the
>       queues in hardware, while the other _destroy_ functions tear down
>       driver data structures.
>
>       Split parts of the hardware reset function up so that I can
>       do part of the reset in suspend. Split out the software disabling
>       of the qpairs into nvme_ctrlr_disable_qpairs.
>
>       Finally, fix a couple of spelling errors in comments related to
>       this.
>
>       Relnotes: Yes
>       MFC After: 1 week
>       Reviewed by: scottl@ (prior version)
>       Differential Revision: https://reviews.freebsd.org/D21493
>
>     Modified:
>       head/sys/dev/nvme/nvme.c
>       head/sys/dev/nvme/nvme_ctrlr.c
>       head/sys/dev/nvme/nvme_pci.c
>       head/sys/dev/nvme/nvme_private.h
>
>     Modified: head/sys/dev/nvme/nvme.c
>
> ==============================================================================
>     --- head/sys/dev/nvme/nvme.c        Tue Sep  3 14:55:19 2019        (r351746)
>     +++ head/sys/dev/nvme/nvme.c        Tue Sep  3 15:26:11 2019        (r351747)
>     @@ -137,9 +137,10 @@ nvme_attach(device_t dev)
>         }
>
>         /*
>     -    * Reset controller twice to ensure we do a transition from cc.en==1
>     -    *  to cc.en==0.  This is because we don't really know what status
>     -    *  the controller was left in when boot handed off to OS.
>     +    * Reset controller twice to ensure we do a transition from cc.en==1 to
>     +    * cc.en==0.  This is because we don't really know what status the
>     +    * controller was left in when boot handed off to OS.  Linux doesn't do
>     +    * this, however. If we adopt that policy, see also nvme_ctrlr_resume().
>          */
>         status = nvme_ctrlr_hw_reset(ctrlr);
>         if (status != 0) {
>
>     Modified: head/sys/dev/nvme/nvme_ctrlr.c
>
> ==============================================================================
>     --- head/sys/dev/nvme/nvme_ctrlr.c  Tue Sep  3 14:55:19 2019        (r351746)
>     +++ head/sys/dev/nvme/nvme_ctrlr.c  Tue Sep  3 15:26:11 2019        (r351747)
>     @@ -118,8 +118,8 @@ nvme_ctrlr_construct_io_qpairs(struct nvme_controller
>
>         /*
>          * Our best estimate for the maximum number of I/Os that we should
>     -    * noramlly have in flight at one time. This should be viewed as a hint,
>     -    * not a hard limit and will need to be revisitted when the upper layers
>     +    * normally have in flight at one time. This should be viewed as a hint,
>     +    * not a hard limit and will need to be revisited when the upper layers
>          * of the storage system grows multi-queue support.
>          */
>         ctrlr->max_hw_pend_io = num_trackers * ctrlr->num_io_queues * 3 / 4;
>     @@ -344,10 +344,10 @@ nvme_ctrlr_enable(struct nvme_controller *ctrlr)
>         return (nvme_ctrlr_wait_for_ready(ctrlr, 1));
>      }
>
>     -int
>     -nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>     +static void
>     +nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
>      {
>     -   int i, err;
>     +   int i;
>
>         nvme_admin_qpair_disable(&ctrlr->adminq);
>         /*
>     @@ -359,7 +359,15 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>                 for (i = 0; i < ctrlr->num_io_queues; i++)
>                         nvme_io_qpair_disable(&ctrlr->ioq[i]);
>         }
>     +}
>
>     +int
>     +nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>     +{
>     +   int err;
>     +
>     +   nvme_ctrlr_disable_qpairs(ctrlr);
>     +
>         DELAY(100*1000);
>
>         err = nvme_ctrlr_disable(ctrlr);
>     @@ -481,7 +489,7 @@ nvme_ctrlr_create_qpairs(struct nvme_controller *ctrlr
>      }
>
>      static int
>     -nvme_ctrlr_destroy_qpairs(struct nvme_controller *ctrlr)
>     +nvme_ctrlr_delete_qpairs(struct nvme_controller *ctrlr)
>      {
>         struct nvme_completion_poll_status      status;
>         struct nvme_qpair                       *qpair;
>     @@ -820,7 +828,7 @@ nvme_ctrlr_configure_int_coalescing(struct nvme_contro
>      }
>
>      static void
>     -nvme_ctrlr_start(void *ctrlr_arg)
>     +nvme_ctrlr_start(void *ctrlr_arg, bool resetting)
>      {
>         struct nvme_controller *ctrlr = ctrlr_arg;
>         uint32_t old_num_io_queues;
>     @@ -833,7 +841,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>          *  the number of I/O queues supported, so cannot reset
>          *  the adminq again here.
>          */
>     -   if (ctrlr->is_resetting)
>     +   if (resetting)
>                 nvme_qpair_reset(&ctrlr->adminq);
>
>         for (i = 0; i < ctrlr->num_io_queues; i++)
>     @@ -854,7 +862,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>          *  explicit specify how many queues it will use.  This value should
>          *  never change between resets, so panic if somehow that does happen.
>          */
>     -   if (ctrlr->is_resetting) {
>     +   if (resetting) {
>                 old_num_io_queues = ctrlr->num_io_queues;
>                 if (nvme_ctrlr_set_num_qpairs(ctrlr) != 0) {
>                         nvme_ctrlr_fail(ctrlr);
>     @@ -894,7 +902,7 @@ nvme_ctrlr_start_config_hook(void *arg)
>
>         if (nvme_ctrlr_set_num_qpairs(ctrlr) == 0 &&
>             nvme_ctrlr_construct_io_qpairs(ctrlr) == 0)
>     -           nvme_ctrlr_start(ctrlr);
>     +           nvme_ctrlr_start(ctrlr, false);
>         else
>                 nvme_ctrlr_fail(ctrlr);
>
>     @@ -923,7 +931,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
>          */
>         pause("nvmereset", hz / 10);
>         if (status == 0)
>     -           nvme_ctrlr_start(ctrlr);
>     +           nvme_ctrlr_start(ctrlr, true);
>         else
>                 nvme_ctrlr_fail(ctrlr);
>
>     @@ -946,7 +954,7 @@ nvme_ctrlr_poll(struct nvme_controller *ctrlr)
>      }
>
>      /*
>     - * Poll the single-vector intertrupt case: num_io_queues will be 1 and
>     + * Poll the single-vector interrupt case: num_io_queues will be 1 and
>       * there's only a single vector. While we're polling, we mask further
>       * interrupts in the controller.
>       */
>     @@ -1012,7 +1020,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>                 if (is_user_buffer) {
>                         /*
>                          * Ensure the user buffer is wired for the duration of
>     -                    *  this passthrough command.
>     +                    *  this pass-through command.
>                          */
>                         PHOLD(curproc);
>                         buf = uma_zalloc(pbuf_zone, M_WAITOK);
>     @@ -1031,7 +1039,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>         } else
>                 req = nvme_allocate_request_null(nvme_pt_done, pt);
>
>     -   /* Assume userspace already converted to little-endian */
>     +   /* Assume user space already converted to little-endian */
>         req->cmd.opc = pt->cmd.opc;
>         req->cmd.fuse = pt->cmd.fuse;
>         req->cmd.rsvd2 = pt->cmd.rsvd2;
>     @@ -1206,7 +1214,7 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, dev
>
>         if (ctrlr->is_initialized) {
>                 if (!gone)
>     -                   nvme_ctrlr_destroy_qpairs(ctrlr);
>     +                   nvme_ctrlr_delete_qpairs(ctrlr);
>                 for (i = 0; i < ctrlr->num_io_queues; i++)
>                         nvme_io_qpair_destroy(&ctrlr->ioq[i]);
>                 free(ctrlr->ioq, M_NVME);
>     @@ -1305,4 +1313,88 @@ nvme_ctrlr_get_data(struct nvme_controller *ctrlr)
>      {
>
>         return (&ctrlr->cdata);
>     +}
>     +
>     +int
>     +nvme_ctrlr_suspend(struct nvme_controller *ctrlr)
>     +{
>     +   int to = hz;
>     +
>     +   /*
>     +    * Can't touch failed controllers, so it's already suspended.
>     +    */
>     +   if (ctrlr->is_failed)
>     +           return (0);
>     +
>     +   /*
>     +    * We don't want the reset taskqueue running, since it does similar
>     +    * things, so prevent it from running after we start. Wait for any reset
>     +    * that may have been started to complete. The reset process we follow
>     +    * will ensure that any new I/O will queue and be given to the hardware
>     +    * after we resume (though there should be none).
>     +    */
>     +   while (atomic_cmpset_32(&ctrlr->is_resetting, 0, 1) == 0 && to-- > 0)
>     +           pause("nvmesusp", 1);
>     +   if (to <= 0) {
>     +           nvme_printf(ctrlr,
>     +               "Competing reset task didn't finish. Try again later.\n");
>     +           return (EWOULDBLOCK);
>     +   }
>     +
>     +   /*
>     +    * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>     +    * delete the hardware I/O queues, and then shutdown. This properly
>     +    * flushes any metadata the drive may have stored so it can survive
>     +    * having its power removed and prevents the unsafe shutdown count from
>     +    * incrementing. Once we delete the qpairs, we have to disable them
>     +    * before shutting down. The delay is out of paranoia in
>     +    * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>     +    * pending I/O that the delay copes with).
>     +    */
>     +   nvme_ctrlr_delete_qpairs(ctrlr);
>     +   nvme_ctrlr_disable_qpairs(ctrlr);
>     +   DELAY(100*1000);
>     +   nvme_ctrlr_shutdown(ctrlr);
>     +
>     +   return (0);
>     +}
>     +
>     +int
>     +nvme_ctrlr_resume(struct nvme_controller *ctrlr)
>     +{
>     +
>     +   /*
>     +    * Can't touch failed controllers, so nothing to do to resume.
>     +    */
>     +   if (ctrlr->is_failed)
>     +           return (0);
>     +
>     +   /*
>     +    * Have to reset the hardware twice, just like we do on attach. See
>     +    * nvme_attach() for why.
>     +    */
>     +   if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>     +           goto fail;
>     +   if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>     +           goto fail;
>     +
>     +   /*
>     +    * Now that we've reset the hardware, we can restart the controller. Any
>     +    * I/O that was pending is requeued. Any admin commands are aborted with
>     +    * an error. Once we've restarted, take the controller out of reset.
>     +    */
>     +   nvme_ctrlr_start(ctrlr, true);
>     +   atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>     +
>     +   return (0);
>     +fail:
>     +   /*
>     +    * Since we can't bring the controller out of reset, announce and fail
>     +    * the controller. However, we have to return success for the resume
>     +    * itself, due to questionable APIs.
>     +    */
>     +   nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
>     +   nvme_ctrlr_fail(ctrlr);
>     +   atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>     +   return (0);
>      }
>
>     Modified: head/sys/dev/nvme/nvme_pci.c
>
> ==============================================================================
>     --- head/sys/dev/nvme/nvme_pci.c    Tue Sep  3 14:55:19 2019        (r351746)
>     +++ head/sys/dev/nvme/nvme_pci.c    Tue Sep  3 15:26:11 2019        (r351747)
>     @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
>      static int    nvme_pci_probe(device_t);
>      static int    nvme_pci_attach(device_t);
>      static int    nvme_pci_detach(device_t);
>     +static int    nvme_pci_suspend(device_t);
>     +static int    nvme_pci_resume(device_t);
>
>      static void nvme_ctrlr_setup_interrupts(struct nvme_controller *ctrlr);
>
>     @@ -51,6 +53,8 @@ static device_method_t nvme_pci_methods[] = {
>         DEVMETHOD(device_probe,     nvme_pci_probe),
>         DEVMETHOD(device_attach,    nvme_pci_attach),
>         DEVMETHOD(device_detach,    nvme_pci_detach),
>     +   DEVMETHOD(device_suspend,   nvme_pci_suspend),
>     +   DEVMETHOD(device_resume,    nvme_pci_resume),
>         DEVMETHOD(device_shutdown,  nvme_shutdown),
>         { 0, 0 }
>      };
>     @@ -331,4 +335,22 @@ nvme_ctrlr_setup_interrupts(struct nvme_controller *ct
>         }
>
>         ctrlr->msix_enabled = 1;
>     +}
>     +
>     +static int
>     +nvme_pci_suspend(device_t dev)
>     +{
>     +   struct nvme_controller  *ctrlr;
>     +
>     +   ctrlr = DEVICE2SOFTC(dev);
>     +   return (nvme_ctrlr_suspend(ctrlr));
>     +}
>     +
>     +static int
>     +nvme_pci_resume(device_t dev)
>     +{
>     +   struct nvme_controller  *ctrlr;
>     +
>     +   ctrlr = DEVICE2SOFTC(dev);
>     +   return (nvme_ctrlr_resume(ctrlr));
>      }
>
>     Modified: head/sys/dev/nvme/nvme_private.h
>
> ==============================================================================
>     --- head/sys/dev/nvme/nvme_private.h        Tue Sep  3 14:55:19 2019        (r351746)
>     +++ head/sys/dev/nvme/nvme_private.h        Tue Sep  3 15:26:11 2019        (r351747)
>     @@ -556,4 +556,7 @@ void    nvme_notify_ns(struct nvme_controller *ctrlr, int
>      void       nvme_ctrlr_intx_handler(void *arg);
>      void       nvme_ctrlr_poll(struct nvme_controller *ctrlr);
>
>     +int        nvme_ctrlr_suspend(struct nvme_controller *ctrlr);
>     +int        nvme_ctrlr_resume(struct nvme_controller *ctrlr);
>     +
>      #endif /* __NVME_PRIVATE_H__ */
>
>
>
>
>

