Re: Trying to understand CAM (and the cciss driver)

From: Warner Losh <imp_at_bsdimp.com>
Date: Wed, 22 May 2024 21:03:55 UTC
Hey Peter,

My apologies for my tardy response. The CISS driver is a bit older, and
most of my experience is with newer drivers...

On Thu, May 9, 2024 at 11:59 AM Peter Eriksson <pen@lysator.liu.se> wrote:

> Hi,
>
> I’m trying to fix a bug in the cciss driver that has been there “forever”
> when using it with an HP H241 SAS HBA card.
> The driver works fine when all (SAS, spinning rust) drives are behaving
> well, but when some of them are starting to go bad it often goes into spin
> and either hangs the kernel or panics. I’ve been trying to add
> instrumentation to it in order to pin-point the problem and have been
> attempting some workarounds (like clearing cr_complete since without that
> hack sometime the driver get many many non-busy repeated requests with the
> same “tag” and then It panics with:
>
> > ciss1: WARNING: completing non-busy request
> > ciss1: WARNING: completing non-busy request
> > ciss1: WARNING: completing non-busy request
>

This means that we're not tracking the state of the requests correctly.


> > panic: cam_periph_done_panic: already done with ccb 0xfffff88abefc4000
>

And I think this is the panic when we detect that.


> > cpuid = 0
> > time = 1714852260
> > KDB: stack backtrace:
> > #0 0xffffffff80c541b5 at kdb_backtrace+0x65
> > #1 0xffffffff80c06b31 at vpanic+0x151
> > #2 0xffffffff80c069d3 at panic+0x43
> > #3 0xffffffff8039017c at cam_periph_done_panic+0x1c
> > #4 0xffffffff80397193 at xpt_done_process+0x313
> > #5 0xffffffff803990a5 at xpt_done_td+0xf5
> > #6 0xffffffff80bc333e at fork_exit+0x7e
> > #7 0xffffffff8108a44e at fork_trampoline+0xe
>
> but it’s still there…
>
> I think I could use some input/suggestions from someone more knowledgeable
> with CAM and device drivers in general…
>
> An example (this is with extra instrumentation added so not quite a stock
> kernel driver):
>
> ciss1: *** Drive failure imminent, Port=1E Box=1 Bay=17 reason=0x37
> [ts=4460, class=4, subclass=0, detail=1, tag=0x2]
> (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 51 98 00 00
> 00 10 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 52 18 00 00
> 00 10 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 65 c8 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 63 c8 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 61 c8 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f c8 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f a8 00 00
> 00 20 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f 38 00 00
> 00 70 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5d 38 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5b 38 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 59 38 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 58 48 00 00
> 00 f0 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 56 48 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 48 d0 00 00
> 00 10 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 46 e0 00 00
> 01 f0 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8f d0 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> (da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8d d0 00 00
> 02 00 00 00
> (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
> (da122:ciss1:32:78:0): Retrying command, 3 more tries remain
> ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144,
> ccphys=1791048448, tag=72, flags=0x50 (----I-C), sg_tag=0x3,
> cr_complete=0xffffffff806059a0]
> ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
> ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144,
> ccphys=1791318368, tag=313, flags=0x50 (----I-C), sg_tag=0x3,
> cr_complete=0xffffffff806059a0]
> ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
> ciss1: WARNING: completing non-busy request #1 [onq=0, length=8192,
> ccphys=1791226528, tag=231, flags=0x50 (----I-C), sg_tag=0x3,
> cr_complete=0xffffffff806059a0]
> ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
> ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144,
> ccphys=1791313888, tag=309, flags=0x50 (----I-C), sg_tag=0x5,
> cr_complete=0xffffffff806059a0]
> ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
> ciss1: WARNING: completing non-busy request #1 [onq=0, length=8192,
> ccphys=1791206368, tag=213, flags=0x50 (----I-C), sg_tag=0x3,
> cr_complete=0xffffffff806059a0]
> ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
> panic: camq_remove: Attempt to remove out-of-bounds index -1 from queue
> 0xfffff80107fa9838 of size 1
>

So these seem to line up with the commands that encountered an error. The
errors handling code of most CAM drivers is the hardest to write and
validate because misbehaving disks, while not rare, are generally not used
in validation.  All my good ones that were dying have since stopped working
completely, for example.

So in this case, you'll need to look at the state tracking code for the
ciss requests, and make sure that in the cases of errors the right
transitions happen. I've fixed bugs in mpr/mps that amounted to marking a
request as busy again when it errored out in some of the weirder error
scenarios (I don't recall the exact cases though). You might look at
mpr/mps for the evolution of this code there. IIRC, the same author did
both ciss and mps and many of the techniques were retained between the two.
You might get good hints for bugs to fix by looking at how things evolved
there.

Basically, you're going to have to find out if these complaints are due to
the state tracking being messed up, but the request really should be doing
the thing that it's doing when the complaint is lodged (in which case you
need to fix the point where the state becomes incorrect). Or if it's caught
a real bug and the state of the request is proper, but we're doing
something like completing something twice, or failing to allocate a new
request or one of the many other silly bugs this tracking code usually
prevents.

Warner


> cpuid = 7
> time = 1715266713
> KDB: stack backtrace:
> #0 0xffffffff80c542d5 at kdb_backtrace+0x65
> #1 0xffffffff80c06c51 at vpanic+0x151
> #2 0xffffffff80c06af3 at panic+0x43
> #3 0xffffffff80390676 at camq_remove+0xf6
> #4 0xffffffff803936c8 at xpt_run_devq+0x148
> #5 0xffffffff80392b0c at xpt_action_default+0x35c
> #6 0xffffffff803c3544 at dastart+0x314
> #7 0xffffffff80394f73 at xpt_run_allocq+0x83
> #8 0xffffffff803c5152 at dastrategy+0x82
> #9 0xffffffff80b438d4 at g_disk_start+0x314
> #10 0xffffffff80b46b55 at g_io_request+0x1d5
> #11 0xffffffff80b46b55 at g_io_request+0x1d5
> #12 0xffffffff821a34dc at vdev_geom_io_start+0x23c
> #13 0xffffffff82347fb4 at zio_vdev_io_start+0x204
> #14 0xffffffff8234204f at zio_nowait+0x12f
> #15 0xffffffff822a0ce7 at vdev_mirror_io_start+0x177
> #16 0xffffffff82347fb4 at zio_vdev_io_start+0x204
> #17 0xffffffff8234204f at zio_nowait+0x12f
> Uptime: 10h16m26s
>
>
> The drive it complains about in the first line is one that has SMART
> errors:
>
> >        connector 1E box  1 bay 17                 HP      MB010000JWAYK
>                                   7PH8LU3G     HPD5 S.M.A.R.T. predictive
> failure.
>
> It is _not_ da122 though… The drive with a SMART error (da85) is not in
> use so basically is just sitting there.
>
>
>
> The code where I’m trying to figure out what’s going on is this part in
> sys/dev/ciss/ciss.c:
>
> static void
> ciss_complete(struct ciss_softc *sc, cr_qhead_t *qh)
> {
>     struct ciss_request *cr;
>
>     debug_called(2);
>
>     /*
>
>
>      * Loop taking requests off the completed queue and performing
>
>
>      * completion processing on them.
>
>
>      */
>     for (;;) {
>       int got_nonbusy = 0;
>
>         if ((cr = ciss_dequeue_complete(sc, qh)) == NULL)
>             break;
>         ciss_unmap_request(cr);
>
>         if ((cr->cr_flags & CISS_REQ_BUSY) == 0) {
>           cr->nonbusy_request_counter++;
>           ciss_printf(sc,
>                       "WARNING: completing non-busy request #%d [onq=%d,
> length=%u, ccphys=%u, tag=%d, flags=0x%x (%s), sg_tag=0x%x,
> cr_complete=%p]\n",
>                       cr->nonbusy_request_counter,
>                       cr->cr_onq, cr->cr_length, cr->cr_ccphys, cr->cr_tag,
>                       cr->cr_flags, cr_flags2str(cr->cr_flags),
>                       cr->cr_sg_tag, cr->cr_complete
>                       );
>           got_nonbusy = 1;
>         }
>
>         cr->cr_flags &= ~CISS_REQ_BUSY;
>
>         /*
>
>
>          * If the request has a callback, invoke it.
>
>
>          */
>         if (cr->cr_complete != NULL) {
>           cr->cr_complete(cr);
>
>         /* pen: Attempt to make sure ciss_cam_complete() doesn’t get
> called multiple times */
>
>           if (cr->cr_complete == ciss_cam_complete && got_nonbusy) {
>             ciss_printf(sc, "WARNING: clearing cr_complete due to
> ciss_cam_complete & nonbusy\n");
>             cr->cr_complete = NULL;
>           }
>           continue;
>         }
>
>         /*
>
>
>          * If someone is sleeping on this request, wake them up.
>
>
>          */
>         if (cr->cr_flags & CISS_REQ_SLEEP) {
>           cr->cr_flags &= ~CISS_REQ_SLEEP;
>           wakeup(cr);
>           continue;
>         }
>
>         /*
>
>
>          * If someone is polling this request for completion, signal.
>
>
>          */
>         if (cr->cr_flags & CISS_REQ_POLL) {
>           cr->cr_flags &= ~CISS_REQ_POLL;
>           continue;
>         }
>
>         /*
>
>
>          * Give up and throw the request back on the free queue.  This
>
>
>          * should never happen; resources will probably be lost.
>
>
>          */
>         ciss_printf(sc, "WARNING: completed command with no submitter\n");
>         ciss_enqueue_free(cr);
>     }
> }
>
>
> - Any suggestions? (Probably not but I thought I’d ask anyway :-)
>
> - Peter
>
>
>
>