Low-level trace-buffers in CAM

Tue Oct 27 02:37:58 UTC 2015

Hi folks,

----------------------------------------------------------------
This is an updated re-send of a message I originally sent about a year ago, during MeetBSD 2014. A few people expressed interest in person, but no one ever followed up on the lists. I'm bringing this up again, in the hopes that I can get some feedback before / during next week's Dev/Vendor Summit. I'm CCing some folks who expressed interest at that time, and some folks that I was told would be interested.
----------------------------------------------------------------

At Panasas, we found it useful to have a trace buffer for each ATA drive in the system, for gathering information about command sequences and latency. I implemented that functionality for our old 7-ish ATA driver, and it works quite well, with fairly low overhead. (I ran our usual suite of performance tests, and any differences were lost in the noise.) These traces allowed us to capture and replay access patterns (though obviously not with the same data), analyze and compare latency, etc.

As Panasas moves toward a more modern base, we want to re-implement this functionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-)

First, some example output; I wrote code for three different formats:
  - The binary trace buffer dump
    Useful for manipulating in C; the other two formats are post-processed out of this, and I also wrote a replay utility and a basic stats analyzer which operated on the binary dump.
  - Tabular hex
    Useful for parsing with scripts; I never actually used this format, but it was trivial so I just went ahead and wrote it.
  - Human-readable
    Useful for eyeballing; on more than one occasion, I was able to see that I had mis-assembled a command in userspace, etc.

These examples are actually really long lines; I've broken them up and indented for the sake of email:

    # Tabular hex format, for use w/ scripts
    # Request time       ATA request information   Response time        
        ATA response information  sts/err  flags
    1445839793003737     c8 0000 0004 000000002fa8 1445839793003806     
        c8 0000 0004 000000002fa8 50 00 0000007d
    1445839793003874     c8 0000 0020 000000113ac8 1445839793011541     
        c8 0000 0020 000000113ac8 50 00 0000007d
    1445839793011795     c8 0000 0008 000000118ca8 1445839793017702     
        c8 0000 0008 000000118ca8 50 00 0000007d
    1445839793017808     c8 0000 0080 000000117628 1445839793018975     
        c8 0000 0080 000000117628 50 00 0000007d
    1445839793019079     c8 0000 0020 0000001177a8 1445839793019686     
        c8 0000 0020 0000001177a8 50 00 0000007d

    # Human-readable
    FLAGS       REQUEST TIME         RESPONSE TIME        ELAPSED TIME         
        REQUEST COMMAND                          STATUS   ERROR    
        RESPONSE COMMAND                        
    ----------- -------------------- -------------------- -------------------- 
        ---------------------------------------- -------- -------- 
        ----------------------------------------
    ____C...._V 1445839793003737     1445839793003806     69                   
        READ DMA (LBA 12200 + 4 sectors)         _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793003874     1445839793011541     7667                 
        READ DMA (LBA 1129160 + 32 sectors)      _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793011795     1445839793017702     5907                 
        READ DMA (LBA 1150120 + 8 sectors)       _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793017808     1445839793018975     1167                 
        READ DMA (LBA 1144360 + 128 sectors)     _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793019079     1445839793019686     607                  
        READ DMA (LBA 1144744 + 32 sectors)      _R_S____ ________ 
        ----                                    

The obvious place to put this sort of tracing would be in GEOM, but that actually isn't what I want, for a few reasons. First of all, I'm primarily interested in the exact register / taskfile / CDB values sent to the drive - opcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level latency - how long does a request stay in the controller and device. Both of those pretty much require working below GEOM.

Therefore, I think I want to do it in the CAM stack; ideally, I'd put the hooks in the SIM drivers, to get as close to the metal as possible, to get the most accurate latency. However, even a light touch to every SIM driver is going to be painful, and since I don't have access to every controller in the world, testing would be difficult. Adding the hooks to da(4) and ada(4) would still be pretty close to the metal, and would require much smaller changes. Or maybe they belong in the XPT layer? (I don't know nearly as much about CAM as I should. :-/)

--------
NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD User Group (BAFUG) meeting, there were suggestions about integrating this with DTrace or KTrace instead. I'm happy to consider either of those, but I know even less about those than I do about CAM. If that's really the best way to do this sort of thing, I'm happy to learn, if someone is willing to teach.
--------

I took the Panasas / 7.x implementation, and pseudo-coded what it might look like in CAM. (I could probably make diffs against the old ATA stack available for comparison, but I figured no one would care anymore.) A huge caveat is that I've never really poked in CAM before, so I fully expect I'm doing several things wrong. Please educate me. :-)

================================

Create a circular buffer of trace records for each CAM PERIPH. Each record will contain the full CDB for the request as sent to the device, the timestamp (in usec) of the request, the CDB as returned by the device, the timestamp of the response, and the status and error code. Additionally, flags are provided to track the state of the trace record, and to indicate which type of periph the record is associated with.

--------

typedef struct {
	u_int32_t	flags;
#define CAM_TRACE_F_VALID	0x00000001	/* valid record */
#define CAM_TRACE_F_INPROGRESS	0x00000002	/* request in progress */
#define CAM_TRACE_F_REQ_VALID	0x00000004	/* request data valid */
#define CAM_TRACE_F_RESP_VALID	0x00000008	/* response data valid */
#define CAM_TRACE_F_COMPLETE	0x00000010	/* response gathered */
#define CAM_TRACE_F_TIMEDOUT	0x00000020	/* request timed out */
#define CAM_TRACE_F_ABANDONED	0x00000040	/* cancelled, etc. */
#define CAM_TRACE_F_RETRIED	0x00000080	/* retry in later record */
#define CAM_TRACE_F_IS_RETRY	0x00000100	/* retry of earlier req. */
#define CAM_TRACE_F_PERIPH_MASK	0xf0000000	/* up to 16 PERIPH types */
#define CAM_TRACE_F_PERIPH_DA	0x0
#define CAM_TRACE_F_PERIPH_ADA	0x1
#define CAM_TRACE_F_PERIPH_NVD	0x2
	u_int64_t	req_usec;
	cdb_t		req_cdb;
	u_int64_t	resp_usec;
	cdb_t		resp_cdb;
	u_int8_t	resp_status;
	u_int8_t	resp_error;
	u_int8_t	resp_asc;	/* Needed for SCSI? */
	u_int8_t	resp_asq;	/* Needed for SCSI? */
	u_int8_t	padding[8];	/* Pad to 64 bytes */
/* There's going to be a large-ish array of these in the kernel; make sure
 * they're packed to minimize space usage, and keep things aligned. It may
 * make sense to re-order the fields for alignment as well.
 */
} cam_trace_record_t __attribute__((packed, aligned(8)));

typedef struct {
	u_int32_t		trace_buf_record_count;
	cam_trace_record_t	*trace_buf;
} cam_trace_buffer_t;

/* This is settable via a tunable; for Panasas' purposes, 100K entries was
 * enough to provide a good amount of history even on a busy system, while
 * not taking up much space (< 5MB/drive).
 */
#define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000

----------------

ioctl interfaces are used to get the trace buffer size, to retrieve the buffer contents, or to clear the buffer. The buffer can then be analyzed from userspace, or written to a file for post-processing.

--------

#define CAM_GET_TRACE_RECORD_COUNT	_IOR(???, ???, u_int32_t)
#define CAM_GET_TRACE_BUFFER		_IOWR(???, ???, cam_trace_buffer_t)
#define CAM_CLEAR_TRACE_BUFFER		_IOW(???, ???, u_int32_t)

----------------

To get the buffer:

--------

	cam_trace_buffer_t trace = NULL;
	fd = open(periph_name, O_RDONLY);
	error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
	    &trace.trace_buf_record_count);
	trace.trace_buf = malloc(trace.trace_buf_record_count *
	    sizeof(cam_trace_record_t));
	error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);

----------------

By including the full CDB, the entire command can be decoded by analysis tools. Tools can be written to parse the buffer and translate command data into human-readable descriptions, calculate elapsed times for requests, perform statistical analysis based on opcode, command type, transfer size, etc. Command decoders would need to be written for each PERIPH type (i.e. one for ATA, one for SCSI, one for NVMe, etc.).

----------------

Each PERIPH's softc needs to include some information about the per-device
buffer.

--------

	u_int32_t		trace_buf_record_count;
	u_int64_t		trace_buf_request_count;
	cam_trace_record_t	*trace_buf;

----------------

In each SIM, there should already be some sort of request structure, which associates the softc and the CDB of the request. To that structure, add a pointer to the trace record associated with the request.

--------

request structure:
	cam_trace_record_t	*trace_rec;

----------------

When initializing the per-device softc, allocate a trace buffer. The number of records can be hard-coded, or settable via a tunable.

The example below has a global size for everything, but it should be possible to do it on a per-device or per-PERIPH basis.

--------

	/** Initialize per-drive command tracing buffer. **/
	/* Get the buffer size from the tunable. Sanity check it. */
	sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT;
	TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
	    &trace_buf_record_count_tunable);
	if (trace_buf_record_count_tunable) {
		/* Less than 1K records is unusably few; more than 1M
		 * records is unusably many.
		 */
		if ((trace_buf_record_count_tunable < (1 << 10)) ||
		    (trace_buf_record_count_tunable > (1 << 20))) {
			device_printf(dev, "cam.trace_buf_record_count=%lu"
			    " out of range; should be between %u and %u\n",
			    trace_buf_record_count_tunable, (1 << 10),
			    (1 << 20));
		} else {
			sc->trace_buf_record_count = (u_int32_t)
			    trace_buf_record_count_tunable;
		}
	}
	/* Allocate the buffer. */
	trace_buf_bytes = sc->trace_buf_record_count *
	    sizeof(cam_trace_record_t);
	sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
	if (sc->trace_buf == NULL) {
		device_printf(dev, "unable to allocate %zd for trace_buf\n",
		    trace_buf_bytes);
		sc->trace_buf_record_count = 0;
	} else {
		device_printf(dev, "allocated %zd @ %p for trace_buf\n",
		    trace_buf_bytes, sc->trace_buf);
	}
	/* Request counter, used for indexing */
	sc->trace_buf_request_count = 0;

----------------

/* Initial setup of the trace record for a new request.
 *
 *  Locate the circular per-device trace buffer in the drive's softc.
 *  Populate the pointer to the next record in the trace buffer; this is
 *  done atomically to prevent concurrent requests from getting the same
 *  index. That could get a bit weird if/when 'trace_buf_request_count'
 *  rolls over, if the maximum counter value is not a multiple of
 *  'trace_buf_record_count' (i.e. the request counter recycles in the
 *  middle of the buffer). By using a 64-bit request counter, we're pretty
 *  safe - even at one transaction per nanosecond, that will still last
 *  about 585 years.
 *
 *  Initialize the record by zeroing it and then setting the VALID flag.
 *
 *  This could be done at any point before sending the request to the
 *  hardware; neither the CDB data nor the timestamps are involved yet.
 */
void
trace_rec_init(cam_request_t req) {
	ada_softc *sc = (request->...->softc)
	u_int64_t tmp = 0;
	if (sc && sc->trace_buf) {
		tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
		req->trace_rec = &sc->trace_buf[tmp %
		    sc->trace_buf_record_count];
		bzero(req->trace_rec, sizeof(*req->trace_rec));
		req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28);
		req->trace_rec->flags |= CAM_TRACE_F_VALID;
	} else {
		req->trace_rec = NULL;
	}
}

----------------

/* Stuff the request data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB into the
 *  record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
 *  valid, and INPROGRESS to show that the request is in-progress.
 *
 *  This should be done just before sending the request to the hardware.
 *  This should *not* be fused into trace_rec_init(), because
 *  trace_rec_init() is also used in the retry case, where it is not
 * correct to immediately start the timer.
 */
void
trace_rec_req(cam_request_t req) {
	struct timeval tv;
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
		microuptime(&tv);
		req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000)
		    + tv->tv_usec;
		req->trace_rec->flags |=
		    (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS);
	}
}

----------------

/* Stuff the response data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB **as returned by
 *  the drive** into the record. At this point, we also have the result from
 *  the command (status, error, (asc? ascq?)), so save them too. Set
 *  CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
 *  Clear INPROGRESS and set COMPLETE instead.
 *
 *  This should be done just after getting the response information from the
 *  hardware.
 */
void
trace_rec_resp(cam_request_t req) {
	struct timeval tv;
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		microuptime(&tv);
		bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
		req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec *
		    1000000) + tv->tv_usec;
		req->trace_rec->flags |=
		    (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
		req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
	}
}

----------------

/* Update the trace record if there was a timeout.
 *
 *  If the trace record is present and valid, set TIMEDOUT to indicate
 *  that the request associated with the record exceeded its timeout.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *
 *  This should be done along with other timeout handling.
 */
void
trace_rec_timeout(cam_request_t req) {
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		req->trace_rec->flags |=
		    (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
		req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
	}
}

----------------

/* Update the trace record if the request is being retried.
 *  
 *  If the trace record is present and valid, set RETRIED to indicate
 *  that the request associated with the record is going to be retried.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *  Then get a new trace record for the new attempt, copy the request
 *  command data to the new record, and set IS_RETRY to show that this
 *  is a retry of an earlier request.
 *
 *  This should be done at the time the retry is queued.
 */
void
trace_rec_retry(cam_request_t req) {
	if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
		/* First, mark the original request as being requeued. */
		req->trace_rec->flags |=
		    (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
		req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
		/* Get a new record for the retry. */
		trace_rec_init(req);
		/* Mark the new record as being a retry. */
		req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY;
	}
}

================================

I'd appreciate feedback, both on the implementation details and on the idea as a whole. Some open questions are:

- My assumption is that each SIM has a structure that associates a PERIPH's softc with the CDB that's being processed. It might require dereferencing a chain of pointers, but it should be there in some fashion.

- The current implementation adds a few ioctls to the old ad(4) disk-device driver. I would expect to add the new ioctls to da(4) and ada(4); is that the right thing to do?

Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendor Summit next week, if anyone wants to discuss this with me in person.

Thanks,

Ravi