John Baldwin jhb at
Wed May 20 19:36:49 UTC 2009

On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
> On Tue, 19 May 2009, John Baldwin wrote:
> > So one of the things I worked on while hacking away at unmapped disk I/O
> > requests was a little API to manage scatter/gather lists of physical
> > addresses.  The basic premise is that a sglist describes a logical object
> > that is backed by one or more physical address ranges.  To minimize locking,
> > the sglist objects themselves are immutable once they are shared.  The
> > unmapped disk I/O project is still very much a WIP (and I'm not even working
> > on any of the really hard bits myself).  However, I actually found this
> > object to be useful for something else I have been working on: the mmap()
> > extensions for the Nvidia amd64 driver.  For the Nvidia patches I have
> > created a new type of VM object that is very similar to OBJT_DEVICE objects
> > except that it uses a sglist to determine the physical pages backing the
> > object instead of calling the d_mmap() method for each page.  Anyway, adding
> > this little API is just the first in a series of patches needed for the
> > Nvidia driver work.  I plan to MFC them to 7.x relatively soon in the hopes
> > that we can soon have a supported Nvidia driver on amd64 on 7.x.
> >
> > The current patches for all the Nvidia stuff are at
> >
> >
> > This particular patch to just add the sglist(9) API is at
> > and is slightly more
> > polished in that it includes a manpage. :)
> I have a couple of minor comments:
> 1) SGLIST_APPEND() contains a return() within a macro.  Shouldn't this be 
> an inline that returns an error code that is always checked?  These kinds 
> of macros get people into trouble.  It also could be written in such a way 
> that you don't have to handle nseg == 0 at each callsite and then it's big 
> enough that it probably shouldn't be a macro or an inline.

Mostly I was trying to avoid having to duplicate a lot of code.  The reason I
didn't handle nseg == 0 directly is a possibly dubious attempt to optimize
the _sglist_append() inline so that it doesn't have to do the extra branch
inside the main loop for virtual address regions that span multiple pages.
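For illustration, here is a userland sketch (hypothetical names, not the actual FreeBSD code) of what an error-returning append function could look like in place of the SGLIST_APPEND() macro, with the nseg == 0 case handled once inside the function instead of at each call site:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct sgseg {
	uint64_t ss_paddr;
	size_t	 ss_len;
};

struct sgl {
	struct sgseg *sg_segs;
	int	      sg_nseg;
	int	      sg_maxseg;
};

/*
 * Append a physical range, coalescing with the previous segment
 * when it is physically contiguous.  Returns 0 or an errno value
 * that the caller is expected to check.
 */
static int
sgl_append_range(struct sgl *sg, uint64_t paddr, size_t len)
{
	struct sgseg *s;

	if (sg->sg_nseg > 0) {
		/* Contiguous with the last segment: just extend it. */
		s = &sg->sg_segs[sg->sg_nseg - 1];
		if (s->ss_paddr + s->ss_len == paddr) {
			s->ss_len += len;
			return (0);
		}
	}
	if (sg->sg_nseg == sg->sg_maxseg)
		return (EFBIG);		/* out of segments */
	s = &sg->sg_segs[sg->sg_nseg++];
	s->ss_paddr = paddr;
	s->ss_len = len;
	return (0);
}
```

The cost is one extra branch per call compared to a macro that open-codes the nseg == 0 check at the call site, which is the trade-off being discussed.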

> 2) I worry that if all users do sglist_count() followed by a dynamic 
> allocation and then an _append() they will be very expensive. 
> pmap_kextract() is much more expensive than it may first seem to be.  Do 
> you have a user of count already?

The only consumer right now is sglist_build(), and nothing currently uses that.
VOP_GETPAGES/VOP_PUTPAGES would not need to do this since they could simply
append the physical addresses extracted directly from vm_page_t's, for example.
I'm not sure this pattern will see much use now.  Originally I thought I would
change all storage drivers to do all DMA operations using sglists, and this
sort of thing would have been used for non-bio requests like firmware
commands; however, as expounded on below, it actually appears better to
still treat bio's separately from non-bio requests for bus_dma so that the
non-bio requests can continue to use bus_dmamap_load_buffer() as they do now.
> 3) Rather than having sg_segs be an actual pointer, did you consider 
> making it an unsized array?  This removes the overhead of one pointer from 
> the structure while enforcing that it's always contiguously allocated.

It's actually a feature to be able to have the header in separate storage from
the segs array.  I use this in the jhb_bio branch in the bus_dma implementations
where a pre-allocated segs array is stored in the bus dma tag and the header
is allocated on the stack.
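The two layouts under discussion can be sketched like this (hypothetical names, userland model): the pointer form lets a driver keep a preallocated segment array (e.g. in a DMA tag) and put the header on the stack, while the unsized trailing array forces header and segments into one contiguous allocation.

```c
#include <stddef.h>
#include <stdlib.h>

struct pseg {
	unsigned long ss_paddr;
	size_t	      ss_len;
};

/* Pointer form: header and segment array can live in separate storage. */
struct sgl_ptr {
	struct pseg *sg_segs;
	int	     sg_nseg;
	int	     sg_maxseg;
};

static int
sgl_ptr_on_stack(struct pseg *tag_segs, int tag_maxseg)
{
	/* Header on the stack, segs array borrowed from the "tag". */
	struct sgl_ptr sg = { tag_segs, 0, tag_maxseg };

	return (sg.sg_maxseg);
}

/* Flexible-array form: one allocation, no extra pointer in the header. */
struct sgl_flex {
	int	    sg_nseg;
	int	    sg_maxseg;
	struct pseg sg_segs[];	/* unsized trailing array */
};

static struct sgl_flex *
sgl_flex_alloc(int maxseg)
{
	struct sgl_flex *sg;

	sg = malloc(sizeof(*sg) + maxseg * sizeof(struct pseg));
	if (sg != NULL) {
		sg->sg_nseg = 0;
		sg->sg_maxseg = maxseg;
	}
	return (sg);
}
```

The flexible-array form saves one pointer per header but rules out the stack-header/preallocated-array split described above.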

> 4) SGLIST_INIT might be better off as an inline, and may not even belong 
> in the header file.

That may be true.  I currently only use it in the jhb_bio branch for the
bus_dma implementations.

> In general I think this is a good idea.  It'd be nice to work on replacing 
> the buf layer's implementation with something like this that could be used 
> directly by drivers.  Have you considered a busdma operation to load from 
> a sglist?

So in regards to the bus_dma stuff, I did work on this a while ago in my
jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had planned on
using that in storage drivers directly.  However, I ended up circling back
to preferring a bus_dmamap_load_bio() and adding a new 'bio_start' field
to 'struct bio' that is an offset into an attached sglist.  This let me
carve up I/O requests in geom_dev to satisfy a disk device's max request
size while still sharing the same read-only sglist across the various
BIO's (by simply adjusting bio_length and bio_start to be a subrange of
the sglist) as opposed to doing memory allocations to allocate specific
ranges of an sglist (using something like sglist_slice()) for each I/O
request.  I then have bus_dmamap_load_bio() use the subrange of the
sglist internally or fall back to using the KVA pointer if the sglist
isn't present.
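The carving itself is just offset arithmetic over the shared list.  A userland model (hypothetical types; the real code would split 'struct bio' in geom_dev) of dividing one request into children that each reference a subrange of the same read-only sglist:

```c
#include <stddef.h>

/*
 * Each child request gets only an (offset, length) window into
 * the parent's shared scatter/gather list, so no per-child sglist
 * allocation (e.g. via something like sglist_slice()) is needed.
 */
struct child_bio {
	size_t bio_start;	/* offset into the shared sglist */
	size_t bio_length;
};

static int
carve_bio(size_t total, size_t maxio, struct child_bio *out, int nout)
{
	size_t off;
	int n;

	for (off = 0, n = 0; off < total && n < nout; n++) {
		out[n].bio_start = off;
		out[n].bio_length = (total - off < maxio) ?
		    total - off : maxio;
		off += out[n].bio_length;
	}
	return (n);		/* number of child requests */
}
```

A 257-byte request against a 128-byte maximum, say, yields three children (128, 128, 1) that all point at the same immutable sglist.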

However, I'm not really trying to get the bio stuff into the tree; this
is mostly for the Nvidia case, and for that use case the driver simply
creates single-entry lists using sglist_append_phys().

An example of doing something like this comes from my sample patdev test
module, which creates a VM object that maps the local APIC uncacheable
like so:

	/* Create a scatter/gather list that maps the local APIC. */
	sc->sg = sglist_alloc(1, M_WAITOK);
	sglist_append_phys(sc->sg, lapic_paddr, LAPIC_LEN);

	/* Create a VM object that is backed by the scatter/gather list. */
	sc->sgobj = vm_pager_allocate(OBJT_SG, sc->sg, LAPIC_LEN, VM_PROT_READ,
	    0);
	vm_object_set_cache_mode(sc->sgobj, VM_CACHE_UNCACHEABLE);

The same approach can be used to map PCI BARs, etc. into userland as well.

John Baldwin

More information about the freebsd-arch mailing list