Unmapped I/O
Alan Cox
alc at rice.edu
Wed Dec 19 21:16:31 UTC 2012
On 12/19/2012 13:28, Jeff Roberson wrote:
> On Wed, 19 Dec 2012, Alan Cox wrote:
>
>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov
>> <kostikbel at gmail.com>wrote:
>>
>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>> necessity to map each I/O buffer's pages into KVA. The problem is
>>> that on multi-core machines the mapping must be flushed from the TLB
>>> on all cores, because buffer pages are mapped globally into the
>>> kernel. This means that buffer creation and destruction disrupt the
>>> execution of all other cores to perform a TLB shootdown through
>>> IPIs, and the thread initiating the shootdown must wait for all
>>> other cores to execute the flush and report completion.
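For illustration, here is a simplified sketch of the conventional mapped-buffer path being described; the wrapper names are invented for the example, but pmap_qenter()/pmap_qremove() are the real MI pmap calls involved.

/*
 * Simplified sketch of the conventional path: buffer pages are entered
 * into the kernel map when the buffer is set up and removed when it is
 * torn down.  Because kernel mappings are visible on every core, the
 * removal triggers a TLB shootdown IPI to all other CPUs.
 */
static void
buf_map_pages(struct buf *bp)
{
	pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages, bp->b_npages);
}

static void
buf_unmap_pages(struct buf *bp)
{
	/* The invalidation here is propagated to all CPUs via IPI. */
	pmap_qremove((vm_offset_t)bp->b_kvabase, bp->b_npages);
}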
>>>
>>> The patch at
>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>> implements 'unmapped buffers': the ability to create a VMIO struct
>>> buf that carries no KVA mapping of the buffer pages into the kernel
>>> address space. Since there is no mapping, the kernel does not need
>>> to invalidate the TLB. Unmapped buffers are marked with the new
>>> B_NOTMAPPED flag and must be requested explicitly by passing the
>>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>>> buffer is requested but an unmapped buffer already exists, the
>>> buffer subsystem maps the pages automatically.
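To make the new flags concrete, here is a hedged sketch of how a filesystem might request an unmapped buffer with this patch applied; the flag names come from the description above, while the surrounding code is invented for illustration only.

	struct buf *bp;

	/* Ask the buffer cache for a buffer without a KVA mapping. */
	bp = getblk(vp, lblkno, bsize, 0, 0, GB_NOTMAPPED);
	if (bp->b_flags & B_NOTMAPPED) {
		/*
		 * bp->b_data is not usable; the data is reachable only
		 * through the bp->b_pages[] array.  A later request for
		 * the same buffer without GB_NOTMAPPED would cause the
		 * buffer subsystem to map the pages transparently.
		 */
	}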
>>>
>>> The clustering code is also made aware of unmapped buffers, but this
>>> required a KPI change, which accounts for the parts of the diff that
>>> touch the non-UFS filesystems.
>>>
>>> UFS is adapted to request unmapped buffers when the kernel does not
>>> need to access the content, i.e. mostly for file data. A new helper
>>> function, vn_io_fault_pgmove(), operates on the unmapped array of
>>> pages. It calls the new pmap method pmap_copy_pages() to move data
>>> to and from usermode.
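For reference, on amd64 the direct map makes pmap_copy_pages() straightforward; the following is a rough sketch of such an implementation, not the patch's exact code.

void
pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset, vm_page_t mb[],
    vm_offset_t b_offset, int xfersize)
{
	void *a_cp, *b_cp;
	vm_offset_t a_pg_offset, b_pg_offset;
	int cnt;

	while (xfersize > 0) {
		/* Copy at most up to the next page boundary on each side. */
		a_pg_offset = a_offset & PAGE_MASK;
		cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
		a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
		    ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
		b_pg_offset = b_offset & PAGE_MASK;
		cnt = min(cnt, PAGE_SIZE - b_pg_offset);
		b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
		    mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
		bcopy(a_cp, b_cp, cnt);
		a_offset += cnt;
		b_offset += cnt;
		xfersize -= cnt;
	}
}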
>>>
>>> Besides unmapped buffers, unmapped BIOs are introduced, marked with
>>> the flag BIO_NOTMAPPED. Unmapped buffers are translated directly to
>>> unmapped BIOs. Geom providers may indicate that they accept unmapped
>>> BIOs. If a provider does not handle unmapped i/o requests, geom now
>>> automatically establishes a transient mapping for the i/o pages.
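A rough illustration of that transient-mapping fallback follows; the function name and the transient_kva_alloc() call are hypothetical, and the bio_ma* field names are an assumption about how the patch carries the page list with a bio.

static void
g_map_unmapped_bio(struct bio *bp)
{
	vm_offset_t kva;

	if ((bp->bio_flags & BIO_NOTMAPPED) == 0)
		return;			/* Already has a kernel address. */

	/*
	 * Hypothetical transient-KVA allocation; the patch draws this
	 * from a dedicated submap.  Map the pages behind the bio so the
	 * provider sees an ordinary bio_data pointer.
	 */
	kva = transient_kva_alloc(ptoa(bp->bio_ma_n));
	pmap_qenter(kva, bp->bio_ma, bp->bio_ma_n);
	bp->bio_data = (caddr_t)(kva + bp->bio_ma_offset);
	bp->bio_flags &= ~BIO_NOTMAPPED;
}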
>>>
>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
>>> The gpart providers indicate support for unmapped BIOs if the
>>> underlying provider can do unmapped i/o. I also hacked ahci(4) to
>>> handle unmapped i/o, but this should be changed to use the proper
>>> busdma interface after Jeff's physbio patch is committed.
>>>
>>> In addition, the swap pager does unmapped swapping if the swap
>>> partition indicates that it can do unmapped i/o. At Jeff's request,
>>> the buffer allocation code may reserve KVA for an unmapped buffer in
>>> advance. Unmapped page-in for the vnode pager is also implemented
>>> where the filesystem supports it, but page-out is not. Page-out, as
>>> well as vnode-backed md(4), currently requires mappings, mostly due
>>> to the use of VOP_WRITE().
>>>
>>> As is, the patch works in my test environment, where I used
>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see
>>> no statistically significant difference in buildworld -j 10 times on
>>> a 4-core machine with HT. On the other hand, when computing sha1
>>> over a 5GB file, the system time was reduced by 30%.
>>>
>>> Unfinished items:
>>> - Integration with physbio; will be done after physbio is committed
>>> to HEAD.
>>> - The key per-architecture function needed for unmapped i/o is
>>> pmap_copy_pages(). I have implemented it for amd64 and i386; it
>>> still needs to be done for all other architectures.
>>> - The sizing of the submap used for transient mapping of the BIOs is
>>> naive. It should be adjusted, especially for KVA-lean architectures.
>>> - Conversion of the other filesystems. Low priority.
>>>
>>> I am interested in reviews, tests and suggestions. Note that
>>> unmapped i/o currently works only for md(4) and ahci(4); for other
>>> drivers the patched kernel should fall back to mapped i/o.
>>>
>>>
>> Here are a couple of things for you to think about:
>>
>> 1. A while back, I developed the patch at
>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying
>> to reduce the number of TLB shootdowns caused by the buffer map. The
>> idea is simple: replace the calls to pmap_q{enter,remove}() with calls
>> to a new machine-dependent function that opportunistically sets the
>> buffer's kernel virtual address to the direct map for physically
>> contiguous pages. However, if the pages are not physically contiguous,
>> it calls pmap_qenter() with the kernel virtual address from the buffer
>> map.
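Condensed, the idea is something like the sketch below; the function name is illustrative, not buf_maps5.patch's exact code, and details such as the data offset within the first page are elided.

static void
buf_qenter_opportunistic(struct buf *bp)
{
	vm_paddr_t pa;
	int i;

	pa = VM_PAGE_TO_PHYS(bp->b_pages[0]);
	for (i = 1; i < bp->b_npages; i++)
		if (VM_PAGE_TO_PHYS(bp->b_pages[i]) != pa + ptoa(i))
			break;
	if (i == bp->b_npages) {
		/*
		 * Physically contiguous: point b_data into the direct
		 * map and create no new mapping, so no TLB shootdown is
		 * needed when the buffer is torn down.
		 */
		bp->b_data = (caddr_t)PHYS_TO_DMAP(pa);
	} else {
		/* Fall back to the buffer-map KVA as before. */
		pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages,
		    bp->b_npages);
		bp->b_data = bp->b_kvabase;
	}
}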
>>
>> This eliminated about half of the TLB shootdowns for a buildworld,
>> because there is a decent amount of physical contiguity that occurs by
>> "accident". Using a buddy allocator for physical page allocation tends
>> to promote this contiguity. However, in a few places it occurs by
>> explicit action, e.g., mapped files, including large executables,
>> using superpage reservations.
>>
>> So, how does this fit with what you've done? You might think of using
>> what I describe above as a kind of "fast path". As you can see from
>> the patch, it's very simple and non-intrusive. If the pages aren't
>> physically contiguous, then instead of using pmap_qenter(), you fall
>> back to whatever approach for creating ephemeral mappings is
>> appropriate to a given architecture.
>
> I think these are complementary. Kib's patch gives us the fastest
> possible path for user data. Alan's patch will improve the metadata
> performance for things that really require the buffer cache. I see no
> reason not to clean up and commit both.
>
>>
>> 2. As for managing the ephemeral mappings on machines that don't
>> support a direct map: I would suggest an approach that is loosely
>> inspired by copying garbage collection (or the segment cleaners in
>> log-structured file systems). Roughly, you manage the buffer map as a
>> few spaces (or segments). When you create a new mapping in one of
>> these spaces (or segments), you simply install the PTEs. When you
>> decide to "garbage collect" a space (or spaces), you perform a global
>> TLB flush. Specifically, you do something like toggling the bit in
>> the cr4 register that enables/disables support for the PG_G bit. If
>> the spaces are sufficiently large, then the number of such global TLB
>> flushes should be quite low. Every space would have an epoch number
>> (or flush number). In the buffer, you would record the epoch number
>> alongside the kernel virtual address. On access to the buffer, if the
>> epoch number is too old, then you have to recreate the buffer's
>> mapping in a new space.
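A minimal sketch of that epoch scheme might look like the following; every identifier here is hypothetical, and a real version would install PTEs without the per-mapping invalidation that pmap_qenter() normally performs, since addresses in a freshly recycled space cannot be TLB-resident.

struct tmap_space {
	vm_offset_t	base;	/* start of this space's KVA */
	vm_offset_t	free;	/* next unused address */
	vm_offset_t	end;
	uint64_t	epoch;	/* bumped on every recycle/flush */
};

static vm_offset_t
tmap_enter(struct tmap_space *sp, vm_page_t *ma, int count, uint64_t *epochp)
{
	vm_offset_t kva;

	if (sp->free + ptoa(count) > sp->end) {
		/*
		 * Space exhausted: recycle it with one machine-wide TLB
		 * flush (e.g. toggling CR4.PGE on x86, broadcast to all
		 * CPUs) instead of one shootdown per mapping.
		 */
		sp->epoch++;
		sp->free = sp->base;
		tmap_flush_global();	/* hypothetical flush primitive */
	}
	kva = sp->free;
	sp->free += ptoa(count);
	pmap_qenter(kva, ma, count);	/* install the PTEs */
	*epochp = sp->epoch;		/* recorded alongside the buffer KVA */
	return (kva);
}

/* A cached address is valid only while its recorded epoch is current. */
static bool
tmap_valid(struct tmap_space *sp, uint64_t epoch)
{
	return (epoch == sp->epoch);
}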
>
> Are the machines that don't have a direct map performance critical?
> My expectation is that they are legacy or embedded. This seems like a
> great project to do when the rest of the pieces are stable and fast.
> Until then they could just use something like pbufs?
>
I think the answer to your first question depends entirely on who you
are. :-) Also, at the low end of the server space, there are many
people trying to promote arm-based systems. While FreeBSD may never run
on your arm-based phone, I think that ceding the arm-based server
market to others would be a strategic mistake.
Alan
P.S. I think we're moving the discussion too far away from kib's
original topic, so I suggest changing the subject line on any follow-ups.