svn commit: r221853 - in head/sys: dev/md dev/null sys vm

Bruce Evans brde at optusnet.com.au
Mon May 30 17:52:55 UTC 2011


On Mon, 30 May 2011 mdf at FreeBSD.org wrote:

> On Mon, May 30, 2011 at 8:25 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Sat, 28 May 2011 mdf at FreeBSD.org wrote:
>>> ...
>>> Meanwhile you could try setting ZERO_REGION_SIZE to PAGE_SIZE and I
>>> think that will restore things to the original performance.
>>
>> Using /dev/zero always thrashes caches by the amount <source buffer
>> size> + <target buffer size> (unless the arch uses nontemporal memory
>> accesses for uiomove, which none do AFAIK).  So a large source buffer
>> is always just a pessimization.  A large target buffer size is also a
>> pessimization, but for the target buffer a fairly large size is needed
>> to amortize the large syscall costs.  In this PR, the target buffer
>> size is 64K.  ZERO_REGION_SIZE is 64K on i386 and 2M on amd64.  64K+64K
>> on i386 is good for thrashing the L1 cache.
>
> That depends -- is the cache virtually or physically addressed?  The
> zero_region only has 4k (PAGE_SIZE) of unique physical addresses.  So
> most of the cache thrashing is due to the user-space buffer, if the
> cache is physically addressed.

Oops.  I now remember thinking that the much larger source buffer would
be OK since it only uses 1 physical page.  But the cache is apparently
virtually addressed.
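
To make that concrete, here is a minimal user-space sketch of the same
property (my own illustration, not the kernel's zero_region code; it
assumes FreeBSD's SHM_ANON): a 64K virtual window backed by one
zero-filled physical page, so walking it touches 64K of virtual
addresses but only PAGE_SIZE worth of unique physical memory.

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	size_t pgsz, region, off, i;
	volatile char sink;
	char *base;
	int fd;

	pgsz = (size_t)sysconf(_SC_PAGESIZE);
	region = 64 * 1024;
	sink = 0;

	/* One zero-filled physical page in an anonymous shm object. */
	fd = shm_open(SHM_ANON, O_RDWR, 0600);
	if (fd == -1)
		err(1, "shm_open");
	if (ftruncate(fd, (off_t)pgsz) == -1)
		err(1, "ftruncate");

	/* Reserve a contiguous 64K virtual window... */
	base = mmap(NULL, region, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED)
		err(1, "mmap reserve");
	/* ...and overlay the same physical page across all of it. */
	for (off = 0; off < region; off += pgsz)
		if (mmap(base + off, pgsz, PROT_READ, MAP_SHARED | MAP_FIXED,
		    fd, 0) == MAP_FAILED)
			err(1, "mmap overlay");

	/* 64K of virtual addresses, one page of physical addresses. */
	for (i = 0; i < region; i++)
		sink += base[i];
	return (0);
}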

>> It will only have a
>> noticeable impact on a current L2 cache in competition with other
>> threads.  It is hard to fit everything in the L1 cache even with
>> non-bloated buffer sizes and 1 thread (16 for the source (I)cache, 0
>> for the source (D)cache and 4K for the target cache might work).  On
>> amd64, 2M+2M is good for thrashing most L2 caches.  In this PR, the
>> thrashing is limited by the target buffer size to about 64K+64K, up
>> from 4K+64K, and it is marginal whether the extra thrashing from the
>> larger source buffer makes much difference.
>>
>> The old zbuf source buffer size of PAGE_SIZE was already too large.
>
> Wouldn't this depend on how far down from the use of the buffer the
> actual copy happens?  Another advantage to a large virtual buffer is
> that it reduces the number of times the copy loop in uiomove has to
> return up to the device layer that initiated the copy.  This is all
> pretty fast, but again, assuming a physically addressed cache, fewer
> trips are better.

Yes, I had forgotten that I have to keep going back to the uiomove()
level for each iteration.  That's a lot of overhead although not nearly
as much as going back to the user level.  If this is actually important
to optimize, then I might add a repeat count to uiomove() and copyout()
(actually a different function for the latter).
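
For reference, the loop shape I mean is roughly the following (a sketch
from memory, not a quote of sys/dev/null/null.c; the names follow the
real code but details may differ).  Each uiomove() call hands out at
most one source buffer's worth of zeroes, so a 64K read from a
PAGE_SIZE source buffer re-enters the loop 16 times, while a 64K or 2M
zero_region makes it a single pass.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/uio.h>
#include <machine/vmparam.h>

static int
zero_read_sketch(struct cdev *dev __unused, struct uio *uio,
    int ioflag __unused)
{
	ssize_t len;
	int error = 0;

	while (uio->uio_resid > 0 && error == 0) {
		/* One trip back here per source buffer's worth of zeroes. */
		len = uio->uio_resid;
		if (len > ZERO_REGION_SIZE)
			len = ZERO_REGION_SIZE;
		error = uiomove(__DECONST(void *, zero_region), len, uio);
	}
	return (error);
}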

linux-2.6.10 supports an mmapped /dev/zero and has had it since Y2K
according to its comment.  Sigh.  You will never beat that by copying,
but I think mmapping /dev/zero is only a big win for silly benchmarks.
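
For reference, the idiom being beaten looks like this from userland (a
sketch of the usual mmap pattern, not code from linux-2.6.10): the
zero-filled pages simply appear in the address space and no copyout
ever runs.

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 64 * 1024;
	char *p;
	int fd;

	fd = open("/dev/zero", O_RDWR);
	if (fd == -1)
		err(1, "open /dev/zero");
	/* Zero-filled pages appear directly; no read(2), no copy. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap /dev/zero");
	printf("first byte: %d\n", p[0]);
	munmap(p, len);
	close(fd);
	return (0);
}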

linux-2.6.10 also has a seekable /dev/zero.  Seeks don't really work,
but some of them "succeed" and keep the offset at 0.  ISTR a FreeBSD
PR about the file offset for /dev/zero not "working" because it is
garbage instead of 0.  It is clearly a Linuxism to depend on it being
nonzero.  IIRC, the file offset for device files is at best
implementation-defined in POSIX.
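
A trivial way to see what the offset actually does on a given system
(a quick test sketch, not the program from that PR):

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char buf[512];
	int fd;

	fd = open("/dev/zero", O_RDONLY);
	if (fd == -1)
		err(1, "open /dev/zero");
	printf("lseek(SEEK_SET, 1234) -> %jd\n",
	    (intmax_t)lseek(fd, 1234, SEEK_SET));
	printf("read(512) -> %zd\n", read(fd, buf, sizeof(buf)));
	printf("lseek(SEEK_CUR, 0) -> %jd\n",
	    (intmax_t)lseek(fd, 0, SEEK_CUR));
	close(fd);
	return (0);
}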

Bruce

