svn commit: r356159 - head/sys/vm

Oliver Pinter oliver.pntr at gmail.com
Sun Dec 29 22:18:44 UTC 2019


Thanks for the detailed answer, Mark!

On Sunday, December 29, 2019, Mark Johnston <markj at freebsd.org> wrote:

> On Sun, Dec 29, 2019 at 03:39:55AM +0100, Oliver Pinter wrote:
> > Are there any performance measurements from before and after? It would
> > be nice to see them.
>
> I did not do extensive benchmarking.  The aim of the patch set was
> simply to remove the use of the hashed page lock, since it shows up
> prominently in lock profiles of some workloads.  The problem is that we
> acquire these locks any time a page's LRU state is updated, and the use
> of the hash lock means that we get false sharing.  The solution is to
> implement these state updates using atomic operations on the page
> structure itself, making data contention much less likely.  Another
> option was to embed a mutex into the vm_page structure, but this would
> bloat a structure which is already too large.
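
As a rough illustration of that idea (a minimal sketch with hypothetical
names and layout, not the actual FreeBSD vm_page code): pack the LRU/queue
state into a single word in the page structure and update it with a
compare-and-swap loop, so no external lock is needed and unrelated pages
no longer share a lock cache line.

    /*
     * Hypothetical per-page queue state packed into one word: the low
     * bits hold the queue index, the remaining bits hold flags.  A CAS
     * loop updates it atomically, retrying if another CPU races.
     */
    #include <stdatomic.h>
    #include <stdint.h>

    struct page_astate {
            _Atomic uint32_t state;
    };

    #define PG_QUEUE_MASK   0xffu
    #define PG_FLAG_ACTIVE  0x100u

    void
    page_set_queue(struct page_astate *pa, uint32_t newq)
    {
            uint32_t old, new;

            old = atomic_load_explicit(&pa->state, memory_order_relaxed);
            do {
                    /* Recompute the desired value from the latest state. */
                    new = (old & ~PG_QUEUE_MASK) | (newq & PG_QUEUE_MASK) |
                        PG_FLAG_ACTIVE;
            } while (!atomic_compare_exchange_weak_explicit(&pa->state,
                &old, new, memory_order_acq_rel, memory_order_relaxed));
    }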
>
> A secondary goal was to reduce the number of locks held during page
> queue scans.  Such scans frequently call pmap_ts_referenced() to collect
> info about recent references to the page.  This operation can be
> expensive since it may require a TLB shootdown, and it can block for a
> long time on the pmap lock, for example if the lock holder is copying
> the page tables as part of a fork().  Now, the active queue scan body is
> executed without any locks held, so a page daemon thread blocked on a
> pmap lock no longer has the potential to block other threads by holding
> on to a shared page lock.  Before, the page daemon could block faulting
> threads for a long time, hurting latency.  I don't have any benchmarks
> that capture this, but it's something that I've observed in production
> workloads.
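
A simplified sketch of that scan structure (generic pthread locking and
assumed helper functions, not the real page daemon code): collect a batch
of pages under the queue lock, drop the lock, and then do the expensive
per-page work (the analogue of pmap_ts_referenced()) with no locks held.

    /*
     * Sketch: gather a batch under the queue lock, then process it with
     * no locks held.  dequeue_locked(), check_references() and
     * requeue_locked() are assumed helpers for this sketch.
     */
    #include <pthread.h>
    #include <stddef.h>

    #define SCAN_BATCH      32

    struct page;                            /* opaque here */

    extern pthread_mutex_t queue_lock;
    extern struct page *dequeue_locked(void);
    extern int check_references(struct page *);  /* may block on pmap lock */
    extern void requeue_locked(struct page *);

    void
    scan_active_queue(void)
    {
            struct page *batch[SCAN_BATCH];
            size_t i, n;

            pthread_mutex_lock(&queue_lock);
            for (n = 0; n < SCAN_BATCH; n++) {
                    if ((batch[n] = dequeue_locked()) == NULL)
                            break;
            }
            pthread_mutex_unlock(&queue_lock);

            /*
             * The expensive work runs here with nothing held, so blocking
             * on a pmap lock cannot stall unrelated faulting threads.
             */
            for (i = 0; i < n; i++)
                    (void)check_references(batch[i]);

            pthread_mutex_lock(&queue_lock);
            for (i = 0; i < n; i++)
                    requeue_locked(batch[i]);
            pthread_mutex_unlock(&queue_lock);
    }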
>
> I used some microbenchmarks to verify that the change did not penalize
> the single-threaded case.  Here are some results on a 64-core arm64
> system I have been playing with:
> https://people.freebsd.org/~markj/arm64_page_lock/
>
> The benchmark from will-it-scale simply maps 128MB of anonymous memory,
> faults on each page, and unmaps it, in a loop.  In the fault handler we
> allocate a page and insert it into the active queue, and the unmap
> operation removes all of those pages from the queue.  I collected the
> throughput for 1, 2, 4, 8, 16 and 32 concurrent processes.
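
A rough userspace approximation of that benchmark loop (not the actual
will-it-scale source; the iteration count here is arbitrary):

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
            const size_t len = 128UL * 1024 * 1024;   /* 128MB */
            long pagesize = sysconf(_SC_PAGESIZE);

            for (int iter = 0; iter < 1000; iter++) {
                    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_ANON | MAP_PRIVATE, -1, 0);
                    if (p == MAP_FAILED)
                            abort();
                    /* Touch one byte per page to fault each page in. */
                    for (size_t off = 0; off < len; off += (size_t)pagesize)
                            p[off] = 1;
                    munmap(p, len);
            }
            return (0);
    }

Running several copies of this loop concurrently corresponds to the 1
through 32 process data points above.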
>
> With my patches we see some modest gains at low concurrency.  At higher
> levels of concurrency we actually get lower throughput than before as
> contention moves from the page locks and the page queue lock to just the
> page queue lock.  I don't believe this is a real regression: first, the
> benchmark is quite extreme relative to any useful workload, and second,
> arm64 suffers from using a much smaller batch size than amd64 for
> batched page queue operations.  Increasing the batch size improves the
> arm64 results somewhat.  Some earlier testing on a 2-socket Xeon system
> showed a
> similar pattern with smaller differences.
>

