svn commit: r312724 - in head/sys: sys vm

Mateusz Guzik mjguzik at gmail.com
Wed Jan 25 05:10:59 UTC 2017


On Wed, Jan 25, 2017 at 01:29:21PM +1100, Bruce Evans wrote:
> On Tue, 24 Jan 2017, Mateusz Guzik wrote:
> 
> >Log:
> > hwpmc: partially depessimize munmap handling if the module is not loaded
> >
> > HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
> > common (module not loaded) case.
> >...
> >Modified: head/sys/sys/pmckern.h
> >==============================================================================
> >--- head/sys/sys/pmckern.h	Tue Jan 24 21:48:57 2017	(r312723)
> >+++ head/sys/sys/pmckern.h	Tue Jan 24 22:00:16 2017	(r312724)
> >@@ -174,6 +174,9 @@ extern const int pmc_kernel_version;
> >/* PMC soft per cpu trapframe */
> >extern struct trapframe pmc_tf[MAXCPU];
> >
> >+/* Quick check if preparatory work is necessary */
> >+#define	PMC_HOOK_INSTALLED(cmd)	__predict_false(pmc_hook != NULL)
> 
> I'm still waiting for other __predict_ugly() macro invocations to be
> removed.
> 

Actually, pmc was already using the annotation, so this fits the local
style.

> The 2 new ones here have even less effect than most.  I couldn't measure
> the effect on makeworld of removing the PMC_HOOKS completely.  Removing
> KTRACE long ago also seemed to have no effect.  Unfortunately, it is
> impossible to remove procctl and other bloat that has grown in the
> syscall path, and it is easy to measure slowdowns from this.
> 

The kernel has a lot of systematic single- and multi-threaded slowness,
and it is unlikely that some branch removals/predictions in isolation
will make a measurable difference in a macrobenchmark.

Most of the slowdown is caused by avoidable atomic ops, which arguably
trump spurious branches even if the target cacheline is not ping-ponged
around. However, when a spurious or mostly-false branch shows up, I
don't see why not to plug it.

There is work being done on proper hotpatching support, which will
hopefully make plenty of branches go away.

Syscall handling is slowed down by a lot of branches which are typically
false. However, here the solution is to group them under one condition
which, if true, takes the kernel to the slow (current) path. That is,
ktrace, ptrace, capsicum and whatever else would set a
->td_specialhandling (or whatever) flag/bitfield in addition to whatever
they set now. There is additional slowness from packing/unpacking all
the arguments, but I don't know if that is worth changing.
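
Roughly, the idea is something like the following stand-alone sketch
(the TDS_* flags, the td_specialhandling field and the function names
are made up for illustration; they are not existing kernel interfaces):

#include <stdint.h>

#define TDS_KTRACE      0x01    /* thread is being ktraced */
#define TDS_PTRACE      0x02    /* thread is being ptraced */
#define TDS_CAPMODE     0x04    /* capsicum capability mode */

struct thread_sketch {
        uint32_t td_specialhandling;    /* OR of TDS_* bits */
};

/* Slow path: does all the per-facility work the fast path skipped. */
static void
syscall_prologue_slow(struct thread_sketch *td)
{
        (void)td;
        /* ktrace record, ptrace stop, capability-mode checks, ... */
}

static inline void
syscall_prologue(struct thread_sketch *td)
{
        /* One predicted-false branch instead of one branch per facility. */
        if (__builtin_expect(td->td_specialhandling != 0, 0))
                syscall_prologue_slow(td);
}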

For non-static syscalls, their handling can be postponed. Instead of
putting such syscalls directly into the table, they can go through a
proxy method which would handle the reference counting. That is,
knowledge of such syscalls would be removed from the common path
completely. It is only a matter of finding a nice way to get back the
syscall number.
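
A stand-alone sketch of what I mean (the struct layout is invented, and
passing the table slot to the handler is just one naive way to get the
syscall number back, not how sysent actually works):

#include <stdatomic.h>

/* Invented, simplified syscall table entry for the sketch. */
struct sysent_sketch {
        int (*sy_call)(struct sysent_sketch *slot, void *args);
};

/* Per-syscall state for a module-provided syscall. */
struct dyn_syscall {
        int             (*handler)(void *args);
        atomic_int      refcount;
};

#define NSYSCALLS       512
static struct sysent_sketch sysent_table[NSYSCALLS];
static struct dyn_syscall *dyn_syscalls[NSYSCALLS];

/*
 * Proxy installed in the table slot of every module-provided syscall.
 * The common dispatch path just calls sy_call and never learns that the
 * target is loadable; the proxy recovers the syscall number and does
 * the reference counting around the real handler.
 */
static int
dyn_syscall_proxy(struct sysent_sketch *slot, void *args)
{
        struct dyn_syscall *ds;
        int error, num;

        num = (int)(slot - sysent_table);       /* get the number back */
        ds = dyn_syscalls[num];
        atomic_fetch_add(&ds->refcount, 1);     /* pin the handler */
        error = ds->handler(args);
        atomic_fetch_sub(&ds->refcount, 1);
        return (error);
}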

I am *NOT* working on this though.

For the vfs layer, slowness includes:
- vget grabs both a holdcnt and a usecount, which gives 2 atomic ops.
  Then they have to be released, and that's another 2. I had a patch
  which made it so that a usecount implies a holdcnt, but it rotted and
  is only a part of the real fix.
- dropping references to vnodes too easily causes them to be reshuffled
  on the free list, which adds single-threaded slowness due to more work
  and a lock contention point. I don't have a good solution, but a total
  hack I came up with so far boils down to grabbing an extra reference
  in vget and letting the syncer drop it some time later.
- similarly, vputx relocks the vnode in exclusive mode to call the
  inactive routine. Said routine very often does nothing and/or gets
  away with a mere interlock. I have a hack which extends the interface
  so that the filesystem can be asked whether it wants to do inactive.
- VOP_* routines have their arguments packed, which the target fs has to
  unpack before use. The wrappers have several avoidable and mostly false
  branches. I would argue a new interface is needed.
- the main lockmgr routine takes 8 arguments, 2 more than what's
  passable in registers on amd64 with the SysV ABI. Then it proceeds to
  perform several branches. I have a patch introducing a fast path which
  avoids it all and falls back to the original lockmgr if there are
  issues getting the lock uncontested (a rough sketch of the idea is
  below this list). This gave me +4% more ops in a single-threaded stat
  benchmark. There is a LOCK_PROFILING bug somewhere in there I have to
  fix before committing.
- lockmgr locks are not adaptive. The facility itself does support
  the feature, but the code is disabled and no fs annotates its vnodes
  to use it if present.
- closing a vnode performs vn_start_write + vnode lock + VOP_CLOSE.
  The latter almost always has nothing to do, thus making the above
  work unnecessary. In a spirit similar to inactive handling, we can ask
  the fs what it wants to do. Or better yet, revamp the interface so
  that the fs calls the relevant helpers to do the locking it needs.

and so on.
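
To illustrate the lockmgr fast path mentioned above, a stand-alone
sketch (the lock word layout, flag and function names are all made up;
the real lockmgr state is far more involved):

#include <stdint.h>
#include <stdatomic.h>

/* Invented, simplified lock for the sketch. */
struct lk_sketch {
        atomic_uintptr_t lk_word;       /* 0 == unlocked */
};

#define LK_SKETCH_XLOCKED       ((uintptr_t)1)

/* Stand-in for the full routine: sleeping, shared queues, profiling, ... */
static int
lockmgr_slow(struct lk_sketch *lk, int flags)
{
        (void)lk;
        (void)flags;
        return (0);
}

/*
 * Fast path: a single compare-and-swap covers the uncontested exclusive
 * case; anything else falls back to the original routine.
 */
static inline int
lockmgr_fast_xlock(struct lk_sketch *lk, int flags)
{
        uintptr_t expected = 0;

        if (atomic_compare_exchange_strong(&lk->lk_word, &expected,
            LK_SKETCH_XLOCKED))
                return (0);             /* got the lock uncontested */
        return (lockmgr_slow(lk, flags));
}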

tl;dr the main point of this patch was to not lock pmc_sx if it can be
helped. I presume you don't have objections here.

The annotations and shifting the work around were done to combat part of
the systematic slowness. It is unclear whether you claim that such
actions have no effect in general, or that they just happen to not be
measurable in isolation.

They are not easily visible, as there is plenty of other slowness in
significantly more heavily used places.

> The above one is an even better obfuscation than most.  It is only invoked
> once, and it is context-dependent whether the false branch is the unusual
> case.
> 

The HWPMC_HOOKS option is present in the default kernel. I highly doubt
the majority of users load the hwpmc module to take advantage of it.
Things like this tend to be hotpatched, which will not be any faster
than the current branch if it has to be taken.

> >+		pkm.pm_address = (uintptr_t) NULL;
> >+		if (vm_map_lookup_entry(map, addr, &entry)) {
> >+			for (;
> >+			    entry != &map->header && entry->start < addr + size;
> >+			    entry = entry->next) {
> >+				if (vm_map_check_protection(map, entry->start,
> >+					entry->end, VM_PROT_EXECUTE) == TRUE) {
> >+					pkm.pm_address = (uintptr_t) addr;
> >+					pkm.pm_size = (size_t) size;
> >+					break;
> >+				}
> >			}
> >		}
> >	}
> 
> Predictions could also be implemented in a more hard-coded way using
> 'goto slowcase' and moving the slow case out of the way (assuming that
> the hardware predicts forward branches as not taken and the compiler
> doesn't reorder the code anyway).  This would be uglier, but
> not much more invasive than re-indenting the code after adding a test
> for the slow case.
> 

I tried that. clang folds the slow case back in if the hint is not
provided. That said, it can still be done to reduce the indentation level.
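
For reference, the layout in question looks roughly like this (a sketch
only; as noted, without the hint clang folds the slow case back in):

/*
 * 'goto slowcase' layout: the common path falls straight through and
 * the rarely executed work is moved past the return, relying on the
 * forward-branch-not-taken heuristic and on the compiler keeping the
 * block placement.
 */
static void
munmap_hook_sketch(int hook_installed)
{
        if (hook_installed)
                goto slowcase;
        return;
slowcase:
        /* look up the map entries, fill in pkm, call the hook, ... */
        return;
}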

-- 
Mateusz Guzik <mjguzik gmail.com>

