Fast sigblock (AKA rtld speedup)

Alfred Perlstein bright at mu.org
Tue Jan 8 15:02:30 UTC 2013


On 1/8/13 9:56 AM, Konstantin Belousov wrote:
> On Tue, Jan 08, 2013 at 01:09:51PM +0800, David Xu wrote:
>> On 2013/01/08 02:22, Konstantin Belousov wrote:
>>> Below is a forward of the patch for which I failed to obtain a private
>>> review.  Perhaps the list will generate more responses.
>>>
>>> Our rtld has a performance bottleneck, typically exposed by images
>>> with a lot of run-time relocation processing, and by C++
>>> exception handling.  We block signal delivery while rtld
>>> performs the relocations, as well as for dl_iterate_phdr(3) (the
>>> latter is used for finding the dwarf unwinding tables).
>>>
>>> The signal blocking is needed to allow rtld to resolve the symbols
>>> for the signal handlers in a safe way, but it also costs 2 syscalls
>>> of overhead per rtld entry.
>>>
>>> The proposed approach shaves off those two syscalls, doubling
>>> FreeBSD's performance on a (silly) throw/catch C++ microbenchmark.
>>>
>>> Date: Mon, 13 Aug 2012 15:26:00 +0300
>>> From: Konstantin Belousov <kostikbel at gmail.com>
>>>
>>> ...
>>>
>>> The basic idea is to implement sigprocmask() as a single write to a
>>> usermode address.  When the kernel needs to calculate the signal mask
>>> for the current thread, it takes into consideration a non-zero value of
>>> the word at the agreed-upon address.  Also, usermode is informed about
>>> signals which were put on hold while the fast sigblock was active.
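>>>
>>> To make that concrete, here is a simplified sketch of the usermode side
>>> of the protocol, distilled from the rtld patch below (the helper names
>>> are illustrative; error handling and the SETPTR registration are
>>> omitted):
>>>
>>>     static uint32_t fsigblock;	/* word shared with the kernel */
>>>
>>>     static void
>>>     block_signals(void)
>>>     {
>>>         /* No syscall: just bump the block count in the shared word. */
>>>         atomic_add_rel_32(&fsigblock, FAST_SIGBLOCK_INC);
>>>     }
>>>
>>>     static void
>>>     unblock_signals(void)
>>>     {
>>>         uint32_t oldval;
>>>
>>>         oldval = atomic_fetchadd_32(&fsigblock, -FAST_SIGBLOCK_INC);
>>>         /*
>>>          * Enter the kernel only when this was the outermost unblock
>>>          * and the kernel flagged a signal it put on hold meanwhile.
>>>          */
>>>         if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC))
>>>             __sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL);
>>>     }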
>>>
>>> As I said, in my measurements of a microbenchmark that did throw/catch
>>> in a loop, I see equal user and system time spent on an unpatched
>>> system, and the same user time with zero system time on a patched
>>> system.
>>>
>>> The patch can be improved further; e.g., it would be nice to allow rtld
>>> to fall back to sigprocmask(2) if the kernel does not support fast
>>> sigblock, to prevent a flag day.  Also, the mask enforced by fast
>>> sigblock can be made configurable.
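>>>
>>> (The fallback could be probed once in lockdflt_init(): if the
>>> FAST_SIGBLOCK_SETPTR call fails on a kernel lacking the syscall, keep
>>> the old sigprocmask(2) path.  A rough sketch of the idea, not part of
>>> the patch; the flag name is illustrative:
>>>
>>>     if (__sys_fast_sigblock(FAST_SIGBLOCK_SETPTR, &fsigblock) == 0)
>>>         have_fast_sigblock = 1;	/* new kernel: use the shared word */
>>>     else
>>>         have_fast_sigblock = 0;	/* old kernel: use sigprocmask(2) */
>>>
>>> with the lock methods testing the flag before choosing between the
>>> atomic ops and sigprocmask(2).)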
>>>
>>> Note that libthr already blocks signals by catching them, not using the
>>> rtld service in the first-line handler.  I tried to make the change in
>>> the spirit of the libthr interceptors, but a handoff to libthr appears
>>> too complicated to work.  In fact, libthr could be changed to use fast
>>> sigblock instead of wrapping sigaction, but that is out of the scope of
>>> this proposal right now.
>>>
>>> Please comment.
>>>
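>>> (A worked example of the word encoding defined in sys/sys/signalvar.h
>>> below: the block recursion count is kept in units of FAST_SIGBLOCK_INC
>>> (0x10), while bit 0 is the kernel-owned FAST_SIGBLOCK_PEND flag.  Two
>>> nested write-lock acquisitions make the word 0x20; if a signal arrives
>>> meanwhile, the kernel holds it and sets the word to 0x21; the first
>>> unblock's fetchadd returns 0x21 and does nothing, the second returns
>>> 0x11, which is exactly FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC, and only
>>> then is the FAST_SIGBLOCK_UNBLOCK syscall made to deliver the held
>>> signal.)
>>>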
>>> diff --git a/lib/libc/sys/Symbol.map b/lib/libc/sys/Symbol.map
>>> index 6888ea0..3b75539 100644
>>> --- a/lib/libc/sys/Symbol.map
>>> +++ b/lib/libc/sys/Symbol.map
>>> @@ -546,6 +546,7 @@ FBSDprivate_1.0 {
>>>    	__sys_extattr_set_link;
>>>    	_extattrctl;
>>>    	__sys_extattrctl;
>>> +	__sys_fast_sigblock;
>>>    	_fchdir;
>>>    	__sys_fchdir;
>>>    	_fchflags;
>>> diff --git a/libexec/rtld-elf/rtld_lock.c b/libexec/rtld-elf/rtld_lock.c
>>> index d1563e5..50c52c6 100644
>>> --- a/libexec/rtld-elf/rtld_lock.c
>>> +++ b/libexec/rtld-elf/rtld_lock.c
>>> @@ -43,6 +43,7 @@
>>>     */
>>>
>>>    #include <sys/param.h>
>>> +#include <sys/signalvar.h>
>>>    #include <signal.h>
>>>    #include <stdlib.h>
>>>    #include <time.h>
>>> @@ -59,8 +60,8 @@ typedef struct Struct_Lock {
>>>    	void *base;
>>>    } Lock;
>>>
>>> -static sigset_t fullsigmask, oldsigmask;
>>>    static int thread_flag;
>>> +static uint32_t fsigblock;
>>>
>>>    static void *
>>>    def_lock_create()
>>> @@ -111,18 +112,26 @@ def_rlock_acquire(void *lock)
>>>    }
>>>
>>>    static void
>>> +sig_fastunblock(void)
>>> +{
>>> +	uint32_t oldval;
>>> +
>>> +	oldval = atomic_fetchadd_32(&fsigblock, -FAST_SIGBLOCK_INC);
>>> +	if (oldval == (FAST_SIGBLOCK_PEND | FAST_SIGBLOCK_INC))
>>> +		__sys_fast_sigblock(FAST_SIGBLOCK_UNBLOCK, NULL);
>>> +}
>>> +
>>> +static void
>>>    def_wlock_acquire(void *lock)
>>>    {
>>>        Lock *l = (Lock *)lock;
>>> -    sigset_t tmp_oldsigmask;
>>>
>>>        for ( ; ; ) {
>>> -	sigprocmask(SIG_BLOCK, &fullsigmask, &tmp_oldsigmask);
>>> +	atomic_add_rel_32(&fsigblock, FAST_SIGBLOCK_INC);
>>>    	if (atomic_cmpset_acq_int(&l->lock, 0, WAFLAG))
>>>    	    break;
>>> -	sigprocmask(SIG_SETMASK, &tmp_oldsigmask, NULL);
>>> +	sig_fastunblock();
>>>        }
>>> -    oldsigmask = tmp_oldsigmask;
>>>    }
>>>
>>>    static void
>>> @@ -134,7 +143,7 @@ def_lock_release(void *lock)
>>>        	atomic_add_rel_int(&l->lock, -RC_INCR);
>>>        else {
>>>        	atomic_add_rel_int(&l->lock, -WAFLAG);
>>> -    	sigprocmask(SIG_SETMASK, &oldsigmask, NULL);
>>> +	sig_fastunblock();
>>>        }
>>>    }
>>>
>>> @@ -286,19 +295,7 @@ lockdflt_init()
>>>
>>>        memcpy(&lockinfo, &deflockinfo, sizeof(lockinfo));
>>>        _rtld_thread_init(NULL);
>>> -    /*
>>> -     * Construct a mask to block all signals except traps which might
>>> -     * conceivably be generated within the dynamic linker itself.
>>> -     */
>>> -    sigfillset(&fullsigmask);
>>> -    sigdelset(&fullsigmask, SIGILL);
>>> -    sigdelset(&fullsigmask, SIGTRAP);
>>> -    sigdelset(&fullsigmask, SIGABRT);
>>> -    sigdelset(&fullsigmask, SIGEMT);
>>> -    sigdelset(&fullsigmask, SIGFPE);
>>> -    sigdelset(&fullsigmask, SIGBUS);
>>> -    sigdelset(&fullsigmask, SIGSEGV);
>>> -    sigdelset(&fullsigmask, SIGSYS);
>>> +    __sys_fast_sigblock(FAST_SIGBLOCK_SETPTR, &fsigblock);
>>>    }
>>>
>>>    /*
>>> @@ -319,7 +316,10 @@ _rtld_thread_init(struct RtldLockInfo *pli)
>>>
>>>    	if (pli == NULL)
>>>    		pli = &deflockinfo;
>>> -
>>> +	else {
>>> +		fsigblock = 0;
>>> +		__sys_fast_sigblock(FAST_SIGBLOCK_UNSETPTR, NULL);
>>> +	}
>>>
>>>    	for (i = 0; i < RTLD_LOCK_CNT; i++)
>>>    		if ((locks[i] = pli->lock_create()) == NULL)
>>> diff --git a/sys/compat/freebsd32/syscalls.master b/sys/compat/freebsd32/syscalls.master
>>> index 0891e41..f9e8b9e 100644
>>> --- a/sys/compat/freebsd32/syscalls.master
>>> +++ b/sys/compat/freebsd32/syscalls.master
>>> @@ -997,3 +997,4 @@
>>>    				    uint32_t offset1, uint32_t offset2,\
>>>    				    uint32_t len1, uint32_t len2, \
>>>    				    int advice); }
>>> +532	AUE_NULL	NOPROTO	{ int fast_sigblock(int cmd, uint32_t *ptr); }
>>> diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c
>>> index 90f7311..8a3cd15 100644
>>> --- a/sys/kern/kern_exec.c
>>> +++ b/sys/kern/kern_exec.c
>>> @@ -1031,6 +1031,7 @@ exec_new_vmspace(imgp, sv)
>>>    	int error;
>>>    	struct proc *p = imgp->proc;
>>>    	struct vmspace *vmspace = p->p_vmspace;
>>> +	struct thread *td = curthread;
>>>    	vm_object_t obj;
>>>    	vm_offset_t sv_minuser, stack_addr;
>>>    	vm_map_t map;
>>> @@ -1039,6 +1040,10 @@ exec_new_vmspace(imgp, sv)
>>>    	imgp->vmspace_destroyed = 1;
>>>    	imgp->sysent = sv;
>>>
>>> +	td->td_pflags &= ~TDP_FAST_SIGBLOCK;
>>> +	td->td_sigblock_ptr = NULL;
>>> +	td->td_sigblock_val = 0;
>>> +
>>>    	/* May be called with Giant held */
>>>    	EVENTHANDLER_INVOKE(process_exec, p, imgp);
>>>
>>> diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
>>> index 2685a8b..3c8a2f7 100644
>>> --- a/sys/kern/kern_sig.c
>>> +++ b/sys/kern/kern_sig.c
>>> @@ -230,6 +230,7 @@ static int sigproptbl[NSIG] = {
>>>    };
>>>
>>>    static void reschedule_signals(struct proc *p, sigset_t block, int flags);
>>> +static sigset_t fastblock_mask;
>>>
>>>    static void
>>>    sigqueue_start(void)
>>> @@ -240,6 +241,16 @@ sigqueue_start(void)
>>>    	p31b_setcfg(CTL_P1003_1B_REALTIME_SIGNALS, _POSIX_REALTIME_SIGNALS);
>>>    	p31b_setcfg(CTL_P1003_1B_RTSIG_MAX, SIGRTMAX - SIGRTMIN + 1);
>>>    	p31b_setcfg(CTL_P1003_1B_SIGQUEUE_MAX, max_pending_per_proc);
>>> +	SIGFILLSET(fastblock_mask);
>>> +	SIG_CANTMASK(fastblock_mask);
>>> +	SIGDELSET(fastblock_mask, SIGILL);
>>> +	SIGDELSET(fastblock_mask, SIGTRAP);
>>> +	SIGDELSET(fastblock_mask, SIGABRT);
>>> +	SIGDELSET(fastblock_mask, SIGEMT);
>>> +	SIGDELSET(fastblock_mask, SIGFPE);
>>> +	SIGDELSET(fastblock_mask, SIGBUS);
>>> +	SIGDELSET(fastblock_mask, SIGSEGV);
>>> +	SIGDELSET(fastblock_mask, SIGSYS);
>>>    }
>>>
>>>    ksiginfo_t *
>>> @@ -2525,6 +2536,7 @@ issignal(struct thread *td, int stop_allowed)
>>>    	struct sigqueue *queue;
>>>    	sigset_t sigpending;
>>>    	int sig, prop, newsig;
>>> +	uint32_t oldval;
>>>
>>>    	p = td->td_proc;
>>>    	ps = p->p_sigacts;
>>> @@ -2541,6 +2553,32 @@ issignal(struct thread *td, int stop_allowed)
>>>    			SIG_STOPSIGMASK(sigpending);
>>>    		if (SIGISEMPTY(sigpending))	/* no signal to send */
>>>    			return (0);
>>> +
>>> +		/*
>>> +		 * Do fast sigblock if requested by usermode.  Since
>>> +		 * we do know that there was a signal pending at this
>>> +		 * point, set the FAST_SIGBLOCK_PEND as indicator for
>>> +		 * usermode to perform a dummy call to
>>> +		 * FAST_SIGBLOCK_UNBLOCK, which causes immediate
>>> +		 * delivery of postponed pending signal.
>>> +		 */
>>> +		if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) {
>>> +			if (td->td_sigblock_val != 0)
>>> +				SIGSETNAND(sigpending, fastblock_mask);
>>> +			if (SIGISEMPTY(sigpending)) {
>>> +				oldval = fuword32(td->td_sigblock_ptr);
>>> +				if (oldval == -1) {
>>> +					fetch_fast_sigblock_failed(td, 0);
>>> +					return (0);
>>> +				}
>>> +				oldval |= FAST_SIGBLOCK_PEND;
>>> +				if (suword32(td->td_sigblock_ptr, oldval) == -1)
>>> +					fetch_fast_sigblock_failed(td, 1);
>>> +				td->td_sigblock_val = oldval;
>>> +				return (0);
>>> +			}
>>> +		}
>>> +
>>>    		sig = sig_ffs(&sigpending);
>>>
>>>    		if (p->p_stops & S_SIG) {
>>> @@ -3456,3 +3494,92 @@ sigacts_shared(struct sigacts *ps)
>>>    	mtx_unlock(&ps->ps_mtx);
>>>    	return (shared);
>>>    }
>>> +
>>> +int
>>> +sys_fast_sigblock(struct thread *td, struct fast_sigblock_args *uap)
>>> +{
>>> +	struct proc *p;
>>> +	int error;
>>> +	uint32_t oldval;
>>> +
>>> +	error = 0;
>>> +	switch (uap->cmd) {
>>> +	case FAST_SIGBLOCK_SETPTR:
>>> +		if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0)
>>> +			error = EBUSY;
>>> +		else if (((uintptr_t)(uap->ptr) & (sizeof(uint32_t) - 1)) != 0)
>>> +			error = EINVAL;
>>> +		else {
>>> +			td->td_pflags |= TDP_FAST_SIGBLOCK;
>>> +			td->td_sigblock_ptr = uap->ptr;
>>> +		}
>>> +		break;
>>> +
>>> +	case FAST_SIGBLOCK_UNBLOCK:
>>> +		if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) {
>>> +			oldval = casuword32(td->td_sigblock_ptr,
>>> +			    FAST_SIGBLOCK_PEND, 0);
>>> +			if (oldval == (uint32_t)-1)
>>> +				error = EFAULT;
>>> +			else if (oldval != FAST_SIGBLOCK_PEND)
>>> +				error = EBUSY;
>>> +			else
>>> +				td->td_sigblock_val = 0;
>>> +		} else
>>> +			error = EDOOFUS;
>>> +
>>> +		/*
>>> +		 * Rely on normal ast mechanism to deliver pending
>>> +		 * signals to current thread.  But notify others about
>>> +		 * fake unblock.
>>> +		 */
>>> +		p = td->td_proc;
>>> +		if (error == 0 && p->p_numthreads != 1) {
>>> +			PROC_LOCK(p);
>>> +			reschedule_signals(p, td->td_sigmask, 0);
>>> +			PROC_UNLOCK(p);
>>> +		}
>>> +		break;
>>> +
>>> +	case FAST_SIGBLOCK_UNSETPTR:
>>> +		if ((td->td_pflags & TDP_FAST_SIGBLOCK) != 0) {
>>> +			error = copyin(td->td_sigblock_ptr, &oldval,
>>> +			    sizeof(uint32_t));
>>> +			if (error != 0)
>>> +				break;
>>> +			if (oldval != 0 && oldval != FAST_SIGBLOCK_PEND)
>>> +				error = EBUSY;
>>> +			else {
>>> +				td->td_pflags &= ~TDP_FAST_SIGBLOCK;
>>> +				td->td_sigblock_val = 0;
>>> +			}
>>> +		} else
>>> +			error = EDOOFUS;
>>> +		break;
>>> +	}
>>> +	return (error);
>>> +}
>>> +
>>> +void
>>> +fetch_fast_sigblock(struct thread *td)
>>> +{
>>> +
>>> +	if ((td->td_pflags & TDP_FAST_SIGBLOCK) == 0)
>>> +		return;
>>> +	td->td_sigblock_val = fuword32(td->td_sigblock_ptr);
>>> +	if (td->td_sigblock_val == -1)
>>> +		fetch_fast_sigblock_failed(td, 0);
>>> +}
>>> +
>>> +void
>>> +fetch_fast_sigblock_failed(struct thread *td, int write)
>>> +{
>>> +	ksiginfo_t ksi;
>>> +
>>> +	td->td_sigblock_val = 0;
>>> +	ksiginfo_init_trap(&ksi);
>>> +	ksi.ksi_signo = SIGSEGV;
>>> +	ksi.ksi_code = write ? SEGV_ACCERR :  SEGV_MAPERR;
>>> +	ksi.ksi_addr = td->td_sigblock_ptr;
>>> +	trapsignal(td, &ksi);
>>> +}
>>> diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c
>>> index 5aee684..77b250d 100644
>>> --- a/sys/kern/subr_syscall.c
>>> +++ b/sys/kern/subr_syscall.c
>>> @@ -131,6 +131,12 @@ syscallenter(struct thread *td, struct syscall_args *sa)
>>>    			    sa->callp, sa->args, 0);
>>>    #endif
>>>
>>> +		/*
>>> +		 * Fetch fast sigblock value at the time of syscall
>>> +		 * entry because sleepqueue primitives might call
>>> +		 * cursig().
>>> +		 */
>>> +		fetch_fast_sigblock(td);
>>>    		AUDIT_SYSCALL_ENTER(sa->code, td);
>>>    		error = (sa->callp->sy_call)(td, sa->args);
>>>    		AUDIT_SYSCALL_EXIT(error, td);
>>> diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c
>>> index 095bbdf..66e485b 100644
>>> --- a/sys/kern/subr_trap.c
>>> +++ b/sys/kern/subr_trap.c
>>> @@ -237,6 +237,7 @@ ast(struct trapframe *framep)
>>>    	 */
>>>    	if (flags & TDF_NEEDSIGCHK || p->p_pendingcnt > 0 ||
>>>    	    !SIGISEMPTY(p->p_siglist)) {
>>> +		fetch_fast_sigblock(td);
>>>    		PROC_LOCK(p);
>>>    		mtx_lock(&p->p_sigacts->ps_mtx);
>>>    		while ((sig = cursig(td, SIG_STOP_ALLOWED)) != 0)
>>> diff --git a/sys/kern/syscalls.master b/sys/kern/syscalls.master
>>> index f62dad7..28a9393 100644
>>> --- a/sys/kern/syscalls.master
>>> +++ b/sys/kern/syscalls.master
>>> @@ -951,5 +951,6 @@
>>>    				    off_t offset, off_t len); }
>>>    531	AUE_NULL	STD	{ int posix_fadvise(int fd, off_t offset, \
>>>    				    off_t len, int advice); }
>>> +532	AUE_NULL	STD	{ int fast_sigblock(int cmd, uint32_t *ptr); }
>>>    ; Please copy any additions and changes to the following compatability tables:
>>>    ; sys/compat/freebsd32/syscalls.master
>>> diff --git a/sys/sys/proc.h b/sys/sys/proc.h
>>> index 06df632..4899ca2 100644
>>> --- a/sys/sys/proc.h
>>> +++ b/sys/sys/proc.h
>>> @@ -272,6 +272,9 @@ struct thread {
>>>    	struct osd	td_osd;		/* (k) Object specific data. */
>>>    	struct vm_map_entry *td_map_def_user; /* (k) Deferred entries. */
>>>    	pid_t		td_dbg_forked;	/* (c) Child pid for debugger. */
>>> +	void		*td_sigblock_ptr; /* (k) uptr for fast sigblock. */
>>> +	uint32_t	td_sigblock_val;  /* (k) fast sigblock value at
>>> +					     kernel entry. */
>>>    #define	td_endzero td_sigmask
>>>
>>>    /* Copied during fork1() or create_thread(). */
>>> @@ -424,6 +427,7 @@ do {									\
>>>    #define	TDP_RESETSPUR	0x04000000 /* Reset spurious page fault history. */
>>>    #define	TDP_NERRNO	0x08000000 /* Last errno is already in td_errno */
>>>    #define	TDP_UIOHELD	0x10000000 /* Current uio has pages held in td_ma */
>>> +#define	TDP_FAST_SIGBLOCK 0x20000000 /* Fast sigblock active */
>>>
>>>    /*
>>>     * Reasons that the current thread can not be run yet.
>>> diff --git a/sys/sys/signalvar.h b/sys/sys/signalvar.h
>>> index 71685e7..68cca58 100644
>>> --- a/sys/sys/signalvar.h
>>> +++ b/sys/sys/signalvar.h
>>> @@ -250,6 +250,20 @@ typedef struct sigqueue {
>>>    /* Flags for ksi_flags */
>>>    #define	SQ_INIT	0x01
>>>
>>> +/*
>>> + * Fast_sigblock
>>> + */
>>> +#define	FAST_SIGBLOCK_SETPTR	1
>>> +#define	FAST_SIGBLOCK_UNBLOCK	2
>>> +#define	FAST_SIGBLOCK_UNSETPTR	3
>>> +
>>> +#define	FAST_SIGBLOCK_PEND	0x1
>>> +#define	FAST_SIGBLOCK_INC	0x10
>>> +
>>> +#ifndef _KERNEL
>>> +int __sys_fast_sigblock(int cmd, void *ptr);
>>> +#endif
>>> +
>>>    #ifdef _KERNEL
>>>
>>>    /* Return nonzero if process p has an unmasked pending signal. */
>>> @@ -329,6 +343,8 @@ extern struct mtx	sigio_lock;
>>>
>>>    int	cursig(struct thread *td, int stop_allowed);
>>>    void	execsigs(struct proc *p);
>>> +void	fetch_fast_sigblock(struct thread *td);
>>> +void	fetch_fast_sigblock_failed(struct thread *td, int write);
>>>    void	gsignal(int pgid, int sig, ksiginfo_t *ksi);
>>>    void	killproc(struct proc *p, char *why);
>>>    ksiginfo_t * ksiginfo_alloc(int wait);
>>>
>>>
>>>
>>> ----- End forwarded message -----
>>>
>> So you want to make the kernel share data with userland. The only thing
>> I don't like is that it adds overhead to every syscall; rumor has it
>> that our syscall path is already slower than others', so perhaps this is
>> not a problem?
>>
>> A long time ago, I proposed a schedctl() syscall to make the kernel
>> share data with userland; some old patches are still there:
>> http://people.freebsd.org/~davidxu/schedctl/
>>
>> Mine does not have such overhead, but it has another one: a memory page
>> is allocated in the kernel which cannot be swapped out and cannot be
>> freed until the process exits. The page is doubly mapped, into both the
>> kernel and userland; accessing the shared data from the kernel has zero
>> overhead, though.
> The initial version of the changes indeed remapped the user page into
> the KVA. Besides the issues you described regarding the page being wired
> for the whole life of the thread, as well as the one page frame in the KVA,
> there were other non-trivial problems. These were mostly related to fork(2)
> copying the address spaces, but not limited to that. Using direct user
> address access turned out to be much less intrusive.
>
> I completely agree with your note about the additional memory access in
> the syscall entry path. I do believe that it does not incur overhead
> under real-world loads, but it might skew some microbenchmarks.
> I did not notice any in the testing I did.
>
> Anyway, I am still not sure whether it is worth optimizing for the
> throw/catch microbenchmark at all. If the general consensus is yes, the
> patch needs some work before being committable anyway, mostly to allow a
> patched rtld to work on pre-patch kernels.
>
> Thank you for the reply.
I think it would be great if it works; I just haven't had time to fully
understand your more recent patch.  If you feel it's safe, maybe write a
few test programs to see what happens; that would be best.

-Alfred

