svn commit: r274242 - in user/dchagin/lemul/sys: kern sys

Fri Nov 7 23:19:10 UTC 2014

On Fri, 7 Nov 2014, Konstantin Belousov wrote:

> On Fri, Nov 07, 2014 at 04:25:08PM +0000, Dmitry Chagin wrote:
>> ...
>> @@ -1337,13 +1321,37 @@ sys_poll(td, uap)
>>  		}
>>  	} else
>>  		asbt = -1;
>> +
>> +	return (kern_poll(td, uap->fds, uap->nfds, asbt, precision));
>> +}
>> +
>> +static int
>> +kern_poll(struct thread *td, struct pollfd *fds, uint32_t nfds,
>> +    sbintime_t sbt, sbintime_t prec)
> uint32_t type fo nfds is weird and probably even wrong.
> Native syscalls use unsigned int for it, and this helper definitely
> should follow them.
>
>> +{
>> +	struct pollfd *bits;
>> +	struct pollfd smallbits[32];
>> +	int error;
>> +	size_t ni;
>> +
>> +	if (nfds > maxfilesperproc && nfds > FD_SETSIZE)
>> +		return (EINVAL);
> This line was moved from the original code intact, but it still
> looks strange.  What is the purpose of FD_SETSIZE test ?

This is a hack by peter IIRC.  Old versions have a comment explaining it.
I never liked it, but only changed the comment and fixed style bugs in
the code.

% Index: sys_generic.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
% retrieving revision 1.131
% diff -u -2 -r1.131 sys_generic.c
% --- sys_generic.c	5 Apr 2004 21:03:35 -0000	1.131
% +++ sys_generic.c	13 Aug 2009 11:21:29 -0000
% @@ -960,19 +945,27 @@
%  	 */
%  	mtx_lock(&Giant);
% +
%  	/*
% -	 * This is kinda bogus.  We have fd limits, but that is not
% -	 * really related to the size of the pollfd array.  Make sure
% -	 * we let the process use at least FD_SETSIZE entries and at
% -	 * least enough for the current limits.  We want to be reasonably
% -	 * safe, but not overly restrictive.
% +	 * Unlike for select(), the number of currently open files is
% +	 * not really related to the size of the of the array (the array
% +	 * may be packed or sparse).  We have limits on the maximum
% +	 * number of open files, but these are even less related to the
% +	 * size of the array and may be too large anyway.  Let the process
% +	 * use up to p->p_fd->fd_nfiles or FD_SETSIZE entries, whichever
% +	 * is greater.  The first limit allows packed arrays containing
% +	 * descriptors for all open files, and the second limit allows
% +	 * sparse arrays with room for filling in the gaps with the
% +	 * descriptors for about FD_SETSIZE files that will be opened
% +	 * later.  Standards require arrays with up to {OPEN_MAX} entries
% +	 * to work, but we can't support that because {OPEN_MAX} might be
% +	 * LONG_MAX.  We support {OPEN_MAX} entries because the default for
% +	 * OPEN_MAX is less than the default for FD_SETSIZE.  Someday we
% +	 * will have a reasonable limit on {OPEN_MAX} and then we can use
% +	 * it here.
%  	 */
% -	PROC_LOCK(td->td_proc);
% -	if ((nfds > lim_cur(td->td_proc, RLIMIT_NOFILE)) &&
% -	    (nfds > FD_SETSIZE)) {
% -		PROC_UNLOCK(td->td_proc);
% +	if (nfds > td->td_proc->p_fd->fd_nfiles && nfds > FD_SETSIZE) {
%  		error = EINVAL;
%  		goto done2;
%  	}
% -	PROC_UNLOCK(td->td_proc);
%  	ni = nfds * sizeof(struct pollfd);
%  	if (ni > sizeof(smallbits))

-current already has the style fixes.

Opps, I changed the code significantly by replacing the RLIMIT_NOFILE
limit by an fd_nfiles limit.  -current has the worse change of replacing
it by a maxfilesperproc limit.  My change has a better chance of not
breaking poll() on descriptors that are larger than the current limit
but are valid because they were opened when the limit was larger.  It
is a bug for both to limit the _number_ of fds using any limit on the
maximum fd.  My test allows an array containing all opened fds with
no duplicates, but not arrays containing all open fds with duplicates.

The comment has a bug for {OPEN_MAX}.  We only support the default OPEN_MAX
number of entries.  {OPEN_MAX} is the runtime value.  We don't support that
properly.  OPEN_MAX is broken by its existence, since the actual limit is
variable.  POSIX now specifies in more cases what happens to existing open
fds when the limits are reduced below the fd numbers.  That is for the
rlimits.  FreeBSD's administrative maxfiles and maxfilesperproc break
the rlimits in ways not conforming to POSIX or documented in FreeBSD if
these limits are actually changed (using sysctl), and are still applied
in buggy ways:

% /* ARGSUSED */

Bogus.  The args are used.

% int
% sys_getdtablesize(struct thread *td, struct getdtablesize_args *uap)
% {
% 	struct proc *p = td->td_proc;

Initialization in declaration.

% 	uint64_t lim;
% 
% 	PROC_LOCK(p);
% 	td->td_retval[0] =
% 	    min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc);

Type errors:
- min() is for u_int types, but is applied to int types.  This gives a
   bug only (?) when maxfilesproc is set to a silly negative value.
- lim_cur() returns type rlim_t (64 bits).  The possibility of the
   compiler warning about rruncation from this is broken by bogusly
   casting to int.  But this is normally harmless because it is
   difficult to set the rlimit to a value that won't fit in an int.

Old programs and libraries that try to be careful about fd limits mostly
use this syscall.  This is unportable, and newer programs should use
{OPEN_MAX} via sysconf().  But these interfaces are incompatible --
{OPEN_MAX} is POSIXly correct, but more broken in practice since it
doesn't apply the maxvilesperproc clamp.

% 	lim = racct_get_limit(td->td_proc, RACCT_NOFILE);
% 	PROC_UNLOCK(p);
% 	if (lim < td->td_retval[0])
% 		td->td_retval[0] = lim;

I don't understand racct.  It is almost undocumented.  RACCT_NOFILE is
undocumented.  But this clearly breaks the POSIX limits some more by
enforcing another undocumented clamp that sysconf() doesn't know about.

% 	return (0);
% }
% 
% static int
% getmaxfd(struct proc *p)
% {
% 	int maxfd;
% 
% 	PROC_LOCK(p);
% 	maxfd = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc);
% 	PROC_UNLOCK(p);
%

Style bug.

% 	return (maxfd);
% }

This is used internally.  It is bug for bug compatible with
getdtablesize() except possibly for handling of racct since it depends
on the caller to do racct stuff and callers do it in a different order.
This replaces several copies of similar code with the type errors and
1 copy of similar code without the type errors (2 copies without the
type errors in my version).

IIRC, POSIX now clarifies that operations with existing open fds are not
affected by the changes to the limits.  FreeBSD seems to conform to this
for RLIMIT_NOFILE except for the clamp in poll().  It is broken for
setting and reporting the limit that applies to new open fds.

Bruce