svn commit: r249859 - head/lib/libc/sys

Bruce Evans brde at optusnet.com.au
Fri Apr 26 18:37:14 UTC 2013


On Fri, 26 Apr 2013, Jilles Tjoelker wrote:

> On Thu, Apr 25, 2013 at 09:56:01PM +1000, Bruce Evans wrote:
>> On Wed, 24 Apr 2013, Jilles Tjoelker wrote:
>
>>> Log:
>>>  getdtablesize(2): Describe what this function actually does.
>>> ...
>>> .Sh NAME
>>> .Nm getdtablesize
>>> -.Nd get descriptor table size
>>> +.Nd get file descriptor limit
>
>> Now its name doesn't match its description, and the reason for this is
>> not documented.
>
> This seems to be the case on most systems that have this function.

No reason not to document it.

>> This function is almost obsolete.  In POSIX, it is spelled {OPEN_MAX}
>> or sysconf(__SC_OPEN_MAX).  It is sometimes misspelled OPEN_MAX.
>
> There is a difference between sysconf(_SC_OPEN_MAX) and getdtablesize():
> the latter also takes maxfilesperproc and rctl(8) rules into account.

Yes, only sysconf(_SC_OPEN_MAX) (and getrlimit(..., RLIMIT_NOFILE)) are
broken.

I'm not familiar with rctl.  It seems to break this some more: in
kern_descrip.c, all the places that use the rlimit clamp it to
maxfileperproc, so the effective rlimit is always the clamped value
and this is what sysconf() and getrlimit() should return too.  Now
for rctl, only 1 of these places (namely getdtablesize()) clamps it
further to the RACCT_NOFILE limit.  Other places include do_dup().
do_dup() checks the RACCT_NOFILE much later.  I think this gives the
same limit, but at least the errno when the limit RACCT_NOFILE limit
is exceeded but the others aren't is wrong in all cases.  (The early
errno for (new >= maxfd) is (flags & DUP_FCNTL ? EINVAL : EBADF).
The later errno for racct_set(... RACCT_NOFILE, new + 1) is always
EMFILE.  EMFILE is not even a possible errno for dup2().

>> I prepared to remove the broken definition of OPEN_MAX, but never committed
>> the final step.  /usr/src has very few misuses of OPEN_MAX now, so removing
>> the definition wouldn't be too hard.  Most uses are in compatibility
>> cruft.  E.g., the following from
>> crypto/openssh/openbsd-compat/bsd-closefrom.c
>> which is confused about related things:
>
>> [snip]
> If that code is compiled at all, it is a bug. We have closefrom() and
> OpenSSH should use it.

It almost certainly uses it.  However, a quick grep for getdtablesize()
shows many applications using it, and most of them use it to give the
top of a close() loop.  These should be converted to use something
like closefrom().  getdtablesize() is used much more than sysconf()
or getrlimit() for this.  This is done mostly in old BSD applications.
However, the worst uses that I noticed are in the relatively new ppp
application.  There are about 40 calls to getdtablesize() in /usr/src.
7 of these are in ppp.  1 of the 7 is for the close() loop.  4 of the
7 are for fcntl(... F_SETFD ...) loops.  The other 2 are for a home
made FD_SETSIZE and a home made FD_ZERO().  These essentially initialize
FD_SETSIZE to the variable getdtablesize().  A table of bits of that
size is allocated.  A table of bits of that size is cleared.  Since
getdtablesize() is variable, buffer overruns and underruns occur if
someone changes getdtablesize() while the process is running (it can
be changed by the maxfileperproc sysctl and now by rctl).  But the
huge table is not always passed to select() (its size doesn't seem to
be recorded anywhere, so it can also be overrun or underrun by
FD_SET() on it, where the fd for FD_SET is acquired before or after
getdtablesize() changes).

>> ... in 4.4BSD and FreeBSD, both sysconf(_SC_OPEN_MAX) are just wrappers for
>> the resource limit (sysconf() is a libc wrapper and getdtablesize() is
>> a syscall wrapper).  Actually, in FreeBSD, getdtablesize() is not even the
>> rlmint -- it is the min() of the rlimit and the global sysctl integer
>> maxfilesperproc.  Here the bug is in the rlimit.  For the rlimit,
>> maxfilesperproc is only used when the rlimit is set and when it is used
>> in the kernel.  But when the rlimit is returned to userland, via
>> getrlimit(), maxfilesperproc is not used, so the rlimit may be wrong if
>> maxfileperproc was lowered after setting the rlimit.
>
> I don't like the idea of rlimits changing of "their own will". Changing
> maxfileperproc at run time is going to be a bit nasty in any case but
> seems not as bad as changing rlimits of running processes.

I don't like this either.  maxfileperproc wouldn't exist if my axe was
sharper in 1995 when it was committed.  rctl is less hackish, but causes
similar problems.

Anyway, since the effective limit does change, getrlimit() should keep up
with the change like getdtablesize() does.  Otherwise processes can't tell
what the rlimit actually is, or needs to know that getdtablesize() must
be used instead of getrlimit() to determine the actual limit.

maxfileperproc and the rctl limit are only needed to prevent growth
of the table.  Once a large table has been allocated, it is not useful
to prevent dup2() from using a large fd for which space has already
been allocated.  But the clamps to maxfileperproc prevent this.  Maybe
the clamps to the rctl limit don't prevent this, since they are applied
later and only cause failure if the allocation fails.  But then since
they are applied at the same time in getdtablesize(), they prevent
applications from knowing the actual limit in another way.  I think this
is now further complicated by sparse tables.

>> Actually, {OPEN_MAX} is guaranteed by POSIX to be at least
>> {_POSIX_OPEN_MAX}, and {_POSIX_OPEN_MAX} is precisely 20.  But these
>> guarantees and similar ones for stdio's FOPEN_MAX have always been
>> broken in FreeBSD, since anyone can reduce the rlimit below 20.
>> ...
> Recent versions of POSIX allow {OPEN_MAX} to be based on the rlimit. In
> that case, it may change when the rlimit is changed.

I think this is more or less required (and implicit before).  But how
can anything work if it there is no API like closefrom() and there is
no way to determine the maximum open fd.  getdtablesize() would be
more useful if it actually returned the table size.  Then its old use
of closing all files up to the table size (less 1) would keep working
OK iff the table size is small.  Of course, the table must remain large
enough to keep holding the maximum open fd.  If getdtablesize() did that,
then it would match its name again.  But it hasn't matched since 386BSD
or earlier -- in FreeBSD-1, it returned essentially the same as now:
(just the rlimit, so it doesn't cover the maximum open fd if anyone
reduces the rlimit).

>>> The
>>> .Fn getdtablesize
>>> -system call returns the size of this table.
>>> +system call returns the maximum number of file descriptors
>>> +that the current process may open.
>
>> Actually, the process may open more than this number, after raising its
>> (soft) rlimit, if this is possible.
>
> True, but this requires different function calls than just ones
> allocating file descriptors.

Not too bad if the rlimit can only be changed by the process itself, but
I'd still like this to be documented explicitly.

POSIX doesn't have the complications of maxfileperproc or rctl.  These
are harder to document.  About all you can say is that the results returned
by getdtablesize(), getrlimit() and sysconf() are volatile so that they
become unusable before they can be used :-(.

>>> +.Xr closefrom 2 ,
>>> .Xr dup 2 ,
>>> -.Xr open 2 ,
>>> -.Xr select 2
>>> +.Xr getrlimit 2 ,
>>> +.Xr sysconf 2
>>> .Sh HISTORY
>>> The
>>> .Fn getdtablesize
>
> I suppose rctl(8) can be added here.

dup(2) needs similar changes (I didn't notice if you changed it
recently).  It says that `oldd' must be < getdtablesize(), but actually,
`oldd' must just be open.  It doesn't say that `newd' must be <
getdtablesize() (modulo complications for the rctl case), except
indirectly in the ERRORS section, it says that EBADF occurs when `newd'
exceeds the maximum allowable descriptor number.  Here "allowable" has
a wrong tense ("allowed" would make more sense), and what this maximum
is is not described.  POSIX uses better wording for this of course
("... when `fildes' is greater than or equal to {OPEN_MAX}").  This
would be better still with "greater than or equal" spelled ">=".

Bruce


More information about the svn-src-all mailing list