svn commit: r249859 - head/lib/libc/sys

Jilles Tjoelker jilles at stack.nl
Mon Apr 29 16:59:31 UTC 2013


On Sat, Apr 27, 2013 at 04:36:56AM +1000, Bruce Evans wrote:
> On Fri, 26 Apr 2013, Jilles Tjoelker wrote:

> > On Thu, Apr 25, 2013 at 09:56:01PM +1000, Bruce Evans wrote:
> >> On Wed, 24 Apr 2013, Jilles Tjoelker wrote:

> >>> Log:
> >>>  getdtablesize(2): Describe what this function actually does.
> >>> ...
> >>> .Sh NAME
> >>> .Nm getdtablesize
> >>> -.Nd get descriptor table size
> >>> +.Nd get file descriptor limit

> >> Now its name doesn't match its description, and the reason for this is
> >> not documented.

> > This seems to be the case on most systems that have this function.

> No reason not to document it.

I can do it if you provide text.

> >> This function is almost obsolete.  In POSIX, it is spelled {OPEN_MAX}
> >> or sysconf(__SC_OPEN_MAX).  It is sometimes misspelled OPEN_MAX.

> > There is a difference between sysconf(_SC_OPEN_MAX) and getdtablesize():
> > the latter also takes maxfilesperproc and rctl(8) rules into account.

> Yes, only sysconf(_SC_OPEN_MAX) (and getrlimit(..., RLIMIT_NOFILE)) are
> broken.

> I'm not familiar with rctl.  It seems to break this some more: in
> kern_descrip.c, all the places that use the rlimit clamp it to
> maxfileperproc, so the effective rlimit is always the clamped value
> and this is what sysconf() and getrlimit() should return too.  Now
> for rctl, only 1 of these places (namely getdtablesize()) clamps it
> further to the RACCT_NOFILE limit.  Other places include do_dup().
> do_dup() checks the RACCT_NOFILE much later.  I think this gives the
> same limit, but at least the errno when the limit RACCT_NOFILE limit
> is exceeded but the others aren't is wrong in all cases.  (The early
> errno for (new >= maxfd) is (flags & DUP_FCNTL ? EINVAL : EBADF).
> The later errno for racct_set(... RACCT_NOFILE, new + 1) is always
> EMFILE.  EMFILE is not even a possible errno for dup2().

Hmm, if we return [EBADF] there, we should also change
sysconf(_SC_OPEN_MAX) to return getdtablesize() instead of the soft
rlimit.

> >> I prepared to remove the broken definition of OPEN_MAX, but never committed
> >> the final step.  /usr/src has very few misuses of OPEN_MAX now, so removing
> >> the definition wouldn't be too hard.  Most uses are in compatibility
> >> cruft.  E.g., the following from
> >> crypto/openssh/openbsd-compat/bsd-closefrom.c
> >> which is confused about related things:

> >> [snip]
> > If that code is compiled at all, it is a bug. We have closefrom() and
> > OpenSSH should use it.

> It almost certainly uses it.  However, a quick grep for getdtablesize()
> shows many applications using it, and most of them use it to give the
> top of a close() loop.  These should be converted to use something
> like closefrom().  getdtablesize() is used much more than sysconf()
> or getrlimit() for this.  This is done mostly in old BSD applications.
> However, the worst uses that I noticed are in the relatively new ppp
> application.  There are about 40 calls to getdtablesize() in /usr/src.
> 7 of these are in ppp.  1 of the 7 is for the close() loop.  4 of the
> 7 are for fcntl(... F_SETFD ...) loops.  The other 2 are for a home
> made FD_SETSIZE and a home made FD_ZERO().  These essentially initialize
> FD_SETSIZE to the variable getdtablesize().  A table of bits of that
> size is allocated.  A table of bits of that size is cleared.  Since
> getdtablesize() is variable, buffer overruns and underruns occur if
> someone changes getdtablesize() while the process is running (it can
> be changed by the maxfileperproc sysctl and now by rctl).  But the
> huge table is not always passed to select() (its size doesn't seem to
> be recorded anywhere, so it can also be overrun or underrun by
> FD_SET() on it, where the fd for FD_SET is acquired before or after
> getdtablesize() changes).

Much of this is in uncommonly used code that is hard to test :(
A getdtablesize()/close() loop is not so bad because it is exactly
equivalent to closefrom() (modulo bugs in the former) but the
fcntl(F_SETFD, 1) loops are harder.

There are also various copies of popen in the tree (often slightly
modified) that create an array of getdtablesize() elements to track
popened file descriptors. This is safe as long as getdtablesize() does
not increase while the process is running, but also wasteful.

> >> ... in 4.4BSD and FreeBSD, both sysconf(_SC_OPEN_MAX) are just wrappers for
> >> the resource limit (sysconf() is a libc wrapper and getdtablesize() is
> >> a syscall wrapper).  Actually, in FreeBSD, getdtablesize() is not even the
> >> rlmint -- it is the min() of the rlimit and the global sysctl integer
> >> maxfilesperproc.  Here the bug is in the rlimit.  For the rlimit,
> >> maxfilesperproc is only used when the rlimit is set and when it is used
> >> in the kernel.  But when the rlimit is returned to userland, via
> >> getrlimit(), maxfilesperproc is not used, so the rlimit may be wrong if
> >> maxfileperproc was lowered after setting the rlimit.

> > I don't like the idea of rlimits changing of "their own will". Changing
> > maxfileperproc at run time is going to be a bit nasty in any case but
> > seems not as bad as changing rlimits of running processes.

> I don't like this either.  maxfileperproc wouldn't exist if my axe was
> sharper in 1995 when it was committed.  rctl is less hackish, but causes
> similar problems.

> Anyway, since the effective limit does change, getrlimit() should keep up
> with the change like getdtablesize() does.  Otherwise processes can't tell
> what the rlimit actually is, or needs to know that getdtablesize() must
> be used instead of getrlimit() to determine the actual limit.

I don't think getrlimit() is intended to allow programs to determine an
actual limit. There may be other limits than the setrlimit().

> maxfileperproc and the rctl limit are only needed to prevent growth
> of the table.  Once a large table has been allocated, it is not useful
> to prevent dup2() from using a large fd for which space has already
> been allocated.  But the clamps to maxfileperproc prevent this.  Maybe
> the clamps to the rctl limit don't prevent this, since they are applied
> later and only cause failure if the allocation fails.  But then since
> they are applied at the same time in getdtablesize(), they prevent
> applications from knowing the actual limit in another way.  I think this
> is now further complicated by sparse tables.

Only preventing growth of the table would still allow a process to
consume more of kern.maxfiles than is supposed to be possible.

> >> Actually, {OPEN_MAX} is guaranteed by POSIX to be at least
> >> {_POSIX_OPEN_MAX}, and {_POSIX_OPEN_MAX} is precisely 20.  But these
> >> guarantees and similar ones for stdio's FOPEN_MAX have always been
> >> broken in FreeBSD, since anyone can reduce the rlimit below 20.
> >> ...
> > Recent versions of POSIX allow {OPEN_MAX} to be based on the rlimit. In
> > that case, it may change when the rlimit is changed.

> I think this is more or less required (and implicit before).  But how
> can anything work if it there is no API like closefrom() and there is
> no way to determine the maximum open fd.  getdtablesize() would be
> more useful if it actually returned the table size.  Then its old use
> of closing all files up to the table size (less 1) would keep working
> OK iff the table size is small.  Of course, the table must remain large
> enough to keep holding the maximum open fd.  If getdtablesize() did that,
> then it would match its name again.  But it hasn't matched since 386BSD
> or earlier -- in FreeBSD-1, it returned essentially the same as now:
> (just the rlimit, so it doesn't cover the maximum open fd if anyone
> reduces the rlimit).

Various uses of getdtablesize() are indeed for the limit, and trying to
imitate closefrom() is an incorrect and slow use of getdtablesize().

A closefrom() API was explicitly rejected for POSIX in
http://www.austingroupbugs.net/view.php?id=149 because closing arbitrary
file descriptors could render the application non-conforming (for
example, tracing might be implemented using a file descriptor, or if
fexecve() is used on an interpreted (#!) file).

Instead, it was suggested to allow making all new file descriptors
close-on-exec and to require standard functions to make all internal
file descriptors close-on-exec. Some of this is already implemented, see
https://wiki.freebsd.org/AtomicCloseOnExec .

> >>> The
> >>> .Fn getdtablesize
> >>> -system call returns the size of this table.
> >>> +system call returns the maximum number of file descriptors
> >>> +that the current process may open.

> >> Actually, the process may open more than this number, after raising its
> >> (soft) rlimit, if this is possible.

> > True, but this requires different function calls than just ones
> > allocating file descriptors.

> Not too bad if the rlimit can only be changed by the process itself, but
> I'd still like this to be documented explicitly.

> POSIX doesn't have the complications of maxfileperproc or rctl.  These
> are harder to document.  About all you can say is that the results returned
> by getdtablesize(), getrlimit() and sysconf() are volatile so that they
> become unusable before they can be used :-(.

> >>> +.Xr closefrom 2 ,
> >>> .Xr dup 2 ,
> >>> -.Xr open 2 ,
> >>> -.Xr select 2
> >>> +.Xr getrlimit 2 ,
> >>> +.Xr sysconf 2
> >>> .Sh HISTORY
> >>> The
> >>> .Fn getdtablesize

> > I suppose rctl(8) can be added here.

> dup(2) needs similar changes (I didn't notice if you changed it
> recently).  It says that `oldd' must be < getdtablesize(), but actually,
> `oldd' must just be open.  It doesn't say that `newd' must be <
> getdtablesize() (modulo complications for the rctl case), except
> indirectly in the ERRORS section, it says that EBADF occurs when `newd'
> exceeds the maximum allowable descriptor number.  Here "allowable" has
> a wrong tense ("allowed" would make more sense), and what this maximum
> is is not described.  POSIX uses better wording for this of course
> ("... when `fildes' is greater than or equal to {OPEN_MAX}").  This
> would be better still with "greater than or equal" spelled ">=".

Probably. To be looked at later.

-- 
Jilles Tjoelker


More information about the svn-src-head mailing list