maxfiles, file table, descriptors, etc...

Kevin A. Pieckiel kpieckiel-freebsd-hackers at smartrafficenter.org
Tue Apr 22 09:57:09 PDT 2003


On Mon, Apr 21, 2003 at 11:04:07AM -0700, Terry Lambert wrote:

> Things which are allocated by the zone allocator at interrupt
> time have a fixed amount of KVA that is set at boot time, before
> the VM system is fully up.  Even if it were not fully up, the
> way it works is by preallocating an address space range to be
> later filled in by physical pages (you cannot call malloc() at
> interrupt time, but you can take a fault and fill in a backing
> page).  So the zone size for sockets (inpcb's, tcpcb's) is fixed
> at boot time, even though it is derived from the "maxfiles".

This--plus the references to zalloci(), zalloc(), and malloc() you
gave--is starting to give me an understanding of it.  At least, I
recognize the differences you're explaining, as well as the logic
behind them.  This is really starting to get fascinating.
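
To make sure I follow the "preallocate an address range, fill in
physical pages later" idea, here is a rough userland analogy I put
together (just my own sketch with mmap()/mprotect(), not the kernel
code itself):

/* Userland sketch (not the kernel code): reserve a large address
 * range up front, then "commit" backing for part of it later --
 * roughly analogous to a ZONE_INTERRUPT zone reserving KVA at boot
 * and having physical pages filled in afterward. */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define RESERVED	(64UL * 1024 * 1024)	/* fixed "zone" size */
#define CHUNK		(1UL * 1024 * 1024)	/* committed on demand */

int
main(void)
{
	/* Reserve address space only; nothing usable behind it yet. */
	char *zone = mmap(NULL, RESERVED, PROT_NONE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (zone == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	/* Later: make part of the range usable; physical pages are
	 * only faulted in when the memory is first touched. */
	if (mprotect(zone, CHUNK, PROT_READ | PROT_WRITE) != 0) {
		perror("mprotect");
		return (1);
	}
	memset(zone, 0, CHUNK);
	printf("reserved %lu MB, committed %lu MB at %p\n",
	    RESERVED >> 20, CHUNK >> 20, (void *)zone);
	return (0);
}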


> A problem with the 5.x approach is that this means it's possible
> to get NULL returns from allocation routines, when the system is
> under memory pressure (because a mapping cannot be established),
> when certain of those routines are expected to *never* fail to
> obtain KVA space.

This is a bit unnerving--or so it would seem, though I'm lost on a
couple of points here.  First, you said:

> In 5.x, the zone limits are still fixed to a static boot-time
> settable only value -- the same value -- but the actual zone
> allocations take place later.

Okay, so basically the kernel is told it has a certain amount of
memory guaranteed to be available to it within a certain zone, when
in fact that memory is not guaranteed at all (because it's allocated
later, by which time it may already have been handed out for another
purpose).  I see how this links to your parenthetical statement:

>                                                             This
> is a serious problem, and has yet to be correctly addressed in
> the new allocator code (the problem occurs because the failure to
> obtain a mapping occurs before the zone in question hits its
> administrative limit).

What I fail to see is why this scheme is decidedly "better" than
that of the old memory allocator.  I understand from the vm source
that uma wants to avoid allocating pools of unused memory for the
kernel--allocating memory on an as-needed basis is a logical thing
to do.  But losing the guarantee that the allocation routines will
not fail, without adjusting the callers of those routines to match,
seems a bit dumb (since, as you state, the kernel panics).  I think
this might be a trouble spot for me because of another question....

What is the correct way to address this in the new allocator code?
I can come up with an option or two on my own... such as that to
which I've already alluded: memory allocation routines that once
guaranteed success can no longer be used in such a manner, thus the
calling functions must be altered to take this into account.  But
this is certainly not trivial!
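
For what it's worth, here is roughly what I have in mind -- a
hypothetical caller, not code from the tree, with zalloc_model()
standing in for the zone allocator:

/* Hypothetical illustration: how a caller that used to assume the
 * allocator could never fail would have to change once a NULL
 * return is possible under memory pressure. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct connpcb {
	int	flags;
	/* ... */
};

/* Stand-in for the zone allocator; may return NULL. */
static void *
zalloc_model(size_t size)
{
	return (malloc(size));
}

/* Old habit: assume success.  In the kernel, a NULL return here
 * means a panic on the dereference. */
static struct connpcb *
conn_attach_unsafe(void)
{
	struct connpcb *pcb = zalloc_model(sizeof(*pcb));

	pcb->flags = 0;			/* boom if pcb == NULL */
	return (pcb);
}

/* What callers would have to look like instead: check for NULL and
 * push a soft failure (e.g. ENOBUFS) back up the stack. */
static int
conn_attach_safe(struct connpcb **pcbp)
{
	struct connpcb *pcb = zalloc_model(sizeof(*pcb));

	if (pcb == NULL)
		return (ENOBUFS);
	pcb->flags = 0;
	*pcbp = pcb;
	return (0);
}

int
main(void)
{
	struct connpcb *pcb;

	if (conn_attach_safe(&pcb) != 0) {
		fprintf(stderr, "attach failed: out of buffers\n");
		return (1);
	}
	free(pcb);
	free(conn_attach_unsafe());	/* fine until memory runs out */
	return (0);
}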

And finally:

>                         Basically, everywhere that calls zalloci()
> is at risk of panic'ing under heavy load.

Am I missing a point here?  I can't find any reference to
zalloci() in the kernel source for 5.x (as of a 07 Apr 2003 cvs
update on HEAD), and these circumstances don't apply to 4.x (which,
of course, is where I DID find it after you mentioned it).


> Correct.  The file descriptors are dynamically allocated; or rather,
> they are allocated incrementally, as needed, and since this is not
> at interrupt time, the standard system malloc() can be used.

A quick tangent....  When file descriptors are assigned and handed to
a running program, are they guaranteed to start from zero (or three,
if you don't close stdin, stdout, and stderr)?  Or is that just an
implementation detail that happens to be common across Unixes?
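
Here is the little test I've been using to poke at that question --
my impression is that open() hands back the lowest unused descriptor
number, but I don't know how far that can be relied on:

/* Quick experiment: does a new descriptor reuse the lowest free slot? */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fd1, fd2;

	fd1 = open("/dev/null", O_RDONLY);
	printf("first open:  fd %d\n", fd1);	/* 3 here, 0-2 still open */

	close(0);				/* free up stdin's slot */
	fd2 = open("/dev/null", O_RDONLY);
	printf("second open: fd %d\n", fd2);	/* 0 on the systems I tried */

	close(fd1);
	close(fd2);
	return (0);
}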


> An interesting aside here is that the per process open file table,
> which holds references to file for the process, is actually
> allocated at power-of-2, meaning each time it needs to grow, the
> size is doubled, using realloc(), instead of malloc(), to keep the
> table allocation contiguous.  This means if you use a lot of files,
> it takes exponentially increasing time to open new files, since
> realloc has to double the size, and then copy everything.  For a
> few files, this is OK; for 100,000+ files (or network connections)
> in a single process, this starts to become a real source of overhead.

Now this _IS_ interesting.  I would think workloads requiring
100,000+ files or network connections, though not uncommon, are
certainly NOT the majority, but anyone running one would still have
a bone to pick with this implementation.  For example, a web
server--from which most users expect (demand?) fast response
times--that stalls to expand its file table in the middle of a
connection or request would seem to have unreasonable response
times.  One would think there is a better way.  How much of an issue
is this really?  (After all, I probably wouldn't have inquired about
file limits, etc., in the first place if I weren't intending to
implement something that will require a lot of connections.)
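
To get a feel for the cost Terry describes, I hacked together a toy
model of a table that doubles via realloc() whenever it fills up --
just my own sketch of the growth pattern, not the kernel's actual fd
code:

/* Toy model of a descriptor table that doubles when full.  Each
 * doubling copies the whole table, so an "open" that lands on a
 * growth step gets progressively more expensive, even though the
 * amortized cost per open stays small. */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	size_t cap = 20;	/* small initial table */
	size_t nfiles, copied = 0;
	int *table = malloc(cap * sizeof(*table));

	if (table == NULL)
		return (1);

	for (nfiles = 0; nfiles < 100000; nfiles++) {
		if (nfiles == cap) {
			cap *= 2;	/* power-of-2 growth */
			table = realloc(table, cap * sizeof(*table));
			if (table == NULL)
				return (1);
			copied += nfiles;	/* entries moved this time */
			printf("grew to %zu slots, %zu entries copied so far\n",
			    cap, copied);
		}
		table[nfiles] = (int)nfiles;	/* "open" one more file */
	}
	free(table);
	return (0);
}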

Excellent info, Terry.  Thanks for sharing it!

Kevin


pos += screamnext[pos]  /* does this goof up anywhere? */
-- Larry Wall in util.c from the perl source code

---
This message was signed by GnuPG.  E-Mail kpieckiel-pgp at smartrafficenter.org
to receive my public key.  You may also get my key from pgpkeys.mit.edu;
my ID is 0xF1604E92 and will expire on 01 January 2004.