'kern.maxpipekva exceeded' messages...

Mike Silbersack silby at silby.com
Tue Jan 27 09:57:23 PST 2004


On Tue, 27 Jan 2004, Dag-Erling Smørgrav wrote:

> My problem is not idle pipes; my problem is that the following system
>
> # sysctl kern.ipc.maxpipekva
> kern.ipc.maxpipekva: 8704000
> # sysctl kern.ipc.pipekva
> kern.ipc.pipekva: 393216
>
> runs out of pipe kva every monday morning when it tries to pipe a
> level 0 dump through ssh.
>
> Is there some way to impose a limit on the memory consumed by a single
> pipe?  I don't care if dump blocks waiting for ssh to push out the
> data, but I do care about the system crashing shortly after running
> out of pipe kva.

There is a limit: no single pipe can grow beyond BIG_PIPE_SIZE, which is
presently defined as 64K.  Well, unless there is a leak, of course. :)

Is it really crashing?  That's not supposed to happen. :(

It occurs to me that the maxpipekva exceeded printf may be misleading and
should be moved to pipe_create so that it is not triggered when pipespace
is called to resize a pipe buffer from pipe_write.  It's possible that
with 4K, 16K, and 64K pipes all sharing the same address space, we're
getting fragmentation which causes some large allocations to fail
prematurely.

> Another problem I have is with a system that runs out of pipe kva when
> I create a large number of jails.  I really need a way to find out
> where all that memory goes...

"fstat | grep pipe" should tell you all that you need to know; each pipe
is presently created with buffers of 16K in each direction (until you
reach half usage, when the size is dropped to 4K.)  So, in general, "fstat
| grep pipe | wc -l" * 16384 should add up to kern.ipc.pipekva.  Pipes
which have grown to 64K in size will break this assumption slightly.

Also note that the property above is an accident; fstat shows both sides
of a pipe, so we're really double-counting each pipe.  However, each pipe
is bidirectional (and few programs take advantage of that), so fstat's
doubling makes up for the fact that we're not taking the unused
direction's buffer into account. :)

> > If you're interested in working on this right now, I can send you what I
> > had planned to do for #1, it would be a very small amount of code,
> > although it would require a bit of testing to ensure that it does not
> > degrade the performance of pipes by a noticeable amount.
>
> That would be nice.  I have several systems I can test it on.
>
> DES

Ok, what it comes down to is that we account for the space allocated,
rather than the space actually used; trying to account for the space
actually used would turn this into a much more complex beast, and I tried
to avoid that.

So, in order to save memory, you'll need to change how much memory we
allocate and dynamically size the buffer upward as needed.  The first part
of this would be to _not_ allocate a buffer for the reverse direction of
the pipe.  To do this, you'll need to add an extra argument to pipe_create
which tells it whether to call pipespace or not, and then add code to
pipe_write which allocates space on the spot if anyone ever uses the
reverse direction of the pipe.  This will net you a 2x memory savings
right away at essentially zero cost.
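
To make that concrete, here's a rough, untested sketch of the idea; the
"backing" argument and the exact pipe_create/pipespace calls are only
illustrative, not a real diff against sys_pipe.c:

	/*
	 * Sketch only: assumes pipe_create() gains a "backing" flag and
	 * that pipespace() can be called lazily from pipe_write().
	 * Locking and most error handling are omitted.
	 */
	static int
	pipe_create(struct pipe *cpipe, int backing)
	{
		int error = 0;

		if (backing)
			error = pipespace(cpipe, PIPE_SIZE);
		/* else: leave pipe_buffer.buffer NULL until first write */
		return (error);
	}

	/* ...and in pipe_write(), before any data is copied in: */
	if (wpipe->pipe_buffer.buffer == NULL) {
		/* First write on the (normally unused) reverse direction. */
		error = pipespace(wpipe, SMALL_PIPE_SIZE);
		if (error)
			return (error);	/* real code would unwind locks first */
	}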

Secondly, you could change pipe_create so that pipespace is always told to
allocate SMALL_PIPE_SIZE pipes.  Then, go into pipe_write and find the
section of code under:

        /*
         * If it is advantageous to resize the pipe buffer, do
         * so.
         */

And rewrite the loop to something more like:

	int tempsize = wpipe->pipe_buffer.size;

	/* Double the target size until it covers the write (or hits the cap). */
	while ((uio->uio_resid > tempsize) &&
	    (tempsize < BIG_PIPE_SIZE) &&
	    (amountpipekva < maxpipekva / 2)) {
		tempsize *= 2;
	}

	/* Only grow a small, empty pipe that is not in direct-write mode. */
	if ((tempsize > wpipe->pipe_buffer.size) &&
	    ((wpipe->pipe_state & PIPE_DIRECTW) == 0) &&
	    (wpipe->pipe_buffer.size <= PIPE_SIZE) &&
	    (wpipe->pipe_buffer.cnt == 0)) {

		if ((error = pipelock(wpipe, 1)) == 0) {
			PIPE_UNLOCK(wpipe);
			pipespace(wpipe, tempsize);
			PIPE_LOCK(wpipe);
			pipeunlock(wpipe);
		}
	}

Note that I took out the amountbigpipe count; if you rewrite everything to
grow dynamically, the bigpipecount can probably be thrown out.  In fact,
you could probably increase BIG_PIPE_SIZE to 128K if that would improve
performance for some application.  On the other hand, maybe 32K is a
better limit... you'd have to do some testing to see how dynamic resizing
would affect the operation, which is why I didn't look into this much.

As far as the implementation of this change goes, it should be extremely
safe; pipespace has been resizing pipes upwards in size for years, so this
should be no different.

Memory savings here:  Well, PIPE_SIZE is 16K, SMALL_PIPE_SIZE is PAGE_SIZE
(4K on i386), and BIG_PIPE_SIZE is 64K.  So if you have all idle pipes,
this would save you 4x memory (up to the point where you reach half usage,
where *everything* is allocated as 4K), and it could also save you memory
if only 32K buffers are needed and we've been allocating 64K for some app.
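
To put rough numbers on it: 100 completely idle pipes currently tie up
100 * 2 * 16K = ~3.2MB of pipe KVA; with the reverse-direction buffer
skipped and a 4K initial size, the same 100 pipes would reserve only about
100 * 4K = 400K, assuming none of them ever grows.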

Now, there are a few implementation issues that may affect the performance
you see as a result of the preceding changes:

1.  pipespace can't resize if there is currently any data in the pipe.  I
believe that copying over the old data during a resize should be doable,
but I haven't attempted it (a rough sketch of what that copy might look
like follows this list).  Not allowing resizes may penalize an application
which writes an initial small piece of data, followed by larger blocks
which would warrant a resize.

1a. If you could resize with data currently in the buffer, then you could
also resize _down_, allowing pipes to shrink in size when memory is short.
This could be useful as well.

2.  Direct writes cannot be followed by non-direct writes until the buffer
has been emptied.  It seems possible that applications which do a large
write and then a small write may be unnecessarily blocked.  However,
changing this behavior would require a large rewrite, and I would not
recommend it unless you can generate statistics which prove that this is
an issue.

3.  Alc has mentioned that direct writes could be optimized a bit more by
not making the direct mapping until the read is performed.  However, as
most pipes never get into direct mode, this is mostly inconsequential.
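
For completeness, here is a rough sketch of what the data copy in #1/#1a
might look like if pipespace were split so the new buffer is mapped before
the old one is released.  The pipe_buffer field names are real, but
"newbuf" and "newsize" are assumed to come from the new allocation, and
locking is hand-waved:

	/*
	 * Sketch only: copy live data from the old buffer into the freshly
	 * allocated one so a non-empty pipe could be resized.  Assumes
	 * newbuf/newsize were just obtained from the pipe map and that
	 * cnt <= newsize; the completely-full edge case is ignored.
	 */
	if (cpipe->pipe_buffer.cnt > 0) {
		u_int cnt = cpipe->pipe_buffer.cnt;
		u_int out = cpipe->pipe_buffer.out;
		u_int size = cpipe->pipe_buffer.size;

		if (out + cnt <= size) {
			/* Data sits contiguously in the old buffer. */
			bcopy(&cpipe->pipe_buffer.buffer[out], newbuf, cnt);
		} else {
			/* Data wraps around the end of the old buffer. */
			u_int first = size - out;

			bcopy(&cpipe->pipe_buffer.buffer[out], newbuf, first);
			bcopy(&cpipe->pipe_buffer.buffer[0], &newbuf[first],
			    cnt - first);
		}
	}
	/* Free the old buffer here, then install the new one. */
	cpipe->pipe_buffer.buffer = newbuf;
	cpipe->pipe_buffer.size = newsize;
	cpipe->pipe_buffer.out = 0;
	cpipe->pipe_buffer.in = cpipe->pipe_buffer.cnt;	/* cnt unchanged */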

Overall, I think that implementing the first two changes from earlier in
the message and #1 above should not take much time at all, and would
provide a substantial memory savings.

Mike "Silby" Silbersack

