kern/158641: Writing > 8192 bytes to a pipe blocks signal handling

Thu Jul 7 02:21:29 UTC 2011

On Mon, 4 Jul 2011, Tom Hukins wrote:

>> Description:
>
> When a pipe has more than 8192 bytes written to it, the current process
> hangs and does not handle signals correctly.

It just blocks and does handle signals correctly.

If a pipe is open in not-O_NONBLOCK mode (as is the case here), write()s
of as little as 1 byte may block, depending on the pipe's buffering
mechanisms and how much is already buffered.  The first blockage occurs
when more than 8191 (not 8192) bytes are written starting from empty,
at least when there is only 1 write.  The details are undocumented, but
have something to do with the undocumented implementation detail
PIPE_MINDIRECT being 8192 (I'm surprised that blocking doesn't start
at the undocumented implementation limit PIPE_SIZE = 16384).

The write() can be terminated by a signal, but of course not by one sent
from the thread doing the write(), since that thread is blocked.

Use fcntl() to set O_NONBLOCK on the pipe write fd if you don't want the
writing thread to block.  Then the write() can fail much more easily, or
return a short count, so checking its return value is much more important.
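
A minimal sketch of that advice, in case it helps.  The helper names
set_nonblock() and try_write() are made up for illustration; only
fcntl(), write() and the errno values are standard.  It assumes a
single write() attempt is enough for the caller, who must then deal
with a short count:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode, preserving its other status flags.
 * Returns 0 on success, -1 on error (errno is set by fcntl()). */
int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: one write() attempt on a non-blocking fd.
 * Returns the number of bytes accepted (possibly fewer than len),
 * 0 if nothing could be written without blocking (EAGAIN), or -1 on
 * any other error.  Retries only when interrupted by a signal. */
ssize_t try_write(int fd, const void *buf, size_t len)
{
    ssize_t n;

    do {
        n = write(fd, buf, len);
    } while (n == -1 && errno == EINTR);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}
```

In the test program this would mean calling set_nonblock(pdes[1]) right
after pipe() succeeds, and comparing the result of try_write() with
strlen(argv[1]) instead of ignoring it.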

>> How-To-Repeat:

>    if ( pipe(pdes) != 0) {
> 	return 1;
>    }
>    signal(SIGALRM, catch_alrm);
>
>    int mypid = getpid();

Set O_NONBLOCK with fcntl() here or earlier.

>    write( pdes[1], argv[1], strlen(argv[1]) );

This blocks when the write() size is more than 8191...

>    kill(mypid, SIGALRM);

...so this is never reached.

Sending SIGALRM from another thread works correctly, except for bugs in
the signal handling: (1) it is unsafe to use printf() in a signal handler;
(2) the external alarm invokes the signal handler and also terminates the
write(), and after write() returns the program sends itself an alarm
signal, which invokes the signal handler again.

With O_NONBLOCK set using fcntl(), the write of course doesn't block, but
the behaviour is still surprising.  I expected a write of 8192 bytes
to return 8191 (since it would have blocked at 8192), but it actually
returned 8192.  This behaviour persists up to a write() size of PIPE_SIZE
(65536) -- the full amount is written.  It takes a write size of
(PIPE_SIZE + 1) for things to work unsurprisingly -- this writes
PIPE_SIZE and returns that since writing 1 more would block.  So
O_NONBLOCK not only prevents blocking but also changes the buffering
so that the buffer has the full size.
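
The measurements above can be repeated with something like the
following sketch.  probe_nonblocking_write() is a made-up name; it
makes a fresh pipe each time, so the write always starts from an empty
buffer with no reader active.  The numbers it returns depend on the
kernel, so none are claimed here:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns what a single non-blocking write() of `size` bytes to a
 * fresh, empty pipe returns.  A result of -1 means either a setup
 * failure or that the write() itself failed (for example EAGAIN). */
long probe_nonblocking_write(size_t size)
{
    int pdes[2];
    int flags;
    char *buf;
    ssize_t n;

    if (pipe(pdes) != 0)
        return -1;

    flags = fcntl(pdes[1], F_GETFL);
    if (flags == -1 || fcntl(pdes[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(pdes[0]);
        close(pdes[1]);
        return -1;
    }

    buf = malloc(size ? size : 1);
    if (buf == NULL) {
        close(pdes[0]);
        close(pdes[1]);
        return -1;
    }
    memset(buf, 'x', size);

    /* Nobody ever reads: we only care how much the kernel accepts. */
    n = write(pdes[1], buf, size);

    free(buf);
    close(pdes[0]);
    close(pdes[1]);
    return (long)n;
}
```

Calling it with sizes such as 8191, 8192, PIPE_SIZE and PIPE_SIZE + 1
and printing the results shows where the short counts start on a given
system.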

Review of the source code shows that this behaviour is intentional.
There are 2 completely different buffering methods.  Write()s of <
8192 bytes use a simple buffering method.  Normally, writes of between
8192 and 65536 bytes (inclusive) use a sophisticated "direct" "zero-copy"
method involving vm and no kernel buffers.  This tends to be faster,
but mainly in silly benchmarks.  But O_NONBLOCK turns off the direct
writing, so that writes of up to PIPE_SIZE (65536) are buffered simply
in kernel buffers of that size, and write() can write that much before
returning immediately.  I think the direct method is not used for these
writes simply because it cannot work for them -- for it to work there
must be a reader to own the buffers that are copied directly to, but
there may be no such reader, and write() cannot (or should not) simply
fail since it is required to (or should) supply at least PIPE_BUF (512)
bytes of buffering.  The critical test is whether a non-blocking write
of 8192 bytes is permitted to fail and return -1 with errno = EAGAIN
just because the implementation wants to use direct buffering but no
direct buffering is available (because there is no reader yet).  Note
that there is no problem for blocking write()s -- the kernel simply
blocks waiting for a reader.

This explains at a lower level why your program blocks, and why the
blockage is at 8192 and not at 65536: the kernel wants to use direct
writes, but can't do that since there is no reader owning the read
buffers; so the kernel waits for a reader; but no reader ever arrives
since the program is single-threaded and never supplies one.
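
To illustrate the other way out, here is a sketch that keeps the pipe
in blocking mode but supplies a reader, so a large write() has someone
to wait for.  write_with_reader() is a hypothetical name, and using
fork() for the reader is just one choice; a second thread would do as
well:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Writes `size` bytes through a blocking pipe while a forked child
 * drains it.  Returns 0 if the parent wrote everything and the child
 * read exactly `size` bytes, -1 otherwise. */
int write_with_reader(size_t size)
{
    int pdes[2];
    int status;
    pid_t pid;
    size_t done = 0;
    char *buf = malloc(size ? size : 1);

    if (buf == NULL)
        return -1;
    memset(buf, 'x', size);
    if (pipe(pdes) != 0) {
        free(buf);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        close(pdes[0]);
        close(pdes[1]);
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: the reader.  Count bytes until EOF. */
        char tmp[4096];
        size_t total = 0;
        ssize_t n;

        close(pdes[1]);
        while ((n = read(pdes[0], tmp, sizeof(tmp))) > 0)
            total += (size_t)n;
        _exit(n == 0 && total == size ? 0 : 1);
    }

    /* Parent: the writer.  write() may block, but the child unblocks it. */
    close(pdes[0]);
    while (done < size) {
        ssize_t n = write(pdes[1], buf + done, size - done);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            break;
        }
        done += (size_t)n;
    }
    close(pdes[1]);     /* gives the child its EOF */
    free(buf);

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (done == size && WIFEXITED(status) &&
        WEXITSTATUS(status) == 0) ? 0 : -1;
}
```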

Apart from being able to write PIPE_BUF bytes atomically, nothing is
guaranteed for the buffering of pipes.  The man page doesn't even
document PIPE_BUF.  Review of the POSIX spec shows that there seems
to be no requirement that writing of PIPE_BUF bytes ever succeeds.
Atomicity just means that a write of <= PIPE_BUF bytes transfers either
nothing or all of them.  Of course such a write may fail if the buffer
is too full to hold
PIPE_BUF more bytes.  In the direct case, we can weaselly say that the
buffer is always too full if there is no reader, so as not to have to
switch to the slower indirect (kernel buffering) method.  This is
technically justified since the buffer is too full if it doesn't exist,
but is probably too surprising.

Bruce