kern/158641: Writing > 8192 bytes to a pipe blocks signal handling

Thu Jul 7 02:30:11 UTC 2011

The following reply was made to PR kern/158641; it has been noted by GNATS.

From: Bruce Evans <brde at optusnet.com.au>
To: Tom Hukins <tom at FreeBSD.org>
Cc: FreeBSD-gnats-submit at FreeBSD.org, freebsd-bugs at FreeBSD.org
Subject: Re: kern/158641: Writing > 8192 bytes to a pipe blocks signal handling
Date: Thu, 7 Jul 2011 12:21:25 +1000 (EST)

 On Mon, 4 Jul 2011, Tom Hukins wrote:

 >> Description:
 >
 > When a pipe has more than 8192 bytes written to it, the current process
 > hangs and does not handle signals correctly.

 It just blocks and does handle signals correctly.

 If a pipe is open in not-O_NONBLOCK mode (as is the case here), write()s
 of as little as 1 byte may block, depending on the pipe's buffering
 mechanisms and how much is already buffered.  The first blockage occurs
 when more than 8191 (not 8192) bytes are written starting from empty,
 at least when there is only 1 write.  The details are undocumented, but
 have something to do with the undocumented implementation detail
 PIPE_MINDIRECT being 8192 (I'm surprised that blocking doesn't start
 at the undocumented implementation limit PIPE_SIZE = 65536).

 The write() can be terminated by a signal, but of course not by one sent
 by the thread doing the write(), since that thread is blocked.

 Use fcntl() to set O_NONBLOCK on the pipe's write fd if you don't want
 the writing thread to block.  Then the write() can fail much more easily,
 or return a short count, so checking its return value is much more
 important.
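
 As a minimal sketch of that advice: the helper names set_nonblocking()
 and try_write() below are hypothetical (they are not from the PR), and
 treating "pipe full" as a return of 0 is just one possible convention.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: add O_NONBLOCK to fd's status flags while keeping
 * the flags that are already set.  Returns 0 on success, -1 on failure.
 */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Hypothetical helper: make one non-blocking write attempt.
 * Returns the number of bytes accepted (possibly fewer than len),
 * 0 if the pipe cannot take anything right now (EAGAIN), or -1 on any
 * other error.  Note that a request with len == 0 also returns 0.
 */
ssize_t try_write(int fd, const void *buf, size_t len)
{
    ssize_t n;

    do {
        n = write(fd, buf, len);
    } while (n == -1 && errno == EINTR);   /* retry if interrupted early */

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}
```

 A caller would run set_nonblocking(pdes[1]) once after pipe(), then
 compare each try_write() result against the requested length and keep
 the unwritten tail for a later attempt.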

 >> How-To-Repeat:

 >    if ( pipe(pdes) != 0) {
 > 	return 1;
 >    }
 >    signal(SIGALRM, catch_alrm);
 >
 >    int mypid = getpid();

 Use fcntl() here or earlier to set O_NONBLOCK on pdes[1].

 >    write( pdes[1], argv[1], strlen(argv[1]) );

 This blocks when the write() size is more than 8191...

 >    kill(mypid, SIGALRM);

 ...so this is never reached.

 Sending SIGALRM from another thread works correctly, except for bugs in
 the signal handling: (1) it is unsafe to use printf() in a signal handler;
 (2) the external alarm invokes the signal handler and also terminates the
 write(), and after write() returns the program sends itself an alarm
 signal, which invokes the signal handler again.

 With fcntl() to O_NONBLOCK, the write of course doesn't block, but
 the behaviour is still surprising.  I expected a write of 8192 bytes
 to return 8191 (since it would have blocked at 8192), but it actually
 returned 8192.  This behaviour persists up to a write() size of PIPE_SIZE
 (65536) -- the full amount is written.  It takes a write size of
 (PIPE_SIZE + 1) for things to work unsurprisingly -- this writes
 PIPE_SIZE and returns that since writing 1 more would block.  So
 O_NONBLOCK not only prevents blocking but also changes the buffering
 so that the buffer has the full size.
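
 Since these limits are undocumented, it is safer to measure them than
 to assume them.  The following sketch repeats the experiment described
 above for one write size; nonblocking_write_once() is a hypothetical
 name, and the value it returns depends on the kernel.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical probe: create a fresh (empty) pipe, set O_NONBLOCK on the
 * write end, and attempt a single write() of len zero bytes.
 * Returns whatever write() returned, or -1 if the setup failed.
 */
ssize_t nonblocking_write_once(size_t len)
{
    int pdes[2];
    ssize_t n = -1;
    char *buf;

    if (pipe(pdes) != 0)
        return -1;

    buf = calloc(len ? len : 1, 1);
    if (buf != NULL) {
        int flags = fcntl(pdes[1], F_GETFL);

        if (flags != -1 &&
            fcntl(pdes[1], F_SETFL, flags | O_NONBLOCK) != -1)
            n = write(pdes[1], buf, len);   /* single attempt, no retry */
        free(buf);
    }
    close(pdes[0]);
    close(pdes[1]);
    return n;
}
```

 Calling it with 8192, PIPE_SIZE and PIPE_SIZE + 1 and printing the
 results is enough to compare a given system against the numbers quoted
 here.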

 Review of the source code shows that this behaviour is intentional.
 There are 2 completely different buffering methods.  Write()s of <
 8192 bytes use a simple buffering method.  Normally, writes of between
 8192 and 65536 bytes (inclusive) use a sophisticated "direct" "zero-copy"
 method involving vm and no kernel buffers.  This tends to be faster,
 but mainly in silly benchmarks.  But O_NONBLOCK turns off the direct
 writing, so that writes of up to PIPE_SIZE (65536) are buffered simply
 in kernel buffers of that size, and write() can write that much before
 returning immediately.  I think the direct method is not used for these
 writes simply because it cannot work for them -- for it to work there
 must be a reader to own the buffers that are copied directly to, but
 there may be no such reader, and write() cannot (or should not) simply
 fail since it is required to (or should) supply at least PIPE_BUF (512)
 bytes of buffering.  The critical test is whether a non-blocking write
 of 8192 bytes is permitted to fail and return -1 with errno = EAGAIN
 just because the implementation wants to use direct buffering but no
 direct buffering is available (because there is no reader yet).  Note
 that there is no problem for blocking write()s -- the kernel simply
 blocks waiting for a reader.

 This explains at a lower level why your program blocks, and why the
 blockage is at 8192 and not at 65536: the kernel wants to use direct
 writes, but can't do that since there is no reader owning the read
 buffers; so the kernel waits for a reader; but no reader ever arrives
 since the program is single-threaded and never supplies one.
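
 To illustrate the point about the missing reader, here is a sketch in
 which a forked child drains the pipe while the parent does blocking
 writes.  The function name write_with_reader() is hypothetical, and
 fork() is only one way to supply a reader (a second thread would also
 do).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: write len bytes to a blocking pipe while a child
 * process reads from the other end.
 * Returns len if everything was written, -1 otherwise.
 */
ssize_t write_with_reader(size_t len)
{
    int pdes[2];
    char *buf;
    pid_t pid;
    size_t done = 0;

    if (pipe(pdes) != 0)
        return -1;
    buf = malloc(len ? len : 1);
    if (buf == NULL) {
        close(pdes[0]);
        close(pdes[1]);
        return -1;
    }
    memset(buf, 'x', len);

    pid = fork();
    if (pid == -1) {
        free(buf);
        close(pdes[0]);
        close(pdes[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the reader.  Drain the pipe until EOF, then exit. */
        char tmp[4096];

        close(pdes[1]);
        while (read(pdes[0], tmp, sizeof tmp) > 0)
            ;
        _exit(0);
    }

    /* Parent: the writer.  Each write() may block until the child reads. */
    close(pdes[0]);
    while (done < len) {
        ssize_t n = write(pdes[1], buf + done, len - done);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything */
            break;
        }
        done += (size_t)n;
    }
    close(pdes[1]);             /* gives the child its EOF */
    free(buf);
    waitpid(pid, NULL, 0);
    return done == len ? (ssize_t)done : -1;
}
```

 With a reader present, a call such as write_with_reader(8192) should
 return instead of hanging, because the kernel no longer has to wait for
 a reader that never arrives.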

 Apart from being able to write PIPE_BUF bytes atomically, nothing is
 guaranteed for the buffering of pipes.  The man page doesn't even
 document PIPE_BUF.  Review of the POSIX spec shows that there seems
 to be no requirement that writing of PIPE_BUF bytes ever succeeds.
 Atomicity just means that the write is of either nothing or >= PIPE_BUF
 bytes.  Of course such a write may fail if the buffer is too full to hold
 PIPE_BUF more bytes.  In the direct case, we can weaselly say that the
 buffer is always too full if there is no reader, so as not to have to
 switch to the slower indirect (kernel buffering) method.  This is
 technically justified, since the buffer is too full if it doesn't exist,
 but is probably too surprising.
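
 Since PIPE_BUF is the only guaranteed figure, a program that depends on
 it can ask for the value at run time instead of hard-coding 512.  A
 minimal sketch using fpathconf(); pipe_buf_limit() is a hypothetical
 name.

```c
#include <unistd.h>

/*
 * Hypothetical helper: report the PIPE_BUF limit for a freshly created
 * pipe.  Returns the limit, or -1 if pipe() fails or fpathconf() cannot
 * determine a value.
 */
long pipe_buf_limit(void)
{
    int pdes[2];
    long limit;

    if (pipe(pdes) != 0)
        return -1;
    limit = fpathconf(pdes[1], _PC_PIPE_BUF);
    close(pdes[0]);
    close(pdes[1]);
    return limit;
}
```

 POSIX requires the value to be at least 512 (_POSIX_PIPE_BUF); writes
 no larger than the reported limit are the ones covered by the atomicity
 guarantee.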

 Bruce