kern/94772: FIFOs (named pipes) + select() == broken

Wed Mar 22 01:09:24 UTC 2006

On Tue, 21 Mar 2006, Oliver Fromme wrote:

>> Description:
>
> I recently wondered why several of my scripts that use
> a named pipe (FIFO) don't work on FreeBSD.
>
> After some debugging it turned out that select() seems
> to be broken when used with FIFOs on FreeBSD 6.
> Particularly, this is the bug I'm suffering from:
>
> When a FIFO had been opened for reading and the writing
> process closes it, the reading process blocks in select(),
> even though the descriptor is ready for read().  If the
> select() call is omitted, read() returns 0 immediately
> indicating EOF.  But as soon as you use select(), it blocks
> and there is no chance to detect the EOF condition.

See also:

PR 76125 (about the same bug)
PR 76144 (about a related bug in poll())
PR 53447 (about a another or the same related bug in poll())
PR 34020 (about the inverse of the bug (return on non-EOF) for select()
           and poll())

The fix for PR 34020 inverted (a symptom of) a bug for poll() to
give a worse bug, and gave the (inverted) bug for select() where there
was no bug before.  fifo_poll() now instructs lower layers to ignore
EOF by setting POLLINIGNEOF.  This moves the bug for poll() and isn't
directly harmful for poll(), but it is directly harmful for select()
and it makes the real bug for poll() more harmful.  The real bug for
poll() is that POLLHUP needs to be set to indicate EOF, but it isn't
actually set for many file types including named pipes.  So we now
always get no indication of EOF where we should get POLLHUP or
POLLIN | POLLHUP.  Previously, we always got EOF indicated by POLLIN,
and we also got EOF indicated by POLLIN in the tricky case where other
systems don't indicate EOF.

The tricky case is for a named pipe that has not had any writers during
the lifetime of current readers or thereabouts (races seem to be a
problem).  Such a pipe can never have had a connection on it, so it
cannot be in the hangup state and it is clear that poll() should not
return POLLHUP for it.  It is less clear that select() and poll()
should block waiting for a writer, but that is what other systems do
for poll() at least.  I fixed FreeBSD long ago to not always block in
read() on a named piped when there are no current writers, since
blocking in read() is just wrong in the O_NONBLOCK case.  This had the
side effect of making select() never block when there are no current
writers.  poll() inherited this behaviour from select().  This behaviour
is wrong since select()/poll are the only reasonable ways to block
waiting for a writer, but it doesn't cause many problems.

>> How-To-Repeat:
>
> Please see the small test program below.  Compile it like this:
> cc -O -o fifotest fifotest.c
> ...
> The same test program (with "err()" replaced by a small
> self-made function) runs without error on all other UNIX
> systems that I've tried:  Linux 2.4.32, Solaris 10, and
> DEC UNIX 4.0 (predecessor of Tru64).  By the way, it's
> even sufficient to do "cat /dev/null > fifo", i.e. not
> writing anything at all, but issuing EOF immediately.
> Under FreeBSD, nothing happens at all in that case.
> All other UNIX systems recognize EOF (select() returns).

Here is a program that tests more cases.  I made it give no output
(for no errors) under Linux-2.6.10.  It also gives no output for
the nameless pipe case under FreeBSD-4.10 and FreeBSD-oldcurrent
and for the named piped case under FreeBSD-4.10, but it fails with
(only) the error in this PR under FreeBSD-oldcurrent.  Please test
it on Solaris etc.  Compile it with -DNAMEDPIPE for the named pipe
case.  In this case, it creates and leaves a fifo "p" in the current
directory (or doesn't handle the error if "p" exists but is not
an accessible fifo) but shouldn't have any other side effects.

%%%
#include <sys/select.h>
#include <sys/stat.h>

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static pid_t cpid;
static pid_t ppid;
static volatile sig_atomic_t state;

static void
catch(int sig)
{
 	state++;
}

static void
child(int fd)
{
 	fd_set rfds;
 	struct timeval tv;
 	char buf[256];

#ifdef NAMEDPIPE
 	fd = open("p", O_RDONLY | O_NONBLOCK);
 	if (fd < 0)
 		err(1, "open for read");
#endif
 	kill(ppid, SIGUSR1);

 	/* XXX should check that fd fits in rfds. */

 	usleep(1);
 	while (state != 1)
 		;
#ifndef NAMEDPIPE
 	/*
 	 * The connection cannot be restablished.  Use the code that delays
 	 * the read until after the writer disconnects since that case is
 	 * more interesting.
 	 */
 	state = 4;
 	goto state4;
#endif
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (FD_ISSET(fd, &rfds))
 		warnx("state 1: expected clear; got set");
 	kill(ppid, SIGUSR1);

 	usleep(1);
 	while (state != 2)
 		;
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (!FD_ISSET(fd, &rfds))
 		warnx("state 2: expected set; got clear");
 	if (read(fd, buf, sizeof buf) != 1)
 		err(1, "read");
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (FD_ISSET(fd, &rfds))
 		warnx("state 2a: expected clear; got set");
 	kill(ppid, SIGUSR1);

 	usleep(1);
 	while (state != 3)
 		;
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (!FD_ISSET(fd, &rfds))
 		warnx("state 3: expected set; got clear");
 	kill(ppid, SIGUSR1);

 	/*
 	 * Now we expect a new writer, and a new connection too since
 	 * we read all the data.  The only new point is that we didn't
 	 * start quite from scratch since the read fd is not new.  Check
 	 * startup state as above, but don't do the read as above.
 	 */
 	usleep(1);
 	while (state != 4)
 		;
state4:
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (FD_ISSET(fd, &rfds))
 		warnx("state 4: expected clear; got set");
 	kill(ppid, SIGUSR1);

 	usleep(1);
 	while (state != 5)
 		;
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (!FD_ISSET(fd, &rfds))
 		warnx("state 5: expected set; got clear");
 	kill(ppid, SIGUSR1);

 	usleep(1);
 	while (state != 6)
 		;
 	/*
 	 * Now we have no writer, but should still have data from the old
 	 * writer. Check that we have both a data condition and a hangup
 	 * condition, and that the data can read the data in the usual way.
 	 * Since Linux does this, programs must not quite reading when they
 	 * see POLLHUP; they must see POLLHUP without POLLIN (or another
 	 * input condition) before they decide that there is EOF.  gdb-6.1.1
 	 * is an example of a broken program that quits on POLLHUP only --
 	 * see its event-loop.c.
 	 */
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (!FD_ISSET(fd, &rfds))
 		warnx("state 6: expected set; got clear");
 	if (read(fd, buf, sizeof buf) != 1)
 		err(1, "read");
 	FD_ZERO(&rfds);
 	FD_SET(fd, &rfds);
 	tv.tv_sec = 0;
 	tv.tv_usec = 0;
 	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
 		err(1, "select");
 	if (!FD_ISSET(fd, &rfds))
 		warnx("state 6a: expected set; got clear");
 	close(fd);
 	kill(ppid, SIGUSR1);
 	exit(0);
}

static void
parent(int fd)
{
 	usleep(1);
 	while (state != 1)
 		;
#ifdef NAMEDPIPE
 	fd = open("p", O_WRONLY | O_NONBLOCK);
 	if (fd < 0)
 		err(1, "open for write");
#endif
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 2)
 		;
 	if (write(fd, "", 1) != 1)
 		err(1, "write");
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 3)
 		;
 	if (close(fd) != 0)
 		err(1, "close for write");
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 4)
 	    ;
#ifndef NAMEDPIPE
 	return;
#endif
 	fd = open("p", O_WRONLY | O_NONBLOCK);
 	if (fd < 0)
 		err(1, "open for write");
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 5)
 		;
 	if (write(fd, "", 1) != 1)
 		err(1, "write");
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 6)
 		;
 	if (close(fd) != 0)
 		err(1, "close for write");
 	kill(cpid, SIGUSR1);

 	usleep(1);
 	while (state != 7)
 		;
}

int
main(void)
{
 	int fd[2];
 	int i;

#ifdef NAMEDPIPE
 	if (mkfifo("p", 0666) != 0 && errno != EEXIST)
 		err(1, "mkfifo");
#endif
 	signal(SIGUSR1, catch);
 	ppid = getpid();
 	for (i = 0; i < 2; i++) {
#ifndef NAMEDPIPE
 		if (pipe(fd) != 0)
 			err(1, "pipe");
#else
 		fd[0] = -1;
 		fd[1] = -1;
#endif
 		state = 0;
 		switch (cpid = fork()) {
 		case -1:
 			err(1, "fork");
 		case 0:
 			(void)close(fd[1]);
 			child(fd[0]);
 			break;
 		default:
 			(void)close(fd[0]);
 			parent(fd[1]);
 			break;
 		}
 	}
 	return (0);
}
%%%

The error output from this is:

%%%
selectp: state 3: expected set; got clear
selectp: state 6a: expected set; got clear
selectp: state 3: expected set; got clear
selectp: state 6a: expected set; got clear
%%%

These messages are all caused by the same bug.  "state 3" is EOF without
any data ever having been readable.  "state 6" is EOF with data readable
(the test for this passed so there is no output for it).  "state 6a" is
EOF after having read the data available in state 6.  It is good that
state 6a fails in the same way as state 3 -- at least the bug doesn't
seem to involve races.  The duplicate messages are caused by iterating
the test to see if the bug depends on previous activity on the pipe
(but the program is probably too careful cleaning up for iteration to
show problems).

A similar test program (not enclosed) shows many more bugs for poll():

- no output under Linux-2.6.10

- FreeBSD-4.10 on nameless pipes:
   % poll: state 6a: expected POLLHUP; got 0x11
   % poll: state 6a: expected POLLHUP; got 0x11
     0x11 is POLLIN | POLLHUP.  Nameless pipes are one of the few file
     types for which POLLHUP is actually implemented (POLLHUP is also
     implemented (not quite right) for ttys but isn't implemented for
     any other important file type).  Linux returns only POLLHUP here.
     This is best, since it allows distinguishing the case of pure EOF
     from the case of EOF with data reasable.  However, buggy
     applications like gdb don't actually understand the difference
     between EOF-with-data and pure EOF (see the comment in the test
     program).  Also, select() depends on pipe_poll() returning POLLIN
     to work.

- FreeBSD-oldcurrent on nameless pipes:
   % poll: state 6a: expected POLLHUP; got 0x11
   % poll: state 6a: expected POLLHUP; got 0x11
     No change.  The kernel code for select() and poll() hasn't been either
     fixed or broken for nameless pipes.

- FreeBSD-4.10 on named pipes:
   % pollp: state 3: expected POLLHUP; got 0x1
   % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
   % pollp: state 6a: expected POLLHUP; got 0x1
   % pollp: state 3: expected POLLHUP; got 0x1
   % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
   % pollp: state 6a: expected POLLHUP; got 0x1
     0x1 is POLLIN.  FreeBSD-4.10 never returns POLLHUP for named pipes.
     This and/or returning POLLIN in too many cases causes the problem in
     PRs 34020, 53447 and 76144.

- FreeBSD-oldcurrent on nameless pipes:
   % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
   % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
     FreeBSD-current still never returns POLLHUP for nameless pipes.
     However, my test program determines whether POLLHUP should have been
     returned in some cases and reduces the bugs to the above.  It uses
     FreeBSD(>4)'s (my) POLLINIGNEOF to do this.  POLLINIGNEOF was supposed
     to be usable for this to limit the damage caused by the fix for PR34020,
     since I knew that this fix would break EOF handling in some cases.
     However, POLLINIGNEOF doesn't really work.  To use it, the test
     program has to use nonblocking syscalls for everything including
     poll() (it gets nonblocking polls using a timeout of 0).  Even with
     this, EOF can't be detected in state 6 (EOF-with-data).  But the
     bug in state 6 is small (it even compensates for the bug in gdb).

     Here is the code to determine POLLHUP:

%%%
#ifdef POLLINIGNEOF
/*
  * FreeBSD's POLLINIGNEOF (which causes half of the bugs when the kernel
  * uses it) can be used to fix up the broken cases 3 and 6a if the kernel
  * uses it, i.e., for named pipes but not for pipes.  Note that the sense
  * of POLLINIGNEOF is reversed when passed to the kernel -- it means
  * don't-ignore-EOF in .events and if it is set there then it means
  * not-POLLHUP in .revents.
  */
int
mypoll(struct pollfd *fds, nfds_t nfds, int timeout)
{
 	struct pollfd mypfd;
 	int r;

 	r = poll(fds, nfds, timeout);
 	if (nfds != 1 || timeout != 0 || fds[0].revents & POLLIN)
 		return (r);
 	mypfd = fds[0];
 	mypfd.events |= POLLINIGNEOF;
 	r = poll(&mypfd, 1, 0);
 	if (r >= 0) {
 		if (mypfd.revents &= POLLIN) {
 			mypfd.revents &= ~POLLIN;
 			mypfd.revents |= POLLHUP;
 		}
 		fds[0].revents = mypfd.revents;
 	}
 	return (r);
}
#define	poll(fds, nfds, timeout)	mypoll((fds), (nfds), (timeout))
#endif
%%%

     With this userland fixup for the missing POLLHUP, the above shows that
     states 3 and 6a have been fixed for poll() on named pipes.  These are
     precisely the states that have been broken for select() on named
     pipes.  Toggling the seting of POLLIN for these states toggles the
     location of one of the bugs.

     Without this userland fixup, the output for FreeBSD-oldcurrent on
     named pipes is:
     % state 3: expected POLLHUP; got 0
     % state 6: expected POLLIN | POLLHUP; got 0x1
     % state 6a: expected POLLHUP; got 0
     % state 3: expected POLLHUP; got 0
     % state 6: expected POLLIN | POLLHUP; got 0x1
     % state 6a: expected POLLHUP; got 0
       The POLLHUP flag is now never set, so states 3 and 6a aren't actually
       fixed; in fact they are more broken than before, just like for select()
       -- now no poll flag is set for these cases, so poll() and select()
       don't even see normal hangups unless they are used with a timeout
       and/or with the negative-logic POLLINIGNEOF as in my test program.
       IIRC, PRs 53447 and 76144 are about this problem.

Quick fix (?): #defining POLLINIGNEOF as 0 in <sys/poll.h> should give the
FreeBSD-4.10 behaviour.

Fix (?):
- actually implement returning POLLHUP in sopoll() and other places.  Return
   POLLHUP but not POLLIN for the pure-EOF case.  Return POLLIN* | POLLHUP
   for EOF-with-data.
- remove POLLINIGNEOF and associated complications in sopoll(), fifo_poll()
   and <sys/poll.h>
- change selscan() to check for POLLHUP so that POLLIN, POLLIN | POLLHUP
   and POLLHUP all act the same for select()
- remove POLLHUP from the comment in selscan().  Fix the rest of this
   comment or remove it (most backends are too broken to return poll flags
   if appropriate, and the comment only mentions one of the other poll flags
   that selscan() ignores)
- remove the corresponding comment in pollscan() since it is wrong and says
   nothing relevant (pollscan() just accepts whatever flags the backends set).

Bruce