kern/94772: FIFOs (named pipes) + select() == broken

Wed Mar 22 01:10:42 UTC 2006

The following reply was made to PR kern/94772; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: Oliver Fromme <olli at secnetix.de>
Cc: FreeBSD-gnats-submit at freebsd.org, freebsd-bugs at freebsd.org
Subject: Re: kern/94772: FIFOs (named pipes) + select() == broken
Date: Wed, 22 Mar 2006 12:09:16 +1100 (EST)

 On Tue, 21 Mar 2006, Oliver Fromme wrote:

 >> Description:
 >
 > I recently wondered why several of my scripts that use
 > a named pipe (FIFO) don't work on FreeBSD.
 >
 > After some debugging it turned out that select() seems
 > to be broken when used with FIFOs on FreeBSD 6.
 > Particularly, this is the bug I'm suffering from:
 >
 > When a FIFO had been opened for reading and the writing
 > process closes it, the reading process blocks in select(),
 > even though the descriptor is ready for read().  If the
 > select() call is omitted, read() returns 0 immediately
 > indicating EOF.  But as soon as you use select(), it blocks
 > and there is no chance to detect the EOF condition.

 See also:

 PR 76125 (about the same bug)
 PR 76144 (about a related bug in poll())
 PR 53447 (about a another or the same related bug in poll())
 PR 34020 (about the inverse of the bug (return on non-EOF) for select()
            and poll())

 The fix for PR 34020 inverted (a symptom of) a bug for poll() to
 give a worse bug, and gave the (inverted) bug for select() where there
 was no bug before.  fifo_poll() now instructs lower layers to ignore
 EOF by setting POLLINIGNEOF.  This moves the bug for poll() and isn't
 directly harmful for poll(), but it is directly harmful for select()
 and it makes the real bug for poll() more harmful.  The real bug for
 poll() is that POLLHUP needs to be set to indicate EOF, but it isn't
 actually set for many file types including named pipes.  So we now
 always get no indication of EOF where we should get POLLHUP or
 POLLIN | POLLHUP.  Previously, we always got EOF indicated by POLLIN,
 and we also got EOF indicated by POLLIN in the tricky case where other
 systems don't indicate EOF.

 The tricky case is for a named pipe that has not had any writers during
 the lifetime of current readers or thereabouts (races seem to be a
 problem).  Such a pipe can never have had a connection on it, so it
 cannot be in the hangup state and it is clear that poll() should not
 return POLLHUP for it.  It is less clear that select() and poll()
 should block waiting for a writer, but that is what other systems do
 for poll() at least.  I fixed FreeBSD long ago to not always block in
 read() on a named piped when there are no current writers, since
 blocking in read() is just wrong in the O_NONBLOCK case.  This had the
 side effect of making select() never block when there are no current
 writers.  poll() inherited this behaviour from select().  This behaviour
 is wrong since select()/poll are the only reasonable ways to block
 waiting for a writer, but it doesn't cause many problems.

 >> How-To-Repeat:
 >
 > Please see the small test program below.  Compile it like this:
 > cc -O -o fifotest fifotest.c
 > ...
 > The same test program (with "err()" replaced by a small
 > self-made function) runs without error on all other UNIX
 > systems that I've tried:  Linux 2.4.32, Solaris 10, and
 > DEC UNIX 4.0 (predecessor of Tru64).  By the way, it's
 > even sufficient to do "cat /dev/null > fifo", i.e. not
 > writing anything at all, but issuing EOF immediately.
 > Under FreeBSD, nothing happens at all in that case.
 > All other UNIX systems recognize EOF (select() returns).

 Here is a program that tests more cases.  I made it give no output
 (for no errors) under Linux-2.6.10.  It also gives no output for
 the nameless pipe case under FreeBSD-4.10 and FreeBSD-oldcurrent
 and for the named piped case under FreeBSD-4.10, but it fails with
 (only) the error in this PR under FreeBSD-oldcurrent.  Please test
 it on Solaris etc.  Compile it with -DNAMEDPIPE for the named pipe
 case.  In this case, it creates and leaves a fifo "p" in the current
 directory (or doesn't handle the error if "p" exists but is not
 an accessible fifo) but shouldn't have any other side effects.

 %%%
 #include <sys/select.h>
 #include <sys/stat.h>

 #include <err.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <unistd.h>

 static pid_t cpid;
 static pid_t ppid;
 static volatile sig_atomic_t state;

 static void
 catch(int sig)
 {
  	state++;
 }

 static void
 child(int fd)
 {
  	fd_set rfds;
  	struct timeval tv;
  	char buf[256];

 #ifdef NAMEDPIPE
  	fd = open("p", O_RDONLY | O_NONBLOCK);
  	if (fd < 0)
  		err(1, "open for read");
 #endif
  	kill(ppid, SIGUSR1);

  	/* XXX should check that fd fits in rfds. */

  	usleep(1);
  	while (state != 1)
  		;
 #ifndef NAMEDPIPE
  	/*
  	 * The connection cannot be restablished.  Use the code that delays
  	 * the read until after the writer disconnects since that case is
  	 * more interesting.
  	 */
  	state = 4;
  	goto state4;
 #endif
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (FD_ISSET(fd, &rfds))
  		warnx("state 1: expected clear; got set");
  	kill(ppid, SIGUSR1);

  	usleep(1);
  	while (state != 2)
  		;
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (!FD_ISSET(fd, &rfds))
  		warnx("state 2: expected set; got clear");
  	if (read(fd, buf, sizeof buf) != 1)
  		err(1, "read");
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (FD_ISSET(fd, &rfds))
  		warnx("state 2a: expected clear; got set");
  	kill(ppid, SIGUSR1);

  	usleep(1);
  	while (state != 3)
  		;
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (!FD_ISSET(fd, &rfds))
  		warnx("state 3: expected set; got clear");
  	kill(ppid, SIGUSR1);

  	/*
  	 * Now we expect a new writer, and a new connection too since
  	 * we read all the data.  The only new point is that we didn't
  	 * start quite from scratch since the read fd is not new.  Check
  	 * startup state as above, but don't do the read as above.
  	 */
  	usleep(1);
  	while (state != 4)
  		;
 state4:
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (FD_ISSET(fd, &rfds))
  		warnx("state 4: expected clear; got set");
  	kill(ppid, SIGUSR1);

  	usleep(1);
  	while (state != 5)
  		;
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (!FD_ISSET(fd, &rfds))
  		warnx("state 5: expected set; got clear");
  	kill(ppid, SIGUSR1);

  	usleep(1);
  	while (state != 6)
  		;
  	/*
  	 * Now we have no writer, but should still have data from the old
  	 * writer. Check that we have both a data condition and a hangup
  	 * condition, and that the data can read the data in the usual way.
  	 * Since Linux does this, programs must not quite reading when they
  	 * see POLLHUP; they must see POLLHUP without POLLIN (or another
  	 * input condition) before they decide that there is EOF.  gdb-6.1.1
  	 * is an example of a broken program that quits on POLLHUP only --
  	 * see its event-loop.c.
  	 */
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (!FD_ISSET(fd, &rfds))
  		warnx("state 6: expected set; got clear");
  	if (read(fd, buf, sizeof buf) != 1)
  		err(1, "read");
  	FD_ZERO(&rfds);
  	FD_SET(fd, &rfds);
  	tv.tv_sec = 0;
  	tv.tv_usec = 0;
  	if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
  		err(1, "select");
  	if (!FD_ISSET(fd, &rfds))
  		warnx("state 6a: expected set; got clear");
  	close(fd);
  	kill(ppid, SIGUSR1);
  	exit(0);
 }

 static void
 parent(int fd)
 {
  	usleep(1);
  	while (state != 1)
  		;
 #ifdef NAMEDPIPE
  	fd = open("p", O_WRONLY | O_NONBLOCK);
  	if (fd < 0)
  		err(1, "open for write");
 #endif
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 2)
  		;
  	if (write(fd, "", 1) != 1)
  		err(1, "write");
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 3)
  		;
  	if (close(fd) != 0)
  		err(1, "close for write");
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 4)
  	    ;
 #ifndef NAMEDPIPE
  	return;
 #endif
  	fd = open("p", O_WRONLY | O_NONBLOCK);
  	if (fd < 0)
  		err(1, "open for write");
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 5)
  		;
  	if (write(fd, "", 1) != 1)
  		err(1, "write");
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 6)
  		;
  	if (close(fd) != 0)
  		err(1, "close for write");
  	kill(cpid, SIGUSR1);

  	usleep(1);
  	while (state != 7)
  		;
 }

 int
 main(void)
 {
  	int fd[2];
  	int i;

 #ifdef NAMEDPIPE
  	if (mkfifo("p", 0666) != 0 && errno != EEXIST)
  		err(1, "mkfifo");
 #endif
  	signal(SIGUSR1, catch);
  	ppid = getpid();
  	for (i = 0; i < 2; i++) {
 #ifndef NAMEDPIPE
  		if (pipe(fd) != 0)
  			err(1, "pipe");
 #else
  		fd[0] = -1;
  		fd[1] = -1;
 #endif
  		state = 0;
  		switch (cpid = fork()) {
  		case -1:
  			err(1, "fork");
  		case 0:
  			(void)close(fd[1]);
  			child(fd[0]);
  			break;
  		default:
  			(void)close(fd[0]);
  			parent(fd[1]);
  			break;
  		}
  	}
  	return (0);
 }
 %%%

 The error output from this is:

 %%%
 selectp: state 3: expected set; got clear
 selectp: state 6a: expected set; got clear
 selectp: state 3: expected set; got clear
 selectp: state 6a: expected set; got clear
 %%%

 These messages are all caused by the same bug.  "state 3" is EOF without
 any data ever having been readable.  "state 6" is EOF with data readable
 (the test for this passed so there is no output for it).  "state 6a" is
 EOF after having read the data available in state 6.  It is good that
 state 6a fails in the same way as state 3 -- at least the bug doesn't
 seem to involve races.  The duplicate messages are caused by iterating
 the test to see if the bug depends on previous activity on the pipe
 (but the program is probably too careful cleaning up for iteration to
 show problems).

 A similar test program (not enclosed) shows many more bugs for poll():

 - no output under Linux-2.6.10

 - FreeBSD-4.10 on nameless pipes:
    % poll: state 6a: expected POLLHUP; got 0x11
    % poll: state 6a: expected POLLHUP; got 0x11
      0x11 is POLLIN | POLLHUP.  Nameless pipes are one of the few file
      types for which POLLHUP is actually implemented (POLLHUP is also
      implemented (not quite right) for ttys but isn't implemented for
      any other important file type).  Linux returns only POLLHUP here.
      This is best, since it allows distinguishing the case of pure EOF
      from the case of EOF with data reasable.  However, buggy
      applications like gdb don't actually understand the difference
      between EOF-with-data and pure EOF (see the comment in the test
      program).  Also, select() depends on pipe_poll() returning POLLIN
      to work.

 - FreeBSD-oldcurrent on nameless pipes:
    % poll: state 6a: expected POLLHUP; got 0x11
    % poll: state 6a: expected POLLHUP; got 0x11
      No change.  The kernel code for select() and poll() hasn't been either
      fixed or broken for nameless pipes.

 - FreeBSD-4.10 on named pipes:
    % pollp: state 3: expected POLLHUP; got 0x1
    % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
    % pollp: state 6a: expected POLLHUP; got 0x1
    % pollp: state 3: expected POLLHUP; got 0x1
    % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
    % pollp: state 6a: expected POLLHUP; got 0x1
      0x1 is POLLIN.  FreeBSD-4.10 never returns POLLHUP for named pipes.
      This and/or returning POLLIN in too many cases causes the problem in
      PRs 34020, 53447 and 76144.

 - FreeBSD-oldcurrent on nameless pipes:
    % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
    % pollp: state 6: expected POLLIN | POLLHUP; got 0x1
      FreeBSD-current still never returns POLLHUP for nameless pipes.
      However, my test program determines whether POLLHUP should have been
      returned in some cases and reduces the bugs to the above.  It uses
      FreeBSD(>4)'s (my) POLLINIGNEOF to do this.  POLLINIGNEOF was supposed
      to be usable for this to limit the damage caused by the fix for PR34020,
      since I knew that this fix would break EOF handling in some cases.
      However, POLLINIGNEOF doesn't really work.  To use it, the test
      program has to use nonblocking syscalls for everything including
      poll() (it gets nonblocking polls using a timeout of 0).  Even with
      this, EOF can't be detected in state 6 (EOF-with-data).  But the
      bug in state 6 is small (it even compensates for the bug in gdb).

      Here is the code to determine POLLHUP:

 %%%
 #ifdef POLLINIGNEOF
 /*
   * FreeBSD's POLLINIGNEOF (which causes half of the bugs when the kernel
   * uses it) can be used to fix up the broken cases 3 and 6a if the kernel
   * uses it, i.e., for named pipes but not for pipes.  Note that the sense
   * of POLLINIGNEOF is reversed when passed to the kernel -- it means
   * don't-ignore-EOF in .events and if it is set there then it means
   * not-POLLHUP in .revents.
   */
 int
 mypoll(struct pollfd *fds, nfds_t nfds, int timeout)
 {
  	struct pollfd mypfd;
  	int r;

  	r = poll(fds, nfds, timeout);
  	if (nfds != 1 || timeout != 0 || fds[0].revents & POLLIN)
  		return (r);
  	mypfd = fds[0];
  	mypfd.events |= POLLINIGNEOF;
  	r = poll(&mypfd, 1, 0);
  	if (r >= 0) {
  		if (mypfd.revents &= POLLIN) {
  			mypfd.revents &= ~POLLIN;
  			mypfd.revents |= POLLHUP;
  		}
  		fds[0].revents = mypfd.revents;
  	}
  	return (r);
 }
 #define	poll(fds, nfds, timeout)	mypoll((fds), (nfds), (timeout))
 #endif
 %%%

      With this userland fixup for the missing POLLHUP, the above shows that
      states 3 and 6a have been fixed for poll() on named pipes.  These are
      precisely the states that have been broken for select() on named
      pipes.  Toggling the seting of POLLIN for these states toggles the
      location of one of the bugs.

      Without this userland fixup, the output for FreeBSD-oldcurrent on
      named pipes is:
      % state 3: expected POLLHUP; got 0
      % state 6: expected POLLIN | POLLHUP; got 0x1
      % state 6a: expected POLLHUP; got 0
      % state 3: expected POLLHUP; got 0
      % state 6: expected POLLIN | POLLHUP; got 0x1
      % state 6a: expected POLLHUP; got 0
        The POLLHUP flag is now never set, so states 3 and 6a aren't actually
        fixed; in fact they are more broken than before, just like for select()
        -- now no poll flag is set for these cases, so poll() and select()
        don't even see normal hangups unless they are used with a timeout
        and/or with the negative-logic POLLINIGNEOF as in my test program.
        IIRC, PRs 53447 and 76144 are about this problem.

 Quick fix (?): #defining POLLINIGNEOF as 0 in <sys/poll.h> should give the
 FreeBSD-4.10 behaviour.

 Fix (?):
 - actually implement returning POLLHUP in sopoll() and other places.  Return
    POLLHUP but not POLLIN for the pure-EOF case.  Return POLLIN* | POLLHUP
    for EOF-with-data.
 - remove POLLINIGNEOF and associated complications in sopoll(), fifo_poll()
    and <sys/poll.h>
 - change selscan() to check for POLLHUP so that POLLIN, POLLIN | POLLHUP
    and POLLHUP all act the same for select()
 - remove POLLHUP from the comment in selscan().  Fix the rest of this
    comment or remove it (most backends are too broken to return poll flags
    if appropriate, and the comment only mentions one of the other poll flags
    that selscan() ignores)
 - remove the corresponding comment in pollscan() since it is wrong and says
    nothing relevant (pollscan() just accepts whatever flags the backends set).

 Bruce