pipe(2) calling convention: why?

Sun Nov 9 15:04:01 PST 2008

On Sun, Nov 9, 2008 at 12:38 PM, Kostik Belousov <kostikbel at gmail.com> wrote:
> On Sun, Nov 09, 2008 at 08:27:46PM +0100, Ed Schouten wrote:
>> Hello all,
>>
>> After having a discussion on IRC with some friends of mine about system
>> call conventions, we couldn't exactly determine why pipe(2)'s calling
>> convention has to be different from the rest. Unlike most system calls,
>> pipe(2) has two return values. Instead of just copying out an array of
>> two elements, it uses two registers to store the file descriptor
>> numbers.
>>
>> It seems a lot of BSD-style system calls used to work that way, but
>> pipe(2) seems to be the only system call on FreeBSD that uses this
>> today. Some system calls only seem to set td_retval[1] to zero, which
>> makes little sense to me. Maybe those assignments can be removed.
>>
>> In my opinion there are a couple of disadvantages of having multiple
>> return values:
>>
>> - As documented in syscall(2), there is no way to obtain the second
>>   return value if you use this function.
>>
>> - Each of those system calls needs to have its own implementation
>>   written in assembly for each architecture we support. Why can hundreds
>>   of system calls be handled in a generic fashion, while interfaces like
>>   pipe(2) can't?
>>
>> As a small experiment I've written a patch to allocate a new system call
>> (506) which uses a generic calling convention to implement pipe(2). It
>> seems Linux also uses this method, so I've removed linux_pipe() from the
>> Linuxolator as well, which seems to work.
>>
>> I could commit this if people think it makes sense. Any comments?
>>
>
> The convention of returning pipe descriptors in the registers comes
> back at least to the Sixth Edition. Check the Lions' book for the reference.
> Amusingly, Solaris uses the same calling convention for pipe(2).
>
> I do not see what we gain by the change. Now, we have one syscall and
> some arch-dependent wrappers in the libc. After the patch, we get rid
> of the wrappers, but grow two syscalls.
>
> The only reason of doing this I can imagine is to allow syscall(2) to
> work for SYS_pipe from C code. Since we have not heard complaints about
> this for ~15 years, we can live with it.
>

The other side effect of the change is to remove one asm instruction
in the syscall handler and replace it with potentially hundreds of
instructions to do the copyout.  Plus we gain another syscall, lose
backwards compatibility with kernel.old again, and so on.  I really
don't see an overall benefit.

What I do see some use for is doing the kern_pipe() split (like in the
patch), which simplifies the Linux ABI wrappers (and other ABI
wrappers, not just Linux!).  Just have our syscall return in retval[0]
and [1] like before, and we still get the benefit of simplifying a
bunch of wrappers.

The patch is incomplete anyway.  It leaks fds if the copyout fails.
There is a comment about this in the patch itself.
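
To make the split concrete, here is a rough userland sketch of the idea,
not the actual patch.  Everything prefixed mock_ is made up for
illustration, the struct thread is a stand-in that only carries the two
return slots, and the kern_pipe()/sys_pipe()/linux_pipe() signatures are
assumptions.  The point is that the core hands back both descriptors in a
kernel array, the native entry point keeps returning them in retval[0]
and [1], and a copyout-style wrapper has to close both descriptors if the
copyout fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mock of the per-thread state; only the return-value slots matter. */
struct thread {
	long td_retval[2];
};

int mock_next_fd = 3;	/* mock descriptor allocator */
int mock_open_fds = 0;	/* descriptors currently allocated */

/* Mock of closing a descriptor from inside the kernel. */
void
mock_close(struct thread *td, int fd)
{
	(void)td;
	(void)fd;
	mock_open_fds--;
}

/* Mock copyout(): fails with EFAULT on a NULL "user" pointer. */
int
mock_copyout(const void *kaddr, void *uaddr, size_t len)
{
	if (uaddr == NULL)
		return (EFAULT);
	memcpy(uaddr, kaddr, len);
	return (0);
}

/* ABI-neutral core: allocate both ends, report them in a kernel array. */
int
kern_pipe(struct thread *td, int fildes[2])
{
	assert(td != NULL);
	fildes[0] = mock_next_fd++;	/* read end */
	fildes[1] = mock_next_fd++;	/* write end */
	mock_open_fds += 2;
	return (0);
}

/* Native entry point: keeps the traditional two-register return. */
int
sys_pipe(struct thread *td)
{
	int fildes[2];
	int error;

	error = kern_pipe(td, fildes);
	if (error != 0)
		return (error);
	td->td_retval[0] = fildes[0];
	td->td_retval[1] = fildes[1];
	return (0);
}

/* Copyout-style wrapper: must not leak the fds if copyout fails. */
int
linux_pipe(struct thread *td, int *user_fildes)
{
	int fildes[2];
	int error;

	error = kern_pipe(td, fildes);
	if (error != 0)
		return (error);
	error = mock_copyout(fildes, user_fildes, sizeof(fildes));
	if (error != 0) {
		mock_close(td, fildes[0]);
		mock_close(td, fildes[1]);
		return (error);
	}
	td->td_retval[0] = 0;
	return (0);
}
```

With this shape, each ABI wrapper is a few lines around kern_pipe(), and
only the wrappers that copy out need the cleanup path.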

Other historical notes..

Ancient unix systems used to implement syscalls by having userland do
a call (jsr) to a shared page.  The trap handler would verify the
entry point, and if it was approved, it would then give privilege and
continue.  The problem was that this severely limited the number of
syscalls because we were talking tiny address spaces.  Given that
syscall numbers were at a premium, it made sense to pack as much
functionality into syscalls as possible.  eg: getpid syscall could
return both pid and ppid, saving a kernel syscall entry point, and so
on.

This is also one of the reasons for SIGSYS.  Calling an illegal kernel
entry point in a process that had run wild could be easily converted
into a signal.  Wild processes could easily hit the kernel entry
points.  Again, this doesn't really apply these days.  It is somewhat
archaic by today's standards - Linux doesn't even bother with SIGSYS -
it has bad syscalls just return ENOSYS.

fork() currently uses both retval[0] and [1], in spite of it appearing
not to.  See cpu_fork() for the other half.

We use both return values for 64 bit returns, e.g. lseek().  Some of
the places that set td_retval[1] to 0 are silly.
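
As a rough illustration of the 64 bit case, the sketch below splits a
64 bit value across two 32 bit return slots and puts it back together.
The helper names are made up, the struct thread is a mock, and the
low-half-in-retval[0] / high-half-in-retval[1] ordering assumes the usual
i386 %eax/%edx pairing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock thread state with two 32 bit return slots, as on i386. */
struct thread {
	uint32_t td_retval[2];
};

/* Kernel side: low half goes to slot 0 (%eax), high half to slot 1 (%edx). */
void
set_retval64(struct thread *td, int64_t value)
{
	uint64_t u = (uint64_t)value;

	assert(td != NULL);
	td->td_retval[0] = (uint32_t)(u & 0xffffffffu);
	td->td_retval[1] = (uint32_t)(u >> 32);
}

/* Userland side: what a stub for an lseek()-like call has to reassemble. */
int64_t
get_retval64(const struct thread *td)
{
	assert(td != NULL);
	return ((int64_t)(((uint64_t)td->td_retval[1] << 32) |
	    (uint64_t)td->td_retval[0]));
}
```

A call that only returns a 32 bit value has nothing meaningful to put in
the second slot, which is why zeroing it there buys nothing.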

I really don't see td_retval[0] and td_retval[1] ever going away
entirely, at least not while we share the syscall vector between 32
and 64 bit systems.

I don't think breaking kernel.old compatibility, replacing the current
syscall for pipe() with a slower one, and having to keep both anyway
is much of a win.  Splitting pipe() and kern_pipe() would help ABI
wrappers.  I don't see value in adding a new way for pipe(2) to fail.
Right now, pipe(2) causes a segfault if you pass a bad address.  The
new wrapper causes it to return EFAULT instead, and NOT crash the app.
The failure mode has changed.
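
To show where the segfault comes from, here is a C model of what the
libc stub does with the two return values.  The real stubs are a few
lines of per-arch asm; raw_pipe(), pipe_stub() and struct sysret are
made-up names, and raw_pipe() is faked on top of the ordinary pipe(2) so
the descriptors are real.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical model of what comes back from the trap. */
struct sysret {
	int error;	/* 0 on success, else an errno value */
	long rv0;	/* first return register: read end */
	long rv1;	/* second return register: write end */
};

/* Mock of the raw two-register syscall, built on the real pipe(2). */
struct sysret
raw_pipe(void)
{
	struct sysret r = { 0, -1, -1 };
	int fds[2];

	if (pipe(fds) == -1) {
		r.error = errno;
		return (r);
	}
	r.rv0 = fds[0];
	r.rv1 = fds[1];
	return (r);
}

/* Model of the libc wrapper: the kernel never sees the user's array. */
int
pipe_stub(int fildes[2])
{
	struct sysret r = raw_pipe();

	if (r.error != 0) {
		errno = r.error;
		return (-1);
	}
	assert(r.rv0 >= 0 && r.rv1 >= 0);
	/*
	 * These stores run in user mode.  A bad pointer faults right
	 * here, in the application, instead of coming back as EFAULT.
	 */
	fildes[0] = (int)r.rv0;
	fildes[1] = (int)r.rv1;
	return (0);
}
```

With a copyout-based syscall the kernel does the store instead, so the
same bad pointer turns into an EFAULT return.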

As an aside.. I'm very very very painfully aware of the dual return
from syscalls.  I've been fighting with this in valgrind for quite
some time now.  We have some very interesting semantics on i386.

* syscalls preserve all registers except for %eax and %eflags.  Even
scratch registers.
* .. except for %edx sometimes, for 64 bit returns, or dual-returns.
Otherwise %edx is preserved.
* libc depends on this in a couple of hand-written asm stubs, eg:
brk()/sbrk().  Nothing else cares about this.
* some libc syscall wrappers trash the scratch registers though.
* in spite of syscalls not using C calling conventions, the kernel
assumes you've done a C-style call to libc.  It assumes the C return
address was pushed onto the stack before the args.

In retrospect I wish it never had started out this way.  But it did,
it still is, and I feel the costs of changing it are not worth it for
such little gain.

-- 
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell