Re: amd64 syscall ABI (vs. Darwin)

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Mon, 17 Jan 2022 22:51:52 UTC
On Mon, Jan 17, 2022 at 10:31:09PM +0000, Damian's Proton Mail wrote:
> 
> > On 17 Jan 2022, at 14:38, Konstantin Belousov <kostikbel@gmail.com> wrote:
> >
> > On Mon, Jan 17, 2022 at 12:41:59PM +0000, Damian Malarczyk wrote:
> >> Hello,
> >>
> >> I'm hacking on a toy project to run Darwin (MachO) binaries on FreeBSD.
> >> Currently I'm at a stage of syscalls support, and I've noticed a difference in the amd64 ABI that I didn't expect.
> >>
> >> FreeBSD is changing values of some registers that aren't used as the syscall output. e.g., r8-r11 are changed, while r12-r15 don't seem to be affected.
> >> That's not the case on Darwin, from what I've seen only rax, rdx used as syscall results are changed.
> >> It looks like FreeBSD's syscalls calling convention is more like standard function calling, and r8-r11 should be always caller saved.
> > It is not 'more like'. FreeBSD follows C ABI for amd64 for syscall
> > registers handling. An additional twist is that the registers which are
> > declared as callee-clobbered are zeroed to avoid kernel data leakage to
> > userspace.
> Oh I see, this explains it then.
> 
> >>
> >> At a first glance Darwin approach seems more optimal, as fewer registers get clobbered. Is there any specific reason why this isn't also the case on FreeBSD?
> >> I'm also wondering where exactly the register values are changed. When I look at the trapframe contents in the sv_set_syscall_retval system vector callback the r8 register value is the same as on the input, so it must be changed somewhere later. Does anyone know where exactly this happens?
> >
> > Look at the sys/amd64/amd64/exceptions.S.  The fast_syscall entry point
> > is where we receive control after the syscall instruction.
> A lot of new things in there for me, but the flow is clear. I was able to find corresponding logic in XNU’s sources too. Earlier I said:
> 
> > At a first glance Darwin approach seems more optimal
> But it’s instead the opposite/no difference at all, as in Darwin, they explicitly restore/set all registers, including callee saved r12-r15.
> 
> Explicitly preserving registers would prevent kernel data leakage too. Doing so in FreeBSD would also be an ABI compatible change I think, since users shouldn’t rely on values in those registers.
> I’m curious if you see any obvious pros/cons with either approach, or is it just a more arbitrary implementation choice?
We preserve everything on syscall entry; it is the SYSCALL instruction
behavior that makes it look somewhat convoluted.  I suggest you read
the SDM description of the SYSCALL instruction to understand the register
manipulations on entry.
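
As a rough illustration of the SDM description, the register side effects
of SYSCALL in 64-bit mode can be modeled in plain userland C.  This is only
a toy model: struct cpu_state and model_syscall_entry() are made-up names
for this sketch, not kernel code, and only the registers the instruction
itself touches are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the state SYSCALL itself touches; not kernel code. */
struct cpu_state {
	uint64_t rip, rflags, rcx, r11;
};

/*
 * Model SYSCALL in 64-bit mode: RCX receives the address of the next
 * instruction, R11 receives RFLAGS, RFLAGS is masked with IA32_FMASK,
 * and RIP is loaded from IA32_LSTAR.
 */
static void
model_syscall_entry(struct cpu_state *c, uint64_t next_rip, uint64_t lstar,
    uint64_t fmask)
{
	assert(c != NULL);
	c->rcx = next_rip;	/* the user's old %rcx is lost here */
	c->r11 = c->rflags;	/* the user's old %r11 is lost here */
	c->rflags &= ~fmask;
	c->rip = lstar;
}
```

The point of the model is that the user's %rcx and %r11 are already gone by
the time the kernel entry point runs, so no amount of saving on entry can
bring them back.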

On the other hand, on the fast syscall return, we indeed do not restore
everything. If you want to restore the full frame, set the PCB_FULL_IRET pcb
flag to request the iretq return path.
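
The difference between the two return paths can be sketched as a userland
model.  It follows the behavior described in this thread (results in
rax/rdx, scratch registers zeroed, callee-saved registers untouched) and is
not transcribed from exceptions.S; struct gprs and both function names are
made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy register file; field names follow the amd64 GPR names. */
struct gprs {
	uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/*
 * Simplified fast return: cpu holds the user registers as they were at
 * entry, frame holds the saved frame with the syscall results stored.
 * Results come back in rax/rdx, rcx and r11 carry the return rip and
 * rflags, and the remaining scratch registers are zeroed, not restored.
 */
static void
model_fast_return(struct gprs *cpu, const struct gprs *frame,
    uint64_t user_rip, uint64_t user_rflags)
{
	assert(cpu != NULL && frame != NULL);
	cpu->rax = frame->rax;
	cpu->rdx = frame->rdx;
	cpu->rcx = user_rip;
	cpu->r11 = user_rflags;
	cpu->r8 = cpu->r9 = cpu->r10 = 0;
}

/* Simplified full (iretq-style) return: load the saved frame as is. */
static void
model_full_iret(struct gprs *cpu, const struct gprs *frame)
{
	assert(cpu != NULL && frame != NULL);
	*cpu = *frame;
}
```

With the same saved frame, the fast model hands back zeroes in r8-r10 while
the full model hands back whatever the frame contains, which is what makes
the full path usable for an ABI with different register expectations.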

> 
> Not that I’d propose changing the ABI though, I also want my toy project to work as a plug-in kernel module.
> I guess the only other option to emulate Darwin's behaviour would be to intercept syscalls in userspace somehow first and manually preserve the register values?

To emulate Darwin, you would need a specific ABI personality (sysent) in the
kernel, which would also provide the sv_syscall_ret method.  The method can
do whatever is needed to the return frame, and set PCB_FULL_IRET to indicate
that the kernel should load it into the CPU GPR file as is.
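
A minimal userland mock of that idea, assuming stand-in types: every
mock_* name, the flag value, and darwin_syscall_ret() are hypothetical and
exist only so the sketch compiles outside the kernel.  The real struct
sysentvec, struct pcb and trapframe differ, and a real module would set the
flag through the kernel's own pcb interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag value for the mock; not taken from pcb.h. */
#define	MOCK_PCB_FULL_IRET	0x1u

/* Stand-ins for kernel structures, reduced to what the sketch needs. */
struct mock_trapframe { uint64_t tf_rax, tf_rdx, tf_r8, tf_r9, tf_r10; };
struct mock_pcb { uint32_t pcb_flags; };
struct mock_thread { struct mock_trapframe frame; struct mock_pcb pcb; };

/* Hypothetical per-ABI vector carrying a syscall-return hook. */
struct mock_sysentvec {
	void (*sv_syscall_ret)(struct mock_thread *td);
};

enum mock_ret_path { RET_FAST, RET_FULL_IRET };

/*
 * Hypothetical Darwin personality hook: leave the saved frame alone (it
 * still holds the user's register values) and ask for a full restore.
 */
static void
darwin_syscall_ret(struct mock_thread *td)
{
	td->pcb.pcb_flags |= MOCK_PCB_FULL_IRET;
}

/* Mock return path: run the ABI hook, then choose how to return. */
static enum mock_ret_path
mock_syscall_return(const struct mock_sysentvec *sv, struct mock_thread *td)
{
	assert(sv != NULL && td != NULL);
	if (sv->sv_syscall_ret != NULL)
		sv->sv_syscall_ret(td);
	return ((td->pcb.pcb_flags & MOCK_PCB_FULL_IRET) != 0 ?
	    RET_FULL_IRET : RET_FAST);
}
```

A personality without the hook keeps the fast path in this mock, so only
the emulated binaries would pay for the full restore.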

BTW, does Darwin use the SYSCALL instruction for syscall entry on amd64?