Re: Bugzilla PR#293127 and 292884

In reply to: Rick Macklem : "Re: Bugzilla PR#293127 and 292884"
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 14 Mar 2026 22:27:16 UTC
On Mon, Mar 9, 2026 at 5:08 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Sun, Feb 22, 2026 at 4:00 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Sun, Feb 22, 2026 at 3:44 PM Mark Johnston <markj@freebsd.org> wrote:
> > >
> > > On Sun, Feb 22, 2026 at 03:27:42PM -0800, Rick Macklem wrote:
> > > > On Sun, Feb 22, 2026 at 3:21 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > > >
> > > > > On Sun, Feb 22, 2026 at 2:19 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Could someone who is familiar with what has changed
> > > > > > related to socket handling between FreeBSD 14 and 15
> > > > > > please look at these bugzilla PRs.
> > > > > >
> > > > > > They show similar crashes, which indicate that the
> > > > > > socket receive buffer has been discarded when
> > > > > > soreceive() returns EWOULDBLOCK.
> > > > > I've been looking at svc_vc.c and I think there needs
> > > > > to be some updating done to it.
> > > > >
> > > > > The code currently looks like...
> > > > > error = soreceive(so, NULL, &uio, &m, &ctrl, &rcvflag);
> > > > >
> > > > > if (error == EWOULDBLOCK) {
> > > > >     /*
> > > > >      * We must re-test for readability after
> > > > >      * taking the lock to protect us in the case
> > > > >      * where a new packet arrives on the socket
> > > > >      * after our call to soreceive fails with
> > > > >      * EWOULDBLOCK.
> > > > >      */
> > > > >      SOCK_RECVBUF_LOCK(so);
> > > > >      if (!soreadable(so))
> > > > >           xprt_inactive_self(xprt);
> > > > >      SOCK_RECVBUF_UNLOCK(so);
> > > > >      sx_xunlock(&xprt->xp_lock);
> > > > >      return (FALSE);
> > > > > }
> > > > >
> > > > > When I look at the code in soreceive_stream(), it calls
> > > > >    error = SOCK_IO_RECV_LOCK(so, SBLOCKWAIT(MGS_DONTWAIT));
> > > > > which will return EWOULDBLOCK and this happens before SOCKBUF_LOCK()
> > > > > in soreceive_stream_locked().
> > >
> > > That'll happen only if a different thread holds the socket I/O lock,
> > > i.e., a different thread is trying to read data from the socket at the
> > > same time.  Can that happen here?
> > No, it shouldn't. An xprt is assigned to an nfsd thread and that should
> > only be one at a time. It does get re-assigned after
> > xprt_inactive()/xprt_active()
> > calls, but I don't think that should happen while a thread assigned to an xprt
> > is in soreceive(), as protected by the xp_lock.
> >
> > >
> > > > > I don't think the EWOULDBLOCK here was meant to check that some other
> > > > > thread has the I/O lock.  I think that EWOULDBLOCK was meant to check
> > > > > for "no rcv data on socket"?
> > >
> > > Yes, if it's possible that something else is reading from the socket at
> > > the same time, then we might be doing the wrong thing here, but I'm not
> > > sure if that's possible?  Are xprts one-to-one with kernel worker
> > > threads?
> > >
> > > I'm pretty this hasn't changed recently either, though certainly that
> > > part of the socket code's been refactored recently.
> > Yea, the code appears to be in 14 and there has not been any
> > reported crashes with 14.
> >
> > >
> > > > > So, does someone (like markj@) know what this code should now be?
> > > > > (The crashes might be because this code is just plain broken now.)
> > > >
> > > > Oh, and it appears that soreceive_stream_locked() returns EAGAIN when
> > > > there is no data to be read and not EWOULDBLOCK.
> > >
> > > EAGAIN and EWOULDBLOCK are two names for the same thing.  errno.h
> > > defines EWOULDBLOCK to be EAGAIN.
> > I knew that, but I forgot;-)
> > (It would be a boring/bothersome commit, but maybe someone should make
> > the kernel always use one of them, to avoid confusion..)
> >
> > Well, that's all I can think of. Hopefully the reporter can run with
> > GENERIC-DEBUG
> > and we can get more info. (Maybe I'll ask the PR#292884 reporters the
> > same thing.)
> Could someone TCP conversant take a look at attachment core.txt.6 in
> PR#293127.
>
> It shows a crash in tcp_discardcb(). He was using the patches that
> avoided crashes for the other reporters, which simply acquires/releases
> an extra reference count on the socket. (As noted, these crashes have
> been showing up for FreeBSD-15.0, although the socket handling code
> in the krpc hasn't changed in something like 10 years.)
>
> I'm hoping a crash in tcp_discardcb() might give someone a hint
> w.r.t. what is going on?
I think this crash was caused by a bug in my "add a socket reference patch",
which did not CURVNET_SET() before calling sorele().

I am waiting to hear from the reporters w.r.t. the fixed version of the patch
(which acquires/releases an extra reference on the socket) stops the crashes.

However, that still doesn't explain why an extra socket reference
might be needed?
(The change appears to have happened somewhere around 15.0.)

I have spotted this change...
Before commit 5bba272 the code in tcp_usr_shutdown():
   tcp_usrclosed(tp);
   if (!(inp->inp_flags & INP_DROPPED))
         error = tcp_output_nodrop(tp);

after commit 55bba272 the code in tcp_usr_shutdown():
   tcp_usrclosed(tp);
   error = tcp_output_nodrop(tp);
*** Note that the test for INP_DROPPED was removed,
     however tcp_usrclosed(tp) can call in_pcbdrop(), which
     can set INP_DROPPED.

Before commit 5bba272, error == 0 if tcp_usrclosed() calls in_pcbdrop().
After commit 5bba272, error is whatever tcp_output_nodrop() returns.

Is this change a bug or a feature?

I am thinking that calling tcp_unlock_or_drop() when the error is
non-zero (but would have been 0 before commit 5bba272) could
be causing the premature destruction of the socket, which causes
the crashes?

rick

>
> Thanks, rick
>
> >
> > Thanks for your help with this, rick