[Bug 261671] rc script fails to start gssd on 12.3

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 02 Feb 2022 03:46:21 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261671

            Bug ID: 261671
           Summary: rc script fails to start gssd on 12.3
           Product: Base System
           Version: 12.3-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: conf
          Assignee: bugs@FreeBSD.org
          Reporter: bugs.freebsd@scourger.nl

Created attachment 231515
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=231515&action=edit
Patch with a workaround.

On FreeBSD 12.3, gssd fails to start on boot.

## Environment

I installed a clean FreeBSD 12.3 system with minimal configuration changes.
It mounts a few NFSv4 filesystems using Kerberos for authentication. Users and
groups are stored in LDAP. A very minimal set of packages is installed to
provide the functionality (see attached pkg.txt).
NFS mounts are specified in /etc/fstab with (among others) the "late" flag set.
Contents of /etc/rc.conf are included as an attachment.
The system uses boot environments with subordinate filesystems, as shown below
(currently only one BE):

# zfs list -r -o name,mountpoint,canmount,mounted fenrir/ROOT
NAME                           MOUNTPOINT  CANMOUNT  MOUNTED
fenrir/ROOT                    none              on       no
fenrir/ROOT/default            none          noauto      yes
fenrir/ROOT/default/usr        /usr          noauto      yes
fenrir/ROOT/default/usr/local  /usr/local    noauto      yes
fenrir/ROOT/default/var        /var          noauto      yes

After configuration of the system, I tested my setup by starting the daemons
and invoking "mount -a -l", and the NFS filesystems got mounted successfully.
Then came the moment of the first reboot, where I was confronted with an
interrupted boot process at the "mountlate" stage (asking to go into single
user mode or proceed to multi-user).

I have used virtually the same setup on earlier hosts without problems since
the 10.X era (including the FreeBSD 12.2 system I'm writing this on). For good
measure, I also tried to upgrade an existing 12.2 install to 12.3 in a boot
environment without subordinate datasets. This resulted in the same error
condition.

## Problem description

During boot, gssd(8) fails to start properly on FreeBSD 12.3. As a result, any
"late" NFSv4 filesystems in /etc/fstab fail to mount during boot.

The console shows an error message when it tries to start gssd, as shown in the
following snippet:
  Starting file system checks:
  Mounting local filesystems:.
  /etc/rc: WARNING: run_rc_command: cannot run /usr/sbin/gssd
  ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib
/usr/local/lib/compat/pkg /usr/local/lib/compat/pkg
  32-bit compatibility ldconfig path: /usr/lib32

The same configuration works fine on FreeBSD 12.2. It appears that the culprit
is a change in the ordering of rc files.
On FreeBSD 12.3, the 'gssd' script gets wedged between 'zfsbe' and 'zfs' (see
the attached rcorder-12.3.orig).
On 12.2, gssd is started much later in the boot process (well after NETWORKING;
see attached rcorder-12.2.orig).
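
A minimal sketch for checking where gssd lands in the ordering on a given
system. It assumes rcorder(8) prints one script path per line; rc_positions is
a hypothetical helper name, not something shipped in the base system.

```shell
# Print the 1-based position and name of each requested rc script within an
# ordering read from stdin (one path per line, as rcorder(8) prints them).
# rc_positions is a hypothetical helper, not part of FreeBSD.
rc_positions() {
    awk -v names="$*" '
        BEGIN {
            n = split(names, want, " ")
            for (i = 1; i <= n; i++) wanted[want[i]] = 1
        }
        {
            base = $0
            sub(/.*\//, "", base)          # strip the directory part
            if (base in wanted) printf "%d %s\n", NR, base
        }
    '
}

# On a live system:
#   rcorder /etc/rc.d/* 2>/dev/null | rc_positions zfsbe gssd zfs NETWORKING mountlate
```

Running this on both releases should show whether gssd sorts before or after
zfs and NETWORKING.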

As a test, I made a minor change to the gssd script to see if the rc ordering
was indeed the problem. Adding NETWORKING to the REQUIRE line seems to be
sufficient to fix the booting problem. I also added "BEFORE:  mountcritremote"
to make sure gssd doesn't start too late on diskless clients (though I haven't
tested diskless). See the attached gssd.patch for the exact changes that I
made. The patch changes the startup order to the one listed in
rcorder-12.3.fixed.
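
To confirm that a modified script really carries the new dependencies, the
rcorder(8) header lines can be inspected directly. The following is only a
sketch: rc_header_has is a hypothetical helper, and it assumes the usual
"# KEYWORD: name name ..." comment format. It does not reproduce the attached
gssd.patch.

```shell
# Succeed if the rc script FILE lists NAME on a "# KEYWORD:" header line.
# usage: rc_header_has KEYWORD NAME FILE
# rc_header_has is a hypothetical helper, not part of FreeBSD.
rc_header_has() {
    awk -v kw="$1" -v name="$2" '
        $1 == "#" && $2 == (kw ":") {
            for (i = 3; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit !found }                # exit status 0 only when found
    ' "$3"
}

# On a patched system:
#   rc_header_has REQUIRE NETWORKING /etc/rc.d/gssd && echo "REQUIRE has NETWORKING"
#   rc_header_has BEFORE mountcritremote /etc/rc.d/gssd && echo "BEFORE has mountcritremote"
```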

To test the hypothesis that rc ordering is indeed the issue, I tried four
test cases:
Case 1: default /etc/rc.d/gssd, no NFS filesystems in /etc/fstab
  The system boots without obvious issues, but gssd is not running.
  Trying to mount an NFSv4 filesystem immediately returns "Permission denied".
  If you start gssd manually, mounting NFSv4 works.

Case 2: default /etc/rc.d/gssd, NFS filesystems in /etc/fstab
  gssd doesn't start during boot, as in case 1.
  The boot process is interrupted during the "mountlate" stage, when it tries
to mount the NFS filesystems.
  If you choose to proceed into multi-user mode, you'll have to manually cancel
further mount attempts during boot.
  Once in multi-user mode, depending on how quickly/often Ctrl-C was pressed to
abort "mountlate", zero or more instances of gssd are running (I've observed 1
and 2).
  Even if only 1 instance of gssd is running, it is not possible to mount NFSv4
filesystems. A manual mount hangs in the "[rpccon]" state before timing out
with a "Permission denied" error:
    root@fenrir:~ # mount /net/cerberus/incoming/
    load: 0.01  cmd: mount_nfs 48471 [rpccon] 0.86r 0.00u 0.00s 0% 8080k
    load: 0.01  cmd: mount_nfs 48471 [rpccon] 1.88r 0.00u 0.00s 0% 8080k
    load: 0.01  cmd: mount_nfs 48471 [rpccon] 2.99r 0.00u 0.00s 0% 8080k
    mount_nfs: nmount: /net/cerberus/incoming: Permission denied
  After killing all gssd instances and running "service gssd restart", mounting
the filesystems is possible.

Case 3: modified /etc/rc.d/gssd, no NFS filesystems in /etc/fstab
  The system boots without issue, gssd is running and NFSv4 filesystems can be
mounted manually.

Case 4: modified /etc/rc.d/gssd, NFS filesystems in /etc/fstab
  The system boots as expected, gssd is running and filesystems are
automatically mounted as expected.

These results seem to confirm that the problem stems from an attempt to start
gssd too early.

Note that I haven't tested this with NFSv3 or non-Kerberized NFSv4, so it is
possible that those work fine.

## How to reproduce

Do a fresh installation of FreeBSD 12.3, and perform the minimal required
configuration for gssd. Running "service gssd start" should successfully launch
the daemon.
Reboot, and observe that gssd hasn't started.

## Solution

A simple fix would be to change the REQUIRE line in the gssd rc file. But that
might just be patchwork that hides the actual problem.

It is unclear to me why the rc ordering is so different between 12.2 and 12.3;
as far as I can see there haven't been any big changes to any of the files in
/etc/rc.d. However, one of the few rc scripts that changed is in fact gssd (see
review D27203). Ironically, that commit doesn't seem to cause the problem.
Using the 12.2 version of the gssd rc script on FreeBSD 12.3 still causes a
startup failure.

In any case, there are huge differences when comparing the output of "rcorder
/etc/rc.d/*" between 12.2 and 12.3, while the contents of files in /etc/rc.d
are almost exactly the same. At this point, my guess is that something has
changed in the behaviour of rcorder(8) itself. I can't say if that is intended,
or a bug.
