Re: git: b19740f4ce7a - main - swap_pager: lock vnode in swapdev_strategy()

From: Yoshihiro Ota <ota_at_j.email.ne.jp>
Date: Mon, 29 Nov 2021 01:43:39 UTC
Hi,

I wonder creating /etc/rc.d/nfsswap that depends on nfsclient can give a smoother shutdown like swaplate depends on mountlate.

Hiro

On Sun, 28 Nov 2021 04:00:56 +0200
Konstantin Belousov <kostikbel@gmail.com> wrote:

> On Sun, Nov 28, 2021 at 12:22:46PM +1100, Peter Jeremy wrote:
> > On 2021-Nov-27 01:26:17 +0200, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > >commit 9c62295373f728459c19138f5aa03d9cb8422554
> > >Author: Konstantin Belousov <kib@FreeBSD.org>
> > >Date:   Sat Nov 27 01:22:27 2021 +0200
> > >
> > >    swapoff_one(): only check free pages count manually turning swap off
> > 
> > That didn't work but I don't think the underlying bug is related to
> > your recent work on swap_pager - digging back through my logs, I've
> > found another similar panic in August last year.
> It is definitely not caused by, it is just yet another issue.
> 
> > 
> > Nov 28 09:40:17 rock64 syslogd: exiting on signal 15
> > Waiting (max 60 seconds) for system process `vnlru' to stop... done
> > Waiting (max 60 seconds) for system process `syncer' to stop... 
> > Syncing disks, vnodes remaining... 0 0 done
> > Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
> > Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
> > All buffers synced.
> > No strategy for buffer at 0xffff0000bf8dc000
> > vnode 0xffffa00009024a80: type VBAD
> >     usecount 2, writecount 0, refcount 33263 seqc users 1
> >     hold count flags ()
> >     flags (VIRF_DOOMED|VV_VMSIZEVNLOCK)
> >     lock type nfs: SHARED (count 1)
> > swap_pager: I/O error - pagein failed; blkno 241400,size 4096, error 45
> > panic: VOP_STRATEGY failed bp=0xffff0000bf8dc000 vp=0
> > cpuid = 0
> > time = 1638052821
> > KDB: stack backtrace:
> > db_trace_self() at db_trace_self
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> > vpanic() at vpanic+0x178
> > panic() at panic+0x44
> > bufstrategy() at bufstrategy+0x80
> > swapdev_strategy() at swapdev_strategy+0xcc
> > swap_pager_getpages_locked() at swap_pager_getpages_locked+0x460
> > swapoff_one() at swapoff_one+0x3e4
> > swapoff_all() at swapoff_all+0x9c
> > bufshutdown() at bufshutdown+0x2ac
> > kern_reboot() at kern_reboot+0x240
> > sys_reboot() at sys_reboot+0x358
> > do_el0_sync() at do_el0_sync+0x4a4
> > handle_el0_sync() at handle_el0_sync+0x9c
> > --- exception, esr 0x56000000
> > KDB: enter: panic
> > [ thread pid 1 tid 100002 ]
> > Stopped at      kdb_enter+0x48: undefined       f900c11f
> > db> 
> > 
> > This is the same traceback as my previous mail.  Looking at the code
> > path, the test whether there's enough RAM to swap in all the data
> > passes in both cases: If swapoff_one() returned ENOMEM then
> > swapoff_all() would report a "Cannot remove swap device" error and
> > keep going (not bother to actually remove the swap device) - and
> > that's not happening.
> > 
> > I think the important message is "No strategy for buffer at 0x..."
> > which comes from vop_nostrategy() and causes bufstrategy() to panic:
> > 
> > swapdev_strategy()
> >  => bstrategy()
> >  => BO_STRATEGY()
> >  => bufstrategy()
> >  => VOP_STRATEGY()
> >  => VOP_STRATEGY_APV()
> >  => vop_nostrategy()
> >     => bufdone() => swp_pager_async_iodone()
> > 
> > Presumably, stopping the network means there's no longer any way for
> > swap operations to complete so the swap device has become associated
> > with default_vnodeops, (though I haven't dug into the actual code
> > path that does that).
> > 
> > Moving up a level, does it really matter if swapoff_one() is skipped?
> > If it actually returned an error (eg if the free memory test failed),
> > then that's what would happen.  By this point in the shutdown, there's
> > no userland left (which makes me wonder why there's anything left in
> > swap in any case) and only the final cleanups remain before the kernel
> > shuts down.  What's really needed is a way to detect that the relevant
> > swap I/O provider has gone away and return to swapoff_all() without
> > panicing.
> 
> I think the intent there is to ensure that the system is in as much steady
> state as possible.  I can easily imagine some HBA doing weird things if
> some io is in flight when the reset comes in.  So we want to ensure that
> no io occurs.
> 
> The cause for your panic is not the network interface down state (in fact,
> I think that interface state up), but as you correctly analyzed, the
> call to vop_nostrategy().  It is there in stack because underlying filesystem
> was unmounted, and the swap vnode was reclaimed, replacing NFS vop vector
> with deadfs vop vector, which inherits from the default vop.
> 
> The bit that I do not understand from your report, is why swapoff_all() did
> not occured earlier.  Your earlier report, where spurious ENOMEM came out
> from swapoff(2) syscall, and which I fixed by the previous patch, means 
> that there was an attempt to swapoff.
> 
> Still, the solution is, IMO, to swapoff before unmount. We do not need
> swapin for flushing buffers.
> 
> Please try this combined patch.
> 
> diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
> index 4b746a269171..201afeec9311 100644
> --- a/sys/kern/vfs_bio.c
> +++ b/sys/kern/vfs_bio.c
> @@ -1452,16 +1452,21 @@ bufshutdown(int show_busybufs)
>  		 */
>  		printf("Giving up on %d buffers\n", nbusy);
>  		DELAY(5000000);	/* 5 seconds */
> +		swapoff_all();
>  	} else {
>  		if (!first_buf_printf)
>  			printf("Final sync complete\n");
> +
>  		/*
> -		 * Unmount filesystems
> +		 * Unmount filesystems.  Swapoff before unmount,
> +		 * because swap on file is unoperational after unmount
> +		 * of underlying filesystem.
>  		 */
> -		if (!KERNEL_PANICKED())
> +		if (!KERNEL_PANICKED()) {
> +			swapoff_all();
>  			vfs_unmountall();
> +		}
>  	}
> -	swapoff_all();
>  	DELAY(100000);		/* wait for console output to finish */
>  }
>  
> diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
> index 4cfdb3fd2cc8..981a71b2c4b1 100644
> --- a/sys/vm/swap_pager.c
> +++ b/sys/vm/swap_pager.c
> @@ -469,7 +469,8 @@ static bool	swp_pager_swblk_empty(struct swblk *sb, int start, int
> limit); static void	swp_pager_free_empty_swblk(vm_object_t, struct swblk *sb);
>  static int	swapongeom(struct vnode *);
>  static int	swaponvp(struct thread *, struct vnode *, u_long);
> -static int	swapoff_one(struct swdevt *sp, struct ucred *cred);
> +static int	swapoff_one(struct swdevt *sp, struct ucred *cred,
> +		    bool swapoff_syscall);
>  
>  /*
>   * Swap bitmap functions
> @@ -2523,14 +2524,14 @@ sys_swapoff(struct thread *td, struct swapoff_args *uap)
>  		error = EINVAL;
>  		goto done;
>  	}
> -	error = swapoff_one(sp, td->td_ucred);
> +	error = swapoff_one(sp, td->td_ucred, true);
>  done:
>  	sx_xunlock(&swdev_syscall_lock);
>  	return (error);
>  }
>  
>  static int
> -swapoff_one(struct swdevt *sp, struct ucred *cred)
> +swapoff_one(struct swdevt *sp, struct ucred *cred, bool swapoff_syscall)
>  {
>  	u_long nblks;
>  #ifdef MAC
> @@ -2552,8 +2553,16 @@ swapoff_one(struct swdevt *sp, struct ucred *cred)
>  	 * available virtual memory in the system will fit the amount
>  	 * of data we will have to page back in, plus an epsilon so
>  	 * the system doesn't become critically low on swap space.
> +	 * The vm_free_count() part does not account e.g. for clean
> +	 * pages that can be immediately reclaimed without paging, so
> +	 * this is very rough estimation.
> +	 *
> +	 * On the other hand, not turning swap off on swapoff_all()
> +	 * means that we loose swap data when filesystems go away,
> +	 * which is arguably worse.
>  	 */
> -	if (vm_free_count() + swap_pager_avail < nblks + nswap_lowat)
> +	if (swapoff_syscall &&
> +	    vm_free_count() + swap_pager_avail < nblks + nswap_lowat)
>  		return (ENOMEM);
>  
>  	/*
> @@ -2603,7 +2612,7 @@ swapoff_all(void)
>  			devname = devtoname(sp->sw_vp->v_rdev);
>  		else
>  			devname = "[file]";
> -		error = swapoff_one(sp, thread0.td_ucred);
> +		error = swapoff_one(sp, thread0.td_ucred, false);
>  		if (error != 0) {
>  			printf("Cannot remove swap device %s (error=%d), "
>  			    "skipping.\n", devname, error);
>