Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

From: Mark Johnston <markj_at_freebsd.org>
Date: Mon, 07 Mar 2022 21:42:54 UTC
On Mon, Mar 07, 2022 at 09:54:26PM +0100, Ronald Klop wrote:
>  
> From: Mark Johnston <markj@freebsd.org>
> Date: Monday, 7 March 2022 16:13
> To: Ronald Klop <ronald-lists@klop.ws>
> CC: bob prohaska <fbsd@www.zefox.net>, Mark Millard <marklmi@yahoo.com>, freebsd-arm@freebsd.org, freebsd-current <freebsd-current@freebsd.org>
> > I haven't been able to reproduce any crashes running poudriere in an
> > arm64 AWS instance, though.  Could you please try the patch below and
> > confirm whether it fixes your panics?  I verified that the apparent
> > problem described above is gone with the patch.
> > 
> > diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
> > index 0cdcfb8fec62..e51c25136ae0 100644
> > --- a/sys/kern/kern_rmlock.c
> > +++ b/sys/kern/kern_rmlock.c
> > @@ -437,6 +437,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >  {
> >     struct thread *td = curthread;
> >     struct pcpu *pc;
> > +   int cpuid;
> >  
> >     if (SCHEDULER_STOPPED())
> >         return (1);
> > @@ -452,6 +453,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >     atomic_interrupt_fence();
> >  
> >     pc = get_pcpu();
> > +   cpuid = pc->pc_cpuid;
> >     rm_tracker_add(pc, tracker);
> >     sched_pin();
> >  
> > @@ -463,7 +465,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >      * conditional jump.
> >      */
> >     if (__predict_true(0 == (td->td_owepreempt |
> > -       CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus))))
> > +       CPU_ISSET(cpuid, &rm->rm_writecpus))))
> >         return (1);
> >  
> >     /* We do not have a read token and need to acquire one. */
> > 
> > 
> > 
> 
> Hi,
> 
> It panicked again with this patch:
>   x0: ffffa00005a31500
>   x1: ffffa00005a0e000
>   x2:                2
>   x3: ffffa00076c4e9a0
>   x4:                0
>   x5:    e672743c8f9e5
>   x6:    dc89f70500ab1
>   x7:               14
>   x8: ffffa00005a31518
>   x9:                1
>  x10: ffffa00005a0e000
>  x11:                0
>  x12:                0
>  x13:                a
>  x14: 1013e6b85a8ecbe4
>  x15:     1dce740d11a5
>  x16: ffff3ea86e2434bf
>  x17: fffffffffffffff2
>  x18: ffff0000fe661800 (g_ctx + fcf9fa54)
>  x19: ffffa00076c4e9a0
>  x20: ffff0000fec39000 (g_ctx + fd577254)
>  x21:                2
>  x22: ffff0000419b6090 (g_ctx + 402f42e4)
>  x23: ffff000000c0b137 (lockstat_enabled + 0)
>  x24:              100
>  x25: ffff000000c0b000 (version + a0)
>  x26: ffff000000c0b000 (version + a0)
>  x27: ffff000000c0b000 (version + a0)
>  x28:                0
>  x29: ffff0000fe661800 (g_ctx + fcf9fa54)
>   sp: ffff0000fe661800
>   lr: ffff00000154ea50 (zio_dva_throttle + 154)
>  elr: ffff00000154ea80 (zio_dva_throttle + 184)
> spsr:         60000045
>  far:     2b753286b0b8
> panic: Unknown kernel exception 0 esr_el1 2000000
> cpuid = 1
> time = 1646685857
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> do_el1h_sync() at do_el1h_sync+0x184
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x2000000
> zio_dva_throttle() at zio_dva_throttle+0x184
> zio_execute() at zio_execute+0x58
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at      kdb_enter+0x44: undefined       f901c11f
> db>  

ZFS doesn't make use of rm locks as far as I can see, so this is a
little weird.  I reverted the original rmlock commit in main, so it may
be worth verifying that the problem really is gone before digging
deeper.  In other words, I'm a bit suspicious that this is a different
bug.