smp_tlb_shootdown loop (Re: spin lock smp rendezvous held by 0xffffff01250a7980 for > 5 seconds)

Sun Nov 27 03:20:32 GMT 2005

On Sat, Nov 26, 2005 at 06:22:45PM -0500, Kris Kennaway wrote:
> On Thu, Nov 24, 2005 at 06:26:16PM -0500, Kris Kennaway wrote:
> > I got this on a quad amd64 machine running 6.0-STABLE.  At the time it
> > was running 21 simultaneous tar extractions onto a sync-mounted md.
> > 
> > panic() at panic+0x1e6
> > _mtx_lock_spin() at _mtx_lock_spin+0xad
> > pmap_invalidate_range() at pmap_invalidate_range+0xb3
> > pmap_qremove() at pmap_qremove+0x53
> > vfs_vmio_release() at vfs_vmio_release+0x1e0
> > getnewbuf() at getnewbuf+0x368
> > getblk() at getblk+0x3d9
> > ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
> > ffs_write() at ffs_write+0x31b
> > VOP_WRITE_APV() at VOP_WRITE_APV+0xed
> > vn_write() at vn_write+0x228
> > dofilewrite() at dofilewrite+0x90
> > kern_writev() at kern_writev+0x54
> > write() at write+0x4b

Another CPU is here:

smp_tlb_shootdown() at smp_tlb_shootdown+0x40
smp_invlpg_range() at smp_invlpg_range+0x1e
pmap_invalidate_range() at pmap_invalidate_range+0xf9
pmap_qenter() at pmap_qenter+0x64
allocbuf() at allocbuf+0x9a0
getblk() at getblk+0x52d
ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
ffs_write() at ffs_write+0x31b
VOP_WRITE_APV() at VOP_WRITE_APV+0xed
vn_write() at vn_write+0x228
dofilewrite() at dofilewrite+0x90
kern_writev() at kern_writev+0x54
write() at write+0x4b
syscall() at syscall+0x404
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall (4, FreeBSD ELF64, write), rip = 0x80070ea6c, rsp = 0x7fffffffe6a8, rbp = 0x52a800 ---

It is looping:

smp_tlb_shootdown+0x40: repe nop
smp_tlb_shootdown+0x42: movl    0x21c4f8,%eax
smp_tlb_shootdown+0x48: cmpl    %ebx,%eax
smp_tlb_shootdown+0x4a: jb      smp_tlb_shootdown+0x40

static void
smp_tlb_shootdown(u_int vector, vm_offset_t addr1, vm_offset_t addr2)
{
        u_int ncpu;

        ncpu = mp_ncpus - 1;    /* does not shootdown self */
        if (ncpu < 1)
                return;         /* no other cpus */
        mtx_assert(&smp_ipi_mtx, MA_OWNED);
        smp_tlb_addr1 = addr1;
        smp_tlb_addr2 = addr2;
        atomic_store_rel_int(&smp_tlb_wait, 0);
        ipi_all_but_self(vector);
        while (smp_tlb_wait < ncpu)
                ia32_pause();
}

which seems to be the while loop at the end.

db> x/x smp_tlb_wait
smp_tlb_wait:   1
db> x mp_ncpus
mp_ncpus:       4
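
To make the wait condition concrete, here is a toy single-threaded model of
the loop above. It is only a sketch: struct cpu_model, takes_ipi,
shootdown_model() and max_spins are hypothetical names invented for
illustration, not kernel interfaces, and the spin bound exists only so that
the model terminates where the real loop would not.

```c
#include <stdbool.h>

/* Toy model of one remote CPU; the field name is hypothetical. */
struct cpu_model {
	bool takes_ipi;	/* false: this CPU never runs the shootdown handler */
};

/*
 * Model of the initiator side of smp_tlb_shootdown(): start the counter
 * at zero, "send" the IPI to every other CPU, then poll the counter.
 * Unlike the kernel loop, polling gives up after max_spins iterations.
 * Stores the acknowledgement count in *acks_out and returns true only
 * if every other CPU acknowledged.
 */
static bool
shootdown_model(const struct cpu_model *others, unsigned ncpu,
    unsigned max_spins, unsigned *acks_out)
{
	unsigned wait = 0;	/* models smp_tlb_wait after the store of 0 */
	unsigned spins, i;

	/* models ipi_all_but_self(): each responsive CPU bumps the counter */
	for (i = 0; i < ncpu; i++)
		if (others[i].takes_ipi)
			wait++;

	/* models: while (smp_tlb_wait < ncpu) ia32_pause(); */
	for (spins = 0; wait < ncpu && spins < max_spins; spins++)
		;

	*acks_out = wait;
	return (wait >= ncpu);
}
```

With three other CPUs of which only one has takes_ipi set, the model ends
with an acknowledgement count of 1 and returns false, which matches the
smp_tlb_wait = 1, mp_ncpus = 4 state shown by ddb.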

So it looks like it's stuck waiting for the TLB shootdown to be
acknowledged by the other processors: ncpu is mp_ncpus - 1 = 3, but
smp_tlb_wait is only 1.  However, the 3 other CPUs are all in the same place:

> _mtx_lock_spin() at _mtx_lock_spin+0x6b
> getit() at getit+0x6f
> DELAY() at DELAY+0x44
> _mtx_lock_spin() at _mtx_lock_spin+0x6b
> pmap_invalidate_range() at pmap_invalidate_range+0xb3
> pmap_qremove() at pmap_qremove+0x53
> vfs_vmio_release() at vfs_vmio_release+0x1e0
> getnewbuf() at getnewbuf+0x368
> getblk() at getblk+0x3d9
> ffs_balloc_ufs1() at ffs_balloc_ufs1+0x662
> ffs_write() at ffs_write+0x31b
> VOP_WRITE_APV() at VOP_WRITE_APV+0xed
> vn_write() at vn_write+0x228
> dofilewrite() at dofilewrite+0x90
> kern_writev() at kern_writev+0x54
> write() at write+0x4b
> syscall() at syscall+0x404
> Xfast_syscall() at Xfast_syscall+0xa8
> --- syscall (4, FreeBSD ELF64, write), rip = 0x80070ea6c, rsp = 0x7fffffffe6a8, rbp = 0x52ae00 ---
> 
> i.e. the first _mtx_lock_spin() tried to acquire the ipi lock and
> spun, which called DELAY and getit, which tried to acquire the clock
> lock:
> 
>         mtx_lock_spin(&clock_lock);
> 
> which *also* spun, and called DELAY...and at that point things went to
> hell and it recursed until it blew out the stack.
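
The recursion described in the quoted text can be sketched as a small model.
This is an illustration under stated assumptions, not the kernel code: every
model_* name and MODEL_STACK_LIMIT is hypothetical, and the model calls the
delay path on the first contended attempt, whereas the real _mtx_lock_spin()
spins for a while before it calls DELAY().

```c
#include <stdbool.h>

#define MODEL_STACK_LIMIT 64	/* arbitrary stand-in for the kernel stack depth */

static bool clock_lock_contended;	/* models clock_lock being held elsewhere */
static unsigned depth;			/* current nesting of the spin path */
static bool overflowed;			/* set when the model "blows the stack" */

static void model_delay(void);

/* Models the contended path of _mtx_lock_spin(): back off via DELAY(). */
static void
model_lock_spin_contended(void)
{
	depth++;
	if (depth >= MODEL_STACK_LIMIT)
		overflowed = true;	/* stop here instead of crashing */
	else
		model_delay();
	depth--;
}

/* Models getit(): it needs clock_lock, so contention re-enters the spin path. */
static void
model_getit(void)
{
	if (clock_lock_contended)
		model_lock_spin_contended();
}

/* Models DELAY(): it reads the timer through getit(). */
static void
model_delay(void)
{
	model_getit();
}

/* Returns true if a first spin on the IPI lock ends in runaway recursion. */
static bool
model_overflows(bool clock_contended)
{
	clock_lock_contended = clock_contended;
	depth = 0;
	overflowed = false;
	model_lock_spin_contended();	/* first spin, on the IPI lock */
	return (overflowed);
}
```

In this model the recursion only runs away when clock_lock is itself
contended; with an uncontended clock_lock the delay returns after one level.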

So why aren't they processing the IPI?  Was the IPI lost somehow?

Kris