'make -j16 universe' gives SIReset

Marius Strobl marius at alchemy.franken.de
Sat Jul 2 00:23:31 UTC 2011


On Fri, Jul 01, 2011 at 08:17:52AM +1000, Peter Jeremy wrote:
> [Moving back on-list]
> 
> On 2011-Jun-30 06:30:08 +0800, Marius Strobl <marius at alchemy.franken.de> wrote:
> >On Thu, Jun 30, 2011 at 08:00:10AM +1000, Peter Jeremy wrote:
> >> On 2011-Jun-29 19:54:44 +0200, Marius Strobl <marius at alchemy.franken.de> wrote:
> >> >On Wed, Jun 29, 2011 at 12:54:33PM +1000, Peter Jeremy wrote:
> >> >> My V890 has been running "make -j32 buildworld" in a loop for a
> >> >> week now without problems so I think that was the problem.
> >> 
> >> OTOH, a V440 that has been running similar load for a similar period
> >> died overnight with:
> >> 
> >> panic: uma_small_alloc: free page still has mappings!
> >> VNASSERT failed
> >> cpuid = 3
> >> 0xfffff800079643c0: KDB: enter: panic
> ...
> >> I'm fairly sure that is the same kernel but will double-check and
> >> investigate that panic further.
> 
> FWIW, that kernel didn't have the latest patchset (adding Zeus support).

That shouldn't make a difference; the later version only adds the
SPARC64 bits, as you already noticed, and adjusts the boot loader so
it compiles again. I made no changes to the existing parts apart from
fixing a comment. Besides, so far I see no connection between the fix
for the gross user TLB flushing and the problem below.

> 
> >Ok, this appears to be an unrelated problem though. Alan, do you
> >have an idea what could be causing this?
> 
> I managed to get the same panic (though different traceback) on the
> V890 after about an hour of pho@'s stress test with INCARNATIONS=150:
> 
> panic: uma_small_alloc: free page still has mappings!
> cpuid = 1
> KDB: enter: panic
> [ thread pid 142 tid 100196 ]
> Stopped at      kdb_enter+0x80: ta              %xcc, 1
> db> where
> Tracing pid 142 tid 100196 td 0xfffff8a016ace880
> panic() at panic+0x20c
> uma_small_alloc() at uma_small_alloc+0xe8
> keg_alloc_slab() at keg_alloc_slab+0xc8
> keg_fetch_slab() at keg_fetch_slab+0x218
> zone_fetch_slab() at zone_fetch_slab+0x44
> uma_zalloc_arg() at uma_zalloc_arg+0x60c
> m_getm2() at m_getm2+0x134
> m_uiotombuf() at m_uiotombuf+0x4c
> sosend_generic() at sosend_generic+0x420
> sosend() at sosend+0x2c
> soo_write() at soo_write+0x3c
> dofilewrite() at dofilewrite+0x7c
> kern_writev() at kern_writev+0x38
> write() at write+0x4c
> syscallenter() at syscallenter+0x270
> syscall() at syscall+0x74
> -- syscall (4, FreeBSD ELF64, write) %o7=0x101db4 --
> userland() at 0x405936c8
> user trace: trap %o7=0x101db4
> pc 0x405936c8, sp 0x7fdffffd8a1
> pc 0x101f44, sp 0x7fdffffd9a1
> pc 0x104604, sp 0x7fdffffda81
> pc 0x1046f0, sp 0x7fdffffdb51
> pc 0x104994, sp 0x7fdffffdc21
> pc 0x104d90, sp 0x7fdffffdd01
> pc 0x101610, sp 0x7fdffffde41
> pc 0x4020cff4, sp 0x7fdffffdf01
> done
> db>
> 
> I've got a crashdump on the V440 but discovered that gdb reports
> "GDB can't read core files on this machine." so it isn't much use.
> Any suggestions on how to debug this?

The VM and its interaction with the MD code are beyond me; I hope
Alan can chime in here. Reading through the code, though, I see a
possible path that could lead to this. tsb_tte_enter(), which is
the only place where TD_PV is ever set (and only for managed
pages), always calls pmap_cache_enter(), which together with
pmap_cache_remove() does the page color handling. In
pmap_remove_all(), however, pmap_cache_remove() is only called for
managed pages, so for unmanaged pages we might miss removing the
mapping from the color it used. I've no idea, though, whether this
is actually relevant, i.e. whether the VM ever calls
pmap_remove_all() for unmanaged pages. Tentatively I'd say it
doesn't, in which case the only solution I see is to exclude
unmanaged pages from the page color handling and caching, and I
don't know whether that is safe (besides the performance impact).
Unfortunately, I can't reproduce this with my gear. Could you
please try the patch below? I've no idea whether it's correct, but
it might give another datapoint.

Marius

Index: pmap.c
===================================================================
--- pmap.c	(revision 223705)
+++ pmap.c	(working copy)
@@ -1382,21 +1385,21 @@ pmap_remove_all(vm_page_t m)
 	vm_page_lock_queues();
 	for (tp = TAILQ_FIRST(&m->md.tte_list); tp != NULL; tp = tpn) {
 		tpn = TAILQ_NEXT(tp, tte_link);
-		if ((tp->tte_data & TD_PV) == 0)
-			continue;
 		pm = TTE_GET_PMAP(tp);
 		va = TTE_GET_VA(tp);
 		PMAP_LOCK(pm);
 		if ((tp->tte_data & TD_WIRED) != 0)
 			pm->pm_stats.wired_count--;
-		if ((tp->tte_data & TD_REF) != 0)
-			vm_page_flag_set(m, PG_REFERENCED);
-		if ((tp->tte_data & TD_W) != 0)
-			vm_page_dirty(m);
+		if ((tp->tte_data & TD_PV) != 0) {
+			if ((tp->tte_data & TD_REF) != 0)
+				vm_page_flag_set(m, PG_REFERENCED);
+			if ((tp->tte_data & TD_W) != 0)
+				vm_page_dirty(m);
+			pm->pm_stats.resident_count--;
+		}
 		tp->tte_data &= ~TD_V;
 		tlb_page_demap(pm, va);
 		TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
-		pm->pm_stats.resident_count--;
 		pmap_cache_remove(m, va);
 		TTE_ZERO(tp);
 		PMAP_UNLOCK(pm);


More information about the freebsd-sparc64 mailing list