dogfooding over in clusteradm land

Mon Jan 2 20:47:17 UTC 2012

On  2 Jan, Florian Smeets wrote:
> On 29.12.11 01:04, Kirk McKusick wrote:
>> Rather than changing BKVASIZE, I would try running the cvs2svn
>> conversion on a 16K/2K filesystem and see if that sorts out the
>> problem. If it does, it tells us that doubling the main block
>> size and reducing the number of buffers by half is the problem.
>> If that is the problem, then we will have to increase the KVM
>> allocated to the buffer cache.
>> 
> 
> This does not make a difference. I tried on 32K/4K with/without journal
> and on 16K/2K all exhibit the same problem. At some point during the
> cvs2svn conversion the sycer starts to use 100% CPU. The whole process
> hangs at that point sometimes for hours, from time to time it does
> continue doing some work, but really really slow. It's usually between
> revision 210000 and 220000, when the resulting svn file gets bigger than
> about 11-12Gb. At that point an ls in the target dir hangs in state ufs.
> 
> I broke into ddb and ran all commands which i thought could be useful.
> The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt

Tracing command syncer pid 9 tid 100183 td 0xfffffe00120e9000
cpustop_handler() at cpustop_handler+0x2b
ipi_nmi_handler() at ipi_nmi_handler+0x50
trap() at trap+0x1a8
nmi_calltrap() at nmi_calltrap+0x8
--- trap 0x13, rip = 0xffffffff8082ba43, rsp = 0xffffff8000270fe0, rbp = 0xffffff88c97829a0 ---
_mtx_assert() at _mtx_assert+0x13
pmap_remove_write() at pmap_remove_write+0x38
vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
vm_object_page_clean() at vm_object_page_clean+0x14d
vfs_msync() at vfs_msync+0xf1
sync_fsync() at sync_fsync+0x12a
sync_vnode() at sync_vnode+0x157
sched_sync() at sched_sync+0x1d1
fork_exit() at fork_exit+0x135
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff88c9782d00, rbp = 0 ---

I thinks this explains why the r228838 patch seems to help the problem.
Instead of an application call to msync(), you're getting bitten by the
syncer doing the equivalent.  I don't know why the syncer is CPU bound,
though.  From my understanding of the patch it only optimizes the I/O.
Without the patch, I would expect that the syncer would just spend a lot
of time waiting on I/O.  My guess is that this is actually a vm problem.
There are nested loops in vm_object_page_clean() and
vm_object_page_remove_write(), so you could be doing something that's
causing lots of looping in that code.

I think that ls is hanging because it's stumbling across the vnode that
the syncer has locked.