dogfooding over in clusteradm land

Tue Jan 3 02:35:46 UTC 2012

On  2 Jan, Don Lewis wrote:
> On  2 Jan, Florian Smeets wrote:

>> This does not make a difference. I tried on 32K/4K with/without journal
>> and on 16K/2K; all exhibit the same problem. At some point during the
>> cvs2svn conversion the syncer starts to use 100% CPU. The whole process
>> hangs at that point, sometimes for hours. From time to time it does
>> continue doing some work, but really, really slowly. It's usually between
>> revision 210000 and 220000, when the resulting svn file gets bigger than
>> about 11-12GB. At that point an ls in the target dir hangs in state ufs.
>> 
>> I broke into ddb and ran all the commands that I thought could be useful.
>> The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt
> 
> Tracing command syncer pid 9 tid 100183 td 0xfffffe00120e9000
> cpustop_handler() at cpustop_handler+0x2b
> ipi_nmi_handler() at ipi_nmi_handler+0x50
> trap() at trap+0x1a8
> nmi_calltrap() at nmi_calltrap+0x8
> --- trap 0x13, rip = 0xffffffff8082ba43, rsp = 0xffffff8000270fe0, rbp = 0xffffff88c97829a0 ---
> _mtx_assert() at _mtx_assert+0x13
> pmap_remove_write() at pmap_remove_write+0x38
> vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
> vm_object_page_clean() at vm_object_page_clean+0x14d
> vfs_msync() at vfs_msync+0xf1
> sync_fsync() at sync_fsync+0x12a
> sync_vnode() at sync_vnode+0x157
> sched_sync() at sched_sync+0x1d1
> fork_exit() at fork_exit+0x135
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff88c9782d00, rbp = 0 ---
> 
> I think this explains why the r228838 patch seems to help the problem.
> Instead of an application call to msync(), you're getting bitten by the
> syncer doing the equivalent.  I don't know why the syncer is CPU bound,
> though.  From my understanding of the patch it only optimizes the I/O.
> Without the patch, I would expect that the syncer would just spend a lot
> of time waiting on I/O.  My guess is that this is actually a vm problem.
> There are nested loops in vm_object_page_clean() and
> vm_object_page_remove_write(), so you could be doing something that's
> causing lots of looping in that code.

Does the machine recover if you suspend cvs2svn?  I think what is
happening is that cvs2svn is continuing to dirty pages while the syncer
is trying to sync the file.  From my limited understanding of this code,
it looks to me like every time cvs2svn dirties a page, it will trigger a
call to vm_object_set_writeable_dirty(), which will increment
object->generation.  Whenever vm_object_page_clean() detects a change in
the generation count, it restarts its scan of the pages associated with
the object.  This is probably not optimal ...
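
To make the cost of that restart behaviour concrete, here is a toy model
in Python. It is only a sketch of the pattern described above, not the
kernel code: scan_cost(), its parameters, and the assumption that the
writer dirties one page every dirty_every page visits are all made up
for illustration. The scanner walks the pages of one object and starts
over whenever the generation count changes under it.

```python
def scan_cost(num_pages, dirty_every, total_writes):
    """Toy model: count page visits for a scan that restarts whenever
    the object's generation count changes.

    num_pages    -- pages in the (hypothetical) object
    dirty_every  -- the writer dirties a page once per this many visits
    total_writes -- how many writes the writer performs before stopping
    Returns (page_visits, restarts).
    """
    generation = 0          # stands in for object->generation
    visits = 0
    restarts = 0
    writes_left = total_writes

    while True:
        seen_generation = generation   # snapshot taken at start of pass
        restarted = False
        for _page in range(num_pages):
            visits += 1
            # Assumed writer behaviour: one dirtied page per
            # dirty_every visits, each bumping the generation count.
            if writes_left > 0 and visits % dirty_every == 0:
                generation += 1
                writes_left -= 1
            # The scanner notices the change and starts over.
            if generation != seen_generation:
                restarts += 1
                restarted = True
                break
        if not restarted:
            return visits, restarts


if __name__ == "__main__":
    print(scan_cost(1000, 100, 0))    # idle writer: one clean pass
    print(scan_cost(1000, 100, 50))   # active writer: repeated restarts
```

In this model an idle writer costs one pass of 1000 visits, while 50
writes spaced 100 visits apart cost 6000 visits and 50 restarts. As
long as the writer keeps dirtying pages faster than one full pass takes,
the scan never gets to finish, which would look a lot like a syncer
stuck at 100% CPU.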