dogfooding over in clusteradm land

Tue Jan 3 08:02:35 UTC 2012

On  2 Jan, Don Lewis wrote:
> On  2 Jan, Don Lewis wrote:
>> On  2 Jan, Florian Smeets wrote:
> 
>>> This does not make a difference. I tried on 32K/4K with/without journal
>>> and on 16K/2K all exhibit the same problem. At some point during the
>>> cvs2svn conversion the syncer starts to use 100% CPU. The whole process
>>> hangs at that point, sometimes for hours. From time to time it does
>>> continue doing some work, but really, really slowly. It's usually between
>>> revision 210000 and 220000, when the resulting svn file gets bigger than
>>> about 11-12GB. At that point an ls in the target dir hangs in state ufs.
>>> 
>>> I broke into ddb and ran all commands which I thought could be useful.
>>> The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt
>> 
>> Tracing command syncer pid 9 tid 100183 td 0xfffffe00120e9000
>> cpustop_handler() at cpustop_handler+0x2b
>> ipi_nmi_handler() at ipi_nmi_handler+0x50
>> trap() at trap+0x1a8
>> nmi_calltrap() at nmi_calltrap+0x8
>> --- trap 0x13, rip = 0xffffffff8082ba43, rsp = 0xffffff8000270fe0, rbp = 0xffffff88c97829a0 ---
>> _mtx_assert() at _mtx_assert+0x13
>> pmap_remove_write() at pmap_remove_write+0x38
>> vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
>> vm_object_page_clean() at vm_object_page_clean+0x14d
>> vfs_msync() at vfs_msync+0xf1
>> sync_fsync() at sync_fsync+0x12a
>> sync_vnode() at sync_vnode+0x157
>> sched_sync() at sched_sync+0x1d1
>> fork_exit() at fork_exit+0x135
>> fork_trampoline() at fork_trampoline+0xe
>> --- trap 0, rip = 0, rsp = 0xffffff88c9782d00, rbp = 0 ---
>> 
>> I think this explains why the r228838 patch seems to help the problem.
>> Instead of an application call to msync(), you're getting bitten by the
>> syncer doing the equivalent.  I don't know why the syncer is CPU bound,
>> though.  From my understanding of the patch it only optimizes the I/O.
>> Without the patch, I would expect that the syncer would just spend a lot
>> of time waiting on I/O.  My guess is that this is actually a vm problem.
>> There are nested loops in vm_object_page_clean() and
>> vm_object_page_remove_write(), so you could be doing something that's
>> causing lots of looping in that code.
> 
> Does the machine recover if you suspend cvs2svn?  I think what is
> happening is that cvs2svn is continuing to dirty pages while the syncer
> is trying to sync the file.  From my limited understanding of this code,
> it looks to me like every time cvs2svn dirties a page, it will trigger a
> call to vm_object_set_writeable_dirty(), which will increment
> object->generation.  Whenever vm_object_page_clean() detects a change in
> the generation count, it restarts its scan of the pages associated with
> the object.  This is probably not optimal ...

Since the syncer is only trying to flush out pages that have been dirty
for the last 30 seconds, I think that vm_object_page_clean() should just
make one pass through the object, ignoring generation, and then return
when it is called from the syncer.  That should keep
vm_object_page_clean() from looping over the object again and again if
another process is actively dirtying the object.
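
To make the suspected behaviour concrete, here is a toy userland model of
the restart-on-generation-change loop and of the proposed single-pass
variant.  It is only a sketch and not the kernel code: struct toy_object,
struct toy_writer, toy_dirty_page(), toy_page_clean() and the single_pass
flag are all hypothetical names made up for illustration, and the
"writer" is simulated by a callback that runs each time a page is flushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy stand-in for a VM object; NOT the real struct vm_object. */
struct toy_object {
	bool dirty[NPAGES];
	int generation;		/* bumped every time a page is dirtied */
};

/* Hypothetical writer that re-dirties page 0 up to `budget` times. */
struct toy_writer {
	int budget;
};

/* Models the effect attributed to vm_object_set_writeable_dirty(). */
static void
toy_dirty_page(struct toy_object *obj, size_t idx)
{
	assert(idx < NPAGES);
	obj->dirty[idx] = true;
	obj->generation++;
}

/* Start with every page dirty. */
static void
toy_object_init_all_dirty(struct toy_object *obj)
{
	obj->generation = 0;
	for (size_t i = 0; i < NPAGES; i++)
		toy_dirty_page(obj, i);
}

/* One step of the simulated concurrent writer. */
static void
toy_writer_step(struct toy_writer *w, struct toy_object *obj)
{
	if (w->budget > 0) {
		w->budget--;
		toy_dirty_page(obj, 0);
	}
}

/*
 * Flush dirty pages and return the number of page visits.
 * single_pass == false: restart the scan whenever the generation
 * changed while a page was being flushed.
 * single_pass == true: ignore the generation and scan exactly once.
 */
static int
toy_page_clean(struct toy_object *obj, struct toy_writer *w, bool single_pass)
{
	int visits = 0;
	bool rescan;

	do {
		int curgen = obj->generation;

		rescan = false;
		for (size_t i = 0; i < NPAGES; i++) {
			visits++;
			if (!obj->dirty[i])
				continue;
			obj->dirty[i] = false;	/* "flush" the page */
			toy_writer_step(w, obj);	/* writer runs meanwhile */
			if (!single_pass && obj->generation != curgen) {
				rescan = true;
				break;
			}
		}
	} while (rescan);
	return (visits);
}
```

In this model the restarting variant keeps rescanning for as long as the
writer keeps dirtying pages, so the work done grows with the writer's
activity rather than with the size of the object.  The single-pass
variant always visits each page once and leaves any page that was
re-dirtied behind its back for the next syncer run.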