syncer causing latency spikes

Mark Johnston markj at freebsd.org
Wed Jul 17 21:42:38 UTC 2013


On Wed, Jul 17, 2013 at 10:18:52PM +0300, Konstantin Belousov wrote:
> On Wed, Jul 17, 2013 at 02:07:55PM -0400, Mark Johnston wrote:
> > Hello,
> > 
> > I'm trying to investigate and solve some postgres latency spikes that
> > I'm seeing as a result of some behaviour in the syncer. This is with
> > FreeBSD 8.2 (with some local modifications and backports, r231160 in
> > particular). The system has an LSI 9261-8i RAID controller (backed by
> > mfi(4)) and the database and WALs are on separate volumes, a RAID 6 and
> > a RAID 1 respectively. It has about 96GB of RAM installed.
> > 
> > What's happening is that the syncer tries to fsync a large database file
> > and goes to sleep in getpbuf() with the corresponding vnode lock held
> > and the following stack:
> > 
> > #3  0xffffffff805fceb5 in _sleep (ident=0xffffffff80ca8e20, lock=0xffffffff80d6bc20, priority=-2134554464, wmesg=0xffffffff80a4fe43 "wswbuf0", timo=0) at /d2/usr/src/sys/kern/kern_synch.c:234
> > #4  0xffffffff808780d5 in getpbuf (pfreecnt=0xffffffff80ca8e20) at /d2/usr/src/sys/vm/vm_pager.c:339
> > #5  0xffffffff80677a00 in cluster_wbuild (vp=0xffffff02ea3d7ce8, size=16384, start_lbn=20869, len=2) at /d2/usr/src/sys/kern/vfs_cluster.c:801
> > #6  0xffffffff808477ed in ffs_syncvnode (vp=0xffffff02ea3d7ce8, waitfor=Variable "waitfor" is not available.) at /d2/usr/src/sys/ufs/ffs/ffs_vnops.c:306
> > #7  0xffffffff808488cf in ffs_fsync (ap=0xffffff9b0cd27b00) at /d2/usr/src/sys/ufs/ffs/ffs_vnops.c:190
> > #8  0xffffffff8096798a in VOP_FSYNC_APV (vop=0xffffffff80ca5300, a=0xffffff9b0cd27b00) at vnode_if.c:1267
> > #9  0xffffffff8068bade in sync_vnode (slp=0xffffff002ab8e758, bo=0xffffff9b0cd27bc0, td=0xffffff002ac89460) at vnode_if.h:549
> > #10 0xffffffff8068bdcd in sched_sync () at /d2/usr/src/sys/kern/vfs_subr.c:1841
> > 
> > (kgdb) frame 6
> > #6  0xffffffff808477ed in ffs_syncvnode (vp=0xffffff02ea3d7ce8, waitfor=Variable "waitfor" is not available.) at /d2/usr/src/sys/ufs/ffs/ffs_vnops.c:306
> > 306                             vfs_bio_awrite(bp);
> > (kgdb) vpath vp
> > 0xffffff02ea3d7ce8: 18381
> > 0xffffff02eab1dce8: 16384
> > 0xffffff02eaaf0588: base
> > 0xffffff01616d8b10: data
> > 0xffffff01616d8ce8: pgsql
> > 0xffffff002af9f588: <mount point>
> > 0xffffff002af853b0: d3
> > 0xffffff002abd6b10: /
> > (kgdb)
> > 
> > During such an fsync, DTrace shows me that syncer sleeps of 50-200ms are
> > happening up to 8 or 10 times a second. When this happens, a bunch of
> > postgres threads become blocked in vn_write() waiting for the vnode lock
> > to become free. It looks like the write-clustering code is limited to
> > using (nswbuf / 2) pbufs, and FreeBSD prevents one from setting nswbuf
> > to anything greater than 256.
> The syncer is probably just a victim of profiling.  Had postgres called
> fsync(2) itself, you would blame the fsync code for the pauses.

True, and postgres does frequently fsync(2) its own database files as
part of its checkpointing operations. However, every time I take a look
at what's happening during an I/O spike, it's the syncer that's holding
the lock. I'm digging into the postgres internals right now, but I think
that it manages to avoid having a backend write to a database file at the
same time that the bg writer is fsync(2)ing it. postgres takes care to
checkpoint its files when it wants to ensure that data has been written
out, so it seems to me that the syncer's behaviour is at least somewhat
redundant here.
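For reference, the sleep measurements mentioned earlier can be gathered
with a DTrace one-liner along these lines. This is a sketch, not the exact
script I used: it assumes the fbt provider is loaded and that the syncer
kthread's execname is "syncer", both inferred from the stack trace above.

```shell
# Sketch: quantize how long the syncer spends inside getpbuf(),
# i.e. how long it sleeps waiting for a pbuf while holding the
# vnode lock. Requires kernel DTrace with the fbt provider.
dtrace -n '
fbt::getpbuf:entry
/execname == "syncer"/
{ self->ts = timestamp; }

fbt::getpbuf:return
/self->ts/
{
        @lat["getpbuf() latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'
```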

> 
> Just add a tunable to allow the user to manually tune nswbuf,
> regardless of the buffer cache sizing.  And yes, nswbuf default max
> probably should be bumped to something like 1024, at least on 64bit
> architectures which do not starve for kernel memory.

I ended up trying that right after I wrote the first email. I bumped
nswbuf to 8192 and found that it just exacerbated the problem: the
syncer still ends up going to sleep waiting for pbufs, but it generates
a lot more I/O than before, and mfi(4)'s bioq becomes huge; I regularly
saw it grow to 20000 entries, which takes a while to drain.
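(As a sketch of what such a tunable looks like: on kernels that expose
nswbuf as a loader tunable, which stock 8.2 does not, so the name
kern.nswbuf used by later versions is an assumption here, the override
would be a config fragment applied at the next boot.)

```shell
# Sketch, assuming a kernel that exposes nswbuf as a loader tunable
# (not stock 8.2, where it has to be patched in). Goes in
# /boot/loader.conf and takes effect on reboot:
kern.nswbuf="1024"
```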

> 
> > 
> > Since the sleeps are happening in the write-clustering code, I tried
> > disabling write clustering on the mountpoint (with the noclusterw mount
> > option) and found that this replaces my problem with another one: the
> > syncer periodically generates large bursts of writes that create a
> > backlog in the mfi(4) bioq. Then postgres' reads take a few seconds to
> > return, causing more or less the same end result.
> Well, this is exactly what the clustering code is for: merging a lot of
> short i/o requests into bigger i/o?

Right - it's what I expected. I just wanted to preemptively answer the
question of "what happens if you disable clustering?"
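For completeness, that experiment amounted to remounting the database
volume with write clustering disabled, along the lines of:

```shell
# Disable write clustering on an already-mounted filesystem;
# -u updates an existing mount (see mount(8)). /d3 is the mount
# point shown in the vpath output above.
mount -u -o noclusterw /d3
```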

> 
> > 
> > Does anyone have any suggestions on what I can do to help reduce the
> > impact of the syncer on my systems? I can't just move to a newer version
> > of FreeBSD, but I'm willing to backport changes if anyone can point me
> > to something that might help.
> 
> As a side note, syncer code for UFS was redone in HEAD and 9 to only
> iterate over the active vnodes.  This is mostly irrelevant to your
> problems which happen when syncer does perform the i/o, but still it
> saves a lot of CPU.

I noticed that change yesterday when I was going through some commit
logs. I'd like to bring it in at some point, but as you point out, it
probably doesn't really help with this particular problem since the CPUs
are mostly idle in my test setup.
