FreeBSD 10.1 Memory Exhaustion

Karl Denninger karl at denninger.net
Tue Jul 14 16:59:23 UTC 2015


On 7/14/2015 10:10, Sean Chittenden wrote:
> I think the reason this is not seen more often is because people frequently
> throw limits on the arc in /boot/loader.conf:
>
> vfs.zfs.arc_min="18G"
> vfs.zfs.arc_max="149G"
>
> ZFS ARC *should* not require those settings, but does currently for mixed
> workloads (i.e. databases) in order to be "stable".  By setting fixed sizes
> on the ARC, UMA and ARC are much more cooperative in that they have their
> own memory regions to manage so this behavior is not seen as often.
However, this is a false god unless you have very tight control over the
RSS requirements on your machine.  For an NFS fileserver or similar you
might be able to get away with that, because the number of nfsds (for
example) is a nominally-known quantity and you can probably quantify RSS
requirements.
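
For illustration only, the sort of thing being described is a static cap
in /boot/loader.conf along these lines (the values here are placeholders,
not recommendations; arc_max has to be derived from your own machine's
RAM and peak application RSS, which is exactly the catch):

```shell
# /boot/loader.conf -- illustrative placeholder values only.
# Pick arc_max so that (physical RAM - arc_max) comfortably exceeds
# the peak combined RSS of everything you expect to run.
vfs.zfs.arc_min="1G"
vfs.zfs.arc_max="4G"
```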

For a server that accepts connections from outside sources and is
subject to burst loads, this strategy moves the wall but probably doesn't
prevent the problem entirely.  (What happens when a bunch of web clients
hit your machine at once, for example, spiking memory demand?)

The fundamental issue is that under certain load patterns (and
surprisingly often) the base code will prefer to keep pages allocated to
ARC (disk cache) in memory over RSS, causing RSS to be paged out.  This
is exacerbated by UMA's "lazy" return of allocated kernel memory (which
is a good thing most of the time, for performance reasons).  That
decision is almost always wrong because paging out RSS requires one
guaranteed I/O (to place the paged RSS on swap) and may require two (to
later recover it if it is referenced), while discarding cached disk data
carries no I/O guarantee: one future I/O is required only if the cached
page is referenced again.
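
The symptom is visible from userland, incidentally.  A sketch of a quick
check (both OIDs are standard FreeBSD sysctls; a large, stable ARC
alongside a climbing swap-out counter is the pattern described above):

```shell
#!/bin/sh
# Compare ARC size against the cumulative swap page-out counter.  If
# arcstats.size stays pinned near arc_max while v_swappgsout keeps
# climbing, ARC is being held in preference to process RSS.
arc=$(sysctl -n kstat.zfs.misc.arcstats.size)
swapped=$(sysctl -n vm.stats.vm.v_swappgsout)
echo "ARC: ${arc} bytes; pages swapped out: ${swapped}"
```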

The patch in question does not change the base code's behavior until and
unless memory is constrained.  It then pares back ARC instead of
allowing the system to page out RSS; in addition, when under memory
pressure, UMA is patrolled to keep the lazily-held kernel memory in
check, and the dmu_tx write buffer size is cut back so as to prevent
heavy burst-loading of memory during write-intensive operations.

The latter, IMHO, is a poorly chosen value in the first instance.  The
ideal situation would be one where the dmu_tx write buffer size is
selected based on the performance of each vdev, so as to always have at
least one full buffer available when the previous DMA'd transfer
completes, but not materially more than one (perhaps two at most).  As
it stands there is only one setting for the entire system (rather than
one per vdev), and it is sized based not on I/O channel performance but
on system memory, with a cap of 4GB, which takes a hell of a long time
to drain to small-parallel (or non-parallel) spinning-media vdevs.  Such
a flush can implicate a sequencing lock (e.g. when you wish to modify
something that is pending write in that buffer), which has the potential
to lead to further misbehavior in the form of long delays before the
system responds.

If you have all spinning media and few (or no) parallel channels on the
vdevs for writes, lowering the max_max will be of material benefit in
leveling out I/O performance, with no penalty on peak write rates.
However, if your machine mixes SSDs and spinning rust there's no "one
size fits all", nor is there if you have pools with varying degrees of
parallelism.
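
To put the 4GB cap in perspective, some back-of-the-envelope arithmetic
(the 150 MB/s figure is an assumed sustained sequential write rate for a
single spinning-media vdev, not a measured one):

```shell
#!/bin/sh
# Hypothetical drain-time estimate for a full write buffer at the 4GB cap
# flushing to a single spinning vdev.
buffer_mib=$((4 * 1024))   # the 4GB cap, expressed in MiB
vdev_rate=150              # assumed sustained write rate, MiB/s
echo "drain time: $((buffer_mib / vdev_rate)) seconds"
# prints: drain time: 27 seconds
```

Roughly half a minute during which anything serialized behind that flush
can stall, which is the "long delays" failure mode described above.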
>
> To be clear, however, it should not be necessary to set parameters like
> these in /boot/loader.conf in order to obtain consistent operational
> behavior.  I'd be curious to know if someone running 10.2 BETA without
> patches is able to trigger this behavior or not.  There was work done that
> reported helped with this between 10.1 and now.  To what extent it helped,
> however, I don't have any advice yet.
>
> -sc
I looked at the changes in 10.2-PRE and didn't see anything that led me
to believe the behavior would materially change.  I have not, however,
had the time to run an exhaustive test suite against unpatched 10.2-PRE.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/