FreeBSD/ZFS on [HEAD] chews up memory
Karl Denninger
karl at denninger.net
Thu Apr 9 14:21:23 UTC 2015
On 4/9/2015 08:53, Mark Martinec wrote:
> 2015-04-09 15:19, Bob Friesenhahn wrote:
>> On Thu, 9 Apr 2015, grarpamp wrote:
>>>> RAM amount might matter too. 12GB vs 32GB is a bit of a difference.
>>> Allow me to bitch hypothetically...
>>> We, and I, get that some FS need memory, just like kernel and
>>> userspace need memory to function. But to be honest, things
>>> should fail or slow gracefully. Why in the world, regardless of
>>> directory size, should I ever need to feed ZFS 10GB of RAM?
>>
>> From my reading of this list in the past month or so, I have seen
>> other complaints about memory usage, but also regarding UFS and NFS
>> and not just ZFS. One is lead to think that the way the system uses
>> memory for filesystems has changed.
>>
>> As others have said, ZFS ARC should automatically diminish, but
>> perhaps ZFS ARC is not responsible for the observed memory issues.
>>
>> Bob
>
> I'd really like to see the:
>
> [Bug 187594] [zfs] [patch] ZFS ARC behavior problem and fix
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>
> find its way into 10-STABLE. Things behaved much more
> sanely some time in 9.*, before the great UMA change
> took place. Not everyone has dozens of gigabytes of memory.
> With 16 GB mem even when memory is tight (poudriere build),
> the wired size seems excessive, most of which is ARC.
>
There are a number of intertwined issues related to how the VM system
interacts with ZFS' use of memory for ARC; the patch listed above IMHO
resolves most -- but not all -- of them.
The one big one remaining, for which I do not have a patch at present,
is the dmu_tx write cache (exposed in sysctl as
vfs.zfs.dirty_data_max*). It is sized at boot based on available RAM,
with both a minimum and a maximum, and it is shared across all pools.
The default allows up to 10% of RAM to be used for this, with a cap of
4GB. That can be a problem: on a machine with a moderately large RAM
configuration and spinning rust, it is entirely possible for that write
cache to represent /tens of seconds or even more than a minute/ of
actual I/O time to flush. (The maximum full-track sequential I/O
speed of a 7200RPM 4TB drive is in the ~200MB/sec range; 10% of 32GB
is about 3GB, so this is ~15 seconds of I/O time in a typical 4-unit
RaidZ2 vdev -- and it gets worse, much worse, with smaller-capacity
disks, which have less areal density under the head and are thus slower
due to the basic physics of the matter.) The write cache is a very
good thing for performance in most circumstances, because it allows ZFS
to optimize writes to minimize the number of seeks and the latency
required, but there are some pathological cases where having it too
large is very bad for performance.
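The back-of-the-envelope arithmetic above can be sketched as follows
(the RAM size and drive throughput are the illustrative figures from the
text, not measurements from any particular system):

```python
# Estimate how long a full dmu_tx write-cache flush takes on spinning rust.
# Figures mirror the example in the text and are illustrative only.

GB = 1024 ** 3
MB = 1024 ** 2

ram = 32 * GB
# Default sizing: up to 10% of RAM, capped at 4GB (shared across all pools).
dirty_data_max = min(ram // 10, 4 * GB)

# Approximate full-track sequential rate of a 7200RPM 4TB drive.
seq_throughput = 200 * MB  # bytes/sec

flush_seconds = dirty_data_max / seq_throughput
print(f"{flush_seconds:.1f} seconds of I/O to drain the write cache")
```

With these numbers the cache works out to roughly 16 seconds of
sequential I/O; the ~15 seconds quoted in the text rounds 10% of 32GB
down to an even 3GB.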
Specifically, it becomes a problem when the operation you wish to
perform on the filesystem requires coherency with something _in_ that
cache, so the cache must flush and complete before that operation
can succeed. This manifests when you do something as benign as typing
"vi some-file" and your terminal session locks up for tens of seconds
or, in some cases, more than a minute!
If _all_ the disks on your machine are of a given type and reasonably
consistent in I/O throughput (e.g. all SSDs, or all rotating rust of
approximately the same size and throughput), then you can tune this as
the code stands to get good performance and avoid the problem. But if
you have some volumes comprised of high-performance SSD storage (say,
for often-modified or frequently-accessed database tables) and other
volumes comprised of high-capacity spinning rust (because SSD for
storage of that data makes no economic sense), then you've got a
problem, because dirty_data_max is system-wide and not per-pool.
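As a stopgap on such mixed-storage systems, the cache can at least be
capped globally at boot. A sketch of what that might look like in
/boot/loader.conf is below; the 256MB figure is an arbitrary
illustration, not a recommendation, and on some FreeBSD versions the
value can also be adjusted at runtime via sysctl:

```shell
# /boot/loader.conf -- cap the dmu_tx write cache system-wide at boot.
# 256MB (268435456 bytes) is an arbitrary example; size it so a full
# flush on your *slowest* pool stays within an acceptable latency.
vfs.zfs.dirty_data_max="268435456"
```

Note that this trades away write-aggregation benefit on the fast pools
to bound stall time on the slow ones, which is exactly the compromise a
per-pool knob would avoid.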
The irony is that with the patch I developed, the pathology tends not
to happen under heavy load, because the dmu_tx cache gets cut back
automatically as part of the UMA reuse mitigation strategy implemented
in that patch. But under light load it still can, and sometimes does,
bite you. The best (and I argue proper) means of eliminating it is
for the dmu_tx cache to be sized per-pool and computed from the pool's
actual write I/O performance; in other words, it should be sized to
represent a maximum acceptable latency-to-coherence time (which should
itself be tunable). Doing so appears to be quite non-trivial, though,
or I would have already taken it on and addressed it.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/