FreeBSD/ZFS on [HEAD] chews up memory
Karl Denninger
karl at denninger.net
Thu Apr 9 14:21:23 UTC 2015
On 4/9/2015 08:53, Mark Martinec wrote:
> 2015-04-09 15:19, Bob Friesenhahn wrote:
>> On Thu, 9 Apr 2015, grarpamp wrote:
>>>> RAM amount might matter too. 12GB vs 32GB is a bit of a difference.
>>> Allow me to bitch hypothetically...
>>> We, and I, get that some FS need memory, just like kernel and
>>> userspace need memory to function. But to be honest, things
>>> should fail or slow gracefully. Why in the world, regardless of
>>> directory size, should I ever need to feed ZFS 10GB of RAM?
>>
>> From my reading of this list in the past month or so, I have seen
>> other complaints about memory usage, but also regarding UFS and NFS
>> and not just ZFS. One is lead to think that the way the system uses
>> memory for filesystems has changed.
>>
>> As others have said, ZFS ARC should automatically diminish, but
>> perhaps ZFS ARC is not responsible for the observed memory issues.
>>
>> Bob
>
> I'd really like to see the:
>
> [Bug 187594] [zfs] [patch] ZFS ARC behavior problem and fix
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>
> find its way into 10-STABLE. Things behaved much more
> sanely some time in 9.*, before the great UMA change
> took place. Not everyone has dozens of gigabytes of memory.
> With 16 GB mem even when memory is tight (poudriere build),
> the wired size seems excessive, most of which is ARC.
>
There are a number of intertwined issues related to how the VM system
interacts with ZFS' use of memory for ARC; the patch listed above IMHO
resolves most -- but not all -- of them.
The one big one remaining, for which I do not have a patch at present,
is the dmu_tx write cache (exposed in sysctl as
vfs.zfs.dirty_data_max*). It is sized at boot based on available RAM,
with both a minimum and a maximum, and it is shared across all pools.
The default allows up to 10% of RAM to be used for this, with a cap of
4GB. That can be a problem: on a machine with a moderately large RAM
configuration and spinning rust, it is entirely possible for that write
cache to represent /tens of seconds or even more than a minute/ of
actual I/O time to flush. (The maximum full-track sequential I/O
speed of a 7200RPM 4TB drive is in the ~200MB/sec range; 10% of 32GB
is about 3GB, so this is ~15 seconds of I/O time in a typical 4-unit
RaidZ2 vdev -- and it gets worse, much worse, with smaller-capacity
disks, which have less areal density under the head and are thus slower
due to the basic physics of the matter.) The write cache is a very
good thing for performance in most circumstances, because it allows ZFS
to optimize writes to minimize the number of seeks and the latency
required, but there are some pathological cases where having it too
large is very bad for performance.
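The back-of-the-envelope arithmetic above can be sketched as follows
(the RAM size and drive throughput are the illustrative figures from the
text, not measurements from any particular system):

```python
# Estimate how long a full dmu_tx write-cache flush takes on spinning rust.
# Figures mirror the example in the text and are illustrative only.

GB = 1024 ** 3
MB = 1024 ** 2

ram = 32 * GB
# Default sizing: up to 10% of RAM, capped at 4GB (shared across all pools).
dirty_data_max = min(ram // 10, 4 * GB)

# Approximate full-track sequential rate of a 7200RPM 4TB drive.
seq_throughput = 200 * MB  # bytes/sec

flush_seconds = dirty_data_max / seq_throughput
print(f"{flush_seconds:.1f} seconds of I/O to drain the write cache")
```

With these numbers the cache works out to roughly 16 seconds of
sequential I/O; the ~15 seconds quoted in the text rounds 10% of 32GB
down to an even 3GB.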
Specifically, it becomes a problem when the operation you wish to
perform on the filesystem requires coherency with something _in_ that
cache, so the cache must flush and complete before that operation
can succeed. This manifests when you do something as benign as typing
"vi some-file" and your terminal session locks up for tens of seconds
or, in some cases, more than a minute!
If _all_ the disks on your machine are of a given type and reasonably
consistent in I/O throughput (e.g. all SSDs, or all rotating rust of
approximately the same size and throughput), then you can tune this as
the code stands to get good performance and avoid the problem. But if
you have some volumes comprised of high-performance SSD storage (say,
for often-modified or frequently-accessed database tables) and other
volumes comprised of high-capacity spinning rust (because SSD for
storage of that data makes no economic sense), then you've got a
problem, because dirty_data_max is system-wide and not per-pool.
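As a stopgap on such mixed-storage systems, the cache can at least be
capped globally at boot. A sketch of what that might look like in
/boot/loader.conf is below; the 256MB figure is an arbitrary
illustration, not a recommendation, and on some FreeBSD versions the
value can also be adjusted at runtime via sysctl:

```shell
# /boot/loader.conf -- cap the dmu_tx write cache system-wide at boot.
# 256MB (268435456 bytes) is an arbitrary example; size it so a full
# flush on your *slowest* pool stays within an acceptable latency.
vfs.zfs.dirty_data_max="268435456"
```

Note that this trades away write-aggregation benefit on the fast pools
to bound stall time on the slow ones, which is exactly the compromise a
per-pool knob would avoid.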
The irony is that with the patch I developed, the pathology tends not
to happen under heavy load, because the dmu_tx cache gets cut back
automatically as part of the UMA reuse mitigation strategy implemented
in that patch. But under light load it still can, and sometimes does,
bite you. The best (and I argue proper) means of eliminating it is
for the dmu_tx cache to be sized per-pool and computed from the pool's
actual write I/O performance; in other words, it should be sized to
represent a maximum acceptable latency-to-coherence time (which should
itself be tunable). Doing so appears to be quite non-trivial, though,
or I would have already taken it on and addressed it.
--
Karl Denninger
karl at denninger.net
/The Market Ticker/