Still getting kmem exhausted panic

Andriy Gapon avg at icyb.net.ua
Tue Sep 28 21:31:17 UTC 2010


on 28/09/2010 21:40 Ben Kelly said the following:
> 
> On Sep 28, 2010, at 1:17 PM, Andriy Gapon wrote:
> 
>> on 28/09/2010 19:46 Ben Kelly said the following:
>>> Hmm.  My server is currently idle with no I/O happening:
>>> 
>>> kstat.zfs.misc.arcstats.c: 25165824 kstat.zfs.misc.arcstats.c_max:
>>> 46137344 kstat.zfs.misc.arcstats.size: 91863156
>>> 
>>> If what you say is true, this shouldn't happen, should it?  This system
>>> is an i386 machine with kmem max at 800M and arc set to 40M.  This is
>>> running head from April 6, 2010, so it is a bit old, though.
>> 
>> Well, your system is a bit old indeed. And the branch is unknown, so I
>> can't really see what sources you have. And I am not sure if I'll be able
>> to say anything about those sources.
> 
> Quite old.  I've been intending to update, but haven't found the time lately.
> I'll try to do the upgrade this weekend and see if it changes anything.
> 
>> As to the numbers - yes, with current code I'd expect arcstats.size to go
>> down to arcstats.c when there is no I/O.  arc_reclaim_thread should do
>> that.
> 
> That's what I thought as well, but when I debugged it a year or two ago I
> found that the buffers were still referenced and thus could not be reclaimed.
> As far as I can remember they needed a vfs/vnops like zfs_vnops_inactive or
> zfs_vnops_reclaim to be executed in order to free the reference.  What is
> responsible for making those calls?

It's time that we should start showing each other places in code :)
Because I don't think that's how the code works.
E.g. I looked at how zfs_read() calls dmu_read_uio(), which calls
dmu_buf_hold_array() and dmu_buf_rele_array() around the uiomove() call.
From what I see, dmu_buf_hold_array() calls dmu_buf_hold_array_by_dnode(), which
calls dbuf_hold(), which calls arc_buf_add_ref() or arc_buf_alloc().
And conversely, dmu_buf_rele_array() calls dbuf_rele(), which calls
arc_buf_remove_ref().

So, I am quite sure that ARC buffers are held/referenced only during ongoing I/O
to or from them.
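
To make it concrete, here is a stripped-down paraphrase of dmu_read_uio() as I
remember the sources of that era (chunking, prefetch and most error handling
are omitted, so take the exact arguments with a grain of salt):

    /*
     * The dbufs (and the ARC buffers behind them) are referenced only
     * around the uiomove() that copies data out, and are released right
     * after it.
     */
    int
    dmu_read_uio(objset_t *os, uint64_t object, uio_t *uio, uint64_t size)
    {
            dmu_buf_t **dbp;
            int numbufs, i, err;

            /* take a hold on every dbuf covering the requested range */
            err = dmu_buf_hold_array(os, object, uio->uio_loffset, size,
                TRUE, FTAG, &numbufs, &dbp);
            if (err)
                    return (err);

            for (i = 0; i < numbufs; i++) {
                    dmu_buf_t *db = dbp[i];
                    uint64_t bufoff = uio->uio_loffset - db->db_offset;
                    uint64_t tocpy = MIN(db->db_size - bufoff, size);

                    /* copy from the held ARC buffer into the caller's uio */
                    err = uiomove((char *)db->db_data + bufoff, tocpy,
                        UIO_READ, uio);
                    if (err)
                            break;
                    size -= tocpy;
            }

            /* drop the holds; the ARC buffers become evictable again */
            dmu_buf_rele_array(dbp, numbufs, FTAG);
            return (err);
    }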

Perhaps, on the other hand, you had in mind the life-cycle of other things (not
ARC buffers) that are accounted against the ARC size (with type ARC_SPACE_OTHER)?
Such as, e.g., the dmu_buf_impl_t-s allocated in dbuf_create().
I have to admit that I haven't investigated the behavior of that part of
ARC-accounted memory.  It's only a small proportion (~10%) of the whole ARC size
on my systems.
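
(For reference, as far as I remember, that accounting is just a bump of the
"other" bucket when the dbuf header is created, with a matching decrement when
it is destroyed - roughly the following, sketched from memory:)

    /* in dbuf_create(): the header comes from its own kmem cache and is
     * merely accounted against the ARC */
    db = kmem_cache_alloc(dbuf_cache, KM_SLEEP);
    /* ... fill in the header ... */
    arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);

    /* and the matching decrement in dbuf_destroy(): */
    kmem_cache_free(dbuf_cache, db);
    arc_space_return(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);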

>>> At one point I had patches running on my system that triggered the
>>> pagedaemon based on arc load and it did allow me to keep my arc below the
>>> max.  Or at least I thought it did.
>>> 
>>> In any case, I've never really been able to wrap my head around the VFS
>>> layer and how it interacts with zfs.  So I'm more than willing to believe
>>> I'm confused.  Any insights are greatly appreciated.
>> 
>> ARC is a ZFS private cache. ZFS doesn't use unified buffer/page cache. So
>> ARC is not directly affected by pagedaemon. But this is not exactly VFS
>> layer thing.
> 
> Can you explain the difference in how the vfs/vnode operations are called or
> used for those two situations?

They are called in exactly the same way.
The VFS layer and the code above it are not aware of FS implementation details.

> I thought that the buffer cache was used by filesystems to implement these
> operations.  So that the buffer cache was below the vfs/vnops layer.  So

The buffer cache works as part of the unified VM, and its buffers use the same
pages as the page cache does.

> while zfs implemented its operations in terms of the arc, things like UFS
> implemented vfs/vnops in terms of the buffer cache.  I thought the layers

Yes.  Filesystems like UFS are "sandwiched" between the buffer cache and the
page cache, which work in concert.  Also, they don't (have to) implement their
own buffer/page caching policies, because it's all managed by the unified VM
system.
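
(For contrast, the UFS read path - the inner loop of ffs_read(), again sketched
from memory with EOF and error handling omitted - just asks the buffer cache
for the block and copies out of it; the pages behind the buf stay cached by the
VM after the release:)

    while (uio->uio_resid > 0 && error == 0) {
            lbn = lblkno(fs, uio->uio_offset);
            blkoffset = blkoff(fs, uio->uio_offset);
            xfersize = MIN(fs->fs_bsize - blkoffset, uio->uio_resid);

            /* bread() returns a buf whose b_data is backed by VM pages */
            error = bread(vp, lbn, fs->fs_bsize, NOCRED, &bp);
            if (error)
                    break;

            /* copy out to the caller's uio */
            error = uiomove((char *)bp->b_data + blkoffset, xfersize, uio);

            /* release the buf; the pages remain in the page cache */
            bqrelse(bp);
    }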

ZFS, by contrast, has its own private cache.
So, first of all, its data may be cached in two places at once - the page cache
and the ARC.  And, because of that, some assumptions of the higher-level code
get violated, so ZFS has to jump through hoops to meet those assumptions (e.g.
see UIO_NOCOPY).
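
(If I remember the FreeBSD side correctly, UIO_NOCOPY makes uiomove() skip the
data copy altogether and only advance the uio bookkeeping, which is what lets
the page-cache-backed paths reuse the same read/write code.  A stripped-down
sketch of the uiomove() loop, with locking and error handling omitted:)

    while (n > 0 && uio->uio_resid) {
            iov = uio->uio_iov;
            cnt = MIN(iov->iov_len, n);
            if (cnt == 0) {
                    /* this iovec is exhausted, move on to the next one */
                    uio->uio_iov++;
                    uio->uio_iovcnt--;
                    continue;
            }

            switch (uio->uio_segflg) {
            case UIO_USERSPACE:
                    /* copyin()/copyout() between cp and user memory */
                    break;
            case UIO_SYSSPACE:
                    /* bcopy() between cp and kernel memory */
                    break;
            case UIO_NOCOPY:
                    /* no data movement at all */
                    break;
            }
            /* in every case the uio is advanced as if the copy happened */
            iov->iov_base = (char *)iov->iov_base + cnt;
            iov->iov_len -= cnt;
            uio->uio_resid -= cnt;
            uio->uio_offset += cnt;
            cp = (char *)cp + cnt;
            n -= cnt;
    }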

> further up the chain like the page daemon did not distinguish that much
> between these two implementation due to the VFS interface layer.  (Although

Right, but see above.

> there seems to be a layering violation in that the buffer cache signals
> directly to the upper page daemon layer to trigger page reclamation.)

Umm, not sure if that is a fact.

> The old (ancient) patch I tried previously to help reduce the arc working set
> and allow it to shrink is here:
> 
> http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff
> 
> Unfortunately, there are a couple ideas on fighting fragmentation mixed into
> that patch.  See the part about arc_reclaim_pages().  This patch did seem to
> allow my arc to stay under the target maximum even when under load that
> previously caused the system to exceed the maximum.  When I update this
> weekend I'll try a stripped down version of the patch to see if it helps or
> not with the latest zfs.
> 
> Thanks for your help in understanding this stuff!

The patch seems good, especially the part that takes kmem fragmentation into
account.  But it also seems to be heavily tuned towards "tiny ARC" systems like
yours, so I am not sure yet how suitable it is for "mainstream" systems.

-- 
Andriy Gapon

