ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

Bruce Evans brde at optusnet.com.au
Mon May 30 14:21:25 UTC 2011


On Thu, 26 May 2011, Artem Belevich wrote:

> On Thu, May 26, 2011 at 6:46 PM, David P Discher <dpd at bitgravity.com> wrote:
>> Hello FS list:
>>
>> We've been using ZFS v3, storage pool v14 with FreeBSD 8.1-RELEASE with fairly good results for over a year.  We have been moving more and more of our storage to ZFS.  Last week, I believe we hit another issue with LBOLT.
>>
>> The original issue was first reported by Artem Belevich for l2arc_feed_thread:
>>
>>  - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html
>>
>> But this affects the arc_reclaim_thread as well. The guys over at iX Systems helped out and pointed me to this patch:
>>
>>  - http://people.freebsd.org/~delphij/misc/218180.diff
>>
>> which typedefs clock_t to int64_t.

I think that patch is unusable and didn't get used.  clock_t is a
(bogusly) machine-dependent system type that cannot be declared in a
cddl header (except possibly to hide bugs by making it different from
the system type in some places only).

>> However, the arc_reclaim_thread does not have a ~24 day rollover - it does not use clock_t.  I think an integer overflow results in LBOLT going negative after about 106-107 days.  We didn't notice this until 112-115 days of uptime.  I think it is also related to L1 ARC sizing and load.  Our systems with the ARC set to a 512M/2G min/max haven't developed the issue - at least not the CPU-hogging thread - but the systems with 12G+ of ARC, and lots of rsync and du activity alongside random reads from the zpool, develop the issue.
>>
>> The problem is slightly different, and possibly more harmful than the l2arc feeder issue seen with LBOLT.
>>
>> in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict() function, under "evict_start:" has this loop to walk the arc cache page list:

[unreadable non-text with hard \xa0's deleted]
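
[The deleted fragment is, roughly, the prefetch-lifespan check in arc_evict() --
this is quoted from memory, so treat it as a sketch rather than the exact code:

	/* prefetch buffers have a minimum lifespan */
	if (HDR_IO_IN_PROGRESS(ab) ||
	    (spa && ab->b_spa != spa) ||
	    (ab->b_flags & (ARC_PREFETCH | ARC_INDIRECT) &&
	    LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
		skipped++;
		continue;
	}

With LBOLT negative, LBOLT - ab->b_arc_access is hugely negative, so the last
test stays true and such buffers are always skipped.]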

>> Now, when LBOLT is negative, with some degree of jitter/randomness, this loop short-circuits, resulting in high CPU usage.  Also, the ARC buffers may not get evicted on time, or possibly at all.  On one system I had, all processes touching the zpool were waiting in D state, and the arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate this issue.  On an affected system:

>> ...
>> Someone else well respected in the community proposed an alternative fix.  I'm vetting it here to make sure there isn't something deeper in the code that could be bitten by this, and to get some clarification:
>>
>> in:
>> ./sys/cddl/compat/opensolaris/sys/time.h
>>
>> the relevant parts are:
>>
>>         41 #define LBOLT   ((gethrtime() * hz) / NANOSEC)

Might work if gethrtime() returns a 64-bit type, but as you pointed out, it is
almost perfectly pessimized.

>>         ...
>>
>>         54 static __inline hrtime_t

Not clear what hrtime_t's underlying type is.

>>         55 gethrtime(void) {
>>         56
>>         57         struct timespec ts;
>>         58         hrtime_t nsec;
>>         59
>>         60 #if 1
>>         61         getnanouptime(&ts);
>>         62 #else
>>         63         nanouptime(&ts);
>>         64 #endif

The ifdef prevents more perfect pessimization -- we are going to scale down
to hz ticks anyway, so we don't want the extra accuracy and overhead of
nanouptime().

>>         65         nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
>
> Yup. This would indeed overflow in ~106.75 days.

Apparently hrtime_t is 64 bits signed, and hz = 1000.  64 bits unsigned
would overflow after twice as long (after the further multiplication by hz
in LBOLT).  Making hz much larger or much smaller than 1000 moves the
overflow point correspondingly.
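
For concreteness, with hrtime_t a signed 64-bit nanosecond count and
hz = 1000, the multiplication in LBOLT overflows at an uptime of

	2**63 / (hz * NANOSEC) = 2**63 / 10**12 s ~= 9.22e6 s ~= 106.75 days

which matches the ~106-107 days observed above.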

>> QUESTION - what units is LBOLT supposed to be in?  If gethrtime() returns nanoseconds, why are the nanoseconds being multiplied by hz?  If LBOLT is supposed to be clock ticks (which is what arc.c looks like it wants), then it really should be:
>>
>>         #define LBOLT   ( (gethrtime() / NANOSEC) * hz )
>>
>> But if that is the case, then why make the call to getnanouptime() at all?  If LBOLT is a number of clock ticks, then can't this just be a query of the uptime in seconds?  So how about something like this:
>>
>>        #define LBOLT   (time_uptime * hz)

This is similar to FreeBSD's `ticks', and overflows at a similar point:
- `ticks' is int, and int is int32_t on all supported arches, so `ticks'
   overflows at 2**31 / hz on all supported arches.  That is 248 days
   if hz is correctly configured as 100.  The default misconfiguration
   of hz = 1000 gives overflow after 24.8 days.
- time_uptime is time_t, so the above overflows at TIME_T_MAX / hz.
   - on i386 and powerpc, time_t is int32_t, so the above overflows at
     the same point as does `ticks' on these arches.
   - on all other arches, time_t is int64_t, so the above overflows after
     2**32 times as long.

> I believe lbolt used to hold number of ticks on solaris, though they
> switched to tickless kernel some time back and got rid of lbolt.

Yes, LBOLT in solaris seems to correspond to `ticks' in FreeBSD, except
it might not overflow like `ticks' does.  (`lbolt' in FreeBSD was a dummy
sleep address with value always 0.)

>> I've applied this change locally and did a basic stress test with our load generator in the lab, thrashing the ARC (96GB RAM, 48G min/max for the ARC).  It seems to have no ill effects - though we will have to wait ~4 months before declaring the actual issue here fixed.  I'm hoping to put this in production next week.

Won't work on i386, where time_t is 32 bits, so time_uptime * hz overflows
after 2**31 / hz seconds (24.8 days with hz = 1000).

>> It would seem, the same optimization could be done here too:
>>
>>                #define ddi_get_lbolt()         (time_uptime * hz)
>>                #define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

To fix the overflow it has to be:

#define ddi_get_lbolt64()	((int64_t)time_uptime * hz)
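
A minimal standalone illustration of why the cast has to widen an operand
rather than the finished product (the variables here are my own stand-ins,
not kernel code; the int32_t mimics i386's time_t):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int hz = 1000;
	int32_t uptime = 30 * 24 * 60 * 60;	/* 30 days, i386-sized time_t */

	/*
	 * Product computed in 32 bits, widened too late: it has already
	 * overflowed (formally undefined; in practice it wraps).
	 */
	int64_t bad = (int64_t)(uptime * hz);

	/* Operand widened first, product computed in 64 bits: correct. */
	int64_t good = (int64_t)uptime * hz;

	printf("bad = %lld, good = %lld\n", (long long)bad, (long long)good);
	return (0);
}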

>> By saving the call to getnanouptime(), a multiply and a divide, there should be a performance improvement of a couple hundred cycles here.  I don't claim this would be noticeable, but it seems like a simple, straightforward optimization.

Should only be a performance improvement of a few tens of cycles.
getnanouptime() is very fast.
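
For reference, this is roughly what it does (paraphrasing sys/kern/kern_tc.c
from memory, so the details may be off): it only copies the uptime snapshot
that tc_windup() cached at the last tick, under a generation check, whereas
nanouptime() additionally reads the hardware timecounter to interpolate
within the current tick.

void
getnanouptime(struct timespec *tsp)
{
	struct timehands *th;
	u_int gen;

	do {
		th = timehands;		/* cached per-tick snapshot */
		gen = th->th_generation;
		bintime2timespec(&th->th_offset, tsp);
	} while (gen == 0 || gen != th->th_generation);
}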

> The side effect is that it limits lbolt resolution to hz units.  With
> HZ=100, that will be 10ms.  Whether it's good enough or too coarse I
> have no idea.  Perhaps we can compromise and update lbolt in
> microseconds.  That should give us a few hundred years until the
> overflow.

No, it actually limits the resolution to one-second units, and that seems
too coarse.  Using getnanouptime() already limited the resolution to
hz/tc_tick inverse-units (tc_tick = 1 for hz <= ~1000, but above that
getnanouptime() provides less than hz inverse-units).
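
One way to keep that resolution without the early overflow -- only a sketch
of mine, not something anyone proposed above, and it assumes hz divides
NANOSEC evenly -- is to divide first by the nanoseconds per tick instead of
multiplying the nanoseconds by hz:

/*
 * Sketch only: gethrtime() itself doesn't overflow int64_t nanoseconds
 * for ~292 years, and dividing never makes the value larger, so there is
 * no 106-day overflow; resolution stays at getnanouptime()'s ~1/hz.
 */
#define LBOLT	(gethrtime() / (NANOSEC / hz))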

`ticks' would be the right thing to use if it didn't overflow.

Some networking code (e.g., in tcp_output.c) still uses `ticks', and at
least used to have bugs from this use when `ticks' overflowed.  Blindly
increasing hz of course made the bugs more frequent.

Bruce

