ZFS: arc_reclaim_thread running 100%, 8.1-RELEASE, LBOLT related

David P Discher dpd at bitgravity.com
Fri May 27 02:17:54 UTC 2011

Hello FS list:

We've been using ZFS v3 (storage pool v14) on FreeBSD 8.1-RELEASE with fairly good results for over a year, and have been moving more and more of our storage to ZFS.  Last week, I believe we hit another issue with LBOLT.

The original issue was first reported by Artem Belevich for l2arc_feed_thread:

 - http://lists.freebsd.org/pipermail/freebsd-fs/2011-January/010558.html

But this affects the arc_reclaim_thread as well.  The guys over at iX Systems helped out and pointed me to this patch:

 - http://people.freebsd.org/~delphij/misc/218180.diff

which typedefs clock_t to int64_t.

However, arc_reclaim_thread is not subject to the ~24-day clock_t rollover, since it does not use clock_t.  Instead, I think a signed integer overflow causes LBOLT to go negative after about 106-107 days.  We didn't actually notice it until 112-115 days of uptime.  I think it is also related to L1 ARC sizing and load.  Our systems with ARC min/max set to 512M/2G haven't developed the issue - at least not the CPU-hogging thread - but the systems with 12G+ of ARC, with lots of rsync and du activity alongside random reads from the zpool, do develop it.

The problem is slightly different, and possibly more harmful, than the l2arc feeder issue seen with LBOLT.

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, the arc_evict() function has this loop, under the "evict_start:" label, to walk the ARC cache page list:

	for (ab = list_tail(list); ab; ab = ab_prev) {
	        ab_prev = list_prev(list, ab);
	        bytes_remaining -= (ab->b_size * ab->b_datacnt);
	        /* prefetch buffers have a minimum lifespan */
	        if (HDR_IO_IN_PROGRESS(ab) ||
	            (spa && ab->b_spa != spa) ||
	            (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
	            LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
	                skipped++;
	                continue;
	        }

Now, when LBOLT is negative, with some degree of jitter/randomness, this loop short-circuits, resulting in high CPU usage.  The ARC buffers may also not get evicted on time, or possibly at all.  On one system I had, all processes touching the zpool were waiting in D-state, and the arc_reclaim_thread was stuck at 100%.  du and rsync seem to aggravate this issue.  On an affected system:

> top -SHb 500 | grep arc_reclaim_thr
   95 root           -8    -     0K    60K arc_re  3 102.9H 96.39% {arc_reclaim_thre}

Conveniently, "skipped++" is surfaced via a sysctl.  Here are two queries to it on this system with the arc reclaim thread running hot (a du was going at the same time), 60 seconds apart:

  kstat.zfs.misc.arcstats.evict_skip: 4117714520450
  kstat.zfs.misc.arcstats.evict_skip: 4118188257434

> uptime
 3:51PM  up 116 days, 23:48, 3 users, load averages: 1.30, 0.96, 0.64

After chatting with someone else well respected in the community, he proposed an alternative fix.  I'm vetting it here to make sure there isn't something deeper in the code that could be bitten by it, and to get some clarification:


the relevant parts are:

	#define LBOLT   ((gethrtime() * hz) / NANOSEC)

	static __inline hrtime_t
	gethrtime(void) {
	        struct timespec ts;
	        hrtime_t nsec;

	#if 1
	        getnanouptime(&ts);
	#else
	        nanouptime(&ts);
	#endif
	        nsec = (hrtime_t)ts.tv_sec * NANOSEC + ts.tv_nsec;
	        return (nsec);
	}

QUESTION - what units is LBOLT supposed to be in?  If gethrtime() returns nanoseconds, why are the nanoseconds being multiplied by hz?  If LBOLT is supposed to be clock ticks (which is what arc.c looks like it wants), then it really should be:

	 #define LBOLT   ( (gethrtime() / NANOSEC) * hz )

But if that is the case, then why make the call to getnanouptime() at all?  If LBOLT is a number of clock ticks, can't this just be a query of uptime in seconds?  So how about something like this:

	#define LBOLT	(time_uptime * hz)

I've applied this change locally, and did a basic stress test with our load generator in the lab, thrashing the ARC cache (96GB RAM, 48G min/max for ARC).  It seems to have no ill effects - though we'll have to wait four months before declaring the actual issue fixed.  I'm hoping to put this in production next week.

All of the above is on 8.1-RELEASE.  ZFS v28 changed some of this, but still didn't improve lbolt:

 - http://svnweb.freebsd.org/base/head/sys/cddl/compat/opensolaris/sys/time.h?revision=221991&view=markup

	#define ddi_get_lbolt()         ((gethrtime() * hz) / NANOSEC)
	#define ddi_get_lbolt64()       (int64_t)((gethrtime() * hz) / NANOSEC)

It would seem the same optimization could be done here too:

		#define ddi_get_lbolt()         (time_uptime * hz)
		#define ddi_get_lbolt64()       (int64_t)(time_uptime * hz)

By saving the call to getnanouptime(), a multiply, and a divide, there should be a performance improvement of a couple hundred cycles here.  I don't claim this would be noticeable, but it seems like a simple, straightforward optimization.

clock_t will still need to be typedef'ed to 64 bits to address the l2arc usage of LBOLT.

Thanks !

David P. Discher
dpd at bitgravity.com * AIM: bgDavidDPD
BITGRAVITY * http://www.bitgravity.com
