ZFS: clock_t overflow in l2arc_feed_thread

Mon Jan 31 12:31:52 UTC 2011

On Mon, 31 Jan 2011, Martin Matuska wrote:

> This is a problem for several places in the ZFS code, most of them in
> arc.c. The L2 issuse is quite clear:
> in arc.c: l2arc_feed_thread() we have the following statement:
>
> (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
> next - LBOLT);
>
> next gets calculated from l2arc_write_interval() from the previous
> while() loop.
> Now a value of 0 will occur if l2arc_writ_interval() returns the same
> LBOLT, effectively making this a cv_wait(),
> negative values for the timeout are theoretically possible, too.

Ugh.  Bugs like this are avoided for timevals by doing all conversions
using tvtohz().  This involves lots more than subtracting values.

There is some magic for zero and negative timeouts.  I wanted tvtohz()
to panic for them, but it only has a printf() and that is normally
inactive since it is under DIAGNOSTIC.  callout_reset_on() still silently
changes timeouts of <= 0 to 1.  cv_timedwait() seems to call that
eventually, so I don't se why a negative timeout in the above would
cause larger problems -- it should just wake up early, which should
be handled since it can happen for other reasons.  nanosleep() handles
long timeouts by waking up early (after only 24 days) and going back
to sleep.

> In addition there are other areas in the code where comparative
> decisions based on a stored LBOLT and current LBOLT are made.
> A wrong decision made on clock_t overflow time might be harmful and
> cause unexpected behaviour.

I didn't see all of the early discussion.  Once timeouts become negative
(but changed to 1 by callout_reset_on), they might remain that way for
a long time until LBOLT and/or `next' overflow again, and this would
waste time at best.  If their types are significantly different (say
long for next and int for LBOLT, where long happens to be 64 bits and
int happens to be 32 bits), then only 1 of them will overflow and it
will never catch up with the other.  Similar problems with network
sequence numbers are rendered mostly harmless by making all values
wrap at the same point, but similar bugs result if the "next" number
is actually before the "current" number or only 1 of the types is
bogusly casted or if a difference is assigned to a mismatched type...

> Affected are (what I have seen so far, in
> sys/cddl/contrib/opensolaris/uts/common/fs/zfs/):
> arc.c: arc_access(), arc_reclaim_thread(), l2arc_write_interval(),
> l2arc_feed_thread()
> dmu_zfetch.c: dmu_zfetch_dofetch(), dmu_zfetch_stream_reclaim()
> zil.c: zil_replay()
>
> In userland we have some clock_t in:
> lib/libzpool/common/kernel.c
>
> DÅˆa 31.01.2011 06:25, Bruce Evans wrote / napÃsal(a):
>> On Mon, 31 Jan 2011, Martin Matuska wrote:
>>
>>> I have re-checked OpenSolaris, and discovered that long is a int32_t.
>>>
>>> I agree, we should go for int64_t in our case.

If there are no pointers in sight, then you can probably just use
int64_t for LBOLT.  The type of `next', and overflows on implicit
conversion of `next - LBOLT' to int by the prototype should still be
checked.  I think bugs in these are relatively harmless -- provided
neither `next' or LBOLT overflows, their difference will usually be
relatively small and positive, and glitches when it is negative will
be handled by callout_reset_on() changing it to 1.  Only cases where
the overflow gives a large positive value (24 days is possible) wouldn't
be harmless.  Another harmless case would be if sign extension bugs
and/or type bugs give a value near -1U, as happens in many overflow
cases.  -1U overflows again in the conversion in the prototype and
becomes -1 on 2's complement machines, and then 1 in callout_reset_on().

Bruce