[patch] zfs livelock and thread priorities

Ben Kelly ben at wanderview.com
Sat May 16 16:40:50 UTC 2009

On May 15, 2009, at 11:13 PM, Adam McDougall wrote:
> On Tue, Apr 28, 2009 at 04:52:23PM -0400, Ben Kelly wrote:
>  On Apr 28, 2009, at 2:11 PM, Artem Belevich wrote:
>> My system had eventually deadlocked overnight, though it took much
>> longer than before to reach that point.
>> In the end I've got many, many processes sleeping in zio_wait with no
>> disk activity whatsoever.
>> I'm not sure if that's the same issue or not.
>> Here are stack traces for all processes -- http://pastebin.com/f364e1452
>> I've got the core saved, so if you want me to dig out some more info,
>> let me know if/how I could help.
>  It looks like there is a possible deadlock between zfs_zget() and
>  zfs_zinactive().  They both acquire a lock via ZFS_OBJ_HOLD_ENTER().
>  The zfs_zinactive() path can get called indirectly from within
>  zio_done().  The zfs_zget() can in turn block waiting for  
> zio_done()'s
>  completion while holding the object lock.
>  The following patch might help:
>     http://www.wanderview.com/svn/public/misc/zfs/zfs_zinactive_deadlock.diff
>  This simply bails out of the inactive processing if the object
>  lock is already held.  I'm not sure if this is 100% correct, as it
>  cannot verify there are references to the vnode.  I also tried
>  executing the zfs_zinactive() logic in a taskqueue to avoid the
>  deadlock, but that caused other deadlocks to occur.
>  Hope that helps.
>  - Ben
> It's my understanding that the deadlock was fixed in -current;
> how does that affect the usefulness of the thread priorities
> patch?  Should I continue testing it or is it effectively a
> NOOP now?
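
(For anyone who doesn't want to follow the diff link quoted above,
the change is roughly shaped like the sketch below.  This is a
simplified paraphrase, and ZFS_OBJ_HOLD_TRYENTER stands in for the
tryenter-style variant of ZFS_OBJ_HOLD_ENTER that the diff actually
uses:)

   void
   zfs_zinactive(znode_t *zp)
   {
           zfsvfs_t *zfsvfs = zp->z_zfsvfs;
           uint64_t z_id = zp->z_id;

           /*
            * If the per-object hold mutex is already taken, skip
            * the inactive processing instead of blocking; blocking
            * here is what closes the lock cycle with zfs_zget().
            */
           if (!ZFS_OBJ_HOLD_TRYENTER(zfsvfs, z_id))
                   return;

           /* ... normal inactive processing ... */

           ZFS_OBJ_HOLD_EXIT(zfsvfs, z_id);
   }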

As far as I know the vnode release deadlock is unrelated to the thread  
prioritization patch.

The particular problem I ran into that caused me to look at the
priorities was a livelock.  When the arc cache got low on memory,
user and txg threads would sometimes begin messaging each other in a
seemingly infinite pattern, waiting for space to be freed.
Unfortunately, these threads were simultaneously starving the
spa_zio threads, keeping them from actually flushing data to the
disks.  This effectively blocked all disk-related activity and would
wedge the box when the syncer process got into the mix as well.
This condition doesn't happen on opensolaris because their use of
explicit priorities ensures that the spa_zio threads take precedence
over user and txg threads.
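
For reference, the opensolaris spa code creates its zio taskqs with
an explicit priority; from memory it does something close to this
(paraphrased, not an exact quote of their sources):

   /*
    * The I/O threads are created at maxclsyspri, the top of the
    * Solaris system class, so they outrank ordinary user and txg
    * threads when the scheduler picks what to run.
    */
   spa->spa_zio_issue_taskq[t] = taskq_create(name,
       zio_taskq_threads, maxclsyspri, 50, INT_MAX,
       TASKQ_PREPOPULATE);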

Beyond this particular scenario it seems possible that there are other
priority-related problems lurking.  ZFS in opensolaris is either
explicitly or implicitly designed around the different threads having
certain relative priorities.  While it seems to mostly work without
these priorities, we are definitely opening ourselves up to untested
corner cases by ignoring the prioritization.

The one downside I have noticed to setting zfs thread priorities  
explicitly is a reduction in interactivity during heavy disk load.   
This is somewhat to be expected since the spa_zio threads are running  
at a higher priority than user threads.  This has been an issue on  
opensolaris as well; the bug report states that a fix is available,
but I haven't had a chance to go back and see what they ended up
doing to make things more responsive.

Currently the thread priority patch for freebsd is a proof of
concept.  If people think it's a valid approach, I can try to clean it
up so that it could be committed.  The two main issues with it right
now are:

   1) It changes the kproc(9) API by adding a kproc_create_priority()
function that allows you to set the priority of the newly created
thread.  I'm not sure how people feel about this; a sketch of what
the prototype might look like is below.

   2) It makes the opensolaris thread_create() function take freebsd
priority values and sets the constants maxclsyspri and minclsyspri to
somewhat arbitrary values.  This means that if someone ports other
opensolaris code over and passes priority values to thread_create()
without using these constants, they will get unexpected behavior.
This could be addressed by creating a mapping function from
opensolaris priorities to freebsd priorities; a rough sketch of such
a mapping is also below.
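
For reference, here is roughly what the new prototype looks like; it
mirrors kproc_create(9) with one extra argument (the exact placement
of the pri parameter may differ in the final patch):

   /*
    * Like kproc_create(9), but the new kernel thread starts at the
    * given FreeBSD scheduling priority rather than the default.
    */
   int
   kproc_create_priority(void (*func)(void *), void *arg,
       struct proc **newpp, int flags, int pages, int pri,
       const char *fmt, ...);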
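
The mapping function could be fairly small.  Something like the
sketch below would work; the 60..99 range for
minclsyspri..maxclsyspri matches the solaris system class, but the
freebsd target window (anchored at PRIBIO here) is just an
illustrative choice, not what the current patch does:

   #include <sys/priority.h>

   #define SOL_MINCLSYSPRI 60      /* solaris minclsyspri */
   #define SOL_MAXCLSYSPRI 99      /* solaris maxclsyspri */

   /*
    * Map a solaris priority (larger means more important) onto a
    * freebsd kernel priority (smaller means more important) by
    * clamping to the solaris system class range and inverting.
    */
   static int
   solaris_pri_to_freebsd(int spri)
   {
           if (spri < SOL_MINCLSYSPRI)
                   spri = SOL_MINCLSYSPRI;
           if (spri > SOL_MAXCLSYSPRI)
                   spri = SOL_MAXCLSYSPRI;
           return (PRIBIO + (SOL_MAXCLSYSPRI - spri));
   }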

> Also, I've been doing some fairly intense testing of zfs in
> recent -current and I am tracking down a situation where
> performance gets worse, but I think I found a workaround.
> I am gathering more data regarding the cause, workaround,
> symptoms, and originating commit and will post about it soon.

I'd be interested to hear more about this.


- Ben
