Process enters unkillable state and somewhat wedges zfs

Daniel Andersen dea at caida.org
Mon Aug 18 19:46:31 UTC 2014


We are currently experiencing a strange problem that partially locks up one of our ZFS pools.  This is on a FreeBSD 10
machine.  Here is a rough layout of our system to better describe what is happening:

We have two pools, tank and work, mounted at /tank and /work respectively.  Within these pools, we have a
variety of partitions.  Each of these partitions is then nullfs mounted into other partitions in an effort to present
a common directory structure for users.  Below is an example:

/tank/a -> /data/a
/tank/a -> /export/a
/tank/b -> /data/b
/work/home -> /home

etc.
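
As a rough sketch, the layout above corresponds to nullfs mounts along these lines.  The MOUNTS variable and the
nullfs_cmds function are illustrative names only, not our actual configuration, and the script just prints the
commands rather than running them:

```shell
#!/bin/sh
# Sketch: print the nullfs mount commands implied by the example layout.
# MOUNTS and nullfs_cmds are hypothetical names; the pairs mirror the
# "source -> mount point" list above.
MOUNTS="/tank/a:/data/a
/tank/a:/export/a
/tank/b:/data/b
/work/home:/home"

nullfs_cmds() {
    echo "$MOUNTS" | while IFS=: read -r src dst; do
        # General form: mount -t nullfs <target> <mount-point>
        echo "mount -t nullfs $src $dst"
    done
}

nullfs_cmds
```

Note that /tank/a appears twice, so the same directory is reachable through three paths: /tank/a, /data/a and
/export/a.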

Now, occasionally, something goes horribly wrong with a process (or sometimes one thread within a process).  It enters
a state where it is running, pegging a CPU at 100%, and is unkillable.  This process, as I understand it, is
attempting to access data on both pools, but only through the nullfs mounts.

Now, when this process enters the above-mentioned state, the /tank pool becomes inaccessible: any process attempting
to touch it directly enters the traditional 'D' state and itself becomes unkillable.  However, you can *still* access all the
data through the nullfs mounts.  So, while 'ls /tank/a' wedges, 'ls /data/a' works fine.

Typically, this happens when the machine is under high load.  ARC memory usage is often at 140+GB (out of 192GB total).
It has happened under low load once, but I suspect there was still substantial I/O load at the time, as we had been
doing many benchmarks trying to trigger this problem, and likely the cache was still flushing to disk.

Initially, we thought this was triggered by a process attempting to dump core, as all the processes that originally wedged
were doing so.  However, after disabling core dumps, we just had a case where 'sudo -u user su - user' wedged.

This has me baffled.  If anyone has any hints as to where to even start debugging this, I'd appreciate it greatly.
For now, I've tuned down the ARC max memory usage, just as a guess.  I saw in another thread here some bugs about
FreeBSD not giving back memory from ARC correctly, when needed.  I don't know if this could cause a process to enter
this state, but figured it was worth a try.
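
For anyone wanting to try the same thing, a minimal sketch of capping the ARC is below.  It assumes the
vfs.zfs.arc_max tunable set in /boot/loader.conf; the 64 GiB figure and the arc_max_line helper are purely
illustrative, not the value we actually chose:

```shell
#!/bin/sh
# Sketch: build a loader.conf line that caps the ZFS ARC.
# arc_max_line is a hypothetical helper; 64 is an example value only.
arc_max_line() {
    gib=$1
    # The tunable is given here in bytes, so convert from GiB.
    echo "vfs.zfs.arc_max=\"$((gib * 1024 * 1024 * 1024))\""
}

# Print the line; it would be added to /boot/loader.conf by hand.
arc_max_line 64
```

The script only prints the line, so nothing changes until it is added to /boot/loader.conf.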

Many thanks,
Daniel Andersen

