zpool import hangs when out of space - Was: zfs pool import hangs on [tx->tx_sync_done_cv]

Wed Oct 15 04:52:15 UTC 2014

----- Original Message ----- 
From: "Steven Hartland" <killing at multiplay.co.uk>
To: "Mark Martinec" <Mark.Martinec+freebsd at ijs.si>; <freebsd-fs at freebsd.org>; <freebsd-stable at freebsd.org>
Sent: Tuesday, October 14, 2014 12:40 PM
Subject: Re: zpool import hangs when out of space - Was: zfs pool import hangs on [tx->tx_sync_done_cv]

> ----- Original Message ----- 
> From: "Mark Martinec" <Mark.Martinec+freebsd at ijs.si>
> 
> 
>> On 10/14/2014 13:19, Steven Hartland wrote:
>>> Well, an interesting issue: I left this pool alone this morning, literally
>>> doing nothing, and it's now out of space.
>>> zpool list
>>> NAME       SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
>>> sys1boot  3.97G  3.97G   190K     0%         -    99%  1.00x  ONLINE  -
>>> sys1copy  3.97G  3.97G     8K     0%         -    99%  1.00x  ONLINE  -
>>>
>>> There's something very wrong here as nothing has been accessing the pool.
>>>
>>>   pool: zfs
>>> state: ONLINE
>>> status: One or more devices are faulted in response to IO failures.
>>> action: Make sure the affected devices are connected, then run 'zpool clear'.
>>>    see: http://illumos.org/msg/ZFS-8000-HC
>>>   scan: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         zfs         ONLINE       0     2     0
>>>           md1       ONLINE       0     0     0
>>>
>>> I tried destroying the pool and even that failed, presumably because
>>> the pool has suspended IO.
>> 
>> That's exactly how trouble started here. Got the
>>   "One or more devices are faulted in response to IO failures"
>> on all three small cloned boot pools one day, out of the blue.
>> There was no activity there, except for periodic snapshotting
>> every 10 minutes.
> 
> Yeah, this isn't fragmentation, this is something else. I've started a
> thread on the openzfs list to discuss this as there's something quite
> odd going on.

After bisecting the kernel versions in stable/10 the problem commit
appears to be:
https://svnweb.freebsd.org/base?view=revision&revision=268650

Removing it, or using a pool without async_destroy enabled, prevents
the leak.
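
For anyone wanting to check whether one of their own pools has the
feature enabled, here is a minimal sketch. It is not part of the
debugging done so far. The helper name zpool_prop_value is made up for
illustration, and it assumes the usual four-column
"NAME PROPERTY VALUE SOURCE" layout printed by 'zpool get'.

```shell
# Hypothetical helper (not a zpool subcommand): reads 'zpool get' output
# on stdin and prints the VALUE column for the named property.
# Assumes the default layout with a header line: NAME PROPERTY VALUE SOURCE
zpool_prop_value() {
    awk -v prop="$1" 'NR > 1 && $2 == prop { print $3 }'
}

# Usage on a live system (pool name taken from the listing quoted above):
#   zpool get feature@async_destroy sys1boot | zpool_prop_value feature@async_destroy
#   zpool get freeing sys1boot | zpool_prop_value freeing
```

A value of "disabled" for feature@async_destroy would mean the pool is
not using the feature. The "freeing" property reports space still
waiting to be reclaimed after a destroy, so it may be worth watching on
an affected pool.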

More debugging tomorrow.

    Regards
    steve