zpool import hangs when out of space - Was: zfs pool import hangs on [tx->tx_sync_done_cv]

Steven Hartland killing at multiplay.co.uk
Tue Oct 14 11:20:06 UTC 2014


----- Original Message ----- 
From: "Steven Hartland" <killing at multiplay.co.uk>
To: "K. Macy" <kmacy at freebsd.org>
Cc: "freebsd-fs at FreeBSD.org" <freebsd-fs at freebsd.org>; "mark" <Mark.Martinec at ijs.si>; "FreeBSD Stable" <freebsd-stable at freebsd.org>
Sent: Tuesday, October 14, 2014 9:14 AM
Subject: Re: zpool import hangs when out of space - Was: zfs pool import hangs on [tx->tx_sync_done_cv]


> ----- Original Message ----- 
> From: "K. Macy" <kmacy at freebsd.org>
> 
> 
>>>> Thank you both for analysis and effort!
>>>>
>>>> I can't rule out the possibility that my main system pool
>>>> on a SSD was low on space at some point in time, but the
>>>> three 4 GiB cloned pools (sys1boot and its brothers) were all
>>>> created as a zfs send / receive copies of the main / (root)
>>>> file system and I haven't noticed anything unusual during
>>>> syncing. This syncing was done manually (using zxfer) and
>>>> independently from the upgrade on the system - on a steady/quiet
>>>> system, when the source file system definitely had sufficient
>>>> free space.
>>>>
>>>> The source file system now shows 1.2 GiB of usage as
>>>> reported by df:
>>>>   shiny/ROOT  61758388  1271620  60486768  2%  /
>>>> It seems unlikely that the 1.2 GiB has grown to 4 GiB of
>>>> space on a cloned filesystem.
>>>>
>>>> Will try to import the main two pools after re-creating
>>>> a sane boot pool...
>>>
>>>
>>> Yeah, zfs list only shows around 2-3GB used too, but zpool list
>>> shows the pool is out of space. Can't rule out an accounting
>>> issue though.
>>>
>> 
>> What is using the extra space in the pool? Is there an unmounted
>> dataset or snapshot? Do you know how to easily tell? Unlike txg and
>> zio processing, I don't have the luxury of having just read that part
>> of the codebase.
> 
> It's not clear, but I believe it could just be fragmentation, even
> though it's ashift=9.
> 
> I sent the last snapshot to another pool of the same size and it
> resulted in:
> NAME       SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
> sys1boot  3.97G  3.97G   190K     0%         -    99%  1.00x  ONLINE  -
> sys1copy  3.97G  3.47G   512M    72%         -    87%  1.00x  ONLINE  -
> 
> I believe FRAG shows 0% because the feature wasn't enabled for the
> lifetime of the pool, hence it's simply not showing a valid value.
> 
> zfs list -t all -r sys1boot
> NAME                                  USED  AVAIL  REFER  MOUNTPOINT
> sys1boot                             1.76G  2.08G    11K  /sys1boot
> sys1boot/ROOT                        1.72G  2.08G  1.20G  /sys1boot/ROOT
> sys1boot/ROOT at auto-2014-08-16_04.00     1K      -  1.19G  -
> sys1boot/ROOT at auto-2014-08-17_04.00     1K      -  1.19G  -
..
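
To chase the zpool vs zfs discrepancy above, a few commands that should
help pin down where the space has gone (a sketch; zdb walks the whole pool
and can take a while, and the freeing property needs a pool recent enough
to have async destroy):

  # per-dataset breakdown: snapshots, descendants, refreservations
  zfs list -o space -t all -r sys1boot

  # pool-level view, per vdev
  zpool list -v sys1boot

  # space still pending release from an async destroy, if any
  zpool get freeing sys1boot

  # full block accounting by object type (slow)
  zdb -bb sys1boot

If zfs list and zdb both agree on ~2GB used while zpool list still reports
the pool full, that would point at an accounting problem rather than a
hidden dataset or snapshot.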

Well, here's an interesting issue: I left this pool alone this morning,
literally doing nothing, and it's now out of space.
zpool list
NAME       SIZE  ALLOC   FREE   FRAG  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
sys1boot  3.97G  3.97G   190K     0%         -    99%  1.00x  ONLINE  -
sys1copy  3.97G  3.97G     8K     0%         -    99%  1.00x  ONLINE  -

There's something very wrong here as nothing has been accessing the pool.
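
To confirm that nothing is actually writing to it, something along these
lines should show any I/O or administrative activity (a sketch, nothing
beyond the stock zpool commands assumed):

  # watch for read/write activity on the pool
  zpool iostat -v sys1boot 5

  # everything that has touched the pool, including internal events
  zpool history -il sys1boot

If iostat shows no writes while the numbers still move, the history output
should at least show whether anything internal (receives, snapshots,
scrubs) ran in the meantime.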

  pool: zfs
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs         ONLINE       0     2     0
          md1       ONLINE       0     0     0

I tried destroying the pool and even that failed, presumably because
the pool has suspended IO.
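
In case it helps anyone hitting the same thing: for a suspended pool backed
by a memory disk the rough sequence would be the following (a sketch; the
backing file path is only an example):

  # check whether the md device is still there
  mdconfig -lv

  # if it was lost, recreate it from the same backing file (example path)
  mdconfig -a -t vnode -f /path/to/zfs.img -u 1

  # ask ZFS to retry the failed I/O and un-suspend the pool
  zpool clear zfs

  # once I/O resumes, the pool can be exported or destroyed
  zpool destroy zfs

If the backing device is gone for good, zpool clear has nothing to retry
against and a reboot is normally the only way to get rid of the pool.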

    Regards
    Steve

