ZFS buggy in CURRENT? Stuck in [zio->io_cv] forever!

Sun Oct 27 23:30:39 UTC 2013

On Sun, 27 Oct 2013 16:32:13 -0000
"Steven Hartland" <killing at multiplay.co.uk> wrote:

> 
> ----- Original Message ----- 
> From: "O. Hartmann" <ohartman at zedat.fu-berlin.de>
> > 
> > I have set up a RAIDZ pool comprising four 3TB HDDs. To maintain 4k
> > block alignment, I followed the instructions given on several sites
> > and I'll sketch them here for the record.
> > 
> > The operating system is 11.0-CURRENT AND 10.0-BETA2.
> > 
> > Create a GPT scheme on each drive and add one partition covering
> > the whole disk with
> > 
> > gpart add -t freebsd-zfs -b 1M -l disk0[0-3] ada[3-6]
> > 
> > gnop create -S4096 gpt/disk0[0-3]
> > 
> > Because I added a disk to an existing RAIDZ, I exported the former
> > ZFS pool, then I deleted on each disk the partition and then
> > destroyed the GPT scheme. The former pool had a ZIL and CACHE
> > residing on the same SSD, partitioned. I didn't kill or destroy the
> > partitions on that SSD. To align 4k blocks, I also created the NOP
> > overlays on the existing gpt/log00 and gpt/cache00 via
> > 
> > gnop create -S4096 gpt/log00|gpt/cache00
> > 
> > After I created a new pool via zpool create POOL gpt/disk0[0-3].nop
> > log gpt/log00.nop cache gpt/cache00.nop
> 
> You don't need any of the nop hax in 10 or 11 any more as it has
> proper sector size detection. The caveat for this is when you have a
> disk which advertises 512b sectors but is 4k and we don't have a 4k
> quirk in the kernel for it yet.

Well, this is news to me.

> 
> If anyone comes across a case of this, feel free to drop me the
> details from camcontrol <identify|inquiry> <device>

camcontrol identify says this (serial numbers skipped): 

ada3:
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       5860533168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 
media RPM             5400

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      5860533168/5860533168
HPA - Security                 no

ada4/ada5/ada6:
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       5860533168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         no       no
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      5860533168/5860533168
HPA - Security                 no
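
For anyone collecting such reports, here is a minimal Python sketch that pulls
the logical and physical sector sizes out of the "sector size" line shown
above and flags drives that present 512-byte logical sectors on larger
physical ones. It is not part of any FreeBSD tool; the function names are made
up for illustration, and it only reads what the drive itself reports.

```python
import re

# Matches the "sector size" line as printed in the camcontrol identify output above.
SECTOR_RE = re.compile(
    r"sector size\s+logical (\d+), physical (\d+), offset (\d+)"
)

def parse_sector_sizes(identify_text):
    """Return (logical, physical, offset) in bytes, or None if the line is absent."""
    m = SECTOR_RE.search(identify_text)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

def is_512e(identify_text):
    """True when the drive reports 512-byte logical but larger physical sectors."""
    sizes = parse_sector_sizes(identify_text)
    return sizes is not None and sizes[0] == 512 and sizes[1] > 512

sample = "sector size           logical 512, physical 4096, offset 0"
print(parse_sector_sizes(sample))
print(is_512e(sample))
```

A drive that reports 512 for both values would not be flagged by this sketch,
which is exactly the case where a kernel quirk or the gnop workaround comes in.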

> 
> If due to this you still need to use the gnop hack then you only need
> to apply it to 1 device as the zpool create uses the largest ashift
> from the disks.

This is also new to me and apparently not well reported or documented.
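
The rule described above can be sketched in a few lines of Python. This is
purely illustrative and not ZFS code: it assumes ashift is the base-2
logarithm of the sector size a device presents and that pool creation takes
the largest value among the member devices. The function names and the example
device list are hypothetical.

```python
def ashift_for(sector_size):
    """Return log2 of a power-of-two sector size, e.g. 4096 -> 12."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_size.bit_length() - 1

def vdev_ashift(sector_sizes):
    """Largest ashift among the member devices' sector sizes."""
    return max(ashift_for(s) for s in sector_sizes)

# Hypothetical case: three devices seen as 512-byte and one gnop overlay
# created with -S4096, which is seen as 4096-byte.
print(vdev_ashift([512, 512, 512, 4096]))
```

Under these assumptions a single 4096-byte overlay is enough to raise the
result from 9 to 12, which is why one gnop device would suffice.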

> 
> I would then as the very first step export and import as at this point
> there is much less data on the devices to scan through, not that
> this should be needed but...

I performed the task again. The pool was not destroyed, as I can not
import it anymore. Instead I deleted all the partitions, destroyed the
GPT scheme, and then recreated the scheme as well as the partitions.

After the "receive" finished after 10 hours, I exported the backup pool
and the newly created pool. Now I'm trying to import the new pool again
- and I'm stuck again.

This crap is stuck, really stuck. I can not

- kill the process
- shutdown the server(!)

Yes, shutting down the server is stuck forever. It doesn't go down, and
I now wait days for the box to go down due to ZFS.

> 
> 
> > I "received" a snapshot that had been taken and sent to another
> > storage array, after the newly created pool showed no signs of
> > illness or corruption.
> > 
> > After ~10 hours of receiving the backup, I exported that pool
> > along with the backup pool, destroyed the appropriate .nop device
> > entries via 
> > 
> > gnop destroy gpt/disk0[0-3]
> > 
> > and the same for cache and log and tried to check via 
> > 
> > zpool import
> > 
> > whether my pool (as well as the backup pool) shows up. And here the
> > nasty mess starts!
> > 
> > The "zpool import" command issued on console is now stuck for hours
> > and can not be interrupted via Ctrl-C! No pool shows up! Hitting
> > Ctrl-T shows a state like
> > 
> > ... cmd: zpool 4317 [zio->io_cv]: 7345.34r 0.00 [...]
> > 
> > Looking with 
> > 
> > systat -vm 1
> > 
> > at the throughput of the CAM devices I realise that two of the four
> > RAIDZ-comprising drives show activity, with 7000 - 8000 tps and
> > ~ 30 MB/s bandwidth - the other two zero!
> > 
> > And the pool is still inactive, the console is stuck.
> > 
> > Well, this made my day! At this point, I try to understand what's
> > going wrong and try to recall what I did differently the last time,
> > when the same procedure with three disks on the same hardware worked
> > for me.
> > 
> > Now, after a 10 hour copy orgy and with the need for a working
> > array, I start believing that ZFS is still peppered with too many
> > development-like flaws, rendering it risky on FreeBSD. Colleagues I
> > consulted who work with ZFS on Solaris never saw the kind of stuck
> > behaviour I am seeing at this moment.
> 
> While we only run 8.3-RELEASE currently, as we've decided to skip 9.X
> and move straight to 10 once we've tested, we've found ZFS is not only
> very stable but it has now become critical to the way we run things.

I have had no problems with the pool during 10.0-CURRENT, except that it
reported some issues with the block sizes and warnings about degraded
performance.

After I followed the steps of block aligning for 4k, the RAIDZ also
worked. It stopped working when the fourth HDD was introduced.
The system doesn't report any issues with the hard drive and the disk
itself is healthy.

> 
> > I do not want to repeat the procedure again. There must be a
> > possibility to import the pool - even the backup pool, which is
> > working, untouched by the work, should be able to import - but it
> > doesn't. If I address that pool, while this crap "zpool import"
> > command is still blocking the console, not willing to die even with
> > "killall -9 zpool", I can not import the backup pool via "zpool
> > import BACKUP00". The console gets stuck immediately and for
> > eternity without any notice. Hitting Ctrl-T says something like 
> > 
> > load: 3.59  cmd: zpool 46199 [spa_namespace_lock] 839.18r 0.00u
> > 0.00s 0% 3036k
> > 
> > which means I can not even import the backup facility and this means
> > really no fun.
> 
> I'm not sure there's enough information here to determine where any
> issue may lie, but as a guess it could be that ZFS is having issues
> locating the one changed device and is scanning the entire disk to
> try and determine that. This would explain the IO on the one device
> but not the others.

If there are issues, what kind of issues? I would expect this if there
is a possibility to add a drive "on the fly". 

I was told I have to destroy and recreate the pool (RAIDZ) when I add
another drive. I did so. The pool was newly created and received an
earlier snapshot from a backup device via 

zfs receive -vdF POOL00 < /path/to/backup.zfs

> 
> Did you per-chance have one of the disks in use for something else
> and hence it may have old label information in it that wasn't cleaned
> down?

No. The disks are entirely ZFS only. One partition per disk.

If old labels are there, then this must be considered a bug or design
flaw. I deleted everything with the command line tools I have (I do not
like this hoodoo-voodoo crap with dd ...). I can not destroy a pool that
isn't imported.

What I see now, for the third time in a row, is that I can neither
reboot the box (shutdown gets stuck forever) nor kill the "zpool
import" job (which I ran to see what pools are available after I had
exported the newly created and "received back" pool).

What I see is that two drives are busy, as reported earlier - and both
of those drives belong to the former pool. The new drive isn't involved:

[...]
Disks  ada0  ada1  ada2  ada3  ada4  ada5  ada6           intrn   718 cpu1:timer
KB/t   0.00  0.00 32.00  0.00  0.00  4.00  4.00   2906400 wire    719 cpu2:timer
tps       0     0     6     0     0  7286  7287     20492 act    1135 cpu3:timer
MB/s   0.00  0.00  0.19  0.00  0.00 28.46 28.47    141320 inact
%busy     0     0     0     0     0    42    42           cache
                                                 12944592 free
                                                  1372736 buf

ada3-ada6 are the RAIDZ comprising HDDs, ada0 is the cache/ZIL, ada1
and ada2 are backup/system.
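
As a sanity check on the systat numbers above, the MB/s row should be roughly
the tps row times the KB/t row. A minimal Python sketch of that arithmetic,
assuming systat counts 1 MB as 1024 KB and keeping in mind that the displayed
KB/t is rounded (the helper name is hypothetical):

```python
def mb_per_s(tps, kb_per_transfer):
    """Bandwidth in MB/s from transfers per second and KB per transfer (1 MB = 1024 KB)."""
    return tps * kb_per_transfer / 1024.0

# Values copied from the tps and KB/t rows for ada5 and ada6 above.
for name, tps, kbt in (("ada5", 7286, 4.00), ("ada6", 7287, 4.00)):
    print(f"{name}: {mb_per_s(tps, kbt):.2f} MB/s")
```

The results land within rounding distance of the reported 28.46 and 28.47
MB/s, so the load on the two busy drives consists of a steady stream of
roughly 4 KB transfers.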

<SAMSUNG SSD 830 Series CXM03B1Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD40EZRX-00SPEB0 80.00A80>    at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG SSD 830 Series CXM03B1Q>  at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD30EFRX-68EUZN0 80.00A80>    at scbus3 target 0 lun 0 (ada3,pass3)
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus4 target 0 lun 0 (ada4,pass4)
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus5 target 0 lun 0 (ada5,pass5)
<WDC WD30EZRX-00DC0B0 80.00A80>    at scbus6 target 0 lun 0 (ada6,pass6)

root at gate [src] zpool status
load: 0.20  cmd: zpool 4944 [spa_namespace_lock] 4.29r 0.00u 0.00s 0%
2572k

I can not even interrupt this command. Everything that touches ZFS right
now locks up the system/console completely! How do I successfully kill
this command? I need control back. This is weird and unacceptable
behaviour.

The most frustrating thing is this total blockade of everything
regarding ZFS. This one(!) pool prevents me, in a single-threaded
giant-lock manner, from importing, checking on or viewing other pools.
This is an absolute no-go for me on a server system. As I reported, not
even a shutdown is possible.

> 
>     Regards
>     Steve

Thanks anyway,

oh
