ZFS and 2 TB disk drive technology :-(

Scott Bennett bennett at sdf.org
Wed Sep 24 11:08:16 UTC 2014


     I've now done some testing with ZFS on four of the five drives
that I currently have ready to put into use for a raidz2 cluster.  In
the process, I've found that some of the recommendations made for
setting various kernel variables in /boot/loader.conf don't work as
advertised, at least not on i386.  To the best of my memory, setting
vfs.zfs.arc_max or vm.kmem_size results in a panic in very short order.
Second, setting vm.kmem_size_max works, but only if the value does not
exceed 512 MB.  512 MB does, however, seem to be sufficient to eliminate
the ZFS kernel module's initialization warning that says to expect
unstable behavior, so that problem appears to have been resolved.
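     For reference, the only tunable that ended up staying in my
/boot/loader.conf is along these lines (the value shown is the largest
that worked here; anything larger paniced):

	vm.kmem_size_max="512M"
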
     I created a four-way mirror vdev from the following four drives
(the creation command is sketched after the list).

	da1	WD 2TB drive (new, in old "MyBook" case with USB 2.0,
		Firewire 400, and eSATA interfaces, connected via
		Firewire 400)

	da2	Seagate 2TB drive (refurbished and seems to work
		tolerably well, in old Backups Plus case with USB 3.0
		interface)

	da5	Seagate 2TB drive (refurbished, already shown to get
		between 1900 and 2000 bytes in error on a 1.08 TB file
		copy, in old Backups Plus case with USB 3.0 interface)

	da7	Samsung 2TB drive (Samsung D3 Station, new in June,
		already shown to get between 1900 and 2000 bytes in
		error on a 1.08 TB file copy, with USB 3.0 interface)

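     The pool creation was along these lines (a sketch; "testmirror" and
the p5 partitions match the status output below, though I may have given
additional options at the time):

	zpool create testmirror mirror da1p5 da2p5 da5p5 da7p5
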
     Then I copied the 1.08 TB file again from another Seagate 2 TB drive
to the mirror vdev.  No errors were detected during the copy.  Next, to
get something written to most of the sectors on the four-drive mirror, I
began creating a tar file from large parts of a nearly full 1.2 TB UFS2
file system on yet another Seagate 2 TB drive on the Firewire 400 bus,
with the tar output going to a file on the mirror.  I terminated tar
after the free space on the mirror got down to about 3% because the
process had slowed to a crawl.  (Apparently, space allocation in ZFS
slows down far more than in UFS2 when available space gets down to the
last few percent. :-( )
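     The tar run was roughly of this form (the paths here are only
illustrative; the source was the UFS2 file system on the Firewire drive,
and the output file lived on the mirror):

	cd /ufs2/mountpoint		# illustrative mount point
	tar -cf /backups/testmirror/ufs2dump.tar ./some/large/directories
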
     Next, I ran a scrub on the mirror and, after the scrub finished, got
the following output from a "zpool status -v".
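For reference, the commands involved were just:

	zpool scrub testmirror
	zpool status -v testmirror	# after the scrub completed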

  pool: testmirror
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014
config:

	NAME        STATE     READ WRITE CKSUM
	testmirror  ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    da1p5   ONLINE       0     0     2
	    da2p5   ONLINE       0     0     2
	    da5p5   ONLINE       0     0     8
	    da7p5   ONLINE       0     0     7

errors: Permanent errors have been detected in the following files:

        /backups/testmirror/backups.s2A

     Note that the choices of recommended action above do *not* include
replacing a bad drive and having ZFS rebuild its content on the
replacement.  Why is that so?
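     (For comparison, replacing a failing member of the mirror would
normally be done with something along these lines; "daXpY" is just a
placeholder for whatever the replacement partition would be, since I
have not actually tried this:)

	zpool replace testmirror da5p5 daXpY
	zpool status testmirror		# watch the resilver progress
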
     Thinking, apparently naively, that the scrub had repaired some or
most of the errors and wanting to know which drives had ended up with
permanent errors, I did a "zpool clear testmirror" and ran another scrub.
During this scrub, I got some kernel messages on the console:

(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command
(da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
(da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
(da7:umass-sim5:5:0:0): Retrying command

I don't know how to decipher these error messages (i.e., what do the hex
digits after "CDB: " mean?).  When the second scrub had finished, another
"zpool status -v" showed these results.

  pool: testmirror
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014
config:

	NAME        STATE     READ WRITE CKSUM
	testmirror  ONLINE       0     0     1
	  mirror-0  ONLINE       0     0     2
	    da1p5   ONLINE       0     0     2
	    da2p5   ONLINE       0     0     2
	    da5p5   ONLINE       0     0     6
	    da7p5   ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        /backups/testmirror/backups.s2A

     So it is not clear to me that either scrub fixed *any* errors at
all.  I next ran a comparison ("cmp -z -l") of the original against the
copy now on the mirror.  It found the differences shown below before
cmp(1) was terminated because the vm_pager got an error while trying to
read in a block from the mirror vdev.  (The cpuset and nice settings
were there to keep cmp(1) from interfering too much with another
ongoing, but unrelated, process.)
Script started on Wed Sep 17 01:37:38 2014
[hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
8169610513 164 124
71816953105 344 304
121604893969 273 233
160321633553 170 130
388494183697  42   2
488384007441 266 226
574339165457 141 101
662115138833 145 105
683519290641 157 117
683546029329  60  20
cmp: Input/output error (caught SIGSEGV)
4144.600u 3948.457s 8:08:08.33 27.6%	15+-393k 5257820+0io 10430953pf+0w
[hellas] 104 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
6022126866 164 124
69669469458 344 304
119457410322 273 233
158174149906 170 130
386346700050  42   2
486236523794 266 226
572191681810 141 101
659967655186 145 105
681371806994 157 117
681398545682  60  20
cmp: Input/output error (caught SIGSEGV)
4132.551u 4003.112s 8:13:20.95 27.4%	15+-345k 5241297+0io 10560652pf+0w
[hellas] 105 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
8169610513 164 124
71816953105 344 304
121604893969 273 233
160321633553 170 130
388494183697  42   2
488384007441 266 226
574339165457 141 101
662115138833 145 105
683519290641 157 117
683546029329  60  20
cmp: Input/output error (caught SIGSEGV)
4136.621u 3977.459s 8:07:43.85 27.7%	15+-378k 5257810+0io 10430951pf+0w
[hellas] 106 % 

As you can see, the hard error seems to be quite consistent.  Also, the
bytes found to differ before termination all differ by a single bit that
is on in the original and off in the copy, and it is always the same bit
of the byte.
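     (A quick check of the byte values cmp printed bears that out:
XORing each original/copy pair of octal values gives 040 every time,
i.e., only bit 5 changed.  A throwaway /bin/sh loop over the values
above shows it:)

	#!/bin/sh
	# XOR each original/copy byte pair reported by cmp -l above.
	for pair in "164 124" "344 304" "273 233" "170 130" "42 2" \
	            "266 226" "141 101" "145 105" "157 117" "60 20"; do
		set -- $pair
		printf '0%s ^ 0%s = 0%o\n' "$1" "$2" "$((0$1 ^ 0$2))"
	done
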
     Another issue revealed above is that ZFS, in spite of having *four*
copies of the data and checksums of them, failed to detect any problem
while reading the data back for cmp(1), much less to feed cmp(1) the
correct version of the data rather than a corrupted one.  Similarly, the
hard error (not otherwise logged by the kernel) apparently encountered by
vm_pager resulted in the termination of cmp(1) rather than in ZFS reading
the page from one of the other three drives.  I don't see how ZFS is of
much help here, so I guess I must have misunderstood the claims for ZFS
that I've read on this list and in the available materials on-line.
     I don't know where to turn next.  I will try calling Seagate/Samsung
again later today about the bad Samsung drive and the bad, refurbished
Seagate drive, but they already told me once that having a couple of kB
of errors in a ~1.08 TB file copy does not mean that a drive is bad.
I don't know whether they will consider a hard write error to mean the
drive is bad.  The kernel messages shown above are the first ones I've
gotten about any of the drives involved in the copy operation or the
tests described above.
     If anyone reading this has any suggestions for a course of action
here, I'd be most interested in reading them.  Thanks in advance for any
ideas and also for any corrections if I've misunderstood what a ZFS
mirror was supposed to have done to preserve the data and maintain
correct operation at the application level.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************

