ZFS and 2 TB disk drive technology :-(

Paul Kraus paul at kraus-haus.org
Wed Sep 24 15:24:44 UTC 2014


On 9/24/14 7:08, Scott Bennett wrote:

<snip>

What version of FreeBSD are you running ?

What hardware are you running it on ?

>       Then I copied the 1.08 TB file again from another Seagate 2 TB drive
> to the mirror vdev.  No errors were detected during the copy.  Then I
> began creating a tar file from large parts of a nearly full 1.2 TB file
> system (UFS2) on yet another Seagate 2TB on the Firewire 400 bus with the
> tar output going to a file in the mirror in order to try to have written
> something to most of the sectors on the four-drive mirror.  I terminated
> tar after the empty space in the mirror got down to about 3% because the
> process had slowed to a crawl.  (Apparently, space allocation in ZFS
> slows down far more than UFS2 when available space gets down to the last
> few percent.:-( )

ZFS's space allocation algorithm will have trouble (performance issues)
allocating new blocks long before you get down to a few percent free. This is
known behavior and the threshold for performance degradation varies with 
work load and historical write patterns. My rule of thumb is that you 
really do not want to go past 75-80% full, but I have seen reports over 
on the ZFS list of issues with very specific write patterns and work 
load with as little as 50% used. For your work load, writing very large 
files once, I would expect that you can get close to 90% used before 
seeing real performance issues.
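
If you want to stay well clear of that regime, something like the
following (using the pool name from your output; the quota value here is
only an illustration) will show how full the pool is and cap how much of
it can be consumed:

    # zpool list testmirror
    # zfs get used,available testmirror
    # zfs set quota=1.6T testmirror

A quota on the top-level dataset is a crude but effective guard against
accidentally running the pool up past the point where allocation slows
down.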

>       Next, I ran a scrub on the mirror and, after the scrub finished, got
> the following output from a "zpool status -v".
>
>    pool: testmirror
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
> 	corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> 	entire pool from backup.
>     see: http://illumos.org/msg/ZFS-8000-8A
>    scan: scrub repaired 1.38M in 17h59m with 1 errors on Mon Sep 15 19:53:45 2014

The above means that ZFS was able to repair 1.38MB of bad data but still 
ran into one error (of unknown size) that it could not fix.

> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	testmirror  ONLINE       0     0     1
> 	  mirror-0  ONLINE       0     0     2
> 	    da1p5   ONLINE       0     0     2
> 	    da2p5   ONLINE       0     0     2
> 	    da5p5   ONLINE       0     0     8
> 	    da7p5   ONLINE       0     0     7
>
> errors: Permanent errors have been detected in the following files:
>
>          /backups/testmirror/backups.s2A

And here is the file that contains the bad data.

>
>       Note that the choices of recommended action above do *not* include
> replacing a bad drive and having ZFS rebuild its content on the
> replacement.  Why is that so?

Correct, because for some reason ZFS was not able to read enough of the 
data without checksum errors to give you back your data intact.

>       Thinking, apparently naively, that the scrub had repaired some or
> most of the errors

It did, 1.38MB worth. But it also had errors it could not repair.

> and wanting to know which drives had ended up with
> permanent errors, I did a "zpool clear testmirror" and ran another scrub.
> During this scrub, I got some kernel messages on the console:
>
> (da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
> (da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
> (da7:umass-sim5:5:0:0): Retrying command
> (da7:umass-sim5:5:0:0): WRITE(10). CDB: 2a 00 3b 20 4d 36 00 00 05 00
> (da7:umass-sim5:5:0:0): CAM status: CCB request completed with an error
> (da7:umass-sim5:5:0:0): Retrying command

How many device errors have you had since booting the system / creating 
the zpool ?
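
A rough way to answer that (assuming errors are going to the default log,
and that you have sysutils/smartmontools installed for the second command)
is:

    # grep -c 'CAM status' /var/log/messages
    # smartctl -a /dev/da7

The first counts the CAM complaints that have been logged, the second
pulls the drive's own error counters and defect lists. With some USB
enclosures you may need to add '-d sat' for smartctl to reach the drive.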

> I don't know how to decipher these error messages (i.e., what do the hex
> digits after "CDB: " mean?)

I do not know the specifics in this case, but whenever I have seen 
device errors it has always been due to either bad communication with a 
drive or a drive reporting an error. If there are ANY device errors you 
must address them before you go any further.
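
(For what it is worth, a WRITE(10) CDB is ten bytes: the 0x2a opcode, a
flags byte, a 4-byte starting LBA, a group byte, a 2-byte transfer length,
and a control byte. If I am decoding your quoted bytes correctly, that
command was a 5-block write starting at LBA 0x3b204d36. The part that
matters, though, is the "CCB request completed with an error" status, not
the CDB contents.)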

As an anecdotal note, I have not had terribly good luck with USB 
attached drives under FreeBSD, especially under 9.x. I suspect that the 
USB stack just can't keep up and ends up dropping things (or hanging). I 
have had better luck with the 10.x release but still do not trust it for 
high traffic loads. I have had no issues with SAS or SATA interfaces 
(using supported chipsets, I have had very good luck with any of the 
Marvell JBOD SATA controllers), _except_ when I was using a SATA port 
multiplier. Over on the ZFS list the consensus is that port multipliers 
are problematic at best and they should be avoided.
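
If you want to see exactly how each da device is attached (which
controller, bridge, or hub it sits behind), something like:

    # camcontrol devlist -v
    # usbconfig

should show the topology and make it easier to tell whether the
misbehaving drives share a common bus or controller.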

> When it had finished, another "zpool status
>   -v" showed these results.
>
>    pool: testmirror
>   state: ONLINE
> status: One or more devices has experienced an error resulting in data
> 	corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
> 	entire pool from backup.
>     see: http://illumos.org/msg/ZFS-8000-8A
>    scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014

This time it fixed 1.25MB of data and still had an error (of unknown 
size) that it could not fix.

> config:
>
> 	NAME        STATE     READ WRITE CKSUM
> 	testmirror  ONLINE       0     0     1
> 	  mirror-0  ONLINE       0     0     2
> 	    da1p5   ONLINE       0     0     2
> 	    da2p5   ONLINE       0     0     2
> 	    da5p5   ONLINE       0     0     6
> 	    da7p5   ONLINE       0     0     8

Once again you have errors on ALL your devices. This points to a 
systemic problem of some sort on your system. On the ZFS list people 
have reported bad memory as sometimes being the cause of these errors. I 
would look for a system component that is common to all the drives and 
controllers. How healthy is your power supply ? How close to its limits 
are you ?
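
If you can afford the downtime, two checks worth running: boot a
memtest86+ (or similar) image and let it run several passes to rule out
RAM, and start a long SMART self-test on each drive, e.g.:

    # smartctl -t long /dev/da1

then read the result later with 'smartctl -l selftest /dev/da1'. Neither
is conclusive, but a failure in either narrows things down considerably.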

>
> errors: Permanent errors have been detected in the following files:
>
>          /backups/testmirror/backups.s2A
>
>       So it is not clear to me that either scrub fixed *any* errors at
> all.

Why is it not clear? The message from zpool status is very clear:

scan: scrub repaired 1.25M in 18h4m with 1 errors on Tue Sep 16 15:02:56 2014

There were errors that were repaired and an error that was not.
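
Also keep in mind that 'zpool clear' only resets the error counters; it
does not repair anything. The sequence that actually re-verifies the data
is along the lines of:

    # zpool clear testmirror
    # zpool scrub testmirror
    # zpool status -v testmirror    (after the scrub completes)

and any file still listed under "Permanent errors" at that point could
not be reconstructed from any of the copies ZFS has.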

>  I next ran a comparison ("cmp -z -l") of the original against the
> copy

If you are comparing the file that ZFS reported was corrupt, then you 
should not expect them to match.

> now on the mirror, which found these differences before cmp(1) was
> terminated because the vm_pager got an error while trying to read in a
> block from the mirror vdev.  (The cpuset stuff was to prevent cmp(1)
> from interfering too much with another ongoing, but unrelated, process.)

It sounds like you are pushing this system to do more than it reasonably 
can. In a situation like this you really should not be running anything 
else at the same time, given how heavily loaded the system already is.

>
> Script started on Wed Sep 17 01:37:38 2014
> [hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A

This is the file that ZFS told you was corrupt, so all bets are off.

<snip>

>       Another issue revealed above is that ZFS, in spite of having *four*
> copies of the data and checksums of them, failed to detect any problem
> while reading the data back for cmp(1), much less feed cmp(1) the correct
> version of the data rather than a corrupted version.

ZFS told you that file was corrupt. You are choosing to try to read it. 
ZFS used to not even let you try to access a corrupt file but that 
behavior was changed to permit people to try to salvage what they could 
instead of writing it all off.
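
If you want to salvage what you can from that file, something along these
lines (the destination path is just a placeholder, and the block size is
arbitrary) will copy everything that is still readable and zero-fill the
ranges ZFS refuses to return:

    # dd if=/backups/testmirror/backups.s2A of=/elsewhere/backups.s2A \
          bs=1m conv=noerror,sync

Just remember that the result is known to differ from the original in at
least the damaged range.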

> Similarly, the hard
> error (not otherwise logged by the kernel) apparently encountered by
> vm_pager resulted in termination of cmp(1) rather than resulting in ZFS
> reading the page from one of the other three drives.  I don't see how ZFS
> is of much help here, so I guess I must have misunderstood the claims for
> ZFS that I've read on this list and in the available materials on-line.

I suggest that you are ignoring what ZFS is telling you: your system is 
incapable of reliably writing to and reading from _any_ of the four 
drives you are trying to use, there is a corrupt file as a result, and 
it gave you the name of that corrupt file.

Until you fix the underlying issues with your system, ZFS (or any FS for 
that matter) will not be of much use to you.

>       I don't know where to turn next.  I will try to call Seagate/Samsung
> later today again about the bad Samsung drive and the bad, refurbished
> Seagate drive, but they already told me once that having a couple of kB
> of errors in a ~1.08 TB file copy does not mean that the drive is bad.
> I don't know whether they will consider a hard write error to mean the
> drive is bad.  The kernel messages shown above are the first ones I've
> gotten about any of the drives involved in the copy operation or the
> tests described above.

The fact that you have TWO different drives from TWO different vendors 
exhibiting the same problem (and to the same degree) makes me think that 
the problem is NOT with the drives but elsewhere with your system. I 
have started tracking usage and failure statistics for my personal drives 
(currently 26 of them, but I have 4 more coming back from Seagate as 
warranty replacements). I know that I do not have a statistically 
significant sample, but it is what I have to work with. Taking into 
account the drives I have as well as the hundreds of drives I managed at 
a past client, I have never seen the kind of bad data failures you are 
seeing UNLESS I had another underlying problem. Especially when the 
problem appears on multiple drives. I suspect that the real odds of 
having the same type of bad data failure on TWO drives in this case are 
so small that another cause needs to be identified.

>       If anyone reading this has any suggestions for a course of action
> here, I'd be most interested in reading them.  Thanks in advance for any
> ideas and also for any corrections if I've misunderstood what a ZFS
> mirror was supposed to have done to preserve the data and maintain
> correct operation at the application level.

The system you are trying to use ZFS on may just not be able to handle 
the throughput (both memory and disk I/O) generated by ZFS without 
breaking. This may NOT just be a question of amount of RAM, but of the 
reliability of the motherboard/CPU/RAM/device interfaces when stressed. 
In the early days of ZFS it was noticed that ZFS stressed the CPU and 
memory systems of a server harder than virtually any other task.
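
One way to see whether the I/O path is what is falling over is to watch
it while a scrub or one of your big copies is running, for example in two
terminals:

    # zpool iostat -v testmirror 5
    # gstat -f 'da[1257]'

If one drive's queue or latency goes through the roof while the others
stay busy, that points at the bus or controller behind that drive rather
than at ZFS itself.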

-- 
Paul Kraus    paul at kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/

