ZFS and 2 TB disk drive technology :-(

Paul Kraus paul at kraus-haus.org
Tue Sep 30 02:30:01 UTC 2014


On 9/28/14 6:30, Scott Bennett wrote:
>       On Wed, 24 Sep 2014 11:24:35 -0400 Paul Kraus <paul at kraus-haus.org>
> wrote:
>       Thanks for chiming in, Paul.
>
>> On 9/24/14 7:08, Scott Bennett wrote:
>>
>> <snip>
>>
>> What version of FreeBSD are you running ?
>
> FreeBSD hellas 9.2-STABLE FreeBSD 9.2-STABLE #1 r264339: Fri Apr 11 05:16:25 CDT 2014     bennett at hellas:/usr/obj/usr/src/sys/hellas  i386

I asked this specifically because I have seen lots of issues with hard 
drives connected via USB on 9.x. In some cases the system hangs 
(silently, with no evidence after a hard reboot); in other cases there 
was just flaky I/O to the drives that caused performance issues. And 
_none_ of the attached drives were over 1TB. There were three different 
1TB drives (one Iomega and two Seagate) and one 500GB Seagate drive in a 
Gigaware enclosure. These were on three different systems (two 
SuperMicro dual Quad-Xeon boxes and one HP ProLiant MicroServer N36L). 
These were all USB 2; I would expect more problems, and weirder ones, 
with USB 3 since it tries to go much faster.

Skipping lots ...

>       Okay, laying aside the question of why no drive out of four in a mirror
> vdev can provide the correct data, so that's why a rebuild wouldn't work.
> Couldn't it at least give a clue about drive(s) to be replaced/repaired?
> I.e., the drive(s) and sector number(s)?  Otherwise, one would spend a lot
> of time reloading data without knowing whether a failure at the same place(s)
> would just happen again.

You can probably dig that out of the zpool using zdb, but I am no zdb 
expert and refer you to the experts on the ZFS list (find out how to 
subscribe at the bottom of the page at 
http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists ).
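
A hedged sketch of where I would start (the pool and dataset names below 
are placeholders, and the zdb flags are from memory, so verify them 
against the zdb man page on your release): zpool status -v names the 
files hit by permanent errors, and zdb can then dump the block pointers, 
whose DVAs encode the vdev and offset, for the objects under those files.

zpool status -v yourpool       # list files affected by permanent (uncorrectable) errors
zdb -c yourpool                # traverse the pool, verifying metadata checksums (-cc checks data blocks too)
zdb -ddddd yourpool/dataset    # dump object and block-pointer detail (DVAs give vdev + offset)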

<snip>

>> As an anecdotal note, I have not had terribly good luck with USB
>> attached drives under FreeBSD, especially under 9.x. I suspect that the
>> USB stack just can't keep up and ends up dropping things (or hanging). I
>> have had better luck with the 10.x release but still do not trust it for
>> high traffic loads. I have had no issues with SAS or SATA interfaces
>
>       Okay.  I'll keep that in mind for the future, but for now I'm stuck
> with 9.2 until I can get some stable disk space to work with to do the
> upgrades to amd64 and then to later releases.  The way things have been
> going, I may have to relegate at least four 2 TB drives to paperweight
> supply and then wait until I can replace them with smaller capacity drives
> that will actually work.  Also, I have four 2 TB drives in external cases
> that have only USB 3.0 interfaces on them, so I have no other way to
> connect them (except USB 2.0, of course), so I'm stuck with (some) USB,
> too.

While I have had nothing but trouble every single time I tried using a 
USB attached drive for more than a few MB at a time under 9.x, I have 
had no problems with Marvell based JBOD SATA cards under 9.x or 10.0.

>
>> (using supported chipsets, I have had very good luck with any of the
>> Marvell JBOD SATA controllers), _except_ when I was using a SATA port
>> multiplier. Over on the ZFS list the consensus is that port multipliers
>> are problematic at best and they should be avoided.
>
>       What kinds of problems did they mention?  Also, how are those Marvell
> controllers connected to your system(s)?  I'm just wondering whether
> I would be able to use any of those models of controllers.  I've not dealt
> with SATA port multipliers.  Would an eSATA card with two ports on it be
> classed as a port multiplier?

I do not recall specifics, but I do recall a variety of issues, mostly 
around SATA bus resets. The problem _I_ had was that if one of the four 
drives in the enclosure (behind the port multiplier) failed, it knocked 
all four off-line.

The cards were PCIe x1 and PCIe x2. All of the Marvell cards I have seen 
have been one logical port per physical port. The chipsets in the add-on 
cards seem to come in sets of 4 ports (although the on-board chipsets 
seem to come in sets of 6). I currently have one 4-port card (2 internal, 
2 external) and one 8-port card (4 internal, 4 external) with no 
problems. They are in an HP ProLiant MicroServer N54L with 16 GB RAM.

Here is the series of cards that I have been using: 
http://www.sybausa.com/productList.php?cid=142&currentPage=0
Specifically the SI-PEX40072 and SI-PEX40065; stay away from the RAID 
versions and just go for the JBOD ones. The Marvell JBOD chips were 
recommended over on the ZFS list.

>       At the moment, all of my ZFS devices are connected by either USB 3.0
> or Firewire 400.

Are the USB drives directly attached or via hubs ? The hubs may be 
introducing more errors (I have not had good luck finding USB hubs that 
are reliable in transferring data on my Mac, and I have never tried 
external hubs on my servers).

>  I now have an eSATA card with two ports on it that I
> plan to install at some point, which will let me move the Firewire 400
> drive to eSATA.  Should I expect any new problem for that drive after the
> change?

I would expect a decrease in error counts for the eSATA attached drive.

I have been running LOTS of scrubs against my ~1.6TB of data recently, 
both on a 2-way 2-column mirror of 2TB drives and on the current config 
of 2-way 3-column mirrors of 1TB drives. I have seen no errors of any kind.
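
For reference, kicking a scrub off and checking on it is just the two 
commands below (KrausHaus is the pool shown in the status output that 
follows; the bare `zpool status` in that output reports on every 
imported pool):

zpool scrub KrausHaus      # start a scrub of the pool
zpool status KrausHaus     # show scrub progress and any errors found so far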

[ppk at FreeBSD2 ~]$ zpool status
   pool: KrausHaus
  state: ONLINE
   scan: scrub repaired 0 in 1h49m with 0 errors on Fri Sep 26 13:51:21 2014
config:

        NAME                             STATE     READ WRITE CKSUM
        KrausHaus                        ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            diskid/DISK-Seagate ES.3     ONLINE       0     0     0
            diskid/DISK-Seagate ES.2     ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            diskid/DISK-WD-SE            ONLINE       0     0     0
            diskid/DISK-HGST UltraStar   ONLINE       0     0     0
          mirror-2                       ONLINE       0     0     0
            diskid/DISK-WD-SE            ONLINE       0     0     0
            diskid/DISK-Seagate ES.2     ONLINE       0     0     0
        spares
          diskid/DISK-Seagate ES.3       AVAIL

errors: No known data errors

Note: Disk serial numbers replaced with type of drive. All are 1TB.

<snip>

>>
>> It sounds like you are really pushing this system to do more than it
>> reasonably can. In a situation like this you should really not be doing
>> anything else at the same time given that you are already pushing what
>> the system can do.
>>
>       It seems to me that the only places that could fail to keep up would
> be the motherboard's chip(set) or one of the controller cards.  The
> motherboard controller knows the speed of the memory, so it will only
> cycle the memory at that speed.  The CPU, of course, should be at a lower
> priority for bus cycles, so it would just use whatever were left over.  There
> is no overclocking involved, so that is not an issue here.  The machine goes
> as fast as it goes and no faster.  If it takes longer for it to complete a
> task, then that's how long it takes.  I don't see that "pushing this system
> to do more than it reasonably can" is even possible for me to do.  It does
> what it does, and it does it when it gets to it.  Would I like it to do
> things faster?  Of course, I would, but what I want does not change physics.
> I'm not getting any machine check or overrun messages, either.

So you deny that race conditions can exist in a system as complex as a 
modern computer running a modern OS ? The OS is an integral part of all 
of this, including all the myriad device drivers, and with multiple CPUs 
the problem may be even worse.

>       Further, because one of the drives is limited to 50 MB/s (Firewire 400)
> transfer rates, ZFS really can't go any faster than that drive.  Most of the
> time, a systat vmstat display during the scrubs showed the MB/s actually
> transferred for all four drives as being about the same (~23 - ~35 MB/s).

What does `iostat -x -w 1` show ? How many drives are at 100 %b ? How 
many drives have a qlen of 10 ? For how many samples in a row ? That is 
the limit of what ZFS will dispatch: once there are 10 outstanding I/O 
requests for a given device, ZFS does not dispatch more I/O requests 
until the qlen drops below 10. This is tunable (look through sysctl -a | 
grep vfs.zfs). On my system with the port multiplier I had to tune this 
down to 4 (found empirically) or I would see underlying SATA device 
errors and retries.
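
As a rough sketch of what that tuning looks like (vfs.zfs.vdev.max_pending 
is the older name and an assumption on my part; the exact knob varies 
with the ZFS version, so check what your system actually exposes first):

sysctl -a | grep vfs.zfs.vdev        # see which queue-depth tunables this ZFS version has
sysctl vfs.zfs.vdev.max_pending      # current per-vdev outstanding I/O limit (assumed name)
sysctl vfs.zfs.vdev.max_pending=4    # lower it at runtime
# add the same setting to /boot/loader.conf to have it survive a reboot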

I find it useful to look at 1-second as well as 10-second samples (to 
see both the peak load on the drives and a more averaged view).

Here is my system with a scrub running on the above zpool and a 
10-second sample time (iostat -x -w 10):

                         extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     783.8   3.0 97834.2    15.8   10  10.4  84
ada1     792.8   3.0 98649.5    15.8    4   3.7  49
ada2     789.9   3.0 98457.0    15.8    4   3.6  47
ada3       0.1  13.1     0.0    59.0    0   6.1   0
ada4       0.8  13.1     0.4    59.0    0   5.8   0
ada5     794.0   3.0 98703.7    15.8    0   4.1  62
ada6     785.9   3.0 98158.3    15.8   10  11.2  98
ada7       0.0   0.0     0.0     0.0    0   0.0   0
ada8     791.4   3.0 98458.2    15.8    0   3.0  53

In the above, ada0 and ada6 have hit their outstanding I/O limit (in 
ZFS); both are slower than the others, with both longer service times 
(svc_t) and higher % busy (%b). These are the oldest drives in the 
zpool, Seagate ES.2 series, and are 5 years old (and just out of 
warranty), so it is not surprising that they are the slowest. They are 
the limiting factor on how fast the scrub can progress.

>       The scrubs took from 5% to 25% of one core's time,

Because they are limited by the I/O stack between the kernel and the device.

> and associated
> kernel functions took from 0% to ~9% (combined) from other cores.  cmp(1)
> took 25% - 35% of one core with associated kernel functions taking 5% - 15%
> (combined) from other cores.  I used cpuset(1) to keep cmp(1) from bothering
> the mprime thread I cared about the most.  (Note that mprime runs niced
> to 18, so its threads should not slow any of the testing I was doing.)  It
> really doesn't look to me like an overload situation, but I can try moving
> the three USB 3.0 drives to USB 2.0 to slow things down even further.

Do you have a way to look at errors directly on the USB bus ?
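
I do not know of a direct per-bus error counter; what I would try (a 
sketch, with da0 as a placeholder for one of the USB drives) is watching 
the kernel log for umass/CAM retries during a scrub and pulling the 
drive's own SMART error log through the USB bridge, if the bridge passes 
SMART commands through:

dmesg | egrep -i 'umass|cam status|retr'    # kernel-side errors/retries on the USB mass-storage path
smartctl -a -d sat /dev/da0                 # SMART attributes and error log via the USB-SATA bridge
                                            # (sysutils/smartmontools; -d sat assumes a SAT-capable bridge)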

>  That
> leaves still unexplained ZFS's failure to make use of multiple copies for
> error correction during the reading of a file or to fix in one scrub
> everything that was fixable.

>>>
>>> Script started on Wed Sep 17 01:37:38 2014
>>> [hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
>>
>> This is the file that ZFS told you was corrupt; all bets are off.
>>
>       There should be only one bad block because the scrubs fixed everything
> else, right?

Not necessarily; I have not looked at the ZFS code (feel free to, it is 
all open source), so I do not know for certain whether it gives up on a 
file once it finds corruption.

> And that bad block is bad on all four drives, right?

Or the I/O to all four drives was interrupted at the same TIME ... I 
have seen that before when it was a device driver stack that was having 
trouble (which is what I suspect here).

<snip>

>> The fact that you have TWO different drives from TWO different vendors
>> exhibiting the same problem (and to the same degree) makes me think that
>> the problem is NOT with the drives but elsewhere with your system. I
>> have started tracking usage and failure statistics for my personal drives
>> (currently 26 of them, but I have 4 more coming back from Seagate as
>
>       Whooweee!  That's a heap of drives!  IIRC, for a chi^2 distribution,
> 30 isn't bad for a sample size.  How many of those drives are of larger
> capacity than 1 TB?

Not really; I used to manage hundreds of drives. When I have 2 out of 4 
Seagate ES.2 1TB drives and 1 out of 2 HGST UltraStar 1TB drives fail 
under warranty, I am still not willing to say that Seagate and HGST each 
have a 50% failure rate overall ... specifically because I do not 
consider 4 (or, worse, 2) drives a statistically significant sample :-)

In terms of drive sizes, a little over 50% are 1TB or larger (not 
counting the four Seagate 1TB warranty replacement drives that arrived 
today). Of the 11 1TB drives in the sample (again not counting the ones 
that arrived today), 3 have failed under warranty (so far). Of the four 
2TB drives in the sample set, none have failed yet, but they are all 
less than a year old.

<snip>

>> The system you are trying to use ZFS on may just not be able to handle
>> the throughput (both memory and disk I/O) generated by ZFS without
>> breaking. This may NOT just be a question of amount of RAM, but of the
>> reliability of the motherboard/CPU/RAM/device interfaces when stressed.
>
>       I did do a fair amount of testing with mprime last year and found no
> problems.

From the brief research I did, it looks like mprime is a computational 
program and will test only limited portions of a system (mostly the CPU 
and RAM).
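
If you want to exercise the disk and bus path specifically, a minimal 
sketch (the device names are placeholders; point them at the drives in 
question) is to run sustained raw reads and watch the same iostat 
columns discussed above:

diskinfo -t /dev/da0                             # quick seek and transfer-rate test of one drive
dd if=/dev/da0 of=/dev/null bs=1m count=20000 &  # ~20 GB sustained sequential read in the background
iostat -x -w 10 da0                              # watch svc_t, qlen, and %b while it runs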

>  I monitor CPU temperatures frequently, especially when I'm
> running a test like the ones I've been doing, and the temperatures have
> remained reasonable throughout.  (My air-conditioning bill has not been
> similarly reasonable, I'm sorry to say.)
>       That having been said, though, between your remarks and Andrew Berg's,
> there does seem cause to run another scrub, perhaps two, with those three
> drives connected via USB 2.0 instead of USB 3.0 to see what happens when
> everything is slowed down drastically.  I'll give that a try when I find
> time.  That won't address the ZFS-related questions or the differences
> in error rates on different drives, but might reveal an underlying system
> hardware issue.
>       Maybe a PCIE2 board is too slow for USB 3.0, although the motherboard
> controller, BIOS, USB 3.0 controller, and kernel all declined to complain.
> If it is, then the eSATA card I bought (SATA II) would likely be useless
> as well. :-<
>
>> In the early days of ZFS it was noticed that ZFS stressed the CPU and
>> memory systems of a server harder than virtually any other task.
>>
>       When would that have been, please?  (I don't know much ZFS history.)
> I believe this machine dates to 2006 or more likely 2007, although the
> USB 3.0 card was new last year.  The VIA Firewire card was installed at
> the same time as the USB 3.0 card, but it was not new at that time.

That would have been the 2005-2007 timeframe. A Sun Fire V240 could be 
brought to its knees by a large ZFS copy operation: both CPUs would peg, 
and all of the memory bandwidth would be consumed by the I/O operations. 
The Sun T2000 was much better, as it had (effectively) 32 logical CPUs 
(8 cores, each with 4 execution threads), and ZFS really likes 
multiprocessor environments.

-- 
--
Paul Kraus    paul at kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/

