ZFS and 2 TB disk drive technology :-(
Paul Kraus
paul at kraus-haus.org
Tue Sep 30 02:30:01 UTC 2014
On 9/28/14 6:30, Scott Bennett wrote:
> On Wed, 24 Sep 2014 11:24:35 -0400 Paul Kraus <paul at kraus-haus.org>
> wrote:
> Thanks for chiming in, Paul.
>
>> On 9/24/14 7:08, Scott Bennett wrote:
>>
>> <snip>
>>
>> What version of FreeBSD are you running ?
>
> FreeBSD hellas 9.2-STABLE FreeBSD 9.2-STABLE #1 r264339: Fri Apr 11 05:16:25 CDT 2014 bennett at hellas:/usr/obj/usr/src/sys/hellas i386
I asked this specifically because I have seen lots of issues with hard
drives connected via USB on 9.x. In some cases the system hung
(silently, with no evidence after a hard reboot); in other cases there was
just flaky I/O to the drives that caused performance issues. And _none_ of
the attached drives were over 1TB. There were three different 1TB drives
(one Iomega and two Seagate) and one 500GB Seagate drive in a Gigaware
enclosure. These were on three different systems (two SuperMicro systems
with dual quad-core Xeon CPUs and one HP ProLiant MicroServer N36L). These
were all USB 2.0; I would expect more, and weirder, problems with USB 3.0
as it (tries to) go much faster.
Skipping lots ...
> Okay, laying aside the question of why no drive out of four in a mirror
> vdev can provide the correct data, so that's why a rebuild wouldn't work.
> Couldn't it at least give a clue about drive(s) to be replaced/repaired?
> I.e., the drive(s) and sector number(s)? Otherwise, one would spend a lot
> of time reloading data without knowing whether a failure at the same place(s)
> would just happen again.
You can probably dig that out of the zpool using zdb, but I am no zdb
expert; I refer you to the experts on the ZFS list (see
http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists for how to
subscribe).
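As a first pass, before reaching for zdb, the verbose pool status will at
least list the files ZFS knows have unrecoverable errors, which narrows
down what has to be restored even though it does not name the drive or
sector (the pool name below is a placeholder):

	zpool status -v <poolname>     # lists files with known permanent errors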
<snip>
>> As an anecdotal note, I have not had terribly good luck with USB
>> attached drives under FreeBSD, especially under 9.x. I suspect that the
>> USB stack just can't keep up and ends up dropping things (or hanging). I
>> have had better luck with the 10.x release but still do not trust it for
>> high traffic loads. I have had no issues with SAS or SATA interfaces
>
> Okay. I'll keep that in mind for the future, but for now I'm stuck
> with 9.2 until I can get some stable disk space to work with to do the
> upgrades to amd64 and then to later releases. The way things have been
> going, I may have to relegate at least four 2 TB drives to paperweight
> supply and then wait until I can replace them with smaller capacity drives
> that will actually work. Also, I have four 2 TB drives in external cases
> that have only USB 3.0 interfaces on them, so I have no other way to
> connect them (except USB 2.0, of course), so I'm stuck with (some) USB,
> too.
While I have had nothing but trouble every single time I have tried using a
USB-attached drive for more than a few MB at a time under 9.x, I have
had no problems with Marvell-based JBOD SATA cards under 9.x or 10.0.
>
>> (using supported chipsets, I have had very good luck with any of the
>> Marvell JBOD SATA controllers), _except_ when I was using a SATA port
>> multiplier. Over on the ZFS list the consensus is that port multipliers
>> are problematic at best and they should be avoided.
>
> What kinds of problems did they mention? Also, how are those Marvell
> controllers connected to your system(s)? I'm just wondering whether
> I would be able to use any of those models of controllers. I've not dealt
> with SATA port multipliers. Would an eSATA card with two ports on it be
> classed as a port multiplier?
I do not recall specifics, but I do recall a variety of issues, mostly
around SATA bus resets. The problem _I_ had was that if one of the four
drives in the enclosure (behind the port multiplier) failed, it knocked
all four off-line.
The cards were PCIe x1 and PCIe x2. All of the Marvell cards I have seen
have been one logical port per physical port. The chipsets in the add-on
cards seem to come in sets of 4 ports (although the on-board chipsets seem
to come in sets of 6). I currently have one 4-port card (2 internal, 2
external) and one 8-port card (4 internal, 4 external) with no problems.
They are in an HP ProLiant MicroServer N54L with 16 GB RAM.
Here is the series of cards that I have been using:
http://www.sybausa.com/productList.php?cid=142&currentPage=0
Specifically the SI-PEX40072 and SI-PEX40065; stay away from the RAID
versions and just go for the JBOD ones. The Marvell JBOD chips were
recommended over on the ZFS list.
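If you do go with one of these, it is worth confirming that the card
attaches via the plain ahci(4) driver rather than a RAID personality and
that each drive shows up as its own device. Something along these lines
(device names will differ on your system):

	dmesg | grep -i ahci      # card should show up as an AHCI controller
	camcontrol devlist        # each drive should appear as its own ada device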
> At the moment, all of my ZFS devices are connected by either USB 3.0
> or Firewire 400.
Are the USB drives directly attached or via hubs? The hubs may be
introducing more errors. (I have not had good luck finding USB hubs that
are reliable at transferring data ... though that was on my Mac; I have
never tried using external hubs on my servers.)
> I now have an eSATA card with two ports on it that I
> plan to install at some point, which will let me move the Firewire 400
> drive to eSATA. Should I expect any new problem for that drive after the
> change?
I would expect a decrease in error counts for the eSATA-attached drive.
I have been running LOTS of scrubs against my ~1.6TB of data recently,
both on a 2-way, 2-column mirror of 2TB drives and on the current
configuration of a 2-way, 3-column mirror of 1TB drives. I have seen no
errors of any kind.
[ppk at FreeBSD2 ~]$ zpool status
  pool: KrausHaus
 state: ONLINE
  scan: scrub repaired 0 in 1h49m with 0 errors on Fri Sep 26 13:51:21 2014
config:

	NAME                            STATE     READ WRITE CKSUM
	KrausHaus                       ONLINE       0     0     0
	  mirror-0                      ONLINE       0     0     0
	    diskid/DISK-Seagate ES.3    ONLINE       0     0     0
	    diskid/DISK-Seagate ES.2    ONLINE       0     0     0
	  mirror-1                      ONLINE       0     0     0
	    diskid/DISK-WD-SE           ONLINE       0     0     0
	    diskid/DISK-HGST UltraStar  ONLINE       0     0     0
	  mirror-2                      ONLINE       0     0     0
	    diskid/DISK-WD-SE           ONLINE       0     0     0
	    diskid/DISK-Seagate ES.2    ONLINE       0     0     0
	spares
	  diskid/DISK-Seagate ES.3      AVAIL

errors: No known data errors
Note: Disk serial numbers replaced with type of drive. All are 1TB.
<snip>
>>
>> It sounds like you are really pushing this system to do more than it
>> reasonably can. In a situation like this you should really not be doing
>> anything else at the same time given that you are already pushing what
>> the system can do.
>>
> It seems to me that the only places that could fail to keep up would
> be the motherboard's chip(set) or one of the controller cards. The
> motherboard controller knows the speed of the memory, so it will only
> cycle the memory at that speed. The CPU, of course, should be at a lower
> priority for bus cycles, so it would just use whatever were left over. There
> is no overclocking involved, so that is not an issue here. The machine goes
> as fast as it goes and no faster. If it takes longer for it to complete a
> task, then that's how long it takes. I don't see that "pushing this system
> to do more than it reasonably can" is even possible for me to do. It does
> what it does, and it does it when it gets to it. Would I like it to do
> things faster? Of course, I would, but what I want does not change physics.
> I'm not getting any machine check or overrun messages, either.
So you deny that race conditions can exist in a system as complex as a
modern computer running a modern OS? The OS is an integral part of all
this, including all the myriad device drivers. And with multiple CPUs
the problem may be even worse.
> Further, because one of the drives is limited to 50 MB/s (Firewire 400)
> transfer rates, ZFS really can't go any faster than that drive. Most of the
> time, a systat vmstat display during the scrubs showed the MB/s actually
> transferred for all four drives as being about the same (~23 - ~35 MB/s).
What does `iostat -x -w 1` show? How many drives are at 100 %b? How
many drives have a qlen of 10? For how many samples in a row? That is
the limit of what ZFS will dispatch: once there are 10 outstanding I/O
requests for a given device, ZFS does not dispatch more I/O requests
until the qlen drops below 10. This is tunable (look through sysctl -a |
grep vfs.zfs). On my system with the port multiplier I had to tune this
down to 4 (found empirically) or I would see underlying SATA device
errors and retries.
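Something like the following should do it, though the exact OID name
depends on the FreeBSD/ZFS version (on 9.x I believe it is
vfs.zfs.vdev.max_pending; newer code replaces it with per-I/O-class
*_max_active tunables), so check what your release actually exposes:

	sysctl -a | grep vfs.zfs.vdev       # find the queue-depth tunable on your release
	sysctl vfs.zfs.vdev.max_pending=4   # drop the per-device limit from 10 to 4
	# add the same line to /boot/loader.conf to make it persistent across reboots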
I find it useful to look at 1-second as well as 10-second samples (to
see both the peak load on the drives and something closer to the average).
Here is my system with a scrub running on the above zpool and a 10-second
sample time (iostat -x -w 10):
                        extended device statistics
device     r/s   w/s     kr/s   kw/s qlen  svc_t  %b
ada0     783.8   3.0  97834.2   15.8   10   10.4  84
ada1     792.8   3.0  98649.5   15.8    4    3.7  49
ada2     789.9   3.0  98457.0   15.8    4    3.6  47
ada3       0.1  13.1      0.0   59.0    0    6.1   0
ada4       0.8  13.1      0.4   59.0    0    5.8   0
ada5     794.0   3.0  98703.7   15.8    0    4.1  62
ada6     785.9   3.0  98158.3   15.8   10   11.2  98
ada7       0.0   0.0      0.0    0.0    0    0.0   0
ada8     791.4   3.0  98458.2   15.8    0    3.0  53
In the above, ada0 and ada6 have hit their outstanding I/O limit (in
ZFS); both are slower than the others, with longer service times
(svc_t) and higher % busy (%b). These are the oldest drives in the zpool,
Seagate ES.2 series, and are 5 years old (and just out of warranty), so
it is not surprising that they are the slowest. They are the limiting
factor on how fast the scrub can progress.
> The scrubs took from 5% to 25% of one core's time,
Because they are limited by the I/O stack between the kernel and the device.
> and associated
> kernel functions took from 0% to ~9% (combined) from other cores. cmp(1)
> took 25% - 35% of one core with associated kernel functions taking 5% - 15%
> (combined) from other cores. I used cpuset(1) to keep cmp(1) from bothering
> the mprime thread I cared about the most. (Note that mprime runs niced
> to 18, so its threads should not slow any of the testing I was doing.) It
> really doesn't look to me like an overload situation, but I can try moving
> the three USB 3.0 drives to USB 2.0 to slow things down even further.
Do you have a way to look at errors directly on the USB bus?
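If not, a couple of things to try, assuming the stock FreeBSD tools
(usbus0 is just an example; match it to the bus the enclosures are on):

	usbconfig                                  # list USB devices and which usbusN each is on
	usbdump -i usbus0 -s 256 -v                # capture and decode traffic on that bus
	dmesg | egrep -i 'umass|cam status|retr'   # look for umass/CAM retries and resets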
> That
> leaves still unexplained ZFS's failure to make use of multiple copies for
> error correction during the reading of a file or to fix in one scrub
> everything that was fixable.
>>>
>>> Script started on Wed Sep 17 01:37:38 2014
>>> [hellas] 101 % time nice +12 cpuset -l 3,0 cmp -z -l /backups/s2C/save/backups.s2A /backups/testmirror/backups.s2A
>>
>> This is the file the ZFS told you was corrupt, all bets are off.
>>
> There should be only one bad block because the scrubs fixed everything
> else, right?
Not necessarily. I have not looked at the ZFS code (feel free to; it is
all open source), so I do not know for certain whether it gives up on a
file once it finds corruption.
> And that bad block is bad on all four drives, right?
Or the I/O to all four drives was interrupted at the same TIME ... I
have seen that before when it was a device driver stack that was having
trouble (which is what I suspect here).
<snip>
>> The fact that you have TWO different drives from TWO different vendors
>> exhibiting the same problem (and to the same degree) makes me think that
>> the problem is NOT with the drives but elsewhere with your system. I
>> have started tracking usage an failure statistics for my personal drives
>> (currently 26 of them, but I have 4 more coming back from Seagate as
>
> Whooweee! That's a heap of drives! IIRC, for a chi^2 distribution,
> 30 isn't bad for a sample size. How many of those drives are of larger
> capacity than 1 TB?
Not really; I used to manage hundreds of drives. When I have 2 out of 4
Seagate ES.2 1TB drives and 1 out of 2 HGST UltraStar 1TB drives fail
under warranty, I am still not willing to say that Seagate and HGST both
have an overall 50% failure rate ... specifically because I do not
consider 4 (or worse, 2) drives a statistically significant sample :-)
In terms of drive sizes, a little over 50% are 1TB or over (not counting
the 4 Seagate 1TB warranty replacement drives that arrived today). Of
the 11 1TB drives in the sample (not counting the ones that arrived
today), 3 have failed under warranty (so far). Of the four 2TB drives in
the sample set, none have failed yet, but they are all less than 1 year old.
<snip>
>> The system you are trying to use ZFS on may just not be able to handle
>> the throughput (both memory and disk I/O) generated by ZFS without
>> breaking. This may NOT just be a question of amount of RAM, but of the
>> reliability of the motherboard/CPU/RAM/device interfaces when stressed.
>
> I did do a fair amount of testing with mprime last year and found no
> problems.
From the brief research I did, it looks like mprime is a computational
program and will test only limited portions of a system (CPU and RAM
mostly).
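If you want to stress the disk I/O path as well, something along these
lines would push all of the drives at once while you watch for errors
(the da0..da3 device names are placeholders for your USB drives; it only
reads, so it is non-destructive):

	for d in da0 da1 da2 da3; do
	    dd if=/dev/$d of=/dev/null bs=1m &   # read each suspect drive end to end
	done
	gstat -I 1s          # or: iostat -x -w 1, to watch the load while it runs
	dmesg | tail         # afterwards, check for CAM/umass retries and resets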
> I monitor CPU temperatures frequently, especially when I'm
> running a test like the ones I've been doing, and the temperatures have
> remained reasonable throughout. (My air-conditioning bill has not been
> similarly reasonable, I'm sorry to say.)
> That having been said, though, between your remarks and Andrew Berg's,
> there does seem cause to run another scrub, perhaps two, with those three
> drives connected via USB 2.0 instead of USB 3.0 to see what happens when
> everything is slowed down drastically. I'll give that a try when I find
> time. That won't address the ZFS-related questions or the differences
> in error rates on different drives, but might reveal an underlying system
> hardware issue.
> Maybe a PCIE2 board is too slow for USB 3.0, although the motherboard
> controller, BIOS, USB 3.0 controller, and kernel all declined to complain.
> If it is, then the eSATA card I bought (SATA II) would likely be useless
> as well. :-<
>
>> In the early days of ZFS it was noticed that ZFS stressed the CPU and
>> memory systems of a server harder than virtually any other task.
>>
> When would that have been, please? (I don't know much ZFS history.)
> I believe this machine dates to 2006 or more likely 2007, although the
> USB 3.0 card was new last year. The VIA Firewire card was installed at
> the same time as the USB 3.0 card, but it was not new at that time.
That would have been the 2005-2007 timeframe. A Sun Fire V240 could be
brought to its knees by a large ZFS copy operation. Both CPUs would peg,
and memory bandwidth would be entirely consumed by the I/O operations. The
Sun T2000 was much better, as it had (effectively) 32 logical CPUs (8
cores, each with 4 execution threads), and ZFS really likes multiprocessor
environments.
--
Paul Kraus paul at kraus-haus.org
Co-Chair Albacon 2014.5 http://www.albacon.org/2014/