dead slow update servers

hw hw at adminart.net
Mon Jul 15 16:43:25 UTC 2019


"Kevin P. Neal" <kpn at neutralgood.org> writes:

> On Mon, Jul 15, 2019 at 05:42:25AM +0200, hw wrote:
>> "Kevin P. Neal" <kpn at neutralgood.org> writes:
>> > Oh, and my Dell machines are old enough that I'm stuck with the hardware
>> > RAID controller. I use ZFS and have raid0 arrays configured with single
>> > drives in each. I _hate_ it. When a drive fails the machine reboots and
>> > the controller hangs the boot until I drive out there and dump the card's
>> > cache. It's just awful.
>> 
>> That doesn't sound like a good setup.  Usually, nothing reboots when a
>> drive fails.
>> 
>> Would it be a disadvantage to put all drives into a single RAID10 (or
>> each half of them into one) and put ZFS on it (or them) if you want to
>> keep ZFS?
>
> Well, it still leaves me with the overhead of dealing with creating arrays
> in the hardware.

Didn't you also have to create the single-disk RAID0 arrays in the
hardware?

> And it costs me loss of the scrubbing/verification of the end-to-end
> checksumming. So I'm less safe there with no less work.

If you're worried about the controller acknowledging writes that
produce correct checksums while the data that actually ends up on the
disk no longer matches those checksums when the controller reads it
back later, what difference does it make which kind of RAID you use?
You can always run a scrub to verify the checksums, and if errors keep
showing up, you may need to replace the controller.
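
For example (the pool name "tank" is only a placeholder):

  # verify every block in the pool against its checksum
  zpool scrub tank

  # watch the progress and see which devices accumulate checksum errors
  zpool status -v tank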

> It would probably eliminate the reboots, though. But that's only if my
> theory about the reboots is correct.
>
> The failures I've seen involve the circuit board on the drive failing and
> the drive not responding to any commands ever again. My guess is that the
> ZFS watchdog timer is rebooting because commands don't complete within
> the timeout period. I could change that by changing the setting that keeps
> ZFS from writing to a drive when a drive vanishes, but then I lose the
> safety of pausing the system when a drive pops out of the slot. Yes, that
> has happened before.

Do the drives pop back into the slots all by themselves before the
timeout expires?

When a drive becomes unresponsive, ZFS should just fail it and continue
to work with the remaining ones.  I've seen it do that.
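
If the setting you mean is the pool's failmode property, you can check
what it is currently set to (the pool name is again just a
placeholder):

  # wait pauses all I/O until the device comes back, continue returns
  # EIO for new writes, panic crashes the machine
  zpool get failmode tank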

> Maybe I should just go ahead and change it. I've got a drive about to
> fail on me. It's a three way mirror so I'm not worried about it. It would
> be, uh, _nice_ if it didn't bring down the machine, though.

If you were using two or more disks each in a RAID1 or RAID10 to create
one disk exposed to ZFS, you wouldn't have a problem when one disk
becomes unresponsive.  If there's someone around who is used to quickly
popping the disks back into their slots, that someone could just as
well replace a failed disk by simply taking it out and plugging a new
one in.
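
With the controller exposing one RAID10 volume as a single disk, the
pool ends up with just one vdev.  Something like this, assuming the
volume shows up as /dev/mfid0 (the device name depends on which driver
your controller uses):

  # the controller handles the redundancy; ZFS only sees a single disk
  zpool create tank /dev/mfid0
  zpool status tank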

Hardware RAID does have advantages, so why not use them when you're
stuck with it anyway?

