Re: zfs (?) issues?
- In reply to: void : "Re: zfs (?) issues?"
Date: Thu, 08 May 2025 14:32:59 UTC
On April 26, 2025 4:06:16 PM GMT+03:00, void <void@f-m.fm> wrote:
>On Sat, Apr 26, 2025 at 03:01:01PM +0300, Sulev-Madis Silber wrote:
>>
>> that might be it!? there is a hdd on that machine that was tested,
>> but now it never really likes to complete the long smart tests, and
>> the short ones take ages. there are no "usual" disk errors, tho.
>> that hdd is part of the 2-disk mirror that the git runs on
>
>These are exactly the symptoms that led me to junk 2x HDs. One was
>CMR, the other SMR. Smart tests failing to execute normally is a sure
>sign it's hardware.

that's how i lost a bunch of recent data: no recent backups, an old
single-disk pool, and i took so long to add a mirror that the hdd
failed first. i had to give it to a recovery company; they agreed it
was maybe a head. it was corrupting data according to both zfs and the
smart data, and finally, after half the disk had been copied off, it
went bust. clicks. this is a mirrored and backed-up system now,
extensively stress tested, but the disks are second-hand 160g ones
whose remaining life i apparently just used up, i guess

>> i'm wondering why no one else spots it much, tho?
>
>maybe they do but in the end attribute it to hardware

>> and this is not fixed on current either? and the fix is in zfs? and
>> ufs, as tested by others, would not be affected... why?
>
>I've also seen the same thing happen in a microSD and a USB2/3
>context, both were UFS2 not ZFS.

>> tl;dr - suspected issue of zfs on a slow device filling up the
>> *entire* ram with write buffers, leaving userland killed and the
>> system in an unusable state
>
>Going by what you've described here, I'd say the problem is down to
>hardware.

just hw, eh? remember, this is not the first user who reports it. i
bet they didn't all have bust hw

i bet slow or intermittent io is the real cause. i also apparently
have too-high fragmentation here: i let the pool fill to 100% by
accident, and now it sits at 51% cap and 65% frag (there's a zpool
list example near the end of this mail). i know frag only describes
free space, but on a cow fs i guess that makes writes slow

but why does it fill up the ram? is there a tunable at least? or why
doesn't it autotune? the arc iirc tunes itself to something like
0.6*ram, or ram minus 1g. this goes way past those limits, so either
the write cache has its own, too-large limit, or... i don't know (some
candidate knobs are sketched below)

i mean, there might be a legit reason or just a small hiccup, on real
or virtual systems, local or remote storage. in the expected cases io
just slows down or stalls. here it seems like zfs accepts everything
in the hope that it will be able to write it out in a moment. but what
if it can't? i'm still looking for a way to defuse this perfect-storm
situation. the hw just makes the problem appear. zfs should reduce
throughput or simply stall the io. the kernel is full of all kinds of
limits that work; i want this one to work too. even if changing the
defaults would be bad, maybe there is a proper value of some tunable
that works for me? although i still find it highly unlikely that zfs
should take more than 90% of ram, whatever its size, or that combined
kernel memory should reach 100%. that should never happen

iirc smr disks did something like what i observe, where write speed
falls off after some amount of time (small cmr cache + big smr area),
and so did those shdds (small flash + big magnetic)

i think it's an annoying and bad surprise that any kind of io issue
can bring everything down

if there's no tunable for this, maybe one should be added? something
where i can say to the kernel: just don't do this. do not go there. do
not let memory usage get that far, whether it's a pre-fs, fs, or
post-fs buffer of some kind. it can't just keep filling up

i also wonder whether a ton of ram even helps. i wish i could test
that somehow too, because i'm curious
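fwiw, the frag number above comes straight from zpool; this is the
query i use ("tank" stands in for the real pool name):

    # capacity is allocated space, fragmentation describes the free
    # space, which is why a once-full pool stays fragmented
    zpool list -o name,size,capacity,fragmentation,health tank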
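and these are the knobs i mean when i ask about tunables. just a
sketch of what i'd poke at, assuming openzfs on a recent freebsd: i
haven't verified that every one of these can be set at runtime, names
may differ on older releases, and the values are examples, not
recommendations:

    # how much dirty (not yet written) data zfs will buffer before the
    # write throttle kicks in; iirc it defaults to a fraction of ram
    sysctl vfs.zfs.dirty_data_max
    sysctl vfs.zfs.dirty_data_max_percent

    # try capping it for this boot, e.g. at 256m (example value only)
    sysctl vfs.zfs.dirty_data_max=268435456

    # the arc cap, for comparison with the above
    sysctl vfs.zfs.arc_max

if a smaller dirty_data_max keeps the box alive with the sick disk,
that would at least tell me it's the write buffering and not the arc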
also, if a disk acts up like this, could zfs be improved to detect it?
currently it doesn't react in any way, and that keeps confusing people
forever. i could fix this with a hdd replacement, but someone else
will just find it again
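in the meantime i try to spot the sick disk by hand, roughly like this
(ada0 is just an example device; zpool iostat -l needs a reasonably
new openzfs):

    # per-vdev average latencies; a dying disk usually stands out
    zpool iostat -v -l 5

    # per-disk busy% and latency below zfs, from geom
    gstat -p

    # and the smart side, from sysutils/smartmontools
    smartctl -a /dev/ada0

but it would be much nicer if zfs flagged it itself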