Re: zfs (?) issues?

From: Sulev-Madis Silber <freebsd-current-freebsd-org111_at_ketas.si.pri.ee>
Date: Sat, 26 Apr 2025 12:01:01 UTC

On April 23, 2025 6:40:44 PM GMT+03:00, void <void@f-m.fm> wrote:
>On Mon, Apr 21, 2025 at 04:25:16AM +0300, Sulev-Madis Silber wrote:
>> i have long running issue in my 13.4 box (amd64)
>> 
>> others don't get it at all and only suggest adding more than 4g ram
>> 
>> it manifests as some mmap or other problems i don't really get
>> 
>> basically unrestricted git consumes all the memory. 
>
>I see symptoms like this on very slow media. That media might
>be inherently slow (like microSD) or it might mean the hd/ssd is
>worn out [1]. Programs like git and subversion read and write lots
>of small files, and the os/filesystem might not be able to write
>to slow media as fast as git would like.
>
>[1] observed failure mode of some hardware, where writes just
>    get slower and slower.
>
>[2] the workaround where the machine *has* to use micro sd, in my
>    example, to update ports, was to download latest ports.tar.xz and
>    unzip it, rather than use git.
>
>[3] test hd performance with fio (/usr/ports/benchmarks/fio)

that might be it!? there is a hdd on this machine that was tested, but these days it never really completes the long smart self-tests, and the short ones take ages. there are no "usual" disk errors, though. that hdd is part of the 2-disk mirror that git runs on
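
for reference, this is roughly how i'd check the suspect disk, per the fio suggestion above (ada0 and /pool/scratch are just placeholders, not my actual device or path):

    # kick off another long self-test and read back the results
    smartctl -t long /dev/ada0
    smartctl -a /dev/ada0

    # random small-write load on the fs git lives on, roughly matching
    # git's many-small-files pattern (needs benchmarks/fio from ports)
    fio --name=smallwrites --directory=/pool/scratch --rw=randwrite \
        --bs=4k --size=256m --numjobs=1 --iodepth=8 \
        --ioengine=posixaio --runtime=60 --time_based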

but there could be a fix for this that wouldn't affect anyone else. i don't know how the internals really work, but slow io could be filling the write buffers up. those could be checked and the fs could be throttled, e.g. by simply not acknowledging a write as done yet. that would make things slower when the queue is full, so git would have to wait. i bet there are checks for this, maybe they just don't work well here? it can't just be blindly accepting writes and hoping they can be committed to storage at some future time
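
if i understand the openzfs write throttle right, knobs for exactly this buffering already exist; a sketch of what i mean, assuming i have the sysctl names right (the 256m value is just an example, not a recommendation):

    # how much dirty (not yet written) data zfs accepts before it starts
    # delaying and eventually blocking new writers
    sysctl vfs.zfs.dirty_data_max
    sysctl vfs.zfs.dirty_data_max_percent

    # percentage of dirty_data_max at which writers start getting delayed
    sysctl vfs.zfs.delay_min_dirty_percent

    # example: cap dirty data at 256m so a slow mirror can't soak up all ram
    sysctl vfs.zfs.dirty_data_max=268435456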

or i could be wrong and it's some other issue

i'm wondering why nobody else spots this much, though? io could be slow because the media is abnormally slow by design, or because it is failing, but it could also be that the incoming writes are simply more than the storage can keep up with. that could happen on fast machines too, by accident or even as an attack, if i understand this correctly.

oom protection won't save any userland process here either? so it really was about the kernel wanting to allocate all of the ram, which it did. iirc i didn't see a single userland process still running, but i couldn't really check either. the kernel itself kept running perfectly fine after that. the fix for that particular failure mode is to enable the watchdog, of course. i think i've seen it on another machine as well but never realized it. or maybe that one was hardware and the kernel was frozen too; when i went to check, i found bulging caps
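
for completeness, this is what i mean by enabling the watchdog, plus keeping sshd off the oom kill list; just rc.conf additions, assuming the box has a usable watchdog:

    # /etc/rc.conf
    watchdogd_enable="YES"   # reboot if watchdogd can't check in, e.g. when userland is starved
    sshd_oomprotect="YES"    # ask the oom killer to spare sshd (rc.subr uses protect(1) for this)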

if all of the above is correct, it should be easy to test, maybe with gnop. i don't see an option to limit throughput, but i do see delay options. maybe i can even test it right now, since it shouldn't need huge amounts of ram just to prove the point that it fills up completely
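
something like this is what i had in mind for the gnop test (da1 and the 200ms delay are made up, and i'm going from the delay options in gnop(8), so the flags need double-checking):

    # put an artificial 200ms delay in front of a scratch disk
    gnop create -d 200 /dev/da1

    # build a throwaway pool on the slowed-down provider
    zpool create slowtest da1.nop

    # generate lots of small writes, like git does, and watch memory use
    git clone /some/local/repo.git /slowtest/repo

    # clean up afterwards
    zpool destroy slowtest
    gnop destroy da1.nop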

and this is not fixed on current either? and the fix would be in zfs? and ufs, as tested by others, would not be affected... why? i know zfs does cow, but anyway, i can't figure it out. that's why i don't develop filesystems. maybe even tell kirk? : p

what's funny is, how did the kernel know to stop there? was it just because it finally hit the actual ram limit, or because the writes stopped once git was killed? i'm not sure what "kernel memory full" even means. always a panic, since you can't kill things in the kernel? or could it return errors, or delay things, like syscall completion being delayed? i'm unsure about all of this and will leave it to others. but it's often said that running out of memory in the kernel leads to panics. couldn't it just error out?
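
next time i try to reproduce this i'll watch where the memory actually goes while it happens; something like this, with the sysctl names as i remember them (may need checking):

    # wired vs free pages and arc size, once a second
    while :; do
        sysctl -n vm.stats.vm.v_wire_count vm.stats.vm.v_free_count
        sysctl -n kstat.zfs.misc.arcstats.size
        sleep 1
    done

    # or just batch-mode top sorted by resident size
    top -b -o res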

tl;dr - suspected issue of zfs on a slow device filling up the *entire* ram with write buffers, leaving userland killed and the system in an unusable state