Re: BLAKE3 unstability?

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Tue, 12 Jul 2022 18:33:47 UTC

On 7/12/22 1:41 AM, Evgeniy Khramtsov wrote:
>>>> I can reproduce via:
>>>>
>>>> $ truncate -s 10G /tmp/test
>>>> $ mdconfig -f /tmp/test -S 4096
>>>> $ zpool create test /dev/md1
>>>> $ zfs create -o checksum=blake3 test/b
>>>> $ dd if=/dev/random of=/test/b/noise bs=1M count=4096
>>>> $ sync
>>>> $ zpool scrub test
>>>> $ zpool status
>>>
>>> I cannot reproduce this on openzfs/zfs@cb01da68057 (the commit that was
>>> most recently merged) built out of tree on either stable/13 70fd40edb86
>>> or main 9aa02d5120a.
>>>
>>> I'll update a system and see if I can reproduce it with the in-tree ZFS.
>>>
>>> - Ryan
>>>
>> It did not reproduce for me with in-tree ZFS on main@3c9ad9398fcd either.
>>
>> Could you share sysctl kstat.zfs.misc.chksum_bench, maybe we are using
>> different implementations?
>> I do see that blake3 went in with only a Linux module parameter for the
>> implementation selection, so I'll have to fix that. For now we can at least
>> see which was fastest, which should be the one selected. You just won't be
>> able to manually change it to see if that helps.
>>
>> - Ryan
> 
> I found the culprit (kernel and base from download.FreeBSD.org
> kernel.txz and base.txz respectively) (I forgot about local sysctl.conf...):
> 
> kern.sched.steal_thresh=1
> kern.sched.preempt_thresh=121
> 
> Then
> 
> #!/bin/sh
> 
> truncate -s 10G /tmp/test
> mdconfig -f /tmp/test -S 4096
> zpool create test /dev/md0
> zfs create -o checksum=blake3 test/b
> dd if=/dev/random of=/test/b/noise bs=1M count=4096
> sync
> zpool scrub test
> sleep 3
> zpool status
> 
> zpool destroy test
> mdconfig -d -u 0
> rm /tmp/test
> 
> As for ULE "tuning", these values give me fine desktop interactivity
> when building lang/rust when nice and idprio did not help, so I left
> them in sysctl.conf. Not sure if scheduling parameters are worthy of
> a ZFS PR, maybe something essential is preempted.

It could be a missing fpu_kern_enter()/fpu_kern_leave() pair that the
lack of preemption would otherwise cover over.  I thought that omitting
those would panic the kernel, though, since FPU instructions (including
vector instructions) are disabled there.  Maybe ZFS isn't using
fpu_kern_enter() with FPU_KERN_NOCTX and is instead trying to juggle
contexts, with a bug in how it manages and reuses saved FPU contexts?
If so, I would suggest that ZFS switch to FPU_KERN_NOCTX, which runs all
SSE-type code in a critical section to disable preemption but avoids
having to allocate and manage FPU contexts.
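
For illustration, the FPU_KERN_NOCTX pattern could look roughly like the
sketch below.  This is not ZFS code: simd_work() and checksum_with_fpu()
are hypothetical names, the header choice assumes amd64, and the
non-kernel branch contains stand-ins (including a placeholder flag value)
that exist only so the sketch compiles outside the kernel.

```c
#ifdef _KERNEL
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <machine/fpu.h>	/* assumption: amd64; other arches differ */
#else
/* Userland stand-ins, NOT the real kernel definitions. */
#include <stddef.h>
#include <stdint.h>
struct thread;
struct fpu_kern_ctx;
#define	FPU_KERN_NOCTX	0x0004	/* placeholder value for the sketch */
static struct thread *curthread = NULL;
static int fpu_depth;		/* tracks enter/leave balance */

static void
fpu_kern_enter(struct thread *td, struct fpu_kern_ctx *ctx, unsigned flags)
{
	(void)td; (void)ctx; (void)flags;
	fpu_depth++;
}

static int
fpu_kern_leave(struct thread *td, struct fpu_kern_ctx *ctx)
{
	(void)td; (void)ctx;
	fpu_depth--;
	return (0);
}
#endif

/* Hypothetical placeholder for the real SSE/AVX routine: XOR-folds bytes. */
static uint32_t
simd_work(const uint8_t *buf, size_t len)
{
	uint32_t acc = 0;

	for (size_t i = 0; i < len; i++)
		acc ^= buf[i];
	return (acc);
}

/*
 * Bracket the vector code with enter/leave.  With FPU_KERN_NOCTX no
 * fpu_kern_ctx is allocated (NULL is passed) and the bracketed region
 * runs in a critical section, so it must be short and must not sleep.
 */
uint32_t
checksum_with_fpu(const uint8_t *buf, size_t len)
{
	uint32_t sum;

	fpu_kern_enter(curthread, NULL, FPU_KERN_NOCTX);
	sum = simd_work(buf, len);
	(void)fpu_kern_leave(curthread, NULL);
	return (sum);
}
```

The trade-off is that preemption stays disabled for the whole bracketed
region, so large buffers are probably better handled in bounded chunks.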

-- 
John Baldwin