Re: witness_lock_list_get: witness exhausted

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 04 Oct 2021 11:27:51 UTC
Just take it and change as you see fit, I don't have time to work on it.

On 10/4/21, Alan Somers <asomers@freebsd.org> wrote:
> On Mon, Jan 8, 2018 at 5:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On Tue, Jan 9, 2018 at 12:41 AM, Michael Jung <mikej@mikej.com> wrote:
>>
>> > On 2018-01-08 13:39, John Baldwin wrote:
>> >
>> >> On Tuesday, November 28, 2017 02:46:03 PM Michael Jung wrote:
>> >>
>> >>> Hi!
>> >>>
>> >>> I've recently up'd my processor count on our poudriere box and have
>> >>> started noticing the error
>> >>> "witness_lock_list_get: witness exhausted" on the console.  The
>> >>> kernel
>> >>> *DOES NOT* crash but I
>> >>> thought the report may be useful to someone.
>> >>>
>> >>> $ uname -a
>> >>> FreeBSD poudriere 12.0-CURRENT FreeBSD 12.0-CURRENT #1 r325999: Sun
>> >>> Nov
>> >>> 19 18:41:20 EST 2017
>> >>> mikej@poudriere:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
>> >>>
>> >>> The machine is pretty busy running four poudriere build instances.
>> >>>
>> >>> last pid: 76584;  load averages: 115.07, 115.96, 98.30
>> >>>
>> >>>                                       up 6+07:32:59  14:44:03
>> >>> 763 processes: 117 running, 581 sleeping, 2 zombie, 63 lock
>> >>> CPU: 59.0% user,  0.0% nice, 40.7% system,  0.1% interrupt,  0.1%
>> >>> idle
>> >>> Mem: 12G Active, 2003M Inact, 44G Wired, 29G Free
>> >>> ARC: 28G Total, 11G MFU, 16G MRU, 122M Anon, 359M Header, 1184M Other
>> >>>       25G Compressed, 32G Uncompressed, 1.24:1 Ratio
>> >>>
>> >>> Let me know what additional information I might supply.
>> >>>
>> >>
>> >> This just means that WITNESS stopped working because it ran out of
>> >> pre-allocated objects.  In particular the objects used to track how
>> >> many locks are held by how many threads:
>> >>
>> >> /*
>> >>  * XXX: This is somewhat bogus, as we assume here that at most 2048
>> >> threads
>> >>  * will hold LOCK_NCHILDREN locks.  We handle failure ok, and we
>> >> should
>> >>  * probably be safe for the most part, but it's still a SWAG.
>> >>  */
>> >> #define LOCK_NCHILDREN  5
>> >> #define LOCK_CHILDCOUNT 2048
>> >>
>> >> Probably the '2048' (max number of concurrent threads) needs to scale
>> >> with
>> >> MAXCPU.  2048 threads is probably a bit low on big x86 boxes.
>> >>
>> >
>> >
>> > Thank you for you explanation.  We are expanding our ESXi cluster and
>> > even
>> > though with standard edition I can only assign 64 vCPU's to a guest and
>> > as
>> > much
>> > RAM as I want, I do like to help with edge cases if I can make them
>> > occur
>> > pushing
>> > boundaries as I can towards additianional improvements in FreeBSD.
>> >
>>
>> Can you apply this and re-run the test?
>>
>> https://people.freebsd.org/~mjg/witness.diff
>>
>> It bumps the counters to be "high enough" but also starts tracking usage.
>> If you get
>> the message again, bump the values even higher.
>>
>> Once you get a complete poudriere run which did not result in the
>> problem,
>> do:
>> $ sysctl debug.witness.list_used debug.witness.list_max_used
>>
>> to dump the actual usage.
>
> This is a nice little patch.  Can we commit to head?  Even better
> would be if LOCK_CHILDCOUNT could be a tunable.  On my largish system,
> here's what I get shortly after boot:
>
> debug.witness.list_max_used: 8432
> debug.witness.list_used: 8420
>
> -Alan
>


-- 
Mateusz Guzik <mjguzik gmail.com>