Re: madvise(MADV_FREE) doesn't work in some cases?

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Sat, 03 Jul 2021 11:35:59 UTC
On Sat, Jul 03, 2021 at 02:32:01AM +0300, Vitaliy Gusev wrote:
> Hi, 
> 
> I came across not expected behaviour with madvise() in FreeBSD.
> 
> Attached test program mmapfork does: mmap, fork, touch memory and then madvise(MADV_FREE).
> 
> Expected behaviour - one process can allocate memory (lazy allocation) while system is freeing previously allocated memory for a second process.
> 
> Current behaviour -  system kills one process with message in dmesg:
> 
> pid 31314 (mmapfork), jid 0, uid 1001, was killed: out of swap space
> 
> Running this test in Linux or illumos shows expected behaviour with a little difference in illumos - it frees memory almost immediately, w/o needs lack of memory in a system.
> 
> If use MADV_NOTNEED - no changes.
> 
> If modify program and do not do fork(), but run two instances  - that shows expected behaviour.
> 
> To reproduce just disable swap, and run program with argument as 1/2 RAM on a system. For instance, command below will try run and use ~ 2GB area twice.
> 
> [vetal@bsdev ~]$ ./mmapfork 2000
> 
> Testing program is attached.
> 
> Note, during testing I disabled swap on all systems: Linux, illumos and FreeBSD.
> 
> Does it mean madvise() doesn't work well in FreeBSD or test does something wrong?

Your program does not do exactly what you described above.  There is a
generic race to consume memory, and there are some specific details about
madvise(2) on FreeBSD.

From the code, you do:
- mmap anonymous private region
- fork
- both child and parent start touching the mmaped region.
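
A minimal sketch of that sequence, reconstructed from the description rather
than from the attached program.  The names mmap_fork_sketch() and
touch_then_free() are hypothetical, and a madvise() failure is only reported,
not treated as fatal, since the call is advisory:

```c
#define _DEFAULT_SOURCE         /* for MAP_ANON/MADV_FREE on glibc; harmless elsewhere */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: dirty every page, then tell the kernel the
 * contents are no longer needed. */
static void
touch_then_free(char *p, size_t len)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

	for (size_t off = 0; off < len; off += pagesz)
		p[off] = 1;
	if (madvise(p, len, MADV_FREE) == -1)
		perror("madvise(MADV_FREE)");
}

/* mmap, fork, then both processes touch and advise the same region.
 * Returns 0 if the child exited cleanly, -1 on any setup failure. */
int
mmap_fork_sketch(size_t len)
{
	char *p;
	pid_t pid;
	int status = 0, rc = 0;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return (-1);

	pid = fork();
	if (pid == -1) {
		munmap(p, len);
		return (-1);
	}
	if (pid == 0) {
		/* Child: races with the parent for memory. */
		touch_then_free(p, len);
		_exit(0);
	}

	/* Parent: does the same work concurrently. */
	touch_then_free(p, len);
	if (waitpid(pid, &status, 0) == -1 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		rc = -1;
	munmap(p, len);
	return (rc);
}
```

Assuming the program's argument is a size in megabytes (an assumption based
on "2000" giving a ~2GB area), `./mmapfork 2000` would roughly correspond to
mmap_fork_sketch((size_t)2000 << 20).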

Two processes race to consume 1/2 of RAM each on your system.  If one of
them happens to execute faster than the other, you do get to the case where
one of them calls madvise().  But it could be that the processes execute in
lockstep, and try to eat all the memory before either reaches madvise().
Did you exclude this case?
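
One possible way to exclude the lockstep case is to serialize the two
processes with a pipe, so the child does not touch the region until the
parent has finished madvise().  This is only a sketch under that assumption;
serialized_sketch() and dirty_and_advise() are hypothetical names, not taken
from the attached program.  It removes the race from the picture but says
nothing about whether the pages are actually released:

```c
#define _DEFAULT_SOURCE         /* for MAP_ANON/MADV_FREE on glibc; harmless elsewhere */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: dirty every page, then advise MADV_FREE.
 * The advice is a hint, so a failure is reported but not fatal. */
static void
dirty_and_advise(char *p, size_t len)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

	for (size_t off = 0; off < len; off += pagesz)
		p[off] = 1;
	if (madvise(p, len, MADV_FREE) == -1)
		perror("madvise(MADV_FREE)");
}

/* Returns 0 if both processes completed in order, -1 otherwise. */
int
serialized_sketch(size_t len)
{
	int fds[2], status = 0, rc = 0;
	char token = 0;
	char *p;
	pid_t pid;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return (-1);
	if (pipe(fds) == -1) {
		munmap(p, len);
		return (-1);
	}

	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		munmap(p, len);
		return (-1);
	}
	if (pid == 0) {
		close(fds[1]);
		/* Block until the parent signals that madvise() is done. */
		if (read(fds[0], &token, 1) != 1)
			_exit(1);
		dirty_and_advise(p, len);
		_exit(0);
	}

	close(fds[0]);
	dirty_and_advise(p, len);
	/* Release the child only after our own madvise() has returned. */
	if (write(fds[1], &token, 1) != 1)
		rc = -1;
	close(fds[1]);
	if (waitpid(pid, &status, 0) == -1 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		rc = -1;
	munmap(p, len);
	return (rc);
}
```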

Now, about the specifics of madvise(MADV_FREE) on FreeBSD.  Due to the way
CoW is implemented with the shadow chain of objects, we cannot drop pages
from the top of the shadow chain; otherwise, instead of returning zeroed
pages on the next access, we would return stale content from an earlier
point in time.  It was a relatively recent discovery, see
bf5661f4a1af6931ec4b6, PR 240061.

To explain it in simplified form: when there is potentially old content
under the CoW copy for the mapping, we cannot drop the CoW-ed pages.  This
is the reason why madvise(MADV_FREE) does nothing for your program.
When you run two instances without fork, there is no previous content
and no CoW, so madvise() can safely remove the pages from the object,
and on the next access they are zero-filled.
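
The hazard can be illustrated with a toy model.  This is not FreeBSD VM
code: struct toy_object and the toy_*() functions are invented for
illustration, a "page" is a single byte, and the pre-fork content 'A' in
the backing object is an assumption chosen to make the effect visible.

```c
#include <stddef.h>

/* Toy stand-in for a VM object holding one page. */
struct toy_object {
	int has_page;			/* is the page present in this object? */
	char data;			/* page content, one byte for brevity */
	struct toy_object *backing;	/* next object down the chain, or NULL */
};

/* Read fault: the first object in the chain that has the page wins;
 * if no object has it, the page is zero-filled. */
char
toy_read(const struct toy_object *top)
{
	for (const struct toy_object *o = top; o != NULL; o = o->backing)
		if (o->has_page)
			return (o->data);
	return (0);
}

/* Write fault: CoW puts the new content into the top object only. */
void
toy_write(struct toy_object *top, char c)
{
	top->has_page = 1;
	top->data = c;
}

/* Naive MADV_FREE: discard the page from the top object. */
void
toy_drop_top(struct toy_object *top)
{
	top->has_page = 0;
}

/* Write 'B', naively drop it, and return what the next read sees. */
char
toy_after_drop(int forked)
{
	struct toy_object orig = { 1, 'A', NULL };	/* assumed old content */
	struct toy_object shadow = { 0, 0, &orig };	/* CoW copy after fork */
	struct toy_object plain = { 0, 0, NULL };	/* no fork, no chain */
	struct toy_object *top = forked ? &shadow : &plain;

	toy_write(top, 'B');
	toy_drop_top(top);
	return (toy_read(top));
}
```

In this model toy_after_drop(1) yields the old 'A' instead of zero, while
toy_after_drop(0) yields 0, matching the zero-fill behaviour described for
the two-instance case.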

You can read more details in the referenced commit, as well as some musings
about ways to make it somewhat better.

I must say that trying to allocate 1/2 + 1/2 of RAM this way, on a system
without swap, is asking for trouble anyway.