Re: POSIX shared memory and dying jails

From: Michael Gmelin <freebsd_at_grem.de>
Date: Fri, 25 Jun 2021 16:58:59 UTC

On Fri, 25 Jun 2021 09:19:05 -0700
James Gritton <jamie@freebsd.org> wrote:

> On 2021-06-25 07:41, Michael Gmelin wrote:
> > It seems like non-anonymous POSIX shared memory is not freed
> > automatically when a jail is removed and keeps it in a dying state,
> > until the shared memory segment is deleted manually.
> > 
> > See below for the most basic example:
> > 
> >     [root@jailhost ~]# jail -c name=shmtest path=/ command=/bin/sh
> >     # posixshmcontrol create /removeme
> >     # exit
> >     [root@jailhost ~]# jls -dv -j shmtest dying
> >     true
> > 
> > So at this point, the jail is stuck in a dying state.
> > 
> > Checking POSIX shared memory segments shows the shared memory
> > segment which is stopping the jail from crossing the Styx:
> > 
> >     [root@jailhost ~]# posixshmcontrol list
> >     MODE            OWNER   GROUP   SIZE    PATH
> >     rw-------       root    wheel   0       /removeme
> > 
> > After removing the shared memory segment manually...
> > 
> >     [root@jailhost ~]# posixshmcontrol rm /removeme
> > 
> > the jail passes away peacefully:
> > 
> >     [root@jailhost ~]#  jls -dv -j shmtest dying
> >     jls: jail "shmtest" not found
> > 
> > I wonder if it wouldn't make sense to always remove POSIX shared
> > memory created by a jail automatically when it's removed.  
> 
> That does seem reasonable, though it would take some bookkeeping to do
> right.  There is currently no concrete idea of a jail's ownership of a
> POSIX shm object, as it uses only uid and gid for access permissions,
> same as files.  The tie to the jail is in the underlying vm_object,
> which holds a cred that references the jail - that seems to be what's
> keeping the jail from going away.

Interesting - I was wondering how that worked, thanks. Would there be a
way to cut that tie somehow (for use cases that deliberately want to
leave the shared memory segment behind)?

> 
> Like files, POSIX shared memory is one way a jail may communicate with
> the rest of the system.  So it's theoretically conceivable that shared
> memory created by a defunct jail may still be in use by a parent jail,
> in the same way that shared memory created by a defunct process is
> still visible to other processes, but that may be a rare enough case
> to disregard.

This could theoretically be controlled by a parameter set on the
jail (something like "noposixshmcleanup"), the default being to remove
the segments on jail removal.

Another problem caused by the lack of jail ownership is that access
semantics are a bit strange. E.g., a jail based on / can easily list
(and remove) all memory allocations in the system, while for other jails
it depends. They can stat their own allocations, for example:

    # posixshmcontrol stat /xyz
    output as expected...

But not list them:

    # posixshmcontrol ls
    posixshmcontrol: cannot get kern.ipc.posix_shm_list length:
    Operation not permitted

This is probably related to matching the path of the allocation; I
didn't look into the code.

For practical purposes, we implemented a primitive workaround in the
scripts that stop jails. It simply lists all allocations whose path is
below the jail's path and removes them:

    # Garbage collect POSIX shared memory
    if command -v posixshmcontrol >/dev/null; then
      # Read one path per line so paths containing spaces survive intact
      posixshmcontrol ls | cut -f 5 | grep "^$_pdir/" |
      while IFS= read -r _shm_path; do
        posixshmcontrol rm "$_shm_path"
      done
    fi

but having something automatic in the OS would be nice, as would being
able to run `posixshmcontrol -j shmtest ls`. It seems like it would take
quite some effort to get right, though, also in terms of who can access
what. Right now, access is simply based on the path, which also gives
a lot of flexibility.
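
Since the workaround above matches on the path, here is a slightly more
defensive sketch of just the filtering step. It compares the jail path
as a literal prefix instead of a regular expression, so a "." in the
path cannot match other characters. The function name shm_paths_under
is made up for illustration, and it assumes tab-separated output with
the path in the fifth column (the same assumption `cut -f 5` makes):

```shell
#!/bin/sh
# Sketch only: shm_paths_under is a hypothetical helper, not part of
# posixshmcontrol. It reads `posixshmcontrol ls` output on stdin and
# prints the paths that lie below the directory given as $1.
shm_paths_under() {
  # index($5, p) == 1 means field 5 starts with the literal string p.
  # Appending "/" keeps /jails/web from also matching /jails/webX.
  # Assumption: columns are tab-separated and PATH is field 5.
  awk -F '\t' -v p="$1/" 'index($5, p) == 1 { print $5 }'
}
```

It would be used in place of the cut/grep pair, roughly like this:

    posixshmcontrol ls | shm_paths_under "$_pdir" |
    while IFS= read -r _shm_path; do
      posixshmcontrol rm "$_shm_path"
    done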

By the way, this was all triggered by running PostgreSQL in a jail -
depending on how it was started (non-persistent/exec.start vs
persistent/jexec) it would not clean up after itself when the jail was
removed, leading to jails and POSIX shared memory leaking on each jail
restart[0]. Probably something about signal handling, but that's
material for a different thread :).
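
To spot leaked jails after such restarts, something along these lines
might help. It is only a sketch: dying_jails is a made-up helper name,
and it assumes `jls -d name dying` prints one "name value" pair per
line and that jail names contain no whitespace:

```shell
#!/bin/sh
# Sketch only: dying_jails is a hypothetical helper. It reads lines of
# the form "<name> <dying>" on stdin and prints the names of the jails
# whose dying value is "true".
dying_jails() {
  # Assumption: the jail name is field 1 and the dying flag is field 2.
  awk '$2 == "true" { print $1 }'
}
```

On the jail host this would be fed from jls:

    jls -d name dying | dying_jails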

Best,
Michael

[0] https://github.com/pizzamig/pot/issues/150

-- 
Michael Gmelin