LOR on nfs: vfs_vnops.c:301 kern_descrip.c:1580

Fri Aug 20 19:35:51 UTC 2010

On Friday, August 20, 2010 3:19:53 pm Kostik Belousov wrote:
> On Fri, Aug 20, 2010 at 08:55:08PM +0400, pluknet wrote:
> > On 19 August 2010 17:34, John Baldwin <jhb at freebsd.org> wrote:
> > > On Thursday, August 19, 2010 5:29:25 am pluknet wrote:
> > >> On 19 August 2010 00:07, John Baldwin <jhb at freebsd.org> wrote:
> > >> > On Wednesday, August 18, 2010 3:17:56 pm pluknet wrote:
> > >> >> On 18 August 2010 23:11, pluknet <pluknet at gmail.com> wrote:
> > >> >> > On 18 August 2010 17:46, Kostik Belousov <kostikbel at gmail.com> wrote:
> > >> >> >> On Wed, Aug 18, 2010 at 02:43:19PM +0400, pluknet wrote:
> > >> >> >>> On 18 August 2010 12:07, pluknet <pluknet at gmail.com> wrote:
> > >> >> >>> > On 17 August 2010 20:04, Kostik Belousov <kostikbel at gmail.com> wrote:
> > >> >> >>> >
> > >> >> >>> >>
> > >> >> >>> >> Also please take a note of the John' suggestion to use the taskqueue.
> > >> >> >>> >
> > >> >> >>> > I decided to go this road. Thank you both.
> > >> >> >>> > Now I do nfs buildkernel survive and prepare some benchmark results.
> > >> >> >>> >
> > >> >> >>>
> > >> >> >>> So, I modified the patch to defer proc_create() with taskqueue(9).
> > >> >> >>> Below is `time make -j5 buildkernel WITHOUT_MODULES=yes` perf.
> > >> > evaluation.
> > >> >> >>> Done on 4-way CPU on clean /usr/obj with /usr/src & /usr/obj both
> > >> >> >>> nfs-mounted over 1Gbit LAN.
> > >> >> >>>
> > >> >> >>> clean old
> > >> >> >>> 1137.985u 239.411s 7:42.15 298.0%       6538+2133k 87+43388io 226pf+0w
> > >> >> >>>
> > >> >> >>> clean new
> > >> >> >>> 1134.755u 240.032s 7:41.25 298.0%       6553+2133k 87+43367io 224pf+0w
> > >> >> >>>
> > >> >> >>> Patch needs polishing, though it generally works.
> > >> >> >>> Not sure if shep_chan (or whatever name it will get) needs locking.
> > >> >> >> As I said yesterday, if several requests to create nfsiod coming one
> > >> >> >> after another, you would loose all but the last.
> > >> >> >>
> > >> >> >> You should put the requests into the list, probably protected by
> > >> >> >> nfs_iod_mtx.
> > >> >> >>
> > >> >> >
> > >> >> > How about this patch? Still several things to ask.
> > >> >> > 1) I used malloc instance w/ M_NOWAIT, since it's called with nfs_iod_mtx
> > >> > held.
> > >> >> > 2) Probably busy/done gymnastics is a wrong mess. Your help is
> > >> > appreciated.
> > >> >> > 3) if (1) is fine, is it right to use fail: logic (i.e. set
> > >> >> > NFSIOD_NOT_AVAILABLE)
> > >> >> > on memory shortage? Not tested.
> > >> >> >
> > >> >> > There are debug printf() left intentionally to see how 3 contexts run
> > >> > under load
> > >> >> > to each other. I attached these messages as well if that makes sense.
> > >> >> >
> > >> >>
> > >> >> Ah, yes. Sorry, forgot about that.
> > >> >
> > >> > Your task handler needs to run in a loop until the list of requests is empty.
> > >> > If multiple threads call taskqueue_enqueue() before your task is scheduled,
> > >> > they will be consolidated down to a single call of your task handler.
> > >>
> > >> Thanks for your suggestions.
> > >>
> > >> So I converted nfs_nfsiodnew_tq() into loop, and as such
> > >> I changed a cleanup SLIST_FOREACH() as well to free a bulk of
> > >> (potentially consolidated) completed requests in one pass.
> > >> kproc_create() is still out of the SLIST_FOREACH() to avoid calling it
> > >> with nfs_iod_mtx held.
> > >>
> > >> >
> > >> > -       int error, i;
> > >> > -       int newiod;
> > >> > +       int i, newiod, error;
> > >> >
> > >> > You should sort these alphabetically if you are going to change this.  I would
> > >> > probably just leave it as-is.
> > >>
> > >> Err.. that's resulted after a set of changes. Thanks for noting that.
> > >>
> > >> >
> > >> > Also, I'm not sure how this works as you don't actually wait for the task to
> > >> > complete.  If the taskqueue_enqueue() doesn't preempt, then you may read
> > >> > ni_error as zero before the kproc has actually been created and return success
> > >> > even though an nfsiod hasn't been created.
> > >> >
> > >>
> > >> I put taskqueue_drain() right after taskqueue_enqueue() to be in sync with
> > >> task handler. That was part to think about, as I didn't find such a use pattern.
> > >> It seems though that tasks are launched now strictly one-by-one, without
> > >> visible parallelism (as far as debug printf()s don't overlap anymore).
> > >
> > > Ah, if it is safe to sleep then I have a couple of suggestions:
> > >
> > 
> > Cool, credits go to John :).
> > I just adopted all of your changes (attached).
> > 
> > > - Use M_WAITOK to malloc() so you don't have to worry about the failure case
> > >  (I see Rick already suggested this).
> > 
> > After all that reduces to the following replacement in nfs_nfsiodnew().
> > So, no regression should be there in theory.
> > 
> >  mtx_unlock(&nfs_iod_mtx);
> > - kproc_create(...)
> > + malloc(...)
> > mtx_lock(&nfs_iod_mtx);
> > 
> > It survived after this simple test running for an hour, which
> > forces creation of many iods with r/w from 300 threads:
> > 
> > nfsserv:/home/svn/freebsd/obj_exp on /usr/obj (nfs)
> > 
> > ./randomio /usr/obj/file 100 0.5 0 4096 60 100 &
> > ./randomio /usr/obj/file 100 0.5 0 4096 60 100 &
> > ./randomio /usr/obj/file 100 0.5 0 4096 60 100 &
> > while [ true ]; do sysctl vfs.nfs.iodmax=2; sysctl vfs.nfs.iodmax=60;
> > sleep 20; done
> > 
> > randomio is from ports/149838.
> > 
> > 
> > P.S.
> > it's funny to see in top
> >     0 root       10  44    0     0K   144K deadlk  2  23:16 28.86% kernel
> > Someone might think the kernel is in deadlock.
> 
> It seems nobody replied to the mdf@ objection against wait of the
> new proc startup being equivalent to the LOR. I think that the wait
> is safe, because the task is executed in the context of
> the different process then the enqueue request.
> This might be worth noting in the comment or commit message.

I do wonder if we could get away with not waiting at all and always return -1?
You could have the task handler actually finish the toggle of the tristate in
the array.  Potentially you could even dispense with the linked list of
malloc'd structures and just walk the array creating processes for any entries
in the "in-progress" state in the task handler.  You might also want to avoid
submitting entries for new threads if there is already a pending one?  If that
is the case it could be further simplified by having the task always create a
single kthread when scheduled and just scheduling the task anytime a request
needs one.

> You should drain the taskq and prevent creation of new nfsiods before
> waiting for existing nfsiods to exit on module unload.

Agreed that unload needs to be handled.

> I find it somewhat weird that nip is removed from the list in the
> task handler, but freed in the enqueue thread. You could provide
> pointers to the local variables to exchange done/exit codes.
> I do not insist on this part.

Well, this is because the ownership of the thread transitions back from the
task to the process that requested the thread be created.  However, I wonder
if given my suggestions above you could have a model where a request that
couldn't immediately find a free thread is always queued elsewhere after
requesting a new thread to be started, and the new thread just grabs a request
from the head of that other queue when it starts up.  If that sort of approach
would work I think it would end up simpler overall.

-- 
John Baldwin