nfs_getpages: error 4

Sat Mar 5 18:36:05 UTC 2016

On Sat, Mar 05, 2016 at 07:42:51PM +0300, Dmitry Sivachenko wrote:
> 
> > On 05 Mar 2016, at 19:27, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > 
> > On Sat, Mar 05, 2016 at 05:24:26PM +0300, Dmitry Sivachenko wrote:
> >>> 
> >>> Again, error 4 is EINTR, so you could disable both the "soft" and "intr" options as a test.
> >> 
> >> 
> >> "soft" is meaningless in such a setup, because "file system calls will fail after retrycnt round trip timeout intervals" but "The default is a retry count of zero, which means to keep retrying forever".
> >> 
> >> If I understand "intr" correctly, it matters only when the server becomes unresponsive, that is, a "server is not responding" message should be in my logs.  But I have no such message.
> >> 
> >> 
> > 
> > The intr NFS mount option allows signals to interrupt NFS waits for the
> > RPC responses.  This is almost certainly the reason for the EINTR error
> > you get from the pager.
> > 
> > You should at least get the
> > vm_fault: pager read error, pid ...
> > messages as well.  Is this true?
> 
> 
> That is true, see my initial post.
Ok.

> 
> 
> >  The end result would be SIGSEGV
> > delivered to the process.
> > 
> > OTOH, I do not quite understand why your threads requesting page-in
> > fell into the wait for a free page.  I assume that there are enough free
> > pages in the system?
> > 
> 
> 
> I have no swap configured, but it is possible that running processes eat all RAM (I would expect them to be killed by OOM rather than get stuck?)

I cannot answer this question about 'eat all ram'.  You can.

But I suspect that you do have enough free or reclaimable pages for OOM
not to trigger, e.g. because you showed command output from the
live system after the situation occurred.  More likely it was a temporary
free page shortage, after which the system recovered.

I am more inclined to believe in a bug in the handling of a killed process
in vm_fault().  Could you get the p_flag value for the hung process?  Like
	ps -o flags <pid>
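
The hex value printed by that command could be decoded roughly as
follows.  This is only a sketch: decode_p_flag is a hypothetical helper
name, and the three bit values are what I believe sys/proc.h defines
(P_WEXIT 0x02000, P_WKILLED 0x08000, P_STOPPED_SIG 0x20000).  Please
check them against the header for your kernel before trusting the result.

```sh
#!/bin/sh
# Decode a few bits of a p_flag value given in hex, with or without a
# leading "0x" (ps prints the value in hex).
# ASSUMPTION: the bit values below are taken from memory of sys/proc.h;
# verify them against the header that matches your kernel.
decode_p_flag() {
    hex=${1#0x}            # strip an optional "0x" prefix
    val=$(( 0x$hex ))      # convert the hex string to an integer
    names=""
    if [ $(( val & 0x02000 )) -ne 0 ]; then
        names="$names P_WEXIT"        # process is exiting
    fi
    if [ $(( val & 0x08000 )) -ne 0 ]; then
        names="$names P_WKILLED"      # process was killed, exit pending
    fi
    if [ $(( val & 0x20000 )) -ne 0 ]; then
        names="$names P_STOPPED_SIG"  # stopped by a signal
    fi
    echo "${names# }"      # drop the leading space, if any
}
```

Call it with the value from the flags column, e.g. "decode_p_flag 8000".
An empty line means none of the three bits is set.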