NFS (& amd?) dysfunction descending a hierarchy
David Wolfskill
david at catwhisker.org
Tue Dec 9 19:14:31 UTC 2008
On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote:
> I seem to have a fairly- (though not deterministly so) reproducible
> mode of failure with an NFS-mounted directory hierarchy: An attempt to
> traverse a "sufficiently large" hierarchy (e.g., via "tar zcpf" or "rm
> -fr") will fail to "visit" some subdirectories, typically apparently
> acting as if the subdirectories in question do not actually exist
> (despite the names having been returned in the output of a previous
> readdir()).
> ...
I was able to reproduce the external symptoms of the failure running
CURRENT as of yesterday, using "rm -fr" of a copy of a recent
/usr/ports hierachy on an NFS-mounted file system as a test case.
However, I believe the mechanism may be a bit different -- while
still being other than what I would expect.
One aspect in which the externally-observable symptoms were different
(under CURRENT, vs. RELENG_7) is that under CURRENT, once the error
condition occurred, the NFS client machine was in a state where it
merely kept repeating
nfs server pid848 at fbsd-build:/volume: not responding
until I logged in as root & rebooted it.
Here's a cut/paste of the kdump from the ktrace of the amd(8) process
under CURRENT, showing where the master amd(8) process (pid 848)
forks a child (4126) to try the unmount:
848 amd 1228846258.722953 CALL gettimeofday(0x8078e48,0)
848 amd 1228846258.722964 RET gettimeofday 0
848 amd 1228846258.722982 CALL sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc)
848 amd 1228846258.722993 RET sigprocmask 0
848 amd 1228846258.723003 CALL fork
848 amd 1228846258.730250 RET fork 4126/0x101e
848 amd 1228846258.730405 CALL sigprocmask(SIG_SETMASK,0xbfbfeadc,0)
4126 amd 1228846258.730252 RET fork 0
4126 amd 1228846258.730456 CALL getpid
4126 amd 1228846258.730467 RET getpid 4126/0x101e
4126 amd 1228846258.730493 CALL unmount(0x2825f340,<invalid>0)
848 amd 1228846258.730422 RET sigprocmask 0
848 amd 1228846258.730595 CALL gettimeofday(0x8078e48,0)
848 amd 1228846258.730608 RET gettimeofday 0
...
848 amd 1228846258.914814 CALL sigprocmask(SIG_SETMASK,0xbfbfeba0,0)
848 amd 1228846258.914826 RET sigprocmask 0
848 amd 1228846258.914838 CALL select(0x400,0xbfbfec40,0,0,0xbfbfecd8)
4126 amd 1228846259.090428 RET unmount 0
4126 amd 1228846259.090492 CALL sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfea0c)
4126 amd 1228846259.090505 RET sigprocmask 0
4126 amd 1228846259.090518 CALL sigprocmask(SIG_SETMASK,0x2809b090,0)
4126 amd 1228846259.090530 RET sigprocmask 0
4126 amd 1228846259.090545 CALL sigprocmask(SIG_BLOCK,0x2809b080,0xbfbfe9dc)
4126 amd 1228846259.090556 RET sigprocmask 0
4126 amd 1228846259.090576 CALL sigprocmask(SIG_SETMASK,0x2809b090,0)
4126 amd 1228846259.090587 RET sigprocmask 0
4126 amd 1228846259.090605 CALL exit(0)
848 amd 1228846259.091248 RET select -1 errno 4 Interrupted system call
848 amd 1228846259.091277 PSIG SIGCHLD caught handler=0x805e090 mask=0x0 code=0x0
848 amd 1228846259.091298 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
848 amd 1228846259.091329 RET wait4 4126/0x101e
848 amd 1228846259.091342 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
848 amd 1228846259.091352 RET wait4 -1 errno 10 No child processes
848 amd 1228846259.091365 CALL sigprocmask(SIG_SETMASK,0x80795bc,0)
848 amd 1228846259.091377 RET sigprocmask 0
848 amd 1228846259.091390 CALL sigprocmask(SIG_BLOCK,0x80792c4,0)
848 amd 1228846259.091401 RET sigprocmask 0
848 amd 1228846259.091411 CALL gettimeofday(0x8078e48,0)
848 amd 1228846259.091422 RET gettimeofday 0
Note that while the child didn't get EBUSY (as it does under RELENG_7)
-- indeed, the unmount call appears to have returned 0 -- the master
amd(8) process looks to be seeing "errno 4 Interrupted system call."
And here's a relevent part of the kdump from the "rm -fr" -- I had
kdump spit out Epoch timestamps with each in order to make correlation
easier:
4121 rm 1228846258.736266 CALL unlink(0x2821c148)
4121 rm 1228846258.736281 NAMI "distinfo"
4121 rm 1228846258.738329 RET unlink 0
4121 rm 1228846258.738379 CALL unlink(0x2821c1b8)
4121 rm 1228846258.738401 NAMI "pkg-descr"
4121 rm 1228846258.739963 RET unlink 0
4121 rm 1228846258.739982 CALL open(0x28178b6b,O_RDONLY,<unused>0)
4121 rm 1228846258.740002 NAMI ".."
4121 rm 1228846258.740541 RET open 4
4121 rm 1228846258.740558 CALL fstat(0x4,0xbfbfe96c)
4121 rm 1228846258.740579 STRU struct stat {dev=67174155, ino=22674937, mode=drwxr-xr-x , nlink=114, uid=9874, gid=929, rdev=0, atime=1228846258.184514000, stime
=1228846258.779501000, ctime=1228846258.779501000, birthtime=-1, size=12288, blksize=4096, blocks=24, flags=0x0 }
4121 rm 1228846258.740593 RET fstat 0
4121 rm 1228846258.740608 CALL fchdir(0x4)
4121 rm 1228846258.740626 RET fchdir 0
4121 rm 1228846258.740641 CALL close(0x4)
4121 rm 1228846258.740976 RET close 0
4121 rm 1228846258.740991 CALL rmdir(0x2821c538)
4121 rm 1228846258.741007 NAMI "dnscheck"
4121 rm 1228846258.741764 RET rmdir 0
4121 rm 1228846258.741783 CALL stat(0x2821d028,0xbfbfe900)
4121 rm 1228846258.741799 NAMI "dnsdoctor"
4121 rm 1228846258.742050 STRU struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712,
ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
4121 rm 1228846258.742066 RET stat 0
4121 rm 1228846258.742080 CALL open(0x2821d028,O_NONBLOCK,<unused>0x28200000)
4121 rm 1228846258.742097 NAMI "dnsdoctor"
4121 rm 1228846258.742419 RET open 4
4121 rm 1228846258.742435 CALL fstat(0x4,0xbfbfe6a0)
4121 rm 1228846258.742452 STRU struct stat {dev=67174155, ino=2519891, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712,
ctime=1228845836.981842000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
4121 rm 1228846258.742465 RET fstat 0
4121 rm 1228846258.742480 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
4121 rm 1228846258.742495 RET fcntl 0
4121 rm 1228846258.742516 CALL fstatfs(0x4,0xbfbfe700)
4121 rm 1228846258.742792 RET fstatfs -1 errno 2 No such file or directory
4121 rm 1228846258.742809 CALL close(0x4)
4121 rm 1228846258.743187 RET close 0
4121 rm 1228846258.743203 CALL stat(0x2821d098,0xbfbfe900)
4121 rm 1228846258.743219 NAMI "dnsflood"
4121 rm 1228846258.743544 STRU struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
4121 rm 1228846258.743559 RET stat 0
4121 rm 1228846258.743574 CALL open(0x2821d098,O_NONBLOCK,<unused>0x28200000)
4121 rm 1228846258.743590 NAMI "dnsflood"
4121 rm 1228846258.743792 RET open 4
4121 rm 1228846258.743809 CALL fstat(0x4,0xbfbfe6a0)
4121 rm 1228846258.743826 STRU struct stat {dev=67174155, ino=2519892, mode=drwxr-xr-x , nlink=3, uid=9874, gid=929, rdev=0, atime=1228844788, stime=1227555712, ctime=1228845836.978868000, birthtime=-1, size=4096, blksize=4096, blocks=8, flags=0x0 }
4121 rm 1228846258.743840 RET fstat 0
4121 rm 1228846258.743854 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
4121 rm 1228846258.743867 RET fcntl 0
4121 rm 1228846258.743882 CALL fstatfs(0x4,0xbfbfe700)
4121 rm 1228846258.744008 RET fstatfs -1 errno 2 No such file or directory
4121 rm 1228846258.744022 CALL close(0x4)
4121 rm 1228846258.744411 RET close 0
[I included a moderate amount of successful processing near the
beginning of that excerpt, so folks could see the pattern.]
In contrast, here are the similar kdump excerpts from RELENG_7:
702 amd 1228774660.297858 CALL gettimeofday(0x807ad48,0)
702 amd 1228774660.297864 RET gettimeofday 0
702 amd 1228774660.297872 CALL sigprocmask(SIG_BLOCK,0xbfbfeaec,0xbfbfeadc)
702 amd 1228774660.297878 RET sigprocmask 0
702 amd 1228774660.297883 CALL fork
702 amd 1228774660.302658 RET fork 16840/0x41c8
16840 amd 1228774660.302660 RET fork 0
702 amd 1228774660.302689 CALL sigprocmask(SIG_SETMASK,0xbfbfeadc,0)
16840 amd 1228774660.302707 CALL getpid
16840 amd 1228774660.302725 RET getpid 16840/0x41c8
702 amd 1228774660.302715 RET sigprocmask 0
702 amd 1228774660.302746 CALL gettimeofday(0x807ad48,0)
16840 amd 1228774660.302753 CALL unmount(0x2837d310,<invalid>0)
702 amd 1228774660.302753 RET gettimeofday 0
...
702 amd 1228774660.417933 CALL select(0x400,0xbfbfec40,0,0,0xbfbfecd8)
16840 amd 1228774660.434632 RET unmount -1 errno 16 Device busy
16840 amd 1228774660.434684 CALL sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfea0c)
16840 amd 1228774660.434691 RET sigprocmask 0
16840 amd 1228774660.434699 CALL sigprocmask(SIG_SETMASK,0x28097c10,0)
16840 amd 1228774660.434705 RET sigprocmask 0
16840 amd 1228774660.434713 CALL sigprocmask(SIG_BLOCK,0x28097c00,0xbfbfe9dc)
16840 amd 1228774660.434718 RET sigprocmask 0
16840 amd 1228774660.434729 CALL sigprocmask(SIG_SETMASK,0x28097c10,0)
16840 amd 1228774660.434734 RET sigprocmask 0
16840 amd 1228774660.434745 CALL exit(0x10)
702 amd 1228774660.435214 RET select -1 errno 4 Interrupted system call
702 amd 1228774660.435227 PSIG SIGCHLD caught handler=0x805de20 mask=0x0 code=0x0
702 amd 1228774660.435237 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
702 amd 1228774660.435255 RET wait4 16840/0x41c8
702 amd 1228774660.435296 CALL wait4(0xffffffff,0xbfbfe83c,WNOHANG,0)
702 amd 1228774660.435302 RET wait4 -1 errno 10 No child processes
702 amd 1228774660.435307 CALL sigprocmask(SIG_SETMASK,0x807ba7c,0)
702 amd 1228774660.435312 RET sigprocmask 0
702 amd 1228774660.435317 CALL sigprocmask(SIG_BLOCK,0x807b784,0)
702 amd 1228774660.435323 RET sigprocmask 0
and:
16835 rm 1228774660.305162 CALL open(0x2816280b,O_RDONLY,<unused>0)
16835 rm 1228774660.305173 NAMI ".."
16835 rm 1228774660.305626 RET open 4
16835 rm 1228774660.305634 CALL fstat(0x4,0xbfbfe93c)
16835 rm 1228774660.305644 STRU struct stat {dev=50396945, ino=29713037, mode=drwxr-xr-x , nlink=91, uid=9874, gid=929, rdev=0, atime=1228774657.877477000, stime=
1228774660.314260000, ctime=1228774660.314260000, birthtime=0, size=20480, blksize=4096, blocks=40, flags=0x0 }
16835 rm 1228774660.305651 RET fstat 0
16835 rm 1228774660.305658 CALL fchdir(0x4)
16835 rm 1228774660.305667 RET fchdir 0
16835 rm 1228774660.305674 CALL close(0x4)
16835 rm 1228774660.305824 RET close 0
16835 rm 1228774660.305831 CALL rmdir(0x2821afe8)
16835 rm 1228774660.305838 NAMI "p-interp"
16835 rm 1228774660.306498 RET rmdir 0
16835 rm 1228774660.306513 CALL stat(0x28218b48,0xbfbfe8cc)
16835 rm 1228774660.306519 NAMI "pcemu"
16835 rm 1228774660.306756 STRU struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736,
ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
16835 rm 1228774660.306764 RET stat 0
16835 rm 1228774660.306770 CALL open(0x28218b48,O_NONBLOCK,<unused>0x1)
16835 rm 1228774660.306776 NAMI "pcemu"
16835 rm 1228774660.307003 RET open 4
16835 rm 1228774660.307010 CALL fstat(0x4,0xbfbfe8cc)
16835 rm 1228774660.307018 STRU struct stat {dev=50396945, ino=8465981, mode=drwxr-xr-x , nlink=5, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736,
ctime=1228773351.399184000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
16835 rm 1228774660.307025 RET fstat 0
16835 rm 1228774660.307031 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
16835 rm 1228774660.307039 RET fcntl 0
16835 rm 1228774660.307049 CALL fstatfs(0x4,0xbfbfe6f4)
16835 rm 1228774660.307294 RET fstatfs -1 errno 2 No such file or directory
16835 rm 1228774660.307302 CALL close(0x4)
16835 rm 1228774660.307549 RET close 0
16835 rm 1228774660.307557 CALL stat(0x28218b98,0xbfbfe8cc)
16835 rm 1228774660.307563 NAMI "pearpc"
16835 rm 1228774660.307759 STRU struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
16835 rm 1228774660.307767 RET stat 0
16835 rm 1228774660.307772 CALL open(0x28218b98,O_NONBLOCK,<unused>0x1)
16835 rm 1228774660.307779 NAMI "pearpc"
16835 rm 1228774660.308000 RET open 4
16835 rm 1228774660.308007 CALL fstat(0x4,0xbfbfe8cc)
16835 rm 1228774660.308015 STRU struct stat {dev=50396945, ino=27094159, mode=drwxr-xr-x , nlink=4, uid=9874, gid=929, rdev=0, atime=1228772282, stime=1227555736, ctime=1228773351.390236000, birthtime=0, size=4096, blksize=4096, blocks=8, flags=0x0 }
16835 rm 1228774660.308021 RET fstat 0
16835 rm 1228774660.308026 CALL fcntl(0x4,F_SETFD,FD_CLOEXEC)
16835 rm 1228774660.308032 RET fcntl 0
16835 rm 1228774660.308038 CALL fstatfs(0x4,0xbfbfe6f4)
16835 rm 1228774660.308101 RET fstatfs -1 errno 2 No such file or directory
16835 rm 1228774660.308108 CALL close(0x4)
16835 rm 1228774660.308350 RET close 0
So either way, the user-level program attempting the directory hierarchy
traversal can be coded to be very careful, yet still receive a rude
surprise -- the system saying that a file that the program had already
opened (and still has opened) does not exist.
How rude! :-{
I'll be very happy to test suggested patches, whether intended to fix
the problem or merely facilitate diagnosis. That said, it shouldn't be
difficult to reproduce this -- I did it with a copy of /usr/ports; a
colleague has reported doing so with a copy of /usr/src (though I
haven't witnessed that).
Peace,
david
--
David H. Wolfskill david at catwhisker.org
Depriving a girl or boy of an opportunity for education is evil.
See http://www.catwhisker.org/~david/publickey.gpg for my public key.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-current/attachments/20081209/340bd551/attachment.pgp
More information about the freebsd-current
mailing list