Data corruption over NFS in -current
rmacklem at uoguelph.ca
Fri Jan 13 16:46:09 UTC 2012
Martin Cracauer wrote:
> More findings.
> Reminder, with the original report I found:
> - files for no reason changing ownership and group to
> - data corruption as in inserting binary junk obviously from ports
> - data corruption as in malformed ascii text that might be a bug I
> have in my code that is only exposed in FreeBSD
> I ran the script on a Linux machine in the same situation against the
> NFS server, and it worked fine. I haven't looked at block sizes, NFS
> versions, etc. in play yet.
> I ran with oldnfs (reboot), which showed only the third problem.
> I re-ran with newnfs (reboot), which worked (all three problems absent).
Since this test worked, it suggests that problem #3 is not a bug in your
software, unless your runs aren't processing the same data. However, a
test using a local disk to confirm this would be nice.
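One way to do that comparison is to checksum the output of a local-disk run against the output of an NFS run. The following is only a minimal sketch, assuming the script writes its results into a directory tree; the function names `tree_digests` and `diff_trees` are made up for illustration and are not part of any existing tool.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def diff_trees(local_root, nfs_root):
    """Return relative paths that are missing from one tree or differ in content."""
    local = tree_digests(local_root)
    nfs = tree_digests(nfs_root)
    return sorted(p for p in local.keys() | nfs.keys() if local.get(p) != nfs.get(p))
```

An empty list from `diff_trees` means both runs produced byte-identical files; any path it returns is a candidate for the junk-insertion or malformed-text problems.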
> I then started building ports/lang/gcc47 at the same time as I
> re-started my crazy script and it took only a few seconds for an
> unexpected ownership change to root to occur.
Well, in my experience, isolating a problem like this is much easier
if you can reproduce it reliably. I'd try this a few times and if
doing ports/lang/gcc47 concurrently reproduces the problem reliably,
then I'd use that for all the testing. (I'd suggest you re-do the above
tests doing ports/lang/gcc47 concurrently with the script.)
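To tell quickly whether a given run reproduced the ownership problem, something like the following could be run against the script's working directory after each attempt. This is a sketch under the assumption that every file there should belong to one known uid/gid; `find_ownership_changes` is a hypothetical helper name.

```python
import os


def find_ownership_changes(root, expected_uid, expected_gid):
    """Return (path, uid, gid) for every entry under root not owned as expected."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: report on symlinks themselves
            except FileNotFoundError:
                continue  # entry was removed while we were scanning
            if st.st_uid != expected_uid or st.st_gid != expected_gid:
                changed.append((path, st.st_uid, st.st_gid))
    return changed
```

Counting how many of, say, ten runs return a non-empty list gives a rough idea of how reliably the concurrent build triggers the problem.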
Also, I'd run "systat -vmstat" or similar (others may have better suggestions
than "systat -vmstat"?) while running the tests, to see if there might be
a memory exhaustion issue. (Daniel mentioned he had seen this, if I understood
his post correctly. Maybe he can elaborate on how he spotted the memory exhaustion?)
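If watching systat interactively is awkward during a long run, free memory could also be sampled periodically and logged. The sketch below assumes a FreeBSD client where `sysctl -n vm.stats.vm.v_free_count` and `sysctl -n hw.pagesize` are available; the helper names are made up, and a low free-page count is only a hint of memory pressure, not proof of exhaustion.

```python
import subprocess
import time


def read_sysctl_int(name):
    """Read one integer sysctl value via the sysctl(8) command."""
    out = subprocess.check_output(["sysctl", "-n", name])
    return int(out.decode().strip())


def free_mebibytes(read=read_sysctl_int):
    """Free memory in MiB, computed from the free page count and the page size."""
    pages = read("vm.stats.vm.v_free_count")
    page_size = read("hw.pagesize")
    return pages * page_size // (1024 * 1024)


def log_free_memory(samples, interval_s, read=read_sysctl_int):
    """Collect `samples` free-memory readings, `interval_s` seconds apart."""
    history = []
    for _ in range(samples):
        history.append(free_mebibytes(read))
        time.sleep(interval_s)
    return history
```

The `read` parameter exists only so the arithmetic can be checked without a FreeBSD machine; on the real client the default is used.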
> My next steps are:
> - trying block sizes and other parameters, maybe use a different NFS
> version with the Linux client. My NFS server is newly upgraded to
> Linux kernel 3.1.5
Or go back to the old version of the NFS server, if that is feasible.
Two changes (new Linux NFS server and new FreeBSD version) at about the
same time make it harder to point your finger at the problem.
> - running my script on a FreeBSD host with local disk to see whether
> problem #3 is a general problem that appears or is exposed only on
It might also be useful to run this FreeBSD host with local disk using
the NFS mount and having a swap partition on the disk. (Again, related
to what Daniel mentioned.)
> - capture tcpdump as mentioned earlier
If the combination of running the script and ports/lang/gcc47 reproduces
the problem reliably, then doing a tcpdump should be straightforward.
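As a rough illustration of wrapping a capture around one reproduction run, here is a sketch. It assumes NFS traffic on the standard port 2049 and that it is run as root; the interface name, server name, and output file in the usage note are placeholders, and `tcpdump_argv` and `capture_during` are hypothetical helper names.

```python
import signal
import subprocess


def tcpdump_argv(interface, server, outfile):
    """Build a tcpdump command line that saves full packets to/from the NFS server."""
    return [
        "tcpdump",
        "-i", interface,   # interface to listen on
        "-s", "0",         # snaplen 0: capture whole packets, not just headers
        "-w", outfile,     # write raw packets to a file for later analysis
        "host", server, "and", "port", "2049",
    ]


def capture_during(test_cmd, interface, server, outfile):
    """Run test_cmd while tcpdump captures, then stop the capture cleanly."""
    capture = subprocess.Popen(tcpdump_argv(interface, server, outfile))
    try:
        return subprocess.call(test_cmd)
    finally:
        capture.send_signal(signal.SIGINT)  # lets tcpdump flush and close the file
        capture.wait()
```

For example, `capture_during(["sh", "run-test.sh"], "em0", "nfsserver", "nfs.pcap")`, where all four arguments are placeholders for the local setup.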
Good luck with it. I'll admit I doubt this will be resolved quickly or
easily, but pursuing it as far as you can find the time to do so will
be appreciated by others who might run into the same problem.
> I will probably have to turn debug off since this script run is
> dominated by system time now and gets 10x slower as it is now.
> Martin Cracauer <cracauer at cons.org> http://www.cons.org/cracauer/
> freebsd-current at freebsd.org mailing list