more weird bugs with mmap-ing via NFS
Matthew Dillon
dillon at apollo.backplane.com
Wed Mar 22 00:26:02 UTC 2006
:I don't specify either, but the default is UDP, is not it?
Yes, the default is UDP.
:> Now imagine a client that experiences this problem only
:> sometimes. Modern hardware, but for some reason (network
:> congestion?) some frames are still lost if sent back-to-back.
:> (Realtek chipset on the receiving side?)
:
:No, both sides have em-cards and are only separated by a rather decent large
:switch.
:
:I'll try the TCP-mount workaround. If it helps, we can assume our UDP NFS is
:broken for sustained high bandwidth writes :-(
:
:Thanks!
:
: -mi
I can't speak for FreeBSD's current implementation, but it should be
possible to determine whether there is an issue with packet drops or
not by observing the network statistics via netstat -s. Generally
speaking, however, I know of no problems with a UDP NFS mount per se,
at least as long as reasonable values are chosen for the block size.
The mmap() call in your mzip.c program looks ok to me with the exception
of the use of PROT_WRITE. Try using PROT_READ|PROT_WRITE. The
ftruncate() looks ok as well. If the program works over a local
filesystem but fails to produce data in the output file on an NFS
mount (but completes otherwise), then there is a bug in NFS somewhere.
If the problem is simply that the program stalls and never completes,
then it could be a problem with dropped packets in the network stack.
If the problem is that the program simply runs
very inefficiently over NFS, with excessive network bandwidth for the
data being written (as you also reported), this is probably an artifact
of attempting to use mmap() to write out the data, for reasons previously
discussed.
I would again caution against using mmap() to populate a file in this
manner. Even with MADV_SEQUENTIAL there is no guarantee that the system
will actually flush the pages to the actual file on the server
sequentially, and you could end up with a very badly fragmented file.
When a file is truncated to a larger size the underlying filesystem
does not allocate the actual backing store on disk for the data hole
created. Allocation winds up being based on the order in which the
operating system flushes the VM pages. The VM system does its best, but
it is really designed more as a random-access system rather than a
sequential system. Pages are flushed based on memory availability and
a thousand other factors and may not necessarily be flushed to the file
in the order you think they should be. write() is really a much better
way to write out a sequential file (on any operating system, not
just BSD).
-Matt
Matthew Dillon
<dillon at backplane.com>
More information about the freebsd-stable
mailing list