nfsclient: incorrect st_blksize (bug?)

Mon Jul 29 14:59:06 UTC 2013

On Mon, 29 Jul 2013, Ali Niknam wrote:

> I've come across a problem that has proven to be unsolvable for me so far. It 
> might be a bug in the NFS Client code; it might also be my general lack of 
> knowledge :). Can someone please give me a hint in the right direction?
>
> This is the case:
>
> mount_nfs -o rsize=32768 -o wsize=32768 -o nfsv4 -o tcp host:/path /mnt/nfs
>
> stat /mnt/nfs gives st_blksize of 4096 bytes.
> statfs /mnt/nfs gives an iosize of 4096 bytes.
>
> Mounting with nfsv3 gives the same results, regardless of udp or tcp 
> protocol. NFSv2 however seems to give a st_blksize of 128k, with an iosize of 
> 8192 bytes.
>
> In short: it seems that with BSD 9.1 the rsize/wsize's aren't passed along 
> correctly. I tried to debug it by looking in the kernel code but I got lost 
> unfortunately in the abstraction layers (everything seems to set 
> NFS_FABLKSIZE).
>
> Mounting the same host on a linux machine gives the correct st_blksize (32k).
>
> The disadvantage is of course that apache/etc adhere to the 4k st_blksize by 
> only reading 4k chunks so that nfs io slows down substantially.

nfs still seems to ask for a blocksize of NFS_FABLKSIZE = 512.  Old
versions of FreeBSD honored the leaf file system's idea of the best block
size and gave this 512.  After many intermediate broken versions, vn_stat()
now has a hack that involves it using PAGE_SIZE iff the leaf file system
prefers a smaller size, so 512 becomes 4096 on x86.  4096 is not as bad as
512, but still too small for most purposes.  OTOH, 512 works quite well for
nfs over local networks with low latency.  512 fits in a 1500-byte packet
but 4096 doesn't, so latency can be better with small block sizes and
lower latency also gives higher throughput provided everything can keep
up with the small blocks.
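
As a rough illustration of the two points above (this is a sketch, not the
kernel code): the clamp described for vn_stat() and the packet arithmetic
can be written out as below.  The helper names and the SKETCH_* constants
are made up for this example; 4096 assumes an x86 page size and 1500 a
standard Ethernet MTU, with protocol header overhead ignored.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE     4096L  /* x86 page size */
#define SKETCH_NFS_FABLKSIZE  512L  /* value quoted in the text above */
#define SKETCH_ETHER_MTU     1500L  /* standard Ethernet MTU */

/*
 * Hypothetical helper mirroring the behaviour described for vn_stat():
 * report PAGE_SIZE iff the leaf file system prefers a smaller size,
 * otherwise pass the file system's preference through unchanged.
 */
long
reported_blksize(long fs_blksize)
{
	assert(fs_blksize > 0);
	return (fs_blksize < SKETCH_PAGE_SIZE ? SKETCH_PAGE_SIZE : fs_blksize);
}

/*
 * Number of MTU-sized packets needed to carry `bytes` of payload.
 * Header overhead is ignored; it would only increase the count.
 */
long
packets_needed(long bytes)
{
	assert(bytes > 0);
	return ((bytes + SKETCH_ETHER_MTU - 1) / SKETCH_ETHER_MTU);
}
```

With these assumptions, reported_blksize(SKETCH_NFS_FABLKSIZE) gives 4096,
and a 512-byte block needs one packet where a 4096-byte block needs three.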

A workaround might be to use statfs() instead of stat().  st_blksize
can vary within a file system in theory, but usually doesn't, and can't
be trusted anyway.  struct statfs has fields f_bsize ("fragment" size)
and f_iosize (optimal transfer size).  These seem to be set better by
leaf file systems, and are certainly never frobbed by upper layers
(except to translate to old statfs()).  nfs still seems to set f_bsize
to NFS_FABLKSIZE, but it sets f_iosize to its i/o size.  ffs sets
f_bsize to its fragment size (not so good.  statfs() can't even
represent ffs's 2 types of block size.  Neither can stat(), but
st_blksize is initialized with the other one, so unportable code can
determine both).  ffs sets f_iosize to a disk-specific size.  There
are many bugs in the setting of the latter too, and it now almost
always reduces to a hard-coded setting of MAXPHYS that has nothing
to do with disks' preferred sizes.  Hard-coding of MAXPHYS everywhere
would be OK for throughput but not so good for latency.  To optimize
for latency, there seems to be nothing better than using statfs()'s
f_bsize, but we know that that reduces to a hard-coded 512 for nfs
and to the not-necessarily best fragment size for ffs.
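
A minimal sketch of the throughput-oriented variant of this workaround,
for an application that wants a read buffer size: prefer statfs()'s
f_iosize where it exists and fall back to st_blksize otherwise.  The
function name preferred_io_size() and its fallback parameter are
hypothetical.  The f_iosize branch is only compiled on FreeBSD, since
other systems' struct statfs may not have that field.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#if defined(__FreeBSD__)
#include <sys/param.h>
#include <sys/mount.h>
#endif

/*
 * Hypothetical helper: pick a transfer size for I/O on `path`.
 * Order of preference:
 *   1. statfs() f_iosize (FreeBSD only in this sketch),
 *   2. stat() st_blksize,
 *   3. the caller-supplied fallback.
 */
size_t
preferred_io_size(const char *path, size_t fallback)
{
	struct stat sb;

#if defined(__FreeBSD__)
	struct statfs sfs;

	/* f_iosize is the file system's idea of the optimal transfer size. */
	if (statfs(path, &sfs) == 0 && sfs.f_iosize > 0)
		return ((size_t)sfs.f_iosize);
#endif
	/* Portable fallback; as noted above, this value can't be trusted much. */
	if (stat(path, &sb) == 0 && sb.st_blksize > 0)
		return ((size_t)sb.st_blksize);

	return (fallback);
}
```

A caller would use something like preferred_io_size("/mnt/nfs", 65536) to
size its buffer.  Whether the result matches the mount's rsize/wsize
depends on what the nfs client puts in f_iosize, which this sketch does
not verify.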

Bruce