i386/68719: [usb] USB 2.0 mobil rack+ fat32 performance problem

Bruce Evans bde at zeta.org.au
Mon May 30 00:19:42 PDT 2005


On Sun, 29 May 2005, Dominic Marks wrote:

> I have been experimenting with msdosfs_read and have managed to come up with
> something that works, but I'm sure it is flawed. On large file reads it
> improves read performance (see below) - but only after a long period in which
> the file copy achieves only 3MB/s (see A1). During this time gstat reports
> that the disc itself is reading at its maximum of around 28MB/s. After the
> long period of low throughput, the disc drops to 25MB/s but the actual
> transfer rate increases to 25MB/s (see A2).

A1 is strange.  It might be reading too far ahead, but I wouldn't expect
the read-ahead to be discarded soon, so this should make little difference
when reading whole files.

> I've tried to narrow it down but I'm mostly in the dark, so I'll just hand
> over what I found to work for review. I looked at Bruce's changes to
> msdosfs_write and tried to do the same (implement cluster_read) using the
> ext2 and ffs _read methods as a how-to. I think I'm reading ahead too far, or
> too early. I have been unable to interpret the gstat output during the first
> part of the transfer any further.

The ext2 and ffs methods are a good place to start.  Also look at cd9660 --
it is a little simpler.

> The patch below combines Bruce's original patch for msdosfs_write, revised
> for the current text positions, with my attempt to do the same for
> msdosfs_read.
>
> %%
> Index: msdosfs_vnops.c
> ===================================================================
> RCS file: /usr/cvs/src/sys/fs/msdosfs/msdosfs_vnops.c,v
> retrieving revision 1.149.2.1
> diff -u -r1.149.2.1 msdosfs_vnops.c
> --- msdosfs_vnops.c	31 Jan 2005 23:25:56 -0000	1.149.2.1
> +++ msdosfs_vnops.c	29 May 2005 14:10:18 -0000
> @@ -565,14 +567,21 @@
> 			error = bread(pmp->pm_devvp, lbn, blsize, NOCRED, &bp);
> 		} else {
> 			blsize = pmp->pm_bpcluster;
> -			rablock = lbn + 1;
> -			if (seqcount > 1 &&
> -			    de_cn2off(pmp, rablock) < dep->de_FileSize) {
> -				rasize = pmp->pm_bpcluster;
> -				error = breadn(vp, lbn, blsize,
> -				    &rablock, &rasize, 1, NOCRED, &bp);
> +			/* XXX what is the best value for crsize? */
> + 			crsize = blsize * nblks > MAXBSIZE ? MAXBSIZE : blsize * nblks;
> +			if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
> +				error = cluster_read(vp, dep->de_FileSize, lbn,
> +					crsize, NOCRED, uio->uio_resid, seqcount, &bp);

crsize should be just the block size (the cluster size in msdosfs; the
blsize variable here), judging by the corresponding code in all other file
systems.  seqcount gives the amount of read-ahead, and there are algorithms
elsewhere to guess its best value.  I think cluster_read() reads only
physically contiguous blocks, so the amount of read-ahead is not critical
for the clustered case anyway: either there is a large range of contiguous
blocks, in which case reading far ahead isn't bad, or the read-ahead will
be limited by discontiguities.  Giving too large a value for crsize may be
harmful by confusing cluster_read() about discontiguities, or just by
asking it to read the large size when the blocks actually in the file
aren't contiguous.

I think the above handles most cases, so look for problems there first.
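
To be concrete, this is roughly what I have in mind: an untested sketch
of your patched hunk with crsize replaced by plain blsize, so that the
read-ahead amount is left entirely to seqcount:

%%
			blsize = pmp->pm_bpcluster;
			if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
				/*
				 * Pass the plain cluster size; seqcount
				 * alone controls how far cluster_read()
				 * reads ahead.
				 */
				error = cluster_read(vp, dep->de_FileSize,
				    lbn, blsize, NOCRED, uio->uio_resid,
				    seqcount, &bp);
			} else if (seqcount > 1 &&
			    de_cn2off(pmp, lbn + 1) < dep->de_FileSize) {
				/* Old-style read-ahead of one cluster. */
				rablock = lbn + 1;
				rasize = pmp->pm_bpcluster;
				error = breadn(vp, lbn, blsize,
				    &rablock, &rasize, 1, NOCRED, &bp);
			} else {
				error = bread(vp, lbn, blsize, NOCRED, &bp);
			}
%%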

> 			} else {

The above seems to be missing a bread() for the EOF case (before the else).
I don't know what cluster_read() does at EOF.  See cd9660_read() for clear
code.  (In msdosfs_read() there is unfortunately an extra level of
indentation from a special case for directories.)
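
One way to be safe is an explicit EOF check before the clustering branch,
roughly as cd9660_read() does it.  Adding it to the sketch above (untested;
I haven't checked whether cluster_read() already handles the last, partial
cluster correctly by itself):

%%
			/*
			 * The last cluster of the file is read alone so
			 * that nothing asks to read past EOF.  This also
			 * makes the de_cn2off() check in the breadn()
			 * branch of the sketch above redundant.
			 */
			if (de_cn2off(pmp, lbn + 1) >= dep->de_FileSize) {
				error = bread(vp, lbn, blsize, NOCRED, &bp);
			} else if ((vp->v_mount->mnt_flag &
			    MNT_NOCLUSTERR) == 0) {
				error = cluster_read(vp, dep->de_FileSize,
				    lbn, blsize, NOCRED, uio->uio_resid,
				    seqcount, &bp);
			}
			/* ... remaining branches as in the sketch above. */
%%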

> -				error = bread(vp, lbn, blsize, NOCRED, &bp);
> +				rablock = lbn + 1;
> +				if (seqcount > 1 &&
> +					de_cn2off(pmp, rablock) < dep->de_FileSize) {
> +						rasize = pmp->pm_bpcluster;
> +						error = breadn(vp, lbn, blsize,
> +						&rablock, &rasize, 1, NOCRED, &bp);
> +				} else {
> +					error = bread(vp, lbn, blsize, NOCRED, &bp);
> +				}

This part seems to be OK.  (It is just the old code indented.)

> 			}
> 		}
> 		if (error) {
> ...
> %%
>
> With this patch I can get the following transfer rates:
>
> msdosfs reading
>
> # ls -lh /mnt/random2.file
> -rwxr-xr-x  1 root  wheel   1.0G May 29 11:24 /mnt/random2.file
>
> # /usr/bin/time -al cp /mnt/random2.file /vol
>       59.61 real         0.05 user         6.79 sys
>       632  maximum resident set size
>        11  average shared memory size
>        80  average unshared data size
>       123  average unshared stack size
>        88  page reclaims
>         0  page faults
>         0  swaps
>     23757  block input operations **
>      8192  block output operations
>         0  messages sent
>         0  messages received
>         0  signals received
>     16660  voluntary context switches
>     10387  involuntary context switches
>
> Average Rate: 15.31MB/s. (Would be higher if not for the slow start)
>
> ** This figure is 3x that of the UFS2 operations. This must be an indicator
> of what I'm doing wrong, but I don't know what.

This might also be a sign of fragmentation, due either to bad allocation
policies at write time or to write() being unable to allocate well because
of previous fragmentation.

The average rate isn't too bad, despite the extra blocks.

> msdosfs writing
>
> # /usr/bin/time -al cp /vol/random2.file /mnt
>       47.33 real         0.03 user         7.13 sys
>       632  maximum resident set size
>        12  average shared memory size
>        85  average unshared data size
>       130  average unshared stack size
>        88  page reclaims
>         0  page faults
>         0  swaps
>      8735  block input operations
>     16385  block output operations
>         0  messages sent
>         0  messages received
>         0  signals received
>      8856  voluntary context switches
>     29631  involuntary context switches
>
> Average Rate: 18.79MB/s.

For writing there are 2x as many blocks as for ffs2, instead of 3x as for
reading.  What are the input blocks for here?  Better to put the
non-msdosfs part of the source or target in memory so that it doesn't get
counted, or try mount -v (it gives sync and async read/write counts for
individual file systems).

2x is actually believable, while ffs2's counts aren't.  It corresponds to
a block size of 64K (1.0GB / 16384 output operations = 64K per operation),
which is what I would expect for the unfragmented case.

> To compare with UFS2 + softupdates on the same system / disc.
>
> ufs2 reading
>
> # /usr/bin/time -al cp /mnt/random2.file /vol
>       42.39 real         0.02 user         6.61 sys
>       632  maximum resident set size
>        12  average shared memory size
>        87  average unshared data size
>       133  average unshared stack size
>        88  page reclaims
>         0  page faults
>         0  swaps
>      8249  block input operations
>      8193  block output operations
>         0  messages sent
>         0  messages received
>         0  signals received
>      8246  voluntary context switches
>     24617  involuntary context switches
>
> Average Rate: 20.89MB/s.

Isn't it 24.16MB/s?

8192 i/o operations seems too small.  It corresponds to a block size of
128K (1.0GB / 8192 operations = 128K per operation).  Most drivers don't
actually support doing i/o of that size (most have a limit of 64K), so if
they get asked to then it is a bug, and this bug is common or ubiquitous.
The block size to use for clusters is in mnt_iosize_max, and this is set
in various wrong ways, often or always to MAXPHYS = 128K.  This usually
makes little difference except to give misleading statistics: clustering
tends to produce blocks of size 128K, and the block i/o counts report
blocks of that size, but smaller blocks are sent to the hardware.  I'm not
sure whether libdevstat sees the smaller blocks; I think it doesn't.
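
For reference, from memory this is roughly how mnt_iosize_max ends up as
128K at mount time (a sketch modelled on ffs's mount code; details may
differ across versions):

%%
	/*
	 * mnt_iosize_max starts at DFLTPHYS (64K) and is raised to
	 * whatever the driver claims in si_iosize_max, clamped to
	 * MAXPHYS (128K).  Drivers that claim MAXPHYS but can really
	 * only do 64K transfers are where the bug comes from.
	 */
	mp->mnt_iosize_max = DFLTPHYS;
	if (devvp->v_rdev->si_iosize_max != 0)
		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
	if (mp->mnt_iosize_max > MAXPHYS)
		mp->mnt_iosize_max = MAXPHYS;
%%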

> [... ufs2 writing similar to reading]

Bruce

