fs/udf: vm pages "overlap" while reading large dir

Tue Feb 12 12:33:42 UTC 2008

On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:

> In message <47B0CE54.6090204 at icyb.net.ua>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
>
> We have three sizes of relevance here, the sectorsize of the provider,
> the blocksize of the filesystem and the page size of the system.

4. The block size required for bread() and friends (almost always
    DEV_BSIZE).

> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-adviced and more likely asking
> for trouble to do requests that are not multiple of and aligned to
> the sectorsize of the provider.
> 
> So up front, I would say that it is an UDF bug to do 512 byte reads off
> a 2k sector provider.
> 
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffercache (ie: that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
>
> So if the objective is to make UDF work in the short term, your
> change might work, if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.

This bug has nothing to do with subsectors, and very little to do with
udf.  udf is just depending on bread()'s API being unbroken.  That API
requires starting with blocks consisting of whole sectors (else the
subsequent I/O would fail) and converting to blocks of size DEV_BSIZE,
phyexcept for unusual (non-disk?) file systems like nfs (and no others?).
All (?) disk file systems do this conversion.  The bug is just more
noticeable for udf since it is used more often on disks with a block
size != DEV_BSIZE, and it apparently does something that makes the bug
mess up VM more than other file systems.

Of course, it might be better for the API to take blocks in units of
the sector size, but that is not what it takes, and this can't be
changed easily or safely.

The standard macro btodb() is often used to convert from bytes to
blocks of this size, and doesn't support bo_bsize or the bufobj pointer
that would be needed to reach bo_bsize.

ffs mostly uses its fsbtodb() macro, which converts blocks in ffs block
(frag?)-sized units to blocks in DEV_BSIZE units so as to pass them
to unbroken bread() and friends.  ffs initializes bo_bsize to its block
(not frag) size, and then never uses it directly except in ffs_rawread(),
where it is used to check that the I/O is in units of whole sectors
(otherwise, ffs_rawread() has DEV_BSIZE more hard-coded than the rest
of ffs).

The details of fsbtodb() are interesting.  They show signs of previous
attempts to use units of sectors.  From ffs/fs.h:

% /*
%  * Turn filesystem block numbers into disk block addresses.
%  * This maps filesystem blocks to device size blocks.
%  */
% #define	fsbtodb(fs, b)	((daddr_t)(b) << (fs)->fs_fsbtodb)
% #define	dbtofsb(fs, b)	((b) >> (fs)->fs_fsbtodb)

Creation of fs_fsbtodb is left to newfs.  The units of DEV_BSIZE are thus
built in to the on-disk data struct (in a fairly easy to change and/or
override way, but there are a lot of existing disks).  From newfs.c:

% 	realsectorsize = sectorsize;
% 	if (sectorsize != DEV_BSIZE) {		/* XXX */
% 		int secperblk = sectorsize / DEV_BSIZE;
% 
% 		sectorsize = DEV_BSIZE;
% 		fssize *= secperblk;
% 		if (pp != NULL)
% 			pp->p_size *= secperblk;
% 	}
% 	mkfs(pp, special);

Though mkfs() appears to support disk blocks of size <sector size> =
sectorsize, its sectorsize parameter is hacked on here, so it always
generates fs_fsbtodb and other derived parameters for disk blocks of
size DEV_BSIZE, as is required for the unbroken bread() API.

msdosfs is similar to ffs here, except it constructs the shift count
at mount time, to arrange for always converting to DEV_BSIZE blocks for
passing the bread() and friends.

udf is effectively similar, but pessimal and disorganized.  For the
conversion for bread(), it uses btodb().  In udf_bmap(), it constructs
a shift count on every call by subtracting DEV_BSHIFT from its block shift
count.  In udf_strategy(), on every call it constructs a multiplier
instead of a shift count, by dividing its block size by DEV_BSIZE.  It's
weird that the strategy routing, which will soon end up sending sectors
to the disk, is converting to DEV_BSIZE units, a size that cannot be the
size for udf's normal media.

cd9660 uses btodb() for one conversion for bread() and
(<its block shift count> - DEV_BSHIFT) in 7 other conversions to
DEV_BSIZE units.

So it's surprising that any file systems work on CDs and DVDs.  Maybe
the bug affects udf more because udf crosses page boundaries more.

It's also surprising that the bug has such small effects.  This seems
to be because the blkno arg to bread() and friends is little more than
a cookie.  Each fs's strategy routine converts it to a byte offset
later, and could do this using any unique cookie, but actually uses a
simple shift forth and back.  All fs's are consistent in their conversion
to and from DEV_BSIZE or other units.  Not much more than getblk()
needs more than a cookie or uses an inconsistent conversion to a byte
offset.

Bruce