svn commit: r218604 - head/sbin/fsck_ffs

Bruce Evans brde at optusnet.com.au
Sun Feb 13 16:07:14 UTC 2011


On Sun, 13 Feb 2011, Kostik Belousov wrote:

> On Sun, Feb 13, 2011 at 11:30:56PM +1100, Bruce Evans wrote:
>> On Sat, 12 Feb 2011, Konstantin Belousov wrote:
>>
>>> Log:
>>> In checker, read journal by sectors.
>>>
>>> Due to UFS insistence to pretend that device sector size is 512 bytes,
>>
>> Do you mean that ffs_fsck does this?  UFS doesn't exist, and ffs in the
>> kernel didn't use device sectors at all until recently (it used DEV_BSIZE
>> units, but those are not sectors but are the units for bread() and friends).
> Yes, the journal writer started using DEV_BSIZE as _sector_ size to
> write the journal blocks, and fixing this was the point of the series
> of commits. Journal was unoperable on the providers with non-512 byte
> sectors.

I now think userland shouldn't use any kernel sizes (DEV_BSIZE, ffs
frag size, or ffs block size, otherfs block sizes) for i/o.  It can
use the disk sector size (if any), but that will be bad for performance
so it normally should't be used either, and there is no need to determine
it (just use a large size that works).  Some reblocking from fs block
sizes to good i/o sizes may be needed, especially if the sizes that
works is large (say 256K -- you would use it as a cache and reblock to
it).  This is best done using block devices so that even 1-byte i/o's
are reblocked with no extra code, but block devices are axed in FreeBSD.

>>> Modified: head/sbin/fsck_ffs/fsck.h
>>> ==============================================================================
>>> --- head/sbin/fsck_ffs/fsck.h	Sat Feb 12 13:12:45 2011	(r218603)
>>> +++ head/sbin/fsck_ffs/fsck.h	Sat Feb 12 13:17:14 2011	(r218604)
>>> @@ -268,6 +268,7 @@ char	snapname[BUFSIZ];	/* when doing sna
>>> char	*cdevname;		/* name of device being checked */
>>> long	dev_bsize;		/* computed value of DEV_BSIZE */
>>> long	secsize;		/* actual disk sector size */
>>> +long	real_dev_bsize;
>>
>> If `secsize' is the actual disk sector size, then we don't need another
>> variable giving the real sector size.
>>
>> The new variable has no comment.  What is the difference between a sector
>> size and a dev_bsize?  There is enough confusion between DEV_BSIZE and
>> sector sizes already.  DEV_BSIZE has come to means nothing to do with
>> sectors and little to do with devices or block sizes.  It is just a minimal
>> i/o sizes for use in old interfaces (ones that didn't want to use more than
>> 32 bits for disk addresses, but now use 64 bits anyway, so that they can
>> address 72 bits after multiplication by DEV_BSIZE).
> There is no "actual" difference (I tired of the word actual when debugging
> the patches), they are all DEV_BSIZE.

Any chance of fixing the comment or removing one of the variables?

I now see that secsize really is the actual sector size, modulo bugs (*),
unlike in newfs, and it is used mainly to do i/o of a working size, with
some rebocking, in fsutil.c.

(*) Old bug: there was no call to DIOCGSECTORSIZE to determine the
sector size, so if the label didn't give it then it could easily be
wrong.  New bug: DIOCGSECTORSIZE is now called, but it is only used
to set real_dev_bsize.  So the old code in fsutil.c doesn't benifit
from the better determination of the sector size.  I think it just
fails if the label didn't give the size and the default of DEV_BSIZE
is too small to work.  suj.c also seems to be in error in calling
bread() instead of blread().  This prevents any chance of the old
code's reblocking to size secsize (I'm not sure if there is enough
reblocking to work, but the old code only uses size secsize).  All old
parts of fsck_ffs call bread() via blread().  The only places that
call bread() directly are suj.c (many times) and gjournal.c (once).

real_dev_bsize still seems bogus.  DEV_BSIZE is the real DEV_BSIZE.
I think you mainly want an i/o size that works.  This can be secsize,
as in old code, provided that is initialized correctly.

In my fixes in fsck_msdosfs, the i/o size that works (except while
probing for one that works) is named iosize.

>>> Modified: head/sbin/fsck_ffs/setup.c
>>> ==============================================================================
>>> --- head/sbin/fsck_ffs/setup.c	Sat Feb 12 13:12:45 2011 (r218603)
>>> +++ head/sbin/fsck_ffs/setup.c	Sat Feb 12 13:17:14 2011 (r218604)
>>> @@ -446,7 +446,7 @@ sblock_init(void)
>>> 	if (sblk.b_un.b_buf == NULL || asblk.b_un.b_buf == NULL)
>>> 		errx(EEXIT, "cannot allocate space for superblock");
>>> 	if ((lp = getdisklabel(NULL, fsreadfd)))
>>> -		dev_bsize = secsize = lp->d_secsize;
>>> +		real_dev_bsize = dev_bsize = secsize = lp->d_secsize;
>>> 	else
>>> 		dev_bsize = secsize = DEV_BSIZE;
>>> }
>>
>> Both the variables are real enough here, provided getdisklabel() works.
>> Otherwise, you are in trouble and should fail instead of defaulting the
>> size unless the size is set by another means.  newfs seems to fail, and
>> I'm sure newfs_msdos does fail if the sector size is unknown.
>>
>>> Modified: head/sbin/fsck_ffs/suj.c
>>> ==============================================================================
>>> --- head/sbin/fsck_ffs/suj.c	Sat Feb 12 13:12:45 2011	(r218603)
>>> +++ head/sbin/fsck_ffs/suj.c	Sat Feb 12 13:17:14 2011	(r218604)
>>> @@ -28,6 +28,7 @@
>>> __FBSDID("$FreeBSD$");
>>>
>>> #include <sys/param.h>
>>> +#include <sys/disk.h>
>>> #include <sys/disklabel.h>
>>> #include <sys/mount.h>
>>> #include <sys/stat.h>
>>> @@ -201,6 +202,11 @@ opendisk(const char *devnam)
>>> 		    disk->d_error);
>>> 	}
>>> 	fs = &disk->d_fs;
>>> +	if (real_dev_bsize == 0 && ioctl(disk->d_fd, DIOCGSECTORSIZE,
>>> +	    &real_dev_bsize) == -1)
>>> +		real_dev_bsize = secsize;
>>> +	if (debug)
>>> +		printf("dev_bsize %ld\n", real_dev_bsize);
>>> }
>>>
>>> /*
>>
>> If this gives a different "real_dev_bsize", and this is different from the
>> sector size in the label or defaulted-to, then the latter is presumably
>> wrong, and we're left with a `secsize' that is not the "actual sector size".
> Yes, secsize is not the "actual sector size", it is equal to DEV_BSIZE,
> see setup.c:sblock_init().

DEV_BSIZE is only the default there.  When there is a label, all of
real_dev_bsize, dev_bsize and secsize are set to lp->d_secsize.
lp->d_secsize is set by g_sectorsize(), so it shouldn't be completely
wrong.  I don't believe in g, but it is an anachronism to start using
DIOCGSECTORSIZE in fsck_ffs long after bsdlabel was changed to use
g_sectorsize() instead.

>> newfs doesn't have the  precedence that I want:
>> - sectorsize set on the command line (-S option) has precedence.  [Bug;
>>   -S 0 is needed to cancel any previous setting of sectorsize on the
>>   command line by -S or maybe by -T, but it is treated as an error.]
>>   fsck_ffs unfortunately doesn't seem to have any -S or -T option.  These
>>   used to be unneeded since sector size weren't used, and labels worked
>>   and contain enough info to locate the superblock where there is more
>>   info.
>> - then try setting sectorsize using the ioctl if sectorsize is not already
>>   set.  [Bug: no error checking for this ioctl.  If it fails, then it
>>   may clobber sectorsize, which breaks the next step.]
>> - then set sectorsize from the label if there is a label and sectorsize
>>   is not already set
>> - fail if sectorsize is not set
> newfs.c does not do what you describe, in fact. After going to all
> troubles finding the sector size, it performs the following:

Er that is what I described -- first it determines the sector size as
described above; then it clobbers the sector size as below::

>
> 	if (sectorsize != DEV_BSIZE) {		/* XXX */
> 		int secperblk = sectorsize / DEV_BSIZE;
>
> 		sectorsize = DEV_BSIZE;
> 		fssize *= secperblk;
> 		if (pp != NULL)
> 			pp->p_size *= secperblk;
> 	}
>
> And now layout calculations happen as if sector size is 512 bytes.

except it preserves the original size in `realsectorsize' before clobbering
`sectorsize'.  My description of these points quoted far below.

> I was
> puzzled for long time why libufs givess me 512 in d_bsize for a volume over
> 4096-bytes per sector md:
> 	disk->d_bsize = fs->fs_fsize / fsbtodb(fs, 1);

Yes, it is very confusing, and becomes even more confusing with even more
block size variables that don't match their name and/or comments.

This reminds me that libufs may be part of the problem.  ffs utilities
used to have very simple read and write routines.  Now they use libufs.
libufs is so uneasy to use that newfs has even lost its error checking
for its most critical write -- sbwrite().  The error checking used to
be in the low-level write routine (it was to print a message and exit
IIRC).  Now libufs just returns an error code, and none of the sbwrite()
calls in newfs check for errors.

Back to the current problem.  libufs needs a block size in a couple
of places, and it uses disk->d_bsize.  The above is its main initialization.
There are considerable possibilities for d_bsize being inconsistent
with ffs, or with the faked block sizes in utilites, or just not working
since it is not a multiple of the sector size.  Here are all reference
to d_bsize in libufs:

% block.c:	cnt = pread(disk->d_fd, p2, size, (off_t)(blockno * disk->d_bsize));
% block.c:	cnt = pwrite(disk->d_fd, data, size, (off_t)(blockno * disk->d_bsize));
% block.c:	ioarg[0] = blockno * disk->d_bsize;

For the offset calculations, disk->d_bsize should be spelled DEV_BSIZE.
Require all callers to pass block numbers in DEV_BSIZE units.  Don't try
to be more general than the kernel.  fsdbtodb() will convert fs block
numbers into DEV_BSIZE units for us in most cases.

% bread.3:.Va d_bsize
% libufs.h:	long d_bsize;		/* device bsize */

The comment adds nothing.  As elsewhere, this has little to do with devices,
but is block size for talking to the historical kernel interfaces bread(9)
and friends.  It is always DEV_BSIZE in the kernel.  That won't change,
and neither will DEV_BSIZE change from 512, since these are only historical
interfaces.  Any interface changes would be to use units of 1 byte.

% sblock.c:	disk->d_bsize = fs->fs_fsize / fsbtodb(fs, 1);

Same calculation as in fsck_ffs.  Always gives DEV_BSIZE.

% sblock.c:	disk->d_sblock = superblock / disk->d_bsize;
% sblock.c:		disk->d_sblock = disk->d_fs.fs_sblockloc / disk->d_bsize;
% type.c:	disk->d_bsize = 1;

Hopefully nver used.

>> Then the hacks involving realsectorsize:
>> - set it to sectorsize
>> - if sectorsize != DEV_BSIZE, change it to DEV_BSIZE and fix up related
>>   quantities (I hope the change to the partition table is never written).
>>   Now `sectorsize' is no longer "actual", though it was actual before the
>>   hacks.  We continue with the actual sector size recorded in
>>   realsectorsize.
> realsectorsize is not used. You are citing the piece of code I pasted above.

I only said that realsectorsize is inititialized :-).  But it is used, just
once, a few lines later to un-rescale pp->p_size.  (This is also confusing.
pp->p_size is in units of sectors and shouldn't need rescaling.  It just
needs initialization if there is no label to begin with.  When there is
a label to begin with, and the sector size is not DEV_BSIZE, we seem to
end up with no change by bogusly rescaling then undoing this.  The case
where their is no label to begin with has less-localized code and I didn't
check it.)

>>> @@ -2262,7 +2268,7 @@ suj_build(void)
>>> 		rec = (union jrec *)seg->ss_blk;
>>> 		for (i = 0; i < seg->ss_rec.jsr_cnt; off += JREC_SIZE,
>>> 		rec++) {
>>> 			/* skip the segrec. */
>>> -			if ((off % DEV_BSIZE) == 0)
>>> +			if ((off % real_dev_bsize) == 0)
>>> 				continue;
>>> 			switch (rec->rec_jrefrec.jr_op) {
>>> 			case JOP_ADDREF:
>>> ...
>>
>> This is all sort of backwards.  DEV_BSIZE is the real DEV_BSIZE.  Old code
>> that uses units of DEV_BSIZE spells this as dev_bsize although it is
>> constant.  New code that uses units of sectors spelled this as DEV_BSIZE
>> although the correct size is variable; now the new code spelled this as
>> real_dev_bsize, which is variable but unrelated to dev_bsize = DEV_BSIZE
>> :-).
> We need to write the journal block atomically, and device/geom level only
> provides that:
> - we must write at least sector;
> - any write bigger then sector is not atomic.

Does it really operate on active file systems/devices?

We use libufs, and it is not clear how atomic that is.  I think it
should provide caching.  I thought that the old fsck_ffs code would
do i/o of the correct size (it uses secsize), but there are minor
problems:
- it does some caching.  Seems to be not a problem here, since
   blwrite() doesn't use the cache (as in the kernel).
- blwrite() first tries to write with the specified size.  It does
   no reblocking.  It doesn't actually use secsize.  If the first try
   fails, then it retries 1 "sector" at a time using the non-sector
   size dev_bsize.
- blread() is similar, except it retries using the sector size secsize
   and has different printfs for errors  when dev_bsize differs from both
   secsize and 1.
We can easily gaurantee than blwrite() doesn't do unwanted caching or
reblocking.  It writes directly.  It reblocks in the error case, but
that is null provided the original size is the sector size blwrite()
uses the correct size for the retry.  Original sizes smaller than the
sector size just won't work, unless the sector size is virtual and
larger than the physical sector size.

Bruce


More information about the svn-src-all mailing list