Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!

Don Lewis truckman at FreeBSD.org
Wed Mar 6 07:15:23 UTC 2013


On  1 Mar, Lev Serebryakov wrote:
> Hello, Ivan.
> You wrote on 28 February 2013, 21:01:46:
> 
>>>   Kirk once said that delayed writes are OK for SU as long as the
>>>  bottom layer doesn't lie about operation completeness. geom_raid5
>>>  can delay writes (in the hope that subsequent writes will combine
>>>  nicely and avoid a read-calculate-write cycle), but it never marks
>>>  a BIO complete until it really is completed (the layers below
>>>  geom_raid5 return completion). So, every BIO in the wait queue is
>>>  "in flight" from the GEOM/VFS point of view. Maybe that is fatal
>>>  for the journal :(
> IV> It shouldn't be - it could be a bug.
>    I understand that it proves nothing, but I've tried to reproduce
>  the "previous crash corrupts the FS in a journal-undetectable way"
>  theory by killing a virtual system during massive writing to a
>  geom_raid5-based FS (on virtual drives, unfortunately). I've made 15
>  attempts (as it is manual testing, it takes about 1-1.5 hours
>  total), but every time the FS was OK after a double fsck (first with
>  the journal and then without it). Of course, there was MASSIVE loss
>  of data, as the timeout and cache size in geom_raid5 were set very
>  high (sometimes the FS becomes empty after unpacking 50% of an SVN
>  mirror seed, crash, and check), but the FS was consistent every time!

Did you have any power failures that took down the system sometime
before this panic occurred?  By default, FreeBSD enables write caching
on ATA drives:

	kern.cam.ada.write_cache: 1
	kern.cam.ada.0.write_cache: -1  (-1 => use system default value)

That means the drive acknowledges writes as soon as they reach its
cache, long before they reach the media, and is free to reorder them as
it pleases.
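
If you would rather trade some write performance for safety, write
caching can be disabled.  A minimal illustration (the device number is
just an example, and if I remember correctly a runtime change only
takes effect after the device is reset, which is why the setting
usually goes in /boot/loader.conf):

	# check what the drive is actually doing
	sysctl kern.cam.ada.0.write_cache

	# disable write caching for all ada disks at the next boot by
	# adding this line to /boot/loader.conf
	kern.cam.ada.write_cache=0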

When UFS+SU allocates a new inode, it first clears the available bit in
the bitmap and writes the bitmap block to disk before it writes the new
inode contents to disk.  When a file is deleted, the inode is zeroed on
disk before the available bit is set in the bitmap and the bitmap block
is written.  That means that if an inode is marked as available in the
bitmap, then it should be zero.  The panic you experienced happened
when the system was attempting to allocate an inode for a new file: it
peeked at an inode that was marked as available and found that the
inode was non-zero.
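
To make that ordering concrete, here is a minimal sketch in C.  It is
NOT the real FreeBSD soft updates code, which tracks dependencies
between dirty buffers instead of issuing synchronous writes;
bwrite_bitmap() and bwrite_inode() are hypothetical "write this block
and wait for it" helpers used only to show the ordering points:

	#include <string.h>

	struct inode  { char d[128]; };             /* simplified on-disk inode */
	struct bitmap { unsigned char map[1024]; }; /* inode-allocation bitmap  */

	void bwrite_bitmap(struct bitmap *);        /* synchronous block writes */
	void bwrite_inode(struct inode *, int);

	/* Allocate: the bitmap (cleared bit = in use) must reach the disk
	 * before the new inode contents do. */
	void
	alloc_inode(struct bitmap *bm, struct inode *ip, int ino)
	{
		bm->map[ino / 8] &= ~(1 << (ino % 8)); /* clear "available" bit */
		bwrite_bitmap(bm);                     /* ordering point        */
		bwrite_inode(ip, ino);
	}

	/* Free: the zeroed inode must reach the disk before the bitmap
	 * does, preserving the invariant "available => zero". */
	void
	free_inode(struct bitmap *bm, struct inode *ip, int ino)
	{
		memset(ip, 0, sizeof(*ip));
		bwrite_inode(ip, ino);                 /* ordering point        */
		bm->map[ino / 8] |= 1 << (ino % 8);    /* set "available" bit   */
		bwrite_bitmap(bm);
	}

A drive that acknowledges writes while they are still in its cache can
reorder the two writes on either side of an ordering point, which is
exactly the scenario described below.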

What might have happened is that sometime in the past, the system was
in the process of creating a new file when a power failure occurred.
It found an available inode, marked it as unavailable in the bitmap,
and wrote the bitmap block to the drive.  Because write caching was
enabled, the bitmap block was cached in the drive's write cache, and
the drive reported that the write was complete.  After getting this
response, UFS+SU wrote the new inode contents to the drive, where they
were also cached.  The drive then committed the inode contents to the
media.  At this point the power failed, losing all of the contents of
the drive's write cache before
the bitmap block was updated.  When the system was powered up again,
fsck just replayed the journal because you were using SU+J, and didn't
detect the inconsistency between the bitmap and the actual inode
contents (which would require a full fsck).  This damage could remain
latent for quite some time, and wouldn't be found until the filesystem
tried to allocate the inode in question.
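
That is also why it is worth forcing a full fsck once in a while on an
SU+J filesystem that may have seen a power loss.  The -f flag makes
fsck do a full check instead of just replaying the journal (the device
name here is only an example):

	fsck -f /dev/ada0p2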


