Panic in ffs_valloc (Was: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!)
Lev Serebryakov
lev at FreeBSD.org
Fri Mar 1 20:23:02 UTC 2013
Hello, Kirk.
You wrote on 1 March 2013 at 22:00:51:
>> As far as I understand, if this theory is right (file system
>> corruption that went unnoticed by a "standard" fsck), it is a bug in
>> FFS SU+J too, as the file system should not be corrupted by reordered
>> writes (provided the writes are properly reported as completed, even
>> if they were reordered).
KM> If the bitmaps are left corrupted (in particular if blocks are marked
KM> free that are actually in use), then that panic can occur. Such a state
KM> should never be possible when running with SU even if you have crashed
KM> multiple times and restarted without running fsck.
I do run fsck every time the server crashes (OK, about every half a
year) due to my awkward experiments on the live system, but I run it
the way it runs by default since the upgrade to 9-STABLE: the journaled
check, not a full old-fashioned pass.
KM> To reduce the number of possible points of failure, I suggest that
KM> you try running with just SU (i.e., turn off the SU+J journalling).
KM> You can do this with `tunefs -j disable /dev/fsdisk'. This will
KM> turn off journalling, but not soft updates. You can verify this
KM> by then running `tunefs -p /dev/fsdisk' to ensure that soft updates
KM> are still enabled.
And wait another half a year :)
I'm trying to reproduce this situation in a VM (VirtualBox with
virtual HDDs), but no luck (yet?).
KM> I will MFC 246876 and 246877 once they have been in head long enough
KM> to have confidence that they will not cause trouble. That means at
KM> least a month (well more than the two weeks they have presently been
KM> there).
KM> Note these changes only pass the barrier request down to the GEOM
KM> layer. I don't know whether it actually makes it to the drive layer
KM> and if it does whether the drive layer actually implements it. My
KM> goal was to get the ball rolling.
I have mixed feelings about these barriers. IMHO, all writes to
UFS (FFS) can and should be divided into two classes: data writes and
metadata writes (including the journal, as FFS doesn't have data
journaling). IMHO (this is the last time I type these four letters,
but please add them before and after each of my sentences as you read,
since I'm not an FS expert of any grade), data writes could be done on
a best-effort basis until fsync() is called (or the file is opened
with the appropriate flag, which is equivalent to an automatic
fsync() after each write). They could be delayed, reordered, etc. But
metadata should have some strong guarantees (and fsync()'ed data too,
of course).
Such a division would allow the best possible performance along with
consistent FS metadata (maybe not consistent user data -- but every
application which needs strong guarantees, like an RDBMS, uses
fsync() anyway).
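The fsync() contract that the argument above relies on can be sketched
in userspace (Python here rather than kernel C; the file name is made
up for illustration):

```python
import os
import tempfile

def durable_write(path, data):
    # Open, write, then fsync(): only after fsync() returns may the
    # application assume the data has reached stable storage.  Before
    # that, the kernel is free to delay and reorder the data blocks.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_write(path, b"commit record")
```

This is exactly the pattern an RDBMS uses for its write-ahead log: loose
semantics for ordinary data, an explicit fsync() for the records that
must be durable.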
Now you add a "BARRIER" write. It looks too strong to use often.
It will force writing of ALL data from the caches, even if your
intention is to write only 2 or 3 blocks of metadata. It could solve
problems with FS metadata, but it will degrade performance, especially
under multithreaded load. An inode-map update for creating a 0-byte
file by one process (protected with a barrier) will flush the whole
data cache (maybe hundreds of megabytes) of another one.
It is better than nothing, but it is not the best solution. Every
write should be marked as either "critical" or "loose", and
critical-marked buffers (BIOs) must be written ASAP and in order with
respect to all other _critical_ BIOs (not all BIOs issued after them,
with or without the flag). So a barrier should affect only other
barriers (ordered writes). The default "loose" semantics (for data)
would be exactly what we have now.
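The proposed two-class scheme can be modelled with a toy scheduler (all
names here are hypothetical, not real GEOM structures): loose data BIOs
may be sorted by block number, the way a RAID5-like layer combines
spatially adjacent writes, while critical metadata BIOs keep their
submission order relative to each other.

```python
from dataclasses import dataclass

CRITICAL, LOOSE = "critical", "loose"

@dataclass
class Bio:
    blkno: int   # target block number on the media
    kind: str    # CRITICAL (metadata) or LOOSE (data)

def schedule(queue):
    # Critical BIOs: submission order must be preserved among themselves.
    critical = [b for b in queue if b.kind == CRITICAL]
    # Loose BIOs: free to reorder, so sort spatially for better combining.
    loose = sorted((b for b in queue if b.kind == LOOSE),
                   key=lambda b: b.blkno)
    # One legal ordering: metadata first, data sorted by block number.
    return critical + loose

q = [Bio(90, LOOSE), Bio(5, CRITICAL), Bio(10, LOOSE), Bio(2, CRITICAL)]
order = [(b.kind, b.blkno) for b in schedule(q)]
```

Note that the two critical BIOs (blocks 5 then 2) stay in issue order
even though a pure elevator sort would swap them, while the loose ones
are reordered freely.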
It is very hard to implement the contract "it only ensures that
buffers written before that buffer get to the media before any buffers
written after it" in any other way than a full flush, which, as I
stated above, will hurt performance in such cases as efficient
RAID5-like implementations, which gain a lot from combining writes by
their spatial (not temporal) locality.
And for a full flush (which is needed sometimes, of course) we
already have the BIO_FLUSH command.
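To make the cost difference concrete, here is a toy count of how many
cached buffers each scheme forces out (the numbers and function names
are illustrative only, not real GEOM code):

```python
def flush_barrier_cost(cached_loose, cached_critical):
    # Emulating a barrier with a full flush: every cached buffer,
    # related to the barrier or not, must hit the media first.
    return cached_loose + cached_critical + 1  # +1 for the barrier write

def ordered_barrier_cost(cached_loose, cached_critical):
    # Hypothetical ordering-only barrier: only earlier critical
    # (barrier) writes must reach the media; loose data may stay
    # cached and keep being combined.
    return cached_critical + 1

# ~100 MB of cached 4 KB data blocks (25000 buffers), 3 metadata buffers:
full = flush_barrier_cost(25000, 3)       # everything forced out
ordered = ordered_barrier_cost(25000, 3)  # only the metadata and barrier
```

Under this toy model, the full-flush emulation writes 25004 buffers
where an ordering-only barrier would need just 4.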
Anyway, I'll support the new semantics in geom_raid5 ASAP. But,
unfortunately, for now it can only be supported as a simple write
followed by BIO_FLUSH -- not very efficient :(
--
// Black Lion AKA Lev Serebryakov <lev at FreeBSD.org>