GSoC proposition: multiplatform UFS2 driver

Richard Yao ryao at gentoo.org
Fri Mar 14 20:29:02 UTC 2014


On Mar 14, 2014, at 3:18 PM, Edward Tomasz Napierała <trasz at FreeBSD.org> wrote:

> On 14 Mar 2014, at 19:53, Richard Yao wrote:
>> On 03/14/2014 02:36 PM, Edward Tomasz Napierała wrote:
>>> On 14 Mar 2014, at 16:39, Ian Lepore wrote:
>>>> On Fri, 2014-03-14 at 15:27 +0000, RW wrote:
>>>>> On Thu, 13 Mar 2014 18:22:10 -0800
>>>>> Dieter BSD wrote:
>>>>> 
>>>>>> Julio writes,
>>>>>>> That being said, I do not like the idea of using NetBSD's UFS2
>>>>>>> code. It lacks Soft-Updates, which I consider to make FreeBSD UFS2
>>>>>>> second only to ZFS in desirability.
>>>>>> 
>>>>>> FFS has been in production use for decades.  ZFS is still wet behind
>>>>>> the ears. Older versions of NetBSD have soft updates, and they work
>>>>>> fine for me. I believe that NetBSD 6.0 is the first release without
>>>>>> soft updates.  They claimed that soft updates was "too difficult" to
>>>>>> maintain.  I find that soft updates are *essential* for data
>>>>>> integrity (I don't know *why*, I'm not a FFS guru).
>>>>> 
>>>>> NetBSD didn't simply drop soft updates; they replaced them with
>>>>> journalling, which is the approach used by practically all modern
>>>>> filesystems.
>>>>> 
>>>>> A number of people on the questions list have said that they find
>>>>> UFS+SU to be considerably less robust than the journalled filesystems
>>>>> of other OSes.
>>> 
>>> Let me remind you that some other OSes had problems such as truncation
>>> of files which were _not_ written (XFS), silently corrupting metadata when
>>> there were too many files in a single directory (ext3), and panicking instead
>>> of returning ENOSPC (btrfs).  ;->
>> 
>> Let's be clear that such problems live between the VFS and block layer
>> and therefore are isolated to specific filesystems. Such problems
>> disappear when using ZFS.
> 
> Such problems disappear after fixing bugs that caused them.  Just like
> with ZFS - some people _have_ lost zpools in the past.

People with problems who get in touch with me can usually save their pools. I cannot recall an incident where a user came to me for help and suffered complete loss of a pool. There have, however, been incidents of partial data loss involving user error (running zfs destroy on data you want to keep is bad), faulty memory (that user ignored my warnings about non-ECC memory, put the system into production without running memtest, and then blamed ZFS) and two incidents caused by bugs in ZoL's autotools checks that disabled flushing to disk. Regression tests have since been put into place to catch the errors that permitted the latter two incidents.
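
To illustrate that autotools failure mode, here is a purely illustrative C sketch (not actual ZoL code; the macro and function names are invented): a feature test that fails for the wrong reason leaves the configuration define unset, and the flush path quietly compiles down to a no-op.

/*
 * Purely illustrative, not ZFSOnLinux code: HAVE_DEVICE_FLUSH and
 * flush_device() are invented names.  If a buggy configure check
 * leaves the define unset, the flush silently becomes a no-op.
 */
#include <stdio.h>

/* configure would normally emit this into a generated config header */
/* #define HAVE_DEVICE_FLUSH 1 */

static void
flush_device(void)
{
#ifdef HAVE_DEVICE_FLUSH
        printf("issuing cache flush\n");
#else
        /* silently compiled out; writes may linger in the drive cache */
#endif
}

int
main(void)
{
        flush_device();
        return (0);
}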

> 
>>>> What I've seen claimed is that UFS+SUJ is less robust.  That's a very
>>>> different thing than UFS+SU.  Journaling was nailed onto the side of
>>>> UFS+SU as an afterthought, and it shows.
>>> 
>>> Not really - it was developed rather recently, and with filesystems it usually
>>> shows, but it's not "nailed onto the side": it complements SU operation
>>> by journalling the few things which SU doesn't really handle and which
>>> used to require background fsck.
>>> 
>>> One problem with SU is that it depends on hardware not lying about
>>> write completion.  Journalling filesystems usually just issue flushes
>>> instead.
>> 
>> This point about write completion being reported for unflushed data,
>> with no flushes being issued, could explain the disconnect between RW's
>> statements and what Soft Updates should accomplish. However, it does not
>> change my assertion that placing UFS SU on a ZFS zvol will avoid such
>> failure modes.
> 
> Assuming everything between UFS and ZFS below behaves correctly.

For ZFS, this means that the hardware honors flushes and does not deduplicate data (as SandForce controllers do), so that ditto blocks retain their effect. The latter failure mode does not appear to have been observed in the wild. To my knowledge, the former has never been observed when ZFS is given the physical disks and the SAS/SATA controller does not do write caching of its own. It has, however, been observed on certain iSCSI targets.
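
As a rough sketch of why device-level deduplication matters (my simplification, not ZFS code; the file name and offsets are arbitrary), consider writing the same metadata buffer to two distant offsets. Ditto blocks work along these lines so that a single bad region cannot take out every copy; a controller that deduplicates internally would quietly keep one physical copy and defeat that redundancy.

/*
 * Simplified ditto-block idea: two logical copies of the same
 * metadata at distant offsets.  A deduplicating drive would store
 * them once physically, so the redundancy would only be nominal.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        char meta[512];
        int fd;

        memset(meta, 0, sizeof(meta));
        strcpy(meta, "important metadata block");

        fd = open("vdev.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (pwrite(fd, meta, sizeof(meta), 1 << 20) != (ssize_t)sizeof(meta) ||
            pwrite(fd, meta, sizeof(meta), 1 << 24) != (ssize_t)sizeof(meta)) {
                perror("pwrite");
                return (1);
        }
        fsync(fd);
        close(fd);
        return (0);
}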

>> In ZFS, we have a two-stage transaction commit that issues a
>> flush at each stage to ensure that data goes to disk, no matter what the
>> drive reported. Unless the hardware disobeys flushes, the second stage
>> cannot happen if the first stage does not complete, and if the second
>> stage does not complete, all changes are ignored.
>> 
>> What keeps soft updates from issuing a flush following write completion?
>> If there are no pending writes, it is a no-op. If the hardware lies, then
>> this will force the write out. The internal dependency tracking mechanisms
>> in Soft Updates should make it rather simple to figure out when a flush
>> needs to be issued in case hardware has lied about completion. At a high
>> level, what needs to be done is to batch the writes that can be done
>> simultaneously and separate those that cannot with flushes. If such
>> behavior is implemented, it should have a mount option for toggling it.
>> It simply is not needed on well-behaved devices, such as ZFS zvols.
> 
> As you say, it's not needed on well-behaved devices.  While it could
> help with crappy hardware, I think it would be either very complicated
> (batching, as described), or would perform very poorly.

For ZFS, a well-behaved device is one that honors flushes. As long as flush semantics are obeyed, ZFS should be fine. The only exceptions known to me involve drives that deduplicate ZFS ditto blocks (so far unobserved in the wild), non-ECC RAM (which breaks everything equally) and driver bugs (ZFS does not replace backups). UFS Soft Updates seems to have stricter requirements than ZFS in that IO completion reports must be honest, yet the end result is not as good, since there are no ditto blocks and no Merkle tree of checksums. In all fairness, ZFS relies on IO completion reports too, but for performance purposes, not for consistency.
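
To make the flush-separated, two-stage commit above concrete, here is a minimal userland sketch of my own (not ZFS code; a regular file stands in for the vdev and fsync stands in for a cache flush): the new blocks are written and flushed first, and only then is the commit record that references them written and flushed.

/*
 * My illustration of a flush-separated, two-stage commit, simulated
 * on a regular file.  The file name, offsets and record contents are
 * made up; fsync() stands in for a device cache flush.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
die(const char *msg)
{
        perror(msg);
        exit(1);
}

int
main(void)
{
        char data[512] = "new block tree";
        char commit[512] = "commit record -> block tree at offset 4096";
        int fd;

        fd = open("pool.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                die("open");

        /* Stage 1: write the new blocks into free space, then flush. */
        if (pwrite(fd, data, sizeof(data), 4096) != (ssize_t)sizeof(data))
                die("pwrite data");
        if (fsync(fd) != 0)
                die("fsync");

        /* Stage 2: only now make the commit record point at them, then flush. */
        if (pwrite(fd, commit, sizeof(commit), 0) != (ssize_t)sizeof(commit))
                die("pwrite commit");
        if (fsync(fd) != 0)
                die("fsync");

        close(fd);
        return (0);
}

If the device honors the first flush, a crash before stage two simply leaves the old commit record in place and the half-written tree is never referenced.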

> To be honest, I wonder how many problems could be avoided by
> disabling write cache by default.  With NCQ it shouldn't cause
> performance problems, right?

I think you need to specify which cache causes the problem: the buffer cache (removed in recent FreeBSD and bypassed on Linux by ZFSOnLinux), the RAID controller cache (using it gives good performance numbers, but it is terrible for reliability), or the actual drive write cache (ZFS is okay with this; UFS2 with SU possibly not).
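
For completeness, the drive write cache setting can be inspected and changed from userland on FreeBSD via sysctl. The sketch below is mine; the MIB name (kern.cam.ada.write_cache) and its value semantics are from my memory of ada(4) and should be verified, and a changed value may not take effect until the device is re-probed.

/*
 * Read and optionally set the ATA disk write cache sysctl.  The MIB
 * name and value semantics are assumptions to verify against ada(4);
 * setting it requires root.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
        int val, newval;
        size_t len = sizeof(val);

        if (sysctlbyname("kern.cam.ada.write_cache", &val, &len, NULL, 0) != 0) {
                perror("sysctlbyname");
                return (1);
        }
        printf("kern.cam.ada.write_cache = %d\n", val);

        if (argc > 1) {                 /* e.g. "./wcache 0" to disable */
                newval = atoi(argv[1]);
                if (sysctlbyname("kern.cam.ada.write_cache", NULL, NULL,
                    &newval, sizeof(newval)) != 0) {
                        perror("sysctlbyname(set)");
                        return (1);
                }
        }
        return (0);
}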

