HAST + ZFS + NFS + CARP

Julien Cigar julien at perdition.city
Thu Aug 11 11:49:25 UTC 2016


On Thu, Aug 11, 2016 at 01:22:05PM +0200, Borja Marcos wrote:
> 
> > On 11 Aug 2016, at 13:02, Julien Cigar <julien at perdition.city> wrote:
> > 
> > On Thu, Aug 11, 2016 at 12:15:39PM +0200, Julien Cigar wrote:
> >> On Thu, Aug 11, 2016 at 11:24:40AM +0200, Borja Marcos wrote:
> >>> 
> >>>> On 11 Aug 2016, at 11:10, Julien Cigar <julien at perdition.city> wrote:
> >>>> 
> > >>>> As I said in a previous post I tested the zfs send/receive approach (with
> > >>>> zrep) and it works (more or less) perfectly, so I concur with everything you
> > >>>> said, especially about off-site replication and synchronous replication.
> >>>> 
> > >>>> Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the moment.
> > >>>> I'm in the early tests and haven't done any heavy writes yet, but ATM it
> > >>>> works as expected and I haven't managed to corrupt the zpool.
> >>> 
> >>> I must be too old school, but I don’t quite like the idea of using an essentially unreliable transport
> >>> (Ethernet) for low-level filesystem operations.
> >>> 
> >>> In case something went wrong, that approach could risk corrupting a pool. Although, frankly,
> > 
> > Now I'm thinking of the following scenario:
> > - filer1 is the MASTER, filer2 the BACKUP
> > - on filer1, a zpool "data" mirrored over loc1, loc2, rem1 and rem2
> > (where rem1 and rem2 are iSCSI disks)
> > - the pool is mounted on MASTER
> > 
> > Now imagine that the replication interface corrupts packets silently,
> > but data are still written to rem1 and rem2. Will ZFS immediately
> > detect that the blocks written to rem1 and rem2 are corrupted?
> 
> As far as I know ZFS does not read after write. It can detect silent corruption when reading a file
> or a metadata block, but that will happen only when requested (file), when needed (metadata)
> or in a scrub. It doesn't do preemptive read-after-write, I think, or at least
> I don't recall having read that it does.

Nope, ZFS doesn't read after write. So in theory your pool can become
corrupted in the following case:

T1: a zpool scrub is run, everything is OK
T2: the replication interface starts to silently corrupt packets
T3: corrupted data blocks are written to the two iSCSI disks while
valid data blocks are written to the two local disks
T4: the corrupted blocks are never read back, so ZFS doesn't notice them
T5: the MASTER dies before another zpool scrub is run
T6: failover happens, the BACKUP becomes the new MASTER and tries to
import the pool -> corruption -> fail >:O

Although very, very unlikely, this scenario is theoretically possible.
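
To make the failure mode concrete, here is a little toy model of it (plain
Python; every name in it is made up, and it is of course nothing like the
real ZFS internals): checksums are only verified when a block is read back
or scrubbed, never at write time, so the bad copies sit there unnoticed
until only bad copies are left.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Disk:
    def __init__(self, name, corrupting_transport=False):
        self.name = name
        # simulates the flaky replication interface between MASTER and BACKUP
        self.corrupting_transport = corrupting_transport
        self.blocks = {}

    def write(self, block_id, data: bytes):
        if self.corrupting_transport:
            data = bytes([data[0] ^ 0xFF]) + data[1:]  # silent bit flip in transit
        self.blocks[block_id] = data                    # no read-after-write check

class MirrorPool:
    def __init__(self, disks):
        self.disks = disks
        self.checksums = {}  # stands in for the checksums kept in parent block pointers

    def write(self, block_id, data: bytes):
        self.checksums[block_id] = checksum(data)  # checksum of the *intended* data
        for d in self.disks:
            d.write(block_id, data)

    def read(self, block_id) -> bytes:
        # verification happens here, at read time -- not at write time
        for d in self.disks:
            data = d.blocks[block_id]
            if checksum(data) == self.checksums[block_id]:
                return data  # a good copy still exists somewhere -> self-heal possible
        raise IOError("block %s: all copies fail checksum -> unrecoverable" % block_id)

# T2/T3: the replication link corrupts silently, only the iSCSI legs are hit
loc1, loc2 = Disk("loc1"), Disk("loc2")
rem1 = Disk("rem1", corrupting_transport=True)
rem2 = Disk("rem2", corrupting_transport=True)
pool = MirrorPool([loc1, loc2, rem1, rem2])
pool.write("blk0", b"important data")

# T4: nothing reads blk0 back, so nobody notices; a read *now* would still
# succeed (and could self-heal) via loc1/loc2
assert pool.read("blk0") == b"important data"

# T5/T6: the MASTER dies, the BACKUP imports the pool with only the iSCSI disks
survivor = MirrorPool([rem1, rem2])
survivor.checksums = pool.checksums
try:
    survivor.read("blk0")
except IOError as e:
    print(e)  # both remaining copies are corrupt -> the feared failure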

BTW, any idea whether the iSCSI protocol applies some sort of checksum to
the payload?
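
FWIW the iSCSI spec (RFC 7143) does seem to define optional
HeaderDigest/DataDigest keys using CRC32C, but they are negotiated at login
and often left at None in practice, so I wouldn't count on them being in
place end-to-end. Purely as an illustration, a minimal (and deliberately
slow, bit-by-bit) sketch of that CRC-32C:

def crc32c(data: bytes) -> int:
    """CRC-32C (Castagnoli), the polynomial used by the optional iSCSI digests."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78  # reflected Castagnoli polynomial
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

# well-known check value for CRC-32C
assert crc32c(b"123456789") == 0xE3069283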

> 
> Silent corruption can be overcome by ZFS as long as it isn't too much. In my case with the
> evil HBA it was like one block operation error in an hour of intensive I/O. In normal operation it could
> be a block error in a week or so. With that error rate, the chance of a random I/O error corrupting the
> same block in three different devices (it's a raidz2 vdev) is really remote.
> 
> But again, I won't push more at the risk of annoying you to death. Just keep in mind that your I/O
> throughput will be bound by your network and iSCSI performance anyway ;)
> 
> 
> 
> 
> Borja.
> 
> 
> P.D: I forgot to reply to this before:
> 
> >> Yeah.. although you could have silent data corruption with any broken
> >> hardware too. Some years ago I suffered a silent data corruption due to 
> >> a broken RAID card, and had to restore from backups..
> 
> Ethernet hardware is designed with the assumption that the loss of a packet is not such a big deal.
> Shit happens on SAS and other specialized storage networks too, of course, but you should expect it
> to happen at least a bit less often. ;)
> 
> 

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.