pNFS mirror file distribution (big picture question?)

Rick Macklem rmacklem at uoguelph.ca
Fri May 25 21:14:07 UTC 2018


Hi,

#1 The code currently in projects/pnfs-planb-server allows creation of sets of
mirrored data servers (DSs). For example, the "-p" nfsd option argument:
nfsv4-data0#nfsv4-data1,nfsv4-data2#nfsv4-data3
defines two mirrored sets of data servers with two servers in each one.
("#" separates mirrors within a mirror set)

I did this a couple of years ago, in part because I thought having a well-defined
"mirror" for a DS would facilitate mirror recovery.
Now that I have completed the mirror recovery code, a defined mirror set is no
longer needed.

#2 An alternate mirroring approach would be what I might call the random/distributed
approach, where each file is distributed on any two (or more) of the DSs.
For this approach, the "-p" nfsd option argument:
nfsv4-data0,nfsv4-data1,nfsv4-data2,nfsv4-data3
defines four DSs and a separate flag would say "two way mirroring", so
each file would be placed on 2 of the 4 DSs.
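To make the placement in #2 concrete, here is a small sketch (not the actual MDS code; the names and the use of a uniform random choice are assumptions for illustration) of picking 2 of the 4 DSs for each new data file:

```python
# Sketch of approach #2 placement: with a flat DS list and a mirror
# count, each new file lands on that many DSs chosen at random, so
# storage gets consumed roughly evenly across all of the DSs.
import random

def place_file(ds_list, mirror_count):
    """Pick mirror_count distinct DSs from ds_list for a new data file."""
    return random.sample(ds_list, mirror_count)

dss = ["nfsv4-data0", "nfsv4-data1", "nfsv4-data2", "nfsv4-data3"]
mirrors = place_file(dss, 2)
# mirrors is some 2 distinct DSs out of the 4, e.g. data2 and data0
```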

The question is "should I switch the code to approach #2?".

I won't call it elegant, but #1 is neat and tidy, since the sysadmin knows that
a data file ends up either on nfsv4-data0/nfsv4-data1 or on nfsv4-data2/nfsv4-data3.
Assuming the mirrored DSs in a set have the same amount of storage, they will
have the same amount of free space.
--> This implies that they will run out of space at the same time, and the pNFS
      service won't be able to write to files on that mirror set.

With #2, one of the DSs will probably run out of space first. I think a client
trying to write a file on it will then report a write error to the Metadata
Server (MDS), and that will cause the DS to be taken offline.
Then, the write will succeed on the other mirror and things will continue to run.
Eventually all the DSs will fill up, but hopefully a sysadmin can step in and fix the
"out of space" problem before that point.
Another advantage I can see for #2 is that it gives the MDS more flexibility than #1
when it chooses which DSs to create the data files on.
(It will be less "neat and tidy", but the sysadmin can find out from the MDS which
DSs store the data for a file on a "per file" basis.)
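The "out of space" handling described above can be sketched as follows (an assumption about the flow based on the description in this post, not the actual client/MDS logic):

```python
# Sketch of the #2 failure flow: a write is tried on each mirror of a
# file; a DS that reports ENOSPC is taken offline (as the MDS would do
# after the client reports the error), and the write still succeeds as
# long as at least one mirror accepts it.
def write_with_failover(mirrors, online, write_fn):
    """Return the list of DSs the write succeeded on; drop full DSs
    from the online set."""
    succeeded = []
    for ds in mirrors:
        if ds not in online:
            continue
        if write_fn(ds) == "ENOSPC":
            online.discard(ds)      # MDS takes the full DS offline
        else:
            succeeded.append(ds)
    return succeeded

# Simulate nfsv4-data0 being full:
online = {"nfsv4-data0", "nfsv4-data1"}
full = {"nfsv4-data0"}
ok = write_with_failover(["nfsv4-data0", "nfsv4-data1"], online,
                         lambda ds: "ENOSPC" if ds in full else None)
# The write succeeds on nfsv4-data1 and nfsv4-data0 goes offline.
```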

James Rose was asking about "manual migration", which refers to a sysadmin running
a command that moves a data file from one DS to another. It is almost the same as
what is already done for mirror recovery and is a pretty trivial addition for #2.
For #1, it can be done, but is more work.
(Others that are more clever than I could use the "manual migration" syscall
 to implement automagic versions to try and balance storage use and I/O load.)
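One "automagic" version of that idea might look like the sketch below (the migrate() helper is hypothetical, standing in for whatever command or syscall wraps the manual migration; picking by free space alone is just one possible policy):

```python
# Sketch of a storage balancer built on manual migration: move one data
# file from the DS with the least free space to the DS with the most.
def rebalance(free_space, files_on, migrate):
    """free_space: DS name -> free bytes; files_on: DS name -> file list.
    migrate(f, src, dst) is a hypothetical wrapper for the migration
    syscall. Returns the (file, src, dst) moved, or None."""
    src = min(free_space, key=free_space.get)   # fullest DS
    dst = max(free_space, key=free_space.get)   # emptiest DS
    if src != dst and files_on[src]:
        f = files_on[src].pop(0)
        files_on[dst].append(f)
        migrate(f, src, dst)
        return (f, src, dst)
    return None

free = {"nfsv4-data0": 10, "nfsv4-data1": 90}
files = {"nfsv4-data0": ["fileA"], "nfsv4-data1": []}
moves = []
result = rebalance(free, files, lambda f, s, d: moves.append((f, s, d)))
# fileA moves from the full nfsv4-data0 to the emptier nfsv4-data1.
```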

Given easier migration and what I think is better handling of "out of space" failures,
I am leaning towards switching the code to #2.

What do others think? rick

