Cluster Filesystem for FreeBSD - any interest?

yf-263 yfyoufeng at 263.net
Mon Jul 4 08:09:05 GMT 2005


On Fri, 2005-07-01 at 17:38 -0700, Bakul Shah wrote:
> > > A couple FS specific suggestions:
> > > - perhaps clustering can be built on top of existing
> > >   filesystems.  Each machine's local filesystem is considered
> > >   a cache and you use some sort of cache coherency protocol.
> > >   That way you don't have to deal with filesystem allocation
> > >   and layout issues.
> > 
> > I see - that's an interesting idea.  Almost like each machine could 
> > mount the shared version read-only, then slap a layer on top that is 
> > connected to a cache coherency manager (maybe there is a daemon on each 
> > node, and the nodes sync their caches via the network) to keep the 
> > filesystems 'in sync'.  Then maybe only one elected node actually writes 
> > the data to the disk.  If that node dies, then another node is elected.
> 
> \begin{handwaving}
> What I was thinking of:

Assume we have clustered metadata servers (MDS), clustered file data
servers (FDS), and clients, linked by any high-speed network such as
Ethernet, Myrinet, FC, etc., implemented either at user level or in the
kernel.
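
Just to make the picture concrete, here is a minimal sketch in C of the
messages I have in mind between client, MDS and FDS.  All struct and
field names are made up, nothing here is existing code:

#include <stdint.h>

/* Client -> MDS: where does this file's data live? */
struct mds_lookup_req {
    char        path[256];
};

/* MDS -> client: the file id and which FDS nodes hold its data. */
struct mds_lookup_reply {
    uint64_t    file_id;
    uint32_t    nreplicas;      /* N copies, see below */
    uint32_t    fds_id[8];      /* FDS nodes holding the data */
};

/* Client -> FDS: read/write the data directly, bypassing the MDS. */
struct fds_io_req {
    uint64_t    file_id;
    uint64_t    offset;
    uint32_t    length;
};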

> - The cluster system assures that there are atleast N copies
>   of every file at N+ separate locations.

This must be assured both at write time and after a client/MDS/FDS
crash.

> - More than N copies may be cached dependign on usage pattern.

Does the server really need a cache?  Is the memory big enough that
cached data gets accessed again before it is overwritten by new data?

> - any node can write.  The system takes care of replication

Yes, any node can write, and the concurrent read/write lock is held
either by the application or by the FS itself to protect the data from
corruption.  Or we can use Google's append-only write mode.
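
A rough sketch of the byte-range lock a client could ask for before it
writes; the lock manager and all names here are hypothetical:

#include <stdint.h>

enum lock_mode { LK_READ, LK_WRITE };

/*
 * Client -> lock manager: lock [offset, offset + length) of file_id.
 * The lock is granted only when no conflicting lock is held, so two
 * writers can never touch the same range at the same time.  In the
 * append-only mode this is not needed: the server itself picks the
 * final offset of each appended record.
 */
struct range_lock_req {
    uint64_t file_id;
    uint64_t offset;
    uint64_t length;
    uint8_t  mode;          /* enum lock_mode */
};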

>   and placement.

Let the MDS decide the file data replication and placement, and let the
FDS carry out the placement.  That also means the FDS needs to manage
its disk space, or at least report its disk space usage to the MDS so
that the MDS knows how to make the decision.
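
A minimal sketch of how the MDS could pick the N FDS nodes for a new
file from the reported free space (simple greedy choice, most free
space first); everything here is hypothetical:

#include <stdint.h>
#include <stdlib.h>

struct fds_info {
    uint32_t id;
    uint64_t free_bytes;    /* as last reported by this FDS */
};

/* Sort FDS nodes by free space, largest first. */
static int
by_free_desc(const void *a, const void *b)
{
    const struct fds_info *x = a, *y = b;

    return (x->free_bytes < y->free_bytes) - (x->free_bytes > y->free_bytes);
}

/* Choose the n FDS nodes with the most free space for a new file. */
static size_t
place_replicas(struct fds_info *all, size_t nall, uint32_t *chosen, size_t n)
{
    size_t i;

    qsort(all, nall, sizeof(*all), by_free_desc);
    if (n > nall)
        n = nall;
    for (i = 0; i < n; i++)
        chosen[i] = all[i].id;
    return n;
}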

> - meta data, directories are implemented *above* this level.

And the metadata also needs to be replicated.

> - more likely you'd want to map file *fragments* to local
>   files so that a file can grow beyond one disk and smaller

Yes, to support really large files.  We need data striping to support
large files and to improve their read/write speed; and we need to pack
very small files together, since they can easily eat up all the inodes
even though plenty of free disk space is left.
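
For the striping part, here is a small sketch of mapping a file offset
to a stripe and an FDS (fixed stripe size, round-robin over the file's
FDS list; the names and the stripe size are made up):

#include <stdint.h>

#define STRIPE_SIZE (1024 * 1024)   /* say, 1 MB stripe units */

struct stripe_loc {
    uint32_t fds_id;        /* which FDS holds this stripe */
    uint64_t stripe_no;     /* index of the stripe within the file */
    uint64_t stripe_off;    /* offset inside that stripe */
};

/* Round-robin a file's stripes over its list of FDS nodes. */
static struct stripe_loc
map_offset(uint64_t offset, const uint32_t *fds_ids, uint32_t nfds)
{
    struct stripe_loc loc;

    loc.stripe_no  = offset / STRIPE_SIZE;
    loc.stripe_off = offset % STRIPE_SIZE;
    loc.fds_id     = fds_ids[loc.stripe_no % nfds];
    return loc;
}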

>   fragements mean you don't have to cache an entire file.

Why would it need to cache the entire file?

> - you still need to mediate access at file level but this
>   is no different from two+ processes accessing a local file.
> Of course, the devil is in the details!
> 
> > > - a network wide stable storage `disk' may be easier to do
> > >   given GEOM.  There are atleast N copies of each data block.
> > >   Data may be cached locally at any site but writing data is
> > >   done as a distributed transaction.  So again cache
> > >   coherency is needed.  A network RAID if you will!
> > 
> > I'm not sure how this would work.  A network RAID with geom+ggate is 
> > simple (I've done this a couple times - cool!), but how does that get me 
> > shared read-write access to the same data?
> 
> What I had in mind something like this: Each logical block is
> backed by N physical blocks at N sites.  Individual
> filesystems live in partitions of this space.  So in effect
> you have a single NFS server per filesystem that deals with
> all metadata+dir lookup but due to caching read access should
> be faster.  When a server goes down, another server can be
> elected.

That is good for read-only access.
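
If I read you right, the mapping would look roughly like this; just a
sketch, the struct is made up:

#include <stdint.h>

#define NCOPIES 3

/*
 * One logical block is backed by NCOPIES physical blocks, each on a
 * different site.  A read may use any valid copy; a write has to
 * update all copies as one distributed transaction.
 */
struct logical_block {
    uint64_t lbn;               /* logical block number */
    uint32_t site[NCOPIES];     /* site holding each copy */
    uint64_t pbn[NCOPIES];      /* physical block number on that site */
    uint8_t  valid[NCOPIES];    /* copy is up to date */
};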

> 
> > :) I understand.  Any nudging in the right direction here would be
> > appreciated.
> 
> I'd probably start with modelling a single filesystem and how
> it maps to a sequence of disk blocks (*without* using any
> code or worrying about details of formats but capturing the

I.e., map a file to one file on one host.  Do you think using raw
blocks instead of files would make the job easier?
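
What I mean by mapping to files: the FDS could keep each fragment as an
ordinary local file, so the local filesystem does the allocation and
layout work.  A tiny sketch, with a made-up path scheme:

#include <stdint.h>
#include <stdio.h>

/* Name the local file that stores fragment frag_no of cluster file
 * file_id under the FDS's (hypothetical) data directory. */
static int
frag_local_path(char *buf, size_t len, uint64_t file_id, uint32_t frag_no)
{
    return snprintf(buf, len, "/var/fds/data/%016llx.%08x",
                    (unsigned long long)file_id, (unsigned)frag_no);
}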

> essential elements).  I'd describe various operations in
> terms of preconditions and postconditions.  Then, I'd extend
> the model to deal with redundancy and so on.  Then I'd model


> various failure modes. etc.  If you are interested _enough_

When an MDS/FDS crashes, how does it bring itself back up?  And if we
need to add another MDS/FDS to the cluster, how does it configure
itself?
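
The re-join handshake I imagine looks like this; all the message names
are hypothetical:

#include <stdint.h>

/*
 * On startup the FDS announces itself and what it still has on disk;
 * the MDS answers with the node's id, the current cluster epoch, and
 * whether it wants a full fragment list so it can decide what must be
 * re-replicated or dropped.
 */
struct fds_join_req {
    uint32_t fds_id;        /* 0 for a brand-new node */
    uint64_t free_bytes;
    uint64_t nfragments;    /* fragments found on the local disks */
};

struct fds_join_reply {
    uint32_t fds_id;        /* assigned id for a new node */
    uint32_t epoch;         /* cluster configuration version */
    uint32_t need_resync;   /* MDS wants the full fragment list */
};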

> we can take this offline and try to work something out.  You

Please count me in to work on the Performance, Scalability,
Availability, and Reliability problems :)

> may even be able to use perl to create an `executable'
> specification:-)

There is already MogileFS ;)

> \end{handwaving}
-- 
yf-263 <yfyoufeng at 263.net>
Unix-driver.org


