Recommendations for replacing a GFS cluster

cluster, distributed-filesystem, nfs

I have a couple of CentOS GFS clusters (GFS as in Global File System) using a shared disk in a Fibre Channel SAN. They have matured, and the time has come to start planning for their replacement.

Each cluster has an odd number of nodes (3 or 5), with fencing of faulty nodes handled by APC PDU power switches. All nodes are active and read and write simultaneously on the same shared filesystem. The filesystem is small, currently less than a TB, and will never grow beyond what fits on a commodity hard drive.

I have two exclusive IP-address resources that relocate when a node goes down (one of them on the 3-node cluster). Everything works very well, but performance is not great when there is a lot of activity.

So, what could I do differently in my next generation cluster?

What I need is service uptime and data availability. Possibly scalability as well, but probably not. I don't expect the load to grow very much.
I also need to be able to read and write the files like regular files on a regular filesystem. There is no need for quotas or ACLs. Just regular Unix permissions, ownership, mtime, size in bytes, and the ability to use ln to make a lock file in a way that fails on all but one node, should several try it at the same time.
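
For reference, the lock pattern I rely on is roughly the following sketch (paths are just examples); it works because link() is atomic, so exactly one node can succeed:

    # minimal sketch of the lock pattern (paths are examples)
    # link() is atomic, so exactly one node can win
    if ln /shared/app/run.pid /shared/app/lock 2>/dev/null; then
        # ... exclusive work ...
        rm /shared/app/lock
    else
        echo "another node holds the lock" >&2
    fi

A well-known caveat if this moves to NFS: a retransmitted link() request can make a successful call appear to fail, so the usual advice there is to check afterwards that the link count of the pid file has become 2.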

I don't want to increase the number of physical servers (which means that I want to use the storage on the actual servers themselves).

It's not mandatory, but I think it would be good if I weren't dependent on the shared disk. I've been through two incidents of enterprise-class SAN storage being unavailable in the last five years, so however improbable that is, I'd like to be one step ahead.

Since uptime is very important, one physical server running one kernel is not enough. Virtual machines are dependent on the SAN in our environment.

My thoughts so far:

  • All nodes could be plain NFSv3 clients (Would ln work the way I expect? What would be the NFS server then?)
  • Ceph with CephFS (When will the FS be production ready?)
  • XtreemFS (Why is there so little written about it compared to Ceph?)

As you can see, I'm interested in distributed storage, but I need advice from experienced gurus. Recommendations or advice about Ceph or XtreemFS would be especially welcome. This is not HPC with insane bandwidth demands; I just need the availability, reliability, and hopefully the flexibility of my old solution, ideally in a "better" configuration than the current one.

EDIT (see Nils' comment)
The main reason I am thinking about replacing this solution is that I want to see whether it is possible to eliminate the SAN storage cabinet as a single point of failure. Or should I instead use LVM mirroring to keep the data on two different storage systems in the same SAN fabric? Two FC HBAs and doubled switches should be enough, I think.
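
If I go the mirroring route, I imagine it would look roughly like this (device names are hypothetical, one LUN from each storage cabinet); since the volume group is clustered, I would also have to verify how the mirror log behaves under clvmd:

    # hypothetical multipath devices, one LUN from each storage cabinet
    pvcreate /dev/mapper/san_a_lun1 /dev/mapper/san_b_lun1
    vgcreate vg_gfs /dev/mapper/san_a_lun1 /dev/mapper/san_b_lun1
    # mirrored LV with one copy on each cabinet
    lvcreate -m 1 -L 900G -n lv_gfs vg_gfs
    # watch the copy percentage while it syncs
    lvs -a -o +devices vg_gfs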

Best Answer

Ceph and GlusterFS are where clustered filesystem technology is currently heading. Since I'm not familiar with GlusterFS, I'll speak to the features of Ceph.

Ceph scales horizontally: the more low-end nodes you add, the better the performance. Unlike GlusterFS, this is a primary benefit of Ceph, because it shards the data across the nodes.

However, Ceph is in active development (it's production ready except for Ceph FS) and requires a modern kernel (as I write this, not even the default CentOS 6.5 kernel can take advantage of the RBD/CephFS features). To get around this I installed the ELRepo kernel-lt package.
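
For reference, this is roughly what I ran to get a long-term kernel on CentOS 6 (the exact elrepo-release RPM version changes over time, so check elrepo.org first):

    # import the ELRepo key and enable the repository (release RPM version may differ)
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm
    # install kernel-lt, then make it the default in /boot/grub/grub.conf and reboot
    yum --enablerepo=elrepo-kernel install kernel-lt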

To break it down for you:

  • Ceph's RBD is a clustered SAN replacement: you create "virtual" block devices that live in the cluster and can be mapped on servers. Note: only one server should have an RBD image mounted at a time (you wouldn't want multiple OSes mounting the same SATA drive, would you?). You would then format the RBD device, mount it like a normal disk, and make it available over NFS/CIFS; see the sketch after this list. If the server providing NFS/CIFS goes down, no data is lost.
  • Ceph FS is a clustered NAS replacement (although not yet production ready); it provides the file-locking capabilities needed for a clustered FS shared between servers (web servers, for example).
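
A rough sketch of the RBD route (pool, image, and export names are just examples, not a recipe):

    # create a 100 GB image in the default 'rbd' pool and map it on the NFS server
    rbd create --size 102400 rbd/nfs_backing
    rbd map rbd/nfs_backing                    # appears as e.g. /dev/rbd0
    # format and mount it like any local disk
    mkfs.ext4 /dev/rbd0
    mkdir -p /export/data
    mount /dev/rbd0 /export/data
    # export it over NFS
    echo '/export/data *(rw,sync,no_root_squash)' >> /etc/exports
    exportfs -ra

If that NFS head goes down, another node can map the same image and take over the export (just make sure the old server really has it unmapped first).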

RBD runs in kernel space, so there is no FUSE performance hit. Ceph FS also runs in kernel space, but can run over FUSE as well.
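
For completeness, the two ways to mount Ceph FS (monitor address and key file are placeholders):

    # kernel client (needs a recent kernel)
    mount -t ceph 192.168.0.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
    # FUSE client (works on older kernels, at some performance cost)
    ceph-fuse -m 192.168.0.10:6789 /mnt/cephfs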

Ceph is also pretty easy to deploy:

  1. pip install ceph-deploy on an admin node (your desktop/workstation).
  2. Add the Inktank RPM repos and run ceph-deploy install node1 node2 ... nodeN to install Ceph on all of the nodes.
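
A minimal end-to-end sequence with ceph-deploy looks roughly like this (host names and the OSD disk are examples; the exact sub-commands vary a bit between ceph-deploy versions):

    # run from the admin node
    pip install ceph-deploy
    ceph-deploy new node1 node2 node3              # generate the initial ceph.conf with these monitors
    ceph-deploy install node1 node2 node3          # install the Ceph packages from the Inktank repos
    ceph-deploy mon create-initial                 # start the monitors and gather the keys
    ceph-deploy osd create node1:sdb node2:sdb node3:sdb    # one OSD per data disk
    ceph-deploy admin node1 node2 node3            # push the admin keyring so 'ceph -s' works everywhere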