Gluster file system in geo replication


Gluster is a distributed and scalable network file system developed in user space: it uses FUSE (Filesystem in Userspace) to hook into the kernel VFS layer.

It permits scaling to several petabytes while handling thousands of clients. Gluster volumes are exportable storage units formed by disk spaces, called bricks, distributed across different servers.

The volume data can be distributed or replicated. In the first case the files are spread among the bricks of the volume without redundancy; in the second case every file is replicated on all the bricks of the volume. For all the details about how volumes can be created, visit the Gluster site.
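
As a quick illustration of the difference, here is a minimal sketch of the two volume types, using hypothetical host names (server1, server2) and brick paths that are not part of this tutorial's setup:

gluster volume create dist-vol server1:/bricks/b1 server2:/bricks/b1
gluster volume create repl-vol replica 2 server1:/bricks/b1 server2:/bricks/b1

The first command spreads files between the two bricks without redundancy; the second keeps a full copy of every file on both bricks.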

The Gluster file system can be used as scalable storage for any type of data. In this tutorial I will explain how to create a replicated Gluster file system formed by two nodes, in geo-replication with a remote site reachable over a WAN.

Data writes between the nodes of the master cluster are synchronous, while writes towards the remote node are asynchronous.

This solution is optimal for a remote site to be used for disaster recovery if the primary site goes down.

The reference architecture is the following:

Gluster geo replication

In my laboratory I used VirtualBox as the virtualization system, CentOS 7.3 as the operating system and GlusterFS version 3.10. These are the IP addresses used:

Node            Role                                 OS           IP Address
gfs-master-01   gfs node of the primary cluster      CentOS 7.3   192.168.1.1
gfs-master-02   gfs node of the primary cluster      CentOS 7.3   192.168.1.2
gfs-remote-01   gfs node of the remote site          CentOS 7.3   192.168.2.1
gfs-client-01   client that mounts the gfs volume    CentOS 7.4   192.168.3.1
gfs-client-02   client that mounts the gfs volume    CentOS 7.4   192.168.3.2

Before starting, I would like to give a short introduction to how the Gluster architecture is implemented.

Gluster Architecture

The Gluster file system is a user space file system developed thanks to FUSE, a kernel module that supports the interaction between the kernel VFS and non-privileged user applications, and that exposes an API accessible from user space.

Every system call caught by the kernel virtual file system is redirected through the /dev/fuse file descriptor, via the FUSE API, to the user space application that interacts directly with the underlying file system (xfs, ext3, ext4, and so on). This means that for every read/write to the underlying file system there are two context switches (a switch from one user space application to another), with a significant impact compared to a classic file system where there is only the switch from application to kernel mode.
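
If you want to see the FUSE pieces involved, these standard commands show the kernel module, the character device and (once a Gluster volume is mounted, as done later in this tutorial) the fuse.glusterfs mount; this is just an optional sanity check:

[root@gfs-client-01 ~]# lsmod | grep fuse
[root@gfs-client-01 ~]# ls -l /dev/fuse
[root@gfs-client-01 ~]# mount | grep fuse.glusterfs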

A typical system call in a FUSE file system:

Fuse file system

This is what happens:

  1. A process reads /fuse/file, which belongs to an already mounted FUSE file system.
  2. The VFS invokes the appropriate handler in fusefs. If the requested data is found in the page cache, it is returned immediately. Otherwise, the system call is forwarded over a character device, /dev/fuse, to the libfuse library, which in turn invokes the callback defined in userfs for the read() operation.
  3. The request is handled by a user space application and is served, for example, by reading from an ext3 file system with a normal system call.
  4. The output is returned to the first user application.

This architecture gives good speed for big data files, but slow performance for small files. For this reason I would not use GlusterFS for a critical web server (in fact I experienced bad performance serving small web pages); it is better suited for big files as used on desktop computers, multimedia servers and in scientific computing.
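
If you want to verify this behaviour yourself, a rough test (my own sketch, run from a client on the mounted volume described later, not part of the original setup) is to compare one large sequential write against many small files:

[root@gfs-client-01 ~]# dd if=/dev/zero of=/gfs/bigfile bs=1M count=512 conv=fsync
[root@gfs-client-01 ~]# time for i in $(seq 1 1000); do echo test > /gfs/small_$i; done

The second command is usually much slower per megabyte, because every small file pays the full FUSE and network round-trip cost.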

You can find more info about FUSE file system in this link: http://www.csl.sri.com/users/gehani/papers/SAC-2010.FUSE.pdf.

In Gluster the basic unit of storage is called a brick, which is an export directory on a server. A volume is the collection of bricks spanned across different servers, mountable by remote clients.

The files can be distributed across the various bricks of the volume (the distributed configuration) or replicated across the bricks of the volume (called replication). There are other ways to configure a GlusterFS cluster, documented at https://gluster.readthedocs.io/en/latest/Quick-Start-Guide/Architecture/#architecture.
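
For example, a distributed-replicated volume combines both approaches: files are distributed between replica pairs, and each pair keeps two copies. A hypothetical sketch with four servers that are not part of this tutorial's setup:

gluster volume create distrep-vol replica 2 server1:/bricks/b1 server2:/bricks/b1 server3:/bricks/b1 server4:/bricks/b1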

The files are replicated to all bricks of the cluster in a synchronous way, and in an asynchronous way towards the slave nodes of a remote cluster. This last solution is called geo-replication and it is the right solution for a backup or disaster recovery site.

Let’s start now with all the actions of the installation and configuration procedure on all nodes of the primary and secondary sites. Next I will detail the geo-replication configuration.

Gluster Master Installation and configuration

The goal of this article is to show how to install and configure two Gluster file system sites in geo-replication. The primary site is composed of two systems, gfs-master-01 and gfs-master-02. The secondary remote site is composed of a single node, gfs-remote-01.

Before configuring the geo replication between primary and secondary sites, the gluster file system must be installed and configured on both sites:

On all glusterfs servers:

[root@gfs-master-01 ~]#yum install centos-release-gluster310-1.0-1.el7.centos.noarch
[root@gfs-master-01 ~]#yum install glusterfs-server

The file system that will be replicated by the GlusterFS servers must be created and mounted on all nodes. On every node I will create a logical volume called lv_gfs, belonging to the volume group vg_data, and mount it.

On all glusterfs servers:

[root@gfs-master-01 ~]#fdisk /dev/sdc
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-32635, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-32635, default 32635):
Using default value 32635
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@gfs-master-01 ~]#pvcreate /dev/sdc1
[root@gfs-master-01 ~]#vgcreate vg_data /dev/sdc1
[root@gfs-master-01 ~]#lvcreate -L 4G -n lv_gfs vg_data
[root@gfs-master-01 ~]#mkdir -p /gfs/brick1
[root@gfs-master-01 ~]#mkfs.ext4 /dev/vg_data/lv_gfs
[root@gfs-master-01 ~]#vi /etc/fstab
/dev/vg_data/lv_gfs /gfs/brick1 ext4 defaults 1 2
[root@gfs-master-01 ~]#mount /gfs/brick1
[root@gfs-master-01 ~]#df -k |grep brick1
/dev/mapper/vg_data-lv_gfs 3997376 3622524 148756 97% /gfs/brick1

Before creating the volume on the master servers, it’s necessary to start the glusterd services, probe the peers and then create the directory that will host the brick:

[root@gfs-master-01 ~]#systemctl enable glusterd
[root@gfs-master-02 ~]#systemctl enable glusterd
[root@gfs-master-01 ~]#systemctl start glusterd
[root@gfs-master-02 ~]#systemctl start glusterd
[root@gfs-master-01 ~]#mkdir /gfs/brick1/gv0
[root@gfs-master-02 ~]#mkdir /gfs/brick1/gv0
[root@gfs-master-01 ~]#gluster peer probe gfs-master-02
[root@gfs-master-02 ~]#gluster peer probe gfs-master-01

Now the volume, replicated in a synchronous way, can be created by executing this command on one master server only:

[root@gfs-master-01 ~]#gluster volume create gv0 replica 2 gfs-master-01:/gfs/brick1/gv0 gfs-master-02:/gfs/brick1/gv0
volume create: gv0: success: please start the volume to access data

The volume can be started:

[root@gfs-master-01 ~]#gluster volume start gv0
volume start: gv0: success
[root@gfs-master-01 ~]#gluster volume info
Volume Name: gv0
Type: Replicate
Volume ID: 3f3gc333-e467-494s-c23d-49bddc79a001
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gfs-master-01:/gfs/brick1/gv0
Brick2: gfs-master-02:/gfs/brick1/gv0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

It’s possible to show all the nodes of the cluster in this way:

[root@gfs-master-01 ~]# gluster peer status
Number of Peers: 1
Hostname: gfs-master-02
Uuid: 6dd2260b-afb9-346d-8fc7-bc6f4a6f8c19
State: Peer in Cluster (Connected)

The master Gluster file system is ready to be mounted on both clients:

[root@gfs-client-01 ~]#vi /etc/fstab
gfs-master-01:/gv0 /gfs glusterfs defaults,_netdev 0 0
[root@gfs-client-02 ~]# vi /etc/fstab 
gfs-master-02:/gv0 /gfs glusterfs defaults,_netdev 0 0
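
After adding the fstab entries, the mount point must exist on both clients and the volume can be mounted and tested; this is a minimal check I suggest (the test file name is just an example). Depending on the Gluster version, a second master can also be listed with the backupvolfile-server (or backup-volfile-servers) mount option, so the client survives the loss of one master node:

[root@gfs-client-01 ~]# mkdir -p /gfs
[root@gfs-client-01 ~]# mount /gfs
[root@gfs-client-01 ~]# df -hT /gfs
[root@gfs-client-01 ~]# touch /gfs/hello-from-client-01
[root@gfs-client-02 ~]# mkdir -p /gfs
[root@gfs-client-02 ~]# mount /gfs
[root@gfs-client-02 ~]# ls /gfs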

Let’s describe now the steps for configuring the Gluster slave on the remote node.

Gluster Slave Installation and configuration

On the slave Gluster server we have to create a gfs volume that will be geo-replicated with the master volume. Before creating the remote volume, a logical volume of the same size as the master’s must be created and the Gluster file system must be installed:

[root@gfs-remote-01 ~]#pvcreate /dev/sdc1
[root@gfs-remote-01 ~]#vgcreate vg_data /dev/sdc1
[root@gfs-remote-01 ~]#lvcreate -L 4G -n lv_gfs vg_data
[root@gfs-remote-01 ~]#mkdir -p /gfs/brick1
[root@gfs-remote-01 ~]#mkfs.ext4 /dev/vg_data/lv_gfs
[root@gfs-remote-01 ~]#vi /etc/fstab
/dev/vg_data/lv_gfs /gfs/brick1 ext4 defaults 1 2
[root@gfs-remote-01 ~]#mount /gfs/brick1
[root@gfs-remote-01 ~]#df -k |grep brick1

Before creating the volume on the remote server, it’s necessary to start the gfs service.

[root@gfs-remote-01 ~]#systemctl enable glusterd
[root@gfs-remote-01 ~]#systemctl start glusterd
[root@gfs-remote-01 ~]#mkdir /gfs/brick1/gv0

Now the volume is created without replica mode, because in the remote site there is only one server.

[root@gfs-remote-01 ~]#gluster volume create gv0 gfs-remote-01:/gfs/brick1/gv0
volume create: gv0: success: please start the volume to access data

The volume can be started:

[root@gfs-remote-01 ~]#gluster volume start gv0
volume start: gv0: success
[root@gfs-remote-01 ~]#gluster volume info
Volume Name: gv0
Type: Distribute
Volume ID: 3f3gc333-e467-494s-c23d-49bddc79a001
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gfs-remote-01:/gfs/brick1/gv0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

The remote volume is up and running. The final step is to geo-replicate the master site with the slave site.

Gluster Geo Replication configuration

The geo data replication between the master and the slave volume works in an asynchronous way, and this is very useful because it doesn’t slow down data writes in the master site. It is a solution well suited for a disaster recovery or backup site.

The connection from the master to the slave uses the ssh protocol, and the data synchronization from master to slave is performed over it using rsync.

On every node of the master and slave sites the following package must be installed:

[root@gfs-master-01 ~]#yum install glusterfs-geo-replication.x86_64
[root@gfs-master-02 ~]#yum install glusterfs-geo-replication.x86_64
[root@gfs-remote-01 ~]#yum install glusterfs-geo-replication.x86_64

On a master node, an ssh key pair must be created and its public key copied to the slave:

[root@gfs-master-01 ~] #ssh-keygen
[root@gfs-master-01 ~]# ssh-copy-id root@gfs-remote-01
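
At this point it is worth a quick check that passwordless ssh from the master to the slave works and that rsync is available on the remote node (assuming rsync is already installed from the base repositories; install it with yum if it is missing):

[root@gfs-master-01 ~]# ssh root@gfs-remote-01 'hostname; rsync --version | head -1'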

The ssh keys must then be copied into the right directory on the master nodes:

[root@gfs-master-01 ~]#cp /root/.ssh/id_rsa.pub /var/lib/glusterd/geo-replication/secret.pem.pub
[root@gfs-master-01 ~]#cp /root/.ssh/id_rsa /var/lib/glusterd/geo-replication/secret.pem
[root@gfs-master-01 ~]#scp /var/lib/glusterd/geo-replication/secret.pem* root@gfs-master-02:/var/lib/glusterd/geo-replication/

On the remote node the following link must be created, otherwise the replication gives an error because the gsyncd utility used for replication is not found:

[root@gfs-remote-01 /]# mkdir -p /nonexistent
[root@gfs-remote-01 /]# ln -s /usr/libexec/glusterfs/gsyncd /nonexistent/gsyncd

The geo replication is ready to be configured and started:

[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 create push-pem
[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 start

As you know, the replication happens between the master volume, which is replicated in a synchronous way, and the remote volume, which is synchronized in an asynchronous way. The remote volume could be another Gluster cluster like the master one; in my case it is a single node.
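
A simple way to verify that the geo-replication is working (my own test, the file name is just an example) is to write a file on a client through the mounted master volume and, after a short delay, check that it appears in the remote brick:

[root@gfs-client-01 ~]# echo "geo replication test" > /gfs/georep-test.txt
[root@gfs-remote-01 ~]# ls -l /gfs/brick1/gv0/georep-test.txt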

For checking the geo replication status:

[root@gfs-master-01 /]# gluster volume geo-replication status
MASTER NODE      MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                       SLAVE NODE       STATUS    CRAWL STATUS       LAST_SYNCED
gfs-master-01    gv0           /gfs/brick1/gv0    root          ssh://gfs-remote-01::gv0    gfs-remote-01    Active    Changelog Crawl    2017-07-22 08:26:23
gfs-master-02    gv0           /gfs/brick1/gv0    root          ssh://gfs-remote-01::gv0    gfs-remote-01    Active    Changelog Crawl    2017-07-21 08:26:18
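
For day-to-day management the same geo-replication command accepts a few other useful actions; these are standard Gluster sub-commands, shown here only for reference:

[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 config
[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 pause
[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 resume
[root@gfs-master-01 ~]# gluster volume geo-replication gv0 gfs-remote-01::gv0 stop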

The key thing to understand is that geo-replication goes from master to slave, not from slave to master. In case of disaster, after the master site has been recovered, a manual procedure must be performed to re-align the master site with the slave, which was updated during the disaster.

Conclusions

The Gluster file system is very useful and a good alternative to an NFS server or to solutions like DRBD, which are more complicated to configure and manage. It’s a solution to use whenever a distributed file system is required, or for disaster recovery sites.

The disadvantages are related to the slow performance of reads/writes of small files. This could degrade performance for applications that perform a high number of input/output operations per second.

Don’t hesitate to contact me for any feedback or suggestions.
