Pacemaker cluster with NFS and DRBD


This article describes how to configure Pacemaker, an open source high-availability cluster manager, to provide a highly available NFS service, using DRBD to mirror the volume data.

The cluster is configured as Active/Standby on two CentOS 7.3 nodes. The reference architecture is the following:

Pacemaker cluster with DRBD in Master/Standby

The software resources are configured as active/standby. The only resource running on both nodes is DRBD, which runs as a module inside the kernel. The resource dependencies ensure that all other processes run on a single node (the master) and start in the right order: before the NFS server starts, the file system must be mounted; before the file system is mounted, the DRBD node must be in the primary state.
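The start-up chain described above can be sketched with tsort from coreutils, which sorts items given as "first then" pairs. This is only an illustration of the dependency idea; the labels are made-up names, not resource IDs from the cluster.

```shell
# Each line is a "first then" pair: the left item must be active before the right one.
# The labels are illustrative only; they are not resource IDs from the cluster.
start_order() {
    printf '%s\n' \
        'drbd-primary filesystem-mounted' \
        'filesystem-mounted nfs-server' | tsort
}

start_order
```

The function prints the three labels in the order in which they have to become active. Pacemaker expresses the same idea with order constraints.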

After this introduction, let’s start to configure drbd on both nodes.

DRBD Installation and Configuration

The Distributed Replicated Block Device (DRBD) is a replicated storage solution for mirroring the content of block devices between hosts.

The solution implemented is a primary/standby architecture: a logical volume is mirrored between the hosts, with a single primary node that mounts the file system and a standby node where the mirrored device is not mounted. If the master crashes, the cluster manager reconfigures the standby as primary. For more information about DRBD, please read the official site https://www.drbd.org/.

The installation procedure is very simple. The following commands must be executed on both nodes:

[root@pcs-01 yum.repos.d]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
[root@pcs-01 yum.repos.d]# rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
[root@pcs-01 yum.repos.d]# yum install -y kmod-drbd84 drbd84-utils

Install the SELinux management tool semanage on all nodes and exempt the DRBD processes from SELinux enforcement:

[root@pcs-01 yum.repos.d]# yum install policycoreutils-python-2.5-11.el7_3.x86_64
[root@pcs-01 yum.repos.d]# semanage permissive -a drbd_t

Create and configure the data volume to mirror. Managing the volume with LVM makes it easy to resize later. In this case the mirrored disk is sdc:

[root@pcs-01 yum.repos.d]# fdisk /dev/sdc
[root@pcs-01 yum.repos.d]# pvcreate /dev/sdc1
[root@pcs-01 yum.repos.d]# vgcreate drbd_vg /dev/sdc1
[root@pcs-01 yum.repos.d]# lvcreate -L 0.99G -n drbd_lv drbd_vg
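
As a rough idea of what resizing could look like once the whole stack is configured, here is a dry-run sketch that only prints a plan. It assumes the names used in this article (resource r0, device /dev/drbd1, ext4 file system), enough free space in the volume group, and an arbitrary example increment of +1G; grow_plan is a hypothetical helper, and the exact procedure should be checked against the DRBD documentation before running anything.

```shell
# Dry-run sketch: prints the commands instead of running them.
# Assumptions: resource r0, device /dev/drbd1 and an ext4 file system, as used in this article;
# the volume group has free space; the +1G increment is an arbitrary example value.
grow_plan() {
    size="${1:-+1G}"
    echo "both nodes:   lvextend -L ${size} /dev/drbd_vg/drbd_lv"
    echo "primary node: drbdadm resize r0"
    echo "primary node: resize2fs /dev/drbd1"
}

grow_plan +1G
```

The order matters: the backing volume grows first on both nodes, then DRBD is told about the new size, and only then is the file system enlarged.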

Now it’s possible to configure the DRBD nodes:

[root@pcs-01 drbd.d]# vi /etc/drbd.d/r0.res
resource r0 {
  net {
    protocol C;
    sndbuf-size 0;
  }
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  on pcs-01 {
    device /dev/drbd1;
    disk /dev/drbd_vg/drbd_lv;
    address 192.168.1.51:7789;
    meta-disk internal;
  }
  on pcs-02 {
    device /dev/drbd1;
    disk /dev/drbd_vg/drbd_lv;
    address 192.168.1.52:7789;
    meta-disk internal;
  }
}
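
Since the two "on" sections differ only in host name and address, the file can also be generated, which avoids copy and paste mistakes between the sections. The following is a minimal sketch; gen_drbd_res is a hypothetical helper, not part of drbd84-utils, and it hard-codes the device, backing disk and port used in this article.

```shell
# Sketch: print a DRBD resource file for two nodes.
# gen_drbd_res is a hypothetical helper, not part of drbd84-utils.
# Usage: gen_drbd_res RESOURCE HOST1 IP1 HOST2 IP2
gen_drbd_res() {
    res="$1"; host1="$2"; ip1="$3"; host2="$4"; ip2="$5"
    cat <<EOF
resource ${res} {
  net { protocol C; sndbuf-size 0; }
  disk { fencing resource-only; }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
EOF
    # One "on" section per node; each pair is written as host=ip.
    for pair in "${host1}=${ip1}" "${host2}=${ip2}"; do
        cat <<EOF
  on ${pair%%=*} {
    device /dev/drbd1;
    disk /dev/drbd_vg/drbd_lv;
    address ${pair#*=}:7789;
    meta-disk internal;
  }
EOF
    done
    echo "}"
}

gen_drbd_res r0 pcs-01 192.168.1.51 pcs-02 192.168.1.52
```

The output can be redirected to /etc/drbd.d/r0.res on both nodes. The host names must match the output of uname -n on each node.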

If the DRBD replication link becomes disconnected, the crm-fence-peer.sh script contacts the cluster manager, determines the Pacemaker Master/Slave resource associated with this DRBD resource, and ensures that the Master/Slave resource no longer gets promoted on any node other than the currently active one. Conversely, when the connection is re-established and DRBD completes its synchronization process, then that constraint is removed and the cluster manager is free to promote the resource on any node again (this sentence is copied from https://www.drbd.org/en/doc/users-guide-84/s-pacemaker-fencing).

In other words, the crm-fence-peer.sh script offers a minimal solution to the split-brain issue. It is not a complete solution: if the cluster itself splits, both nodes become active, and this can lead to data inconsistency.

Now the metadata that DRBD uses for synchronization can be created and the resource brought up on both nodes:

[root@pcs-01 drbd.d]# drbdadm create-md r0
[root@pcs-02 drbd.d]# drbdadm create-md r0
[root@pcs-01 drbd.d]# drbdadm up r0
[root@pcs-02 drbd.d]# drbdadm up r0

Promote the pcs-01 node to primary:

[root@pcs-01 drbd.d]# drbdadm primary --force r0

Check the sync between hosts:

[root@pcs-01 drbd.d]# more /proc/drbd
version: 8.4.9-1 (api:1/proto:86-101)
GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by akemi@Build64R7, 2016-12-04 01:08:48
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:1040316 nr:0 dw:0 dr:1041228 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
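
The same check can be scripted. The sketch below treats the resource as synchronized when the status line shows cs:Connected together with ds:UpToDate/UpToDate, as in the output above. drbd_in_sync is a hypothetical helper, and it assumes a single DRBD resource on the node.

```shell
# Sketch: succeed only when the DRBD status text shows a connected, fully synced resource.
# drbd_in_sync is a hypothetical helper; it reads /proc/drbd unless another file is given.
# Assumption: only one DRBD resource is configured on the node.
drbd_in_sync() {
    status_file="${1:-/proc/drbd}"
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate' "$status_file"
}

# Only query the live status when the DRBD module is loaded.
if [ -r /proc/drbd ]; then
    if drbd_in_sync; then echo "in sync"; else echo "not in sync yet"; fi
fi
```

A script like this can be useful to wait for the initial synchronization to finish before creating the file system.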

On the primary node, create the file system:

[root@pcs-01 drbd.d]# mkfs.ext4 /dev/drbd1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65024 inodes, 260062 blocks
13003 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=266338304
8 block groups
32768 blocks per group, 32768 fragments per group
8128 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

To test that everything is OK, mount the file system on the primary node, create a file, and unmount it again, because from now on it will be managed directly by the cluster manager:

[root@pcs-01 drbd.d]# mount /dev/drbd1 /var/share
[root@pcs-01 drbd.d]# touch /var/share/test-01.txt
[root@pcs-01 drbd.d]# umount /var/share

Remember not to enable DRBD at boot time, because starting it is the cluster’s job.

Now let’s configure the cluster manager.

Pacemaker installation and configuration

Pacemaker is a free cluster manager that provides high availability to applications and databases by automatically restarting the managed software resources on another node. It is a good alternative to commercial products from vendors like Veritas, Red Hat and Oracle.

In my laboratory the cluster manages an NFS server with all the resources it needs: the exported file system, mounted from a block device managed by DRBD, and a virtual IP address to listen on. In case of a node fault, all the resources are started on the other node, respecting the defined constraints.

The installation of the Pacemaker cluster is very simple. These are the commands to execute on both systems:

[root@pcs-01 drbd.d]# yum install -y pacemaker pcs psmisc policycoreutils-python 
[root@pcs-01 drbd.d]# systemctl start pcsd.service
[root@pcs-01 drbd.d]# systemctl enable pcsd.service

Here is how to configure it (on both nodes):

[root@pcs-01 drbd.d]# passwd hacluster
[root@pcs-01 drbd.d]# pcs cluster auth pcs-01 pcs-02

Next, run pcs cluster setup on one node only to generate and synchronize the corosync configuration (see http://clusterlabs.org/ for the Pacemaker architecture).

[root@pcs-01 drbd.d]# pcs cluster setup --name nfs-cluster pcs-01 pcs-02

Now it’s time to start the cluster on both nodes:

[root@pcs-01 drbd.d]# pcs cluster start --all
[root@pcs-01 drbd.d]# systemctl enable pcsd.service

Before configuring the cluster, you should understand that, to guarantee data safety, Pacemaker enables STONITH, a fencing technique for computer clusters, by default. Without a STONITH configuration the cluster reports an error. In this setup the feature is disabled to bypass the error:

[root@pcs-01 drbd.d]# pcs property set stonith-enabled=false

With the new cluster option set, the configuration is now valid.

In production, some form of protection against split brain must be implemented. At the end of the article I only suggest some ideas on how to do it.

Let’s start with the configuration of the base resources of the cluster: the virtual IP address and the DRBD device in primary/standby mode. For the virtual IP address:

[root@pcs-01 var]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.1.53 cidr_netmask=32 op monitor interval=30s

The status of the cluster resources is shown in this way:

[root@pcs-01 var]# pcs status
Cluster name: nfs-cluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pcs-01 (version 1.1.15-11.el7_3.2-e174ec8) – partition with quorum
Last updated: Fri Feb 26 12:23:33 2017 Last change: Fri Feb 26 12:21:50 2017 by root via cibadmin on pcs-01
2 nodes and 1 resource configured
Online: [ pcs-01 pcs-02 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcs-01
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/disabled

Configure the DRBD resource on all nodes.

The next action is to configure the DRBD resource in primary/standby mode. To do so, run the following commands:

[root@pcs-01 var]# pcs cluster cib drbd_cfg
[root@pcs-01 var]# pcs -f drbd_cfg resource create NFS-DRBD ocf:linbit:drbd \
drbd_resource=r0 op monitor interval=60s
[root@pcs-01 var]# pcs -f drbd_cfg resource master NFS-DRBD-Clone NFS-DRBD \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
notify=true
[root@pcs-01 var]# pcs -f drbd_cfg constraint colocation add NFS-DRBD-Clone with ClusterIP INFINITY
[root@pcs-01 var]# pcs -f drbd_cfg constraint order ClusterIP then NFS-DRBD-Clone
Adding ClusterIP NFSData (kind: Mandatory) (Options: first-action=start then-action=start)
[root@pcs-01 var]# pcs cluster cib-push drbd_cfg

The commands above are committed together when the cib-push is done. One constraint forces the ClusterIP resource to run on the same node as the DRBD resource (colocation constraint) and another starts the ClusterIP before the DRBD resource (order constraint).

The next resource is the file system, which must be mounted where the DRBD primary resource is running. Two constraints are required for that: the first colocates the file system with the DRBD master, the second promotes the DRBD node to primary before the file system is mounted:

[root@pcs-01 var]# pcs cluster cib fs_cfg
[root@pcs-01 var]# pcs -f fs_cfg resource create DATA Filesystem device="/dev/drbd1" directory="/var/share/" fstype="ext4"
[root@pcs-01 var]# pcs -f fs_cfg constraint colocation add DATA with NFS-DRBD-Clone INFINITY with-rsc-role=Master
[root@pcs-01 var]# pcs -f fs_cfg constraint order promote NFS-DRBD-Clone then start DATA
[root@pcs-01 var]# pcs cluster cib-push fs_cfg

The last resource to configure is the NFS server. Before doing that, the following package must be installed on both nodes:

[root@pcs-01 var]# yum install nfs-utils

Don’t enable the nfs-server service: that is the cluster’s duty. The only thing to do is to configure the NFS server: the exports file must contain the directory to export. In my case:

[root@pcs-01 var]# vi /etc/exports
/var/share 192.168.1.40(rw,sync,no_root_squash,no_all_squash) 192.168.1.39(rw,sync,no_root_squash,no_all_squash)

The file above contains the directory to export and the list of NFS clients allowed to mount it.
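
In /etc/exports each client takes its own option list in parentheses, and clients are separated by spaces. With more than a couple of clients the line gets long, so a small generator can help. This is only a sketch: exports_line is a hypothetical helper that prints text and does not touch /etc/exports, and it assumes the exported directory is the /var/share mount point used earlier.

```shell
# Sketch: build one /etc/exports line, repeating the option list for every client.
# exports_line is a hypothetical helper; it only prints text and does not modify /etc/exports.
# Usage: exports_line DIRECTORY OPTIONS CLIENT [CLIENT...]
exports_line() {
    dir="$1"; opts="$2"; shift 2
    line="$dir"
    for client in "$@"; do
        line="$line ${client}(${opts})"
    done
    echo "$line"
}

exports_line /var/share rw,sync,no_root_squash,no_all_squash 192.168.1.40 192.168.1.39
```

The printed line can be appended to /etc/exports on both nodes, so that the export is identical wherever the NFS server is started.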

The commands to configure the nfs resource in the cluster are:

[root@pcs-01 var]# pcs cluster cib fs_cfg
[root@pcs-01 var]# pcs -f fs_cfg resource create NFS-SERVER systemd:nfs-server op monitor interval="30s"
[root@pcs-01 var]# pcs -f fs_cfg constraint colocation add NFS-SERVER with DATA INFINITY with-rsc-role=Master
[root@pcs-01 var]# pcs -f fs_cfg constraint order promote DATA then start NFS-SERVER
[root@pcs-01 var]# pcs cluster cib-push fs_cfg

The constraints force the NFS server to run on the same node where the volume is mounted and to start only after the exported file system has been mounted.

The cluster status with all the constraints follows:

[root@pcs-01 var]# pcs status
Cluster name: nfs-cluster
Stack: corosync
Current DC: pcs-02 (version 1.1.15-11.el7_3.2-e174ec8) – partition with quorum
Last updated: Sun Feb 26 23:45:43 2017 Last change: Wed Feb 22 14:34:09 2017 by root via crm_attribute on pcs-01
2 nodes and 5 resources configured
Node pcs-02: standby
Online: [ pcs-01 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started pcs-01
Master/Slave Set: NFS-DRBD-Clone [NFS-DRBD]
Masters: [ pcs-01 ]
Stopped: [ pcs-02 ]
DATA (ocf::heartbeat:Filesystem): Started pcs-01
NFS-SERVER (systemd:nfs-server): Started pcs-01
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/disabled
[root@pcs-01 var]# pcs constraint
Location Constraints:
Ordering Constraints:
promote NFS-DRBD-Clone then start DATA (kind:Mandatory)
start NFS-DRBD-Clone then start ClusterIP (kind:Mandatory)
promote DATA then start NFS-SERVER (kind:Mandatory)
Colocation Constraints:
DATA with NFS-DRBD-Clone (score:INFINITY) (with-rsc-role:Master)
ClusterIP with NFS-DRBD-Clone (score:INFINITY)
NFS-SERVER with DATA (score:INFINITY) (with-rsc-role:Master)
Ticket Constraints:

It’s possible to test the cluster failover functionality by putting the node pcs-01 in standby:

[root@pcs-01 var]# ps -afe | grep nfs
root 8634 2 0 Feb27 ? 00:00:00 [nfsd4_callbacks]
root 8645 2 0 Feb27 ? 00:00:00 [nfsd]
root 8646 2 0 Feb27 ? 00:00:00 [nfsd]
root 8647 2 0 Feb27 ? 00:00:00 [nfsd]
root 8648 2 0 Feb27 ? 00:00:00 [nfsd]
root 8649 2 0 Feb27 ? 00:00:00 [nfsd]
root 8650 2 0 Feb27 ? 00:00:00 [nfsd]
root 8653 2 0 Feb27 ? 00:00:00 [nfsd]
root 8654 2 0 Feb27 ? 00:00:00 [nfsd]
root 29482 29466 0 23:53 pts/0 00:00:00 grep --color=auto nfs
[root@pcs-01 var]# more /proc/drbd
version: 8.4.9-1 (api:1/proto:86-101)
GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by akemi@Build64R7, 2016-12-04 01:08:48
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r-----
ns:4 nr:8 dw:20 dr:1866 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
[root@pcs-01 var]# df -k |grep nfs
/dev/drbd1 2039788 527388 1402396 28% /var/share
[root@pcs-01 var]# pcs resource
ClusterIP (ocf::heartbeat:IPaddr2): Started pcs-01
Master/Slave Set: NFS-DRBD-Clone [NFS-DRBD ]
Masters: [ pcs-01 ]
Standby: [ pcs-02 ]
DATA (ocf::heartbeat:Filesystem): Started pcs-01
NFS-SERVER (systemd:nfs-server): Started pcs-01
[root@pcs-01 var]# pcs cluster standby pcs-01
[root@pcs-01 var]# ps -afe |grep nfs
root 29960 29466 0 23:54 pts/0 00:00:00 grep --color=auto nfs
[root@pcs-01 var]# more /proc/drbd
version: 8.4.9-1 (api:1/proto:86-101)
[root@pcs-01 var]# pcs resource
ClusterIP (ocf::heartbeat:IPaddr2): Started pcs-02
Master/Slave Set: NFS-DRBD-Clone [NFS-DRBD ]
Masters: [ pcs-02 ]
Stopped: [ pcs-01 ]
DATA (ocf::heartbeat:Filesystem): Started pcs-02
NFS-SERVER (systemd:nfs-server): Started pcs-02
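
Reading the pcs resource output by eye works for a test like this one. For repeated failover tests, the check that every started resource sits on the same node can be scripted. The sketch below parses text in the format shown above; active_node is a hypothetical helper built on awk, not a pcs feature.

```shell
# Sketch: read "pcs resource" output on stdin and print the node hosting the started resources.
# Fails (exit status 1) if no resource is started or if they are spread over several nodes.
# active_node is a hypothetical helper, not part of pcs.
active_node() {
    awk '/Started / { nodes[$NF] = 1 }
         END {
             n = 0
             for (node in nodes) { n++; last = node }
             if (n == 1) { print last } else { exit 1 }
         }'
}

# Only query the cluster when pcs is installed on this machine.
if command -v pcs >/dev/null 2>&1; then
    pcs resource | active_node
fi
```

Run before and after pcs cluster standby, it should print pcs-01 first and pcs-02 afterwards if the failover went as expected.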

The cluster works fine and is ready to be used, but in production the split-brain issue must be resolved to ensure data integrity. If a node loses network connectivity to the other side of the cluster, each node will believe it is the only healthy one and will bring its resources up. The data volume will then be mounted on both nodes, which could compromise data integrity.

I prefer a shared storage based approach to this issue and in this context two different solutions are possible:

  1. Split brain detection (SBD) with STONITH.
  2. Storage-based fencing SCSI-3 PR (Persistent Reservation) with STONITH.

In SBD STONITH (shoot the other node in the head), the nodes in the Linux cluster keep each other updated by using the Heartbeat mechanism. If something goes wrong with a node in the cluster, a poison pill is written for that node to the shared storage device. The node has to “eat” (accept) the poison pill and terminate itself, after which a file system resource can be safely failed over to another node in the Linux cluster (sentence copied from searchdatacenter.techtarget.com). You can find more information on how to configure SBD with STONITH at http://www.linux-ha.org/wiki/SBD_Fencing. Another useful link is http://searchdatacenter.techtarget.com/tip/Implementing-SBD-STONITH-in-Linux-HA-clusters

Storage-based fencing with SCSI-3 PR (Persistent Reservation) ensures that only one node can write to or access the storage at a time. A shared storage device that supports SCSI-3 PR must be used. The daemon or service responsible for fencing in Pacemaker is stonith. To configure stonith in Pacemaker using this approach:

[root@pcs-01 var]# pcs stonith create scsi fence_scsi pcmk_host_list="pcs-01 pcs-02" pcmk_monitor_action="metadata" pcmk_reboot_action="off" devices="/dev/mapper/drbd_vg-drbd_lv" meta provides="unfencing"
[root@pcs-01 var]# pcs stonith show
scsi (stonith:fence_scsi): Started

For more information about it, see https://keithtenzer.com/2015/06/22/pacemaker-the-open-source-high-availability-cluster/.

The two approaches above will be implemented and described in more detail in a future article.

I hope that this article was useful for you.

Don’t hesitate to contact me with any doubt or suggestion.

 
