Kubernetes Ceph Rook: first part

  Network, System

In this article I investigate the possibility of implementing a robust, scalable and highly available Ceph storage solution inside a Kubernetes cluster using Rook (https://rook.io/), which automates the delivery and management of the Ceph solution.

The benefits of this solution are a storage area that is easy to install and manage, as opposed to the large installation and management effort required by a standalone Ceph installation, and of course the ability to create, inside Kubernetes, persistent volumes to be mounted as block devices or as a distributed file system in high availability, without worrying about the implementation details.

In this first part I will show you how to create a shared filesystem that can be mounted with read/write permissions from multiple pods, useful for example for logging or for sharing a common repository for file transfers. In the second part I will discuss how to create a block device to be mounted in write mode from a single pod, useful for managing data services, like databases, inside the cluster. In the last part I will show how to create NFS volumes, with the Ganesha NFS server, exported in high availability to the external world using HAProxy for TCP load balancing towards the Ceph Ganesha endpoints.

The laboratory is implemented in a Kubernetes micro cluster, the same one described in my previous article https://www.securityandit.com/network/kubernetes-the-easy-way-part-01/, with 3 workers that are all used for running both business pods and the Ceph daemons: Monitor, OSD, Manager and Metadata Server. In a production environment it is better to separate, by labels and taints, the nodes where the business pods run from those where the Ceph pods run.

In this context, the reference architecture implemented is the following:

Kubernetes ceph rook cluster installation

Ceph Rook makes it possible to deliver object (like AWS S3), block and file storage inside a Kubernetes cluster, using as data storage the device files, or disks, attached to the worker nodes of the cluster. It's also possible to implement a Ceph Rook architecture in the public cloud – like AWS, Google or Azure – using zonal disks (AWS EBS, Google persistent disk, Azure LRS disk) that perform better than regional disks. This makes sense because availability is provided by the Ceph cluster itself, which replicates the data across the disks present in the different availability zones.

Returning to our scenario, the Ceph architecture, as explained in depth in the Ceph architecture overview https://docs.ceph.com/en/latest/architecture/, is composed of multiple types of daemons:

  1. Ceph Monitor: ceph-mon maintains maps of the cluster state and, through the CRUSH algorithm, is able to detect the OSDs that store the data requested by the client.
  2. Ceph OSD Daemon: ceph-osd is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network. In Ceph Rook there is one OSD pod for every disk attached to the cluster. One OSD is elected as primary and at least two others act as secondaries.
  3. Ceph Manager: ceph-mgr provides monitoring functionality.
  4. Ceph Metadata Server: ceph-mds is the metadata server for distributed file systems; these pods, as described below, are started when a Ceph file system is created.

The installation of Ceph Rook inside Kubernetes, described at https://github.com/rook/rook/blob/master/Documentation/quickstart.md, is based on the Kubernetes CRD concept, an extension of the Kubernetes API for storing custom object definitions, and on the Kubernetes operator pattern, designed to automate actions and software delivery inside Kubernetes according to CRD objects; in a few words, when a CRD object is created inside the cluster, the operator performs actions based on its meaning and its configuration (see https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).

The installation of the CRD resources and the Kubernetes operator is performed in this simple way (for OpenShift you should apply operator-openshift.yaml):

$git clone --single-branch --branch v1.7.0 https://github.com/rook/rook.git
$cd rook/cluster/examples/kubernetes/ceph
$kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$kubectl get pods -A |grep opera
rook-ceph           rook-ceph-operator-6b88ff7b4c-xk67b                               1/1     Running            0          38d

The operator, which is a Kubernetes pod, is now ready to create, by continually watching the CRD objects created in the API server, the Ceph resources that are the basic components of the Ceph cluster – OSD, manager, metadata server and monitor.

The CRD objects contain all the configuration of the Ceph infrastructure, such as:

  1. Number of monitor pods. For production, at least 3 nodes are recommended.
  2. Number of manager pods. When higher availability of the mgr is needed, increase the count to 2.
  3. Enabling of the dashboard. I enabled it because it’s very useful.
  4. Taints and selectors for separating the nodes where the Ceph pods should run from the business part.
  5. The filter for detecting the device special files to include in the storage area. In this scenario there are, as described in the architecture above, different disks. Every time a device is attached to a node of the cluster, it’s automatically added to the storage and a new OSD deployment/pod is created to manage it.
  6. The path on the host, in the dataDirHostPath variable, where configuration files will be persisted. I suggest a file system different from the root file system.
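For reference, the settings listed above map onto the CephCluster custom resource roughly as follows. This is a trimmed, illustrative excerpt, not the stock cluster.yaml: the image tag and the deviceFilter regex are examples from my lab and must be adapted to your environment.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.5   # example tag; use the one shipped with your Rook release
  # Item 6: host path where Ceph configuration is persisted
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3          # item 1: at least 3 monitors in production
  mgr:
    count: 1          # item 2: raise to 2 for mgr high availability
  dashboard:
    enabled: true     # item 3
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: ^sd[b-f]   # item 5: regex matching the disks to consume
```

Taints and selectors (item 4) would go under spec.placement, which I left out here since my lab shares the nodes between business and Ceph pods.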

This is the command to apply to create the Ceph infrastructure:

$ kubectl create -f cluster.yaml

The OSD pods are created, one for each device attached to the worker nodes, and each one runs on the node where its managed device is attached.

$ kubectl get pods -n rook-ceph |grep osd
rook-ceph-osd-2-7dbf6bb4c7-pzp72 1/1 Running 1 49d
rook-ceph-osd-5-674d9595ff-zf8nh 1/1 Running 1 49d
rook-ceph-osd-4-584f4f4857-nzwq4 1/1 Running 6 119d
rook-ceph-osd-1-59cc49cf99-xljkt 1/1 Running 7 119d
rook-ceph-osd-0-575f688ffc-764n8 1/1 Running 0 38d
rook-ceph-osd-3-5f4984ddcf-wjtvw 1/1 Running 0 38d
rook-ceph-osd-6-6d849c48d7-wpw72 1/1 Running 0 38d

Every pod is managed by a deployment that forces the pod to run on the node where the disk is attached, through the nodeSelector Kubernetes attribute.

$kubectl get deploy -n rook-ceph |grep osd
rook-ceph-osd-2                             1/1     1            1           30d
rook-ceph-osd-5                             1/1     1            1           31d
rook-ceph-osd-4                             1/1     1            1           25d
rook-ceph-osd-1                             1/1     1            1           15d
rook-ceph-osd-0                             1/1     1            1           37d
rook-ceph-osd-3                             1/1     1            1           43d
rook-ceph-osd-6                             1/1     1            1           23d
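The pinning can be verified by inspecting one of these deployments. A trimmed, illustrative excerpt of the pod template (the hostname is an example from my lab) typically contains a nodeSelector like this:

```yaml
# Excerpt of a generated rook-ceph-osd deployment (illustrative)
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: microk8s01   # pins the OSD pod to the node owning the disk
```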

The Ceph manager and three Ceph monitor pods, in high availability, are created too:

$ kubectl get pods -n rook-ceph |grep mon
rook-ceph-mon-b-7d494dd789-fmjw5 1/1 Running 1 49d
rook-ceph-mon-d-69d66dc67d-zj85g 1/1 Running 2 49d
rook-ceph-mon-c-85655988f8-vtz6x 1/1 Running 0 36d
$ kubectl get pods -n rook-ceph |grep mgr
rook-ceph-mgr-a-74c8d8d65f-bllm8 1/1 Running 3 49d

Other running pods are the plugins that create the block storage consumed by a single pod (RWO) and the filesystems shared across multiple pods (RWX), accessible inside the cluster through persistent volume claims. These plugins are associated to storage class definitions.

$ kubectl get pods -n rook-ceph |grep plugin
csi-cephfsplugin-qdfhr 3/3 Running 21 120d
csi-rbdplugin-chhk2 3/3 Running 22 120d
csi-cephfsplugin-7dtt2 3/3 Running 24 120d
csi-rbdplugin-xjqk8 3/3 Running 24 120d
csi-rbdplugin-provisioner-5868bd8b55-hb9d6 6/6 Running 0 38d
csi-cephfsplugin-provisioner-775dcbbc86-k2d4g 6/6 Running 0 38d
csi-rbdplugin-q979d 3/3 Running 18 120d
csi-cephfsplugin-x4mdc 3/3 Running 18 120d
csi-rbdplugin-provisioner-5868bd8b55-bcd9m 6/6 Running 24 49d
csi-cephfsplugin-provisioner-775dcbbc86-85cq5 6/6 Running 8 49d

The next step, before creating the Ceph filesystem, is to install a toolbox useful for troubleshooting.

$ kubectl create -f deploy/examples/toolbox.yaml
$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
[root@rook-ceph-tools-744c55f865-vgr96 /]# ceph status
  cluster:
    id:     4g4d5452-75td-56g2-p45d-34c9487a6769
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,d (age 5w)
    mgr: a (active, since 5w)
    mds: 1/1 daemons up, 2 hot standby
    osd: 7 osds: 7 up (since 5w), 7 in (since 5w)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 161 pgs
    objects: 25.55k objects, 11 GiB
    usage:   36 GiB used, 104 GiB / 140 GiB avail
    pgs:     161 active+clean

  io:
    client:   3.2 KiB/s rd, 853 B/s wr, 4 op/s rd, 0 op/s wr

It’s possible to connect to the Ceph dashboard and monitor the cluster through the GUI. The following shows how to get the admin password and the Kubernetes service to expose, by NodePort or by Ingress (find more info in my article https://www.securityandit.com/network/kubernetes-service-nodeport-ingress-and-loadbalancer/), in order to reach the dashboard.

$kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo
$kubectl get svc -n rook-ceph|grep dash
rook-ceph-mgr-dashboard          ClusterIP           8443/TCP            157d
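If you go the NodePort way, a minimal service sketch could look like the following (the service name and nodePort value are my own choices for illustration, not taken from the Rook examples):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mgr-dashboard-external
  namespace: rook-ceph
spec:
  type: NodePort
  selector:
    app: rook-ceph-mgr   # same selector used by the built-in dashboard service
  ports:
    - port: 8443
      targetPort: 8443
      nodePort: 30443    # example: dashboard reachable at https://<node-ip>:30443
```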

In order to have the cluster ready to use, only for microk8s, I had to apply this fix:

$kubectl edit configmap rook-ceph-operator-config  -n rook-ceph
ROOK_CSI_KUBELET_DIR_PATH: "/var/snap/microk8s/common/var/lib/kubelet"
changed from /var/lib/kubelet to /var/snap/microk8s/common/var/lib/kubelet

As I said, the operator automatically watches for new devices added to your cluster and, if they match the filters configured in the cluster custom resource, a new OSD that manages the local device is created. In fact, I added a new disk to the first node of the cluster:

$fdisk -l |grep sdf
Disk /dev/sdf: 10 GiB, 10737418240 bytes, 20971520 sectors

After an operator restart, even if theoretically it should not be necessary, the disk is detected by the operator and a new OSD is started:

kubectl logs rook-ceph-operator-6b88ff7b4c-hmh7w -f -n rook-ceph
2022-01-16 12:29:38.243892 I | op-osd: creating OSD 4 on node "microk8s01"
2022-01-16 12:29:38.461093 I | clusterdisruption-controller: osd "rook-ceph-osd-4" is down but no node drain is detected
root@microk8s-01:~# kubectl get pods -n rook-ceph |grep osd

rook-ceph-osd-7-699dfccc4f-gcfrh                             1/1     Running     0          2m22s

Kubernetes ceph rook file system

As described in the following link, https://rook.io/docs/rook/v1.7/ceph-filesystem.html, it’s time to create a Ceph file system without having to create all the Ceph objects behind the scenes. Every Ceph file system needs a data pool, the logical partition where the data is stored; its placement groups, automatically scalable, belong to the pool, store the data and are managed by the OSD daemons, one for each attached device. A metadata pool is also needed for storing metadata.

This is the custom resource to create in the cluster:

$vi ceph-filesystem.yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfilesystem
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
$kubectl apply -f ceph-filesystem.yaml
$kubectl get CephFilesystem -A
NAMESPACE   NAME           ACTIVEMDS   AGE
rook-ceph   myfilesystem   1           157d

This Ceph filesystem creates, behind the scenes, Ceph pools that are split into placement groups, each managed by a primary OSD, and each individual object is assigned to one of them. If an OSD fails or the cluster re-balances, Ceph can move or replicate an entire placement group, i.e. all of the objects in the placement group, without having to address each object individually (for further details see https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs).

In our case, every placement group of the data and metadata pools is replicated three times, as specified in the size field.

Inside the tool box, it’s possible to manage the ceph resources:

$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
$ ceph osd pool ls

The next step is to create a storage class inside Kubernetes that, for any persistent volume claim of the same class, through the Ceph plugin installed during the creation of the cluster, will create a persistent volume mountable in write mode by many pods at the same time.

It’s important to set, in the storage class configuration, the Ceph file system name, myfilesystem, and the name of the cluster, rook-ceph.

$vi ceph-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.cephfs.csi.ceph.com
reclaimPolicy: Retain
allowVolumeExpansion: true
parameters:
  # clusterID is the namespace where the operator is deployed.
  clusterID: rook-ceph
  # CephFS filesystem name into which the volume shall be created
  fsName: myfilesystem
  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: myfilesystem-data0
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
$kubectl apply -f ceph-sc.yaml

I suggest setting allowVolumeExpansion to true, for resizing the persistent volume quota on demand, and reclaimPolicy to Retain, for manual reclamation of the resource.
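With allowVolumeExpansion set to true, an already bound claim of this class can later be grown just by raising its storage request and re-applying the manifest. As a sketch, growing a 10Gi claim to 20Gi only requires this change in the claim spec:

```yaml
# Illustrative: only the storage request of the claim changes
spec:
  resources:
    requests:
      storage: 20Gi   # was 10Gi; after re-applying, the CSI plugin expands the volume online
```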

Remember that the provisioner plugin is composed of two pods, one of them active, deployed during the creation of the cluster. Generally, in Kubernetes, the creation of resources, like volumes to mount, virtual hosts to configure or, as happens in the cloud, the deployment of external objects like load balancers, is performed by a plugin: a pod running inside the cluster that, with the right permissions, is able to do its job.

To create the persistent volume claim:

$vi cephfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-cephfs
$ kubectl apply -f cephfs-pvc.yaml
$ kubectl get pvc
NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cephfs-pvc   Bound    pvc-6551a7g6-c9f8-4f78-9678-6h4bhuu6e4e3   10Gi       RWX            rook-cephfs    10s

The volume, created automatically by the Ceph plugin, has been bound to the persistent volume claim and is ready to be mounted by any pod that asks for it. The Kubernetes volume is a simple directory inside the Ceph filesystem, and it can be shown by mounting the Ceph filesystem directly inside a virtual machine.

To do this, you need to authenticate to the Ceph monitor daemons, using the credentials available in the Ceph toolbox, and get the monitor IP endpoints:

$kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
mon_endpoints=$(grep mon_host /etc/ceph/ceph.conf | awk '{print $3}')
my_secret=$(grep key /etc/ceph/keyring | awk '{print $3}')
mount -t ceph -o mds_namespace=myfilesystem,name=admin,secret=$my_secret $mon_endpoints:/ /mnt/
cd /mnt/volumes/csi/csi-vol-34113909-ge18-677d-hg7d-df7hd27dhbc3/34113909-ge18-677d-hg7d-df7hd27dhbc3
touch index.html
echo "Hello World" > index.html

Let’s try to access this volume directly from a pod:

$vi nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: testvolume
      restartPolicy: Always
      volumes:
      - name: testvolume
        persistentVolumeClaim:
          claimName: cephfs-pvc
$kubectl apply -f nginx-deployment.yaml
$kubectl get pods -o wide |grep nginx-depl
nginx-deployment-7fc9c699f5-zbgg4                                 1/1     Running            0          4m59s    microk8s03              
$kubectl exec -it nginx-deployment-7fc9c699f5-zbgg4 -- /bin/bash
root@nginx-deployment-7fc9c699f5-zbgg4:/# df -k |grep csi
1048576     8192   1040384   1% /usr/share/nginx/html
root@nginx-deployment-7fc9c699f5-zbgg4:/# cat /usr/share/nginx/html/index.html
Hello World


In this article I showed how to install Rook Ceph inside a Kubernetes cluster, with the option to use a distributed file system, highly available and quite performant, that I typically suggest using for logging or for sharing a repository among different sources.

In the next article I will show how to create a block device, accessible in write mode from only one pod, useful for deploying, inside Kubernetes, stateful workloads storing SQL or NoSQL databases, message brokers or caching services.