Ceph


We offer Ceph as a scalable and fast way to obtain storage for your needs.

How does it work?

The gist of how Ceph works for you:
We have many servers with SSDs or HDDs, each bought by one organizational unit such as a chair. Data is spread across all servers, and each organizational unit gets as much storage space as it bought in servers.

You can access the storage mainly via RBD (RADOS block device), which is a device behaving like a local disk (USB stick, SSD, ...), but actually stores and retrieves data from the cluster in our data centre.

RBD acquisition

An RBD is a storage device you can use in your servers to store data in our Ceph cluster. It uses either HDD or SSD storage (cheaper vs. faster).

For evaluation purposes, you can get small amounts of storage directly.
Otherwise, you can get as much space as you are entitled to.

Each RBD is stored in a "namespace", which restricts access to it. You can have multiple RBDs in the same namespace.

The name of an RBD is ORG-name/namespacename/rbdname.

To request the creation (or extension) of an RBD, write to support@ito.cit.tum.de specifying name, size, namespace and HDD/SSD.

You will get back a secret keyring to access the namespace.

RBD mapping

In order to "use" an RBD in your server, you need to "map" it.

You should have the name and keyring of the RBD ready.

  • Please install ceph-common, at least in version 15.
    • It contains a tool named rbdmap, which (as the name suggests) can map your RBD.
  • Edit /etc/ceph/rbdmap and add a line for your RBD (see the consolidated example after this list)
    • It has the format: rbdname name=keyringname,options=...
    • Example: ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'
  • Place the keyring file in /etc/ceph/
    • Filename: ceph.client.ORG.rbd.namespacename.keyring
    • Permissions: 700
    • Owner: root
    • Content: the client identifier and the 28-byte key in base64 encoding:
[client.ORG.rbd.namespacename]
key = ASD+OdlsdoTQJxFFljfCDEf/ASDFlYIbEbZatg==
  • Run systemctl enable --now rbdmap.service so the RBD device is created now and on every system start.
  • You should now have a /dev/rbd0 device
  • You can list current mapping status with rbd device list
  • You can manually map/unmap with rbd device map $rbdname and rbd device unmap $rbdname
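
Putting it all together, here is a minimal sketch for a hypothetical RBD ORG-name/namespacename/rbdname (the names, the key and the device number are placeholders):

# append the mapping line to /etc/ceph/rbdmap
echo "ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'" >> /etc/ceph/rbdmap

# install the keyring with restrictive permissions
install -m 700 -o root -g root ceph.client.ORG.rbd.namespacename.keyring /etc/ceph/

# map now and on every boot, then check the result
systemctl enable --now rbdmap.service
rbd device list
ls -l /dev/rbd0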

Now you have a raw storage device, but you can't yet store files on it, since you are missing a filesystem.

RBD formatting

Now that you have mapped your RBD, we can create file system structures on it.

This is as simple as running the following (stride and stripe_width of 1024 filesystem blocks of 4 KiB each match Ceph's 4 MiB object size):

mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/rbdxxx

Get the UUID of the newly created filesystem:

sudo blkid /dev/rbdxxx

Now create an entry in /etc/fstab with noauto, so that the mount is triggered by the helper script below and not attempted too early during boot.

/etc/fstab:

UUID=your-new-fs-uuid /your/mount/point ext4 defaults,_netdev,acl,noauto,nodev,nosuid,noatime,stripe=1024 0 0
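
Once the fstab entry exists and the RBD is mapped, you can create the mount point and check the entry by hand (the mount point is a placeholder; at boot, the helper script below does the mounting):

mkdir -p /your/mount/point
mount /your/mount/point
findmnt /your/mount/point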

In order to mount this filesystem on your server, we need a mount helper script (otherwise the RBD is not yet mapped at system start when /etc/fstab tries to mount it directly during boot).

/etc/ceph/rbd.d/ORG-name/namespacename/rbdname:

#!/bin/bash

# lvm may disable vgs when not all blocks were available during scan
pvscan
vgchange -ay

# mount all the filesystems
mountpoint -q /your/mount/point || mount /your/mount/point

Mark this script executable so rbdmap can execute it as a post-mapping hook!

To test, either restart rbdmap.service or manually call umount and mount for /your/mount/point.
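
For example (paths are the placeholders from above):

# the hook must be executable for rbdmap to run it
chmod +x /etc/ceph/rbd.d/ORG-name/namespacename/rbdname

# test the whole chain: unmount, re-run the mapping service, verify the hook remounted it
umount /your/mount/point
systemctl restart rbdmap.service
mountpoint /your/mount/point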

LVM on RBD

You can create LVM PVs and LVs on your RBD, for example to use them for read/write caching (see below). This works as usual: just run pvcreate etc.
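
A minimal sketch, assuming the mapped device is /dev/rbd0 and using the hypothetical names datavg/datalv (the same names are used in the LVM-Cache examples below):

# turn the RBD into a PV and build a VG and LV on top of it
pvcreate /dev/rbd0
vgcreate datavg /dev/rbd0
lvcreate -n datalv -l 100%FREE datavg

# then format and mount /dev/datavg/datalv instead of the raw RBD
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/datavg/datalv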

RBD tuning

To get more performance, there are some useful tweaks.

CPU Bugs

If your server is sufficiently shielded behind firewalls and not susceptible to attacks, you can disable the CPU bug mitigations via a kernel command line parameter for a performance boost:

/etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
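
To apply the change (a sketch for Debian/Ubuntu-style systems; the sysfs path is a standard kernel interface):

# regenerate the GRUB config and reboot
update-grub
reboot

# after the reboot, check which mitigations are disabled
grep . /sys/devices/system/cpu/vulnerabilities/*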

Read-Ahead

We read ahead 1 MiB, since Ceph stores objects in 4 MiB blocks anyway. We also allow more parallel requests and use no I/O scheduler (since Ceph is distributed, all requests have roughly equal latency anyway).

/etc/udev/rules.d/90-ceph-rbd.rules:

KERNEL=="rbd[0-9]*", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="1024" ATTR{queue/scheduler}="none" ATTR{queue/wbt_lat_usec}="0" ATTR{queue/nr_requests}="2048"

LVM-Cache

See man 7 lvmcache. We can cache the RBD on a local NVMe for more performance.

  • /dev/fastdevice is the name of the local NVMe.
  • /dev/datavg/datalv is the name of your existing logical volume containing the data stored on Ceph.
  • We recommend read and write caching, and a local fastdevice size of at least 50 GiB; the more, the better.

## setup
# cache device
pvcreate /dev/fastdevice

# add cache device to vg to cache
vgextend datavg /dev/fastdevice

# create cache pool (meta+data combined):
lvcreate -n cache --type cache-pool -l '100%FREE' datavg /dev/fastdevice

# enable caching
#
# --type cache (recommended): use dm-cache for read and writecache
#   --cachemode: do we cache writes?
#     buffer writes: writeback
#     no write buffering: writethrough
#
# --type writecache: only ever cache writes, not reads
#
# --chunksize data block management size
lvconvert --type cache --cachepool cache --cachemode writeback --chunksize 1024k /dev/datavg/datalv

## status
# check status
lvs -ao+devices

## resizing
lvconvert --splitcache /dev/datavg/datalv
lvextend -l +100%FREE /dev/datavg/datalv
lvconvert ... # to enable caching again

## disabling
# deactivate and keep cache lv
lvconvert --splitcache /dev/datavg/datalv

# disable and delete cache lv -> cache-pv still part of vg!
# watch out when resizing the lv -> the cache-pv will get parts of the lv then, use pvmove to remove again.
lvconvert --uncache /dev/datavg/datalv

# remove pv from vg
vgreduce datavg /dev/fastdevice
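
To watch how the cache behaves, one option is the dm-cache status line and the lvs cache fields (a sketch using the datavg/datalv names from above; available lvs fields can be listed with lvs -o help):

# low-level statistics of the dm-cache target
dmsetup status datavg-datalv

# cache usage and hit/miss counters as reported by lvm
lvs -a -o +cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses datavg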

NFS tuning

In /etc/default/nfs-kernel-server:

echo "1048576" > /proc/fs/nfsd/max_block_size   # allow 1 MiB I/O size (even more is possible)
RPCNFSDCOUNT=64    # number of worker threads
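
A sketch for applying this: max_block_size can only be changed while no nfsd threads are running, so stop the server first (systemd unit name as on Debian/Ubuntu):

# stop nfsd, raise the block size, start it again
systemctl stop nfs-kernel-server
echo "1048576" > /proc/fs/nfsd/max_block_size
systemctl start nfs-kernel-server

# verify
cat /proc/fs/nfsd/max_block_size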