# Ceph
We offer [Ceph](https://ceph.io) as a scalable and fast way to obtain storage for your needs.

{{toc/}}

## How does it work?

The gist of how Ceph works for you:
We have many servers with SSDs or HDDs, each bought by an organization unit such as a chair. Data is spread across all servers, and each organization unit gets as much storage space as it bought in servers.

You can access the storage mainly via RBD (RADOS block device), a device that behaves like a local disk (USB stick, SSD, ...) but actually stores and retrieves data from the cluster in our data centre.

## RBD acquisition

An RBD is a **storage device** you can use in your servers to store data in our Ceph cluster. It uses either **HDD** or **SSD** storage (cheaper vs. faster).

For evaluation purposes, you can get small amounts of storage directly.
Otherwise, you can get as much space as you are entitled to.

Each RBD is stored in a "namespace", which **restricts access** to it. You can have multiple RBDs in the same namespace.

The name of an RBD is `ORG-name/namespacename/rbdname`.

To request the creation (or extension) of an RBD, write to [support@ito.cit.tum.de](mailto:support@ito.cit.tum.de), specifying **name**, **size**, **namespace** and **HDD/SSD**.

You will get back a secret **keyring** to access the namespace.

## RBD mapping

In order to "use" an RBD in your server, you need to "map" it.

Have the name and keyring of the RBD ready.

* Please install `ceph-common`, at least in version 15.
* It contains a tool named `rbdmap`, which can (what a surprise) map your RBD.
* Edit `/etc/ceph/rbdmap` and add a line for your RBD.
  * The format is: `rbdname name=keyringname,options=...`
  * Example: `ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'`
* Place the keyring file in `/etc/ceph/`.
  * Filename: `ceph.client.ORG.rbd.namespacename.keyring`
  * Permissions: 700
  * Owner: root
  * Content: the client identifier and the 28-byte key in base64 encoding:

```
[client.ORG.rbd.namespacename]
key = ASD+OdlsdoTQJxFFljfCDEf/ASDFlYIbEbZatg==
```

* Run `systemctl enable --now rbdmap.service` so the RBD device is created now and on every system start.
* You should now have a `/dev/rbd0` device.
* You can list the current mapping status with `rbd device list`.
* You can manually map/unmap with `rbd device map $rbdname` and `rbd device unmap $rbdname`.

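
Putting these steps together, a condensed sketch using the same placeholders as above (`ORG-name`, `namespacename`, `rbdname`; `your.keyring` stands for the keyring file you received):

```bash
# install the Ceph client tools (Debian/Ubuntu shown)
apt install ceph-common

# register the RBD so rbdmap picks it up
echo "ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'" >> /etc/ceph/rbdmap

# install the keyring you received, readable only by root
install -m 700 -o root -g root your.keyring /etc/ceph/ceph.client.ORG.rbd.namespacename.keyring

# map now and on every boot
systemctl enable --now rbdmap.service

# verify
rbd device list
lsblk /dev/rbd0
```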
Now you have a raw storage device, but you can't store files on it yet, since it has no filesystem.


## RBD formatting

Now that you have mapped your RBD, you can create a filesystem on it.

This is as simple as running:

```
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/rbdxxx
```

Get the newly created filesystem's UUID:

```
sudo blkid /dev/rbdxxx
```

Now create an entry in `/etc/fstab` with `noauto`, so that the helper script below triggers the mount and it is not attempted too early during boot.

`/etc/fstab`:
```
UUID=your-new-fs-uuid /your/mount/point ext4 defaults,_netdev,acl,noauto,nodev,nosuid,noatime,stripe=1024 0 0
```

In order to mount this filesystem in your server, we need a mount helper script (otherwise `/etc/fstab` would try to mount the filesystem during boot, before the RBD is mapped).

`/etc/ceph/rbd.d/ORG-name/namespacename/rbdname`:
```bash
#!/bin/bash

# LVM may have deactivated VGs if not all blocks were available during its scan
pvscan
vgchange -ay

# mount all the filesystems
mountpoint -q /your/mount/point || mount /your/mount/point
```
Mark this script *executable* so `rbdmap` can run it as a post-mapping hook!

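
For example:

```bash
chmod +x '/etc/ceph/rbd.d/ORG-name/namespacename/rbdname'
```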
To test, either restart `rbdmap.service` or manually call `umount` and `mount` for `/your/mount/point`.

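
A quick check of the whole chain could look like this (assuming the mount point from above):

```bash
systemctl restart rbdmap.service

# the filesystem should be mounted again and show the expected size
mountpoint /your/mount/point
df -h /your/mount/point
```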

## LVM on RBD

You can create LVM physical and logical volumes (`pvs`/`lvs`) on your RBD, for example to use read/write caching (see below).
This works as usual, just run `pvcreate` etc., as sketched below.

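
A minimal sketch, assuming the mapped device is `/dev/rbd0` and using the volume group/logical volume names `datavg`/`datalv` that the caching example below also uses:

```bash
# turn the mapped RBD into an LVM physical volume
pvcreate /dev/rbd0

# create a volume group and one logical volume spanning it
vgcreate datavg /dev/rbd0
lvcreate -n datalv -l 100%FREE datavg

# the LV is then formatted and mounted just like the plain RBD above
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/datavg/datalv
```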

## RBD tuning

To get more performance, there are some useful tweaks.

### CPU Bugs

If your server is sufficiently shielded behind firewalls and not susceptible to attacks, you can disable the CPU bug mitigations for a performance boost via a kernel command line parameter:

`/etc/default/grub`:
```
GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
```
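After editing the file, regenerate the GRUB configuration and reboot; on Debian/Ubuntu-style systems:

```bash
# update-grub is a wrapper around grub-mkconfig -o /boot/grub/grub.cfg
sudo update-grub
sudo reboot

# after the reboot, check the mitigation status
cat /sys/devices/system/cpu/vulnerabilities/*
```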

### Read-Ahead

We read ahead 1 MiB, since Ceph stores the objects in 4 MiB blocks anyway.

`/etc/udev/rules.d/90-ceph-rbd.rules`:
```
KERNEL=="rbd[0-9]*", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="1024", ATTR{queue/scheduler}="none", ATTR{queue/wbt_lat_usec}="0", ATTR{queue/nr_requests}="2048"
```
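To apply the rule without rebooting, reload the udev rules, re-trigger the device and check the resulting sysfs values, e.g.:

```bash
udevadm control --reload
udevadm trigger --action=change /dev/rbd0

# verify
cat /sys/block/rbd0/bdi/read_ahead_kb
cat /sys/block/rbd0/queue/scheduler
```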

### LVM-Cache

See `man 7 lvmcache`.
We can cache the RBD on a local NVMe for more performance.

* `/dev/fastdevice` is the name of the local NVMe.
* `/dev/datavg/datalv` is the name of your existing logical volume containing the data stored on Ceph.
* We recommend writeback caching.

```bash
## setup
# cache device
pvcreate /dev/fastdevice

# add the cache device to the vg we want to cache
vgextend datavg /dev/fastdevice

# create the cache pool (meta+data combined):
lvcreate -n cache --type cache-pool -l '100%FREE' datavg /dev/fastdevice

# enable caching
#
# --type cache (recommended): use dm-cache for read and write caching
# --cachemode: do we cache writes?
#   buffer writes: writeback
#   no write buffering: writethrough
#
# --type writecache: only ever cache writes, not reads
#
# --chunksize: data block management size
lvconvert --type cache --cachepool cache --cachemode writeback --chunksize 1024KiB /dev/datavg/datalv

## status
# check status
lvs -ao+devices

## resizing
lvconvert --splitcache /dev/datavg/datalv
lvextend -l +100%FREE /dev/datavg/datalv
lvconvert ... # to enable caching again

## disabling
# deactivate and keep the cache lv
lvconvert --splitcache /dev/datavg/datalv

# disable and delete the cache lv -> the cache pv is still part of the vg!
# watch out when resizing the lv -> the cache pv will then receive parts of the lv; use pvmove to move them off again.
lvconvert --uncache /dev/datavg/datalv

# remove the pv from the vg
vgreduce datavg /dev/fastdevice
```
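To see whether the cache is actually used, the dm-cache counters can be queried via `lvs`; a sketch, with field names as listed in `lvmcache(7)` and `lvs -o help`:

```bash
# hit/miss and occupancy counters for the cached LV
lvs -o+cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses datavg/datalv
```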