Changes for page Ceph

Last modified by Jonas Jelten on 2024/09/13 15:05

From version 1.1
edited by Jonas Jelten
on 2024/08/23 13:26
Change comment: There is no comment for this version
To version 3.1
edited by Jonas Jelten
on 2024/08/23 14:09
Change comment: There is no comment for this version

Summary

Details

Page properties
Content
... ... @@ -1,1 +1,176 @@
1 1  We offer [Ceph](https://ceph.io) as a scalable and fast way to obtain storage for your needs.
2 +
3 +{{toc/}}
4 +
5 +## How does it work?
6 +
7 +The gist of how Ceph works for you:
8 +We have many servers with SSDs or HDDs, each bought by one organization unit such as a chair. Data is spread across all servers, and each organization unit gets as much storage space as it bought in servers.
9 +
10 +You can access the storage mainly via RBD (RADOS block device), which is a device that behaves like a local disk (USB stick, SSD, ...) but actually stores and retrieves its data from the cluster in our data centre.
11 +
12 +## RBD acquisition
13 +
14 +An RBD is a **storage device** you can use in your servers to store data in our Ceph cluster. It uses either **HDD** or **SSD** storage (cheaper vs. faster).
15 +
16 +For evaluation purposes, you can get small amounts of storage directly.
17 +Otherwise, you can get as much space as you are entitled to.
18 +
19 +Each RBD is stored in a "namespace", which **restricts access** to it. You can have multiple RBDs in the same namespace.
20 +
21 +The name of an RBD is `ORG-name/namespacename/rbdname`.
22 +
23 +To request the creation (or extension) of an RBD, write to [support@ito.cit.tum.de](mailto:support@ito.cit.tum.de) specifying **name**, **size**, **namespace** and **HDD/SSD**.
24 +
25 +You will get back a secret **keyring** to access the namespace.
26 +
27 +## RBD mapping
28 +
29 +In order to "use" an RBD in your server, you need to "map" it.
30 +
31 +Have the name and the keyring of the RBD ready.
32 +
33 +* Please install `ceph-common`, at least in version 15.
34 + * It contains a tool named `rbdmap`, which (as the name suggests) can map your RBD.
35 +* Edit `/etc/ceph/rbdmap` to add a line for your RBD
36 + * it has the format: `rbdname name=keyringname,options=...`
37 + * `ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'`
38 +* Place the keyring file in `/etc/ceph/`
39 + * Filename: `ceph.client.ORG.rbd.namespacename.keyring`
40 + * Permissions: 700
41 + * Owner: root
42 + * Content: the client identifier and the 28-byte key in base64 encoding.
43 +
44 +```
45 +[client.ORG.rbd.namespacename]
46 +key = ASD+OdlsdoTQJxFFljfCDEf/ASDFlYIbEbZatg==
47 +```
48 +
49 +* `systemctl enable --now rbdmap.service` so the RBD device is created now and on every system start.
50 +* You should now have a `/dev/rbd0` device
51 +* You can list current mapping status with `rbd device list`
52 +* You can manually map/unmap with `rbd device map $rbdname` and `rbd device unmap $rbdname` (see the sketch after this list)
53 +
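+A minimal sketch of this workflow, using the placeholder names from above (substitute your own pool, namespace, RBD and client names; the `--id` flag for manual mapping is an assumption here, `rbdmap` normally takes the client name from the `name=` entry):
+
+```bash
+# enable mapping on every boot and map now
+systemctl enable --now rbdmap.service
+
+# show which RBDs are currently mapped to which /dev/rbdX device
+rbd device list
+
+# map/unmap by hand (normally handled by rbdmap.service)
+rbd device map ORG-name/namespacename/rbdname --id ORG.rbd.namespacename
+rbd device unmap ORG-name/namespacename/rbdname
+```
+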
54 +Now you have a raw storage device, but you can't yet store files on it, since you are missing a filesystem.
55 +
56 +
57 +## RBD formatting
58 +
59 +Now that you have mapped your RBD, we can create file system structures on it.
60 +
61 +This is as simple as running (the stride/stripe_width of 1024 means 1024 × 4 KiB filesystem blocks, which lines the filesystem up with Ceph's 4 MiB objects):
62 +
63 +```
64 +mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/rbdxxx
65 +```
66 +
67 +Get the UUID of the newly created filesystem:
68 +```
69 +sudo blkid /dev/rbdxxx
70 +```
71 +
72 +Now create the mount point directory (`mkdir -p /your/mount/point`) and add an entry to `/etc/fstab` with `noauto`, so the helper script below triggers the mount and it is not attempted too early during boot.
73 +
74 +`/etc/fstab`:
75 +```
76 +UUID=your-new-fs-uuid /your/mount/point ext4 defaults,_netdev,acl,noauto,nodev,nosuid,noatime,stripe=1024 0 0
77 +```
78 +
79 +In order to mount this filesystem on your server, we need a mount helper script: without it, `/etc/fstab` would try to mount the filesystem during boot before the RBD is mapped.
80 +
81 +`/etc/ceph/rbd.d/ORG-name/namespacename/rbdname`:
82 +```bash
83 +#!/bin/bash
84 +
85 +# lvm may disable vgs when not all blocks were available during scan
86 +pvscan
87 +vgchange -ay
88 +
89 +# mount all the filesystems
90 +mountpoint -q /your/mount/point || mount /your/mount/point
91 +```
92 +Mark this script *executable* so `rbdmap` can run it as a post-mapping hook!
93 +
94 +To test, either restart `rbdmap.service` or manually call `umount` and `mount` for `/your/mount/point`.
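+
+For example (a sketch, using the paths from above):
+
+```bash
+# make the hook script executable so rbdmap runs it after mapping
+chmod +x /etc/ceph/rbd.d/ORG-name/namespacename/rbdname
+
+# test the whole map + mount chain ...
+systemctl restart rbdmap.service
+
+# ... or just the mount itself
+umount /your/mount/point
+mount /your/mount/point
+mountpoint /your/mount/point
+```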
95 +
96 +
97 +## LVM on RBD
98 +
99 +You can create LVM `pvs` and `lvs` on your RBD. You can use this for read/write caching, for example (see below).
100 +This works as usual: just run `pvcreate` etc., as sketched below.
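+
+A minimal sketch, assuming your mapped RBD is `/dev/rbd0` and using the hypothetical VG/LV names `datavg`/`datalv` (the same names the caching example below uses):
+
+```bash
+# use the mapped RBD as an LVM physical volume
+pvcreate /dev/rbd0
+
+# create a volume group and a logical volume spanning it
+vgcreate datavg /dev/rbd0
+lvcreate -n datalv -l '100%FREE' datavg
+
+# the filesystem then goes onto the LV instead of the raw RBD
+mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/datavg/datalv
+```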
101 +
102 +
103 +## RBD tuning
104 +
105 +To get more performance, there are some useful tweaks:
106 +
107 +### CPU Bugs
108 +
109 +If your server is sufficiently shielded behind firewalls and not susceptible to attacks, you can disable the CPU bug mitigations with a kernel command line parameter for a performance boost:
110 +
111 +`/etc/default/grub`:
112 +```
113 +GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
114 +```
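+
+After editing the file, regenerate the GRUB configuration and reboot (a sketch; the exact command depends on your distribution):
+
+```bash
+# Debian/Ubuntu
+update-grub
+# most other distributions:
+# grub-mkconfig -o /boot/grub/grub.cfg
+reboot
+```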
115 +
116 +### Read-Ahead
117 +
118 +We read ahead 1MiB, since Ceph stores the objects in 4MiB blocks anyway.
119 +
120 +`/etc/udev/rules.d/90-ceph-rbd.rules`:
121 +```
122 +KERNEL=="rbd[0-9]*", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="1024", ATTR{queue/scheduler}="none", ATTR{queue/wbt_lat_usec}="0", ATTR{queue/nr_requests}="2048"
123 +```
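+
+To apply the rule without rebooting, reload udev and re-trigger the device (a sketch, assuming your device is `/dev/rbd0`; unmapping and re-mapping works as well):
+
+```bash
+# reload the rules and re-apply them to the existing device
+udevadm control --reload
+udevadm trigger --action=change /dev/rbd0
+
+# verify the new settings
+cat /sys/block/rbd0/bdi/read_ahead_kb
+cat /sys/block/rbd0/queue/scheduler
+```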
124 +
125 +### LVM-Cache
126 +
127 +See `man 7 lvmcache`.
128 +We can cache the RBD on a local NVMe for more performance.
129 +
130 +* `/dev/fastdevice` is the name of the local NVMe.
131 +* `/dev/datavg/datalv` is the name of your existing logical volume containing the data stored on Ceph.
132 +* we recommend writeback caching
133 +
134 +```bash
135 +## setup
136 +# cache device
137 +pvcreate /dev/fastdevice
138 +
139 +# add cache device to vg to cache
140 +vgextend datavg /dev/fastdevice
141 +
142 +# create cache pool (meta+data combined):
143 +lvcreate -n cache --type cache-pool -l '100%FREE' datavg /dev/fastdevice
144 +
145 +# enable caching
146 +#
147 +# --type cache (recommended): use dm-cache for read and write caching
148 +# --cachemode: do we cache writes?
149 +# buffer writes: writeback
150 +# no write buffering: writethrough
151 +#
152 +# --type writecache: only ever cache writes, not reads
153 +#
154 +# --chunksize data block management size
155 +lvconvert --type cache --cachepool cache --cachemode writeback --chunksize 1024KiB /dev/datavg/datalv
156 +
157 +## status
158 +# check status
159 +lvs -ao+devices
160 +
161 +## resizing
162 +lvconvert --splitcache /dev/datavg/datalv
163 +lvextend -l +100%FREE /dev/datavg/datalv
164 +lvconvert ... # to enable caching again
165 +
166 +## disabling
167 +# deactivate and keep cache lv
168 +lvconvert --splitcache /dev/datavg/datalv
169 +
170 +# disable and delete cache lv -> cache-pv still part of vg!
171 +# watch out when resizing the lv -> the cache-pv will get parts of the lv then, use pvmove to remove again.
172 +lvconvert --uncache /dev/datavg/datalv
173 +
174 +# remove pv from vg
175 +vgreduce datavg /dev/fastdevice
176 +```
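+
+To see whether the cache is actually being used, the cache counters can be queried via `lvs` reporting fields (a sketch; field names as documented in `lvs(8)`, using the hypothetical `datavg`/`datalv` names from above):
+
+```bash
+# cache usage and hit/miss counters of the cached LV
+lvs -o name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses datavg/datalv
+```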