We offer [Ceph](https://ceph.io) as a scalable and fast way to obtain storage for your needs.

{{toc/}}

## How does it work?

The gist of how Ceph works for you:
We have many servers with SSDs or HDDs, each bought by one organization unit such as a chair. Data is spread across all servers, and each organization unit gets as much storage space as they bought in servers.

You can access the storage mainly via RBD (RADOS block device), which is a device behaving like a local disk (USB stick, SSD, ...), but actually stores and retrieves data from the cluster in our data centre.

## RBD acquisition

An RBD is a **storage device** you can use in your servers to store data in our Ceph cluster. It uses either **HDD** or **SSD** storage (cheaper vs. faster).

For evaluation purposes, you can get small amounts of storage directly.
Otherwise, you can get as much space as you are entitled to.

Each RBD is stored in a "namespace", which **restricts access** to it. You can have multiple RBDs in the same namespace.

The name of an RBD is `ORG-name/namespacename/rbdname`.

To request the creation (or extension) of an RBD, write to [support@ito.cit.tum.de](mailto:support@ito.cit.tum.de) specifying **name**, **size**, **namespace** and **HDD/SSD**.

You will get back a secret **keyring** to access the namespace.

## RBD mapping

In order to "use" an RBD in your server, you need to "map" it.

You should have the name and keyring of the RBD ready.

* Please install `ceph-common`, at least version 15.
* It contains a tool named `rbdmap`, which can, as the name suggests, map your RBD.
* Edit `/etc/ceph/rbdmap` and add a line for your RBD
* It has the format: `rbdname name=keyringname,options=...`
* e.g. `ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'`
* Place the keyring file in `/etc/ceph/`
* Filename: `ceph.client.ORG.rbd.namespacename.keyring`
* Permissions: 700
* Owner: root
* Content: the client identifier and the 28-byte key in base64 encoding:

```
[client.ORG.rbd.namespacename]
key = ASD+OdlsdoTQJxFFljfCDEf/ASDFlYIbEbZatg==
```
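
For example, after saving the keyring file (placeholder names as above), you could set owner and permissions like this:

```bash
# adjust ORG / namespacename to your actual names
chown root:root /etc/ceph/ceph.client.ORG.rbd.namespacename.keyring
chmod 700 /etc/ceph/ceph.client.ORG.rbd.namespacename.keyring
```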

* Run `systemctl enable --now rbdmap.service` so the RBD device is created now and on every system start.
* You should now have a `/dev/rbd0` device.
* You can list the current mapping status with `rbd device list`.
* You can manually map/unmap with `rbd device map $rbdname` and `rbd device unmap $rbdname`, as sketched below.
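
A minimal sketch of a manual round-trip, assuming the placeholder names from above and that the keyring in `/etc/ceph/` is found automatically via `--id`:

```bash
# manual mapping round-trip (placeholder names, adjust to yours)
rbd --id ORG.rbd.namespacename device map ORG-name/namespacename/rbdname
rbd device list
rbd --id ORG.rbd.namespacename device unmap ORG-name/namespacename/rbdname
```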

Now you have a raw storage device, but you can't store files on it yet, since it is missing a filesystem.


## RBD formatting

Now that you have mapped your RBD, we can create a filesystem on it.

This is as simple as running:

```
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/rbdxxx
```

Get the newly created filesystem's UUID:
```
sudo blkid /dev/rbdxxx
```

Now we create an entry in `/etc/fstab` with `noauto`, so that the script below triggers the mount and the mount is not attempted too early during boot.

`/etc/fstab`:
```
UUID=your-new-fs-uuid /your/mount/point ext4 defaults,_netdev,acl,noauto,nodev,nosuid,noatime,stripe=1024 0 0
```

In order to mount this filesystem on your server, we need a mount helper script (otherwise the RBD is not yet mapped at system start when `/etc/fstab` tries to mount it directly during boot).

`/etc/ceph/rbd.d/ORG-rbd/namespacename/rbdname`:
```bash
#!/bin/bash

# lvm may disable vgs when not all blocks were available during scan
pvscan
vgchange -ay

# mount all the filesystems
mountpoint -q /your/mount/point || mount /your/mount/point
```

Mark this script *executable* so `rbdmap` can execute it as a post-mapping hook!

To test, either restart `rbdmap.service` or manually call `umount` and `mount` for `/your/mount/point`.
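
For example (using the placeholder paths from above):

```bash
# make the hook executable
chmod +x /etc/ceph/rbd.d/ORG-rbd/namespacename/rbdname

# re-run the mapping + mount hook and check the result
systemctl restart rbdmap.service
mountpoint /your/mount/point
```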


## LVM on RBD

You can create LVM PVs and LVs on your RBD. You can use this for read/write caching, for example (see below).
This works as usual, just run `pvcreate` etc., as in the sketch below.
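
A minimal sketch, assuming `/dev/rbd0` is your mapped RBD and using the `datavg`/`datalv` names that the LVM-Cache example below builds on:

```bash
# create a physical volume, volume group and logical volume on the RBD
pvcreate /dev/rbd0
vgcreate datavg /dev/rbd0
lvcreate -n datalv -l '100%FREE' datavg

# then put a filesystem on /dev/datavg/datalv, analogous to the RBD formatting above
```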


## RBD tuning

To get more performance, there are some useful tweaks.

### CPU Bugs

If your server is sufficiently shielded behind firewalls and not susceptible to attacks, you can disable the CPU bug mitigations with a kernel command line parameter for a performance boost:

`/etc/default/grub`:
```
GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
```
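
Afterwards, regenerate the GRUB configuration and reboot; the exact command depends on your distribution, for example:

```bash
# Debian/Ubuntu
update-grub
# or, on other distributions (config path may differ)
grub-mkconfig -o /boot/grub/grub.cfg
```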

### Read-Ahead

We read ahead 1MiB, since Ceph stores the objects in 4MiB blocks anyway.

`/etc/udev/rules.d/90-ceph-rbd.rules`:
```
KERNEL=="rbd[0-9]*", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="1024", ATTR{queue/scheduler}="none", ATTR{queue/wbt_lat_usec}="0", ATTR{queue/nr_requests}="2048"
```
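
To apply the rule to already-mapped devices without a reboot, reload the rules and trigger a change event:

```bash
# reload udev rules and re-apply them to existing block devices
udevadm control --reload-rules
udevadm trigger --subsystem-match=block --action=change
```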

### LVM-Cache

See `man 7 lvmcache`.
We can cache the RBD on a local NVMe for more performance.

* `/dev/fastdevice` is the name of the local NVMe.
* `/dev/datavg/datalv` is the name of your existing logical volume containing all the data stored on Ceph.
* We recommend writeback caching.

```bash
## setup
# cache device
pvcreate /dev/fastdevice

# add the cache device to the vg we want to cache
vgextend datavg /dev/fastdevice

# create the cache pool (meta+data combined):
lvcreate -n cache --type cache-pool -l '100%FREE' datavg /dev/fastdevice

# enable caching
#
# --type cache (recommended): use dm-cache for read and write caching
# --cachemode: do we cache writes?
#   buffer writes: writeback
#   no write buffering: writethrough
#
# --type writecache: only ever cache writes, not reads
#
# --chunksize: data block management size
lvconvert --type cache --cachepool cache --cachemode writeback --chunksize 1024KiB /dev/datavg/datalv

## status
# check status
lvs -ao+devices

## resizing
lvconvert --splitcache /dev/datavg/datalv
lvextend -l +100%FREE /dev/datavg/datalv
lvconvert ... # to enable caching again

## disabling
# deactivate and keep the cache lv
lvconvert --splitcache /dev/datavg/datalv

# disable and delete the cache lv -> the cache pv is still part of the vg!
# watch out when resizing the lv -> the cache pv will then get parts of the lv; use pvmove to move them off again.
lvconvert --uncache /dev/datavg/datalv

# remove the pv from the vg
vgreduce datavg /dev/fastdevice
```
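
To check whether the cache is actually being used, you can look at the dm-cache hit/miss counters; a quick way, assuming the LV names from above (the device-mapper name is then `datavg-datalv`):

```bash
# show dm-cache statistics (hits/misses, dirty blocks) for the cached LV
dmsetup status datavg-datalv
```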