# Ceph
We offer [Ceph](https://ceph.io) as a scalable and fast way to obtain storage for your needs.

{{toc/}}

## How does it work?

The gist of how Ceph works for you:
We have many servers with SSDs or HDDs, each bought by an organization unit such as a chair. Data is spread across all servers, and each organization unit gets as much storage space as it bought in servers.

You can access the storage mainly via RBD (RADOS block device), a device that behaves like a local disk (USB stick, SSD, ...) but actually stores and retrieves data from the cluster in our data centre.

## RBD acquisition

An RBD is a **storage device** you can use in your servers to store data in our Ceph cluster. It uses either **HDD** or **SSD** storage (cheaper vs. faster).

For evaluation purposes, you can get small amounts of storage directly.
Otherwise, you can get as much space as you are entitled to.

Each RBD is stored in a "namespace", which **restricts access** to it. You can have multiple RBDs in the same namespace.

The name of an RBD is `ORG-name/namespacename/rbdname`.

To request the creation (or extension) of an RBD, write to [support@ito.cit.tum.de](mailto:support@ito.cit.tum.de), specifying **name**, **size**, **namespace** and **HDD/SSD**.

You will get back a secret **keyring** to access the namespace.

## RBD mapping

In order to "use" an RBD in your server, you need to "map" it.

Have the name and keyring of the RBD ready.

* Please install `ceph-common`, at least in version 15.
* It contains a tool named `rbdmap`, which can (what a surprise) map your RBD.
* Edit `/etc/ceph/rbdmap` and add a line for your RBD.
  * The format is: `rbdname name=keyringname,options=...`
  * Example: `ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'`
* Place the keyring file in `/etc/ceph/`.
  * Filename: `ceph.client.ORG.rbd.namespacename.keyring`
  * Permissions: 700
  * Owner: root
  * Content: the client identifier and the 28-byte key in base64 encoding:

```
[client.ORG.rbd.namespacename]
key = ASD+OdlsdoTQJxFFljfCDEf/ASDFlYIbEbZatg==
```

* Run `systemctl enable --now rbdmap.service` so the RBD device is created now and on every system start.
* You should now have a `/dev/rbd0` device.
* You can list the current mapping status with `rbd device list`.
* You can manually map/unmap with `rbd device map $rbdname` and `rbd device unmap $rbdname`.

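
Putting these steps together, a condensed sketch using the same placeholders as above (`ORG-name`, `namespacename`, `rbdname`; `your.keyring` stands for the keyring file you received):

```bash
# install the Ceph client tools (Debian/Ubuntu shown)
apt install ceph-common

# register the RBD so rbdmap picks it up
echo "ORG-name/namespacename/rbdname name=client.ORG.rbd.namespacename,options='queue_depth=1024'" >> /etc/ceph/rbdmap

# install the keyring you received, readable only by root
install -m 700 -o root -g root your.keyring /etc/ceph/ceph.client.ORG.rbd.namespacename.keyring

# map now and on every boot
systemctl enable --now rbdmap.service

# verify
rbd device list
lsblk /dev/rbd0
```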
Now you have a raw storage device, but you can't store files on it yet, since it has no filesystem.


## RBD formatting

Now that you have mapped your RBD, you can create a filesystem on it.

This is as simple as running:

```
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/rbdxxx
```

Get the newly created filesystem's UUID:

```
sudo blkid /dev/rbdxxx
```

Now create an entry in `/etc/fstab` with `noauto`, so that the helper script below triggers the mount and it is not attempted too early during boot.

`/etc/fstab`:
```
UUID=your-new-fs-uuid /your/mount/point ext4 defaults,_netdev,acl,noauto,nodev,nosuid,noatime,stripe=1024 0 0
```

In order to mount this filesystem in your server, we need a mount helper script (otherwise `/etc/fstab` would try to mount the filesystem during boot, before the RBD is mapped).

`/etc/ceph/rbd.d/ORG-name/namespacename/rbdname`:
```bash
#!/bin/bash

# LVM may have deactivated VGs if not all blocks were available during its scan
pvscan
vgchange -ay

# mount all the filesystems
mountpoint -q /your/mount/point || mount /your/mount/point
```
Mark this script *executable* so `rbdmap` can run it as a post-mapping hook!

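
For example:

```bash
chmod +x '/etc/ceph/rbd.d/ORG-name/namespacename/rbdname'
```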
To test, either restart `rbdmap.service` or manually call `umount` and `mount` for `/your/mount/point`.

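
A quick check of the whole chain could look like this (assuming the mount point from above):

```bash
systemctl restart rbdmap.service

# the filesystem should be mounted again and show the expected size
mountpoint /your/mount/point
df -h /your/mount/point
```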

## LVM on RBD

You can create LVM physical and logical volumes (`pvs`/`lvs`) on your RBD, for example to use read/write caching (see below).
This works as usual, just run `pvcreate` etc., as sketched below.

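
A minimal sketch, assuming the mapped device is `/dev/rbd0` and using the volume group/logical volume names `datavg`/`datalv` that the caching example below also uses:

```bash
# turn the mapped RBD into an LVM physical volume
pvcreate /dev/rbd0

# create a volume group and one logical volume spanning it
vgcreate datavg /dev/rbd0
lvcreate -n datalv -l 100%FREE datavg

# the LV is then formatted and mounted just like the plain RBD above
mkfs.ext4 -E nodiscard,stride=1024,stripe_width=1024 /dev/datavg/datalv
```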

## RBD tuning

To get more performance, there are some useful tweaks.

### CPU Bugs

If your server is sufficiently shielded behind firewalls and not susceptible to attacks, you can disable the CPU bug mitigations for a performance boost via a kernel command line parameter:

`/etc/default/grub`:
```
GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
```
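After editing the file, regenerate the GRUB configuration and reboot; on Debian/Ubuntu-style systems:

```bash
# update-grub is a wrapper around grub-mkconfig -o /boot/grub/grub.cfg
sudo update-grub
sudo reboot

# after the reboot, check the mitigation status
cat /sys/devices/system/cpu/vulnerabilities/*
```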

### Read-Ahead

We read ahead 1 MiB, since Ceph stores the objects in 4 MiB blocks anyway.

`/etc/udev/rules.d/90-ceph-rbd.rules`:
```
KERNEL=="rbd[0-9]*", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="1024", ATTR{queue/scheduler}="none", ATTR{queue/wbt_lat_usec}="0", ATTR{queue/nr_requests}="2048"
```
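To apply the rule without rebooting, reload the udev rules, re-trigger the device and check the resulting sysfs values, e.g.:

```bash
udevadm control --reload
udevadm trigger --action=change /dev/rbd0

# verify
cat /sys/block/rbd0/bdi/read_ahead_kb
cat /sys/block/rbd0/queue/scheduler
```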

### LVM-Cache

See `man 7 lvmcache`.
We can cache the RBD on a local NVMe for more performance.

* `/dev/fastdevice` is the name of the local NVMe.
* `/dev/datavg/datalv` is the name of your existing logical volume containing the data stored on Ceph.
* We recommend writeback caching.

```bash
## setup
# cache device
pvcreate /dev/fastdevice

# add the cache device to the vg we want to cache
vgextend datavg /dev/fastdevice

# create the cache pool (meta+data combined):
lvcreate -n cache --type cache-pool -l '100%FREE' datavg /dev/fastdevice

# enable caching
#
# --type cache (recommended): use dm-cache for read and write caching
# --cachemode: do we cache writes?
#   buffer writes: writeback
#   no write buffering: writethrough
#
# --type writecache: only ever cache writes, not reads
#
# --chunksize: data block management size
lvconvert --type cache --cachepool cache --cachemode writeback --chunksize 1024KiB /dev/datavg/datalv

## status
# check status
lvs -ao+devices

## resizing
lvconvert --splitcache /dev/datavg/datalv
lvextend -l +100%FREE /dev/datavg/datalv
lvconvert ... # to enable caching again

## disabling
# deactivate and keep the cache lv
lvconvert --splitcache /dev/datavg/datalv

# disable and delete the cache lv -> the cache pv is still part of the vg!
# watch out when resizing the lv -> the cache pv will then receive parts of the lv; use pvmove to move them off again.
lvconvert --uncache /dev/datavg/datalv

# remove the pv from the vg
vgreduce datavg /dev/fastdevice
```
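To see whether the cache is actually used, the dm-cache counters can be queried via `lvs`; a sketch, with field names as listed in `lvmcache(7)` and `lvs -o help`:

```bash
# hit/miss and occupancy counters for the cached LV
lvs -o+cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses datavg/datalv
```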