Ceph Storage

The Acme deployment uses Ceph as a storage backend.

The Ceph deployment is not managed by StackHPC Ltd.

Working with the Ceph deployment tool

cephadm configuration location

In the kayobe-config repository, under etc/kayobe/cephadm.yml (or, when using multiple Kayobe environments, in a specific environment, e.g. etc/kayobe/environments/production/cephadm.yml).

StackHPC’s cephadm Ansible collection relies on multiple inventory groups:

  • mons

  • mgrs

  • osds

  • rgws (optional)

Those groups are usually defined in etc/kayobe/inventory/groups.

Running cephadm playbooks

In the kayobe-config repository, under etc/kayobe/ansible, there is a set of cephadm-based playbooks utilising the stackhpc.cephadm Ansible Galaxy collection (an example invocation is shown after this list):

  • cephadm.yml - runs the end-to-end process, starting with deployment and followed by definition of EC profiles, CRUSH rules, pools and users

  • cephadm-crush-rules.yml - defines Ceph CRUSH rules

  • cephadm-deploy.yml - runs the bootstrap/deploy playbook without the additional playbooks

  • cephadm-ec-profiles.yml - defines Ceph EC profiles

  • cephadm-gather-keys.yml - gathers Ceph configuration and keys and populates kayobe-config

  • cephadm-keys.yml - defines Ceph users/keys

  • cephadm-pools.yml - defines Ceph pools
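
The playbooks are normally invoked with kayobe playbook run from the Ansible control host. A minimal sketch, assuming the standard kayobe-config layout and that KAYOBE_CONFIG_PATH has been set (for example by sourcing the kayobe-env script); the kayobe# prompt is a placeholder for the control host:

kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm.yml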

Running Ceph commands

Ceph commands are usually run inside a cephadm shell utility container:

ceph# cephadm shell

Operating a cluster requires a keyring with admin access to be available for Ceph commands. Cephadm copies such a keyring to the nodes carrying the _admin label, which is present on the MON servers by default when using the StackHPC cephadm collection.
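
If admin access is needed on an additional host, the label can be applied with the orchestrator and the result checked in the host listing (substitute the real hostname):

ceph# cephadm shell
ceph# ceph orch host label add <host> _admin
ceph# ceph orch host ls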

Adding a new storage node

Add the node to the respective group (e.g. osds) and run the cephadm-deploy.yml playbook.

Note

To add node types other than OSDs (mons, mgrs, etc.) you need to specify -e cephadm_bootstrap=True when running the playbook, as in the example below.
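
A sketch of both cases, assuming the same kayobe playbook run invocation as in the earlier example; the first command is sufficient for OSD nodes, the second is needed for other node types:

kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml
kayobe# kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/cephadm-deploy.yml -e cephadm_bootstrap=True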

Removing a storage node

First, drain the node:

ceph# cephadm shell
ceph# ceph orch host drain <host>
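
Draining schedules the removal of all OSDs on the host and may take some time. Progress can be checked from within the cephadm shell, for example:

ceph# ceph orch osd rm status
ceph# ceph orch ps <host>

The first command shows OSDs still being drained, the second lists any daemons remaining on the host.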

Once all daemons are removed, you can remove the host:

ceph# cephadm shell
ceph# ceph orch host rm <host>

Then remove the host from the Ansible inventory (usually in etc/kayobe/inventory/overcloud).

Additional options and commands may be found in the Ceph host management documentation: https://docs.ceph.com/en/quincy/cephadm/host-management/

Replacing a Failed Ceph Drive

Once an OSD has been identified as having a hardware failure, the affected drive will need to be replaced.

If rebooting a Ceph node, first set noout to prevent excess data movement:

ceph# cephadm shell
ceph# ceph osd set noout

Reboot the node and replace the drive.

Unset noout after the node is back online:

ceph# cephadm shell
ceph# ceph osd unset noout

Remove the OSD using the Ceph orchestrator command:

ceph# cephadm shell
ceph# ceph orch osd rm <ID> --replace
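
With --replace, the OSD is marked destroyed rather than fully removed, so its ID is preserved for the replacement device. The state can be checked afterwards:

ceph# ceph osd tree

The affected OSD should be reported as destroyed until a new drive is deployed in its place.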

After removing OSDs, if the drives the OSDs were deployed on once again become available, cephadm may automatically try to deploy more OSDs on these drives if they match an existing drivegroup spec. If this is not your desired action plan, it is best to modify the drivegroup spec beforehand (the cephadm_osd_spec variable in etc/kayobe/cephadm.yml). Either set unmanaged: true to stop cephadm from picking up new disks, or modify it in some way that it no longer matches the drives you want to remove.
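
The OSD service specs currently active in the cluster can be reviewed from the cephadm shell, which helps to confirm whether a drive would be picked up again:

ceph# ceph orch ls osd --export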

Operations

Replacing a drive

See upstream documentation: https://docs.ceph.com/en/quincy/cephadm/services/osd/#replacing-an-osd

In the case where a disk holding the DB and/or WAL fails, it is necessary to recreate (using the replacement procedure above) all OSDs that are associated with this disk, usually an NVMe drive. The following single command is sufficient to identify which OSDs are tied to which physical disks:

ceph# ceph device ls
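
To narrow this down to a single OSD or host, the following may also help (substitute the real OSD ID and hostname):

ceph# ceph device ls-by-daemon osd.<ID>
ceph# ceph device ls-by-host <host>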

Host maintenance

https://docs.ceph.com/en/quincy/cephadm/host-management/#maintenance-mode

Upgrading

https://docs.ceph.com/en/quincy/cephadm/upgrade/

Troubleshooting

Investigating a Failed Ceph Drive

A failing drive in a Ceph cluster will cause the OSD daemon to crash. In this case Ceph will go into a HEALTH_WARN state. Ceph can report details about failed OSDs by running:

ceph# ceph health detail

Note

Remember to run ceph/rbd commands from within a cephadm shell (the preferred method) or after installing the Ceph client; details are in the official documentation. It is also required that the host where commands are executed has an admin Ceph keyring present. This is easiest to achieve by applying the _admin label (Ceph MON servers have it by default when using the StackHPC cephadm collection).

A failed OSD will also be reported as down by running:

ceph# ceph osd tree

Note the ID of the failed OSD.
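
With the OSD ID known, the orchestrator can report where the OSD lives, for example:

ceph# ceph osd find <ID>

The output includes the host and CRUSH location of the OSD.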

The failed disk is usually logged by the Linux kernel too:

storage-0# dmesg -T

Cross-reference the hardware device and OSD ID to ensure they match. (Using pvs and lvs may help make this connection).
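
On the storage node itself, cephadm can also map OSDs to their LVM volumes and underlying devices; a sketch:

storage-0# cephadm ceph-volume lvm list

The output lists each OSD together with its block (and, where present, DB/WAL) devices.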

Inspecting a Ceph Block Device for a VM

To find out what block devices are attached to a VM, go to the hypervisor that it is running on (an admin-level user can see this from openstack server show).
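
If the hypervisor is not already known, it can be looked up with the OpenStack CLI using admin credentials; a sketch (the admin# prompt is a placeholder for wherever the client is available):

admin# openstack server show <server-uuid> -c OS-EXT-SRV-ATTR:hypervisor_hostname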

On this hypervisor, enter the libvirt container:

comp0# docker exec -it nova_libvirt /bin/bash

Find the VM name using libvirt:

(nova-libvirt)[root@comp0 /]# virsh list
 Id    Name                State
------------------------------------
 1     instance-00000001   running

Now inspect the properties of the VM using virsh dumpxml:

(nova-libvirt)[root@comp0 /]# virsh dumpxml instance-00000001 | grep rbd
      <source protocol='rbd' name='acme-vms/51206278-e797-4153-b720-8255381228da_disk'>

On a Ceph node, the RBD pool can be inspected and the volume extracted as a RAW block image:

ceph# rbd ls acme-vms
ceph# rbd export acme-vms/51206278-e797-4153-b720-8255381228da_disk blob.raw

The raw block device (blob.raw above) can be mounted using the loopback device.
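
A minimal sketch of doing so, assuming the image contains a single partition (loop device names and partition numbers will vary):

ceph# losetup --find --show --partscan blob.raw
/dev/loop0
ceph# mount /dev/loop0p1 /mnt
ceph# ls /mnt
ceph# umount /mnt
ceph# losetup --detach /dev/loop0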

Inspecting a QCOW Image using LibGuestFS

The virtual machine’s root image can be inspected by installing libguestfs-tools and using the guestfish command:

ceph# export LIBGUESTFS_BACKEND=direct
ceph# guestfish -a blob.qcow
><fs> run
 100% [XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 00:00
><fs> list-filesystems
/dev/sda1: ext4
><fs> mount /dev/sda1 /
><fs> ls /
bin
boot
dev
etc
home
lib
lib64
lost+found
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var
><fs> quit