SUSE Enterprise Storage 6

Administration Guide

Authors: Tomáš Bažant, Alexandra Settle, Liam Proven, and Sven Seeberg
Publication Date: 11/25/2020
About This Guide
Available Documentation
Feedback
Documentation Conventions
About the Making of This Manual
Ceph Contributors
I Cluster Management
1 User Privileges and Command Prompts
1.1 Salt/DeepSea Related Commands
1.2 Ceph Related Commands
1.3 General Linux Commands
1.4 Additional Information
2 Salt Cluster Administration
2.1 Adding New Cluster Nodes
2.2 Adding New Roles to Nodes
2.3 Removing and Reinstalling Cluster Nodes
2.4 Redeploying Monitor Nodes
2.5 Verify an Encrypted OSD
2.6 Adding an OSD Disk to a Node
2.7 Removing an OSD
2.8 Replacing an OSD Disk
2.9 Recovering a Reinstalled OSD Node
2.10 Moving the Admin Node to a New Server
2.11 Automated Installation via Salt
2.12 Updating the Cluster Nodes
2.13 Halting or Rebooting Cluster
2.14 Adjusting ceph.conf with Custom Settings
2.15 Enabling AppArmor Profiles
2.16 Deactivating Tuned Profiles
2.17 Removing an Entire Ceph Cluster
3 Backing Up Cluster Configuration and Data
3.1 Back Up Ceph Configuration
3.2 Back Up Salt Configuration
3.3 Back Up DeepSea Configuration
3.4 Back Up Custom Configurations
II Ceph Dashboard
4 About Ceph Dashboard
5 Dashboard's Web User Interface
5.1 Log In
5.2 Utility Menu
5.3 Main Menu
5.4 The Content Pane
5.5 Common Web UI Features
5.6 Dashboard Widgets
6 Managing Dashboard Users and Roles
6.1 Listing Users
6.2 Adding New Users
6.3 Editing Users
6.4 Deleting Users
6.5 Listing User Roles
6.6 Adding Custom Roles
6.7 Editing Custom Roles
6.8 Deleting Custom Roles
7 Viewing Cluster Internals
7.1 Cluster Nodes
7.2 Ceph Monitors
7.3 Ceph OSDs
7.4 Cluster Configuration
7.5 CRUSH Map
7.6 Manager Modules
7.7 Logs
8 Managing Pools
8.1 Adding a New Pool
8.2 Deleting Pools
8.3 Editing a Pool's Options
9 Managing RADOS Block Devices
9.1 Viewing Details about RBDs
9.2 Viewing RBD's Configuration
9.3 Creating RBDs
9.4 Deleting RBDs
9.5 RADOS Block Device Snapshots
9.6 Managing iSCSI Gateways
9.7 RBD Quality of Service (QoS)
9.8 RBD Mirroring
10 Managing NFS Ganesha
10.1 Adding NFS Exports
10.2 Deleting NFS Exports
10.3 Editing NFS Exports
11 Managing Ceph File Systems
11.1 Viewing CephFS Overview
12 Managing Object Gateways
12.1 Viewing Object Gateways
12.2 Managing Object Gateway Users
12.3 Managing the Object Gateway Buckets
13 Manual Configuration
13.1 TLS/SSL Support
13.2 Host Name and Port Number
13.3 User Name and Password
13.4 Enabling the Object Gateway Management Front-end
13.5 Enable Single Sign-On
14 Managing Users and Roles on the Command Line
14.1 User Accounts
14.2 User Roles and Permissions
14.3 Reverse Proxies
14.4 Auditing
III Operating a Cluster
15 Introduction
16 Operating Ceph Services
16.1 Operating Ceph Cluster Related Services Using systemd
16.2 Restarting Ceph Services Using DeepSea
16.3 Shutdown and Start of the Whole Ceph Cluster
17 Determining Cluster State
17.1 Checking a Cluster's Status
17.2 Checking Cluster Health
17.3 Watching a Cluster
17.4 Checking a Cluster's Usage Stats
17.5 Checking OSD Status
17.6 Checking for Full OSDs
17.7 Checking Monitor Status
17.8 Checking Placement Group States
17.9 Using the Admin Socket
17.10 Storage Capacity
17.11 Monitoring OSDs and Placement Groups
17.12 OSD Is Not Running
18 Monitoring and Alerting
18.1 Pillar Variables
18.2 Grafana
18.3 Prometheus
18.4 Alertmanager
18.5 Troubleshooting Alerts
19 Authentication with cephx
19.1 Authentication Architecture
19.2 Key Management
20 Stored Data Management
20.1 Devices
20.2 Buckets
20.3 Rule Sets
20.4 Placement Groups
20.5 CRUSH Map Manipulation
20.6 Scrubbing
21 Ceph Manager Modules
21.1 Balancer
21.2 Telemetry Module
22 Managing Storage Pools
22.1 Associate Pools with an Application
22.2 Operating Pools
22.3 Pool Migration
22.4 Pool Snapshots
22.5 Data Compression
23 RADOS Block Device
23.1 Block Device Commands
23.2 Mounting and Unmounting
23.3 Snapshots
23.4 Mirroring
23.5 Cache Settings
23.6 QoS Settings
23.7 Read-ahead Settings
23.8 Advanced Features
23.9 Mapping RBD Using Old Kernel Clients
24 Erasure Coded Pools
24.1 Prerequisite for Erasure Coded Pools
24.2 Creating a Sample Erasure Coded Pool
24.3 Erasure Code Profiles
24.4 Erasure Coded Pools with RADOS Block Device
25 Ceph Cluster Configuration
25.1 Runtime Configuration
25.2 Ceph OSD and BlueStore
25.3 Ceph Object Gateway
IV Accessing Cluster Data
26 Ceph Object Gateway
26.1 Object Gateway Restrictions and Naming Limitations
26.2 Deploying the Object Gateway
26.3 Operating the Object Gateway Service
26.4 Configuration Options
26.5 Managing Object Gateway Access
26.6 HTTP Front-ends
26.7 Enabling HTTPS/SSL for Object Gateways
26.8 Synchronization Modules
26.9 LDAP Authentication
26.10 Bucket Index Sharding
26.11 Integrating OpenStack Keystone
26.12 Pool Placement and Storage Classes
26.13 Multisite Object Gateways
26.14 Load Balancing the Object Gateway Servers with HAProxy
27 Ceph iSCSI Gateway
27.1 Connecting to ceph-iscsi Managed Targets
27.2 Conclusion
28 Clustered File System
28.1 Mounting CephFS
28.2 Unmounting CephFS
28.3 CephFS in /etc/fstab
28.4 Multiple Active MDS Daemons (Active-Active MDS)
28.5 Managing Failover
28.6 Setting CephFS Quotas
28.7 Managing CephFS Snapshots
29 Exporting Ceph Data via Samba
29.1 Export CephFS via Samba Share
29.2 Samba Gateway Joining Active Directory
30 NFS Ganesha: Export Ceph Data via NFS
30.1 Installation
30.2 Configuration
30.3 Custom NFS Ganesha Roles
30.4 Starting or Restarting NFS Ganesha
30.5 Setting the Log Level
30.6 Verifying the Exported NFS Share
30.7 Mounting the Exported NFS Share
V Integration with Virtualization Tools
31 Using libvirt with Ceph
31.1 Configuring Ceph
31.2 Preparing the VM Manager
31.3 Creating a VM
31.4 Configuring the VM
31.5 Summary
32 Ceph as a Back-end for QEMU KVM Instance
32.1 Installation
32.2 Usage
32.3 Creating Images with QEMU
32.4 Resizing Images with QEMU
32.5 Retrieving Image Info with QEMU
32.6 Running QEMU with RBD
32.7 Enabling Discard/TRIM
32.8 QEMU Cache Options
VI FAQs, Tips and Troubleshooting
33 Hints and Tips
33.1 Identifying Orphaned Partitions
33.2 Adjusting Scrubbing
33.3 Stopping OSDs without Rebalancing
33.4 Time Synchronization of Nodes
33.5 Checking for Unbalanced Data Writing
33.6 Btrfs Subvolume for /var/lib/ceph on Ceph Monitor Nodes
33.7 Increasing File Descriptors
33.8 Integration with Virtualization Software
33.9 Firewall Settings for Ceph
33.10 Testing Network Performance
33.11 How to Locate Physical Disks Using LED Lights
34 Frequently Asked Questions
34.1 How Does the Number of Placement Groups Affect the Cluster Performance?
34.2 Can I Use SSDs and Hard Disks on the Same Cluster?
34.3 What are the Trade-offs of Using a Journal on SSD?
34.4 What Happens When a Disk Fails?
34.5 What Happens When a Journal Disk Fails?
35 Troubleshooting
35.1 Reporting Software Problems
35.2 Sending Large Objects with rados Fails with Full OSD
35.3 Corrupted XFS File system
35.4 'Too Many PGs per OSD' Status Message
35.5 'nn pg stuck inactive' Status Message
35.6 OSD Weight is 0
35.7 OSD is Down
35.8 Finding Slow OSDs
35.9 Fixing Clock Skew Warnings
35.10 Poor Cluster Performance Caused by Network Problems
35.11 /var Running Out of Space
A DeepSea Stage 1 Custom Example
B Ceph Maintenance Updates Based on Upstream 'Nautilus' Point Releases
Glossary
C Documentation Updates
C.1 Maintenance update of SUSE Enterprise Storage 6 documentation
C.2 June 2019 (Release of SUSE Enterprise Storage 6)
List of Figures
5.1 Ceph Dashboard Login Screen
5.2 Ceph Dashboard Home Page
5.3 Status Widgets
5.4 Performance Widgets
5.5 Capacity Widgets
6.1 User Management
6.2 Adding a User
6.3 User Roles
6.4 Adding a Role
7.1 Hosts
7.2 Ceph Monitors
7.3 Ceph OSDs
7.4 OSD Flags
7.5 OSD Recovery Priority
7.6 OSD Details
7.7 Cluster Configuration
7.8 CRUSH Map
7.9 Manager Modules
7.10 Logs
8.1 List of Pools
8.2 Adding a New Pool
9.1 List of RBD Images
9.2 RBD Details
9.3 RBD Configuration
9.4 Adding a New RBD
9.5 RBD Snapshots
9.6 List of iSCSI Targets
9.7 iSCSI Target Details
9.8 Adding a New Target
9.9 Running rbd-mirror Daemons
9.10 Creating a Pool with RBD Application
9.11 Configuring the Replication Mode
9.12 Adding Peer Credentials
9.13 List of Replicated Pools
9.14 New RBD Image
9.15 New RBD Image Synchronized
9.16 RBD Images' Replication Status
10.1 List of NFS Exports
10.2 NFS Export Details
10.3 Adding a New NFS Export
10.4 Editing an NFS Export
11.1 CephFS Details
12.1 Gateway's Details
12.2 Gateway Users
12.3 Adding a New Gateway User
12.4 Gateway Bucket Details
17.1 Ceph Cluster
17.2 Peering Schema
17.3 Placement Groups Status
19.1 Basic cephx Authentication
19.2 cephx Authentication
19.3 cephx Authentication - MDS and OSD
20.1 OSDs with Mixed Device Classes
20.2 Example Tree
20.3 Node Replacement Methods
20.4 Placement Groups in a Pool
20.5 Placement Groups and OSDs
22.1 Pools before Migration
22.2 Cache Tier Setup
22.3 Data Flushing
22.4 Setting Overlay
22.5 Migration Complete
23.1 RADOS Protocol
27.1 iSCSI Initiator Properties
27.2 Discover Target Portal
27.3 Target Portals
27.4 Targets
27.5 iSCSI Target Properties
27.6 Device Details
27.7 New Volume Wizard
27.8 Offline Disk Prompt
27.9 Confirm Volume Selections
27.10 iSCSI Initiator Properties
27.11 Add Target Server
27.12 Manage Multipath Devices
27.13 Paths Listing for Multipath
27.14 Add Storage Dialog
27.15 Custom Space Setting
27.16 iSCSI Datastore Overview

Copyright © 2020 SUSE LLC

Copyright © 2016, RedHat, Inc, and contributors.

The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners.

For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

About This Guide

SUSE Enterprise Storage 6 is an extension to SUSE Linux Enterprise Server 15 SP1. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 6 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.

This guide helps you understand the concepts of SUSE Enterprise Storage 6, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to https://documentation.suse.com.

1 Available Documentation

The following manuals are available for this product:

Administration Guide

The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

Book “Deployment Guide”

Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Find the latest documentation updates at https://documentation.suse.com where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests

For services and support options available for your product, refer to http://www.suse.com/support/.

To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support › Service Request.

User Comments

We want to hear your comments and suggestions for this manual and the other documentation included with this product. If you have questions, suggestions, or corrections, contact doc-team@suse.com, or you can also click the Report Documentation Bug link beside each chapter or section heading.

Mail

For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

  • /etc/passwd: directory names and file names

  • placeholder: replace placeholder with the actual value

  • PATH: the environment variable PATH

  • ls, --help: commands, options, and parameters

  • user: users or groups

  • Alt, Alt–F1: a key to press or a key combination; keys are shown in uppercase as on a keyboard

  • File, File › Save As: menu items, buttons

  • Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.

5 Ceph Contributors

The Ceph project and its documentation are the result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.

Part I Cluster Management

1 User Privileges and Command Prompts

As a Ceph cluster administrator, you will be configuring and adjusting the cluster behavior by running specific commands. There are several types of commands you will need:

2 Salt Cluster Administration

After you deploy a Ceph cluster, you will occasionally need to modify it. Typical modifications include adding or removing nodes, disks, or services. This chapter describes how to perform these administration tasks.

3 Backing Up Cluster Configuration and Data

This chapter explains which parts of the Ceph cluster you should back up in order to be able to restore its functionality.

1 User Privileges and Command Prompts

As a Ceph cluster administrator, you will be configuring and adjusting the cluster behavior by running specific commands. There are several types of commands you will need:

1.1 Salt/DeepSea Related Commands

These commands help you to deploy or upgrade the Ceph cluster, run commands on several (or all) cluster nodes at the same time, or assist you when adding or removing cluster nodes. The most frequently used are salt, salt-run, and deepsea. You need to run Salt commands on the Salt master node (refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.2 “Introduction to DeepSea” for details) as root. These commands are introduced with the following prompt:

root@master # 

For example:

root@master # salt '*.example.net' test.ping

1.2 Ceph Related Commands

These are lower-level commands to configure and fine-tune all aspects of the cluster and its gateways on the command line, for example ceph, rbd, radosgw-admin, or crushtool.

To run Ceph related commands, you need to have read access to a Ceph key. The key's capabilities then define your privileges within the Ceph environment. One option is to run Ceph commands as root (or via sudo) and use the unrestricted default keyring 'ceph.client.admin.keyring'.

A safer and recommended option is to create a more restrictive individual key for each administrator user and put it in a directory where that user can read it, for example:

~/.ceph/ceph.client.USERNAME.keyring

Tip: Path to Ceph Keys

To use a custom admin user and keyring, you need to specify the user name and path to the key each time you run the ceph command using the -n client.USER_NAME and --keyring PATH/TO/KEYRING options.

To avoid this, include these options in the CEPH_ARGS variable in the individual users' ~/.bashrc files.
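
As a minimal sketch of this tip, the following lines could be appended to a user's ~/.bashrc. The user name alice and the keyring location are hypothetical placeholders; the -n and --keyring options are the ones named above.

```shell
# Hypothetical values: replace with the real user name and keyring path.
CEPH_USER="alice"
CEPH_KEYRING="$HOME/.ceph/ceph.client.${CEPH_USER}.keyring"

# The ceph command-line tools read extra options from CEPH_ARGS, so the
# -n and --keyring options no longer need to be typed on every call.
export CEPH_ARGS="-n client.${CEPH_USER} --keyring ${CEPH_KEYRING}"
```

After opening a new shell, a plain ceph call would then run with that user and keyring instead of the default admin key.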

Although you can run Ceph related commands on any cluster node, we recommend running them on the Admin Node. This documentation uses the cephadm user to run the commands, therefore they are introduced with the following prompt:

cephadm@adm > 

For example:

cephadm@adm > ceph auth list

Tip: Commands for Specific Nodes

If the documentation instructs you to run a command on a cluster node with a specific role, the prompt names that role. For example:

cephadm@mon > 

1.3 General Linux Commands

Linux commands not related to Ceph or DeepSea, such as mount, cat, or openssl, are introduced either with the cephadm@adm > or root # prompts, depending on which privileges the related command requires.

1.4 Additional Information

For more information on Ceph key management, refer to Section 19.2, “Key Management”.

2 Salt Cluster Administration

After you deploy a Ceph cluster, you will occasionally need to modify it. Typical modifications include adding or removing nodes, disks, or services. This chapter describes how to perform these administration tasks.

2.1 Adding New Cluster Nodes

The procedure of adding new nodes to the cluster is almost identical to the initial cluster node deployment described in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”:

Tip: Prevent Rebalancing

When adding an OSD to the existing cluster, bear in mind that the cluster will be rebalancing for some time afterward. To minimize the rebalancing periods, add all OSDs you intend to add at the same time.

An additional way is to set the osd crush initial weight = 0 option in the ceph.conf file before adding the OSDs:

  1. Add osd crush initial weight = 0 to /srv/salt/ceph/configuration/files/ceph.conf.d/global.conf.

  2. Create the new configuration on the Salt master node:

    root@master # salt 'SALT_MASTER_NODE' state.apply ceph.configuration.create

    Or:

    root@master # salt-call state.apply ceph.configuration.create
  3. Apply the new configuration to the targeted OSD minions:

    root@master # salt 'OSD_MINIONS' state.apply ceph.configuration

    Note

    If this is not a new node, but you want to proceed as if it were, ensure you remove the /etc/ceph/destroyedOSDs.yml file from the node. Otherwise, any devices from the first attempt will be restored with their previous OSD ID and reweight.

    Run the following commands:

    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run state.orch ceph.stage.2
    root@master # salt 'node*' state.apply ceph.osd
  4. After the new OSDs are added, adjust their weights as required with the ceph osd crush reweight command in small increments. Small increments let the cluster rebalance and return to a healthy state between steps, so that the data movement does not overwhelm the cluster or the clients accessing it.
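
The incremental reweighting in the last step can be sketched as a dry run that only prints the commands. The helper name reweight_plan, the OSD name osd.12, the target weight, and the step size are all hypothetical; only the ceph osd crush reweight command comes from the procedure above.

```shell
# Print (not run) the commands that raise an OSD's CRUSH weight from 0
# to a target weight in fixed steps.
# Usage: reweight_plan OSD_NAME TARGET_WEIGHT STEP
reweight_plan() {
  awk -v osd="$1" -v target="$2" -v step="$3" 'BEGIN {
    # A non-positive step would never reach the target.
    if (step <= 0) { exit 1 }
    w = 0
    while (w + step < target) {
      w += step
      printf "ceph osd crush reweight %s %.2f\n", osd, w
    }
    # The final command always lands exactly on the target weight.
    printf "ceph osd crush reweight %s %.2f\n", osd, target
  }'
}

# Hypothetical example: bring osd.12 up to weight 1.0 in steps of 0.25.
reweight_plan osd.12 1.0 0.25
```

The printed commands are meant to be run one at a time, checking the cluster state (for example with ceph health) before moving on to the next increment.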

  1. Install SUSE Linux Enterprise Server 15 SP1 on the new node and configure its network settings so that it resolves the Salt master host name correctly. Verify that it has a proper connection to both public and cluster networks, and that time synchronization is correctly configured. Then install the salt-minion package:

    root@minion > zypper in salt-minion

    If the Salt master's host name is different from salt, edit /etc/salt/minion and add the following:

    master: DNS_name_of_your_salt_master

    If you changed the configuration file mentioned above, restart the salt-minion service:

    root@minion > systemctl restart salt-minion.service
  2. On the Salt master, accept the salt key of the new node:

    root@master # salt-key --accept NEW_NODE_KEY
  3. Verify that /srv/pillar/ceph/deepsea_minions.sls targets the new Salt minion and/or set the proper DeepSea grain. Refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.2.2.1 “Matching the Minion Name” or Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”, Running Deployment Stages for more details.

  4. Run the preparation stage. It synchronizes modules and grains so that the new minion can provide all the information DeepSea expects.

    root@master # salt-run state.orch ceph.stage.0

    Important: Possible Restart of DeepSea stage 0

    If the Salt master rebooted after its kernel update, you need to restart DeepSea stage 0.

  5. Run the discovery stage. It will write new file entries in the /srv/pillar/ceph/proposals directory, where you can edit relevant .yml files:

    root@master # salt-run state.orch ceph.stage.1
  6. Optionally, change /srv/pillar/ceph/proposals/policy.cfg if the newly added host does not match the existing naming scheme. For details, refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1 “The policy.cfg File”.

  7. Run the configuration stage. It reads everything under /srv/pillar/ceph and updates the pillar accordingly:

    root@master # salt-run state.orch ceph.stage.2

    Pillar stores data which you can access with the following command:

    root@master # salt target pillar.items

    Tip: Modifying OSD's Layout

    If you want to modify the default OSD's layout and change the drive groups configuration, follow the procedure described in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.2 “DriveGroups”.

  8. Run stages 3 and 4 so that the deployment includes the newly added nodes:

    root@master # salt-run state.orch ceph.stage.3
    root@master # salt-run state.orch ceph.stage.4

2.2 Adding New Roles to Nodes

You can deploy all types of supported roles with DeepSea. See Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1.2 “Role Assignment” for more information on supported role types and examples of matching them.

To add a new service to an existing node, follow these steps:

  1. Adapt /srv/pillar/ceph/proposals/policy.cfg to match the existing host with a new role. For more details, refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1 “The policy.cfg File”. For example, if you need to run an Object Gateway on a MON node, the line is similar to:

    role-rgw/xx/x/example.mon-1.sls
  2. Run stage 2 to update the pillar:

    root@master # salt-run state.orch ceph.stage.2
  3. Run stage 3 to deploy core services, or stage 4 to deploy optional services. Running both stages does not hurt.

2.3 Removing and Reinstalling Cluster Nodes

Tip: Removing a Cluster Node Temporarily

The Salt master expects all minions to be present in the cluster and responsive. If a minion breaks and stops responding, it causes problems for the Salt infrastructure, mainly for DeepSea and Ceph Dashboard.

Before you fix the minion, delete its key from the Salt master temporarily:

root@master # salt-key -d MINION_HOST_NAME

After the minion is fixed, add its key to the Salt master again:

root@master # salt-key -a MINION_HOST_NAME

To remove a role from a cluster, edit /srv/pillar/ceph/proposals/policy.cfg and remove the corresponding line(s). Then run stages 2 and 5 as described in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”.

Note: Removing OSDs from Cluster

In case you need to remove a particular OSD node from your cluster, ensure that your cluster has more free disk space than the disk you intend to remove. Bear in mind that removing an OSD results in rebalancing of the whole cluster.

Before running stage 5 to do the actual removal, always check which OSDs are going to be removed by DeepSea:

root@master # salt-run rescinded.ids

When a role is removed from a minion, the objective is to undo all changes related to that role. For most of the roles, the task is simple, but there may be problems with package dependencies. If a package is uninstalled, its dependencies are not.

Removed OSDs appear as blank drives. The related tasks overwrite the beginning of the file systems and remove backup partitions in addition to wiping the partition tables.

Note: Preserving Partitions Created by Other Methods

Disk drives previously configured by other methods, such as ceph-deploy, may still contain partitions. DeepSea will not automatically destroy these. The administrator must reclaim these drives manually.

Example 2.1: Removing a Salt minion from the Cluster

If your storage minions are named, for example, 'data1.ceph', 'data2.ceph' ... 'data6.ceph', and the related lines in your policy.cfg are similar to the following:

[...]
# Hardware Profile
role-storage/cluster/data*.sls
[...]

Then to remove the Salt minion 'data2.ceph', change the lines to the following:

[...]
# Hardware Profile
role-storage/cluster/data[1,3-6]*.sls
[...]

Also remember to adapt your drive_groups.yml file to match the new targets.

    [...]
    drive_group_name:
      target: 'data[1,3-6]*'
    [...]

Then run stage 2, check which OSDs are going to be removed, and finish by running stage 5:

root@master # salt-run state.orch ceph.stage.2
root@master # salt-run rescinded.ids
root@master # salt-run state.orch ceph.stage.5
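
To see which of the six minions a pattern such as data[1,3-6]* selects, the pattern can be tried against the names in a shell first. This is only a sketch: it assumes that the target resolves like an ordinary shell glob, and the helper name matches_target is hypothetical.

```shell
# Return success if a minion name matches the target pattern.
# The pattern is left unquoted inside `case` on purpose, so that the
# shell treats it as a glob and not as a literal string.
matches_target() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

# Minion names from the example above.
for n in 1 2 3 4 5 6; do
  if matches_target "data$n.ceph" 'data[1,3-6]*'; then
    echo "data$n.ceph: kept"
  else
    echo "data$n.ceph: removed"
  fi
done
```

Under shell glob rules, the bracket expression matches a single character, so data2.ceph is the only one of the six names that is excluded. Note that the trailing * would also match a name such as data10.ceph, which matters in clusters with ten or more storage nodes.
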
Example 2.2: Migrating Nodes

Assume the following situation: during the fresh cluster installation, you (the administrator) allocated one of the storage nodes as a stand-alone Object Gateway while waiting for the gateway's hardware to arrive. Now the permanent hardware has arrived for the gateway and you can finally assign the intended role to the backup storage node and have the gateway role removed.

After running stages 0 and 1 (see Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”, Running Deployment Stages) for the new hardware, you named the new gateway rgw1. If the node data8 needs the Object Gateway role removed and the storage role added, and the current policy.cfg looks like this:

# Hardware Profile
role-storage/cluster/data[1-7]*.sls

# Roles
role-rgw/cluster/data8*.sls

Then change it to:

# Hardware Profile
role-storage/cluster/data[1-8]*.sls

# Roles
role-rgw/cluster/rgw1*.sls

Run stages 2 to 4, check which OSDs might be removed, and finish by running stage 5. Stage 3 will add data8 as a storage node. For a moment, data8 will have both roles. Stage 4 will add the Object Gateway role to rgw1 and stage 5 will remove the Object Gateway role from data8:

root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4
root@master # salt-run rescinded.ids
root@master # salt-run state.orch ceph.stage.5
Example 2.3: Removal of a Failed Node

If the Salt minion is not responding and the administrator is unable to resolve the issue, we recommend removing the Salt key:

root@master # salt-key -d MINION_ID
Example 2.4: Removal of a Failed Storage Node

When a server fails (due to network, power, or other issues), all of its OSDs fail with it. Issue the following commands for each OSD on the failed storage node:

cephadm@adm > ceph osd purge-actual $id --yes-i-really-mean-it
cephadm@adm > ceph auth del osd.$id

Running the ceph osd purge-actual command is equivalent to the following:

cephadm@adm > ceph osd destroy $id
cephadm@adm > ceph osd rm $id
cephadm@adm > ceph osd crush remove osd.$id
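
Because the two cleanup commands have to be repeated for every OSD on the failed node, a small dry-run helper can generate them from a list of IDs. The function name purge_plan and the IDs 3 and 7 are hypothetical; the commands themselves are the ones shown above.

```shell
# Print (not run) the cleanup commands for every OSD ID passed in.
# Review the output before executing any of the printed commands.
purge_plan() {
  for id in "$@"; do
    echo "ceph osd purge-actual $id --yes-i-really-mean-it"
    echo "ceph auth del osd.$id"
  done
}

# Hypothetical OSD IDs that were located on the failed storage node.
purge_plan 3 7
```

The IDs that belonged to the failed host can be read from the output of ceph osd tree, which lists OSDs grouped by host.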

2.4 Redeploying Monitor Nodes

When one or more of your monitor nodes fail and are not responding, you need to remove the failed monitors from the cluster and, if appropriate, re-add them afterward.

Important: The Minimum Is Three Monitor Nodes

The number of monitor nodes must not be less than three. If a monitor node fails, and as a result your cluster has only two monitor nodes, you need to temporarily assign the monitor role to other cluster nodes before you redeploy the failed monitor nodes. After you redeploy the failed monitor nodes, you can uninstall the temporary monitor roles.

For more information on adding new nodes/roles to the Ceph cluster, see Section 2.1, “Adding New Cluster Nodes” and Section 2.2, “Adding New Roles to Nodes”.

For more information on removing cluster nodes, refer to Section 2.3, “Removing and Reinstalling Cluster Nodes”.
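
The three-monitor minimum can be illustrated with a little arithmetic. The sketch below assumes the usual majority rule for monitor quorum, which this section does not spell out; the function names are hypothetical.

```shell
# Monitors form a quorum by strict majority: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of monitors that may fail while a majority still exists.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 5; do
  echo "$n monitors: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Under that assumption, a cluster that has dropped to two monitors cannot lose another one without losing quorum, which is why a temporary monitor role is assigned before redeploying.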

There are two basic degrees of a Ceph node failure:

  • The Salt minion host is broken either physically or on the OS level, and does not respond to the salt 'minion_name' test.ping call. In such a case, you need to redeploy the server completely by following the relevant instructions in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”.

  • The monitor related services failed and refuse to recover, but the host responds to the salt 'minion_name' test.ping call. In such a case, follow these steps:

  1. Edit /srv/pillar/ceph/proposals/policy.cfg on the Salt master, and remove or update the lines that correspond to the failed monitor nodes so that they now point to the working monitor nodes. For example:

    [...]
    # MON
    #role-mon/cluster/ses-example-failed1.sls
    #role-mon/cluster/ses-example-failed2.sls
    role-mon/cluster/ses-example-new1.sls
    role-mon/cluster/ses-example-new2.sls
    [...]
  2. Run DeepSea stages 2 to 5 to apply the changes:

    root@master # deepsea stage run ceph.stage.2
    root@master # deepsea stage run ceph.stage.3
    root@master # deepsea stage run ceph.stage.4
    root@master # deepsea stage run ceph.stage.5

2.5 Verify an Encrypted OSD

After using DeepSea to deploy an OSD, you may want to verify that the OSD is encrypted.

  1. Check the output of ceph-volume lvm list (it should be run as root on the node where the OSDs in question are located):

    root@master # ceph-volume lvm list
    
      ====== osd.3 =======
    
        [block]       /dev/ceph-d9f09cf7-a2a4-4ddc-b5ab-b1fa4096f713/osd-data-71f62502-4c85-4944-9860-312241d41bb7
    
            block device              /dev/ceph-d9f09cf7-a2a4-4ddc-b5ab-b1fa4096f713/osd-data-71f62502-4c85-4944-9860-312241d41bb7
            block uuid                m5F10p-tUeo-6ZGP-UjxJ-X3cd-Ec5B-dNGXvG
            cephx lockbox secret
            cluster fsid              413d9116-e4f6-4211-a53b-89aa219f1cf2
            cluster name              ceph
            crush device class        None
            encrypted                 0
            osd fsid                  f8596bf7-000f-4186-9378-170b782359dc
            osd id                    3
            type                      block
            vdo                       0
            devices                   /dev/vdb
    
      ====== osd.7 =======
    
        [block]       /dev/ceph-38914e8d-f512-44a7-bbee-3c20a684753d/osd-data-0f385f9e-ce5c-45b9-917d-7f8c08537987
    
            block device              /dev/ceph-38914e8d-f512-44a7-bbee-3c20a684753d/osd-data-0f385f9e-ce5c-45b9-917d-7f8c08537987
            block uuid                1y3qcS-ZG01-Y7Z1-B3Kv-PLr6-jbm6-8B79g6
            cephx lockbox secret
            cluster fsid              413d9116-e4f6-4211-a53b-89aa219f1cf2
            cluster name              ceph
            crush device class        None
            encrypted                 0
            osd fsid                  0f9a8002-4c81-4f5f-93a6-255252cac2c4
            osd id                    7
            type                      block
            vdo                       0
            devices                   /dev/vdc

    Note the line that says encrypted 0. This means the OSD is not encrypted. The possible values are as follows:

      encrypted                 0  = not encrypted
      encrypted                 1  = encrypted

    If you get the following error, it means the node where you are running the command does not have any OSDs on it:

    root@master # ceph-volume lvm list
    No valid Ceph lvm devices found

    If you have deployed a cluster with an OSD for which ceph-volume lvm list shows encrypted 1, the OSD is encrypted. If you are unsure, proceed to step two.

  2. Ceph OSD encryption-at-rest relies on the Linux kernel's dm-crypt subsystem and the Linux Unified Key Setup ("LUKS"). When creating an encrypted OSD, ceph-volume creates an encrypted logical volume and saves the corresponding dm-crypt secret key in the Ceph Monitor data store. When the OSD is to be started, ceph-volume ensures the device is mounted, retrieves the dm-crypt secret key from the Ceph Monitors, and decrypts the underlying device. This creates a new device containing the unencrypted data, and this is the device the Ceph OSD daemon is started on.

    The OSD does not know whether the underlying logical volume is encrypted, and there is no ceph osd command that returns this information. However, it is possible to query LUKS for it, as follows.

    First, get the device of the OSD logical volume you are interested in. This can be obtained from the ceph-volume lvm list output:

    block device              /dev/ceph-d9f09cf7-a2a4-4ddc-b5ab-b1fa4096f713/osd-data-71f62502-4c85-4944-9860-312241d41bb7

    Then, dump the LUKS header from that device:

    root@master # cryptsetup luksDump OSD_BLOCK_DEVICE

    If the OSD is not encrypted, the output is as follows:

    Device /dev/ceph-38914e8d-f512-44a7-bbee-3c20a684753d/osd-data-0f385f9e-ce5c-45b9-917d-7f8c08537987 is not a valid LUKS device.

    If the OSD is encrypted, the output is as follows:

    root@master # cryptsetup luksDump /dev/ceph-1ce61157-81be-427d-83ad-7337f05d8514/osd-data-89230c92-3ace-4685-97ff-6fa059cef63a
      LUKS header information for /dev/ceph-1ce61157-81be-427d-83ad-7337f05d8514/osd-data-89230c92-3ace-4685-97ff-6fa059cef63a
    
      Version:        1
      Cipher name:    aes
      Cipher mode:    xts-plain64
      Hash spec:      sha256
      Payload offset: 4096
      MK bits:        256
      MK digest:      e9 41 85 f1 1b a3 54 e2 48 6a dc c2 50 26 a5 3b 79 b0 f2 2e
      MK salt:        4c 8c 9d 1f 72 1a 88 6c 06 88 04 72 81 7b e4 bb
                      b1 70 e1 c2 7c c5 3b 30 6d f7 c8 9c 7c ca 22 7d
      MK iterations:  118940
      UUID:           7675f03b-58e3-47f2-85fc-3bafcf1e589f
    
      Key Slot 0: ENABLED
              Iterations:             1906500
              Salt:                   8f 1f 7f f4 eb 30 5a 22 a5 b4 14 07 cc da dc 48
                                      b5 e9 87 ef 3b 9b 24 72 59 ea 1a 0a ec 61 e6 42
              Key material offset:    8
              AF stripes:             4000
      Key Slot 1: DISABLED
      Key Slot 2: DISABLED
      Key Slot 3: DISABLED
      Key Slot 4: DISABLED
      Key Slot 5: DISABLED
      Key Slot 6: DISABLED
      Key Slot 7: DISABLED

    Since decrypting the data on an encrypted OSD disk requires knowledge of the corresponding dm-crypt secret key, OSD encryption provides protection for cases when a disk drive that was used as an OSD is decommissioned, lost, or stolen.
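
When a node hosts many OSDs, reading the encrypted flag by eye becomes error-prone. The following is a minimal sketch that extracts the flag for each OSD from the plain-text output of ceph-volume lvm list. The function name encrypted_flags is hypothetical, and the parsing assumes the output layout shown in step 1; other ceph-volume versions may format the listing differently.

```python
import re

def encrypted_flags(lvm_list_output):
    """Map each OSD ID to True/False from `ceph-volume lvm list` text output."""
    flags = {}
    current = None
    for line in lvm_list_output.splitlines():
        # Section headers look like: "====== osd.3 ======="
        header = re.match(r"\s*=+\s*osd\.(\d+)\s*=+\s*$", line)
        if header:
            current = int(header.group(1))
            continue
        fields = line.split()
        # The flag line looks like: "encrypted                 0"
        if current is not None and len(fields) == 2 and fields[0] == "encrypted":
            flags[current] = fields[1] == "1"
    return flags
```

An empty result corresponds to the case where the node has no OSDs. If an OSD spans several volumes, the value of the last listed volume is kept in this sketch.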

2.6 Adding an OSD Disk to a Node

To add a disk to an existing OSD node, verify that any partition on the disk was removed and wiped. Refer to Step 12 in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment” for more details. Adapt /srv/salt/ceph/configuration/files/drive_groups.yml accordingly (refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.2 “DriveGroups” for details). After saving the file, run DeepSea's stage 3:

root@master # deepsea stage run ceph.stage.3

2.7 Removing an OSD

You can remove a Ceph OSD from the cluster by running the following command:

root@master # salt-run osd.remove OSD_ID

OSD_ID must be the number of the OSD without the osd. prefix. For example, for osd.3, use only the number 3.
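
The prefix rule is easy to get wrong when OSD names are copied from other command output. The sketch below normalizes a name such as osd.3 to the bare number and builds the argument list for the runner without executing it. Both function names are hypothetical helpers for this example, not part of Salt or DeepSea.

```python
def osd_number(name):
    """Return the numeric OSD ID as a string, accepting '3', 3 or 'osd.3'."""
    text = str(name).strip()
    if text.startswith("osd."):
        text = text[len("osd."):]
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a valid OSD ID: {name!r}")
    return text

def remove_command(*osds):
    """Build the argument list for salt-run osd.remove (does not run it)."""
    return ["salt-run", "osd.remove"] + [osd_number(o) for o in osds]
```

Returning a list instead of a single string keeps the arguments separate, which suits subprocess.run if you decide to execute the command from a script.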

2.7.1 Removing Multiple OSDs

Use the same procedure as mentioned in Section 2.7, “Removing an OSD” but simply supply multiple OSD IDs:

root@master # salt-run osd.remove 2 6 11 15
Removing osd 2 on host data1
Draining the OSD
Waiting for ceph to catch up.
osd.2 is safe to destroy
Purging from the crushmap
Zapping the device


Removing osd 6 on host data1
Draining the OSD
Waiting for ceph to catch up.
osd.6 is safe to destroy
Purging from the crushmap
Zapping the device


Removing osd 11 on host data1
Draining the OSD
Waiting for ceph to catch up.
osd.11 is safe to destroy
Purging from the crushmap
Zapping the device


Removing osd 15 on host data1
Draining the OSD
Waiting for ceph to catch up.
osd.15 is safe to destroy
Purging from the crushmap
Zapping the device


2:
True
6:
True
11:
True
15:
True

2.7.2 Removing All OSDs on a Host

To remove all OSDs on a specific host, run the following command:

root@master # salt-run osd.remove OSD_HOST_NAME

2.7.3 Removing Broken OSDs Forcefully

There are cases when removing an OSD gracefully (see Section 2.7, “Removing an OSD”) fails. This may happen, for example, if the OSD or its journal, WAL, or DB are broken, when it suffers from hanging I/O operations, or when the OSD disk fails to unmount. In such cases, force the removal:

root@master # salt-run osd.remove OSD_ID force=True
Tip: Hanging Mounts

If a partition is still mounted on the disk being removed, the command exits with the 'Unmount failed - check for processes on DEVICE' message. You can then list all processes that access the file system with the fuser -m DEVICE command. If fuser returns nothing, try unmounting DEVICE manually and watch the output of the dmesg or journalctl commands.

2.7.4 Validating OSD LVM Metadata

After you remove an OSD using salt-run osd.remove ID or other ceph commands, its LVM metadata may not be completely removed. As a result, when you redeploy a new OSD, the old LVM metadata would be reused.

  1. First, check if the OSD has been removed:

    cephadm@osd > ceph-volume lvm list

    Even if one of the OSDs has been removed successfully, it can still be listed. For example, if you removed osd.2, the following would be the output:

      ====== osd.2 =======
    
      [block] /dev/ceph-a2189611-4380-46f7-b9a2-8b0080a1f9fd/osd-data-ddc508bc-6cee-4890-9a42-250e30a72380
    
      block device /dev/ceph-a2189611-4380-46f7-b9a2-8b0080a1f9fd/osd-data-ddc508bc-6cee-4890-9a42-250e30a72380
      block uuid kH9aNy-vnCT-ExmQ-cAsI-H7Gw-LupE-cvSJO9
      cephx lockbox secret
      cluster fsid 6b6bbac4-eb11-45cc-b325-637e3ff9fa0c
      cluster name ceph
      crush device class None
      encrypted 0
      osd fsid aac51485-131c-442b-a243-47c9186067db
      osd id 2
      type block
      vdo 0
      devices /dev/sda

    In this example, you can see that osd.2 is still located in /dev/sda.

  2. Validate the LVM metadata on the OSD node:

    cephadm@osd > ceph-volume inventory

    The output of ceph-volume inventory marks the availability of /dev/sda as False. For example:

      Device Path   Size       rotates   available   Model name
      /dev/sda      40.00 GB   True      False       QEMU HARDDISK
      /dev/sdb      40.00 GB   True      False       QEMU HARDDISK
      /dev/sdc      40.00 GB   True      False       QEMU HARDDISK
      /dev/sdd      40.00 GB   True      False       QEMU HARDDISK
      /dev/sde      40.00 GB   True      False       QEMU HARDDISK
      /dev/sdf      40.00 GB   True      False       QEMU HARDDISK
      /dev/vda      25.00 GB   True      False
  3. Run the following command on the OSD node to remove the LVM metadata completely:

    cephadm@osd > ceph-volume lvm zap --osd-id ID --destroy
  4. Run the inventory command again to verify that the /dev/sda availability returns True. For example:

    cephadm@osd > ceph-volume inventory
    Device Path   Size       rotates   available   Model name
    /dev/sda      40.00 GB   True      True        QEMU HARDDISK
    /dev/sdb      40.00 GB   True      False       QEMU HARDDISK
    /dev/sdc      40.00 GB   True      False       QEMU HARDDISK
    /dev/sdd      40.00 GB   True      False       QEMU HARDDISK
    /dev/sde      40.00 GB   True      False       QEMU HARDDISK
    /dev/sdf      40.00 GB   True      False       QEMU HARDDISK
    /dev/vda      25.00 GB   True      False

    LVM metadata has been removed. You can safely run the dd command on the device.

  5. The OSD can now be re-deployed without needing to reboot the OSD node:

    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
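
The availability check in steps 2 and 4 can be scripted. The following sketch lists the devices that the plain-text ceph-volume inventory output reports as not available. The function name unavailable_devices is hypothetical, and the column order (path, size, unit, rotates, available, model) is an assumption taken from the example output above; verify it against the output of your ceph-volume version.

```python
def unavailable_devices(inventory_output):
    """Return device paths that `ceph-volume inventory` reports as not available."""
    busy = []
    for line in inventory_output.splitlines():
        fields = line.split()
        # Expected row: path, size, unit, rotates, available, [model ...]
        if len(fields) >= 5 and fields[0].startswith("/dev/"):
            if fields[4] == "False":
                busy.append(fields[0])
    return busy
```

After a successful zap, the device that hosted the removed OSD should no longer appear in the returned list.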

2.8 Replacing an OSD Disk

There are several reasons why you may need to replace an OSD disk, for example:

  • The OSD disk failed or is soon going to fail based on SMART information, and can no longer be used to store data safely.

  • You need to upgrade the OSD disk, for example to increase its size.

The replacement procedure is the same for both cases. It is also valid for both default and customized CRUSH Maps.

  1. Suppose that '5' is the ID of the OSD whose disk needs to be replaced. The following command marks it as destroyed in the CRUSH Map but leaves its original ID:

    root@master # salt-run osd.replace 5
    Tip: osd.replace and osd.remove

    Salt's osd.replace and osd.remove (see Section 2.7, “Removing an OSD”) commands are identical except that osd.replace leaves the OSD as 'destroyed' in the CRUSH Map, while osd.remove removes all traces of it from the CRUSH Map.

  2. Manually replace the failed/upgraded OSD drive.

  3. If you want to modify the default OSD layout and change the DriveGroups configuration, follow the procedure described in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.2 “DriveGroups”.

  4. Run the deployment stage 3 to deploy the replaced OSD disk:

    root@master # salt-run state.orch ceph.stage.3

2.9 Recovering a Reinstalled OSD Node

If the operating system breaks and is not recoverable on one of your OSD nodes, follow these steps to recover it and redeploy its OSD role with cluster data untouched:

  1. Reinstall the base SUSE Linux Enterprise operating system on the node where the OS broke. Install the salt-minion packages on the OSD node, delete the old Salt minion key on the Salt master, and register the new Salt minion's key with the Salt master. For more information on the initial deployment, see Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”.

  2. Instead of running the whole of stage 0, run the following parts:

    root@master # salt 'osd_node' state.apply ceph.sync
    root@master # salt 'osd_node' state.apply ceph.packages.common
    root@master # salt 'osd_node' state.apply ceph.mines
    root@master # salt 'osd_node' state.apply ceph.updates
  3. Copy the ceph.conf to the OSD node, and then activate the OSD:

    root@master # salt 'osd_node' state.apply ceph.configuration
    root@master # salt 'osd_node' cmd.run "ceph-volume lvm activate --all"
  4. Verify activation with one of the following commands:

    root@master # ceph -s
    # OR
    root@master # ceph osd tree
  5. To ensure consistency across the cluster, run the DeepSea stages in the following order:

    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
    root@master # salt-run state.orch ceph.stage.4
    root@master # salt-run state.orch ceph.stage.5
  6. Run DeepSea stage 0:

    root@master # salt-run state.orch ceph.stage.0
  7. Reboot the relevant OSD node. All OSD disks will be rediscovered and reused.

  8. Install and start the Prometheus node exporter:

    root@master # salt 'RECOVERED_MINION' \
     state.apply ceph.monitoring.prometheus.exporters.node_exporter
  9. Remove unnecessary Salt grains (best after all OSDs have been migrated to LVM):

    root@master # salt -I roles:storage grains.delkey ceph

2.10 Moving the Admin Node to a New Server

If you need to replace the Admin Node host with a new one, you need to move the Salt master and DeepSea files. Use your favorite synchronization tool for transferring the files. In this procedure, we use rsync because it is a standard tool available in SUSE Linux Enterprise Server 15 SP1 software repositories.

  1. Stop salt-master and salt-minion services on the old Admin Node:

    root@master # systemctl stop salt-master.service
    root@master # systemctl stop salt-minion.service
  2. Configure Salt on the new Admin Node so that the Salt master and Salt minions communicate. Find more details in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”.

    Tip: Transition of Salt Minions

    To ease the transition of Salt minions to the new Admin Node, remove the original Salt master's public key from each of them:

    root@minion > rm /etc/salt/pki/minion/minion_master.pub
    root@minion > systemctl restart salt-minion.service
  3. Verify that the deepsea package is installed and install it if required.

    root@master # zypper install deepsea
  4. Customize the policy.cfg file by changing the role-master line. Find more details in Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1 “The policy.cfg File”.

  5. Synchronize /srv/pillar and /srv/salt directories from the old Admin Node to the new one.

    Tip: rsync Dry Run and Symbolic Links

    If possible, try synchronizing the files in a dry run first to see which files will be transferred (rsync's option -n). Also, include symbolic links (rsync's option -a). For rsync, the synchronize command will look as follows:

    root@master # rsync -avn /srv/pillar/ NEW-ADMIN-HOSTNAME:/srv/pillar
  6. If you made custom changes to files outside of /srv/pillar and /srv/salt, for example in /etc/salt/master or /etc/salt/master.d, synchronize them as well.

  7. Now you can run DeepSea stages from the new Admin Node. Refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.2 “Introduction to DeepSea” for their detailed description.

2.11 Automated Installation via Salt

The installation can be automated by using the Salt reactor. For virtual environments or consistent hardware environments, this configuration will allow the creation of a Ceph cluster with the specified behavior.

Warning

Salt cannot perform dependency checks based on reactor events. There is a real risk of putting your Salt master into a death spiral.

The automated installation requires the following:

  • A properly created /srv/pillar/ceph/proposals/policy.cfg.

  • A prepared custom global.yml placed in the /srv/pillar/ceph/stack directory.

The default reactor configuration will only run stages 0 and 1. This allows testing of the reactor without waiting for subsequent stages to complete.

When the first salt-minion starts, stage 0 will begin. A lock prevents multiple instances. When all minions complete stage 0, stage 1 will begin.

If the operation is performed properly, edit the file

/etc/salt/master.d/reactor.conf

and replace the following line

- /srv/salt/ceph/reactor/discovery.sls

with

- /srv/salt/ceph/reactor/all_stages.sls

Verify that the line is not commented out.
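
The replacement and the final check can be expressed as a short helper. This is a sketch under stated assumptions: the function name enable_all_stages is hypothetical, and the helper treats reactor.conf as plain text without parsing its YAML structure. It returns the modified text together with a flag that tells whether an uncommented all_stages.sls line is present.

```python
OLD = "/srv/salt/ceph/reactor/discovery.sls"
NEW = "/srv/salt/ceph/reactor/all_stages.sls"

def enable_all_stages(reactor_conf):
    """Swap discovery.sls for all_stages.sls; report if the line is active."""
    lines = [ln.replace(OLD, NEW) for ln in reactor_conf.splitlines()]
    # The line counts as active only if it is not commented out.
    active = any(
        NEW in ln and not ln.lstrip().startswith("#") for ln in lines
    )
    return "\n".join(lines) + "\n", active
```

If the flag is False, the line is still commented out and the reactor would not run the stages.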

2.12 Updating the Cluster Nodes

Keep the Ceph cluster nodes up-to-date by applying rolling updates regularly.

2.12.1 Software Repositories

Before patching the cluster with the latest software packages, verify that all the cluster's nodes have access to the relevant repositories. Refer to Book “Deployment Guide”, Chapter 6 “Upgrading from Previous Releases”, Section 6.8.1 “Manual Node Upgrade Using the Installer DVD” for a complete list of the required repositories.

2.12.2 Repository Staging

If you use a staging tool—for example, SUSE Manager, Subscription Management Tool, or Repository Mirroring Tool—that serves software repositories to the cluster nodes, verify that stages for both 'Updates' repositories for SUSE Linux Enterprise Server and SUSE Enterprise Storage are created at the same point in time.

We strongly recommend using a staging tool to apply patches that have frozen or staged patch levels. This ensures that new nodes joining the cluster have the same patch level as the nodes already running in the cluster. This way you avoid the need to apply the latest patches to all the cluster's nodes before new nodes can join the cluster.

2.12.3 zypper patch or zypper dup

By default, cluster nodes are upgraded using the zypper dup command. If you prefer to update the system using zypper patch instead, edit /srv/pillar/ceph/stack/global.yml and add the following line:

update_method_init: zypper-patch

2.12.4 Cluster Node Reboots

During the update, cluster nodes may be optionally rebooted if their kernel was upgraded by the update. If you want to eliminate the possibility of a forced reboot of potentially all nodes, either verify that the latest kernel is installed and running on Ceph nodes, or disable automatic node reboots as described in Book “Deployment Guide”, Chapter 7 “Customizing the Default Configuration”, Section 7.1.5 “Updates and Reboots during Stage 0”.

2.12.5 Downtime of Ceph Services

Depending on the configuration, cluster nodes may be rebooted during the update as described in Section 2.12.4, “Cluster Node Reboots”. If there is a single point of failure for services such as Object Gateway, Samba Gateway, NFS Ganesha, or iSCSI, the client machines may be temporarily disconnected from services whose nodes are being rebooted.

2.12.6 Running the Update

To update the software packages on all cluster nodes to the latest version, follow these steps:

  1. Update the deepsea, salt-master, and salt-minion packages and restart relevant services on the Salt master:

    root@master # salt -I 'roles:master' state.apply ceph.updates.master
  2. Update the salt-minion package and restart its service on all cluster nodes:

    root@master # salt -I 'cluster:ceph' state.apply ceph.updates.salt
  3. Update all other software packages on the cluster:

    root@master # salt-run state.orch ceph.stage.0
  4. Restart Ceph related services:

    root@master # salt-run state.orch ceph.restart

2.13 Halting or Rebooting Cluster

In some cases, it may be necessary to halt or reboot the whole cluster. We recommend carefully checking for dependencies of running services. The following steps provide an outline for stopping and starting the cluster:

  1. Tell the Ceph cluster not to mark OSDs as out:

    cephadm@adm > ceph osd set noout
  2. Stop daemons and nodes in the following order:

    1. Storage clients

    2. Gateways, for example NFS Ganesha or Object Gateway

    3. Metadata Server

    4. Ceph OSD

    5. Ceph Manager

    6. Ceph Monitor

  3. If required, perform maintenance tasks.

  4. Start the nodes and servers in the reverse order of the shutdown process:

    1. Ceph Monitor

    2. Ceph Manager

    3. Ceph OSD

    4. Metadata Server

    5. Gateways, for example NFS Ganesha or Object Gateway

    6. Storage clients

  5. Remove the noout flag:

    cephadm@adm > ceph osd unset noout
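
The outline above can be captured as data, which makes the symmetry between shutdown and start-up explicit. This is a minimal sketch: the names SHUTDOWN_ORDER, startup_order, and plan are hypothetical, and the code only produces a checklist; it does not stop or start any service.

```python
SHUTDOWN_ORDER = [
    "Storage clients",
    "Gateways (for example NFS Ganesha or Object Gateway)",
    "Metadata Server",
    "Ceph OSD",
    "Ceph Manager",
    "Ceph Monitor",
]

def startup_order(shutdown_order=SHUTDOWN_ORDER):
    """Start-up is the shutdown sequence reversed."""
    return list(reversed(shutdown_order))

def plan():
    """Return the full halt/maintenance/start checklist as a list of steps."""
    steps = ["ceph osd set noout"]
    steps += [f"stop: {name}" for name in SHUTDOWN_ORDER]
    steps += ["maintenance (if required)"]
    steps += [f"start: {name}" for name in startup_order()]
    steps += ["ceph osd unset noout"]
    return steps
```

Deriving the start-up sequence from the shutdown list, rather than maintaining two lists, prevents the two orders from drifting apart when a component is added.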

2.14 Adjusting ceph.conf with Custom Settings

If you need to put custom settings into the ceph.conf file, you can do so by modifying the configuration files in the /srv/salt/ceph/configuration/files/ceph.conf.d directory:

  • global.conf

  • mon.conf

  • mgr.conf

  • mds.conf

  • osd.conf

  • client.conf

  • rgw.conf

Note: Unique rgw.conf

The Object Gateway offers a lot of flexibility and is unique compared to the other ceph.conf sections. All other Ceph components have static headers such as [mon] or [osd]. The Object Gateway has unique headers such as [client.rgw.rgw1]. This means that the rgw.conf file needs a header entry. For examples, see

/srv/salt/ceph/configuration/files/rgw.conf

or

/srv/salt/ceph/configuration/files/rgw-ssl.conf

See Section 26.7, “Enabling HTTPS/SSL for Object Gateways” for more examples.

Important: Run Stages 3 and 4

After you make custom changes to the above mentioned configuration files, run stages 3 and 4 to apply these changes to the cluster nodes:

root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4

These files are included from the /srv/salt/ceph/configuration/files/ceph.conf.j2 template file, and correspond to the different sections that the Ceph configuration file accepts. Putting a configuration snippet in the correct file enables DeepSea to place it into the correct section. You do not need to add any of the section headers.

Tip

To apply configuration options only to specific instances of a daemon, add a header such as [osd.1]. The configuration options that follow this header are applied only to the OSD daemon with the ID 1.

2.14.1 Overriding the Defaults

Later statements in a section overwrite earlier ones. Therefore it is possible to override the default configuration as specified in the /srv/salt/ceph/configuration/files/ceph.conf.j2 template. For example, to turn off cephx authentication, add the following three lines to the /srv/salt/ceph/configuration/files/ceph.conf.d/global.conf file:

auth cluster required = none
auth service required = none
auth client required = none

When redefining the default values, Ceph related tools such as rados may issue warnings that specific values from the ceph.conf.j2 were redefined in global.conf. These warnings are caused by one parameter assigned twice in the resulting ceph.conf.

As a workaround for this specific case, follow these steps:

  1. Change the current directory to /srv/salt/ceph/configuration/create:

    root@master # cd /srv/salt/ceph/configuration/create
  2. Copy default.sls to custom.sls:

    root@master # cp default.sls custom.sls
  3. Edit custom.sls and change ceph.conf.j2 to custom-ceph.conf.j2.

  4. Change current directory to /srv/salt/ceph/configuration/files:

    root@master # cd /srv/salt/ceph/configuration/files
  5. Copy ceph.conf.j2 to custom-ceph.conf.j2:

    root@master # cp ceph.conf.j2 custom-ceph.conf.j2
  6. Edit custom-ceph.conf.j2 and delete the following line:

    {% include "ceph/configuration/files/rbd.conf" %}

    Edit global.yml and add the following line:

    configuration_create: custom
  7. Refresh the pillar:

    root@master # salt target saltutil.pillar_refresh
  8. Run stage 3:

    root@master # salt-run state.orch ceph.stage.3

Now you should have only one entry for each value definition. To re-create the configuration, run:

root@master # salt-run state.orch ceph.configuration.create

and then verify the contents of /srv/salt/ceph/configuration/cache/ceph.conf.
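
Checking the generated file for options that are assigned more than once can be automated. The sketch below scans the text of a ceph.conf file and reports every option that appears more than once within the same section. The function name duplicate_options is hypothetical. The sketch also assumes that spaces and underscores in option names are interchangeable, so 'auth cluster required' and 'auth_cluster_required' count as the same option.

```python
from collections import defaultdict

def duplicate_options(conf_text):
    """Return {(section, option): count} for options set more than once."""
    counts = defaultdict(int)
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if "=" in line:
            key = line.split("=", 1)[0]
            # Assumption: spaces and underscores are interchangeable in names.
            key = "_".join(key.replace("_", " ").split())
            counts[(section, key)] += 1
    return {k: n for k, n in counts.items() if n > 1}
```

An empty dictionary means that each option is defined only once per section, which is the state the workaround aims for.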

2.14.2 Including Configuration Files

If you need to apply a lot of custom configurations, use include statements within the custom configuration files to make file management easier. The following is an example of the osd.conf file:

[osd.1]
{% include "ceph/configuration/files/ceph.conf.d/osd1.conf" ignore missing %}
[osd.2]
{% include "ceph/configuration/files/ceph.conf.d/osd2.conf" ignore missing %}
[osd.3]
{% include "ceph/configuration/files/ceph.conf.d/osd3.conf" ignore missing %}
[osd.4]
{% include "ceph/configuration/files/ceph.conf.d/osd4.conf" ignore missing %}

In the previous example, the osd1.conf, osd2.conf, osd3.conf, and osd4.conf files contain the configuration options specific to the related OSD.
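
For a larger number of OSDs, the include lines follow a regular pattern and can be generated instead of typed. The helper below is an illustrative sketch, not a DeepSea feature. The function name osd_includes is hypothetical, and the per-OSD file naming (osd1.conf, osd2.conf, and so on) simply mirrors the example above.

```python
def osd_includes(osd_ids):
    """Render the osd.conf body with one section header and include per OSD ID."""
    prefix = '{% include "ceph/configuration/files/ceph.conf.d/osd'
    suffix = '.conf" ignore missing %}'
    lines = []
    for osd_id in osd_ids:
        lines.append(f"[osd.{osd_id}]")
        lines.append(prefix + str(osd_id) + suffix)
    return "\n".join(lines) + "\n"
```

For example, osd_includes(range(1, 5)) reproduces the four sections shown in the osd.conf example.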

Tip: Runtime Configuration

Changes made to Ceph configuration files take effect after the related Ceph daemons restart. See Section 25.1, “Runtime Configuration” for more information on changing the Ceph runtime configuration.

2.15 Enabling AppArmor Profiles

AppArmor is a security solution that confines programs by a specific profile. For more details, refer to https://documentation.suse.com/sles/15-SP1/single-html/SLES-security/#part-apparmor.

DeepSea provides three states for AppArmor profiles: 'enforce', 'complain', and 'disable'. To activate a particular AppArmor state, run:

salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-STATE

To put the AppArmor profiles in an 'enforce' state:

root@master # salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-enforce

To put the AppArmor profiles in a 'complain' state:

root@master # salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-complain

To disable the AppArmor profiles:

root@master # salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-disable
Tip: Enabling the AppArmor Service

Each of these three calls verifies if AppArmor is installed and installs it if not, and starts and enables the related systemd service. DeepSea will warn you if AppArmor was installed and started/enabled in another way and therefore runs without DeepSea profiles.

2.16 Deactivating Tuned Profiles

By default, DeepSea deploys Ceph clusters with tuned profiles active on Ceph Monitor, Ceph Manager, and Ceph OSD nodes. In some cases, you may need to permanently deactivate tuned profiles. To do so, put the following lines in /srv/pillar/ceph/stack/global.yml and re-run stage 3:

alternative_defaults:
 tuned_mgr_init: default-off
 tuned_mon_init: default-off
 tuned_osd_init: default-off

root@master # salt-run state.orch ceph.stage.3

2.17 Removing an Entire Ceph Cluster

The ceph.purge runner removes the entire Ceph cluster. This way you can clean the cluster environment when testing different setups. After ceph.purge completes, the Salt cluster is reverted to the state at the end of DeepSea stage 1. You can then either change the policy.cfg (see Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1 “The policy.cfg File”), or proceed to DeepSea stage 2 with the same setup.

To prevent accidental deletion, the orchestration checks if the safety is disengaged. You can disengage the safety measures and remove the Ceph cluster by running:

root@master # salt-run disengage.safety
root@master # salt-run state.orch ceph.purge
Tip: Disabling Ceph Cluster Removal

If you want to prevent anyone from running the ceph.purge runner, create a file named disabled.sls in the /srv/salt/ceph/purge directory and insert the following line in the /srv/pillar/ceph/stack/global.yml file:

purge_init: disabled
Important: Rescind Custom Roles

If you previously created custom roles for Ceph Dashboard (refer to Section 6.6, “Adding Custom Roles” and Section 14.2, “User Roles and Permissions” for detailed information), you need to take manual steps to purge them before running the ceph.purge runner. For example, if the custom role for Object Gateway is named 'us-east-1', then follow these steps:

root@master # cd /srv/salt/ceph/rescind
root@master # rsync -a rgw/ us-east-1
root@master # sed -i 's!rgw!us-east-1!' us-east-1/*.sls
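
The sed expression in the last command has no g flag, so it replaces only the first occurrence of 'rgw' on each line of the copied state files. The following sketch mirrors that behavior on a string, which can help to preview the result before editing the files in place. The function name rename_role is a hypothetical helper.

```python
def rename_role(sls_text, old="rgw", new="us-east-1"):
    """Replace the first occurrence of `old` on each line, like sed without g."""
    return "\n".join(
        line.replace(old, new, 1) for line in sls_text.split("\n")
    )
```

Lines that mention 'rgw' more than once keep the later occurrences, so inspect such lines manually.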

3 Backing Up Cluster Configuration and Data

This chapter explains which parts of the Ceph cluster you should back up in order to be able to restore its functionality.

3.1 Back Up Ceph Configuration

Back up the /etc/ceph directory. It contains crucial cluster configuration. You will need the backup of /etc/ceph, for example, when you replace the Admin Node.

3.2 Back Up Salt Configuration

You need to back up the /etc/salt/ directory. It contains the Salt configuration files, for example the Salt master key and accepted client keys.

The Salt files are not strictly required for backing up the Admin Node, but make redeploying the Salt cluster easier. If there is no backup of these files, the Salt minions need to be registered again at the new Admin Node.

Note: Security of the Salt Master Private Key

Make sure that the backup of the Salt master private key is stored in a safe location. The Salt master key can be used to manipulate all cluster nodes.

After restoring the /etc/salt directory from a backup, restart the Salt services:

root@master # systemctl restart salt-master
root@master # systemctl restart salt-minion

3.3 Back Up DeepSea Configuration

All files required by DeepSea are stored in /srv/pillar/, /srv/salt/ and /etc/salt/master.d.

If you need to redeploy the Admin Node, install the DeepSea package on the new node and move the backed up data back into the directories. DeepSea can then be used again without any further changes being required. Before using DeepSea again, make sure that all Salt minions are correctly registered on the Admin Node.

3.4 Back Up Custom Configurations

  • Prometheus data and customization.

  • Grafana customization.

  • Verify that you have a record of existing openATTIC users so that you can create new accounts for these users in the Ceph Dashboard.

  • Manual changes to ceph.conf outside of DeepSea.

  • Manual changes to the iSCSI configuration outside of DeepSea.

  • Ceph keys.

  • CRUSH Map and CRUSH rules. Save the decompiled CRUSH Map including CRUSH rules into crushmap-backup.txt by running the following command:

    cephadm@adm > ceph osd getcrushmap | crushtool -d - -o crushmap-backup.txt
  • Samba Gateway configuration. If you are using a single gateway, back up /etc/samba/smb.conf. If you are using an HA setup, also back up the CTDB and Pacemaker configuration files. Refer to Chapter 29, Exporting Ceph Data via Samba for details on what configuration is used by Samba Gateways.

  • NFS Ganesha configuration. Only needed when using an HA setup. Refer to Chapter 30, NFS Ganesha: Export Ceph Data via NFS for details on what configuration is used by NFS Ganesha.

Part II Ceph Dashboard

4 About Ceph Dashboard

The Ceph Dashboard is a module that adds a built-in Web based monitoring and administration application to the Ceph Manager (refer to Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.2.3 “Ceph Nodes and Daemons” for more details on Ceph Manager). You no longer need …

5 Dashboard's Web User Interface

To log in to the dashboard Web application, point your browser to its URL including the port number. You can find its address by running

6 Managing Dashboard Users and Roles

Dashboard user management performed by Ceph commands on the command line was already introduced in Chapter 14, Managing Users and Roles on the Command Line.

7 Viewing Cluster Internals

The Cluster menu item lets you view detailed information about Ceph cluster hosts, OSDs, MONs, CRUSH Map, and the content of log files.

8 Managing Pools

For more general information about Ceph pools, refer to Chapter 22, Managing Storage Pools. For information specific to erasure code pools, refer to Chapter 24, Erasure Coded Pools.

9 Managing RADOS Block Devices

To list all available RADOS Block Devices (RBDs), click Block › Images from the main menu.

10 Managing NFS Ganesha

For more general information about NFS Ganesha, refer to Chapter 30, NFS Ganesha: Export Ceph Data via NFS.

11 Managing Ceph File Systems

To find detailed information about CephFS, refer to Chapter 28, Clustered File System.

12 Managing Object Gateways

For more general information about Object Gateway, refer to Chapter 26, Ceph Object Gateway.

13 Manual Configuration

This section introduces advanced information for users that prefer configuring dashboard's settings manually on the command line.

14 Managing Users and Roles on the Command Line

This section describes how to manage user accounts used by the Ceph Dashboard. It helps you create or modify user accounts, as well as set proper user roles and permissions.

4 About Ceph Dashboard

The Ceph Dashboard is a module that adds a built-in Web-based monitoring and administration application to the Ceph Manager (refer to Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.2.3 “Ceph Nodes and Daemons” for more details on Ceph Manager). You no longer need to know complex Ceph-related commands to manage and monitor your Ceph cluster. You can use either the Ceph Dashboard's intuitive Web interface or its built-in REST API.

The Ceph Dashboard is automatically enabled and configured with DeepSea's stage 3 during the deployment procedure (see Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment”). In a Ceph cluster with multiple Ceph Manager instances, only the dashboard running on the currently active Ceph Manager daemon serves incoming requests. Accessing the dashboard's TCP port on any of the other Ceph Manager instances that are currently on standby results in an HTTP redirect (303) to the currently active Ceph Manager's dashboard URL. This way, you can point your browser to any of the Ceph Manager instances to access the dashboard. Consider this behavior when securing access with a firewall or planning an HA setup.

5 Dashboard's Web User Interface

5.1 Log In

To log in to the dashboard Web application, point your browser to its URL including the port number. You can find its address by running:

cephadm@adm > ceph mgr services | grep dashboard
"dashboard": "https://ses-dash-node.example.com:8443/",
Figure 5.1: Ceph Dashboard Login Screen
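
If you script against the dashboard, it can be handy to extract the URL programmatically instead of using grep. The following Python sketch assumes that the unfiltered output of ceph mgr services is a JSON object that maps service names to URLs, as the output above suggests. The function name dashboard_endpoint and the sample string are illustrative only.

```python
import json
from typing import Tuple
from urllib.parse import urlsplit


def dashboard_endpoint(mgr_services_json: str) -> Tuple[str, str, int]:
    """Return (url, host, port) for the dashboard entry of the given JSON text."""
    services = json.loads(mgr_services_json)
    url = services["dashboard"]  # raises KeyError if no dashboard URL is listed
    parts = urlsplit(url)
    # Fall back to the scheme's default port if the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return url, parts.hostname, port


# Sample text modeled on the output shown above; not taken from a live cluster.
sample = '{"dashboard": "https://ses-dash-node.example.com:8443/"}'
print(dashboard_endpoint(sample))
```

On a real Admin Node, the input could come from subprocess.run(["ceph", "mgr", "services"], capture_output=True, text=True).stdout.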

You need a user account to log in to the dashboard Web application. DeepSea creates a default user 'admin' with administrator privileges for you. If you decide to log in with the default 'admin' user, retrieve the corresponding password by running:

root@master # salt-call grains.get dashboard_creds

Tip: Custom User Account

If you do not want to use the default 'admin' account to access the Ceph Dashboard, create a custom user account with administrator privileges. Refer to Chapter 14, Managing Users and Roles on the Command Line for more details.

The dashboard user interface is graphically divided into several blocks: the utility menu, the main menu, and the main content pane.

Figure 5.2: Ceph Dashboard Home Page

5.2 Utility Menu

The top right part of the screen contains a utility menu. It includes general tasks related more to the dashboard than to the Ceph cluster. By clicking its items, you can access the following topics:

  • Change the language of the dashboard's user interface. You can choose from Czech, English, German, Spanish, French, Portuguese, Chinese, or Indonesian.

  • Display a list of Ceph related tasks that are running in the background.

  • View and erase recent dashboard notifications.

  • Display a list of links that refer to the information about the dashboard, its complete documentation, and an overview of its REST API.

  • Manage the dashboard's users and user roles. Refer to Chapter 14, Managing Users and Roles on the Command Line for more detailed command line descriptions.

  • Log out of the dashboard.

5.3 Main Menu

The dashboard's main menu occupies the top left part of the screen. It covers the following topics:

Dashboard

Return to Ceph Dashboard's home page.

Cluster

View detailed information about overall cluster configuration and CRUSH Map, its hosts, Ceph OSDs, Ceph Monitors, and the content of a log file.

Pools

View and manage cluster pools.

Block

View and manage block devices and their iSCSI exports.

NFS

View and manage NFS Ganesha deployments.

Filesystems

View and manage CephFSs.

Object Gateway

View and manage Object Gateway's daemons, users, and buckets.

5.4 The Content Pane

The content pane occupies the main part of the dashboard's screen. The dashboard home page shows a number of widgets that summarize the cluster's current status, capacity, and performance.

5.5 Common Web UI Features

In Ceph Dashboard, you often work with lists, for example, lists of pools, OSD nodes, or RBD devices. By default, all lists refresh automatically every 5 seconds. The following common widgets help you manage or adjust these lists:

Click to trigger a manual refresh of the list.

Click to display or hide individual table columns.

Click and select how many rows to display on a single page.

Click inside and filter the rows by typing the string to search for.

Use to change the currently displayed page if the list spans across multiple pages.

5.6 Dashboard Widgets

Each dashboard widget shows specific status information related to a specific aspect of a running Ceph cluster. Some widgets are active links: clicking them takes you to a detailed page on the topic they represent.

Tip: More Details on Mouse Over

Some graphical widgets show you more detail when you move the mouse over them.

5.6.1 Status Widgets

Status widgets give you a brief overview of the cluster's current status.

Figure 5.3: Status Widgets

Cluster Status

Presents basic information about the cluster's health.

Monitors

Shows the number of running monitors and their quorum.

OSDs

Shows the total number of OSDs, as well as the number of 'up' and 'in' OSDs.

Manager Daemons

Shows the number of active and standby Ceph Manager daemons.

Hosts

Shows the total number of cluster nodes.

Object Gateways

Shows the number of running Object Gateways.

Metadata Servers

Shows the number of Metadata Servers.

iSCSI Gateway

Shows the number of configured iSCSI gateways.

5.6.2 Performance Widgets

Performance widgets show basic performance data of Ceph clients.

Figure 5.4: Performance Widgets

Client IOPS

The number of clients' read and write operations per second.

Client Throughput

The amount of data that clients read and write per second.

Client Read/Write

Visualizes clients' read/write ratio.

Recovery Throughput

The throughput of data recovered per second.

Scrub

Shows the scrub (see Section 20.4.9, “Scrubbing a Placement Group”) status. It is either disabled, enabled, or active.

5.6.3 Capacity Widgets

Capacity widgets show brief information about the storage capacity.

Figure 5.5: Capacity Widgets

Pools

Shows the number of pools in the cluster.

Raw Capacity

Shows the ratio of used and available raw storage capacity.

Objects

Shows the number of data objects stored in the cluster.

PGs per OSD

Shows the number of placement groups per OSD.

PG Status

Displays a chart of the placement groups according to their status.

6 Managing Dashboard Users and Roles

Dashboard user management using Ceph commands on the command line is described in Chapter 14, Managing Users and Roles on the Command Line.

This section describes how to manage user accounts by using the Dashboard Web user interface.

6.1 Listing Users

Click in the utility menu and select User Management.

The list contains each user's user name, full name, e-mail, and list of assigned roles.

Figure 6.1: User Management

6.2 Adding New Users

Click Add in the top left of the table heading to add a new user. Enter their user name, password, and optionally a full name and an e-mail.

Figure 6.2: Adding a User

Click the pen icon to assign predefined roles to the user. Confirm with Create User.

6.3 Editing Users

Click a user's table row and then select Edit to edit details about the user. Confirm with Update User.

6.4 Deleting Users

Click a user's table row and then select Delete to delete the user account. Activate the Yes, I am sure check box and confirm with Delete User.

6.5 Listing User Roles

Click in the utility menu and select User Management. Then click the Roles tab.

The list contains each role's name, description, and whether it is a system role.

Figure 6.3: User Roles

6.6 Adding Custom Roles

Click Add in the top left of the table heading to add a new custom role. Enter its name, description, and set required permissions for individual topics.

Tip: Purging Custom Roles

If you create custom user roles and intend to remove the Ceph cluster with the ceph.purge runner later on, you need to purge the custom roles first. Find more details in Section 2.17, “Removing an Entire Ceph Cluster”.

Figure 6.4: Adding a Role

Tip: Multiple Activation

By activating the check box that precedes the topic name, you activate all permissions for that topic. By activating the All check box, you activate all permissions for all the topics.

Confirm with Create Role.

6.7 Editing Custom Roles

Click a custom role's table row and then select Edit in the top left of the table heading to edit a description and permissions of the custom role. Confirm with Update Role.

6.8 Deleting Custom Roles

Click a role's table row and then select Delete to delete the role. Activate the Yes, I am sure check box and confirm with Delete Role.

7 Viewing Cluster Internals

The Cluster menu item lets you view detailed information about Ceph cluster hosts, OSDs, MONs, CRUSH Map, and the content of log files.

7.1 Cluster Nodes

Click Cluster › Hosts to view a list of cluster nodes.

Figure 7.1: Hosts

Click a node name in the Hostname column to view performance details of the node.

The Services column lists all daemons that are running on each related node. Click a daemon name to view its detailed configuration.

7.2 Ceph Monitors

Click Cluster › Monitors to view a list of cluster nodes with running Ceph monitors. The list includes each monitor's rank number, public IP address, and number of open sessions.

The list is divided into two tables: one for Ceph Monitors that are in quorum, and the second one for Ceph Monitors that are not in quorum.

Click a node name in the Name column to view the related Ceph Monitor configuration.

The Status table shows general statistics about the running Ceph Monitors.

Figure 7.2: Ceph Monitors

7.3 Ceph OSDs

Click Cluster › OSDs to view a list of nodes with running OSD daemons/disks. The list includes each node's name, status, number of placement groups, size, usage, reads/writes chart in time, and the rate of read/write operations per second.

Figure 7.3: Ceph OSDs

Click Set Cluster-wide Flags in the table heading to pop up a window with a list of flags that apply to the whole cluster. You can activate or deactivate individual flags, and confirm with Submit.

Figure 7.4: OSD Flags

Click Set Cluster-wide Recovery Priority in the table heading to open a pop-up window with a list of OSD recovery priorities that apply to the whole cluster. You can activate the preferred priority profile, and fine tune the individual values below. Confirm with Submit.

Figure 7.5: OSD Recovery Priority

Click a node name in the Host column to view an extended table with details about the OSD setting and performance. Browsing through several tabs, you can see lists of Attributes, Metadata, Performance counter, and a graphical Histogram of reads and writes.

Figure 7.6: OSD Details

Tip: Perform Specific Tasks on OSDs

After you click an OSD node name, its table row changes color slightly, meaning that you can now perform a task on the node. Click the down arrow in the top left of the table heading and select a task to perform, such as Deep Scrub. Optionally enter a required value for the task, and confirm to run it on the node.

7.4 Cluster Configuration

Click Cluster › Configuration to view a complete list of Ceph cluster configuration options. The list contains the name of the option, its short description, and its current and default values.

Figure 7.7: Cluster Configuration

Click a specific option row to highlight it and see detailed information about the option, such as its value type, minimum and maximum permitted values, and whether it can be updated at runtime.

After highlighting a specific option, you can edit its value(s) by clicking the Edit button in the top left of the table heading. Confirm changes with Save.

7.5 CRUSH Map

Click Cluster › CRUSH map to view a CRUSH Map of the cluster. For more general information on CRUSH Maps, refer to Section 20.5, “CRUSH Map Manipulation”.

Click the root, nodes, or individual OSDs to view more detailed information, such as CRUSH weight, depth in the map tree, and device class of the OSD.

Figure 7.8: CRUSH Map

7.6 Manager Modules

Click Cluster › Manager modules to view a list of available Ceph Manager modules. Each line consists of a module name and information on whether it is currently enabled or not.

Figure 7.9: Manager Modules

After highlighting a specific module, you can see its detailed settings in the Details table below. Edit them by clicking the Edit button in the top left of the table heading. Confirm changes with Update.

7.7 Logs

Click Cluster › Logs to view a list of the cluster's recent log entries. Each line consists of a time stamp, the type of the log entry, and the logged message itself.

Click the Audit Logs tab to view log entries of the auditing subsystem. Refer to Section 14.4, “Auditing” for commands to enable or disable auditing.

Figure 7.10: Logs

8 Managing Pools

Tip: More Information on Pools

For more general information about Ceph pools, refer to Chapter 22, Managing Storage Pools. For information specific to erasure code pools, refer to Chapter 24, Erasure Coded Pools.

To list all available pools, click Pools from the main menu.

The list shows each pool's name, type, related application, placement group status, replica size, erasure coded profile, usage, and read/write statistics.

Figure 8.1: List of Pools

To view more detailed information about a pool, activate its table row.

8.1 Adding a New Pool

To add a new pool, click Add in the top left of the pools table. In the pool form you can enter the pool's name, type, number of placement groups, and additional information, such as the pool's applications, and compression mode and algorithm. The pool form itself pre-calculates the number of placement groups that is best suited to this specific pool. The calculation is based on the number of OSDs in the cluster and the selected pool type with its specific settings. As soon as a placement groups number is set manually, it will be replaced by a calculated number. Confirm with Create pool.

Figure 8.2: Adding a New Pool

8.2 Deleting Pools

To delete a pool, click its table row and click Delete in the top left of the pools table.

8.3 Editing a Pool's Options

To edit a pool's options, click its table row and select Edit in the top left of the pools table.

You can change the name of the pool, increase the number of placement groups, and change the list of the pool's applications and compression settings. Confirm with Edit pool.

9 Managing RADOS Block Devices

To list all available RADOS Block Devices (RBDs), click Block › Images from the main menu.

The list shows brief information about each device, such as its name, the related pool name, the size of the device, and the number and size of objects on the device.

Figure 9.1: List of RBD Images

9.1 Viewing Details about RBDs

To view more detailed information about a device, click its row in the table:

Figure 9.2: RBD Details

9.2 Viewing RBD's Configuration

To view detailed configuration of a device, click its row in the table and then the Configuration tab in the lower table:

Figure 9.3: RBD Configuration

9.3 Creating RBDs

To add a new device, click Add in the top left of the table heading and do the following on the Create RBD screen:

Figure 9.4: Adding a New RBD

  1. Enter the name of the new device. Refer to Book “Deployment Guide”, Chapter 2 “Hardware Requirements and Recommendations”, Section 2.11 “Naming Limitations” for naming limitations.

  2. Select the pool with the 'rbd' application assigned from which the new RBD device will be created.

  3. Specify the size of the new device.

  4. Specify additional options for the device. To fine-tune the device parameters, click Advanced and enter values for object size, stripe unit, or stripe count. To enter Quality of Service (QoS) limits, click Quality of Service and enter them.

  5. Confirm with Create RBD.

9.4 Deleting RBDs

To delete a device, click its row in the table and select Delete in the top left of the table heading. Confirm the deletion with Delete RBD.

Tip: Moving RBDs to Trash

Deleting an RBD is an irreversible action. If you Move to Trash instead, you can restore the device later on by selecting it on the Trash tab of the main table and clicking Restore in the top left of the table heading.

9.5 RADOS Block Device Snapshots

To create a RADOS Block Device snapshot, click the device's table row and in the Snapshots tab below the main table, click Create in the top left of the table heading. Enter the snapshot's name and confirm with Create Snapshot.

After selecting a snapshot, you can perform additional actions on the device, such as rename, protect, clone, copy, or delete. Rollback restores the device's state from the current snapshot.

Figure 9.5: RBD Snapshots

9.6 Managing iSCSI Gateways

Tip: More Information on iSCSI Gateways

For more general information about iSCSI Gateways, refer to Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway” and Chapter 27, Ceph iSCSI Gateway.

To list all available gateways and mapped images, click Block › iSCSI from the main menu. An Overview tab opens, listing currently configured iSCSI Gateways and mapped RBD images.

The Gateways table lists each gateway's state, number of iSCSI targets, and number of sessions. The Images table lists each mapped image's name, related pool name, backstore type, and other statistical details.

The Targets tab lists currently configured iSCSI targets.

Figure 9.6: List of iSCSI Targets

To view more detailed information about a target, click its table row. A tree-structured schema opens, listing disks, portals, initiators, and groups. Click an item to expand it and view its detailed contents, optionally with a related configuration in the table on the right.

Figure 9.7: iSCSI Target Details

9.6.1 Adding iSCSI Targets

To add a new iSCSI target, click Add in the top left of the Targets table and enter the required information.

Figure 9.8: Adding a New Target

  1. Enter the target address of the new gateway.

  2. Click Add portal and select one or multiple iSCSI portals from the list.

  3. Click Add image and select one or multiple RBD images for the gateway.

  4. If you need to use authentication to access the gateway, activate the Authentication check box and enter the credentials. You can find more advanced authentication options after activating Mutual authentication and Discovery authentication.

  5. Confirm with Create Target.

9.6.2 Editing iSCSI Targets

To edit an existing iSCSI target, click its row in the Targets table and click Edit in the top left of the table.

You can then modify the iSCSI target, add or delete portals, and add or delete related RBD images. You can also adjust authentication information for the gateway.

9.6.3 Deleting iSCSI Targets

To delete an iSCSI target, click its table row and select Delete in the top left of the Targets table. Activate Yes, I am sure and confirm with Delete iSCSI.

9.7 RBD Quality of Service (QoS)

Tip: For More Information

For more general information and a description of RBD QoS configuration options, refer to Section 23.6, “QoS Settings”.

The QoS options can be configured at different levels:

  • Globally

  • On a per-pool basis

  • On a per-image basis

The global configuration is at the top of the list and will be used for all newly created RBD images and for those images that do not override these values on the pool or RBD image layer. An option value specified globally can be overridden on a per-pool or per-image basis. Options specified on a pool will be applied to all RBD images of that pool unless overridden by a configuration option set on an image. Options specified on an image will override options specified on a pool and will override options specified globally.

This way it is possible to define defaults globally, adapt them for all RBD images of a specific pool, and override the pool configuration for individual RBD images.
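
The precedence described above can be modeled in a few lines of Python. This is only a sketch of the lookup order (image, then pool, then global), not the way Ceph implements it internally. The function name effective_qos and all values are hypothetical, and rbd_qos_iops_limit serves merely as an example option name.

```python
from typing import Dict, Optional

Config = Dict[str, int]


def effective_qos(option: str, global_cfg: Config, pool_cfg: Config,
                  image_cfg: Config) -> Optional[int]:
    """Resolve an option by precedence: image first, then pool, then global."""
    for layer in (image_cfg, pool_cfg, global_cfg):
        if option in layer:
            return layer[option]
    return None  # option is not set on any level


# Hypothetical values, for illustration only.
global_cfg = {"rbd_qos_iops_limit": 1000}
pool_cfg = {"rbd_qos_iops_limit": 500}
image_cfg: Config = {}

# No image-level value is set, so the pool-level value applies.
print(effective_qos("rbd_qos_iops_limit", global_cfg, pool_cfg, image_cfg))

# Setting the option on the image overrides both the pool and the global value.
image_cfg["rbd_qos_iops_limit"] = 200
print(effective_qos("rbd_qos_iops_limit", global_cfg, pool_cfg, image_cfg))
```

Removing a key from image_cfg or pool_cfg in this model corresponds to resetting the value on that level: the lookup then falls through to the next level.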

9.7.1 Configuring Options Globally

To configure RBD options globally, select Cluster › Configuration from the main menu.

To list all available global configuration options, click Level › Advanced. Then type 'rbd_qos' in the search field to filter the table. This lists all available configuration options for QoS. To change a value, click its row in the table, then select Edit at the top left of the table. The Edit dialog contains six different fields for specifying values. Enter the RBD configuration option values in the mgr text box. Note that unlike the other dialogs, this one does not allow you to specify the value in convenient units. You need to set these values in either bytes or IOPS, depending on the option you are editing.
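
Because the global dialog expects raw byte values, a small helper can convert a human-readable size before you type it in. The following Python sketch assumes binary multiples (1 K = 1024 bytes), which is an assumption and not something this guide states. The function name to_bytes is hypothetical.

```python
import re

# Assumption: binary multiples. Adjust the table if your site uses decimal units.
_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_bytes(value: str) -> int:
    """Convert a string such as '512', '1M', or '5G' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"cannot parse size: {value!r}")
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix])


for text in ("512", "1M", "5G"):
    print(text, "->", to_bytes(text))
```

The printed number is what you would enter for a bytes-per-second option. IOPS options take a plain operation count and need no conversion.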

9.7.2 Configuring Options on a New Pool

To create a new pool and configure RBD configuration options on it, click Pools › Create. Select replicated as pool type. You will then need to add the rbd application tag to the pool to be able to configure the RBD QoS options.

Note

It is not possible to configure RBD QoS configuration options on an erasure coded pool. To configure the RBD QoS options for erasure coded pools, you need to edit the replicated metadata pool of an RBD image. The configuration will then be applied to the erasure coded data pool of that image.

9.7.3 Configuring Options on an Existing Pool

To configure RBD QoS options on an existing pool, click Pools, then click the pool's table row and select Edit at the top left of the table.

You should see the RBD Configuration section in the dialog, followed by a Quality of Service section.

Note

If you see neither the RBD Configuration nor the Quality of Service section, you are likely either editing an erasure coded pool, which cannot be used to set RBD configuration options, or the pool is not configured to be used by RBD images. In the latter case, assign the rbd application tag to the pool and the corresponding configuration sections will show up.

9.7.4 Configuration Options

Click Quality of Service + to expand the configuration options. A list of all available options will show up. The units of the configuration options are already shown in the text boxes. For any bytes per second (BPS) option, you can use shortcuts such as '1M' or '5G'. They will be automatically converted to '1 MB/s' and '5 GB/s' respectively.

By clicking the reset button to the right of each text box, any value set on the pool will be removed. This does not remove configuration values of options configured globally or on an RBD image.

9.7.5 Creating RBD QoS Options with a New RBD Image

To create an RBD image with RBD QoS options set on that image, select Block › Images and then click Create. Click Advanced to expand the advanced configuration section. Click Quality of Service to open all available configuration options.

9.7.6 Editing RBD QoS Options on Existing Images

To edit RBD QoS options on an existing image, select Block › Images, then click the image's table row, and lastly click Edit. The edit dialog will show up. Click Advanced to expand the advanced configuration section. Click Quality of Service to open all available configuration options.

9.7.7 Changing Configuration Options When Copying or Cloning Images

If an RBD image is cloned or copied, the values set on that particular image will be copied too, by default. If you want to change them while copying or cloning, you can do so by specifying the updated configuration values in the copy/clone dialog, the same way as when creating or editing an RBD image. Doing so will only set (or reset) the values for the RBD image that is copied or cloned. This operation changes neither the source RBD image configuration, nor the global configuration.

If you choose to reset the option value on copying/cloning, no value for that option will be set on that image. This means that any value of that option specified for the parent pool will be used if the parent pool has the value configured. Otherwise, the global default will be used.

9.8 RBD Mirroring

Tip: General Information

For general information and the command line approach to RADOS Block Device mirroring, refer to Section 23.4, “Mirroring”.

You can use the Ceph Dashboard to configure replication of RBD images between two or more clusters.

9.8.1 Primary Cluster and Secondary Cluster(s)

The primary cluster is where the original pool with images is created. Secondary clusters receive the replicated pool and images from the primary cluster.

Note: Relative Naming

The primary and secondary terms can be relative in the context of replication because they relate more to individual pools than to clusters. For example, in two-way replication, one pool can be mirrored from the primary cluster to the secondary one, while another pool can be mirrored from the secondary cluster to the primary one.

9.8.2 Replication Modes

There are two modes of data replication:

  • Using the pool mode, you replicate all the RBD images in a pool.

  • Using the image mode, you can activate mirroring only for specific image(s) in a pool.

9.8.3 Configure the rbd-mirror Daemon

The rbd-mirror daemon performs the actual cluster data replication. To install, configure, and run it, follow these steps:

  1. The rbd-mirror daemon needs to run on one of the nodes on the secondary cluster other than the Admin Node. Because it is not installed by default, install it:

    root@minion > zypper install rbd-mirror

  2. On the primary cluster, create a unique Ceph user ID for the rbd-mirror daemon process. In this example, we will use 'uid1' as the user ID:

    cephadm@adm > ceph auth get-or-create client.rbd-mirror.uid1 \
     mon 'profile rbd-mirror' osd 'profile rbd'
    [client.rbd-mirror.uid1]
    	key = AQBbDJddZKLBIxAAdsmSCCjXoKwzGkGmCpUQ9g==

  3. On the node where you previously installed the rbd-mirror package on the secondary cluster, create the same Ceph user and save the output to a keyring:

    root@minion > ceph auth get-or-create client.rbd-mirror.uid1 \
     mon 'profile rbd-mirror' osd 'profile rbd' \
     > /etc/ceph/ceph.client.rbd-mirror.uid1.keyring

  4. On the same node, enable and run the rbd-mirror service:

    root@minion > systemctl enable ceph-rbd-mirror@rbd-mirror.uid1
    root@minion > systemctl start ceph-rbd-mirror@rbd-mirror.uid1
    root@minion > systemctl status ceph-rbd-mirror@rbd-mirror.uid1
    ● ceph-rbd-mirror@rbd-mirror.uid1.service - Ceph rbd mirror daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: disabled)
       Active: active (running) since Fri 2019-10-04 07:48:53 EDT; 2 days ago
     Main PID: 212434 (rbd-mirror)
        Tasks: 47
       CGroup: /system.slice/system-ceph\x2drbd\x2dmirror.slice/ceph-rbd-mirror@rbd-mirror.uid1.service
               └─212434 /usr/bin/rbd-mirror -f --cluster ceph --id rbd-mirror.uid1 --setuser ceph --setgroup ceph
    
    Oct 04 07:48:53 doc-ses6min4 systemd[1]: Started Ceph rbd mirror daemon.

  5. On the secondary cluster's Ceph Dashboard, navigate to Block › Mirroring. The Daemons table to the left shows actively running rbd-mirror daemons and their health.

    Figure 9.9: Running rbd-mirror Daemons
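
The steps above use three names that all derive from the same user ID: the Ceph user, the keyring file, and the systemd unit. The following Python sketch derives them from one value so that they stay consistent. The helper name rbd_mirror_names is hypothetical, and the patterns are taken from the 'uid1' example above only.

```python
from typing import Dict


def rbd_mirror_names(uid: str) -> Dict[str, str]:
    """Derive the names used in the procedure above from a single user ID."""
    user = f"rbd-mirror.{uid}"
    return {
        # Entity passed to `ceph auth get-or-create`.
        "ceph_user": f"client.{user}",
        # File that receives the output of `ceph auth get-or-create`.
        "keyring": f"/etc/ceph/ceph.client.{user}.keyring",
        # Unit name passed to `systemctl enable` and `systemctl start`.
        "systemd_unit": f"ceph-rbd-mirror@{user}",
    }


for name, value in rbd_mirror_names("uid1").items():
    print(f"{name}: {value}")
```

Generating the names in one place helps avoid a mismatch between the keyring file name and the ID that the service is started with.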

9.8.4 Configure Pool Replication in Ceph Dashboard

The rbd-mirror daemon needs to have access to the primary cluster to be able to mirror RBD images. Therefore you need to create a peer Ceph user account on the primary cluster and let the secondary cluster know about its keyring:

  1. On the primary cluster, create a new 'client.rbd-mirror-peer' user that will be used for data replication:

    cephadm@adm > ceph auth get-or-create client.rbd-mirror-peer \
     mon 'profile rbd' osd 'profile rbd'
    [client.rbd-mirror-peer]
    	key = AQBbDJddZKLBIxAAdsmSCCjXoKwzGkGmCpUQ9g==

  2. On both the primary and secondary cluster, create a pool with an identical name and assign the 'rbd' application to it. Refer to Section 8.1, “Adding a New Pool” for more details on creating a new pool.

    Figure 9.10: Creating a Pool with RBD Application

  3. On both the primary and secondary cluster's dashboards, navigate to Block › Mirroring. In the Pools table on the right, click the name of the pool to replicate, and after clicking Edit Mode, select the replication mode. In this example, we will work with the pool replication mode, which means that all images within a given pool will be replicated. Confirm with Update.

    Figure 9.11: Configuring the Replication Mode

    Important: Error or Warning on the Primary Cluster

    After updating the replication mode, an error or warning flag will appear in the corresponding right column. That is because the pool has no peer user for replication assigned yet. Ignore this flag for the primary cluster as we assign a peer user to the secondary cluster only.

  4. On the secondary cluster's Dashboard, navigate to Block › Mirroring. Register the 'client.rbd-mirror-peer' user keyring to the mirrored pool by clicking the pool's name and selecting Add Peer. Provide the primary cluster's details:

    Figure 9.12: Adding Peer Credentials

    Cluster Name

    An arbitrary unique string that identifies the primary cluster, such as 'primary'. The cluster name needs to be different from the real secondary cluster's name.

    CephX ID

    The Ceph user ID that you created as a mirroring peer. In this example it is 'rbd-mirror-peer'.

    Monitor Addresses

    Comma separated list of IP addresses/host names of the primary cluster's Ceph Monitor nodes.

    CephX Key

    The key related to the peer user ID. You can retrieve it by running the following example command on the primary cluster:

    cephadm@adm > ceph auth print_key client.rbd-mirror-peer

    Confirm with Submit.

    Figure 9.13: List of Replicated Pools

9.8.5 Verify That RBD Image Replication Works

When the rbd-mirror daemon is running and RBD image replication is configured on the Ceph Dashboard, it is time to verify whether the replication actually works:

  1. On the primary cluster's Ceph Dashboard, create an RBD image so that its parent pool is the pool that you already created for replication purposes. Enable the Exclusive lock and Journaling features for the image. Refer to Section 9.3, “Creating RBDs” for details on how to create RBD images.

    Figure 9.14: New RBD Image
  2. After you create the image that you want to replicate, open the secondary cluster's Ceph Dashboard and navigate to Block › Mirroring. The Pools table on the right will reflect the change in the number of # Remote images and synchronize the number of # Local images.

    Figure 9.15: New RBD Image Synchronized

    Tip: Replication Progress

    The Images table at the bottom of the page shows the status of replication of RBD images. The Issues tab includes possible problems, the Syncing tab displays the progress of image replication, and the Ready tab lists all images with successful replication.

    Figure 9.16: RBD Images' Replication Status
  3. On the primary cluster, write data to the RBD image. On the secondary cluster's Ceph Dashboard, navigate to Block › Images and monitor whether the corresponding image's size is growing as the data on the primary cluster is written.
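
The same check can also be made from the command line. The following is a minimal sketch, not part of the procedure above: the helper name check_mirror_pool is hypothetical, and it assumes the rbd tool is installed on the node where you run it and can reach the secondary cluster. It relies on the rbd mirror pool info and rbd mirror pool status subcommands.

```shell
# Hypothetical helper: print mirroring details for one pool.
# The pool name is supplied by the caller (the pool created for replication).
check_mirror_pool() {
    pool="${1:?usage: check_mirror_pool POOL}"
    # Shows the mirroring mode and the registered peers.
    rbd mirror pool info "$pool"
    # Shows the overall health and the state of each mirrored image.
    rbd mirror pool status "$pool" --verbose
}
```

Call it as check_mirror_pool POOL_NAME on the secondary cluster after creating the image on the primary cluster.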

10 Managing NFS Ganesha

Tip: More Information on NFS Ganesha

For more general information about NFS Ganesha, refer to Chapter 30, NFS Ganesha: Export Ceph Data via NFS.

To list all available NFS exports, click NFS from the main menu.

The list shows each export's directory, daemon host name, type of storage back-end, and access type.

Figure 10.1: List of NFS Exports

To view more detailed information about an NFS export, click its row in the table.

Figure 10.2: NFS Export Details

10.1 Adding NFS Exports

To add a new NFS export, click Add in the top left of the exports table and enter the required information.

Note

The following example uses admin for the Object Gateway User. For more information on the permissions for other users, see Section 19.2.1.2, “Authorization and Capabilities”.

Figure 10.3: Adding a New NFS Export
  1. Select one or more NFS Ganesha daemons that will run the export.

  2. Select a storage back-end—either CephFS or Object Gateway.

  3. Select a user ID and other back-end related options.

  4. Enter the directory path for the NFS export. If the directory does not exist on the server, it will be created.

  5. Specify other NFS related options, such as supported NFS protocol version, pseudo, access type, squashing, or transport protocol.

  6. If you need to limit access to specific clients only, click Add clients and add their IP addresses together with access type and squashing options.

  7. Confirm with Submit.

10.2 Deleting NFS Exports

To delete an export, click its table row and select Delete in the top left of the exports table.

10.3 Editing NFS Exports

To edit an existing export, click its table row and click Edit in the top left of the exports table.

You can then adjust all the details of the NFS export.

Figure 10.4: Editing an NFS Export

11 Managing Ceph File Systems

Tip: For More Information

To find detailed information about CephFS, refer to Chapter 28, Clustered File System.

11.1 Viewing CephFS Overview

Click Filesystems from the main menu to view the overview of configured file systems. The main table shows each file system's name, date of creation, and whether it is enabled or not.

By clicking on a file system's table row, you reveal details about its rank and pools added to the file system.

Figure 11.1: CephFS Details

At the bottom of the screen, you can see statistics counting the number of related MDS inodes and client requests, collected in real time.

12 Managing Object Gateways

Tip: More Information on Object Gateway

For more general information about Object Gateway, refer to Chapter 26, Ceph Object Gateway.

12.1 Viewing Object Gateways

To view a list of configured Object Gateways, click Object Gateway › Daemons. The list includes the ID of the gateway, host name of the cluster node where the gateway daemon is running, and the gateway's version number.

Click a gateway's table row to view detailed information about the gateway. The Performance Counters tab shows details about read/write operations and cache statistics.

Figure 12.1: Gateway's Details

12.2 Managing Object Gateway Users

Click Object Gateway › Users to view a list of existing Object Gateway users.

Click a user's table row to view details about the user account, such as status information or the user and bucket quota details.

Figure 12.2: Gateway Users

12.2.1 Adding a New Gateway User

To add a new gateway user, click Add in the top left of the table heading. Fill in their credentials, details about the S3 key and user/bucket quota, then confirm with Add.

Figure 12.3: Adding a New Gateway User

12.2.2 Deleting Gateway Users

To delete a gateway user, click the user's table row and select Delete in the top left of the table heading. Activate the Yes, I am sure check box and confirm with Delete User.

12.2.3 Editing Gateway User Details

To change gateway user details, click the user's table row and select Edit in the top left of the table heading.

Modify basic or additional user information, such as their capabilities, keys, sub-users, and quota information. Confirm with Update.

The Keys tab includes a read-only list of the gateway's users and their access and secret keys. To view the keys, click a user name in the list and then select Show in the top left of the table heading. In the S3 Key dialog, click the 'eye' icon to unveil the keys, or click the clipboard icon to copy the related key to the clipboard.

12.3 Managing the Object Gateway Buckets

Object Gateway (OGW) buckets implement the functionality of OpenStack Swift containers. Object Gateway buckets serve as containers for storing data objects.

Click Object Gateway › Buckets to view a list of OGW buckets.

12.3.1 Adding a New Bucket

To add a new OGW bucket, click Add in the top left of the table heading. Enter the bucket's name and select its owner. Confirm with Add.

12.3.2 Viewing Bucket Details

To view detailed information about an OGW bucket, click its table row.

Figure 12.4: Gateway Bucket Details

Tip: Bucket Quota

Below the Details table, you can find details about the bucket quota settings.

12.3.3 Updating the Owner of a Bucket

Click a bucket table row, then select Edit in the top left of the table heading.

Update the owner of the bucket and confirm with Update.

12.3.4 Deleting a Bucket

To delete an OGW bucket, click its table row and select Delete in the top left of the table heading. Activate the Yes, I am sure check box, and confirm with Delete bucket.

13 Manual Configuration

This section provides advanced information for users who prefer to configure the dashboard's settings manually on the command line.

13.1 TLS/SSL Support

All HTTP connections to the dashboard are secured with SSL/TLS by default. A secure connection requires an SSL certificate. You can either use a self-signed certificate, or generate a certificate and have a well known certificate authority (CA) sign it.

Tip: Disabling SSL

You may want to disable SSL support for a specific reason, for example, if the dashboard is running behind a proxy that does not support SSL.

Use caution when disabling SSL as user names and passwords will be sent to the dashboard unencrypted.

To disable SSL, run:

cephadm@adm > ceph config set mgr mgr/dashboard/ssl false
Tip: Restart Ceph Manager Processes

You need to restart the Ceph Manager processes manually after changing the SSL certificate and key. You can do so either by running

cephadm@adm > ceph mgr fail ACTIVE-MANAGER-NAME

or by disabling and re-enabling the dashboard module, which also triggers the manager to respawn itself:

cephadm@adm > ceph mgr module disable dashboard
cephadm@adm > ceph mgr module enable dashboard

13.1.1 Self-signed Certificates

Creating a self-signed certificate for secure communication is simple. This way you can get the dashboard running quickly.

Note: Web Browsers Complain

Most Web browsers will complain about a self-signed certificate and require explicit confirmation before establishing a secure connection to the dashboard.

To generate and install a self-signed certificate, use the following built-in command:

cephadm@adm > ceph dashboard create-self-signed-cert

13.1.2 Self-signed or Trusted Third-party Certificate with OpenSSL

OpenSSL is an open-source command-line tool that is commonly used to generate private keys, create Certificate Signing Requests (CSR), install your SSL/TLS certificate, and identify certificate information. The following instructions illustrate how to generate a self-signed or trusted third-party certificate using OpenSSL:

  1. Generate a Private Key:

    cephadm@adm > openssl genrsa -des3 -out server.key 2048

    Type the passphrase to protect the key.

  2. Generate a CSR:

    cephadm@adm > openssl req -new -key server.key -out server.csr

    Enter the passphrase, and fill in the Country Name, State or Province Name, Locality Name, Organization Name, Organizational Unit Name, Common Name, and Email Address.

    Note

    The Common Name should be the FQDN of the server, for example, server.mydomain.com.

    When asked for a challenge password and an optional company name, leave both blank.

  3. To sign the certificate, select from the following options:

    • Trusted Third-party Certificate Authority.  Send the CSR to the third party for signing. You should receive the following files in return: the server certificate (public key), the intermediate CA certificate, and the bundles that chain to the trusted root CA.

    • Self-signed.  Sign the certificate with OpenSSL:

      cephadm@adm > openssl x509 -req -days 730 -in server.csr -signkey server.key -out server.crt

      Increase or decrease the value 730 as needed. This is the number of days for which the certificate is valid.

  4. (Optional) If needed, generate a new private key and a self-signed certificate, both in PEM format, in a single step:

    cephadm@adm > openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -keyout key.pem -out cert.pem
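
Steps 1 to 3 of the self-signed variant can be run without prompts. The following is a minimal sketch under stated assumptions: the helper name make_self_signed is hypothetical, the key is created without the -des3 option so that no passphrase prompt appears (which leaves the key unencrypted on disk), and the subject contains only the Common Name.

```shell
# Hypothetical helper: create an unencrypted key, a CSR, and a self-signed
# certificate. The output directory and Common Name are caller-supplied.
make_self_signed() {
    out_dir="${1:?output directory required}"
    common_name="${2:?common name (FQDN) required}"
    days="${3:-730}"
    # Step 1: private key. No -des3 here, so no passphrase prompt appears.
    openssl genrsa -out "$out_dir/server.key" 2048 || return 1
    # Step 2: CSR with the subject given up front instead of interactively.
    openssl req -new -key "$out_dir/server.key" \
        -subj "/CN=$common_name" -out "$out_dir/server.csr" || return 1
    # Step 3: self-sign the CSR with the same key.
    openssl x509 -req -days "$days" -in "$out_dir/server.csr" \
        -signkey "$out_dir/server.key" -out "$out_dir/server.crt"
}
```

For example, make_self_signed /tmp/certs server.mydomain.com writes server.key, server.csr, and server.crt to an existing /tmp/certs directory.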

13.1.3 Certificates Signed by CA

To properly secure the connection to the dashboard and to eliminate Web browser complaints about a self-signed certificate, we recommend using a certificate that is signed by a CA.

You can generate a certificate key pair with a command similar to the following:

root # openssl req -new -nodes -x509 \
  -subj "/O=IT/CN=ceph-mgr-dashboard" -days 3650 \
  -keyout dashboard.key -out dashboard.crt -extensions v3_ca

The above command outputs dashboard.key and dashboard.crt files. After you get the dashboard.crt file signed by a CA, enable it for all Ceph Manager instances by running the following commands:

cephadm@adm > ceph config-key set mgr/dashboard/crt -i dashboard.crt
cephadm@adm > ceph config-key set mgr/dashboard/key -i dashboard.key
Tip: Different Certificates for Each Manager Instance

If you require different certificates for each Ceph Manager instance, modify the commands and include the name of the instance as follows. Replace NAME with the name of the Ceph Manager instance (usually the related host name):

cephadm@adm > ceph config-key set mgr/dashboard/NAME/crt -i dashboard.crt
cephadm@adm > ceph config-key set mgr/dashboard/NAME/key -i dashboard.key
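
The two variants differ only in the configuration key path. A minimal sketch that covers both, assuming a hypothetical helper named install_dashboard_cert and the ceph config-key set commands shown above:

```shell
# Hypothetical helper: store a certificate and key for the dashboard.
# Without a third argument the pair applies to all Ceph Manager instances;
# with one, it applies only to the named instance.
install_dashboard_cert() {
    crt="${1:?certificate file required}"
    key="${2:?key file required}"
    name="${3:-}"
    # With an instance name the path becomes mgr/dashboard/NAME/...
    prefix="mgr/dashboard${name:+/$name}"
    ceph config-key set "$prefix/crt" -i "$crt" || return 1
    ceph config-key set "$prefix/key" -i "$key"
}
```

Remember to restart the Ceph Manager processes afterward, as described in Section 13.1.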

13.1.4 Certificates Signed with a Custom CA

The following procedure needs to be followed once to create the root CA.

Note

This is the key used to sign the certificate requests. Anyone holding this can sign certificates on your behalf.

  1. Create the Root Key:

    cephadm@adm > openssl genrsa -des3 -out rootCA.key 4096
    Note

    If you want a non-password protected key, remove the -des3 option.

  2. Create and self-sign the root certificate:

    cephadm@adm > openssl req -x509 -new -nodes -key rootCA.key -sha256 -days 1024 -out rootCA.crt

The following procedure needs to be followed for each server that needs a trusted certificate from our CA.

  1. Create the certificate key:

    cephadm@adm > openssl genrsa -out mydomain.com.key 2048

    The certificate signing request is where you specify the details for the certificate you want to generate. This request is processed by the owner of the Root Key to generate the certificate.

  2. There are two ways to create the CSR:

    Important

    When creating the certificate signing request, it is important to specify the Common Name providing the IP address or domain name for the service, otherwise the certificate cannot be verified.

    • Interactive method. For example:

      cephadm@adm > openssl req -new -key mydomain.com.key -out mydomain.com.csr

      You will then be prompted for information, for example, the Country Name, Organization Name, and Email Address.

    • One-liner method. This is where instead of being interactively prompted, you include the information up front. For example:

      cephadm@adm > openssl req -new -sha256 -key mydomain.com.key -subj "/C=US/ST=CA/O=MyOrg, Inc./CN=mydomain.com" -out mydomain.com.csr

      If you need to pass additional configuration in the one-liner method, you can use the -config parameter. For example:

      cephadm@adm > openssl req -new -sha256 \
            -key mydomain.com.key \
            -subj "/C=US/ST=CA/O=MyOrg, Inc./CN=mydomain.com" \
            -reqexts SAN \
            -config <(cat /etc/ssl/openssl.cnf \
                <(printf "\n[SAN]\nsubjectAltName=DNS:mydomain.com,DNS:www.mydomain.com")) \
            -out mydomain.com.csr
  3. Verify the CSR content:

    cephadm@adm > openssl req -in mydomain.com.csr -noout -text
  4. Generate the certificate using the mydomain CSR and key along with the CA Root Key:

    cephadm@adm > openssl x509 -req -in mydomain.com.csr -CA rootCA.crt -CAkey rootCA.key -CAcreateserial -out mydomain.com.crt -days 500 -sha256
  5. Verify the certificate's content:

    cephadm@adm > openssl x509 -in mydomain.com.crt -text -noout
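
Beyond inspecting the certificate text, you may want to confirm that the certificate chains to the root CA and belongs to the server key. The following is a minimal sketch, assuming RSA keys as created above; the helper name verify_signed_cert is hypothetical, and the modulus comparison is a common OpenSSL idiom rather than a step of this procedure.

```shell
# Hypothetical helper: succeed only if the certificate was signed by the
# given CA and matches the given private key (RSA keys only).
verify_signed_cert() {
    ca="${1:?CA certificate required}"
    crt="${2:?server certificate required}"
    key="${3:?server key required}"
    # Check that the certificate chains to the CA.
    openssl verify -CAfile "$ca" "$crt" || return 1
    # Compare the public modulus of the certificate and of the key.
    crt_sum="$(openssl x509 -noout -modulus -in "$crt" | openssl md5)"
    key_sum="$(openssl rsa -noout -modulus -in "$key" | openssl md5)"
    [ "$crt_sum" = "$key_sum" ]
}
```

With the file names used in this section, the call would be verify_signed_cert rootCA.crt mydomain.com.crt mydomain.com.key.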

13.2 Host Name and Port Number

The Ceph Dashboard Web application binds to a specific TCP/IP address and TCP port. By default, the currently active Ceph Manager that hosts the dashboard binds to TCP port 8443 (or 8080 when SSL is disabled).

The dashboard Web application binds to "::" by default, which corresponds to all available IPv4 and IPv6 addresses. You can change the IP address and port number of the Web application so that they apply to all Ceph Manager instances by using the following commands:

cephadm@adm > ceph config set mgr mgr/dashboard/server_addr IP_ADDRESS
cephadm@adm > ceph config set mgr mgr/dashboard/server_port PORT_NUMBER
Tip: Configure Ceph Manager Instances Separately

Since each ceph-mgr daemon hosts its own instance of the dashboard, you may need to configure them separately. Change the IP address and port number for a specific manager instance by using the following commands (replace NAME with the ID of the ceph-mgr instance):

cephadm@adm > ceph config set mgr mgr/dashboard/NAME/server_addr IP_ADDRESS
cephadm@adm > ceph config set mgr mgr/dashboard/NAME/server_port PORT_NUMBER
Tip: List Configured Endpoints

The ceph mgr services command displays all endpoints that are currently configured. Look for the 'dashboard' key to obtain the URL for accessing the dashboard.
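
In scripts, the dashboard URL can be extracted from that output. The following is a minimal sketch, assuming that ceph mgr services prints a JSON object with one key per module; the helper name dashboard_url_from_services is hypothetical, and the sed expression is a simple text match, not a full JSON parser.

```shell
# Read the output of "ceph mgr services" on standard input and
# print the value of its "dashboard" key.
dashboard_url_from_services() {
    sed -n 's/.*"dashboard": *"\([^"]*\)".*/\1/p'
}
```

Use it as ceph mgr services | dashboard_url_from_services.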

13.3 User Name and Password

If you do not want to use the default administrator account, create a different user account and associate it with at least one role. We provide a set of predefined system roles that you can use. For more details refer to Chapter 14, Managing Users and Roles on the Command Line.

To create a user with administrator privileges, use the following command:

cephadm@adm > ceph dashboard ac-user-create USER_NAME PASSWORD administrator

13.4 Enabling the Object Gateway Management Front-end

To use the Object Gateway management functionality of the dashboard, you need to provide the login credentials of a user with the 'system' flag enabled:

  1. If you do not have a user with the 'system' flag, create one:

    cephadm@adm > radosgw-admin user create --uid=USER_ID --display-name=DISPLAY_NAME --system

    Take note of the 'access_key' and 'secret_key' keys in the output of the command.

  2. You can also obtain the credentials of an existing user by using the radosgw-admin command:

    cephadm@adm > radosgw-admin user info --uid=USER_ID
  3. Provide the received credentials to the dashboard:

    cephadm@adm > ceph dashboard set-rgw-api-access-key ACCESS_KEY
    cephadm@adm > ceph dashboard set-rgw-api-secret-key SECRET_KEY

There are several points to consider:

  • The host name and port number of the Object Gateway are determined automatically.

  • If multiple zones are used, the dashboard automatically determines the host within the master zonegroup and master zone. This is sufficient for most setups, but in some circumstances you may want to set the host name and port manually:

    cephadm@adm > ceph dashboard set-rgw-api-host HOST
    cephadm@adm > ceph dashboard set-rgw-api-port PORT
  • These are additional settings that you may need:

    cephadm@adm > ceph dashboard set-rgw-api-scheme SCHEME  # http or https
    cephadm@adm > ceph dashboard set-rgw-api-admin-resource ADMIN_RESOURCE
    cephadm@adm > ceph dashboard set-rgw-api-user-id USER_ID
  • If you are using a self-signed certificate (Section 13.1, “TLS/SSL Support”) in your Object Gateway setup, disable certificate verification in the dashboard to avoid refused connections caused by certificates signed by an unknown CA or not matching the host name:

    cephadm@adm > ceph dashboard set-rgw-api-ssl-verify False
  • If the Object Gateway takes too long to process requests and the dashboard runs into timeouts, the timeout value can be adjusted (default is 45 seconds):

    cephadm@adm > ceph dashboard set-rest-requests-timeout SECONDS
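
Steps 2 and 3 of the procedure above can be chained so that the keys do not have to be copied by hand. The following is a minimal sketch under stated assumptions: the helper names json_field and set_rgw_keys_from_json are hypothetical, the radosgw-admin output is assumed to be pretty-printed JSON with one field per line, only the first key pair of the user is used, and escaped slashes in the JSON are unescaped with a plain text substitution.

```shell
# Print the first value of a JSON string field read from standard input.
# Simple text match, not a JSON parser; "\/" is turned back into "/".
json_field() {
    sed -n 's/.*"'"$1"'": *"\([^"]*\)".*/\1/p' | head -n 1 | sed 's|\\/|/|g'
}

# Read "radosgw-admin user info" output on standard input and hand the
# first access/secret key pair to the dashboard.
set_rgw_keys_from_json() {
    info="$(cat)"
    access_key="$(printf '%s\n' "$info" | json_field access_key)"
    secret_key="$(printf '%s\n' "$info" | json_field secret_key)"
    # Stop if either key could not be found in the input.
    [ -n "$access_key" ] && [ -n "$secret_key" ] || return 1
    ceph dashboard set-rgw-api-access-key "$access_key" || return 1
    ceph dashboard set-rgw-api-secret-key "$secret_key"
}
```

Use it as radosgw-admin user info --uid=USER_ID | set_rgw_keys_from_json.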

13.5 Enable Single Sign-On

Single Sign-On (SSO) is an access control method that enables users to log in with a single ID and password to multiple applications simultaneously.

The Ceph Dashboard supports external authentication of users via the SAML 2.0 protocol. Because authorization is still performed by the dashboard, you first need to create user accounts and associate them with the desired roles. However, the authentication process can be performed by an existing Identity Provider (IdP).

To configure Single Sign-On, use the following command:

cephadm@adm > ceph dashboard sso setup saml2 CEPH_DASHBOARD_BASE_URL \
 IDP_METADATA IDP_USERNAME_ATTRIBUTE \
 IDP_ENTITY_ID SP_X_509_CERT \
 SP_PRIVATE_KEY

Parameters:

CEPH_DASHBOARD_BASE_URL

Base URL where Ceph Dashboard is accessible (for example, 'https://cephdashboard.local').

IDP_METADATA

URL, file path, or content of the IdP metadata XML (for example, 'https://myidp/metadata').

IDP_USERNAME_ATTRIBUTE

Optional. Attribute that will be used to get the user name from the authentication response. Defaults to 'uid'.

IDP_ENTITY_ID

Optional. Use when more than one entity ID exists on the IdP metadata.

SP_X_509_CERT / SP_PRIVATE_KEY

Optional. File path or content of the certificate and private key that will be used by Ceph Dashboard (Service Provider) for signing and encryption.

Note: SAML Requests

The issuer value of SAML requests will follow this pattern:

CEPH_DASHBOARD_BASE_URL/auth/saml2/metadata

To display the current SAML 2.0 configuration, run:

cephadm@adm > ceph dashboard sso show saml2

To disable Single Sign-On, run:

cephadm@adm > ceph dashboard sso disable

To check if SSO is enabled, run:

cephadm@adm > ceph dashboard sso status

To enable SSO, run:

cephadm@adm > ceph dashboard sso enable saml2
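
As an illustration, the setup and the status check can be combined. The following is a minimal sketch, assuming a hypothetical helper named setup_dashboard_sso that passes only the first three parameters and falls back to the documented default 'uid' for the user name attribute:

```shell
# Hypothetical helper: configure SAML 2.0 SSO, then print the SSO status.
setup_dashboard_sso() {
    base_url="${1:?dashboard base URL required}"
    idp_metadata="${2:?IdP metadata URL, path, or content required}"
    username_attr="${3:-uid}"
    ceph dashboard sso setup saml2 "$base_url" "$idp_metadata" "$username_attr" || return 1
    # Report whether SSO is enabled.
    ceph dashboard sso status
}
```

With the example values from the parameter list, the call would be setup_dashboard_sso https://cephdashboard.local https://myidp/metadata.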

14 Managing Users and Roles on the Command Line

This section describes how to manage user accounts used by the Ceph Dashboard. It helps you create or modify user accounts, as well as set proper user roles and permissions.

14.1 User Accounts

The Ceph Dashboard supports managing multiple user accounts. Each user account consists of a user name, a password (stored as a bcrypt hash), an optional name, and an optional e-mail address.

User accounts are stored in Ceph Monitor’s configuration database and are globally shared across all Ceph Manager instances.

Use the following commands to manage user accounts:

Show existing users:
cephadm@adm > ceph dashboard ac-user-show [USERNAME]
Create a new user:
cephadm@adm > ceph dashboard ac-user-create USERNAME [PASSWORD] [ROLENAME] [NAME] [EMAIL]
Delete a user:
cephadm@adm > ceph dashboard ac-user-delete USERNAME
Change a user's password:
cephadm@adm > ceph dashboard ac-user-set-password USERNAME PASSWORD
Modify a user's name and email:
cephadm@adm > ceph dashboard ac-user-set-info USERNAME NAME EMAIL

14.2 User Roles and Permissions

This section describes which security scopes you can assign to a user role, how to manage user roles, and how to assign them to user accounts.

14.2.1 Security Scopes

User accounts are associated with a set of roles that define which parts of the dashboard can be accessed by the user. The dashboard parts are grouped within a security scope. Security scopes are predefined and static. The following security scopes are currently available:

hosts

Includes all features related to the Hosts menu entry.

config-opt

Includes all features related to the management of Ceph configuration options.

pool

Includes all features related to pool management.

osd

Includes all features related to the Ceph OSD management.

monitor

Includes all features related to the Ceph Monitor management.

rbd-image

Includes all features related to the RADOS Block Device image management.

rbd-mirroring

Includes all features related to the RADOS Block Device mirroring management.

iscsi

Includes all features related to iSCSI management.

rgw

Includes all features related to the Object Gateway management.

cephfs

Includes all features related to CephFS management.

manager

Includes all features related to the Ceph Manager management.

log

Includes all features related to Ceph logs management.

grafana

Includes all features related to the Grafana proxy.

dashboard-settings

Allows changing dashboard settings.

14.2.2 User Roles

A role specifies a set of mappings between a security scope and a set of permissions. There are four types of permissions: 'read', 'create', 'update', and 'delete'.

The following example specifies a role where a user has 'read' and 'create' permissions for features related to pool management, and has full permissions for features related to RBD image management:

{
  'role': 'my_new_role',
  'description': 'My new role',
  'scopes_permissions': {
    'pool': ['read', 'create'],
    'rbd-image': ['read', 'create', 'update', 'delete']
  }
}

The dashboard already provides a set of predefined roles that we call system roles. You can instantly use them after a fresh Ceph Dashboard installation:

administrator

Provides full permissions for all security scopes.

read-only

Provides read permission for all security scopes except the dashboard settings.

block-manager

Provides full permissions for 'rbd-image', 'rbd-mirroring', and 'iscsi' scopes.

rgw-manager

Provides full permissions for the 'rgw' scope.

cluster-manager

Provides full permissions for the 'hosts', 'osd', 'monitor', 'manager', and 'config-opt' scopes.

pool-manager

Provides full permissions for the 'pool' scope.

cephfs-manager

Provides full permissions for the 'cephfs' scope.

14.2.2.1 Managing Custom Roles

You can create new user roles by using the following commands:

Create a new role:
cephadm@adm > ceph dashboard ac-role-create ROLENAME [DESCRIPTION]
Delete a role:
cephadm@adm > ceph dashboard ac-role-delete ROLENAME
Add scope permissions to a role:
cephadm@adm > ceph dashboard ac-role-add-scope-perms ROLENAME SCOPENAME PERMISSION [PERMISSION...]
Delete scope permissions from a role:
cephadm@adm > ceph dashboard ac-role-del-scope-perms ROLENAME SCOPENAME
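
As an illustration, the 'my_new_role' example from Section 14.2.2 can be expressed with these commands. This is a minimal sketch; the wrapper function create_my_new_role is hypothetical and only groups the calls:

```shell
# Hypothetical wrapper: create the example role with 'read' and 'create'
# on the pool scope and full permissions on the rbd-image scope.
create_my_new_role() {
    ceph dashboard ac-role-create my_new_role 'My new role' || return 1
    ceph dashboard ac-role-add-scope-perms my_new_role pool read create || return 1
    ceph dashboard ac-role-add-scope-perms my_new_role rbd-image read create update delete
}
```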

14.2.2.2 Assigning Roles to User Accounts

Use the following commands to assign roles to users:

Set user roles:
cephadm@adm > ceph dashboard ac-user-set-roles USERNAME ROLENAME [ROLENAME ...]
Add additional roles to a user:
cephadm@adm > ceph dashboard ac-user-add-roles USERNAME ROLENAME [ROLENAME ...]
Delete roles from a user:
cephadm@adm > ceph dashboard ac-user-del-roles USERNAME ROLENAME [ROLENAME ...]
Tip: Purging Custom Roles

If you create custom user roles and intend to remove the Ceph cluster with the ceph.purge runner later on, you need to purge the custom roles first. Find more details in Section 2.17, “Removing an Entire Ceph Cluster”.

14.2.2.3 Example: Creating a User and a Custom Role

This section illustrates a procedure for creating a user account capable of managing RBD images, viewing and creating Ceph pools, and having read-only access to any other scopes.

  1. Create a new user named 'tux':

    cephadm@adm > ceph dashboard ac-user-create tux PASSWORD
  2. Create a role and specify scope permissions:

    cephadm@adm > ceph dashboard ac-role-create rbd/pool-manager
    cephadm@adm > ceph dashboard ac-role-add-scope-perms rbd/pool-manager \
     rbd-image read create update delete
    cephadm@adm > ceph dashboard ac-role-add-scope-perms rbd/pool-manager pool read create
  3. Associate the roles with the 'tux' user:

    cephadm@adm > ceph dashboard ac-user-set-roles tux rbd/pool-manager read-only

14.3 Reverse Proxies

If you are accessing the dashboard via a reverse proxy configuration, you may need to serve it under a URL prefix. To get the dashboard to use hyperlinks that include your prefix, set the url_prefix setting:

cephadm@adm > ceph config set mgr mgr/dashboard/url_prefix URL_PREFIX

Then you can access the dashboard at http://HOST_NAME:PORT_NUMBER/URL_PREFIX/.

14.4 Auditing

The Ceph Dashboard's REST API can log PUT, POST, and DELETE requests to the Ceph audit log. Logging is disabled by default, but you can enable it with the following command:

cephadm@adm > ceph dashboard set-audit-api-enabled true

If enabled, the following parameters are logged for each request:

from

The origin of the request, for example 'https://[::1]:44410'.

path

The REST API path, for example '/api/auth'.

method

'PUT', 'POST', or 'DELETE'.

user

The name of the user (or ‘None’).

An example log entry looks like this:

2019-02-06 10:33:01.302514 mgr.x [INF] [DASHBOARD] \
 from='https://[::ffff:127.0.0.1]:37022' path='/api/rgw/user/exu' method='PUT' \
 user='admin' params='{"max_buckets": "1000", "display_name": "Example User", "uid": "exu", "suspended": "0", "email": "user@example.com"}'
Tip: Disable Logging of Request Payload

The logging of the request payload (the list of arguments and their values) is enabled by default. You can disable it as follows:

cephadm@adm > ceph dashboard set-audit-api-log-payload false
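
To review audit entries quickly, the logged parameters can be extracted with standard text tools. The following is a minimal sketch, assuming that each entry is written on a single line (the example entry above is wrapped for readability) and that the fields appear in the order shown there; the helper name audit_summary is hypothetical.

```shell
# Read audit log lines on standard input and print "METHOD PATH USER"
# for every dashboard REST API entry.
audit_summary() {
    sed -n "s/.*\[DASHBOARD\].* path='\([^']*\)' method='\([^']*\)' user='\([^']*\)'.*/\2 \1 \3/p"
}
```

Feed it the audit log, for example audit_summary < AUDIT_LOG_FILE, where AUDIT_LOG_FILE is a placeholder for the location of the Ceph audit log on your system.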

Part III Operating a Cluster

15 Introduction

In this part of the manual you will learn how to start or stop Ceph services, monitor a cluster's state, use and modify CRUSH Maps, or manage storage pools.

16 Operating Ceph Services

You can operate Ceph services either using systemd or using DeepSea.

17 Determining Cluster State

When you have a running cluster, you may use the ceph tool to monitor it. Determining the cluster state typically involves checking the status of Ceph OSDs, Ceph Monitors, placement groups, and Metadata Servers.

18 Monitoring and Alerting

In SUSE Enterprise Storage 6, DeepSea no longer deploys a monitoring and alerting stack on the Salt master. Users have to define the Prometheus role for Prometheus and Alertmanager, and the Grafana role for Grafana. When multiple nodes are assigned with the Prometheus or Grafana role, a highly avail…

19 Authentication with cephx

To identify clients and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system. Clients in this context are either human users—such as the admin user—or Ceph-related services/daemons, for example OSDs, monitors, or Object Gateways.

20 Stored Data Management

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a…

21 Ceph Manager Modules

The architecture of the Ceph Manager (refer to Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.2.3 “Ceph Nodes and Daemons” for a brief introduction) allows extending its functionality via modules, such as 'dashboard' (see Part II, “Ceph Dashboard”), 'prometheus' (…

22 Managing Storage Pools

Ceph stores data within pools. Pools are logical groups for storing objects. When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. The following important highlights relate to Ceph pools:

23 RADOS Block Device

A block is a sequence of bytes, for example a 4 MB block of data. Block-based storage interfaces are the most common way to store data with rotating media, such as hard disks, CDs, floppy disks. The ubiquity of block device interfaces makes a virtual block device an ideal candidate to interact with …

24 Erasure Coded Pools

Ceph provides an alternative to the normal replication of data in pools, called erasure or erasure coded pool. Erasure pools do not provide all functionality of replicated pools (for example, they cannot store metadata for RBD pools), but require less raw storage. A default erasure pool capable of s…

25 Ceph Cluster Configuration

This chapter provides a list of important Ceph cluster settings and their description. The settings are sorted by topic.

15 Introduction

In this part of the manual you will learn how to start or stop Ceph services, monitor a cluster's state, use and modify CRUSH Maps, or manage storage pools.

It also includes advanced topics, for example how to manage users and authentication in general, how to manage pool and RADOS device snapshots, how to set up erasure coded pools, or how to increase the cluster performance with cache tiering.

16 Operating Ceph Services

You can operate Ceph services either using systemd or using DeepSea.

16.1 Operating Ceph Cluster Related Services Using systemd

Use the systemctl command to operate all Ceph related services. The operation takes place on the node you are currently logged in to. You need to have root privileges to be able to operate on Ceph services.

16.1.1 Starting, Stopping, and Restarting Services Using Targets

To simplify starting, stopping, and restarting all the services of a particular type (for example all Ceph services, or all MONs, or all OSDs) on a node, Ceph provides the following systemd unit files:

cephadm@adm > ls /usr/lib/systemd/system/ceph*.target
ceph.target
ceph-osd.target
ceph-mon.target
ceph-mgr.target
ceph-mds.target
ceph-radosgw.target
ceph-rbd-mirror.target

To start/stop/restart all Ceph services on the node, run:

root # systemctl start ceph.target
root # systemctl stop ceph.target
root # systemctl restart ceph.target

To start/stop/restart all OSDs on the node, run:

root # systemctl start ceph-osd.target
root # systemctl stop ceph-osd.target
root # systemctl restart ceph-osd.target

Commands for the other targets are analogous.

16.1.2 Starting, Stopping, and Restarting Individual Services

You can operate individual services using the following parameterized systemd unit files:

ceph-osd@.service
ceph-mon@.service
ceph-mds@.service
ceph-mgr@.service
ceph-radosgw@.service
ceph-rbd-mirror@.service

To use these commands, you first need to identify the name of the service you want to operate. See Section 16.1.3, “Identifying Individual Services” to learn more about services identification.

To start, stop or restart the osd.1 service, run:

root # systemctl start ceph-osd@1.service
root # systemctl stop ceph-osd@1.service
root # systemctl restart ceph-osd@1.service

Commands for the other service types are analogous.
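
The naming rule used above (a parameterized unit file plus an instance name after the @ sign) can be sketched as a small Python helper. The function name ceph_unit_name is hypothetical; it only formats a string for the daemon types listed above and does not call systemctl.

```python
# Daemon types taken from the parameterized unit files listed above.
CEPH_DAEMON_TYPES = ("osd", "mon", "mds", "mgr", "radosgw", "rbd-mirror")


def ceph_unit_name(daemon_type, instance):
    """Return the systemd unit name for one Ceph daemon instance.

    Hypothetical helper: it builds the name only and never runs systemctl.
    """
    if daemon_type not in CEPH_DAEMON_TYPES:
        raise ValueError(f"unknown Ceph daemon type: {daemon_type!r}")
    instance = str(instance)
    if not instance:
        raise ValueError("instance name must not be empty")
    return f"ceph-{daemon_type}@{instance}.service"


if __name__ == "__main__":
    # Unit name for the osd.1 service used in the example above.
    print(ceph_unit_name("osd", 1))
```

The resulting string can be passed to systemctl start, stop, or restart in the same way as in the osd.1 example.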

16.1.3 Identifying Individual Services

You can find out the names/numbers of a particular type of service in several ways. The following commands provide results for ceph* services. You can run them on any node of the Ceph cluster.

To list all (even inactive) services of type ceph*, run:

root # systemctl list-units --all --type=service ceph*

To list only the inactive services, run:

root # systemctl list-units --all --state=inactive --type=service ceph*

You can also use salt to query services across multiple nodes:

root@master # salt TARGET cmd.shell \
 "systemctl list-units --all --type=service ceph* | sed -e '/^$/,$ d'"

Query storage nodes only:

root@master # salt -I 'roles:storage' cmd.shell \
 'systemctl list-units --all --type=service ceph*'

16.1.4 Service Status

You can query systemd for the status of services. For example:

root # systemctl status ceph-osd@1.service
root # systemctl status ceph-mon@HOSTNAME.service

Replace HOSTNAME with the host name the daemon is running on.

If you do not know the exact name/number of the service, see Section 16.1.3, “Identifying Individual Services”.

16.2 Restarting Ceph Services Using DeepSea

After applying updates to the cluster nodes, the affected Ceph related services need to be restarted. Normally, restarts are performed automatically by DeepSea. This section describes how to restart the services manually.

Tip: Watching the Restart

The process of restarting the cluster may take some time. You can watch the events by using the Salt event bus by running:

root@master # salt-run state.event pretty=True

Another command to monitor active jobs is:

root@master # salt-run jobs.active

16.2.1 Restarting All Services

Warning: Interruption of Services

If Ceph related services—specifically iSCSI or NFS Ganesha—are configured as single points of access with no High Availability setup, restarting them will result in their temporary outage as viewed from the client side.

Tip: Samba Not Managed by DeepSea

Because DeepSea and the Ceph Dashboard do not currently support Samba deployments, you need to manage Samba related services manually. For more details, see Chapter 29, Exporting Ceph Data via Samba.

To restart all services on the cluster, run the following command:

root@master # salt-run state.orch ceph.restart
  • For DeepSea prior to version 0.8.4, the Metadata Server, iSCSI Gateway, Object Gateway, and NFS Ganesha services restart in parallel.

  • For DeepSea 0.8.4 and newer, all roles you have configured restart in the following order: Ceph Monitor, Ceph Manager, Ceph OSD, Metadata Server, Object Gateway, iSCSI Gateway, NFS Ganesha. To keep the downtime low and to find potential issues as early as possible, nodes are restarted sequentially. For example, only one monitoring node is restarted at a time.

The command waits for the cluster to recover if the cluster is in a degraded, unhealthy state.

16.2.2 Restarting Specific Services

To restart a specific service on the cluster, run:

root@master # salt-run state.orch ceph.restart.service_name

For example, to restart all Object Gateways, run:

root@master # salt-run state.orch ceph.restart.rgw

You can use the following targets:

root@master # salt-run state.orch ceph.restart.mon
root@master # salt-run state.orch ceph.restart.mgr
root@master # salt-run state.orch ceph.restart.osd
root@master # salt-run state.orch ceph.restart.mds
root@master # salt-run state.orch ceph.restart.rgw
root@master # salt-run state.orch ceph.restart.igw
root@master # salt-run state.orch ceph.restart.ganesha

The restart orchestration checks whether the installed binary is newer than the running one, or whether configuration changes exist for this daemon, and restarts the daemon only in those cases. If you run one of the above commands and nothing happens, neither condition was met. To restart a daemon regardless of these checks, see Section 16.1.2, “Starting, Stopping, and Restarting Individual Services”.
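
As a rough illustration, the restart order described in Section 16.2.1 for DeepSea 0.8.4 and newer can be combined with the ceph.restart.* targets listed above. The following Python sketch only builds the command lines as strings; the function name restart_commands is hypothetical, and running the targets one by one in this order is an assumption, not a documented DeepSea procedure.

```python
# Restart order documented for DeepSea 0.8.4 and newer, expressed with the
# ceph.restart.* target suffixes listed above.
RESTART_ORDER = ("mon", "mgr", "osd", "mds", "rgw", "igw", "ganesha")


def restart_commands(roles):
    """Return salt-run command lines for the given roles in the documented order.

    Hypothetical helper: it returns strings and does not execute anything.
    """
    wanted = set(roles)
    unknown = wanted - set(RESTART_ORDER)
    if unknown:
        raise ValueError(f"unknown restart targets: {sorted(unknown)}")
    return [
        f"salt-run state.orch ceph.restart.{role}"
        for role in RESTART_ORDER
        if role in wanted
    ]


if __name__ == "__main__":
    for command in restart_commands(["rgw", "mon", "osd"]):
        print(command)
```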

16.3 Shutdown and Start of the Whole Ceph Cluster

There are occasions when you need to stop all Ceph related services in the cluster in the recommended order and then start them again, for example in case of a planned power outage.

Procedure 16.1: Shutting Down the Whole Ceph Cluster
  1. Shut down or disconnect any clients accessing the cluster.

  2. To prevent CRUSH from automatically rebalancing the cluster, set the cluster to noout:

    root@master # ceph osd set noout
  3. Disable safety measures and run the ceph.shutdown runner:

    root@master # salt-run disengage.safety
    root@master # salt-run state.orch ceph.shutdown
  4. Power off all cluster nodes:

    root@master # salt -C 'G@deepsea:*' cmd.run "shutdown -h"
Procedure 16.2: Starting the Whole Ceph Cluster
  1. Power on the Admin Node.

  2. Power on the Ceph Monitor nodes.

  3. Power on the Ceph OSD nodes.

  4. Unset the previously set noout flag:

    root@master # ceph osd unset noout
  5. Power on all configured gateways.

  6. Power on or connect cluster clients.

17 Determining Cluster State

When you have a running cluster, you may use the ceph tool to monitor it. Determining the cluster state typically involves checking the status of Ceph OSDs, Ceph Monitors, placement groups, and Metadata Servers.

Tip: Interactive Mode

To run the ceph tool in an interactive mode, type ceph at the command line with no arguments. The interactive mode is more convenient if you are going to enter more ceph commands in a row. For example:

cephadm@adm > ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status

17.1 Checking a Cluster's Status

To check a cluster's status, execute the following:

cephadm@adm > ceph status

or

cephadm@adm > ceph -s

In interactive mode, type status and press Enter.

ceph> status

Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:

cluster b370a29d-9287-4ca3-ab57-3d824f65e339
 health HEALTH_OK
 monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
 osdmap e63: 2 osds: 2 up, 2 in
  pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
        115 GB used, 167 GB / 297 GB avail
               1 active+clean+scrubbing+deep
             951 active+clean

17.2 Checking Cluster Health

After you start your cluster and before you start reading and/or writing data, check your cluster's health:

cephadm@adm > ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3
Tip

If you specified non-default locations for your configuration or keyring, you may specify their locations:

cephadm@adm > ceph -c /path/to/conf -k /path/to/keyring health
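
For scripting, the one-line summary printed by ceph health can be split into the overall status and the individual messages. The following Python sketch assumes the semicolon-separated format shown in the example output above; the function name parse_health is hypothetical, and other Ceph versions may format the line differently.

```python
def parse_health(summary):
    """Split one-line `ceph health` output into (status, list of messages).

    Assumes the format shown above: a status word followed by
    semicolon-separated messages.
    """
    status, _, rest = summary.strip().partition(" ")
    messages = [part.strip() for part in rest.split(";") if part.strip()]
    return status, messages


if __name__ == "__main__":
    # The sample line from above, joined into a single line.
    sample = (
        "HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; "
        "1 mons down, quorum 0,2 node-1,node-2,node-3"
    )
    status, messages = parse_health(sample)
    print(status)
    for message in messages:
        print(" -", message)
```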

The Ceph cluster returns one of the following health codes:

OSD_DOWN

One or more OSDs are marked down. The OSD daemon may have been stopped, or peer OSDs may be unable to reach the OSD over the network. Common causes include a stopped or crashed daemon, a down host, or a network outage.

Verify the host is healthy, the daemon is started, and network is functioning. If the daemon has crashed, the daemon log file (/var/log/ceph/ceph-osd.*) may contain debugging information.

OSD_<crush type>_DOWN, for example OSD_HOST_DOWN

All the OSDs within a particular CRUSH subtree are marked down, for example all OSDs on a host.

OSD_ORPHAN

An OSD is referenced in the CRUSH map hierarchy but does not exist. The OSD can be removed from the CRUSH hierarchy with:

cephadm@adm > ceph osd crush rm osd.ID
OSD_OUT_OF_ORDER_FULL

The usage thresholds for nearfull (defaults to 0.85), backfillfull (defaults to 0.90), full (defaults to 0.95), and/or failsafe_full are not ascending. In particular, Ceph expects nearfull < backfillfull, backfillfull < full, and full < failsafe_full.

To read the current values, run:

cephadm@adm > ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%

The thresholds can be adjusted with the following commands:

cephadm@adm > ceph osd set-backfillfull-ratio ratio
cephadm@adm > ceph osd set-nearfull-ratio ratio
cephadm@adm > ceph osd set-full-ratio ratio
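
Before changing a ratio, it can help to check that the new set of values is still ascending. With the defaults quoted above (nearfull 0.85, backfillfull 0.90, full 0.95), ascending means nearfull < backfillfull < full < failsafe_full. The following Python sketch is a minimal illustration; the function name is hypothetical, and the failsafe_full value of 0.97 in the example is an assumed value chosen for illustration.

```python
def ratios_out_of_order(nearfull, backfillfull, full, failsafe_full):
    """Return the adjacent threshold pairs that are not strictly ascending."""
    ordered = [
        ("nearfull", nearfull),
        ("backfillfull", backfillfull),
        ("full", full),
        ("failsafe_full", failsafe_full),
    ]
    return [
        (low_name, high_name)
        for (low_name, low), (high_name, high) in zip(ordered, ordered[1:])
        if not low < high
    ]


if __name__ == "__main__":
    # Defaults quoted above; 0.97 for failsafe_full is an assumption.
    print(ratios_out_of_order(0.85, 0.90, 0.95, 0.97))
    # Raising nearfull above backfillfull would trigger OSD_OUT_OF_ORDER_FULL.
    print(ratios_out_of_order(0.92, 0.90, 0.95, 0.97))
```

An empty result means the thresholds are in the expected order.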
OSD_FULL

One or more OSDs has exceeded the full threshold and is preventing the cluster from servicing writes. Usage by pool can be checked with:

cephadm@adm > ceph df

The currently defined full ratio can be seen with:

cephadm@adm > ceph osd dump | grep full_ratio

A short-term workaround to restore write availability is to raise the full threshold by a small amount:

cephadm@adm > ceph osd set-full-ratio ratio

Add new storage to the cluster by deploying more OSDs, or delete existing data in order to free up space.

OSD_BACKFILLFULL

One or more OSDs has exceeded the backfillfull threshold, which prevents data from being allowed to rebalance to this device. This is an early warning that rebalancing may not be able to complete and that the cluster is approaching full. Usage by pool can be checked with:

cephadm@adm > ceph df
OSD_NEARFULL

One or more OSDs has exceeded the nearfull threshold. This is an early warning that the cluster is approaching full. Usage by pool can be checked with:

cephadm@adm > ceph df
OSDMAP_FLAGS

One or more cluster flags of interest has been set. With the exception of full, these flags can be set or cleared with:

cephadm@adm > ceph osd set flag
cephadm@adm > ceph osd unset flag

These flags include:

full

The cluster is flagged as full and cannot service writes.

pauserd, pausewr

Paused reads or writes.

noup

OSDs are not allowed to start.

nodown

OSD failure reports are being ignored, such that the monitors will not mark OSDs down.

noin

OSDs that were previously marked out will not be marked back in when they start.

noout

Down OSDs will not automatically be marked out after the configured interval.

nobackfill, norecover, norebalance

Recovery or data rebalancing is suspended.

noscrub, nodeep_scrub

Scrubbing (see Section 20.6, “Scrubbing”) is disabled.

notieragent

Cache tiering activity is suspended.

OSD_FLAGS

One or more OSDs has a per-OSD flag of interest set. These flags include:

noup

OSD is not allowed to start.

nodown

Failure reports for this OSD will be ignored.

noin

If this OSD was previously marked out automatically after a failure, it will not be marked in when it starts.

noout

If this OSD is down, it will not be automatically marked out after the configured interval.

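Per-OSD flags can be set and cleared with the following commands. Replace flag with one of the flag names listed above, for example add-noout or rm-noout: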

cephadm@adm > ceph osd add-flag osd-ID
cephadm@adm > ceph osd rm-flag osd-ID
OLD_CRUSH_TUNABLES

The CRUSH Map is using very old settings and should be updated. The oldest tunables that can be used (that is the oldest client version that can connect to the cluster) without triggering this health warning is determined by the mon_crush_min_required_version configuration option.

OLD_CRUSH_STRAW_CALC_VERSION

The CRUSH Map is using an older, non-optimal method for calculating intermediate weight values for straw buckets. The CRUSH Map should be updated to use the newer method (straw_calc_version=1).

CACHE_POOL_NO_HIT_SET

One or more cache pools is not configured with a hit set to track usage, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Hit sets can be configured on the cache pool with:

cephadm@adm > ceph osd pool set poolname hit_set_type type
cephadm@adm > ceph osd pool set poolname hit_set_period period-in-seconds
cephadm@adm > ceph osd pool set poolname hit_set_count number-of-hitsets
cephadm@adm > ceph osd pool set poolname hit_set_fpp target-false-positive-rate

For more information on cache tiering, see Book “Tuning Guide”, Chapter 7 “Cache Tiering”.

OSD_NO_SORTBITWISE

No pre-Luminous v12 OSDs are running but the sortbitwise flag has not been set. You need to set the sortbitwise flag before Luminous v12 or newer OSDs can start:

cephadm@adm > ceph osd set sortbitwise
POOL_FULL

One or more pools have reached their quota and are no longer allowing writes. You can view pool quotas and usage with:

cephadm@adm > ceph df detail

You can either raise the pool quota with:

cephadm@adm > ceph osd pool set-quota poolname max_objects num-objects
cephadm@adm > ceph osd pool set-quota poolname max_bytes num-bytes

or delete some existing data to reduce usage.

PG_AVAILABILITY

Data availability is reduced, meaning that the cluster is unable to service potential read or write requests for some data in the cluster. Specifically, one or more PGs is in a state that does not allow I/O requests to be serviced. Problematic PG states include peering, stale, incomplete, and the lack of active (if those conditions do not clear quickly). Detailed information about which PGs are affected is available from:

cephadm@adm > ceph health detail

In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:

cephadm@adm > ceph tell pgid query
PG_DEGRADED

Data redundancy is reduced for some data, meaning the cluster does not have the desired number of replicas for all data (for replicated pools) or erasure code fragments (for erasure coded pools). Specifically, one or more PGs have either the degraded or undersized flag set (there are not enough instances of that placement group in the cluster), or have not had the clean flag set for some time. Detailed information about which PGs are affected is available from:

cephadm@adm > ceph health detail

In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:

cephadm@adm > ceph tell pgid query
PG_DEGRADED_FULL

Data redundancy may be reduced or at risk for some data because of a lack of free space in the cluster. Specifically, one or more PGs has the backfill_toofull or recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold.

PG_DAMAGED

Data scrubbing (see Section 20.6, “Scrubbing”) has discovered some problems with data consistency in the cluster. Specifically, one or more PGs have the inconsistent or snaptrim_error flag set, indicating that an earlier scrub operation found a problem, or have the repair flag set, meaning that a repair for such an inconsistency is currently in progress.

OSD_SCRUB_ERRORS

Recent OSD scrubs have uncovered inconsistencies.

CACHE_POOL_NEAR_FULL

A cache tier pool is nearly full. Full in this context is determined by the target_max_bytes and target_max_objects properties on the cache pool. When the pool reaches the target threshold, write requests to the pool may block while data is flushed and evicted from the cache, a state that normally leads to very high latencies and poor performance. The cache pool target size can be adjusted with:

cephadm@adm > ceph osd pool set cache-pool-name target_max_bytes bytes
cephadm@adm > ceph osd pool set cache-pool-name target_max_objects objects

Normal cache flush and evict activity may also be throttled because of reduced availability or performance of the base tier, or overall cluster load.

Find more information about cache tiering in Book “Tuning Guide”, Chapter 7 “Cache Tiering”.

TOO_FEW_PGS

The number of PGs in use is below the configurable threshold of mon_pg_warn_min_per_osd PGs per OSD. This can lead to suboptimal distribution and balance of data across the OSDs in the cluster, and reduce overall performance.

TOO_MANY_PGS

The number of PGs in use is above the configurable threshold of mon_pg_warn_max_per_osd PGs per OSD. This can lead to higher memory usage for OSD daemons, slower peering after cluster state changes (for example OSD restarts, additions, or removals), and higher load on the Ceph Managers and Ceph Monitors.

Although the pg_num value for existing pools cannot be reduced, the pgp_num value can. Reducing it effectively collocates some PGs on the same sets of OSDs, mitigating some of the negative impacts described above. The pgp_num value can be adjusted with:

cephadm@adm > ceph osd pool set pool pgp_num value
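
To get a feeling for the TOO_FEW_PGS and TOO_MANY_PGS thresholds, the number of PG instances per OSD can be estimated as the sum of pg_num multiplied by the number of copies over all pools, divided by the number of OSDs. The following Python sketch is an approximation under that assumption and not Ceph's own calculation. The figures 544 PGs, 4 OSDs, and the limit of 300 are borrowed from the sample ceph -s output in Section 17.3; the copy count of 3 and the lower limit of 30 are assumptions.

```python
def pgs_per_osd(pools, num_osds):
    """Estimate PG instances per OSD.

    pools is a list of (pg_num, copies) pairs, one pair per pool.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return sum(pg_num * copies for pg_num, copies in pools) / num_osds


def pg_count_warning(pools, num_osds, warn_min, warn_max):
    """Return the health code the estimate would suggest, or None."""
    per_osd = pgs_per_osd(pools, num_osds)
    if per_osd < warn_min:
        return "TOO_FEW_PGS"
    if per_osd > warn_max:
        return "TOO_MANY_PGS"
    return None


if __name__ == "__main__":
    pools = [(544, 3)]  # 544 PGs in total, assumed to be stored with 3 copies
    print(pgs_per_osd(pools, 4))
    print(pg_count_warning(pools, 4, warn_min=30, warn_max=300))
```

With these inputs the estimate is 408 PGs per OSD, which is above the limit of 300.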
SMALLER_PGP_NUM

One or more pools have a pgp_num value less than pg_num. This normally indicates that the PG count was increased without also updating the placement behavior. It is normally resolved by setting pgp_num to match pg_num, which triggers the data migration, with:

cephadm@adm > ceph osd pool set pool pgp_num pg_num_value
MANY_OBJECTS_PER_PG

One or more pools have an average number of objects per PG that is significantly higher than the overall cluster average. The specific threshold is controlled by the mon_pg_warn_max_object_skew configuration value. This is usually an indication that the pool(s) containing most of the data in the cluster have too few PGs, and/or that other pools that do not contain as much data have too many PGs. The threshold can be raised to silence the health warning by adjusting the mon_pg_warn_max_object_skew configuration option on the monitors.
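
The skew check can be illustrated by comparing each pool's average number of objects per PG with the cluster-wide average. The following Python sketch is a simplified model of that comparison, not Ceph's implementation. The pool names, the object counts, and the skew limit of 10 are made-up values for illustration.

```python
def skewed_pools(pools, max_skew):
    """Return names of pools whose objects per PG exceed max_skew times the cluster average.

    pools maps a pool name to an (objects, pg_num) pair.
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pg_num for _, pg_num in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return []
    cluster_average = total_objects / total_pgs
    return sorted(
        name
        for name, (objects, pg_num) in pools.items()
        if pg_num > 0 and objects / pg_num > max_skew * cluster_average
    )


if __name__ == "__main__":
    # Hypothetical pools: most data sits in a pool with very few PGs.
    pools = {
        "data": (1_000_000, 8),
        "meta": (1_000, 512),
        "misc": (500, 512),
    }
    print(skewed_pools(pools, max_skew=10))
```

In this example only the data pool is reported, which matches the explanation above: the pool holding most of the data has too few PGs compared to the others.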

POOL_APP_NOT_ENABLED

A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application. For example, if the pool is used by RBD:

cephadm@adm > rbd pool init pool_name

If the pool is being used by a custom application 'foo', you can also label it using the low-level command:

cephadm@adm > ceph osd pool application enable pool_name foo
POOL_FULL

One or more pools have reached (or are very close to reaching) their quota. The threshold to trigger this error condition is controlled by the mon_pool_quota_crit_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:

cephadm@adm > ceph osd pool set-quota pool max_bytes bytes
cephadm@adm > ceph osd pool set-quota pool max_objects objects

Setting the quota value to 0 will disable the quota.

POOL_NEAR_FULL

One or more pools are approaching their quota. The threshold to trigger this warning condition is controlled by the mon_pool_quota_warn_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:

cephadm@adm > ceph osd pool set-quota pool max_bytes bytes
cephadm@adm > ceph osd pool set-quota pool max_objects objects

Setting the quota value to 0 will disable the quota.
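
The relation between a pool's usage, its quota, and the warning and critical thresholds can be sketched as follows. This Python example is a simplified model: the function name is hypothetical, and treating both thresholds as percentages of the quota is an assumption about their unit. The only behavior taken from the text above is that a quota of 0 disables the check.

```python
def quota_state(used, quota, warn_percent, crit_percent):
    """Classify pool usage against a quota.

    used and quota share one unit (bytes or objects); a quota of 0 means
    that no quota is set.
    """
    if quota == 0:
        return "no quota"
    if used * 100 >= crit_percent * quota:
        return "POOL_FULL"
    if used * 100 >= warn_percent * quota:
        return "POOL_NEAR_FULL"
    return "ok"


if __name__ == "__main__":
    # Hypothetical thresholds of 80% (warning) and 95% (critical).
    for used in (10, 85, 100):
        print(used, quota_state(used, 100, warn_percent=80, crit_percent=95))
```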

OBJECT_MISPLACED

One or more objects in the cluster are not stored on the node where the cluster wants them to be. This is an indication that data migration caused by a recent cluster change has not yet completed. Misplaced data is not a dangerous condition in itself. Data consistency is never at risk, and old copies of objects are never removed until the desired number of new copies (in the desired locations) are present.

OBJECT_UNFOUND

One or more objects in the cluster cannot be found. Specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on the OSDs that are currently up. Read or write requests to the 'unfound' objects will be blocked. Ideally, the down OSD that has the most recent copy of the unfound object can be brought back up. Candidate OSDs can be identified from the peering state for the PG(s) responsible for the unfound object:

cephadm@adm > ceph tell pgid query
REQUEST_SLOW

One or more OSD requests is taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. You can query the request queue on the OSD(s) in question with the following command executed from the OSD host:

cephadm@adm > ceph daemon osd.id ops

You can see a summary of the slowest recent requests:

cephadm@adm > ceph daemon osd.id dump_historic_ops

You can find the location of an OSD with:

cephadm@adm > ceph osd find osd.id
REQUEST_STUCK

One or more OSD requests have been blocked for a relatively long time, for example 4096 seconds. This is an indication that either the cluster has been unhealthy for an extended period of time (for example, not enough running OSDs or inactive PGs) or there is some internal problem with the OSD.

PG_NOT_SCRUBBED

One or more PGs have not been scrubbed (see Section 20.6, “Scrubbing”) recently. PGs are normally scrubbed every mon_scrub_interval seconds, and this warning triggers when mon_warn_not_scrubbed such intervals have elapsed without a scrub. PGs will not scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a scrub of a clean PG with:

cephadm@adm > ceph pg scrub pgid
PG_NOT_DEEP_SCRUBBED

One or more PGs have not been deep scrubbed (see Section 20.6, “Scrubbing”) recently. PGs are normally deep scrubbed every osd_deep_scrub_interval seconds, and this warning triggers when mon_warn_not_deep_scrubbed seconds have elapsed without a deep scrub. PGs will not (deep) scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a deep scrub of a clean PG with:

cephadm@adm > ceph pg deep-scrub pgid

17.3 Watching a Cluster

You can find the immediate state of the cluster using ceph -s. For example, a small Ceph cluster consisting of three monitors and four OSDs may print the following when a workload is running:

cephadm@adm > ceph -s
cluster:
  id:     ea4cf6ce-80c6-3583-bb5e-95fa303c893f
  health: HEALTH_WARN
          too many PGs per OSD (408 > max 300)

services:
  mon: 3 daemons, quorum ses5min1,ses5min3,ses5min2
  mgr: ses5min1(active), standbys: ses5min3, ses5min2
  mds: cephfs-1/1/1 up  {0=ses5min3=up:active}
  osd: 4 osds: 4 up, 4 in
  rgw: 1 daemon active

data:
  pools:   8 pools, 544 pgs
  objects: 253 objects, 3821 bytes
  usage:   6252 MB used, 13823 MB / 20075 MB avail
  pgs:     544 active+clean

The output provides the following information:

  • Cluster ID

  • Cluster health status

  • The monitor map epoch and the status of the monitor quorum

  • The OSD map epoch and the status of OSDs

  • The status of Ceph Managers

  • The status of Object Gateways

  • The placement group map version

  • The number of placement groups and pools

  • The notional amount of data stored and the number of objects stored

  • The total amount of data stored

Tip: How Ceph Calculates Data Usage

The used value reflects the actual amount of raw storage used. The xxx GB / xxx GB value means the amount available (the lesser number) of the overall storage capacity of the cluster. The notional number reflects the size of the stored data before it is replicated, cloned, or snapshotted. Therefore, the amount of data actually stored typically exceeds the notional amount stored, because Ceph creates replicas of the data and may also use storage capacity for cloning and snapshotting.

Other commands that display immediate status information are:

  • ceph pg stat

  • ceph osd pool stats

  • ceph df

  • ceph df detail

To get the information updated in real time, put any of these commands (including ceph -s) as an argument of the watch command:

root # watch -n 10 'ceph -s'

Press Ctrl+C to stop watching.

17.4 Checking a Cluster's Usage Stats

To check a cluster’s data usage and distribution among pools, use the ceph df command. To get more details, use ceph df detail.

cephadm@adm > ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    hdd       40 GiB     32 GiB     137 MiB      8.1 GiB         20.33
    TOTAL     40 GiB     32 GiB     137 MiB      8.1 GiB         20.33
POOLS:
    POOL             ID     STORED     OBJECTS    USED       %USED    MAX AVAIL
    iscsi-images      1     3.9 KiB          8    769 KiB        0       10 GiB
    cephfs_data       2     1.6 KiB          5    960 KiB        0       10 GiB
    cephfs_metadata   3      54 KiB         22    1.5 MiB        0       10 GiB
[...]

The RAW STORAGE section of the output provides an overview of the amount of storage your cluster uses for your data.

  • CLASS: The storage class of the device. Refer to Section 20.1.1, “Device Classes” for more details on device classes.

  • SIZE: The overall storage capacity of the cluster.

  • AVAIL: The amount of free space available in the cluster.

  • USED: The space (accumulated over all OSDs) allocated purely for data objects kept at block device.

  • RAW USED: The sum of 'USED' space and space allocated/reserved at block device for Ceph purposes, for example BlueFS part for BlueStore.

  • %RAW USED: The percentage of raw storage used. Use this number in conjunction with the full ratio and near full ratio to ensure that you are not reaching your cluster’s capacity. See Section 17.10, “Storage Capacity” for additional details.

    Note: Cluster Fill Level

    When a raw storage fill level is getting close to 100%, you need to add new storage to the cluster. A higher usage may lead to single full OSDs and cluster health problems.

    Use the command ceph osd df tree to list the fill level of all OSDs.

The POOLS section of the output provides a list of pools and the notional usage of each pool. The output from this section does not reflect replicas, clones or snapshots. For example, if you store an object with 1MB of data, the notional usage will be 1MB, but the actual usage may be 2MB or more depending on the number of replicas, clones and snapshots.

  • POOL: The name of the pool.

  • ID: The pool ID.

  • STORED: The amount of data stored by the user.

  • OBJECTS: The notional number of objects stored per pool.

  • USED: The amount of space allocated purely for data by all OSD nodes in kB.

  • %USED: The notional percentage of storage used per pool.

  • MAX AVAIL: The maximum available space in the given pool.

Note

The numbers in the POOLS section are notional. They are not inclusive of the number of replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the RAW STORAGE section of the output.
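
The difference between notional and raw usage can be estimated with simple arithmetic. The following Python sketch is a rough model that ignores snapshots, clones, and any on-disk overhead. The function names are hypothetical, and the formula for erasure coded pools (k data chunks plus m coding chunks) is added here for comparison and is not part of the ceph df output.

```python
def replicated_raw_bytes(notional_bytes, replicas):
    """Raw space consumed by a replicated pool: every byte is stored `replicas` times."""
    return notional_bytes * replicas


def erasure_coded_raw_bytes(notional_bytes, k, m):
    """Raw space consumed by an erasure coded pool with k data and m coding chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    return notional_bytes * (k + m) / k


if __name__ == "__main__":
    one_mb = 1_000_000
    # The 1 MB object from the paragraph above, stored with 2 replicas.
    print(replicated_raw_bytes(one_mb, 2))
    # The same object in a hypothetical erasure coded pool with k=4, m=2.
    print(erasure_coded_raw_bytes(one_mb, 4, 2))
```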

17.5 Checking OSD Status

You can check OSDs to ensure they are up and in by executing:

cephadm@adm > ceph osd stat

or

cephadm@adm > ceph osd dump

You can also view OSDs according to their position in the CRUSH map.

cephadm@adm > ceph osd tree

Ceph will print a CRUSH tree with a host, its OSDs, whether they are up and their weight.

# id    weight  type name       up/down reweight
-1      3       pool default
-3      3               rack mainrack
-2      3                       host osd-host
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1
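
If you want to spot OSDs that are not up without reading the whole tree, the output can be filtered by a script. The following Python sketch parses only the column layout shown in the sample above (ID, weight, name, state, reweight); the function name is hypothetical, and other Ceph versions print additional columns that this sketch does not handle.

```python
SAMPLE_TREE = """\
# id    weight  type name       up/down reweight
-1      3       pool default
-3      3               rack mainrack
-2      3                       host osd-host
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
"""


def osds_not_up(tree_text):
    """Return the names of OSDs whose state column is anything other than 'up'."""
    not_up = []
    for line in tree_text.splitlines():
        fields = line.split()
        # OSD rows look like: ID WEIGHT osd.N STATE REWEIGHT
        if len(fields) >= 4 and fields[2].startswith("osd."):
            if fields[3] != "up":
                not_up.append(fields[2])
    return not_up


if __name__ == "__main__":
    # In SAMPLE_TREE, osd.1 was changed to 'down' for illustration.
    print(osds_not_up(SAMPLE_TREE))
```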

17.6 Checking for Full OSDs

Ceph prevents you from writing to a full OSD so that you do not lose data. In an operational cluster, you should receive a warning when your cluster is getting near its full ratio. The mon osd full ratio defaults to 0.95, or 95% of capacity before it stops clients from writing data. The mon osd nearfull ratio defaults to 0.85, or 85% of capacity, when it generates a health warning.

Full OSD nodes will be reported by ceph health:

cephadm@adm > ceph health
  HEALTH_WARN 1 nearfull osds
  osd.2 is near full at 85%

or

cephadm@adm > ceph health
  HEALTH_ERR 1 nearfull osds, 1 full osds
  osd.2 is near full at 85%
  osd.3 is full at 97%
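
The per-OSD lines in this output can be collected by a script, for example to feed a monitoring system. The following Python sketch matches only the line format shown above ("osd.N is near full at P%", "osd.N is backfill full at P%", "osd.N is full at P%"); the function name is hypothetical.

```python
import re

# Matches lines such as "osd.2 is near full at 85%".
_FILL_LINE = re.compile(r"^(osd\.\d+) is (near full|backfill full|full) at (\d+)%$")


def osd_fill_levels(health_text):
    """Return {osd_name: (state, percent)} for every matching line."""
    levels = {}
    for line in health_text.splitlines():
        match = _FILL_LINE.match(line.strip())
        if match:
            name, state, percent = match.groups()
            levels[name] = (state, int(percent))
    return levels


if __name__ == "__main__":
    sample = """\
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%
"""
    for name, (state, percent) in osd_fill_levels(sample).items():
        print(name, state, percent)
```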

The best way to deal with a full cluster is to add new OSD hosts/disks allowing the cluster to redistribute data to the newly available storage.

Tip: Preventing Full OSDs

After an OSD becomes full—it uses 100% of its disk space—it will normally crash quickly without warning. Following are a few tips to remember when administering OSD nodes.

  • Each OSD's disk space (usually mounted under /var/lib/ceph/osd/osd-{1,2..}) needs to be placed on a dedicated underlying disk or partition.

  • Check the Ceph configuration files and make sure that Ceph does not store its log file to the disks/partitions dedicated for use by OSDs.

  • Make sure that no other process writes to the disks/partitions dedicated for use by OSDs.

17.7 Checking Monitor Status

After you start the cluster and before first reading and/or writing data, check the Ceph Monitors' quorum status. When the cluster is already serving requests, check the Ceph Monitors' status periodically to ensure that they are running.

To display the monitor map, execute the following:

cephadm@adm > ceph mon stat

or

cephadm@adm > ceph mon dump

To check the quorum status for the monitor cluster, execute the following:

cephadm@adm > ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:

{ "election_epoch": 10,
  "quorum": [
        0,
        1,
        2],
  "monmap": { "epoch": 1,
      "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
      "modified": "2011-12-12 13:28:27.505520",
      "created": "2011-12-12 13:28:27.505520",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "192.168.1.10:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "192.168.1.11:6789\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "192.168.1.12:6789\/0"}
           ]
    }
}
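
Because the quorum status is returned as JSON, it is straightforward to evaluate in a script. The following Python sketch assumes the structure shown above (a quorum list of ranks and a monmap with one entry per monitor). The function names are hypothetical, and the sample is shortened to the fields the code uses, with monitor b removed from the quorum for illustration.

```python
import json

SAMPLE_STATUS = """
{ "election_epoch": 10,
  "quorum": [0, 2],
  "monmap": { "epoch": 1,
      "mons": [
            { "rank": 0, "name": "a" },
            { "rank": 1, "name": "b" },
            { "rank": 2, "name": "c" }
           ]
    }
}
"""


def monitors_out_of_quorum(status_json):
    """Return names of monitors listed in the monmap but missing from the quorum."""
    status = json.loads(status_json)
    in_quorum = set(status["quorum"])
    return [
        mon["name"]
        for mon in status["monmap"]["mons"]
        if mon["rank"] not in in_quorum
    ]


def has_majority(status_json):
    """Return True if more than half of the known monitors are in the quorum."""
    status = json.loads(status_json)
    return len(status["quorum"]) > len(status["monmap"]["mons"]) / 2


if __name__ == "__main__":
    print(monitors_out_of_quorum(SAMPLE_STATUS))
    print(has_majority(SAMPLE_STATUS))
```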

17.8 Checking Placement Group States

Placement groups map objects to OSDs. When you monitor your placement groups, you will want them to be active and clean. For a detailed discussion, refer to Section 17.11, “Monitoring OSDs and Placement Groups”.

17.9 Using the Admin Socket

The Ceph admin socket allows you to query a daemon via a socket interface. By default, Ceph sockets reside under /var/run/ceph. To access a daemon via the admin socket, log in to the host running the daemon and use the following command:

cephadm@adm > ceph --admin-daemon /var/run/ceph/socket-name

To view the available admin socket commands, execute the following command:

cephadm@adm > ceph --admin-daemon /var/run/ceph/socket-name help

The admin socket command enables you to show and set your configuration at runtime. Refer to Section 25.1, “Runtime Configuration” for details.

Additionally, you can set configuration values at runtime directly through the admin socket. The admin socket bypasses the monitor. In contrast, ceph tell daemon-type.id injectargs relies on the monitor, but does not require you to log in to the host in question.

17.10 Storage Capacity

When a Ceph storage cluster gets close to its maximum capacity, Ceph prevents you from writing to or reading from Ceph OSDs as a safety measure to prevent data loss. Therefore, letting a production cluster approach its full ratio is not a good practice, because it sacrifices high availability. The default full ratio is set to .95, meaning 95% of capacity. This is a very aggressive setting for a test cluster with a small number of OSDs.

Tip: Increase Storage Capacity

When monitoring your cluster, be alert to warnings related to the nearfull ratio. Such a warning means that the failure of one or more OSDs could result in a temporary service disruption. Consider adding more OSDs to increase storage capacity.

A common scenario for test clusters involves a system administrator removing a Ceph OSD from the Ceph storage cluster to watch the cluster rebalance, then removing another Ceph OSD, and so on until the cluster eventually reaches the full ratio and locks up. We recommend a bit of capacity planning even with a test cluster. Planning enables you to estimate how much spare capacity you will need in order to maintain high availability. Ideally, you want to plan for a series of Ceph OSD failures where the cluster can recover to an active + clean state without replacing those Ceph OSDs immediately. You can run a cluster in an active + degraded state, but this is not ideal for normal operating conditions.

The following diagram depicts a simplistic Ceph storage cluster containing 33 Ceph nodes with one Ceph OSD per host, each of them reading from and writing to a 3 TB drive. This example cluster has a maximum actual capacity of 99 TB. The mon osd full ratio option is set to 0.95. If the cluster falls to about 5 TB of remaining capacity (99 TB × 0.05 = 4.95 TB), it will not allow the clients to read and write data. Therefore, the storage cluster’s operating capacity is about 94 TB (99 TB × 0.95 = 94.05 TB), not 99 TB.

Figure 17.1: Ceph Cluster

It is normal in such a cluster for one or two OSDs to fail. A less frequent but reasonable scenario involves a rack’s router or power supply failing, which brings down multiple OSDs simultaneously (for example, OSDs 7-12). In such a scenario, you should still strive for a cluster that can remain operational and achieve an active + clean state, even if that means adding a few hosts with additional OSDs in short order. If your capacity usage is too high, you may not lose data, but you could still sacrifice data availability while resolving an outage within a failure domain if the capacity usage of the cluster exceeds the full ratio. For this reason, we recommend at least some rough capacity planning.

Identify two numbers for your cluster:

  1. The number of OSDs.

  2. The total capacity of the cluster.

If you divide the total capacity of your cluster by the number of OSDs in your cluster, you get the mean capacity of an OSD within your cluster. Multiply that number by the number of OSDs you expect to fail simultaneously during normal operations (a relatively small number) to estimate the amount of data held by the failing OSDs. Then multiply the capacity of the cluster by the full ratio to arrive at a maximum operating capacity. Finally, subtract the amount of data held by the OSDs you expect to fail to arrive at a reasonable full ratio. Repeat the process with a higher number of OSD failures (for example, a rack of OSDs) to arrive at a reasonable number for a near full ratio.
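
The calculation can be sketched in Python as follows. This is an illustration only: the function name reasonable_ratio is hypothetical, and dividing the result by the total capacity to express it as a ratio is an assumption, because the procedure above does not spell out that last step. The figures reuse the 33-OSD, 99 TB example cluster, with 2 OSDs assumed to fail in normal operations and 6 OSDs (one rack) in the larger scenario.

```python
def reasonable_ratio(total_tb: float, num_osds: int, failed_osds: int,
                     full_ratio: float = 0.95) -> float:
    """Estimate a ratio that leaves room for the given number of OSD failures."""
    mean_osd_tb = total_tb / num_osds         # mean capacity of one OSD
    lost_tb = mean_osd_tb * failed_osds       # capacity of the OSDs expected to fail
    max_operating_tb = total_tb * full_ratio  # capacity usable before the full ratio is hit
    # Assumption: express the remaining capacity as a fraction of the total.
    return (max_operating_tb - lost_tb) / total_tb

full = reasonable_ratio(99, 33, failed_osds=2)      # a few OSDs failing
nearfull = reasonable_ratio(99, 33, failed_osds=6)  # a whole rack failing
print(f"full ratio ~ {full:.2f}, near full ratio ~ {nearfull:.2f}")
```

With these inputs, the sketch yields roughly 0.89 and 0.77. Treat the values as a starting point for planning and not as recommended settings.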

The following settings only apply on cluster creation and are then stored in the OSD map:

[global]
 mon osd full ratio = .80
 mon osd backfillfull ratio = .75
 mon osd nearfull ratio = .70
Tip

These settings only apply during cluster creation. Afterward, change them in the OSD map using the ceph osd set-nearfull-ratio, ceph osd set-backfillfull-ratio, and ceph osd set-full-ratio commands.

mon osd full ratio

The percentage of disk space used before an OSD is considered full. Default is .95

mon osd backfillfull ratio

The percentage of disk space used before an OSD is considered too full to backfill. Default is .90

mon osd nearfull ratio

The percentage of disk space used before an OSD is considered nearfull. Default is .85

Tip: Check OSD Weight

If some OSDs are nearfull, but others have plenty of capacity, you may have a problem with the CRUSH weight for the nearfull OSDs.

17.11 Monitoring OSDs and Placement Groups

High availability and high reliability require a fault-tolerant approach to managing hardware and software issues. Ceph has no single point of failure, and can service requests for data in a 'degraded' mode. Ceph’s data placement introduces a layer of indirection to ensure that data does not bind directly to particular OSD addresses. This means that tracking down system faults requires finding the placement group and the underlying OSDs at the root of the problem.

Tip: Access in Case of Failure

A fault in one part of the cluster may prevent you from accessing a particular object. That does not mean that you cannot access other objects. When you run into a fault, follow the steps for monitoring your OSDs and placement groups. Then begin troubleshooting.

Ceph is generally self-repairing. However, when problems persist, monitoring OSDs and placement groups will help you identify the problem.

17.11.1 Monitoring OSDs

An OSD’s status is either in the cluster ('in') or out of the cluster ('out'). At the same time, it is either up and running ('up') or it is down and not running ('down'). If an OSD is 'up', it may be either in the cluster (you can read and write data) or out of the cluster. If it was in the cluster and recently moved out of the cluster, Ceph will migrate placement groups to other OSDs. If an OSD is out of the cluster, CRUSH will not assign placement groups to it. If an OSD is 'down', it should also be 'out'.

Note: Unhealthy State

If an OSD is 'down' and 'in', there is a problem and the cluster will not be in a healthy state.

If you execute a command such as ceph health, ceph -s or ceph -w, you may notice that the cluster does not always echo back HEALTH_OK. With regard to OSDs, you should expect that the cluster will not echo HEALTH_OK under the following circumstances:

  • You have not started the cluster yet (it will not respond).

  • You have just started or restarted the cluster and it is not ready yet, because the placement groups are being created and the OSDs are in the process of peering.

  • You have just added or removed an OSD.

  • You have just modified your cluster map.

An important aspect of monitoring OSDs is to ensure that when the cluster is up and running, all the OSDs in the cluster are up and running, too. To see if all the OSDs are running, execute:

root # ceph osd stat
x osds: y up, z in; epoch: eNNNN

The result should tell you the total number of OSDs (x), how many are 'up' (y), how many are 'in' (z), and the map epoch (eNNNN). If the number of OSDs that are 'in' the cluster is more than the number of OSDs that are 'up', execute the following command to identify the ceph-osd daemons that are not running:

root # ceph osd tree
#ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
-1       2.00000 pool openstack
-3       2.00000 rack dell-2950-rack-A
-2       2.00000 host dell-2950-A1
0   ssd 1.00000      osd.0                up  1.00000 1.00000
1   ssd 1.00000      osd.1              down  1.00000 1.00000

If an OSD with, for example, ID 1 is down, start it:

cephadm@osd > sudo systemctl start ceph-osd@1.service

See Section 17.12, “OSD Is Not Running” for problems associated with OSDs that have stopped or that will not restart.
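
The comparison between 'in' and 'up' OSDs described above can be automated. The following Python sketch parses a line in the format shown for ceph osd stat; the function name in_but_not_up and the sample lines are hypothetical, and the exact output format may differ between releases.

```python
import re

def in_but_not_up(osd_stat_line: str) -> int:
    """Return how many OSDs are counted as 'in' beyond those that are 'up'."""
    match = re.search(r"(\d+) osds: (\d+) up[^,]*, (\d+) in", osd_stat_line)
    if match is None:
        raise ValueError("unrecognized 'ceph osd stat' output")
    _total, up, in_cluster = (int(value) for value in match.groups())
    return max(in_cluster - up, 0)

# Hypothetical sample line following the format "x osds: y up, z in; epoch: eNNNN".
print(in_but_not_up("3 osds: 2 up, 3 in; epoch: e42"))
```

A result greater than zero suggests running ceph osd tree to find the daemons that are down.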

17.11.2 Placement Group Sets

When CRUSH assigns placement groups to OSDs, it looks at the number of replicas for the pool and assigns the placement group to OSDs such that each replica of the placement group gets assigned to a different OSD. For example, if the pool requires three replicas of a placement group, CRUSH may assign them to osd.1, osd.2 and osd.3 respectively. CRUSH actually seeks a pseudo-random placement that will take into account failure domains you set in your CRUSH Map, so you will rarely see placement groups assigned to nearest neighbor OSDs in a large cluster. We refer to the set of OSDs that should contain the replicas of a particular placement group as the acting set. In some cases, an OSD in the acting set is down or otherwise not able to service requests for objects in the placement group. When these situations arise, it may match one of the following scenarios:

  • You added or removed an OSD. Then, CRUSH reassigned the placement group to other OSDs and therefore changed the composition of the acting set, causing the migration of data with a 'backfill' process.

  • An OSD was 'down', was restarted, and is now recovering.

  • An OSD in the acting set is 'down' or unable to service requests, and another OSD has temporarily assumed its duties.

    Ceph processes a client request using the up set, which is the set of OSDs that will actually handle the requests. In most cases, the up set and the acting set are identical. When they are not, it may indicate that Ceph is migrating data, an OSD is recovering, or that there is a problem (for example, Ceph usually echoes a HEALTH_WARN state with a 'stuck stale' message in such scenarios).

To retrieve a list of placement groups, run:

cephadm@adm > ceph pg dump

To view which OSDs are within the acting set or the up set for a given placement group, run:

cephadm@adm > ceph pg map PG_NUM
osdmap eNNN pg RAW_PG_NUM (PG_NUM) -> up [0,1,2] acting [0,1,2]

The result should tell you the osdmap epoch (eNNN), the placement group number (PG_NUM), the OSDs in the up set ('up'), and the OSDs in the acting set ('acting').

Tip: Cluster Problem Indicator

If the up set and acting set do not match, this may be an indicator either of the cluster rebalancing itself, or of a potential problem with the cluster.
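
To compare the two sets without reading the output by eye, the line returned by ceph pg map can be parsed. The following Python sketch assumes the output format shown above; the function name parse_pg_map and the sample line are hypothetical.

```python
import re
from typing import List, Tuple

def parse_pg_map(line: str) -> Tuple[List[int], List[int]]:
    """Extract the up set and the acting set from a 'ceph pg map' output line."""
    match = re.search(r"up \[([\d,]*)\] acting \[([\d,]*)\]", line)
    if match is None:
        raise ValueError("unrecognized 'ceph pg map' output")

    def to_ids(text: str) -> List[int]:
        # An empty set is printed as "[]", so skip empty fragments.
        return [int(item) for item in text.split(",") if item]

    return to_ids(match.group(1)), to_ids(match.group(2))

# Hypothetical sample line in the documented format.
up, acting = parse_pg_map("osdmap e13 pg 0.1f (0.1f) -> up [0,1,2] acting [0,1,2]")
print("sets match" if up == acting else "sets differ")
```

A mismatch is not an error in itself. As noted above, it can also mean that the cluster is rebalancing.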

17.11.3 Peering

Before you can write data to a placement group, it must be in an 'active' state, and it should be in a 'clean' state. For Ceph to determine the current state of a placement group, the primary OSD of the placement group (the first OSD in the acting set) peers with the secondary and tertiary OSDs to establish agreement on the current state of the placement group (assuming a pool with three replicas of the PG).

Figure 17.2: Peering Schema

17.11.4 Monitoring Placement Group States

If you execute a command such as ceph health, ceph -s or ceph -w, you may notice that the cluster does not always echo back the HEALTH_OK message. After you check to see if the OSDs are running, you should also check placement group states.

Expect that the cluster will not echo HEALTH_OK in a number of placement group peering-related circumstances:

  • You have just created a pool and placement groups have not peered yet.

  • The placement groups are recovering.

  • You have just added an OSD to or removed an OSD from the cluster.

  • You have just modified your CRUSH Map and your placement groups are migrating.

  • There is inconsistent data in different replicas of a placement group.

  • Ceph is scrubbing a placement group’s replicas.

  • Ceph does not have enough storage capacity to complete backfilling operations.

If one of the above-mentioned circumstances causes Ceph to echo HEALTH_WARN, do not panic. In many cases, the cluster will recover on its own. In some cases, you may need to take action. An important aspect of monitoring placement groups is to ensure that when the cluster is up and running, all placement groups are 'active' and preferably in the 'clean' state. To see the status of all placement groups, run:

cephadm@adm > ceph pg stat
x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

The result should tell you the total number of placement groups (x), how many placement groups are in a particular state such as 'active+clean' (y) and the amount of data stored (z).

In addition to the placement group states, Ceph also echoes back the amount of storage capacity used (aa), the amount of storage capacity remaining (bb), and the total storage capacity (cc). These numbers can be important in a few cases:

  • You are reaching your near full ratio or full ratio.

  • Your data is not getting distributed across the cluster because of an error in your CRUSH configuration.

Tip: Placement Group IDs

Placement group IDs consist of the pool number (not pool name) followed by a period (.) and the placement group ID—a hexadecimal number. You can view pool numbers and their names from the output of ceph osd lspools. For example, the default pool rbd corresponds to pool number 0. A fully qualified placement group ID has the following form:

POOL_NUM.PG_ID

And it typically looks like this:

0.1f
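
Because the part after the period is hexadecimal, it is easy to misread. The following Python sketch splits such an ID into its two numeric parts; the function name parse_pg_id is hypothetical.

```python
from typing import Tuple

def parse_pg_id(pg_id: str) -> Tuple[int, int]:
    """Split a POOL_NUM.PG_ID string into (pool number, placement group number)."""
    pool_num, pg_hex = pg_id.split(".")
    # The pool number is decimal, the placement group part is hexadecimal.
    return int(pool_num), int(pg_hex, 16)

print(parse_pg_id("0.1f"))  # prints (0, 31)
```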

To retrieve a list of placement groups, run the following:

cephadm@adm > ceph pg dump

You can also format the output in JSON format and save it to a file:

cephadm@adm > ceph pg dump -o FILE_NAME --format=json

To query a particular placement group, run the following:

cephadm@adm > ceph pg POOL_NUM.PG_ID query

The following list describes the common placement group states in detail.

CREATING

When you create a pool, it will create the number of placement groups you specified. Ceph will echo 'creating' when it is creating one or more placement groups. When they are created, the OSDs that are part of the placement group’s acting set will peer. When peering is complete, the placement group status should be 'active+clean', which means that a Ceph client can begin writing to the placement group.

Figure 17.3: Placement Groups Status

PEERING

When Ceph is peering a placement group, it is bringing the OSDs that store the replicas of the placement group into agreement about the state of the objects and metadata in the placement group. When Ceph completes peering, this means that the OSDs that store the placement group agree about the current state of the placement group. However, completion of the peering process does not mean that each replica has the latest contents.

Note: Authoritative History

Ceph will not acknowledge a write operation to a client until all OSDs of the acting set persist the write operation. This practice ensures that at least one member of the acting set will have a record of every acknowledged write operation since the last successful peering operation.

With an accurate record of each acknowledged write operation, Ceph can construct and enlarge a new authoritative history of the placement group—a complete and fully ordered set of operations that, if performed, would bring an OSD’s copy of a placement group up to date.

ACTIVE

When Ceph completes the peering process, a placement group may become 'active'. The 'active' state means that the data in the placement group is generally available in the primary placement group and the replicas for read and write operations.

CLEAN

When a placement group is in the 'clean' state, the primary OSD and the replica OSDs have successfully peered and there are no stray replicas for the placement group. Ceph replicated all objects in the placement group the correct number of times.

DEGRADED

When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the placement group will remain in a 'degraded' state until the primary OSD has received an acknowledgement from the replica OSDs that Ceph created the replica objects successfully.

The reason a placement group can be 'active+degraded' is that an OSD may be 'active' even though it does not hold all of the objects yet. If an OSD goes down, Ceph marks each placement group assigned to the OSD as 'degraded'. The OSDs must peer again when the OSD comes back up. However, a client can still write a new object to a degraded placement group if it is 'active'.

If an OSD is 'down' and the 'degraded' condition persists, Ceph may mark the down OSD as 'out' of the cluster and remap the data from the 'down' OSD to another OSD. The time between being marked 'down' and being marked 'out' is controlled by the mon osd down out interval option, which is set to 600 seconds by default.

A placement group can also be 'degraded' because Ceph cannot find one or more objects that should be in the placement group. While you cannot read or write to unfound objects, you can still access all of the other objects in the 'degraded' placement group.

RECOVERING

Ceph was designed for fault-tolerance at a scale where hardware and software problems are ongoing. When an OSD goes 'down', its contents may fall behind the current state of other replicas in the placement groups. When the OSD is back 'up', the contents of the placement groups must be updated to reflect the current state. During that time period, the OSD may reflect a 'recovering' state.

Recovery is not always trivial, because a hardware failure may cause a cascading failure of multiple OSDs. For example, a network switch for a rack or cabinet may fail, which can cause the OSDs of a number of host machines to fall behind the current state of the cluster. Each of the OSDs must recover when the fault is resolved.

Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data objects and restore the placement groups to the current state. The osd recovery delay start setting allows an OSD to restart, re-peer and even process some replay requests before starting the recovery process. The osd recovery thread timeout sets a thread timeout, because multiple OSDs may fail, restart and re-peer at staggered rates. The osd recovery max active setting limits the number of recovery requests an OSD will process simultaneously to prevent the OSD from failing to serve. The osd recovery max chunk setting limits the size of the recovered data chunks to prevent network congestion.

BACKFILLING

When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the reassigned placement groups immediately can put excessive load on the new OSD. Backfilling the OSD with the placement groups allows this process to begin in the background. When backfilling is complete, the new OSD will begin serving requests when it is ready.

During the backfill operations, you may see one of several states: 'backfill_wait' indicates that a backfill operation is pending, but is not yet in progress; 'backfilling' indicates that a backfill operation is in progress; 'backfill_toofull' indicates that a backfill operation was requested, but could not be completed because of insufficient storage capacity. When a placement group cannot be backfilled, it may be considered 'incomplete'.

Ceph provides a number of settings to manage the load associated with reassigning placement groups to an OSD (especially a new OSD). By default, osd max backfills sets the maximum number of concurrent backfills to or from an OSD to 1. The backfill full ratio enables an OSD to refuse a backfill request if the OSD is approaching its full ratio (90%, by default). You can change it with the ceph osd set-backfillfull-ratio command. If an OSD refuses a backfill request, the osd backfill retry interval enables an OSD to retry the request (after 10 seconds, by default). OSDs can also set osd backfill scan min and osd backfill scan max to manage scan intervals (64 and 512, by default).

REMAPPED

When the acting set that services a placement group changes, the data migrates from the old acting set to the new acting set. It may take some time for a new primary OSD to service requests. So it may ask the old primary to continue to service requests until the placement group migration is complete. When data migration completes, the mapping uses the primary OSD of the new acting set.

STALE

While Ceph uses heartbeats to ensure that hosts and daemons are running, the ceph-osd daemons may also get into a 'stuck' state where they are not reporting statistics in a timely manner (for example, a temporary network fault). By default, OSD daemons report their placement group, boot and failure statistics every half second (0.5), which is more frequent than the heartbeat thresholds. If the primary OSD of a placement group’s acting set fails to report to the monitor or if other OSDs have reported the primary OSD as 'down', the monitors will mark the placement group as 'stale'.

When you start your cluster, it is common to see the 'stale' state until the peering process completes. After your cluster has been running for a while, seeing placement groups in the 'stale' state indicates that the primary OSD for those placement groups is down or not reporting placement group statistics to the monitor.

17.11.5 Identifying Troubled Placement Groups

As previously noted, a placement group is not necessarily problematic just because its state is not 'active+clean'. However, when placement groups get stuck, Ceph’s ability to self-repair may not be working. The stuck states include:

  • Unclean: Placement groups contain objects that are not replicated the required number of times. They should be recovering.

  • Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come back up.

  • Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster in a while (configured by the mon osd report timeout option).

To identify stuck placement groups, run the following:

cephadm@adm > ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

17.11.6 Finding an Object Location

To store object data in the Ceph Object Store, a Ceph client needs to set an object name and specify a related pool. The Ceph client retrieves the latest cluster map and the CRUSH algorithm calculates how to map the object to a placement group, and then calculates how to assign the placement group to an OSD dynamically. To find the object location, all you need is the object name and the pool name. For example:

cephadm@adm > ceph osd map POOL_NAME OBJECT_NAME [NAMESPACE]
Example 17.1: Locating an Object

As an example, let us create an object. Specify an object name 'test-object-1', a path to an example file 'testfile.txt' containing some object data, and a pool name 'data' using the rados put command on the command line:

cephadm@adm > rados put test-object-1 testfile.txt --pool=data

To verify that the Ceph Object Store stored the object, run the following:

cephadm@adm > rados -p data ls

Now, identify the object location. Ceph will output the object’s location:

cephadm@adm > ceph osd map data test-object-1
osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 \
(0.4) -> up ([1,0], p0) acting ([1,0], p0)

To remove the example object, simply delete it using the rados rm command:

cephadm@adm > rados rm test-object-1 --pool=data

17.12 OSD Is Not Running

Under normal circumstances, simply restarting the ceph-osd daemon will allow it to rejoin the cluster and recover.

17.12.1 An OSD Will Not Start

If you start your cluster and an OSD will not start, check the following:

  • Configuration File: If you were not able to get OSDs running from a new installation, check your configuration file to ensure it conforms (for example, host and not hostname).

  • Check Paths: Check the paths in your configuration, and the actual paths themselves for data and journals. If you separate the OSD data from the journal data and there are errors in your configuration file or in the actual mounts, you may have trouble starting OSDs. If you want to store the journal on a block device, you need to partition your journal disk and assign one partition per OSD.

  • Check Max Threadcount: If you have a node with a lot of OSDs, you may be hitting the default maximum number of threads (usually 32,000), especially during recovery. Use the sysctl command to check whether raising the limit to the maximum allowed value (for example, 4194303) helps:

    root # sysctl -w kernel.pid_max=4194303

    If increasing the maximum thread count resolves the issue, you can make it permanent by including a kernel.pid_max setting in the /etc/sysctl.conf file:

    kernel.pid_max = 4194303

17.12.2 An OSD Failed

When the ceph-osd process dies, the monitor will learn about the failure from surviving ceph-osd daemons and report it via the ceph health command:

cephadm@adm > ceph health
HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ceph-osd processes that are marked 'in' and 'down'. You can identify which ceph-osds are down with:

cephadm@adm > ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

If there is a disk failure or other fault preventing ceph-osd from functioning or restarting, an error message should be present in its log file in /var/log/ceph.

If the daemon stopped because of a heartbeat failure, the underlying kernel file system may be unresponsive. Check the dmesg command output for disk or other kernel errors.

17.12.3 No Free Disk Space

Ceph prevents you from writing to a full OSD to prevent losing data. In an operational cluster, you should receive a warning when your cluster is getting near its full ratio. The mon osd full ratio option defaults to 0.95, or 95% of capacity before it stops clients from writing data. The mon osd backfillfull ratio defaults to 0.90, or 90% of capacity when it blocks backfills from starting. The OSD nearfull ratio defaults to 0.85, or 85% of capacity when it generates a health warning. You can change the value of 'nearfull' with the following command, where RATIO is a value between 0.0 and 1.0:

cephadm@adm > ceph osd set-nearfull-ratio RATIO

Full cluster issues usually arise when testing how Ceph handles an OSD failure on a small cluster. When one node has a high percentage of the cluster’s data, the cluster can easily eclipse its 'nearfull' and 'full' ratio immediately. If you are testing how Ceph reacts to OSD failures on a small cluster, you should leave sufficient free disk space and consider temporarily lowering the OSD full ratio, OSD backfillfull ratio and OSD nearfull ratio using these commands, where each RATIO is a value between 0.0 and 1.0:

cephadm@adm > ceph osd set-nearfull-ratio RATIO
cephadm@adm > ceph osd set-full-ratio RATIO
cephadm@adm > ceph osd set-backfillfull-ratio RATIO

Full Ceph OSDs are reported by ceph health:

cephadm@adm > ceph health
HEALTH_WARN 1 nearfull osd(s)

or

cephadm@adm > ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
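
The three ratios form an ordered scale, which the following Python sketch illustrates using the default values quoted in this section. The function name classify_osd is hypothetical, and treating a value exactly equal to a threshold as exceeding it is an assumption.

```python
def classify_osd(used_fraction: float, nearfull: float = 0.85,
                 backfillfull: float = 0.90, full: float = 0.95) -> str:
    """Map an OSD's used fraction to the most severe threshold it has reached."""
    # Assumption: reaching a threshold exactly already counts as exceeding it.
    if used_fraction >= full:
        return "full"
    if used_fraction >= backfillfull:
        return "backfillfull"
    if used_fraction >= nearfull:
        return "nearfull"
    return "ok"

# Utilization values taken from the example output above.
for osd, used in {"osd.3": 0.97, "osd.4": 0.91, "osd.2": 0.87}.items():
    print(osd, classify_osd(used))
```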

The best way to deal with a full cluster is to add new Ceph OSDs, allowing the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by deleting some placement group directories in the full OSD.

Important: Deleting a Placement Group Directory

If you choose to delete a placement group directory on a full OSD, do not delete the same placement group directory on another full OSD, or you may lose data. You must maintain at least one copy of your data on at least one OSD.

18 Monitoring and Alerting

In SUSE Enterprise Storage 6, DeepSea no longer deploys a monitoring and alerting stack on the Salt master. Users have to define the Prometheus role for Prometheus and Alertmanager, and the Grafana role for Grafana. When multiple nodes are assigned with the Prometheus or Grafana role, a highly available setup is deployed.

  • Prometheus is the monitoring and alerting toolkit.

  • Alertmanager handles alerts sent by the Prometheus server.

  • Grafana is the visualization and alerting software.

  • The prometheus-node_exporter is the service running on all Salt minions.

The Prometheus configuration and scrape targets (exporting daemons) are set up automatically by DeepSea. DeepSea also deploys a list of default alerts, for example health error, 10% OSDs down, or pgs inactive.

18.1 Pillar Variables

The Salt pillar is a key-value store that provides information and configuration values to minions. It is available to all minions, each with differing content. The Salt pillar is pre-populated with default values and can be customized in two different ways:

  • /srv/pillar/ceph/stack/global.yml: to change pillar variables for all nodes.

  • /srv/pillar/ceph/stack/CLUSTER_NAME/minions/HOST: to change specific minion configurations.

The pillar variables below are available to all nodes by default:

monitoring:
  alertmanager:
    config: salt://path/to/config
    additional_flags: ''
  grafana:
    ssl_cert: False # self-signed certs are created by default
    ssl_key: False # self-signed certs are created by default
  prometheus:
    # pass additional configuration to prometheus
    additional_flags: ''
    alert_relabel_config: []
    rule_files: []
    # per exporter config variables
    scrape_interval:
      ceph: 10
      node_exporter: 10
      prometheus: 10
      grafana: 10
    relabel_config:
      alertmanager: []
      ceph: []
      node_exporter: []
      prometheus: []
      grafana: []
    metric_relabel_config:
      ceph: []
      node_exporter: []
      prometheus: []
      grafana: []
    target_partition:
      ceph: '1/1'
      node_exporter: '1/1'
      prometheus: '1/1'
      grafana: '1/1'

18.2 Grafana

All traffic through Grafana is encrypted. You can either supply your own SSL certificates or create self-signed ones.

Grafana uses the following variables:

  • ssl_cert

  • ssl_key

The Ceph Dashboard embeds the Grafana dashboards via HTML iframe elements. If Grafana is configured without SSL/TLS support or if SSL uses self-signed certificates, most browsers will block the embedding of insecure content into a secured web page if the SSL support in the dashboard has been enabled (which is the default configuration). If you cannot see the embedded Grafana dashboards in Ceph Dashboard, check your browser's documentation on how to unblock mixed content or how to accept self-signed certificates. Alternatively, consider enabling SSL/TLS support in Grafana, using a certificate that is issued by a certificate authority (CA) known to the browser.

For more information on supplying your own SSL certificates, see Section 13.1.3, “Certificates Signed by CA”. For generating a self-signed or trusted third-party certificate using OpenSSL, see Section 13.1.2, “Self-signed or Trusted Third-party Certificate with OpenSSL”. For creating your own CA signed certificate, see Section 13.1.1, “Self-signed Certificates”. For creating your own custom CA signed certificate, see Section 13.1.4, “Certificates Signed with a Custom CA”.

18.3 Prometheus

The exporter-based configuration can be passed through the pillar. The configuration groups map to the exporters that provide data: the node exporter is present on all nodes, Ceph is exported by the Ceph Manager nodes, and Prometheus and Grafana are exported by the respective Prometheus and Grafana nodes.

Prometheus uses the following variables:

  • scrape_interval: change the scrape interval, how often an exporter is to be scraped.

  • target_partition: partition scrape targets when multiple Prometheus instances are deployed, so that some instances scrape only part of all exporter instances.

  • relabel_config: dynamically rewrites the label set of a target before it gets scraped. Multiple relabeling steps can be configured per scrape configuration.

  • metric_relabel_config: applied to samples as the last step before ingestion.
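The target_partition values, such as '1/1', can be read as "partition I of N". The following Python sketch illustrates one way such partitioning could work, assuming that I is 1-based and that each target is assigned to exactly one partition by hashing its address. The function names, the sample targets, and the use of MD5 are assumptions made for illustration; this is not DeepSea's or Prometheus' actual implementation.

```python
import hashlib

def parse_partition(spec):
    """Parse an 'I/N' string into (index, total); I is assumed to be 1-based."""
    index, total = (int(part) for part in spec.split("/"))
    if not 1 <= index <= total:
        raise ValueError(f"invalid partition spec: {spec!r}")
    return index, total

def in_partition(target, spec):
    """Return True if the target hashes into the given partition."""
    index, total = parse_partition(spec)
    digest = hashlib.md5(target.encode("utf-8")).hexdigest()
    return int(digest, 16) % total == index - 1

# Hypothetical node exporter targets.
targets = ["node1:9100", "node2:9100", "node3:9100", "node4:9100"]
first = [t for t in targets if in_partition(t, "1/2")]
second = [t for t in targets if in_partition(t, "2/2")]
print(first, second)
```

With the default '1/1', every target falls into the single partition, so one Prometheus instance scrapes all exporters.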

18.3.1 Security Model

Prometheus' security model presumes that untrusted users have access to the Prometheus HTTP endpoint and logs. Untrusted users have access to all the (meta-)data Prometheus collects that is contained in the database, plus a variety of operational and debugging information.

However, Prometheus' HTTP API is limited to read-only operations. Configurations cannot be changed using the API, and secrets are not exposed. Moreover, Prometheus has some built-in measures to mitigate the impact of denial of service attacks.

18.4 Alertmanager

The Alertmanager handles alerts sent by the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver. It also takes care of silencing alerts. Alertmanager is configured via command line flags and a configuration file that defines inhibition rules, notification routing, and notification receivers.

18.4.1 Configuration File

Alertmanager's configuration is different for each deployment. Therefore, DeepSea does not ship any related defaults. You need to provide your own alertmanager.yml configuration file. The alertmanager package by default installs a configuration file /etc/prometheus/alertmanager.yml which can serve as an example configuration. If you prefer to have your Alertmanager configuration managed by DeepSea, add the following key to your pillar, for example to the /srv/pillar/ceph/stack/ceph/minions/YOUR_SALT_MASTER_MINION_ID.sls file:

monitoring:
 alertmanager_config:
   /path/to/your/alertmanager/config.yml

For a complete example of Alertmanager's configuration file, see Section 18.5, “Troubleshooting Alerts”.

Alertmanager's configuration file is written in the YAML format. It follows the scheme described below. Parameters in brackets are optional. If a non-list parameter is omitted, its value is set to the specified default. The following generic placeholders are used in the scheme:

DURATION

A duration matching the regular expression [0-9]+(ms|[smhdwy])

LABELNAME

A string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*

LABELVALUE

A string of Unicode characters.

FILEPATH

A valid path in the current working directory.

BOOLEAN

A Boolean that can take the values 'true' or 'false'.

STRING

A regular string.

SECRET

A regular string that is a secret, for example a password.

TMPL_STRING

A string which is template-expanded before usage.

TMPL_SECRET

A secret string which is template-expanded before usage.
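The regular expressions given for DURATION and LABELNAME can be used to check values before they are written to the configuration file. The following Python sketch does this with the standard re module. The helper names are illustrative, and it is an assumption of this sketch that the whole string has to match the expression.

```python
import re

# Patterns copied from the placeholder definitions above.
DURATION_RE = re.compile(r"[0-9]+(ms|[smhdwy])")
LABELNAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_duration(value):
    """True if the whole string is a valid DURATION, for example '5m'."""
    return DURATION_RE.fullmatch(value) is not None

def is_labelname(value):
    """True if the whole string is a valid LABELNAME, for example 'severity'."""
    return LABELNAME_RE.fullmatch(value) is not None

for candidate in ("5m", "30s", "250ms", "5 minutes"):
    print(candidate, is_duration(candidate))
```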

Example 18.1: Global Configuration

Parameters in the global: configuration are valid in all other configuration contexts. They also serve as defaults for other configuration sections.

global:
# the time after which an alert is declared resolved if it has not been updated
[ resolve_timeout: DURATION | default = 5m ]

# The default SMTP From header field.
[ smtp_from: TMPL_STRING ]
# The default SMTP smarthost used for sending emails, including port number.
# Port number usually is 25, or 587 for SMTP over TLS
# (sometimes referred to as STARTTLS).
# Example: smtp.example.org:587
[ smtp_smarthost: STRING ]
# The default host name to identify to the SMTP server.
[ smtp_hello: STRING | default = "localhost" ]
[ smtp_auth_username: STRING ]
# SMTP Auth using LOGIN and PLAIN.
[ smtp_auth_password: SECRET ]
# SMTP Auth using PLAIN.
[ smtp_auth_identity: STRING ]
# SMTP Auth using CRAM-MD5.
[ smtp_auth_secret: SECRET ]
# The default SMTP TLS requirement.
[ smtp_require_tls: BOOLEAN | default = true ]

# The API URL to use for Slack notifications.
[ slack_api_url: STRING ]
[ victorops_api_key: STRING ]
[ victorops_api_url: STRING | default = "https://victorops.example.com/integrations/alert/" ]
[ pagerduty_url: STRING | default = "https://pagerduty.example.com/v2/enqueue" ]
[ opsgenie_api_key: STRING ]
[ opsgenie_api_url: STRING | default = "https://opsgenie.example.com/" ]
[ hipchat_api_url: STRING | default = "https://hipchat.example.com/" ]
[ hipchat_auth_token: SECRET ]
[ wechat_api_url: STRING | default = "https://wechat.example.com/cgi-bin/" ]
[ wechat_api_secret: SECRET ]
[ wechat_api_corp_id: STRING ]

# The default HTTP client configuration
[ http_config: HTTP_CONFIG ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
[ - FILEPATH ... ]

# The root node of the routing tree.
route: ROUTE

# A list of notification receivers.
receivers:
- RECEIVER ...

# A list of inhibition rules.
inhibit_rules:
[ - INHIBIT_RULE ... ]

Example 18.2: ROUTE

A ROUTE block defines a node in a routing tree. Unspecified parameters are inherited from its parent node. Every alert enters the routing tree at the configured top-level route, which needs to match all alerts. It then traverses the child nodes. If the continue option is set to 'false', the traversal stops after the first matching child. If the option is set to 'true' on a matching node, the alert continues matching against subsequent siblings. If an alert does not match any children of a node, the alert is handled based on the configuration parameters of the current node.

[ receiver: STRING ]
[ group_by: '[' LABELNAME, ... ']' ]

# If an alert should continue matching subsequent sibling nodes.
[ continue: BOOLEAN | default = false ]

# A set of equality matchers an alert has to fulfill to match a node.
match:
 [ LABELNAME: LABELVALUE, ... ]

# A set of regex-matchers an alert has to fulfill to match a node.
match_re:
 [ LABELNAME: REGEX, ... ]

# Time to wait before sending a notification for a group of alerts.
[ group_wait: DURATION | default = 30s ]

# Time to wait before sending a notification about new alerts
# added to a group of alerts for which an initial notification has
# already been sent.
[ group_interval: DURATION | default = 5m ]

# Time to wait before re-sending a notification
[ repeat_interval: DURATION | default = 4h ]

# Possible child routes.
routes:
 [ - ROUTE ... ]
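The traversal described for ROUTE can be sketched in Python as follows. This is a simplified model under stated assumptions, not Alertmanager's implementation: routes are plain dictionaries, only the receiver is inherited, regular expressions have to match the whole label value, and the receiver names are hypothetical.

```python
import re

def matches(route, labels):
    """True if the alert labels satisfy the route's match and match_re sets."""
    for name, value in route.get("match", {}).items():
        if labels.get(name) != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True

def resolve(route, labels, inherited=None):
    """Return the receivers that handle an alert, walking the tree depth-first."""
    receiver = route.get("receiver", inherited)
    found = []
    for child in route.get("routes", []):
        if matches(child, labels):
            found.extend(resolve(child, labels, receiver))
            # Stop at the first matching child unless 'continue' is set.
            if not child.get("continue", False):
                break
    # If no child matched, the current node handles the alert.
    return found or [receiver]

tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pager", "continue": True},
        {"match_re": {"type": "ses_.*"}, "receiver": "storage-team"},
    ],
}
print(resolve(tree, {"severity": "critical", "type": "ses_default"}))
print(resolve(tree, {"severity": "info"}))
```

In this example, a critical alert of type ses_default is delivered to both child receivers because the first child sets continue, while an alert that matches no child falls back to the top-level receiver.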

Example 18.3: INHIBIT_RULE

An inhibition rule mutes a target alert that matches a set of matchers when a source alert exists that matches another set of matchers. Both alerts need to share the same label values for the label names in the equal list.

Alerts can match and therefore inhibit themselves. Do not write inhibition rules where an alert matches both source and target.

# Matchers that need to be fulfilled for the alerts to be muted.
target_match:
 [ LABELNAME: LABELVALUE, ... ]
target_match_re:
 [ LABELNAME: REGEX, ... ]

# Matchers for which at least one alert needs to exist so that the
# inhibition occurs.
source_match:
 [ LABELNAME: LABELVALUE, ... ]
source_match_re:
 [ LABELNAME: REGEX, ... ]

# Labels with an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' LABELNAME, ... ']' ]
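The inhibition logic described above can be modelled in a few lines of Python. The sketch below treats alerts as dictionaries of labels and a rule as a dictionary with the keys from the scheme. The alert names and label values are hypothetical, and treating a missing label as an empty string is an assumption of this sketch.

```python
import re

def _matches(labels, equal_matchers, regex_matchers):
    """True if the labels satisfy all equality and regex matchers."""
    for name, value in equal_matchers.items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in regex_matchers.items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True

def is_inhibited(target, active_alerts, rule):
    """True if the rule mutes the target alert given the active alerts."""
    if not _matches(target, rule.get("target_match", {}), rule.get("target_match_re", {})):
        return False
    for source in active_alerts:
        if not _matches(source, rule.get("source_match", {}), rule.get("source_match_re", {})):
            continue
        # Source and target must agree on every label in the 'equal' list.
        if all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", [])):
            return True
    return False

rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "OSDDown", "instance": "node1", "severity": "critical"}
warning = {"alertname": "OSDDown", "instance": "node1", "severity": "warning"}
other = {"alertname": "OSDDown", "instance": "node2", "severity": "warning"}
active = [critical, warning, other]
print(is_inhibited(warning, active, rule), is_inhibited(other, active, rule))
```

Because the source and target matchers in this rule require different severity values, no alert can match both sides and inhibit itself.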

Example 18.4: HTTP_CONFIG

HTTP_CONFIG configures the HTTP client used by the receiver to communicate with API services.

Note that basic_auth, bearer_token and bearer_token_file options are mutually exclusive.

# Sets the 'Authorization' header with the user name and password.
basic_auth:
[ username: STRING ]
[ password: SECRET ]

# Sets the 'Authorization' header with the bearer token.
[ bearer_token: SECRET ]

# Sets the 'Authorization' header with the bearer token read from a file.
[ bearer_token_file: FILEPATH ]

# TLS settings.
tls_config:
# CA certificate to validate the server certificate with.
[ ca_file: FILEPATH ]
# Certificate and key files for client cert authentication to the server.
[ cert_file: FILEPATH ]
[ key_file: FILEPATH ]
# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: STRING ]
# Disable validation of the server certificate.
[ insecure_skip_verify: BOOLEAN | default = false ]

# Optional proxy URL.
[ proxy_url: STRING ]
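Because basic_auth, bearer_token, and bearer_token_file are mutually exclusive, it can help to check an HTTP_CONFIG section before deploying it. The following Python sketch assumes that the section has already been loaded into a dictionary; the function name is illustrative and not part of Alertmanager.

```python
AUTH_OPTIONS = ("basic_auth", "bearer_token", "bearer_token_file")

def check_http_config(http_config):
    """Return the authentication option in use, or None if there is none.

    Raises ValueError if more than one of the mutually exclusive
    options is set.
    """
    used = [name for name in AUTH_OPTIONS if http_config.get(name)]
    if len(used) > 1:
        raise ValueError("mutually exclusive options set: " + ", ".join(used))
    return used[0] if used else None

print(check_http_config({"bearer_token_file": "/path/to/token"}))
```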

Example 18.5: RECEIVER

Receiver is a named configuration for one or more notification integrations.

Instead of adding new receivers, we recommend implementing custom notification integrations using the webhook receiver (see Example 18.15, “WEBHOOK_CONFIG”).

# The unique name of the receiver.
name: STRING

# Configurations for several notification integrations.
email_configs:
[ - EMAIL_CONFIG, ... ]
hipchat_configs:
[ - HIPCHAT_CONFIG, ... ]
pagerduty_configs:
[ - PAGERDUTY_CONFIG, ... ]
pushover_configs:
[ - PUSHOVER_CONFIG, ... ]
slack_configs:
[ - SLACK_CONFIG, ... ]
opsgenie_configs:
[ - OPSGENIE_CONFIG, ... ]
webhook_configs:
[ - WEBHOOK_CONFIG, ... ]
victorops_configs:
[ - VICTOROPS_CONFIG, ... ]
wechat_configs:
[ - WECHAT_CONFIG, ... ]

Example 18.6: EMAIL_CONFIG

# Whether to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]

# The email address to send notifications to.
to: TMPL_STRING

# The sender address.
[ from: TMPL_STRING | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: STRING | default = global.smtp_smarthost ]

# The host name to identify to the SMTP server.
[ hello: STRING | default = global.smtp_hello ]

# SMTP authentication details.
[ auth_username: STRING | default = global.smtp_auth_username ]
[ auth_password: SECRET | default = global.smtp_auth_password ]
[ auth_secret: SECRET | default = global.smtp_auth_secret ]
[ auth_identity: STRING | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
[ require_tls: BOOLEAN | default = global.smtp_require_tls ]

# The HTML body of the email notification.
[ html: TMPL_STRING | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: TMPL_STRING ]

# Further email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { STRING: TMPL_STRING, ... } ]

Example 18.7: HIPCHAT_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]

# The HipChat Room ID.
room_id: TMPL_STRING
# The authentication token.
[ auth_token: SECRET | default = global.hipchat_auth_token ]
# The URL to send API requests to.
[ api_url: STRING | default = global.hipchat_api_url ]

# A label to be shown in addition to the sender's name.
[ from:  TMPL_STRING | default = '{{ template "hipchat.default.from" . }}' ]
# The message body.
[ message:  TMPL_STRING | default = '{{ template "hipchat.default.message" . }}' ]
# Whether this message will trigger a user notification.
[ notify:  BOOLEAN | default = false ]
# Determines how the message is treated by the alertmanager and rendered inside HipChat. Valid values are 'text' and 'html'.
[ message_format:  STRING | default = 'text' ]
# Background color for message.
[ color:  TMPL_STRING | default = '{{ if eq .Status "firing" }}red{{ else }}green{{ end }}' ]

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.8: PAGERDUTY_CONFIG

The routing_key and service_key options are mutually exclusive.

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]

# The PagerDuty integration key (when using 'Events API v2').
routing_key: TMPL_SECRET
# The PagerDuty integration key (when using 'Prometheus').
service_key: TMPL_SECRET

# The URL to send API requests to.
[ url: STRING | default = global.pagerduty_url ]

# The client identification of the Alertmanager.
[ client:  TMPL_STRING | default = '{{ template "pagerduty.default.client" . }}' ]
# A backlink to the notification sender.
[ client_url:  TMPL_STRING | default = '{{ template "pagerduty.default.clientURL" . }}' ]

# The incident description.
[ description: TMPL_STRING | default = '{{ template "pagerduty.default.description" .}}' ]

# Severity of the incident.
[ severity: TMPL_STRING | default = 'error' ]

# A set of arbitrary key/value pairs that provide further details.
[ details: { STRING: TMPL_STRING, ... } | default = {
 firing:       '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
 resolved:     '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
 num_firing:   '{{ .Alerts.Firing | len }}'
 num_resolved: '{{ .Alerts.Resolved | len }}'
} ]

# The HTTP client's configuration.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.9: PUSHOVER_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]

# The recipient user key.
user_key: SECRET

# Registered application’s API token.
token: SECRET

# Notification title.
[ title: TMPL_STRING | default = '{{ template "pushover.default.title" . }}' ]

# Notification message.
[ message: TMPL_STRING | default = '{{ template "pushover.default.message" . }}' ]

# A supplementary URL displayed together with the message.
[ url: TMPL_STRING | default = '{{ template "pushover.default.url" . }}' ]

# Priority.
[ priority: TMPL_STRING | default = '{{ if eq .Status "firing" }}2{{ else }}0{{ end }}' ]

# How often the Pushover servers will send the same notification (at least 30 seconds).
[ retry: DURATION | default = 1m ]

# How long your notification will continue to be retried (unless the user
# acknowledges the notification).
[ expire: DURATION | default = 1h ]

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.10: SLACK_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]

# The Slack webhook URL.
[ api_url: SECRET | default = global.slack_api_url ]

# The channel or user to send notifications to.
channel: TMPL_STRING

# API request data as defined by the Slack webhook API.
[ icon_emoji: TMPL_STRING ]
[ icon_url: TMPL_STRING ]
[ link_names: BOOLEAN | default = false ]
[ username: TMPL_STRING | default = '{{ template "slack.default.username" . }}' ]
# The following parameters define the attachment.
actions:
[ ACTION_CONFIG ... ]
[ color: TMPL_STRING | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ fallback: TMPL_STRING | default = '{{ template "slack.default.fallback" . }}' ]
fields:
[ FIELD_CONFIG ... ]
[ footer: TMPL_STRING | default = '{{ template "slack.default.footer" . }}' ]
[ pretext: TMPL_STRING | default = '{{ template "slack.default.pretext" . }}' ]
[ short_fields: BOOLEAN | default = false ]
[ text: TMPL_STRING | default = '{{ template "slack.default.text" . }}' ]
[ title: TMPL_STRING | default = '{{ template "slack.default.title" . }}' ]
[ title_link: TMPL_STRING | default = '{{ template "slack.default.titlelink" . }}' ]
[ image_url: TMPL_STRING ]
[ thumb_url: TMPL_STRING ]

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.11: ACTION_CONFIG for SLACK_CONFIG

# The action type. Set to 'button' to tell Slack to render a button.
type: TMPL_STRING
# Label for the button.
text: TMPL_STRING
# http or https URL to deliver users to. If you specify invalid URLs, the message will be posted with no button.
url: TMPL_STRING
# If set to 'primary', the button is green, indicating the best forward action to take.
# 'danger' turns the button red, indicating a destructive action.
[ style: TMPL_STRING | default = '' ]

Example 18.12: FIELD_CONFIG for SLACK_CONFIG

# A bold heading without markup above the value text.
title: TMPL_STRING
# The text of the field. It can span across several lines.
value: TMPL_STRING
# A flag indicating if value is short enough to be displayed together with other values.
[ short: BOOLEAN | default = slack_config.short_fields ]

Example 18.13: OPSGENIE_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]

# The API key to use with the OpsGenie API.
[ api_key: SECRET | default = global.opsgenie_api_key ]

# The host to send OpsGenie API requests to.
[ api_url: STRING | default = global.opsgenie_api_url ]

# Alert text (maximum is 130 characters).
[ message: TMPL_STRING ]

# A description of the incident.
[ description: TMPL_STRING | default = '{{ template "opsgenie.default.description" . }}' ]

# A backlink to the sender.
[ source: TMPL_STRING | default = '{{ template "opsgenie.default.source" . }}' ]

# A set of arbitrary key/value pairs that provide further detail.
[ details: { STRING: TMPL_STRING, ... } ]

# Comma-separated list of teams responsible for notifications.
[ teams: TMPL_STRING ]

# Comma-separated list of tags attached to the notifications.
[ tags: TMPL_STRING ]

# Additional alert note.
[ note: TMPL_STRING ]

# Priority level of alert, one of P1, P2, P3, P4, and P5.
[ priority: TMPL_STRING ]

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.14: VICTOROPS_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]

# The API key for talking to the VictorOps API.
[ api_key: SECRET | default = global.victorops_api_key ]

# The VictorOps API URL.
[ api_url: STRING | default = global.victorops_api_url ]

# A key used to map the alert to a team.
routing_key: TMPL_STRING

# Describes the behavior of the alert (one of 'CRITICAL', 'WARNING', 'INFO').
[ message_type: TMPL_STRING | default = 'CRITICAL' ]

# Summary of the alerted problem.
[ entity_display_name: TMPL_STRING | default = '{{ template "victorops.default.entity_display_name" . }}' ]

# Long explanation of the alerted problem.
[ state_message: TMPL_STRING | default = '{{ template "victorops.default.state_message" . }}' ]

# The monitoring tool the state message is from.
[ monitoring_tool: TMPL_STRING | default = '{{ template "victorops.default.monitoring_tool" . }}' ]

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Example 18.15: WEBHOOK_CONFIG

You can use the webhook receiver to configure a generic receiver.

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]

# The endpoint for sending HTTP POST requests.
url: STRING

# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]

Alertmanager sends HTTP POST requests in the following JSON format:

{
 "version": "4",
 "groupKey": STRING, // identification of the group of alerts (to deduplicate)
 "status": "<resolved|firing>",
 "receiver": STRING,
 "groupLabels": OBJECT,
 "commonLabels": OBJECT,
 "commonAnnotations": OBJECT,
 "externalURL": STRING, // backlink to Alertmanager.
 "alerts": [
   {
     "status": "<resolved|firing>",
     "labels": OBJECT,
     "annotations": OBJECT,
     "startsAt": "<rfc3339>",
     "endsAt": "<rfc3339>",
     "generatorURL": STRING // identifies the entity that caused the alert
   },
   ...
 ]
}
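A custom integration that receives these POST requests first needs to decode the JSON body. The following Python sketch shows this step only, using the field names from the format above. The sample payload is made up for illustration; an actual payload contains no comments, and the alertname label is assumed to be present because Prometheus sets it on alerts.

```python
import json

def summarize(payload_text):
    """Return one line per alert contained in a webhook notification."""
    payload = json.loads(payload_text)
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        lines.append(f"{alert['status']}: {name} (since {alert['startsAt']})")
    return lines

# Hypothetical notification body, following the documented format.
sample = json.dumps({
    "version": "4",
    "groupKey": "example-group",
    "status": "firing",
    "receiver": "webhook",
    "groupLabels": {"alertname": "high pg count deviation"},
    "commonLabels": {"severity": "warning"},
    "commonAnnotations": {},
    "externalURL": "http://alertmanager.example.com:9093",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "high pg count deviation", "severity": "warning"},
            "annotations": {},
            "startsAt": "2020-11-25T10:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus.example.com:9090/graph",
        }
    ],
})
print(summarize(sample))
```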

The webhook receiver allows for integration with the following notification mechanisms:

  • DingTalk (https://github.com/timonwong/prometheus-webhook-dingtalk)

  • IRC Bot (https://github.com/multimfi/bot)

  • JIRAlert (https://github.com/free/jiralert)

  • Phabricator / Maniphest (https://github.com/knyar/phalerts)

  • prom2teams: forwards notifications to Microsoft Teams (https://github.com/idealista/prom2teams)

  • SMS: supports multiple providers (https://github.com/messagebird/sachet)

  • Telegram bot (https://github.com/inCaller/prometheus_bot)

  • SNMP trap (https://github.com/SUSE/prometheus-webhook-snmp)

Example 18.16: WECHAT_CONFIG

# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]

# The API key to use for the WeChat API.
[ api_secret: SECRET | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url: STRING | default = global.wechat_api_url ]

# The corp id used to authenticate.
[ corp_id: STRING | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message: TMPL_STRING | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: STRING | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: STRING | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: STRING | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: STRING | default = '{{ template "wechat.default.to_tag" . }}' ]

18.4.2 Custom Alerts

You can define custom alert conditions to send notifications to an external service. Prometheus uses its own expression language for defining custom alerts. The following is an example of a rule with an alert:

groups:
- name: example
  rules:
  # alert on high deviation from average PG count
  - alert: high pg count deviation
    expr: abs(((ceph_osd_pgs > 0) - on (job) group_left avg(ceph_osd_pgs > 0) by (job)) / on (job) group_left avg(ceph_osd_pgs > 0) by (job)) > 0.35
    for: 5m
    labels:
      severity: warning
      type: ses_default
    annotations:
      description: >
        OSD {{ $labels.osd }} deviates by more than 35% from average PG count

The optional for clause specifies the time Prometheus will wait between first encountering a new expression output vector element and counting an alert as firing. In this case, Prometheus will check that the alert continues to be active for 5 minutes before firing the alert. Elements in a pending state are active, but not firing yet.
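The effect of the for clause can be illustrated with a small Python function that derives the state of a single alert element from the time it became active. This is a simplified model: Prometheus only re-evaluates rules at its evaluation interval, and the function name and the use of plain seconds are assumptions of this sketch.

```python
def alert_state(active_since, now, for_seconds):
    """Return 'inactive', 'pending', or 'firing' for one alert element.

    active_since is the time (in seconds) at which the expression first
    returned this element, or None if it is currently not returned.
    """
    if active_since is None:
        return "inactive"
    if now - active_since >= for_seconds:
        return "firing"
    return "pending"

# With 'for: 5m' (300 seconds), an element active for 2 minutes is pending.
print(alert_state(0, 120, 300))
print(alert_state(0, 300, 300))
```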

The labels clause specifies a set of additional labels attached to the alert. Conflicting labels will be overwritten. Labels can be templated (see Section 18.4.2.1, “Templates” for more details on templating).

The annotations clause specifies informational labels. You can use them to store additional information, for example alert descriptions or runbook links. Annotations can be templated (see Section 18.4.2.1, “Templates” for more details on templating).

To add your custom alerts to SUSE Enterprise Storage 6, either

  • place your YAML files with custom alerts in the /etc/prometheus/alerts directory

or

  • provide a list of paths to your custom alert files in the Pillar under the monitoring:custom_alerts key. DeepSea Stage 2 or the salt SALT_MASTER state.apply ceph.monitoring.prometheus command will add your alert files in the right place.

    Example 18.17: Adding Custom Alerts to SUSE Enterprise Storage

    A file with custom alerts is in /root/my_alerts/my_alerts.yml on the Salt master. If you add

    monitoring:
     custom_alerts:
       - /root/my_alerts/my_alerts.yml

    to the /srv/pillar/ceph/cluster/YOUR_SALT_MASTER_MINION_ID.sls file, DeepSea will create the /etc/prometheus/alerts/my_alerts.yml file and restart Prometheus.

18.4.2.1 Templates

You can use templates for label and annotation values. The $labels variable includes the label key/value pairs of an alert instance, while $value holds the evaluated value of an alert instance.

The following example inserts a firing element label and value:

{{ $labels.LABELNAME }}
{{ $value }}

18.4.2.2 Inspecting Alerts at Runtime

If you need to verify which alerts are active, you have several options:

  • Navigate to the Alerts tab of Prometheus. It will show you the exact label sets for which defined alerts are active. Prometheus also stores synthetic time series for pending and firing alerts. They have the following form:

    ALERTS{alertname="ALERT_NAME", alertstate="pending|firing", ADDITIONAL_ALERT_LABELS}

    The sample value is 1 if the alert is active (pending or firing). The series is marked 'stale' when the alert is inactive.

  • In the Prometheus Web interface at the URL address http://PROMETHEUS_HOST_IP:9090/alerts, inspect alerts and their state (INACTIVE, PENDING or FIRING).

  • In the Alertmanager Web interface at the URL address http://PROMETHEUS_HOST_IP:9093/#/alerts, inspect alerts and silence them if desired.
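Alerts can also be inspected from a script. The following Python sketch assumes that the deployed Prometheus version provides the /api/v1/alerts endpoint of its HTTP API and that the endpoint is reachable without authentication. The function names are illustrative, and the sample response only mimics the structure of the API's reply.

```python
import json
from urllib.request import urlopen

def fetch_alerts(base_url):
    """Query the Prometheus HTTP API for pending and firing alerts."""
    with urlopen(base_url + "/api/v1/alerts") as response:
        return json.load(response)

def firing_alert_names(api_response):
    """Return the sorted names of all alerts in the 'firing' state."""
    alerts = api_response["data"]["alerts"]
    return sorted({a["labels"]["alertname"] for a in alerts if a["state"] == "firing"})

# Made-up response for illustration; no network access is needed here.
sample_response = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "high pg count deviation"}, "state": "firing"},
            {"labels": {"alertname": "osd down"}, "state": "pending"},
        ]
    },
}
print(firing_alert_names(sample_response))
```

Against a running instance, the two functions could be combined as firing_alert_names(fetch_alerts("http://PROMETHEUS_HOST_IP:9090")).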

18.4.3 SNMP Trap Receiver

If you want to get notified about Prometheus alerts via SNMP traps, then you can install the Prometheus Alertmanager SNMP trap receiver via DeepSea. To do so you need to enable it in the pillar under the monitoring:alertmanager_receiver_snmp:enabled key in your global.yml file. The configuration of the receiver must be set under the monitoring:alertmanager_receiver_snmp:config key.

Example 18.18: SNMP Trap Configuration

monitoring:
 alertmanager:
   receiver:
      snmp:
        enabled: True
        config:
          host: localhost
          port: 9099
          snmp_host: snmp.foo-bar.com
          snmp_community: private
          metrics: True

Refer to the receiver manual at https://github.com/SUSE/prometheus-webhook-snmp#global-configuration-file for more details about the configuration options.

DeepSea Stage 2 or the salt SALT_MASTER state.apply ceph.monitoring.alertmanager command will install and configure the receiver in the appropriate location. Verify your settings with:

root@master # salt-call pillar.get 'monitoring:alertmanager_receiver_snmp:enabled'
root@master # salt-call pillar.get 'monitoring:alertmanager_receiver_snmp:config'

18.5 Troubleshooting Alerts

This section describes the health alerts that can be triggered and the actions to take when an alert is displayed.

MONITOR
MON_DOWN

One or more monitor daemons are down. The cluster requires a majority of the monitors in order to function. When one or more monitors are down, clients will initially have difficulty connecting to the cluster.

Restart the monitor daemon that is down as soon as possible to reduce the risk of a subsequent monitor failure.
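The majority requirement can be expressed as a short calculation. The following Python sketch shows how many monitors must be up and how many failures a cluster of a given size tolerates; the function names are illustrative.

```python
def has_quorum(total_monitors, monitors_up):
    """A strict majority of all monitors must be up to form a quorum."""
    return monitors_up >= total_monitors // 2 + 1

def tolerated_failures(total_monitors):
    """Number of monitors that can be down while a majority remains."""
    return total_monitors - (total_monitors // 2 + 1)

for total in (3, 4, 5):
    print(total, "monitors tolerate", tolerated_failures(total), "failure(s)")
```

The calculation also shows why an even number of monitors brings no additional fault tolerance: four monitors tolerate only one failure, the same as three.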

MON_CLOCK_SKEW

The clocks on the hosts running the ceph-mon monitor daemons are not well synchronized. This health alert is raised if the cluster detects a clock skew greater than mon_clock_drift_allowed. Resolve this by synchronizing the clocks using either ntpd or chrony. If it is impractical to keep the clocks closely synchronized, the mon_clock_drift_allowed threshold can be increased, but this value must stay well below the mon_lease interval for the monitor cluster to function properly.

MON_MSGR2_NOT_ENABLED

The ms_bind_msgr2 option is enabled but one or more monitors are not configured to bind to a v2 port in the cluster’s monmap. This means that features specific to the msgr2 protocol (for example, encryption) are not available on some or all connections. In most cases this can be corrected by issuing the following command:

cephadm@adm > ceph mon enable-msgr2

This command changes any monitor configured for the old default port 6789 to continue to listen for v1 connections on 6789 and also listen for v2 connections on the new default 3300 port. If a monitor is configured to listen for v1 connections on a non-standard port (not 6789), then the monmap needs to be modified manually.

MANAGER
MGR_MODULE_DEPENDENCY

An enabled manager module is failing its dependency check. This health check should come with a message from the module about the problem. For example, a module might report that a required package is not installed. In that case, install the required package and restart your manager daemons. This health check only applies to enabled modules. If a module is not enabled, you can see whether it is reporting dependency issues in the output of ceph mgr module ls.

MGR_MODULE_ERROR

A manager module has experienced an unexpected error. Typically, this means an unhandled exception was raised from the module’s serve function. The human readable description of the error may be obscurely worded if the exception did not provide a useful description of itself. This health check may indicate a bug. Open a bug report if you think you have encountered a bug. If you believe the error is transient, you may restart your manager daemon(s), or use ceph mgr fail on the active daemon to prompt a failover to another daemon.

OSDS
OSD_DOWN

One or more OSDs are marked down. The ceph-osd daemon may have been stopped, or peer OSDs may be unable to reach the OSD over the network. Common causes include a stopped or crashed daemon, a down host, or a network outage. Verify the host is healthy, the daemon is started, and network is functioning. If the daemon has crashed, the daemon log file (/var/log/ceph/ceph-osd.*) may contain debugging information.

OSD_CRUSH TYPE_DOWN

CRUSH TYPE stands for the type of the affected CRUSH subtree, for example OSD_HOST_DOWN or OSD_ROOT_DOWN. All the OSDs within a particular CRUSH subtree are marked down, for example all OSDs on a host.

OSD_ORPHAN

An OSD is referenced in the CRUSH Map hierarchy but does not exist. The OSD can be removed from the CRUSH hierarchy with:

cephadm@adm > ceph osd crush rm osd.ID

OSD_OUT_OF_ORDER_FULL

The utilization thresholds for backfillfull, nearfull, full, and failsafe_full are not ascending. The thresholds can be adjusted with:

cephadm@adm > ceph osd set-backfillfull-ratio RATIO
cephadm@adm > ceph osd set-nearfull-ratio RATIO
cephadm@adm > ceph osd set-full-ratio RATIO
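Before adjusting the ratios, it can be useful to check which pair of thresholds is out of order. The following Python sketch assumes the ascending order nearfull, backfillfull, full, failsafe_full, which corresponds to Ceph's default values. Both the order and the sample ratios are assumptions of this sketch and should be verified against your cluster.

```python
# Assumed ascending order, based on Ceph's default threshold values.
EXPECTED_ORDER = ("nearfull", "backfillfull", "full", "failsafe_full")

def out_of_order(ratios):
    """Return the pairs of neighboring thresholds that are not ascending."""
    problems = []
    for lower, higher in zip(EXPECTED_ORDER, EXPECTED_ORDER[1:]):
        if ratios[lower] > ratios[higher]:
            problems.append((lower, higher))
    return problems

# Sample values for illustration only.
ratios = {"nearfull": 0.92, "backfillfull": 0.90, "full": 0.95, "failsafe_full": 0.97}
print(out_of_order(ratios))
```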

OSD_FULL

One or more OSDs have exceeded the full threshold and are preventing the cluster from servicing writes. Utilization by pool can be checked with:

cephadm@adm > ceph df

The currently defined full ratio can be seen with:

cephadm@adm > ceph osd dump | grep full_ratio

A short-term workaround to restore write availability is to raise the full threshold by a small amount:

cephadm@adm > ceph osd set-full-ratio RATIO

New storage should be added to the cluster by deploying more OSDs or existing data should be deleted in order to free up space.

OSD_BACKFILLFULL

One or more OSDs have exceeded the backfillfull threshold, preventing data from being allowed to rebalance to this device. This is an early warning that rebalancing may not be able to complete and that the cluster is approaching full. Utilization by pool can be checked with:

cephadm@adm > ceph df

OSD_NEARFULL

One or more OSDs have exceeded the nearfull threshold. This is an early warning that the cluster is approaching full. Utilization by pool can be checked with:

cephadm@adm > ceph df

OSDMAP_FLAGS

One or more cluster flags of interest have been set. These flags include:

full

The cluster is flagged as full and cannot serve writes

pauserd, pausewr

Paused reads or writes

noup

OSDs are not allowed to start

nodown

OSD failure reports are being ignored and the monitors are not marking OSDs down

noin

OSDs that were previously marked out are not being marked back in when they start

noout

Down OSDs are not automatically marked out after the configured interval

nobackfill, norecover, norebalance

Recovery or data rebalancing is suspended

noscrub, nodeep_scrub

Scrubbing is disabled

notieragent

Cache tiering activity is suspended

With the exception of full, these flags can be set or cleared with:

cephadm@adm > ceph osd set FLAG
cephadm@adm > ceph osd unset FLAG

OSD_FLAGS

One or more OSDs, CRUSH nodes, or CRUSH device classes have a flag of interest set. These flags include:

noup

These OSDs are not allowed to start

nodown

Failure reports for these OSDs are ignored

noin

If these OSDs were previously marked out automatically after a failure, they are not to be marked in when they start

noout

If these OSDs are down they are not automatically marked out after the configured interval

These flags can be set and cleared in batch with:

cephadm@adm > ceph osd set-group FLAG WHO
cephadm@adm > ceph osd unset-group FLAG WHO

For example:

cephadm@adm > ceph osd set-group noup,noout osd.0 osd.1
cephadm@adm > ceph osd unset-group noup,noout osd.0 osd.1
cephadm@adm > ceph osd set-group noup,noout host-foo
cephadm@adm > ceph osd unset-group noup,noout host-foo
cephadm@adm > ceph osd set-group noup,noout class-hdd
cephadm@adm > ceph osd unset-group noup,noout class-hdd

OLD_CRUSH_TUNABLES

The CRUSH Map is using old settings and should be updated. The oldest tunables that can be used (that is, the oldest client version that can connect to the cluster) without triggering this health warning are determined by the mon_crush_min_required_version config option.

OLD_CRUSH_STRAW_CALC_VERSION

The CRUSH Map is using an older, sub-optimal method for calculating intermediate weight values for straw buckets. The CRUSH Map requires an update to use the newer method (straw_calc_version=1).

CACHE_POOL_NO_HIT_SET

One or more cache pools are not configured with a hit set to track utilization. This prevents the tiering agent from identifying cold objects to flush and evict from the cache. Hit sets can be configured on the cache pool with the following:

cephadm@adm > ceph osd pool set POOLNAME hit_set_type TYPE
cephadm@adm > ceph osd pool set POOLNAME hit_set_period PERIOD-IN-SECONDS
cephadm@adm > ceph osd pool set POOLNAME hit_set_count NUMBER-OF-HITSETS
cephadm@adm > ceph osd pool set POOLNAME hit_set_fpp TARGET-FALSE-POSITIVE-RATE
OSD_NO_SORTBITWISE

No OSDs older than SUSE Enterprise Storage 5.5 (v12.y.z) are running, but the sortbitwise flag has not been set. The sortbitwise flag must be set before v12.y.z or newer OSDs can start. You can safely set the flag with:

cephadm@adm > ceph osd set sortbitwise
POOL_FULL

One or more pools have reached their quota and are no longer allowing writes. Pool quotas and utilization can be seen with the following command:

cephadm@adm > ceph df detail

You can either raise the pool quota with the following commands:

cephadm@adm > ceph osd pool set-quota POOLNAME max_objects NUM-OBJECTS
cephadm@adm > ceph osd pool set-quota POOLNAME max_bytes NUM-BYTES

Or, you can delete existing data to reduce utilization.

BLUEFS_SPILLOVER

One or more OSDs that use the BlueStore backend have been allocated db partitions (storage space for metadata, normally on a faster device) but that space has filled, such that metadata has overflowed onto the normal slow device. This is not necessarily an error condition or even unexpected, but if the administrator’s expectation was that all metadata would fit on the faster device, it indicates that not enough space was provided. This warning can be disabled on all OSDs with the following command:

cephadm@adm > ceph config set osd bluestore_warn_on_bluefs_spillover false

Alternatively, it can be disabled on a specific OSD with the following command:

cephadm@adm > ceph config set osd.123 bluestore_warn_on_bluefs_spillover false

To provide more metadata space, the OSD in question can be destroyed and reprovisioned. This involves data migration and recovery. Alternatively, it may be possible to expand the LVM logical volume backing the db storage. If the underlying LV has been expanded, the OSD daemon needs to be stopped and BlueFS informed of the device size change with the following command:

cephadm@adm > ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$ID
BLUEFS_AVAILABLE_SPACE

To check how much space is free for BlueFS, execute:

cephadm@adm > ceph daemon osd.123 bluestore bluefs available

This provides output for up to three values: BDEV_DB free, BDEV_SLOW free, and available_from_bluestore. BDEV_DB and BDEV_SLOW report the amount of space that has been acquired by BlueFS and is considered free. The available_from_bluestore value denotes the ability of BlueStore to relinquish more space to BlueFS. It is normal for this value to differ from the amount of BlueStore free space, because the BlueFS allocation unit is typically larger than the BlueStore allocation unit. This means that only part of the BlueStore free space is usable by BlueFS.

BLUEFS_LOW_SPACE

If BlueFS is running low on available free space and there is little available_from_bluestore, consider reducing BlueFS' allocation unit size. To simulate available space when the allocation unit is different, execute:

cephadm@adm > ceph daemon osd.123 bluestore bluefs available ALLOC-UNIT-SIZE
BLUESTORE_FRAGMENTATION

As BlueStore works, free space on underlying storage becomes fragmented. This is normal and unavoidable, but excessive fragmentation can cause slowdown. To inspect BlueStore fragmentation, execute:

cephadm@adm > ceph daemon osd.123 bluestore allocator score block

The score is given in the range 0 to 1:

  • 0.0 to 0.4: tiny fragmentation

  • 0.4 to 0.7: small, acceptable fragmentation

  • 0.7 to 0.9: considerable, but safe fragmentation

  • 0.9 to 1.0: severe fragmentation, which can impact BlueFS' ability to get space from BlueStore

If a detailed report of free fragments is required, execute:

cephadm@adm > ceph daemon osd.123 bluestore allocator dump block
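
The score bands described above can be turned into a small helper, for example when post-processing scores collected from several OSDs. The following is a minimal sketch: the function name classify_fragmentation is hypothetical, and assigning each boundary value (0.4, 0.7, 0.9) to the higher band is an assumption, because the documented ranges overlap at their edges.

```python
def classify_fragmentation(score: float) -> str:
    """Map a BlueStore allocator score in [0, 1] to a fragmentation band.

    Assumption: a boundary value belongs to the higher (worse) band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score < 0.4:
        return "tiny"
    if score < 0.7:
        return "small, acceptable"
    if score < 0.9:
        return "considerable, but safe"
    return "severe"


if __name__ == "__main__":
    # Illustrative scores only, not output from a real cluster.
    for value in (0.12, 0.55, 0.93):
        print(value, classify_fragmentation(value))
```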

If the OSD process is not running, fragmentation can be inspected with ceph-bluestore-tool. Get the fragmentation score:

cephadm@adm > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score

Dump detailed free chunks:

cephadm@adm > ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump
BLUESTORE_LEGACY_STATFS

As of SUSE Enterprise Storage 6, BlueStore tracks its internal usage statistics on a per-pool granular basis, but one or more OSDs have BlueStore volumes that were created prior to SUSE Enterprise Storage 6. If all OSDs are older than SUSE Enterprise Storage 6, the per-pool metrics are not available. However, if there is a mix of pre-SUSE Enterprise Storage 6 and post-SUSE Enterprise Storage 6 OSDs, the cluster usage statistics reported by ceph df will not be accurate. The old OSDs can be updated to use the new usage tracking scheme by stopping each OSD, running a repair operation, and then restarting it. For example, if osd.123 requires an update:

root # systemctl stop ceph-osd@123
cephadm@adm > ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-123
root # systemctl start ceph-osd@123

This warning can be disabled with:

cephadm@adm > ceph config set global bluestore_warn_on_legacy_statfs false
BLUESTORE_DISK_SIZE_MISMATCH

One or more OSDs using BlueStore have an internal inconsistency between the size of the physical device and the metadata tracking its size. This can lead to the OSD crashing in the future. The OSDs in question should be destroyed and re-deployed. To avoid putting any data at risk, re-deploy only one OSD at a time. For example, if osd.$N has the error:

cephadm@adm > ceph osd out osd.$N
while ! ceph osd safe-to-destroy osd.$N ; do sleep 1m ; done
ceph osd destroy osd.$N
ceph-volume lvm zap /path/to/device
ceph-volume lvm create --osd-id $N --data /path/to/device
DEVICE HEALTH
DEVICE_HEALTH

One or more devices are expected to fail. The warning threshold is controlled by the mgr/devicehealth/warn_threshold configuration option. This warning only applies to OSDs that are currently marked in. The expected response to this failure is to mark the device out. The data is then migrated off of the device and the hardware is removed from the system. Marking out is normally done automatically if mgr/devicehealth/self_heal is enabled based on the mgr/devicehealth/mark_out_threshold. Device health can be checked with:

cephadm@adm > ceph device info DEVICE-ID

Device life expectancy is set by a prediction model run by the Ceph Manager or by an external tool via the command:

cephadm@adm > ceph device set-life-expectancy DEVICE-ID FROM TO

You can change the stored life expectancy manually, but the change usually does not persist, because the tool that originally set it is likely to set it again. Changing the stored value also does not affect the actual health of the hardware device.

DEVICE_HEALTH_IN_USE

One or more devices are expected to fail and have been marked out of the cluster based on mgr/devicehealth/mark_out_threshold, but the devices are still participating in one or more PGs. This may be because they were only recently marked out and the data is still migrating, or because the data cannot be migrated off for some reason (for example, the cluster is nearly full, or the CRUSH hierarchy is such that there is not another suitable OSD to migrate the data to). This message can be silenced by disabling the self heal behavior (setting mgr/devicehealth/self_heal to false), by adjusting the mgr/devicehealth/mark_out_threshold, or by addressing what is preventing data from being migrated off of the ailing device.

DEVICE_HEALTH_TOOMANY

Too many devices are expected to fail and the mgr/devicehealth/self_heal behavior is enabled, such that marking out all of the ailing devices would exceed the cluster's mon_osd_min_in_ratio ratio, which prevents too many OSDs from being automatically marked out. This generally indicates that too many devices in the cluster are expected to fail and that action is required to add newer (healthier) devices before too many devices fail and data is lost. The health message can also be silenced by adjusting parameters like mon_osd_min_in_ratio or mgr/devicehealth/mark_out_threshold, but be warned that this increases the likelihood of unrecoverable data loss in the cluster.
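
The interaction between self healing and mon_osd_min_in_ratio can be illustrated with a short calculation. This is a simplified sketch, not the Ceph Manager's actual logic: the function name and the exact formula (remaining in OSDs divided by all OSDs) are assumptions made for illustration.

```python
def marking_out_is_allowed(total_osds: int, in_osds: int,
                           ailing_in_osds: int, min_in_ratio: float) -> bool:
    """Return True if marking out all ailing OSDs keeps the in ratio acceptable.

    Assumption: the in ratio is (OSDs still marked in) / (all OSDs).
    """
    if total_osds <= 0:
        raise ValueError("total_osds must be positive")
    remaining_in = in_osds - ailing_in_osds
    return remaining_in / total_osds >= min_in_ratio


if __name__ == "__main__":
    # Hypothetical cluster: 20 OSDs, all in, ratio limit of 0.75.
    print(marking_out_is_allowed(20, 20, 3, 0.75))  # 17/20 = 0.85
    print(marking_out_is_allowed(20, 20, 6, 0.75))  # 14/20 = 0.70
```

In the second call, marking out six devices would drop the in ratio below the limit, which corresponds to the situation this health check reports.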

DATA HEALTH (POOLS AND PLACEMENT GROUPS)
PG_AVAILABILITY

Data availability is reduced and the cluster is unable to service potential read or write requests for some data in the cluster. Specifically, one or more PGs are in a state that does not allow IO requests to be serviced. Problematic PG states include peering, stale, incomplete, and inactive (if those conditions do not clear quickly). Detailed information about which PGs are affected is available from:

cephadm@adm > ceph health detail

In most cases the root cause is that one or more OSDs are currently down; see the discussion for OSD_DOWN above. The state of specific problematic PGs can be queried with:

cephadm@adm > ceph tell PG_ID query
PG_DEGRADED

Data redundancy is reduced for some data, meaning the cluster does not have the desired number of replicas for all data (for replicated pools) or erasure code fragments (for erasure coded pools). Specifically, one or more PGs:

  • have a degraded or undersized flag set, meaning there are not enough instances of that placement group in the cluster;

  • have not had the clean flag set for some time.

PG_RECOVERY_FULL

Data redundancy can be reduced or at risk for some data due to a lack of free space in the cluster. Specifically, one or more PGs have the recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs are above the full threshold. See the discussion for OSD_FULL above for steps to resolve this condition.

PG_BACKFILL_FULL

Data redundancy can be reduced or at risk for some data due to a lack of free space in the cluster. Specifically, one or more PGs have the backfill_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs are above the backfillfull threshold. See the discussion for OSD_BACKFILLFULL above for steps to resolve this condition.

PG_DAMAGED

Data scrubbing has discovered some problems with data consistency in the cluster. Specifically, one or more PGs have the inconsistent or snaptrim_error flag set, indicating that an earlier scrub operation found a problem, or have the repair flag set, indicating that a repair for such an inconsistency is currently in progress.

OSD_SCRUB_ERRORS

Recent OSD scrubs have uncovered inconsistencies. This error is generally paired with PG_DAMAGED.

LARGE_OMAP_OBJECTS

One or more pools contain large omap objects as determined by osd_deep_scrub_large_omap_object_key_threshold (threshold for number of keys to determine a large omap object) or osd_deep_scrub_large_omap_object_value_sum_threshold (the threshold for summed size (bytes) of all key values to determine a large omap object) or both. More information on the object name, key count, and size in bytes can be found by searching the cluster log for ‘Large omap object found’. Large omap objects can be caused by RGW bucket index objects that do not have automatic resharding enabled. The thresholds can be adjusted with:

cephadm@adm > ceph config set osd osd_deep_scrub_large_omap_object_key_threshold KEYS
cephadm@adm > ceph config set osd osd_deep_scrub_large_omap_object_value_sum_threshold BYTES
CACHE_POOL_NEAR_FULL

A cache tier pool is nearly full. Full is determined by the target_max_bytes and target_max_objects properties on the cache pool. Once the pool reaches the target threshold, write requests to the pool may block while data is flushed and evicted from the cache, a state that normally leads to very high latencies and poor performance. The cache pool target size can be adjusted with:

cephadm@adm > ceph osd pool set CACHE-POOL-NAME target_max_bytes BYTES
cephadm@adm > ceph osd pool set CACHE-POOL-NAME target_max_objects OBJECTS

Normal cache flush and eviction activity can also be throttled due to reduced availability, performance of the base tier, or overall cluster load.

POOL_TOO_FEW_PGS

One or more pools should probably have more PGs, based on the amount of data that is currently stored in the pool. This can lead to sub-optimal distribution and balance of data across the OSDs in the cluster, and similarly reduce overall performance. This warning is generated if the pg_autoscale_mode property on the pool is set to warn. To disable the warning, you can disable auto-scaling of PGs for the pool entirely with:

cephadm@adm > ceph osd pool set POOL-NAME pg_autoscale_mode off

To allow the cluster to automatically adjust the number of PGs:

cephadm@adm > ceph osd pool set POOL-NAME pg_autoscale_mode on

You can also manually set the number of PGs for the pool to the recommended amount with:

cephadm@adm > ceph osd pool set POOL-NAME pg_num NEW-PG-NUM
TOO_MANY_PGS

The number of PGs in use in the cluster is above the configurable threshold of mon_max_pg_per_osd PGs per OSD. If this threshold is exceeded, the cluster does not allow new pools to be created, pool pg_num to be increased, or pool replication to be increased (any of which would lead to more PGs in the cluster). A large number of PGs can lead to higher memory utilization for OSD daemons, slower peering after cluster state changes (like OSD restarts, additions, or removals), and higher load on the Ceph Manager and Ceph Monitor daemons. The simplest way to mitigate the problem is to increase the number of OSDs in the cluster by adding more hardware. The OSD count used for the purposes of this health check is the number of in OSDs, so marking any out OSDs back in (if there are any) can also help:

cephadm@adm > ceph osd in OSD_IDs
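
To see why adding or marking in OSDs helps, the PG load per OSD can be estimated by hand. The sketch below assumes that each pool contributes pg_num multiplied by its size (replica count, or k+m for an erasure coded pool) and that the total is divided by the number of in OSDs. The function name and the threshold of 250 are hypothetical values chosen for illustration.

```python
from typing import Iterable, Tuple


def pgs_per_in_osd(pools: Iterable[Tuple[int, int]], in_osds: int) -> float:
    """Estimate the average number of PG instances per 'in' OSD.

    pools: (pg_num, size) pairs, one per pool.
    """
    if in_osds <= 0:
        raise ValueError("at least one OSD must be marked in")
    return sum(pg_num * size for pg_num, size in pools) / in_osds


if __name__ == "__main__":
    pools = [(128, 3), (256, 3)]   # hypothetical pools
    threshold = 250                # hypothetical mon_max_pg_per_osd value
    for osds in (4, 6):
        load = pgs_per_in_osd(pools, osds)
        print(osds, load, "too many" if load > threshold else "ok")
```

With four in OSDs the estimate is 288 PGs per OSD. With six it drops to 192, which is below the example threshold.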
POOL_TOO_MANY_PGS

One or more pools should probably have fewer PGs, based on the amount of data that is currently stored in the pool. This can lead to higher memory utilization for OSD daemons, slower peering after cluster state changes (like OSD restarts, additions, or removals), and higher load on the manager and monitor daemons. This warning is generated if the pg_autoscale_mode property on the pool is set to warn. To disable the warning, you can disable auto-scaling of PGs for the pool entirely with:

cephadm@adm > ceph osd pool set POOL-NAME pg_autoscale_mode off

To allow the cluster to automatically adjust the number of PGs:

cephadm@adm > ceph osd pool set POOL-NAME pg_autoscale_mode on

You can also manually set the number of PGs for the pool to the recommended amount with:

cephadm@adm > ceph osd pool set POOL-NAME pg_num NEW-PG-NUM
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED

One or more pools have a target_size_ratio property set to estimate the expected size of the pool as a fraction of total storage, but the value(s) exceed the total available storage (either by themselves or in combination with other pools’ actual usage). This can indicate that the target_size_ratio value for the pool is too large and should be reduced or set to zero with:

cephadm@adm > ceph osd pool set POOL-NAME target_size_ratio 0
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED

One or more pools have a target_size_bytes property set to estimate the expected size of the pool, but the value(s) exceed the total available storage (either by themselves or in combination with other pools’ actual usage). This indicates that the target_size_bytes value for the pool is too large and should be reduced or set to zero with:

cephadm@adm > ceph osd pool set POOL-NAME target_size_bytes 0
TOO_FEW_OSDS

The number of OSDs in the cluster is below the configurable threshold of osd_pool_default_size.

SMALLER_PGP_NUM

One or more pools have a pgp_num value less than pg_num, indicating that the PG count was increased without also updating the placement behavior. Changing the number of placement groups involves both pg_num and pgp_num: increasing pg_num first splits the placement groups without triggering a rebalance, and data migrates only when pgp_num is changed. To resolve, set pgp_num to match pg_num and trigger the data migration with:

cephadm@adm > ceph osd pool set POOL pgp_num PG-NUM-VALUE
MANY_OBJECTS_PER_PG

One or more pools have an average number of objects per PG that is significantly higher than the overall cluster average. The specific threshold is controlled by the mon_pg_warn_max_object_skew configuration value. This indicates that the pool(s) containing most of the data in the cluster have too few PGs, or that other pools that do not contain as much data have too many PGs. The threshold can be raised to silence the health warning by adjusting the mon_pg_warn_max_object_skew configuration option on the monitors.
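
The skew comparison can be sketched as follows. This is an illustration under stated assumptions rather than the monitor's exact code: the skew is taken to be a pool's objects per PG divided by the cluster-wide objects per PG, and the function name, pool names, and threshold of 10 are hypothetical.

```python
from typing import Dict, List, Tuple


def pools_over_skew(pools: Dict[str, Tuple[int, int]],
                    max_skew: float) -> List[str]:
    """Return the names of pools whose objects-per-PG skew exceeds max_skew.

    pools maps a pool name to (object_count, pg_num).
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pgs for _, pgs in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return []
    cluster_average = total_objects / total_pgs
    return [
        name
        for name, (objects, pgs) in pools.items()
        if pgs > 0 and (objects / pgs) / cluster_average > max_skew
    ]


if __name__ == "__main__":
    # Hypothetical pools: most data lives in a pool with very few PGs.
    example = {"big": (1_000_000, 8), "small": (1_000, 512)}
    print(pools_over_skew(example, max_skew=10))
```

In this example the cluster average is 1925 objects per PG, while the pool named big holds 125000 objects per PG, so only that pool is reported.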

POOL_APP_NOT_ENABLED

A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application. For example, if the pool is used by RBD:

cephadm@adm > rbd pool init POOLNAME

If the pool is being used by a custom application FOO, you can also label via the low-level command:

cephadm@adm > ceph osd pool application enable POOLNAME FOO
POOL_FULL

One or more pools have reached (or are very close to reaching) their quota. The threshold to trigger this error condition is controlled by the mon_pool_quota_crit_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:

cephadm@adm > ceph osd pool set-quota POOL max_bytes BYTES
cephadm@adm > ceph osd pool set-quota POOL max_objects OBJECTS

Setting the quota value to 0 disables the quota.

POOL_NEAR_FULL

One or more pools are approaching their quota. The threshold to trigger this warning condition is controlled by the mon_pool_quota_warn_threshold configuration option. Pool quotas can be adjusted up or down (or removed) with:

cephadm@adm > ceph osd pool set-quota POOL max_bytes BYTES
cephadm@adm > ceph osd pool set-quota POOL max_objects OBJECTS
OBJECT_MISPLACED

One or more objects in the cluster are not stored on the node the cluster would like them to be stored on. This is an indication that data migration due to some recent cluster change has not yet completed. Misplaced data is not a dangerous condition in and of itself. Data consistency is not at risk and old copies of objects are not removed until the desired number of new copies (in the desired locations) are present.

OBJECT_UNFOUND

One or more objects in the cluster cannot be found. Specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. Read or write requests to unfound objects will block. Ideally, a down OSD can be brought back online that has the more recent copy of the unfound object. Candidate OSDs can be identified from the peering state for the PG(s) responsible for the unfound object:

cephadm@adm > ceph tell PG_ID query

If the latest copy of the object is not available, the cluster can be told to roll back to a previous version of the object.

SLOW_OPS

One or more OSD requests are taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. The request queue on the OSD(s) in question can be queried with the following command, executed from the OSD host:

cephadm@adm > ceph daemon osd.ID ops

A summary of the slowest recent requests can be seen with:

cephadm@adm > ceph daemon osd.ID dump_historic_ops

The location of an OSD can be found with:

cephadm@adm > ceph osd find osd.ID
PG_NOT_SCRUBBED

One or more PGs have not been scrubbed recently. PGs are normally scrubbed every mon_scrub_interval seconds, and this warning triggers when mon_warn_pg_not_scrubbed_ratio percentage of the interval has elapsed without a scrub since it was due. PGs do not scrub if they are not flagged as clean. This can happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a scrub of a clean PG with:

cephadm@adm > ceph pg scrub PG_ID
PG_NOT_DEEP_SCRUBBED

One or more PGs have not been deep scrubbed recently. PGs are normally deep scrubbed every osd_deep_scrub_interval seconds, and this warning triggers when mon_warn_pg_not_deep_scrubbed_ratio percentage of the interval has elapsed without a deep scrub since it was due. PGs do not (deep) scrub if they are not flagged as clean. This can happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a deep scrub of a clean PG with:

cephadm@adm > ceph pg deep-scrub PG_ID
MISCELLANEOUS
RECENT_CRASH

One or more Ceph daemons have crashed recently, and the crash has not yet been archived or acknowledged by the administrator. This may indicate a software bug, a hardware problem (for example, a failing disk), or some other problem.

Note

Encountering a crash is not normal, but can be observed on occasion. When a crash occurs, ceph crash will alert the administrator. If ceph crash is reporting an abnormal number of crashes, contact SUSE support for further assistance. supportconfig reports from the affected nodes will help SUSE address the issue. Also consider patching the cluster at a regular interval.

New crashes can be listed with:

cephadm@adm > ceph crash ls-new

Information about a specific crash can be examined with:

cephadm@adm > ceph crash info CRASH-ID

This warning can be silenced by archiving the crash (perhaps after being examined by an administrator) so that it does not generate this warning:

cephadm@adm > ceph crash archive CRASH-ID

Similarly, all new crashes can be archived with:

cephadm@adm > ceph crash archive-all

Archived crashes are still visible via ceph crash ls but not ceph crash ls-new. The time period for what recent means is controlled by the option mgr/crash/warn_recent_interval (default: two weeks). These warnings can be disabled entirely with:

cephadm@adm > ceph config set mgr mgr/crash/warn_recent_interval 0
TELEMETRY_CHANGED

Telemetry has been enabled but the contents of the telemetry report have changed since that time, so telemetry reports are not sent. The Ceph developers periodically revise the telemetry feature to include new and useful information, or to remove information found to be useless or sensitive. If any new information is included in the report, Ceph requires the administrator to re-enable telemetry to ensure they have an opportunity to (re)review what information is shared. To review the contents of the telemetry report:

cephadm@adm > ceph telemetry show

The telemetry report consists of several optional channels that are independently enabled or disabled. To re-enable telemetry (and make this warning go away):

cephadm@adm > ceph telemetry on

To disable telemetry (and make this warning go away):

cephadm@adm > ceph telemetry off
groups:
  - name: cluster health
    rules:
      - alert: health error
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Ceph in error for > 5m
      - alert: unhealthy
        expr: ceph_health_status != 0
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: Ceph not healthy for > 15m
  - name: mon
    rules:
      - alert: low monitor quorum count
        expr: ceph_monitor_quorum_count < 3
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Monitor count in quorum is low
  - name: osd
    rules:
      - alert: 10% OSDs down
        expr: sum(ceph_osd_down) / count(ceph_osd_in) >= 0.1
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: More than 10% of OSDs are down
      - alert: OSD down
        expr: sum(ceph_osd_down) > 1
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: One or more OSDs down for more than 15 minutes
      - alert: OSDs near full
        expr: (ceph_osd_utilization unless on(osd) ceph_osd_down) > 80
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: OSD {{ $labels.osd }} is dangerously full, over 80%
      # alert on single OSDs flapping
      - alert: flap osd
        expr: rate(ceph_osd_up[5m])*60 > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            OSD {{ $labels.osd }} was marked down and back up at least once a
            minute for 5 minutes.
      # alert on high deviation from average PG count
      - alert: high pg count deviation
        expr: abs(((ceph_osd_pgs > 0) - on (job) group_left avg(ceph_osd_pgs > 0) by (job)) / on (job) group_left avg(ceph_osd_pgs > 0) by (job)) > 0.35
        for: 5m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            OSD {{ $labels.osd }} deviates by more than 35% from
            average PG count
      # alert on high commit latency...but how high is too high
  - name: mds
    rules:
    # no mds metrics are exported yet
  - name: mgr
    rules:
    # no mgr metrics are exported yet
  - name: pgs
    rules:
      - alert: pgs inactive
        expr: ceph_total_pgs - ceph_active_pgs > 0
        for: 5m
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: One or more PGs are inactive for more than 5 minutes.
      - alert: pgs unclean
        expr: ceph_total_pgs - ceph_clean_pgs > 0
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: One or more PGs are not clean for more than 15 minutes.
  - name: nodes
    rules:
      - alert: root volume full
        expr: node_filesystem_avail{mountpoint="/"} / node_filesystem_size{mountpoint="/"} < 0.1
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Root volume (OSD and MON store) is dangerously full (< 10% free)
      # alert on nic packet errors and drops rates > 1 packet/s
      - alert: network packets dropped
        expr: irate(node_network_receive_drop{device!="lo"}[5m]) + irate(node_network_transmit_drop{device!="lo"}[5m]) > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet drop > 1
            packet/s on interface {{ $labels.device }}
      - alert: network packet errors
        expr: irate(node_network_receive_errs{device!="lo"}[5m]) + irate(node_network_transmit_errs{device!="lo"}[5m]) > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet errors > 1
            packet/s on interface {{ $labels.device }}
      # predict fs fillup times
      - alert: storage filling
        expr: ((node_filesystem_free - node_filesystem_size) / deriv(node_filesystem_free[2d]) <= 5) > 0
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Mountpoint {{ $labels.mountpoint }} will be full in less than 5 days
            assuming the average fillup rate of the past 48 hours.
  - name: pools
    rules:
      - alert: pool full
        expr: ceph_pool_used_bytes / ceph_pool_available_bytes > 0.9
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Pool {{ $labels.pool }} at 90% capacity or over
      - alert: pool filling up
        expr: (-ceph_pool_used_bytes / deriv(ceph_pool_available_bytes[2d]) <= 5 ) > 0
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Pool {{ $labels.pool }} will be full in less than 5 days
            assuming the average fillup rate of the past 48 hours.
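
The expression of the high pg count deviation rule can be hard to read in PromQL. The following sketch reproduces the same idea in plain Python: OSDs that hold no PGs are ignored, the mean PG count of the remaining OSDs is computed, and any OSD whose relative deviation from that mean exceeds 0.35 is reported. The function name and the sample data are hypothetical.

```python
from typing import Dict, List


def osds_with_high_pg_deviation(pgs_by_osd: Dict[str, int],
                                threshold: float = 0.35) -> List[str]:
    """Return OSDs whose PG count deviates from the mean by more than threshold.

    OSDs with zero PGs are excluded, mirroring the 'ceph_osd_pgs > 0' filter.
    """
    active = {osd: pgs for osd, pgs in pgs_by_osd.items() if pgs > 0}
    if not active:
        return []
    mean = sum(active.values()) / len(active)
    return [
        osd for osd, pgs in active.items()
        if abs(pgs - mean) / mean > threshold
    ]


if __name__ == "__main__":
    sample = {"osd.0": 100, "osd.1": 100, "osd.2": 100, "osd.3": 180, "osd.4": 0}
    print(osds_with_high_pg_deviation(sample))
```

For the sample data the mean over the four active OSDs is 120 PGs, so osd.3 deviates by 0.5 and is reported, while the OSDs with 100 PGs deviate by about 0.17 and are not.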

19 Authentication with cephx

To identify clients and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system. Clients in this context are either human users—such as the admin user—or Ceph-related services/daemons, for example OSDs, monitors, or Object Gateways.

Note

The cephx protocol does not address data encryption in transport, such as TLS/SSL.

19.1 Authentication Architecture

cephx uses shared secret keys for authentication, meaning both the client and Ceph Monitors have a copy of the client’s secret key. The authentication protocol enables both parties to prove to each other that they have a copy of the key without actually revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is sure that the cluster has a copy of the secret key as well.
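
The idea of proving possession of a shared key without revealing it can be illustrated with a generic challenge-response exchange. The sketch below is not the cephx wire protocol: it uses HMAC-SHA256 from the Python standard library as a stand-in, and all function names are hypothetical. Each side sends a random challenge and checks that the peer's answer could only have been computed with the shared key.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Create a fresh random challenge that is safe to send in the clear."""
    return secrets.token_bytes(16)


def prove(secret_key: bytes, challenge: bytes) -> bytes:
    """Answer a challenge; the key itself never leaves the host."""
    return hmac.new(secret_key, challenge, hashlib.sha256).digest()


def verify(secret_key: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a peer's answer using a constant-time comparison."""
    return hmac.compare_digest(prove(secret_key, challenge), proof)


if __name__ == "__main__":
    shared_key = secrets.token_bytes(32)  # known to client and monitor

    # The monitor challenges the client ...
    monitor_challenge = make_challenge()
    client_ok = verify(shared_key, monitor_challenge,
                       prove(shared_key, monitor_challenge))

    # ... and the client challenges the monitor (mutual authentication).
    client_challenge = make_challenge()
    monitor_ok = verify(shared_key, client_challenge,
                        prove(shared_key, client_challenge))

    print(client_ok and monitor_ok)
```

An eavesdropper sees only the challenges and the proofs, neither of which discloses the key.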

A key scalability feature of Ceph is to avoid a centralized interface to the Ceph object store. This means that Ceph clients can interact with OSDs directly. To protect data, Ceph provides its cephx authentication system, which authenticates Ceph clients.

Each monitor can authenticate clients and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the client’s permanent secret key, so that only the client can request services from the Ceph monitors. The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained wrongfully.

To use cephx, an administrator must set up clients/users first. In the following diagram, the client.admin user invokes ceph auth get-or-create-key from the command line to generate a user name and secret key. Ceph’s auth subsystem generates the user name and key, stores a copy with the monitor(s) and transmits the user’s secret back to the client.admin user. This means that the client and the monitor share a secret key.

Figure 19.1: Basic cephx Authentication

To authenticate with the monitor, the client passes the user name to the monitor. The monitor generates a session key and encrypts it with the secret key associated with the user name and transmits the encrypted ticket back to the client. The client then decrypts the data with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket related to the user, which is signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.

Figure 19.2: cephx Authentication

The cephx protocol authenticates ongoing communications between the client machine and the Ceph servers. Each message sent between a client and a server after the initial authentication is signed using a ticket that the monitors, OSDs, and metadata servers can verify with their shared secret.

Figure 19.3: cephx Authentication - MDS and OSD
Important

The protection offered by this authentication is between the Ceph client and the Ceph cluster hosts. The authentication is not extended beyond the Ceph client. If the user accesses the Ceph client from a remote host, Ceph authentication is not applied to the connection between the user’s host and the client host.

19.2 Key Management

This section describes Ceph client users and their authentication and authorization with the Ceph storage cluster. Users are either individuals or system actors such as applications, which use Ceph clients to interact with the Ceph storage cluster daemons.

When Ceph runs with authentication and authorization enabled (enabled by default), you must specify a user name and a keyring containing the secret key of the specified user (usually via the command line). If you do not specify a user name, Ceph will use client.admin as the default user name. If you do not specify a keyring, Ceph will look for a keyring via the keyring setting in the Ceph configuration file. For example, if you execute the ceph health command without specifying a user name or keyring, Ceph interprets the command like this:

cephadm@adm > ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health

Alternatively, you may use the CEPH_ARGS environment variable to avoid re-entering the user name and secret.
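
As a minimal sketch of the CEPH_ARGS approach, the following Python helper prepares an environment in which the user ID and keyring path are preset. The helper name ceph_env is hypothetical, and the sketch assumes that neither value contains spaces.

```python
import os
from typing import Dict


def ceph_env(user_id: str, keyring: str) -> Dict[str, str]:
    """Return a copy of the current environment with CEPH_ARGS preset.

    Assumption: user_id and keyring contain no whitespace, so joining
    them with single spaces is sufficient.
    """
    env = dict(os.environ)
    env["CEPH_ARGS"] = f"--id {user_id} --keyring {keyring}"
    return env


env = ceph_env("foo", "/path/to/keyring")
```

The resulting dictionary can be passed as the env argument of subprocess.run() when calling the ceph command from a script.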

19.2.1 Background Information

Regardless of the type of Ceph client (for example, block device, object storage, file system, native API), Ceph stores all data as objects within pools. Ceph users need to have access to pools in order to read and write data. Additionally, Ceph users must have execute permissions to use Ceph's administrative commands. The following concepts will help you understand Ceph user management.

19.2.1.1 User

A user is either an individual or a system actor such as an application. Creating users allows you to control who (or what) can access your Ceph storage cluster, its pools, and the data within pools.

Ceph distinguishes between types of users. For the purposes of user management, the type will always be client. Ceph identifies users in period (.) delimited form consisting of the user type and the user ID (TYPE.ID), for example client.admin or client.user1. The reason for user typing is that Ceph monitors, OSDs, and metadata servers also use the cephx protocol, but they are not clients. Distinguishing the user type helps to distinguish between client users and other users, streamlining access control, user monitoring, and traceability.

Sometimes Ceph’s user type may seem confusing, because the Ceph command line allows you to specify a user with or without the type, depending upon your command line usage. If you specify --user or --id, you can omit the type. So client.user1 can be entered simply as user1. If you specify --name or -n, you must specify the type and name, such as client.user1. We recommend using the type and name as a best practice wherever possible.
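
The relationship between the short and the fully qualified form can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that a bare ID never contains a period.

```python
from typing import List


def qualify_user(name: str, default_type: str = "client") -> str:
    """Return TYPE.ID, prepending default_type when only an ID is given.

    Assumption: a bare ID contains no period, so any period means the
    type is already present.
    """
    if "." in name:
        return name
    return f"{default_type}.{name}"


def cli_user_args(name: str) -> List[str]:
    """Build the fully qualified --name argument pair."""
    return ["--name", qualify_user(name)]
```

With this helper, both user1 and client.user1 resolve to the same --name client.user1 argument.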

Note

A Ceph storage cluster user is not the same as a Ceph object storage user or a Ceph file system user. The Ceph Object Gateway uses a Ceph storage cluster user to communicate between the gateway daemon and the storage cluster, but the gateway has its own user management functionality for end users. The Ceph file system uses POSIX semantics. The user space associated with it is not the same as a Ceph storage cluster user.

19.2.1.2 Authorization and Capabilities

Ceph uses the term capabilities to describe authorizing an authenticated user to exercise the functionality of the monitors, OSDs and metadata servers. Capabilities can also restrict access to data within a pool or pool namespace. A Ceph administrative user sets a user's capabilities when creating or updating a user.

Capability syntax follows the form:

daemon-type 'cap-spec[, cap-spec ...]'

The following is a list of capabilities for each service type:

Monitor capabilities

Monitor capabilities include r, w, x and allow profile cap.

mon 'allow rwx'
mon 'allow profile osd'

OSD capabilities

OSD capabilities include r, w, x, class-read, class-write and profile osd. Additionally, OSD capabilities allow for pool and namespace settings.

osd 'allow capability' [pool=poolname] [namespace=namespace-name]

MDS capability

The MDS capability simply requires allow, or blank.

mds 'allow'

The following entries describe each capability:

allow

Precedes access settings for a daemon. Implies rw for MDS only.

r

Gives the user read access. Required with monitors to retrieve the CRUSH map.

w

Gives the user write access to objects.

x

Gives the user the capability to call class methods (both read and write) and to conduct auth operations on monitors.

class-read

Gives the user the capability to call class read methods. Subset of x.

class-write

Gives the user the capability to call class write methods. Subset of x.

*

Gives the user read, write, and execute permissions for a particular daemon/pool, and the ability to execute admin commands.

profile osd

Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting.

profile mds

Gives a user permissions to connect as an MDS to other MDSs or monitors.

profile bootstrap-osd

Gives a user permissions to bootstrap an OSD. Delegated to deployment tools so that they have permissions to add keys when bootstrapping an OSD.

profile bootstrap-mds

Gives a user permissions to bootstrap a metadata server. Delegated to deployment tools so they have permissions to add keys when bootstrapping a metadata server.

For example, monitor capabilities include r, w, x access settings or profile {name}.

The syntax looks as follows:

mon 'allow {access-spec} [network {network/prefix}]'
mon 'profile {name}'

The {access-spec} syntax is as follows:

* | all | [r][w][x]

The optional {network/prefix} is a standard network name and prefix length in CIDR notation (for example, 10.3.0.0/16). If present, the use of this capability is restricted to clients connecting from this network.

Another example is the OSD capabilities that include r, w, x, class-read, class-write access settings or profile {name}. Additionally, OSD capabilities also allow for pool and namespace settings.

The syntax looks as follows:

osd 'allow {access-spec} [{match-spec}] [network {network/prefix}]'
osd 'profile {name} [pool={pool-name} [namespace={namespace-name}]] [network {network/prefix}]'

Capabilities can also restrict access to data within a pool, a namespace within a pool, or a set of pools based on their application tags. A Ceph administrative user sets a user’s capabilities when creating or updating a user.
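
To make the capability syntax more concrete, the following Python sketch assembles capability strings of the forms shown above. It is an illustration only: the helper names mon_cap and osd_cap are hypothetical, the monitor access-spec is checked against the * | all | [r][w][x] grammar, and OSD access settings are passed through unchecked.

```python
import ipaddress
import re
from typing import Optional

# Matches the access-spec grammar: * | all | [r][w][x]
ACCESS_RE = re.compile(r"\*|all|r?w?x?")


def _check_access(access: str) -> str:
    """Reject empty or out-of-order access specs such as 'wr'."""
    if not access or not ACCESS_RE.fullmatch(access):
        raise ValueError(f"invalid access-spec: {access!r}")
    return access


def mon_cap(access: str, network: Optional[str] = None) -> str:
    """Build a monitor capability string, optionally limited to a network."""
    cap = f"allow {_check_access(access)}"
    if network is not None:
        ipaddress.ip_network(network)  # raises ValueError on malformed CIDR
        cap += f" network {network}"
    return cap


def osd_cap(access: str, pool: Optional[str] = None,
            namespace: Optional[str] = None) -> str:
    """Build an OSD capability string with optional pool and namespace."""
    parts = ["allow", access]
    if pool:
        parts.append(f"pool={pool}")
    if namespace:
        parts.append(f"namespace={namespace}")
    return " ".join(parts)
```

For example, osd_cap("rw", pool="liverpool") yields the string allow rw pool=liverpool, which is the OSD capability used in the user examples later in this chapter.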

19.2.1.3 Pools

A pool is a logical partition where users store data. In Ceph deployments, it is common to create a pool as a logical partition for similar types of data. For example, when deploying Ceph as a back-end for OpenStack, a typical deployment would have pools for volumes, images, backups and virtual machines, and users such as client.glance or client.cinder.

19.2.2 Managing Users

User management functionality provides Ceph cluster administrators with the ability to create, update, and delete users directly in the Ceph cluster.

When you create or delete users in the Ceph cluster, you may need to distribute keys to clients so that they can be added to keyrings. See Section 19.2.3, “Keyring Management” for details.

19.2.2.1 Listing Users

To list the users in your cluster, execute the following:

cephadm@adm > ceph auth list

Ceph will list all users in your cluster. For example, in a cluster with two nodes, ceph auth list output looks similar to this:

installed auth entries:

osd.0
        key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.1
        key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
client.bootstrap-mds
        key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
        caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
        key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
        caps: [mon] allow profile bootstrap-osd
Note: TYPE.ID Notation

The TYPE.ID notation means that osd.0 is a user of type osd with the ID 0, and client.admin is a user of type client with the ID admin. Each entry has a key: value entry and one or more caps: entries.

You may use the -o filename option with ceph auth list to save the output to a file.
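
If you save the listing to a file, it can be post-processed with a small script. The following Python sketch parses text in the layout shown above into a dictionary. It assumes exactly that layout (an unindented TYPE.ID line followed by indented key: and caps: lines), which may differ between releases; the function name parse_auth_list and the shortened sample keys are hypothetical.

```python
from typing import Dict

SAMPLE = """installed auth entries:

osd.0
        key: EXAMPLEKEY0==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: EXAMPLEKEY1==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
"""


def parse_auth_list(text: str) -> Dict[str, dict]:
    """Map each TYPE.ID to its key and per-daemon capabilities."""
    users: Dict[str, dict] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "installed auth entries:":
            continue
        if not line[0].isspace():
            # An unindented line starts a new user entry.
            current = stripped
            users[current] = {"key": None, "caps": {}}
        elif stripped.startswith("key:"):
            users[current]["key"] = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("caps:"):
            # Example: "caps: [mon] allow profile osd"
            rest = stripped.split(":", 1)[1].strip()
            daemon, _, spec = rest.partition("] ")
            users[current]["caps"][daemon.lstrip("[")] = spec
    return users


users = parse_auth_list(SAMPLE)
```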

19.2.2.2 Getting Information about Users

To retrieve the key and capabilities of a specific user, execute the following:

cephadm@adm > ceph auth get TYPE.ID

For example:

cephadm@adm > ceph auth get client.admin
exported keyring for client.admin
[client.admin]
	key = AQA19uZUqIwkHxAAFuUwvq0eJD4S173oFRxe0g==
	caps mds = "allow"
	caps mon = "allow *"
	caps osd = "allow *"

Developers may also execute the following:

cephadm@adm > ceph auth export TYPE.ID

The auth export command is identical to auth get, but also prints the internal authentication ID.

19.2.2.3 Adding Users

Adding a user creates a user name (TYPE.ID), a secret key, and any capabilities included in the command you use to create the user.

A user's key enables the user to authenticate with the Ceph storage cluster. The user's capabilities authorize the user to read, write, or execute on Ceph monitors (mon), Ceph OSDs (osd), or Ceph metadata servers (mds).

There are a few commands available to add a user:

ceph auth add

This command is the canonical way to add a user. It will create the user, generate a key, and add any specified capabilities.

ceph auth get-or-create

This command is often the most convenient way to create a user, because it returns a keyfile format with the user name (in brackets) and the key. If the user already exists, this command simply returns the user name and key in the keyfile format. You may use the -o filename option to save the output to a file.

ceph auth get-or-create-key

This command is a convenient way to create a user and return the user's key (only). This is useful for clients that need the key only (for example libvirt). If the user already exists, this command simply returns the key. You may use the -o filename option to save the output to a file.

When creating client users, you may create a user with no capabilities. A user with no capabilities can authenticate but nothing more. Such a client cannot retrieve the cluster map from the monitor. However, you can create a user with no capabilities if you want to defer adding capabilities until later using the ceph auth caps command.

A typical user has at least read capabilities on the Ceph monitor and read and write capabilities on Ceph OSDs. Additionally, a user's OSD permissions are often restricted to accessing a particular pool.

cephadm@adm > ceph auth add client.john mon 'allow r' osd \
 'allow rw pool=liverpool'
cephadm@adm > ceph auth get-or-create client.paul mon 'allow r' osd \
 'allow rw pool=liverpool'
cephadm@adm > ceph auth get-or-create client.george mon 'allow r' osd \
 'allow rw pool=liverpool' -o george.keyring
cephadm@adm > ceph auth get-or-create-key client.ringo mon 'allow r' osd \
 'allow rw pool=liverpool' -o ringo.key
Important

If you provide a user with capabilities to OSDs, but you do not restrict access to particular pools, the user will have access to all pools in the cluster.

19.2.2.4 Modifying User Capabilities

The ceph auth caps command allows you to specify a user and change the user's capabilities. Setting new capabilities will overwrite current ones. To view current capabilities run ceph auth get USERTYPE.USERID. To add capabilities, you also need to specify the existing capabilities when using the following form:

cephadm@adm > ceph auth caps USERTYPE.USERID daemon 'allow [r|w|x|*|...] \
     [pool=pool-name] [namespace=namespace-name]' [daemon 'allow [r|w|x|*|...] \
     [pool=pool-name] [namespace=namespace-name]']

For example:

cephadm@adm > ceph auth get client.john
cephadm@adm > ceph auth caps client.john mon 'allow r' osd 'allow rw pool=prague'
cephadm@adm > ceph auth caps client.paul mon 'allow rw' osd 'allow r pool=prague'
cephadm@adm > ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'

To remove a capability, you may reset the capability. If you want the user to have no access to a particular daemon that was previously set, specify an empty string:

cephadm@adm > ceph auth caps client.ringo mon ' ' osd ' '
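
Because new capabilities overwrite the current ones, a script that changes a single daemon's capability needs to pass the complete set. The following Python sketch merges the existing capabilities with the requested changes and builds the corresponding argument list. The helper name caps_command is hypothetical, and the sketch assumes the current capabilities are already known, for example from ceph auth get.

```python
from typing import Dict, List


def caps_command(user: str, current: Dict[str, str],
                 changes: Dict[str, str]) -> List[str]:
    """Return the argument list for 'ceph auth caps' with all caps included.

    current and changes map a daemon type (mon, osd, mds) to a
    capability string; entries in changes override those in current.
    """
    merged = dict(current)
    merged.update(changes)
    argv = ["ceph", "auth", "caps", user]
    for daemon in sorted(merged):
        argv += [daemon, merged[daemon]]
    return argv


argv = caps_command(
    "client.john",
    current={"mon": "allow r", "osd": "allow rw pool=liverpool"},
    changes={"osd": "allow rw pool=prague"},
)
```

In this example, the unchanged mon capability is carried over, so it is not lost when the osd capability is replaced.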

19.2.2.5 Deleting Users

To delete a user, use ceph auth del:

cephadm@adm > ceph auth del TYPE.ID

where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.

If you created users with permissions strictly for a pool that no longer exists, you should consider deleting those users too.

19.2.2.6 Printing a User's Key

To print a user’s authentication key to standard output, execute the following:

cephadm@adm > ceph auth print-key TYPE.ID

where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.

Printing a user's key is useful when you need to populate client software (such as libvirt) with a user's key, as in the following example:

root # mount -t ceph host:/ mount_point \
-o name=client.user,secret=`ceph auth print-key client.user`

19.2.2.7 Importing Users

To import one or more users, use ceph auth import and specify a keyring:

cephadm@adm > ceph auth import -i /etc/ceph/ceph.keyring
Note

The Ceph storage cluster will add new users, their keys and their capabilities and will update existing users, their keys and their capabilities.

19.2.3 Keyring Management

When you access Ceph via a Ceph client, the client will look for a local keyring. Ceph presets the keyring setting with the following four keyring names by default so you do not need to set them in your Ceph configuration file unless you want to override the defaults:

/etc/ceph/cluster.name.keyring
/etc/ceph/cluster.keyring
/etc/ceph/keyring
/etc/ceph/keyring.bin

The cluster metavariable is your Ceph cluster name as defined by the name of the Ceph configuration file. For example, a configuration file named ceph.conf means that the cluster name is ceph, and the keyring is therefore ceph.keyring. The name metavariable is the user type and user ID, for example client.admin, which results in ceph.client.admin.keyring.
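
The expansion of the two metavariables can be sketched as follows. This Python example only illustrates the naming scheme: the function names are hypothetical, and checking the candidates in the listed order is an assumption.

```python
import os
from typing import List, Optional


def default_keyring_paths(cluster: str = "ceph",
                          name: str = "client.admin") -> List[str]:
    """Expand the four default keyring names for a cluster and user."""
    return [
        f"/etc/ceph/{cluster}.{name}.keyring",
        f"/etc/ceph/{cluster}.keyring",
        "/etc/ceph/keyring",
        "/etc/ceph/keyring.bin",
    ]


def find_keyring(cluster: str = "ceph",
                 name: str = "client.admin") -> Optional[str]:
    """Return the first candidate that exists on this host, if any.

    Assumption: candidates are tried in the order listed above.
    """
    for path in default_keyring_paths(cluster, name):
        if os.path.exists(path):
            return path
    return None
```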

After you create a user (for example client.ringo), you must get the key and add it to a keyring on a Ceph client so that the user can access the Ceph storage cluster.

Section 19.2, “Key Management” details how to list, get, add, modify and delete users directly in the Ceph storage cluster. However, Ceph also provides the ceph-authtool utility to allow you to manage keyrings from a Ceph client.

19.2.3.1 Creating a Keyring

When you use the procedures in Section 19.2, “Key Management” to create users, you need to provide user keys to the Ceph client(s) so that the client can retrieve the key for the specified user and authenticate with the Ceph storage cluster. Ceph clients access keyrings to look up a user name and retrieve the user's key:

cephadm@adm > ceph-authtool --create-keyring /path/to/keyring

When creating a keyring with multiple users, we recommend using the cluster name (for example cluster.keyring) for the keyring file name and saving it in the /etc/ceph directory so that the keyring configuration default setting will pick up the file name without requiring you to specify it in the local copy of your Ceph configuration file. For example, create ceph.keyring by executing the following:

cephadm@adm > ceph-authtool -C /etc/ceph/ceph.keyring

When creating a keyring with a single user, we recommend using the cluster name, the user type and the user name and saving it in the /etc/ceph directory. For example, ceph.client.admin.keyring for the client.admin user.

19.2.3.2 Adding a User to a Keyring

When you add a user to the Ceph storage cluster (see Section 19.2.2.3, “Adding Users”), you can retrieve the user, key and capabilities, and save the user to a keyring.

If you only want to use one user per keyring, the ceph auth get command with the -o option will save the output in the keyring file format. For example, to create a keyring for the client.admin user, execute the following:

cephadm@adm > ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring

When you want to import users to a keyring, you can use ceph-authtool to specify the destination keyring and the source keyring:

cephadm@adm > ceph-authtool /etc/ceph/ceph.keyring \
  --import-keyring /etc/ceph/ceph.client.admin.keyring
Important

If your keyring is compromised, delete your key from the /etc/ceph directory and create a new key using the instructions from Section 19.2.3.1, “Creating a Keyring”.

19.2.3.3 Creating a User

Ceph provides the ceph auth add command to create a user directly in the Ceph storage cluster. However, you can also create a user, keys and capabilities directly on a Ceph client keyring. Then, you can import the user to the Ceph storage cluster:

cephadm@adm > ceph-authtool -n client.ringo --cap osd 'allow rwx' \
  --cap mon 'allow rwx' /etc/ceph/ceph.keyring

You can also create a keyring and add a new user to the keyring simultaneously:

cephadm@adm > ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo \
  --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key

In the previous scenarios, the new user client.ringo exists only in the keyring. To add the new user to the Ceph storage cluster, you must still run ceph auth add:

cephadm@adm > ceph auth add client.ringo -i /etc/ceph/ceph.keyring

19.2.3.4 Modifying Users

To modify the capabilities of a user record in a keyring, specify the keyring and the user followed by the capabilities:

cephadm@adm > ceph-authtool /etc/ceph/ceph.keyring -n client.ringo \
  --cap osd 'allow rwx' --cap mon 'allow rwx'

To update the modified user within the Ceph cluster environment, you must import the changes from the keyring to the user entry in the Ceph cluster:

cephadm@adm > ceph auth import -i /etc/ceph/ceph.keyring

See Section 19.2.2.7, “Importing Users” for details on updating a Ceph storage cluster user from a keyring.

19.2.4 Command Line Usage

The ceph command supports the following options related to the user name and secret manipulation:

--id or --user

Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --id and --user options enable you to specify the ID portion of the user name (for example admin or user1) and omit the type. For example, to specify user client.foo, enter the following:

cephadm@adm > ceph --id foo --keyring /path/to/keyring health
cephadm@adm > ceph --user foo --keyring /path/to/keyring health
--name or -n

Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --name and -n options enable you to specify the fully qualified user name. You must specify the user type (typically client) with the user ID:

cephadm@adm > ceph --name client.foo --keyring /path/to/keyring health
cephadm@adm > ceph -n client.foo --keyring /path/to/keyring health
--keyring

The path to the keyring containing one or more user names and secrets. The --secret option provides the same functionality, but it does not work with Object Gateway, which uses --secret for another purpose. You may retrieve a keyring with ceph auth get-or-create and store it locally. This is the preferred approach, because you can switch user names without switching the keyring path:

cephadm@adm > rbd map --id foo --keyring /path/to/keyring mypool/myimage

20 Stored Data Management

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH Map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

CRUSH maps contain a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

After you deploy a Ceph cluster, a default CRUSH Map is generated. It is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH Map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.

For example, if an OSD goes down, a CRUSH Map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.

Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or with the power supply to the rack or switch rather than with the OSDs themselves.

A custom CRUSH Map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) (refer to Section 20.4, “Placement Groups”) associated with a failed host are in a degraded state.

There are three main sections to a CRUSH Map.

  • Devices consist of any object storage device corresponding to a ceph-osd daemon.

  • Buckets consist of a hierarchical aggregation of storage locations (for example rows, racks, hosts, etc.) and their assigned weights.

  • Rule Sets consist of the manner of selecting buckets.

20.1 Devices

To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.

#devices
device NUM osd.OSD_NAME class CLASS_NAME

For example:

#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd

As a general rule, an OSD daemon maps to a single disk.
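
When working with a decompiled CRUSH Map, the devices section is easy to read programmatically. The following Python sketch assumes the line format shown above (device NUM osd.OSD_NAME class CLASS_NAME, with the class part optional); the names parse_devices and osds_by_class are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

DEVICE_RE = re.compile(r"^device\s+(\d+)\s+(osd\.\d+)(?:\s+class\s+(\S+))?$")

SAMPLE_DEVICES = """#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd
"""


def parse_devices(text: str) -> List[dict]:
    """Return one record per 'device' line; other lines are ignored."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_RE.match(line.strip())
        if match:
            devices.append({
                "id": int(match.group(1)),
                "name": match.group(2),
                "class": match.group(3),  # None when no class is set
            })
    return devices


def osds_by_class(devices: List[dict]) -> Dict[str, List[str]]:
    """Group OSD names by their device class."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for device in devices:
        groups[device["class"]].append(device["name"])
    return dict(groups)


devices = parse_devices(SAMPLE_DEVICES)
```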

20.1.1 Device Classes

The flexibility of the CRUSH Map in controlling data placement is one of Ceph's strengths. It is also one of the most difficult parts of the cluster to manage. Device classes automate the most common changes to CRUSH Maps that administrators previously had to make manually.

20.1.1.1 The CRUSH Management Problem

Ceph clusters are frequently built with multiple types of storage devices: HDD, SSD, NVMe, or even mixed classes of the above. We call these different types of storage devices device classes to avoid confusion with the type property of CRUSH buckets (for example, host, rack, row; see Section 20.2, “Buckets” for more details). Ceph OSDs backed by SSDs are much faster than those backed by spinning disks, making them better suited for certain workloads. Ceph makes it easy to create RADOS pools for different data sets or workloads and to assign different CRUSH rules to control data placement for those pools.

Figure 20.1: OSDs with Mixed Device Classes

However, setting up the CRUSH rules to place data only on a certain class of device is tedious. Rules work in terms of the CRUSH hierarchy, but if the devices are mixed into the same hosts or racks (as in the sample hierarchy above), they will (by default) be mixed together and appear in the same sub-trees of the hierarchy. Manually separating them out into separate trees involved creating multiple versions of each intermediate node for each device class in previous versions of SUSE Enterprise Storage.

20.1.1.2 Device Classes

An elegant solution that Ceph offers is to add a property called device class to each OSD. By default, OSDs will automatically set their device classes to either 'hdd', 'ssd', or 'nvme' based on the hardware properties exposed by the Linux kernel. These device classes are reported in a new column of the ceph osd tree command output:

cephadm@adm > ceph osd tree
 ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       83.17899 root default
 -4       23.86200     host cpach
 2   hdd  1.81898         osd.2      up  1.00000 1.00000
 3   hdd  1.81898         osd.3      up  1.00000 1.00000
 4   hdd  1.81898         osd.4      up  1.00000 1.00000
 5   hdd  1.81898         osd.5      up  1.00000 1.00000
 6   hdd  1.81898         osd.6      up  1.00000 1.00000
 7   hdd  1.81898         osd.7      up  1.00000 1.00000
 8   hdd  1.81898         osd.8      up  1.00000 1.00000
 15  hdd  1.81898         osd.15     up  1.00000 1.00000
 10  nvme 0.93100         osd.10     up  1.00000 1.00000
 0   ssd  0.93100         osd.0      up  1.00000 1.00000
 9   ssd  0.93100         osd.9      up  1.00000 1.00000

If the automatic device class detection fails, for example because the device driver is not properly exposing information about the device via /sys/block, you can adjust device classes from the command line:

cephadm@adm > ceph osd crush rm-device-class osd.2 osd.3
done removing class of osd(s): 2,3
cephadm@adm > ceph osd crush set-device-class ssd osd.2 osd.3
set osd(s) 2,3 to class 'ssd'

20.1.1.3 CRUSH Placement Rules

CRUSH rules can restrict placement to a specific device class. For example, to create a 'fast' replicated pool that distributes data only over SSD disks, first create a replicated rule using the following syntax:

cephadm@adm > ceph osd crush rule create-replicated RULE_NAME ROOT FAILURE_DOMAIN_TYPE DEVICE_CLASS

For example:

cephadm@adm > ceph osd crush rule create-replicated fast default host ssd

Create a pool named 'fast_pool' and assign it to the 'fast' rule:

cephadm@adm > ceph osd pool create fast_pool 128 128 replicated fast

The process for creating erasure code rules is slightly different. First, you create an erasure code profile that includes a property for your desired device class. Then, use that profile when creating the erasure coded pool:

cephadm@adm > ceph osd erasure-code-profile set myprofile \
 k=4 m=2 crush-device-class=ssd crush-failure-domain=host
cephadm@adm > ceph osd pool create mypool 64 erasure myprofile

In case you need to manually edit the CRUSH Map to customize your rule, the syntax has been extended to allow the device class to be specified. For example, the CRUSH rule generated by the above commands looks as follows:

rule mypool {
  id 2
  type erasure
  min_size 3
  max_size 6
  step set_chooseleaf_tries 5
  step set_choose_tries 100
  step take default class ssd
  step chooseleaf indep 0 type host
  step emit
}

The important difference here is that the 'take' command includes the additional 'class CLASS_NAME' suffix.

20.1.1.4 Additional Commands

To list device classes used in a CRUSH Map, run:

cephadm@adm > ceph osd crush class ls
[
  "hdd",
  "ssd"
]

To list existing CRUSH rules, run:

cephadm@adm > ceph osd crush rule ls
replicated_rule
fast

To view details of the CRUSH rule named 'fast', run:

cephadm@adm > ceph osd crush rule dump fast
{
  "rule_id": 1,
  "rule_name": "fast",
  "ruleset": 1,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
    {
      "op": "take",
      "item": -21,
      "item_name": "default~ssd"
    },
    {
      "op": "chooseleaf_firstn",
      "num": 0,
      "type": "host"
    },
    {
      "op": "emit"
    }
  ]
}
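
The item_name of the take step in this output combines the root and the device class, separated by a tilde (default~ssd). As a minimal sketch, the following Python code extracts both parts from a rule dump. It assumes the JSON layout shown above; the function name take_targets is hypothetical, and the sample is shortened to the fields the code reads.

```python
import json
from typing import List, Optional, Tuple

RULE_DUMP = """
{
  "rule_id": 1,
  "rule_name": "fast",
  "steps": [
    {"op": "take", "item": -21, "item_name": "default~ssd"},
    {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
    {"op": "emit"}
  ]
}
"""


def take_targets(rule_json: str) -> List[Tuple[str, Optional[str]]]:
    """Return (root, device_class) for every take step of a rule."""
    rule = json.loads(rule_json)
    targets = []
    for step in rule["steps"]:
        if step.get("op") == "take":
            root, _, device_class = step["item_name"].partition("~")
            # A take step without a class has no '~' in its item name.
            targets.append((root, device_class or None))
    return targets
```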

To list OSDs that belong to an 'ssd' class, run:

cephadm@adm > ceph osd crush class ls-osd ssd
0
1

20.1.1.5 Migrating from a Legacy SSD Rule to Device Classes

In SUSE Enterprise Storage prior to version 5, you needed to manually edit the CRUSH Map and maintain a parallel hierarchy for each specialized device type (such as SSD) in order to write rules that apply to these devices. Since SUSE Enterprise Storage 5, the device class feature has enabled this transparently.

You can transform a legacy rule and hierarchy to the new class-based rules by using the crushtool command. There are several types of transformation possible:

crushtool --reclassify-root ROOT_NAME DEVICE_CLASS

This command takes everything in the hierarchy beneath ROOT_NAME and adjusts any rules that reference that root via

take ROOT_NAME

to instead

take ROOT_NAME class DEVICE_CLASS

It renumbers the buckets so that the old IDs are used for the specified class’s 'shadow tree'. As a consequence, no data movement occurs.

Example 20.1: crushtool --reclassify-root

Consider the following existing rule:

rule replicated_ruleset {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type rack
   step emit
}

If you reclassify the root 'default' as class 'hdd', the rule will become:

rule replicated_ruleset {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default class hdd
   step chooseleaf firstn 0 type rack
   step emit
}
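
The textual effect of --reclassify-root on a rule can be imitated in a few lines of Python. This sketch covers only the rewrite of the take step; it does not renumber buckets or build the shadow tree as crushtool does, and the function name reclassify_take_steps is hypothetical.

```python
import re

RULE = """rule replicated_ruleset {
   id 0
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type rack
   step emit
}
"""


def reclassify_take_steps(rule_text: str, root: str, device_class: str) -> str:
    """Append 'class DEVICE_CLASS' to every plain 'step take ROOT' line."""
    pattern = re.compile(
        rf"^([ \t]*step take {re.escape(root)})[ \t]*$", re.MULTILINE
    )
    return pattern.sub(rf"\1 class {device_class}", rule_text)


adjusted = reclassify_take_steps(RULE, "default", "hdd")
```

Lines that already carry a class suffix do not match the pattern, so applying the function twice leaves the result unchanged.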
crushtool --set-subtree-class BUCKET_NAME DEVICE_CLASS

This method marks every device in the subtree rooted at BUCKET_NAME with the specified device class.

--set-subtree-class is normally used in conjunction with the --reclassify-root option to ensure that all devices in that root are labeled with the correct class. However, some of those devices may intentionally have a different class, and therefore you do not want to relabel them. In such cases, omit the --set-subtree-class option. Keep in mind that such remapping will not be perfect, because the previous rule distributed data across devices of multiple classes, but the adjusted rules will only map to devices of the specified device class.

crushtool --reclassify-bucket MATCH_PATTERN DEVICE_CLASS DEFAULT_PATTERN

This method allows merging a parallel type-specific hierarchy with the normal hierarchy. For example, many users have CRUSH Maps similar to the following one:

Example 20.2: crushtool --reclassify-bucket
host node1 {
   id -2           # do not change unnecessarily
   # weight 109.152
   alg straw
   hash 0  # rjenkins1
   item osd.0 weight 9.096
   item osd.1 weight 9.096
   item osd.2 weight 9.096
   item osd.3 weight 9.096
   item osd.4 weight 9.096
   item osd.5 weight 9.096
   [...]
}

host node1-ssd {
   id -10          # do not change unnecessarily
   # weight 2.000
   alg straw
   hash 0  # rjenkins1
   item osd.80 weight 2.000
   [...]
}

root default {
   id -1           # do not change unnecessarily
   alg straw
   hash 0  # rjenkins1
   item node1 weight 110.967
   [...]
}

root ssd {
   id -18          # do not change unnecessarily
   # weight 16.000
   alg straw
   hash 0  # rjenkins1
   item node1-ssd weight 2.000
   [...]
}

This function reclassifies each bucket that matches a given pattern. The pattern can look like %suffix or prefix%. In the above example, you would use the pattern %-ssd. For each matched bucket, the remaining portion of the name that matches the '%' wild card specifies the base bucket. All devices in the matched bucket are labeled with the specified device class and then moved to the base bucket. If the base bucket does not exist (for example, if 'node12-ssd' exists but 'node12' does not), then it is created and linked underneath the specified default parent bucket. The old bucket IDs are preserved for the new shadow buckets to prevent data movement. Rules with the take steps that reference the old buckets are adjusted.
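
The pattern matching described above can be sketched in Python as follows. The function name base_bucket is hypothetical, and the sketch assumes that the '%' wild card must match at least one character.

```python
from typing import Optional


def base_bucket(name: str, pattern: str) -> Optional[str]:
    """Return the part of name matched by '%', or None if there is no match.

    Supports the two pattern shapes mentioned in the text:
    '%suffix' and 'prefix%'.
    """
    if pattern.startswith("%"):
        suffix = pattern[1:]
        if suffix and name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    elif pattern.endswith("%"):
        prefix = pattern[:-1]
        if prefix and name.startswith(prefix) and len(name) > len(prefix):
            return name[len(prefix):]
    return None
```

For the example map, base_bucket("node1-ssd", "%-ssd") returns node1, which is the bucket that receives the relabeled devices.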

crushtool --reclassify-bucket BUCKET_NAME DEVICE_CLASS BASE_BUCKET

You can use the --reclassify-bucket option without a wild card to map a single bucket. In the previous example, the 'ssd' bucket needs to be mapped to the default bucket.

The final command to convert the map composed of the above fragments would be as follows:

cephadm@adm > ceph osd getcrushmap -o original
cephadm@adm > crushtool -i original --reclassify \
  --set-subtree-class default hdd \
  --reclassify-root default hdd \
  --reclassify-bucket %-ssd ssd default \
  --reclassify-bucket ssd ssd default \
  -o adjusted

To verify that the conversion is correct, use the --compare option. It tests a large sample of inputs to the CRUSH Map and checks whether both maps return the same results. These inputs are controlled by the same options that apply to the --test option. For the above example, the command would be as follows:

cephadm@adm > crushtool -i original --compare adjusted
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent

Tip

If there were differences, you would see what ratio of inputs are remapped in the parentheses.
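If you want to evaluate the comparison in a script, the per-rule lines can be parsed as in the following Python sketch. It assumes the output has exactly the shape shown in the example above; the function name mismatch_ratios is hypothetical.

```python
import re

# Matches lines such as: rule 0 had 0/10240 mismatched mappings (0)
MISMATCH_RE = re.compile(r"rule (\d+) had (\d+)/(\d+) mismatched mappings")


def mismatch_ratios(output: str) -> dict:
    """Map each rule ID to the fraction of mismatched mappings."""
    ratios = {}
    for line in output.splitlines():
        match = MISMATCH_RE.search(line)
        if match:
            rule_id, mismatched, total = (int(group) for group in match.groups())
            ratios[rule_id] = mismatched / total
    return ratios


sample = """rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent"""

print(mismatch_ratios(sample))
```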

If you are satisfied with the adjusted CRUSH Map, you can apply it to the cluster:

cephadm@adm > ceph osd setcrushmap -i adjusted

20.1.1.6 For More Information

Find more details on CRUSH Maps in Section 20.5, “CRUSH Map Manipulation”.

Find more details on Ceph pools in general in Chapter 22, Managing Storage Pools.

Find more details about erasure coded pools in Chapter 24, Erasure Coded Pools.

20.2 Buckets

CRUSH maps contain a list of OSDs, which can be organized into 'buckets' for aggregating the devices into physical locations.

ID   Type         Description
0    osd          An OSD daemon (osd.1, osd.2, etc.).
1    host         A host name containing one or more OSDs.
2    chassis      Chassis of which the rack is composed.
3    rack         A computer rack. The default is unknownrack.
4    row          A row in a series of racks.
5    pdu          Power distribution unit.
6    pod          Abbreviation for "Point of Delivery": in this context, a group of PDUs, or a group of rows of racks.
7    room         A room containing racks and rows of hosts.
8    datacenter   A physical data center containing rooms.
9    region
10   root

Tip

You can modify the existing types and create your own bucket types.

Ceph’s deployment tools generate a CRUSH Map that contains a bucket for each host, and a root named 'default', which is useful for the default rbd pool. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.

A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm (straw2 by default), and the hash (0 by default, reflecting CRUSH Hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.

[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw2 | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}

The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.

host ceph-osd-server-1 {
        id -17
        alg straw2
        hash 0
        item osd.0 weight 0.546
        item osd.1 weight 0.546
}

row rack-1-row-1 {
        id -16
        alg straw2
        hash 0
        item ceph-osd-server-1 weight 2.00
}

rack rack-3 {
        id -15
        alg straw2
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}

rack rack-2 {
        id -14
        alg straw2
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}

rack rack-1 {
        id -13
        alg straw2
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}

room server-room-1 {
        id -12
        alg straw2
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}

datacenter dc-1 {
        id -11
        alg straw2
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}

root data {
        id -10
        alg straw2
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}
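In the example above, the weight under which a parent lists a rack, room, or data center equals the sum of that bucket's own item weights. The following Python sketch checks this for 'rack-1' and 'server-room-1'. The dictionaries simply copy the item weights from the example, and the helper name bucket_weight is hypothetical.

```python
# Item weights copied from the 'rack-1' and 'server-room-1' buckets above.
rack_1_items = {
    "rack-1-row-1": 2.00,
    "rack-1-row-2": 2.00,
    "rack-1-row-3": 2.00,
    "rack-1-row-4": 2.00,
    "rack-1-row-5": 2.00,
}
server_room_1_items = {"rack-1": 10.00, "rack-2": 10.00, "rack-3": 10.00}


def bucket_weight(items: dict) -> float:
    """Weight of a bucket, taken here as the sum of its item weights."""
    return sum(items.values())


# 'server-room-1' lists 'rack-1' with the sum of the rack's row weights.
print(bucket_weight(rack_1_items), server_room_1_items["rack-1"])
# 'dc-1' lists 'server-room-1' with weight 30.00.
print(bucket_weight(server_room_1_items))
```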

20.3 Rule Sets

CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH Map has a rule for the default root. If you want more roots and more rules, you need to create them later or they will be created automatically when new pools are created.

Note

In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.

A rule takes the following form:

rule rulename {
        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step
}

ruleset

An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.

type

A string. Describes a rule for either a 'replicated' or 'erasure' coded pool. This option is required. Default is replicated.

min_size

An integer. If a pool makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.

max_size

An integer. If a pool makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.

step take bucket

Takes a bucket specified by a name, and begins iterating down the tree. This option is required. For an explanation about iterating through the tree, see Section 20.3.1, “Iterating Through the Node Tree”.

step target mode num type bucket-type

target can either be choose or chooseleaf. When set to choose, a number of buckets is selected. chooseleaf directly selects the OSDs (leaf nodes) from the sub-tree of each bucket in the set of buckets.

mode can either be firstn or indep. See Section 20.3.2, “firstn and indep”.

Selects the number (num) of buckets of the given type. Where N is the number of options available: if num > 0 && num < N, choose that many buckets; if num < 0, choose N - |num| buckets; and if num == 0, choose N buckets (all available). Follows step take or step choose.

step emit

Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.
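The meaning of the num argument of a choose or chooseleaf step can be summarized in a small Python sketch. The function name buckets_to_choose is hypothetical. A negative num is read here as "N minus the absolute value of num", and capping the result between 0 and N is an assumption of this sketch.

```python
def buckets_to_choose(num: int, n: int) -> int:
    """Return how many buckets a step selects for a given num and N."""
    if num == 0:
        return n                    # choose N buckets (all available)
    if num < 0:
        return max(n + num, 0)      # N - |num|, never below zero (assumption)
    return min(num, n)              # choose num buckets, at most N (assumption)


print(buckets_to_choose(0, 3))   # 3
print(buckets_to_choose(2, 3))   # 2
print(buckets_to_choose(-1, 3))  # 2
```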

20.3.1 Iterating Through the Node Tree

The structure defined with the buckets can be viewed as a node tree. Buckets are nodes and OSDs are leaves in this tree.

Rules in the CRUSH Map define how OSDs are selected from this tree. A rule starts with a node and then iterates down the tree to return a set of OSDs. It is not possible to define which branch needs to be selected. Instead, the CRUSH algorithm ensures that the set of OSDs fulfills the replication requirements and evenly distributes the data.

With step take bucket, the iteration through the node tree begins at the given bucket (not bucket type). If OSDs from all branches in the tree are to be returned, the bucket must be the root bucket. Otherwise, the following steps only iterate through a sub-tree.

After step take, one or more step choose entries follow in the rule definition. Each step choose chooses a defined number of nodes (or branches) from the previously selected upper node.

In the end, the selected OSDs are returned with step emit.

step chooseleaf is a convenience function that directly selects OSDs from branches of the given bucket.

Figure 20.2, “Example Tree” provides an example of how step is used to iterate through a tree. The orange arrows and numbers correspond to example1a and example1b, while blue corresponds to example2 in the following rule definitions.

Figure 20.2: Example Tree
# orange arrows
rule example1a {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2)
        step choose firstn 0 host
        # orange (3)
        step choose firstn 1 osd
        step emit
}

rule example1b {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2) + (3)
        step chooseleaf firstn 0 host
        step emit
}

# blue arrows
rule example2 {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # blue (1)
        step take room1
        # blue (2)
        step chooseleaf firstn 0 rack
        step emit
}
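To make the relation between step choose and step chooseleaf more tangible, the following Python sketch walks a small tree. It is a toy model and not the CRUSH algorithm: the tree and all names are hypothetical and do not come from Figure 20.2, a SHA-256 ranking stands in for the CRUSH hash, weights are ignored, and num 0 simply means "all candidates".

```python
import hashlib

# Hypothetical node tree: bucket name -> child names.
TREE = {
    "room1": ["rack1", "rack2"],
    "rack1": ["host1", "host2"],
    "rack2": ["host3", "host4"],
    "host1": ["osd.0", "osd.1"],
    "host2": ["osd.2", "osd.3"],
    "host3": ["osd.4", "osd.5"],
    "host4": ["osd.6", "osd.7"],
}


def node_type(name: str) -> str:
    """Derive the type from the name: 'rack1' -> 'rack', 'osd.3' -> 'osd'."""
    return name.split(".")[0].rstrip("0123456789")


def rank(key: str, name: str) -> str:
    """Deterministic pseudo-random ordering (stand-in for the CRUSH hash)."""
    return hashlib.sha256(f"{key}:{name}".encode()).hexdigest()


def descendants(node: str, wanted_type: str) -> list:
    """All nodes of wanted_type below node."""
    found = []
    for child in TREE.get(node, []):
        if node_type(child) == wanted_type:
            found.append(child)
        else:
            found.extend(descendants(child, wanted_type))
    return found


def choose(nodes: list, num: int, wanted_type: str, key: str) -> list:
    """For each input node, select num nodes of wanted_type (0 = all)."""
    selected = []
    for node in nodes:
        candidates = sorted(descendants(node, wanted_type), key=lambda n: rank(key, n))
        selected.extend(candidates if num == 0 else candidates[:num])
    return selected


def chooseleaf(nodes: list, num: int, wanted_type: str, key: str) -> list:
    """Select buckets of wanted_type, then one OSD below each of them."""
    return choose(choose(nodes, num, wanted_type, key), 1, "osd", key)


key = "pg-1.6c"  # hypothetical input that identifies a placement group
example1a = choose(choose(["rack1"], 0, "host", key), 1, "osd", key)
example1b = chooseleaf(["rack1"], 0, "host", key)
example2 = chooseleaf(["room1"], 0, "rack", key)
print(example1a, example1b, example2)
```

In this model, example1a and example1b return the same OSDs (one per host below rack1), which mirrors the description of step chooseleaf as a convenience function. example2 returns one OSD per rack below room1.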

20.3.2 firstn and indep

A CRUSH rule defines replacements for failed nodes or OSDs (see Section 20.3, “Rule Sets”). Each step choose or step chooseleaf entry requires either firstn or indep as a parameter. Figure 20.3, “Node Replacement Methods” provides an example.

firstn adds replacement nodes to the end of the list of active nodes. In case of a failed node, the following healthy nodes are shifted to the left to fill the gap of the failed node. This is the default and desired method for replicated pools, because a secondary node already has all data and therefore can take over the duties of the primary node immediately.

indep selects fixed replacement nodes for each active node. The replacement of a failed node does not change the order of the remaining nodes. This is desired for erasure coded pools. In erasure coded pools the data stored on a node depends on its position in the node selection. When the order of nodes changes, all data on affected nodes needs to be relocated.

Figure 20.3: Node Replacement Methods
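The difference between the two replacement methods can be sketched in a few lines of Python. The sketch only models the ordering of the resulting list as described above. The function names and OSD names are hypothetical, and how the replacement itself is selected is not modeled.

```python
def replace_firstn(active: list, failed: str, replacement: str) -> list:
    """firstn: healthy nodes shift left, the replacement is appended."""
    survivors = [node for node in active if node != failed]
    return survivors + [replacement]


def replace_indep(active: list, failed: str, replacement: str) -> list:
    """indep: the replacement takes the position of the failed node."""
    return [replacement if node == failed else node for node in active]


active = ["osd.1", "osd.2", "osd.3", "osd.4"]
print(replace_firstn(active, "osd.2", "osd.9"))
# ['osd.1', 'osd.3', 'osd.4', 'osd.9']
print(replace_indep(active, "osd.2", "osd.9"))
# ['osd.1', 'osd.9', 'osd.3', 'osd.4']
```

With indep, 'osd.3' and 'osd.4' keep their positions. This is why that method suits erasure coded pools, where the position of a node determines which data it stores.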

20.4 Placement Groups

Ceph maps objects to placement groups (PGs). Placement groups are shards or fragments of a logical object pool that place objects as a group into OSDs. Placement groups reduce the amount of per-object metadata when Ceph stores the data in OSDs. A larger number of placement groups—for example, 100 per OSD—leads to better balancing.

20.4.1 How Are Placement Groups Used?

A placement group (PG) aggregates objects within a pool. The main reason is that tracking object placement and metadata on a per-object basis is computationally expensive. For example, a system with millions of objects cannot track placement of each of its objects directly.

Figure 20.4: Placement Groups in a Pool

The Ceph client calculates which placement group an object belongs to. It does this by hashing the object ID and applying an operation based on the number of PGs in the defined pool and the ID of the pool.
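The following Python sketch illustrates the principle of this calculation. It is a simplified stand-in and will not reproduce the placement group IDs of a real cluster: zlib.crc32 and a plain modulo replace the hash and the operation that Ceph actually uses, and the function name is hypothetical.

```python
import zlib


def placement_group_id(pool_id: int, object_name: str, pg_num: int) -> str:
    """Return an ID of the form POOL_ID.PG (PG in hexadecimal)."""
    # Stand-in hash: the same object name always yields the same number.
    object_hash = zlib.crc32(object_name.encode("utf-8"))
    # Reduce the hash to one of the pool's pg_num placement groups.
    pg = object_hash % pg_num
    return f"{pool_id}.{pg:x}"


print(placement_group_id(1, "my-object", 128))
```

Because the result depends only on the object name, the pool ID, and the number of PGs, every client can compute it independently without a per-object lookup table.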

The object’s contents within a placement group are stored in a set of OSDs. For example, in a replicated pool of size two, each placement group will store objects on two OSDs:

Figure 20.5: Placement Groups and OSDs

If OSD #2 fails, another OSD will be assigned to placement group #1 and will be filled with copies of all objects in OSD #1. If the pool size is changed from two to three, an additional OSD will be assigned to the placement group and will receive copies of all objects in the placement group.

Placement groups do not own the OSD; they share it with other placement groups from the same pool or even other pools. If OSD #2 fails, placement group #2 will also need to restore copies of objects, using OSD #3.

When the number of placement groups increases, the new placement groups will be assigned OSDs. The result of the CRUSH function will also change and some objects from the former placement groups will be copied over to the new placement groups and removed from the old ones.

20.4.2 Determining the Value of PG_NUM

When creating a new pool, it is mandatory to choose the value of PG_NUM:

root # ceph osd pool create POOL_NAME PG_NUM

PG_NUM cannot be calculated automatically. Following are a few commonly used values, depending on the number of OSDs in the cluster:

Fewer than 5 OSDs:

Set PG_NUM to 128.

Between 5 and 10 OSDs:

Set PG_NUM to 512.

Between 10 and 50 OSDs:

Set PG_NUM to 1024.

As the number of OSDs increases, choosing the right value for PG_NUM becomes more important. PG_NUM strongly affects the behavior of the cluster as well as the durability of the data in case of OSD failure.

20.4.2.1 Number of Placement Groups for More Than 50 OSDs

If you have fewer than 50 OSDs, use the preselection described in Section 20.4.2, “Determining the Value of PG_NUM”. If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability, and distribution. For a single pool of objects, you can use the following formula to get a baseline:

          total PGs = (OSDs * 100) / POOL_SIZE

Where POOL_SIZE is either the number of replicas for replicated pools, or the 'k'+'m' sum for erasure coded pools as returned by the ceph osd erasure-code-profile get command. You should round the result up to the nearest power of 2. Rounding up is recommended for the CRUSH algorithm to evenly balance the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate the number of PGs as follows:

          (200 * 100) / 3 = 6667

The nearest power of 2 is 8192.
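The calculation above, including the rounding up to a power of 2, can be expressed as a short Python function. The function name baseline_pg_count and its parameter names are hypothetical. The default of 100 PGs per OSD is taken from the formula above.

```python
import math


def baseline_pg_count(osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """(OSDs * 100) / POOL_SIZE, rounded up to a power of 2."""
    raw = math.ceil(osds * pgs_per_osd / pool_size)
    power = 1
    while power < raw:
        power *= 2
    return power


print(baseline_pg_count(200, 3))  # 8192
```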

When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD. You need to reach a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.

For example, a cluster of 10 pools, each with 512 placement groups on 10 OSDs, is a total of 5,120 placement groups spread over 10 OSDs, that is 512 placement groups per OSD. Such a setup does not use too many resources. However, if 1000 pools were created with 512 placement groups each, the OSDs would handle approximately 50,000 placement groups each and it would require significantly more resources and time for peering.
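The arithmetic of this example can be reproduced with the following Python sketch. Like the text, it ignores replicas and only divides the total number of placement groups by the number of OSDs. The function name is hypothetical.

```python
def pgs_per_osd(pools: int, pgs_per_pool: int, osds: int) -> float:
    """Total number of placement groups divided by the number of OSDs."""
    return pools * pgs_per_pool / osds


print(pgs_per_osd(10, 512, 10))    # 512.0
print(pgs_per_osd(1000, 512, 10))  # 51200.0
```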

20.4.3 Setting the Number of Placement Groups

To set the number of placement groups in a pool, you need to specify the number of placement groups at the time you create the pool (see Section 22.2.2, “Create a Pool”). Once you have set placement groups for a pool, you may increase the number of placement groups, but you cannot decrease them. To increase the number of placement groups, run the following:

root # ceph osd pool set POOL_NAME pg_num PG_NUM

After you increase the number of placement groups, you also need to increase the number of placement groups for placement (PGP_NUM) before your cluster will rebalance. PGP_NUM will be the number of placement groups that will be considered for placement by the CRUSH algorithm. Increasing PG_NUM splits the placement groups but data will not be migrated to the newer placement groups until PGP_NUM is increased. PGP_NUM should be equal to PG_NUM. To increase the number of placement groups for placement, run the following:

root # ceph osd pool set POOL_NAME pgp_num PGP_NUM

20.4.4 Getting the Number of Placement Groups

To get the number of placement groups in a pool, run the following:

root # ceph osd pool get POOL_NAME pg_num

20.4.5 Getting a Cluster's PG Statistics

To get the statistics for the placement groups in your cluster, run the following:

root # ceph pg dump [--format FORMAT]

Valid formats are 'plain' (default) and 'json'.

20.4.6 Getting Statistics for Stuck PGs

To get the statistics for all placement groups stuck in a specified state, run the following:

root # ceph pg dump_stuck STATE \
 [--format FORMAT] [--threshold THRESHOLD]

STATE is one of 'inactive' (PGs cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up), 'unclean' (PGs contain objects that are not replicated the desired number of times), 'stale' (PGs are in an unknown state—the OSDs that host them have not reported to the monitor cluster in a time interval specified by the mon_osd_report_timeout option), 'undersized', or 'degraded'.

Valid formats are 'plain' (default) and 'json'.

The threshold defines the minimum number of seconds the placement group is stuck before including it in the returned statistics (300 seconds by default).

20.4.7 Getting a Placement Group Map

To get the placement group map for a particular placement group, run the following:

root # ceph pg map PG_ID

Ceph will return the placement group map, the placement group, and the OSD status:

root # ceph pg map 1.6c
osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
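If you need the 'up' and 'acting' OSD sets in a script, the line can be parsed as in the following Python sketch. It assumes the plain text output has exactly the shape shown above; the function name parse_pg_map is hypothetical.

```python
import re

# Matches: osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
PG_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \(\S+\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(line: str) -> dict:
    """Extract the OSD map epoch, the PG ID, and the up/acting OSD IDs."""
    match = PG_MAP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected 'ceph pg map' output: {line!r}")

    def osd_ids(text: str) -> list:
        return [int(item) for item in text.split(",") if item]

    return {
        "epoch": int(match.group("epoch")),
        "pgid": match.group("pgid"),
        "up": osd_ids(match.group("up")),
        "acting": osd_ids(match.group("acting")),
    }


print(parse_pg_map("osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]"))
```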

20.4.8 Getting a Placement Group's Statistics

To retrieve statistics for a particular placement group, run the following:

root # ceph pg PG_ID query

20.4.9 Scrubbing a Placement Group

To scrub (Section 20.6, “Scrubbing”) a placement group, run the following:

root # ceph pg scrub PG_ID

Ceph checks the primary and replica nodes, generates a catalog of all objects in the placement group, and compares them to ensure that no objects are missing or mismatched and their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.

20.4.10 Prioritizing Backfill and Recovery of Placement Groups

You may run into a situation where several placement groups require recovery and/or backfill, while some groups hold data more important than others. For example, some PGs may hold data for images used by running machines, while other PGs may be used by inactive machines or hold less relevant data. In that case, you may want to prioritize recovery of the more important groups so that performance and availability of the data stored on them is restored earlier. To mark particular placement groups as prioritized during backfill or recovery, run the following:

root # ceph pg force-recovery PG_ID1 [PG_ID2 ... ]
root # ceph pg force-backfill PG_ID1 [PG_ID2 ... ]

This causes Ceph to perform recovery or backfill on the specified placement groups first, before other placement groups. It does not interrupt ongoing backfills or recovery, but causes the specified PGs to be processed as soon as possible. If you change your mind or prioritize the wrong groups, cancel the prioritization:

root # ceph pg cancel-force-recovery PG_ID1 [PG_ID2 ... ]
root # ceph pg cancel-force-backfill PG_ID1 [PG_ID2 ... ]

The cancel-* commands remove the 'force' flag from the PGs so that they are processed in default order. Again, this does not affect placement groups currently being processed, only those that are still queued. The 'force' flag is cleared automatically after recovery or backfill of the group is done.

20.4.11 Reverting Lost Objects

If the cluster has lost one or more objects and you have decided to abandon the search for the lost data, you need to mark the unfound objects as 'lost'.

If the objects are still lost after having queried all possible locations, you may need to give up on the lost objects. This is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered.

The supported options are 'revert' and 'delete'. 'revert' either rolls back to a previous version of the object or, in the case of a new object, forgets about it entirely. 'delete' forgets about the unfound objects entirely. To mark the 'unfound' objects as 'lost', run the following:

  cephadm@adm > ceph pg PG_ID mark_unfound_lost revert|delete

20.4.12 PG Auto-scaler

As of the Nautilus release, Ceph includes a new manager module called pg_autoscaler that allows the cluster to consider the amount of data actually stored (or expected to be stored) in each pool and choose appropriate pg_num values automatically.

To enable the auto-scaler:

cephadm@adm > ceph mgr module enable pg_autoscaler

The autoscaler is configured on a per-pool basis, and can run in three modes:

warn

In warn mode, a health warning is issued if the suggested pg_num value is too different from the current value. This is the default for new and existing pools.

on

In autoscaler on mode, the pool pg_num is adjusted automatically with no need for any administrator interaction.

off

The autoscaler can also be turned off for any given pool, leaving the administrator to manage pg_num manually as before.

Note

We recommend using the on mode, but administrators who want to exercise more manual control should use the warn mode.

To enable the autoscaler for a particular pool (named foo in this example):

cephadm@adm > ceph osd pool set foo pg_autoscale_mode on

Once enabled, the current state of all pools and the proposed adjustments can be queried on the command line:

cephadm@adm > ceph osd pool autoscale-status

20.5 CRUSH Map Manipulation

This section introduces basic ways of manipulating a CRUSH Map, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding, moving, or removing an OSD.

20.5.1 Editing a CRUSH Map

To edit an existing CRUSH map, do the following:

  1. Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:

    cephadm@adm > ceph osd getcrushmap -o compiled-crushmap-filename

    Ceph will output (-o) a compiled CRUSH Map to the file name you specified. Since the CRUSH Map is in a compiled form, you must decompile it first before you can edit it.

  2. Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:

    cephadm@adm > crushtool -d compiled-crushmap-filename \
     -o decompiled-crushmap-filename

    Ceph will decompile (-d) the compiled CRUSH Map and output (-o) it to the file name you specified.

  3. Edit at least one of the Devices, Buckets, and Rules parameters.

  4. Compile a CRUSH Map. To compile a CRUSH Map, execute the following:

    cephadm@adm > crushtool -c decompiled-crushmap-filename \
     -o compiled-crushmap-filename

    Ceph will store the compiled CRUSH Map to the file name you specified.

  5. Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:

    cephadm@adm > ceph osd setcrushmap -i compiled-crushmap-filename

    Ceph will input the compiled CRUSH Map of the file name you specified as the CRUSH Map for the cluster.
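If you want to script the procedure, the commands of steps 1, 2, 4, and 5 can be assembled as in the following Python sketch. It only builds and prints the command lines shown above and does not run them. The function name and the file names are hypothetical, and step 3 (the edit itself) remains a manual task.

```python
def crush_edit_commands(compiled: str, decompiled: str, recompiled: str) -> list:
    """Return the commands of steps 1, 2, 4, and 5 as argument lists."""
    return [
        ["ceph", "osd", "getcrushmap", "-o", compiled],     # step 1: get
        ["crushtool", "-d", compiled, "-o", decompiled],    # step 2: decompile
        # Step 3: edit the decompiled file manually.
        ["crushtool", "-c", decompiled, "-o", recompiled],  # step 4: compile
        ["ceph", "osd", "setcrushmap", "-i", recompiled],   # step 5: set
    ]


commands = crush_edit_commands("crushmap.bin", "crushmap.txt", "crushmap.new.bin")
for argv in commands:
    print(" ".join(argv))
```

Writing the recompiled map to a separate file keeps the originally exported map available for a rollback.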

Tip: Use Versioning System

Use a versioning system—such as git or svn—for the exported and modified CRUSH Map files. It makes a possible rollback simple.

Tip: Test the New CRUSH Map

Test the new adjusted CRUSH Map using the crushtool --test command, and compare the result with the state before applying the new CRUSH Map. You may find the following command switches useful: --show-statistics, --show-mappings, --show-bad-mappings, --show-utilization, --show-utilization-all, --show-choose-tries.

20.5.2 Add/Move an OSD

To add or move an OSD in the CRUSH Map of a running cluster, execute the following:

cephadm@adm > ceph osd crush set id_or_name weight root=pool-name \
 bucket-type=bucket-name ...

id

An integer. The numeric ID of the OSD. This option is required.

name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.

root

A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required.

bucket-type

Key/value pairs. You may specify the OSD’s location in the CRUSH hierarchy.

The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.

cephadm@adm > ceph osd crush set osd.0 1.0 root=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1

20.5.3 Difference between ceph osd reweight and ceph osd crush reweight

There are two similar commands that change the 'weight' of a Ceph OSD. The context of their usage is different and may cause confusion.

20.5.3.1 ceph osd reweight

Usage:

cephadm@adm > ceph osd reweight OSD_NAME NEW_WEIGHT

ceph osd reweight sets an override weight on the Ceph OSD. This value is in the range of 0 to 1, and forces CRUSH to reposition the data that would otherwise live on this drive. It does not change the weights assigned to the buckets above the OSD, and is a corrective measure in case the normal CRUSH distribution is not working out quite right. For example, if one of your OSDs is at 90% and the others are at 40%, you could reduce this weight to try and compensate for it.

Note: OSD Weight Is Temporary

Note that ceph osd reweight is not a persistent setting. When an OSD gets marked out, its weight will be set to 0 and when it gets marked in again, the weight will be changed to 1.

20.5.3.2 ceph osd crush reweight

Usage:

cephadm@adm > ceph osd crush reweight OSD_NAME NEW_WEIGHT

ceph osd crush reweight sets the CRUSH weight of the OSD. This weight is an arbitrary value—generally the size of the disk in TB—and controls how much data the system tries to allocate to the OSD.
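How the two weights interact can be approximated with the following Python sketch. It is a rough first-order model and not the CRUSH algorithm: it assumes that the share of data an OSD is expected to receive is proportional to its CRUSH weight multiplied by its override weight. The function name and all values are hypothetical.

```python
def expected_shares(osds: dict) -> dict:
    """Map OSD name -> expected fraction of the data.

    osds maps each OSD name to a tuple (crush_weight, override_weight),
    where override_weight is the 0 to 1 value set by 'ceph osd reweight'.
    """
    effective = {name: crush * override for name, (crush, override) in osds.items()}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Three disks of equal size; osd.2 is too full, so its override is lowered.
shares = expected_shares({
    "osd.0": (4.0, 1.0),
    "osd.1": (4.0, 1.0),
    "osd.2": (4.0, 0.5),
})
print(shares)
```

In this model, lowering the override weight of 'osd.2' to 0.5 reduces its expected share from one third to one fifth, while its CRUSH weight (and therefore the weight of the buckets above it) stays unchanged.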

20.5.4 Remove an OSD

To remove an OSD from the CRUSH Map of a running cluster, execute the following:

cephadm@adm > ceph osd crush remove OSD_NAME

20.5.5 Add a Bucket

To add a bucket to the CRUSH Map of a running cluster, execute the ceph osd crush add-bucket command:

cephadm@adm > ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE

20.5.6 Move a Bucket

To move a bucket to a different location or position in the CRUSH Map hierarchy, execute the following:

cephadm@adm > ceph osd crush move BUCKET_NAME BUCKET_TYPE=BUCKET_NAME [...]

For example:

cephadm@adm > ceph osd crush move bucket1 datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

20.5.7 Remove a Bucket

To remove a bucket from the CRUSH Map hierarchy, execute the following:

cephadm@adm > ceph osd crush remove BUCKET_NAME

Note: Empty Bucket Only

A bucket must be empty before removing it from the CRUSH hierarchy.

20.6 Scrubbing

In addition to making multiple copies of objects, Ceph ensures data integrity by scrubbing placement groups (find more information about placement groups in Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.3.2 “Placement Group”). Ceph scrubbing is analogous to running fsck on the object storage layer. For each placement group, Ceph generates a catalog of all objects and compares each primary object and its replicas to ensure that no objects are missing or mismatched. Daily light scrubbing checks the object size and attributes, while weekly deep scrubbing reads the data and uses checksums to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:

osd max scrubs

The maximum number of simultaneous scrub operations for a Ceph OSD. Default is 1.

osd scrub begin hour, osd scrub end hour

The hours of day (0 to 24) that define a time window during which the scrubbing can happen. By default, begins at 0 and ends at 24.

Important

If the placement group’s scrub interval exceeds the osd scrub max interval setting, the scrub will happen no matter what time window you define for scrubbing.

osd scrub during recovery

Allows scrubs during recovery. Setting this to 'false' will disable scheduling new scrubs while there is an active recovery. Already running scrubs will continue. This option is useful for reducing load on busy clusters. Default is 'true'.

osd scrub thread timeout

The maximum time in seconds before a scrub thread times out. Default is 60.

osd scrub finalize thread timeout

The maximum time in seconds before a scrub finalize thread times out. Default is 60*10.

osd scrub load threshold

The normalized maximum load. Ceph will not scrub when the system load (as defined by the ratio of getloadavg() / number of online cpus) is higher than this number. Default is 0.5.

osd scrub min interval

The minimal interval in seconds for scrubbing Ceph OSD when the Ceph cluster load is low. Default is 60*60*24 (once a day).

osd scrub max interval

The maximum interval in seconds for scrubbing Ceph OSD, irrespective of cluster load. Default is 7*60*60*24 (once a week).

osd scrub chunk min

The minimum number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during a scrub. Default is 5.

osd scrub chunk max

The maximum number of object store chunks to scrub during a single operation. Default is 25.

osd scrub sleep

Time to sleep before scrubbing the next group of chunks. Increasing this value slows down the whole scrub operation, while client operations are less impacted. Default is 0.

osd deep scrub interval

The interval for 'deep' scrubbing (fully reading all data). The osd scrub load threshold option does not affect this setting. Default is 60*60*24*7 (once a week).

osd scrub interval randomize ratio

Adds a random delay to the osd scrub min interval value when scheduling the next scrub job for a placement group. The delay is a random value smaller than the result of osd scrub min interval * osd scrub interval randomize ratio. Therefore, the default setting spreads the scrubs out randomly within the allowed time window of [1, 1.5] * osd scrub min interval. Default is 0.5.

osd deep scrub stride

Read size when doing a deep scrub. Default is 524288 (512 kB).
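Two of the settings above, osd scrub load threshold and osd scrub interval randomize ratio, describe simple calculations. The following Python sketch illustrates them under the defaults listed above. It is not the OSD's scheduling code, and the function names are hypothetical.

```python
import random


def load_allows_scrub(load_average: float, online_cpus: int,
                      threshold: float = 0.5) -> bool:
    """True if the normalized load is not higher than the threshold."""
    return load_average / online_cpus <= threshold


def next_scrub_interval(min_interval: float = 60 * 60 * 24,
                        randomize_ratio: float = 0.5,
                        rng=random) -> float:
    """osd scrub min interval plus a random delay of up to
    min_interval * randomize_ratio seconds."""
    return min_interval + rng.uniform(0, min_interval * randomize_ratio)


print(load_allows_scrub(load_average=3.0, online_cpus=8))  # 0.375 -> True
print(load_allows_scrub(load_average=6.0, online_cpus=8))  # 0.75  -> False
print(next_scrub_interval())  # between 86400 and 129600 seconds
```

On a running system, os.getloadavg() and os.cpu_count() could supply the two inputs of load_allows_scrub.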

21 Ceph Manager Modules

The architecture of the Ceph Manager (refer to Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.2.3 “Ceph Nodes and Daemons” for a brief introduction) allows extending its functionality via modules, such as 'dashboard' (see Part II, “Ceph Dashboard”), 'prometheus' (see Chapter 18, Monitoring and Alerting), or 'balancer'.

To list all available modules, run:

cephadm@adm > ceph mgr module ls
{
        "enabled_modules": [
                "restful",
                "status"
        ],
        "disabled_modules": [
                "dashboard"
        ]
}
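Because the command prints JSON, a script can check whether a module is enabled, as in the following Python sketch. It assumes output shaped like the example above, with module names as plain strings. Other releases may structure the output differently. The function name module_enabled is hypothetical.

```python
import json


def module_enabled(module_ls_output: str, module: str) -> bool:
    """True if module is listed under 'enabled_modules'."""
    data = json.loads(module_ls_output)
    return module in data.get("enabled_modules", [])


sample_output = """
{
        "enabled_modules": [
                "restful",
                "status"
        ],
        "disabled_modules": [
                "dashboard"
        ]
}
"""

print(module_enabled(sample_output, "restful"))    # True
print(module_enabled(sample_output, "dashboard"))  # False
```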

To enable or disable a specific module, run:

cephadm@adm > ceph mgr module enable MODULE-NAME
cephadm@adm > ceph mgr module disable MODULE-NAME

For example:

cephadm@adm > ceph mgr module disable dashboard

To list the services that the enabled modules provide, run:

cephadm@adm > ceph mgr services
{
        "dashboard": "http://myserver.com:7789/",
        "restful": "https://myserver.com:8789/"
}

21.1 Balancer

The balancer module optimizes the placement group (PG) distribution across OSDs for a more balanced deployment. Although the module is enabled by default, automatic balancing is inactive. It supports the following two modes: 'crush-compat' and 'upmap'.

Tip: Current Balancer Configuration

To view the current balancer configuration, run:

cephadm@adm > ceph config-key dump

Important: Supported Mode

We currently only support the 'crush-compat' mode because the 'upmap' mode requires an OSD feature that prevents any pre-Luminous OSDs from connecting to the cluster.

21.1.1 The 'crush-compat' Mode

In 'crush-compat' mode, the balancer adjusts the OSDs' reweight-sets to achieve improved distribution of the data. It moves PGs between OSDs, temporarily causing a HEALTH_WARN cluster state resulting from misplaced PGs.

Tip: Mode Activation

Although 'crush-compat' is the default mode, we recommend activating it explicitly:

cephadm@adm > ceph balancer mode crush-compat

21.1.2 Planning and Executing Data Balancing

Using the balancer module, you can create a plan for data balancing. You can then execute the plan manually, or let the balancer balance PGs continuously.

The decision whether to run the balancer in manual or automatic mode depends on several factors, such as the current data imbalance, cluster size, PG count, or I/O activity. We recommend creating an initial plan and executing it at a time of low I/O load in the cluster. The reason for this is that the initial imbalance will probably be considerable and it is a good practice to keep the impact on clients low. After an initial manual run, consider activating the automatic mode and monitor the rebalance traffic under normal I/O load. The improvements in PG distribution need to be weighed against the rebalance traffic caused by the balancer.

Tip: Movable Fraction of Placement Groups (PGs)

During the process of balancing, the balancer module throttles PG movements so that only a configurable fraction of PGs is moved. The default is 5% and you can adjust the fraction, to 9% for example, by running the following command:

cephadm@adm > ceph config set mgr target_max_misplaced_ratio .09
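To get a feeling for what the fraction means in practice, the following Python sketch estimates how many PGs may be moved at once for a given ratio. It is a rough estimate under the assumption that the fraction is applied to the total PG count of the cluster. The function name and the PG count of 1024 are hypothetical.

```python
from fractions import Fraction


def max_misplaced_pgs(total_pgs, ratio=0.05):
    """Rough upper bound on PGs that may be misplaced at the same time."""
    # Fraction(str(...)) avoids binary floating point rounding for
    # values such as 0.09.
    return int(total_pgs * Fraction(str(ratio)))


# 1024 PGs is an assumed cluster-wide total used only for illustration.
print(max_misplaced_pgs(1024))        # default ratio of 5%
print(max_misplaced_pgs(1024, 0.09))  # ratio raised to 9%
```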

To create and execute a balancing plan, follow these steps:

  1. Check the current cluster score:

    cephadm@adm > ceph balancer eval
  2. Create a plan. For example, 'great_plan':

    cephadm@adm > ceph balancer optimize great_plan
  3. See what changes the 'great_plan' will entail:

    cephadm@adm > ceph balancer show great_plan
  4. Check the potential cluster score if you decide to apply the 'great_plan':

    cephadm@adm > ceph balancer eval great_plan
  5. Execute 'great_plan' once:

    cephadm@adm > ceph balancer execute great_plan
  6. Observe the cluster balancing with the ceph -s command. If you are satisfied with the result, activate automatic balancing:

    cephadm@adm > ceph balancer on

    If you later decide to deactivate automatic balancing, run:

    cephadm@adm > ceph balancer off
Tip: Automatic Balancing without Initial Plan

You can activate automatic balancing without executing an initial plan. In such a case, expect a potentially long-running rebalancing of placement groups.

21.2 Telemetry Module

The telemetry plugin sends the Ceph project anonymous data about the cluster in which the plugin is running.

This (opt-in) component contains counters and statistics on how the cluster has been deployed, the version of Ceph, the distribution of the hosts and other parameters which help the project to gain a better understanding of the way Ceph is used. It does not contain any sensitive data like pool names, object names, object contents, or host names.

The purpose of the telemetry module is to provide an automated feedback loop for the developers. It helps them quantify adoption rates and tracking, and points out things that need to be better explained or validated during configuration to prevent undesirable outcomes.

Note

The telemetry module requires the Ceph Manager nodes to have the ability to push data over HTTPS to the upstream servers. Ensure your corporate firewalls permit this action.

  1. To enable the telemetry module:

    cephadm@adm > ceph mgr module enable telemetry
    Note

    This command only enables you to view your data locally. This command does not share your data with the Ceph community.

  2. To allow the telemetry module to start sharing data:

    cephadm@adm > ceph telemetry on
  3. To disable telemetry data sharing:

    cephadm@adm > ceph telemetry off
  4. To generate a JSON report that can be printed:

    cephadm@adm > ceph telemetry show
  5. To add a contact and description to the report:

    cephadm@adm > ceph config set mgr mgr/telemetry/contact 'John Doe john.doe@example.com'
    cephadm@adm > ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
  6. The module compiles and sends a new report every 24 hours by default. To adjust this interval:

    cephadm@adm > ceph config set mgr mgr/telemetry/interval HOURS

22 Managing Storage Pools

Ceph stores data within pools. Pools are logical groups for storing objects. When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. The following important highlights relate to Ceph pools:

  • Resilience: You can set how many OSDs, buckets, or leaves are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. New pools are created with a default count of replicas set to 3. For erasure coded pools, it is the number of coding chunks (that is m=2 in the erasure code profile).

  • Placement Groups: Internal data structures for storing data in a pool across OSDs. The way Ceph stores data into PGs is defined in a CRUSH Map. You can set the number of placement groups for a pool at its creation. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole. A Ceph PGs per Pool Calculator can help you.

  • CRUSH Rules: When you store data in a pool, objects and their replicas (or chunks in case of erasure coded pools) are placed according to the CRUSH ruleset mapped to the pool. You can create a custom CRUSH rule for your pool.

  • Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.

To organize data into pools, you can list, create, and remove pools. You can also view the usage statistics for each pool.
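The guideline of approximately 100 placement groups per OSD can be turned into a quick estimate. The following Python sketch assumes a cluster with a single replicated pool. Dividing by the replica count and rounding up to a power of two follow commonly used upstream Ceph guidance rather than a rule stated in this guide, and the function name is hypothetical. Treat the result as a starting point only.

```python
def suggested_pg_count(num_osds, pool_size, target_per_osd=100):
    """Estimate pg_num for a single pool spanning all OSDs."""
    # Each PG is stored pool_size times, so divide by the replica count
    # to keep the per-OSD load near the target.
    raw = num_osds * target_per_osd / pool_size
    # Round up to the next power of two (assumed convention).
    power = 1
    while power < raw:
        power *= 2
    return power


# 10 OSDs and 3 replicas are assumed example values.
print(suggested_pg_count(10, 3))
```

For clusters with several pools, divide the budget between the pools instead of applying the estimate to each pool separately.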

22.1 Associate Pools with an Application

Before using pools, you need to associate them with an application. Pools that will be used with CephFS and pools that are automatically created by Object Gateway are associated automatically.

For other cases, you can manually associate a free-form application name with a pool:

cephadm@adm > ceph osd pool application enable pool_name application_name
Tip: Default Application Names

CephFS uses the application name cephfs, RADOS Block Device uses rbd, and Object Gateway uses rgw.

A pool can be associated with multiple applications, and each application can have its own metadata. You can display the application metadata for a given pool using the following command:

cephadm@adm > ceph osd pool application get pool_name

22.2 Operating Pools

This section introduces practical information to perform basic tasks with pools. You can find out how to list, create, and delete pools, as well as show pool statistics or manage snapshots of a pool.

22.2.1 List Pools

To list your cluster’s pools, execute:

cephadm@adm > ceph osd pool ls

22.2.2 Create a Pool

A pool can be created as either 'replicated' to recover from lost OSDs by keeping multiple copies of the objects or 'erasure' to get a kind of generalized RAID5/6 capability. Replicated pools require more raw storage, while erasure coded pools require less raw storage. Default is 'replicated'.

To create a replicated pool, execute:

cephadm@adm > ceph osd pool create pool_name pg_num pgp_num replicated crush_ruleset_name \
expected_num_objects

To create an erasure coded pool, execute:

cephadm@adm > ceph osd pool create pool_name pg_num pgp_num erasure erasure_code_profile \
 crush_ruleset_name expected_num_objects

The ceph osd pool create command can fail if you exceed the limit of placement groups per OSD. The limit is set with the option mon_max_pg_per_osd.

pool_name

The name of the pool. It must be unique. This option is required.

pg_num

The total number of placement groups for the pool. This option is required. Default value is 8.

pgp_num

The total number of placement groups for placement purposes. This should be equal to the total number of placement groups, except for placement group splitting scenarios. This option is required. Default value is 8.

crush_ruleset_name

The name of the CRUSH ruleset for this pool. If the specified ruleset does not exist, the creation of a replicated pool fails with -ENOENT. If you do not specify a ruleset, replicated pools use the ruleset specified by the osd pool default crush replicated ruleset configuration variable, which must exist. Erasure coded pools use 'erasure-code' if the default erasure code profile is used, or POOL_NAME otherwise. This ruleset is created implicitly if it does not exist already.

erasure_code_profile=profile

For erasure coded pools only. Use the erasure code profile. It must be an existing profile as defined by osd erasure-code-profile set.

When you create a pool, set the number of placement groups to a reasonable value. Consider the total number of placement groups per OSD too. Placement groups are computationally expensive, so performance will degrade when you have many pools with many placement groups (for example 50 pools with 100 placement groups each).

See Section 20.4, “Placement Groups” for details on calculating an appropriate number of placement groups for your pool.

expected_num_objects

The expected number of objects for this pool. By setting this value (together with a negative filestore merge threshold), the PG folder splitting happens at the pool creation time. This avoids the latency impact with a runtime folder splitting.
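To judge whether a planned set of pools stays within the placement group limit per OSD, you can estimate the number of PG replicas that each OSD has to carry. The following Python sketch is a simplified model: it assumes that all pools are spread evenly over all OSDs, and the exact accounting done by the monitors may differ. The function names, the 30 OSDs, and the limit of 250 are hypothetical example values. Read the real mon_max_pg_per_osd value from your cluster.

```python
def pg_replicas_per_osd(pools, num_osds):
    """Average PG replicas per OSD; pools is a list of (pg_num, size)."""
    total = sum(pg_num * size for pg_num, size in pools)
    return total / num_osds


def exceeds_limit(pools, num_osds, max_pg_per_osd):
    """True if the estimated load is above the given per-OSD limit."""
    return pg_replicas_per_osd(pools, num_osds) > max_pg_per_osd


# 50 pools with 100 PGs each (the example from the text), 3 replicas,
# spread over an assumed 30 OSDs.
pools = [(100, 3)] * 50
print(pg_replicas_per_osd(pools, 30))
print(exceeds_limit(pools, 30, 250))
```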

22.2.3 Set Pool Quotas

You can set pool quotas for the maximum number of bytes and/or the maximum number of objects per pool. Set each quota with a separate command:

cephadm@adm > ceph osd pool set-quota POOL_NAME max_objects OBJ_COUNT
cephadm@adm > ceph osd pool set-quota POOL_NAME max_bytes BYTES

For example:

cephadm@adm > ceph osd pool set-quota data max_objects 10000

To remove a quota, set its value to 0.

22.2.4 Delete a Pool

Warning: Pool Deletion is Not Reversible

Pools may contain important data. Deleting a pool causes all data in the pool to disappear, and there is no way to recover it.

Because inadvertent pool deletion is a real danger, Ceph implements two mechanisms that prevent pools from being deleted. Both mechanisms must be disabled before a pool can be deleted.

The first mechanism is the NODELETE flag. Each pool has this flag, and its default value is 'false'. To find out the value of this flag on a pool, run the following command:

cephadm@adm > ceph osd pool get pool_name nodelete

If it outputs nodelete: true, it is not possible to delete the pool until you change the flag using the following command:

cephadm@adm > ceph osd pool set pool_name nodelete false

The second mechanism is the cluster-wide configuration parameter mon allow pool delete, which defaults to 'false'. This means that, by default, it is not possible to delete a pool. The error message displayed is:

Error EPERM: pool deletion is disabled; you must first set the
mon_allow_pool_delete config option to true before you can destroy a pool

To delete the pool in spite of this safety setting, you can temporarily set mon allow pool delete to 'true', delete the pool, and then return the parameter to 'false':

cephadm@adm > ceph tell mon.* injectargs --mon-allow-pool-delete=true
cephadm@adm > ceph osd pool delete pool_name pool_name --yes-i-really-really-mean-it
cephadm@adm > ceph tell mon.* injectargs --mon-allow-pool-delete=false

The injectargs command displays the following message:

injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart)

This is merely confirming that the command was executed successfully. It is not an error.
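The deletion procedure consists of several commands that need to be run in a fixed order. The following Python sketch only assembles the commands from this section for a given pool and prints them for review. It does not execute anything. The function name and the pool name 'pool1' are hypothetical.

```python
import shlex


def pool_delete_commands(pool_name):
    """Return the commands from this section, in order, as strings."""
    pool = shlex.quote(pool_name)
    return [
        # Inspect and clear the per-pool NODELETE flag.
        f"ceph osd pool get {pool} nodelete",
        f"ceph osd pool set {pool} nodelete false",
        # Temporarily allow pool deletion cluster-wide.
        "ceph tell mon.* injectargs --mon-allow-pool-delete=true",
        f"ceph osd pool delete {pool} {pool} --yes-i-really-really-mean-it",
        # Restore the safety setting afterward.
        "ceph tell mon.* injectargs --mon-allow-pool-delete=false",
    ]


for command in pool_delete_commands("pool1"):
    print(command)
```

Printing the commands first lets you verify the pool name before you run anything destructive.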

If you created your own rulesets and rules for a pool you created, you should consider removing them when you no longer need your pool.

22.2.5 Rename a Pool

To rename a pool, execute:

cephadm@adm > ceph osd pool rename current-pool-name new-pool-name

If you rename a pool and you have per-pool capabilities for an authenticated user, you must update the user’s capabilities with the new pool name.

22.2.6 Show Pool Statistics

To show a pool’s usage statistics, execute:

cephadm@adm > rados df
POOL_NAME                    USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD  WR_OPS      WR USED COMPR UNDER COMPR
.rgw.root                 768 KiB       4      0     12                  0       0        0      44  44 KiB       4   4 KiB        0 B         0 B
cephfs_data               960 KiB       5      0     15                  0       0        0    5502 2.1 MiB      14  11 KiB        0 B         0 B
cephfs_metadata           1.5 MiB      22      0     66                  0       0        0      26  78 KiB     176 147 KiB        0 B         0 B
default.rgw.buckets.index     0 B       1      0      3                  0       0        0       4   4 KiB       1     0 B        0 B         0 B
default.rgw.control           0 B       8      0     24                  0       0        0       0     0 B       0     0 B        0 B         0 B
default.rgw.log               0 B     207      0    621                  0       0        0 5372132 5.1 GiB 3579618     0 B        0 B         0 B
default.rgw.meta          961 KiB       6      0     18                  0       0        0     155 140 KiB      14   7 KiB        0 B         0 B
example_rbd_pool          2.1 MiB      18      0     54                  0       0        0 3350841 2.7 GiB     118  98 KiB        0 B         0 B
iscsi-images              769 KiB       8      0     24                  0       0        0 1559261 1.3 GiB      61  42 KiB        0 B         0 B
mirrored-pool             1.1 MiB      10      0     30                  0       0        0  475724 395 MiB      54  48 KiB        0 B         0 B
pool2                         0 B       0      0      0                  0       0        0       0     0 B       0     0 B        0 B         0 B
pool3                     333 MiB      37      0    111                  0       0        0 3169308 2.5 GiB   14847 118 MiB        0 B         0 B
pool4                     1.1 MiB      13      0     39                  0       0        0 1379568 1.1 GiB   16840  16 MiB        0 B         0 B

A description of the individual columns follows:

USED

Number of bytes used by the pool.

OBJECTS

Number of objects stored in the pool.

CLONES

Number of clones stored in the pool. When you write to an object after a snapshot has been created, Ceph creates a clone of the object instead of modifying the original, so that the snapshotted content stays unchanged.

COPIES

Number of object replicas. For example, if a replicated pool with the replication factor 3 has 'x' objects, it will normally have 3 * x copies.

MISSING_ON_PRIMARY

Number of objects in the degraded state (not all copies exist) while the copy is missing on the primary OSD.

UNFOUND

Number of unfound objects.

DEGRADED

Number of degraded objects.

RD_OPS

Total number of read operations requested for this pool.

RD

Total number of bytes read from this pool.

WR_OPS

Total number of write operations requested for this pool.

WR

Total number of bytes written to the pool. Note that it is not the same as the pool's usage because you can write to the same object many times. The result is that the pool's usage will remain the same but the number of bytes written to the pool will grow.

USED COMPR

Number of bytes allocated for compressed data.

UNDER COMPR

Number of bytes that the compressed data occupy when it is not compressed.
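The relation between the OBJECTS and COPIES columns can be checked against the sample output above. The following Python sketch uses a few rows copied from that output and derives the implied replication factor for each pool. The function name is hypothetical.

```python
# (pool, objects, copies) taken from the sample `rados df` output above.
ROWS = [
    (".rgw.root", 4, 12),
    ("cephfs_data", 5, 15),
    ("cephfs_metadata", 22, 66),
    ("default.rgw.log", 207, 621),
    ("pool3", 37, 111),
]


def implied_replication(objects, copies):
    """Return copies / objects, or None for an empty pool."""
    return copies / objects if objects else None


for pool, objects, copies in ROWS:
    print(pool, implied_replication(objects, copies))
```

Each of the listed pools yields a factor of 3, which matches a replicated pool with three replicas.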

22.2.7 Get Pool Values

To get a value from a pool, execute:

cephadm@adm > ceph osd pool get pool-name key

You can get values for keys listed in Section 22.2.8, “Set Pool Values” plus the following keys:

pg_num

The number of placement groups for the pool.

pgp_num

The effective number of placement groups to use when calculating data placement. Valid range is equal to or less than pg_num.

Tip: All of a Pool's Values

To list all values related to a specific pool, run:

cephadm@adm > ceph osd pool get POOL_NAME all

22.2.8 Set Pool Values

To set a value to a pool, execute:

cephadm@adm > ceph osd pool set pool-name KEY VALUE

The following is a list of pool values sorted by pool type:

Common Pool Values
crash_replay_interval

The number of seconds to allow clients to replay acknowledged, but uncommitted requests.

pg_num

The number of placement groups for the pool. If you add new OSDs to the cluster, verify the value for placement groups on all pools targeted for the new OSDs.

pgp_num

The effective number of placement groups to use when calculating data placement.

crush_ruleset

The ruleset to use for mapping object placement in the cluster.

hashpspool

Set (1) or unset (0) the HASHPSPOOL flag on a given pool. Enabling this flag changes the algorithm to better distribute PGs to OSDs. After enabling this flag on a pool whose HASHPSPOOL flag was set to the default 0, the cluster starts backfilling to have a correct placement of all PGs again. Be aware that this can create quite substantial I/O load on a cluster, therefore do not enable the flag from 0 to 1 on highly loaded production clusters.

nodelete

Prevents the pool from being removed.

nopgchange

Prevents the pool's pg_num and pgp_num from being changed.

noscrub,nodeep-scrub

Disables (deep) scrubbing of the data for the specific pool to resolve temporary high I/O load.

write_fadvise_dontneed

Set or unset the WRITE_FADVISE_DONTNEED flag on a given pool's read/write requests to bypass putting data into cache. Default is false. Applies to both replicated and EC pools.

scrub_min_interval

The minimum interval in seconds for pool scrubbing when the cluster load is low. The default 0 means that the osd_scrub_min_interval value from the Ceph configuration file is used.

scrub_max_interval

The maximum interval in seconds for pool scrubbing, regardless of the cluster load. The default 0 means that the osd_scrub_max_interval value from the Ceph configuration file is used.

deep_scrub_interval

The interval in seconds for the pool deep scrubbing. The default 0 means that the osd_deep_scrub_interval value from the Ceph configuration file is used.

Replicated Pool Values
size

Sets the number of replicas for objects in the pool. See Section 22.2.9, “Set the Number of Object Replicas” for further details. Replicated pools only.

min_size

Sets the minimum number of replicas required for I/O. See Section 22.2.9, “Set the Number of Object Replicas” for further details. Replicated pools only.

nosizechange

Prevents the pool's size from being changed. When a pool is created, the default value is taken from the value of the osd_pool_default_flag_nosizechange parameter which is false by default. Applies to replicated pools only because you cannot change size for EC pools.

hit_set_type

Enables hit set tracking for cache pools. See Bloom Filter for additional information. This option can have the following values: bloom, explicit_hash, explicit_object. Default is bloom, other values are for testing only.

hit_set_count

The number of hit sets to store for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon. Default is 0.

hit_set_period

The duration of a hit set period in seconds for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon. When a pool is created, the default value is taken from the value of the osd_tier_default_cache_hit_set_period parameter, which is 1200 by default. Applies to replicated pools only because EC pools cannot be used as a cache tier.

hit_set_fpp

The false positive probability for the bloom hit set type. See Bloom Filter for additional information. Valid range is 0.0 - 1.0. Default is 0.05.

use_gmt_hitset

Force OSDs to use GMT (Greenwich Mean Time) time stamps when creating a hit set for cache tiering. This ensures that nodes in different time zones return the same result. Default is 1. This value should not be changed.

cache_target_dirty_ratio

The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool. Default is 0.4.

cache_target_dirty_high_ratio

The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool with a higher speed. Default is 0.6.

cache_target_full_ratio

The percentage of the cache pool containing unmodified (clean) objects before the cache tiering agent will evict them from the cache pool. Default is 0.8.

target_max_bytes

Ceph will begin flushing or evicting objects when the max_bytes threshold is triggered.

target_max_objects

Ceph will begin flushing or evicting objects when the max_objects threshold is triggered.

hit_set_grade_decay_rate

Temperature decay rate between two successive hit_sets. Default is 20.

hit_set_search_last_n

Count at most N appearances in hit_sets for temperature calculation. Default is 1.

cache_min_flush_age

The time (in seconds) before the cache tiering agent will flush an object from the cache pool to the storage pool.

cache_min_evict_age

The time (in seconds) before the cache tiering agent will evict an object from the cache pool.

Erasure Coded Pool Values
fast_read

If this flag is enabled on erasure coding pools, then the read request issues sub-reads to all shards, and waits until it receives enough shards to decode to serve the client. In the case of jerasure and isa erasure plug-ins, when the first K replies return, then the client’s request is served immediately using the data decoded from these replies. This approach causes more CPU load and less disk/network load. Currently, this flag is only supported for erasure coding pools. Default is 0.

22.2.9 Set the Number of Object Replicas

To set the number of object replicas on a replicated pool, execute the following:

cephadm@adm > ceph osd pool set poolname size num-replicas

The num-replicas includes the object itself. For example, if you want the object and two copies of the object for a total of three instances of the object, specify 3.

Warning: Do Not Set Less Than 3 Replicas

If you set the num-replicas to 2, there will be only one additional copy of your data. If you lose one object instance, you need to trust that the other copy has not been corrupted, for example since the last scrubbing during recovery (refer to Section 20.6, “Scrubbing” for details).

Setting a pool to one replica means that there is exactly one instance of the data object in the pool. If the OSD fails, you lose the data. A possible usage for a pool with one replica is storing temporary data for a short time.

Tip: Setting More Than 3 Replicas

Setting 4 replicas for a pool increases the reliability by 25%.

In case of two data centers, you need to set at least 4 replicas for a pool to have two copies in each data center so that if one data center is lost, two copies still exist and you can still lose one disk without losing data.

Note

An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, you should use the min_size setting. For example:

cephadm@adm > ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.
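The interplay of size and min_size can be summarized in a few lines. The following Python sketch is a simplified model of the rule described above and not Ceph's implementation. The function names are hypothetical, and the values size 3 and min_size 2 are example settings.

```python
def accepts_io(available_replicas, min_size):
    """I/O is served only while at least min_size replicas are available."""
    return available_replicas >= min_size


def is_degraded(available_replicas, size):
    """An object is degraded when fewer than size replicas exist."""
    return available_replicas < size


# Example pool with size 3 and min_size 2.
for available in (3, 2, 1):
    print(available, accepts_io(available, 2), is_degraded(available, 3))
```

With two of three replicas available, the pool still serves I/O in degraded mode. With only one replica left, I/O is blocked until recovery restores at least min_size replicas.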

Tip: Get the Number of Object Replicas

To get the number of object replicas, execute the following:

cephadm@adm > ceph osd dump | grep 'replicated size'

Ceph will list the pools, with the replicated size attribute highlighted. By default, Ceph creates two replicas of an object (a total of three copies, or a size of 3).

22.3 Pool Migration

When creating a pool (see Section 22.2.2, “Create a Pool”) you need to specify its initial parameters, such as the pool type or the number of placement groups. If you later decide to change any of these parameters—for example when converting a replicated pool into an erasure coded one, or decreasing the number of placement groups—you need to migrate the pool data to another one whose parameters suit your deployment.

This section describes two migration methods—a cache tier method for general pool data migration, and a method using rbd migrate sub-commands to migrate RBD images to a new pool. Each method has its specifics and limitations.

22.3.1 Limitations

  • You can use the cache tier method to migrate from a replicated pool to either an EC pool or another replicated pool. Migrating from an EC pool is not supported.

  • You cannot migrate RBD images and CephFS exports from a replicated pool to an EC pool. The reason is that EC pools do not support omap, while RBD and CephFS use omap to store their metadata. For example, the header object of the RBD will fail to be flushed. However, you can migrate the data to an EC pool and leave the metadata in a replicated pool.

  • The rbd migration method allows migrating images with minimal client downtime. You only need to stop the client before the prepare step and start it afterward. Note that only a librbd client that supports this feature (Ceph Nautilus or newer) will be able to open the image just after the prepare step, while older librbd clients or the krbd clients will not be able to open the image until the commit step is executed.

22.3.2 Migrate Using Cache Tier

The principle is simple—include the pool that you need to migrate into a cache tier in reverse order. Find more details on cache tiers in Book “Tuning Guide”, Chapter 7 “Cache Tiering”. The following example migrates a replicated pool named 'testpool' to an erasure coded pool:

Procedure 22.1: Migrating Replicated to Erasure Coded Pool
  1. Create a new erasure coded pool named 'newpool'. Refer to Section 22.2.2, “Create a Pool” for a detailed explanation of pool creation parameters.

     cephadm@adm > ceph osd pool create newpool PG_NUM PGP_NUM erasure default

    Verify that the used client keyring provides at least the same capabilities for 'newpool' as it does for 'testpool'.

    Now you have two pools: the original replicated 'testpool' filled with data, and the new empty erasure coded 'newpool':

    Figure 22.1: Pools before Migration
  2. Set up the cache tier and configure the replicated pool 'testpool' as a cache pool. The --force-nonempty option allows adding a cache tier even if the pool already has data:

    cephadm@adm > ceph tell mon.* injectargs \
     '--mon_debug_unsafe_allow_tier_with_nonempty_snaps=1'
    cephadm@adm > ceph osd tier add newpool testpool --force-nonempty
    cephadm@adm > ceph osd tier cache-mode testpool proxy
    Figure 22.2: Cache Tier Setup
  3. Force the cache pool to move all objects to the new pool:

    cephadm@adm > rados -p testpool cache-flush-evict-all
    Figure 22.3: Data Flushing
  4. Until all the data has been flushed to the new erasure coded pool, you need to specify an overlay so that objects are searched on the old pool:

    cephadm@adm > ceph osd tier set-overlay newpool testpool

    With the overlay, all operations are forwarded to the old replicated 'testpool':

    Figure 22.4: Setting Overlay

    Now you can switch all the clients to access objects on the new pool.

  5. After all data is migrated to the erasure coded 'newpool', remove the overlay and the old cache pool 'testpool':

    cephadm@adm > ceph osd tier remove-overlay newpool
    cephadm@adm > ceph osd tier remove newpool testpool
    Figure 22.5: Migration Complete
  6. Turn off the monitor option that you turned on in step 2:

    cephadm@adm > ceph tell mon.* injectargs \
     '--mon_debug_unsafe_allow_tier_with_nonempty_snaps=0'

22.3.3 Migrating RBD Images

The following is the recommended way to migrate RBD images from one replicated pool to another replicated pool.

  1. Stop clients (such as a virtual machine) from accessing the RBD image.

  2. Create a new image in the target pool, with the parent set to the source image:

    cephadm@adm > rbd migration prepare SRC_POOL/IMAGE TARGET_POOL/IMAGE
    Tip: Migrate Only Data to an EC Pool

    If you need to migrate only the image data to a new EC pool and leave the metadata in the original replicated pool, run the following command instead:

    cephadm@adm > rbd migration prepare SRC_POOL/IMAGE \
     --data-pool TARGET_POOL
  3. Let clients access the image in the target pool.

  4. Migrate data to the target pool:

    cephadm@adm > rbd migration execute SRC_POOL/IMAGE
  5. Remove the old image:

    cephadm@adm > rbd migration commit SRC_POOL/IMAGE

22.4 Pool Snapshots

Pool snapshots are snapshots of the state of the whole Ceph pool. With pool snapshots, you can retain the history of the pool's state. Creating pool snapshots consumes storage space proportional to the pool size. Always check the related storage for enough disk space before creating a snapshot of a pool.

22.4.1 Make a Snapshot of a Pool

To make a snapshot of a pool, run:

cephadm@adm > ceph osd pool mksnap POOL-NAME SNAP-NAME

For example:

cephadm@adm > ceph osd pool mksnap pool1 snap1
created pool pool1 snap snap1

22.4.2 List Snapshots of a Pool

To list existing snapshots of a pool, run:

cephadm@adm > rados lssnap -p POOL_NAME

For example:

cephadm@adm > rados lssnap -p pool1
1	snap1	2018.12.13 09:36:20
2	snap2	2018.12.13 09:46:03
2 snaps

22.4.3 Remove a Snapshot of a Pool

To remove a snapshot of a pool, run:

cephadm@adm > ceph osd pool rmsnap POOL-NAME SNAP-NAME

22.5 Data Compression

BlueStore (find more details in Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.4 “BlueStore”) provides on-the-fly data compression to save disk space. The compression ratio depends on the data stored in the system. Note that compression/decompression requires additional CPU power.

You can configure data compression globally (see Section 22.5.3, “Global Compression Options”) and then override specific compression settings for each individual pool.

You can enable or disable pool data compression, or change the compression algorithm and mode at any time, regardless of whether the pool contains data or not.

No compression will be applied to existing data after enabling the pool compression.

After disabling the compression of a pool, all its data will be decompressed.

22.5.1 Enable Compression

To enable data compression for a pool named POOL_NAME, run the following command:

cephadm@adm > ceph osd pool set POOL_NAME compression_algorithm COMPRESSION_ALGORITHM
cephadm@adm > ceph osd pool set POOL_NAME compression_mode COMPRESSION_MODE
Tip: Disabling Pool Compression

To disable data compression for a pool, use 'none' as the compression algorithm:

cephadm@adm > ceph osd pool set POOL_NAME compression_algorithm none

22.5.2 Pool Compression Options

A full list of compression settings:

compression_algorithm

Possible values are none, zstd, snappy. Default is snappy.

Which compression algorithm to use depends on the specific use case. Several recommendations follow:

  • Use the default snappy as long as you do not have a good reason to change it.

  • zstd offers a good compression ratio, but causes high CPU overhead when compressing small amounts of data.

  • Run a benchmark of these algorithms on a sample of your actual data while keeping an eye on the CPU and memory usage of your cluster.

compression_mode

Possible values are none, aggressive, passive, force. Default is none.

  • none: never compress

  • passive: compress only if hinted COMPRESSIBLE

  • aggressive: compress unless hinted INCOMPRESSIBLE

  • force: always compress

compression_required_ratio

Value: Double, Ratio = SIZE_COMPRESSED / SIZE_ORIGINAL. Default is 0.875, which means that if the compression does not reduce the occupied space by at least 12.5%, the object will not be compressed.

Objects above this ratio will not be stored compressed because of the low net gain.

compression_max_blob_size

Value: Unsigned Integer, size in bytes. Default: 0

Maximum size of objects that are compressed.

compression_min_blob_size

Value: Unsigned Integer, size in bytes. Default: 0

Minimum size of objects that are compressed.
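
As a rough illustration of the compression_required_ratio rule described above, the following Python sketch checks whether a blob would be kept in compressed form. The function name is_stored_compressed and its arguments are hypothetical and not part of Ceph. The sketch assumes that a result exactly equal to the ratio still counts as acceptable.

```python
def is_stored_compressed(size_original, size_compressed, required_ratio=0.875):
    """Return True if a blob of the given sizes would be kept compressed.

    Hypothetical helper for illustration only; it is not a Ceph API.
    """
    if size_original <= 0:
        raise ValueError("size_original must be positive")
    # Ratio = SIZE_COMPRESSED / SIZE_ORIGINAL, as defined for the option.
    ratio = size_compressed / size_original
    # Objects above the required ratio are stored uncompressed.
    return ratio <= required_ratio


# 4096 bytes shrinking to 3072 bytes gives a ratio of 0.75.
print(is_stored_compressed(4096, 3072))
# 4096 bytes shrinking to only 3900 bytes gives a ratio of about 0.95.
print(is_stored_compressed(4096, 3900))
```

With the default of 0.875, the first call reports a worthwhile saving and the second does not.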

22.5.3 Global Compression Options

The following configuration options can be set in the Ceph configuration and apply to all OSDs and not only a single pool. The pool specific configuration listed in Section 22.5.2, “Pool Compression Options” takes precedence.

bluestore_compression_algorithm

See compression_algorithm

bluestore_compression_mode

See compression_mode

bluestore_compression_required_ratio

See compression_required_ratio

bluestore_compression_min_blob_size

Value: Unsigned Integer, size in bytes. Default: 0

Minimum size of objects that are compressed. The setting is ignored by default in favor of bluestore_compression_min_blob_size_hdd and bluestore_compression_min_blob_size_ssd. It takes precedence when set to a non-zero value.

bluestore_compression_max_blob_size

Value: Unsigned Integer, size in bytes. Default: 0

Maximum size of objects that are compressed before they will be split into smaller chunks. The setting is ignored by default in favor of bluestore_compression_max_blob_size_hdd and bluestore_compression_max_blob_size_ssd. It takes precedence when set to a non-zero value.

bluestore_compression_min_blob_size_ssd

Value: Unsigned Integer, size in bytes. Default: 8K

Minimum size of objects that are compressed and stored on solid-state drive.

bluestore_compression_max_blob_size_ssd

Value: Unsigned Integer, size in bytes. Default: 64K

Maximum size of objects that are compressed and stored on solid-state drive before they will be split into smaller chunks.

bluestore_compression_min_blob_size_hdd

Value: Unsigned Integer, size in bytes. Default: 128K

Minimum size of objects that are compressed and stored on hard disks.

bluestore_compression_max_blob_size_hdd

Value: Unsigned Integer, size in bytes. Default: 512K

Maximum size of objects that are compressed and stored on hard disks before they will be split into smaller chunks.
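
The precedence between the generic blob size options and their _hdd and _ssd variants can be sketched as follows. This is a minimal Python illustration of the rule stated above (a non-zero generic value wins, otherwise the device-specific value applies). The function effective_blob_size and the DEFAULTS table are hypothetical names, and pool-level overrides are deliberately left out.

```python
# Defaults as listed in this section, in bytes.
DEFAULTS = {
    "min": {"generic": 0, "ssd": 8 * 1024, "hdd": 128 * 1024},
    "max": {"generic": 0, "ssd": 64 * 1024, "hdd": 512 * 1024},
}


def effective_blob_size(kind, rotational, settings=None):
    """Resolve the blob size limit that applies to one OSD.

    kind       -- "min" or "max"
    rotational -- True for hard disks, False for solid-state drives
    settings   -- optional overrides, for example {"generic": 4096}
    Hypothetical helper for illustration only.
    """
    values = dict(DEFAULTS[kind])
    values.update(settings or {})
    # A non-zero generic value takes precedence over the device variants.
    if values["generic"] != 0:
        return values["generic"]
    return values["hdd"] if rotational else values["ssd"]


print(effective_blob_size("min", rotational=True))
print(effective_blob_size("max", rotational=False, settings={"generic": 32768}))
```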

23 RADOS Block Device

A block is a sequence of bytes, for example a 4 MB block of data. Block-based storage interfaces are the most common way to store data on rotating media, such as hard disks, CDs, and floppy disks. The ubiquity of block device interfaces makes a virtual block device an ideal candidate to interact with a mass data storage system like Ceph.

Ceph block devices allow sharing of physical resources, and are resizable. They store data striped over multiple OSDs in a Ceph cluster. Ceph block devices leverage RADOS capabilities such as snapshotting, replication, and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library.

Figure 23.1: RADOS Protocol

Ceph's block devices deliver high performance with infinite scalability to kernel modules. They support virtualization solutions such as QEMU, or cloud-based computing systems such as OpenStack that rely on libvirt. You can use the same cluster to operate the Object Gateway, CephFS, and RADOS Block Devices simultaneously.

23.1 Block Device Commands

The rbd command enables you to create, list, introspect, and remove block device images. You can also use it, for example, to clone images, create snapshots, roll back an image to a snapshot, or view a snapshot.

23.1.1 Creating a Block Device Image in a Replicated Pool

Before you can add a block device to a client, you need to create a related image in an existing pool (see Chapter 22, Managing Storage Pools):

cephadm@adm > rbd create --size MEGABYTES POOL-NAME/IMAGE-NAME

For example, to create a 1 GB image named 'myimage' that stores information in a pool named 'mypool', execute the following:

cephadm@adm > rbd create --size 1024 mypool/myimage
Tip: Image Size Units

If you omit a size unit shortcut ('G' or 'T'), the image's size is in megabytes. Use 'G' or 'T' after the size number to specify gigabytes or terabytes.
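
The unit rule from the tip above can be sketched in Python. The helper size_to_megabytes is a hypothetical name. It assumes binary multiples (1 G = 1024 megabytes), which matches the 1 GB example created with --size 1024, and it handles only the 'G' and 'T' suffixes mentioned here.

```python
# Multipliers relative to megabytes; an empty suffix means megabytes.
UNIT_FACTORS = {"": 1, "G": 1024, "T": 1024 * 1024}


def size_to_megabytes(size):
    """Convert a size such as '1024', '1G' or '2T' to megabytes.

    Hypothetical helper for illustration; it does not call rbd.
    """
    text = str(size).strip().upper()
    number = text.rstrip("GT")
    unit = text[len(number):]
    if not number.isdigit() or unit not in UNIT_FACTORS:
        raise ValueError("unsupported size: {!r}".format(size))
    return int(number) * UNIT_FACTORS[unit]


print(size_to_megabytes("1G"))
print(size_to_megabytes("1024"))
```

Under these assumptions, both calls describe the same image size.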

23.1.2 Creating a Block Device Image in an Erasure Coded Pool

As of SUSE Enterprise Storage 5, it is possible to store data of a block device image directly in erasure coded (EC) pools. A RADOS Block Device image consists of data and metadata parts. You can store only the 'data' part of a RADOS Block Device image in an EC pool. The pool needs to have the allow_ec_overwrites flag set to true, and that is only possible if all OSDs where the pool is stored use BlueStore.

You cannot store the image's 'metadata' part in an EC pool. You need to specify the replicated pool for storing the image's metadata with the --pool= option of the rbd create command.

Use the following steps to create an RBD image in a newly created EC pool:

cephadm@adm > ceph osd pool create POOL_NAME 12 12 erasure
cephadm@adm > ceph osd pool set POOL_NAME allow_ec_overwrites true

#Metadata will reside in pool "OTHER_POOL", and data in pool "POOL_NAME"
cephadm@adm > rbd create IMAGE_NAME --size=1G --data-pool POOL_NAME --pool=OTHER_POOL

23.1.3 Listing Block Device Images

To list block devices in a pool named 'mypool', execute the following:

cephadm@adm > rbd ls mypool

23.1.4 Retrieving Image Information

To retrieve information from an image 'myimage' within a pool named 'mypool', run the following:

cephadm@adm > rbd info mypool/myimage

23.1.5 Resizing a Block Device Image

RADOS Block Device images are thin provisioned—they do not actually use any physical storage until you begin saving data to them. However, they do have a maximum capacity that you set with the --size option. If you want to increase (or decrease) the maximum size of the image, run the following:

cephadm@adm > rbd resize --size 2048 POOL_NAME/IMAGE_NAME # to increase
cephadm@adm > rbd resize --size 1024 POOL_NAME/IMAGE_NAME --allow-shrink # to decrease
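
When resizing is scripted, the --allow-shrink option only needs to be added when the new size is smaller than the current one. The following Python sketch builds the argument list accordingly. The function resize_command is a hypothetical helper, sizes are assumed to be in megabytes, and the command is only constructed, not executed.

```python
def resize_command(image_spec, current_mb, new_mb):
    """Build the argument list for 'rbd resize'.

    image_spec -- POOL_NAME/IMAGE_NAME
    Hypothetical helper; the caller decides how to run the command.
    """
    argv = ["rbd", "resize", "--size", str(new_mb), image_spec]
    # Shrinking can destroy data, so it must be requested explicitly.
    if new_mb < current_mb:
        argv.append("--allow-shrink")
    return argv


print(" ".join(resize_command("mypool/myimage", 1024, 2048)))
print(" ".join(resize_command("mypool/myimage", 2048, 1024)))
```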

23.1.6 Removing a Block Device Image

To remove a block device that corresponds to an image 'myimage' in a pool named 'mypool', run the following:

cephadm@adm > rbd rm mypool/myimage

23.2 Mounting and Unmounting

After you create a RADOS Block Device, you can use it like any other disk device: format it, mount it to be able to exchange files, and unmount it when done.

  1. Make sure your Ceph cluster includes a pool with the disk image you want to map. Assume the pool is called mypool and the image is myimage.

    cephadm@adm > rbd list mypool
  2. Map the image to a new block device.

    cephadm@adm > rbd map --pool mypool myimage
    Tip: User Name and Authentication

    To specify a user name, use --id user-name. If you use cephx authentication, you also need to specify a secret. It may come from a keyring or a file containing the secret:

    cephadm@adm > rbd map --pool mypool myimage --id admin --keyring /path/to/keyring

    or

    cephadm@adm > rbd map --pool mypool myimage --id admin --keyfile /path/to/file
  3. List all mapped devices:

    cephadm@adm > rbd showmapped
     id pool   image   snap device
     0  mypool myimage -    /dev/rbd0

    The device we want to work on is /dev/rbd0.

    Tip: RBD Device Path

    Instead of /dev/rbdDEVICE_NUMBER, you can use /dev/rbd/POOL_NAME/IMAGE_NAME as a persistent device path. For example:

    /dev/rbd/mypool/myimage
  4. Make an XFS file system on the /dev/rbd0 device.

    root # mkfs.xfs /dev/rbd0
     log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
     log stripe unit adjusted to 32KiB
     meta-data=/dev/rbd0              isize=256    agcount=9, agsize=261120 blks
              =                       sectsz=512   attr=2, projid32bit=1
              =                       crc=0        finobt=0
     data     =                       bsize=4096   blocks=2097152, imaxpct=25
              =                       sunit=1024   swidth=1024 blks
     naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
     log      =internal log           bsize=4096   blocks=2560, version=2
              =                       sectsz=512   sunit=8 blks, lazy-count=1
     realtime =none                   extsz=4096   blocks=0, rtextents=0
  5. Mount the device and check it is correctly mounted. Replace /mnt with your mount point.

    root # mount /dev/rbd0 /mnt
    root # mount | grep rbd0
    /dev/rbd0 on /mnt type xfs (rw,relatime,attr2,inode64,sunit=8192,...

    Now you can move data to and from the device as if it were a local directory.

    Tip: Increasing the Size of RBD Device

    If the RBD device becomes too small, you can enlarge it. See Section 23.2.2, “Increasing the Size of RBD Device”.
  6. After you finish accessing the device, you can unmount and unmap it.

    root # umount /mnt
    cephadm@adm > rbd unmap /dev/rbd0
Tip: Manual (Un)mounting

Since manually mapping and mounting RBD images after boot, and unmounting and unmapping them before shutdown, can be tedious, an rbdmap script and a systemd unit are provided. Refer to Section 23.2.1, “rbdmap: Map RBD Devices at Boot Time”.
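
The relation between the rbd showmapped listing and the persistent device path described in the procedure above can be sketched in Python. The sample text mirrors the listing shown in the procedure. The functions parse_showmapped and persistent_path are hypothetical helpers, and the parser assumes whitespace-separated columns with a header line and no spaces in pool or image names.

```python
SAMPLE = """\
id pool   image   snap device
0  mypool myimage -    /dev/rbd0
"""


def parse_showmapped(text):
    """Turn the tabular listing into a list of dictionaries.

    Hypothetical helper; it only parses text and does not call rbd.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def persistent_path(entry):
    """Return the /dev/rbd/POOL_NAME/IMAGE_NAME path for one entry."""
    return "/dev/rbd/{pool}/{image}".format(**entry)


for entry in parse_showmapped(SAMPLE):
    print(entry["device"], "->", persistent_path(entry))
```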

23.2.1 rbdmap: Map RBD Devices at Boot Time

rbdmap is a shell script that automates rbd map and rbd unmap operations on one or more RBD images. Although you can run the script manually at any time, the main advantage is automatic mapping and mounting of RBD images at boot time (and unmounting and unmapping at shutdown), as triggered by the Init system. A systemd unit file, rbdmap.service, is included with the ceph-common package for this purpose.

The script takes a single argument, which can be either map or unmap. In either case, the script parses a configuration file. It defaults to /etc/ceph/rbdmap, but can be overridden via an environment variable RBDMAPFILE. Each line of the configuration file corresponds to an RBD image which is to be mapped, or unmapped.

The configuration file has the following format:

image_specification rbd_options
image_specification

Path to an image within a pool. Specify as pool_name/image_name.

rbd_options

An optional list of parameters to be passed to the underlying rbd map command. These parameters and their values should be specified as a comma-separated string, for example:

PARAM1=VAL1,PARAM2=VAL2,...

The example makes the rbdmap script run the following command:

cephadm@adm > rbd map POOL_NAME/IMAGE_NAME --PARAM1 VAL1 --PARAM2 VAL2

The following example of a configuration file line specifies a user name and a keyring with a corresponding secret:

mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring
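
The translation from a configuration line of the form image_specification rbd_options into an rbd map command can be sketched in Python. This is a simplified illustration, not the logic of the actual rbdmap script: the function rbdmap_line_to_command is a hypothetical name, and the sketch ignores comments, empty lines, and quoted values that contain commas or spaces.

```python
def rbdmap_line_to_command(line):
    """Convert one configuration line into an 'rbd map' argument list.

    Hypothetical helper for illustration only.
    """
    fields = line.split(None, 1)
    image_spec = fields[0]
    argv = ["rbd", "map", image_spec]
    if len(fields) > 1:
        # PARAM1=VAL1,PARAM2=VAL2 becomes --PARAM1 VAL1 --PARAM2 VAL2
        for pair in fields[1].strip().split(","):
            name, _, value = pair.partition("=")
            argv += ["--" + name, value]
    return argv


line = "mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring"
print(" ".join(rbdmap_line_to_command(line)))
```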

When run as rbdmap map, the script parses the configuration file, and for each specified RBD image, it attempts to first map the image (using the rbd map command) and then mount the image.

When run as rbdmap unmap, images listed in the configuration file will be unmounted and unmapped.

rbdmap unmap-all attempts to unmount and subsequently unmap all currently mapped RBD images, regardless of whether they are listed in the configuration file.

If successful, the rbd map operation maps the image to a /dev/rbdX device, at which point a udev rule is triggered to create a friendly device name symbolic link /dev/rbd/pool_name/image_name pointing to the real mapped device.

In order for mounting and unmounting to succeed, the 'friendly' device name needs to have a corresponding entry in /etc/fstab. When writing /etc/fstab entries for RBD images, specify the 'noauto' (or 'nofail') mount option. This prevents the Init system from trying to mount the device too early—before the device in question even exists, as rbdmap.service is typically triggered quite late in the boot sequence.
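
As an illustration of such an /etc/fstab entry, the following Python sketch formats a line that uses the friendly device name together with the noauto option. The function fstab_entry is a hypothetical helper, and the XFS file system type and the trailing dump and pass fields (0 0) are assumptions chosen for the example.

```python
def fstab_entry(pool, image, mount_point, fs_type="xfs", options=("noauto",)):
    """Format an /etc/fstab line for a mapped RBD image.

    Hypothetical helper; it returns text and does not modify any file.
    """
    device = "/dev/rbd/{}/{}".format(pool, image)
    return "{} {} {} {} 0 0".format(device, mount_point, fs_type, ",".join(options))


print(fstab_entry("mypool", "myimage", "/mnt"))
```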

For a complete list of rbd options, see the rbd manual page (man 8 rbd).

For examples of the rbdmap usage, see the rbdmap manual page (man 8 rbdmap).

23.2.2 Increasing the Size of RBD Device

If you find that the size of the RBD device is no longer enough, you can easily increase it.

  1. Increase the size of the RBD image, for example up to 10 GB.

    cephadm@adm > rbd resize --size 10000 mypool/myimage
     Resizing image: 100% complete...done.
  2. Grow the file system to fill up the new size of the device.

    root # xfs_growfs /mnt
     [...]
     data blocks changed from 2097152 to 2560000

23.3 Snapshots

An RBD snapshot is a snapshot of a RADOS Block Device image. With snapshots, you retain a history of the image's state. Ceph also supports snapshot layering, which allows you to clone VM images quickly and easily. Ceph supports block device snapshots using the rbd command and many higher-level interfaces, including QEMU, libvirt, OpenStack, and CloudStack.

Note

Stop input and output operations and flush all pending writes before snapshotting an image. If the image contains a file system, the file system must be in a consistent state at the time of snapshotting.

23.3.1 Cephx Notes

When cephx is enabled, you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See Chapter 19, Authentication with cephx for more details. You may also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters.

cephadm@adm > rbd --id user-ID --keyring=/path/to/secret commands
cephadm@adm > rbd --name username --keyring=/path/to/secret commands

For example:

cephadm@adm > rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
cephadm@adm > rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip

Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
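
When rbd is called from a script, CEPH_ARGS can be set in the environment of the child process. The following Python sketch prepares such an environment. The function env_with_ceph_args is a hypothetical helper, and the sketch assumes that CEPH_ARGS holds the same options that you would otherwise type on the command line.

```python
import os


def env_with_ceph_args(user_id, keyring, base_env=None):
    """Return a copy of the environment with CEPH_ARGS set.

    Hypothetical helper; pass the result as env= to subprocess.run().
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CEPH_ARGS"] = "--id {} --keyring={}".format(user_id, keyring)
    return env


env = env_with_ceph_args("admin", "/etc/ceph/ceph.keyring")
print(env["CEPH_ARGS"])
```

A call such as subprocess.run(["rbd", "ls", "mypool"], env=env) would then not need the --id and --keyring options on each invocation.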

23.3.2 Snapshot Basics

The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.

23.3.2.1 Create Snapshot

To create a snapshot with rbd, specify the snap create option, the pool name, and the image name.

cephadm@adm > rbd --pool pool-name snap create --snap snap-name image-name
cephadm@adm > rbd snap create pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool rbd snap create --snap snapshot1 image1
cephadm@adm > rbd snap create rbd/image1@snapshot1

23.3.2.2 List Snapshots

To list snapshots of an image, specify the pool name and the image name.

cephadm@adm > rbd --pool pool-name snap ls image-name
cephadm@adm > rbd snap ls pool-name/image-name

For example:

cephadm@adm > rbd --pool rbd snap ls image1
cephadm@adm > rbd snap ls rbd/image1

23.3.2.3 Rollback Snapshot

To roll back to a snapshot with rbd, specify the snap rollback option, the pool name, the image name, and the snapshot name.

cephadm@adm > rbd --pool pool-name snap rollback --snap snap-name image-name
cephadm@adm > rbd snap rollback pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool pool1 snap rollback --snap snapshot1 image1
cephadm@adm > rbd snap rollback pool1/image1@snapshot1
Note

Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to roll back an image to a snapshot, and it is the preferred method of returning to a pre-existing state.

23.3.2.4 Delete a Snapshot

To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name, and the snapshot name.

cephadm@adm > rbd --pool pool-name snap rm --snap snap-name image-name
cephadm@adm > rbd snap rm pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool pool1 snap rm --snap snapshot1 image1
cephadm@adm > rbd snap rm pool1/image1@snapshot1
Note

Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.

23.3.2.5 Purge Snapshots

To delete all snapshots for an image with rbd, specify the snap purge option and the image name.

cephadm@adm > rbd --pool pool-name snap purge image-name
cephadm@adm > rbd snap purge pool-name/image-name

For example:

cephadm@adm > rbd --pool pool1 snap purge image1
cephadm@adm > rbd snap purge pool1/image1
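
The snapshot subcommands in this section share the compact pool-name/image-name@snap-name notation. The following Python sketch builds that specification and the matching argument list. The functions snap_spec and snap_command are hypothetical helpers. They only construct commands and do not run them.

```python
# Whether each 'rbd snap' action expects a snapshot name.
NEEDS_SNAP = {"create": True, "rollback": True, "rm": True, "ls": False, "purge": False}


def snap_spec(pool, image, snap=None):
    """Return pool/image or pool/image@snap."""
    spec = "{}/{}".format(pool, image)
    return spec if snap is None else "{}@{}".format(spec, snap)


def snap_command(action, pool, image, snap=None):
    """Build the argument list for one 'rbd snap' action.

    Hypothetical helper for illustration only.
    """
    if action not in NEEDS_SNAP:
        raise ValueError("unknown action: " + action)
    if NEEDS_SNAP[action] and snap is None:
        raise ValueError(action + " requires a snapshot name")
    if not NEEDS_SNAP[action]:
        snap = None  # ls and purge operate on the whole image
    return ["rbd", "snap", action, snap_spec(pool, image, snap)]


print(" ".join(snap_command("create", "rbd", "image1", "snapshot1")))
print(" ".join(snap_command("purge", "pool1", "image1")))
```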

23.3.3 Layering

Ceph supports the ability to create multiple copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it, then snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics, making it possible to create clones rapidly.

Note

The terms 'parent' and 'child' mentioned in the command line examples below mean a Ceph block device snapshot (parent) and the corresponding image cloned from the snapshot (child).

Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.

A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.

Note: --image-format 1 Not Supported

You cannot clone snapshots of images created with the deprecated rbd create --image-format 1 option. Ceph only supports cloning of the default format 2 images.

23.3.3.1 Getting Started with Layering

Ceph block device layering is a simple process. You must have an image. You must create a snapshot of the image. You must protect the snapshot. After you have performed these steps, you can begin cloning the snapshot.

The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID, and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.

  • Image Template: A common use case for block device layering is to create a master image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (for example, SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (for example, zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.

  • Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (a VM template) and install other software (for example, a database, a content management system, or an analytics system), and then snapshot the extended image, which itself may be updated in the same way as the base image.

  • Template Pool: One way to use block device layering is to create a pool that contains master images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.

  • Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.

23.3.3.2 Protecting a Snapshot

Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.

cephadm@adm > rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
cephadm@adm > rbd snap protect pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 snap protect --image image1 --snap snapshot1
cephadm@adm > rbd snap protect pool1/image1@snapshot1
Note

You cannot delete a protected snapshot.

23.3.3.3 Cloning a Snapshot

To clone a snapshot, you need to specify the parent pool, image, snapshot, the child pool, and the image name. You need to protect the snapshot before you can clone it.

cephadm@adm > rbd clone --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
cephadm@adm > rbd clone pool-name/parent-image@snap-name \
pool-name/child-image-name

For example:

cephadm@adm > rbd clone pool1/image1@snapshot1 pool1/image2
Note

You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writable clones in another pool.
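
The order of operations for layering (snapshot, protect, clone) can be sketched in Python as a list of commands. The function clone_workflow is a hypothetical helper that only assembles the argument lists in the required order. It assumes that the parent image already exists.

```python
def clone_workflow(parent_pool, parent_image, snap, child_pool, child_image):
    """Return the commands needed to create a clone, in order.

    Hypothetical helper; the commands are built but not executed.
    """
    parent = "{}/{}@{}".format(parent_pool, parent_image, snap)
    child = "{}/{}".format(child_pool, child_image)
    return [
        ["rbd", "snap", "create", parent],   # 1. snapshot the image
        ["rbd", "snap", "protect", parent],  # 2. protect the snapshot
        ["rbd", "clone", parent, child],     # 3. clone it, possibly into another pool
    ]


for argv in clone_workflow("pool1", "image1", "snapshot1", "pool1", "image2"):
    print(" ".join(argv))
```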

23.3.3.4 Unprotecting a Snapshot

Before you can delete a snapshot, you must unprotect it first. Additionally, you may not delete snapshots that have references from clones. You need to flatten each clone of a snapshot before you can delete the snapshot.

cephadm@adm > rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
cephadm@adm > rbd snap unprotect pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
cephadm@adm > rbd snap unprotect pool1/image1@snapshot1

23.3.3.5 Listing Children of a Snapshot

To list the children of a snapshot, execute the following:

cephadm@adm > rbd --pool pool-name children --image image-name --snap snap-name
cephadm@adm > rbd children pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 children --image image1 --snap snapshot1
cephadm@adm > rbd children pool1/image1@snapshot1

23.3.3.6 Flattening a Cloned Image

Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively 'flatten' the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.

cephadm@adm > rbd --pool pool-name flatten --image image-name
cephadm@adm > rbd flatten pool-name/image-name

For example:

cephadm@adm > rbd --pool pool1 flatten --image image2
cephadm@adm > rbd flatten pool1/image2
Note

Since a flattened image contains all the information from the snapshot, a flattened image will take up more storage space than a layered clone.
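
Removing a protected snapshot that still has clones therefore requires a fixed order: flatten every child, unprotect the snapshot, then delete it. The following Python sketch assembles that sequence. The function snapshot_removal_plan is a hypothetical helper, and the list of children is assumed to come from the rbd children command.

```python
def snapshot_removal_plan(parent_spec, children):
    """Return the commands needed to delete a cloned, protected snapshot.

    parent_spec -- pool-name/image-name@snapshot-name
    children    -- clone specifications (pool-name/image-name)
    Hypothetical helper; the commands are built but not executed.
    """
    plan = [["rbd", "flatten", child] for child in children]
    plan.append(["rbd", "snap", "unprotect", parent_spec])
    plan.append(["rbd", "snap", "rm", parent_spec])
    return plan


for argv in snapshot_removal_plan("pool1/image1@snapshot1", ["pool1/image2"]):
    print(" ".join(argv))
```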

23.4 Mirroring

RBD images can be asynchronously mirrored between two Ceph clusters. This capability uses the RBD journaling image feature to ensure crash-consistent replication between clusters. Mirroring is configured on a per-pool basis within peer clusters and can be configured to automatically mirror all images within a pool or only a specific subset of images. Mirroring is configured using the rbd command. The rbd-mirror daemon is responsible for pulling image updates from the remote peer cluster and applying them to the image within the local cluster.

Note: rbd-mirror Daemon

To use RBD mirroring, you need to have two Ceph clusters, each running the rbd-mirror daemon.

Important: RADOS Block Devices Exported via iSCSI

You cannot mirror RBD devices that are exported via iSCSI using kernel-based iSCSI Gateway.

Refer to Chapter 27, Ceph iSCSI Gateway for more details on iSCSI.

23.4.1 rbd-mirror Daemon

The two rbd-mirror daemons are responsible for watching image journals on the remote, peer cluster and replaying the journal events against the local cluster. The RBD image journaling feature records all modifications to the image in the order they occur. This ensures that a crash-consistent mirror of the remote image is available locally.

The rbd-mirror daemon is available in the rbd-mirror package. You can install the package on OSD nodes, gateway nodes, or even on dedicated nodes. We do not recommend installing the rbd-mirror on the Admin Node. Install, enable, and start rbd-mirror:

root@minion > zypper install rbd-mirror
root@minion > systemctl enable ceph-rbd-mirror@server_name.service
root@minion > systemctl start ceph-rbd-mirror@server_name.service
Important

Each rbd-mirror daemon requires the ability to connect to both clusters simultaneously.

23.4.2 Pool Configuration

The following procedures demonstrate how to perform the basic administrative tasks to configure mirroring using the rbd command. Mirroring is configured on a per-pool basis within the Ceph clusters.

You need to perform the pool configuration steps on both peer clusters. For clarity, these procedures assume that two clusters, named 'local' and 'remote', are accessible from a single host.

See the rbd manual page (man 8 rbd) for additional details on how to connect to different Ceph clusters.

Tip: Multiple Clusters

The cluster name in the following examples corresponds to a Ceph configuration file of the same name, for example /etc/ceph/remote.conf.

23.4.2.1 Enable Mirroring on a Pool

To enable mirroring on a pool, specify the mirror pool enable subcommand, the pool name, and the mirroring mode. The mirroring mode can either be pool or image:

pool

All images in the pool with the journaling feature enabled are mirrored.

image

Mirroring needs to be explicitly enabled on each image. See Section 23.4.3.2, “Enable Image Mirroring” for more information.

For example:

cephadm@adm > rbd --cluster local mirror pool enable POOL_NAME pool
cephadm@adm > rbd --cluster remote mirror pool enable POOL_NAME pool

23.4.2.2 Disable Mirroring

To disable mirroring on a pool, specify the mirror pool disable subcommand and the pool name. When mirroring is disabled on a pool in this way, mirroring will also be disabled on any images (within the pool) for which mirroring was enabled explicitly.

cephadm@adm > rbd --cluster local mirror pool disable POOL_NAME
cephadm@adm > rbd --cluster remote mirror pool disable POOL_NAME

23.4.2.3 Add Cluster Peer

In order for the rbd-mirror daemon to discover its peer cluster, the peer needs to be registered to the pool. To add a mirroring peer cluster, specify the mirror pool peer add subcommand, the pool name, and a cluster specification:

cephadm@adm > rbd --cluster local mirror pool peer add POOL_NAME client.remote@remote
cephadm@adm > rbd --cluster remote mirror pool peer add POOL_NAME client.local@local

23.4.2.4 Remove Cluster Peer

To remove a mirroring peer cluster, specify the mirror pool peer remove subcommand, the pool name, and the peer UUID (available from the rbd mirror pool info command):

cephadm@adm > rbd --cluster local mirror pool peer remove POOL_NAME \
 55672766-c02b-4729-8567-f13a66893445
cephadm@adm > rbd --cluster remote mirror pool peer remove POOL_NAME \
 60c0e299-b38f-4234-91f6-eed0a367be08

23.4.3 Image Configuration

Unlike pool configuration, image configuration only needs to be performed against a single mirroring peer Ceph cluster.

Mirrored RBD images are designated as either primary or non-primary. This is a property of the image and not the pool. Images that are designated as non-primary cannot be modified.

Images are automatically promoted to primary when mirroring is first enabled on an image. This happens either implicitly, if the pool mirror mode is 'pool' and the image has the journaling image feature enabled, or explicitly by the rbd command (see Section 23.4.3.2, “Enable Image Mirroring”).

23.4.3.1 Image Journaling Support

RBD mirroring uses the RBD journaling feature to ensure that the replicated image always remains crash-consistent. Before an image can be mirrored to a peer cluster, the journaling feature must be enabled. The feature can be enabled at the time of image creation by providing the --image-feature exclusive-lock,journaling option to the rbd command.

Alternatively, the journaling feature can be dynamically enabled on pre-existing RBD images. To enable journaling, specify the feature enable subcommand, the pool and image name, and the feature name:

cephadm@adm > rbd --cluster local feature enable POOL_NAME/IMAGE_NAME journaling
Note: Option Dependency

The journaling feature is dependent on the exclusive-lock feature. If the exclusive-lock feature is not already enabled, you need to enable it prior to enabling the journaling feature.

Warning: Journaling on All New Images

You can enable journaling on all new images by default by appending the journaling value to the rbd default features option in the Ceph configuration file. For example:

rbd default features = layering,exclusive-lock,object-map,deep-flatten,journaling

Before applying such a change, carefully consider if enabling journaling on all new images is good for your deployment, because it can have a negative performance impact.
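
Because journaling depends on exclusive-lock, a feature list can be checked before it is written to the configuration. The following Python sketch parses a comma-separated value such as the one shown above and verifies that dependency. The functions parse_features and journaling_dependency_ok are hypothetical helpers, and no other feature dependencies are checked.

```python
def parse_features(value):
    """Split a comma-separated feature list into feature names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def journaling_dependency_ok(features):
    """Return True unless journaling is listed without exclusive-lock.

    Hypothetical helper for illustration only.
    """
    return "journaling" not in features or "exclusive-lock" in features


value = "layering,exclusive-lock,object-map,deep-flatten,journaling"
print(journaling_dependency_ok(parse_features(value)))
```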

23.4.3.2 Enable Image Mirroring

If mirroring is configured in the 'image' mode, then it is necessary to explicitly enable mirroring for each image within the pool. To enable mirroring for a specific image, specify the mirror image enable subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image enable POOL_NAME/IMAGE_NAME

23.4.3.3 Disable Image Mirroring

To disable mirroring for a specific image, specify the mirror image disable subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image disable POOL_NAME/IMAGE_NAME

23.4.3.4 Image Promotion and Demotion

In a failover scenario where the primary designation needs to be moved to the image in the peer cluster, you need to stop access to the primary image, demote the current primary image, promote the new primary image, and resume access to the image on the alternate cluster.

Note: Forced Promotion

Promotion can be forced using the --force option. Forced promotion is needed when the demotion cannot be propagated to the peer cluster (for example, in case of cluster failure or communication outage). This will result in a split-brain scenario between the two peers, and the image will no longer be synchronized until a resync subcommand is issued.

To demote a specific image to non-primary, specify the mirror image demote subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image demote POOL_NAME/IMAGE_NAME

To demote all primary images within a pool to non-primary, specify the mirror pool demote subcommand along with the pool name:

cephadm@adm > rbd --cluster local mirror pool demote POOL_NAME

To promote a specific image to primary, specify the mirror image promote subcommand along with the pool and image name:

cephadm@adm > rbd --cluster remote mirror image promote POOL_NAME/IMAGE_NAME

To promote all non-primary images within a pool to primary, specify the mirror pool promote subcommand along with the pool name:

cephadm@adm > rbd --cluster local mirror pool promote POOL_NAME
Tip: Split I/O Load

Since the primary or non-primary status is per-image, it is possible to have two clusters split the I/O load and stage failover or failback.
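
The failover order described in this section (demote the current primary, then promote the image on the peer) can be sketched in Python. The function failover_plan is a hypothetical helper that only builds the argument lists. Stopping and resuming client access is outside its scope. With force=True it skips the demotion, which corresponds to the forced promotion case in which the old primary cannot be reached.

```python
def failover_plan(image_spec, old_primary="local", new_primary="remote", force=False):
    """Return the mirroring commands for moving the primary role.

    image_spec -- POOL_NAME/IMAGE_NAME
    Hypothetical helper; the commands are built but not executed.
    """
    plan = []
    if not force:
        # Orderly failover: demote on the cluster that is currently primary.
        plan.append(["rbd", "--cluster", old_primary,
                     "mirror", "image", "demote", image_spec])
    promote = ["rbd", "--cluster", new_primary, "mirror", "image", "promote"]
    if force:
        # Forced promotion leads to split-brain until a resync is issued.
        promote.append("--force")
    promote.append(image_spec)
    plan.append(promote)
    return plan


for argv in failover_plan("mypool/myimage"):
    print(" ".join(argv))
```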

23.4.3.5 Force Image Resync

If a split-brain event is detected by the rbd-mirror daemon, it will not attempt to mirror the affected image until corrected. To resume mirroring for an image, first demote the image determined to be out of date and then request a resync to the primary image. To request an image resync, specify the mirror image resync subcommand along with the pool and image name:

cephadm@adm > rbd mirror image resync POOL_NAME/IMAGE_NAME

23.4.4 Mirror Status

The peer cluster replication status is stored for every primary mirrored image. This status can be retrieved using the mirror image status and mirror pool status subcommands:

To request the mirror image status, specify the mirror image status subcommand along with the pool and image name:

cephadm@adm > rbd mirror image status POOL_NAME/IMAGE_NAME

To request the mirror pool summary status, specify the mirror pool status subcommand along with the pool name:

cephadm@adm > rbd mirror pool status POOL_NAME
Tip

Adding the --verbose option to the mirror pool status subcommand will additionally output status details for every mirroring image in the pool.

23.5 Cache Settings

The user space implementation of the Ceph block device (librbd) cannot take advantage of the Linux page cache. Therefore, it includes its own in-memory caching. RBD caching behaves similarly to hard disk caching. When the OS sends a barrier or a flush request, all 'dirty' data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes. The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can merge adjacent requests for better throughput.

Ceph supports write-back caching for RBD. To enable it, add

[client]
...
rbd cache = true

to the [client] section of your ceph.conf file. Without caching, writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more unflushed bytes than set in the rbd cache max dirty option. In such a case, the write triggers writeback and blocks until enough bytes are flushed.

Ceph supports write-through caching for RBD. You can set the size of the cache, and you can set targets and limits to switch from write-back caching to write-through caching. To enable write-through mode, set

rbd cache max dirty = 0

This means writes return only when the data is on disk on all replicas, but reads may come from the cache. The cache is in memory on the client, and each RBD image has its own cache. Since the cache is local to the client, there is no coherency if there are others accessing the image. Running GFS or OCFS on top of RBD will not work with caching enabled.

Set the RBD cache options in the [client] section of your ceph.conf file. The settings include:

rbd cache

Enable caching for RADOS Block Device (RBD). Default is 'true'.

rbd cache size

The RBD cache size in bytes. Default is 32 MB.

rbd cache max dirty

The 'dirty' limit in bytes at which the cache triggers write-back. rbd cache max dirty needs to be less than rbd cache size. If set to 0, uses write-through caching. Default is 24 MB.

rbd cache target dirty

The 'dirty target' before the cache begins writing data to the data storage. Does not block writes to the cache. Default is 16 MB.

rbd cache max dirty age

The number of seconds dirty data is in the cache before writeback starts. Default is 1.

rbd cache writethrough until flush

Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case virtual machines running on rbd are too old to send flushes (for example, the virtio driver in Linux before kernel 2.6.32). Default is 'true'.
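
The relationship between these options can be sketched as follows. This is a simplified model and not librbd code: the function name cache_mode is hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the effect of rbd cache writethrough until flush is ignored.

```python
MB = 1024 * 1024  # assumption: the defaults above are binary megabytes

# Defaults as listed in this section.
DEFAULTS = {
    "rbd cache": True,
    "rbd cache size": 32 * MB,
    "rbd cache max dirty": 24 * MB,
    "rbd cache target dirty": 16 * MB,
}


def cache_mode(overrides=None):
    """Classify the caching mode implied by a set of [client] options."""
    cfg = {**DEFAULTS, **(overrides or {})}
    if not cfg["rbd cache"]:
        return "none"
    max_dirty = cfg["rbd cache max dirty"]
    if max_dirty == 0:
        # A dirty limit of 0 selects write-through caching.
        return "write-through"
    if max_dirty >= cfg["rbd cache size"]:
        raise ValueError("rbd cache max dirty must be less than rbd cache size")
    return "write-back"


print(cache_mode())                            # defaults
print(cache_mode({"rbd cache max dirty": 0}))  # dirty limit of 0
```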

23.6 QoS Settings

Generally, Quality of Service (QoS) refers to methods of traffic prioritization and resource reservation. It is particularly important for the transportation of traffic with special requirements.

Important: Not Supported by iSCSI

The following QoS settings are used only by the userspace RBD implementation librbd and not used by the kRBD implementation. Because iSCSI uses kRBD, it does not use the QoS settings. However, for iSCSI you can configure QoS on the kernel block device layer using standard kernel facilities.

rbd qos iops limit

The desired limit of I/O operations per second. Default is 0 (no limit).

rbd qos bps limit

The desired limit of I/O bytes per second. Default is 0 (no limit).

rbd qos read iops limit

The desired limit of read operations per second. Default is 0 (no limit).

rbd qos write iops limit

The desired limit of write operations per second. Default is 0 (no limit).

rbd qos read bps limit

The desired limit of read bytes per second. Default is 0 (no limit).

rbd qos write bps limit

The desired limit of write bytes per second. Default is 0 (no limit).

rbd qos iops burst

The desired burst limit of I/O operations. Default is 0 (no limit).

rbd qos bps burst

The desired burst limit of I/O bytes. Default is 0 (no limit).

rbd qos read iops burst

The desired burst limit of read operations. Default is 0 (no limit).

rbd qos write iops burst

The desired burst limit of write operations. Default is 0 (no limit).

rbd qos read bps burst

The desired burst limit of read bytes. Default is 0 (no limit).

rbd qos write bps burst

The desired burst limit of write bytes. Default is 0 (no limit).

rbd qos schedule tick min

The minimum schedule tick (in milliseconds) for QoS. Default is 50.
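
As an illustration, the following sketch renders a configuration fragment from a set of QoS values and rejects option names that are not listed above. The helper render_qos_section is hypothetical, the values 500 and 1000 are arbitrary, and placing the options in the [client] section is an assumption based on Section 23.5, “Cache Settings”.

```python
# Option names exactly as listed in this section.
QOS_OPTIONS = (
    "rbd qos iops limit",
    "rbd qos bps limit",
    "rbd qos read iops limit",
    "rbd qos write iops limit",
    "rbd qos read bps limit",
    "rbd qos write bps limit",
    "rbd qos iops burst",
    "rbd qos bps burst",
    "rbd qos read iops burst",
    "rbd qos write iops burst",
    "rbd qos read bps burst",
    "rbd qos write bps burst",
    "rbd qos schedule tick min",
)


def render_qos_section(settings):
    """Render a [client] fragment, rejecting unknown names and bad values."""
    lines = ["[client]"]
    for name, value in settings.items():
        if name not in QOS_OPTIONS:
            raise ValueError(f"unknown QoS option: {name}")
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
        lines.append(f"{name} = {value}")
    return "\n".join(lines)


# Arbitrary example values, not recommendations.
print(render_qos_section({"rbd qos iops limit": 500, "rbd qos iops burst": 1000}))
```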

23.7 Read-ahead Settings

RADOS Block Device supports read-ahead/prefetching to optimize small, sequential reads. This should normally be handled by the guest OS in the case of a virtual machine, but boot loaders may not issue efficient reads. Read-ahead is automatically disabled if caching is disabled.

rbd readahead trigger requests

Number of sequential read requests necessary to trigger read-ahead. Default is 10.

rbd readahead max bytes

Maximum size of a read-ahead request. If set to 0, read-ahead is disabled. Default is 512 kB.

rbd readahead disable after bytes

After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the guest OS to take over read-ahead when it is booted. If set to 0, read-ahead stays enabled. Default is 50 MB.
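
The way these three options interact can be summarized in a small model. This is only a reading of the descriptions above, not the librbd implementation: the function name readahead_active is hypothetical, and 1 kB and 1 MB are assumed to be 1024 and 1024 * 1024 bytes.

```python
KB = 1024
MB = 1024 * KB

# Defaults as listed in this section.
DEFAULTS = {
    "rbd readahead trigger requests": 10,
    "rbd readahead max bytes": 512 * KB,
    "rbd readahead disable after bytes": 50 * MB,
}


def readahead_active(sequential_requests, bytes_read, cache_enabled=True,
                     overrides=None):
    """Return True if read-ahead would be in effect under this simple model."""
    cfg = {**DEFAULTS, **(overrides or {})}
    if not cache_enabled:
        # Read-ahead is automatically disabled if caching is disabled.
        return False
    if cfg["rbd readahead max bytes"] == 0:
        return False
    limit = cfg["rbd readahead disable after bytes"]
    if limit != 0 and bytes_read >= limit:
        # The guest OS is expected to have taken over read-ahead by now.
        return False
    return sequential_requests >= cfg["rbd readahead trigger requests"]


print(readahead_active(sequential_requests=10, bytes_read=0))
print(readahead_active(sequential_requests=10, bytes_read=50 * MB))
```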

23.8 Advanced Features

RADOS Block Device supports advanced features that enhance the functionality of RBD images. You can specify the features either on the command line when creating an RBD image, or in the Ceph configuration file by using the rbd_default_features option.

You can specify the values of the rbd_default_features option in two ways:

  • As a sum of features' internal values. Each feature has its own internal value, for example 'layering' has 1 and 'fast-diff' has 16. Therefore, to activate these two features by default, include the following:

    rbd_default_features = 17
  • As a comma-separated list of features. The previous example will look as follows:

    rbd_default_features = layering,fast-diff
Note: Features Not Supported by iSCSI

RBD images with the following features will not be supported by iSCSI: deep-flatten, object-map, journaling, fast-diff, striping

A list of advanced RBD features follows:

layering

Layering enables you to use cloning.

Internal value is 1, default is 'yes'.

striping

Striping spreads data across multiple objects and helps with parallelism for sequential read/write workloads. It prevents single node bottlenecks for large or busy RADOS Block Devices.

Internal value is 2, default is 'yes'.

exclusive-lock

When enabled, it requires a client to get a lock on an object before making a write. Enable the exclusive lock only when no more than one client accesses an image at any given time. Internal value is 4. Default is 'yes'.

object-map

Object map support depends on exclusive lock support. Block devices are thin provisioned, meaning that they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, importing and exporting a sparsely populated image, and deleting.

Internal value is 8, default is 'yes'.

fast-diff

Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and the actual data usage of a snapshot.

Internal value is 16, default is 'yes'.

deep-flatten

Deep-flatten makes the rbd flatten command (see Section 23.3.3.6, “Flattening a Cloned Image”) work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, therefore you will not be able to delete the parent image until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.

Internal value is 32, default is 'yes'.

journaling

Journaling support depends on exclusive lock support. Journaling records all modifications to an image in the order they occur. RBD mirroring (see Section 23.4, “Mirroring”) uses the journal to replicate a crash consistent image to a remote cluster.

Internal value is 64, default is 'no'.
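
The two notations for rbd_default_features can be converted into each other with a few lines of code. The following sketch uses the internal values and dependencies listed above; the function names are hypothetical. Note that the value 17 from the earlier example illustrates the arithmetic only: according to the descriptions above, fast-diff also depends on object-map and exclusive-lock.

```python
# Internal values as listed in this section.
FEATURES = {
    "layering": 1,
    "striping": 2,
    "exclusive-lock": 4,
    "object-map": 8,
    "fast-diff": 16,
    "deep-flatten": 32,
    "journaling": 64,
}

# Dependencies as stated in the feature descriptions.
REQUIRES = {
    "object-map": {"exclusive-lock"},
    "fast-diff": {"object-map", "exclusive-lock"},
    "journaling": {"exclusive-lock"},
}


def features_to_value(names):
    """Sum the internal values of the given feature names."""
    return sum(FEATURES[name] for name in set(names))


def value_to_features(value):
    """List the feature names whose bit is set in value."""
    return [name for name, bit in FEATURES.items() if value & bit]


def missing_dependencies(names):
    """Map each feature to the features it requires but that are not listed."""
    names = set(names)
    missing = {}
    for name in names:
        lacking = REQUIRES.get(name, set()) - names
        if lacking:
            missing[name] = lacking
    return missing


print(features_to_value(["layering", "fast-diff"]))
print(value_to_features(17))
print(missing_dependencies(["layering", "fast-diff"]))
```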

23.9 Mapping RBD Using Old Kernel Clients

Old clients (for example, SLE11 SP4) may not be able to map RBD images because a cluster deployed with SUSE Enterprise Storage 6 forces some features (both RBD image level features and RADOS level features) that these old clients do not support. When this happens, the OSD logs will show messages similar to the following:

2019-05-17 16:11:33.739133 7fcb83a2e700  0 -- 192.168.122.221:0/1006830 >> \
192.168.122.152:6789/0 pipe(0x65d4e0 sd=3 :57323 s=1 pgs=0 cs=0 l=1 c=0x65d770).connect \
protocol feature mismatch, my 2fffffffffff < peer 4010ff8ffacffff missing 401000000000000
Warning: Changing CRUSH Map Bucket Types Causes Massive Rebalancing

If you intend to switch the CRUSH Map bucket types between 'straw' and 'straw2', do it in a planned manner. Expect a significant impact on the cluster load because changing bucket type will cause massive cluster rebalancing.

  1. Disable any RBD image features that are not supported. For example:

    cephadm@adm > rbd feature disable pool1/image1 object-map
    cephadm@adm > rbd feature disable pool1/image1 exclusive-lock
  2. Change the CRUSH Map bucket types from 'straw2' to 'straw':

    1. Save the CRUSH Map:

      cephadm@adm > ceph osd getcrushmap -o crushmap.original
    2. Decompile the CRUSH Map:

      cephadm@adm > crushtool -d crushmap.original -o crushmap.txt
    3. Edit the CRUSH Map and replace 'straw2' with 'straw'.

    4. Recompile the CRUSH Map:

      cephadm@adm > crushtool -c crushmap.txt -o crushmap.new
    5. Set the new CRUSH Map:

      cephadm@adm > ceph osd setcrushmap -i crushmap.new
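
The manual edit of the decompiled CRUSH Map (replacing 'straw2' with 'straw') can also be scripted. The following is a minimal sketch with a hypothetical function name. It assumes that the bucket algorithm appears in the decompiled text as the whole word 'straw2' and leaves everything else unchanged. Review the result before recompiling it.

```python
import re


def straw2_to_straw(crushmap_text):
    """Replace the whole word 'straw2' with 'straw' in decompiled map text."""
    return re.sub(r"\bstraw2\b", "straw", crushmap_text)


# Illustrative bucket definition, not taken from a real cluster.
sample = "host node1 {\n\tid -2\n\talg straw2\n\thash 0\n}\n"
print(straw2_to_straw(sample))
```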

24 Erasure Coded Pools

Ceph provides an alternative to the normal replication of data in pools, called erasure coded pools (or erasure pools for short). Erasure pools do not provide all the functionality of replicated pools (for example, they cannot store metadata for RBD pools), but require less raw storage. A default erasure pool capable of storing 1 TB of data requires 1.5 TB of raw storage and tolerates a single disk failure. This compares favorably to a replicated pool, which needs 2 TB of raw storage for the same purpose.

For background information on Erasure Code, see https://en.wikipedia.org/wiki/Erasure_code.

For a list of pool values related to EC pools, refer to Erasure Coded Pool Values.

Note

When using FileStore, you cannot access erasure coded pools with the RBD interface unless you have a cache tier configured. Refer to Book “Tuning Guide”, Chapter 7 “Cache Tiering”, Section 7.5 “Erasure Coded Pool and Cache Tiering” for more details, or use the default BlueStore (see Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 6 and Ceph”, Section 1.4 “BlueStore”).

24.1 Prerequisite for Erasure Coded Pools

To make use of erasure coding, you need to:

  • Define an erasure rule in the CRUSH Map.

  • Define an erasure code profile that specifies the coding algorithm to be used.

  • Create a pool using the previously mentioned rule and profile.

Keep in mind that changing the profile and the details in the profile will not be possible after the pool is created and has data.

Ensure that the CRUSH rules for erasure pools use indep in their step statements. For details, see Section 20.3.2, “firstn and indep”.

24.2 Creating a Sample Erasure Coded Pool

The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts. This procedure describes how to create a pool for testing purposes.

  1. The command ceph osd pool create is used to create a pool with type erasure. The 12 stands for the number of placement groups. With default parameters, the pool is able to handle the failure of one OSD.

    cephadm@adm > ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
  2. The string ABCDEFGHI is written into an object called NYAN.

    cephadm@adm > echo ABCDEFGHI | rados --pool ecpool put NYAN -
  3. For testing purposes OSDs can now be disabled, for example by disconnecting them from the network.

  4. To test whether the pool can handle the failure of devices, the content of the object can be accessed with the rados command.

    cephadm@adm > rados --pool ecpool get NYAN -
    ABCDEFGHI

24.3 Erasure Code Profiles

When the ceph osd pool create command is invoked to create an erasure pool, the default profile is used, unless another profile is specified. Profiles define the redundancy of data. This is done by setting two parameters, arbitrarily named k and m. k and m define in how many chunks a piece of data is split and how many coding chunks are created. Redundant chunks are then stored on different OSDs.

Definitions required for erasure pool profiles:

chunk

When the encoding function is called, it returns chunks of the same size: data chunks, which can be concatenated to reconstruct the original object, and coding chunks, which can be used to rebuild a lost chunk.

k

The number of data chunks, that is the number of chunks into which the original object is divided. For example, if k = 2, a 10 kB object is divided into 2 chunks of 5 kB each. The default min_size on erasure coded pools is k + 1. However, we recommend min_size to be k + 2 or more to prevent loss of writes and data.

m

The number of coding chunks, that is the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, 2 OSDs can be out without losing data.

crush-failure-domain

Defines to which devices the chunks are distributed. A bucket type needs to be set as its value. For all bucket types, see Section 20.2, “Buckets”. If the failure domain is rack, the chunks will be stored on different racks to increase the resilience in case of rack failures. Keep in mind that this requires k+m racks.

With the default erasure code profile used in Section 24.2, “Creating a Sample Erasure Coded Pool”, you will not lose cluster data if a single OSD or host fails. Because k=2 and m=1, storing 1 TB of data requires an additional 0.5 TB for coding chunks, that is 1.5 TB of raw storage in total. This is equivalent to a common RAID 5 configuration. For comparison, a replicated pool needs 2 TB of raw storage to store 1 TB of data.

The settings of the default profile can be displayed with:

cephadm@adm > ceph osd erasure-code-profile get default
directory=.libs
k=2
m=1
plugin=jerasure
crush-failure-domain=host
technique=reed_sol_van

Choosing the right profile is important because it cannot be modified after the pool is created. A new pool with a different profile needs to be created and all objects from the previous pool moved to the new one (see Section 22.3, “Pool Migration”).

The most important parameters of the profile are k, m and crush-failure-domain because they define the storage overhead and the data durability. For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 66%, the following profile can be defined. Note that this is only valid with a CRUSH Map that has buckets of type 'rack':

cephadm@adm > ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   crush-failure-domain=rack

The example in Section 24.2, “Creating a Sample Erasure Coded Pool” can be repeated with this new profile:

cephadm@adm > ceph osd pool create ecpool 12 12 erasure myprofile
cephadm@adm > echo ABCDEFGHI | rados --pool ecpool put NYAN -
cephadm@adm > rados --pool ecpool get NYAN -
ABCDEFGHI

The NYAN object will be divided into three data chunks (k=3) and two additional coding chunks will be created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack setting will create a CRUSH ruleset that ensures no two chunks are stored in the same rack.
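
The storage figures quoted in this section follow directly from k and m: each object is stored as k + m chunks, of which k carry data. The following sketch reproduces the arithmetic with exact fractions. The function names are hypothetical, and the replicated case assumes two copies, as in the 2 TB comparison above.

```python
from fractions import Fraction


def ec_raw_storage(data_tb, k, m):
    """Raw capacity (TB) needed for data_tb of data with k data and m coding chunks."""
    return Fraction(data_tb) * (k + m) / k


def ec_overhead(k, m):
    """Storage overhead as the ratio of coding chunks to data chunks."""
    return Fraction(m, k)


def replicated_raw_storage(data_tb, copies):
    """Raw capacity (TB) needed for data_tb of data kept in the given number of copies."""
    return Fraction(data_tb) * copies


print(float(ec_raw_storage(1, k=2, m=1)))            # default profile
print(float(replicated_raw_storage(1, copies=2)))    # replicated comparison
print(ec_overhead(k=3, m=2))                         # profile 'myprofile'
```

For the default profile this yields 1.5 TB of raw storage per 1 TB of data. For k=3 and m=2 the overhead is 2/3, which is the storage overhead of roughly 66% mentioned above.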

24.3.1 Creating a New Erasure Code Profile

The following command creates a new erasure code profile:

root # ceph osd erasure-code-profile set NAME \
 directory=DIRECTORY \
 plugin=PLUGIN \
 stripe_unit=STRIPE_UNIT \
 KEY=VALUE ... \
 --force
DIRECTORY

Optional. Set the directory name from which the erasure code plugin is loaded. Default is /usr/lib/ceph/erasure-code.

PLUGIN

Optional. Use the erasure code plugin to compute coding chunks and recover missing chunks. Available plugins are 'jerasure', 'isa', 'lrc', and 'shec'. Default is 'jerasure'.

STRIPE_UNIT

Optional. The amount of data in a data chunk, per stripe. For example, a profile with 2 data chunks and stripe_unit=4K would put the range 0-4K in chunk 0, 4K-8K in chunk 1, then 8K-12K in chunk 0 again. This should be a multiple of 4K for best performance. The default value is taken from the monitor configuration option osd_pool_erasure_code_stripe_unit when a pool is created. The 'stripe_width' of a pool using this profile will be the number of data chunks multiplied by this 'stripe_unit'.

KEY=VALUE

Key/value pairs of options specific to the selected erasure code plugin.

--force

Optional. Override an existing profile by the same name, and allow setting a non-4K-aligned stripe_unit.
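
The STRIPE_UNIT example above (ranges of 4K alternating between chunk 0 and chunk 1) can be expressed as a small calculation. This is a sketch of the layout as described in this section, with hypothetical function names. It is not code from Ceph.

```python
def chunk_index(offset, k, stripe_unit):
    """Return the data chunk that holds the byte at the given offset."""
    return (offset // stripe_unit) % k


def stripe_width(k, stripe_unit):
    """Number of data chunks multiplied by the stripe unit."""
    return k * stripe_unit


K4 = 4096  # 4K
for offset in (0, K4, 2 * K4):
    print(offset, chunk_index(offset, k=2, stripe_unit=K4))
print(stripe_width(k=2, stripe_unit=K4))
```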

Warning

We strongly recommend that profiles are never modified. Instead, a new profile should be created and used when creating a new pool or creating a new rule for an existing pool. Seek expert advice before performing this action in specific circumstances.

24.3.2 Removing an Erasure Code Profile

The following command removes an erasure code profile as identified by its NAME:

root # ceph osd erasure-code-profile rm NAME
Important

If the profile is referenced by a pool, the deletion will fail.

24.3.3 Displaying an Erasure Code Profile's Details

The following command displays details of an erasure code profile as identified by its NAME:

root # ceph osd erasure-code-profile get NAME

24.3.4 Listing Erasure Code Profiles

The following command lists the names of all erasure code profiles:

root # ceph osd erasure-code-profile ls

24.4 Erasure Coded Pools with RADOS Block Device

To mark an EC pool as an RBD pool, tag it accordingly:

cephadm@adm > ceph osd pool application enable ec_pool_name rbd

RBD can store image data in EC pools. However, the image header and metadata still need to be stored in a replicated pool. Assuming you have the pool named 'rbd' for this purpose:

cephadm@adm > rbd create rbd/image_name --size 1T --data-pool ec_pool_name

You can use the image normally like any other image, except that all of the data will be stored in the ec_pool_name pool instead of the 'rbd' pool.

25 Ceph Cluster Configuration

This chapter provides a list of important Ceph cluster settings and their description. The settings are sorted by topic.

25.1 Runtime Configuration

Section 2.14, “Adjusting ceph.conf with Custom Settings” describes how to make changes to the Ceph configuration file ceph.conf. However, the actual cluster behavior is determined not by the current state of the ceph.conf file but by the configuration of the running Ceph daemons, which is stored in memory.

You can query an individual Ceph daemon for a particular configuration setting using the admin socket on the node where the daemon is running. For example, the following command gets the value of the osd_max_write_size configuration parameter from the daemon named osd.0:

cephadm@adm > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok \
config get osd_max_write_size
{
  "osd_max_write_size": "90"
}
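
Because the admin socket prints JSON, the output is easy to process in scripts. The following sketch parses the output shown above. The function name is hypothetical, and the sample string is copied from the example rather than read from a live daemon. Note that the value is returned as a string.

```python
import json


def parse_config_get(output, option):
    """Extract one option from the JSON printed by 'config get'."""
    return json.loads(output)[option]


# Output copied from the example above.
SAMPLE = '{\n  "osd_max_write_size": "90"\n}'

value = parse_config_get(SAMPLE, "osd_max_write_size")
print(value, int(value))
```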

You can also change the daemons' settings at runtime. Remember that this change is temporary and will be lost after the next daemon restart. For example, the following command changes the osd_max_write_size parameter to '50' for all OSDs in the cluster:

cephadm@adm > ceph tell 'osd.*' injectargs '--osd_max_write_size 50'
Warning: injectargs Is Not Reliable

Unfortunately, changing the cluster settings with the injectargs command is not 100% reliable. If you need to be sure that the changed parameter is active, change it in the configuration files on all cluster nodes and restart all daemons in the cluster.

25.2 Ceph OSD and BlueStore

25.2.1 Automatic Cache Sizing

BlueStore can be configured to automatically resize its caches when tc_malloc is configured as the memory allocator and the bluestore_cache_autotune setting is enabled. This option is currently enabled by default. BlueStore will attempt to keep OSD heap memory usage under a designated target size via the osd_memory_target configuration option. This is a best effort algorithm and caches will not shrink smaller than the amount specified by osd_memory_cache_min. Cache ratios will be chosen based on a hierarchy of priorities. If priority information is not available, the bluestore_cache_meta_ratio and bluestore_cache_kv_ratio options are used as fallbacks.

bluestore_cache_autotune

Automatically tunes the ratios assigned to different BlueStore caches while respecting minimum values. Default is True.

osd_memory_target

When tc_malloc and bluestore_cache_autotune are enabled, try to keep this many bytes mapped in memory.

Note

This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped.

osd_memory_cache_min

When tc_malloc and bluestore_cache_autotune are enabled, set the minimum amount of memory used for caches.

Note

Setting this value too low can result in significant cache thrashing.

25.3 Ceph Object Gateway

You can influence the Object Gateway behavior by a number of options in the /etc/ceph/ceph.conf file under sections named

[client.radosgw.INSTANCE_NAME]

If an option is not specified, its default value is used. A complete list of the Object Gateway options follows:

25.3.1 General Settings

rgw_frontends

Configures the HTTP front-end(s). Specify multiple front-ends in a comma-delimited list. Each front-end configuration may include a list of options separated by spaces, where each option is in the form “key=value” or “key”. Default is:

rgw_frontends = beast port=7480

rgw_data

Sets the location of the data files for the Object Gateway. Default is /var/lib/ceph/radosgw/CLUSTER_ID.

rgw_enable_apis

Enables the specified APIs. Default is 's3, swift, swift_auth, admin' (all APIs).

rgw_cache_enabled

Enables or disables the Object Gateway cache. Default is 'true'.

rgw_cache_lru_size

The number of entries in the Object Gateway cache. Default is 10000.

rgw_socket_path

The socket path for the domain socket. FastCgiExternalServer uses this socket. If you do not specify a socket path, the Object Gateway will not run as an external server. The path you specify here needs to be the same as the path specified in the rgw.conf file.

rgw_fcgi_socket_backlog

The socket backlog for fcgi. Default is 1024.

rgw_host

The host for the Object Gateway instance. It can be an IP address or a host name. Default is 0.0.0.0.

rgw_port

The port number where the instance listens for requests. If not specified, the Object Gateway runs external FastCGI.

rgw_dns_name

The DNS name of the served domain.

rgw_script_uri

The alternative value for the SCRIPT_URI if not set in the request.

rgw_request_uri

The alternative value for the REQUEST_URI if not set in the request.

rgw_print_continue

Enable 100-continue if it is operational. Default is 'true'.

rgw_remote_addr_param

The remote address parameter. For example, the HTTP field containing the remote address, or the X-Forwarded-For address if a reverse proxy is operational. Default is REMOTE_ADDR.

rgw_op_thread_timeout

The timeout in seconds for open threads. Default is 600.

rgw_op_thread_suicide_timeout

The timeout in seconds before the Object Gateway process dies. Disabled if set to 0 (default).

rgw_thread_pool_size

Number of threads for the Beast server. Increase to a higher value if you need to serve more requests. Defaults to 100 threads.

rgw_num_rados_handles

The number of RADOS cluster handles for Object Gateway. Each Object Gateway worker thread now gets to pick a RADOS handle for its lifetime. This option may be deprecated and removed in future releases. Default is 1.

rgw_num_control_oids

The number of notification objects used for cache synchronization between different rgw instances. Default is 8.

rgw_init_timeout

The number of seconds before the Object Gateway gives up on initialization. Default is 30.

rgw_mime_types_file

The path of the MIME types file. Used for Swift auto-detection of object types. Default is /etc/mime.types.

rgw_gc_max_objs

The maximum number of objects that may be handled by garbage collection in one garbage collection processing cycle. Default is 32.

rgw_gc_obj_min_wait

The minimum wait time before the object may be removed and handled by garbage collection processing. Default is 2 * 3600.

rgw_gc_processor_max_time

The maximum time between the beginning of two consecutive garbage collection processing cycles. Default is 3600.

rgw_gc_processor_period

The cycle time for garbage collection processing. Default is 3600.

rgw_s3_success_create_obj_status

The alternate success status response for create-obj. Default is 0.

rgw_resolve_cname

Whether the Object Gateway should use the DNS CNAME record of the request host name field (if the host name is not equal to the Object Gateway DNS name). Default is 'false'.

rgw_obj_stripe_size

The size of an object stripe for Object Gateway objects. Default is 4 << 20.

rgw_extended_http_attrs

Add a new set of attributes that can be set on an entity (for example, a user, a bucket, or an object). These extra attributes can be set through HTTP header fields when putting the entity or modifying it using the POST method. If set, these attributes will return as HTTP fields when requesting GET/HEAD on the entity. No extra attributes are defined by default. Example value: 'content_foo, content_bar, x-foo-bar'.

rgw_exit_timeout_secs

Number of seconds to wait for a process before exiting unconditionally. Default is 120.

rgw_get_obj_window_size

The window size in bytes for a single object request. Default is '16 << 20'.

rgw_get_obj_max_req_size

The maximum request size of a single GET operation sent to the Ceph Storage Cluster. Default is 4 << 20.

rgw_relaxed_s3_bucket_names

Enables relaxed S3 bucket name rules for US region buckets. Default is 'false'.

rgw_list_buckets_max_chunk

The maximum number of buckets to retrieve in a single operation when listing user buckets. Default is 1000.

rgw_override_bucket_index_max_shards

Represents the number of shards for the bucket index object. Setting 0 (default) indicates there is no sharding. It is not recommended to set a value too large (for example 1000) as it increases the cost for bucket listing. This variable should be set in the client or global sections so that it is automatically applied to radosgw-admin commands.

rgw_curl_wait_timeout_ms

The timeout in milliseconds for certain curl calls. Default is 1000.

rgw_copy_obj_progress

Enables output of object progress during long copy operations. Default is 'true'.

rgw_copy_obj_progress_every_bytes

The minimum bytes between copy progress output. Default is 1024 * 1024.

rgw_admin_entry

The entry point for an admin request URL. Default is 'admin'.

rgw_content_length_compat

Enable compatibility handling of FCGI requests with both CONTENT_LENGTH and HTTP_CONTENT_LENGTH set. Default is 'false'.

rgw_bucket_quota_ttl

The amount of time in seconds for which cached quota information is trusted. After this timeout, the quota information will be re-fetched from the cluster. Default is 600.

rgw_user_quota_bucket_sync_interval

The amount of time in seconds for which the bucket quota information is accumulated before synchronizing to the cluster. During this time, other Object Gateway instances will not see the changes in the bucket quota stats related to operations on this instance. Default is 180.

rgw_user_quota_sync_interval

The amount of time in seconds for which user quota information is accumulated before synchronizing to the cluster. During this time, other Object Gateway instances will not see the changes in the user quota stats related to operations on this instance. Default is 180.

rgw_bucket_default_quota_max_objects

Default maximum number of objects per bucket. It is set on new users if no other quota is specified, and has no effect on existing users. This variable should be set in the client or global sections so that it is automatically applied to radosgw-admin commands. Default is -1.

rgw_bucket_default_quota_max_size

Default maximum capacity per bucket in bytes. It is set on new users if no other quota is specified, and has no effect on existing users. Default is -1.

rgw_user_default_quota_max_objects

Default maximum number of objects for a user. This includes all objects in all buckets owned by the user. It is set on new users if no other quota is specified, and has no effect on existing users. Default is -1.

rgw_user_default_quota_max_size

The value for user maximum size quota in bytes set on new users if no other quota is specified. It has no effect on existing users. Default is -1.

rgw_verify_ssl

Verify SSL certificates while making requests. Default is 'true'.

rgw_max_chunk_size

Maximum size of a chunk of data that will be read in a single operation. Increasing the value to 4 MB (4194304) will provide better performance when processing large objects. Default is 128 kB (131072).
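
The rgw_frontends syntax described at the beginning of this list (a comma-delimited list of front-ends, each followed by space-separated “key=value” or “key” options) can be parsed with a few lines of code. The following is a minimal sketch of that format as described here. The function name is hypothetical, and the Object Gateway's own parser may accept more than this.

```python
def parse_frontends(value):
    """Parse an rgw_frontends value into a list of (name, options) pairs."""
    frontends = []
    for item in value.split(","):
        tokens = item.split()
        if not tokens:
            continue  # skip empty entries such as a trailing comma
        name, options = tokens[0], {}
        for token in tokens[1:]:
            key, sep, val = token.partition("=")
            # A bare "key" option is recorded as True.
            options[key] = val if sep else True
        frontends.append((name, options))
    return frontends


# Default value from this section.
print(parse_frontends("beast port=7480"))
```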

Multisite Settings
rgw_zone

The name of the zone for the gateway instance. If no zone is set, a cluster-wide default can be configured with the radosgw-admin zone default command.

rgw_zonegroup

The name of the zonegroup for the gateway instance. If no zonegroup is set, a cluster-wide default can be configured with the radosgw-admin zonegroup default command.

rgw_realm

The name of the realm for the gateway instance. If no realm is set, a cluster-wide default can be configured with the radosgw-admin realm default command.

rgw_run_sync_thread

If there are other zones in the realm to synchronize from, spawn threads to handle the synchronization of data and metadata. Default is 'true'.

rgw_data_log_window

The data log entries window in seconds. Default is 30.

rgw_data_log_changes_size

The number of in-memory entries to hold for the data changes log. Default is 1000.

rgw_data_log_obj_prefix

The object name prefix for the data log. Default is 'data_log'.

rgw_data_log_num_shards

The number of shards (objects) on which to keep the data changes log. Default is 128.

rgw_md_log_max_shards

The maximum number of shards for the metadata log. Default is 64.

Swift Settings
rgw_enforce_swift_acls

Enforces the Swift Access Control List (ACL) settings. Default is 'true'.

rgw_swift_token_expiration

The time in seconds for expiring a Swift token. Default is 24 * 3600.

rgw_swift_url

The URL for the Ceph Object Gateway Swift API.

rgw_swift_url_prefix

The URL prefix for the Swift StorageURL that goes in front of the “/v1” part. This allows you to run several Gateway instances on the same host. For compatibility, setting this configuration variable to empty causes the default “/swift” to be used. Use the explicit prefix “/” to start the StorageURL at the root.

Warning

Setting this option to “/” will not work if S3 API is enabled. Keep in mind that disabling S3 will make it impossible to deploy the Object Gateway in the multisite configuration!

rgw_swift_auth_url

Default URL for verifying v1 authentication tokens when the internal Swift authentication is not used.

rgw_swift_auth_entry

The entry point for a Swift authentication URL. Default is 'auth'.

rgw_swift_versioning_enabled

Enables the Object Versioning of OpenStack Object Storage API. This allows clients to put the X-Versions-Location attribute on containers that should be versioned. The attribute specifies the name of container storing archived versions. It must be owned by the same user as the versioned container for reasons of access control verification—ACLs are not taken into consideration. Those containers cannot be versioned by the S3 object versioning mechanism. Default is 'false'.

Logging Settings
rgw_log_nonexistent_bucket

Enables the Object Gateway to log a request for a non-existent bucket. Default is 'false'.

rgw_log_object_name

The logging format for an object name. See the manual page man 1 date for details about format specifiers. Default is '%Y-%m-%d-%H-%i-%n'.

rgw_log_object_name_utc

Whether a logged object name includes a UTC time. If set to 'false' (default), it uses the local time.

rgw_usage_max_shards

The maximum number of shards for usage logging. Default is 32.

rgw_usage_max_user_shards

The maximum number of shards used for a single user’s usage logging. Default is 1.

rgw_enable_ops_log

Enable logging for each successful Object Gateway operation. Default is 'false'.

rgw_enable_usage_log

Enable the usage log. Default is 'false'.

rgw_ops_log_rados

Whether the operations log should be written to the Ceph Storage Cluster back-end. Default is 'true'.

rgw_ops_log_socket_path

The Unix domain socket for writing operations logs.

rgw_ops_log_data_backlog

The maximum data backlog size for operations logs written to a Unix domain socket. Default is 5 << 20.
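
Several defaults in this list are written as expressions rather than plain numbers. The following minimal Python sketch evaluates three of them. The variable names are illustrative only, and the unit of the backlog size is assumed to be bytes.

```python
# Defaults quoted in this section as expressions, evaluated to plain numbers.
ops_log_data_backlog = 5 << 20          # rgw_ops_log_data_backlog (assumed bytes)
swift_token_expiration = 24 * 3600      # rgw_swift_token_expiration, in seconds
keystone_revocation_interval = 15 * 60  # rgw_keystone_revocation_interval, in seconds

print(ops_log_data_backlog)          # 5242880, that is 5 * 2**20
print(swift_token_expiration)        # 86400, one day
print(keystone_revocation_interval)  # 900, fifteen minutes
```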

rgw_usage_log_flush_threshold

The number of dirty merged entries in the usage log before flushing synchronously. Default is 1024.

rgw_usage_log_tick_interval

Flush pending usage log data every 'n' seconds. Default is 30.

rgw_log_http_headers

Comma-delimited list of HTTP headers to include in log entries. Header names are case-insensitive, and use the full header name with words separated by underscores. For example, 'http_x_forwarded_for', 'http_x_special_k'.
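
As a rough illustration of the naming convention, the following Python sketch converts ordinary HTTP header names to the form used in the examples above. The helper name is hypothetical, and the 'http_' prefix and lowercasing are inferred from the two examples rather than taken from a specification.

```python
def to_log_header_name(header):
    """Convert an HTTP header name such as 'X-Forwarded-For' to the
    underscore-separated form shown for rgw_log_http_headers."""
    return "http_" + header.strip().lower().replace("-", "_")


headers = ["X-Forwarded-For", "X-Special-K"]
# Join the converted names into a comma-delimited option value.
option_value = ",".join(to_log_header_name(h) for h in headers)
print(option_value)  # http_x_forwarded_for,http_x_special_k
```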

rgw_intent_log_object_name

The logging format for the intent log object name. See the manual page man 1 date for details about format specifiers. Default is '%Y-%m-%d-%i-%n'.

rgw_intent_log_object_name_utc

Whether the intent log object name includes a UTC time. If set to 'false' (default), it uses the local time.

Keystone Settings
rgw_keystone_url

The URL for the Keystone server.

rgw_keystone_api_version

The version (2 or 3) of OpenStack Identity API that should be used for communication with the Keystone server. Default is 2.

rgw_keystone_admin_domain

The name of the OpenStack domain with the administrator privilege when using OpenStack Identity API v3.

rgw_keystone_admin_project

The name of the OpenStack project with the administrator privilege when using OpenStack Identity API v3. If not set, the value of the rgw keystone admin tenant will be used instead.

rgw_keystone_admin_token

The Keystone administrator token (shared secret). In the Object Gateway, authentication with the administrator token has priority over authentication with the administrator credentials (options rgw keystone admin user, rgw keystone admin password, rgw keystone admin tenant, rgw keystone admin project, and rgw keystone admin domain). The administrator token feature is considered deprecated.

rgw_keystone_admin_tenant

The name of the OpenStack tenant with the administrator privilege (Service Tenant) when using OpenStack Identity API v2.

rgw_keystone_admin_user

The name of the OpenStack user with the administrator privilege for Keystone authentication (Service User) when using OpenStack Identity API v2.

rgw_keystone_admin_password

The password for the OpenStack administrator user when using OpenStack Identity API v2.

rgw_keystone_accepted_roles

The roles required to serve requests. Default is 'Member, admin'.

rgw_keystone_token_cache_size

The maximum number of entries in each Keystone token cache. Default is 10000.

rgw_keystone_revocation_interval

The number of seconds between token revocation checks. Default is 15 * 60.

rgw_keystone_verify_ssl

Verify SSL certificates while making token requests to Keystone. Default is 'true'.

25.3.1.1 Additional Notes

rgw_dns_name

If the parameter rgw dns name is added to the ceph.conf, make sure that the S3 client is configured to direct requests at the endpoint specified by rgw dns name.

25.3.2 HTTP Front-ends

25.3.2.1 Beast

port, ssl_port

IPv4 & IPv6 listening port numbers. You can specify multiple port numbers:

port=80 port=8000 ssl_port=8080

Default is 80.

endpoint, ssl_endpoint

The listening addresses in the form 'address[:port]', where the address is an IPv4 address string in dotted decimal form, or an IPv6 address in hexadecimal notation surrounded by square brackets. If you specify an IPv6 endpoint, the gateway listens on IPv6 only. The optional port number defaults to 80 for endpoint and 443 for ssl_endpoint. You can specify multiple addresses:

endpoint=[::1] endpoint=192.168.0.100:8000 ssl_endpoint=192.168.0.100:8080

ssl_private_key

Optional path to the private key file used for SSL-enabled endpoints. If not specified, the ssl_certificate file is used as a private key.

tcp_nodelay

If specified, the socket option will disable Nagle's algorithm on the connection. It means that packets will be sent as soon as possible instead of waiting for a full buffer or timeout to occur.

'1' disables Nagle's algorithm for all sockets.

'0' keeps Nagle's algorithm enabled (default).
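
To make the 'address[:port]' form accepted by endpoint and ssl_endpoint more concrete, here is a minimal Python sketch that splits such a value into an address and a port. The function name is hypothetical, and the parsing is an approximation of the format described above, not the gateway's own parser.

```python
def parse_endpoint(value, default_port=80):
    """Split 'address[:port]' into (address, port).

    IPv6 addresses are expected in square brackets, as in '[::1]:8000'.
    Use default_port=443 when parsing an ssl_endpoint value.
    """
    if value.startswith("["):
        # IPv6 literal: '[addr]' optionally followed by ':port'
        address, _, rest = value[1:].partition("]")
        port = rest[1:] if rest.startswith(":") else ""
    else:
        # IPv4 address optionally followed by ':port'
        address, _, port = value.partition(":")
    return address, int(port) if port else default_port


print(parse_endpoint("[::1]"))                            # ('::1', 80)
print(parse_endpoint("192.168.0.100:8000"))               # ('192.168.0.100', 8000)
print(parse_endpoint("192.168.0.100", default_port=443))  # ('192.168.0.100', 443)
```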

Example 25.1: Example Beast Configuration in /etc/ceph/ceph.conf
rgw_frontends = beast port=8000 ssl_port=443 ssl_certificate=/etc/ssl/ssl.crt error_log_file=/var/log/radosgw/civetweb.error.log

25.3.2.2 CivetWeb

port

The listening port number. For SSL-enabled ports, add an 's' suffix (for example, '443s'). To bind a specific IPv4 or IPv6 address, use the form 'address:port'. You can specify multiple endpoints either by joining them with '+' or by providing multiple options:

port=127.0.0.1:8000+443s
port=8000 port=443s

Default is 7480.
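
The port syntax can be read as follows. This Python sketch is only an illustration of the rules described above ('+' as separator, 's' suffix for SSL, optional 'address:' prefix); the function name is hypothetical and CivetWeb's real parser may accept more forms.

```python
def parse_civetweb_ports(spec):
    """Parse a CivetWeb 'port' value into (address, port, ssl) tuples.

    address is None when no specific address is bound.
    """
    result = []
    for item in spec.split("+"):
        ssl = item.endswith("s")       # an 's' suffix marks an SSL-enabled port
        if ssl:
            item = item[:-1]
        # Everything before the last ':' is the address, if present.
        address, _, port = item.rpartition(":")
        result.append((address or None, int(port), ssl))
    return result


print(parse_civetweb_ports("127.0.0.1:8000+443s"))
# [('127.0.0.1', 8000, False), (None, 443, True)]
```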

num_threads

The number of threads spawned by Civetweb to handle incoming HTTP connections. This effectively limits the number of concurrent connections that the front-end can service.

Default is the value specified by the rgw_thread_pool_size option.

request_timeout_ms

The amount of time in milliseconds that Civetweb will wait for more incoming data before giving up.

Default is 30000 milliseconds.

access_log_file

Path to the access log file. You can specify either a full path, or a path relative to the current working directory. If not specified (default), then accesses are not logged.

error_log_file

Path to the error log file. You can specify either a full path, or a path relative to the current working directory. If not specified (default), then errors are not logged.

Example 25.2: Example Civetweb Configuration in /etc/ceph/ceph.conf
rgw_frontends = civetweb port=8000+443s request_timeout_ms=30000 error_log_file=/var/log/radosgw/civetweb.error.log

25.3.2.3 Common Options

ssl_certificate

Path to the SSL certificate file used for SSL-enabled endpoints.

prefix

A prefix string that is inserted into the URI of all requests. For example, a Swift-only front-end could supply a URI prefix of '/swift'.

Part IV Accessing Cluster Data

26 Ceph Object Gateway

This chapter introduces details about administration tasks related to Object Gateway, such as checking status of the service, managing accounts, multisite gateways, or LDAP authentication.

27 Ceph iSCSI Gateway

The chapter focuses on administration tasks related to the iSCSI Gateway. For a procedure of deployment refer to Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway”.

28 Clustered File System

This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 11 “Installation of CephFS”.

29 Exporting Ceph Data via Samba

This chapter describes how to export data stored in a Ceph cluster via a Samba/CIFS share so that you can easily access them from Windows* client machines. It also includes information that will help you configure a Ceph Samba gateway to join Active Directory in the Windows* domain to authenticate a…

30 NFS Ganesha: Export Ceph Data via NFS

NFS Ganesha is an NFS server (refer to Sharing File Systems with NFS ) that runs in a user address space instead of as part of the operating system kernel. With NFS Ganesha, you can plug in your own storage mechanism—such as Ceph—and access it from any NFS client.

26 Ceph Object Gateway

This chapter introduces details about administration tasks related to Object Gateway, such as checking status of the service, managing accounts, multisite gateways, or LDAP authentication.

26.1 Object Gateway Restrictions and Naming Limitations

Following is a list of important Object Gateway limits:

26.1.1 Bucket Limitations

When accessing Object Gateway via the S3 API, bucket names are limited to DNS-compliant names with a dash character '-' allowed. When accessing Object Gateway via the Swift API, you may use any combination of UTF-8 supported characters except for a slash character '/'. The maximum length of a bucket name is 255 characters. Bucket names must be unique.

Tip
Tip: Use DNS-compliant Bucket Names

Although you may use any UTF-8 based bucket name via the Swift API, it is recommended to name buckets with regard to the S3 naming limitations to avoid problems accessing the same bucket via the S3 API.
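
The naming rules above can be approximated with a short check before creating a bucket. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, and 'DNS-compliant' is interpreted here as lowercase letters, digits, and dashes in dot-separated labels that do not start or end with a dash. The checks that Object Gateway itself applies may differ.

```python
import re

MAX_BUCKET_NAME_LENGTH = 255  # maximum length stated above

# One DNS label: lowercase alphanumerics and dashes, no leading/trailing dash.
_DNS_LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
_DNS_NAME = re.compile(r"%s(?:\.%s)*" % (_DNS_LABEL, _DNS_LABEL))


def valid_for_swift(name):
    """Any characters except '/', up to the maximum length."""
    return 0 < len(name) <= MAX_BUCKET_NAME_LENGTH and "/" not in name


def valid_for_s3(name):
    """Swift rules plus the (approximated) DNS-compliance requirement."""
    return valid_for_swift(name) and _DNS_NAME.fullmatch(name) is not None


for candidate in ["my-new-bucket", "Übung 1", "a/b"]:
    print(candidate, valid_for_swift(candidate), valid_for_s3(candidate))
```

A name that passes valid_for_s3() also passes valid_for_swift(), which matches the recommendation in the tip above.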

26.1.2 Stored Object Limitations

Maximum number of objects per user

No restriction by default (limited by ~ 2^63).

Maximum number of objects per bucket

No restriction by default (limited by ~ 2^63).

Maximum size of an object to upload/store

Single uploads are restricted to 5 GB. Use multipart for larger object sizes. The maximum number of multipart chunks is 10000.
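
The two limits above determine how large the parts of a multipart upload must be at minimum. The following Python sketch works through the arithmetic. The function names are hypothetical, and 5 GB is interpreted as 5 * 1024^3 bytes, which is an assumption.

```python
MAX_SINGLE_UPLOAD = 5 * 1024**3  # single-upload limit, assuming 5 GB = 5 GiB
MAX_PARTS = 10000                # maximum number of multipart chunks


def needs_multipart(object_size):
    """True if the object is too large for a single upload."""
    return object_size > MAX_SINGLE_UPLOAD


def min_part_size(object_size):
    """Smallest part size in bytes that fits the object into MAX_PARTS parts."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (object_size + MAX_PARTS - 1) // MAX_PARTS


one_tib = 1024**4
print(needs_multipart(one_tib))  # True
print(min_part_size(one_tib))    # 109951163 bytes, roughly 105 MiB per part
```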

26.1.3 HTTP Header Limitations

HTTP header and request limitations depend on the Web front-end used. The default front-end, Beast, restricts the size of the HTTP header to 16 kB.

26.2 Deploying the Object Gateway

The recommended way of deploying the Ceph Object Gateway is via the DeepSea infrastructure by adding the relevant role-rgw [...] line(s) into the policy.cfg file on the Salt master, and running the required DeepSea stages.

  • To include the Object Gateway during the Ceph cluster deployment process, refer to Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.3 “Cluster Deployment” and Book “Deployment Guide”, Chapter 5 “Deploying with DeepSea/Salt”, Section 5.5.1 “The policy.cfg File”.

  • To add the Object Gateway role to an already deployed cluster, refer to Section 2.2, “Adding New Roles to Nodes”.

26.3 Operating the Object Gateway Service

The Object Gateway service is operated with the systemctl command. You need to have root privileges to operate the Object Gateway service. Note that GATEWAY_HOST is the host name of the server whose Object Gateway instance you need to operate.

The following subcommands are supported for the Object Gateway service:

systemctl status ceph-radosgw@rgw.GATEWAY_HOST

Prints the status information of the service.

systemctl start ceph-radosgw@rgw.GATEWAY_HOST

Starts the service if it is not already running.

systemctl restart ceph-radosgw@rgw.GATEWAY_HOST

Restarts the service.

systemctl stop ceph-radosgw@rgw.GATEWAY_HOST

Stops the running service.

systemctl enable ceph-radosgw@rgw.GATEWAY_HOST

Enables the service so that it is automatically started on system start-up.

systemctl disable ceph-radosgw@rgw.GATEWAY_HOST

Disables the service so that it is not automatically started on system start-up.

26.4 Configuration Options

Refer to Section 25.3, “Ceph Object Gateway” for a list of Object Gateway configuration options.

26.5 Managing Object Gateway Access

You can communicate with Object Gateway using either the S3-compatible or the Swift-compatible interface. The S3 interface is compatible with a large subset of the Amazon S3 RESTful API. The Swift interface is compatible with a large subset of the OpenStack Swift API.

Both interfaces require you to create a specific user, and install the relevant client software to communicate with the gateway using the user's secret key.

26.5.1 Accessing Object Gateway

26.5.1.1 S3 Interface Access

To access the S3 interface, you need a REST client. S3cmd is a command line S3 client. You can find it in the openSUSE Build Service. The repository contains versions for both SUSE Linux Enterprise and openSUSE based distributions.

If you want to test your access to the S3 interface, you can also write a small Python script. The script will connect to Object Gateway, create a new bucket, and list all buckets. The values for aws_access_key_id and aws_secret_access_key are taken from the values of access_key and secret_key returned by the radosgw-admin command from Section 26.5.2.1, “Adding S3 and Swift Users”.

  1. Install the python-boto package:

    root # zypper in python-boto
  2. Create a new Python script called s3test.py with the following content:

    import boto
    import boto.s3.connection

    # Keys returned by radosgw-admin when the S3 user was created
    access_key = '11BS02LGFB6AL6H1ADMW'
    secret_key = 'vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY'

    # Connect over plain HTTP using path-style bucket addressing
    conn = boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host='HOSTNAME',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Create a bucket, then list the name and creation date of all buckets
    bucket = conn.create_bucket('my-new-bucket')
    for bucket in conn.get_all_buckets():
        print("{name}\t{created}".format(
            name=bucket.name,
            created=bucket.creation_date,
        ))

    Replace HOSTNAME with the host name of the host where you configured the Object Gateway service, for example gateway_host.

  3. Run the script:

    python s3test.py

    The script outputs something like the following:

    my-new-bucket 2015-07-22T15:37:42.000Z

26.5.1.2 Swift Interface Access

To access Object Gateway via the Swift interface, you need the swift command line client. Its manual page man 1 swift tells you more about its command line options.

The package is included in the 'Public Cloud' module for SUSE Linux Enterprise 12 from SP3 and SUSE Linux Enterprise 15. Before installing the package, you need to activate the module and refresh the software repository:

root # SUSEConnect -p sle-module-public-cloud/12/SYSTEM-ARCH
root # zypper refresh

Or

root # SUSEConnect -p sle-module-public-cloud/15/SYSTEM-ARCH
root # zypper refresh

To install the swift command, run the following:

root # zypper in python-swiftclient

The swift access uses the following syntax:

tux > swift -A http://IP_ADDRESS/auth/1.0 \
-U example_user:swift -K 'SWIFT_SECRET_KEY' list

Replace IP_ADDRESS with the IP address of the gateway server, and SWIFT_SECRET_KEY with its value from the output of the radosgw-admin key create command executed for the swift user in Section 26.5.2.1, “Adding S3 and Swift Users”.

For example:

tux > swift -A http://gateway.example.com/auth/1.0 -U example_user:swift \
-K 'r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h' list

The output is:

my-new-bucket
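
The -A, -U, and -K options map onto a simple HTTP exchange. As a rough sketch, assuming the gateway follows the usual Swift v1 authentication convention (the client sends X-Auth-User and X-Auth-Key headers and receives X-Storage-Url and X-Auth-Token in the response), the request could be built in Python as follows. The function names are hypothetical, and authenticate() requires a reachable gateway.

```python
import urllib.request


def swift_v1_auth_request(gateway, user, key):
    """Return (url, headers) for a Swift v1 authentication request."""
    url = "http://%s/auth/1.0" % gateway
    headers = {"X-Auth-User": user, "X-Auth-Key": key}
    return url, headers


def authenticate(gateway, user, key):
    """Send the request and return (storage_url, token).

    Assumes the response carries X-Storage-Url and X-Auth-Token headers.
    """
    url, headers = swift_v1_auth_request(gateway, user, key)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.headers["X-Storage-Url"], response.headers["X-Auth-Token"]


url, headers = swift_v1_auth_request(
    "gateway.example.com", "example_user:swift", "SWIFT_SECRET_KEY")
print(url)  # http://gateway.example.com/auth/1.0
```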

26.5.2 Managing S3 and Swift Accounts

26.5.2.1 Adding S3 and Swift Users

You need to create a user, access key and secret to enable end users to interact with the gateway. There are two types of users: users and subusers. While users are used when interacting with the S3 interface, subusers are users of the Swift interface. Each subuser is associated with a user.

Users can also be added via the DeepSea file rgw.sls. For an example, see Section 30.3.1, “Different Object Gateway Users for NFS Ganesha”.

To create a Swift user, follow the steps:

  1. To create a Swift user—which is a subuser in our terminology—you need to create the associated user first.

    cephadm@adm > radosgw-admin user create --uid=USERNAME \
     --display-name="DISPLAY-NAME" --email=EMAIL

    For example:

    cephadm@adm > radosgw-admin user create \
       --uid=example_user \
       --display-name="Example User" \
       --email=penguin@example.com
  2. To create a subuser (Swift interface) for the user, you must specify the user ID (--uid=USERNAME), a subuser ID, and the access level for the subuser.

    cephadm@adm > radosgw-admin subuser create --uid=UID \
     --subuser=UID \
     --access=[ read | write | readwrite | full ]

    For example:

    cephadm@adm > radosgw-admin subuser create --uid=example_user \
     --subuser=example_user:swift --access=full
  3. Generate a secret key for the user.

    cephadm@adm > radosgw-admin key create \
       --gen-secret \
       --subuser=example_user:swift \
       --key-type=swift
  4. Both commands will output JSON-formatted data showing the user state. Notice the following lines, and remember the secret_key value:

    "swift_keys": [
       { "user": "example_user:swift",
         "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}],

When accessing Object Gateway through the S3 interface you need to create an S3 user by running:

cephadm@adm > radosgw-admin user create --uid=USERNAME \
 --display-name="DISPLAY-NAME" --email=EMAIL

For example:

cephadm@adm > radosgw-admin user create \
   --uid=example_user \
   --display-name="Example User" \
   --email=penguin@example.com

The command also creates the user's access and secret key. Check its output for access_key and secret_key keywords and their values:

[...]
 "keys": [
       { "user": "example_user",
         "access_key": "11BS02LGFB6AL6H1ADMW",
         "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
 [...]
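
Because radosgw-admin prints JSON, the keys can also be extracted programmatically. The following Python sketch is a minimal illustration. The function name is hypothetical, and the sample contains only the fields shown in the excerpts above, whereas real output contains more.

```python
import json


def extract_keys(user_json):
    """Return S3 (access_key, secret_key) pairs and Swift (user, secret_key) pairs."""
    info = json.loads(user_json)
    s3 = [(k["access_key"], k["secret_key"]) for k in info.get("keys", [])]
    swift = [(k["user"], k["secret_key"]) for k in info.get("swift_keys", [])]
    return s3, swift


# Abbreviated sample built from the excerpts shown in this section.
sample = """
{
  "keys": [
    {"user": "example_user",
     "access_key": "11BS02LGFB6AL6H1ADMW",
     "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
  "swift_keys": [
    {"user": "example_user:swift",
     "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}]
}
"""

s3_keys, swift_keys = extract_keys(sample)
print(s3_keys[0][0])     # 11BS02LGFB6AL6H1ADMW
print(swift_keys[0][0])  # example_user:swift
```

In practice, the JSON text would come from the output of a command such as radosgw-admin user info --uid=example_user.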

26.5.2.2 Removing S3 and Swift Users

The procedure for deleting users is similar for S3 and Swift users. However, in the case of Swift users, you may need to delete the user including its subusers.

To remove an S3 or Swift user (including all its subusers), specify user rm and the user ID in the following command:

cephadm@adm > radosgw-admin user rm --uid=example_user

To remove a subuser, specify subuser rm and the subuser ID.

cephadm@adm > radosgw-admin subuser rm --subuser=example_user:swift

You can make use of the following options:

--purge-data

Purges all data associated with the user ID.

--purge-keys

Purges all keys associated with the user ID.

Tip
Tip: Removing a Subuser

When you remove a subuser, you are removing access to the Swift interface. The user will remain in the system.

26.5.2.3 Changing S3 and Swift User Access and Secret Keys

The access_key and secret_key parameters identify the Object Gateway user when accessing the gateway. Changing the existing user keys is the same as creating new ones, as the old keys get overwritten.

For S3 users, run the following:

cephadm@adm > radosgw-admin key create --uid=EXAMPLE_USER --key-type=s3 --gen-access-key --gen-secret

For Swift users, run the following:

cephadm@adm > radosgw-admin key create --subuser=EXAMPLE_USER:swift --key-type=swift --gen-secret

--key-type=TYPE

Specifies the type of key. Either swift or s3.

--gen-access-key

Generates a random access key (for S3 user by default).

--gen-secret

Generates a random secret key.

--secret=KEY

Specifies a secret key, for example manually generated.

26.5.2.4 User Quota Management

The Ceph Object Gateway enables you to set quotas on users and buckets owned by users. Quotas include the maximum number of objects and the maximum storage size.

Before you enable a user quota, you first need to set its parameters:

cephadm@adm > radosgw-admin quota set --quota-scope=user --uid=EXAMPLE_USER \
 --max-objects=1024 --max-size=1024

--max-objects

Specifies the maximum number of objects. A negative value disables the check.

--max-size

Specifies the maximum number of bytes. A negative value disables the check.

--quota-scope

Sets the scope for the quota. The options are bucket and user. Bucket quotas apply to buckets a user owns. User quotas apply to a user.

Once you set a user quota, you may enable it:

cephadm@adm > radosgw-admin quota enable --quota-scope=user --uid=EXAMPLE_USER

To disable a quota:

cephadm@adm > radosgw-admin quota disable --quota-scope=user --uid=EXAMPLE_USER

To list quota settings:

cephadm@adm > radosgw-admin user info --uid=EXAMPLE_USER

To update quota statistics:

cephadm@adm > radosgw-admin user stats --uid=EXAMPLE_USER --sync-stats
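
When quotas are managed for many users, the set and enable commands can be generated by a script. The following Python sketch only assembles the argument lists from the options described above and does not run them. The function name is hypothetical, and --max-size is given in bytes as stated in the option description.

```python
def quota_commands(uid, max_objects=-1, max_size_bytes=-1, scope="user"):
    """Build the 'quota set' and 'quota enable' commands for one user.

    Negative limits disable the corresponding check.
    """
    base = ["radosgw-admin", "quota"]
    target = ["--quota-scope=%s" % scope, "--uid=%s" % uid]
    set_cmd = base + ["set"] + target + [
        "--max-objects=%d" % max_objects,
        "--max-size=%d" % max_size_bytes,
    ]
    enable_cmd = base + ["enable"] + target
    return set_cmd, enable_cmd


# Example: at most 1024 objects and 10 GiB for EXAMPLE_USER.
set_cmd, enable_cmd = quota_commands("EXAMPLE_USER", 1024, 10 * 1024**3)
print(" ".join(set_cmd))
print(" ".join(enable_cmd))
```

The resulting lists can be passed to subprocess.run() on a node where radosgw-admin is available.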

26.6 HTTP Front-ends

The Ceph Object Gateway supports two embedded HTTP front-ends: Beast and Civetweb.

The Beast front-end uses the Boost.Beast library for HTTP parsing and the Boost.Asio library for asynchronous network I/O.

The Civetweb front-end uses the Civetweb HTTP library, which is a fork of Mongoose.

You can configure them with the rgw_frontends option in the /etc/ceph/ceph.conf file. Refer to Section 25.3, “Ceph Object Gateway” for a list of configuration options.

26.7 Enabling HTTPS/SSL for Object Gateways

To enable the default Object Gateway role to communicate securely using SSL, you need to either have a CA-issued certificate or create a self-signed one. There are two ways to configure Object Gateway with HTTPS enabled: a simple way that makes use of the default settings, and an advanced way that lets you fine-tune HTTPS-related settings.

26.7.1 Create a Self-Signed Certificate

Tip

Skip this section if you already have a valid certificate signed by a CA.

By default, DeepSea expects the certificate file in /srv/salt/ceph/rgw/cert/rgw.pem on the Salt master. It will then distribute the certificate to /etc/ceph/rgw.pem on the Salt minion with the Object Gateway role, where Ceph reads it.

The following procedure describes how to generate a self-signed SSL certificate on the Salt master.

  1. If you need your Object Gateway to be known by additional subject identities, add them to the subjectAltName option in the [v3_req] section of the /etc/ssl/openssl.cnf file:

    [...]
    [ v3_req ]
    subjectAltName = DNS:server1.example.com,DNS:server2.example.com
    [...]

    Tip
    Tip: IP Addresses in subjectAltName

    To use IP addresses instead of domain names in the subjectAltName option, replace the example line with the following:

    subjectAltName = IP:10.0.0.10,IP:10.0.0.11
  2. Create the key and the certificate using openssl. Enter all data you need to include in your certificate. We recommend entering the FQDN as the common name. Before signing the certificate, verify that 'X509v3 Subject Alternative Name:' is included in the requested extensions, and that the resulting certificate has 'X509v3 Subject Alternative Name:' set.

    root@master # openssl req -x509 -nodes -days 1095 \
     -newkey rsa:4096 -keyout rgw.key -out /srv/salt/ceph/rgw/cert/rgw.pem
  3. Append the key to the certificate file:

    root@master # cat rgw.key >> /srv/salt/ceph/rgw/cert/rgw.pem
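
After the last step, the rgw.pem file should contain both the certificate and the private key. A quick sanity check could look like the following Python sketch. The function name is hypothetical, and the check only looks for the PEM block markers. It does not validate the certificate or the key.

```python
import re


def pem_has_cert_and_key(pem_text):
    """True if the text contains a certificate block and a private key block."""
    has_cert = "-----BEGIN CERTIFICATE-----" in pem_text
    # Matches 'PRIVATE KEY' as well as variants such as 'RSA PRIVATE KEY'.
    has_key = re.search(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----", pem_text) is not None
    return has_cert and has_key


# Placeholder content that stands in for a real combined file.
combined = (
    "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
    "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
)
print(pem_has_cert_and_key(combined))  # True
```

To check the real file on the Salt master, read /srv/salt/ceph/rgw/cert/rgw.pem and pass its content to the function.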

26.7.2 Simple HTTPS Configuration

By default, Ceph on the Object Gateway node reads the /etc/ceph/rgw.pem certificate, and uses port 443 for secure SSL communication. If you do not need to change these values, follow these steps:

  1. Edit /srv/pillar/ceph/stack/global.yml and add the following line:

    rgw_init: default-ssl
  2. Copy the default Object Gateway SSL configuration to the ceph.conf.d subdirectory:

    root@master # cp /srv/salt/ceph/configuration/files/rgw-ssl.conf \
     /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf
  3. Run DeepSea stages 2, 3, and 4 to apply the changes:

    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
    root@master # salt-run state.orch ceph.stage.4

26.7.3 Advanced HTTPS Configuration

If you need to change the default values for SSL settings of the Object Gateway, follow these steps:

  1. Edit /srv/pillar/ceph/stack/global.yml and add the following line:

    rgw_init: default-ssl
  2. Copy the default Object Gateway SSL configuration to the ceph.conf.d subdirectory:

    root@master # cp /srv/salt/ceph/configuration/files/rgw-ssl.conf \
     /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf
  3. Edit /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf and change the default options, such as port number or path to the SSL certificate, to reflect your setup.

  4. Run DeepSea stages 2, 3, and 4 to apply the changes:

    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
    root@master # salt-run state.orch ceph.stage.4

Tip
Tip: Binding to Multiple Ports

The Beast server can bind to multiple ports. This is useful if you need to access a single Object Gateway instance with both SSL and non-SSL connections. A two-port configuration line example follows:

[client.{{ client }}]
rgw_frontends = beast port=80 ssl_port=443 ssl_certificate=/etc/ceph/rgw.pem

26.8 Synchronization Modules

The multisite functionality of Object Gateway allows you to create multiple zones and mirror data and metadata between them. Synchronization modules are built atop the multisite framework and allow data and metadata to be forwarded to a different external tier. A synchronization module allows a set of actions to be performed whenever a change in data occurs (for example, metadata operations such as bucket or user creation). Because the Object Gateway multisite changes are eventually consistent at remote sites, changes are propagated asynchronously. This covers use cases such as backing up the object storage to an external cloud cluster, a custom backup solution using tape drives, or indexing metadata in ElasticSearch.

26.8.1 General Configuration

All synchronization modules are configured in a similar way. You need to create a new zone (refer to Section 26.13, “Multisite Object Gateways” for more details) and set its --tier-type option, for example --tier-type=cloud for the cloud synchronization module:

cephadm@adm > radosgw-admin zone create --rgw-zonegroup=ZONE-GROUP-NAME \
 --rgw-zone=ZONE-NAME \
 --endpoints=http://endpoint1.example.com,http://endpoint2.example.com, [...] \
 --tier-type=cloud

You can configure the specific tier by using the following command:

cephadm@adm > radosgw-admin zone modify --rgw-zonegroup=ZONE-GROUP-NAME \
 --rgw-zone=ZONE-NAME \
 --tier-config=KEY1=VALUE1,KEY2=VALUE2

The KEY in the configuration specifies the configuration variable that you want to update, and the VALUE specifies its new value. Nested values can be accessed using a period. For example:

cephadm@adm > radosgw-admin zone modify --rgw-zonegroup=ZONE-GROUP-NAME \
 --rgw-zone=ZONE-NAME \
 --tier-config=connection.access_key=KEY,connection.secret=SECRET

You can access array entries by appending the index of the referenced entry enclosed in square brackets. You can add a new array entry by using empty square brackets '[]'. An index value of -1 references the last entry in the array. It is not possible to create a new entry and reference it again in the same command. For example, the following commands create a new profile for buckets starting with PREFIX and then configure it:

cephadm@adm > radosgw-admin zone modify --rgw-zonegroup=ZONE-GROUP-NAME \
 --rgw-zone=ZONE-NAME \
 --tier-config=profiles[].source_bucket=PREFIX'*'
cephadm@adm > radosgw-admin zone modify --rgw-zonegroup=ZONE-GROUP-NAME \
 --rgw-zone=ZONE-NAME \
 --tier-config=profiles[-1].connection_id=CONNECTION_ID,profiles[-1].acls_id=ACLS_ID

Tip
Tip: Adding and Removing Configuration Entries

You can add a new tier configuration entry by using the --tier-config-add=KEY=VALUE parameter.

You can remove an existing entry by using --tier-config-rm=KEY.
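
The key notation described in this section (periods for nesting, '[]' to append, '[-1]' for the last entry) can be modeled with a small Python function that applies one KEY=VALUE pair to a nested dictionary. This is purely an illustration of the notation under those assumptions. The function name is hypothetical, and this is not how radosgw-admin is implemented.

```python
import re

# A path element is a name optionally followed by '[]' or '[INDEX]'.
_TOKEN = re.compile(r"([^.\[\]]+)(?:\[(-?\d*)\])?")


def set_tier_config(config, path, value):
    """Set value at the dotted path inside the nested dict config."""
    node = config
    parts = path.split(".")
    for i, part in enumerate(parts):
        match = _TOKEN.fullmatch(part)
        if match is None:
            raise ValueError("unsupported path element: %r" % part)
        name, index = match.group(1), match.group(2)
        last = i == len(parts) - 1
        if index is None:            # plain key, for example 'connection'
            if last:
                node[name] = value
            else:
                node = node.setdefault(name, {})
            continue
        items = node.setdefault(name, [])
        if index == "":              # '[]' appends a new array entry
            items.append(value if last else {})
            node = items[-1]
        elif last:                   # '[N]' or '[-1]' as the final element
            items[int(index)] = value
        else:                        # descend into an existing entry
            node = items[int(index)]
    return config


config = {}
set_tier_config(config, "connection.access_key", "KEY")
set_tier_config(config, "profiles[].source_bucket", "PREFIX*")
set_tier_config(config, "profiles[-1].connection_id", "CONNECTION_ID")
print(config)
```

The second and third calls mirror the two-command example above: the first one creates the profile entry, and the next one references it as the last entry of the array.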

26.8.2 Synchronizing Zones

A synchronization module configuration is local to a zone. The synchronization module determines whether the zone exports data or can only consume data that was modified in another zone. As of Luminous, the supported synchronization plug-ins are ElasticSearch; rgw, the default plug-in, which synchronizes data between the zones; and log, a trivial plug-in that logs the metadata operations that happen in the remote zones. The following sections use the example of a zone with the ElasticSearch synchronization module. The process is similar for configuring any other synchronization plug-in.

Note
Note: Default Synchronization Plug-in

rgw is the default synchronization plug-in and there is no need to configure it explicitly.

26.8.2.1 Requirements and Assumptions

Assume a simple multisite configuration, as described in Section 26.13, “Multisite Object Gateways”, that consists of two zones: us-east and us-west. Now we add a third zone, us-east-es, which only processes metadata from the other sites. This zone can be in the same Ceph cluster as us-east or in a different one. It only consumes metadata from other zones, and Object Gateways in this zone do not serve any end user requests directly.

26.8.2.2 Configuring Synchronization Modules

  1. Create the third zone similar to the ones described in Section 26.13, “Multisite Object Gateways”, for example:

    cephadm@adm > radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-es \
    --access-key=SYSTEM-KEY --secret=SECRET --endpoints=http://rgw-es:80
  2. Configure a synchronization module for this zone with the following command:

    cephadm@adm > radosgw-admin zone modify --rgw-zone=ZONE-NAME --tier-type=TIER-TYPE \
    --tier-config={set of key=value pairs}
  3. For example, for the ElasticSearch synchronization module:

    cephadm@adm > radosgw-admin zone modify --rgw-zone=ZONE-NAME --tier-type=elasticsearch \
    --tier-config=endpoint=http://localhost:9200,num_shards=10,num_replicas=1

    For the various supported tier-config options, refer to Section 26.8.3, “ElasticSearch Synchronization Module”.

  4. Finally, update the period:

    cephadm@adm > radosgw-admin period update --commit
  5. Start the Object Gateway in the zone:

    root # systemctl start ceph-radosgw@rgw.`hostname -s`
    root # systemctl enable ceph-radosgw@rgw.`hostname -s`

26.8.3 ElasticSearch Synchronization Module

This synchronization module writes the metadata from other zones to ElasticSearch. As of Luminous, the following JSON shows the data fields that are stored in ElasticSearch:

{
  "_index" : "rgw-gold-ee5863d6",
  "_type" : "object",
  "_id" : "34137443-8592-48d9-8ca7-160255d52ade.34137.1:object1:null",
  "_score" : 1.0,
  "_source" : {
    "bucket" : "testbucket123",
    "name" : "object1",
    "instance" : "null",
    "versioned_epoch" : 0,
    "owner" : {
      "id" : "user1",
      "display_name" : "user1"
    },
    "permissions" : [
      "user1"
    ],
    "meta" : {
      "size" : 712354,
      "mtime" : "2017-05-04T12:54:16.462Z",
      "etag" : "7ac66c0f148de9519b8bd264312c4d64"
    }
  }
}

26.8.3.1 ElasticSearch Tier Type Configuration Parameters

endpoint

Specifies the ElasticSearch server endpoint to access.

num_shards

(integer) The number of shards that ElasticSearch will be configured with on data synchronization initialization. Note that this cannot be changed after initialization. Any change here requires rebuilding the ElasticSearch index and reinitializing the data synchronization process.

num_replicas

(integer) The number of replicas that ElasticSearch will be configured with on data synchronization initialization.

explicit_custom_meta

(true | false) Specifies whether all user custom metadata will be indexed, or whether the user needs to configure (at the bucket level) which custom metadata entries should be indexed. This is false by default.

index_buckets_list

(comma separated list of strings) If empty, all buckets will be indexed. Otherwise, only buckets specified here will be indexed. It is possible to provide bucket prefixes (for example 'foo*'), or bucket suffixes (for example '*bar').

approved_owners_list

(comma separated list of strings) If empty, buckets of all owners will be indexed (subject to other restrictions). Otherwise, only buckets owned by the specified owners will be indexed. Suffixes and prefixes can also be provided.

override_index_path

(string) If not empty, this string will be used as the ElasticSearch index path. Otherwise, the index path will be determined and generated on synchronization initialization.

username

Specifies a user name for ElasticSearch if authentication is required.

password

Specifies a password for ElasticSearch if authentication is required.
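
As an illustration only, the parameters above can be assembled into the comma-separated key=value string that --tier-config expects. The following Python sketch is not part of the product. The helper name build_tier_config is hypothetical, and the check against the parameter names listed in this section is an assumption made for the example. List-valued options are deliberately left out, because their quoting rules are not covered here.

```python
# Parameter names taken from the list in this section.
KNOWN_KEYS = {
    "endpoint", "num_shards", "num_replicas", "explicit_custom_meta",
    "index_buckets_list", "approved_owners_list", "override_index_path",
    "username", "password",
}


def build_tier_config(options):
    """Return a 'key=value,key=value' string (hypothetical helper)."""
    unknown = set(options) - KNOWN_KEYS
    if unknown:
        raise ValueError("unknown tier-config keys: " + ", ".join(sorted(unknown)))
    parts = []
    for key, value in options.items():
        # Render Python booleans as the lowercase true/false used in this section.
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return ",".join(parts)


tier_config = build_tier_config({
    "endpoint": "http://localhost:9200",
    "num_shards": 10,
    "num_replicas": 1,
})
print(tier_config)
```

The resulting string can then be passed to radosgw-admin zone modify as the value of --tier-config.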

26.8.3.2 Metadata Queries

Since the ElasticSearch cluster now stores object metadata, it is important that the ElasticSearch endpoint is not exposed to the public and is accessible only to the cluster administrators. Exposing metadata queries to end users is problematic, because each user should be able to query only their own metadata and not that of other users. This would require the ElasticSearch cluster to authenticate users in a way similar to RGW.

As of Ceph Luminous, RGW in the metadata master zone can service end user requests. This avoids exposing the ElasticSearch endpoint publicly and also solves the authentication and authorization problem, because RGW itself can authenticate the end user requests. For this purpose, RGW introduces a new query in the bucket APIs that can service ElasticSearch requests. All these requests must be sent to the metadata master zone.

Get an ElasticSearch Query
GET /BUCKET?query=QUERY-EXPR

request params:

  • max-keys: max number of entries to return

  • marker: pagination marker

expression := [(]<arg> <op> <value> [)][<and|or> ...]

op is one of the following: <, <=, ==, >=, >

For example:

GET /?query=name==foo

This returns all indexed keys that the user has read permission to and that are named 'foo'. The output is a list of keys in XML, similar to the S3 list buckets response.
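
To make the request syntax more concrete, the following Python sketch builds the request path for a metadata query from a single condition. It is a minimal illustration, not an official client: the function names condition and metadata_query_path are hypothetical, percent-encoding the expression is an assumption about what your HTTP client needs, and the request must still be signed and sent like any other S3 request to the metadata master zone.

```python
from urllib.parse import quote

# Operators listed in this section.
ALLOWED_OPS = ("<", "<=", "==", ">=", ">")


def condition(arg, op, value):
    """Format one '<arg> <op> <value>' condition (hypothetical helper)."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return f"{arg}{op}{value}"


def metadata_query_path(expression, bucket="", max_keys=None, marker=None):
    """Build the path and query string of a metadata query request."""
    path = f"/{bucket}?query={quote(expression, safe='')}"
    if max_keys is not None:
        path += f"&max-keys={max_keys}"
    if marker is not None:
        path += f"&marker={quote(marker, safe='')}"
    return path


# 'mybooks' is an example bucket name.
query_path = metadata_query_path(condition("name", "==", "foo"),
                                 bucket="mybooks", max_keys=100)
print("GET " + query_path)
```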

Configure custom metadata fields

Define which custom metadata entries should be indexed (under the specified bucket) and what the types of these keys are. If explicit custom metadata indexing is configured, this is needed so that RGW indexes the specified custom metadata values. Otherwise, it is needed in cases where the indexed metadata keys are of a type other than string.

POST /BUCKET?mdsearch
x-amz-meta-search: <key [; type]> [, ...]

Multiple metadata fields must be comma-separated. A type can be forced for a field with a `;`. The currently allowed types are string (default), integer, and date. For example, to index the custom object metadata x-amz-meta-year as int, x-amz-meta-release-date as date, and x-amz-meta-title as string, send the following request:

POST /mybooks?mdsearch
x-amz-meta-search: x-amz-meta-year;int, x-amz-meta-release-date;date, x-amz-meta-title;string
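
The header value used above can also be generated programmatically. The following Python sketch only formats the string. The helper name mdsearch_header is hypothetical, and the sketch does not validate the type names.

```python
def mdsearch_header(fields):
    """Format the x-amz-meta-search value from (key, type) pairs.

    A type of None leaves the field untyped, so the default applies.
    """
    entries = []
    for key, type_name in fields:
        entries.append(key if type_name is None else f"{key};{type_name}")
    return ", ".join(entries)


header_value = mdsearch_header([
    ("x-amz-meta-year", "int"),
    ("x-amz-meta-release-date", "date"),
    ("x-amz-meta-title", "string"),
])
print("x-amz-meta-search: " + header_value)
```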
Delete custom metadata configuration

Delete custom metadata bucket configuration.

DELETE /BUCKET?mdsearch
Get custom metadata configuration

Retrieve custom metadata bucket configuration.

GET /BUCKET?mdsearch

26.8.4 Cloud Synchronization Module

This section introduces a module that synchronizes the zone data to a remote cloud service. The synchronization is unidirectional only: the data is not synchronized back from the remote zone. The main goal of this module is to enable synchronizing data to multiple cloud service providers. Currently it supports cloud providers that are compatible with AWS (S3).

To synchronize data to a remote cloud service, you need to configure user credentials. Because many cloud services limit the number of buckets that each user can create, you can configure the mapping of source objects and buckets, and assign different targets to different buckets and bucket prefixes. Note that source access lists (ACLs) will not be preserved. However, it is possible to map permissions of specific source users to specific destination users.

Because of API limitations, there is no way to preserve original object modification time and HTTP entity tag (ETag). The cloud synchronization module stores these as metadata attributes on the destination objects.

26.8.4.1 General Configuration

Following are examples of a trivial and non-trivial configuration for the cloud synchronization module. Note that the trivial configuration can collide with the non-trivial one.

Example 26.1: Trivial Configuration
{
  "connection": {
    "access_key": ACCESS,
    "secret": SECRET,
    "endpoint": ENDPOINT,
    "host_style": path | virtual,
  },
  "acls": [ { "type": id | email | uri,
    "source_id": SOURCE_ID,
    "dest_id": DEST_ID } ... ],
  "target_path": TARGET_PATH,
}
Example 26.2: Non-trivial Configuration
{
  "default": {
    "connection": {
      "access_key": ACCESS,
      "secret": SECRET,
      "endpoint": ENDPOINT,
      "host_style": path | virtual,
    },
    "acls": [
    {
      "type": id | email | uri,   #  optional, default is id
      "source_id": ID,
      "dest_id": ID
    } ... ],
    "target_path": PATH # optional
  },
  "connections": [
  {
    "connection_id": ID,
    "access_key": ACCESS,
    "secret": SECRET,
    "endpoint": ENDPOINT,
    "host_style": path | virtual,  # optional
  } ... ],
  "acl_profiles": [
  {
    "acls_id": ID, # acl mappings
    "acls": [ {
      "type": id | email | uri,
      "source_id": ID,
      "dest_id": ID
    } ... ]
  }
  ],
  "profiles": [
  {
   "source_bucket": SOURCE,
   "connection_id": CONNECTION_ID,
   "acls_id": MAPPINGS_ID,
   "target_path": DEST,          # optional
  } ... ],
}

Explanation of used configuration terms follows:

connection

Represents a connection to the remote cloud service. Contains 'connection_id', 'access_key', 'secret', 'endpoint', and 'host_style'.

access_key

The remote cloud access key that will be used for the specific connection.

secret

The secret key for the remote cloud service.

endpoint

URL of remote cloud service endpoint.

host_style

Type of host style ('path' or 'virtual') to be used when accessing remote cloud endpoint. Default is 'path'.

acls

Array of access list mappings.

acl_mapping

Each 'acl_mapping' structure contains 'type', 'source_id', and 'dest_id'. These define the ACL mutation for each object. An ACL mutation allows converting a source user ID to a destination ID.

type

ACL type: 'id' defines user ID, 'email' defines user by e-mail, and 'uri' defines user by URI (group).

source_id

ID of user in the source zone.

dest_id

ID of user in the destination.

target_path

A string that defines how the target path is created. The target path specifies a prefix to which the source object name is appended. The target path can include any of the following variables:

SID

A unique string that represents the synchronization instance ID.

ZONEGROUP

Zonegroup name.

ZONEGROUP_ID

Zonegroup ID.

ZONE

Zone name.

ZONE_ID

Zone ID.

BUCKET

Source bucket name.

OWNER

Source bucket owner ID.

For example: target_path = rgwx-ZONE-SID/OWNER/BUCKET

acl_profiles

An array of access list profiles.

acl_profile

Each profile contains 'acls_id' that represents the profile, and an 'acls' array that holds a list of 'acl_mappings'.

profiles

A list of profiles. Each profile contains the following:

source_bucket

Either a bucket name, or a bucket prefix (if it ends with *) that defines the source bucket(s) for this profile.

target_path

See above for the explanation.

connection_id

ID of the connection that will be used for this profile.

acls_id

ID of the ACL profile that will be used for this profile.
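
The following Python sketch illustrates how the 'profiles' and 'target_path' terms fit together: it picks a profile for a source bucket and expands the variables in its target path. It is a simplified model, not the module's implementation. The function names, the sample values, and the rule that the first matching profile wins are assumptions, and the sketch simply replaces the uppercase variable names as they are written in the example above.

```python
import re

# Variable names from the target_path description; longer names come first.
VARIABLES = re.compile(
    r"\b(ZONEGROUP_ID|ZONEGROUP|ZONE_ID|ZONE|SID|BUCKET|OWNER)\b"
)


def expand_target_path(template, values):
    """Replace each variable name in the template with its value."""
    return VARIABLES.sub(lambda match: values[match.group(1)], template)


def bucket_matches(source_bucket, bucket):
    """A trailing '*' makes source_bucket a prefix, otherwise a full name."""
    if source_bucket.endswith("*"):
        return bucket.startswith(source_bucket[:-1])
    return bucket == source_bucket


def select_profile(profiles, bucket):
    """Return the first matching profile (assumed rule), or None."""
    for profile in profiles:
        if bucket_matches(profile["source_bucket"], bucket):
            return profile
    return None


# Hypothetical sample data.
profiles = [
    {"source_bucket": "logs-*", "connection_id": "conn-a",
     "acls_id": "acls-a", "target_path": "rgwx-ZONE-SID/OWNER/BUCKET"},
    {"source_bucket": "photos", "connection_id": "conn-b",
     "acls_id": "acls-b"},
]
values = {"ZONE": "us-east", "SID": "1", "OWNER": "user1",
          "BUCKET": "logs-2020"}

profile = select_profile(profiles, "logs-2020")
target = expand_target_path(profile["target_path"], values)
print(profile["connection_id"], target)
```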

26.8.4.2 S3 Specific Configurables

The cloud synchronization module will only work with back-ends that are compatible with AWS S3. There are a few configurables that can be used to tweak its behavior when accessing S3 cloud services:

{
  "multipart_sync_threshold": OBJECT_SIZE,
  "multipart_min_part_size": PART_SIZE
}
multipart_sync_threshold

Objects whose size is equal to or larger than this value will be synchronized with the cloud service using multipart upload.

multipart_min_part_size

Minimum part size to use when synchronizing objects using multipart upload.
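
As a rough illustration of how the two values interact, the following Python sketch decides whether an object of a given size would be synchronized with multipart upload and estimates an upper bound for the number of parts. The function name and the sample sizes are assumptions for the example and do not reflect the module's default values. The actual part size chosen by the module may be larger than the minimum, which results in fewer parts.

```python
MIB = 1024 * 1024


def sync_plan(object_size, multipart_sync_threshold, multipart_min_part_size):
    """Describe how an object of object_size bytes would be uploaded."""
    if object_size < multipart_sync_threshold:
        return {"multipart": False, "max_parts": 1}
    # Integer ceiling division: part count if every part has the minimum size.
    max_parts = (object_size + multipart_min_part_size - 1) // multipart_min_part_size
    return {"multipart": True, "max_parts": max_parts}


# Hypothetical sample values.
small = sync_plan(10 * MIB, 32 * MIB, 32 * MIB)
large = sync_plan(100 * MIB, 32 * MIB, 32 * MIB)
print(small)
print(large)
```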

26.8.5 Archive Synchronization Module

The archive sync module utilizes the versioning feature of S3 objects in Object Gateway. You can configure an archive zone that captures the different versions of S3 objects as they occur over time in other zones. The history of versions that the archive zone keeps can only be eliminated via gateways associated with the archive zone.

With such an architecture, several non-versioned zones can mirror their data and metadata via their zone gateways providing high availability to the end users, while the archive zone captures all the data updates to consolidate them as versions of S3 objects.

By including the archive zone in a multi-zone configuration, you gain the flexibility of an S3 object history in one zone while saving the space that the replicas of the versioned S3 objects would consume in the remaining zones.

26.8.5.1 Configuration

Tip: More Information

Refer to Section 26.13, “Multisite Object Gateways” for details on configuring multisite gateways.

Refer to Section 26.8, “Synchronization Modules” for details on configuring synchronization modules.

To use the archive module, you need to create a new zone whose tier type is set to archive:

cephadm@adm > radosgw-admin zone create --rgw-zonegroup=ZONE_GROUP_NAME \
 --rgw-zone=OGW_ZONE_NAME \
 --endpoints=http://OGW_ENDPOINT1_URL[,http://OGW_ENDPOINT2_URL,...] \
 --tier-type=archive

26.9 LDAP Authentication

Apart from the default local user authentication, Object Gateway can use LDAP server services to authenticate users as well.

26.9.1 Authentication Mechanism

The Object Gateway extracts the user's LDAP credentials from a token. A search filter is constructed from the user name. The Object Gateway use