Administration Guide
- About This Guide
- I Cluster Management
- 1 Salt Cluster Administration
- 1.1 Adding New Cluster Nodes
- 1.2 Adding New Roles to Nodes
- 1.3 Removing and Reinstalling Cluster Nodes
- 1.4 Redeploying Monitor Nodes
- 1.5 Adding an OSD Disk to a Node
- 1.6 Removing an OSD
- 1.7 Replacing an OSD Disk
- 1.8 Recovering a Reinstalled OSD Node
- 1.9 Automated Installation via Salt
- 1.10 Updating the Cluster Nodes
- 1.11 Halting or Rebooting Cluster
- 1.12 Adjusting ceph.conf with Custom Settings
- 1.13 Enabling AppArmor Profiles
- II Operating a Cluster
- III Accessing Cluster Data
- 13 Ceph Object Gateway
- 13.1 Object Gateway Restrictions and Naming Limitations
- 13.2 Deploying the Object Gateway
- 13.3 Operating the Object Gateway Service
- 13.4 Configuration Parameters
- 13.5 Managing Object Gateway Access
- 13.6 Enabling HTTPS/SSL for Object Gateways
- 13.7 Sync Modules
- 13.8 LDAP Authentication
- 13.9 Bucket Index Sharding
- 13.10 Integrating OpenStack Keystone
- 13.11 Multisite Object Gateways
- 13.12 Load Balancing the Object Gateway Servers with HAProxy
- 14 Ceph iSCSI Gateway
- 15 Clustered File System
- 16 NFS Ganesha: Export Ceph Data via NFS
- IV Managing Cluster with GUI Tools
- V Integration with Virtualization Tools
- VI FAQs, Tips and Troubleshooting
- 20 Hints and Tips
- 20.1 Identifying Orphaned Partitions
- 20.2 Adjusting Scrubbing
- 20.3 Stopping OSDs without Rebalancing
- 20.4 Time Synchronization of Nodes
- 20.5 Checking for Unbalanced Data Writing
- 20.6 Btrfs Sub-volume for /var/lib/ceph
- 20.7 Increasing File Descriptors
- 20.8 How to Use Existing Partitions for OSDs Including OSD Journals
- 20.9 Integration with Virtualization Software
- 20.10 Firewall Settings for Ceph
- 20.11 Testing Network Performance
- 20.12 Replacing Storage Disk
- 21 Frequently Asked Questions
- 22 Troubleshooting
- 22.1 Reporting Software Problems
- 22.2 Sending Large Objects with rados Fails with Full OSD
- 22.3 Corrupted XFS File system
- 22.4 'Too Many PGs per OSD' Status Message
- 22.5 'nn pg stuck inactive' Status Message
- 22.6 OSD Weight is 0
- 22.7 OSD is Down
- 22.8 Finding Slow OSDs
- 22.9 Fixing Clock Skew Warnings
- 22.10 Poor Cluster Performance Caused by Network Problems
- 22.11 /var Running Out of Space
- 22.12 Too Many PGs Per OSD
- Glossary
- A DeepSea Stage 1 Custom Example
- B Default Alerts for SUSE Enterprise Storage
- C Example Procedure of Manual Ceph Installation
- D Documentation Updates
Copyright © 2022 SUSE LLC
Copyright © 2016, RedHat, Inc, and contributors.
The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. All other trademarks are the property of their respective owners.
For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.
About This Guide
SUSE Enterprise Storage 5.5 is an extension to SUSE Linux Enterprise Server 12 SP3. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage 5.5 provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.
This guide helps you understand the concepts of SUSE Enterprise Storage 5.5, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.
Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.
For an overview of the documentation available for your product and the latest documentation updates, refer to https://documentation.suse.com.
1 Available Documentation
The following manuals are available for this product:
- Book “Administration Guide”
The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.
- Book “Deployment Guide”
Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.
HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Find the latest documentation updates at https://documentation.suse.com where you can download the manuals for your product in multiple formats.
2 Feedback
Several feedback channels are available:
- Bugs and Enhancement Requests
For services and support options available for your product, refer to http://www.suse.com/support/.
To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select › .
- User Comments
We want to hear your comments and suggestions for this manual and the other documentation included with this product. If you have questions, suggestions, or corrections, contact doc-team@suse.com, or click the Report Documentation Bug link beside each chapter or section heading.
For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).
3 Documentation Conventions
The following typographical conventions are used in this manual:
- /etc/passwd: directory names and file names
- placeholder: replace placeholder with the actual value
- PATH: the environment variable PATH
- ls, --help: commands, options, and parameters
- user: users or groups
- Alt, Alt–F1: a key to press or a key combination; keys are shown in uppercase as on a keyboard
- Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.
4 About the Making of This Manual
This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.
5 Ceph Contributors
The Ceph project and its documentation are the result of the work of hundreds of contributors and organizations. See https://ceph.com/contributors/ for more details.
Part I Cluster Management
1 Salt Cluster Administration
After you deploy a Ceph cluster, you will occasionally need to modify it, for example by adding or removing nodes, disks, or services. This chapter describes how to perform these administration tasks.
1.1 Adding New Cluster Nodes
The procedure of adding new nodes to the cluster is almost identical to the initial cluster node deployment described in Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”:
Tip: Prevent Rebalancing
When adding an OSD to the existing cluster, bear in mind that the cluster will be rebalancing for some time afterward. To minimize the rebalancing periods, add all OSDs you intend to add at the same time.
An additional way is to set the osd crush initial weight = 0 option in the ceph.conf file before adding the OSDs:
1. Add osd crush initial weight = 0 to /srv/salt/ceph/configuration/files/ceph.conf.d/global.conf.
2. Create the new configuration:
   root@master # salt MASTER state.apply ceph.configuration.create
   Or:
   root@master # salt-call state.apply ceph.configuration.create
3. Apply the new configuration:
   root@master # salt TARGET state.apply ceph.configuration
   Note
   If this is not a new node, but you want to proceed as if it were, ensure you remove the /etc/ceph/destroyedOSDs.yml file from the node. Otherwise, any devices from the first attempt will be restored with their previous OSD ID and reweight.
4. Run the following commands:
   root@master # salt-run state.orch ceph.stage.1
   root@master # salt-run state.orch ceph.stage.2
   root@master # salt 'node*' state.apply ceph.osd
5. After the new OSDs are added, adjust their weights as required with the ceph osd reweight command in small increments. This allows the cluster to rebalance and become healthy between increments, so that the rebalancing does not overwhelm the cluster and the clients accessing it.
1. Install SUSE Linux Enterprise Server 12 SP3 on the new node and configure its network settings so that it resolves the Salt master host name correctly. Verify that it has a proper connection to both public and cluster networks, and that time synchronization is correctly configured. Then install the salt-minion package:
   root@minion > zypper in salt-minion
   If the Salt master's host name is different from salt, edit /etc/salt/minion and add the following:
   master: DNS_name_of_your_salt_master
   If you performed any changes to the configuration files mentioned above, restart the salt-minion service:
   root@minion > systemctl restart salt-minion.service
2. On the Salt master, accept the Salt key of the new node:
   root@master # salt-key --accept NEW_NODE_KEY
3. Verify that /srv/pillar/ceph/deepsea_minions.sls targets the new Salt minion and/or set the proper DeepSea grain. For more details, refer to Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.2.2.1 “Matching the Minion Name” and to the Running Deployment Stages procedure in Section 4.3 “Cluster Deployment” of the same chapter.
4. Run the preparation stage. It synchronizes modules and grains so that the new minion can provide all the information DeepSea expects.
   root@master # salt-run state.orch ceph.stage.0
   Important: Possible Restart of DeepSea Stage 0
   If the Salt master rebooted after its kernel update, you need to restart DeepSea Stage 0.
5. Run the discovery stage. It will write new file entries in the /srv/pillar/ceph/proposals directory, where you can edit relevant .yml files:
   root@master # salt-run state.orch ceph.stage.1
6. Optionally, change /srv/pillar/ceph/proposals/policy.cfg if the newly added host does not match the existing naming scheme. For details, refer to Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.5.1 “The policy.cfg File”.
7. Run the configuration stage. It reads everything under /srv/pillar/ceph and updates the pillar accordingly:
   root@master # salt-run state.orch ceph.stage.2
   Pillar stores data which you can access with the following command:
   root@master # salt target pillar.items
8. The configuration and deployment stages include newly added nodes:
   root@master # salt-run state.orch ceph.stage.3
   root@master # salt-run state.orch ceph.stage.4
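The minion configuration change from the first step of this procedure can be scripted. The following is only a minimal sketch: the helper name set_salt_master and the host name salt-master.example.com are hypothetical, and the demonstration writes to a scratch file instead of /etc/salt/minion.

```shell
# Hypothetical helper: make sure a minion configuration file contains
# exactly one "master:" entry pointing to the given Salt master.
# Usage: set_salt_master FILE MASTER_HOST_NAME
set_salt_master() {
    conf="$1"
    master="$2"
    if grep -q '^master:' "$conf" 2>/dev/null; then
        # An entry exists already: rewrite it and keep a backup in FILE.bak.
        sed -i.bak "s|^master:.*|master: ${master}|" "$conf"
    else
        # No entry yet: append one.
        printf 'master: %s\n' "$master" >> "$conf"
    fi
}

# Demonstration on a scratch file (assumed host name).
demo_minion=$(mktemp)
set_salt_master "$demo_minion" salt-master.example.com
set_salt_master "$demo_minion" salt-master.example.com   # second call does not add a duplicate
cat "$demo_minion"
```

On a real node you would pass /etc/salt/minion as the file and restart the salt-minion service afterward, as described in the procedure.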
1.2 Adding New Roles to Nodes
You can deploy all types of supported roles with DeepSea. See Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.5.1.2 “Role Assignment” for more information on supported role types and examples of matching them.
To add a new service to an existing node, follow these steps:
1. Adapt /srv/pillar/ceph/proposals/policy.cfg to match the existing host with a new role. For more details, refer to Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.5.1 “The policy.cfg File”. For example, if you need to run an Object Gateway on a MON node, the line is similar to:
   role-rgw/xx/x/example.mon-1.sls
2. Run Stage 2 to update the pillar:
   root@master # salt-run state.orch ceph.stage.2
3. Run Stage 3 to deploy core services, or Stage 4 to deploy optional services. Running both stages does not hurt.
1.3 Removing and Reinstalling Cluster Nodes
Tip: Removing a Cluster Node Temporarily
The Salt master expects all minions to be present in the cluster and responsive. If a minion breaks and is no longer responsive, it causes problems for the Salt infrastructure, mainly for DeepSea and openATTIC.
Before you fix the minion, delete its key from the Salt master temporarily:
root@master # salt-key -d MINION_HOST_NAME
After the minion is fixed, add its key to the Salt master again:
root@master # salt-key -a MINION_HOST_NAME
To remove a role from a cluster, edit /srv/pillar/ceph/proposals/policy.cfg and remove the corresponding line(s). Then run Stages 2 and 5 as described in Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment”.
Note: Removing OSDs from Cluster
If you need to remove a particular OSD node from your cluster, ensure that your cluster has more free disk space than the capacity of the disks you intend to remove. Bear in mind that removing an OSD results in rebalancing of the whole cluster.
Before running Stage 5 to do the actual removal, always check which OSDs are going to be removed by DeepSea:
root@master # salt-run rescinded.ids
When a role is removed from a minion, the objective is to undo all changes related to that role. For most of the roles, the task is simple, but there may be problems with package dependencies. If a package is uninstalled, its dependencies are not.
Removed OSDs appear as blank drives. The related tasks overwrite the beginning of the file systems and remove backup partitions in addition to wiping the partition tables.
Note: Preserving Partitions Created by Other Methods
Disk drives previously configured by other methods, such as ceph-deploy, may still contain partitions. DeepSea will not automatically destroy these. The administrator must reclaim these drives manually.
Example 1.1: Removing a Salt minion from the Cluster
If your storage minions are named, for example, 'data1.ceph', 'data2.ceph' ... 'data6.ceph', and the related lines in your policy.cfg are similar to the following:
[...]
# Hardware Profile
profile-default/cluster/data*.sls
profile-default/stack/default/ceph/minions/data*.yml
[...]
Then to remove the Salt minion 'data2.ceph', change the lines to the following:
[...]
# Hardware Profile
profile-default/cluster/data[1,3-6]*.sls
profile-default/stack/default/ceph/minions/data[1,3-6]*.yml
[...]
Then run Stage 2, check which OSDs are going to be removed, and finish by running Stage 5:
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run rescinded.ids
root@master # salt-run state.orch ceph.stage.5
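To see which minions a pattern such as data[1,3-6]*.sls selects, you can try it with shell globbing first. This is only a sketch: it assumes that policy.cfg patterns behave like shell patterns for this simple case, and the function name matches_profile is hypothetical.

```shell
# Return success if a profile file name matches the pattern from the example.
# Inside the brackets the comma is a literal character, so the class accepts
# "1", ",", or any digit from 3 to 6 directly after "data".
matches_profile() {
    case "$1" in
        data[1,3-6]*.sls) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the six minions from the example.
for n in 1 2 3 4 5 6; do
    name="data$n.ceph.sls"
    if matches_profile "$name"; then
        echo "kept:    $name"
    else
        echo "dropped: $name"
    fi
done
```

Only data2.ceph.sls is reported as dropped, which corresponds to the minion that the example removes.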
Example 1.2: Migrating Nodes
Assume the following situation: during the fresh cluster installation, you (the administrator) allocated one of the storage nodes as a stand-alone Object Gateway while waiting for the gateway's hardware to arrive. Now the permanent hardware has arrived for the gateway and you can finally assign the intended role to the backup storage node and have the gateway role removed.
After running Stages 0 and 1 (see Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment”, Running Deployment Stages) for the new hardware, you named the new gateway rgw1. If the node data8 needs the Object Gateway role removed and the storage role added, and the current policy.cfg looks like this:
# Hardware Profile
profile-default/cluster/data[1-7]*.sls
profile-default/stack/default/ceph/minions/data[1-7]*.sls
# Roles
role-rgw/cluster/data8*.sls
Then change it to:
# Hardware Profile
profile-default/cluster/data[1-8]*.sls
profile-default/stack/default/ceph/minions/data[1-8]*.sls
# Roles
role-rgw/cluster/rgw1*.sls
Run Stages 2 to 4, check which OSDs are possibly going to be removed, and finish by running Stage 5. Stage 3 will add data8 as a storage node. For a moment, data8 will have both roles. Stage 4 will add the Object Gateway role to rgw1 and Stage 5 will remove the Object Gateway role from data8:
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
root@master # salt-run state.orch ceph.stage.4
root@master # salt-run rescinded.ids
root@master # salt-run state.orch ceph.stage.5
1.4 Redeploying Monitor Nodes
When one or more of your monitor nodes fail and are not responding, you need to remove the failed monitors from the cluster and possibly re-add them to the cluster afterward.
Important: The Minimum is Three Monitor Nodes
The number of monitor nodes must not be less than three. If a monitor node fails, and as a result your cluster has only two monitor nodes, you need to temporarily assign the monitor role to other cluster nodes before you redeploy the failed monitor nodes. After you redeploy the failed monitor nodes, you can uninstall the temporary monitor roles.
For more information on adding new nodes/roles to the Ceph cluster, see Section 1.1, “Adding New Cluster Nodes” and Section 1.2, “Adding New Roles to Nodes”.
For more information on removing cluster nodes, refer to Section 1.3, “Removing and Reinstalling Cluster Nodes”.
There are two basic degrees of a Ceph node failure:
- The Salt minion host is broken either physically or on the OS level, and does not respond to the salt 'minion_name' test.ping call. In such case you need to redeploy the server completely by following the relevant instructions in Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment”.
- The monitor related services failed and refuse to recover, but the host responds to the salt 'minion_name' test.ping call. In such case, follow these steps:
1. Edit /srv/pillar/ceph/proposals/policy.cfg on the Salt master, and remove or update the lines that correspond to the failed monitor nodes so that they now point to the working monitor nodes. For example:
   [...]
   # MON
   #role-mon/cluster/ses-example-failed1.sls
   #role-mon/cluster/ses-example-failed2.sls
   role-mon/cluster/ses-example-new1.sls
   role-mon/cluster/ses-example-new2.sls
   [...]
2. Run DeepSea Stages 2 to 5 to apply the changes:
   root@master # deepsea stage run ceph.stage.2
   root@master # deepsea stage run ceph.stage.3
   root@master # deepsea stage run ceph.stage.4
   root@master # deepsea stage run ceph.stage.5
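Commenting out the role line of a failed monitor, as in the policy.cfg example above, can be sketched with sed. The helper name disable_mon_role is hypothetical, and the demonstration edits a scratch file rather than the real policy.cfg. Adding the lines for the replacement monitors remains a manual step.

```shell
# Hypothetical helper: comment out the role-mon line of one failed monitor.
# Only an uncommented line that matches the minion name is changed.
# Dots in the minion name are treated as "any character" by sed.
# Usage: disable_mon_role FILE MINION_NAME
disable_mon_role() {
    cfg="$1"
    minion="$2"
    # A backup of the unmodified file is kept in FILE.bak.
    sed -i.bak "s|^role-mon/cluster/${minion}\.sls\$|#&|" "$cfg"
}

# Demonstration on a scratch file with the names used in the example.
demo_cfg=$(mktemp)
printf '%s\n' \
    '# MON' \
    'role-mon/cluster/ses-example-failed1.sls' \
    'role-mon/cluster/ses-example-new1.sls' > "$demo_cfg"
disable_mon_role "$demo_cfg" ses-example-failed1
cat "$demo_cfg"
```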
1.5 Adding an OSD Disk to a Node
To add a disk to an existing OSD node, verify that any partition on the disk was removed and wiped. Refer to Step 12 in Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment” for more details. After the disk is empty, add the disk to the YAML file of the node. The path to the file is /srv/pillar/ceph/proposals/profile-default/stack/default/ceph/minions/node_name.yml.
After saving the file, run DeepSea Stages 2 and 3:
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
Tip: Updated Profiles Automatically
Instead of manually editing the YAML file, DeepSea can create new profiles. To let DeepSea create new profiles, the existing profiles need to be moved:
root@master # old /srv/pillar/ceph/proposals/profile-default/
root@master # salt-run state.orch ceph.stage.1
root@master # salt-run state.orch ceph.stage.2
root@master # salt-run state.orch ceph.stage.3
We recommend verifying the suggested proposals before deploying the changes. Refer to Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.5.1.4 “Profile Assignment” for more details on viewing proposals.
1.6 Removing an OSD
You can remove a Ceph OSD from the cluster by running the following commands:
root@master # salt-run disengage.safety
root@master # salt-run remove.osd OSD_ID
OSD_ID needs to be the number of the OSD without the osd. prefix. For example, from osd.3 only use the digit 3.
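If you copy OSD names from other output, the prefix can be stripped with shell parameter expansion before the value is passed to remove.osd. A minimal sketch, in which the function name osd_number is hypothetical:

```shell
# Print the numeric ID for an OSD given as "osd.N" or "N".
# Anything that is not purely numeric after stripping the prefix is rejected.
osd_number() {
    id="${1#osd.}"          # drop a leading "osd." if present
    case "$id" in
        ''|*[!0-9]*)
            echo "not a numeric OSD ID: $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$id"
}

osd_number osd.3
```

The result could then be used, for example, as salt-run remove.osd "$(osd_number osd.3)".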
1.6.1 Removing Multiple OSDs
Use the same procedure as mentioned in Section 1.6, “Removing an OSD” but simply supply multiple OSD IDs:
root@master # salt-run disengage.safety
safety is now disabled for cluster ceph
root@master # salt-run remove.osd 1 13 20
Removing osds 1, 13, 20 from minions
Press Ctrl-C to abort
Removing osd 1 from minion data4.ceph
Removing osd 13 from minion data4.ceph
Removing osd 20 from minion data4.ceph
Removing osd 1 from Ceph
Removing osd 13 from Ceph
Removing osd 20 from Ceph
Important: Removed OSD ID Still Present in grains
After the remove.osd command finishes, the ID of the removed OSD is still part of Salt grains and you can see it after running salt target osd.list. The reason is that if the remove.osd command partially fails on removing the data disk, the only reference to related partitions on the shared devices is in the grains. If we updated the grains immediately, then those partitions would be orphaned.
To update the grains manually, run salt target osd.retain. It is part of DeepSea Stage 3, therefore if you are going to run Stage 3 after the OSD removal, the grains get updated automatically.
Tip: Automatic Retries
You can append the timeout parameter (in seconds) after which Salt retries the OSD removal:
root@master # salt-run remove.osd 20 timeout=6
Removing osd 20 from minion data4.ceph
Timeout expired - OSD 20 has 22 PGs remaining
Retrying...
Removing osd 20 from Ceph
1.6.2 Removing Broken OSDs Forcefully
There are cases when removing an OSD gracefully (see Section 1.6, “Removing an OSD”) fails. This may happen, for example, if the OSD or its journal, WAL, or DB are broken, when it suffers from hanging I/O operations, or when the OSD disk fails to unmount. In such case, you need to force the OSD removal. The following command removes both the data partition, and the journal or WAL/DB partitions:
root@master # salt target osd.remove OSD_ID force=True
Tip: Hanging Mounts
If a partition is still mounted on the disk being removed, the command will exit with the 'Unmount failed - check for processes on DEVICE' message. You can then list all processes that access the file system with fuser -m DEVICE. If fuser returns nothing, try to unmount the device manually with umount DEVICE and watch the output of the dmesg or journalctl commands.
1.7 Replacing an OSD Disk
There are several reasons why you may need to replace an OSD disk, for example:
- The OSD disk failed or is soon going to fail based on SMART information, and can no longer be used to store data safely.
- You need to upgrade the OSD disk, for example to increase its size.
The replacement procedure is the same for both cases. It is also valid for both default and customized CRUSH Maps.
Warning: The Number of Free Disks
When replacing OSDs in an automated way, the number of free disks needs to be the same as the number of disks you need to replace. If there are more free disks available in the system, it is impossible to guess which of the free disks to use as replacements. Therefore the automated replacement will not be performed.
1. Turn off safety limitations temporarily:
   root@master # salt-run disengage.safety
2. Suppose that, for example, '5' is the ID of the OSD whose disk needs to be replaced. The following command marks it as destroyed in the CRUSH Map but leaves its original ID:
   root@master # salt-run replace.osd 5
   Tip: replace.osd and remove.osd
   The Salt replace.osd and remove.osd (see Section 1.6, “Removing an OSD”) commands are identical except that replace.osd leaves the OSD as 'destroyed' in the CRUSH Map while remove.osd removes all traces from the CRUSH Map.
3. Manually replace the failed/upgraded OSD drive.
4. After replacing the physical drive, you need to modify the configuration of the related Salt minion. You can do so either manually or in an automated way.
   - To manually change a Salt minion's configuration, see Section 1.7.1, “Manual Configuration”.
   - To change a Salt minion's configuration in an automated way, see Section 1.7.2, “Automated Configuration”.
5. After you finish either manual or automated configuration of the Salt minion, run DeepSea Stage 2 to update the Salt configuration. It prints out a summary about the differences between the storage configuration and the current setup:
   root@master # salt-run state.orch ceph.stage.2
   deepsea_minions : valid
   yaml_syntax : valid
   profiles_populated : valid
   public network : 172.16.21.0/24
   cluster network : 172.16.22.0/24
   These devices will be deployed
   data1.ceph: /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde, /dev/sdf, /dev/sdg
   Tip: Run salt-run advise.osds
   To summarize the steps that will be taken when the actual replacement is deployed, you can run the following command:
   root@master # salt-run advise.osds
   These devices will be deployed
   data1.ceph:
   /dev/disk/by-id/cciss-3600508b1001c7c24c537bdec8f3a698f:
   Run 'salt-run state.orch ceph.stage.3'
6. Run the deployment Stage 3 to deploy the replaced OSD disk:
   root@master # salt-run state.orch ceph.stage.3
1.7.1 Manual Configuration
1. Find the renamed YAML file for the Salt minion. For example, the file for the minion named 'data1.ceph' is
   /srv/pillar/ceph/proposals/profile-PROFILE_NAME/stack/default/ceph/minions/data1.ceph.yml-replace
2. Rename the file to its original name (without the -replace suffix), edit it, and replace the old device with the new device name.
   Tip: salt osd.report
   Consider using salt 'MINION_NAME' osd.report to identify the device that has been removed.
   For example, if the data1.ceph.yml file contains
   ceph:
     storage:
       osds:
         [...]
         /dev/disk/by-id/cciss-3600508b1001c93595b70bd0fb700ad38:
           format: bluestore
         [...]
   replace the corresponding device path with
   ceph:
     storage:
       osds:
         [...]
         /dev/disk/by-id/cciss-3600508b1001c7c24c537bdec8f3a698f:
           format: bluestore
           replace: True
         [...]
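The rename in this procedure can be sketched as a small shell helper. The function name restore_profile_name is hypothetical, and the demonstration uses a scratch directory instead of the proposals directory. Editing the device path and adding replace: True still has to be done by hand afterward.

```shell
# Hypothetical helper: rename FILE.yml-replace back to FILE.yml and print
# the resulting path. Files without the expected suffix are left untouched.
restore_profile_name() {
    src="$1"
    case "$src" in
        *.yml-replace)
            # "${src%-replace}" removes the trailing "-replace" suffix.
            mv -- "$src" "${src%-replace}" && printf '%s\n' "${src%-replace}"
            ;;
        *)
            echo "unexpected file name: $src" >&2
            return 1
            ;;
    esac
}

# Demonstration in a scratch directory with the file name from the example.
demo_dir=$(mktemp -d)
touch "$demo_dir/data1.ceph.yml-replace"
restore_profile_name "$demo_dir/data1.ceph.yml-replace"
```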
1.7.2 Automated Configuration
While the default profile for Stage 1 may work for the simplest setups, this stage can be optionally customized:
1. Set the stage_discovery: CUSTOM_STAGE_NAME option in /srv/pillar/ceph/stack/global.yml.
2. Create the corresponding file /srv/salt/ceph/stage/1/CUSTOM_STAGE_NAME.sls and customize it to reflect your specific requirements for Stage 1. See Appendix A, DeepSea Stage 1 Custom Example for an example.
   Tip: Inspect init.sls
   Inspect the /srv/salt/ceph/stage/1/init.sls file to see what variables you can use in your custom Stage 1 .sls file.
3. Refresh the Pillar:
   root@master # salt '*' saltutil.pillar_refresh
4. Run Stage 1 to generate the new configuration file:
   root@master # salt-run state.orch ceph.stage.1
Tip: Custom Options
To list all available options, inspect the output of the salt-run proposal.help command.
If you customized the cluster deployment with a specific command
salt-run proposal.populate OPTION=VALUE
use the same configuration when doing the automated configuration.
1.8 Recovering a Reinstalled OSD Node
If the operating system breaks and is not recoverable on one of your OSD nodes, follow these steps to recover it and redeploy its OSD role with cluster data untouched:
1. Reinstall the base SUSE Linux Enterprise operating system on the node where the OS broke. Install the salt-minion packages on the OSD node, delete the old Salt minion key on the Salt master, and register the new Salt minion's key with the Salt master. For more information on the initial deployment, see Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment”.
2. Instead of running the whole of Stage 0, run the following parts:
   root@master # salt 'osd_node' state.apply ceph.sync
   root@master # salt 'osd_node' state.apply ceph.packages.common
   root@master # salt 'osd_node' state.apply ceph.mines
   root@master # salt 'osd_node' state.apply ceph.updates
3. Run DeepSea Stages 1 to 5:
   root@master # salt-run state.orch ceph.stage.1
   root@master # salt-run state.orch ceph.stage.2
   root@master # salt-run state.orch ceph.stage.3
   root@master # salt-run state.orch ceph.stage.4
   root@master # salt-run state.orch ceph.stage.5
4. Run DeepSea Stage 0:
   root@master # salt-run state.orch ceph.stage.0
5. Reboot the relevant OSD node. All OSD disks will be rediscovered and reused.
1.9 Automated Installation via Salt
The installation can be automated by using the Salt reactor. For virtual environments or consistent hardware environments, this configuration will allow the creation of a Ceph cluster with the specified behavior.
Warning
Salt cannot perform dependency checks based on reactor events. There is a real risk of putting your Salt master into a death spiral.
The automated installation requires the following:
A properly created /srv/pillar/ceph/proposals/policy.cfg.
Prepared custom configuration placed in the /srv/pillar/ceph/stack directory.
The default reactor configuration will only run Stages 0 and 1. This allows testing of the reactor without waiting for subsequent stages to complete.
When the first salt-minion starts, Stage 0 will begin. A lock prevents multiple instances. When all minions complete Stage 0, Stage 1 will begin.
If the test run of Stages 0 and 1 completes properly, edit the file /etc/salt/master.d/reactor.conf and replace the following line
- /srv/salt/ceph/reactor/discovery.sls
with
- /srv/salt/ceph/reactor/all_stages.sls
Verify that the line is not commented out.
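The replacement and the follow-up check can also be scripted. The following is a minimal sketch under the assumption that the reactor entry appears in the file exactly as the path shown above. The function names enable_all_stages and reactor_line_active are hypothetical.

```shell
# enable_all_stages FILE
# Replaces the discovery.sls reactor entry with all_stages.sls in FILE.
# A copy of the original file is kept as FILE.bak.
enable_all_stages() {
    conf=$1
    cp "$conf" "$conf.bak" || return 1
    sed 's|/srv/salt/ceph/reactor/discovery\.sls|/srv/salt/ceph/reactor/all_stages.sls|' \
        "$conf.bak" > "$conf"
}

# reactor_line_active FILE
# Succeeds only if FILE contains the all_stages.sls list entry and that
# entry is not commented out.
reactor_line_active() {
    grep -Eq '^[[:space:]]*-[[:space:]]*/srv/salt/ceph/reactor/all_stages\.sls' "$1"
}

# On the Salt master (not executed by this snippet):
#   enable_all_stages /etc/salt/master.d/reactor.conf
#   reactor_line_active /etc/salt/master.d/reactor.conf || echo "entry missing or commented out"
```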
1.10 Updating the Cluster Nodes #
Keep the Ceph cluster nodes up-to-date by applying rolling updates regularly.
Important: Access to Software Repositories
Before patching the cluster with the latest software packages, verify that all its nodes have access to SUSE Linux Enterprise Server repositories that match your version of SUSE Enterprise Storage. For SUSE Enterprise Storage 5.5, the following repositories are required:
root #
zypper lr -E
# | Alias | Name | Enabled | GPG Check | Refresh
---+---------+-----------------------------------+---------+-----------+--------
4 | [...] | SUSE-Enterprise-Storage-5-Pool | Yes | (r ) Yes | No
6 | [...] | SUSE-Enterprise-Storage-5-Updates | Yes | (r ) Yes | Yes
9 | [...] | SLES12-SP3-Pool | Yes | (r ) Yes | No
11 | [...] | SLES12-SP3-Updates | Yes | (r ) Yes | Yes
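The repository check can be automated on each node. The following is a minimal sketch that assumes the table layout shown above, in which the repository name is the third column separated by the | character. The function name missing_repos is hypothetical, and the list of required names is copied from the listing above.

```shell
# missing_repos
# Reads the output of "zypper lr -E" on standard input and prints every
# required repository name that does not appear in the Name column.
# Prints nothing if all required repositories are listed.
missing_repos() {
    awk -F'|' '
        BEGIN {
            need["SUSE-Enterprise-Storage-5-Pool"] = 1
            need["SUSE-Enterprise-Storage-5-Updates"] = 1
            need["SLES12-SP3-Pool"] = 1
            need["SLES12-SP3-Updates"] = 1
        }
        NF >= 3 {
            name = $3
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
            delete need[name]
        }
        END { for (n in need) print n }
    ' | LC_ALL=C sort
}

# On a cluster node (not executed by this snippet):
#   zypper lr -E | missing_repos
```

Because zypper lr -E lists only enabled repositories, a name reported by the sketch is either not configured or not enabled on that node.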
Tip: Repository Staging
If you use a staging tool—for example, SUSE Manager, Subscription Management Tool, or Repository Mirroring Tool—that serves software repositories to the cluster nodes, verify that stages for both 'Updates' repositories for SUSE Linux Enterprise Server and SUSE Enterprise Storage are created at the same point in time.
We strongly recommend using a staging tool to apply patches which have 'frozen' or 'staged' patch levels. This ensures that new nodes joining the cluster have the same patch level as the nodes already running in the cluster. This way you avoid the need to apply the latest patches to all the cluster's nodes before new nodes can join the cluster.
To update the software packages on all cluster nodes to the latest version, follow these steps:
Update the deepsea, salt-master, and salt-minion packages and restart relevant services on the Salt master:
root@master #
salt -I 'roles:master' state.apply ceph.updates.master
Update and restart the salt-minion package on all cluster nodes:
root@master #
salt -I 'cluster:ceph' state.apply ceph.updates.salt
Update all other software packages on the cluster:
root@master #
salt-run state.orch ceph.stage.0
Restart Ceph related services:
root@master #
salt-run state.orch ceph.restart
Note: Possible Downtime of Ceph Services
When applying updates to Ceph cluster nodes, Ceph services may be restarted. If there is a single point of failure for services such as Object Gateway, NFS Ganesha, or iSCSI, the client machines may be temporarily disconnected from related services.
If DeepSea detects a running Ceph cluster, it applies available updates, restarts running Ceph services, and optionally restarts nodes sequentially if a kernel update was installed. DeepSea follows Ceph's official recommendation of first updating the monitors, then the OSDs, and lastly additional services, such as Metadata Server, Object Gateway, iSCSI Gateway, or NFS Ganesha. DeepSea stops the update process if it detects an issue in the cluster. A trigger for that can be:
Ceph reports 'HEALTH_ERR' for longer than 300 seconds.
Salt minions are queried for their assigned services to be still up and running after an update. The update fails if the services are down for more than 900 seconds.
Making these arrangements ensures that even with corrupted or failing updates, the Ceph cluster is still operational.
DeepSea Stage 0 updates the system via zypper update and optionally reboots the system if the kernel is updated. If you want to eliminate the possibility of a forced reboot of potentially all nodes, either make sure that the latest kernel is installed and running before initiating DeepSea Stage 0, or disable automatic node reboots as described in Book “Deployment Guide”, Chapter 7 “Customizing the Default Configuration”, Section 7.1.5 “Updates and Reboots during Stage 0”.
Tip: zypper patch
If you prefer to update the system using the zypper patch command, edit /srv/pillar/ceph/stack/global.yml and add the following line:
update_method_init: zypper-patch
You can change the default update/reboot behavior of DeepSea Stage 0 by adding/changing the stage_prep_master and stage_prep_minion options. For more information, see Book “Deployment Guide”, Chapter 7 “Customizing the Default Configuration”, Section 7.1.5 “Updates and Reboots during Stage 0”.
1.11 Halting or Rebooting Cluster #
In some cases it may be necessary to halt or reboot the whole cluster. We recommend carefully checking for dependencies of running services. The following steps provide an outline for stopping and starting the cluster:
Tell the Ceph cluster not to mark OSDs as out:
cephadm >
ceph osd set noout
Stop daemons and nodes in the following order:
Storage clients
Gateways, for example NFS Ganesha or Object Gateway
Metadata Server
Ceph OSD
Ceph Manager
Ceph Monitor
If required, perform maintenance tasks.
Start the nodes and servers in the reverse order of the shutdown process:
Ceph Monitor
Ceph Manager
Ceph OSD
Metadata Server
Gateways, for example NFS Ganesha or Object Gateway
Storage clients
Remove the noout flag:
cephadm >
ceph osd unset noout
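Because the start order is the exact reverse of the stop order, a maintenance script only needs to keep one list. The following is a minimal sketch that prints both sequences. The function names stop_order and start_order are hypothetical, and the component names are shortened from the lists above.

```shell
# stop_order
# Prints the components in the order in which they are stopped,
# one per line.
stop_order() {
    printf '%s\n' \
        "Storage clients" \
        "Gateways" \
        "Metadata Server" \
        "Ceph OSD" \
        "Ceph Manager" \
        "Ceph Monitor"
}

# start_order
# Prints the same components in reverse order. awk is used instead of
# tac so that the sketch does not depend on GNU coreutils.
start_order() {
    stop_order | awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }'
}
```

Deriving the start sequence from the stop sequence means that the two cannot drift apart if a component is added later.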
1.12 Adjusting ceph.conf with Custom Settings #
If you need to put custom settings into the ceph.conf file, you can do so by modifying the configuration files in the /srv/salt/ceph/configuration/files/ceph.conf.d directory:
global.conf
mon.conf
mgr.conf
mds.conf
osd.conf
client.conf
rgw.conf
Note: Unique rgw.conf
The Object Gateway offers a lot of flexibility and is unique compared to the other ceph.conf sections. All other Ceph components have static headers such as [mon] or [osd]. The Object Gateway has unique headers such as [client.rgw.rgw1]. This means that the rgw.conf file needs a header entry. For examples, see /srv/salt/ceph/configuration/files/rgw.conf or /srv/salt/ceph/configuration/files/rgw-ssl.conf.
Important: Run Stage 3
After you make custom changes to the above mentioned configuration files, run Stages 3 and 4 to apply these changes to the cluster nodes:
root@master #
salt-run state.orch ceph.stage.3
root@master #
salt-run state.orch ceph.stage.4
These files are included from the /srv/salt/ceph/configuration/files/ceph.conf.j2 template file, and correspond to the different sections that the Ceph configuration file accepts. Putting a configuration snippet in the correct file enables DeepSea to place it into the correct section. You do not need to add any of the section headers.
Tip
To apply any configuration options only to specific instances of a daemon, add a header such as [osd.1]. The following configuration options will only be applied to the OSD daemon with the ID 1.
1.12.1 Overriding the Defaults #
Later statements in a section overwrite earlier ones. Therefore it is possible to override the default configuration as specified in the /srv/salt/ceph/configuration/files/ceph.conf.j2 template. For example, to turn off cephx authentication, add the following three lines to the /srv/salt/ceph/configuration/files/ceph.conf.d/global.conf file:
auth cluster required = none
auth service required = none
auth client required = none
When redefining the default values, Ceph related tools such as rados may issue warnings that specific values from the ceph.conf.j2 were redefined in global.conf. These warnings are caused by one parameter assigned twice in the resulting ceph.conf.
As a workaround for this specific case, follow these steps:
Change the current directory to /srv/salt/ceph/configuration/create:
root@master #
cd /srv/salt/ceph/configuration/create
Copy default.sls to custom.sls:
root@master #
cp default.sls custom.sls
Edit custom.sls and change ceph.conf.j2 to custom-ceph.conf.j2.
Change the current directory to /srv/salt/ceph/configuration/files:
root@master #
cd /srv/salt/ceph/configuration/files
Copy ceph.conf.j2 to custom-ceph.conf.j2:
root@master #
cp ceph.conf.j2 custom-ceph.conf.j2
Edit custom-ceph.conf.j2 and delete the following line:
{% include "ceph/configuration/files/rbd.conf" %}
Edit global.yml and add the following line:
configuration_create: custom
Refresh the pillar:
root@master #
salt target saltutil.pillar_refresh
Run Stage 3:
root@master #
salt-run state.orch ceph.stage.3
Now you should have only one entry for each value definition. To re-create the configuration, run:
root@master #
salt-run state.orch ceph.configuration.create
and then verify the contents of /srv/salt/ceph/configuration/cache/ceph.conf.
1.12.2 Including Configuration Files #
If you need to apply a lot of custom configurations, use the following include statements within the custom configuration files to make file management easier. Following is an example of the osd.conf file:
[osd.1]
{% include "ceph/configuration/files/ceph.conf.d/osd1.conf" ignore missing %}
[osd.2]
{% include "ceph/configuration/files/ceph.conf.d/osd2.conf" ignore missing %}
[osd.3]
{% include "ceph/configuration/files/ceph.conf.d/osd3.conf" ignore missing %}
[osd.4]
{% include "ceph/configuration/files/ceph.conf.d/osd4.conf" ignore missing %}
In the previous example, the osd1.conf, osd2.conf, osd3.conf, and osd4.conf files contain the configuration options specific to the related OSD.
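With many OSDs, the header and include pairs become repetitive and can be generated. The following is a minimal sketch that prints one pair per OSD ID in the same form as the osd.conf example. The function name osd_includes is hypothetical, and the osdN.conf naming follows the example rather than any DeepSea requirement.

```shell
# osd_includes ID...
# Prints an [osd.ID] header followed by the matching Jinja include
# statement for every OSD ID given on the command line.
osd_includes() {
    for id in "$@"; do
        printf '[osd.%s]\n' "$id"
        # %% in the printf format produces a literal % in the output
        printf '{%% include "ceph/configuration/files/ceph.conf.d/osd%s.conf" ignore missing %%}\n' "$id"
    done
}

# Example (not executed by this snippet); review the output before
# adding it to osd.conf:
#   osd_includes 1 2 3 4
```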
Tip: Runtime Configuration
Changes made to Ceph configuration files take effect after the related Ceph daemons restart. See Section 12.1, “Runtime Configuration” for more information on changing the Ceph runtime configuration.
1.13 Enabling AppArmor Profiles #
AppArmor is a security solution that confines programs by a specific profile. For more details, refer to https://documentation.suse.com/sles/12-SP5/single-html/SLES-security/#part-apparmor.
DeepSea provides three states for AppArmor profiles: 'enforce', 'complain', and 'disable'. To activate a particular AppArmor state, run:
salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-STATE
To put the AppArmor profiles in an 'enforce' state:
root@master #
salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-enforce
To put the AppArmor profiles in a 'complain' state:
root@master #
salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-complain
To disable the AppArmor profiles:
root@master #
salt -I "deepsea_minions:*" state.apply ceph.apparmor.default-disable
Tip: Enabling the AppArmor Service
Each of these three calls verifies if AppArmor is installed and installs it if not, and starts and enables the related systemd service. DeepSea will warn you if AppArmor was installed and started/enabled in another way and therefore runs without DeepSea profiles.
Part II Operating a Cluster #
2 Introduction #
In this part of the manual you will learn how to start or stop Ceph services, monitor a cluster's state, use and modify CRUSH Maps, or manage storage pools.
It also includes advanced topics, for example how to manage users and authentication in general, how to manage pool and RADOS device snapshots, how to set up erasure coded pools, or how to increase the cluster performance with cache tiering.
3 Operating Ceph Services #
You can operate Ceph services either using systemd, or using DeepSea.
3.1 Operating Ceph Cluster Related Services using systemd #
Use the systemctl command to operate all Ceph related services. The operation takes place on the node you are currently logged in to. You need to have root privileges to be able to operate on Ceph services.
3.1.1 Starting, Stopping, and Restarting Services using Targets #
To simplify starting, stopping, and restarting all the services of a particular type (for example all Ceph services, or all MONs, or all OSDs) on a node, Ceph provides the following systemd unit files:
cephadm >
ls /usr/lib/systemd/system/ceph*.target
ceph.target
ceph-osd.target
ceph-mon.target
ceph-mgr.target
ceph-mds.target
ceph-radosgw.target
ceph-rbd-mirror.target
To start/stop/restart all Ceph services on the node, run:
root #
systemctl start ceph.target
root #
systemctl stop ceph.target
root #
systemctl restart ceph.target
To start/stop/restart all OSDs on the node, run:
root #
systemctl start ceph-osd.target
root #
systemctl stop ceph-osd.target
root #
systemctl restart ceph-osd.target
Commands for the other targets are analogous.
3.1.2 Starting, Stopping, and Restarting Individual Services #
You can operate individual services using the following parameterized systemd unit files:
ceph-osd@.service
ceph-mon@.service
ceph-mds@.service
ceph-mgr@.service
ceph-radosgw@.service
ceph-rbd-mirror@.service
To use these commands, you first need to identify the name of the service you want to operate. See Section 3.1.3, “Identifying Individual Services” to learn more about services identification.
To start/stop/restart the osd.1 service, run:
root #
systemctl start ceph-osd@1.service
root #
systemctl stop ceph-osd@1.service
root #
systemctl restart ceph-osd@1.service
Commands for the other service types are analogous.
3.1.3 Identifying Individual Services #
You can find out the names/numbers of a particular type of service in several ways. The following commands provide results for services ceph* and lrbd*. You can run them on any node of the Ceph cluster.
To list all (even inactive) services of type ceph* and lrbd*, run:
root #
systemctl list-units --all --type=service ceph* lrbd*
To list only the inactive services, run:
root #
systemctl list-units --all --state=inactive --type=service ceph* lrbd*
You can also use salt to query services across multiple nodes:
root@master #
salt TARGET cmd.shell \
"systemctl list-units --all --type=service ceph* lrbd* | sed -e '/^$/,$ d'"
Query storage nodes only:
root@master #
salt -I 'roles:storage' cmd.shell \
'systemctl list-units --all --type=service ceph* lrbd*'
3.1.4 Service Status #
You can query systemd for the status of services. For example:
root #
systemctl status ceph-osd@1.service
root #
systemctl status ceph-mon@HOSTNAME.service
Replace HOSTNAME with the host name the daemon is running on.
If you do not know the exact name/number of the service, see Section 3.1.3, “Identifying Individual Services”.
3.2 Restarting Ceph Services using DeepSea #
After applying updates to the cluster nodes, the affected Ceph related services need to be restarted. Normally, restarts are performed automatically by DeepSea. This section describes how to restart the services manually.
Tip: Watching the Restart
The process of restarting the cluster may take some time. You can watch the events by using the Salt event bus by running:
root@master #
salt-run state.event pretty=True
Another command to monitor active jobs is
root@master #
salt-run jobs.active
3.2.1 Restarting All Services #
Warning: Interruption of Services
If Ceph related services, specifically iSCSI or NFS Ganesha, are configured as single points of access with no High Availability setup, restarting them will result in their temporary outage as viewed from the client side.
Tip: Samba not Managed by DeepSea
Because DeepSea and openATTIC do not currently support Samba deployments, you need to manage Samba related services manually. For more details, see Book “Deployment Guide”, Chapter 13 “Exporting Ceph Data via Samba”.
To restart all services on the cluster, run the following command:
root@master #
salt-run state.orch ceph.restart
All roles you have configured restart in the following order: Ceph Monitor, Ceph Manager, Ceph OSD, Metadata Server, Object Gateway, iSCSI Gateway, NFS Ganesha. To keep the downtime low and to find potential issues as early as possible, nodes are restarted sequentially. For example, only one monitoring node is restarted at a time.
The command waits for the cluster to recover if the cluster is in a degraded, unhealthy state.
3.2.2 Restarting Specific Services #
To restart a specific service on the cluster, run:
root@master #
salt-run state.orch ceph.restart.service_name
For example, to restart all Object Gateways, run:
root@master #
salt-run state.orch ceph.restart.rgw
You can use the following targets:
root@master #
salt-run state.orch ceph.restart.mon
root@master #
salt-run state.orch ceph.restart.mgr
root@master #
salt-run state.orch ceph.restart.osd
root@master #
salt-run state.orch ceph.restart.mds
root@master #
salt-run state.orch ceph.restart.rgw
root@master #
salt-run state.orch ceph.restart.igw
root@master #
salt-run state.orch ceph.restart.ganesha
3.3 Shutdown and Restart of the Whole Ceph Cluster #
Shutting down and restarting the cluster may be necessary in the case of a planned power outage. To stop all Ceph related services and restart without issue, follow the steps below.
Procedure 3.1: Shutting Down the Whole Ceph Cluster #
Shut down or disconnect any clients accessing the cluster.
To prevent CRUSH from automatically rebalancing the cluster, set the cluster to noout:
root@master #
ceph osd set noout
Disable safety measures:
root@master #
salt-run disengage.safety
Stop all Ceph services in the following order:
Stop NFS Ganesha:
root@master #
salt -C 'I@roles:ganesha and I@cluster:ceph' ceph.terminate.ganesha
Stop Object Gateways:
root@master #
salt -C 'I@roles:rgw and I@cluster:ceph' ceph.terminate.rgw
Stop Metadata Servers:
root@master #
salt -C 'I@roles:mds and I@cluster:ceph' ceph.terminate.mds
Stop iSCSI Gateways:
root@master #
salt -C 'I@roles:igw and I@cluster:ceph' ceph.terminate.igw
Stop Ceph OSDs:
root@master #
salt -C 'I@roles:storage and I@cluster:ceph' ceph.terminate.storage
Stop Ceph Managers:
root@master #
salt -C 'I@roles:mgr and I@cluster:ceph' ceph.terminate.mgr
Stop Ceph Monitors:
root@master #
salt -C 'I@roles:mon and I@cluster:ceph' ceph.terminate.mon
Power off all cluster nodes:
root@master #
salt -C 'G@deepsea:*' cmd.run "shutdown -h"
Procedure 3.2: Starting the Whole Ceph Cluster #
Power on the Admin Node.
Power on the Ceph Monitor nodes.
Power on the Ceph OSD nodes.
Unset the previously set noout flag:
root@master #
ceph osd unset noout
Power on all configured gateways.
Power on or connect cluster clients.
4 Determining Cluster State #
When you have a running cluster, you may use the ceph tool to monitor it. Determining the cluster state typically involves checking the status of Ceph OSDs, Ceph Monitors, placement groups and Metadata Servers.
Tip: Interactive Mode
To run the ceph tool in an interactive mode, type ceph at the command line with no arguments. The interactive mode is more convenient if you are going to enter more ceph commands in a row. For example:
cephadm >
ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status
4.1 Checking a Cluster's Status #
To check a cluster's status, execute the following:
cephadm >
ceph status
or
cephadm >
ceph -s
In interactive mode, type status and press Enter.
ceph> status
Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:
cluster b370a29d-9287-4ca3-ab57-3d824f65e339
health HEALTH_OK
monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e63: 2 osds: 2 up, 2 in
pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
115 GB used, 167 GB / 297 GB avail
1 active+clean+scrubbing+deep
951 active+clean
4.2 Checking Cluster Health #
After you start your cluster and before you start reading and/or writing data, check your cluster's health:
cephadm >
ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3
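For use in scripts, the leading status word of the ceph health output can be extracted and mapped to an exit status. The following is a minimal sketch. The function names health_status and health_exit_code are hypothetical, and the mapping of HEALTH_OK to 0, HEALTH_WARN to 1, and anything else to 2 is an assumption chosen for this example, not something Ceph defines.

```shell
# health_status
# Reads "ceph health" output on standard input and prints the first
# word of the first line, for example HEALTH_OK or HEALTH_WARN.
health_status() {
    awk 'NR == 1 { print $1 }'
}

# health_exit_code STATUS
# Returns 0 for HEALTH_OK, 1 for HEALTH_WARN, and 2 for any other value
# (including HEALTH_ERR and empty input).
health_exit_code() {
    case "$1" in
        HEALTH_OK)   return 0 ;;
        HEALTH_WARN) return 1 ;;
        *)           return 2 ;;
    esac
}

# On a node with access to the cluster (not executed by this snippet):
#   health_exit_code "$(ceph health | health_status)"
```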
Tip
If you specified non-default locations for your configuration or keyring, you may specify their locations:
cephadm >
ceph -c /path/to/conf -k /path/to/keyring health
The Ceph cluster returns one of the following health codes:
- OSD_DOWN
One or more OSDs are marked down. The OSD daemon may have been stopped, or peer OSDs may be unable to reach the OSD over the network. Common causes include a stopped or crashed daemon, a down host, or a network outage.
Verify the host is healthy, the daemon is started, and network is functioning. If the daemon has crashed, the daemon log file (/var/log/ceph/ceph-osd.*) may contain debugging information.
- OSD_crush type_DOWN, for example OSD_HOST_DOWN
All the OSDs within a particular CRUSH subtree are marked down, for example all OSDs on a host.
- OSD_ORPHAN
An OSD is referenced in the CRUSH map hierarchy but does not exist. The OSD can be removed from the CRUSH hierarchy with:
cephadm >
ceph osd crush rm osd.ID
- OSD_OUT_OF_ORDER_FULL
The usage thresholds for nearfull (defaults to 0.85), backfillfull (defaults to 0.90), full (defaults to 0.95), and/or failsafe_full are not ascending. In particular, we expect nearfull < backfillfull, backfillfull < full, and full < failsafe_full.
To read the current values, run:
cephadm >
ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
The thresholds can be adjusted with the following commands:
cephadm >
ceph osd set-backfillfull-ratio ratio
cephadm >
ceph osd set-nearfull-ratio ratio
cephadm >
ceph osd set-full-ratio ratio
- OSD_FULL
One or more OSDs has exceeded the full threshold and is preventing the cluster from servicing writes. Usage by pool can be checked with:
cephadm >
ceph df
The currently defined full ratio can be seen with:
cephadm >
ceph osd dump | grep full_ratio
A short-term workaround to restore write availability is to raise the full threshold by a small amount:
cephadm >
ceph osd set-full-ratio ratio
Add new storage to the cluster by deploying more OSDs, or delete existing data in order to free up space.
- OSD_BACKFILLFULL
One or more OSDs has exceeded the backfillfull threshold, which prevents data from being allowed to rebalance to this device. This is an early warning that rebalancing may not be able to complete and that the cluster is approaching full. Usage by pool can be checked with:
cephadm >
ceph df
- OSD_NEARFULL
One or more OSDs has exceeded the nearfull threshold. This is an early warning that the cluster is approaching full. Usage by pool can be checked with:
cephadm >
ceph df
- OSDMAP_FLAGS
One or more cluster flags of interest has been set. With the exception of full, these flags can be set or cleared with:
cephadm >
ceph osd set flag
cephadm >
ceph osd unset flag
These flags include:
- full
The cluster is flagged as full and cannot service writes.
- pauserd, pausewr
Paused reads or writes.
- noup
OSDs are not allowed to start.
- nodown
OSD failure reports are being ignored, such that the monitors will not mark OSDs down.
- noin
OSDs that were previously marked out will not be marked back in when they start.
- noout
Down OSDs will not automatically be marked out after the configured interval.
- nobackfill, norecover, norebalance
Recovery or data rebalancing is suspended.
- noscrub, nodeep_scrub
Scrubbing (see Section 7.5, “Scrubbing”) is disabled.
- notieragent
Cache tiering activity is suspended.
- OSD_FLAGS
One or more OSDs has a per-OSD flag of interest set. These flags include:
- noup
OSD is not allowed to start.
- nodown
Failure reports for this OSD will be ignored.
- noin
If this OSD was previously marked out automatically after a failure, it will not be marked in when it starts.
- noout
If this OSD is down, it will not be automatically marked out after the configured interval.
Per-OSD flags can be set and cleared with:
cephadm >
ceph osd add-flag osd-ID
cephadm >
ceph osd rm-flag osd-ID
- OLD_CRUSH_TUNABLES
The CRUSH Map is using very old settings and should be updated. The oldest tunables that can be used (that is the oldest client version that can connect to the cluster) without triggering this health warning is determined by the mon_crush_min_required_version configuration option.
- OLD_CRUSH_STRAW_CALC_VERSION
The CRUSH Map is using an older, non-optimal method for calculating intermediate weight values for straw buckets. The CRUSH Map should be updated to use the newer method (straw_calc_version=1).
- CACHE_POOL_NO_HIT_SET
One or more cache pools is not configured with a hit set to track usage, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Hit sets can be configured on the cache pool with:
cephadm >
ceph osd pool set poolname hit_set_type type
cephadm >
ceph osd pool set poolname hit_set_period period-in-seconds
cephadm >
ceph osd pool set poolname hit_set_count number-of-hitsets
cephadm >
ceph osd pool set poolname hit_set_fpp target-false-positive-rate
For more information on cache tiering, see Chapter 11, Cache Tiering.
- OSD_NO_SORTBITWISE
No pre-luminous v12 OSDs are running but the sortbitwise flag has not been set. You need to set the sortbitwise flag before luminous v12 or newer OSDs can start:
cephadm >
ceph osd set sortbitwise
- POOL_FULL
One or more pools has reached its quota and is no longer allowing writes. You can view pool quotas and usage with:
cephadm >
ceph df detail
You can either raise the pool quota with
cephadm >
ceph osd pool set-quota poolname max_objects num-objects
cephadm >
ceph osd pool set-quota poolname max_bytes num-bytes
or delete some existing data to reduce usage.
- PG_AVAILABILITY
Data availability is reduced, meaning that the cluster is unable to service potential read or write requests for some data in the cluster. Specifically, one or more PGs is in a state that does not allow IO requests to be serviced. Problematic PG states include peering, stale, incomplete, and the lack of active (if those conditions do not clear quickly). Detailed information about which PGs are affected is available from:
cephadm >
ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm >
ceph tell pgid query
- PG_DEGRADED
Data redundancy is reduced for some data, meaning the cluster does not have the desired number of replicas for all data (for replicated pools) or erasure code fragments (for erasure coded pools). Specifically, one or more PGs have either the degraded or undersized flag set (there are not enough instances of that placement group in the cluster), or have not had the clean flag set for some time. Detailed information about which PGs are affected is available from:
cephadm >
ceph health detail
In most cases the root cause is that one or more OSDs is currently down. The state of specific problematic PGs can be queried with:
cephadm >
ceph tell pgid query
- PG_DEGRADED_FULL
Data redundancy may be reduced or at risk for some data because of a lack of free space in the cluster. Specifically, one or more PGs has the backfill_toofull or recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold.
- PG_DAMAGED
Data scrubbing (see Section 7.5, “Scrubbing”) has discovered some problems with data consistency in the cluster. Specifically, one or more PGs has the inconsistent or snaptrim_error flag set, indicating an earlier scrub operation found a problem, or the repair flag set, meaning a repair for such an inconsistency is currently in progress.
- OSD_SCRUB_ERRORS
Recent OSD scrubs have uncovered inconsistencies.
- CACHE_POOL_NEAR_FULL
A cache tier pool is nearly full. Full in this context is determined by the target_max_bytes and target_max_objects properties on the cache pool. When the pool reaches the target threshold, write requests to the pool may block while data is flushed and evicted from the cache, a state that normally leads to very high latencies and poor performance. The cache pool target size can be adjusted with:
cephadm >
ceph osd pool set cache-pool-name target_max_bytes bytes
cephadm >
ceph osd pool set cache-pool-name target_max_objects objects
Normal cache flush and evict activity may also be throttled because of reduced availability or performance of the base tier, or overall cluster load.
Find more information about cache tiering in Chapter 11, Cache Tiering.
- TOO_FEW_PGS
The number of PGs in use is below the configurable threshold of mon_pg_warn_min_per_osd PGs per OSD. This can lead to suboptimal distribution and balance of data across the OSDs in the cluster and reduce overall performance.
See Placement Groups for details on calculating an appropriate number of placement groups for your pool.
- TOO_MANY_PGS
The number of PGs in use is above the configurable threshold of mon_pg_warn_max_per_osd PGs per OSD. This can lead to higher memory usage for OSD daemons, slower peering after cluster state changes (for example OSD restarts, additions, or removals), and higher load on the Ceph Managers and Ceph Monitors.
While the pg_num value for existing pools cannot be reduced, the pgp_num value can. This effectively collocates some PGs on the same sets of OSDs, mitigating some of the negative impacts described above. The pgp_num value can be adjusted with:
cephadm >
ceph osd pool set pool pgp_num value
- SMALLER_PGP_NUM
One or more pools has a pgp_num value less than pg_num. This is normally an indication that the PG count was increased without also increasing the placement behavior. This is normally resolved by setting pgp_num to match pg_num, triggering the data migration, with:
cephadm >
ceph osd pool set pool pgp_num pg_num_value
- MANY_OBJECTS_PER_PG
One or more pools have an average number of objects per PG that is significantly higher than the overall cluster average. The specific threshold is controlled by the
mon_pg_warn_max_object_skew
configuration value. This is usually an indication that the pool(s) containing most of the data in the cluster have too few PGs, and/or that other pools that do not contain as much data have too many PGs. The threshold can be raised to silence the health warning by adjusting the
mon_pg_warn_max_object_skew
configuration option on the monitors.
- POOL_APP_NOT_ENABLED
A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application. For example, if the pool is used by RBD:
cephadm >
rbd pool init pool_name
If the pool is being used by a custom application 'foo', you can also label it using the low-level command:
cephadm >
ceph osd pool application enable pool_name foo
- POOL_FULL
One or more pools have reached (or are very close to reaching) their quota. The threshold to trigger this error condition is controlled by the
mon_pool_quota_crit_threshold
configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm >
ceph osd pool set-quota pool max_bytes bytes
cephadm >
ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- POOL_NEAR_FULL
One or more pools are approaching their quota. The threshold to trigger this warning condition is controlled by the
mon_pool_quota_warn_threshold
configuration option. Pool quotas can be adjusted up or down (or removed) with:
cephadm >
ceph osd pool set-quota pool max_bytes bytes
cephadm >
ceph osd pool set-quota pool max_objects objects
Setting the quota value to 0 will disable the quota.
- OBJECT_MISPLACED
One or more objects in the cluster are not stored on the node where the cluster wants them. This is an indication that data migration caused by a recent cluster change has not yet completed. Misplaced data is not a dangerous condition in itself. Data consistency is never at risk, and old copies of objects are never removed until the desired number of new copies (in the desired locations) are present.
- OBJECT_UNFOUND
One or more objects in the cluster cannot be found. Specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. Read or write requests to the 'unfound' objects will be blocked. Ideally, a down OSD can be brought back online that has the more recent copy of the unfound object. Candidate OSDs can be identified from the peering state for the PG(s) responsible for the unfound object:
cephadm >
ceph tell pgid query
- REQUEST_SLOW
One or more OSD requests are taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. You can query the request queue on the OSD(s) in question with the following command executed from the OSD host:
cephadm >
ceph daemon osd.id ops
You can see a summary of the slowest recent requests:
cephadm >
ceph daemon osd.id dump_historic_ops
You can find the location of an OSD with:
cephadm >
ceph osd find osd.id
- REQUEST_STUCK
One or more OSD requests have been blocked for an extended time, for example 4096 seconds. This is an indication that either the cluster has been unhealthy for an extended period of time (for example not enough running OSDs or inactive PGs) or there is some internal problem with the OSD.
- PG_NOT_SCRUBBED
One or more PGs have not been scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally scrubbed every
mon_scrub_interval
seconds, and this warning triggers when
mon_warn_not_scrubbed
such intervals have elapsed without a scrub. PGs will not scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a scrub of a clean PG with:
cephadm >
ceph pg scrub pgid
- PG_NOT_DEEP_SCRUBBED
One or more PGs have not been deep scrubbed (see Section 7.5, “Scrubbing”) recently. PGs are normally deep scrubbed every
osd_deep_scrub_interval
seconds, and this warning triggers when
mon_warn_not_deep_scrubbed
seconds have elapsed without a deep scrub. PGs will not (deep) scrub if they are not flagged as clean, which may happen if they are misplaced or degraded (see PG_AVAILABILITY and PG_DEGRADED above). You can manually initiate a deep scrub of a clean PG with:
cephadm >
ceph pg deep-scrub pgid
4.3 Watching a Cluster #
You can find the immediate state of the cluster using ceph
-s
. For example, a small Ceph cluster consisting of three monitors
and four OSDs may print the following when a workload is running:
cephadm >
ceph -s
cluster:
id: ea4cf6ce-80c6-3583-bb5e-95fa303c893f
health: HEALTH_WARN
too many PGs per OSD (408 > max 300)
services:
mon: 3 daemons, quorum ses5min1,ses5min3,ses5min2
mgr: ses5min1(active), standbys: ses5min3, ses5min2
mds: cephfs-1/1/1 up {0=ses5min3=up:active}
osd: 4 osds: 4 up, 4 in
rgw: 1 daemon active
data:
pools: 8 pools, 544 pgs
objects: 253 objects, 3821 bytes
usage: 6252 MB used, 13823 MB / 20075 MB avail
pgs: 544 active+clean
The output provides the following information:
Cluster ID
Cluster health status
The monitor map epoch and the status of the monitor quorum
The OSD map epoch and the status of OSDs
The status of Ceph Managers.
The status of Object Gateways.
The placement group map version
The number of placement groups and pools
The notional amount of data stored and the number of objects stored
The total amount of data stored.
Tip: How Ceph Calculates Data Usage
The used
value reflects the actual amount of raw storage
used. The xxx GB / xxx GB
value means the amount
available (the lesser number) of the overall storage capacity of the
cluster. The notional number reflects the size of the stored data before it
is replicated, cloned or snapshotted. Therefore, the amount of data actually
stored typically exceeds the notional amount stored, because Ceph creates
replicas of the data and may also use storage capacity for cloning and
snapshotting.
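The health warning in the sample output above (408 PGs per OSD against a maximum of 300) can be reproduced by hand. The following is a minimal sketch, assuming that every pool uses a replica count of 3 and that the per-OSD figure is the number of PG replicas divided by the number of OSDs. The function name pgs_per_osd is hypothetical and not part of Ceph.

```shell
# Hypothetical helper, not a Ceph command.
# $1 = total number of PGs, $2 = replica count (assumed equal for all pools),
# $3 = number of OSDs. Prints the integer number of PG replicas per OSD.
pgs_per_osd() {
  echo $(( $1 * $2 / $3 ))
}

# 544 PGs x 3 replicas / 4 OSDs, the values from the sample output
pgs_per_osd 544 3 4   # prints 408
```

In a cluster whose pools use different replica counts, the products of PG count and replica count would need to be summed per pool before dividing.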
Other commands that display immediate status information are:
ceph pg stat
ceph osd pool stats
ceph df
ceph df detail
To get the information updated in real time, put any of these commands
(including ceph -s
) as an argument of the
watch
command:
root #
watch -n 10 'ceph -s'
Press Ctrl–C to stop watching.
4.4 Checking a Cluster's Usage Stats #
To check a cluster’s data usage and distribution among pools, use the
ceph df
command. To get more details, use ceph
df detail
.
cephadm >
ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
65886G 45826G 7731M 16
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
data 1 1726M 10 17676G 1629
rbd 4 5897M 27 22365G 3547
ecpool 6 69M 0.2 35352G 31
[...]
The GLOBAL
section of the output provides an overview of
the amount of storage your cluster uses for your data.
SIZE
: The overall storage capacity of the cluster.
AVAIL
: The amount of free space available in the cluster.
RAW USED
: The amount of raw storage used.
%RAW USED
: The percentage of raw storage used. Use this number in conjunction with the
full ratio
and
near full ratio
to ensure that you are not reaching your cluster’s capacity. See Storage Capacity for additional details.
Note: Cluster Fill Level
When a raw storage fill level is getting close to 100%, you need to add new storage to the cluster. High usage may lead to individual OSDs becoming full and to cluster health problems.
Use the command
ceph osd df tree
to list the fill level of all OSDs.
The POOLS
section of the output provides a list of pools
and the notional usage of each pool. The output from this section
does not reflect replicas, clones or snapshots. For
example, if you store an object with 1MB of data, the notional usage will be
1MB, but the actual usage may be 2MB or more depending on the number of
replicas, clones and snapshots.
NAME
: The name of the pool.
ID
: The pool ID.
USED
: The notional amount of data stored in kilobytes, unless the number is suffixed with M for megabytes or G for gigabytes.
%USED
: The notional percentage of storage used per pool.
MAX AVAIL
: The maximum available space in the given pool.
OBJECTS
: The notional number of objects stored per pool.
Note
The numbers in the POOLS section are notional. They are not inclusive of the number of replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL section of the output.
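To spot pools with a high notional fill level, the POOLS section can be filtered with standard text tools. The following is a minimal sketch, assuming the column layout of the sample output above, with %USED in the fourth column. The layout may differ between Ceph versions, and the function name pools_over is hypothetical.

```shell
# Hypothetical helper: reads "ceph df" output on standard input and prints the
# names of pools whose %USED value (assumed to be column 4) is at or above $1.
pools_over() {
  awk -v limit="$1" '
    # Only pool rows are considered: six columns, after the POOLS: header
    in_pools && $1 != "NAME" && NF >= 6 && $4 + 0 >= limit { print $1 }
    /^POOLS:/ { in_pools = 1 }
  '
}
```

On a cluster node, it could be used as ceph df | pools_over 25 to list pools that are at least 25% full.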
4.5 Checking OSD Status #
You can check OSDs to ensure they are up and in by executing:
cephadm >
ceph osd stat
or
cephadm >
ceph osd dump
You can also view OSDs according to their position in the CRUSH map.
cephadm >
ceph osd tree
Ceph will print a CRUSH tree with a host, its OSDs, whether they are up and their weight.
# id    weight  type name               up/down reweight
-1      3       pool default
-3      3         rack mainrack
-2      3           host osd-host
0       1             osd.0             up      1
1       1             osd.1             up      1
2       1             osd.2             up      1
4.6 Checking for Full OSDs #
Ceph prevents you from writing to a full OSD so that you do not lose data.
In an operational cluster, you should receive a warning when your cluster is
getting near its full ratio. The mon osd full ratio
defaults to 0.95, or 95% of capacity before it stops clients from writing
data. The mon osd nearfull ratio
defaults to 0.85, or 85%
of capacity, when it generates a health warning.
Full OSD nodes will be reported by ceph health
:
cephadm >
ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
or
cephadm >
ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%
The best way to deal with a full cluster is to add new OSD hosts/disks allowing the cluster to redistribute data to the newly available storage.
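The two thresholds described above can be expressed as a simple classification. The following is a minimal sketch, assuming whole-number percentages and the default ratios of 85% and 95%. The function name osd_fill_state is hypothetical and not part of Ceph.

```shell
# Hypothetical helper: classifies an OSD fill level given as a whole-number
# percentage. $2 and $3 are the nearfull and full thresholds in percent and
# default to 85 and 95, mirroring the default ratios described above.
osd_fill_state() {
  pct=$1
  nearfull=${2:-85}
  full=${3:-95}
  if [ "$pct" -ge "$full" ]; then
    echo full
  elif [ "$pct" -ge "$nearfull" ]; then
    echo nearfull
  else
    echo ok
  fi
}
```

With the values from the sample output, osd_fill_state 85 reports nearfull and osd_fill_state 97 reports full.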
Tip: Preventing Full OSDs
After an OSD becomes full—it uses 100% of its disk space—it will normally crash quickly without warning. Following are a few tips to remember when administering OSD nodes.
Each OSD's disk space (usually mounted under
/var/lib/ceph/osd/osd-{1,2..}
) needs to be placed on a dedicated underlying disk or partition.
Check the Ceph configuration files and make sure that Ceph does not store its log file to the disks/partitions dedicated for use by OSDs.
Make sure that no other process writes to the disks/partitions dedicated for use by OSDs.
4.7 Checking Monitor Status #
After you start the cluster and before first reading and/or writing data, check the Ceph Monitors quorum status. When the cluster is already serving requests, check the Ceph Monitors status periodically to ensure that they are running.
To display the monitor map, execute the following:
cephadm >
ceph mon stat
or
cephadm >
ceph mon dump
To check the quorum status for the monitor cluster, execute the following:
cephadm >
ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:
{
  "election_epoch": 10,
  "quorum": [
    0,
    1,
    2
  ],
  "monmap": {
    "epoch": 1,
    "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
    "modified": "2011-12-12 13:28:27.505520",
    "created": "2011-12-12 13:28:27.505520",
    "mons": [
      { "rank": 0, "name": "a", "addr": "192.168.1.10:6789\/0" },
      { "rank": 1, "name": "b", "addr": "192.168.1.11:6789\/0" },
      { "rank": 2, "name": "c", "addr": "192.168.1.12:6789\/0" }
    ]
  }
}
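Monitors can only form a quorum when a strict majority of the monitors in the monitor map take part. In the output above, the quorum list contains three of the three monitors listed under mons. The following minimal sketch shows the majority check. The function name has_majority is hypothetical, and the two counts are assumed to have been read from the output manually or with a JSON tool of your choice.

```shell
# Hypothetical helper: succeeds if $1 monitors in quorum form a strict
# majority of the $2 monitors listed in the monitor map.
has_majority() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

if has_majority 3 3; then
  echo "quorum holds a majority"
fi
```

For a three-monitor cluster, two monitors in quorum still pass the check, while a single monitor does not.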
4.8 Checking Placement Group States #
Placement groups map objects to OSDs. When you monitor your placement
groups, you will want them to be active
and
clean
. For a detailed discussion, refer to
Monitoring
OSDs and Placement Groups.
4.9 Using the Admin Socket #
The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under /var/run/ceph
.
To access a daemon via the admin socket, log in to the host running the
daemon and use the following command:
cephadm >
ceph --admin-daemon /var/run/ceph/socket-name
To view the available admin socket commands, execute the following command:
cephadm >
ceph --admin-daemon /var/run/ceph/socket-name help
The admin socket command enables you to show and set your configuration at runtime. Refer to Viewing a Configuration at Runtime for details.
Additionally, you can set configuration values at runtime directly (the
admin socket bypasses the monitor, unlike ceph tell
daemon-type.id
injectargs, which relies on the monitor but does not require you to log in
directly to the host in question).
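The socket name usually follows from the cluster name and the daemon name. The following is a minimal sketch, assuming the default socket directory, the default cluster name ceph, and the default naming pattern CLUSTER-DAEMON.asok. The function name asok_path is hypothetical, and your deployment may use a different socket location.

```shell
# Hypothetical helper: prints the admin socket path for a daemon such as
# osd.0 or mon.a, assuming the default socket directory and the default
# cluster name "ceph" (override the cluster name with $2).
asok_path() {
  echo "/var/run/ceph/${2:-ceph}-$1.asok"
}

asok_path osd.0
```

On the host running the daemon, the result can be passed to the command shown above, for example ceph --admin-daemon "$(asok_path osd.0)" config show to print the running configuration.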
5 Monitoring and Alerting #
By default, DeepSea deploys a monitoring and alerting stack on the Salt master. It consists of the following components:
Prometheus monitoring and alerting toolkit.
Grafana visualization and alerting software.
The
prometheus-ceph_exporter
service running on the Salt master.
The
prometheus-node_exporter
service running on all Salt minions.
The Prometheus configuration and scrape targets
(exporting daemons) are set up automatically by DeepSea. DeepSea also
deploys a list of default alerts, for example health
error
, 10% OSDs down
, or pgs
inactive
.
The Alertmanager handles alerts sent by the Prometheus server. It takes care of de-duplicating, grouping, and routing them to the correct receiver. It also takes care of silencing of alerts. Alertmanager is configured via the command line flags and a configuration file that defines inhibition rules, notification routing and notification receivers.
5.1 Configuration File #
Alertmanager's configuration is different for each deployment. Therefore,
DeepSea does not ship any related defaults. You need to provide your own
alertmanager.yml
configuration file. The
alertmanager package by default installs a
configuration file /etc/prometheus/alertmanager.yml
which can serve as an example configuration. If you prefer to have your
Alertmanager configuration managed by DeepSea, add the following key to
your pillar, for example to the
/srv/pillar/ceph/stack/ceph/minions/YOUR_SALT_MASTER_MINION_ID.sls
file:
monitoring:
  alertmanager_config: /path/to/your/alertmanager/config.yml
For a complete example of Alertmanager's configuration file, see Appendix B, Default Alerts for SUSE Enterprise Storage.
Alertmanager's configuration file is written in the YAML format. It follows the scheme described below. Parameters in brackets are optional. For non-list parameters the default value is used. The following generic placeholders are used in the scheme:
- DURATION
A duration matching the regular expression
[0-9]+(ms|[smhdwy])
- LABELNAME
A string matching the regular expression
[a-zA-Z_][a-zA-Z0-9_]*
- LABELVALUE
A string of unicode characters.
- FILEPATH
A valid path in the current working directory.
- BOOLEAN
A boolean that can take the values
true
or
false
.
- STRING
A regular string.
- SECRET
A regular string that is a secret. For example, a password.
- TMPL_STRING
A string which is template-expanded before usage.
- TMPL_SECRET
A secret string which is template-expanded before usage.
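Values for the DURATION placeholder can be checked against the regular expression given above before they are written to the configuration file. The following is a minimal sketch using grep with extended regular expressions. The function name is_duration is hypothetical.

```shell
# Hypothetical helper: succeeds if $1 matches the DURATION pattern
# [0-9]+(ms|[smhdwy]), anchored to the whole string.
is_duration() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(ms|[smhdwy])$'
}

for value in 5m 30s 100ms 5min; do
  if is_duration "$value"; then
    echo "$value: valid"
  else
    echo "$value: invalid"
  fi
done
```

Of the four sample values, only 5min fails the check, because min is not one of the accepted unit suffixes.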
Example 5.1: Global Configuration #
Parameters in the global:
configuration are valid in
all other configuration contexts. They also serve as defaults for other
configuration sections.
global:
  # the time after which an alert is declared resolved if it has not been updated
  [ resolve_timeout: DURATION | default = 5m ]
  # The default SMTP From header field.
  [ smtp_from: TMPL_STRING ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS
  # (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: STRING ]
  # The default host name to identify to the SMTP server.
  [ smtp_hello: STRING | default = "localhost" ]
  [ smtp_auth_username: STRING ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: SECRET ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: STRING ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: SECRET ]
  # The default SMTP TLS requirement.
  [ smtp_require_tls: BOOL | default = true ]
  # The API URL to use for Slack notifications.
  [ slack_api_url: STRING ]
  [ victorops_api_key: STRING ]
  [ victorops_api_url: STRING | default = "https://victorops.example.com/integrations/alert/" ]
  [ pagerduty_url: STRING | default = "https://pagerduty.example.com/v2/enqueue" ]
  [ opsgenie_api_key: STRING ]
  [ opsgenie_api_url: STRING | default = "https://opsgenie.example.com/" ]
  [ hipchat_api_url: STRING | default = "https://hipchat.example.com/" ]
  [ hipchat_auth_token: SECRET ]
  [ wechat_api_url: STRING | default = "https://wechat.example.com/cgi-bin/" ]
  [ wechat_api_secret: SECRET ]
  [ wechat_api_corp_id: STRING ]
  # The default HTTP client configuration
  [ http_config: HTTP_CONFIG ]
# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - FILEPATH ... ]
# The root node of the routing tree.
route: ROUTE
# A list of notification receivers.
receivers:
  - RECEIVER ...
# A list of inhibition rules.
inhibit_rules:
  [ - INHIBIT_RULE ... ]
Example 5.2: ROUTE #
A ROUTE block defines a node in a routing
tree. Unspecified parameters are inherited from its parent node. Every
alert enters the routing tree at the configured top-level route, which
needs to match all alerts, and then traverses the child nodes. If the
continue
option is set to false
,
the traversal stops after the first matched child. If the option is set to
true
on a matched node, the alert continues to be matched
against subsequent siblings. If an alert does not match any children of a
node, the alert is handled based on the configuration parameters of the
current node.
[ receiver: STRING ]
[ group_by: '[' LABELNAME, ... ']' ]
# If an alert should continue matching subsequent sibling nodes.
[ continue: BOOLEAN | default = false ]
# A set of equality matchers an alert has to fulfill to match a node.
match:
  [ LABELNAME: LABELVALUE, ... ]
# A set of regex-matchers an alert has to fulfill to match a node.
match_re:
  [ LABELNAME: REGEX, ... ]
# Time to wait before sending a notification for a group of alerts.
[ group_wait: DURATION | default = 30s ]
# Time to wait before sending a notification about new alerts
# added to a group of alerts for which an initial notification has
# already been sent.
[ group_interval: DURATION | default = 5m ]
# Time to wait before re-sending a notification
[ repeat_interval: DURATION | default = 4h ]
# Possible child routes.
routes:
  [ - ROUTE ... ]
Example 5.3: INHIBIT_RULE #
An inhibition rule mutes a target alert that matches a set of matchers
when a source alert exists that matches another set of matchers. Both
alerts need to share the same label values for the label names in the
equal
list.
Alerts can match and therefore inhibit themselves. Do not write inhibition rules where an alert matches both source and target.
# Matchers that need to be fulfilled for the alerts to be muted.
target_match:
  [ LABELNAME: LABELVALUE, ... ]
target_match_re:
  [ LABELNAME: REGEX, ... ]
# Matchers for which at least one alert needs to exist so that the
# inhibition occurs.
source_match:
  [ LABELNAME: LABELVALUE, ... ]
source_match_re:
  [ LABELNAME: REGEX, ... ]
# Labels with an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' LABELNAME, ... ']' ]
Example 5.4: HTTP_CONFIG #
HTTP_CONFIG configures the HTTP client used by the receiver to communicate with API services.
Note that basic_auth
, bearer_token
and
bearer_token_file
options are mutually exclusive.
# Sets the 'Authorization' header with the user name and password.
basic_auth:
  [ username: STRING ]
  [ password: SECRET ]
# Sets the 'Authorization' header with the bearer token.
[ bearer_token: SECRET ]
# Sets the 'Authorization' header with the bearer token read from a file.
[ bearer_token_file: FILEPATH ]
# TLS settings.
tls_config:
  # CA certificate to validate the server certificate with.
  [ ca_file: FILEPATH ]
  # Certificate and key files for client cert authentication to the server.
  [ cert_file: FILEPATH ]
  [ key_file: FILEPATH ]
  # ServerName extension to indicate the name of the server.
  # http://tools.ietf.org/html/rfc4366#section-3.1
  [ server_name: STRING ]
  # Disable validation of the server certificate.
  [ insecure_skip_verify: BOOLEAN | default = false]
# Optional proxy URL.
[ proxy_url: STRING ]
Example 5.5: RECEIVER #
Receiver is a named configuration for one or more notification integrations.
Instead of adding new receivers, we recommend implementing custom notification integrations using the webhook receiver (see Example 5.15, “WEBHOOK_CONFIG”).
# The unique name of the receiver.
name: STRING
# Configurations for several notification integrations.
email_configs:
  [ - EMAIL_CONFIG, ... ]
hipchat_configs:
  [ - HIPCHAT_CONFIG, ... ]
pagerduty_configs:
  [ - PAGERDUTY_CONFIG, ... ]
pushover_configs:
  [ - PUSHOVER_CONFIG, ... ]
slack_configs:
  [ - SLACK_CONFIG, ... ]
opsgenie_configs:
  [ - OPSGENIE_CONFIG, ... ]
webhook_configs:
  [ - WEBHOOK_CONFIG, ... ]
victorops_configs:
  [ - VICTOROPS_CONFIG, ... ]
wechat_configs:
  [ - WECHAT_CONFIG, ... ]
Example 5.6: EMAIL_CONFIG #
# Whether to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]
# The email address to send notifications to.
to: TMPL_STRING
# The sender address.
[ from: TMPL_STRING | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: STRING | default = global.smtp_smarthost ]
# The host name to identify to the SMTP server.
[ hello: STRING | default = global.smtp_hello ]
# SMTP authentication details.
[ auth_username: STRING | default = global.smtp_auth_username ]
[ auth_password: SECRET | default = global.smtp_auth_password ]
[ auth_secret: SECRET | default = global.smtp_auth_secret ]
[ auth_identity: STRING | default = global.smtp_auth_identity ]
# The SMTP TLS requirement.
[ require_tls: BOOL | default = global.smtp_require_tls ]
# The HTML body of the email notification.
[ html: TMPL_STRING | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: TMPL_STRING ]
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { STRING: TMPL_STRING, ... } ]
Example 5.7: HIPCHAT_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]
# The HipChat Room ID.
room_id: TMPL_STRING
# The authentication token.
[ auth_token: SECRET | default = global.hipchat_auth_token ]
# The URL to send API requests to.
[ api_url: STRING | default = global.hipchat_api_url ]
# A label to be shown in addition to the sender's name.
[ from: TMPL_STRING | default = '{{ template "hipchat.default.from" . }}' ]
# The message body.
[ message: TMPL_STRING | default = '{{ template "hipchat.default.message" . }}' ]
# Whether this message will trigger a user notification.
[ notify: BOOLEAN | default = false ]
# Determines how the message is treated by the alertmanager and rendered inside HipChat. Valid values are 'text' and 'html'.
[ message_format: STRING | default = 'text' ]
# Background color for message.
[ color: TMPL_STRING | default = '{{ if eq .Status "firing" }}red{{ else }}green{{ end }}' ]
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.8: PAGERDUTY_CONFIG #
The routing_key
and service_key
options
are mutually exclusive.
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]
# The PagerDuty integration key (when using 'Events API v2').
routing_key: TMPL_SECRET
# The PagerDuty integration key (when using 'Prometheus').
service_key: TMPL_SECRET
# The URL to send API requests to.
[ url: STRING | default = global.pagerduty_url ]
# The client identification of the Alertmanager.
[ client: TMPL_STRING | default = '{{ template "pagerduty.default.client" . }}' ]
# A backlink to the notification sender.
[ client_url: TMPL_STRING | default = '{{ template "pagerduty.default.clientURL" . }}' ]
# The incident description.
[ description: TMPL_STRING | default = '{{ template "pagerduty.default.description" .}}' ]
# Severity of the incident.
[ severity: TMPL_STRING | default = 'error' ]
# A set of arbitrary key/value pairs that provide further details.
[ details: { STRING: TMPL_STRING, ... } | default = {
  firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
  resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
  num_firing: '{{ .Alerts.Firing | len }}'
  num_resolved: '{{ .Alerts.Resolved | len }}'
} ]
# The HTTP client's configuration.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.9: PUSHOVER_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]
# The recipient user key.
user_key: SECRET
# Registered application’s API token.
token: SECRET
# Notification title.
[ title: TMPL_STRING | default = '{{ template "pushover.default.title" . }}' ]
# Notification message.
[ message: TMPL_STRING | default = '{{ template "pushover.default.message" . }}' ]
# A supplementary URL displayed together with the message.
[ url: TMPL_STRING | default = '{{ template "pushover.default.url" . }}' ]
# Priority.
[ priority: TMPL_STRING | default = '{{ if eq .Status "firing" }}2{{ else }}0{{ end }}' ]
# How often the Pushover servers will send the same notification (at least 30 seconds).
[ retry: DURATION | default = 1m ]
# How long your notification will continue to be retried (unless the user
# acknowledges the notification).
[ expire: DURATION | default = 1h ]
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.10: SLACK_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]
# The Slack webhook URL.
[ api_url: SECRET | default = global.slack_api_url ]
# The channel or user to send notifications to.
channel: TMPL_STRING
# API request data as defined by the Slack webhook API.
[ icon_emoji: TMPL_STRING ]
[ icon_url: TMPL_STRING ]
[ link_names: BOOLEAN | default = false ]
[ username: TMPL_STRING | default = '{{ template "slack.default.username" . }}' ]
# The following parameters define the attachment.
actions:
  [ ACTION_CONFIG ... ]
[ color: TMPL_STRING | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ fallback: TMPL_STRING | default = '{{ template "slack.default.fallback" . }}' ]
fields:
  [ FIELD_CONFIG ... ]
[ footer: TMPL_STRING | default = '{{ template "slack.default.footer" . }}' ]
[ pretext: TMPL_STRING | default = '{{ template "slack.default.pretext" . }}' ]
[ short_fields: BOOLEAN | default = false ]
[ text: TMPL_STRING | default = '{{ template "slack.default.text" . }}' ]
[ title: TMPL_STRING | default = '{{ template "slack.default.title" . }}' ]
[ title_link: TMPL_STRING | default = '{{ template "slack.default.titlelink" . }}' ]
[ image_url: TMPL_STRING ]
[ thumb_url: TMPL_STRING ]
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.11: ACTION_CONFIG for SLACK_CONFIG #
# Set to 'button' to tell Slack you want to render a button.
type: TMPL_STRING
# Label for the button.
text: TMPL_STRING
# http or https URL to deliver users to. If you specify invalid URLs, the message will be posted with no button.
url: TMPL_STRING
# If set to 'primary', the button will be green, indicating the best forward action to take.
# 'danger' turns the button red, indicating a destructive action.
[ style: TMPL_STRING | default = '' ]
Example 5.12: FIELD_CONFIG for SLACK_CONFIG #
# A bold heading without markup above the value text.
title: TMPL_STRING
# The text of the field. It can span across several lines.
value: TMPL_STRING
# A flag indicating if value is short enough to be displayed together with other values.
[ short: BOOLEAN | default = slack_config.short_fields ]
Example 5.13: OPSGENIE_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]
# The API key to use with the OpsGenie API.
[ api_key: SECRET | default = global.opsgenie_api_key ]
# The host to send OpsGenie API requests to.
[ api_url: STRING | default = global.opsgenie_api_url ]
# Alert text (maximum is 130 characters).
[ message: TMPL_STRING ]
# A description of the incident.
[ description: TMPL_STRING | default = '{{ template "opsgenie.default.description" . }}' ]
# A backlink to the sender.
[ source: TMPL_STRING | default = '{{ template "opsgenie.default.source" . }}' ]
# A set of arbitrary key/value pairs that provide further detail.
[ details: { STRING: TMPL_STRING, ... } ]
# Comma-separated list of teams responsible for notifications.
[ teams: TMPL_STRING ]
# Comma-separated list of tags attached to the notifications.
[ tags: TMPL_STRING ]
# Additional alert note.
[ note: TMPL_STRING ]
# Priority level of alert, one of P1, P2, P3, P4, and P5.
[ priority: TMPL_STRING ]
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.14: VICTOROPS_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]
# The API key for talking to the VictorOps API.
[ api_key: SECRET | default = global.victorops_api_key ]
# The VictorOps API URL.
[ api_url: STRING | default = global.victorops_api_url ]
# A key used to map the alert to a team.
routing_key: TMPL_STRING
# Describes the behavior of the alert (one of 'CRITICAL', 'WARNING', 'INFO').
[ message_type: TMPL_STRING | default = 'CRITICAL' ]
# Summary of the alerted problem.
[ entity_display_name: TMPL_STRING | default = '{{ template "victorops.default.entity_display_name" . }}' ]
# Long explanation of the alerted problem.
[ state_message: TMPL_STRING | default = '{{ template "victorops.default.state_message" . }}' ]
# The monitoring tool the state message is from.
[ monitoring_tool: TMPL_STRING | default = '{{ template "victorops.default.monitoring_tool" . }}' ]
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Example 5.15: WEBHOOK_CONFIG #
You can utilize the webhook receiver to configure a generic receiver.
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = true ]
# The endpoint for sending HTTP POST requests.
url: STRING
# Configuration of the HTTP client.
[ http_config: HTTP_CONFIG | default = global.http_config ]
Alertmanager sends HTTP POST requests in the following JSON format:
{
  "version": "4",
  "groupKey": STRING, // identification of the group of alerts (to deduplicate)
  "status": "<resolved|firing>",
  "receiver": STRING,
  "groupLabels": OBJECT,
  "commonLabels": OBJECT,
  "commonAnnotations": OBJECT,
  "externalURL": STRING, // backlink to Alertmanager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": OBJECT,
      "annotations": OBJECT,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": STRING // identifies the entity that caused the alert
    },
    ...
  ]
}
The webhook receiver allows for integration with the following notification mechanisms:
DingTalk (https://github.com/timonwong/prometheus-webhook-dingtalk)
IRC Bot (https://github.com/multimfi/bot)
JIRAlert (https://github.com/free/jiralert)
Phabricator / Maniphest (https://github.com/knyar/phalerts)
prom2teams: forwards notifications to Microsoft Teams (https://github.com/idealista/prom2teams)
SMS: supports multiple providers (https://github.com/messagebird/sachet)
Telegram bot (https://github.com/inCaller/prometheus_bot)
Example 5.16: WECHAT_CONFIG #
# Whether or not to notify about resolved alerts.
[ send_resolved: BOOLEAN | default = false ]
# The API key to use for the WeChat API.
[ api_secret: SECRET | default = global.wechat_api_secret ]
# The WeChat API URL.
[ api_url: STRING | default = global.wechat_api_url ]
# The corp id used to authenticate.
[ corp_id: STRING | default = global.wechat_api_corp_id ]
# API request data as defined by the WeChat API.
[ message: TMPL_STRING | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: STRING | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: STRING | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: STRING | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: STRING | default = '{{ template "wechat.default.to_tag" . }}' ]
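The scheme above can be combined into a small working file. The following is a minimal sketch that writes an Alertmanager configuration with one route and one webhook receiver, using only parameters listed in the examples above. The function name write_alertmanager_example, the receiver name ops-webhook, the grouping label alertname and the URL are assumptions made for illustration. Replace them with values that fit your environment.

```shell
# Hypothetical helper: writes a minimal Alertmanager configuration to the
# file given as $1. Receiver name, grouping label and URL are placeholders.
write_alertmanager_example() {
  cat > "$1" <<'EOF'
global:
  resolve_timeout: 5m
route:
  # Top-level route: it has no matchers, so it matches all alerts.
  receiver: ops-webhook
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: ops-webhook
    webhook_configs:
      - url: http://127.0.0.1:5001/
        send_resolved: true
EOF
}
```

After calling write_alertmanager_example with a path of your choice, the same path can be set as the alertmanager_config value in the pillar so that DeepSea manages the file.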
5.2 Custom Alerts #
You can define your custom alert conditions to send notifications to an external service. Prometheus uses its own expression language for defining custom alerts. Following is an example of a rule with an alert:
groups:
- name: example
  rules:
  # alert on high deviation from average PG count
  - alert: high pg count deviation
    expr: abs(((ceph_osd_pgs > 0) - on (job) group_left avg(ceph_osd_pgs > 0) by (job)) / on (job) group_left avg(ceph_osd_pgs > 0) by (job)) > 0.35
    for: 5m
    labels:
      severity: warning
      type: ses_default
    annotations:
      description: >
        OSD {{ $labels.osd }} deviates by more than 30% from average PG count
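To see what the expression in this rule computes, the following Python sketch repeats the arithmetic on a plain dictionary of PG counts. It assumes that all OSDs belong to a single job, so the grouping by job is left out. The function name and the sample counts are hypothetical.

```python
def deviating_osds(pg_counts: dict[str, int], threshold: float = 0.35) -> list[str]:
    """Return OSDs whose PG count deviates from the average by more than threshold."""
    # Mirror the "ceph_osd_pgs > 0" filter: OSDs without PGs are ignored.
    active = {osd: count for osd, count in pg_counts.items() if count > 0}
    if not active:
        return []
    average = sum(active.values()) / len(active)
    # Relative deviation: abs((value - average) / average)
    return sorted(
        osd for osd, count in active.items()
        if abs((count - average) / average) > threshold
    )


# Hypothetical PG counts per OSD; the average of the active OSDs is 80.
counts = {"osd.0": 100, "osd.1": 100, "osd.2": 100, "osd.3": 20, "osd.4": 0}
print(deviating_osds(counts))
```

With these counts, the OSDs holding 100 PGs deviate by 0.25 and stay below the threshold, while osd.3 deviates by 0.75 and is reported.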
The optional for
clause specifies the time Prometheus
waits between first encountering a new expression output vector
element and counting an alert as firing. In this case, Prometheus
checks that the alert continues to be active for 5 minutes before firing
the alert. Elements in a pending state are active, but not firing yet.
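The transition from pending to firing can be pictured as a small state function. The Python sketch below is a simplified model, not Prometheus code: it ignores the evaluation interval and assumes that the time at which the expression first became active is known. All names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_state(active_since: Optional[datetime], now: datetime,
                hold: timedelta = timedelta(minutes=5)) -> str:
    """Classify an alert element as inactive, pending, or firing."""
    if active_since is None:
        # The expression currently returns no element for this label set.
        return "inactive"
    if now - active_since < hold:
        # Active, but the "for" duration has not elapsed yet.
        return "pending"
    return "firing"


start = datetime(2019, 1, 1, 12, 0)
print(alert_state(start, start + timedelta(minutes=2)))
print(alert_state(start, start + timedelta(minutes=5)))
```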
The labels
clause specifies a set of additional labels
attached to the alert. Conflicting labels will be overwritten. Labels can
be templated (see Section 5.2.1, “Templates” for more
details on templating).
The annotations
clause specifies informational labels.
You can use them to store additional information, for example alert
descriptions or runbook links. Annotations can be templated (see
Section 5.2.1, “Templates” for more details on templating).
To add your custom alerts to SUSE Enterprise Storage, either...
place your YAML files with custom alerts in the
/etc/prometheus/alerts
directory
or
provide a list of paths to your custom alert files in the pillar under the
monitoring:custom_alerts
key. DeepSea Stage 2 or the
salt SALT_MASTER state.apply ceph.monitoring.prometheus
command adds your alert files in the right place.
Example 5.17: Adding Custom Alerts to SUSE Enterprise Storage #
A file with custom alerts is in
/root/my_alerts/my_alerts.yml
on the Salt master. If you add
monitoring:
  custom_alerts:
    - /root/my_alerts/my_alerts.yml
to the
/srv/pillar/ceph/cluster/YOUR_SALT_MASTER_MINION_ID.sls
file, DeepSea creates the
/etc/prometheus/alerts/my_alerts.yml
file and restarts Prometheus.
5.2.1 Templates #
You can use templates for label and annotation values. The
$labels
variable includes the label key and value pairs of
an alert instance, while $value
holds the evaluated
value of an alert instance.
The following example inserts a firing element label and value:
{{ $labels.LABELNAME }} {{ $value }}
5.2.2 Inspecting Alerts at Runtime #
If you need to verify which alerts are active, you have several options:
Navigate to the Alerts tab of Prometheus. It shows you the exact label sets for which defined alerts are active. Prometheus also stores synthetic time series for pending and firing alerts. They have the following form:
ALERTS{alertname="ALERT_NAME", alertstate="pending|firing", ADDITIONAL_ALERT_LABELS}
The sample value is 1 if the alert is active (pending or firing). The series is marked stale when the alert is inactive.
In the Prometheus web interface at the URL address http://PROMETHEUS_HOST_IP:9090/alerts, inspect alerts and their state (INACTIVE, PENDING or FIRING).
In the Alertmanager web interface at the URL address http://PROMETHEUS_HOST_IP:9093/#/alerts, inspect alerts and silence them if desired.
6 Authentication with cephx #
To identify clients and protect against man-in-the-middle attacks, Ceph
provides its cephx
authentication system. Clients in
this context are either human users—such as the admin user—or
Ceph-related services/daemons, for example OSDs, monitors, or Object Gateways.
Note
The cephx
protocol does not address data encryption in transport, such as
TLS/SSL.
6.1 Authentication Architecture #
cephx
uses shared secret keys for authentication, meaning both the client
and Ceph Monitors have a copy of the client’s secret key. The authentication
protocol enables both parties to prove to each other that they have a copy
of the key without actually revealing it. This provides mutual
authentication, which means the cluster is sure the user possesses the
secret key, and the user is sure that the cluster has a copy of the secret
key as well.
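The idea of proving possession of a shared key without revealing it can be illustrated with a generic challenge-response exchange. The Python sketch below uses HMAC-SHA256 as a stand-in. It is not the cephx wire protocol, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def prove(secret_key: bytes, challenge: bytes) -> bytes:
    """Answer a random challenge using the shared key; the key is never sent."""
    return hmac.new(secret_key, challenge, hashlib.sha256).digest()


def verify(secret_key: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check an answer by recomputing it with the local copy of the key."""
    return hmac.compare_digest(prove(secret_key, challenge), proof)


# Both parties hold a copy of the same secret key.
shared_key = secrets.token_bytes(16)

# The monitor challenges the client.
monitor_challenge = secrets.token_bytes(16)
client_ok = verify(shared_key, monitor_challenge,
                   prove(shared_key, monitor_challenge))

# The client challenges the monitor, which makes the authentication mutual.
client_challenge = secrets.token_bytes(16)
monitor_ok = verify(shared_key, client_challenge,
                    prove(shared_key, client_challenge))

# A party holding a different key cannot produce a valid answer.
impostor_ok = verify(shared_key, monitor_challenge,
                     prove(secrets.token_bytes(16), monitor_challenge))

print(client_ok, monitor_ok, impostor_ok)
```

Only challenges and answers cross the network in this model, so an eavesdropper does not learn the key itself.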
A key scalability feature of Ceph is to avoid a centralized interface to
the Ceph object store. This means that Ceph clients can interact with
OSDs directly. To protect data, Ceph provides its cephx
authentication
system, which authenticates Ceph clients.
Each monitor can authenticate clients and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the client’s permanent secret key, so that only the client can request services from the Ceph monitors. The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained wrongfully.
To use cephx, an administrator must set up clients/users first. In the
following diagram, the
client.admin
user invokes
ceph auth get-or-create-key
from the command line to
generate a user name and secret key. Ceph’s auth
subsystem generates the user name and key, stores a copy with the monitor(s)
and transmits the user’s secret back to the
client.admin
user. This means that
the client and the monitor share a secret key.
Figure 6.1: Basic cephx Authentication #
To authenticate with the monitor, the client passes the user name to the monitor. The monitor generates a session key and encrypts it with the secret key associated with the user name and transmits the encrypted ticket back to the client. The client then decrypts the data with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket related to the user, which is signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.
Figure 6.2: cephx Authentication #
The cephx
protocol authenticates ongoing communications between the client
machine and the Ceph servers. Each message sent between a client and a
server after the initial authentication is signed using a ticket that the
monitors, OSDs, and metadata servers can verify with their shared secret.
Figure 6.3: cephx Authentication - MDS and OSD #
Important
The protection offered by this authentication is between the Ceph client and the Ceph cluster hosts. The authentication is not extended beyond the Ceph client. If the user accesses the Ceph client from a remote host, Ceph authentication is not applied to the connection between the user’s host and the client host.
6.2 Key Management #
This section describes Ceph client users and their authentication and authorization with the Ceph storage cluster. Users are either individuals or system actors such as applications, which use Ceph clients to interact with the Ceph storage cluster daemons.
When Ceph runs with authentication and authorization enabled (enabled by
default), you must specify a user name and a keyring containing the secret
key of the specified user (usually via the command line). If you do not
specify a user name, Ceph will use
client.admin
as the default user
name. If you do not specify a keyring, Ceph will look for a keyring via
the keyring setting in the Ceph configuration file. For example, if you
execute the ceph health
command without specifying a user
name or keyring, Ceph interprets the command like this:
cephadm >
ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health
Alternatively, you may use the CEPH_ARGS
environment
variable to avoid re-entering the user name and secret.
6.2.1 Background Information #
Regardless of the type of Ceph client (for example, block device, object storage, file system, native API), Ceph stores all data as objects within pools. Ceph users need to have access to pools in order to read and write data. Additionally, Ceph users must have execute permissions to use Ceph's administrative commands. The following concepts will help you understand Ceph user management.
6.2.1.1 User #
A user is either an individual or a system actor such as an application. Creating users allows you to control who (or what) can access your Ceph storage cluster, its pools, and the data within pools.
Ceph distinguishes users by type. For the purposes of user management, the type will always be client. Ceph identifies users in period (.) delimited form, consisting of the user type and the user ID. For example, TYPE.ID, client.admin, or client.user1. The reason for user typing is that Ceph monitors, OSDs, and metadata servers also use the cephx protocol, but they are not clients. Distinguishing the user type helps to distinguish between client users and other users, streamlining access control, user monitoring, and traceability.
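When scripts process user names, it helps to split the TYPE.ID notation in one place. The following Python sketch splits at the first period only, on the assumption that the ID portion may itself contain periods. The function name is hypothetical.

```python
def parse_entity(name: str) -> tuple[str, str]:
    """Split a Ceph user name of the form TYPE.ID into its two parts."""
    # Split at the first period only; everything after it is the ID.
    user_type, separator, user_id = name.partition(".")
    if not separator or not user_type or not user_id:
        raise ValueError(f"expected TYPE.ID, got {name!r}")
    return user_type, user_id


print(parse_entity("client.admin"))
print(parse_entity("osd.0"))
```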
Note
A Ceph storage cluster user is not the same as a Ceph object storage user or a Ceph file system user. The Ceph Object Gateway uses a Ceph storage cluster user to communicate between the gateway daemon and the storage cluster, but the gateway has its own user management functionality for end users. The Ceph file system uses POSIX semantics. The user space associated with it is not the same as a Ceph storage cluster user.
6.2.1.2 Authorization and Capabilities #
Ceph uses the term 'capabilities' (caps) to describe authorizing an authenticated user to exercise the functionality of the monitors, OSDs, and metadata servers. Capabilities can also restrict access to data within a pool or pool namespace. A Ceph administrative user sets a user's capabilities when creating or updating a user.
Capability syntax follows the form:
daemon-type 'allow capability' [...]
Following is a list of capabilities for each service type:
- Monitor capabilities
include r, w, x and allow profile cap.
mon 'allow rwx'
mon 'allow profile osd'
- OSD capabilities
include r, w, x, class-read, class-write and profile osd. Additionally, OSD capabilities also allow for pool and namespace settings.
osd 'allow capability' [pool=poolname] [namespace=namespace-name]
- MDS capability
simply requires allow, or blank.
mds 'allow'
The following entries describe each capability:
- allow
Precedes access settings for a daemon. Implies rw for MDS only.
- r
Gives the user read access. Required with monitors to retrieve the CRUSH map.
- w
Gives the user write access to objects.
- x
Gives the user the capability to call class methods (both read and write) and to conduct auth operations on monitors.
- class-read
Gives the user the capability to call class read methods. Subset of x.
- class-write
Gives the user the capability to call class write methods. Subset of x.
- *
Gives the user read, write, and execute permissions for a particular daemon/pool, and the ability to execute admin commands.
- profile osd
Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting.
- profile mds
Gives a user permissions to connect as an MDS to other MDSs or monitors.
- profile bootstrap-osd
Gives a user permissions to bootstrap an OSD. Delegated to deployment tools so that they have permissions to add keys when bootstrapping an OSD.
- profile bootstrap-mds
Gives a user permissions to bootstrap a metadata server. Delegated to deployment tools so they have permissions to add keys when bootstrapping a metadata server.
6.2.2 Managing Users #
User management functionality provides Ceph cluster administrators with the ability to create, update, and delete users directly in the Ceph cluster.
When you create or delete users in the Ceph cluster, you may need to distribute keys to clients so that they can be added to keyrings. See Section 6.2.3, “Keyring Management” for details.
6.2.2.1 Listing Users #
To list the users in your cluster, execute the following:
cephadm >
ceph auth list
Ceph will list all users in your cluster. For example, in a cluster with
two nodes, ceph auth list
output looks similar to this:
installed auth entries:
osd.0
        key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.1
        key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
client.bootstrap-mds
        key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
        caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
        key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
        caps: [mon] allow profile bootstrap-osd
Note: TYPE.ID Notation
Note that the TYPE.ID notation for users applies such that osd.0 specifies a user of type osd and its ID is 0. client.admin is a user of type client and its ID is admin. Note also that each entry has a key: value entry, and one or more caps: entries.
You may use the -o filename
option with ceph auth list
to save the output to a
file.
6.2.2.2 Getting Information about Users #
To retrieve a specific user, key, and capabilities, execute the following:
cephadm >
ceph auth get TYPE.ID
For example:
cephadm >
ceph auth get client.admin
exported keyring for client.admin
[client.admin]
key = AQA19uZUqIwkHxAAFuUwvq0eJD4S173oFRxe0g==
caps mds = "allow"
caps mon = "allow *"
caps osd = "allow *"
Developers may also execute the following:
cephadm >
ceph auth export TYPE.ID
The auth export command is identical to auth get, but also prints the internal authentication ID.
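The keyring format printed by ceph auth get is simple enough to read from a script. The Python sketch below parses the layout of the example output above into a dictionary. It is a minimal illustration based only on that layout, and the function name is hypothetical.

```python
def parse_keyring(text: str) -> dict[str, dict[str, str]]:
    """Parse keyring text into {user name: {field: value}}."""
    users: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # Section header such as [client.admin]
            current = users.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            # Split at the first "=" only; keys end with "=" padding.
            field, _, value = line.partition("=")
            current[field.strip()] = value.strip().strip('"')
    return users


sample = """\
[client.admin]
    key = AQA19uZUqIwkHxAAFuUwvq0eJD4S173oFRxe0g==
    caps mds = "allow"
    caps mon = "allow *"
    caps osd = "allow *"
"""

entry = parse_keyring(sample)["client.admin"]
print(entry["caps mon"])
```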
6.2.2.3 Adding Users #
Adding a user creates a user name (TYPE.ID), a secret key, and any capabilities included in the command you use to create the user.
A user's key enables the user to authenticate with the Ceph storage cluster. The user's capabilities authorize the user to read, write, or execute on Ceph monitors (mon), Ceph OSDs (osd), or Ceph metadata servers (mds).
There are a few commands available to add a user:
ceph auth add
This command is the canonical way to add a user. It will create the user, generate a key, and add any specified capabilities.
ceph auth get-or-create
This command is often the most convenient way to create a user, because it returns a keyfile format with the user name (in brackets) and the key. If the user already exists, this command simply returns the user name and key in the keyfile format. You may use the -o filename option to save the output to a file.
ceph auth get-or-create-key
This command is a convenient way to create a user and return the user's key (only). This is useful for clients that need the key only (for example libvirt). If the user already exists, this command simply returns the key. You may use the -o filename option to save the output to a file.
When creating client users, you may create a user with no capabilities. A user with no capabilities can authenticate but nothing more. Such a client cannot retrieve the cluster map from the monitor. However, you can create a user with no capabilities if you want to defer adding capabilities until later using the ceph auth caps command.
A typical user has at least read capabilities on the Ceph monitor and read and write capabilities on Ceph OSDs. Additionally, a user's OSD permissions are often restricted to accessing a particular pool.
cephadm >
ceph auth add client.john mon 'allow r' osd \
 'allow rw pool=liverpool'
cephadm >
ceph auth get-or-create client.paul mon 'allow r' osd \
 'allow rw pool=liverpool'
cephadm >
ceph auth get-or-create client.george mon 'allow r' osd \
 'allow rw pool=liverpool' -o george.keyring
cephadm >
ceph auth get-or-create-key client.ringo mon 'allow r' osd \
 'allow rw pool=liverpool' -o ringo.key
Important
If you provide a user with capabilities to OSDs, but you do not restrict access to particular pools, the user will have access to all pools in the cluster.
6.2.2.4 Modifying User Capabilities #
The ceph auth caps command allows you to specify a user and change the user's capabilities. Setting new capabilities will overwrite current ones. To view current capabilities run ceph auth get USERTYPE.USERID.
To add capabilities, you also need to specify the existing capabilities
when using the following form:
cephadm >
ceph auth caps USERTYPE.USERID daemon 'allow [r|w|x|*|...] \
[pool=pool-name] [namespace=namespace-name]' [daemon 'allow [r|w|x|*|...] \
[pool=pool-name] [namespace=namespace-name]']
For example:
cephadm >
ceph auth get client.john
cephadm >
ceph auth caps client.john mon 'allow r' osd 'allow rw pool=prague'
cephadm >
ceph auth caps client.paul mon 'allow rw' osd 'allow r pool=prague'
cephadm >
ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'
To remove a capability, you may reset the capability. If you want the user to have no access to a particular daemon that was previously set, specify an empty string:
cephadm >
ceph auth caps client.ringo mon ' ' osd ' '
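Because ceph auth caps overwrites all current capabilities, a script that changes one daemon's capabilities has to repeat the others. The Python sketch below merges the current capabilities with the requested changes and builds the full command line. The function name and the sample values are hypothetical, and the sketch only prints the command instead of running it.

```python
import shlex


def caps_command(user: str, current: dict[str, str],
                 changes: dict[str, str]) -> list[str]:
    """Build a 'ceph auth caps' command that keeps unchanged capabilities."""
    # Entries in 'changes' replace the current capability of the same daemon.
    merged = {**current, **changes}
    argv = ["ceph", "auth", "caps", user]
    for daemon in sorted(merged):
        argv += [daemon, merged[daemon]]
    return argv


# Capabilities as reported by 'ceph auth get client.john' (sample values).
current = {"mon": "allow r", "osd": "allow rw pool=liverpool"}
argv = caps_command("client.john", current, {"osd": "allow rw pool=prague"})
print(shlex.join(argv))
```

The monitor capability is carried over unchanged, so the user does not lose it when the OSD capability is replaced.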
6.2.2.5 Deleting Users #
To delete a user, use ceph auth del:
cephadm >
ceph auth del TYPE.ID
where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.
If you created users with permissions strictly for a pool that no longer exists, you should consider deleting those users too.
6.2.2.6 Printing a User's Key #
To print a user’s authentication key to standard output, execute the following:
cephadm >
ceph auth print-key TYPE.ID
where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.
Printing a user's key is useful when you need to populate client software with a user's key (such as libvirt), as in the following example:
root #
mount -t ceph host:/ mount_point \
-o name=client.user,secret=`ceph auth print-key client.user`
6.2.2.7 Importing Users #
To import one or more users, use ceph auth import
and
specify a keyring:
cephadm >
ceph auth import -i /etc/ceph/ceph.keyring
Note
The Ceph storage cluster will add new users, their keys and their capabilities and will update existing users, their keys and their capabilities.
6.2.3 Keyring Management #
When you access Ceph via a Ceph client, the client will look for a local keyring. Ceph presets the keyring setting with the following four keyring names by default so you do not need to set them in your Ceph configuration file unless you want to override the defaults:
/etc/ceph/cluster.name.keyring
/etc/ceph/cluster.keyring
/etc/ceph/keyring
/etc/ceph/keyring.bin
The cluster metavariable is your Ceph cluster name as defined by the name of the Ceph configuration file. ceph.conf means that the cluster name is ceph, thus ceph.keyring. The name metavariable is the user type and user ID, for example client.admin, thus ceph.client.admin.keyring.
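The substitution of the two metavariables can be written out as follows. This Python sketch only expands the four default names listed above and does not check whether the files exist. The function name is hypothetical.

```python
def default_keyring_paths(cluster: str = "ceph",
                          name: str = "client.admin") -> list[str]:
    """Expand the cluster and name metavariables in the default keyring names."""
    return [
        f"/etc/ceph/{cluster}.{name}.keyring",
        f"/etc/ceph/{cluster}.keyring",
        "/etc/ceph/keyring",
        "/etc/ceph/keyring.bin",
    ]


for path in default_keyring_paths():
    print(path)
```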
After you create a user (for example client.ringo), you must get the key and add it to a keyring on a Ceph client so that the user can access the Ceph storage cluster.
Section 6.2, “Key Management” details how to list, get, add,
modify and delete users directly in the Ceph storage cluster. However,
Ceph also provides the ceph-authtool
utility to allow
you to manage keyrings from a Ceph client.
6.2.3.1 Creating a Keyring #
When you use the procedures in Section 6.2, “Key Management” to create users, you need to provide user keys to the Ceph client(s) so that the client can retrieve the key for the specified user and authenticate with the Ceph storage cluster. Ceph clients access keyrings to look up a user name and retrieve the user's key:
cephadm >
ceph-authtool --create-keyring /path/to/keyring
When creating a keyring with multiple users, we recommend using the
cluster name (for example cluster.keyring) for
the keyring file name and saving it in the /etc/ceph
directory so that the keyring configuration default setting will pick up
the file name without requiring you to specify it in the local copy of
your Ceph configuration file. For example, create
ceph.keyring
by executing the following:
cephadm >
ceph-authtool -C /etc/ceph/ceph.keyring
When creating a keyring with a single user, we recommend using the cluster
name, the user type and the user name and saving it in the
/etc/ceph
directory. For example,
ceph.client.admin.keyring
for the
client.admin
user.
6.2.3.2 Adding a User to a Keyring #
When you add a user to the Ceph storage cluster (see Section 6.2.2.3, “Adding Users”), you can retrieve the user, key and capabilities, and save the user to a keyring.
If you only want to use one user per keyring, the ceph auth
get
command with the -o
option will save the
output in the keyring file format. For example, to create a keyring for
the client.admin
user, execute
the following:
cephadm >
ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
When you want to import users to a keyring, you can use
ceph-authtool
to specify the destination keyring and
the source keyring:
cephadm >
ceph-authtool /etc/ceph/ceph.keyring \
--import-keyring /etc/ceph/ceph.client.admin.keyring
Important
If your keyring is compromised, delete your key from the /etc/ceph directory and create a new key using the instructions from Section 6.2.3.1, “Creating a Keyring”.
6.2.3.3 Creating a User #
Ceph provides the ceph auth add
command to create a
user directly in the Ceph storage cluster. However, you can also create
a user, keys and capabilities directly on a Ceph client keyring. Then,
you can import the user to the Ceph storage cluster:
cephadm >
ceph-authtool -n client.ringo --cap osd 'allow rwx' \
--cap mon 'allow rwx' /etc/ceph/ceph.keyring
You can also create a keyring and add a new user to the keyring simultaneously:
cephadm >
ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo \
--cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key
In the previous scenarios, the new user client.ringo exists only in the keyring. To make the user available in the Ceph storage cluster, you must still add it to the cluster:
cephadm >
ceph auth add client.ringo -i /etc/ceph/ceph.keyring
6.2.3.4 Modifying Users #
To modify the capabilities of a user record in a keyring, specify the keyring and the user followed by the capabilities:
cephadm >
ceph-authtool /etc/ceph/ceph.keyring -n client.ringo \
--cap osd 'allow rwx' --cap mon 'allow rwx'
To update the modified user within the Ceph cluster environment, you must import the changes from the keyring to the user entry in the Ceph cluster:
cephadm >
ceph auth import -i /etc/ceph/ceph.keyring
See Section 6.2.2.7, “Importing Users” for details on updating a Ceph storage cluster user from a keyring.
6.2.4 Command Line Usage #
The ceph
command supports the following options related
to the user name and secret manipulation:
--id or --user
Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --id and --user options enable you to specify the ID portion of the user name (for example admin or user1). You can specify the user with --id and omit the type. For example, to specify user client.foo enter the following:
cephadm >
ceph --id foo --keyring /path/to/keyring health
cephadm >
ceph --user foo --keyring /path/to/keyring health
--name or -n
Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --name and -n options enable you to specify the fully qualified user name. You must specify the user type (typically client) with the user ID:
cephadm >
ceph --name client.foo --keyring /path/to/keyring health
cephadm >
ceph -n client.foo --keyring /path/to/keyring health
--keyring
The path to the keyring containing one or more user names and secrets. The --secret option provides the same functionality, but it does not work with Object Gateway, which uses --secret for another purpose. You may retrieve a keyring with ceph auth get-or-create and store it locally. This is a preferred approach, because you can switch user names without switching the keyring path:
cephadm >
rbd map --id foo --keyring /path/to/keyring mypool/myimage
7 Stored Data Management #
The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
CRUSH requires a map of your cluster, and uses the CRUSH Map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.
CRUSH maps contain a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.
After you deploy a Ceph cluster, a default CRUSH Map is generated. It is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH Map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.
For example, if an OSD goes down, a CRUSH Map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.
Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or power to the rack or the network switch rather than the OSDs themselves.
A custom CRUSH Map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) associated with a failed host are in a degraded state.
There are three main sections to a CRUSH Map.
7.1 Devices #
To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.
#devices
device NUM osd.OSD_NAME class CLASS_NAME
For example:
#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd
As a general rule, an OSD daemon maps to a single disk.
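If you need to inspect the device list of a decompiled CRUSH Map in a script, the lines can be matched with a regular expression. The Python sketch below follows the device syntax shown above and treats the class suffix as optional, which is an assumption. The names used are hypothetical.

```python
import re

# device NUM osd.OSD_NAME [class CLASS_NAME]
DEVICE_LINE = re.compile(r"^device\s+(\d+)\s+(osd\.\S+)(?:\s+class\s+(\S+))?$")


def parse_devices(text: str) -> list[tuple[int, str, "str | None"]]:
    """Return (number, OSD name, device class) for every device line."""
    devices = []
    for raw in text.splitlines():
        match = DEVICE_LINE.match(raw.strip())
        if match:
            devices.append((int(match.group(1)), match.group(2), match.group(3)))
    return devices


sample = """\
#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd
"""

for number, osd, device_class in parse_devices(sample):
    print(number, osd, device_class)
```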
7.1.1 Device Classes #
The flexibility of the CRUSH Map in controlling data placement is one of Ceph's strengths. It is also one of the most difficult parts of the cluster to manage. Device classes remove one of the most common reasons for editing CRUSH Maps manually.
7.1.1.1 The CRUSH Management Problem #
Ceph clusters are frequently built with multiple types of storage devices: HDDs, SSDs, NVMes, or even mixed classes of the above. We call these different types of storage devices device classes to avoid confusion with the type property of CRUSH buckets (for example host, rack, row; see Section 7.2, “Buckets” for more details). Ceph OSDs backed by SSDs are much faster than those backed by spinning disks, making them better suited for certain workloads. Ceph makes it easy to create RADOS pools for different data sets or workloads and to assign different CRUSH rules to control data placement for those pools.
Figure 7.1: OSDs with Mixed Device Classes #
However, setting up the CRUSH rules to place data only on a certain class of device is tedious. Rules work in terms of the CRUSH hierarchy, but if the devices are mixed into the same hosts or racks (as in the sample hierarchy above), they will (by default) be mixed together and appear in the same subtrees of the hierarchy. Manually separating them out into separate trees involved creating multiple versions of each intermediate node for each device class in previous versions of SUSE Enterprise Storage.
7.1.1.2 Device Classes #
An elegant solution that Ceph offers is to add a property called
device class to each OSD. By default, OSDs will
automatically set their device classes to either 'hdd', 'ssd', or 'nvme'
based on the hardware properties exposed by the Linux kernel. These device
classes are reported in a new column of the ceph osd
tree
command output:
cephadm >
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 83.17899 root default
-4 23.86200 host cpach
2 hdd 1.81898 osd.2 up 1.00000 1.00000
3 hdd 1.81898 osd.3 up 1.00000 1.00000
4 hdd 1.81898 osd.4 up 1.00000 1.00000
5 hdd 1.81898 osd.5 up 1.00000 1.00000
6 hdd 1.81898 osd.6 up 1.00000 1.00000
7 hdd 1.81898 osd.7 up 1.00000 1.00000
8 hdd 1.81898 osd.8 up 1.00000 1.00000
15 hdd 1.81898 osd.15 up 1.00000 1.00000
10 nvme 0.93100 osd.10 up 1.00000 1.00000
0 ssd 0.93100 osd.0 up 1.00000 1.00000
9 ssd 0.93100 osd.9 up 1.00000 1.00000
If the automatic device class detection fails, for example because the
device driver is not properly exposing information about the device via
/sys/block
, you can adjust device classes from the
command line:
cephadm >
ceph osd crush rm-device-class osd.2 osd.3
done removing class of osd(s): 2,3
cephadm >
ceph osd crush set-device-class ssd osd.2 osd.3
set osd(s) 2,3 to class 'ssd'
7.1.1.3 CRUSH Placement Rules #
CRUSH rules can restrict placement to a specific device class. For example, to create a 'fast' replicated pool that distributes data only over SSD disks, first create a replicated rule for that device class by running the following command:
cephadm >
ceph osd crush rule create-replicated RULE_NAME ROOT FAILURE_DOMAIN_TYPE DEVICE_CLASS
For example:
cephadm >
ceph osd crush rule create-replicated fast default host ssd
Create a pool named 'fast_pool' and assign it to the 'fast' rule:
cephadm >
ceph osd pool create fast_pool 128 128 replicated fast
The process for creating erasure code rules is slightly different. First, you create an erasure code profile that includes a property for your desired device class. Then use that profile when creating the erasure coded pool:
cephadm >
ceph osd erasure-code-profile set myprofile \
 k=4 m=2 crush-device-class=ssd crush-failure-domain=host
cephadm >
ceph osd pool create mypool 64 erasure myprofile
If you need to manually edit the CRUSH Map to customize your rule, the syntax has been extended to allow the device class to be specified. For example, the CRUSH rule generated by the above commands looks as follows:
rule ecpool {
id 2
type erasure
min_size 3
max_size 6
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class ssd
step chooseleaf indep 0 type host
step emit
}
The important difference there is that the 'take' command includes the additional 'class CLASS_NAME' suffix.
7.1.1.4 Additional Commands #
To list device classes used in a CRUSH Map, run:
cephadm >
ceph osd crush class ls
[
"hdd",
"ssd"
]
To list existing CRUSH rules, run:
cephadm >
ceph osd crush rule ls
replicated_rule
fast
To view details of the CRUSH rule named 'fast', run:
cephadm >
ceph osd crush rule dump fast
{
"rule_id": 1,
"rule_name": "fast",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -21,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
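The item_name of the take step in this output combines the root and the device class, separated by a tilde (default~ssd). The Python sketch below extracts both parts from the JSON output. It relies only on the fields visible in the example above, and the function name is hypothetical.

```python
import json
from typing import Optional


def take_targets(rule_dump: str) -> list[tuple[str, Optional[str]]]:
    """Return (root, device class) for each 'take' step of a dumped rule."""
    rule = json.loads(rule_dump)
    targets = []
    for step in rule["steps"]:
        if step["op"] == "take":
            # "default~ssd" -> ("default", "ssd"); no tilde -> class is None
            root, _, device_class = step["item_name"].partition("~")
            targets.append((root, device_class or None))
    return targets


sample = """
{
  "rule_id": 1, "rule_name": "fast", "ruleset": 1, "type": 1,
  "min_size": 1, "max_size": 10,
  "steps": [
    {"op": "take", "item": -21, "item_name": "default~ssd"},
    {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
    {"op": "emit"}
  ]
}
"""

print(take_targets(sample))
```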
To list OSDs that belong to the 'ssd' class, run:
cephadm >
ceph osd crush class ls-osd ssd
0
1
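The rule dump shown earlier in this section can also be inspected programmatically. The following Python sketch parses the JSON printed for the 'fast' rule and reports which root and device class each 'take' step refers to. It assumes that a class-restricted root is reported as ROOT~CLASS in item_name, as in the 'default~ssd' value of the example; the helper name take_targets is hypothetical.

```python
import json

# Output of `ceph osd crush rule dump fast`, copied from the example in this section.
RULE_DUMP = """
{
  "rule_id": 1,
  "rule_name": "fast",
  "ruleset": 1,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
    {"op": "take", "item": -21, "item_name": "default~ssd"},
    {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
    {"op": "emit"}
  ]
}
"""


def take_targets(rule):
    """Return (root, device_class) for every 'take' step of a dumped rule.

    Assumption: a class-restricted root appears as 'ROOT~CLASS'.
    A plain root has no '~' and yields device_class None.
    """
    targets = []
    for step in rule["steps"]:
        if step["op"] != "take":
            continue
        root, _, device_class = step["item_name"].partition("~")
        targets.append((root, device_class or None))
    return targets


rule = json.loads(RULE_DUMP)
print(rule["rule_name"], take_targets(rule))
```

On a live cluster, the same function could be fed the output of the dump command instead of the embedded string.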
7.1.1.5 Migrating from a Legacy SSD Rule to Device Classes #
In SUSE Enterprise Storage prior to version 5, you needed to manually edit the CRUSH Map and maintain a parallel hierarchy for each specialized device type (such as SSD) in order to write rules that apply to these devices. Since SUSE Enterprise Storage 5, the device class feature has enabled this transparently.
You can transform a legacy rule and hierarchy to the new class-based rules by using the crushtool command. There are several types of transformation possible:
crushtool --reclassify-root ROOT_NAME DEVICE_CLASS
This command takes everything in the hierarchy beneath ROOT_NAME and adjusts any rules that reference that root via 'take ROOT_NAME' to instead use 'take ROOT_NAME class DEVICE_CLASS'. It renumbers the buckets so that the old IDs are used for the specified class’s 'shadow tree'. As a consequence, no data movement occurs.
Example 7.1: crushtool --reclassify-root
Consider the following existing rule:
rule replicated_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type rack
step emit
}
If you reclassify the root 'default' as class 'hdd', the rule will become:
rule replicated_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type rack
step emit
}
crushtool --set-subtree-class BUCKET_NAME DEVICE_CLASS
This method marks every device in the subtree rooted at BUCKET_NAME with the specified device class.
--set-subtree-class is normally used in conjunction with the --reclassify-root option to ensure that all devices in that root are labeled with the correct class. However, some of those devices may intentionally have a different class, and therefore you do not want to relabel them. In such cases, exclude the --set-subtree-class option. Keep in mind that such remapping will not be perfect, because the previous rule is distributed across devices of multiple classes but the adjusted rules will only map to devices of the specified device class.
crushtool --reclassify-bucket MATCH_PATTERN DEVICE_CLASS DEFAULT_PATTERN
This method allows merging a parallel type-specific hierarchy with the normal hierarchy. For example, many users have CRUSH Maps similar to the following one:
Example 7.2: crushtool --reclassify-bucket
host node1 {
id -2 # do not change unnecessarily
# weight 109.152
alg straw
hash 0 # rjenkins1
item osd.0 weight 9.096
item osd.1 weight 9.096
item osd.2 weight 9.096
item osd.3 weight 9.096
item osd.4 weight 9.096
item osd.5 weight 9.096
[...]
}
host node1-ssd {
id -10 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item osd.80 weight 2.000
[...]
}
root default {
id -1 # do not change unnecessarily
alg straw
hash 0 # rjenkins1
item node1 weight 110.967
[...]
}
root ssd {
id -18 # do not change unnecessarily
# weight 16.000
alg straw
hash 0 # rjenkins1
item node1-ssd weight 2.000
[...]
}
This function reclassifies each bucket that matches a given pattern. The pattern can look like %suffix or prefix%. In the above example, you would use the pattern %-ssd. For each matched bucket, the remaining portion of the name that matches the '%' wild card specifies the base bucket. All devices in the matched bucket are labeled with the specified device class and then moved to the base bucket. If the base bucket does not exist (for example if 'node12-ssd' exists but 'node12' does not), then it is created and linked underneath the specified default parent bucket. The old bucket IDs are preserved for the new shadow buckets to prevent data movement. Rules with the take steps that reference the old buckets are adjusted.
crushtool --reclassify-bucket BUCKET_NAME DEVICE_CLASS BASE_BUCKET
You can use the --reclassify-bucket option without a wild card to map a single bucket. For example, in the previous example, we want the 'ssd' bucket to be mapped to the default bucket.
The final command to convert the map composed of the above fragments would be as follows:
cephadm >
ceph osd getcrushmap -o original
cephadm >
crushtool -i original --reclassify \
  --set-subtree-class default hdd \
  --reclassify-root default hdd \
  --reclassify-bucket %-ssd ssd default \
  --reclassify-bucket ssd ssd default \
  -o adjusted
To verify that the conversion is correct, there is a --compare option that tests a large sample of inputs to the CRUSH Map and checks whether the same result comes back out. These inputs are controlled by the same options that apply to the --test option. For the above example the command would be as follows:
cephadm >
crushtool -i original --compare adjusted
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent
Tip
If there were differences, you would see what ratio of inputs are remapped in the parentheses.
If you are satisfied with the adjusted CRUSH Map, you can apply it to the cluster:
cephadm >
ceph osd setcrushmap -i adjusted
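The wild card matching that --reclassify-bucket applies to bucket names can be illustrated with a short Python sketch. It only models the name handling described in this section (%suffix and prefix% patterns, with the part matched by '%' naming the base bucket). The function name base_bucket is hypothetical, and requiring a non-empty remainder is an assumption; crushtool's actual implementation may differ in details.

```python
def base_bucket(pattern, bucket_name):
    """Return the base bucket name if bucket_name matches pattern, else None.

    Only the two forms named in the text are supported: '%suffix' and 'prefix%'.
    Assumption: the part matched by '%' must be non-empty.
    """
    if pattern.startswith("%"):
        suffix = pattern[1:]
        if suffix and bucket_name.endswith(suffix) and len(bucket_name) > len(suffix):
            return bucket_name[: -len(suffix)]
        return None
    if pattern.endswith("%"):
        prefix = pattern[:-1]
        if prefix and bucket_name.startswith(prefix) and len(bucket_name) > len(prefix):
            return bucket_name[len(prefix):]
        return None
    raise ValueError("pattern must start or end with '%'")


# Bucket names taken from Example 7.2; 'node12-ssd' is the case from the text.
for name in ["node1", "node1-ssd", "node12-ssd", "default"]:
    print(name, "->", base_bucket("%-ssd", name))
```

Buckets that yield None are left alone by the pattern. Buckets that yield a name would have their devices labeled with the device class and moved to that base bucket.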
7.1.1.6 For More Information #
Find more details on CRUSH Maps in Section 7.4, “CRUSH Map Manipulation”.
Find more details on Ceph pools in general in Chapter 8, Managing Storage Pools.
Find more details about erasure coded pools in Chapter 10, Erasure Coded Pools.
7.2 Buckets #
CRUSH maps contain a list of OSDs, which can be organized into 'buckets' for aggregating the devices into physical locations.
0 | osd | An OSD daemon (osd.1, osd.2, etc.).
1 | host | A host name containing one or more OSDs.
2 | chassis | Chassis of which the rack is composed.
3 | rack | A computer rack. The default is
4 | row | A row in a series of racks.
5 | pdu | Power distribution unit.
6 | pod |
7 | room | A room containing racks and rows of hosts.
8 | datacenter | A physical data center containing rooms.
9 | region |
10 | root |
Tip
You can modify the existing types and create your own bucket types.
Ceph’s deployment tools generate a CRUSH Map that contains a bucket for each host, and a root named 'default', which is useful for the default rbd pool. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.
A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm (straw2 by default), and the hash (0 by default, reflecting CRUSH Hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.
[bucket-type] [bucket-name] {
id [a unique negative numeric ID]
weight [the relative capacity/capability of the item(s)]
alg [the bucket type: uniform | list | tree | straw2 | straw ]
hash [the hash type: 0 by default]
item [item-name] weight [weight]
}
The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.
host ceph-osd-server-1 {
id -17
alg straw2
hash 0
item osd.0 weight 0.546
item osd.1 weight 0.546
}
row rack-1-row-1 {
id -16
alg straw2
hash 0
item ceph-osd-server-1 weight 2.00
}
rack rack-3 {
id -15
alg straw2
hash 0
item rack-3-row-1 weight 2.00
item rack-3-row-2 weight 2.00
item rack-3-row-3 weight 2.00
item rack-3-row-4 weight 2.00
item rack-3-row-5 weight 2.00
}
rack rack-2 {
id -14
alg straw2
hash 0
item rack-2-row-1 weight 2.00
item rack-2-row-2 weight 2.00
item rack-2-row-3 weight 2.00
item rack-2-row-4 weight 2.00
item rack-2-row-5 weight 2.00
}
rack rack-1 {
id -13
alg straw2
hash 0
item rack-1-row-1 weight 2.00
item rack-1-row-2 weight 2.00
item rack-1-row-3 weight 2.00
item rack-1-row-4 weight 2.00
item rack-1-row-5 weight 2.00
}
room server-room-1 {
id -12
alg straw2
hash 0
item rack-1 weight 10.00
item rack-2 weight 10.00
item rack-3 weight 10.00
}
datacenter dc-1 {
id -11
alg straw2
hash 0
item server-room-1 weight 30.00
item server-room-2 weight 30.00
}
root data {
id -10
alg straw2
hash 0
item dc-1 weight 60.00
item dc-2 weight 60.00
}
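When many similar buckets need to be written, the bucket syntax above can be generated rather than typed by hand. The following Python sketch renders one bucket definition from its type, name, ID, and items. The function render_bucket is hypothetical and not part of any Ceph tool; it only produces text in the decompiled CRUSH Map form shown in this section, and the three-decimal weight format is an arbitrary choice.

```python
def render_bucket(bucket_type, name, bucket_id, items, alg="straw2", hash_type=0):
    """Render one bucket in the decompiled CRUSH Map syntax.

    items is a list of (item_name, weight) tuples.
    """
    if bucket_id >= 0:
        # The text states that bucket IDs are negative integers.
        raise ValueError("bucket IDs must be negative integers")
    lines = [
        f"{bucket_type} {name} {{",
        f"id {bucket_id}",
        f"alg {alg}",
        f"hash {hash_type}",
    ]
    lines += [f"item {item} weight {weight:.3f}" for item, weight in items]
    lines.append("}")
    return "\n".join(lines)


print(render_bucket("host", "ceph-osd-server-1", -17,
                    [("osd.0", 0.546), ("osd.1", 0.546)]))
```

The generated text would still need to be placed into a decompiled CRUSH Map and compiled with crushtool before it can be used.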
7.3 Rule Sets #
CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH Map has a rule for the default root. If you want more roots and more rules, you need to create them later or they will be created automatically when new pools are created.
Note
In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.
A rule takes the following form:
rule rulename {
ruleset ruleset
type type
min_size min-size
max_size max-size
step step
}
- ruleset
An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.
- type
A string. Describes a rule for either a 'replicated' or an 'erasure' coded pool. This option is required. Default is replicated.
- min_size
An integer. If a pool makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.
- max_size
An integer. If a pool makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.
- step take bucket
Takes a bucket specified by a name, and begins iterating down the tree. This option is required. For an explanation about iterating through the tree, see Section 7.3.1, “Iterating Through the Node Tree”.
- step target mode num type bucket-type
target can either be choose or chooseleaf. When set to choose, a number of buckets is selected. chooseleaf directly selects the OSDs (leaf nodes) from the sub-tree of each bucket in the set of buckets.
mode can either be firstn or indep. See Section 7.3.2, “firstn and indep”.
Selects the number of buckets of the given type. Where N is the number of options available: if num > 0 and num < N, choose that many buckets; if num < 0, choose N - |num| buckets; and if num == 0, choose N buckets (all available). Follows step take or step choose.
- step emit
Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.
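The meaning of the num argument of a choose or chooseleaf step can be summarized in a few lines of Python. This is a sketch of the rule as worded in the list above, not of CRUSH's implementation. It reads a negative num as "N minus the absolute value of num" and caps positive values at N; both readings are assumptions, and the function name resolve_num is hypothetical.

```python
def resolve_num(num, n):
    """Return how many buckets a choose/chooseleaf step selects.

    num is the value written in the rule; n is the number of options
    available (N in the text).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if num == 0:
        return n                 # all available
    if num < 0:
        return max(n + num, 0)   # N - |num|, never below zero
    return min(num, n)           # assumption: cannot exceed N


for num in (0, 2, -1):
    print(f"num={num:>2}, N=3 -> {resolve_num(num, 3)} buckets")
```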
7.3.1 Iterating Through the Node Tree #
The structure defined with the buckets can be viewed as a node tree. Buckets are nodes and OSDs are leaves in this tree.
Rules in the CRUSH Map define how OSDs are selected from this tree. A rule starts with a node and then iterates down the tree to return a set of OSDs. It is not possible to define which branch needs to be selected. Instead the CRUSH algorithm assures that the set of OSDs fulfills the replication requirements and evenly distributes the data.
With step take bucket the iteration through the node tree begins at the given bucket (not bucket type). If OSDs from all branches in the tree are to be returned, the bucket must be the root bucket. Otherwise the following steps are only iterating through a sub-tree.
After step take one or more step choose entries follow in the rule definition. Each step choose chooses a defined number of nodes (or branches) from the previously selected upper node.
In the end the selected OSDs are returned with step emit.
step chooseleaf is a convenience function that directly selects OSDs from branches of the given bucket.
Figure 7.2, “Example Tree” provides an example of how step is used to iterate through a tree. The orange arrows and numbers correspond to example1a and example1b, while blue corresponds to example2 in the following rule definitions.
Figure 7.2: Example Tree #
# orange arrows
rule example1a {
ruleset 0
type replicated
min_size 2
max_size 10
# orange (1)
step take rack1
# orange (2)
step choose firstn 0 type host
# orange (3)
step choose firstn 1 type osd
step emit
}
rule example1b {
ruleset 0
type replicated
min_size 2
max_size 10
# orange (1)
step take rack1
# orange (2) + (3)
step chooseleaf firstn 0 type host
step emit
}
# blue arrows
rule example2 {
ruleset 0
type replicated
min_size 2
max_size 10
# blue (1)
step take room1
# blue (2)
step chooseleaf firstn 0 type rack
step emit
}
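To make the iteration concrete, the following Python sketch walks a small, hypothetical tree in the way the three example rules describe. It is a deliberately simplified model: real CRUSH picks candidates pseudo-randomly by hash and weight, and a num of 0 stands for the pool's replica count, whereas this sketch simply takes candidates in listed order and treats 0 as "all". The tree layout and all names in TREE are assumptions, because the figure's exact contents are not reproduced in the text.

```python
# Hypothetical tree: bucket name -> (bucket type, children). OSDs are leaves.
TREE = {
    "room1": ("room", ["rack1", "rack2"]),
    "rack1": ("rack", ["host1", "host2"]),
    "rack2": ("rack", ["host3"]),
    "host1": ("host", ["osd.0", "osd.1"]),
    "host2": ("host", ["osd.2", "osd.3"]),
    "host3": ("host", ["osd.4", "osd.5"]),
}


def node_type(name):
    """Buckets carry their type in TREE; everything else is an OSD leaf."""
    return TREE[name][0] if name in TREE else "osd"


def descendants_of_type(start, wanted):
    """All nodes of type `wanted` below `start`, in listed order."""
    found = []
    for child in TREE.get(start, ("", []))[1]:
        if node_type(child) == wanted:
            found.append(child)
        else:
            found.extend(descendants_of_type(child, wanted))
    return found


def choose(current, num, wanted):
    """Stand-in for 'step choose firstn num type wanted' (0 means all here)."""
    result = []
    for node in current:
        candidates = descendants_of_type(node, wanted)
        result.extend(candidates if num == 0 else candidates[:num])
    return result


def chooseleaf(current, num, wanted):
    """Stand-in for 'step chooseleaf': choose buckets, then one OSD below each."""
    return choose(choose(current, num, wanted), 1, "osd")


print("example1a:", choose(choose(["rack1"], 0, "host"), 1, "osd"))
print("example1b:", chooseleaf(["rack1"], 0, "host"))
print("example2: ", chooseleaf(["room1"], 0, "rack"))
```

In this model, example1a and example1b return the same OSDs, which mirrors the statement that chooseleaf is a convenience form of two consecutive choose steps.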
7.3.2 firstn and indep #
A CRUSH rule defines replacements for failed nodes or OSDs (see Section 7.3, “Rule Sets”). The keyword step requires either firstn or indep as parameter. Figure 7.3, “Node Replacement Methods” provides an example.
firstn adds replacement nodes to the end of the list of active nodes. In case of a failed node, the following healthy nodes are shifted to the left to fill the gap of the failed node. This is the default and desired method for replicated pools, because a secondary node already has all data and therefore can take over the duties of the primary node immediately.
indep selects fixed replacement nodes for each active node. The replacement of a failed node does not change the order of the remaining nodes. This is desired for erasure coded pools. In erasure coded pools the data stored on a node depends on its position in the node selection. When the order of nodes changes, all data on affected nodes needs to be relocated.
Figure 7.3: Node Replacement Methods #
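The difference between the two modes can be shown with a toy model in Python. The sketch below assumes an ordered list of candidate nodes and a set of failed nodes. It illustrates only the ordering behavior described above (shift left versus keep positions), not CRUSH's hash-based retry logic. The node numbers and the function names are hypothetical.

```python
def firstn(candidates, n, failed):
    """Healthy nodes in candidate order: survivors shift left over failures."""
    return [c for c in candidates if c not in failed][:n]


def indep(candidates, n, failed):
    """Positions are fixed: only a failed position is refilled from the spares."""
    result = list(candidates[:n])
    spares = [c for c in candidates[n:] if c not in failed]
    for position, node in enumerate(result):
        if node in failed:
            result[position] = spares.pop(0) if spares else None
    return result


candidates = [1, 2, 3, 4, 5, 6, 7]  # hypothetical node IDs in selection order
print("healthy:", firstn(candidates, 5, set()))
print("firstn :", firstn(candidates, 5, {2}))  # nodes 3, 4, 5 move one slot left
print("indep  :", indep(candidates, 5, {2}))   # only slot 1 changes
```

With node 2 failed, firstn changes the position of every node after it, while indep changes only the failed slot. For an erasure coded pool, where each position holds a different chunk, the second behavior avoids relocating data on healthy nodes.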
7.4 CRUSH Map Manipulation #
This section introduces basic ways to manipulate a CRUSH Map, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding, moving, or removing an OSD.
7.4.1 Editing a CRUSH Map #
To edit an existing CRUSH map, do the following:
Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:
cephadm >
ceph osd getcrushmap -o compiled-crushmap-filename
Ceph will output (-o) a compiled CRUSH Map to the file name you specified. Since the CRUSH Map is in a compiled form, you must decompile it first before you can edit it.
Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:
cephadm >
crushtool -d compiled-crushmap-filename \
  -o decompiled-crushmap-filename
Ceph will decompile (-d) the compiled CRUSH Map and output (-o) it to the file name you specified.
Edit at least one of Devices, Buckets and Rules parameters.
Compile a CRUSH Map. To compile a CRUSH Map, execute the following:
cephadm >
crushtool -c decompiled-crushmap-filename \
  -o compiled-crushmap-filename
Ceph will store a compiled CRUSH Map to the file name you specified.
Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:
cephadm >
ceph osd setcrushmap -i compiled-crushmap-filename
Ceph will input the compiled CRUSH Map of the file name you specified as the CRUSH Map for the cluster.
Tip: Use Versioning System
Use a versioning system—such as git or svn—for the exported and modified CRUSH Map files. It makes a possible rollback simple.
Tip: Test the New CRUSH Map
Test the new adjusted CRUSH Map using the crushtool --test command, and compare to the state before applying the new CRUSH Map. You may find the following command switches useful: --show-statistics, --show-mappings, --show-bad-mappings, --show-utilization, --show-utilization-all, --show-choose-tries
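The get, decompile, compile, test, and set sequence described in this section can be collected into one ordered command list, for example to keep it in a runbook or script. The Python sketch below only builds and prints the commands; it does not execute them. The file names are hypothetical defaults, and the test step assumes that crushtool accepts the compiled map via -i together with --test and --show-statistics.

```python
import shlex


def crushmap_edit_commands(compiled="crushmap.bin",
                           decompiled="crushmap.txt",
                           new_compiled="crushmap.new.bin"):
    """Return the CRUSH Map editing workflow as a list of argument lists.

    The manual edit of the decompiled file happens between the second
    and third command.
    """
    return [
        ["ceph", "osd", "getcrushmap", "-o", compiled],
        ["crushtool", "-d", compiled, "-o", decompiled],
        ["crushtool", "-c", decompiled, "-o", new_compiled],
        ["crushtool", "-i", new_compiled, "--test", "--show-statistics"],
        ["ceph", "osd", "setcrushmap", "-i", new_compiled],
    ]


for command in crushmap_edit_commands():
    print(shlex.join(command))
```

Writing the new map to a separate file keeps the original compiled map available for a rollback.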
7.4.2 Add/Move an OSD #
To add or move an OSD in the CRUSH Map of a running cluster, execute the following:
cephadm >
ceph osd crush set id_or_name weight root=pool-name bucket-type=bucket-name ...
- id
An integer. The numeric ID of the OSD. This option is required.
- name
A string. The full name of the OSD. This option is required.
- weight
A double. The CRUSH weight for the OSD. This option is required.
- root
A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required.
- bucket-type
Key/value pairs. You may specify the OSD’s location in the CRUSH hierarchy.
The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.
cephadm >
ceph osd crush set osd.0 1.0 root=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1
7.4.3 Difference between ceph osd reweight and ceph osd crush reweight #
There are two similar commands that change the 'weight' of a Ceph OSD. The context of their usage is different and may cause confusion.
7.4.3.1 ceph osd reweight #
Usage:
cephadm >
ceph osd reweight OSD_NAME NEW_WEIGHT
ceph osd reweight sets an override weight on the Ceph OSD. This value is in the range 0 to 1, and forces CRUSH to re-place (1 - weight) of the data that would otherwise live on this drive. It does not change the weights assigned to the buckets above the OSD, and is a corrective measure in case the normal CRUSH distribution is not working out quite right. For example, if one of your OSDs is at 90% and the others are at 40%, you could reduce this weight to try to compensate for it.
Note: OSD Weight is Temporary
Note that ceph osd reweight is not a persistent setting. When an OSD gets marked out, its weight will be set to 0 and when it gets marked in again, the weight will be changed to 1.
7.4.3.2 ceph osd crush reweight #
Usage:
cephadm >
ceph osd crush reweight OSD_NAME NEW_WEIGHT
ceph osd crush reweight sets the CRUSH weight of the OSD. This weight is an arbitrary value (generally the size of the disk in TB) and controls how much data the system tries to allocate to the OSD.
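The interplay of the two weights can be approximated with a small calculation. The Python sketch below assumes that an OSD's expected share of data is proportional to its CRUSH weight multiplied by its override weight. This is a simplified model for illustration, not the exact placement algorithm, and the OSD names and sizes are hypothetical.

```python
def expected_shares(osds):
    """Approximate each OSD's share of the data.

    osds maps an OSD name to (crush_weight, override_weight).
    Assumption: share is proportional to crush_weight * override_weight.
    """
    effective = {name: crush * override for name, (crush, override) in osds.items()}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Three hypothetical 4 TB OSDs; osd.2 is overfull and gets an override of 0.5.
osds = {"osd.0": (4.0, 1.0), "osd.1": (4.0, 1.0), "osd.2": (4.0, 0.5)}
for name, share in expected_shares(osds).items():
    print(f"{name}: {share:.0%}")
```

In this model, lowering the override weight of one OSD shifts part of its data to the other OSDs without changing any CRUSH weight in the hierarchy.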
7.4.4 Remove an OSD #
To remove an OSD from the CRUSH Map of a running cluster, execute the following:
cephadm >
ceph osd crush remove OSD_NAME
7.4.5 Add a Bucket #
To add a bucket in the CRUSH Map of a running cluster, execute the ceph osd crush add-bucket command:
cephadm >
ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE
7.4.6 Move a Bucket #
To move a bucket to a different location or position in the CRUSH Map hierarchy, execute the following:
cephadm >
ceph osd crush move BUCKET_NAME BUCKET_TYPE=BUCKET_NAME [...]
For example:
cephadm >
ceph osd crush move bucket1 datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
7.4.7 Remove a Bucket #
To remove a bucket from the CRUSH Map hierarchy, execute the following:
cephadm >
ceph osd crush remove BUCKET_NAME
Note: Empty Bucket Only
A bucket must be empty before removing it from the CRUSH hierarchy.
7.5 Scrubbing #
In addition to making multiple copies of objects, Ceph ensures data integrity by scrubbing placement groups (find more information about placement groups in Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 5.5 and Ceph”, Section 1.4.2 “Placement Group”). Ceph scrubbing is analogous to running fsck on the object storage layer. For each placement group, Ceph generates a catalog of all objects and compares each primary object and its replicas to ensure that no objects are missing or mismatched. Daily light scrubbing checks the object size and attributes, while weekly deep scrubbing reads the data and uses checksums to ensure data integrity.
Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:
osd max scrubs
The maximum number of simultaneous scrub operations for a Ceph OSD. Default is 1.
osd scrub begin hour, osd scrub end hour
The hours of day (0 to 24) that define a time window when the scrubbing can happen. By default begins at 0 and ends at 24.
Important
If the placement group’s scrub interval exceeds the osd scrub max interval setting, the scrub will happen no matter what time window you define for scrubbing.
osd scrub during recovery
Allows scrubs during recovery. Setting this to 'false' will disable scheduling new scrubs while there is an active recovery. Already running scrubs will continue. This option is useful for reducing load on busy clusters. Default is 'true'.
osd scrub thread timeout
The maximum time in seconds before a scrub thread times out. Default is 60.
osd scrub finalize thread timeout
The maximum time in seconds before a scrub finalize thread times out. Default is 60*10.
osd scrub load threshold
The normalized maximum load. Ceph will not scrub when the system load (as defined by the ratio of getloadavg() / number of online cpus) is higher than this number. Default is 0.5.
osd scrub min interval
The minimal interval in seconds for scrubbing Ceph OSD when the Ceph cluster load is low. Default is 60*60*24 (once a day).
osd scrub max interval
The maximum interval in seconds for scrubbing Ceph OSD irrespective of cluster load. Default is 7*60*60*24 (once a week).
osd scrub chunk min
The minimal number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during scrub. Default is 5.
osd scrub chunk max
The maximum number of object store chunks to scrub during a single operation. Default is 25.
osd scrub sleep
Time to sleep before scrubbing the next group of chunks. Increasing this value slows down the whole scrub operation while client operations are less impacted. Default is 0.
osd deep scrub interval
The interval for 'deep' scrubbing (fully reading all data). The osd scrub load threshold option does not affect this setting. Default is 60*60*24*7 (once a week).
osd scrub interval randomize ratio
Add a random delay to the osd scrub min interval value when scheduling the next scrub job for a placement group. The delay is a random value smaller than the result of osd scrub min interval * osd scrub interval randomize ratio. Therefore, the default setting practically randomly spreads the scrubs out in the allowed time window of [1, 1.5] * osd scrub min interval. Default is 0.5.
osd deep scrub stride
Read size when doing a deep scrub. Default is 524288 (512 kB).
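Two of the settings above involve a small amount of arithmetic: the randomized scheduling window and the load threshold. The Python sketch below works both out using the default values quoted in this section. The function names are hypothetical, and the code only restates the descriptions; it is not how the OSD schedules scrubs internally.

```python
def scrub_window(min_interval=60 * 60 * 24, randomize_ratio=0.5):
    """Return (earliest, latest) delay in seconds before the next scrub.

    The random delay is smaller than min_interval * randomize_ratio, so the
    window is [1, 1 + ratio] * min_interval.
    """
    return (min_interval, min_interval * (1 + randomize_ratio))


def load_allows_scrub(loadavg, online_cpus, threshold=0.5):
    """Scrubbing is skipped when loadavg / online CPUs is higher than the threshold."""
    return loadavg / online_cpus <= threshold


earliest, latest = scrub_window()
print(f"next scrub between {earliest / 3600:.0f} h and {latest / 3600:.0f} h")
print("load 1.0 on 4 CPUs:", load_allows_scrub(1.0, 4))
print("load 3.0 on 4 CPUs:", load_allows_scrub(3.0, 4))
```

With the defaults, the next scrub of a placement group is scheduled between 24 and 36 hours after the previous one, provided the normalized load stays at or below 0.5.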
8 Managing Storage Pools #
Ceph stores data within pools. Pools are logical groups for storing objects. When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. The following important highlights relate to Ceph pools:
Resilience: You can set how many OSDs, buckets, or leaves are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. New pools are created with a default count of replicas set to 3. For erasure coded pools, it is the number of coding chunks (that is m=2 in the erasure code profile).
Placement Groups: are internal data structures for storing data in a pool across OSDs. The way Ceph stores data into PGs is defined in a CRUSH Map. You can set the number of placement groups for a pool at its creation. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.
CRUSH Rules: When you store data in a pool, objects and their replicas (or chunks in case of erasure coded pools) are placed according to the CRUSH ruleset mapped to the pool. You can create a custom CRUSH rule for your pool.
Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.
To organize data into pools, you can list, create, and remove pools. You can also view the usage statistics for each pool.
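The guideline of approximately 100 placement groups per OSD can be turned into a rough starting value for pg_num. The Python sketch below divides the target by the number of replicas, because every replica of a placement group occupies an OSD, and rounds up to a power of two. The rounding is a common convention and an assumption here, as is the function name; for erasure coded pools, k + m would take the place of the replica count. When several pools share the same OSDs, the result needs to be divided among them.

```python
def suggested_pg_num(num_osds, replicas=3, target_pgs_per_osd=100):
    """Suggest a pg_num for a single pool that spans all OSDs.

    Assumption: the result is rounded up to the next power of two.
    """
    if num_osds <= 0 or replicas <= 0:
        raise ValueError("num_osds and replicas must be positive")
    raw = num_osds * target_pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


for osds in (3, 10, 40):
    print(f"{osds} OSDs, 3 replicas -> pg_num {suggested_pg_num(osds)}")
```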
8.1 Associate Pools with an Application #
Before using pools, you need to associate them with an application. Pools that will be used with CephFS, or pools that are automatically created by Object Gateway are automatically associated.
For other cases, you can manually associate a free-form application name with a pool:
cephadm >
ceph osd pool application enable pool_name application_name
Tip: Default Application Names
CephFS uses the application name cephfs, RADOS Block Device uses rbd, and Object Gateway uses rgw.
A pool can be associated with multiple applications, and each application can have its own metadata. You can display the application metadata for a given pool using the following command:
cephadm >
ceph osd pool application get pool_name
8.2 Operating Pools #
This section introduces practical information to perform basic tasks with pools. You can find out how to list, create, and delete pools, as well as show pool statistics or manage snapshots of a pool.
8.2.1 List Pools #
To list your cluster’s pools, execute:
cephadm >
ceph osd pool ls
8.2.2 Create a Pool #
A pool can either be 'replicated' to recover from lost OSDs by keeping multiple copies of the objects or 'erasure' to get a kind of generalized RAID5/6 capability. The replicated pools require more raw storage, while erasure coded pools require less raw storage. Default is 'replicated'.
To create a replicated pool, execute:
cephadm >
ceph osd pool create pool_name pg_num pgp_num replicated crush_ruleset_name \
expected_num_objects
To create an erasure coded pool, execute:
cephadm >
ceph osd pool create pool_name pg_num pgp_num erasure erasure_code_profile \
crush_ruleset_name expected_num_objects
The ceph osd pool create command can fail if you exceed the limit of placement groups per OSD. The limit is set with the option mon_max_pg_per_osd.
- pool_name
The name of the pool. It must be unique. This option is required.
- pg_num
The total number of placement groups for the pool. This option is required. Default value is 8.
- pgp_num
The total number of placement groups for placement purposes. This should be equal to the total number of placement groups, except for placement group splitting scenarios. This option is required. Default value is 8.
- crush_ruleset_name
The name of the crush ruleset for this pool. If the specified ruleset does not exist, the creation of a replicated pool will fail with -ENOENT. For replicated pools it is the ruleset specified by the osd pool default crush replicated ruleset configuration variable. This ruleset must exist. For erasure pools it is 'erasure-code' if the default erasure code profile is used or POOL_NAME otherwise. This ruleset will be created implicitly if it does not exist already.
- erasure_code_profile=profile
For erasure coded pools only. Use the erasure code profile. It must be an existing profile as defined by osd erasure-code-profile set.
When you create a pool, set the number of placement groups to a reasonable value. Consider the total number of placement groups per OSD too. Placement groups are computationally expensive, so performance will degrade when you have many pools with many placement groups (for example 50 pools with 100 placement groups each).
See Placement Groups for details on calculating an appropriate number of placement groups for your pool.
- expected_num_objects
The expected number of objects for this pool. By setting this value (together with a negative filestore merge threshold), the PG folder splitting happens at the pool creation time. This avoids the latency impact with a runtime folder splitting.
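Because pool creation can fail when the per-OSD placement group limit is exceeded, it can help to estimate the load before running ceph osd pool create. The Python sketch below sums pg_num times the replica count over all pools and divides by the number of OSDs. This approximates the check and assumes that placement groups are spread evenly; the pool values and the limit of 200 used in the example are hypothetical, so substitute the mon_max_pg_per_osd value configured on your cluster.

```python
def pgs_per_osd_after_create(existing_pools, new_pg_num, new_size, num_osds):
    """Estimate placement group replicas per OSD once the new pool exists.

    existing_pools is a list of (pg_num, size) tuples, where size is the
    replica count (or k + m for erasure coded pools).
    """
    total = sum(pg_num * size for pg_num, size in existing_pools)
    total += new_pg_num * new_size
    return total / num_osds


def would_exceed_limit(existing_pools, new_pg_num, new_size, num_osds,
                       mon_max_pg_per_osd):
    """True if the estimate is above the configured per-OSD limit."""
    estimate = pgs_per_osd_after_create(existing_pools, new_pg_num, new_size, num_osds)
    return estimate > mon_max_pg_per_osd


existing = [(128, 3)]  # one hypothetical pool: 128 PGs, 3 replicas
for pg_num in (128, 256):
    estimate = pgs_per_osd_after_create(existing, pg_num, 3, 4)
    verdict = would_exceed_limit(existing, pg_num, 3, 4, mon_max_pg_per_osd=200)
    print(f"pg_num {pg_num}: about {estimate:.0f} PGs per OSD, exceeds limit: {verdict}")
```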
8.2.3 Set Pool Quotas #
You can set pool quotas for the maximum number of bytes and/or the maximum number of objects per pool.
cephadm >
ceph osd pool set-quota pool-name max_objects obj-count max_bytes bytes
For example:
cephadm >
ceph osd pool set-quota data max_objects 10000
To remove a quota, set its value to 0.
8.2.4 Delete a Pool #
Warning: Pool Deletion is Not Reversible
Pools may contain important data. Deleting a pool causes all data in the pool to disappear, and there is no way to recover it.
Because inadvertent pool deletion is a real danger, Ceph implements two mechanisms that prevent pools from being deleted. Both mechanisms must be disabled before a pool can be deleted.
The first mechanism is the NODELETE flag. Each pool has this flag, and its default value is 'false'. To find out the value of this flag on a pool, run the following command:
cephadm >
ceph osd pool get pool_name nodelete
If it outputs nodelete: true, it is not possible to delete the pool until you change the flag using the following command:
cephadm >
ceph osd pool set pool_name nodelete false
The second mechanism is the cluster-wide configuration parameter mon allow pool delete, which defaults to 'false'. This means that, by default, it is not possible to delete a pool. The error message displayed is:
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
To delete the pool in spite of this safety setting, you can temporarily set mon allow pool delete to 'true', delete the pool, and then return the parameter to 'false':
cephadm >
ceph tell mon.* injectargs --mon-allow-pool-delete=true
cephadm >
ceph osd pool delete pool_name pool_name --yes-i-really-really-mean-it
cephadm >
ceph tell mon.* injectargs --mon-allow-pool-delete=false
The injectargs command displays the following message:
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart)
This is merely confirming that the command was executed successfully. It is not an error.
If you created your own rulesets and rules for a pool you created, you should consider removing them when you no longer need your pool.
8.2.5 Rename a Pool #
To rename a pool, execute:
cephadm >
ceph osd pool rename current-pool-name new-pool-name
If you rename a pool and you have per-pool capabilities for an authenticated user, you must update the user’s capabilities with the new pool name.
8.2.6 Show Pool Statistics #
To show a pool’s usage statistics, execute:
cephadm >
rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
cold-storage - 228 1 0 0 0 0 0 1 228
data - 1 4 0 0 0 0 0 4 4
hot-storage - 1 2 0 0 0 15 10 5 231
metadata - 0 0 0 0 0 0 0 0 0
pool1 - 0 0 0 0 0 0 0 0 0
rbd - 0 0 0 0 0 0 0 0 0
total used 266268 7
total avail 27966296
total space 28232564
8.2.7 Get Pool Values #
To get a value from a pool, execute:
cephadm >
ceph osd pool get pool-name key
You can get values for keys listed in Section 8.2.8, “Set Pool Values” plus the following keys:
- pg_num
The number of placement groups for the pool.
- pgp_num
The effective number of placement groups to use when calculating data placement. Valid range is equal to or less than pg_num.
Tip: All Pool's Values
To list all values related to a specific pool, run:
cephadm >
ceph osd pool get POOL_NAME all
8.2.8 Set Pool Values #
To set a value to a pool, execute:
cephadm >
ceph osd pool set pool-name key value
You may set values for the following keys:
- size
Sets the number of replicas for objects in the pool. See Section 8.2.9, “Set the Number of Object Replicas” for further details. Replicated pools only.
- min_size
Sets the minimum number of replicas required for I/O. See Section 8.2.9, “Set the Number of Object Replicas” for further details. Replicated pools only.
- crash_replay_interval
The number of seconds to allow clients to replay acknowledged, but uncommitted requests.
- pg_num
The number of placement groups for the pool. If you add new OSDs to the cluster, verify the value for placement groups on all pools targeted for the new OSDs. For details, refer to Section 8.2.10, “Increasing the Number of Placement Groups”.
- pgp_num
The effective number of placement groups to use when calculating data placement.
- crush_ruleset
The ruleset to use for mapping object placement in the cluster.
- hashpspool
Set (1) or unset (0) the HASHPSPOOL flag on a given pool. Enabling this flag changes the algorithm to better distribute PGs to OSDs. After enabling this flag on a pool whose HASHPSPOOL flag was set to the default 0, the cluster starts backfilling to have a correct placement of all PGs again. Be aware that this can create a substantial I/O load on a cluster. Therefore, do not change the flag from 0 to 1 on highly loaded production clusters.
- nodelete
Prevents the pool from being removed.
- nopgchange
Prevents the pool's pg_num and pgp_num from being changed.
- nosizechange
Prevents the pool's size from being changed.
- write_fadvise_dontneed
Set/Unset the WRITE_FADVISE_DONTNEED flag on a given pool.
- noscrub,nodeep-scrub
Disables (deep)-scrubbing of the data for the specific pool to resolve temporary high I/O load.
- hit_set_type
Enables hit set tracking for cache pools. See Bloom Filter for additional information. This option can have the following values: bloom, explicit_hash, explicit_object. Default is bloom, other values are for testing only.
- hit_set_count
The number of hit sets to store for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon. Default is 0.
- hit_set_period
The duration of a hit set period in seconds for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon.
- hit_set_fpp
The false positive probability for the bloom hit set type. See Bloom Filter for additional information. Valid range is 0.0 - 1.0. Default is 0.05.
- use_gmt_hitset
Force OSDs to use GMT (Greenwich Mean Time) time stamps when creating a hit set for cache tiering. This ensures that nodes in different time zones return the same result. Default is 1. This value should not be changed.
- cache_target_dirty_ratio
The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool. Default is 0.4.
- cache_target_dirty_high_ratio
The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool with a higher speed. Default is 0.6.
- cache_target_full_ratio
The percentage of the cache pool containing unmodified (clean) objects before the cache tiering agent will evict them from the cache pool. Default is 0.8.
- target_max_bytes
Ceph will begin flushing or evicting objects when the max_bytes threshold is triggered.
- target_max_objects
Ceph will begin flushing or evicting objects when the max_objects threshold is triggered.
- hit_set_grade_decay_rate
Temperature decay rate between two successive hit_sets. Default is 20.
- hit_set_search_last_n
Count at most N appearances in hit_sets for temperature calculation. Default is 1.
- cache_min_flush_age
The time (in seconds) before the cache tiering agent will flush an object from the cache pool to the storage pool.
- cache_min_evict_age
The time (in seconds) before the cache tiering agent will evict an object from the cache pool.
- fast_read
If this flag is enabled on erasure coding pools, then the read request issues sub-reads to all shards, and waits until it receives enough shards to decode to serve the client. In the case of jerasure and isa erasure plug-ins, when the first K replies return, then the client's request is served immediately using the data decoded from these replies. This approach causes more CPU load and less disk/network load. Currently, this flag is only supported for erasure coding pools. Default is 0.
- scrub_min_interval
The minimum interval in seconds for pool scrubbing when the cluster load is low. The default 0 means that the osd_scrub_min_interval value from the Ceph configuration file is used.
- scrub_max_interval
The maximum interval in seconds for pool scrubbing, regardless of the cluster load. The default 0 means that the osd_scrub_max_interval value from the Ceph configuration file is used.
- deep_scrub_interval
The interval in seconds for the pool deep scrubbing. The default 0 means that the osd_deep_scrub_interval value from the Ceph configuration file is used.
8.2.9 Set the Number of Object Replicas #
To set the number of object replicas on a replicated pool, execute the following:
cephadm >
ceph osd pool set poolname size num-replicas
The num-replicas value includes the object itself. For example, if you want the object and two copies of the object for a total of three instances of the object, specify 3.
Warning: Do not Set Less than 3 Replicas
If you set the num-replicas to 2, there will be only one copy of your data. If you lose one object instance, you need to trust that the other copy has not been corrupted, for example since the last scrubbing, during recovery (refer to Section 7.5, “Scrubbing” for details).
Setting a pool to one replica means that there is exactly one instance of the data object in the pool. If the OSD fails, you lose the data. A possible usage for a pool with one replica is storing temporary data for a short time.
Tip: Setting More than 3 Replicas
Setting 4 replicas for a pool increases the reliability by 25%.
In case of two data centers, you need to set at least 4 replicas for a pool to have two copies in each data center. If one data center is lost, two copies still exist and you can still lose one disk without losing data.
Note
An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, you should use the min_size setting. For example:
cephadm >
ceph osd pool set data min_size 2
This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.
Tip: Get the Number of Object Replicas
To get the number of object replicas, execute the following:
cephadm >
ceph osd dump | grep 'replicated size'
Ceph will list the pools, with the replicated size attribute highlighted. By default, Ceph creates two replicas of an object (a total of three copies, or a size of 3).
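To read the replica settings of several pools at once, the pool lines of the dump can be reduced to a short table. The following is a minimal sketch that assumes each pool line has the shape `pool ID 'NAME' replicated size N min_size M ...`; the function name and the sample line are illustrative, not part of Ceph.

```shell
# Print "NAME SIZE MIN_SIZE" for every replicated pool line read from stdin.
# Assumes lines shaped like:  pool 1 'data' replicated size 3 min_size 2 ...
pool_replicas() {
    tr -d "'" | awk '$1 == "pool" && $4 == "replicated" {
        size = ""; min = ""
        for (i = 5; i < NF; i++) {
            if ($i == "size")     size = $(i + 1)
            if ($i == "min_size") min  = $(i + 1)
        }
        print $3, size, min
    }'
}

# Hypothetical sample line; on a live cluster use: ceph osd dump | pool_replicas
printf "pool 1 'data' replicated size 3 min_size 2 crush_ruleset 0\n" | pool_replicas
```

Lines for erasure coded pools are skipped, because size and min_size as described in this section apply to replicated pools.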
8.2.10 Increasing the Number of Placement Groups #
When creating a new pool, you specify the number of placement groups for the pool (see Section 8.2.2, “Create a Pool”). After adding more OSDs to the cluster, you usually need to increase the number of placement groups as well for performance and data durability reasons. For each placement group, OSD and monitor nodes need memory, network and CPU at all times and even more during recovery. It follows that minimizing the number of placement groups saves significant amounts of resources. On the other hand, too small a number of placement groups causes unequal data distribution among OSDs.
Warning: Too High Value of pg_num
When changing the pg_num value for a pool, it may happen that the new number of placement groups exceeds the allowed limit. For example:
cephadm >
ceph osd pool set rbd pg_num 4096
Error E2BIG: specified pg_num 3500 is too large (creating 4096 new PGs \
on ~64 OSDs exceeds per-OSD max of 32)
The limit prevents extreme placement group splitting, and is derived from the mon_osd_max_split_count value.
Important: Reducing the Number of Placement Groups not Possible
While increasing the number of placement groups on a pool is possible at any time, reducing is not possible.
Determining the right new number of placement groups for a resized cluster is a complex task. One approach is to continuously grow the number of placement groups up to the state when the cluster performance is satisfactory. To determine the new incremented number of placement groups, you need to get the value of the mon_osd_max_split_count parameter (default is 32), and add it to the current number of placement groups. To give you a basic idea, take a look at the following script:
cephadm >
max_inc=`ceph daemon mon.a config get mon_osd_max_split_count 2>&1 \
 | tr -d '\n ' | sed 's/.*"\([[:digit:]]\+\)".*/\1/'`
cephadm >
pg_num=`ceph osd pool get POOL_NAME pg_num | cut -f2 -d: | tr -d ' '`
cephadm >
echo "current pg_num value: $pg_num, max increment: $max_inc"
cephadm >
next_pg_num="$(($pg_num+$max_inc))"
cephadm >
echo "next allowed pg_num value: $next_pg_num"
After finding out the next number of placement groups, increase it with
cephadm >
ceph osd pool set POOL_NAME pg_num NEXT_PG_NUM
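The arithmetic of that script can be tried without a cluster. The sketch below separates the parsing and the addition into two small functions; the function names and the sample value `pg_num: 128` are assumptions for illustration, and 32 is the default of mon_osd_max_split_count mentioned above.

```shell
# Extract the number from output shaped like "pg_num: 128".
parse_pg_num() {
    cut -f2 -d: | tr -d ' \n'
}

# Add the allowed increment ($2) to the current placement group count ($1).
next_pg_num() {
    echo $(( $1 + $2 ))
}

# Sample output, not taken from a live cluster.
current=$(echo "pg_num: 128" | parse_pg_num)
next_pg_num "$current" 32
```

On a live cluster, the input of `parse_pg_num` would come from `ceph osd pool get POOL_NAME pg_num`.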
8.3 Pool Migration #
When creating a pool (see Section 8.2.2, “Create a Pool”) you need to specify its initial parameters, such as the pool type or the number of placement groups. If you later decide to change any of these parameters (for example when converting a replicated pool into an erasure coded one, or decreasing the number of placement groups), you need to migrate the pool data to another pool whose parameters suit your deployment.
This section describes two migration methods: a cache tier method for general pool data migration, and a method using rbd migration sub-commands to migrate RBD images to a new pool. Each method has its specifics and limitations.
8.3.1 Limitations #
You can use the cache tier method to migrate from a replicated pool to either an EC pool or another replicated pool. Migrating from an EC pool is not supported.
You cannot migrate RBD images and CephFS exports from a replicated pool to an EC pool. The reason is that EC pools do not support omap, while RBD and CephFS use omap to store their metadata. For example, the header object of the RBD will fail to be flushed. But you can migrate data to an EC pool, leaving metadata in a replicated pool.
The rbd migration method allows migrating images with minimal client downtime. You only need to stop the client before the prepare step and start it afterward. Note that only a librbd client that supports this feature (Ceph Nautilus or newer) will be able to open the image just after the prepare step, while older librbd clients or the krbd clients will not be able to open the image until the commit step is executed.
8.3.2 Migrate Using Cache Tier #
The principle is simple—include the pool that you need to migrate into a cache tier in reverse order. Find more details on cache tiers in Chapter 11, Cache Tiering. The following example migrates a replicated pool named 'testpool' to an erasure coded pool:
Procedure 8.1: Migrating Replicated to Erasure Coded Pool #
Create a new erasure coded pool named 'newpool'. Refer to Section 8.2.2, “Create a Pool” for detailed explanation of pool creation parameters.
cephadm >
ceph osd pool create newpool PG_NUM PGP_NUM erasure default
Verify that the used client keyring provides at least the same capabilities for 'newpool' as it does for 'testpool'.
Now you have two pools: the original replicated 'testpool' filled with data, and the new empty erasure coded 'newpool':
Figure 8.1: Pools before Migration #
Set up the cache tier and configure the replicated pool 'testpool' as a cache pool. The --force-nonempty option allows adding a cache tier even if the pool already has data:
cephadm >
ceph tell mon.* injectargs \
 '--mon_debug_unsafe_allow_tier_with_nonempty_snaps=1'
cephadm >
ceph osd tier add newpool testpool --force-nonempty
cephadm >
ceph osd tier cache-mode testpool proxy
Figure 8.2: Cache Tier Setup #
Force the cache pool to move all objects to the new pool:
cephadm >
rados -p testpool cache-flush-evict-all
Figure 8.3: Data Flushing #
Until all the data has been flushed to the new erasure coded pool, you need to specify an overlay so that objects are searched on the old pool:
cephadm >
ceph osd tier set-overlay newpool testpool
With the overlay, all operations are forwarded to the old replicated 'testpool':
Figure 8.4: Setting Overlay #
Now you can switch all the clients to access objects on the new pool.
After all data is migrated to the erasure coded 'newpool', remove the overlay and the old cache pool 'testpool':
cephadm >
ceph osd tier remove-overlay newpool
cephadm >
ceph osd tier remove newpool testpool
Figure 8.5: Migration Complete #
Finally, turn the mon_debug_unsafe_allow_tier_with_nonempty_snaps option off again:
cephadm >
ceph tell mon.* injectargs \
 '--mon_debug_unsafe_allow_tier_with_nonempty_snaps=0'
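Before running the procedure for real pool names, it can help to print the tiering commands in order and review them. The following sketch only prints text and executes nothing; the function name is hypothetical, and pool creation and the two `injectargs` calls from the procedure are intentionally left out.

```shell
# Print (do not execute) the tiering commands for migrating OLD_POOL to NEW_POOL.
# Usage: cache_tier_migration_plan OLD_POOL NEW_POOL
cache_tier_migration_plan() {
    old=$1
    new=$2
    cat <<EOF
ceph osd tier add $new $old --force-nonempty
ceph osd tier cache-mode $old proxy
rados -p $old cache-flush-evict-all
ceph osd tier set-overlay $new $old
ceph osd tier remove-overlay $new
ceph osd tier remove $new $old
EOF
}

cache_tier_migration_plan testpool newpool
```

The printed order follows the procedure above. Review each line and run the commands one at a time, checking the cluster state between steps.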
8.3.3 Migrating RBD Images #
The following is the recommended way to migrate RBD images from one replicated pool to another replicated pool.
Stop clients (such as a virtual machine) from accessing the RBD image.
Create a new image in the target pool, with the parent set to the source image:
cephadm >
rbd migration prepare SRC_POOL/IMAGE TARGET_POOL/IMAGE
Tip: Migrate Only Data to an EC Pool
If you need to migrate only the image data to a new EC pool and leave the metadata in the original replicated pool, run the following command instead:
cephadm >
rbd migration prepare SRC_POOL/IMAGE \
 --data-pool TARGET_POOL/IMAGE
Let clients access the image in the target pool.
Migrate data to the target pool:
cephadm >
rbd migration execute SRC_POOL/IMAGE
Remove the old image:
cephadm >
rbd migration commit SRC_POOL/IMAGE
8.4 Pool Snapshots #
Pool snapshots are snapshots of the state of the whole Ceph pool. With pool snapshots, you can retain the history of the pool's state. Creating pool snapshots consumes storage space proportional to the pool size. Always check the related storage for enough disk space before creating a snapshot of a pool.
8.4.1 Make a Snapshot of a Pool #
To make a snapshot of a pool, run:
cephadm >
ceph osd pool mksnap POOL-NAME SNAP-NAME
For example:
cephadm >
ceph osd pool mksnap pool1 snap1
created pool pool1 snap snap1
8.4.2 List Snapshots of a Pool #
To list existing snapshots of a pool, run:
cephadm >
rados lssnap -p POOL_NAME
For example:
cephadm >
rados lssnap -p pool1
1 snap1 2018.12.13 09:36:20
2 snap2 2018.12.13 09:46:03
2 snaps
8.4.3 Remove a Snapshot of a Pool #
To remove a snapshot of a pool, run:
cephadm >
ceph osd pool rmsnap POOL-NAME SNAP-NAME
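When several snapshots need to be removed, their names can be taken from the `rados lssnap` listing. The sketch below assumes the output layout shown in Section 8.4.2 (ID, name, date, time on each snapshot row, followed by a summary row); the function name is illustrative.

```shell
# Print only the snapshot names from `rados lssnap` style output.
# Snapshot rows have four fields (ID, name, date, time); the summary row has two.
snap_names() {
    awk 'NF == 4 { print $2 }'
}

# Sample rows copied from the listing in Section 8.4.2.
# On a cluster: rados lssnap -p pool1 | snap_names
printf '%s\n' \
    '1 snap1 2018.12.13 09:36:20' \
    '2 snap2 2018.12.13 09:46:03' \
    '2 snaps' | snap_names
```

Each printed name can then be passed as SNAP-NAME to `ceph osd pool rmsnap`.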
8.5 Data Compression #
BlueStore (find more details in Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 5.5 and Ceph”, Section 1.5 “BlueStore”) provides on-the-fly data compression to save disk space. The compression ratio depends on the data stored in the system. Note that compression and decompression require additional CPU power.
You can configure data compression globally (see Section 8.5.3, “Global Compression Options”), and then override specific compression settings for each individual pool.
You can enable or disable pool data compression, or change the compression algorithm and mode at any time, regardless of whether the pool contains data or not.
No compression will be applied to existing data after enabling the pool compression.
After disabling the compression of a pool, all its data will be decompressed.
8.5.1 Enable Compression #
To enable data compression for a pool named POOL_NAME, run the following command:
cephadm >
ceph osd pool set POOL_NAME compression_algorithm COMPRESSION_ALGORITHM
cephadm >
ceph osd pool set POOL_NAME compression_mode COMPRESSION_MODE
Tip: Disabling Pool Compression
To disable data compression for a pool, use 'none' as the compression algorithm:
cephadm >
ceph osd pool set POOL_NAME compression_algorithm none
8.5.2 Pool Compression Options #
A full list of compression settings:
- compression_algorithm
Possible values are none, zstd, snappy. Default is snappy.
Which compression algorithm to use depends on the specific use case. Several recommendations follow:
Use the default snappy as long as you do not have a good reason to change it.
zstd offers a good compression ratio, but causes high CPU overhead when compressing small amounts of data.
Run a benchmark of these algorithms on a sample of your actual data while keeping an eye on the CPU and memory usage of your cluster.
- compression_mode
Possible values are none, aggressive, passive, force. Default is none.
none: compress never
passive: compress if hinted COMPRESSIBLE
aggressive: compress unless hinted INCOMPRESSIBLE
force: compress always
- compression_required_ratio
Value: Double, Ratio = SIZE_COMPRESSED / SIZE_ORIGINAL. Default is 0.875, which means that if the compression does not reduce the occupied space by at least 12.5%, the object will not be compressed.
Objects above this ratio will not be stored compressed because of the low net gain.
- compression_max_blob_size
Value: Unsigned Integer, size in bytes. Default: 0
Maximum size of objects that are compressed.
- compression_min_blob_size
Value: Unsigned Integer, size in bytes. Default: 0
Minimum size of objects that are compressed.
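The effect of compression_required_ratio can be checked with simple arithmetic. The following sketch is not how BlueStore is implemented; it only restates the rule Ratio = SIZE_COMPRESSED / SIZE_ORIGINAL with integers, expressing the ratio in thousandths so that 875 stands for the default 0.875. The function name and the byte counts are made up for illustration.

```shell
# Succeeds (exit status 0) if SIZE_COMPRESSED / SIZE_ORIGINAL <= ratio.
# Usage: worth_compressing SIZE_ORIGINAL SIZE_COMPRESSED [RATIO_IN_THOUSANDTHS]
worth_compressing() {
    original=$1
    compressed=$2
    permille=${3:-875}
    [ $(( compressed * 1000 )) -le $(( original * permille )) ]
}

# 3500/4096 is below 0.875, 3700/4096 is above it.
worth_compressing 4096 3500 && echo "4096 -> 3500 bytes: stored compressed"
worth_compressing 4096 3700 || echo "4096 -> 3700 bytes: stored uncompressed"
```

With the default ratio, a 4096-byte object must shrink to 3584 bytes or less to be stored compressed.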
8.5.3 Global Compression Options #
The following configuration options can be set in the Ceph configuration and apply to all OSDs and not only a single pool. The pool-specific configuration listed in Section 8.5.2, “Pool Compression Options” takes precedence.
- bluestore_compression_algorithm
- bluestore_compression_mode
See compression_mode
- bluestore_compression_required_ratio
- bluestore_compression_min_blob_size
Value: Unsigned Integer, size in bytes. Default: 0
Minimum size of objects that are compressed. The setting is ignored by default in favor of bluestore_compression_min_blob_size_hdd and bluestore_compression_min_blob_size_ssd. It takes precedence when set to a non-zero value.
- bluestore_compression_max_blob_size
Value: Unsigned Integer, size in bytes. Default: 0
Maximum size of objects that are compressed before they will be split into smaller chunks. The setting is ignored by default in favor of bluestore_compression_max_blob_size_hdd and bluestore_compression_max_blob_size_ssd. It takes precedence when set to a non-zero value.
- bluestore_compression_min_blob_size_ssd
Value: Unsigned Integer, size in bytes. Default: 8K
Minimum size of objects that are compressed and stored on solid-state drive.
- bluestore_compression_max_blob_size_ssd
Value: Unsigned Integer, size in bytes. Default: 64K
Maximum size of objects that are compressed and stored on solid-state drive before they will be split into smaller chunks.
- bluestore_compression_min_blob_size_hdd
Value: Unsigned Integer, size in bytes. Default: 128K
Minimum size of objects that are compressed and stored on hard disks.
- bluestore_compression_max_blob_size_hdd
Value: Unsigned Integer, size in bytes. Default: 512K
Maximum size of objects that are compressed and stored on hard disks before they will be split into smaller chunks.
9 RADOS Block Device #
A block is a sequence of bytes, for example a 4MB block of data. Block-based storage interfaces are the most common way to store data with rotating media, such as hard disks, CDs, floppy disks. The ubiquity of block device interfaces makes a virtual block device an ideal candidate to interact with a mass data storage system like Ceph.
Ceph block devices allow sharing of physical resources, and are resizable. They store data striped over multiple OSDs in a Ceph cluster. Ceph block devices leverage RADOS capabilities such as snapshotting, replication, and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library.
Figure 9.1: RADOS Protocol #
Ceph's block devices deliver high performance with infinite scalability to kernel modules. They support virtualization solutions such as QEMU, or cloud-based computing systems such as OpenStack that rely on libvirt. You can use the same cluster to operate the Object Gateway, CephFS, and RADOS Block Devices simultaneously.
9.1 Block Device Commands #
The rbd command enables you to create, list, introspect, and remove block device images. You can also use it, for example, to clone images, create snapshots, roll back an image to a snapshot, or view a snapshot.
9.1.1 Creating a Block Device Image in a Replicated Pool #
Before you can add a block device to a client, you need to create a related image in an existing pool (see Chapter 8, Managing Storage Pools):
cephadm >
rbd create --size MEGABYTES POOL-NAME/IMAGE-NAME
For example, to create a 1GB image named 'myimage' that stores information in a pool named 'mypool', execute the following:
cephadm >
rbd create --size 1024 mypool/myimage
Tip: Image Size Units
If you omit a size unit shortcut ('G' or 'T'), the image's size is in megabytes. Use 'G' or 'T' after the size number to specify gigabytes or terabytes.
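The relation between the size forms can be sketched as a small conversion helper. It is only an illustration of the unit rule described in this tip (a bare number means megabytes, 'G' and 'T' mean gigabytes and terabytes); the function name is hypothetical and other suffixes are not handled.

```shell
# Convert an rbd --size value such as 1024, 10G or 2T to megabytes.
size_to_mb() {
    case $1 in
        *T) echo $(( ${1%T} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

size_to_mb 1G     # the same size as --size 1024 in the example above
```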
9.1.2 Creating a Block Device Image in an Erasure Coded Pool #
As of SUSE Enterprise Storage 5.5, it is possible to store data of a block device image directly in erasure coded (EC) pools. A RADOS Block Device image consists of data and metadata parts. You can store only the 'data' part of a RADOS Block Device image in an EC pool. The pool needs to have the 'overwrite' flag set to true, and that is only possible if all OSDs where the pool is stored use BlueStore.
You cannot store the image's 'metadata' part in an EC pool. You need to specify the replicated pool for storing the image's metadata with the --pool= option of the rbd create command.
Use the following steps to create an RBD image in a newly created EC pool:
cephadm >
ceph osd pool create POOL_NAME 12 12 erasure
cephadm >
ceph osd pool set POOL_NAME allow_ec_overwrites true
#Metadata will reside in pool "OTHER_POOL", and data in pool "POOL_NAME"
cephadm >
rbd create IMAGE_NAME --size=1G --data-pool POOL_NAME --pool=OTHER_POOL
9.1.3 Listing Block Device Images #
To list block devices in a pool named 'mypool', execute the following:
cephadm >
rbd ls mypool
9.1.4 Retrieving Image Information #
To retrieve information from an image 'myimage' within a pool named 'mypool', run the following:
cephadm >
rbd info mypool/myimage
9.1.5 Resizing a Block Device Image #
RADOS Block Device images are thin provisioned: they do not actually use any physical storage until you begin saving data to them. However, they do have a maximum capacity that you set with the --size option. If you want to increase (or decrease) the maximum size of the image, run the following:
cephadm >
rbd resize --size 2048 POOL_NAME/IMAGE_NAME # to increase
cephadm >
rbd resize --size 2048 POOL_NAME/IMAGE_NAME --allow-shrink # to decrease
9.1.6 Removing a Block Device Image #
To remove a block device that corresponds to an image 'myimage' in a pool named 'mypool', run the following:
cephadm >
rbd rm mypool/myimage
9.2 Mounting and Unmounting #
After you create a RADOS Block Device, you can use it as any other disk device: format it, mount it to be able to exchange files, and unmount it when done.
Make sure your Ceph cluster includes a pool with the disk image you want to map. Assume the pool is called mypool and the image is myimage.
cephadm >
rbd list mypool
Map the image to a new block device.
cephadm >
rbd map --pool mypool myimage
Tip: User Name and Authentication
To specify a user name, use --id user-name. If you use cephx authentication, you also need to specify a secret. It may come from a keyring or a file containing the secret:
cephadm >
rbd map --pool mypool myimage --id admin --keyring /path/to/keyring
or
cephadm >
rbd map --pool mypool myimage --id admin --keyfile /path/to/file
List all mapped devices:
cephadm >
rbd showmapped
id pool image snap device
0 mypool myimage - /dev/rbd0
The device we want to work on is /dev/rbd0.
Tip: RBD Device Path
Instead of /dev/rbdDEVICE_NUMBER, you can use /dev/rbd/POOL_NAME/IMAGE_NAME as a persistent device path. For example:
/dev/rbd/mypool/myimage
Make an XFS file system on the /dev/rbd0 device.
root #
mkfs.xfs /dev/rbd0
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd0 isize=256 agcount=9, agsize=261120 blks
 = sectsz=512 attr=2, projid32bit=1
 = crc=0 finobt=0
data = bsize=4096 blocks=2097152, imaxpct=25
 = sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=2560, version=2
 = sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Mount the device and check it is correctly mounted. Replace /mnt with your mount point.
root #
mount /dev/rbd0 /mnt
root #
mount | grep rbd0
/dev/rbd0 on /mnt type xfs (rw,relatime,attr2,inode64,sunit=8192,...
Now you can move data to and from the device as if it was a local directory.
Tip: Increasing the Size of RBD Device
If you find that the size of the RBD device is no longer enough, you can easily increase it.
Increase the size of the RBD image, for example up to 10GB.
root #
rbd resize --size 10000 mypool/myimage
Resizing image: 100% complete...done.
Grow the file system to fill up the new size of the device.
root #
xfs_growfs /mnt
[...]
data blocks changed from 2097152 to 2560000
After you finish accessing the device, you can unmount and unmap it.
root #
umount /mnt
cephadm >
rbd unmap /dev/rbd0
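In scripts, the device that belongs to a given image can be looked up from the `rbd showmapped` listing instead of being hard-coded. The sketch below assumes the column order shown in the procedure above (id, pool, image, snap, device); the function name is illustrative.

```shell
# Print the device for POOL/IMAGE from `rbd showmapped` style output on stdin.
# Assumes the columns: id pool image snap device (first row is the header).
mapped_device() {
    awk -v pool="$1" -v image="$2" \
        'NR > 1 && $2 == pool && $3 == image { print $5 }'
}

# Sample rows taken from the procedure above.
# Live use: rbd showmapped | mapped_device mypool myimage
printf '%s\n' \
    'id pool image snap device' \
    '0 mypool myimage - /dev/rbd0' | mapped_device mypool myimage
```

The function prints nothing if the image is not mapped, which a calling script can test for before unmounting or unmapping.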
Tip: Manual (Un)mounting
Since manually mapping and mounting RBD images after boot and unmounting and unmapping them before shutdown can be tedious, an rbdmap script and systemd unit is provided. Refer to Section 9.2.1, “rbdmap: Map RBD Devices at Boot Time”.
9.2.1 rbdmap: Map RBD Devices at Boot Time #
rbdmap is a shell script that automates rbd map and rbd unmap operations on one or more RBD images. Although you can run the script manually at any time, the main advantage is automatic mapping and mounting of RBD images at boot time (and unmounting and unmapping at shutdown), as triggered by the Init system. A systemd unit file, rbdmap.service, is included with the ceph-common package for this purpose.
The script takes a single argument, which can be either map or unmap. In either case, the script parses a configuration file. It defaults to /etc/ceph/rbdmap, but can be overridden via an environment variable RBDMAPFILE. Each line of the configuration file corresponds to an RBD image which is to be mapped, or unmapped.
The configuration file has the following format:
image_specification rbd_options
image_specification
Path to an image within a pool. Specify as pool_name/image_name.
rbd_options
An optional list of parameters to be passed to the underlying rbd map command. These parameters and their values should be specified as a comma-separated string, for example:
PARAM1=VAL1,PARAM2=VAL2,...
The example makes the rbdmap script run the following command:
cephadm >
rbd map POOL_NAME/IMAGE_NAME --PARAM1 VAL1 --PARAM2 VAL2
In the following example you can see how to specify a user name and a keyring with a corresponding secret:
cephadm >
rbdmap map mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring
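The translation from a configuration line to an `rbd map` call can be illustrated with a few lines of shell. This is a simplified sketch of the mapping described above, not the code of the rbdmap script itself; the function name is hypothetical and quoting or escaping inside values is not handled.

```shell
# Turn "POOL/IMAGE PARAM1=VAL1,PARAM2=VAL2" into the corresponding rbd map call.
rbdmap_line_to_command() {
    echo "$1" | awk '{
        cmd = "rbd map " $1
        n = split($2, kv, ",")          # n is 0 when no options are given
        for (i = 1; i <= n; i++) {
            sub(/=/, " ", kv[i])        # PARAM=VAL becomes "PARAM VAL"
            cmd = cmd " --" kv[i]
        }
        print cmd
    }'
}

rbdmap_line_to_command 'mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring'
```

The function only prints the command, so it can be used to check a configuration line before relying on it at boot time.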
When run as rbdmap map, the script parses the configuration file, and for each specified RBD image, it attempts to first map the image (using the rbd map command) and then mount the image.
When run as rbdmap unmap, images listed in the configuration file will be unmounted and unmapped.
rbdmap unmap-all attempts to unmount and subsequently unmap all currently mapped RBD images, regardless of whether they are listed in the configuration file.
If successful, the rbd map operation maps the image to a /dev/rbdX device, at which point a udev rule is triggered to create a friendly device name symbolic link /dev/rbd/pool_name/image_name pointing to the real mapped device.
In order for mounting and unmounting to succeed, the 'friendly' device name needs to have a corresponding entry in /etc/fstab.
When writing /etc/fstab entries for RBD images, specify the 'noauto' (or 'nofail') mount option. This prevents the Init system from trying to mount the device too early, before the device in question even exists, as rbdmap.service is typically triggered quite late in the boot sequence.
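A matching /etc/fstab line can be composed from the pool name, image name, and mount point. The following is a minimal sketch; the function name, the xfs default, and the trailing `0 0` dump and fsck fields are assumptions chosen for illustration, so adjust them to your file system and policy.

```shell
# Build an /etc/fstab line for a mapped RBD image using the friendly device name.
# Usage: rbd_fstab_entry POOL IMAGE MOUNT_POINT [FS_TYPE]
rbd_fstab_entry() {
    printf '/dev/rbd/%s/%s %s %s noauto 0 0\n' "$1" "$2" "$3" "${4:-xfs}"
}

rbd_fstab_entry mypool myimage /mnt
```

The function only prints the line. Append it to /etc/fstab manually after checking that the mount point exists.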
For a complete list of rbd options, see the rbd manual page (man 8 rbd).
For examples of the rbdmap usage, see the rbdmap manual page (man 8 rbdmap).
9.2.2 Increasing the Size of RBD Device #
If you find that the size of the RBD device is no longer enough, you can easily increase it.
Increase the size of the RBD image, for example up to 10GB.
cephadm >
rbd resize --size 10000 mypool/myimage
Resizing image: 100% complete...done.
Grow the file system to fill up the new size of the device.
root #
xfs_growfs /mnt
[...]
data blocks changed from 2097152 to 2560000
9.3 Snapshots #
An RBD snapshot is a snapshot of a RADOS Block Device image. With snapshots, you retain a history of the image's state. Ceph also supports snapshot layering, which allows you to clone VM images quickly and easily. Ceph supports block device snapshots using the rbd command and many higher-level interfaces, including QEMU, libvirt, OpenStack, and CloudStack.
Note
Stop input and output operations and flush all pending writes before snapshotting an image. If the image contains a file system, the file system must be in a consistent state at the time of snapshotting.
9.3.1 Cephx Notes #
When cephx is enabled (see http://ceph.com/docs/master/rados/configuration/auth-config-ref/ for more information), you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See User Management for more details. You may also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters.
cephadm >
rbd --id user-ID --keyring=/path/to/secret commands
cephadm >
rbd --name username --keyring=/path/to/secret commands
For example:
cephadm >
rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
cephadm >
rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip
Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
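As a minimal sketch, the variable can be set once per shell session. The ID and keyring path below are the ones from the example above and are placeholders for your own values.

```shell
# Options taken from the example above; adjust the ID and keyring path to your setup.
CEPH_ARGS="--id admin --keyring=/etc/ceph/ceph.keyring"
export CEPH_ARGS

# Later calls in the same shell can then omit these options, for example:
#   rbd ls mypool
```

The setting lasts only for the current shell. To keep it across logins, place the two lines in a shell profile that only the intended user can read.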
9.3.2 Snapshot Basics #
The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.
9.3.2.1 Create Snapshot #
To create a snapshot with rbd, specify the snap create option, the pool name, and the image name.
cephadm >
rbd --pool pool-name snap create --snap snap-name image-name
cephadm >
rbd snap create pool-name/image-name@snap-name
For example:
cephadm >
rbd --pool rbd snap create --snap snapshot1 image1
cephadm >
rbd snap create rbd/image1@snapshot1
9.3.2.2 List Snapshots #
To list snapshots of an image, specify the pool name and the image name.
cephadm >
rbd --pool pool-name snap ls image-name
cephadm >
rbd snap ls pool-name/image-name
For example:
cephadm >
rbd --pool rbd snap ls image1
cephadm >
rbd snap ls rbd/image1
9.3.2.3 Rollback Snapshot #
To roll back to a snapshot with rbd, specify the snap rollback option, the pool name, the image name, and the snapshot name.
cephadm >
rbd --pool pool-name snap rollback --snap snap-name image-name
cephadm >
rbd snap rollback pool-name/image-name@snap-name
For example:
cephadm >
rbd --pool pool1 snap rollback --snap snapshot1 image1
cephadm >
rbd snap rollback pool1/image1@snapshot1
Note
Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to roll back an image to a snapshot, and it is the preferred method of returning to a pre-existing state.
9.3.2.4 Delete a Snapshot #
To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name, and the snapshot name.
cephadm >
rbd --pool pool-name snap rm --snap snap-name image-name
cephadm >
rbd snap rm pool-name/image-name@snap-name
For example:
cephadm >
rbd --pool pool1 snap rm --snap snapshot1 image1
cephadm >
rbd snap rm pool1/image1@snapshot1
Note
Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.
9.3.2.5 Purge Snapshots #
To delete all snapshots for an image with rbd, specify the snap purge option and the image name.
cephadm >
rbd --pool pool-name snap purge image-name
cephadm >
rbd snap purge pool-name/image-name
For example:
cephadm >
rbd --pool pool1 snap purge image1
cephadm >
rbd snap purge pool1/image1
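Each snapshot command in this section is shown in two equivalent forms: one with separate --pool and --snap options, and one with a single pool-name/image-name@snap-name specification. The sketch below shows how the short specification breaks down into its parts, using `snap create` as the example; the function name is illustrative and the function only prints the long form.

```shell
# Split POOL/IMAGE@SNAP and print the long-option form of `rbd snap create`.
snap_create_long_form() {
    spec=$1
    pool=${spec%%/*}      # text before the first "/"
    rest=${spec#*/}       # IMAGE@SNAP
    image=${rest%%@*}     # text before the "@"
    snap=${rest#*@}       # text after the "@"
    echo "rbd --pool $pool snap create --snap $snap $image"
}

snap_create_long_form rbd/image1@snapshot1
```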
9.3.3 Layering #
Ceph supports the ability to create multiple copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it, then snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics, making it possible to create clones rapidly.
Note
The terms 'parent' and 'child' mentioned in the command line examples below mean a Ceph block device snapshot (parent) and the corresponding image cloned from the snapshot (child).
Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.
A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.
Note: --image-format 1 not Supported
You cannot create snapshots of images created with the deprecated rbd create --image-format 1 option. Ceph only supports cloning of the default format 2 images.
9.3.3.1 Getting Started with Layering #
Ceph block device layering is a simple process. You must have an image. You must create a snapshot of the image. You must protect the snapshot. After you have performed these steps, you can begin cloning the snapshot.
The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID, and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.
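The steps described above (snapshot, protect, clone) can be sketched as a short sequence of rbd command lines. The following Python snippet only builds the command strings and does not run them. The helper name layering_commands and the sample pool, image, and snapshot names are illustrative assumptions; the rbd subcommands themselves are the ones used in this chapter.

```python
def layering_commands(pool, image, snap, child, child_pool=None):
    """Return the rbd command lines that snapshot, protect, and clone an image.

    child_pool defaults to the parent pool; passing another pool illustrates
    cloning a snapshot from one pool to an image in another pool.
    """
    child_pool = child_pool or pool
    parent = f"{pool}/{image}@{snap}"
    return [
        f"rbd snap create {parent}",                 # create the snapshot
        f"rbd snap protect {parent}",                # protect it before cloning
        f"rbd clone {parent} {child_pool}/{child}",  # create the COW clone
    ]


for line in layering_commands("pool1", "image1", "snapshot1", "image2"):
    print(line)
```

Keeping the protect step directly before the clone step mirrors the requirement that a snapshot must be protected before it can be cloned.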
Image Template: A common use case for block device layering is to create a master image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (for example, SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (for example, zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.
Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (a VM template) and install other software (for example, a database, a content management system, or an analytics system), and then snapshot the extended image, which itself may be updated in the same way as the base image.
Template Pool: One way to use block device layering is to create a pool that contains master images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.
Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.
9.3.3.2 Protecting a Snapshot #
Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.
cephadm >
rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
cephadm >
rbd snap protect pool-name/image-name@snapshot-name
For example:
cephadm >
rbd --pool pool1 snap protect --image image1 --snap snapshot1
cephadm >
rbd snap protect pool1/image1@snapshot1
Note
You cannot delete a protected snapshot.
9.3.3.3 Cloning a Snapshot #
To clone a snapshot, you need to specify the parent pool, image, snapshot, the child pool, and the image name. You need to protect the snapshot before you can clone it.
cephadm >
rbd clone --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
cephadm >
rbd clone pool-name/parent-image@snap-name \
 pool-name/child-image-name
For example:
cephadm >
rbd clone pool1/image1@snapshot1 pool1/image2
Note
You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writable clones in another pool.
9.3.3.4 Unprotecting a Snapshot #
Before you can delete a snapshot, you must unprotect it first. Additionally, you may not delete snapshots that have references from clones. You need to flatten each clone of a snapshot before you can delete the snapshot.
cephadm >
rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
cephadm >
rbd snap unprotect pool-name/image-name@snapshot-name
For example:
cephadm >
rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
cephadm >
rbd snap unprotect pool1/image1@snapshot1
9.3.3.5 Listing Children of a Snapshot #
To list the children of a snapshot, execute the following:
cephadm >
rbd --pool pool-name children --image image-name --snap snap-name
cephadm >
rbd children pool-name/image-name@snapshot-name
For example:
cephadm >
rbd --pool pool1 children --image image1 --snap snapshot1
cephadm >
rbd children pool1/image1@snapshot1
9.3.3.6 Flattening a Cloned Image #
Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively 'flatten' the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.
cephadm >
rbd --pool pool-name flatten --image image-name
cephadm >
rbd flatten pool-name/image-name
For example:
cephadm >
rbd --pool pool1 flatten --image image1
cephadm >
rbd flatten pool1/image1
Note
Since a flattened image contains all the information from the snapshot, a flattened image will take up more storage space than a layered clone.
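Taken together, the sections above imply an order of operations for deleting a snapshot that has clones: flatten every child, unprotect the snapshot, then remove it. The following Python sketch builds that sequence as command strings without running anything. The helper name snapshot_removal_commands is an illustrative assumption, and the list of children is expected to come from the output of rbd children.

```python
def snapshot_removal_commands(parent_snap, children):
    """Return rbd command lines that make a cloned snapshot deletable, then delete it.

    parent_snap: snapshot spec such as 'pool1/image1@snapshot1'
    children:    clone specs as reported by 'rbd children', e.g. ['pool1/image2']
    """
    commands = [f"rbd flatten {child}" for child in children]  # detach each clone
    commands.append(f"rbd snap unprotect {parent_snap}")       # no references remain
    commands.append(f"rbd snap rm {parent_snap}")              # finally delete it
    return commands


for line in snapshot_removal_commands("pool1/image1@snapshot1", ["pool1/image2"]):
    print(line)
```

Flattening comes first because a snapshot that still has references from clones cannot be deleted.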
9.4 Mirroring #
RBD images can be asynchronously mirrored between two Ceph clusters. This capability uses the RBD journaling image feature to ensure crash-consistent replication between clusters. Mirroring is configured on a per-pool basis within peer clusters and can be configured to automatically mirror all images within a pool or only a specific subset of images. Mirroring is configured using the rbd command. The rbd-mirror daemon is responsible for pulling image updates from the remote peer cluster and applying them to the image within the local cluster.
Note: rbd-mirror Daemon
To use RBD mirroring, you need to have two Ceph clusters, each running the rbd-mirror daemon.
Important: RADOS Block Devices Exported via iSCSI
You cannot mirror RBD devices that are exported via iSCSI using lrbd.
Refer to Chapter 14, Ceph iSCSI Gateway for more details on iSCSI.
9.4.1 rbd-mirror Daemon #
The two rbd-mirror daemons are responsible for watching image journals on the remote peer cluster and replaying the journal events against the local cluster. The RBD image journaling feature records all modifications to the image in the order they occur. This ensures that a crash-consistent mirror of the remote image is available locally.
The rbd-mirror daemon is available in the rbd-mirror package. You can install the package on OSD nodes, gateway nodes, or even on dedicated nodes. We do not recommend installing rbd-mirror on the Salt master/admin node.
Install, enable, and start rbd-mirror:
root #
zypper install rbd-mirror
root #
systemctl enable ceph-rbd-mirror@server_name.service
root #
systemctl start ceph-rbd-mirror@server_name.service
Important
Each rbd-mirror daemon requires the ability to connect to both clusters simultaneously.
9.4.2 Pool Configuration #
The following procedures demonstrate how to perform the basic administrative tasks to configure mirroring using the rbd command. Mirroring is configured on a per-pool basis within the Ceph clusters.
You need to perform the pool configuration steps on both peer clusters. For clarity, these procedures assume that two clusters, named 'local' and 'remote', are accessible from a single host.
See the rbd manual page (man 8 rbd) for additional details on how to connect to different Ceph clusters.
Tip: Multiple Clusters
The cluster name in the following examples corresponds to a Ceph configuration file of the same name, /etc/ceph/remote.conf. See the ceph-conf documentation for how to configure multiple clusters.
9.4.2.1 Enable Mirroring on a Pool #
To enable mirroring on a pool, specify the mirror pool enable subcommand, the pool name, and the mirroring mode. The mirroring mode can either be pool or image:
- pool
All images in the pool with the journaling feature enabled are mirrored.
- image
Mirroring needs to be explicitly enabled on each image. See Section 9.4.3.2, “Enable Image Mirroring” for more information.
For example:
cephadm >
rbd --cluster local mirror pool enable POOL_NAME pool
cephadm >
rbd --cluster remote mirror pool enable POOL_NAME pool
9.4.2.2 Disable Mirroring #
To disable mirroring on a pool, specify the mirror pool disable subcommand and the pool name. When mirroring is disabled on a pool in this way, mirroring will also be disabled on any images (within the pool) for which mirroring was enabled explicitly.
cephadm >
rbd --cluster local mirror pool disable POOL_NAME
cephadm >
rbd --cluster remote mirror pool disable POOL_NAME
9.4.2.3 Add Cluster Peer #
For the rbd-mirror daemon to discover its peer cluster, the peer needs to be registered to the pool. To add a mirroring peer cluster, specify the mirror pool peer add subcommand, the pool name, and a cluster specification:
cephadm >
rbd --cluster local mirror pool peer add POOL_NAME client.remote@remote
cephadm >
rbd --cluster remote mirror pool peer add POOL_NAME client.local@local
9.4.2.4 Remove Cluster Peer #
To remove a mirroring peer cluster, specify the mirror pool peer remove subcommand, the pool name, and the peer UUID (available from the rbd mirror pool info command):
cephadm >
rbd --cluster local mirror pool peer remove POOL_NAME \
 55672766-c02b-4729-8567-f13a66893445
cephadm >
rbd --cluster remote mirror pool peer remove POOL_NAME \
 60c0e299-b38f-4234-91f6-eed0a367be08
9.4.3 Image Configuration #
Unlike pool configuration, image configuration only needs to be performed against a single mirroring peer Ceph cluster.
Mirrored RBD images are designated as either primary or non-primary. This is a property of the image and not the pool. Images that are designated as non-primary cannot be modified.
Images are automatically promoted to primary when mirroring is first enabled on an image. This happens either implicitly, if the pool mirror mode is 'pool' and the image has the journaling image feature enabled, or explicitly by the rbd command (see Section 9.4.3.2, “Enable Image Mirroring”).
9.4.3.1 Image Journaling Support #
RBD mirroring uses the RBD journaling feature to ensure that the replicated image always remains crash-consistent. Before an image can be mirrored to a peer cluster, the journaling feature must be enabled. The feature can be enabled at the time of image creation by providing the --image-feature exclusive-lock,journaling option to the rbd command.
Alternatively, the journaling feature can be dynamically enabled on pre-existing RBD images. To enable journaling, specify the feature enable subcommand, the pool and image name, and the feature name:
cephadm >
rbd --cluster local feature enable POOL_NAME/IMAGE_NAME journaling
Note: Option Dependency
The journaling feature is dependent on the exclusive-lock feature. If the exclusive-lock feature is not already enabled, you need to enable it prior to enabling the journaling feature.
Warning: Journaling on All New Images
You can enable journaling on all new images by default by appending the journaling value to the rbd default features option in the Ceph configuration file. For example:
rbd default features = layering,exclusive-lock,object-map,deep-flatten,journaling
Before applying such a change, carefully consider whether enabling journaling on all new images is good for your deployment, because it can have a negative performance impact.
9.4.3.2 Enable Image Mirroring #
If mirroring is configured in the 'image' mode, then it is necessary to explicitly enable mirroring for each image within the pool. To enable mirroring for a specific image, specify the mirror image enable subcommand along with the pool and image name:
cephadm >
rbd --cluster local mirror image enable POOL_NAME/IMAGE_NAME
9.4.3.3 Disable Image Mirroring #
To disable mirroring for a specific image, specify the mirror image disable subcommand along with the pool and image name:
cephadm >
rbd --cluster local mirror image disable POOL_NAME/IMAGE_NAME
9.4.3.4 Image Promotion and Demotion #
In a failover scenario where the primary designation needs to be moved to the image in the peer cluster, you need to stop access to the primary image, demote the current primary image, promote the new primary image, and resume access to the image on the alternate cluster.
Note: Forced Promotion
Promotion can be forced using the --force option. Forced promotion is needed when the demotion cannot be propagated to the peer cluster (for example, in case of cluster failure or communication outage). This will result in a split-brain scenario between the two peers, and the image will no longer be synchronized until a resync subcommand is issued.
To demote a specific image to non-primary, specify the mirror image demote subcommand along with the pool and image name:
cephadm >
rbd --cluster local mirror image demote POOL_NAME/IMAGE_NAME
To demote all primary images within a pool to non-primary, specify the mirror pool demote subcommand along with the pool name:
cephadm >
rbd --cluster local mirror pool demote POOL_NAME
To promote a specific image to primary, specify the mirror image promote subcommand along with the pool and image name:
cephadm >
rbd --cluster remote mirror image promote POOL_NAME/IMAGE_NAME
To promote all non-primary images within a pool to primary, specify the mirror pool promote subcommand along with the pool name:
cephadm >
rbd --cluster local mirror pool promote POOL_NAME
Tip: Split I/O Load
Since the primary or non-primary status is per-image, it is possible to have two clusters split the I/O load and stage failover or failback.
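The failover order described in this section (demote the current primary, then promote the image on the peer) can be sketched as follows. The Python snippet only builds command strings. The helper name failover_commands is an illustrative assumption, as is the choice to skip the demote step when force is set, which models the case where the old primary cluster cannot be reached. Stopping and resuming client access are outside the scope of the sketch.

```python
def failover_commands(pool, image, old_primary="local", new_primary="remote", force=False):
    """Return rbd command lines that move the primary designation to the peer cluster."""
    spec = f"{pool}/{image}"
    commands = []
    if not force:
        # Orderly failover: demote the current primary first.
        commands.append(f"rbd --cluster {old_primary} mirror image demote {spec}")
    flag = " --force" if force else ""
    commands.append(f"rbd --cluster {new_primary} mirror image promote{flag} {spec}")
    return commands


for line in failover_commands("POOL_NAME", "IMAGE_NAME"):
    print(line)
```

After a forced promotion, the images are in a split-brain state until a resync is requested for the out-of-date image.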
9.4.3.5 Force Image Resync #
If a split-brain event is detected by the rbd-mirror daemon, it will not attempt to mirror the affected image until corrected. To resume mirroring for an image, first demote the image determined to be out of date and then request a resync to the primary image. To request an image resync, specify the mirror image resync subcommand along with the pool and image name:
cephadm >
rbd mirror image resync POOL_NAME/IMAGE_NAME
9.4.4 Mirror Status #
The peer cluster replication status is stored for every primary mirrored image. This status can be retrieved using the mirror image status and mirror pool status subcommands:
To request the mirror image status, specify the mirror image status subcommand along with the pool and image name:
cephadm >
rbd mirror image status POOL_NAME/IMAGE_NAME
To request the mirror pool summary status, specify the mirror pool status subcommand along with the pool name:
cephadm >
rbd mirror pool status POOL_NAME
Tip
Adding the --verbose option to the mirror pool status subcommand will additionally output status details for every mirroring image in the pool.
9.5 Advanced Features #
RADOS Block Device supports advanced features that enhance the functionality of RBD images. You can specify the features either on the command line when creating an RBD image, or in the Ceph configuration file by using the rbd_default_features option.
You can specify the values of the rbd_default_features option in two ways:
As a sum of features' internal values. Each feature has its own internal value, for example 'layering' has 1 and 'fast-diff' has 16. Therefore, to activate these two features by default, include the following:
rbd_default_features = 17
As a comma-separated list of features. The previous example will look as follows:
rbd_default_features = layering,fast-diff
Note: Features not Supported by iSCSI
RBD images with the following features will not be supported by iSCSI: deep-flatten, striping, exclusive-lock, object-map, journaling, fast-diff
A list of advanced RBD features follows:
layering
Layering enables you to use cloning.
Internal value is 1, default is 'yes'.
striping
Striping spreads data across multiple objects and helps with parallelism for sequential read/write workloads. It prevents single node bottleneck for large or busy RADOS Block Devices.
Internal value is 2, default is 'yes'.
exclusive-lock
When enabled, it requires a client to get a lock on an object before making a write. Enable the exclusive lock only when a single client accesses an image at any given time. Internal value is 4, default is 'yes'.
object-map
Object map support depends on exclusive lock support. Block devices are thin provisioned meaning that they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, importing and exporting a sparsely populated image, and deleting.
Internal value is 8, default is 'yes'.
fast-diff
Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and to calculate the actual data usage of a snapshot.
Internal value is 16, default is 'yes'.
deep-flatten
Deep-flatten makes rbd flatten (see Section 9.3.3.6, “Flattening a Cloned Image”) work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, therefore you will not be able to delete the parent image until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.
Internal value is 32, default is 'yes'.
journaling
Journaling support depends on exclusive lock support. Journaling records all modifications to an image in the order they occur. RBD mirroring (see Section 9.4, “Mirroring”) utilizes the journal to replicate a crash consistent image to a remote cluster.
Internal value is 64, default is 'no'.
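The two notations for rbd_default_features can be converted into each other with simple bit arithmetic. The following Python sketch uses the internal values and dependencies listed above; the function names are illustrative assumptions and are not part of any Ceph tool.

```python
# Internal values as listed in this section.
RBD_FEATURES = {
    "layering": 1,
    "striping": 2,
    "exclusive-lock": 4,
    "object-map": 8,
    "fast-diff": 16,
    "deep-flatten": 32,
    "journaling": 64,
}

# Dependencies as described in this section.
DEPENDS_ON = {
    "object-map": {"exclusive-lock"},
    "fast-diff": {"object-map", "exclusive-lock"},
    "journaling": {"exclusive-lock"},
}


def features_to_value(names):
    """Sum the internal values of the given feature names."""
    return sum(RBD_FEATURES[name] for name in set(names))


def value_to_features(value):
    """Return the feature names whose bit is set in the numeric value."""
    return [name for name, bit in RBD_FEATURES.items() if value & bit]


def missing_dependencies(names):
    """Return features that the given set depends on but does not include."""
    enabled = set(names)
    missing = set()
    for name in enabled:
        missing |= DEPENDS_ON.get(name, set()) - enabled
    return sorted(missing)


print(features_to_value("layering,fast-diff".split(",")))
print(value_to_features(17))
print(missing_dependencies(["layering", "journaling"]))
```

A check such as missing_dependencies can be useful before writing a feature list to the configuration file, because several features only work together with exclusive-lock.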
10 Erasure Coded Pools #
Ceph provides an alternative to the normal replication of data in pools, called erasure coded pools (or erasure pools). Erasure pools do not provide all the functionality of replicated pools (for example, they cannot store metadata for RBD pools), but require less raw storage. A default erasure pool capable of storing 1 TB of data requires 1.5 TB of raw storage, allowing a single disk failure. This compares favorably to a replicated pool, which needs 2 TB of raw storage for the same purpose.
For background information on Erasure Code, see https://en.wikipedia.org/wiki/Erasure_code.
Note
When using FileStore, you cannot access erasure coded pools with the RBD interface unless you have a cache tier configured. Refer to Section 11.5, “Erasure Coded Pool and Cache Tiering” for more details, or use BlueStore.
10.1 Prerequisite for Erasure Coded Pools #
To make use of erasure coding, you need to:
Define an erasure rule in the CRUSH Map.
Define an erasure code profile that specifies the coding algorithm to be used.
Create a pool using the previously mentioned rule and profile.
Keep in mind that the profile and its details cannot be changed after the pool has been created and contains data.
Ensure that the CRUSH rules for erasure pools use indep for step. For details see Section 7.3.2, “firstn and indep”.
10.2 Creating a Sample Erasure Coded Pool #
The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts. This procedure describes how to create a pool for testing purposes.
The command ceph osd pool create is used to create a pool with type erasure. The 12 stands for the number of placement groups. With default parameters, the pool is able to handle the failure of one OSD.
cephadm >
ceph osd pool create ecpool 12 12 erasure
pool 'ecpool' created
The string ABCDEFGHI is written into an object called NYAN.
cephadm >
echo ABCDEFGHI | rados --pool ecpool put NYAN -
For testing purposes OSDs can now be disabled, for example by disconnecting them from the network.
To test whether the pool can handle the failure of devices, the content of the file can be accessed with the rados command.
cephadm >
rados --pool ecpool get NYAN -
ABCDEFGHI
10.3 Erasure Code Profiles #
When the ceph osd pool create command is invoked to create an erasure pool, the default profile is used, unless another profile is specified. Profiles define the redundancy of data. This is done by setting two parameters, arbitrarily named k and m. k and m define in how many chunks a piece of data is split and how many coding chunks are created. Redundant chunks are then stored on different OSDs.
Definitions required for erasure pool profiles:
- chunk
When the encoding function is called, it returns chunks of the same size: data chunks which can be concatenated to reconstruct the original object and coding chunks which can be used to rebuild a lost chunk.
- k
The number of data chunks, that is the number of chunks into which the original object is divided. For example, if k = 2, a 10 KB object will be divided into k objects of 5 KB each.
- m
The number of coding chunks, that is the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, it means 2 OSDs can be out without losing data.
- crush-failure-domain
Defines to which devices the chunks are distributed. A bucket type needs to be set as value. For all bucket types, see Section 7.2, “Buckets”. If the failure domain is rack, the chunks will be stored on different racks to increase the resilience in case of rack failures. Keep in mind that this requires k+m racks.
With the default erasure code profile used in Section 10.2, “Creating a Sample Erasure Coded Pool”, you will not lose cluster data if a single OSD or host fails. Therefore, to store 1 TB of data it needs another 0.5 TB of raw storage. That means 1.5 TB of raw storage are required for 1 TB of data (due to k=2,m=1). This is equivalent to a common RAID 5 configuration. For comparison: a replicated pool needs 2 TB of raw storage to store 1 TB of data.
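The storage figures above follow from the ratio of total chunks to data chunks. The following Python sketch shows the arithmetic; the function names are illustrative assumptions, and the replicated comparison assumes two copies, which matches the 2 TB figure quoted above.

```python
def ec_raw_storage(data, k, m):
    """Raw capacity needed to store `data` with k data chunks and m coding chunks."""
    return data * (k + m) / k


def ec_overhead(k, m):
    """Extra raw storage relative to the stored data (m coding chunks per k data chunks)."""
    return m / k


def replicated_raw_storage(data, copies=2):
    """Raw capacity needed to store `data` in a replicated pool (assumed 2 copies)."""
    return data * copies


print(ec_raw_storage(1, k=2, m=1))         # default profile, 1 TB of data
print(replicated_raw_storage(1))           # replicated pool, 1 TB of data
print(round(ec_overhead(k=3, m=2) * 100))  # overhead in percent for k=3, m=2
```

The same functions can be used to compare candidate values of k and m before a profile is created, since the profile cannot be changed afterward.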
The settings of the default profile can be displayed with:
cephadm >
ceph osd erasure-code-profile get default
directory=.libs
k=2
m=1
plugin=jerasure
crush-failure-domain=host
technique=reed_sol_van
Choosing the right profile is important because it cannot be modified after the pool is created. A new pool with a different profile needs to be created and all objects from the previous pool moved to the new one (see Section 8.3, “Pool Migration”).
The most important parameters of the profile are k, m and crush-failure-domain because they define the storage overhead and the data durability. For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 66%, the following profile can be defined. Note that this is only valid with a CRUSH Map that has buckets of type 'rack':
cephadm >
ceph osd erasure-code-profile set myprofile \
k=3 \
m=2 \
crush-failure-domain=rack
The example in Section 10.2, “Creating a Sample Erasure Coded Pool” can be repeated with this new profile:
cephadm >
ceph osd pool create ecpool 12 12 erasure myprofile
cephadm >
echo ABCDEFGHI | rados --pool ecpool put NYAN -
cephadm >
rados --pool ecpool get NYAN -
ABCDEFGHI
The NYAN object will be divided in three (k=3) and two additional chunks will be created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack will create a CRUSH ruleset that ensures no two chunks are stored in the same rack.
For more information about the erasure code profiles, see http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile.
10.4 Erasure Coded Pools with RADOS Block Device #
To mark an EC pool as an RBD pool, tag it accordingly:
cephadm >
ceph osd pool application enable ec_pool_name rbd
RBD can store image data in EC pools. However, the image header and metadata still need to be stored in a replicated pool. Assuming you have the pool named 'rbd' for this purpose:
cephadm >
rbd create rbd/image_name --size 1T --data-pool ec_pool_name
You can use the image normally like any other image, except that all of the data will be stored in the ec_pool_name pool instead of 'rbd' pool.
11 Cache Tiering #
A cache tier is an additional storage layer implemented between the client and the standard storage. It is designed to speed up access to pools stored on slow hard disks and erasure coded pools.
Typically cache tiering involves creating a pool of relatively fast storage devices (for example SSD drives) configured to act as a cache tier, and a backing pool of slower and cheaper devices configured to act as a storage tier. The size of the cache pool is usually 10-20% of the storage pool.
11.1 Tiered Storage Terminology #
Cache tiering recognizes two types of pools: a cache pool and a storage pool.
- storage pool
Either a standard replicated pool that stores several copies of an object in the Ceph storage cluster, or an erasure coded pool (see Chapter 10, Erasure Coded Pools).
The storage pool is sometimes referred to as 'backing' or 'cold' storage.
- cache pool
A standard replicated pool stored on relatively small but fast storage devices with their own ruleset in a CRUSH Map.
The cache pool is also referred to as 'hot' storage.
11.2 Points to Consider #
Cache tiering may degrade the cluster performance for specific workloads. The following points show some of its aspects that you need to consider:
Workload-dependent: Whether a cache will improve performance is dependent on the workload. Because there is a cost associated with moving objects into or out of the cache, it can be more effective when most of the requests touch a small number of objects. The cache pool should be large enough to capture the working set for your workload to avoid thrashing.
Difficult to benchmark: Most performance benchmarks may show low performance with cache tiering. The reason is that they request a big set of objects, and it takes a long time for the cache to 'warm up'.
Possibly low performance: For workloads that are not suitable for cache tiering, performance is often slower than a normal replicated pool without cache tiering enabled.
librados object enumeration: If your application is using librados directly and relies on object enumeration, cache tiering may not work as expected. (This is not a problem for Object Gateway, RBD, or CephFS.)
11.3 When to Use Cache Tiering #
Consider using cache tiering in the following cases:
Your erasure coded pools are stored on FileStore and you need to access them via RADOS Block Device. For more information on RBD, see Chapter 9, RADOS Block Device.
Your erasure coded pools are stored on FileStore and you need to access them via iSCSI. For more information on iSCSI, refer to Chapter 14, Ceph iSCSI Gateway.
You have a limited amount of high-performance storage and a large collection of low-performance storage, and need to access the stored data faster.
11.4 Cache Modes #
The cache tiering agent handles the migration of data between the cache tier and the backing storage tier. Administrators have the ability to configure how this migration takes place. There are two main scenarios:
- write-back mode
In write-back mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier migrates to the storage tier and gets flushed from the cache tier. Conceptually, the cache tier is overlaid 'in front' of the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates the data to the cache tier on read, then it is sent to the Ceph client. Thereafter, the Ceph client can perform I/O using the cache tier, until the data becomes inactive. This is ideal for mutable data, such as photo or video editing, or transactional data.
- read-only mode
In read-only mode, Ceph clients write data directly to the backing tier. On read, Ceph copies the requested objects from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data such as presenting pictures or videos on a social network, DNA data, or X-ray imaging, because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use read-only mode for mutable data.
11.5 Erasure Coded Pool and Cache Tiering #
Erasure coded pools require more resources than replicated pools. To overcome these limitations, we recommend setting a cache tier before the erasure coded pool. This is a requirement when using FileStore.
For example, if the “hot-storage” pool is made of fast storage, the “ecpool” created in Section 10.3, “Erasure Code Profiles” can be sped up with:
cephadm >
ceph osd tier add ecpool hot-storage
cephadm >
ceph osd tier cache-mode hot-storage writeback
cephadm >
ceph osd tier set-overlay ecpool hot-storage
This will place the “hot-storage” pool as a tier of ecpool in write-back mode so that every write and read to the ecpool is actually using the hot-storage and benefits from its flexibility and speed.
cephadm >
rbd --pool ecpool create --size 10 myvolume
For more information about cache tiering, see Chapter 11, Cache Tiering.
11.6 Setting Up an Example Tiered Storage #
This section illustrates how to set up a fast SSD cache tier (hot storage) in front of a standard hard disk (cold storage).
Tip
The following example is for illustration purposes only and includes a setup with one root and one rule for the SSD part residing on a single Ceph node.
In the production environment, cluster setups typically include more root and rule entries for the hot storage, and also mixed nodes, with both SSDs and SATA disks.
Create two additional CRUSH rules, 'replicated_ssd' for the fast SSD caching device class, and 'replicated_hdd' for the slower HDD device class:
cephadm >
ceph osd crush rule create-replicated replicated_ssd default host ssd
cephadm >
ceph osd crush rule create-replicated replicated_hdd default host hdd
Switch all existing pools to the 'replicated_hdd' rule. This prevents Ceph from storing data to the newly added SSD devices:
cephadm >
ceph osd pool set POOL_NAME crush_rule replicated_hdd
Turn the machine into a Ceph node using DeepSea. Install the software and configure the host machine as described in Section 1.1, “Adding New Cluster Nodes”. Let us assume that its name is node-4. This node needs to have 4 OSD disks.
[...]
host node-4 {
   id -5  # do not change unnecessarily
   # weight 0.012
   alg straw
   hash 0  # rjenkins1
   item osd.6 weight 0.003
   item osd.7 weight 0.003
   item osd.8 weight 0.003
   item osd.9 weight 0.003
}
[...]
Edit the CRUSH map for the hot storage pool mapped to the OSDs backed by the fast SSD drives. Define a second hierarchy with a root node for the SSDs (as 'root ssd'). Additionally, change the weight and a CRUSH rule for the SSDs. For more information on CRUSH map, see http://docs.ceph.com/docs/master/rados/operations/crush-map/.
Edit the CRUSH map directly with command line tools such as getcrushmap and crushtool:
cephadm >
ceph osd crush rm-device-class osd.6 osd.7 osd.8 osd.9
cephadm >
ceph osd crush set-device-class ssd osd.6 osd.7 osd.8 osd.9
Create the hot storage pool to be used for cache tiering. Use the new 'ssd' rule for it:
cephadm >
ceph osd pool create hot-storage 100 100 replicated ssd
Create the cold storage pool using the default 'replicated_ruleset' rule:
cephadm >
ceph osd pool create cold-storage 100 100 replicated replicated_ruleset
Then, setting up a cache tier involves associating a backing storage pool with a cache pool, in this case, cold storage (= storage pool) with hot storage (= cache pool):
cephadm >
ceph osd tier add cold-storage hot-storage
To set the cache mode to 'writeback', execute the following:
cephadm >
ceph osd tier cache-mode hot-storage writeback
For more information about cache modes, see Section 11.4, “Cache Modes”.
Writeback cache tiers overlay the backing storage tier, so they require one additional step: you must direct all client traffic from the storage pool to the cache pool. To direct client traffic directly to the cache pool, execute the following, for example:
cephadm >
ceph osd tier set-overlay cold-storage hot-storage
11.7 Configuring a Cache Tier #
There are several options you can use to configure cache tiers. Use the following syntax:
cephadm >
ceph osd pool set cachepool key value
11.7.1 Hit Set #
Hit set parameters allow for tuning of cache pools. Hit sets in Ceph are usually bloom filters and provide a memory-efficient way of tracking objects that are already in the cache pool.
The hit set is a bit array that is used to store the result of a set of hashing functions applied on object names. Initially, all bits are set to 0. When an object is added to the hit set, its name is hashed and the result is mapped on different positions in the hit set, where the value of the bit is then set to 1.
To find out whether an object exists in the cache, the object name is
hashed again. If any bit is 0
, the object is definitely
not in the cache and needs to be retrieved from cold storage.
It is possible that the results of different objects are stored in the same
location of the hit set. By chance, all bits can be 1
without the object being in the cache. Therefore, hit sets working with a
bloom filter can only tell whether an object is definitely not in the cache
and needs to be retrieved from cold storage.
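The mechanism described above can be illustrated with a toy bloom filter. This is a minimal sketch for illustration only: the class name, the bit array size, and the use of salted SHA-256 digests as hash functions are assumptions and do not reflect how Ceph implements hit sets internally.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # initially, all bits are 0

    def _positions(self, name):
        # Derive one position per hash function by salting the object name.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos] = 1

    def might_contain(self, name):
        # False: definitely not present. True: possibly present.
        return all(self.bits[pos] for pos in self._positions(name))


hit_set = BloomFilter()
for obj in ("object-a", "object-b", "object-c"):
    hit_set.add(obj)

print(hit_set.might_contain("object-a"))  # True: added objects are always found
```

A result of False is always reliable, while True may be a false positive caused by other objects having set the same bits.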
A cache pool can have more than one hit set tracking file access over time. The setting hit_set_count defines how many hit sets are being used, and hit_set_period defines for how long each hit set is used. After the period has expired, the next hit set is used. If the number of hit sets is exhausted, the memory from the oldest hit set is freed and a new hit set is created. The values of hit_set_count and hit_set_period multiplied by each other define the overall time frame in which access to objects is tracked.
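The multiplication described above can be written down as a short calculation. The function name is hypothetical, and the values are taken from the example in Section 11.7.4.2.

```python
def tracked_window_seconds(hit_set_count, hit_set_period):
    """Overall time frame covered by all hit sets of a cache pool, in seconds."""
    return hit_set_count * hit_set_period


window = tracked_window_seconds(hit_set_count=12, hit_set_period=14400)
print(window, "seconds =", window // 3600, "hours")  # 172800 seconds = 48 hours
```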
Figure 11.1: Bloom Filter with 3 Stored Objects #
Compared to the number of hashed objects, a hit set based on a bloom filter is very memory-efficient. Fewer than 10 bits per object are required to reduce the false positive probability below 1%. The false positive probability can be defined with hit_set_fpp. Based on the number of objects in a placement group and the false positive probability, Ceph automatically calculates the size of the hit set.
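The "fewer than 10 bits" figure can be checked against the textbook sizing formula for an optimally configured bloom filter, where the number of bits per stored object is -ln(p) / (ln 2)^2 for a false positive probability p. The sketch below applies that formula; the function names are hypothetical, and whether Ceph sizes its hit sets with exactly this formula is an assumption.

```python
import math


def bloom_bits_per_object(fpp):
    """Bits per stored object for an optimally sized bloom filter."""
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_size_bytes(num_objects, fpp):
    """Approximate filter size in bytes for a given number of objects."""
    return math.ceil(num_objects * bloom_bits_per_object(fpp) / 8)


print(round(bloom_bits_per_object(0.01), 2))  # 9.59
print(round(bloom_bits_per_object(0.05), 2))  # 6.24
```

A lower hit_set_fpp value therefore costs more memory per tracked object.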
The required storage on the cache pool can be limited with min_write_recency_for_promote and min_read_recency_for_promote. If the value is set to 0, all objects are promoted to the cache pool as soon as they are read or written and this persists until they are evicted. Any value greater than 0 defines the number of hit sets ordered by age that are searched for the object. If the object is found in a hit set, it will be promoted to the cache pool. Keep in mind that backup of objects may also cause them to be promoted to the cache. A full backup with the value of '0' can cause all data to be promoted to the cache tier while active data gets removed from the cache tier. Therefore, changing this setting based on the backup strategy may be useful.
Note
The longer the period and the higher the min_read_recency_for_promote and min_write_recency_for_promote values, the more RAM the ceph-osd daemon consumes. In particular, when the agent is active to flush or evict cache objects, all hit_set_count hit sets are loaded into RAM.
11.7.1.1 Use GMT for Hit Set #
Cache tier setups have a bloom filter called hit set. The filter tests whether an object belongs to a set of either hot or cold objects. The objects are added to the hit set using time stamps appended to their names.
If cluster machines are placed in different time zones and the time stamps are derived from the local time, objects in a hit set can have misleading names consisting of future or past time stamps. In the worst case, objects may not exist in the hit set at all.
To prevent this, the use_gmt_hitset option defaults to '1' on newly created cache tier setups. This way, you force OSDs to use GMT (Greenwich Mean Time) time stamps when creating the object names for the hit set.
Warning: Leave the Default Value
Do not touch the default value '1' of use_gmt_hitset. If errors related to this option are not caused by your cluster setup, never change it manually. Otherwise, the cluster behavior may become unpredictable.
11.7.2 Cache Sizing #
The cache tiering agent performs two main functions:
- Flushing
The agent identifies modified (dirty) objects and forwards them to the storage pool for long-term storage.
- Evicting
The agent identifies objects that have not been modified (clean) and evicts the least recently used among them from the cache.
11.7.2.1 Absolute Sizing #
The cache tiering agent can flush or evict objects based on the total number of bytes or the total number of objects. To specify a maximum number of bytes, execute the following:
cephadm >
ceph osd pool set cachepool target_max_bytes num_of_bytes
To specify the maximum number of objects, execute the following:
cephadm >
ceph osd pool set cachepool target_max_objects num_of_objects
Note
Ceph is not able to determine the size of a cache pool automatically, so the configuration on the absolute size is required here. Otherwise, flush and evict will not work. If you specify both limits, the cache tiering agent will begin flushing or evicting when either threshold is triggered.
Note
All client requests will be blocked only when target_max_bytes or target_max_objects is reached.
11.7.2.2 Relative Sizing #
The cache tiering agent can flush or evict objects relative to the size of the cache pool (specified by target_max_bytes or target_max_objects in Section 11.7.2.1, “Absolute Sizing”). When the cache pool consists of a certain percentage of modified (dirty) objects, the cache tiering agent will flush them to the storage pool. To set the cache_target_dirty_ratio, execute the following:
cephadm >
ceph osd pool set cachepool cache_target_dirty_ratio 0.0..1.0
For example, setting the value to 0.4 will begin flushing modified (dirty) objects when they reach 40% of the cache pool's capacity:
cephadm >
ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
When the dirty objects reach a certain percentage of the capacity, flush them at a higher speed. Use cache_target_dirty_high_ratio:
cephadm >
ceph osd pool set cachepool cache_target_dirty_high_ratio 0.0..1.0
When the cache pool reaches a certain percentage of its capacity, the cache tiering agent will evict objects to maintain free capacity. To set the cache_target_full_ratio, execute the following:
cephadm >
ceph osd pool set cachepool cache_target_full_ratio 0.0..1.0
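Because the three ratios are relative to the absolute size of the cache pool, it can help to convert them into byte values. The following sketch does that for target_max_bytes. The function name and the dictionary keys are hypothetical, the ratios 0.6 and 0.8 are illustrative values rather than recommendations, and the calculation is a simplified model of the thresholds described above.

```python
def agent_thresholds(target_max_bytes, dirty_ratio, dirty_high_ratio, full_ratio):
    """Convert the relative cache ratios into absolute byte thresholds."""
    return {
        "flush_from": round(target_max_bytes * dirty_ratio),
        "fast_flush_from": round(target_max_bytes * dirty_high_ratio),
        "evict_from": round(target_max_bytes * full_ratio),
    }


# A cache pool limited to 10**12 bytes (about 1 TB).
limits = agent_thresholds(10**12, dirty_ratio=0.4, dirty_high_ratio=0.6, full_ratio=0.8)
for name, value in limits.items():
    print(name, value)
```

With these values, flushing starts at 400 GB of dirty data, flushing speeds up at 600 GB, and eviction starts when the pool holds 800 GB.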
11.7.3 Cache Age #
You can specify the minimum age of a recently modified (dirty) object before the cache tiering agent flushes it to the backing storage pool. Note that this will only apply if the cache actually needs to flush/evict objects:
cephadm >
ceph osd pool set cachepool cache_min_flush_age num_of_seconds
You can specify the minimum age of an object before it will be evicted from the cache tier:
cephadm >
ceph osd pool set cachepool cache_min_evict_age num_of_seconds
11.7.4 Examples #
11.7.4.1 Large Cache Pool and Small Memory #
If lots of storage and only a small amount of RAM is available, all objects can be promoted to the cache pool as soon as they are accessed. The hit set is kept small. The following is a set of example configuration values:
hit_set_count = 1
hit_set_period = 3600
hit_set_fpp = 0.05
min_write_recency_for_promote = 0
min_read_recency_for_promote = 0
11.7.4.2 Small Cache Pool and Large Memory #
If a small amount of storage but a comparably large amount of memory is available, the cache tier can be configured to promote a limited number of objects into the cache pool. Twelve hit sets, of which each is used over a period of 14,400 seconds, provide tracking for a total of 48 hours. If an object has been accessed in the last 8 hours, it is promoted to the cache pool. The set of example configuration values then is:
hit_set_count = 12
hit_set_period = 14400
hit_set_fpp = 0.01
min_write_recency_for_promote = 2
min_read_recency_for_promote = 2
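Example values like these are applied one key at a time with the ceph osd pool set syntax from Section 11.7. The following sketch only builds the command lines and does not run them. The helper name is hypothetical, and the pool name hot-storage is assumed from the earlier cache tier setup.

```python
def pool_set_commands(pool, settings):
    """Build one 'ceph osd pool set' command line per setting."""
    return [f"ceph osd pool set {pool} {key} {value}" for key, value in settings.items()]


small_cache_large_memory = {
    "hit_set_count": 12,
    "hit_set_period": 14400,
    "hit_set_fpp": 0.01,
    "min_write_recency_for_promote": 2,
    "min_read_recency_for_promote": 2,
}

for command in pool_set_commands("hot-storage", small_cache_large_memory):
    print(command)
```

Printing the commands first makes it possible to review them before running them on an admin node.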
12 Ceph Cluster Configuration #
This chapter provides a list of important Ceph cluster settings and their description. The settings are sorted by topic.
12.1 Runtime Configuration #
Section 1.12, “Adjusting ceph.conf with Custom Settings” describes how to make changes to the Ceph configuration file ceph.conf. However, the actual cluster behavior is determined not by the current state of the ceph.conf file but by the configuration of the running Ceph daemons, which is stored in memory.
You can query an individual Ceph daemon for a particular configuration setting using the admin socket on the node where the daemon is running. For example, the following command gets the value of the osd_max_write_size configuration parameter from the daemon named osd.0:
cephadm >
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok \
config get osd_max_write_size
{
"osd_max_write_size": "90"
}
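Since the admin socket answers in JSON, the output is easy to process in a script. The following sketch parses the sample output shown above. The function name is hypothetical, and obtaining the text (for example by capturing the command output on the OSD node) is left out.

```python
import json


def config_value(admin_socket_output, option):
    """Extract one option from the JSON printed by 'config get'."""
    return json.loads(admin_socket_output)[option]


sample_output = '{ "osd_max_write_size": "90" }'
value = config_value(sample_output, "osd_max_write_size")
print(value)       # 90
print(int(value))  # the value is reported as a string, so convert it if needed
```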
You can also change the daemons' settings at runtime. Remember that this change is temporary and will be lost after the next daemon restart. For example, the following command changes the osd_max_write_size parameter to '50' for all OSDs in the cluster:
cephadm >
ceph tell osd.* injectargs --osd_max_write_size 50
Warning: injectargs is Not Reliable
Unfortunately, changing the cluster settings with the injectargs command is not 100% reliable. If you need to be sure that the changed parameter is active, change it in the configuration files on all cluster nodes and restart all daemons in the cluster.
12.2 Ceph OSD and BlueStore #
12.2.1 Automatic Cache Sizing #
BlueStore can be configured to automatically resize its caches when tc_malloc is configured as the memory allocator and the bluestore_cache_autotune setting is enabled. This option is currently enabled by default. BlueStore will attempt to keep OSD heap memory usage under a designated target size via the osd_memory_target configuration option. This is a best-effort algorithm and caches will not shrink smaller than the amount specified by osd_memory_cache_min. Cache ratios will be chosen based on a hierarchy of priorities. If priority information is not available, the bluestore_cache_meta_ratio and bluestore_cache_kv_ratio options are used as fallbacks.
- bluestore_cache_autotune
Automatically tunes the ratios assigned to different BlueStore caches while respecting minimum values. Default is True.
- osd_memory_target
When tc_malloc and bluestore_cache_autotune are enabled, try to keep this many bytes mapped in memory.
Note
This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped.
- osd_memory_cache_min
When tc_malloc and bluestore_cache_autotune are enabled, set the minimum amount of memory used for caches.
Note
Setting this value too low can result in significant cache thrashing.
Part III Accessing Cluster Data #
- 13 Ceph Object Gateway
This chapter introduces details about administration tasks related to Object Gateway, such as checking status of the service, managing accounts, multisite gateways, or LDAP authentication.
- 14 Ceph iSCSI Gateway
The chapter focuses on administration tasks related to the iSCSI Gateway. For a procedure of deployment refer to Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway”.
- 15 Clustered File System
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 11 “Installation of CephFS”.
- 16 NFS Ganesha: Export Ceph Data via NFS
NFS Ganesha is an NFS server (refer to Sharing File Systems with NFS ) that runs in a user address space instead of as part of the operating system kernel. With NFS Ganesha, you can plug in your own storage mechanism—such as Ceph—and access it from any NFS client.
13 Ceph Object Gateway #
This chapter introduces details about administration tasks related to Object Gateway, such as checking status of the service, managing accounts, multisite gateways, or LDAP authentication.
13.1 Object Gateway Restrictions and Naming Limitations #
Following is a list of important Object Gateway limits:
13.1.1 Bucket Limitations #
When approaching Object Gateway via the S3 API, bucket names are limited to DNS-compliant names with a dash character '-' allowed. When approaching Object Gateway via the Swift API, you may use any combination of UTF-8 supported characters except for a slash character '/'. The maximum length of a bucket name is 255 characters. Bucket names must be unique.
Tip: Use DNS-compliant Bucket Names
Although you may use any UTF-8 based bucket name via the Swift API, it is recommended to name buckets with regard to the S3 naming limitations to avoid problems accessing the same bucket via the S3 API.
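A simple check can flag bucket names that would be awkward to use via the S3 API. The sketch below is an assumption-laden illustration: this guide does not spell out the exact S3 rules, so the pattern (lowercase letters, digits, and dashes, starting and ending with a letter or digit) is one conservative reading of "DNS-compliant", and the function name is hypothetical. Only the 255 character limit is taken from the text above.

```python
import re

# Conservative reading of "DNS-compliant with a dash allowed" (an assumption).
_S3_SAFE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_s3_safe_bucket_name(name, max_length=255):
    """Return True if the name fits the conservative pattern and length limit."""
    return len(name) <= max_length and _S3_SAFE.fullmatch(name) is not None


for candidate in ("my-new-bucket", "My_Bucket", "photos/2015", "-leading-dash"):
    print(candidate, is_s3_safe_bucket_name(candidate))
```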
13.1.2 Stored Object Limitations #
- Maximum number of objects per user
No restriction by default (limited by ~ 2^63).
- Maximum number of objects per bucket
No restriction by default (limited by ~ 2^63).
- Maximum size of an object to upload / store
Single uploads are restricted to 5GB. Use multipart for larger object sizes. The maximum number of multipart chunks is 10000.
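The two upload limits above determine when multipart upload is needed and how large the parts must be at minimum. The following sketch works this out. The constant and function names are hypothetical, and reading "5GB" as 5 * 1024^3 bytes is an assumption.

```python
MAX_SINGLE_UPLOAD = 5 * 1024**3  # 5 GB single upload limit, read as GiB (assumption)
MAX_PARTS = 10000                # maximum number of multipart chunks


def needs_multipart(object_size):
    """True if the object is too large for a single upload."""
    return object_size > MAX_SINGLE_UPLOAD


def min_part_size(object_size):
    """Smallest part size in bytes that fits the object into MAX_PARTS parts."""
    return -(-object_size // MAX_PARTS)  # integer ceiling division


one_tib = 1024**4
print(needs_multipart(one_tib))  # True
print(min_part_size(one_tib))    # 109951163
```

For a 1 TiB object, each part must therefore be roughly 105 MiB or larger.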
13.1.3 HTTP Header Limitations #
HTTP header and request limitations depend on the Web front-end used. The default CivetWeb restricts the number of HTTP headers to 64 headers, and the size of the HTTP header to 16kB.
13.2 Deploying the Object Gateway #
The recommended way of deploying the Ceph Object Gateway is via the DeepSea infrastructure by adding the relevant role-rgw [...] line(s) into the policy.cfg file on the Salt master, and running required DeepSea stages.
To include the Object Gateway during the Ceph cluster deployment process, refer to Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.3 “Cluster Deployment” and Book “Deployment Guide”, Chapter 4 “Deploying with DeepSea/Salt”, Section 4.5.1 “The policy.cfg File”.
To add the Object Gateway role to an already deployed cluster, refer to Section 1.2, “Adding New Roles to Nodes”.
13.3 Operating the Object Gateway Service #
The Object Gateway service is operated with the systemctl command. You need to have root privileges to operate the Object Gateway service. Note that gateway_host is the host name of the server whose Object Gateway instance you need to operate.
The following subcommands are supported for the Object Gateway service:
- systemctl status ceph-radosgw@rgw.gateway_host
Prints the status information of the service.
- systemctl start ceph-radosgw@rgw.gateway_host
Starts the service if it is not already running.
- systemctl restart ceph-radosgw@rgw.gateway_host
Restarts the service.
- systemctl stop ceph-radosgw@rgw.gateway_host
Stops the running service.
- systemctl enable ceph-radosgw@rgw.gateway_host
Enables the service so that it is automatically started on system start-up.
- systemctl disable ceph-radosgw@rgw.gateway_host
Disables the service so that it is not automatically started on system start-up.
13.4 Configuration Parameters #
You can influence the Object Gateway behavior by a number of options in the ceph.conf file under the section named [client.radosgw.INSTANCE_NAME]. If an option is not specified, its default value is used. A complete list of the Object Gateway options follows:
General Settings #
- rgw frontends
Configures the HTTP front end(s). Specify multiple front ends in a comma-delimited list. Each front end configuration may include a list of options separated by spaces, where each option is in the form “key=value” or “key”. Default is
rgw frontends = civetweb port=7480
Note: tcp_nodelay
This option may affect the transfer rate of sending TCP packets, depending on the data chunk sizes. If set to '1', the socket option will disable Nagle's algorithm on the connection. Therefore packets will be sent as soon as possible instead of waiting for a full buffer or timeout to occur.
- rgw data
Sets the location of the data files for the Object Gateway. Default is /var/lib/ceph/radosgw/CLUSTER_ID.
- rgw enable apis
Enables the specified APIs. Default is 's3, swift, swift_auth, admin' (all APIs).
- rgw cache enabled
Enables or disables the Object Gateway cache. Default is 'true'.
- rgw cache lru size
The number of entries in the Object Gateway cache. Default is 10000.
- rgw socket path
The socket path for the domain socket. FastCgiExternalServer uses this socket. If you do not specify a socket path, the Object Gateway will not run as an external server. The path you specify here needs to be the same as the path specified in the rgw.conf file.
- rgw fcgi socket backlog
The socket backlog for fcgi. Default is 1024.
- rgw host
The host for the Object Gateway instance. It can be an IP address or a hostname. Default is 0.0.0.0
- rgw port
The port number where the instance listens for requests. If not specified, the Object Gateway runs external FastCGI.
- rgw dns name
The DNS name of the served domain.
- rgw script uri
The alternative value for the SCRIPT_URI if not set in the request.
- rgw request uri
The alternative value for the REQUEST_URI if not set in the request.
- rgw print continue
Enable 100-continue if it is operational. Default is 'true'.
- rgw remote addr param
The remote address parameter. For example, the HTTP field containing the remote address, or the X-Forwarded-For address if a reverse proxy is operational. Default is REMOTE_ADDR.
- rgw op thread timeout
The timeout in seconds for open threads. Default is 600.
- rgw op thread suicide timeout
The timeout in seconds before the Object Gateway process dies. Disabled if set to 0 (default).
- rgw thread pool size
Number of threads for the CivetWeb server. Increase to a higher value if you need to serve more requests. Defaults to 100 threads.
- rgw num rados handles
The number of RADOS cluster handles for Object Gateway. Having a configurable number of RADOS handles results in significant performance boost for all types of workloads. Each Object Gateway worker thread now gets to pick a RADOS handle for its lifetime. Default is 1.
- rgw num control oids
The number of notification objects used for cache synchronization between different rgw instances. Default is 8.
- rgw init timeout
The number of seconds before the Object Gateway gives up on initialization. Default is 30.
- rgw mime types file
The path and location of the MIME types. Used for Swift auto-detection of object types. Default is /etc/mime.types.
- rgw gc max objs
The maximum number of objects that may be handled by garbage collection in one garbage collection processing cycle. Default is 32.
- rgw gc obj min wait
The minimum wait time before the object may be removed and handled by garbage collection processing. Default is 2 * 3600.
- rgw gc processor max time
The maximum time between the beginning of two consecutive garbage collection processing cycles. Default is 3600.
- rgw gc processor period
The cycle time for garbage collection processing. Default is 3600.
- rgw s3 success create obj status
The alternate success status response for create-obj. Default is 0.
- rgw resolve cname
Whether the Object Gateway should use DNS CNAME record of the request host name field (if host name is not equal to the Object Gateway DNS name). Default is 'false'.
- rgw obj stripe size
The size of an object stripe for Object Gateway objects. Default is 4 << 20.
- rgw extended http attrs
Add a new set of attributes that can be set on an entity (for example a user, a bucket or an object). These extra attributes can be set through HTTP header fields when putting the entity or modifying it using POST method. If set, these attributes will return as HTTP fields when requesting GET/HEAD on the entity. Default is 'content_foo, content_bar, x-foo-bar'.
- rgw exit timeout secs
Number of seconds to wait for a process before exiting unconditionally. Default is 120.
- rgw get obj window size
The window size in bytes for a single object request. Default is '16 << 20'.
- rgw get obj max req size
The maximum request size of a single GET operation sent to the Ceph Storage Cluster. Default is 4 << 20.
- rgw relaxed s3 bucket names
Enables relaxed S3 bucket names rules for US region buckets. Default is 'false'.
- rgw list buckets max chunk
The maximum number of buckets to retrieve in a single operation when listing user buckets. Default is 1000.
- rgw override bucket index max shards
Represents the number of shards for the bucket index object. Setting 0 (default) indicates there is no sharding. It is not recommended to set a value too large (for example 1000) as it increases the cost for bucket listing. This variable should be set in the client or global sections so that it is automatically applied to radosgw-admin commands.
- rgw curl wait timeout ms
The timeout in milliseconds for certain curl calls. Default is 1000.
- rgw copy obj progress
Enables output of object progress during long copy operations. Default is 'true'.
- rgw copy obj progress every bytes
The minimum bytes between copy progress output. Default is 1024 * 1024.
- rgw admin entry
The entry point for an admin request URL. Default is 'admin'.
- rgw content length compat
Enable compatibility handling of FCGI requests with both CONTENT_LENGTH AND HTTP_CONTENT_LENGTH set. Default is 'false'.
- rgw bucket quota ttl
The amount of time in seconds that cached quota information is trusted. After this timeout, the quota information will be re-fetched from the cluster. Default is 600.
- rgw user quota bucket sync interval
The amount of time in seconds for which the bucket quota information is accumulated before syncing to the cluster. During this time, other Object Gateway instances will not see the changes in the bucket quota stats related to operations on this instance. Default is 180.
- rgw user quota sync interval
The amount of time in seconds for which user quota information is accumulated before syncing to the cluster. During this time, other Object Gateway instances will not see the changes in the user quota stats related to operations on this instance. Default is 180.
- rgw bucket default quota max objects
Default maximum number of objects per bucket. It is set on new users if no other quota is specified, and has no effect on existing users. This variable should be set in the client or global sections so that it is automatically applied to radosgw-admin commands. Default is -1.
- rgw bucket default quota max size
Default maximum capacity per bucket in bytes. It is set on new users if no other quota is specified, and has no effect on existing users. Default is -1.
- rgw user default quota max objects
Default maximum number of objects for a user. This includes all objects in all buckets owned by the user. It is set on new users if no other quota is specified, and has no effect on existing users. Default is -1.
- rgw user default quota max size
The value for user maximum size quota in bytes set on new users if no other quota is specified. It has no effect on existing users. Default is -1.
- rgw verify ssl
Verify SSL certificates while making requests. Default is 'true'.
- rgw max chunk size
Maximum size of a chunk of data that will be read in a single operation. Increasing the value to 4MB (4194304) will provide better performance when processing large objects. Default is 128kB (131072).
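The value format of the rgw frontends option described at the top of this list (a comma-delimited list of front ends, each followed by space-separated “key=value” or “key” options) can be illustrated with a small parser. This is a sketch under those stated rules only: the function name is hypothetical, and it is not the parser the Object Gateway itself uses.

```python
def parse_frontends(value):
    """Split an 'rgw frontends' value into (name, options) pairs."""
    frontends = []
    for entry in value.split(","):
        tokens = entry.split()
        if not tokens:
            continue
        name, options = tokens[0], {}
        for token in tokens[1:]:
            key, sep, val = token.partition("=")
            # A bare "key" without "=value" is stored as a flag.
            options[key] = val if sep else True
        frontends.append((name, options))
    return frontends


print(parse_frontends("civetweb port=7480"))
# [('civetweb', {'port': '7480'})]
```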
Multisite Settings #
- rgw zone
The name of the zone for the gateway instance. If no zone is set, a cluster-wide default can be configured with the radosgw-admin zone default command.
- rgw zonegroup
The name of the zonegroup for the gateway instance. If no zonegroup is set, a cluster-wide default can be configured with the radosgw-admin zonegroup default command.
- rgw realm
The name of the realm for the gateway instance. If no realm is set, a cluster-wide default can be configured with the radosgw-admin realm default command.
- rgw run sync thread
If there are other zones in the realm to synchronize from, spawn threads to handle the synchronization of data and metadata. Default is 'true'.
- rgw data log window
The data log entries window in seconds. Default is 30.
- rgw data log changes size
The number of in-memory entries to hold for the data changes log. Default is 1000.
- rgw data log obj prefix
The object name prefix for the data log. Default is 'data_log'.
- rgw data log num shards
The number of shards (objects) on which to keep the data changes log. Default is 128.
- rgw md log max shards
The maximum number of shards for the metadata log. Default is 64.
Swift Settings #
- rgw enforce swift acls
Enforces the Swift Access Control List (ACL) settings. Default is 'true'.
- rgw swift token expiration
The time in seconds for expiring a Swift token. Default is 24 * 3600.
- rgw swift url
The URL for the Ceph Object Gateway Swift API.
- rgw swift url prefix
The URL prefix for the Swift StorageURL that goes in front of the “/v1” part. This allows to run several Gateway instances on the same host. For compatibility, setting this configuration variable to empty causes the default “/swift” to be used. Use explicit prefix “/” to start StorageURL at the root.
Warning
Setting this option to “/” will not work if S3 API is enabled. Keep in mind that disabling S3 will make it impossible to deploy the Object Gateway in the multisite configuration!
- rgw swift auth url
Default URL for verifying v1 authentication tokens when the internal Swift authentication is not used.
- rgw swift auth entry
The entry point for a Swift authentication URL. Default is 'auth'.
- rgw swift versioning enabled
Enables the Object Versioning of OpenStack Object Storage API. This allows clients to put the X-Versions-Location attribute on containers that should be versioned. The attribute specifies the name of the container storing archived versions. It must be owned by the same user as the versioned container because of access control verification; ACLs are not taken into consideration. Those containers cannot be versioned by the S3 object versioning mechanism. Default is 'false'.
Logging Settings #
- rgw log nonexistent bucket
Enables the Object Gateway to log a request for a non-existent bucket. Default is 'false'.
- rgw log object name
The logging format for an object name. See the manual page man 1 date for details about format specifiers. Default is '%Y-%m-%d-%H-%i-%n'.
- rgw log object name utc
Whether a logged object name includes a UTC time. If set to 'false' (default), it uses the local time.
- rgw usage max shards
The maximum number of shards for usage logging. Default is 32.
- rgw usage max user shards
The maximum number of shards used for a single user’s usage logging. Default is 1.
- rgw enable ops log
Enable logging for each successful Object Gateway operation. Default is 'false'.
- rgw enable usage log
Enable the usage log. Default is 'false'.
- rgw ops log rados
Whether the operations log should be written to the Ceph Storage Cluster back end. Default is 'true'.
- rgw ops log socket path
The Unix domain socket for writing operations logs.
- rgw ops log data backlog
The maximum data backlog data size for operations logs written to a Unix domain socket. Default is 5 << 20.
- rgw usage log flush threshold
The number of dirty merged entries in the usage log before flushing synchronously. Default is 1024.
- rgw usage log tick interval
Flush pending usage log data every 'n' seconds. Default is 30.
- rgw log http headers
Comma-delimited list of HTTP headers to include in log entries. Header names are case insensitive, and use the full header name with words separated by underscores. For example 'http_x_forwarded_for, http_x_special_k'.
- rgw intent log object name
The logging format for the intent log object name. See the manual page man 1 date for details about format specifiers. Default is '%Y-%m-%d-%i-%n'.
- rgw intent log object name utc
Whether the intent log object name includes a UTC time. If set to 'false' (default), it uses the local time.
Keystone Settings #
- rgw keystone url
The URL for the Keystone server.
- rgw keystone api version
The version (2 or 3) of OpenStack Identity API that should be used for communication with the Keystone server. Default is 2.
- rgw keystone admin domain
The name of the OpenStack domain with the administrator privilege when using OpenStack Identity API v3.
- rgw keystone admin project
The name of the OpenStack project with the administrator privilege when using OpenStack Identity API v3. If not set, the value of the rgw keystone admin tenant will be used instead.
- rgw keystone admin token
The Keystone administrator token (shared secret). In the Object Gateway, authentication with the administrator token has priority over authentication with the administrator credentials (options rgw keystone admin user, rgw keystone admin password, rgw keystone admin tenant, rgw keystone admin project, and rgw keystone admin domain). The administrator token feature is considered deprecated.
- rgw keystone admin tenant
The name of the OpenStack tenant with the administrator privilege (Service Tenant) when using OpenStack Identity API v2.
- rgw keystone admin user
The name of the OpenStack user with the administrator privilege for Keystone authentication (Service User) when using OpenStack Identity API v2.
- rgw keystone admin password
The password for the OpenStack administrator user when using OpenStack Identity API v2.
- rgw keystone accepted roles
The roles required to serve requests. Default is 'Member, admin'.
- rgw keystone token cache size
The maximum number of entries in each Keystone token cache. Default is 10000.
- rgw keystone revocation interval
The number of seconds between token revocation checks. Default is 15 * 60.
- rgw keystone verify ssl
Verify SSL certificates while making token requests to keystone. Default is 'true'.
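Several defaults in the lists above are written as expressions, such as '4 << 20', '2 * 3600', or '15 * 60'. The following sketch evaluates these notations to plain numbers. The function name is hypothetical, and it handles only the two operators that appear in this section.

```python
import operator
import re

_OPS = {"<<": operator.lshift, "*": operator.mul}


def default_value(expr):
    """Evaluate defaults written as 'A << B' or 'A * B', or as a plain integer."""
    match = re.fullmatch(r"\s*(\d+)\s*(<<|\*)\s*(\d+)\s*", expr)
    if match is None:
        return int(expr)
    left, op, right = match.groups()
    return _OPS[op](int(left), int(right))


for expr in ("4 << 20", "16 << 20", "5 << 20", "2 * 3600", "24 * 3600", "15 * 60"):
    print(expr, "=", default_value(expr))
```

For example, '4 << 20' is 4194304 bytes (4 MB), and '24 * 3600' is 86400 seconds (one day).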
13.4.1 Additional Notes #
- rgw dns name
If the parameter rgw dns name is added to the ceph.conf, make sure that the S3 client is configured to direct requests at the endpoint specified by rgw dns name.
13.5 Managing Object Gateway Access #
You can communicate with Object Gateway using either the S3- or the Swift-compatible interface. The S3 interface is compatible with a large subset of the Amazon S3 RESTful API. The Swift interface is compatible with a large subset of the OpenStack Swift API.
Both interfaces require you to create a specific user, and install the relevant client software to communicate with the gateway using the user's secret key.
13.5.1 Accessing Object Gateway #
13.5.1.1 S3 Interface Access #
To access the S3 interface, you need a REST client. S3cmd is a command line S3 client. You can find it in the openSUSE Build Service. The repository contains versions for both SUSE Linux Enterprise and openSUSE based distributions.
If you want to test your access to the S3 interface, you can also write a small Python script. The script will connect to Object Gateway, create a new bucket, and list all buckets. The values for aws_access_key_id and aws_secret_access_key are taken from the values of access_key and secret_key returned by the radosgw-admin command from Section 13.5.2.1, “Adding S3 and Swift Users”.
Install the python-boto package:
root #
zypper in python-boto
Create a new Python script called s3test.py with the following content:
import boto
import boto.s3.connection

access_key = '11BS02LGFB6AL6H1ADMW'
secret_key = 'vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY'

# Connect to the Object Gateway over plain HTTP with path-style bucket addressing.
conn = boto.connect_s3(
    aws_access_key_id = access_key,
    aws_secret_access_key = secret_key,
    host = '{hostname}',
    is_secure=False,
    calling_format = boto.s3.connection.OrdinaryCallingFormat(),
)

# Create a bucket, then list all buckets owned by the user.
bucket = conn.create_bucket('my-new-bucket')
for bucket in conn.get_all_buckets():
    print("{name}\t{created}".format(
        name = bucket.name,
        created = bucket.creation_date,
    ))
Replace {hostname} with the host name of the host where you configured Object Gateway service, for example gateway_host.
Run the script:
python s3test.py
The script outputs something like the following:
my-new-bucket 2015-07-22T15:37:42.000Z
13.5.1.2 Swift Interface Access #
To access Object Gateway via the Swift interface, you need the swift command line client. Its manual page man 1 swift tells you more about its command line options.
The package is included in the 'Public Cloud' module for SUSE Linux Enterprise 12 SP3. Before installing the package, you need to activate the module and refresh the software repository:
root #
SUSEConnect -p sle-module-public-cloud/12/x86_64
root #
zypper refresh
To install the swift command, run the following:
root #
zypper in python-swiftclient
The swift access uses the following syntax:
cephadm >
swift -A http://IP_ADDRESS/auth/1.0 \
-U example_user:swift -K 'swift_secret_key' list
Replace IP_ADDRESS with the IP address of the
gateway server, and swift_secret_key with its
value from the output of the radosgw-admin key create
command executed for the swift
user in
Section 13.5.2.1, “Adding S3 and Swift Users”.
For example:
cephadm >
swift -A http://gateway.example.com/auth/1.0 -U example_user:swift \
-K 'r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h' list
The output is:
my-new-bucket
13.5.2 Managing S3 and Swift Accounts #
13.5.2.1 Adding S3 and Swift Users #
You need to create a user, access key, and secret to enable end users to interact with the gateway. There are two types of users: users and subusers. Users are used when interacting with the S3 interface, while subusers are users of the Swift interface. Each subuser is associated with a user.
Users can also be added via the DeepSea file
rgw.sls
. For an example, see
Section 16.3.1, “Different Object Gateway Users for NFS Ganesha”.
To create a Swift user, follow these steps:
To create a Swift user (which is a subuser in our terminology), you need to create the associated user first.
cephadm >
radosgw-admin user create --uid=username \
 --display-name="display-name" --email=email
For example:
cephadm >
radosgw-admin user create \
 --uid=example_user \
 --display-name="Example User" \
 --email=penguin@example.com
To create a subuser (Swift interface) for the user, you must specify the user ID (--uid=username), a subuser ID, and the access level for the subuser.
cephadm >
radosgw-admin subuser create --uid=uid \
 --subuser=uid \
 --access=[ read | write | readwrite | full ]
For example:
cephadm >
radosgw-admin subuser create --uid=example_user \
 --subuser=example_user:swift --access=full
Generate a secret key for the user.
cephadm >
radosgw-admin key create \
 --gen-secret \
 --subuser=example_user:swift \
 --key-type=swift
Both commands will output JSON-formatted data showing the user state. Notice the following lines, and remember the secret_key value:
"swift_keys": [
 { "user": "example_user:swift",
 "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}],
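Instead of copying the secret by hand, the JSON output can be parsed. The following is a minimal Python 3 sketch, assuming the output has been captured into a string and contains the swift_keys list shown above. The function name swift_secret and the trimmed sample document are illustrative only.

```python
import json


def swift_secret(user_info_json, subuser):
    """Return the Swift secret key stored for `subuser`, or None if absent."""
    info = json.loads(user_info_json)
    for entry in info.get("swift_keys", []):
        if entry.get("user") == subuser:
            return entry.get("secret_key")
    return None


# Trimmed sample that contains only the fields shown in this section.
sample = """
{
  "swift_keys": [
    {"user": "example_user:swift",
     "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}
  ]
}
"""

print(swift_secret(sample, "example_user:swift"))
```

The returned value is what the swift client expects after its -K option.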
When accessing the Object Gateway through the S3 interface, you need to create an S3 user by running:
cephadm >
radosgw-admin user create --uid=username \
--display-name="display-name" --email=email
For example:
cephadm >
radosgw-admin user create \
--uid=example_user \
--display-name="Example User" \
--email=penguin@example.com
The command also creates the user's access and secret key. Check its
output for access_key
and secret_key
keywords and their values:
[...]
"keys": [
 { "user": "example_user",
 "access_key": "11BS02LGFB6AL6H1ADMW",
 "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
[...]
13.5.2.2 Removing S3 and Swift Users #
The procedure for deleting users is similar for S3 and Swift users. In the case of Swift users, however, you may need to delete the user including its subusers.
To remove an S3 or Swift user (including all its subusers), specify
user rm
and the user ID in the following command:
cephadm >
radosgw-admin user rm --uid=example_user
To remove a subuser, specify subuser rm
and the subuser
ID.
cephadm >
radosgw-admin subuser rm --subuser=example_user:swift
You can make use of the following options:
- --purge-data
Purges all data associated to the user ID.
- --purge-keys
Purges all keys associated to the user ID.
Tip: Removing a Subuser
When you remove a subuser, you are removing access to the Swift interface. The user will remain in the system.
13.5.2.3 Changing S3 and Swift User Access and Secret Keys #
The access_key
and secret_key
parameters identify the Object Gateway user when accessing the gateway. Changing
the existing user keys is the same as creating new ones, as the old keys
get overwritten.
For S3 users, run the following:
cephadm >
radosgw-admin key create --uid=example_user --key-type=s3 --gen-access-key --gen-secret
For Swift users, run the following:
cephadm >
radosgw-admin key create --subuser=example_user:swift --key-type=swift --gen-secret
--key-type=type
Specifies the type of key. Either swift or s3.
--gen-access-key
Generates a random access key (for S3 user by default).
--gen-secret
Generates a random secret key.
--secret=key
Specifies a secret key, for example manually generated.
13.5.2.4 User Quota Management #
The Ceph Object Gateway enables you to set quotas on users and buckets owned by users. Quotas include the maximum number of objects in a bucket and the maximum storage size in megabytes.
Before you enable a user quota, you first need to set its parameters:
cephadm >
radosgw-admin quota set --quota-scope=user --uid=example_user \
--max-objects=1024 --max-size=1024
--max-objects
Specifies the maximum number of objects. A negative value disables the check.
--max-size
Specifies the maximum number of bytes. A negative value disables the check.
--quota-scope
Sets the scope for the quota. The options are bucket and user. Bucket quotas apply to buckets a user owns. User quotas apply to a user.
Once you set a user quota, you may enable it:
cephadm >
radosgw-admin quota enable --quota-scope=user --uid=example_user
To disable a quota:
cephadm >
radosgw-admin quota disable --quota-scope=user --uid=example_user
To list quota settings:
cephadm >
radosgw-admin user info --uid=example_user
To update quota statistics:
cephadm >
radosgw-admin user stats --uid=example_user --sync-stats
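When quotas are managed for many users, the command line can be assembled by a script. The following Python 3 sketch only builds the argument list for the quota set command, using the options described above. The helper name quota_set_command is hypothetical, and running the result (for example with subprocess.run) is left to the caller.

```python
def quota_set_command(uid, max_objects=-1, max_size=-1, scope="user"):
    """Build the argument list for 'radosgw-admin quota set'.

    A negative max_objects or max_size disables the respective check.
    """
    if scope not in ("user", "bucket"):
        raise ValueError("scope must be 'user' or 'bucket'")
    return [
        "radosgw-admin", "quota", "set",
        "--quota-scope=" + scope,
        "--uid=" + uid,
        "--max-objects=%d" % max_objects,
        "--max-size=%d" % max_size,
    ]


print(" ".join(quota_set_command("example_user", 1024, 1024)))
```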
13.6 Enabling HTTPS/SSL for Object Gateways #
To enable the default Object Gateway role to communicate securely using SSL, you need either a CA-issued certificate or a self-signed one. There are two ways to configure the Object Gateway with HTTPS enabled: a simple way that makes use of the default settings, and an advanced way that lets you fine-tune HTTPS-related settings.
13.6.1 Create a Self-Signed Certificate #
By default, DeepSea expects the certificate file in
/srv/salt/ceph/rgw/cert/rgw.pem
on the Salt master. It
will then distribute the certificate to
/etc/ceph/rgw.pem
on the Salt minion with the Object Gateway
role, where Ceph reads it.
The following procedure describes how to generate a self-signed SSL certificate on the Salt master node.
Note
If you have a valid certificate signed by a CA, proceed to Step 3.
If you need your Object Gateway to be known by additional subject identities, add them to the subjectAltName option in the [v3_req] section of the /etc/ssl/openssl.cnf file. Separate multiple entries with commas:
[...]
[ v3_req ]
subjectAltName = DNS:server1.example.com,DNS:server2.example.com
[...]
Tip: IP Addresses in subjectAltName
To use IP addresses instead of domain names in the subjectAltName option, replace the example line with the following:
subjectAltName = IP:10.0.0.10,IP:10.0.0.11
Create the key and the certificate using openssl. Enter all data you need to include in your certificate. We recommend entering the FQDN as the common name. Before signing the certificate, verify that 'X509v3 Subject Alternative Name:' is included in the requested extensions, and that the resulting certificate has 'X509v3 Subject Alternative Name:' set.
root@master #
openssl req -x509 -nodes -days 1095 \
 -newkey rsa:4096 -keyout rgw.key -out /srv/salt/ceph/rgw/cert/rgw.pem
Append the key file to the rgw.pem file. For example:
root@master #
cat rgw.key >> /srv/salt/ceph/rgw/cert/rgw.pem
13.6.2 Simple HTTPS Configuration #
By default, Ceph on the Object Gateway node reads the
/etc/ceph/rgw.pem
certificate, and uses port 443 for
secure SSL communication. If you do not need to change these values, follow
these steps:
Edit /srv/pillar/ceph/stack/global.yml and add the following line:
rgw_init: default-ssl
Copy the default Object Gateway SSL configuration to the ceph.conf.d subdirectory:
root@master #
cp /srv/salt/ceph/configuration/files/rgw-ssl.conf \
 /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf
Run DeepSea Stages 2, 3, and 4 to apply the changes:
root@master #
salt-run state.orch ceph.stage.2
root@master #
salt-run state.orch ceph.stage.3
root@master #
salt-run state.orch ceph.stage.4
13.6.3 Advanced HTTPS Configuration #
If you need to change the default values for SSL settings of the Object Gateway, follow these steps:
Edit /srv/pillar/ceph/stack/global.yml and add the following line:
rgw_init: default-ssl
Copy the default Object Gateway SSL configuration to the ceph.conf.d subdirectory:
root@master #
cp /srv/salt/ceph/configuration/files/rgw-ssl.conf \
 /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf
Edit /srv/salt/ceph/configuration/files/ceph.conf.d/rgw.conf and change the default options, such as the port number or the path to the SSL certificate, to reflect your setup.
Run DeepSea Stages 3 and 4 to apply the changes:
root@master #
salt-run state.orch ceph.stage.3
root@master #
salt-run state.orch ceph.stage.4
Tip: Binding to Multiple Ports
The CivetWeb server can bind to multiple ports. This is useful if you need to access a single Object Gateway instance with both SSL and non-SSL connections. When specifying the ports, separate their numbers by a plus sign '+'. A two-port configuration line example follows:
[client.{{ client }}]
rgw_frontends = civetweb port=80+443s ssl_certificate=/etc/ceph/rgw.pem
13.7 Sync Modules #
The multisite functionality of the Object Gateway, introduced in Jewel, allows you to create multiple zones and mirror data and metadata between them. Sync modules are built on top of the multisite framework and allow data and metadata to be forwarded to a different external tier. A sync module allows a set of actions to be performed whenever a change in data occurs (metadata operations such as bucket or user creation are also regarded as changes in data). Because multisite changes are eventually consistent at remote sites, they are propagated asynchronously. This enables use cases such as backing up the object storage to an external cloud cluster, a custom backup solution using tape drives, or indexing metadata in Elasticsearch.
13.7.1 Synchronizing Zones #
A sync module configuration is local to a zone. The sync module determines whether the zone exports data or can only consume data that was modified in another zone. As of Luminous, the supported sync plug-ins are elasticsearch; rgw, which is the default sync plug-in that synchronizes data between the zones; and log, which is a trivial sync plug-in that logs the metadata operations that happen in the remote zones. The following sections use the example of a zone with the elasticsearch sync module. The process is similar for configuring any other sync plug-in.
Note: Default Sync Plug-in
rgw is the default sync plug-in and there is no need to explicitly configure it.
13.7.1.1 Requirements and Assumptions #
Let us assume a simple multisite configuration as described in Section 13.11, “Multisite Object Gateways”, consisting of the two zones us-east and us-west. Now we add a third zone, us-east-es, which only processes metadata from the other sites. This zone can be in the same Ceph cluster as us-east or in a different one. It only consumes metadata from other zones, and Object Gateways in this zone do not serve any end user requests directly.
13.7.1.2 Configuring Sync Modules #
Create the third zone similar to the ones described in Section 13.11, “Multisite Object Gateways”, for example:
cephadm >
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-es \
 --access-key={system-key} --secret={secret} --endpoints=http://rgw-es:80
A sync module can be configured for this zone with the following command:
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --tier-type={tier-type} \
 --tier-config={set of key=value pairs}
For example, for the elasticsearch sync module:
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --tier-type=elasticsearch \
 --tier-config=endpoint=http://localhost:9200,num_shards=10,num_replicas=1
For the various supported tier-config options, refer to Section 13.7.2, “Storing Metadata in Elasticsearch”.
Finally, update the period:
cephadm >
radosgw-admin period update --commit
Now start the radosgw in the zone:
root #
systemctl start ceph-radosgw@rgw.`hostname -s`
root #
systemctl enable ceph-radosgw@rgw.`hostname -s`
13.7.2 Storing Metadata in Elasticsearch #
This sync module writes the metadata from other zones to Elasticsearch. As of Luminous, the following JSON shows the data fields that are currently stored in Elasticsearch:
{
  "_index" : "rgw-gold-ee5863d6",
  "_type" : "object",
  "_id" : "34137443-8592-48d9-8ca7-160255d52ade.34137.1:object1:null",
  "_score" : 1.0,
  "_source" : {
    "bucket" : "testbucket123",
    "name" : "object1",
    "instance" : "null",
    "versioned_epoch" : 0,
    "owner" : {
      "id" : "user1",
      "display_name" : "user1"
    },
    "permissions" : [
      "user1"
    ],
    "meta" : {
      "size" : 712354,
      "mtime" : "2017-05-04T12:54:16.462Z",
      "etag" : "7ac66c0f148de9519b8bd264312c4d64"
    }
  }
}
13.7.2.1 Elasticsearch Tier Type Configuration Parameters #
- endpoint
Specifies the Elasticsearch server endpoint to access.
- num_shards
(integer) The number of shards that Elasticsearch will be configured with on data sync initialization. Note that this cannot be changed after initialization. Any change here requires rebuild of the Elasticsearch index and reinitialization of the data sync process.
- num_replicas
(integer) The number of the replicas that Elasticsearch will be configured with on data sync initialization.
- explicit_custom_meta
(true | false) Specifies whether all user custom metadata will be indexed, or whether the user will need to configure (at the bucket level) which custom metadata entries should be indexed. This is false by default.
- index_buckets_list
(comma separated list of strings) If empty, all buckets will be indexed. Otherwise, only buckets specified here will be indexed. It is possible to provide bucket prefixes (for example 'foo*'), or bucket suffixes (for example '*bar').
- approved_owners_list
(comma separated list of strings) If empty, buckets of all owners will be indexed (subject to other restrictions), otherwise, only buckets owned by specified owners will be indexed. Suffixes and prefixes can also be provided.
- override_index_path
(string) If not empty, this string will be used as the Elasticsearch index path. Otherwise the index path will be determined and generated on sync initialization.
13.7.2.2 Metadata Queries #
Since the Elasticsearch cluster now stores object metadata, it is important that the Elasticsearch endpoint is not exposed to the public and is only accessible to cluster administrators. Exposing metadata queries to end users themselves is problematic: users should only be able to query their own metadata and not that of other users, which would require the Elasticsearch cluster to authenticate users in a way similar to RGW.
As of Luminous, RGW in the metadata master zone can service end user requests. This avoids exposing the Elasticsearch endpoint in public and also solves the authentication and authorization problem, because RGW itself can authenticate the end user requests. For this purpose, RGW introduces a new query in the bucket APIs that can service Elasticsearch requests. All these requests must be sent to the metadata master zone.
- Get an Elasticsearch Query
GET /BUCKET?query={query-expr}
request params:
max-keys: max number of entries to return
marker: pagination marker
expression := [(]<arg> <op> <value> [)][<and|or> ...]
op is one of the following: <, <=, ==, >=, >
For example:
GET /?query=name==foo
This returns all the indexed keys that the user has read permission to and that are named 'foo'. The output is a list of keys in XML, similar to the S3 list buckets response.
- Configure custom metadata fields
Define which custom metadata entries should be indexed (under the specified bucket), and what the types of these keys are. If explicit custom metadata indexing is configured, this is needed so that rgw indexes the specified custom metadata values. Otherwise, it is needed in cases where the indexed metadata keys are of a type other than string.
POST /BUCKET?mdsearch
x-amz-meta-search: <key [; type]> [, ...]
Multiple metadata fields must be comma separated. A type can be forced for a field with a semicolon (;). The currently allowed types are string (default), integer, and date. For example, to index the custom object metadata x-amz-meta-year as an integer, x-amz-meta-release-date as a date, and x-amz-meta-title as a string, send the following:
POST /mybooks?mdsearch
x-amz-meta-search: x-amz-meta-year;int, x-amz-meta-release-date;date, x-amz-meta-title;string
- Delete custom metadata configuration
Delete custom metadata bucket configuration.
DELETE /BUCKET?mdsearch
- Get custom metadata configuration
Retrieve custom metadata bucket configuration.
GET /BUCKET?mdsearch
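The query request described above takes its expression and paging options as URL parameters, so the expression needs to be URL-encoded. The following Python 3 sketch builds only the request path. The function name mdsearch_query_path is hypothetical, and the sketch assumes that signing and sending the request is handled by your S3 client.

```python
from urllib.parse import quote, urlencode


def mdsearch_query_path(bucket, expression, max_keys=None, marker=None):
    """Return the path and query string for a metadata search request."""
    params = [("query", expression)]
    if max_keys is not None:
        params.append(("max-keys", str(max_keys)))  # maximum number of entries
    if marker is not None:
        params.append(("marker", marker))  # pagination marker
    return "/" + quote(bucket) + "?" + urlencode(params)


print(mdsearch_query_path("mybooks", "name==foo", max_keys=100))
```

The equals signs in the expression are percent-encoded (%3D) so that they are not mistaken for parameter separators.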
13.8 LDAP Authentication #
Apart from the default local user authentication, Object Gateway can use LDAP server services to authenticate users as well.
13.8.1 Authentication Mechanism #
The Object Gateway extracts the user's LDAP credentials from a token. A search filter is constructed from the user name. The Object Gateway uses the configured service account to search the directory for a matching entry. If an entry is found, the Object Gateway attempts to bind to the found distinguished name with the password from the token. If the credentials are valid, the bind will succeed, and the Object Gateway grants access.
You can limit the allowed users by setting the base for the search to a specific organizational unit or by specifying a custom search filter, for example requiring specific group membership, custom object classes, or attributes.
13.8.2 Requirements #
LDAP or Active Directory: A running LDAP instance accessible by the Object Gateway.
Service account: LDAP credentials to be used by the Object Gateway with search permissions.
User account: At least one user account in the LDAP directory.
Important: Do Not Overlap LDAP and Local Users
You should not use the same user names for local users and for users being authenticated by using LDAP. The Object Gateway cannot distinguish them and it treats them as the same user.
Tip: Sanity Checks
Use the ldapsearch
utility to verify the service
account or the LDAP connection. For example:
cephadm >
ldapsearch -x -D "uid=ceph,ou=system,dc=example,dc=com" -W \
-H ldaps://example.com -b "ou=users,dc=example,dc=com" 'uid=*' dn
Make sure to use the same LDAP parameters as in the Ceph configuration file to eliminate possible problems.
13.8.3 Configure Object Gateway to Use LDAP Authentication #
The following parameters in the /etc/ceph/ceph.conf
configuration file are related to the LDAP authentication:
rgw_ldap_uri
Specifies the LDAP server to use. Make sure to use the ldaps://fqdn:port parameter to avoid transmitting the plain text credentials openly.
rgw_ldap_binddn
The Distinguished Name (DN) of the service account used by the Object Gateway.
rgw_ldap_secret
The password for the service account.
rgw_ldap_searchdn
Specifies the base in the directory information tree for searching users. This might be your users' organizational unit or some more specific Organizational Unit (OU).
rgw_ldap_dnattr
The attribute being used in the constructed search filter to match a user name. Depending on your Directory Information Tree (DIT), this would probably be uid or cn.
rgw_search_filter
If not specified, the Object Gateway automatically constructs the search filter with the rgw_ldap_dnattr setting. Use this parameter to narrow the list of allowed users in very flexible ways. Consult Section 13.8.4, “Using a Custom Search Filter to Limit User Access” for details.
13.8.4 Using a Custom Search Filter to Limit User Access #
There are two ways you can use the rgw_search_filter
parameter.
13.8.4.1 Partial Filter to Further Limit the Constructed Search Filter #
An example of a partial filter:
"objectclass=inetorgperson"
The Object Gateway will generate the search filter as usual with the user name from
the token and the value of rgw_ldap_dnattr
. The
constructed filter is then combined with the partial filter from the
rgw_search_filter
attribute. Depending on the user name
and the settings the final search filter may become:
"(&(uid=hari)(objectclass=inetorgperson))"
In that case, user 'hari' will only be granted access if the user is found in the LDAP directory, has an object class of 'inetorgperson', and specified a valid password.
13.8.4.2 Complete Filter #
A complete filter must contain a USERNAME
token which
will be substituted with the user name during the authentication attempt.
The rgw_ldap_dnattr
parameter is not used anymore in this
case. For example, to limit valid users to a specific group, use the
following filter:
"(&(uid=USERNAME)(memberOf=cn=ceph-users,ou=groups,dc=mycompany,dc=com))"
Note: memberOf Attribute
Using the memberOf attribute in LDAP searches requires server-side support from your specific LDAP server implementation.
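The two filter modes can be summarized in a short Python 3 sketch. It is an illustration of the behavior described in this section, not the Object Gateway's actual implementation. The function name build_search_filter is hypothetical, and the sketch does not escape special characters in user names.

```python
def build_search_filter(username, dnattr="uid", search_filter=None):
    """Combine a user name with an optional partial or complete filter."""
    constructed = "(%s=%s)" % (dnattr, username)
    if not search_filter:
        # No custom filter: only the constructed filter is used.
        return constructed
    if "USERNAME" in search_filter:
        # Complete filter: substitute the token, dnattr is not used.
        return search_filter.replace("USERNAME", username)
    # Partial filter: AND it with the constructed filter.
    partial = search_filter
    if not partial.startswith("("):
        partial = "(%s)" % partial
    return "(&%s%s)" % (constructed, partial)


print(build_search_filter("hari", "uid", "objectclass=inetorgperson"))
```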
13.8.5 Generating an Access Token for LDAP authentication #
The radosgw-token
utility generates the access token
based on the LDAP user name and password. It outputs a base-64 encoded
string which is the actual access token. Use your favorite S3 client (refer
to Section 13.5.1, “Accessing Object Gateway”) and specify the token as the
access key and use an empty secret key.
cephadm >
export RGW_ACCESS_KEY_ID="username"
cephadm >
export RGW_SECRET_ACCESS_KEY="password"
cephadm >
radosgw-token --encode --ttype=ldap
Important: Clear Text Credentials
The access token is a base-64 encoded JSON structure and contains the LDAP credentials in clear text.
Note: Active Directory
For Active Directory, use the --ttype=ad
parameter.
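To illustrate why the token must be treated like a password, the following Python 3 sketch shows that base-64 encoding is trivially reversible. The field names in the sample payload are placeholders chosen for this example. They are an assumption and do not reproduce the exact structure that radosgw-token emits.

```python
import base64
import json


def decode_token(token):
    """Decode a base-64 encoded JSON token back into a Python object."""
    return json.loads(base64.b64decode(token).decode("utf-8"))


# Hypothetical payload; real tokens use their own field layout.
payload = {"id": "username", "key": "password"}
token = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")

print(token)
print(decode_token(token))
```

Anyone who obtains the token can recover the credentials in the same way, so only send it over encrypted connections.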
13.9 Bucket Index Sharding #
The Object Gateway stores bucket index data in an index pool, which defaults to
.rgw.buckets.index
. If you put too many (hundreds of
thousands) objects into a single bucket and the quota for maximum number of
objects per bucket (rgw bucket default quota max objects
)
is not set, the performance of the index pool may degrade. Bucket
index sharding prevents such performance decreases and allows a
high number of objects per bucket.
13.9.1 Bucket Index Resharding #
If a bucket has grown large and its initial configuration is not sufficient anymore, the bucket's index pool needs to be resharded. You can either use automatic online bucket index resharding (refer to Section 13.9.1.1, “Dynamic Resharding”), or reshard the bucket index offline manually (refer to Section 13.9.1.2, “Manual Resharding”).
13.9.1.1 Dynamic Resharding #
Since SUSE Enterprise Storage 5.5, we support online bucket resharding. It detects if the number of objects per bucket reaches a certain threshold, and automatically increases the number of shards used by the bucket index. This process reduces the number of entries in each bucket index shard.
The detection process runs:
When new objects are added to the bucket.
In a background process that periodically scans all the buckets. This is needed in order to deal with existing buckets that are not being updated.
A bucket that requires resharding is added to the
reshard_log
queue and will be scheduled to be resharded
later. The reshard threads run in the background and execute the scheduled
resharding, one at a time.
Configuring Dynamic Resharding #
rgw_dynamic_resharding
Enables or disables dynamic bucket index resharding. Possible values are 'true' or 'false'. Defaults to 'true'.
rgw_reshard_num_logs
Number of shards for the resharding log. Defaults to 16.
rgw_reshard_bucket_lock_duration
Duration of lock on the bucket object during resharding. Defaults to 120 seconds.
rgw_max_objs_per_shard
Maximum number of objects per bucket index shard. Defaults to 100000 objects.
rgw_reshard_thread_interval
Maximum time between rounds of reshard thread processing. Defaults to 600 seconds.
Important: Multisite Configurations
Dynamic resharding is not supported in multisite environments. It is disabled by default since Ceph 12.2.2, but we recommend that you double-check the setting.
Commands to Administer the Resharding Process #
- Add a bucket to the resharding queue:
cephadm >
radosgw-admin reshard add \
 --bucket BUCKET_NAME \
 --num-shards NEW_NUMBER_OF_SHARDS
- List resharding queue:
cephadm >
radosgw-admin reshard list
- Process / schedule a bucket resharding:
cephadm >
radosgw-admin reshard process
- Display the bucket resharding status:
cephadm >
radosgw-admin reshard status --bucket BUCKET_NAME
- Cancel pending bucket resharding:
cephadm >
radosgw-admin reshard cancel --bucket BUCKET_NAME
13.9.1.2 Manual Resharding #
Dynamic resharding mentioned in Section 13.9.1.1, “Dynamic Resharding” is supported only for simple Object Gateway configurations. For multisite configurations, use manual resharding described in this section.
To reshard the bucket index manually offline, use the following command:
cephadm >
radosgw-admin bucket reshard
The bucket reshard
command performs the following:
Creates a new set of bucket index objects for the specified object.
Spreads all object entries across these index objects.
Creates a new bucket instance.
Links the new bucket instance with the bucket so that all new index operations go through the new bucket indexes.
Prints the old and the new bucket ID to the standard output.
Procedure 13.1: Resharding the Bucket Index Pool #
Make sure that all operations to the bucket are stopped.
Back up the original bucket index:
cephadm >
radosgw-admin bi list \
 --bucket=BUCKET_NAME \
 > BUCKET_NAME.list.backup
Reshard the bucket index:
cephadm >
radosgw-admin bucket reshard \
 --bucket=BUCKET_NAME \
 --num-shards=NEW_SHARDS_NUMBER
Tip: Old Bucket ID
As part of its output, this command also prints the new and the old bucket ID. Note down the old bucket ID; you will need it to purge the old bucket index objects.
Verify that the objects are listed correctly by comparing the old bucket index listing with the new one. Then purge the old bucket index objects:
cephadm >
radosgw-admin bi purge --bucket=BUCKET_NAME --bucket-id=OLD_BUCKET_ID
13.9.2 Bucket Index Sharding for New Buckets #
There are two options that affect bucket index sharding:
Use the rgw_override_bucket_index_max_shards option for simple configurations.
Use the bucket_index_max_shards option for multisite configurations.
Setting the options to 0
disables bucket index sharding.
A value greater than 0
enables bucket index sharding and
sets the maximum number of shards.
The following formula helps you calculate the recommended number of shards:
number_of_objects_expected_in_a_bucket / 100000
Be aware that the maximum number of shards is 7877.
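The formula and the upper limit can be combined in a small Python 3 helper. This is a sketch only: rounding up to the next whole shard and returning at least one shard are assumptions made here, and the function name recommended_shards is hypothetical.

```python
OBJECTS_PER_SHARD = 100000  # divisor from the formula above
MAX_SHARDS = 7877           # maximum number of shards


def recommended_shards(expected_objects):
    """Return a shard count for the expected number of objects in a bucket."""
    if expected_objects < 0:
        raise ValueError("expected_objects must not be negative")
    # Integer ceiling division avoids floating point rounding.
    shards = (expected_objects + OBJECTS_PER_SHARD - 1) // OBJECTS_PER_SHARD
    return max(1, min(shards, MAX_SHARDS))


print(recommended_shards(1200000))
```

For a bucket expected to hold 1,200,000 objects, the helper returns 12.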
13.9.2.1 Simple Configurations #
Open the Ceph configuration file and add or modify the following option:
rgw_override_bucket_index_max_shards = 12
Tip: All or One Object Gateway Instances
To configure bucket index sharding for all instances of the Object Gateway, include rgw_override_bucket_index_max_shards in the [global] section.
To configure bucket index sharding only for a particular instance of the Object Gateway, include rgw_override_bucket_index_max_shards in the related instance section.
Restart the Object Gateway. See Section 13.3, “Operating the Object Gateway Service” for more details.
13.9.2.2 Multisite Configurations #
Multisite configurations can have a different index pool to manage
failover. To configure a consistent shard count for zones in one zone
group, set the bucket_index_max_shards
option in the zone
group's configuration:
Export the zone group configuration to the zonegroup.json file:
cephadm >
radosgw-admin zonegroup get > zonegroup.json
Edit the zonegroup.json file and set the bucket_index_max_shards option for each named zone.
Reset the zone group:
cephadm >
radosgw-admin zonegroup set < zonegroup.json
Update the period:
cephadm >
radosgw-admin period update --commit
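The editing step can also be scripted. The following Python 3 sketch assumes that the exported zone group is a JSON object with a zones list in which each zone carries a bucket_index_max_shards key. Verify this against your own zonegroup.json before relying on it. The function name set_max_shards and the sample data are illustrative.

```python
import json


def set_max_shards(zonegroup, shards):
    """Set bucket_index_max_shards for every zone listed in the zone group."""
    for zone in zonegroup.get("zones", []):
        zone["bucket_index_max_shards"] = shards
    return zonegroup


# Trimmed sample; a real export contains many more fields.
sample = {
    "name": "us",
    "zones": [
        {"name": "us-east", "bucket_index_max_shards": 0},
        {"name": "us-west", "bucket_index_max_shards": 0},
    ],
}

print(json.dumps(set_max_shards(sample, 12), indent=2))
```

To apply it to the exported file, load zonegroup.json with json.load, call the function, and write the result back with json.dump before running the zonegroup set command.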
13.10 Integrating OpenStack Keystone #
OpenStack Keystone is an identity service for the OpenStack product. You can integrate the Object Gateway with Keystone to set up a gateway that accepts a Keystone authentication token. A user authorized by Keystone to access the gateway will be verified on the Ceph Object Gateway side and automatically created if needed. The Object Gateway queries Keystone periodically for a list of revoked tokens.
13.10.1 Configuring OpenStack #
Before configuring the Ceph Object Gateway, you need to configure the OpenStack Keystone to enable the Swift service and point it to the Ceph Object Gateway:
Set the Swift service. To use OpenStack to validate Swift users, first create the Swift service:
cephadm >
openstack service create \
 --name=swift \
 --description="Swift Service" \
 object-store
Set the endpoints. After you create the Swift service, point to the Ceph Object Gateway. Replace REGION_NAME with the name of the gateway's zone group or region.
cephadm >
openstack endpoint create --region REGION_NAME \
 --publicurl "http://radosgw.example.com:8080/swift/v1" \
 --adminurl "http://radosgw.example.com:8080/swift/v1" \
 --internalurl "http://radosgw.example.com:8080/swift/v1" \
 swift
Verify the settings. After you create the Swift service and set the endpoints, show the endpoints to verify that all the settings are correct.
cephadm >
openstack endpoint show object-store
13.10.2 Configuring the Ceph Object Gateway #
13.10.2.1 Configure SSL Certificates #
The Ceph Object Gateway queries Keystone periodically for a list of revoked tokens. These requests are encoded and signed. Keystone may be also configured to provide self-signed tokens, which are also encoded and signed. You need to configure the gateway so that it can decode and verify these signed messages. Therefore, the OpenSSL certificates that Keystone uses to create the requests need to be converted to the 'nss db' format:
root #
mkdir /var/ceph/nss
root #
openssl x509 -in /etc/keystone/ssl/certs/ca.pem \
 -pubkey | certutil -d /var/ceph/nss -A -n ca -t "TCu,Cu,Tuw"
root #
openssl x509 -in /etc/keystone/ssl/certs/signing_cert.pem \
 -pubkey | certutil -A -d /var/ceph/nss -n signing_cert -t "P,P,P"
OpenStack Keystone can use a self-signed SSL certificate. To allow the Ceph Object Gateway to interact with Keystone in that case, either install Keystone's SSL certificate on the node running the Ceph Object Gateway, or set the value of the option rgw keystone verify ssl to 'false'. Setting rgw keystone verify ssl to 'false' means that the gateway will not attempt to verify the certificate.
13.10.2.2 Configure the Object Gateway's Options #
You can configure Keystone integration using the following options:
rgw keystone api version
Version of the Keystone API. Valid options are 2 or 3. Defaults to 2.
rgw keystone url
The URL and port number of the administrative RESTful API on the Keystone server. Follows the pattern SERVER_URL:PORT_NUMBER.
rgw keystone admin token
The token or shared secret that is configured internally in Keystone for administrative requests.
rgw keystone accepted roles
The roles required to serve requests. Defaults to 'Member, admin'.
rgw keystone accepted admin roles
The list of roles allowing a user to gain administrative privileges.
rgw keystone token cache size
The maximum number of entries in the Keystone token cache.
rgw keystone revocation interval
The number of seconds before checking revoked tokens. Defaults to 15 * 60.
rgw keystone implicit tenants
Create new users in their own tenants of the same name. Defaults to 'false'.
rgw s3 auth use keystone
If set to 'true', the Ceph Object Gateway will authenticate users using Keystone. Defaults to 'false'.
nss db path
The path to the NSS database.
It is also possible to configure the Keystone service tenant, user, and password (for version 2.0 of the OpenStack Identity API), similar to the way OpenStack services tend to be configured. This way you can avoid setting the shared secret rgw keystone admin token in the configuration file, which should be disabled in production environments. The service tenant credentials should have admin privileges. For more details, refer to the official OpenStack Keystone documentation. The related configuration options follow:
rgw keystone admin user
The Keystone administrator user name.
rgw keystone admin password
The Keystone administrator user password.
rgw keystone admin tenant
The Keystone version 2.0 administrator user tenant.
A Ceph Object Gateway user is mapped to a Keystone tenant. A Keystone user has
different roles assigned to it, possibly on more than a single tenant.
When the Ceph Object Gateway gets the ticket, it looks at the tenant and the user roles
that are assigned to that ticket, and accepts or rejects the request
according to the setting of the rgw keystone accepted
roles
option.
Tip: Mapping to OpenStack Tenants
Although Swift tenants are mapped to the Object Gateway user by default, they can also be mapped to OpenStack tenants via the rgw keystone implicit tenants option. This makes containers use the tenant namespace instead of the S3-like global namespace that the Object Gateway defaults to. We recommend deciding on the mapping method at the planning stage to avoid confusion, because toggling the option later affects only newer requests, which get mapped under a tenant, while buckets created earlier remain in the global namespace.
For version 3 of the OpenStack Identity API, you should replace the
rgw keystone admin tenant
option with:
rgw keystone admin domain
The Keystone administrator user domain.
rgw keystone admin project
The Keystone administrator user project.
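The options above could be combined into a ceph.conf section along the following lines. This is a minimal sketch for version 3 of the Identity API, not a verified production configuration: the helper keystone_v3_section is not part of Ceph, and the section name client.rgw.gateway1, the Keystone URL and port, and the user, domain, and project values are hypothetical placeholders.

```shell
#!/bin/sh
# Print a ceph.conf section for Keystone v3 integration.
# Arguments (hypothetical example values): section name, Keystone URL,
# admin user, admin domain, admin project.
# The password is a placeholder on purpose; do not commit real secrets.
keystone_v3_section() {
    section=$1; url=$2; user=$3; domain=$4; project=$5
    cat <<EOF
[$section]
rgw keystone api version = 3
rgw keystone url = $url
rgw keystone admin user = $user
rgw keystone admin password = CHANGE_ME
rgw keystone admin domain = $domain
rgw keystone admin project = $project
rgw keystone accepted roles = Member, admin
rgw s3 auth use keystone = true
EOF
}

# Example call with placeholder values; review the output before
# adding it to the configuration file.
keystone_v3_section client.rgw.gateway1 http://keystone.example.com:5000 rgw Default service
```

The sketch uses the admin domain and admin project options in place of rgw keystone admin tenant, as described for version 3 above.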
13.11 Multisite Object Gateways #
You can configure each Object Gateway to participate in a federated architecture, working in an active zone configuration while allowing for writes to non-master zones.
13.11.1 Terminology #
A description of terms specific to a federated architecture follows:
- Zone
A logical grouping of one or more Object Gateway instances. There must be one zone designated as the master zone in a zonegroup, which handles all bucket and user creation.
- Zonegroup
A zonegroup consists of multiple zones. There should be a master zonegroup that will handle changes to the system configuration.
- Zonegroup map
A configuration structure that holds the map of the entire system, for example which zonegroup is the master, relationships between different zone groups, and certain configuration options such as storage policies.
- Realm
A container for zone groups. This allows for separation of zone groups between clusters. It is possible to create multiple realms, making it easier to run completely different configurations in the same cluster.
- Period
A period holds the configuration structure for the current state of the realm. Every period contains a unique ID and an epoch. Every realm has an associated current period, holding the current state of configuration of the zone groups and storage policies. Any configuration change for a non-master zone will increment the period's epoch. Changing the master zone to a different zone will trigger the following changes:
  - A new period is generated with a new period ID and epoch of 1.
  - Realm's current period is updated to point to the newly generated period ID.
  - Realm's epoch is incremented.
13.11.2 Example Cluster Setup #
In this example, we will focus on creating a single zone group with three separate zones, which actively synchronize their data. Two zones belong to the same cluster, while the third belongs to a different one. There is no synchronization agent involved in mirroring data changes between the Object Gateways. This allows for a much simpler configuration scheme and active-active configurations. Note that metadata operations—such as creating a new user—still need to go through the master zone. However, data operations—such as creation of buckets and objects—can be handled by any of the zones.
13.11.3 System Keys #
While configuring zones, Object Gateway expects creation of an S3-compatible system user together with their access and secret keys. This allows another Object Gateway instance to pull the configuration remotely with the access and secret keys. For more information on creating S3 users, see Section 13.5.2.1, “Adding S3 and Swift Users”.
Tip
It is useful to generate the access and secret keys before the zone creation itself because it makes scripting and use of configuration management tools easier later on.
For the purpose of this example, let us assume that the access and secret keys are set in the environment variables:
# SYSTEM_ACCESS_KEY=1555b35654ad1656d804
# SYSTEM_SECRET_KEY=h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q==
Generally, access keys consist of 20 alphanumeric characters, while secret keys consist of 40 alphanumeric characters (they can contain +/= characters as well). You can generate these keys in the command line:
# SYSTEM_ACCESS_KEY=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 20 | head -n 1)
# SYSTEM_SECRET_KEY=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 40 | head -n 1)
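The same idea can be sketched as a small helper that reads a fixed number of random bytes and then checks the resulting key lengths. The function name gen_key and the 1024-byte read size are assumptions made for this illustration. The keys are deliberately not printed.

```shell
#!/bin/sh
# Generate N random alphanumeric characters from a finite read of
# /dev/urandom. 1024 random bytes yield roughly 250 alphanumeric
# characters, which is far more than the 40 needed here.
gen_key() {
    head -c 1024 /dev/urandom \
        | LC_ALL=C tr -dc 'a-zA-Z0-9' \
        | cut -c1-"$1"
}

SYSTEM_ACCESS_KEY=$(gen_key 20)
SYSTEM_SECRET_KEY=$(gen_key 40)

# Sanity check on the lengths before the keys are used anywhere.
if [ "${#SYSTEM_ACCESS_KEY}" -eq 20 ] && [ "${#SYSTEM_SECRET_KEY}" -eq 40 ]; then
    echo "system keys generated"
else
    echo "key generation failed" >&2
fi
```

Reading a bounded number of bytes avoids relying on the pipeline being terminated by head, which can matter in scripts that treat a broken pipe as an error.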
13.11.4 Naming Conventions #
This example describes the process of setting up a master zone. We will assume a zonegroup called us spanning the United States, which will be our master zonegroup. This will contain three zones written in a zonegroup-zone format. This is our convention only and you can choose a format that you prefer. In summary:
- Master zonegroup: United States: us
- Master zone: United States, East Region 1: us-east-1
- Secondary zone: United States, East Region 2: us-east-2
- Secondary zone: United States, West Region: us-west
This will be a part of a larger realm named gold. The us-east-1 and us-east-2 zones are part of the same Ceph cluster, us-east-1 being the primary one. us-west is in a different Ceph cluster.
13.11.5 Default Pools #
When configured with the appropriate permissions, Object Gateway creates default pools on its own. The pg_num and pgp_num values are taken from the ceph.conf configuration file. Pools related to a zone by default follow the convention of zone-name.pool-name. For example, for the us-east-1 zone, these are the following pools:
.rgw.root
us-east-1.rgw.control
us-east-1.rgw.data.root
us-east-1.rgw.gc
us-east-1.rgw.log
us-east-1.rgw.intent-log
us-east-1.rgw.usage
us-east-1.rgw.users.keys
us-east-1.rgw.users.email
us-east-1.rgw.users.swift
us-east-1.rgw.users.uid
us-east-1.rgw.buckets.index
us-east-1.rgw.buckets.data
us-east-1.rgw.meta
These pools can be created in other zones as well, by replacing us-east-1 with the appropriate zone name.
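The naming convention can be illustrated with a short helper that prints the expected pool names for any zone. This is only a sketch: the function name zone_pools is hypothetical, and the suffix list is copied from the us-east-1 example above rather than queried from a running cluster.

```shell
#!/bin/sh
# Print the default pool names expected for a zone, following the
# zone-name.pool-name convention. .rgw.root carries no zone prefix.
zone_pools() {
    zone=$1
    echo ".rgw.root"
    for suffix in control data.root gc log intent-log usage \
        users.keys users.email users.swift users.uid \
        buckets.index buckets.data meta; do
        echo "$zone.rgw.$suffix"
    done
}

# Example: list the pool names for the us-east-2 zone.
zone_pools us-east-2
```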
13.11.6 Creating a Realm #
Configure a realm called gold
and make it the default
realm:
cephadm >
radosgw-admin realm create --rgw-realm=gold --default
{
"id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"name": "gold",
"current_period": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"epoch": 1
}
Note that every realm has an ID, which allows for flexibility such as
renaming the realm later if needed. The current_period
changes whenever we change anything in the master zone. The
epoch
is incremented when there is a change in the
master zone's configuration which results in a change of the current
period.
13.11.7 Deleting the Default Zonegroup #
The default installation of Object Gateway creates the default zonegroup called default. Because we no longer need the default zonegroup, remove it:
cephadm >
radosgw-admin zonegroup delete --rgw-zonegroup=default
13.11.8 Creating a Master Zonegroup #
Create a master zonegroup called us. The zonegroup will manage the zonegroup map and propagate changes to the rest of the system. By marking the zonegroup as default, you avoid having to specify the --rgw-zonegroup switch explicitly in later commands.
cephadm >
radosgw-admin zonegroup create --rgw-zonegroup=us \
--endpoints=http://rgw1:80 --master --default
{
"id": "d4018b8d-8c0d-4072-8919-608726fa369e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http:\/\/rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "",
"zones": [],
"placement_targets": [],
"default_placement": "",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
Alternatively, you can mark a zonegroup as default with the following command:
cephadm >
radosgw-admin zonegroup default --rgw-zonegroup=us
13.11.9 Creating a Master Zone #
Now create the master zone, make it the default zone, and add it to the default zonegroup. Note that you will use this zone for metadata operations such as user creation:
cephadm >
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-1 \
--endpoints=http://rgw1:80 --access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY \
--master --default
{
"id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"name": "us-east-1",
"domain_root": "us-east-1/gc.rgw.data.root",
"control_pool": "us-east-1/gc.rgw.control",
"gc_pool": "us-east-1/gc.rgw.gc",
"log_pool": "us-east-1/gc.rgw.log",
"intent_log_pool": "us-east-1/gc.rgw.intent-log",
"usage_log_pool": "us-east-1/gc.rgw.usage",
"user_keys_pool": "us-east-1/gc.rgw.users.keys",
"user_email_pool": "us-east-1/gc.rgw.users.email",
"user_swift_pool": "us-east-1/gc.rgw.users.swift",
"user_uid_pool": "us-east-1/gc.rgw.users.uid",
"system_key": {
"access_key": "1555b35654ad1656d804",
"secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "us-east-1/gc.rgw.buckets.index",
"data_pool": "us-east-1/gc.rgw.buckets.data",
"data_extra_pool": "us-east-1/gc.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"metadata_heap": "us-east-1/gc.rgw.meta",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
Note that the --rgw-zonegroup and --default switches add the zone to a zonegroup and make it the default zone. Alternatively, the same can also be done with the following commands:
cephadm >
radosgw-admin zone default --rgw-zone=us-east-1
cephadm >
radosgw-admin zonegroup add --rgw-zonegroup=us --rgw-zone=us-east-1
13.11.9.1 Creating System Users #
To access zone pools, you need to create a system user. Note that you will need these keys when configuring the secondary zone as well.
cephadm >
radosgw-admin user create --uid=zone.user \
--display-name="Zone User" --access-key=$SYSTEM_ACCESS_KEY \
--secret=$SYSTEM_SECRET_KEY --system
13.11.9.2 Update the Period #
Because you changed the master zone configuration, you need to commit the changes for them to take effect in the realm configuration structure. Initially, the period looks like this:
cephadm >
radosgw-admin period get
{
"id": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"epoch": 1,
"predecessor_uuid": "",
"sync_status": [],
"period_map": {
"id": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"zonegroups": [],
"short_zone_ids": []
},
"master_zonegroup": "",
"master_zone": "",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"realm_name": "gold",
"realm_epoch": 1
}
Update the period and commit the changes:
cephadm >
radosgw-admin period update --commit
{
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"epoch": 1,
"predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"sync_status": [ "[...]"
],
"period_map": {
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"zonegroups": [
{
"id": "d4018b8d-8c0d-4072-8919-608726fa369e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http:\/\/rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"zones": [
{
"id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"name": "us-east-1",
"endpoints": [
"http:\/\/rgw1:80"
],
"log_meta": "true",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
],
"short_zone_ids": [
{
"key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"val": 630926044
}
]
},
"master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"realm_name": "gold",
"realm_epoch": 2
}
13.11.9.3 Start the Object Gateway #
You need to specify the Object Gateway zone and port options in the configuration file before starting the Object Gateway. For more information on Object Gateway and its configuration, see Chapter 13, Ceph Object Gateway. The configuration section of Object Gateway should look similar to this:
[client.rgw.us-east-1]
rgw_frontends="civetweb port=80"
rgw_zone=us-east-1
Start the Object Gateway:
root #
systemctl start ceph-radosgw@rgw.us-east-1
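Because the same three-line section is needed for every zone in this example, it can be generated rather than typed. The following is a minimal sketch: the function name rgw_section is hypothetical, and it assumes that the gateway instance is named after its zone, as in this example.

```shell
#!/bin/sh
# Print the ceph.conf section for an Object Gateway serving one zone.
# Usage: rgw_section ZONE [PORT]   (PORT defaults to 80)
rgw_section() {
    zone=$1
    port=${2:-80}
    printf '[client.rgw.%s]\n' "$zone"
    printf 'rgw_frontends="civetweb port=%s"\n' "$port"
    printf 'rgw_zone=%s\n' "$zone"
}

# Example: print the section for the master zone of this example.
rgw_section us-east-1
```

The output is meant to be reviewed and then added to the configuration file by hand or by a configuration management tool.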
13.11.10 Creating a Secondary Zone #
In the same cluster, create and configure the secondary zone named us-east-2. You can execute all the following commands on the node hosting the master zone itself.
To create the secondary zone, use the same command as when you created the primary zone, but without the master flag:
cephadm >
radosgw-admin zone create --rgw-zonegroup=us --endpoints=http://rgw2:80 \
--rgw-zone=us-east-2 --access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
{
"id": "950c1a43-6836-41a2-a161-64777e07e8b8",
"name": "us-east-2",
"domain_root": "us-east-2.rgw.data.root",
"control_pool": "us-east-2.rgw.control",
"gc_pool": "us-east-2.rgw.gc",
"log_pool": "us-east-2.rgw.log",
"intent_log_pool": "us-east-2.rgw.intent-log",
"usage_log_pool": "us-east-2.rgw.usage",
"user_keys_pool": "us-east-2.rgw.users.keys",
"user_email_pool": "us-east-2.rgw.users.email",
"user_swift_pool": "us-east-2.rgw.users.swift",
"user_uid_pool": "us-east-2.rgw.users.uid",
"system_key": {
"access_key": "1555b35654ad1656d804",
"secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "us-east-2.rgw.buckets.index",
"data_pool": "us-east-2.rgw.buckets.data",
"data_extra_pool": "us-east-2.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"metadata_heap": "us-east-2.rgw.meta",
"realm_id": "815d74c2-80d6-4e63-8cfc-232037f7ff5c"
}
13.11.10.1 Update the Period #
Inform all the gateways of the new change in the system map by doing a period update and committing the changes:
cephadm >
radosgw-admin period update --commit
{
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"epoch": 2,
"predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"sync_status": [ "[...]"
],
"period_map": {
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"zonegroups": [
{
"id": "d4018b8d-8c0d-4072-8919-608726fa369e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http:\/\/rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"zones": [
{
"id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"name": "us-east-1",
"endpoints": [
"http:\/\/rgw1:80"
],
"log_meta": "true",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
},
{
"id": "950c1a43-6836-41a2-a161-64777e07e8b8",
"name": "us-east-2",
"endpoints": [
"http:\/\/rgw2:80"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
],
"short_zone_ids": [
{
"key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"val": 630926044
},
{
"key": "950c1a43-6836-41a2-a161-64777e07e8b8",
"val": 4276257543
}
]
},
"master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"realm_name": "gold",
"realm_epoch": 2
}
13.11.10.2 Start the Object Gateway #
Adjust the configuration of the Object Gateway for the secondary zone, and start it:
[client.rgw.us-east-2]
rgw_frontends="civetweb port=80"
rgw_zone=us-east-2
cephadm >
sudo systemctl start ceph-radosgw@rgw.us-east-2
13.11.11 Adding Object Gateway to the Second Cluster #
The second Ceph cluster belongs to the same zonegroup as the initial one, but may be geographically located elsewhere.
13.11.11.1 Default Realm and Zonegroup #
Since you already created the realm for the first gateway, pull the realm to the second cluster and make it the default there:
cephadm >
radosgw-admin realm pull --url=http://rgw1:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
{
"id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"name": "gold",
"current_period": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"epoch": 2
}
cephadm >
radosgw-admin realm default --rgw-realm=gold
Get the configuration from the master zone by pulling the period:
cephadm >
radosgw-admin period pull --url=http://rgw1:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
Set the default zonegroup to the already created us zonegroup:
cephadm >
radosgw-admin zonegroup default --rgw-zonegroup=us
13.11.11.2 Secondary Zone Configuration #
Create a new zone named us-west with the same system keys:
cephadm >
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-west \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY \
--endpoints=http://rgw3:80 --default
{
"id": "d9522067-cb7b-4129-8751-591e45815b16",
"name": "us-west",
"domain_root": "us-west.rgw.data.root",
"control_pool": "us-west.rgw.control",
"gc_pool": "us-west.rgw.gc",
"log_pool": "us-west.rgw.log",
"intent_log_pool": "us-west.rgw.intent-log",
"usage_log_pool": "us-west.rgw.usage",
"user_keys_pool": "us-west.rgw.users.keys",
"user_email_pool": "us-west.rgw.users.email",
"user_swift_pool": "us-west.rgw.users.swift",
"user_uid_pool": "us-west.rgw.users.uid",
"system_key": {
"access_key": "1555b35654ad1656d804",
"secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "us-west.rgw.buckets.index",
"data_pool": "us-west.rgw.buckets.data",
"data_extra_pool": "us-west.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"metadata_heap": "us-west.rgw.meta",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
13.11.11.3 Update the Period #
To propagate the zonegroup map changes, we update and commit the period:
cephadm >
radosgw-admin period update --commit --rgw-zone=us-west
{
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"epoch": 3,
"predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
"sync_status": [
"", # truncated
],
"period_map": {
"id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
"zonegroups": [
{
"id": "d4018b8d-8c0d-4072-8919-608726fa369e",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http:\/\/rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"zones": [
{
"id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"name": "us-east-1",
"endpoints": [
"http:\/\/rgw1:80"
],
"log_meta": "true",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
},
{
"id": "950c1a43-6836-41a2-a161-64777e07e8b8",
"name": "us-east-2",
"endpoints": [
"http:\/\/rgw2:80"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
},
{
"id": "d9522067-cb7b-4129-8751-591e45815b16",
"name": "us-west",
"endpoints": [
"http:\/\/rgw3:80"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}
],
"short_zone_ids": [
{
"key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"val": 630926044
},
{
"key": "950c1a43-6836-41a2-a161-64777e07e8b8",
"val": 4276257543
},
{
"key": "d9522067-cb7b-4129-8751-591e45815b16",
"val": 329470157
}
]
},
"master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
"master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
"period_config": {
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
},
"realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
"realm_name": "gold",
"realm_epoch": 2
}
Note that the period epoch number has incremented, indicating a change in the configuration.
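The steps for joining the second cluster can be collected into one script. The sketch below only strings together the commands shown in this section. The function names join_realm and run are hypothetical, and the script prints the commands by default (DRY_RUN=1) instead of executing them. Note that the printed commands include the keys, so the placeholder values used here should be replaced only in a protected environment.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder keys, used only if the variables are not already set.
SYSTEM_ACCESS_KEY=${SYSTEM_ACCESS_KEY:-EXAMPLE_ACCESS_KEY}
SYSTEM_SECRET_KEY=${SYSTEM_SECRET_KEY:-EXAMPLE_SECRET_KEY}

# join_realm MASTER_URL REALM ZONEGROUP ZONE ENDPOINT
join_realm() {
    master_url=$1; realm=$2; zonegroup=$3; zone=$4; endpoint=$5
    run radosgw-admin realm pull --url="$master_url" \
        --access-key="$SYSTEM_ACCESS_KEY" --secret="$SYSTEM_SECRET_KEY"
    run radosgw-admin realm default --rgw-realm="$realm"
    run radosgw-admin period pull --url="$master_url" \
        --access-key="$SYSTEM_ACCESS_KEY" --secret="$SYSTEM_SECRET_KEY"
    run radosgw-admin zonegroup default --rgw-zonegroup="$zonegroup"
    run radosgw-admin zone create --rgw-zonegroup="$zonegroup" --rgw-zone="$zone" \
        --access-key="$SYSTEM_ACCESS_KEY" --secret="$SYSTEM_SECRET_KEY" \
        --endpoints="$endpoint" --default
    run radosgw-admin period update --commit --rgw-zone="$zone"
}

# Values taken from the example in this section.
join_realm http://rgw1:80 gold us us-west http://rgw3:80
```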
13.11.11.4 Start the Object Gateway #
This is similar to starting the Object Gateway in the first zone. The only difference is that the Object Gateway zone configuration should reflect the us-west zone name:
[client.rgw.us-west]
rgw_frontends="civetweb port=80"
rgw_zone=us-west
Start the second Object Gateway:
root #
systemctl start ceph-radosgw@rgw.us-west
13.11.12 Failover and Disaster Recovery #
If the master zone should fail, fail over to the secondary zone for disaster recovery.
Make the secondary zone the master and default zone. For example:
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --master --default
By default, Ceph Object Gateway will run in an active-active configuration. If the cluster was configured to run in an active-passive configuration, the secondary zone is a read-only zone. Remove the --read-only status to allow the zone to receive write operations. For example:
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --master --default \
--read-only=False
Update the period to make the changes take effect.
cephadm >
radosgw-admin period update --commit
Finally, restart the Ceph Object Gateway.
root #
systemctl restart ceph-radosgw@rgw.`hostname -s`
If the former master zone recovers, revert the operation.
From the recovered zone, pull the period from the current master zone.
cephadm >
radosgw-admin period pull --url={url-to-master-zone-gateway} \
--access-key={access-key} --secret={secret}
Make the recovered zone the master and default zone.
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --master --default
Update the period to make the changes take effect.
cephadm >
radosgw-admin period update --commit
Then, restart the Ceph Object Gateway in the recovered zone.
root #
systemctl restart ceph-radosgw@rgw.`hostname -s`
If the secondary zone needs to be a read-only configuration, update the secondary zone.
cephadm >
radosgw-admin zone modify --rgw-zone={zone-name} --read-only
Update the period to make the changes take effect.
cephadm >
radosgw-admin period update --commit
Finally, restart the Ceph Object Gateway in the secondary zone.
root #
systemctl restart ceph-radosgw@rgw.`hostname -s`
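The promotion of a secondary zone can be sketched as a single function. This is an illustration of the sequence described above, not a tested recovery tool: the names promote_zone and run are hypothetical, the gateway instance name is passed as an argument (the procedure above derives it with hostname -s), and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# promote_zone ZONE INSTANCE [yes|no]
#   ZONE      zone to promote to master and default
#   INSTANCE  gateway instance name used in the systemd unit
#   yes       the zone was read-only (active-passive setup)
promote_zone() {
    zone=$1; instance=$2; was_read_only=${3:-no}
    if [ "$was_read_only" = "yes" ]; then
        run radosgw-admin zone modify --rgw-zone="$zone" --master --default \
            --read-only=False
    else
        run radosgw-admin zone modify --rgw-zone="$zone" --master --default
    fi
    run radosgw-admin period update --commit
    run systemctl restart "ceph-radosgw@rgw.$instance"
}

# Example with hypothetical names: promote us-east-2, served by the
# gateway instance "gateway2", which was previously read-only.
promote_zone us-east-2 gateway2 yes
```

Reverting after the former master recovers follows the same pattern, preceded by the period pull from the current master zone.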
13.12 Load Balancing the Object Gateway Servers with HAProxy #
You can use the HAProxy load balancer to distribute all requests across multiple Object Gateway back-end servers. Refer to https://documentation.suse.com/sle-ha/12-SP5/single-html/SLE-HA-guide/#sec-ha-lb-haproxy for more details on configuring HAProxy.
Following is a simple configuration of HAProxy for balancing Object Gateway nodes using round robin as the balancing algorithm:
cephadm >
cat /etc/haproxy/haproxy.cfg
[...]
frontend https_frontend
bind *:443 crt path-to-cert.pem [ciphers: ... ]
default_backend rgw
backend rgw
mode http
balance roundrobin
server rgw_server1 rgw-endpoint1 weight 1 maxconn 100 check
server rgw_server2 rgw-endpoint2 weight 1 maxconn 100 check
[...]
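When the number of gateways changes often, the backend section can be generated from a list of endpoints. The following sketch prints a backend block with the same settings as the example above. The function name haproxy_rgw_backend is hypothetical, and the endpoints rgw1:80 to rgw3:80 are assumed values borrowed from the multisite example.

```shell
#!/bin/sh
# Print an HAProxy backend section that balances the given Object
# Gateway endpoints with round robin, one "server" line per endpoint.
haproxy_rgw_backend() {
    echo "backend rgw"
    echo "    mode http"
    echo "    balance roundrobin"
    i=1
    for endpoint in "$@"; do
        echo "    server rgw_server$i $endpoint weight 1 maxconn 100 check"
        i=$((i + 1))
    done
}

# Example with assumed endpoints; paste the reviewed output into
# /etc/haproxy/haproxy.cfg.
haproxy_rgw_backend rgw1:80 rgw2:80 rgw3:80
```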
14 Ceph iSCSI Gateway #
This chapter focuses on administration tasks related to the iSCSI Gateway. For the deployment procedure, refer to Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway”.
14.1 Connecting to lrbd-managed Targets #
This section describes how to connect to lrbd-managed targets from clients running Linux, Microsoft Windows, or VMware.
14.1.1 Linux (open-iscsi) #
Connecting to lrbd-backed iSCSI targets with open-iscsi is a two-step process. First the initiator must discover the iSCSI targets available on the gateway host, then it must log in and map the available Logical Units (LUs).
Both steps require that the open-iscsi daemon is running. The way you start the open-iscsi daemon depends on your Linux distribution:
- On SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL) hosts, run systemctl start iscsid (or service iscsid start if systemctl is not available).
- On Debian and Ubuntu hosts, run systemctl start open-iscsi (or service open-iscsi start).
If your initiator host runs SUSE Linux Enterprise Server, refer to https://documentation.suse.com/sles/12-SP5/single-html/SLES-storage/#sec-iscsi-initiator for details on how to connect to an iSCSI target.
For any other Linux distribution supporting open-iscsi, proceed to discover targets on your lrbd gateway (this example uses iscsi1.example.com as the portal address; for multipath access repeat these steps with iscsi2.example.com):
root #
iscsiadm -m discovery -t sendtargets -p iscsi1.example.com
192.168.124.104:3260,1 iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol
Then, log in to the portal. If the login completes successfully, any RBD-backed logical units on the portal will immediately become available on the system SCSI bus:
root #
iscsiadm -m node -p iscsi1.example.com --login
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] successful.
Repeat this process for other portal IP addresses or hosts.
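For several portals, discovery and login can be wrapped in a loop. The sketch below reuses the two iscsiadm invocations shown above. The function names connect_portals and run are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Discover targets on each portal, then log in to it.
connect_portals() {
    for portal in "$@"; do
        run iscsiadm -m discovery -t sendtargets -p "$portal"
        run iscsiadm -m node -p "$portal" --login
    done
}

# Portal names taken from the example in this section.
connect_portals iscsi1.example.com iscsi2.example.com
```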
If your system has the lsscsi utility installed, you can use it to enumerate available SCSI devices on your system:
lsscsi
[8:0:0:0] disk SUSE RBD 4.0 /dev/sde
[9:0:0:0] disk SUSE RBD 4.0 /dev/sdf
In a multipath configuration (where two connected iSCSI devices represent
one and the same LU), you can also examine the multipath device state with
the multipath
utility:
root #
multipath -ll
360014050cf9dcfcb2603933ac3298dca dm-9 SUSE,RBD
size=49G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 8:0:0:0 sde 8:64 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
`- 9:0:0:0 sdf 8:80 active ready running
You can now use this multipath device as you would any block device. For example, you can use the device as a Physical Volume for Linux Logical Volume Management (LVM), or you can simply create a file system on it. The example below demonstrates how to create an XFS file system on the newly connected multipath iSCSI volume:
root #
mkfs -t xfs /dev/mapper/360014050cf9dcfcb2603933ac3298dca
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/mapper/360014050cf9dcfcb2603933ac3298dca isize=256 agcount=17, agsize=799744 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=12800000, imaxpct=25
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=6256, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Note that because XFS is a non-clustered file system, you can mount it on only one iSCSI initiator node at any given time.
If at any time you want to discontinue using the iSCSI LUs associated with a particular target, run the following command:
root #
iscsiadm -m node -p iscsi1.example.com --logout
Logging out of session [sid: 18, iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260]
Logout of [sid: 18, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] successful.
As with discovery and login, you must repeat the logout steps for all portal IP addresses or host names.
14.1.1.1 Multipath Configuration #
The multipath configuration is maintained on the clients or initiators and is independent of any lrbd configuration. Select a strategy prior to using block storage. After editing /etc/multipath.conf, restart multipathd with
root #
systemctl restart multipathd
For an active-passive configuration with friendly names, add
defaults {
  user_friendly_names yes
}
to your /etc/multipath.conf. After connecting to your targets successfully, run
root #
multipath -ll
mpathd (36001405dbb561b2b5e439f0aed2f8e1e) dm-0 SUSE,RBD
size=2.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 2:0:0:3 sdl 8:176 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 3:0:0:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
`- 4:0:0:3 sdk 8:160 active ready running
Note the status of each link. For an active-active configuration, add
defaults {
  user_friendly_names yes
}
devices {
  device {
    vendor "(LIO-ORG|SUSE)"
    product "RBD"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "1 alua"
    prio "alua"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry 12
    rr_min_io 100
  }
}
to your /etc/multipath.conf. Restart multipathd and run
root #
multipath -ll
mpathd (36001405dbb561b2b5e439f0aed2f8e1e) dm-3 SUSE,RBD
size=2.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
|- 4:0:0:3 sdj 8:144 active ready running
|- 3:0:0:3 sdk 8:160 active ready running
`- 2:0:0:3 sdl 8:176 active ready running
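To check quickly that every expected path is usable, the output of multipath -ll can be filtered for paths in the active ready running state. The helper below is a rough sketch based only on the output format shown above. The function name count_ready_paths is hypothetical, and the sample input is copied from the active-active example.

```shell
#!/bin/sh
# Count the paths reported as "active ready running" on standard input.
# grep exits non-zero when nothing matches, so "|| true" keeps the
# helper usable in scripts that stop on errors; the count is printed
# either way.
count_ready_paths() {
    grep -c 'active ready running' || true
}

# Example with the sample output from this section. On a live system,
# use: multipath -ll | count_ready_paths
count_ready_paths <<'EOF'
mpathd (36001405dbb561b2b5e439f0aed2f8e1e) dm-3 SUSE,RBD
size=2.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 4:0:0:3 sdj 8:144 active ready running
  |- 3:0:0:3 sdk 8:160 active ready running
  `- 2:0:0:3 sdl 8:176 active ready running
EOF
```

For the sample above, the helper reports three paths, matching the three gateways in the example.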
14.1.2 Microsoft Windows (Microsoft iSCSI initiator) #
To connect to a SUSE Enterprise Storage 5.5 iSCSI target from a Windows 2012 server, follow these steps:
Open Windows Server Manager. From the Dashboard, select › . The dialog appears. Select the tab:
Figure 14.1: iSCSI Initiator Properties #
In the dialog, enter the target's host name or IP address in the field and click :
Figure 14.2: Discover Target Portal #
Repeat this process for all other gateway host names or IP addresses. When completed, review the list:
Figure 14.3: Target Portals #
Next, switch to the tab and review your discovered target(s).
Figure 14.4: Targets #
Click in the tab. The dialog appears. Select the check box to enable multipath I/O (MPIO), then click :
When the dialog closes, select to review the target's properties:
Figure 14.5: iSCSI Target Properties #
Select , and click to review the multipath I/O configuration:
Figure 14.6: Device Details #
The default is . If you prefer a pure fail-over configuration, change it to .
This concludes the iSCSI initiator configuration. The iSCSI volumes are now available like any other SCSI devices, and may be initialized for use as volumes and drives. Click to close the dialog, and proceed with the role from the dashboard.
Observe the newly connected volume. It identifies as SUSE RBD SCSI Multi-Path Drive on the iSCSI bus, and is initially marked with an Offline status and a partition table type of Unknown. If the new volume does not appear immediately, select from the drop-down box to rescan the iSCSI bus.
Right-click on the iSCSI volume and select from the context menu. The appears. Click , highlight the newly connected iSCSI volume and click to begin.
Figure 14.7: New Volume Wizard #
Initially, the device is empty and does not contain a partition table. When prompted, confirm the dialog indicating that the volume will be initialized with a GPT partition table:
Figure 14.8: Offline Disk Prompt #
Select the volume size. Typically, you would use the device's full capacity. Then assign a drive letter or directory name where the newly created volume will become available. Then select a file system to create on the new volume, and finally confirm your selections with to finish creating the volume:
Figure 14.9: Confirm Volume Selections #
When the process finishes, review the results, then to conclude the drive initialization. Once initialization completes, the volume (and its NTFS file system) becomes available like a newly initialized local drive.
14.1.3 VMware #
To connect to lrbd-managed iSCSI volumes you need a configured iSCSI software adapter. If no such adapter is available in your vSphere configuration, create one by selecting › › › .
When available, select the adapter's properties by right-clicking the adapter and selecting from the context menu:
Figure 14.10: iSCSI Initiator Properties #
In the dialog, click the button. Then go to the tab and select .
Enter the IP address or host name of your lrbd iSCSI gateway. If you run multiple iSCSI gateways in a failover configuration, repeat this step for as many gateways as you operate.
Figure 14.11: Add Target Server #
When you have entered all iSCSI gateways, click in the dialog to initiate a rescan of the iSCSI adapter.
When the rescan completes, the new iSCSI device appears below the list in the pane. For multipath devices, you can now right-click on the adapter and select from the context menu:
Figure 14.12: Manage Multipath Devices #
You should now see all paths with a green light under . One of your paths should be marked and all others simply :
Figure 14.13: Paths Listing for Multipath #
You can now switch from to the item labeled . Select in the top-right corner of the pane to bring up the dialog. Then, select and click . The newly added iSCSI device appears in the list. Select it, then click to proceed:
Figure 14.14: Add Storage Dialog #
Click to accept the default disk layout.
In the pane, assign a name to the new datastore, and click . Accept the default setting to use the volume's entire space for the datastore, or select for a smaller datastore:
Figure 14.15: Custom Space Setting #
Click to complete the datastore creation.
The new datastore now appears in the datastore list and you can select it to retrieve details. You are now able to use the lrbd-backed iSCSI volume like any other vSphere datastore.
Figure 14.16: iSCSI Datastore Overview #
14.2 Conclusion #
lrbd is a key component of SUSE Enterprise Storage 5.5 that enables access to distributed, highly available block storage from any server or client capable of speaking the iSCSI protocol. By using lrbd on one or more iSCSI gateway hosts, Ceph RBD images become available as Logical Units (LUs) associated with iSCSI targets, which can be accessed in an optionally load-balanced, highly available fashion.
Since all of lrbd's configuration is stored in the Ceph RADOS object store, lrbd gateway hosts are inherently without persistent state and thus can be replaced, augmented, or reduced at will. As a result, SUSE Enterprise Storage 5.5 enables SUSE customers to run a truly distributed, highly available, resilient, and self-healing enterprise storage technology on commodity hardware and an entirely open source platform.
15 Clustered File System #
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Book “Deployment Guide”, Chapter 11 “Installation of CephFS”.
15.1 Mounting CephFS #
When the file system is created and the MDS is active, you are ready to mount the file system from a client host.
15.1.1 Client Preparation #
If the client host is running SUSE Linux Enterprise 12 SP2 or SP3, you can skip this section as the system is ready to mount CephFS 'out of the box'.
If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.
In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 5.5 product is not needed.
To support the full mount syntax, the ceph-common package (which is shipped with SUSE Linux Enterprise) should be installed before trying to mount CephFS.
15.1.2 Create a Secret File #
The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:
Procedure 15.1: Creating a Secret Key #
View the key for the particular user in a keyring file:
cephadm > cat /etc/ceph/ceph.client.admin.keyring
Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:
AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.
Paste the key value into the file created in the previous step.
Set proper access rights on the file. The user should be the only one who can read the file; others must not have any access rights.
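The manual steps of this procedure can also be scripted. The following is a minimal sketch, assuming the keyring uses the usual layout with a [client.USER] section header followed by a "key = VALUE" line. The function names extract_secret and write_secret_file are hypothetical helpers introduced for illustration only; they are not part of Ceph.

```shell
# Hypothetical helper: print the secret key stored for one user in a keyring file.
# Assumes a "[client.USER]" section header followed by a "key = VALUE" line.
# Usage: extract_secret KEYRING [USER]   (USER defaults to "admin")
extract_secret() {
    awk -v section="[client.${2:-admin}]" '
        $1 == section             { in_section = 1; next }  # entered the wanted section
        /^\[/                     { in_section = 0 }         # any other section header
        in_section && $1 == "key" { print $3; exit }         # line of the form: key = VALUE
    ' "$1"
}

# Hypothetical helper: write the secret to a file that only its owner can read.
# Usage: write_secret_file KEYRING USER TARGET
write_secret_file() {
    # The restrictive umask covers a newly created file ...
    ( umask 077 && extract_secret "$1" "$2" > "$3" )
    # ... and chmod covers a target file that already existed.
    chmod 600 "$3"
}

# Example invocation (run as root on the client):
#   write_secret_file /etc/ceph/ceph.client.admin.keyring admin /etc/ceph/admin.secret
```

The resulting file contains only the key itself, which is the format the secretfile mount option expects.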
15.1.3 Mount CephFS #
You can mount CephFS with the mount command. You need to specify the monitor host name or IP address. Because cephx authentication is enabled by default in SUSE Enterprise Storage 5.5, you need to specify a user name and the related secret as well:
root #
mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
As the previous command remains in the shell history, a more secure approach is to read the secret from a file:
root #
mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Note that the secret file should only contain the actual keyring secret. In our example, the file will then contain only the following line:
AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
Tip: Specify Multiple Monitors
It is a good idea to specify multiple monitors separated by commas on the mount command line in case one monitor happens to be down at the time of mount. Each monitor address takes the form host[:port]. If the port is not specified, it defaults to 6789.
Create the mount point on the local host:
root #
mkdir /mnt/cephfs
Mount the CephFS:
root #
mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
A subdirectory subdir may be specified if a subset of the file system is to be mounted:
root #
mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
You can specify more than one monitor host in the mount command:
root #
mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Important: Read Access to the Root Directory
If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:
client.bar
 key: supersecretkey
 caps: [mds] allow rw path=/barjail, allow r path=/
 caps: [mon] allow r
 caps: [osd] allow rwx
The allow r path=/ part means that path-restricted clients are able to see the root volume, but cannot write to it. This may be an issue for use cases where complete isolation is a requirement.
15.2 Unmounting CephFS #
To unmount CephFS, use the umount command:
root #
umount /mnt/cephfs
15.3 CephFS in /etc/fstab #
To mount CephFS automatically upon client start-up, insert the corresponding line in its file systems table /etc/fstab:
mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
15.4 Multiple Active MDS Daemons (Active-Active MDS) #
CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.
15.4.1 When to Use Active-Active MDS #
Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.
Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.
Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.
15.4.2 Increasing the MDS Active Cluster Size #
Each CephFS file system has a max_mds setting, which controls how many ranks will be created. The actual number of ranks in the file system will only be increased if a spare daemon is available to take on the new rank. For example, if there is only one MDS daemon running and max_mds is set to two, no second rank will be created.
In the following example, we set the max_mds option to 2 to create a new rank apart from the default one. To see the changes, run ceph status before and after you set max_mds, and watch the line containing fsmap:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
    [...]
cephadm > ceph mds set max_mds 2
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
    [...]
The newly created rank (1) passes through the 'creating' state and then enters its 'active' state.
Important: Standby Daemons
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.
Consequently, the practical maximum of max_mds for highly available systems is one less than the total number of MDS servers in your system. To remain available in the event of multiple server failures, increase the number of standby daemons in the system to match the number of server failures you need to survive.
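The sizing rule in this note is simple arithmetic: the number of active ranks is the total number of MDS daemons minus the number of standbys you want to keep. The sketch below illustrates it; max_active_mds is a hypothetical helper name and does not query the cluster.

```shell
# Hypothetical helper: largest max_mds value that still leaves the requested
# number of standby daemons.
# Usage: max_active_mds TOTAL_MDS_DAEMONS [SERVER_FAILURES_TO_SURVIVE]
# The second argument defaults to 1 (one standby daemon).
max_active_mds() {
    total=$1
    standbys=${2:-1}
    active=$((total - standbys))
    # A file system always needs at least one active rank. When this
    # clamp applies, the requested number of standbys cannot be met.
    if [ "$active" -lt 1 ]; then
        active=1
    fi
    printf '%s\n' "$active"
}

# Example: with 5 MDS daemons and 2 server failures to survive,
# "max_active_mds 5 2" prints the max_mds value to use.
```

With three MDS servers and the default of one standby, the helper prints 2, which matches the rule of thumb of one less than the total number of MDS servers.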
15.4.3 Decreasing the Number of Ranks #
All ranks, including the ranks to be removed, must first be active. This means that you need to have at least max_mds MDS daemons available.
First, set max_mds to a lower number. For example, go back to having a single active MDS:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
    [...]
cephadm > ceph mds set max_mds 1
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
    [...]
Note that we still have two active MDSs. The ranks still exist even though we have decreased max_mds, because max_mds only restricts the creation of new ranks.
Next, use the ceph mds deactivate rank command to remove the unneeded rank:
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:active}
cephadm > ceph mds deactivate 1
telling mds.1:1 192.168.58.101:6805/2799214375 to deactivate
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/1 up {0=node2=up:active,1=node1=up:stopping}
cephadm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
The deactivated rank first enters the 'stopping' state for a period of time while it hands off its share of the metadata to the remaining active daemons. This phase can take from seconds to minutes. If the MDS appears to be stuck in the 'stopping' state, investigate this as a possible bug.
If an MDS daemon crashes or is terminated while in the 'stopping' state, a standby will take over and the rank will go back to 'active'. You can try to deactivate it again when it has come back up.
When a daemon finishes stopping, it will start again and go back to being a standby.
15.4.4 Manually Pinning Directory Trees to a Rank #
In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.
The mechanism provided for this purpose is called an 'export pin'. It is an extended attribute of directories. The name of this extended attribute is ceph.dir.pin. Users can set this attribute using standard commands:
root # setfattr -n ceph.dir.pin -v 2 /path/to/dir
The value (-v) of the extended attribute is the rank to assign the directory sub-tree to. A default value of -1 indicates that the directory is not pinned.
A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:
root #
mkdir -p a/b # "a" and "a/b" start with no export pin set.
setfattr -n ceph.dir.pin -v 1 a/ # "a" and "b" are now pinned to rank 1.
setfattr -n ceph.dir.pin -v 0 a/b # "a/b" is now pinned to rank 0
# and "a/" and the rest of its children
# are still pinned to rank 1.
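The inheritance rule can be modeled without a running cluster. The following sketch is a plain shell simulation that does not read any extended attributes. The function effective_pin and its "path=rank" argument format are assumptions made for illustration. It walks from a directory up through its parents and reports the first explicit pin it finds.

```shell
# Hypothetical model of export pin inheritance (no Ceph calls involved).
# Usage: effective_pin DIR "path=rank" ["path=rank" ...]
# Prints the rank of the closest pinned directory, or -1 if none is pinned.
effective_pin() {
    dir=${1%/}          # drop a trailing slash, so "a/" is treated like "a"
    shift
    while [ -n "$dir" ]; do
        for pin in "$@"; do
            # An explicit pin on this directory wins over any parent pin.
            if [ "${pin%%=*}" = "$dir" ]; then
                printf '%s\n' "${pin#*=}"
                return 0
            fi
        done
        # Move on to the parent directory; stop after the top-most component.
        case $dir in
            */*) dir=${dir%/*} ;;
            *)   dir= ;;
        esac
    done
    printf '%s\n' "-1"
}

# Mirrors the example above: "a" is pinned to rank 1, "a/b" to rank 0.
#   effective_pin a/b a=1 a/b=0     # explicit pin on a/b
#   effective_pin a/c a=1 a/b=0     # inherited from a
```

In this model, a/b resolves to rank 0 through its own pin, while a/c has no pin of its own and resolves to rank 1, inherited from a.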
15.5 Managing Failover #
If an MDS daemon stops communicating with the monitor, the monitor will wait mds_beacon_grace seconds (default 15 seconds) before marking the daemon as laggy. You can configure one or more 'standby' daemons that will take over during the MDS daemon failover.
15.5.1 Configuring Standby Daemons #
There are several configuration settings that control how a daemon will behave while in standby. You can specify them in the ceph.conf on the host where the MDS daemon runs. The daemon loads these settings when it starts, and sends them to the monitor.
By default, if none of these settings are used, all MDS daemons which do not hold a rank will be used as 'standbys' for any rank.
The settings which associate a standby daemon with a particular name or rank do not guarantee that the daemon will only be used for that rank. They mean that when several standbys are available, the associated standby daemon will be used. If a rank is failed, and a standby is available, it will be used even if it is associated with a different rank or named daemon.
- mds_standby_replay
If set to true, then the standby daemon will continuously read the metadata journal of an up rank. This will give it a warm metadata cache, and speed up the process of failing over if the daemon serving the rank fails.
An up rank may only have one standby replay daemon assigned to it. If two daemons are both set to be standby replay, then one of them will arbitrarily win, and the other will become a normal non-replay standby.
When a daemon has entered the standby replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby replay daemon will not be used as a replacement, even if no other standbys are available.
- mds_standby_for_name
Set this to make the standby daemon only take over a failed rank if the last daemon to hold it matches this name.
- mds_standby_for_rank
Set this to make the standby daemon only take over the specified rank. If another rank fails, this daemon will not be used to replace it.
Use in conjunction with mds_standby_for_fscid to be specific about which file system's rank you are targeting in case of multiple file systems.
- mds_standby_for_fscid
If mds_standby_for_rank is set, this is simply a qualifier to say which file system's rank is being referred to.
If mds_standby_for_rank is not set, then setting FSCID will cause this daemon to target any rank in the specified FSCID. Use this if you have a daemon that you want to use for any rank, but only within a particular file system.
- mon_force_standby_active
This setting is used on monitor hosts. It defaults to true.
If it is false, then daemons configured with standby_replay=true will only become active if the rank/name that they have been configured to follow fails. On the other hand, if this setting is true, then a daemon configured with standby_replay=true may be assigned some other rank.
15.5.2 Examples #
Several example ceph.conf configurations follow. You can either copy a ceph.conf with the configuration of all daemons to all your servers, or you can have a different file on each server that contains that server's daemon configuration.
15.5.2.1 Simple Pair #
Two MDS daemons 'a' and 'b' acting as a pair. Whichever one is not currently assigned a rank will be the standby replay follower of the other.
[mds.a]
mds standby replay = true
mds standby for rank = 0

[mds.b]
mds standby replay = true
mds standby for rank = 0
16 NFS Ganesha: Export Ceph Data via NFS #
NFS Ganesha is an NFS server (refer to Sharing File Systems with NFS) that runs in a user address space instead of as part of the operating system kernel. With NFS Ganesha, you can plug in your own storage mechanism, such as Ceph, and access it from any NFS client.
S3 buckets are exported to NFS on a per-user basis, for example via the path GANESHA_NODE:/USERNAME/BUCKETNAME.
A CephFS is exported by default via the path GANESHA_NODE:/cephfs.
Note: NFS Ganesha Performance
Due to increased protocol overhead and additional latency caused by extra network hops between the client and the storage, accessing Ceph via an NFS Gateway may significantly reduce application performance when compared to native CephFS or Object Gateway clients.
16.1 Installation #
For installation instructions, see Book “Deployment Guide”, Chapter 12 “Installation of NFS Ganesha”.
16.2 Configuration #
For a list of all parameters available within the configuration file, see:
man ganesha-config
man ganesha-ceph-config for CephFS File System Abstraction Layer (FSAL) options.
man ganesha-rgw-config for Object Gateway FSAL options.
This section includes information to help you configure the NFS Ganesha server to export the cluster data accessible via Object Gateway and CephFS.
NFS Ganesha configuration is controlled by /etc/ganesha/ganesha.conf. Note that changes to this file are overwritten when DeepSea Stage 4 is executed. To persistently change the settings, edit the file /srv/salt/ceph/ganesha/files/ganesha.conf.j2 located on the Salt master.
16.2.1 Export Section #
This section describes how to configure the EXPORT sections in the ganesha.conf.
EXPORT {
  Export_Id = 1;
  Path = "/";
  Pseudo = "/";
  Access_Type = RW;
  Squash = No_Root_Squash;
  [...]
  FSAL {
    Name = CEPH;
  }
}
16.2.1.1 Export Main Section #
- Export_Id
Each export needs to have a unique 'Export_Id' (mandatory).
- Path
Export path in the related CephFS pool (mandatory). This allows subdirectories to be exported from the CephFS.
- Pseudo
Target NFS export path (mandatory for NFSv4). It defines under which NFS export path the exported data is available.
Example: with the value /cephfs/ and after executing
root # mount GANESHA_IP:/cephfs/ /mnt/
the CephFS data is available in the directory /mnt/cephfs/ on the client.
- Access_Type
'RO' for read-only access, 'RW' for read-write access, and 'None' for no access.
Tip: Limit Access to Clients
If you leave Access_Type = RW in the main EXPORT section and limit access to a specific client in the CLIENT section, other clients will be able to connect anyway. To disable access to all clients and enable access for specific clients only, set Access_Type = None in the EXPORT section and then specify a less restrictive access mode for one or more clients in the CLIENT section:
EXPORT {
  Access_Type = "none";
  [...]
  CLIENT {
    Clients = 192.168.124.9;
    Access_Type = "RW";
    [...]
  }
  [...]
}
- Squash
NFS squash option.
- FSAL
Exporting 'File System Abstraction Layer'. See Section 16.2.1.2, “FSAL Subsection”.
16.2.1.2 FSAL Subsection #
EXPORT {
  [...]
  FSAL {
    Name = CEPH;
  }
}
- Name
Defines which back-end NFS Ganesha uses. Allowed values are CEPH for CephFS or RGW for Object Gateway. Depending on the choice, a role-mds or role-rgw must be defined in the policy.cfg.
16.2.2 RGW Section #
RGW {
  ceph_conf = "/etc/ceph/ceph.conf";
  name = "name";
  cluster = "ceph";
}
- ceph_conf
Points to the ceph.conf file. When deploying with DeepSea, it is not necessary to change this value.
- name
The name of the Ceph client user used by NFS Ganesha.
- cluster
Name of the Ceph cluster. SUSE Enterprise Storage 5.5 currently only supports one cluster name, which is ceph by default.
16.2.3 Changing Default NFS Ganesha Ports #
NFS Ganesha uses the port 2049 for NFS and 875 for the rquota support by default. To change the default port numbers, use the NFS_Port and RQUOTA_Port options inside the NFS_CORE_PARAM section, for example:
NFS_CORE_PARAM {
  NFS_Port = 2060;
  RQUOTA_Port = 876;
}
16.3 Custom NFS Ganesha Roles #
Custom NFS Ganesha roles for cluster nodes can be defined. These roles are then assigned to nodes in the policy.cfg. The roles allow for:
Separated NFS Ganesha nodes for accessing Object Gateway and CephFS.
Assigning different Object Gateway users to NFS Ganesha nodes.
Having different Object Gateway users enables NFS Ganesha nodes to access different S3 buckets. S3 buckets can be used for access control. Note: S3 buckets are not to be confused with Ceph buckets used in the CRUSH Map.
16.3.1 Different Object Gateway Users for NFS Ganesha #
The following example procedure for the Salt master shows how to create two NFS Ganesha roles with different Object Gateway users. In this example, the roles gold and silver are used, for which DeepSea already provides example configuration files.
Open the file /srv/pillar/ceph/stack/global.yml with the editor of your choice. Create the file if it does not exist.
The file needs to contain the following lines:
rgw_configurations:
  - rgw
  - silver
  - gold
ganesha_configurations:
  - silver
  - gold
These roles can later be assigned in the policy.cfg.
Create a file /srv/salt/ceph/rgw/users/users.d/gold.yml and add the following content:
- { uid: "gold1", name: "gold1", email: "gold1@demo.nil" }
Create a file /srv/salt/ceph/rgw/users/users.d/silver.yml and add the following content:
- { uid: "silver1", name: "silver1", email: "silver1@demo.nil" }
Now, templates for the ganesha.conf need to be created for each role. The original template of DeepSea is a good start. Create two copies:
root # cd /srv/salt/ceph/ganesha/files/
root # cp ganesha.conf.j2 silver.conf.j2
root # cp ganesha.conf.j2 gold.conf.j2
The new roles require keyrings to access the cluster. To provide access, copy the ganesha.j2:
root # cp ganesha.j2 silver.j2
root # cp ganesha.j2 gold.j2
Copy the keyring for the Object Gateway:
root # cd /srv/salt/ceph/rgw/files/
root # cp rgw.j2 silver.j2
root # cp rgw.j2 gold.j2
Object Gateway also needs the configuration for the different roles:
root # cd /srv/salt/ceph/configuration/files/
root # cp ceph.conf.rgw silver.conf
root # cp ceph.conf.rgw gold.conf
Assign the newly created roles to cluster nodes in the /srv/pillar/ceph/proposals/policy.cfg:
role-silver/cluster/NODE1.sls
role-gold/cluster/NODE2.sls
Replace NODE1 and NODE2 with the names of the nodes to which you want to assign the roles.
Execute DeepSea Stages 0 to 4.
16.3.2 Separating CephFS and Object Gateway FSAL #
The following example procedure for the Salt master shows how to create two new, separate roles that use CephFS and Object Gateway:
Open the file /srv/pillar/ceph/rgw.sls with the editor of your choice. Create the file if it does not exist.
The file needs to contain the following lines:
rgw_configurations:
  ganesha_cfs:
    users:
      - { uid: "demo", name: "Demo", email: "demo@demo.nil" }
  ganesha_rgw:
    users:
      - { uid: "demo", name: "Demo", email: "demo@demo.nil" }
ganesha_configurations:
  - ganesha_cfs
  - ganesha_rgw
These roles can later be assigned in the policy.cfg.
Now, templates for the ganesha.conf need to be created for each role. The original template of DeepSea is a good start. Create two copies:
root # cd /srv/salt/ceph/ganesha/files/
root # cp ganesha.conf.j2 ganesha_rgw.conf.j2
root # cp ganesha.conf.j2 ganesha_cfs.conf.j2
Edit the ganesha_rgw.conf.j2 and remove the section:
{% if salt.saltutil.runner('select.minions', cluster='ceph', roles='mds') != [] %}
  [...]
{% endif %}
Edit the ganesha_cfs.conf.j2 and remove the section:
{% if salt.saltutil.runner('select.minions', cluster='ceph', roles=role) != [] %}
  [...]
{% endif %}
The new roles require keyrings to access the cluster. To provide access, copy the ganesha.j2:
root # cp ganesha.j2 ganesha_rgw.j2
root # cp ganesha.j2 ganesha_cfs.j2
The line caps mds = "allow *" can be removed from the ganesha_rgw.j2.
Copy the keyring for the Object Gateway:
root # cp /srv/salt/ceph/rgw/files/rgw.j2 \
 /srv/salt/ceph/rgw/files/ganesha_rgw.j2
Object Gateway needs the configuration for the new role:
root # cp /srv/salt/ceph/configuration/files/ceph.conf.rgw \
 /srv/salt/ceph/configuration/files/ceph.conf.ganesha_rgw
Assign the newly created roles to cluster nodes in the /srv/pillar/ceph/proposals/policy.cfg:
role-ganesha_rgw/cluster/NODE1.sls
role-ganesha_cfs/cluster/NODE2.sls
Replace NODE1 and NODE2 with the names of the nodes to which you want to assign the roles.
Execute DeepSea Stages 0 to 4.
16.3.3 Supported Operations #
The RGW NFS interface supports most operations on files and directories, with the following restrictions:
Links including symbolic links are not supported.
NFS access control lists (ACLs) are not supported. Unix user and group ownership and permissions are supported.
Directories may not be moved or renamed. You may move files between directories.
Only full, sequential write I/O is supported. Therefore, write operations are forced to be uploads. Many typical I/O operations, such as editing files in place, will necessarily fail as they perform non-sequential stores. There are file utilities that apparently write sequentially (for example, some versions of GNU tar), but may fail due to infrequent non-sequential stores. When mounting via NFS, an application's sequential I/O can generally be forced to sequential writes to the NFS server via synchronous mounting (the -o sync option). NFS clients that cannot mount synchronously (for example, Microsoft Windows*) will not be able to upload files.
NFS RGW supports read-write operations only for block sizes smaller than 4 MB.
16.4 Starting or Restarting NFS Ganesha #
To enable and start the NFS Ganesha service, run:
root # systemctl enable nfs-ganesha
root # systemctl start nfs-ganesha
Restart NFS Ganesha with:
root # systemctl restart nfs-ganesha
When NFS Ganesha is started or restarted, it has a grace timeout of 90 seconds for NFS v4. During the grace period, new requests from clients are actively rejected. Hence, clients may face a slowdown of requests when NFS is in grace state.
16.5 Setting the Log Level #
You can change the default debug level NIV_EVENT by editing the file /etc/sysconfig/nfs-ganesha. Replace NIV_EVENT with NIV_DEBUG or NIV_FULL_DEBUG. Increasing the log verbosity can produce large amounts of data in the log files.
OPTIONS="-L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT"
A restart of the service is required when changing the log level.
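If you switch log levels often, the edit can be scripted. The following is a minimal sketch, assuming the OPTIONS line has the form shown above with the level passed after -N. The function set_ganesha_log_level is a hypothetical helper, and it only accepts the three levels named in this section.

```shell
# Hypothetical helper: replace the level passed to -N in the OPTIONS line.
# Usage: set_ganesha_log_level LEVEL [FILE]
# FILE defaults to /etc/sysconfig/nfs-ganesha.
set_ganesha_log_level() {
    level=$1
    file=${2:-/etc/sysconfig/nfs-ganesha}
    # Accept only the levels mentioned in this section.
    case $level in
        NIV_EVENT|NIV_DEBUG|NIV_FULL_DEBUG) ;;
        *) echo "unsupported log level: $level" >&2; return 1 ;;
    esac
    # Keep a backup copy, then rewrite only the NIV_* token that follows -N.
    cp "$file" "$file.bak" &&
    sed "s/-N NIV_[A-Z_]*/-N $level/" "$file.bak" > "$file"
}

# Example (run as root), followed by the required service restart:
#   set_ganesha_log_level NIV_DEBUG && systemctl restart nfs-ganesha
```

The backup copy with the .bak suffix makes it easy to return to the previous setting after a debugging session.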
Note
NFS Ganesha uses Ceph client libraries to connect to the Ceph cluster. By default, client libraries do not log errors or any other output. To see more details about NFS Ganesha interacting with the Ceph cluster (for example, details of connection issues), logging needs to be explicitly defined in the ceph.conf configuration file under the [client] section. For example:
[client]
log_file = "/var/log/ceph/ceph-client.log"
16.6 Verifying the Exported NFS Share #
When using NFS v3, you can verify whether the NFS shares are exported on the NFS Ganesha server node:
root # showmount -e
/ (everything)
16.7 Mounting the Exported NFS Share #
To mount the exported NFS share (as configured in Section 16.2, “Configuration”) on a client host, run:
root # mount -t nfs -o rw,noatime,sync \
 nfs_ganesha_server_hostname:/ /path/to/local/mountpoint
16.8 Additional Resources #
The original NFS Ganesha documentation can be found at https://github.com/nfs-ganesha/nfs-ganesha/wiki/Docs.
Part IV Managing Cluster with GUI Tools #
17 openATTIC #
Tip: Calamari Removed
Calamari used to be the preferred Web UI application for managing and monitoring the Ceph cluster. Since SUSE Enterprise Storage 5.5, Calamari has been removed in favor of the more advanced openATTIC.
openATTIC is a central storage management system that supports Ceph storage clusters. With openATTIC, you can control everything from a central management interface. It is no longer necessary to be familiar with the inner workings of the Ceph storage tools. Cluster management tasks can be carried out either by using openATTIC's intuitive Web interface, or via its REST API.
17.1 openATTIC Deployment and Configuration #
This section introduces steps to deploy and configure openATTIC and its supported features so that you can administer your Ceph cluster using a user-friendly Web interface.
17.1.1 Enabling Secure Access to openATTIC using SSL #
Access to the openATTIC Web application uses non-secure HTTP protocol by default. To enable secure access to openATTIC, you need to configure the Apache Web server manually:
If you do not have an SSL certificate signed by a well-known certificate authority (CA), create a self-signed SSL certificate and copy its files to the directory where the Web server expects them, for example:
root # openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 \
 -keyout key.pem -out cert.pem
root # cp cert.pem /etc/ssl/certs/servercert.pem
root # cp key.pem /etc/ssl/certs/serverkey.pem
Refer to https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-apache2-ssl for more details on creating SSL certificates.
Add 'SSL' to the APACHE_SERVER_FLAGS option in the /etc/sysconfig/apache2 configuration file. You can do it manually, or run the following commands:
root # a2enmod ssl
root # a2enflag SSL
Create /etc/apache2/vhosts.d/vhost-ssl.conf for a new Apache virtual host with the following content:
<IfDefine SSL>
<IfDefine !NOSSL>
<VirtualHost *:80>
 ServerName OA_HOST_NAME
 Redirect "/" "https://OA_HOST_NAME/"
</VirtualHost>
<VirtualHost _default_:443>
 ServerName OA_HOST_NAME
 DocumentRoot "/srv/www/htdocs"
 ErrorLog /var/log/apache2/error_log
 TransferLog /var/log/apache2/access_log
 SSLEngine on
 SSLCertificateFile /etc/ssl/certs/servercert.pem
 SSLCertificateKeyFile /etc/ssl/certs/serverkey.pem
 CustomLog /var/log/apache2/ssl_request_log ssl_combined
</VirtualHost>
</IfDefine>
</IfDefine>
Restart the Web server to reload the new virtual host definition together with the certificate files:
root # systemctl restart apache2.service
17.1.2 Deploying openATTIC #
Since SUSE Enterprise Storage 5.5, openATTIC has been deployed as a DeepSea role. Refer to Chapter 1, Salt Cluster Administration for a general procedure.
17.1.3 openATTIC Initial Setup #
By default, oaconfig creates an administrative user account, openattic, with the same password as the user name. As a security precaution, we strongly recommend changing this password immediately:
cephadm > oaconfig changepassword openattic
Changing password for user 'openattic'
Password: <enter password>
Password (again): <re-enter password>
Password changed successfully for user 'openattic'
17.1.4 DeepSea Integration in openATTIC #
Some openATTIC features, such as iSCSI Gateway and Object Gateway management, make use of the DeepSea REST API. It is enabled and configured by default. If you need to override its default settings for debugging purposes, edit /etc/sysconfig/openattic and add or change the following lines:
SALT_API_HOST="salt_api_host"
SALT_API_PORT=8001
SALT_API_USERNAME="example_user"
SALT_API_PASSWORD="password"
Important: oaconfig restart
Remember to run oaconfig restart after you make changes to the /etc/sysconfig/openattic file.
Important: File Syntax
/etc/sysconfig/openattic is used in Python as well as Bash. Therefore, the file needs to be in a format that Bash can understand, and it is not possible to have spaces before or after the 'equals' signs.
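A quick check before running oaconfig restart can catch this kind of mistake. The sketch below is an illustration only: check_sysconfig_syntax is a hypothetical helper, not an openATTIC tool, and it flags only assignment lines that have whitespace directly before or after the equals sign.

```shell
# Hypothetical helper: report assignment lines with whitespace around "=".
# Usage: check_sysconfig_syntax FILE
# Prints offending lines with their line numbers and returns 1 if any exist.
check_sysconfig_syntax() {
    # Match NAME followed by either "<spaces>=" or "=<spaces>".
    if grep -nE '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*([[:space:]]+=|=[[:space:]]+)' "$1"; then
        return 1
    fi
    return 0
}

# Example:
#   check_sysconfig_syntax /etc/sysconfig/openattic && oaconfig restart
```

A line such as SALT_API_PORT = 8001 is reported, while SALT_API_PORT=8001 and quoted values that contain spaces pass the check.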
17.1.5 Object Gateway Management #
Object Gateway management features in openATTIC are enabled by default. If you need to override the default values for Object Gateway API as discovered from DeepSea, include the following options with relevant values in /etc/sysconfig/openattic. For example:
RGW_API_HOST="rgw_api_host"
RGW_API_PORT=80
RGW_API_SCHEME="http"
RGW_API_ACCESS_KEY="VFEG733GBY0DJCIV6NK0"
RGW_API_SECRET_KEY="lJzPbZYZTv8FzmJS5eiiZPHxlT2LMGOMW8ZAeOAq"
Note: Default Resource for Object Gateway
If your Object Gateway admin resource is not configured to use the default value 'admin' as used in 'http://rgw_host:80/admin', you need to also set the RGW_API_ADMIN_RESOURCE option appropriately.
To obtain Object Gateway credentials, use the radosgw-admin command:
cephadm > radosgw-admin user info --uid=admin
17.1.6 iSCSI Gateway Management #
iSCSI Gateway management features in openATTIC are enabled by default. If you need to override the default Salt API host name, change the SALT_API_HOST value as described in Section 17.1.4, “DeepSea Integration in openATTIC”.
17.2 openATTIC Web User Interface #
openATTIC can be managed using a Web user interface. Open a Web browser and navigate to http://SERVER_HOST/openattic. To log in, use the default user name openattic and the corresponding password.
Figure 17.1: openATTIC Login Screen #
The openATTIC user interface is graphically divided into a top menu pane and a content pane.
The right part of the top pane includes a link to the current user settings, and a
link, and links to the list of existing and system . The rest of the top pane includes the main openATTIC menu.
The content pane changes depending on which menu item is activated. By default, a
is displayed showing a number of widgets to inform you about the status of the cluster.
Figure 17.2: openATTIC Dashboard #
17.3 Dashboard #
Each
widget shows specific status information related to the running Ceph cluster. After clicking the title of a widget, the widget spreads across the whole content pane, possibly showing more details. A list of several widgets follows:
The
widget tells whether the cluster is operating correctly. In case a problem is detected, you can view the detailed error message by clicking the subtitle inside the widget.
The
, , , , , , and widgets simply show the related numbers.
Figure 17.3: Basic Widgets #
The following widgets deal with total and available storage capacity:
, , , and .
Figure 17.4: Capacity Widgets #
The following widgets deal with OSD and monitor node latency:
, , and :
Figure 17.5: Latency Widgets #
The
widget shows the read and write per second statistics in time.
Figure 17.6: Throughput #
Tip: More Details on Mouse Over
If you move the mouse pointer over any of the displayed charts, you will be shown more details related to the date and time pointed at in a pop-up window.
If you click in the chart area and then drag the mouse pointer to the left or right along the time axis, the time interval on the axis will be zoomed in to the interval you marked by moving the mouse. To zoom out back to the original scale, double-click the chart.
Within openATTIC there are options to display graphs for longer than 15 days. However, by default Prometheus only stores history for 15 days. You can adjust this behavior in /etc/systemd/system/multi-user.target.wants/prometheus.service.
Open /etc/systemd/system/multi-user.target.wants/prometheus.service. This file should reference the following:
EnvironmentFile=-/etc/sysconfig/prometheus
ExecStart=/usr/bin/prometheus $ARGS
If it does not, add the above two lines and include the following:
ARGS="--storage.tsdb.retention=90d \
  --log.level=warn"
Tip
Ensure ARGS is a multiline bash string. This enables Prometheus to store up to 90 days of data.
If you want other time options, the format is a number followed by a time multiplier, where the time multiplier can be h[ours], d[ays], w[eeks], or y[ears].
Restart the Prometheus service.
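Before editing the retention value, it can help to check that it follows the number-plus-multiplier format described in the tip. The following is a minimal sketch; the function name valid_retention is hypothetical, and it accepts only the four multipliers named in this guide.

```shell
# Hypothetical helper: accept values such as 12h, 90d, 8w, or 1y.
valid_retention() {
    # One or more digits followed by exactly one of the unit letters h, d, w, y.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[hdwy]$'
}

for value in 90d 26w 1y 90days; do
    if valid_retention "$value"; then
        echo "$value: ok"
    else
        echo "$value: not in <number><h|d|w|y> form"
    fi
done
```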
17.4 Ceph Related Tasks #
openATTIC's main menu lists Ceph related tasks. Currently, the following tasks are relevant:
, , , , , , , and .
17.4.1 Common Web UI Features #
In openATTIC, you often work with lists, for example, lists of pools, OSD nodes, or RBD devices. The following common widgets help you manage or adjust these lists:
Click to refresh the list of items.
Click to display or hide individual table columns.
Click and select how many rows to display on a single page.
Click inside and filter the rows by typing the string to search
for.
Use to change the currently displayed page if the list
spans across multiple pages.
17.4.2 Listing OSD Nodes #
To list all available OSD nodes, click
from the main menu.
The list shows each OSD's name, host name, status, weight, and storage back-end.
Figure 17.7: List of OSD nodes #
17.4.3 Managing RADOS Block Devices (RBDs) #
To list all available RADOS Block Devices, click
from the main menu.
The list shows each device's name, the related pool name, size of the device, and, if 'fast-diff' was enabled during the RADOS Block Device creation, the percentage that is already occupied.
Figure 17.8: List of RBDs #
17.4.3.1 Status Information #
To view more detailed information about a device, activate its check box in the very left column:
Figure 17.9: RBD Details #
17.4.3.2 Statistics #
Click the
tab of a RADOS Block Device to view the statistics of transferred data. You can zoom in and out of the time range either by highlighting the time range with the mouse, or by selecting it after clicking the date in the top left corner of the tab.
17.4.3.3 RADOS Block Device Snapshots #
To create a RADOS Block Device snapshot, click its
tab and select from the top left drop-down box.
After selecting a snapshot, you can rename, protect, clone, or delete it. Deletion also works if you select multiple snapshots.
restores the device's state from the current snapshot.
Figure 17.10: RBD Snapshots #
17.4.3.4 Deleting RBDs #
To delete a device or a group of devices, activate their check boxes in the very left column and click
in the top left of the RBDs table:
Figure 17.11: Deleting RBD #
17.4.3.5 Adding RBDs #
To add a new device, click
in the top left of the RBDs table and do the following on the screen:
Figure 17.12: Adding a New RBD #
Enter the name of the new device. Refer to Book “Deployment Guide”, Chapter 2 “Hardware Requirements and Recommendations”, Section 2.8 “Naming Limitations” for naming limitations.
Select the cluster that will store the new pool.
Select the pool from which the new RBD device will be created.
Specify the size of the new device. If you click the
link above, the maximum pool size is populated.
To fine-tune the device parameters, click
and activate or deactivate the displayed options.
Confirm with
.
17.4.4 Managing Pools #
Tip: More Information on Pools
For more general information about Ceph pools, refer to Chapter 8, Managing Storage Pools. For information specific to erasure coded pools, refer to Chapter 10, Erasure Coded Pools.
To list all available pools, click
from the main menu.
The list shows each pool's name, ID, the percentage of used space, the number of placement groups, replica size, type ('replicated' or 'erasure'), erasure code profile, and the CRUSH ruleset.
Figure 17.13: List of Pools #
To view more detailed information about a pool, activate its check box in the very left column:
Figure 17.14: Pool Details #
17.4.4.1 Deleting Pools #
To delete a pool or a group of pools, activate their check boxes in the very left column and click
in the top left of the pools table:
Figure 17.15: Deleting Pools #
17.4.4.2 Adding Pools #
To add a new pool, click
in the top left of the pools table and do the following on the screen:
Figure 17.16: Adding a New Pool #
Enter the name of the new pool. Refer to Book “Deployment Guide”, Chapter 2 “Hardware Requirements and Recommendations”, Section 2.8 “Naming Limitations” for naming limitations.
Select the cluster that will store the new pool.
Select the pool type. Pools can be either replicated or erasure coded.
For a replicated pool, specify the replica size and the number of placement groups.
For an erasure code pool, specify the number of placement groups and erasure code profile. You can add your custom profile by clicking the plus '+' sign and specifying the profile name, data and coding chunks, and a ruleset failure domain.
Confirm with
.
17.4.5 Listing Nodes #
Click
from the main menu to view the list of nodes available on the cluster.
Figure 17.17: List of Nodes #
Each node is represented by its host name, public IP address, cluster ID it belongs to, node role (for example, 'admin', 'storage', or 'master'), and key acceptance status.
17.4.6 Managing NFS Ganesha #
Tip: More Information on NFS Ganesha
For more general information about NFS Ganesha, refer to Chapter 16, NFS Ganesha: Export Ceph Data via NFS.
To list all available NFS exports, click
from the main menu.
The list shows each export's directory, host name, status, type of storage back-end, and access type.
Figure 17.18: List of NFS Exports #
To view more detailed information about an NFS export, activate its check box in the very left column:
Figure 17.19: NFS Export Details #
Tip: NFS Mount Command
At the bottom of the export detailed view, there is a mount command for you to be able to easily mount the related NFS export from a client machine.
17.4.6.1 Adding NFS Exports #
To add a new NFS export, click
in the top left of the exports table and enter the required information.
Figure 17.20: Adding a New NFS Export #
Select a server host for the NFS export.
Select a storage back-end—either
or .
Enter the directory path for the NFS export. If the directory does not exist on the server, it will be created.
Specify other NFS related options, such as supported NFS protocol version, access type, squashing, or transport protocol.
If you need to limit access to specific clients only, click
and add their IP addresses together with access type and squashing options.
Confirm with
.
17.4.6.2 Cloning and Deleting NFS Exports #
To delete an export or a group of exports, activate their check boxes in the very left column and select
in the top left of the exports table.
Similarly, you can select
to clone the activated export.
17.4.6.3 Editing NFS Exports #
To edit an existing export, either click its name in the exports table, or activate its check box and click
in the top left of the exports table.
You can then adjust all the details of the NFS export.
Figure 17.21: Editing an NFS Export #
17.4.7 Managing iSCSI Gateways #
Tip: More Information on iSCSI Gateways
For more general information about iSCSI Gateways, refer to Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway” and Chapter 14, Ceph iSCSI Gateway.
To list all available gateways, click
from the main menu.
The list shows each gateway's target, state, and related portals and RBD images.
Figure 17.22: List of iSCSI Gateways #
To view more detailed information about a gateway, activate its check box in the very left column:
Figure 17.23: Gateway Details #
17.4.7.1 Adding iSCSI Gateways #
To add a new iSCSI Gateway, click
in the top left of the gateways table and enter the required information.
Figure 17.24: Adding a New iSCSI Gateway #
Enter the target address of the new gateway.
Click
and select one or multiple iSCSI portals from the list.
Click
and select one or multiple RBD images for the gateway.
If you need to use authentication to access the gateway, activate the
check box and enter the credentials. You can find more advanced authentication options after activating and .
Confirm with
.
17.4.7.2 Editing iSCSI Gateways #
To edit an existing iSCSI Gateway, either click its name in the gateways table, or activate its check box and click
in the top left of the gateways table.
You can then modify the iSCSI target, add or delete portals, and add or delete related RBD images. You can also adjust authentication information for the gateway.
17.4.7.3 Cloning and Deleting iSCSI Gateways #
To delete a gateway or a group of gateways, activate their check boxes in the very left column and select
in the top left of the gateways table.
Similarly, you can select
to clone the activated gateway.
17.4.7.4 Starting and Stopping iSCSI Gateways #
To start all gateways, select
in the top left of the gateways table. To stop all gateways, select .
17.4.8 Viewing the Cluster CRUSH Map #
Click
from the main menu to view the cluster CRUSH Map.
Figure 17.25: CRUSH Map #
In the
pane, you can see the structure of the cluster as described by the CRUSH Map.
In the
pane, you can view individual rulesets after selecting one of them from the drop-down box.
Figure 17.26: Replication rules #
17.4.9 Managing Object Gateway Users and Buckets #
Tip: More Information on Object Gateways
For more general information about Object Gateways, refer to Chapter 13, Ceph Object Gateway.
To list Object Gateway users, select
› from the main menu.
The list shows each user's ID, display name, e-mail address, whether the user is suspended, and the maximum number of buckets for the user.
Figure 17.27: List of Object Gateway Users #
17.4.9.1 Adding a New Object Gateway User #
To add a new Object Gateway user, click
in the top left of the users table and enter the relevant information.
Tip: More Information
Find more information about Object Gateway user accounts in Section 13.5.2, “Managing S3 and Swift Accounts”.
Figure 17.28: Adding a New Object Gateway User #
Enter the user name, full name, and optionally an e-mail address and the maximum number of buckets for the user.
If the user should be initially suspended, activate the
check box.
Specify the access and secret keys for the S3 authentication. If you want openATTIC to generate the keys for you, activate
.
In the
section, set quota limits for the current user.
Check
to activate the user quota limits. You can either specify the of the disk space the user can use within the cluster, or check for no size limit.
Similarly, specify
that the user can store on the cluster storage, or if the user may store any number of objects.
Figure 17.29: User quota #
In the
section, set the bucket quota limits for the current user.
Figure 17.30: Bucket Quota #
Confirm with
.
17.4.9.2 Deleting Object Gateway Users #
To delete one or more Object Gateway users, activate their check boxes in the very left column and select
in the top left of the users table.
17.4.9.3 Editing Object Gateway Users #
To edit the user information of an Object Gateway user, either activate their check box in the very left column and select
in the top left of the users table, or click their ID. You can change the information you entered when adding the user in Section 17.4.9.1, “Adding a New Object Gateway User”, plus the following additional information:
- Subusers
Add, remove, or edit subusers of the currently edited user.
Figure 17.31: Adding a Subuser #
- Keys
Add, remove, or view access and secret keys of the currently edited user.
You can add S3 keys for the currently edited user, or view Swift keys for their subusers.
Figure 17.32: View S3 keys #
- Capabilities
Add or remove user's capabilities. The capabilities apply to
, , , , and . Each capability value can be one of 'read', 'write', or '*' for read and write privilege.
Figure 17.33: Capabilities #
17.4.9.4 Listing Buckets for Object Gateway Users #
Tip
A bucket is a mechanism for storing data objects. A user account may have many buckets, but bucket names must be unique. Although the term 'bucket' is normally used within the Amazon S3 API, the term 'container' is used in the OpenStack Swift API context.
Click
› to list all available Object Gateway buckets.
Figure 17.34: Object Gateway Buckets #
17.4.9.5 Adding Buckets for Object Gateway Users #
To add a new bucket, click
in the top left of the buckets table and enter the new bucket name and the related Object Gateway user. Confirm with .
Figure 17.35: Adding a New Bucket #
17.4.9.6 Viewing Bucket Details #
To view detailed information about an Object Gateway bucket, activate its check box in the very left column of the buckets table.
Figure 17.36: Bucket Details #
17.4.9.7 Editing Buckets #
To edit a bucket, either activate its check box in the very left column and select
in the top left of the buckets table, or click its name.
Figure 17.37: Editing an Object Gateway Bucket #
On the edit screen, you can change the user to which the bucket belongs.
17.4.9.8 Deleting Buckets #
To delete one or more Object Gateway buckets, activate their check boxes in the very left column of the buckets table, and select
in the top left of the table.
Figure 17.38: Deleting Buckets #
To confirm the deletion, type 'yes' in the
pop-up window, and click .
Warning: Careful Deletion
When deleting an Object Gateway bucket, it is currently not verified whether the bucket is actually in use, for example by NFS Ganesha via the S3 storage back-end.
Part V Integration with Virtualization Tools #
- 18 Using libvirt with Ceph
- 19 Ceph as a Back-end for QEMU KVM Instance
18 Using libvirt with Ceph #
The libvirt library creates a virtual machine abstraction layer between hypervisor interfaces and the software applications that use them. With libvirt, developers and system administrators can focus on a common management framework, common API, and common shell interface (virsh) to many different hypervisors, including QEMU/KVM, Xen, LXC, or VirtualBox.
Ceph block devices support QEMU/KVM. You can use Ceph block devices with software that interfaces with libvirt. The cloud solution uses libvirt to interact with QEMU/KVM, and QEMU/KVM interacts with Ceph block devices via librbd.
To create VMs that use Ceph block devices, use the procedures in the following sections. In the examples, we have used libvirt-pool for the pool name, client.libvirt for the user name, and new-libvirt-image for the image name. You may use any value you like, but ensure you replace those values when executing commands in the subsequent procedures.
18.1 Configuring Ceph #
To configure Ceph for use with libvirt, perform the following steps:
Create a pool. The following example uses the pool name libvirt-pool with 128 placement groups.
cephadm > ceph osd pool create libvirt-pool 128 128
Verify that the pool exists.
cephadm > ceph osd lspools
Create a Ceph User. The following example uses the Ceph user name client.libvirt and references libvirt-pool.
cephadm > ceph auth get-or-create client.libvirt mon 'profile rbd' osd \
 'profile rbd pool=libvirt-pool'
Verify the name exists.
cephadm > ceph auth list
Note
libvirt will access Ceph using the ID libvirt, not the Ceph name client.libvirt. See http://docs.ceph.com/docs/master/rados/operations/user-management/#user for a detailed explanation of the difference between ID and name.
Use QEMU to create an image in your RBD pool. The following example uses the image name new-libvirt-image and references libvirt-pool.
Tip: Keyring File Location
The libvirt user key is stored in a keyring file placed in the /etc/ceph directory. The keyring file needs to have an appropriate name that includes the name of the Ceph cluster it belongs to. For the default cluster name 'ceph', the keyring file name is /etc/ceph/ceph.client.libvirt.keyring.
If the keyring does not exist, create it with:
cephadm > ceph auth get client.libvirt > /etc/ceph/ceph.client.libvirt.keyring
root # qemu-img create -f raw rbd:libvirt-pool/new-libvirt-image:id=libvirt 2G
Verify the image exists.
cephadm > rbd -p libvirt-pool ls
18.2 Preparing the VM Manager #
You may use libvirt without a VM manager, but you may find it simpler to create your first domain with virt-manager.
Install a virtual machine manager.
root # zypper in virt-manager
Prepare/download an OS image of the system you want to run virtualized.
Launch the virtual machine manager.
virt-manager
18.3 Creating a VM #
To create a VM with virt-manager, perform the following steps:
Choose the connection from the list, right-click it, and select
.
libvirt-virtual-machine
.
Finish the configuration and start the VM.
Verify that the newly created domain exists with sudo virsh list. If needed, specify the connection string, such as
virsh -c qemu+ssh://root@vm_host_hostname/system list
Id Name State
-----------------------------------------------
[...]
9 libvirt-virtual-machine running
Log in to the VM and stop it before configuring it for use with Ceph.
18.4 Configuring the VM #
In this chapter, we focus on configuring VMs for integration with Ceph using virsh. virsh commands often require root privileges (sudo) and will not return appropriate results or notify you that root privileges are required. For a reference of virsh commands, refer to Virsh Command Reference.
Open the configuration file with virsh edit vm-domain-name.
root # virsh edit libvirt-virtual-machine
Under <devices> there should be a <disk> entry.
<devices>
  <emulator>/usr/bin/qemu-system-x86_64</emulator>
  <disk type='file' device='disk'>
    <driver name='qemu' type='raw'/>
    <source file='/path/to/image/recent-linux.img'/>
    <target dev='vda' bus='virtio'/>
    <address type='drive' controller='0' bus='0' unit='0'/>
  </disk>
Replace /path/to/image/recent-linux.img with the path to the OS image.
Important
Use sudo virsh edit instead of a text editor. If you edit the configuration file under /etc/libvirt/qemu with a text editor, libvirt may not recognize the change. If there is a discrepancy between the contents of the XML file under /etc/libvirt/qemu and the result of sudo virsh dumpxml vm-domain-name, then your VM may not work properly.
Add the Ceph RBD image you previously created as a <disk> entry.
<disk type='network' device='disk'>
  <source protocol='rbd' name='libvirt-pool/new-libvirt-image'>
    <host name='monitor-host' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
Replace monitor-host with the name of your host, and replace the pool and/or image name as necessary. You may add multiple <host> entries for your Ceph monitors. The dev attribute is the logical device name that will appear under the /dev directory of your VM. The optional bus attribute indicates the type of disk device to emulate. The valid settings are driver specific (for example ide, scsi, virtio, xen, usb or sata). See Disks for details of the <disk> element, and its child elements and attributes.
Save the file.
If your Ceph cluster has authentication enabled (it does by default), you must generate a secret. Open an editor of your choice and create a file called secret.xml with the following content:
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
Define the secret.
root # virsh secret-define --file secret.xml
<uuid of secret is output here>
Get the client.libvirt key and save the key string to a file.
cephadm > ceph auth get-key client.libvirt | sudo tee client.libvirt.key
Set the UUID of the secret.
root # virsh secret-set-value --secret uuid of secret \
 --base64 $(cat client.libvirt.key) && rm client.libvirt.key secret.xml
You must also set the secret manually by adding the following <auth> entry to the <disk> element you entered earlier (replacing the uuid value with the result from the command line example above).
root # virsh edit libvirt-virtual-machine
Then, add the <auth></auth> element to the domain configuration file:
...
</source>
<auth username='libvirt'>
  <secret type='ceph' uuid='9ec59067-fdbc-a6c0-03ff-df165c0587b8'/>
</auth>
<target ...
Note
The exemplary ID is libvirt, not the Ceph name client.libvirt as generated at step 2 of Section 18.1, “Configuring Ceph”. Ensure you use the ID component of the Ceph name you generated. If for some reason you need to regenerate the secret, you will need to execute sudo virsh secret-undefine uuid before executing sudo virsh secret-set-value again.
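The difference between the Ceph name and the ID used in the <auth> element is easy to get wrong. The following is a minimal sketch, assuming the name always carries the client. type prefix; the helper names ceph_id_from_name and libvirt_auth_element are hypothetical and not part of libvirt or Ceph. It derives the ID from the name and prints a matching <auth> element for a given secret UUID.

```shell
# Hypothetical helper: strip the "client." type prefix from a Ceph user name.
ceph_id_from_name() {
    printf '%s\n' "${1#client.}"
}

# Hypothetical helper: print the <auth> element for a Ceph name and a secret UUID.
libvirt_auth_element() {
    printf "<auth username='%s'>\n  <secret type='ceph' uuid='%s'/>\n</auth>\n" \
        "$(ceph_id_from_name "$1")" "$2"
}

# Uses the example UUID shown above; replace it with the UUID of your secret.
libvirt_auth_element client.libvirt 9ec59067-fdbc-a6c0-03ff-df165c0587b8
```

The printed element still needs to be pasted into the <disk> element with virsh edit; the sketch only helps to keep the user name and UUID consistent.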
18.5 Summary #
Once you have configured the VM for use with Ceph, you can start the VM. To verify that the VM and Ceph are communicating, you may perform the following procedures.
Check to see if Ceph is running:
cephadm > ceph health
Check to see if the VM is running:
root # virsh list
Check to see if the VM is communicating with Ceph. Replace vm-domain-name with the name of your VM domain:
root # virsh qemu-monitor-command --hmp vm-domain-name 'info block'
Check to see if the device from <target dev='hdb' bus='ide'/> appears under /dev or under /proc/partitions:
cephadm > ls /dev
cephadm > cat /proc/partitions
19 Ceph as a Back-end for QEMU KVM Instance #
The most frequent Ceph use case involves providing block device images to virtual machines. For example, a user may create a 'golden' image with an OS and any relevant software in an ideal configuration. Then, the user takes a snapshot of the image. Finally, the user clones the snapshot (usually many times, see Section 9.3, “Snapshots” for details). The ability to make copy-on-write clones of a snapshot means that Ceph can provision block device images to virtual machines quickly, because the client does not need to download an entire image each time it spins up a new virtual machine.
Ceph block devices can integrate with the QEMU virtual machines. For more information on QEMU KVM, see https://documentation.suse.com/sles/12-SP5/single-html/SLES-virtualization/#part-virt-qemu.
19.1 Installation #
In order to use Ceph block devices, QEMU needs to have the appropriate driver installed. Check whether the qemu-block-rbd package is installed, and install it if needed:
root # zypper install qemu-block-rbd
19.2 Usage #
The QEMU command line expects you to specify the pool name and image name. You may also specify a snapshot name.
qemu-img command options \
rbd:pool-name/image-name@snapshot-name:option1=value1:option2=value2...
For example, specifying the id and conf options might look like the following:
qemu-img command options \
rbd:pool_name/image_name:id=glance:conf=/etc/ceph/ceph.conf
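The specification string above is assembled from a pool name, an image name, an optional snapshot name, and optional option=value pairs. The following is a minimal sketch of that assembly; the helper name rbd_spec is hypothetical and is not a QEMU or Ceph command.

```shell
# Hypothetical helper: rbd_spec POOL IMAGE [SNAPSHOT] [OPTION=VALUE ...]
# Pass an empty string as SNAPSHOT to add options without a snapshot.
# The body runs in a subshell so that its variables do not leak.
rbd_spec() (
    spec="rbd:$1/$2"
    if [ -n "${3:-}" ]; then
        spec="${spec}@$3"
    fi
    # Drop pool, image, and snapshot; whatever remains are options.
    if [ "$#" -ge 3 ]; then shift 3; else shift "$#"; fi
    for opt in "$@"; do
        spec="${spec}:${opt}"
    done
    printf '%s\n' "$spec"
)

# Rebuilds the string from the example above.
rbd_spec pool_name image_name "" id=glance conf=/etc/ceph/ceph.conf
```

The result can then be passed to qemu-img, for example qemu-img info "$(rbd_spec pool1 image1)".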
19.3 Creating Images with QEMU #
You can create a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you want to create. You must also specify the size of the image.
qemu-img create -f raw rbd:pool-name/image-name size
For example:
qemu-img create -f raw rbd:pool1/image1 10G
Formatting 'rbd:pool1/image1', fmt=raw size=10737418240 nocow=off cluster_size=0
Important
The raw data format is really the only sensible format option to use with RBD. Technically, you could use other QEMU-supported formats such as qcow2, but doing so would add overhead, and would also render the volume unsafe for virtual machine live migration when caching is enabled.
19.4 Resizing Images with QEMU #
You can resize a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you want to resize. You must also specify the new size of the image.
qemu-img resize rbd:pool-name/image-name size
For example:
qemu-img resize rbd:pool1/image1 9G
Image resized.
19.5 Retrieving Image Info with QEMU #
You can retrieve block device image information from QEMU. You must specify rbd, the pool name, and the name of the image.
qemu-img info rbd:pool-name/image-name
For example:
qemu-img info rbd:pool1/image1
image: rbd:pool1/image1
file format: raw
virtual size: 9.0G (9663676416 bytes)
disk size: unavailable
cluster_size: 4194304
19.6 Running QEMU with RBD #
QEMU can access an image as a virtual block device directly via librbd. This avoids an additional context switch, and can take advantage of RBD caching.
You can use qemu-img to convert existing virtual machine images to Ceph block device images. For example, if you have a qcow2 image, you could run:
qemu-img convert -f qcow2 -O raw sles12.qcow2 rbd:pool1/sles12
To run a virtual machine booting from that image, you could run:
root # qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12
RBD caching can significantly improve performance. QEMU’s cache options control librbd caching:
root # qemu -m 1024 -drive format=rbd,file=rbd:pool1/sles12,cache=writeback
19.7 Enabling Discard/TRIM #
Ceph block devices support the discard operation. This means that a guest can send TRIM requests to let a Ceph block device reclaim unused space. This can be enabled in the guest by mounting XFS with the discard option.
For this to be available to the guest, it must be explicitly enabled for the block device. To do this, you must specify a discard_granularity associated with the drive:
root # qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12,id=drive1,if=none \
-device driver=ide-hd,drive=drive1,discard_granularity=512
Note
The above example uses the IDE driver. The virtio driver does not support discard.
If using libvirt, edit your libvirt domain’s configuration file using virsh edit to include the xmlns:qemu value. Then, add a qemu:commandline block as a child of that domain. The following example shows how to set two devices with qemu id= to different discard_granularity values.
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='block.scsi0-0-0.discard_granularity=4096'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='block.scsi0-0-1.discard_granularity=65536'/>
  </qemu:commandline>
</domain>
19.8 QEMU Cache Options #
QEMU’s cache options correspond to the following Ceph RBD Cache settings.
Writeback:
rbd_cache = true
Writethrough:
rbd_cache = true
rbd_cache_max_dirty = 0
None:
rbd_cache = false
QEMU’s cache settings override Ceph’s default settings (settings that are not explicitly set in the Ceph configuration file). If you explicitly set RBD Cache settings in your Ceph configuration file, your Ceph settings override the QEMU cache settings. If you set cache settings on the QEMU command line, the QEMU command line settings override the Ceph configuration file settings.
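The correspondence listed above can be written down as a small lookup. The following is a minimal sketch that only restates the three mappings from this section; the function name rbd_cache_settings is hypothetical and does not query QEMU or Ceph.

```shell
# Hypothetical helper: print the RBD cache settings that correspond to a
# QEMU cache mode, as listed in this section.
rbd_cache_settings() {
    case "$1" in
        writeback)    printf '%s\n' 'rbd_cache = true' ;;
        writethrough) printf '%s\n' 'rbd_cache = true' 'rbd_cache_max_dirty = 0' ;;
        none)         printf '%s\n' 'rbd_cache = false' ;;
        *)            echo "unknown cache mode: $1" >&2; return 1 ;;
    esac
}

rbd_cache_settings writethrough
```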
Part VI FAQs, Tips and Troubleshooting #
- 20 Hints and Tips
The chapter provides information to help you enhance performance of your Ceph cluster and provides tips how to set the cluster up.
- 21 Frequently Asked Questions
- 22 Troubleshooting
This chapter describes several issues that you may face when you operate a Ceph cluster.
20 Hints and Tips #
This chapter provides information to help you enhance the performance of your Ceph cluster, and tips on how to set the cluster up.
20.1 Identifying Orphaned Partitions #
To identify possibly orphaned journal/WAL/DB devices, follow these steps:
Pick the device that may have orphaned partitions and save the list of its partitions to a file:
root@minion > ls /dev/sdd?* > /tmp/partitions
Run readlink against all block.wal, block.db, and journal devices, and compare the output to the previously saved list of partitions:
root@minion > readlink -f /var/lib/ceph/osd/ceph-*/{block.wal,block.db,journal} \
 | sort | comm -23 /tmp/partitions -
The output is the list of partitions that are not used by Ceph.
Remove the orphaned partitions that do not belong to Ceph with your preferred command (for example fdisk, parted, or sgdisk).
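To see how the comm -23 comparison in the procedure above isolates unused partitions, the following sketch runs the same comparison on made-up data. The device names and the temporary directory are assumptions for illustration; no real devices are read or modified.

```shell
# Hypothetical sample data standing in for real device listings.
workdir=$(mktemp -d)
# All partitions found on the device (what `ls /dev/sdd?*` would list).
printf '%s\n' /dev/sdd1 /dev/sdd2 /dev/sdd3 /dev/sdd4 > "$workdir/partitions"
# Partitions that OSDs actually reference (what `readlink -f ...` would print).
printf '%s\n' /dev/sdd3 /dev/sdd1 > "$workdir/in_use"

# comm needs both inputs sorted the same way; -23 keeps only the lines
# that appear in the first file and not in the second.
orphans=$(LC_ALL=C sort "$workdir/in_use" | LC_ALL=C comm -23 "$workdir/partitions" -)
echo "$orphans"
rm -r "$workdir"
```

With this sample data, the partitions that are listed on the device but not referenced by any OSD are the ones printed.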
20.2 Adjusting Scrubbing #
By default, Ceph performs light scrubbing (find more details in Section 7.5, “Scrubbing”) daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11 p.m. to 6 a.m.:
[osd]
osd_scrub_begin_hour = 23
osd_scrub_end_hour = 6
If time restriction is not an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold option. The default value is 0.5, but it could be modified for low load conditions:
[osd]
osd_scrub_load_threshold = 0.25
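A scrubbing window such as 23 to 6 wraps around midnight, which is easy to misread. The following is a minimal sketch of how such a window can be evaluated, assuming the begin hour is inclusive and the end hour is exclusive; the function name in_scrub_window is hypothetical, and the sketch does not read the cluster configuration.

```shell
# Hypothetical helper: in_scrub_window HOUR BEGIN END (hours 0-23).
# Assumes BEGIN is inclusive and END is exclusive.
# The body runs in a subshell so that its variables do not leak.
in_scrub_window() (
    hour=$1
    begin=$2
    end=$3
    if [ "$begin" -lt "$end" ]; then
        # Window within a single day, for example 1 to 5.
        [ "$hour" -ge "$begin" ] && [ "$hour" -lt "$end" ]
    else
        # Window wraps around midnight, for example 23 to 6.
        [ "$hour" -ge "$begin" ] || [ "$hour" -lt "$end" ]
    fi
)

for h in 22 23 2 6; do
    if in_scrub_window "$h" 23 6; then
        echo "hour $h: inside the window"
    else
        echo "hour $h: outside the window"
    fi
done
```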
20.3 Stopping OSDs without Rebalancing #
You may need to stop OSDs for maintenance periodically. If you do not want CRUSH to automatically rebalance the cluster in order to avoid huge data transfers, set the cluster to noout first:
root@minion >
ceph osd set noout
When the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work:
root@minion >
systemctl stop ceph-osd@OSD_NUMBER.service
Find more information in Section 3.1.2, “Starting, Stopping, and Restarting Individual Services”.
After you complete the maintenance, start OSDs again:
root@minion >
systemctl start ceph-osd@OSD_NUMBER.service
After OSD services are started, unset the cluster from noout:
cephadm >
ceph osd unset noout
20.4 Time Synchronization of Nodes #
Ceph requires precise time synchronization between all nodes.
We recommend synchronizing all Ceph cluster nodes with at least three reliable time sources that are located on the internal network. The internal time sources can point to a public time server or have their own time source.
Important: Public Time Servers
Do not synchronize all Ceph cluster nodes directly with remote public time servers. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers that may provide slightly different time. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds, which is what the Ceph monitors require.
For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide.
Then to change the time on your cluster, do the following:
Important: Setting Time
You may face a situation where you need to set the time back, for example when the time changes from summer time to standard time. We do not recommend moving the time backward by a longer period than the cluster is down. Moving the time forward does not cause any trouble.
Procedure 20.1: Time Synchronization on the Cluster #
Stop all clients accessing the Ceph cluster, especially those using iSCSI.
Shut down your Ceph cluster. On each node run:
root #
systemctl stop ceph.target
Note
If you use Ceph and SUSE OpenStack Cloud, also stop SUSE OpenStack Cloud.
Verify that your NTP server is set up correctly: all ntpd daemons must get their time from a source or sources in the local network.
Set the correct time on your NTP server.
Verify that NTP is running and working properly. On all nodes, run:
root #
systemctl status ntpd.service
or
root #
ntpq -p
Start all monitoring nodes and verify that there is no clock skew:
root #
systemctl start target
Start all OSD nodes.
Start other Ceph services.
Start the SUSE OpenStack Cloud if you have it.
20.5 Checking for Unbalanced Data Writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check how much data is written to each OSD. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given rule set, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written: you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
$ ceph osd tree | grep osd.13
 13  3                   osd.13  up      1
$ ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map
$ ceph osd tree | grep osd.13
 13  3.05                osd.13  up      1
Tip: OSD Reweight by Utilization
The ceph osd reweight-by-utilization threshold command automates the process of reducing the weight of OSDs which are heavily overused. By default it will adjust the weights downward on OSDs which reached 120% of the average usage, but if you include threshold it will use that percentage instead.
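To make the threshold concrete, the following sketch selects the OSDs whose usage exceeds a given percentage of the average usage. It only illustrates the selection criterion described above and does not reproduce how Ceph then adjusts the weights. The function name and the utilization figures are hypothetical.

```python
def overused_osds(utilization, threshold_percent=120):
    """Return OSDs whose usage is above threshold_percent of the average.

    utilization maps an OSD name to its usage in percent.
    """
    average = sum(utilization.values()) / len(utilization)
    limit = average * threshold_percent / 100.0
    return sorted(osd for osd, used in utilization.items() if used > limit)


# Hypothetical usage figures in percent; the average is 50.
sample = {"osd.0": 40, "osd.1": 45, "osd.2": 80, "osd.3": 35}

print(overused_osds(sample))        # limit is 120% of 50 = 60
print(overused_osds(sample, 200))   # limit is 200% of 50 = 100
```

With the default of 120%, only osd.2 is above the limit of 60% usage. A larger threshold selects fewer OSDs.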
20.6 Btrfs Sub-volume for /var/lib/ceph #
SUSE Linux Enterprise by default is installed on a Btrfs partition. The directory /var/lib/ceph should be excluded from Btrfs snapshots and rollbacks, especially when a MON is running on the node. DeepSea provides the fs runner that can set up a sub-volume for this path.
20.6.1 Requirements for New Installation #
If you are setting up the cluster for the first time, the following requirements must be met before you can use the DeepSea runner:
Salt and DeepSea are properly installed and working according to this documentation.
salt-run state.orch ceph.stage.0 has been invoked to synchronize all the Salt and DeepSea modules to the minions.
Ceph is not yet installed, thus ceph.stage.3 has not yet been run and /var/lib/ceph does not yet exist.
20.6.2 Requirements for Existing Installation #
If your cluster is already installed, the following requirements must be met before you can use the DeepSea runner:
Nodes are upgraded to SUSE Enterprise Storage 5.5 and the cluster is under DeepSea control.
Ceph cluster is up and healthy.
Upgrade process has synchronized Salt and DeepSea modules to all minion nodes.
20.6.3 Automatic Setup #
On the Salt master run:
root@master #
salt-run state.orch ceph.migrate.subvolume
On nodes without an existing /var/lib/ceph directory, this will, one node at a time:
create /var/lib/ceph as a @/var/lib/ceph Btrfs sub-volume.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph.
On nodes with an existing Ceph installation, this will, one node at a time:
terminate running Ceph processes.
unmount OSDs on the node.
create @/var/lib/ceph Btrfs sub-volume and migrate existing /var/lib/ceph data.
mount the new sub-volume and update /etc/fstab appropriately.
disable copy-on-write for /var/lib/ceph/*, omitting /var/lib/ceph/osd/*.
re-mount OSDs.
re-start Ceph daemons.
20.6.4 Manual Setup #
This uses the new fs runner.
Inspect the state of /var/lib/ceph on all nodes and print suggestions about how to proceed:
root@master #
salt-run fs.inspect_var
This will return one of the following commands:
salt-run fs.create_var
salt-run fs.migrate_var
salt-run fs.correct_var_attrs
Run the command that was returned in the previous step.
If an error occurs on one of the nodes, the execution for other nodes will stop as well and the runner will try to revert performed changes. Consult the log files on the affected minions to determine the cause. The runner can be re-run after the problem has been solved.
The command salt-run fs.help provides a list of all runner and module commands for the fs module.
20.7 Increasing File Descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the OS default value and specify the number in /etc/ceph/ceph.conf, for example:
max_open_files = 131072
After you change max_open_files, you need to restart the OSD service on the relevant Ceph node.
20.8 How to Use Existing Partitions for OSDs Including OSD Journals #
Important
This section describes an advanced topic that only storage experts and developers should examine. It is mostly needed when using non-standard OSD journal sizes. If the OSD partition's size is less than 10GB, its initial weight is rounded to 0 and because no data are therefore placed on it, you should increase its weight. We take no responsibility for overfilled journals.
If you need to use existing disk partitions as an OSD node, the OSD journal and data partitions need to be in a GPT partition table.
You need to set the correct partition types to the OSD partitions so that udev recognizes them correctly and sets their ownership to ceph:ceph.
For example, to set the partition type for the journal partition /dev/vdb1 and data partition /dev/vdb2, run the following:
root #
sgdisk --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/vdb
root #
sgdisk --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/vdb
Tip
The Ceph partition table types are listed in /usr/lib/udev/rules.d/95-ceph-osd.rules:
cat /usr/lib/udev/rules.d/95-ceph-osd.rules
# OSD_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER="ceph", GROUP="ceph", MODE="660"

# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"
[...]
20.9 Integration with Virtualization Software #
20.9.1 Storing KVM Disks in Ceph Cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a Ceph pool, optionally convert the content of an existing image to it, and then run the virtual machine with qemu-kvm making use of the disk image stored in the cluster. For more detailed information, see Chapter 19, Ceph as a Back-end for QEMU KVM Instance.
20.9.2 Storing libvirt Disks in Ceph Cluster #
Similar to KVM (see Section 20.9.1, “Storing KVM Disks in Ceph Cluster”), you can use Ceph to store virtual machines driven by libvirt. The advantage is that you can run any libvirt-supported virtualization solution, such as KVM, Xen, or LXC. For more information, see Chapter 18, Using libvirt with Ceph.
20.9.3 Storing Xen Disks in Ceph Cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt as described in Chapter 18, Using libvirt with Ceph.
Another option is to make Xen talk to the rbd block device driver directly:
If you have no disk image prepared for Xen, create a new one:
cephadm >
rbd create myimage --size 8000 --pool mypool
List images in the pool mypool and check if your new image is there:
cephadm >
rbd list mypool
Create a new block device by mapping the myimage image to the rbd kernel module:
cephadm >
rbd map --pool mypool myimage
Tip: User Name and Authentication
To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:
cephadm >
rbd map --pool rbd myimage --id admin --keyring /path/to/keyring
or
cephadm >
rbd map --pool rbd myimage --id admin --keyfile /path/to/file
List all mapped devices:
cephadm >
rbd showmapped
id pool   image   snap device
0  mypool myimage -    /dev/rbd0
Now you can configure Xen to use this device as a disk for running a virtual machine. You can, for example, add the following line to the xl-style domain configuration file:
disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
20.10 Firewall Settings for Ceph #
Warning: DeepSea Stages Fail with Firewall
DeepSea deployment stages fail when the firewall is active (even if it is configured). To pass the stages correctly, you need to either turn the firewall off by running
root #
systemctl stop SuSEfirewall2.service
or set the FAIL_ON_WARNING option to 'False' in /srv/pillar/ceph/stack/global.yml:
FAIL_ON_WARNING: False
We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting › › › .
Following is a list of Ceph-related services and the numbers of the ports that they normally use:
- Ceph Monitor
Enable the service or port 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the service, or ports 6800-7300 (TCP).
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. It is set in /etc/ceph/ceph.conf on the line starting with rgw frontends =. Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP). Refer to Section 16.2.3, “Changing Default NFS Ganesha Ports” for more information on changing the default NFS Ganesha ports.
- Apache based services, such as openATTIC, SMT, or SUSE Manager
Open ports 80 for HTTP and 443 for HTTPS (TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9100 (TCP).
20.11 Testing Network Performance #
To test the network performance, the DeepSea net runner provides the following commands.
A simple ping to all nodes:
root@master #
salt-run net.ping
Succeeded: 9 addresses from 9 minions average rtt 1.35 ms
A jumbo ping to all nodes:
root@master #
salt-run net.jumbo_ping
Succeeded: 9 addresses from 9 minions average rtt 2.13 ms
A bandwidth test:
root@master #
salt-run net.iperf
Fastest 2 hosts:
    |_ - 192.168.58.106 - 2981 Mbits/sec
    |_ - 192.168.58.107 - 2967 Mbits/sec
Slowest 2 hosts:
    |_ - 192.168.58.102 - 2857 Mbits/sec
    |_ - 192.168.58.103 - 2842 Mbits/sec
Tip: Stop 'iperf3' Processes Manually
When running a test using the net.iperf runner, the 'iperf3' server processes that are started do not stop automatically when a test is completed. To stop the processes, use the following runner:
root@master #
salt '*' multi.kill_iperf_cmd
20.12 Replacing Storage Disk #
If you need to replace a storage disk in a Ceph cluster, you can do so during the cluster's full operation. The replacement will cause a temporary increase in data transfer.
If the disk fails entirely, Ceph needs to rewrite at least as much data as the capacity of the failed disk. If the disk is properly evacuated and then re-added to avoid loss of redundancy during the process, the amount of rewritten data will be twice as big. If the new disk has a different size than the replaced one, it will cause some additional data to be redistributed to even out the usage of all OSDs.
21 Frequently Asked Questions #
21.1 How Does the Number of Placement Groups Affect the Cluster Performance? #
When your cluster is becoming 70% to 80% full, it is time to add more OSDs to it. When you increase the number of OSDs, you may consider increasing the number of placement groups as well.
Warning
Changing the number of PGs causes a lot of data transfer within the cluster.
Calculating the optimal value for your newly resized cluster is a complex task.
A high number of PGs creates small chunks of data. This speeds up recovery after an OSD failure, but puts a lot of load on the monitor nodes as they are responsible for calculating the data location.
On the other hand, a low number of PGs takes more time and data transfer to recover from an OSD failure, but does not impose that much load on monitor nodes as they need to calculate locations for fewer (but larger) data chunks.
21.2 Can I Use SSDs and Hard Disks on the Same Cluster? #
Solid-state drives (SSD) are generally faster than hard disks. If you mix the two types of disks for the same write operation, the data writing to the SSD disk will be slowed down by the hard disk performance. Thus, you should never mix SSDs and hard disks for data writing following the same rule (see Section 7.3, “Rule Sets” for more information on rules for storing data).
There are generally 2 cases where using SSD and hard disk on the same cluster makes sense:
Use each disk type for writing data following different rules. Then you need to have a separate rule for the SSD disk, and another rule for the hard disk.
Use each disk type for a specific purpose. For example, use the SSD for the journal and the hard disk for storing data.
21.3 What are the Trade-offs of Using a Journal on SSD? #
Using SSDs for OSD journal(s) is better for performance as the journal is usually the bottleneck of hard disk-only OSDs. SSDs are often used to share journals of several OSDs.
Following is a list of potential disadvantages of using SSDs for OSD journal:
SSD disks are more expensive than hard disks. But as one OSD journal requires up to 6GB of disk space only, the price may not be so crucial.
An SSD consumes a storage slot that could otherwise be used by a large hard disk to extend the cluster capacity.
SSDs have reduced write cycles compared to hard disks, but modern technologies are beginning to eliminate the problem.
If you share several journals on the same SSD, you risk losing all the related OSDs after the SSD fails. This will require a lot of data to be moved to rebalance the cluster.
Hotplugging disks becomes more complex as the data mapping is not 1:1 between the failed OSD and the journal disk.
21.4 What Happens When a Disk Fails? #
When a disk that stores cluster data has a hardware problem and fails to operate, here is what happens:
The related OSD crashes and is automatically removed from the cluster.
The failed disk's data is replicated to another OSD in the cluster from other copies of the same data stored in other OSDs.
Then you should remove the disk from the cluster CRUSH Map, and physically from the host hardware.
21.5 What Happens When a Journal Disk Fails? #
Ceph can be configured to store journals or write ahead logs on devices separate from the OSDs. When a disk dedicated to a journal fails, the related OSD(s) fail as well (see Section 21.4, “What Happens When a Disk Fails?”).
Warning: Hosting Multiple Journals on One Disk
For a performance boost, you can use a fast disk (such as an SSD) to store journal partitions for several OSDs. We do not recommend hosting journals for more than 4 OSDs on one disk, because if the journal disk fails, you risk losing the stored data of all the related OSDs' disks.
22 Troubleshooting #
This chapter describes several issues that you may face when you operate a Ceph cluster.
22.1 Reporting Software Problems #
If you come across a problem when running SUSE Enterprise Storage 5.5 related to some of its components, such as Ceph or Object Gateway, report the problem to SUSE Technical Support. The recommended way is with the supportconfig utility.
Tip
Because supportconfig is modular software, make sure that the supportutils-plugin-ses package is installed.
cephadm >
rpm -q supportutils-plugin-ses
If it is missing on the Ceph server, install it with
root #
zypper ref && zypper in supportutils-plugin-ses
Although you can use supportconfig on the command line, we recommend using the related YaST module. Find more information about supportconfig in https://documentation.suse.com/sles/12-SP5/single-html/SLES-admin/#sec-admsupport-supportconfig.
22.2 Sending Large Objects with rados Fails with Full OSD #
rados is a command line utility to manage RADOS object storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the rados utility, such as
cephadm >
rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and cause serious trouble to the cluster performance.
22.3 Corrupted XFS File system #
In rare circumstances, such as a kernel bug or broken or misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might be damaged and unmountable.
If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:
cephadm >
ceph osd down OSD identification
Warning: Do Not Format or Otherwise Modify the Damaged Device
Even though using xfs_repair to fix the problem in the file system may seem reasonable, do not use it as the command modifies the file system. The OSD may start but its functioning may be impaired.
Now zap the underlying disk and re-create the OSD by running:
cephadm >
ceph-disk prepare --zap $OSD_DISK_DEVICE $OSD_JOURNAL_DEVICE
for example:
cephadm >
ceph-disk prepare --zap /dev/sdb /dev/sdd2
22.4 'Too Many PGs per OSD' Status Message #
If you receive a Too Many PGs per OSD message after running ceph status, it means that the mon_pg_warn_max_per_osd value (300 by default) was exceeded. This value is compared to the ratio of PGs per OSD. Exceeding it means that the cluster setup is not optimal.
The number of PGs cannot be reduced after the pool is created. Pools that do not yet contain any data can safely be deleted and then re-created with a lower number of PGs. Where pools already contain data, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.
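The following sketch estimates the ratio of PGs per OSD and shows why adding OSDs lowers it. It is a rough illustration under stated assumptions, not the monitor's exact calculation: it assumes that every PG is counted once per replica and that PGs are spread evenly across OSDs. The function name and the pool figures are hypothetical.

```python
def pgs_per_osd(pools, num_osds):
    """Estimate the average number of PG copies hosted by each OSD.

    pools is a list of (pg_num, replica_count) tuples.
    """
    total_pg_copies = sum(pg_num * replicas for pg_num, replicas in pools)
    return total_pg_copies / num_osds


WARN_LIMIT = 300  # default of mon_pg_warn_max_per_osd named in the text

# Hypothetical cluster: three replicated pools with 3 copies each.
pools = [(1024, 3), (512, 3), (256, 3)]

before = pgs_per_osd(pools, num_osds=12)
after = pgs_per_osd(pools, num_osds=24)
print(before, before > WARN_LIMIT)
print(after, after > WARN_LIMIT)
```

In this example, doubling the number of OSDs from 12 to 24 halves the ratio from 448 to 224, which brings it below the warning limit without changing any pool.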
22.5 'nn pg stuck inactive' Status Message #
If you receive a stuck inactive status message after running ceph status, it means that Ceph does not know where to replicate the stored data to fulfill the replication rules. It can happen shortly after the initial Ceph setup and fix itself automatically. In other cases, this may require a manual interaction, such as bringing up a broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing the replication level may help.
If the placement groups are stuck perpetually, you need to check the output of ceph osd tree. The output should look tree-structured, similar to the example in Section 22.7, “OSD is Down”.
If the output of ceph osd tree is rather flat as in the following example:
cephadm >
ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
2 0 osd.2 up 1.00000 1.00000
You should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect (for example, the root contains hosts, but the OSDs are at the top level and are not themselves assigned to hosts), you will need to move the OSDs to the correct place in the hierarchy. This can be done using the ceph osd crush move and/or ceph osd crush set commands. For further details see Section 7.4, “CRUSH Map Manipulation”.
22.6 OSD Weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the bigger the chance that the cluster writes data to the OSD. The weight is either specified in a cluster CRUSH Map, or calculated by the OSDs' start-up script.
In some cases, the calculated value for OSDs' weight may be rounded down to zero. It means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15GB) and should be replaced with a bigger one.
22.7 OSD is Down #
An OSD daemon is either running, or stopped/down. There are 3 general reasons why an OSD is down:
Hard disk failure.
The OSD crashed.
The server crashed.
You can see the detailed status of OSDs by running
cephadm >
ceph osd tree
# id weight type name up/down reweight
-1 0.02998 root default
-2 0.009995 host doc-ceph1
0 0.009995 osd.0 up 1
-3 0.009995 host doc-ceph2
1 0.009995 osd.1 up 1
-4 0.009995 host doc-ceph3
2 0.009995 osd.2 down 1
The example listing shows that osd.2 is down. Then you may check if the disk where the OSD is located is mounted:
root #
lsblk -f
[...]
vdb
├─vdb1 /var/lib/ceph/osd/ceph-2
└─vdb2
You can track the reason why the OSD is down by inspecting its log file /var/log/ceph/ceph-osd.2.log. After you find and fix the reason why the OSD is not running, start it with
root #
systemctl start ceph-osd@2.service
Do not forget to replace 2 with the actual number of your stopped OSD.
22.8 Finding Slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. The reason is that if the data is written to the slow(est) disk, the complete write operation slows down as it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
ceph tell osd.OSD_ID_NUMBER bench
For example:
cephadm >
ceph tell osd.0 bench
{ "bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": "19377779.000000"}
Then you need to run this command on each OSD and compare the bytes_per_sec value to get the slow(est) OSDs.
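Once the benchmark output of each OSD has been collected, the comparison can be scripted. The following sketch parses the JSON output shown above and sorts the OSDs from slowest to fastest. The function name is hypothetical, the result for osd.0 is taken from the example, and the results for osd.1 and osd.2 are invented sample data.

```python
import json


def rank_osds_by_speed(bench_outputs):
    """Return (osd, bytes_per_sec) pairs sorted from slowest to fastest.

    bench_outputs maps an OSD name to the raw JSON text of its benchmark.
    """
    speeds = {
        osd: float(json.loads(raw)["bytes_per_sec"])
        for osd, raw in bench_outputs.items()
    }
    return sorted(speeds.items(), key=lambda item: item[1])


bench_outputs = {
    # Output from the example above.
    "osd.0": '{"bytes_written": 1073741824, "blocksize": 4194304, '
             '"bytes_per_sec": "19377779.000000"}',
    # Invented values for illustration only.
    "osd.1": '{"bytes_written": 1073741824, "blocksize": 4194304, '
             '"bytes_per_sec": "95000000.000000"}',
    "osd.2": '{"bytes_written": 1073741824, "blocksize": 4194304, '
             '"bytes_per_sec": "88000000.000000"}',
}

ranking = rank_osds_by_speed(bench_outputs)
for osd, speed in ranking:
    print(f"{osd}: {speed / 1e6:.1f} MB/s")
```

The first entry of the ranking is the slowest OSD and therefore the first candidate to investigate.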
22.9 Fixing Clock Skew Warnings #
The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.
Time synchronization is managed with NTP (see http://en.wikipedia.org/wiki/Network_Time_Protocol). Set each node to synchronize its time with one or more NTP servers, preferably to the same group of NTP servers. If the time skew still occurs on a node, follow these steps to fix it:
root #
systemctl stop ntpd.service
root #
systemctl stop ceph-mon.target
root #
systemctl start ntpd.service
root #
systemctl start ceph-mon.target
You can then query the NTP peers and check the time offset with sudo ntpq -p.
The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. Refer to Section 20.4, “Time Synchronization of Nodes” for more information.
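As a rough aid, the offsets reported by ntpq -p can be compared against the 0.05 second limit. The following sketch parses the peer table, where the offset column is given in milliseconds, and lists peers whose absolute offset exceeds the limit. Note that this offset is measured between a node and its time source, so it is only an indicator and not the skew between two monitors. The function names, the host names, and the sample table are hypothetical.

```python
# Hypothetical ntpq -p output; the offset column is in milliseconds.
SAMPLE_NTPQ_OUTPUT = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.lo 10.0.0.1         2 u   33   64  377    0.412   12.345   0.101
+ntp2.example.lo 10.0.0.1         2 u   12   64  377    0.398  -61.200   0.250
"""


def peer_offsets_seconds(ntpq_output):
    """Map each peer name to its offset in seconds."""
    offsets = {}
    # The first two lines are the column header and the separator.
    for line in ntpq_output.splitlines()[2:]:
        if not line.strip():
            continue
        # The first character is the tally code (for example '*' or '+').
        fields = line[1:].split()
        offsets[fields[0]] = float(fields[8]) / 1000.0
    return offsets


def peers_over_limit(offsets, limit_seconds=0.05):
    """Return peers whose absolute offset exceeds the limit."""
    return sorted(p for p, off in offsets.items() if abs(off) > limit_seconds)


offsets = peer_offsets_seconds(SAMPLE_NTPQ_OUTPUT)
print(peers_over_limit(offsets))
```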
22.10 Poor Cluster Performance Caused by Network Problems #
Cluster performance can degrade for several reasons, and network problems are one of them. In such a case, you may notice the cluster reaching quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems, inspect the Ceph log files under the /var/log/ceph directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try the DeepSea diagnostics runner net.ping to ping between cluster nodes to see whether an individual interface can reach a specific interface, and to get the average response time. Any specific response time much slower than average will also be reported. For example:
root@master #
salt-run net.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with jumbo frames enabled:
root@master #
salt-run net.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
Network performance benchmark. Try DeepSea's network performance runner net.iperf to test inter-node network bandwidth. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes will be used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master #
salt-run net.iperf cluster=ceph output=full
192.168.128.1:
    8644.0 Mbits/sec
192.168.128.2:
    10360.0 Mbits/sec
192.168.128.3:
    9336.0 Mbits/sec
192.168.128.4:
    9588.56 Mbits/sec
192.168.128.5:
    10187.0 Mbits/sec
192.168.128.6:
    10465.0 Mbits/sec
Check firewall settings on cluster nodes. Make sure they do not block ports/protocols required by Ceph operation. See Section 20.10, “Firewall Settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
Tip: Separate Network
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
22.11 /var Running Out of Space #
By default, the Salt master saves every minion's return for every job in its job cache. The cache can then be used later to look up results for previous jobs. The cache directory defaults to /var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this directory can grow very large, depending on the number of published jobs and the value of the keep_jobs option in the /etc/salt/master file. keep_jobs sets the number of hours (24 by default) to keep information about past minion jobs.
keep_jobs: 24
Important: Do Not Set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to 'False':
job_cache: False
Tip: Restoring Partition Full because of Job Cache
When the partition with job cache files gets full because of a wrong keep_jobs setting, follow these steps to free disk space and improve the job cache settings:
Stop the Salt master service:
root@master #
systemctl stop salt-master
Change the Salt master configuration related to job cache by editing /etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt master job cache:
root #
rm -rfv /var/cache/salt/master/jobs/*
Start the Salt master service:
root@master #
systemctl start salt-master
22.12 Too Many PGs Per OSD #
The TOO_MANY_PGS flag is raised when the number of PGs in use is above the configurable threshold of mon_pg_warn_max_per_osd PGs per OSD. If this threshold is exceeded, the cluster does not allow new pools to be created, pool pg_num to be increased, or pool replication to be increased.
SUSE Enterprise Storage 4 and 5.5 have two ways to solve this issue.
Procedure 22.1: Solution 1 #
Set the following in your ceph.conf:
[global]
mon_max_pg_per_osd = 800 # depends on your amount of PGs
osd max pg per osd hard ratio = 10 # the default is 2. We recommend to set at least 5.
mon allow pool delete = true # without it you can't remove a pool
Restart all MONs and OSDs one by one.
Check the value of your MON and OSD ID:
cephadm >
ceph --admin-daemon /var/run/ceph/ceph-mon.ID.asok config get mon_max_pg_per_osd
cephadm >
ceph --admin-daemon /var/run/ceph/ceph-osd.ID.asok config get osd_max_pg_per_osd_hard_ratio
Execute the following to determine your default pg_num:
cephadm >
rados lspools
cephadm >
ceph osd pool get USER-EMAIL pg_num
With caution, execute the following commands to remove pools:
cephadm >
ceph osd pool create USER-EMAILnew 8
cephadm >
rados cppool USER-EMAIL default.rgw.lc.new
cephadm >
ceph osd pool delete USER-EMAIL USER-EMAIL --yes-i-really-really-mean-it
cephadm >
ceph osd pool rename USER-EMAIL.new USER-EMAIL
cephadm >
ceph osd pool application enable USER-EMAIL rgw
If this does not remove enough PGs per OSD and you are still receiving blocking requests, you may need to find another pool to remove.
Procedure 22.2: Solution 2 #
Create a new pool with the correct PG count:
cephadm >
ceph osd pool create NEW-POOL PG-COUNT
Copy the contents of the old pool to the new pool:
cephadm >
rados cppool OLD-POOL NEW-POOL
Remove the old pool:
cephadm >
ceph osd pool delete OLD-POOL OLD-POOL --yes-i-really-really-mean-it
Rename the new pool:
cephadm >
ceph osd pool rename NEW-POOL OLD-POOL
Restart the Object Gateway.
Glossary #
General
- Admin node #
The node from which you run the ceph-deploy utility to deploy Ceph on OSD nodes.
- Bucket #
A point which aggregates other nodes into a hierarchy of physical locations.
- CRUSH, CRUSH Map #
An algorithm that determines how to store and retrieve data by computing data storage locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.
- Monitor node, MON #
A cluster node that maintains maps of cluster state, including the monitor map, or the OSD map.
- Node #
Any single machine or server in a Ceph cluster.
- OSD node #
A cluster node that stores data; handles data replication, recovery, backfilling, and rebalancing; and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.
- Pool #
Logical partitions for storing objects such as disk images.
- Rule Set #
Rules to determine data placement for a pool.
A DeepSea Stage 1 Custom Example #
{% set master = salt['master.minion']() %}

include:
  - ..validate

ready:
  salt.runner:
    - name: minions.ready
    - timeout: {{ salt['pillar.get']('ready_timeout', 300) }}

refresh_pillar0:
  salt.state:
    - tgt: {{ master }}
    - sls: ceph.refresh

discover roles:
  salt.runner:
    - name: populate.proposals
    - require:
        - salt: refresh_pillar0

discover storage profiles:
  salt.runner:
    - name: proposal.populate
    - kwargs:
        'name': 'prod'
        'db-size': '59G'
        'wal-size': '1G'
        'nvme-spinner': True
        'ratio': 12
    - require:
        - salt: refresh_pillar0
B Default Alerts for SUSE Enterprise Storage #
groups:
  - name: cluster health
    rules:
      - alert: health error
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Ceph in error for > 5m
      - alert: unhealthy
        expr: ceph_health_status != 0
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: Ceph not healthy for > 5m
  - name: mon
    rules:
      - alert: low monitor quorum count
        expr: ceph_monitor_quorum_count < 3
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Monitor count in quorum is low
  - name: osd
    rules:
      - alert: 10% OSDs down
        expr: sum(ceph_osd_down) / count(ceph_osd_in) >= 0.1
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: More than 10% of OSDs are down
      - alert: OSD down
        expr: sum(ceph_osd_down) > 1
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: One or more OSDs down for more than 15 minutes
      - alert: OSDs near full
        expr: (ceph_osd_utilization unless on(osd) ceph_osd_down) > 80
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: OSD {{ $labels.osd }} is dangerously full, over 80%
      # alert on single OSDs flapping
      - alert: flap osd
        expr: rate(ceph_osd_up[5m])*60 > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            OSD {{ $labels.osd }} was marked down and back up at least once a
            minute for 5 minutes.
      # alert on high deviation from average PG count
      - alert: high pg count deviation
        expr: abs(((ceph_osd_pgs > 0) - on (job) group_left avg(ceph_osd_pgs > 0) by (job)) / on (job) group_left avg(ceph_osd_pgs > 0) by (job)) > 0.35
        for: 5m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            OSD {{ $labels.osd }} deviates by more than 30% from
            average PG count
      # alert on high commit latency...but how high is too high
  - name: mds
    rules:
    # no mds metrics are exported yet
  - name: mgr
    rules:
    # no mgr metrics are exported yet
  - name: pgs
    rules:
      - alert: pgs inactive
        expr: ceph_total_pgs - ceph_active_pgs > 0
        for: 5m
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: One or more PGs are inactive for more than 5 minutes.
      - alert: pgs unclean
        expr: ceph_total_pgs - ceph_clean_pgs > 0
        for: 15m
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: One or more PGs are not clean for more than 15 minutes.
  - name: nodes
    rules:
      - alert: root volume full
        expr: node_filesystem_avail{mountpoint="/"} / node_filesystem_size{mountpoint="/"} < 0.1
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Root volume (OSD and MON store) is dangerously full (< 10% free)
      # alert on nic packet errors and drops rates > 1 packet/s
      - alert: network packets dropped
        expr: irate(node_network_receive_drop{device!="lo"}[5m]) + irate(node_network_transmit_drop{device!="lo"}[5m]) > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet drop > 1
            packet/s on interface {{ $labels.device }}
      - alert: network packet errors
        expr: irate(node_network_receive_errs{device!="lo"}[5m]) + irate(node_network_transmit_errs{device!="lo"}[5m]) > 1
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet errors > 1
            packet/s on interface {{ $labels.device }}
      # predict fs fillup times
      - alert: storage filling
        expr: ((node_filesystem_free - node_filesystem_size) / deriv(node_filesystem_free[2d]) <= 5) > 0
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Mountpoint {{ $labels.mountpoint }} will be full in less than 5 days
            assuming the average fillup rate of the past 48 hours.
  - name: pools
    rules:
      - alert: pool full
        expr: ceph_pool_used_bytes / ceph_pool_available_bytes > 0.9
        labels:
          severity: critical
          type: ses_default
        annotations:
          description: Pool {{ $labels.pool }} at 90% capacity or over
      - alert: pool filling up
        expr: (-ceph_pool_used_bytes / deriv(ceph_pool_available_bytes[2d]) <= 5 ) > 0
        labels:
          severity: warning
          type: ses_default
        annotations:
          description: >
            Pool {{ $labels.pool }} will be full in less than 5 days
            assuming the average fillup rate of the past 48 hours.
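To make the "high pg count deviation" rule easier to read, the following Python sketch reproduces its arithmetic outside Prometheus: OSDs with zero PGs are ignored, the average PG count of the remaining OSDs is computed, and every OSD whose relative deviation from that average exceeds 0.35 is reported. The function name and the sample numbers are illustrative assumptions; the sketch does not model the `on (job) group_left` matching or the `for: 5m` hold time.

```python
def high_pg_deviation(pgs_per_osd, threshold=0.35):
    """Return {osd: relative deviation} for OSDs exceeding the threshold.

    pgs_per_osd maps an OSD name to its PG count. OSDs with a count of 0
    are skipped, mirroring the `ceph_osd_pgs > 0` filter in the rule.
    """
    active = {osd: count for osd, count in pgs_per_osd.items() if count > 0}
    if not active:
        return {}
    average = sum(active.values()) / len(active)
    flagged = {}
    for osd, count in active.items():
        deviation = abs(count - average) / average
        if deviation > threshold:
            flagged[osd] = deviation
    return flagged


if __name__ == "__main__":
    # Hypothetical sample: the average over the three active OSDs is 80.
    sample = {"osd.0": 100, "osd.1": 100, "osd.2": 40, "osd.3": 0}
    print(high_pg_deviation(sample))  # {'osd.2': 0.5}
```

In the sample, `osd.0` and `osd.1` deviate by 0.25 and stay below the threshold, while `osd.2` deviates by 0.5 and would trigger the alert.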
C Example Procedure of Manual Ceph Installation #
The following procedure shows the commands that you need to install a Ceph storage cluster manually.
Generate the key secrets for the Ceph services you plan to run. You can use the following command to generate each key:
python -c "import os ; import struct ; import time ; import base64 ; \
 key = os.urandom(16) ; header = struct.pack('<hiih',1,int(time.time()),0,len(key)) ; \
 print(base64.b64encode(header + key).decode())"
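The one-line command above packs a 12-byte little-endian header in front of 16 random bytes and prints the result Base64-encoded. The following sketch spells out the same steps as Python functions and adds a helper that decodes the header again. The function names and the field names `type`, `created_sec`, `created_nsec`, and `length` are assumptions made for this illustration, based on the order of the values passed to `struct.pack`.

```python
import base64
import os
import struct
import time

HEADER_FORMAT = "<hiih"  # little-endian: int16, int32, int32, int16 (12 bytes)


def generate_key():
    """Build a key the same way as the one-line command."""
    secret = os.urandom(16)
    header = struct.pack(HEADER_FORMAT, 1, int(time.time()), 0, len(secret))
    return base64.b64encode(header + secret).decode("ascii")


def describe_key(encoded):
    """Decode the header of a generated key (field names are assumed)."""
    raw = base64.b64decode(encoded)
    header_size = struct.calcsize(HEADER_FORMAT)
    key_type, sec, nsec, length = struct.unpack(HEADER_FORMAT, raw[:header_size])
    return {
        "type": key_type,
        "created_sec": sec,
        "created_nsec": nsec,
        "length": length,
        "secret_bytes": len(raw) - header_size,
    }


if __name__ == "__main__":
    key = generate_key()
    print(key)
    print(describe_key(key))
```

Decoding a freshly generated key is a quick way to confirm that the length stored in the header matches the number of secret bytes that follow it.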
Add the keys to the related keyrings. First for client.admin, then for monitors, and then other related services, such as OSD, Object Gateway, or MDS:
cephadm > ceph-authtool -n client.admin \
  --create-keyring /etc/ceph/ceph.client.admin.keyring \
  --cap mds 'allow *' --cap mon 'allow *' --cap osd 'allow *'
cephadm > ceph-authtool -n mon. \
  --create-keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring \
  --set-uid=0 --cap mon 'allow *'
cephadm > ceph-authtool -n client.bootstrap-osd \
  --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring \
  --cap mon 'allow profile bootstrap-osd'
cephadm > ceph-authtool -n client.bootstrap-rgw \
  --create-keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring \
  --cap mon 'allow profile bootstrap-rgw'
cephadm > ceph-authtool -n client.bootstrap-mds \
  --create-keyring /var/lib/ceph/bootstrap-mds/ceph.keyring \
  --cap mon 'allow profile bootstrap-mds'
Create a monmap, a database of all monitors in a cluster:
monmaptool --create --fsid eaac9695-4265-4ca8-ac2a-f3a479c559b1 \
  /tmp/tmpuuhxm3/monmap
monmaptool --add osceph-02 192.168.43.60 /tmp/tmpuuhxm3/monmap
monmaptool --add osceph-03 192.168.43.96 /tmp/tmpuuhxm3/monmap
monmaptool --add osceph-04 192.168.43.80 /tmp/tmpuuhxm3/monmap
Create a new keyring and import keys from the admin and monitors' keyrings there. Then use them to start the monitors:
cephadm > ceph-authtool --create-keyring /tmp/tmpuuhxm3/keyring \
  --import-keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring
cephadm > ceph-authtool /tmp/tmpuuhxm3/keyring \
  --import-keyring /etc/ceph/ceph.client.admin.keyring
cephadm > sudo -u ceph ceph-mon --mkfs -i osceph-03 \
  --monmap /tmp/tmpuuhxm3/monmap --keyring /tmp/tmpuuhxm3/keyring
cephadm > systemctl restart ceph-mon@osceph-03
Check the monitor's state in systemd:
root # systemctl show --property ActiveState ceph-mon@osceph-03
Check if Ceph is running and reports the monitor status:
cephadm > ceph --cluster=ceph \
  --admin-daemon /var/run/ceph/ceph-mon.osceph-03.asok mon_status
Check the specific services' status using the existing keys:
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin -f json-pretty status
[...]
cephadm > ceph --connect-timeout 5 \
  --keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring \
  --name mon. -f json-pretty status
Import the keyrings of the existing Ceph services and check the status:
cephadm > ceph auth import -i /var/lib/ceph/bootstrap-osd/ceph.keyring
cephadm > ceph auth import -i /var/lib/ceph/bootstrap-rgw/ceph.keyring
cephadm > ceph auth import -i /var/lib/ceph/bootstrap-mds/ceph.keyring
cephadm > ceph --cluster=ceph \
  --admin-daemon /var/run/ceph/ceph-mon.osceph-03.asok mon_status
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin -f json-pretty status
Prepare disks/partitions for OSDs, using the XFS file system:
cephadm > ceph-disk -v prepare --fs-type xfs --data-dev --cluster ceph \
  --cluster-uuid eaac9695-4265-4ca8-ac2a-f3a479c559b1 /dev/vdb
cephadm > ceph-disk -v prepare --fs-type xfs --data-dev --cluster ceph \
  --cluster-uuid eaac9695-4265-4ca8-ac2a-f3a479c559b1 /dev/vdc
[...]
Activate the partitions:
cephadm > ceph-disk -v activate --mark-init systemd --mount /dev/vdb1
cephadm > ceph-disk -v activate --mark-init systemd --mount /dev/vdc1
For SUSE Enterprise Storage version 2.1 and earlier, create the default pools:
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .users.swift 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .intent-log 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .rgw.gc 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .users.uid 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .rgw.control 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .users 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .usage 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .log 16 16
cephadm > ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
  --name client.admin osd pool create .rgw 16 16
Create the Object Gateway instance key from the bootstrap key:
cephadm > ceph --connect-timeout 5 --cluster ceph --name client.bootstrap-rgw \
  --keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring auth get-or-create \
  client.rgw.0dc1e13033d2467eace46270f0048b39 osd 'allow rwx' mon 'allow rw' \
  -o /var/lib/ceph/radosgw/ceph-rgw.rgw_name/keyring
Enable and start Object Gateway:
root # systemctl enable ceph-radosgw@rgw.rgw_name
root # systemctl start ceph-radosgw@rgw.rgw_name
Optionally, create the MDS instance key from the bootstrap key, then enable and start it:
cephadm > ceph --connect-timeout 5 --cluster ceph --name client.bootstrap-mds \
  --keyring /var/lib/ceph/bootstrap-mds/ceph.keyring auth get-or-create \
  mds.mds.rgw_name osd 'allow rwx' mds allow mon \
  'allow profile mds' \
  -o /var/lib/ceph/mds/ceph-mds.rgw_name/keyring
cephadm > systemctl enable ceph-mds@mds.rgw_name
cephadm > systemctl start ceph-mds@mds.rgw_name
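The pool-creation step for SUSE Enterprise Storage 2.1 and earlier in this procedure repeats one command nine times, changing only the pool name. As a convenience, the following Python sketch generates those command lines from a list of pool names. The names `DEFAULT_POOLS` and `pool_create_commands` are assumptions made for this illustration; the sketch only builds the commands and does not run them.

```python
# Pool names taken from the pool-creation step of this procedure.
DEFAULT_POOLS = [
    ".users.swift", ".intent-log", ".rgw.gc", ".users.uid", ".rgw.control",
    ".users", ".usage", ".log", ".rgw",
]


def pool_create_commands(pools=DEFAULT_POOLS, pg_num=16, pgp_num=16,
                         keyring="/etc/ceph/ceph.client.admin.keyring"):
    """Return one `ceph ... osd pool create` argument list per pool."""
    base = ["ceph", "--connect-timeout", "5",
            "--keyring", keyring, "--name", "client.admin"]
    return [
        base + ["osd", "pool", "create", pool, str(pg_num), str(pgp_num)]
        for pool in pools
    ]


if __name__ == "__main__":
    for command in pool_create_commands():
        print(" ".join(command))
```

Each printed line matches one of the commands listed in the procedure, so the output can be compared against the step before use.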
D Documentation Updates #
This chapter lists content changes for this document since the initial release of SUSE Enterprise Storage 4. You can find changes related to the cluster deployment that apply to previous versions in https://documentation.suse.com/ses/5.5/single-html/ses-deployment/#ap-deploy-docupdate.
The document was updated on the following dates:
D.1 The Latest Documentation Update #
Bugfixes #
Added hint on verifying profile proposal in Section 1.5, “Adding an OSD Disk to a Node” (https://bugzilla.suse.com/show_bug.cgi?id=1134736).
Remove minion's key from the Salt master temporarily when problematic in Section 1.3, “Removing and Reinstalling Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1120662).
Added procedure on making ceph.conf parameters unique in Section 1.12, “Adjusting ceph.conf with Custom Settings” (https://bugzilla.suse.com/show_bug.cgi?id=1116349).
Extended a tip on preventing rebalancing after adding a node in Section 1.1, “Adding New Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1131044).
Added performance notes to Book “Deployment Guide”, Chapter 13 “Exporting Ceph Data via Samba” and Chapter 16, NFS Ganesha: Export Ceph Data via NFS (https://bugzilla.suse.com/show_bug.cgi?id=1124674).
Added Section 7.1.1.5, “Migrating from a Legacy SSD Rule to Device Classes” (https://bugzilla.suse.com/show_bug.cgi?id=1112883).
Added a list of corresponding software repositories in Section 1.10, “Updating the Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1117474).
Added Section 9.5, “Advanced Features” (https://bugzilla.suse.com/show_bug.cgi?id=1120706).
Extended examples of Ceph services identification in Section 3.1.3, “Identifying Individual Services” (https://bugzilla.suse.com/show_bug.cgi?id=1120682).
Updated Section 11.6, “Setting Up an Example Tiered Storage” to use device classes (https://bugzilla.suse.com/show_bug.cgi?id=1114827).
Added example to Book “Deployment Guide”, Chapter 7 “Customizing the Default Configuration”, Section 7.1.5 “Updates and Reboots during Stage 0” (https://bugzilla.suse.com/show_bug.cgi?id=1119451).
Use multiple internal time sources in Section 20.4, “Time Synchronization of Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1119571).
Added listing of pool snapshots in Section 8.4, “Pool Snapshots” (https://bugzilla.suse.com/show_bug.cgi?id=1113911).
Added Section 7.1.1, “Device Classes” (https://bugzilla.suse.com/show_bug.cgi?id=1113292).
Added tcp_nodelay explanation in Section 13.4, “Configuration Parameters” (https://bugzilla.suse.com/show_bug.cgi?id=1106274).
Added more details on pool compression in Section 8.5, “Data Compression” (https://bugzilla.suse.com/show_bug.cgi?id=1113934).
Improved Section 8.3.2, “Migrate Using Cache Tier” and added Section 8.3.3, “Migrating RBD Images” (https://bugzilla.suse.com/show_bug.cgi?id=1113900).
Added Important: Removed OSD ID Still Present in grains (https://bugzilla.suse.com/show_bug.cgi?id=1107464).
D.2 October, 2018 (Documentation Maintenance Update) #
General Updates #
Moved Section 20.1, “Identifying Orphaned Partitions” into Chapter 20, Hints and Tips.
Extended Section 17.4.3, “Managing RADOS Block Devices (RBDs)”, mainly added a section about snapshots (Fate #325642).
Added Section 8.3, “Pool Migration” (Fate#322006).
Inserted Section 3.2, “Restarting Ceph Services using DeepSea” into Chapter 3, Operating Ceph Services.
Bugfixes #
Extended Section 1.7.2, “Automated Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1111442).
Added /etc/ceph to the list of backup content in Book “Deployment Guide”, Chapter 6 “Backing Up the Cluster Configuration”, Section 6.1 “Back Up Ceph Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1153342).
Removed 'default-update-no-reboot' as it is 'default' now, in Book “Deployment Guide”, Chapter 7 “Customizing the Default Configuration”, Section 7.2 “Modifying Discovered Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1111318).
Fixed path to *-replace file in Section 1.7, “Replacing an OSD Disk” (https://bugzilla.suse.com/show_bug.cgi?id=1111470).
Removed AppArmor snippets from Book “Deployment Guide”, Chapter 5 “Upgrading from Previous Releases” and added Section 1.13, “Enabling AppArmor Profiles” (https://bugzilla.suse.com/show_bug.cgi?id=1110861).
Added Chapter 5, Monitoring and Alerting (https://bugzilla.suse.com/show_bug.cgi?id=1107833).
Disable kernel NFS on role-ganesha in Book “Deployment Guide”, Chapter 12 “Installation of NFS Ganesha”, Section 12.1.2 “Summary of Requirements” (https://bugzilla.suse.com/show_bug.cgi?id=1107625).
Added Section 16.3.3, “Supported Operations” (https://bugzilla.suse.com/show_bug.cgi?id=1107624).
Improved auto-replacement OSDs procedure in Section 1.7, “Replacing an OSD Disk” (https://bugzilla.suse.com/show_bug.cgi?id=1107090).
Fixed libvirt keyring creation in Section 18.1, “Configuring Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1106495).
Simplified a user creation command in Section 18.1, “Configuring Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1102467).
Added 'tier' to the command in Section 8.3.2, “Migrate Using Cache Tier” (https://bugzilla.suse.com/show_bug.cgi?id=1102212).
Added a missing salt command in Section 1.6.2, “Removing Broken OSDs Forcefully” (https://bugzilla.suse.com/show_bug.cgi?id=1100701).
Added the state.apply part to the salt command in Section 1.8, “Recovering a Reinstalled OSD Node” (https://bugzilla.suse.com/show_bug.cgi?id=1095937).
Added Section 17.1.1, “Enabling Secure Access to openATTIC using SSL” (https://bugzilla.suse.com/show_bug.cgi?id=1083216).
Added Section 13.12, “Load Balancing the Object Gateway Servers with HAProxy” (https://bugzilla.suse.com/show_bug.cgi?id=1093513).
Added Section 7.5, “Scrubbing” (https://bugzilla.suse.com/show_bug.cgi?id=1079256).
Extended the port list for firewall settings in Section 20.10, “Firewall Settings for Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1070087).
Differentiated role restarting by DeepSea version in Section 3.2.2, “Restarting Specific Services” (https://bugzilla.suse.com/show_bug.cgi?id=1091075).
Added Section 1.4, “Redeploying Monitor Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1038731).
Added Section 13.9.1.1, “Dynamic Resharding” (https://bugzilla.suse.com/show_bug.cgi?id=1076001).
Added Section 13.9, “Bucket Index Sharding” (https://bugzilla.suse.com/show_bug.cgi?id=1076000).
Updated the Object Gateway SSL the DeepSea way in Section 13.6, “Enabling HTTPS/SSL for Object Gateways” (https://bugzilla.suse.com/show_bug.cgi?id=1083756 and https://bugzilla.suse.com/show_bug.cgi?id=1077809).
Changes in DeepSea require a modified deployment of NFS Ganesha with Object Gateway. See Section 16.3.1, “Different Object Gateway Users for NFS Ganesha” (https://bugzilla.suse.com/show_bug.cgi?id=1058821).
ceph osd pool create fails if the placement group limit per OSD is exceeded. See Section 8.2.2, “Create a Pool” (https://bugzilla.suse.com/show_bug.cgi?id=1076509).
Added Section 1.8, “Recovering a Reinstalled OSD Node” (https://bugzilla.suse.com/show_bug.cgi?id=1057764).
Added a reliability warning in Section 12.1, “Runtime Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=989349).
Added Section 13.2, “Deploying the Object Gateway” (https://bugzilla.suse.com/show_bug.cgi?id=1088895).
Removed lz4 from the list of compression algorithms in Section 8.5.3, “Global Compression Options” (https://bugzilla.suse.com/show_bug.cgi?id=1088450).
Added a tip on removing multiple OSDs in Section 1.6, “Removing an OSD” (https://bugzilla.suse.com/show_bug.cgi?id=1070791).
Added Section 20.3, “Stopping OSDs without Rebalancing” (https://bugzilla.suse.com/show_bug.cgi?id=1051039).
Added MDS cache size configurables in Book “Deployment Guide”, Chapter 11 “Installation of CephFS”, Section 11.2.2 “Configuring a Metadata Server” (https://bugzilla.suse.com/show_bug.cgi?id=1062692).
Added Section 13.10, “Integrating OpenStack Keystone” (https://bugzilla.suse.com/show_bug.cgi?id=1077941).
Added a tip about synchronizing iSCSI Gateway configuration in Book “Deployment Guide”, Chapter 10 “Installation of iSCSI Gateway”, Section 10.4.3 “Export RBD Images via iSCSI” (https://bugzilla.suse.com/show_bug.cgi?id=1073327).
D.3 November 2017 (Documentation Maintenance Update) #
General Updates #
Added Section 20.12, “Replacing Storage Disk” (Fate #321032).
Bugfixes #
Added Section 10.4, “Erasure Coded Pools with RADOS Block Device” (https://bugzilla.suse.com/show_bug.cgi?id=1075158).
Added a section about adding a disk to OSD nodes. See Section 1.5, “Adding an OSD Disk to a Node”. (https://bugzilla.suse.com/show_bug.cgi?id=1066005)
salt-run remove.osd requires the OSD_ID digit without the leading osd. prefix. See Section 1.6, “Removing an OSD”.
ceph tell requires the OSD_ID digit with the leading osd. prefix. See Section 22.8, “Finding Slow OSDs”.
Added Section 22.11, “/var Running Out of Space” (https://bugzilla.suse.com/show_bug.cgi?id=1069255).
Added Section 13.1, “Object Gateway Restrictions and Naming Limitations” (https://bugzilla.suse.com/show_bug.cgi?id=1067613).
Fixed rbd-mirror starting and stopping commands in Section 9.4.1, “rbd-mirror Daemon” (https://bugzilla.suse.com/show_bug.cgi?id=1068061).
D.4 October, 2017 (Release of SUSE Enterprise Storage 5.5) #
General Updates #
Removed Calamari in favor of openATTIC.
Added Section 17.4.6, “Managing NFS Ganesha” (Fate #321620).
Added Chapter 9, RADOS Block Device (Fate #321061).
Added Section 17.4.9, “Managing Object Gateway Users and Buckets” (Fate #320318).
Added Section 13.8, “LDAP Authentication” (Fate #321631).
Added Section 15.4, “Multiple Active MDS Daemons (Active-Active MDS)” (Fate #322976).
Added Section 17.4.7, “Managing iSCSI Gateways” (Fate #321370).
Added iSCSI Gateway and Object Gateway configuration for openATTIC, see Section 17.1.5, “Object Gateway Management” and Section 17.1.6, “iSCSI Gateway Management” (Fate #320318 and #321370).
Updated Chapter 16, NFS Ganesha: Export Ceph Data via NFS (https://bugzilla.suse.com/show_bug.cgi?id=1036495, https://bugzilla.suse.com/show_bug.cgi?id=1031444).
RBD images can now be stored in EC pools, see Section 9.1.2, “Creating a Block Device Image in an Erasure Coded Pool” (https://bugzilla.suse.com/show_bug.cgi?id=1040752).
Added section about backing up DeepSea configuration, see Book “Deployment Guide”, Chapter 6 “Backing Up the Cluster Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1046497).
Object Gateway failover and disaster recovery, see Section 13.11.12, “Failover and Disaster Recovery” (https://bugzilla.suse.com/show_bug.cgi?id=1036084).
BlueStore enables data compression for pools, see Section 8.5, “Data Compression” (FATE#318582).
CIFS export of CephFS is possible, see Book “Deployment Guide”, Chapter 13 “Exporting Ceph Data via Samba” (FATE#321622).
Added procedure for cluster reboot, see Section 1.11, “Halting or Rebooting Cluster” (https://bugzilla.suse.com/show_bug.cgi?id=1047638).
DeepSea stage 0 can update without rebooting, see Section 1.10, “Updating the Cluster Nodes”.
ceph fs replaced, see Section 15.4.3, “Decreasing the Number of Ranks” and Section 15.4.2, “Increasing the MDS Active Cluster Size” (https://bugzilla.suse.com/show_bug.cgi?id=1047638).
Added Section 20.11, “Testing Network Performance” (FATE#321031).
Bugfixes #
The swift client package is now part of the 'Public Cloud' module in Section 13.5.1, “Accessing Object Gateway” (https://bugzilla.suse.com/show_bug.cgi?id=1057591).
Added Section 12.1, “Runtime Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1061435).
CivetWeb binds to multiple ports in Tip: Binding to Multiple Ports (https://bugzilla.suse.com/show_bug.cgi?id=1055181).
Included 3 Object Gateway options affecting performance in Section 13.4, “Configuration Parameters” (https://bugzilla.suse.com/show_bug.cgi?id=1052983).
Imported Section 1.12, “Adjusting ceph.conf with Custom Settings” and added the need for Stage 3 (https://bugzilla.suse.com/show_bug.cgi?id=1057273).
Added libvirt keyring creation in Section 18.1, “Configuring Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1055610).
Added Example 1.1, “Removing a Salt minion from the Cluster” (https://bugzilla.suse.com/show_bug.cgi?id=1054516).
Updated Section 4.3, “Watching a Cluster” (https://bugzilla.suse.com/show_bug.cgi?id=1053638).
Made Salt REST API variables optional in Section 17.1, “openATTIC Deployment and Configuration” (https://bugzilla.suse.com/show_bug.cgi?id=1054748 and https://bugzilla.suse.com/show_bug.cgi?id=1054749).
Removed oaconfig install in Section 17.1.3, “openATTIC Initial Setup” (https://bugzilla.suse.com/show_bug.cgi?id=1054747).
Added a section about displaying pool metadata in Section 8.1, “Associate Pools with an Application” (https://bugzilla.suse.com/show_bug.cgi?id=1053327).
Imported a list of health codes in Section 4.2, “Checking Cluster Health” (https://bugzilla.suse.com/show_bug.cgi?id=1052939).
Updated screenshots and related text in Section 17.4.9.1, “Adding a New Object Gateway User” and Section 17.4.9.3, “Editing Object Gateway Users” (https://bugzilla.suse.com/show_bug.cgi?id=1051814 and https://bugzilla.suse.com/show_bug.cgi?id=1051816).
Added Object Gateway buckets in Section 17.4.9, “Managing Object Gateway Users and Buckets” (https://bugzilla.suse.com/show_bug.cgi?id=1051800).
Included cephx in mount examples in Section 15.1.3, “Mount CephFS” (https://bugzilla.suse.com/show_bug.cgi?id=1053022).
Updated and improved the description of pool deletion in Section 8.2.4, “Delete a Pool” (https://bugzilla.suse.com/show_bug.cgi?id=1052981).
Added compression algorithms description in Section 8.5.2, “Pool Compression Options” (https://bugzilla.suse.com/show_bug.cgi?id=1051457).
Replaced network diagnostics and benchmark in Section 22.10, “Poor Cluster Performance Caused by Network Problems” (https://bugzilla.suse.com/show_bug.cgi?id=1050190).
Extended Section 22.5, “'nn pg stuck inactive' Status Message” (https://bugzilla.suse.com/show_bug.cgi?id=1050183).
Mentioned pool re-creation in Section 22.4, “'Too Many PGs per OSD' Status Message” (https://bugzilla.suse.com/show_bug.cgi?id=1050178).
Fixed RGW section name in ceph.conf in Book “Deployment Guide”, Chapter 9 “Ceph Object Gateway”, Section 9.1 “Object Gateway Manual Installation” (https://bugzilla.suse.com/show_bug.cgi?id=1050170).
Updated commands output in Section 4.4, “Checking a Cluster's Usage Stats” and Section 4.3, “Watching a Cluster” (https://bugzilla.suse.com/show_bug.cgi?id=1050175).
Removed a preventive HEALTH_WARN section in Section 4.2, “Checking Cluster Health” (https://bugzilla.suse.com/show_bug.cgi?id=1050174).
Fixed sudo in Section 13.5.2.1, “Adding S3 and Swift Users” (https://bugzilla.suse.com/show_bug.cgi?id=1050177).
Removed a reference to a RADOS striper in Section 22.2, “Sending Large Objects with rados Fails with Full OSD” (https://bugzilla.suse.com/show_bug.cgi?id=1050171).
Improved section about OSD failure because of journal failure in Section 21.5, “What Happens When a Journal Disk Fails?” (https://bugzilla.suse.com/show_bug.cgi?id=1050169).
Added a tip on zypper patch during Stage 0 in Section 1.10, “Updating the Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1050165).
Added Section 8.1, “Associate Pools with an Application” (https://bugzilla.suse.com/show_bug.cgi?id=1049940).
Improved time synchronization information in Section 22.9, “Fixing Clock Skew Warnings” (https://bugzilla.suse.com/show_bug.cgi?id=1050186).
Replaced 'erasure pool' with the correct 'erasure coded pool' (https://bugzilla.suse.com/show_bug.cgi?id=1050093).
Replaced rcceph with systemctl (https://bugzilla.suse.com/show_bug.cgi?id=1050111).
Updated CephFS mounting preparation in Section 15.1.1, “Client Preparation” (https://bugzilla.suse.com/show_bug.cgi?id=1049451).
Fixed qemu-img command in Section 18.1, “Configuring Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1047190).
Specified which DeepSea stages to run when removing roles in Section 1.3, “Removing and Reinstalling Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1047430).
Added a new DeepSea role Ceph Manager (https://bugzilla.suse.com/show_bug.cgi?id=1047472).
Adjusted intro for 12 SP3 in Section 15.1.1, “Client Preparation” (https://bugzilla.suse.com/show_bug.cgi?id=1043739).
Fixed typo in XML entity in Section 18.4, “Configuring the VM” (https://bugzilla.suse.com/show_bug.cgi?id=1042917).
Added information to re-run DeepSea Stages 2-5 for a role removal in Section 1.3, “Removing and Reinstalling Cluster Nodes” (https://bugzilla.suse.com/show_bug.cgi?id=1041899).
Added Object Gateway, iSCSI Gateway, and NFS Ganesha ports numbers that need to be open in SUSE Firewall in Section 20.10, “Firewall Settings for Ceph” (https://bugzilla.suse.com/show_bug.cgi?id=1034081).
Added description of CRUSH map tree iteration, see Section 7.3.1, “Iterating Through the Node Tree”.
Added indep parameter to CRUSH rule, see Section 7.3.2, “firstn and indep”. (https://bugzilla.suse.com/show_bug.cgi?id=1025189)
Mounting CephFS via /etc/fstab requires the _netdev parameter. See Section 15.3, “CephFS in /etc/fstab” (https://bugzilla.suse.com/show_bug.cgi?id=989349).
Added tip on an existing rbdmap systemd service file in Section 9.2, “Mounting and Unmounting” (https://bugzilla.suse.com/show_bug.cgi?id=1015748).
Added an explanation of the use_gmt_hitset option in Section 11.7.1.1, “Use GMT for Hit Set” (https://bugzilla.suse.com/show_bug.cgi?id=1024522).
Moved mounting CephFS back into the admin guide and added a client preparation section in Section 15.1.1, “Client Preparation” (https://bugzilla.suse.com/show_bug.cgi?id=1025447).